results: Experimental results show that the multi-label CircleSnake model achieves higher average precision than the traditional Mask R-CNN and DeepSnake models on eosinophil instance segmentation.
Abstract
Eosinophilic esophagitis (EoE) is a chronic and relapsing disease characterized by esophageal inflammation. Symptoms of EoE include difficulty swallowing, food impaction, and chest pain which significantly impact the quality of life, resulting in nutritional impairments, social limitations, and psychological distress. The diagnosis of EoE is typically performed with a threshold (15 to 20) of eosinophils (Eos) per high-power field (HPF). Since the current counting process of Eos is a resource-intensive process for human pathologists, automatic methods are desired. Circle representation has been shown as a more precise, yet less complicated, representation for automatic instance cell segmentation such as CircleSnake approach. However, the CircleSnake was designed as a single-label model, which is not able to deal with multi-label scenarios. In this paper, we propose the multi-label CircleSnake model for instance segmentation on Eos. It extends the original CircleSnake model from a single-label design to a multi-label model, allowing segmentation of multiple object types. Experimental results illustrate the CircleSnake model's superiority over the traditional Mask R-CNN model and DeepSnake model in terms of average precision (AP) in identifying and segmenting eosinophils, thereby enabling enhanced characterization of EoE. This automated approach holds promise for streamlining the assessment process and improving diagnostic accuracy in EoE analysis. The source code has been made publicly available at https://github.com/yilinliu610730/EoE.
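Since the diagnosis hinges on an eosinophil count per high-power field, a minimal sketch of how instance-segmentation output could be reduced to that count is given below; the function names, label strings, and the 15 Eos/HPF cutoff are illustrative assumptions, not code from the paper's repository.

```python
# Illustrative sketch (not from the paper): reducing per-HPF instance
# predictions to the eosinophil count used for the EoE threshold.

def count_eos_per_hpf(instances_per_hpf, score_threshold=0.5, eos_label="eosinophil"):
    """instances_per_hpf: list of (label, confidence) tuples for one high-power field."""
    return sum(1 for label, score in instances_per_hpf
               if label == eos_label and score >= score_threshold)

def flag_eoe_fields(fields, diagnostic_threshold=15):
    """Return per-field counts and whether any field meets the assumed 15 Eos/HPF criterion."""
    counts = [count_eos_per_hpf(f) for f in fields]
    return counts, any(c >= diagnostic_threshold for c in counts)

# Example: two fields, the second exceeds the threshold.
fields = [
    [("eosinophil", 0.9)] * 8 + [("other", 0.8)] * 3,
    [("eosinophil", 0.7)] * 17,
]
counts, positive = flag_eoe_fields(fields)
print(counts, positive)   # [8, 17] True
```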
An inexact proximal majorization-minimization Algorithm for remote sensing image stripe noise removal
results: Numerical experiments show that the proposed model and algorithm outperform existing destriping models and algorithms.
Abstract
The stripe noise existing in remote sensing images badly degrades the visual quality and restricts the precision of data analysis. Therefore, many destriping models have been proposed in recent years. In contrast to these existing models, in this paper, we propose a nonconvex model with a DC function (i.e., the difference of convex functions) structure to remove the strip noise. To solve this model, we make use of the DC structure and apply an inexact proximal majorization-minimization algorithm with each inner subproblem solved by the alternating direction method of multipliers. It deserves mentioning that we design an implementable stopping criterion for the inner subproblem, while the convergence can still be guaranteed. Numerical experiments demonstrate the superiority of the proposed model and algorithm.
End-to-end Alternating Optimization for Real-World Blind Super Resolution
paper_authors: Zhengxiong Luo, Yan Huang, Shang Li, Liang Wang, Tieniu Tan
for: This paper proposes a new method for blind super-resolution (SR) of low-resolution (LR) images, which can estimate the degradation of the LR image and super-resolve it to its high-resolution (HR) counterpart in a single model.
methods: The proposed method uses an alternating optimization algorithm that consists of two convolutional neural modules: \textit{Restorer} and \textit{Estimator}. \textit{Restorer} restores the SR image based on the estimated degradation, while \textit{Estimator} estimates the degradation with the help of the restored SR image.
results: The proposed method outperforms state-of-the-art methods in terms of both objective metrics and visual quality, and produces more visually favorable results. The codes are available at \url{https://github.com/greatlog/RealDAN.git}.
Abstract
Blind Super-Resolution (SR) usually involves two sub-problems: 1) estimating the degradation of the given low-resolution (LR) image; 2) super-resolving the LR image to its high-resolution (HR) counterpart. Both problems are ill-posed due to the information loss in the degrading process. Most previous methods try to solve the two problems independently, but often fall into a dilemma: a good super-resolved HR result requires an accurate degradation estimation, which, however, is difficult to obtain without the help of original HR information. To address this issue, instead of considering these two problems independently, we adopt an alternating optimization algorithm, which can estimate the degradation and restore the SR image in a single model. Specifically, we design two convolutional neural modules, namely \textit{Restorer} and \textit{Estimator}. \textit{Restorer} restores the SR image based on the estimated degradation, and \textit{Estimator} estimates the degradation with the help of the restored SR image. We alternate these two modules repeatedly and unfold this process to form an end-to-end trainable network. In this way, both \textit{Restorer} and \textit{Estimator} could get benefited from the intermediate results of each other, and make each sub-problem easier. Moreover, \textit{Restorer} and \textit{Estimator} are optimized in an end-to-end manner, thus they could get more tolerant of the estimation deviations of each other and cooperate better to achieve more robust and accurate final results. Extensive experiments on both synthetic datasets and real-world images show that the proposed method can largely outperform state-of-the-art methods and produce more visually favorable results. The codes are released at \url{https://github.com/greatlog/RealDAN.git}.
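To make the alternating scheme concrete, here is a minimal PyTorch sketch of unfolding an Estimator and a Restorer for a fixed number of steps; the tiny module architectures, the degradation code dimension, and the bilinear upsampling are placeholder assumptions rather than the RealDAN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Estimator(nn.Module):
    """Toy degradation estimator: maps (LR image, current SR estimate) to a degradation code."""
    def __init__(self, code_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, code_dim))
    def forward(self, lr, sr):
        sr_down = F.interpolate(sr, size=lr.shape[-2:], mode="bilinear", align_corners=False)
        return self.net(torch.cat([lr, sr_down], dim=1))

class Restorer(nn.Module):
    """Toy restorer: super-resolves the LR image conditioned on the degradation code."""
    def __init__(self, scale=2, code_dim=10):
        super().__init__()
        self.scale = scale
        self.net = nn.Sequential(
            nn.Conv2d(3 + code_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1))
    def forward(self, lr, code):
        up = F.interpolate(lr, scale_factor=self.scale, mode="bilinear", align_corners=False)
        code_map = code[:, :, None, None].expand(-1, -1, *up.shape[-2:])
        return up + self.net(torch.cat([up, code_map], dim=1))

def unfolded_sr(lr, estimator, restorer, steps=4):
    """Alternate Estimator and Restorer for a fixed number of unfolded steps."""
    sr = F.interpolate(lr, scale_factor=restorer.scale, mode="bilinear", align_corners=False)
    for _ in range(steps):
        code = estimator(lr, sr)   # estimate degradation from LR and current SR
        sr = restorer(lr, code)    # restore SR conditioned on the estimate
    return sr

lr = torch.rand(1, 3, 32, 32)
print(unfolded_sr(lr, Estimator(), Restorer()).shape)  # torch.Size([1, 3, 64, 64])
```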
Recursive Detection and Analysis of Nanoparticles in Scanning Electron Microscopy Images
results: The framework detects nanoparticles with 97% accuracy across five distinct test images, and can resolve cluttered particle arrangements and particles of faint intensity.
Abstract
In this study, we present a computational framework tailored for the precise detection and comprehensive analysis of nanoparticles within scanning electron microscopy (SEM) images. The primary objective of this framework revolves around the accurate localization of nanoparticle coordinates, accompanied by secondary objectives encompassing the extraction of pertinent morphological attributes including area, orientation, brightness, and length. Constructed leveraging the robust image processing capabilities of Python, particularly harnessing libraries such as OpenCV, SciPy, and Scikit-Image, the framework employs an amalgamation of techniques, including thresholding, dilating, and eroding, to enhance the fidelity of image processing outcomes. The ensuing nanoparticle data is seamlessly integrated into the RStudio environment to facilitate meticulous post-processing analysis. This encompasses a comprehensive evaluation of model accuracy, discernment of feature distribution patterns, and the identification of intricate particle arrangements. The finalized framework exhibits high nanoparticle identification within the primary sample image and boasts 97\% accuracy in detecting particles across five distinct test images drawn from a SEM nanoparticle dataset. Furthermore, the framework demonstrates the capability to discern nanoparticles of faint intensity, eluding manual labeling within the control group.
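A minimal OpenCV sketch of the thresholding/dilation/erosion pipeline described above is shown below; the Otsu thresholding choice, kernel size, and minimum-area filter are assumptions for illustration, not the authors' exact parameters.

```python
import cv2

def detect_particles(gray, min_area=20):
    """Threshold, clean up with dilation/erosion, and return per-particle stats."""
    # Otsu threshold separates bright particles from the background (assumed contrast polarity).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    binary = cv2.dilate(binary, kernel, iterations=1)   # close small gaps
    binary = cv2.erode(binary, kernel, iterations=1)    # restore approximate particle size
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    particles = []
    for i in range(1, n):                                # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        if area < min_area:
            continue
        mask = labels == i
        particles.append({
            "centroid": tuple(centroids[i]),
            "area": int(area),
            "brightness": float(gray[mask].mean()),
        })
    return particles

gray = cv2.imread("sem_image.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
if gray is not None:
    print(len(detect_particles(gray)), "particles detected")
```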
Dynamic Kernel-Based Adaptive Spatial Aggregation for Learned Image Compression
results: The method achieves superior rate-distortion performance on three standard benchmarks compared with state-of-the-art learning-based methods.
Abstract
Learned image compression methods have shown superior rate-distortion performance and remarkable potential compared to traditional compression methods. Most existing learned approaches use stacked convolution or window-based self-attention for transform coding, which aggregate spatial information in a fixed range. In this paper, we focus on extending spatial aggregation capability and propose a dynamic kernel-based transform coding. The proposed adaptive aggregation generates kernel offsets to capture valid information in the content-conditioned range to help transform. With the adaptive aggregation strategy and the sharing weights mechanism, our method can achieve promising transform capability with acceptable model complexity. Besides, according to the recent progress of entropy model, we define a generalized coarse-to-fine entropy model, considering the coarse global context, the channel-wise, and the spatial context. Based on it, we introduce dynamic kernel in hyper-prior to generate more expressive global context. Furthermore, we propose an asymmetric spatial-channel entropy model according to the investigation of the spatial characteristics of the grouped latents. The asymmetric entropy model aims to reduce statistical redundancy while maintaining coding efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance on three benchmarks compared to the state-of-the-art learning-based methods.
Deployment and Analysis of Instance Segmentation Algorithm for In-field Grade Estimation of Sweetpotatoes
paper_authors: Hoang M. Nguyen, Sydney Gyurek, Russell Mierop, Kenneth V. Pecota, Kylie LaGamba, Michael Boyette, G. Craig Yencho, Cranos M. Williams, Michael W. Kudenov
results: The model correctly detects sweetpotato (SP) storage roots under varied environmental conditions, including differences in lighting and soil characteristics. Compared with a commercial optical sorter, RMSEs for length, width, and weight were 0.66 cm, 1.22 cm, and 74.73 g, and the RMSE for root counts per plot was 5.27 roots with r^2 = 0.8. This phenotyping strategy promises rapid in-field yield estimation where sophisticated and costly optical sorters are unavailable.
Abstract
Shape estimation of sweetpotato (SP) storage roots is inherently challenging due to their varied size and shape characteristics. Even measuring "simple" metrics, such as length and width, requires significant time investments either directly in-field or afterward using automated graders. In this paper, we present the results of a model that can perform grading and provide yield estimates directly in the field quicker than manual measurements. Detectron2, a library consisting of deep-learning object detection algorithms, was used to implement Mask R-CNN, an instance segmentation model. This model was deployed for in-field grade estimation of SPs and evaluated against an optical sorter. Storage roots from various clones imaged with a cellphone during trials between 2019 and 2020, were used in the model's training and validation to fine-tune a model to detect SPs. Our results showed that the model could distinguish individual SPs in various environmental conditions including variations in lighting and soil characteristics. RMSE for length, width, and weight, from the model compared to a commercial optical sorter, were 0.66 cm, 1.22 cm, and 74.73 g, respectively, while the RMSE of root counts per plot was 5.27 roots, with r^2 = 0.8. This phenotyping strategy has the potential to enable rapid yield estimates in the field without the need for sophisticated and costly optical sorters and may be more readily deployed in environments with limited access to these kinds of resources or facilities.
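The following is a hedged Detectron2 sketch of running a Mask R-CNN predictor of the kind described; the config file, single storage-root class, score threshold, and image path are illustrative assumptions, and in practice the model would first be fine-tuned on the annotated sweetpotato images.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")  # replace with fine-tuned SP weights
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1           # assumed single "storage root" class
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
cfg.MODEL.DEVICE = "cpu"                       # or "cuda" if a GPU is available

predictor = DefaultPredictor(cfg)
image = cv2.imread("plot_photo.jpg")           # BGR cellphone image (placeholder path)
instances = predictor(image)["instances"].to("cpu")
print("roots detected:", len(instances))
for box in instances.pred_boxes:               # (x1, y1, x2, y2); rough pixel proxy for length/width
    x1, y1, x2, y2 = box.tolist()
    print("length_px:", max(x2 - x1, y2 - y1), "width_px:", min(x2 - x1, y2 - y1))
```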
Learning to Distill Global Representation for Sparse-View CT
paper_authors: Zilong Li, Chenglong Ma, Jie Chen, Junping Zhang, Hongming Shan
for: This paper targets sparse-view CT, which lowers the radiation dose to patients and shortens data acquisition, and aims to improve the quality of the reconstructed images.
methods: The paper proposes GloReDi, a global representation (GloRe) distillation framework: GloRe is learned with Fourier convolution so that each element has an image-wide receptive field, and it is distilled from intermediate-view reconstructions via directional distillation and band-pass-specific contrastive distillation.
results: Experiments show that GloReDi outperforms previous methods, including dual-domain ones, for sparse-view CT reconstruction.
Abstract
Sparse-view computed tomography (CT) -- using a small number of projections for tomographic reconstruction -- enables much lower radiation dose to patients and accelerated data acquisition. The reconstructed images, however, suffer from strong artifacts, greatly limiting their diagnostic value. Current trends for sparse-view CT turn to the raw data for better information recovery. The resultant dual-domain methods, nonetheless, suffer from secondary artifacts, especially in ultra-sparse view scenarios, and their generalization to other scanners/protocols is greatly limited. A crucial question arises: have the image post-processing methods reached the limit? Our answer is not yet. In this paper, we stick to image post-processing methods due to great flexibility and propose global representation (GloRe) distillation framework for sparse-view CT, termed GloReDi. First, we propose to learn GloRe with Fourier convolution, so each element in GloRe has an image-wide receptive field. Second, unlike methods that only use the full-view images for supervision, we propose to distill GloRe from intermediate-view reconstructed images that are readily available but not explored in previous literature. The success of GloRe distillation is attributed to two key components: representation directional distillation to align the GloRe directions, and band-pass-specific contrastive distillation to gain clinically important details. Extensive experiments demonstrate the superiority of the proposed GloReDi over the state-of-the-art methods, including dual-domain ones. The source code is available at https://github.com/longzilicart/GloReDi.
paper_authors: Eunseop Yoon, Hee Suk Yoon, Dhananjaya Gowda, SooHwan Eom, Daehyeok Kim, John Harvill, Heting Gao, Mark Hasegawa-Johnson, Chanwoo Kim, Chang D. Yoo
results: Experimental results show that the proposed loss-based sampling method improves the performance of sentence-level and paragraph-level G2P.
Abstract
Text-to-Text Transfer Transformer (T5) has recently been considered for the Grapheme-to-Phoneme (G2P) transduction. As a follow-up, a tokenizer-free byte-level model based on T5 referred to as ByT5, recently gave promising results on word-level G2P conversion by representing each input character with its corresponding UTF-8 encoding. Although it is generally understood that sentence-level or paragraph-level G2P can improve usability in real-world applications as it is better suited to perform on heteronyms and linking sounds between words, we find that using ByT5 for these scenarios is nontrivial. Since ByT5 operates on the character level, it requires longer decoding steps, which deteriorates the performance due to the exposure bias commonly observed in auto-regressive generation models. This paper shows that the performance of sentence-level and paragraph-level G2P can be improved by mitigating such exposure bias using our proposed loss-based sampling method.
Classifying Dementia in the Presence of Depression: A Cross-Corpus Study
paper_authors: Franziska Braun, Sebastian P. Bayerl, Paula A. Pérez-Toro, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Elmar Nöth, Tobias Bocklet, Korbinian Riedhammer
methods: The paper applies established baseline systems with speech, text, and emotion embeddings to a three-class classification problem (healthy controls vs. mild cognitive impairment vs. dementia), using audio and transcripts from the semantic Verbal Fluency Test and the Boston Naming Test.
results: Cross-corpus and mixed-corpus experiments on two independently recorded German datasets investigate generalization to larger populations and different recording conditions, and a detailed error analysis with depression as a secondary diagnosis examines what the classifiers actually learn.
Abstract
Automated dementia screening enables early detection and intervention, reducing costs to healthcare systems and increasing quality of life for those affected. Depression has shared symptoms with dementia, adding complexity to diagnoses. The research focus so far has been on binary classification of dementia (DEM) and healthy controls (HC) using speech from picture description tests from a single dataset. In this work, we apply established baseline systems to discriminate cognitive impairment in speech from the semantic Verbal Fluency Test and the Boston Naming Test using text, audio and emotion embeddings in a 3-class classification problem (HC vs. MCI vs. DEM). We perform cross-corpus and mixed-corpus experiments on two independently recorded German datasets to investigate generalization to larger populations and different recording conditions. In a detailed error analysis, we look at depression as a secondary diagnosis to understand what our classifiers actually learn.
ChinaTelecom System Description to VoxCeleb Speaker Recognition Challenge 2023
results: The final submission achieved a minDCF of 0.1066 and an EER of 1.980%.
Abstract
This technical report describes ChinaTelecom system for Track 1 (closed) of the VoxCeleb2023 Speaker Recognition Challenge (VoxSRC 2023). Our system consists of several ResNet variants trained only on VoxCeleb2, which were fused for better performance later. Score calibration was also applied for each variant and the fused system. The final submission achieved minDCF of 0.1066 and EER of 1.980%.
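As a reference for the reported metric, a small sketch of computing the equal error rate (EER) from trial scores is given below; the synthetic score distributions are placeholders, and the minDCF computation (which additionally weights miss and false-alarm costs) is omitted.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """EER: the operating point where false-acceptance and false-rejection rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

rng = np.random.default_rng(0)
target = rng.normal(2.0, 1.0, 1000)      # scores for same-speaker trials (synthetic)
nontarget = rng.normal(0.0, 1.0, 1000)   # scores for different-speaker trials (synthetic)
scores = np.concatenate([target, nontarget])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(f"EER = {100 * equal_error_rate(scores, labels):.2f}%")
```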
AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis
results: Experiments show that the model controls the emotion of generated speech while preserving each speaker's identity, style, and emotional cadence. Language-independent emotion modeling is demonstrated with an emotion-transfer task between English and Chinese, achieving state-of-the-art results on both qualitative and quantitative metrics.
Abstract
Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations. While existing text-to-speech (TTS) and speech-to-speech systems rely on strength embedding vectors and global style tokens to capture emotions, these models represent emotions as a component of style or represent them in discrete categories. We propose AffectEcho, an emotion translation model, that uses a Vector Quantized codebook to model emotions within a quantized space featuring five levels of affect intensity to capture complex nuances and subtle differences in the same emotion. The quantized emotional embeddings are implicitly derived from spoken speech samples, eliminating the need for one-hot vectors or explicit strength embeddings. Experimental results demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker. We showcase the language-independent emotion modeling capability of the quantized emotional embeddings learned from a bilingual (English and Chinese) speech corpus with an emotion transfer task from a reference speech to a target speech. We achieve state-of-art results on both qualitative and quantitative metrics.
SCANet: A Self- and Cross-Attention Network for Audio-Visual Speech Separation
methods: SCANet contains two types of attention blocks: self-attention (SA) and cross-attention (CA) blocks, with CA blocks placed at the top (TCA), middle (MCA), and bottom (BCA) of the network. These blocks learn modality-specific features and extract different semantics from the audio-visual features.
results: Extensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) show that SCANet outperforms existing state-of-the-art (SOTA) methods while maintaining comparable inference time.
Abstract
The integration of different modalities, such as audio and visual information, plays a crucial role in human perception of the surrounding environment. Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, they predominantly focus on multi-modal fusion architectures situated either at the top or bottom positions, rather than comprehensively considering multi-modal fusion at various hierarchical positions within the network. In this paper, we propose a novel model called self- and cross-attention network (SCANet), which leverages the attention mechanism for efficient audio-visual feature fusion. SCANet consists of two types of attention blocks: self-attention (SA) and cross-attention (CA) blocks, where the CA blocks are distributed at the top (TCA), middle (MCA) and bottom (BCA) of SCANet. These blocks maintain the ability to learn modality-specific features and enable the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of SCANet, outperforming existing state-of-the-art (SOTA) methods while maintaining comparable inference time.
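A minimal PyTorch sketch of the self-attention (SA) and cross-attention (CA) building blocks described above follows; the dimensions, head count, and the single fusion step shown are illustrative assumptions rather than the full SCANet with its TCA/MCA/BCA placement.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """SA block: one modality attends to itself."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
    def forward(self, x):                       # x: (batch, time, dim)
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)

class CrossAttentionBlock(nn.Module):
    """CA block: one modality queries the other to fuse audio-visual information."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
    def forward(self, query, context):
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)

audio = torch.rand(2, 100, 256)   # (batch, frames, dim), placeholder features
video = torch.rand(2, 25, 256)
audio = SelfAttentionBlock(256)(audio)
audio = CrossAttentionBlock(256)(audio, video)  # e.g., one top-level (TCA-style) fusion step
print(audio.shape)  # torch.Size([2, 100, 256])
```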
Radio2Text: Streaming Speech Recognition Using mmWave Radio Signals
paper_authors: Running Zhao, Jiangtao Yu, Hang Zhao, Edith C. H. Ngai
for: This paper proposes a mmWave-based system for streaming automatic speech recognition (ASR) with a large vocabulary size.
methods: The proposed system, called Radio2Text, uses a tailored streaming Transformer that learns speech-related features effectively, and a cross-modal structure based on knowledge distillation to mitigate the negative effect of low quality mmWave signals.
results: The experimental results show that Radio2Text can achieve a character error rate of 5.7% and a word error rate of 9.4% for the recognition of a vocabulary consisting of over 13,000 words.
Abstract
Millimeter wave (mmWave) based speech recognition provides more possibility for audio-related applications, such as conference speech transcription and eavesdropping. However, considering the practicality in real scenarios, latency and recognizable vocabulary size are two critical factors that cannot be overlooked. In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary size exceeding 13,000 words. Radio2Text is based on a tailored streaming Transformer that is capable of effectively learning representations of speech-related features, paving the way for streaming ASR with a large vocabulary. To alleviate the deficiency of streaming networks unable to access entire future inputs, we propose the Guidance Initialization that facilitates the transfer of feature knowledge related to the global context from the non-streaming Transformer to the tailored streaming Transformer through weight inheritance. Further, we propose a cross-modal structure based on knowledge distillation (KD), named cross-modal KD, to mitigate the negative effect of low quality mmWave signals on recognition performance. In the cross-modal KD, the audio streaming Transformer provides feature and response guidance that inherit fruitful and accurate speech information to supervise the training of the tailored radio streaming Transformer. The experimental results show that our Radio2Text can achieve a character error rate of 5.7% and a word error rate of 9.4% for the recognition of a vocabulary consisting of over 13,000 words.
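The cross-modal distillation idea can be sketched as a loss combining feature guidance and softened response guidance from the audio teacher; the weights, temperature, and tensor shapes below are assumptions, and the task loss on the transcript (omitted here) would be added on top.

```python
import torch
import torch.nn.functional as F

def cross_modal_kd_loss(radio_feat, audio_feat, radio_logits, audio_logits,
                        temperature=2.0, alpha=0.5, beta=0.5):
    """Two guidance terms only; the supervised transcription loss is added separately."""
    # Feature guidance: push mmWave (radio) features toward the audio teacher's features.
    feat_loss = F.mse_loss(radio_feat, audio_feat.detach())
    # Response guidance: match the softened output distributions of teacher and student.
    resp_loss = F.kl_div(
        F.log_softmax(radio_logits / temperature, dim=-1),
        F.softmax(audio_logits.detach() / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2
    return alpha * feat_loss + beta * resp_loss

radio_feat, audio_feat = torch.rand(4, 50, 256), torch.rand(4, 50, 256)       # placeholder features
radio_logits, audio_logits = torch.rand(4, 50, 1000), torch.rand(4, 50, 1000) # placeholder logits
print(cross_modal_kd_loss(radio_feat, audio_feat, radio_logits, audio_logits).item())
```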
End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations
results: For long queries and queries that do not appear in the training data, the proposed model outperforms the ASR-based system.
Abstract
Conventional keyword search systems operate on automatic speech recognition (ASR) outputs, which causes them to have a complex indexing and search pipeline. This has led to interest in ASR-free approaches to simplify the search procedure. We recently proposed a neural ASR-free keyword search model which achieves competitive performance while maintaining an efficient and simplified pipeline, where queries and documents are encoded with a pair of recurrent neural network encoders and the encodings are combined with a dot-product. In this article, we extend this work with multilingual pretraining and detailed analysis of the model. Our experiments show that the proposed multilingual training significantly improves the model performance and that despite not matching a strong ASR-based conventional keyword search system for short queries and queries comprising in-vocabulary words, the proposed model outperforms the ASR-based system for long queries and queries that do not appear in the training data.
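A minimal sketch of the dual-encoder scoring scheme (a recurrent query encoder, a recurrent document encoder, and a dot product) is given below; the token-id inputs, vocabulary sizes, and single-layer GRUs are placeholder assumptions, since the actual system encodes acoustic document representations.

```python
import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """Shared recipe for query and document encoders: embed, run a GRU, keep the final state."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
    def forward(self, tokens):                  # tokens: (batch, length) integer ids
        _, h = self.rnn(self.embed(tokens))
        return h[-1]                            # (batch, dim)

query_enc = RecurrentEncoder(vocab_size=100)    # e.g., grapheme/phone inputs for the query
doc_enc = RecurrentEncoder(vocab_size=64)       # stand-in for the acoustic document encoder
query = torch.randint(0, 100, (1, 8))
docs = torch.randint(0, 64, (5, 200))
scores = doc_enc(docs) @ query_enc(query).T     # dot-product detection score per document
print(scores.squeeze(-1))
```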
results: The final submission achieved first place on the VoxSRC-23 public leaderboard with a minDCF(0.05) of 0.0762 and an EER of 1.30%.
Abstract
This report describes ID R&D team submissions for Track 2 (open) to the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). Our solution is based on the fusion of deep ResNets and self-supervised learning (SSL) based models trained on a mixture of a VoxCeleb2 dataset and a large version of a VoxTube dataset. The final submission to the Track 2 achieved the first place on the VoxSRC-23 public leaderboard with a minDCF(0.05) of 0.0762 and EER of 1.30%.
results: On the Surface Water and Qinghai-Tibet Plateau Lake datasets, LEPrompter improves over the previous state-of-the-art method without introducing additional parameters or GFLOPs, achieving mIoU scores of 91.48% and 97.43%, respectively.
Abstract
The extraction of lakes from remote sensing images is a complex challenge due to the varied lake shapes and data noise. Current methods rely on multispectral image datasets, making it challenging to learn lake features accurately from pixel arrangements. This, in turn, affects model learning and the creation of accurate segmentation masks. This paper introduces a unified prompt-based dataset construction approach that provides approximate lake locations using point, box, and mask prompts. We also propose a two-stage prompt enhancement framework, LEPrompter, which involves prompt-based and prompt-free stages during training. The prompt-based stage employs a prompt encoder to extract prior information, integrating prompt tokens and image embeddings through self- and cross-attention in the prompt decoder. Prompts are deactivated once the model is trained to ensure independence during inference, enabling automated lake extraction. Evaluations on Surface Water and Qinghai-Tibet Plateau Lake datasets show consistent performance improvements compared to the previous state-of-the-art method. LEPrompter achieves mIoU scores of 91.48% and 97.43% on the respective datasets without introducing additional parameters or GFLOPs. Supplementary materials provide the source code, pre-trained models, and detailed user studies.
Integrating Visual and Semantic Similarity Using Hierarchies for Image Retrieval
paper_authors: Aishwarya Venkataramanan, Martin Laviale, Cédric Pradalier
for: This paper aims to improve content-based image retrieval (CBIR) so that retrieved results are not only visually but also semantically related to the query.
methods: The method trains a deep neural network for classification and merges classes with overlapping features in its latent space into a visual hierarchy, which is then integrated into the distance metric used for similarity search.
results: Experiments on the standard CUB-200-2011 and CIFAR100 datasets, and on a real-life diatom microscopy retrieval task, show superior performance compared to existing methods.
Abstract
Most of the research in content-based image retrieval (CBIR) focus on developing robust feature representations that can effectively retrieve instances from a database of images that are visually similar to a query. However, the retrieved images sometimes contain results that are not semantically related to the query. To address this, we propose a method for CBIR that captures both visual and semantic similarity using a visual hierarchy. The hierarchy is constructed by merging classes with overlapping features in the latent space of a deep neural network trained for classification, assuming that overlapping classes share high visual and semantic similarities. Finally, the constructed hierarchy is integrated into the distance calculation metric for similarity search. Experiments on standard datasets: CUB-200-2011 and CIFAR100, and a real-life use case using diatom microscopy images show that our method achieves superior performance compared to the existing methods on image retrieval.
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
paper_authors: Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu
for: The paper aims to improve the performance of vision-language tasks by addressing the issue of intrinsic noise and unmatched image-text pairs in web data through a novel pre-training method called Adaptive Language-Image Pre-training (ALIP).
methods: The paper proposes an ALIP model that integrates supervision from both raw text and synthetic captions, with core components such as the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) that dynamically adjust the weights of samples and image-text/caption pairs during training. The adaptive contrastive loss is also used to reduce the impact of noise data and enhance the efficiency of pre-training.
results: The paper achieves state-of-the-art performance on multiple downstream tasks including zero-shot image-text retrieval and linear probe, and the code and pre-trained models are released for future research at https://github.com/deepglint/ALIP.
Abstract
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs collected from the web. However, the presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning. To address this issue, we first utilize the OFA model to generate synthetic captions that focus on the image content. The generated captions contain complementary information that is beneficial for pre-training. Then, we propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption. As the core components of ALIP, the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during the training process. Meanwhile, the adaptive contrastive loss can effectively reduce the impact of noise data and enhances the efficiency of pre-training data. We validate ALIP with experiments on different scales of models and pre-training datasets. Experiments results show that ALIP achieves state-of-the-art performance on multiple downstream tasks including zero-shot image-text retrieval and linear probe. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/ALIP.
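To illustrate how per-sample gate values can enter a contrastive objective, here is a hedged sketch of a weight-modulated symmetric InfoNCE loss; the weighting scheme shown is a stand-in for ALIP's LCG/DCG gates, not the released implementation.

```python
import torch
import torch.nn.functional as F

def weighted_clip_loss(image_emb, text_emb, sample_weights, temperature=0.07):
    """Symmetric image-text InfoNCE over a batch, with per-sample weights standing in
    for the gate values that down-weight noisy or mismatched pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.T, targets, reduction="none")
    per_sample = 0.5 * (loss_i2t + loss_t2i)
    w = sample_weights / sample_weights.sum().clamp_min(1e-8)
    return (w * per_sample).sum()

img = torch.randn(8, 512)
txt = torch.randn(8, 512)
weights = torch.rand(8)        # e.g., produced by a consistency gate (assumed)
print(weighted_clip_loss(img, txt, weights).item())
```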
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
results: We evaluate Tem-Adapter and different pre-training transfer methods on two VideoQA benchmarks, and the significant performance improvement demonstrates the effectiveness of our method.
Abstract
Video-language pre-trained models have shown remarkable success in guiding video question-answering (VideoQA) tasks. However, due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones. This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains. To bridge these gaps, in this paper, we propose Tem-Adapter, which enables the learning of temporal dynamics and complex semantics by a visual Temporal Aligner and a textual Semantic Aligner. Unlike conventional pretrained knowledge adaptation methods that only concentrate on the downstream task objective, the Temporal Aligner introduces an extra language-guided autoregressive task aimed at facilitating the learning of temporal dependencies, with the objective of predicting future states based on historical clues and language guidance that describes event progression. Besides, to reduce the semantic gap and adapt the textual representation for better event description, we introduce a Semantic Aligner that first designs a template to fuse question and answer pairs as event descriptions and then learns a Transformer decoder with the whole video sequence as guidance for refinement. We evaluate Tem-Adapter and different pre-train transferring methods on two VideoQA benchmarks, and the significant performance improvement demonstrates the effectiveness of our method.
Prediction of post-radiotherapy recurrence volumes in head and neck squamous cell carcinoma using 3D U-Net segmentation
paper_authors: Denis Kutnár, Ivan R Vogelius, Katrin Elisabet Håkansson, Jens Petersen, Jeppe Friborg, Lena Specht, Mogens Bernsdorf, Anita Gothelf, Claus Kristensen, Abraham George Smith
results: The CNN can predict locoregional recurrence (LRR) volumes from pre-treatment FDG-PET/CT scans, but further dataset development is required to reach clinically useful prediction accuracy.
Abstract
Locoregional recurrences (LRR) are still a frequent site of treatment failure for head and neck squamous cell carcinoma (HNSCC) patients. Identification of high risk subvolumes based on pretreatment imaging is key to biologically targeted radiation therapy. We investigated the extent to which a Convolutional neural network (CNN) is able to predict LRR volumes based on pre-treatment 18F-fluorodeoxyglucose positron emission tomography (FDG-PET)/computed tomography (CT) scans in HNSCC patients and thus the potential to identify biological high risk volumes using CNNs. For 37 patients who had undergone primary radiotherapy for oropharyngeal squamous cell carcinoma, five oncologists contoured the relapse volumes on recurrence CT scans. Datasets of pre-treatment FDG-PET/CT, gross tumour volume (GTV) and contoured relapse for each of the patients were randomly divided into training (n=23), validation (n=7) and test (n=7) datasets. We compared a CNN trained from scratch, a pre-trained CNN, a SUVmax threshold approach, and using the GTV directly. The SUVmax threshold method included 5 out of the 7 relapse origin points within a volume of median 4.6 cubic centimetres (cc). Both the GTV contour and best CNN segmentations included the relapse origin 6 out of 7 times with median volumes of 28 and 18 cc respectively. The CNN included the same or greater number of relapse volume POs, with significantly smaller relapse volumes. Our novel findings indicate that CNNs may predict LRR, yet further work on dataset development is required to attain clinically useful prediction accuracy.
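A small sketch of the SUVmax-threshold baseline mentioned above is given below; the 50% threshold fraction, voxel volume, and largest-component heuristic are illustrative assumptions rather than the study's exact protocol.

```python
import numpy as np
from scipy import ndimage

def suvmax_threshold_volume(suv, gtv_mask, fraction=0.5, voxel_volume_cc=0.016):
    """Threshold the PET volume at a fraction of SUVmax inside the GTV and keep the
    largest connected component; returns a boolean mask and its volume in cc."""
    suv_max = suv[gtv_mask].max()
    candidate = (suv >= fraction * suv_max) & gtv_mask
    labeled, n = ndimage.label(candidate)
    if n == 0:
        return candidate, 0.0
    sizes = ndimage.sum(candidate, labeled, index=range(1, n + 1))
    keep = labeled == (int(np.argmax(sizes)) + 1)
    return keep, keep.sum() * voxel_volume_cc

suv = np.random.rand(64, 64, 32) * 10            # placeholder PET volume
gtv = np.zeros_like(suv, dtype=bool)
gtv[20:40, 20:40, 10:20] = True                   # placeholder GTV mask
mask, volume_cc = suvmax_threshold_volume(suv, gtv)
print(f"high-uptake subvolume: {volume_cc:.1f} cc")
```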
SIGMA: Scale-Invariant Global Sparse Shape Matching
results: The paper achieves state-of-the-art sparse non-rigid matching on several challenging 3D datasets, including data with inconsistent meshing, as well as mesh-to-point-cloud matching applications.
Abstract
We propose a novel mixed-integer programming (MIP) formulation for generating precise sparse correspondences for highly non-rigid shapes. To this end, we introduce a projected Laplace-Beltrami operator (PLBO) which combines intrinsic and extrinsic geometric information to measure the deformation quality induced by predicted correspondences. We integrate the PLBO, together with an orientation-aware regulariser, into a novel MIP formulation that can be solved to global optimality for many practical problems. In contrast to previous methods, our approach is provably invariant to rigid transformations and global scaling, initialisation-free, has optimality guarantees, and scales to high resolution meshes with (empirically observed) linear time. We show state-of-the-art results for sparse non-rigid matching on several challenging 3D datasets, including data with inconsistent meshing, as well as applications in mesh-to-point-cloud matching.
Robust Autonomous Vehicle Pursuit without Expert Steering Labels
results: We extensively validate the approach using the CARLA simulator and demonstrate real-time performance and robustness to different scenarios, including unseen trajectories and high route completion.
Abstract
In this work, we present a learning method for lateral and longitudinal motion control of an ego-vehicle for vehicle pursuit. The car being controlled does not have a pre-defined route, rather it reactively adapts to follow a target vehicle while maintaining a safety distance. To train our model, we do not rely on steering labels recorded from an expert driver but effectively leverage a classical controller as an offline label generation tool. In addition, we account for the errors in the predicted control values, which can lead to a loss of tracking and catastrophic crashes of the controlled vehicle. To this end, we propose an effective data augmentation approach, which allows to train a network capable of handling different views of the target vehicle. During the pursuit, the target vehicle is firstly localized using a Convolutional Neural Network. The network takes a single RGB image along with cars' velocities and estimates the target vehicle's pose with respect to the ego-vehicle. This information is then fed to a Multi-Layer Perceptron, which regresses the control commands for the ego-vehicle, namely throttle and steering angle. We extensively validate our approach using the CARLA simulator on a wide range of terrains. Our method demonstrates real-time performance and robustness to different scenarios including unseen trajectories and high route completion. The project page containing code and multimedia can be publicly accessed here: https://changyaozhou.github.io/Autonomous-Vehicle-Pursuit/.
Automated Semiconductor Defect Inspection in Scanning Electron Microscope Images: a Systematic Review
results: The review analyzes 38 publications on automated defect inspection in SEM images, summarizing the application, methodology, dataset, results, limitations, and future work of each.
Abstract
A growing need exists for efficient and accurate methods for detecting defects in semiconductor materials and devices. These defects can have a detrimental impact on the efficiency of the manufacturing process, because they cause critical failures and wafer-yield limitations. As nodes and patterns get smaller, even high-resolution imaging techniques such as Scanning Electron Microscopy (SEM) produce noisy images due to operating close to sensitivity levels and due to varying physical properties of different underlayers or resist materials. This inherent noise is one of the main challenges for defect inspection. One promising approach is the use of machine learning algorithms, which can be trained to accurately classify and locate defects in semiconductor samples. Recently, convolutional neural networks have proved to be particularly useful in this regard. This systematic review provides a comprehensive overview of the state of automated semiconductor defect inspection on SEM images, including the most recent innovations and developments. 38 publications were selected on this topic, indexed in IEEE Xplore and SPIE databases. For each of these, the application, methodology, dataset, results, limitations and future work were summarized. A comprehensive overview and analysis of their methods is provided. Finally, promising avenues for future work in the field of SEM-based defect inspection are suggested.
Diff-CAPTCHA: An Image-based CAPTCHA with Security Enhanced by Denoising Diffusion Model
paper_authors: Ran Jiang, Sanfeng Zhang, Linfeng Liu, Yanbing Peng
for: Enhancing the security of text CAPTCHAs.
methods: A denoising diffusion model generates the CAPTCHA images, deeply fusing character features with the background to make them harder for machine-learning attacks to exploit.
results: Compared with baseline schemes, Diff-CAPTCHA shows higher security while maintaining good usability, and can effectively resist end-to-end attack algorithms.
Abstract
To enhance the security of text CAPTCHAs, various methods have been employed, such as adding the interference lines on the text, randomly distorting the characters, and overlapping multiple characters. These methods partly increase the difficulty of automated segmentation and recognition attacks. However, facing the rapid development of the end-to-end breaking algorithms, their security has been greatly weakened. The diffusion model is a novel image generation model that can generate the text images with deep fusion of characters and background images. In this paper, an image-click CAPTCHA scheme called Diff-CAPTCHA is proposed based on denoising diffusion models. The background image and characters of the CAPTCHA are treated as a whole to guide the generation process of a diffusion model, thus weakening the character features available for machine learning, enhancing the diversity of character features in the CAPTCHA, and increasing the difficulty of breaking algorithms. To evaluate the security of Diff-CAPTCHA, this paper develops several attack methods, including end-to-end attacks based on Faster R-CNN and two-stage attacks, and Diff-CAPTCHA is compared with three baseline schemes, including commercial CAPTCHA scheme and security-enhanced CAPTCHA scheme based on style transfer. The experimental results show that diffusion models can effectively enhance CAPTCHA security while maintaining good usability in human testing.
DeepContrast: Deep Tissue Contrast Enhancement using Synthetic Data Degradations and OOD Model Predictions
paper_authors: Nuno Pimpão Martins, Yannis Kalaidzidis, Marino Zerial, Florian Jug
for: This paper aims to improve microscopy image quality, enabling better inspection and characterization of cellular and tissue-level structures and functions.
methods: Deep-learning approaches normally require ground truth (GT) data for training, but clean GT cannot be acquired when imaging deep into thick samples. The authors therefore propose a new method that circumvents the missing GT data.
results: The authors first use an approximate forward model to simulate the noise and contrast loss of deep-tissue imaging, then train a neural network to learn the inverse of this degradation. Results show that the method improves microscopy image quality out-of-distribution (OOD); however, with iterative predictions the image contrast keeps improving while fine details are increasingly removed, so a balance between contrast improvement and detail retention must be chosen for the desired downstream analysis.
Abstract
Microscopy images are crucial for life science research, allowing detailed inspection and characterization of cellular and tissue-level structures and functions. However, microscopy data are unavoidably affected by image degradations, such as noise, blur, or others. Many such degradations also contribute to a loss of image contrast, which becomes especially pronounced in deeper regions of thick samples. Today, best performing methods to increase the quality of images are based on Deep Learning approaches, which typically require ground truth (GT) data during training. Our inability to counteract blurring and contrast loss when imaging deep into samples prevents the acquisition of such clean GT data. The fact that the forward process of blurring and contrast loss deep into tissue can be modeled, allowed us to propose a new method that can circumvent the problem of unobtainable GT data. To this end, we first synthetically degraded the quality of microscopy images even further by using an approximate forward model for deep tissue image degradations. Then we trained a neural network that learned the inverse of this degradation function from our generated pairs of raw and degraded images. We demonstrated that networks trained in this way can be used out-of-distribution (OOD) to improve the quality of less severely degraded images, e.g. the raw data imaged in a microscope. Since the absolute level of degradation in such microscopy images can be stronger than the additional degradation introduced by our forward model, we also explored the effect of iterative predictions. Here, we observed that in each iteration the measured image contrast kept improving while detailed structures in the images got increasingly removed. Therefore, dependent on the desired downstream analysis, a balance between contrast improvement and retention of image details has to be found.
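The degrade-then-learn-the-inverse idea can be sketched as generating training pairs from an approximate forward model; the specific contrast-attenuation and noise terms below are invented placeholders standing in for the paper's forward model.

```python
import numpy as np

def degrade(image, depth_fraction, rng):
    """Illustrative forward model: depth-dependent contrast loss plus additive noise."""
    contrast = 1.0 - 0.7 * depth_fraction            # deeper planes lose more contrast (assumed)
    mean = image.mean()
    degraded = mean + contrast * (image - mean)      # shrink intensities toward the mean
    degraded += rng.normal(0.0, 0.05 + 0.05 * depth_fraction, image.shape)
    return np.clip(degraded, 0.0, 1.0)

def make_training_pairs(stack, rng=None):
    """Turn each plane of an already-degraded raw stack into (further degraded, raw) pairs,
    so a network can learn the inverse of the synthetic degradation."""
    rng = rng or np.random.default_rng(0)
    depth = stack.shape[0]
    return [(degrade(stack[z], z / max(depth - 1, 1), rng), stack[z]) for z in range(depth)]

raw_stack = np.random.rand(16, 128, 128).astype(np.float32)   # placeholder microscopy stack
pairs = make_training_pairs(raw_stack)
print(len(pairs), pairs[0][0].shape)
```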
KernelWarehouse: Towards Parameter-Efficient Dynamic Convolution
results: Compared with baseline methods on the ImageNet and MS-COCO datasets, KernelWarehouse attains state-of-the-art results. For instance, ResNet18|ResNet50|MobileNetV2|ConvNeXt-Tiny models trained with KernelWarehouse on ImageNet reach 76.05%|81.05%|75.52%|82.51% top-1 accuracy. KernelWarehouse can also shrink a ConvNet while improving accuracy; for example, a ResNet18 with 36.45%|65.10% parameter reduction shows 2.89%|2.29% absolute top-1 accuracy gains over the baseline.
Abstract
Dynamic convolution learns a linear mixture of $n$ static kernels weighted with their sample-dependent attentions, demonstrating superior performance compared to normal convolution. However, existing designs are parameter-inefficient: they increase the number of convolutional parameters by $n$ times. This and the optimization difficulty lead to no research progress in dynamic convolution that can allow us to use a significant large value of $n$ (e.g., $n>100$ instead of typical setting $n<10$) to push forward the performance boundary. In this paper, we propose $KernelWarehouse$, a more general form of dynamic convolution, which can strike a favorable trade-off between parameter efficiency and representation power. Its key idea is to redefine the basic concepts of "$kernels$" and "$assembling$ $kernels$" in dynamic convolution from the perspective of reducing kernel dimension and increasing kernel number significantly. In principle, KernelWarehouse enhances convolutional parameter dependencies within the same layer and across successive layers via tactful kernel partition and warehouse sharing, yielding a high degree of freedom to fit a desired parameter budget. We validate our method on ImageNet and MS-COCO datasets with different ConvNet architectures, and show that it attains state-of-the-art results. For instance, the ResNet18|ResNet50|MobileNetV2|ConvNeXt-Tiny model trained with KernelWarehouse on ImageNet reaches 76.05%|81.05%|75.52%|82.51% top-1 accuracy. Thanks to its flexible design, KernelWarehouse can even reduce the model size of a ConvNet while improving the accuracy, e.g., our ResNet18 model with 36.45%|65.10% parameter reduction to the baseline shows 2.89%|2.29% absolute improvement to top-1 accuracy.
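For context, a hedged sketch of the baseline dynamic convolution described in the first sentence (an attention-weighted mixture of n static kernels) is shown below; KernelWarehouse's kernel partitioning and cross-layer warehouse sharing are deliberately omitted, and the per-sample loop is written for clarity rather than efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Baseline dynamic convolution: per-sample attention over n static kernels whose
    weighted sum forms the kernel actually applied (not the KernelWarehouse design)."""
    def __init__(self, in_ch, out_ch, k=3, n=4):
        super().__init__()
        self.kernels = nn.Parameter(torch.randn(n, out_ch, in_ch, k, k) * 0.02)
        self.attend = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, n))
        self.padding = k // 2

    def forward(self, x):
        alpha = F.softmax(self.attend(x), dim=-1)             # (batch, n) sample-dependent attention
        outputs = []
        for i in range(x.size(0)):                             # assemble one kernel per sample
            weight = (alpha[i][:, None, None, None, None] * self.kernels).sum(dim=0)
            outputs.append(F.conv2d(x[i:i + 1], weight, padding=self.padding))
        return torch.cat(outputs, dim=0)

layer = DynamicConv2d(16, 32, n=4)
print(layer(torch.rand(2, 16, 8, 8)).shape)   # torch.Size([2, 32, 8, 8])
```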
Membrane Potential Batch Normalization for Spiking Neural Networks
results: Experiments show that the proposed MPBN performs well on both popular non-spiking static and neuromorphic datasets, and that it can also adopt an element-wise form. The code is open-sourced on GitHub (https://github.com/yfguo91/MPBN).
Abstract
As one of the energy-efficient alternatives of conventional neural networks (CNNs), spiking neural networks (SNNs) have gained more and more interest recently. To train the deep models, some effective batch normalization (BN) techniques are proposed in SNNs. All these BNs are suggested to be used after the convolution layer as usually doing in CNNs. However, the spiking neuron is much more complex with the spatio-temporal dynamics. The regulated data flow after the BN layer will be disturbed again by the membrane potential updating operation before the firing function, i.e., the nonlinear activation. Therefore, we advocate adding another BN layer before the firing function to normalize the membrane potential again, called MPBN. To eliminate the induced time cost of MPBN, we also propose a training-inference-decoupled re-parameterization technique to fold the trained MPBN into the firing threshold. With the re-parameterization technique, the MPBN will not introduce any extra time burden in the inference. Furthermore, the MPBN can also adopt the element-wised form, while these BNs after the convolution layer can only use the channel-wised form. Experimental results show that the proposed MPBN performs well on both popular non-spiking static and neuromorphic datasets. Our code is open-sourced at \href{https://github.com/yfguo91/MPBN}{MPBN}.
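The re-parameterization step described above, folding a trained membrane-potential BN into the firing threshold, can be illustrated with a single-step, single-layer sketch. Temporal dynamics and surrogate-gradient training are omitted, and the BN scale gamma is assumed positive; this is an illustration of the algebra, not the authors' code.

```python
import torch

# Training-time rule:   spike = 1[ gamma * (u - mu) / sigma + beta >= v_th ]
# With gamma > 0 this is equivalent to comparing u against a folded threshold:
#   spike = 1[ u >= (v_th - beta) * sigma / gamma + mu ]
# so the trained MPBN adds no extra cost at inference.

def fold_mpbn_into_threshold(gamma, beta, running_mean, running_var, v_th, eps=1e-5):
    sigma = torch.sqrt(running_var + eps)
    return (v_th - beta) * sigma / gamma + running_mean       # per-channel threshold

gamma = torch.tensor([1.2, 0.8]); beta = torch.tensor([0.1, -0.2])
mean  = torch.tensor([0.0, 0.5]); var  = torch.tensor([1.0, 2.0])
u = torch.randn(4, 2)                                          # membrane potentials (batch, channel)

bn_u = gamma * (u - mean) / torch.sqrt(var + 1e-5) + beta
spikes_train = (bn_u >= 1.0).float()
thr = fold_mpbn_into_threshold(gamma, beta, mean, var, v_th=1.0)
spikes_fold = (u >= thr).float()
assert torch.equal(spikes_train, spikes_fold)                  # identical firing decisions
```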
GAEI-UNet: Global Attention and Elastic Interaction U-Net for Vessel Image Segmentation
results: Evaluation on the DRIVE retinal vessel dataset shows that GAEI-UNet outperforms conventional deep learning segmentation models in terms of SE and connectivity of small structures, without significantly increasing computational complexity.
Abstract
Vessel image segmentation plays a pivotal role in medical diagnostics, aiding in the early detection and treatment of vascular diseases. While segmentation based on deep learning has shown promising results, effectively segmenting small structures and maintaining connectivity between them remains challenging. To address these limitations, we propose GAEI-UNet, a novel model that combines global attention and elastic interaction-based techniques. GAEI-UNet leverages global spatial and channel context information to enhance high-level semantic understanding within the U-Net architecture, enabling precise segmentation of small vessels. Additionally, we adopt an elastic interaction-based loss function to improve connectivity among these fine structures. By capturing the forces generated by misalignment between target and predicted shapes, our model effectively learns to preserve the correct topology of vessel networks. Evaluation on retinal vessel dataset -- DRIVE demonstrates the superior performance of GAEI-UNet in terms of SE and connectivity of small structures, without significantly increasing computational complexity. This research aims to advance the field of vessel image segmentation, providing more accurate and reliable diagnostic tools for the medical community. The implementation code is available on Code.
Denoising Diffusion Probabilistic Model for Retinal Image Generation and Segmentation
for: Detection and diagnosis of eye, blood-circulation, and brain-related diseases from retinal images.
methods: A two-stage Denoising Diffusion Probabilistic Model (DDPM), proposed as an alternative to GAN-based synthesis, that first generates vessel trees from standard-normal noise and is then guided to generate the corresponding fundus images.
results: Created the Retinal Trees (ReTree) dataset, consisting of retinal images, corresponding vessel trees, and a DDPM-based segmentation network trained on ReTree images; the dataset's effectiveness was validated by training the vessel segmentation model on synthetic data and testing it on authentic data.
Abstract
Experts use retinal images and vessel trees to detect and diagnose various eye, blood circulation, and brain-related diseases. However, manual segmentation of retinal images is a time-consuming process that requires high expertise and is difficult due to privacy issues. Many methods have been proposed to segment images, but the need for large retinal image datasets limits the performance of these methods. Several methods synthesize deep learning models based on Generative Adversarial Networks (GAN) to generate limited sample varieties. This paper proposes a novel Denoising Diffusion Probabilistic Model (DDPM) that outperformed GANs in image synthesis. We developed a Retinal Trees (ReTree) dataset consisting of retinal images, corresponding vessel trees, and a segmentation network based on DDPM trained with images from the ReTree dataset. In the first stage, we develop a two-stage DDPM that generates vessel trees from random numbers belonging to a standard normal distribution. Later, the model is guided to generate fundus images from given vessel trees and random distribution. The proposed dataset has been evaluated quantitatively and qualitatively. Quantitative evaluation metrics include Frechet Inception Distance (FID) score, Jaccard similarity coefficient, Cohen's kappa, Matthew's Correlation Coefficient (MCC), precision, recall, F1-score, and accuracy. We trained the vessel segmentation model with synthetic data to validate our dataset's efficiency and tested it on authentic data. Our developed dataset and source code is available at https://github.com/AAleka/retree.
Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN
results: Experiments show that models equipped with the Depth Gradient Refinement (DGR) module and the proposed loss function achieve improved performance without added complexity or computational cost. The study offers fresh insight into how Transformers and CNNs differ in monocular depth estimation and paves the way for novel depth estimation methodologies.
Abstract
Monocular depth estimation is an ongoing challenge in computer vision. Recent progress with Transformer models has demonstrated notable advantages over conventional CNNs in this area. However, there's still a gap in understanding how these models prioritize different regions in 2D images and how these regions affect depth estimation performance. To explore the differences between Transformers and CNNs, we employ a sparse pixel approach to contrastively analyze the distinctions between the two. Our findings suggest that while Transformers excel in handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity. To further enhance the performance of Transformer models in monocular depth estimation, we propose the Depth Gradient Refinement (DGR) module that refines depth estimation through high-order differentiation, feature fusion, and recalibration. Additionally, we leverage optimal transport theory, treating depth maps as spatial probability distributions, and employ the optimal transport distance as a loss function to optimize our model. Experimental results demonstrate that models integrated with the plug-and-play Depth Gradient Refinement (DGR) module and the proposed loss function enhance performance without increasing complexity and computational costs. This research not only offers fresh insights into the distinctions between Transformers and CNNs in depth estimation but also paves the way for novel depth estimation methodologies.
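The optimal-transport loss mentioned above treats two depth maps as spatial probability distributions and compares them with a transport distance. Below is a minimal entropic (Sinkhorn) sketch on a tiny grid; the paper's exact ground cost, regularization, and integration with the DGR module are not specified here, so this is a generic illustration rather than the authors' loss.

```python
import torch

def sinkhorn_ot(p, q, cost, eps=0.1, iters=200):
    """Entropy-regularized OT between two discrete distributions p and q
    (each sums to 1) over the same pixel grid, with an N x N ground-cost matrix."""
    K = torch.exp(-cost / eps)                   # Gibbs kernel
    u = torch.ones_like(p)
    for _ in range(iters):                       # Sinkhorn fixed-point iterations
        v = q / (K.t() @ u)
        u = p / (K @ v)
    pi = u.unsqueeze(1) * K * v.unsqueeze(0)     # transport plan
    return (pi * cost).sum()

# Tiny 8x8 example: two depth maps normalized into spatial probability distributions.
H = W = 8
pred, gt = torch.rand(H, W) + 0.1, torch.rand(H, W) + 0.1
p, q = (pred / pred.sum()).flatten(), (gt / gt.sum()).flatten()
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()
cost = torch.cdist(coords, coords) ** 2
cost = cost / cost.max()                         # normalize the ground cost for stability
ot_loss = sinkhorn_ot(p, q, cost)
```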
AdaBrowse: Adaptive Video Browser for Efficient Continuous Sign Language Recognition
results: Experiments show that AdaBrowse and AdaBrowse+ achieve accuracy comparable to state-of-the-art methods while delivering 1.44x higher throughput and 2.12x fewer FLOPs.
Abstract
Raw videos have been proven to own considerable feature redundancy where in many cases only a portion of frames can already meet the requirements for accurate recognition. In this paper, we are interested in whether such redundancy can be effectively leveraged to facilitate efficient inference in continuous sign language recognition (CSLR). We propose a novel adaptive model (AdaBrowse) to dynamically select a most informative subsequence from input video sequences by modelling this problem as a sequential decision task. In specific, we first utilize a lightweight network to quickly scan input videos to extract coarse features. Then these features are fed into a policy network to intelligently select a subsequence to process. The corresponding subsequence is finally inferred by a normal CSLR model for sentence prediction. As only a portion of frames are processed in this procedure, the total computations can be considerably saved. Besides temporal redundancy, we are also interested in whether the inherent spatial redundancy can be seamlessly integrated together to achieve further efficiency, i.e., dynamically selecting a lowest input resolution for each sample, whose model is referred to as AdaBrowse+. Extensive experimental results on four large-scale CSLR datasets, i.e., PHOENIX14, PHOENIX14-T, CSL-Daily and CSL, demonstrate the effectiveness of AdaBrowse and AdaBrowse+ by achieving comparable accuracy with state-of-the-art methods with 1.44$\times$ throughput and 2.12$\times$ fewer FLOPs. Comparisons with other commonly-used 2D CNNs and adaptive efficient methods verify the effectiveness of AdaBrowse. Code is available at \url{https://github.com/hulianyuyy/AdaBrowse}.
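The scan-then-select pipeline described above can be caricatured as follows: a lightweight 3D backbone produces coarse per-frame features, a small policy head decides how much of the clip to keep, and only the selected frames are passed to the full CSLR recognizer. The module names, candidate lengths, and the plain argmax decision (the real policy is trained as a sequential decision task) are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class AdaBrowseSketch(nn.Module):
    def __init__(self, feat_dim=64, num_lengths=3):
        super().__init__()
        self.scan = nn.Sequential(nn.Conv3d(3, 8, 3, padding=1),
                                  nn.AdaptiveAvgPool3d((None, 1, 1)))  # coarse per-frame features
        self.policy = nn.GRU(8, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_lengths)    # choose how many frames to keep
        self.lengths = [0.25, 0.5, 1.0]                 # candidate fractions of the clip

    def forward(self, video, recognizer):
        # video: (B, 3, T, H, W); recognizer: any full-size CSLR model (assumed to exist)
        B, _, T, _, _ = video.shape
        coarse = self.scan(video).squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 8)
        _, h = self.policy(coarse)
        choice = self.head(h[-1]).argmax(dim=1)         # discrete decision per sample
        outs = []
        for b in range(B):
            keep = max(1, int(self.lengths[choice[b].item()] * T))
            idx = torch.linspace(0, T - 1, keep).long() # uniformly subsample the kept frames
            outs.append(recognizer(video[b:b+1, :, idx]))
        return outs

model = AdaBrowseSketch()
dummy_recognizer = lambda clip: clip.mean(dim=(1, 2, 3, 4))     # placeholder recognizer
preds = model(torch.rand(2, 3, 16, 32, 32), dummy_recognizer)
```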
Visually-Aware Context Modeling for News Image Captioning
results: Extensive experiments demonstrate the effectiveness of the framework; without using additional paired data, it establishes new state-of-the-art performance on two News Image Captioning datasets, exceeding the previous state of the art by 5 CIDEr points.
Abstract
The goal of News Image Captioning is to generate an image caption according to the content of both a news article and an image. To leverage the visual information effectively, it is important to exploit the connection between the context in the articles/captions and the images. Psychological studies indicate that human faces in images draw higher attention priorities. On top of that, humans often play a central role in news stories, as also proven by the face-name co-occurrence pattern we discover in existing News Image Captioning datasets. Therefore, we design a face-naming module for faces in images and names in captions/articles to learn a better name embedding. Apart from names, which can be directly linked to an image area (faces), news image captions mostly contain context information that can only be found in the article. Humans typically address this by searching for relevant information from the article based on the image. To emulate this thought process, we design a retrieval strategy using CLIP to retrieve sentences that are semantically close to the image. We conduct extensive experiments to demonstrate the efficacy of our framework. Without using additional paired data, we establish the new state-of-the-art performance on two News Image Captioning datasets, exceeding the previous state-of-the-art by 5 CIDEr points. We will release code upon acceptance.
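The retrieval strategy sketched in the abstract (finding article sentences semantically close to the image) reduces, at its core, to a nearest-neighbour search in a shared image-text embedding space. The snippet below assumes the image and sentence embeddings have already been produced by a CLIP-like encoder; it shows only the ranking step, not the paper's full pipeline or its face-naming module.

```python
import torch
import torch.nn.functional as F

def retrieve_context_sentences(image_emb, sentence_embs, k=3):
    """Rank article sentences by cosine similarity to the image and return the
    indices of the top-k matches.
    image_emb: (D,) image embedding; sentence_embs: (N, D) sentence embeddings."""
    img = F.normalize(image_emb, dim=0)
    sents = F.normalize(sentence_embs, dim=1)
    sims = sents @ img                       # (N,) cosine similarities
    return sims.topk(k).indices

# Toy usage: 10 candidate article sentences with 512-d embeddings.
idx = retrieve_context_sentences(torch.randn(512), torch.randn(10, 512), k=3)
```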
Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations
for: This paper aims to address the instability issues in discriminative self-supervised learning methods and improve the downstream performance of the learned representations.
methods: The paper uses a causal perspective to analyze the unstable behaviors of discriminative self-supervised methods and proposes solutions to overcome these issues. The proposed solutions involve tempering a linear transformation with controlled synthetic data.
results: The authors show through experiments on both controlled image datasets and realistic image datasets that their proposed solutions are effective in addressing the instability issues and improving the downstream performance of the learned representations.
Abstract
In recent years, discriminative self-supervised methods have made significant strides in advancing various visual tasks. The central idea of learning a data encoder that is robust to data distortions/augmentations is straightforward yet highly effective. Although many studies have demonstrated the empirical success of various learning methods, the resulting learned representations can exhibit instability and hinder downstream performance. In this study, we analyze discriminative self-supervised methods from a causal perspective to explain these unstable behaviors and propose solutions to overcome them. Our approach draws inspiration from prior works that empirically demonstrate the ability of discriminative self-supervised methods to demix ground truth causal sources to some extent. Unlike previous work on causality-empowered representation learning, we do not apply our solutions during the training process but rather during the inference process to improve time efficiency. Through experiments on both controlled image datasets and realistic image datasets, we show that our proposed solutions, which involve tempering a linear transformation with controlled synthetic data, are effective in addressing these issues.
Dual-Stream Diffusion Net for Text-to-Video Generation
methods: Proposes a dual-stream diffusion net (DSDN) with two diffusion streams, a video-content branch and a motion branch, together with a designed cross-transformer interaction module that keeps the content and motion domains well aligned; motion decomposer and combiner modules facilitate operations on video motion.
results: Experiments show that the method generates smoother continuous videos with fewer flickers.
Abstract
With the emerging diffusion models, recently, text-to-video generation has aroused increasing attention. But an important bottleneck therein is that generative videos often tend to carry some flickers and artifacts. In this work, we propose a dual-stream diffusion net (DSDN) to improve the consistency of content variations in generating videos. In particular, the designed two diffusion streams, video content and motion branches, could not only run separately in their private spaces for producing personalized video variations as well as content, but also be well-aligned between the content and motion domains through leveraging our designed cross-transformer interaction module, which would benefit the smoothness of generated videos. Besides, we also introduce motion decomposer and combiner to faciliate the operation on video motion. Qualitative and quantitative experiments demonstrate that our method could produce amazing continuous videos with fewer flickers.
ECPC-IDS: A benchmark endometrial cancer PET/CT image dataset for evaluation of semantic segmentation and detection of hypermetabolic regions
paper_authors: Dechao Tang, Xuanyi Li, Tianming Du, Deguo Ma, Zhiyu Ma, Hongzan Sun, Marcin Grzegorzek, Huiyan Jiang, Chen Li
for: This paper aims to provide a publicly available dataset of endometrial cancer images for research and development of computer-assisted diagnostic techniques.
methods: The paper uses five classical deep learning semantic segmentation methods and six deep learning object detection methods to demonstrate the differences between various methods on the dataset.
results: The dataset provides a large number of images in multiple formats, together with the annotation information required for segmentation and object detection, which can aid researchers in exploring new algorithms to enhance computer-assisted technology and improve the accuracy and objectivity of diagnosis.
Endometrial cancer is one of the most common tumors in the female reproductive system and is the third most common gynecological malignancy that causes death after ovarian and cervical cancer. Early diagnosis can significantly improve the 5-year survival rate of patients. With the development of artificial intelligence, computer-assisted diagnosis plays an increasingly important role in improving the accuracy and objectivity of diagnosis, as well as reducing the workload of doctors. However, the absence of publicly available endometrial cancer image datasets restricts the application of computer-assisted diagnostic techniques.In this paper, a publicly available Endometrial Cancer PET/CT Image Dataset for Evaluation of Semantic Segmentation and Detection of Hypermetabolic Regions (ECPC-IDS) are published. Specifically, the segmentation section includes PET and CT images, with a total of 7159 images in multiple formats. In order to prove the effectiveness of segmentation methods on ECPC-IDS, five classical deep learning semantic segmentation methods are selected to test the image segmentation task. The object detection section also includes PET and CT images, with a total of 3579 images and XML files with annotation information. Six deep learning methods are selected for experiments on the detection task.This study conduct extensive experiments using deep learning-based semantic segmentation and object detection methods to demonstrate the differences between various methods on ECPC-IDS. As far as we know, this is the first publicly available dataset of endometrial cancer with a large number of multiple images, including a large amount of information required for image and target detection. ECPC-IDS can aid researchers in exploring new algorithms to enhance computer-assisted technology, benefiting both clinical doctors and patients greatly.
Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos
results: Compared with existing video-modeling architectures, NAOGAT better captures the relationship between objects and the global scene context to predict the next active object and the corresponding future action, and it leverages object motion dynamics to improve accuracy. Experiments show that NAOGAT outperforms existing methods on the Ego4D and EpicKitchens-100 ("Unseen Set") datasets, as measured by several additional metrics such as time to contact (TTC) and next-active-object localization.
Abstract
Objects are crucial for understanding human-object interactions. By identifying the relevant objects, one can also predict potential future interactions or actions that may occur with these objects. In this paper, we study the problem of Short-Term Object interaction anticipation (STA) and propose NAOGAT (Next-Active-Object Guided Anticipation Transformer), a multi-modal end-to-end transformer network, that attends to objects in observed frames in order to anticipate the next-active-object (NAO) and, eventually, to guide the model to predict context-aware future actions. The task is challenging since it requires anticipating future action along with the object with which the action occurs and the time after which the interaction will begin, a.k.a. the time to contact (TTC). Compared to existing video modeling architectures for action anticipation, NAOGAT captures the relationship between objects and the global scene context in order to predict detections for the next active object and anticipate relevant future actions given these detections, leveraging the objects' dynamics to improve accuracy. One of the key strengths of our approach, in fact, is its ability to exploit the motion dynamics of objects within a given clip, which is often ignored by other models, and separately decoding the object-centric and motion-centric information. Through our experiments, we show that our model outperforms existing methods on two separate datasets, Ego4D and EpicKitchens-100 ("Unseen Set"), as measured by several additional metrics, such as time to contact, and next-active-object localization. The code will be available upon acceptance.
Improving Audio-Visual Segmentation with Bidirectional Generation
results: Extensive experiments and analyses on the AVSBench benchmark show that the method establishes a new state-of-the-art performance level for AVS, particularly excelling on the challenging MS3 subset, which involves segmenting multiple sound sources.
Abstract
The aim of audio-visual segmentation (AVS) is to precisely differentiate audible objects within videos down to the pixel level. Traditional approaches often tackle this challenge by combining information from various modalities, where the contribution of each modality is implicitly or explicitly modeled. Nevertheless, the interconnections between different modalities tend to be overlooked in audio-visual modeling. In this paper, inspired by the human ability to mentally simulate the sound of an object and its visual appearance, we introduce a bidirectional generation framework. This framework establishes robust correlations between an object's visual characteristics and its associated sound, thereby enhancing the performance of AVS. To achieve this, we employ a visual-to-audio projection component that reconstructs audio features from object segmentation masks and minimizes reconstruction errors. Moreover, recognizing that many sounds are linked to object movements, we introduce an implicit volumetric motion estimation module to handle temporal dynamics that may be challenging to capture using conventional optical flow methods. To showcase the effectiveness of our approach, we conduct comprehensive experiments and analyses on the widely recognized AVSBench benchmark. As a result, we establish a new state-of-the-art performance level in the AVS benchmark, particularly excelling in the challenging MS3 subset which involves segmenting multiple sound sources. To facilitate reproducibility, we plan to release both the source code and the pre-trained model.
CARE: A Large Scale CT Image Dataset and Clinical Applicable Benchmark Model for Rectal Cancer Segmentation
for: Rectal cancer segmentation of CT images for timely clinical diagnosis, radiotherapy treatment, and follow-up.
methods: Proposed a novel large-scale rectal cancer CT image dataset CARE with pixel-level annotations, and a novel medical cancer lesion segmentation benchmark model named U-SAM that incorporates prompt information to tackle the challenges of intricate anatomical structures.
results: U-SAM outperformed state-of-the-art methods on the CARE dataset and demonstrated generalization on the WORD dataset through extensive experiments. These results can serve as a baseline for future research and clinical application development.
Abstract
Rectal cancer segmentation of CT image plays a crucial role in timely clinical diagnosis, radiotherapy treatment, and follow-up. Although current segmentation methods have shown promise in delineating cancerous tissues, they still encounter challenges in achieving high segmentation precision. These obstacles arise from the intricate anatomical structures of the rectum and the difficulties in performing differential diagnosis of rectal cancer. Additionally, a major obstacle is the lack of a large-scale, finely annotated CT image dataset for rectal cancer segmentation. To address these issues, this work introduces a novel large scale rectal cancer CT image dataset CARE with pixel-level annotations for both normal and cancerous rectum, which serves as a valuable resource for algorithm research and clinical application development. Moreover, we propose a novel medical cancer lesion segmentation benchmark model named U-SAM. The model is specifically designed to tackle the challenges posed by the intricate anatomical structures of abdominal organs by incorporating prompt information. U-SAM contains three key components: promptable information (e.g., points) to aid in target area localization, a convolution module for capturing low-level lesion details, and skip-connections to preserve and recover spatial information during the encoding-decoding process. To evaluate the effectiveness of U-SAM, we systematically compare its performance with several popular segmentation methods on the CARE dataset. The generalization of the model is further verified on the WORD dataset. Extensive experiments demonstrate that the proposed U-SAM outperforms state-of-the-art methods on these two datasets. These experiments can serve as the baseline for future research and clinical application development.
Computer vision-enriched discrete choice models, with an application to residential location choice
paper_authors: Sander van Cranenburgh, Francisco Garrido-Valenzuela
for: This paper aims to address the gap between traditional discrete choice models and real-world decision-making by incorporating computer vision into these models.
methods: The proposed “Computer Vision-enriched Discrete Choice Models” (CV-DCMs) integrate computer vision and traditional discrete choice models to handle choice tasks involving numeric attributes and images.
results: The proposed CV-DCMs are grounded in random utility maximization principles and show the potential to handle complex decision-making tasks involving visual imagery, as demonstrated through a novel stated choice experiment involving residential location choices.
Abstract
Visual imagery is indispensable to many multi-attribute decision situations. Examples of such decision situations in travel behaviour research include residential location choices, vehicle choices, tourist destination choices, and various safety-related choices. However, current discrete choice models cannot handle image data and thus cannot incorporate information embedded in images into their representations of choice behaviour. This gap between discrete choice models' capabilities and the real-world behaviour it seeks to model leads to incomplete and, possibly, misleading outcomes. To solve this gap, this study proposes "Computer Vision-enriched Discrete Choice Models" (CV-DCMs). CV-DCMs can handle choice tasks involving numeric attributes and images by integrating computer vision and traditional discrete choice models. Moreover, because CV-DCMs are grounded in random utility maximisation principles, they maintain the solid behavioural foundation of traditional discrete choice models. We demonstrate the proposed CV-DCM by applying it to data obtained through a novel stated choice experiment involving residential location choices. In this experiment, respondents faced choice tasks with trade-offs between commute time, monthly housing cost and street-level conditions, presented using images. As such, this research contributes to the growing body of literature in the travel behaviour field that seeks to integrate discrete choice modelling and machine learning.
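Since CV-DCMs stay within the random-utility-maximization framework, the core idea can be illustrated with a multinomial-logit sketch in which each residential alternative's utility combines numeric attributes with a score extracted from its street-level image. The parameter values, the scalar image score, and the vision model producing it are all hypothetical; the paper's actual specification of the image term is not reproduced here.

```python
import numpy as np

def choice_probabilities(time, cost, image_score, beta_time, beta_cost, beta_img):
    """Multinomial-logit choice probabilities from systematic utilities
    v_j = beta_time * time_j + beta_cost * cost_j + beta_img * f(image_j)."""
    v = beta_time * time + beta_cost * cost + beta_img * image_score
    v = v - v.max()                              # numerical stability
    expv = np.exp(v)
    return expv / expv.sum()

# Three residential alternatives (values purely illustrative).
time = np.array([20., 35., 50.])                 # commute minutes
cost = np.array([1500., 1100., 900.])            # monthly housing cost
img = np.array([0.8, 0.2, 0.5])                  # assumed visual score f(image) per alternative
p = choice_probabilities(time, cost, img, beta_time=-0.05, beta_cost=-0.002, beta_img=1.5)
```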
Detecting Olives with Synthetic or Real Data? Olive the Above
results: Experiments show that combining a large amount of synthetic data with a small amount of real data improves olive detection accuracy compared with using only a small sample of real data.
Abstract
Modern robotics has enabled the advancement in yield estimation for precision agriculture. However, when applied to the olive industry, the high variation of olive colors and their similarity to the background leaf canopy presents a challenge. Labeling several thousands of very dense olive grove images for segmentation is a labor-intensive task. This paper presents a novel approach to detecting olives without the need to manually label data. In this work, we present the world's first olive detection dataset comprised of synthetic and real olive tree images. This is accomplished by generating an auto-labeled photorealistic 3D model of an olive tree. Its geometry is then simplified for lightweight rendering purposes. In addition, experiments are conducted with a mix of synthetically generated and real images, yielding an improvement of up to 66% compared to when only using a small sample of real data. When access to real, human-labeled data is limited, a combination of mostly synthetic data and a small amount of real data can enhance olive detection.
OnUVS: Online Feature Decoupling Framework for High-Fidelity Ultrasound Video Synthesis
paper_authors: Han Zhou, Dong Ni, Ao Chang, Xinrui Zhou, Rusi Chen, Yanlin Chen, Lian Liu, Jiamin Liang, Yuhao Huang, Tong Han, Zhe Liu, Deng-Ping Fan, Xin Yang
for: The paper aims to address the challenges of synthesizing high-fidelity ultrasound (US) videos for clinical diagnosis, particularly in the context of limited availability of specific US video cases.
methods: The proposed method is an online feature-decoupling framework called OnUVS, which incorporates anatomic information into keypoint learning, uses a dual-decoder to decouple content and textural features, and employs a multiple-feature discriminator to enhance sharpness and fine details.
results: The paper reports that OnUVS synthesizes US videos with high fidelity, as demonstrated through validation and user studies on in-house echocardiographic and pelvic floor US videos.
Abstract
Ultrasound (US) imaging is indispensable in clinical practice. To diagnose certain diseases, sonographers must observe corresponding dynamic anatomic structures to gather comprehensive information. However, the limited availability of specific US video cases causes teaching difficulties in identifying corresponding diseases, which potentially impacts the detection rate of such cases. The synthesis of US videos may represent a promising solution to this issue. Nevertheless, it is challenging to accurately animate the intricate motion of dynamic anatomic structures while preserving image fidelity. To address this, we present a novel online feature-decoupling framework called OnUVS for high-fidelity US video synthesis. Our highlights can be summarized by four aspects. First, we introduced anatomic information into keypoint learning through a weakly-supervised training strategy, resulting in improved preservation of anatomical integrity and motion while minimizing the labeling burden. Second, to better preserve the integrity and textural information of US images, we implemented a dual-decoder that decouples the content and textural features in the generator. Third, we adopted a multiple-feature discriminator to extract a comprehensive range of visual cues, thereby enhancing the sharpness and fine details of the generated videos. Fourth, we constrained the motion trajectories of keypoints during online learning to enhance the fluidity of generated videos. Our validation and user studies on in-house echocardiographic and pelvic floor US videos showed that OnUVS synthesizes US videos with high fidelity.
SceNeRFlow: Time-Consistent Reconstruction of General Dynamic Scenes
paper_authors: Edith Tretschk, Vladislav Golyanik, Michael Zollhoefer, Aljaz Bozic, Christoph Lassner, Christian Theobalt
for: A time-consistent 4D reconstruction method for general, non-rigidly deforming objects, enabling downstream tasks such as 3D editing, motion analysis, or virtual-asset creation.
methods: SceNeRFlow, a dynamic-NeRF method that takes multi-view RGB videos and background images from static cameras with known parameters as input and reconstructs online: it estimates a canonical model of geometry and appearance and then reconstructs its deformations in a time-consistent manner. Because the canonical model is time-invariant, correspondences are obtained even for long-term, long-range motions; neural scene representations parameterize the components of the method.
results: Experiments show that, unlike prior work that only handles small motion, the method reconstructs studio-scale motions, enabled by decomposing the deformations into a strongly regularized coarse component and a weakly regularized fine component.
Abstract
Existing methods for the 4D reconstruction of general, non-rigidly deforming objects focus on novel-view synthesis and neglect correspondences. However, time consistency enables advanced downstream tasks like 3D editing, motion analysis, or virtual-asset creation. We propose SceNeRFlow to reconstruct a general, non-rigid scene in a time-consistent manner. Our dynamic-NeRF method takes multi-view RGB videos and background images from static cameras with known camera parameters as input. It then reconstructs the deformations of an estimated canonical model of the geometry and appearance in an online fashion. Since this canonical model is time-invariant, we obtain correspondences even for long-term, long-range motions. We employ neural scene representations to parametrize the components of our method. Like prior dynamic-NeRF methods, we use a backwards deformation model. We find non-trivial adaptations of this model necessary to handle larger motions: We decompose the deformations into a strongly regularized coarse component and a weakly regularized fine component, where the coarse component also extends the deformation field into the space surrounding the object, which enables tracking over time. We show experimentally that, unlike prior work that only handles small motion, our method enables the reconstruction of studio-scale motions.
MultiMediate’23: Engagement Estimation and Bodily Behaviour Recognition in Social Interactions
paper_authors: Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Dominik Schiller, Mohammed Guermal, Dominike Thomas, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling
results: The paper presents baseline results for both MultiMediate'23 challenge tasks.
Abstract
Automatic analysis of human behaviour is a fundamental prerequisite for the creation of machines that can effectively interact with- and support humans in social interactions. In MultiMediate'23, we address two key human social behaviour analysis tasks for the first time in a controlled challenge: engagement estimation and bodily behaviour recognition in social interactions. This paper describes the MultiMediate'23 challenge and presents novel sets of annotations for both tasks. For engagement estimation we collected novel annotations on the NOvice eXpert Interaction (NOXI) database. For bodily behaviour recognition, we annotated test recordings of the MPIIGroupInteraction corpus with the BBSI annotation scheme. In addition, we present baseline results for both challenge tasks.
Contrastive Learning for Lane Detection via Cross-Similarity
results: On the evaluation datasets, CLLD outperforms previous contrastive-learning methods, especially under visibility-impairing conditions such as shadows; compared with supervised learning, CLLD also performs better in scenarios such as shadows and crowded scenes.
Abstract
Detecting road lanes is challenging due to intricate markings vulnerable to unfavorable conditions. Lane markings have strong shape priors, but their visibility is easily compromised. Factors like lighting, weather, vehicles, pedestrians, and aging colors challenge the detection. A large amount of data is required to train a lane detection approach that can withstand natural variations caused by low visibility. This is because there are numerous lane shapes and natural variations that exist. Our solution, Contrastive Learning for Lane Detection via cross-similarity (CLLD), is a self-supervised learning method that tackles this challenge by enhancing lane detection models resilience to real-world conditions that cause lane low visibility. CLLD is a novel multitask contrastive learning that trains lane detection approaches to detect lane markings even in low visible situations by integrating local feature contrastive learning (CL) with our new proposed operation cross-similarity. Local feature CL focuses on extracting features for small image parts, which is necessary to localize lane segments, while cross-similarity captures global features to detect obscured lane segments using their surrounding. We enhance cross-similarity by randomly masking parts of input images for augmentation. Evaluated on benchmark datasets, CLLD outperforms state-of-the-art contrastive learning, especially in visibility-impairing conditions like shadows. Compared to supervised learning, CLLD excels in scenarios like shadows and crowded scenes.
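Two ingredients named in the abstract, random masking of image regions and a contrastive objective, can be sketched generically as follows: a masked view of each image must still match its unmasked counterpart under an InfoNCE loss, which pushes the encoder to infer occluded lane regions from surrounding context. This is not CLLD's cross-similarity operation or its local-feature branch, and the encoder below is a placeholder; the sketch only illustrates the masking-plus-contrast idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    """Generic InfoNCE between two batches of embeddings (positives are row-aligned)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                    # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

def random_mask(imgs, patch=32, n_patches=4):
    """Zero out a few random square patches, mimicking occluded / low-visibility regions."""
    imgs = imgs.clone()
    B, _, H, W = imgs.shape
    for b in range(B):
        for _ in range(n_patches):
            y = torch.randint(0, H - patch, (1,)).item()
            x = torch.randint(0, W - patch, (1,)).item()
            imgs[b, :, y:y + patch, x:x + patch] = 0.0
    return imgs

# One self-supervised step with a dummy encoder standing in for the real backbone.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 128, 64))
images = torch.rand(4, 3, 128, 128)
loss = info_nce(encoder(images), encoder(random_mask(images)))
```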
DDF-HO: Hand-Held Object Reconstruction via Conditional Directed Distance Field
results: Extensive experiments on synthetic and real-world datasets show that DDF-HO outperforms baseline methods by a large margin, roughly an 80% improvement in Chamfer Distance; code and trained models will be released soon.
Abstract
Reconstructing hand-held objects from a single RGB image is an important and challenging problem. Existing works utilizing Signed Distance Fields (SDF) reveal limitations in comprehensively capturing the complex hand-object interactions, since SDF is only reliable within the proximity of the target, and hence, infeasible to simultaneously encode local hand and object cues. To address this issue, we propose DDF-HO, a novel approach leveraging Directed Distance Field (DDF) as the shape representation. Unlike SDF, DDF maps a ray in 3D space, consisting of an origin and a direction, to corresponding DDF values, including a binary visibility signal determining whether the ray intersects the objects and a distance value measuring the distance from origin to target in the given direction. We randomly sample multiple rays and collect local to global geometric features for them by introducing a novel 2D ray-based feature aggregation scheme and a 3D intersection-aware hand pose embedding, combining 2D-3D features to model hand-object interactions. Extensive experiments on synthetic and real-world datasets demonstrate that DDF-HO consistently outperforms all baseline methods by a large margin, especially under Chamfer Distance, with about 80% leap forward. Codes and trained models will be released soon.
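To make the SDF-versus-DDF distinction above concrete: an SDF maps a 3D point to a signed distance, whereas a DDF maps a ray (origin plus direction) to a visibility flag and a travel distance. The sketch below evaluates the DDF of an analytic sphere; in DDF-HO the same two quantities are predicted by a network conditioned on the image and hand pose rather than computed in closed form.

```python
import numpy as np

def sphere_ddf(origin, direction, center=np.zeros(3), radius=1.0):
    """Directed Distance Field of a sphere: for a ray (origin, direction), return
    (hit, distance), i.e. whether the ray intersects the sphere and the travel
    distance to the first intersection."""
    d = direction / np.linalg.norm(direction)
    oc = origin - center
    b = np.dot(oc, d)
    c = np.dot(oc, oc) - radius ** 2
    disc = b * b - c
    if disc < 0:
        return False, np.inf                      # ray misses the object
    t = -b - np.sqrt(disc)
    if t < 0:
        t = -b + np.sqrt(disc)                    # origin lies inside the sphere
    return (t >= 0), max(t, 0.0)

hit, dist = sphere_ddf(origin=np.array([0., 0., -3.]), direction=np.array([0., 0., 1.]))
# hit == True, dist == 2.0 for a unit sphere at the origin
```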
results: Experiments show that the proposed method significantly reduces spike firing while achieving better performance than state-of-the-art SNN baselines.
Abstract
Spiking Neural Networks (SNNs) are well known as a promising energy-efficient alternative to conventional artificial neural networks. Subject to the preconceived impression that SNNs are sparse firing, the analysis and optimization of inherent redundancy in SNNs have been largely overlooked, thus the potential advantages of spike-based neuromorphic computing in accuracy and energy efficiency are interfered. In this work, we pose and focus on three key questions regarding the inherent redundancy in SNNs. We argue that the redundancy is induced by the spatio-temporal invariance of SNNs, which enhances the efficiency of parameter utilization but also invites lots of noise spikes. Further, we analyze the effect of spatio-temporal invariance on the spatio-temporal dynamics and spike firing of SNNs. Then, motivated by these analyses, we propose an Advance Spatial Attention (ASA) module to harness SNNs' redundancy, which can adaptively optimize their membrane potential distribution by a pair of individual spatial attention sub-modules. In this way, noise spike features are accurately regulated. Experimental results demonstrate that the proposed method can significantly drop the spike firing with better performance than state-of-the-art SNN baselines. Our code is available in \url{https://github.com/BICLab/ASA-SNN}.
How To Overcome Confirmation Bias in Semi-Supervised Image Classification By Active Learning
results: Experiments on simulated real-world data challenges show that random sampling does not mitigate confirmation bias and in some cases performs worse than supervised learning, whereas active learning can overcome confirmation bias in SSL under these realistic conditions.
Abstract
Do we need active learning? The rise of strong deep semi-supervised methods raises doubt about the usability of active learning in limited labeled data settings. This is caused by results showing that combining semi-supervised learning (SSL) methods with a random selection for labeling can outperform existing active learning (AL) techniques. However, these results are obtained from experiments on well-established benchmark datasets that can overestimate the external validity. However, the literature lacks sufficient research on the performance of active semi-supervised learning methods in realistic data scenarios, leaving a notable gap in our understanding. Therefore we present three data challenges common in real-world applications: between-class imbalance, within-class imbalance, and between-class similarity. These challenges can hurt SSL performance due to confirmation bias. We conduct experiments with SSL and AL on simulated data challenges and find that random sampling does not mitigate confirmation bias and, in some cases, leads to worse performance than supervised learning. In contrast, we demonstrate that AL can overcome confirmation bias in SSL in these realistic settings. Our results provide insights into the potential of combining active and semi-supervised learning in the presence of common real-world challenges, which is a promising direction for robust methods when learning with limited labeled data in real-world applications.
Low-Light Image Enhancement with Illumination-Aware Gamma Correction and Complete Image Modelling Network
methods: integrate gamma correction with deep networks, use Taylor Series to approximate gamma correction, use a novel Transformer block to simulate pixel dependencies
results: Outperforms state-of-the-art methods on several benchmark datasets.
Abstract
This paper presents a novel network structure with illumination-aware gamma correction and complete image modelling to solve the low-light image enhancement problem. Low-light environments usually lead to less informative large-scale dark areas, directly learning deep representations from low-light images is insensitive to recovering normal illumination. We propose to integrate the effectiveness of gamma correction with the strong modelling capacities of deep networks, which enables the correction factor gamma to be learned in a coarse to elaborate manner via adaptively perceiving the deviated illumination. Because exponential operation introduces high computational complexity, we propose to use Taylor Series to approximate gamma correction, accelerating the training and inference speed. Dark areas usually occupy large scales in low-light images, common local modelling structures, e.g., CNN, SwinIR, are thus insufficient to recover accurate illumination across whole low-light images. We propose a novel Transformer block to completely simulate the dependencies of all pixels across images via a local-to-global hierarchical attention mechanism, so that dark areas could be inferred by borrowing the information from far informative regions in a highly effective manner. Extensive experiments on several benchmark datasets demonstrate that our approach outperforms state-of-the-art methods.
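The Taylor-series trick mentioned in the abstract replaces the per-pixel power operation x**gamma = exp(gamma * ln x) with a truncated series of the exponential. The snippet below is a stand-alone illustration of that approximation with a fixed gamma map; in the paper gamma is learned adaptively from the estimated illumination, which is not modelled here.

```python
import torch

def gamma_correct_taylor(x, gamma, order=6):
    """Approximate x**gamma = exp(gamma * ln x) with a truncated Taylor series of
    the exponential. x in (0, 1]; gamma may be a spatially varying tensor."""
    t = gamma * torch.log(x.clamp_min(1e-4))      # the exponent gamma * ln(x)
    out = torch.ones_like(x)
    term = torch.ones_like(x)
    for k in range(1, order + 1):
        term = term * t / k                        # accumulates t**k / k!
        out = out + term
    return out.clamp(0, 1)

x = torch.rand(1, 3, 8, 8).clamp_min(0.05)
approx = gamma_correct_taylor(x, gamma=torch.full_like(x, 0.5))
exact = x ** 0.5
max_err = (approx - exact).abs().max()             # small for moderate exponents
```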
MEDOE: A Multi-Expert Decoder and Output Ensemble Framework for Long-tailed Semantic Segmentation
paper_authors: Junao Shen, Long Chen, Kun Kuang, Fei Wu, Tian Feng, Wei Zhang
for: Addressing the long-tailed distribution of semantic categories, which conventional methods ignore, to improve semantic segmentation performance on tail categories.
methods: Proposes MEDOE, a framework for long-tailed semantic segmentation via contextual information ensemble-and-grouping, which uses multiple experts (a multi-expert decoder and a multi-expert output ensemble) to improve classification accuracy.
results: Experiments show that MEDOE outperforms current methods on the Cityscapes and ADE20K datasets by up to 1.78% in mIoU and 5.89% in mAcc.
Abstract
Long-tailed distribution of semantic categories, which has been often ignored in conventional methods, causes unsatisfactory performance in semantic segmentation on tail categories. In this paper, we focus on the problem of long-tailed semantic segmentation. Although some long-tailed recognition methods (e.g., re-sampling/re-weighting) have been proposed in other problems, they can probably compromise crucial contextual information and are thus hardly adaptable to the problem of long-tailed semantic segmentation. To address this issue, we propose MEDOE, a novel framework for long-tailed semantic segmentation via contextual information ensemble-and-grouping. The proposed two-sage framework comprises a multi-expert decoder (MED) and a multi-expert output ensemble (MOE). Specifically, the MED includes several "experts". Based on the pixel frequency distribution, each expert takes the dataset masked according to the specific categories as input and generates contextual information self-adaptively for classification; The MOE adopts learnable decision weights for the ensemble of the experts' outputs. As a model-agnostic framework, our MEDOE can be flexibly and efficiently coupled with various popular deep neural networks (e.g., DeepLabv3+, OCRNet, and PSPNet) to improve their performance in long-tailed semantic segmentation. Experimental results show that the proposed framework outperforms the current methods on both Cityscapes and ADE20K datasets by up to 1.78% in mIoU and 5.89% in mAcc.
摘要
长尾分布的 semantic category 问题,常被 conventional methods 忽略,导致 semantic segmentation 在尾部类别上的性能不理想。在这篇论文中,我们关注长尾 semantic segmentation 问题。尽管一些长尾识别方法(例如重采样/重加权)已在其他问题上被提出,但它们可能会丢失重要的上下文信息,因此难以适应长尾 semantic segmentation 问题。为解决这个问题,我们提出了 MEDOE,一种基于上下文信息集成与分组的新框架。该两阶段框架包括一个多专家解码器(MED)和一个多专家输出集成(MOE)。具体来说,MED 包括多个"专家"。根据像素频率分布,每个专家以按特定类别掩码后的数据集为输入,自适应地生成用于分类的上下文信息;MOE 采用可学习的决策权重,对各专家的输出进行集成。作为一个模型无关的框架,我们的 MEDOE 可以灵活且高效地与各种流行的深度神经网络(例如 DeepLabv3+、OCRNet 和 PSPNet)结合,以提高它们在长尾 semantic segmentation 中的性能。实验结果表明,我们的方案在 Cityscapes 和 ADE20K 数据集上比现有方法最高提升 1.78% 的 mIoU 和 5.89% 的 mAcc。
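A minimal sketch of the multi-expert output ensemble (MOE) idea follows: each expert decoder produces per-pixel class logits and a learnable weight matrix blends them. The expert internals, the softmax parameterisation of the decision weights, and all shapes are assumptions for illustration, not MEDOE's exact design.

    import torch
    import torch.nn as nn

    class OutputEnsemble(nn.Module):
        def __init__(self, num_experts, num_classes):
            super().__init__()
            # one learnable decision weight per (expert, class); softmax keeps them normalised
            self.decision = nn.Parameter(torch.zeros(num_experts, num_classes))

        def forward(self, expert_logits):
            # expert_logits: (E, B, C, H, W) stacked outputs of the expert decoders
            w = torch.softmax(self.decision, dim=0)            # (E, C)
            w = w.view(w.shape[0], 1, w.shape[1], 1, 1)         # broadcast over batch and space
            return (expert_logits * w).sum(dim=0)               # (B, C, H, W) fused logits

    experts = [nn.Conv2d(256, 19, kernel_size=1) for _ in range(3)]   # toy stand-ins for the "experts"
    feats = torch.randn(2, 256, 32, 32)
    logits = torch.stack([e(feats) for e in experts], dim=0)
    fused = OutputEnsemble(num_experts=3, num_classes=19)(logits)
    print(fused.shape)   # torch.Size([2, 19, 32, 32])

Because the fusion is just a weighted sum over expert logits, it can be attached to any backbone decoder, which matches the framework's model-agnostic claim.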
Neural Spherical Harmonics for structurally coherent continuous representation of diffusion MRI signal
results: 使用该方法重建 dMRI 信号后,得到的数据具有更具结构一致性的表示:gradient 图像中的噪声被去除,fiber orientation distribution functions 沿纤维束显示出平滑的方向变化。此外,该方法还可以计算 mean diffusivity、fractional anisotropy 和总表观纤维密度(total apparent fiber density)。这些结果只需一个模型架构和一个 hyperparameter 即可实现。此外,在 angular 和 spatial 域进行 upsampling 也可以实现与现有方法相当或更好的重建结果。Abstract
We present a novel way to model diffusion magnetic resonance imaging (dMRI) datasets, that benefits from the structural coherence of the human brain while only using data from a single subject. Current methods model the dMRI signal in individual voxels, disregarding the intervoxel coherence that is present. We use a neural network to parameterize a spherical harmonics series (NeSH) to represent the dMRI signal of a single subject from the Human Connectome Project dataset, continuous in both the angular and spatial domain. The reconstructed dMRI signal using this method shows a more structurally coherent representation of the data. Noise in gradient images is removed and the fiber orientation distribution functions show a smooth change in direction along a fiber tract. We showcase how the reconstruction can be used to calculate mean diffusivity, fractional anisotropy, and total apparent fiber density. These results can be achieved with a single model architecture, tuning only one hyperparameter. In this paper we also demonstrate how upsampling in both the angular and spatial domain yields reconstructions that are on par or better than existing methods.
摘要
我们提出了一种新的方法,用于建模扩散磁共振成像(dMRI)数据集,该方法在仅使用单个受试者数据的同时,利用了人脑的结构一致性。现有方法在单个体素内建模 dMRI 信号,忽略了体素之间的相关性。我们使用神经网络来参数化一个球谐函数级数(NeSH),以表示来自 Human Connectome Project 数据集的单个受试者的 dMRI 信号,该表示在角度域和空间域上均是连续的。使用这种方法重建的 dMRI 信号显示出更具结构一致性的数据表示:gradient 图像中的噪声被去除,纤维方向分布函数沿纤维束显示出平滑的方向变化。我们还展示了如何利用该重建来计算平均扩散率(mean diffusivity)、分数各向异性(fractional anisotropy)和总表观纤维密度(total apparent fiber density)。这些结果只需一个模型架构、调节一个 hyperparameter 即可实现。本文还展示了在角度域和空间域进行 upsampling 所得到的重建结果与现有方法相当或更好。
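The core mechanism above (an MLP that maps a spatial coordinate to spherical-harmonic coefficients, evaluated at a gradient direction) can be sketched as below. The SH order, network sizes, and the use of the real part of scipy's complex SH basis are simplifying assumptions, not the paper's exact model.

    import numpy as np
    import torch
    import torch.nn as nn
    from scipy.special import sph_harm

    L_MAX = 4
    lm_pairs = [(l, m) for l in range(0, L_MAX + 1, 2) for m in range(-l, l + 1)]  # even orders, as usual for dMRI

    def sh_basis(theta, phi):
        # theta: azimuth, phi: polar angle (scipy's convention); real part as a simplification
        return np.array([sph_harm(m, l, theta, phi).real for (l, m) in lm_pairs], dtype=np.float32)

    coeff_net = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, len(lm_pairs)))

    xyz = torch.rand(1, 3)                                     # a continuous spatial coordinate
    coeffs = coeff_net(xyz)                                    # predicted SH coefficients at that location
    basis = torch.from_numpy(sh_basis(theta=0.3, phi=1.1))     # one gradient direction
    signal = (coeffs * basis).sum()                            # continuous in both space and angle
    print(len(lm_pairs), float(signal))

Because the coefficients come from a network over continuous coordinates rather than a per-voxel fit, the representation can be queried at arbitrary spatial and angular resolutions, which is what enables the upsampling experiments mentioned above.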
Self-Reference Deep Adaptive Curve Estimation for Low-Light Image Enhancement
results: 对多个实际 datasets进行了广泛的Qualitative和Quantitative分析,结果表明,该方法在比较当前最佳算法的测试中表现出色。Abstract
In this paper, we propose a 2-stage low-light image enhancement method called Self-Reference Deep Adaptive Curve Estimation (Self-DACE). In the first stage, we present an intuitive, lightweight, fast, and unsupervised luminance enhancement algorithm. The algorithm is based on a novel low-light enhancement curve that can be used to locally boost image brightness. We also propose a new loss function with a simplified physical model designed to preserve natural images' color, structure, and fidelity. We use a vanilla CNN to map each pixel through deep Adaptive Adjustment Curves (AAC) while preserving the local image structure. Secondly, we introduce the corresponding denoising scheme to remove the latent noise in the darkness. We approximately model the noise in the dark and deploy a Denoising-Net to estimate and remove the noise after the first stage. Exhaustive qualitative and quantitative analysis shows that our method outperforms existing state-of-the-art algorithms on multiple real-world datasets.
摘要
在本文中,我们提出了一种两阶段低光照图像增强方法,称为自参照深度自适应曲线估计(Self-DACE)。在第一阶段,我们提出了一种直观、轻量、快速且无监督的亮度增强算法。该算法基于一条新的低光照增强曲线,可以局部提升图像亮度。我们还提出了一个新的损失函数,采用简化的物理模型,以保持自然图像的颜色、结构和保真度。我们使用一个普通的卷积神经网络(CNN),通过深度自适应调整曲线(AAC)对每个像素进行映射,同时保持图像的局部结构。在第二阶段,我们引入了相应的去噪方案,以去除暗区中的潜在噪声。我们对暗区噪声进行近似建模,并部署一个去噪网络(Denoising-Net)在第一阶段之后估计并去除噪声。系统的定性和定量分析表明,我们的方法在多个真实世界数据集上优于现有的最先进算法。
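To make the idea of a pixel-wise adjustment curve applied iteratively concrete, here is an illustrative sketch. The quadratic curve form LE(x) = x + a * x * (1 - x) and the number of iterations are assumptions borrowed from Zero-DCE-style curves; Self-DACE's exact curve and loss are not reproduced here.

    import numpy as np

    def apply_curve(x, alpha, iterations=4):
        """x: image in [0, 1]; alpha: per-pixel curve parameter in [-1, 1] (a CNN would predict it)."""
        out = x
        for _ in range(iterations):
            out = out + alpha * out * (1.0 - out)   # monotone on [0, 1], preserves 0 and 1
        return np.clip(out, 0.0, 1.0)

    low_light = np.random.uniform(0.0, 0.3, size=(4, 4)).astype(np.float32)
    alpha_map = np.full_like(low_light, 0.8)         # stand-in for the CNN-predicted parameter map
    enhanced = apply_curve(low_light, alpha_map)
    print(low_light.mean(), enhanced.mean())         # brightness increases while staying in range

Because the curve is applied per pixel with a locally predicted parameter, brightness can be boosted locally without a global exposure change, which is the property the first stage relies on.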
Automatic Vision-Based Parking Slot Detection and Occupancy Classification
results: 在使用了公开available的PKLot和CNRPark+EXT数据集进行测试后,该算法表现了高效的停车槽检测和占用分类能力,并能够快速响应停车场的变化。Abstract
Parking guidance information (PGI) systems are used to provide information to drivers about the nearest parking lots and the number of vacant parking slots. Recently, vision-based solutions started to appear as a cost-effective alternative to standard PGI systems based on hardware sensors mounted on each parking slot. Vision-based systems provide information about parking occupancy based on images taken by a camera that is recording a parking lot. However, such systems are challenging to develop due to various possible viewpoints, weather conditions, and object occlusions. Most notably, they require manual labeling of parking slot locations in the input image which is sensitive to camera angle change, replacement, or maintenance. In this paper, the algorithm that performs Automatic Parking Slot Detection and Occupancy Classification (APSD-OC) solely on input images is proposed. Automatic parking slot detection is based on vehicle detections in a series of parking lot images upon which clustering is applied in bird's eye view to detect parking slots. Once the parking slots positions are determined in the input image, each detected parking slot is classified as occupied or vacant using a specifically trained ResNet34 deep classifier. The proposed approach is extensively evaluated on well-known publicly available datasets (PKLot and CNRPark+EXT), showing high efficiency in parking slot detection and robustness to the presence of illegal parking or passing vehicles. Trained classifier achieves high accuracy in parking slot occupancy classification.
摘要
停车导航信息(PGI)系统用于向驾驶员提供最近停车场及空余停车位数量的信息。最近,基于视觉的解决方案开始出现,作为在每个停车位上安装硬件传感器的传统 PGI 系统的低成本替代。基于视觉的系统根据记录停车场的摄像头图像来提供停车位占用信息。然而,由于视角多变、天气条件不同以及物体遮挡,这类系统的开发具有挑战性。尤其是,它们需要在输入图像中手动标注停车位位置,而这对摄像头角度变化、更换或维护十分敏感。本文提出了一种仅基于输入图像的自动停车位检测与占用分类算法(APSD-OC)。自动停车位检测基于对一系列停车场图像中的车辆检测,并在鸟瞰视图中对检测结果进行聚类以确定停车位。一旦确定了停车位在输入图像中的位置,每个检测到的停车位将由专门训练的 ResNet34 深度分类器分类为已占用或空闲。所提出的方法在知名的公开数据集(PKLot 和 CNRPark+EXT)上进行了广泛评估,表现出高效的停车位检测能力,并对违规停车或过往车辆具有鲁棒性。训练后的分类器在停车位占用分类中达到了很高的准确率。
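A sketch of the slot-discovery step described above: vehicle-detection centres accumulated over many frames are projected to a bird's-eye view and clustered, and each cluster centre is treated as a candidate parking-slot location. The homography values, DBSCAN parameters, and toy detections are assumptions for illustration only.

    import numpy as np
    from sklearn.cluster import DBSCAN

    H = np.array([[1.0, 0.0, 0.0],       # image-to-BEV homography (would be estimated in practice)
                  [0.0, 2.0, 0.0],
                  [0.0, 0.001, 1.0]])

    def to_bev(points_xy):
        pts = np.hstack([points_xy, np.ones((len(points_xy), 1))])
        warped = pts @ H.T
        return warped[:, :2] / warped[:, 2:3]

    # detection centres gathered over a series of frames (two slots, with jitter)
    detections = np.vstack([
        np.random.normal(loc=[100, 200], scale=3, size=(40, 2)),
        np.random.normal(loc=[160, 205], scale=3, size=(35, 2)),
    ])
    bev_points = to_bev(detections)
    labels = DBSCAN(eps=15.0, min_samples=10).fit_predict(bev_points)
    slots = [bev_points[labels == k].mean(axis=0) for k in set(labels) if k != -1]
    print(f"found {len(slots)} candidate slots")   # each slot crop would then go to a ResNet34 classifier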
Unsupervised Domain Adaptive Detection with Network Stability Analysis
results: 这篇研究使用NSA整合Faster R-CNN,实现了顶尖的结果,包括在Cityscapes-to-FoggyCityscapes上的52.7% mAP记录。此外,NSA还可以应用于其他一阶检测器(例如FCOS),并且在实验中证明了这一点。Abstract
Domain adaptive detection aims to improve the generality of a detector, learned from the labeled source domain, on the unlabeled target domain. In this work, drawing inspiration from the concept of stability in control theory that a robust system must remain consistent both externally and internally regardless of disturbances, we propose a novel framework that achieves unsupervised domain adaptive detection through stability analysis. Specifically, we treat discrepancies between images and regions from different domains as disturbances, and introduce a novel simple but effective Network Stability Analysis (NSA) framework that considers various disturbances for domain adaptation. Particularly, we explore three types of perturbations including heavy and light image-level disturbances and instance-level disturbance. For each type, NSA performs external consistency analysis on the outputs from raw and perturbed images and/or internal consistency analysis on their features, using teacher-student models. By integrating NSA into Faster R-CNN, we immediately achieve state-of-the-art results. In particular, we set a new record of 52.7% mAP on Cityscapes-to-FoggyCityscapes, showing the potential of NSA for domain adaptive detection. It is worth noting that our NSA is designed for general purposes, and thus applicable to one-stage detection models (e.g., FCOS) besides the adopted one, as shown by experiments. https://github.com/tiankongzhang/NSA.
摘要
域自适应检测的目标是提高在有标注源域上学习得到的检测器在无标注目标域上的泛化能力。在这项工作中,我们借鉴控制理论中的稳定性概念(一个鲁棒系统无论受到何种扰动,都应保持外部和内部的一致性),提出了一种通过稳定性分析实现无监督域自适应检测的新框架。Specifically, we treat discrepancies between images and regions from different domains as disturbances, and introduce a novel simple but effective Network Stability Analysis (NSA) framework that considers various disturbances for domain adaptation. Particularly, we explore three types of perturbations including heavy and light image-level disturbances and instance-level disturbance. For each type, NSA performs external consistency analysis on the outputs from raw and perturbed images and/or internal consistency analysis on their features, using teacher-student models. By integrating NSA into Faster R-CNN, we immediately achieve state-of-the-art results. In particular, we set a new record of 52.7% mAP on Cityscapes-to-FoggyCityscapes, showing the potential of NSA for domain adaptive detection. It is worth noting that our NSA is designed for general purpose, and thus applicable to one-stage detection models (e.g., FCOS) besides the adopted one, as shown by experiments. More details can be found at https://github.com/tiankongzhang/NSA.
AATCT-IDS: A Benchmark Abdominal Adipose Tissue CT Image Dataset for Image Denoising, Semantic Segmentation, and Radiomics Evaluation
paper_authors: Zhiyu Ma, Chen Li, Tianming Du, Le Zhang, Dechao Tang, Deguo Ma, Shanchuan Huang, Yan Liu, Yihao Sun, Zhihao Chen, Jin Yuan, Qianqing Nie, Marcin Grzegorzek, Hongzan Sun
for: 该研究构建并发布的数据集旨在研究腹部脂肪组织的多维特征。
methods: 该研究使用名为 AATTCT-IDS 的基准数据集,包含 300 名受试者的 3D CT 切片;研究人员对其中 3,213 张切片手动标注了脂肪组织区域。针对不同任务,研究人员使用多种方法在 AATTCT-IDS 上进行对比分析,以验证该数据集在这些任务中的研究潜力。
results: 该研究发现,在图像去噪任务中,采用平滑策略的算法以牺牲图像细节为代价抑制混合噪声,并取得更好的评估指标;在 semantic segmentation 任务中,BiSeNet 模型能以较短的训练时间获得精度较高的结果,并具有更好的结构区分能力;在放射组学(radiomics)研究中,研究人员发现了三种不同的脂肪分布方式。Abstract
Methods: In this study, a benchmark \emph{Abdominal Adipose Tissue CT Image Dataset} (AATTCT-IDS) containing 300 subjects is prepared and published. AATTCT-IDS makes public 13,732 raw CT slices, and the researchers individually annotate the subcutaneous and visceral adipose tissue regions of 3,213 of those slices that have the same slice distance, in order to validate denoising methods, train semantic segmentation models, and study radiomics. For the different tasks, this paper compares and analyzes the performance of various methods on AATTCT-IDS by combining the visualization results and evaluation data, thereby verifying the research potential of this dataset for the above three types of tasks. Results: In the comparative study of image denoising, algorithms using a smoothing strategy suppress mixed noise at the expense of image details and obtain better evaluation data. Methods such as BM3D preserve the original image structure better, although the evaluation data are slightly lower. The results show significant differences among them. In the comparative study of semantic segmentation of abdominal adipose tissue, the segmentation results of adipose tissue produced by each model show different structural characteristics. Among them, BiSeNet obtains segmentation results only slightly inferior to U-Net with the shortest training time and effectively separates small and isolated adipose tissue. In addition, the radiomics study based on AATTCT-IDS reveals three adipose distributions in the subject population. Conclusion: AATTCT-IDS contains the ground truth of adipose tissue regions in abdominal CT slices. This open-source dataset can attract researchers to explore the multi-dimensional characteristics of abdominal adipose tissue and thus help physicians and patients in clinical practice. AATTCT-IDS is freely published for non-commercial purposes at: \url{https://figshare.com/articles/dataset/AATTCT-IDS/23807256}.
摘要
方法:本研究准备并发布了名为"Abdominal Adipose Tissue CT Image Dataset"(AATTCT-IDS)的基准数据集,包含 300 名受试者,可免费下载。AATTCT-IDS 公开了 13,732 张原始 CT 切片,研究人员对其中 3,213 张切片进行了手动标注,用于验证去噪方法、训练 semantic segmentation 模型以及研究放射组学(radiomics)。通过结合可视化结果和评估数据,对不同任务进行比较和分析,以验证 AATTCT-IDS 的研究潜力。结果:在图像去噪对比研究中,采用平滑策略的算法以牺牲图像细节为代价抑制混合噪声,并获得更好的评估指标;BM3D 等方法能更好地保留原始图像结构,但评估指标略低。结果显示不同算法之间存在显著差异。在 semantic segmentation 对比研究中,各模型对脂肪组织的分割结果呈现不同的结构特征。其中,BiSeNet 能以最短的训练时间获得仅略逊于 U-Net 的分割结果,并能有效分离小而孤立的脂肪组织。此外,基于 AATTCT-IDS 的放射组学研究发现了受试人群中三种脂肪分布类型。结论:AATTCT-IDS 包含了腹部 CT 切片中脂肪组织区域的真实标注。这个开源数据集可以吸引研究人员深入研究腹部脂肪组织的多维特征,从而帮助医生和患者的临床实践。AATTCT-IDS 免费发布,可在以下链接下载:https://figshare.com/articles/dataset/AATTCT-IDS/23807256。
Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis
paper_authors: Minho Park, Jooyeol Yun, Seunghwan Choi, Jaegul Choo
for: 提高文本到图像生成的文本-图像匹配率,而不仅仅依靠大规模的文本-图像数据集。
methods: 提出了一种新的方法,即利用可用的semantic layout来增强文本到图像生成的文本-图像匹配率。Specifically, we propose a Gaussian-categorical diffusion process that simultaneously generates both images and corresponding layout pairs.
results: 我们的实验表明,我们可以通过训练模型生成 semantic labels for each pixel,使模型对不同图像区域的semantics有所了解。我们的方法在Multi-Modal CelebA-HQ和Cityscapes dataset上达到了更高的文本-图像匹配率,这些数据集上的文本-图像对是罕见的。Abstract
Existing text-to-image generation approaches have set high standards for photorealism and text-image correspondence, largely benefiting from web-scale text-image datasets, which can include up to 5 billion pairs. However, text-to-image generation models trained on domain-specific datasets, such as urban scenes, medical images, and faces, still suffer from low text-image correspondence due to the lack of text-image pairs. Additionally, collecting billions of text-image pairs for a specific domain can be time-consuming and costly. Thus, ensuring high text-image correspondence without relying on web-scale text-image datasets remains a challenging task. In this paper, we present a novel approach for enhancing text-image correspondence by leveraging available semantic layouts. Specifically, we propose a Gaussian-categorical diffusion process that simultaneously generates both images and corresponding layout pairs. Our experiments reveal that we can guide text-to-image generation models to be aware of the semantics of different image regions, by training the model to generate semantic labels for each pixel. We demonstrate that our approach achieves higher text-image correspondence compared to existing text-to-image generation approaches on the Multi-Modal CelebA-HQ and the Cityscapes datasets, where text-image pairs are scarce. Code is available at https://pmh9960.github.io/research/GCDP
摘要
现有的文本到图像生成方法已经在照片级真实感和文本-图像对应方面达到了很高的水准,这主要得益于网络规模的文本-图像数据集,其规模可达 50 亿对。然而,在特定领域数据集(如城市场景、医学图像和人脸)上训练的文本到图像生成模型,由于缺乏文本-图像对,仍然存在文本-图像对应度低的问题。此外,为某一特定领域收集数十亿对文本-图像既耗时又昂贵。因此,在不依赖网络规模文本-图像数据集的情况下保证高文本-图像对应度,仍然是一项具有挑战性的任务。在这篇论文中,我们提出了一种利用可用的 semantic layout 来提高文本-图像对应度的新方法。具体来说,我们提出了一个 Gaussian-categorical 扩散过程,同时生成图像和对应的布局对。我们的实验表明,通过训练模型为每个像素生成 semantic 标签,可以引导文本到图像生成模型感知不同图像区域的语义。我们的方法在文本-图像对稀缺的 Multi-Modal CelebA-HQ 和 Cityscapes 数据集上取得了比现有文本到图像生成方法更高的文本-图像对应度。代码可以在以下链接获取:https://pmh9960.github.io/research/GCDP
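One forward step of a joint Gaussian-categorical corruption process can be sketched as below: the image channel receives Gaussian noise while the per-pixel semantic label is resampled towards a uniform categorical distribution at the same timestep. The schedule and the exact transition kernel are assumptions; the point is only that image and layout are noised jointly.

    import torch

    def forward_step(x, labels, beta, num_classes):
        # Gaussian branch: x_t = sqrt(1 - beta) * x + sqrt(beta) * eps
        x_t = torch.sqrt(torch.tensor(1.0 - beta)) * x + torch.sqrt(torch.tensor(beta)) * torch.randn_like(x)
        # Categorical branch: with probability beta, replace the label by a uniform random class
        resample = torch.rand_like(labels, dtype=torch.float32) < beta
        random_labels = torch.randint_like(labels, num_classes)
        labels_t = torch.where(resample, random_labels, labels)
        return x_t, labels_t

    image = torch.rand(1, 3, 16, 16)
    layout = torch.randint(0, 19, (1, 16, 16))
    noisy_image, noisy_layout = forward_step(image, layout, beta=0.1, num_classes=19)
    print(noisy_image.shape, (noisy_layout != layout).float().mean())

The reverse model would then be trained to denoise both branches together, which is what ties each generated pixel to a semantic label.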
methods: 该文章扩展了 Blau et al. 的 perceptual quality 定义,通过以用户定义的信息为条件(conditioning)得到条件感知质量,并在此基础上构建保持条件感知质量的压缩框架。
results: 实验结果表明,该编码器能够在所有比特率下成功保持图像的高感知质量和语义质量;此外,文章给出了所需公共随机性(common randomness)的下界,从而解决了此前关于(conditional) perceptual quality 压缩是否应在生成器中引入随机性的争论。Abstract
We propose conditional perceptual quality, an extension of the perceptual quality defined in \citet{blau2018perception}, by conditioning it on user defined information. Specifically, we extend the original perceptual quality $d(p_{X},p_{\hat{X}})$ to the conditional perceptual quality $d(p_{X|Y},p_{\hat{X}|Y})$, where $X$ is the original image, $\hat{X}$ is the reconstruction, $Y$ is side information defined by the user and $d(\cdot,\cdot)$ is a divergence. We show that conditional perceptual quality has similar theoretical properties as the rate-distortion-perception trade-off \citep{blau2019rethinking}. Based on these theoretical results, we propose an optimal framework for conditional perceptual quality preserving compression. Experimental results show that our codec successfully maintains high perceptual quality and semantic quality at all bitrates. Besides, by providing a lower bound of the common randomness required, we settle the previous arguments on whether randomness should be incorporated into the generator for (conditional) perceptual quality compression. The source code is provided in the supplementary material.
摘要
我们提出条件感知质量,它是对 \citet{blau2018perception} 中定义的感知质量的扩展,通过以用户定义的信息为条件。具体来说,我们将原始的感知质量 $d(p_{X},p_{\hat{X}})$ 扩展为条件感知质量 $d(p_{X|Y},p_{\hat{X}|Y})$,其中 $X$ 是原始图像,$\hat{X}$ 是重建图像,$Y$ 是用户定义的边信息,$d(\cdot,\cdot)$ 是散度。我们证明了条件感知质量具有与率-失真-感知权衡(rate-distortion-perception trade-off)类似的理论性质。基于这些理论结果,我们提出了一个最优的保持条件感知质量的压缩框架。实验结果表明,我们的编码器在所有比特率下都能成功保持高感知质量和语义质量。此外,通过给出所需公共随机性(common randomness)的下界,我们解决了此前关于生成器中是否应引入随机性以实现(条件)感知质量压缩的争论。源代码见补充材料。
SCANet: A Self- and Cross-Attention Network for Audio-Visual Speech Separation
results: 在三个标准的视听语音分离测试集(LRS2、LRS3 和 VoxCeleb2)上,SCANet 表现出色,超过了现有的最先进方法,同时保持了相当的推理时间。Abstract
The integration of different modalities, such as audio and visual information, plays a crucial role in human perception of the surrounding environment. Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, they predominantly focus on multi-modal fusion architectures situated either at the top or bottom positions, rather than comprehensively considering multi-modal fusion at various hierarchical positions within the network. In this paper, we propose a novel model called self- and cross-attention network (SCANet), which leverages the attention mechanism for efficient audio-visual feature fusion. SCANet consists of two types of attention blocks: self-attention (SA) and cross-attention (CA) blocks, where the CA blocks are distributed at the top (TCA), middle (MCA) and bottom (BCA) of SCANet. These blocks maintain the ability to learn modality-specific features and enable the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of SCANet, outperforming existing state-of-the-art (SOTA) methods while maintaining comparable inference time.
摘要
人类对周围环境的感知依赖于不同模态(如音频和视觉信息)的融合。现有研究在设计视听语音分离的融合模块方面已取得重要进展。然而,这些模块主要将多模态融合结构置于网络的顶层或底层,而没有全面考虑在网络各层级位置进行多模态融合。在这篇论文中,我们提出了一种新的模型,即自注意力与交叉注意力网络(SCANet),它利用注意力机制实现高效的视听特征融合。SCANet 包括两类注意力块:自注意力(SA)块和交叉注意力(CA)块,其中 CA 块分布在 SCANet 的顶层(TCA)、中层(MCA)和底层(BCA)。这些块既能学习各模态特有的特征,又能从视听特征中提取不同的语义信息。我们在三个标准的视听分离基准(LRS2、LRS3 和 VoxCeleb2)上进行了全面实验,结果证明 SCANet 的效果优于现有最先进(SOTA)方法,同时保持了相当的推理时间。
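A minimal sketch of a cross-attention (CA) block in which audio features attend to visual features follows; in SCANet such blocks would sit at several depths (top/middle/bottom). Feature dimensions, sequence lengths, and the single-block layout are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class CrossAttentionBlock(nn.Module):
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, audio, visual):
            # audio queries attend to visual keys/values, then a residual feed-forward refines them
            fused, _ = self.attn(query=audio, key=visual, value=visual)
            audio = self.norm(audio + fused)
            return audio + self.ffn(audio)

    audio_feats = torch.randn(2, 100, 256)    # (batch, audio frames, channels)
    visual_feats = torch.randn(2, 25, 256)    # (batch, video frames, channels)
    out = CrossAttentionBlock(dim=256)(audio_feats, visual_feats)
    print(out.shape)                           # torch.Size([2, 100, 256])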
S2R: Exploring a Double-Win Transformer-Based Framework for Ideal and Blind Super-Resolution
paper_authors: Minghao She, Wendong Mao, Huihong Shi, Zhongfeng Wang
for: 提高理想和盲 SR 任务的视觉效果(ideal and blind super-resolution)
methods: 提出了一种双赢框架(double-win framework),包括一种轻量级 transformer 基于 SR 模型(S2R transformer)和一种新的由粗到细(coarse-to-fine)训练策略
results: 实验结果表明,提出的 S2R 模型在理想 SR 条件下与只有 578K 参数表现出色,并在盲糊条件下与只有 10 梯度更新达到更好的视觉效果,提高了整体训练速度三百倍。Abstract
Nowadays, deep learning based methods have demonstrated impressive performance on ideal super-resolution (SR) datasets, but most of these methods suffer dramatic performance drops when directly applied to real-world SR reconstruction tasks with unpredictable blur kernels. To tackle this issue, blind SR methods have been proposed to improve the visual results on random blur kernels, but they in turn produce unsatisfactory reconstructions on ideal low-resolution images. In this paper, we propose a double-win framework for ideal and blind SR tasks, named S2R, including a light-weight transformer-based SR model (S2R transformer) and a novel coarse-to-fine training strategy, which can achieve excellent visual results under both ideal and random fuzzy conditions. On the algorithm level, the S2R transformer smartly combines some efficient and light-weight blocks to enhance the representation ability of extracted features with a relatively low number of parameters. For the training strategy, a coarse-level learning process is first performed to improve the generalization of the network with the help of a large-scale external dataset, and then a fast fine-tune process is developed to transfer the pre-trained model to real-world SR tasks by mining the internal features of the image. Experimental results show that the proposed S2R outperforms other single-image SR models in the ideal SR condition with only 578K parameters. Meanwhile, it can achieve better visual results than regular blind SR models in blind fuzzy conditions with only 10 gradient updates, which improves convergence speed by 300 times, significantly accelerating the transfer-learning process in real-world situations.
摘要
目前,基于深度学习的方法在理想的超分辨率(SR)数据集上已经表现出了惊人的性能,但大多数方法在直接应用于真实世界的 SR 重建任务时会出现明显的性能下降,特别是在模糊核不可预测的情况下。为解决这个问题,人们提出了盲 SR 方法来改善随机模糊核下的视觉效果,但这些方法在理想的低分辨率图像上又会导致不理想的重建效果。在这篇文章中,我们提出了一个双赢框架,名为 S2R,包括一个轻量级的 transformer SR 模型(S2R transformer)和一种新的由粗到细(coarse-to-fine)训练策略,可以在理想和随机模糊条件下均取得出色的视觉效果。在算法层面,S2R transformer 巧妙地结合了一些高效且轻量的模块,在参数量较少的情况下增强了所提取特征的表示能力。在训练策略方面,我们首先借助大规模外部数据集进行粗级学习,以提高网络的泛化能力;然后开发了一种快速的 fine-tune 过程,通过挖掘图像内部特征将预训练模型迁移到真实世界的 SR 任务中。实验结果显示,我们提出的 S2R 仅用 578K 参数即可在理想 SR 条件下超越其他单图像 SR 模型;同时在盲模糊条件下,它只需 10 次梯度更新就能取得比常规盲 SR 模型更好的视觉效果,将收敛速度提高了 300 倍,显著加速了真实应用中的迁移学习过程。
GPA-3D: Geometry-aware Prototype Alignment for Unsupervised Domain Adaptive 3D Object Detection from Point Clouds
results: 在Waymo、nuScenes和KITTI等多个 benchmark上获得了比领先方法更好的适应性和性能Abstract
LiDAR-based 3D detection has made great progress in recent years. However, the performance of 3D detectors is considerably limited when deployed in unseen environments, owing to the severe domain gap problem. Existing domain adaptive 3D detection methods do not adequately consider the problem of the distributional discrepancy in feature space, thereby hindering generalization of detectors across domains. In this work, we propose a novel unsupervised domain adaptive \textbf{3D} detection framework, namely \textbf{G}eometry-aware \textbf{P}rototype \textbf{A}lignment (\textbf{GPA-3D}), which explicitly leverages the intrinsic geometric relationship from point cloud objects to reduce the feature discrepancy, thus facilitating cross-domain transferring. Specifically, GPA-3D assigns a series of tailored and learnable prototypes to point cloud objects with distinct geometric structures. Each prototype aligns BEV (bird's-eye-view) features derived from corresponding point cloud objects on source and target domains, reducing the distributional discrepancy and achieving better adaptation. The evaluation results obtained on various benchmarks, including Waymo, nuScenes and KITTI, demonstrate the superiority of our GPA-3D over the state-of-the-art approaches for different adaptation scenarios. The MindSpore version code will be publicly available at \url{https://github.com/Liz66666/GPA3D}.
摘要
“基于激光雷达(LiDAR)的 3D 检测技术近年来取得了巨大进展。然而,由于严重的域差(domain gap)问题,3D 检测器在未见过的环境中部署时性能明显受限。现有的域自适应 3D 检测方法没有充分考虑特征空间中的分布差异问题,从而阻碍了检测器在不同域之间的泛化。在这项工作中,我们提出了一种新的无监督域自适应 3D 检测框架,即几何感知原型对齐(GPA-3D)。GPA-3D 显式利用点云对象的内在几何关系来减少特征差异,从而促进跨域迁移。具体来说,GPA-3D 为具有不同几何结构的点云对象分配一系列定制的可学习原型。每个原型将源域和目标域中对应点云对象的 BEV(鸟瞰)特征进行对齐,从而减少分布差异并实现更好的适应。在 Waymo、nuScenes 和 KITTI 等多个基准上的评估结果表明,GPA-3D 在不同适应场景下均优于最先进的方法。MindSpore 版本代码将在 \url{https://github.com/Liz66666/GPA3D} 上公开。”
View Consistent Purification for Accurate Cross-View Localization
results: 对KITTI和Ford Multi-AV Seasonal dataset进行了广泛的实验,并表明了our proposed method可以高效地在室外环境中进行自地localization,其中 median lateral和longitudinal方向的空间精度错误在0.5米以下, medianorientation方向的精度错误在2度以下。Abstract
This paper proposes a fine-grained self-localization method for outdoor robotics that utilizes a flexible number of onboard cameras and readily accessible satellite images. The proposed method addresses limitations in existing cross-view localization methods that struggle to handle noise sources such as moving objects and seasonal variations. It is the first sparse visual-only method that enhances perception in dynamic environments by detecting view-consistent key points and their corresponding deep features from ground and satellite views, while removing off-the-ground objects and establishing homography transformation between the two views. Moreover, the proposed method incorporates a spatial embedding approach that leverages camera intrinsic and extrinsic information to reduce the ambiguity of purely visual matching, leading to improved feature matching and overall pose estimation accuracy. The method exhibits strong generalization and is robust to environmental changes, requiring only geo-poses as ground truth. Extensive experiments on the KITTI and Ford Multi-AV Seasonal datasets demonstrate that our proposed method outperforms existing state-of-the-art methods, achieving median spatial accuracy errors below $0.5$ meters along the lateral and longitudinal directions, and a median orientation accuracy error below 2 degrees.
摘要
本文提出了一种面向室外机器人的细粒度自定位方法,该方法使用数量灵活的机载摄像头和易于获取的卫星图像。该方法解决了现有跨视图定位方法难以处理移动物体和季节变化等噪声源的局限。这是首个仅依赖稀疏视觉信息的方法,它通过从地面和卫星视图中检测视图一致的关键点及其对应的深度特征、剔除非地面物体并在两个视图之间建立单应(homography)变换,增强了动态环境下的感知能力。此外,该方法还引入了一种空间嵌入方法,利用摄像头的内参和外参信息来减少纯视觉匹配的歧义,从而提高特征匹配和整体位姿估计的精度。该方法具有很强的泛化能力,并对环境变化具有鲁棒性,仅需地理位姿作为真值。在 KITTI 和 Ford Multi-AV Seasonal 数据集上的大量实验表明,我们提出的方法优于现有的最先进方法,在横向和纵向的空间精度中位误差低于 0.5 米,方向精度中位误差低于 2 度。
Snapshot High Dynamic Range Imaging with a Polarization Camera
paper_authors: Mingyang Xie, Matthew Chan, Christopher Metzler
for: 该论文旨在将普通的偏振相机转化为高性能的 HDR 相机。
methods: 该方法在偏振相机前放置一个线性偏振器,利用偏振器方向的不同同时获取四张不同曝光的图像;随后通过一种抗离群且自校准的算法,从这些测量中重建(单一偏振下的)HDR 图像。
results: 该方法通过实际的实验证明其效果。Abstract
High dynamic range (HDR) images are important for a range of tasks, from navigation to consumer photography. Accordingly, a host of specialized HDR sensors have been developed, the most successful of which are based on capturing variable per-pixel exposures. In essence, these methods capture an entire exposure bracket sequence at once in a single shot. This paper presents a straightforward but highly effective approach for turning an off-the-shelf polarization camera into a high-performance HDR camera. By placing a linear polarizer in front of the polarization camera, we are able to simultaneously capture four images with varied exposures, which are determined by the orientation of the polarizer. We develop an outlier-robust and self-calibrating algorithm to reconstruct an HDR image (at a single polarity) from these measurements. Finally, we demonstrate the efficacy of our approach with extensive real-world experiments.
摘要
高动态范围(HDR)图像在从导航到消费级摄影的各类任务中都非常重要。因此,人们开发了多种专用的 HDR 传感器,其中最成功的是基于逐像素可变曝光的方案,这类方法本质上是在单次拍摄中同时捕获整个曝光序列。本文提出了一种简单而高效的方法,可以将现成的偏振相机转化为高性能的 HDR 相机:在偏振相机前放置一个线性偏振器,即可同时捕获四张曝光不同的图像,曝光量由偏振器的方向决定。我们开发了一种抗离群且自校准的算法,从这些测量中重建(单一偏振下的)HDR 图像。最后,我们通过大量真实场景实验证明了该方法的有效性。
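A toy illustration of the acquisition principle: a front linear polarizer at angle phi combined with the camera's on-sensor polarizers at 0/45/90/135 degrees attenuates light roughly as cos^2(theta_k - phi) (Malus's law), so one shot yields four differently exposed images, and a simple weighted fusion recovers radiance over a wider range. The noise-free model and the fusion weights are assumptions, not the paper's reconstruction algorithm.

    import numpy as np

    phi = np.deg2rad(20.0)
    angles = np.deg2rad([0.0, 45.0, 90.0, 135.0])
    gains = np.cos(angles - phi) ** 2 + 1e-3                    # per-channel effective "exposure"

    radiance = np.random.uniform(0.0, 8.0, size=(32, 32))       # HDR scene, exceeds the sensor range
    captures = [np.clip(gains[k] * radiance, 0.0, 1.0) for k in range(4)]   # saturating sensor

    # fuse: trust unsaturated measurements, divide out the per-capture gain
    weights = [np.where(c < 0.98, 1.0, 1e-6) for c in captures]
    hdr = sum(w * c / g for w, c, g in zip(weights, captures, gains)) / sum(weights)
    print(float(np.abs(hdr - radiance).mean()))                  # small error away from saturation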
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
results: 实验证明 DragNUWA 的效果,其能够在视频生成中实现细致的控制功能,并且在不同的材料和设定下都能够达到比较好的效果。Abstract
Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models' capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, We propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories in different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is \url{https://www.microsoft.com/en-us/research/project/dragnuwa/}
摘要
可控视频生成技术在最近几年内得到了广泛关注。然而,目前仍存在两个主要限制:首先,大多数现有工作只基于文本、图像或轨迹之一进行控制,难以实现对视频内容的精细控制;其次,轨迹控制的研究仍处于早期阶段,大多数实验基于 Human3.6M 等简单数据集进行。这两个限制使得模型难以处理开放域图像和复杂的弯曲轨迹。在这篇论文中,我们提出了 DragNUWA,一种基于扩散的开放域视频生成模型。为了解决现有工作中控制粒度不足的问题,我们同时引入文本、图像和轨迹信息,从语义、空间和时间三个角度对视频内容进行精细控制。为了解决当前研究中开放域轨迹控制受限的问题,我们从三个方面对轨迹进行建模:轨迹采样器(Trajectory Sampler,TS)实现对任意轨迹的开放域控制,多尺度融合(Multiscale Fusion,MF)控制不同粒度的轨迹,自适应训练(Adaptive Training,AT)策略生成遵循轨迹的一致视频。我们的实验证明了 DragNUWA 的有效性,并展示了其在视频生成精细控制方面的优越性。更多信息请访问项目主页:https://www.microsoft.com/en-us/research/project/dragnuwa/
Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection
results: 在三个标准测试集上,使用 Pro-Cap 方法的模型显示了良好的性能和泛化能力,证明了方法的有效性和通用性。Abstract
Hateful meme detection is a challenging multimodal task that requires comprehension of both vision and language, as well as cross-modal interactions. Recent studies have tried to fine-tune pre-trained vision-language models (PVLMs) for this task. However, with increasing model sizes, it becomes important to leverage powerful PVLMs more efficiently, rather than simply fine-tuning them. Recently, researchers have attempted to convert meme images into textual captions and prompt language models for predictions. This approach has shown good performance but suffers from non-informative image captions. Considering the two factors mentioned above, we propose a probing-based captioning approach to leverage PVLMs in a zero-shot visual question answering (VQA) manner. Specifically, we prompt a frozen PVLM by asking hateful content-related questions and use the answers as image captions (which we call Pro-Cap), so that the captions contain information critical for hateful content detection. The good performance of models with Pro-Cap on three benchmarks validates the effectiveness and generalization of the proposed method.
摘要
仇恨表情包检测是一个复杂的多模态任务,需要同时理解视觉和语言,以及跨模态交互。近期研究尝试针对该任务微调预训练的视觉语言模型(PVLM)。然而,随着模型规模的增大,更重要的是更高效地利用强大的 PVLM,而不仅仅是对其进行微调。Recently, researchers have attempted to convert meme images into textual captions and prompt language models for predictions. This approach has shown good performance but suffers from non-informative image captions. Considering the two factors mentioned above, we propose a probing-based captioning approach to leverage PVLMs in a zero-shot visual question answering (VQA) manner. Specifically, we prompt a frozen PVLM by asking hateful content-related questions and use the answers as image captions (which we call Pro-Cap), so that the captions contain information critical for hateful content detection. The good performance of models with Pro-Cap on three benchmarks validates the effectiveness and generalization of the proposed method.
Deep Learning Framework for Spleen Volume Estimation from 2D Cross-sectional Views
paper_authors: Zhen Yuan, Esther Puyol-Anton, Haran Jogeesvaran, Baba Inusa, Andrew P. King
for: The paper aims to provide an automated method for measuring spleen volume from 2D cross-sectional segmentations obtained from ultrasound imaging, which can be used to assess splenomegaly and related clinical conditions.
methods: The proposed method uses a variational autoencoder-based framework to estimate spleen volume from single- or dual-view 2D spleen segmentations. The framework includes three volume estimation methods, which are evaluated and compared with the clinical standard approach and a deep learning-based 2D-3D reconstruction approach.
results: The best model achieved mean relative volume accuracies of 86.62% and 92.58% for single- and dual-view segmentations, respectively, higher than both the clinical standard approach and the comparative deep learning-based approach.Abstract
Abnormal spleen enlargement (splenomegaly) is regarded as a clinical indicator for a range of conditions, including liver disease, cancer and blood diseases. While spleen length measured from ultrasound images is a commonly used surrogate for spleen size, spleen volume remains the gold standard metric for assessing splenomegaly and the severity of related clinical conditions. Computed tomography is the main imaging modality for measuring spleen volume, but it is less accessible in areas where there is a high prevalence of splenomegaly (e.g., the Global South). Our objective was to enable automated spleen volume measurement from 2D cross-sectional segmentations, which can be obtained from ultrasound imaging. In this study, we describe a variational autoencoder-based framework to measure spleen volume from single- or dual-view 2D spleen segmentations. We propose and evaluate three volume estimation methods within this framework. We also demonstrate how 95% confidence intervals of volume estimates can be produced to make our method more clinically useful. Our best model achieved mean relative volume accuracies of 86.62% and 92.58% for single- and dual-view segmentations, respectively, surpassing the performance of the clinical standard approach of linear regression using manual measurements and a comparative deep learning-based 2D-3D reconstruction-based approach. The proposed spleen volume estimation framework can be integrated into standard clinical workflows which currently use 2D ultrasound images to measure spleen length. To the best of our knowledge, this is the first work to achieve direct 3D spleen volume estimation from 2D spleen segmentations.
摘要
异常的脾脏肿大(splenomegaly)被视为多种疾病的临床指标,包括肝病、癌症和血液疾病。由超声图像测得的脾脏长度是常用的脾脏大小替代指标,但脾脏体积仍然是评估脾肿大及相关临床疾病严重程度的金标准。计算机断层扫描(CT)是测量脾脏体积的主要成像方式,但在脾肿大高发地区(例如全球南方),其可及性较差。我们的目标是实现从 2D 横断面分割(可由超声成像获得)自动测量脾脏体积。在本研究中,我们描述了一个基于变分自编码器的框架,用于从单视图或双视图的 2D 脾脏分割中估计脾脏体积,并在该框架内提出并评估了三种体积估计方法。我们还展示了如何给出体积估计的 95% 置信区间,使该方法更具临床实用性。我们的最佳模型在单视图和双视图分割中分别达到了 86.62% 和 92.58% 的平均相对体积准确率,超过了使用手动测量进行线性回归的临床标准方法以及一种基于深度学习的 2D-3D 重建方法。所提出的脾脏体积估计框架可以集成到目前使用 2D 超声图像测量脾脏长度的标准临床工作流程中。据我们所知,这是首个直接从 2D 脾脏分割中估计 3D 脾脏体积的工作。
Shortcut-V2V: Compression Framework for Video-to-Video Translation based on Temporal Redundancy Reduction
results: 相比原始模型,在测试时节省了 3.2-5.7 倍的计算成本和 7.8-44 倍的内存占用,同时保持了相当的性能。Abstract
Video-to-video translation aims to generate video frames of a target domain from an input video. Despite its usefulness, the existing networks require enormous computations, necessitating their model compression for wide use. While there exist compression methods that improve computational efficiency in various image/video tasks, a generally-applicable compression method for video-to-video translation has not been studied much. In response, we present Shortcut-V2V, a general-purpose compression framework for video-to-video translation. Shortcut-V2V avoids full inference for every neighboring video frame by approximating the intermediate features of a current frame from those of the previous frame. Moreover, in our framework, a newly-proposed block called AdaBD adaptively blends and deforms features of neighboring frames, which makes more accurate predictions of the intermediate features possible. We conduct quantitative and qualitative evaluations using well-known video-to-video translation models on various tasks to demonstrate the general applicability of our framework. The results show that Shortcut-V2V achieves comparable performance compared to the original video-to-video translation model while saving 3.2-5.7x computational cost and 7.8-44x memory at test time.
摘要
对于视频到视频翻译任务,目标是从输入视频生成目标域中的视频帧。尽管其用途很大,但现有网络需要巨大的计算量,因此需要对模型进行压缩以便广泛应用。虽然在各类图像/视频任务中已有提升计算效率的压缩方法,但针对视频到视频翻译的通用压缩方法尚未得到充分研究。为此,我们提出了 Shortcut-V2V,一个面向视频到视频翻译的通用压缩框架。Shortcut-V2V 通过利用前一帧的中间特征来近似当前帧的中间特征,避免了对每个相邻视频帧进行完整推理。此外,在我们的框架中,一个新提出的模块 AdaBD 能自适应地混合并形变相邻帧的特征,从而更准确地预测中间特征。我们使用多个知名的视频到视频翻译模型在不同任务上进行了定量和定性评估,以证明框架的通用性。结果表明,Shortcut-V2V 在测试时可节省 3.2-5.7 倍的计算成本和 7.8-44 倍的内存占用,同时保持与原始视频到视频翻译模型相当的性能。
$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models
results: 实验表明,$A^2$Nav 在 R2R-Habitat 和 RxR-Habitat 数据集上取得了可观的 ZS-VLN 性能,甚至超过了监督学习方法。Abstract
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data. Normally, the instructions have complex grammatical structures and often contain various action descriptions (e.g., "proceed beyond", "depart from"). How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging. Note that a well-educated human being can easily understand path instructions without the need for any special training. In this paper, we propose an action-aware zero-shot VLN method ($A^2$Nav) by exploiting the vision-and-language ability of foundation models. Specifically, the proposed method consists of an instruction parser and an action-aware navigation policy. The instruction parser utilizes the advanced reasoning ability of large language models (e.g., GPT-3) to decompose complex navigation instructions into a sequence of action-specific object navigation sub-tasks. Each sub-task requires the agent to localize the object and navigate to a specific goal position according to the associated action demand. To accomplish these sub-tasks, an action-aware navigation policy is learned from freely collected action-specific datasets that reveal distinct characteristics of each action demand. We use the learned navigation policy for executing sub-tasks sequentially to follow the navigation instruction. Extensive experiments show $A^2$Nav achieves promising ZS-VLN performance and even surpasses the supervised learning methods on R2R-Habitat and RxR-Habitat datasets.
摘要
我们研究零样本视觉语言导航(ZS-VLN)任务,这是一个实用而又具有挑战性的问题:智能体需要在不依赖任何路径-指令标注数据的情况下,学习按照语言指令描述的路径进行导航。通常,这些指令具有复杂的语法结构,并且常常包含各种动作描述(例如"继续前进"、"离开")。如何正确理解并执行这些动作要求是一个关键问题,而标注数据的缺乏使其更加困难。值得注意的是,一个受过良好教育的人无需任何特殊训练就能轻松理解路径指令。在这篇论文中,我们提出了一种动作感知的零样本 VLN 方法($A^2$Nav),利用基础模型的视觉与语言能力。具体来说,所提方法包括一个指令解析器和一个动作感知的导航策略。指令解析器利用大型语言模型(例如 GPT-3)的高级推理能力,将复杂的导航指令分解为一系列针对特定动作的目标物体导航子任务。每个子任务要求智能体定位目标物体,并按照相应的动作要求导航到特定的目标位置。为了完成这些子任务,我们从自由收集的、体现各动作需求不同特性的动作特定数据集中学习一个动作感知的导航策略。我们按顺序执行各子任务,从而遵循整条导航指令。大量实验表明,$A^2$Nav 取得了可观的 ZS-VLN 性能,甚至在 R2R-Habitat 和 RxR-Habitat 数据集上超过了监督学习方法。
YODA: You Only Diffuse Areas. An Area-Masked Diffusion Approach For Image Super-Resolution
paper_authors: Brian B. Moser, Stanislav Frolov, Federico Raue, Sebastian Palacio, Andreas Dengel
for: 该论文旨在提出一种基于部分扩散的单图像超分辨率(SISR)方法,即"You Only Diffuse Areas"(YODA)。
methods: 该方法利用由低分辨率图像和当前扩散时间步得到的注意力图,有选择地对空间区域应用扩散,从而更有效地得到高分辨率输出。
results: 我们通过扩展 SR3 和 SRDiff 证明了 YODA 的性能提升,在人脸和通用 SR 任务上的 PSNR、SSIM 和 LPIPS 指标上取得了新的最先进(SOTA)结果。此外,YODA 还能稳定训练过程,尤其是缓解小批量下可能引起的颜色偏移问题。Abstract
This work introduces "You Only Diffuse Areas" (YODA), a novel method for partial diffusion in Single-Image Super-Resolution (SISR). The core idea is to utilize diffusion selectively on spatial regions based on attention maps derived from the low-resolution image and the current time step in the diffusion process. This time-dependent targeting enables a more effective conversion to high-resolution outputs by focusing on areas that benefit the most from the iterative refinement process, i.e., detail-rich objects. We empirically validate YODA by extending leading diffusion-based SISR methods SR3 and SRDiff. Our experiments demonstrate new state-of-the-art performance gains in face and general SR across PSNR, SSIM, and LPIPS metrics. A notable finding is YODA's stabilization effect on training by reducing color shifts, especially when induced by small batch sizes, potentially contributing to resource-constrained scenarios. The proposed spatial and temporal adaptive diffusion mechanism opens promising research directions, including developing enhanced attention map extraction techniques and optimizing inference latency based on sparser diffusion.
摘要
我们介绍了一种新的单图像超分辨率方法"You Only Diffuse Areas"(YODA),它在单图像超分辨率中实现了部分扩散。该方法基于由低分辨率图像和扩散过程当前时间步得到的注意力图,有选择地对空间区域应用扩散。这种随时间变化的目标选择,使迭代精化过程能够聚焦于受益最大的区域(即细节丰富的对象),从而更有效地得到高分辨率输出。我们通过将 YODA 扩展到领先的基于扩散的 SISR 方法 SR3 和 SRDiff 上进行了实证验证。实验表明,YODA 在人脸和通用 SR 任务上的 PSNR、SSIM 和 LPIPS 指标上均取得了新的最先进结果。一个值得注意的发现是,YODA 能通过减少颜色偏移来稳定训练,尤其是在小批量引起颜色偏移的情况下,这可能有助于资源受限的场景。所提出的空间与时间自适应扩散机制开启了许多有前景的研究方向,包括开发更好的注意力图提取技术,以及基于更稀疏的扩散优化推理延迟。
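The area-masked idea can be sketched as a masked reverse-diffusion update: at each step the model's refinement is applied only where a (time-dependent) attention mask is active, while the remaining pixels keep the current estimate. The denoiser, the mask source, and the thresholding schedule below are placeholders/assumptions; only the masked-update structure is the point.

    import torch

    def masked_reverse_step(x_t, attention_map, denoise_fn, t, num_steps):
        # diffuse fewer areas as t decreases: threshold grows over the reverse trajectory (assumed schedule)
        threshold = 1.0 - t / num_steps
        mask = (attention_map > threshold).float()
        x_refined = denoise_fn(x_t, t)                     # one ordinary reverse-diffusion update
        return mask * x_refined + (1.0 - mask) * x_t       # only attended areas are diffused

    toy_denoiser = lambda x, t: x - 0.05 * torch.randn_like(x)   # stand-in for the trained network
    x = torch.rand(1, 3, 64, 64)
    attn = torch.rand(1, 1, 64, 64)                               # in YODA, derived from the LR image
    for t in reversed(range(1, 51)):
        x = masked_reverse_step(x, attn, toy_denoiser, t, num_steps=50)
    print(x.shape)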
Boosting Cross-Quality Face Verification using Blind Face Restoration
for: 这篇论文旨在研究如何使用三种最先进的盲人脸修复(Blind Face Restoration)技术来提高人脸验证系统的性能,并保留人脸中有价值的身份信息。
methods: 这 paper 使用的方法包括 GFP-GAN、GPEN 和 SGPN 三种盲面 restore 技术。
results: 实验结果表明,使用 GFP-GAN 技术可以大幅提高人脸识别系统的准确率,特别是在具有低质量图像的情况下。Abstract
In recent years, various Blind Face Restoration (BFR) techniques were developed. These techniques transform low quality faces suffering from multiple degradations to more realistic and natural face images with high perceptual quality. However, it is crucial for the task of face verification to not only enhance the perceptual quality of the low quality images but also to improve the biometric-utility face quality metrics. Furthermore, preserving the valuable identity information is of great importance. In this paper, we investigate the impact of applying three state-of-the-art blind face restoration techniques namely, GFP-GAN, GPEN and SGPN on the performance of face verification system under very challenging environment characterized by very low quality images. Extensive experimental results on the recently proposed cross-quality LFW database using three state-of-the-art deep face recognition models demonstrate the effectiveness of GFP-GAN in boosting significantly the face verification accuracy.
摘要
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
for: 该论文的目的是提出一种新型的视频表示方法,称为内容形变场(Content Deformation Field,CoDeF),用于重建视频。
methods: 该论文使用了精心设计的渲染管线和若干正则化项,使得 CoDeF 能够自然地支持将图像算法"提升"到视频处理上。
results: 实验表明,CoDeF 能够将图像到图像的翻译提升为视频到视频的翻译,并将关键点检测提升为关键点跟踪,而无需任何训练。此外,CoDeF 具有更好的跨帧一致性,甚至能够跟踪水和烟雾等非刚性对象。Abstract
We present the content deformation field CoDeF as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We deliberately introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video. With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field. We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training. More importantly, thanks to our lifting strategy that deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog. The project page can be found at https://qiuyu96.github.io/CoDeF/.
摘要
我们介绍了一种新的视频表示方式:内容形变场(CoDeF)。它包含一个聚合整个视频中静态内容的规范内容场,以及一个记录规范图像(即由规范内容场渲染得到的图像)到时间轴上各帧之间变换的时间形变场。给定一个目标视频,这两个场通过我们精心设计的渲染管线被联合优化以重建该视频。我们在优化过程中有意引入了一些正则化,使规范内容场能够继承视频中的语义信息(例如物体形状)。这种设计使 CoDeF 能够自然地支持将图像算法"提升"到视频处理:只需将图像算法应用于规范图像,再借助时间形变场将结果传播到整个视频。我们的实验表明,CoDeF 可以在无需训练的情况下,将图像到图像的翻译提升为视频到视频的翻译,并将关键点检测提升为关键点跟踪。此外,由于我们的提升策略只需在一张图像上运行算法,处理后的视频相比现有的视频到视频翻译方法具有更好的跨帧一致性,甚至能够跟踪水和烟雾等非刚性物体。项目页面:https://qiuyu96.github.io/CoDeF/。
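A conceptual sketch of the decomposition described above: a pixel of frame t is rendered by first deforming its coordinate with a temporal deformation field and then querying a canonical content field at the deformed location. Both fields are tiny MLPs here purely for illustration; the real method uses far richer field representations and a tailored rendering pipeline.

    import torch
    import torch.nn as nn

    deformation_field = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))  # (x, y, t) -> offset
    canonical_field = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))    # (x, y) -> RGB

    def render(coords_xy, t):
        t_col = torch.full_like(coords_xy[:, :1], float(t))
        offsets = deformation_field(torch.cat([coords_xy, t_col], dim=1))
        canonical_xy = coords_xy + offsets            # where this pixel lives in the canonical image
        return canonical_field(canonical_xy)          # colour shared across all frames at that point

    pixels = torch.rand(1024, 2)                      # normalised pixel coordinates of one frame
    frame_5 = render(pixels, t=5)
    print(frame_5.shape)                              # torch.Size([1024, 3])
    # Editing only the canonical field (one image) and re-rendering propagates the edit to every frame.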
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
results: 在零样本测试中,于多个下游视频-文本检索与分类基准上表现优于现有最先进方法;并且将所学表示用作长期视频理解任务(例如 Ego4D 中的情景记忆)的输入时也取得了良好效果。Abstract
We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learnt by our model, by: (i) evaluating it for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) by using the representations learned as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over the state of the art -- even compared to networks trained with far larger batch sizes. We also show that by using noisy image-level detection as pseudo-labels in training, the model learns to provide better bounding boxes using video consistency, as well as grounding the words in the associated text descriptions. Overall, we show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
摘要
我们介绍一个对象感知解码器(object-aware decoder),用于改进以自我为中心(ego-centric)视频的时空表示性能。其关键思想是在训练过程中增强模型的对象感知能力:当有成对的文字描述可用时,让模型预测手部位置、物体位置以及物体的语义标签。在推理时,模型只需要 RGB 帧作为输入,便能够跟踪并定位物体(尽管并未针对此进行显式训练)。我们通过两方面展示了该模型学习到的对象感知表示的性能:(i)在多个下游视频-文本检索与分类基准上进行零样本迁移测试;(ii)将所学表示用作长期视频理解任务(例如 Ego4D 中的情景记忆)的输入。在所有情况下,性能均超过了现有最先进方法,甚至优于使用更大批量训练的网络。我们还表明,通过在训练中使用带噪声的图像级检测作为伪标签,模型能够借助视频一致性给出更好的边界框,并将相关文本描述中的词语进行视觉定位(grounding)。总之,该模型可以直接替换以自我为中心的视频模型,通过视觉-文本对齐来提升性能。
A Foundation LAnguage-Image model of the Retina (FLAIR): Encoding expert knowledge in text supervision
paper_authors: Julio Silva-Rodriguez, Hadi Chakor, Riadh Kobbi, Jose Dolz, Ismail Ben Ayed
for: Developing a pre-trained vision-language model for universal retinal fundus image understanding, with the goal of improving the performance of domain-expert models in medical imaging.
methods: The paper uses a combination of open-access fundus imaging datasets and expert knowledge from clinical literature and community standards to pre-train and fine-tune a vision-language model called FLAIR. The model is adapted with a lightweight linear probe for zero-shot inference, and its performance is evaluated under various scenarios with domain shifts and unseen categories.
results: The paper reports that FLAIR outperforms fully-trained, dataset-focused models and more generalist, larger-scale image-language models in few-shot regimes, emphasizing the potential of embedding experts’ domain knowledge in medical imaging. Specifically, FLAIR achieves better performance in difficult scenarios with domain shifts or unseen categories, and outperforms more generalist models by a large margin.Abstract
Foundation vision-language models are currently transforming computer vision, and are on the rise in medical imaging fueled by their very promising generalization capabilities. However, the initial attempts to transfer this new paradigm to medical imaging have shown less impressive performances than those observed in other domains, due to the significant domain shift and the complex, expert domain knowledge inherent to medical-imaging tasks. Motivated by the need for domain-expert foundation models, we present FLAIR, a pre-trained vision-language model for universal retinal fundus image understanding. To this end, we compiled 37 open-access, mostly categorical fundus imaging datasets from various sources, with up to 97 different target conditions and 284,660 images. We integrate the expert's domain knowledge in the form of descriptive textual prompts, during both pre-training and zero-shot inference, enhancing the less-informative categorical supervision of the data. Such a textual expert's knowledge, which we compiled from the relevant clinical literature and community standards, describes the fine-grained features of the pathologies as well as the hierarchies and dependencies between them. We report comprehensive evaluations, which illustrate the benefit of integrating expert knowledge and the strong generalization capabilities of FLAIR under difficult scenarios with domain shifts or unseen categories. When adapted with a lightweight linear probe, FLAIR outperforms fully-trained, dataset-focused models, more so in the few-shot regimes. Interestingly, FLAIR outperforms by a large margin more generalist, larger-scale image-language models, which emphasizes the potential of embedding experts' domain knowledge and the limitations of generalist models in medical imaging.
摘要
基础视觉-语言模型目前正在改变计算机视觉领域,并凭借其出色的泛化能力在医学影像领域迅速兴起。然而,由于医学影像任务存在显著的域偏移以及复杂的专家领域知识,将这一新范式迁移到医学影像的初步尝试表现不如其他领域。出于对领域专家型基础模型的需求,我们提出了 FLAIR,一个用于通用视网膜眼底图像理解的预训练视觉-语言模型。为此,我们收集了 37 个来自不同来源的开放获取、以分类标注为主的眼底影像数据集,涵盖多达 97 种目标疾病、共 284,660 张图像。我们在预训练和零样本推理阶段,都以描述性文本提示的形式引入专家领域知识,以弥补数据中分类监督信息量不足的问题。这些文本知识整理自相关临床文献和社区标准,描述了各类病变的细粒度特征以及它们之间的层次与依赖关系。我们报告了全面的评估结果,说明了引入专家知识的益处,以及 FLAIR 在域偏移或未见类别等困难场景下强大的泛化能力。当配合轻量级线性探针(linear probe)使用时,FLAIR 的表现超过了针对特定数据集完整训练的模型,在少样本场景下优势更为明显。有趣的是,FLAIR 还大幅超越了更通用、规模更大的图像-语言模型,这凸显了嵌入专家领域知识的潜力以及通用模型在医学影像中的局限性。
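The zero-shot inference pattern described above can be sketched as follows: expert-written textual descriptions of each condition are encoded once, and a fundus image is assigned to the category whose text embedding is closest. The encoders here are untrained stand-ins and the prompts are invented examples; FLAIR's actual prompts come from clinical literature and community standards.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))   # stand-in vision tower
    text_encoder = nn.Sequential(nn.EmbeddingBag(30000, 512))                    # stand-in text tower

    prompts = {
        "diabetic retinopathy": "fundus image with microaneurysms and hard exudates",
        "glaucoma": "fundus image with enlarged optic cup and increased cup-to-disc ratio",
        "normal": "healthy fundus with clear optic disc and macula",
    }

    def fake_tokenize(text):                      # placeholder tokenizer (hash words into a fixed vocab)
        return torch.tensor([[hash(w) % 30000 for w in text.split()]])

    image = torch.rand(1, 3, 224, 224)
    img_emb = F.normalize(image_encoder(image), dim=-1)
    scores = {}
    for name, prompt in prompts.items():
        txt_emb = F.normalize(text_encoder(fake_tokenize(prompt)), dim=-1)
        scores[name] = float((img_emb * txt_emb).sum())
    print(max(scores, key=scores.get), scores)

Because the class information lives entirely in the text prompts, new conditions can be added at inference time by writing new descriptions, which is what makes the zero-shot and domain-shift evaluations possible.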
Memory-and-Anticipation Transformer for Online Action Understanding
results: 对四个具有挑战性的标准 benchmark(TVSeries、THUMOS’14、HDD、EPIC-Kitchens-100)进行测试,MAT模型在在线动作检测和预测任务中具有显著的优异性,与现有方法相比显著超越。Abstract
Most existing forecasting systems are memory-based methods, which attempt to mimic human forecasting ability by employing various memory mechanisms and have progressed in temporal modeling for memory dependency. Nevertheless, an obvious weakness of this paradigm is that it can only model limited historical dependence and can not transcend the past. In this paper, we rethink the temporal dependence of event evolution and propose a novel memory-anticipation-based paradigm to model an entire temporal structure, including the past, present, and future. Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks. In addition, owing to the inherent superiority of MAT, it can process online action detection and anticipation tasks in a unified manner. The proposed MAT model is tested on four challenging benchmarks TVSeries, THUMOS'14, HDD, and EPIC-Kitchens-100, for online action detection and anticipation tasks, and it significantly outperforms all existing methods. Code is available at https://github.com/Echo0125/Memory-and-Anticipation-Transformer.
摘要
现有的预测系统多数是记忆基本方法,尝试模拟人类预测能力 by 使用不同的记忆机制,并进步在时间模型中。然而,这个思维模型的明显弱点是它只能模型有限的历史依赖,无法突破过去。在这篇文章中,我们重新思考了事件演化的时间依赖,并提出了一个新的记忆预测基本方法,可以模型整个时间结构,包括过去、现在和未来。基于这个想法,我们提出了记忆预测变换器(MAT),一种记忆预测基本方法,用于线上动作检测和预测任务。此外,由于MAT的内在优势,可以在线上进行动作检测和预测任务的统一处理。我们在四个具有挑战性的参考标准(TVSeries、THUMOS'14、HDD和EPIC-Kitchens-100)上进行了MAT模型的评估,并与所有现有的方法进行比较。结果显示MAT模型在线上动作检测和预测任务上具有杰出的表现。代码可以在 获取。
The Challenge of Fetal Cardiac MRI Reconstruction Using Deep Learning
paper_authors: Denis Prokopenko, Kerstin Hammernik, Thomas Roberts, David F A Lloyd, Daniel Rueckert, Joseph V Hajnal
for: This paper explores the use of deep learning methods to improve the quality of non-gated kt-SENSE reconstruction for dynamic free-breathing fetal cardiac MRI.
methods: The authors use supervised deep learning networks to reconstruct fully-sampled data from undersampled data, and consider various model architectures and training strategies for their application in the real clinical setup used to collect the dataset.
results: The authors show that the best-performing models recover a detailed depiction of the maternal anatomy but underestimate the dynamic properties of the fetal heart, suggesting the need for more targeted training and evaluation methods for fetal heart applications.Abstract
Dynamic free-breathing fetal cardiac MRI is one of the most challenging modalities, which requires high temporal and spatial resolution to depict rapid changes in a small fetal heart. The ability of deep learning methods to recover undersampled data could help to optimise the kt-SENSE acquisition strategy and improve non-gated kt-SENSE reconstruction quality. In this work, we explore supervised deep learning networks for reconstruction of kt-SENSE style acquired data using an extensive in vivo dataset. Having access to fully-sampled low-resolution multi-coil fetal cardiac MRI, we study the performance of the networks to recover fully-sampled data from undersampled data. We consider model architectures together with training strategies taking into account their application in the real clinical setup used to collect the dataset to enable networks to recover prospectively undersampled data. We explore a set of modifications to form a baseline performance evaluation for dynamic fetal cardiac MRI on real data. We systematically evaluate the models on coil-combined data to reveal the effect of the suggested changes to the architecture in the context of fetal heart properties. We show that the best-performers recover a detailed depiction of the maternal anatomy on a large scale, but the dynamic properties of the fetal heart are under-represented. Training directly on multi-coil data improves the performance of the models, allows their prospective application to undersampled data and makes them outperform CTFNet introduced for adult cardiac cine MRI. However, these models deliver similar qualitative performances recovering the maternal body very well but underestimating the dynamic properties of fetal heart. This dynamic feature of fast change of fetal heart that is highly localised suggests both more targeted training and evaluation methods might be needed for fetal heart application.
摘要
动态自由呼吸胎儿心脏 MRI 是最具挑战性的成像模式之一,需要很高的时间和空间分辨率来刻画胎儿小心脏中的快速变化。深度学习方法恢复欠采样数据的能力,有望帮助优化 kt-SENSE 采集策略并提高非门控 kt-SENSE 重建质量。在这项工作中,我们利用大规模的在体数据集,研究了用于重建 kt-SENSE 风格采集数据的有监督深度学习网络。由于我们拥有完整采样的低分辨率多线圈胎儿心脏 MRI 数据,我们可以研究网络从欠采样数据中恢复完整采样数据的性能。我们结合采集该数据集的真实临床设置来考虑模型架构和训练策略,使网络能够前瞻性地应用于欠采样数据。我们探索了一组修改,为真实数据上的动态胎儿心脏 MRI 建立基线性能评估,并在线圈合并数据上系统地评估各模型,以揭示所建议的架构修改在胎儿心脏特性背景下的影响。我们发现,表现最好的模型能够在大尺度上细致地重建母体解剖结构,但胎儿心脏的动态特性表现不足。直接在多线圈数据上训练可以提升模型性能,使其能够前瞻性地应用于欠采样数据,并超越为成人心脏电影 MRI 提出的 CTFNet。然而,这些模型在定性上表现相近:能很好地恢复母体结构,却低估了胎儿心脏的动态特性。胎儿心脏快速变化且高度局部化的动态特征表明,针对胎儿心脏的应用可能需要更有针对性的训练和评估方法。
Advancements in Repetitive Action Counting: Joint-Based PoseRAC Model With Improved Performance
results: 这篇论文在RepCount数据集上取得了比前方法更好的结果,其中MAE为0.211,OBO准确率为0.599。实验结果显示了方法的效iveness和可靠性。Abstract
Repetitive counting (RepCount) is critical in various applications, such as fitness tracking and rehabilitation. Previous methods have relied on the estimation of red-green-and-blue (RGB) frames and body pose landmarks to identify the number of action repetitions, but these methods suffer from a number of issues, including the inability to stably handle changes in camera viewpoints, over-counting, under-counting, difficulty in distinguishing between sub-actions, inaccuracy in recognizing salient poses, etc. In this paper, based on the work done by [1], we integrate joint angles with body pose landmarks to address these challenges and achieve better results than the state-of-the-art RepCount methods, with a Mean Absolute Error (MAE) of 0.211 and an Off-By-One (OBO) counting accuracy of 0.599 on the RepCount data set [2]. Comprehensive experimental results demonstrate the effectiveness and robustness of our method.
摘要
重复动作计数(RepCount)在健身追踪和康复等多种应用中具有重要意义。先前的方法依赖 RGB 帧和人体姿态关键点来估计动作重复次数,但这些方法存在诸多问题,包括无法稳定应对摄像机视角变化、过计数、漏计数、难以区分子动作、难以准确识别显著姿态等。本文基于 [1] 的工作,将关节角度与人体姿态关键点相结合,以解决这些挑战,并取得了优于现有最先进 RepCount 方法的结果:在 RepCount 数据集 [2] 上,MAE 为 0.211,OBO 计数准确率为 0.599。全面的实验结果证明了我们方法的有效性和鲁棒性。
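A small worked example of the joint-angle feature used above: the angle at a joint B formed by landmarks A-B-C is the arccosine of the normalised dot product of BA and BC. The landmark coordinates are invented; in the pipeline they would come from a pose estimator.

    import numpy as np

    def joint_angle(a, b, c):
        """Angle (degrees) at point b formed by segments b->a and b->c."""
        v1, v2 = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    shoulder, elbow, wrist = (0.50, 0.30), (0.55, 0.45), (0.52, 0.60)
    print(round(joint_angle(shoulder, elbow, wrist), 1))   # elbow flexion angle for one frame
    # Tracking such angles over time gives a 1-D signal whose peaks and valleys mark action repetitions.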
SEDA: Self-Ensembling ViT with Defensive Distillation and Adversarial Training for robust Chest X-rays Classification
paper_authors: Raza Imam, Ibrahim Almakky, Salma Alrashdi, Baketah Alrashdi, Mohammad Yaqub
for: This paper aims to enhance the robustness of self-ensembling ViTs for the tuberculosis chest x-ray classification task, in order to improve the reliability of Deep Learning methods in medical settings.
methods: The proposed method, SEDA, utilizes efficient CNN blocks to learn spatial features with various levels of abstraction, and leverages adversarial training in combination with defensive distillation for improved robustness against adversaries.
results: Extensive experiments performed with the proposed architecture and training paradigm on a publicly available Tuberculosis x-ray dataset show that SEDA achieves state-of-the-art (SOTA) efficacy compared to SEViT, with a 70x lighter framework and enhanced robustness of +9%.Abstract
Deep Learning methods have recently seen increased adoption in medical imaging applications. However, elevated vulnerabilities have been exposed in recent Deep Learning solutions, which can hinder future adoption. In particular, the vulnerability of the Vision Transformer (ViT) to adversarial, privacy, and confidentiality attacks raises serious concerns about its reliability in medical settings. This work aims to enhance the robustness of self-ensembling ViTs for the tuberculosis chest x-ray classification task. We propose Self-Ensembling ViT with defensive Distillation and Adversarial training (SEDA). SEDA utilizes efficient CNN blocks to learn spatial features with various levels of abstraction from feature representations extracted from intermediate ViT blocks, which are largely unaffected by adversarial perturbations. Furthermore, SEDA leverages adversarial training in combination with defensive distillation for improved robustness against adversaries. Training using adversarial examples leads to better model generalizability and improves its ability to handle perturbations. Distillation using soft probabilities introduces uncertainty and variation into the output probabilities, making adversarial and privacy attacks more difficult. Extensive experiments performed with the proposed architecture and training paradigm on a publicly available Tuberculosis x-ray dataset show the SOTA efficacy of SEDA compared to SEViT in terms of computational efficiency, with a 70x lighter framework and enhanced robustness of +9%.
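As a rough illustration of the two training ingredients named above, the sketch below trains a small CNN head on stand-in intermediate ViT features with a one-step FGSM adversarial example and a defensive-distillation term. All tensor shapes, layer sizes, the FGSM attack, and the random teacher are assumptions made for this sketch and do not reproduce SEDA's actual architecture or training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNHead(nn.Module):
    """Small CNN block over intermediate ViT patch tokens reshaped into a feature map."""
    def __init__(self, dim=192, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, n_classes)

    def forward(self, tokens):                       # tokens: (B, N, dim), N a square number
        b, n, d = tokens.shape
        s = int(n ** 0.5)
        x = tokens.transpose(1, 2).reshape(b, d, s, s)
        return self.fc(self.conv(x).flatten(1))

def fgsm(x, y, model, eps=0.03):
    """One-step adversarial perturbation of the input features (illustrative only)."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

def distill_loss(student_logits, teacher_logits, T=3.0):
    """Defensive distillation: match the teacher's softened probabilities."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T

# Toy training step with random stand-ins for intermediate ViT features.
feats = torch.randn(8, 196, 192)
labels = torch.randint(0, 2, (8,))
teacher, student = CNNHead(), CNNHead()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

adv = fgsm(feats, labels, student)                   # adversarial training input
logits = student(adv)
loss = F.cross_entropy(logits, labels) + distill_loss(logits, teacher(adv).detach())
opt.zero_grad()
loss.backward()
opt.step()
```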
for: This paper focuses on improving the performance of neural implicit surface-based methods for multi-view 3D reconstruction, with a particular emphasis on reconstructing individual objects within a scene.
methods: The proposed framework, called ObjectSDF++, utilizes an occlusion-aware object opacity rendering formulation and a novel regularization term for object distinction to improve the quality of object reconstruction.
results: The extensive experiments conducted in the paper demonstrate that ObjectSDF++ produces superior object reconstruction results and significantly improves the quality of scene reconstruction compared to the previous state-of-the-art method, ObjectSDF.Abstract
In recent years, neural implicit surface reconstruction has emerged as a popular paradigm for multi-view 3D reconstruction. Unlike traditional multi-view stereo approaches, the neural implicit surface-based methods leverage neural networks to represent 3D scenes as signed distance functions (SDFs). However, they tend to disregard the reconstruction of individual objects within the scene, which limits their performance and practical applications. To address this issue, previous work ObjectSDF introduced a nice framework of object-composition neural implicit surfaces, which utilizes 2D instance masks to supervise individual object SDFs. In this paper, we propose a new framework called ObjectSDF++ to overcome the limitations of ObjectSDF. First, in contrast to ObjectSDF whose performance is primarily restricted by its converted semantic field, the core component of our model is an occlusion-aware object opacity rendering formulation that directly volume-renders object opacity to be supervised with instance masks. Second, we design a novel regularization term for object distinction, which can effectively mitigate the issue that ObjectSDF may result in unexpected reconstruction in invisible regions due to the lack of constraint to prevent collisions. Our extensive experiments demonstrate that our novel framework not only produces superior object reconstruction results but also significantly improves the quality of scene reconstruction. Code and more resources can be found in \url{https://qianyiwu.github.io/objectsdf++}
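For intuition about what an occlusion-aware object opacity can mean, the sketch below volume-renders a per-object opacity along each ray from per-object densities and supervises it with one-hot instance labels. The formulation, shapes, and random densities here are assumptions made for illustration only and are not the paper's actual renderer or loss.

```python
import torch
import torch.nn.functional as F

def render_object_opacity(densities, deltas):
    """densities: (R, S, K) per-ray, per-sample, per-object; deltas: (R, S) step sizes.
    Returns (R, K): how much each object contributes to each ray's pixel."""
    sigma = densities.sum(dim=-1)                               # total density (R, S)
    alpha = 1.0 - torch.exp(-sigma * deltas)                    # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1
    )[:, :-1]                                                   # transmittance
    weights = trans * alpha                                     # volume-rendering weights
    frac = densities / (sigma.unsqueeze(-1) + 1e-10)            # per-object density share
    return (weights.unsqueeze(-1) * frac).sum(dim=1)

R, S, K = 4, 16, 3                                              # 4 rays, 16 samples, 3 objects
raw = torch.randn(R, S, K, requires_grad=True)
opacity = render_object_opacity(F.softplus(raw), torch.full((R, S), 0.05))
mask = F.one_hot(torch.tensor([0, 1, 2, 1]), K).float()         # instance label per ray
loss = F.binary_cross_entropy(opacity.clamp(1e-6, 1 - 1e-6), mask)
loss.backward()                                                 # gradients flow back to the densities
```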
StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models
results: The framework outperforms existing state-of-the-art methods, achieving high-quality C-S disentanglement and style transfer, as well as flexible C-S disentanglement and trade-off control.Abstract
Content and style (C-S) disentanglement is a fundamental problem and critical challenge of style transfer. Existing approaches based on explicit definitions (e.g., Gram matrix) or implicit learning (e.g., GANs) are neither interpretable nor easy to control, resulting in entangled representations and less satisfying results. In this paper, we propose a new C-S disentangled framework for style transfer without using previous assumptions. The key insight is to explicitly extract the content information and implicitly learn the complementary style information, yielding interpretable and controllable C-S disentanglement and style transfer. A simple yet effective CLIP-based style disentanglement loss coordinated with a style reconstruction prior is introduced to disentangle C-S in the CLIP image space. By further leveraging the powerful style removal and generative ability of diffusion models, our framework achieves superior results compared to the state of the art, as well as flexible C-S disentanglement and trade-off control. Our work provides new insights into the C-S disentanglement in style transfer and demonstrates the potential of diffusion models for learning well-disentangled C-S characteristics.
results: The studies find that cooperative dialog between a human and a cognitive system increases human cognitive accuracy and precision, and that different types of information help improve human creativity and problem-solving ability.Abstract
Whenever humans use tools, human performance is enhanced. Cognitive systems are a new kind of tool, continually increasing in cognitive capability and now performing high-level cognitive tasks previously thought to be exclusively human. Usage of such tools, known as cogs, is expected to result in ever increasing levels of human cognitive augmentation. In a human-cog ensemble, a cooperative, peer-to-peer, and collaborative dialog between a human and a cognitive system, human cognitive capability is augmented as a result of the interaction. The human-cog ensemble is therefore able to achieve more than either the human or the cog working alone. This article presents results from two studies designed to measure the effect information supplied by a cog has on cognitive accuracy, the ability to produce the correct result, and cognitive precision, the propensity to produce only the correct result. Both cognitive accuracy and cognitive precision are shown to be increased by information of different types (policies and rules, examples, and suggestions) and with different kinds of problems (inventive problem solving and puzzles). Similar effects shown in other studies are compared.
Explainable AI for clinical risk prediction: a survey of concepts, methods, and modalities
results: The paper surveys recent research and development on explainability, including the application and evaluation of various explainability methods and their outcomes in practical use.Abstract
Recent advancements in AI applications to healthcare have shown incredible promise in surpassing human performance in diagnosis and disease prognosis. With the increasing complexity of AI models, however, concerns have grown regarding their opacity, potential biases, and the need for interpretability. To ensure trust and reliability in AI systems, especially in clinical risk prediction models, explainability becomes crucial. Explainability is usually referred to as an AI system's ability to provide a robust interpretation of its decision-making logic or the decisions themselves to human stakeholders. In clinical risk prediction, other aspects of explainability like fairness, bias, trust, and transparency also represent important concepts beyond just interpretability. In this review, we address the relationship between these concepts as they are often used together or interchangeably. This review also discusses recent progress in developing explainable models for clinical risk prediction, highlighting the importance of quantitative and clinical evaluation and validation across multiple common modalities in clinical practice. It emphasizes the need for external validation and the combination of diverse interpretability methods to enhance trust and fairness. Adopting rigorous testing, such as using synthetic datasets with known generative factors, can further improve the reliability of explainability methods. Open access and code-sharing resources are essential for transparency and reproducibility, enabling the growth and trustworthiness of explainable research. While challenges exist, an end-to-end approach to explainability in clinical risk prediction, incorporating stakeholders from clinicians to developers, is essential for success.
PDPK: A Framework to Synthesise Process Data and Corresponding Procedural Knowledge for Manufacturing
results: Evaluating the generated data, the paper finds that some established embedding methods show potential for representing procedural knowledge. The paper also provides an open-source framework and evaluation code so that future work can be compared more easily.Abstract
Procedural knowledge describes how to accomplish tasks and mitigate problems. Such knowledge is commonly held by domain experts, e.g. operators in manufacturing who adjust parameters to achieve quality targets. To the best of our knowledge, no real-world datasets containing process data and corresponding procedural knowledge are publicly available, possibly due to corporate apprehensions regarding the loss of knowledge advances. Therefore, we provide a framework to generate synthetic datasets that can be adapted to different domains. The design choices are inspired by two real-world datasets of procedural knowledge we have access to. Apart from containing representations of procedural knowledge in Resource Description Framework (RDF)-compliant knowledge graphs, the framework simulates parametrisation processes and provides consistent process data. We compare established embedding methods on the resulting knowledge graphs, detailing which out-of-the-box methods have the potential to represent procedural knowledge. This provides a baseline which can be used to increase the comparability of future work. Furthermore, we validate the overall characteristics of a synthesised dataset by comparing the results to those achievable on a real-world dataset. The framework and evaluation code, as well as the dataset used in the evaluation, are available open source.
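To make the RDF-compliant representation concrete, here is a minimal sketch that encodes one procedural-knowledge entry (an operator lowering a parameter in response to a quality deviation) as triples with rdflib. The namespace, property names, and values are invented for illustration and do not reflect the schema actually used by the framework.

```python
from rdflib import Graph, Namespace, Literal, RDF
from rdflib.namespace import XSD

EX = Namespace("http://example.org/pdpk/")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# "If the melt viscosity is too high, decrease the extruder temperature."
proc = EX["procedure_001"]
g.add((proc, RDF.type, EX.Procedure))
g.add((proc, EX.observesDeviation, EX.HighViscosity))
g.add((proc, EX.adjustsParameter, EX.ExtruderTemperature))
g.add((proc, EX.adjustmentDirection, Literal("decrease")))
g.add((proc, EX.targetsQuality, EX.SurfaceFinish))
g.add((EX.ExtruderTemperature, EX.hasSetpoint, Literal(210.0, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```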
Agglomerative Transformer for Human-Object Interaction Detection
results: Achieves a new state-of-the-art performance of 36.75 mAP on HICO-Det and, with a single-stage and end-to-end design, reduces GFLOPs by 8.5% and improves FPS by 36%.Abstract
We propose an agglomerative Transformer (AGER) that enables Transformer-based human-object interaction (HOI) detectors to flexibly exploit extra instance-level cues in a single-stage and end-to-end manner for the first time. AGER acquires instance tokens by dynamically clustering patch tokens and aligning cluster centers to instances with textual guidance, thus enjoying two benefits: 1) Integrality: each instance token is encouraged to contain all discriminative feature regions of an instance, which demonstrates a significant improvement in the extraction of different instance-level cues and subsequently leads to a new state-of-the-art performance of HOI detection with 36.75 mAP on HICO-Det. 2) Efficiency: the dynamical clustering mechanism allows AGER to generate instance tokens jointly with the feature learning of the Transformer encoder, eliminating the need of an additional object detector or instance decoder in prior methods, thus allowing the extraction of desirable extra cues for HOI detection in a single-stage and end-to-end pipeline. Concretely, AGER reduces GFLOPs by 8.5% and improves FPS by 36%, even compared to a vanilla DETR-like pipeline without extra cue extraction.
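The dynamic clustering step can be approximated by a soft assignment of patch tokens to a set of centres that are then pooled into instance tokens. The sketch below is a simplified stand-in; the shapes, temperature, and the random "text-guided" queries are assumptions and not AGER's actual mechanism.

```python
import torch
import torch.nn.functional as F

def agglomerate(patch_tokens, centers, temperature=0.1):
    """Softly cluster patch tokens (B, N, D) around centers (B, M, D) -> instance tokens (B, M, D)."""
    sim = torch.einsum("bnd,bmd->bnm",
                       F.normalize(patch_tokens, dim=-1),
                       F.normalize(centers, dim=-1)) / temperature
    assign = sim.softmax(dim=-1)                         # each patch distributes over M clusters
    weights = assign / (assign.sum(dim=1, keepdim=True) + 1e-6)
    return torch.einsum("bnm,bnd->bmd", weights, patch_tokens)

tokens = torch.randn(2, 196, 256)     # patch tokens from the Transformer encoder
queries = torch.randn(2, 10, 256)     # stand-ins for text-guided cluster centers
instances = agglomerate(tokens, queries)
print(instances.shape)                # torch.Size([2, 10, 256])
```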
Is Meta-Learning the Right Approach for the Cold-Start Problem in Recommender Systems?
paper_authors: Davide Buffelli, Ashish Gupta, Agnieszka Strzalka, Vassilis Plachouras
for: This paper aims to address the cold-start problem in deep learning models for recommender systems, specifically exploring the use of standard and widely adopted deep learning models and a simple modular approach for achieving similar or higher performance without using meta-learning techniques.
methods: The paper employs a variety of deep learning models, including standard and widely adopted models, and compares their performance with meta-learning models specifically designed for the cold-start setting. The authors also propose a simple modular approach using common representation learning techniques.
results: The authors show that, when tuned correctly, standard and widely adopted deep learning models perform just as well as newer meta-learning models on commonly used benchmarks for the cold-start problem. Additionally, the simple modular approach using common representation learning techniques performs comparably to meta-learning techniques specifically designed for the cold-start setting while being much more easily deployable in real-world applications.Abstract
Recommender systems have become fundamental building blocks of modern online products and services, and have a substantial impact on user experience. In the past few years, deep learning methods have attracted a lot of research, and are now heavily used in modern real-world recommender systems. Nevertheless, dealing with recommendations in the cold-start setting, e.g., when a user has had only limited interactions with the system, is a problem that remains far from solved. Meta-learning techniques, and in particular optimization-based meta-learning, have recently become the most popular approaches in the academic research literature for tackling the cold-start problem in deep learning models for recommender systems. However, current meta-learning approaches are not practical for real-world recommender systems, which have billions of users and items, and strict latency requirements. In this paper we show that it is possible to obtain similar, or higher, performance on commonly used benchmarks for the cold-start problem without using meta-learning techniques. In more detail, we show that, when tuned correctly, standard and widely adopted deep learning models perform just as well as newer meta-learning models. We further show that an extremely simple modular approach using common representation learning techniques, can perform comparably to meta-learning techniques specifically designed for the cold-start setting while being much more easily deployable in real-world applications.
Graph Out-of-Distribution Generalization with Controllable Data Augmentation
results: Extensive studies on several real-world datasets demonstrate that the method has a significant advantage over baseline models.Abstract
Graph Neural Network (GNN) has demonstrated extraordinary performance in classifying graph properties. However, due to the selection bias of training and testing data (e.g., training on small graphs and testing on large graphs, or training on dense graphs and testing on sparse graphs), distribution deviation is widespread. More importantly, we often observe \emph{hybrid structure distribution shift} of both scale and density, despite the one-sided biased data partition. The spurious correlations over hybrid distribution deviation degrade the performance of previous GNN methods and show large instability among different datasets. To alleviate this problem, we propose \texttt{OOD-GMixup} to jointly manipulate the training distribution with \emph{controllable data augmentation} in metric space. Specifically, we first extract the graph rationales to eliminate the spurious correlations due to irrelevant information. Secondly, we generate virtual samples with perturbation on graph rationale representation domain to obtain potential OOD training samples. Finally, we propose OOD calibration to measure the distribution deviation of virtual samples by leveraging Extreme Value Theory, and further actively control the training distribution by emphasizing the impact of virtual OOD samples. Extensive studies on several real-world datasets on graph classification demonstrate the superiority of our proposed method over state-of-the-art baselines.
Learning Logic Programs by Discovering Higher-Order Abstractions
paper_authors: Céline Hocquette, Sebastijan Dumančić, Andrew Cropper
for: This work aims to discover higher-order abstractions, which is important for achieving human-level AI.
methods: The approach builds on inductive logic programming, which induces logic programs from examples and background knowledge.
results: Experiments on multiple domains show that STEVIE can improve predictive accuracies by 27% and reduce learning times by 47%. Moreover, STEVIE can discover abstractions that transfer across domains.Abstract
Discovering novel abstractions is important for human-level AI. We introduce an approach to discover higher-order abstractions, such as map, filter, and fold. We focus on inductive logic programming, which induces logic programs from examples and background knowledge. We introduce the higher-order refactoring problem, where the goal is to compress a logic program by introducing higher-order abstractions. We implement our approach in STEVIE, which formulates the higher-order refactoring problem as a constraint optimisation problem. Our experimental results on multiple domains, including program synthesis and visual reasoning, show that, compared to no refactoring, STEVIE can improve predictive accuracies by 27% and reduce learning times by 47%. We also show that STEVIE can discover abstractions that transfer to different domains.
A Framework for Data-Driven Explainability in Mathematical Optimization
paper_authors: Kevin-Martin Aigner, Marc Goerigk, Michael Hartisch, Frauke Liers, Arthur Miehlich
for: Advances in mathematical programming have made it possible to efficiently solve many large-scale real-world problems that were previously considered intractable.
methods: Introduces explainability of a solution as an additional evaluation criterion, addressing the perception of optimization software as a black box.
results: Proposes explainability as a new evaluation criterion and proves that already in simple cases the explainable model is NP-hard. For the shortest-path problem, an explainable model is presented, and numerical experiments show that the cost of enforcing explainability can be very small.Abstract
Advancements in mathematical programming have made it possible to efficiently tackle large-scale real-world problems that were deemed intractable just a few decades ago. However, provably optimal solutions may not be accepted due to the perception of optimization software as a black box. Although well understood by scientists, this understanding is not easily accessible to practitioners. Hence, we advocate for introducing the explainability of a solution as another evaluation criterion, next to its objective value, which enables us to find trade-off solutions between these two criteria. Explainability is attained by comparing against (not necessarily optimal) solutions that were implemented in similar situations in the past. Thus, solutions are preferred that exhibit similar features. Although we prove that already in simple cases the explainable model is NP-hard, we characterize relevant polynomially solvable cases such as the explainable shortest-path problem. Our numerical experiments on both artificial as well as real-world road networks show the resulting Pareto front. It turns out that the cost of enforcing explainability can be very small.
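As a small illustration of trading off objective value against explainability, the sketch below computes shortest paths on a toy graph while penalising edges that deviate from a route used in a similar past situation, and sweeps the trade-off parameter to trace a Pareto-style front. The graph, the past route, and the deviation penalty are assumptions made for illustration, not the paper's exact model.

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 2.0), ("b", "c", 2.0), ("c", "d", 2.0),   # cheapest route, cost 6
    ("a", "e", 1.0), ("e", "f", 1.0), ("f", "d", 5.0),   # historical route, cost 7
])

# A (not necessarily optimal) route implemented in a similar situation in the past.
past_edges = {frozenset(p) for p in [("a", "e"), ("e", "f"), ("f", "d")]}

def trade_off_path(lam):
    """Shortest path under: travel cost + lam * (number of edges deviating from the past route)."""
    def w(u, v, d):
        return d["weight"] + lam * (frozenset((u, v)) not in past_edges)
    return nx.shortest_path(G, "a", "d", weight=w)

for lam in [0.0, 0.2, 1.0]:            # sweep the explainability weight
    path = trade_off_path(lam)
    cost = sum(G[u][v]["weight"] for u, v in zip(path, path[1:]))
    print(f"lambda={lam}: path={path}, travel cost={cost}")
```

With a small penalty the cost-optimal route is kept; once the penalty outweighs the extra travel cost, the familiar historical route is preferred, illustrating how cheap explainability can be on a toy instance.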
Integrating cognitive map learning and active inference for planning in ambiguous environments
results: The study finds that the active inference agent plans more effectively in challenging scenarios, and its advantage is most pronounced when sensory observations provide ambiguous information about location.Abstract
Living organisms need to acquire both cognitive maps for learning the structure of the world and planning mechanisms able to deal with the challenges of navigating ambiguous environments. Although significant progress has been made in each of these areas independently, the best way to integrate them is an open research question. In this paper, we propose the integration of a statistical model of cognitive map formation within an active inference agent that supports planning under uncertainty. Specifically, we examine the clone-structured cognitive graph (CSCG) model of cognitive map formation and compare a naive clone graph agent with an active inference-driven clone graph agent, in three spatial navigation scenarios. Our findings demonstrate that while both agents are effective in simple scenarios, the active inference agent is more effective when planning in challenging scenarios, in which sensory observations provide ambiguous information about location.
results: The paper proves that the RoBOS algorithm guarantees sublinear lenient regret under certain assumptions, and introduces a weaker notion of regret, called robust satisficing regret, under which the algorithm achieves a sublinear upper bound independent of the amount of distribution shift.Abstract
Distributional shifts pose a significant challenge to achieving robustness in contemporary machine learning. To overcome this challenge, robust satisficing (RS) seeks a robust solution to an unspecified distributional shift while achieving a utility above a desired threshold. This paper focuses on the problem of RS in contextual Bayesian optimization when there is a discrepancy between the true and reference distributions of the context. We propose a novel robust Bayesian satisficing algorithm called RoBOS for noisy black-box optimization. Our algorithm guarantees sublinear lenient regret under certain assumptions on the amount of distribution shift. In addition, we define a weaker notion of regret called robust satisficing regret, in which our algorithm achieves a sublinear upper bound independent of the amount of distribution shift. To demonstrate the effectiveness of our method, we apply it to various learning problems and compare it to other approaches, such as distributionally robust optimization.
It Ain’t That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models
paper_authors: Xingcheng Xu, Zihao Pan, Haipeng Zhang, Yanqing Yang
for: investigate the generalization ability of Generative Transformer-based models
methods: using n-digit addition and multiplication tasks to study the models’ generalization behaviors
results: the models show successful ID generalization but poor OOD generalization; this is traced to the models' learned algebraic structures, which map OOD inputs to equivalent outputs in the ID domain.Abstract
Generative Transformer-based models have achieved remarkable proficiency on solving diverse problems. However, their generalization ability is not fully understood and not always satisfying. Researchers take basic mathematical tasks like n-digit addition or multiplication as important perspectives for investigating their generalization behaviors. Curiously, it is observed that when training on n-digit operations (e.g., additions) in which both input operands are n-digit in length, models generalize successfully on unseen n-digit inputs (in-distribution (ID) generalization), but fail miserably and mysteriously on longer, unseen cases (out-of-distribution (OOD) generalization). Studies try to bridge this gap with workarounds such as modifying position embedding, fine-tuning, and priming with more extensive or instructive data. However, without addressing the essential mechanism, there is hardly any guarantee regarding the robustness of these solutions. We bring this unexplained performance drop into attention and ask whether it is purely from random errors. Here we turn to the mechanistic line of research which has notable successes in model interpretability. We discover that the strong ID generalization stems from structured representations, while behind the unsatisfying OOD performance, the models still exhibit clear learned algebraic structures. Specifically, these models map unseen OOD inputs to outputs with equivalence relations in the ID domain. These highlight the potential of the models to carry useful information for improved generalization.
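The in-distribution versus out-of-distribution setup described above is easy to reproduce on synthetic data. The sketch below generates n-digit addition examples for training and longer, unseen-length examples for OOD evaluation; the formatting, sizes, and the specific equivalence relation probed at the end are arbitrary choices for illustration, not the paper's exact protocol.

```python
import random

def addition_examples(n_digits, k, seed=0):
    """k addition problems whose two operands are exactly n_digits long."""
    rng = random.Random(seed)
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    pairs = [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(k)]
    return [f"{a}+{b}={a + b}" for a, b in pairs]

train = addition_examples(n_digits=3, k=3)          # ID: both operands have 3 digits
ood   = addition_examples(n_digits=5, k=3, seed=1)  # OOD: longer, unseen operand length
print(train)
print(ood)

# One candidate equivalence relation to probe (an assumption, not necessarily the relation
# the paper identifies): does a model's OOD answer match the ID problem obtained by keeping
# only the lowest 3 digits of each operand?
a, b = 98765, 43210
print(f"{a % 1000}+{b % 1000}={a % 1000 + b % 1000}")
```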
Description Logics Go Second-Order – Extending EL with Universally Quantified Concepts
results: The authors show that, for a useful fragment of the extension of the description logic $\mathcal{EL}$, classical $\mathcal{EL}$ reasoning algorithms can be used even under the second-order semantics. They further show that a slightly smaller, but still useful, fragment of the extension is polynomially decidable; this fragment can express a generalized form of role chain axioms, positive self restrictions, and some forms of (local) role-value-maps from KL-ONE without requiring any additional constructors.Abstract
The study of Description Logics has historically been focused mostly on features that can be translated to decidable fragments of first-order logic. In this paper, we leave this restriction behind and look for useful and decidable extensions outside first-order logic. We introduce universally quantified concepts, which take the form of variables that can be replaced with arbitrary concepts, and define two semantics of this extension. A schema semantics allows replacements of concept variables only by concepts from a particular language, giving us axiom schemata similar to modal logics. A second-order semantics allows replacement of concept variables with arbitrary subsets of the domain, which is similar to quantified predicates in second-order logic. To study the proposed semantics, we focus on the extension of the description logic $\mathcal{EL}$. We show that for a useful fragment of the extension, the conclusions entailed by the different semantics coincide, allowing us to use classical $\mathcal{EL}$ reasoning algorithms even for the second-order semantics. For a slightly smaller, but still useful, fragment, we were also able to show polynomial decidability of the extension. This fragment, in particular, can express a generalized form of role chain axioms, positive self restrictions, and some forms of (local) role-value-maps from KL-ONE, without requiring any additional constructors.
TEST: Text Prototype Aligned Embedding to Activate LLM’s Ability for Time Series
results: Experiments show that the TEST method enables an LLM to process TS data; although it cannot outperform existing models customized for TS tasks, it endows the LLM with the ability to handle TS data without compromising its language ability.Abstract
This work summarizes two strategies for completing time-series (TS) tasks using today's language models (LLMs): LLM-for-TS, design and train a fundamental large model for TS data; TS-for-LLM, enable the pre-trained LLM to handle TS data. Considering the insufficient data accumulation, limited resources, and semantic context requirements, this work focuses on TS-for-LLM methods, where we aim to activate the LLM's ability for TS data by designing a TS embedding method suitable for LLMs. The proposed method is named TEST. It first tokenizes TS, builds an encoder to embed them by instance-wise, feature-wise, and text-prototype-aligned contrast, then creates prompts to make the LLM more open to the embeddings, and finally implements TS tasks. Experiments are carried out on TS classification and forecasting tasks using 8 LLMs with different structures and sizes. Although its results cannot significantly outperform the current SOTA models customized for TS tasks, by treating the LLM as a pattern machine, it can endow the LLM with the ability to process TS data without compromising its language ability. This paper is intended to serve as a foundational work that will inspire further research.
Challenges and Opportunities of Using Transformer-Based Multi-Task Learning in NLP Through ML Lifecycle: A Survey
results: The survey systematically analyses how MTL methods fit into the ML lifecycle phases in NLP and points out the challenges and opportunities of MTL approaches in these phases. It also motivates research on combining MTL with continual learning (CL) to address model training and deployment issues in real-world applications.Abstract
The increasing adoption of natural language processing (NLP) models across industries has led to practitioners' need for machine learning systems to handle these models efficiently, from training to serving them in production. However, training, deploying, and updating multiple models can be complex, costly, and time-consuming, mainly when using transformer-based pre-trained language models. Multi-Task Learning (MTL) has emerged as a promising approach to improve efficiency and performance through joint training, rather than training separate models. Motivated by this, we first provide an overview of transformer-based MTL approaches in NLP. Then, we discuss the challenges and opportunities of using MTL approaches throughout typical ML lifecycle phases, specifically focusing on the challenges related to data engineering, model development, deployment, and monitoring phases. This survey focuses on transformer-based MTL architectures and, to the best of our knowledge, is novel in that it systematically analyses how transformer-based MTL in NLP fits into ML lifecycle phases. Furthermore, we motivate research on the connection between MTL and continual learning (CL), as this area remains unexplored. We believe it would be practical to have a model that can handle both MTL and CL, as this would make it easier to periodically re-train the model, update it due to distribution shifts, and add new capabilities to meet real-world requirements.
Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models
paper_authors: Zhenhua Wang, Wei Xie, Kai Chen, Baosheng Wang, Zhiwen Gui, Enze Wang
for: This paper investigates the LLM "jailbreak" problem and proposes an automatic jailbreak method for the first time.
methods: The paper introduces the concept of a semantic firewall with three technical implementation approaches, and presents a "self-deception" attack that can bypass the semantic firewall.
results: Experimental results show that the proposed attack can successfully jailbreak the GPT-3.5-Turbo and GPT-4 models, with success rates of 86.2% and 67% and failure rates of 4.7% and 2.2%, respectively.Abstract
Large language models (LLMs), such as ChatGPT, have emerged with astonishing capabilities approaching artificial general intelligence. While providing convenience for various societal needs, LLMs have also lowered the cost of generating harmful content. Consequently, LLM developers have deployed semantic-level defenses to recognize and reject prompts that may lead to inappropriate content. Unfortunately, these defenses are not foolproof, and some attackers have crafted "jailbreak" prompts that temporarily hypnotize the LLM into forgetting content defense rules and answering any improper questions. To date, there is no clear explanation of the principles behind these semantic-level attacks and defenses in both industry and academia. This paper investigates the LLM jailbreak problem and proposes an automatic jailbreak method for the first time. We propose the concept of a semantic firewall and provide three technical implementation approaches. Inspired by the attack that penetrates traditional firewalls through reverse tunnels, we introduce a "self-deception" attack that can bypass the semantic firewall by inducing LLM to generate prompts that facilitate jailbreak. We generated a total of 2,520 attack payloads in six languages (English, Russian, French, Spanish, Chinese, and Arabic) across seven virtual scenarios, targeting the three most common types of violations: violence, hate, and pornography. The experiment was conducted on two models, namely the GPT-3.5-Turbo and GPT-4. The success rates on the two models were 86.2% and 67%, while the failure rates were 4.7% and 2.2%, respectively. This highlighted the effectiveness of the proposed attack method. All experimental code and raw data will be released as open-source to inspire future research. We believe that manipulating AI behavior through carefully crafted prompts will become an important research direction in the future.
In situ Fault Diagnosis of Indium Tin Oxide Electrodes by Processing S-Parameter Patterns
results: The researchers construct a comprehensive S-parameter pattern database according to defect states and use deep learning (DL) approaches, including a multilayer perceptron (MLP), a convolutional neural network (CNN), and a transformer, to simultaneously analyze the cause and severity of defects.Abstract
In the field of optoelectronics, indium tin oxide (ITO) electrodes play a crucial role in various applications, such as displays, sensors, and solar cells. Effective fault detection and diagnosis of the ITO electrodes are essential to ensure the performance and reliability of the devices. However, traditional visual inspection is challenging with transparent ITO electrodes, and existing fault detection methods have limitations in determining the root causes of the defects, often requiring destructive evaluations. In this study, an in situ fault diagnosis method is proposed using scattering parameter (S-parameter) signal processing, offering early detection, high diagnostic accuracy, noise robustness, and root cause analysis. A comprehensive S-parameter pattern database is obtained according to defect states. Deep learning (DL) approaches, including multilayer perceptron (MLP), convolutional neural network (CNN), and transformer, are then used to simultaneously analyze the cause and severity of defects. Notably, it is demonstrated that the diagnostic performance under additive noise levels can be significantly enhanced by combining different channels of the S-parameters as input to the learning algorithms, as confirmed through the t-distributed stochastic neighbor embedding (t-SNE) dimension reduction visualization.
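The reported benefit of combining S-parameter channels can be pictured as stacking several channels (for example, magnitude and phase of different S-parameters) into a multi-channel signal over frequency and feeding it to a small classifier. The channel choice, network size, class count, and noise level below are invented for illustration and are not the models used in the study.

```python
import torch
import torch.nn as nn

n_channels, n_freq, n_classes = 4, 201, 5   # e.g. |S11|, arg(S11), |S21|, arg(S21); 5 defect states

model = nn.Sequential(
    nn.Conv1d(n_channels, 16, kernel_size=7, padding=3), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, n_classes),
)

s_params = torch.randn(8, n_channels, n_freq)          # a batch of S-parameter patterns
noisy = s_params + 0.05 * torch.randn_like(s_params)   # additive measurement noise
print(model(noisy).shape)                              # torch.Size([8, 5])
```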
Explainable Multi-View Deep Networks Methodology for Experimental Physics
results: Experimental results show that the multi-view architectures improve classification accuracy and provide more explainability. Specifically, in High Energy Density Physics experiments, the multi-view models extract more information from the multiple imaging representations of foam samples, improving the accuracy of sample-quality assessment.Abstract
Physical experiments often involve multiple imaging representations, such as X-ray scans and microscopic images. Deep learning models have been widely used for supervised analysis in these experiments. Combining different image representations is frequently required to analyze and make a decision properly. Consequently, multi-view data has emerged - datasets where each sample is described by views from different angles, sources, or modalities. These problems are addressed with the concept of multi-view learning. Understanding the decision-making process of deep learning models is essential for reliable and credible analysis. Hence, many explainability methods have been devised recently. Nonetheless, there is a lack of proper explainability in multi-view models, which are challenging to explain due to their architectures. In this paper, we suggest different multi-view architectures for the vision domain, each suited to another problem, and we also present a methodology for explaining these models. To demonstrate the effectiveness of our methodology, we focus on the domain of High Energy Density Physics (HEDP) experiments, where multiple imaging representations are used to assess the quality of foam samples. We apply our methodology to classify the foam samples quality using the suggested multi-view architectures. Through experimental results, we showcase the improvement of accurate architecture choice on both accuracy - 78% to 84% and AUC - 83% to 93% and present a trade-off between performance and explainability. Specifically, we demonstrate that our approach enables the explanation of individual one-view models, providing insights into the decision-making process of each view. This understanding enhances the interpretability of the overall multi-view model. The sources of this work are available at: https://github.com/Scientific-Computing-Lab-NRCN/Multi-View-Explainability.
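A minimal instance of the multi-view idea is a late-fusion classifier with one branch per imaging representation, which also makes it possible to inspect and explain each one-view branch separately. The sketch below assumes two single-channel views and a binary quality label; these choices are illustrative and not the architectures proposed in the paper.

```python
import torch
import torch.nn as nn

class TwoViewClassifier(nn.Module):
    """Late fusion: one small CNN branch per view, concatenated features, shared head."""
    def __init__(self, n_classes=2):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
            )
        self.xray, self.micro = branch(), branch()
        self.head = nn.Linear(32, n_classes)

    def forward(self, xray_img, micro_img):
        return self.head(torch.cat([self.xray(xray_img), self.micro(micro_img)], dim=1))

model = TwoViewClassifier()
logits = model(torch.randn(4, 1, 64, 64), torch.randn(4, 1, 64, 64))
print(logits.shape)   # torch.Size([4, 2])
```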
Towards Ontology-Mediated Planning with OWL DL Ontologies (Extended Version)
results: A first experimental evaluation shows the potential and limitations of the approach on comparatively small domains. The approach also allows planning experts to use existing planning tools to solve ontology-mediated planning problems.Abstract
While classical planning languages make the closed-domain and closed-world assumption, there have been various approaches to extend those with DL reasoning, which is then interpreted under the usual open-world semantics. Current approaches for planning with DL ontologies integrate the DL directly into the planning language, and practical approaches have been developed based on first-order rewritings or rewritings into datalog. We present here a new approach in which the planning specification and ontology are kept separate, and are linked together using an interface. This allows planning experts to work in a familiar formalism, while existing ontologies can be easily integrated and extended by ontology experts. Our approach for planning with those ontology-mediated planning problems is optimized for cases with comparatively small domains, and supports the whole OWL DL fragment. The idea is to rewrite the ontology-mediated planning problem into a classical planning problem to be processed by existing planning tools. Different to other approaches, our rewriting is data-dependent. A first experimental evaluation of our approach shows the potential and limitations of this approach.
Modelling the Spread of COVID-19 in Indoor Spaces using Automated Probabilistic Planning
for: This paper aims to provide a novel approach for modeling the spread of COVID-19 in indoor spaces using probabilistic planning and dynamic graph analysis, and to evaluate the effectiveness of different mitigation strategies.
methods: The authors use a probabilistic planning framework to model the spread of COVID-19 in shared spaces, and endow the planner with means to control the spread of the disease through non-pharmaceutical interventions (NPIs) such as mandating masks and vaccines. They also compare the impact of crowds and capacity limits on the spread of COVID-19 in these settings.
results: The authors demonstrate that the use of probabilistic planning is effective in predicting the amount of infections that are likely to occur in shared spaces, and that automated planners have the potential to design competent interventions to limit the spread of the disease.Abstract
The coronavirus disease 2019 (COVID-19) pandemic has been ongoing for around 3 years, and has infected over 750 million people and caused over 6 million deaths worldwide at the time of writing. Throughout the pandemic, several strategies for controlling the spread of the disease have been debated by healthcare professionals, government authorities, and international bodies. To anticipate the potential impact of the disease, and to simulate the effectiveness of different mitigation strategies, a robust model of disease spread is needed. In this work, we explore a novel approach based on probabilistic planning and dynamic graph analysis to model the spread of COVID-19 in indoor spaces. We endow the planner with means to control the spread of the disease through non-pharmaceutical interventions (NPIs) such as mandating masks and vaccines, and we compare the impact of crowds and capacity limits on the spread of COVID-19 in these settings. We demonstrate that the use of probabilistic planning is effective in predicting the amount of infections that are likely to occur in shared spaces, and that automated planners have the potential to design competent interventions to limit the spread of the disease. Our code is fully open-source and is available at: https://github.com/mharmanani/prob-planning-covid19 .
results: Using various state-of-the-art counterfactual generators and several benchmark datasets, we generate large numbers of counterfactuals and find that the induced domain and model shifts are substantial and may impede the applicability of Algorithmic Recourse. We also propose strategies to mitigate these concerns. Our framework for generating and studying counterfactuals is fast and open-source.Abstract
Existing work on Counterfactual Explanations (CE) and Algorithmic Recourse (AR) has largely focused on single individuals in a static environment: given some estimated model, the goal is to find valid counterfactuals for an individual instance that fulfill various desiderata. The ability of such counterfactuals to handle dynamics like data and model drift remains a largely unexplored research challenge. There has also been surprisingly little work on the related question of how the actual implementation of recourse by one individual may affect other individuals. Through this work, we aim to close that gap. We first show that many of the existing methodologies can be collectively described by a generalized framework. We then argue that the existing framework does not account for a hidden external cost of recourse, that only reveals itself when studying the endogenous dynamics of recourse at the group level. Through simulation experiments involving various state-of-the-art counterfactual generators and several benchmark datasets, we generate large numbers of counterfactuals and study the resulting domain and model shifts. We find that the induced shifts are substantial enough to likely impede the applicability of Algorithmic Recourse in some situations. Fortunately, we find various strategies to mitigate these concerns. Our simulation framework for studying recourse dynamics is fast and open-sourced.
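For intuition, a counterfactual for a linear score model has a closed form: the smallest change that moves an instance across the decision boundary. The three-feature "risk model" below is invented, and the sketch is a generic illustration of counterfactual generation rather than any of the generators studied in the paper.

```python
import numpy as np

# Toy linear model: score = w.x + b, with score > 0 meaning e.g. "loan rejected".
w, b = np.array([0.8, -0.5, 1.2]), -0.3
x = np.array([1.0, 0.2, 0.6])                     # a rejected instance (hypothetical features)

def closest_counterfactual(x, w, b, margin=1e-3):
    """Smallest L2 change that pushes the linear score just past the decision boundary."""
    score = w @ x + b
    return x - (score + np.sign(score) * margin) * w / (w @ w)

x_cf = closest_counterfactual(x, w, b)
print(w @ x + b, w @ x_cf + b)                    # positive before, (just) negative after
print(x_cf - x)                                   # the recommended feature changes
```

When many individuals implement changes of this kind and the model is later retrained on the shifted data, the endogenous domain and model shifts studied in the paper arise.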
Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System
results: Achieves a 6.7% improvement in non-empty joint goal accuracy over existing strong baselines.Abstract
End-to-end task-oriented dialogue (TOD) systems have achieved promising performance by leveraging sophisticated natural language understanding and natural language generation capabilities of pre-trained models. This work enables the TOD systems with more flexibility through a simple cache. The cache provides the flexibility to dynamically update the TOD systems and handle both existing and unseen dialogue scenarios. Towards this end, we first fine-tune a retrieval module to effectively retrieve the most relevant information entries from the cache. We then train end-to-end TOD models that can refer to and ground on both dialogue history and retrieved information during TOD generation. The cache is straightforward to construct, and the backbone models of TOD systems are compatible with existing pre-trained generative models. Extensive experiments demonstrate the superior performance of our framework, with a notable improvement in non-empty joint goal accuracy by 6.7% compared to strong baselines.
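A minimal sketch of the cache-plus-retrieval idea is shown below: TF-IDF similarity selects the most relevant cache entry, which is appended to the dialogue context before it reaches the TOD model. The cache entries and formatting are invented, and TF-IDF is only a stand-in; the paper fine-tunes a dedicated retrieval module instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical cache of information entries the TOD system can ground on.
cache = [
    "restaurant: Golden Wok | area: centre | food: chinese | price: cheap",
    "restaurant: La Tasca | area: north | food: spanish | price: moderate",
    "hotel: City Stay | area: centre | parking: yes | stars: 4",
]
vectorizer = TfidfVectorizer().fit(cache)
cache_vecs = vectorizer.transform(cache)

def retrieve(dialogue_history, k=1):
    """Return the top-k cache entries most relevant to the dialogue history."""
    sims = cosine_similarity(vectorizer.transform([dialogue_history]), cache_vecs)[0]
    return [cache[i] for i in sims.argsort()[::-1][:k]]

history = "user: I want a cheap chinese restaurant in the centre"
model_input = history + " | retrieved: " + " ; ".join(retrieve(history))
print(model_input)   # grounded input passed to the end-to-end TOD model
```

Because the cache can be edited at any time, new entities or scenarios can be handled by adding entries, without retraining the dialogue model.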
摘要
End-to-end任务导向对话(TOD)系统通过利用预训练模型的自然语言理解和自然语言生成能力,已经取得了可观的成绩。这项工作通过一个简单的缓存,使TOD系统更加灵活。缓存使TOD系统可以动态更新,并能处理现有和未见过的对话场景。为此,我们首先微调检索模块,以便有效地从缓存中检索最相关的信息条目。然后,我们训练端到端TOD模型,使其在生成时能够参照并基于对话历史和检索到的信息。缓存的构建非常简单,并且TOD系统的骨干模型与现有预训练生成模型兼容。大量实验证明了我们框架的优越性能,相比强基线,非空联合目标准确率提高了6.7%。
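A minimal sketch of the cache-and-retrieve idea follows. It is not the paper's implementation: the cache entries, the TF-IDF retriever, and the separator tokens are illustrative stand-ins for the fine-tuned retriever and the generator input format described above.

```python
# Sketch: retrieve the most relevant cache entries for the dialogue history,
# then ground the end-to-end generator on both history and retrieved entries.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cache = [
    "restaurant=Curry Garden | area=centre | food=indian | phone=01223302330",
    "hotel=Acorn Guest House | area=north | parking=yes | stars=4",
    "train=TR1234 | departure=cambridge | destination=london | leaves=09:00",
]
vectorizer = TfidfVectorizer().fit(cache)
cache_vecs = vectorizer.transform(cache)

def retrieve(history: str, k: int = 2):
    sims = cosine_similarity(vectorizer.transform([history]), cache_vecs)[0]
    return [cache[i] for i in sims.argsort()[::-1][:k]]

history = "user: I need an indian restaurant in the centre of town."
retrieved = retrieve(history)
generator_input = " [entry] ".join(retrieved) + " [history] " + history
print(generator_input)
```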
PEvoLM: Protein Sequence Evolutionary Information Language Model
paper_authors: Issar Arab for:This paper aims to improve the efficiency and accuracy of protein sequence alignment and evolutionary information retrieval by leveraging recent advancements in natural language processing (NLP) and machine learning (ML).methods:The proposed method, called PEvoLM, is a novel bidirectional language model that combines the concept of transfer learning with the idea of position-specific scoring matrices (PSSMs) to learn the evolutionary information of protein sequences. The model uses a single path for both the forward and backward passes, which reduces the number of free parameters compared to traditional bidirectional models.results:The proposed PEvoLM model was trained on a dataset of protein sequences and achieved state-of-the-art performance on predicting the next amino acid in a sequence, as well as the probability distribution of the next AA derived from similar yet different sequences. The model also demonstrated improved performance on multi-task learning, which involves predicting both the next AA and the evolutionary information of protein sequences. The source code and pre-trained model are available on GitHub under the permissive MIT license.Abstract
With the exponential increase of the protein sequence databases over time, multiple-sequence alignment (MSA) methods, like PSI-BLAST, perform exhaustive and time-consuming database search to retrieve evolutionary information. The resulting position-specific scoring matrices (PSSMs) of such search engines represent a crucial input to many machine learning (ML) models in the field of bioinformatics and computational biology. A protein sequence is a collection of contiguous tokens or characters called amino acids (AAs). The analogy to natural language allowed us to exploit the recent advancements in the field of Natural Language Processing (NLP) and therefore transfer NLP state-of-the-art algorithms to bioinformatics. This research presents an Embedding Language Model (ELMo), converting a protein sequence to a numerical vector representation. While the original ELMo trained a 2-layer bidirectional Long Short-Term Memory (LSTMs) network following a two-path architecture, one for the forward and the second for the backward pass, by merging the idea of PSSMs with the concept of transfer-learning, this work introduces a novel bidirectional language model (bi-LM) with four times less free parameters and using rather a single path for both passes. The model was trained not only on predicting the next AA but also on the probability distribution of the next AA derived from similar, yet different sequences as summarized in a PSSM, simultaneously for multi-task learning, hence learning evolutionary information of protein sequences as well. The network architecture and the pre-trained model are made available as open source under the permissive MIT license on GitHub at https://github.com/issararab/PEvoLM.
摘要
随着蛋白序列数据库的不断增长,多序列比对(MSA)方法(如PSI-BLAST)需要进行穷举且耗时的数据库搜索,以获取进化信息。这类搜索引擎得到的位置特异性打分矩阵(PSSM)是生物信息学和计算生物学领域许多机器学习模型的重要输入。蛋白序列是一系列连续的字符(氨基酸,AA)的集合。与自然语言的类比使我们能够利用自然语言处理领域的最新进展,并将其迁移到生物信息学中。本研究提出了一种嵌入语言模型(ELMo),将蛋白序列转换为数值向量表示。原始ELMo训练了一个两层双向长短时记忆(LSTM)网络,采用双路结构,一路用于前向传递,另一路用于后向传递;通过将PSSM的思想与迁移学习的概念相结合,本工作提出了一种新的双向语言模型(bi-LM),其自由参数减少到四分之一,并且前向和后向传递共用同一条路径。模型不仅被训练去预测下一个氨基酸,还以多任务学习的方式同时预测由相似但不同的序列(汇总为PSSM)导出的下一个氨基酸的概率分布,从而学习蛋白序列的进化信息。网络架构和预训练模型以MIT许可证开源,可在GitHub上获取:https://github.com/issararab/PEvoLM。
Interpretability Benchmark for Evaluating Spatial Misalignment of Prototypical Parts Explanations
paper_authors: Mikołaj Sacha, Bartosz Jura, Dawid Rymarczyk, Łukasz Struski, Jacek Tabor, Bartosz Zieliński
for: 提高parts-based网络的可解释性
methods: 引入特定的metric集来衡量相关的概念错误,并提出一种修正方法来解决这种错误
results: 通过实验研究,证明了该metric集的表达能力和修正方法的有效性Abstract
Prototypical parts-based networks are becoming increasingly popular due to their faithful self-explanations. However, their similarity maps are calculated in the penultimate network layer. Therefore, the receptive field of the prototype activation region often depends on parts of the image outside this region, which can lead to misleading interpretations. We name this undesired behavior a spatial explanation misalignment and introduce an interpretability benchmark with a set of dedicated metrics for quantifying this phenomenon. In addition, we propose a method for misalignment compensation and apply it to existing state-of-the-art models. We show the expressiveness of our benchmark and the effectiveness of the proposed compensation methodology through extensive empirical studies.
摘要
归并部网络在当前受欢迎的程度增加,这主要归功于它们的自我解释性。然而,它们的相似地图通常在半 finales层计算,因此prototype activation区域的接受范围经常受到图像外部的部分影响,这可能会导致误导性的解释。我们称这种不 DESirable 行为为空间解释不一致,并提出了一个专门的可解释指标集来量化这种现象。此外,我们还提出了一种补做方法,并应用于现有的状态级模型。我们通过广泛的实验研究表明了我们的指标和补做方法的表达能力和效果。
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework
results: 这个框架可以减轻 LLM 生成和理解能力的强大 yet 不完美性,同时充分利用人类的理解和智慧。它还可以简化和统一复杂的 LLM 工作流程,使其变得更加自然和简单。Abstract
This technical report presents AutoGen, a new framework that enables development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools. AutoGen's design offers multiple advantages: a) it gracefully navigates the strong but imperfect generation and reasoning abilities of these LLMs; b) it leverages human understanding and intelligence, while providing valuable automation through conversations between agents; c) it simplifies and unifies the implementation of complex LLM workflows as automated agent chats. We provide many diverse examples of how developers can easily use AutoGen to effectively solve tasks or build applications, ranging from coding, mathematics, operations research, entertainment, online decision-making, question answering, etc.
摘要
这份技术报告介绍了AutoGen,一个新的框架,它使得开发者可以用多个能够相互对话的代理来构建LLM应用程序,协作解决任务。AutoGen代理可定制、可对话,并且可以方便地让人类参与。它们可以在多种模式下运行,组合使用LLM、人类输入和工具。AutoGen的设计具有多个优点:一、它能够妥善应对LLM强大但并不完美的生成和推理能力;二、它利用人类的理解和智慧,同时通过代理之间的对话提供有价值的自动化;三、它将复杂的LLM工作流简化并统一为自动化的代理对话。我们提供了许多不同的示例,展示开发者如何轻松地使用AutoGen解决任务或构建应用,涵盖编程、数学、运筹、娱乐、在线决策、问答等领域。
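To make the multi-agent conversation pattern concrete, here is a schematic two-agent loop. This is a generic illustration of the pattern, not the AutoGen API: `call_llm`, `ConversableAgent`, and `run_chat` are placeholder names, and any real deployment would plug in an actual chat-completion backend and tool execution.

```python
# Schematic two-agent conversation loop (illustrative only, not the AutoGen API).
def call_llm(system_prompt: str, messages: list) -> str:
    raise NotImplementedError("plug in a chat-completion API here")

class ConversableAgent:
    def __init__(self, name: str, system_prompt: str):
        self.name = name
        self.system_prompt = system_prompt
        self.history = []

    def receive(self, sender: str, content: str) -> str:
        self.history.append({"role": "user", "name": sender, "content": content})
        reply = call_llm(self.system_prompt, self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply

def run_chat(a, b, task: str, max_turns: int = 6):
    message = task
    for _ in range(max_turns):
        message = b.receive(a.name, message)   # b (e.g. an assistant) answers a
        if "TERMINATE" in message:
            break
        message = a.receive(b.name, message)   # a (e.g. a user proxy) responds back

assistant = ConversableAgent("assistant", "You are a helpful coding assistant.")
user_proxy = ConversableAgent("user_proxy", "You relay tasks and report execution results.")
# run_chat(user_proxy, assistant, "Plot the first 10 Fibonacci numbers.")
```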
results: 个性化学习经验和预测投资骗局成功率提高Abstract
In today's world, with the rise of numerous social platforms, it has become relatively easy for anyone to spread false information and lure people into traps. Fraudulent schemes and traps are growing rapidly in the investment world. Due to this, countries and individuals face huge financial risks. We present an awareness system with the use of machine learning and gamification techniques to educate the people about investment scams and traps. Our system applies machine learning techniques to provide a personalized learning experience to the user. The system chooses distinct game-design elements and scams from the knowledge pool crafted by domain experts for each individual. The objective of the research project is to reduce inequalities in all countries by educating investors via Active Learning. Our goal is to assist the regulators in assuring a conducive environment for a fair, efficient, and inclusive capital market. In the paper, we discuss the impact of the problem, provide implementation details, and showcase the potentiality of the system through preliminary experiments and results.
摘要
在当今世界,随着众多社交平台的兴起,任何人都可以轻易地散布虚假信息并设置陷阱。投资领域的诈骗和陷阱正在迅速增多,使国家和个人面临巨大的金融风险。我们提出了一种结合机器学习和游戏化技术的意识教育系统,用于向公众普及投资诈骗与陷阱的知识。该系统利用机器学习技术为每名用户提供个性化的学习体验:系统从领域专家构建的知识库中,为每个人选取不同的游戏设计元素和诈骗案例。本研究项目的目标是通过主动学习教育投资者,从而减少各国的不平等。我们希望协助监管机构营造一个公平、高效、包容的资本市场环境。在论文中,我们讨论了该问题的影响,给出了实现细节,并通过初步实验和结果展示了系统的潜力。
SYENet: A Simple Yet Effective Network for Multiple Low-Level Vision Tasks with Real-time Performance on Mobile Device
paper_authors: Weiran Gou, Ziyao Yi, Yan Xiang, Shaoqing Li, Zibin Liu, Dehui Kong, Ke Xu for:This paper aims to solve the problems of task-specific algorithms and large parameters in deep learning-based low-level vision tasks on mobile devices.methods:The proposed SYENet network consists of two asymmetrical branches with simple building blocks and a Quadratic Connection Unit(QCU) to connect the results. The network also uses a new Outlier-Aware Loss to improve performance.results:The proposed SYENet network achieves superior performance in real-time applications such as Image Signal Processing(ISP), Low-Light Enhancement(LLE), and Super-Resolution(SR) with 2K60FPS throughput on Qualcomm 8 Gen 1 mobile SoC. Specifically, it got the highest score in MAI 2022 Learned Smartphone ISP challenge for ISP task.Abstract
With the rapid development of AI hardware accelerators, applying deep learning-based algorithms to solve various low-level vision tasks on mobile devices has gradually become possible. However, two main problems still need to be solved: task-specific algorithms make it difficult to integrate them into a single neural network architecture, and large amounts of parameters make it difficult to achieve real-time inference. To tackle these problems, we propose a novel network, SYENet, with only $~$6K parameters, to handle multiple low-level vision tasks on mobile devices in a real-time manner. The SYENet consists of two asymmetrical branches with simple building blocks. To effectively connect the results by asymmetrical branches, a Quadratic Connection Unit(QCU) is proposed. Furthermore, to improve performance, a new Outlier-Aware Loss is proposed to process the image. The proposed method proves its superior performance with the best PSNR as compared with other networks in real-time applications such as Image Signal Processing(ISP), Low-Light Enhancement(LLE), and Super-Resolution(SR) with 2K60FPS throughput on Qualcomm 8 Gen 1 mobile SoC(System-on-Chip). Particularly, for ISP task, SYENet got the highest score in MAI 2022 Learned Smartphone ISP challenge.
摘要
随着人工智能硬件加速器的快速发展,在移动设备上应用基于深度学习的算法来解决各种低级视觉任务已成为可能。然而,仍有两个主要问题需要解决:任务特定的算法难以集成到单一的神经网络架构中;大量参数使得实时推理难以实现。为解决这些问题,我们提出了一种新的网络SYENet,仅有约6K个参数,可在移动设备上实时处理多种低级视觉任务。SYENet由两个非对称分支组成,每个分支由简单的构建模块搭建。为了有效地融合两个分支的结果,我们提出了二次连接单元(Quadratic Connection Unit, QCU)。此外,为了进一步提升性能,我们还提出了一种新的离群感知损失(Outlier-Aware Loss)用于处理图像。所提方法在图像信号处理(ISP)、低光增强(LLE)和超分辨率(SR)等实时应用中取得了最优的PSNR,并在高通8 Gen 1移动SoC上达到2K 60FPS的吞吐量。特别地,在ISP任务上,SYENet在MAI 2022 Learned Smartphone ISP挑战赛中获得了最高分。
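The abstract does not spell out the exact form of the Quadratic Connection Unit. One plausible reading, shown purely as an assumption for illustration, is a fusion of the two branch outputs that adds a second-order (element-wise product) term to the usual additive skip.

```python
# Hypothetical reading of a "quadratic connection" between two branch outputs;
# the paper's actual QCU formulation may differ.
import torch
import torch.nn as nn

class QuadraticConnectionUnit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # linear terms plus an element-wise quadratic interaction of the branches
        return self.proj(a * b) + a + b

fuse = QuadraticConnectionUnit(16)
a = torch.randn(1, 16, 64, 64)   # output of branch 1
b = torch.randn(1, 16, 64, 64)   # output of branch 2
print(fuse(a, b).shape)          # torch.Size([1, 16, 64, 64])
```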
Ranking-aware Uncertainty for Text-guided Image Retrieval
for: The paper is written for text-guided image retrieval, specifically to incorporate conditional text to better capture users' intent.
methods: The paper proposes a novel ranking-aware uncertainty approach that uses only the provided triplets to capture more ranking information. The approach consists of three components: in-sample uncertainty, cross-sample uncertainty, and distribution regularization.
results: The proposed method achieves significant results on two public datasets for composed image retrieval compared to existing state-of-the-art methods.Abstract
Text-guided image retrieval is to incorporate conditional text to better capture users' intent. Traditionally, the existing methods focus on minimizing the embedding distances between the source inputs and the targeted image, using the provided triplets $\langle$source image, source text, target image$\rangle$. However, such triplet optimization may limit the learned retrieval model to capture more detailed ranking information, e.g., the triplets are one-to-one correspondences and they fail to account for many-to-many correspondences arising from semantic diversity in feedback languages and images. To capture more ranking information, we propose a novel ranking-aware uncertainty approach to model many-to-many correspondences by only using the provided triplets. We introduce uncertainty learning to learn the stochastic ranking list of features. Specifically, our approach mainly comprises three components: (1) In-sample uncertainty, which aims to capture semantic diversity using a Gaussian distribution derived from both combined and target features; (2) Cross-sample uncertainty, which further mines the ranking information from other samples' distributions; and (3) Distribution regularization, which aligns the distributional representations of source inputs and targeted image. Compared to the existing state-of-the-art methods, our proposed method achieves significant results on two public datasets for composed image retrieval.
摘要
文本帮助图像检索是将条件文本更好地捕捉用户的意图。传统方法通常是通过使用提供的三元组 $\langle$源图像、源文本、目标图像$\rangle$ 进行距离计算,以最小化嵌入空间中的距离。然而,这种三元组优化可能会限制学习的检索模型,从而缺乏更多的排名信息,例如:三元组是一对一对应关系,而忽略了语言反馈和图像之间的多对多关系。为了捕捉更多的排名信息,我们提出了一种新的排名不确定采用方法。我们的方法包括三个主要组成部分:1. 样本内不确定性,它用一个由combined和目标特征组成的 Gaussian 分布来捕捉语义多样性。2. 交叉样本不确定性,它进一步挖掘来自其他样本的排名信息。3. 分布规则,它将源输入和目标图像的分布表示相似。与现有状态艺术方法相比,我们的提出方法在两个公共数据集上 achieved 显著的结果。
How to Mask in Error Correction Code Transformer: Systematic and Double Masking
results: 对 ECCT 进行修改后,实现了 state-of-the-art 的解码性能,与传统的解码算法相比,具有显著的性能优势Abstract
In communication and storage systems, error correction codes (ECCs) are pivotal in ensuring data reliability. As deep learning's applicability has broadened across diverse domains, there is a growing research focus on neural network-based decoders that outperform traditional decoding algorithms. Among these neural decoders, Error Correction Code Transformer (ECCT) has achieved the state-of-the-art performance, outperforming other methods by large margins. To further enhance the performance of ECCT, we propose two novel methods. First, leveraging the systematic encoding technique of ECCs, we introduce a new masking matrix for ECCT, aiming to improve the performance and reduce the computational complexity. Second, we propose a novel transformer architecture of ECCT called a double-masked ECCT. This architecture employs two different mask matrices in a parallel manner to learn more diverse features of the relationship between codeword bits in the masked self-attention blocks. Extensive simulation results show that the proposed double-masked ECCT outperforms the conventional ECCT, achieving the state-of-the-art decoding performance with significant margins.
摘要
在通信和存储系统中,纠错码(ECC)是确保数据可靠性的关键。随着深度学习在各个领域的广泛应用,基于神经网络、性能超越传统译码算法的译码器日益受到关注。在这些神经译码器中,纠错码Transformer(ECCT)以较大优势取得了目前最佳的性能。为了进一步提升ECCT的性能,我们提出了两种新方法。首先,利用ECC的系统编码技术,我们为ECCT引入了一种新的掩码矩阵,以提升性能并降低计算复杂度。其次,我们提出了一种新的ECCT Transformer架构,称为双掩码ECCT(double-masked ECCT):该架构并行使用两个不同的掩码矩阵,从而在掩码自注意力模块中学习码字比特之间关系的更多样化特征。大量仿真结果表明,所提出的双掩码ECCT优于传统ECCT,以显著优势取得了目前最佳的译码性能。
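As background for the masking idea, the sketch below shows roughly how a code-structure-derived attention mask can be built from a parity-check matrix (Tanner graph) for an ECCT-style decoder whose input is bit tokens followed by syndrome tokens. This is an illustration of the general construction; the systematic and double masks proposed in the paper are refinements not reproduced here.

```python
# Illustrative Tanner-graph mask for an ECCT-style decoder over the (7,4) Hamming code.
import numpy as np

H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
m, n = H.shape                      # m = n - k parity checks, n code bits

size = n + m                        # n "bit" tokens followed by m "syndrome" tokens
mask = np.eye(size, dtype=bool)     # every token attends to itself
mask[n:, :n] = H.astype(bool)       # check j <-> the bits it involves
mask[:n, n:] = H.T.astype(bool)
mask[:n, :n] |= (H.T @ H > 0)       # two bits connect if they share a parity check

print(mask.astype(int))
```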
OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution
results: 实现高品质和分辨率的旋转和缩放功能,可以在虚拟现实环境中自由移动和缩放到 interessant objectAbstract
Omnidirectional images (ODIs) have become increasingly popular, as their large field-of-view (FoV) can offer viewers the chance to freely choose the view directions in immersive environments such as virtual reality. The M\"obius transformation is typically employed to further provide the opportunity for movement and zoom on ODIs, but applying it to the image level often results in blurry effect and aliasing problem. In this paper, we propose a novel deep learning-based approach, called \textbf{OmniZoomer}, to incorporate the M\"obius transformation into the network for movement and zoom on ODIs. By learning various transformed feature maps under different conditions, the network is enhanced to handle the increasing edge curvatures, which alleviates the blurry effect. Moreover, to address the aliasing problem, we propose two key components. Firstly, to compensate for the lack of pixels for describing curves, we enhance the feature maps in the high-resolution (HR) space and calculate the transformed index map with a spatial index generation module. Secondly, considering that ODIs are inherently represented in the spherical space, we propose a spherical resampling module that combines the index map and HR feature maps to transform the feature maps for better spherical correlation. The transformed feature maps are decoded to output a zoomed ODI. Experiments show that our method can produce HR and high-quality ODIs with the flexibility to move and zoom in to the object of interest. Project page is available at http://vlislab22.github.io/OmniZoomer/.
摘要
全方位图像(ODI)因其广阔的视场(FoV)能让观看者在虚拟现实等沉浸式环境中自由选择视线方向而日益受到欢迎。通常使用莫比乌斯变换(Möbius transformation)为ODI进一步提供移动和缩放能力,但直接在图像层面应用该变换往往会带来模糊效应和混叠问题。在这篇论文中,我们提出了一种基于深度学习的新方法OmniZoomer,将莫比乌斯变换引入网络,以实现在ODI上的移动和缩放。通过学习不同条件下的变换特征图,网络能够处理不断增大的边缘曲率,从而缓解模糊效应。此外,为了解决混叠问题,我们提出了两个关键组件。首先,针对描述曲线时像素不足的问题,我们在高分辨率(HR)空间中增强特征图,并通过空间索引生成模块计算变换后的索引图。其次,考虑到ODI本质上定义在球面空间中,我们提出了一个球面重采样模块,将索引图与HR特征图结合,对特征图进行变换以获得更好的球面相关性。变换后的特征图经解码后输出缩放后的ODI。实验表明,我们的方法可以生成高分辨率、高质量的ODI,并能灵活地移动和缩放到感兴趣的目标。项目主页:http://vlislab22.github.io/OmniZoomer/。
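For readers unfamiliar with the underlying geometry, a Möbius transformation on an omnidirectional image can be written by mapping each sphere direction to the complex plane via stereographic projection, applying z -> (az+b)/(cz+d), and mapping back. The sketch below shows this warp at the image level only; the paper instead incorporates the transformation inside the network, and the specific coefficients here are arbitrary examples.

```python
# Illustrative Moebius warp for an equirectangular ODI (image-level maths only).
import numpy as np

def mobius_warp(lon, lat, a=1.2, b=0.1 + 0.0j, c=0.0j, d=1.0):
    # sphere direction -> complex plane via stereographic projection from the north pole
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    w = (x + 1j * y) / (1.0 - z + 1e-9)
    # Moebius transformation: |a/d| controls zoom, b controls movement
    w = (a * w + b) / (c * w + d)
    # back to the sphere
    denom = 1.0 + np.abs(w) ** 2
    x, y, z = 2 * w.real / denom, 2 * w.imag / denom, (np.abs(w) ** 2 - 1) / denom
    return np.arctan2(y, x), np.arcsin(np.clip(z, -1, 1))

H, W = 256, 512
lon, lat = np.meshgrid(np.linspace(-np.pi, np.pi, W), np.linspace(-np.pi / 2, np.pi / 2, H))
src_lon, src_lat = mobius_warp(lon, lat)   # where each output pixel samples the input ODI
```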
ChatLogo: A Large Language Model-Driven Hybrid Natural-Programming Language Interface for Agent-based Modeling and Programming
results: 提供一个更用户友好的界面,支持创造性表达,并避免技术系统过度依赖于单个大语言模型Abstract
Building on Papert (1980)'s idea of children talking to computers, we propose ChatLogo, a hybrid natural-programming language interface for agent-based modeling and programming. We build upon previous efforts to scaffold ABM & P learning and recent development in leveraging large language models (LLMs) to support the learning of computational programming. ChatLogo aims to support conversations with computers in a mix of natural and programming languages, provide a more user-friendly interface for novice learners, and keep the technical system from over-reliance on any single LLM. We introduced the main elements of our design: an intelligent command center, and a conversational interface to support creative expression. We discussed the presentation format and future work. Responding to the challenges of supporting open-ended constructionist learning of ABM & P and leveraging LLMs for educational purposes, we contribute to the field by proposing the first constructionist LLM-driven interface to support computational and complex systems thinking.
摘要
基于Papert(1980)提出的让儿童与计算机对话的想法,我们提出了ChatLogo,一种用于基于代理的建模与编程(ABM & P)的自然语言与编程语言混合界面。我们的工作建立在以往支撑ABM & P学习的努力,以及近来利用大语言模型(LLM)辅助计算编程学习的进展之上。ChatLogo旨在支持以自然语言和编程语言混合的方式与计算机对话,为初学者提供更友好的界面,并避免技术系统过度依赖任何单一LLM。我们介绍了设计的主要元素:一个智能命令中心,以及一个支持创造性表达的对话界面。我们还讨论了展示形式和未来工作。面对支持ABM & P开放式建构主义学习以及将LLM用于教育目的的挑战,我们提出了第一个以建构主义为导向、由LLM驱动的界面,用于支持计算思维和复杂系统思维,以此为该领域做出贡献。
S-Mixup: Structural Mixup for Graph Neural Networks
results: 经过广泛的实验表明,S-Mixup可以提高GNN的Robustness和泛化性,特别在异质情况下。同时,S-Mixup可以减少模型的训练时间和计算量,同时保持模型的性能。Abstract
Existing studies for applying the mixup technique on graphs mainly focus on graph classification tasks, while the research in node classification is still under-explored. In this paper, we propose a novel mixup augmentation for node classification called Structural Mixup (S-Mixup). The core idea is to take into account the structural information while mixing nodes. Specifically, S-Mixup obtains pseudo-labels for unlabeled nodes in a graph along with their prediction confidence via a Graph Neural Network (GNN) classifier. These serve as the criteria for the composition of the mixup pool for both inter and intra-class mixups. Furthermore, we utilize the edge gradient obtained from the GNN training and propose a gradient-based edge selection strategy for selecting edges to be attached to the nodes generated by the mixup. Through extensive experiments on real-world benchmark datasets, we demonstrate the effectiveness of S-Mixup evaluated on the node classification task. We observe that S-Mixup enhances the robustness and generalization performance of GNNs, especially in heterophilous situations. The source code of S-Mixup can be found at \url{https://github.com/SukwonYun/S-Mixup}
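As a rough illustration of the node-level mixup idea described above (not the paper's S-Mixup implementation), the sketch below mixes the features of two nodes and their pseudo-label confidences; the graph, pseudo-labels, and mixing coefficient are all toy assumptions.

```python
# Toy node mixup between two nodes selected by pseudo-label confidence.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                 # node features
pseudo_labels = np.array([0, 0, 1, 1, 0, 1])
confidence = np.array([0.9, 0.6, 0.95, 0.7, 0.55, 0.8])

def mix_nodes(i, j, lam=None):
    """Mix node i and node j (intra- or inter-class, depending on their labels)."""
    lam = rng.beta(1.0, 1.0) if lam is None else lam
    x_mix = lam * X[i] + (1 - lam) * X[j]
    y_mix = lam * np.eye(2)[pseudo_labels[i]] + (1 - lam) * np.eye(2)[pseudo_labels[j]]
    return x_mix, y_mix

# pick the most confident node of each class for an inter-class mixup
i = int(np.argmax(np.where(pseudo_labels == 0, confidence, -1)))
j = int(np.argmax(np.where(pseudo_labels == 1, confidence, -1)))
x_new, y_new = mix_nodes(i, j)
# In S-Mixup, edges for the new node would then be attached using edge gradients;
# here that selection step is omitted.
print(x_new, y_new)
```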
Decentralized Graph Neural Network for Privacy-Preserving Recommendation
results: 通过三个公共数据集进行了广泛的实验 validate了我们的框架在多个维度上的一致优势,包括推荐效果、通信效率和隐私保护。Abstract
Building a graph neural network (GNN)-based recommender system without violating user privacy proves challenging. Existing methods can be divided into federated GNNs and decentralized GNNs. But both methods have undesirable effects, i.e., low communication efficiency and privacy leakage. This paper proposes DGREC, a novel decentralized GNN for privacy-preserving recommendations, where users can choose to publicize their interactions. It includes three stages, i.e., graph construction, local gradient calculation, and global gradient passing. The first stage builds a local inner-item hypergraph for each user and a global inter-user graph. The second stage models user preference and calculates gradients on each local device. The third stage designs a local differential privacy mechanism named secure gradient-sharing, which proves strong privacy-preserving of users' private data. We conduct extensive experiments on three public datasets to validate the consistent superiority of our framework.
摘要
在不侵犯用户隐私的前提下构建基于图神经网络(GNN)的推荐系统颇具挑战。现有方法可以分为联邦GNN和去中心化GNN两类,但这两类方法都存在不良影响,即通信效率低下和隐私泄露。本文提出了DGREC,一种新的去中心化GNN隐私保护推荐框架,其中用户可以自行选择是否公开其交互。该框架包括三个阶段:图构建、本地梯度计算和全局梯度传递。第一阶段为每个用户构建本地的物品内部超图,并构建全局的用户间图。第二阶段在每个本地设备上建模用户偏好并计算梯度。第三阶段设计了一种名为安全梯度共享的本地差分隐私机制,能够对用户私有数据提供强有力的隐私保护。我们在三个公开数据集上进行了广泛的实验,验证了我们框架的一致优势。
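The sketch below shows only the generic local-differential-privacy building block behind gradient sharing (clip, then add noise on-device before anything leaves the device). It is illustrative: DGREC's secure gradient-sharing mechanism and its privacy accounting are described in the paper and are not reproduced here.

```python
# Generic local-DP style gradient perturbation before sharing (illustrative).
import numpy as np

def privatize_gradient(grad: np.ndarray, clip: float = 1.0, epsilon: float = 1.0):
    # 1) bound each user's contribution (L1 clipping)
    l1 = np.abs(grad).sum()
    grad = grad * min(1.0, clip / (l1 + 1e-12))
    # 2) add Laplace noise proportional to the clipping bound before leaving the device
    noise = np.random.laplace(loc=0.0, scale=clip / epsilon, size=grad.shape)
    return grad + noise

# each device shares only a noisy, clipped gradient; the aggregator averages them
local_grads = [np.random.randn(8) for _ in range(5)]
shared = [privatize_gradient(g, epsilon=2.0) for g in local_grads]
global_grad = np.mean(shared, axis=0)
print(global_grad)
```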
Freshness or Accuracy, Why Not Both? Addressing Delayed Feedback via Dynamic Graph Neural Networks
paper_authors: Xiaolin Zheng, Zhongyu Wang, Chaochao Chen, Feng Zhu, Jiashu Qian for: This paper aims to address the delayed feedback problem in online commercial systems, where users’ conversions are always delayed and can negatively impact the accuracy of training algorithms.methods: The proposed method, called Delayed Feedback Modeling by Dynamic Graph Neural Network (DGDFEM), includes three stages: preparing a data pipeline, building a dynamic graph, and training a CVR prediction model. The model training uses a novel graph convolutional method named HLGCN, which leverages both high-pass and low-pass filters to deal with conversion and non-conversion relationships.results: The proposed method achieves both data freshness and label accuracy, as validated by extensive experiments on three industry datasets. The results show the consistent superiority of the method over existing methods.Abstract
The delayed feedback problem is one of the most pressing challenges in predicting the conversion rate since users' conversions are always delayed in online commercial systems. Although new data are beneficial for continuous training, without complete feedback information, i.e., conversion labels, training algorithms may suffer from overwhelming fake negatives. Existing methods tend to use multitask learning or design data pipelines to solve the delayed feedback problem. However, these methods have a trade-off between data freshness and label accuracy. In this paper, we propose Delayed Feedback Modeling by Dynamic Graph Neural Network (DGDFEM). It includes three stages, i.e., preparing a data pipeline, building a dynamic graph, and training a CVR prediction model. In the model training, we propose a novel graph convolutional method named HLGCN, which leverages both high-pass and low-pass filters to deal with conversion and non-conversion relationships. The proposed method achieves both data freshness and label accuracy. We conduct extensive experiments on three industry datasets, which validate the consistent superiority of our method.
摘要
延迟反馈问题是在线商业系统中预测转化率时最紧迫的挑战之一,因为用户的转化总是延迟发生的。虽然新数据有利于持续训练,但在缺乏完整反馈信息(即转化标签)的情况下,训练算法可能会被大量虚假负样本所困扰。现有方法通常采用多任务学习或设计数据管道来解决延迟反馈问题,但这些方法需要在数据新鲜度和标签准确性之间权衡。本文提出了基于动态图神经网络的延迟反馈建模方法(DGDFEM),包括三个阶段:准备数据管道、构建动态图和训练CVR预测模型。在模型训练中,我们提出了一种名为HLGCN的新图卷积方法,它同时利用高通和低通滤波器来处理转化与非转化关系。所提方法兼顾了数据新鲜度和标签准确性。我们在三个行业数据集上进行了广泛的实验,验证了方法的一致优势。
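To ground the high-pass/low-pass idea, the sketch below shows the two standard graph filters such a layer can combine: the normalized adjacency (low-pass, smoothing over neighbours) and its complement (high-pass, emphasizing differences to neighbours). How HLGCN weights the two filters per edge is the paper's contribution and is only stubbed here with a fixed coefficient.

```python
# Combining low-pass and high-pass graph filters (illustrative building block).
import numpy as np

def normalized_adj(A):
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return d_inv_sqrt @ A @ d_inv_sqrt

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.random.randn(4, 8)                 # node features

A_hat = normalized_adj(A + np.eye(4))     # add self-loops, then normalize
low_pass = A_hat @ X                      # smooths features over neighbours
high_pass = (np.eye(4) - A_hat) @ X       # emphasises differences to neighbours

alpha = 0.5                               # fixed here; learned per edge in HLGCN
H = alpha * low_pass + (1 - alpha) * high_pass
print(H.shape)
```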
results: 本文的算法可以解决一个开放的问题,即每个有限的 Littlestone 维度类型都存在一个可计算的在线学习算法。Abstract
We consider online learning in the model where a learning algorithm can access the class only via the consistency oracle -- an oracle, that, at any moment, can give a function from the class that agrees with all examples seen so far. This model was recently considered by Assos et al. (COLT'23). It is motivated by the fact that standard methods of online learning rely on computing the Littlestone dimension of subclasses, a problem that is computationally intractable. Assos et al. gave an online learning algorithm in this model that makes at most $C^d$ mistakes on classes of Littlestone dimension $d$, for some absolute unspecified constant $C > 0$. We give a novel algorithm that makes at most $O(256^d)$ mistakes. Our proof is significantly simpler and uses only very basic properties of the Littlestone dimension. We also observe that there exists no algorithm in this model that makes at most $2^{d+1}-2$ mistakes. We also observe that our algorithm (as well as the algorithm of Assos et al.) solves an open problem by Hasrati and Ben-David (ALT'23). Namely, it demonstrates that every class of finite Littlestone dimension with recursively enumerable representation admits a computable online learner (that may be undefined on unrealizable samples).
摘要
我们考虑在模型中进行在线学习,其中学习算法可以通过一个一致性 oracle 访问类型。这个 oracle 可以在任何时刻给出一个与所有前面看到的示例都一致的函数。这种模型在 Assos et al. (COLT'23) 中最近被考虑。它是由于标准的在线学习方法需要计算子类的 Litstone 维度,这是计算上不可能的问题而启发的。Assos et al. 提供了一个在这种模型中的在线学习算法,该算法在类型的 Litstone 维度为 $d$ 时最多会出现 $C^d$ 个错误,其中 $C$ 是一个未知的绝对常数。我们提供了一个新的算法,该算法在类型的 Litstone 维度为 $d$ 时最多会出现 $O(256^d)$ 个错误。我们的证明比较简单,只需使用了类型的 Litstone 维度的基本性质。我们还观察到,在这种模型中并没有任何算法可以在类型的 Litstone 维度为 $d$ 时出现最多 $2^{d+1}-2$ 个错误。此外,我们的算法(以及 Assos et al. 的算法)解决了 Hasrati 和 Ben-David (ALT'23) 提出的一个开放问题。即,我们证明了所有有 finite Littlestone 维度的类型都具有可计算的在线学习算法(可能是 undefined 的示例)。
Unbiased Decisions Reduce Regret: Adversarial Domain Adaptation for the Bank Loan Problem
results: 研究获得了一系列具有挑战性的 benchmark 问题的州望性结果,并且初步证明了这种方法可以提高这些问题中的公平性。Abstract
In many real world settings binary classification decisions are made based on limited data in near real-time, e.g. when assessing a loan application. We focus on a class of these problems that share a common feature: the true label is only observed when a data point is assigned a positive label by the principal, e.g. we only find out whether an applicant defaults if we accepted their loan application. As a consequence, the false rejections become self-reinforcing and cause the labelled training set, that is being continuously updated by the model decisions, to accumulate bias. Prior work mitigates this effect by injecting optimism into the model, however this comes at the cost of increased false acceptance rate. We introduce adversarial optimism (AdOpt) to directly address bias in the training set using adversarial domain adaptation. The goal of AdOpt is to learn an unbiased but informative representation of past data, by reducing the distributional shift between the set of accepted data points and all data points seen thus far. AdOpt significantly exceeds state-of-the-art performance on a set of challenging benchmark problems. Our experiments also provide initial evidence that the introduction of adversarial domain adaptation improves fairness in this setting.
摘要
在许多现实场景中,二分类决策需要基于有限数据近乎实时地做出,例如审批贷款申请。我们关注这类问题中具有一个共同特点的子类:只有当决策者把数据点判定为正类时,才能观察到其真实标签。例如,只有在批准贷款申请之后,我们才能知道申请人是否违约。其后果是,错误拒绝会自我强化,使得由模型决策持续更新的带标签训练集逐渐累积偏差。已有工作通过向模型注入乐观性来缓解这一效应,但这会以提高错误接受率为代价。我们提出了对抗乐观(AdOpt)方法,利用对抗域适应直接解决训练集中的偏差:AdOpt通过缩小已接受数据点与迄今为止所有观测数据点之间的分布差异,学习一种无偏但仍具信息量的历史数据表示。AdOpt在一组具有挑战性的基准问题上显著超越了现有最优方法。我们的实验还提供了初步证据,表明引入对抗域适应可以在这一场景下提升公平性。
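A standard building block for this kind of adversarial domain adaptation is a gradient-reversal layer feeding a domain discriminator, sketched below. This is the generic DANN-style mechanism under assumed dimensions, not AdOpt's full training loop or loss weighting.

```python
# Gradient-reversal building block for adversarial domain adaptation (illustrative).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

encoder = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
classifier = nn.Linear(32, 1)      # predicts default / no default
domain_head = nn.Linear(32, 1)     # tries to separate "accepted set" from "all applicants"

x = torch.randn(64, 10)
z = encoder(x)
y_hat = classifier(z)
# The reversed gradient pushes the encoder towards representations on which the
# accepted subpopulation cannot be told apart from the full population.
d_hat = domain_head(GradReverse.apply(z, 1.0))
```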
A Comparative Analysis of the Capabilities of Nature-inspired Feature Selection Algorithms in Predicting Student Performance
results: 结果表明,无论数据集都可以使用NIAs进行特征选择和传统机器学习算法进行分类,可以提高预测精度,同时减少特征集大小 by 2/3。Abstract
Predicting student performance is key in leveraging effective pre-failure interventions for at-risk students. In this paper, I have analyzed the relative performance of a suite of 12 nature-inspired algorithms when used to predict student performance across 3 datasets consisting of instance-based clickstream data, intra-course single-course performance, and performance when taking multiple courses simultaneously. I found that, for all datasets, leveraging an ensemble approach using NIAs for feature selection and traditional ML algorithms for classification increased predictive accuracy while also reducing feature set size by 2/3.
摘要
预测学生表现是对高风险学生及早进行干预、避免其挂科的关键。在本文中,我分析了12种自然启发算法(NIA)在预测学生表现方面的相对性能,比较基于三个数据集:基于实例的点击流数据、单门课程内的表现,以及同时修读多门课程时的表现。我发现,对于所有数据集,采用NIA进行特征选择、传统机器学习算法进行分类的集成方法,都能够在将特征集规模缩减三分之二的同时提升预测精度。
DiagGPT: An LLM-based Chatbot with Automatic Topic Management for Task-Oriented Dialogue
results: 实验表明,DiagGPT在进行诊断对话场景中表现出色,能够帮助用户完成任务。这表明DiagGPT在实际应用中具有潜在的应用前景。Abstract
Large Language Models (LLMs), such as ChatGPT, are becoming increasingly sophisticated, demonstrating capabilities that closely resemble those of humans. These AI models are playing an essential role in assisting humans with a wide array of tasks in daily life. A significant application of AI is its use as a chat agent, responding to human inquiries across various domains. Current LLMs have shown proficiency in answering general questions. However, basic question-answering dialogue often falls short in complex diagnostic scenarios, such as legal or medical consultations. These scenarios typically necessitate Task-Oriented Dialogue (TOD), wherein an AI chat agent needs to proactively pose questions and guide users towards specific task completion. Previous fine-tuning models have underperformed in TOD, and current LLMs do not inherently possess this capability. In this paper, we introduce DiagGPT (Dialogue in Diagnosis GPT), an innovative method that extends LLMs to TOD scenarios. Our experiments reveal that DiagGPT exhibits outstanding performance in conducting TOD with users, demonstrating its potential for practical applications.
Automated Test Case Generation Using Code Models and Domain Adaptation
results: 提出了一个完全自动化测试框架,可以补充搜索基本测试生成器,提高测试覆盖率。results show that our approach can generate new test cases that cover lines that were not covered by developer-written tests, and using domain adaptation can increase line coverage by 49.9% and 54%.Abstract
State-of-the-art automated test generation techniques, such as search-based testing, are usually ignorant about what a developer would create as a test case. Therefore, they typically create tests that are not human-readable and may not necessarily detect all types of complex bugs developer-written tests would do. In this study, we leverage Transformer-based code models to generate unit tests that can complement search-based test generation. Specifically, we use CodeT5, i.e., a state-of-the-art large code model, and fine-tune it on the test generation downstream task. For our analysis, we use the Methods2test dataset for fine-tuning CodeT5 and Defects4j for project-level domain adaptation and evaluation. The main contribution of this study is proposing a fully automated testing framework that leverages developer-written tests and available code models to generate compilable, human-readable unit tests. Results show that our approach can generate new test cases that cover lines that were not covered by developer-written tests. Using domain adaptation, we can also increase line coverage of the model-generated unit tests by 49.9% and 54% in terms of mean and median (compared to the model without domain adaptation). We can also use our framework as a complementary solution alongside common search-based methods to increase the overall coverage with mean and median of 25.3% and 6.3%. It can also increase the mutation score of search-based methods by killing extra mutants (up to 64 new mutants were killed per project in our experiments).
摘要
现代自动化测试技术,如搜寻式测试,通常忽略开发者会写的测试案例。因此,它们通常会创建不可读的测试和可能不会检测所有类型的复杂bug。在本研究中,我们利用Transformer型别code模型来生成单元测试,以补充搜寻式测试生成器。具体来说,我们使用CodeT5,即现代大型code模型,并对其进行精度调整。我们使用Methods2test数据集进行精度调整和Defects4j进行项目级域适应和评估。本研究的主要贡献是提出了一个完全自动化测试框架,利用开发者写的测试和可用的code模型来生成可读性高的单元测试。结果表明,我们的方法可以生成新的测试案例,覆盖开发者写的测试案例中未覆盖的行数。通过领域适应,我们可以增加模型生成的单元测试的行覆盖率,增加了49.9%和54%的平均和中值(相比无领域适应)。此外,我们的框架还可以作为搜寻式方法的补充解决方案,增加总覆盖率的平均和中值为25.3%和6.3%。此外,它还可以提高搜寻式方法的突变得分,杀死Extra突变(在我们的实验中,每个项目最多杀死64个突变)。
Planning to Learn: A Novel Algorithm for Active Learning during Model-Based Planning
paper_authors: Rowan Hodson, Bruce Bassett, Charel van Hoof, Benjamin Rosman, Mark Solms, Jonathan P. Shock, Ryan Smith
for: 这个论文的目的是对Active Inference框架进行评估和改进。
methods: 这个论文使用了强化学习和搜索算法来解决多步规划问题。
results: 论文的实验结果显示,使用强化学习和搜索算法可以在一种生物学上相关的环境中超过其他算法的性能。Abstract
Active Inference is a recent framework for modeling planning under uncertainty. Empirical and theoretical work have now begun to evaluate the strengths and weaknesses of this approach and how it might be improved. A recent extension - the sophisticated inference (SI) algorithm - improves performance on multi-step planning problems through recursive decision tree search. However, little work to date has been done to compare SI to other established planning algorithms. SI was also developed with a focus on inference as opposed to learning. The present paper has two aims. First, we compare performance of SI to Bayesian reinforcement learning (RL) schemes designed to solve similar problems. Second, we present an extension of SI - sophisticated learning (SL) - that more fully incorporates active learning during planning. SL maintains beliefs about how model parameters would change under the future observations expected under each policy. This allows a form of counterfactual retrospective inference in which the agent considers what could be learned from current or past observations given different future observations. To accomplish these aims, we make use of a novel, biologically inspired environment designed to highlight the problem structure for which SL offers a unique solution. Here, an agent must continually search for available (but changing) resources in the presence of competing affordances for information gain. Our simulations show that SL outperforms all other algorithms in this context - most notably, Bayes-adaptive RL and upper confidence bound algorithms, which aim to solve multi-step planning problems using similar principles (i.e., directed exploration and counterfactual reasoning). These results provide added support for the utility of Active Inference in solving this class of biologically-relevant problems and offer added tools for testing hypotheses about human cognition.
摘要
主动推断(Active Inference)是近年来提出的一种在不确定性下建模规划的框架。实证和理论工作已开始评估这种方法的优缺点及其改进方向。最近的一个扩展——精细推断(sophisticated inference, SI)算法——通过递归的决策树搜索提升了多步规划问题上的性能。然而,到目前为止,很少有工作将SI与其他成熟的规划算法进行比较;此外,SI的设计侧重于推断而非学习。本文有两个目标。第一,我们将SI的性能与为解决类似问题而设计的贝叶斯强化学习(RL)方法进行比较。第二,我们提出了SI的一个扩展——精细学习(sophisticated learning, SL)——它在规划过程中更充分地融入了主动学习。SL维护关于模型参数在各个策略所预期的未来观测下将如何变化的信念,从而支持一种反事实的回溯推断:智能体可以考虑在不同的未来观测下,从当前或过去的观测中能够学到什么。为实现这些目标,我们使用了一个新颖的、受生物学启发的环境,其问题结构恰好突显了SL所能提供的独特解法:智能体必须在存在多种信息增益竞争的情况下,不断搜索可用(但不断变化)的资源。我们的仿真结果显示,SL在该情境下优于所有其他算法,尤其是贝叶斯自适应RL和上置信界(UCB)算法,这些算法试图用类似的原则(即有指导的探索和反事实推理)解决多步规划问题。这些结果进一步支持了主动推断在解决这类具有生物学意义的问题上的价值,并为检验关于人类认知的假设提供了新的工具。
results: 研究发现,量子计算在大规模计算情况下可以实现更高的利润和能源效率,而且这种优势取决于大规模计算。 基于实际物理参数,文章还证明了实现这种能源效率优势所需的规模。Abstract
Energy cost is increasingly crucial in the modern computing industry with the wide deployment of large-scale machine learning models and language models. For the firms that provide computing services, low energy consumption is important both from the perspective of their own market growth and the government's regulations. In this paper, we study the energy benefits of quantum computing vis-a-vis classical computing. Deviating from the conventional notion of quantum advantage based solely on computational complexity, we redefine advantage in an energy efficiency context. Through a Cournot competition model constrained by energy usage, we demonstrate quantum computing firms can outperform classical counterparts in both profitability and energy efficiency at Nash equilibrium. Therefore quantum computing may represent a more sustainable pathway for the computing industry. Moreover, we discover that the energy benefits of quantum computing economies are contingent on large-scale computation. Based on real physical parameters, we further illustrate the scale of operation necessary for realizing this energy efficiency advantage.
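The competition setting can be illustrated with a textbook Cournot duopoly in which energy enters as a per-unit cost difference between the two providers. This is a stylized sketch under assumed demand and cost parameters; the paper's model additionally constrains energy usage and ties the cost advantage to the scale of computation.

```python
# Stylized Cournot duopoly with per-unit (energy-inclusive) marginal costs.
a, b = 100.0, 1.0                 # inverse demand: P = a - b * (q1 + q2)
c_classical = 30.0                # assumed marginal cost, classical provider
c_quantum = 20.0                  # assumed lower energy cost at large scale

def cournot_equilibrium(c1, c2):
    # closed-form Nash equilibrium quantities for linear demand and constant costs
    q1 = (a - 2 * c1 + c2) / (3 * b)
    q2 = (a - 2 * c2 + c1) / (3 * b)
    p = a - b * (q1 + q2)
    return q1, q2, p

q_c, q_q, p = cournot_equilibrium(c_classical, c_quantum)
profit_classical = (p - c_classical) * q_c
profit_quantum = (p - c_quantum) * q_q
print(f"price={p:.1f}  classical: q={q_c:.1f}, profit={profit_classical:.1f}  "
      f"quantum: q={q_q:.1f}, profit={profit_quantum:.1f}")
```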
GRINN: A Physics-Informed Neural Network for solving hydrodynamic systems in the presence of self-gravity
results: 我们的结果在线性区间内与线性解析解的误差在1%以内,当扰动进入非线性区间后与传统网格代码解的误差在5%以内。我们发现GRINN的计算时间不随维度数增加而增长,这与网格代码在维度增加时水动力学与自引力计算耗时的增长形成对比。在一维和二维计算中,GRINN的计算时间长于网格代码,但在三维计算中,在精度相近的情况下比网格代码快一个数量级。Abstract
Modeling self-gravitating gas flows is essential to answering many fundamental questions in astrophysics. This spans many topics including planet-forming disks, star-forming clouds, galaxy formation, and the development of large-scale structures in the Universe. However, the nonlinear interaction between gravity and fluid dynamics offers a formidable challenge to solving the resulting time-dependent partial differential equations (PDEs) in three dimensions (3D). By leveraging the universal approximation capabilities of a neural network within a mesh-free framework, physics informed neural networks (PINNs) offer a new way of addressing this challenge. We introduce the gravity-informed neural network (GRINN), a PINN-based code, to simulate 3D self-gravitating hydrodynamic systems. Here, we specifically study gravitational instability and wave propagation in an isothermal gas. Our results match a linear analytic solution to within 1\% in the linear regime and a conventional grid code solution to within 5\% as the disturbance grows into the nonlinear regime. We find that the computation time of the GRINN does not scale with the number of dimensions. This is in contrast to the scaling of the grid-based code for the hydrodynamic and self-gravity calculations as the number of dimensions is increased. Our results show that the GRINN computation time is longer than the grid code in one- and two- dimensional calculations but is an order of magnitude lesser than the grid code in 3D with similar accuracy. Physics-informed neural networks like GRINN thus show promise for advancing our ability to model 3D astrophysical flows.
摘要
模拟自引力气体流动是回答天体物理学中许多基本问题的关键,涉及行星形成盘、恒星形成云、星系形成以及宇宙大尺度结构的演化等诸多课题。然而,引力与流体动力学之间的非线性相互作用,使得求解由此产生的三维(3D)含时偏微分方程(PDE)成为一项艰巨的挑战。物理信息神经网络(PINN)利用神经网络在无网格框架中的通用逼近能力,为应对这一挑战提供了新途径。我们提出了引力信息神经网络(GRINN),一种基于PINN的代码,用于模拟三维自引力流体系统。在本文中,我们具体研究了等温气体中的引力不稳定性和波的传播。在线性区间内,我们的结果与线性解析解的误差在1%以内;当扰动发展进入非线性区间后,与传统网格代码解的误差在5%以内。我们发现GRINN的计算时间不随维度数增加而增长,这与网格代码在维度增加时水动力学与自引力计算耗时的增长形成对比。我们的结果显示,GRINN在一维和二维计算中耗时长于网格代码,但在三维计算中,在精度相近的情况下比网格代码快一个数量级。像GRINN这样的物理信息神经网络因此有望提升我们模拟三维天体物理流动的能力。
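To make the physics-informed residual concrete, here is a minimal PINN sketch for a 1D Poisson equation for the gravitational potential, d²φ/dx² = 4πGρ, with a prescribed toy density. It only illustrates the mesh-free residual loss; GRINN couples self-gravity with the full hydrodynamic equations in 3D, which is not reproduced here.

```python
# Minimal physics-informed residual loss for a 1D Poisson equation (illustrative).
import torch
import torch.nn as nn

G = 1.0
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))

def rho(x):                            # toy density field assumed for this sketch
    return torch.exp(-x ** 2)

def pde_residual(x):
    x = x.requires_grad_(True)
    phi = net(x)
    dphi = torch.autograd.grad(phi, x, torch.ones_like(phi), create_graph=True)[0]
    d2phi = torch.autograd.grad(dphi, x, torch.ones_like(dphi), create_graph=True)[0]
    return d2phi - 4 * torch.pi * G * rho(x)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    x = (torch.rand(256, 1) - 0.5) * 10          # random collocation points (mesh-free)
    loss = pde_residual(x).pow(2).mean()         # boundary terms omitted for brevity
    opt.zero_grad(); loss.backward(); opt.step()
```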
Large Language Models in Introductory Programming Education: ChatGPT’s Performance and Implications for Assessments
results: 结果显示 LLMs 的正确率高达94.4%到95.8%,并可靠地提供文本解释和程序代码,这对programming教育和评价方面开启了新的可能性。Abstract
This paper investigates the performance of the Large Language Models (LLMs) ChatGPT-3.5 and GPT-4 in solving introductory programming tasks. Based on the performance, implications for didactic scenarios and assessment formats utilizing LLMs are derived. For the analysis, 72 Python tasks for novice programmers were selected from the free site CodingBat. Full task descriptions were used as input to the LLMs, while the generated replies were evaluated using CodingBat's unit tests. In addition, the general availability of textual explanations and program code was analyzed. The results show high scores of 94.4 to 95.8% correct responses and reliable availability of textual explanations and program code, which opens new ways to incorporate LLMs into programming education and assessment.
APACE: AlphaFold2 and advanced computing as a service for accelerated discovery in biophysics
paper_authors: Hyun Park, Parth Patel, Roland Haas, E. A. Huerta
for: Protein 3D structure prediction from amino acid sequence.
methods: AlphaFold2 and advanced computing as a service.
results: Up to two orders of magnitude faster than off-the-shelf AlphaFold2 implementations, reducing time-to-solution from weeks to minutes.Abstract
The prediction of protein 3D structure from amino acid sequence is a computational grand challenge in biophysics, and plays a key role in robust protein structure prediction algorithms, from drug discovery to genome interpretation. The advent of AI models, such as AlphaFold, is revolutionizing applications that depend on robust protein structure prediction algorithms. To maximize the impact, and ease the usability, of these novel AI tools we introduce APACE, AlphaFold2 and advanced computing as a service, a novel computational framework that effectively handles this AI model and its TB-size database to conduct accelerated protein structure prediction analyses in modern supercomputing environments. We deployed APACE in the Delta supercomputer, and quantified its performance for accurate protein structure predictions using four exemplar proteins: 6AWO, 6OAN, 7MEZ, and 6D6U. Using up to 200 ensembles, distributed across 50 nodes in Delta, equivalent to 200 A100 NVIDIA GPUs, we found that APACE is up to two orders of magnitude faster than off-the-shelf AlphaFold2 implementations, reducing time-to-solution from weeks to minutes. This computational approach may be readily linked with robotics laboratories to automate and accelerate scientific discovery.
摘要
“蛋白结构预测从氨基酸序列是生物物理 Computational Grand Challenge,它扮演着关键角色在稳定蛋白结构预测算法中,从药物发现到基因解读。人工智能模型,如AlphaFold,的出现正在改变这些应用程序中的应用,以增加其影响力和使用之 ease。为了最大化这些新的人工智能工具的影响和使用易用性,我们介绍了APACE、AlphaFold2和高性能计算作为服务,一个新的计算框架,可以有效地处理这些人工智能模型和其TB-size数据库,在现代超级计算环境中进行加速蛋白结构预测分析。我们在Delta超级计算机上部署了APACE,并评估其表现,使用四个例子蛋白:6AWO、6OAN、7MEZ和6D6U。使用 Up to 200个组,分布在Delta上的50个节点中,相当于200个A100 NVIDIA GPUs,我们发现APACE与传统的AlphaFold2实现方法相比,可以提高到二个次的速度,从 weeks 缩短到 minutes。这个计算方法可能可以与Robotics laboratory相连,以实现和加速科学发现。”
RAVEN: In-Context Learning with Retrieval Augmented Encoder-Decoder Language Models
results: 通过广泛的实验,我们证明了RAVEN模型在certain scenarios中具有显著的优势,比如ATLAS模型和一些最先进的语言模型。尽管RAVEN模型有许多 fewer parameters,但它在一些场景中可以达到与最先进的语言模型相当的性能。我们的研究证明了检索增强encoder-decoder语言模型在上下文学习中的潜力,并鼓励进一步的研究在这个方向上。Abstract
In this paper, we investigate the in-context learning ability of retrieval-augmented encoder-decoder language models. We first conduct a comprehensive analysis of the state-of-the-art ATLAS model and identify its limitations in in-context learning, primarily due to a mismatch between pretraining and testing, as well as a restricted context length. To address these issues, we propose RAVEN, a model that combines retrieval-augmented masked language modeling and prefix language modeling. We further introduce Fusion-in-Context Learning to enhance the few-shot performance by enabling the model to leverage more in-context examples without requiring additional training or model modifications. Through extensive experiments, we demonstrate that RAVEN significantly outperforms ATLAS and achieves results comparable to the most advanced language models in certain scenarios, despite having substantially fewer parameters. Our work underscores the potential of retrieval-augmented encoder-decoder language models for in-context learning and encourages further research in this direction.
摘要
在这篇论文中,我们研究了隐藏语言模型在受限上下文中学习的能力。我们首先进行了现有ATLAS模型的全面分析,并发现其在受限上下文中学习时存在一些限制,主要是预训练和测试的不符合,以及上下文长度的限制。为了解决这些问题,我们提出了RAVEN模型,它结合了检索支持的隐藏语言模型和前缀语言模型。我们还引入了内容学习协调技术,以便让模型在少量示例下具有更好的表现。通过广泛的实验,我们证明了RAVEN模型在certain情况下与ATLAS模型相比有显著改善,并且在一些情况下与当前最先进的语言模型相当。我们的工作表明了隐藏语言模型在受限上下文中学习的潜力,并鼓励了进一步的研究在这个方向上。
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
methods: 这篇论文使用了不同的 Code Usage Frequency 约束来提高 GPT-4 Code Interpreter 的逻辑能力。这些约束包括代码生成和执行、代码执行结果评估和纠正答案。
results: 这篇论文的结果表明,通过使用 Code Usage Frequency 约束和 CSV 提示方法,GPT-4 Code Interpreter 的数学逻辑能力得到了大幅提高。具体来说,在 MATH 数据集上,使用 GPT-4 Code Interpreter 和 CSV 得到了 Zero-shot 精度提高从 53.9% 到 84.3%。Abstract
Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has brought significant advancements in addressing math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the \textit{Code Usage Frequency} of GPT-4 Code Interpreter. We found that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit \uline{c}ode-based \uline{s}elf-\uline{v}erification~(CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. In instances where the verification state registers as ``False'', the model shall automatically amend its solution, analogous to our approach of rectifying errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on MATH dataset \textbf{(53.9\% $\to$ 84.3\%)}.
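The self-verification loop can be sketched as follows. This is a schematic of the described CSV idea only: `ask_model` and `run_code` are placeholder hooks, not an official API, and a real setup would call a code-capable model and execute its code in a sandbox.

```python
# Schematic code-based self-verification (CSV) loop (illustrative, placeholder hooks).
def ask_model(prompt: str) -> str:
    raise NotImplementedError("call a code-capable LLM here")

def run_code(code: str) -> str:
    raise NotImplementedError("execute generated code in a sandbox, return stdout")

def solve_with_csv(problem: str, max_rounds: int = 3) -> str:
    prompt = problem + "\nSolve step by step, writing and running code where helpful."
    answer = ""
    for _ in range(max_rounds):
        solution = ask_model(prompt)
        answer = run_code(solution)
        verdict = run_code(ask_model(
            f"Write code that checks whether the answer {answer!r} satisfies:\n"
            f"{problem}\nPrint True or False."))
        if "True" in verdict:
            return answer                       # verified -> accept
        prompt = (problem + f"\nYour previous answer {answer!r} failed verification. "
                  "Re-examine the solution and try again.")
    return answer                               # fall back to the last attempt
```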
Relightable and Animatable Neural Avatar from Sparse-View Video
results: 实现了从缺乏视角(或半影)输入中生成高质量的可重新照明和动画的神经人物模型,并且与状态艺术法比较。Abstract
This paper tackles the challenge of creating relightable and animatable neural avatars from sparse-view (or even monocular) videos of dynamic humans under unknown illumination. Compared to studio environments, this setting is more practical and accessible but poses an extremely challenging ill-posed problem. Previous neural human reconstruction methods are able to reconstruct animatable avatars from sparse views using deformed Signed Distance Fields (SDF) but cannot recover material parameters for relighting. While differentiable inverse rendering-based methods have succeeded in material recovery of static objects, it is not straightforward to extend them to dynamic humans as it is computationally intensive to compute pixel-surface intersection and light visibility on deformed SDFs for inverse rendering. To solve this challenge, we propose a Hierarchical Distance Query (HDQ) algorithm to approximate the world space distances under arbitrary human poses. Specifically, we estimate coarse distances based on a parametric human model and compute fine distances by exploiting the local deformation invariance of SDF. Based on the HDQ algorithm, we leverage sphere tracing to efficiently estimate the surface intersection and light visibility. This allows us to develop the first system to recover animatable and relightable neural avatars from sparse view (or monocular) inputs. Experiments demonstrate that our approach is able to produce superior results compared to state-of-the-art methods. Our code will be released for reproducibility.
摘要
这篇论文要解决的挑战是:在未知光照条件下,从动态人体的稀疏视角(甚至单目)视频中创建可重光照、可驱动的神经化身。与摄影棚环境相比,这一设定更加实用且易于获取,但也构成了一个极具挑战性的不适定问题。先前的神经人体重建方法能够利用形变的符号距离场(SDF)从稀疏视角重建可驱动的化身,但无法恢复用于重光照的材质参数。基于可微逆渲染的方法虽然在静态物体的材质恢复上取得了成功,但将其直接扩展到动态人体并不容易,因为在形变SDF上为逆渲染计算像素-表面求交和光线可见性的代价非常高。为了解决这一难题,我们提出了分层距离查询(HDQ)算法,用以在任意人体姿态下近似世界空间距离。具体而言,我们基于参数化人体模型估计粗略距离,并利用SDF的局部形变不变性计算精细距离。基于HDQ算法,我们借助球体追踪(sphere tracing)高效地估计表面求交与光线可见性。这使我们得以开发出第一个能够从稀疏视角(或单目)输入中恢复可驱动、可重光照神经化身的系统。实验表明,我们的方法能够取得优于现有最先进方法的结果。我们将公开代码以便复现。
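For readers unfamiliar with sphere tracing, the primitive that HDQ is designed to make affordable on deformed human SDFs, here is a minimal sketch against a static SDF. The unit-sphere SDF and the step limits are toy assumptions for illustration only.

```python
# Sphere tracing a ray against a signed distance field (illustrative toy SDF).
import numpy as np

def sdf(p):
    return np.linalg.norm(p) - 1.0       # stand-in SDF: unit sphere at the origin

def sphere_trace(origin, direction, max_steps=64, eps=1e-4, far=10.0):
    direction = direction / np.linalg.norm(direction)
    t = 0.0
    for _ in range(max_steps):
        p = origin + t * direction
        d = sdf(p)
        if d < eps:
            return p                      # surface intersection found
        t += d                            # safe step: no surface closer than d
        if t > far:
            break
    return None                           # ray missed the surface

hit = sphere_trace(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print(hit)   # approximately [0, 0, -1]
```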
Through the Lens of Core Competency: Survey on Evaluation of Large Language Models
paper_authors: Ziyu Zhuang, Qiguang Chen, Longxuan Ma, Mingda Li, Yi Han, Yushan Qian, Haopeng Bai, Zixian Feng, Weinan Zhang, Ting Liu for: 本研究旨在更好地评估大语言模型(LLM)的性能,以帮助指导研究领域的发展。methods: 本研究使用多种评估任务和指标来评估 LLM 的四大能力:理解、知识、可靠性和安全性。results: 研究发现,现有的评估任务和指标不具备完善性,因此提出了多种新的评估方法和指标,以更好地评估 LLM 的性能。Abstract
From pre-trained language model (PLM) to large language model (LLM), the field of natural language processing (NLP) has witnessed steep performance gains and wide practical uses. The evaluation of a research field guides its direction of improvement. However, LLMs are extremely hard to thoroughly evaluate for two reasons. First of all, traditional NLP tasks become inadequate due to the excellent performance of LLM. Secondly, existing evaluation tasks are difficult to keep up with the wide range of applications in real-world scenarios. To tackle these problems, existing works proposed various benchmarks to better evaluate LLMs. To clarify the numerous evaluation tasks in both academia and industry, we investigate multiple papers concerning LLM evaluations. We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety. For every competency, we introduce its definition, corresponding benchmarks, and metrics. Under this competency architecture, similar tasks are combined to reflect corresponding ability, while new tasks can also be easily added into the system. Finally, we give our suggestions on the future direction of LLM's evaluation.
Probabilistic Phase Labeling and Lattice Refinement for Autonomous Material Research
paper_authors: Ming-Chiang Chang, Sebastian Ament, Maximilian Amsler, Duncan R. Sutherland, Lan Zhou, John M. Gregoire, Carla P. Gomes, R. Bruce van Dover, Michael O. Thompson
results: 实验和synthetic数据表明,CrystalShift可以提供可靠的晶体结构估算,超过现有方法的性能,并可以轻松地 интеグрите到高通过率的实验室工作流程中。此外,CrystalShift还提供了材料结构参数的量化见解,以便专家评估和AI模型对phas espacio进行模拟,从而加速材料的鉴定和发现。Abstract
X-ray diffraction (XRD) is an essential technique to determine a material's crystal structure in high-throughput experimentation, and has recently been incorporated in artificially intelligent agents in autonomous scientific discovery processes. However, rapid, automated and reliable analysis method of XRD data matching the incoming data rate remains a major challenge. To address these issues, we present CrystalShift, an efficient algorithm for probabilistic XRD phase labeling that employs symmetry-constrained pseudo-refinement optimization, best-first tree search, and Bayesian model comparison to estimate probabilities for phase combinations without requiring phase space information or training. We demonstrate that CrystalShift provides robust probability estimates, outperforming existing methods on synthetic and experimental datasets, and can be readily integrated into high-throughput experimental workflows. In addition to efficient phase-mapping, CrystalShift offers quantitative insights into materials' structural parameters, which facilitate both expert evaluation and AI-based modeling of the phase space, ultimately accelerating materials identification and discovery.
EduSAT: A Pedagogical Tool for Theory and Applications of Boolean Satisfiability
results: 对EduSAT的评估显示其高度准确,在所有实现的SAT和SMT解决方法中都达到100%的正确率Abstract
Boolean Satisfiability (SAT) and Satisfiability Modulo Theories (SMT) are widely used in automated verification, but there is a lack of interactive tools designed for educational purposes in this field. To address this gap, we present EduSAT, a pedagogical tool specifically developed to support learning and understanding of SAT and SMT solving. EduSAT offers implementations of key algorithms such as the Davis-Putnam-Logemann-Loveland (DPLL) algorithm and the Reduced Order Binary Decision Diagram (ROBDD) for SAT solving. Additionally, EduSAT provides solver abstractions for five NP-complete problems beyond SAT and SMT. Users can benefit from EduSAT by experimenting, analyzing, and validating their understanding of SAT and SMT solving techniques. Our tool is accompanied by comprehensive documentation and tutorials, extensive testing, and practical features such as a natural language interface and SAT and SMT formula generators, which also serve as a valuable opportunity for learners to deepen their understanding. Our evaluation of EduSAT demonstrates its high accuracy, achieving 100% correctness across all the implemented SAT and SMT solvers. We release EduSAT as a python package in .whl file, and the source can be identified at https://github.com/zhaoy37/SAT_Solver.
A Comprehensive Study on Knowledge Graph Embedding over Relational Patterns Based on Rule Learning
results: The study finds that KGE models' performance over specific relational patterns does not necessarily match theoretical expectations, and that there are relational patterns on which KGE models perform poorly. To address this, a training-free method, Score-based Patterns Adaptation (SPA), is proposed to enhance KGE models' performance over various relational patterns.Abstract
Knowledge Graph Embedding (KGE) has proven to be an effective approach to solving the Knowledge Graph Completion (KGC) task. Relational patterns which refer to relations with specific semantics exhibiting graph patterns are an important factor in the performance of KGE models. Though KGE models' capabilities are analyzed over different relational patterns in theory and a rough connection between better relational patterns modeling and better performance of KGC has been built, a comprehensive quantitative analysis on KGE models over relational patterns remains absent so it is uncertain how the theoretical support of KGE to a relational pattern contributes to the performance of triples associated to such a relational pattern. To address this challenge, we evaluate the performance of 7 KGE models over 4 common relational patterns on 2 benchmarks, then conduct an analysis in theory, entity frequency, and part-to-whole three aspects and get some counterintuitive conclusions. Finally, we introduce a training-free method Score-based Patterns Adaptation (SPA) to enhance KGE models' performance over various relational patterns. This approach is simple yet effective and can be applied to KGE models without additional training. Our experimental results demonstrate that our method generally enhances performance over specific relational patterns. Our source code is available from GitHub at https://github.com/zjukg/Comprehensive-Study-over-Relational-Patterns.
Towards Temporal Edge Regression: A Case Study on Agriculture Trade Between Nations
results: The baseline models exhibit remarkably strong performance across various settings, especially in the presence of negative edges. Among the three dynamic GNN models, TGN performs best, and the proportion of negative edges in the training samples significantly affects test performance.Abstract
Recently, Graph Neural Networks (GNNs) have shown promising performance in tasks on dynamic graphs such as node classification, link prediction and graph regression. However, few work has studied the temporal edge regression task which has important real-world applications. In this paper, we explore the application of GNNs to edge regression tasks in both static and dynamic settings, focusing on predicting food and agriculture trade values between nations. We introduce three simple yet strong baselines and comprehensively evaluate one static and three dynamic GNN models using the UN Trade dataset. Our experimental results reveal that the baselines exhibit remarkably strong performance across various settings, highlighting the inadequacy of existing GNNs. We also find that TGN outperforms other GNN models, suggesting TGN is a more appropriate choice for edge regression tasks. Moreover, we note that the proportion of negative edges in the training samples significantly affects the test performance. The companion source code can be found at: https://github.com/scylj1/GNN_Edge_Regression.
The $10 Million ANA Avatar XPRIZE Competition Advanced Immersive Telepresence Systems
paper_authors: Sven Behnke, Julie A. Adams, David Locke
for: This paper is about the multi-year $10M ANA Avatar XPRIZE competition, in which participating teams had to develop avatar systems that can transport human presence to remote locations in real time.
methods: The paper describes the competition stages with their tasks and the evaluation procedures.
results: The paper reports how the competing teams performed on the tasks and evaluation criteria, which teams received awards, and the lessons learned.Abstract
The $10M ANA Avatar XPRIZE aimed to create avatar systems that can transport human presence to remote locations in real time. The participants of this multi-year competition developed robotic systems that allow operators to see, hear, and interact with a remote environment in a way that feels as if they are truly there. On the other hand, people in the remote environment were given the impression that the operator was present inside the avatar robot. At the competition finals, held in November 2022 in Long Beach, CA, USA, the avatar systems were evaluated on their support for remotely interacting with humans, exploring new environments, and employing specialized skills. This article describes the competition stages with tasks and evaluation procedures, reports the results, presents the winning teams' approaches, and discusses lessons learned.
Synthesizing Political Zero-Shot Relation Classification via Codebook Knowledge, NLI, and ChatGPT
results: Extensive experiments show that ZSP achieves a 40% improvement in F1 score for fine-grained Rootcode classification and performs competitively with supervised BERT models, indicating that ZSP can serve as a valuable tool for event record validation and ontology development.Abstract
Recent supervised models for event coding vastly outperform pattern-matching methods. However, their reliance solely on new annotations disregards the vast knowledge within expert databases, hindering their applicability to fine-grained classification. To address these limitations, we explore zero-shot approaches for political event ontology relation classification, by leveraging knowledge from established annotation codebooks. Our study encompasses both ChatGPT and a novel natural language inference (NLI) based approach named ZSP. ZSP adopts a tree-query framework that deconstructs the task into context, modality, and class disambiguation levels. This framework improves interpretability, efficiency, and adaptability to schema changes. By conducting extensive experiments on our newly curated datasets, we pinpoint the instability issues within ChatGPT and highlight the superior performance of ZSP. ZSP achieves an impressive 40% improvement in F1 score for fine-grained Rootcode classification. ZSP demonstrates competitive performance compared to supervised BERT models, positioning it as a valuable tool for event record validation and ontology development. Our work underscores the potential of leveraging transfer learning and existing expertise to enhance the efficiency and scalability of research in the field.
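As background, NLI-based zero-shot classification — the generic building block behind approaches like ZSP — can be sketched with the Hugging Face pipeline as follows; the checkpoint, example text, candidate labels, and hypothesis template are illustrative assumptions and do not reproduce ZSP's tree-query framework.

```python
# Generic NLI-based zero-shot classification sketch (illustrative only).
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "Officials from the two governments met to negotiate a ceasefire."
labels = ["consult", "make statement", "use conventional military force"]  # hypothetical labels
result = classifier(text, candidate_labels=labels,
                    hypothesis_template="This event involves the action: {}.")
print(result["labels"][0], result["scores"][0])  # top label and its score
```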
Emotion Embeddings $\unicode{x2014}$ Learning Stable and Homogeneous Abstractions from Heterogeneous Affective Datasets
results: Experiments show that the approach yields interoperable and reusable emotion representations across a wide range of heterogeneous affective datasets, independent of language, modality, or label format, without penalizing prediction quality.Abstract
Human emotion is expressed in many communication modalities and media formats and so their computational study is equally diversified into natural language processing, audio signal analysis, computer vision, etc. Similarly, the large variety of representation formats used in previous research to describe emotions (polarity scales, basic emotion categories, dimensional approaches, appraisal theory, etc.) have led to an ever proliferating diversity of datasets, predictive models, and software tools for emotion analysis. Because of these two distinct types of heterogeneity, at the expressional and representational level, there is a dire need to unify previous work on increasingly diverging data and label types. This article presents such a unifying computational model. We propose a training procedure that learns a shared latent representation for emotions, so-called emotion embeddings, independent of different natural languages, communication modalities, media or representation label formats, and even disparate model architectures. Experiments on a wide range of heterogeneous affective datasets indicate that this approach yields the desired interoperability for the sake of reusability, interpretability and flexibility, without penalizing prediction quality. Code and data are archived under https://doi.org/10.5281/zenodo.7405327 .
Brain-Inspired Computational Intelligence via Predictive Coding
paper_authors: Tommaso Salvatori, Ankur Mali, Christopher L. Buckley, Thomas Lukasiewicz, Rajesh P. N. Rao, Karl Friston, Alexander Ororbia
for: The paper aims to explore the potential of using predictive coding (PC) as a guiding principle for the development of machine learning algorithms, in order to address some of the limitations of current deep neural network approaches.
methods: The paper surveys the literature on PC and its applications in machine intelligence tasks, highlighting its exciting properties and potential benefits for the field of machine learning.
results: The paper discusses the potential of PC to model information processing in different brain areas, be used in cognitive control and robotics, and provide a powerful inversion scheme for continuous-state generative models.Abstract
Artificial intelligence (AI) is rapidly becoming one of the key technologies of this century. The majority of results in AI thus far have been achieved using deep neural networks trained with the error backpropagation learning algorithm. However, the ubiquitous adoption of this approach has highlighted some important limitations such as substantial computational cost, difficulty in quantifying uncertainty, lack of robustness, unreliability, and biological implausibility. It is possible that addressing these limitations may require schemes that are inspired and guided by neuroscience theories. One such theory, called predictive coding (PC), has shown promising performance in machine intelligence tasks, exhibiting exciting properties that make it potentially valuable for the machine learning community: PC can model information processing in different brain areas, can be used in cognitive control and robotics, and has a solid mathematical grounding in variational inference, offering a powerful inversion scheme for a specific class of continuous-state generative models. With the hope of foregrounding research in this direction, we survey the literature that has contributed to this perspective, highlighting the many ways that PC might play a role in the future of machine learning and computational intelligence at large.
results: Experiments show that Equivariant Transporter Net is much more sample efficient than the non-symmetric version and can imitate demonstrated pick and place behavior from very few human demonstrations across a variety of tasks.Abstract
Robotic pick and place tasks are symmetric under translations and rotations of both the object to be picked and the desired place pose. For example, if the pick object is rotated or translated, then the optimal pick action should also rotate or translate. The same is true for the place pose; if the desired place pose changes, then the place action should also transform accordingly. A recently proposed pick and place framework known as Transporter Net captures some of these symmetries, but not all. This paper analytically studies the symmetries present in planar robotic pick and place and proposes a method of incorporating equivariant neural models into Transporter Net in a way that captures all symmetries. The new model, which we call Equivariant Transporter Net, is equivariant to both pick and place symmetries and can immediately generalize pick and place knowledge to different pick and place poses. We evaluate the new model empirically and show that it is much more sample efficient than the non-symmetric version, resulting in a system that can imitate demonstrated pick and place behavior using very few human demonstrations on a variety of imitation learning tasks.
paper_authors: Fernando B. Pérez Maurera, Maurizio Ferrari Dacrema, Pablo Castells, Paolo Cremonesi
for: A novel data source for improving the quality of recommender systems.
methods: Using impressions (past recommendations, i.e. items previously shown to users) together with traditional interaction data to refine user preferences.
results: A systematic literature review covering three fundamental research angles for recommender systems that use impressions: recommendation algorithms, datasets, and evaluation methodologies.Abstract
Novel data sources bring new opportunities to improve the quality of recommender systems. Impressions are a novel data source containing past recommendations (shown items) and traditional interactions. Researchers may use impressions to refine user preferences and overcome the current limitations in recommender systems research. The relevance and interest of impressions have increased over the years; hence, the need for a review of relevant work on this type of recommenders. We present a systematic literature review on recommender systems using impressions, focusing on three fundamental angles in research: recommenders, datasets, and evaluation methodologies. We provide three categorizations of papers describing recommenders using impressions, present each reviewed paper in detail, describe datasets with impressions, and analyze the existing evaluation methodologies. Lastly, we present open questions and future directions of interest, highlighting aspects missing in the literature that can be addressed in future works.
paper_authors: Eunseop Yoon, Hee Suk Yoon, Dhananjaya Gowda, SooHwan Eom, Daehyeok Kim, John Harvill, Heting Gao, Mark Hasegawa-Johnson, Chanwoo Kim, Chang D. Yoo
for: This paper aims to improve grapheme-to-phoneme (G2P) conversion, specifically when using the Text-to-Text Transfer Transformer (T5) and the tokenizer-free byte-level model ByT5.
results: The study finds that the proposed loss-based sampling method effectively improves sentence-level and paragraph-level G2P performance, which improves usability in real-world applications.Abstract
Text-to-Text Transfer Transformer (T5) has recently been considered for the Grapheme-to-Phoneme (G2P) transduction. As a follow-up, a tokenizer-free byte-level model based on T5 referred to as ByT5, recently gave promising results on word-level G2P conversion by representing each input character with its corresponding UTF-8 encoding. Although it is generally understood that sentence-level or paragraph-level G2P can improve usability in real-world applications as it is better suited to perform on heteronyms and linking sounds between words, we find that using ByT5 for these scenarios is nontrivial. Since ByT5 operates on the character level, it requires longer decoding steps, which deteriorates the performance due to the exposure bias commonly observed in auto-regressive generation models. This paper shows that the performance of sentence-level and paragraph-level G2P can be improved by mitigating such exposure bias using our proposed loss-based sampling method.
results: Experimental results show that KEAF outperforms previous state-of-the-art (SOTA) models on two datasets, reducing label noise and improving model robustness.Abstract
Existing attribute-value extraction (AVE) models require large quantities of labeled data for training. However, new products with new attribute-value pairs enter the market every day in real-world e-Commerce. Thus, we formulate AVE in multi-label few-shot learning (FSL), aiming to extract unseen attribute value pairs based on a small number of training examples. We propose a Knowledge-Enhanced Attentive Framework (KEAF) based on prototypical networks, leveraging the generated label description and category information to learn more discriminative prototypes. Besides, KEAF integrates with hybrid attention to reduce noise and capture more informative semantics for each class by calculating the label-relevant and query-related weights. To achieve multi-label inference, KEAF further learns a dynamic threshold by integrating the semantic information from both the support set and the query set. Extensive experiments with ablation studies conducted on two datasets demonstrate that KEAF outperforms other SOTA models for information extraction in FSL. The code can be found at: https://github.com/gjiaying/KEAF
Advancing continual lifelong learning in neural information retrieval: definition, dataset, framework, and empirical evaluation
results: The results show that the proposed framework successfully prevents catastrophic forgetting in neural information retrieval and enhances performance on previously learned tasks. Embedding-based retrieval models' continual learning performance declines as the topic shift distance and the dataset volume of new tasks increase, whereas pretraining-based models show no such correlation; adopting suitable learning strategies can mitigate these effects.Abstract
Continual learning refers to the capability of a machine learning model to learn and adapt to new information, without compromising its performance on previously learned tasks. Although several studies have investigated continual learning methods for information retrieval tasks, a well-defined task formulation is still lacking, and it is unclear how typical learning strategies perform in this context. To address this challenge, a systematic task formulation of continual neural information retrieval is presented, along with a multiple-topic dataset that simulates continuous information retrieval. A comprehensive continual neural information retrieval framework consisting of typical retrieval models and continual learning strategies is then proposed. Empirical evaluations illustrate that the proposed framework can successfully prevent catastrophic forgetting in neural information retrieval and enhance performance on previously learned tasks. The results indicate that embedding-based retrieval models experience a decline in their continual learning performance as the topic shift distance and dataset volume of new tasks increase. In contrast, pretraining-based models do not show any such correlation. Adopting suitable learning strategies can mitigate the effects of topic shift and data augmentation.
results: Small-scale user studies show the effectiveness of the application, with participants especially appreciating the balance between automated guidance and opportunities for personal input.Abstract
Current approaches for text summarization are predominantly automatic, with rather limited space for human intervention and control over the process. In this paper, we introduce SummHelper, a 2-phase summarization assistant designed to foster human-machine collaboration. The initial phase involves content selection, where the system recommends potential content, allowing users to accept, modify, or introduce additional selections. The subsequent phase, content consolidation, involves SummHelper generating a coherent summary from these selections, which users can then refine using visual mappings between the summary and the source text. Small-scale user studies reveal the effectiveness of our application, with participants being especially appreciative of the balance between automated guidance and opportunities for personal input.
for: detoxifying language models (LLMs) to avoid generating harmful content while maintaining generation capability.
methods: decomposing the detoxification process into sub-steps, calibrating the strong reasoning ability of LLMs using a Detox-Chain, and training with the non-toxic prompt.
results: Significant detoxification and generation improvement for six LLMs scaling from 1B to 33B, as demonstrated by automatic and human evaluation on two benchmarks.Abstract
Detoxification for LLMs is challenging since it requires models to avoid generating harmful content while maintaining the generation capability. To ensure the safety of generations, previous detoxification methods detoxify the models by changing the data distributions or constraining the generations from different aspects in a single-step manner. However, these approaches will dramatically affect the generation quality of LLMs, e.g., discourse coherence and semantic consistency, since language models tend to generate along the toxic prompt while detoxification methods work in the opposite direction. To handle such a conflict, we decompose the detoxification process into different sub-steps, where the detoxification is concentrated in the input stage and the subsequent continual generation is based on the non-toxic prompt. Besides, we also calibrate the strong reasoning ability of LLMs by designing a Detox-Chain to connect the above sub-steps in an orderly manner, which allows LLMs to detoxify the text step-by-step. Automatic and human evaluation on two benchmarks reveals that by training with Detox-Chain, six LLMs scaling from 1B to 33B can obtain significant detoxification and generation improvement. Our code and data are available at https://github.com/CODINNLG/Detox-CoT. Warning: examples in the paper may contain uncensored offensive content.
Pre-training with Large Language Model-based Document Expansion for Dense Passage Retrieval
results: Experimental results show that pre-training with LLM-based document expansion significantly boosts retrieval performance on large-scale web-search tasks. The approach exhibits strong zero-shot and out-of-domain retrieval abilities, making it more widely applicable for retrieval when no human-labeled data is available.Abstract
In this paper, we systematically study the potential of pre-training with Large Language Model(LLM)-based document expansion for dense passage retrieval. Concretely, we leverage the capabilities of LLMs for document expansion, i.e. query generation, and effectively transfer expanded knowledge to retrievers using pre-training strategies tailored for passage retrieval. These strategies include contrastive learning and bottlenecked query generation. Furthermore, we incorporate a curriculum learning strategy to reduce the reliance on LLM inferences. Experimental results demonstrate that pre-training with LLM-based document expansion significantly boosts the retrieval performance on large-scale web-search tasks. Our work shows strong zero-shot and out-of-domain retrieval abilities, making it more widely applicable for retrieval when initializing with no human-labeled data.
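As a rough illustration of the contrastive-learning component mentioned above, the following PyTorch sketch shows a generic in-batch-negative contrastive loss between query and passage embeddings; it is a simplification under stated assumptions, not the paper's pre-training recipe (bottlenecked query generation and curriculum learning are omitted).

```python
# Generic in-batch-negative contrastive (InfoNCE-style) loss for dense retrieval.
# Illustrative only; names and the temperature value are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, p_emb, temperature=0.05):
    """q_emb, p_emb: (batch, dim) embeddings of queries and their positive passages."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.t() / temperature                     # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Example with random tensors standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```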
Benchmarking Neural Network Generalization for Grammar Induction
paper_authors: Nur Lan, Emmanuel Chemla, Roni Katzir
for: This paper benchmarks the generalization ability of neural network models.
methods: The paper evaluates neural network generalization with a benchmark based on fully specified formal languages.
results: The study finds that networks trained with a Minimum Description Length (MDL) objective generalize better, and from less data, than networks trained with standard loss functions.Abstract
How well do neural networks generalize? Even for grammar induction tasks, where the target generalization is fully known, previous works have left the question open, testing very limited ranges beyond the training set and using different success criteria. We provide a measure of neural network generalization based on fully specified formal languages. Given a model and a formal grammar, the method assigns a generalization score representing how well a model generalizes to unseen samples in inverse relation to the amount of data it was trained on. The benchmark includes languages such as $a^nb^n$, $a^nb^nc^n$, $a^nb^mc^{n+m}$, and Dyck-1 and 2. We evaluate selected architectures using the benchmark and find that networks trained with a Minimum Description Length objective (MDL) generalize better and using less data than networks trained using standard loss functions. The benchmark is available at https://github.com/taucompling/bliss.
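To make the benchmark setup concrete, here is a minimal, illustrative sketch (not the authors' released code) of how strings from formal languages such as $a^nb^n$ and Dyck-1 can be generated and checked for membership, so that a trained network's outputs on unseen lengths can be scored; the function names and the simple length-based train/test split are assumptions for illustration.

```python
# Illustrative generators and membership tests for simple formal languages.
import random

def an_bn(n):
    """Return the string a^n b^n."""
    return "a" * n + "b" * n

def is_an_bn(s):
    """Membership test for the language a^n b^n."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def dyck1(n, rng=random):
    """Sample a balanced bracket string (Dyck-1) with n bracket pairs."""
    s, open_count, remaining = [], 0, n
    while remaining > 0 or open_count > 0:
        if remaining > 0 and (open_count == 0 or rng.random() < 0.5):
            s.append("(")
            open_count += 1
            remaining -= 1
        else:
            s.append(")")
            open_count -= 1
    return "".join(s)

def is_dyck1(s):
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

# Hypothetical evaluation split: train on short strings, probe generalization on longer ones.
train = [an_bn(n) for n in range(1, 51)]
test = [an_bn(n) for n in range(51, 101)]   # unseen lengths
assert all(is_an_bn(s) for s in train + test)
```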
MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation
results: Experimental results show that MemoChat outperforms strong baselines across three testing scenarios involving both open-source and API-accessible chatbots at scale, enabling consistent long-range open-domain conversation.Abstract
We propose MemoChat, a pipeline for refining instructions that enables large language models (LLMs) to effectively employ self-composed memos for maintaining consistent long-range open-domain conversations. We demonstrate a long-range open-domain conversation through iterative "memorization-retrieval-response" cycles. This requires us to carefully design tailored tuning instructions for each distinct stage. The instructions are reconstructed from a collection of public datasets to teach the LLMs to memorize and retrieve past dialogues with structured memos, leading to enhanced consistency when participating in future conversations. We invite experts to manually annotate a test set designed to evaluate the consistency of long-range conversations questions. Experiments on three testing scenarios involving both open-source and API-accessible chatbots at scale verify the efficacy of MemoChat, which outperforms strong baselines. Our codes, data and models are available here: https://github.com/LuJunru/MemoChat.
MoCoSA: Momentum Contrast for Knowledge Graph Completion with Structure-Augmented Pre-trained Language Models
paper_authors: Jiabang He, Liu Jia, Lei Wang, Xiyao Li, Xing Xu
for: This work aims to improve performance on the Knowledge Graph Completion task, enabling models to better reason over knowledge graphs and infer missing links.
methods: The work proposes Momentum Contrast with Structure-Augmented pre-trained language models (MoCoSA), in which an adaptable structure encoder lets the PLM perceive structural information; momentum hard negative sampling and intra-relation negative sampling are also introduced to improve learning efficiency.
results: Experimental results show that the method achieves state-of-the-art performance, with mean reciprocal rank (MRR) improvements of 2.5% on WN18RR and 21% on OpenBG500.Abstract
Knowledge Graph Completion (KGC) aims to conduct reasoning on the facts within knowledge graphs and automatically infer missing links. Existing methods can mainly be categorized into structure-based or description-based. On the one hand, structure-based methods effectively represent relational facts in knowledge graphs using entity embeddings. However, they struggle with semantically rich real-world entities due to limited structural information and fail to generalize to unseen entities. On the other hand, description-based methods leverage pre-trained language models (PLMs) to understand textual information. They exhibit strong robustness towards unseen entities. However, they have difficulty with larger negative sampling and often lag behind structure-based methods. To address these issues, in this paper, we propose Momentum Contrast for knowledge graph completion with Structure-Augmented pre-trained language models (MoCoSA), which allows the PLM to perceive the structural information by the adaptable structure encoder. To improve learning efficiency, we proposed momentum hard negative and intra-relation negative sampling. Experimental results demonstrate that our approach achieves state-of-the-art performance in terms of mean reciprocal rank (MRR), with improvements of 2.5% on WN18RR and 21% on OpenBG500.
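For context, the momentum-update rule that gives "momentum contrast" methods their name can be sketched as below; this is the generic MoCo-style update and only an illustration — MoCoSA's structure encoder and negative-sampling details are not shown.

```python
# Generic momentum update of a slowly-moving key encoder (illustrative sketch).
import copy
import torch

def momentum_update(query_encoder, key_encoder, m=0.999):
    """key_params <- m * key_params + (1 - m) * query_params (no gradients)."""
    with torch.no_grad():
        for q_param, k_param in zip(query_encoder.parameters(),
                                    key_encoder.parameters()):
            k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

# Example: the key encoder starts as a copy of the query encoder and then trails it.
query_encoder = torch.nn.Linear(768, 256)
key_encoder = copy.deepcopy(query_encoder)
momentum_update(query_encoder, key_encoder)
```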
Leveraging Explainable AI to Analyze Researchers’ Aspect-Based Sentiment about ChatGPT
methods: The study uses Explainable AI to facilitate aspect-based sentiment analysis on research data, extending the state of the art of the technique.
results: The study provides valuable insights into extending the state of the art of aspect-based sentiment analysis to newer datasets, where such analysis is not hampered by the length of the text data.Abstract
The groundbreaking invention of ChatGPT has triggered enormous discussion among users across all fields and domains. Among celebration around its various advantages, questions have been raised with regards to its correctness and ethics of its use. Efforts are already underway towards capturing user sentiments around it. But it begs the question as to how the research community is analyzing ChatGPT with regards to various aspects of its usage. It is this sentiment of the researchers that we analyze in our work. Since Aspect-Based Sentiment Analysis has usually only been applied on a few datasets, it gives limited success and that too only on short text data. We propose a methodology that uses Explainable AI to facilitate such analysis on research data. Our technique presents valuable insights into extending the state of the art of Aspect-Based Sentiment Analysis on newer datasets, where such analysis is not hampered by the length of the text data.
ChinaTelecom System Description to VoxCeleb Speaker Recognition Challenge 2023
results: The final submission achieved a minDCF of 0.1066 and an EER of 1.980%.Abstract
This technical report describes ChinaTelecom system for Track 1 (closed) of the VoxCeleb2023 Speaker Recognition Challenge (VoxSRC 2023). Our system consists of several ResNet variants trained only on VoxCeleb2, which were fused for better performance later. Score calibration was also applied for each variant and the fused system. The final submission achieved minDCF of 0.1066 and EER of 1.980%.
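To illustrate what score-level fusion and calibration typically look like in such systems, the following sketch averages per-trial scores from several subsystems and fits a simple logistic-regression calibration on a development set; the equal fusion weights and the logistic form are assumptions, not the team's actual recipe.

```python
# Illustrative score fusion and calibration for speaker-verification subsystems.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_scores(score_lists, weights=None):
    """Weighted average of per-system trial scores (each a 1-D array of the same length)."""
    scores = np.stack(score_lists, axis=0)            # (n_systems, n_trials)
    if weights is None:
        weights = np.ones(scores.shape[0]) / scores.shape[0]
    return np.average(scores, axis=0, weights=weights)

def calibrate(dev_scores, dev_labels, eval_scores):
    """Learn a monotonic score mapping on a dev set and apply it to eval scores."""
    lr = LogisticRegression()
    lr.fit(dev_scores.reshape(-1, 1), dev_labels)
    return lr.decision_function(eval_scores.reshape(-1, 1))

# Example with random stand-in scores from two subsystems.
rng = np.random.default_rng(0)
fused = fuse_scores([rng.normal(size=100), rng.normal(size=100)])
calibrated = calibrate(fused, rng.integers(0, 2, size=100), fused)
```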
RSpell: Retrieval-augmented Framework for Domain Adaptive Chinese Spelling Check
results: Experiments on CSC datasets from three domains (law, medicine, and official document writing) show that RSpell achieves state-of-the-art performance in both zero-shot and fine-tuning scenarios, demonstrating the effectiveness of the retrieval-augmented CSC framework.Abstract
Chinese Spelling Check (CSC) refers to the detection and correction of spelling errors in Chinese texts. In practical application scenarios, it is important to make CSC models have the ability to correct errors across different domains. In this paper, we propose a retrieval-augmented spelling check framework called RSpell, which searches corresponding domain terms and incorporates them into CSC models. Specifically, we employ pinyin fuzzy matching to search for terms, which are combined with the input and fed into the CSC model. Then, we introduce an adaptive process control mechanism to dynamically adjust the impact of external knowledge on the model. Additionally, we develop an iterative strategy for the RSpell framework to enhance reasoning capabilities. We conducted experiments on CSC datasets in three domains: law, medicine, and official document writing. The results demonstrate that RSpell achieves state-of-the-art performance in both zero-shot and fine-tuning scenarios, demonstrating the effectiveness of the retrieval-augmented CSC framework. Our code is available at https://github.com/47777777/Rspell.
AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis
results: Through a series of experiments, the study shows that modeling emotions with a Vector Quantized codebook enables language-independent emotion modeling and achieves state-of-the-art results on both qualitative and quantitative metrics.Abstract
Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations. While existing text-to-speech (TTS) and speech-to-speech systems rely on strength embedding vectors and global style tokens to capture emotions, these models represent emotions as a component of style or represent them in discrete categories. We propose AffectEcho, an emotion translation model, that uses a Vector Quantized codebook to model emotions within a quantized space featuring five levels of affect intensity to capture complex nuances and subtle differences in the same emotion. The quantized emotional embeddings are implicitly derived from spoken speech samples, eliminating the need for one-hot vectors or explicit strength embeddings. Experimental results demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker. We showcase the language-independent emotion modeling capability of the quantized emotional embeddings learned from a bilingual (English and Chinese) speech corpus with an emotion transfer task from a reference speech to a target speech. We achieve state-of-art results on both qualitative and quantitative metrics.
results: The best model achieves 0.70 F1 on the HurricaneSARC dataset, and performance on the dataset can be further improved by leveraging intermediate task transfer learning.Abstract
During natural disasters, people often use social media platforms such as Twitter to ask for help, to provide information about the disaster situation, or to express contempt about the unfolding event or public policies and guidelines. This contempt is in some cases expressed as sarcasm or irony. Understanding this form of speech in a disaster-centric context is essential to improving natural language understanding of disaster-related tweets. In this paper, we introduce HurricaneSARC, a dataset of 15,000 tweets annotated for intended sarcasm, and provide a comprehensive investigation of sarcasm detection using pre-trained language models. Our best model is able to obtain as much as 0.70 F1 on our dataset. We also demonstrate that the performance on HurricaneSARC can be improved by leveraging intermediate task transfer learning. We release our data and code at https://github.com/tsosea2/HurricaneSarc.
paper_authors: Daniela N. Rim, Kimera Richard, Heeyoul Choi
for: Improving the computational efficiency of Transformer models by avoiding the computation wasted on empty (padding) tokens.
methods: Sorting translation sentence pairs by length before batching; the data is only partially sorted so as to preserve the i.i.d. data assumption.
results: On English-Korean and English-Luganda machine translation tasks, the method reduces computation time without any loss in performance.Abstract
The Transformer model has revolutionized Natural Language Processing tasks such as Neural Machine Translation, and many efforts have been made to study the Transformer architecture, which increased its efficiency and accuracy. One potential area for improvement is to address the computation of empty tokens that the Transformer computes only to discard them later, leading to an unnecessary computational burden. To tackle this, we propose an algorithm that sorts translation sentence pairs based on their length before batching, minimizing the waste of computing power. Since the amount of sorting could violate the independent and identically distributed (i.i.d) data assumption, we sort the data partially. In experiments, we apply the proposed method to English-Korean and English-Luganda language pairs for machine translation and show that there are gains in computational time while maintaining the performance. Our method is independent of architectures, so that it can be easily integrated into any training process with flexible data lengths.
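A minimal sketch of the partial length-sorting idea follows; the function name and bucket size are illustrative assumptions, not the authors' code.

```python
# Partial length-based sorting before batching: sort only within shuffled
# buckets so that batches contain similar-length pairs (little padding) while
# the data stays approximately i.i.d. Illustrative sketch only.
import random

def partially_sorted_batches(pairs, batch_size=32, bucket_size=32 * 100):
    """pairs: list of (src_tokens, tgt_tokens) tuples."""
    random.shuffle(pairs)
    batches = []
    for start in range(0, len(pairs), bucket_size):
        bucket = sorted(pairs[start:start + bucket_size],
                        key=lambda p: (len(p[0]), len(p[1])))
        for b in range(0, len(bucket), batch_size):
            batches.append(bucket[b:b + batch_size])
    random.shuffle(batches)  # avoid presenting batches in strict length order
    return batches
```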
MDDial: A Multi-turn Differential Diagnosis Dialogue Dataset with Reliability Evaluation
methods: The paper introduces a new dialogue dataset called MDDial, containing differential diagnosis dialogues in English, together with a unified diagnosis score that accounts for the interplay between symptoms and diagnosis and also indicates the system's reliability.
results: Experiments show that existing language models trained on MDDial perform poorly because they struggle to relate relevant symptoms and diseases, indicating that further research is needed to build effective ADD dialogue systems.Abstract
Dialogue systems for Automatic Differential Diagnosis (ADD) have a wide range of real-life applications. These dialogue systems are promising for providing easy access and reducing medical costs. Building end-to-end ADD dialogue systems requires dialogue training datasets. However, to the best of our knowledge, there is no publicly available ADD dialogue dataset in English (although non-English datasets exist). Driven by this, we introduce MDDial, the first differential diagnosis dialogue dataset in English which can aid to build and evaluate end-to-end ADD dialogue systems. Additionally, earlier studies present the accuracy of diagnosis and symptoms either individually or as a combined weighted score. This method overlooks the connection between the symptoms and the diagnosis. We introduce a unified score for the ADD system that takes into account the interplay between symptoms and diagnosis. This score also indicates the system's reliability. To the end, we train two moderate-size of language models on MDDial. Our experiments suggest that while these language models can perform well on many natural language understanding tasks, including dialogue tasks in the general domain, they struggle to relate relevant symptoms and disease and thus have poor performance on MDDial. MDDial will be released publicly to aid the study of ADD dialogue research.
Radio2Text: Streaming Speech Recognition Using mmWave Radio Signals
results: Experimental results show that Radio2Text achieves a character error rate of 5.7% and a word error rate of 9.4% when recognizing speech over a vocabulary of more than 13,000 words.Abstract
Millimeter wave (mmWave) based speech recognition provides more possibility for audio-related applications, such as conference speech transcription and eavesdropping. However, considering the practicality in real scenarios, latency and recognizable vocabulary size are two critical factors that cannot be overlooked. In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary size exceeding 13,000 words. Radio2Text is based on a tailored streaming Transformer that is capable of effectively learning representations of speech-related features, paving the way for streaming ASR with a large vocabulary. To alleviate the deficiency of streaming networks unable to access entire future inputs, we propose the Guidance Initialization that facilitates the transfer of feature knowledge related to the global context from the non-streaming Transformer to the tailored streaming Transformer through weight inheritance. Further, we propose a cross-modal structure based on knowledge distillation (KD), named cross-modal KD, to mitigate the negative effect of low quality mmWave signals on recognition performance. In the cross-modal KD, the audio streaming Transformer provides feature and response guidance that inherit fruitful and accurate speech information to supervise the training of the tailored radio streaming Transformer. The experimental results show that our Radio2Text can achieve a character error rate of 5.7% and a word error rate of 9.4% for the recognition of a vocabulary consisting of over 13,000 words.
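The cross-modal knowledge-distillation idea can be illustrated generically as below, where an audio-branch teacher supplies feature and response (logit) guidance to a radio-branch student; the loss weights, temperature, and tensor shapes are assumptions for illustration and do not reproduce Radio2Text's implementation.

```python
# Generic cross-modal knowledge-distillation loss: feature matching + softened
# response matching + the ordinary task loss. Illustrative sketch only.
import torch
import torch.nn.functional as F

def cross_modal_kd_loss(student_feat, teacher_feat,
                        student_logits, teacher_logits,
                        targets, T=2.0, alpha=0.5, beta=0.5):
    feat_loss = F.mse_loss(student_feat, teacher_feat.detach())
    resp_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                         F.softmax(teacher_logits.detach() / T, dim=-1),
                         reduction="batchmean") * (T * T)
    task_loss = F.cross_entropy(student_logits, targets)
    return task_loss + alpha * feat_loss + beta * resp_loss

# Example with random tensors standing in for the two branches' outputs.
loss = cross_modal_kd_loss(torch.randn(4, 256), torch.randn(4, 256),
                           torch.randn(4, 30), torch.randn(4, 30),
                           torch.randint(0, 30, (4,)))
```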
Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation
results: Extensive experiments show that the approach effectively improves the truthfulness and detoxification of LLMs while largely preserving their fundamental capabilities.Abstract
Large language models (LLMs) have been widely used in various applications but are known to suffer from issues related to untruthfulness and toxicity. While parameter-efficient modules (PEMs) have demonstrated their effectiveness in equipping models with new skills, leveraging PEMs for deficiency unlearning remains underexplored. In this work, we propose a PEMs operation approach, namely Extraction-before-Subtraction (Ext-Sub), to enhance the truthfulness and detoxification of LLMs through the integration of ``expert'' PEM and ``anti-expert'' PEM. Remarkably, even anti-expert PEM possess valuable capabilities due to their proficiency in generating fabricated content, which necessitates language modeling and logical narrative competence. Rather than merely negating the parameters, our approach involves extracting and eliminating solely the deficiency capability within anti-expert PEM while preserving the general capabilities. To evaluate the effectiveness of our approach in terms of truthfulness and detoxification, we conduct extensive experiments on LLMs, encompassing additional abilities such as language modeling and mathematical reasoning. Our empirical results demonstrate that our approach effectively improves truthfulness and detoxification, while largely preserving the fundamental abilities of LLMs.
Improving Detection of ChatGPT-Generated Fake Science Using Real Publication Text: Introducing xFakeBibs a Supervised-Learning Network Algorithm
results: The study finds that ChatGPT contributed merely 23% of the bigram content, less than any of the 10 calibration folds; this disparity in technical terms shows ChatGPT falling short of matching real science. The xFakeBibs algorithm accurately identified 98 out of 100 publications as fake, though further research is needed to detect all fake records.Abstract
ChatGPT is becoming a new reality. In this paper, we show how to distinguish ChatGPT-generated publications from counterparts produced by scientists. Using a newly designed supervised Machine Learning algorithm, we demonstrate how to detect machine-generated publications from those produced by scientists. The algorithm was trained using 100 real publication abstracts, followed by a 10-fold calibration approach to establish a lower-upper bound range of acceptance. In the comparison with ChatGPT content, it was evident that ChatGPT contributed merely 23\% of the bigram content, which is less than 50\% of any of the other 10 calibrating folds. This analysis highlights a significant disparity in technical terms where ChatGPT fell short of matching real science. When categorizing the individual articles, the xFakeBibs algorithm accurately identified 98 out of 100 publications as fake, with 2 articles incorrectly classified as real publications. Though this work introduced an algorithmic approach that detected the ChatGPT-generated fake science with a high degree of accuracy, it remains challenging to detect all fake records. This work is indeed a step in the right direction to counter fake science and misinformation.
The Costly Dilemma: Generalization, Evaluation and Cost-Optimal Deployment of Large Language Models
paper_authors: Abi Aryan, Aakash Kumar Nain, Andrew McMahon, Lucas Augusto Meyer, Harpreet Singh Sahota
for: This paper examines enterprise-level adoption of and investment in large language models.
methods: The paper proposes a framework for assessing the generalization, evaluation, and cost-optimality of large language models.
results: The paper argues that generalization, evaluation, and cost-optimality are often relatively orthogonal factors, each of which must be assessed when developing, deploying, and managing large language models.Abstract
When deploying machine learning models in production for any product/application, there are three properties that are commonly desired. First, the models should be generalizable, in that we can extend it to further use cases as our knowledge of the domain area develops. Second they should be evaluable, so that there are clear metrics for performance and the calculation of those metrics in production settings are feasible. Finally, the deployment should be cost-optimal as far as possible. In this paper we propose that these three objectives (i.e. generalization, evaluation and cost-optimality) can often be relatively orthogonal and that for large language models, despite their performance over conventional NLP models, enterprises need to carefully assess all the three factors before making substantial investments in this technology. We propose a framework for generalization, evaluation and cost-modeling specifically tailored to large language models, offering insights into the intricacies of development, deployment and management for these large language models.
Using Artificial Populations to Study Psychological Phenomena in Neural Models
results: Through experiments on artificial populations, the study finds that language models exhibit behavior consistent with typicality effects for categories highly represented in training, but do not tend to exhibit structural priming effects. Overall, single models tend to overestimate the presence of cognitive behaviors in neural models.Abstract
The recent proliferation of research into transformer based natural language processing has led to a number of studies which attempt to detect the presence of human-like cognitive behavior in the models. We contend that, as is true of human psychology, the investigation of cognitive behavior in language models must be conducted in an appropriate population of an appropriate size for the results to be meaningful. We leverage work in uncertainty estimation in a novel approach to efficiently construct experimental populations. The resultant tool, PopulationLM, has been made open source. We provide theoretical grounding in the uncertainty estimation literature and motivation from current cognitive work regarding language models. We discuss the methodological lessons from other scientific communities and attempt to demonstrate their application to two artificial population studies. Through population based experimentation we find that language models exhibit behavior consistent with typicality effects among categories highly represented in training. However, we find that language models don't tend to exhibit structural priming effects. Generally, our results show that single models tend to over estimate the presence of cognitive behaviors in neural models.
End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations
results: For long queries and queries that do not appear in the training data, the proposed model outperforms the ASR-based system, while for short queries and queries comprising in-vocabulary words it lags behind the ASR-based system.Abstract
Conventional keyword search systems operate on automatic speech recognition (ASR) outputs, which causes them to have a complex indexing and search pipeline. This has led to interest in ASR-free approaches to simplify the search procedure. We recently proposed a neural ASR-free keyword search model which achieves competitive performance while maintaining an efficient and simplified pipeline, where queries and documents are encoded with a pair of recurrent neural network encoders and the encodings are combined with a dot-product. In this article, we extend this work with multilingual pretraining and detailed analysis of the model. Our experiments show that the proposed multilingual training significantly improves the model performance and that despite not matching a strong ASR-based conventional keyword search system for short queries and queries comprising in-vocabulary words, the proposed model outperforms the ASR-based system for long queries and queries that do not appear in the training data.
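A minimal sketch of the query/document dual-encoder scoring described above follows; the GRU encoders, feature dimensions, and last-hidden-state pooling are illustrative assumptions, not the paper's exact architecture.

```python
# Dual-encoder dot-product scoring for ASR-free keyword search (illustrative sketch).
import torch
import torch.nn as nn

class SeqEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True)

    def forward(self, x):                      # x: (batch, time, input_dim)
        _, h = self.rnn(x)                     # final hidden state
        return h[-1]                           # (batch, hidden_dim)

query_enc, doc_enc = SeqEncoder(40), SeqEncoder(40)
query = torch.randn(2, 20, 40)                 # stand-in acoustic/phonetic features
docs = torch.randn(2, 500, 40)
score = (query_enc(query) * doc_enc(docs)).sum(dim=-1)  # dot-product detection score per pair
print(score.shape)                             # torch.Size([2])
```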
Anaphoric Structure Emerges Between Neural Networks
results: The study finds that, despite the potential for increased ambiguity, languages with anaphoric structures are learnable by artificial neural networks, and such structures emerge naturally between networks. Moreover, introducing an explicit efficiency pressure makes anaphoric structures more prevalent. It concludes that certain pragmatic structures emerge straightforwardly between neural networks, while the competing needs of speakers and listeners condition the degree and nature of their emergence.Abstract
Pragmatics is core to natural language, enabling speakers to communicate efficiently with structures like ellipsis and anaphora that can shorten utterances without loss of meaning. These structures require a listener to interpret an ambiguous form - like a pronoun - and infer the speaker's intended meaning - who that pronoun refers to. Despite potential to introduce ambiguity, anaphora is ubiquitous across human language. In an effort to better understand the origins of anaphoric structure in natural language, we look to see if analogous structures can emerge between artificial neural networks trained to solve a communicative task. We show that: first, despite the potential for increased ambiguity, languages with anaphoric structures are learnable by neural models. Second, anaphoric structures emerge between models 'naturally' without need for additional constraints. Finally, introducing an explicit efficiency pressure on the speaker increases the prevalence of these structures. We conclude that certain pragmatic structures straightforwardly emerge between neural networks, without explicit efficiency pressures, but that the competing needs of speakers and listeners conditions the degree and nature of their emergence.
摘要
Pragmatics 是自然语言的核心,允许发言人通过缩短语句而不产生意义损失的结构进行efficient的交流。这些结构需要听众可以解释模糊的形式,如pronoun,并从speaker的意图中归纳出true meaning。虽然可能引入模糊性,但是anaphora在人类语言中 ubique。为了更好地理解自然语言中anaphoric structure的起源,我们尝试看看人工神经网络在解决交流任务时是否可以学习这些结构。我们发现:一、despite the potential for increased ambiguity, languages with anaphoric structures can be learned by neural models; two、anaphoric structures emerge between models naturally, without the need for additional constraints; three、introducing an explicit efficiency pressure on the speaker increases the prevalence of these structures. 我们 conclude that certain pragmatic structures can emerge between neural networks without explicit efficiency pressures, but the competing needs of speakers and listeners condition the degree and nature of their emergence.
“Beware of deception”: Detecting Half-Truth and Debunking it through Controlled Claim Editing
results: 实现了对修订后的声明的BLEU分数0.88,对受检测的半真话的混淆分数85%,并在与其他语言模型比较中表现出了显著的优势。Abstract
The prevalence of half-truths, which are statements containing some truth but that are ultimately deceptive, has risen with the increasing use of the internet. To help combat this problem, we have created a comprehensive pipeline consisting of a half-truth detection model and a claim editing model. Our approach utilizes the T5 model for controlled claim editing; "controlled" here means precise adjustments to select parts of a claim. Our methodology achieves an average BLEU score of 0.88 (on a scale of 0-1) and a disinfo-debunk score of 85% on edited claims. Significantly, our T5-based approach outperforms other Language Models such as GPT2, RoBERTa, PEGASUS, and Tailor, with average improvements of 82%, 57%, 42%, and 23% in disinfo-debunk scores, respectively. By extending the LIAR PLUS dataset, we achieve an F1 score of 82% for the half-truth detection model, setting a new benchmark in the field. While previous attempts have been made at half-truth detection, our approach is, to the best of our knowledge, the first to attempt to debunk half-truths.
摘要
“半真话”的流行率在互联网时代增加了,为了解决这问题,我们创建了一个全面的管道,包括半真话检测模型和声明编辑模型。我们的方法使用 T5 模型进行控制的声明编辑,其中“控制”指的是精确地调整选择部分声明的部分。我们的方法在编辑声明后获得的 BLEU 分数为 0.88(分数范围为 0-1),并达到了对编辑声明的85%的误信驳斥分数。与其他语言模型相比,如 GPT2、RoBERTa、PEGASUS 和 Tailor,我们的 T5 基本法提供了82%、57%、42% 和 23% 的平均提高在误信驳斥分数上。通过扩展 LIAR PLUS 数据集,我们实现了半真话检测模型的 F1 分数为 82%,创造了新的benchmark。相比之前的尝试,我们的方法是我们知道的首次尝试用于驳斥半真话。
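As context for the reported BLEU of 0.88, the snippet below shows how a sentence-level BLEU score between an original and an edited claim can be computed with NLTK. The example claims are invented, and the paper's disinfo-debunk score is a separate metric that is not reproduced here.

```python
# Illustrative check of how closely an edited claim preserves the wording of a
# reference claim, using sentence-level BLEU (0-1). This only mirrors the BLEU
# part of the evaluation described above; the disinfo-debunk score is not shown.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the policy reduced emissions by 10 percent in its first year".split()
edited    = "the policy reduced emissions by 10 percent in its first two years".split()

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], edited, smoothing_function=smooth)
print(f"BLEU: {score:.2f}")
```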
MultiSChuBERT: Effective Multimodal Fusion for Scholarly Document Quality Prediction
results: 这 paper 的实验结果表明,combining 文本和视觉模型可以显著提高 SDQP 任务的结果。此外,paper 还证明了逐步解冻视觉子模型的权重可以降低其过拟合倾向,并且使用更新的文本嵌入模型可以进一步提高结果。Abstract
Automatic assessment of the quality of scholarly documents is a difficult task with high potential impact. Multimodality, in particular the addition of visual information next to text, has been shown to improve the performance on scholarly document quality prediction (SDQP) tasks. We propose the multimodal predictive model MultiSChuBERT. It combines a textual model based on chunking full paper text and aggregating computed BERT chunk-encodings (SChuBERT), with a visual model based on Inception V3. Our work contributes to the current state-of-the-art in SDQP in three ways. First, we show that the method of combining visual and textual embeddings can substantially influence the results. Second, we demonstrate that gradual-unfreezing of the weights of the visual sub-model reduces its tendency to overfit the data, improving results. Third, we show the retained benefit of multimodality when replacing standard BERT$_{\textrm{BASE}}$ embeddings with more recent state-of-the-art text embedding models. Using BERT$_{\textrm{BASE}}$ embeddings, on the (log) number of citations prediction task with the ACL-BiblioMetry dataset, our MultiSChuBERT (text+visual) model obtains an $R^{2}$ score of 0.454 compared to 0.432 for the SChuBERT (text only) model. Similar improvements are obtained on the PeerRead accept/reject prediction task. In our experiments using SciBERT, scincl, SPECTER and SPECTER2.0 embeddings, we show that each of these tailored embeddings adds further improvements over the standard BERT$_{\textrm{BASE}}$ embeddings, with the SPECTER2.0 embeddings performing best.
摘要
自动评估学术文献质量是一项复杂的任务,具有高度潜在的影响。多模态,即在文本之外加入视觉信息,已被证明可以提高学术文献质量预测(SDQP)任务的性能。我们提出了多模态预测模型MultiSChuBERT。它将基于块化全篇文献文本并聚合计算得到的BERT块编码的文本模型(SChuBERT),与基于Inception V3的视觉模型相结合。我们的工作从三个方面推进了SDQP的现有水平。首先,我们显示了视觉与文本嵌入的结合方式会对结果产生显著影响。其次,我们证明了逐步解冻视觉子模型权重的方法可以降低其过拟合数据的倾向,从而提高结果。最后,我们显示了在使用更新的最先进文本嵌入模型替换标准BERT$_{\textrm{BASE}}$嵌入时,多模态的优势仍然保留。使用BERT$_{\textrm{BASE}}$嵌入,我们在ACL-BiblioMetry数据集上的(对数)引用数预测任务中,MultiSChuBERT(文本+视觉)模型的$R^{2}$分数为0.454,高于SChuBERT(仅文本)模型的0.432。类似的改进也在PeerRead接受/拒绝预测任务中得到。在我们使用SciBERT、scincl、SPECTER和SPECTER2.0嵌入的实验中,每种定制嵌入都在标准BERT$_{\textrm{BASE}}$嵌入之上带来了进一步的改进,其中SPECTER2.0嵌入表现最佳。
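The gradual-unfreezing idea mentioned above can be sketched generically: start with the visual sub-model fully frozen and release one group of layers per epoch. The stand-in backbone, grouping, and schedule below are illustrative assumptions rather than the MultiSChuBERT configuration.

```python
# Sketch of gradual unfreezing: the visual sub-model starts fully frozen and
# deeper blocks are unfrozen one group per epoch, which is the idea used above
# to curb overfitting. The model, grouping and schedule here are placeholders.
import torch.nn as nn

visual_model = nn.Sequential(           # stand-in for e.g. an Inception V3 backbone
    nn.Conv2d(3, 16, 3), nn.ReLU(),
    nn.Conv2d(16, 32, 3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 8),
)

# Group parameters per module, then unfreeze from the head backwards.
groups = [list(m.parameters()) for m in visual_model if any(True for _ in m.parameters())]
groups = groups[::-1]                   # head first, input layers last

for p in visual_model.parameters():     # start fully frozen
    p.requires_grad = False

def unfreeze_schedule(epoch: int):
    """Unfreeze one more parameter group at each epoch."""
    for group in groups[: epoch + 1]:
        for p in group:
            p.requires_grad = True

for epoch in range(len(groups)):
    unfreeze_schedule(epoch)
    trainable = sum(p.numel() for p in visual_model.parameters() if p.requires_grad)
    print(f"epoch {epoch}: {trainable} trainable parameters")
    # ... run one training epoch here ...
```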
Teach LLMs to Personalize – An Approach inspired by Writing Education
results: 我们在三个公共数据集上进行了评估,并与多种基eline进行比较。我们的结果表明,我们的方法可以获得显著的提升。Abstract
Personalized text generation is an emerging research area that has attracted much attention in recent years. Most studies in this direction focus on a particular domain by designing bespoke features or models. In this work, we propose a general approach for personalized text generation using large language models (LLMs). Inspired by the practice of writing education, we develop a multistage and multitask framework to teach LLMs for personalized generation. In writing instruction, the task of writing from sources is often decomposed into multiple steps that involve finding, evaluating, summarizing, synthesizing, and integrating information. Analogously, our approach to personalized text generation consists of multiple stages: retrieval, ranking, summarization, synthesis, and generation. In addition, we introduce a multitask setting that helps the model improve its generation ability further, which is inspired by the observation in education that a student's reading proficiency and writing ability are often correlated. We evaluate our approach on three public datasets, each of which covers a different and representative domain. Our results show significant improvements over a variety of baselines.
摘要
个性化文本生成是近年来备受关注的新兴研究领域。该方向的大多数研究都集中在某一特定领域,并为其设计专门的特征或模型。在这项工作中,我们提出了一种使用大型语言模型(LLM)进行个性化文本生成的通用方法。受写作教育实践的启发,我们开发了一个多阶段、多任务的框架来训练LLM进行个性化生成。在写作教学中,基于素材的写作任务通常被分解为查找、评估、总结、综合和整合信息等多个步骤。与此类似,我们的个性化文本生成方法由多个阶段组成:检索、排序、摘要、综合和生成。此外,我们引入了一种多任务设置,以进一步提升模型的生成能力,这一设计源于教育领域的观察:学生的阅读水平与写作能力往往是相关的。我们在三个公共数据集上评估了我们的方法,每个数据集覆盖一个不同且具有代表性的领域。结果表明,我们的方法相比多种基线取得了显著提升。
results: 这篇论文首次生成了大规模的REI问题数据集,并提出了一些初步的机器学习基线。Abstract
We propose \emph{regular expression inference (REI)} as a challenge for code/language modelling, and the wider machine learning community. REI is a supervised machine learning (ML) and program synthesis task, and poses the problem of finding minimal regular expressions from examples: Given two finite sets of strings $P$ and $N$ and a cost function $\text{cost}(\cdot)$, the task is to generate an expression $r$ that accepts all strings in $P$ and rejects all strings in $N$, while no other such expression $r'$ exists with $\text{cost}(r')<\text{cost}(r)$. REI has advantages as a challenge problem: (i) regular expressions are well-known, widely used, and a natural idealisation of code; (ii) REI's asymptotic worst-case complexity is well understood; (iii) REI has a small number of easy to understand parameters (e.g.~$P$ or $N$ cardinality, string lengths of examples, or the cost function); this lets us easily finetune REI-hardness; (iv) REI is an unsolved problem for deep learning based ML. Recently, an REI solver was implemented on GPUs, using program synthesis techniques. This enabled, for the first time, fast generation of minimal expressions for complex REI instances. Building on this advance, we generate and publish the first large-scale datasets for REI, and devise and evaluate several initial heuristic and machine learning baselines. We invite the community to participate and explore ML methods that learn to solve REI problems. We believe that progress in REI directly translates to code/language modelling.
摘要
我们提出了正则表达式推理(REI)作为代码/语言建模以及更广泛的机器学习社区的一个挑战问题。REI 是一个监督式机器学习(ML)与程序合成任务,其问题是从示例中找到最小的正则表达式:给定两个有限字符串集合 $P$ 和 $N$ 以及一个成本函数 $\text{cost}(\cdot)$,任务是生成一个表达式 $r$,使其接受 $P$ 中的所有字符串并拒绝 $N$ 中的所有字符串,且不存在其他满足条件的表达式 $r'$ 使得 $\text{cost}(r') < \text{cost}(r)$。REI 作为挑战问题有以下优点:1. 正则表达式广为人知、应用广泛,并且是代码的一种自然理想化;2. REI 的最坏情况渐近复杂度已有良好理解;3. REI 只有少量易于理解的参数(例如 $P$ 或 $N$ 的基数、示例字符串的长度、或成本函数),这使我们可以轻松地调整 REI 的难度;4. 对基于深度学习的 ML 而言,REI 仍是一个未解决的问题。最近,一个基于程序合成技术的 REI 求解器在 GPU 上得以实现,首次使得针对复杂 REI 实例快速生成最小表达式成为可能。基于这一进展,我们生成并发布了第一批大规模 REI 数据集,并设计和评估了若干初步的启发式与机器学习基线。我们邀请社区参与并探索能够学习求解 REI 问题的 ML 方法。我们相信 REI 上的进展将直接转化为代码/语言建模的进步。
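To make the REI task definition concrete, here is a toy brute-force solver: it enumerates candidate expressions over a two-letter alphabet and returns a lowest-cost pattern (cost taken as string length) that accepts every string in P and rejects every string in N. The operator set, cost function, and search bounds are simplifying assumptions and have nothing to do with the GPU solver or baselines from the paper.

```python
# A tiny brute-force illustration of the REI task defined above: enumerate
# candidate regular expressions and keep the cheapest one (here, cost = string
# length) accepting every string in P and rejecting every string in N.
import re

ALPHABET = ["a", "b"]
ATOMS = ALPHABET + ["(a|b)"]

def candidates(max_len: int):
    """Yield patterns built from atoms, each optionally starred, up to 4 atoms deep."""
    frontier = [""]
    for _ in range(4):                      # bound the search depth
        next_frontier = []
        for prefix in frontier:
            for atom in ATOMS:
                for suffix in (atom, atom + "*"):
                    pattern = prefix + suffix
                    if len(pattern) <= max_len:
                        next_frontier.append(pattern)
        frontier = next_frontier
        yield from frontier

def infer_regex(P, N, max_len=12):
    best = None
    for pattern in candidates(max_len):
        if all(re.fullmatch(pattern, s) for s in P) and not any(
            re.fullmatch(pattern, s) for s in N
        ):
            if best is None or len(pattern) < len(best):
                best = pattern
    return best

P = ["ab", "aab", "aaab"]
N = ["b", "ba", ""]
print(infer_regex(P, N))   # e.g. 'a*ab' or an equivalent minimal-cost pattern
```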
results: 对比 vanilla MLLMs,LCL-MLLM 在Recognizing unseen images和理解 novel concepts 方面表现出色,并且在ISEKAI dataset上进行了广泛的实验评估。Abstract
The ability to learn from context with novel concepts, and deliver appropriate responses are essential in human conversations. Despite current Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) being trained on mega-scale datasets, recognizing unseen images or understanding novel concepts in a training-free manner remains a challenge. In-Context Learning (ICL) explores training-free few-shot learning, where models are encouraged to ``learn to learn" from limited tasks and generalize to unseen tasks. In this work, we propose link-context learning (LCL), which emphasizes "reasoning from cause and effect" to augment the learning capabilities of MLLMs. LCL goes beyond traditional ICL by explicitly strengthening the causal relationship between the support set and the query set. By providing demonstrations with causal links, LCL guides the model to discern not only the analogy but also the underlying causal associations between data points, which empowers MLLMs to recognize unseen images and understand novel concepts more effectively. To facilitate the evaluation of this novel approach, we introduce the ISEKAI dataset, comprising exclusively of unseen generated image-label pairs designed for link-context learning. Extensive experiments show that our LCL-MLLM exhibits strong link-context learning capabilities to novel concepts over vanilla MLLMs. Code and data will be released at https://github.com/isekai-portal/Link-Context-Learning.
摘要
人工智能沟通中,学习从文本上下文中理解新概念并提供相应的响应是非常重要的。尽管当前的多模态大型自然语言模型(MLLM)和大型自然语言模型(LLM)在巨大数据集上训练,仍然识别未看过的图像或理解未经训练的新概念是一大挑战。无需训练的内容学习(ICL)探索了几shot学习,使模型能够“学习学习”从有限任务中吸取知识并通过未经训练的任务进行推理。在这种工作中,我们提出了链接上下文学习(LCL),强调“因果关系”来增强模型的学习能力。LCL比传统的ICL更加广泛,通过显示 causal 链接,使模型能够不仅理解 analogies,还能够捕捉数据点之间的 causal 关系,从而使得 MLLMs 能够更好地识别未看过的图像和理解新概念。为了便于这种新的方法的评估,我们提出了 ISEKAI 数据集,包括一系列未经训练的生成图像标签对,供 link-context 学习使用。经过广泛的实验,我们发现我们的 LCL-MLLM 在对 novel 概念的链接上下文学习中表现出色,胜过普通的 MLLMs。代码和数据将在 GitHub 上发布,请参考 https://github.com/isekai-portal/Link-Context-Learning。
A Trustable LSTM-Autoencoder Network for Cyberbullying Detection on Social Media Using Synthetic Data
for: The paper aims to detect cyberbullying on social media, specifically using a trustable LSTM-Autoencoder Network with synthetic data to address data availability issues.
methods: The proposed method uses a combination of LSTM and Autoencoder networks to identify aggressive comments on social media, and the authors also compare the performance of their model with other state-of-the-art models such as LSTM, BiLSTM, Word2vec, BERT, and GPT-2.
results: The proposed model outperforms all the other models on all datasets, achieving an accuracy of 95%. The authors also demonstrate the state-of-the-art results of their model compared to previous works on the dataset used in this paper.
results: 该方法在所有数据集上都有最高的准确率(95%),并且比前一些工作在该数据集上的结果更为出色。Abstract
Social media cyberbullying has a detrimental effect on human life. As online social networking grows daily, the amount of hate speech also increases. Such terrible content can cause depression and actions related to suicide. This paper proposes a trustable LSTM-Autoencoder Network for cyberbullying detection on social media using synthetic data. We have demonstrated a cutting-edge method to address data availability difficulties by producing machine-translated data. However, several languages such as Hindi and Bangla still lack adequate investigations due to a lack of datasets. We carried out experimental identification of aggressive comments on Hindi, Bangla, and English datasets using the proposed model and traditional models, including Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), LSTM-Autoencoder, Word2vec, Bidirectional Encoder Representations from Transformers (BERT), and Generative Pre-trained Transformer 2 (GPT-2) models. We employed evaluation metrics such as f1-score, accuracy, precision, and recall to assess the models performance. Our proposed model outperformed all the models on all datasets, achieving the highest accuracy of 95%. Our model achieves state-of-the-art results among all the previous works on the dataset we used in this paper.
摘要
社交媒体网络欺凌对人类生活产生负面影响。随着在线社交网络的日常增长,仇恨言论的数量也在增加。这些丑陋的内容可能导致抑郁和自杀相关行为。这篇论文提出一种可靠的LSTM-自编码器网络,用于基于合成数据的社交媒体欺凌检测。我们通过生成机器翻译数据解决了数据可用性的问题。然而,一些语言,如印地语和孟加拉语,由于缺乏数据集仍然缺乏足够的研究。我们使用所提模型以及传统模型,在印地语、孟加拉语和英语数据集上对攻击性评论进行了实验识别。我们使用了F1分数、准确率、精确率和召回率等评价指标来评价模型的表现。我们提出的模型在所有数据集上都取得了最高的准确率95%,并在本文所用数据集上相比之前的工作取得了最先进的结果。
paper_authors: Mohammad Soleymanpour, Michael T. Johnson, Rahim Soleymanpour, Jeffrey Berry
For: This paper is written for the purpose of developing a new dysarthric speech synthesis method for use in Automatic Speech Recognition (ASR) training data augmentation.* Methods: The paper uses a modified neural multi-talker Text-to-Speech (TTS) system, which includes a dysarthria severity level coefficient and a pause insertion model, to synthesize dysarthric speech for varying severity levels. The paper also uses a DNN-HMM model for dysarthria-specific speech recognition.* Results: The paper shows that the addition of dysarthric speech synthesis to ASR training data improves the accuracy of dysarthric speech recognition by 12.2%, and that the addition of severity level and pause insertion controls decreases WER by 6.5%. The subjective evaluation shows that the synthesized speech is perceived as similar to true dysarthric speech, especially for higher levels of dysarthria.* For: 这篇论文旨在开发一种新的构音障碍语音合成方法,用于自动语音识别(ASR)训练数据增强。* Methods: 这篇论文使用一种修改后的神经网络多说话人Text-to-Speech(TTS)系统,加入构音障碍严重程度系数和停顿插入模型,以合成不同严重程度的构音障碍语音。* Results: 这篇论文显示,将合成的构音障碍语音加入ASR训练数据可以提高构音障碍语音识别的准确率,相比基线WER改善12.2%;此外,加入严重程度和停顿插入控制可进一步使WER降低6.5%。主观评估表明,合成语音在感知上与真实构音障碍语音相似,尤其是在构音障碍程度较高时。Abstract
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. Differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components for dysarthric speech modeling, synthesis, and augmentation. For dysarthric speech synthesis, a modified neural multi-talker TTS is implemented by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. To evaluate the effectiveness for synthesis of training data for ASR, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has significant impact on the dysarthric ASR systems. In addition, we have conducted a subjective evaluation to evaluate the dysarthric-ness and similarity of synthesized speech. Our subjective evaluation shows that the perceived dysarthric-ness of synthesized speech is similar to that of true dysarthric speech, especially for higher levels of dysarthria.
摘要
Eliciting Risk Aversion with Inverse Reinforcement Learning via Interactive Questioning
results: 我们证明,询问代理人的优化政策在不同环境中是一种有效的方法来识别代理人的风险偏好。具体来说,我们证明代理人的风险偏好可以通过问题的数量增长和问题的随机设计来识别出来,并且我们开发了一种算法来设计优化问题。在 simulations 中,我们发现我们的方法可以快速地学习代理人的风险偏好,比Randomly 设计的问题更快。这种方法在 robo-advising 中有重要应用,并提供了一种新的风险偏好识别方法。Abstract
This paper proposes a novel framework for identifying an agent's risk aversion using interactive questioning. Our study is conducted in two scenarios: a one-period case and an infinite horizon case. In the one-period case, we assume that the agent's risk aversion is characterized by a cost function of the state and a distortion risk measure. In the infinite horizon case, we model risk aversion with an additional component, a discount factor. Assuming access to a finite set of candidates containing the agent's true risk aversion, we show that asking the agent to demonstrate her optimal policies in various environments, which may depend on their previous answers, is an effective means of identifying the agent's risk aversion. Specifically, we prove that the agent's risk aversion can be identified as the number of questions tends to infinity, and the questions are randomly designed. We also develop an algorithm for designing optimal questions and provide empirical evidence that our method learns risk aversion significantly faster than randomly designed questions in simulations. Our framework has important applications in robo-advising and provides a new approach for identifying an agent's risk preferences.
摘要
Digital twinning of cardiac electrophysiology models from the surface ECG: a geodesic backpropagation approach
results: 在一个合成测试案例中,Geodesic-BP方法即使在模型存在不准确性的情况下,也能很准确地重建模拟的心脏激活。此外,我们还将该算法应用于一个公开可用的兔子模型数据集,得到了非常正面的结果。Abstract
The eikonal equation has become an indispensable tool for modeling cardiac electrical activation accurately and efficiently. In principle, by matching clinically recorded and eikonal-based electrocardiograms (ECGs), it is possible to build patient-specific models of cardiac electrophysiology in a purely non-invasive manner. Nonetheless, the fitting procedure remains a challenging task. The present study introduces a novel method, Geodesic-BP, to solve the inverse eikonal problem. Geodesic-BP is well-suited for GPU-accelerated machine learning frameworks, allowing us to optimize the parameters of the eikonal equation to reproduce a given ECG. We show that Geodesic-BP can reconstruct a simulated cardiac activation with high accuracy in a synthetic test case, even in the presence of modeling inaccuracies. Furthermore, we apply our algorithm to a publicly available dataset of a rabbit model, with very positive results. Given the future shift towards personalized medicine, Geodesic-BP has the potential to help in future functionalizations of cardiac models meeting clinical time constraints while maintaining the physiological accuracy of state-of-the-art cardiac models.
摘要
程函方程(eikonal equation)已成为准确且高效地建模心脏电激活的不可或缺工具。原则上,通过匹配临床记录的心电图(ECG)与基于程函方程计算的心电图,可以以完全无创的方式建立患者特异的心脏电生理模型。然而,拟合过程仍然是一项具有挑战性的任务。本研究提出了一种新方法,即Geodesic-BP,用于求解程函方程的反问题。Geodesic-BP非常适合GPU加速的机器学习框架,可以优化程函方程的参数,以复现给定的ECG。我们在合成测试案例中显示,即使存在建模不准确性,Geodesic-BP也能够高精度地重建模拟的心脏激活。此外,我们将该算法应用于一个公开可用的兔子模型数据集,得到了非常正面的结果。随着未来向个性化医疗的转变,Geodesic-BP有潜力帮助心脏模型在满足临床时间约束的同时保持最先进心脏模型的生理准确性,从而实现其功能化。
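As background for the forward model being fitted, the eikonal equation relates activation time to local conduction velocity, and a Dijkstra-style front propagation gives a crude first-arrival solution on a grid. The sketch below illustrates only this forward solve on an invented 2D speed map; it is not Geodesic-BP or its GPU implementation.

```python
# Background sketch: the eikonal equation |grad T| = 1/v relates activation
# time T to conduction velocity v. A Dijkstra-style front propagation on a
# grid gives a crude first-arrival solution. This illustrates the forward
# model only; it is not the Geodesic-BP fitting procedure from the paper.
import heapq
import numpy as np

def eikonal_dijkstra(speed: np.ndarray, source: tuple, h: float = 1.0) -> np.ndarray:
    """Approximate first-arrival times on a 2D grid with per-cell speed."""
    ny, nx = speed.shape
    times = np.full((ny, nx), np.inf)
    times[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        t, (i, j) = heapq.heappop(heap)
        if t > times[i, j]:
            continue                                   # stale heap entry
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < ny and 0 <= nj < nx:
                # edge cost = distance / local speed (averaged between cells)
                cost = h * 2.0 / (speed[i, j] + speed[ni, nj])
                if t + cost < times[ni, nj]:
                    times[ni, nj] = t + cost
                    heapq.heappush(heap, (t + cost, (ni, nj)))
    return times

speed = np.ones((50, 50))
speed[20:30, 10:40] = 0.2          # a slow-conducting region
activation = eikonal_dijkstra(speed, source=(5, 5))
print(activation[-1, -1])          # arrival time at the far corner
```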
Explainable AI for clinical risk prediction: a survey of concepts, methods, and modalities
results: 这篇论文的结果显示,这些解释模型可以在临床预测中提高解释性和信任度,并且可以实现在多种模式下的透明度和公平性。Abstract
Recent advancements in AI applications to healthcare have shown incredible promise in surpassing human performance in diagnosis and disease prognosis. With the increasing complexity of AI models, however, concerns have arisen regarding their opacity, potential biases, and the need for interpretability. To ensure trust and reliability in AI systems, especially in clinical risk prediction models, explainability becomes crucial. Explainability is usually referred to as an AI system's ability to provide a robust interpretation of its decision-making logic or the decisions themselves to human stakeholders. In clinical risk prediction, other aspects of explainability like fairness, bias, trust, and transparency also represent important concepts beyond just interpretability. In this review, we address the relationship between these concepts as they are often used together or interchangeably. This review also discusses recent progress in developing explainable models for clinical risk prediction, highlighting the importance of quantitative and clinical evaluation and validation across multiple common modalities in clinical practice. It emphasizes the need for external validation and the combination of diverse interpretability methods to enhance trust and fairness. Adopting rigorous testing, such as using synthetic datasets with known generative factors, can further improve the reliability of explainability methods. Open access and code-sharing resources are essential for transparency and reproducibility, enabling the growth and trustworthiness of explainable research. While challenges exist, an end-to-end approach to explainability in clinical risk prediction, incorporating stakeholders from clinicians to developers, is essential for success.
摘要
近期人工智能在医疗领域的应用显示了很大的承诺,可以超越人类的诊断和疾病预测。然而,随着人工智能模型的复杂度的增加,关于它们的不透明度、潜在偏见和解释性的问题也在关注。为确保人工智能系统的可靠性和可信worthiness,特别是在临床风险预测模型中,解释性变得非常重要。解释性通常指人工智能系统能够提供人类权利者可靠的决策逻辑或决策结果的强有力的解释。在临床风险预测中,其他方面的解释性,如公平、偏见、信任和透明度,也是重要的概念,不仅是解释性。本文评论了这些概念之间的关系,并讨论了最新的开发可解释模型的进展,强调临床实践中多种普遍的Modalities的数据量的评估和验证。它强调需要外部验证和多种解释性方法的组合,以增强可靠性和公平。采用严格的测试,如使用已知生成因素的 sintetic 数据集,可以进一步提高解释性方法的可靠性。开放访问和代码分享资源是必要的,以便透明度和可重现性。虽然存在挑战,但是综合approach,从临床医生到开发者,是必要的 для成功。
Content-based Recommendation Engine for Video Streaming Platform
results: 提出一种基于内容的推荐引擎,可以为用户提供适合其兴趣和选择的视频建议。测试结果显示,提出的引擎性能比较好,精度、回归率和F1核心均达到了预期水平。Abstract
A recommendation engine suggests content, products, or services to the user by using machine learning algorithms. This paper proposes a content-based recommendation engine for providing video suggestions to the user based on their previous interests and choices. We use the TF-IDF text vectorization method to determine the relevance of words in a document. We then find the similarity between items by calculating the cosine similarity between them. Finally, the engine recommends videos to the users based on the obtained similarity score values. In addition, we measure the engine's performance by computing the precision, recall, and F1 score of the proposed system.
摘要
推荐引擎使用机器学习算法向用户推荐内容、产品或服务。这篇论文提出了一种基于内容的推荐引擎,根据用户之前的兴趣和选择为其提供视频建议。我们使用TF-IDF文本向量化方法确定词语在文档中的相关性,然后计算各内容之间的余弦相似度,并根据得到的相似度分值向用户推荐视频。此外,我们还通过计算精确率、召回率和F1分数来评估推荐引擎的性能。
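The pipeline described above (TF-IDF vectorization, cosine similarity, top-k suggestions) can be reproduced in a few lines with scikit-learn. The toy catalogue and watch history below are made up; precision, recall, and F1 would additionally require labelled relevance judgements, which are omitted here.

```python
# Minimal content-based recommender in the spirit described above: TF-IDF
# vectors over video descriptions, cosine similarity between items, and the
# top-scoring unseen items returned as suggestions. The catalogue is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalogue = {
    "space documentary":   "space exploration rockets mars mission documentary",
    "cooking show":        "cooking recipes italian pasta kitchen",
    "rocket launch recap": "rocket launch orbit satellite space",
    "baking masterclass":  "baking bread pastry kitchen recipes",
}
titles = list(catalogue)
tfidf = TfidfVectorizer().fit_transform(catalogue.values())
similarity = cosine_similarity(tfidf)          # item-item similarity matrix

def recommend(watched_title: str, k: int = 2):
    idx = titles.index(watched_title)
    ranked = similarity[idx].argsort()[::-1]   # most similar first
    return [titles[i] for i in ranked if i != idx][:k]

print(recommend("space documentary"))          # e.g. ['rocket launch recap', ...]
```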
Fast Uncertainty Quantification of Spent Nuclear Fuel with Neural Networks
paper_authors: Arnau Albà, Andreas Adelmann, Lucas Münster, Dimitri Rochman, Romana Boiger
for: 这 paper 是为了快速评估核电燃料(SNF)的特性而写的。
methods: 这 paper 使用神经网络(NN)来模拟 SNF 的特性,减少计算成本。
results: NN 可以准确地预测 decay heat 和 nuclide 浓度,响应关键输入参数的变化。模型被验证了,并且可以减少计算成本。Abstract
The accurate calculation and uncertainty quantification of the characteristics of spent nuclear fuel (SNF) play a crucial role in ensuring the safety, efficiency, and sustainability of nuclear energy production, waste management, and nuclear safeguards. State of the art physics-based models, while reliable, are computationally intensive and time-consuming. This paper presents a surrogate modeling approach using neural networks (NN) to predict a number of SNF characteristics with reduced computational costs compared to physics-based models. An NN is trained using data generated from CASMO5 lattice calculations. The trained NN accurately predicts decay heat and nuclide concentrations of SNF, as a function of key input parameters, such as enrichment, burnup, cooling time between cycles, mean boron concentration and fuel temperature. The model is validated against physics-based decay heat simulations and measurements of different uranium oxide fuel assemblies from two different pressurized water reactors. In addition, the NN is used to perform sensitivity analysis and uncertainty quantification. The results are in very good alignment to CASMO5, while the computational costs (taking into account the costs of generating training samples) are reduced by a factor of 10 or more. Our findings demonstrate the feasibility of using NNs as surrogate models for fast characterization of SNF, providing a promising avenue for improving computational efficiency in assessing nuclear fuel behavior and associated risks.
摘要
准确计算乏核燃料(SNF)的特性并量化其不确定性,对保障核能生产、废物管理和核保障监督的安全、效率与可持续性至关重要。现有的基于物理的模型虽然可靠,但计算量大且耗时。这篇论文提出了使用神经网络(NN)作为替代模型来快速预测乏核燃料的多项特性,相比基于物理的模型可显著降低计算成本。NN模型使用CASMO5栅元计算生成的数据进行训练,能够根据浓缩度、燃耗、循环间冷却时间、平均硼浓度和燃料温度等关键输入参数,准确预测SNF的衰变热和核素浓度。该模型通过基于物理模型的衰变热模拟,以及来自两座不同压水堆的多种铀氧化物燃料组件的测量数据进行了验证。此外,NN模型还被用于进行敏感性分析和不确定性量化。结果与CASMO5高度一致,而计算成本(计入生成训练样本的成本)降低了10倍或更多。这些发现表明使用NN作为替代模型可以快速表征SNF特性,为提高核燃料行为及相关风险评估的计算效率提供了一条有前景的途径。
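The surrogate-modelling idea is straightforward to sketch: fit a small neural network that maps input parameters such as enrichment, burnup, and cooling time to a quantity like decay heat. In the sketch below the training targets come from an invented analytic response rather than CASMO5 lattice calculations, so it only illustrates the workflow, not the paper's model.

```python
# Sketch of the surrogate-modelling workflow: train a small neural network
# mapping fuel parameters to a scalar response. The synthetic target below
# stands in for lattice-code (e.g. CASMO5) outputs and is purely illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Columns: enrichment (wt%), burnup (GWd/t), cooling time (years) -- illustrative ranges.
X = rng.uniform([2.0, 10.0, 1.0], [5.0, 60.0, 40.0], size=(2000, 3))
# Placeholder "decay heat" response with a little noise.
y = 100.0 * X[:, 1] / (1.0 + 0.1 * X[:, 2]) + 5.0 * X[:, 0] + rng.normal(0.0, 1.0, 2000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
surrogate = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
)
surrogate.fit(X_train, y_train)
print(f"held-out R^2: {surrogate.score(X_test, y_test):.3f}")
```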
results: 在各种情况下,Continuous Sweep比Median Sweep表现更好,并且可以通过分析表达来找到最佳决策边界。Abstract
Quantification is a supervised machine learning task, focused on estimating the class prevalence of a dataset rather than labeling its individual observations. We introduce Continuous Sweep, a new parametric binary quantifier inspired by the well-performing Median Sweep. Median Sweep is currently one of the best binary quantifiers, but we have changed this quantifier on three points, namely 1) using parametric class distributions instead of empirical distributions, 2) optimizing decision boundaries instead of applying discrete decision rules, and 3) calculating the mean instead of the median. We derive analytic expressions for the bias and variance of Continuous Sweep under general model assumptions. This is one of the first theoretical contributions in the field of quantification learning. Moreover, these derivations enable us to find the optimal decision boundaries. Finally, our simulation study shows that Continuous Sweep outperforms Median Sweep in a wide range of situations.
摘要
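For intuition about sweep-style quantifiers, the sketch below applies the classic adjusted-count correction at many thresholds, (observed positive rate - FPR) / (TPR - FPR), and aggregates the per-threshold estimates with a median (Median Sweep's choice) and a mean (closer in spirit to Continuous Sweep, which additionally uses parametric class distributions and optimized decision boundaries). The simulated scores and the threshold-filtering rule are illustrative assumptions, not the paper's estimator.

```python
# Illustrative sketch of sweep-style quantification: for each threshold t, the
# adjusted-count estimate is (observed positive rate - FPR) / (TPR - FPR); the
# per-threshold estimates are then aggregated over thresholds.
import numpy as np

rng = np.random.default_rng(1)
true_prevalence = 0.3

def sample_scores(n, prevalence):
    """Classifier scores: positives ~ N(1, 1), negatives ~ N(-1, 1)."""
    labels = rng.random(n) < prevalence
    scores = np.where(labels, rng.normal(1, 1, n), rng.normal(-1, 1, n))
    return scores, labels

val_scores, val_labels = sample_scores(5000, 0.5)      # labelled validation data
test_scores, _ = sample_scores(5000, true_prevalence)  # unlabelled test data

estimates = []
for t in np.linspace(-2, 2, 81):
    tpr = np.mean(val_scores[val_labels] >= t)
    fpr = np.mean(val_scores[~val_labels] >= t)
    if tpr - fpr > 0.25:                               # keep well-separated thresholds
        observed = np.mean(test_scores >= t)
        estimates.append((observed - fpr) / (tpr - fpr))

print(f"median-sweep style estimate: {np.median(estimates):.3f}")
print(f"mean-sweep style estimate:   {np.mean(estimates):.3f}")
print(f"true prevalence:             {true_prevalence:.3f}")
```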
Precision and Recall Reject Curves for Classification
methods: 使用 prototype-based classifiers from learning vector quantization
results: 提供了一种新的评估方法,可以更准确地评估模型的性能,尤其适用于类别分布不均衡的数据场景。Abstract
For some classification scenarios, it is desirable to use only those classification instances that a trained model associates with a high certainty. To obtain such high-certainty instances, previous work has proposed accuracy-reject curves. Reject curves allow to evaluate and compare the performance of different certainty measures over a range of thresholds for accepting or rejecting classifications. However, the accuracy may not be the most suited evaluation metric for all applications, and instead precision or recall may be preferable. This is the case, for example, for data with imbalanced class distributions. We therefore propose reject curves that evaluate precision and recall, the recall-reject curve and the precision-reject curve. Using prototype-based classifiers from learning vector quantization, we first validate the proposed curves on artificial benchmark data against the accuracy reject curve as a baseline. We then show on imbalanced benchmarks and medical, real-world data that for these scenarios, the proposed precision- and recall-curves yield more accurate insights into classifier performance than accuracy reject curves.
摘要
在某些分类场景中,人们希望只使用训练好的模型以高确定性关联的那些分类实例。为了获得这些高确定性实例,先前的工作提出了准确率-拒绝曲线。拒绝曲线可以在一系列接受或拒绝分类的阈值上,评估和比较不同确定性度量的性能。然而,准确率并非在所有应用中都是最合适的评估指标,有时精确率或召回率更为可取,例如在类别分布不均衡的数据上。因此,我们提出了评估精确率和召回率的拒绝曲线,即召回率-拒绝曲线和精确率-拒绝曲线。我们使用基于学习向量量化(LVQ)的原型分类器,首先在人工基准数据上以准确率-拒绝曲线为基线验证所提出的曲线。然后,我们在类别不均衡的基准数据和医学真实数据上表明,在这些场景中,所提出的精确率曲线和召回率曲线比准确率-拒绝曲线能更准确地揭示分类器的性能。
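Reject curves of the kind proposed above are easy to compute once a certainty value is attached to each prediction: sort by certainty, drop the least-certain fraction, and evaluate the metric on what remains. The sketch below uses a logistic-regression probability as a stand-in certainty measure on an imbalanced toy dataset; the paper itself works with prototype-based (LVQ) certainty measures.

```python
# Sketch of reject curves: reject the least-certain fraction of predictions and
# track accuracy / precision / recall on the retained samples. The toy data and
# the probability-based certainty proxy are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X[:2000], y[:2000])
proba = clf.predict_proba(X[2000:])[:, 1]
y_test = y[2000:]
y_pred = (proba >= 0.5).astype(int)
certainty = np.abs(proba - 0.5)                     # distance from the decision boundary

order = np.argsort(certainty)                       # least certain first
for reject_frac in (0.0, 0.1, 0.3, 0.5):
    keep = order[int(reject_frac * len(order)):]    # drop the least-certain fraction
    acc = accuracy_score(y_test[keep], y_pred[keep])
    prec = precision_score(y_test[keep], y_pred[keep], zero_division=0)
    rec = recall_score(y_test[keep], y_pred[keep], zero_division=0)
    print(f"reject {reject_frac:.0%}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```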
A distributed neural network architecture for dynamic sensor selection with application to bandwidth-constrained body-sensor networks
results: 这个方法可以将选择最佳通道分布到不同的节点上,并且可以对实际的体内侦测网络(WSN)进行验证,并分析传输负载和任务准确性之间的交易。Abstract
We propose a dynamic sensor selection approach for deep neural networks (DNNs), which is able to derive an optimal sensor subset selection for each specific input sample instead of a fixed selection for the entire dataset. This dynamic selection is jointly learned with the task model in an end-to-end way, using the Gumbel-Softmax trick to allow the discrete decisions to be learned through standard backpropagation. We then show how we can use this dynamic selection to increase the lifetime of a wireless sensor network (WSN) by imposing constraints on how often each node is allowed to transmit. We further improve performance by including a dynamic spatial filter that makes the task-DNN more robust against the fact that it now needs to be able to handle a multitude of possible node subsets. Finally, we explain how the selection of the optimal channels can be distributed across the different nodes in a WSN. We validate this method on a use case in the context of body-sensor networks, where we use real electroencephalography (EEG) sensor data to emulate an EEG sensor network. We analyze the resulting trade-offs between transmission load and task accuracy.
摘要
我们提出了一种动态感知选择方法,用于深度神经网络(DNN),可以在每个特定输入样本上选择最佳感知subset,而不是整个数据集中固定的选择。这种动态选择与任务模型在综合的方式中同时学习,使用Gumbel-Softmax技巧,以使得柔性决策可以通过标准反馈来学习。然后,我们介绍了如何使用这种动态选择来增加无线传感器网络(WSN)的寿命,通过限制每个节点发送的次数。此外,我们还提高了性能,通过包括动态空间筛选器,使任务-DNN更加抗性能于面临多个可能的节点subset。最后,我们解释了如何选择优化的通道。我们验证了这种方法,使用了真实的电enzephalography(EEG)感知数据,模拟了EEG感知网络。我们分析了结果的平衡 Trade-offs between transmission load和任务准确率。
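The Gumbel-Softmax trick referenced above lets a discrete channel selection be trained with ordinary backpropagation. The sketch below learns a small set of (approximately) one-hot channel masks jointly with a tiny task head on random data; the network sizes, the number of selected channels, and the data are placeholders, not the paper's EEG sensor-network setup.

```python
# Sketch of learning a discrete channel selection with the Gumbel-Softmax
# trick: selection logits are relaxed to (near-)one-hot masks during training
# so gradients flow through the choice. Sizes and data are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_channels, n_select, feat_dim = 16, 4, 32

class ChannelSelector(nn.Module):
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_select, n_channels))

    def forward(self, x, tau=1.0):              # x: (batch, n_channels, feat_dim)
        # One (approximately) one-hot row per selected slot; straight-through with hard=True.
        masks = F.gumbel_softmax(self.logits, tau=tau, hard=True)   # (n_select, n_channels)
        return torch.einsum("sc,bcf->bsf", masks, x)                # gather selected channels

selector = ChannelSelector()
task_head = nn.Sequential(nn.Flatten(), nn.Linear(n_select * feat_dim, 2))
optim = torch.optim.Adam(list(selector.parameters()) + list(task_head.parameters()), lr=1e-2)

x = torch.randn(8, n_channels, feat_dim)
y = torch.randint(0, 2, (8,))
for _ in range(5):                              # a few illustrative training steps
    logits = task_head(selector(x))
    loss = F.cross_entropy(logits, y)
    optim.zero_grad()
    loss.backward()
    optim.step()
print(selector.logits.argmax(dim=1))            # currently preferred channel per slot
```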
PDPK: A Framework to Synthesise Process Data and Corresponding Procedural Knowledge for Manufacturing
paper_authors: Richard Nordsieck, André Schweizer, Michael Heider, Jörg Hähner for: 本研究的目的是提供一个框架,可以生成适用于不同领域的合成数据集,以便模拟实际中的程序性知识。methods: 本研究使用的方法包括:(1) 基于 Resource Description Framework (RDF) 的知识图谱设计,(2) 模拟参数化过程,(3) 使用现有的嵌入方法来表示程序性知识。results: 本研究的结果包括:(1) 一个可以适应不同领域的合成数据集生成框架,(2) 一个基于 RDF 知识图谱的表示方法,(3) 一个用于评估嵌入方法的基线比较结果。Abstract
Procedural knowledge describes how to accomplish tasks and mitigate problems. Such knowledge is commonly held by domain experts, e.g. operators in manufacturing who adjust parameters to achieve quality targets. To the best of our knowledge, no real-world datasets containing process data and corresponding procedural knowledge are publicly available, possibly due to corporate apprehensions regarding the loss of knowledge advances. Therefore, we provide a framework to generate synthetic datasets that can be adapted to different domains. The design choices are inspired by two real-world datasets of procedural knowledge we have access to. Apart from containing representations of procedural knowledge in Resource Description Framework (RDF)-compliant knowledge graphs, the framework simulates parametrisation processes and provides consistent process data. We compare established embedding methods on the resulting knowledge graphs, detailing which out-of-the-box methods have the potential to represent procedural knowledge. This provides a baseline which can be used to increase the comparability of future work. Furthermore, we validate the overall characteristics of a synthesised dataset by comparing the results to those achievable on a real-world dataset. The framework and evaluation code, as well as the dataset used in the evaluation, are available open source.
摘要
“程序性知识”描述了如何完成任务和解决问题。这种知识通常由领域专家所拥有,例如制造业中的操作员,他们会调整参数以达到质量目标。据我们所知,没有公开可用的实际世界数据集,可能因为企业对知识前进的担忧。因此,我们提供了一个框架,可以生成可靠的Synthetic数据集,可以适应不同领域。这个框架基于我们可以获得的两个实际世界数据集的知识,并且模拟参数化过程,提供一致的处理数据。我们使用已有的嵌入方法对知识图进行评估,并详细介绍这些方法的潜在表现。此外,我们还验证了生成的数据集的总性特征,并与实际世界数据集进行比较。框架和评估代码以及使用于评估的数据集都可以公开获取。
Dual-Branch Temperature Scaling Calibration for Long-Tailed Recognition
methods: 本研究基于温度缩放(TS)方法,设计了双分支温度缩放模型(Dual-TS),同时考虑不同类别温度参数的多样性,以及少数类中稀有样本温度参数的非泛化性。
results: 通过实验,我们示出了我们的模型在传统ECE和Esbin-ECE评价指标上均达到了顶尖性。Abstract
The calibration for deep neural networks is currently receiving widespread attention and research. Miscalibration usually leads to overconfidence of the model. Under the condition of long-tailed distribution of data, the problem of miscalibration is more prominent due to the different confidence levels of samples in minority and majority categories, and it will result in more serious overconfidence. To address this problem, some current research has designed diverse temperature coefficients for different categories based on the temperature scaling (TS) method. However, in the case of rare samples in minority classes, the temperature coefficient is not generalizable, and there is a large difference between the temperature coefficients of the training set and the validation set. To solve this challenge, this paper proposes a dual-branch temperature scaling calibration model (Dual-TS), which considers the diversities in temperature parameters of different categories and the non-generalizability of temperature parameters for rare samples in minority classes simultaneously. Moreover, we noticed that the traditional calibration evaluation metric, Expected Calibration Error (ECE), gives a higher weight to low-confidence samples in the minority classes, which leads to inaccurate evaluation of model calibration. Therefore, we also propose Equal Sample Bin Expected Calibration Error (Esbin-ECE) as a new calibration evaluation metric. Through experiments, we demonstrate that our model yields state-of-the-art results in both traditional ECE and Esbin-ECE metrics.
摘要
目前,深度神经网络的校准正受到广泛的关注和研究。校准不良通常会导致模型过度自信。而在数据呈长尾分布的情况下,由于多数类和少数类样本的置信水平不同,校准不良的问题更为突出,并会造成更严重的过度自信。为解决这一问题,现有一些研究基于温度缩放(TS)方法为不同类别设计了不同的温度系数。然而,对于少数类中的稀有样本,温度系数不具泛化性,训练集与验证集的温度系数之间存在较大差异。为应对这一挑战,本文提出了双分支温度缩放校准模型(Dual-TS),同时考虑不同类别温度参数的多样性以及少数类稀有样本温度参数的非泛化性。此外,我们注意到传统的校准评价指标Expected Calibration Error(ECE)会给少数类中低置信度样本更高的权重,从而导致对模型校准的评估不准确。因此,我们还提出了Equal Sample Bin Expected Calibration Error(Esbin-ECE)作为新的校准评价指标。实验表明,我们的模型在传统ECE和Esbin-ECE两个指标下均达到了最先进的性能。
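For background, the sketch below shows the two standard ingredients the paper builds on: fitting a single temperature on held-out logits by minimizing negative log-likelihood, and measuring Expected Calibration Error by binning confidences. The random logits are placeholders for a real model's outputs, and the paper's dual-branch, class-wise temperatures and Esbin-ECE metric are not reproduced.

```python
# Background sketch: standard (single-temperature) scaling fits one scalar T on
# held-out logits by minimising NLL, and ECE bins predictions by confidence.
# The random "logits" below are placeholders for a real model's outputs.
import torch
import torch.nn.functional as F

def expected_calibration_error(probs, labels, n_bins=15):
    conf, pred = probs.max(dim=1)
    correct = pred.eq(labels).float()
    ece = torch.zeros(())
    bins = torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.float().mean() * (correct[in_bin].mean() - conf[in_bin].mean()).abs()
    return ece.item()

torch.manual_seed(0)
logits = 3.0 * torch.randn(2000, 10)            # placeholder, deliberately overconfident
labels = torch.randint(0, 10, (2000,))

temperature = torch.nn.Parameter(torch.ones(1))
optim = torch.optim.LBFGS([temperature], lr=0.1, max_iter=50)

def closure():
    optim.zero_grad()
    loss = F.cross_entropy(logits / temperature, labels)
    loss.backward()
    return loss

optim.step(closure)
print("fitted temperature:", temperature.item())
print("ECE before:", expected_calibration_error(F.softmax(logits, dim=1), labels))
print("ECE after: ", expected_calibration_error(F.softmax(logits / temperature.detach(), dim=1), labels))
```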
KernelWarehouse: Towards Parameter-Efficient Dynamic Convolution
results: 研究人员通过对 ImageNet 和 MS-COCO 数据集使用不同的 ConvNet 架构,并使用 KernelWarehouse 模型,达到了当前最佳的结果。例如,使用 KernelWarehouse 模型在 ImageNet 上训练 ResNet18|ResNet50|MobileNetV2|ConvNeXt-Tiny 模型,可以达到 76.05%|81.05%|75.52%|82.51% 的 top-1 准确率。此外,由于 KernelWarehouse 的灵活设计,可以将 ConvNet 模型的大小减小,同时提高准确率,例如,我们的 ResNet18 模型在 36.45%|65.10% 参数减少后,对比基eline 模型,可以提高 2.89%|2.29% 的绝对准确率。Abstract
Dynamic convolution learns a linear mixture of $n$ static kernels weighted with their sample-dependent attentions, demonstrating superior performance compared to normal convolution. However, existing designs are parameter-inefficient: they increase the number of convolutional parameters by $n$ times. This and the optimization difficulty lead to no research progress in dynamic convolution that can allow us to use a significant large value of $n$ (e.g., $n>100$ instead of typical setting $n<10$) to push forward the performance boundary. In this paper, we propose $KernelWarehouse$, a more general form of dynamic convolution, which can strike a favorable trade-off between parameter efficiency and representation power. Its key idea is to redefine the basic concepts of "$kernels$" and "$assembling$ $kernels$" in dynamic convolution from the perspective of reducing kernel dimension and increasing kernel number significantly. In principle, KernelWarehouse enhances convolutional parameter dependencies within the same layer and across successive layers via tactful kernel partition and warehouse sharing, yielding a high degree of freedom to fit a desired parameter budget. We validate our method on ImageNet and MS-COCO datasets with different ConvNet architectures, and show that it attains state-of-the-art results. For instance, the ResNet18|ResNet50|MobileNetV2|ConvNeXt-Tiny model trained with KernelWarehouse on ImageNet reaches 76.05%|81.05%|75.52%|82.51% top-1 accuracy. Thanks to its flexible design, KernelWarehouse can even reduce the model size of a ConvNet while improving the accuracy, e.g., our ResNet18 model with 36.45%|65.10% parameter reduction to the baseline shows 2.89%|2.29% absolute improvement to top-1 accuracy.
摘要
“动态核函数学习一种线性混合的 $n$ 个静止核函数,显示出比普通核函数更高的性能。然而,现有设计是参数不充分利用的:它们将核函数参数数量增加 $n$ 倍。这和优化困难导致无法进行大量 $n$ 的研究进步,而 $n$ 通常设置在 10 左右。在这篇论文中,我们提出了 $KernelWarehouse$,一种更通用的动态核函数设计,可以在参数效率和表示力之间做出一个平衡。它的关键思想是通过重新定义动态核函数中 "$核函数" 和 "$核函数组合" 的概念,从减少核函数维度和增加核函数数量的角度来提高卷积参数的相互依赖性和层之间的参数共享,从而获得一定的参数预算的自由度。我们在 ImageNet 和 MS-COCO 数据集上验证了我们的方法,并显示其可以达到领先的Result。例如,我们在 ImageNet 上使用 KernelWarehouse 训练 ResNet18|ResNet50|MobileNetV2|ConvNeXt-Tiny 模型,达到 76.05%|81.05%|75.52%|82.51% 的 top-1 准确率。另外,由于 KernelWarehouse 的灵活设计,它可以减少 ConvNet 模型的大小,同时提高准确率,例如,我们的 ResNet18 模型,在参数减少 36.45%|65.10% 后,对比基eline的准确率提高 2.89%|2.29%。”
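Plain dynamic convolution, the baseline that KernelWarehouse generalizes, is easy to sketch: n static kernels are mixed by sample-dependent attention weights before a single convolution is applied. The per-sample loop below is written for clarity rather than speed, and none of this reflects the kernel-partition or warehouse-sharing machinery of the paper.

```python
# Minimal sketch of plain dynamic convolution: n static kernels are combined
# with sample-dependent attention weights and the mixed kernel is applied with
# one convolution. A naive per-sample loop is used for clarity only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, n_kernels=4):
        super().__init__()
        self.kernels = nn.Parameter(torch.randn(n_kernels, out_ch, in_ch, k, k) * 0.02)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, n_kernels),
        )
        self.padding = k // 2

    def forward(self, x):                                   # x: (B, in_ch, H, W)
        weights = F.softmax(self.attn(x), dim=1)            # (B, n_kernels)
        outputs = []
        for i in range(x.size(0)):                          # per-sample mixed kernel
            mixed = (weights[i][:, None, None, None, None] * self.kernels).sum(dim=0)
            outputs.append(F.conv2d(x[i : i + 1], mixed, padding=self.padding))
        return torch.cat(outputs, dim=0)

layer = DynamicConv2d(in_ch=8, out_ch=16, n_kernels=4)
out = layer(torch.randn(2, 8, 32, 32))
print(out.shape)                                            # torch.Size([2, 16, 32, 32])
```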
Independent Distribution Regularization for Private Graph Embedding
results: 实验结果表明,PVGAE在三个真实数据集上的效用性能和隐私保护性都优于基线方法。Abstract
Learning graph embeddings is a crucial task in graph mining tasks. An effective graph embedding model can learn low-dimensional representations from graph-structured data for data publishing benefiting various downstream applications such as node classification, link prediction, etc. However, recent studies have revealed that graph embeddings are susceptible to attribute inference attacks, which allow attackers to infer private node attributes from the learned graph embeddings. To address these concerns, privacy-preserving graph embedding methods have emerged, aiming to simultaneously consider primary learning and privacy protection through adversarial learning. However, most existing methods assume that representation models have access to all sensitive attributes in advance during the training stage, which is not always the case due to diverse privacy preferences. Furthermore, the commonly used adversarial learning technique in privacy-preserving representation learning suffers from unstable training issues. In this paper, we propose a novel approach called Private Variational Graph AutoEncoders (PVGAE) with the aid of independent distribution penalty as a regularization term. Specifically, we split the original variational graph autoencoder (VGAE) to learn sensitive and non-sensitive latent representations using two sets of encoders. Additionally, we introduce a novel regularization to enforce the independence of the encoders. We prove the theoretical effectiveness of regularization from the perspective of mutual information. Experimental results on three real-world datasets demonstrate that PVGAE outperforms other baselines in private embedding learning regarding utility performance and privacy protection.
摘要
学习图嵌入是图挖掘任务中的一项关键任务。一个有效的图嵌入模型可以从图结构数据中学习低维表示,用于数据发布,并惠及节点分类、链接预测等多种下游应用。然而,最近的研究表明,图嵌入容易受到属性推断攻击,使攻击者能够从学习到的图嵌入中推断出节点的私有属性。为应对这些问题,出现了保护隐私的图嵌入方法,旨在通过对抗学习同时兼顾主学习任务与隐私保护。然而,大多数现有方法假设表示模型在训练阶段就能预先获得所有敏感属性,由于隐私偏好的多样性,这一假设并不总是成立。此外,隐私保护表示学习中常用的对抗学习技术存在训练不稳定的问题。在本文中,我们提出了一种名为私有变分图自编码器(PVGAE)的新方法,并引入独立分布惩罚作为正则项。具体而言,我们将原始的变分图自编码器(VGAE)拆分为两组编码器,分别学习敏感与非敏感的潜在表示。此外,我们引入一种新的正则化来强制编码器之间的独立性,并从互信息的角度证明了该正则化的理论有效性。在三个真实数据集上的实验结果表明,PVGAE在效用性能和隐私保护两方面均优于其他基线方法。
Convergence of Two-Layer Regression with Nonlinear Units
results: 作者计算了一个准确的表示形式 для损失函数的资深特征,并证明了这个表示形式下的梯度是 lipschitz 连续和归一化的。Abstract
Large language models (LLMs), such as ChatGPT and GPT4, have shown outstanding performance on many tasks of human life. Attention computation plays an important role in training LLMs. The softmax unit and the ReLU unit are the key structures in attention computation. Inspired by them, we put forward a softmax ReLU regression problem. Generally speaking, our goal is to find an optimal solution to the regression problem involving the ReLU unit. In this work, we calculate a closed-form representation for the Hessian of the loss function. Under certain assumptions, we prove the Lipschitz continuity and the PSDness of the Hessian. Then, we introduce a greedy algorithm based on an approximate Newton method, which converges in the sense of the distance to the optimal solution. Lastly, we relax the Lipschitz condition and prove the convergence in the sense of the loss value.
摘要
大型语言模型(LLM),如ChatGPT和GPT4,在许多人类生活任务中表现出色。注意力计算在训练LLM中扮演重要角色,而Softmax单元和ReLU单元是注意力计算中的关键结构。受其启发,我们提出了Softmax ReLU回归问题。总体而言,我们的目标是为包含ReLU单元的回归问题找到最优解。在这项工作中,我们推导了损失函数Hessian的闭式表达。在一定假设下,我们证明了Hessian的Lipschitz连续性和半正定性。接着,我们提出了一种基于近似牛顿法的贪心算法,它在到最优解的距离意义下收敛。最后,我们放宽了Lipschitz条件,并证明了在损失值意义下的收敛。
Is Meta-Learning the Right Approach for the Cold-Start Problem in Recommender Systems?
results: 研究表明,当正确地调整深度学习模型时,它们可以与更新的 meta-learning 模型相比,在常用的benchmark上达到同等或更高的性能。此外,研究还表明了一种非常简单的模块化方法,可以在实际应用中更易地实现。Abstract
Recommender systems have become fundamental building blocks of modern online products and services, and have a substantial impact on user experience. In the past few years, deep learning methods have attracted a lot of research, and are now heavily used in modern real-world recommender systems. Nevertheless, dealing with recommendations in the cold-start setting, e.g., when a user has done limited interactions in the system, is a problem that remains far from solved. Meta-learning techniques, and in particular optimization-based meta-learning, have recently become the most popular approaches in the academic research literature for tackling the cold-start problem in deep learning models for recommender systems. However, current meta-learning approaches are not practical for real-world recommender systems, which have billions of users and items, and strict latency requirements. In this paper we show that it is possible to obtaining similar, or higher, performance on commonly used benchmarks for the cold-start problem without using meta-learning techniques. In more detail, we show that, when tuned correctly, standard and widely adopted deep learning models perform just as well as newer meta-learning models. We further show that an extremely simple modular approach using common representation learning techniques, can perform comparably to meta-learning techniques specifically designed for the cold-start setting while being much more easily deployable in real-world applications.
摘要
Graph Out-of-Distribution Generalization with Controllable Data Augmentation
paper_authors: Bin Lu, Xiaoying Gan, Ze Zhao, Shiyu Liang, Luoyi Fu, Xinbing Wang, Chenghu Zhou for: This paper aims to address the issue of distribution deviation in graph neural network (GNN) training, specifically the problem of hybrid structure distribution shift, which can lead to spurious correlations and degrade the performance of GNN methods.methods: The proposed method, called \texttt{OOD-GMixup}, uses controllable data augmentation in the metric space to manipulate the training distribution and alleviate the distribution deviation problem. Specifically, the method involves extracting graph rationales to eliminate spurious correlations, generating virtual samples with perturbation on the graph rationale representation domain, and using OOD calibration to measure the distribution deviation of virtual samples.results: The proposed method outperforms state-of-the-art baselines on several real-world datasets for graph classification tasks, demonstrating its effectiveness in addressing the distribution deviation problem in GNN training.Abstract
Graph Neural Network (GNN) has demonstrated extraordinary performance in classifying graph properties. However, due to the selection bias of training and testing data (e.g., training on small graphs and testing on large graphs, or training on dense graphs and testing on sparse graphs), distribution deviation is widespread. More importantly, we often observe \emph{hybrid structure distribution shift} of both scale and density, despite of one-sided biased data partition. The spurious correlations over hybrid distribution deviation degrade the performance of previous GNN methods and show large instability among different datasets. To alleviate this problem, we propose \texttt{OOD-GMixup} to jointly manipulate the training distribution with \emph{controllable data augmentation} in metric space. Specifically, we first extract the graph rationales to eliminate the spurious correlations due to irrelevant information. Secondly, we generate virtual samples with perturbation on graph rationale representation domain to obtain potential OOD training samples. Finally, we propose OOD calibration to measure the distribution deviation of virtual samples by leveraging Extreme Value Theory, and further actively control the training distribution by emphasizing the impact of virtual OOD samples. Extensive studies on several real-world datasets on graph classification demonstrate the superiority of our proposed method over state-of-the-art baselines.
摘要
图 нейрон网络(GNN)在分类图属性方面表现出色,但由于训练和测试数据的选择偏见(例如训练小图测试大图,或训练密集图测试稀疏图),流行范围偏差。更重要的是,我们经常观察到 hybrid 结构分布偏移,即一种同时具有规模和密度偏移的分布偏移,尽管数据分区是一种一侧偏见。这种偏移导致过去的 GNN 方法表现不稳定,并且在不同的数据集上显示出大的变化。为解决这个问题,我们提出了 \texttt{OOD-GMixup},它通过在度量空间进行可控的数据增强来同时 manipulate 训练分布。具体来说,我们首先提取图理由来消除由无关信息引起的假 positives。然后,我们通过对图理由表示域进行扰动来生成可能的 OOD 训练样本。最后,我们提出了 OOD 校准,通过沿用极值理论来测量虚拟 OOD 样本的分布偏移,并且通过强调虚拟 OOD 样本的影响来活动控制训练分布。我们在多个实际 dataset 上进行了广泛的研究,并证明了我们的提议方法的超越性。
Learning Logic Programs by Discovering Higher-Order Abstractions
results: 我们在多个领域,包括程序生成和视觉理解,进行了实验。结果显示,相比无 refactoring,STEVIE可以提高预测精度27%,降低学习时间47%。此外,STEVIE还可以找到可以在不同领域中传递的抽象。Abstract
Discovering novel abstractions is important for human-level AI. We introduce an approach to discover higher-order abstractions, such as map, filter, and fold. We focus on inductive logic programming, which induces logic programs from examples and background knowledge. We introduce the higher-order refactoring problem, where the goal is to compress a logic program by introducing higher-order abstractions. We implement our approach in STEVIE, which formulates the higher-order refactoring problem as a constraint optimisation problem. Our experimental results on multiple domains, including program synthesis and visual reasoning, show that, compared to no refactoring, STEVIE can improve predictive accuracies by 27% and reduce learning times by 47%. We also show that STEVIE can discover abstractions that transfer to different domains
摘要
发现新的抽象是人类水平AI的关键。我们介绍了一种方法,用于发现更高级别的抽象,如地图、筛选和折叠。我们关注了逻辑编程,它从示例和背景知识中生成逻辑程序。我们定义了更高级别的重构问题,其目标是通过引入更高级别的抽象来压缩逻辑程序。我们在STEVIE中实现了我们的方法,它将重构问题转化为约束优化问题。我们的实验结果在多个领域,包括程序生成和视觉理解,表明,相比无 refactoring,STEVIE可以提高预测精度 by 27%,并将学习时间减少 by 47%。我们还表明STEVIE可以找到可以在不同领域传递的抽象。
Warped geometric information on the optimisation of Euclidean functions
results: 使用测地线的三阶泰勒近似来提高优化效率,并且比标准的欧几里得梯度方法收敛更快。Abstract
We consider the fundamental task of optimizing a real-valued function defined in a potentially high-dimensional Euclidean space, such as the loss function in many machine-learning tasks or the logarithm of the probability distribution in statistical inference. We use the warped Riemannian geometry notions to redefine the optimisation problem of a function on Euclidean space to a Riemannian manifold with a warped metric, and then find the function's optimum along this manifold. The warped metric chosen for the search domain induces a computational friendly metric-tensor for which optimal search directions associate with geodesic curves on the manifold becomes easier to compute. Performing optimization along geodesics is known to be generally infeasible, yet we show that in this specific manifold we can analytically derive Taylor approximations up to third-order. In general these approximations to the geodesic curve will not lie on the manifold, however we construct suitable retraction maps to pull them back onto the manifold. Therefore, we can efficiently optimize along the approximate geodesic curves. We cover the related theory, describe a practical optimization algorithm and empirically evaluate it on a collection of challenging optimisation benchmarks. Our proposed algorithm, using third-order approximation of geodesics, outperforms standard Euclidean gradient-based counterparts in term of number of iterations until convergence and an alternative method for Hessian-based optimisation routines.
摘要
我们考虑在可能高维的欧几里得空间中优化实值函数这一基本任务,例如许多机器学习任务中的损失函数或统计推断中概率分布的对数。我们利用扭曲黎曼几何的概念,将欧几里得空间上的函数优化问题重新定义为带扭曲度量的黎曼流形上的优化问题,然后沿该流形寻找函数的最优点。为搜索域选择的扭曲度量会诱导出一个计算友好的度量张量,使与流形上测地线相关联的最优搜索方向更易于计算。沿测地线进行优化通常被认为是不可行的,但我们证明在这一特定流形上可以解析地推导出至三阶的泰勒近似。一般而言,这些对测地线的近似不会落在流形上,不过我们构建了合适的回缩映射将其拉回流形。因此,我们可以沿近似测地线高效地进行优化。我们阐述了相关理论,描述了一个实用的优化算法,并在一组具有挑战性的优化基准上对其进行了实证评估。我们提出的算法使用测地线的三阶泰勒近似,在收敛所需的迭代次数上优于标准的欧几里得梯度方法,也优于另一种基于Hessian的优化方法。
results: 本研究显示,RoBOS 算法可以在不同的学习问题中具有抑 Linear 的宽松偏误,并且可以独立于分布变化的量来控制偏误。Abstract
Distributional shifts pose a significant challenge to achieving robustness in contemporary machine learning. To overcome this challenge, robust satisficing (RS) seeks a robust solution to an unspecified distributional shift while achieving a utility above a desired threshold. This paper focuses on the problem of RS in contextual Bayesian optimization when there is a discrepancy between the true and reference distributions of the context. We propose a novel robust Bayesian satisficing algorithm called RoBOS for noisy black-box optimization. Our algorithm guarantees sublinear lenient regret under certain assumptions on the amount of distribution shift. In addition, we define a weaker notion of regret called robust satisficing regret, in which our algorithm achieves a sublinear upper bound independent of the amount of distribution shift. To demonstrate the effectiveness of our method, we apply it to various learning problems and compare it to other approaches, such as distributionally robust optimization.
摘要
DFedADMM: Dual Constraints Controlled Model Inconsistency for Decentralized Federated Learning
results: 在非凸设置下,DFedADMM 和 DFedADMM-SAM 的收敛速度分别为 $\mathcal{O}\Big(\frac{1}{\sqrt{KT}}+\frac{1}{KT(1-\psi)^2}\Big)$ 和 $\mathcal{O}\Big(\frac{1}{\sqrt{KT}}+\frac{1}{KT(1-\psi)^2}+ \frac{1}{T^{3/2}K^{1/2}}\Big)$;在 MNIST、CIFAR10 和 CIFAR100 数据集上的实验表明,我们的算法在泛化性能和收敛速度方面均显著优于现有 SOTA 优化器。Abstract
To address the communication burden issues associated with federated learning (FL), decentralized federated learning (DFL) discards the central server and establishes a decentralized communication network, where each client communicates only with neighboring clients. However, existing DFL methods still suffer from two major challenges: local inconsistency and local heterogeneous overfitting, which have not been fundamentally addressed by existing DFL methods. To tackle these issues, we propose novel DFL algorithms, DFedADMM and its enhanced version DFedADMM-SAM, to enhance the performance of DFL. The DFedADMM algorithm employs primal-dual optimization (ADMM) by utilizing dual variables to control the model inconsistency raised from the decentralized heterogeneous data distributions. The DFedADMM-SAM algorithm further improves on DFedADMM by employing a Sharpness-Aware Minimization (SAM) optimizer, which uses gradient perturbations to generate locally flat models and searches for models with uniformly low loss values to mitigate local heterogeneous overfitting. Theoretically, we derive convergence rates of $\small \mathcal{O}\Big(\frac{1}{\sqrt{KT}}+\frac{1}{KT(1-\psi)^2}\Big)$ and $\small \mathcal{O}\Big(\frac{1}{\sqrt{KT}}+\frac{1}{KT(1-\psi)^2}+ \frac{1}{T^{3/2}K^{1/2}}\Big)$ in the non-convex setting for DFedADMM and DFedADMM-SAM, respectively, where $1 - \psi$ represents the spectral gap of the gossip matrix. Empirically, extensive experiments on MNIST, CIFAR10 and CIFAR100 datasets demonstrate that our algorithms exhibit superior performance in terms of both generalization and convergence speed compared to existing state-of-the-art (SOTA) optimizers in DFL.
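The spectral gap $1-\psi$ of the gossip matrix appears directly in the convergence rates above. The following minimal sketch computes it for an assumed ring communication topology with uniform neighbor weights; the topology and weights are illustrative and not taken from the paper.

```python
import numpy as np

def ring_gossip_matrix(n: int) -> np.ndarray:
    """Symmetric doubly stochastic gossip matrix for a ring of n clients,
    where each client averages itself with its two neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1 / 3
        W[i, (i - 1) % n] = 1 / 3
        W[i, (i + 1) % n] = 1 / 3
    return W

def spectral_gap(W: np.ndarray) -> float:
    """1 - psi, where psi is the second-largest eigenvalue magnitude of W."""
    eigvals = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - eigvals[1]

# The gap shrinks as the ring grows, which slows the second term in the rate.
for n in (8, 16, 64):
    print(n, spectral_gap(ring_gossip_matrix(n)))
```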
CARE: A Large Scale CT Image Dataset and Clinical Applicable Benchmark Model for Rectal Cancer Segmentation
for: Rectal cancer segmentation of CT images for timely clinical diagnosis, radiotherapy treatment, and follow-up.
methods: A novel large-scale rectal cancer CT image dataset (CARE) with pixel-level annotations, and a novel medical cancer lesion segmentation benchmark model (U-SAM) that incorporates prompt information to tackle the challenges of intricate anatomical structures.
results: The proposed U-SAM outperforms state-of-the-art methods on the CARE dataset and the WORD dataset, demonstrating its effectiveness in rectal cancer segmentation.Abstract
Rectal cancer segmentation of CT image plays a crucial role in timely clinical diagnosis, radiotherapy treatment, and follow-up. Although current segmentation methods have shown promise in delineating cancerous tissues, they still encounter challenges in achieving high segmentation precision. These obstacles arise from the intricate anatomical structures of the rectum and the difficulties in performing differential diagnosis of rectal cancer. Additionally, a major obstacle is the lack of a large-scale, finely annotated CT image dataset for rectal cancer segmentation. To address these issues, this work introduces a novel large scale rectal cancer CT image dataset CARE with pixel-level annotations for both normal and cancerous rectum, which serves as a valuable resource for algorithm research and clinical application development. Moreover, we propose a novel medical cancer lesion segmentation benchmark model named U-SAM. The model is specifically designed to tackle the challenges posed by the intricate anatomical structures of abdominal organs by incorporating prompt information. U-SAM contains three key components: promptable information (e.g., points) to aid in target area localization, a convolution module for capturing low-level lesion details, and skip-connections to preserve and recover spatial information during the encoding-decoding process. To evaluate the effectiveness of U-SAM, we systematically compare its performance with several popular segmentation methods on the CARE dataset. The generalization of the model is further verified on the WORD dataset. Extensive experiments demonstrate that the proposed U-SAM outperforms state-of-the-art methods on these two datasets. These experiments can serve as the baseline for future research and clinical application development.
摘要
CT 影像中的直肠癌分割在及时的临床诊断、放疗和随访中扮演着关键角色。尽管现有分割方法在勾画癌变组织方面已展现出潜力,但在实现高分割精度上仍面临挑战。这些障碍来自直肠复杂的解剖结构以及直肠癌鉴别诊断的困难;此外,一个主要障碍是缺乏大规模、精细标注的直肠癌 CT 影像数据集。为解决这些问题,本工作提出了一个新的大规模直肠癌 CT 影像数据集 CARE,对正常直肠和癌变直肠均提供像素级标注,可作为算法研究和临床应用开发的宝贵资源。此外,我们提出了一个新的医学癌症病灶分割基准模型 U-SAM。该模型通过引入提示信息,专门应对腹部器官复杂解剖结构带来的挑战。U-SAM 包含三个关键组件:用于辅助目标区域定位的可提示信息(如点)、用于捕获低层病灶细节的卷积模块,以及用于在编码-解码过程中保留和恢复空间信息的跳跃连接。为评估 U-SAM 的有效性,我们在 CARE 数据集上系统地将其与多种常用分割方法进行比较,并在 WORD 数据集上进一步验证了模型的泛化能力。大量实验表明,所提出的 U-SAM 在这两个数据集上均优于现有最先进方法。这些实验可作为未来研究和临床应用开发的基线。
It Ain’t That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models
results: 研究发现,在 n 位数运算(如加法)上训练时,模型能够成功泛化到未见过的 n 位数输入(分布内泛化,ID),但在更长的未见输入上表现极差(分布外泛化,OOD)。已有研究尝试通过修改位置嵌入、微调以及使用更多或更具指导性的数据进行 priming 等手段来弥合这一差距,但由于没有触及核心机制,这些方案的鲁棒性难以保证。Abstract
Generative Transformer-based models have achieved remarkable proficiency on solving diverse problems. However, their generalization ability is not fully understood and not always satisfying. Researchers take basic mathematical tasks like n-digit addition or multiplication as important perspectives for investigating their generalization behaviors. Curiously, it is observed that when training on n-digit operations (e.g., additions) in which both input operands are n-digit in length, models generalize successfully on unseen n-digit inputs (in-distribution (ID) generalization), but fail miserably and mysteriously on longer, unseen cases (out-of-distribution (OOD) generalization). Studies try to bridge this gap with workarounds such as modifying position embedding, fine-tuning, and priming with more extensive or instructive data. However, without addressing the essential mechanism, there is hardly any guarantee regarding the robustness of these solutions. We bring this unexplained performance drop into attention and ask whether it is purely from random errors. Here we turn to the mechanistic line of research which has notable successes in model interpretability. We discover that the strong ID generalization stems from structured representations, while behind the unsatisfying OOD performance, the models still exhibit clear learned algebraic structures. Specifically, these models map unseen OOD inputs to outputs with equivalence relations in the ID domain. These highlight the potential of the models to carry useful information for improved generalization.
results: 在 CitationNet、OGBN-arxiv 和 TWITCH 数据集上,RAM-CG 模型相比现有最佳方法分别带来 2.2%、6.9% 和 6.6% 的准确率提升。Abstract
Continual graph learning (CGL) studies the problem of learning from an infinite stream of graph data, consolidating historical knowledge, and generalizing it to the future task. At once, only current graph data are available. Although some recent attempts have been made to handle this task, we still face two potential challenges: 1) most existing works only manipulate on the intermediate graph embedding and ignore intrinsic properties of graphs. It is non-trivial to differentiate the transferred information across graphs. 2) recent attempts take a parameter-sharing policy to transfer knowledge across time steps or progressively expand new architecture given shifted graph distribution. Learning a single model could lose discriminative information for each graph task while the model expansion scheme suffers from high model complexity. In this paper, we point out that latent relations behind graph edges can be attributed as an invariant factor for the evolving graphs and the statistical information of latent relations evolves. Motivated by this, we design a relation-aware adaptive model, dubbed as RAM-CG, that consists of a relation-discovery modular to explore latent relations behind edges and a task-awareness masking classifier to account for the shifted graph distribution. Extensive experiments show that RAM-CG provides significant 2.2%, 6.9% and 6.6% accuracy improvements over the state-of-the-art results on CitationNet, OGBN-arxiv and TWITCH datasets, respectively.
摘要
在这篇论文中,我们指出图中边背后的潜在关系可以被视为演化图的一个不变因素,而这些潜在关系的统计信息会随时间演化。基于这一点,我们设计了一种关系感知的自适应模型 RAM-CG,它由一个用于挖掘边背后潜在关系的关系发现模块和一个用于应对图分布偏移的任务感知掩码分类器组成。大量实验表明,RAM-CG 在 CitationNet、OGBN-arxiv 和 TWITCH 数据集上分别比现有最佳结果提升 2.2%、6.9% 和 6.6% 的准确率。
Two Phases of Scaling Laws for Nearest Neighbor Classifiers
results: 研究发现,最近邻分类器的标度律可以分为两个阶段:第一阶段中,泛化误差对数据维度呈多项式依赖并快速下降;第二阶段中,误差对数据维度呈指数依赖且下降缓慢。这一结果凸显了数据分布的复杂性对模型泛化误差的影响。Abstract
A scaling law refers to the observation that the test performance of a model improves as the number of training data increases. A fast scaling law implies that one can solve machine learning problems by simply boosting the data and the model sizes. Yet, in many cases, the benefit of adding more data can be negligible. In this work, we study the rate of scaling laws of nearest neighbor classifiers. We show that a scaling law can have two phases: in the first phase, the generalization error depends polynomially on the data dimension and decreases fast; whereas in the second phase, the error depends exponentially on the data dimension and decreases slowly. Our analysis highlights the complexity of the data distribution in determining the generalization error. When the data distributes benignly, our result suggests that nearest neighbor classifier can achieve a generalization error that depends polynomially, instead of exponentially, on the data dimension.
摘要
尺度法则(scaling law)指的是模型的测试性能随训练数据量增加而改进的现象。一个快速的尺度法则意味着可以通过简单地增加数据量和模型规模来解决机器学习问题。然而,在许多情况下,增加更多数据带来的收益很小。在这项工作中,我们研究最近邻分类器尺度法则的速率。我们发现一个尺度法则可以有两个阶段:在第一阶段,泛化误差对数据维度呈多项式依赖,并随数据量的增加而快速下降;而在第二阶段,误差对数据维度呈指数依赖,且下降缓慢。我们的分析表明数据分布的复杂性决定了泛化误差。当数据分布良好时,我们的结果表明最近邻分类器可以取得对数据维度呈多项式(而非指数)依赖的泛化误差。
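A toy way to observe such a scaling law empirically is to measure the test error of a nearest neighbor classifier as the amount of training data and the dimension vary. The sketch below uses an assumed synthetic problem (the label is the sign of the first coordinate) purely for illustration; it does not reproduce the paper's theoretical analysis.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def error_of_1nn(n_train: int, dim: int, n_test: int = 2000) -> float:
    """Test error of a 1-NN classifier on a toy problem where the label is
    the sign of the first coordinate."""
    Xtr = rng.uniform(-1, 1, size=(n_train, dim))
    Xte = rng.uniform(-1, 1, size=(n_test, dim))
    ytr, yte = (Xtr[:, 0] > 0).astype(int), (Xte[:, 0] > 0).astype(int)
    clf = KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr)
    return 1.0 - clf.score(Xte, yte)

# Error falls with more data but degrades as the dimension grows.
for dim in (2, 10, 30):
    errors = [error_of_1nn(n, dim) for n in (100, 1000, 10000)]
    print(dim, [round(e, 3) for e in errors])
```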
The Expressive Power of Graph Neural Networks: A Survey
results: 本综述指出,GNN 的表达能力存在诸多限制,但可以通过图特征增强、图拓扑增强和 GNN 架构增强等不同方法来加以提升。Abstract
Graph neural networks (GNNs) are effective machine learning models for many graph-related applications. Despite their empirical success, many research efforts focus on the theoretical limitations of GNNs, i.e., the GNNs expressive power. Early works in this domain mainly focus on studying the graph isomorphism recognition ability of GNNs, and recent works try to leverage the properties such as subgraph counting and connectivity learning to characterize the expressive power of GNNs, which are more practical and closer to real-world. However, no survey papers and open-source repositories comprehensively summarize and discuss models in this important direction. To fill the gap, we conduct a first survey for models for enhancing expressive power under different forms of definition. Concretely, the models are reviewed based on three categories, i.e., Graph feature enhancement, Graph topology enhancement, and GNNs architecture enhancement.
摘要
图神经网络(GNN)是许多图相关应用中的有效机器学习模型。尽管其在实验上取得了成功,许多研究工作仍聚焦于 GNN 的理论局限性,即 GNN 的表达能力。该领域早期工作主要研究 GNN 的图同构识别能力,而近期工作则尝试利用子图计数和连通性学习等性质来刻画 GNN 的表达能力,这些刻画更贴近实际应用。然而,目前还没有综述论文和开源代码库对这一重要方向的模型进行全面总结与讨论。为填补这一空白,我们首次对不同定义形式下增强表达能力的模型进行了综述。具体而言,这些模型按三个类别进行回顾:图特征增强、图拓扑增强和 GNN 架构增强。
Challenges and Opportunities of Using Transformer-Based Multi-Task Learning in NLP Through ML Lifecycle: A Survey
results: 本文系统地分析了 NLP 领域中基于 Transformer 的 MTL 方法如何融入机器学习生命周期的各个阶段,并指出了 MTL 与持续学习(CL)相结合这一研究方向的潜力,以便在实际应用中更方便地定期重新训练模型、应对分布偏移更新模型,以及添加新功能以满足实际需求。Abstract
The increasing adoption of natural language processing (NLP) models across industries has led to practitioners' need for machine learning systems to handle these models efficiently, from training to serving them in production. However, training, deploying, and updating multiple models can be complex, costly, and time-consuming, mainly when using transformer-based pre-trained language models. Multi-Task Learning (MTL) has emerged as a promising approach to improve efficiency and performance through joint training, rather than training separate models. Motivated by this, we first provide an overview of transformer-based MTL approaches in NLP. Then, we discuss the challenges and opportunities of using MTL approaches throughout typical ML lifecycle phases, specifically focusing on the challenges related to data engineering, model development, deployment, and monitoring phases. This survey focuses on transformer-based MTL architectures and, to the best of our knowledge, is novel in that it systematically analyses how transformer-based MTL in NLP fits into ML lifecycle phases. Furthermore, we motivate research on the connection between MTL and continual learning (CL), as this area remains unexplored. We believe it would be practical to have a model that can handle both MTL and CL, as this would make it easier to periodically re-train the model, update it due to distribution shifts, and add new capabilities to meet real-world requirements.
摘要
随着自然语言处理(NLP)模型在各行业的广泛应用,机器学习实践者需要高效地处理这些模型,从训练到在生产环境中部署。然而,训练、部署和更新多个模型可能复杂、昂贵且耗时,尤其是在使用基于 Transformer 的预训练语言模型时。多任务学习(MTL)作为一种有前景的方法,通过联合训练而非分别训练多个模型来提升效率和性能。受此启发,我们首先概述了 NLP 领域中基于 Transformer 的 MTL 方法。随后,我们讨论了在机器学习生命周期各阶段使用 MTL 方法所面临的挑战与机遇,尤其是数据工程、模型开发、部署和监控阶段。本综述聚焦基于 Transformer 的 MTL 架构,并且据我们所知,首次系统地分析了 NLP 中基于 Transformer 的 MTL 如何融入机器学习生命周期的各个阶段。此外,我们还呼吁研究 MTL 与持续学习(CL)之间的联系,因为这一领域尚未得到充分探索。我们认为,一个能够同时处理 MTL 和 CL 的模型将更具实用性,因为这样可以更方便地定期重新训练模型、应对分布偏移更新模型,并添加新功能以满足实际需求。
SCQPTH: an efficient differentiable splitting method for convex quadratic programming
methods: 这篇论文使用交替方向乘子法(ADMM)与算子分裂方法来求解凸二次规划(QP)。
results: 实验结果显示,与现有的可微 QP 求解器相比,SCQPTH 可带来 1–10 倍的计算效率提升。Abstract
We present SCQPTH: a differentiable first-order splitting method for convex quadratic programs. The SCQPTH framework is based on the alternating direction method of multipliers (ADMM) and the software implementation is motivated by the state-of-the art solver OSQP: an operating splitting solver for convex quadratic programs (QPs). The SCQPTH software is made available as an open-source python package and contains many similar features including efficient reuse of matrix factorizations, infeasibility detection, automatic scaling and parameter selection. The forward pass algorithm performs operator splitting in the dimension of the original problem space and is therefore suitable for large scale QPs with $100-1000$ decision variables and thousands of constraints. Backpropagation is performed by implicit differentiation of the ADMM fixed-point mapping. Experiments demonstrate that for large scale QPs, SCQPTH can provide a $1\times - 10\times$ improvement in computational efficiency in comparison to existing differentiable QP solvers.
摘要
我们提出 SCQPTH:一种针对凸二次规划(QP)的可微一阶分裂方法。SCQPTH 框架基于交替方向乘子法(ADMM),其软件实现借鉴了最先进的求解器 OSQP(一种凸 QP 的算子分裂求解器)。SCQPTH 以开源 Python 包的形式发布,具有许多类似特性,包括矩阵分解的高效复用、不可行性检测、自动缩放和参数选择。前向传播算法在原问题空间的维度上进行算子分裂,因此适用于具有 100–1000 个决策变量和数千个约束的大规模 QP。反向传播通过对 ADMM 不动点映射进行隐式微分来实现。实验表明,对于大规模 QP,SCQPTH 相比现有的可微 QP 求解器可带来 1–10 倍的计算效率提升。
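To make the operator-splitting idea concrete, here is a minimal ADMM sketch for a box-constrained convex QP, a simplified version of the problem class that OSQP-style solvers target. It is not the SCQPTH implementation (which handles general constraints and implicit differentiation for backpropagation); the penalty parameter, stopping rule, and toy problem are assumptions.

```python
import numpy as np

def admm_box_qp(P, q, lb, ub, rho=1.0, iters=500, tol=1e-8):
    """Minimize 0.5*x'Px + q'x subject to lb <= x <= ub via ADMM
    (splitting x = z, with z constrained to the box)."""
    n = q.shape[0]
    x, z, y = np.zeros(n), np.zeros(n), np.zeros(n)
    # Factor once and reuse, mirroring the "efficient reuse of matrix
    # factorizations" idea mentioned in the abstract.
    L = np.linalg.cholesky(P + rho * np.eye(n))
    for _ in range(iters):
        rhs = rho * z - y - q
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))   # x-update
        z_new = np.clip(x + y / rho, lb, ub)                 # z-update (projection)
        y = y + rho * (x - z_new)                             # dual update
        if (np.linalg.norm(x - z_new) < tol
                and np.linalg.norm(rho * (z_new - z)) < tol):
            z = z_new
            break
        z = z_new
    return z

# Tiny example: minimize (x0-1)^2 + (x1+2)^2 with -1 <= x <= 1.
P = 2 * np.eye(2)
q = np.array([-2.0, 4.0])
print(admm_box_qp(P, q, lb=-np.ones(2), ub=np.ones(2)))  # approx [1.0, -1.0]
```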
Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models
paper_authors: Zhenhua Wang, Wei Xie, Kai Chen, Baosheng Wang, Zhiwen Gui, Enze Wang
for: 这篇论文研究了大语言模型(LLM)的越狱问题,并首次提出了一种自动越狱方法。
methods: 论文提出了语义防火墙的概念及三种技术实现途径,并引入一种"自我欺骗"攻击,通过诱导 LLM 自行生成便于越狱的提示来绕过语义防火墙。
results: 实验结果显示,该自动越狱方法在两种模型上的成功率分别为 86.2% 和 67%,失败率分别为 4.7% 和 2.2%。Abstract
Large language models (LLMs), such as ChatGPT, have emerged with astonishing capabilities approaching artificial general intelligence. While providing convenience for various societal needs, LLMs have also lowered the cost of generating harmful content. Consequently, LLM developers have deployed semantic-level defenses to recognize and reject prompts that may lead to inappropriate content. Unfortunately, these defenses are not foolproof, and some attackers have crafted "jailbreak" prompts that temporarily hypnotize the LLM into forgetting content defense rules and answering any improper questions. To date, there is no clear explanation of the principles behind these semantic-level attacks and defenses in both industry and academia. This paper investigates the LLM jailbreak problem and proposes an automatic jailbreak method for the first time. We propose the concept of a semantic firewall and provide three technical implementation approaches. Inspired by the attack that penetrates traditional firewalls through reverse tunnels, we introduce a "self-deception" attack that can bypass the semantic firewall by inducing LLM to generate prompts that facilitate jailbreak. We generated a total of 2,520 attack payloads in six languages (English, Russian, French, Spanish, Chinese, and Arabic) across seven virtual scenarios, targeting the three most common types of violations: violence, hate, and pornography. The experiment was conducted on two models, namely the GPT-3.5-Turbo and GPT-4. The success rates on the two models were 86.2% and 67%, while the failure rates were 4.7% and 2.2%, respectively. This highlighted the effectiveness of the proposed attack method. All experimental code and raw data will be released as open-source to inspire future research. We believe that manipulating AI behavior through carefully crafted prompts will become an important research direction in the future.
摘要
大型语言模型(LLM),如 ChatGPT,展现出了接近通用人工智能的惊人能力。它们在为社会各类需求提供便利的同时,也降低了生成有害内容的成本。因此,LLM 开发者部署了语义层面的防御,用以识别并拒绝可能导致不当内容的提示。然而,这些防御并非万无一失,一些攻击者已经构造出"越狱"提示,使 LLM 暂时"忘记"内容防御规则,从而回答任何不当问题。迄今为止,业界和学术界都尚未对这类语义层面攻防背后的原理给出清晰解释。本文研究了 LLM 越狱问题,并首次提出了一种自动越狱方法。我们提出了语义防火墙的概念,并给出三种技术实现途径。受通过反向隧道穿透传统防火墙的攻击启发,我们引入了一种"自我欺骗"攻击,通过诱导 LLM 自行生成便于越狱的提示来绕过语义防火墙。我们在六种语言(英语、俄语、法语、西班牙语、中文和阿拉伯语)的七个虚拟场景中,针对暴力、仇恨和色情这三类最常见的违规内容,总共生成了 2,520 个攻击载荷。实验在 GPT-3.5-Turbo 和 GPT-4 两个模型上进行,成功率分别为 86.2% 和 67%,失败率分别为 4.7% 和 2.2%,这印证了所提攻击方法的有效性。我们将公开全部实验代码和原始数据,以启发未来研究。我们相信,通过精心设计的提示来操控 AI 行为,将成为未来一个重要的研究方向。
Exploring Winograd Convolution for Cost-effective Neural Network Fault Tolerance
results: 实验表明,与标准卷积相比,Winograd 卷积可在不损失精度的情况下平均减少 55.77% 的容错设计开销;在进一步考虑 Winograd 卷积固有容错能力时,还可再减少 17.24% 的计算开销。当其与故障感知重训练和受限激活函数相结合应用于容错神经网络时,模型在各种故障条件下的精度均有显著提升。Abstract
Winograd is generally utilized to optimize convolution performance and computational efficiency because of the reduced multiplication operations, but the reliability issues brought by winograd are usually overlooked. In this work, we observe the great potential of winograd convolution in improving neural network (NN) fault tolerance. Based on the observation, we evaluate winograd convolution fault tolerance comprehensively from different granularities ranging from models, layers, and operation types for the first time. Then, we explore the use of inherent fault tolerance of winograd convolution for cost-effective NN protection against soft errors. Specifically, we mainly investigate how winograd convolution can be effectively incorporated with classical fault-tolerant design approaches including triple modular redundancy (TMR), fault-aware retraining, and constrained activation functions. According to our experiments, winograd convolution can reduce the fault-tolerant design overhead by 55.77\% on average without any accuracy loss compared to standard convolution, and further reduce the computing overhead by 17.24\% when the inherent fault tolerance of winograd convolution is considered. When it is applied on fault-tolerant neural networks enhanced with fault-aware retraining and constrained activation functions, the resulting model accuracy generally shows significant improvement in presence of various faults.
摘要
Winograd 卷积因减少乘法运算而通常被用于优化卷积性能和计算效率,但其带来的可靠性问题往往被忽视。在这项工作中,我们观察到 Winograd 卷积在提升神经网络(NN)容错能力方面具有巨大潜力。基于这一观察,我们首次从模型、层和运算类型等不同粒度对 Winograd 卷积的容错性进行了全面评估。随后,我们探索了利用 Winograd 卷积固有的容错能力,以低成本保护神经网络免受软错误影响。具体而言,我们主要研究了如何将 Winograd 卷积与经典容错设计方法有效结合,包括三模冗余(TMR)、故障感知重训练和受限激活函数。实验表明,与标准卷积相比,Winograd 卷积可在不损失精度的情况下平均减少 55.77% 的容错设计开销;在进一步考虑其固有容错能力时,还可再减少 17.24% 的计算开销。当其应用于结合故障感知重训练和受限激活函数的容错神经网络时,模型在各种故障条件下的精度均有显著提升。
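The reduced multiplication count that Winograd convolution relies on can be seen in the classic F(2,3) case: two outputs of a 1-D, 3-tap convolution are obtained with 4 element-wise multiplications instead of 6. The sketch below uses the standard F(2,3) transform matrices; it illustrates the arithmetic only and says nothing about the fault-tolerance analysis in the paper.

```python
import numpy as np

# Winograd F(2,3): input transform B', filter transform G, output transform A'.
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d: np.ndarray, g: np.ndarray) -> np.ndarray:
    """d: 4 input values, g: 3 filter taps -> 2 output values."""
    m = (G @ g) * (BT @ d)          # only 4 element-wise multiplications
    return AT @ m

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
print(winograd_f23(d, g), direct)   # identical up to floating-point error
```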
results: 实验结果表明,所提出的方法能够显著降低脉冲发放率,并且与现有 SNN 基线相比表现更优。Abstract
Spiking Neural Networks (SNNs) are well known as a promising energy-efficient alternative to conventional artificial neural networks. Subject to the preconceived impression that SNNs are sparse firing, the analysis and optimization of inherent redundancy in SNNs have been largely overlooked, thus the potential advantages of spike-based neuromorphic computing in accuracy and energy efficiency are interfered. In this work, we pose and focus on three key questions regarding the inherent redundancy in SNNs. We argue that the redundancy is induced by the spatio-temporal invariance of SNNs, which enhances the efficiency of parameter utilization but also invites lots of noise spikes. Further, we analyze the effect of spatio-temporal invariance on the spatio-temporal dynamics and spike firing of SNNs. Then, motivated by these analyses, we propose an Advance Spatial Attention (ASA) module to harness SNNs' redundancy, which can adaptively optimize their membrane potential distribution by a pair of individual spatial attention sub-modules. In this way, noise spike features are accurately regulated. Experimental results demonstrate that the proposed method can significantly drop the spike firing with better performance than state-of-the-art SNN baselines. Our code is available in \url{https://github.com/BICLab/ASA-SNN}.
摘要
脉冲神经网络(SNN)因其出色的能效性而被广泛视为传统人工神经网络的一种有前景的替代方案。然而,受"SNN 发放稀疏"这一先入为主印象的影响,SNN 内在冗余的分析与优化在很大程度上被忽视,从而影响了基于脉冲的神经形态计算在精度和能效方面的潜在优势。在这项工作中,我们提出并聚焦于关于 SNN 内在冗余的三个关键问题。我们认为这种冗余源自 SNN 的时空不变性:它提高了参数利用效率,但也引入了大量噪声脉冲。我们进一步分析了时空不变性对 SNN 时空动力学和脉冲发放的影响。受这些分析启发,我们提出了先进空间注意力(ASA)模块来利用 SNN 的冗余,它通过一对独立的空间注意力子模块自适应地优化膜电位分布,从而精确地调控噪声脉冲特征。实验结果表明,所提方法能够显著降低脉冲发放率,并优于现有最先进的 SNN 基线。我们的代码见 https://github.com/BICLab/ASA-SNN。
How To Overcome Confirmation Bias in Semi-Supervised Image Classification By Active Learning
results: 研究发现,将 SSL 与随机选择标注相结合在既有基准数据集上可以取得很好的表现,但这些结果可能高估了外部效度;同时,文献中缺乏关于主动半监督学习方法在真实数据场景中表现的研究,使我们的理解仍存在空白。为此,本文提出了真实应用中常见的三种数据挑战:类间不平衡、类内不平衡和类间相似。这些挑战会因确认偏差而损害 SSL 性能,而随机采样并不能缓解确认偏差,有时甚至比监督学习更差;相反,实验表明主动学习(AL)能够在这些真实场景中克服 SSL 的确认偏差。Abstract
Do we need active learning? The rise of strong deep semi-supervised methods raises doubt about the usability of active learning in limited labeled data settings. This is caused by results showing that combining semi-supervised learning (SSL) methods with a random selection for labeling can outperform existing active learning (AL) techniques. However, these results are obtained from experiments on well-established benchmark datasets that can overestimate the external validity. However, the literature lacks sufficient research on the performance of active semi-supervised learning methods in realistic data scenarios, leaving a notable gap in our understanding. Therefore we present three data challenges common in real-world applications: between-class imbalance, within-class imbalance, and between-class similarity. These challenges can hurt SSL performance due to confirmation bias. We conduct experiments with SSL and AL on simulated data challenges and find that random sampling does not mitigate confirmation bias and, in some cases, leads to worse performance than supervised learning. In contrast, we demonstrate that AL can overcome confirmation bias in SSL in these realistic settings. Our results provide insights into the potential of combining active and semi-supervised learning in the presence of common real-world challenges, which is a promising direction for robust methods when learning with limited labeled data in real-world applications.
摘要
我们是否还需要主动学习?强大的深度半监督方法的兴起,使人们开始质疑主动学习在标注数据有限场景下的实用性。这是因为已有结果表明,将半监督学习(SSL)方法与随机选择标注相结合,可以超越现有的主动学习(AL)技术。然而,这些结果来自在公认基准数据集上的实验,可能高估了外部效度。同时,文献中缺乏对主动半监督学习方法在真实数据场景中表现的充分研究,使我们的理解存在明显空白。因此,我们提出了真实应用中常见的三种数据挑战:类间不平衡、类内不平衡和类间相似。这些挑战会因确认偏差而损害 SSL 的性能。我们在模拟这些数据挑战的实验中对 SSL 与 AL 进行了比较,发现随机采样并不能缓解确认偏差,在某些情况下甚至比监督学习更差;相反,我们证明了 AL 能够在这些真实设置中克服 SSL 的确认偏差。我们的结果揭示了在常见真实挑战存在时结合主动学习与半监督学习的潜力,为在真实应用中利用有限标注数据进行学习提供了一个有前景的稳健方向。
HyperSNN: A new efficient and robust deep learning model for resource constrained control applications
results: 论文在 AI Gym 基准上测试了 HyperSNN 模型,发现其控制精度与传统机器学习方法相当,而能耗仅为后者的 1.36% 至 9.96%;实验还表明 HyperSNN 具有更强的鲁棒性。因此,HyperSNN 适用于交互式、移动和可穿戴设备,有助于实现高能效且稳健的系统设计,同时为模型预测控制(MPC)等复杂算法在实际工业场景中的落地铺平了道路。Abstract
In light of the increasing adoption of edge computing in areas such as intelligent furniture, robotics, and smart homes, this paper introduces HyperSNN, an innovative method for control tasks that uses spiking neural networks (SNNs) in combination with hyperdimensional computing. HyperSNN substitutes expensive 32-bit floating point multiplications with 8-bit integer additions, resulting in reduced energy consumption while enhancing robustness and potentially improving accuracy. Our model was tested on AI Gym benchmarks, including Cartpole, Acrobot, MountainCar, and Lunar Lander. HyperSNN achieves control accuracies that are on par with conventional machine learning methods but with only 1.36% to 9.96% of the energy expenditure. Furthermore, our experiments showed increased robustness when using HyperSNN. We believe that HyperSNN is especially suitable for interactive, mobile, and wearable devices, promoting energy-efficient and robust system design. Furthermore, it paves the way for the practical implementation of complex algorithms like model predictive control (MPC) in real-world industrial scenarios.
摘要
在智能家具、机器人和智能家居等领域边缘计算日益普及的背景下,本文提出了 HyperSNN,一种将脉冲神经网络(SNN)与超维计算相结合的创新控制方法。HyperSNN 用 8 位整数加法替代昂贵的 32 位浮点乘法,在降低能耗的同时增强了鲁棒性,并有可能提升精度。我们的模型在 AI Gym 基准环境中进行了测试,包括 Cartpole、Acrobot、MountainCar 和 Lunar Lander。HyperSNN 的控制精度与传统机器学习方法相当,而能耗仅为其 1.36% 至 9.96%。此外,实验表明使用 HyperSNN 能提升系统的鲁棒性。我们认为 HyperSNN 特别适用于交互式、移动和可穿戴设备,有助于实现高能效且稳健的系统设计,并为模型预测控制(MPC)等复杂算法在实际工业场景中的落地铺平了道路。
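As background for the hyperdimensional-computing side of HyperSNN, the following sketch shows a generic HDC classifier in which encoding and similarity use only integer additions and integer dot products. The item memory, discretized levels, and class prototypes are hypothetical and are not taken from the HyperSNN architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096  # hypervector dimensionality

def random_hv():
    """Random bipolar hypervector with int8 entries in {-1, +1}."""
    return rng.choice(np.array([-1, 1], dtype=np.int8), size=D)

# Item memory: one hypervector per discretized sensor level (an assumption).
levels = {lvl: random_hv() for lvl in range(8)}

def encode(observation):
    """Bundle the level hypervectors of an observation with integer additions
    only (no floating-point multiplications)."""
    acc = np.zeros(D, dtype=np.int32)
    for lvl in observation:
        acc += levels[lvl]
    return acc

# Class prototypes built by bundling encoded training observations.
proto = {"left": encode([0, 1, 2]) + encode([1, 1, 3]),
         "right": encode([7, 6, 5]) + encode([6, 6, 4])}

def classify(observation):
    q = encode(observation)
    # Integer dot products as the similarity measure.
    return max(proto, key=lambda c: int(q @ proto[c]))

print(classify([0, 2, 2]), classify([7, 5, 5]))  # expected: left right
```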
In situ Fault Diagnosis of Indium Tin Oxide Electrodes by Processing S-Parameter Patterns
results: 研究发现,将 S 参数的不同通道组合作为输入,可以让深度学习(DL)算法同时分析缺陷的成因与严重程度,并在存在加性噪声的情况下显著提升诊断性能。Abstract
In the field of optoelectronics, indium tin oxide (ITO) electrodes play a crucial role in various applications, such as displays, sensors, and solar cells. Effective fault detection and diagnosis of the ITO electrodes are essential to ensure the performance and reliability of the devices. However, traditional visual inspection is challenging with transparent ITO electrodes, and existing fault detection methods have limitations in determining the root causes of the defects, often requiring destructive evaluations. In this study, an in situ fault diagnosis method is proposed using scattering parameter (S-parameter) signal processing, offering early detection, high diagnostic accuracy, noise robustness, and root cause analysis. A comprehensive S-parameter pattern database is obtained according to defect states. Deep learning (DL) approaches, including multilayer perceptron (MLP), convolutional neural network (CNN), and transformer, are then used to simultaneously analyze the cause and severity of defects. Notably, it is demonstrated that the diagnostic performance under additive noise levels can be significantly enhanced by combining different channels of the S-parameters as input to the learning algorithms, as confirmed through the t-distributed stochastic neighbor embedding (t-SNE) dimension reduction visualization.
摘要
在光电子领域,氧化铟锡(ITO)电极在显示器、传感器和太阳能电池等多种应用中发挥着关键作用。对 ITO 电极进行有效的故障检测与诊断,对确保器件的性能和可靠性至关重要。然而,透明的 ITO 电极难以进行传统的目视检查,而现有的故障检测方法在确定缺陷根因方面存在局限,往往需要破坏性评估。本研究提出了一种基于散射参数(S 参数)信号处理的原位故障诊断方法,具备早期检测、高诊断精度、抗噪声以及根因分析能力。我们根据缺陷状态建立了完整的 S 参数模式数据库,随后使用多层感知机(MLP)、卷积神经网络(CNN)和 Transformer 等深度学习(DL)方法,同时分析缺陷的成因与严重程度。值得注意的是,实验表明,将 S 参数的不同通道组合作为学习算法的输入,可以显著提升加性噪声条件下的诊断性能,这一点也通过 t-分布随机邻域嵌入(t-SNE)降维可视化得到了证实。
Epicure: Distilling Sequence Model Predictions into Patterns
methods: 这篇论文提出了一种名为 Epicure 的方法,可将序列模型的预测蒸馏为简单的模式。Epicure 将模型预测映射到一个格(lattice)上,该格表示逐渐更一般、并涵盖具体模型预测的抽象模式。
results: 在给定函数体源代码预测函数描述性名称和检测异常函数名称这两个任务上,Epicure 预测的命名模式比仅取最高概率的模型预测更常与真实名称相符;在 10% 误报率下,Epicure 匹配的真实名称比最佳模型预测多 61%。Abstract
Most machine learning models predict a probability distribution over concrete outputs and struggle to accurately predict names over high entropy sequence distributions. Here, we explore finding abstract, high-precision patterns intrinsic to these predictions in order to make abstract predictions that usefully capture rare sequences. In this short paper, we present Epicure, a method that distils the predictions of a sequence model, such as the output of beam search, into simple patterns. Epicure maps a model's predictions into a lattice that represents increasingly more general patterns that subsume the concrete model predictions. On the tasks of predicting a descriptive name of a function given the source code of its body and detecting anomalous names given a function, we show that Epicure yields accurate naming patterns that match the ground truth more often compared to just the highest probability model prediction. For a false alarm rate of 10%, Epicure predicts patterns that match 61% more ground-truth names compared to the best model prediction, making Epicure well-suited for scenarios that require high precision.
摘要
大多数机器学习模型预测结果为概率分布,困难准确预测高 entropy 序列分布中的名称。我们在这里探索找到高精度抽象模式,以便在罕见序列中准确预测名称。在这篇短文中,我们介绍 Epicure,一种方法,可以将序列模型的预测映射到一个表示增加更一般模式的格子中。在函数体代码中预测函数名和检测异常名称任务上,我们显示 Epicure 可以准确地预测名称,与真实值更常地匹配。对于 false alarm rate 为 10%,Epicure 可以预测匹配真实值的名称61%多于最佳模型预测,使其适用于需要高精度的场景。
DeSCo: Towards Generalizable and Scalable Deep Subgraph Counting
paper_authors: Tianyu Fu, Chiyue Wei, Yu Wang, Rex Ying
for: 这篇论文针对大规模子图计数问题提出了一种可扩展的解决方案。大规模子图计数有许多实际应用,例如社交网络分析中的模体计数以及交易网络中用于洗钱检测的环路计数。
methods: 论文提出了名为 DeSCo 的可扩展神经深度子图计数流水线,用于准确预测查询图的计数及其出现位置。DeSCo 首先使用一种新颖的规范划分(canonical partition)技术,将大型目标图分解为小的邻域图;该技术在保证不漏计、不重复计数的同时,大幅降低了计数的方差。其次,DeSCo 使用表达能力强的基于子图的异质图神经网络在每个邻域内进行计数。最后,通过带可学习门控的 gossip 传播在邻域之间传递计数,以利用模体计数的归纳偏置。
results: 在来自不同领域的八个真实数据集上的实验显示,DeSCo 的计数预测均方误差比现有最先进的神经方法平均改善 137 倍,同时保持多项式时间复杂度。Abstract
Subgraph counting is the problem of counting the occurrences of a given query graph in a large target graph. Large-scale subgraph counting is useful in various domains, such as motif counting for social network analysis and loop counting for money laundering detection on transaction networks. Recently, to address the exponential runtime complexity of scalable subgraph counting, neural methods are proposed. However, existing neural counting approaches fall short in three aspects. Firstly, the counts of the same query can vary from zero to millions on different target graphs, posing a much larger challenge than most graph regression tasks. Secondly, current scalable graph neural networks have limited expressive power and fail to efficiently distinguish graphs in count prediction. Furthermore, existing neural approaches cannot predict the occurrence position of queries in the target graph. Here we design DeSCo, a scalable neural deep subgraph counting pipeline, which aims to accurately predict the query count and occurrence position on any target graph after one-time training. Firstly, DeSCo uses a novel canonical partition and divides the large target graph into small neighborhood graphs. The technique greatly reduces the count variation while guaranteeing no missing or double-counting. Secondly, neighborhood counting uses an expressive subgraph-based heterogeneous graph neural network to accurately perform counting in each neighborhood. Finally, gossip propagation propagates neighborhood counts with learnable gates to harness the inductive biases of motif counts. DeSCo is evaluated on eight real-world datasets from various domains. It outperforms state-of-the-art neural methods with 137x improvement in the mean squared error of count prediction, while maintaining the polynomial runtime complexity.
摘要
子图计数问题是统计给定查询图在大型目标图中出现次数的问题。大规模子图计数在许多领域都很有用,例如社交网络分析中的模体计数,以及交易网络上用于洗钱检测的环路计数。近来,为了应对可扩展子图计数的指数级运行时间复杂度,研究者提出了神经方法。然而,现有的神经计数方法在三个方面存在不足:首先,同一查询在不同目标图上的计数可能从零到数百万不等,这比大多数图回归任务更具挑战性;其次,当前可扩展的图神经网络表达能力有限,难以在计数预测中高效区分不同的图;此外,现有神经方法无法预测查询在目标图中的出现位置。为此,我们设计了 DeSCo,一种可扩展的神经深度子图计数流水线,旨在经过一次训练后即可在任意目标图上准确预测查询计数及其出现位置。首先,DeSCo 使用一种新颖的规范划分,将大型目标图分解为小的邻域图;该技术在保证不漏计、不重复计数的同时大幅降低了计数方差。其次,邻域计数使用表达能力强的基于子图的异质图神经网络在每个邻域内准确计数。最后,gossip 传播利用可学习的门控在邻域之间传递计数,以利用模体计数的归纳偏置。DeSCo 在来自不同领域的八个真实数据集上进行了评估,其计数预测的均方误差比现有最先进的神经方法改善 137 倍,同时保持多项式运行时间复杂度。
Leveraging Explainable AI to Analyze Researchers’ Aspect-Based Sentiment about ChatGPT
methods: 本研究使用可解释 AI 在科研数据上实现基于方面的情感分析,以推动该领域最新技术的发展。
results: 本研究为将基于方面的情感分析扩展到更新的数据集提供了有价值的见解,且这种分析不受文本数据长度的限制。Abstract
The groundbreaking invention of ChatGPT has triggered enormous discussion among users across all fields and domains. Among celebration around its various advantages, questions have been raised with regards to its correctness and ethics of its use. Efforts are already underway towards capturing user sentiments around it. But it begs the question as to how the research community is analyzing ChatGPT with regards to various aspects of its usage. It is this sentiment of the researchers that we analyze in our work. Since Aspect-Based Sentiment Analysis has usually only been applied on a few datasets, it gives limited success and that too only on short text data. We propose a methodology that uses Explainable AI to facilitate such analysis on research data. Our technique presents valuable insights into extending the state of the art of Aspect-Based Sentiment Analysis on newer datasets, where such analysis is not hampered by the length of the text data.
摘要
ChatGPT 的突破性发明引发了各领域用户的广泛讨论。在人们为其各种优势欢呼的同时,也有人对其正确性及使用伦理提出了质疑。目前已有工作着手收集用户对它的情感态度,但这也引出一个问题:研究界是如何从其使用的各个方面来分析 ChatGPT 的?我们在本工作中分析的正是研究者的这种情感。由于基于方面的情感分析通常只应用于少数数据集,且仅在短文本数据上取得有限的成功,我们提出了一种利用可解释 AI 在科研数据上进行此类分析的方法。我们的技术为在更新的数据集上扩展基于方面的情感分析的最新技术提供了有价值的见解,且这类分析不再受文本数据长度的限制。
results: 通过使用多种最先进的反事实生成器和若干基准数据集进行模拟实验,我们发现补救所引入的领域和模型偏移相当显著,可能会在某些情况下妨碍算法补救的适用性;不过,我们也提出了若干缓解策略,并开源了一个快速的补救动力学模拟框架。Abstract
Existing work on Counterfactual Explanations (CE) and Algorithmic Recourse (AR) has largely focused on single individuals in a static environment: given some estimated model, the goal is to find valid counterfactuals for an individual instance that fulfill various desiderata. The ability of such counterfactuals to handle dynamics like data and model drift remains a largely unexplored research challenge. There has also been surprisingly little work on the related question of how the actual implementation of recourse by one individual may affect other individuals. Through this work, we aim to close that gap. We first show that many of the existing methodologies can be collectively described by a generalized framework. We then argue that the existing framework does not account for a hidden external cost of recourse, that only reveals itself when studying the endogenous dynamics of recourse at the group level. Through simulation experiments involving various state-of the-art counterfactual generators and several benchmark datasets, we generate large numbers of counterfactuals and study the resulting domain and model shifts. We find that the induced shifts are substantial enough to likely impede the applicability of Algorithmic Recourse in some situations. Fortunately, we find various strategies to mitigate these concerns. Our simulation framework for studying recourse dynamics is fast and opensourced.
摘要
现有关于反事实解释(CE)和算法补救(AR)的工作主要关注静态环境下的单个个体:给定某个估计出的模型,目标是为单个实例找到满足多种要求的有效反事实。这类反事实在数据和模型漂移等动态情形下的适用性,仍是一个基本未被探索的研究挑战。此外,关于某个个体实际实施补救会如何影响其他个体这一相关问题,现有研究也出奇地少。我们希望通过这项工作来弥补这一空白。我们首先表明,许多现有方法可以用一个统一的框架来概括;随后我们指出,现有框架忽略了补救的一种隐藏外部成本,这种成本只有在群体层面研究补救的内生动力学时才会显现。通过使用多种最先进的反事实生成器和若干基准数据集进行模拟实验,我们生成了大量反事实,并研究由此导致的领域与模型偏移。我们发现这些偏移足够显著,可能会在某些情况下妨碍算法补救的适用性。幸运的是,我们找到了多种缓解这些问题的策略。我们用于研究补救动力学的模拟框架运行快速且已开源。
paper_authors: Shuwen Lu, Zhihui Zhang, Cong Guo, Jingwen Leng, Yangjie Zhou, Minyi Guo
for: This paper aims to develop a high-performance and efficient hardware acceleration framework for graph neural network (GNN) models, addressing the challenges of high bandwidth requirements and model diversity.
methods: The proposed framework, called SwitchBlade, utilizes a new type of partition-level operator fusion, partition-level multi-threading, and fine-grained graph partitioning to reduce the bandwidth requirement and improve hardware utilization.
results: The proposed framework achieves an average speedup of 1.85 times and energy savings of 19.03 times compared to the NVIDIA V100 GPU, and delivers performance comparable to state-of-the-art specialized accelerators.Abstract
Graph neural networks (GNNs) have shown significant accuracy improvements in a variety of graph learning domains, sparking considerable research interest. To translate these accuracy improvements into practical applications, it is essential to develop high-performance and efficient hardware acceleration for GNN models. However, designing GNN accelerators faces two fundamental challenges: the high bandwidth requirement of GNN models and the diversity of GNN models. Previous works have addressed the first challenge by using more expensive memory interfaces to achieve higher bandwidth. For the second challenge, existing works either support specific GNN models or have generic designs with poor hardware utilization. In this work, we tackle both challenges simultaneously. First, we identify a new type of partition-level operator fusion, which we utilize to internally reduce the high bandwidth requirement of GNNs. Next, we introduce partition-level multi-threading to schedule the concurrent processing of graph partitions, utilizing different hardware resources. To further reduce the extra on-chip memory required by multi-threading, we propose fine-grained graph partitioning to generate denser graph partitions. Importantly, these three methods make no assumptions about the targeted GNN models, addressing the challenge of model variety. We implement these methods in a framework called SwitchBlade, consisting of a compiler, a graph partitioner, and a hardware accelerator. Our evaluation demonstrates that SwitchBlade achieves an average speedup of $1.85\times$ and energy savings of $19.03\times$ compared to the NVIDIA V100 GPU. Additionally, SwitchBlade delivers performance comparable to state-of-the-art specialized accelerators.
摘要
图神经网络(GNN)在各种图学习任务中表现出了显著的准确性改进,引发了广泛的研究兴趣。为将这些准确性改进应用于实际场景,必须为 GNN 模型开发高性能且高效的硬件加速器。然而,设计 GNN 加速器面临两个基本挑战:GNN 模型的高带宽需求,以及 GNN 模型的多样性。先前的工作通过使用更昂贵的内存接口来实现更高的带宽,以解决第一个挑战;对于第二个挑战,现有工作要么只支持特定的 GNN 模型,要么采用通用设计但硬件利用率低下。在这项工作中,我们同时解决了这两个挑战。首先,我们发现了一种新的分区级算子融合,并利用它在内部降低 GNN 的高带宽需求。然后,我们引入分区级多线程来调度图分区的并行处理,充分利用不同的硬件资源。为进一步减少多线程带来的额外片上内存开销,我们提出细粒度的图划分以生成更稠密的图分区。重要的是,这三种方法均不对目标 GNN 模型作任何假设,从而应对了模型多样性的挑战。我们将这些方法实现于名为 SwitchBlade 的框架中,该框架由编译器、图划分器和硬件加速器组成。评估表明,与 NVIDIA V100 GPU 相比,SwitchBlade 实现了平均 $1.85\times$ 的加速和 $19.03\times$ 的能耗节省;此外,SwitchBlade 的性能可与当前最先进的专用加速器相媲美。
Expressivity of Graph Neural Networks Through the Lens of Adversarial Robustness
paper_authors: Francesco Campi, Lukas Gosch, Tom Wollschläger, Yan Scholten, Stephan Günnemann
for: This paper studies the adversarial robustness of Graph Neural Networks (GNNs) and compares their expressive power to traditional Message Passing Neural Networks (MPNNs).
methods: The paper uses adversarial attacks to test the ability of GNNs to count specific subgraph patterns, and extends the concept of adversarial robustness to this task.
results: The paper shows that more powerful GNNs fail to generalize to small perturbations to the graph's structure and fail to count substructures on out-of-distribution graphs.
results: 论文显示,更强大的 GNN 无法泛化到图结构的微小扰动,也无法在分布外的图上正确统计子结构。Abstract
We perform the first adversarial robustness study into Graph Neural Networks (GNNs) that are provably more powerful than traditional Message Passing Neural Networks (MPNNs). In particular, we use adversarial robustness as a tool to uncover a significant gap between their theoretically possible and empirically achieved expressive power. To do so, we focus on the ability of GNNs to count specific subgraph patterns, which is an established measure of expressivity, and extend the concept of adversarial robustness to this task. Based on this, we develop efficient adversarial attacks for subgraph counting and show that more powerful GNNs fail to generalize even to small perturbations to the graph's structure. Expanding on this, we show that such architectures also fail to count substructures on out-of-distribution graphs.
摘要
我们对那些在理论上证明比传统消息传递神经网络(MPNN)更强大的图神经网络(GNN)进行了首个对抗鲁棒性研究。特别地,我们将对抗鲁棒性作为一种工具,揭示了这类 GNN 在理论上可达的表达能力与其经验上实际达到的表达能力之间存在显著差距。为此,我们聚焦于 GNN 统计特定子图模式的能力(这是一种公认的表达能力度量),并将对抗鲁棒性的概念扩展到这一任务。在此基础上,我们为子图计数任务设计了高效的对抗攻击,并表明更强大的 GNN 即使面对图结构的微小扰动也无法泛化;进一步地,这类架构在分布外的图上同样无法正确统计子结构。
AATCT-IDS: A Benchmark Abdominal Adipose Tissue CT Image Dataset for Image Denoising, Semantic Segmentation, and Radiomics Evaluation
paper_authors: Zhiyu Ma, Chen Li, Tianming Du, Le Zhang, Dechao Tang, Deguo Ma, Shanchuan Huang, Yan Liu, Yihao Sun, Zhihao Chen, Jin Yuan, Qianqing Nie, Marcin Grzegorzek, Hongzan Sun
results: 研究结果显示,在图像去噪任务中,采用平滑策略的算法能更好地抑制混合噪声,但会损失图像细节;在语义分割任务中,BiSeNet 能以最短的训练时间取得仅略逊于 U-Net 的分割结果,并能有效分离小而孤立的脂肪组织;在影像组学分析中,研究揭示了受试人群中的三种脂肪分布模式。Abstract
Methods: In this study, a benchmark \emph{Abdominal Adipose Tissue CT Image Dataset} (AATTCT-IDS) containing 300 subjects is prepared and published. AATTCT-IDS publics 13,732 raw CT slices, and the researchers individually annotate the subcutaneous and visceral adipose tissue regions of 3,213 of those slices that have the same slice distance to validate denoising methods, train semantic segmentation models, and study radiomics. For different tasks, this paper compares and analyzes the performance of various methods on AATTCT-IDS by combining the visualization results and evaluation data. Thus, verify the research potential of this data set in the above three types of tasks. Results: In the comparative study of image denoising, algorithms using a smoothing strategy suppress mixed noise at the expense of image details and obtain better evaluation data. Methods such as BM3D preserve the original image structure better, although the evaluation data are slightly lower. The results show significant differences among them. In the comparative study of semantic segmentation of abdominal adipose tissue, the segmentation results of adipose tissue by each model show different structural characteristics. Among them, BiSeNet obtains segmentation results only slightly inferior to U-Net with the shortest training time and effectively separates small and isolated adipose tissue. In addition, the radiomics study based on AATTCT-IDS reveals three adipose distributions in the subject population. Conclusion: AATTCT-IDS contains the ground truth of adipose tissue regions in abdominal CT slices. This open-source dataset can attract researchers to explore the multi-dimensional characteristics of abdominal adipose tissue and thus help physicians and patients in clinical practice. AATCT-IDS is freely published for non-commercial purpose at: \url{https://figshare.com/articles/dataset/AATTCT-IDS/23807256}.
摘要
方法:本研究构建并公开了一个基准数据集《腹部脂肪组织 CT 影像数据集》(AATTCT-IDS),包含 300 名受试者。AATTCT-IDS 公开了 13,732 张原始 CT 切片,研究人员对其中 3,213 张具有相同层间距的切片的皮下脂肪和内脏脂肪区域进行了逐一标注,用于验证去噪方法、训练语义分割模型以及开展影像组学研究。针对不同任务,本文结合可视化结果与评估数据,比较并分析了各种方法在 AATTCT-IDS 上的表现,从而验证该数据集在上述三类任务中的研究潜力。结果:在图像去噪的对比研究中,采用平滑策略的算法以牺牲图像细节为代价抑制混合噪声,获得了更好的评估数据;BM3D 等方法能更好地保留原始图像结构,尽管评估数据略低,各方法之间差异显著。在腹部脂肪组织语义分割的对比研究中,各模型对脂肪组织的分割结果呈现出不同的结构特征;其中 BiSeNet 以最短的训练时间取得了仅略逊于 U-Net 的分割结果,并能有效分离小而孤立的脂肪组织。此外,基于 AATTCT-IDS 的影像组学研究揭示了受试人群中的三种脂肪分布。结论:AATTCT-IDS 提供了腹部 CT 切片中脂肪组织区域的真实标注。该开源数据集有望吸引研究者探索腹部脂肪组织的多维特征,从而在临床实践中帮助医生和患者。AATTCT-IDS 以非商业用途免费发布,地址为:https://figshare.com/articles/dataset/AATTCT-IDS/23807256。
results: 该算法的运行时间为 $\tilde{O}\left( 2^{\tilde{O}(\frac{k}{\varepsilon})} \eta^2 d\right)$,并以高概率输出一个包含 $k$ 个中心的集合 $C$,满足 $cost(V, C) \leq (1+\varepsilon) \cdot cost(V, C_{OPT})$,其中 $C_{OPT}$ 为最优的 $k$ 个中心,$cost(\cdot)$ 为标准的 $k$-means 代价函数(即各点到最近中心的平方距离之和),$\eta$ 为纵横比(最大距离与最小距离之比)。这是第一个运行时间对数据点数仅呈多对数依赖、且给出可证 $(1+\varepsilon)$ 近似保证的量子 $k$-means 算法;它不需要量子线性代数子程序,运行时间也不依赖于诸如条件数等在此类子程序中出现的参数。Abstract
We give a quantum approximation scheme (i.e., $(1 + \varepsilon)$-approximation for every $\varepsilon > 0$) for the classical $k$-means clustering problem in the QRAM model with a running time that has only polylogarithmic dependence on the number of data points. More specifically, given a dataset $V$ with $N$ points in $\mathbb{R}^d$ stored in QRAM data structure, our quantum algorithm runs in time $\tilde{O} \left( 2^{\tilde{O}(\frac{k}{\varepsilon})} \eta^2 d\right)$ and with high probability outputs a set $C$ of $k$ centers such that $cost(V, C) \leq (1+\varepsilon) \cdot cost(V, C_{OPT})$. Here $C_{OPT}$ denotes the optimal $k$-centers, $cost(.)$ denotes the standard $k$-means cost function (i.e., the sum of the squared distance of points to the closest center), and $\eta$ is the aspect ratio (i.e., the ratio of maximum distance to minimum distance). This is the first quantum algorithm with a polylogarithmic running time that gives a provable approximation guarantee of $(1+\varepsilon)$ for the $k$-means problem. Also, unlike previous works on unsupervised learning, our quantum algorithm does not require quantum linear algebra subroutines and has a running time independent of parameters (e.g., condition number) that appear in such procedures.
摘要
我们给出了经典 $k$-means 聚类问题在 QRAM 模型下的一个量子近似方案(即对任意 $\varepsilon > 0$ 的 $(1+\varepsilon)$-近似),其运行时间对数据点数量仅呈多对数依赖。更具体地,给定一个存储在 QRAM 数据结构中、包含 $\mathbb{R}^d$ 中 $N$ 个点的数据集 $V$,我们的量子算法在 $\tilde{O}\left( 2^{\tilde{O}(\frac{k}{\varepsilon})} \eta^2 d\right)$ 时间内运行,并以高概率输出一个包含 $k$ 个中心的集合 $C$,使得 $cost(V, C) \leq (1+\varepsilon) \cdot cost(V, C_{OPT})$。其中 $C_{OPT}$ 表示最优的 $k$ 个中心,$cost(\cdot)$ 表示标准的 $k$-means 代价函数(即各点到最近中心的平方距离之和),$\eta$ 为纵横比(最大距离与最小距离之比)。这是第一个运行时间呈多对数依赖、并对 $k$-means 问题给出可证 $(1+\varepsilon)$ 近似保证的量子算法。此外,与以往的无监督学习工作不同,我们的量子算法不需要量子线性代数子程序,其运行时间也不依赖于此类子程序中出现的参数(如条件数)。
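The guarantee above is stated in terms of the standard k-means cost. A minimal classical implementation of that cost function (not the quantum algorithm) looks as follows; the random data is only for demonstration.

```python
import numpy as np

def kmeans_cost(V: np.ndarray, C: np.ndarray) -> float:
    """Standard k-means cost: sum over points of the squared distance to the
    closest center. V has shape (N, d), C has shape (k, d)."""
    # Pairwise squared distances between every point and every center.
    d2 = ((V[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).sum())

rng = np.random.default_rng(0)
V = rng.normal(size=(1000, 5))   # N = 1000 points in R^5
C = rng.normal(size=(3, 5))      # k = 3 candidate centers
print(kmeans_cost(V, C))
```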
PEvoLM: Protein Sequence Evolutionary Information Language Model
methods: 该方法采用基于双向长短期记忆网络(LSTM)的语言模型,并将 PSSM 的思想与迁移学习相结合,使模型的自由参数数量减少为原来的四分之一,且前向与后向传递共用单一路径。
results: 该模型通过多任务学习,同时预测下一个氨基酸以及由相似但不同序列(汇总为 PSSM)导出的下一个氨基酸概率分布,从而学习到蛋白质序列的进化信息。Abstract
With the exponential increase of the protein sequence databases over time, multiple-sequence alignment (MSA) methods, like PSI-BLAST, perform exhaustive and time-consuming database search to retrieve evolutionary information. The resulting position-specific scoring matrices (PSSMs) of such search engines represent a crucial input to many machine learning (ML) models in the field of bioinformatics and computational biology. A protein sequence is a collection of contiguous tokens or characters called amino acids (AAs). The analogy to natural language allowed us to exploit the recent advancements in the field of Natural Language Processing (NLP) and therefore transfer NLP state-of-the-art algorithms to bioinformatics. This research presents an Embedding Language Model (ELMo), converting a protein sequence to a numerical vector representation. While the original ELMo trained a 2-layer bidirectional Long Short-Term Memory (LSTMs) network following a two-path architecture, one for the forward and the second for the backward pass, by merging the idea of PSSMs with the concept of transfer-learning, this work introduces a novel bidirectional language model (bi-LM) with four times less free parameters and using rather a single path for both passes. The model was trained not only on predicting the next AA but also on the probability distribution of the next AA derived from similar, yet different sequences as summarized in a PSSM, simultaneously for multi-task learning, hence learning evolutionary information of protein sequences as well. The network architecture and the pre-trained model are made available as open source under the permissive MIT license on GitHub at https://github.com/issararab/PEvoLM.
摘要
随着蛋白序列数据库的呈指数增长,多重序列对齐(MSA)方法如PSI-BLAST在时间上进行耗时和耗力的数据库搜索,以获取进化信息。得到的位置特异分数据(PSSM)被多种机器学习(ML)模型在生物信息学和计算生物学中作为重要输入。蛋白序列是一系列连续的字符或氨基酸(AA)的集合。通过将蛋白序列与自然语言的相似性进行比较,我们可以利用自然语言处理(NLP)领域的最新进展,并将其应用到生物信息学中。本研究投入了一个Embedding Language Model(ELMo),将蛋白序列转换为数值vector表示。在原ELMo模型中,一个2层扩展LSTM网络按照两路架构,一路为前向传输,另一路为后向传输。在将PSSM的概念与传输学习混合到一起的基础上,本工作提出了一种新的双向语言模型(bi-LM),具有四倍少的自由参数,并使用单路架构进行两个方向的传输。该模型在预测下一个AA以外,同时也预测来自相似 yet different 序列的AA的概率分布,即PSSM,并在多任务学习中同时学习蛋白序列的进化信息。网络架构和预训练模型在MIT免费许可下在GitHub上提供,可以在https://github.com/issararab/PEvoLM中下载。
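Since PSSMs are central to this work, the following sketch builds a toy position-specific scoring matrix from a small aligned set of sequences, using pseudocounts and a uniform background. This is a simplified stand-in for the PSI-BLAST procedure, and the pseudocount and background values are assumptions.

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"

def toy_pssm(msa, pseudocount=1.0, background=0.05):
    """Position-specific scoring matrix: log-odds of per-column amino-acid
    frequencies against a uniform background, computed from an aligned set
    of equal-length sequences."""
    L = len(msa[0])
    counts = np.full((L, len(AAS)), pseudocount)
    for seq in msa:
        for i, aa in enumerate(seq):
            counts[i, AAS.index(aa)] += 1
    freqs = counts / counts.sum(axis=1, keepdims=True)
    return np.log2(freqs / background)

msa = ["ACDKA", "ACEKA", "GCDKA", "ACDRA"]
pssm = toy_pssm(msa)
print(pssm.shape)                 # (5, 20): one row per alignment column
print(pssm[0, AAS.index("A")])    # 'A' is enriched at position 0
```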
Stochastic Controlled Averaging for Federated Learning with Communication Compression
results: 实验表明,SCALLION 和 SCAFCOM 能以大幅减少的上行通信量达到与相应全精度 FL 方法相当的性能,并在相同通信预算下优于近期的压缩 FL 方法;二者还能适应任意的数据异质性,且不对压缩误差作额外假设。Abstract
Communication compression, a technique aiming to reduce the information volume to be transmitted over the air, has gained great interests in Federated Learning (FL) for the potential of alleviating its communication overhead. However, communication compression brings forth new challenges in FL due to the interplay of compression-incurred information distortion and inherent characteristics of FL such as partial participation and data heterogeneity. Despite the recent development, the performance of compressed FL approaches has not been fully exploited. The existing approaches either cannot accommodate arbitrary data heterogeneity or partial participation, or require stringent conditions on compression. In this paper, we revisit the seminal stochastic controlled averaging method by proposing an equivalent but more efficient/simplified formulation with halved uplink communication costs. Building upon this implementation, we propose two compressed FL algorithms, SCALLION and SCAFCOM, to support unbiased and biased compression, respectively. Both the proposed methods outperform the existing compressed FL methods in terms of communication and computation complexities. Moreover, SCALLION and SCAFCOM accommodates arbitrary data heterogeneity and do not make any additional assumptions on compression errors. Experiments show that SCALLION and SCAFCOM can match the performance of corresponding full-precision FL approaches with substantially reduced uplink communication, and outperform recent compressed FL methods under the same communication budget.
摘要
通信压缩是一种旨在减少空中传输信息量的技术,由于其有望减轻联邦学习(FL)的通信负担而受到广泛关注。然而,由于压缩带来的信息失真与 FL 固有特性(如部分参与和数据异质性)之间的相互作用,通信压缩也给 FL 带来了新的挑战。尽管近年来有所进展,压缩 FL 方法的性能尚未被充分发掘:现有方法要么无法适应任意的数据异质性或部分参与,要么对压缩提出了苛刻的条件。在本文中,我们重新审视了经典的随机受控平均(stochastic controlled averaging)方法,提出了一种等价但更高效、更简化的表述,将上行通信成本减半。基于这一实现,我们提出了两种压缩 FL 算法 SCALLION 和 SCAFCOM,分别支持无偏压缩和有偏压缩。这两种方法在通信和计算复杂度方面均优于现有的压缩 FL 方法;此外,SCALLION 和 SCAFCOM 能够适应任意的数据异质性,并且不对压缩误差做任何额外假设。实验表明,SCALLION 和 SCAFCOM 能在大幅减少上行通信的情况下达到与相应全精度 FL 方法相当的性能,并在相同的通信预算下优于近期的压缩 FL 方法。
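As a rough illustration of the ingredients described above (not the actual SCALLION/SCAFCOM update rules), the sketch below combines a control-variate-corrected client message with an unbiased rand-k sparsifier for the uplink; the round structure and all names are assumptions.

```python
import numpy as np

def rand_k_compress(v, k, rng):
    """Unbiased rand-k sparsifier: keep k random coordinates, rescale by d/k."""
    d = v.size
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * v

def compressed_round(x, client_grads, controls, k, lr, rng):
    """One schematic round: clients upload compressed control-variate corrections."""
    uploads = []
    for grad_fn, c in zip(client_grads, controls):
        g = grad_fn(x)                           # local stochastic gradient at the server model
        uploads.append(rand_k_compress(g - c, k, rng))
    # Server re-adds the (known) controls and takes a gradient-style step.
    update = np.mean(uploads, axis=0) + np.mean(controls, axis=0)
    return x - lr * update
```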
Characteristics of networks generated by kernel growing neural gas
results: 研究发现,kernel GNG 能够生成网络,且每种核函数所生成的网络各自具有不同的特征。Abstract
This research aims to develop kernel GNG, a kernelized version of the growing neural gas (GNG) algorithm, and to investigate the features of the networks generated by the kernel GNG. The GNG is an unsupervised artificial neural network that can transform a dataset into an undirected graph, thereby extracting the features of the dataset as a graph. The GNG is widely used in vector quantization, clustering, and 3D graphics. Kernel methods are often used to map a dataset to feature space, with support vector machines being the most prominent application. This paper introduces the kernel GNG approach and explores the characteristics of the networks generated by kernel GNG. Five kernels, including Gaussian, Laplacian, Cauchy, inverse multiquadric, and log kernels, are used in this study.
摘要
这项研究的目标是开发 kernel GNG,即增长神经气(growing neural gas,GNG)算法的核化版本,并研究由 kernel GNG 生成的网络的特性。GNG 是一种无监督人工神经网络,可以将数据集转换为无向图,从而以图的形式提取数据集的特征。GNG 被广泛应用于向量量化、聚类和 3D 图形。核方法常用于将数据集映射到特征空间,其中最著名的应用是支持向量机(SVM)。本文介绍了 kernel GNG 方法,并探讨由 kernel GNG 生成的网络的特征。本研究使用了五种核函数,包括 Gaussian、Laplacian、Cauchy、inverse multiquadric 和 log 核。
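For reference, common textbook forms of the five kernels named above are sketched below; the paper's exact parameterization and hyperparameters may differ.

```python
import numpy as np

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def laplacian(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) / sigma)

def cauchy(x, y, sigma=1.0):
    return 1.0 / (1.0 + np.sum((x - y) ** 2) / sigma ** 2)

def inverse_multiquadric(x, y, c=1.0):
    return 1.0 / np.sqrt(np.sum((x - y) ** 2) + c ** 2)

def log_kernel(x, y, beta=2.0):
    # Conditionally positive definite log kernel; beta is illustrative.
    return -np.log(np.linalg.norm(x - y) ** beta + 1.0)
```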
Interpretability Benchmark for Evaluating Spatial Misalignment of Prototypical Parts Explanations
results: 研究人员通过广泛的实验,证明了所提指标集和补偿方法的有效性。他们发现,使用该方法后,基于原型部件的网络能够更准确地定位并解释图像中的部件。Abstract
Prototypical parts-based networks are becoming increasingly popular due to their faithful self-explanations. However, their similarity maps are calculated in the penultimate network layer. Therefore, the receptive field of the prototype activation region often depends on parts of the image outside this region, which can lead to misleading interpretations. We name this undesired behavior a spatial explanation misalignment and introduce an interpretability benchmark with a set of dedicated metrics for quantifying this phenomenon. In addition, we propose a method for misalignment compensation and apply it to existing state-of-the-art models. We show the expressiveness of our benchmark and the effectiveness of the proposed compensation methodology through extensive empirical studies.
摘要
基于原型部件的网络由于其忠实的自我解释能力而日益流行。然而,它们的相似度图是在网络的倒数第二层计算的,因此原型激活区域的感受野往往依赖于该区域之外的图像部分,这可能导致误导性的解释。我们将这种不良行为称为空间解释错位,并提出了一个带有专用度量指标的可解释性基准来量化这一现象。此外,我们还提出了一种错位补偿方法,并将其应用于现有的最先进模型。我们通过广泛的实验研究展示了该基准的表达能力以及所提补偿方法的有效性。
Benchmarking Adversarial Robustness of Compressed Deep Learning Models
results: 我们发现,即使压缩后的模型具有更好的总体性能和执行速度,它们对攻击输入的抵抗性仍然保持相对不变。这表明,模型压缩不会对针对攻击的鲁棒性产生负面影响。Abstract
The increasing size of Deep Neural Networks (DNNs) poses a pressing need for model compression, particularly when employed on resource constrained devices. Concurrently, the susceptibility of DNNs to adversarial attacks presents another significant hurdle. Despite substantial research on both model compression and adversarial robustness, their joint examination remains underexplored. Our study bridges this gap, seeking to understand the effect of adversarial inputs crafted for base models on their pruned versions. To examine this relationship, we have developed a comprehensive benchmark across diverse adversarial attacks and popular DNN models. We uniquely focus on models not previously exposed to adversarial training and apply pruning schemes optimized for accuracy and performance. Our findings reveal that while the benefits of pruning (enhanced generalizability, compression, and faster inference times) are preserved, adversarial robustness remains comparable to the base model. This suggests that model compression, while offering its unique advantages, does not undermine adversarial robustness.
摘要
随着深度神经网络(DNN)规模的不断增大,模型压缩成为在资源受限设备上部署模型的迫切需求。与此同时,DNN 易受对抗攻击也是一大难题。尽管在模型压缩和对抗鲁棒性两方面都已有大量研究,但对两者的联合考察仍然不足。我们的研究旨在填补这一空白,探究针对基础模型构造的对抗输入会如何影响其剪枝后的版本。为考察这一关系,我们在多种对抗攻击和流行的 DNN 模型上建立了一个全面的基准。我们特别关注未经过对抗训练的模型,并采用针对精度和性能优化的剪枝方案。我们的发现表明,剪枝带来的好处(更好的泛化能力、压缩率和更快的推理速度)得以保留,同时对抗鲁棒性与基础模型相当。这表明模型压缩在提供其独特优势的同时,并不会削弱对抗鲁棒性。
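A minimal evaluation loop of the kind such a benchmark would run is sketched below, using a single FGSM attack to compare a base and a pruned model under the same perturbation budget; the model and data loader are placeholders, and the benchmark in the paper covers many more attacks.

```python
import torch
import torch.nn.functional as F

def fgsm_accuracy(model, loader, epsilon, device="cpu"):
    """Accuracy of `model` on FGSM-perturbed inputs with budget `epsilon`."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad, = torch.autograd.grad(loss, x)
        x_adv = (x + epsilon * grad.sign()).clamp(0.0, 1.0)  # one-step FGSM
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# Usage: compare fgsm_accuracy(base_model, test_loader, 8/255) against
# fgsm_accuracy(pruned_model, test_loader, 8/255) under the same budget.
```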
Deep Generative Imputation Model for Missing Not At Random Data
results: 与最先进的基线相比,所提出的 GNR 模型在 RMSE 上平均提升 9.9% 到 18.8%,并且总能获得更好的掩码重建精度,使插补更有据可依。Abstract
Data analysis usually suffers from the Missing Not At Random (MNAR) problem, where the cause of the value missing is not fully observed. Compared to the naive Missing Completely At Random (MCAR) problem, it is more in line with realistic scenarios while also being more complex and challenging. Existing statistical methods model the MNAR mechanism by different decompositions of the joint distribution of the complete data and the missing mask. But we empirically find that directly incorporating these statistical methods into deep generative models is sub-optimal. Specifically, it would neglect the confidence of the reconstructed mask during the MNAR imputation process, which leads to insufficient information extraction and less-guaranteed imputation quality. In this paper, we revisit the MNAR problem from a novel perspective that the complete data and missing mask are two modalities of incomplete data on an equal footing. Along with this line, we put forward a generative-model-specific joint probability decomposition method, the conjunction model, to represent the distributions of the two modalities in parallel and extract sufficient information from both complete data and missing mask. Taking a step further, we exploit a deep generative imputation model, namely GNR, to process the real-world missing mechanism in the latent space and concurrently impute the incomplete data and reconstruct the missing mask. The experimental results show that our GNR surpasses state-of-the-art MNAR baselines with significant margins (averagely improved from 9.9% to 18.8% in RMSE) and always gives a better mask reconstruction accuracy which makes the imputation more principled.
摘要
数据分析通常会受到非随机缺失(Missing Not At Random,MNAR)问题的影响,即造成数值缺失的原因并未被完全观测到。与简单的完全随机缺失(Missing Completely At Random,MCAR)问题相比,MNAR 更贴近真实场景,也更加复杂和具有挑战性。现有的统计方法通过对完整数据与缺失掩码的联合分布进行不同的分解来建模 MNAR 机制。但我们在实验中发现,直接将这些统计方法融入深度生成模型并不理想:它会在 MNAR 插补过程中忽略重建掩码的置信度,导致信息提取不足,插补质量缺乏保证。在本文中,我们从一个新的视角重新审视 MNAR 问题,即将完整数据和缺失掩码视为地位平等的不完整数据的两种模态。沿着这一思路,我们提出了一种面向生成模型的联合概率分解方法,即联合模型(conjunction model),以并行的方式表示两种模态的分布,并从完整数据和缺失掩码中提取充分的信息。更进一步,我们利用深度生成插补模型 GNR,在潜在空间中处理真实世界的缺失机制,同时插补不完整数据并重建缺失掩码。实验结果表明,我们的 GNR 以显著优势超越了最先进的 MNAR 基线(RMSE 平均提升 9.9% 到 18.8%),并且总能给出更好的掩码重建精度,使插补更有据可依。
results: 研究结果显示,使用中间任务转移学习可以提高 HurricaneSARC 上的性能,最佳模型可以达到0.70的 F1 分数。Abstract
During natural disasters, people often use social media platforms such as Twitter to ask for help, to provide information about the disaster situation, or to express contempt about the unfolding event or public policies and guidelines. This contempt is in some cases expressed as sarcasm or irony. Understanding this form of speech in a disaster-centric context is essential to improving natural language understanding of disaster-related tweets. In this paper, we introduce HurricaneSARC, a dataset of 15,000 tweets annotated for intended sarcasm, and provide a comprehensive investigation of sarcasm detection using pre-trained language models. Our best model is able to obtain as much as 0.70 F1 on our dataset. We also demonstrate that the performance on HurricaneSARC can be improved by leveraging intermediate task transfer learning. We release our data and code at https://github.com/tsosea2/HurricaneSarc.
摘要
在自然灾害期间,人们经常使用推特等社交媒体平台请求帮助、提供灾情信息,或表达对事件进展或公共政策与指引的不满,这种不满有时以讽刺或反讽的形式出现。在以灾害为中心的语境下理解这种语言表达方式,是提升对灾害相关推文的自然语言理解能力的关键。在本文中,我们介绍了 HurricaneSARC 数据集,该数据集包含 15,000 条针对讽刺意图进行标注的推文,并使用预训练语言模型对讽刺检测进行了全面研究。我们的最佳模型在该数据集上可以获得 0.70 的 F1 分数。我们还证明,通过中间任务迁移学习可以进一步提高在 HurricaneSARC 上的性能。我们已将数据和代码发布在 https://github.com/tsosea2/HurricaneSarc。
Hierarchical Topological Ordering with Conditional Independence Test for Limited Time Series
paper_authors: Anpeng Wu, Haoxuan Li, Kun Kuang, Keli Zhang, Fei Wu
for: 本研究旨在利用有向无环图(DAG)揭示观测数据背后的因果关系。
methods: 本研究采用基于拓扑的方法,先学习变量的拓扑排序,再在保证图无环的前提下剔除冗余的边。
results: 研究人员对基于拓扑的方法提出改进,通过引入条件工具变量作为外生干预来识别每个变量的后代节点;HT-CIT 算法可以大幅减少需要剪除的边的数量,并在真实数据上取得更好的性能。Abstract
Learning directed acyclic graphs (DAGs) to identify causal relations underlying observational data is crucial but also poses significant challenges. Recently, topology-based methods have emerged as a two-step approach to discovering DAGs by first learning the topological ordering of variables and then eliminating redundant edges, while ensuring that the graph remains acyclic. However, one limitation is that these methods would generate numerous spurious edges that require subsequent pruning. To overcome this limitation, in this paper, we propose an improvement to topology-based methods by introducing limited time series data, consisting of only two cross-sectional records that need not be adjacent in time and are subject to flexible timing. By incorporating conditional instrumental variables as exogenous interventions, we aim to identify descendant nodes for each variable. Following this line, we propose a hierarchical topological ordering algorithm with conditional independence test (HT-CIT), which enables the efficient learning of sparse DAGs with a smaller search space compared to other popular approaches. The HT-CIT algorithm greatly reduces the number of edges that need to be pruned. Empirical results from synthetic and real-world datasets demonstrate the superiority of the proposed HT-CIT algorithm.
Online Control for Linear Dynamics: A Data-Driven Approach
results: 我们证明了我们的算法的 regret 为 $\mathcal{O}(\sqrt{T})$,这意味着其性能与基于模型的方法相当。Abstract
This paper considers an online control problem over a linear time-invariant system with unknown dynamics, bounded disturbance, and adversarial cost. We propose a data-driven strategy to reduce the regret of the controller. Unlike model-based methods, our algorithm does not identify the system model, instead, it leverages a single noise-free trajectory to calculate the accumulation of disturbance and makes decisions using the accumulated disturbance action controller we design, whose parameters are updated by online gradient descent. We prove that the regret of our algorithm is $\mathcal{O}(\sqrt{T})$ under mild assumptions, suggesting that its performance is on par with model-based methods.
摘要
这篇论文考虑一个在线控制问题:系统为线性时不变系统,动力学未知,并受到有界扰动和对抗性代价的影响。我们提出了一种数据驱动的策略来降低控制器的 regret。与基于模型的方法不同,我们的算法不需要辨识系统模型,而是利用一条无噪声的轨迹来计算扰动的累积,并使用我们设计的累积扰动动作控制器做出决策,其参数通过在线梯度下降进行更新。我们证明,在较弱的假设下,我们算法的 regret 为 $\mathcal{O}(\sqrt{T})$,这意味着其性能与基于模型的方法相当。
Microstructure-Empowered Stock Factor Extraction and Utilization
paper_authors: Xianfeng Jiao, Zizhong Li, Chang Xu, Yang Liu, Weiqing Liu, Jiang Bian
for: This paper aims to effectively extract essential factors from order flow data for diverse downstream tasks across different granularities and scenarios.
methods: The proposed framework consists of a Context Encoder and a Factor Extractor, using unsupervised learning methods to select important signals from the given context.
results: The extracted factors are utilized for downstream tasks, demonstrating significant improvement for stock trend prediction and order execution tasks at the second and minute level, compared to existing tick-level approaches.Abstract
High-frequency quantitative investment is a crucial aspect of stock investment. Notably, order flow data plays a critical role as it provides the most detailed level of information among high-frequency trading data, including comprehensive data from the order book and transaction records at the tick level. The order flow data is extremely valuable for market analysis as it equips traders with essential insights for making informed decisions. However, extracting and effectively utilizing order flow data present challenges due to the large volume of data involved and the limitations of traditional factor mining techniques, which are primarily designed for coarser-level stock data. To address these challenges, we propose a novel framework that aims to effectively extract essential factors from order flow data for diverse downstream tasks across different granularities and scenarios. Our method consists of a Context Encoder and an Factor Extractor. The Context Encoder learns an embedding for the current order flow data segment's context by considering both the expected and actual market state. In addition, the Factor Extractor uses unsupervised learning methods to select such important signals that are most distinct from the majority within the given context. The extracted factors are then utilized for downstream tasks. In empirical studies, our proposed framework efficiently handles an entire year of stock order flow data across diverse scenarios, offering a broader range of applications compared to existing tick-level approaches that are limited to only a few days of stock data. We demonstrate that our method extracts superior factors from order flow data, enabling significant improvement for stock trend prediction and order execution tasks at the second and minute level.
摘要
高频量化投资是股票投资的重要方面。其中,订单流数据尤为关键,因为它在高频交易数据中提供了最精细的信息,包括逐笔级别的订单簿和成交记录的完整数据。订单流数据对市场分析极具价值,可为交易者做出明智决策提供关键洞见。然而,由于数据量庞大,且传统的因子挖掘技术主要面向更粗粒度的股票数据,订单流数据的提取和有效利用面临挑战。为解决这些问题,我们提出了一个新的框架,旨在从订单流数据中有效提取关键因子,以服务于不同粒度和场景下的多种下游任务。我们的方法由上下文编码器(Context Encoder)和因子提取器(Factor Extractor)组成:上下文编码器同时考虑预期和实际的市场状态,为当前订单流数据片段的上下文学习一个嵌入表示;因子提取器则使用无监督学习方法,在给定上下文中选取与大多数信号差异最大的重要信号。提取出的因子随后被用于下游任务。在实证研究中,我们的框架能够高效处理整整一年、跨多种场景的股票订单流数据,相比仅能处理数天股票数据的现有逐笔级方法,具有更广泛的应用范围。我们展示了该方法能从订单流数据中提取出更优的因子,使秒级和分钟级的股票趋势预测与订单执行任务得到显著提升。
Is Self-Supervised Pretraining Good for Extrapolation in Molecular Property Prediction?
results: 研究发现,自监督预训练可以帮助模型学习未观测属性值的相对趋势、提升外推性能,但模型仍无法准确外推绝对属性值。Abstract
The prediction of material properties plays a crucial role in the development and discovery of materials in diverse applications, such as batteries, semiconductors, catalysts, and pharmaceuticals. Recently, there has been a growing interest in employing data-driven approaches by using machine learning technologies, in combination with conventional theoretical calculations. In material science, the prediction of unobserved values, commonly referred to as extrapolation, is particularly critical for property prediction as it enables researchers to gain insight into materials beyond the limits of available data. However, even with the recent advancements in powerful machine learning models, accurate extrapolation is still widely recognized as a significantly challenging problem. On the other hand, self-supervised pretraining is a machine learning technique where a model is first trained on unlabeled data using relatively simple pretext tasks before being trained on labeled data for target tasks. As self-supervised pretraining can effectively utilize material data without observed property values, it has the potential to improve the model's extrapolation ability. In this paper, we clarify how such self-supervised pretraining can enhance extrapolation performance.We propose an experimental framework for the demonstration and empirically reveal that while models were unable to accurately extrapolate absolute property values, self-supervised pretraining enables them to learn relative tendencies of unobserved property values and improve extrapolation performance.
摘要
材料性质的预测在电池、半导体、催化剂和药物等多种应用的材料开发与发现中扮演着关键角色。近年来,将机器学习技术与传统理论计算相结合的数据驱动方法受到越来越多的关注。在材料科学中,预测未观测到的值(通常称为外推)对性质预测尤为重要,因为它使研究者能够洞察超出现有数据范围的材料。然而,即使有了强大的机器学习模型,准确的外推仍被普遍认为是一个极具挑战性的问题。另一方面,自监督预训练是一种机器学习技术:模型先在无标签数据上通过相对简单的前置任务进行训练,再在有标签数据上针对目标任务进行训练。由于自监督预训练可以有效利用没有观测性质值的材料数据,它有望提升模型的外推能力。在本文中,我们阐明了这种自监督预训练如何增强外推性能。我们提出了一个用于验证的实验框架,并通过实验揭示:虽然模型无法准确外推绝对性质值,但自监督预训练使其能够学习未观测性质值的相对趋势,从而提升外推性能。
How to Mask in Error Correction Code Transformer: Systematic and Double Masking
results: 对 ECCT 进行改进,实现了最先进的译码性能,与传统译码算法相比带来了显著的性能提升。Abstract
In communication and storage systems, error correction codes (ECCs) are pivotal in ensuring data reliability. As deep learning's applicability has broadened across diverse domains, there is a growing research focus on neural network-based decoders that outperform traditional decoding algorithms. Among these neural decoders, Error Correction Code Transformer (ECCT) has achieved the state-of-the-art performance, outperforming other methods by large margins. To further enhance the performance of ECCT, we propose two novel methods. First, leveraging the systematic encoding technique of ECCs, we introduce a new masking matrix for ECCT, aiming to improve the performance and reduce the computational complexity. Second, we propose a novel transformer architecture of ECCT called a double-masked ECCT. This architecture employs two different mask matrices in a parallel manner to learn more diverse features of the relationship between codeword bits in the masked self-attention blocks. Extensive simulation results show that the proposed double-masked ECCT outperforms the conventional ECCT, achieving the state-of-the-art decoding performance with significant margins.
摘要
在通信和存储系统中,纠错码(ECC)是确保数据可靠性的关键。随着深度学习在各领域的广泛应用,基于神经网络、性能优于传统译码算法的译码器成为日益活跃的研究方向。其中,纠错码 Transformer(ECCT)已取得最先进的性能,大幅领先于其他方法。为进一步提升 ECCT 的性能,我们提出了两种新方法。首先,利用 ECC 的系统编码技术,我们为 ECCT 引入了一种新的掩码矩阵,以提升性能并降低计算复杂度。其次,我们提出了一种新的 ECCT 架构,即双掩码 ECCT。该架构以并行方式使用两个不同的掩码矩阵,使掩码自注意力模块能够学习码字比特之间关系的更多样化特征。大量仿真结果表明,所提出的双掩码 ECCT 优于传统 ECCT,以显著优势实现了最先进的译码性能。
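The parallel double-masking idea can be illustrated with the toy module below, which runs two self-attention branches over the same tokens with two different attention masks and fuses their outputs; how the masks are built from the code's parity-check structure is omitted, and all shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class DoubleMaskedAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x, mask_a, mask_b):
        # Each branch sees the same token sequence but a different mask,
        # so the two branches learn different relations between codeword bits.
        out_a, _ = self.attn_a(x, x, x, attn_mask=mask_a)
        out_b, _ = self.attn_b(x, x, x, attn_mask=mask_b)
        return self.fuse(torch.cat([out_a, out_b], dim=-1))

# Usage sketch: x has shape (batch, n_bits, dim); mask_a and mask_b are
# (n_bits, n_bits) boolean matrices where True marks positions to be ignored.
```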
S-Mixup: Structural Mixup for Graph Neural Networks
results: 通过在真实世界基准数据集上进行的广泛实验,我们证明了 S-Mixup 在节点分类任务中的有效性,尤其是在异配(heterophilous)图的情形下。Abstract
Existing studies for applying the mixup technique on graphs mainly focus on graph classification tasks, while the research in node classification is still under-explored. In this paper, we propose a novel mixup augmentation for node classification called Structural Mixup (S-Mixup). The core idea is to take into account the structural information while mixing nodes. Specifically, S-Mixup obtains pseudo-labels for unlabeled nodes in a graph along with their prediction confidence via a Graph Neural Network (GNN) classifier. These serve as the criteria for the composition of the mixup pool for both inter and intra-class mixups. Furthermore, we utilize the edge gradient obtained from the GNN training and propose a gradient-based edge selection strategy for selecting edges to be attached to the nodes generated by the mixup. Through extensive experiments on real-world benchmark datasets, we demonstrate the effectiveness of S-Mixup evaluated on the node classification task. We observe that S-Mixup enhances the robustness and generalization performance of GNNs, especially in heterophilous situations. The source code of S-Mixup can be found at \url{https://github.com/SukwonYun/S-Mixup}
摘要
现有将 mixup 技术应用于图的研究主要集中在图分类任务上,节点分类方面的研究仍然不足。在本文中,我们提出了一种面向节点分类的新型 mixup 增强方法,称为结构 mixup(S-Mixup)。其核心思想是在混合节点时考虑结构信息。具体而言,S-Mixup 通过图神经网络(GNN)分类器为图中未标注节点获取伪标签及其预测置信度,并以此作为构建类间与类内 mixup 混合池的依据。此外,我们利用 GNN 训练中得到的边梯度,提出了一种基于梯度的边选择策略,用于为 mixup 生成的节点选择要连接的边。通过在真实世界基准数据集上的大量实验,我们证明了 S-Mixup 在节点分类任务上的有效性。我们观察到 S-Mixup 提升了 GNN 的鲁棒性和泛化性能,尤其是在异配情形下。S-Mixup 的源代码可在 \url{https://github.com/SukwonYun/S-Mixup} 获取。
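A rough sketch of the confidence-gated node mixup step is shown below; the full S-Mixup pipeline (inter/intra-class pool construction and gradient-based edge attachment) is more involved, and the threshold and names here are assumptions.

```python
import torch

def mixup_nodes(x, pseudo_labels, confidence, tau=0.9, lam=0.5):
    """Mix pairs of high-confidence nodes; returns mixed features and label pairs."""
    pool = torch.nonzero(confidence > tau, as_tuple=False).flatten()  # confident pseudo-labels only
    perm = pool[torch.randperm(pool.numel())]
    x_mix = lam * x[pool] + (1.0 - lam) * x[perm]   # convex combination of node features
    return x_mix, pseudo_labels[pool], pseudo_labels[perm]
```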
Safety Filter Design for Neural Network Systems via Convex Optimization
methods: 该方法利用 NN 验证工具对 NN 动力学进行过近似(上界逼近),然后通过鲁棒线性 MPC 搜索能够保证约束满足的控制器。
results: 数值算例表明,该方法可以有效地保证 NN 系统的安全性,并且能够应对不同的建模误差。Abstract
With the increase in data availability, it has been widely demonstrated that neural networks (NN) can capture complex system dynamics precisely in a data-driven manner. However, the architectural complexity and nonlinearity of the NNs make it challenging to synthesize a provably safe controller. In this work, we propose a novel safety filter that relies on convex optimization to ensure safety for a NN system, subject to additive disturbances that are capable of capturing modeling errors. Our approach leverages tools from NN verification to over-approximate NN dynamics with a set of linear bounds, followed by an application of robust linear MPC to search for controllers that can guarantee robust constraint satisfaction. We demonstrate the efficacy of the proposed framework numerically on a nonlinear pendulum system.
摘要
随着数据可得性的提升,大量研究表明神经网络(NN)能够以数据驱动的方式精确刻画复杂系统的动力学。然而,神经网络的结构复杂性和非线性使得综合出一个可证明安全的控制器颇具挑战。在本工作中,我们提出了一种新的安全滤波器,它依靠凸优化来保证 NN 系统在加性扰动(可用于刻画建模误差)下的安全性。我们的方法利用 NN 验证工具,用一组线性界对 NN 动力学进行过近似,随后应用鲁棒线性 MPC 来搜索能够保证鲁棒约束满足的控制器。我们在一个非线性单摆系统上通过数值实验展示了所提框架的有效性。
Rigid Transformations for Stabilized Lower Dimensional Space to Support Subsurface Uncertainty Quantification and Interpretation
paper_authors: Ademide O. Mabadeje, Michael J. Pyrcz
for: This paper aims to improve the accuracy and repeatability of nonlinear dimensionality reduction (NDR) methods for subsurface datasets, which are characterized by big data challenges such as high dimensionality and complex relationships.
methods: The proposed method employs rigid transformations to stabilize the Euclidean invariant representation of the data, integrates out-of-sample points (OOSP), and quantifies uncertainty using a stress ratio (SR) metric.
results: The proposed method is validated using synthetic data, distance metrics, and real-world wells from the Duvernay Formation, and shows improved accuracy and repeatability compared to existing methods. The SR metric provides valuable insights into uncertainty, enabling better model adjustments and inferential analysis.Abstract
Subsurface datasets inherently possess big data characteristics such as vast volume, diverse features, and high sampling speeds, further compounded by the curse of dimensionality from various physical, engineering, and geological inputs. Among the existing dimensionality reduction (DR) methods, nonlinear dimensionality reduction (NDR) methods, especially Metric-multidimensional scaling (MDS), are preferred for subsurface datasets due to their inherent complexity. While MDS retains intrinsic data structure and quantifies uncertainty, its limitations include unstabilized unique solutions invariant to Euclidean transformations and an absence of out-of-sample points (OOSP) extension. To enhance subsurface inferential and machine learning workflows, datasets must be transformed into stable, reduced-dimension representations that accommodate OOSP. Our solution employs rigid transformations for a stabilized Euclidean invariant representation for LDS. By computing an MDS input dissimilarity matrix, and applying rigid transformations on multiple realizations, we ensure transformation invariance and integrate OOSP. This process leverages a convex hull algorithm and incorporates loss function and normalized stress for distortion quantification. We validate our approach with synthetic data, varying distance metrics, and real-world wells from the Duvernay Formation. Results confirm our method's efficacy in achieving consistent LDS representations. Furthermore, our proposed "stress ratio" (SR) metric provides insight into uncertainty, beneficial for model adjustments and inferential analysis. Consequently, our workflow promises enhanced repeatability and comparability in NDR for subsurface energy resource engineering and associated big data workflows.
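As an illustration of the stabilization step described above, the sketch below rigidly aligns one MDS realization onto a reference embedding with a Kabsch-style fit and computes a Kruskal-style normalized stress; the paper's full workflow (convex hull algorithm, OOSP integration, stress ratio) is not reproduced, so treat this as a simplified stand-in.

```python
import numpy as np

def rigid_align(X, X_ref):
    """Rotate/translate X (n x d) onto X_ref with an SVD-based rigid (Kabsch) fit."""
    mu_x, mu_r = X.mean(axis=0), X_ref.mean(axis=0)
    U, _, Vt = np.linalg.svd((X - mu_x).T @ (X_ref - mu_r))
    R = U @ Vt
    if np.linalg.det(R) < 0:        # flip one axis to exclude reflections
        U[:, -1] *= -1.0
        R = U @ Vt
    return (X - mu_x) @ R + mu_r

def normalized_stress(D_high, X_low):
    """Kruskal-style normalized stress between input dissimilarities and embedded distances."""
    D_low = np.linalg.norm(X_low[:, None, :] - X_low[None, :, :], axis=-1)
    return np.sqrt(np.sum((D_high - D_low) ** 2) / np.sum(D_high ** 2))
```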
Decentralized Graph Neural Network for Privacy-Preserving Recommendation
results: 通过在三个公共数据集上进行的广泛实验,验证了我们的框架在不同场景下相对于现有方法的一致优势。Abstract
Building a graph neural network (GNN)-based recommender system without violating user privacy proves challenging. Existing methods can be divided into federated GNNs and decentralized GNNs. But both methods have undesirable effects, i.e., low communication efficiency and privacy leakage. This paper proposes DGREC, a novel decentralized GNN for privacy-preserving recommendations, where users can choose to publicize their interactions. It includes three stages, i.e., graph construction, local gradient calculation, and global gradient passing. The first stage builds a local inner-item hypergraph for each user and a global inter-user graph. The second stage models user preference and calculates gradients on each local device. The third stage designs a local differential privacy mechanism named secure gradient-sharing, which proves strong privacy-preserving of users' private data. We conduct extensive experiments on three public datasets to validate the consistent superiority of our framework.
摘要
在不侵犯用户隐私的前提下构建基于图神经网络(GNN)的推荐系统颇具挑战。现有方法可分为联邦 GNN 和去中心化 GNN 两类,但二者都存在不足,即通信效率低和隐私泄露。本文提出 DGREC,一种新颖的面向隐私保护推荐的去中心化 GNN,用户可以自行选择是否公开其交互记录。该系统包括三个阶段:图构建、本地梯度计算和全局梯度传递。第一阶段为每个用户构建本地的物品内部超图,并构建全局的用户间图;第二阶段对用户偏好进行建模,并在每个本地设备上计算梯度;第三阶段设计了一种名为安全梯度共享的本地差分隐私机制,可被证明对用户隐私数据提供强有力的保护。我们在三个公共数据集上进行了广泛的实验,验证了我们框架的一致优越性。
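A minimal clip-and-noise step of the flavor used for locally differentially private gradient sharing is sketched below; the secure gradient-sharing mechanism in the paper is more elaborate, so the clipping norm and noise scale here are purely illustrative.

```python
import numpy as np

def privatize_gradient(grad, clip_norm=1.0, noise_scale=0.1, rng=None):
    """Clip a user's gradient and add noise before it leaves the device."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / (norm + 1e-12))   # bound each user's contribution
    return clipped + rng.laplace(scale=noise_scale, size=grad.shape)
```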
Freshness or Accuracy, Why Not Both? Addressing Delayed Feedback via Dynamic Graph Neural Networks
paper_authors: Xiaolin Zheng, Zhongyu Wang, Chaochao Chen, Feng Zhu, Jiashu Qian
for: 这篇论文旨在解决在线商业系统中的延迟反馈问题,因为用户反馈的延迟通常会影响模型训练。
methods: 本论文提出了名为 DGDFEM 的延迟反馈建模方法,它包括三个阶段:准备数据管线、构建动态图以及训练 CVR 预测模型。在模型训练过程中,我们提出了一种名为 HLGCN 的新型图卷积方法,可以同时处理转化关系与非转化关系。
results: 我们在三个工业数据集上进行了广泛的实验,验证了我们方法的一致优越性。Abstract
The delayed feedback problem is one of the most pressing challenges in predicting the conversion rate since users' conversions are always delayed in online commercial systems. Although new data are beneficial for continuous training, without complete feedback information, i.e., conversion labels, training algorithms may suffer from overwhelming fake negatives. Existing methods tend to use multitask learning or design data pipelines to solve the delayed feedback problem. However, these methods have a trade-off between data freshness and label accuracy. In this paper, we propose Delayed Feedback Modeling by Dynamic Graph Neural Network (DGDFEM). It includes three stages, i.e., preparing a data pipeline, building a dynamic graph, and training a CVR prediction model. In the model training, we propose a novel graph convolutional method named HLGCN, which leverages both high-pass and low-pass filters to deal with conversion and non-conversion relationships. The proposed method achieves both data freshness and label accuracy. We conduct extensive experiments on three industry datasets, which validate the consistent superiority of our method.
摘要
延迟反馈问题是预测转化率时最为紧迫的挑战之一,因为在线商业系统中用户的转化总是延迟发生的。新数据有利于持续训练,但在缺乏完整反馈信息(即转化标签)的情况下,训练算法可能会被大量的假负样本淹没。现有方法通常使用多任务学习或设计数据管线来解决延迟反馈问题,但这些方法在数据新鲜度与标签准确性之间存在取舍。在本文中,我们提出了基于动态图神经网络的延迟反馈建模方法(DGDFEM),它包括三个阶段:准备数据管线、构建动态图和训练 CVR 预测模型。在模型训练中,我们提出了一种名为 HLGCN 的新型图卷积方法,它利用高通和低通滤波器来处理转化与非转化关系。我们的方法可以同时兼顾数据新鲜度和标签准确性。我们在三个工业数据集上进行了广泛的实验,证明了我们方法的一致优越性。
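To illustrate the high-pass/low-pass idea behind HLGCN, the sketch below applies simple low-pass and high-pass graph filters built from the symmetrically normalized adjacency matrix; the exact filters and how HLGCN switches between them per edge are not reproduced here.

```python
import numpy as np

def normalized_adjacency(A):
    d = A.sum(axis=1)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def low_pass(A, X):
    """Smooths features across neighbors (conversion-like relationships)."""
    return (np.eye(A.shape[0]) + normalized_adjacency(A)) @ X

def high_pass(A, X):
    """Sharpens differences between neighbors (non-conversion-like relationships)."""
    return (np.eye(A.shape[0]) - normalized_adjacency(A)) @ X
```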
results: 研究发现,当模型在服从次高斯分布且满足反集中条件的随机位置上被观测、并带有加性次高斯噪声时,适当初始化的 GD 和 SGD 都能线性收敛到真实参数的一个邻域;此外,SGD 不仅在无噪声场景下以更少的观测和更短的运行时间收敛,在低样本且带噪声的场景下也优于交替最小化和 GD。Abstract
We consider regression of a max-affine model that produces a piecewise linear model by combining affine models via the max function. The max-affine model ubiquitously arises in applications in signal processing and statistics including multiclass classification, auction problems, and convex regression. It also generalizes phase retrieval and learning rectifier linear unit activation functions. We present a non-asymptotic convergence analysis of gradient descent (GD) and mini-batch stochastic gradient descent (SGD) for max-affine regression when the model is observed at random locations following the sub-Gaussianity and an anti-concentration with additive sub-Gaussian noise. Under these assumptions, a suitably initialized GD and SGD converge linearly to a neighborhood of the ground truth specified by the corresponding error bound. We provide numerical results that corroborate the theoretical finding. Importantly, SGD not only converges faster in run time with fewer observations than alternating minimization and GD in the noiseless scenario but also outperforms them in low-sample scenarios with noise.
摘要
我们考虑最大仿射(max-affine)模型的回归问题,该模型通过 max 函数组合多个仿射模型,从而产生分段线性模型。最大仿射模型广泛出现在信号处理和统计的各种应用中,包括多类分类、拍卖问题和凸回归;它还推广了相位恢复以及修正线性单元(ReLU)激活函数的学习。我们给出了梯度下降(GD)和小批量随机梯度下降(SGD)在最大仿射回归中的非渐近收敛分析:当模型在满足次高斯性和反集中条件的随机位置上被观测、并带有加性次高斯噪声时,适当初始化的 GD 和 SGD 会线性收敛到由相应误差界所刻画的真实参数邻域。我们提供的数值结果印证了这一理论发现。重要的是,在无噪声场景下,SGD 不仅比交替最小化和 GD 需要更少的观测、运行时间也更短;在低样本且带噪声的场景下,SGD 同样优于它们。
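A minimal mini-batch SGD loop for max-affine regression, y ≈ max_j (a_j^T x + b_j), is sketched below; the initialization, step size, and stopping rule studied in the paper are not reproduced, so all values here are illustrative.

```python
import numpy as np

def sgd_max_affine(X, y, k, lr=0.1, epochs=50, batch=32, seed=0):
    """Fit y ~ max_j (A[j] @ x + b[j]) with squared loss and mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.normal(scale=0.1, size=(k, d))
    b = np.zeros(k)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch)):
            scores = X[idx] @ A.T + b                   # (batch, k)
            j = scores.argmax(axis=1)                   # active affine piece per sample
            resid = scores[np.arange(len(idx)), j] - y[idx]
            for m in range(k):                          # gradient flows only to the active piece
                sel = j == m
                if sel.any():
                    A[m] -= lr * (resid[sel][:, None] * X[idx][sel]).mean(axis=0)
                    b[m] -= lr * resid[sel].mean()
    return A, b
```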
A Reinforcement Learning Approach for Performance-aware Reduction in Power Consumption of Data Center Compute Nodes
methods: 本研究使用强化学习(Reinforcement Learning)来设计计算节点的功率限制策略,并将其与 Argo 节点资源管理(NRM)软件栈和 Intel Running Average Power Limit(RAPL)硬件控制机制结合使用。
results: 经过训练的智能体可以在不损害应用性能的前提下调整处理器的最大供电功率,从而在功耗与应用性能之间取得平衡;我们使用 STREAM 基准在真实硬件上对其进行了演示和评估。Abstract
As Exascale computing becomes a reality, the energy needs of compute nodes in cloud data centers will continue to grow. A common approach to reducing this energy demand is to limit the power consumption of hardware components when workloads are experiencing bottlenecks elsewhere in the system. However, designing a resource controller capable of detecting and limiting power consumption on-the-fly is a complex issue and can also adversely impact application performance. In this paper, we explore the use of Reinforcement Learning (RL) to design a power capping policy on cloud compute nodes using observations on current power consumption and instantaneous application performance (heartbeats). By leveraging the Argo Node Resource Management (NRM) software stack in conjunction with the Intel Running Average Power Limit (RAPL) hardware control mechanism, we design an agent to control the maximum supplied power to processors without compromising on application performance. Employing a Proximal Policy Optimization (PPO) agent to learn an optimal policy on a mathematical model of the compute nodes, we demonstrate and evaluate using the STREAM benchmark how a trained agent running on actual hardware can take actions by balancing power consumption and application performance.
The Costly Dilemma: Generalization, Evaluation and Cost-Optimal Deployment of Large Language Models
paper_authors: Abi Aryan, Aakash Kumar Nain, Andrew McMahon, Lucas Augusto Meyer, Harpreet Singh Sahota
for: 这个论文是为了解决在生产环境中部署机器学习模型时,常见的三个属性(泛化性、可评估性和成本最优)之间的权衡问题。
methods: 论文提出了一种框架,用于考虑这三个属性的关系,并为大语言模型的开发、部署和管理提供了新的思路。
results: 论文表明,通过这种框架,可以帮助企业更好地评估大语言模型的投资,并且可以在生产环境中部署这些模型,以便更好地满足企业的需求。Abstract
When deploying machine learning models in production for any product/application, there are three properties that are commonly desired. First, the models should be generalizable, in that we can extend it to further use cases as our knowledge of the domain area develops. Second they should be evaluable, so that there are clear metrics for performance and the calculation of those metrics in production settings are feasible. Finally, the deployment should be cost-optimal as far as possible. In this paper we propose that these three objectives (i.e. generalization, evaluation and cost-optimality) can often be relatively orthogonal and that for large language models, despite their performance over conventional NLP models, enterprises need to carefully assess all the three factors before making substantial investments in this technology. We propose a framework for generalization, evaluation and cost-modeling specifically tailored to large language models, offering insights into the intricacies of development, deployment and management for these large language models.
results: 实验结果表明,在各种合成数据和真实的单细胞 RNA 测序(scRNA-seq)数据上,ZIPTF 和 C-ZIPTF 都能够成功重建已知且具有生物学意义的基因表达程序。特别是,当数据中过多零值的概率较高时,ZIPTF 的准确率最高可提升 2.4 倍。此外,C-ZIPTF 还能提高分解结果的一致性和准确性。Abstract
Tensor factorizations (TF) are powerful tools for the efficient representation and analysis of multidimensional data. However, classic TF methods based on maximum likelihood estimation underperform when applied to zero-inflated count data, such as single-cell RNA sequencing (scRNA-seq) data. Additionally, the stochasticity inherent in TFs results in factors that vary across repeated runs, making interpretation and reproducibility of the results challenging. In this paper, we introduce Zero Inflated Poisson Tensor Factorization (ZIPTF), a novel approach for the factorization of high-dimensional count data with excess zeros. To address the challenge of stochasticity, we introduce Consensus Zero Inflated Poisson Tensor Factorization (C-ZIPTF), which combines ZIPTF with a consensus-based meta-analysis. We evaluate our proposed ZIPTF and C-ZIPTF on synthetic zero-inflated count data and synthetic and real scRNA-seq data. ZIPTF consistently outperforms baseline matrix and tensor factorization methods in terms of reconstruction accuracy for zero-inflated data. When the probability of excess zeros is high, ZIPTF achieves up to $2.4\times$ better accuracy. Additionally, C-ZIPTF significantly improves the consistency and accuracy of the factorization. When tested on both synthetic and real scRNA-seq data, ZIPTF and C-ZIPTF consistently recover known and biologically meaningful gene expression programs.
摘要
张量分解(TF)是高效表示和分析多维数据的有力工具。然而,基于最大似然估计的经典 TF 方法在应用于零膨胀计数数据(如单细胞 RNA 测序,scRNA-seq 数据)时表现不佳。此外,TF 固有的随机性会导致不同运行之间得到的因子不一致,使结果的解释和可复现性面临挑战。在本文中,我们提出了零膨胀泊松张量分解(ZIPTF),一种针对含有过多零值的高维计数数据的新型分解方法。为了解决随机性带来的挑战,我们又提出了共识零膨胀泊松张量分解(C-ZIPTF),它将 ZIPTF 与基于共识的元分析相结合。我们在合成的零膨胀计数数据以及合成和真实的 scRNA-seq 数据上评估了所提出的 ZIPTF 和 C-ZIPTF。在零膨胀数据的重建精度方面,ZIPTF 始终优于基线矩阵分解和张量分解方法;当过多零值的概率较高时,ZIPTF 的精度最高可提升 2.4 倍。此外,C-ZIPTF 显著提高了分解的一致性和准确性。在合成和真实 scRNA-seq 数据上的测试中,ZIPTF 和 C-ZIPTF 都能一致地恢复已知且具有生物学意义的基因表达程序。
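The core likelihood building block is the zero-inflated Poisson: with probability pi an entry is an excess (structural) zero, otherwise it is Poisson with a rate reconstructed from the factor matrices. A per-entry log-likelihood under common conventions is sketched below; the paper's variational inference machinery is not shown.

```python
import numpy as np
from scipy.special import gammaln

def zip_loglik(x, lam, pi):
    """Log-likelihood of count x under a zero-inflated Poisson(lam) with zero-inflation pi."""
    if x == 0:
        return np.log(pi + (1.0 - pi) * np.exp(-lam))
    return np.log(1.0 - pi) + x * np.log(lam) - lam - gammaln(x + 1.0)
```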
methods: 本研究提出了一种新算法,该算法在类的 Littlestone 维度为 $d$ 时至多犯 $O(256^d)$ 次错误。我们的证明比之前的算法简单得多,只用到了 Littlestone 维度的一些基本性质。
results: 本研究的结果是,对于 Littlestone 维度为 $d$ 的类,我们的算法至多犯 $O(256^d)$ 次错误;同时我们观察到,该模型中不存在至多犯 $2^{d+1}-2$ 次错误的算法。此外,我们还解决了 Hasrati 和 Ben-David(ALT'23)提出的一个开放问题,即每个具有递归可枚举表示的有限 Littlestone 维度类都存在一个可计算的在线学习器(在不可实现样本上可能无定义)。Abstract
We consider online learning in the model where a learning algorithm can access the class only via the consistency oracle -- an oracle, that, at any moment, can give a function from the class that agrees with all examples seen so far. This model was recently considered by Assos et al. (COLT'23). It is motivated by the fact that standard methods of online learning rely on computing the Littlestone dimension of subclasses, a problem that is computationally intractable. Assos et al. gave an online learning algorithm in this model that makes at most $C^d$ mistakes on classes of Littlestone dimension $d$, for some absolute unspecified constant $C > 0$. We give a novel algorithm that makes at most $O(256^d)$ mistakes. Our proof is significantly simpler and uses only very basic properties of the Littlestone dimension. We also observe that there exists no algorithm in this model that makes at most $2^{d+1}-2$ mistakes. We also observe that our algorithm (as well as the algorithm of Assos et al.) solves an open problem by Hasrati and Ben-David (ALT'23). Namely, it demonstrates that every class of finite Littlestone dimension with recursively enumerable representation admits a computable online learner (that may be undefined on unrealizable samples).
摘要
我们考虑这样一种在线学习模型:学习算法只能通过一致性预言机(consistency oracle)访问概念类,该预言机在任意时刻都能给出类中一个与迄今所见全部样例一致的函数。这一模型最近由 Assos 等人(COLT'23)提出,其动机在于标准的在线学习方法依赖于计算子类的 Littlestone 维度,而这是一个计算上难解的问题。Assos 等人在该模型中给出了一个在线学习算法,它在类的 Littlestone 维度为 $d$ 时至多犯 $C^d$ 次错误,其中 $C>0$ 是某个未指明的绝对常数。我们给出了一个新算法,其错误次数至多为 $O(256^d)$。我们的证明明显更简单,只用到了 Littlestone 维度的一些非常基本的性质。我们还观察到,该模型中不存在至多犯 $2^{d+1}-2$ 次错误的算法。此外,我们的算法(以及 Assos 等人的算法)解决了 Hasrati 和 Ben-David(ALT'23)提出的一个开放问题,即证明了每个具有递归可枚举表示的有限 Littlestone 维度类都存在一个可计算的在线学习器(在不可实现样本上可能无定义)。
Natural Evolution Strategies as a Black Box Estimator for Stochastic Variational Inference
methods: 提出一种基于自然进化策略(Natural Evolution Strategies)的替代估计器,它不对所使用分布的类型做任何假设,从而允许构建在 VAE 框架下原本无法实现的模型。
results: 所提出的估计器不依赖重参数化技巧,因而不受其对分布类型的限制,可以构建在 VAE 框架下原本无法实现的模型。Abstract
Stochastic variational inference and its derivatives in the form of variational autoencoders enjoy the ability to perform Bayesian inference on large datasets in an efficient manner. However, performing inference with a VAE requires a certain design choice (i.e. reparameterization trick) to allow unbiased and low variance gradient estimation, restricting the types of models that can be created. To overcome this challenge, an alternative estimator based on natural evolution strategies is proposed. This estimator does not make assumptions about the kind of distributions used, allowing for the creation of models that would otherwise not have been possible under the VAE framework.
摘要
随机变分推断及其以变分自编码器(VAE)形式出现的衍生方法,能够高效地在大规模数据集上进行贝叶斯推断。然而,使用 VAE 进行推断需要特定的设计选择(即重参数化技巧)才能获得无偏且低方差的梯度估计,这限制了可以构建的模型类型。为了克服这一挑战,本文提出了一种基于自然进化策略的替代估计器。该估计器不对所使用的分布类型做任何假设,从而允许构建在 VAE 框架下原本无法实现的模型。
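A simplified score-function estimator in the spirit of natural evolution strategies is sketched below: it estimates the gradient of E_{z ~ N(mu, sigma^2 I)}[f(z)] with respect to mu from black-box evaluations of f, without any reparameterization; this is a generic stand-in under stated assumptions, not the paper's exact estimator.

```python
import numpy as np

def nes_gradient(f, mu, sigma, n_samples=100, rng=None):
    """Black-box estimate of d/d_mu E_{z ~ N(mu, sigma^2 I)}[f(z)] for 1D array mu."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu[None, :] + sigma * eps
    values = np.array([f(zi) for zi in z])
    values = values - values.mean()                 # baseline for variance reduction
    # Score function of the Gaussian with respect to mu is eps / sigma.
    return (values[:, None] * eps).mean(axis=0) / sigma
```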
Unbiased Decisions Reduce Regret: Adversarial Domain Adaptation for the Bank Loan Problem
methods: 本研究使用对抗优化(AdOpt),通过对抗域适应直接处理训练集中的偏差,从而学习无偏且信息丰富的表示。
results: AdOpt significantly exceeds state-of-the-art performance on a set of challenging benchmark problems, and our experiments also provide initial evidence that the introduction of adversarial domain adaptation improves fairness in this setting.Abstract
In many real world settings binary classification decisions are made based on limited data in near real-time, e.g. when assessing a loan application. We focus on a class of these problems that share a common feature: the true label is only observed when a data point is assigned a positive label by the principal, e.g. we only find out whether an applicant defaults if we accepted their loan application. As a consequence, the false rejections become self-reinforcing and cause the labelled training set, that is being continuously updated by the model decisions, to accumulate bias. Prior work mitigates this effect by injecting optimism into the model, however this comes at the cost of increased false acceptance rate. We introduce adversarial optimism (AdOpt) to directly address bias in the training set using adversarial domain adaptation. The goal of AdOpt is to learn an unbiased but informative representation of past data, by reducing the distributional shift between the set of accepted data points and all data points seen thus far. AdOpt significantly exceeds state-of-the-art performance on a set of challenging benchmark problems. Our experiments also provide initial evidence that the introduction of adversarial domain adaptation improves fairness in this setting.
摘要
在许多现实场景中,二分类决策需要基于有限数据近乎实时地做出,例如审批贷款申请。我们关注这类问题中具有一个共同特征的子类:只有当数据点被决策方赋予正类标签时,才能观测到其真实标签,例如只有在接受了贷款申请之后,我们才能知道申请人是否违约。其后果是,错误的拒绝会自我强化,使得由模型决策持续更新的带标签训练集不断累积偏差。先前的工作通过向模型注入乐观性来缓解这一效应,但代价是错误接受率的上升。我们提出对抗乐观(AdOpt),利用对抗域适应直接处理训练集中的偏差。AdOpt 的目标是通过缩小已接受数据点集合与迄今所见全部数据点之间的分布偏移,学习一种无偏但信息丰富的历史数据表示。AdOpt 在一组具有挑战性的基准问题上显著超越了最先进方法的表现。我们的实验还提供了初步证据,表明引入对抗域适应能够改善该场景下的公平性。
Regret Lower Bounds in Multi-agent Multi-armed Bandit
results: 本研究给出了一系列适用于不同设定的 regret 下界,并证明了它们的紧性,从而弥合了与先前工作中已知上界之间的差距。Abstract
Multi-armed Bandit motivates methods with provable upper bounds on regret and also the counterpart lower bounds have been extensively studied in this context. Recently, Multi-agent Multi-armed Bandit has gained significant traction in various domains, where individual clients face bandit problems in a distributed manner and the objective is the overall system performance, typically measured by regret. While efficient algorithms with regret upper bounds have emerged, limited attention has been given to the corresponding regret lower bounds, except for a recent lower bound for adversarial settings, which, however, has a gap with known upper bounds. To this end, we herein provide the first comprehensive study on regret lower bounds across different settings and establish their tightness. Specifically, when the graphs exhibit good connectivity properties and the rewards are stochastically distributed, we demonstrate a lower bound of order $O(\log T)$ for instance-dependent bounds and $\sqrt{T}$ for mean-gap independent bounds which are tight. Assuming adversarial rewards, we establish a lower bound $O(T^{\frac{2}{3}})$ for connected graphs, thereby bridging the gap between the lower and upper bound in the prior work. We also show a linear regret lower bound when the graph is disconnected. While previous works have explored these settings with upper bounds, we provide a thorough study on tight lower bounds.
摘要
多臂老虎机问题推动了具有可证明 regret 上界的方法的发展,与之对应的下界在这一背景下也得到了广泛研究。近来,多智能体多臂老虎机在各个领域受到了广泛关注:各个客户端以分布式的方式面对老虎机问题,而目标是整体系统性能,通常以 regret 来衡量。虽然已经出现了具有 regret 上界的高效算法,但对应的 regret 下界却鲜有研究,唯一的例外是最近针对对抗设定给出的一个下界,然而它与已知上界之间仍存在差距。为此,我们首次对不同设定下的 regret 下界进行了全面研究,并证明了它们的紧性。具体而言,当图具有良好的连通性且奖励服从随机分布时,我们证明了实例依赖下界为 $O(\log T)$ 阶、与均值差无关的下界为 $\sqrt{T}$ 阶,且二者都是紧的。在对抗奖励的假设下,我们针对连通图建立了 $O(T^{\frac{2}{3}})$ 的下界,从而弥合了先前工作中上下界之间的差距。我们还证明了当图不连通时的线性 regret 下界。先前的工作仅从上界的角度研究了这些设定,而我们对紧下界进行了深入研究。
A Comparative Analysis of the Capabilities of Nature-inspired Feature Selection Algorithms in Predicting Student Performance
results: 结果表明,对于所有数据集,使用NIAs进行特征选择和传统机器学习算法进行分类,可以提高预测精度,同时减少特征集的大小。Abstract
Predicting student performance is key in leveraging effective pre-failure interventions for at-risk students. In this paper, I have analyzed the relative performance of a suite of 12 nature-inspired algorithms when used to predict student performance across 3 datasets consisting of instance-based clickstream data, intra-course single-course performance, and performance when taking multiple courses simultaneously. I found that, for all datasets, leveraging an ensemble approach using NIAs for feature selection and traditional ML algorithms for classification increased predictive accuracy while also reducing feature set size by 2/3.
摘要
预测学生表现是为高风险学生实施有效的事前干预的关键。在本文中,我分析了 12 种自然启发算法在三个数据集(基于实例的点击流数据、单门课程内的表现、以及同时修读多门课程时的表现)上预测学生表现的相对性能。我发现,对于所有数据集,采用以自然启发算法进行特征选择、以传统机器学习算法进行分类的集成方法,既提高了预测精度,又将特征集的规模缩减了三分之二。
Classification of Data Generated by Gaussian Mixture Models Using Deep ReLU Networks
results: 我们得到了不依赖于维度 $d$ 的非渐近上界和超额风险(分类错误率)的收敛速率,表明深度 ReLU 神经网络在分类问题中可以克服维度灾难。Abstract
This paper studies the binary classification of unbounded data from ${\mathbb R}^d$ generated under Gaussian Mixture Models (GMMs) using deep ReLU neural networks. We obtain, for the first time, non-asymptotic upper bounds and convergence rates of the excess risk (excess misclassification error) for the classification without restrictions on model parameters. The convergence rates we derive do not depend on dimension $d$, demonstrating that deep ReLU networks can overcome the curse of dimensionality in classification. While the majority of existing generalization analysis of classification algorithms relies on a bounded domain, we consider an unbounded domain by leveraging the analyticity and fast decay of Gaussian distributions. To facilitate our analysis, we give a novel approximation error bound for general analytic functions using ReLU networks, which may be of independent interest. Gaussian distributions can be adopted nicely to model data arising in applications, e.g., speeches, images, and texts; our results provide a theoretical verification of the observed efficiency of deep neural networks in practical classification problems.
摘要
本文研究使用深度 ReLU 神经网络,对由高斯混合模型(GMM)生成的 ${\mathbb R}^d$ 中无界数据进行二分类的问题。我们首次得到了在不限制模型参数的情况下,分类超额风险(超额误分类错误)的非渐近上界和收敛速率。我们得到的收敛速率不依赖于维度 $d$,表明深度 ReLU 网络能够在分类问题中克服维度灾难。现有的分类算法泛化分析大多依赖于有界定义域,而我们借助高斯分布的解析性和快速衰减,考虑了无界定义域的情形。为便于分析,我们还给出了一个利用 ReLU 网络逼近一般解析函数的新误差界,该结果本身或许也具有独立的价值。高斯分布可以很好地刻画语音、图像和文本等应用中出现的数据;我们的结果为深度神经网络在实际分类问题中展现出的高效性提供了理论验证。
Planning to Learn: A Novel Algorithm for Active Learning during Model-Based Planning
results: 仿真结果表明,在一个与生物学相关的环境中,SL 的表现优于所有其他算法,尤其是优于采用类似原则(例如有向探索和反事实推理)来解决多步规划问题的贝叶斯自适应强化学习和置信上界算法。Abstract
Active Inference is a recent framework for modeling planning under uncertainty. Empirical and theoretical work have now begun to evaluate the strengths and weaknesses of this approach and how it might be improved. A recent extension - the sophisticated inference (SI) algorithm - improves performance on multi-step planning problems through recursive decision tree search. However, little work to date has been done to compare SI to other established planning algorithms. SI was also developed with a focus on inference as opposed to learning. The present paper has two aims. First, we compare performance of SI to Bayesian reinforcement learning (RL) schemes designed to solve similar problems. Second, we present an extension of SI - sophisticated learning (SL) - that more fully incorporates active learning during planning. SL maintains beliefs about how model parameters would change under the future observations expected under each policy. This allows a form of counterfactual retrospective inference in which the agent considers what could be learned from current or past observations given different future observations. To accomplish these aims, we make use of a novel, biologically inspired environment designed to highlight the problem structure for which SL offers a unique solution. Here, an agent must continually search for available (but changing) resources in the presence of competing affordances for information gain. Our simulations show that SL outperforms all other algorithms in this context - most notably, Bayes-adaptive RL and upper confidence bound algorithms, which aim to solve multi-step planning problems using similar principles (i.e., directed exploration and counterfactual reasoning). These results provide added support for the utility of Active Inference in solving this class of biologically-relevant problems and offer added tools for testing hypotheses about human cognition.
results: 我们发现,量子计算的能效优势取决于大规模计算,需要在足够大的运算规模下才能实现。基于真实的物理参数,我们进一步刻画了实现这种能效优势所需的运算规模。Abstract
Energy cost is increasingly crucial in the modern computing industry with the wide deployment of large-scale machine learning models and language models. For the firms that provide computing services, low energy consumption is important both from the perspective of their own market growth and the government's regulations. In this paper, we study the energy benefits of quantum computing vis-a-vis classical computing. Deviating from the conventional notion of quantum advantage based solely on computational complexity, we redefine advantage in an energy efficiency context. Through a Cournot competition model constrained by energy usage, we demonstrate quantum computing firms can outperform classical counterparts in both profitability and energy efficiency at Nash equilibrium. Therefore quantum computing may represent a more sustainable pathway for the computing industry. Moreover, we discover that the energy benefits of quantum computing economies are contingent on large-scale computation. Based on real physical parameters, we further illustrate the scale of operation necessary for realizing this energy efficiency advantage.
摘要
随着大规模机器学习模型和语言模型的广泛部署,能源成本在现代计算产业中日益重要。对提供计算服务的企业而言,低能耗无论从自身市场增长的角度还是从政府监管的角度来看都十分重要。在本文中,我们研究了量子计算相对于经典计算的能源效益。我们不再沿用仅基于计算复杂度的传统量子优势概念,而是在能源效率的语境下重新定义优势。通过一个受能耗约束的库诺(Cournot)竞争模型,我们证明在纳什均衡下,量子计算企业可以在盈利能力和能源效率两方面都优于经典计算企业,因此量子计算可能代表了计算产业更可持续的发展路径。此外,我们发现量子计算经济的能源效益取决于大规模计算。基于真实的物理参数,我们进一步刻画了实现这种能效优势所需的运算规模。
Active Inverse Learning in Stackelberg Trajectory Games
results: 在滚动时域的重复轨迹博弈中,与均匀随机输入相比,所提方法给出的领导者输入能够将不同假设在追随者轨迹条件下的概率收敛速度加快数个数量级。Abstract
Game-theoretic inverse learning is the problem of inferring the players' objectives from their actions. We formulate an inverse learning problem in a Stackelberg game between a leader and a follower, where each player's action is the trajectory of a dynamical system. We propose an active inverse learning method for the leader to infer which hypothesis among a finite set of candidates describes the follower's objective function. Instead of using passively observed trajectories like existing methods, the proposed method actively maximizes the differences in the follower's trajectories under different hypotheses to accelerate the leader's inference. We demonstrate the proposed method in a receding-horizon repeated trajectory game. Compared with uniformly random inputs, the leader inputs provided by the proposed method accelerate the convergence of the probability of different hypotheses conditioned on the follower's trajectory by orders of magnitude.
摘要
博弈论逆学习是指从参与者的行为中推断其目标的问题。我们在领导者与追随者之间的 Stackelberg 博弈中构建了一个逆学习问题,其中每个参与者的动作都是某个动力系统的轨迹。我们提出了一种主动逆学习方法,使领导者能够从有限个候选假设中推断出哪一个刻画了追随者的目标函数。与现有方法被动地使用观测到的轨迹不同,所提方法主动地最大化不同假设下追随者轨迹之间的差异,以加速领导者的推断。我们在滚动时域的重复轨迹博弈中演示了该方法。与均匀随机输入相比,所提方法给出的领导者输入能够将不同假设在追随者轨迹条件下的概率收敛速度加快数个数量级。
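The active ingredient can be caricatured as follows: among candidate leader inputs, pick the one whose predicted follower trajectories disagree most across hypotheses, then update the posterior over hypotheses from the observed trajectory. The dynamics model, likelihoods, and names below are placeholders, not the paper's formulation.

```python
import numpy as np

def pick_leader_input(candidate_inputs, predict_traj, hypotheses):
    """predict_traj(u, h) -> predicted follower trajectory (array) under hypothesis h."""
    best_u, best_gap = None, -np.inf
    for u in candidate_inputs:
        trajs = [predict_traj(u, h) for h in hypotheses]
        gap = sum(np.linalg.norm(ti - tj)
                  for i, ti in enumerate(trajs) for tj in trajs[i + 1:])
        if gap > best_gap:
            best_u, best_gap = u, gap
    return best_u

def update_beliefs(prior, likelihoods):
    """Bayes update of the probability of each hypothesis given the observed trajectory."""
    post = np.asarray(prior) * np.asarray(likelihoods)
    return post / post.sum()
```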
GRINN: A Physics-Informed Neural Network for solving hydrodynamic systems in the presence of self-gravity
results: 在线性区,GRINN 的结果与解析解的偏差在 1% 以内;当扰动增长进入非线性区后,与传统网格代码结果的偏差在 5% 以内。GRINN 的计算时间不随维度增加而增长:在一维和二维计算中,其耗时多于网格代码,但在三维计算中,在精度相当的情况下比网格代码快一个数量级。Abstract
Modeling self-gravitating gas flows is essential to answering many fundamental questions in astrophysics. This spans many topics including planet-forming disks, star-forming clouds, galaxy formation, and the development of large-scale structures in the Universe. However, the nonlinear interaction between gravity and fluid dynamics offers a formidable challenge to solving the resulting time-dependent partial differential equations (PDEs) in three dimensions (3D). By leveraging the universal approximation capabilities of a neural network within a mesh-free framework, physics informed neural networks (PINNs) offer a new way of addressing this challenge. We introduce the gravity-informed neural network (GRINN), a PINN-based code, to simulate 3D self-gravitating hydrodynamic systems. Here, we specifically study gravitational instability and wave propagation in an isothermal gas. Our results match a linear analytic solution to within 1\% in the linear regime and a conventional grid code solution to within 5\% as the disturbance grows into the nonlinear regime. We find that the computation time of the GRINN does not scale with the number of dimensions. This is in contrast to the scaling of the grid-based code for the hydrodynamic and self-gravity calculations as the number of dimensions is increased. Our results show that the GRINN computation time is longer than the grid code in one- and two- dimensional calculations but is an order of magnitude lesser than the grid code in 3D with similar accuracy. Physics-informed neural networks like GRINN thus show promise for advancing our ability to model 3D astrophysical flows.
摘要
To address this challenge, we propose a novel approach based on physics-informed neural networks (PINNs), which leverages the universal approximation capabilities of neural networks to simulate 3D self-gravitating hydrodynamic systems. We introduce the gravity-informed neural network (GRINN), a PINN-based code that can accurately simulate 3D self-gravitating flows.In this study, we specifically focus on gravitational instability and wave propagation in an isothermal gas. Our results show that the GRINN code can accurately capture the linear analytic solution to within 1% in the linear regime and a conventional grid code solution to within 5% as the disturbance grows into the nonlinear regime. Moreover, we find that the computation time of the GRINN code does not scale with the number of dimensions, which is in contrast to the scaling of the grid-based code for the hydrodynamic and self-gravity calculations as the number of dimensions is increased.Our results demonstrate that the GRINN code is an order of magnitude faster than the grid code in 3D with similar accuracy, indicating that PINNs like GRINN hold great promise for advancing our ability to model 3D astrophysical flows.
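To make the PINN mechanics concrete, the following is a minimal sketch of a physics-informed training step on a toy 1-D advection equation (not the GRINN hydrodynamics + self-gravity system): collocation points are sampled, the PDE residual is formed with automatic differentiation, and the network is fit to residual plus initial-condition losses.

```python
# Minimal PINN sketch (toy problem: u_t + c * u_x = 0 on the unit square).
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
c = 1.0

for it in range(200):
    xt = torch.rand(256, 2, requires_grad=True)          # collocation points (x, t)
    u = net(xt)
    grads = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = grads[:, 0], grads[:, 1]
    residual = u_t + c * u_x                               # PDE residual
    x0 = torch.rand(128, 1)
    ic = net(torch.cat([x0, torch.zeros_like(x0)], 1)) - torch.sin(2 * torch.pi * x0)
    loss = residual.pow(2).mean() + ic.pow(2).mean()       # physics + initial-condition loss
    opt.zero_grad(); loss.backward(); opt.step()

print("final loss:", float(loss))
```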
BI-LAVA: Biocuration with Hierarchical Image Labeling through Active Learning and Visual Analysis
results: 经过评估,这种人机混合方法能够有效地帮助领域专家理解分类体系中各类别的特点,并验证和提升已标注与未标注数据集的数据质量。Abstract
In the biomedical domain, taxonomies organize the acquisition modalities of scientific images in hierarchical structures. Such taxonomies leverage large sets of correct image labels and provide essential information about the importance of a scientific publication, which could then be used in biocuration tasks. However, the hierarchical nature of the labels, the overhead of processing images, the absence or incompleteness of labeled data, and the expertise required to label this type of data impede the creation of useful datasets for biocuration. From a multi-year collaboration with biocurators and text-mining researchers, we derive an iterative visual analytics and active learning strategy to address these challenges. We implement this strategy in a system called BI-LAVA Biocuration with Hierarchical Image Labeling through Active Learning and Visual Analysis. BI-LAVA leverages a small set of image labels, a hierarchical set of image classifiers, and active learning to help model builders deal with incomplete ground-truth labels, target a hierarchical taxonomy of image modalities, and classify a large pool of unlabeled images. BI-LAVA's front end uses custom encodings to represent data distributions, taxonomies, image projections, and neighborhoods of image thumbnails, which help model builders explore an unfamiliar image dataset and taxonomy and correct and generate labels. An evaluation with machine learning practitioners shows that our mixed human-machine approach successfully supports domain experts in understanding the characteristics of classes within the taxonomy, as well as validating and improving data quality in labeled and unlabeled collections.
摘要
在生物医学领域,分类体系(taxonomy)将科学图像的采集方式组织成层次结构。这些分类体系依赖大量正确的图像标签,提供了关于科学出版物重要性的关键信息,可进一步用于生物数据管护(biocuration)任务。然而,标签的层次性、图像处理的开销、标注数据的缺失或不完整,以及标注这类数据所需的专业知识,都阻碍了可用于生物数据管护的数据集的构建。基于与生物数据管护人员和文本挖掘研究者多年的合作,我们提出了一种迭代式的可视分析与主动学习策略来应对这些挑战,并将其实现为 BI-LAVA 系统。BI-LAVA 利用少量图像标签、一组层次化的图像分类器以及主动学习,帮助模型构建者处理不完整的真实标签、面向层次化的图像模态分类体系,并对大量未标注图像进行分类。BI-LAVA 的前端使用自定义编码来表示数据分布、分类体系、图像投影以及图像缩略图的邻域,帮助模型构建者探索陌生的图像数据集与分类体系,并修正和生成标签。与机器学习从业者共同进行的评估表明,我们的人机混合方法能够有效支持领域专家理解分类体系中各类别的特征,并验证和提升已标注与未标注数据集的数据质量。
A physics-informed machine learning model for reconstruction of dynamic loads
paper_authors: Gledson Rodrigo Tondo, Igor Kavrakov, Guido Morgenthal
For: This paper aims to develop a probabilistic physics-informed machine-learning framework for reconstructing dynamic forces on long-span bridges based on measured deflections, velocities, or accelerations.* Methods: The proposed framework uses Gaussian process regression to model the relationship between the measured data and the dynamic forces, and can handle incomplete and contaminated data.* Results: The developed framework is applied to an aerodynamic analysis of the Great Belt East Bridge, and the results show good agreement between the applied and predicted dynamic load. The framework can be used for validation of design models and assumptions, as well as prognosis of responses to assist in damage detection and structural health monitoring.* For: 这篇论文旨在开发一种基于测量挠度、速度或加速度的概率物理信息机器学习框架,用于重建长跨桥梁上的动态荷载。* Methods: 所提框架使用高斯过程回归来建模测量数据与动态荷载之间的关系,并能处理不完整和受污染的数据。* Results: 该框架被应用于大贝尔特东桥(Great Belt East Bridge)的气动分析,结果显示施加荷载与预测动态荷载之间吻合良好;该框架可用于验证设计模型与假设,以及预测响应以辅助损伤检测和结构健康监测。Abstract
Long-span bridges are subjected to a multitude of dynamic excitations during their lifespan. To account for their effects on the structural system, several load models are used during design to simulate the conditions the structure is likely to experience. These models are based on different simplifying assumptions and are generally guided by parameters that are stochastically identified from measurement data, making their outputs inherently uncertain. This paper presents a probabilistic physics-informed machine-learning framework based on Gaussian process regression for reconstructing dynamic forces based on measured deflections, velocities, or accelerations. The model can work with incomplete and contaminated data and offers a natural regularization approach to account for noise in the measurement system. An application of the developed framework is given by an aerodynamic analysis of the Great Belt East Bridge. The aerodynamic response is calculated numerically based on the quasi-steady model, and the underlying forces are reconstructed using sparse and noisy measurements. Results indicate a good agreement between the applied and the predicted dynamic load and can be extended to calculate global responses and the resulting internal forces. Uses of the developed framework include validation of design models and assumptions, as well as prognosis of responses to assist in damage detection and structural health monitoring.
摘要
长跨桥梁在其使用寿命内会经受多种动态激励。为了考虑这些激励对结构系统的影响,设计阶段会采用多种荷载模型来模拟结构可能经历的工况。这些模型基于不同的简化假设,其参数通常由测量数据随机辨识得到,因此其输出具有内在的不确定性。本文提出一种基于高斯过程回归的概率物理信息机器学习框架,用于根据测量的挠度、速度或加速度重建动态荷载。该模型可以处理不完整和受污染的数据,并提供一种自然的正则化方式来考虑测量系统中的噪声。作为应用示例,本文对大贝尔特东桥进行了气动分析:基于准定常模型数值计算气动响应,并利用稀疏且带噪声的测量数据重建其背后的动态荷载。结果表明施加荷载与预测荷载之间吻合良好,该方法还可以扩展到计算整体响应及由此产生的内力。该框架可用于验证设计模型和假设,以及预测响应以辅助损伤检测和结构健康监测。
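As a concrete illustration of the Gaussian-process machinery (only the GP regression step, under the simplifying assumption of direct, noisy observations of the force rather than the paper's structural-dynamics link to deflections/velocities/accelerations):

```python
# Hedged sketch: GP regression recovering a smooth load history from sparse, noisy samples.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 200)[:, None]
true_force = np.sin(t).ravel() + 0.3 * np.sin(3 * t).ravel()

idx = rng.choice(len(t), size=30, replace=False)            # sparse measurements
y = true_force[idx] + 0.05 * rng.standard_normal(30)        # contaminated by noise

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)   # noise term acts as natural regularization
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t[idx], y)
mean, std = gp.predict(t, return_std=True)                   # posterior mean and uncertainty

print("max abs reconstruction error:", np.abs(mean - true_force).max())
```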
Monte Carlo guided Diffusion for Bayesian linear inverse problems
paper_authors: Gabriel Cardoso, Yazid Janati El Idrissi, Sylvain Le Corff, Eric Moulines
for: 这篇论文旨在求解结合前向测量模型知识与先验模型的病态线性逆问题,这类问题广泛出现在计算摄影、医学成像等多种应用中。
methods: 这个论文使用了分数基生成模型(SGM)来解决这些问题,特别是在填充问题中。
results: 该论文在贝叶斯框架下将恢复问题表述为 Feynman-Kac 模型,并使用序贯蒙特卡洛方法进行求解。数值实验表明,所提算法在处理病态逆问题时优于现有基线方法。Abstract
Ill-posed linear inverse problems that combine knowledge of the forward measurement model with prior models arise frequently in various applications, from computational photography to medical imaging. Recent research has focused on solving these problems with score-based generative models (SGMs) that produce perceptually plausible images, especially in inpainting problems. In this study, we exploit the particular structure of the prior defined in the SGM to formulate recovery in a Bayesian framework as a Feynman--Kac model adapted from the forward diffusion model used to construct score-based diffusion. To solve this Feynman--Kac problem, we propose the use of Sequential Monte Carlo methods. The proposed algorithm, MCGdiff, is shown to be theoretically grounded and we provide numerical simulations showing that it outperforms competing baselines when dealing with ill-posed inverse problems.
摘要
结合前向测量模型知识与先验模型的病态线性逆问题,在计算摄影和医学成像等多种应用中经常出现。近期研究着重使用基于得分的生成模型(SGM)来求解这类问题,以生成在感知上合理的图像,特别是在图像修复问题中。本研究利用 SGM 中先验的特殊结构,在贝叶斯框架下将恢复问题表述为由构造得分扩散所用的前向扩散模型改造而来的 Feynman-Kac 模型,并提出使用序贯蒙特卡洛方法对其求解。我们提出的算法 MCGdiff 具有理论依据,数值模拟表明其在处理病态逆问题时优于竞争基线。
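The Monte Carlo ingredient can be illustrated on a much simpler problem than MCGdiff addresses: self-normalised importance sampling for a Bayesian linear inverse problem with a Gaussian prior, standing in for the particle approximation of the posterior (everything below is a toy assumption, not the paper's sampler).

```python
# Toy sketch: particles from the prior, reweighted by the Gaussian likelihood of y = A x + noise.
import numpy as np

rng = np.random.default_rng(2)
d, m, sigma = 5, 3, 0.1
A = rng.standard_normal((m, d))
x_true = rng.standard_normal(d)
y = A @ x_true + sigma * rng.standard_normal(m)

particles = rng.standard_normal((10_000, d))                           # prior samples x ~ N(0, I)
log_w = -np.sum((y - particles @ A.T) ** 2, axis=1) / (2 * sigma**2)   # log-likelihood weights
w = np.exp(log_w - log_w.max()); w /= w.sum()

posterior_mean = w @ particles
print("posterior mean:", np.round(posterior_mean, 2))
print("effective sample size:", 1.0 / np.sum(w ** 2))
```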
An Adaptive Approach for Probabilistic Wind Power Forecasting Based on Meta-Learning
for: 这篇论文研究了一种用于概率风电功率预测(WPF)的自适应方法,包括离线和在线两个学习阶段。
methods: 在离线学习阶段,通过元学习的内循环和外循环更新来训练基础预测模型,使其对不同的预测任务(例如不同预测提前期或不同地点的概率风电功率预测)具有良好的适应性。在在线学习阶段,基础预测模型结合增量学习技术用于在线预测。
results: 对于不同的预测任务和地点,提出了两种应用:一是针对不同的领先时间(temporal adaptation),二是针对新建的风力电站(spatial adaptation)。数据集实验结果表明,提出的方法具有优秀的适应性,与现有的方法相比。Abstract
This paper studies an adaptive approach for probabilistic wind power forecasting (WPF) including offline and online learning procedures. In the offline learning stage, a base forecast model is trained via inner and outer loop updates of meta-learning, which endows the base forecast model with excellent adaptability to different forecast tasks, i.e., probabilistic WPF with different lead times or locations. In the online learning stage, the base forecast model is applied to online forecasting combined with incremental learning techniques. On this basis, the online forecast takes full advantage of recent information and the adaptability of the base forecast model. Two applications are developed based on our proposed approach concerning forecasting with different lead times (temporal adaptation) and forecasting for newly established wind farms (spatial adaptation), respectively. Numerical tests were conducted on real-world wind power data sets. Simulation results validate the advantages in adaptivity of the proposed methods compared with existing alternatives.
摘要
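For readers unfamiliar with the inner/outer-loop structure mentioned above, the following is a schematic first-order meta-learning loop (Reptile-style, a stand-in rather than the paper's algorithm); tasks, network sizes, and the plain MSE loss are placeholders for the paper's forecasting tasks and probabilistic loss.

```python
# Hedged sketch: inner-loop task adaptation, outer-loop meta-update of the base forecaster.
import copy
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
meta_lr, inner_lr = 0.1, 1e-2

def sample_task():
    w = torch.randn(8, 1)                       # hypothetical task: random linear mapping
    X = torch.randn(64, 8)
    return X, X @ w + 0.1 * torch.randn(64, 1)

for outer_step in range(100):
    task_model = copy.deepcopy(model)
    opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
    X, y = sample_task()
    for _ in range(5):                          # inner loop: adapt to one task
        loss = torch.nn.functional.mse_loss(task_model(X), y)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                       # outer loop: move meta-weights toward adapted weights
        for p, q in zip(model.parameters(), task_model.parameters()):
            p += meta_lr * (q - p)
```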
MultiSChuBERT: Effective Multimodal Fusion for Scholarly Document Quality Prediction
results: 研究发现,将视觉信息与文本信息相结合可以显著改善预测结果。此外,研究还发现逐渐解冻视觉子模型的权重可以减少过拟合、提高结果。最后,研究发现使用更新的文本嵌入模型可以进一步提升结果。Abstract
Automatic assessment of the quality of scholarly documents is a difficult task with high potential impact. Multimodality, in particular the addition of visual information next to text, has been shown to improve the performance on scholarly document quality prediction (SDQP) tasks. We propose the multimodal predictive model MultiSChuBERT. It combines a textual model based on chunking full paper text and aggregating computed BERT chunk-encodings (SChuBERT), with a visual model based on Inception V3. Our work contributes to the current state-of-the-art in SDQP in three ways. First, we show that the method of combining visual and textual embeddings can substantially influence the results. Second, we demonstrate that gradual-unfreezing of the weights of the visual sub-model reduces its tendency to overfit the data, improving results. Third, we show the retained benefit of multimodality when replacing standard BERT$_{\textrm{BASE}}$ embeddings with more recent state-of-the-art text embedding models. Using BERT$_{\textrm{BASE}}$ embeddings, on the (log) number of citations prediction task with the ACL-BiblioMetry dataset, our MultiSChuBERT (text+visual) model obtains an $R^{2}$ score of 0.454 compared to 0.432 for the SChuBERT (text only) model. Similar improvements are obtained on the PeerRead accept/reject prediction task. In our experiments using SciBERT, scincl, SPECTER and SPECTER2.0 embeddings, we show that each of these tailored embeddings adds further improvements over the standard BERT$_{\textrm{BASE}}$ embeddings, with the SPECTER2.0 embeddings performing best.
摘要
自动评估学术文献质量是一项复杂且具有高潜在影响力的任务。多模态方法,特别是在文本之外加入视觉信息,已被证明可以提高学术文献质量预测(SDQP)任务的性能。我们提出了多模态预测模型 MultiSChuBERT,它将基于全文分块并聚合 BERT 分块编码的文本模型(SChuBERT)与基于 Inception V3 的视觉模型相结合。我们的工作从三个方面推进了 SDQP 领域的最新水平。首先,我们发现视觉嵌入与文本嵌入的结合方式会显著影响结果。其次,我们证明逐渐解冻视觉子模型的权重可以降低其过拟合倾向,从而提升结果。最后,我们发现在将标准的 BERT$_{\textrm{BASE}}$ 嵌入替换为更新的最先进文本嵌入模型时,多模态带来的收益仍然保留。使用 BERT$_{\textrm{BASE}}$ 嵌入,在 ACL-BiblioMetry 数据集的(对数)引用数预测任务上,MultiSChuBERT(文本+视觉)模型取得 0.454 的 $R^{2}$ 分数,高于 SChuBERT(仅文本)模型的 0.432;在 PeerRead 接受/拒绝预测任务上也取得了类似的提升。在使用 SciBERT、scincl、SPECTER 和 SPECTER2.0 嵌入的实验中,我们发现这些专用嵌入均能在标准 BERT$_{\textrm{BASE}}$ 嵌入的基础上带来进一步的改进,其中 SPECTER2.0 嵌入表现最佳。
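The gradual-unfreezing idea mentioned above is easy to demonstrate in isolation. The sketch below uses a small stand-in network for the visual branch (the paper's branch is Inception V3) and an illustrative schedule that releases parameter groups from the top of the network downward as epochs progress.

```python
# Hedged sketch of gradual unfreezing; the module grouping and schedule are assumptions.
import torch.nn as nn

visual = nn.Sequential(                                     # stand-in for the visual sub-model
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128),
)
param_groups = [list(m.parameters()) for m in visual if list(m.parameters())]

for p in visual.parameters():
    p.requires_grad = False                                  # epoch 0: visual branch fully frozen

def unfreeze(epoch, groups_per_epoch=1):
    """Release the last `groups_per_epoch * epoch` parameter groups (top-down)."""
    n = min(len(param_groups), epoch * groups_per_epoch)
    for group in param_groups[len(param_groups) - n:]:
        for p in group:
            p.requires_grad = True

for epoch in range(4):
    unfreeze(epoch)
    trainable = sum(p.numel() for p in visual.parameters() if p.requires_grad)
    print(f"epoch {epoch}: {trainable} trainable parameters in the visual branch")
```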
RAVEN: In-Context Learning with Retrieval Augmented Encoder-Decoder Language Models
results: 通过广泛的实验,我们证明了 RAVEN 模型在某些场景下可以明显超越 ATLAS 模型,并达到与当前最先进语言模型相当的结果,尽管其参数量明显更少。此外,RAVEN 在少样本学习上的表现也优于 ATLAS。我们的研究展示了检索增强的编码器-解码器语言模型在上下文学习中的潜力,并鼓励这一方向的进一步研究。Abstract
In this paper, we investigate the in-context learning ability of retrieval-augmented encoder-decoder language models. We first conduct a comprehensive analysis of the state-of-the-art ATLAS model and identify its limitations in in-context learning, primarily due to a mismatch between pretraining and testing, as well as a restricted context length. To address these issues, we propose RAVEN, a model that combines retrieval-augmented masked language modeling and prefix language modeling. We further introduce Fusion-in-Context Learning to enhance the few-shot performance by enabling the model to leverage more in-context examples without requiring additional training or model modifications. Through extensive experiments, we demonstrate that RAVEN significantly outperforms ATLAS and achieves results comparable to the most advanced language models in certain scenarios, despite having substantially fewer parameters. Our work underscores the potential of retrieval-augmented encoder-decoder language models for in-context learning and encourages further research in this direction.
摘要
在这篇论文中,我们研究了检索增强的编码器-解码器语言模型的上下文学习能力。我们首先对当前最先进的 ATLAS 模型进行了全面分析,发现其在上下文学习方面存在局限,主要源于预训练与测试之间的不匹配以及受限的上下文长度。为了解决这些问题,我们提出了 RAVEN 模型,它将检索增强的掩码语言建模与前缀语言建模相结合。我们还引入了 Fusion-in-Context Learning,使模型能够在不需要额外训练或修改模型的情况下利用更多的上下文示例,从而提升少样本性能。通过广泛的实验,我们证明 RAVEN 在某些场景下显著优于 ATLAS,并达到与最先进语言模型相当的结果,尽管其参数量明显更少。我们的工作强调了检索增强的编码器-解码器语言模型在上下文学习方面的潜力,并鼓励这一方向的进一步研究。
methods: 这篇论文使用的方法包括程序合成技术和 GPU 加速。
results: 这篇论文提供了首个大规模的 REI 数据集,以及一些初步的机器学习基线。Abstract
We propose \emph{regular expression inference (REI)} as a challenge for code/language modelling, and the wider machine learning community. REI is a supervised machine learning (ML) and program synthesis task, and poses the problem of finding minimal regular expressions from examples: Given two finite sets of strings $P$ and $N$ and a cost function $\text{cost}(\cdot)$, the task is to generate an expression $r$ that accepts all strings in $P$ and rejects all strings in $N$, while no other such expression $r'$ exists with $\text{cost}(r')<\text{cost}(r)$. REI has advantages as a challenge problem: (i) regular expressions are well-known, widely used, and a natural idealisation of code; (ii) REI's asymptotic worst-case complexity is well understood; (iii) REI has a small number of easy to understand parameters (e.g.~$P$ or $N$ cardinality, string lengths of examples, or the cost function); this lets us easily finetune REI-hardness; (iv) REI is an unsolved problem for deep learning based ML. Recently, an REI solver was implemented on GPUs, using program synthesis techniques. This enabled, for the first time, fast generation of minimal expressions for complex REI instances. Building on this advance, we generate and publish the first large-scale datasets for REI, and devise and evaluate several initial heuristic and machine learning baselines. We invite the community to participate and explore ML methods that learn to solve REI problems. We believe that progress in REI directly translates to code/language modelling.
摘要
我们提出正则表达式推断(REI)作为代码/语言建模乃至更广泛的机器学习社区的一个挑战问题。REI 是一个有监督机器学习(ML)与程序合成任务,其目标是从示例中找到最小的正则表达式:给定两个有限字符串集合 $P$ 与 $N$ 以及代价函数 $\text{cost}(\cdot)$,任务是生成一个表达式 $r$,使其接受 $P$ 中的所有字符串并拒绝 $N$ 中的所有字符串,且不存在另一个满足同样条件的表达式 $r'$ 使得 $\text{cost}(r')<\text{cost}(r)$。REI 作为挑战问题具有以下优点:(一)正则表达式广为人知、应用广泛,是代码的一种自然理想化;(二)REI 的最坏情况渐近复杂度已被很好地理解;(三)REI 只有少量易于理解的参数(例如 $P$ 或 $N$ 的基数、示例字符串的长度或代价函数),这使我们能够方便地调节 REI 的难度;(四)对基于深度学习的 ML 而言,REI 仍是未解决的问题。最近,一个基于程序合成技术的 REI 求解器在 GPU 上得到实现,首次实现了针对复杂 REI 实例快速生成最小表达式。在此基础上,我们生成并发布了首批大规模 REI 数据集,并设计和评估了若干初步的启发式与机器学习基线。我们邀请社区参与并探索能够学习求解 REI 问题的 ML 方法。我们相信 REI 上的进展将直接转化为代码/语言建模上的进步。
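To make the task definition tangible, here is a deliberately naive generate-and-test REI baseline over a tiny pattern grammar, with cost taken to be pattern length (an assumption for this sketch; the paper's cost function and solver are far more general).

```python
# Hedged sketch: enumerate short patterns, keep the cheapest one consistent with (P, N).
import itertools
import re

P = {"ab", "aab", "aaab"}
N = {"b", "ba", ""}
pieces = ["a", "b", "a*", "b*", "a+", "b+"]

def consistent(pattern):
    try:
        return (all(re.fullmatch(pattern, s) for s in P)
                and not any(re.fullmatch(pattern, s) for s in N))
    except re.error:
        return False

best = None
for length in range(1, 4):                                   # patterns built from 1-3 pieces
    for parts in itertools.product(pieces, repeat=length):
        pattern = "".join(parts)
        if consistent(pattern) and (best is None or len(pattern) < len(best)):
            best = pattern

print("cheapest consistent pattern:", best)                  # e.g. "a+b"
```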
SciRE-Solver: Efficient Sampling of Diffusion Probabilistic Models by Score-integrand Solver with Recursive Derivative Estimation
paper_authors: Shigui Li, Wei Chen, Delu Zeng
for: This paper proposes a high-efficiency sampler for Diffusion Probabilistic Models (DPMs), which are powerful generative models known for their ability to generate high-fidelity image samples.
methods: The proposed method uses a score-based exact solution paradigm for the diffusion ODEs corresponding to the sampling process of DPMs, and introduces a new perspective on developing numerical algorithms for solving diffusion ODEs. The method also uses a recursive derivative estimation (RDE) method to reduce the estimation error.
results: The proposed method, called SciRE-Solver, achieves state-of-the-art (SOTA) sampling performance with a limited number of score function evaluations (NFE) on both discrete-time and continuous-time DPMs. Specifically, the method achieves $3.48$ FID with $12$ NFE and $2.42$ FID with $20$ NFE for continuous-time DPMs on CIFAR10, respectively. The method also reaches SOTA values of $2.40$ FID with $100$ NFE for continuous-time DPM and of $3.15$ FID with $84$ NFE for discrete-time DPM on CIFAR-10, as well as of $2.17$ ($2.02$) FID with $18$ ($50$) NFE for discrete-time DPM on CelebA 64$\times$64.Abstract
Diffusion probabilistic models (DPMs) are a powerful class of generative models known for their ability to generate high-fidelity image samples. A major challenge in the implementation of DPMs is the slow sampling process. In this work, we bring a high-efficiency sampler for DPMs. Specifically, we propose a score-based exact solution paradigm for the diffusion ODEs corresponding to the sampling process of DPMs, which introduces a new perspective on developing numerical algorithms for solving diffusion ODEs. To achieve an efficient sampler, we propose a recursive derivative estimation (RDE) method to reduce the estimation error. With our proposed solution paradigm and RDE method, we propose the score-integrand solver with the convergence order guarantee as efficient solver (SciRE-Solver) for solving diffusion ODEs. The SciRE-Solver attains state-of-the-art (SOTA) sampling performance with a limited number of score function evaluations (NFE) on both discrete-time and continuous-time DPMs in comparison to existing training-free sampling algorithms. Such as, we achieve $3.48$ FID with $12$ NFE and $2.42$ FID with $20$ NFE for continuous-time DPMs on CIFAR10, respectively. Different from other samplers, SciRE-Solver has the promising potential to surpass the FIDs achieved in the original papers of some pre-trained models with a small NFEs. For example, we reach SOTA value of $2.40$ FID with $100$ NFE for continuous-time DPM and of $3.15$ FID with $84$ NFE for discrete-time DPM on CIFAR-10, as well as of $2.17$ ($2.02$) FID with $18$ ($50$) NFE for discrete-time DPM on CelebA 64$\times$64.
摘要
扩散概率模型(DPMs)是一类强大的生成模型,能够生成高保真的图像样本。然而,DPMs 在实现中面临采样过程缓慢的问题。在这项工作中,我们为 DPMs 提出了一种高效采样器。具体而言,我们针对与 DPMs 采样过程对应的扩散 ODE 提出了一种基于得分的精确解范式,为求解扩散 ODE 的数值算法提供了新的视角。此外,我们还提出了一种递归导数估计(RDE)方法以降低估计误差。基于所提的解范式与 RDE 方法,我们提出了具有收敛阶保证的得分被积函数求解器(SciRE-Solver)来求解扩散 ODE。与现有的免训练采样算法相比,SciRE-Solver 在离散时间和连续时间 DPMs 上均以有限的得分函数评估次数(NFE)达到了最先进(SOTA)的采样性能。例如,对于 CIFAR10 上的连续时间 DPMs,我们分别以 $12$ NFE 和 $20$ NFE 取得 $3.48$ 和 $2.42$ 的 FID。与其他采样器不同,SciRE-Solver 有潜力在较小的 NFE 下超过某些预训练模型原论文中报告的 FID:例如,在 CIFAR-10 上,连续时间 DPM 以 $100$ NFE 达到 $2.40$ FID,离散时间 DPM 以 $84$ NFE 达到 $3.15$ FID;在 CelebA 64$\times$64 上,离散时间 DPM 以 $18$($50$)NFE 达到 $2.17$($2.02$)FID。
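For intuition about what "solving the diffusion ODE with few score evaluations" means, the toy sampler below integrates the VP probability-flow ODE for 1-D Gaussian data, where the score is analytic, with a simple second-order Heun rule. This is not SciRE-Solver; real DPM samplers replace the analytic score with a trained network and use higher-order, NFE-efficient rules.

```python
# Hedged sketch: probability-flow ODE sampling with an analytic score (1-D Gaussian data).
import numpy as np

beta = 5.0                                        # constant noise schedule for simplicity
data_var = 0.25                                   # data ~ N(0, 0.25)

def score(x, t):
    var_t = data_var * np.exp(-beta * t) + 1.0 - np.exp(-beta * t)   # marginal variance at time t
    return -x / var_t

def drift(x, t):                                  # probability-flow ODE: dx/dt = -0.5*beta*(x + score)
    return -0.5 * beta * (x + score(x, t))

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)                   # start from the prior at t = 1
ts = np.linspace(1.0, 1e-3, 50)
for t0, t1 in zip(ts[:-1], ts[1:]):
    h = t1 - t0
    k1 = drift(x, t0)
    x = x + 0.5 * h * (k1 + drift(x + h * k1, t1))    # Heun (2nd-order) update

print("sample variance:", x.var(), "target:", data_var)
```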
results: 本文的理论结果通过数学实验证明,可以高精度地重construct Radon-NikodymDerivative在任何特定点。Abstract
We discuss the problem of estimating Radon-Nikodym derivatives. This problem appears in various applications, such as covariate shift adaptation, likelihood-ratio testing, mutual information estimation, and conditional probability estimation. To address the above problem, we employ the general regularization scheme in reproducing kernel Hilbert spaces. The convergence rate of the corresponding regularized algorithm is established by taking into account both the smoothness of the derivative and the capacity of the space in which it is estimated. This is done in terms of general source conditions and the regularized Christoffel functions. We also find that the reconstruction of Radon-Nikodym derivatives at any particular point can be done with high order of accuracy. Our theoretical results are illustrated by numerical simulations.
摘要
我们讨论 Radon-Nikodym 导数的估计问题。该问题出现在多种应用中,例如协变量偏移适应、似然比检验、互信息估计以及条件概率估计。为了解决这一问题,我们在再生核希尔伯特空间中采用一般的正则化方案。我们同时考虑导数的光滑性与其所在估计空间的容量(通过一般源条件与正则化 Christoffel 函数刻画),建立了相应正则化算法的收敛速度。此外,我们还发现可以以高阶精度在任意给定点重建 Radon-Nikodym 导数。我们的理论结果通过数值实验加以说明。
Back to Basics: A Sanity Check on Modern Time Series Classification Algorithms
results: 研究发现,使用Tabular模型可以在大约19%的单variate数据集和28%的多variate数据集上表现出色,并且在大约50%的数据集上达到了10个百分点的准确率。这些结果表明,在开发时序分类器时,需要考虑使用Tabular模型作为基线。这些模型快速、简单、可能比较容易理解和部署。Abstract
The state-of-the-art in time series classification has come a long way, from the 1NN-DTW algorithm to the ROCKET family of classifiers. However, in the current fast-paced development of new classifiers, taking a step back and performing simple baseline checks is essential. These checks are often overlooked, as researchers are focused on establishing new state-of-the-art results, developing scalable algorithms, and making models explainable. Nevertheless, there are many datasets that look like time series at first glance, but classic algorithms such as tabular methods with no time ordering may perform better on such problems. For example, for spectroscopy datasets, tabular methods tend to significantly outperform recent time series methods. In this study, we compare the performance of tabular models using classic machine learning approaches (e.g., Ridge, LDA, RandomForest) with the ROCKET family of classifiers (e.g., Rocket, MiniRocket, MultiRocket). Tabular models are simple and very efficient, while the ROCKET family of classifiers are more complex and have state-of-the-art accuracy and efficiency among recent time series classifiers. We find that tabular models outperform the ROCKET family of classifiers on approximately 19% of univariate and 28% of multivariate datasets in the UCR/UEA benchmark and achieve accuracy within 10 percentage points on about 50% of datasets. Our results suggest that it is important to consider simple tabular models as baselines when developing time series classifiers. These models are very fast, can be as effective as more complex methods and may be easier to understand and deploy.
摘要
时间序列分类的最新技术已经取得了长足进步,从 1NN-DTW 算法发展到 ROCKET 系列分类器。然而,在新分类器快速涌现的当下,退一步进行简单的基线检查是必不可少的。这些检查常常被忽视,因为研究者更关注刷新最新成果、开发可扩展算法以及提升模型可解释性。然而,许多数据集乍看像时间序列,但对这类问题,不考虑时间顺序的经典表格方法可能表现更好。例如,对于光谱数据集,表格方法往往显著优于近期的时间序列方法。在本研究中,我们比较了采用经典机器学习方法(如 Ridge、LDA、RandomForest)的表格模型与 ROCKET 系列分类器(如 Rocket、MiniRocket、MultiRocket)的性能。表格模型简单且高效,而 ROCKET 系列分类器更为复杂,在近期的时间序列分类器中具有最先进的精度与效率。我们发现,在 UCR/UEA 基准中,表格模型在约 19% 的单变量数据集和约 28% 的多变量数据集上优于 ROCKET 系列分类器,并且在约 50% 的数据集上准确率差距在 10 个百分点以内。我们的结果表明,在开发时间序列分类器时,应将简单的表格模型作为基线加以考虑。这些模型速度很快,可能与更复杂的方法同样有效,而且往往更容易理解和部署。
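The baseline check advocated above is straightforward to reproduce in spirit: treat each equal-length series as a plain feature vector and fit standard tabular classifiers. The data below is synthetic (loading from the UCR/UEA archive is omitted), so only the workflow, not the reported numbers, is illustrated.

```python
# Hedged sketch of the tabular-baseline workflow on a placeholder dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifierCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 150))            # stand-in for an (n_samples, series_length) dataset
y = (X[:, :10].mean(axis=1) > 0).astype(int)   # synthetic labels for the sketch

for name, clf in [("ridge", RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))),
                  ("random forest", RandomForestClassifier(n_estimators=200, random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```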
The Challenge of Fetal Cardiac MRI Reconstruction Using Deep Learning
results: 研究发现,使用多架构和训练策略可以提高模型的性能,但是这些模型仍然不能准确地捕捉婴儿心脏的动态特征。Abstract
Dynamic free-breathing fetal cardiac MRI is one of the most challenging modalities, which requires high temporal and spatial resolution to depict rapid changes in a small fetal heart. The ability of deep learning methods to recover undersampled data could help to optimise the kt-SENSE acquisition strategy and improve non-gated kt-SENSE reconstruction quality. In this work, we explore supervised deep learning networks for reconstruction of kt-SENSE style acquired data using an extensive in vivo dataset. Having access to fully-sampled low-resolution multi-coil fetal cardiac MRI, we study the performance of the networks to recover fully-sampled data from undersampled data. We consider model architectures together with training strategies taking into account their application in the real clinical setup used to collect the dataset to enable networks to recover prospectively undersampled data. We explore a set of modifications to form a baseline performance evaluation for dynamic fetal cardiac MRI on real data. We systematically evaluate the models on coil-combined data to reveal the effect of the suggested changes to the architecture in the context of fetal heart properties. We show that the best-performers recover a detailed depiction of the maternal anatomy on a large scale, but the dynamic properties of the fetal heart are under-represented. Training directly on multi-coil data improves the performance of the models, allows their prospective application to undersampled data and makes them outperform CTFNet introduced for adult cardiac cine MRI. However, these models deliver similar qualitative performances recovering the maternal body very well but underestimating the dynamic properties of fetal heart. This dynamic feature of fast change of fetal heart that is highly localised suggests both more targeted training and evaluation methods might be needed for fetal heart application.
摘要
动态自由呼吸胎儿心脏磁共振成像(MRI)是最具挑战性的成像方式之一,需要很高的时间与空间分辨率来呈现胎儿小心脏的快速变化。深度学习方法恢复欠采样数据的能力,有助于优化 kt-SENSE 采集策略并提高非门控 kt-SENSE 重建质量。在这项工作中,我们利用大规模的在体数据集,研究用于重建 kt-SENSE 式采集数据的有监督深度学习网络。借助完整采样的低分辨率多线圈胎儿心脏 MRI 数据,我们研究了网络从欠采样数据中恢复完整采样数据的能力。我们结合采集该数据集的真实临床设置来考虑模型架构与训练策略,使网络能够应用于前瞻性欠采样数据。我们系统地在线圈合并数据上评估了这些模型,以揭示所提出的架构改动在胎儿心脏特性背景下的影响。结果表明,表现最好的模型能够在大尺度上细致地重建母体解剖结构,但胎儿心脏的动态特性仍未得到充分体现。直接在多线圈数据上训练可以提升模型性能,使其能够前瞻性地应用于欠采样数据,并优于为成人心脏电影 MRI 提出的 CTFNet。然而,这些模型的定性表现相似:母体部分恢复得很好,但低估了胎儿心脏的动态特性。胎儿心脏变化快速且高度局部化的特点表明,针对胎儿心脏的应用可能需要更有针对性的训练与评估方法。
A Trustable LSTM-Autoencoder Network for Cyberbullying Detection on Social Media Using Synthetic Data
results: 研究发现,提出的LSTM-Autoencoder网络在所有数据集上表现最佳,具有95%的准确率。与前一些相关研究相比,本研究的结果具有状态的前进。Abstract
Social media cyberbullying has a detrimental effect on human life. As online social networking grows daily, the amount of hate speech also increases. Such terrible content can cause depression and actions related to suicide. This paper proposes a trustable LSTM-Autoencoder Network for cyberbullying detection on social media using synthetic data. We have demonstrated a cutting-edge method to address data availability difficulties by producing machine-translated data. However, several languages such as Hindi and Bangla still lack adequate investigations due to a lack of datasets. We carried out experimental identification of aggressive comments on Hindi, Bangla, and English datasets using the proposed model and traditional models, including Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), LSTM-Autoencoder, Word2vec, Bidirectional Encoder Representations from Transformers (BERT), and Generative Pre-trained Transformer 2 (GPT-2) models. We employed evaluation metrics such as f1-score, accuracy, precision, and recall to assess the models performance. Our proposed model outperformed all the models on all datasets, achieving the highest accuracy of 95%. Our model achieves state-of-the-art results among all the previous works on the dataset we used in this paper.
摘要
社交媒体上的网络欺凌对人们的生活造成了有害影响。随着在线社交网络的日益发展,仇恨言论的数量也在增加。这类恶劣内容可能导致抑郁以及与自杀相关的行为。本文提出了一种可信的 LSTM-Autoencoder 网络,利用合成数据在社交媒体上检测网络欺凌。我们展示了一种通过生成机器翻译数据来应对数据可用性难题的前沿方法。然而,由于缺乏数据集,印地语和孟加拉语等多种语言仍然缺乏充分的研究。我们使用所提模型以及包括长短期记忆网络(LSTM)、双向长短期记忆网络(BiLSTM)、LSTM-Autoencoder、Word2vec、BERT 和 GPT-2 在内的传统模型,在印地语、孟加拉语和英语数据集上进行了攻击性评论识别实验。我们采用 F1 分数、准确率、精确率和召回率等评价指标来评估模型表现。我们提出的模型在所有数据集上均优于其他模型,取得了 95% 的最高准确率,在本文所用数据集上达到了此前相关工作中的最先进水平。
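As a structural sketch of the kind of model named above, the following PyTorch module combines an LSTM encoder, an LSTM decoder that reconstructs the embedded sequence, and a classification head on the latent code. Layer sizes, the vocabulary, and the joint loss weighting are placeholders, not the paper's configuration.

```python
# Hedged sketch of an LSTM-Autoencoder with a classification head.
import torch
import torch.nn as nn

class LSTMAutoencoderClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, embed_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        emb = self.embed(token_ids)                          # (B, T, E)
        _, (h, _) = self.encoder(emb)                        # latent code from the last hidden state
        latent = h[-1]
        repeated = latent.unsqueeze(1).repeat(1, token_ids.size(1), 1)
        reconstruction, _ = self.decoder(repeated)           # reconstruct the embedded sequence
        return self.classifier(latent), reconstruction, emb

model = LSTMAutoencoderClassifier()
tokens = torch.randint(0, 10_000, (4, 32))                   # dummy batch of token ids
logits, recon, target = model(tokens)
loss = nn.functional.cross_entropy(logits, torch.zeros(4, dtype=torch.long)) \
       + nn.functional.mse_loss(recon, target.detach())      # classification + reconstruction loss
print(logits.shape, float(loss))
```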
Towards Temporal Edge Regression: A Case Study on Agriculture Trade Between Nations
for: This paper is written for exploring the application of Graph Neural Networks (GNNs) to edge regression tasks in both static and dynamic settings, specifically focusing on predicting food and agriculture trade values between nations.
methods: The paper introduces three simple yet strong baselines and comprehensively evaluates one static and three dynamic GNN models using the UN Trade dataset.
results: The experimental results show that the baselines exhibit remarkably strong performance across various settings, highlighting the inadequacy of existing GNNs. Additionally, the paper finds that TGN outperforms other GNN models, suggesting TGN is a more appropriate choice for edge regression tasks. Furthermore, the proportion of negative edges in the training samples significantly affects the test performance.
methods: 论文引入三种简单强大的基线,并对一个静态和三个动态 GNN 模型进行了全面的评估,使用 UN 贸易 dataset。
results: 实验结果显示,基线在不同设置下表现出色,强调现有 GNN 的不足。此外,论文发现 TGN 在边 regression 任务中表现更出色,建议 TGN 是更适合的选择。此外,论文发现训练样本中负边的比例对测试性能产生了显著的影响。Abstract
Recently, Graph Neural Networks (GNNs) have shown promising performance in tasks on dynamic graphs such as node classification, link prediction and graph regression. However, few work has studied the temporal edge regression task which has important real-world applications. In this paper, we explore the application of GNNs to edge regression tasks in both static and dynamic settings, focusing on predicting food and agriculture trade values between nations. We introduce three simple yet strong baselines and comprehensively evaluate one static and three dynamic GNN models using the UN Trade dataset. Our experimental results reveal that the baselines exhibit remarkably strong performance across various settings, highlighting the inadequacy of existing GNNs. We also find that TGN outperforms other GNN models, suggesting TGN is a more appropriate choice for edge regression tasks. Moreover, we note that the proportion of negative edges in the training samples significantly affects the test performance. The companion source code can be found at: https://github.com/scylj1/GNN_Edge_Regression.
摘要
近期,图神经网络(GNNs)在动态图任务(如节点分类、链接预测和图回归)中表现出色。然而,针对具有重要现实应用价值的时序边回归任务的研究还很少。在这篇论文中,我们探讨了 GNNs 在静态和动态设置下的边回归任务中的应用,重点预测国家之间的粮食与农产品贸易额。我们提出了三种简单而强大的基线,并使用 UN Trade 数据集对一个静态 GNN 模型和三个动态 GNN 模型进行了全面评估。实验结果表明,这些基线在各种设置下都表现出惊人的强劲性能,凸显了现有 GNNs 的不足。我们还发现 TGN 优于其他 GNN 模型,表明 TGN 更适合边回归任务。此外,训练样本中负边所占比例对测试性能有显著影响。相关源代码见:https://github.com/scylj1/GNN_Edge_Regression。
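One plausible "simple yet strong" baseline for temporal edge regression is persistence: predict that each edge keeps its most recently observed weight. This is a guess at what such a baseline could look like, not necessarily one of the paper's three; the toy edge list below mimics yearly trade values.

```python
# Hedged sketch: persistence baseline on a toy temporal edge list.
import numpy as np
import pandas as pd

edges = pd.DataFrame({
    "year":  [2015, 2016, 2017, 2015, 2016, 2017],
    "src":   ["A",  "A",  "A",  "B",  "B",  "B"],
    "dst":   ["B",  "B",  "B",  "C",  "C",  "C"],
    "value": [10.0, 12.0, 11.0, 5.0,  4.0,  6.0],
})

test_year = 2017
train = edges[edges.year < test_year]
test = edges[edges.year == test_year]

last_seen = (train.sort_values("year")
                  .groupby(["src", "dst"], as_index=False)["value"].last()
                  .rename(columns={"value": "pred"}))          # last observed value per edge
merged = test.merge(last_seen, on=["src", "dst"], how="left")
mse = np.mean((merged["pred"] - merged["value"]) ** 2)
print("persistence-baseline MSE:", mse)
```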
Synthesizing Political Zero-Shot Relation Classification via Codebook Knowledge, NLI, and ChatGPT
methods: 本研究采用零样本方法,包括一种基于自然语言推理(NLI)的新方法 ZSP。ZSP 采用树状查询框架,将任务分解为上下文、情态和类别消歧三个层级。
results: 在细粒度的 Rootcode 分类上,ZSP 的 F1 分数提升了 40%,其性能与有监督的 BERT 模型相比具有竞争力,可用于事件记录验证和本体开发。Abstract
Recent supervised models for event coding vastly outperform pattern-matching methods. However, their reliance solely on new annotations disregards the vast knowledge within expert databases, hindering their applicability to fine-grained classification. To address these limitations, we explore zero-shot approaches for political event ontology relation classification, by leveraging knowledge from established annotation codebooks. Our study encompasses both ChatGPT and a novel natural language inference (NLI) based approach named ZSP. ZSP adopts a tree-query framework that deconstructs the task into context, modality, and class disambiguation levels. This framework improves interpretability, efficiency, and adaptability to schema changes. By conducting extensive experiments on our newly curated datasets, we pinpoint the instability issues within ChatGPT and highlight the superior performance of ZSP. ZSP achieves an impressive 40% improvement in F1 score for fine-grained Rootcode classification. ZSP demonstrates competitive performance compared to supervised BERT models, positioning it as a valuable tool for event record validation and ontology development. Our work underscores the potential of leveraging transfer learning and existing expertise to enhance the efficiency and scalability of research in the field.
摘要
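To illustrate the NLI-based zero-shot idea described above, the snippet below uses the Hugging Face zero-shot classification pipeline with an entailment model. ZSP's tree-query decomposition into context, modality, and class levels is not reproduced; the model name, example sentence, labels, and hypothesis template are all illustrative.

```python
# Hedged sketch: zero-shot relation classification via NLI entailment scores.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentence = "Government forces shelled the rebel-held town overnight."
candidate_relations = ["use conventional military force", "engage in diplomatic cooperation",
                       "provide humanitarian aid", "impose economic sanctions"]

result = classifier(sentence, candidate_labels=candidate_relations,
                    hypothesis_template="This event describes an attempt to {}.")
for label, score in zip(result["labels"], result["scores"]):
    print(f"{score:.2f}  {label}")
```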
Emotion Embeddings -- Learning Stable and Homogeneous Abstractions from Heterogeneous Affective Datasets
results: 实验结果表明,该方法可以实现数据和标签类型之间的协同性,无需增加预测质量的损害,同时提供了可重用、可解释和灵活的模型。Abstract
Human emotion is expressed in many communication modalities and media formats and so their computational study is equally diversified into natural language processing, audio signal analysis, computer vision, etc. Similarly, the large variety of representation formats used in previous research to describe emotions (polarity scales, basic emotion categories, dimensional approaches, appraisal theory, etc.) have led to an ever proliferating diversity of datasets, predictive models, and software tools for emotion analysis. Because of these two distinct types of heterogeneity, at the expressional and representational level, there is a dire need to unify previous work on increasingly diverging data and label types. This article presents such a unifying computational model. We propose a training procedure that learns a shared latent representation for emotions, so-called emotion embeddings, independent of different natural languages, communication modalities, media or representation label formats, and even disparate model architectures. Experiments on a wide range of heterogeneous affective datasets indicate that this approach yields the desired interoperability for the sake of reusability, interpretability and flexibility, without penalizing prediction quality. Code and data are archived under https://doi.org/10.5281/zenodo.7405327 .
摘要
人类情感通过多种交流方式和媒体形式表达,因此其计算研究也同样分散在自然语言处理、音频信号分析、计算机视觉等多个领域。类似地,以往研究用于描述情感的多种表示格式(极性量表、基本情绪类别、维度方法、评价理论等)导致了情感分析数据集、预测模型和软件工具的不断激增。由于在表达层面与表示层面存在这两类异质性,迫切需要对日益分化的数据与标签类型上的既有工作加以统一。本文提出了这样一个统一的计算模型。我们提出一种训练过程,用于学习情感的共享潜在表示(即情感嵌入),它独立于不同的自然语言、交流方式、媒体或表示标签格式,甚至独立于不同的模型架构。在大量异质情感数据集上的实验表明,该方法在不损失预测质量的前提下实现了所期望的互操作性,从而带来了可重用性、可解释性和灵活性。代码与数据存档于 https://doi.org/10.5281/zenodo.7405327 。
Brain-Inspired Computational Intelligence via Predictive Coding
paper_authors: Tommaso Salvatori, Ankur Mali, Christopher L. Buckley, Thomas Lukasiewicz, Rajesh P. N. Rao, Karl Friston, Alexander Ororbia
for: This paper aims to explore the potential of predictive coding (PC) in addressing the limitations of deep neural networks in machine learning and computational intelligence.
methods: The paper surveys the literature on PC and its applications in cognitive control, robotics, and variational inference, highlighting its exciting properties and potential value for the machine learning community.
results: The paper hopes to foreground research in PC-inspired machine learning and encourage further exploration of its potential in the future of computational intelligence.Abstract
Artificial intelligence (AI) is rapidly becoming one of the key technologies of this century. The majority of results in AI thus far have been achieved using deep neural networks trained with the error backpropagation learning algorithm. However, the ubiquitous adoption of this approach has highlighted some important limitations such as substantial computational cost, difficulty in quantifying uncertainty, lack of robustness, unreliability, and biological implausibility. It is possible that addressing these limitations may require schemes that are inspired and guided by neuroscience theories. One such theory, called predictive coding (PC), has shown promising performance in machine intelligence tasks, exhibiting exciting properties that make it potentially valuable for the machine learning community: PC can model information processing in different brain areas, can be used in cognitive control and robotics, and has a solid mathematical grounding in variational inference, offering a powerful inversion scheme for a specific class of continuous-state generative models. With the hope of foregrounding research in this direction, we survey the literature that has contributed to this perspective, highlighting the many ways that PC might play a role in the future of machine learning and computational intelligence at large.
摘要
人工智能(AI)正迅速成为本世纪的关键技术之一。迄今为止,AI 的大多数成果都是通过使用误差反向传播学习算法训练深度神经网络取得的。然而,这种方法的普遍采用也凸显出一些重要局限,包括高昂的计算成本、难以量化不确定性、缺乏鲁棒性、不够可靠以及在生物学上不合理。解决这些局限可能需要受神经科学理论启发和指导的方案。其中一种理论是预测编码(PC),它已在机器智能任务中展现出可观的性能,并具有一些令人振奋、对机器学习社区可能很有价值的特性:PC 可以建模不同脑区的信息处理过程,可用于认知控制和机器人学,并且在变分推断中具有坚实的数学基础,为一类连续状态生成模型提供了强大的反演方案。为了推动这一方向的研究,我们综述了促成这一视角的文献,并着重指出 PC 在机器学习乃至整个计算智能的未来中可能发挥作用的多种方式。
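For readers new to predictive coding, its inference step can be shown on a two-layer linear generative model: latent activities are iteratively nudged to reduce precision-weighted prediction errors, i.e. gradient descent on the PC energy. Sizes, the unit-precision assumption, and the learning rate below are arbitrary choices for this sketch.

```python
# Hedged sketch: predictive-coding inference in a two-layer linear model.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 4)) * 0.3       # generative weights: latent (4) -> data (10)
x_data = rng.standard_normal(10)             # observed data vector

z = np.zeros(4)                              # latent activity, initialised at the prior mean
lr = 0.05
for step in range(200):
    eps_x = x_data - W @ z                   # bottom-level prediction error
    eps_z = -z                               # top-level error against a zero-mean prior
    z += lr * (W.T @ eps_x + eps_z)          # gradient descent on the PC energy

energy = 0.5 * (eps_x @ eps_x + z @ z)
print("final energy:", energy)
```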
Graph-Structured Kernel Design for Power Flow Learning using Gaussian Processes
results: 作者通过实验表明,所提出的 VDK-GP 在中等规模的 500 节点和大规模的 1354 节点电力系统上,相比完整的 GP 可将样本复杂度降低两倍以上。此外,作者还提出了一种新的网络扫掠主动学习算法来加速 VDK 的学习。在测试预测中,该算法在中等规模的 500 节点系统上比 500 次随机试验的平均性能高出两倍,在大规模的 1354 节点系统上比 25 次随机试验的最佳性能高出 10%。此外,作者还展示了所提方法在测试数据分布发生偏移时用于不确定性量化的性能。Abstract
This paper presents a physics-inspired graph-structured kernel designed for power flow learning using Gaussian Process (GP). The kernel, named the vertex-degree kernel (VDK), relies on latent decomposition of voltage-injection relationship based on the network graph or topology. Notably, VDK design avoids the need to solve optimization problems for kernel search. To enhance efficiency, we also explore a graph-reduction approach to obtain a VDK representation with lesser terms. Additionally, we propose a novel network-swipe active learning scheme, which intelligently selects sequential training inputs to accelerate the learning of VDK. Leveraging the additive structure of VDK, the active learning algorithm performs a block-descent type procedure on GP's predictive variance, serving as a proxy for information gain. Simulations demonstrate that the proposed VDK-GP achieves more than two fold sample complexity reduction, compared to full GP on medium scale 500-Bus and large scale 1354-Bus power systems. The network-swipe algorithm outperforms mean performance of 500 random trials on test predictions by two fold for medium-sized 500-Bus systems and best performance of 25 random trials for large-scale 1354-Bus systems by 10%. Moreover, we demonstrate that the proposed method's performance for uncertainty quantification applications with distributionally shifted testing data sets.
摘要
paper_authors: Fernando B. Pérez Maurera, Maurizio Ferrari Dacrema, Pablo Castells, Paolo Cremonesi
for: 这篇论文旨在探讨基于印象(过去推荐的项目)的推荐系统,以提高推荐系统的质量。
methods: 本文使用系统性文献综述方法,从三个基本的研究角度考察使用印象的推荐系统:推荐器、数据集和评价方法。
results: 本文对各种使用印象的推荐系统进行了详细的介绍,还分析了相关的数据集和评价方法。最后,本文提出了一些未解决的问题和未来研究方向,强调文献中缺失的方面可以在未来的研究中进行深入探讨。Abstract
Novel data sources bring new opportunities to improve the quality of recommender systems. Impressions are a novel data source containing past recommendations (shown items) and traditional interactions. Researchers may use impressions to refine user preferences and overcome the current limitations in recommender systems research. The relevance and interest of impressions have increased over the years; hence, the need for a review of relevant work on this type of recommenders. We present a systematic literature review on recommender systems using impressions, focusing on three fundamental angles in research: recommenders, datasets, and evaluation methodologies. We provide three categorizations of papers describing recommenders using impressions, present each reviewed paper in detail, describe datasets with impressions, and analyze the existing evaluation methodologies. Lastly, we present open questions and future directions of interest, highlighting aspects missing in the literature that can be addressed in future works.
摘要
新的数据源带来了改善推荐系统质量的新机遇。印象是一种新的数据源,包含过去的推荐(显示的项目)和传统的交互。研究人员可以使用印象来细化用户喜好,超越当前推荐系统研究的限制。近年来,印象的相关性和受关注程度不断提高,因此需要对这类推荐器的相关工作进行系统性的回顾。本文提供了对使用印象的推荐系统的系统性文献综述,重点是三个基本的研究方向:推荐器、数据集和评估方法。我们提供了三种描述使用印象的推荐器的论文分类,详细介绍每篇综述的文章,描述含印象的数据集,并分析现有的评估方法。最后,我们提出了未解决的问题和未来方向,强调文献中缺失的方面,可以在未来的研究中进行深入探究。
paper_authors: Denis Kutnár, Ivan R Vogelius, Katrin Elisabet Håkansson, Jens Petersen, Jeppe Friborg, Lena Specht, Mogens Bernsdorf, Anita Gothelf, Claus Kristensen, Abraham George Smith For:* The paper aims to investigate the use of Convolutional Neural Networks (CNNs) to predict locoregional recurrences (LRR) in head and neck squamous cell carcinoma (HNSCC) patients based on pre-treatment imaging.Methods:* The study uses pre-treatment 18F-fluorodeoxyglucose positron emission tomography (FDG-PET)/computed tomography (CT) scans to train a CNN to predict LRR volumes.* The dataset is divided into training, validation, and test sets.* The CNN is trained from scratch and compared to a pre-trained CNN and a SUVmax threshold approach.Results:* The SUVmax threshold method had a median volume of 4.6 cubic centimeters (cc) and included 5 out of 7 relapse origin points.* The GTV contour and best CNN segmentations included the relapse origin 6 out of 7 times with median volumes of 28 and 18 cc, respectively.* The CNN included the same or greater number of relapse volume points of origin (POs) with significantly smaller relapse volumes.Abstract
Locoregional recurrences (LRR) are still a frequent site of treatment failure for head and neck squamous cell carcinoma (HNSCC) patients. Identification of high risk subvolumes based on pretreatment imaging is key to biologically targeted radiation therapy. We investigated the extent to which a Convolutional neural network (CNN) is able to predict LRR volumes based on pre-treatment 18F-fluorodeoxyglucose positron emission tomography (FDG-PET)/computed tomography (CT) scans in HNSCC patients and thus the potential to identify biological high risk volumes using CNNs. For 37 patients who had undergone primary radiotherapy for oropharyngeal squamous cell carcinoma, five oncologists contoured the relapse volumes on recurrence CT scans. Datasets of pre-treatment FDG-PET/CT, gross tumour volume (GTV) and contoured relapse for each of the patients were randomly divided into training (n=23), validation (n=7) and test (n=7) datasets. We compared a CNN trained from scratch, a pre-trained CNN, a SUVmax threshold approach, and using the GTV directly. The SUVmax threshold method included 5 out of the 7 relapse origin points within a volume of median 4.6 cubic centimetres (cc). Both the GTV contour and best CNN segmentations included the relapse origin 6 out of 7 times with median volumes of 28 and 18 cc respectively. The CNN included the same or greater number of relapse volume POs, with significantly smaller relapse volumes. Our novel findings indicate that CNNs may predict LRR, yet further work on dataset development is required to attain clinically useful prediction accuracy.
摘要
头颈部鳞状细胞癌(HNSCC)患者的局部区域复发(LRR)仍是治疗失败的常见部位。基于治疗前影像识别高风险亚体积,是实施生物学靶向放射治疗的关键。我们研究了卷积神经网络(CNN)基于治疗前 18F-氟代脱氧葡萄糖正电子发射断层扫描(FDG-PET)/计算机断层扫描(CT)图像预测 HNSCC 患者 LRR 体积的能力,从而评估利用 CNN 识别生物学高风险体积的潜力。对 37 名接受过根治性放疗的口咽鳞状细胞癌患者,由五位肿瘤科医生在复发 CT 图像上勾画复发体积。每位患者的治疗前 FDG-PET/CT、大体肿瘤体积(GTV)和勾画的复发体积被随机划分为训练集(n=23)、验证集(n=7)和测试集(n=7)。我们比较了从零开始训练的 CNN、预训练的 CNN、SUVmax 阈值方法以及直接使用 GTV 的方法。SUVmax 阈值方法在中位数为 4.6 立方厘米(cc)的体积内包含了 7 个复发起源点中的 5 个;GTV 轮廓和最佳 CNN 分割均在 7 次中有 6 次包含复发起源点,中位体积分别为 28 cc 和 18 cc。CNN 在显著更小的复发体积下包含了相同或更多的复发起源点。我们的新发现表明 CNN 有可能预测 LRR,但仍需进一步的数据集建设工作才能达到临床可用的预测精度。
DeepContrast: Deep Tissue Contrast Enhancement using Synthetic Data Degradations and OOD Model Predictions
paper_authors: Nuno Pimpão Martins, Yannis Kalaidzidis, Marino Zerial, Florian Jug
For: The paper aims to improve the quality of microscopy images by addressing the problem of image degradation, specifically blurring and contrast loss, which is a major challenge in life science research.* Methods: The authors use a deep learning approach, training a neural network to learn the inverse of the degradation function, and demonstrate that the network can be used out-of-distribution to improve the quality of less severely degraded images. They also explore the effect of iterative predictions and find that a balance between contrast improvement and retention of image details is necessary.* Results: The authors show that their method can improve the quality of microscopy images, especially in deeper regions of thick samples, and demonstrate the effectiveness of their approach through experiments using real microscopy images.Abstract
Microscopy images are crucial for life science research, allowing detailed inspection and characterization of cellular and tissue-level structures and functions. However, microscopy data are unavoidably affected by image degradations, such as noise, blur, or others. Many such degradations also contribute to a loss of image contrast, which becomes especially pronounced in deeper regions of thick samples. Today, best performing methods to increase the quality of images are based on Deep Learning approaches, which typically require ground truth (GT) data during training. Our inability to counteract blurring and contrast loss when imaging deep into samples prevents the acquisition of such clean GT data. The fact that the forward process of blurring and contrast loss deep into tissue can be modeled, allowed us to propose a new method that can circumvent the problem of unobtainable GT data. To this end, we first synthetically degraded the quality of microscopy images even further by using an approximate forward model for deep tissue image degradations. Then we trained a neural network that learned the inverse of this degradation function from our generated pairs of raw and degraded images. We demonstrated that networks trained in this way can be used out-of-distribution (OOD) to improve the quality of less severely degraded images, e.g. the raw data imaged in a microscope. Since the absolute level of degradation in such microscopy images can be stronger than the additional degradation introduced by our forward model, we also explored the effect of iterative predictions. Here, we observed that in each iteration the measured image contrast kept improving while detailed structures in the images got increasingly removed. Therefore, dependent on the desired downstream analysis, a balance between contrast improvement and retention of image details has to be found.
摘要
显微镜图像对生命科学研究至关重要,它使我们能够细致地检查和表征细胞及组织层面的结构与功能。然而,显微镜数据不可避免地受到噪声、模糊等图像退化的影响。许多这类退化还会造成图像对比度下降,在厚样本的较深区域尤为明显。如今,性能最好的图像质量提升方法大多基于深度学习,而这类方法通常需要在训练时提供真实参考(GT)数据。由于我们无法在对样本深处成像时抵消模糊与对比度损失,因此无法采集到这样干净的 GT 数据。由于深入组织成像时的模糊与对比度损失这一前向过程可以被建模,我们提出了一种能够绕开 GT 数据不可得问题的新方法。为此,我们首先利用一个深部组织图像退化的近似前向模型,对显微镜图像的质量做进一步的人为退化;然后用生成的原始图像与退化图像对训练一个神经网络,使其学习该退化函数的逆过程。我们证明了用这种方式训练的网络可以在分布外(OOD)用于提升退化程度较轻的图像(例如显微镜直接采集的原始数据)的质量。由于这类显微镜图像本身的退化程度可能强于我们前向模型额外引入的退化,我们还研究了迭代预测的效果:在每次迭代中,图像对比度不断提升,但图像中的细节结构也被逐渐抹去。因此,需要根据下游分析的需求,在对比度提升与细节保留之间取得平衡。
GAEI-UNet: Global Attention and Elastic Interaction U-Net for Vessel Image Segmentation
results: 在DRIVE retinal vessel dataset 上进行评估,GAEI-UNet 能够在 SE 和小血管之间的连接性方面表现出色,而且不需要对计算复杂性进行增加。Abstract
Vessel image segmentation plays a pivotal role in medical diagnostics, aiding in the early detection and treatment of vascular diseases. While segmentation based on deep learning has shown promising results, effectively segmenting small structures and maintaining connectivity between them remains challenging. To address these limitations, we propose GAEI-UNet, a novel model that combines global attention and elastic interaction-based techniques. GAEI-UNet leverages global spatial and channel context information to enhance high-level semantic understanding within the U-Net architecture, enabling precise segmentation of small vessels. Additionally, we adopt an elastic interaction-based loss function to improve connectivity among these fine structures. By capturing the forces generated by misalignment between target and predicted shapes, our model effectively learns to preserve the correct topology of vessel networks. Evaluation on retinal vessel dataset -- DRIVE demonstrates the superior performance of GAEI-UNet in terms of SE and connectivity of small structures, without significantly increasing computational complexity. This research aims to advance the field of vessel image segmentation, providing more accurate and reliable diagnostic tools for the medical community. The implementation code is available on Code.
摘要
血管图像分割在医学诊断中起着关键作用,有助于血管疾病的早期发现与治疗。虽然基于深度学习的分割已展现出可观的效果,但有效分割细小结构并保持它们之间的连通性仍然具有挑战性。为了解决这些局限,我们提出了 GAEI-UNet,一种结合全局注意力与弹性相互作用技术的新模型。GAEI-UNet 利用全局空间与通道上下文信息来增强 U-Net 架构中的高层语义理解,从而实现对细小血管的精确分割。此外,我们采用基于弹性相互作用的损失函数来改善这些细小结构之间的连通性:通过捕捉目标形状与预测形状错位所产生的作用力,模型能够有效学习保持血管网络的正确拓扑结构。在 DRIVE 视网膜血管数据集上的评估表明,GAEI-UNet 在 SE 与细小结构连通性方面表现优越,且没有显著增加计算复杂度。本研究旨在推进血管图像分割领域的发展,为医学界提供更准确、更可靠的诊断工具。实现代码可在 Code 获取。
Denoising Diffusion Probabilistic Model for Retinal Image Generation and Segmentation
For: The paper aims to provide a novel dataset and method for retinal image segmentation, which can help detect and diagnose various eye, blood circulation, and brain-related diseases.* Methods: The proposed method uses a Denoising Diffusion Probabilistic Model (DDPM) to generate retinal images and vessel trees, and a two-stage DDPM to guide the generation of fundus images from given vessel trees and random distribution.* Results: The proposed dataset, called Retinal Trees (ReTree), has been evaluated quantitatively and qualitatively, and the results show that the DDPM outperformed GANs in image synthesis. The dataset and source code are available online for further research and validation.Abstract
Experts use retinal images and vessel trees to detect and diagnose various eye, blood circulation, and brain-related diseases. However, manual segmentation of retinal images is a time-consuming process that requires high expertise and is difficult due to privacy issues. Many methods have been proposed to segment images, but the need for large retinal image datasets limits the performance of these methods. Several methods synthesize deep learning models based on Generative Adversarial Networks (GAN) to generate limited sample varieties. This paper proposes a novel Denoising Diffusion Probabilistic Model (DDPM) that outperformed GANs in image synthesis. We developed a Retinal Trees (ReTree) dataset consisting of retinal images, corresponding vessel trees, and a segmentation network based on DDPM trained with images from the ReTree dataset. In the first stage, we develop a two-stage DDPM that generates vessel trees from random numbers belonging to a standard normal distribution. Later, the model is guided to generate fundus images from given vessel trees and random distribution. The proposed dataset has been evaluated quantitatively and qualitatively. Quantitative evaluation metrics include Frechet Inception Distance (FID) score, Jaccard similarity coefficient, Cohen's kappa, Matthew's Correlation Coefficient (MCC), precision, recall, F1-score, and accuracy. We trained the vessel segmentation model with synthetic data to validate our dataset's efficiency and tested it on authentic data. Our developed dataset and source code is available at https://github.com/AAleka/retree.
摘要
专家利用视网膜图像和血管树来检测并诊断多种眼部、血液循环及脑部相关疾病。然而,人工分割视网膜图像耗时费力、需要高度专业知识,且受隐私问题制约。已有许多图像分割方法被提出,但对大规模视网膜图像数据集的需求限制了这些方法的性能。一些方法基于生成对抗网络(GAN)构建深度学习模型,但只能生成有限的样本多样性。本文提出了一种新的去噪扩散概率模型(DDPM),其图像合成效果优于 GAN。我们构建了名为 Retinal Trees(ReTree)的数据集,其中包含视网膜图像、对应的血管树,以及一个用 ReTree 数据训练的基于 DDPM 的分割网络。在第一阶段,我们开发了一个两阶段 DDPM,先从服从标准正态分布的随机数生成血管树;随后引导模型根据给定的血管树和随机分布生成眼底图像。我们从定量和定性两方面评估了所提数据集,定量指标包括 Fréchet Inception Distance(FID)、Jaccard 相似系数、Cohen's kappa、Matthews 相关系数(MCC)、精确率、召回率、F1 分数和准确率。我们用合成数据训练血管分割模型以验证数据集的有效性,并在真实数据上进行了测试。我们开发的数据集和源代码可在 https://github.com/AAleka/retree 获取。
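The DDPM training objective underlying this kind of image synthesis fits in a few lines: sample a timestep, noise the image with the closed-form forward process, and regress the injected noise. The sketch below is generic (a tiny stand-in network instead of a U-Net, and none of the paper's two-stage vessel-tree to fundus conditioning).

```python
# Hedged sketch: the core DDPM noise-prediction loss.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0):
    t = torch.randint(0, T, (x0.size(0),))
    eps = torch.randn_like(x0)
    a = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps               # closed-form q(x_t | x_0)
    return torch.nn.functional.mse_loss(model(x_t, t), eps)  # predict the injected noise

class TinyEpsModel(torch.nn.Module):                         # stand-in for a U-Net
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(1, 1, 3, padding=1)
    def forward(self, x, t):
        return self.conv(x)

model = TinyEpsModel()
x0 = torch.rand(8, 1, 32, 32) * 2 - 1                        # images scaled to [-1, 1]
print("loss:", float(ddpm_loss(model, x0)))
```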
ECPC-IDS:A benchmark endometrail cancer PET/CT image dataset for evaluation of semantic segmentation and detection of hypermetabolic regions
results: 这个论文通过对 ECPC-IDS 进行广泛的实验,证明了深度学习基于语义 segmentation 和对象检测方法的效果,并且这些方法可以帮助研究者开发新的计算机支持诊断技术,从而为临床医生和患者带来很大的 benefit.Abstract
Endometrial cancer is one of the most common tumors in the female reproductive system and is the third most common gynecological malignancy that causes death after ovarian and cervical cancer. Early diagnosis can significantly improve the 5-year survival rate of patients. With the development of artificial intelligence, computer-assisted diagnosis plays an increasingly important role in improving the accuracy and objectivity of diagnosis, as well as reducing the workload of doctors. However, the absence of publicly available endometrial cancer image datasets restricts the application of computer-assisted diagnostic techniques. In this paper, a publicly available Endometrial Cancer PET/CT Image Dataset for Evaluation of Semantic Segmentation and Detection of Hypermetabolic Regions (ECPC-IDS) is published. Specifically, the segmentation section includes PET and CT images, with a total of 7159 images in multiple formats. In order to prove the effectiveness of segmentation methods on ECPC-IDS, five classical deep learning semantic segmentation methods are selected to test the image segmentation task. The object detection section also includes PET and CT images, with a total of 3579 images and XML files with annotation information. Six deep learning methods are selected for experiments on the detection task. This study conducts extensive experiments using deep learning-based semantic segmentation and object detection methods to demonstrate the differences between various methods on ECPC-IDS. As far as we know, this is the first publicly available dataset of endometrial cancer with a large number of multiple images, including a large amount of information required for image and target detection. ECPC-IDS can aid researchers in exploring new algorithms to enhance computer-assisted technology, benefiting both clinical doctors and patients greatly.
摘要
子宫内膜癌是女性生殖系统中最常见的肿瘤之一,也是致死率仅次于卵巢癌和宫颈癌的第三大妇科恶性肿瘤。早期诊断可以显著提高患者的 5 年生存率。随着人工智能的发展,计算机辅助诊断在提高诊断准确性和客观性、减轻医生工作负担方面发挥着越来越重要的作用。然而,公开可用的子宫内膜癌图像数据集的缺乏,限制了计算机辅助诊断技术的应用。本文发布了一个公开可用的子宫内膜癌 PET/CT 图像数据集(ECPC-IDS),用于评估语义分割和高代谢区域检测。其中分割部分包括 PET 和 CT 图像,共 7159 张,格式多样。为了验证分割方法在 ECPC-IDS 上的有效性,本文选取了五种经典的深度学习语义分割方法进行图像分割实验。目标检测部分同样包括 PET 和 CT 图像,共 3579 张图像及含标注信息的 XML 文件,并选取了六种深度学习方法进行检测实验。本研究使用基于深度学习的语义分割与目标检测方法开展了大量实验,以展示不同方法在 ECPC-IDS 上的差异。据我们所知,这是首个公开可用、包含大量多模态图像以及图像与目标检测所需丰富信息的子宫内膜癌数据集。ECPC-IDS 可以帮助研究人员探索新的算法来提升计算机辅助技术,让临床医生和患者都大为受益。
OnUVS: Online Feature Decoupling Framework for High-Fidelity Ultrasound Video Synthesis
paper_authors: Han Zhou, Dong Ni, Ao Chang, Xinrui Zhou, Rusi Chen, Yanlin Chen, Lian Liu, Jiamin Liang, Yuhao Huang, Tong Han, Zhe Liu, Deng-Ping Fan, Xin Yang
For: The paper aims to address the challenges of synthesizing high-fidelity ultrasound (US) videos for medical education and diagnosis, specifically the limited availability of specific US video cases and the need to accurately animate dynamic anatomic structures while preserving image fidelity.* Methods: The proposed method, called OnUVS, is an online feature-decoupling framework that incorporates anatomic information into keypoint learning, uses a dual-decoder to decouple content and textural features, and employs a multiple-feature discriminator to enhance the sharpness and fine details of the generated videos. Additionally, the method constrains the motion trajectories of keypoints during online learning to enhance the fluidity of the generated videos.* Results: The paper reports that OnUVS synthesizes US videos with high fidelity, as validated and demonstrated through user studies on in-house echocardiographic and pelvic floor US videos.Abstract
Ultrasound (US) imaging is indispensable in clinical practice. To diagnose certain diseases, sonographers must observe corresponding dynamic anatomic structures to gather comprehensive information. However, the limited availability of specific US video cases causes teaching difficulties in identifying corresponding diseases, which potentially impacts the detection rate of such cases. The synthesis of US videos may represent a promising solution to this issue. Nevertheless, it is challenging to accurately animate the intricate motion of dynamic anatomic structures while preserving image fidelity. To address this, we present a novel online feature-decoupling framework called OnUVS for high-fidelity US video synthesis. Our highlights can be summarized by four aspects. First, we introduced anatomic information into keypoint learning through a weakly-supervised training strategy, resulting in improved preservation of anatomical integrity and motion while minimizing the labeling burden. Second, to better preserve the integrity and textural information of US images, we implemented a dual-decoder that decouples the content and textural features in the generator. Third, we adopted a multiple-feature discriminator to extract a comprehensive range of visual cues, thereby enhancing the sharpness and fine details of the generated videos. Fourth, we constrained the motion trajectories of keypoints during online learning to enhance the fluidity of generated videos. Our validation and user studies on in-house echocardiographic and pelvic floor US videos showed that OnUVS synthesizes US videos with high fidelity.
Neural Spherical Harmonics for structurally coherent continuous representation of diffusion MRI signal
methods: The paper uses a neural network to parameterize a NeSH series representing the dMRI signal of a single subject, continuous in both the angular and spatial domains. This approach removes noise from the gradient images, and the fibre orientation distribution functions (FODs) show smooth changes of direction along a fibre tract.
results: The method can compute mean diffusivity, fractional anisotropy, and total apparent fiber density, all with a single model architecture and by tuning only one hyperparameter. In addition, upsampling in both the angular and spatial domains yields reconstructions that are on par with or better than existing methods.Abstract
We present a novel way to model diffusion magnetic resonance imaging (dMRI) datasets, that benefits from the structural coherence of the human brain while only using data from a single subject. Current methods model the dMRI signal in individual voxels, disregarding the intervoxel coherence that is present. We use a neural network to parameterize a spherical harmonics series (NeSH) to represent the dMRI signal of a single subject from the Human Connectome Project dataset, continuous in both the angular and spatial domain. The reconstructed dMRI signal using this method shows a more structurally coherent representation of the data. Noise in gradient images is removed and the fiber orientation distribution functions show a smooth change in direction along a fiber tract. We showcase how the reconstruction can be used to calculate mean diffusivity, fractional anisotropy, and total apparent fiber density. These results can be achieved with a single model architecture, tuning only one hyperparameter. In this paper we also demonstrate how upsampling in both the angular and spatial domain yields reconstructions that are on par or better than existing methods.
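The core representational idea, fitting a spherical harmonics series to diffusion-weighted samples over gradient directions, can be illustrated without the neural parameterization. A minimal sketch, assuming the even-order real symmetric SH basis conventionally used for dMRI and synthetic gradient directions and signal values; this is a plain least-squares fit, not the NeSH network itself:

```python
# Least-squares fit of an even-order real spherical harmonics (SH) series to
# dMRI signal samples on the sphere. Directions and signal values are synthetic
# placeholders; NeSH replaces this linear fit with a neural parameterization.
import numpy as np
from scipy.special import sph_harm

def real_sh_basis(theta, phi, lmax=4):
    """Rows: directions, columns: real symmetric SH up to even order lmax."""
    cols = []
    for l in range(0, lmax + 1, 2):            # even orders only (antipodal symmetry)
        for m in range(-l, l + 1):
            y = sph_harm(abs(m), l, phi, theta)  # scipy order: (m, l, azimuth, polar)
            if m < 0:
                cols.append(np.sqrt(2) * (-1) ** m * y.imag)
            elif m == 0:
                cols.append(y.real)
            else:
                cols.append(np.sqrt(2) * (-1) ** m * y.real)
    return np.stack(cols, axis=-1)

rng = np.random.default_rng(0)
n_dirs = 90                                     # hypothetical gradient directions
theta = np.arccos(rng.uniform(-1, 1, n_dirs))   # polar angle
phi = rng.uniform(0, 2 * np.pi, n_dirs)         # azimuth
signal = rng.uniform(0.2, 1.0, n_dirs)          # placeholder diffusion attenuations

B = real_sh_basis(theta, phi, lmax=4)
coeffs, *_ = np.linalg.lstsq(B, signal, rcond=None)
reconstructed = B @ coeffs                      # smooth SH approximation of the signal
print(coeffs.shape, float(np.abs(reconstructed - signal).mean()))
```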
Self-Reference Deep Adaptive Curve Estimation for Low-Light Image Enhancement
results: Extensive qualitative and quantitative analyses on multiple real-world datasets demonstrate that the method outperforms existing state-of-the-art algorithms.Abstract
In this paper, we propose a 2-stage low-light image enhancement method called Self-Reference Deep Adaptive Curve Estimation (Self-DACE). In the first stage, we present an intuitive, lightweight, fast, and unsupervised luminance enhancement algorithm. The algorithm is based on a novel low-light enhancement curve that can be used to locally boost image brightness. We also propose a new loss function with a simplified physical model designed to preserve natural images' color, structure, and fidelity. We use a vanilla CNN to map each pixel through deep Adaptive Adjustment Curves (AAC) while preserving the local image structure. Secondly, we introduce the corresponding denoising scheme to remove the latent noise in the darkness. We approximately model the noise in the dark and deploy a Denoising-Net to estimate and remove the noise after the first stage. Exhaustive qualitative and quantitative analysis shows that our method outperforms existing state-of-the-art algorithms on multiple real-world datasets.
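The abstract does not spell out the enhancement curve, so the sketch below uses the quadratic pixel-wise curve family popularized by Zero-DCE, LE(x) = x + alpha * x * (1 - x), applied iteratively with per-pixel alpha maps, purely as a hedged stand-in for the paper's adaptive adjustment curves (AAC); the CNN that predicts alpha is not reproduced:

```python
# Illustrative pixel-wise brightness curve in the Zero-DCE family, used here only
# as a stand-in for Self-DACE's adaptive adjustment curves (alpha maps are fixed
# constants here instead of being predicted by a CNN).
import numpy as np

def apply_curve(image: np.ndarray, alpha_maps: np.ndarray) -> np.ndarray:
    """image in [0, 1]; alpha_maps shaped (n_iters, H, W, C) with values in [-1, 1]."""
    x = image.astype(np.float64)
    for alpha in alpha_maps:
        x = x + alpha * x * (1.0 - x)   # LE(x) = x + a*x*(1-x): boosts dark pixels most
    return np.clip(x, 0.0, 1.0)

rng = np.random.default_rng(0)
low_light = rng.uniform(0.0, 0.2, size=(64, 64, 3))    # synthetic dark image
alphas = np.full((8, 64, 64, 3), 0.6)                  # hypothetical per-pixel curve maps
enhanced = apply_curve(low_light, alphas)
print(low_light.mean(), enhanced.mean())               # mean brightness increases
```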
results: Experimental results show that the codec successfully maintains high perceptual quality and semantic quality at all bitrates, and by providing a lower bound on the common randomness required, it settles the previous debate on whether randomness should be incorporated into the generator for conditional perceptual quality compression.Abstract
We propose conditional perceptual quality, an extension of the perceptual quality defined in \citet{blau2018perception}, by conditioning it on user defined information. Specifically, we extend the original perceptual quality $d(p_{X},p_{\hat{X}})$ to the conditional perceptual quality $d(p_{X|Y},p_{\hat{X}|Y})$, where $X$ is the original image, $\hat{X}$ is the reconstructed image, $Y$ is side information defined by the user, and $d(\cdot,\cdot)$ is a divergence. We show that conditional perceptual quality has similar theoretical properties as the rate-distortion-perception trade-off \citep{blau2019rethinking}. Based on these theoretical results, we propose an optimal framework for conditional perceptual quality preserving compression. Experimental results show that our codec successfully maintains high perceptual quality and semantic quality at all bitrates. Besides, by providing a lower bound on the common randomness required, we settle the previous arguments on whether randomness should be incorporated into the generator for (conditional) perceptual quality compression. The source code is provided in the supplementary material.
A Comprehensive Overview of Computational Nuclei Segmentation Methods in Digital Pathology
paper_authors: Vasileios Magoulianitis, Catherine A. Alexander, C. -C. Jay Kuo
for: This paper provides a comprehensive review of computational nuclei segmentation for digital pathology in the cancer diagnosis pipeline, particularly for the identification, staging, and grading of malignant areas.
methods: The review covers both traditional image processing techniques and Deep Learning (DL) models for automatic nuclei segmentation, and assesses the influence of different types of supervision.
results: The paper thoroughly discusses the advantages of the different models and types of supervision and envisions future research directions that minimize the need for labeled data; future methods should emphasize efficient and explainable models so that physicians can trust their output.Abstract
In the cancer diagnosis pipeline, digital pathology plays an instrumental role in the identification, staging, and grading of malignant areas on biopsy tissue specimens. High resolution histology images are subject to high variance in appearance, arising either from the acquisition devices or the H\&E staining process. Nuclei segmentation is an important task, as it detects the nuclei cells over background tissue and gives rise to the topology, size, and count of nuclei, which are determinant factors for cancer detection. Yet, it is a fairly time consuming task for pathologists, with reportedly high subjectivity. Computer Aided Diagnosis (CAD) tools empowered by modern Artificial Intelligence (AI) models enable the automation of nuclei segmentation. This can reduce the subjectivity in analysis and the reading time. This paper provides an extensive review, beginning from earlier works that use traditional image processing techniques and reaching up to modern approaches following the Deep Learning (DL) paradigm. Our review also focuses on the weak supervision aspect of the problem, motivated by the fact that annotated data is scarce. At the end, the advantages of different models and types of supervision are thoroughly discussed. Furthermore, we try to extrapolate and envision how future research lines will potentially unfold, so as to minimize the need for labeled data while maintaining high performance. Future methods should emphasize efficient and explainable models with a transparent underlying process so that physicians can trust their output.
Snapshot High Dynamic Range Imaging with a Polarization Camera
results: Extensive real-world experiments demonstrate that the method achieves effective HDR image reconstruction.Abstract
High dynamic range (HDR) images are important for a range of tasks, from navigation to consumer photography. Accordingly, a host of specialized HDR sensors have been developed, the most successful of which are based on capturing variable per-pixel exposures. In essence, these methods capture an entire exposure bracket sequence at once in a single shot. This paper presents a straightforward but highly effective approach for turning an off-the-shelf polarization camera into a high-performance HDR camera. By placing a linear polarizer in front of the polarization camera, we are able to simultaneously capture four images with varied exposures, which are determined by the orientation of the polarizer. We develop an outlier-robust and self-calibrating algorithm to reconstruct an HDR image (at a single polarity) from these measurements. Finally, we demonstrate the efficacy of our approach with extensive real-world experiments.
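A hedged sketch of the measurement model: with a linear polarizer at angle theta in front of a polarization camera whose on-chip analyzers sit at 0, 45, 90, and 135 degrees, Malus's law gives each channel a different effective exposure proportional to cos^2(theta - theta_channel). The merge below is a generic weighted exposure-bracket fusion on synthetic data; it is not the paper's outlier-robust, self-calibrating algorithm, and the polarizer angle is an assumed value:

```python
# Toy HDR merge from the four channels of a division-of-focal-plane polarization
# sensor. Relative exposures follow Malus's law for an assumed external polarizer
# angle; the paper's robust/self-calibrating estimation is replaced by a plain
# weighted average for illustration.
import numpy as np

def merge_hdr(channels, exposures, eps=1e-6):
    """channels: (4, H, W) linear LDR images; exposures: relative exposure per channel."""
    channels = np.asarray(channels, dtype=np.float64)
    exposures = np.asarray(exposures, dtype=np.float64).reshape(-1, 1, 1)
    # Weight mid-range pixels highly; down-weight clipped or near-black ones.
    weights = np.clip(1.0 - np.abs(2.0 * channels - 1.0), eps, None)
    radiance = channels / (exposures + eps)
    return (weights * radiance).sum(axis=0) / weights.sum(axis=0)

theta = np.deg2rad(20.0)                                  # assumed polarizer orientation
analyzers = np.deg2rad([0.0, 45.0, 90.0, 135.0])
exposures = np.cos(theta - analyzers) ** 2                # Malus's law attenuation per channel

rng = np.random.default_rng(0)
scene = rng.uniform(0.0, 4.0, size=(32, 32))              # synthetic HDR radiance
channels = np.clip(scene[None] * exposures[:, None, None], 0.0, 1.0)  # 4 clipped LDR captures
hdr = merge_hdr(channels, exposures)
print(float(np.abs(hdr - scene).mean()))
```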
Deep Learning Framework for Spleen Volume Estimation from 2D Cross-sectional Views
results: Our best model achieved mean relative volume accuracies of 86.62% and 92.58% on single- and dual-view spleen segmentations, respectively, surpassing the existing clinical standard approach and a related deep learning-based 2D-3D reconstruction method.Abstract
Abnormal spleen enlargement (splenomegaly) is regarded as a clinical indicator for a range of conditions, including liver disease, cancer and blood diseases. While spleen length measured from ultrasound images is a commonly used surrogate for spleen size, spleen volume remains the gold standard metric for assessing splenomegaly and the severity of related clinical conditions. Computed tomography is the main imaging modality for measuring spleen volume, but it is less accessible in areas where there is a high prevalence of splenomegaly (e.g., the Global South). Our objective was to enable automated spleen volume measurement from 2D cross-sectional segmentations, which can be obtained from ultrasound imaging. In this study, we describe a variational autoencoder-based framework to measure spleen volume from single- or dual-view 2D spleen segmentations. We propose and evaluate three volume estimation methods within this framework. We also demonstrate how 95% confidence intervals of volume estimates can be produced to make our method more clinically useful. Our best model achieved mean relative volume accuracies of 86.62% and 92.58% for single- and dual-view segmentations, respectively, surpassing the performance of the clinical standard approach of linear regression using manual measurements and a comparative deep learning-based 2D-3D reconstruction-based approach. The proposed spleen volume estimation framework can be integrated into standard clinical workflows which currently use 2D ultrasound images to measure spleen length. To the best of our knowledge, this is the first work to achieve direct 3D spleen volume estimation from 2D spleen segmentations.
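The abstract mentions 95% confidence intervals for the volume estimates. One hedged way to obtain such an interval from a stochastic, VAE-based predictor is to repeat the prediction and take empirical percentiles, as in the sketch below; the volume samples here are synthetic placeholders, not model outputs, and this is not the paper's estimation pipeline:

```python
# Empirical 95% interval from repeated stochastic volume predictions
# (the samples below are synthetic placeholders, not model output).
import numpy as np

def volume_ci(volume_samples, level=0.95):
    lo, hi = np.percentile(volume_samples, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return float(np.mean(volume_samples)), float(lo), float(hi)

rng = np.random.default_rng(0)
samples_ml = rng.normal(loc=310.0, scale=18.0, size=200)   # hypothetical spleen volumes (mL)
mean, lo, hi = volume_ci(samples_ml)
print(f"volume ~= {mean:.1f} mL, 95% CI [{lo:.1f}, {hi:.1f}]")
```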
results: The study finds that using the AMSS improves acoustic comfort in the soundscape, and that maskers can be better selected according to listeners' perceptual feedback.Abstract
Soundscape augmentation or "masking" introduces wanted sounds into the acoustic environment to improve acoustic comfort. Usually, the masker selection and playback strategies are either arbitrary or based on simple rules (e.g. -3 dBA), which may lead to sub-optimal increment or even reduction in acoustic comfort for dynamic acoustic environments. To reduce ambiguity in the selection of maskers, an automatic masker selection system (AMSS) was recently developed. The AMSS uses a deep-learning model trained on a large-scale dataset of subjective responses to maximize the derived ISO pleasantness (ISO 12913-2). Hence, this study investigates the short-term in situ performance of the AMSS implemented in a gazebo in an urban park. Firstly, the predicted ISO pleasantness from the AMSS is evaluated in comparison to the in situ subjective evaluation scores. Secondly, the effect of various masker selection schemes on the perceived affective quality and appropriateness would be evaluated. In total, each participant evaluated 6 conditions: (1) ambient environment with no maskers; (2) AMSS; (3) bird and (4) water masker from prior art; (5) random selection from same pool of maskers used to train the AMSS; and (6) selection of best-performing maskers based on the analysis of the dataset used to train the AMSS.
Improving CTC-AED model with integrated-CTC and auxiliary loss regularization
results: The DAL method performs better in attention rescoring, while the PMP method excels in CTC prefix beam search and greedy search.Abstract
Connectionist temporal classification (CTC) and attention-based encoder decoder (AED) joint training has been widely applied in automatic speech recognition (ASR). Unlike most hybrid models that separately calculate the CTC and AED losses, our proposed integrated-CTC utilizes the attention mechanism of AED to guide the output of CTC. In this paper, we employ two fusion methods, namely direct addition of logits (DAL) and preserving the maximum probability (PMP). We achieve dimensional consistency by adaptively affine transforming the attention results to match the dimensions of CTC. To accelerate model convergence and improve accuracy, we introduce auxiliary loss regularization for accelerated convergence. Experimental results demonstrate that the DAL method performs better in attention rescoring, while the PMP method excels in CTC prefix beam search and greedy search.
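One plausible reading of the two fusion rules, inferred only from their names and the dimension-matching step described above: "direct addition of logits" sums the affine-projected attention outputs with the CTC logits, while "preserving the maximum probability" keeps, per frame, the larger of the two per-token probabilities. The sketch below implements that reading; the paper's exact operations may differ:

```python
# Hedged sketch of DAL and PMP fusion of CTC and attention outputs. The affine
# projection mirrors the dimension matching in the paper; the fusion formulas
# themselves are assumptions inferred from the method names.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_att, vocab = 50, 256, 100
att_feats = rng.normal(size=(T, d_att))          # attention decoder features (placeholder)
ctc_logits = rng.normal(size=(T, vocab))         # CTC head logits (placeholder)

# Affine transform so attention features match the CTC output dimension.
W, b = rng.normal(scale=0.02, size=(d_att, vocab)), np.zeros(vocab)
att_logits = att_feats @ W + b

# DAL: direct addition of logits, then a single softmax.
dal_probs = softmax(ctc_logits + att_logits)

# PMP (assumed reading): element-wise maximum of the two distributions, renormalized,
# so the larger per-token probability is preserved.
pmp_probs = np.maximum(softmax(ctc_logits), softmax(att_logits))
pmp_probs /= pmp_probs.sum(axis=-1, keepdims=True)

print(dal_probs.shape, pmp_probs.shape)
```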
Using Text Injection to Improve Recognition of Personal Identifiers in Speech
methods: A text-injection method that includes fake textual substitutes for Personally Identifiable Information (PII) categories in the training data, improving the ASR model's recognition accuracy for these categories.
results: On medical notes, the approach improves recall of names and dates while also improving the overall word error rate (WER). For alphanumeric digit sequences, it improves the character error rate and sentence accuracy.Abstract
Accurate recognition of specific categories, such as persons' names, dates or other identifiers is critical in many Automatic Speech Recognition (ASR) applications. As these categories represent personal information, ethical use of this data including collection, transcription, training and evaluation demands special care. One way of ensuring the security and privacy of individuals is to redact or eliminate Personally Identifiable Information (PII) from collection altogether. However, this results in ASR models that tend to have lower recognition accuracy of these categories. We use text-injection to improve the recognition of PII categories by including fake textual substitutes of PII categories in the training data using a text injection method. We demonstrate substantial improvement to Recall of Names and Dates in medical notes while improving overall WER. For alphanumeric digit sequences we show improvements to Character Error Rate and Sentence Accuracy.
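A minimal sketch of the text-injection idea: transcripts whose PII has been redacted to placeholder tags are re-filled with fake surrogates before being injected as training text, so the categories stay represented without exposing real identifiers. The tag names and surrogate lists below are hypothetical, not the paper's actual schema:

```python
# Hypothetical text-injection preprocessing: replace redacted PII tags with
# fake surrogates so names and dates stay represented in the injected text.
import random
import re

SURROGATES = {                       # hypothetical tag -> fake value pools
    "NAME": ["alex morgan", "priya shah", "juan ortega"],
    "DATE": ["march third twenty twenty two", "july ninth nineteen ninety"],
}
TAG_RE = re.compile(r"<(NAME|DATE)>")

def inject_fake_pii(transcript: str, rng: random.Random) -> str:
    """Substitute each redaction tag with a randomly chosen fake surrogate."""
    return TAG_RE.sub(lambda m: rng.choice(SURROGATES[m.group(1)]), transcript)

rng = random.Random(0)
redacted = "patient <NAME> was seen on <DATE> for follow up"
print(inject_fake_pii(redacted, rng))
```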
Localization of DOA trajectories – Beyond the grid
results: Compared with traditional grid-based methods, the proposed trajectory localization algorithms show improved performance in terms of resolution, robustness to noise, and computational efficiency.Abstract
The direction of arrival (DOA) estimation algorithms are crucial in localizing acoustic sources. Traditional localization methods rely on block-level processing to extract the directional information from multiple measurements processed together. However, these methods assume that DOA remains constant throughout the block, which may not be true in practical scenarios. Also, the performance of localization methods is limited when the true parameters do not lie on the parameter search grid. In this paper we propose two trajectory models, namely the polynomial and bandlimited trajectory models, to capture the DOA dynamics. To estimate trajectory parameters, we adopt two gridless algorithms: i) Sliding Frank-Wolfe (SFW), which solves the Beurling LASSO problem and ii) Newtonized Orthogonal Matching Pursuit (NOMP), which improves over OMP using cyclic refinement. Furthermore, we extend our analysis to include wideband processing. The simulation results indicate that the proposed trajectory localization algorithms exhibit improved performance compared to grid-based methods in terms of resolution, robustness to noise, and computational efficiency.
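To make the trajectory models concrete, a polynomial DOA trajectory writes the angle over the block as theta(t) = sum_k c_k t^k, and array snapshots follow from the time-varying steering vector. The sketch below only builds such synthetic snapshots for a uniform linear array; the trajectory coefficients, array size, and noise level are illustrative choices, and the SFW and NOMP estimators are not implemented here:

```python
# Synthetic snapshots for a source whose DOA follows a polynomial trajectory
# theta(t) = sum_k c_k * t**k, observed by a half-wavelength uniform linear array.
import numpy as np

def ula_steering(theta_rad, n_sensors, spacing=0.5):
    """Steering vector of a ULA with element spacing in wavelengths."""
    n = np.arange(n_sensors)
    return np.exp(2j * np.pi * spacing * n * np.sin(theta_rad))

rng = np.random.default_rng(0)
n_sensors, n_snapshots = 12, 200
t = np.linspace(-1.0, 1.0, n_snapshots)
coeffs = [0.2, 0.5, -0.3]                                   # c0 + c1*t + c2*t^2 (radians)
theta_t = sum(c * t**k for k, c in enumerate(coeffs))       # polynomial DOA trajectory

signal = (rng.normal(size=n_snapshots) + 1j * rng.normal(size=n_snapshots)) / np.sqrt(2)
snapshots = np.stack(
    [ula_steering(th, n_sensors) * s for th, s in zip(theta_t, signal)], axis=1
)
snapshots += 0.05 * (rng.normal(size=snapshots.shape) + 1j * rng.normal(size=snapshots.shape))
print(snapshots.shape)   # (n_sensors, n_snapshots); the DOA varies smoothly across the block
```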
Compositional nonlinear audio signal processing with Volterra series
results: The paper establishes several results about nonlinear audio systems, including: changes in nonlinear audio systems can be modeled as lens-like morphisms of Volterra series; Volterra series and their morphisms organize into a category, Volt, which can model time-varying nonlinear audio behavior; and the series composition of Volterra series is associative.Abstract
We develop a compositional theory of nonlinear audio signal processing based on a categorification of the Volterra series. We begin by considering what it would mean for the Volterra series to be functorial with respect to a base category whose objects are temperate distributions and whose morphisms are certain linear transformations. This leads to formulae describing how the outcomes of nonlinear transformations are affected if their input signals are first linearly processed. We then consider how nonlinear audio systems change, and introduce as a model thereof a notion of morphism of Volterra series, which we exhibit as a kind of lens map. We show how morphisms can be parameterized and used to generate indexed families of Volterra series, which are well-suited to model nonstationary or time-varying nonlinear phenomena. We then describe how Volterra series and their morphisms organize into a category, which we call Volt. We exhibit the operations of sum, product, and series composition of Volterra series as monoidal products on Volt and identify, for each in turn, its corresponding universal property. We show, in particular, that the series composition of Volterra series is associative. We then bridge between our framework and a subject at the heart of audio signal processing: time-frequency analysis. Specifically, we show that an equivalence between a certain class of second-order Volterra series and the bilinear time-frequency distributions (TFDs) can be extended to one between certain higher-order Volterra series and the so-called polynomial TFDs. We end with prospects for future work, including the incorporation of nonlinear system identification techniques and the extension of our theory to the settings of compositional graph and topological audio signal processing.
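As a concrete anchor for readers less familiar with Volterra series, the sketch below evaluates a discrete-time, second-order truncated series, y[n] = sum_k h1[k] x[n-k] + sum_{k1,k2} h2[k1,k2] x[n-k1] x[n-k2], with arbitrary illustrative kernels; the categorical constructions (morphisms, the category Volt) are not represented here:

```python
# Discrete-time truncated Volterra series (orders 1 and 2) evaluated by brute force.
# Kernels are arbitrary illustrations, not taken from the paper.
import numpy as np

def volterra2(x, h1, h2):
    """y[n] = sum_k h1[k] x[n-k] + sum_{k1,k2} h2[k1,k2] x[n-k1] x[n-k2]."""
    x = np.asarray(x, dtype=np.float64)
    M = len(h1)
    y = np.zeros_like(x)
    for n in range(len(x)):
        window = np.array([x[n - k] if n - k >= 0 else 0.0 for k in range(M)])
        y[n] = h1 @ window + window @ h2 @ window
    return y

rng = np.random.default_rng(0)
M = 8
h1 = rng.normal(scale=0.5, size=M)          # linear (first-order) kernel
h2 = rng.normal(scale=0.05, size=(M, M))    # quadratic (second-order) kernel
h2 = (h2 + h2.T) / 2                        # symmetrized, as is conventional

x = np.sin(2 * np.pi * 0.05 * np.arange(256))
y = volterra2(x, h1, h2)                    # output contains harmonics of the input tone
print(y[:5])
```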
results: The ResNet293 and MFA-Conformer models exhibited diarization error rates (DERs) of 3.65% and 3.83% on VAL46, respectively; the submitted ensemble model achieved a DER of 3.50% on VAL46 and 4.88% on the VoxSRC-23 test set.Abstract
This report describes the submission system by the GIST-AiTeR team for the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23) Track 4. Our submission system focuses on implementing diverse speaker diarization (SD) techniques, including ResNet293 and MFA-Conformer with different combinations of segment and hop length. Then, those models are combined into an ensemble model. The ResNet293 and MFA-Conformer models exhibited the diarization error rates (DERs) of 3.65% and 3.83% on VAL46, respectively. The submitted ensemble model provided a DER of 3.50% on VAL46, and consequently, it achieved a DER of 4.88% on the VoxSRC-23 test set.
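For context, the diarization error rate (DER) quoted by these challenge submissions follows the standard definition, which can be written as:

```latex
% Standard diarization error rate: wrongly attributed speech time relative to
% the total scored reference speech time.
\[
\mathrm{DER} \;=\;
\frac{T_{\text{false alarm}} + T_{\text{missed speech}} + T_{\text{speaker confusion}}}
     {T_{\text{total scored reference speech}}}
\]
```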
The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023
results: Fusing the different clustering-based and TSVAD-based diarization systems with DOVER-Lap achieves a diarization error rate (DER) of 4.30%, ranking first on the track 4 challenge leaderboard.Abstract
This paper describes the DKU-MSXF submission to track 4 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). Our system pipeline contains voice activity detection, clustering-based diarization, overlapped speech detection, and target-speaker voice activity detection, where each procedure has a fused output from 3 sub-models. Finally, we fuse different clustering-based and TSVAD-based diarization systems using DOVER-Lap and achieve the 4.30% diarization error rate (DER), which ranks first place on track 4 of the challenge leaderboard.
AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
results: Extensive experiments validate the effectiveness of the proposed method, achieving new state-of-the-art performance on the widely-used LRS2 and LRS3 datasets.Abstract
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information on lip movements. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality. Different from the previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) saves the linguistic information of audio knowledge in compact audio memory by discarding the non-linguistic information from the audio through quantization, and 3) includes Audio Bridging Module which can find the best-matched audio features from the compact audio memory, which makes our training possible without audio inputs, once after the compact audio memory is composed. We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performances on the widely-used datasets, LRS2 and LRS3.
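A hedged sketch of the "compact audio memory" idea: audio features from a pretrained model are quantized into a small codebook, discarding fine non-linguistic variation, and a bridging step later retrieves the best-matching entries for projected visual features. The k-means codebook and cosine retrieval below are stand-ins for the paper's actual quantization and Audio Bridging Module; feature dimensions and codebook size are assumptions:

```python
# Stand-in for a compact audio memory: quantize pretrained audio features with
# k-means, then retrieve the best-matched codeword for each (projected) visual
# feature by cosine similarity. All sizes and features below are placeholders.
import numpy as np

def build_codebook(features, k=32, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((features[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def retrieve(queries, codebook):
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    return codebook[np.argmax(q @ c.T, axis=1)]        # best cosine match per query

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(2000, 128))              # placeholder pretrained audio features
codebook = build_codebook(audio_feats, k=32)
visual_queries = rng.normal(size=(10, 128))             # placeholder projected visual features
matched_audio = retrieve(visual_queries, codebook)
print(codebook.shape, matched_audio.shape)
```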