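results: Experiments show that the method balances compression ratio against image quality and achieves a higher compression ratio and better image quality than hand-engineered codecs and other neural network codecs.
Abstract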
The design of a neural image compression network is governed by how well the entropy model matches the true distribution of the latent code. Apart from the model capacity, this ability is indirectly under the effect of how close the relaxed quantization is to the actual hard quantization. Optimizing the parameters of a rate-distortion variational autoencoder (R-D VAE) is ruled by this approximated quantization scheme. In this paper, we propose a feature-level frequency disentanglement to help the relaxed scalar quantization achieve lower bit rates by guiding the high entropy latent features to include most of the low-frequency texture of the image. In addition, to strengthen the de-correlating power of the transformer-based analysis/synthesis transform, an augmented self-attention score calculation based on the Hadamard product is utilized during both encoding and decoding. Channel-wise autoregressive entropy modeling takes advantage of the proposed frequency separation as it inherently directs high-informational low-frequency channels to the first chunks and conditions the future chunks on it. The proposed network not only outperforms hand-engineered codecs, but also neural network-based codecs built on computation-heavy spatially autoregressive entropy models.
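Summary
The design of a neural image compression network depends on how well the entropy model matches the true distribution of the latent code. Beyond model capacity, this match is affected by how close the relaxed quantization is to the actual hard quantization. In this paper we propose a feature-level disentanglement based on frequency separation that helps relaxed scalar quantization reach lower bit rates by steering the high-entropy latent features to carry most of the low-frequency texture of the image. In addition, to strengthen the de-correlating power of the transformer-based analysis/synthesis transforms, we use an augmented self-attention score calculation based on the Hadamard product during both encoding and decoding. Channel-wise autoregressive entropy modeling benefits from the proposed frequency separation, because it naturally assigns the highly informative low-frequency channels to the first chunks and conditions later chunks on them. The proposed network outperforms both hand-engineered codecs and neural codecs built on computation-heavy spatially autoregressive entropy models.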
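As a rough illustration of the gap between relaxed and hard quantization that the abstract refers to, the sketch below compares rounding with an additive-uniform-noise proxy. The abstract does not say which relaxation the authors use. Additive noise is assumed here only because it is a common choice, and `hard_quantize` and `noisy_relaxation` are hypothetical names.

```python
import numpy as np

def hard_quantize(latent):
    """Round each latent value to the nearest integer (what the codec does at test time)."""
    return np.rint(latent)

def noisy_relaxation(latent, rng):
    """Assumed training-time proxy: add uniform noise in [-0.5, 0.5) instead of rounding."""
    return latent + rng.uniform(-0.5, 0.5, size=latent.shape)

rng = np.random.default_rng(0)
latent = rng.normal(scale=3.0, size=(4, 8))  # toy latent code, not from any real model

# The entropy model is fitted to the relaxed values but evaluated on the rounded ones,
# so the size of this gap is one informal measure of the train/test mismatch.
gap = np.abs(noisy_relaxation(latent, rng) - hard_quantize(latent))
print("largest gap between relaxed and hard quantization:", gap.max())
```

The gap can never exceed 1, since both the rounding error and the injected noise are at most 0.5 in magnitude.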
Brain MRI Segmentation using Template-Based Training and Visual Perception Augmentation
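results: 3D U-Net models were trained with this approach for mouse, rat, marmoset, rhesus, and human brain MRI and applied to segmentation tasks. The tool addresses the limited availability of training data in deep learning image analysis and gives researchers a unified way to train deep neural networks from a single image sample.
Abstract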
Deep learning models usually require sufficient training data to achieve high accuracy, but obtaining labeled data can be time-consuming and labor-intensive. Here we introduce a template-based training method to train a 3D U-Net model from scratch using only one population-averaged brain MRI template and its associated segmentation label. The process incorporated visual perception augmentation to enhance the model's robustness in handling diverse image inputs and mitigating overfitting. Leveraging this approach, we trained 3D U-Net models for mouse, rat, marmoset, rhesus, and human brain MRI to achieve segmentation tasks such as skull-stripping, brain segmentation, and tissue probability mapping. This tool effectively addresses the limited availability of training data and holds significant potential for expanding deep learning applications in image analysis, providing researchers with a unified solution to train deep neural networks with only one image sample.
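Summary
Deep learning models usually need ample training data to reach high accuracy, but obtaining labeled data is time-consuming and labor-intensive. We introduce a template-based training method that trains a 3D U-Net model from scratch using only one population-averaged brain MRI template and its associated segmentation label. The process includes visual perception augmentation to make the model more robust to diverse image inputs and to mitigate overfitting. With this approach we trained 3D U-Net models for mouse, rat, marmoset, rhesus, and human brain MRI for segmentation tasks such as skull-stripping, brain segmentation, and tissue probability mapping. The tool addresses the limited availability of training data and has the potential to extend deep learning applications in image analysis, giving researchers a unified way to train deep neural networks from a single image sample.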
T-UNet: Triplet UNet for Change Detection in High-Resolution Remote Sensing Images
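for: The study proposes a new network model for detecting changes between remote sensing images more accurately.
methods: The model uses a three-branch (triplet) encoder to simultaneously extract object features and the change features between the images of the two time phases. A multi-branch spatial-spectral cross-attention module (MBSSCA) interacts and fuses the features from the branches.
results: The model detects changes between remote sensing images more accurately and recovers the details of the changed regions.
Abstract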
Remote sensing image change detection aims to identify the differences between images acquired at different times in the same area. It is widely used in land management, environmental monitoring, disaster assessment and other fields. Currently, most change detection methods are based on Siamese network structure or early fusion structure. Siamese structure focuses on extracting object features at different times but lacks attention to change information, which leads to false alarms and missed detections. Early fusion (EF) structure focuses on extracting features after the fusion of images of different phases but ignores the significance of object features at different times for detecting change details, making it difficult to accurately discern the edges of changed objects. To address these issues and obtain more accurate results, we propose a novel network, Triplet UNet (T-UNet), based on a three-branch encoder, which is capable of simultaneously extracting the object features and the change features between the pre- and post-time-phase images through a triplet encoder. To effectively interact and fuse the features extracted from the three branches of the triplet encoder, we propose a multi-branch spatial-spectral cross-attention module (MBSSCA). In the decoder stage, we introduce the channel attention mechanism (CAM) and spatial attention mechanism (SAM) to fully mine and integrate detailed texture information at the shallow layer and semantic localization information at the deep layer.
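Summary
Remote sensing image change detection aims to identify differences between images of the same area acquired at different times. It is widely used in land management, environmental monitoring, disaster assessment, and other fields. Most current change detection methods are based on a Siamese network structure or an early fusion structure. The Siamese structure focuses on extracting object features at different times but pays little attention to change information, which leads to false alarms and missed detections. The early fusion structure extracts features after fusing the images of different phases but overlooks the importance of per-time object features for detecting change details, which makes it hard to identify the edges of changed objects accurately. To address these issues we propose a new network, Triplet UNet (T-UNet), built on a three-branch encoder. Through the triplet encoder, T-UNet simultaneously extracts object features and the change features between the images of the two time phases. To interact and fuse the features from the triplet encoder effectively, we propose a multi-branch spatial-spectral cross-attention module (MBSSCA). In the decoder stage we introduce a channel attention mechanism (CAM) and a spatial attention mechanism (SAM) to fully mine and integrate detailed texture information and semantic localization information.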
Generative Image Priors for MRI Reconstruction Trained from Magnitude-Only Images
paper_authors: Guanxiong Luo, Xiaoqing Wang, Moritz Blumenthal, Martin Schilling, Erik Hans Ulrich Rauf, Raviteja Kotikalapudi, Niels Focke, Martin Uecker
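results: Experiments show that priors trained on complex images outperform priors trained only on magnitude images. A prior trained on a larger dataset is also more robust. Finally, the generative priors are superior to L1-wavelet regularization for compressed sensing parallel imaging at high undersampling.
Abstract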
Purpose: In this work, we present a workflow to construct generic and robust generative image priors from magnitude-only images. The priors can then be used for regularization in reconstruction to improve image quality. Methods: The workflow begins with the preparation of training datasets from magnitude-only MR images. This dataset is then augmented with phase information and used to train generative priors of complex images. Finally, trained priors are evaluated using both linear and nonlinear reconstruction for compressed sensing parallel imaging with various undersampling schemes. Results: The results of our experiments demonstrate that priors trained on complex images outperform priors trained only on magnitude images. Additionally, a prior trained on a larger dataset exhibits higher robustness. Finally, we show that the generative priors are superior to L1-wavelet regularization for compressed sensing parallel imaging with high undersampling. Conclusion: These findings stress the importance of incorporating phase information and leveraging large datasets to raise the performance and reliability of the generative priors for MRI reconstruction. Phase augmentation makes it possible to use existing image databases for training.
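Summary
Purpose: We present a workflow for constructing generic and robust generative image priors from magnitude-only images. The priors can then be used for regularization in reconstruction to improve image quality. Methods: The workflow begins by preparing training datasets from magnitude-only images. These datasets are then augmented with phase information and used to train generative image priors. Finally, the trained priors are evaluated in reconstructions with different undersampling schemes. Results: Priors trained on complex images perform better in reconstruction, and a prior trained on a larger dataset is more robust. The generative priors also outperform L1-wavelet regularization for compressed sensing parallel imaging at high undersampling. Conclusion: The results stress the importance of incorporating phase information and using large datasets to raise the performance and reliability of generative priors for MRI reconstruction. Phase augmentation makes it possible to train on existing image databases.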
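The phase augmentation step described above amounts to turning a real-valued magnitude image into a complex-valued one. The sketch below shows that operation under a strong assumption: the abstract does not say where the phase maps come from, so `random_smooth_phase` is a hypothetical stand-in that draws a smooth low-order polynomial phase, and `add_synthetic_phase` is likewise an illustrative name.

```python
import numpy as np

def random_smooth_phase(shape, rng):
    """Hypothetical phase source: a random low-order polynomial over the image grid (radians)."""
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, shape[0]),
        np.linspace(-1.0, 1.0, shape[1]),
        indexing="ij",
    )
    a, b, c = rng.uniform(-np.pi, np.pi, size=3)
    return a * xs + b * ys + c * xs * ys

def add_synthetic_phase(magnitude, phase):
    """Combine a magnitude image and a phase map into one complex-valued image."""
    return magnitude * np.exp(1j * phase)

rng = np.random.default_rng(0)
magnitude = rng.uniform(0.0, 1.0, size=(64, 64))  # placeholder for a magnitude-only MR image
complex_image = add_synthetic_phase(magnitude, random_smooth_phase(magnitude.shape, rng))

# Taking the absolute value recovers the original magnitude image.
print(np.allclose(np.abs(complex_image), magnitude))
```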
CT Reconstruction from Few Planar X-rays with Application towards Low-resource Radiotherapy
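for: Presents a method for generating CT volumes from a few (<5) planar X-ray images and performs the first evaluation of such a method for a clinical application: radiotherapy planning.
methods: A deep generative model based on neural implicit representations synthesizes volumetric CT from planar X-ray images. Anatomical guidance during training focuses the model on clinically relevant features.
results: For radiotherapy planning on thoracic CT, the isocenter radiation dose on the generated volumes differs by <1% from the dose computed on clinically acquired CT. The model also outperforms recent sparse CT reconstruction baselines on standard pixel- and structure-level metrics (PSNR, SSIM, Dice score) on the LIDC lung CT dataset.
Abstract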
CT scans are the standard-of-care for many clinical ailments, and are needed for treatments like external beam radiotherapy. Unfortunately, CT scanners are rare in low and mid-resource settings due to their costs. Planar X-ray radiography units, in comparison, are far more prevalent, but can only provide limited 2D observations of the 3D anatomy. In this work, we propose a method to generate CT volumes from few (<5) planar X-ray observations using a prior data distribution, and perform the first evaluation of such a reconstruction algorithm for a clinical application: radiotherapy planning. We propose a deep generative model, building on advances in neural implicit representations to synthesize volumetric CT scans from few input planar X-ray images at different angles. To focus the generation task on clinically-relevant features, our model can also leverage anatomical guidance during training (via segmentation masks). We generated 2-field opposed, palliative radiotherapy plans on thoracic CTs reconstructed by our method, and found that the isocenter radiation dose on reconstructed scans has <1% error with respect to the dose calculated on clinically acquired CTs using <=4 X-ray views. In addition, our method is better than recent sparse CT reconstruction baselines in terms of standard pixel and structure-level metrics (PSNR, SSIM, Dice score) on the public LIDC lung CT dataset. Code is available at: https://github.com/wanderinrain/Xray2CT.
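Summary
CT scans are the standard of care for many clinical conditions and are needed for external beam radiotherapy. However, CT scanners are rare in low- and mid-resource settings, mainly because of their cost. Planar X-ray radiography units are far more common, but they provide only limited 2D observations of the 3D anatomy. We propose a method that uses a prior data distribution to generate CT volumes from a few planar X-ray images. We also propose a deep generative model, based on neural implicit representations, that synthesizes volumetric CT scans from a few planar X-ray images taken at different angles. To focus the generation task on clinically relevant features, the model can also use anatomical guidance during training (via segmentation masks). We generated 2-field opposed, palliative radiotherapy plans on thoracic CTs reconstructed by our method and found that the radiation dose on the reconstructed scans differs by less than 1% from the dose calculated on clinically acquired CTs, using at most 4 X-ray views. Our method also outperforms recent sparse CT reconstruction baselines on standard pixel- and structure-level metrics (PSNR, SSIM, Dice score) on the public LIDC lung CT dataset. Code is available at https://github.com/wanderinrain/Xray2CT.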
paper_authors: Syed M. Arshad, Lee C. Potter, Chong Chen, Yingmin Liu, Preethi Chandrasekaran, Christopher Crabtree, Yuchi Han, Rizwan Ahmad
for: The paper presents and validates an outlier rejection method that makes free-running cardiovascular MRI (CMR) more motion robust.
methods: The proposed method, called compressive recovery with outlier rejection (CORe), models outliers as an auxiliary variable and enforces MR physics-guided group-sparsity on it; the auxiliary variable is jointly estimated with the image using an iterative algorithm.
results: The simulation studies show that CORe outperforms traditional compressed sensing (CS), robust regression (RR), and another outlier rejection method in terms of normalized mean squared error (NMSE) and structural similarity index (SSIM) across 50 different realizations. The expert reader evaluation of 3D cine images demonstrates that CORe is more effective in suppressing artifacts while maintaining or improving image sharpness. The flow consistency evaluation in 4D flow images shows that CORe yields more consistent flow measurements, especially under exercise stress.
Abstract
PURPOSE: To present and validate an outlier rejection method that makes free-running cardiovascular MRI (CMR) more motion robust. METHODS: The proposed method, called compressive recovery with outlier rejection (CORe), models outliers as an auxiliary variable that is added to the measured data. We enforce MR physics-guided group-sparsity on the auxiliary variable and jointly estimate it along with the image using an iterative algorithm. For validation, CORe is first compared to traditional compressed sensing (CS), robust regression (RR), and another outlier rejection method using two simulation studies. Then, CORe is compared to CS using five 3D cine and ten rest and stress 4D flow imaging datasets. RESULTS: Our simulation studies show that CORe outperforms CS, RR, and the outlier rejection method in terms of normalized mean squared error (NMSE) and structural similarity index (SSIM) across 50 different realizations. The expert reader evaluation of 3D cine images demonstrates that CORe is more effective in suppressing artifacts while maintaining or improving image sharpness. The flow consistency evaluation in 4D flow images shows that CORe yields more consistent flow measurements, especially under exercise stress. CONCLUSION: An outlier rejection method is presented and validated using simulated and measured data. This method can help suppress motion artifacts in a wide range of free-running CMR applications. CODE: MATLAB implementation code is available on GitHub at https://github.com/syedmurtazaarshad/motion-robust-CMR
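Summary
Purpose: To present and validate a method that makes free-running cardiovascular MRI (CMR) more robust to motion artifacts. Methods: The proposed method, compressive recovery with outlier rejection (CORe), treats outliers as an auxiliary variable that is added to the measured data. We constrain this auxiliary variable with MR physics-guided group sparsity and estimate it jointly with the image using an iterative algorithm. Validation: CORe is first compared with traditional compressed sensing (CS), robust regression (RR), and another outlier rejection method in two simulation studies. CORe is then compared with CS on five 3D cine and ten rest and stress 4D flow imaging datasets. Results: The simulation studies show that CORe outperforms CS, RR, and the other outlier rejection method in NMSE and SSIM across 50 different realizations. Expert reader evaluation of the 3D cine images shows that CORe suppresses artifacts more effectively while maintaining or improving image sharpness. In the 4D flow images, CORe yields more consistent flow measurements, especially under exercise stress. Conclusion: We present an outlier rejection method that can suppress motion artifacts in free-running CMR applications and validate it with simulated and measured data. Code: A MATLAB implementation is available on GitHub at https://github.com/syedmurtazaarshad/motion-robust-CMR.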
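Group sparsity on an auxiliary outlier variable, as described above, is often enforced with a group soft-thresholding step inside an iterative solver. The sketch below shows that generic operator only. It is not taken from the CORe code (which is in MATLAB), the grouping of one row per group is an assumption, and `group_soft_threshold` is a hypothetical name.

```python
import numpy as np

def group_soft_threshold(v, threshold):
    """Shrink each row (group) of v toward zero by `threshold` in L2 norm.

    Rows whose norm is below the threshold are set to zero entirely, which is how
    whole groups of measurements end up flagged as outlier-free.
    """
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    scale = np.maximum(1.0 - threshold / np.maximum(norms, 1e-12), 0.0)
    return v * scale

# Toy auxiliary variable: row 0 looks like noise, row 1 looks like a corrupted readout.
v = np.array([[0.1, -0.2],
              [3.0, 4.0]])
print(group_soft_threshold(v, threshold=1.0))
```

In this toy case the small row is zeroed and the large row keeps its direction but is shortened by one unit of norm.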
Diffusion Models for Counterfactual Generation and Anomaly Detection in Brain Images
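results: The method generates healthy counterfactuals of diseased images and reconstructs healthy inputs without significant changes. In experiments it improves the DICE score over other weakly supervised methods on stroke lesion and brain tumour segmentation.
Abstract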
Segmentation masks of pathological areas are useful in many medical applications, such as brain tumour and stroke management. Moreover, healthy counterfactuals of diseased images can be used to enhance radiologists' training files and to improve the interpretability of segmentation models. In this work, we present a weakly supervised method to generate a healthy version of a diseased image and then use it to obtain a pixel-wise anomaly map. To do so, we start by considering a saliency map that approximately covers the pathological areas, obtained with ACAT. Then, we propose a technique that allows targeted modifications to these regions, while preserving the rest of the image. In particular, we employ a diffusion model trained on healthy samples and combine Denoising Diffusion Probabilistic Model (DDPM) and Denoising Diffusion Implicit Model (DDIM) at each step of the sampling process. DDPM is used to modify the areas affected by a lesion within the saliency map, while DDIM guarantees reconstruction of the normal anatomy outside of it. The two parts are also fused at each timestep, to guarantee the generation of a sample with a coherent appearance and a seamless transition between edited and unedited parts. We verify that when our method is applied to healthy samples, the input images are reconstructed without significant modifications. We compare our approach with alternative weakly supervised methods on IST-3 for stroke lesion segmentation and on BraTS2021 for brain tumour segmentation, where we improve the DICE score of the best competing method from $0.6534$ to $0.7056$.
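The per-timestep fusion of the DDPM and DDIM branches described in the abstract can be pictured as a mask-weighted blend. The sketch below assumes a simple convex combination driven by the saliency map; the abstract does not give the exact fusion rule, and `fuse_step` and its arguments are hypothetical names. The two input arrays stand in for the outputs of the two samplers at one timestep.

```python
import numpy as np

def fuse_step(ddpm_sample, ddim_sample, saliency_mask):
    """Blend two candidate images for one timestep.

    saliency_mask is 1 where the lesion is suspected (take the stochastic DDPM edit)
    and 0 elsewhere (keep the deterministic DDIM reconstruction). Soft values blend.
    """
    return saliency_mask * ddpm_sample + (1.0 - saliency_mask) * ddim_sample

rng = np.random.default_rng(0)
ddpm_sample = rng.normal(size=(4, 4))  # placeholder for the DDPM branch output
ddim_sample = rng.normal(size=(4, 4))  # placeholder for the DDIM branch output
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                   # toy saliency region

fused = fuse_step(ddpm_sample, ddim_sample, mask)
print(np.array_equal(fused[0], ddim_sample[0]))  # outside the mask: DDIM values are kept
```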
Predicting Ki67, ER, PR, and HER2 Statuses from H&E-stained Breast Cancer Images
paper_authors: Amir Akbarnejad, Nilanjan Ray, Penny J. Barnes, Gilbert Bigras
for: The paper aims to investigate whether machine learning methods can accurately predict molecular information from histomorphology.
methods: The authors built a large-scale dataset of 185538 images with reliable measurements for Ki67, ER, PR, and HER2 statuses, using mirrored images of H&E and IHC assays. They used a standard ViT-based pipeline to train the classifiers and achieved prediction performances around 90% in terms of AUC.
results: The authors showed that the trained classifiers can localize relevant regions, which encourages future work to improve the localizations. They also made their dataset publicly available for further research.
Abstract
Despite the advances in machine learning and digital pathology, it is not yet clear if machine learning methods can accurately predict molecular information merely from histomorphology. In a quest to answer this question, we built a large-scale dataset (185538 images) with reliable measurements for Ki67, ER, PR, and HER2 statuses. The dataset is composed of mirrored images of H&E and corresponding images of immunohistochemistry (IHC) assays (Ki67, ER, PR, and HER2). These images are mirrored through registration. To increase reliability, individual pairs were inspected and discarded if artifacts were present (tissue folding, bubbles, etc.). Measurements for Ki67, ER and PR were determined by calculating H-Score from image analysis. HER2 measurement is based on binary classification: 0 and 1+ (IHC scores representing a negative subset) vs 3+ (IHC score positive subset). Cases with IHC equivocal score (2+) were excluded. We show that a standard ViT-based pipeline can achieve prediction performances around 90% in terms of Area Under the Curve (AUC) when trained with a proper labeling protocol. Finally, we shed light on the ability of the trained classifiers to localize relevant regions, which encourages future work to improve the localizations. Our proposed dataset is publicly available: https://ihc4bc.github.io/
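Summary
Despite advances in machine learning and digital pathology, it is not yet known whether machine learning methods can accurately predict molecular information from histomorphology. To answer this question we built a large-scale dataset (185538 images) with reliable measurements of Ki67, ER, PR, and HER2 statuses. The dataset consists of mirrored images of H&E and the corresponding immunohistochemistry (IHC) assays (Ki67, ER, PR, and HER2). The images are mirrored through registration. To increase reliability, individual pairs were inspected and discarded if artifacts were present (tissue folding, bubbles, etc.). Measurements for Ki67, ER, and PR were computed as H-Scores, while the HER2 measurement is a binary classification: 0 and 1+ (IHC scores in the negative subset) vs 3+ (IHC score in the positive subset). Cases with an equivocal IHC score (2+) were excluded. We show that a standard ViT-based pipeline reaches about 90% Area Under the Curve (AUC) when trained. Finally, we examine whether the trained classifiers can localize relevant regions, which should help future work. The dataset is available at https://ihc4bc.github.io/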
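The HER2 labeling rule stated in the abstract (0 and 1+ negative, 3+ positive, 2+ excluded) is simple enough to write down directly. The sketch below is only an illustration of that rule: the function name `her2_binary_label` and the use of strings for IHC scores are assumptions, not part of the published dataset tooling.

```python
def her2_binary_label(ihc_score):
    """Map a HER2 IHC score to a binary label, or None if the case is excluded.

    "0" and "1+" -> 0 (negative subset)
    "3+"         -> 1 (positive subset)
    "2+"         -> None (equivocal, excluded from the dataset)
    """
    if ihc_score in ("0", "1+"):
        return 0
    if ihc_score == "3+":
        return 1
    if ihc_score == "2+":
        return None
    raise ValueError(f"unexpected IHC score: {ihc_score!r}")

cases = ["0", "1+", "2+", "3+"]
kept = [(score, her2_binary_label(score)) for score in cases
        if her2_binary_label(score) is not None]
print(kept)
```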
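results: The study proposes a method for tailoring time-frequency representations to user-specified requirements and demonstrates its effectiveness through mathematical analysis and numerical examples.
Abstract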
Sparse time-frequency (T-F) representations have been an important research topic for several decades. Among them, optimization-based methods (in particular, extensions of basis pursuit) allow us to design the representations through objective functions. Since acoustic signal processing utilizes models of spectrogram, the flexibility of optimization-based T-F representations is helpful for adjusting the representation for each application. However, acoustic applications often require models of the magnitude of T-F representations obtained by discrete Gabor transform (DGT). Adjusting a T-F representation to such a magnitude model (e.g., smoothness of magnitude of DGT coefficients) results in a non-convex optimization problem that is difficult to solve. In this paper, instead of tackling difficult non-convex problems, we propose a convex optimization-based framework that realizes a T-F representation whose magnitude has characteristics specified by the user. We analyze the properties of the proposed method and provide numerical examples of sparse T-F representations having, e.g., low-rank or smooth magnitude, which have not been realized before.
Optimizing multi-user sound communications in reverberating environments with acoustic reconfigurable metasurfaces
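results: Demonstrated control of multi-spectral sound fields over a broad spectrum, enabling crosstalk-free acoustic communication, frequency-multiplexed communication, and multi-user communication; an experiment achieved crosstalk-free simultaneous music playback.
Abstract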
How do you ensure that, in a reverberant room, several people can speak simultaneously to several other people, making themselves perfectly understood and without any crosstalk between messages? In this work, we report a conceptual solution to this problem by developing an intelligent acoustic wall, which can be reconfigured electronically and is controlled by a learning algorithm that adapts to the geometry of the room and the positions of sources and receivers. To this end, a portion of the room boundaries is covered with a smart mirror made of a broadband acoustic reconfigurable metasurface (ARMs) designed to provide a two-state (0 or $\pi$) phase shift in the reflected waves by 200 independently tunable units. The whole phase pattern is optimized to maximize the Shannon capacity while minimizing crosstalk between the different sources and receivers. We demonstrate the control of multi-spectral sound fields covering a spectrum much larger than the coherence bandwidth of the room for diverse striking functionalities, including crosstalk-free acoustic communication, frequency-multiplexed communications, and multi-user communications. An experiment conducted with two music sources for two different people demonstrates a crosstalk-free simultaneous music playback. Our work opens new routes for the control of sound waves in complex media and for a new generation of acoustic devices.
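Summary
How can several people in a reverberant room speak to others at the same time, be fully understood, and avoid any crosstalk? In this work we report a conceptual solution based on an intelligent acoustic wall. The wall is reconfigured electronically and controlled by a learning algorithm that adapts to the geometry of the room and to the positions of the sources and receivers. To this end, part of the room boundary is covered with a smart mirror made of 200 independently tunable units, each providing a phase shift of 0 or $\pi$. The whole phase pattern is optimized to maximize the Shannon capacity while minimizing crosstalk between the different sources and receivers. We demonstrate control of multi-spectral sound fields over a spectrum much larger than the coherence bandwidth of the room, with functionalities including crosstalk-free acoustic communication, frequency-multiplexed communication, and multi-user communication. In an experiment with two music sources for two different people, we achieve crosstalk-free simultaneous music playback. Our work opens new routes for controlling sound waves in complex media and for a new generation of acoustic devices.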
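results: Experiments across multiple languages validate the efficacy of the proposed method across diverse multilingual tasks. The study also shows that UTUT can perform many-to-many language speech-to-speech translation, which has not previously been explored in the literature.
Abstract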
In this paper, we propose a method to learn unified representations of multilingual speech and text with a single model, especially focusing on the purpose of speech synthesis. We represent multilingual speech audio with speech units, the quantized representations of speech features encoded from a self-supervised speech model. Therefore, we can focus on their linguistic content by treating the audio as pseudo text and can build a unified representation of speech and text. Then, we propose to train an encoder-decoder structured model with a Unit-to-Unit Translation (UTUT) objective on multilingual data. Specifically, by conditioning the encoder with the source language token and the decoder with the target language token, the model is optimized to translate the spoken language into that of the target language, in a many-to-many language translation setting. Therefore, the model can build the knowledge of how spoken languages are comprehended and how to relate them to different languages. A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST). By conducting comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks. Moreover, we show UTUT can perform many-to-many language STS, which has not been previously explored in the literature. Samples are available on https://choijeongsoo.github.io/utut.
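Summary
In this paper we propose a method for learning unified representations of multilingual speech and text with a single model, with a particular focus on speech synthesis. We represent multilingual speech with speech units, the quantized representations of speech features encoded by a self-supervised speech model, and treat them as pseudo text so that we can focus on their linguistic content. We then train an encoder-decoder model with a Unit-to-Unit Translation (UTUT) objective. Specifically, by conditioning the encoder on the source language token and the decoder on the target language token, the model is optimized to translate spoken language into the target language in a many-to-many translation setting. The model therefore learns how spoken languages relate to one another and how to translate speech into different languages. A single pre-trained UTUT model can be used for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST). Comprehensive experiments across various languages validate the proposed method. We also show that UTUT can perform many-to-many language STS, which has not previously been explored in the literature. Samples are available at https://choijeongsoo.github.io/utut.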
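methods: The method uses an end-to-end deep learning framework that estimates food energy by inferring the 3D shape of the food from a food image.
results: Evaluated on the Nutrition5k food image dataset, the method achieves a Mean Absolute Error (MAE) of 40.05 kCal and a Mean Absolute Percentage Error (MAPE) of 11.47%. It takes only an RGB image as input and is competitive with existing methods that require both RGB and depth information.
Abstract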
Dietary assessment is a key contributor to monitoring health status. Existing self-report methods are tedious and time-consuming with substantial biases and errors. Image-based food portion estimation aims to estimate food energy values directly from food images, showing great potential for automated dietary assessment solutions. Existing image-based methods either use a single-view image or incorporate multi-view images and depth information to estimate the food energy, which either has limited performance or creates user burdens. In this paper, we propose an end-to-end deep learning framework for food energy estimation from a monocular image through 3D shape reconstruction. We leverage a generative model to reconstruct the voxel representation of the food object from the input image to recover the missing 3D information. Our method is evaluated on a publicly available food image dataset Nutrition5k, resulting in a Mean Absolute Error (MAE) of 40.05 kCal and Mean Absolute Percentage Error (MAPE) of 11.47% for food energy estimation. Our method uses an RGB image as the only input at the inference stage and achieves competitive results compared to the existing method requiring both RGB and depth information.
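Summary
Dietary assessment is a key part of monitoring health status. Existing self-report methods suffer from substantial biases and errors. Image-based food portion estimation estimates food energy values directly from food images and shows great potential for automated dietary assessment. Existing image-based methods use either a single-view image or multi-view images with depth information to estimate food energy, which either limits performance or burdens the user. In this paper we propose an end-to-end deep learning framework that estimates food energy from a single RGB image. We use a generative model to reconstruct the voxel representation of the food object from the input image and so recover the missing 3D information. Evaluated on the publicly available food image dataset Nutrition5k, our method achieves a Mean Absolute Error (MAE) of 40.05 kCal and a Mean Absolute Percentage Error (MAPE) of 11.47%. It needs only an RGB image as input at inference and is competitive with the existing method that requires both RGB and depth information.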
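For readers unfamiliar with the two metrics reported above, the sketch below computes MAE and MAPE on made-up energy values. The numbers are illustrative only and have nothing to do with Nutrition5k; the function names are hypothetical.

```python
import numpy as np

def mean_absolute_error(predicted, actual):
    """Average absolute difference, in the same unit as the inputs (here kCal)."""
    return float(np.mean(np.abs(predicted - actual)))

def mean_absolute_percentage_error(predicted, actual):
    """Average absolute difference relative to the true value, in percent."""
    return float(100.0 * np.mean(np.abs(predicted - actual) / np.abs(actual)))

actual_kcal = np.array([200.0, 400.0])     # made-up ground-truth energies
predicted_kcal = np.array([220.0, 360.0])  # made-up model estimates

print(mean_absolute_error(predicted_kcal, actual_kcal))             # errors of 20 and 40 kCal
print(mean_absolute_percentage_error(predicted_kcal, actual_kcal))  # both dishes are off by 10%
```

MAE is dominated by high-energy dishes, while MAPE weights every dish by its relative error, which is why the abstract reports both.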
QUEST: Query Stream for Vehicle-Infrastructure Cooperative Perception
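results: Experiments show that the QUEST framework effectively improves cooperative perception performance and, in a practical application scene (camera-based vehicle-infrastructure perception), offers greater transmission flexibility and robustness to packet dropout.
Abstract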
Cooperative perception can effectively enhance individual perception performance by providing additional viewpoints and expanding the sensing field. Existing cooperation paradigms are either interpretable (result cooperation) or flexible (feature cooperation). In this paper, we propose the concept of query cooperation to enable interpretable instance-level flexible feature interaction. To specifically explain the concept, we propose a cooperative perception framework, termed QUEST, which lets query streams flow among agents. The cross-agent queries interact via fusion for co-aware instances and complementation for individual unaware instances. Taking camera-based vehicle-infrastructure perception as a typical practical application scene, the experimental results on the real-world dataset, DAIR-V2X-Seq, demonstrate the effectiveness of QUEST and further reveal the advantage of the query cooperation paradigm on transmission flexibility and robustness to packet dropout. We hope our work can further facilitate the cross-agent representation interaction for better cooperative perception in practice.
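Summary
Cooperative perception can effectively improve individual perception performance by providing additional viewpoints and expanding the sensing field. Existing cooperation paradigms are either interpretable (result cooperation) or flexible (feature cooperation). In this paper we propose the concept of query cooperation to enable interpretable, instance-level, flexible feature interaction. To make the concept concrete we propose a cooperative perception framework, QUEST, which lets query streams flow among agents. Cross-agent queries interact through fusion for co-aware instances and through complementation for instances that an individual agent is unaware of. Taking camera-based vehicle-infrastructure perception as a practical application scene, experiments on the real-world dataset DAIR-V2X-Seq demonstrate the effectiveness of QUEST and reveal the advantage of the query cooperation paradigm in transmission flexibility and robustness to packet dropout. We hope our work can further facilitate cross-agent representation interaction for better cooperative perception in practice.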
RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension
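results: Experiments show that our framework, RegionBLIP, preserves the image comprehension capability of BLIP-2 and gains comprehension of the newly introduced point cloud modality and of regional objects.
Abstract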
In this work, we investigate extending the comprehension of Multi-modal Large Language Models (MLLMs) to regional objects. To this end, we propose to extract features corresponding to regional objects as soft prompts for LLM, which provides a straightforward and scalable approach and eliminates the need for LLM fine-tuning. To effectively extract regional features from regular image features and irregular point cloud features, we present a novel and unified position-assisted feature extraction module. Furthermore, training an MLLM from scratch is highly time-consuming. Thus, we propose incrementally extending existing pre-trained MLLMs to comprehend more modalities and the regional objects of those modalities. Specifically, we freeze the Q-Former from BLIP-2, an impressive MLLM, and optimize the modality-specific LoRA parameters in Q-Former and LLM for each newly introduced modality. The freezing of the Q-Former eliminates the need for extensive pre-training on massive image-text data. The frozen Q-Former pre-trained from massive image-text data is also beneficial for the pre-training on image-region-text data. We name our framework RegionBLIP. We pre-train RegionBLIP on image-region-text, point-cloud-text, and point-cloud-region-text data. Experimental results verify that RegionBLIP can preserve the image comprehension capability of BLIP-2 and further gain a comprehension of the newly introduced point cloud modality and regional objects. The Data, Code, and Pre-trained models will be available at https://github.com/mightyzau/RegionBLIP.
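Summary
In this work we investigate extending the comprehension of Multi-modal Large Language Models (MLLMs) to regional objects. To this end we propose extracting features that correspond to regional objects as soft prompts for the LLM, which is a straightforward and scalable approach and removes the need for LLM fine-tuning. To extract regional features effectively from regular image features and irregular point cloud features, we present a novel, unified position-assisted feature extraction module. Training an MLLM from scratch is also very time-consuming, so we propose incrementally extending existing pre-trained MLLMs to comprehend more modalities and the regional objects of those modalities. Specifically, we freeze the Q-Former from BLIP-2 and optimize the modality-specific LoRA parameters in the Q-Former and the LLM. Freezing the Q-Former removes the need for extensive pre-training on massive image-text data, and the frozen Q-Former also benefits pre-training on image-region-text data. We name the framework RegionBLIP and pre-train it on image-region-text, point-cloud-text, and point-cloud-region-text data. Experimental results show that our method preserves the image comprehension capability of BLIP-2 and gains comprehension of the newly introduced point cloud modality and of regional objects. The data, code, and pre-trained models will be available at https://github.com/mightyzau/RegionBLIP.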
Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport
results: Experiments on Pascal VOC and COCO show that the proposed Point2Mask method reaches high-quality prediction performance without requiring extensive annotation.
Abstract
Weakly-supervised image segmentation has recently attracted increasing research attention, aiming to avoid expensive pixel-wise labeling. In this paper, we present an effective method, namely Point2Mask, to achieve high-quality panoptic prediction using only a single random point annotation per target for training. Specifically, we formulate the panoptic pseudo-mask generation as an Optimal Transport (OT) problem, where each ground-truth (gt) point label and pixel sample are defined as the label supplier and consumer, respectively. The transportation cost is calculated by the introduced task-oriented maps, which focus on the category-wise and instance-wise differences among the various thing and stuff targets. Furthermore, a centroid-based scheme is proposed to set the accurate unit number for each gt point supplier. Hence, the pseudo-mask generation is converted into finding the optimal transport plan at a globally minimal transportation cost, which can be solved via the Sinkhorn-Knopp Iteration. Experimental results on Pascal VOC and COCO demonstrate the promising performance of our proposed Point2Mask approach to point-supervised panoptic segmentation. Source code is available at: https://github.com/LiWentomng/Point2Mask.
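To make the supplier/consumer formulation concrete, the following is a minimal sketch of an entropy-regularised Sinkhorn-Knopp iteration in NumPy. It is not the authors' implementation: the 1-D toy pixels, the squared-distance cost (standing in for the task-oriented maps), the regularisation strength `eps`, and the fixed unit counts in `supply` (standing in for the centroid-based scheme) are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn(cost, supply, demand, eps=0.1, n_iters=500):
    """Entropy-regularised transport plan between suppliers (rows) and consumers (columns)."""
    K = np.exp(-cost / eps)              # Gibbs kernel of the cost matrix
    u = np.ones_like(supply)
    v = np.ones_like(demand)
    for _ in range(n_iters):
        u = supply / (K @ v)             # rescale rows towards the supply marginal
        v = demand / (K.T @ u)           # rescale columns towards the demand marginal
    return u[:, None] * K * v[None, :]

# Hypothetical toy problem: two point labels supply labels to six pixels on a line.
pixels = np.array([0.0, 1.0, 2.0, 7.0, 8.0, 9.0])
points = np.array([1.0, 8.0])
cost = (points[:, None] - pixels[None, :]) ** 2 / 81.0   # assumed cost, normalised to [0, 1]
supply = np.array([3.0, 3.0])            # assumed number of label units per point
demand = np.ones(6)                      # each pixel consumes one label unit

plan = sinkhorn(cost, supply, demand)
pseudo_labels = plan.argmax(axis=0)      # each pixel takes the label of its dominant supplier
```

Reading the pseudo-mask off the plan with an `argmax` per pixel is one simple choice; the abstract does not say how the plan is converted into masks.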
Deep Learning-based Prediction of Stress and Strain Maps in Arterial Walls for Improved Cardiovascular Risk Assessment
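results: The models predict von Mises stress and strain fields with high accuracy, with SSIM scores of 0.854 and 0.830 and mean squared errors of 0.017 and 0.018. Ensemble and transfer learning techniques are also proposed to further improve model performance.
Abstract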
This study investigated the potential of end-to-end deep learning tools as a more effective substitute for FEM in predicting stress-strain fields within 2D cross sections of arterial wall. We first proposed a U-Net based fully convolutional neural network (CNN) to predict the von Mises stress and strain distribution based on the spatial arrangement of calcification within arterial wall cross-sections. Further, we developed a conditional generative adversarial network (cGAN) to enhance, particularly from the perceptual perspective, the prediction accuracy of stress and strain field maps for arterial walls with various calcification quantities and spatial configurations. On top of U-Net and cGAN, we also proposed their ensemble approaches, respectively, to further improve the prediction accuracy of field maps. Our dataset, consisting of input and output images, was generated by implementing boundary conditions and extracting stress-strain field maps. The trained U-Net models can accurately predict von Mises stress and strain fields, with structural similarity index scores (SSIM) of 0.854 and 0.830 and mean squared errors of 0.017 and 0.018 for stress and strain, respectively, on a reserved test set. Meanwhile, the cGAN models in a combination of ensemble and transfer learning techniques demonstrate high accuracy in predicting von Mises stress and strain fields, as evidenced by SSIM scores of 0.890 for stress and 0.803 for strain. Additionally, mean squared errors of 0.008 for stress and 0.017 for strain further support the model's performance on a designated test set. Overall, this study developed a surrogate model for finite element analysis, which can accurately and efficiently predict stress-strain fields of arterial walls regardless of complex geometries and boundary conditions.
Focus on Content not Noise: Improving Image Generation for Nuclei Segmentation by Suppressing Steganography in CycleGAN
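results: Compared with a vanilla CycleGAN, the method improves the F1-score on the nuclei segmentation task by 5.4 percentage points. The study also suggests that integrating advanced regularization techniques into the CycleGAN architecture may mitigate steganography-related issues and produce more accurate synthetic datasets.
Abstract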
Annotating nuclei in microscopy images for the training of neural networks is a laborious task that requires expert knowledge and suffers from inter- and intra-rater variability, especially in fluorescence microscopy. Generative networks such as CycleGAN can invert the process and generate synthetic microscopy images for a given mask, thereby building a synthetic dataset. However, past works report content inconsistencies between the mask and generated image, partially due to CycleGAN minimizing its loss by hiding shortcut information for the image reconstruction in high frequencies rather than encoding the desired image content and learning the target task. In this work, we propose to remove the hidden shortcut information, called steganography, from generated images by employing low-pass filtering based on the discrete cosine transform (DCT). We show that this increases coherence between generated images and cycled masks and evaluate synthetic datasets on a downstream nuclei segmentation task. Here we achieve an improvement of 5.4 percentage points in the F1-score compared to a vanilla CycleGAN. Integrating advanced regularization techniques into the CycleGAN architecture may help mitigate steganography-related issues and produce more accurate synthetic datasets for nuclei segmentation.
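As an illustration of DCT-based low-pass filtering, the sketch below transforms an image with an orthonormal DCT-II, zeroes all coefficients outside a low-frequency block, and transforms back. The square cutoff mask, the `keep` parameter, and the random stand-in image are assumptions; the abstract does not specify the filter shape or where in the CycleGAN pipeline it is applied.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, and C.T @ X inverts it."""
    k = np.arange(n)[:, None]            # frequency index
    i = np.arange(n)[None, :]            # sample index
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)           # rescale the DC row so that C is orthonormal
    return C

def dct_low_pass(image, keep):
    """Zero every 2-D DCT coefficient outside the top-left keep x keep block."""
    h, w = image.shape
    Ch, Cw = dct_matrix(h), dct_matrix(w)
    coeffs = Ch @ image @ Cw.T           # separable 2-D DCT
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = 1.0             # assumed square low-frequency window
    return Ch.T @ (coeffs * mask) @ Cw   # inverse 2-D DCT of the masked spectrum

rng = np.random.default_rng(0)
generated = rng.random((16, 16))         # stand-in for a generated grayscale image
filtered = dct_low_pass(generated, keep=8)
```

Because the transform is orthonormal, the filter can only remove energy, so any signal hidden purely in the discarded high-frequency coefficients cannot survive into the filtered image.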
Multidimensional Data Analysis Based on Block Convolutional Tensor Decomposition
paper_authors: Mahdi Molavi, Mansoor Rezghi, Tayyebeh Saeedi
for: This paper focuses on developing a new tensor-tensor product called the $\star_c{}\text{-Product}$ based on block convolution with reflective boundary conditions, and using it to improve tensor decomposition for analyzing high-dimensional data.
methods: The paper starts from the t-product of tensors, interprets it as a block convolution, and replaces the periodic boundary conditions with reflective ones to obtain the $\star_c{}\text{-Product}$. It also introduces a tensor decomposition based on this product for arbitrary order tensors.
results: The paper shows that the decomposition based on the proposed $\star_c{}\text{-Product}$ has lower complexity than t-SVD and yields higher-quality results in applications such as classification and compression.
Abstract
Tensor decompositions are powerful tools for analyzing multi-dimensional data in their original format. Besides tensor decompositions like Tucker and CP, Tensor SVD (t-SVD), which is based on the t-product of tensors, is another extension of SVD to tensors that was recently developed and has found numerous applications in analyzing high-dimensional data. This paper offers a new insight into the t-product and shows that this product is a block convolution of two tensors with periodic boundary conditions. Based on this viewpoint, we propose a new tensor-tensor product called the $\star_c{}\text{-Product}$ based on block convolution with reflective boundary conditions. Using a tensor framework, this product can be easily extended to tensors of arbitrary order. Additionally, we introduce a tensor decomposition based on our $\star_c{}\text{-Product}$ for arbitrary order tensors. Compared to t-SVD, our new decomposition has lower complexity, and experiments show that it yields higher-quality results in applications such as classification and compression.
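The claim that the t-product is a block convolution with periodic boundary conditions can be checked numerically. The sketch below computes the standard t-product of two third-order tensors in two ways: slice-wise matrix products in the Fourier domain, and an explicit circular convolution over frontal slices. The function names are illustrative, and the reflective-boundary $\star_c{}\text{-Product}$ is not implemented here because its exact definition is not given in the abstract.

```python
import numpy as np

def t_product_fft(A, B):
    """t-product via FFT along the third axis and slice-wise matrix products."""
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    C_hat = np.einsum('ijk,jlk->ilk', A_hat, B_hat)   # one matrix product per frequency slice
    return np.real(np.fft.ifft(C_hat, axis=2))

def t_product_conv(A, B):
    """Same product written as a block circular (periodic) convolution of frontal slices."""
    n1, _, n3 = A.shape
    n4 = B.shape[1]
    C = np.zeros((n1, n4, n3))
    for k in range(n3):
        for j in range(n3):
            C[:, :, k] += A[:, :, j] @ B[:, :, (k - j) % n3]   # index wraps around: periodic boundary
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((4, 2, 5))
same = np.allclose(t_product_fft(A, B), t_product_conv(A, B))
```

The wrap-around index `(k - j) % n3` is where the periodic boundary condition enters; a reflective variant would change how out-of-range slice indices are mapped back into the tensor.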
PoissonNet: Resolution-Agnostic 3D Shape Reconstruction using Fourier Neural Operators
paper_authors: Hector Andrade-Loarca, Julius Hege, Aras Bacho, Gitta Kutyniok
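for: This paper addresses the reconstruction of 3D shapes from point cloud data, where traditional deep neural networks run into computational complexity problems at high resolutions.
methods: The authors use Fourier Neural Operators (FNOs) to solve the Poisson equation and reconstruct a mesh from oriented point cloud measurements.
results: The method outperforms existing approaches in reconstruction quality, running time, and resolution flexibility, while supporting one-shot super-resolution and remaining differentiable.
Abstract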
We introduce PoissonNet, an architecture for shape reconstruction that addresses the challenge of recovering 3D shapes from points. Traditional deep neural networks face challenges with common 3D shape discretization techniques due to their computational complexity at higher resolutions. To overcome this, we leverage Fourier Neural Operators (FNOs) to solve the Poisson equation and reconstruct a mesh from oriented point cloud measurements. PoissonNet exhibits two main advantages. First, it enables efficient training on low-resolution data while achieving comparable performance at high-resolution evaluation, thanks to the resolution-agnostic nature of FNOs. This feature allows for one-shot super-resolution. Second, our method surpasses existing approaches in reconstruction quality while being differentiable. Overall, our proposed method not only improves upon the limitations of classical deep neural networks in shape reconstruction but also achieves superior results in terms of reconstruction quality, running time, and resolution flexibility. Furthermore, we demonstrate that the Poisson surface reconstruction problem is well-posed in the limit case by showing a universal approximation theorem for the solution operator of the Poisson equation with distributional data utilizing the Fourier Neural Operator, which provides a theoretical foundation for our numerical results. The code to reproduce the experiments is available at: https://github.com/arsenal9971/PoissonNet.
NuInsSeg: A Fully Annotated Dataset for Nuclei Instance Segmentation in H&E-Stained Histological Images
paper_authors: Amirreza Mahbod, Christine Polak, Katharina Feldmann, Rumsha Khan, Katharina Gelles, Georg Dorffner, Ramona Woitek, Sepideh Hatamikia, Isabella Ellinger
for: This paper is written for the task of automatic nuclei instance segmentation in whole slide image analysis, specifically using supervised deep learning methods.
methods: The paper uses a fully manually annotated dataset called NuInsSeg, which contains 665 image patches with over 30,000 manually segmented nuclei from 31 human and mouse organs. Additionally, the paper provides ambiguous area masks for the entire dataset, which represent parts of the images where precise manual annotations are impossible.
results: The paper releases one of the biggest fully manually annotated datasets of nuclei in Hematoxylin and Eosin (H&E)-stained histological images, called NuInsSeg, which can be used to train and evaluate supervised deep learning models for nuclei instance segmentation.
Abstract
In computational pathology, automatic nuclei instance segmentation plays an essential role in whole slide image analysis. While many computerized approaches have been proposed for this task, supervised deep learning (DL) methods have shown superior segmentation performance compared to classical machine learning and image processing techniques. However, these models need fully annotated datasets for training, which are challenging to acquire, especially in the medical domain. In this work, we release one of the biggest fully manually annotated datasets of nuclei in Hematoxylin and Eosin (H&E)-stained histological images, called NuInsSeg. This dataset contains 665 image patches with more than 30,000 manually segmented nuclei from 31 human and mouse organs. Moreover, for the first time, we provide additional ambiguous area masks for the entire dataset. These vague areas represent the parts of the images where precise and deterministic manual annotations are impossible, even for human experts. The dataset and detailed step-by-step instructions to generate related segmentation masks are publicly available at https://www.kaggle.com/datasets/ipateam/nuinsseg and https://github.com/masih4/NuInsSeg, respectively.
Neural Collapse Terminus: A Unified Solution for Class Incremental Learning and Its Variants
paper_authors: Yibo Yang, Haobo Yuan, Xiangtai Li, Jianlong Wu, Lefei Zhang, Zhouchen Lin, Philip Torr, Dacheng Tao, Bernard Ghanem
for: Handle class incremental learning (CIL), long-tail class incremental learning (LTCIL), and few-shot class incremental learning (FSCIL) well.
methods: Propose a unified solution called neural collapse terminus, which is a fixed structure with the maximal equiangular inter-class separation for the whole label space. Also, propose a prototype evolving scheme to drive the backbone features into the neural collapse terminus smoothly.
results: The method is effective in all three tasks and can handle data imbalance and data scarcity. Theoretical analysis indicates that the method holds the neural collapse optimality in an incremental fashion. Extensive experiments with multiple datasets demonstrate the effectiveness of the unified solution and the generalized case.Abstract
How to enable learnability for new classes while keeping the capability well on old classes has been a crucial challenge for class incremental learning. Beyond the normal case, long-tail class incremental learning and few-shot class incremental learning are also proposed to consider the data imbalance and data scarcity, respectively, which are common in real-world implementations and further exacerbate the well-known problem of catastrophic forgetting. Existing methods are specifically proposed for one of the three tasks. In this paper, we offer a unified solution to the misalignment dilemma in the three tasks. Concretely, we propose neural collapse terminus that is a fixed structure with the maximal equiangular inter-class separation for the whole label space. It serves as a consistent target throughout the incremental training to avoid dividing the feature space incrementally. For CIL and LTCIL, we further propose a prototype evolving scheme to drive the backbone features into our neural collapse terminus smoothly. Our method also works for FSCIL with only minor adaptations. Theoretical analysis indicates that our method holds the neural collapse optimality in an incremental fashion regardless of data imbalance or data scarcity. We also design a generalized case where we do not know the total number of classes and whether the data distribution is normal, long-tail, or few-shot for each coming session, to test the generalizability of our method. Extensive experiments with multiple datasets are conducted to demonstrate the effectiveness of our unified solution to all the three tasks and the generalized case.
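A fixed set of class prototypes with maximal equiangular separation is commonly realised in the neural collapse literature as a simplex equiangular tight frame (ETF). The sketch below builds one under that assumption; the function name, the random orthonormal basis, and the requirement `dim >= num_classes` are choices made for this example rather than details taken from the abstract.

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF."""
    K = num_classes
    rng = np.random.default_rng(seed)
    # Orthonormal columns from a reduced QR factorisation (this construction needs dim >= K).
    U, _ = np.linalg.qr(rng.standard_normal((dim, K)))
    center = np.eye(K) - np.ones((K, K)) / K          # removes the mean direction
    return np.sqrt(K / (K - 1)) * U @ center

prototypes = simplex_etf(num_classes=5, dim=16)
gram = prototypes.T @ prototypes
# Diagonal entries are 1 (unit-norm prototypes); every off-diagonal entry is -1 / (K - 1),
# the most negative pairwise cosine that K equiangular unit vectors can share.
```

Because the prototypes are generated once for the whole label space, they can serve as a consistent target across incremental sessions, which is the role the abstract assigns to the neural collapse terminus.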
Enhancing Visibility in Nighttime Haze Images Using Guided APSF and Gradient Adaptive Convolution
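results: The method is evaluated extensively on real nighttime haze images and reaches a PSNR of 30.38 dB, outperforming previous methods by 13%. Data and code are available at https://github.com/jinyeying/nighttime_dehaze.
Abstract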
Visibility in hazy nighttime scenes is frequently reduced by multiple factors, including low light, intense glow, light scattering, and the presence of multicolored light sources. Existing nighttime dehazing methods often struggle with handling glow or low-light conditions, resulting in either excessively dark visuals or unsuppressed glow outputs. In this paper, we enhance the visibility from a single nighttime haze image by suppressing glow and enhancing low-light regions. To handle glow effects, our framework learns from the rendered glow pairs. Specifically, a light source aware network is proposed to detect light sources of night images, followed by the APSF (Angular Point Spread Function)-guided glow rendering. Our framework is then trained on the rendered images, resulting in glow suppression. Moreover, we utilize gradient-adaptive convolution to capture edges and textures in hazy scenes. By leveraging extracted edges and textures, we enhance the contrast of the scene without losing important structural details. To boost low-light intensity, our network learns an attention map, which is then adjusted by gamma correction. This attention has high values on low-light regions and low values on haze and glow regions. Extensive evaluation on real nighttime haze images demonstrates the effectiveness of our method. Our experiments demonstrate that our method achieves a PSNR of 30.38 dB, outperforming state-of-the-art methods by 13% on the GTA5 nighttime haze dataset. Our data and code are available at: https://github.com/jinyeying/nighttime_dehaze.
Quantification of Predictive Uncertainty via Inference-Time Sampling
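results: Experiments show that the method can generate diverse predictive distributions and that the estimated uncertainty correlates well with the prediction error.
Abstract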
Predictive variability due to data ambiguities has typically been addressed via construction of dedicated models with built-in probabilistic capabilities that are trained to predict uncertainty estimates as variables of interest. These approaches require distinct architectural components and training mechanisms, may include restrictive assumptions and exhibit overconfidence, i.e., high confidence in imprecise predictions. In this work, we propose a post-hoc sampling strategy for estimating predictive uncertainty accounting for data ambiguity. The method can generate different plausible outputs for a given input and does not assume parametric forms of predictive distributions. It is architecture agnostic and can be applied to any feed-forward deterministic network without changes to the architecture or training procedure. Experiments on regression tasks on imaging and non-imaging input data show the method's ability to generate diverse and multi-modal predictive distributions, and a desirable correlation of the estimated uncertainty with the prediction error.
Weakly Supervised 3D Instance Segmentation without Instance-level Annotations
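results: Experiments show that the method achieves results comparable to recent fully supervised methods and can help existing methods learn 3D instance segmentation at reduced annotation cost.
Abstract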
3D semantic scene understanding tasks have achieved great success with the emergence of deep learning, but often require a huge amount of manually annotated training data. To alleviate the annotation cost, we propose the first weakly-supervised 3D instance segmentation method that only requires categorical semantic labels as supervision, and we do not need instance-level labels. The required semantic annotations can be either dense or extremely sparse (e.g. 0.02% of total points). Even without having any instance-related ground-truth, we design an approach to break point clouds into raw fragments and find the most confident samples for learning instance centroids. Furthermore, we construct a recomposed dataset using pseudo instances, which is used to learn our defined multilevel shape-aware objectness signal. An asymmetrical object inference algorithm then processes core points and boundary points with different strategies and generates high-quality pseudo instance labels to guide iterative training. Experiments demonstrate that our method can achieve comparable results with recent fully supervised methods. By generating pseudo instance labels from categorical semantic labels, our designed approach can also assist existing methods for learning 3D instance segmentation at reduced annotation cost.
Balanced Destruction-Reconstruction Dynamics for Memory-replay Class Incremental Learning
paper_authors: Yuhang Zhou, Jiangchao Yao, Feng Hong, Ya Zhang, Yanfeng Wang
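for: Improving stability and generalization in class incremental learning (CIL) by balancing the destruction and reconstruction of old knowledge to alleviate catastrophic forgetting.
methods: A Balanced Destruction-Reconstruction (BDR) module is proposed. It balances the destruction and reconstruction of old knowledge by accounting for the variance in training status across classes and the quantity imbalance between current-phase samples and samples stored in memory, which improves knowledge reconstruction.
results: Experiments show that, as a lightweight plug-in, the BDR module substantially improves the performance of existing state-of-the-art methods and generalizes well.
Abstract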
Class incremental learning (CIL) aims to incrementally update a trained model with the new classes of samples (plasticity) while retaining previously learned ability (stability). To address the most challenging issue in this goal, i.e., catastrophic forgetting, the mainstream paradigm is memory-replay CIL, which consolidates old knowledge by replaying a small number of old classes of samples saved in the memory. Despite its effectiveness, the inherent destruction-reconstruction dynamics in memory-replay CIL are an intrinsic limitation: if the old knowledge is severely destroyed, it will be quite hard to reconstruct the lossless counterpart. Our theoretical analysis shows that the destruction of old knowledge can be effectively alleviated by balancing the contribution of samples from the current phase and those saved in the memory. Motivated by this theoretical finding, we propose a novel Balanced Destruction-Reconstruction module (BDR) for memory-replay CIL, which can achieve better knowledge reconstruction by reducing the degree of maximal destruction of old knowledge. Specifically, to achieve a better balance between old knowledge and new classes, the proposed BDR module takes into account two factors: the variance in training status across different classes and the quantity imbalance of samples from the current phase and memory. By dynamically manipulating the gradient during training based on these factors, BDR can effectively alleviate knowledge destruction and improve knowledge reconstruction. Extensive experiments on a range of CIL benchmarks have shown that as a lightweight plug-and-play module, BDR can significantly improve the performance of existing state-of-the-art methods with good generalization.
BEVControl: Accurately Controlling Street-view Elements with Multi-perspective Consistency via BEV Sketch Layout
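results: Compared with BEVGen, BEVControl improves foreground segmentation mIoU from 5.89 to 26.80, and training a downstream perception model on the generated images improves the NDS score by 1.29 on average.
Abstract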
Using synthesized images to boost the performance of perception models is a long-standing research challenge in computer vision. It becomes more pressing in visual-centric autonomous driving systems with multi-view cameras, as some long-tail scenarios can never be collected. Guided by the BEV segmentation layouts, the existing generative networks seem to synthesize photo-realistic street-view images when evaluated solely on scene-level metrics. However, once zoomed in, they usually fail to produce accurate foreground and background details such as heading. To this end, we propose a two-stage generative method, dubbed BEVControl, that can generate accurate foreground and background contents. In contrast to segmentation-like input, it also supports sketch style input, which is more flexible for humans to edit. In addition, we propose a comprehensive multi-level evaluation protocol to fairly compare the quality of the generated scene, foreground object, and background geometry. Our extensive experiments show that our BEVControl surpasses the state-of-the-art method, BEVGen, by a significant margin, from 5.89 to 26.80 on foreground segmentation mIoU. In addition, we show that training the downstream perception model on images generated by BEVControl yields an average improvement of 1.29 in NDS score.
DiffColor: Toward High Fidelity Text-Guided Image Colorization with Diffusion Models
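results: DiffColor produces vivid and diverse colors within a few iterations, keeps the image structure and background intact, and aligns colors with the target language guidance. It also supports in-context controllable colorization, producing different colorization results by modifying the prompt text without any fine-tuning. Extensive experiments and user studies show that DiffColor outperforms previous works in visual quality, color fidelity, and diversity of colorization options.
Abstract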
Recent data-driven image colorization methods have enabled automatic or reference-based colorization, while still suffering from unsatisfactory and inaccurate object-level color control. To address these issues, we propose a new method called DiffColor that leverages the power of pre-trained diffusion models to recover vivid colors conditioned on a prompt text, without any additional inputs. DiffColor mainly contains two stages: colorization with generative color prior and in-context controllable colorization. Specifically, we first fine-tune a pre-trained text-to-image model to generate colorized images using a CLIP-based contrastive loss. Then we try to obtain an optimized text embedding aligning the colorized image and the text prompt, and a fine-tuned diffusion model enabling high-quality image reconstruction. Our method can produce vivid and diverse colors with a few iterations, and keep the structure and background intact while having colors well-aligned with the target language guidance. Moreover, our method allows for in-context colorization, i.e., producing different colorization results by modifying prompt texts without any fine-tuning, and can achieve object-level controllable colorization results. Extensive experiments and user studies demonstrate that DiffColor outperforms previous works in terms of visual quality, color fidelity, and diversity of colorization options.
Multi-scale Cross-restoration Framework for Electrocardiogram Anomaly Detection
results: The method achieves state-of-the-art performance on a new benchmark dataset and also performs well on two other commonly used ECG datasets.
Abstract
The electrocardiogram (ECG) is a widely used diagnostic tool for detecting heart conditions. Rare cardiac diseases may be underdiagnosed using traditional ECG analysis, considering that no training dataset can exhaust all possible cardiac disorders. This paper proposes using anomaly detection to identify any unhealthy status, with normal ECGs solely for training. However, detecting anomalies in ECG can be challenging due to significant inter-individual differences and anomalies present in both global rhythm and local morphology. To address this challenge, this paper introduces a novel multi-scale cross-restoration framework for ECG anomaly detection and localization that considers both local and global ECG characteristics. The proposed framework employs a two-branch autoencoder to facilitate multi-scale feature learning through a masking and restoration process, with one branch focusing on global features from the entire ECG and the other on local features from heartbeat-level details, mimicking the diagnostic process of cardiologists. Anomalies are identified by their high restoration errors. To evaluate the performance on a large number of individuals, this paper introduces a new challenging benchmark with signal point-level ground truths annotated by experienced cardiologists. The proposed method demonstrates state-of-the-art performance on this benchmark and two other well-known ECG datasets. The benchmark dataset and source code are available at: https://github.com/MediaBrain-SJTU/ECGAD
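Summary
The electrocardiogram (ECG) is a widely used diagnostic tool for detecting heart conditions. Rare cardiac diseases may be missed by traditional ECG analysis, because no training dataset can cover every possible cardiac disorder. This paper proposes using anomaly detection to identify any unhealthy status, training only on normal ECGs. Detecting ECG anomalies is difficult because of large inter-individual differences and because anomalies appear in both global rhythm and local morphology. To address this challenge, the paper introduces a new multi-scale cross-restoration framework for ECG anomaly detection and localization. The framework uses a two-branch autoencoder for multi-scale feature learning: one branch focuses on global features of the entire ECG, the other on heartbeat-level details. Anomalies are identified by their high restoration errors. For evaluation, the paper builds a new challenging benchmark with signal point-level ground truth annotated by experienced cardiologists. The proposed method reaches state-of-the-art performance on this benchmark and performs well on two other common ECG datasets. The benchmark dataset and source code are available at: \url{https://github.com/MediaBrain-SJTU/ECGAD}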
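The idea that anomalies are identified by high restoration errors can be sketched in a few lines. This is only an illustration under stated assumptions: the squared-error measure, the mean as a signal-level score, the fixed threshold, and all function names are hypothetical and are not taken from the paper or its repository.

```python
def restoration_errors(signal, restored):
    """Return the squared restoration error at every signal point."""
    if not signal or len(signal) != len(restored):
        raise ValueError("signal and restored must be non-empty and equally long")
    return [(s - r) ** 2 for s, r in zip(signal, restored)]


def anomaly_score(signal, restored):
    """Signal-level score (assumed here): mean restoration error over all points."""
    errors = restoration_errors(signal, restored)
    return sum(errors) / len(errors)


def localize_anomalies(signal, restored, threshold):
    """Point-level localization: indices whose error exceeds `threshold`."""
    errors = restoration_errors(signal, restored)
    return [i for i, e in enumerate(errors) if e > threshold]


# Toy example: a model trained on normal ECGs fails to restore the spike at index 3.
ecg = [0.0, 1.0, 0.0, 5.0]
restored_ecg = [0.0, 1.0, 0.0, 1.0]
flagged = localize_anomalies(ecg, restored_ecg, threshold=1.0)
score = anomaly_score(ecg, restored_ecg)
```

In a real system the restored signal would come from the trained autoencoder, and the threshold would be chosen on held-out normal data.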
paper_authors: Guanzhou Ke, Yang Yu, Guoqing Chao, Xiaoli Wang, Chenyang Xu, Shengfeng He
for: The paper proposes a novel multi-view representation disentangling method that goes beyond inductive biases and ensures both interpretability and generalizability of the resulting representations.
methods: The proposed method is based on discovering multi-view consistency in advance, which determines the disentangling information boundary, and on maximizing transformation invariance and clustering consistency between views. The method consists of two stages: obtaining multi-view consistency by training a consistent encoder, and disentangling specificity from comprehensive representations by minimizing the upper bound of mutual information.
results: The proposed method outperforms 12 comparison methods in terms of clustering and classification performance on four multi-view datasets, and the extracted consistency and specificity are compact and interpretable.
Abstract
Multi-view (or -modality) representation learning aims to understand the relationships between different view representations. Existing methods disentangle multi-view representations into consistent and view-specific representations by introducing strong inductive biases, which can limit their generalization ability. In this paper, we propose a novel multi-view representation disentangling method that aims to go beyond inductive biases, ensuring both interpretability and generalizability of the resulting representations. Our method is based on the observation that discovering multi-view consistency in advance can determine the disentangling information boundary, leading to a decoupled learning objective. We also found that the consistency can be easily extracted by maximizing the transformation invariance and clustering consistency between views. These observations drive us to propose a two-stage framework. In the first stage, we obtain multi-view consistency by training a consistent encoder to produce semantically-consistent representations across views as well as their corresponding pseudo-labels. In the second stage, we disentangle specificity from comprehensive representations by minimizing the upper bound of mutual information between consistent and comprehensive representations. Finally, we reconstruct the original data by concatenating pseudo-labels and view-specific representations. Our experiments on four multi-view datasets demonstrate that our proposed method outperforms 12 comparison methods in terms of clustering and classification performance. The visualization results also show that the extracted consistency and specificity are compact and interpretable. Our code can be found at \url{https://github.com/Guanzhou-Ke/DMRIB}.
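Summary
Multi-view (or multi-modality) representation learning aims to understand the relationships between different view representations. Existing methods disentangle multi-view representations by introducing strong inductive biases, which can limit their generalization ability. In this paper, we propose a new multi-view representation disentangling method that goes beyond inductive biases to ensure both interpretability and generalizability. Our method builds on the observation that discovering multi-view consistency in advance determines the disentangling information boundary, which leads to a decoupled learning objective. We also found that the consistency can be extracted by maximizing transformation invariance and clustering consistency between views. These observations lead us to a two-stage framework. In the first stage, we train a consistent encoder to produce semantically consistent representations across views together with their pseudo-labels. In the second stage, we disentangle specificity from the comprehensive representations by minimizing the upper bound of the mutual information between consistent and comprehensive representations. Finally, we reconstruct the original data from the pseudo-labels and view-specific representations. Experiments on four multi-view datasets show that our method outperforms 12 comparison methods in clustering and classification. The visualization results also show that the extracted consistency and specificity are compact and interpretable. Our code is available at \url{https://github.com/Guanzhou-Ke/DMRIB}.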
Erasure-based Interaction Network for RGBT Video Object Detection and A Unified Benchmark
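methods: Introduces a new computer vision task called RGB-thermal (RGBT) VOD, designs a new network named EINet, and builds the VT-VOD50 dataset of 50 pairs of RGBT video sequences with complex backgrounds, various objects and different illumination conditions.
results: Extensive experiments on the VT-VOD50 dataset demonstrate the effectiveness and efficiency of the proposed method compared with existing mainstream VOD methods.
Abstract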
Recently, many breakthroughs have been made in the field of Video Object Detection (VOD), but the performance is still limited due to the imaging limitations of RGB sensors in adverse illumination conditions. To alleviate this issue, this work introduces a new computer vision task called RGB-thermal (RGBT) VOD by introducing the thermal modality that is insensitive to adverse illumination conditions. To promote the research and development of RGBT VOD, we design a novel Erasure-based Interaction Network (EINet) and establish a comprehensive benchmark dataset (VT-VOD50) for this task. Traditional VOD methods often leverage temporal information by using many auxiliary frames, and thus have a large computational burden. Considering that thermal images exhibit less noise than RGB ones, we develop a negative activation function that is used to erase the noise of RGB features with the help of thermal image features. Furthermore, with the benefits from thermal images, we rely only on a small temporal window to model the spatio-temporal information to greatly improve efficiency while maintaining detection accuracy. The VT-VOD50 dataset consists of 50 pairs of challenging RGBT video sequences with complex backgrounds, various objects and different illuminations, which are collected in real traffic scenarios. Extensive experiments on the VT-VOD50 dataset demonstrate the effectiveness and efficiency of our proposed method against existing mainstream VOD methods. The code of EINet and the dataset will be released to the public for free academic usage.
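Summary
Video object detection (VOD) has seen many recent breakthroughs, but performance is still limited by the imaging limitations of RGB sensors under adverse illumination. To alleviate this, the work introduces a new computer vision task, RGB-thermal (RGBT) VOD, which adds a thermal modality that is insensitive to adverse illumination. To promote research on RGBT VOD, we design a new Erasure-based Interaction Network (EINet) and build a comprehensive benchmark dataset (VT-VOD50). Traditional VOD methods usually exploit temporal information from many auxiliary frames and therefore carry a large computational burden. Because thermal images are less noisy than RGB images, we develop a negative activation function that erases noise in RGB features with the help of thermal image features. Thanks to the thermal images, we need only a small temporal window to model spatio-temporal information, which greatly improves efficiency while maintaining detection accuracy. VT-VOD50 contains 50 pairs of challenging RGBT video sequences with complex backgrounds, various objects and different illumination, collected in real traffic scenarios. Extensive experiments on VT-VOD50 demonstrate the effectiveness and efficiency of our method against existing mainstream VOD methods. The EINet code and the dataset will be released publicly for free academic use.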
A Multidimensional Analysis of Social Biases in Vision Transformers
paper_authors: Jannik Brinkmann, Paul Swoboda, Christian Bartelt
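for: Studies the social biases present in Vision Transformers (ViT).
methods: Measures the impact of training data, model architecture, and training objectives on the social biases in the representations learned by ViTs.
results: Counterfactual augmentation training using diffusion-based image editing can mitigate social biases but does not eliminate them. Larger models are less biased than smaller ones, and models trained with discriminative objectives are less biased than those trained with generative objectives. ViTs trained on the same dataset with different self-supervised objectives can exhibit opposite biases. These findings give insight into the origins of social biases and point to ways of improving fairness.
Abstract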
The embedding spaces of image models have been shown to encode a range of social biases such as racism and sexism. Here, we investigate specific factors that contribute to the emergence of these biases in Vision Transformers (ViT). Therefore, we measure the impact of training data, model architecture, and training objectives on social biases in the learned representations of ViTs. Our findings indicate that counterfactual augmentation training using diffusion-based image editing can mitigate biases, but does not eliminate them. Moreover, we find that larger models are less biased than smaller models, and that models trained using discriminative objectives are less biased than those trained using generative objectives. In addition, we observe inconsistencies in the learned social biases. To our surprise, ViTs can exhibit opposite biases when trained on the same data set using different self-supervised objectives. Our findings give insights into the factors that contribute to the emergence of social biases and suggest that we could achieve substantial fairness improvements based on model design choices.
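Summary
The embedding spaces of image models have been shown to encode a range of social biases, such as racism and sexism. Here, we investigate the factors behind the emergence of these biases in Vision Transformers (ViT) by measuring the impact of training data, model architecture, and training objectives on the social biases in learned representations. Our findings indicate that counterfactual augmentation training using diffusion-based image editing can mitigate biases but does not eliminate them. We also find that larger models are less biased than smaller ones, and that models trained with discriminative objectives are less biased than those trained with generative objectives. The learned social biases are also inconsistent: on the same dataset, different self-supervised objectives can lead ViTs to exhibit opposite biases. Our findings give insight into the origins of social biases and suggest that substantial fairness improvements are achievable through model design choices.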
A Novel Convolutional Neural Network Architecture with a Continuous Symmetry
results: The proposed ConvNet architecture achieves comparable performance on image classification and has an internal continuous symmetry.
Abstract
This paper introduces a new Convolutional Neural Network (ConvNet) architecture inspired by a class of partial differential equations (PDEs) called quasi-linear hyperbolic systems. With comparable performance on the image classification task, it allows for the modification of the weights via a continuous group of symmetry. This is a significant shift from traditional models where the architecture and weights are essentially fixed. We wish to promote the (internal) symmetry as a new desirable property for a neural network, and to draw attention to the PDE perspective in analyzing and interpreting ConvNets in the broader Deep Learning community.
A Survey on Deep Learning-based Spatio-temporal Action Detection
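results: The paper comprehensively reviews state-of-the-art deep learning methods in the field, introduces the commonly used benchmark datasets and evaluation metrics, and closes with a set of potential future research directions.
Abstract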
Spatio-temporal action detection (STAD) aims to classify the actions present in a video and localize them in space and time. It has become a particularly active area of research in computer vision because of its explosively emerging real-world applications, such as autonomous driving, visual surveillance, entertainment, etc. Many efforts have been devoted in recent years to building a robust and effective framework for STAD. This paper provides a comprehensive review of the state-of-the-art deep learning-based methods for STAD. Firstly, a taxonomy is developed to organize these methods. Next, the linking algorithms, which aim to associate the frame- or clip-level detection results together to form action tubes, are reviewed. Then, the commonly used benchmark datasets and evaluation metrics are introduced, and the performance of state-of-the-art models is compared. At last, this paper is concluded, and a set of potential research directions of STAD are discussed.
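Summary
Spatio-temporal action detection (STAD) aims to classify the actions present in a video and localize them in space and time. It has become a particularly active research area in computer vision because of its rapidly emerging real-world applications, such as autonomous driving, visual surveillance and entertainment. Considerable effort has gone in recent years into building a robust and effective framework for STAD. This paper gives a comprehensive review of state-of-the-art deep learning-based methods. First, a taxonomy is developed to organize the methods. Next, the linking algorithms, which associate frame- or clip-level detection results to form action tubes, are reviewed. Then the commonly used benchmark datasets and evaluation metrics are introduced, and the performance of state-of-the-art models is compared. Finally, the paper concludes and discusses potential future research directions for STAD.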
Real-time Light Estimation and Neural Soft Shadows for AR Indoor Scenarios
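results: The pipeline can integrate objects into AR scenes at a new level of realism in real time. The models are small enough to run on current mobile devices, with runtimes of 9 ms for light estimation and 5 ms for neural soft shadows on an iPhone 11 Pro.
Abstract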
We present a pipeline for realistic embedding of virtual objects into footage of indoor scenes with focus on real-time AR applications. Our pipeline consists of two main components: A light estimator and a neural soft shadow texture generator. Our light estimation is based on deep neural nets and determines the main light direction, light color, ambient color and an opacity parameter for the shadow texture. Our neural soft shadow method encodes object-based realistic soft shadows as light direction dependent textures in a small MLP. We show that our pipeline can be used to integrate objects into AR scenes in a new level of realism in real-time. Our models are small enough to run on current mobile devices. We achieve runtimes of 9ms for light estimation and 5ms for neural shadows on an iPhone 11 Pro.
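Summary
We present a pipeline for realistically embedding virtual objects into footage of indoor scenes, with a focus on real-time AR applications. The pipeline has two main components: a light estimator and a neural soft shadow texture generator. Our light estimation is based on deep neural networks and determines the main light direction, light color, ambient color and an opacity parameter for the shadow texture. Our neural soft shadow method encodes object-based realistic soft shadows as light-direction-dependent textures in a small MLP. We show that the pipeline can integrate objects into AR scenes at a new level of realism in real time. Our models are small enough to run on current mobile devices: on an iPhone 11 Pro, light estimation takes 9 ms and neural shadows take 5 ms.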
IndoHerb: Indonesia Medicinal Plants Recognition using Transfer Learning and Deep Learning
results: In testing, DenseNet121 reaches the highest accuracy at 87.4%, while a model trained from scratch reaches 43.53%.
Abstract
Herbal plants are nutritious plants that can be used as an alternative to traditional disease healing. In Indonesia there are various types of herbal plants. But with the development of the times, the existence of herbal plants as traditional medicines began to be forgotten, so that not everyone can recognize them. Having the ability to identify herbal plants can have many positive impacts. However, identifying plants can take a long time because it requires in-depth knowledge and careful examination of plant criteria. The application of computer vision can therefore help identify herbal plants. Previously, research had been conducted on the recognition of herbal plants from Vietnam using several algorithms, but the accuracy of this research was not high enough. Therefore, this study intends to implement transfer learning from the Convolutional Neural Network (CNN) algorithm to classify types of herbal plants from Indonesia. This research was conducted by collecting image data of herbal plants from Indonesia independently through the Google Images search engine. The data then goes through preprocessing and classification using the transfer learning method from CNN, and analysis is carried out. The CNN transfer learning models used are ResNet34, DenseNet121, and VGG11_bn. Based on the test results of the three models, DenseNet121 was the model with the highest accuracy, at 87.4%. In addition, testing was also carried out using a model trained from scratch, which obtained an accuracy of 43.53%. The hyperparameter configuration used in this test is the ExponentialLR scheduler with a gamma value of 0.9; learning rate 0.001; Cross Entropy Loss function; Adam optimizer; and 50 epochs. The Indonesia Medicinal Plant Dataset can be accessed at the following link: https://github.com/Salmanim20/indo_medicinal_plant
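Summary
Herbal plants are nutritious plants that can be used as an alternative treatment for disease. Indonesia has many types of herbal plants, but over time their role as traditional medicines has begun to be forgotten, so not everyone can recognize them. Being able to identify herbal plants can have many positive impacts. Identification, however, can take a long time because it requires in-depth knowledge and careful examination of plant criteria, so computer vision can help. Earlier research on recognizing herbal plants from Vietnam used several algorithms, but the accuracy was not high enough. This study therefore applies transfer learning with Convolutional Neural Network (CNN) models to classify types of herbal plants from Indonesia. Image data of Indonesian herbal plants was collected independently, followed by data preprocessing, classification and analysis. In testing, DenseNet121 reached an accuracy of 87.4%, while a model trained from scratch reached 43.53%. The hyperparameter configuration was the ExponentialLR scheduler with a gamma of 0.9, a learning rate of 0.001, the cross-entropy loss function, the Adam optimizer, and 50 epochs. The Indonesia Medicinal Plant Dataset is available at https://github.com/Salmanim20/indo_medicinal_plant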
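To make the reported schedule concrete, the sketch below computes the learning rate that an exponential decay with gamma 0.9 and an initial rate of 0.001 would produce over 50 epochs. It assumes the scheduler is stepped once per epoch, which the abstract does not state; the function name is hypothetical and the code deliberately avoids any deep learning library.

```python
def exponential_lr(base_lr, gamma, epoch):
    """Learning rate after `epoch` scheduler steps: base_lr * gamma ** epoch."""
    return base_lr * gamma ** epoch


BASE_LR = 0.001   # learning rate reported in the abstract
GAMMA = 0.9       # gamma reported in the abstract
EPOCHS = 50       # number of epochs reported in the abstract

# Learning rate used during each epoch, assuming one scheduler step per epoch.
schedule = [exponential_lr(BASE_LR, GAMMA, e) for e in range(EPOCHS)]

for epoch in (0, 10, 25, 49):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.3e}")
```

Under this assumption the rate shrinks by 10% every epoch and ends below 1e-5, so the later epochs make only small updates.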
Reference-Free Isotropic 3D EM Reconstruction using Diffusion Models
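results: The method can restore a single anisotropic volume in a self-supervised way, without any training data. Extensive experiments on two public datasets demonstrate its robustness and its advantage over supervised learning methods.
Abstract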
Electron microscopy (EM) images exhibit anisotropic axial resolution due to the characteristics inherent to the imaging modality, presenting challenges in analysis and downstream tasks. In this paper, we propose a diffusion-model-based framework that overcomes the limitations of requiring reference data or prior knowledge about the degradation process. Our approach utilizes 2D diffusion models to consistently reconstruct 3D volumes and is well-suited for highly downsampled data. Extensive experiments conducted on two public datasets demonstrate the robustness and superiority of leveraging the generative prior compared to supervised learning methods. Additionally, we demonstrate our method's feasibility for self-supervised reconstruction, which can restore a single anisotropic volume without any training data.
Consistency Regularization for Generalizable Source-free Domain Adaptation
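results: Extensive experiments on several SFDA benchmarks show that the method achieves state-of-the-art performance and generalization without accessing the source dataset, and that it remains robust on unseen testing datasets.
Abstract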
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset, making it applicable in a variety of real-world scenarios. Existing SFDA methods ONLY assess their adapted models on the target training set, neglecting the data from unseen but identically distributed testing sets. This oversight leads to overfitting issues and constrains the model's generalization ability. In this paper, we propose a consistency regularization framework to develop a more generalizable SFDA method, which simultaneously boosts model performance on both target training and testing datasets. Our method leverages soft pseudo-labels generated from weakly augmented images to supervise strongly augmented images, facilitating the model training process and enhancing the generalization ability of the adapted model. To leverage more potentially useful supervision, we present a sampling-based pseudo-label selection strategy, taking samples with severer domain shift into consideration. Moreover, global-oriented calibration methods are introduced to exploit global class distribution and feature cluster information, further improving the adaptation process. Extensive experiments demonstrate our method achieves state-of-the-art performance on several SFDA benchmarks, and exhibits robustness on unseen testing datasets.
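The core supervision signal described above, soft pseudo-labels from a weakly augmented image supervising the prediction on a strongly augmented one, can be sketched as a soft-label cross-entropy. This is a generic illustration, not the paper's exact objective: the function names are hypothetical, and the sampling-based pseudo-label selection and global-oriented calibration are not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(weak_logits, strong_logits):
    """Cross-entropy between the soft pseudo-label (weak view) and the
    prediction on the strong view. In training, the soft label would be
    treated as a constant target (no gradient flows through it)."""
    soft_label = softmax(weak_logits)
    strong_probs = softmax(strong_logits)
    return -sum(p * math.log(q) for p, q in zip(soft_label, strong_probs))


# The loss is small when both views agree and larger when they disagree.
loss_agree = consistency_loss([2.0, 0.0], [2.0, 0.0])
loss_disagree = consistency_loss([2.0, 0.0], [0.0, 2.0])
```

Using the full soft distribution rather than a hard arg-max label keeps some supervision from uncertain samples, which is one plausible reading of why soft pseudo-labels help under domain shift.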
MVFlow: Deep Optical Flow Estimation of Compressed Videos with Motion Vector Prior
results: Experiments show that, compared with existing models, MVFlow reduces AEPE by 1.09, or reaches similar accuracy while saving 52% of the computation time.
Abstract
In recent years, many deep learning-based methods have been proposed to tackle the problem of optical flow estimation and achieved promising results. However, they hardly consider that most videos are compressed and thus ignore the pre-computed information in compressed video streams. Motion vectors, one of the compression information, record the motion of the video frames. They can be directly extracted from the compression code stream without computational cost and serve as a solid prior for optical flow estimation. Therefore, we propose an optical flow model, MVFlow, which uses motion vectors to improve the speed and accuracy of optical flow estimation for compressed videos. In detail, MVFlow includes a key Motion-Vector Converting Module, which ensures that the motion vectors can be transformed into the same domain of optical flow and then be utilized fully by the flow estimation module. Meanwhile, we construct four optical flow datasets for compressed videos containing frames and motion vectors in pairs. The experimental results demonstrate the superiority of our proposed MVFlow, which can reduce the AEPE by 1.09 compared to existing models or save 52% time to achieve similar accuracy to existing models.
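The AEPE figure quoted above is the average endpoint error, the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal sketch of the metric follows; the flat list-of-pairs layout and the function name are assumptions made for illustration, not the paper's evaluation code.

```python
import math


def average_endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth (u, v) vectors."""
    if not pred_flow or len(pred_flow) != len(gt_flow):
        raise ValueError("flows must be non-empty and equally long")
    total = sum(
        math.hypot(pu - gu, pv - gv)
        for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow)
    )
    return total / len(pred_flow)


# Two pixels: the first prediction is off by (3, 4), the second is exact.
pred = [(3.0, 4.0), (0.0, 0.0)]
gt = [(0.0, 0.0), (0.0, 0.0)]
aepe = average_endpoint_error(pred, gt)
```

Because the metric is measured in pixels of displacement, a reduction of 1.09 means predicted vectors land on average about one pixel closer to the ground truth.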
Dynamic Token-Pass Transformers for Semantic Segmentation
results: The method reduces FLOPs by about 40% to 60% while remaining hardware-friendly, with an mIoU drop within 0.8%. The throughput and inference speed of ViT-L/B increase to more than 2$\times$ on Cityscapes.
Abstract
Vision transformers (ViT) usually extract features via forwarding all the tokens in the self-attention layers from top to toe. In this paper, we introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation, which can adaptively reduce the inference cost for images with different complexity. DoViT gradually stops partial easy tokens from self-attention calculation and keeps the hard tokens forwarding until meeting the stopping criteria. We employ lightweight auxiliary heads to make the token-pass decision and divide the tokens into keeping/stopping parts. With a token separate calculation, the self-attention layers are speeded up with sparse tokens and still work friendly with hardware. A token reconstruction module is built to collect and reset the grouped tokens to their original position in the sequence, which is necessary to predict correct semantic masks. We conduct extensive experiments on two common semantic segmentation tasks, and demonstrate that our method greatly reduces about 40% $\sim$ 60% FLOPs and the drop of mIoU is within 0.8% for various segmentation transformers. The throughput and inference speed of ViT-L/B are increased to more than 2$\times$ on Cityscapes.
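Summary
Vision transformers (ViT) usually extract features by forwarding all tokens through the self-attention layers from the first layer to the last. In this paper, we introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation, which adaptively reduce the inference cost for images of different complexity. DoViT gradually stops some easy tokens from taking part in the self-attention calculation and keeps forwarding the hard tokens until the stopping criterion is met. We use lightweight auxiliary heads to make the token-pass decision and divide the tokens into keeping and stopping parts. By computing on the separated sparse tokens, the self-attention layers are sped up while remaining hardware-friendly. We also build a token reconstruction module that collects the grouped tokens and resets them to their original positions in the sequence, which is necessary for predicting correct semantic masks. Extensive experiments on two common semantic segmentation tasks show that our method reduces FLOPs by about 40% to 60% with an mIoU drop within 0.8%. The throughput and inference speed of ViT-L/B increase to more than 2$\times$ on Cityscapes.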
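The keep/stop split and the token reconstruction step can be illustrated with plain lists. This is a toy sketch under assumptions: the per-token confidence is taken as given (in the paper it would come from the auxiliary heads), the threshold rule and every name below are hypothetical, and a string transform stands in for a self-attention layer.

```python
def split_tokens(tokens, confidences, threshold):
    """Partition tokens into (kept, stopped) lists of (original_index, token).
    Tokens whose confidence reaches `threshold` are treated as easy and stopped."""
    kept, stopped = [], []
    for idx, (tok, conf) in enumerate(zip(tokens, confidences)):
        (stopped if conf >= threshold else kept).append((idx, tok))
    return kept, stopped


def reconstruct(kept, stopped, length):
    """Scatter both groups back to their original sequence positions."""
    sequence = [None] * length
    for idx, tok in kept + stopped:
        sequence[idx] = tok
    return sequence


tokens = ["a", "b", "c", "d"]
confidences = [0.9, 0.2, 0.95, 0.1]      # assumed auxiliary-head outputs
kept, stopped = split_tokens(tokens, confidences, threshold=0.5)

# Only the kept (hard) tokens go through the stand-in "layer".
kept = [(idx, tok.upper()) for idx, tok in kept]

restored = reconstruct(kept, stopped, len(tokens))
```

Restoring the original order matters for segmentation because every token maps to a fixed image location, so a shuffled sequence would produce a scrambled mask.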
Get the Best of Both Worlds: Improving Accuracy and Transferability by Grassmann Class Representation
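results: On ImageNet-1K, GCR reduces the top-1 error of ResNet50-D, ResNeXt50, Swin-T and Deit3-S by 5.6%, 4.5%, 3.0% and 3.5%, respectively. GCR also gives features more freedom to vary, which improves feature quality for downstream tasks; for ResNet50-D, the average linear transfer accuracy improves from 77.98% to 79.70%.
Abstract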
We generalize the class vectors found in neural networks to linear subspaces (i.e., points in the Grassmann manifold) and show that the Grassmann Class Representation (GCR) enables the simultaneous improvement in accuracy and feature transferability. In GCR, each class is a subspace and the logit is defined as the norm of the projection of a feature onto the class subspace. We integrate Riemannian SGD into deep learning frameworks such that class subspaces in a Grassmannian are jointly optimized with the rest of the model parameters. Compared to the vector form, the representative capability of subspaces is more powerful. We show that on ImageNet-1K, the top-1 errors of ResNet50-D, ResNeXt50, Swin-T and Deit3-S are reduced by 5.6%, 4.5%, 3.0% and 3.5%, respectively. Subspaces also provide freedom for features to vary and we observed that the intra-class feature variability grows when the subspace dimension increases. Consequently, we found the quality of GCR features is better for downstream tasks. For ResNet50-D, the average linear transfer accuracy across 6 datasets improves from 77.98% to 79.70% compared to the strong baseline of vanilla softmax. For Swin-T, it improves from 81.5% to 83.4% and for Deit3, it improves from 73.8% to 81.4%. With these encouraging results, we believe that more applications could benefit from the Grassmann class representation. Code is released at https://github.com/innerlee/GCR.
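Summary
We generalize the class vectors found in neural networks to linear subspaces (i.e., points on the Grassmann manifold) and show that the Grassmann Class Representation (GCR) improves accuracy and feature transferability at the same time. In GCR, each class is a subspace, and the logit is defined as the norm of the projection of a feature onto the class subspace. We integrate Riemannian SGD into deep learning frameworks so that the class subspaces in a Grassmannian are optimized jointly with the other model parameters. Compared with the vector form, subspaces have more representational power. On ImageNet-1K, the top-1 errors of ResNet50-D, ResNeXt50, Swin-T and Deit3-S are reduced by 5.6%, 4.5%, 3.0% and 3.5%, respectively. Subspaces also give features freedom to vary, and we observe that intra-class feature variability grows as the subspace dimension increases. As a result, GCR features are of better quality for downstream tasks. For ResNet50-D, the average linear transfer accuracy across 6 datasets improves from 77.98% to 79.70% over the strong vanilla softmax baseline. For Swin-T it improves from 81.5% to 83.4%, and for Deit3 from 73.8% to 81.4%. These encouraging results suggest that more applications could benefit from GCR. Code is available at https://github.com/innerlee/GCR.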
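The GCR logit, the norm of the projection of a feature onto a class subspace, can be written out directly. The sketch below assumes each class subspace is given as a list of orthonormal basis vectors, in which case the projection norm equals the norm of the coefficient vector; the function name and data layout are illustrative and are not taken from the released code.

```python
import math


def gcr_logit(feature, basis):
    """Norm of the projection of `feature` onto span(basis).
    Assumes the basis vectors are orthonormal, so the projection norm
    equals the norm of the vector of inner products."""
    coeffs = [sum(b_i * x_i for b_i, x_i in zip(b, feature)) for b in basis]
    return math.sqrt(sum(c * c for c in coeffs))


feature = [3.0, 4.0, 12.0]

# A 2-dimensional class subspace spanned by the first two coordinate axes.
plane = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# A 1-dimensional subspace: the logit reduces to |w . x| for a unit vector w.
line = [[0.0, 0.0, 1.0]]

logit_plane = gcr_logit(feature, plane)
logit_line = gcr_logit(feature, line)
```

Any feature component that lies inside the subspace contributes to the logit, which is one way to see why features have more room to vary within a class than with a single class vector.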
DMDC: Dynamic-mask-based dual camera design for snapshot Hyperspectral Imaging
results: Extensive experiments on multiple datasets show that the method improves PSNR by more than 9 dB over the previous state of the art (SOTA).
Abstract
Deep learning methods are developing rapidly in coded aperture snapshot spectral imaging (CASSI). The number of parameters and FLOPs of existing state-of-the-art methods (SOTA) continues to increase, but the reconstruction accuracy improves slowly. Current methods still face two problems: 1) The performance of the spatial light modulator (SLM) is not fully developed due to the limitation of fixed Mask coding. 2) The single input limits the network performance. In this paper we present a dynamic-mask-based dual camera system, which consists of an RGB camera and a CASSI system running in parallel. First, the system learns the spatial feature distribution of the scene based on the RGB images, then instructs the SLM to encode each scene, and finally sends both RGB and CASSI images to the network for reconstruction. We further designed the DMDC-net, which consists of two separate networks, a small-scale CNN-based dynamic mask network for dynamic adjustment of the mask and a multimodal reconstruction network for reconstruction using RGB and CASSI measurements. Extensive experiments on multiple datasets show that our method achieves more than 9 dB improvement in PSNR over the SOTA. (https://github.com/caizeyu1992/DMDC)
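Summary
Deep learning methods are developing rapidly in coded aperture snapshot spectral imaging (CASSI). The parameter counts and FLOPs of existing state-of-the-art (SOTA) methods keep increasing, but reconstruction accuracy improves only slowly. Current methods still face two problems: 1) the performance of the spatial light modulator (SLM) is not fully exploited because of the limitation of fixed mask coding; 2) the single input limits network performance. In this paper we present a dynamic-mask-based dual camera system, consisting of an RGB camera and a CASSI system running in parallel. The system first learns the spatial feature distribution of the scene from the RGB images, then instructs the SLM to encode each scene, and finally sends both RGB and CASSI images to the network for reconstruction. We further design DMDC-net, which consists of two separate networks: a small-scale CNN-based dynamic mask network for dynamically adjusting the mask, and a multimodal reconstruction network that reconstructs from RGB and CASSI measurements. Extensive experiments on multiple datasets show that our method improves PSNR by more than 9 dB over the SOTA. (https://github.com/caizeyu1992/DMDC)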
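To put the reported PSNR gain in perspective, the short calculation below uses the standard definition PSNR = 10 log10(peak^2 / MSE) to convert a gain in dB into a ratio of mean squared errors. The peak value of 1.0 and the example MSE are arbitrary assumptions for illustration, not numbers from the paper.

```python
import math


def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to raise PSNR by `gain_db` at a fixed peak."""
    return 10.0 ** (gain_db / 10.0)


baseline_mse = 0.01                       # arbitrary example value
ratio = mse_ratio_for_gain(9.0)           # MSE reduction implied by a 9 dB gain
improved_mse = baseline_mse / ratio
gain = psnr(improved_mse) - psnr(baseline_mse)
```

At a fixed peak value, a 9 dB gain therefore corresponds to roughly an eightfold reduction in mean squared error.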
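results: Extensive experiments show that the model achieves state-of-the-art performance. The paper also proposes a new operation called ID mixing, which lets the user customize a new identity.
Abstract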
Face swapping is a task that changes a facial identity of a given image to that of another person. In this work, we propose a novel face-swapping framework called Megapixel Facial Identity Manipulation (MFIM). The face-swapping model should achieve two goals. First, it should be able to generate a high-quality image. We argue that a model which is proficient in generating a megapixel image can achieve this goal. However, generating a megapixel image is generally difficult without careful model design. Therefore, our model exploits pretrained StyleGAN in the manner of GAN-inversion to effectively generate a megapixel image. Second, it should be able to effectively transform the identity of a given image. Specifically, it should be able to actively transform ID attributes (e.g., face shape and eyes) of a given image into those of another person, while preserving ID-irrelevant attributes (e.g., pose and expression). To achieve this goal, we exploit 3DMM that can capture various facial attributes. Specifically, we explicitly supervise our model to generate a face-swapped image with the desirable attributes using 3DMM. We show that our model achieves state-of-the-art performance through extensive experiments. Furthermore, we propose a new operation called ID mixing, which creates a new identity by semantically mixing the identities of several people. It allows the user to customize the new identity.
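Summary
Face swapping is the task of changing the facial identity of a given image to that of another person. In this work, we propose a new face-swapping framework called Megapixel Facial Identity Manipulation (MFIM). The model should achieve two goals. First, it should generate a high-quality image; we argue that a model proficient at generating megapixel images can achieve this. However, generating a megapixel image is generally difficult without careful model design, so our model uses a pretrained StyleGAN in the manner of GAN inversion to generate megapixel images effectively. Second, it should effectively transform the identity of a given image: it should actively transform ID attributes (e.g., face shape and eyes) into those of another person while preserving ID-irrelevant attributes (e.g., pose and expression). To this end, we use 3DMM, which can capture various facial attributes, and explicitly supervise the model with 3DMM to generate a face-swapped image with the desired attributes. Extensive experiments show that our model achieves state-of-the-art performance. We also propose a new operation called ID mixing, which creates a new identity by mixing the identities of several people and lets the user customize it.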
Multimodal Adaptation of CLIP for Few-Shot Action Recognition
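methods: The paper builds on the "pre-training, fine-tuning" paradigm, which avoids training a network from scratch and so saves time and resources. This paradigm has two drawbacks: first, the limited labeled samples in few-shot action recognition require minimizing the number of tunable parameters to avoid over-fitting; second, the extra temporal dimension of video makes effective temporal modeling difficult, while pre-trained visual models are usually image models. The paper proposes Multimodal Adaptation of CLIP (MA-CLIP) to address these issues.
results: MA-CLIP adapts quickly to few-shot action recognition tasks and transfers across different tasks without training a network from scratch. The paper also designs an attention-based, text-guided prototype construction module that fully uses video-text multimodal information to enhance the representation of video prototypes.
Abstract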
Applying large-scale pre-trained visual models like CLIP to few-shot action recognition tasks can benefit performance and efficiency. Utilizing the "pre-training, fine-tuning" paradigm makes it possible to avoid training a network from scratch, which can be time-consuming and resource-intensive. However, this method has two drawbacks. First, limited labeled samples for few-shot action recognition necessitate minimizing the number of tunable parameters to mitigate over-fitting, also leading to inadequate fine-tuning that increases resource consumption and may disrupt the generalized representation of models. Second, the video's extra-temporal dimension challenges few-shot recognition's effective temporal modeling, while pre-trained visual models are usually image models. This paper proposes a novel method called Multimodal Adaptation of CLIP (MA-CLIP) to address these issues. It adapts CLIP for few-shot action recognition by adding lightweight adapters, which can minimize the number of learnable parameters and enable the model to transfer across different tasks quickly. The adapters we design can combine information from video-text multimodal sources for task-oriented spatiotemporal modeling, which is fast, efficient, and has low training costs. Additionally, based on the attention mechanism, we design a text-guided prototype construction module that can fully utilize video-text information to enhance the representation of video prototypes. Our MA-CLIP is plug-and-play, which can be used in any different few-shot action recognition temporal alignment metric.
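Summary
Applying large-scale pre-trained visual models such as CLIP to few-shot action recognition can improve performance and efficiency. The "pre-training, fine-tuning" paradigm avoids training a network from scratch, which saves time and resources. However, this approach has two drawbacks. First, the limited labeled samples in few-shot action recognition require keeping the number of tunable parameters as small as possible to avoid over-fitting. Second, the extra temporal dimension of video makes effective temporal modeling difficult for few-shot recognition, while pre-trained visual models are usually image models. This paper proposes a new method called Multimodal Adaptation of CLIP (MA-CLIP) to address these issues. It adapts CLIP to few-shot action recognition by adding lightweight adapters, which minimize the number of learnable parameters and let the model transfer quickly across tasks. The adapters we design combine information from video-text multimodal sources for task-oriented spatiotemporal modeling, which is fast, efficient, and cheap to train. In addition, based on the attention mechanism, we design a text-guided prototype construction module that fully uses video-text information to enhance the representation of video prototypes. Our MA-CLIP is plug-and-play and can be used with any few-shot action recognition temporal alignment metric.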
Data Augmentation for Human Behavior Analysis in Multi-Person Conversations
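results: The solution achieves the best result for bodily behavior recognition (mean average precision of 0.6262), an accuracy of 0.7771 for eye contact detection, and a comparable result for next speaker prediction (unweighted average recall of 0.5281) on the test set.
Abstract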
In this paper, we present the solution of our team HFUT-VUT for the MultiMediate Grand Challenge 2023 at ACM Multimedia 2023. The solution covers three sub-challenges: bodily behavior recognition, eye contact detection, and next speaker prediction. We select Swin Transformer as the baseline and exploit data augmentation strategies to address the above three tasks. Specifically, we crop the raw video to remove the noise from other parts. At the same time, we utilize data augmentation to improve the generalization of the model. As a result, our solution achieves the best results of 0.6262 for bodily behavior recognition in terms of mean average precision and the accuracy of 0.7771 for eye contact detection on the corresponding test set. In addition, our approach also achieves comparable results of 0.5281 for the next speaker prediction in terms of unweighted average recall.
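Summary
In this paper, we present the solution of our team HFUT-VUT for the MultiMediate Grand Challenge 2023 at ACM Multimedia 2023. The solution covers three sub-challenges: bodily behavior recognition, eye contact detection, and next speaker prediction. We select Swin Transformer as the baseline and use data augmentation strategies to address the three tasks. Specifically, we crop the raw video to remove noise from other parts, and we use data augmentation to improve the generalization of the model. As a result, our solution achieves the best result for bodily behavior recognition (mean average precision of 0.6262) and an accuracy of 0.7771 for eye contact detection on the corresponding test set. Our approach also achieves a comparable result for next speaker prediction (unweighted average recall of 0.5281).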
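Next speaker prediction is scored with unweighted average recall (UAR), the mean of the per-class recalls, which differs from plain accuracy when classes are imbalanced. The sketch below shows the computation on toy labels; the function name and the data are illustrative only and are not the challenge's evaluation script.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, with every class weighted equally."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and equally long")
    recalls = []
    for cls in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


# A classifier that always predicts the majority class (1 = "speaks next").
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
uar = unweighted_average_recall(y_true, y_pred)
```

Here the always-majority classifier looks strong on accuracy but scores only chance level on UAR, which is why a UAR near 0.5 on a two-class task should be read with that baseline in mind.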
VisAlign: Dataset for Measuring the Degree of Alignment between AI and Humans in Visual Perception
paper_authors: Jiyoung Lee, Seungho Kim, Seunghyun Won, Joonseok Lee, Marzyeh Ghassemi, James Thorne, Jaeseok Choi, O-Kil Kwon, Edward Choi
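for: The study aims to improve the alignment of AI models with human goals, preferences, or ethical principles, in support of AI safety.
methods: The paper proposes a new image classification dataset for measuring the alignment between AI models and human visual perception.
results: Using the dataset, the authors analyze the visual alignment and reliability of five popular visual perception models and seven abstention methods.
Abstract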
AI alignment refers to models acting towards human-intended goals, preferences, or ethical principles. Given that most large-scale deep learning models act as black boxes and cannot be manually controlled, analyzing the similarity between models and humans can be a proxy measure for ensuring AI safety. In this paper, we focus on the models' visual perception alignment with humans, further referred to as AI-human visual alignment. Specifically, we propose a new dataset for measuring AI-human visual alignment in terms of image classification, a fundamental task in machine perception. In order to evaluate AI-human visual alignment, a dataset should encompass samples with various scenarios that may arise in the real world and have gold human perception labels. Our dataset consists of three groups of samples, namely Must-Act (i.e., Must-Classify), Must-Abstain, and Uncertain, based on the quantity and clarity of visual information in an image and further divided into eight categories. All samples have a gold human perception label; even Uncertain (severely blurry) sample labels were obtained via crowd-sourcing. The validity of our dataset is verified by sampling theory, statistical theories related to survey design, and experts in the related fields. Using our dataset, we analyze the visual alignment and reliability of five popular visual perception models and seven abstention methods. Our code and data are available at \url{https://github.com/jiyounglee-0523/VisAlign}.
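Summary
AI alignment means that a model's behavior matches human goals, preferences, or ethical principles. Because most large-scale deep learning models are black boxes that cannot be controlled manually, analyzing the similarity between models and humans can serve as a proxy measure for AI safety. This paper focuses on the alignment of visual models with human visual perception, referred to as AI-human visual alignment. We propose a new dataset that measures this alignment through image classification. To evaluate AI-human visual alignment, a dataset should cover the varied scenarios that can arise in the real world, and every sample should carry a gold human perception label. Our dataset is divided into three groups of samples, Must-Act, Must-Abstain, and Uncertain, according to the quantity and clarity of the visual information in the image. All samples have a human perception label; even the labels of Uncertain (severely blurry) samples were obtained through crowd-sourcing. The validity of the dataset is verified by sampling theory, statistical theories related to survey design, and experts in the related fields. Using the dataset, we analyze the visual alignment and reliability of five popular visual perception models and seven abstention methods. Our code and data are available on GitHub: https://github.com/jiyounglee-0523/VisAlign.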
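results: The model's samples match the representation format of standard CAD software, so they can be imported into CAD software to be solved, edited, and applied to downstream design tasks.
Abstract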
In engineering applications, line, circle, arc, and point are collectively referred to as primitives, and they play a crucial role in path planning, simulation analysis, and manufacturing. When designing CAD models, engineers typically start by sketching the model's orthographic view on paper or a whiteboard and then translate the design intent into a CAD program. Although this design method is powerful, it often involves challenging and repetitive tasks, requiring engineers to perform numerous similar operations in each design. To address this conversion process, we propose an efficient and accurate end-to-end method that avoids the inefficiency and error accumulation issues associated with using auto-regressive models to infer parametric primitives from hand-drawn sketch images. Since our model samples match the representation format of standard CAD software, they can be imported into CAD software for solving, editing, and applied to downstream design tasks.
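Summary
In engineering applications, lines, circles, arcs, and points are collectively called primitives, and they play an important role in path planning, simulation analysis, and manufacturing. When designing CAD models, engineers typically begin by sketching the model's orthographic view on paper or a whiteboard and then translate the design intent into a CAD program. Although this design method is powerful, it often involves challenging and repetitive tasks, requiring engineers to perform many similar operations in each design. To address this conversion process, we propose an efficient and accurate end-to-end method that avoids the inefficiency and error accumulation associated with using auto-regressive models to infer parametric primitives from hand-drawn sketch images. Because our model's samples match the representation format of standard CAD software, they can be imported directly into CAD software to be solved, edited, and applied to downstream design tasks.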
Contrastive Multi-FaceForensics: An End-to-end Bi-grained Contrastive Learning Approach for Multi-face Forgery Detection
results: On the OpenForensics dataset, the proposed Contrastive Multi-FaceForensics (COMICS) method outperforms competing methods by roughly 18.5%.
Abstract
DeepFakes have raised serious societal concerns, leading to a great surge in detection-based forensics methods in recent years. Face forgery recognition is the conventional detection method that usually follows a two-phase pipeline: it extracts the face first and then determines its authenticity by classification. Since DeepFakes in the wild usually contain multiple faces, using face forgery detection methods is hardly practical, as they have to process faces sequentially, i.e., only one face is processed at a time. One straightforward way to address this issue is to integrate face extraction and forgery detection in an end-to-end fashion by adapting advanced object detection architectures. However, as these object detection architectures are designed to capture the semantic information of different object categories rather than the subtle forgery traces among the faces, the direct adaptation is far from optimal. In this paper, we describe a new end-to-end framework, Contrastive Multi-FaceForensics (COMICS), to enhance multi-face forgery detection. The core of the proposed framework is a novel bi-grained contrastive learning approach that explores effective face forgery traces at both the coarse- and fine-grained levels. Specifically, the coarse-grained level contrastive learning captures the discriminative features among positive and negative proposal pairs in multiple scales with the instruction of the proposal generator, and the fine-grained level contrastive learning captures the pixel-wise discrepancy between the forged and original areas of the same face and the pixel-wise content inconsistency between different faces. Extensive experiments on the OpenForensics dataset demonstrate that our method outperforms other counterparts by a large margin (~18.5%) and shows great potential for integration into various architectures.
Circumventing Concept Erasure Methods For Text-to-Image Generative Models
paper_authors: Minh Pham, Kelly O. Marshall, Chinmay Hegde
for: This study examines five recently proposed concept erasure methods to determine whether they fully remove the targeted concepts.
methods: The study probes the five sanitized models with specially learned word embeddings, leaving the model weights unchanged.
results: None of the five methods fully removes the targeted concepts. In particular, special learned word embeddings can retrieve "erased" concepts from the sanitized models with no alterations to their weights. These results show that post hoc concept erasure methods are brittle and call into question their use for AI safety.
Abstract
Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.
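Summary
Text-to-image generative models can produce highly realistic images covering an extremely broad range of concepts, and they are widely used by the general public. However, these models also have many drawbacks, including the potential to generate sexually explicit content, to imitate artistic styles without permission, or even to hallucinate (or deepfake) the likenesses of celebrities. Various methods have therefore been proposed to "erase" sensitive concepts from text-to-image models. In this work, we study five recently proposed concept erasure methods and find that the targeted concepts are not fully removed by any of them. Specifically, we use special learned word embeddings that can retrieve the "erased" concepts from the sanitized models without any change to the model weights. Our results show the brittleness of post hoc concept erasure methods and call into question their use in AI safety.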
TSMD: A Database for Static Color Mesh Quality Assessment Study
paper_authors: Qi Yang, Joel Jung, Haiqiang Wang, Xiaozhong Xu, Shan Liu
for: This paper is written for the study of static mesh compression algorithms and objective quality metrics.
methods: The paper uses a large-scale, crowdsourcing-based, subjective experiment to collect subjective scores from 74 viewers, and analyzes the dataset to validate its sample diversity and Mean Opinion Scores (MOS) accuracy.
results: The paper reports Pearson and Spearman correlations around 0.75, demonstrating the need for further development of more robust metrics.
Abstract
Static meshes with texture map are widely used in modern industrial and manufacturing sectors, attracting considerable attention in the mesh compression community due to its huge amount of data. To facilitate the study of static mesh compression algorithm and objective quality metric, we create the Tencent - Static Mesh Dataset (TSMD) containing 42 reference meshes with rich visual characteristics. 210 distorted samples are generated by the lossy compression scheme developed for the Call for Proposals on polygonal static mesh coding, released on June 23 by the Alliance for Open Media Volumetric Visual Media group. Using processed video sequences, a large-scale, crowdsourcing-based, subjective experiment was conducted to collect subjective scores from 74 viewers. The dataset undergoes analysis to validate its sample diversity and Mean Opinion Scores (MOS) accuracy, establishing its heterogeneous nature and reliability. State-of-the-art objective metrics are evaluated on the new dataset. Pearson and Spearman correlations around 0.75 are reported, deviating from results typically observed on less heterogeneous datasets, demonstrating the need for further development of more robust metrics. The TSMD, including meshes, PVSs, bitstreams, and MOS, is made publicly available at the following location: https://multimedia.tencent.com/resources/tsmd.
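Summary
Static meshes with texture maps are widely used in modern industrial and manufacturing sectors and attract considerable attention in the mesh compression community because of their large data volume. To facilitate the study of static mesh compression algorithms and objective quality metrics, we create the Tencent - Static Mesh Dataset (TSMD), which contains 42 reference meshes with rich visual characteristics. We generate 210 distorted samples with the lossy compression scheme developed for the Call for Proposals on polygonal static mesh coding. Using processed video sequences, we conduct a large-scale, crowdsourcing-based subjective experiment that collects subjective scores from 74 viewers. The dataset is analyzed to validate its sample diversity and the accuracy of its Mean Opinion Scores (MOS), establishing its heterogeneity and reliability. We evaluate state-of-the-art objective metrics on the dataset and report Pearson and Spearman correlations around 0.75, lower than what is typically observed on less heterogeneous datasets, which indicates the need for more robust metrics. The TSMD, including meshes, PVSs, bitstreams, and MOS, is publicly available at https://multimedia.tencent.com/resources/tsmd.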
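The evaluation described above scores an objective metric by its Pearson (linear) and Spearman (rank) correlation with the MOS. The sketch below shows how the two coefficients can be computed with the standard library; the MOS and metric values are invented for illustration and are not TSMD data.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical scores for five distorted samples.
mos = [4.5, 3.9, 3.1, 2.2, 1.4]
metric = [0.95, 0.90, 0.70, 0.65, 0.20]
plcc = pearson(metric, mos)
srocc = spearman(metric, mos)
```

In this toy case the metric orders the samples exactly as the viewers do, so the rank correlation is perfect while the linear correlation is slightly lower.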
TDMD: A Database for Dynamic Color Mesh Subjective and Objective Quality Explorations
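results: The database can be used to study how different types of distortion affect human perception and to derive recommendations for DCM compression and related tasks. The paper also evaluates three types of state-of-the-art objective metrics on the TDMD: image-based, point-based, and video-based metrics. The experiments show that each metric has strengths and weaknesses in different applications, and the paper gives suggestions for selecting metrics in practice. The TDMD will be made publicly available at https://multimedia.tencent.com/resources/tdmd.
Abstract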
Dynamic colored meshes (DCM) are widely used in various applications; however, these meshes may undergo different processes, such as compression or transmission, which can distort them and degrade their quality. To facilitate the development of objective metrics for DCMs and study the influence of typical distortions on their perception, we create the Tencent - dynamic colored mesh database (TDMD) containing eight reference DCM objects with six typical distortions. Using processed video sequences (PVS) derived from the DCM, we have conducted a large-scale subjective experiment that resulted in 303 distorted DCM samples with mean opinion scores, making the TDMD the largest available DCM database to our knowledge. This database enabled us to study the impact of different types of distortion on human perception and offer recommendations for DCM compression and related tasks. Additionally, we have evaluated three types of state-of-the-art objective metrics on the TDMD, including image-based, point-based, and video-based metrics. Our experimental results highlight the strengths and weaknesses of each metric, and we provide suggestions about the selection of metrics in practical DCM applications. The TDMD will be made publicly available at the following location: https://multimedia.tencent.com/resources/tdmd.
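Summary
Dynamic colored meshes (DCM) are widely used in many applications, but they may undergo processes such as compression or transmission that degrade their quality. To support the development of objective metrics for DCMs and to study how these distortions affect human perception, we create the Tencent - dynamic colored mesh database (TDMD), which contains eight reference DCM objects and six typical distortions. Using processed video sequences (PVS) derived from the DCMs, we conduct a large-scale subjective experiment that yields 303 distorted DCM samples, each with a mean opinion score. To our knowledge, the TDMD is the largest available DCM database. With it, we study the impact of different types of distortion on human perception and offer guidance for DCM compression and related tasks. We also evaluate three types of state-of-the-art objective metrics on the TDMD: image-based, point-based, and video-based metrics. The experimental results show the strengths and weaknesses of each metric, and we give suggestions for selecting metrics in practical applications. The TDMD will be made publicly available at https://multimedia.tencent.com/resources/tdmd.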
Efficient neural supersampling on a novel gaming dataset
paper_authors: Antoine Mercier, Ruan Erasmus, Yashesh Savani, Manik Dhingra, Fatih Porikli, Guillaume Berger
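for: Improving real-time rendering for video games, which is increasingly demanding because of the need for higher resolutions, frame rates, and photorealism.
methods: A neural algorithm for supersampling rendered content that is four times more efficient than existing methods while maintaining the same level of accuracy.
results: The paper introduces a new dataset that provides auxiliary modalities such as motion vectors and depth, generated at different resolutions using rendering features like viewport jittering and mipmap biasing. The authors believe the dataset fills a gap in the current dataset landscape and can serve as a valuable resource for measuring progress.
Abstract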
Real-time rendering for video games has become increasingly challenging due to the need for higher resolutions, framerates and photorealism. Supersampling has emerged as an effective solution to address this challenge. Our work introduces a novel neural algorithm for supersampling rendered content that is 4 times more efficient than existing methods while maintaining the same level of accuracy. Additionally, we introduce a new dataset which provides auxiliary modalities such as motion vectors and depth generated using graphics rendering features like viewport jittering and mipmap biasing at different resolutions. We believe that this dataset fills a gap in the current dataset landscape and can serve as a valuable resource to help measure progress in the field and advance the state-of-the-art in super-resolution techniques for gaming content.
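Summary
Real-time rendering for video games faces growing demands for higher resolution, higher frame rates, and photorealism, and supersampling has emerged as an effective solution. Our work introduces a neural algorithm for supersampling rendered content that is four times more efficient than existing methods while maintaining the same level of accuracy. We also introduce a new dataset that provides auxiliary modalities such as motion vectors and depth, generated at different resolutions using graphics rendering features like viewport jittering and mipmap biasing. We believe this dataset fills a gap in the current dataset landscape and can serve as a valuable resource for measuring progress and advancing super-resolution techniques for gaming content.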
HANDAL: A Dataset of Real-World Manipulable Object Categories with Pose Annotations, Affordances, and Reconstructions
paper_authors: Andrew Guo, Bowen Wen, Jianhe Yuan, Jonathan Tremblay, Stephen Tyree, Jeffrey Smith, Stan Birchfield
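for: This paper provides a dataset for category-level object pose estimation and affordance prediction, focused on manipulable objects that robots can grasp, such as pliers, utensils, and screwdrivers.
methods: The annotation process uses a single off-the-shelf camera and semi-automated processing to produce high-quality 3D annotations without crowd-sourcing.
results: The dataset contains 308k annotated image frames from 2.2k videos of 212 real-world objects in 17 categories. The paper also outlines the usefulness of the dataset and the remaining challenges.
Abstract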
We present the HANDAL dataset for category-level object pose estimation and affordance prediction. Unlike previous datasets, ours is focused on robotics-ready manipulable objects that are of the proper size and shape for functional grasping by robot manipulators, such as pliers, utensils, and screwdrivers. Our annotation process is streamlined, requiring only a single off-the-shelf camera and semi-automated processing, allowing us to produce high-quality 3D annotations without crowd-sourcing. The dataset consists of 308k annotated image frames from 2.2k videos of 212 real-world objects in 17 categories. We focus on hardware and kitchen tool objects to facilitate research in practical scenarios in which a robot manipulator needs to interact with the environment beyond simple pushing or indiscriminate grasping. We outline the usefulness of our dataset for 6-DoF category-level pose+scale estimation and related tasks. We also provide 3D reconstructed meshes of all objects, and we outline some of the bottlenecks to be addressed for democratizing the collection of datasets like this one.
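Summary
We present the HANDAL dataset for category-level object pose estimation and affordance prediction. Unlike previous datasets, ours focuses on manipulable objects that are suitable for robots, such as pliers, utensils, and screwdrivers, whose size and shape allow functional grasping by robot manipulators. Our annotation process requires only a single off-the-shelf camera and semi-automated processing, which lets us produce high-quality 3D annotations without crowd-sourcing. The dataset consists of 308k annotated image frames from 2.2k videos of 212 real-world objects in 17 categories. We focus on hardware and kitchen tool objects to support research on practical scenarios in which a robot manipulator must interact with the environment beyond simple pushing or indiscriminate grasping. We outline the usefulness of the dataset for 6-DoF category-level pose+scale estimation and related tasks. We also provide 3D reconstructed meshes of all objects and outline some of the bottlenecks in collecting datasets of this kind.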
DLSIA: Deep Learning for Scientific Image Analysis
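results: DLSIA provides customizable CNN construction and abstracts away CNN complexity, helping scientists accelerate discoveries, foster interdisciplinary collaboration, and advance scientific image analysis.
Abstract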
We introduce DLSIA (Deep Learning for Scientific Image Analysis), a Python-based machine learning library that empowers scientists and researchers across diverse scientific domains with a range of customizable convolutional neural network (CNN) architectures for a wide variety of tasks in image analysis to be used in downstream data processing, or for experiment-in-the-loop computing scenarios. DLSIA features easy-to-use architectures such as autoencoders, tunable U-Nets, and parameter-lean mixed-scale dense networks (MSDNets). Additionally, we introduce sparse mixed-scale networks (SMSNets), generated using random graphs and sparse connections. As experimental data continues to grow in scale and complexity, DLSIA provides accessible CNN construction and abstracts CNN complexities, allowing scientists to tailor their machine learning approaches, accelerate discoveries, foster interdisciplinary collaboration, and advance research in scientific image analysis.
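Summary
We introduce DLSIA (Deep Learning for Scientific Image Analysis), a Python-based machine learning library that gives scientists and researchers a range of customizable convolutional neural network architectures for many image analysis tasks, for use in downstream data processing or in experiment-in-the-loop computing scenarios. DLSIA offers easy-to-use architectures such as autoencoders, tunable U-Nets, and parameter-lean mixed-scale dense networks (MSDNets). We also introduce sparse mixed-scale networks (SMSNets), generated using random graphs and sparse connections. As experimental data continue to grow in scale and complexity, DLSIA provides accessible CNN construction and abstracts CNN complexities, allowing scientists to tailor their machine learning approaches, accelerate discoveries, foster interdisciplinary collaboration, and advance research in scientific image analysis.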
COVID-VR: A Deep Learning COVID-19 Classification Model Using Volume-Rendered Computer Tomography
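results: The approach effectively identifies pulmonary lesions and performs competitively with traditional slice-based methods in comparisons on both private and public data.
Abstract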
The COVID-19 pandemic presented numerous challenges to healthcare systems worldwide. Given that lung infections are prevalent among COVID-19 patients, chest Computer Tomography (CT) scans have frequently been utilized as an alternative method for identifying COVID-19 conditions and various other types of pulmonary diseases. Deep learning architectures have emerged to automate the identification of pulmonary disease types by leveraging CT scan slices as inputs for classification models. This paper introduces COVID-VR, a novel approach for classifying pulmonary diseases based on volume rendering images of the lungs captured from multiple angles, thereby providing a comprehensive view of the entire lung in each image. To assess the effectiveness of our proposal, we compared it against competing strategies utilizing both private data obtained from partner hospitals and a publicly available dataset. The results demonstrate that our approach effectively identifies pulmonary lesions and performs competitively when compared to slice-based methods.
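Summary
The COVID-19 pandemic presented many challenges to healthcare systems worldwide. Because lung infections are common among COVID-19 patients, chest Computer Tomography (CT) scans have been widely used to identify COVID-19 and other types of pulmonary disease. Deep learning architectures have emerged that automate the identification of pulmonary disease types from CT scan slices. This paper introduces COVID-VR, a new approach based on volume rendering images of the lungs captured from multiple angles, so that each image gives a comprehensive view of the entire lung. To assess the effectiveness of our proposal, we compare it against competing strategies using both private data from partner hospitals and a publicly available dataset. The results show that our approach effectively identifies pulmonary lesions and performs competitively with slice-based methods.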
LiDAR View Synthesis for Robust Vehicle Navigation Without Expert Labels
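results: Through a comprehensive online evaluation and a comparison with concurrent work, the authors demonstrate the effectiveness of the method, with a notable advantage in model robustness. Project page: https://jonathsch.github.io/lidar-synthesis/
Abstract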
Deep learning models for self-driving cars require a diverse training dataset to manage critical driving scenarios on public roads safely. This includes having data from divergent trajectories, such as the oncoming traffic lane or sidewalks. Such data would be too dangerous to collect in the real world. Data augmentation approaches have been proposed to tackle this issue using RGB images. However, solutions based on LiDAR sensors are scarce. Therefore, we propose synthesizing additional LiDAR point clouds from novel viewpoints without physically driving at dangerous positions. The LiDAR view synthesis is done using mesh reconstruction and ray casting. We train a deep learning model, which takes a LiDAR scan as input and predicts the future trajectory as output. A waypoint controller is then applied to this predicted trajectory to determine the throttle and steering labels of the ego-vehicle. Our method neither requires expert driving labels for the original nor the synthesized LiDAR sequence. Instead, we infer labels from LiDAR odometry. We demonstrate the effectiveness of our approach in a comprehensive online evaluation and with a comparison to concurrent work. Our results show the importance of synthesizing additional LiDAR point clouds, particularly in terms of model robustness. Project page: https://jonathsch.github.io/lidar-synthesis/
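Summary
Deep learning models for self-driving cars need a diverse training dataset to handle critical driving scenarios on public roads safely. This includes data from divergent trajectories, such as the oncoming traffic lane or sidewalks, which would be too dangerous to collect in the real world. Data augmentation approaches based on RGB images have been proposed to tackle this issue, but solutions based on LiDAR sensors are scarce. We therefore propose synthesizing additional LiDAR point clouds from novel viewpoints using mesh reconstruction and ray casting. We train a deep learning model that takes a LiDAR scan as input and predicts the future trajectory. A waypoint controller is then applied to the predicted trajectory to determine the throttle and steering labels of the ego-vehicle. Our method requires expert driving labels for neither the original nor the synthesized LiDAR sequence; instead, we infer labels from LiDAR odometry. We demonstrate the approach in a comprehensive online evaluation and compare it with concurrent work. Our results show that synthesizing additional LiDAR point clouds matters, particularly for model robustness. Project page: https://jonathsch.github.io/lidar-synthesis/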
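The abstract does not specify which waypoint controller turns the predicted trajectory into steering commands. As a hedged illustration only, the sketch below uses a pure-pursuit rule, a common choice for this step; the function name, the lookahead distance, and the wheelbase are assumptions and do not come from the paper.

```python
import math

def pure_pursuit_steering(waypoints, lookahead, wheelbase):
    """Steering angle (rad) towards the first waypoint at least `lookahead` away.

    Waypoints are (x, y) in the ego frame: x forward, y to the left.
    Falls back to the last waypoint if none is far enough.
    """
    target = waypoints[-1]
    for x, y in waypoints:
        if math.hypot(x, y) >= lookahead:
            target = (x, y)
            break
    tx, ty = target
    dist = math.hypot(tx, ty)
    alpha = math.atan2(ty, tx)  # bearing of the target relative to the heading
    # Pure pursuit: delta = atan(2 * L * sin(alpha) / lookahead_distance)
    return math.atan2(2.0 * wheelbase * math.sin(alpha), dist)

# Hypothetical predicted trajectories in metres.
straight = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
left_turn = [(1.0, 0.1), (2.0, 0.4), (3.0, 0.9)]
steer_straight = pure_pursuit_steering(straight, lookahead=2.0, wheelbase=2.5)
steer_left = pure_pursuit_steering(left_turn, lookahead=2.0, wheelbase=2.5)
```

A trajectory that stays on the heading axis yields zero steering, and one that bends left yields a positive angle, which is the kind of label a controller could derive from a predicted path.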
Harder synthetic anomalies to improve OoD detection in Medical Images
methods: The study builds on the synthetic local anomaly approach that won the 2020 MOOD challenge and improves the synthetic anomaly generation process to make the anomalies more heterogeneous and challenging.
results: The method took first place in both the sample-wise and pixel-wise tasks of the 2022 MOOD challenge, demonstrating that it improves the ability of medical image segmentation networks to generalize.
Abstract
Our method builds upon previous Medical Out-of-Distribution (MOOD) challenge winners that empirically show that synthetic local anomalies generated by copying or interpolating foreign patches are useful for training segmentation networks that generalize to unseen types of anomalies. In terms of the synthetic anomaly generation process, our contributions make synthetic anomalies more heterogeneous and challenging by 1) using random shapes instead of squares and 2) smoothing the interpolation edge of anomalies so networks cannot rely on the high gradient between the image and the foreign patch to identify anomalies. Our experiments using the validation set of the 2020 MOOD winners show that both contributions substantially improved the method's performance. We used a standard 3D U-Net architecture as the segmentation network, trained patch-wise on both brain and abdominal datasets. Our final challenge submission consisted of 10 U-Nets trained across 5 data folds with different configurations of the anomaly generation process. Our method achieved first position in both the sample-wise and pixel-wise tasks in the 2022 edition of the Medical Out-of-Distribution challenge held at MICCAI.
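Summary
Our method builds on previous winners of the Medical Out-of-Distribution (MOOD) challenge, who showed empirically that synthetic local anomalies generated by copying or interpolating foreign patches help train segmentation networks that generalize to unseen types of anomalies. In the synthetic anomaly generation process, our contributions make the anomalies more heterogeneous and challenging by 1) using random shapes instead of squares and 2) smoothing the interpolation edge so that networks cannot rely on the high gradient between the image and the foreign patch to identify anomalies. Our experiments on the validation set of the 2020 MOOD winners show that both contributions improve performance. We use a standard 3D U-Net architecture as the segmentation network, trained patch-wise on brain and abdominal datasets. Our final challenge submission consists of 10 U-Nets trained across 5 data folds with different configurations of the anomaly generation process. Our method took first place in both the sample-wise and pixel-wise tasks of the 2022 MOOD challenge.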
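The two contributions above, irregular shapes and smoothed interpolation edges, can be illustrated on a small 2D example. This is a rough sketch under stated assumptions: the union-of-discs shape, the box blur, and all sizes are illustrative choices, not the authors' generation procedure, which operates on 3D medical volumes.

```python
import random

def random_blob_mask(size, rng, n_discs=3):
    """Binary mask built as a union of randomly placed discs (an irregular shape)."""
    mask = [[0.0] * size for _ in range(size)]
    for _ in range(n_discs):
        cy = rng.randrange(size // 4, 3 * size // 4)
        cx = rng.randrange(size // 4, 3 * size // 4)
        r = rng.randrange(size // 8, size // 4)
        for y in range(size):
            for x in range(size):
                if (y - cy) ** 2 + (x - cx) ** 2 <= r * r:
                    mask[y][x] = 1.0
    return mask

def box_blur(mask, radius):
    """Average over a (2*radius+1)^2 window, clipped at the border."""
    size = len(mask)
    out = [[0.0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            ys = range(max(0, y - radius), min(size, y + radius + 1))
            xs = range(max(0, x - radius), min(size, x + radius + 1))
            out[y][x] = sum(mask[j][i] for j in ys for i in xs) / (len(ys) * len(xs))
    return out

def blend(image, foreign, mask):
    """Per-pixel interpolation: mask=1 takes the foreign patch, mask=0 keeps the image."""
    size = len(image)
    return [[(1 - mask[y][x]) * image[y][x] + mask[y][x] * foreign[y][x]
             for x in range(size)] for y in range(size)]

def max_horizontal_step(img):
    """Largest intensity jump between horizontally adjacent pixels."""
    return max(abs(row[x + 1] - row[x]) for row in img for x in range(len(row) - 1))

rng = random.Random(0)
size = 32
image = [[0.0] * size for _ in range(size)]    # stand-in for the original image
foreign = [[1.0] * size for _ in range(size)]  # stand-in for the foreign patch
hard = random_blob_mask(size, rng)
soft = box_blur(hard, radius=2)
hard_edge = max_horizontal_step(blend(image, foreign, hard))
soft_edge = max_horizontal_step(blend(image, foreign, soft))
```

With the hard mask the anomaly border is a full-contrast step, while the blurred mask spreads the transition over several pixels, so a network can no longer find the anomaly from the edge gradient alone.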
Follow the Soldiers with Optimized Single-Shot Multibox Detection and Reinforcement Learning
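results: SSD Lite gives the best performance of the compared techniques, with a considerable boost in inference speed (about 2-3 times) without compromising accuracy.
Abstract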
Nowadays, autonomous cars are gaining traction due to their numerous potential applications on battlefields and in resolving a variety of other real-world challenges. The main goal of our project is to build an autonomous system using DeepRacer that follows a specific person (for our project, a soldier) as they move in any direction. The two main components of this project are an optimized Single-Shot Multibox Detection (SSD) object detection model and a Reinforcement Learning (RL) model. We accomplished the task using SSD Lite instead of SSD and, at the end, compared the results among SSD, SSD with Neural Computing Stick (NCS), and SSD Lite. Experimental results show that SSD Lite gives the best performance among these three techniques and exhibits a considerable boost in inference speed (~2-3 times) without compromising accuracy.
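Summary
Autonomous cars are gaining traction because of their many potential applications on battlefields and in solving a variety of real-world problems. The main goal of our project is to use DeepRacer to build an autonomous system that follows a specific person (in our project, a soldier) as they move in any direction. The two main components of the project are an optimized Single-Shot Multibox Detection (SSD) model and a Reinforcement Learning (RL) model. We use SSD Lite instead of SSD and, at the end, compare SSD, SSD with Neural Computing Stick (NCS), and SSD Lite. The experimental results show that SSD Lite performs best among the three techniques and delivers a clear boost in inference speed (about 2-3 times) without sacrificing accuracy.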
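results: The system runs in a hand-held smartphone camera app and produces motion-blur and long-exposure effects fully automatically, while preserving high resolution and high dynamic range.
Abstract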
Long exposure photography produces stunning imagery, representing moving elements in a scene with motion-blur. It is generally employed in two modalities, producing either a foreground or a background blur effect. Foreground blur images are traditionally captured on a tripod-mounted camera and portray blurred moving foreground elements, such as silky water or light trails, over a perfectly sharp background landscape. Background blur images, also called panning photography, are captured while the camera is tracking a moving subject, to produce an image of a sharp subject over a background blurred by relative motion. Both techniques are notoriously challenging and require additional equipment and advanced skills. In this paper, we describe a computational burst photography system that operates in a hand-held smartphone camera app, and achieves these effects fully automatically, at the tap of the shutter button. Our approach first detects and segments the salient subject. We track the scene motion over multiple frames and align the images in order to preserve desired sharpness and to produce aesthetically pleasing motion streaks. We capture an under-exposed burst and select the subset of input frames that will produce blur trails of controlled length, regardless of scene or camera motion velocity. We predict inter-frame motion and synthesize motion-blur to fill the temporal gaps between the input frames. Finally, we composite the blurred image with the sharp regular exposure to protect the sharpness of faces or areas of the scene that are barely moving, and produce a final high resolution and high dynamic range (HDR) photograph. Our system democratizes a capability previously reserved to professionals, and makes this creative style accessible to most casual photographers. More information and supplementary material can be found on our project webpage: https://motion-mode.github.io/
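Summary
Long exposure photography produces striking imagery by rendering the moving elements of a scene with motion blur. It is generally used in two modes, foreground blur and background blur. Foreground blur images are traditionally captured with a tripod-mounted camera and show blurred moving foreground elements, such as silky water or light trails, over a perfectly sharp background. Background blur images, also called panning photography, are captured while the camera tracks a moving subject, which yields a sharp subject over a background blurred by relative motion. Both techniques are notoriously challenging and require additional equipment and advanced skills. In this paper, we describe a computational burst photography system that runs in a hand-held smartphone camera app and achieves these effects fully automatically at the tap of the shutter button. Our approach first detects and segments the salient subject. We track the scene motion over multiple frames and align the images to preserve the desired sharpness and produce aesthetically pleasing motion streaks. We capture an under-exposed burst and select the subset of input frames that produces blur trails of controlled length, regardless of scene or camera motion velocity. We predict inter-frame motion and synthesize motion blur to fill the temporal gaps between the input frames. Finally, we composite the blurred image with the sharp regular exposure to protect the sharpness of faces or barely moving areas, and produce a final photograph with high resolution and high dynamic range (HDR). Our system makes a capability previously reserved for professionals accessible to most casual photographers. More information and supplementary material can be found on our project webpage: https://motion-mode.github.io/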
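Two steps of the pipeline above lend themselves to a small illustration: accumulating aligned frames into a blurred image, and compositing that result with a sharp exposure so protected regions stay crisp. The following is a simplified 1-D sketch, assuming the frames are already aligned and a protection mask is given; it leaves out segmentation, alignment, frame selection, and motion interpolation, and all values are hypothetical.

```python
def average_frames(frames):
    """Temporal mean of aligned frames: moving content smears into a streak."""
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]

def composite(blurred, sharp, protect_mask):
    """Keep `sharp` where protect_mask is 1 and `blurred` where it is 0."""
    return [m * s + (1 - m) * b for b, s, m in zip(blurred, sharp, protect_mask)]

# Hypothetical 1-D scene: a light moves right by one pixel per frame, while
# pixel 0 holds a subject whose brightness drops after the first frame.
width = 8
frames = []
for t in range(4):
    row = [0.0] * width
    row[0] = 1.0 if t == 0 else 0.6  # subject, only fully exposed in frame 0
    row[2 + t] = 1.0                 # moving light
    frames.append(row)

blurred = average_frames(frames)
sharp = frames[0]                    # stand-in for the regular sharp exposure
mask = [1.0] + [0.0] * (width - 1)   # protect the subject pixel only
result = composite(blurred, sharp, mask)
```

Averaging turns the moving light into an even streak across four pixels and dims the subject, and the composite restores the subject from the sharp exposure while keeping the streak.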
ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders
paper_authors: Shawn Xu, Lin Yang, Christopher Kelly, Marcin Sieniek, Timo Kohlberger, Martin Ma, Wei-Hung Weng, Attila Kiraly, Sahar Kazemzadeh, Zakkai Melamed, Jungyeon Park, Patricia Strachan, Yun Liu, Chuck Lau, Preeti Singh, Christina Chen, Mozziyar Etemadi, Sreenivasa Raju Kalidindi, Yossi Matias, Katherine Chou, Greg S. Corrado, Shravya Shetty, Daniel Tse, Shruthi Prabhakara, Daniel Golden, Rory Pilgrim, Krish Eswaran, Andrew Sellergren
for: This study aims to develop a broadly applicable language/image-aligned X-ray system (ELIXR) for tasks such as chest X-ray (CXR) classification, data-efficient classification, and semantic search.
methods: A language-aligned image encoder is combined with a fixed LLM, PaLM 2, through a lightweight adapter architecture. The model is trained on images paired with the corresponding free-text radiology reports from the MIMIC-CXR dataset.
results: ELIXR achieves state-of-the-art performance on zero-shot CXR classification (mean AUC of 0.850 across 13 findings), on data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings with 1% and 10% of the training data), and on semantic search (0.76 normalized discounted cumulative gain (NDCG), including perfect retrieval on some queries).
Abstract
Our approach, which we call Embeddings for Language/Image-aligned X-Rays, or ELIXR, leverages a language-aligned image encoder combined or grafted onto a fixed LLM, PaLM 2, to perform a broad range of tasks. We train this lightweight adapter architecture using images paired with corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieved state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings), data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) for 1% (~2,200 images) and 10% (~22,000 images) training data), and semantic search (0.76 normalized discounted cumulative gain (NDCG) across nineteen queries, including perfect retrieval on twelve of them). Compared to existing data-efficient methods including supervised contrastive learning (SupCon), ELIXR required two orders of magnitude less data to reach similar performance. ELIXR also showed promise on CXR vision-language tasks, demonstrating overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance tasks, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI.
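Summary
Our approach, which we call Embeddings for Language/Image-aligned X-Rays (ELIXR), combines a language-aligned image encoder with a fixed LLM, PaLM 2, to perform a broad range of tasks. We train this lightweight adapter architecture on images paired with the corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieves state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings), on data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings: atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema), and on semantic search (0.76 normalized discounted cumulative gain, NDCG). Compared with existing data-efficient methods, including supervised contrastive learning (SupCon), ELIXR needs two orders of magnitude less data to reach similar performance. ELIXR also shows promise on CXR vision-language tasks, with overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI.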
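The semantic search result above is reported as normalized discounted cumulative gain (NDCG). The sketch below shows one common form of the metric, with linear gains and a log2 rank discount; the exact variant and relevance grading used in the paper are not stated in the abstract, so both are assumptions here, as are the example relevance lists.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance at rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevances of the top three results for two queries.
perfect = ndcg([2, 1, 0])  # results already in the ideal order
swapped = ndcg([0, 1, 2])  # best result ranked last
```

A query whose results come back in the ideal order scores exactly 1, which is what "perfect retrieval" corresponds to, and any misordering pulls the score below 1.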
Patched Denoising Diffusion Models For High-Resolution Image Synthesis
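results: Patch-DM produces high-quality images on a nature image dataset (1024×512) and on standard smaller-size benchmarks (256×256), and achieves state-of-the-art FID scores on all four datasets. It also reduces memory complexity compared with classic diffusion models.
Abstract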
We propose an effective denoising diffusion model for generating high-resolution images (e.g., 1024$\times$512), trained on small-size image patches (e.g., 64$\times$64). We name our algorithm Patch-DM, in which a new feature collage strategy is designed to avoid the boundary artifact when synthesizing large-size images. Feature collage systematically crops and combines partial features of the neighboring patches to predict the features of a shifted image patch, allowing the seamless generation of the entire image due to the overlap in the patch feature space. Patch-DM produces high-quality image synthesis results on our newly collected dataset of nature images (1024$\times$512), as well as on standard benchmarks of smaller sizes (256$\times$256), including LSUN-Bedroom, LSUN-Church, and FFHQ. We compare our method with previous patch-based generation methods and achieve state-of-the-art FID scores on all four datasets. Further, Patch-DM also reduces memory complexity compared to the classic diffusion models.
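Summary
We propose an effective denoising diffusion model for generating high-resolution images (e.g., 1024×512) that is trained on small image patches (e.g., 64×64). We name the algorithm Patch-DM. It uses a new feature collage strategy to avoid boundary artifacts when synthesizing large images. Feature collage systematically crops and combines partial features of neighboring patches to predict the features of a shifted image patch, and the overlap in the patch feature space allows the whole image to be generated seamlessly. Patch-DM produces high-quality synthesis results on our newly collected dataset of nature images (1024×512) and on standard smaller-size benchmarks (256×256), including LSUN-Bedroom, LSUN-Church, and FFHQ. We compare our method with previous patch-based generation methods and achieve state-of-the-art FID scores on all four datasets. Patch-DM also reduces memory complexity compared with classic diffusion models.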
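The feature collage idea above, building the features of a shifted patch from partial features of its neighbors, can be shown in one dimension. This is only a schematic sketch with placeholder feature values; the function name is hypothetical, and the actual method works on 2D feature maps inside the denoising network.

```python
def collage_shifted(features_left, features_right, shift):
    """Features for a window shifted `shift` positions to the right.

    Takes the trailing part of the left patch and the leading part of the
    right patch, so the shifted window straddles the patch boundary.
    """
    return features_left[shift:] + features_right[:shift]

# Two neighbouring patches, each with four feature positions (placeholders).
patch_a = ["a0", "a1", "a2", "a3"]
patch_b = ["b0", "b1", "b2", "b3"]
shifted = collage_shifted(patch_a, patch_b, shift=2)
```

Because the shifted window overlaps both neighbors in feature space, predictions made for it tie the two patches together across their shared border.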
results: The study finds that a previous representative self-supervised approach, DETReg, fails to improve strong DETR-based detectors in the full-data regime, whereas combining a more accurate box predictor with the Objects365 benchmark improves results significantly. The authors achieve an AP of 59.3% on the COCO val set, surpassing the previous state-of-the-art model by 1.4%. They also generate a series of synthetic pre-training datasets and show that pre-training on them leads to notable improvements in object detection performance.
Abstract
Motivated by that DETR-based approaches have established new records on COCO detection and segmentation benchmarks, many recent endeavors show increasing interest in how to further improve DETR-based approaches by pre-training the Transformer in a self-supervised manner while keeping the backbone frozen. Some studies already claimed significant improvements in accuracy. In this paper, we take a closer look at their experimental methodology and check if their approaches are still effective on the very recent state-of-the-art such as $\mathcal{H}$-Deformable-DETR. We conduct thorough experiments on COCO object detection tasks to study the influence of the choice of pre-training datasets, localization, and classification target generation schemes. Unfortunately, we find the previous representative self-supervised approach such as DETReg, fails to boost the performance of the strong DETR-based approaches on full data regimes. We further analyze the reasons and find that simply combining a more accurate box predictor and Objects$365$ benchmark can significantly improve the results in follow-up experiments. We demonstrate the effectiveness of our approach by achieving strong object detection results of AP=$59.3\%$ on COCO val set, which surpasses $\mathcal{H}$-Deformable-DETR + Swin-L by +$1.4\%$. Last, we generate a series of synthetic pre-training datasets by combining the very recent image-to-text captioning models (LLaVA) and text-to-image generative models (SDXL). Notably, pre-training on these synthetic datasets leads to notable improvements in object detection performance. Looking ahead, we anticipate substantial advantages through the future expansion of the synthetic pre-training dataset.
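Summary
DETR-based approaches have set new records on the COCO detection and segmentation benchmarks, and many recent efforts explore how to improve them further by pre-training the Transformer in a self-supervised manner while keeping the backbone frozen. Some studies have already claimed significant accuracy gains. In this paper, we take a closer look at their experimental methodology and check whether the approaches remain effective on very recent state-of-the-art models such as $\mathcal{H}$-Deformable-DETR. We run thorough experiments on COCO object detection to study the influence of the pre-training dataset and of the localization and classification target generation schemes. Unfortunately, we find that a representative earlier self-supervised approach, DETReg, fails to improve strong DETR-based models in the full-data regime. We analyze the reasons and find that simply combining a more accurate box predictor with the Objects$365$ benchmark significantly improves results. Our approach reaches AP = $59.3\%$ on the COCO val set, surpassing $\mathcal{H}$-Deformable-DETR + Swin-L by +$1.4\%$. Finally, we generate a series of synthetic pre-training datasets by combining recent image-to-text captioning models (LLaVA) with text-to-image generative models (SDXL); pre-training on them yields notable improvements in detection performance. Looking ahead, we expect substantial benefits from expanding the synthetic pre-training dataset.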
A vision transformer-based framework for knowledge transfer from multi-modal to mono-modal lymphoma subtyping models
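results: In an experimental study on 157 patients, the mono-modal classification model performs strongly, outperforming six recent state-of-the-art cancer classification methods. In addition, a power-law curve estimated on the experimental data indicates that the model needs a reasonable number of additional patients for training to potentially reach the same diagnostic accuracy as IHC technologies.
Abstract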
Determining lymphoma subtypes is a crucial step for better patients treatment targeting to potentially increase their survival chances. In this context, the existing gold standard diagnosis method, which is based on gene expression technology, is highly expensive and time-consuming making difficult its accessibility. Although alternative diagnosis methods based on IHC (immunohistochemistry) technologies exist (recommended by the WHO), they still suffer from similar limitations and are less accurate. WSI (Whole Slide Image) analysis by deep learning models showed promising new directions for cancer diagnosis that would be cheaper and faster than existing alternative methods. In this work, we propose a vision transformer-based framework for distinguishing DLBCL (Diffuse Large B-Cell Lymphoma) cancer subtypes from high-resolution WSIs. To this end, we propose a multi-modal architecture to train a classifier model from various WSI modalities. We then exploit this model through a knowledge distillation mechanism for efficiently driving the learning of a mono-modal classifier. Our experimental study conducted on a dataset of 157 patients shows the promising performance of our mono-modal classification model, outperforming six recent methods from the state-of-the-art dedicated for cancer classification. Moreover, the power-law curve, estimated on our experimental data, shows that our classification model requires a reasonable number of additional patients for its training to potentially reach identical diagnosis accuracy as IHC technologies.
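Summary
Determining lymphoma subtypes is a key step in targeting patient treatment to improve survival chances. However, the existing gold-standard diagnosis method, based on gene expression technology, is very expensive and time-consuming, which makes it hard to access. Alternative diagnosis methods based on IHC (immunohistochemistry) exist, but they suffer from similar limitations and are less accurate. WSI (Whole Slide Image) analysis with deep learning models has opened promising directions for cancer diagnosis that would be cheaper and faster than existing alternatives. In this work, we propose a vision transformer-based framework for distinguishing DLBCL (Diffuse Large B-Cell Lymphoma) subtypes from high-resolution WSIs. To this end, we propose a multi-modal architecture to train a classifier from several WSI modalities. We then use this model, through a knowledge distillation mechanism, to guide the training of a mono-modal classifier. Our experimental study on a dataset of 157 patients shows strong performance for the mono-modal classifier, which outperforms six recent state-of-the-art cancer classification methods. Moreover, the power-law curve estimated on our experimental data indicates that the model needs a reasonable number of additional patients for training to potentially reach the same diagnostic accuracy as IHC technologies.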
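The abstract does not say which distillation objective is used, so the following is only a minimal sketch of one common formulation (temperature-softened KL divergence to the teacher plus cross-entropy to the label), assumed here for illustration. The names `distillation_loss`, `temperature`, and `alpha` are hypothetical and not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Assumed objective: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE.

    The teacher stands in for the multi-modal classifier and the student
    for the mono-modal one; both are represented only by their logits.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student_soft) if pt > 0)
    # Standard cross-entropy of the student against the ground-truth label.
    cross_entropy = -math.log(softmax(student_logits)[label])
    return alpha * (temperature ** 2) * kl + (1 - alpha) * cross_entropy
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the supervised term remains, which is the behaviour one would expect from any distillation objective of this family.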
Incorporating Season and Solar Specificity into Renderings made by a NeRF Architecture using Satellite Images
results: The authors test their framework on eight Areas of Interest, obtaining accurate renderings, accurate height maps, and shadow predictions. They also run ablation studies to justify the network design parameters.
Abstract
As a result of Shadow NeRF and Sat-NeRF, it is possible to take the solar angle into account in a NeRF-based framework for rendering a scene from a novel viewpoint using satellite images for training. Our work extends those contributions and shows how one can make the renderings season-specific. Our main challenge was creating a Neural Radiance Field (NeRF) that could render seasonal features independently of viewing angle and solar angle while still being able to render shadows. We teach our network to render seasonal features by introducing one more input variable -- time of the year. However, the small training datasets typical of satellite imagery can introduce ambiguities in cases where shadows are present in the same location for every image of a particular season. We add additional terms to the loss function to discourage the network from using seasonal features for accounting for shadows. We show the performance of our network on eight Areas of Interest containing images captured by the Maxar WorldView-3 satellite. This evaluation includes tests measuring the ability of our framework to accurately render novel views, generate height maps, predict shadows, and specify seasonal features independently from shadows. Our ablation studies justify the choices made for network design parameters.
Summary
Thanks to Shadow NeRF and Sat-NeRF, a NeRF-based framework trained on satellite images can take the solar angle into account when rendering a scene from a novel viewpoint. Our work extends these contributions and shows how to make the renderings season-specific. The main challenge was to create a Neural Radiance Field (NeRF) that renders seasonal features independently of viewing angle and solar angle while still rendering shadows correctly. We teach the network to render seasonal features by adding the time of year as an input variable. However, the small training datasets typical of satellite imagery can introduce ambiguities when shadows fall in the same location in every image of a given season. We add terms to the loss function that discourage the network from using seasonal features to account for shadows. We evaluate the network on eight Areas of Interest imaged by the Maxar WorldView-3 satellite, testing novel-view rendering, height map generation, shadow prediction, and the separation of seasonal features from shadows. Our ablation studies justify the network design choices.
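The abstract says only that time of year is added as an input variable; it does not say how that variable is encoded. One plausible option, sketched here purely as an assumption, is a periodic sine/cosine encoding so that late December and early January map to nearby inputs. The function name and the `num_frequencies` parameter are hypothetical.

```python
import math


def encode_time_of_year(day_of_year, days_in_year=365.0, num_frequencies=2):
    """Map a day of the year to periodic features [sin, cos, sin2, cos2, ...].

    Assumed encoding, not the paper's: the features wrap around smoothly
    at the year boundary, unlike a raw day index.
    """
    phase = 2.0 * math.pi * (day_of_year / days_in_year)
    features = []
    for k in range(1, num_frequencies + 1):
        features.append(math.sin(k * phase))
        features.append(math.cos(k * phase))
    return features
```

Such a vector would be concatenated with the network's other inputs; whether the authors do this or feed the raw value is not stated in the text above.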
Learning Spatial Distribution of Long-Term Trackers Scores
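results: The method reaches a recall of 0.738 on the LTB-50 dataset when learning from VOT-LT2022, and 0.619 when the two datasets are swapped; both results are competitive with the state of the art.
Abstract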
Long-Term tracking is a hot topic in Computer Vision. In this context, competitive models are presented every year, showing a constant growth rate in performances, mainly measured in standardized protocols as Visual Object Tracking (VOT) and Object Tracking Benchmark (OTB). Fusion-trackers strategy has been applied over last few years for overcoming the known re-detection problem, turning out to be an important breakthrough. Following this approach, this work aims to generalize the fusion concept to an arbitrary number of trackers used as baseline trackers in the pipeline, leveraging a learning phase to better understand how outcomes correlate with each other, even when no target is present. A model and data independence conjecture will be evidenced in the manuscript, yielding a recall of 0.738 on LTB-50 dataset when learning from VOT-LT2022, and 0.619 by reversing the two datasets. In both cases, results are strongly competitive with state-of-the-art and recall turns out to be the first on the podium.
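Summary
Long-term tracking is a hot topic in computer vision. Competitive models are presented every year, with steadily improving performance, measured mainly with standardized protocols such as Visual Object Tracking (VOT) and Object Tracking Benchmark (OTB). Fusion-tracker strategies have been applied over the last few years to overcome the known re-detection problem and have proved an important breakthrough. Following this approach, this work generalizes the fusion concept to an arbitrary number of baseline trackers in the pipeline, using a learning phase to better understand how their outcomes correlate with each other, even when no target is present. The manuscript gives evidence for a model and data independence conjecture, with a recall of 0.738 on the LTB-50 dataset when learning from VOT-LT2022, and 0.619 when the two datasets are reversed. In both cases the results are strongly competitive with the state of the art, and recall ranks first.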
A Hyper-pixel-wise Contrastive Learning Augmented Segmentation Network for Old Landslide Detection Using High-Resolution Remote Sensing Images and Digital Elevation Model Data
methods: hyper-pixel-wise contrastive learning augmented segmentation network (HPCL-Net) and global hyper-pixel-wise sample pair queues-based contrastive learning method
results: improved reliability of old landslide detection compared to previous models, with increased mIoU, Landslide IoU, and F1-score metrics
Abstract
As a hazardous disaster, landslides often bring tremendous losses to humanity, so reliable landslide detection is necessary. However, visual blur and small datasets pose great challenges for the old landslide detection task when using remote sensing data. To reliably extract semantic features, a hyper-pixel-wise contrastive learning augmented segmentation network (HPCL-Net) is proposed, which augments the local salient feature extraction from the boundaries of landslides through HPCL and fuses the heterogeneous information in the semantic space from High-Resolution Remote Sensing Images and Digital Elevation Model Data. For full utilization of the precious samples, a global hyper-pixel-wise sample pair queues-based contrastive learning method is developed, which includes the construction of global queues that store hyper-pixel-wise samples and the updating scheme of a momentum encoder, reliably enhancing the extraction ability of semantic features. The proposed HPCL-Net is evaluated on a Loess Plateau old landslide dataset, and experimental results show that the model greatly improves the reliability of old landslide detection compared to the previous old landslide segmentation model: the mIoU metric increases from 0.620 to 0.651, the Landslide IoU metric from 0.334 to 0.394, and the F1-score metric from 0.501 to 0.565.
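Summary
As a hazardous disaster, landslides often cause tremendous losses, so reliable landslide detection is necessary. However, visual blur and small datasets make old landslide detection from remote sensing data very challenging. To extract semantic features reliably, we propose a hyper-pixel-wise contrastive learning augmented segmentation network (HPCL-Net), which strengthens local salient feature extraction at landslide boundaries through HPCL and fuses heterogeneous information from high-resolution remote sensing images and digital elevation model data in the semantic space. To make full use of the precious samples, we develop a global hyper-pixel-wise sample pair queues-based contrastive learning method, which builds global queues that store hyper-pixel-wise samples and defines the update scheme of a momentum encoder. On a Loess Plateau old landslide dataset, HPCL-Net detects old landslides more reliably than the previous old landslide segmentation model: mIoU rises from 0.620 to 0.651, Landslide IoU from 0.334 to 0.394, and F1-score from 0.501 to 0.565.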
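The two ingredients named above, a global queue of stored samples and a momentum-encoder update, can be sketched as follows. This assumes a MoCo-style design (exponential moving average of the encoder weights and a fixed-size first-in-first-out queue); the abstract does not give the actual update rule or queue policy, and the names `momentum_update` and `SampleQueue` are hypothetical.

```python
from collections import deque


def momentum_update(key_params, query_params, momentum=0.999):
    """Assumed EMA rule: key <- m * key + (1 - m) * query, element-wise."""
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]


class SampleQueue:
    """Fixed-size FIFO store of feature vectors.

    When the queue is full, adding new samples silently drops the
    oldest ones, so the store always holds the most recent features.
    """

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)

    def enqueue(self, features):
        # Append every feature in order; deque enforces the size limit.
        self.items.extend(features)

    def __len__(self):
        return len(self.items)
```

In a contrastive setup of this kind, the queue contents would typically serve as additional negative (or positive) pairs for each new batch, which is one way to get more out of a small dataset.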
A Hybrid Approach To Real-Time Multi-Object Tracking
results: Across different experimental settings, the method reaches a MOTA of 0.608, compared with 0.549 for the state-of-the-art reference, and roughly halves the running time.
Abstract
Multi-Object Tracking, also known as Multi-Target Tracking, is a significant area of computer vision that has many uses in a variety of settings. The development of deep learning, which has encouraged researchers to propose more and more work in this direction, has significantly impacted the scientific advancement around the study of tracking as well as many other domains related to computer vision. In fact, all of the solutions that are currently state-of-the-art in the literature and in the tracking industry, are built on top of deep learning methodologies that produce exceptionally good results. Deep learning is enabled thanks to the ever more powerful technology researchers can use to handle the significant computational resources demanded by these models. However, when real-time is a main requirement, developing a tracking system without being constrained by expensive hardware support with enormous computational resources is necessary to widen tracking applications in real-world contexts. To this end, a compromise is to combine powerful deep strategies with more traditional approaches to favor considerably lower processing solutions at the cost of less accurate tracking results even though suitable for real-time domains. Indeed, the present work goes in that direction, proposing a hybrid strategy for real-time multi-target tracking that combines effectively a classical optical flow algorithm with a deep learning architecture, targeted to a human-crowd tracking system exhibiting a desirable trade-off between performance in tracking precision and computational costs. The developed architecture was experimented with different settings, and yielded a MOTA of 0.608 out of the compared state-of-the-art 0.549 results, and about half the running time when introducing the optical flow phase, achieving almost the same performance in terms of accuracy.
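Summary
Multi-object tracking (also called multi-target tracking) is an important area of computer vision with many applications. The rise of deep learning, which has encouraged researchers to propose more and more work, has strongly influenced progress in tracking and in other computer vision domains. In fact, all current state-of-the-art solutions in the literature and in industry build on deep learning methods that produce exceptionally good results. However, when real-time operation is a main requirement, a tracking system that does not depend on expensive hardware is needed to widen tracking applications in real-world contexts. One compromise is to combine powerful deep strategies with more traditional approaches, lowering computational cost at the expense of some tracking accuracy. This work takes that direction and proposes a hybrid strategy that combines a classical optical flow algorithm with a deep learning architecture for human-crowd tracking, trading off tracking precision against computational cost. In experiments, the system obtains a MOTA of 0.608 against 0.549 for the compared state of the art, with running time reduced by about half.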
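For readers unfamiliar with the metric quoted above, MOTA (Multiple Object Tracking Accuracy) is commonly defined as one minus the ratio of all tracking errors (misses, false positives, identity switches) to the number of ground-truth objects, summed over frames. The sketch below implements that standard definition; the per-frame tuple layout is an assumption made for this example, and the numbers in the usage line are invented, not taken from the paper.

```python
def mota(frames):
    """Compute MOTA from per-frame counts.

    Each frame is a tuple (false_negatives, false_positives,
    id_switches, num_ground_truth). Errors and ground-truth counts
    are summed over all frames before taking the ratio.
    """
    total_errors = sum(fn + fp + ids for fn, fp, ids, _ in frames)
    total_gt = sum(gt for _, _, _, gt in frames)
    if total_gt <= 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - total_errors / total_gt


# Two made-up frames: 7 errors over 20 ground-truth objects.
example_score = mota([(3, 1, 0, 10), (2, 0, 1, 10)])
```

Because errors can outnumber ground-truth objects, MOTA can be negative; a score of 0.608 means the summed errors amount to about 39% of the ground-truth count.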
Tirtha – An Automated Platform to Crowdsource Images and Create 3D Models of Heritage Sites
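results: The authors built a web platform that lets the general public contribute photos to create 3D models of CH sites, making digital preservation more efficient, cost-effective, and sustainable.
Abstract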
Digital preservation of Cultural Heritage (CH) sites is crucial to protect them against damage from natural disasters or human activities. Creating 3D models of CH sites has become a popular method of digital preservation thanks to advancements in computer vision and photogrammetry. However, the process is time-consuming, expensive, and typically requires specialized equipment and expertise, posing challenges in resource-limited developing countries. Additionally, the lack of an open repository for 3D models hinders research and public engagement with their heritage. To address these issues, we propose Tirtha, a web platform for crowdsourcing images of CH sites and creating their 3D models. Tirtha utilizes state-of-the-art Structure from Motion (SfM) and Multi-View Stereo (MVS) techniques. It is modular, extensible and cost-effective, allowing for the incorporation of new techniques as photogrammetry advances. Tirtha is accessible through a web interface at https://tirtha.niser.ac.in and can be deployed on-premise or in a cloud environment. In our case studies, we demonstrate the pipeline's effectiveness by creating 3D models of temples in Odisha, India, using crowdsourced images. These models are available for viewing, interaction, and download on the Tirtha website. Our work aims to provide a dataset of crowdsourced images and 3D reconstructions for research in computer vision, heritage conservation, and related domains. Overall, Tirtha is a step towards democratizing digital preservation, primarily in resource-limited developing countries.
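Summary
Digital preservation of Cultural Heritage (CH) sites is crucial to protect them from damage by natural disasters or human activities. Creating 3D models of CH sites has become a popular preservation method thanks to advances in computer vision and photogrammetry. However, the process is time-consuming and expensive and typically requires specialized equipment and expertise, which poses challenges in resource-limited developing countries. In addition, the lack of an open repository for 3D models hinders research and public engagement with heritage. To address these issues, we propose Tirtha, a web platform for crowdsourcing images of CH sites and creating their 3D models. Tirtha uses state-of-the-art Structure from Motion (SfM) and Multi-View Stereo (MVS) techniques. It is modular, extensible, and cost-effective, so new techniques can be incorporated as photogrammetry advances. Tirtha is accessible through a web interface and can be deployed on-premise or in a cloud environment. In our case studies, we create 3D models of temples in Odisha, India, from crowdsourced images. The models can be viewed, interacted with, and downloaded on the Tirtha website. Our work aims to provide a dataset of crowdsourced images and 3D reconstructions for research in computer vision, heritage conservation, and related domains. Overall, Tirtha is a step towards democratizing digital preservation, primarily in resource-limited developing countries.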
paper_authors: Isaac R. Galatzer-Levy, Daniel McDuff, Vivek Natarajan, Alan Karthikesalingam, Matteo Malgaroli
for: This paper aims to investigate the ability of Large language models (LLMs) to predict psychiatric functioning from patient interviews and clinical descriptions without explicit training.
methods: The study uses Med-PaLM 2, a large language model explicitly trained on a large corpus of medical knowledge, to predict psychiatric functioning based on patient interviews and clinical descriptions.
results: The study finds that Med-PaLM 2 is capable of assessing psychiatric functioning across a range of psychiatric conditions, with the strongest performance in predicting depression scores based on standardized assessments (accuracy range 0.80 - 0.84), statistically indistinguishable from human clinical raters (t(1,144) = 1.20; p = 0.23). The results show the potential for general clinical language models to flexibly predict psychiatric risk based on free descriptions of functioning from both patients and clinicians.
Abstract
The current work investigates the capability of Large language models (LLMs) that are explicitly trained on large corpuses of medical knowledge (Med-PaLM 2) to predict psychiatric functioning from patient interviews and clinical descriptions without being trained to do so. To assess this, n = 145 depression and n =115 PTSD assessments and n = 46 clinical case studies across high prevalence/high comorbidity disorders (Depressive, Anxiety, Psychotic, trauma and stress, Addictive disorders) were analyzed using prompts to extract estimated clinical scores and diagnoses. Results demonstrate that Med-PaLM 2 is capable of assessing psychiatric functioning across a range of psychiatric conditions with the strongest performance being the prediction of depression scores based on standardized assessments (Accuracy range= 0.80 - 0.84) which were statistically indistinguishable from human clinical raters t(1,144) = 1.20; p = 0.23. Results show the potential for general clinical language models to flexibly predict psychiatric risk based on free descriptions of functioning from both patients and clinicians.
Learning beyond sensations: how dreams organize neuronal representations
paper_authors: Nicolas Deperrois, Mihai A. Petrovici, Walter Senn, Jakob Jordan
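for: This paper examines how semantic representations in higher sensory cortices are formed and maintained, and how these representations shape behavior.
methods: The paper draws on predictive learning theories and on virtual experiences to explain how cortical representations are formed and maintained.
results: The paper proposes two complementary learning principles, "adversarial dreaming" and "contrastive dreaming", which can explain cortical learning beyond the classical predictive learning paradigm.
Abstract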
Semantic representations in higher sensory cortices form the basis for robust, yet flexible behavior. These representations are acquired over the course of development in an unsupervised fashion and continuously maintained over an organism's lifespan. Predictive learning theories propose that these representations emerge from predicting or reconstructing sensory inputs. However, brains are known to generate virtual experiences, such as during imagination and dreaming, that go beyond previously experienced inputs. Here, we suggest that virtual experiences may be just as relevant as actual sensory inputs in shaping cortical representations. In particular, we discuss two complementary learning principles that organize representations through the generation of virtual experiences. First, "adversarial dreaming" proposes that creative dreams support a cortical implementation of adversarial learning in which feedback and feedforward pathways engage in a productive game of trying to fool each other. Second, "contrastive dreaming" proposes that the invariance of neuronal representations to irrelevant factors of variation is acquired by trying to map similar virtual experiences together via a contrastive learning process. These principles are compatible with known cortical structure and dynamics and the phenomenology of sleep thus providing promising directions to explain cortical learning beyond the classical predictive learning paradigm.
Hard Adversarial Example Mining for Improving Robust Fairness
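methods: The study proposes a simple yet effective framework, adaptive Hard Adversarial example Mining (HAM), which adaptively mines hard AEs to improve the effectiveness of AT.
results: Experiments on three benchmarks, CIFAR-10, SVHN, and Imagenette, show that HAM significantly improves robust fairness while reducing computational cost.
Abstract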
Adversarial training (AT) is widely considered the state-of-the-art technique for improving the robustness of deep neural networks (DNNs) against adversarial examples (AE). Nevertheless, recent studies have revealed that adversarially trained models are prone to unfairness problems, restricting their applicability. In this paper, we empirically observe that this limitation may be attributed to serious adversarial confidence overfitting, i.e., certain adversarial examples with overconfidence. To alleviate this problem, we propose HAM, a straightforward yet effective framework via adaptive Hard Adversarial example Mining. HAM concentrates on mining hard adversarial examples while discarding the easy ones in an adaptive fashion. Specifically, HAM identifies hard AEs in terms of their step sizes needed to cross the decision boundary when calculating loss value. Besides, an early-dropping mechanism is incorporated to discard the easy examples at the initial stages of AE generation, resulting in efficient AT. Extensive experimental results on CIFAR-10, SVHN, and Imagenette demonstrate that HAM achieves significant improvement in robust fairness while reducing computational cost compared to several state-of-the-art adversarial training methods. The code will be made publicly available.
Summary
Adversarial training (AT) improves the robustness of deep neural networks (DNNs), yet in practice it suffers from unfairness problems. Recent studies have shown that adversarially trained models are prone to unfairness, which limits their applicability. In this paper, we empirically observe that this limitation may be attributed to serious adversarial confidence overfitting, i.e., certain adversarial examples with overconfidence. To alleviate this problem, we propose HAM, a straightforward yet effective framework based on adaptive Hard Adversarial example Mining. HAM concentrates on mining hard adversarial examples while discarding the easy ones in an adaptive fashion. Specifically, HAM identifies hard AEs by the step sizes needed to cross the decision boundary when calculating the loss value. In addition, an early-dropping mechanism discards easy examples at the initial stages of AE generation, resulting in efficient AT. Extensive experiments on CIFAR-10, SVHN, and Imagenette demonstrate that HAM significantly improves robust fairness while reducing computational cost compared with several state-of-the-art adversarial training methods. The code will be made publicly available.
Deep Neural Networks Fused with Textures for Image Classification
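results: On eight datasets (human faces, skin lesions, food dishes, marine life, etc.) with four standard backbone CNNs, the method achieves higher classification accuracy than existing methods.
Abstract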
Fine-grained image classification (FGIC) is a challenging task in computer vision due to small visual differences between subcategories but large intra-class variations. Deep learning methods have achieved remarkable success in solving FGIC. In this paper, we propose a fusion approach to address FGIC by combining global texture with local patch-based information. The first pipeline extracts deep features from various fixed-size non-overlapping patches and encodes features by sequential modelling using the long short-term memory (LSTM). Another path computes image-level textures at multiple scales using the local binary patterns (LBP). The advantages of both streams are integrated to represent an efficient feature vector for image classification. The method is tested on eight datasets representing human faces, skin lesions, food dishes, marine life, etc., using four standard backbone CNNs. Our method has attained better classification accuracy over existing methods with notable margins.
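Summary
Fine-grained image classification (FGIC) is a challenging computer vision task because visual differences between subcategories are small while intra-class variations are large. Deep learning methods have been very successful on FGIC. In this paper, we propose a fusion approach to FGIC that combines global texture with local patch-based information. The first pipeline extracts deep features from fixed-size non-overlapping patches and encodes them by sequential modelling with a long short-term memory (LSTM). The other pipeline computes image-level textures at multiple scales using local binary patterns (LBP). The strengths of both streams are integrated into an efficient feature vector for image classification. The method is tested on eight datasets covering human faces, skin lesions, food dishes, marine life, etc., using four standard backbone CNNs, and attains better classification accuracy than existing methods by notable margins.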
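To make the texture stream more concrete, here is a minimal sketch of the basic local binary pattern operator: each interior pixel is compared with its eight neighbours, and the comparison results are packed into an 8-bit code whose histogram describes the image texture. This is the plain single-scale, radius-1 variant; the multi-scale scheme used in the paper is not specified in the abstract, and the bit ordering below is one convention among several.

```python
# Neighbour offsets (row, col), clockwise from the top-left pixel.
# The ordering fixes which neighbour maps to which bit; it is a convention.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """8-bit LBP code of an interior pixel in a 2D list of intensities."""
    center = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOUR_OFFSETS):
        # A neighbour at least as bright as the centre sets its bit.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    histogram = [0] * 256
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            histogram[lbp_code(image, row, col)] += 1
    return histogram
```

The resulting histogram is the kind of image-level texture descriptor that could be concatenated with deep features; border pixels are skipped here for simplicity.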
Job Shop Scheduling via Deep Reinforcement Learning: a Sequence to Sequence approach
paper_authors: Giovanni Bonetta, Davide Zago, Rossella Cancelliere, Andrea Grosso
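for: The paper proposes a deep reinforcement learning approach to job scheduling that automatically learns dispatching rules.
methods: The method is inspired by natural language encoder-decoder models for sequence processing, which, to the authors' knowledge, had not previously been used for scheduling.
results: Experiments show that the method outperforms many classical priority dispatching rules and is competitive with state-of-the-art deep reinforcement learning methods.
Abstract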
Job scheduling is a well-known Combinatorial Optimization problem with endless applications. Well planned schedules bring many benefits in the context of automated systems: among others, they limit production costs and waste. Nevertheless, the NP-hardness of this problem makes it essential to use heuristics whose design is difficult, requires specialized knowledge and often produces methods tailored to the specific task. This paper presents an original end-to-end Deep Reinforcement Learning approach to scheduling that automatically learns dispatching rules. Our technique is inspired by natural language encoder-decoder models for sequence processing and has never been used, to the best of our knowledge, for scheduling purposes. We applied and tested our method in particular to some benchmark instances of Job Shop Problem, but this technique is general enough to be potentially used to tackle other different optimal job scheduling tasks with minimal intervention. Results demonstrate that we outperform many classical approaches exploiting priority dispatching rules and show competitive results on state-of-the-art Deep Reinforcement Learning ones.
Summary
This paper presents a novel end-to-end deep reinforcement learning approach to job scheduling that automatically learns dispatching rules. Our technique is inspired by natural language encoder-decoder models for sequence processing and has never been used for scheduling purposes, to the best of our knowledge. We applied and tested our method on some benchmark instances of the job shop problem, but it is general enough to be potentially used to tackle other optimal job scheduling tasks with minimal intervention. Our results demonstrate that we outperform many classical approaches that use priority dispatching rules and show competitive results with state-of-the-art deep reinforcement learning methods.
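As context for the classical baselines mentioned above, the following sketch shows what a priority dispatching rule looks like on a tiny job shop instance. It uses the shortest-processing-time (SPT) rule, chosen here only as a familiar example; it is not the paper's learned policy, and the instance and function name are made up for illustration.

```python
def spt_schedule(jobs):
    """Build a schedule with the shortest-processing-time dispatching rule.

    `jobs` is a list of jobs; each job is an ordered list of
    (machine, duration) operations. At every step the rule picks, among
    the next unscheduled operation of each job, the one with the smallest
    duration (ties broken by job index) and starts it as early as its job
    and its machine allow.
    Returns (schedule, makespan) with schedule entries
    (job, machine, start, end).
    """
    next_op = [0] * len(jobs)      # index of each job's next operation
    job_ready = [0] * len(jobs)    # time at which each job is free again
    machine_ready = {}             # time at which each machine is free again
    schedule = []
    remaining = sum(len(job) for job in jobs)
    while remaining:
        candidates = [j for j in range(len(jobs)) if next_op[j] < len(jobs[j])]
        chosen = min(candidates, key=lambda j: (jobs[j][next_op[j]][1], j))
        machine, duration = jobs[chosen][next_op[chosen]]
        start = max(job_ready[chosen], machine_ready.get(machine, 0))
        end = start + duration
        schedule.append((chosen, machine, start, end))
        job_ready[chosen] = end
        machine_ready[machine] = end
        next_op[chosen] += 1
        remaining -= 1
    makespan = max((end for _, _, _, end in schedule), default=0)
    return schedule, makespan


# Toy instance: 2 jobs, 2 machines.
toy_jobs = [[(0, 3), (1, 2)], [(1, 2), (0, 4)]]
toy_schedule, toy_makespan = spt_schedule(toy_jobs)  # makespan is 7
```

A learned policy replaces the fixed `min(...)` choice with a model that scores candidate operations; rules like SPT are cheap but ignore the downstream effect of each choice, which is where learned dispatching can gain.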
Guided Distillation for Semi-Supervised Instance Segmentation
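results: The approach improves teacher-student distillation models over the previous state of the art, raising mask-AP on Cityscapes from 23.7 to 33.9 (labels for 10\% of images) and on COCO from 18.3 to 34.1 (labels for 1\% of the training data).
Abstract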
Although instance segmentation methods have improved considerably, the dominant paradigm is to rely on fully-annotated training images, which are tedious to obtain. To alleviate this reliance, and boost results, semi-supervised approaches leverage unlabeled data as an additional training signal that limits overfitting to the labeled samples. In this context, we present novel design choices to significantly improve teacher-student distillation models. In particular, we (i) improve the distillation approach by introducing a novel "guided burn-in" stage, and (ii) evaluate different instance segmentation architectures, as well as backbone networks and pre-training strategies. Contrary to previous work which uses only supervised data for the burn-in period of the student model, we also use guidance of the teacher model to exploit unlabeled data in the burn-in period. Our improved distillation approach leads to substantial improvements over previous state-of-the-art results. For example, on the Cityscapes dataset we improve mask-AP from 23.7 to 33.9 when using labels for 10\% of images, and on the COCO dataset we improve mask-AP from 18.3 to 34.1 when using labels for only 1\% of the training data.
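Summary
Although instance segmentation methods have improved considerably, the dominant paradigm still relies on fully annotated training images, which are tedious to obtain. To reduce this reliance and boost results, semi-supervised approaches use unlabeled data as an additional training signal that limits overfitting to the labeled samples. In this context, we present new design choices that significantly improve teacher-student distillation models. In particular, we (i) improve the distillation approach by introducing a novel "guided burn-in" stage, and (ii) evaluate different instance segmentation architectures, backbone networks, and pre-training strategies. Unlike previous work, which uses only supervised data during the burn-in period of the student model, we also use the guidance of the teacher model to exploit unlabeled data during burn-in. Our improved distillation approach leads to substantial gains over previous state-of-the-art results. For example, on Cityscapes we improve mask-AP from 23.7 to 33.9 when using labels for 10\% of images, and on COCO we improve mask-AP from 18.3 to 34.1 when using labels for only 1\% of the training data.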
MAP: A Model-agnostic Pretraining Framework for Click-through Rate Prediction
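for: The paper addresses click-through rate (CTR) prediction for personalized online services, where existing neural models fail to fully exploit the large volume of user click logs.
methods: The paper adopts the self-supervised learning paradigm and derives two practical algorithms, masked feature prediction (MFP) and replaced feature detection (RFD), to exploit large amounts of user click logs and improve CTR prediction.
results: Experiments on two real-world large-scale datasets (i.e., Avazu, Criteo) show that MFP and RFD achieve new state-of-the-art performance on several strong backbones.
Abstract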
With the widespread application of personalized online services, click-through rate (CTR) prediction has received more and more attention and research. The most prominent features of CTR prediction are its multi-field categorical data format, and vast and daily-growing data volume. The large capacity of neural models helps digest such massive amounts of data under the supervised learning paradigm, yet they fail to utilize the substantial data to its full potential, since the 1-bit click signal is not sufficient to guide the model to learn capable representations of features and instances. The self-supervised learning paradigm provides a more promising pretrain-finetune solution to better exploit the large amount of user click logs, and learn more generalized and effective representations. However, self-supervised learning for CTR prediction is still an open question, since current works on this line are only preliminary and rudimentary. To this end, we propose a Model-agnostic pretraining (MAP) framework that applies feature corruption and recovery on multi-field categorical data, and more specifically, we derive two practical algorithms: masked feature prediction (MFP) and replaced feature detection (RFD). MFP digs into feature interactions within each instance through masking and predicting a small portion of input features, and introduces noise contrastive estimation (NCE) to handle large feature spaces. RFD further turns MFP into a binary classification mode through replacing and detecting changes in input features, making it even simpler and more effective for CTR pretraining. Our extensive experiments on two real-world large-scale datasets (i.e., Avazu, Criteo) demonstrate the advantages of these two methods on several strong backbones (e.g., DCNv2, DeepFM), and achieve new state-of-the-art performance in terms of both effectiveness and efficiency for CTR prediction.
Summary
To better exploit large user click logs for CTR prediction, we propose a model-agnostic pretraining (MAP) framework that applies feature corruption and recovery on multi-field categorical data. Specifically, we derive two practical algorithms: masked feature prediction (MFP) and replaced feature detection (RFD). MFP digs into feature interactions within each instance through masking and predicting a small portion of input features, and introduces noise contrastive estimation (NCE) to handle large feature spaces. RFD further turns MFP into a binary classification mode through replacing and detecting changes in input features, making it simpler and more effective for CTR pretraining. Our extensive experiments on two real-world large-scale datasets (i.e., Avazu, Criteo) demonstrate the advantages of these two methods on several strong backbones (e.g., DCNv2, DeepFM), and achieve new state-of-the-art performance in terms of both effectiveness and efficiency for CTR prediction.
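The two corruption schemes described above can be illustrated on a single multi-field categorical instance. The sketch below covers only the data-corruption step, under assumed details: uniform random choice of fields, a `[MASK]` placeholder, and replacement values drawn uniformly from each field's vocabulary. The paper's actual sampling strategy, the NCE loss, and the model are not reproduced, and all names here are hypothetical.

```python
import random

MASK = "[MASK]"  # placeholder token, assumed for this sketch


def mask_features(instance, mask_ratio, rng):
    """MFP-style corruption: hide some fields and keep their values as targets."""
    fields = list(instance)
    num_masked = max(1, round(mask_ratio * len(fields)))
    corrupted = dict(instance)
    targets = {}
    for field in rng.sample(fields, num_masked):
        targets[field] = instance[field]   # value the model must predict
        corrupted[field] = MASK
    return corrupted, targets


def replace_features(instance, vocab, replace_ratio, rng):
    """RFD-style corruption: swap some field values and label each field 0/1."""
    fields = list(instance)
    num_replaced = max(1, round(replace_ratio * len(fields)))
    chosen = set(rng.sample(fields, num_replaced))
    corrupted = dict(instance)
    labels = {}
    for field in fields:
        alternatives = [v for v in vocab[field] if v != instance[field]]
        if field in chosen and alternatives:
            corrupted[field] = rng.choice(alternatives)
            labels[field] = 1              # replaced: detector should flag it
        else:
            labels[field] = 0              # untouched original value
    return corrupted, labels
```

In the first case the pretraining task is a multi-class prediction over each masked field's vocabulary; in the second it is a per-field binary decision, which is why the abstract describes RFD as the simpler formulation.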
Towards Self-organizing Personal Knowledge Assistants in Evolving Corporate Memories
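results: Past experiments and results cover topics such as knowledge graph construction, Managed Forgetting, and Context Spaces. The paper also gives an overview of related work and some recent, not yet published findings. Finally, it describes CoMem, a Corporate Memory based on the presented research that is already in productive use, and the challenges it raises for further research.
Abstract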
This paper presents a retrospective overview of a decade of research in our department towards self-organizing personal knowledge assistants in evolving corporate memories. Our research is typically inspired by real-world problems and often conducted in interdisciplinary collaborations with research and industry partners. We summarize past experiments and results comprising topics like various ways of knowledge graph construction in corporate and personal settings, Managed Forgetting and (Self-organizing) Context Spaces as a novel approach to Personal Information Management (PIM) and knowledge work support. Past results are complemented by an overview of related work and some of our latest findings not published so far. Last, we give an overview of our related industry use cases including a detailed look into CoMem, a Corporate Memory based on our presented research already in productive use and providing challenges for further research. Many contributions are only first steps in new directions with still a lot of untapped potential, especially with regard to further increasing the automation in PIM and knowledge work support.
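Summary
This paper reviews a decade of research in our department on self-organizing personal knowledge assistants in evolving corporate memories. Our research is usually inspired by real-world problems and carried out in collaboration with research and industry partners. We summarize past experiments and results, including various ways of constructing knowledge graphs in corporate and personal settings, and (Self-organizing) Context Spaces and Managed Forgetting as a new approach to Personal Information Management (PIM) and knowledge work support. Past results are complemented by an overview of related work and some recent findings that have not yet been published. Finally, we present related industry use cases, including a detailed look at CoMem, a corporate memory based on the presented research that is already in productive use and poses challenges for further research. Many contributions are only first steps in new directions, especially with regard to further increasing automation in PIM and knowledge work support.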
Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings
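results: GPT-4 shows high interrater reliability across different time spans (ICC scores between 0.94 and 0.99), indicating that the model produces consistent ratings under repeated prompting. Content and style ratings show a high correlation of 0.87. When a non-adequate style is applied, content ratings remain relatively constant while style ratings decrease, which indicates that the LLM effectively distinguishes between content and style during evaluation.
Abstract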
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4, a state-of-the-art artificial intelligence language model, across multiple iterations, time spans and stylistic variations. The model rated responses to tasks within the Higher Education (HE) subject domain of macroeconomics in terms of their content and style. Statistical analysis was conducted in order to learn more about the interrater reliability, consistency of the ratings across iterations and the correlation between ratings in terms of content and style. The results revealed a high interrater reliability with ICC scores ranging between 0.94 and 0.99 for different timespans, suggesting that GPT-4 is capable of generating consistent ratings across repetitions with a clear prompt. Style and content ratings show a high correlation of 0.87. When applying a non-adequate style the average content ratings remained constant, while style ratings decreased, which indicates that the large language model (LLM) effectively distinguishes between these two criteria during evaluation. The prompt used in this study is furthermore presented and explained. Further research is necessary to assess the robustness and reliability of AI models in various use cases.
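To make the two statistics in this abstract concrete, here is a minimal sketch of an intraclass correlation and a Pearson correlation in plain Python. The abstract does not say which ICC variant was used, so the one-way random-effects form ICC(1,1) below is an assumption, and the function names are hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    `ratings` is a list of rated items, each a list of k repeated ratings
    (for example, the same answer rated in k separate model runs).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    item_means = [sum(r) / k for r in ratings]
    # Mean square between items and mean square within items.
    ms_between = k * sum((m - grand_mean) ** 2 for m in item_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, item_means) for x in r
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def pearson(x, y):
    """Pearson correlation, e.g. between content and style ratings."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)
```

An ICC close to 1 means repeated ratings of the same item barely differ compared with the spread between items, which is the sense in which the reported 0.94 to 0.99 indicates consistent ratings.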
Local Large Language Models for Complex Structured Medical Tasks
paper_authors: V. K. Cody Bumgardner, Aaron Mullen, Sam Armstrong, Caylin Hickey, Jeff Talbert
for: This paper aims to tackle complex, domain-specific tasks by combining the language reasoning capabilities of large language models (LLMs) with the benefits of local training.
methods: The proposed approach utilizes local LLMs, which can be fine-tuned to respond to specific generative instructions and provide structured outputs. The authors used a dataset of over 150k uncurated surgical pathology reports to train and evaluate different model architectures, including LLaMA, BERT, and LongFormer.
results: The results show that the LLaMA-based models significantly outperform BERT-style models across all evaluated metrics, especially with large datasets. The LLaMA models demonstrated their ability to handle complex, multi-label tasks, making them a promising approach for utilizing LLMs to perform domain-specific tasks using accessible hardware.
Abstract
This paper introduces an approach that combines the language reasoning capabilities of large language models (LLMs) with the benefits of local training to tackle complex, domain-specific tasks. Specifically, the authors demonstrate their approach by extracting structured condition codes from pathology reports. The proposed approach utilizes local LLMs, which can be fine-tuned to respond to specific generative instructions and provide structured outputs. The authors collected a dataset of over 150k uncurated surgical pathology reports, containing gross descriptions, final diagnoses, and condition codes. They trained different model architectures, including LLaMA, BERT and LongFormer and evaluated their performance. The results show that the LLaMA-based models significantly outperform BERT-style models across all evaluated metrics, even with extremely reduced precision. The LLaMA models performed especially well with large datasets, demonstrating their ability to handle complex, multi-label tasks. Overall, this work presents an effective approach for utilizing LLMs to perform domain-specific tasks using accessible hardware, with potential applications in the medical domain, where complex data extraction and classification are required.
Bees Local Phase Quantization Feature Selection for RGB-D Facial Expressions Recognition
results: The proposed Bees LPQ method reaches 99% accuracy on the facial expression recognition task and performs very well compared with the other methods.
Abstract
Feature selection can be defined as an optimization problem and solved by bio-inspired algorithms. The Bees Algorithm (BA) shows decent performance in feature selection optimization tasks. On the other hand, Local Phase Quantization (LPQ) is a frequency-domain feature which has excellent performance on depth images. Here, after extracting LPQ features from RGB (colour) and depth images from the Iranian Kinect Face Database (IKFDB), the Bees feature selection algorithm is applied to select the desired number of features for the final classification tasks. IKFDB is recorded with the Kinect sensor V.2 and contains colour and depth images for facial and facial micro-expression recognition. Five facial expressions (Anger, Joy, Surprise, Disgust and Fear) are used for final validation. The proposed Bees LPQ method is compared with Particle Swarm Optimization (PSO) LPQ, PCA LPQ, Lasso LPQ, and plain LPQ features for classification tasks with Support Vector Machines (SVM), K-Nearest Neighbours (KNN), a Shallow Neural Network and Ensemble Subspace KNN. The results show a decent performance of the proposed algorithm (99% accuracy) in comparison with the others.
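Summary
Feature selection can be defined as an optimization problem and solved with bio-inspired algorithms. The Bees Algorithm (BA) performs well in feature selection optimization tasks. Local Phase Quantization (LPQ), on the other hand, is a frequency-domain feature that performs very well on depth images. Here, LPQ features are extracted from RGB (colour) and depth images, and the Bees feature selection algorithm then selects the desired number of features. IKFDB (the Iranian Kinect Face Database) was recorded with the Kinect sensor V.2 and contains colour and depth images for facial and facial micro-expression recognition. Five expressions are used for final validation: anger, joy, surprise, disgust and fear. The proposed Bees LPQ method is compared with Particle Swarm Optimization (PSO) LPQ, PCA LPQ, Lasso LPQ and plain LPQ features on classification tasks with Support Vector Machines (SVM), K-Nearest Neighbours (KNN), a shallow neural network and Ensemble Subspace KNN. The results show that the proposed algorithm (99% accuracy) performs well in comparison.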
LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment
paper_authors: Zhiwei Zhang, Zhizhong Zhang, Qian Yu, Ran Yi, Yuan Xie, Lizhuang Ma
for: This paper is written for the task of 3D panoptic segmentation, which aims to simultaneously perform semantic segmentation and instance segmentation in a scene using both LiDAR and camera data.
methods: The proposed method, called LCPS, uses a three-stage fusion approach that includes an asynchronous compensation pixel alignment module, a semantic-aware region alignment module, and a point-to-voxel feature propagation module to fuse LiDAR and camera data.
results: The proposed method achieves an improvement of about 6.9% in PQ performance over the LiDAR-only baseline on the NuScenes dataset, demonstrating the effectiveness of the proposed fusion strategy.
Abstract
3D panoptic segmentation is a challenging perception task that requires both semantic segmentation and instance segmentation. In this task, we notice that images could provide rich texture, color, and discriminative information, which can complement LiDAR data for evident performance improvement, but their fusion remains a challenging problem. To this end, we propose LCPS, the first LiDAR-Camera Panoptic Segmentation network. In our approach, we conduct LiDAR-Camera fusion in three stages: 1) an Asynchronous Compensation Pixel Alignment (ACPA) module that calibrates the coordinate misalignment caused by asynchronous problems between sensors; 2) a Semantic-Aware Region Alignment (SARA) module that extends the one-to-one point-pixel mapping to one-to-many semantic relations; 3) a Point-to-Voxel feature Propagation (PVP) module that integrates both geometric and semantic fusion information for the entire point cloud. Our fusion strategy improves about 6.9% PQ performance over the LiDAR-only baseline on NuScenes dataset. Extensive quantitative and qualitative experiments further demonstrate the effectiveness of our novel framework. The code will be released at https://github.com/zhangzw12319/lcps.git.
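Summary
3D panoptic segmentation is a challenging perception task that requires both semantic segmentation and instance segmentation. In this task, images can provide rich texture, colour and discriminative information that complements LiDAR data and improves performance, but fusing the two remains a challenge. To this end, we propose LCPS, the first LiDAR-Camera Panoptic Segmentation network. Our method has three stages: 1) an Asynchronous Compensation Pixel Alignment (ACPA) module that corrects the coordinate misalignment between the camera and the LiDAR sensor; 2) a Semantic-Aware Region Alignment (SARA) module that extends the one-to-one point-pixel mapping to one-to-many semantic relations; 3) a Point-to-Voxel feature Propagation (PVP) module that propagates geometric and semantic fusion information to the entire point cloud. Our fusion strategy improves PQ performance by about 6.9% over the LiDAR-only baseline on the NuScenes dataset. Extensive quantitative and qualitative experiments further demonstrate the effectiveness of the new framework. The code will be released at https://github.com/zhangzw12319/lcps.git.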
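The PQ figure quoted for LCPS refers to panoptic quality. As a reading aid, the sketch below computes PQ for a single class using the standard definition (sum of IoUs over matched segments, divided by TP + 0.5 FP + 0.5 FN, with a match requiring IoU above 0.5). It is not taken from the LCPS code base: segments are modelled simply as sets of point indices, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two non-empty sets of point indices."""
    return len(a & b) / len(a | b)

def panoptic_quality(pred_segments, gt_segments):
    """Panoptic quality for one class.

    A predicted segment matches a ground-truth segment when their IoU
    exceeds 0.5; with that threshold each segment has at most one match.
    """
    matched_gt = set()
    iou_sum = 0.0
    true_pos = 0
    for pred in pred_segments:
        for gt_index, gt in enumerate(gt_segments):
            if gt_index in matched_gt:
                continue
            score = iou(pred, gt)
            if score > 0.5:
                matched_gt.add(gt_index)
                iou_sum += score
                true_pos += 1
                break
    false_pos = len(pred_segments) - true_pos
    false_neg = len(gt_segments) - true_pos
    denom = true_pos + 0.5 * false_pos + 0.5 * false_neg
    return iou_sum / denom if denom else 0.0
```

Because PQ multiplies segmentation quality (average IoU of matches) by recognition quality (an F1-style ratio), a gain can come from sharper masks, from fewer missed or spurious instances, or from both.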
Evaluating Link Prediction Explanations for Graph Neural Networks
paper_authors: Claudio Borile, Alan Perotti, André Panisson
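for: This paper provides metrics for evaluating explanations of link prediction models, in order to foster the adoption of such models.
methods: The paper evaluates state-of-the-art explainability methods using these metrics.
results: The study finds that the choice of distance between node embeddings can influence the quality of link prediction explanations.
Abstract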
Graph Machine Learning (GML) has numerous applications, such as node/graph classification and link prediction, in real-world domains. Providing human-understandable explanations for GML models is a challenging yet fundamental task to foster their adoption, but validating explanations for link prediction models has received little attention. In this paper, we provide quantitative metrics to assess the quality of link prediction explanations, with or without ground-truth. State-of-the-art explainability methods for Graph Neural Networks are evaluated using these metrics. We discuss how underlying assumptions and technical details specific to the link prediction task, such as the choice of distance between node embeddings, can influence the quality of the explanations.
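Summary
Graph Machine Learning (GML) has many applications in real-world domains, such as node/graph classification and link prediction. Providing human-understandable explanations for GML models is a challenging task that is needed to foster their adoption, but validating explanations for link prediction models has received little attention. In this paper, we provide quantitative metrics for assessing explanations, with or without ground truth. State-of-the-art explainability methods for graph neural networks are evaluated using these metrics. We discuss how underlying assumptions and technical details of the link prediction task, such as the choice of distance between node embeddings, influence the quality of the explanations.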
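The remark about the choice of distance between node embeddings can be illustrated with a small sketch. It assumes a link predictor that scores a node pair by passing either the dot product or the cosine similarity of their embeddings through a sigmoid. The function names and embedding values are made up for the example and do not come from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def link_score(u, v, distance="dot"):
    """Probability-like score for an edge between two embedded nodes."""
    raw = dot(u, v) if distance == "dot" else cosine(u, v)
    return 1.0 / (1.0 + math.exp(-raw))

# Hypothetical embeddings: b has a large norm, c is almost parallel to a.
a = [1.0, 0.0]
b = [10.0, 10.0]
c = [1.0, 0.1]

# The dot product favours the pair (a, b) because of b's norm,
# while cosine similarity favours (a, c) because of its direction.
dot_prefers_b = link_score(a, b, "dot") > link_score(a, c, "dot")
cosine_prefers_c = link_score(a, c, "cosine") > link_score(a, b, "cosine")
```

Since the two decoders respond to different properties of the embeddings (norm versus direction), an attribution method applied on top of them can plausibly highlight different parts of the graph, which is the kind of effect the abstract says it discusses.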
NBIAS: A Natural Language Processing Framework for Bias Identification in Text
results: The results show that the proposed method achieves the goal of bias detection and removal, and improves accuracy by 1% to 8% over baseline methods.
Abstract
Bias in textual data can lead to skewed interpretations and outcomes when the data is used. These biases could perpetuate stereotypes, discrimination, or other forms of unfair treatment. An algorithm trained on biased data ends up making decisions that disproportionately impact a certain group of people. Therefore, it is crucial to detect and remove these biases to ensure the fair and ethical use of data. To this end, we develop a comprehensive and robust framework, NBIAS, that consists of a data layer, corpus construction, a model development layer and an evaluation layer. The dataset is constructed by collecting diverse data from various fields, including social media, healthcare, and job hiring portals. We then apply a transformer-based token classification model that is able to identify biased words and phrases through a unique named entity. In the assessment procedure, we incorporate a blend of quantitative and qualitative evaluations to gauge the effectiveness of our models. We achieve accuracy improvements ranging from 1% to 8% compared to baselines. We are also able to generate a robust understanding of the model functioning, capturing not only numerical data but also the quality and intricacies of its performance. The proposed approach is applicable to a variety of biases and contributes to the fair and ethical use of textual data.
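Summary
Bias in textual data can skew interpretations and outcomes, which can lead to unfair or unjust recommendations or reports. An algorithm trained on biased data makes skewed decisions about certain groups of people and thus produces unfair results. Detecting and removing bias is therefore necessary to ensure the fair and ethical use of data. To this end, we develop a comprehensive and robust framework, NBIAS, which consists of a data layer, a corpus construction layer, a model development layer and an evaluation layer. We collect diverse data from various fields, including social media, healthcare and job hiring portals, and apply a transformer-based token classification model to identify biased words and phrases. In the evaluation, we combine quantitative and qualitative assessments to gauge the effectiveness of the models. We achieve accuracy improvements ranging from 1% to 8% over baselines and gain a detailed understanding of how the model performs. The approach is applicable to a variety of biases and contributes to the fair and ethical use of textual data.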
Learning Implicit Entity-object Relations by Bidirectional Generative Alignment for Multimodal NER
results: Extensive experiments on two benchmarks show that the method achieves state-of-the-art performance without image input during inference.
Abstract
The challenge posed by multimodal named entity recognition (MNER) is mainly two-fold: (1) bridging the semantic gap between text and image and (2) matching the entity with its associated object in image. Existing methods fail to capture the implicit entity-object relations, due to the lack of corresponding annotation. In this paper, we propose a bidirectional generative alignment method named BGA-MNER to tackle these issues. Our BGA-MNER consists of image2text and text2image generation with respect to entity-salient content in two modalities. It jointly optimizes the bidirectional reconstruction objectives, leading to aligning the implicit entity-object relations under such direct and powerful constraints. Furthermore, image-text pairs usually contain unmatched components which are noisy for generation. A stage-refined context sampler is proposed to extract the matched cross-modal content for generation. Extensive experiments on two benchmarks demonstrate that our method achieves state-of-the-art performance without image input during inference.
MARLIM: Multi-Agent Reinforcement Learning for Inventory Management
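results: Experiments show that the reinforcement learning approach to supply chain management outperforms traditional baselines and makes better product replenishment decisions.
Abstract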
Maintaining a balance between the supply and demand of products by optimizing replenishment decisions is one of the most important challenges in the supply chain industry. This paper presents a novel reinforcement learning framework called MARLIM, to address the inventory management problem for a single-echelon multi-products supply chain with stochastic demands and lead-times. Within this context, controllers are developed through single or multiple agents in a cooperative setting. Numerical experiments on real data demonstrate the benefits of reinforcement learning methods over traditional baselines.
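Summary
Maintaining the balance between product supply and demand is one of the most important challenges in the supply chain industry. This paper proposes a new reinforcement learning framework, named MARLIM, to address the single-echelon inventory management problem. In this setting, controllers are developed through single or multiple agents in a cooperative environment. Experiments on real data show that reinforcement learning methods are more beneficial than traditional baselines.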
Improving Wind Resistance Performance of Cascaded PID Controlled Quadcopters using Residual Reinforcement Learning
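results: In various experiments, including one in an outdoor scene with wind speed greater than 13 m/s, the controller reduces the quadcopter's position deviation by approximately 50%. The trained controller is also robust and preserves its performance when the quadcopter's mass and the propellers' lift coefficient change.
Abstract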
Wind resistance control is an essential feature for quadcopters to maintain their position, avoid deviation from the target position and prevent collisions with obstacles. Conventionally, a cascaded PID controller is used to control quadcopters because of its simplicity and the ease of tuning its parameters. However, it is weak against wind disturbances, and the quadcopter can easily deviate from the target position. In this work, we propose a residual reinforcement learning based approach to build a wind resistance controller for a quadcopter. By learning only the residual that compensates for the disturbance, we can continue using the cascaded PID controller as the base controller of the quadcopter but improve its performance against wind disturbances. To avoid unexpected crashes and destruction of quadcopters, our method does not require real hardware for data collection and training. The controller is trained only in a simulator and directly applied to the target hardware without an extra fine-tuning process. We demonstrate the effectiveness of our approach through various experiments, including an experiment in an outdoor scene with wind speed greater than 13 m/s. Despite its simplicity, our controller reduces the position deviation by approximately 50% compared to the quadcopter controlled with the conventional cascaded PID controller. Furthermore, the trained controller is robust and preserves its performance even when the quadcopter's mass and the propellers' lift coefficient are changed to between 50% and 150% of their values at training time.
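Summary
Wind resistance control is an essential feature for a quadcopter to hold its target position, avoiding deviation from that position and collisions with obstacles. Conventionally, a cascaded PID controller is the most common choice for quadcopter control because it is simple and its parameters are easy to tune. However, it is weak against wind disturbances, and the quadcopter can easily deviate from the target position. In this work, we propose a wind resistance control method based on residual reinforcement learning. By learning only the residual that compensates for the disturbance, we can keep the cascaded PID controller as the quadcopter's base controller while improving its performance against wind disturbances. To avoid unexpected crashes and damage to quadcopters, our method does not require real hardware for data collection and training. The controller is trained in a simulator and applied directly to the target hardware without any additional fine-tuning. We demonstrate the effectiveness of the method through various experiments, including one in an outdoor scene with wind speed greater than 13 m/s. Despite its simplicity, our controller reduces the quadcopter's position deviation by approximately 50% compared with the conventional cascaded PID controller. In addition, the trained controller is robust and preserves its performance even when the quadcopter's mass and the propellers' lift coefficient are changed to between 50% and 150% of their values at training time.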
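The idea of keeping the cascaded PID controller and learning only a correction can be sketched for a single axis as follows. This is a simplified illustration, not the authors' controller: the class and function names, the gains, and the point at which the residual is added are assumptions, and `residual_policy` stands in for a trained reinforcement learning policy.

```python
class PID:
    """Textbook PID controller for one scalar signal."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no history on the first call
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def cascaded_command(pos_pid, vel_pid, target_pos, pos, vel, dt, residual_policy=None):
    """Outer position loop feeds an inner velocity loop; a learned residual is added."""
    target_vel = pos_pid.step(target_pos - pos, dt)   # outer loop
    base_command = vel_pid.step(target_vel - vel, dt)  # inner loop
    residual = residual_policy(pos, vel, target_pos) if residual_policy else 0.0
    return base_command + residual
```

With `residual_policy=None` the function reduces to the plain cascaded PID, so the base controller's behaviour is preserved and the learned part only has to model what the PID gets wrong under wind.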
Interleaving GANs with knowledge graphs to support design creativity for book covers
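results: The method generates book covers better than previous attempts, and the knowledge graph produces more options, giving the book author or editor better choices.
Abstract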
An attractive book cover is important for the success of a book. In this paper, we apply Generative Adversarial Networks (GANs) to the book covers domain, using different methods for training in order to obtain better generated images. We interleave GANs with knowledge graphs to alter the input title to obtain multiple possible options for any given title, which are then used as an augmented input to the generator. Finally, we use the discriminator obtained during the training phase to select the best images generated with new titles. Our method performed better at generating book covers than previous attempts, and the knowledge graph gives better options to the book author or editor compared to using GANs alone.
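Summary
An attractive book cover is very important for the success of a book. In this paper, we apply Generative Adversarial Networks (GANs) to the book cover domain, using different training methods to obtain better generated images. We interleave GANs with knowledge graphs to alter the input title and obtain multiple possible options, which are then used as augmented input to the generator. Finally, we use the discriminator obtained during training to select the best generated images. Our method generates book covers better than previous attempts, and the knowledge graph gives the book author or editor more options than using GANs alone.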
Multimodal Indoor Localisation in Parkinson’s Disease for Detecting Medication Use: Observational Pilot Study in a Free-Living Setting
paper_authors: Ferdian Jovan, Catherine Morgan, Ryan McConville, Emma L. Tonkin, Ian Craddock, Alan Whone
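for: The study aims to improve the effectiveness of current indoor localisation methods, using dual modalities (Received Signal Strength Indicator and accelerometer data), and to evaluate motor fluctuations in people with PD.
methods: The study uses a transformer-based approach that takes RSSI and accelerometer data as complementary views of movement.
results: The method performs indoor localisation effectively and captures the motor fluctuations of people with PD. Specifically, precise room-level localisation predictions can accurately predict whether a person with PD is taking levodopa medication.
Abstract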
Parkinson's disease (PD) is a slowly progressive, debilitating neurodegenerative disease which causes motor symptoms including gait dysfunction. Motor fluctuations are alternations between periods with a positive response to levodopa therapy ("on") and periods marked by re-emergence of PD symptoms ("off") as the response to medication wears off. These fluctuations often affect gait speed and they increase in their disabling impact as PD progresses. To improve the effectiveness of current indoor localisation methods, a transformer-based approach utilising dual modalities which provide complementary views of movement, Received Signal Strength Indicator (RSSI) and accelerometer data from wearable devices, is proposed. A sub-objective aims to evaluate whether indoor localisation, including its in-home gait speed features (i.e. the time taken to walk between rooms), could be used to evaluate motor fluctuations by detecting whether the person with PD is taking levodopa medications or withholding them. To properly evaluate our proposed method, we use a free-living dataset where the movements and mobility are greatly varied and unstructured as expected in real-world conditions. 24 participants lived in pairs (consisting of one person with PD, one control) for five days in a smart home with various sensors. Our evaluation on the resulting dataset demonstrates that our proposed network outperforms other methods for indoor localisation. The sub-objective evaluation shows that precise room-level localisation predictions, transformed into in-home gait speed features, produce accurate predictions on whether the PD participant is taking or withholding their medications.
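Summary
Parkinson's disease (PD) is a slowly progressive, debilitating neurodegenerative disease that causes motor symptoms including gait dysfunction. Motor fluctuations are alternations between periods with a positive response to levodopa therapy ("on") and periods in which PD symptoms re-emerge ("off"); these fluctuations often affect gait speed and worsen as PD progresses. To improve the effectiveness of current indoor localisation methods, a transformer-based approach is proposed that uses dual modalities providing complementary views of movement: Received Signal Strength Indicator (RSSI) and accelerometer data. A sub-objective is to evaluate whether indoor localisation, including in-home gait speed features (i.e. the time taken to walk between rooms), can be used to determine whether the PD participant is taking levodopa medication. To evaluate the proposed method properly, we use a free-living dataset in which participants' movements and mobility are highly varied and unstructured, as expected in real-world conditions. 24 participants (in pairs of one person with PD and one control) lived for five days in a smart home equipped with various sensors. Our evaluation on the resulting dataset shows that the proposed network outperforms other methods for indoor localisation. The sub-objective evaluation shows that precise room-level localisation predictions, transformed into in-home gait speed features, accurately predict whether the PD participant is taking levodopa medication.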
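One way to picture how room-level predictions could be turned into an in-home gait speed feature ("the time taken to walk between rooms") is sketched below. The abstract does not describe the actual feature extraction, so everything here is an assumption: the function names, the idea of a transit zone labelled `"hall"`, and the use of timestamps in seconds.

```python
def collapse_visits(samples):
    """Merge consecutive (timestamp, room) samples into [room, start, end] visits."""
    visits = []
    for timestamp, room in samples:
        if visits and visits[-1][0] == room:
            visits[-1][2] = timestamp  # extend the current visit
        else:
            visits.append([room, timestamp, timestamp])
    return visits

def transit_times(samples, transit_label="hall"):
    """Time from the last sample in one room to the first sample in the next,
    for every passage through the transit zone.

    Returns (from_room, to_room, seconds) tuples.
    """
    visits = collapse_visits(samples)
    transitions = []
    for prev, mid, nxt in zip(visits, visits[1:], visits[2:]):
        if mid[0] == transit_label:
            transitions.append((prev[0], nxt[0], nxt[1] - prev[2]))
    return transitions
```

Under this reading, the accuracy of the room-level predictions matters directly: a single misclassified sample splits a visit in two and changes the measured transit time, which is consistent with the abstract's emphasis on precise localisation.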
ReIDTrack: Multi-Object Track and Segmentation Without Motion
paper_authors: Kaer Huang, Bingchuan Sun, Feng Chen, Tao Zhang, Jun Xie, Jian Li, Christopher Walter Twombly, Zhepeng Wang
for: The paper explores achieving state-of-the-art (SOTA) performance in multi-object tracking and segmentation (MOTS) using only high-performance detection and appearance models, without relying on motion information and IoU mapping during association.
methods: The proposed method uses CBNetV2 as a detection model and MoCo-v2 as a self-supervised appearance model, and removes motion information and IoU mapping during the association process.
results: The method achieved 1st place on the MOTS track and 2nd place on the MOT track in the CVPR2023 WAD workshop, demonstrating its effectiveness and simplicity.
Abstract
In recent years, dominant Multi-object tracking (MOT) and segmentation (MOTS) methods mainly follow the tracking-by-detection paradigm. Transformer-based end-to-end (E2E) solutions bring some ideas to MOT and MOTS, but they cannot achieve a new state-of-the-art (SOTA) performance in major MOT and MOTS benchmarks. Detection and association are two main modules of the tracking-by-detection paradigm. Association techniques mainly depend on the combination of motion and appearance information. As deep learning has been recently developed, the performance of the detection and appearance model is rapidly improved. These trends made us consider whether we can achieve SOTA based on only high-performance detection and appearance model. Our paper mainly focuses on exploring this direction based on CBNetV2 with Swin-B as a detection model and MoCo-v2 as a self-supervised appearance model. Motion information and IoU mapping were removed during the association. Our method wins 1st place on the MOTS track and wins 2nd on the MOT track in the CVPR2023 WAD workshop. We hope our simple and effective method can give some insights to the MOT and MOTS research community. Source code will be released under this git repository
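Summary
In recent years, mainstream multi-object tracking (MOT) and multi-object tracking and segmentation (MOTS) methods have mainly followed the tracking-by-detection paradigm. Transformer-based end-to-end (E2E) solutions have brought some new ideas to MOT and MOTS, but they cannot reach new state-of-the-art (SOTA) performance on the major MOT and MOTS benchmarks. Detection and association are the two main modules of the tracking-by-detection paradigm. Association techniques mainly rely on a combination of motion and appearance information. With the development of deep learning, the performance of detection and appearance models has improved rapidly. These trends led us to ask whether SOTA can be reached with high-performance detection and appearance models alone. Our paper explores this direction, using CBNetV2 as the detection model and MoCo-v2 as the self-supervised appearance model. Motion information and IoU mapping are removed during association. Our method won 1st place on the MOTS track and 2nd place on the MOT track of the CVPR2023 WAD workshop. We hope our simple and effective method offers some insights to the MOT and MOTS research community. The source code will be released in this Git repository.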
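Association based on appearance alone, with no motion model and no IoU term, can be sketched as matching track embeddings to detection embeddings by cosine similarity. The greedy matching and the `min_sim` threshold below are assumptions made for brevity; the abstract does not state which matching algorithm or threshold the authors use, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def associate(track_embs, det_embs, min_sim=0.5):
    """Greedily match tracks to detections by descending appearance similarity.

    Returns (matches, unmatched_detections), where matches is a list of
    (track_index, detection_index) pairs.
    """
    pairs = sorted(
        (
            (cosine(t, d), ti, di)
            for ti, t in enumerate(track_embs)
            for di, d in enumerate(det_embs)
        ),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for sim, ti, di in pairs:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    unmatched = [di for di in range(len(det_embs)) if di not in used_dets]
    return matches, unmatched
```

Unmatched detections would typically start new tracks. An optimal assignment (for example the Hungarian algorithm) could replace the greedy loop without changing the interface.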
Assessing Systematic Weaknesses of DNNs using Counterfactuals
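results: For a semantic segmentation model in the autonomous driving domain, the paper finds that performance differences exist among the different pedestrian assets, but that only in some cases is the asset type itself the cause of the degraded performance.
Abstract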
With the advancement of DNNs into safety-critical applications, testing approaches for such models have gained more attention. A current direction is the search for and identification of systematic weaknesses that put safety assumptions based on average performance values at risk. Such weaknesses can take on the form of (semantically coherent) subsets or areas in the input space where a DNN performs systematically worse than its expected average. However, it is non-trivial to attribute the reason for such observed low performances to the specific semantic features that describe the subset. For instance, inhomogeneities within the data w.r.t. other (non-considered) attributes might distort results. However, taking into account all (available) attributes and their interaction is often computationally highly expensive. Inspired by counterfactual explanations, we propose an effective and computationally cheap algorithm to validate the semantic attribution of existing subsets, i.e., to check whether the identified attribute is likely to have caused the degraded performance. We demonstrate this approach on an example from the autonomous driving domain using highly annotated simulated data, where we show for a semantic segmentation model that (i) performance differences among the different pedestrian assets exist, but (ii) only in some cases is the asset type itself the reason for this reduction in the performance.
Summary
Inspired by counterfactual explanations, we propose an efficient and computationally inexpensive algorithm to validate the semantic attribution of existing subsets. We demonstrate this approach on an example from the autonomous driving domain using highly annotated simulated data. Our results show that while performance differences exist among different pedestrian assets, the asset type itself is not always the reason for the reduction in performance.
Discriminative Graph-level Anomaly Detection via Dual-students-teacher Model
results: Extensive experimental analysis shows that the method is effective on real-world graph datasets and can accurately detect anomalous graphs.
Abstract
Different from the current node-level anomaly detection task, the goal of graph-level anomaly detection is to find abnormal graphs that significantly differ from others in a graph set. Due to the scarcity of research on graph-level anomaly detection, the detailed description of graph-level anomalies is insufficient. Furthermore, existing works focus on capturing anomalous graph information to learn better graph representations, but they ignore the importance of an effective anomaly score function for evaluating abnormal graphs. Thus, in this work, we first define anomalous graph information, including node and graph property anomalies, in a graph set and adopt node-level and graph-level information differences to identify them, respectively. Then, we introduce a discriminative graph-level anomaly detection framework with a dual-students-teacher model, where the teacher model is trained with a heuristic loss to make graph representations more divergent. Then, two competing student models, trained on normal and abnormal graphs respectively, fit the graph representations of the teacher model from node-level and graph-level representation perspectives. Finally, we combine the representation errors of the two student models to discriminatively distinguish anomalous graphs. Extensive experimental analysis demonstrates that our method is effective for the graph-level anomaly detection task on real-world graph datasets.
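Summary
Unlike the current node-level anomaly detection task, the goal of graph-level anomaly detection is to find abnormal graphs that differ significantly from the other graphs in a graph set. Because research on graph-level anomaly detection is scarce, graph-level anomalies have not been described in sufficient detail. In addition, existing works capture anomalous graph information to learn better graph representations, but they ignore the importance of an effective anomaly score function. In this work, we therefore first define anomalous graph information in a graph set, including node and graph property anomalies, and use node-level and graph-level information differences to identify them. We then propose a discriminative graph-level anomaly detection framework based on a dual-students-teacher model, in which the teacher model is trained with a heuristic loss to make graph representations more divergent. Two competing student models, trained on normal and abnormal graphs respectively, then fit the teacher model's graph representations from the node-level and graph-level perspectives. Finally, we combine the representation errors of the two student models to distinguish anomalous graphs. Extensive experimental analysis on real-world graph datasets demonstrates the effectiveness of the method.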
DOLCE: A Descriptive Ontology for Linguistic and Cognitive Engineering
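results: DOLCE has remained stable over the past twenty years, has inspired most existing top-level ontologies, and has been used to develop or improve standards and public domain resources (e.g. CIDOC CRM, DBpedia and WordNet).
Abstract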
DOLCE, the first top-level (foundational) ontology to be axiomatized, has remained stable for twenty years and today is broadly used in a variety of domains. DOLCE is inspired by cognitive and linguistic considerations and aims to model a commonsense view of reality, like the one human beings exploit in everyday life in areas as diverse as socio-technical systems, manufacturing, financial transactions and cultural heritage. DOLCE clearly lists the ontological choices it is based upon, relies on philosophical principles, is richly formalized, and is built according to well-established ontological methodologies, e.g. OntoClean. Because of these features, it has inspired most of the existing top-level ontologies and has been used to develop or improve standards and public domain resources (e.g. CIDOC CRM, DBpedia and WordNet). Being a foundational ontology, DOLCE is not directly concerned with domain knowledge. Its purpose is to provide the general categories and relations needed to give a coherent view of reality, to integrate domain knowledge, and to mediate across domains. In these 20 years DOLCE has shown that applied ontologies can be stable and that interoperability across reference and domain ontologies is a reality. This paper briefly introduces the ontology and shows how to use it on a few modeling cases.
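Summary
DOLCE, the first top-level (foundational) ontology to be axiomatized, has remained stable for twenty years and is widely used today in many domains. DOLCE is inspired by cognitive and linguistic considerations and aims to model the commonsense view of reality that people use in everyday life, in areas such as socio-technical systems, manufacturing, financial transactions and cultural heritage. DOLCE explicitly lists the ontological choices it is based on, relies on philosophical principles, is richly formalized, and is built according to well-established ontological methodologies (such as OntoClean). Because of these features, it has influenced most existing top-level ontologies and has been used to develop or improve standards and public domain resources (such as CIDOC CRM, DBpedia and WordNet). As a foundational ontology, DOLCE is not directly concerned with domain knowledge. Its purpose is to provide a coherent view of reality, to integrate domain knowledge, and to mediate across domains. Over the past twenty years, DOLCE has shown that applied ontologies can be stable and that interoperability across reference and domain ontologies is a reality. This paper briefly introduces the ontology and illustrates it with a few modeling cases.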
Holy Grail 2.0: From Natural Language to Constraint Models
results: Preliminary results suggest that this approach can help extract high-quality models and can reduce the expertise and skills that CP users need.
Abstract
Twenty-seven years ago, E. Freuder highlighted that "Constraint programming represents one of the closest approaches computer science has yet made to the Holy Grail of programming: the user states the problem, the computer solves it". Nowadays, CP users have great modeling tools available (like Minizinc and CPMpy), allowing them to formulate the problem and then let a solver do the rest of the job, getting closer to the stated goal. However, this still requires the CP user to know the formalism and respect it. Another significant challenge lies in the expertise required to effectively model combinatorial problems. All this limits the wider adoption of CP. In this position paper, we investigate a possible approach to leverage pre-trained Large Language Models to extract models from textual problem descriptions. More specifically, we take inspiration from the Natural Language Processing for Optimization (NL4OPT) challenge and present early results with a decomposition-based prompting approach to GPT Models.
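Summary
Twenty-seven years ago, E. Freuder pointed out that "Constraint programming represents one of the closest approaches computer science has yet made to the Holy Grail of programming: the user states the problem, the computer solves it". Today, CP users have excellent modeling tools (such as Minizinc and CPMpy) that let them formulate the problem and leave the rest to a solver, bringing them closer to that goal. However, this still requires the CP user to know the formalism, and modeling combinatorial problems requires expertise. These requirements limit the wider adoption of CP. In this position paper, we investigate a possible approach that uses pre-trained large language models to extract models from textual problem descriptions. More specifically, we take inspiration from the Natural Language Processing for Optimization (NL4OPT) challenge and apply a decomposition-based prompting approach to GPT models.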
SoK: Assessing the State of Applied Federated Machine Learning
paper_authors: Tobias Müller, Maximilian Stäbler, Hugo Gascón, Frank Köster, Florian Matthes
for: This paper aims to explore the current state of applied Federated Machine Learning (FedML) and identify the challenges hindering its practical adoption.
methods: The paper uses a comprehensive systematic literature review to assess 74 relevant papers and analyze the real-world applicability of FedML, including its characteristics and emerging trends, motivational drivers, and application domains.
results: The paper identifies the challenges encountered in integrating FedML into real-life settings, providing insights that contribute to the further development and implementation of FedML in privacy-critical scenarios.
Abstract
Machine Learning (ML) has shown significant potential in various applications; however, its adoption in privacy-critical domains has been limited due to concerns about data privacy. A promising solution to this issue is Federated Machine Learning (FedML), a model-to-data approach that prioritizes data privacy. By enabling ML algorithms to be applied directly to distributed data sources without sharing raw data, FedML offers enhanced privacy protections, making it suitable for privacy-critical environments. Despite its theoretical benefits, FedML has not seen widespread practical implementation. This study aims to explore the current state of applied FedML and identify the challenges hindering its practical adoption. Through a comprehensive systematic literature review, we assess 74 relevant papers to analyze the real-world applicability of FedML. Our analysis focuses on the characteristics and emerging trends of FedML implementations, as well as the motivational drivers and application domains. We also discuss the encountered challenges in integrating FedML into real-life settings. By shedding light on the existing landscape and potential obstacles, this research contributes to the further development and implementation of FedML in privacy-critical scenarios.
Unsupervised Representation Learning for Time Series: A Review
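results: Experiments show that contrastive learning methods perform strongly on 9 real-world datasets, and that the library supports fast implementation and unified evaluation of different models.
Abstract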
Unsupervised representation learning approaches aim to learn discriminative feature representations from unlabeled data, without the requirement of annotating every sample. Enabling unsupervised representation learning is extremely crucial for time series data, due to its unique annotation bottleneck caused by its complex characteristics and lack of visual cues compared with other data modalities. In recent years, unsupervised representation learning techniques have advanced rapidly in various domains. However, there is a lack of systematic analysis of unsupervised representation learning approaches for time series. To fill the gap, we conduct a comprehensive literature review of existing rapidly evolving unsupervised representation learning approaches for time series. Moreover, we also develop a unified and standardized library, named ULTS (i.e., Unsupervised Learning for Time Series), to facilitate fast implementations and unified evaluations on various models. With ULTS, we empirically evaluate state-of-the-art approaches, especially the rapidly evolving contrastive learning methods, on 9 diverse real-world datasets. We further discuss practical considerations as well as open research challenges on unsupervised representation learning for time series to facilitate future research in this field.
SimTeG: A Frustratingly Simple Approach Improves Textual Graph Learning
results: Experiments show that this approach substantially improves the performance of various graph neural networks (GNNs) on multiple graph benchmarks.
Abstract
Textual graphs (TGs) are graphs whose nodes correspond to text (sentences or documents), which are widely prevalent. The representation learning of TGs involves two stages: (i) unsupervised feature extraction and (ii) supervised graph representation learning. In recent years, extensive efforts have been devoted to the latter stage, where Graph Neural Networks (GNNs) have dominated. However, the former stage for most existing graph benchmarks still relies on traditional feature engineering techniques. More recently, with the rapid development of language models (LMs), researchers have focused on leveraging LMs to facilitate the learning of TGs, either by jointly training them in a computationally intensive framework (merging the two stages), or designing complex self-supervised training tasks for feature extraction (enhancing the first stage). In this work, we present SimTeG, a frustratingly Simple approach for Textual Graph learning that does not innovate in frameworks, models, and tasks. Instead, we first perform supervised parameter-efficient fine-tuning (PEFT) on a pre-trained LM on the downstream task, such as node classification. We then generate node embeddings using the last hidden states of finetuned LM. These derived features can be further utilized by any GNN for training on the same task. We evaluate our approach on two fundamental graph representation learning tasks: node classification and link prediction. Through extensive experiments, we show that our approach significantly improves the performance of various GNNs on multiple graph benchmarks.
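Summary
Textual graphs (TGs) are a widespread kind of graph whose nodes correspond to text (sentences or documents). Representation learning on TGs involves two stages: (i) unsupervised feature extraction and (ii) supervised graph representation learning. In recent years most effort has gone into the second stage, where GNNs (graph neural networks) dominate. The first stage, however, still relies on traditional feature engineering for most existing graph benchmarks. With the rapid development of language models (LMs), researchers have begun using LMs to support TG learning, either by jointly training the two in a computationally expensive framework (merging the two stages) or by designing complex self-supervised tasks to improve feature extraction (enhancing the first stage). Against this background, we propose SimTeG, a frustratingly simple approach to textual graph learning. We first apply supervised parameter-efficient fine-tuning (PEFT) to a pre-trained LM, then generate node embeddings from the last hidden states of the fine-tuned LM. These derived features can be used by any GNN for training on the same task. We evaluate on two fundamental graph representation learning tasks: node classification and link prediction. Extensive experiments show that our approach improves the performance of various GNNs on multiple graph benchmarks.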
Motion Planning Diffusion: Learning and Planning of Robot Motions with Diffusion Models
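results: In simulated planar robot and 7-DoF robot manipulator environments, comparisons against baseline methods show that diffusion models are strong priors over high-dimensional trajectory distributions and can encode the multimodality of high-dimensional motion data.
Abstract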
Learning priors on trajectory distributions can help accelerate robot motion planning optimization. Given previously successful plans, learning trajectory generative models as priors for a new planning problem is highly desirable. Prior works propose several ways of using this prior to bootstrap the motion planning problem: either sampling the prior for initializations or using the prior distribution in a maximum-a-posteriori formulation for trajectory optimization. In this work, we propose learning diffusion models as priors. We can then sample directly from the posterior trajectory distribution conditioned on task goals, by leveraging the inverse denoising process of diffusion models. Furthermore, diffusion has recently been shown to effectively encode data multimodality in high-dimensional settings, which makes it particularly well suited to large trajectory datasets. To demonstrate the efficacy of our method, we compare our proposed method, Motion Planning Diffusion, against several baselines in simulated planar robot and 7-DoF robot arm manipulator environments. To assess the generalization capabilities of our method, we test it in environments with previously unseen obstacles. Our experiments show that diffusion models are strong priors to encode high-dimensional trajectory distributions of robot motions.
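Summary
Learning trajectory distributions can help accelerate robot motion planning optimization. Given previously successful plans, using trajectory generative models as priors for a new planning problem is highly desirable. Prior work has proposed several ways of using such a prior to bootstrap the motion planning problem: sampling from the prior for initialization, or using the prior distribution in a maximum-a-posteriori formulation for trajectory optimization. In this work, we propose learning the prior with diffusion models. Using the reverse denoising process of diffusion models, we can sample directly from the posterior trajectory distribution conditioned on task goals. In addition, diffusion models have recently been shown to encode data multimodality effectively in high-dimensional settings, which is particularly well suited to large trajectory datasets. To demonstrate the method's efficacy, we compare against several baselines in simulated planar robot and 7-DoF robot manipulator environments. To assess generalization, we test in environments with previously unseen obstacles. Our experiments show that diffusion models are strong priors for high-dimensional trajectory distributions.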
A Global Transport Capacity Risk Prediction Method for Rail Transit Based on Gaussian Bayesian Network
results: The effectiveness of the proposed method is verified through simulation examples.
Abstract
Aiming at the prediction problem of transport capacity risk caused by the mismatch between the carrying capacity of rail transit network and passenger flow demand, this paper proposes an explainable prediction method of rail transit network transport capacity risk based on linear Gaussian Bayesian network. This method obtains the training data of the prediction model based on the simulation model of the rail transit system with a three-layer structure including rail transit network, train flow and passenger flow. A Bayesian network structure construction method based on the topology of the rail transit network is proposed, and the MLE (Maximum Likelihood Estimation) method is used to realize the parameter learning of the Bayesian network. Finally, the effectiveness of the proposed method is verified by simulation examples.
Summary
This paper targets the prediction of transport capacity risk caused by the mismatch between the carrying capacity of the rail transit network and passenger flow demand, and proposes an explainable prediction method based on a linear Gaussian Bayesian network. Training data for the prediction model are obtained from a simulation model of the rail transit system comprising the rail network, train flow, and passenger flow. A Bayesian network structure construction method based on the rail network topology is proposed, and MLE (maximum likelihood estimation) is used for parameter learning. Finally, simulation examples verify the effectiveness of the method.
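The abstract above states that the parameters of the linear Gaussian Bayesian network are learned by MLE. For a linear Gaussian node this reduces to ordinary least squares. The following is a minimal sketch for a node with a single parent; the function name `fit_linear_gaussian` and the single-parent restriction are illustrative assumptions and are not taken from the paper.

```python
def fit_linear_gaussian(parent, child):
    """MLE for child ~ Normal(b0 + b1 * parent, sigma2) with one parent.

    Hypothetical helper for illustration; a real network would fit one
    such conditional distribution per node, possibly with several parents.
    """
    n = len(parent)
    if n == 0 or n != len(child):
        raise ValueError("parent and child must be non-empty and equal length")
    mean_x = sum(parent) / n
    mean_y = sum(child) / n
    sxx = sum((x - mean_x) ** 2 for x in parent)
    if sxx == 0:
        raise ValueError("parent values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(parent, child))
    b1 = sxy / sxx               # slope: least squares equals MLE under Gaussian noise
    b0 = mean_y - b1 * mean_x    # intercept
    residuals = [y - (b0 + b1 * x) for x, y in zip(parent, child)]
    sigma2 = sum(r * r for r in residuals) / n   # MLE variance divides by n, not n - 1
    return b0, b1, sigma2
```

Calling `fit_linear_gaussian(upstream_load, downstream_load)` on two simulated series (hypothetical variable names) would return the intercept, the edge weight, and the noise variance for that edge.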
InterAct: Exploring the Potentials of ChatGPT as a Cooperative Agent
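methods: We introduce the InterAct approach, which feeds ChatGPT varied prompts, assigns it multiple roles such as checker and sorter, and then integrates them with the original language model.
results: Our study shows a 98% success rate in AlfWorld, which comprises 6 different tasks in a simulated household environment. This demonstrates ChatGPT's ability to handle complex tasks in real-world settings and paves the way for task planning.
Abstract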
This research paper delves into the integration of OpenAI's ChatGPT into embodied agent systems, evaluating its influence on an interactive decision-making benchmark. Drawing a parallel to the concept of people assuming roles according to their unique strengths, we introduce InterAct. In this approach, we feed ChatGPT with varied prompts, assigning it numerous roles such as a checker and a sorter, then integrating them with the original language model. Our research shows a remarkable success rate of 98% in AlfWorld, which consists of 6 different tasks in a simulated household environment, emphasizing the significance of proficient prompt engineering. The results highlight ChatGPT's competence in comprehending and performing intricate tasks effectively in real-world settings, thus paving the way for further advancements in task planning.
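Summary
This research paper examines the integration of OpenAI's ChatGPT into embodied agent systems and evaluates its influence on an interactive decision-making benchmark. We introduce the InterAct approach, feeding ChatGPT different prompts so that it plays multiple roles, such as checker and sorter, and then integrating them with the original language model. We find a 98% success rate across the 6 tasks of AlfWorld, which take place in a simulated household environment. This demonstrates ChatGPT's ability to complete complex tasks in real-world settings and paves the way for further advances in task planning.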
Avoidance Navigation Based on Offline Pre-Training Reinforcement Learning
paper_authors: Yang Wenkai, Ji Ruihang, Zhang Yuxiang, Lei Hao, Zhao Zijie
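for: This paper applies deep reinforcement learning (DRL) to map-free avoidance navigation for mobile robots.
methods: The paper proposes a mapping from raw sensor data to control variables in unknown environments, along with an efficient offline training strategy that speeds up the random exploration of the early training stage. It also collects a general-purpose dataset including expert experience for use in other navigation training work.
results: Pre-training and prioritized expert experience reduce training time by 80% and double the DRL reward. The trained model is also shown to navigate without collisions and to generalize across different environments.
Abstract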
This paper presents pre-trained deep reinforcement learning (DRL) for map-free avoidance navigation of mobile robots, which maps raw sensor data to control variables and navigates in an unknown environment. An efficient offline training strategy is proposed to speed up the inefficient random exploration of the early stage, and we also collect a universal dataset including expert experience for offline training, which is of some significance for other navigation training work. Pre-training and prioritized expert experience are proposed to reduce training time by 80% and have been verified to double the DRL reward. The advanced Gazebo simulator, with realistic physical modelling and dynamic equations, reduces the sim-to-real gap. We train our model in a corridor environment and evaluate it in different environments, obtaining the same effect. Compared with traditional navigation methods, we confirm that the trained model can be applied directly to different scenarios and can navigate without collisions. This demonstrates that our DRL model generalizes across different environments.
MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies
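paper_authors: Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov
for: This paper proposes a text-to-music generation model based on Stable Diffusion to address limited data and copyright concerns in music generation.
methods: The paper adapts the Stable Diffusion and AudioLDM architectures to the music domain by retraining the pre-trained CLAP model and the Hifi-GAN vocoder. It also proposes two different mixup strategies for generating more diverse music when training data are limited.
results: Experiments show that the proposed MusicLDM model and mixup strategies improve the quality and novelty of generated music while preserving the correspondence between input text and generated music. The paper also designs several new evaluation metrics to demonstrate the effect of MusicLDM and the mixup strategies.
Abstract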
Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embeddings space, respectively. Such mixup strategies encourage the model to interpolate between musical training samples and generate new music within the convex hull of the training data, making the generated music more diverse while still staying faithful to the corresponding style. In addition to popular evaluation metrics, we design several new evaluation metrics based on CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music, as well as the correspondence between input text and generated music.
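Summary
Diffusion models perform well in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, a special type of audio, poses unique challenges, mainly limited data and copyright concerns. To address these challenges, we first build a state-of-the-art text-to-music model, MusicLDM, which adapts the Stable Diffusion and AudioLDM architectures to the music domain by retraining CLAP and the Hifi-GAN vocoder. Then, to address the limits of the training data and avoid plagiarism, we use a beat tracking model and propose two mixup strategies: beat-synchronous audio mixup and beat-synchronous latent mixup. These strategies let the model interpolate between training music samples and so generate more diverse music while remaining relevant to the input text. We also design several new evaluation metrics based on CLAP score to show that MusicLDM and the beat-synchronous mixup strategies improve the quality and novelty of generated music while preserving the correspondence between input text and generated music.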
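The mixup strategies described above recombine two training samples so that the result stays inside the convex hull of the training data. A minimal sketch of the interpolation step is shown here, assuming the two clips have already been beat-aligned and trimmed to equal length upstream. The function names, the Beta(5, 5) sampling of the mixing weight, and the list-of-floats representation are illustrative assumptions; the abstract does not specify these details.

```python
import random


def mixup(clip_a, clip_b, lam):
    """Convex combination of two equal-length, beat-aligned clips.

    `lam` in [0, 1] is the weight on `clip_a`. The same formula applies to
    waveforms (audio mixup) or to latent vectors (latent mixup).
    """
    if len(clip_a) != len(clip_b):
        raise ValueError("clips must be aligned to the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(clip_a, clip_b)]


def sample_lambda(alpha=5.0, rng=random):
    """Draw a mixing weight from a symmetric Beta(alpha, alpha).

    The value alpha=5.0 is an assumed default, not a value from the paper.
    """
    return rng.betavariate(alpha, alpha)
```

Because the weights are non-negative and sum to one, every mixed sample lies on the segment between its two sources, which is what keeps generated training targets within the convex hull of the data.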
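results: The paper trains neural networks to upscale levels and offers several resolutions so that users can quickly create and edit levels. Designers using the tool enjoyed the sense of co-design, liked the concept, and gave feedback for further improvement.
Abstract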
We explore AI-powered upscaling as a design assistance tool in the context of creating 2D game levels. Deep neural networks are used to upscale artificially downscaled patches of levels from the puzzle platformer game Lode Runner. The trained networks are incorporated into a web-based editor, where the user can create and edit levels at three different levels of resolution: 4x4, 8x8, and 16x16. An edit at any resolution instantly transfers to the other resolutions. As upscaling requires inventing features that might not be present at lower resolutions, we train neural networks to reproduce these features. We introduce a neural network architecture that is capable of not only learning upscaling but also giving higher priority to less frequent tiles. To investigate the potential of this tool and guide further development, we conduct a qualitative study with 3 designers to understand how they use it. Designers enjoyed co-designing with the tool, liked its underlying concept, and provided feedback for further improvement.
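Summary
We explore AI-powered upscaling as a tool for game level design. We use deep neural networks to upscale artificially downscaled patches of levels from the puzzle platformer Lode Runner. The trained networks are integrated into a web-based editor, allowing users to create and edit levels at different resolutions (4x4, 8x8, and 16x16). An edit at any resolution transfers instantly to the others. Because upscaling requires inventing features that are absent at lower resolutions, we train neural networks to reproduce these features. We propose a neural network architecture that not only learns upscaling but also gives higher priority to less frequent tiles. To understand the tool's potential and guide further development, we conducted a qualitative study with 3 designers to learn how they use it; they also provided feedback for further improvement.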
Non-equilibrium physics: from spin glasses to machine and neural learning
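results: The thesis uncovers relationships between learning mechanisms and physical dynamics in disordered systems and proposes a statistical-physics-based approach to designing intelligent systems. These findings may extend our current understanding of intelligence and reveal a wider range of computational substrates suitable for AI applications.
Abstract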
Disordered many-body systems exhibit a wide range of emergent phenomena across different scales. These complex behaviors can be utilized for various information processing tasks such as error correction, learning, and optimization. Despite the empirical success of utilizing these systems for intelligent tasks, the underlying principles that govern their emergent intelligent behaviors remain largely unknown. In this thesis, we aim to characterize such emergent intelligence in disordered systems through statistical physics. We chart a roadmap for our efforts in this thesis based on two axes: learning mechanisms (long-term memory vs. working memory) and learning dynamics (artificial vs. natural). Throughout our journey, we uncover relationships between learning mechanisms and physical dynamics that could serve as guiding principles for designing intelligent systems. We hope that our investigation into the emergent intelligence of seemingly disparate learning systems can expand our current understanding of intelligence beyond neural systems and uncover a wider range of computational substrates suitable for AI applications.
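Summary
Disordered many-body systems exhibit a wide range of emergent phenomena across different scales. These complex behaviors can be used for information processing tasks such as error correction, learning, and optimization. Although using these systems for intelligent tasks has been very successful in practice, their underlying principles are not yet well understood. In this thesis, we aim to characterize this emergent intelligent behavior through statistical physics. We chart our roadmap along two axes: learning mechanisms (long-term memory vs. working memory) and learning dynamics (artificial vs. natural). Along the way, we uncover relationships between learning mechanisms and physical dynamics that may serve as guiding principles for designing intelligent systems. We hope that studying the emergent intelligence of various learning systems will extend our current understanding of intelligence and uncover more computational substrates suitable for AI applications.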
Food Classification using Joint Representation of Visual and Textual Data
results: In an evaluation on the large open-source UPMC Food-101 dataset, the proposed network outperforms the other methods, with accuracy margins of 11.57% and 6.34%.
Abstract
Food classification is an important task in health care. In this work, we propose a multimodal classification framework that uses a modified version of EfficientNet with the Mish activation function for image classification and the traditional BERT transformer-based network for text classification. The proposed network and the other state-of-the-art methods are evaluated on a large open-source dataset, UPMC Food-101. The experimental results show that the proposed network outperforms the other methods: a significant difference of 11.57% and 6.34% in accuracy is observed for image and text classification, respectively, when compared with the second-best performing method. We also compared the performance in terms of accuracy, precision, and recall for text classification using both machine learning and deep learning-based models. The comparative analysis from the prediction results of both images and text demonstrated the efficiency and robustness of the proposed approach.
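Summary
Food classification is an important task in health care. In this work, we propose a multimodal classification framework that uses a modified EfficientNet with the Mish activation function for image classification and the traditional BERT transformer network for text classification. The proposed network and other state-of-the-art methods are evaluated on the large open-source dataset UPMC Food-101. The results show that the proposed network outperforms the other methods in both image and text classification, with margins of 11.57% and 6.34% over the second-best method. We also compare accuracy, precision, and recall for text classification, and the comparison of image and text prediction results demonstrates the effectiveness and robustness of our approach.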
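The comparison above is reported in terms of accuracy, precision, and recall. As a reminder of how these three quantities differ, here is a minimal sketch that computes them for one class of interest. The function name and the toy labels in the usage note are hypothetical, and the paper's own evaluation (for example, any averaging across the 101 classes) may differ.

```python
def classification_metrics(y_true, y_pred, positive):
    """Return (accuracy, precision, recall), treating `positive` as the target class.

    Precision and recall fall back to 0.0 when their denominator is zero.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)                     # all correct predictions
    precision = tp / (tp + fp) if tp + fp else 0.0      # how many predicted positives are right
    recall = tp / (tp + fn) if tp + fn else 0.0         # how many true positives are found
    return accuracy, precision, recall
```

For instance, with true labels `["pizza", "pizza", "sushi", "sushi"]` and predictions `["pizza", "sushi", "sushi", "sushi"]`, the class `"pizza"` has perfect precision but only half of its items are recalled, which accuracy alone would not reveal.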
Digital twin brain: a bridge between biological intelligence and artificial intelligence
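results: The DTB can offer unprecedented insight into the emergence of intelligence and into neurological disorders, with potential applications in the development of artificial intelligence and in precision mental healthcare.
Abstract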
In recent years, advances in neuroscience and artificial intelligence have paved the way for unprecedented opportunities for understanding the complexity of the brain and its emulation by computational systems. Cutting-edge advancements in neuroscience research have revealed the intricate relationship between brain structure and function, while the success of artificial neural networks highlights the importance of network architecture. Now is the time to bring them together to better unravel how intelligence emerges from the brain's multiscale repositories. In this review, we propose the Digital Twin Brain (DTB) as a transformative platform that bridges the gap between biological and artificial intelligence. It consists of three core elements: the brain structure that is fundamental to the twinning process, bottom-layer models to generate brain functions, and its wide spectrum of applications. Crucially, brain atlases provide a vital constraint, preserving the brain's network organization within the DTB. Furthermore, we highlight open questions that invite joint efforts from interdisciplinary fields and emphasize the far-reaching implications of the DTB. The DTB can offer unprecedented insights into the emergence of intelligence and neurological disorders, which holds tremendous promise for advancing our understanding of both biological and artificial intelligence, and ultimately propelling the development of artificial general intelligence and facilitating precision mental healthcare.
Quantum Multi-Agent Reinforcement Learning for Autonomous Mobility Cooperation
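results: For multi-agent systems, the proposed QMARL algorithm copes better with the limitations of the noisy intermediate-scale quantum (NISQ) era and converges faster. It also uses parameters more efficiently.
Abstract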
For Industry 4.0 Revolution, cooperative autonomous mobility systems are widely used based on multi-agent reinforcement learning (MARL). However, the MARL-based algorithms suffer from huge parameter utilization and convergence difficulties with many agents. To tackle these problems, a quantum MARL (QMARL) algorithm based on the concept of actor-critic network is proposed, which is beneficial in terms of scalability, to deal with the limitations in the noisy intermediate-scale quantum (NISQ) era. Additionally, our QMARL is also beneficial in terms of efficient parameter utilization and fast convergence due to quantum supremacy. Note that the reward in our QMARL is defined as task precision over computation time in multiple agents, thus, multi-agent cooperation can be realized. For further improvement, an additional technique for scalability is proposed, which is called projection value measure (PVM). Based on PVM, our proposed QMARL can achieve the highest reward, by reducing the action dimension into a logarithmic-scale. Finally, we can conclude that our proposed QMARL with PVM outperforms the other algorithms in terms of efficient parameter utilization, fast convergence, and scalability.
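The abstract says the projection value measure (PVM) reduces the action dimension to a logarithmic scale. One way to read this is that measuring n qubits yields one of 2^n basis outcomes, so an action space of size N needs only about log2(N) qubits. The sketch below illustrates that counting argument only; the function names and the bitstring encoding are assumptions for illustration and do not reproduce the paper's circuit.

```python
def qubits_needed(num_actions):
    """Smallest n such that 2**n >= num_actions."""
    if num_actions < 1:
        raise ValueError("num_actions must be at least 1")
    # For N >= 1, (N - 1).bit_length() equals ceil(log2(N)) without float rounding.
    return (num_actions - 1).bit_length()


def action_from_measurement(bits, num_actions):
    """Map a measured bitstring such as '101' to an action index.

    Outcomes beyond the action space are rejected here; how a real system
    would treat them is left open.
    """
    index = int(bits, 2)
    if index >= num_actions:
        raise ValueError("measurement outcome outside the action space")
    return index
```

Under this reading, 1024 discrete actions would be covered by 10 measured qubits rather than 1024 separate output units.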
Large-scale Generative Simulation Artificial Intelligence: the Next Hotspot in Generative AI
results: The study reports key findings, including solutions to practical challenges and the application prospects of LS-GenAI.
Abstract
The concept of GenAI has been developed for decades. Only recently has it impressed us with substantial breakthroughs in natural language processing and computer vision, actively engaging in industrial scenarios. Noting the practical challenges, e.g., limited learning resources and over-dependence on the empiricism of scientific discovery, we nominate large-scale generative simulation artificial intelligence (LS-GenAI) as the next hotspot for GenAI to connect.
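Summary
The concept of GenAI has been developed over decades, and only recently has it delivered major breakthroughs in natural language processing and computer vision and moved actively into industrial scenarios. Noting practical challenges such as limited learning resources and over-dependence on scientific discovery, we nominate large-scale generative simulation artificial intelligence (LS-GenAI) as the next hotspot for GenAI to connect.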
Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving
results: Extensive experiments in urban and highway settings show that the method outperforms the current state of the art. More information is available at https://waabi.ai/research/implicito.
Abstract
A self-driving vehicle (SDV) must be able to perceive its surroundings and predict the future behavior of other traffic participants. Existing works either perform object detection followed by trajectory forecasting of the detected objects, or predict dense occupancy and flow grids for the whole scene. The former poses a safety concern as the number of detections needs to be kept low for efficiency reasons, sacrificing object recall. The latter is computationally expensive due to the high-dimensionality of the output grid, and suffers from the limited receptive field inherent to fully convolutional networks. Furthermore, both approaches employ many computational resources predicting areas or objects that might never be queried by the motion planner. This motivates our unified approach to perception and future prediction that implicitly represents occupancy and flow over time with a single neural network. Our method avoids unnecessary computation, as it can be directly queried by the motion planner at continuous spatio-temporal locations. Moreover, we design an architecture that overcomes the limited receptive field of previous explicit occupancy prediction methods by adding an efficient yet effective global attention mechanism. Through extensive experiments in both urban and highway settings, we demonstrate that our implicit model outperforms the current state-of-the-art. For more information, visit the project website: https://waabi.ai/research/implicito.
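Summary
A self-driving vehicle (SDV) must be able to perceive its surroundings and predict the future behavior of other traffic participants. Existing work either detects objects and then forecasts the trajectories of the detected objects, or predicts dense occupancy and flow grids for the whole scene. The former raises a safety concern because the number of detections must be kept low, sacrificing object recall. The latter is computationally expensive because of the high dimensionality of the output grid, and it suffers from the limited receptive field of fully convolutional networks. Moreover, both approaches spend substantial compute predicting regions or objects that may never be queried. This motivates our unified approach to perception and future prediction, which can be queried directly by the motion planner and so avoids unnecessary computation. We also design an efficient yet effective global attention mechanism to overcome the limited receptive field of earlier explicit occupancy prediction methods. Through extensive experiments in both urban and highway settings, we show that our implicit model outperforms the current state of the art. For more information, visit the project website: https://waabi.ai/research/implicito.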
Training Data Protection with Compositional Diffusion Models
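results: The results show that CDMs enable selective forgetting and continual learning, as well as serving customized models based on a user's access rights. CDMs can also determine the importance of a data subset in generating particular samples.
Abstract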
We introduce Compartmentalized Diffusion Models (CDM), a method to train different diffusion models (or prompts) on distinct data sources and arbitrarily compose them at inference time. The individual models can be trained in isolation, at different times, and on different distributions and domains and can be later composed to achieve performance comparable to a paragon model trained on all data simultaneously. Furthermore, each model only contains information about the subset of the data it was exposed to during training, enabling several forms of training data protection. In particular, CDMs are the first method to enable both selective forgetting and continual learning for large-scale diffusion models, as well as allowing serving customized models based on the user's access rights. CDMs also allow determining the importance of a subset of the data in generating particular samples.
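Summary
We introduce Compartmentalized Diffusion Models (CDM), a method for training different diffusion models (or prompts) on distinct data sources and composing them arbitrarily at inference time. The individual models can be trained independently, at different times and on different distributions and domains, and then combined at inference. In addition, each model contains information only about the subset of data it saw during training, which enables several forms of training data protection. In particular, CDMs are the first method to enable selective forgetting and continual learning for large-scale diffusion models, as well as serving customized models based on the user's access rights. CDMs also make it possible to determine the importance of a data subset in generating particular samples.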
Dual Governance: The intersection of centralized regulation and crowdsourced safety mechanisms for Generative AI
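paper_authors: Avijit Ghosh, Dhanya Lakshmi
for: This paper proposes a framework called Dual Governance to ensure safety and ethics in applications of generative AI.
methods: The paper analyzes existing centralized regulation and community-driven safety mechanisms to find where they complement each other, and proposes the Dual Governance framework to combine centralized regulation with community-driven governance.
results: The paper argues that implementing the Dual Governance framework can promote innovation in generative AI while ensuring its safe and ethical use.
Abstract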
Generative Artificial Intelligence (AI) has seen mainstream adoption lately, especially in the form of consumer-facing, open-ended, text and image generating models. However, the use of such systems raises significant ethical and safety concerns, including privacy violations, misinformation and intellectual property theft. The potential for generative AI to displace human creativity and livelihoods has also been under intense scrutiny. To mitigate these risks, there is an urgent need for policies and regulations that promote responsible and ethical development in the field of generative AI. Existing and proposed centralized regulations by governments to rein in AI face criticisms such as not having sufficient clarity or uniformity, lack of interoperability across lines of jurisdictions, restricting innovation, and hindering free market competition. Decentralized protections via crowdsourced safety tools and mechanisms are a potential alternative. However, they have clear deficiencies in terms of inadequate oversight and difficulty of enforcing ethical and safety standards, and are thus not enough by themselves as a regulation mechanism. We propose a marriage of these two strategies via a framework we call Dual Governance. This framework proposes a cooperative synergy between centralized government regulations in a U.S. specific context and safety mechanisms developed by the community to protect stakeholders from the harms of generative AI. By implementing the Dual Governance framework, we posit that innovation and creativity can be promoted while ensuring safe and ethical deployment of generative AI.
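Summary
Generative artificial intelligence (AI) has been widely adopted in recent years, especially in the form of consumer-facing, open-ended text and image generation models. However, using these systems raises significant ethical and safety concerns, including privacy violations, misinformation, and intellectual property theft. The potential for generative AI to displace human creativity and livelihoods is also under intense scrutiny. Mitigating these risks requires policies and regulations that promote responsible development of AI. Existing and proposed centralized government regulations are criticized for insufficient clarity and uniformity and for hindering innovation and free market competition. Decentralized protections, such as crowdsourced safety tools and mechanisms, are a possible alternative. However, they lack oversight and the means to enforce ethical and safety standards, so they are not sufficient on their own. To address this, we propose a framework called Dual Governance. In a U.S.-specific context, the framework combines centralized government regulation with community-developed safety mechanisms to protect stakeholders from the harms of generative AI. By implementing the Dual Governance framework, we believe innovation and creativity can be promoted while ensuring safe and responsible AI development.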
VertexSerum: Poisoning Graph Neural Networks for Link Inference
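results: Experiments show that VertexSerum significantly outperforms state-of-the-art (SOTA) link inference attacks, improving AUC scores by 9.8% on average. The experiments also show that VertexSerum is effective in both black-box and online learning settings.
Abstract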
Graph neural networks (GNNs) have brought superb performance to various applications utilizing graph structural data, such as social analysis and fraud detection. The graph links, e.g., social relationships and transaction history, are sensitive and valuable information, which raises privacy concerns when using GNNs. To exploit these vulnerabilities, we propose VertexSerum, a novel graph poisoning attack that increases the effectiveness of graph link stealing by amplifying the link connectivity leakage. To infer node adjacency more accurately, we propose an attention mechanism that can be embedded into the link detection network. Our experiments demonstrate that VertexSerum significantly outperforms the SOTA link inference attack, improving the AUC scores by an average of $9.8\%$ across four real-world datasets and three different GNN structures. Furthermore, our experiments reveal the effectiveness of VertexSerum in both black-box and online learning settings, further validating its applicability in real-world scenarios.
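Summary
Graph neural networks (GNNs) deliver excellent performance in applications that use graph-structured data, such as social analysis and fraud detection. Graph links, for example social relationships and transaction histories, are sensitive and valuable information, so using GNNs raises privacy concerns. To exploit these vulnerabilities, we propose VertexSerum, a new graph poisoning attack that makes graph link stealing more effective by amplifying link connectivity leakage. To infer node adjacency more accurately, we propose an attention mechanism that can be embedded in the link detection network. Our experiments show that VertexSerum clearly outperforms the state-of-the-art (SOTA) link inference attack, improving AUC scores by 9.8% on average. They also show that VertexSerum is effective in both black-box and online learning settings.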
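The attack above is scored by AUC. For a link inference attack, AUC can be read as the probability that a truly linked node pair receives a higher attack score than an unlinked pair. The sketch below computes it directly from that pairwise definition; the function name and the toy scores in the usage note are hypothetical, and this quadratic-time version is for illustration rather than for large graphs.

```python
def link_auc(labels, scores):
    """AUC as the fraction of (linked, unlinked) pairs ranked correctly.

    `labels` holds 1 for node pairs that share an edge and 0 otherwise;
    `scores` holds the attacker's confidence for each pair. Ties count 0.5.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have equal length")
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one linked and one unlinked pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

With labels `[1, 1, 0, 0]` and scores `[0.9, 0.4, 0.5, 0.1]`, three of the four linked/unlinked comparisons are ranked correctly, so the function returns 0.75. An attacker guessing at random would sit near 0.5.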
Novel Physics-Based Machine-Learning Models for Indoor Air Quality Approximations
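results: Experiments show that the proposed models are simpler, more computationally efficient, and better at capturing the highly nonlinear patterns in indoor air quality data than comparable state-of-the-art models.
Abstract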
Cost-effective sensors are capable of real-time capturing a variety of air quality-related modalities from different pollutant concentrations to indoor/outdoor humidity and temperature. Machine learning (ML) models are capable of performing air-quality "ahead-of-time" approximations. Undoubtedly, accurate indoor air quality approximation significantly helps provide a healthy indoor environment, optimize associated energy consumption, and offer human comfort. However, it is crucial to design an ML architecture to capture the domain knowledge, so-called problem physics. In this study, we propose six novel physics-based ML models for accurate indoor pollutant concentration approximations. The proposed models include an adroit combination of state-space concepts in physics, Gated Recurrent Units, and Decomposition techniques. The proposed models were illustrated using data collected from five offices in a commercial building in California. The proposed models are shown to be less complex, computationally more efficient, and more accurate than similar state-of-the-art transformer-based models. The superiority of the proposed models is due to their relatively light architecture (computational efficiency) and, more importantly, their ability to capture the underlying highly nonlinear patterns embedded in the often contaminated sensor-collected indoor air quality temporal data.
Summary
Cost-effective sensors can capture a variety of air quality-related modalities in real time, from pollutant concentrations to indoor/outdoor humidity and temperature. Machine learning (ML) models can approximate air quality ahead of time. Accurate indoor air quality approximation is crucial for providing a healthy indoor environment, optimizing the associated energy consumption, and offering human comfort. However, the ML architecture must be designed to capture the domain knowledge, the so-called problem physics. In this study, we propose six novel physics-based ML models for accurate indoor pollutant concentration approximation. The models combine state-space concepts from physics, Gated Recurrent Units, and decomposition techniques. They were illustrated using data collected from five offices in a commercial building in California. The proposed models are less complex, computationally more efficient, and more accurate than similar state-of-the-art transformer-based models. Their advantage comes from a light architecture and from their ability to capture the highly nonlinear patterns in the often contaminated, sensor-collected indoor air quality time series.
Why Do We Need Neuro-symbolic AI to Model Pragmatic Analogies?
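methods: The paper uses Neuro-symbolic AI techniques that combine statistical and symbolic AI to improve the representation of text, highlight and augment relevant content, provide abstraction, and guide the mapping process.
results: The study finds that as analogies become more complex, they require more extensive and diverse knowledge that lexical co-occurrence statistics are unlikely to provide. Neuro-symbolic AI techniques can maintain the efficiency of LLMs while preserving the ability to explain analogies, which supports pedagogical applications.
Abstract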
A hallmark of intelligence is the ability to use a familiar domain to make inferences about a less familiar domain, known as analogical reasoning. In this article, we delve into the performance of Large Language Models (LLMs) in dealing with progressively complex analogies expressed in unstructured text. We discuss analogies at four distinct levels of complexity: lexical analogies, syntactic analogies, semantic analogies, and pragmatic analogies. As the analogies become more complex, they require increasingly extensive, diverse knowledge beyond the textual content, unlikely to be found in the lexical co-occurrence statistics that power LLMs. To address this, we discuss the necessity of employing Neuro-symbolic AI techniques that combine statistical and symbolic AI, informing the representation of unstructured text to highlight and augment relevant content, provide abstraction and guide the mapping process. Our knowledge-informed approach maintains the efficiency of LLMs while preserving the ability to explain analogies for pedagogical applications.
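Summary
A hallmark of intelligence is the ability to use a familiar domain to reason about a less familiar one, known as analogical reasoning. In this article, we examine how large language models (LLMs) handle progressively more complex analogies expressed in text. We discuss analogies at four levels of complexity: lexical, syntactic, semantic, and pragmatic. As analogies become more complex, they require more extensive and diverse knowledge that is unlikely to be found in lexical co-occurrence statistics. We therefore discuss the need for techniques that combine statistical and symbolic AI to highlight and augment textual content, provide abstraction, and guide the mapping process. Our knowledge-informed approach maintains the efficiency of LLMs while preserving the ability to explain analogies for pedagogical applications.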
Unlocking the Potential of Similarity Matching: Scalability, Supervision and Pre-training
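methods: The researchers propose a PyTorch implementation of Convolutional Nonnegative Similarity Matching (SM) to scale SM to large datasets. They also propose a localized supervised learning objective built on SM layers, and use the PyTorch implementation to pre-train architectures such as LeNet for comparison.
results: The researchers find that the PyTorch-based SM algorithm compares with BP in terms of computational efficiency and biological plausibility, and that it scales to large datasets. They also find that combining SM layers with BP improves the model's evaluation performance.
Abstract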
While effective, the backpropagation (BP) algorithm exhibits limitations in terms of biological plausibility, computational cost, and suitability for online learning. As a result, there has been a growing interest in developing alternative biologically plausible learning approaches that rely on local learning rules. This study focuses on the primarily unsupervised similarity matching (SM) framework, which aligns with observed mechanisms in biological systems and offers online, localized, and biologically plausible algorithms. i) To scale SM to large datasets, we propose an implementation of Convolutional Nonnegative SM using PyTorch. ii) We introduce a localized supervised SM objective reminiscent of canonical correlation analysis, facilitating stacking SM layers. iii) We leverage the PyTorch implementation for pre-training architectures such as LeNet and compare the evaluation of features against BP-trained models. This work combines biologically plausible algorithms with computational efficiency opening multiple avenues for further explorations.
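Summary
Although effective, the backpropagation (BP) algorithm has limitations in biological plausibility, computational cost, and suitability for online learning. As a result, there is growing interest in biologically plausible learning methods. This study focuses on the similarity matching (SM) framework, which is consistent with mechanisms observed in biological systems and offers online, localized, and biologically plausible algorithms. (i) To scale SM to large datasets, we propose a PyTorch implementation of Convolutional Nonnegative SM. (ii) We introduce a localized supervised SM objective, reminiscent of canonical correlation analysis, that facilitates stacking SM layers. (iii) We use the PyTorch implementation to pre-train architectures such as LeNet and compare the resulting features against BP-trained models. This work combines biologically plausible algorithms with computational efficiency, opening several avenues for further exploration.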
Bio+Clinical BERT, BERT Base, and CNN Performance Comparison for Predicting Drug-Review Satisfaction
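paper_authors: Yue Ling
for: The study aims to develop natural language processing (NLP) models that analyze patients' drug review text and accurately classify it as positive, neutral, or negative.
methods: The study uses several classification models, including a BERT base model, Bio+Clinical BERT, and a simple CNN.
results: Bio+Clinical BERT performs best overall, particularly on medical jargon, with an 11% improvement in macro F1 and recall, as shown in Table 2.
Abstract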
The objective of this study is to develop natural language processing (NLP) models that can analyze patients' drug reviews and accurately classify their satisfaction levels as positive, neutral, or negative. Such models would reduce the workload of healthcare professionals and provide greater insight into patients' quality of life, which is a critical indicator of treatment effectiveness. To achieve this, we implemented and evaluated several classification models, including a BERT base model, Bio+Clinical BERT, and a simpler CNN. Results indicate that the medical domain-specific Bio+Clinical BERT model significantly outperformed the general domain base BERT model, achieving macro f1 and recall score improvement of 11%, as shown in Table 2. Future research could explore how to capitalize on the specific strengths of each model. Bio+Clinical BERT excels in overall performance, particularly with medical jargon, while the simpler CNN demonstrates the ability to identify crucial words and accurately classify sentiment in texts with conflicting sentiments.
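Summary
The objective is to develop natural language processing (NLP) models that analyze patients' drug reviews and accurately classify satisfaction as positive, neutral, or negative. Such models would reduce the workload of healthcare professionals and provide more insight into patients' quality of life, a key indicator of treatment effectiveness. To this end, we implemented and evaluated several classification models, including a BERT base model, Bio+Clinical BERT, and a simple CNN. The results show that the medical domain-specific Bio+Clinical BERT model significantly outperformed the general-domain BERT base model, improving macro F1 and recall by 11%, as shown in Table 2. Future research could explore how to exploit the strengths of each model: Bio+Clinical BERT excels in overall performance, particularly with medical jargon, while the simple CNN identifies crucial words and accurately classifies sentiment in texts with conflicting sentiments.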
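Macro F1 and macro recall average the per-class scores with equal weight, so a small class (often the neutral reviews) counts as much as a large one. A minimal sketch of that computation follows; the label names, the function name, and the toy predictions are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")  # assumed label names


def macro_recall_f1(y_true, y_pred, labels=LABELS):
    """Return (macro recall, macro F1): unweighted means of per-class scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    recalls, f1s = [], []
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1s.append(f1)
    return sum(recalls) / len(labels), sum(f1s) / len(labels)


# Toy data, made up for illustration only.
y_true = ["positive", "positive", "neutral", "negative", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]
macro_recall, macro_f1 = macro_recall_f1(y_true, y_pred)
```

Because every class contributes one third of the score here, a model that ignores the neutral class is penalized even if neutral reviews are rare.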
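results: The study finds that current AI has become a general-purpose technology that can be configured for applications across many domains. However, these technologies also bring problems and risks, such as data privacy and security issues.
Abstract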
Kuhn's framework of scientific progress (Kuhn, 1962) provides a useful framing of the paradigm shifts that have occurred in Artificial Intelligence over the last 60 years. The framework is also useful in understanding what is arguably a new paradigm shift in AI, signaled by the emergence of large pre-trained systems such as GPT-3, on which conversational agents such as ChatGPT are based. Such systems make intelligence a commoditized general purpose technology that is configurable to applications. In this paper, I summarize the forces that led to the rise and fall of each paradigm, and discuss the pressing issues and risks associated with the current paradigm shift in AI.
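Summary
Kuhn's framework of scientific progress (Kuhn, 1962) provides a useful framing of the paradigm shifts in artificial intelligence over the last 60 years. The framework also helps in understanding what is arguably a new paradigm shift in AI, signaled by the emergence of large pre-trained systems such as GPT-3 and the conversational agents built on them, such as ChatGPT. These systems make intelligence a general-purpose technology that can be configured for applications. In this paper, I summarize the rise and fall of each past paradigm and discuss the pressing issues and risks of the current shift in AI.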
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
paper_authors: Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, Ludwig Schmidt
results: On seven vision-language datasets, OpenFlamingo models average between 80% and 89% of the corresponding Flamingo models' performance.
Abstract
We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80 - 89% of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo.
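Summary
We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80% and 89% of the corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo.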
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
paper_authors: Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, Zhongzhu Zhou, Michael Wyatt, Molly Smith, Lev Kurilenko, Heyang Qin, Masahiro Tanaka, Shuai Che, Shuaiwen Leon Song, Yuxiong He
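for: The paper aims to provide an accessible, efficient, and affordable Reinforcement Learning with Human Feedback (RLHF) training pipeline for ChatGPT-like models, one that can train models with hundreds of billions of parameters in record time and at reduced cost.
results: The paper reports that DeepSpeed-Chat can train models with hundreds of billions of parameters in record time and at a fraction of the cost.
Abstract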
ChatGPT-like models have revolutionized various applications in artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance. However, the current landscape lacks an accessible, efficient, and cost-effective end-to-end RLHF (Reinforcement Learning with Human Feedback) training pipeline for these powerful models, particularly when training at the scale of billions of parameters. This paper introduces DeepSpeed-Chat, a novel system that democratizes RLHF training, making it accessible to the AI community. DeepSpeed-Chat offers three key capabilities: an easy-to-use training and inference experience for ChatGPT-like models, a DeepSpeed-RLHF pipeline that replicates the training pipeline from InstructGPT, and a robust DeepSpeed-RLHF system that combines various optimizations for training and inference in a unified way. The system delivers unparalleled efficiency and scalability, enabling training of models with hundreds of billions of parameters in record time and at a fraction of the cost. With this development, DeepSpeed-Chat paves the way for broader access to advanced RLHF training, even for data scientists with limited resources, thereby fostering innovation and further development in the field of AI.
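Summary
ChatGPT-like models have revolutionized many applications of artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance. However, the field currently lacks an accessible, efficient, and cost-effective RLHF (Reinforcement Learning with Human Feedback) training pipeline, particularly for training at the scale of billions of parameters. This paper introduces DeepSpeed-Chat, a new system that makes RLHF training more accessible. DeepSpeed-Chat offers three key capabilities: an easy-to-use training and inference experience for ChatGPT-like models, a DeepSpeed-RLHF pipeline that replicates the InstructGPT training pipeline, and a robust DeepSpeed-RLHF system that combines multiple optimizations for efficient and scalable training and inference. The system can train models with hundreds of billions of parameters in record time and at a fraction of the cost. With this development, DeepSpeed-Chat opens advanced RLHF training to more data scientists, including those with limited resources, fostering innovation and further development in AI.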
CausalOps – Towards an Industrial Lifecycle for Causal Probabilistic Graphical Models
paper_authors: Robert Maier, Andreas Schlattl, Thomas Guess, Jürgen Mottok
for: This paper aims to provide a novel lifecycle framework for causal model development and application, called CausalOps, to address the gap in a process reference for organizations interested in employing causal engineering.
methods: The paper proposes CausalOps, a lifecycle framework that defines key entities, dependencies, and intermediate artifacts generated during causal engineering, establishing a consistent vocabulary and workflow model.
results: The paper aims to drive the adoption of causal methods in practical applications within interested organizations and the causality community by providing a holistic view of creating and maintaining causal models.
Abstract
Causal probabilistic graph-based models have gained widespread utility, enabling the modeling of cause-and-effect relationships across diverse domains. With their rising adoption in new areas, such as automotive system safety and machine learning, the need for an integrated lifecycle framework akin to DevOps and MLOps has emerged. Currently, a process reference for organizations interested in employing causal engineering is missing. To address this gap and foster widespread industrial adoption, we propose CausalOps, a novel lifecycle framework for causal model development and application. By defining key entities, dependencies, and intermediate artifacts generated during causal engineering, we establish a consistent vocabulary and workflow model. This work contextualizes causal model usage across different stages and stakeholders, outlining a holistic view of creating and maintaining them. CausalOps' aim is to drive the adoption of causal methods in practical applications within interested organizations and the causality community.
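Summary
Causal probabilistic graph-based models are widely used to model cause-and-effect relationships across diverse domains. As they are adopted in new areas, such as automotive system safety and machine learning, the need has emerged for an integrated lifecycle framework akin to DevOps and MLOps. A process reference for organizations interested in causal engineering is currently missing. To fill this gap and promote the practical use of causal methods, we propose CausalOps, a new lifecycle framework for causal model development and application. By defining the key entities, dependencies, and intermediate artifacts of causal engineering, we establish a consistent vocabulary and workflow model. This work places causal model usage in context across different stages and stakeholders, giving a holistic view of creating and maintaining causal models. CausalOps aims to drive the adoption of causal methods within interested organizations and the causality community.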
AI-Enhanced Data Processing and Discovery Crowd Sourcing for Meteor Shower Mapping
results: To date, CAMS has discovered more than 200 new meteor showers and validated a number of previously reported showers.
Abstract
The Cameras for Allsky Meteor Surveillance (CAMS) project, funded by NASA starting in 2010, aims to map our meteor showers by triangulating meteor trajectories detected in low-light video cameras from multiple locations across 16 countries in both the northern and southern hemispheres. Its mission is to validate, discover, and predict the upcoming returns of meteor showers. Our research aimed to streamline the data processing by implementing an automated cloud-based AI-enabled pipeline and improve the data visualization to improve the rate of discoveries by involving the public in monitoring the meteor detections. This article describes the process of automating the data ingestion, processing, and insight generation using an interpretable Active Learning and AI pipeline. This work also describes the development of an interactive web portal (the NASA Meteor Shower portal) to facilitate the visualization of meteor radiant maps. To date, CAMS has discovered over 200 new meteor showers and has validated dozens of previously reported showers.
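Summary
The Cameras for Allsky Meteor Surveillance (CAMS) project, funded by NASA since 2010, aims to map meteor showers by triangulating meteor trajectories captured by low-light video cameras at multiple locations in multiple countries. Its mission is to validate, discover, and predict upcoming returns of meteor showers. Our research aimed to streamline data processing with an automated, cloud-based, AI-enabled pipeline and to improve data visualization, raising the rate of discoveries by involving the public in monitoring meteor detections. This article describes the Active Learning and AI pipeline that automates data ingestion, processing, and insight generation, as well as the development of an interactive NASA Meteor Shower portal for visualizing meteor radiant maps. To date, CAMS has discovered more than 200 new meteor showers and validated dozens of previously reported ones.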
An enhanced motion planning approach by integrating driving heterogeneity and long-term trajectory prediction for automated driving systems
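results: The study finds that the proposed enhanced approach better predicts the driving behavior of surrounding HDVs, improving the navigation capability and safety of automated driving systems.
Abstract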
Navigating automated driving systems (ADSs) through complex driving environments is difficult. Predicting the driving behavior of surrounding human-driven vehicles (HDVs) is a critical component of an ADS. This paper proposes an enhanced motion-planning approach for an ADS in a highway-merging scenario. The proposed enhanced approach utilizes the results of two aspects: the driving behavior and long-term trajectory of surrounding HDVs, which are coupled using a hierarchical model that is used for the motion planning of an ADS to improve driving safety.
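Summary
Navigating automated driving systems (ADSs) through complex driving environments is difficult. Predicting the driving behavior of surrounding human-driven vehicles (HDVs) is a critical component of an ADS. This paper proposes an enhanced motion-planning approach for an ADS in a highway-merging scenario. The approach uses two kinds of results, the driving behavior and the long-term trajectory of surrounding HDVs, and couples them through a hierarchical model used for ADS motion planning to improve driving safety.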
Empirical Translation Process Research: Past and Possible Future Perspectives
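results: FEP/AIF provides a mathematically rigorous foundation that enables modeling of deep temporal architectures in which embedded translation processes unfold on different timelines. The framework opens up prospects for predictive TPR, is likely to contribute to the understanding of human translation processes, and may influence translation studies and the design of cognitive architectures.
Abstract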
Over the past four decades, efforts have been made to develop and evaluate models for Empirical Translation Process Research (TPR), yet a comprehensive framework remains elusive. This article traces the evolution of empirical TPR within the CRITT TPR-DB tradition and proposes the Free Energy Principle (FEP) and Active Inference (AIF) as a framework for modeling deeply embedded translation processes. It introduces novel approaches for quantifying fundamental concepts of Relevance Theory (relevance, s-mode, i-mode), and establishes their relation to the Monitor Model, framing relevance maximization as a special case of minimizing free energy. FEP/AIF provides a mathematically rigorous foundation that enables modeling of deep temporal architectures in which embedded translation processes unfold on different timelines. This framework opens up exciting prospects for future research in predictive TPR, likely to enrich our comprehension of human translation processes, and making valuable contributions to the wider realm of translation studies and the design of cognitive architectures.
More Context, Less Distraction: Visual Classification by Inferring and Conditioning on Contextual Attributes
paper_authors: Bang An, Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Chaithanya Kumar Mummadi, Furong Huang
for: Zero-shot image classification
methods: Uses contextual attributes inferred by CLIP for image classification; no training is required
results: better generalization, group robustness, and better interpretability compared to traditional zero-shot classification methods
Abstract
CLIP, as a foundational vision language model, is widely used in zero-shot image classification due to its ability to understand various visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better zero-shot classification is still an open question. This paper draws inspiration from the human visual perception process: a modern neuroscience view suggests that in classifying an object, humans first infer its class-independent attributes (e.g., background and orientation) which help separate the foreground object from the background, and then make decisions based on this information. Inspired by this, we observe that providing CLIP with contextual attributes improves zero-shot classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer the attributes from an image. With these observations, we propose a training-free, two-step zero-shot classification method named PerceptionCLIP. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioning on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and better interpretability. For example, PerceptionCLIP with ViT-L/14 improves the worst group accuracy by 16.5% on the Waterbirds dataset and by 3.5% on CelebA.
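Summary
CLIP, a foundational vision-language model, is widely used in zero-shot image classification because it understands a wide range of visual concepts and natural language descriptions. However, how to fully leverage CLIP's human-like understanding to achieve better zero-shot classification remains an open question. This paper draws inspiration from human visual perception: a modern neuroscience view holds that, when classifying an object, humans first infer its class-independent attributes (for example, background and orientation), which help separate the object from the background, and then decide based on this information. Inspired by this, we observe that providing CLIP with contextual attributes improves zero-shot classification and reduces reliance on spurious features. We also observe that CLIP itself can reasonably infer these attributes from an image. Based on these observations, we propose a training-free, two-step zero-shot classification method named PerceptionCLIP. Given an image, it first infers contextual attributes (for example, background) and then classifies the object conditioned on them. Our experiments show that PerceptionCLIP improves worst-group accuracy by 16.5% on the Waterbirds dataset and by 3.5% on CelebA.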
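The two-step procedure described above (infer a contextual attribute, then classify conditioned on it) can be sketched without any model at all. The snippet below is a toy illustration under stated assumptions: the similarity scores are made-up numbers standing in for CLIP image-text similarities, the prompt template in the docstring is hypothetical, and pooling over classes with a log-sum-exp is one plausible way to score an attribute, not necessarily the paper's exact rule.

```python
import math


def softmax(scores):
    """Normalize a dict of scores into probabilities."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


def two_step_classify(sim, classes, attributes):
    """sim[(cls, attr)] is an assumed image-text similarity for a prompt
    such as 'a photo of a {cls}, {attr}' (hypothetical template)."""
    # Step 1: infer the contextual attribute, pooling evidence over classes.
    attr_scores = {
        a: math.log(sum(math.exp(sim[(c, a)]) for c in classes))
        for a in attributes
    }
    attr = max(attr_scores, key=attr_scores.get)
    # Step 2: classify the object conditioned on the inferred attribute.
    class_probs = softmax({c: sim[(c, attr)] for c in classes})
    label = max(class_probs, key=class_probs.get)
    return attr, label, class_probs


classes = ["landbird", "waterbird"]
attributes = ["on land", "on water"]
sim = {  # made-up scores for one image
    ("landbird", "on land"): 2.0,
    ("waterbird", "on land"): 1.0,
    ("landbird", "on water"): 3.0,
    ("waterbird", "on water"): 3.5,
}
attr, label, probs = two_step_classify(sim, classes, attributes)
```

Conditioning on the inferred background means the class decision compares prompts that share the same context, which is the intuition behind reducing reliance on spurious background cues.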
results: The article describes the system's design and training procedure, as well as the results of user tests.
Abstract
We present Lode Encoder, a gamified mixed-initiative level creation system for the classic platform-puzzle game Lode Runner. The system is built around several autoencoders which are trained on sets of Lode Runner levels. When fed with the user's design, each autoencoder produces a version of that design which is closer in style to the levels that it was trained on. The Lode Encoder interface allows the user to build and edit levels through 'painting' from the suggestions provided by the autoencoders. Crucially, in order to encourage designers to explore new possibilities, the system does not include more traditional editing tools. We report on the system design and training procedure, as well as on the evolution of the system itself and user tests.
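Summary
We present Lode Encoder, a gamified mixed-initiative level creation system for the classic platform-puzzle game Lode Runner. The system is built around several autoencoders, each trained on a different set of Lode Runner levels. Given the user's design, each autoencoder produces a version of it that is closer in style to the levels it was trained on. The Lode Encoder interface lets the user build and edit levels by "painting" from the suggestions the autoencoders provide. Crucially, to encourage designers to explore new possibilities, the system does not include traditional editing tools. We report on the system design and training procedure, as well as on user tests.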
EmbeddingTree: Hierarchical Exploration of Entity Features in Embedding
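results: In experiments, EmbeddingTree helps users discover nuanced features of data entities, denoise or inject features during embedding training, and generate embeddings for unseen entities.
Abstract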
Embedding learning transforms discrete data entities into continuous numerical representations, encoding features/properties of the entities. Despite the outstanding performance reported from different embedding learning algorithms, few efforts were devoted to structurally interpreting how features are encoded in the learned embedding space. This work proposes EmbeddingTree, a hierarchical embedding exploration algorithm that relates the semantics of entity features with the less-interpretable embedding vectors. An interactive visualization tool is also developed based on EmbeddingTree to explore high-dimensional embeddings. The tool helps users discover nuance features of data entities, perform feature denoising/injecting in embedding training, and generate embeddings for unseen entities. We demonstrate the efficacy of EmbeddingTree and our visualization tool through embeddings generated for industry-scale merchant data and the public 30Music listening/playlists dataset.
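Summary
Embedding learning transforms discrete data entities into continuous numerical representations that encode the entities' features and properties. Although many embedding learning algorithms report outstanding performance, little effort has gone into structurally interpreting how features are encoded in the learned embedding space. This work proposes EmbeddingTree, a hierarchical embedding exploration algorithm that relates the semantics of entity features to the embedding vectors. We also develop an interactive visualization tool based on EmbeddingTree for exploring high-dimensional embeddings. The tool helps users discover nuanced features of data entities, denoise or inject features during embedding training, and generate embeddings for unseen entities. We demonstrate EmbeddingTree and the visualization tool on embeddings generated for industry-scale merchant data and the public 30Music listening/playlists dataset.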
Flows: Building Blocks of Reasoning and Collaborating AI
paper_authors: Martin Josifoski, Lars Klein, Maxime Peyrard, Yifei Li, Saibo Geng, Julian Paul Schnitzler, Yuxing Yao, Jiheng Wei, Debjit Paul, Robert West
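for: The paper aims to develop a principled way of designing and studying structured interactions among multiple AI systems and humans.
results: On competitive coding, structured reasoning and collaboration bring further improvements, with AI-only Flows adding +$21$ and human-AI Flows adding +$54$ absolute points in solve rate.
Abstract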
Recent advances in artificial intelligence (AI) have produced highly capable and controllable systems. This creates unprecedented opportunities for structured reasoning as well as collaboration among multiple AI systems and humans. To fully realize this potential, it is essential to develop a principled way of designing and studying such structured interactions. For this purpose, we introduce the conceptual framework of Flows: a systematic approach to modeling complex interactions. Flows are self-contained building blocks of computation, with an isolated state, communicating through a standardized message-based interface. This modular design allows Flows to be recursively composed into arbitrarily nested interactions, with a substantial reduction of complexity. Crucially, any interaction can be implemented using this framework, including prior work on AI--AI and human--AI interactions, prompt engineering schemes, and tool augmentation. We demonstrate the potential of Flows on the task of competitive coding, a challenging task on which even GPT-4 struggles. Our results suggest that structured reasoning and collaboration substantially improve generalization, with AI-only Flows adding +$21$ and human--AI Flows adding +$54$ absolute points in terms of solve rate. To support rapid and rigorous research, we introduce the aiFlows library. The library comes with a repository of Flows that can be easily used, extended, and composed into novel, more complex Flows. The aiFlows library is available at https://github.com/epfl-dlab/aiflows. Data and Flows for reproducing our experiments are available at https://github.com/epfl-dlab/cc_flows.
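Summary
Recent advances in artificial intelligence (AI) have produced highly capable and controllable systems. This creates unprecedented opportunities for structured reasoning and collaboration among multiple AI systems and humans. To realize this potential, we introduce the conceptual framework of Flows: a systematic approach to designing and studying such structured interactions. Flows are self-contained building blocks with an isolated state and a standardized message-based interface. This modular design allows Flows to be recursively composed into arbitrarily nested interactions, which reduces complexity. These interactions can include prior work on AI-AI and human-AI interaction, prompt engineering schemes, and tool augmentation. We demonstrate the potential of Flows on competitive coding, a task on which even GPT-4 struggles. Our results suggest that structured reasoning and collaboration improve generalization, with AI-only Flows adding +$21$ and human-AI Flows adding +$54$ absolute points in solve rate. To support rapid and rigorous research, we introduce the aiFlows library, which includes a repository of Flows that can be easily used, extended, and composed into more complex Flows. The aiFlows library is available at https://github.com/epfl-dlab/aiflows. Data and Flows for reproducing our experiments are available at https://github.com/epfl-dlab/cc_flows.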
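The idea of self-contained building blocks with isolated state, a message-based interface, and recursive composition can be illustrated with a small sketch. The class names `AtomicFlow` and `SequentialFlow`, the dict-based messages, and the stand-in step functions below are hypothetical and written only for this illustration; they are not the aiFlows library API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Message = Dict[str, object]  # assumed message format: a plain dict


@dataclass
class AtomicFlow:
    """A leaf block: one step function plus its own isolated state."""
    name: str
    step: Callable[[Message, dict], Message]
    state: dict = field(default_factory=dict)

    def run(self, message: Message) -> Message:
        return self.step(message, self.state)


@dataclass
class SequentialFlow:
    """A composite block: runs sub-flows in order, passing messages along.
    Sub-flows may themselves be composites, so nesting is arbitrary."""
    name: str
    subflows: List[object]

    def run(self, message: Message) -> Message:
        for flow in self.subflows:
            message = flow.run(message)
        return message


# Stand-ins for what would be LLM or human calls in a real system.
def plan(msg, state):
    state["calls"] = state.get("calls", 0) + 1
    return {**msg, "plan": f"outline for {msg['task']}"}


def code(msg, state):
    return {**msg, "solution": f"code following: {msg['plan']}"}


def review(msg, state):
    return {**msg, "approved": "solution" in msg}


planner = AtomicFlow("planner", plan)
inner = SequentialFlow("plan_then_code", [planner, AtomicFlow("coder", code)])
outer = SequentialFlow("with_review", [inner, AtomicFlow("reviewer", review)])
result = outer.run({"task": "two-sum"})
```

Because every block exposes the same `run(message)` interface, the composite `inner` can be dropped into `outer` exactly like a leaf, which is the property that makes nested interactions manageable.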
Fighting Fire with Fire: Can ChatGPT Detect AI-generated Text?
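results: The study finds that ChatGPT's detection performance on AI-generated and human-written text is similar, although it may be disrupted in some cases. The results offer insight into how ChatGPT and similar large language models can be used in automated detection pipelines.
Abstract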
Large language models (LLMs) such as ChatGPT are increasingly being used for various use cases, including text content generation at scale. Although detection methods for such AI-generated text exist already, we investigate ChatGPT's performance as a detector on such AI-generated text, inspired by works that use ChatGPT as a data labeler or annotator. We evaluate the zero-shot performance of ChatGPT in the task of human-written vs. AI-generated text detection, and perform experiments on publicly available datasets. We empirically investigate if ChatGPT is symmetrically effective in detecting AI-generated or human-written text. Our findings provide insight on how ChatGPT and similar LLMs may be leveraged in automated detection pipelines by simply focusing on solving a specific aspect of the problem and deriving the rest from that solution. All code and data is available at https://github.com/AmritaBh/ChatGPT-as-Detector.
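Summary
Large language models (LLMs) such as ChatGPT are increasingly used in many settings, including large-scale text generation. Although detection methods for AI-generated text already exist, we study ChatGPT's performance as a detector of such text, inspired by work that uses ChatGPT as a data labeler or annotator. We evaluate zero-shot performance on publicly available datasets for the task of distinguishing human-written from AI-generated text, and we empirically investigate whether ChatGPT is symmetrically effective at detecting each. Our findings may help in using ChatGPT and similar LLMs in automated detection pipelines by focusing on solving one specific aspect of the problem and deriving the rest from that solution. All code and data are available at https://github.com/AmritaBh/ChatGPT-as-Detector.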
BRNES: Enabling Security and Privacy-aware Experience Sharing in Multiagent Robotic and Autonomous Systems
for: The paper is written to address the issues of adversarial manipulation and inference in multi-agent reinforcement learning (MARL) with experience sharing (ES).
methods: The proposed framework, called BRNES, uses a dynamic neighbor zone selection and weighted experience aggregation to reduce the impact of Byzantine attacks. It also employs local differential privacy (LDP) to protect the agents’ private information from adversarial inference attacks.
results: The proposed framework outperforms the state-of-the-art in terms of steps to goal, obtained reward, and time to goal metrics. Specifically, it is 8.32x faster than non-private frameworks and 1.41x faster than private frameworks in an adversarial setting.
Abstract
Although experience sharing (ES) accelerates multiagent reinforcement learning (MARL) in an advisor-advisee framework, attempts to apply ES to decentralized multiagent systems have so far relied on trusted environments and overlooked the possibility of adversarial manipulation and inference. Nevertheless, in a real-world setting, some Byzantine attackers, disguised as advisors, may provide false advice to the advisee and catastrophically degrade the overall learning performance. Also, an inference attacker, disguised as an advisee, may conduct several queries to infer the advisors' private information and make the entire ES process questionable in terms of privacy leakage. To address and tackle these issues, we propose a novel MARL framework (BRNES) that heuristically selects a dynamic neighbor zone for each advisee at each learning step and adopts a weighted experience aggregation technique to reduce Byzantine attack impact. Furthermore, to keep the agent's private information safe from adversarial inference attacks, we leverage the local differential privacy (LDP)-induced noise during the ES process. Our experiments show that our framework outperforms the state-of-the-art in terms of the steps to goal, obtained reward, and time to goal metrics. Particularly, our evaluation shows that the proposed framework is 8.32x faster than the current non-private frameworks and 1.41x faster than the private frameworks in an adversarial setting.
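Summary
Although experience sharing (ES) accelerates multiagent reinforcement learning (MARL) in an advisor-advisee framework, attempts to apply ES to decentralized multiagent systems have relied on trusted environments and overlooked the possibility of adversarial manipulation and inference. In a real-world setting, however, Byzantine attackers disguised as advisors may give the advisee false advice and severely degrade overall learning performance. In addition, an inference attacker disguised as an advisee may issue multiple queries to infer the advisors' private information, raising privacy leakage concerns for the entire ES process. To address these issues, we propose a new MARL framework, BRNES, which heuristically selects a dynamic neighbor zone for each advisee and uses weighted experience aggregation to reduce the impact of attacks. Furthermore, noise induced by local differential privacy (LDP) during the ES process protects the agents' private information from adversarial inference attacks. Our experiments show that, in an adversarial setting, our framework is 8.32x faster than existing non-private frameworks and 1.41x faster than private frameworks.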
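The two defenses named above, LDP-induced noise on shared experience and weighted aggregation of advice, can be sketched in a few lines. This is a minimal illustration under assumptions: the shared experience is taken to be a list of Q-values, the noise follows the standard Laplace mechanism with scale `sensitivity / epsilon`, and the advisor weights and `own_weight` blend are hypothetical placeholders. The abstract does not specify BRNES's actual weighting or neighbor-zone rules.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(q_values, epsilon, sensitivity, rng):
    """Advisor side: add Laplace noise to Q-values before sharing them."""
    scale = sensitivity / epsilon
    return [q + laplace_noise(scale, rng) for q in q_values]


def weighted_aggregate(own_q, advice, weights, own_weight=0.5):
    """Advisee side: blend own Q-values with a weighted mean of the advice.
    A weight of 0 removes an advisor's influence entirely."""
    total = sum(weights)
    if total == 0:
        return list(own_q)
    blended = []
    for a in range(len(own_q)):
        advised = sum(w * adv[a] for w, adv in zip(weights, advice)) / total
        blended.append(own_weight * own_q[a] + (1 - own_weight) * advised)
    return blended


rng = random.Random(0)
noisy_advice = privatize([1.0, 2.0], epsilon=1.0, sensitivity=1.0, rng=rng)

# Third advisor sends implausible values and is given zero weight.
advice = [[1.0, 2.0], [3.0, 4.0], [100.0, -100.0]]
q_new = weighted_aggregate([0.0, 0.0], advice, weights=[1, 1, 0])
```

A smaller `epsilon` gives stronger privacy but noisier advice, so the aggregation weights also serve to dampen whatever distortion the noise introduces.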
A Probabilistic Approach to Self-Supervised Learning using Cyclical Stochastic Gradient MCMC
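results: By exploring an expressive posterior over the representations, Bayesian self-supervised learning produces interpretable and diverse representations. It achieves significant gains in performance, calibration, and out-of-distribution detection on a variety of downstream classification tasks.
Abstract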
In this paper we present a practical Bayesian self-supervised learning method with Cyclical Stochastic Gradient Hamiltonian Monte Carlo (cSGHMC). Within this framework, we place a prior over the parameters of a self-supervised learning model and use cSGHMC to approximate the high dimensional and multimodal posterior distribution over the embeddings. By exploring an expressive posterior over the embeddings, Bayesian self-supervised learning produces interpretable and diverse representations. Marginalizing over these representations yields a significant gain in performance, calibration and out-of-distribution detection on a variety of downstream classification tasks. We provide experimental results on multiple classification tasks on four challenging datasets. Moreover, we demonstrate the effectiveness of the proposed method in out-of-distribution detection using the SVHN and CIFAR-10 datasets.
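Summary
In this paper we present a practical Bayesian self-supervised learning method based on Cyclical Stochastic Gradient Hamiltonian Monte Carlo (cSGHMC). Within this framework, we place a prior over the parameters of a self-supervised learning model and use cSGHMC to approximate the high-dimensional, multimodal posterior distribution over the embeddings. By exploring this posterior, Bayesian self-supervised learning produces interpretable and diverse representations. Marginalizing over these representations yields significant gains in performance, calibration, and out-of-distribution detection on downstream classification tasks. We report experiments on multiple classification tasks across several datasets, and on out-of-distribution detection using the SVHN and CIFAR-10 datasets.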
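Cyclical stochastic-gradient MCMC methods are usually driven by a step size that restarts periodically: large steps early in each cycle move the sampler toward a new mode, and small steps late in the cycle collect samples around it. A minimal sketch of the cosine schedule commonly used in the cyclical SG-MCMC literature is shown below; that this paper uses exactly this schedule, and the values of `total_iters`, `num_cycles`, and `alpha0`, are assumptions made for illustration.

```python
import math


def cyclical_step_size(k, total_iters, num_cycles, alpha0):
    """Cosine cyclical step size for iteration k (1-indexed).

    alpha_k = alpha0 / 2 * (cos(pi * mod(k - 1, L) / L) + 1),
    where L = ceil(total_iters / num_cycles) is the cycle length.
    """
    cycle_len = math.ceil(total_iters / num_cycles)
    position = ((k - 1) % cycle_len) / cycle_len  # progress in [0, 1)
    return 0.5 * alpha0 * (math.cos(math.pi * position) + 1.0)


# 100 iterations split into 4 cycles of 25 steps each.
schedule = [cyclical_step_size(k, 100, 4, 0.1) for k in range(1, 101)]
```

In this schedule the step size returns to `alpha0` at iterations 1, 26, 51, and 76, which is what lets the sampler visit several posterior modes instead of settling in one.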
Exploring the psychology of GPT-4’s Moral and Legal Reasoning
paper_authors: Guilherme F. C. F. Almeida, José Luiz Nunes, Neele Engelmann, Alex Wiegmann, Marcelo de Araújo
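for: The paper studies how the large language model GPT-4 simulates human reasoning in moral and legal decisions.
methods: The authors use methods from psychology to probe GPT-4's moral and legal reasoning.
results: The study finds high correlations between GPT-4 and humans on intentionality ascriptions, judgments about causation, the morality of deception, moral foundations, the impact of moral luck on legal judgments, consent, and rule violation judgments, but also several significant systematic differences.
Abstract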
Large language models have been used as the foundation of highly sophisticated artificial intelligences, capable of delivering human-like responses to probes about legal and moral issues. However, these models are unreliable guides to their own inner workings, and even the engineering teams behind their creation are unable to explain exactly how they came to develop all of the capabilities they currently have. The emerging field of machine psychology seeks to gain insight into the processes and concepts that these models possess. In this paper, we employ the methods of psychology to probe into GPT-4's moral and legal reasoning. More specifically, we investigate the similarities and differences between GPT-4 and humans when it comes to intentionality ascriptions, judgments about causation, the morality of deception, moral foundations, the impact of moral luck on legal judgments, the concept of consent, and rule violation judgments. We find high correlations between human and AI responses, but also several significant systematic differences between them. We conclude with a discussion of the philosophical implications of our findings.
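Summary
Large language models have been used as the foundation of sophisticated artificial intelligences that can give human-like responses to questions about legal and moral issues. However, these models are unreliable guides to their own inner workings, and even the teams that built them cannot fully explain how they acquired all of their current capabilities. The emerging field of machine psychology seeks to understand the processes and concepts these models possess. In this paper, we use the methods of psychology to probe GPT-4's moral and legal reasoning. More specifically, we study the similarities and differences between humans and GPT-4 with respect to intentionality ascriptions, judgments about causation, the morality of deception, moral foundations, the impact of moral luck on legal judgments, the concept of consent, and rule violation judgments. We find high correlations between human and AI responses, but also several significant systematic differences. We conclude with a discussion of the philosophical implications of our findings.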
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
paper_authors: Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy
for: The paper is written to address the issue of large language models following malicious instructions and generating toxic content, and to propose a new test suite called XSTest to identify eXaggerated Safety behaviors in a structured and systematic way.
methods: The paper uses a new test suite called XSTest, which comprises 200 safe prompts across ten prompt types, to evaluate the safety behaviors of a recently-released state-of-the-art language model.
results: The paper highlights systematic failure modes in the language model, demonstrating that it is not well-calibrated and tends to refuse complying with safe prompts that use similar language to unsafe prompts or mention sensitive topics.
Abstract
Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse complying with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a structured and systematic way. In its current form, XSTest comprises 200 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with. We describe XSTest's creation and composition, and use the test suite to highlight systematic failure modes in a recently-released state-of-the-art language model.
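results: Through experiments covering various languages, the authors validate the method on diverse multilingual tasks, including speech-to-speech translation, multilingual text-to-speech synthesis, and text-to-speech translation. They also show that the method can perform many-to-many language speech-to-speech translation, which had not previously been explored in the literature. Samples are available at https://choijeongsoo.github.io/utut.
Abstract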
In this paper, we propose a method to learn unified representations of multilingual speech and text with a single model, especially focusing on the purpose of speech synthesis. We represent multilingual speech audio with speech units, the quantized representations of speech features encoded from a self-supervised speech model. Therefore, we can focus on their linguistic content by treating the audio as pseudo text and can build a unified representation of speech and text. Then, we propose to train an encoder-decoder structured model with a Unit-to-Unit Translation (UTUT) objective on multilingual data. Specifically, by conditioning the encoder with the source language token and the decoder with the target language token, the model is optimized to translate the spoken language into that of the target language, in a many-to-many language translation setting. Therefore, the model can build the knowledge of how spoken languages are comprehended and how to relate them to different languages. A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST). By conducting comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks. Moreover, we show UTUT can perform many-to-many language STS, which has not been previously explored in the literature. Samples are available on https://choijeongsoo.github.io/utut.
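Summary
In this paper, we propose a method for learning unified representations of multilingual speech and text with a single model, focusing in particular on speech synthesis. We represent multilingual speech audio with speech units, the quantized representations of speech features encoded by a self-supervised speech model. Treating the audio as pseudo text lets us focus on its linguistic content and build a unified representation of speech and text. We then train an encoder-decoder model with a Unit-to-Unit Translation (UTUT) objective on multilingual data. Specifically, by conditioning the encoder on the source language token and the decoder on the target language token, the model is optimized to translate spoken language into the target language in a many-to-many translation setting. The model thereby learns how spoken languages are comprehended and how to relate them to one another. A single model pre-trained with UTUT can be used for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST). Comprehensive experiments across various languages validate the effectiveness of the proposed method. We also show that UTUT can perform many-to-many language STS, which has not previously been explored in the literature. Samples are available at https://choijeongsoo.github.io/utut.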
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
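for: This study investigates how pre-training loss, supervised data amount, and augmented data amount influence the mathematical reasoning performance of a supervised large language model (LLM).
results: Pre-training loss is a better indicator of an LLM's performance than its parameter count. Supervised fine-tuning (SFT) with different amounts of supervised data shows a log-linear relation between data amount and model performance, and better models improve less as the supervised dataset grows. Rejection sampling Fine-Tuning (RFT) augments the data without human effort and improves mathematical reasoning, especially for less performant LLMs. Combining rejection samples from multiple models pushes LLaMA-7B to an accuracy of 49.3%, significantly outperforming the SFT accuracy of 35.9%.
Abstract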
Mathematical reasoning is a challenging task for large language models (LLMs), while the scaling relationship of it with respect to LLM capacity is under-explored. In this paper, we investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM. We find that pre-training loss is a better indicator of the model's performance than the model's parameter count. We apply supervised fine-tuning (SFT) with different amounts of supervised data and empirically find a log-linear relation between data amount and model performance, and we find better models improve less with enlarged supervised datasets. To augment more data samples for improving model performances without any human effort, we propose to apply Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as augmented fine-tuning datasets. We find with augmented samples containing more distinct reasoning paths, RFT improves mathematical reasoning performance more for LLMs. We also find RFT brings more improvement for less performant LLMs. Furthermore, we combine rejection samples from multiple models which push LLaMA-7B to an accuracy of 49.3% and outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.
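Summary
Mathematical reasoning is a challenging task for large language models (LLMs), and how it scales with LLM capacity is under-explored. In this paper, we investigate how pre-training loss, supervised data amount, and augmented data amount influence the reasoning performance of a supervised LLM. We find that pre-training loss is a better indicator of model performance than parameter count. Applying supervised fine-tuning (SFT) with different amounts of supervised data, we find that performance rises with data amount following a log-linear relation, and that better models improve less from larger supervised datasets. To obtain more data samples without human effort, we propose Rejection sampling Fine-Tuning (RFT), which uses supervised models to generate and collect correct reasoning paths as an augmented fine-tuning dataset. When the augmented samples contain more distinct reasoning paths, RFT improves the reasoning performance of LLMs more, and it brings larger gains for less performant LLMs. Finally, combining rejection samples from multiple models raises the accuracy of LLaMA-7B to 49.3%, significantly above the SFT accuracy of 35.9%.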
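The filtering step behind RFT, as described in the abstract, can be sketched as follows: sample several reasoning paths per question, keep only those that reach the correct answer, and keep each distinct path once. This is a minimal sketch under stated assumptions, not the authors' implementation. Generation is left out (candidates are passed in), and the whitespace-normalized de-duplication key, the function name, and the toy data are all hypothetical.

```python
from typing import Dict, List, Tuple

def rejection_sample(
    candidates: Dict[str, List[Tuple[str, str]]],
    references: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Keep correct, distinct reasoning paths for each question.

    candidates: question -> list of (reasoning_path, final_answer)
    references: question -> reference final answer
    Returns (question, reasoning_path) pairs for fine-tuning.
    """
    augmented = []
    for question, samples in candidates.items():
        seen = set()
        for path, answer in samples:
            if answer.strip() != references[question].strip():
                continue  # reject paths that end in a wrong answer
            key = " ".join(path.split())  # hypothetical de-duplication key
            if key in seen:
                continue  # keep each distinct path only once
            seen.add(key)
            augmented.append((question, path))
    return augmented

# Invented toy data, for illustration only.
candidates = {
    "What is 3 + 4 * 2?": [
        ("4 * 2 = 8; 3 + 8 = 11", "11"),
        ("4 * 2 = 8;  3 + 8 = 11", "11"),  # duplicate up to whitespace
        ("3 + 4 = 7; 7 * 2 = 14", "14"),   # wrong answer, rejected
        ("2 * 4 = 8; 8 + 3 = 11", "11"),   # distinct correct path
    ]
}
references = {"What is 3 + 4 * 2?": "11"}
rft_data = rejection_sample(candidates, references)
```

Pooling candidates from several models before filtering would raise the number of distinct correct paths, which is the effect the abstract attributes to combining rejection samples.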
Lexicon and Rule-based Word Lemmatization Approach for the Somali Language
results: Tested on 120 documents, the algorithm achieves an accuracy of 57% on relatively long documents (e.g. full news articles), 60.57% on news article extracts, and 95.87% on short texts such as social media messages.
Abstract
Lemmatization is a Natural Language Processing (NLP) technique used to normalize text by changing morphological derivations of words to their root forms. It is used as a core pre-processing step in many NLP tasks including text indexing, information retrieval, and machine learning for NLP, among others. This paper pioneers the development of text lemmatization for the Somali language, a low-resource language with very limited or no prior effective adoption of NLP methods and datasets. We especially develop a lexicon and rule-based lemmatizer for Somali text, which is a starting point for a full-fledged Somali lemmatization system for various NLP tasks. With consideration of the language morphological rules, we have developed an initial lexicon of 1247 root words and 7173 derivationally related terms enriched with rules for lemmatizing words not present in the lexicon. We have tested the algorithm on 120 documents of various lengths including news articles, social media posts, and text messages. Our initial results demonstrate that the algorithm achieves an accuracy of 57% for relatively long documents (e.g. full news articles), 60.57% for news article extracts, and high accuracy of 95.87% for short texts such as social media messages.
Summary
Lemmatization is a Natural Language Processing (NLP) technique that normalizes text by changing morphological derivations of words to their root forms. It is a core pre-processing step in many NLP tasks, including text indexing, information retrieval, and machine learning for NLP. This paper pioneers text lemmatization for Somali, a low-resource language with very limited or no prior effective adoption of NLP methods and datasets. We develop a lexicon and rule-based lemmatizer for Somali text as a starting point for a full-fledged Somali lemmatization system. Taking the morphological rules of the language into account, we built an initial lexicon of 1247 root words and 7173 derivationally related terms, enriched with rules for lemmatizing words that are not in the lexicon. We tested the algorithm on 120 documents, including news articles, social media posts, and text messages. Initial results show an accuracy of 57% on relatively long documents (e.g. full news articles), 60.57% on news article extracts, and 95.87% on short texts such as social media messages.
Does Correction Remain A Problem For Large Language Models?
results: The paper finds that correction remains a problem for large language models, but that it can be addressed with few-shot learning techniques and GPT-like models.
Abstract
As large language models, such as GPT, continue to advance the capabilities of natural language processing (NLP), the question arises: does the problem of correction still persist? This paper investigates the role of correction in the context of large language models by conducting two experiments. The first experiment focuses on correction as a standalone task, employing few-shot learning techniques with GPT-like models for error correction. The second experiment explores the notion of correction as a preparatory task for other NLP tasks, examining whether large language models can tolerate and perform adequately on texts containing certain levels of noise or errors. By addressing these experiments, we aim to shed light on the significance of correction in the era of large language models and its implications for various NLP applications.
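Summary
As large language models such as GPT continue to advance the capabilities of natural language processing (NLP), the question arises: does the problem of correction still persist? This paper studies the role of correction in the context of large language models through two experiments. The first focuses on correction as a standalone task, using few-shot learning with GPT-like models for error correction. The second treats correction as a preparatory task for other NLP tasks, examining whether large language models can tolerate and perform adequately on texts containing certain levels of noise or errors. Through these experiments, we aim to shed light on the significance of correction in the era of large language models and its implications for various NLP applications.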
Supply chain emission estimation using large language models
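results: The domain-adapted foundation model outperforms state-of-the-art text mining techniques and performs as well as a subject matter expert (SME). The framework could accelerate Scope 3 estimation at enterprise scale and help enterprises take appropriate climate action to achieve SDG 13.
Abstract
Large enterprises face a crucial imperative to achieve the Sustainable Development Goals (SDGs), especially goal 13, which focuses on combating climate change and its impacts. To mitigate the effects of climate change, reducing enterprise Scope 3 (supply chain emissions) is vital, as it accounts for more than 90% of total emission inventories. However, tracking Scope 3 emissions proves challenging, as data must be collected from thousands of upstream and downstream suppliers. To address the above-mentioned challenges, we propose a first-of-a-kind framework that uses domain-adapted NLP foundation models to estimate Scope 3 emissions, by utilizing financial transactions as a proxy for purchased goods and services. We compared the performance of the proposed framework with state-of-the-art text classification models such as TF-IDF, word2Vec, and zero-shot learning. Our results show that the domain-adapted foundation model outperforms state-of-the-art text mining techniques and performs as well as a subject matter expert (SME). The proposed framework could accelerate Scope 3 estimation at enterprise scale and will help to take appropriate climate actions to achieve SDG 13.
Summary
Large enterprises face a crucial imperative to achieve the Sustainable Development Goals (SDGs), especially goal 13, which focuses on combating climate change and its impacts. To mitigate the effects of climate change, reducing enterprise Scope 3 (supply chain) emissions is vital, as they account for more than 90% of total emission inventories. Tracking Scope 3 emissions is challenging, however, because data must be collected from thousands of upstream and downstream suppliers. To address this, we propose a novel framework that uses domain-adapted NLP foundation models to estimate Scope 3 emissions, using financial transactions as a proxy for purchased goods and services. We compare it with state-of-the-art text classification models, including TF-IDF, word2Vec, and zero-shot learning. Our results show that the domain-adapted foundation model outperforms state-of-the-art text mining techniques and performs as well as a subject matter expert (SME). The proposed framework could accelerate Scope 3 estimation at enterprise scale and help enterprises take climate action toward SDG 13.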
Ambient Adventures: Teaching ChatGPT on Developing Complex Stories
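results: The study indicates that, using the story generation capability of large language models with human-written prompts, an agent can carry out imaginary play in a text adventure game that simulates a house as the playground.
Abstract
Imaginative play is an area of creativity that could allow robots to engage with the world around them in a much more personified way. Imaginary play can be seen as taking real objects and locations and using them as imaginary objects and locations in virtual scenarios. We adopted the story generation capability of large language models (LLMs) to obtain the stories used for imaginary play with human-written prompts. Those generated stories will be simplified and mapped into action sequences that can guide the agent in imaginary play. To evaluate whether the agent can successfully finish the imaginary play, we also designed a text adventure game to simulate a house as the playground for the agent to interact with.
Summary
Imaginative play is an area of creativity that could let robots engage with the world around them in a more personified way. Imaginary play can be seen as taking real objects and locations and using them as imaginary objects and locations in virtual scenarios. We use the story generation capability of large language models (LLMs), with human-written prompts, to generate stories. The generated stories are simplified and mapped into action sequences that guide the agent in imaginary play. To evaluate whether the agent can successfully complete the imaginary play, we also designed a text adventure game that simulates a house as the playground.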
Baby’s CoThought: Leveraging Large Language Models for Enhanced Reasoning in Compact Models
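methods: The pipeline uses GPT-3.5-turbo to restructure a dataset of less than 100M in size, and the BabyLM is then pretrained in a RoBERTa (Liu et al., 2019) fashion.
results: Evaluated on 4 benchmarks, the BabyLM outperforms RoBERTa-base on 10 linguistic, NLU, and question answering tasks by more than 3 points, showing a superior ability to extract contextual information.
Abstract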
Large Language Models (LLMs) demonstrate remarkable performance on a variety of Natural Language Understanding (NLU) tasks, primarily due to their in-context learning ability. This ability is utilized in our proposed "CoThought" pipeline, which efficiently trains smaller "baby" language models (BabyLMs) by leveraging the Chain of Thought (CoT) prompting of LLMs. Our pipeline restructures a dataset of less than 100M in size using GPT-3.5-turbo, transforming it into task-oriented, human-readable texts that are comparable to the school texts for language learners. The BabyLM is then pretrained on this restructured dataset in a RoBERTa (Liu et al., 2019) fashion. In evaluations across 4 benchmarks, our BabyLM outperforms the RoBERTa-base in 10 linguistic, NLU, and question answering tasks by more than 3 points, showing superior ability to extract contextual information. These results suggest that compact LMs pretrained on small, LLM-restructured data can better understand tasks and achieve improved performance. The code for data processing and model training is available at: https://github.com/oooranz/Baby-CoThought.
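Summary
Large Language Models (LLMs) perform remarkably well on a variety of Natural Language Understanding (NLU) tasks, primarily because of their in-context learning ability. Our proposed "CoThought" pipeline uses the Chain of Thought (CoT) prompting of LLMs to efficiently train smaller "baby" language models (BabyLMs). We use GPT-3.5-turbo to restructure a dataset, transforming it into task-oriented, human-readable texts comparable to school texts for language learners. The BabyLM is then pretrained on this restructured dataset in a RoBERTa (Liu et al., 2019) fashion. Evaluated on 4 benchmarks, our BabyLM outperforms RoBERTa-base on 10 linguistic, NLU, and question answering tasks by more than 3 points, showing a superior ability to extract contextual information. These results suggest that compact LMs pretrained on small, LLM-restructured data can better understand tasks and achieve improved performance. The code is available on GitHub: https://github.com/oooranz/Baby-CoThought.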
Evaluating ChatGPT text-mining of clinical records for obesity monitoring
paper_authors: Ivo S. Fins, Heather Davies, Sean Farrell, Jose R. Torres, Gina Pinchbeck, Alan D. Radford, Peter-John Noble
for: The paper aims to compare the ability of a large language model (ChatGPT) and a previously developed regular expression (RegexT) to identify overweight body condition scores (BCS) in veterinary narratives.
methods: The study uses 4,415 anonymized clinical narratives, with BCS values extracted using either RegexT or by appending the narrative to a prompt sent to ChatGPT and manually reviewing the results.
results: The paper finds that the precision of RegexT (100%, 95% CI 94.81-100%) was higher than that of ChatGPT (89.3%, 95% CI 82.75-93.64%), but the recall of ChatGPT (100%, 95% CI 96.18-100%) was considerably higher than that of RegexT (72.6%, 95% CI 63.92-79.94%).
Abstract
Background: Veterinary clinical narratives remain a largely untapped resource for addressing complex diseases. Here we compare the ability of a large language model (ChatGPT) and a previously developed regular expression (RegexT) to identify overweight body condition scores (BCS) in veterinary narratives. Methods: BCS values were extracted from 4,415 anonymised clinical narratives using either RegexT or by appending the narrative to a prompt sent to ChatGPT, coercing the model to return the BCS information. Data were manually reviewed for comparison. Results: The precision of RegexT was higher (100%, 95% CI 94.81-100%) than that of ChatGPT (89.3%, 95% CI 82.75-93.64%). However, the recall of ChatGPT (100%, 95% CI 96.18-100%) was considerably higher than that of RegexT (72.6%, 95% CI 63.92-79.94%). Limitations: Subtle prompt engineering is needed to improve ChatGPT output. Conclusions: Large language models create diverse opportunities and, whilst complex, present an intuitive interface to information but require careful implementation to avoid unpredictable errors.
Summary
Background: Veterinary clinical narratives remain a largely untapped resource for addressing complex diseases. Here we compare the ability of a large language model (ChatGPT) and a previously developed regular expression (RegexT) to identify overweight body condition scores (BCS) in veterinary narratives. Methods: BCS values were extracted from 4,415 anonymised clinical narratives using either RegexT or by appending the narrative to a prompt that asks ChatGPT to return the BCS information. Data were manually reviewed for comparison. Results: The precision of RegexT (100%, 95% CI 94.81-100%) was higher than that of ChatGPT (89.3%, 95% CI 82.75-93.64%), but the recall of ChatGPT (100%, 95% CI 96.18-100%) was considerably higher than that of RegexT (72.6%, 95% CI 63.92-79.94%). Limitations: Subtle prompt engineering is needed to improve ChatGPT output. Conclusions: Large language models create diverse opportunities and, although complex, offer an intuitive interface to information, but they require careful implementation to avoid unpredictable errors.
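The precision and recall trade-off reported here can be illustrated with a toy regular-expression extractor. This is only a sketch: the pattern below is hypothetical and is not the paper's RegexT, the overweight threshold of 6 on a 9-point scale is an assumption, and the narratives are invented.

```python
import re

# Hypothetical pattern, NOT the paper's RegexT: matches e.g. "BCS 7/9" or "bcs: 6/9".
BCS_PATTERN = re.compile(r"\bBCS\s*:?\s*(\d)\s*/\s*9\b", re.IGNORECASE)

def is_overweight(narrative, threshold=6):
    """Return True if a BCS at or above the assumed threshold is found."""
    match = BCS_PATTERN.search(narrative)
    return bool(match) and int(match.group(1)) >= threshold

def precision_recall(predicted, actual):
    """Compute precision and recall from parallel lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented narratives with manual labels (True = overweight).
narratives = [
    ("Vaccination. BCS 7/9, advised diet.", True),
    ("Lameness exam. bcs: 4/9.", False),
    ("Owner reports weight gain, body condition score seven out of nine.", True),
    ("Dental check. BCS 6/9.", True),
]
predicted = [is_overweight(text) for text, _ in narratives]
actual = [label for _, label in narratives]
precision, recall = precision_recall(predicted, actual)
```

In this toy set the pattern never fires on a non-overweight record but misses the score written out in words, which mirrors the qualitative pattern in the abstract: a strict regular expression tends toward high precision and lower recall.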
BioBERT Based SNP-traits Associations Extraction from Biomedical Literature
results: On the SNPPhenA dataset, the BioBERT-GRU method performs better than previous machine learning and deep learning based methods, with a precision of 0.883, a recall of 0.882, and an F1-score of 0.881.
Abstract
Scientific literature contains a considerable amount of information that provides an excellent opportunity for developing text mining methods to extract biomedical relationships. An important type of information is the relationship between single nucleotide polymorphisms (SNPs) and traits. In this paper, we present a BioBERT-GRU method to identify SNP-trait associations. Based on the evaluation of our method on the SNPPhenA dataset, it is concluded that this new method performs better than previous machine learning and deep learning based methods. BioBERT-GRU achieved a precision of 0.883, recall of 0.882 and F1-score of 0.881.
Summary
The scientific literature contains a large amount of information, which provides an excellent opportunity to develop text mining methods for extracting biomedical relationships. One important type is the relationship between single nucleotide polymorphisms (SNPs) and traits. In this paper, we present a BioBERT-GRU method for identifying SNP-trait associations. Evaluation on the SNPPhenA dataset shows that the new method performs better than previous machine learning and deep learning based methods. BioBERT-GRU achieved a precision of 0.883, a recall of 0.882, and an F1-score of 0.881.
Multimodal Neurons in Pretrained Text-Only Transformers
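results: When a frozen text transformer is augmented with vision, the language model can map image representations into text, and this translation between modalities occurs deep within the transformer. The study also identifies "multimodal neurons" that convert visual representations into corresponding text descriptions and have a systematic causal effect on image captioning.
Abstract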
Language models demonstrate remarkable capacity to generalize representations learned in one modality to downstream tasks in other modalities. Can we trace this ability to individual neurons? We study the case where a frozen text transformer is augmented with vision using a self-supervised visual encoder and a single linear projection learned on an image-to-text task. Outputs of the projection layer are not immediately decodable into language describing image content; instead, we find that translation between modalities occurs deeper within the transformer. We introduce a procedure for identifying "multimodal neurons" that convert visual representations into corresponding text, and decoding the concepts they inject into the model's residual stream. In a series of experiments, we show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
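Summary
Language models show a remarkable capacity to generalize representations learned in one modality to downstream tasks in other modalities. Can this ability be traced to individual neurons? We study the case where a frozen text transformer is augmented with vision using a self-supervised visual encoder and a single linear projection learned on an image-to-text task. The outputs of the projection layer cannot be decoded immediately into language describing the image content; instead, translation between modalities occurs deeper within the transformer. We introduce a procedure for identifying "multimodal neurons" that convert visual representations into corresponding text, and for decoding the concepts they inject into the model's residual stream. In a series of experiments, we show that multimodal neurons operate on specific visual concepts across inputs and have a systematic causal effect on image captioning.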
Comparing scalable strategies for generating numerical perspectives
results: The combination of the three approaches dominates any single method; different approaches excel in different settings and contexts, and users display heterogeneous preferences across approaches.
Abstract
Numerical perspectives help people understand extreme and unfamiliar numbers (e.g., $330 billion is about $1,000 per person in the United States). While research shows perspectives to be helpful, generating them at scale is challenging both because it is difficult to identify what makes some analogies more helpful than others, and because what is most helpful can vary based on the context in which a given number appears. Here we present and compare three policies for large-scale perspective generation: a rule-based approach, a crowdsourced system, and a model that uses Wikipedia data and semantic similarity (via BERT embeddings) to generate context-specific perspectives. We find that the combination of these three approaches dominates any single method, with different approaches excelling in different settings and users displaying heterogeneous preferences across approaches. We conclude by discussing our deployment of perspectives in a widely-used online word processor.
Summary
Numerical perspectives help people understand extreme and unfamiliar numbers (e.g., $330 billion is about $1,000 per person in the United States). Although research shows that perspectives are helpful, generating them at scale is challenging, both because it is difficult to identify which analogies are most helpful and because what is most helpful can vary with the context in which a given number appears. Here, we present and compare three policies for large-scale perspective generation: a rule-based approach, a crowdsourced system, and a model that uses Wikipedia data and semantic similarity (via BERT embeddings) to generate context-specific perspectives. We find that the combination of these three approaches dominates any single method, with different approaches excelling in different settings and users displaying heterogeneous preferences across approaches. We conclude by discussing our deployment of perspectives in a widely-used online word processor.
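The example perspective in the abstract is a per-capita rescaling, which can be checked with a few lines of arithmetic. The sketch below assumes a round United States population of 330 million; that figure is an assumption for illustration and does not come from the source.

```python
def per_person(amount, population):
    """Rescale a large total into a per-person figure."""
    return amount / population

US_POPULATION = 330_000_000  # assumed round figure, not from the source

value = per_person(330e9, US_POPULATION)
perspective = f"$330 billion is about ${value:,.0f} per person in the United States"
```

A rule-based policy of the kind the abstract mentions could apply this sort of fixed rescaling, whereas a context-specific policy would choose the comparison unit based on the surrounding text.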
Large Language Model Displays Emergent Ability to Interpret Novel Literary Metaphors
results: GPT-4 produced detailed and incisive interpretations of the metaphors, and in interpreting reversed metaphors it showed sensitivity to the Gricean cooperative principle. These results indicate that LLMs such as GPT-4 have acquired an emergent ability to interpret complex novel metaphors.
Abstract
Recent advances in the performance of large language models (LLMs) have sparked debate over whether, given sufficient training, high-level human abilities emerge in such generic forms of artificial intelligence (AI). Despite the exceptional performance of LLMs on a wide range of tasks involving natural language processing and reasoning, there has been sharp disagreement as to whether their abilities extend to more creative human abilities. A core example is the ability to interpret novel metaphors. Given the enormous and non-curated text corpora used to train LLMs, a serious obstacle to designing tests is the requirement of finding novel yet high-quality metaphors that are unlikely to have been included in the training data. Here we assessed the ability of GPT-4, a state-of-the-art large language model, to provide natural-language interpretations of novel literary metaphors drawn from Serbian poetry and translated into English. Despite exhibiting no signs of having been exposed to these metaphors previously, the AI system consistently produced detailed and incisive interpretations. Human judges, blind to the fact that an AI model was involved, rated metaphor interpretations generated by GPT-4 as superior to those provided by a group of college students. In interpreting reversed metaphors, GPT-4, as well as humans, exhibited signs of sensitivity to the Gricean cooperative principle. These results indicate that LLMs such as GPT-4 have acquired an emergent ability to interpret complex novel metaphors.
Investigating Reinforcement Learning for Communication Strategies in a Task-Initiative Setting
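results: Coherence-based representations of dialogue strategy bring minimal data requirements, explainable choices, and strong audit capabilities, while incurring little loss in predicted outcomes across a wide range of user models.
Abstract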
Many conversational domains require the system to present nuanced information to users. Such systems must follow up what they say to address clarification questions and repair misunderstandings. In this work, we explore this interactive strategy in a referential communication task. Using simulation, we analyze the communication trade-offs between initial presentation and subsequent followup as a function of user clarification strategy, and compare the performance of several baseline strategies to policies derived by reinforcement learning. We find surprising advantages to coherence-based representations of dialogue strategy, which bring minimal data requirements, explainable choices, and strong audit capabilities, but incur little loss in predicted outcomes across a wide range of user models.
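Summary
Many conversational domains require the system to present nuanced information to users. Such systems must follow up on what they say to address clarification questions and repair misunderstandings. In this work, we study this interactive strategy in a referential communication task. Using simulation, we analyze the communication trade-offs between initial presentation and subsequent follow-up as a function of the user's clarification strategy, and compare several baseline strategies with policies derived by reinforcement learning. We find surprising advantages to coherence-based representations of dialogue strategy: they bring minimal data requirements, explainable choices, and strong audit capabilities, while incurring little loss in predicted outcomes across a wide range of user models.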
Reverse Stable Diffusion: What prompt was used to generate this image?
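results: The new learning framework produces excellent results on this task, with the highest gains on the white-box model. The authors also make an interesting discovery: training a diffusion model on the prompt generation task makes it generate images that are much better aligned with the input prompts when the model is directly reused for text-to-image generation.
Abstract
Text-to-image diffusion models such as Stable Diffusion have recently attracted the interest of many researchers, and inverting the diffusion process can play an important role in better understanding the generative process and how to engineer prompts in order to obtain the desired images. To this end, we introduce the new task of predicting the text prompt given an image generated by a generative diffusion model. We combine a series of white-box and black-box models (with and without access to the weights of the diffusion network) to deal with the proposed task. We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective that generates improved prompts. To further improve our method, we employ a curriculum learning procedure that promotes the learning of image-prompt pairs with lower labeling noise (i.e. that are better aligned), and an unsupervised domain-adaptive kernel learning method that uses the similarities between samples in the source and target domains as extra features. We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion. Our novel learning framework produces excellent results on the aforementioned task, yielding the highest gains when applied on the white-box model. In addition, we make an interesting discovery: training a diffusion model on the prompt generation task can make the model generate images that are much better aligned with the input prompts, when the model is directly reused for text-to-image generation.
Summary
Text-to-image diffusion models such as Stable Diffusion have recently attracted the interest of many researchers. Inverting the diffusion process can help us better understand the generative process and how to engineer prompts to obtain the desired images. To this end, we introduce the new task of predicting the text prompt given an image generated by a generative diffusion model. We combine a series of white-box and black-box models (with and without access to the weights of the diffusion network) to address this task. We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective, which generates improved prompts. To further improve our method, we use a curriculum learning procedure that promotes the learning of image-prompt pairs with lower labeling noise (i.e. that are better aligned), and an unsupervised domain-adaptive kernel learning method that uses the similarities between samples in the source and target domains as extra features. We run experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion. Our learning framework produces excellent results, with the highest gains on the white-box model. We also make an interesting discovery: training a diffusion model on the prompt generation task makes it generate images that are much better aligned with the input prompts when the model is directly reused for text-to-image generation.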
UPB at IberLEF-2023 AuTexTification: Detection of Machine-Generated Text using Transformer Ensembles
results: The best-performing model achieved macro F1-scores of 66.63% on the English dataset and 67.10% on the Spanish dataset.
Abstract
This paper describes the solutions submitted by the UPB team to the AuTexTification shared task, featured as part of IberLEF-2023. Our team participated in the first subtask, identifying text documents produced by large language models instead of humans. The organizers provided a bilingual dataset for this subtask, comprising English and Spanish texts covering multiple domains, such as legal texts, social media posts, and how-to articles. We experimented mostly with deep learning models based on Transformers, as well as training techniques such as multi-task learning and virtual adversarial training to obtain better results. We submitted three runs, two of which consisted of ensemble models. Our best-performing model achieved macro F1-scores of 66.63% on the English dataset and 67.10% on the Spanish dataset.
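The macro F1-score reported above averages the per-class F1 over classes with equal weight, so both the human and the machine-generated class count equally regardless of how many documents each has. A minimal sketch, with invented labels and predictions that are not taken from the shared task data:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented toy labels, for illustration only.
y_true = ["human", "human", "generated", "generated", "generated"]
y_pred = ["human", "generated", "generated", "generated", "human"]
score = macro_f1(y_true, y_pred)
```

Here the per-class F1 values are 2/3 for "generated" and 1/2 for "human", so the macro average is 7/12.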
Optimizing Machine Translation through Prompt Engineering: An Investigation into ChatGPT’s Customizability
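results: Adding suitable prompts to large-scale language models like ChatGPT can yield flexible translations, which conventional Machine Translation (MT) has yet to achieve. Translation quality changes when prompts specify particular conditions. The evaluation is conducted from a practicing translator's viewpoint and supplemented by cosine similarity calculations using OpenAI's word embedding API. The findings suggest that including the purpose of the translation and the target audience in prompts improves translation quality, which is particularly useful for marketing documents and culturally dependent idioms.
Abstract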
This paper explores the influence of integrating the purpose of the translation and the target audience into prompts on the quality of translations produced by ChatGPT. Drawing on previous translation studies, industry practices, and ISO standards, the research underscores the significance of the pre-production phase in the translation process. The study reveals that the inclusion of suitable prompts in large-scale language models like ChatGPT can yield flexible translations, a feat yet to be realized by conventional Machine Translation (MT). The research scrutinizes the changes in translation quality when prompts are used to generate translations that meet specific conditions. The evaluation is conducted from a practicing translator's viewpoint, both subjectively and qualitatively, supplemented by the use of OpenAI's word embedding API for cosine similarity calculations. The findings suggest that the integration of the purpose and target audience into prompts can indeed modify the generated translations, generally enhancing the translation quality by industry standards. The study also demonstrates the practical application of the "good translation" concept, particularly in the context of marketing documents and culturally dependent idioms.
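The cosine similarity used to supplement the evaluation compares two embedding vectors by the angle between them. The sketch below shows only that calculation; the vectors are invented placeholders, and how the embeddings are obtained (the abstract mentions OpenAI's word embedding API) is outside the scope of this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Placeholder vectors standing in for embeddings of two translations.
baseline_translation = [0.2, 0.7, 0.1]
prompted_translation = [0.25, 0.6, 0.3]
similarity = cosine_similarity(baseline_translation, prompted_translation)
```

A value near 1 means the two translations are close in embedding space; lower values would indicate that the prompt changed the output more substantially.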
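Summary
results: The study finds that Med-PaLM 2 can assess psychiatric functioning across a range of psychiatric conditions, with the strongest performance in predicting depression scores (accuracy range 0.80-0.84), comparable to human clinical raters (t(1,144) = 1.20; p = 0.23). These results suggest that general clinical language models can flexibly predict psychiatric risk from free descriptions of functioning.
Abstract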
The current work investigates the capability of Large language models (LLMs) that are explicitly trained on large corpuses of medical knowledge (Med-PaLM 2) to predict psychiatric functioning from patient interviews and clinical descriptions without being trained to do so. To assess this, n = 145 depression and n =115 PTSD assessments and n = 46 clinical case studies across high prevalence/high comorbidity disorders (Depressive, Anxiety, Psychotic, trauma and stress, Addictive disorders) were analyzed using prompts to extract estimated clinical scores and diagnoses. Results demonstrate that Med-PaLM 2 is capable of assessing psychiatric functioning across a range of psychiatric conditions with the strongest performance being the prediction of depression scores based on standardized assessments (Accuracy range= 0.80 - 0.84) which were statistically indistinguishable from human clinical raters t(1,144) = 1.20; p = 0.23. Results show the potential for general clinical language models to flexibly predict psychiatric risk based on free descriptions of functioning from both patients and clinicians.
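Summary
This study examines whether large language models (LLMs) trained on large corpora of medical knowledge (Med-PaLM 2) can predict patients' psychiatric functioning. To assess this, the researchers analyzed 145 depression assessments and 115 post-traumatic stress disorder (PTSD) assessments, as well as 46 clinical case studies covering high-prevalence, high-comorbidity disorders (depressive, anxiety, psychotic, and addictive disorders). The results show that Med-PaLM 2 can assess the level of functioning across a range of psychiatric conditions, with the strongest performance in predicting depression scores (accuracy range 0.80-0.84), comparable to human clinical raters (t(1,144) = 1.20; p = 0.23). The results suggest that large clinical language models can predict psychiatric risk by using prompts to extract estimated scores and diagnoses.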
Distribution-Free Inference for the Regression Function of Binary Classification
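results: The paper proves that the constructed confidence regions are strongly consistent, that is, any false model is excluded in the long run, and this exclusion is quantified with probably approximately correct (PAC) type bounds. The methods are also compared to asymptotic confidence ellipsoids.
Abstract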
One of the key objects of binary classification is the regression function, i.e., the conditional expectation of the class labels given the inputs. With the regression function not only a Bayes optimal classifier can be defined, but it also encodes the corresponding misclassification probabilities. The paper presents a resampling framework to construct exact, distribution-free and non-asymptotically guaranteed confidence regions for the true regression function for any user-chosen confidence level. Then, specific algorithms are suggested to demonstrate the framework. It is proved that the constructed confidence regions are strongly consistent, that is, any false model is excluded in the long run with probability one. The exclusion is quantified with probably approximately correct type bounds, as well. Finally, the algorithms are validated via numerical experiments, and the methods are compared to approximate asymptotic confidence ellipsoids.
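Summary
A key object in binary classification is the regression function, i.e., the conditional expectation of the class labels given the inputs. The regression function not only defines a Bayes optimal classifier but also encodes the corresponding misclassification probabilities. The paper proposes a resampling framework for constructing exact, distribution-free confidence regions with non-asymptotic guarantees for the true regression function, and suggests specific algorithms that demonstrate the framework. It proves that the constructed confidence regions are strongly consistent, that is, any false model is excluded in the long run, and quantifies this exclusion with probably approximately correct type bounds. Finally, the algorithms are validated through numerical experiments and compared to approximate asymptotic confidence ellipsoids.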
Hard Adversarial Example Mining for Improving Robust Fairness
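results: Experiments on CIFAR-10, SVHN, and Imagenette show that the method achieves significant improvement, raising the robustness of DNNs against adversarial examples (AE) and their robust fairness while reducing computational cost.
Abstract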
Adversarial training (AT) is widely considered the state-of-the-art technique for improving the robustness of deep neural networks (DNNs) against adversarial examples (AE). Nevertheless, recent studies have revealed that adversarially trained models are prone to unfairness problems, restricting their applicability. In this paper, we empirically observe that this limitation may be attributed to serious adversarial confidence overfitting, i.e., certain adversarial examples with overconfidence. To alleviate this problem, we propose HAM, a straightforward yet effective framework via adaptive Hard Adversarial example Mining. HAM concentrates on mining hard adversarial examples while discarding the easy ones in an adaptive fashion. Specifically, HAM identifies hard AEs in terms of their step sizes needed to cross the decision boundary when calculating loss value. Besides, an early-dropping mechanism is incorporated to discard the easy examples at the initial stages of AE generation, resulting in efficient AT. Extensive experimental results on CIFAR-10, SVHN, and Imagenette demonstrate that HAM achieves significant improvement in robust fairness while reducing computational cost compared to several state-of-the-art adversarial training methods. The code will be made publicly available.
Summary
Adversarial training (AT) is a prominent technique for improving the robustness of deep neural networks (DNNs) against adversarial inputs. However, recent studies have shown that adversarially trained models are prone to unfairness problems, limiting their applicability. In this paper, we empirically observe that this limitation may be attributed to serious adversarial confidence overfitting, i.e., certain adversarial examples with overconfidence. To address this problem, we propose HAM, a straightforward yet effective framework via adaptive Hard Adversarial example Mining. HAM concentrates on mining hard adversarial examples while discarding the easy ones in an adaptive fashion. Specifically, HAM identifies hard AEs in terms of their step sizes needed to cross the decision boundary when calculating loss value. Furthermore, an early-dropping mechanism is incorporated to discard the easy examples at the initial stages of AE generation, resulting in efficient AT. Extensive experimental results on CIFAR-10, SVHN, and Imagenette demonstrate that HAM achieves significant improvement in robust fairness while reducing computational cost compared to several state-of-the-art adversarial training methods. The code will be made publicly available.
Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit
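for: This paper investigates whether new phenomena emerge in wide neural networks trained with adaptive optimizers such as Adam.
methods: The paper uses a new Tensor Program language, NEXORT, together with bra-ket notation that simplifies expressions and calculations, to describe how adaptive optimizers process gradients into updates.
results: The study finds that the dichotomy between feature learning and kernel behaviors seen under stochastic gradient descent (SGD) also holds for wide neural networks trained with adaptive optimizers. It also derives the corresponding "neural tangent" and "maximal update" limits for any architecture.
Abstract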
Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam? Here we show: The same dichotomy between feature learning and kernel behaviors (as in SGD) holds for general optimizers as well, including Adam -- albeit with a nonlinear notion of "kernel." We derive the corresponding "neural tangent" and "maximal update" limits for any architecture. Two foundational advances underlie the above results: 1) A new Tensor Program language, NEXORT, that can express how adaptive optimizers process gradients into updates. 2) The introduction of bra-ket notation to drastically simplify expressions and calculations in Tensor Programs. This work summarizes and generalizes all previous results in the Tensor Programs series of papers.
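Summary
Going beyond SGD, what new phenomena emerge in wide neural networks trained by adaptive optimizers such as Adam? Here we show that the dichotomy between feature learning and kernel behaviors found in SGD also holds for general optimizers, including Adam, although with a nonlinear notion of "kernel." We derive the corresponding "neural tangent" and "maximal update" limits for any architecture. These results rest on two foundational advances: 1. A new Tensor Program language, NEXORT, that can express how adaptive optimizers process gradients into updates. 2. The use of bra-ket notation to drastically simplify expressions and calculations in Tensor Programs. This work summarizes and generalizes all previous results in the Tensor Programs series of papers.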
Job Shop Scheduling via Deep Reinforcement Learning: a Sequence to Sequence approach
paper_authors: Giovanni Bonetta, Davide Zago, Rossella Cancelliere, Andrea Grosso
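for: This paper applies Deep Reinforcement Learning to the Job Shop Problem, a well-known Combinatorial Optimization problem, to obtain good production schedules.
methods: The paper uses a Deep Reinforcement Learning approach, inspired by natural language encoder-decoder models, that automatically learns dispatching rules. To the best of the authors' knowledge, this technique has never been used for scheduling, yet it is general enough to be applied to other optimal job scheduling tasks.
results: Experiments show that the method outperforms many classical approaches based on priority dispatching rules and achieves competitive results against state-of-the-art Deep Reinforcement Learning methods on the Job Shop Problem.
Abstract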
Job scheduling is a well-known Combinatorial Optimization problem with endless applications. Well planned schedules bring many benefits in the context of automated systems: among others, they limit production costs and waste. Nevertheless, the NP-hardness of this problem makes it essential to use heuristics whose design is difficult, requires specialized knowledge and often produces methods tailored to the specific task. This paper presents an original end-to-end Deep Reinforcement Learning approach to scheduling that automatically learns dispatching rules. Our technique is inspired by natural language encoder-decoder models for sequence processing and has never been used, to the best of our knowledge, for scheduling purposes. We applied and tested our method in particular to some benchmark instances of Job Shop Problem, but this technique is general enough to be potentially used to tackle other different optimal job scheduling tasks with minimal intervention. Results demonstrate that we outperform many classical approaches exploiting priority dispatching rules and show competitive results on state-of-the-art Deep Reinforcement Learning ones.
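Summary
Job scheduling is a well-known Combinatorial Optimization problem with countless applications in automated systems. Well-planned schedules limit production costs and waste. However, the NP-hardness of the problem makes it necessary to use heuristics whose design requires specialized knowledge and often yields methods tailored to a specific task. This paper presents an original Deep Reinforcement Learning approach that automatically learns dispatching rules. The technique is inspired by natural language encoder-decoder models and, to the best of the authors' knowledge, has never been used for scheduling. We applied the method to benchmark instances of the Job Shop Problem, but it is general enough to tackle other optimal job scheduling tasks. Results show that we outperform many classical approaches and are competitive with existing Deep Reinforcement Learning methods.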
Benchmarking Adaptative Variational Quantum Algorithms on QUBO Instances
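results: The paper compares three Adaptative VQAs (EVQE, VAns, and RA-VQE) with QAOA on QUBO problems in terms of solution quality and computational time, and shows that the choice of hyperparameters affects the overall performance of the algorithms. The analysis sets benchmarks for Adaptative VQAs on near-term quantum devices and offers guidance for future research in this area.
Abstract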
In recent years, Variational Quantum Algorithms (VQAs) have emerged as a promising approach for solving optimization problems on quantum computers in the NISQ era. However, one limitation of VQAs is their reliance on fixed-structure circuits, which may not be tailored for specific problems or hardware configurations. A leading strategy to address this issue is to use Adaptative VQAs, which dynamically modify the circuit structure by adding and removing gates, and optimize their parameters during the training. Several Adaptative VQAs, based on heuristics such as circuit shallowness, entanglement capability and hardware compatibility, have already been proposed in the literature, but there is still a lack of a systematic comparison between the different methods. In this paper, we aim to fill this gap by analyzing three Adaptative VQAs: Evolutionary Variational Quantum Eigensolver (EVQE), Variable Ansatz (VAns), already proposed in the literature, and Random Adapt-VQE (RA-VQE), a random approach we introduce as a baseline. In order to compare these algorithms to traditional VQAs, we also include the Quantum Approximate Optimization Algorithm (QAOA) in our analysis. We apply these algorithms to QUBO problems and study their performance by examining the quality of the solutions found and the computational times required. Additionally, we investigate how the choice of the hyperparameters can impact the overall performance of the algorithms, highlighting the importance of selecting an appropriate methodology for hyperparameter tuning. Our analysis sets benchmarks for Adaptative VQAs designed for near-term quantum devices and provides valuable insights to guide future research in this area.
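For readers unfamiliar with QUBO (Quadratic Unconstrained Binary Optimization), the objective is to minimize x^T Q x over binary vectors x. The sketch below is a classical brute-force baseline for tiny instances, useful only for checking the quality of solutions returned by any solver; it is not one of the quantum algorithms studied in the paper, and the matrix `Q` is a hypothetical example.

```python
from itertools import product


def qubo_energy(Q, x):
    """Evaluate x^T Q x for a binary assignment x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def brute_force_qubo(Q):
    """Enumerate all 2^n assignments; feasible only for small n."""
    n = len(Q)
    best = min(product((0, 1), repeat=n), key=lambda x: qubo_energy(Q, x))
    return list(best), qubo_energy(Q, best)


# Hypothetical 3-variable instance: diagonal terms reward setting a bit,
# off-diagonal terms penalize setting two neighbouring bits together.
Q = [
    [-1, 2, 0],
    [0, -1, 2],
    [0, 0, -1],
]
solution, energy = brute_force_qubo(Q)  # solution == [1, 0, 1], energy == -2
```

The exponential cost of this enumeration is the reason heuristic and variational approaches are of interest for larger instances.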
Summary
Recent years have seen the emergence of Variational Quantum Algorithms (VQAs) as a promising approach for solving optimization problems on quantum computers in the NISQ era. However, one limitation of VQAs is their reliance on fixed-structure circuits, which may not be tailored for specific problems or hardware configurations. To address this issue, Adaptative VQAs have been proposed, which dynamically modify the circuit structure and optimize parameters during training. Several Adaptative VQAs have been proposed in the literature, but there is still a lack of systematic comparison between them. In this paper, we aim to fill this gap by analyzing three Adaptative VQAs: Evolutionary Variational Quantum Eigensolver (EVQE), Variable Ansatz (VAns), and Random Adapt-VQE (RA-VQE), as well as the Quantum Approximate Optimization Algorithm (QAOA) for comparison. We apply these algorithms to QUBO problems and study their performance by examining the quality of the solutions found and the computational times required. We also investigate the impact of hyperparameter choice on algorithm performance, highlighting the importance of selecting an appropriate methodology for hyperparameter tuning. Our analysis sets benchmarks for Adaptative VQAs designed for near-term quantum devices and provides valuable insights to guide future research in this area.
Deep Learning-based Prediction of Stress and Strain Maps in Arterial Walls for Improved Cardiovascular Risk Assessment
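results: The models accurately predict the von Mises stress and strain distributions in arterial walls across varying calcification quantities and spatial configurations.
Abstract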
This study investigated the potential of end-to-end deep learning tools as a more effective substitute for FEM in predicting stress-strain fields within 2D cross sections of arterial wall. We first proposed a U-Net based fully convolutional neural network (CNN) to predict the von Mises stress and strain distribution based on the spatial arrangement of calcification within arterial wall cross-sections. Further, we developed a conditional generative adversarial network (cGAN) to enhance, particularly from the perceptual perspective, the prediction accuracy of stress and strain field maps for arterial walls with various calcification quantities and spatial configurations. On top of U-Net and cGAN, we also proposed their ensemble approaches, respectively, to further improve the prediction accuracy of field maps. Our dataset, consisting of input and output images, was generated by implementing boundary conditions and extracting stress-strain field maps. The trained U-Net models can accurately predict von Mises stress and strain fields, with structural similarity index scores (SSIM) of 0.854 and 0.830 and mean squared errors of 0.017 and 0.018 for stress and strain, respectively, on a reserved test set. Meanwhile, the cGAN models in a combination of ensemble and transfer learning techniques demonstrate high accuracy in predicting von Mises stress and strain fields, as evidenced by SSIM scores of 0.890 for stress and 0.803 for strain. Additionally, mean squared errors of 0.008 for stress and 0.017 for strain further support the model's performance on a designated test set. Overall, this study developed a surrogate model for finite element analysis, which can accurately and efficiently predict stress-strain fields of arterial walls regardless of complex geometries and boundary conditions.
Summary
Bag of Policies for Distributional Deep Exploration
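methods: We propose a general-purpose approach, Bag of Policies (BoP), that can be built on top of any return distribution estimator and achieves deep exploration by maintaining an ensemble of multiple heads that are updated independently.
results: Experiments show that BoP improves robustness and speed of learning on ALE Atari games.
Abstract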
Efficient exploration in complex environments remains a major challenge for reinforcement learning (RL). Compared to previous Thompson sampling-inspired mechanisms that enable temporally extended exploration, i.e., deep exploration, we focus on deep exploration in distributional RL. We develop here a general purpose approach, Bag of Policies (BoP), that can be built on top of any return distribution estimator by maintaining a population of its copies. BoP consists of an ensemble of multiple heads that are updated independently. During training, each episode is controlled by only one of the heads and the collected state-action pairs are used to update all heads off-policy, leading to distinct learning signals for each head which diversify learning and behaviour. To test whether optimistic ensemble method can improve on distributional RL as did on scalar RL, by e.g. Bootstrapped DQN, we implement the BoP approach with a population of distributional actor-critics using Bayesian Distributional Policy Gradients (BDPG). The population thus approximates a posterior distribution of return distributions along with a posterior distribution of policies. Another benefit of building upon BDPG is that it allows to analyze global posterior uncertainty along with local curiosity bonus simultaneously for exploration. As BDPG is already an optimistic method, this pairing helps to investigate if optimism is accumulatable in distributional RL. Overall BoP results in greater robustness and speed during learning as demonstrated by our experimental results on ALE Atari games.
Summary
Guided Distillation for Semi-Supervised Instance Segmentation
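results: The paper improves teacher-student distillation models by introducing a novel "guided burn-in" stage and by evaluating different instance segmentation architectures, backbone networks, and pre-training strategies. These changes yield substantial gains: on Cityscapes, mask-AP rises from 23.7 to 33.9 when using labels for 10% of images, and on COCO, mask-AP rises from 18.3 to 34.1 when using labels for only 1% of the training data.
Abstract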
Although instance segmentation methods have improved considerably, the dominant paradigm is to rely on fully-annotated training images, which are tedious to obtain. To alleviate this reliance, and boost results, semi-supervised approaches leverage unlabeled data as an additional training signal that limits overfitting to the labeled samples. In this context, we present novel design choices to significantly improve teacher-student distillation models. In particular, we (i) improve the distillation approach by introducing a novel "guided burn-in" stage, and (ii) evaluate different instance segmentation architectures, as well as backbone networks and pre-training strategies. Contrary to previous work which uses only supervised data for the burn-in period of the student model, we also use guidance of the teacher model to exploit unlabeled data in the burn-in period. Our improved distillation approach leads to substantial improvements over previous state-of-the-art results. For example, on the Cityscapes dataset we improve mask-AP from 23.7 to 33.9 when using labels for 10% of images, and on the COCO dataset we improve mask-AP from 18.3 to 34.1 when using labels for only 1% of the training data.
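Summary
Although instance segmentation methods have improved considerably, the dominant paradigm still relies on fully annotated training images, which are tedious to obtain. To reduce this reliance, semi-supervised approaches use unlabeled data as an additional training signal that limits overfitting to the labeled samples. In this context, we present new design choices that improve teacher-student distillation models. Specifically, we (i) improve the distillation approach by introducing a novel "guided burn-in" stage, and (ii) evaluate different instance segmentation architectures, backbone networks, and pre-training strategies. Unlike previous work, we also use the teacher model's guidance to exploit unlabeled data during the student model's burn-in period. Our improved distillation approach leads to substantial improvements over the previous state of the art. For example, on the Cityscapes dataset we improve mask-AP from 23.7 to 33.9, and on the COCO dataset from 18.3 to 34.1 when using labels for only 1% of the training data.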
Neural Collapse Terminus: A Unified Solution for Class Incremental Learning and Its Variants
paper_authors: Yibo Yang, Haobo Yuan, Xiangtai Li, Jianlong Wu, Lefei Zhang, Zhouchen Lin, Philip Torr, Dacheng Tao, Bernard Ghanem
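for: This paper targets three tasks, class incremental learning (CIL), long-tail class incremental learning (LTCIL), and few-shot class incremental learning (FSCIL), and addresses the difficulties that data imbalance and data scarcity add to class incremental learning.
methods: The paper proposes a unified solution, the neural collapse terminus (NCT), a fixed structure with the maximal equiangular inter-class separation for the whole label space. NCT acts as a consistent target throughout the incremental training, which avoids dividing the feature space incrementally. For CIL and LTCIL, a prototype evolving scheme is further proposed to drive the backbone features into the NCT smoothly.
results: Extensive experiments on multiple datasets demonstrate the effectiveness of the method on all three tasks and on a generalized case in which the total number of classes and the type of data distribution (normal, long-tail, or few-shot) are unknown.
Abstract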
How to enable learnability for new classes while keeping the capability well on old classes has been a crucial challenge for class incremental learning. Beyond the normal case, long-tail class incremental learning and few-shot class incremental learning are also proposed to consider the data imbalance and data scarcity, respectively, which are common in real-world implementations and further exacerbate the well-known problem of catastrophic forgetting. Existing methods are specifically proposed for one of the three tasks. In this paper, we offer a unified solution to the misalignment dilemma in the three tasks. Concretely, we propose neural collapse terminus that is a fixed structure with the maximal equiangular inter-class separation for the whole label space. It serves as a consistent target throughout the incremental training to avoid dividing the feature space incrementally. For CIL and LTCIL, we further propose a prototype evolving scheme to drive the backbone features into our neural collapse terminus smoothly. Our method also works for FSCIL with only minor adaptations. Theoretical analysis indicates that our method holds the neural collapse optimality in an incremental fashion regardless of data imbalance or data scarcity. We also design a generalized case where we do not know the total number of classes and whether the data distribution is normal, long-tail, or few-shot for each coming session, to test the generalizability of our method. Extensive experiments with multiple datasets are conducted to demonstrate the effectiveness of our unified solution to all the three tasks and the generalized case.
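A fixed set of class targets with maximal equiangular separation can be illustrated with the simplex equiangular tight frame (ETF) known from the neural collapse literature. The sketch below builds one standard form of it; treating this exact matrix as the paper's terminus is an assumption, and the function names are hypothetical.

```python
import math


def simplex_etf(num_classes):
    """Return num_classes unit vectors whose pairwise cosine is -1/(num_classes - 1).

    Row i is sqrt(K / (K - 1)) * (e_i - (1/K) * ones), with K = num_classes.
    """
    k = num_classes
    if k < 2:
        raise ValueError("need at least two classes")
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# For K = 4 every target has unit norm and every pair has cosine -1/3.
targets = simplex_etf(4)
norms = [math.sqrt(dot(t, t)) for t in targets]
pairwise = [dot(targets[i], targets[j]) for i in range(4) for j in range(i + 1, 4)]
```

Because the targets are fixed for the whole label space up front, classes that arrive in later sessions already have a reserved direction, which is the intuition behind avoiding an incremental division of the feature space.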
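Summary
How to learn new classes while maintaining performance on old classes is a major challenge in class incremental learning. Beyond the normal case, long-tail class incremental learning and few-shot class incremental learning have been proposed to account for data imbalance and data scarcity, respectively, which are common in practice and further aggravate the well-known problem of catastrophic forgetting. Existing methods are mostly designed for one of the three tasks. In this paper, we offer a unified solution to the misalignment dilemma in the three tasks. Concretely, we propose the neural collapse terminus, a fixed structure with the maximal equiangular inter-class separation for the whole label space. It serves as a consistent target throughout incremental training, which avoids dividing the feature space incrementally. For CIL and LTCIL, we further propose a prototype evolving scheme that drives the backbone features smoothly toward the neural collapse terminus. Our method also works for FSCIL with only minor adaptations. Theoretical analysis indicates that our method maintains neural collapse optimality in an incremental fashion regardless of data imbalance or data scarcity. We also design a generalized case in which the total number of classes and whether the data distribution is normal, long-tail, or few-shot are unknown for each incoming session, to test the generalizability of our method. Extensive experiments on multiple datasets demonstrate the effectiveness of our unified solution on all three tasks and the generalized case.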
Multitask Learning with No Regret: from Improved Confidence Bounds to Active Learning
paper_authors: Pier Giuseppe Sessa, Pierre Laforgue, Nicolò Cesa-Bianchi, Andreas Krause
for: This paper is written for people who want to learn multiple related tasks simultaneously while quantifying uncertainty in the estimated tasks, especially in the challenging agnostic setting.
methods: The paper uses novel multitask confidence intervals that do not require i.i.d. data and can be applied to bound the regret in online learning. The paper also proposes a novel online learning algorithm that achieves improved regret without knowing the task similarity parameter in advance.
results: The paper obtains new regret guarantees that can significantly improve over treating tasks independently, and introduces a novel multitask active learning setup where several tasks must be simultaneously optimized but only one can be queried for feedback. The paper also empirically validates the bounds and algorithms on synthetic and real-world data.
Abstract
Multitask learning is a powerful framework that enables one to simultaneously learn multiple related tasks by sharing information between them. Quantifying uncertainty in the estimated tasks is of pivotal importance for many downstream applications, such as online or active learning. In this work, we provide novel multitask confidence intervals in the challenging agnostic setting, i.e., when neither the similarity between tasks nor the tasks' features are available to the learner. The obtained intervals do not require i.i.d. data and can be directly applied to bound the regret in online learning. Through a refined analysis of the multitask information gain, we obtain new regret guarantees that, depending on a task similarity parameter, can significantly improve over treating tasks independently. We further propose a novel online learning algorithm that achieves such improved regret without knowing this parameter in advance, i.e., automatically adapting to task similarity. As a second key application of our results, we introduce a novel multitask active learning setup where several tasks must be simultaneously optimized, but only one of them can be queried for feedback by the learner at each round. For this problem, we design a no-regret algorithm that uses our confidence intervals to decide which task should be queried. Finally, we empirically validate our bounds and algorithms on synthetic and real-world (drug discovery) data.
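Summary
Multitask learning is a powerful framework for simultaneously learning multiple related tasks by sharing information between them. Quantifying uncertainty in the estimated tasks is crucial for many downstream applications, such as online or active learning. In this work, we provide novel multitask confidence intervals that do not require i.i.d. data and can be directly applied to bound the regret in online learning. Through a refined analysis of the multitask information gain, we obtain new regret guarantees that, depending on a task similarity parameter, can significantly improve over treating tasks independently. We also propose a novel online learning algorithm that achieves this improved regret without knowing the task similarity parameter in advance. As a second key application, we introduce a multitask active learning setup in which several tasks must be optimized simultaneously, but only one of them can be queried for feedback by the learner at each round. For this problem, we design a no-regret algorithm that uses our confidence intervals to decide which task to query. Finally, we empirically validate our bounds and algorithms on synthetic and real-world (drug discovery) data.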
Finding the Optimum Design of Large Gas Engines Prechambers Using CFD and Bayesian Optimization
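results: The results indicate that the chosen strategy is effective in finding a prechamber design that achieves the desired target values.
Abstract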
The turbulent jet ignition concept using prechambers is a promising solution to achieve stable combustion at lean conditions in large gas engines, leading to high efficiency at low emission levels. Due to the wide range of design and operating parameters for large gas engine prechambers, the preferred method for evaluating different designs is computational fluid dynamics (CFD), as testing in test bed measurement campaigns is time-consuming and expensive. However, the significant computational time required for detailed CFD simulations due to the complexity of solving the underlying physics also limits its applicability. In optimization settings similar to the present case, i.e., where the evaluation of the objective function(s) is computationally costly, Bayesian optimization has largely replaced classical design-of-experiment. Thus, the present study deals with the computationally efficient Bayesian optimization of large gas engine prechambers design using CFD simulation. Reynolds-averaged-Navier-Stokes simulations are used to determine the target values as a function of the selected prechamber design parameters. The results indicate that the chosen strategy is effective to find a prechamber design that achieves the desired target values.
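Summary
The turbulent jet ignition concept using prechambers is a promising solution for stable combustion in large gas engines, enabling high efficiency at low emission levels. Because of the wide range of design and operating parameters for large gas engine prechambers, computational fluid dynamics (CFD) is the preferred method for evaluating different designs, since test bed measurements are time-consuming and expensive. However, the substantial computational time required to solve the complex underlying physics limits the applicability of CFD simulation. In optimization settings such as the present one, Bayesian optimization has largely replaced classical design-of-experiment. This study therefore addresses the computationally efficient Bayesian optimization of large gas engine prechamber design using CFD simulation. Reynolds-averaged Navier-Stokes simulations are used to determine the target values as a function of the selected prechamber design parameters. The results show that the chosen strategy is effective and can find a prechamber design that meets the desired targets.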
Exploiting Multi-Label Correlation in Label Distribution Learning
paper_authors: Zhiqiang Kou, Jing Wang, Yuheng Jia, Xin Geng
for: This paper focuses on addressing the challenges of Label Distribution Learning (LDL) methods that exploit low-rank label correlation, which is not always present in real-world datasets.
methods: The proposed method introduces an auxiliary Multi-Label Learning (MLL) process to capture low-rank label correlation, which is then used to improve the performance of LDL.
results: The proposed method is shown to be superior to existing LDL methods through comprehensive experiments, and ablation studies demonstrate the advantages of exploiting low-rank label correlation in the auxiliary MLL.
Abstract
Label Distribution Learning (LDL) is a novel machine learning paradigm that assigns label distribution to each instance. Many LDL methods proposed to leverage label correlation in the learning process to solve the exponential-sized output space; among these, many exploited the low-rank structure of label distribution to capture label correlation. However, recent studies disclosed that label distribution matrices are typically full-rank, posing challenges to those works exploiting low-rank label correlation. Note that multi-label is generally low-rank; low-rank label correlation is widely adopted in multi-label learning (MLL) literature. Inspired by that, we introduce an auxiliary MLL process in LDL and capture low-rank label correlation on that MLL rather than LDL. In such a way, low-rank label correlation is appropriately exploited in our LDL methods. We conduct comprehensive experiments and demonstrate that our methods are superior to existing LDL methods. Besides, the ablation studies justify the advantages of exploiting low-rank label correlation in the auxiliary MLL.
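Summary
Label Distribution Learning (LDL) is a novel machine learning paradigm that assigns a label distribution to each instance. Many LDL methods leverage label correlation to cope with the exponential-sized output space; among these, many exploit the low-rank structure of the label distribution to capture label correlation. However, recent studies show that label distribution matrices are typically full-rank, which poses challenges for methods that exploit low-rank label correlation. Note that multi-label matrices are generally low-rank, and low-rank label correlation is widely adopted in the multi-label learning (MLL) literature. Motivated by this, we introduce an auxiliary MLL process and capture low-rank label correlation on it. In this way, low-rank label correlation is exploited appropriately. We conduct comprehensive experiments and show that our methods outperform existing LDL methods. In addition, ablation studies confirm the advantages of exploiting low-rank label correlation in the auxiliary MLL.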
Bringing Chemistry to Scale: Loss Weight Adjustment for Multivariate Regression in Deep Learning of Thermochemical Processes
results: The effectiveness of the loss weight adjustment is explained by the more balanced gradients it produces during network training.
Abstract
Flamelet models are widely used in computational fluid dynamics to simulate thermochemical processes in turbulent combustion. These models typically employ memory-expensive lookup tables that are predetermined and represent the combustion process to be simulated. Artificial neural networks (ANNs) offer a deep learning approach that can store this tabular data using a small number of network weights, potentially reducing the memory demands of complex simulations by orders of magnitude. However, ANNs with standard training losses often struggle with underrepresented targets in multivariate regression tasks, e.g., when learning minor species mass fractions as part of lookup tables. This paper seeks to improve the accuracy of an ANN when learning multiple species mass fractions of a hydrogen (H2) combustion lookup table. We assess a simple, yet effective loss weight adjustment that outperforms the standard mean-squared error optimization and enables accurate learning of all species mass fractions, even of minor species where the standard optimization completely fails. Furthermore, we find that the loss weight adjustment leads to more balanced gradients in the network training, which explains its effectiveness.
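The abstract does not spell out the weighting rule, so the following is only a sketch of one plausible per-target loss weighting: each output is weighted by the inverse of its mean squared magnitude, so that a minor species with tiny mass fractions contributes to the loss on the same scale as a major one. The weighting formula, function names, and toy numbers are all assumptions, not the paper's method.

```python
def inverse_scale_weights(targets, eps=1e-12):
    """One weight per output: 1 / (mean squared target value + eps). Assumed rule."""
    n_out = len(targets[0])
    weights = []
    for k in range(n_out):
        mean_sq = sum(row[k] ** 2 for row in targets) / len(targets)
        weights.append(1.0 / (mean_sq + eps))
    return weights


def weighted_mse(preds, targets, weights):
    """Mean over samples and outputs of weight_k * (pred_k - target_k)^2."""
    total = 0.0
    for p_row, t_row in zip(preds, targets):
        total += sum(w * (p - t) ** 2 for w, p, t in zip(weights, p_row, t_row))
    return total / (len(targets) * len(weights))


# Hypothetical mass fractions: column 0 is a major species, column 1 a minor one.
targets = [[1.0, 0.001], [1.0, 0.001]]
preds = [[0.9, 0.0009], [0.9, 0.0009]]  # 10% relative error on both outputs

plain = weighted_mse(preds, targets, [1.0, 1.0])                      # dominated by column 0
balanced = weighted_mse(preds, targets, inverse_scale_weights(targets))
```

With unit weights the minor species contributes about a millionth of the loss even though its relative error is the same, which matches the abstract's observation that standard optimization can fail on underrepresented targets.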
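Summary
Flamelet models are widely used in computational fluid dynamics to simulate thermochemical processes, and they typically rely on memory-expensive lookup tables that represent the combustion process. Artificial neural networks (ANNs) offer a deep learning approach that can store this tabular data in a small number of network weights, potentially reducing the memory demands of complex simulations by orders of magnitude. However, ANNs trained with standard losses often perform poorly in multivariate regression tasks, for example when learning minor species mass fractions in a hydrogen (H2) combustion lookup table. This paper seeks to improve the accuracy of an ANN when learning multiple species mass fractions of a combustion lookup table. We assess a simple yet effective loss weight adjustment that outperforms standard mean-squared error optimization and enables accurate learning of all species mass fractions, including minor species for which the standard optimization fails completely. Furthermore, we find that the loss weight adjustment leads to more balanced gradients during network training, which explains its effectiveness.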
MAP: A Model-agnostic Pretraining Framework for Click-through Rate Prediction
results: Experiments show that MFP and RFD achieve new state-of-the-art performance on two real-world large-scale datasets (Avazu, Criteo), and are more efficient and simpler than previous methods.
Abstract
With the widespread application of personalized online services, click-through rate (CTR) prediction has received more and more attention and research. The most prominent features of CTR prediction are its multi-field categorical data format, and vast and daily-growing data volume. The large capacity of neural models helps digest such massive amounts of data under the supervised learning paradigm, yet they fail to utilize the substantial data to its full potential, since the 1-bit click signal is not sufficient to guide the model to learn capable representations of features and instances. The self-supervised learning paradigm provides a more promising pretrain-finetune solution to better exploit the large amount of user click logs, and learn more generalized and effective representations. However, self-supervised learning for CTR prediction is still an open question, since current works on this line are only preliminary and rudimentary. To this end, we propose a Model-agnostic pretraining (MAP) framework that applies feature corruption and recovery on multi-field categorical data, and more specifically, we derive two practical algorithms: masked feature prediction (MFP) and replaced feature detection (RFD). MFP digs into feature interactions within each instance through masking and predicting a small portion of input features, and introduces noise contrastive estimation (NCE) to handle large feature spaces. RFD further turns MFP into a binary classification mode through replacing and detecting changes in input features, making it even simpler and more effective for CTR pretraining. Our extensive experiments on two real-world large-scale datasets (i.e., Avazu, Criteo) demonstrate the advantages of these two methods on several strong backbones (e.g., DCNv2, DeepFM), and achieve new state-of-the-art performance in terms of both effectiveness and efficiency for CTR prediction.
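The two corruption schemes described above can be sketched on a single multi-field categorical instance. This is a hedged illustration only: the dictionary representation, the `<MASK>` token, the function names, and the per-field vocabularies are assumptions, and the prediction heads (including the noise contrastive estimation step) are left out.

```python
import random


def mask_features(instance, mask_ratio, mask_token="<MASK>", rng=None):
    """MFP-style corruption: hide a fraction of fields, keep the originals as targets."""
    rng = rng or random.Random()
    fields = list(instance)
    n_mask = max(1, round(mask_ratio * len(fields)))
    masked_fields = set(rng.sample(fields, n_mask))
    corrupted = {
        f: (mask_token if f in masked_fields else v) for f, v in instance.items()
    }
    targets = {f: instance[f] for f in masked_fields}
    return corrupted, targets


def replace_features(instance, vocab, replace_ratio, rng=None):
    """RFD-style corruption: swap a fraction of fields, label each field 1 if replaced.

    Assumes every field in `vocab` has at least two distinct values.
    """
    rng = rng or random.Random()
    fields = list(instance)
    n_replace = max(1, round(replace_ratio * len(fields)))
    chosen = set(rng.sample(fields, n_replace))
    corrupted, labels = {}, {}
    for field, value in instance.items():
        if field in chosen:
            candidates = [c for c in vocab[field] if c != value]
            corrupted[field] = rng.choice(candidates)
            labels[field] = 1
        else:
            corrupted[field] = value
            labels[field] = 0
    return corrupted, labels
```

In this sketch MFP yields a multi-class target per masked field, whereas RFD yields one binary label per field, which is the sense in which RFD turns the task into binary classification.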
Quantification of Predictive Uncertainty via Inference-Time Sampling
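results: The method can generate different plausible outputs, and the estimated uncertainty correlates positively with the prediction error.
Abstract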
Predictive variability due to data ambiguities has typically been addressed via construction of dedicated models with built-in probabilistic capabilities that are trained to predict uncertainty estimates as variables of interest. These approaches require distinct architectural components and training mechanisms, may include restrictive assumptions and exhibit overconfidence, i.e., high confidence in imprecise predictions. In this work, we propose a post-hoc sampling strategy for estimating predictive uncertainty accounting for data ambiguity. The method can generate different plausible outputs for a given input and does not assume parametric forms of predictive distributions. It is architecture agnostic and can be applied to any feed-forward deterministic network without changes to the architecture or training procedure. Experiments on regression tasks on imaging and non-imaging input data show the method's ability to generate diverse and multi-modal predictive distributions, and a desirable correlation of the estimated uncertainty with the prediction error.
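Summary
Predictive variability has traditionally been addressed by building dedicated models with built-in probabilistic capabilities. These approaches usually require specific architectural components and training mechanisms, may carry restrictive assumptions, and can be overconfident, i.e., highly confident in imprecise predictions. In this work, we propose a post-hoc sampling strategy for estimating predictive uncertainty that accounts for data ambiguity. The method can generate different plausible outputs and does not assume a parametric form of the predictive distribution. It is architecture agnostic and can be applied to any deterministic network without changing the architecture or training procedure. Experiments show that the method generates diverse, multi-modal predictive distributions and that the estimated uncertainty correlates well with the prediction error.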
Telematics Combined Actuarial Neural Networks for Cross-Sectional and Longitudinal Claim Count Data
results: The CANN models perform better for vehicle insurance than log-linear models that rely on manually engineered telematics features.
Abstract
We present novel cross-sectional and longitudinal claim count models for vehicle insurance built upon the Combined Actuarial Neural Network (CANN) framework proposed by Mario W\"uthrich and Michael Merz. The CANN approach combines a classical actuarial model, such as a generalized linear model, with a neural network. This blending of models results in a two-component model comprising a classical regression model and a neural network part. The CANN model leverages the strengths of both components, providing a solid foundation and interpretability from the classical model while harnessing the flexibility and capacity to capture intricate relationships and interactions offered by the neural network. In our proposed models, we use well-known log-linear claim count regression models for the classical regression part and a multilayer perceptron (MLP) for the neural network part. The MLP part is used to process telematics car driving data given as a vector characterizing the driving behavior of each insured driver. In addition to the Poisson and negative binomial distributions for cross-sectional data, we propose a procedure for training our CANN model with a multivariate negative binomial (MVNB) specification. By doing so, we introduce a longitudinal model that accounts for the dependence between contracts from the same insured. Our results reveal that the CANN models exhibit superior performance compared to log-linear models that rely on manually engineered telematics features.
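Summary
We present novel cross-sectional and longitudinal claim count models for vehicle insurance, built on the Combined Actuarial Neural Network (CANN) framework proposed by Mario W\"uthrich and Michael Merz. The CANN approach combines a classical actuarial model, such as a generalized linear model, with a neural network, giving a two-component model made of a classical regression part and a neural network part. The CANN model draws on the solid foundation and interpretability of the classical model together with the flexibility of the neural network and its capacity to capture intricate relationships and interactions. In our proposed models, we use well-known log-linear claim count regression models for the classical regression part and a multilayer perceptron (MLP) to process the driving behavior data of each insured driver. We also propose a procedure for training the CANN model with a multivariate negative binomial (MVNB) specification, which yields a longitudinal model that accounts for the dependence between contracts from the same insured. Our results show that the CANN models outperform log-linear models that rely on manually engineered telematics features.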
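As a rough illustration of the two-component idea, the sketch below adds the output of a tiny MLP on a telematics vector to a log-linear predictor and evaluates a Poisson negative log-likelihood. The exposure offset, the single tanh hidden layer, and all names and parameter shapes are assumptions for illustration; the paper's exact architecture and the negative binomial and MVNB variants are not reproduced here.

```python
import math


def glm_log_mean(x, beta, intercept):
    """Log-linear (GLM) part: intercept + beta . x on the log scale."""
    return intercept + sum(b * xi for b, xi in zip(beta, x))


def mlp(z, W1, b1, w2, b2):
    """One-hidden-layer perceptron with tanh units and a scalar output."""
    hidden = [
        math.tanh(sum(w * zi for w, zi in zip(row, z)) + bias)
        for row, bias in zip(W1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def cann_mean(x, z, exposure, glm_params, mlp_params):
    """Expected claim count: exposure * exp(GLM part + network part)."""
    log_mu = math.log(exposure) + glm_log_mean(x, *glm_params) + mlp(z, *mlp_params)
    return math.exp(log_mu)


def poisson_nll(y, mu):
    """Negative log-likelihood of observing y claims under Poisson(mu)."""
    return mu - y * math.log(mu) + math.lgamma(y + 1)
```

When the output weights of the network are zero, `cann_mean` reduces to the plain log-linear model, so in this sketch the classical model acts as a starting point that the network can refine.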
ADRNet: A Generalized Collaborative Filtering Framework Combining Clinical and Non-Clinical Data for Adverse Drug Reaction Prediction
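results: Extensive benchmark comparisons on two large clinical datasets, together with experiments on two non-clinical datasets, demonstrate the accuracy and efficiency of the proposed ADRNet framework.
Abstract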
Adverse drug reaction (ADR) prediction plays a crucial role in both health care and drug discovery for reducing patient mortality and enhancing drug safety. Recently, many studies have been devoted to effectively predicting drug-ADR incidence rates. However, these methods either did not effectively utilize non-clinical data, i.e., physical, chemical, and biological information about the drug, or did little to establish a link between content-based and pure collaborative filtering during the training phase. In this paper, we first formulate the prediction of multi-label ADRs as a drug-ADR collaborative filtering problem, and to the best of our knowledge, this is the first work to provide extensive benchmark results of previous collaborative filtering methods on two large publicly available clinical datasets. Then, by exploiting the easily accessible drug characteristics from non-clinical data, we propose ADRNet, a generalized collaborative filtering framework combining clinical and non-clinical data for drug-ADR prediction. Specifically, ADRNet has a shallow collaborative filtering module and a deep drug representation module, which can exploit the high-dimensional drug descriptors to further guide the learning of low-dimensional ADR latent embeddings, incorporating the benefits of both collaborative filtering and representation learning. Extensive experiments are conducted on two publicly available real-world drug-ADR clinical datasets and two non-clinical datasets to demonstrate the accuracy and efficiency of the proposed ADRNet. The code is available at https://github.com/haoxuanli-pku/ADRnet.
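Summary
Adverse drug reaction (ADR) prediction plays a crucial role in health care and drug discovery, reducing patient mortality and enhancing drug safety. In recent years, many studies have sought to predict drug-ADR incidence rates. However, these methods either did not effectively utilize non-clinical data, such as physical, chemical, and biological information about the drug, or did little to establish a link between content-based and pure collaborative filtering during the training phase. In this paper, we formulate multi-label ADR prediction as a drug-ADR collaborative filtering problem; to the best of our knowledge, this is the first work to provide extensive benchmark results on two large publicly available clinical datasets. Then, by exploiting easily accessible drug characteristics from non-clinical data, we propose ADRNet, a generalized collaborative filtering framework that combines clinical and non-clinical data for drug-ADR prediction. Specifically, ADRNet has a shallow collaborative filtering module and a deep drug representation module, which uses high-dimensional drug descriptors to further guide the learning of low-dimensional ADR latent embeddings, combining the benefits of collaborative filtering and representation learning. Extensive experiments on two publicly available real-world drug-ADR clinical datasets and two non-clinical datasets demonstrate the accuracy and efficiency of the proposed ADRNet. The code is available at https://github.com/haoxuanli-pku/ADRnet.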
Evaluating Link Prediction Explanations for Graph Neural Networks
results: State-of-the-art explainability methods for GNNs are evaluated with the proposed metrics, and the choice of distance between node embeddings is found to influence the quality of link prediction explanations.
Abstract
Graph Machine Learning (GML) has numerous applications, such as node/graph classification and link prediction, in real-world domains. Providing human-understandable explanations for GML models is a challenging yet fundamental task to foster their adoption, but validating explanations for link prediction models has received little attention. In this paper, we provide quantitative metrics to assess the quality of link prediction explanations, with or without ground-truth. State-of-the-art explainability methods for Graph Neural Networks are evaluated using these metrics. We discuss how underlying assumptions and technical details specific to the link prediction task, such as the choice of distance between node embeddings, can influence the quality of the explanations.
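Summary
Graph Machine Learning (GML) has numerous real-world applications, such as node/graph classification and link prediction. Providing human-understandable explanations for GML models is a fundamental task for fostering their adoption, yet validating explanations for link prediction models has received little attention. In this paper, we provide quantitative metrics to assess the quality of link prediction explanations, with or without ground truth. We use these metrics to evaluate state-of-the-art explainability methods for Graph Neural Networks. We also discuss how underlying assumptions and technical details specific to the link prediction task, such as the choice of distance between node embeddings, influence the quality of the explanations.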
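To see why the choice of distance between node embeddings matters, the sketch below scores candidate links with three common decoders and shows that they can rank the same candidates differently. The three scoring functions and the toy embeddings are assumptions chosen for illustration; the abstract does not say which distances the paper compares.

```python
import math


def dot_score(u, v):
    """Inner-product decoder: larger means a more likely link."""
    return sum(a * b for a, b in zip(u, v))


def cosine_score(u, v):
    """Cosine similarity: the inner product after length normalization."""
    return dot_score(u, v) / (math.sqrt(dot_score(u, u)) * math.sqrt(dot_score(v, v)))


def neg_l2_score(u, v):
    """Negative Euclidean distance: closer embeddings score higher."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_candidates(anchor, candidates, score_fn):
    """Candidate names sorted from most to least likely link partner."""
    return sorted(
        candidates,
        key=lambda name: score_fn(anchor, candidates[name]),
        reverse=True,
    )
```

An explanation method that attributes the predicted score to input features or edges inherits these differences, since the quantity being explained changes with the decoder.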
Learning Implicit Entity-object Relations by Bidirectional Generative Alignment for Multimodal NER
results: Extensive experiments on two benchmarks show that the method achieves state-of-the-art performance without image input during inference.
Abstract
The challenge posed by multimodal named entity recognition (MNER) is mainly two-fold: (1) bridging the semantic gap between text and image and (2) matching the entity with its associated object in image. Existing methods fail to capture the implicit entity-object relations, due to the lack of corresponding annotation. In this paper, we propose a bidirectional generative alignment method named BGA-MNER to tackle these issues. Our BGA-MNER consists of \texttt{image2text} and \texttt{text2image} generation with respect to entity-salient content in two modalities. It jointly optimizes the bidirectional reconstruction objectives, leading to aligning the implicit entity-object relations under such direct and powerful constraints. Furthermore, image-text pairs usually contain unmatched components which are noisy for generation. A stage-refined context sampler is proposed to extract the matched cross-modal content for generation. Extensive experiments on two benchmarks demonstrate that our method achieves state-of-the-art performance without image input during inference.
Efficiency of First-Order Methods for Low-Rank Tensor Recovery with the Tensor Nuclear Norm Under Strict Complementarity
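For: This paper studies the recovery of low-rank tensors through convex relaxations based on the tensor nuclear norm.
Methods: The paper analyzes standard first-order methods (projected gradient and extragradient) for constrained minimization over the tensor nuclear norm ball, under a strict complementarity condition.
Results: 1. When the objective is a strongly convex function composed with a linear map plus a linear term, a quadratic growth bound holds and standard projected gradient methods converge linearly, even though the objective need not be strongly convex. 2. For a smooth objective, near an optimal solution that satisfies strict complementarity, projections only require SVDs whose rank matches the tubal rank of the solution, which gives nearly linear runtime per iteration when the tubal rank is constant. 3. For a nonsmooth objective with a smooth saddle-point formulation, similar results hold for the extragradient method. The paper also extends many basic results to tensors of arbitrary order.
Abstract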
We consider convex relaxations for recovering low-rank tensors based on constrained minimization over a ball induced by the tensor nuclear norm, recently introduced in \cite{tensor_tSVD}. We build on a recent line of results that considered convex relaxations for the recovery of low-rank matrices and established that under a strict complementarity condition (SC), both the convergence rate and per-iteration runtime of standard gradient methods may improve dramatically. We develop the appropriate strict complementarity condition for the tensor nuclear norm ball and obtain the following main results under this condition: 1. When the objective to minimize is of the form $f(\mX)=g(\mA\mX)+\langle{\mC,\mX}\rangle$ , where $g$ is strongly convex and $\mA$ is a linear map (e.g., least squares), a quadratic growth bound holds, which implies linear convergence rates for standard projected gradient methods, despite the fact that $f$ need not be strongly convex. 2. For a smooth objective function, when initialized in certain proximity of an optimal solution which satisfies SC, standard projected gradient methods only require SVD computations (for projecting onto the tensor nuclear norm ball) of rank that matches the tubal rank of the optimal solution. In particular, when the tubal rank is constant, this implies nearly linear (in the size of the tensor) runtime per iteration, as opposed to super linear without further assumptions. 3. For a nonsmooth objective function which admits a popular smooth saddle-point formulation, we derive similar results to the latter for the well known extragradient method. An additional contribution which may be of independent interest, is the rigorous extension of many basic results regarding tensors of arbitrary order, which were previously obtained only for third-order tensors.
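Summary
We consider convex relaxations for recovering low-rank tensors, based on constrained minimization over the ball induced by the tensor nuclear norm introduced in \cite{tensor_tSVD}. We build on a recent line of results on convex relaxations for low-rank matrix recovery, which established that under a strict complementarity condition (SC) both the convergence rate and the per-iteration runtime of standard gradient methods may improve dramatically. We develop the appropriate SC condition for the tensor nuclear norm ball and obtain the following main results: 1. When the objective has the form $f(\mX)=g(\mA\mX)+\langle{\mC,\mX}\rangle$, where $g$ is strongly convex and $\mA$ is a linear map (e.g., least squares), a quadratic growth bound holds, which implies linear convergence rates for standard projected gradient methods even though $f$ need not be strongly convex. 2. For a smooth objective, when initialized close enough to an optimal solution that satisfies SC, standard projected gradient methods only require SVD computations (for projecting onto the tensor nuclear norm ball) of rank matching the tubal rank of the optimal solution. When the tubal rank is constant, this implies nearly linear runtime per iteration in the size of the tensor, as opposed to super-linear runtime without further assumptions. 3. For a nonsmooth objective that admits a popular smooth saddle-point formulation, we derive similar results for the well-known extragradient method. An additional contribution of possible independent interest is the rigorous extension of many basic results, previously obtained only for third-order tensors, to tensors of arbitrary order.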
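For readers unfamiliar with the term, the quadratic growth property in result 1 and the linear rate it leads to can be written as follows. This is a sketch in standard notation, not the paper's exact statement: the radius $\tau$, the growth constant $\alpha$, and the contraction factor $\rho$ are assumed symbols, and the precise constants depend on the paper's assumptions.

```latex
% Feasible set: tensor nuclear norm ball of radius \tau (\tau is an assumed symbol)
\mathcal{B}_\tau = \{\mX : \|\mX\|_* \le \tau\}, \qquad
\mathcal{X}^* = \arg\min_{\mX \in \mathcal{B}_\tau} f(\mX), \qquad
f^* = \min_{\mX \in \mathcal{B}_\tau} f(\mX)

% Quadratic growth with an assumed constant \alpha > 0
f(\mX) - f^* \;\ge\; \frac{\alpha}{2}\,\operatorname{dist}(\mX, \mathcal{X}^*)^2
\quad \text{for all } \mX \in \mathcal{B}_\tau

% Linear convergence of projected gradient iterates \mX_t, for some \rho \in (0,1)
f(\mX_t) - f^* \;\le\; \rho^{\,t}\,\bigl(f(\mX_0) - f^*\bigr)
```

Quadratic growth is weaker than strong convexity, which is why a linear rate remains possible even though $f$ itself need not be strongly convex.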
End-to-End Reinforcement Learning of Koopman Models for Economic Nonlinear MPC
paper_authors: Daniel Mayfrank, Alexander Mitsos, Manuel Dahmen
For: The paper aims to develop a method for end-to-end reinforcement learning of dynamic surrogate models for optimal performance in nonlinear model predictive control (NMPC) applications.
Methods: The paper proposes a method for training dynamic surrogate models using reinforcement learning, which can reduce the computational burden of NMPC and improve its performance.
Results: The paper shows that the proposed method can achieve predictive controllers that strike a favorable balance between control performance and computational demand, and outperform models derived from system identification. Additionally, the method can react to changes in the control setting without retraining.
Abstract
(Economic) nonlinear model predictive control ((e)NMPC) requires dynamic system models that are sufficiently accurate in all relevant state-space regions. These models must also be computationally cheap enough to ensure real-time tractability. Data-driven surrogate models for mechanistic models can be used to reduce the computational burden of (e)NMPC; however, such models are typically trained by system identification for maximum average prediction accuracy on simulation samples and perform suboptimally as part of actual (e)NMPC. We present a method for end-to-end reinforcement learning of dynamic surrogate models for optimal performance in (e)NMPC applications, resulting in predictive controllers that strike a favorable balance between control performance and computational demand. We validate our method on two applications derived from an established nonlinear continuous stirred-tank reactor model. We compare the controller performance to that of MPCs utilizing models trained by the prevailing maximum prediction accuracy paradigm, and model-free neural network controllers trained using reinforcement learning. We show that our method matches the performance of the model-free neural network controllers while consistently outperforming models derived from system identification. Additionally, we show that the MPC policies can react to changes in the control setting without retraining.
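Summary
(Economic) nonlinear model predictive control ((e)NMPC) requires dynamic system models that are sufficiently accurate in all relevant state-space regions and computationally cheap enough to ensure real-time tractability. Data-driven surrogate models can reduce the computational burden of (e)NMPC; however, such models are typically trained by system identification for maximum average prediction accuracy on simulation samples. We present a method for end-to-end reinforcement learning of dynamic surrogate models for optimal performance in (e)NMPC applications. We validate the method on two applications derived from an established nonlinear continuous stirred-tank reactor model. We compare the controller performance to that of MPCs using models trained by system identification and to model-free neural network controllers trained with reinforcement learning. Our method matches the performance of the model-free neural network controllers while consistently outperforming models derived from system identification. We also show that the MPC policies can react to changes in the control setting without retraining.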
UniG-Encoder: A Universal Feature Encoder for Graph and Hypergraph Node Classification
results: On twelve representative hypergraph datasets and six real-world graph datasets, the method outperforms existing methods.
Abstract
Graph and hypergraph representation learning has attracted increasing attention from various research fields. Despite the decent performance and fruitful applications of Graph Neural Networks (GNNs), Hypergraph Neural Networks (HGNNs), and their well-designed variants, on some commonly used benchmark graphs and hypergraphs, they are outperformed by even a simple Multi-Layer Perceptron. This observation motivates a reexamination of the design paradigm of the current GNNs and HGNNs and poses challenges of extracting graph features effectively. In this work, a universal feature encoder for both graph and hypergraph representation learning is designed, called UniG-Encoder. The architecture starts with a forward transformation of the topological relationships of connected nodes into edge or hyperedge features via a normalized projection matrix. The resulting edge/hyperedge features, together with the original node features, are fed into a neural network. The encoded node embeddings are then derived from the reversed transformation, described by the transpose of the projection matrix, of the network's output, which can be further used for tasks such as node classification. The proposed architecture, in contrast to the traditional spectral-based and/or message passing approaches, simultaneously and comprehensively exploits the node features and graph/hypergraph topologies in an efficient and unified manner, covering both heterophilic and homophilic graphs. The designed projection matrix, encoding the graph features, is intuitive and interpretable. Extensive experiments are conducted and demonstrate the superior performance of the proposed framework on twelve representative hypergraph datasets and six real-world graph datasets, compared to the state-of-the-art methods. Our implementation is available online at https://github.com/MinhZou/UniG-Encoder.
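Summary
Graph and hypergraph representation learning has attracted increasing attention across research fields. Despite the decent performance and fruitful applications of Graph Neural Networks (GNNs), Hypergraph Neural Networks (HGNNs), and their well-designed variants, on some commonly used benchmark graphs and hypergraphs they are outperformed by even a simple Multi-Layer Perceptron (MLP). This observation motivates a reexamination of the design paradigm of current GNNs and HGNNs and poses the challenge of extracting graph features effectively. In this work, we propose a universal feature encoder called UniG-Encoder. Its architecture consists of a forward transformation and a reverse transformation built on a normalized projection matrix. The forward transformation maps the topological relationships of connected nodes into edge or hyperedge features via the normalized projection matrix. The resulting edge/hyperedge features, together with the original node features, are fed into a neural network. The encoded node embeddings are then obtained by the reverse transformation, i.e., by applying the transpose of the projection matrix to the network's output. In contrast to traditional spectral-based and/or message passing approaches, this architecture exploits node features and graph/hypergraph topology simultaneously and comprehensively, in an efficient and unified manner. The designed projection matrix represents the graph features in an intuitive and interpretable way. Extensive experiments on twelve representative hypergraph datasets and six real-world graph datasets show that the proposed framework outperforms state-of-the-art methods. Our implementation is available at https://github.com/MinhZou/UniG-Encoder.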
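The forward and reverse transformations can be sketched with plain lists. This is a minimal illustration under stated assumptions: the projection matrix stacks an identity block (which keeps the original node features) on top of one row per hyperedge that averages its member nodes, and `network` is a placeholder for any row-wise neural network. The actual normalization and network in UniG-Encoder may differ.

```python
def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*A)]


def projection_matrix(num_nodes, hyperedges):
    """Identity rows for nodes, then one averaging row per (hyper)edge.

    Assumption: mean normalization over the members of each hyperedge.
    """
    rows = [[1.0 if i == j else 0.0 for j in range(num_nodes)] for i in range(num_nodes)]
    for edge in hyperedges:
        weight = 1.0 / len(edge)
        rows.append([weight if j in edge else 0.0 for j in range(num_nodes)])
    return rows


def encode(X, hyperedges, network):
    """Forward projection, placeholder network, then reverse projection via P^T."""
    P = projection_matrix(len(X), hyperedges)
    stacked = matmul(P, X)          # node features followed by edge features
    hidden = network(stacked)       # any model applied to the stacked rows
    return matmul(transpose(P), hidden)
```

An ordinary graph is the special case in which every hyperedge has exactly two members, which is how a single encoder can cover both settings in this sketch.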
MARLIM: Multi-Agent Reinforcement Learning for Inventory Management
paper_authors: Rémi Leluc, Elie Kadoche, Antoine Bertoncello, Sébastien Gourvénec
for: solve the inventory management problem for a single-echelon multi-products supply chain with stochastic demands and lead-times.
methods: reinforcement learning framework called MARLIM, with controllers developed through single or multiple agents in a cooperative setting.
results: numerical experiments on real data demonstrate the benefits of reinforcement learning methods over traditional baselines.
Abstract
Maintaining a balance between the supply and demand of products by optimizing replenishment decisions is one of the most important challenges in the supply chain industry. This paper presents a novel reinforcement learning framework called MARLIM, to address the inventory management problem for a single-echelon multi-products supply chain with stochastic demands and lead-times. Within this context, controllers are developed through single or multiple agents in a cooperative setting. Numerical experiments on real data demonstrate the benefits of reinforcement learning methods over traditional baselines.
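Summary
Maintaining a balance between the supply and demand of products is one of the most important challenges in the supply chain industry. This paper presents a novel reinforcement learning framework called MARLIM to address the inventory management problem for a single-echelon multi-products supply chain. Within this context, controllers are developed through single or multiple agents in a cooperative setting. Numerical experiments demonstrate the benefits of reinforcement learning methods over traditional baselines.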
Interleaving GANs with knowledge graphs to support design creativity for book covers
results: The method generates book covers better than previous attempts, and the knowledge graph gives the book author or editor more options.
Abstract
An attractive book cover is important for the success of a book. In this paper, we apply Generative Adversarial Networks (GANs) to the book covers domain, using different methods for training in order to obtain better generated images. We interleave GANs with knowledge graphs to alter the input title to obtain multiple possible options for any given title, which are then used as an augmented input to the generator. Finally, we use the discriminator obtained during the training phase to select the best images generated with new titles. Our method performed better at generating book covers than previous attempts, and the knowledge graph gives better options to the book author or editor compared to using GANs alone.
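Summary
An attractive book cover is important for the success of a book. In this paper, we apply Generative Adversarial Networks (GANs) to the book covers domain, using different training methods to obtain better generated images. We interleave GANs with knowledge graphs to alter the input title, producing multiple possible options that are then passed to the generator as augmented input. Finally, we use the discriminator obtained during training to select the best generated images. Our method performs better at generating book covers, and the knowledge graph gives the book author or editor better options than using GANs alone.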
Weighted Multi-Level Feature Factorization for App ads CTR and installation prediction
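methods: The method is based on weighted multi-level feature factorization and treats clicking and installing as two different yet related tasks, so the model engineers a task-specific set of features for each task plus a set of shared features.
results: The submission ranked 11th with an overall score of 55 in the academia-track final results.
Abstract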
This paper provides an overview of the approach we used as team ISISTANITOS for the ACM RecSys Challenge 2023. The competition was organized by ShareChat and involved predicting the probability of a user clicking an app ad and/or installing an app, to improve deep funnel optimization with a special focus on user privacy. Our proposed method infers the probabilities of clicking and installing as two different but related tasks. Hence, the model engineers a specific set of features for each task and a set of shared features. Our model is called Weighted Multi-Level Feature Factorization because it considers the interaction of features of different orders, where the order is associated with the depth in a neural network. The prediction for a given task is generated by combining the task-specific and shared features on the different levels. Our submission achieved rank 11 and an overall score of 55 in the competition's academia-track final results. We release our source code at: https://github.com/knife982000/RecSys2023Challenge
Multimodal Indoor Localisation in Parkinson’s Disease for Detecting Medication Use: Observational Pilot Study in a Free-Living Setting
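results: The proposed network performs well on real-world data, and its room-level localisation predictions accurately indicate whether the person with PD is taking levodopa medication.
Abstract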
Parkinson's disease (PD) is a slowly progressive, debilitating neurodegenerative disease which causes motor symptoms including gait dysfunction. Motor fluctuations are alterations between periods with a positive response to levodopa therapy ("on") and periods marked by re-emergence of PD symptoms ("off") as the response to medication wears off. These fluctuations often affect gait speed and they increase in their disabling impact as PD progresses. To improve the effectiveness of current indoor localisation methods, a transformer-based approach utilising dual modalities which provide complementary views of movement, Received Signal Strength Indicator (RSSI) and accelerometer data from wearable devices, is proposed. A sub-objective aims to evaluate whether indoor localisation, including its in-home gait speed features (i.e. the time taken to walk between rooms), could be used to evaluate motor fluctuations by detecting whether the person with PD is taking levodopa medications or withholding them. To properly evaluate our proposed method, we use a free-living dataset where the movements and mobility are greatly varied and unstructured as expected in real-world conditions. 24 participants lived in pairs (consisting of one person with PD, one control) for five days in a smart home with various sensors. Our evaluation on the resulting dataset demonstrates that our proposed network outperforms other methods for indoor localisation. The sub-objective evaluation shows that precise room-level localisation predictions, transformed into in-home gait speed features, produce accurate predictions on whether the PD participant is taking or withholding their medications.
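Summary
Parkinson's disease (PD) is a slowly progressive, debilitating neurodegenerative disease that causes motor symptoms including gait dysfunction. Under levodopa therapy, motor symptoms fluctuate between "on" and "off" states. These fluctuations affect gait speed and worsen as the disease progresses. To improve current indoor localisation methods, a transformer-based approach is proposed that uses two modalities, Received Signal Strength Indicator (RSSI) and accelerometer data from wearable devices. A sub-objective is to evaluate whether indoor localisation, including its in-home gait speed features (i.e., the time taken to walk between rooms), could be used to evaluate motor fluctuations by detecting whether the person with PD is taking levodopa medications or withholding them. To properly evaluate the proposed method, we use a free-living dataset in which movements and mobility are greatly varied and unstructured, as expected in real-world conditions. 24 participants lived in pairs (one person with PD, one control) for five days in a smart home with various sensors. Our evaluation on the resulting dataset shows that the proposed network performs well for indoor localisation. The sub-objective evaluation shows that precise room-level localisation predictions, transformed into in-home gait speed features, accurately predict whether the PD participant is taking or withholding their medications.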
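The step from room-level predictions to an in-home gait speed feature can be sketched as follows. The input format is an assumption: a time-ordered list of `(timestamp_seconds, room)` pairs, where `room` is `None` whenever no confident room prediction is available (for example while the person is between rooms). The study's actual feature definition may differ.

```python
def room_transition_times(predictions):
    """Return (from_room, to_room, seconds) for every change of room.

    The duration runs from the last sample in the old room to the first
    sample in the new room; samples with room=None are skipped.
    """
    transitions = []
    last_room = None
    last_seen = None
    for timestamp, room in predictions:
        if room is None:
            continue
        if last_room is not None and room != last_room:
            transitions.append((last_room, room, timestamp - last_seen))
        last_room = room
        last_seen = timestamp
    return transitions


def mean_transition_time(predictions):
    """Average room-to-room transition time, or None if no transition occurred."""
    durations = [seconds for _, _, seconds in room_transition_times(predictions)]
    return sum(durations) / len(durations) if durations else None
```

A longer average transition time over a given window would then serve as a proxy for slower gait, which is the kind of signal a downstream classifier could use to separate medicated from withheld periods.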
A Novel Convolutional Neural Network Architecture with a Continuous Symmetry
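results: The architecture reaches performance comparable to traditional models on the image classification task while allowing the weights to be modified via a continuous group of symmetry.
Abstract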
This paper introduces a new Convolutional Neural Network (ConvNet) architecture inspired by a class of partial differential equations (PDEs) called quasi-linear hyperbolic systems. With comparable performance on the image classification task, it allows for the modification of the weights via a continuous group of symmetry. This is a significant shift from traditional models where the architecture and weights are essentially fixed. We wish to promote the (internal) symmetry as a new desirable property for a neural network, and to draw attention to the PDE perspective in analyzing and interpreting ConvNets in the broader Deep Learning community.
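Summary
This paper introduces a new Convolutional Neural Network (ConvNet) architecture inspired by a class of partial differential equations (PDEs) called quasi-linear hyperbolic systems. Unlike traditional models, the architecture allows the weights to be modified via a continuous group of symmetry, which is a significant shift. We wish to promote (internal) symmetry as a new desirable property for a neural network, and to draw the attention of the broader Deep Learning community to the PDE perspective in analyzing and interpreting ConvNets.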
Assessing Systematic Weaknesses of DNNs using Counterfactuals
paper_authors: Sujan Sai Gannamaneni, Michael Mock, Maram Akila
for: This paper aims to identify systematic weaknesses in deep neural networks (DNNs) for safety-critical applications, and to develop an effective and computationally efficient algorithm to validate the semantic attribution of existing subsets.
methods: The proposed method is inspired by counterfactual explanations and checks whether an identified semantic attribute is likely to have caused the degraded performance observed on a subset.
results: The authors demonstrate the effectiveness of their approach on an example from the autonomous driving domain, showing that the proposed method can identify performance differences among different pedestrian assets, and that the asset type is not always the reason for reduced performance.
Abstract
With the advancement of DNNs into safety-critical applications, testing approaches for such models have gained more attention. A current direction is the search for and identification of systematic weaknesses that put safety assumptions based on average performance values at risk. Such weaknesses can take on the form of (semantically coherent) subsets or areas in the input space where a DNN performs systematically worse than its expected average. However, it is non-trivial to attribute the reason for such observed low performances to the specific semantic features that describe the subset. For instance, inhomogeneities within the data w.r.t. other (non-considered) attributes might distort results. However, taking into account all (available) attributes and their interaction is often computationally highly expensive. Inspired by counterfactual explanations, we propose an effective and computationally cheap algorithm to validate the semantic attribution of existing subsets, i.e., to check whether the identified attribute is likely to have caused the degraded performance. We demonstrate this approach on an example from the autonomous driving domain using highly annotated simulated data, where we show for a semantic segmentation model that (i) performance differences among the different pedestrian assets exist, but (ii) only in some cases is the asset type itself the reason for this reduction in the performance.
Feature Noise Boosts DNN Generalization under Label Noise
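results: Theoretical analysis and experiments show that the feature noise method effectively improves DNN generalization under label noise, and that suitable types and levels of feature noise can be chosen for different levels and types of label noise.
Abstract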
The presence of label noise in the training data has a profound impact on the generalization of deep neural networks (DNNs). In this study, we introduce a simple feature noise method, which directly adds noise to the features of training data, and theoretically demonstrate that it can enhance the generalization of DNNs under label noise. Specifically, we conduct theoretical analyses to reveal that label noise leads to weakened DNN generalization by loosening the PAC-Bayes generalization bound, and feature noise results in better DNN generalization by imposing an upper bound on the mutual information between the model weights and the features, which constrains the PAC-Bayes generalization bound. Furthermore, to ensure effective generalization of DNNs in the presence of label noise, we conduct application analyses to identify the optimal types and levels of feature noise to add for obtaining desirable label noise generalization. Finally, extensive experimental results on several popular datasets demonstrate that the feature noise method can significantly enhance the label noise generalization of the state-of-the-art label noise method.
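Summary
The generalization of deep neural networks (DNNs) is affected by label noise. In this study, we introduce a simple feature noise method that adds noise directly to the features of the training data to improve DNN generalization. Our theoretical analysis shows that label noise weakens DNN generalization, whereas feature noise benefits it and constrains the PAC-Bayes generalization bound. Furthermore, to ensure effective generalization of DNNs under label noise, we conduct application analyses to determine suitable types and levels of feature noise. Finally, extensive experiments on several popular datasets show that the feature noise method significantly improves the generalization of label noise methods.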
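Adding noise directly to the training features can be sketched in a few lines. The zero-mean Gaussian noise, the `sigma` parameter, and the function name are assumptions for illustration; the paper analyses which noise types and levels are appropriate, and that selection is not reproduced here.

```python
import random


def add_feature_noise(features, sigma, rng=None):
    """Return a copy of `features` with independent Gaussian noise on every entry.

    features: list of rows (one row per training example).
    sigma:    standard deviation of the zero-mean noise (assumed noise model).
    """
    rng = rng or random.Random()
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in features]
```

In a training loop this would typically be applied to each batch before the forward pass, leaving the (possibly noisy) labels untouched.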
Discriminative Graph-level Anomaly Detection via Dual-students-teacher Model
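methods: The paper proposes a dual-students-teacher model in which the teacher model is trained with a heuristic loss to make graph representations more divergent. Two competing student models, one trained on normal graphs and one on abnormal graphs, fit the teacher's representations from the node-level and graph-level perspectives. Finally, the representation errors of the two student models are combined to identify anomalous graphs.
results: Experimental analysis shows that the method is effective at detecting anomalous graphs on real-world graph datasets.
Abstract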
Different from the current node-level anomaly detection task, the goal of graph-level anomaly detection is to find abnormal graphs that significantly differ from others in a graph set. Due to the scarcity of research on graph-level anomaly detection, the detailed description of graph-level anomaly is insufficient. Furthermore, existing works focus on capturing anomalous graph information to learn better graph representations, but they ignore the importance of an effective anomaly score function for evaluating abnormal graphs. Thus, in this work, we first define anomalous graph information including node and graph property anomalies in a graph set and adopt node-level and graph-level information differences to identify them, respectively. Then, we introduce a discriminative graph-level anomaly detection framework with a dual-students-teacher model, where the teacher model is trained with a heuristic loss to make graph representations more divergent. Then, two competing student models trained by normal and abnormal graphs respectively fit graph representations of the teacher model in terms of node-level and graph-level representation perspectives. Finally, we combine representation errors between two student models to discriminatively distinguish anomalous graphs. Extensive experiment analysis demonstrates that our method is effective for the graph-level anomaly detection task on graph datasets in the real world.
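Summary
Unlike the current node-level anomaly detection task, graph-level anomaly detection aims to find abnormal graphs that differ significantly from the others in a graph set. Because research on graph-level anomaly detection is scarce, the description of graph-level anomalies remains insufficiently detailed. Existing works mainly capture anomalous graph information to learn better graph representations, but they overlook the importance of an effective anomaly score function for evaluating abnormal graphs. In this work, we first define the anomalous information in a graph set, covering node and graph property anomalies. We then propose a discriminative graph-level anomaly detection framework that uses two student models and one teacher model. The teacher is trained with a heuristic loss to make graph representations more divergent. The two students, trained on normal and abnormal graphs respectively, fit the teacher's graph representations from node-level and graph-level perspectives. Finally, we combine the representation errors of the two students to discriminate anomalous graphs. Extensive experimental analysis on real-world graph datasets demonstrates the method's effectiveness.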
Unsupervised Multiplex Graph Learning with Complementary and Consistent Information
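results: Compared with other methods, the proposed method shows notable effectiveness and efficiency across different downstream tasks and effectively handles the out-of-sample and noise issues.
Abstract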
Unsupervised multiplex graph learning (UMGL) has been shown to achieve significant effectiveness for different downstream tasks by exploring both complementary information and consistent information among multiple graphs. However, previous methods usually overlook the issues in practical applications, i.e., the out-of-sample issue and the noise issue. To address the above issues, in this paper, we propose an effective and efficient UMGL method to explore both complementary and consistent information. To do this, our method employs multiple MLP encoders rather than graph convolutional network (GCN) to conduct representation learning with two constraints, i.e., preserving the local graph structure among nodes to handle the out-of-sample issue, and maximizing the correlation of multiple node representations to handle the noise issue. Comprehensive experiments demonstrate that our proposed method achieves superior effectiveness and efficiency over the comparison methods and effectively tackles those two issues. Code is available at https://github.com/LarryUESTC/CoCoMG.
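Summary
Unsupervised multiplex graph learning (UMGL) has shown notable effectiveness on different downstream tasks by exploring both the consistent and the complementary information among multiple graphs. However, previous methods usually overlook two issues that arise in practice: the out-of-sample issue and the noise issue. To address them, this paper proposes an effective and efficient UMGL method that performs representation learning under two constraints: preserving the local graph structure among nodes to handle the out-of-sample issue, and maximizing the correlation among multiple node representations to handle the noise issue. Extensive experiments show that the proposed method is more effective and efficient than the comparison methods and tackles both issues. Code is available at https://github.com/LarryUESTC/CoCoMG.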
Deep Learning-based surrogate models for parametrized PDEs: handling geometric variability through graph neural networks
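results: Experiments show that GNNs offer an effective alternative that improves simulation efficiency and generalization. GNNs also generalize across different geometries and mesh resolutions.
Abstract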
Mesh-based simulations play a key role when modeling complex physical systems that, in many disciplines across science and engineering, require the solution of parametrized time-dependent nonlinear partial differential equations (PDEs). In this context, full order models (FOMs), such as those relying on the finite element method, can reach high levels of accuracy, however often yielding intensive simulations to run. For this reason, surrogate models are developed to replace computationally expensive solvers with more efficient ones, which can strike favorable trade-offs between accuracy and efficiency. This work explores the potential usage of graph neural networks (GNNs) for the simulation of time-dependent PDEs in the presence of geometrical variability. In particular, we propose a systematic strategy to build surrogate models based on a data-driven time-stepping scheme where a GNN architecture is used to efficiently evolve the system. With respect to the majority of surrogate models, the proposed approach stands out for its ability of tackling problems with parameter dependent spatial domains, while simultaneously generalizing to different geometries and mesh resolutions. We assess the effectiveness of the proposed approach through a series of numerical experiments, involving both two- and three-dimensional problems, showing that GNNs can provide a valid alternative to traditional surrogate models in terms of computational efficiency and generalization to new scenarios. We also assess, from a numerical standpoint, the importance of using GNNs, rather than classical dense deep neural networks, for the proposed framework.
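Summary
Mesh-based simulations play a key role across many areas of science and engineering, especially when solving parametrized, time-dependent, nonlinear partial differential equations (PDEs). In this setting, full order models (FOMs), such as those based on the finite element method, reach high accuracy but are usually expensive to compute. Surrogate models are therefore developed to replace them with more efficient solvers that strike an acceptable trade-off between accuracy and efficiency. This work explores graph neural networks (GNNs) for simulating time-dependent PDEs, particularly on parametrized spatial domains. We propose a systematic approach that builds surrogate models with a data-driven time-stepping scheme, in which a GNN architecture efficiently evolves the system. Unlike most surrogate models, the approach handles problems with parameter-dependent spatial domains while generalizing to different geometries and mesh resolutions. Through a series of numerical experiments on two- and three-dimensional problems, we show that GNNs provide a valid alternative to traditional surrogate models. We also assess, from a numerical standpoint, the importance of using GNNs rather than classical dense deep neural networks.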
Experimental Results regarding multiple Machine Learning via Quaternions
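results: Built on quaternions and several machine learning algorithms, the study reports higher accuracy and significantly improved predictive performance. Overall, it provides an empirical basis for using quaternions in machine learning tasks.
Abstract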
This paper presents an experimental study on the application of quaternions in several machine learning algorithms. Quaternion is a mathematical representation of rotation in three-dimensional space, which can be used to represent complex data transformations. In this study, we explore the use of quaternions to represent and classify rotation data, using randomly generated quaternion data and corresponding labels, converting quaternions to rotation matrices, and using them as input features. Based on quaternions and multiple machine learning algorithms, it has shown higher accuracy and significantly improved performance in prediction tasks. Overall, this study provides an empirical basis for exploiting quaternions for machine learning tasks.
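The abstract mentions converting quaternions to rotation matrices before using them as input features. A minimal sketch of that conversion follows, using the standard unit-quaternion formula. The function names and the `(w, x, y, z)` component order are assumptions made for illustration, not the authors' code.

```python
import math


def quaternion_to_rotation_matrix(w, x, y, z):
    """Convert a quaternion (w, x, y, z) into a 3x3 rotation matrix.

    The quaternion is normalized first, so any non-zero input is accepted.
    The matrix is returned as a list of three rows.
    """
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("a zero-norm quaternion does not describe a rotation")
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def flatten(matrix):
    """Flatten the 3x3 matrix into a 9-element feature vector (hypothetical helper)."""
    return [value for row in matrix for value in row]
```

Flattening the matrix gives a 9-dimensional feature vector per sample, which is one plausible way to feed rotations to a standard classifier.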
SoK: Assessing the State of Applied Federated Machine Learning
results: The study finds that FedML holds promise for privacy-critical environments, but practical deployment still faces challenges such as data privacy protection and communication overhead.
Abstract
Machine Learning (ML) has shown significant potential in various applications; however, its adoption in privacy-critical domains has been limited due to concerns about data privacy. A promising solution to this issue is Federated Machine Learning (FedML), a model-to-data approach that prioritizes data privacy. By enabling ML algorithms to be applied directly to distributed data sources without sharing raw data, FedML offers enhanced privacy protections, making it suitable for privacy-critical environments. Despite its theoretical benefits, FedML has not seen widespread practical implementation. This study aims to explore the current state of applied FedML and identify the challenges hindering its practical adoption. Through a comprehensive systematic literature review, we assess 74 relevant papers to analyze the real-world applicability of FedML. Our analysis focuses on the characteristics and emerging trends of FedML implementations, as well as the motivational drivers and application domains. We also discuss the encountered challenges in integrating FedML into real-life settings. By shedding light on the existing landscape and potential obstacles, this research contributes to the further development and implementation of FedML in privacy-critical scenarios.
Unsupervised Representation Learning for Time Series: A Review
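results: The study finds that contrastive learning methods achieve the best representation learning performance on 9 real-world datasets, and it sets out practical considerations and open research challenges.
Abstract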
Unsupervised representation learning approaches aim to learn discriminative feature representations from unlabeled data, without the requirement of annotating every sample. Enabling unsupervised representation learning is extremely crucial for time series data, due to its unique annotation bottleneck caused by its complex characteristics and lack of visual cues compared with other data modalities. In recent years, unsupervised representation learning techniques have advanced rapidly in various domains. However, there is a lack of systematic analysis of unsupervised representation learning approaches for time series. To fill the gap, we conduct a comprehensive literature review of existing rapidly evolving unsupervised representation learning approaches for time series. Moreover, we also develop a unified and standardized library, named ULTS (i.e., Unsupervised Learning for Time Series), to facilitate fast implementations and unified evaluations on various models. With ULTS, we empirically evaluate state-of-the-art approaches, especially the rapidly evolving contrastive learning methods, on 9 diverse real-world datasets. We further discuss practical considerations as well as open research challenges on unsupervised representation learning for time series to facilitate future research in this field.
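Summary
Unsupervised representation learning approaches aim to learn discriminative feature representations from unlabeled data, without the requirement of annotating every sample. The complex characteristics of time series and their lack of visual cues create a severe annotation bottleneck, which makes unsupervised representation learning especially important for this data. In recent years, unsupervised representation learning techniques have advanced rapidly in various domains, yet a systematic analysis of such approaches for time series is still lacking. To fill this gap, we conduct a comprehensive literature review of existing, rapidly evolving unsupervised representation learning approaches for time series. We also develop a unified and standardized library, named ULTS (i.e., Unsupervised Learning for Time Series), to facilitate fast implementations and unified evaluations of various models. With ULTS, we empirically evaluate state-of-the-art approaches, especially the rapidly evolving contrastive learning methods, on diverse datasets. We further discuss practical considerations and open research problems to facilitate future research in this field.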
Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS
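results: Compared with the recent state-of-the-art models FastSpeech2 and DiffGAN-TTS, the proposed model performs better on various evaluation metrics, including SSIM, MCD, F0 RMSE, STOI, and PESQ, as well as the subjective MOS.
Abstract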
The diffusion model is capable of generating high-quality data through a probabilistic approach. However, it suffers from the drawback of slow generation speed due to the requirement of a large number of time steps. To address this limitation, recent models such as denoising diffusion implicit models (DDIM) focus on generating samples without directly modeling the probability distribution, while models like denoising diffusion generative adversarial networks (GAN) combine diffusion processes with GANs. In the field of speech synthesis, a recent diffusion speech synthesis model called DiffGAN-TTS, utilizing the structure of GANs, has been introduced and demonstrates superior performance in both speech quality and generation speed. In this paper, to further enhance the performance of DiffGAN-TTS, we propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data. Objective metrics such as structural similarity index measure (SSIM), mel-cepstral distortion (MCD), F0 root mean squared error (F0 RMSE), short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), as well as subjective metrics like mean opinion score (MOS), are used to evaluate the performance of the proposed model. The evaluation results show that the proposed model outperforms recent state-of-the-art models such as FastSpeech2 and DiffGAN-TTS in various metrics. Our implementation and audio samples are located on GitHub.
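Summary
Diffusion models can generate high-quality data through a probabilistic approach. However, they need a large number of time steps, which makes generation slow. To address this limitation, recent models such as denoising diffusion implicit models (DDIM) and denoising diffusion generative adversarial networks (GAN) generate samples without directly modeling the probability distribution. In speech synthesis, a recent diffusion-based model called DiffGAN-TTS, which uses the structure of GANs, has been introduced and shows superior speech quality and generation speed. In this paper, we propose a speech synthesis model with two discriminators: a diffusion discriminator that learns the distribution of the reverse process and a spectrogram discriminator that learns the distribution of the generated data. Objective metrics such as structural similarity index measure (SSIM), mel-cepstral distortion (MCD), F0 root mean squared error (F0 RMSE), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ), together with subjective metrics such as mean opinion score (MOS), are used to evaluate the proposed model. The evaluation results show that it outperforms recent state-of-the-art models such as FastSpeech2 and DiffGAN-TTS on various metrics. Our implementation and audio samples are on GitHub.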
Fast Slate Policy Optimization: Going Beyond Plackett-Luce
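results: The paper compares the method with the commonly used Plackett-Luce policy class on problems with action spaces on the order of millions, demonstrating its effectiveness.
Abstract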
An increasingly important building block of large-scale machine learning systems is based on returning slates: ordered lists of items given a query. Applications of this technology include search, information retrieval, and recommender systems. When the action space is large, decision systems are restricted to a particular structure to complete online queries quickly. This paper addresses the optimization of these large-scale decision systems given an arbitrary reward function. We cast this learning problem in a policy optimization framework and propose a new class of policies, born from a novel relaxation of decision functions. This results in a simple yet efficient learning algorithm that scales to massive action spaces. We compare our method to the commonly adopted Plackett-Luce policy class and demonstrate the effectiveness of our approach on problems with action space sizes in the order of millions.
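Summary
An increasingly important building block of large-scale machine learning systems is returning slates, that is, ordered lists of items given a query. Applications include search, information retrieval, and recommender systems. When the action space is large, decision systems are restricted to a particular structure so that online queries complete quickly. This paper casts the optimization of such large-scale decision systems in a policy optimization framework and proposes a new class of policies. The new class arises from a novel relaxation of decision functions and yields a simple yet efficient learning algorithm that scales to massive action spaces. We compare our method with the commonly adopted Plackett-Luce policy class and demonstrate its effectiveness on problems with action spaces on the order of millions.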
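For readers unfamiliar with the Plackett-Luce baseline named above, the sketch below samples an ordered slate from a Plackett-Luce distribution and computes its log-probability. It illustrates only the baseline policy class, not the paper's proposed relaxation. The function names and the per-item `scores` input are assumptions.

```python
import math
import random


def sample_plackett_luce_slate(scores, slate_size, rng=None):
    """Sample an ordered slate by drawing items one at a time without replacement.

    Each draw picks a remaining item with probability proportional to exp(score).
    """
    rng = rng or random.Random()
    remaining = list(range(len(scores)))
    slate = []
    for _ in range(min(slate_size, len(remaining))):
        # Subtract the max score before exponentiating for numerical stability.
        max_score = max(scores[i] for i in remaining)
        weights = [math.exp(scores[i] - max_score) for i in remaining]
        chosen = rng.choices(remaining, weights=weights, k=1)[0]
        slate.append(chosen)
        remaining.remove(chosen)
    return slate


def plackett_luce_log_prob(scores, slate):
    """Log-probability of an ordered slate under the Plackett-Luce model."""
    remaining = set(range(len(scores)))
    log_prob = 0.0
    for item in slate:
        max_score = max(scores[i] for i in remaining)
        log_norm = max_score + math.log(
            sum(math.exp(scores[i] - max_score) for i in remaining)
        )
        log_prob += scores[item] - log_norm
        remaining.remove(item)
    return log_prob
```

In this naive form every draw renormalizes over all remaining items, so the cost grows with both the slate size and the number of candidate items.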
Hierarchical Federated Learning in Wireless Networks: Pruning Tackles Bandwidth Scarcity and System Heterogeneity
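results: Through extensive simulations, the study validates the effectiveness of the proposed PHFL algorithm in terms of test accuracy, wall-clock time, energy consumption, and bandwidth requirement.
Abstract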
While a practical wireless network has many tiers where end users do not directly communicate with the central server, the users' devices have limited computation and battery powers, and the serving base station (BS) has a fixed bandwidth. Owing to these practical constraints and system models, this paper leverages model pruning and proposes a pruning-enabled hierarchical federated learning (PHFL) in heterogeneous networks (HetNets). We first derive an upper bound of the convergence rate that clearly demonstrates the impact of the model pruning and wireless communications between the clients and the associated BS. Then we jointly optimize the model pruning ratio, central processing unit (CPU) frequency and transmission power of the clients in order to minimize the controllable terms of the convergence bound under strict delay and energy constraints. However, since the original problem is not convex, we perform successive convex approximation (SCA) and jointly optimize the parameters for the relaxed convex problem. Through extensive simulation, we validate the effectiveness of our proposed PHFL algorithm in terms of test accuracy, wall clock time, energy consumption and bandwidth requirement.
Summary
In a practical wireless network there are many tiers in which end users do not communicate directly with the central server, users' devices have limited computation and battery power, and the serving base station (BS) has a fixed bandwidth. Given these practical constraints and system models, this paper leverages model pruning and proposes pruning-enabled hierarchical federated learning (PHFL) in heterogeneous networks (HetNets). We first derive an upper bound on the convergence rate that clearly shows the impact of model pruning and of wireless communication between the clients and their associated BS. We then jointly optimize the model pruning ratio, central processing unit (CPU) frequency, and client transmission power to minimize the controllable terms of the bound under strict delay and energy constraints. Because the original problem is not convex, we apply successive convex approximation (SCA) and jointly optimize the parameters of the relaxed convex problem. Extensive simulations validate the effectiveness of the proposed PHFL algorithm in terms of test accuracy, wall-clock time, energy consumption, and bandwidth requirement.
Motion Planning Diffusion: Learning and Planning of Robot Motions with Diffusion Models
paper_authors: Joao Carvalho, An T. Le, Mark Baierl, Dorothea Koert, Jan Peters
for: Researchers and practitioners in robot motion planning and optimization, particularly those interested in using learned priors to accelerate the planning process.
methods: The paper proposes using diffusion models as priors for motion planning, which allows sampling directly from the posterior trajectory distribution conditioned on task goals. The authors also leverage the inverse denoising process of diffusion models to effectively encode data multimodality in high-dimensional settings.
results: The paper demonstrates the efficacy of the proposed method, Motion Planning Diffusion, through experiments in simulated planar robot and 7-dof robot arm manipulator environments. The results show that diffusion models are strong priors for encoding high-dimensional trajectory distributions of robot motions, and the method generalizes well to previously unseen obstacles.
Abstract
Learning priors on trajectory distributions can help accelerate robot motion planning optimization. Given previously successful plans, learning trajectory generative models as priors for a new planning problem is highly desirable. Prior works propose several ways on utilizing this prior to bootstrapping the motion planning problem. Either sampling the prior for initializations or using the prior distribution in a maximum-a-posterior formulation for trajectory optimization. In this work, we propose learning diffusion models as priors. We then can sample directly from the posterior trajectory distribution conditioned on task goals, by leveraging the inverse denoising process of diffusion models. Furthermore, diffusion has been recently shown to effectively encode data multimodality in high-dimensional settings, which is particularly well-suited for large trajectory dataset. To demonstrate our method efficacy, we compare our proposed method - Motion Planning Diffusion - against several baselines in simulated planar robot and 7-dof robot arm manipulator environments. To assess the generalization capabilities of our method, we test it in environments with previously unseen obstacles. Our experiments show that diffusion models are strong priors to encode high-dimensional trajectory distributions of robot motions.
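Summary
Learning priors on trajectory distributions can help accelerate robot motion planning optimization. Given previously successful plans, learning trajectory generative models as priors for a new planning problem is highly desirable. Prior works propose several ways to use such a prior to bootstrap the motion planning problem: sampling from the prior for initializations, or using the prior distribution in a maximum-a-posteriori formulation for trajectory optimization. In this work, we propose learning diffusion models as priors. By leveraging the inverse denoising process of diffusion models, we can sample directly from the posterior trajectory distribution conditioned on task goals, and the ability of diffusion models to encode high-dimensional data suits large trajectory datasets. To demonstrate its efficacy, we compare our method, Motion Planning Diffusion, against several baselines in simulated planar robot and 7-dof robot arm manipulator environments. To assess generalization, we test it in environments with previously unseen obstacles. Our experiments show that diffusion models are strong priors for encoding high-dimensional trajectory distributions of robot motions.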
InterAct: Exploring the Potentials of ChatGPT as a Cooperative Agent
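results: On AlfWorld, a simulated household environment comprising 6 different tasks, ChatGPT reaches a 98% success rate, underscoring the importance of prompt engineering and opening new possibilities for task planning.
Abstract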
This research paper delves into the integration of OpenAI's ChatGPT into embodied agent systems, evaluating its influence on an interactive decision-making benchmark. Drawing a parallel to the concept of people assuming roles according to their unique strengths, we introduce InterAct. In this approach, we feed ChatGPT varied prompts, assigning it numerous roles such as a checker and a sorter, and then integrate these roles with the original language model. Our research shows a remarkable success rate of 98% in AlfWorld, which consists of 6 different tasks in a simulated household environment, emphasizing the significance of proficient prompt engineering. The results highlight ChatGPT's competence in comprehending and performing intricate tasks effectively in real-world settings, thus paving the way for further advancements in task planning.
MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies
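results: Under newly defined evaluation metrics based on the CLAP score, the model and the mixup strategies improve the quality and novelty of the generated music while preserving its correspondence with the input text.
Abstract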
Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embeddings space, respectively. Such mixup strategies encourage the model to interpolate between musical training samples and generate new music within the convex hull of the training data, making the generated music more diverse while still staying faithful to the corresponding style. In addition to popular evaluation metrics, we design several new evaluation metrics based on CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music, as well as the correspondence between input text and generated music.
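The mixup step described in the abstract can be pictured with a generic sketch: two training clips are blended with a random weight. The code below assumes the two clips have already been beat-aligned and trimmed to equal length by an upstream beat tracker, which is not shown. The function name, the `alpha` default, and the Beta-distributed weight are assumptions borrowed from standard mixup, not details confirmed by the abstract.

```python
import random


def mixup_waveforms(wave_a, wave_b, alpha=0.4, rng=None):
    """Blend two equal-length, beat-aligned waveforms with a random weight.

    Returns the mixed waveform and the mixing weight lam in [0, 1],
    where lam is drawn from Beta(alpha, alpha) as in standard mixup.
    """
    rng = rng or random.Random()
    if len(wave_a) != len(wave_b):
        raise ValueError("waveforms must be aligned to the same length before mixing")
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(wave_a, wave_b)]
    return mixed, lam
```

The same convex combination could be applied to latent embeddings instead of raw samples, which corresponds to the latent variant the abstract mentions.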
paper_authors: Debosmita Bhaumik, Julian Togelius, Georgios N. Yannakakis, Ahmed Khalifa
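for: An AI-assisted design tool for creating 2D game levels
methods: Deep neural networks for upscaling, editing, and extending levels
results: By upscaling and editing game levels, designers can create and edit levels at different resolutions, and the neural networks learn to add the missing features.
Abstract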
We explore AI-powered upscaling as a design assistance tool in the context of creating 2D game levels. Deep neural networks are used to upscale artificially downscaled patches of levels from the puzzle platformer game Lode Runner. The trained networks are incorporated into a web-based editor, where the user can create and edit levels at three different levels of resolution: 4x4, 8x8, and 16x16. An edit at any resolution instantly transfers to the other resolutions. As upscaling requires inventing features that might not be present at lower resolutions, we train neural networks to reproduce these features. We introduce a neural network architecture that is capable of not only learning upscaling but also giving higher priority to less frequent tiles. To investigate the potential of this tool and guide further development, we conduct a qualitative study with 3 designers to understand how they use it. Designers enjoyed co-designing with the tool, liked its underlying concept, and provided feedback for further improvement.
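Summary
We explore AI-powered upscaling as a design tool for creating 2D game levels. Deep neural networks upscale artificially downscaled level patches from the puzzle platformer game Lode Runner. The trained networks are integrated into a web-based editor in which users can create and edit levels at three resolutions: 4x4, 8x8, and 16x16. Any edit is synchronized to the other resolutions. Because upscaling requires inventing features that are absent at lower resolutions, we train neural networks to reproduce these features. We introduce a neural network architecture that not only learns upscaling but also gives higher priority to less frequent tiles. To understand the tool's potential and guide further development, we conduct a qualitative study with 3 designers on how they use it and collect their views on its concept and their feedback.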
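results: Extensive experiments show that the model achieves state-of-the-art performance and can be customized to the user's needs. The paper also proposes a new operation called ID mixing, which creates a new identity by semantically mixing the identities of several people.
Abstract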
Face swapping is a task that changes a facial identity of a given image to that of another person. In this work, we propose a novel face-swapping framework called Megapixel Facial Identity Manipulation (MFIM). The face-swapping model should achieve two goals. First, it should be able to generate a high-quality image. We argue that a model which is proficient in generating a megapixel image can achieve this goal. However, generating a megapixel image is generally difficult without careful model design. Therefore, our model exploits pretrained StyleGAN in the manner of GAN-inversion to effectively generate a megapixel image. Second, it should be able to effectively transform the identity of a given image. Specifically, it should be able to actively transform ID attributes (e.g., face shape and eyes) of a given image into those of another person, while preserving ID-irrelevant attributes (e.g., pose and expression). To achieve this goal, we exploit 3DMM that can capture various facial attributes. Specifically, we explicitly supervise our model to generate a face-swapped image with the desirable attributes using 3DMM. We show that our model achieves state-of-the-art performance through extensive experiments. Furthermore, we propose a new operation called ID mixing, which creates a new identity by semantically mixing the identities of several people. It allows the user to customize the new identity.
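Summary
Face swapping is a task that replaces the face in a given image with that of another person. In this work, we propose a novel face-swapping framework called Megapixel Facial Identity Manipulation (MFIM). A face-swapping model should meet two goals: first, generate a high-quality image; second, effectively transform the ID attributes of a given image (e.g., face shape and eyes) into those of another person while preserving ID-irrelevant attributes (e.g., pose and expression). To achieve the second goal, we exploit 3DMM, which can capture various facial attributes. Specifically, we explicitly supervise our model to generate a face-swapped image with the desired attributes. Extensive experiments show that our model achieves state-of-the-art performance. We also propose a new operation, ID mixing, which creates a new identity by semantically mixing the identities of several people and lets the user customize the new identity.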
Food Classification using Joint Representation of Visual and Textual Data
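results: Experiments show that the proposed network performs well on the large open-source dataset UPMC Food-101, outperforming the second-best method by 11.57% and 6.34% in accuracy for image and text classification, respectively.
Abstract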
Food classification is an important task in health care. In this work, we propose a multimodal classification framework that uses the modified version of EfficientNet with the Mish activation function for image classification, and the traditional BERT transformer-based network is used for text classification. The proposed network and the other state-of-the-art methods are evaluated on a large open-source dataset, UPMC Food-101. The experimental results show that the proposed network outperforms the other methods, a significant difference of 11.57% and 6.34% in accuracy is observed for image and text classification, respectively, when compared with the second-best performing method. We also compared the performance in terms of accuracy, precision, and recall for text classification using both machine learning and deep learning-based models. The comparative analysis from the prediction results of both images and text demonstrated the efficiency and robustness of the proposed approach.
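Summary
Food classification is an important task in health care. In this work, we propose a multimodal classification framework that uses a modified EfficientNet with the Mish activation function for image classification and the traditional BERT transformer-based network for text classification. The proposed network and other state-of-the-art methods are evaluated on the large open-source dataset UPMC Food-101. The experimental results show that the proposed network exceeds the second-best method by 11.57% and 6.34% in accuracy for image and text classification, respectively. We also compare accuracy, precision, and recall for text classification, and a comparative analysis of the image and text prediction results demonstrates the efficiency and robustness of the approach.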
Circumventing Concept Erasure Methods For Text-to-Image Generative Models
results: The results show that these post hoc concept erasure methods are brittle and unsuitable for AI safety.
Abstract
Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.
Local-Global Temporal Fusion Network with an Attention Mechanism for Multiple and Multiclass Arrhythmia Classification
paper_authors: Yun Kwan Kim, Minji Lee, Kunwook Jo, Hee Seok Song, Seong-Whan Lee
for: The paper aims to develop a framework for arrhythmia detection and classification with a constrained input length, which can accurately recognize arrhythmias and calculate their occurrence times.
methods: The proposed method consists of local temporal information extraction, global pattern extraction, and local-global information fusion with attention. It combines local and global features to capture the dynamics of arrhythmias and achieve accurate detection and classification.
results: The proposed method was evaluated on the MIT-BIH arrhythmia database and the MIT-BIH atrial fibrillation database, and the results showed superior performance compared to state-of-the-art models. The method was also tested on a different database and achieved superior performance, demonstrating its generalization ability.
Abstract
Clinical decision support systems (CDSSs) have been widely utilized to support the decisions made by cardiologists when detecting and classifying arrhythmia from electrocardiograms (ECGs). However, forming a CDSS for the arrhythmia classification task is challenging due to the varying lengths of arrhythmias. Although the onset time of arrhythmia varies, previously developed methods have not considered such conditions. Thus, we propose a framework that consists of (i) local temporal information extraction, (ii) global pattern extraction, and (iii) local-global information fusion with attention to perform arrhythmia detection and classification with a constrained input length. The 10-class and 4-class performances of our approach were assessed by detecting the onset and offset of arrhythmia as an episode and the duration of arrhythmia based on the MIT-BIH arrhythmia database (MITDB) and MIT-BIH atrial fibrillation database (AFDB), respectively. The results were statistically superior to those achieved by the comparison models. To check the generalization ability of the proposed method, an AFDB-trained model was tested on the MITDB, and superior performance was attained compared with that of a state-of-the-art model. The proposed method can capture local-global information and dynamics without incurring information losses. Therefore, arrhythmias can be recognized more accurately, and their occurrence times can be calculated; thus, the clinical field can create more accurate treatment plans by using the proposed method.
Summary
Clinical decision support systems (CDSSs) are widely used to support cardiologists in detecting and classifying arrhythmias from electrocardiograms (ECGs). Building a CDSS for arrhythmia classification is difficult, however, because arrhythmias vary in duration. Although the onset time of an arrhythmia varies, previously developed methods have not considered such conditions. We therefore propose a framework that consists of (i) local temporal information extraction, (ii) global pattern extraction, and (iii) local-global information fusion with attention to perform arrhythmia detection and classification with a constrained input length. We evaluate our method on the MIT-BIH arrhythmia database (MITDB) and the MIT-BIH atrial fibrillation database (AFDB). On the 10-class and 4-class tasks, the results were statistically superior to those of the comparison models. To check the generalization ability of the method, an AFDB-trained model was tested on the MITDB and outperformed an existing model. The method captures local-global information and dynamics without incurring information loss. Arrhythmias can therefore be recognized more accurately and their occurrence times calculated, so the clinical field can create more accurate treatment plans by using the proposed method.
Online Multi-Task Learning with Recursive Least Squares and Recursive Kernel Methods
paper_authors: Gabriel R. Lencione, Fernando J. Von Zuben
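for: This paper proposes two new methods for online multi-task learning (MTL) regression problems.
methods: We employ a high-performance graph-based MTL formulation and develop recursive versions of it based on Weighted Recursive Least Squares (WRLS) and Online Sparse Least Squares Support Vector Regression (OSLSSVR).
results: Using task-stacking transformations, we show that a single matrix captures the relationships among multiple tasks, and that its structural information is embodied in the initialization procedure of MT-WRLS or in the multi-task kernel function of MT-OSLSSVR. In contrast to the existing literature, we achieve exact and approximate recursions with quadratic per-instance cost in the dimension. In a real-world wind speed forecasting case study, we compare our online MTL methods with other contenders and show that both proposed approaches yield significant performance gains.
Abstract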
This paper introduces two novel approaches for Online Multi-Task Learning (MTL) Regression Problems. We employ a high performance graph-based MTL formulation and develop its recursive versions based on the Weighted Recursive Least Squares (WRLS) and the Online Sparse Least Squares Support Vector Regression (OSLSSVR). Adopting task-stacking transformations, we demonstrate the existence of a single matrix incorporating the relationship of multiple tasks and providing structural information to be embodied by the MT-WRLS method in its initialization procedure and by the MT-OSLSSVR in its multi-task kernel function. Contrasting the existing literature, which is mostly based on Online Gradient Descent (OGD) or cubic inexact approaches, we achieve exact and approximate recursions with quadratic per-instance cost on the dimension of the input space (MT-WRLS) or on the size of the dictionary of instances (MT-OSLSSVR). We compare our online MTL methods to other contenders in a real-world wind speed forecasting case study, evidencing the significant gain in performance of both proposed approaches.
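To make the "quadratic per-instance cost" claim concrete, here is a minimal sketch of the textbook single-task recursive least squares update with a forgetting factor. It is not the paper's MT-WRLS method: the multi-task structure and its initialization are not reproduced, and the names `rls_update`, `fit_stream`, `delta`, and `lam` are illustrative assumptions. Each step touches the d x d matrix `P` once, so the cost per sample is O(d^2).

```python
def rls_update(w, P, x, y, lam=1.0):
    """One recursive least squares step (hypothetical helper).

    w: current weights (length d), P: current d x d matrix (symmetric),
    x: input vector, y: target, lam: forgetting factor in (0, 1].
    """
    d = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(d)) for i in range(d)]  # P x
    denom = lam + sum(x[i] * Px[i] for i in range(d))               # lam + x^T P x
    gain = [Px[i] / denom for i in range(d)]                        # gain vector
    err = y - sum(w[i] * x[i] for i in range(d))                    # a priori error
    w_new = [w[i] + gain[i] * err for i in range(d)]
    # P is symmetric, so x^T P equals (P x)^T and the update is a rank-one correction.
    P_new = [[(P[i][j] - gain[i] * Px[j]) / lam for j in range(d)] for i in range(d)]
    return w_new, P_new


def fit_stream(samples, d, delta=1e6, lam=1.0):
    """Process (x, y) pairs one at a time, starting from w = 0 and P = delta * I."""
    w = [0.0] * d
    P = [[delta if i == j else 0.0 for j in range(d)] for i in range(d)]
    for x, y in samples:
        w, P = rls_update(w, P, x, y, lam)
    return w


# Noise-free toy stream generated by y = 2*x0 - 3*x1 (made-up data).
stream = [([1.0, 0.0], 2.0), ([0.0, 1.0], -3.0), ([1.0, 1.0], -1.0), ([2.0, 1.0], 1.0)]
weights = fit_stream(stream, d=2)
```

A large `delta` corresponds to a weak prior on the weights, so on this toy stream the recursion should closely track the batch least squares solution.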
Minimax Optimal $Q$ Learning with Nearest Neighbors
for: This paper proposes two new $Q$ learning methods to improve the convergence rate of estimated $Q$ functions in continuous state spaces.
methods: The proposed methods use a direct nearest neighbor approach instead of the kernel nearest neighbor approach used in (Shah and Xie, 2018), which significantly improves the convergence rate and time complexity in high-dimensional state spaces.
results: Both offline and online methods are minimax rate optimal, meaning that they achieve the optimal convergence rate of $\tilde{O}(T^{-1/(d+2)})$ in high-dimensional state spaces.
Abstract
$Q$ learning is a popular model-free reinforcement learning method. Most existing works focus on analyzing $Q$ learning for finite state and action spaces. If the state space is continuous, the original $Q$ learning method cannot be used directly. A modification of the original $Q$ learning method was proposed in (Shah and Xie, 2018), which estimates $Q$ values with nearest neighbors. This modification makes $Q$ learning suitable for continuous state spaces. (Shah and Xie, 2018) shows that the convergence rate of the estimated $Q$ function is $\tilde{O}(T^{-1/(d+3)})$, which is slower than the minimax lower bound $\tilde{\Omega}(T^{-1/(d+2)})$, indicating that this method is not efficient. This paper proposes two new $Q$ learning methods that close this gap in convergence rates, one offline and the other online. Although we still use a nearest neighbor approach to estimate the $Q$ function, the algorithms are crucially different from (Shah and Xie, 2018). In particular, we replace the kernel nearest neighbor in discretized region with a direct nearest neighbor approach. Consequently, our approach significantly improves the convergence rate. Moreover, the time complexity is also significantly improved in high-dimensional state spaces. Our analysis shows that both offline and online methods are minimax rate optimal.
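The gap between the two rates can be read as a difference in sample requirements. The sketch below is back-of-the-envelope arithmetic only: it drops the constants and logarithmic factors hidden in the tilde notation, and the function name `samples_for_accuracy` and the chosen values of `eps` and `d` are illustrative assumptions. Solving $T^{-1/(d+k)} \le \epsilon$ for $T$ gives $T \ge \epsilon^{-(d+k)}$.

```python
def samples_for_accuracy(eps: float, d: int, offset: int) -> float:
    """Smallest T with T ** (-1 / (d + offset)) <= eps, ignoring constants and logs."""
    return eps ** (-(d + offset))


eps, d = 0.1, 3  # target error and state dimension (made-up values)
slow = samples_for_accuracy(eps, d, offset=3)  # rate T^{-1/(d+3)}
fast = samples_for_accuracy(eps, d, offset=2)  # rate T^{-1/(d+2)}
print(f"d={d}, eps={eps}: about {slow:.0f} vs {fast:.0f} samples")
```

Under these simplifications the ratio between the two requirements is $1/\epsilon$ regardless of $d$, so a target error of 0.1 corresponds to roughly ten times fewer samples at the faster rate.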
Efficient neural supersampling on a novel gaming dataset
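results: Compared with traditional methods, the algorithm makes supersampling 4 times more efficient while maintaining the same accuracy. The new dataset fills a gap in existing datasets and can serve as a valuable resource for measuring progress in the field.
Abstract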
Real-time rendering for video games has become increasingly challenging due to the need for higher resolutions, framerates and photorealism. Supersampling has emerged as an effective solution to address this challenge. Our work introduces a novel neural algorithm for supersampling rendered content that is 4 times more efficient than existing methods while maintaining the same level of accuracy. Additionally, we introduce a new dataset which provides auxiliary modalities such as motion vectors and depth generated using graphics rendering features like viewport jittering and mipmap biasing at different resolutions. We believe that this dataset fills a gap in the current dataset landscape and can serve as a valuable resource to help measure progress in the field and advance the state-of-the-art in super-resolution techniques for gaming content.
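Summary
Real-time rendering for modern video games is challenged by demands for high resolution, high frame rates, and photorealism. Supersampling has become an effective way to address these challenges. Our work proposes a new neural network algorithm for supersampling game content that is 4 times more efficient than existing methods while maintaining the same accuracy. We also provide a new dataset that includes motion vectors and depth information from video games, generated using graphics rendering features such as viewport jittering and mipmap biasing. We believe this dataset fills a gap in the current dataset landscape and can serve as a valuable resource for measuring progress and advancing the state of the art in the field.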
Online covariance estimation for stochastic gradient descent under Markovian sampling
for: The paper is written for understanding the convergence rate of the online overlapping batch-means covariance estimator for Stochastic Gradient Descent (SGD) under Markovian sampling.
methods: The paper uses techniques such as the batch-means covariance estimator and state-dependent and state-independent Markovian sampling to analyze the convergence rate of SGD.
results: The paper shows that the convergence rates of the covariance estimator are $O\big(\sqrt{d}\,n^{-1/8}(\log n)^{1/4}\big)$ and $O\big(\sqrt{d}\,n^{-1/8}\big)$ under state-dependent and state-independent Markovian sampling, respectively, with $d$ representing dimensionality and $n$ denoting the number of observations or SGD iterations. These rates match the best-known convergence rate previously established for the independent and identically distributed (i.i.d.) case. Additionally, the paper establishes the convergence rate for the first four moments of the $\ell_2$ norm of the error of SGD dynamics under state-dependent Markovian data.
Abstract
We study the online overlapping batch-means covariance estimator for Stochastic Gradient Descent (SGD) under Markovian sampling. We show that the convergence rates of the covariance estimator are $O\big(\sqrt{d}\,n^{-1/8}(\log n)^{1/4}\big)$ and $O\big(\sqrt{d}\,n^{-1/8}\big)$ under state-dependent and state-independent Markovian sampling, respectively, with $d$ representing dimensionality and $n$ denoting the number of observations or SGD iterations. Remarkably, these rates match the best-known convergence rate previously established for the independent and identically distributed (i.i.d.) case by \cite{zhu2021online}, up to logarithmic factors. Our analysis overcomes significant challenges that arise due to Markovian sampling, leading to the introduction of additional error terms and complex dependencies between the blocks of the batch-means covariance estimator. Moreover, we establish the convergence rate for the first four moments of the $\ell_2$ norm of the error of SGD dynamics under state-dependent Markovian data, which holds potential interest as an independent result. To validate our theoretical findings, we provide numerical illustrations to derive confidence intervals for SGD when training linear and logistic regression models under Markovian sampling. Additionally, we apply our approach to tackle the intriguing problem of strategic classification with logistic regression, where adversaries can adaptively modify features during the training process to increase their chances of being classified in a specific target class.
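Summary
We study the online overlapping batch-means covariance estimator for SGD under Markovian sampling and show that its convergence rates are $O\big(\sqrt{d}\,n^{-1/8}(\log n)^{1/4}\big)$ and $O\big(\sqrt{d}\,n^{-1/8}\big)$ under state-dependent and state-independent Markovian sampling, respectively, where $d$ is the dimensionality and $n$ is the number of observations or SGD iterations. These rates match the best-known rate for the independent and identically distributed (i.i.d.) case up to logarithmic factors. Our analysis overcomes significant challenges posed by Markovian sampling, which introduces additional error terms and complex dependencies between the blocks of the batch-means covariance estimator. We also establish the convergence rate for the first four moments of the $\ell_2$ norm of the error of SGD dynamics under state-dependent Markovian data, a result of potential independent interest. We provide numerical illustrations that derive confidence intervals for SGD training. Finally, we apply our approach to strategic classification, a challenging problem in which adversaries can adaptively modify features during training to increase their chances of being classified in a specific target class.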
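For readers unfamiliar with batch-means estimators, the following is a minimal sketch of the classical overlapping batch-means estimate of the long-run variance of a scalar, dependent sequence. It is an offline, fixed-batch-size simplification and should not be read as the paper's online estimator for SGD iterates; the AR(1) test chain and the names `obm_variance` and `ar1_chain` are assumptions made for illustration.

```python
import random


def obm_variance(xs, b):
    """Overlapping batch-means estimate of the long-run variance of xs.

    Uses all n - b + 1 overlapping windows of length b and the usual
    normalization n * b / ((n - b) * (n - b + 1)). Requires 1 <= b < len(xs).
    """
    n = len(xs)
    mean = sum(xs) / n
    window = sum(xs[:b])              # running sum of the current window
    total = 0.0
    for j in range(n - b + 1):
        if j > 0:
            window += xs[j + b - 1] - xs[j - 1]  # slide the window by one step
        total += (window / b - mean) ** 2
    return n * b * total / ((n - b) * (n - b + 1))


def ar1_chain(n, phi, seed=0):
    """Simulate x_t = phi * x_{t-1} + e_t with standard normal noise (a Markov chain)."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


chain = ar1_chain(20000, phi=0.5)
estimate = obm_variance(chain, b=200)
```

For this AR(1) chain the long-run variance is $1/(1-\phi)^2 = 4$, so `estimate` should land in that neighborhood, whereas the plain sample variance would target the smaller marginal variance $1/(1-\phi^2)$. That difference is why dependence between samples has to be accounted for when building confidence intervals.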
Interpretable Machine Learning for Discovery: Statistical Challenges & Opportunities
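results: The paper describes in detail the discoveries that can be made with interpretable machine learning techniques in supervised and unsupervised learning, and how to validate the validity and reliability of these discoveries.
Abstract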
New technologies have led to vast troves of large and complex datasets across many scientific domains and industries. People routinely use machine learning techniques to not only process, visualize, and make predictions from this big data, but also to make data-driven discoveries. These discoveries are often made using Interpretable Machine Learning, or machine learning models and techniques that yield human understandable insights. In this paper, we discuss and review the field of interpretable machine learning, focusing especially on the techniques as they are often employed to generate new knowledge or make discoveries from large data sets. We outline the types of discoveries that can be made using Interpretable Machine Learning in both supervised and unsupervised settings. Additionally, we focus on the grand challenge of how to validate these discoveries in a data-driven manner, which promotes trust in machine learning systems and reproducibility in science. We discuss validation from both a practical perspective, reviewing approaches based on data-splitting and stability, as well as from a theoretical perspective, reviewing statistical results on model selection consistency and uncertainty quantification via statistical inference. Finally, we conclude by highlighting open challenges in using interpretable machine learning techniques to make discoveries, including gaps between theory and practice for validating data-driven-discoveries.
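Summary
New technologies have produced large and complex datasets across scientific domains and industries. People routinely use machine learning not only to process, visualize, and make predictions from this big data, but also to make data-driven discoveries. These discoveries are often made using interpretable machine learning, that is, machine learning models and techniques that yield human-understandable insights. In this paper, we review the field of interpretable machine learning, focusing on how these techniques are used to make new discoveries from large datasets. We outline the types of discoveries that can be made in both supervised and unsupervised settings. We also address how to validate these discoveries in a data-driven manner, which promotes trust in machine learning systems and reproducibility in science. We discuss validation from both practical and theoretical perspectives, covering approaches based on data splitting and stability as well as statistical results on model selection consistency and uncertainty quantification. We conclude by highlighting open challenges in using interpretable machine learning for discovery, including the validation of data-driven discoveries and gaps between theory and practice.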
Reverse Stable Diffusion: What prompt was used to generate this image?
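results: The experiments show that the proposed learning framework can predict text prompts from images generated by Stable Diffusion, with the highest gains on the white-box model. The paper also reports an interesting discovery: training a diffusion model on the prompt generation task makes it generate images that are better aligned with the input prompts when the model is directly reused for text-to-image generation.
Abstract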
Text-to-image diffusion models such as Stable Diffusion have recently attracted the interest of many researchers, and inverting the diffusion process can play an important role in better understanding the generative process and how to engineer prompts in order to obtain the desired images. To this end, we introduce the new task of predicting the text prompt given an image generated by a generative diffusion model. We combine a series of white-box and black-box models (with and without access to the weights of the diffusion network) to deal with the proposed task. We propose a novel learning framework comprising of a joint prompt regression and multi-label vocabulary classification objective that generates improved prompts. To further improve our method, we employ a curriculum learning procedure that promotes the learning of image-prompt pairs with lower labeling noise (i.e. that are better aligned), and an unsupervised domain-adaptive kernel learning method that uses the similarities between samples in the source and target domains as extra features. We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion. Our novel learning framework produces excellent results on the aforementioned task, yielding the highest gains when applied on the white-box model. In addition, we make an interesting discovery: training a diffusion model on the prompt generation task can make the model generate images that are much better aligned with the input prompts, when the model is directly reused for text-to-image generation.
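Summary
Text-to-image diffusion models such as Stable Diffusion have recently attracted many researchers, and inverting the diffusion process can help in better understanding the generative process and how to engineer prompts to obtain the desired images. To this end, we introduce the new task of predicting the text prompt from an image generated by a generative diffusion model. We combine white-box and black-box models (with and without access to the weights of the diffusion network) to address the task. We propose a new learning framework comprising a joint prompt regression and multi-label vocabulary classification objective that generates improved prompts. To further improve the method, we use a curriculum learning procedure that prioritizes image-prompt pairs with lower labeling noise (i.e., that are better aligned), as well as an unsupervised domain-adaptive method that uses similarities between source and target domain samples as extra features. We conduct experiments on the DiffusionDB dataset, predicting text prompts from images generated by Stable Diffusion. Our framework achieves excellent results on this task, with the highest gains on the white-box model. We also make an interesting discovery: training a diffusion model on the prompt generation task makes it generate images that are better aligned with the input prompts when the model is directly reused for text-to-image generation.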
Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving
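results: Our experiments show that the method outperforms the previous state of the art in both urban and highway settings while avoiding excessive prediction and wasted computation. More information is available on the project website: https://waabi.ai/research/implicito.
Abstract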
A self-driving vehicle (SDV) must be able to perceive its surroundings and predict the future behavior of other traffic participants. Existing works either perform object detection followed by trajectory forecasting of the detected objects, or predict dense occupancy and flow grids for the whole scene. The former poses a safety concern as the number of detections needs to be kept low for efficiency reasons, sacrificing object recall. The latter is computationally expensive due to the high-dimensionality of the output grid, and suffers from the limited receptive field inherent to fully convolutional networks. Furthermore, both approaches employ many computational resources predicting areas or objects that might never be queried by the motion planner. This motivates our unified approach to perception and future prediction that implicitly represents occupancy and flow over time with a single neural network. Our method avoids unnecessary computation, as it can be directly queried by the motion planner at continuous spatio-temporal locations. Moreover, we design an architecture that overcomes the limited receptive field of previous explicit occupancy prediction methods by adding an efficient yet effective global attention mechanism. Through extensive experiments in both urban and highway settings, we demonstrate that our implicit model outperforms the current state-of-the-art. For more information, visit the project website: https://waabi.ai/research/implicito.
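Summary
A self-driving vehicle (SDV) must be able to perceive its surroundings and predict the future behavior of other traffic participants. Existing methods either perform object detection followed by trajectory forecasting of the detected objects, or predict dense occupancy and flow grids for the whole scene. The former poses a safety concern because the number of detections must be kept low for efficiency, sacrificing object recall. The latter is computationally expensive because of the high-dimensional output grid and suffers from the limited receptive field of fully convolutional networks. Moreover, both approaches spend substantial computation predicting areas or objects that the motion planner may never query. This motivates our unified approach to perception and future prediction, which the motion planner can query directly and which avoids unnecessary computation. We also design an efficient yet effective global attention mechanism that overcomes the limited receptive field of previous explicit occupancy prediction methods. Through extensive experiments in urban and highway settings, we show that our implicit model outperforms the current state of the art. For more information, visit the project website: https://waabi.ai/research/implicito.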
Training Data Protection with Compositional Diffusion Models
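results: CDMs enable selective forgetting and continual learning for large-scale diffusion models, as well as serving customized models based on the user's access rights. CDMs can also determine the importance of a subset of the data in generating particular samples.
Abstract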
We introduce Compartmentalized Diffusion Models (CDM), a method to train different diffusion models (or prompts) on distinct data sources and arbitrarily compose them at inference time. The individual models can be trained in isolation, at different times, and on different distributions and domains and can be later composed to achieve performance comparable to a paragon model trained on all data simultaneously. Furthermore, each model only contains information about the subset of the data it was exposed to during training, enabling several forms of training data protection. In particular, CDMs are the first method to enable both selective forgetting and continual learning for large-scale diffusion models, as well as allowing serving customized models based on the user's access rights. CDMs also allow determining the importance of a subset of the data in generating particular samples.
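Summary
We introduce Compartmentalized Diffusion Models (CDM), a method for training different diffusion models (or prompts) on distinct data sources and composing them at inference time. The individual models can be trained in isolation, at different times, and on different distributions and domains. At inference time, they can be composed to achieve performance comparable to a paragon model trained on all the data. Furthermore, each model contains information only about the subset of data it was exposed to during training, which enables several forms of training data protection. In particular, CDMs are the first method to enable selective forgetting and continual learning for large-scale diffusion models, as well as serving customized models based on the user's access rights. CDMs also allow determining the importance of a subset of the data in generating particular samples.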
Dual Governance: The intersection of centralized regulation and crowdsourced safety mechanisms for Generative AI
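results: According to the paper, the Dual Governance framework can promote innovation and creativity while ensuring the safe and ethical deployment of generative AI.
Abstract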
Generative Artificial Intelligence (AI) has seen mainstream adoption lately, especially in the form of consumer-facing, open-ended, text and image generating models. However, the use of such systems raises significant ethical and safety concerns, including privacy violations, misinformation and intellectual property theft. The potential for generative AI to displace human creativity and livelihoods has also been under intense scrutiny. To mitigate these risks, there is an urgent need for policies and regulations that support responsible and ethical development in the field of generative AI. Existing and proposed centralized regulations by governments to rein in AI face criticisms such as not having sufficient clarity or uniformity, lack of interoperability across lines of jurisdictions, restricting innovation, and hindering free market competition. Decentralized protections via crowdsourced safety tools and mechanisms are a potential alternative. However, they have clear deficiencies in terms of inadequate oversight and difficulty of enforcing ethical and safety standards, and are thus not enough by themselves as a regulation mechanism. We propose a marriage of these two strategies via a framework we call Dual Governance. This framework proposes a cooperative synergy between centralized government regulations in a U.S. specific context and safety mechanisms developed by the community to protect stakeholders from the harms of generative AI. By implementing the Dual Governance framework, we posit that innovation and creativity can be promoted while ensuring safe and ethical deployment of generative AI.
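Summary
Generative artificial intelligence (AI) has seen widespread adoption in recent years, especially in the form of consumer-facing, open-ended text and image generation models. However, the use of these systems raises significant ethical and safety concerns, including privacy violations, misinformation, and intellectual property infringement. The potential for generative AI to displace human creativity and livelihoods is also under intense scrutiny. To mitigate these risks, policies and regulations for the responsible and ethical development of generative AI are urgently needed. Existing and proposed centralized government regulations face criticisms including lack of clarity and uniformity, lack of interoperability across jurisdictions, restricting innovation, and hindering free market competition. Decentralized protections via crowdsourced safety tools and mechanisms are a possible alternative. However, they lack adequate oversight and make ethical and safety standards difficult to enforce, so they are not sufficient as the sole regulatory mechanism. We propose a framework called Dual Governance, which establishes cooperation between centralized government regulations in a U.S.-specific context and community-developed safety mechanisms, in order to promote innovation and creativity while ensuring the safe and ethical deployment of generative AI.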
VertexSerum: Poisoning Graph Neural Networks for Link Inference
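results: Across four real-world datasets and three different GNN structures, VertexSerum significantly outperforms existing link inference attacks, improving AUC scores by 9.8% on average. The experiments also show that VertexSerum is effective in both black-box and online learning settings.
Abstract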
Graph neural networks (GNNs) have brought superb performance to various applications utilizing graph structural data, such as social analysis and fraud detection. The graph links, e.g., social relationships and transaction history, are sensitive and valuable information, which raises privacy concerns when using GNNs. To exploit these vulnerabilities, we propose VertexSerum, a novel graph poisoning attack that increases the effectiveness of graph link stealing by amplifying the link connectivity leakage. To infer node adjacency more accurately, we propose an attention mechanism that can be embedded into the link detection network. Our experiments demonstrate that VertexSerum significantly outperforms the SOTA link inference attack, improving the AUC scores by an average of $9.8\%$ across four real-world datasets and three different GNN structures. Furthermore, our experiments reveal the effectiveness of VertexSerum in both black-box and online learning settings, further validating its applicability in real-world scenarios.
Machine Learning Small Molecule Properties in Drug Discovery
paper_authors: Nikolai Schapin, Maciej Majewski, Alejandro Varela, Carlos Arroniz, Gianni De Fabritiis
for: This paper provides a comprehensive overview of various machine learning (ML) methods for predicting small molecule properties in drug discovery, including binding affinities, solubility, and ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity).
methods: The paper reviews a wide range of ML methods, including neural networks, chemical fingerprints, and graph-based neural networks, and discusses existing popular datasets and molecular descriptors.
results: The paper highlights the challenges of predicting and optimizing multiple properties during hit-to-lead and lead optimization stages of drug discovery and explores briefly possible multi-objective optimization techniques to balance diverse properties while optimizing lead candidates. Additionally, the paper assesses techniques to provide an understanding of model predictions, especially for critical decision-making in drug discovery.
Abstract
Machine learning (ML) is a promising approach for predicting small molecule properties in drug discovery. Here, we provide a comprehensive overview of various ML methods introduced for this purpose in recent years. We review a wide range of properties, including binding affinities, solubility, and ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity). We discuss existing popular datasets and molecular descriptors and embeddings, such as chemical fingerprints and graph-based neural networks. We highlight also challenges of predicting and optimizing multiple properties during hit-to-lead and lead optimization stages of drug discovery and explore briefly possible multi-objective optimization techniques that can be used to balance diverse properties while optimizing lead candidates. Finally, techniques to provide an understanding of model predictions, especially for critical decision-making in drug discovery are assessed. Overall, this review provides insights into the landscape of ML models for small molecule property predictions in drug discovery. So far, there are multiple diverse approaches, but their performances are often comparable. Neural networks, while more flexible, do not always outperform simpler models. This shows that the availability of high-quality training data remains crucial for training accurate models and there is a need for standardized benchmarks, additional performance metrics, and best practices to enable richer comparisons between the different techniques and models that can shed a better light on the differences between the many techniques.
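Summary
Machine learning (ML) is a promising approach for predicting small molecule properties in drug discovery. This paper gives a comprehensive overview of the ML methods introduced for this purpose in recent years. We review a range of properties, including binding affinities, solubility, and ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity). We discuss existing popular datasets and molecular descriptors, such as chemical fingerprints and graph-based neural networks. We also note the challenges of predicting and optimizing multiple properties, along with multi-objective optimization techniques that can be used to balance them. Finally, we assess techniques for understanding model predictions, especially for critical decision-making in drug discovery. Overall, this paper provides a view of the landscape of ML models for small molecule property prediction. There are many diverse approaches, but their performances are often comparable. Neural networks, while more flexible, do not always outperform simpler models. This shows that the quality of training data is crucial, and that standardized benchmarks, additional performance metrics, and best practices are needed to better compare the different methods and models and to understand the differences between them.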
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
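paper_authors: Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez
for: This paper proposes a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations.
methods: The method uses diffusion models operating over multiple frequency bands.
results: At equal bit rate, the method achieves higher perceptual quality than existing generative techniques.
Abstract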
Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generate audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or generating relatively low sampling rate signals. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and, evaluation code, along with audio samples, are available on the facebookresearch/audiocraft Github page.
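Summary
Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although these methods produce impressive results, they are prone to audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (conditioned on mel-spectrograms) or to generate relatively low sampling rate signals. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in perceptual quality. Training and evaluation code, along with audio samples, are available on the facebookresearch/audiocraft GitHub page.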
A digital twin framework for civil engineering structures
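results: Using a reduced-order numerical model, the researchers compute a health-dependent control policy in synthetic case studies and support informed decision-making by dynamically updating the digital twin state. Two synthetic case studies (a cantilever beam and a railway bridge) demonstrate the dynamic decision-making capabilities of the method.
Abstract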
The digital twin concept represents an appealing opportunity to advance condition-based and predictive maintenance paradigms for civil engineering systems, thus allowing reduced lifecycle costs, increased system safety, and increased system availability. This work proposes a predictive digital twin approach to the health monitoring, maintenance, and management planning of civil engineering structures. The asset-twin coupled dynamical system is encoded employing a probabilistic graphical model, which allows all relevant sources of uncertainty to be taken into account. In particular, the time-repeating observations-to-decisions flow is modeled using a dynamic Bayesian network. Real-time structural health diagnostics are provided by assimilating sensed data with deep learning models. The digital twin state is continually updated in a sequential Bayesian inference fashion. This is then exploited to inform the optimal planning of maintenance and management actions within a dynamic decision-making framework. A preliminary offline phase involves the population of training datasets through a reduced-order numerical model and the computation of a health-dependent control policy. The strategy is assessed on two synthetic case studies, involving a cantilever beam and a railway bridge, demonstrating the dynamic decision-making capabilities of health-aware digital twins.
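Summary
The digital twin concept offers an appealing opportunity to advance condition-based and predictive maintenance for civil engineering systems, reducing lifecycle costs and increasing system safety and availability. This work proposes a predictive digital twin approach to the health monitoring, maintenance, and management planning of civil engineering structures. The asset-twin coupled dynamical system is encoded with a probabilistic graphical model that accounts for all relevant sources of uncertainty. Specifically, the time-repeating observations-to-decisions flow is modeled with a dynamic Bayesian network. Real-time structural health diagnostics are provided by assimilating sensed data with deep learning models. The digital twin state is continually updated through sequential Bayesian inference. This information is then used to optimize the planning of maintenance and management actions within a dynamic decision-making framework. In a preliminary offline phase, training datasets are populated using a reduced-order numerical model and a health-dependent control policy is computed. The strategy is demonstrated on two synthetic case studies, a cantilever beam and a railway bridge, which show the dynamic decision-making capabilities of health-aware digital twins.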
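The sequential Bayesian update of the digital twin state mentioned above can be illustrated with a generic predict-then-correct step over a small set of discrete health states. This is only a sketch of the general idea behind filtering in a dynamic Bayesian network: the two states, the transition matrix, the observation likelihoods, and the name `bayes_filter_step` are all hypothetical and are not taken from the paper.

```python
def bayes_filter_step(belief, transition, likelihood):
    """One sequential Bayesian update over discrete states.

    belief[i]: prior probability of state i.
    transition[i][j]: probability of moving from state i to state j in one step.
    likelihood[j]: probability of the new observation given state j.
    Returns the normalized posterior belief.
    """
    n = len(belief)
    # Predict: push the belief through the state transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correct: weight each state by how well it explains the observation.
    unnormalized = [predicted[j] * likelihood[j] for j in range(n)]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]


# Hypothetical two-state example: index 0 = "healthy", index 1 = "damaged".
transition = [[0.95, 0.05],   # a healthy structure rarely degrades in one step
              [0.00, 1.00]]   # damage persists until maintenance
damage_alarm = [0.2, 0.9]     # assumed P(sensor raises alarm | state)

belief = [1.0, 0.0]           # start certain that the structure is healthy
for _ in range(3):            # three consecutive alarms
    belief = bayes_filter_step(belief, transition, damage_alarm)
```

Each repeated alarm shifts probability mass toward the damaged state. In a decision-making framework of the kind the abstract describes, a belief like this would be the input to the maintenance policy.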
DLSIA: Deep Learning for Scientific Image Analysis
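results: As experimental data continue to grow in scale and complexity, DLSIA provides customizable architectures and abstracts CNN complexity, helping scientists tailor their machine learning approaches, accelerate discoveries, foster interdisciplinary collaboration, and advance research in scientific image analysis.
Abstract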
We introduce DLSIA (Deep Learning for Scientific Image Analysis), a Python-based machine learning library that empowers scientists and researchers across diverse scientific domains with a range of customizable convolutional neural network (CNN) architectures for a wide variety of tasks in image analysis to be used in downstream data processing, or for experiment-in-the-loop computing scenarios. DLSIA features easy-to-use architectures such as autoencoders, tunable U-Nets, and parameter-lean mixed-scale dense networks (MSDNets). Additionally, we introduce sparse mixed-scale networks (SMSNets), generated using random graphs and sparse connections. As experimental data continues to grow in scale and complexity, DLSIA provides accessible CNN construction and abstracts CNN complexities, allowing scientists to tailor their machine learning approaches, accelerate discoveries, foster interdisciplinary collaboration, and advance research in scientific image analysis.
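Summary
We introduce DLSIA (Deep Learning for Scientific Image Analysis), a Python-based machine learning library that provides scientists and researchers with a range of customizable convolutional neural network architectures for a wide variety of image analysis tasks, including downstream data processing and experiment-in-the-loop computing scenarios. DLSIA offers easy-to-use architectures such as autoencoders, tunable U-Nets, and parameter-lean mixed-scale dense networks (MSDNets). We also introduce sparse mixed-scale networks (SMSNets), generated using random graphs and sparse connections. As experimental data continue to grow in scale and complexity, DLSIA provides accessible CNN construction and abstracts CNN complexities, allowing scientists to tailor their machine learning approaches, accelerate discoveries, foster interdisciplinary collaboration, and advance research in scientific image analysis.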
Novel Physics-Based Machine-Learning Models for Indoor Air Quality Approximations
paper_authors: Ahmad Mohammadshirazi, Aida Nadafian, Amin Karimi Monsefi, Mohammad H. Rafiei, Rajiv Ramnath
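for: This study aims to provide accurate indoor air quality approximation in order to provide a healthy indoor environment, optimize associated energy consumption, and offer human comfort.
methods: The study proposes six new physics-based machine learning models that combine state-space concepts, Gated Recurrent Units, and decomposition techniques.
results: The study shows that the proposed models are less complex and computationally more efficient than similar state-of-the-art transformer-based models, and that they better capture the highly nonlinear patterns in indoor air quality data.
Abstract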
Cost-effective sensors are capable of real-time capturing a variety of air quality-related modalities from different pollutant concentrations to indoor/outdoor humidity and temperature. Machine learning (ML) models are capable of performing air-quality "ahead-of-time" approximations. Undoubtedly, accurate indoor air quality approximation significantly helps provide a healthy indoor environment, optimize associated energy consumption, and offer human comfort. However, it is crucial to design an ML architecture to capture the domain knowledge, so-called problem physics. In this study, we propose six novel physics-based ML models for accurate indoor pollutant concentration approximations. The proposed models include an adroit combination of state-space concepts in physics, Gated Recurrent Units, and Decomposition techniques. The proposed models were illustrated using data collected from five offices in a commercial building in California. The proposed models are shown to be less complex, computationally more efficient, and more accurate than similar state-of-the-art transformer-based models. The superiority of the proposed models is due to their relatively light architecture (computational efficiency) and, more importantly, their ability to capture the underlying highly nonlinear patterns embedded in the often contaminated sensor-collected indoor air quality temporal data.
Summary
Cost-effective sensors can capture a variety of air quality-related modalities in real time, from different pollutant concentrations to indoor/outdoor humidity and temperature. Machine learning (ML) models can perform air-quality "ahead-of-time" approximations. Accurate indoor air quality approximation is crucial to provide a healthy indoor environment, optimize associated energy consumption, and offer human comfort. However, it is essential to design an ML architecture that captures the domain knowledge, so-called problem physics. In this study, we propose six novel physics-based ML models for accurate indoor pollutant concentration approximations. The proposed models combine state-space concepts in physics, Gated Recurrent Units, and Decomposition techniques. The proposed models were illustrated using data collected from five offices in a commercial building in California. The proposed models are less complex, computationally more efficient, and more accurate than similar state-of-the-art transformer-based models. The superiority of the proposed models is due to their relatively light architecture (computational efficiency) and their ability to capture the underlying highly nonlinear patterns embedded in the often contaminated sensor-collected indoor air quality temporal data.
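results: The study finds that embedding an electricity market-clearing optimization layer reduces the impact of prediction errors on electricity prices and controls the spatial distribution of pricing errors across the system.
Abstract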
While deep learning gradually penetrates operational planning, its inherent prediction errors may significantly affect electricity prices. This letter examines how prediction errors propagate into electricity prices, revealing notable pricing errors and their spatial disparity in congested power systems. To improve fairness, we propose to embed electricity market-clearing optimization as a deep learning layer. Differentiating through this layer allows for balancing between prediction and pricing errors, as opposed to minimizing prediction errors alone. This layer implicitly optimizes fairness and controls the spatial distribution of price errors across the system. We showcase the price-aware deep learning in the nexus of wind power forecasting and short-term electricity market clearing.
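Summary
Deep learning is gradually penetrating operational planning, but its inherent prediction errors may significantly affect electricity prices. This letter analyzes how prediction errors propagate into electricity prices and reveals notable pricing errors and their spatial disparity in congested power systems. To improve fairness, we propose embedding electricity market-clearing optimization as a deep learning layer. Differentiating through this layer balances prediction errors against pricing errors instead of minimizing prediction errors alone. The layer also implicitly optimizes fairness and controls the spatial distribution of price errors across the system. We demonstrate price-aware deep learning at the nexus of wind power forecasting and short-term electricity market clearing.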
COVID-VR: A Deep Learning COVID-19 Classification Model Using Volume-Rendered Computer Tomography
results: The approach effectively identifies pulmonary lesions and performs competitively against traditional slice-based methods.
Abstract
The COVID-19 pandemic presented numerous challenges to healthcare systems worldwide. Given that lung infections are prevalent among COVID-19 patients, chest Computer Tomography (CT) scans have frequently been utilized as an alternative method for identifying COVID-19 conditions and various other types of pulmonary diseases. Deep learning architectures have emerged to automate the identification of pulmonary disease types by leveraging CT scan slices as inputs for classification models. This paper introduces COVID-VR, a novel approach for classifying pulmonary diseases based on volume rendering images of the lungs captured from multiple angles, thereby providing a comprehensive view of the entire lung in each image. To assess the effectiveness of our proposal, we compared it against competing strategies utilizing both private data obtained from partner hospitals and a publicly available dataset. The results demonstrate that our approach effectively identifies pulmonary lesions and performs competitively when compared to slice-based methods.
Summary
The COVID-19 pandemic presented numerous challenges to healthcare systems worldwide. Because lung infections are prevalent among COVID-19 patients, chest Computer Tomography (CT) scans have frequently been used as an alternative method for identifying COVID-19 conditions and various other types of pulmonary diseases. Deep learning architectures have emerged to automate the identification of pulmonary disease types by leveraging CT scan slices as inputs for classification models. This paper introduces COVID-VR, a novel approach for classifying pulmonary diseases based on volume rendering images of the lungs captured from multiple angles, thereby providing a comprehensive view of the entire lung in each image. To assess the effectiveness of our proposal, we compared it against competing strategies utilizing both private data obtained from partner hospitals and a publicly available dataset. The results demonstrate that our approach effectively identifies pulmonary lesions and performs competitively when compared to slice-based methods.
Unlocking the Potential of Similarity Matching: Scalability, Supervision and Pre-training
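results: The authors scale SM with a PyTorch implementation of Convolutional Nonnegative SM and introduce a localized supervised SM objective reminiscent of canonical correlation analysis, which makes it easier to stack SM layers. They also pre-train architectures such as LeNet and compare the resulting features against BP-trained models. Combining biological plausibility with computational efficiency opens multiple avenues for further exploration.
Abstract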
While effective, the backpropagation (BP) algorithm exhibits limitations in terms of biological plausibility, computational cost, and suitability for online learning. As a result, there has been a growing interest in developing alternative biologically plausible learning approaches that rely on local learning rules. This study focuses on the primarily unsupervised similarity matching (SM) framework, which aligns with observed mechanisms in biological systems and offers online, localized, and biologically plausible algorithms. i) To scale SM to large datasets, we propose an implementation of Convolutional Nonnegative SM using PyTorch. ii) We introduce a localized supervised SM objective reminiscent of canonical correlation analysis, facilitating stacking SM layers. iii) We leverage the PyTorch implementation for pre-training architectures such as LeNet and compare the evaluation of features against BP-trained models. This work combines biologically plausible algorithms with computational efficiency opening multiple avenues for further explorations.
Summary
While effective, the backpropagation (BP) algorithm has limitations in terms of biological plausibility, computational cost, and suitability for online learning. As a result, there has been a growing interest in developing alternative biologically plausible learning approaches that rely on local learning rules. This study focuses on the primarily unsupervised similarity matching (SM) framework, which aligns with observed mechanisms in biological systems and offers online, localized, and biologically plausible algorithms. i) To scale SM to large datasets, we propose an implementation of Convolutional Nonnegative SM using PyTorch. ii) We introduce a localized supervised SM objective reminiscent of canonical correlation analysis, facilitating stacking SM layers. iii) We leverage the PyTorch implementation for pre-training architectures such as LeNet and compare the evaluation of features against BP-trained models. This work combines biologically plausible algorithms with computational efficiency, opening multiple avenues for further explorations.
Bio+Clinical BERT, BERT Base, and CNN Performance Comparison for Predicting Drug-Review Satisfaction
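results: The Bio+Clinical BERT model outperformed the base BERT model overall, improving macro F1 and recall by 11%, as shown in Table 2. Future research could explore how to capitalize on the strengths of each model: Bio+Clinical BERT excels with medical jargon, while the simpler CNN identifies crucial words and accurately classifies sentiment in text.
Abstract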
The objective of this study is to develop natural language processing (NLP) models that can analyze patients' drug reviews and accurately classify their satisfaction levels as positive, neutral, or negative. Such models would reduce the workload of healthcare professionals and provide greater insight into patients' quality of life, which is a critical indicator of treatment effectiveness. To achieve this, we implemented and evaluated several classification models, including a BERT base model, Bio+Clinical BERT, and a simpler CNN. Results indicate that the medical domain-specific Bio+Clinical BERT model significantly outperformed the general domain base BERT model, achieving macro f1 and recall score improvement of 11%, as shown in Table 2. Future research could explore how to capitalize on the specific strengths of each model. Bio+Clinical BERT excels in overall performance, particularly with medical jargon, while the simpler CNN demonstrates the ability to identify crucial words and accurately classify sentiment in texts with conflicting sentiments.
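Summary
The aim of this study is to develop natural language processing (NLP) models that analyze patients' drug reviews and accurately classify their satisfaction as positive, neutral, or negative. Such models would reduce the workload of healthcare professionals and offer greater insight into patients' quality of life, a critical indicator of treatment effectiveness. To this end, we implemented and evaluated several classification models, including a BERT base model, Bio+Clinical BERT, and a simpler CNN. The results indicate that the medical domain-specific Bio+Clinical BERT model significantly outperformed the general-domain base BERT model, improving macro F1 and recall by 11%, as shown in Table 2. Future research could explore how to capitalize on the strengths of each model. Bio+Clinical BERT excels in overall performance, particularly with medical jargon, while the simpler CNN identifies crucial words and accurately classifies sentiment in texts with conflicting sentiments.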
Sea level Projections with Machine Learning using Altimetry and Climate Model ensembles
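results: A non-linear ML fusion of climate model outputs is used to project sea level change 30 years into the future, and segmenting the dataset into clusters improves prediction accuracy.
Abstract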
Satellite altimeter observations retrieved since 1993 show that the global mean sea level is rising at an unprecedented rate (3.4mm/year). With almost three decades of observations, we can now investigate the contributions of anthropogenic climate-change signals such as greenhouse gases, aerosols, and biomass burning in this rising sea level. We use machine learning (ML) to investigate future patterns of sea level change. To understand the extent of contributions from the climate-change signals, and to help in forecasting sea level change in the future, we turn to climate model simulations. This work presents a machine learning framework that exploits both satellite observations and climate model simulations to generate sea level rise projections at a 2-degree resolution spatial grid, 30 years into the future. We train fully connected neural networks (FCNNs) to predict altimeter values through a non-linear fusion of the climate model hindcasts (for 1993-2019). The learned FCNNs are then applied to future climate model projections to predict future sea level patterns. We propose segmenting our spatial dataset into meaningful clusters and show that clustering helps to improve predictions of our ML model.
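Summary
Satellite altimeter observations retrieved since 1993 show that the global mean sea level is rising at 3.4 mm/year. After nearly three decades of observations, we can now investigate the contributions of anthropogenic climate-change signals, such as greenhouse gases, aerosols, and biomass burning, to this rise. We use machine learning (ML) to investigate future patterns of sea level change. To understand the extent of the contributions from these signals, and to help forecast future sea level change, we turn to climate model simulations. This work presents a machine learning framework that exploits both satellite observations and climate model simulations to project sea level change 30 years into the future. We train fully connected neural networks (FCNNs) to predict altimeter values through a non-linear fusion of the climate model hindcasts (1993-2019). The learned FCNNs are then applied to future climate model projections to predict future sea level patterns. We also propose segmenting the spatial dataset into clusters and show that this improves the predictions of the ML model.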
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
paper_authors: Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, Ludwig Schmidt
results: According to the paper, OpenFlamingo models reach on average 80-89% of the corresponding Flamingo models' performance on seven vision-language datasets.
Abstract
We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80% and 89% of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo.
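Summary
We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models reach on average 80-89% of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo.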
Follow the Soldiers with Optimized Single-Shot Multibox Detection and Reinforcement Learning
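results: SSD Lite was used instead of SSD, and the approaches were compared. Experimental results show that SSD Lite performs best among the three techniques, with a considerable boost in inference speed (~2-3 times) without compromising accuracy.
Abstract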
Nowadays, autonomous cars are gaining traction due to their numerous potential applications on battlefields and in resolving a variety of other real-world challenges. The main goal of our project is to build an autonomous system using DeepRacer that follows a specific person (for our project, a soldier) as they move in any direction. The two main components of this project are an optimized Single-Shot Multibox Detection (SSD) object detection model and a Reinforcement Learning (RL) model. We accomplished the task using SSD Lite instead of SSD and, at the end, compared the results among SSD, SSD with Neural Computing Stick (NCS), and SSD Lite. Experimental results show that SSD Lite gives the best performance among these three techniques and exhibits a considerable boost in inference speed (~2-3 times) without compromising accuracy.
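Summary
Autonomous cars are attracting wide attention because of their potential on battlefields and in solving a variety of other real-world problems. The main goal of our project is to build an autonomous system with DeepRacer that follows a specific person (in our project, a soldier) as they move in any direction. To achieve this, we use an optimized Single-Shot Multibox Detection (SSD) object detection model and a Reinforcement Learning (RL) model. We used SSD Lite instead of SSD and, at the end, compared the results of SSD, SSD with Neural Computing Stick (NCS), and SSD Lite. Experimental results show that SSD Lite performs best among these three techniques and delivers a considerable boost in inference speed (~2-3 times) without compromising accuracy.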
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
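results: The paper shows that DeepSpeed-Chat can train ChatGPT-like models in a short time and at comparatively low cost, even at large scale. The system is expected to advance the AI field and give more data scientists access to advanced RLHF training.
Abstract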
ChatGPT-like models have revolutionized various applications in artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance. However, the current landscape lacks an accessible, efficient, and cost-effective end-to-end RLHF (Reinforcement Learning with Human Feedback) training pipeline for these powerful models, particularly when training at the scale of billions of parameters. This paper introduces DeepSpeed-Chat, a novel system that democratizes RLHF training, making it accessible to the AI community. DeepSpeed-Chat offers three key capabilities: an easy-to-use training and inference experience for ChatGPT-like models, a DeepSpeed-RLHF pipeline that replicates the training pipeline from InstructGPT, and a robust DeepSpeed-RLHF system that combines various optimizations for training and inference in a unified way. The system delivers unparalleled efficiency and scalability, enabling training of models with hundreds of billions of parameters in record time and at a fraction of the cost. With this development, DeepSpeed-Chat paves the way for broader access to advanced RLHF training, even for data scientists with limited resources, thereby fostering innovation and further development in the field of AI.
Summary
ChatGPT-like models have revolutionized a range of applications in artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance. However, the current landscape lacks an accessible, efficient, and cost-effective end-to-end RLHF (Reinforcement Learning with Human Feedback) training pipeline for these models, particularly when training at the scale of billions of parameters. This paper introduces DeepSpeed-Chat, a new system that democratizes RLHF training and makes it accessible to the AI community. DeepSpeed-Chat offers three key capabilities: an easy-to-use training and inference experience for ChatGPT-like models, a DeepSpeed-RLHF pipeline that replicates the training pipeline from InstructGPT, and a robust DeepSpeed-RLHF system that combines training and inference optimizations in a unified way. The system delivers high efficiency and scalability, enabling the training of models with hundreds of billions of parameters in a short time and at a fraction of the cost. With this development, DeepSpeed-Chat paves the way for broader access to advanced RLHF training, even for data scientists with limited resources, thereby fostering innovation and further development in the field of AI.
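results: The system lets photographers achieve long exposure photography with a hand-held smartphone camera. It automatically detects and segments the subject and generates aesthetically pleasing motion streaks, which makes long exposure photos easier to capture and the technique accessible to more photography enthusiasts.
Abstract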
Long exposure photography produces stunning imagery, representing moving elements in a scene with motion-blur. It is generally employed in two modalities, producing either a foreground or a background blur effect. Foreground blur images are traditionally captured on a tripod-mounted camera and portray blurred moving foreground elements, such as silky water or light trails, over a perfectly sharp background landscape. Background blur images, also called panning photography, are captured while the camera is tracking a moving subject, to produce an image of a sharp subject over a background blurred by relative motion. Both techniques are notoriously challenging and require additional equipment and advanced skills. In this paper, we describe a computational burst photography system that operates in a hand-held smartphone camera app, and achieves these effects fully automatically, at the tap of the shutter button. Our approach first detects and segments the salient subject. We track the scene motion over multiple frames and align the images in order to preserve desired sharpness and to produce aesthetically pleasing motion streaks. We capture an under-exposed burst and select the subset of input frames that will produce blur trails of controlled length, regardless of scene or camera motion velocity. We predict inter-frame motion and synthesize motion-blur to fill the temporal gaps between the input frames. Finally, we composite the blurred image with the sharp regular exposure to protect the sharpness of faces or areas of the scene that are barely moving, and produce a final high resolution and high dynamic range (HDR) photograph. Our system democratizes a capability previously reserved to professionals, and makes this creative style accessible to most casual photographers. More information and supplementary material can be found on our project webpage: https://motion-mode.github.io/
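Summary
Long exposure photography produces striking images that render the moving elements of a scene with motion blur. It is generally used in two modes, producing either a foreground or a background blur effect. Foreground blur images are traditionally captured with a tripod-mounted camera and show blurred moving foreground elements, such as silky water or light trails, over a perfectly sharp background. Background blur images, also called panning photography, are captured while the camera tracks a moving subject, yielding a sharp subject over a background blurred by relative motion. Both techniques are challenging and require additional equipment and advanced skills. In this paper, we describe a computational burst photography system that runs in a hand-held smartphone camera app and achieves these effects fully automatically at the tap of the shutter button. Our approach first detects and segments the salient subject. We track scene motion over multiple frames and align the images to preserve the desired sharpness and produce aesthetically pleasing motion streaks. We capture an under-exposed burst and select the subset of input frames that will produce blur trails of controlled length. We predict inter-frame motion and synthesize motion blur to fill the temporal gaps between input frames. Finally, we composite the blurred image with the sharp regular exposure to protect the sharpness of faces and barely moving areas of the scene, and produce a final high-resolution, high dynamic range (HDR) photograph. Our system opens a capability previously reserved for professionals to most casual photographers. More information and supplementary material can be found on our project webpage: https://motion-mode.github.io/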
AI-Enhanced Data Processing and Discovery Crowd Sourcing for Meteor Shower Mapping
results: To date, CAMS has discovered over 200 new meteor showers and validated dozens of previously reported showers.
Abstract
The Cameras for Allsky Meteor Surveillance (CAMS) project, funded by NASA starting in 2010, aims to map our meteor showers by triangulating meteor trajectories detected in low-light video cameras from multiple locations across 16 countries in both the northern and southern hemispheres. Its mission is to validate, discover, and predict the upcoming returns of meteor showers. Our research aimed to streamline the data processing by implementing an automated cloud-based AI-enabled pipeline and improve the data visualization to improve the rate of discoveries by involving the public in monitoring the meteor detections. This article describes the process of automating the data ingestion, processing, and insight generation using an interpretable Active Learning and AI pipeline. This work also describes the development of an interactive web portal (the NASA Meteor Shower portal) to facilitate the visualization of meteor radiant maps. To date, CAMS has discovered over 200 new meteor showers and has validated dozens of previously reported showers.
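Summary
The Cameras for Allsky Meteor Surveillance (CAMS) project, funded by NASA since 2010, aims to map meteor showers by triangulating meteor trajectories detected in low-light video cameras at multiple locations in multiple countries. Its mission is to validate, discover, and predict the upcoming returns of meteor showers. Our research aimed to streamline data processing with an automated, cloud-based, AI-enabled pipeline and to improve data visualization, raising the rate of discoveries by involving the public in monitoring meteor detections. This article describes how data ingestion, processing, and insight generation were automated using an interpretable Active Learning and AI pipeline. It also describes the development of the NASA Meteor Shower portal, which facilitates the visualization of meteor radiant maps. To date, CAMS has discovered over 200 new meteor showers and validated dozens of previously reported showers.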
Explainable Deep Learning for Tumor Dynamic Modeling and Overall Survival Prediction using Neural-ODE
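results: TDNODE overcomes a key limitation of existing models by making unbiased predictions from truncated data. The generated metrics predict patients' overall survival (OS) with high accuracy.
Abstract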
While tumor dynamic modeling has been widely applied to support the development of oncology drugs, there remains a need to increase predictivity, enable personalized therapy, and improve decision-making. We propose the use of Tumor Dynamic Neural-ODE (TDNODE) as a pharmacology-informed neural network to enable model discovery from longitudinal tumor size data. We show that TDNODE overcomes a key limitation of existing models in its ability to make unbiased predictions from truncated data. The encoder-decoder architecture is designed to express an underlying dynamical law which possesses the fundamental property of generalized homogeneity with respect to time. Thus, the modeling formalism enables the encoder output to be interpreted as kinetic rate metrics, with inverse time as the physical unit. We show that the generated metrics can be used to predict patients' overall survival (OS) with high accuracy. The proposed modeling formalism provides a principled way to integrate multimodal dynamical datasets in oncology disease modeling.
Summary
Tumor dynamic modeling has been widely used to support the development of oncology drugs, but there is still a need to improve predictivity, enable personalized therapy, and support better decision-making. We propose the use of Tumor Dynamic Neural-ODE (TDNODE) as a pharmacology-informed neural network to enable model discovery from longitudinal tumor size data. We show that TDNODE overcomes a key limitation of existing models by making unbiased predictions from truncated data. The encoder-decoder architecture is designed to express an underlying dynamical law that possesses the fundamental property of generalized homogeneity with respect to time. Thus, the modeling formalism enables the encoder output to be interpreted as kinetic rate metrics, with inverse time as the physical unit. We show that the generated metrics can be used to predict patients' overall survival (OS) with high accuracy. The proposed modeling formalism provides a principled way to integrate multimodal dynamical datasets in oncology disease modeling.
Compressed and distributed least-squares regression: convergence rates with applications to Federated Learning
paper_authors: Constantin Philippenko, Aymeric Dieuleveut
for: investigate the impact of compression on stochastic gradient algorithms for machine learning, specifically in distributed and federated learning
methods: analyze the convergence rates of several unbiased compression operators, and extend the results to the case of federated learning
results: demonstrate that the limit variance term scales with $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$, and analyze the dependency of $\mathfrak{C}_{\mathrm{ania}}$ on the compression strategy and its impact on convergence in centralized and heterogeneous FL frameworks.
Abstract
In this paper, we investigate the impact of compression on stochastic gradient algorithms for machine learning, a technique widely used in distributed and federated learning. We underline differences in terms of convergence rates between several unbiased compression operators, that all satisfy the same condition on their variance, thus going beyond the classical worst-case analysis. To do so, we focus on the case of least-squares regression (LSR) and analyze a general stochastic approximation algorithm for minimizing quadratic functions relying on a random field. We consider weak assumptions on the random field, tailored to the analysis (specifically, expected H\"older regularity), and on the noise covariance, enabling the analysis of various randomizing mechanisms, including compression. We then extend our results to the case of federated learning. More formally, we highlight the impact on the convergence of the covariance $\mathfrak{C}_{\mathrm{ania}}$ of the additive noise induced by the algorithm. We demonstrate despite the non-regularity of the stochastic field, that the limit variance term scales with $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ (where $H$ is the Hessian of the optimization problem and $K$ the number of iterations) generalizing the rate for the vanilla LSR case where it is $\sigma^2 \mathrm{Tr}(H H^{-1}) / K = \sigma^2 d / K$ (Bach and Moulines, 2013). Then, we analyze the dependency of $\mathfrak{C}_{\mathrm{ania}}$ on the compression strategy and ultimately its impact on convergence, first in the centralized case, then in two heterogeneous FL frameworks.
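Summary
In this paper, we study the impact of compression on stochastic gradient algorithms used in distributed and federated learning. We compare several unbiased compression operators that all satisfy the same condition on their variance, thus going beyond the classical worst-case analysis. To do so, we focus on least-squares regression (LSR) and analyze a general stochastic approximation algorithm, relying on a random field, for minimizing quadratic functions. We make weak assumptions on the random field (specifically, expected H\"older regularity) and on the noise covariance, which allows various randomizing mechanisms, including compression, to be analyzed. We then extend the results to federated learning. More formally, we highlight how the covariance $\mathfrak{C}_{\mathrm{ania}}$ of the additive noise induced by the algorithm affects convergence. We show that, even though the random field is not regular, the limit variance term scales with $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ (where $H$ is the Hessian of the optimization problem and $K$ the number of iterations), which generalizes the rate $\sigma^2 \mathrm{Tr}(H H^{-1}) / K = \sigma^2 d / K$ of the vanilla LSR case (Bach and Moulines, 2013). We then analyze how $\mathfrak{C}_{\mathrm{ania}}$ depends on the compression strategy, first in the centralized case and then in two heterogeneous federated learning frameworks.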
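The notion of an unbiased compression operator mentioned in the abstract above can be made concrete with a small sketch. The example uses rand-k sparsification, a commonly used unbiased operator; whether it is among the operators the paper studies is an assumption, and the names `rand_k` and `empirical_mean` are illustrative only. For rand-k, $\mathbb{E}[\mathcal{C}(x)] = x$ and $\mathbb{E}\|\mathcal{C}(x) - x\|^2 = (d/k - 1)\|x\|^2$, which is a variance condition of the kind the abstract refers to.

```python
import numpy as np

def rand_k(x, k, rng):
    """Rand-k sparsification (illustrative, not taken from the paper).

    Keeps k coordinates chosen uniformly at random and rescales them by d/k,
    which makes the operator unbiased: E[rand_k(x)] = x.
    """
    d = x.shape[0]
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

def empirical_mean(x, k, n_samples, seed=0):
    """Average many compressed copies of x to check unbiasedness numerically."""
    rng = np.random.default_rng(seed)
    return np.mean([rand_k(x, k, rng) for _ in range(n_samples)], axis=0)

if __name__ == "__main__":
    x = np.arange(1.0, 6.0)  # d = 5
    print("original:      ", x)
    print("empirical mean:", empirical_mean(x, k=2, n_samples=20000))
```

Each compressed vector transmits only k of the d coordinates, and the price is the extra variance factor d/k - 1; the abstract's point is that operators sharing the same variance bound can still differ in convergence through the covariance of the noise they induce.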
More Context, Less Distraction: Visual Classification by Inferring and Conditioning on Contextual Attributes
paper_authors: Bang An, Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Chaithanya Kumar Mummadi, Furong Huang
for: This paper aims to improve zero-shot image classification using CLIP, by leveraging the model’s ability to understand visual concepts and natural language descriptions.
methods: The proposed method, called PerceptionCLIP, first infers contextual attributes (e.g., background) from an image, and then performs object classification conditioning on these attributes.
results: The proposed method achieves better generalization, group robustness, and interpretability compared to traditional zero-shot classification methods. For example, PerceptionCLIP with ViT-L/14 improves the worst group accuracy by 16.5% on the Waterbirds dataset and by 3.5% on CelebA.
Abstract
CLIP, as a foundational vision language model, is widely used in zero-shot image classification due to its ability to understand various visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better zero-shot classification is still an open question. This paper draws inspiration from the human visual perception process: a modern neuroscience view suggests that in classifying an object, humans first infer its class-independent attributes (e.g., background and orientation) which help separate the foreground object from the background, and then make decisions based on this information. Inspired by this, we observe that providing CLIP with contextual attributes improves zero-shot classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer the attributes from an image. With these observations, we propose a training-free, two-step zero-shot classification method named PerceptionCLIP. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioning on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and better interpretability. For example, PerceptionCLIP with ViT-L/14 improves the worst group accuracy by 16.5% on the Waterbirds dataset and by 3.5% on CelebA.
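Summary
CLIP, as a foundational vision-language model, is widely used in zero-shot image classification because it understands a wide range of visual concepts and natural language descriptions. However, how to fully leverage CLIP's human-like understanding capabilities to achieve better zero-shot classification remains an open question. This paper draws inspiration from the human visual perception process: a modern neuroscience view suggests that, when classifying an object, humans first infer its class-independent attributes (e.g., background and orientation), which help separate the foreground object from the background, and then make decisions based on this information. Inspired by this, we observe that providing CLIP with contextual attributes improves zero-shot classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer the attributes from an image. With these observations, we propose a training-free, two-step zero-shot classification method named PerceptionCLIP. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioning on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability. For example, PerceptionCLIP with ViT-L/14 improves the worst group accuracy by 16.5% on the Waterbirds dataset and by 3.5% on CelebA.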
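The two-step procedure described above (infer a contextual attribute, then classify conditioned on it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the image-text similarity scores are passed in as plain arrays instead of being computed by CLIP, and the function name `two_step_classify` and the hard arg-max over attributes are hypothetical choices.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def two_step_classify(sim_attr, sim_class_given_attr):
    """Two-step zero-shot classification sketch.

    sim_attr: shape (A,), similarity of the image to attribute-only prompts
        (e.g. "a photo on water", "a photo on land"). Assumed input.
    sim_class_given_attr: shape (A, C), similarity of the image to prompts
        that name both attribute a and class c. Assumed input.
    Returns (inferred attribute index, predicted class index).
    """
    # Step 1: infer the most likely contextual attribute.
    a_hat = int(np.argmax(softmax(sim_attr)))
    # Step 2: classify using only the prompts conditioned on that attribute.
    class_probs = softmax(sim_class_given_attr[a_hat])
    return a_hat, int(np.argmax(class_probs))

if __name__ == "__main__":
    sim_attr = np.array([0.2, 0.9])
    sim_class_given_attr = np.array([[0.8, 0.1],
                                     [0.3, 0.7]])
    print(two_step_classify(sim_attr, sim_class_given_attr))
```

In practice the similarity scores would come from a vision-language model's image and text encoders; how the paper combines attribute and class scores is not specified in the abstract, so the hard selection in step 1 should be read as one possible design.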
results: The article reports on the system's design and training procedure, as well as on user test results.
Abstract
We present Lode Encoder, a gamified mixed-initiative level creation system for the classic platform-puzzle game Lode Runner. The system is built around several autoencoders which are trained on sets of Lode Runner levels. When fed with the user's design, each autoencoder produces a version of that design which is closer in style to the levels that it was trained on. The Lode Encoder interface allows the user to build and edit levels through 'painting' from the suggestions provided by the autoencoders. Crucially, in order to encourage designers to explore new possibilities, the system does not include more traditional editing tools. We report on the system design and training procedure, as well as on the evolution of the system itself and user tests.
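Summary
We present Lode Encoder, a gamified mixed-initiative level creation system for the classic platform-puzzle game Lode Runner. The system is built around several autoencoders trained on sets of Lode Runner levels. When given the user's design, each autoencoder produces a version of that design that is closer in style to the levels it was trained on. The Lode Encoder interface lets the user build and edit levels by 'painting' from the suggestions provided by the autoencoders. To encourage designers to explore new possibilities, the system does not include more traditional editing tools. We report on the system design and training procedure, as well as on user tests.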
Masked and Swapped Sequence Modeling for Next Novel Basket Recommendation in Grocery Shopping
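results: Extensive experiments on three public datasets show that BTBR, together with the proposed masking and swapping strategies, substantially improves performance on the NNBR task.
Abstract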
Next basket recommendation (NBR) is the task of predicting the next set of items based on a sequence of already purchased baskets. It is a recommendation task that has been widely studied, especially in the context of grocery shopping. In next basket recommendation (NBR), it is useful to distinguish between repeat items, i.e., items that a user has consumed before, and explore items, i.e., items that a user has not consumed before. Most NBR work either ignores this distinction or focuses on repeat items. We formulate the next novel basket recommendation (NNBR) task, i.e., the task of recommending a basket that only consists of novel items, which is valuable for both real-world application and NBR evaluation. We evaluate how existing NBR methods perform on the NNBR task and find that, so far, limited progress has been made w.r.t. the NNBR task. To address the NNBR task, we propose a simple bi-directional transformer basket recommendation model (BTBR), which is focused on directly modeling item-to-item correlations within and across baskets instead of learning complex basket representations. To properly train BTBR, we propose and investigate several masking strategies and training objectives: (i) item-level random masking, (ii) item-level select masking, (iii) basket-level all masking, (iv) basket-level explore masking, and (v) joint masking. In addition, an item-basket swapping strategy is proposed to enrich the item interactions within the same baskets. We conduct extensive experiments on three open datasets with various characteristics. The results demonstrate the effectiveness of BTBR and our masking and swapping strategies for the NNBR task. BTBR with a properly selected masking and swapping strategy can substantially improve NNBR performance.
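Summary
Next basket recommendation (NBR) is the task of predicting the next set of items from a sequence of previously purchased baskets. It is a widely studied recommendation task, particularly for grocery shopping. In NBR it is useful to distinguish between repeat items (items a user has consumed before) and explore items (items a user has not consumed before). Most NBR work either ignores this distinction or focuses on repeat items. We formulate the next novel basket recommendation (NNBR) task, that is, recommending a basket that consists only of novel items, which is valuable for both real-world applications and NBR evaluation. We evaluate how existing NBR methods perform on NNBR and find that progress so far has been limited. To address the task, we propose a simple bi-directional transformer basket recommendation model (BTBR) that directly models item-to-item correlations within and across baskets rather than learning complex basket representations. To train BTBR properly, we propose and investigate several masking strategies and training objectives: (i) item-level random masking, (ii) item-level select masking, (iii) basket-level all masking, (iv) basket-level explore masking, and (v) joint masking. We also propose an item-basket swapping strategy to enrich item interactions within the same basket. Extensive experiments on three open datasets show that BTBR and our masking and swapping strategies are effective for NNBR; BTBR with a properly selected masking and swapping strategy can substantially improve NNBR performance.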
Excitatory/Inhibitory Balance Emerges as a Key Factor for RBN Performance, Overriding Attractor Dynamics
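results: For specific distribution parameters, Random Boolean Networks (RBNs) exhibit diverse dynamics near critical points. A positive excitatory balance produces a critical point with higher memory performance, a negative inhibitory balance produces one with better prediction performance, and the intrinsic attractor dynamics have little influence on performance in either case.
Abstract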
Reservoir computing provides a time and cost-efficient alternative to traditional learning methods. Critical regimes, known as the "edge of chaos," have been found to optimize computational performance in binary neural networks. However, little attention has been devoted to studying reservoir-to-reservoir variability when investigating the link between connectivity, dynamics, and performance. As physical reservoir computers become more prevalent, developing a systematic approach to network design is crucial. In this article, we examine Random Boolean Networks (RBNs) and demonstrate that specific distribution parameters can lead to diverse dynamics near critical points. We identify distinct dynamical attractors and quantify their statistics, revealing that most reservoirs possess a dominant attractor. We then evaluate performance in two challenging tasks, memorization and prediction, and find that a positive excitatory balance produces a critical point with higher memory performance. In comparison, a negative inhibitory balance delivers another critical point with better prediction performance. Interestingly, we show that the intrinsic attractor dynamics have little influence on performance in either case.
Summary
Reservoir computing offers a time- and cost-efficient alternative to traditional learning methods. Critical regimes, known as the "edge of chaos," have been found to optimize computational performance in binary neural networks, but reservoir-to-reservoir variability has received little attention in studies of the link between connectivity, dynamics, and performance. As physical reservoir computers become more prevalent, a systematic approach to network design is crucial. This article examines Random Boolean Networks (RBNs) and shows that specific distribution parameters can lead to diverse dynamics near critical points. It identifies distinct dynamical attractors and quantifies their statistics, revealing that most reservoirs possess a dominant attractor. Performance is then evaluated on two challenging tasks, memorization and prediction: a positive excitatory balance produces a critical point with higher memory performance, while a negative inhibitory balance delivers another critical point with better prediction performance. The intrinsic attractor dynamics have little influence on performance in either case.
EmbeddingTree: Hierarchical Exploration of Entity Features in Embedding
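results: Using EmbeddingTree and the interactive visualization tool, the authors extract more semantic information from the embedding space and perform feature denoising/injection during embedding training. Experiments on industry-scale merchant data and the public 30Music listening/playlists dataset demonstrate the efficacy of EmbeddingTree and the value of the visualization tool.
Abstract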
Embedding learning transforms discrete data entities into continuous numerical representations, encoding features/properties of the entities. Despite the outstanding performance reported from different embedding learning algorithms, few efforts were devoted to structurally interpreting how features are encoded in the learned embedding space. This work proposes EmbeddingTree, a hierarchical embedding exploration algorithm that relates the semantics of entity features with the less-interpretable embedding vectors. An interactive visualization tool is also developed based on EmbeddingTree to explore high-dimensional embeddings. The tool helps users discover nuance features of data entities, perform feature denoising/injecting in embedding training, and generate embeddings for unseen entities. We demonstrate the efficacy of EmbeddingTree and our visualization tool through embeddings generated for industry-scale merchant data and the public 30Music listening/playlists dataset.
Summary
Embedding learning transforms discrete data entities into continuous numerical representations that encode the entities' features/properties. Although embedding learning algorithms report outstanding performance, little effort has gone into structurally interpreting how features are encoded in the learned embedding space. This work proposes EmbeddingTree, a hierarchical embedding exploration algorithm that relates the semantics of entity features to the embedding vectors. We also develop an interactive visualization tool based on EmbeddingTree that helps users explore nuanced features in high-dimensional embeddings, perform feature denoising/injection during embedding training, and generate embeddings for unseen entities. We demonstrate the efficacy of EmbeddingTree and the visualization tool on industry-scale merchant data and the public 30Music listening/playlists dataset.
Investigation on Machine Learning Based Approaches for Estimating the Critical Temperature of Superconductors
results: Compared with previously available studies, the model shows promising performance, with an RMSE of 9.68 and an R2 score of 0.922.
Abstract
Superconductors have been among the most fascinating substances, as the fundamental concept of superconductivity as well as the correlation of critical temperature and superconductive materials have been the focus of extensive investigation since their discovery. However, superconductors at normal temperatures have yet to be identified. Additionally, there are still many unknown factors and gaps of understanding regarding this unique phenomenon, particularly the connection between superconductivity and the fundamental criteria to estimate the critical temperature. To bridge the gap, numerous machine learning techniques have been established to estimate critical temperatures as it is extremely challenging to determine. Furthermore, the need for a sophisticated and feasible method for determining the temperature range that goes beyond the scope of the standard empirical formula appears to be strongly emphasized by various machine-learning approaches. This paper uses a stacking machine learning approach to train itself on the complex characteristics of superconductive materials in order to accurately predict critical temperatures. In comparison to other previous accessible research investigations, this model demonstrated a promising performance with an RMSE of 9.68 and an R2 score of 0.922. The findings presented here could be a viable technique to shed new insight on the efficient implementation of the stacking ensemble method with hyperparameter optimization (HPO).
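Summary
Superconductors are among the most fascinating substances: the fundamental concept of superconductivity and the relationship between critical temperature and superconductive materials have been studied extensively for many years. However, superconductors at normal temperatures have yet to be identified. Many unknown factors and gaps in understanding remain, particularly regarding the connection between superconductivity and the fundamental criteria for estimating the critical temperature. To bridge these gaps, numerous machine learning techniques have been developed to estimate critical temperatures. There is also a need for a feasible, reliable method for determining the critical temperature range that goes beyond the scope of the standard empirical formula. This paper uses a stacking machine learning approach, trained on the characteristics of superconductive materials, to predict critical temperatures accurately. Compared with previously available studies, the model shows promising performance, with an RMSE of 9.68 and an R2 score of 0.922. These findings may help improve the implementation of the stacking ensemble method with hyperparameter optimization (HPO).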
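The two metrics quoted for this paper, RMSE and R2, can be computed as in the minimal sketch below. The temperature values are hypothetical and purely illustrative; they are not taken from the paper, and the paper's own evaluation code is not shown in this digest.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as the target (kelvin here)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical critical temperatures (K) and predictions, for illustration only.
tc_true = [4.2, 9.3, 39.0, 92.0, 133.0]
tc_pred = [6.0, 8.0, 35.0, 95.0, 128.0]
print(rmse(tc_true, tc_pred), r2_score(tc_true, tc_pred))
```

An R2 of 0.922 therefore means the model's squared error is about 7.8% of the variance of the measured critical temperatures.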
BRNES: Enabling Security and Privacy-aware Experience Sharing in Multiagent Robotic and Autonomous Systems
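results: Experiments show that, compared to state-of-the-art frameworks, BRNES reaches the goal faster and learns better under adversarial and inference attacks. Specifically, BRNES is 8.32x faster than non-private frameworks and 1.41x faster than private frameworks.
Abstract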
Although experience sharing (ES) accelerates multiagent reinforcement learning (MARL) in an advisor-advisee framework, attempts to apply ES to decentralized multiagent systems have so far relied on trusted environments and overlooked the possibility of adversarial manipulation and inference. Nevertheless, in a real-world setting, some Byzantine attackers, disguised as advisors, may provide false advice to the advisee and catastrophically degrade the overall learning performance. Also, an inference attacker, disguised as an advisee, may conduct several queries to infer the advisors' private information and make the entire ES process questionable in terms of privacy leakage. To address and tackle these issues, we propose a novel MARL framework (BRNES) that heuristically selects a dynamic neighbor zone for each advisee at each learning step and adopts a weighted experience aggregation technique to reduce Byzantine attack impact. Furthermore, to keep the agent's private information safe from adversarial inference attacks, we leverage the local differential privacy (LDP)-induced noise during the ES process. Our experiments show that our framework outperforms the state-of-the-art in terms of the steps to goal, obtained reward, and time to goal metrics. Particularly, our evaluation shows that the proposed framework is 8.32x faster than the current non-private frameworks and 1.41x faster than the private frameworks in an adversarial setting.
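Summary
Although experience sharing (ES) accelerates multiagent reinforcement learning (MARL) in an advisor-advisee framework, attempts to apply ES to decentralized multiagent systems have relied on trusted environments and overlooked the possibility of adversarial manipulation and inference. In real-world settings, Byzantine attackers disguised as advisors may give the advisee false advice and severely degrade overall learning performance. In addition, an inference attacker disguised as an advisee may issue multiple queries to infer the advisors' private information, raising privacy-leakage concerns for the entire ES process. To address these issues, we propose a novel MARL framework, BRNES, which selects a dynamic neighbor zone for each advisee at each learning step and uses weighted experience aggregation to reduce the impact of Byzantine attacks. To protect agents' private information from adversarial inference attacks, we apply local differential privacy (LDP)-induced noise during the ES process. Our experiments show that, in an adversarial setting, the framework is 8.32x faster than current non-private frameworks and 1.41x faster than private frameworks.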
A Probabilistic Approach to Self-Supervised Learning using Cyclical Stochastic Gradient MCMC
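results: Experiments show that marginalizing over the representations yields significant gains in performance, calibration, and out-of-distribution detection on a variety of downstream classification tasks. The proposed method is also effective at detecting out-of-distribution samples on the SVHN and CIFAR-10 datasets.
Abstract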
In this paper we present a practical Bayesian self-supervised learning method with Cyclical Stochastic Gradient Hamiltonian Monte Carlo (cSGHMC). Within this framework, we place a prior over the parameters of a self-supervised learning model and use cSGHMC to approximate the high dimensional and multimodal posterior distribution over the embeddings. By exploring an expressive posterior over the embeddings, Bayesian self-supervised learning produces interpretable and diverse representations. Marginalizing over these representations yields a significant gain in performance, calibration and out-of-distribution detection on a variety of downstream classification tasks. We provide experimental results on multiple classification tasks on four challenging datasets. Moreover, we demonstrate the effectiveness of the proposed method in out-of-distribution detection using the SVHN and CIFAR-10 datasets.
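Summary
In this paper we present a practical Bayesian self-supervised learning method based on Cyclical Stochastic Gradient Hamiltonian Monte Carlo (cSGHMC). Within this framework, we place a prior over the parameters of a self-supervised learning model and use cSGHMC to approximate the high-dimensional, multimodal posterior distribution over the embeddings. By exploring an expressive posterior over the embeddings, Bayesian self-supervised learning produces interpretable and diverse representations. Marginalizing over these representations yields significant gains in performance, calibration, and out-of-distribution detection on a variety of downstream classification tasks. We report experiments on multiple classification tasks across multiple datasets and demonstrate the method's effectiveness for out-of-distribution detection on the SVHN and CIFAR-10 datasets.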
Tirtha – An Automated Platform to Crowdsource Images and Create 3D Models of Heritage Sites
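methods: Creating 3D models of CH sites has become a popular method of digital preservation thanks to advances in computer vision and photogrammetry. However, the process is time-consuming and expensive, and typically requires specialized equipment and expertise, which is a particular challenge in resource-limited developing countries.
results: We propose Tirtha, a web platform that crowdsources images to generate 3D models of CH sites. Tirtha uses state-of-the-art Structure from Motion (SfM) and Multi-View Stereo (MVS) techniques. It is extensible and cost-effective, and can incorporate new techniques as photogrammetry advances. Tirtha is accessible through a web interface and can be deployed on-premise or in a cloud environment. Our case studies show that Tirtha can create 3D models of temples in Odisha, India, from crowdsourced images. These models can be viewed, interacted with, and downloaded on the Tirtha website. Our work aims to provide a dataset of crowdsourced images and 3D reconstructions for research in computer vision, heritage conservation, and related domains. Overall, Tirtha is a step towards democratizing digital preservation, primarily in resource-limited developing countries.
Abstract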
Digital preservation of Cultural Heritage (CH) sites is crucial to protect them against damage from natural disasters or human activities. Creating 3D models of CH sites has become a popular method of digital preservation thanks to advancements in computer vision and photogrammetry. However, the process is time-consuming, expensive, and typically requires specialized equipment and expertise, posing challenges in resource-limited developing countries. Additionally, the lack of an open repository for 3D models hinders research and public engagement with their heritage. To address these issues, we propose Tirtha, a web platform for crowdsourcing images of CH sites and creating their 3D models. Tirtha utilizes state-of-the-art Structure from Motion (SfM) and Multi-View Stereo (MVS) techniques. It is modular, extensible and cost-effective, allowing for the incorporation of new techniques as photogrammetry advances. Tirtha is accessible through a web interface at https://tirtha.niser.ac.in and can be deployed on-premise or in a cloud environment. In our case studies, we demonstrate the pipeline's effectiveness by creating 3D models of temples in Odisha, India, using crowdsourced images. These models are available for viewing, interaction, and download on the Tirtha website. Our work aims to provide a dataset of crowdsourced images and 3D reconstructions for research in computer vision, heritage conservation, and related domains. Overall, Tirtha is a step towards democratizing digital preservation, primarily in resource-limited developing countries.
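Summary
Digital preservation of Cultural Heritage (CH) sites is crucial for protecting them against damage from natural disasters or human activities. Creating 3D models of CH sites has become a popular method of digital preservation thanks to advances in computer vision and photogrammetry. However, the process is time-consuming and expensive, and typically requires specialized equipment and expertise, which poses challenges in resource-limited developing countries. In addition, the lack of an open repository for 3D models hinders research and public engagement with heritage. To address these issues, we propose Tirtha, a web platform for crowdsourcing images of CH sites and creating their 3D models. Tirtha uses state-of-the-art Structure from Motion (SfM) and Multi-View Stereo (MVS) techniques. It is extensible and cost-effective, and can incorporate new techniques as the field advances. Tirtha is accessible through a web interface and can be deployed on-premise or in a cloud environment. In our case studies, we create 3D models of temples in Odisha, India, from crowdsourced images. These models can be viewed, interacted with, and downloaded on the Tirtha website. Our work aims to provide a dataset of crowdsourced images and 3D reconstructions for research in computer vision, heritage conservation, and related domains. Overall, Tirtha is a step towards democratizing digital preservation, primarily in resource-limited developing countries.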
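results: Experimental and numerical data show that the method improves image reconstruction quality under challenging noise conditions and requires only a straightforward adjustment to the conventional approach.
Abstract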
Optical measurements often exhibit mixed Poisson-Gaussian noise statistics, which hampers image quality, particularly under low signal-to-noise ratio (SNR) conditions. Computational imaging falls short in such situations when solely Poissonian noise statistics are assumed. In response to this challenge, we define a loss function that explicitly incorporates this mixed noise nature. By using maximum-likelihood estimation, we devise a practical method to account for camera readout noise in gradient-based ptychography optimization. Our results, based on both experimental and numerical data, demonstrate that this approach outperforms the conventional one, enabling enhanced image reconstruction quality under challenging noise conditions through a straightforward methodological adjustment.
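Summary
Optical measurements often exhibit mixed Poisson-Gaussian noise statistics, which degrades image quality, particularly at low signal-to-noise ratio (SNR). Computational imaging falls short in such situations when only Poissonian noise statistics are assumed. To address this, we define a loss function that explicitly incorporates the mixed nature of the noise. Using maximum-likelihood estimation, we develop a practical method to account for camera readout noise in gradient-based ptychography optimization. Our results, based on both experimental and numerical data, show that this approach outperforms the conventional one and improves image reconstruction quality under challenging noise conditions.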
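To make the idea of a loss that accounts for readout noise concrete, the sketch below uses the shifted-Poisson approximation, a standard way to fold Gaussian read noise of variance sigma^2 into a Poisson likelihood by treating `y + sigma^2` as Poisson with mean `I + sigma^2`. This is an assumption chosen for illustration; the abstract does not state the exact form of the paper's loss, and the function and parameter names are hypothetical.

```python
import math


def shifted_poisson_nll(intensity, measured, read_noise_var):
    """Negative log-likelihood (up to a constant) under the shifted-Poisson
    approximation of mixed Poisson-Gaussian noise.

    intensity      : model-predicted intensities I (must be positive)
    measured       : measured counts y (may be slightly negative after offset removal)
    read_noise_var : Gaussian readout-noise variance sigma^2, in counts^2
    """
    total = 0.0
    for i_model, y in zip(intensity, measured):
        lam = i_model + read_noise_var
        # Clip so a strongly negative noisy pixel cannot flip the sign of the log term.
        k = max(y + read_noise_var, 0.0)
        total += lam - k * math.log(lam)
    return total


# With read_noise_var = 0 this reduces to the usual Poisson loss sum(I - y * log I).
measured = [5.0, 0.5, 20.0]
print(shifted_poisson_nll(measured, measured, read_noise_var=2.0))
```

Each term is minimized when the model intensity equals the measurement, and the added variance keeps the logarithm well behaved for dim pixels, which is where a purely Poissonian loss tends to struggle.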
Focus on Content not Noise: Improving Image Generation for Nuclei Segmentation by Suppressing Steganography in CycleGAN
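results: Experiments show that removing high-frequency information from generated images with low-pass filtering increases the coherence between generated images and cycled masks and improves performance on the nuclei segmentation task.
Abstract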
Annotating nuclei in microscopy images for the training of neural networks is a laborious task that requires expert knowledge and suffers from inter- and intra-rater variability, especially in fluorescence microscopy. Generative networks such as CycleGAN can inverse the process and generate synthetic microscopy images for a given mask, thereby building a synthetic dataset. However, past works report content inconsistencies between the mask and generated image, partially due to CycleGAN minimizing its loss by hiding shortcut information for the image reconstruction in high frequencies rather than encoding the desired image content and learning the target task. In this work, we propose to remove the hidden shortcut information, called steganography, from generated images by employing a low pass filtering based on the DCT. We show that this increases coherence between generated images and cycled masks and evaluate synthetic datasets on a downstream nuclei segmentation task. Here we achieve an improvement of 5.4 percentage points in the F1-score compared to a vanilla CycleGAN. Integrating advanced regularization techniques into the CycleGAN architecture may help mitigate steganography-related issues and produce more accurate synthetic datasets for nuclei segmentation.
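Summary
Annotating nuclei in microscopy images is a laborious task that requires expert knowledge and suffers from inter- and intra-rater variability, especially in fluorescence microscopy. Generative networks such as CycleGAN can invert the process and generate synthetic microscopy images for a given mask, thereby building a synthetic dataset. However, past work reports content inconsistencies between the mask and the generated image, partly because CycleGAN minimizes its loss by hiding shortcut information for image reconstruction in high frequencies rather than encoding the desired image content and learning the target task. In this work, we propose to remove this hidden shortcut information from generated images with DCT-based low-pass filtering. We find that this increases coherence between generated images and cycled masks, and we evaluate the synthetic datasets on a downstream nuclei segmentation task, where we achieve an improvement of 5.4 percentage points in F1-score over a vanilla CycleGAN. Integrating advanced regularization techniques into the CycleGAN architecture may help mitigate steganography-related issues and produce more accurate synthetic datasets for nuclei segmentation.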
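DCT-based low-pass filtering of the kind described above can be sketched as follows: transform the image with a 2D DCT, zero the high-frequency coefficients, and transform back. The cutoff rule (keep a square block of the lowest `keep` frequencies) and all function names are assumptions for illustration; the abstract does not specify the paper's filter design. The naive transform is written in pure Python for clarity, not speed.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    return [
        math.sqrt((1.0 if k == 0 else 2.0) / n)
        * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def idct_1d(coeffs):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n)
            * coeffs[k]
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n)
        )
        for i in range(n)
    ]


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def apply_2d(image, transform):
    """Apply a separable 1D transform along rows, then along columns."""
    rows = [transform(row) for row in image]
    cols = [transform(col) for col in transpose(rows)]
    return transpose(cols)


def dct_lowpass(image, keep):
    """Keep only DCT coefficients with both frequency indices below `keep`."""
    coeffs = apply_2d(image, dct_1d)
    for u, row in enumerate(coeffs):
        for v in range(len(row)):
            if u >= keep or v >= keep:
                row[v] = 0.0
    return apply_2d(coeffs, idct_1d)


# A smooth (constant) image survives filtering; a pixel-level checkerboard does not.
flat = [[3.0] * 4 for _ in range(4)]
checker = [[(-1.0) ** (r + c) for c in range(4)] for r in range(4)]
print(dct_lowpass(flat, keep=1)[0])
print(dct_lowpass(checker, keep=1)[0])
```

The checkerboard stands in for a high-frequency pattern in which a generator could hide information: its energy sits entirely in the discarded coefficients, so the filtered output is numerically zero.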
NuInsSeg: A Fully Annotated Dataset for Nuclei Instance Segmentation in H&E-Stained Histological Images
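results: This study provides a large manually annotated dataset (NuInsSeg) containing 665 image patches with more than 30,000 manually segmented nuclei, along with ambiguous area masks.
Abstract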
In computational pathology, automatic nuclei instance segmentation plays an essential role in whole slide image analysis. While many computerized approaches have been proposed for this task, supervised deep learning (DL) methods have shown superior segmentation performances compared to classical machine learning and image processing techniques. However, these models need fully annotated datasets for training which is challenging to acquire, especially in the medical domain. In this work, we release one of the biggest fully manually annotated datasets of nuclei in Hematoxylin and Eosin (H&E)-stained histological images, called NuInsSeg. This dataset contains 665 image patches with more than 30,000 manually segmented nuclei from 31 human and mouse organs. Moreover, for the first time, we provide additional ambiguous area masks for the entire dataset. These vague areas represent the parts of the images where precise and deterministic manual annotations are impossible, even for human experts. The dataset and detailed step-by-step instructions to generate related segmentation masks are publicly available at https://www.kaggle.com/datasets/ipateam/nuinsseg and https://github.com/masih4/NuInsSeg, respectively.
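Summary
In computational pathology, automatic nuclei instance segmentation plays a key role in whole slide image analysis. Although many computerized approaches have been proposed for this task, supervised deep learning (DL) methods have shown superior segmentation performance. However, these models need fully annotated datasets for training, which are difficult to acquire, especially in the medical domain. In this work, we release NuInsSeg, a fully manually annotated dataset containing 665 image patches with more than 30,000 manually segmented nuclei. The dataset covers H&E-stained images from 31 human and mouse organs. In addition, for the first time, we provide ambiguous area masks for the entire dataset. These ambiguous areas mark the parts of the images where precise and deterministic manual annotation is impossible, even for human experts. The dataset and detailed step-by-step instructions for generating the related segmentation masks are publicly available at https://www.kaggle.com/datasets/ipateam/nuinsseg and https://github.com/masih4/NuInsSeg, respectively.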
Reference-Free Isotropic 3D EM Reconstruction using Diffusion Models
for: overcome the limitations of anisotropic axial resolution in Electron Microscopy (EM) images
methods: utilize 2D diffusion models for consistent 3D volume reconstruction, well-suited for highly downsampled data
results: Experiments show the superiority of leveraging the generative prior over supervised learning methods, with higher-quality reconstructions on highly downsampled data, and demonstrate the feasibility of self-supervised reconstruction without any training data.
Abstract
Electron microscopy (EM) images exhibit anisotropic axial resolution due to the characteristics inherent to the imaging modality, presenting challenges in analysis and downstream tasks. In this paper, we propose a diffusion-model-based framework that overcomes the limitations of requiring reference data or prior knowledge about the degradation process. Our approach utilizes 2D diffusion models to consistently reconstruct 3D volumes and is well-suited for highly downsampled data. Extensive experiments conducted on two public datasets demonstrate the robustness and superiority of leveraging the generative prior compared to supervised learning methods. Additionally, we demonstrate our method's feasibility for self-supervised reconstruction, which can restore a single anisotropic volume without any training data.
DMDC: Dynamic-mask-based dual camera design for snapshot Hyperspectral Imaging
results: Extensive experiments on multiple datasets show an improvement of more than 9 dB in PSNR over the SOTA.
Abstract
Deep learning methods are developing rapidly in coded aperture snapshot spectral imaging (CASSI). The number of parameters and FLOPs of existing state-of-the-art methods (SOTA) continues to increase, but the reconstruction accuracy improves slowly. Current methods still face two problems: 1) The performance of the spatial light modulator (SLM) is not fully developed due to the limitation of fixed Mask coding. 2) The single input limits the network performance. In this paper we present a dynamic-mask-based dual camera system, which consists of an RGB camera and a CASSI system running in parallel. First, the system learns the spatial feature distribution of the scene based on the RGB images, then instructs the SLM to encode each scene, and finally sends both RGB and CASSI images to the network for reconstruction. We further designed the DMDC-net, which consists of two separate networks, a small-scale CNN-based dynamic mask network for dynamic adjustment of the mask and a multimodal reconstruction network for reconstruction using RGB and CASSI measurements. Extensive experiments on multiple datasets show that our method achieves more than 9 dB improvement in PSNR over the SOTA. (https://github.com/caizeyu1992/DMDC)
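Summary
Deep learning methods are developing rapidly in coded aperture snapshot spectral imaging (CASSI). The number of parameters and FLOPs of existing state-of-the-art (SOTA) methods continues to increase, but reconstruction accuracy improves only slowly. Current methods still face two problems: 1) the performance of the spatial light modulator (SLM) is not fully exploited because of the limitation of fixed mask coding; 2) the single input limits network performance. In this paper we present a dynamic-mask-based dual camera system, which consists of an RGB camera and a CASSI system running in parallel. The system first learns the spatial feature distribution of the scene from the RGB images, then instructs the SLM to encode each scene, and finally sends both RGB and CASSI images to the network for reconstruction. We further design DMDC-net, which consists of two separate networks: a small-scale CNN-based dynamic mask network for dynamic adjustment of the mask, and a multimodal reconstruction network that reconstructs from RGB and CASSI measurements. Extensive experiments on multiple datasets show that our method achieves more than 9 dB improvement in PSNR over the SOTA. (https://github.com/caizeyu1992/DMDC)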
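To put the reported PSNR gain in perspective, the sketch below computes PSNR from its standard definition and converts a gain in decibels into the corresponding reduction in mean squared error. The sample pixel values and the assumption that intensities are normalized to [0, 1] are illustrative and not taken from the paper.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized flat pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


print(psnr([0.0, 0.0], [0.1, 0.1]))   # uniform error of 0.1 on a [0, 1] scale
print(mse_ratio_for_gain(9.0))        # MSE reduction implied by a 9 dB gain
```

Because PSNR is logarithmic, a 9 dB gain corresponds to roughly an eightfold reduction in mean squared error at the same peak value.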
Numerical Uncertainty of Convolutional Neural Networks Inference for Structural Brain MRI Analysis
paper_authors: Inés Gonzalez Pepe, Vinuyan Sivakolunthu, Hae Lang Park, Yohan Chatelain, Tristan Glatard
for: This paper investigates the numerical uncertainty of Convolutional Neural Networks (CNNs) inference for structural brain MRI analysis.
methods: This paper applies Random Rounding, a stochastic arithmetic technique, to CNN models employed in non-linear registration (SynthMorph) and whole-brain segmentation (FastSurfer), and compares the resulting numerical uncertainty to the one measured in a reference image-processing pipeline (FreeSurfer recon-all).
results: Results obtained on 32 representative subjects show that CNN predictions are substantially more accurate numerically than traditional image-processing results (non-linear registration: 19 vs 13 significant bits on average; whole-brain segmentation: 0.99 vs 0.92 Sørensen-Dice score on average), which suggests a better reproducibility of CNN results across execution environments.
Abstract
This paper investigates the numerical uncertainty of Convolutional Neural Networks (CNNs) inference for structural brain MRI analysis. It applies Random Rounding -- a stochastic arithmetic technique -- to CNN models employed in non-linear registration (SynthMorph) and whole-brain segmentation (FastSurfer), and compares the resulting numerical uncertainty to the one measured in a reference image-processing pipeline (FreeSurfer recon-all). Results obtained on 32 representative subjects show that CNN predictions are substantially more accurate numerically than traditional image-processing results (non-linear registration: 19 vs 13 significant bits on average; whole-brain segmentation: 0.99 vs 0.92 Sørensen-Dice score on average), which suggests a better reproducibility of CNN results across execution environments.
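The two quantities reported above can be estimated from repeated perturbed runs roughly as in the sketch below. The significant-bits estimate uses the common formula s = -log2(sigma / |mu|) over samples of the same output value; that this matches the paper's exact estimator is an assumption, as are the cap of 23 bits (the explicit float32 mantissa width) and the function names.

```python
import math
import statistics


def significant_bits(samples, max_bits=23):
    """Estimate significant bits of a value from repeated perturbed runs,
    using s = -log2(sigma / |mu|), clamped to [0, max_bits]."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    if sigma == 0:
        return float(max_bits)   # all runs agree to the available precision
    if mu == 0:
        return 0.0               # relative precision is undefined around zero
    return min(max(-math.log2(sigma / abs(mu)), 0.0), float(max_bits))


def dice(labels_a, labels_b):
    """Sørensen-Dice score between two sets of voxel indices."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Three hypothetical runs of the same output that differ around the 10th bit.
runs = [1.0, 1.0 + 2 ** -10, 1.0 - 2 ** -10]
print(significant_bits(runs))
print(dice({1, 2, 3}, {2, 3, 4}))
```

Under this reading, 19 versus 13 significant bits means the CNN outputs vary across perturbed runs by roughly 2^6 = 64 times less, in relative terms, than the reference pipeline's outputs.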
TDMD: A Database for Dynamic Color Mesh Subjective and Objective Quality Explorations
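results: The study finds that different types of distortion affect human perception of DCMs differently. The paper also evaluates three types of state-of-the-art objective metrics (image-based, point-based, and video-based) and offers recommendations for selecting metrics in practical applications.
Abstract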
Dynamic colored meshes (DCM) are widely used in various applications; however, these meshes may undergo different processes, such as compression or transmission, which can distort them and degrade their quality. To facilitate the development of objective metrics for DCMs and study the influence of typical distortions on their perception, we create the Tencent - dynamic colored mesh database (TDMD) containing eight reference DCM objects with six typical distortions. Using processed video sequences (PVS) derived from the DCM, we have conducted a large-scale subjective experiment that resulted in 303 distorted DCM samples with mean opinion scores, making the TDMD the largest available DCM database to our knowledge. This database enabled us to study the impact of different types of distortion on human perception and offer recommendations for DCM compression and related tasks. Additionally, we have evaluated three types of state-of-the-art objective metrics on the TDMD, including image-based, point-based, and video-based metrics. Our experimental results highlight the strengths and weaknesses of each metric, and we provide suggestions about the selection of metrics in practical DCM applications. The TDMD will be made publicly available at the following location: https://multimedia.tencent.com/resources/tdmd.
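Summary
Dynamic colored meshes (DCM) are widely used in various applications, but they may undergo processes such as compression or transmission that distort them and degrade their quality. To facilitate the development of objective metrics for DCMs and to study how typical distortions influence human perception, we create the Tencent dynamic colored mesh database (TDMD), which contains eight reference DCM objects with six typical distortions. Using processed video sequences (PVS) derived from the DCMs, we conducted a large-scale subjective experiment that yielded 303 distorted DCM samples with mean opinion scores, making the TDMD the largest DCM database available to our knowledge. The database allows us to study the impact of different types of distortion on human perception and to offer recommendations for DCM compression and related tasks. We also evaluate three types of state-of-the-art objective metrics on the TDMD: image-based, point-based, and video-based. Our experimental results show the strengths and weaknesses of each metric, and we provide suggestions for selecting metrics in practical applications. The TDMD will be made publicly available at https://multimedia.tencent.com/resources/tdmd.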
Estimation of motion blur kernel parameters using regression convolutional neural networks
paper_authors: Luis G. Varela, Laura E. Boucheron, Steven Sandoval, David Voelz, Abu Bucker Siddik
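for: This study aims to estimate the blur kernel parameters of images degraded by linear motion blur.
methods: The study uses neural networks in a regression setting to predict the parameters of linear motion blur kernels.
results: The study analyzes the relationship between the length and the angle of linear motion blur, which provides a foundation for regression-based prediction of kernel parameters in uniform motion blur images.
Abstract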
Many deblurring and blur kernel estimation methods use MAP or classification deep learning techniques to sharpen an image and predict the blur kernel. We propose a regression approach using neural networks to predict the parameters of linear motion blur kernels. These kernels can be parameterized by the length and the orientation of the blur. This paper analyzes the relationship between the length and angle of linear motion blur. This analysis helps establish a foundation for using regression prediction in uniform motion blur images.
Summary
Many deblurring and blur kernel estimation methods use MAP or classification-based deep learning techniques to sharpen an image and predict the blur kernel. We propose a regression approach that uses neural networks to predict the parameters of linear motion blur kernels. These kernels can be parameterized by the length and the orientation of the blur. This paper analyzes the relationship between the length and angle of linear motion blur, to establish a foundation for using regression prediction on uniform motion blur images.
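The two-parameter kernel described above (blur length and orientation) can be illustrated with the sketch below, which rasterizes a line segment through the kernel center and normalizes it to sum to 1. The rasterization scheme, the default kernel size, and the function name are assumptions made for illustration; the paper's own kernel generator is not described in the abstract.

```python
import math


def motion_blur_kernel(length, angle_deg, size=None):
    """Build a normalized linear motion blur kernel.

    length    : blur length in pixels (>= 1)
    angle_deg : blur orientation, counter-clockwise from the horizontal axis
    size      : kernel side length; defaults to the smallest odd size >= length
    """
    if size is None:
        size = length if length % 2 == 1 else length + 1
    center = size // 2
    theta = math.radians(angle_deg)
    kernel = [[0.0] * size for _ in range(size)]

    # Sample densely along the segment and accumulate into the nearest pixel.
    samples = 4 * length
    for s in range(samples):
        t = (s / (samples - 1) - 0.5) * (length - 1)
        col = int(round(center + t * math.cos(theta)))
        row = int(round(center - t * math.sin(theta)))  # image rows grow downward
        if 0 <= row < size and 0 <= col < size:
            kernel[row][col] += 1.0

    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


for row in motion_blur_kernel(length=5, angle_deg=0):
    print(["%.2f" % v for v in row])
```

A regression network of the kind proposed would take a blurred image as input and output the two numbers (`length`, `angle_deg`) that such a generator consumes, rather than choosing among a fixed set of kernel classes.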
ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders
paper_authors: Shawn Xu, Lin Yang, Christopher Kelly, Marcin Sieniek, Timo Kohlberger, Martin Ma, Wei-Hung Weng, Attila Kiraly, Sahar Kazemzadeh, Zakkai Melamed, Jungyeon Park, Patricia Strachan, Yun Liu, Chuck Lau, Preeti Singh, Christina Chen, Mozziyar Etemadi, Sreenivasa Raju Kalidindi, Yossi Matias, Katherine Chou, Greg S. Corrado, Shravya Shetty, Daniel Tse, Shruthi Prabhakara, Daniel Golden, Rory Pilgrim, Krish Eswaran, Andrew Sellergren
for: The paper is written for the task of developing a lightweight adapter architecture for chest X-ray (CXR) classification and vision-language tasks using a fixed language model (PaLM 2) and a language-aligned image encoder.
methods: The paper uses a combination of a language-aligned image encoder and a fixed large language model (PaLM 2) to perform zero-shot CXR classification, data-efficient CXR classification, and semantic search tasks. The authors train the adapter architecture using images paired with corresponding free-text radiology reports from the MIMIC-CXR dataset.
results: The paper achieves state-of-the-art performance on zero-shot CXR classification (mean AUC of 0.850 across 13 findings), data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings for 1% and 10% training data), and semantic search (0.76 normalized discounted cumulative gain across nineteen queries, including perfect retrieval on twelve of them). The authors also demonstrate the promise of ELIXR on CXR vision-language tasks, achieving overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance tasks, respectively.
Abstract
Our approach, which we call Embeddings for Language/Image-aligned X-Rays, or ELIXR, leverages a language-aligned image encoder combined or grafted onto a fixed LLM, PaLM 2, to perform a broad range of tasks. We train this lightweight adapter architecture using images paired with corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieved state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings), data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) for 1% (~2,200 images) and 10% (~22,000 images) training data), and semantic search (0.76 normalized discounted cumulative gain (NDCG) across nineteen queries, including perfect retrieval on twelve of them). Compared to existing data-efficient methods including supervised contrastive learning (SupCon), ELIXR required two orders of magnitude less data to reach similar performance. ELIXR also showed promise on CXR vision-language tasks, demonstrating overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance tasks, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI.
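Summary
Our approach, which we call Embeddings for Language/Image-aligned X-Rays (ELIXR), combines a language-aligned image encoder with a fixed LLM (PaLM 2) to perform a broad range of tasks. We train this lightweight adapter architecture using images paired with the corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieves state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings) and on data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) for 1% (about 2,200 images) and 10% (about 22,000 images) of the training data). ELIXR also shows promise on CXR vision-language tasks, with overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI.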
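The semantic search result above is reported as normalized discounted cumulative gain (NDCG). The sketch below computes NDCG for a single ranked list using the standard log2 discount; the relevance grades are hypothetical, and the paper's exact grading scheme and cutoff are not given in the abstract.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades for the top results of one query.
print(ndcg([1, 1, 0, 0]))  # relevant items ranked first
print(ndcg([0, 1]))        # the only relevant item is ranked second
```

"Perfect retrieval" on a query corresponds to an NDCG of 1 for that query, meaning no less relevant image is ranked above a more relevant one.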
A vision transformer-based framework for knowledge transfer from multi-modal to mono-modal lymphoma subtyping models
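results: Our experimental study on a dataset of 157 patients shows that the mono-modal classification model performs well, outperforming six recent state-of-the-art methods. In addition, an estimated power-law curve indicates that the model needs only a reasonable number of additional patients for training to potentially reach the same diagnostic accuracy as IHC technologies.
Abstract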
Determining lymphoma subtypes is a crucial step for better-targeted patient treatment, which can potentially increase survival chances. In this context, the existing gold standard diagnosis method, which is based on gene expression technology, is highly expensive and time-consuming, which limits its accessibility. Although alternative diagnosis methods based on IHC (immunohistochemistry) technologies exist (recommended by the WHO), they still suffer from similar limitations and are less accurate. WSI (Whole Slide Image) analysis by deep learning models has shown promising new directions for cancer diagnosis that would be cheaper and faster than existing alternative methods. In this work, we propose a vision transformer-based framework for distinguishing DLBCL (Diffuse Large B-Cell Lymphoma) cancer subtypes from high-resolution WSIs. To this end, we propose a multi-modal architecture to train a classifier model from various WSI modalities. We then exploit this model through a knowledge distillation mechanism to efficiently drive the learning of a mono-modal classifier. Our experimental study conducted on a dataset of 157 patients shows the promising performance of our mono-modal classification model, which outperforms six recent state-of-the-art methods dedicated to cancer classification. Moreover, the power-law curve, estimated on our experimental data, shows that our classification model requires a reasonable number of additional patients for its training to potentially reach the same diagnosis accuracy as IHC technologies.
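Summary
Determining the lymphoma subtype is a key diagnostic step for targeting patient treatment and potentially improving survival. The current gold-standard diagnostic method, based on gene expression technology, is expensive and time-consuming, which limits its accessibility. Alternative methods based on IHC (immunohistochemistry) exist, but they suffer from similar limitations and are less accurate. Deep learning analysis of WSIs (whole slide images) points to diagnosis that could be cheaper and faster. In this work, we propose a vision transformer-based framework for distinguishing DLBCL (Diffuse Large B-Cell Lymphoma) subtypes from high-resolution WSIs. To this end, we propose a multi-modal architecture that trains a classifier from various WSI modalities, and then use knowledge distillation from this model to efficiently drive the learning of a mono-modal classifier. Experiments on a dataset of 157 patients show promising performance for the mono-modal classifier, which outperforms six recent state-of-the-art methods for cancer classification. A power-law curve estimated on the experimental data shows that the classifier needs a reasonable number of additional training patients to potentially reach the same diagnostic accuracy as IHC technologies.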
Incorporating Season and Solar Specificity into Renderings made by a NeRF Architecture using Satellite Images
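results: The method accurately renders novel views of a scene, generates height maps, predicts shadows, and renders seasonal features independently of shadows.
Abstract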
As a result of Shadow NeRF and Sat-NeRF, it is possible to take the solar angle into account in a NeRF-based framework for rendering a scene from a novel viewpoint using satellite images for training. Our work extends those contributions and shows how one can make the renderings season-specific. Our main challenge was creating a Neural Radiance Field (NeRF) that could render seasonal features independently of viewing angle and solar angle while still being able to render shadows. We teach our network to render seasonal features by introducing one more input variable -- time of the year. However, the small training datasets typical of satellite imagery can introduce ambiguities in cases where shadows are present in the same location for every image of a particular season. We add additional terms to the loss function to discourage the network from using seasonal features for accounting for shadows. We show the performance of our network on eight Areas of Interest containing images captured by the Maxar WorldView-3 satellite. This evaluation includes tests measuring the ability of our framework to accurately render novel views, generate height maps, predict shadows, and specify seasonal features independently from shadows. Our ablation studies justify the choices made for network design parameters.
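Summary
Building on Shadow NeRF and Sat-NeRF, a NeRF-based framework trained on satellite images can take the solar angle into account when rendering a scene from a novel viewpoint. Our work extends those contributions and shows how to make the renderings season-specific. Our main challenge was to create a Neural Radiance Field (NeRF) that renders seasonal features independently of viewing angle and solar angle while still rendering shadows. We teach the network to render seasonal features by introducing one more input variable, the time of the year. However, the small training sets typical of satellite imagery can introduce ambiguity when shadows fall in the same location in every image of a given season. We add terms to the loss function to discourage the network from using seasonal features to account for shadows. We evaluate the network on eight Areas of Interest with images captured by the Maxar WorldView-3 satellite, testing novel-view rendering, height-map generation, shadow prediction, and the specification of seasonal features independently of shadows. Our ablation studies justify the chosen network design parameters.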
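methods: The paper proposes music de-limiter networks built on the principle of sample-wise gain inversion (SGI). A limiter applies sample-wise gain reduction to a signal, and SGI estimates the inverse of that gain to recover the uncompressed music from the heavily compressed signal.
results: The proposed de-limiter network reconstructs the music well, reaching an SI-SDR of 23.8 dB. The authors also release a large training set (musdb-XL-train) for training real-world friendly de-limiter networks.
Abstract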
The loudness war, an ongoing phenomenon in the music industry characterized by the increasing final loudness of music while reducing its dynamic range, has been a controversial topic for decades. Music mastering engineers have used limiters to heavily compress and make music louder, which can induce ear fatigue and hearing loss in listeners. In this paper, we introduce music de-limiter networks that estimate uncompressed music from heavily compressed signals. Inspired by the principle of a limiter, which performs sample-wise gain reduction of a given signal, we propose the framework of sample-wise gain inversion (SGI). We also present the musdb-XL-train dataset, consisting of 300k segments created by applying a commercial limiter plug-in for training real-world friendly de-limiter networks. Our proposed de-limiter network achieves excellent performance with a scale-invariant source-to-distortion ratio (SI-SDR) of 23.8 dB in reconstructing musdb-HQ from musdb-XL data, a limiter-applied version of musdb-HQ. The training data, code, and model weights are available in our repository (https://github.com/jeonchangbin49/De-limiter).
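Summary
The loudness war, an ongoing phenomenon in the music industry in which the final loudness of music keeps increasing while its dynamic range shrinks, has been controversial for decades. Mastering engineers use limiters to compress music heavily and make it louder, which can cause ear fatigue and hearing loss in listeners. In this paper, we introduce music de-limiter networks that estimate uncompressed music from heavily compressed signals. Inspired by the principle of a limiter, which performs sample-wise gain reduction, we propose the framework of sample-wise gain inversion (SGI). We also present the musdb-XL-train dataset, which consists of 300k segments created by applying a commercial limiter plug-in, for training real-world friendly de-limiter networks. The proposed de-limiter network achieves an SI-SDR of 23.8 dB when reconstructing musdb-HQ from musdb-XL data. The training data, code, and model weights are available in our repository (https://github.com/jeonchangbin49/De-limiter).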
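To make the limiter principle and the SI-SDR metric concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: `limit` is a hypothetical toy hard limiter (the paper uses a commercial limiter plug-in), and `invert_gain` is handed the true per-sample gains, whereas a de-limiter network would have to estimate them from the compressed signal alone.

```python
import math

def limit(signal, threshold=0.5):
    """Toy hard limiter (assumption): scale any sample whose magnitude
    exceeds `threshold` down to the threshold. Returns (limited, gains)."""
    gains = [min(1.0, threshold / abs(x)) if x != 0.0 else 1.0 for x in signal]
    limited = [g * x for g, x in zip(gains, signal)]
    return limited, gains

def invert_gain(limited, gains):
    """Sample-wise gain inversion: divide each sample by its gain.
    Here the gains are known; a network would predict them."""
    return [y / g for y, g in zip(limited, gains)]

def si_sdr(reference, estimate):
    """Scale-invariant source-to-distortion ratio in dB."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy                      # optimal scaling of the reference
    target = [alpha * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    if noise_energy == 0.0:
        return math.inf                           # perfect reconstruction
    return 10.0 * math.log10(target_energy / noise_energy)

signal = [0.2, -0.9, 0.4, 1.0]
limited, gains = limit(signal)
restored = invert_gain(limited, gains)
```

With these toy values, `si_sdr(signal, restored)` is far higher than `si_sdr(signal, limited)`, which is the gap a de-limiter tries to close when the gains are not known in advance.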
Inaudible Adversarial Perturbation: Manipulating the Recognition of User Speech in Real Time
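paper_authors: Xinfeng Li, Chen Yan, Xuancun Lu, Zihan Zeng, Xiaoyu Ji, Wenyuan Xu
for: The paper aims to extend existing attacks to practical user-present scenarios.
methods: The paper proposes an inaudible adversarial perturbation (IAP) attack on speech recognition systems, which manipulates the ASR output by delivering ultrasound while the user speaks.
results: Experiments show that VRIFLE achieves effective real-time manipulation under various configurations and defenses, and remains effective under user disruption.
Abstract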
Automatic speech recognition (ASR) systems have been shown to be vulnerable to adversarial examples (AEs). Recent success all assumes that users will not notice or disrupt the attack process despite the existence of music/noise-like sounds and spontaneous responses from voice assistants. Nonetheless, in practical user-present scenarios, user awareness may nullify existing attack attempts that launch unexpected sounds or ASR usage. In this paper, we seek to bridge the gap in existing research and extend the attack to user-present scenarios. We propose VRIFLE, an inaudible adversarial perturbation (IAP) attack via ultrasound delivery that can manipulate ASRs as a user speaks. The inherent differences between audible sounds and ultrasounds make IAP delivery face unprecedented challenges such as distortion, noise, and instability. In this regard, we design a novel ultrasonic transformation model to enhance the crafted perturbation to be physically effective and even survive long-distance delivery. We further enable VRIFLE's robustness by adopting a series of augmentation on user and real-world variations during the generation process. In this way, VRIFLE features an effective real-time manipulation of the ASR output from different distances and under any speech of users, with an alter-and-mute strategy that suppresses the impact of user disruption. Our extensive experiments in both digital and physical worlds verify VRIFLE's effectiveness under various configurations, robustness against six kinds of defenses, and universality in a targeted manner. We also show that VRIFLE can be delivered with a portable attack device and even everyday-life loudspeakers.
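Summary
Automatic speech recognition (ASR) systems have been shown to be vulnerable to adversarial examples (AEs). Recent successful attacks all assume that users will not notice or disrupt the attack despite music- or noise-like sounds and spontaneous responses from voice assistants. In practical user-present scenarios, however, user awareness may nullify such attacks, so this paper extends the attack to user-present scenarios. We propose VRIFLE, an inaudible adversarial perturbation (IAP) attack delivered via ultrasound that manipulates ASRs as a user speaks. The inherent differences between audible sound and ultrasound create unprecedented challenges for IAP delivery, such as distortion, noise, and instability. We therefore design a novel ultrasonic transformation model that makes the crafted perturbation physically effective, even over long distances. We further improve VRIFLE's robustness by augmenting over user and real-world variations during generation. As a result, VRIFLE manipulates the ASR output in real time from different distances and under any user speech, with an alter-and-mute strategy that suppresses the impact of user disruption. Extensive experiments in the digital and physical worlds verify VRIFLE's effectiveness under various configurations, its robustness against six kinds of defenses, and its universality in a targeted manner. We also show that VRIFLE can be delivered with a portable attack device and even everyday loudspeakers.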
SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis
results: The proposed approach outperforms the FastSpeech2 baseline on both objective and subjective evaluation measures.
Abstract
While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditional inputs, it still leaves scope for richer representations. As a part of this work, we leverage representations from various Self-Supervised Learning (SSL) models to enhance the quality of the synthesized speech. In particular, we pass the FastSpeech2 encoder's length-regulated outputs through a series of encoder layers with the objective of reconstructing the SSL representations. In the SALTTS-parallel implementation, the representations from this second encoder are used for an auxiliary reconstruction loss with the SSL features. The SALTTS-cascade implementation, however, passes these representations through the decoder in addition to having the reconstruction loss. The richness of speech characteristics from the SSL features reflects in the output speech quality, with the objective and subjective evaluation measures of the proposed approach outperforming the baseline FastSpeech2.
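Summary
While FastSpeech2 integrates aspects of speech such as pitch, energy, and duration as conditional inputs, it still leaves room for richer representations. In this work, we use representations from various self-supervised learning (SSL) models to improve the quality of the synthesized speech. Specifically, we pass the length-regulated outputs of the FastSpeech2 encoder through a series of encoder layers whose objective is to reconstruct the SSL representations. In the SALTTS-parallel implementation, the representations from this second encoder are used only for an auxiliary reconstruction loss against the SSL features. In the SALTTS-cascade implementation, these representations are also passed through the decoder, in addition to the reconstruction loss. The richness of the speech characteristics in the SSL features is reflected in the output speech quality, and the proposed approach outperforms the FastSpeech2 baseline on objective and subjective evaluation measures.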
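results: Experiments on multiple medical image datasets show that the network achieves high segmentation performance while being faster, lighter, and computationally cheaper than existing networks.
Abstract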
The U-shaped architecture has emerged as a crucial paradigm in the design of medical image segmentation networks. However, due to the inherent local limitations of convolution, a fully convolutional segmentation network with U-shaped architecture struggles to effectively extract global context information, which is vital for the precise localization of lesions. While hybrid architectures combining CNNs and Transformers can address these issues, their application in real medical scenarios is limited due to the computational resource constraints imposed by the environment and edge devices. In addition, the convolutional inductive bias in lightweight networks adeptly fits the scarce medical data, which is lacking in the Transformer based network. In order to extract global context information while taking advantage of the inductive bias, we propose CMUNeXt, an efficient fully convolutional lightweight medical image segmentation network, which enables fast and accurate auxiliary diagnosis in real scene scenarios. CMUNeXt leverages large kernel and inverted bottleneck design to thoroughly mix distant spatial and location information, efficiently extracting global context information. We also introduce the Skip-Fusion block, designed to enable smooth skip-connections and ensure ample feature fusion. Experimental results on multiple medical image datasets demonstrate that CMUNeXt outperforms existing heavyweight and lightweight medical image segmentation networks in terms of segmentation performance, while offering a faster inference speed, lighter weights, and a reduced computational cost. The code is available at https://github.com/FengheTan9/CMUNeXt.
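Summary
The U-shaped architecture has become a key paradigm in the design of medical image segmentation networks. However, because convolution is inherently local, a fully convolutional U-shaped segmentation network struggles to extract global context information, which is vital for precisely localizing lesions. Hybrid architectures that combine CNNs and Transformers can address this, but their use in real medical scenarios is limited by the computational constraints of the environment and of edge devices. In addition, the convolutional inductive bias of lightweight networks fits scarce medical data well, and Transformer-based networks lack this bias. To extract global context while keeping the inductive bias, we propose CMUNeXt, an efficient, fully convolutional, lightweight medical image segmentation network that enables fast and accurate auxiliary diagnosis in real scenarios. CMUNeXt uses large kernels and an inverted bottleneck design to thoroughly mix distant spatial and location information and efficiently extract global context. We also introduce the Skip-Fusion block, which enables smooth skip-connections and ample feature fusion. Experiments on multiple medical image datasets show that CMUNeXt outperforms existing heavyweight and lightweight medical image segmentation networks in segmentation performance while offering faster inference, lighter weights, and lower computational cost. The code is available at https://github.com/FengheTan9/CMUNeXt.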
TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval
results: Extensive experiments show that the proposed method achieves efficient text-to-video retrieval on multiple public datasets.
Abstract
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods are dominating. Compared to CLIP4Clip which is efficient and compact, the state-of-the-art models tend to compute video-text similarity by fine-grained cross-modal feature interaction and matching, putting their scalability for large-scale T2VR into doubt. For efficient T2VR, we propose TeachCLIP with multi-grained teaching to let a CLIP4Clip based student network learn from more advanced yet computationally heavy models such as X-CLIP, TS2-Net and X-Pool . To improve the student's learning capability, we add an Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage/computation overhead at the retrieval stage. While attentive weights produced by AFA are commonly used for combining frame-level features, we propose a novel use of the weights to let them imitate frame-text relevance estimated by the teacher network. As such, AFA provides a fine-grained learning (teaching) channel for the student (teacher). Extensive experiments on multiple public datasets justify the viability of the proposed method.
Improving Generalization in Visual Reinforcement Learning via Conflict-aware Gradient Agreement Augmentation
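results: Experiments show that CG2A significantly improves the generalization performance and sample efficiency of visual reinforcement learning algorithms.
Abstract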
Learning a policy that generalizes well to unseen environments remains challenging but critical in visual reinforcement learning. Despite the success of augmentation combination for generalization in supervised learning, naively applying it to visual RL algorithms may damage training efficiency and cause severe performance degradation. In this paper, we first conduct a qualitative analysis and identify the main causes: (i) high-variance gradient magnitudes and (ii) gradient conflicts that exist across augmentation methods. To alleviate these issues, we propose a general policy gradient optimization framework, named Conflict-aware Gradient Agreement Augmentation (CG2A), which better integrates augmentation combination into visual RL algorithms to address the generalization bias. In particular, CG2A develops a Gradient Agreement Solver to adaptively balance the varying gradient magnitudes, and introduces a Soft Gradient Surgery strategy to alleviate the gradient conflicts. Extensive experiments demonstrate that CG2A significantly improves the generalization performance and sample efficiency of visual RL algorithms.
Generative Noisy-Label Learning by Implicit Dicriminative Approximation with Partial Label Prior
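results: Extensive experiments on several noisy-label benchmarks show that the generative model achieves state-of-the-art results while keeping a computational complexity similar to that of discriminative models.
Abstract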
The learning with noisy labels has been addressed with both discriminative and generative models. Although discriminative models have dominated the field due to their simpler modeling and more efficient computational training processes, generative models offer a more effective means of disentangling clean and noisy labels and improving the estimation of the label transition matrix. However, generative approaches maximize the joint likelihood of noisy labels and data using a complex formulation that only indirectly optimizes the model of interest associating data and clean labels. Additionally, these approaches rely on generative models that are challenging to train and tend to use uninformative clean label priors. In this paper, we propose a new generative noisy-label learning approach that addresses these three issues. First, we propose a new model optimisation that directly associates data and clean labels. Second, the generative model is implicitly estimated using a discriminative model, eliminating the inefficient training of a generative model. Third, we propose a new informative label prior inspired by partial label learning as supervision signal for noisy label learning. Extensive experiments on several noisy-label benchmarks demonstrate that our generative model provides state-of-the-art results while maintaining a similar computational complexity as discriminative models.
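Summary
Learning with noisy labels has been addressed with both discriminative and generative models. Discriminative models dominate the field because they are simpler to model and more efficient to train, but generative models are more effective at disentangling clean and noisy labels and at estimating the label transition matrix. However, generative approaches maximize the joint likelihood of noisy labels and data through a complex formulation that only indirectly optimizes the model of interest, which associates data with clean labels. These approaches also rely on generative models that are hard to train, and they tend to use uninformative clean-label priors. In this paper, we propose a new generative noisy-label learning approach that addresses these three issues. First, we propose a new model optimisation that directly associates data and clean labels. Second, we estimate the generative model implicitly with a discriminative model, which removes the inefficient training of a generative model. Third, we propose a new informative label prior, inspired by partial label learning, as the supervision signal for noisy-label learning. Extensive experiments on several noisy-label benchmarks show that our generative model achieves state-of-the-art results with a computational complexity similar to that of discriminative models.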
Interpretable End-to-End Driving Model for Implicit Scene Understanding
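methods: Specific perception tasks such as object detection and scene graph generation are commonly used, but their results amount only to samples from the high-dimensional scene features and are not sufficient to represent the scenario. In addition, the goal of perception tasks is inconsistent with human driving, which focuses only on what may affect the ego-trajectory. The paper therefore proposes an end-to-end Interpretable Implicit Driving Scene Understanding (II-DSU) model that, guided by a planning module, extracts implicit high-dimensional scene features as the scene understanding result and validates them with auxiliary perception tasks for visualization.
results: Experiments on CARLA benchmarks show that the approach achieves a new state of the art and obtains scene features that carry richer driving-relevant scene information, which improves downstream planning.
Abstract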
Driving scene understanding is to obtain comprehensive scene information through the sensor data and provide a basis for downstream tasks, which is indispensable for the safety of self-driving vehicles. Specific perception tasks, such as object detection and scene graph generation, are commonly used. However, the results of these tasks are only equivalent to the characterization of sampling from high-dimensional scene features, which are not sufficient to represent the scenario. In addition, the goal of perception tasks is inconsistent with human driving that just focuses on what may affect the ego-trajectory. Therefore, we propose an end-to-end Interpretable Implicit Driving Scene Understanding (II-DSU) model to extract implicit high-dimensional scene features as scene understanding results guided by a planning module and to validate the plausibility of scene understanding using auxiliary perception tasks for visualization. Experimental results on CARLA benchmarks show that our approach achieves the new state-of-the-art and is able to obtain scene features that embody richer scene information relevant to driving, enabling superior performance of the downstream planning.
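Summary
Driving scene understanding obtains comprehensive scene information from sensor data and provides a basis for downstream tasks, which makes it indispensable for the safety of self-driving vehicles. Specific perception tasks, such as object detection and scene graph generation, are commonly used. However, the results of these tasks only amount to samples from the high-dimensional scene features and are not sufficient to represent the scenario. In addition, the goal of perception tasks is inconsistent with human driving, which focuses only on what may affect the ego-trajectory. We therefore propose an end-to-end Interpretable Implicit Driving Scene Understanding (II-DSU) model that, guided by a planning module, extracts implicit high-dimensional scene features as the scene understanding result, and that validates the plausibility of this understanding through auxiliary perception tasks used for visualization. Experiments on CARLA benchmarks show that our approach achieves a new state of the art and obtains scene features that carry richer driving-relevant information, which improves downstream planning.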
paper_authors: Huzheng Yang, James Gee, Jianbo Shi
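for: The study explores a new class of brain encoding model that takes memory-related information as input.
methods: The study uses a vision-memory cognitive task and predicts non-visual brain activity from previously seen images.
results: The Memory Encoding Model (Mem) scores 66.8 as a single model and 70.8 as an ensemble, compared with 61.4 for the ensemble without memory input. The study also observes a periodic delayed brain response, and correlated hippocampus activity, tied to the image seen 6 to 7 frames earlier.
Abstract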
We explore a new class of brain encoding model by adding memory-related information as input. Memory is an essential brain mechanism that works alongside visual stimuli. During a vision-memory cognitive task, we found that the non-visual brain is largely predictable using previously seen images. Our Memory Encoding Model (Mem) won the Algonauts 2023 visual brain competition even without model ensembling (single-model score 66.8, ensemble score 70.8). Our ensemble model without memory input (61.4) would still take 3rd place. Furthermore, we observe a periodic delayed brain response correlated with the 6th-7th prior image, and the hippocampus also showed correlated activity timed with this periodicity. We conjecture that the periodic replay could be related to a memory mechanism that enhances working memory.
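Summary
We explore a new class of brain encoding model that takes memory-related information as input. Memory is an essential brain mechanism that works alongside visual stimuli. In a vision-memory cognitive task, we found that the non-visual brain is largely predictable from previously seen images. Our Memory Encoding Model (Mem) won the Algonauts 2023 visual brain competition even without model ensembling (single-model score 66.8, ensemble score 70.8). Our ensemble model without memory input (61.4) would still place 3rd. We also observe a periodic delayed brain response correlated with the 6th-7th prior image, and the hippocampus shows correlated activity timed with this periodicity. We conjecture that this periodic replay could be related to a memory mechanism that enhances working memory.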
Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation
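results: Extensive experiments on four benchmark datasets from different domains demonstrate the effectiveness of the components proposed in FSA-CDM, which exceeds the state of the art by about 2%-12% in DTW.
Abstract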
The recently rising markup-to-image generation poses greater challenges as compared to natural image generation, due to its low tolerance for errors as well as the complex sequence and context correlations between markup and rendered image. This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM), which introduces contrastive positive/negative samples into the diffusion model to boost performance for markup-to-image generation. Technically, we design a fine-grained cross-modal alignment module to well explore the sequence similarity between the two modalities for learning robust feature representations. To improve the generalization ability, we propose a contrast-augmented diffusion model to explicitly explore positive and negative samples by maximizing a novel contrastive variational objective, which is mathematically inferred to provide a tighter bound for the model's optimization. Moreover, the context-aware cross attention module is developed to capture the contextual information within markup language during the denoising process, yielding better noise prediction results. Extensive experiments are conducted on four benchmark datasets from different domains, and the experimental results demonstrate the effectiveness of the proposed components in FSA-CDM, significantly exceeding state-of-the-art performance by about 2%-12% DTW improvements. The code will be released at https://github.com/zgj77/FSACDM.
UCDFormer: Unsupervised Change Detection Using a Transformer-driven Image Translation
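for: The study proposes an unsupervised change detection method that accounts for domain shift, aimed at change detection in remote sensing images.
methods: The method, UCDFormer, uses a transformer-based image translation model to mitigate domain shift. The translation model consists of a light-weight transformer and a domain-specific affinity weight. A change map is then derived from the difference map computed after translation. Finally, a reliable pixel extraction module selects changed and unchanged pixels by fusing fuzzy c-means clustering with adaptive thresholding.
results: Experiments on several unsupervised change detection tasks show that UCDFormer outperforms related methods, improving the Kappa coefficient by more than 12%. UCDFormer also performs well on earthquake-induced landslide detection in large-scale applications. The code is available at https://github.com/zhu-xlab/UCDFormer.
Abstract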
Change detection (CD) by comparing two bi-temporal images is a crucial task in remote sensing. With the advantages of requiring no cumbersome labeled change information, unsupervised CD has attracted extensive attention in the community. However, existing unsupervised CD approaches rarely consider the seasonal and style differences incurred by the illumination and atmospheric conditions in multi-temporal images. To this end, we propose a change detection with domain shift setting for remote sensing images. Furthermore, we present a novel unsupervised CD method using a light-weight transformer, called UCDFormer. Specifically, a transformer-driven image translation composed of a light-weight transformer and a domain-specific affinity weight is first proposed to mitigate domain shift between two images with real-time efficiency. After image translation, we can generate the difference map between the translated before-event image and the original after-event image. Then, a novel reliable pixel extraction module is proposed to select significantly changed/unchanged pixel positions by fusing the pseudo change maps of fuzzy c-means clustering and adaptive threshold. Finally, a binary change map is obtained based on these selected pixel pairs and a binary classifier. Experimental results on different unsupervised CD tasks with seasonal and style changes demonstrate the effectiveness of the proposed UCDFormer. For example, compared with several other related methods, UCDFormer improves performance on the Kappa coefficient by more than 12\%. In addition, UCDFormer achieves excellent performance for earthquake-induced landslide detection when considering large-scale applications. The code is available at \url{https://github.com/zhu-xlab/UCDFormer}
Summary
Change detection (CD) by comparing two bi-temporal images is a crucial task in remote sensing. Because it requires no cumbersome labeled change information, unsupervised CD has attracted extensive attention. However, existing unsupervised CD approaches rarely consider the seasonal and style differences that illumination and atmospheric conditions introduce into multi-temporal images. To address this, we propose a change detection setting with domain shift for remote sensing images, together with UCDFormer, a novel unsupervised CD method built on a light-weight transformer. First, a transformer-driven image translation, composed of a light-weight transformer and a domain-specific affinity weight, mitigates the domain shift between the two images with real-time efficiency. After translation, we generate the difference map between the translated before-event image and the original after-event image. A novel reliable pixel extraction module then selects significantly changed and unchanged pixel positions by fusing the pseudo change maps of fuzzy c-means clustering and adaptive thresholding. Finally, a binary change map is obtained from these selected pixel pairs and a binary classifier. Experiments on unsupervised CD tasks with seasonal and style changes show that UCDFormer improves the Kappa coefficient by more than 12% over several related methods, and that it performs very well on earthquake-induced landslide detection in large-scale applications. The code is available at https://github.com/zhu-xlab/UCDFormer.
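Since the headline result is reported as a Kappa coefficient, a short sketch of how Cohen's kappa is computed for a binary change map may help. This is the standard textbook definition written in plain Python; the function name `cohen_kappa` and the toy label lists are illustrative assumptions and are not taken from the UCDFormer code.

```python
def cohen_kappa(predicted, reference):
    """Cohen's kappa between two flattened label maps of equal length.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the label frequencies.
    """
    n = len(reference)
    observed = sum(p == r for p, r in zip(predicted, reference)) / n
    labels = set(predicted) | set(reference)
    expected = sum(
        (predicted.count(label) / n) * (reference.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # degenerate case: both maps contain one identical label
    return (observed - expected) / (1.0 - expected)

# Toy flattened change maps: 1 = changed pixel, 0 = unchanged pixel.
reference = [1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa(predicted, reference)
```

Kappa is preferred over raw accuracy for change maps because unchanged pixels usually dominate, so a detector that predicts "unchanged" everywhere scores high accuracy but a kappa near zero.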
DySTreSS: Dynamically Scaled Temperature in Self-Supervised Contrastive Learning
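results: Experiments show that the proposed framework (DySTreSS) outperforms or is on par with contrastive loss-based SSL algorithms. The work also analyzes the choice of temperature and how local and global structures in the feature space evolve as the temperature varies.
Abstract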
In contemporary self-supervised contrastive algorithms like SimCLR, MoCo, etc., the task of balancing attraction between two semantically similar samples and repulsion between two samples from different classes is primarily affected by the presence of hard negative samples. While the InfoNCE loss has been shown to impose penalties based on hardness, the temperature hyper-parameter is the key to regulating the penalties and the trade-off between uniformity and tolerance. In this work, we focus our attention to improve the performance of InfoNCE loss in SSL by studying the effect of temperature hyper-parameter values. We propose a cosine similarity-dependent temperature scaling function to effectively optimize the distribution of the samples in the feature space. We further analyze the uniformity and tolerance metrics to investigate the optimal regions in the cosine similarity space for better optimization. Additionally, we offer a comprehensive examination of the behavior of local and global structures in the feature space throughout the pre-training phase, as the temperature varies. Experimental evidence shows that the proposed framework outperforms or is at par with the contrastive loss-based SSL algorithms. We believe our work (DySTreSS) on temperature scaling in SSL provides a foundation for future research in contrastive learning.
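Summary
In contemporary self-supervised contrastive algorithms such as SimCLR and MoCo, the balance between attraction of two semantically similar samples and repulsion of two samples from different classes is primarily affected by hard negative samples. The InfoNCE loss has been shown to impose penalties based on hardness, and the temperature hyper-parameter is the key to regulating those penalties and the trade-off between uniformity and tolerance. In this work, we aim to improve the performance of the InfoNCE loss in SSL by studying the effect of the temperature value. We propose a cosine similarity-dependent temperature scaling function to optimize the distribution of samples in the feature space. We further analyze the uniformity and tolerance metrics to find the regions of the cosine similarity space that allow better optimization. We also examine how local and global structures in the feature space behave throughout pre-training as the temperature varies. Experiments show that the proposed framework outperforms or is on par with contrastive loss-based SSL algorithms. We believe that our work (DySTreSS) on temperature scaling in SSL provides a foundation for future research in contrastive learning.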
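The idea of a similarity-dependent temperature can be sketched in a few lines of Python. The abstract does not give the scaling function, so `temperature` below is a hypothetical cosine-shaped schedule chosen only for illustration, with assumed bounds `tau_min` and `tau_max`; it should not be read as the DySTreSS function. Passing a constant function instead recovers the usual fixed-temperature InfoNCE loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def temperature(sim, tau_min=0.1, tau_max=0.2):
    """Hypothetical schedule (assumption): tau_max at sim = 0,
    falling to tau_min at sim = +1 and sim = -1."""
    return tau_min + 0.5 * (tau_max - tau_min) * (1.0 + math.cos(math.pi * sim))

def info_nce(anchor, positive, negatives, temp_fn=temperature):
    """InfoNCE loss for one anchor, with a per-pair temperature."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temp_fn(s) for s in sims]
    peak = max(logits)  # subtract the max for a stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

# Positive aligned with the anchor, negatives orthogonal to it: low loss.
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [0.0, -1.0]])
# Positive orthogonal to the anchor, negative aligned with it: high loss.
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Because each logit is divided by its own temperature, the schedule controls how sharply pairs at a given similarity are penalized, which is the lever the abstract describes for trading off uniformity against tolerance.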
Multi-task learning for classification, segmentation, reconstruction, and detection on chest CT scans
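results: The paper proposes a new multi-task learning framework intended to help physicians identify lesions in lung cancer and COVID-19 patients faster and more accurately. It also examines two different backbones and different loss functions for the segmentation task.
Abstract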
Lung cancer and COVID-19 have some of the highest morbidity and mortality rates in the world. For physicians, identifying lesions in the early stages of the disease is difficult and time-consuming. Therefore, multi-task learning is an approach to extracting important features, such as lesions, from small amounts of medical data because it learns to generalize better. We propose a novel multi-task framework for classification, segmentation, reconstruction, and detection. To the best of our knowledge, we are the first to add detection to the multi-task solution. Additionally, we checked the possibility of using two different backbones and different loss functions in the segmentation task.
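Summary
Lung cancer and COVID-19 have some of the highest morbidity and mortality rates in the world. For physicians, identifying lesions in the early stages of the disease is difficult and time-consuming. Multi-task learning is therefore a way to extract important features, such as lesions, from small amounts of medical data, because it learns to generalize better. We propose a novel multi-task framework for classification, segmentation, reconstruction, and detection. To the best of our knowledge, we are the first to add detection to the multi-task solution. We also examine the use of two different backbones and different loss functions in the segmentation task.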
Leveraging Expert Models for Training Deep Neural Networks in Scarce Data Domains: Application to Offline Handwritten Signature Verification
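results: The study shows that models trained with this approach match or exceed the teacher model on three popular signature datasets, without using any signature data during feature extraction training. The approach may also apply to other data-scarce domains.
Abstract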
This paper introduces a novel approach to leverage the knowledge of existing expert models for training new Convolutional Neural Networks, on domains where task-specific data are limited or unavailable. The presented scheme is applied in offline handwritten signature verification (OffSV) which, akin to other biometric applications, suffers from inherent data limitations due to regulatory restrictions. The proposed Student-Teacher (S-T) configuration utilizes feature-based knowledge distillation (FKD), combining graph-based similarity for local activations with global similarity measures to supervise student's training, using only handwritten text data. Remarkably, the models trained using this technique exhibit comparable, if not superior, performance to the teacher model across three popular signature datasets. More importantly, these results are attained without employing any signatures during the feature extraction training process. This study demonstrates the efficacy of leveraging existing expert models to overcome data scarcity challenges in OffSV and potentially other related domains.
DiffusePast: Diffusion-based Generative Replay for Class Incremental Semantic Segmentation
results: Experiments show that the method achieves competitive performance on mainstream benchmarks and strikes a better balance between old and new classes.
Abstract
The Class Incremental Semantic Segmentation (CISS) extends the traditional segmentation task by incrementally learning newly added classes. Previous work has introduced generative replay, which involves replaying old class samples generated from a pre-trained GAN, to address the issues of catastrophic forgetting and privacy concerns. However, the generated images lack semantic precision and exhibit out-of-distribution characteristics, resulting in inaccurate masks that further degrade the segmentation performance. To tackle these challenges, we propose DiffusePast, a novel framework featuring a diffusion-based generative replay module that generates semantically accurate images with more reliable masks guided by different instructions (e.g., text prompts or edge maps). Specifically, DiffusePast introduces a dual-generator paradigm, which focuses on generating old class images that align with the distribution of downstream datasets while preserving the structure and layout of the original images, enabling more precise masks. To adapt to the novel visual concepts of newly added classes continuously, we incorporate class-wise token embedding when updating the dual-generator. Moreover, we assign adequate pseudo-labels of old classes to the background pixels in the new step images, further mitigating the forgetting of previously learned knowledge. Through comprehensive experiments, our method demonstrates competitive performance across mainstream benchmarks, striking a better balance between the performance of old and novel classes.
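Summary
Class Incremental Semantic Segmentation (CISS) extends the traditional segmentation task by continually learning newly added classes. Earlier work introduced generative replay, which replays old-class samples generated by a pre-trained GAN, to address catastrophic forgetting and privacy concerns. However, the generated images lack semantic precision and show out-of-distribution characteristics, which yields inaccurate masks and further degrades segmentation performance. To address these challenges, we propose DiffusePast, a new framework with a diffusion-based generative replay module that produces semantically accurate images with more reliable masks. Specifically, DiffusePast adopts a dual-generator paradigm that generates old-class images aligned with the distribution of the downstream datasets while preserving the structure and layout of the original images, enabling more precise masks. To adapt to the visual concepts of newly added classes, we incorporate class-wise token embedding when updating the dual generator. In addition, we assign pseudo-labels of old classes to the background pixels of new-step images, further reducing forgetting of previously learned knowledge. In comprehensive experiments, our method shows competitive performance on mainstream benchmarks and strikes a better balance between old and new classes.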
Stereo Visual Odometry with Deep Learning-Based Point and Line Feature Matching using an Attention Graph Neural Network
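results: Experiments on multiple real and synthetic datasets show that the method obtains more line feature matches under low-visibility weather and lighting conditions and, when complemented with point feature matches, performs consistently in adverse weather and lighting conditions.
Abstract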
Robust feature matching forms the backbone for most Visual Simultaneous Localization and Mapping (vSLAM), visual odometry, 3D reconstruction, and Structure from Motion (SfM) algorithms. However, recovering feature matches from texture-poor scenes is a major challenge and still remains an open area of research. In this paper, we present a Stereo Visual Odometry (StereoVO) technique based on point and line features which uses a novel feature-matching mechanism based on an Attention Graph Neural Network that is designed to perform well even under adverse weather conditions such as fog, haze, rain, and snow, and dynamic lighting conditions such as nighttime illumination and glare scenarios. We perform experiments on multiple real and synthetic datasets to validate the ability of our method to perform StereoVO under low visibility weather and lighting conditions through robust point and line matches. The results demonstrate that our method achieves more line feature matches than state-of-the-art line matching algorithms, which when complemented with point feature matches perform consistently well in adverse weather and dynamic lighting conditions.
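Summary
Robust feature matching is the core of most visual Simultaneous Localization and Mapping (vSLAM), visual odometry, 3D reconstruction, and Structure from Motion (SfM) algorithms. Recovering feature matches from texture-poor scenes, however, is a major challenge and remains an open research area. In this paper, we present a Stereo Visual Odometry (StereoVO) technique based on point and line features, using a new feature-matching mechanism built on an Attention Graph Neural Network that performs well under adverse weather such as fog, haze, rain, and snow, and under difficult lighting conditions. We run experiments on multiple real and synthetic datasets to validate our method's ability to perform StereoVO under low-visibility weather and lighting conditions. The results show that our method obtains more line feature matches than state-of-the-art line-matching algorithms and, when complemented with point feature matches, performs consistently well in adverse weather and dynamic lighting conditions.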
Unlearning Spurious Correlations in Chest X-ray Classification
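methods: The study trains a deep learning model on a Covid-19 chest X-ray dataset and applies eXplanation Based Learning (XBL), which uses interactive user feedback to unlearn spurious correlations.
results: The results show that XBL can effectively reduce the classification model's reliance on spurious correlations and improve model transparency on Covid-19 chest X-ray images.
Abstract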
Medical image classification models are frequently trained using training datasets derived from multiple data sources. While leveraging multiple data sources is crucial for achieving model generalization, it is important to acknowledge that the diverse nature of these sources inherently introduces unintended confounders and other challenges that can impact both model accuracy and transparency. A notable confounding factor in medical image classification, particularly in musculoskeletal image classification, is skeletal maturation-induced bone growth observed during adolescence. We train a deep learning model using a Covid-19 chest X-ray dataset and we showcase how this dataset can lead to spurious correlations due to unintended confounding regions. eXplanation Based Learning (XBL) is a deep learning approach that goes beyond interpretability by utilizing model explanations to interactively unlearn spurious correlations. This is achieved by integrating interactive user feedback, specifically feature annotations. In our study, we employed two non-demanding manual feedback mechanisms to implement an XBL-based approach for effectively eliminating these spurious correlations. Our results underscore the promising potential of XBL in constructing robust models even in the presence of confounding factors.
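Summary
Medical image classification models are often trained on multiple data sources. Although using multiple sources is important for model generalization, the diversity of these sources inherently introduces unintended confounders and other challenges that can affect both accuracy and transparency. One common confounder in medical image classification is bone growth caused by skeletal maturation during adolescence. We train a deep learning model on a COVID-19 chest X-ray dataset and show that the dataset can lead to spurious correlations because of unintended confounding regions. eXplanation Based Learning (XBL) is a deep learning approach that goes beyond providing explanations by using interactive user feedback to unlearn spurious correlations. In our study we use two low-effort manual feedback mechanisms to implement an XBL-based approach. Our results show that XBL can build robust models even when confounding factors are present.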
Spatio-Temporal Branching for Motion Prediction using Motion Increments
results: On standard HMP benchmarks, the method achieves strong prediction accuracy and outperforms state-of-the-art methods.
Abstract
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications, but it remains a challenging task due to the stochastic and aperiodic nature of future poses. Traditional methods rely on hand-crafted features and machine learning techniques, which often struggle to model the complex dynamics of human motion. Recent deep learning-based methods have achieved success by learning spatio-temporal representations of motion, but these models often overlook the reliability of motion data. Additionally, the temporal and spatial dependencies of skeleton nodes are distinct. The temporal relationship captures motion information over time, while the spatial relationship describes body structure and the relationships between different nodes. In this paper, we propose a novel spatio-temporal branching network using incremental information for HMP, which decouples the learning of temporal-domain and spatial-domain features, extracts more motion information, and achieves complementary cross-domain knowledge learning through knowledge distillation. Our approach effectively reduces noise interference and provides more expressive information for characterizing motion by separately extracting temporal and spatial features. We evaluate our approach on standard HMP benchmarks and outperform state-of-the-art methods in terms of prediction accuracy.
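Summary
Human motion prediction (HMP) has become a popular research topic because of its wide range of applications, yet it remains challenging because future poses are stochastic and aperiodic. Traditional methods rely on hand-crafted features and machine learning techniques, which often struggle to capture the complex dynamics of human motion. Recent deep learning-based methods have succeeded by learning spatio-temporal representations of motion, but they often overlook the reliability of motion data. Moreover, temporal and spatial dependencies differ: the temporal relationship captures motion information over time, while the spatial relationship describes body structure and the relationships between nodes. In this paper, we propose a new spatio-temporal branching network for HMP that decouples the learning of temporal-domain and spatial-domain features, extracts more motion information, and achieves complementary cross-domain knowledge learning through knowledge distillation. Our method reduces noise interference and provides more expressive information for characterizing motion. We evaluate it on standard HMP benchmarks, where it outperforms state-of-the-art methods in prediction accuracy.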
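To make the "motion increments" idea in this entry concrete, the sketch below computes frame-to-frame joint displacements for a pose sequence. It assumes that an increment is the first-order difference between consecutive poses, which the abstract does not spell out. The function name `motion_increments` and the toy data are illustrative only.

```python
def motion_increments(poses):
    """Return per-frame joint displacements for a pose sequence.

    poses: list of frames; each frame is a list of (x, y, z) joint tuples.
    Assumption: an "increment" is the difference between consecutive frames.
    """
    increments = []
    for prev, curr in zip(poses, poses[1:]):
        increments.append([
            tuple(c - p for p, c in zip(joint_prev, joint_curr))
            for joint_prev, joint_curr in zip(prev, curr)
        ])
    return increments


# Toy sequence: 3 frames with 2 joints each (hypothetical values).
poses = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(0.5, 0.0, 0.0), (1.0, 2.0, 1.0)],
    [(1.5, 0.0, 0.0), (1.0, 2.0, 3.0)],
]
deltas = motion_increments(poses)  # one increment frame fewer than input frames
```

A sequence of T frames yields T-1 increment frames, so a model that consumes increments would need the last observed pose to recover absolute positions.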
AutoPoster: A Highly Automatic and Content-aware Design System for Advertising Poster Generation
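results: The system turns an input image and title into advertising posters of several sizes and produces high-quality posters. User studies and experiments show that it is more effective and more aesthetically pleasing than other poster generation methods.
Abstract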
Advertising posters, a form of information presentation, combine visual and linguistic modalities. Creating a poster involves multiple steps and necessitates design experience and creativity. This paper introduces AutoPoster, a highly automatic and content-aware system for generating advertising posters. With only product images and titles as inputs, AutoPoster can automatically produce posters of varying sizes through four key stages: image cleaning and retargeting, layout generation, tagline generation, and style attribute prediction. To ensure visual harmony of posters, two content-aware models are incorporated for layout and tagline generation. Moreover, we propose a novel multi-task Style Attribute Predictor (SAP) to jointly predict visual style attributes. Meanwhile, to our knowledge, we propose the first poster generation dataset that includes visual attribute annotations for over 76k posters. Qualitative and quantitative outcomes from user studies and experiments substantiate the efficacy of our system and the aesthetic superiority of the generated posters compared to other poster generation methods.
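Summary
Advertising posters are a form of information presentation that combines visual and linguistic modalities. Creating a poster involves multiple steps and requires design experience and creativity. This paper introduces AutoPoster, a highly automatic and content-aware poster generation system. Given only product images and titles, AutoPoster automatically generates posters of different sizes through four key stages: image cleaning and retargeting, layout generation, tagline generation, and style attribute prediction. To ensure visual harmony, two content-aware models are incorporated into layout and tagline generation. We also propose a new multi-task Style Attribute Predictor (SAP) that jointly predicts visual style attributes. In addition, we provide the first poster generation dataset with visual attribute annotations, covering over 76k posters. User studies and experiments show that our system is effective and that its posters are aesthetically superior to those of other poster generation methods.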
Attention-free Spikformer: Mixing Spike Sequences with Simple Linear Transforms
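results: The study shows that replacing SSA with an unparameterized linear transform (LT), such as the Fourier or wavelet transform, reduces quadratic time complexity to log-linear time complexity. These transforms mix spike sequences and extract sparse visual features in the frequency and time domains, showing strong performance and efficiency.
Abstract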
By integrating the self-attention capability and the biological properties of Spiking Neural Networks (SNNs), Spikformer applies the flourishing Transformer architecture to SNNs design. It introduces a Spiking Self-Attention (SSA) module to mix sparse visual features using spike-form Query, Key, and Value, resulting in the State-Of-The-Art (SOTA) performance on numerous datasets compared to previous SNN-like frameworks. In this paper, we demonstrate that the Spikformer architecture can be accelerated by replacing the SSA with an unparameterized Linear Transform (LT) such as Fourier and Wavelet transforms. These transforms are utilized to mix spike sequences, reducing the quadratic time complexity to log-linear time complexity. They alternate between the frequency and time domains to extract sparse visual features, showcasing powerful performance and efficiency. We conduct extensive experiments on image classification using both neuromorphic and static datasets. The results indicate that compared to the SOTA Spikformer with SSA, Spikformer with LT achieves higher Top-1 accuracy on neuromorphic datasets (i.e., CIFAR10-DVS and DVS128 Gesture) and comparable Top-1 accuracy on static datasets (i.e., CIFAR-10 and CIFAR-100). Furthermore, Spikformer with LT achieves approximately 29-51% improvement in training speed, 61-70% improvement in inference speed, and reduces memory usage by 4-26% due to not requiring learnable parameters.
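Summary
By combining self-attention with the biological properties of Spiking Neural Networks (SNNs), Spikformer applies the Transformer architecture to SNN design. It introduces a Spiking Self-Attention (SSA) module that mixes sparse visual features using spike-form Query, Key, and Value, achieving State-Of-The-Art (SOTA) performance on many datasets compared with earlier SNN-like frameworks. In this paper, we show that the Spikformer architecture can be accelerated by replacing SSA with an unparameterized Linear Transform (LT), such as the Fourier or Wavelet transform. These transforms mix spike sequences and reduce the quadratic time complexity to log-linear. They alternate between the frequency and time domains to extract sparse visual features, showing strong performance and efficiency. We run extensive image classification experiments on neuromorphic and static datasets. Compared with the SOTA Spikformer with SSA, Spikformer with LT achieves higher Top-1 accuracy on neuromorphic datasets (CIFAR10-DVS and DVS128 Gesture) and comparable Top-1 accuracy on static datasets (CIFAR-10 and CIFAR-100). Spikformer with LT also improves training speed by approximately 29-51% and inference speed by 61-70%, and reduces memory usage.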
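As a rough illustration of parameter-free token mixing with a Fourier transform, the sketch below applies a discrete Fourier transform along the feature axis and then along the token axis, keeping the real part. This follows the general FNet-style recipe and is an assumption, not necessarily the exact layer used in the paper. The names `dft` and `fourier_mix` are hypothetical, and the naive DFT here is quadratic for readability, whereas the log-linear cost quoted in the abstract presumes a fast Fourier transform.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a 1-D sequence (O(n^2), for clarity)."""
    n = len(values)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(values))
        for k in range(n)
    ]


def fourier_mix(tokens):
    """Mix a (num_tokens x dim) matrix with a 2-D DFT and keep the real part.

    No learnable parameters are involved: the transform itself does the mixing.
    """
    # DFT along the feature axis of every token.
    rows = [dft(row) for row in tokens]
    # DFT along the token axis of every feature column.
    cols = [dft([row[j] for row in rows]) for j in range(len(rows[0]))]
    # Transpose back to (num_tokens x dim) and drop the imaginary part.
    return [[cols[j][i].real for j in range(len(cols))] for i in range(len(rows))]


# A single spike at the first token and first feature (toy binary input).
spikes = [[1, 0], [0, 0], [0, 0], [0, 0]]
mixed = fourier_mix(spikes)  # every output entry receives the spike's contribution
```

The toy input shows why this counts as mixing: one spike in one position influences every position of the output.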
Homography Estimation in Complex Topological Scenes
results: Experiments show that the proposed method improves the IoU metric by up to 12% compared with an existing model.
Abstract
Surveillance videos and images are used for a broad set of applications, ranging from traffic analysis to crime detection. Extrinsic camera calibration data is important for most analysis applications. However, security cameras are susceptible to environmental conditions and small camera movements, resulting in a need for an automated re-calibration method that can account for these varying conditions. In this paper, we present an automated camera-calibration process leveraging a dictionary-based approach that does not require prior knowledge on any camera settings. The method consists of a custom implementation of a Spatial Transformer Network (STN) and a novel topological loss function. Experiments reveal that the proposed method improves the IoU metric by up to 12% w.r.t. a state-of-the-art model across five synthetic datasets and the World Cup 2014 dataset.
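Summary
Surveillance videos and images are widely used in applications such as traffic analysis and crime detection. Extrinsic camera calibration is affected by environmental conditions and small camera movements, so an automated re-calibration method that accounts for these varying conditions is needed. This paper presents an automated camera-calibration method based on a dictionary approach that requires no prior knowledge of any camera settings. The method consists of a custom Spatial Transformer Network (STN) and a new topological loss function. Experiments show that the proposed method improves the IoU metric by up to 12% relative to a state-of-the-art model on five synthetic datasets and the World Cup 2014 dataset.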
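The IoU (intersection over union) metric reported for this entry can be sketched for two binary masks as follows. How the paper builds the masks being compared is not stated in the abstract, so the inputs here are plain hypothetical 0/1 grids, and the choice to return 1.0 for two empty masks is an assumption.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks (lists of rows)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


# Toy masks: one shared pixel, three pixels covered in total.
predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [1, 0]]
score = iou(predicted, reference)
```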
Improving Generalization of Synthetically Trained Sonar Image Descriptors for Underwater Place Recognition
results: The study shows that the proposed method, trained on synthetic data, performs better in real scenarios than existing methods.
Abstract
Autonomous navigation in underwater environments presents challenges due to factors such as light absorption and water turbidity, limiting the effectiveness of optical sensors. Sonar systems are commonly used for perception in underwater operations as they are unaffected by these limitations. Traditional computer vision algorithms are less effective when applied to sonar-generated acoustic images, while convolutional neural networks (CNNs) typically require large amounts of labeled training data that are often unavailable or difficult to acquire. To this end, we propose a novel compact deep sonar descriptor pipeline that can generalize to real scenarios while being trained exclusively on synthetic data. Our architecture is based on a ResNet18 back-end and a properly parameterized random Gaussian projection layer, whereas input sonar data is enhanced with standard ad-hoc normalization/prefiltering techniques. A customized synthetic data generation procedure is also presented. The proposed method has been evaluated extensively using both synthetic and publicly available real data, demonstrating its effectiveness compared to state-of-the-art methods.
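Summary
Autonomous navigation in underwater environments faces challenges such as light absorption and water turbidity, which limit the effectiveness of optical sensors. Sonar systems are therefore widely used in underwater operations because they are unaffected by these limitations. Traditional computer vision algorithms are less effective on sonar-generated acoustic images, and convolutional neural networks (CNNs) usually need large amounts of labeled training data that are often unavailable or hard to acquire. To this end, we propose a new compact deep sonar descriptor pipeline that generalizes to real scenarios while being trained only on synthetic data. Our architecture is based on a ResNet18 back-end and a properly parameterized random Gaussian projection layer, and the input sonar data is enhanced with standard normalization and prefiltering techniques. We also present a customized synthetic data generation procedure. Extensive evaluation on synthetic and publicly available real data shows the method's effectiveness compared with existing methods.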
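A random Gaussian projection layer of the kind mentioned in this entry can be sketched as a fixed matrix with normally distributed entries that maps a long feature vector to a compact descriptor. The abstract does not say how the layer is parameterized, so the `1 / sqrt(out_dim)` scaling, the seed handling, and the function names below are assumptions chosen for illustration.

```python
import random


def make_gaussian_projection(in_dim, out_dim, seed=0):
    """Build a fixed (out_dim x in_dim) matrix with N(0, 1) entries.

    Assumption: entries are scaled by 1/sqrt(out_dim), a common choice that
    roughly preserves vector norms; the paper's parameterization may differ.
    """
    rng = random.Random(seed)  # local generator so the matrix is reproducible
    scale = 1.0 / out_dim ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, features):
    """Apply the projection: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]


# Compress a hypothetical 8-dimensional backbone feature into 3 dimensions.
projection = make_gaussian_projection(in_dim=8, out_dim=3, seed=42)
descriptor = project(projection, [1.0] * 8)
```

Because the matrix is drawn once and never trained, it adds no learnable parameters to the pipeline.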
MammoDG: Generalisable Deep Learning Breaks the Limits of Cross-Domain Multi-Center Breast Cancer Screening
paper_authors: Yijun Yang, Shujun Wang, Lihao Liu, Sarah Hickman, Fiona J Gilbert, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero
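for: Improving early detection of breast cancer to improve treatment outcomes and women's quality of life.
methods: Machine learning models are used to support expert decision-making.
results: The paper proposes MammoDG, a new deep learning framework for reliable and generalisable analysis of multi-view mammograms from different centers. MammoDG uses a new contrastive mechanism to improve generalisation.
Abstract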
Breast cancer is a major cause of cancer death among women, emphasising the importance of early detection for improved treatment outcomes and quality of life. Mammography, the primary diagnostic imaging test, poses challenges due to the high variability and patterns in mammograms. Double reading of mammograms is recommended in many screening programs to improve diagnostic accuracy but increases radiologists' workload. Researchers explore Machine Learning models to support expert decision-making. Stand-alone models have shown comparable or superior performance to radiologists, but some studies note decreased sensitivity with multiple datasets, indicating the need for high generalisation and robustness models. This work devises MammoDG, a novel deep-learning framework for generalisable and reliable analysis of cross-domain multi-center mammography data. MammoDG leverages multi-view mammograms and a novel contrastive mechanism to enhance generalisation capabilities. Extensive validation demonstrates MammoDG's superiority, highlighting the critical importance of domain generalisation for trustworthy mammography analysis in imaging protocol variations.
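Summary
Breast cancer is one of the leading causes of cancer death among women, which underlines the importance of early detection for better treatment outcomes and quality of life. Mammography is the primary diagnostic imaging test, but the high variability and patterns in mammograms make it challenging. To improve diagnostic accuracy, many screening programs recommend double reading, which increases radiologists' workload. Researchers are exploring machine learning models to support expert decision-making. Stand-alone models have shown performance comparable or superior to that of experts, but some studies report decreased sensitivity across multiple datasets, indicating the need for models with high generalisation and robustness. This work proposes MammoDG, a new deep learning framework for reliable and generalisable analysis of mammography data. MammoDG uses multi-view mammograms and a new contrastive mechanism to enhance generalisation. Validation shows that MammoDG clearly outperforms other methods, highlighting the critical importance of domain generalisation for trustworthy mammography analysis under imaging protocol variations.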
Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation
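methods: The method builds on the idea of early exit. The network is split into several stages, each with an auxiliary block that grades the difficulty of every token. Easy tokens obtain their result after only the first few stages, which reduces computation.
results: Experiments show that the proposed DToP architecture reduces computational cost by $20\% - 35\%$ on average for existing semantic segmentation methods without affecting accuracy.
Abstract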
Vision transformers have achieved leading performance on various visual tasks yet still suffer from high computational complexity. The situation deteriorates in dense prediction tasks like semantic segmentation, as high-resolution inputs and outputs usually imply more tokens involved in computations. Directly removing the less attentive tokens has been discussed for the image classification task but can not be extended to semantic segmentation since a dense prediction is required for every patch. To this end, this work introduces a Dynamic Token Pruning (DToP) method based on the early exit of tokens for semantic segmentation. Motivated by the coarse-to-fine segmentation process by humans, we naturally split the widely adopted auxiliary-loss-based network architecture into several stages, where each auxiliary block grades every token's difficulty level. We can finalize the prediction of easy tokens in advance without completing the entire forward pass. Moreover, we keep $k$ highest confidence tokens for each semantic category to uphold the representative context information. Thus, computational complexity will change with the difficulty of the input, akin to the way humans do segmentation. Experiments suggest that the proposed DToP architecture reduces on average $20\% - 35\%$ of computational cost for current semantic segmentation methods based on plain vision transformers without accuracy degradation.
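Summary
Vision transformers have achieved leading performance on many visual tasks but still suffer from high computational complexity. In dense prediction tasks such as semantic segmentation, high-resolution inputs and outputs usually mean more tokens are involved in computation. Directly removing less important tokens has been discussed for image classification, but it cannot be extended to semantic segmentation because a dense prediction is required for every patch. To address this, this work proposes Dynamic Token Pruning (DToP), a method based on the early exit of tokens. Inspired by the coarse-to-fine way humans segment, we split the widely adopted auxiliary-loss-based network architecture into several stages, where each auxiliary block grades the difficulty level of every token. The prediction of easy tokens can thus be finalized early, without completing the whole forward pass. We also keep the $k$ highest-confidence tokens of each semantic category to preserve representative context information. Computational complexity therefore varies with the difficulty of the input, similar to human segmentation. Experiments show that the proposed DToP architecture reduces the computational cost of current semantic segmentation methods based on plain vision transformers by $20\% - 35\%$ on average without affecting accuracy.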
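The early-exit rule described in this entry can be sketched as a single pruning step at one auxiliary stage. The sketch assumes that a token counts as easy when its top class probability reaches a fixed threshold, and that the `k` most confident easy tokens of each class stay in the computation as context. The threshold value, the function name `split_tokens`, and the toy probabilities are hypothetical, and the paper's exact criteria may differ.

```python
def split_tokens(probs, threshold=0.95, k=1):
    """Decide which tokens exit early at one auxiliary stage.

    probs: per-token class-probability lists from an auxiliary head.
    Returns (finalized, remaining) as sorted lists of token indices.
    """
    confident = {}  # predicted class -> list of (confidence, token index)
    remaining = []  # hard tokens continue through the later stages
    for index, token_probs in enumerate(probs):
        confidence = max(token_probs)
        predicted = token_probs.index(confidence)
        if confidence >= threshold:
            confident.setdefault(predicted, []).append((confidence, index))
        else:
            remaining.append(index)

    finalized = []
    for items in confident.values():
        items.sort(reverse=True)  # most confident first
        # Keep the k most confident tokens of each class as context.
        remaining.extend(index for _, index in items[:k])
        # The other easy tokens take their prediction now and are pruned.
        finalized.extend(index for _, index in items[k:])
    return sorted(finalized), sorted(remaining)


# Four tokens, two classes (toy probabilities).
stage_probs = [[0.99, 0.01], [0.97, 0.03], [0.60, 0.40], [0.02, 0.98]]
finalized, remaining = split_tokens(stage_probs, threshold=0.95, k=1)
```

In this toy case token 1 exits early, token 2 continues because it is hard, and tokens 0 and 3 continue as the retained context for their classes.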
WCCNet: Wavelet-integrated CNN with Crossmodal Rearranging Fusion for Fast Multispectral Pedestrian Detection
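paper_authors: Xingjian Wang, Li Chai, Jiming Chen, Zhiguo Shi
for: The paper proposes a new, efficient multispectral pedestrian detection framework that improves visibility for pedestrian detection while offering higher computational efficiency and accuracy.
methods: The framework is built on a dual-stream backbone: one stream uses the discrete wavelet transform (DWT) for fast inference and training, and the other uses CNN layers for feature extraction. The paper also proposes a crossmodal rearranging fusion module (CMRF) that mitigates spatial misalignment and merges semantically complementary features of spatially related local regions.
results: On the KAIST and FLIR benchmarks, WCCNet achieves a strong trade-off between computational efficiency and accuracy compared with current state-of-the-art methods. The paper also includes an ablation study and an in-depth analysis of how each component affects performance.
Abstract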
Multispectral pedestrian detection achieves better visibility in challenging conditions and thus has a broad application in various tasks, for which both the accuracy and computational cost are of paramount importance. Most existing approaches treat RGB and infrared modalities equally, typically adopting two symmetrical CNN backbones for multimodal feature extraction, which ignores the substantial differences between modalities and brings great difficulty for the reduction of the computational cost as well as effective crossmodal fusion. In this work, we propose a novel and efficient framework named WCCNet that is able to differentially extract rich features of different spectra with lower computational complexity and semantically rearranges these features for effective crossmodal fusion. Specifically, the discrete wavelet transform (DWT) allowing fast inference and training speed is embedded to construct a dual-stream backbone for efficient feature extraction. The DWT layers of WCCNet extract frequency components for infrared modality, while the CNN layers extract spatial-domain features for RGB modality. This methodology not only significantly reduces the computational complexity, but also improves the extraction of infrared features to facilitate the subsequent crossmodal fusion. Based on the well extracted features, we elaborately design the crossmodal rearranging fusion module (CMRF), which can mitigate spatial misalignment and merge semantically complementary features of spatially-related local regions to amplify the crossmodal complementary information. We conduct comprehensive evaluations on KAIST and FLIR benchmarks, in which WCCNet outperforms state-of-the-art methods with considerable computational efficiency and competitive accuracy. We also perform the ablation study and analyze thoroughly the impact of different components on the performance of WCCNet.
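Summary
Multispectral pedestrian detection offers better visibility under challenging conditions and is therefore widely used across tasks in which both accuracy and computational cost matter greatly. Most existing methods treat the RGB and infrared modalities identically, typically using two symmetrical CNN backbones for multimodal feature extraction. This ignores the substantial differences between modalities and makes it hard both to reduce computational cost and to fuse the modalities effectively. We propose a new, efficient framework named WCCNet that differentially extracts rich features of different spectra at lower computational cost and semantically rearranges them for effective crossmodal fusion. Specifically, the discrete wavelet transform (DWT), which allows fast inference and training, is embedded to build a dual-stream backbone. The DWT layers extract frequency components for the infrared modality, while the CNN layers extract spatial-domain features for the RGB modality. This not only reduces computational cost but also improves infrared feature extraction, which helps the subsequent crossmodal fusion. Building on the well-extracted features, we design the crossmodal rearranging fusion module (CMRF), which mitigates spatial misalignment and merges semantically complementary features of spatially related local regions to amplify crossmodal complementary information. In comprehensive evaluations on the KAIST and FLIR benchmarks, WCCNet outperforms state-of-the-art methods with considerable computational efficiency and competitive accuracy. We also perform an ablation study and analyze in detail how each component affects WCCNet's performance.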
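To show what a DWT layer computes on an image, the sketch below performs a single-level 2D Haar transform, splitting an image into one low-frequency sub-band and three detail sub-bands at half resolution. The Haar wavelet is an assumption made for brevity, since the abstract does not name the wavelet family, and the function name `haar_dwt2` is illustrative.

```python
def haar_dwt2(image):
    """Single-level 2-D Haar transform of an image with even height and width.

    Returns four half-resolution sub-bands:
    (approximation, column_detail, row_detail, diagonal_detail).
    """
    approx, col_detail, row_detail, diag_detail = [], [], [], []
    for r in range(0, len(image), 2):
        bands = ([], [], [], [])
        for c in range(0, len(image[0]), 2):
            # The four pixels of one 2x2 block.
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            bands[0].append((a + b + d + e) / 2)  # local average (low frequency)
            bands[1].append((a - b + d - e) / 2)  # left column minus right column
            bands[2].append((a + b - d - e) / 2)  # top row minus bottom row
            bands[3].append((a - b - d + e) / 2)  # diagonal difference
        approx.append(bands[0])
        col_detail.append(bands[1])
        row_detail.append(bands[2])
        diag_detail.append(bands[3])
    return approx, col_detail, row_detail, diag_detail


# A flat 2x2 patch has no detail; a textured one does (toy values).
flat = haar_dwt2([[1, 1], [1, 1]])
textured = haar_dwt2([[4, 2], [0, 2]])
```

Each sub-band has a quarter of the original pixels, which is one reason a wavelet stream can be cheaper than a stack of convolutions at full resolution.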
TS-RGBD Dataset: a Novel Dataset for Theatre Scenes Description for People with Visual Impairments
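paper_authors: Leyla Benhamida, Khadidja Delloul, Slimane Larabi
for: The paper provides a new RGB-D dataset for image captioning and human action recognition.
methods: RGB, depth, and skeleton sequences are captured with Microsoft Kinect, and image captioning and human action recognition models are evaluated on the data.
results: By testing image captioning models and skeleton-based human action recognition models on the TS-RGBD dataset, the authors find that these models can detect human actions and describe regions of interest in theatre scenes.
Abstract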
Computer vision was long a tool used for aiding visually impaired people to move around their environment and avoid obstacles and falls. Solutions are limited to either indoor or outdoor scenes, which limits the kind of places and scenes visually disabled people can be in, including entertainment places such as theatres. Furthermore, most of the proposed computer-vision-based methods rely on RGB benchmarks to train their models resulting in a limited performance due to the absence of the depth modality. In this paper, we propose a novel RGB-D dataset containing theatre scenes with ground truth human actions and dense captions annotations for image captioning and human action recognition: TS-RGBD dataset. It includes three types of data: RGB, depth, and skeleton sequences, captured by Microsoft Kinect. We test image captioning models on our dataset as well as some skeleton-based human action recognition models in order to extend the range of environment types where a visually disabled person can be, by detecting human actions and textually describing appearances of regions of interest in theatre scenes.
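Summary
Computer vision has long been used to help visually impaired people move around their environment and avoid obstacles and falls. Existing solutions are mostly limited to either indoor or outdoor scenes, which restricts the places and scenes visually impaired people can be in, including entertainment venues such as theatres. Moreover, most proposed computer-vision methods train on RGB benchmarks, and their performance is limited by the absence of the depth modality. In this paper, we propose a new RGB-D dataset of theatre scenes with ground-truth human actions and dense caption annotations for image captioning and human action recognition. The dataset includes three types of data, RGB, depth, and skeleton sequences, captured with Microsoft Kinect. We test image captioning models and human action recognition models on our dataset in order to extend the range of environments available to visually impaired people, by detecting human actions and textually describing regions of interest in theatre scenes.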
Point Anywhere: Directed Object Estimation from Omnidirectional Images
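methods: The method uses an omnidirectional camera to eliminate the user/object position constraint and the left/right constraint of the pointing arm. Specifically, repeatedly extracting regions of interest from equirectangular images and projecting them onto perspective images enables highly accurate estimation.
results: The method improves estimation accuracy, and training a machine learning model improves it further.
Abstract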
One of the intuitive instruction methods in robot navigation is a pointing gesture. In this study, we propose a method using an omnidirectional camera to eliminate the user/object position constraint and the left/right constraint of the pointing arm. Although the accuracy of skeleton and object detection is low due to the high distortion of equirectangular images, the proposed method enables highly accurate estimation by repeatedly extracting regions of interest from the equirectangular image and projecting them onto perspective images. Furthermore, we found that training the likelihood of the target object in machine learning further improves the estimation accuracy.
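Summary
Pointing gestures are an intuitive way to instruct a robot during navigation. In this study, we propose using an omnidirectional camera to eliminate the user/object position constraint and the left/right constraint of the pointing arm. Although the high distortion of equirectangular images lowers the accuracy of skeleton and object detection, our method achieves highly accurate estimation by repeatedly extracting regions from the equirectangular image and projecting them onto perspective images. We also find that training the likelihood of the target object with machine learning further improves estimation accuracy.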
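Projecting a region of an equirectangular image onto a perspective view starts from mapping pixels to viewing directions. The sketch below shows that mapping under an assumed convention: the image centre looks along +x, longitude grows to the right, latitude grows upward, and +z is up. The paper's own coordinate convention is not given in the abstract, and the function name is hypothetical.

```python
import math


def pixel_to_direction(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit viewing direction (x, y, z).

    Assumed convention: image centre -> +x, top row -> +z (straight up).
    """
    longitude = (u / width - 0.5) * 2.0 * math.pi  # range [-pi, pi)
    latitude = (0.5 - v / height) * math.pi        # range (-pi/2, pi/2]
    x = math.cos(latitude) * math.cos(longitude)
    y = math.cos(latitude) * math.sin(longitude)
    z = math.sin(latitude)
    return (x, y, z)


# Centre pixel of a 100x50 panorama looks straight ahead.
forward = pixel_to_direction(50, 25, 100, 50)
# Any pixel on the top row looks straight up.
up = pixel_to_direction(10, 0, 100, 50)
```

A perspective crop would then be rendered by choosing a virtual camera orientation and sampling the panorama along the directions that fall inside its field of view.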
MDT3D: Multi-Dataset Training for LiDAR 3D Object Detection Generalization
paper_authors: Louis Soum-Fontez, Jean-Emmanuel Deschaud, François Goulette
for: The paper aims to improve the robustness of 3D object detection models when tested in new environments with different sensor configurations.
methods: The authors propose Multi-Dataset Training for 3D Object Detection (MDT3D), which leverages information from several annotated source datasets to increase the robustness of 3D object detection models. A new label mapping based on coarse labels tackles the labelling gap between datasets. They also introduce a new cross-dataset augmentation method called cross-dataset object injection.
results: The authors demonstrate improvements in 3D object detection performance with MDT3D compared to training on a single dataset. The improvements hold for different types of 3D object detection models.
Abstract
Supervised 3D Object Detection models have been displaying increasingly better performance in single-domain cases where the training data comes from the same environment and sensor as the testing data. However, in real-world scenarios data from the target domain may not be available for finetuning or for domain adaptation methods. Indeed, 3D object detection models trained on a source dataset with a specific point distribution have shown difficulties in generalizing to unseen datasets. Therefore, we decided to leverage the information available from several annotated source datasets with our Multi-Dataset Training for 3D Object Detection (MDT3D) method to increase the robustness of 3D object detection models when tested in a new environment with a different sensor configuration. To tackle the labelling gap between datasets, we used a new label mapping based on coarse labels. Furthermore, we show how we managed the mix of datasets during training and finally introduce a new cross-dataset augmentation method: cross-dataset object injection. We demonstrate that this training paradigm shows improvements for different types of 3D object detection models. The source code and additional results for this research project will be publicly available on GitHub for interested parties to access and utilize: https://github.com/LouisSF/MDT3D
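Summary
Supervised 3D object detection models perform increasingly well in single-domain settings, but in real-world applications data from the target domain may not be available for finetuning or domain adaptation. Indeed, 3D object detection models trained on a dataset with a specific point distribution generalize poorly to unseen datasets. We therefore use the information from several annotated source datasets, through our Multi-Dataset Training method (MDT3D), to make 3D object detection models more robust in new environments. To bridge the labelling gap between datasets, we use a new label mapping based on coarse labels. We also describe how datasets are mixed during training and introduce a new cross-dataset augmentation method: cross-dataset object injection. We show that this training approach improves different types of 3D object detection models. The source code and additional results will be made publicly available on GitHub: https://github.com/LouisSF/MDT3D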
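A coarse label mapping of the kind this entry describes can be sketched as a per-dataset lookup table that sends each dataset-specific class to a shared coarse class. All dataset names, class names, and coarse categories below are hypothetical placeholders, not the mapping used in the paper.

```python
# Hypothetical fine-to-coarse tables for two source datasets.
COARSE_LABELS = {
    "dataset_a": {"car": "vehicle", "truck": "vehicle",
                  "pedestrian": "person", "cyclist": "cyclist"},
    "dataset_b": {"sedan": "vehicle", "bus": "vehicle",
                  "person": "person", "bicycle_rider": "cyclist"},
}


def to_coarse(dataset, label):
    """Return the shared coarse label, or None if the class has no mapping."""
    return COARSE_LABELS[dataset].get(label)


def merge_annotations(samples):
    """Relabel (dataset, label) pairs into one coarse label space.

    Classes without a coarse counterpart are dropped rather than guessed.
    """
    merged = []
    for dataset, label in samples:
        coarse = to_coarse(dataset, label)
        if coarse is not None:
            merged.append(coarse)
    return merged


mixed_batch = [("dataset_a", "truck"), ("dataset_b", "sedan"),
               ("dataset_b", "traffic_cone"), ("dataset_a", "pedestrian")]
coarse_batch = merge_annotations(mixed_batch)
```

With a shared label space in place, boxes from different datasets can be placed in the same training batch.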
Exploiting Synthetic Data for Data Imbalance Problems: Baselines from a Data Perspective
paper_authors: Moon Ye-Bin, Nam Hyeon-Woo, Wonseok Choi, Nayeong Kim, Suha Kwak, Tae-Hyun Oh
for: addressing data imbalance problems in deep neural networks
methods: utilizes synthetic data as a preliminary step before employing task-specific algorithms
results: impressive performance on several datasets, surpassing the performance of existing task-specific methods
Abstract
We live in a vast ocean of data, and deep neural networks are no exception to this. However, this data exhibits an inherent phenomenon of imbalance. This imbalance poses a risk of deep neural networks producing biased predictions, leading to potentially severe ethical and social consequences. To address these challenges, we believe that the use of generative models is a promising approach for comprehending tasks, given the remarkable advancements demonstrated by recent diffusion models in generating high-quality images. In this work, we propose a simple yet effective baseline, SYNAuG, that utilizes synthetic data as a preliminary step before employing task-specific algorithms to address data imbalance problems. This straightforward approach yields impressive performance on datasets such as CIFAR100-LT, ImageNet100-LT, UTKFace, and Waterbird, surpassing the performance of existing task-specific methods. While we do not claim that our approach serves as a complete solution to the problem of data imbalance, we argue that supplementing the existing data with synthetic data proves to be an effective and crucial preliminary step in addressing data imbalance concerns.
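Summary
We live in a vast ocean of data, and deep neural networks are no exception. This data, however, is inherently imbalanced, which can bias the predictions of deep neural networks and lead to serious ethical and social consequences. To address these challenges, we consider generative models a promising approach, given the progress recent diffusion models have shown in generating high-quality images. In this work, we propose a simple yet effective baseline, SYNAuG, which uses synthetic data as a preliminary step before applying task-specific algorithms to data imbalance problems. This simple approach performs well on datasets such as CIFAR100-LT, ImageNet100-LT, UTKFace, and Waterbird, surpassing existing task-specific methods. Although we do not claim that our approach is a complete solution to data imbalance, we argue that supplementing existing data with synthetic data is an effective and important preliminary step.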
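One simple way to plan synthetic supplementation is to compute how many generated samples each class would need to match the largest class. This is only a sketch of that bookkeeping step: the abstract does not state how much synthetic data is added, so balancing up to the majority class is an assumption, and the function name and class counts are hypothetical.

```python
def synthetic_quota(class_counts):
    """Return the number of synthetic samples needed per class.

    Assumption: every class is topped up to the size of the largest class.
    """
    target = max(class_counts.values())
    return {name: target - count for name, count in class_counts.items()}


# A toy long-tailed class distribution.
counts = {"head": 100, "middle": 10, "tail": 1}
quota = synthetic_quota(counts)
```

The resulting quota could be passed to a generative model as the number of images to produce per class before a task-specific algorithm is trained on the combined data.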
Orientation-Guided Contrastive Learning for UAV-View Geo-Localisation
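results: We ran experiments on the University-1652 and University-160k datasets and obtained better performance than previous methods. At inference time the orientation module is no longer needed, so no extra computation is required in practical use.
Abstract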
Retrieving relevant multimedia content is one of the main problems in a world that is increasingly data-driven. With the proliferation of drones, high quality aerial footage is now available to a wide audience for the first time. Integrating this footage into applications can enable GPS-less geo-localisation or location correction. In this paper, we present an orientation-guided training framework for UAV-view geo-localisation. Through hierarchical localisation orientations of the UAV images are estimated in relation to the satellite imagery. We propose a lightweight prediction module for these pseudo labels which predicts the orientation between the different views based on the contrastive learned embeddings. We experimentally demonstrate that this prediction supports the training and outperforms previous approaches. The extracted pseudo-labels also enable aligned rotation of the satellite image as augmentation to further strengthen the generalisation. During inference, we no longer need this orientation module, which means that no additional computations are required. We achieve state-of-the-art results on both the University-1652 and University-160k datasets.
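Summary
Retrieving relevant multimedia content is one of the main problems in an increasingly data-driven world. With the spread of drones, high-quality aerial footage is now available to a wide audience. Integrating this footage into applications can enable GPS-less geo-localisation or location correction. In this paper, we present an orientation-guided training framework for UAV-view geo-localisation. Through hierarchical localisation, we estimate the orientation of the UAV images relative to the satellite imagery. We propose a lightweight prediction module that predicts the orientation between the different views from the contrastively learned embeddings. Experiments show that this prediction supports training and outperforms previous approaches. The extracted pseudo-labels also enable aligned rotation of the satellite image as an augmentation that further strengthens generalisation. During inference the orientation module is no longer needed, so no additional computation is required. We achieve state-of-the-art results on the University-1652 and University-160k datasets.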
ForensicsForest Family: A Series of Multi-scale Hierarchical Cascade Forests for Detecting GAN-generated Faces
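results: In experiments, all three variants of the ForensicsForest Family achieve relatively high forgery detection rates and remain stable across different forgery rates. Compared with Hybrid ForensicsForest, Divide-and-Conquer ForensicsForest reduces memory cost while keeping a relatively high detection rate.
Abstract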
The prominent progress in generative models has significantly improved the realism of generated faces, bringing serious concerns to society. Since recent GAN-generated faces are highly realistic, the forgery traces have become more imperceptible, increasing the forensics challenge. To combat GAN-generated faces, many countermeasures based on Convolutional Neural Networks (CNNs) have been developed because of their strong learning ability. In this paper, we rethink this problem and explore a new approach based on forest models instead of CNNs. Specifically, we describe a simple and effective set of forest-based methods, called ForensicsForest Family, to detect GAN-generated faces. The proposed ForensicsForest Family is composed of three variants: ForensicsForest, Hybrid ForensicsForest, and Divide-and-Conquer ForensicsForest. ForensicsForest is a newly proposed Multi-scale Hierarchical Cascade Forest, which takes semantic, frequency and biology features as input, hierarchically cascades different levels of features for authenticity prediction, and then employs a multi-scale ensemble scheme that can comprehensively consider different levels of information to improve the performance further. Based on ForensicsForest, we develop Hybrid ForensicsForest, an extended version that integrates CNN layers into the model to further refine the effectiveness of augmented features. Moreover, to reduce the memory cost in training, we propose Divide-and-Conquer ForensicsForest, which can construct a forest model using only a portion of the training samples. In the training stage, we train several candidate forest models using subsets of the training samples. Then a ForensicsForest is assembled by picking the suitable components from these candidate forest models...
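Summary
Remarkable progress in generative models has made generated faces highly realistic, raising serious concerns for society. Because recent GAN-generated faces are so realistic, forgery traces have become more imperceptible, which makes forensics harder. Many CNN-based countermeasures have been developed to combat GAN-generated faces. This paper rethinks the problem and explores a new approach based on forest models instead of CNNs. Specifically, it describes a simple and effective family of forest-based methods, called the ForensicsForest Family, for detecting GAN-generated faces. ForensicsForest is a newly proposed multi-scale hierarchical cascade forest that takes semantic, frequency, and biology features as input, hierarchically cascades features from different levels, and then applies a multi-scale ensemble scheme that combines information from different levels to further improve performance. Building on ForensicsForest, Hybrid ForensicsForest is an extended version that integrates CNN layers into the model to further refine the augmented features. In addition, to reduce memory cost during training, Divide-and-Conquer ForensicsForest constructs a forest model from only a portion of the training samples: several candidate forest models are trained on subsets of the training samples, and a ForensicsForest is then assembled by picking suitable components from these candidates.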
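methods: The paper presents Curriculum Adaptation for Black-Box (CABB), a curriculum-guided adaptation approach that trains the target model gradually, first on target data with high-confidence (clean) labels and later on target data with noisy labels. CABB uses Jensen-Shannon divergence as a better criterion for separating clean and noisy samples, and co-trains a dual-branch network to suppress the error accumulation caused by confirmation bias.
results: Experiments show that CABB outperforms existing black-box domain adaptation models and is comparable to white-box domain adaptation models.
Abstract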
Addressing the rising concerns of privacy and security, domain adaptation in the dark aims to adapt a black-box source trained model to an unlabeled target domain without access to any source data or source model parameters. The need for domain adaptation of black-box predictors becomes even more pronounced to protect intellectual property as deep learning based solutions are becoming increasingly commercialized. Current methods distill noisy predictions on the target data obtained from the source model to the target model, and/or separate clean/noisy target samples before adapting using traditional noisy label learning algorithms. However, these methods do not utilize the easy-to-hard learning nature of the clean/noisy data splits. Also, none of the existing methods are end-to-end, and require a separate fine-tuning stage and an initial warmup stage. In this work, we present Curriculum Adaptation for Black-Box (CABB) which provides a curriculum guided adaptation approach to gradually train the target model, first on target data with high confidence (clean) labels, and later on target data with noisy labels. CABB utilizes Jensen-Shannon divergence as a better criterion for clean-noisy sample separation, compared to the traditional criterion of cross entropy loss. Our method utilizes co-training of a dual-branch network to suppress error accumulation resulting from confirmation bias. The proposed approach is end-to-end trainable and does not require any extra finetuning stage, unlike existing methods. Empirical results on standard domain adaptation datasets show that CABB outperforms existing state-of-the-art black-box DA models and is comparable to white-box domain adaptation models.
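The clean/noisy separation criterion described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the choice to compare the target model's prediction with a one-hot pseudo-label from the source model, and the fixed `threshold` are all assumptions made for illustration.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (log base 2) between matching rows of p and q."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log2(p / m), axis=-1)
    kl_qm = np.sum(q * np.log2(q / m), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def split_clean_noisy(target_probs, source_probs, threshold=0.5):
    """Hypothetical split: a sample counts as 'clean' when the target model's
    prediction is close (low JSD) to the pseudo-label from the black-box source model."""
    num_classes = source_probs.shape[1]
    pseudo_labels = np.eye(num_classes)[source_probs.argmax(axis=1)]
    jsd = js_divergence(target_probs, pseudo_labels)
    return jsd < threshold, jsd
```

With base-2 logarithms the divergence is bounded by 1, which makes a fixed threshold easier to reason about than an unbounded cross-entropy value; a curriculum would start training on the samples flagged as clean and bring in the rest later.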
Training-Free Instance Segmentation from Semantic Image Segmentation Masks
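results: Extensive experiments on two challenging datasets with a range of semantic segmentation baselines (including CNNs and Transformers) give results competitive with state-of-the-art fully-supervised instance segmentation methods, without additional human resources or increased computational cost.
Abstract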
In recent years, the development of instance segmentation has garnered significant attention in a wide range of applications. However, training a fully-supervised instance segmentation model requires costly instance-level and pixel-level annotations. In contrast, weakly-supervised instance segmentation methods (i.e., with image-level class labels or point labels) struggle to satisfy the accuracy and recall requirements of practical scenarios. In this paper, we propose a novel paradigm for instance segmentation called training-free instance segmentation (TFISeg), which achieves instance segmentation results from image masks predicted using off-the-shelf semantic segmentation models. TFISeg does not require training a semantic and/or instance segmentation model and avoids the need for instance-level image annotations. Therefore, it is highly efficient. Specifically, we first obtain a semantic segmentation mask of the input image via a trained semantic segmentation model. Then, we calculate a displacement field vector for each pixel based on the segmentation mask, which can indicate representations belonging to the same class but different instances, i.e., obtaining the instance-level object information. Finally, instance segmentation results are obtained after being refined by a learnable category-agnostic object boundary branch. Extensive experimental results on two challenging datasets and representative semantic segmentation baselines (including CNNs and Transformers) demonstrate that TFISeg can achieve competitive results compared to the state-of-the-art fully-supervised instance segmentation methods without the need for additional human resources or increased computational costs. The code is available at: TFISeg
Decomposing and Coupling Saliency Map for Lesion Segmentation in Ultrasound Images
results: The paper shows that DC-Net achieves highly accurate lesion segmentation in ultrasound images, with a marked improvement over previous methods.
Abstract
The complex scenario of ultrasound images, in which adjacent tissues (i.e., background) share similar intensity with, and even contain richer texture patterns than, the lesion region (i.e., foreground), brings a unique challenge for accurate lesion segmentation. This work presents a decomposition-coupling network, called DC-Net, to deal with this challenge in a (foreground-background) saliency map disentanglement-fusion manner. The DC-Net consists of decomposition and coupling subnets: the former preliminarily disentangles the original image into foreground and background saliency maps, and the latter then performs accurate segmentation with the assistance of saliency prior fusion. The coupling subnet involves three fusion strategies: 1) regional feature aggregation (via a differentiable context pooling operator in the encoder) to adaptively preserve local contextual details with a larger receptive field during dimension reduction; 2) relation-aware representation fusion (via a cross-correlation fusion module in the decoder) to efficiently fuse low-level visual characteristics and high-level semantic features during resolution restoration; 3) dependency-aware prior incorporation (via a coupler) to reinforce the foreground-salient representation with the complementary information derived from the background representation. Furthermore, a harmonic loss function is introduced to encourage the network to focus more attention on low-confidence and hard samples. The proposed method is evaluated on two ultrasound lesion segmentation tasks, where it demonstrates a remarkable performance improvement over existing state-of-the-art methods.
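Summary
In complex ultrasound scenes, adjacent tissue (the background) has intensity similar to, and texture patterns even richer than, the lesion region (the foreground), which makes accurate lesion segmentation uniquely challenging. This work proposes a decomposition-coupling network (DC-Net) that addresses the challenge by disentangling and then fusing foreground and background saliency maps. DC-Net consists of a decomposition subnet and a coupling subnet: the former coarsely disentangles the original image into foreground and background saliency maps, and the latter performs accurate segmentation with the help of saliency prior fusion. The coupling subnet uses three fusion strategies: 1) regional feature aggregation (via a differentiable context pooling operator in the encoder), which adaptively preserves local contextual details with a larger receptive field during dimension reduction; 2) relation-aware representation fusion (via a cross-correlation fusion module in the decoder), which efficiently fuses low-level visual characteristics and high-level semantic features during resolution restoration; 3) dependency-aware prior incorporation (via a coupler), which reinforces the foreground-salient representation with complementary information derived from the background representation. A harmonic loss function is also introduced to make the network pay more attention to low-confidence and hard samples. The method is evaluated on two ultrasound lesion segmentation tasks and shows a marked improvement over existing state-of-the-art methods.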
WaterFlow: Heuristic Normalizing Flow for Underwater Image Enhancement and Beyond
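results: Compared with state-of-the-art methods, the proposed underwater image enhancement method improves both visual quality and practical applicability.
Abstract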
Underwater images suffer from light refraction and absorption, which impairs visibility and interferes with subsequent applications. Existing underwater image enhancement methods mainly focus on image quality improvement, ignoring the effect on practical use. To balance visual quality and application, we propose a heuristic normalizing flow for detection-driven underwater image enhancement, dubbed WaterFlow. Specifically, we first develop an invertible mapping to achieve the translation between the degraded image and its clear counterpart. Considering differentiability and interpretability, we incorporate the heuristic prior into the data-driven mapping procedure, where the ambient light and medium transmission coefficient benefit credible generation. Furthermore, we introduce a detection perception module to transmit implicit semantic guidance into the enhancement procedure, so that the enhanced images hold more detection-favorable features and are able to promote detection performance. Extensive experiments demonstrate the superiority of WaterFlow over state-of-the-art methods, both quantitatively and qualitatively.
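Summary
Underwater images suffer from light refraction and absorption, which reduces visibility and interferes with subsequent applications. Existing underwater image enhancement methods mainly focus on improving image quality and ignore the effect on practical use. To balance visual quality and application, we propose WaterFlow, a heuristic normalizing flow for detection-driven underwater image enhancement. Specifically, we first develop an invertible mapping that translates between the degraded image and its clear counterpart. For differentiability and interpretability, we incorporate a heuristic prior into the data-driven mapping procedure, where the ambient light and medium transmission coefficient support credible generation. We also introduce a detection perception module that passes implicit semantic guidance into the enhancement procedure, so that the enhanced images hold more detection-favorable features and improve detection performance. Extensive experiments show that WaterFlow outperforms state-of-the-art methods both quantitatively and qualitatively.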
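The abstract names ambient light and a medium transmission coefficient as the ingredients of its heuristic prior. A minimal sketch of why such a prior pairs naturally with an invertible mapping, assuming the commonly used simplified formation model I = J·t + A·(1 − t) (this model, and the names `degrade`, `restore`, and `t_min`, are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def degrade(clear, transmission, ambient):
    """Simplified formation model: attenuated scene radiance plus scattered ambient light."""
    return clear * transmission + ambient * (1.0 - transmission)

def restore(degraded, transmission, ambient, t_min=0.1):
    """Closed-form inverse of `degrade`; transmission is floored at t_min
    so that noise is not amplified without bound."""
    t = np.maximum(transmission, t_min)
    return (degraded - ambient * (1.0 - t)) / t
```

When the transmission stays above `t_min`, `restore` undoes `degrade` exactly, which is the same degraded-to-clear and clear-to-degraded symmetry that a learned invertible mapping provides.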
Towards Discriminative Representation with Meta-learning for Colonoscopic Polyp Re-Identification
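results: Experiments show that the method clearly surpasses current state-of-the-art methods and maintains strong performance across different cameras and views.
Abstract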
Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras and plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, traditional methods for object ReID directly adopting CNN models trained on the ImageNet dataset usually produce unsatisfactory retrieval performance on colonoscopic datasets due to the large domain gap. Additionally, these methods neglect to explore the potential of self-discrepancy among intra-class relations in the colonoscopic polyp dataset, which remains an open research problem in the medical community. To solve this dilemma, we propose a simple but effective training method named Colo-ReID, which can help our model to learn more general and discriminative knowledge based on the meta-learning strategy in scenarios with fewer samples. Based on this, a dynamic Meta-Learning Regulation mechanism called MLR is introduced to further boost the performance of polyp re-identification. To the best of our knowledge, this is the first attempt to leverage the meta-learning paradigm instead of traditional machine learning to effectively train deep models in the task of colonoscopic polyp re-identification. Empirical results show that our method significantly outperforms current state-of-the-art methods by a clear margin.
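Summary
Colonoscopic polyp re-identification is an important technique in computer-aided diagnosis that aims to match the same polyp in a large gallery against images taken from different views with different cameras. However, traditional object ReID methods usually perform unsatisfactorily on this task because they directly adopt CNN models trained on the ImageNet dataset, and there is a large domain gap between ImageNet and colonoscopic data. These methods also neglect the self-discrepancy among intra-class relations in colonoscopic polyp datasets, which remains an open problem in the medical community. To address this, we propose a simple yet effective training method named Colo-ReID, which helps the model learn more general and discriminative knowledge through a meta-learning strategy in scenarios with fewer samples. We further introduce a dynamic Meta-Learning Regulation mechanism, MLR, to boost polyp re-identification performance. To the best of our knowledge, this is the first attempt to use the meta-learning paradigm rather than traditional machine learning to train deep models for colonoscopic polyp re-identification. Experiments show that our method significantly outperforms current state-of-the-art methods.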
Detection and Segmentation of Cosmic Objects Based on Adaptive Thresholding and Back Propagation Neural Network
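results: Using an Adaptive Thresholding Method and a Back Propagation Neural Network, the paper reports high-accuracy classification and detection of cosmic objects.
Abstract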
Astronomical images provide information about the great variety of cosmic objects in the Universe. Due to the large volumes of data, the presence of innumerable bright point sources as well as noise within the frame and the spatial gap between objects and satellite cameras, it is a challenging task to classify and detect the celestial objects. We propose an Adaptive Thresholding Method (ATM) based segmentation and Back Propagation Neural Network (BPNN) based cosmic object detection including a well-structured series of pre-processing steps designed to enhance segmentation and detection.
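Summary
Astronomical images provide information about the great variety of cosmic objects in the Universe. Because of the large data volumes, the innumerable bright point sources, the noise within the frame, and the spatial gap between objects and satellite cameras, classifying and detecting celestial objects is a challenging task. We propose segmentation based on an Adaptive Thresholding Method (ATM) and cosmic object detection based on a Back Propagation Neural Network (BPNN), together with a well-structured series of pre-processing steps designed to improve segmentation and detection.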
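The abstract does not spell out how its Adaptive Thresholding Method works, so the following is only a generic local-mean adaptive threshold, shown to make the idea concrete: a pixel is kept when it is brighter than the average of its neighbourhood by some margin, which copes with an uneven sky background better than one global cutoff. The function names, `window`, and `offset` are assumptions for illustration.

```python
import numpy as np

def local_mean(image, window):
    """Mean over a window x window neighbourhood (window odd), using edge padding."""
    pad = window // 2
    padded = np.pad(image.astype(np.float64), pad, mode="edge")
    # Integral image with a leading row and column of zeros.
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = image.shape
    total = (integral[window:window + h, window:window + w]
             - integral[:h, window:window + w]
             - integral[window:window + h, :w]
             + integral[:h, :w])
    return total / (window * window)

def adaptive_threshold(image, window=15, offset=0.0):
    """Boolean mask of pixels brighter than their local mean plus an offset."""
    return image > local_mean(image, window) + offset
```

The resulting mask could then be passed to a connected-component step to separate individual sources before classification.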
Continual Domain Adaptation on Aerial Images under Gradually Degrading Weather
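results: The study finds stability issues in existing buffer-fed continual adaptation methods and proposes gradient normalization as a simple way to curb training instability.
Abstract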
Domain adaptation (DA) strives to mitigate the domain gap between the source domain where a model is trained, and the target domain where the model is deployed. When a deep learning model is deployed on an aerial platform, it may face gradually degrading weather conditions during operation, leading to widening domain gaps between the training data and the encountered evaluation data. We synthesize two such gradually worsening weather conditions on real images from two existing aerial imagery datasets, generating a total of four benchmark datasets. Under the continual, or test-time adaptation setting, we evaluate three DA models on our datasets: a baseline standard DA model and two continual DA models. In this setting, the models can access only one small portion, or one batch, of the target data at a time, and adaptation takes place continually, over only one epoch of the data. The combination of the constraints of continual adaptation and gradually deteriorating weather conditions provides a practical DA scenario for aerial deployment. Among the evaluated models, we consider both convolutional and transformer architectures for comparison. We discover stability issues during adaptation for existing buffer-fed continual DA methods, and offer gradient normalization as a simple solution to curb training instability.
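Summary
Domain adaptation (DA) aims to reduce the domain gap between the source domain, where a model is trained, and the target domain, where it is deployed. When a deep learning model is deployed on an aerial platform, it may encounter gradually degrading weather during operation, which widens the gap between the training data and the data seen at evaluation time. We synthesize two such gradually worsening weather conditions on real images from two existing aerial imagery datasets, producing four benchmark datasets. Under the continual (test-time) adaptation setting, we evaluate three DA models: a standard baseline DA model and two continual DA models. In this setting, a model can access only a small portion, or one batch, of the target data at a time, and adaptation runs continually over a single epoch of the data. The combination of continual adaptation and gradually deteriorating weather gives a practical DA scenario for aerial deployment. We consider both convolutional and transformer architectures for comparison. We find stability issues during adaptation for existing buffer-fed continual DA methods and offer gradient normalization as a simple way to curb training instability.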
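The abstract names gradient normalization as the remedy but does not define it. One plausible reading, sketched here purely as an assumption, is to rescale the gradient to unit L2 norm before each update so that a single bad batch cannot produce an arbitrarily large step. The helper names and the plain SGD update are hypothetical.

```python
import numpy as np

def normalize_gradients(grads, eps=1e-8):
    """Rescale a list of gradient arrays so that their joint L2 norm is (almost) 1."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    return [g / (total_norm + eps) for g in grads]

def sgd_step(params, grads, lr=0.01):
    """One SGD update whose step length is set by lr alone, not by the gradient scale."""
    unit_grads = normalize_gradients(grads)
    return [p - lr * g for p, g in zip(params, unit_grads)]
```

In a single-epoch, one-batch-at-a-time setting this keeps the update direction while bounding its size; it differs from gradient clipping, which only intervenes above a chosen threshold.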
Survey on Computer Vision Techniques for Internet-of-Things Devices
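results: The paper summarizes the advantages, disadvantages, and open research problems of various low-power techniques, and their application to convolutional and transformer DNNs.
Abstract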
Deep neural networks (DNNs) are state-of-the-art techniques for solving most computer vision problems. DNNs require billions of parameters and operations to achieve state-of-the-art results. This requirement makes DNNs extremely compute-, memory-, and energy-hungry, and consequently difficult to deploy on small battery-powered Internet-of-Things (IoT) devices with limited computing resources. Deployment of DNNs on Internet-of-Things devices, such as traffic cameras, can improve public safety by enabling applications such as automatic accident detection and emergency response. Through this paper, we survey the recent advances in low-power and energy-efficient DNN implementations that improve the deployability of DNNs without significantly sacrificing accuracy. In general, these techniques either reduce the memory requirements, the number of arithmetic operations, or both. The techniques can be divided into three major categories: neural network compression, network architecture search and design, and compiler and graph optimizations. In this paper, we survey low-power techniques for both convolutional and transformer DNNs, and summarize the advantages, disadvantages, and open research problems.
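Summary
Deep neural networks (DNNs) are the state-of-the-art technique for most computer vision problems. DNNs need billions of parameters and operations to reach state-of-the-art results, which makes them extremely compute-, memory-, and energy-hungry and therefore hard to deploy on small, battery-powered Internet-of-Things (IoT) devices with limited computing resources. Deploying DNNs on IoT devices such as traffic cameras can nevertheless improve public safety, for example through automatic accident detection and emergency response. This paper surveys recent advances in low-power and energy-efficient DNN implementations that improve deployability without significantly sacrificing accuracy. The techniques fall into three major categories: neural network compression, network architecture search and design, and compiler and graph optimizations. The paper covers low-power techniques for both convolutional and transformer DNNs and summarizes their advantages, disadvantages, and open research problems.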
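To make the "neural network compression" category concrete, here is a minimal sketch of one widely used form of compression, symmetric 8-bit weight quantization. The abstract does not say which compression techniques the survey covers, so treating quantization as the example is an assumption, as are the function names.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization of float weights to int8 plus one scale."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 codes."""
    return q.astype(np.float32) * scale
```

Storing int8 codes instead of float32 values cuts weight memory by a factor of four, at the price of a rounding error of at most half a quantization step per weight.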
Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion
for: This paper aims to prevent a generative model from producing inappropriate or dangerous content, such as specific intellectual property (IP), human faces, and various artistic styles, by shielding the unwanted concepts from the model's weights.
methods: The proposed method, Degeneration-Tuning (DT), uses Scrambled Grid to reconstruct the correlation between undesired concepts and their corresponding image domain, and guides the text-to-image diffusion model to generate meaningless content when such textual concepts are provided as input.
results: DT shields the unwanted concepts from the model's weights without significantly affecting the generative quality of other content. The FID and IS scores of the model on COCO-30K change only slightly after DT, from 12.61 and 39.20 to 13.04 and 38.25, respectively, which outperforms previous methods.
Abstract
Owing to the unrestricted nature of the content in the training data, large text-to-image diffusion models, such as Stable Diffusion (SD), are capable of generating images with potentially copyrighted or dangerous content based on the corresponding textual concepts. This includes specific intellectual property (IP), human faces, and various artistic styles. However, Negative Prompt, a widely used method for content removal, frequently fails to conceal this content due to inherent limitations in its inference logic. In this work, we propose a novel strategy named Degeneration-Tuning (DT) to shield contents of unwanted concepts from SD weights. By utilizing Scrambled Grid to reconstruct the correlation between undesired concepts and their corresponding image domain, we guide SD to generate meaningless content when such textual concepts are provided as input. As this adaptation occurs at the level of the model's weights, the SD, after DT, can be grafted onto other conditional diffusion frameworks like ControlNet to shield unwanted concepts. In addition to qualitatively showcasing the effectiveness of our DT method in protecting various types of concepts, a quantitative comparison of the SD before and after DT indicates that the DT method does not significantly impact the generative quality of other contents. The FID and IS scores of the model on COCO-30K exhibit only minor changes after DT, shifting from 12.61 and 39.20 to 13.04 and 38.25, respectively, which clearly outperforms the previous methods.
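Summary
Because the content of their training data is unrestricted, large text-to-image diffusion models such as Stable Diffusion (SD) can generate images with potentially copyrighted or dangerous content from the corresponding textual concepts, including specific intellectual property (IP), human faces, and various artistic styles. Negative Prompt, a widely used content removal method, often fails to conceal this content because of inherent limitations in its inference logic. In this work, we propose a new strategy, Degeneration-Tuning (DT), to shield the content of unwanted concepts from the SD weights. By using Scrambled Grid to reconstruct the correlation between undesired concepts and their corresponding image domain, we guide SD to generate meaningless content when such textual concepts are given as input. Because the adaptation happens at the level of the model weights, SD after DT can be grafted onto other conditional diffusion frameworks such as ControlNet. A quantitative comparison shows that DT does not significantly affect the generative quality of other content: on COCO-30K, FID moves from 12.61 to 13.04 (a change of 0.43) and IS from 39.20 to 38.25 (a change of 0.95), which outperforms previous methods.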
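The abstract does not describe the Scrambled Grid operation in detail. One natural reading, offered here only as an assumption, is that an image of the unwanted concept is cut into a grid of tiles that are then shuffled, which keeps local texture but destroys the global structure that makes the concept recognisable. The function name, `grid`, and `seed` are hypothetical.

```python
import numpy as np

def scramble_grid(image, grid=4, seed=0):
    """Cut an H x W x C image into grid x grid tiles and shuffle the tile positions."""
    h, w, c = image.shape
    if h % grid or w % grid:
        raise ValueError("image height and width must be divisible by grid")
    th, tw = h // grid, w // grid
    # (grid, th, grid, tw, c) -> (grid * grid, th, tw, c): one entry per tile.
    tiles = (image.reshape(grid, th, grid, tw, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(grid * grid, th, tw, c))
    rng = np.random.default_rng(seed)
    tiles = tiles[rng.permutation(grid * grid)]
    # Reassemble the shuffled tiles into an image of the original size.
    return (tiles.reshape(grid, grid, th, tw, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(h, w, c))
```

Under this reading, such scrambled images would serve as fine-tuning targets for prompts that contain the unwanted concept, so the model learns to associate the concept with meaningless output.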
Virtual histological staining of unlabeled autopsy tissue
paper_authors: Yuzhu Li, Nir Pillar, Jingxi Li, Tairan Liu, Di Wu, Songyu Sun, Guangdong Ma, Kevin de Haan, Luzhe Huang, Sepehr Hamidi, Anatoly Urisman, Tal Keidar Haran, William Dean Wallace, Jonathan E. Zuckerman, Aydogan Ozcan
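results: The researchers find that virtual staining can rapidly and cost-effectively generate high-quality H&E-equivalent images, even for samples with severe autolysis and cell death, such as COVID-19 samples, where traditional histochemical staining fails to provide consistent staining quality. The technique can also be extended to necrotic tissue and reduces the labor, cost, and infrastructure requirements of staining autopsy tissue.
Abstract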
Histological examination is a crucial step in an autopsy; however, the traditional histochemical staining of post-mortem samples faces multiple challenges, including the inferior staining quality due to autolysis caused by delayed fixation of cadaver tissue, as well as the resource-intensive nature of chemical staining procedures covering large tissue areas, which demand substantial labor, cost, and time. These challenges can become more pronounced during global health crises when the availability of histopathology services is limited, resulting in further delays in tissue fixation and more severe staining artifacts. Here, we report the first demonstration of virtual staining of autopsy tissue and show that a trained neural network can rapidly transform autofluorescence images of label-free autopsy tissue sections into brightfield equivalent images that match hematoxylin and eosin (H&E) stained versions of the same samples, eliminating autolysis-induced severe staining artifacts inherent in traditional histochemical staining of autopsied tissue. Our virtual H&E model was trained using >0.7 TB of image data and a data-efficient collaboration scheme that integrates the virtual staining network with an image registration network. The trained model effectively accentuated nuclear, cytoplasmic and extracellular features in new autopsy tissue samples that experienced severe autolysis, such as COVID-19 samples never seen before, where the traditional histochemical staining failed to provide consistent staining quality. This virtual autopsy staining technique can also be extended to necrotic tissue, and can rapidly and cost-effectively generate artifact-free H&E stains despite severe autolysis and cell death, also reducing labor, cost and infrastructure requirements associated with the standard histochemical staining.
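Summary
Histological examination is a crucial step in an autopsy, but traditional histochemical staining of post-mortem samples faces several challenges, including inferior staining quality due to autolysis caused by delayed fixation of cadaver tissue, and the resource-intensive nature of chemically staining large tissue areas, which demands substantial labor, cost, and time. During global health crises, when histopathology services are limited, fixation is delayed further and staining artifacts become more severe. We report the first demonstration of virtual staining of autopsy tissue: a trained neural network rapidly transforms autofluorescence images of label-free autopsy tissue sections into brightfield-equivalent images that match the H&E-stained versions of the same samples, eliminating the severe autolysis-induced staining artifacts. Our virtual H&E model was trained using >0.7 TB of image data and a data-efficient collaboration scheme that integrates the virtual staining network with an image registration network. The trained model accentuates nuclear, cytoplasmic, and extracellular features in new autopsy samples with severe autolysis, such as previously unseen COVID-19 samples, where traditional histochemical staining failed to provide consistent quality. The technique can also be extended to necrotic tissue and can rapidly and cost-effectively generate artifact-free H&E stains despite severe autolysis and cell death, while reducing the labor, cost, and infrastructure requirements of standard histochemical staining.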
A Novel Cross-Perturbation for Single Domain Generalization
for: The paper aims to improve a model's ability to generalize to unknown domains when trained on a single source domain, using cross-perturbation and feature-level perturbation.
methods: The paper proposes a simple yet effective cross-perturbation method called CPerb, which uses both horizontal and vertical operations to increase data diversity and learn domain-invariant features. It also proposes a novel feature-level perturbation method called MixPatch, which exploits local image style information to further diversify the training data.
results: Extensive experiments on various benchmark datasets validate the effectiveness of the proposed method in enhancing the generalization capability of the model.
Abstract
Single domain generalization aims to enhance the ability of the model to generalize to unknown domains when trained on a single source domain. However, the limited diversity in the training data hampers the learning of domain-invariant features, resulting in compromised generalization performance. To address this, data perturbation (augmentation) has emerged as a crucial method to increase data diversity. Nevertheless, existing perturbation methods often focus on either image-level or feature-level perturbations independently, neglecting their synergistic effects. To overcome these limitations, we propose CPerb, a simple yet effective cross-perturbation method. Specifically, CPerb utilizes both horizontal and vertical operations. Horizontally, it applies image-level and feature-level perturbations to enhance the diversity of the training data, mitigating the issue of limited diversity in single-source domains. Vertically, it introduces multi-route perturbation to learn domain-invariant features from different perspectives of samples with the same semantic category, thereby enhancing the generalization capability of the model. Additionally, we propose MixPatch, a novel feature-level perturbation method that exploits local image style information to further diversify the training data. Extensive experiments on various benchmark datasets validate the effectiveness of our method.
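Summary
Single domain generalization aims to improve a model's ability to generalize to unknown domains when it is trained on a single source domain, but the limited diversity of the training data hampers the learning of domain-invariant features. Data perturbation (augmentation) has therefore become a key way to increase data diversity. Existing perturbation methods, however, usually focus on either image-level or feature-level perturbations alone and neglect their synergy. To overcome these limitations, we propose CPerb, a simple yet effective cross-perturbation method. CPerb applies both image-level and feature-level perturbations to increase the diversity of the training data, mitigating the limited diversity of a single source domain. We also propose MixPatch, a novel feature-level perturbation method that exploits local image style information to further diversify the training data. Extensive experiments validate the effectiveness of our method.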
ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
for: This paper aims to propose a novel image manipulation method that can accurately reflect human intentions without relying on external cross-modal language information.
methods: The proposed method, called ImageBrush, learns visual instructions for image editing by employing a pair of transformation images as visual instructions. The method uses a diffusion-based inpainting approach to capture the underlying intentions from visual demonstrations and apply them to a new image.
results: The proposed method generates engaging manipulation results that conform to the transformations entailed in the demonstrations. The method also exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation, and video inpainting.
Abstract
While language-guided image manipulation has made remarkable progress, the challenge of how to instruct the manipulation process faithfully reflecting human intentions persists. An accurate and comprehensive description of a manipulation task using natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity present in linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If this possibility exists, the inherent modality gap would be effortlessly eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely captures human intention but also facilitates accessibility in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intentions solely from visual demonstrations and then applying this operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where the contextual information is fully exploited through an iterative process of generation. A visual prompting encoder is carefully devised to enhance the model's capacity in uncovering human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.
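Summary
Language-guided image manipulation has made remarkable progress, but instructing the manipulation process so that it faithfully reflects human intentions remains a challenge. Describing a manipulation task accurately and comprehensively in natural language is laborious and sometimes impossible, mainly because of the inherent uncertainty and ambiguity of linguistic expressions. Can image manipulation be accomplished without resorting to external cross-modal language information? If so, the inherent modality gap would be eliminated. In this paper, we propose a novel manipulation method, ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to use a pair of transformation images as the visual instruction, which both captures human intention precisely and is easy to provide in real-world scenarios.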
Deep Learning Approaches in Pavement Distress Identification: A Review
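results: The paper finds that UAVs combined with deep learning algorithms can effectively detect and classify pavement distress, and that 2D image processing has advanced while 3D image processing still faces challenges.
Abstract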
This paper presents a comprehensive review of recent advancements in image processing and deep learning techniques for pavement distress detection and classification, a critical aspect in modern pavement management systems. The conventional manual inspection process conducted by human experts is gradually being superseded by automated solutions, leveraging machine learning and deep learning algorithms to enhance efficiency and accuracy. The ability of these algorithms to discern patterns and make predictions based on extensive datasets has revolutionized the domain of pavement distress identification. The paper investigates the integration of unmanned aerial vehicles (UAVs) for data collection, offering unique advantages such as aerial perspectives and efficient coverage of large areas. By capturing high-resolution images, UAVs provide valuable data that can be processed using deep learning algorithms to detect and classify various pavement distresses effectively. While the primary focus is on 2D image processing, the paper also acknowledges the challenges associated with 3D images, such as sensor limitations and computational requirements. Understanding these challenges is crucial for further advancements in the field. The findings of this review significantly contribute to the evolution of pavement distress detection, fostering the development of efficient pavement management systems. As automated approaches continue to mature, the implementation of deep learning techniques holds great promise in ensuring safer and more durable road infrastructure for the benefit of society.
Summary
The paper explores the integration of unmanned aerial vehicles (UAVs) for data collection, which offers unique advantages such as aerial perspectives and efficient coverage of large areas. By capturing high-resolution images, UAVs provide valuable data that can be processed using deep learning algorithms to detect and classify various pavement distresses effectively. While the primary focus is on 2D image processing, the paper also acknowledges the challenges associated with 3D images, such as sensor limitations and computational requirements. Understanding these challenges is crucial for further advancements in the field. The findings of this review significantly contribute to the evolution of pavement distress detection, fostering the development of efficient pavement management systems. As automated approaches continue to mature, the implementation of deep learning techniques holds great promise in ensuring safer and more durable road infrastructure for the benefit of society.
Addressing Uncertainty in Imbalanced Histopathology Image Classification of HER2 Breast Cancer: An interpretable Ensemble Approach with Threshold Filtered Single Instance Evaluation (SIE)
paper_authors: Md Sakib Hossain Shovon, M. F. Mridha, Khan Md Hasib, Sultan Alfarhood, Mejdl Safran, Dunren Che
for: This paper aims to develop an accurate and robust method for diagnosing breast cancer (BC) subtypes based on the expression of Human Epidermal Growth Factor Receptor (HER2).
methods: The proposed method utilizes an ensemble approach that combines DenseNet201 and Xception feature extractors, followed by a single instance evaluation (SIE) technique to determine different confidence levels and adjust the decision boundary among the imbalanced classes.
results: The proposed approach, called DenseNet201-Xception-SIE, achieved an accuracy of 97.12% on H&E data and 97.56% on IHC data, outperforming all other existing state-of-the-art models. The use of Grad-CAM and Guided Grad-CAM provided insights into how the model works and makes decisions on the histopathology dataset.
Abstract
Breast cancer (BC) is among the most lethal health concerns for women. Early diagnosis can reduce the mortality rate by helping patients make efficient treatment decisions. Human Epidermal Growth Factor Receptor (HER2) has become one of the most lethal subtypes of BC. According to the College of American Pathologists/American Society of Clinical Oncology (CAP/ASCO), the severity level of HER2 expression can be classified in the range 0 to 3+. HER2 can be detected effectively from immunohistochemical (IHC) and hematoxylin & eosin (H&E) images of different classes such as 0, 1+, 2+, and 3+. This study proposes an ensemble approach integrated with a threshold filtered single instance evaluation (SIE) technique to diagnose BC from the multi-categorical expression of HER2 subtypes. First, DenseNet201 and Xception are ensembled into a single classifier as feature extractors, combined with global average pooling, a dropout layer, a dense layer with a swish activation function, an L2 regularizer, batch normalization, etc. The extracted features are then processed through single instance evaluation (SIE) to determine different confidence levels and adjust the decision boundary among the imbalanced classes. The study was conducted on the BC immunohistochemical (BCI) dataset, which is classified by pathologists into four stages of HER2 BC. The proposed approach, known as DenseNet201-Xception-SIE, with a threshold value of 0.7 surpassed all other existing state-of-the-art models with an accuracy of 97.12%, precision of 97.15%, and recall of 97.68% on H&E data, and an accuracy of 97.56%, precision of 97.57%, and recall of 98.00% on IHC data, a substantial improvement. Finally, Grad-CAM and Guided Grad-CAM are employed to interpret how the TL-based model works on the histopathology dataset and makes decisions from the data.
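Summary
Breast cancer (BC) is one of the most lethal health problems for women. Early diagnosis can lower the mortality rate by helping patients make effective treatment decisions. Human Epidermal Growth Factor Receptor (HER2) is one of the most lethal BC subtypes. According to the classification of the College of American Pathologists/American Society of Clinical Oncology (CAP/ASCO), the severity of HER2 expression ranges from 0 to 3+. HER2 can be detected effectively from immunohistochemical (IHC) and hematoxylin and eosin (HE) images. This study proposes an ensemble method for diagnosing BC from multi-category HER2 expression. The method first ensembles DenseNet201 and Xception into a single classifier as feature extractors, combined with global average pooling, a dropout layer, a dense layer with a swish activation function, an L2 regularizer, and batch normalization. The extracted features are then processed by single instance evaluation (SIE) to determine different confidence levels and adjust the decision boundary among the imbalanced classes. Experiments were run on the BC immunohistochemical (BCI) dataset, which pathologists classify into four stages of HER2 BC. The proposed method, called DenseNet201-Xception-SIE, reached 97.12% accuracy, 97.15% precision, and 97.68% recall on HE data, and 97.56% accuracy, 97.57% precision, and 98.00% recall on IHC data, a large improvement over all existing state-of-the-art models. Finally, Grad-CAM and Guided Grad-CAM were used to interpret how the TL-based model works on the histopathology dataset and makes decisions from the data.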
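As a rough illustration of the threshold-filtered evaluation described in this entry, the sketch below keeps a prediction only when its top class probability reaches 0.7 (the threshold value quoted in the abstract) and flags the rest as uncertain. The function name `filter_by_confidence`, the class list, and the example probabilities are hypothetical, and the paper's actual SIE procedure may differ.

```python
# Hypothetical sketch: confidence-threshold filtering of per-image class probabilities.
HER2_CLASSES = ["0", "1+", "2+", "3+"]  # HER2 score classes named in the abstract


def filter_by_confidence(probabilities, threshold=0.7):
    """Return (label, confidence) per sample; label is None when below the threshold."""
    results = []
    for probs in probabilities:
        # Index of the most probable class for this sample.
        best = max(range(len(probs)), key=lambda i: probs[i])
        confidence = probs[best]
        label = HER2_CLASSES[best] if confidence >= threshold else None
        results.append((label, confidence))
    return results


if __name__ == "__main__":
    example = [
        [0.05, 0.10, 0.05, 0.80],  # confident sample: kept
        [0.30, 0.40, 0.20, 0.10],  # ambiguous sample: flagged as uncertain
    ]
    for label, confidence in filter_by_confidence(example):
        print(label, confidence)
```

Samples returned with a `None` label could be routed to manual review or re-evaluated with a different decision rule; that choice is outside what the abstract specifies.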
Body Knowledge and Uncertainty Modeling for Monocular 3D Human Body Reconstruction
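results: Experiments show that KNOWN's body reconstruction outperforms prior weakly-supervised approaches, particularly on the challenging minority images.
Abstract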
While 3D body reconstruction methods have made remarkable progress recently, it remains difficult to acquire the sufficiently accurate and numerous 3D supervisions required for training. In this paper, we propose KNOWN, a framework that effectively utilizes body KNOWledge and uNcertainty modeling to compensate for insufficient 3D supervisions. KNOWN exploits a comprehensive set of generic body constraints derived from well-established body knowledge. These generic constraints precisely and explicitly characterize the reconstruction plausibility and enable 3D reconstruction models to be trained without any 3D data. Moreover, existing methods typically use images from multiple datasets during training, which can result in data noise (e.g., inconsistent joint annotation) and data imbalance (e.g., minority images representing unusual poses or captured from challenging camera views). KNOWN solves these problems through a novel probabilistic framework that models both aleatoric and epistemic uncertainty. Aleatoric uncertainty is encoded in a robust Negative Log-Likelihood (NLL) training loss, while epistemic uncertainty is used to guide model refinement. Experiments demonstrate that KNOWN's body reconstruction outperforms prior weakly-supervised approaches, particularly on the challenging minority images.
Summary
Recently, 3D body reconstruction methods have made significant progress, but acquiring sufficient and accurate 3D supervisions for training remains a challenge. In this paper, we propose KNOWN, a framework that effectively utilizes body knowledge and uncertainty modeling to compensate for insufficient 3D supervisions. KNOWN leverages a comprehensive set of generic body constraints derived from well-established body knowledge, which precisely and explicitly characterize the reconstruction plausibility and enable 3D reconstruction models to be trained without any 3D data. Moreover, existing methods typically use images from multiple datasets during training, which can result in data noise (e.g., inconsistent joint annotation) and data imbalance (e.g., minority images representing unusual poses or captured from challenging camera views). KNOWN solves these problems through a novel probabilistic framework that models both aleatoric and epistemic uncertainty. Aleatoric uncertainty is encoded in a robust Negative Log-Likelihood (NLL) training loss, while epistemic uncertainty is used to guide model refinement. Experimental results demonstrate that KNOWN's body reconstruction outperforms prior weakly-supervised approaches, particularly on challenging minority images.
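To make the role of aleatoric uncertainty in an NLL loss concrete, here is a minimal sketch of a heteroscedastic Gaussian negative log-likelihood, in which the model predicts a variance alongside each estimate. This is the standard textbook form, not necessarily the robust NLL used by KNOWN; `gaussian_nll` and its argument names are illustrative.

```python
import math


def gaussian_nll(target, mean, variance):
    """Negative log-likelihood of `target` under a Gaussian N(mean, variance)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    # Log-normaliser term plus the squared residual scaled by the predicted variance.
    return 0.5 * (math.log(2 * math.pi * variance) + (target - mean) ** 2 / variance)


if __name__ == "__main__":
    # Same residual of 2.0, evaluated with a small and a large predicted variance.
    print(gaussian_nll(target=2.0, mean=0.0, variance=1.0))
    print(gaussian_nll(target=2.0, mean=0.0, variance=4.0))
```

For a fixed residual, predicting a larger variance shrinks the squared-error term but pays a log-variance cost, which is how a loss of this kind can tolerate noisy annotations without ignoring them entirely.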
Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking
methods: The paper introduces the confidence state and height state as potential weak cues, used alongside velocity direction.
results: Compared with existing methods, the approach performs better across diverse trackers and scenarios, and delivers significant improvements in a plug-and-play, training-free manner. On the DanceTrack benchmark in particular, it achieves superior performance thanks to its ability to handle interaction and occlusion.
Abstract
Multi-Object Tracking (MOT) aims to detect and associate all desired objects across frames. Most methods accomplish the task by explicitly or implicitly leveraging strong cues (i.e., spatial and appearance information), which exhibit powerful instance-level discrimination. However, when object occlusion and clustering occur, both spatial and appearance information will become ambiguous simultaneously due to the high overlap between objects. In this paper, we demonstrate that this long-standing challenge in MOT can be efficiently and effectively resolved by incorporating weak cues to compensate for strong cues. Along with velocity direction, we introduce the confidence state and height state as potential weak cues. With superior performance, our method still maintains Simple, Online and Real-Time (SORT) characteristics. Furthermore, our method shows strong generalization for diverse trackers and scenarios in a plug-and-play and training-free manner. Significant and consistent improvements are observed when applying our method to 5 different representative trackers. Further, by leveraging both strong and weak cues, our method Hybrid-SORT achieves superior performance on diverse benchmarks, including MOT17, MOT20, and especially DanceTrack where interaction and occlusion are frequent and severe. The code and models are available at https://github.com/ymzis69/HybirdSORT.
Summary
Accessibility and Inclusiveness of New Information and Communication Technologies for Disabled Users and Content Creators in the Metaverse
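results: The study finds that the active involvement of physically disabled individuals in the design and development of Metaverse platforms is key to promoting inclusivity. It also stresses the need for further research and collaboration to establish standards and regulations for disabled individuals in Metaverse projects.
Abstract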
Despite the proliferation of Blockchain Metaverse projects, the inclusion of physically disabled individuals in the Metaverse remains distant, with limited standards and regulations in place. However, the article proposes a concept of the Metaverse that leverages emerging technologies, such as Virtual and Augmented Reality, and the Internet of Things, to enable greater engagement of disabled creatives. This approach aims to enhance inclusiveness in the Metaverse landscape. Based on the findings, the paper concludes that the active involvement of physically disabled individuals in the design and development of Metaverse platforms is crucial for promoting inclusivity. The proposed framework for accessibility and inclusiveness in Virtual, Augmented, and Mixed realities of decentralised Metaverses provides a basis for the meaningful participation of disabled creatives. The article emphasises the importance of addressing the mechanisms for art production by individuals with disabilities in the emerging Metaverse landscape. Additionally, it highlights the need for further research and collaboration to establish standards and regulations that facilitate the inclusion of physically disabled individuals in Metaverse projects.
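Summary
Despite the proliferation of Metaverse projects, the inclusion of disabled individuals in the Metaverse remains distant, with limited standards and regulations in place. However, this article proposes a Metaverse concept built on emerging technologies such as virtual and augmented reality and the Internet of Things to better promote the participation of disabled creatives. The approach aims to improve inclusiveness in the Metaverse landscape. Based on its findings, the article concludes that the active involvement of physically disabled individuals in the design and development of Metaverse platforms is key to promoting inclusivity. The article also emphasises that addressing the mechanisms for art production by individuals with disabilities, in the virtual, augmented, and mixed realities of decentralised Metaverses, is an important issue in the emerging Metaverse landscape. In addition, it highlights the need for further research and collaboration to establish standards and regulations for including physically disabled individuals in Metaverse projects.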
High-Fidelity Eye Animatable Neural Radiance Fields for Human Face
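results: Experiments on the ETH-XGaze dataset show that the model can generate high-quality images with accurate eyeball rotation and non-rigid periocular deformation. In addition, using the generated images effectively improves gaze estimation performance.
Abstract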
Face rendering using neural radiance fields (NeRF) is a rapidly developing research area in computer vision. While recent methods primarily focus on controlling facial attributes such as identity and expression, they often overlook the crucial aspect of modeling eyeball rotation, which holds importance for various downstream tasks. In this paper, we aim to learn a face NeRF model that is sensitive to eye movements from multi-view images. We address two key challenges in eye-aware face NeRF learning: how to effectively capture eyeball rotation for training and how to construct a manifold for representing eyeball rotation. To accomplish this, we first fit FLAME, a well-established parametric face model, to the multi-view images considering multi-view consistency. Subsequently, we introduce a new Dynamic Eye-aware NeRF (DeNeRF). DeNeRF transforms 3D points from different views into a canonical space to learn a unified face NeRF model. We design an eye deformation field for the transformation, including rigid transformation, e.g., eyeball rotation, and non-rigid transformation. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our model is capable of generating high-fidelity images with accurate eyeball rotation and non-rigid periocular deformation, even under novel viewing angles. Furthermore, we show that utilizing the rendered images can effectively enhance gaze estimation performance.
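Summary
Face rendering with neural radiance fields (NeRF) is a rapidly developing research area in computer vision. Recent methods mainly control facial attributes such as identity and expression, but they often overlook eyeball rotation, which matters for many downstream tasks. In this paper, we aim to learn, from multi-view images, a face NeRF model that is sensitive to eyeball rotation. We address two key challenges: how to capture eyeball rotation effectively for training, and how to construct a manifold for representing eyeball rotation. To do so, we first fit FLAME, a well-established parametric face model, to the multi-view images while taking multi-view consistency into account. We then introduce a new Dynamic Eye-aware NeRF (DeNeRF). DeNeRF transforms 3D points from different views into a canonical space to learn a unified face NeRF model. We design an eye deformation field that includes rigid transformation, such as eyeball rotation, and non-rigid transformation. Experiments on the ETH-XGaze dataset show that our model can generate high-quality images with accurate eyeball rotation and non-rigid periocular deformation, even under novel viewing angles. We also show that using the rendered images effectively improves gaze estimation performance.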
Decomposition Ascribed Synergistic Learning for Unified Image Restoration
paper_authors: Jinghao Zhang, Jie Huang, Man Zhou, Chongyi Li, Feng Zhao
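for: This study aims to provide a joint learning mechanism across multiple image degradation restoration tasks for real-world application scenarios.
methods: We decompose the various degradations using singular value decomposition (SVD), analyse the different types of degradation information, and divide them into two groups: singular vector dominated and singular value dominated.
results: We propose a DASL-based image restoration method with two effective operators, Singular VEctor Operator (SVEO) and Singular VAlue Operator (SVAO), that can be integrated easily into existing convolutional image restoration backbones. We also propose a congruous decomposition loss. Extensive experiments on five blended image restoration tasks demonstrate the effectiveness of the method.
Abstract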
Learning to restore multiple image degradations within a single model is quite beneficial for real-world applications. Nevertheless, existing works typically concentrate on regarding each degradation independently, while their relationship has been less exploited to ensure the synergistic learning. To this end, we revisit the diverse degradations through the lens of singular value decomposition, with the observation that the decomposed singular vectors and singular values naturally undertake the different types of degradation information, dividing various restoration tasks into two groups, i.e., singular vector dominated and singular value dominated. The above analysis renders a more unified perspective to ascribe the diverse degradations, compared to previous task-level independent learning. The dedicated optimization of degraded singular vectors and singular values inherently utilizes the potential relationship among diverse restoration tasks, attributing to the Decomposition Ascribed Synergistic Learning (DASL). Specifically, DASL comprises two effective operators, namely, Singular VEctor Operator (SVEO) and Singular VAlue Operator (SVAO), to favor the decomposed optimization, which can be lightly integrated into existing convolutional image restoration backbone. Moreover, the congruous decomposition loss has been devised for auxiliary. Extensive experiments on blended five image restoration tasks demonstrate the effectiveness of our method, including image deraining, image dehazing, image denoising, image deblurring, and low-light image enhancement.
Summary
To exploit the potential relationship among diverse restoration tasks, we propose Decomposition Ascribed Synergistic Learning (DASL). DASL consists of two operators: Singular VEctor Operator (SVEO) and Singular VAlue Operator (SVAO), which favor decomposed optimization. These operators can be easily integrated into existing convolutional image restoration backbones. Additionally, we introduce a congruous decomposition loss for auxiliary purposes. Our method is tested on five blended image restoration tasks: image deraining, image dehazing, image denoising, image deblurring, and low-light image enhancement. Extensive experiments demonstrate the effectiveness of our approach.
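The decomposition view behind this entry can be illustrated with a small NumPy sketch: a matrix standing in for an image patch is factored by SVD, and recombining the singular vectors of one matrix with the singular values of another shows that the two components can be handled separately. The helper `recombine` and the random stand-in arrays are assumptions for illustration; this is not the SVEO/SVAO implementation.

```python
import numpy as np


def recombine(vector_source, value_source):
    """Combine the singular vectors of one matrix with the singular values of another."""
    u, _, vt = np.linalg.svd(vector_source, full_matrices=False)
    _, s, _ = np.linalg.svd(value_source, full_matrices=False)
    return u @ np.diag(s) @ vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.random((8, 8))   # stand-in for a clean grayscale patch
    degraded = 0.5 * clean       # uniform dimming rescales only the singular values
    restored = recombine(degraded, clean)
    print(np.allclose(restored, clean))
```

In this toy case the degradation lives entirely in the singular values, so swapping in the clean values recovers the patch; real degradations generally affect both components to different degrees.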
LISA: Reasoning Segmentation via Large Language Model
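results: Experiments show that LISA handles complex reasoning, world knowledge, explanatory answers, and multi-turn conversation, and that it still performs well zero-shot when trained only on reasoning-free datasets. Fine-tuning improves performance further.
Abstract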
Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction to identify the target objects or categories before executing visual recognition tasks. Such systems lack the ability to actively reason and comprehend implicit user intentions. In this work, we propose a new segmentation task -- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of the multi-modal Large Language Model (LLM) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving: 1) complex reasoning; 2) world knowledge; 3) explanatory answers; 4) multi-turn conversation. Also, it demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs results in further performance enhancement. Experiments show our method not only unlocks new reasoning segmentation capabilities but also proves effective in both complex reasoning segmentation and standard referring segmentation tasks. Code, models, and demo are at https://github.com/dvlab-research/LISA.
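Summary
Although perception systems have made remarkable progress in recent years, they still need explicit human instructions to identify target objects or categories before carrying out visual recognition tasks. Such systems lack the ability to actively reason about and understand users' implicit intentions. In this work, we propose a new segmentation task: reasoning segmentation. The task is to output a segmentation mask given a complex and implicit query text. We also build a benchmark of over one thousand image-instruction pairs that incorporates intricate reasoning and world knowledge for evaluation. Finally, we present LISA: large Language Instructed Segmentation Assistant, which inherits the language generation ability of the multi-modal Large Language Model (LLM) while also being able to produce segmentation masks. We add a token to the original vocabulary and propose the embedding-as-mask paradigm to unlock the segmentation capability. Notably, LISA can handle: 1) complex reasoning; 2) world knowledge; 3) explanatory answers; 4) multi-turn conversation. It also shows zero-shot capability when trained only on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs yields a further performance gain. Experiments show that our method not only unlocks new reasoning segmentation capabilities but is also effective on both complex reasoning segmentation and standard referring segmentation tasks. Code, models, and demo are available at https://github.com/dvlab-research/LISA.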
Toward Zero-shot Character Recognition: A Gold Standard Dataset with Radical-level Annotations
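results: The experimental results show that ACCID is a valid ancient Chinese character image dataset, and that the zero-shot OCR method based on decomposition and recombination achieves high recognition rates across different text backgrounds.
Abstract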
Optical character recognition (OCR) methods have been applied to diverse tasks, e.g., street view text recognition and document analysis. Recently, zero-shot OCR has piqued the interest of the research community because it considers a practical OCR scenario with unbalanced data distribution. However, there is a lack of benchmarks for evaluating such zero-shot methods that apply a divide-and-conquer recognition strategy by decomposing characters into radicals. Meanwhile, radical recognition, as another important OCR task, also lacks radical-level annotation for model training. In this paper, we construct an ancient Chinese character image dataset that contains both radical-level and character-level annotations to satisfy the requirements of the above-mentioned methods, namely, ACCID, where radical-level annotations include radical categories, radical locations, and structural relations. To increase the adaptability of ACCID, we propose a splicing-based synthetic character algorithm to augment the training samples and apply an image denoising method to improve the image quality. By introducing character decomposition and recombination, we propose a baseline method for zero-shot OCR. The experimental results demonstrate the validity of ACCID and the baseline model quantitatively and qualitatively.
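Summary
Optical character recognition (OCR) has been applied to a variety of tasks, such as street view text recognition and document analysis. Recently, zero-shot OCR has attracted researchers' attention because it considers the unbalanced data distribution of practical OCR scenarios. However, benchmarks for evaluating such zero-shot methods are lacking, especially for methods that decompose characters into radicals under a divide-and-conquer recognition strategy. Radical recognition likewise lacks radical-level annotated data for model training. In this paper, we construct an ancient Chinese character image dataset with both character-level and radical-level annotations to meet the requirements of these methods. To increase the adaptability of ACCID, we propose a splicing-based synthetic character algorithm, together with an image denoising method that improves image quality. By introducing character decomposition and recombination, we propose a baseline method for zero-shot OCR. The experimental results demonstrate the validity of ACCID and the baseline model, both quantitatively and qualitatively.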
Ada-DQA: Adaptive Diverse Quality-aware Feature Acquisition for Video Quality Assessment
results: Experiments on three mainstream no-reference VQA benchmarks show that Ada-DQA clearly outperforms current state-of-the-art methods without using extra VQA training data.
Abstract
Video quality assessment (VQA) has attracted growing attention in recent years. However, the great expense of annotating large-scale VQA datasets has become the main obstacle for current deep-learning methods. To surmount the constraint of insufficient training data, in this paper, we first consider the complete range of video distribution diversity (i.e., content, distortion, motion) and employ diverse pretrained models (e.g., architecture, pretext task, pre-training dataset) to benefit quality representation. An Adaptive Diverse Quality-aware feature Acquisition (Ada-DQA) framework is proposed to capture desired quality-related features generated by these frozen pretrained models. By leveraging the Quality-aware Acquisition Module (QAM), the framework is able to extract more essential and relevant features to represent quality. Finally, the learned quality representation is utilized as supplementary supervisory information, along with the supervision of the labeled quality score, to guide the training of a relatively lightweight VQA model in a knowledge distillation manner, which largely reduces the computational cost during inference. Experimental results on three mainstream no-reference VQA benchmarks clearly show the superior performance of Ada-DQA in comparison with current state-of-the-art approaches without using extra training data of VQA.
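Summary
Video quality assessment (VQA) has attracted growing attention in recent years. However, the cost of annotating large-scale VQA data has become the main obstacle for current deep-learning methods. To overcome this limitation, we first consider the full range of video distribution diversity (i.e., content, distortion, motion) and employ diverse pretrained models (e.g., architecture, pretext task, pre-training dataset) to improve quality representation. An Adaptive Diverse Quality-aware feature Acquisition (Ada-DQA) framework is proposed to capture the desired quality-related features. By leveraging the Quality-aware Acquisition Module (QAM), the framework extracts more essential and relevant features to represent quality. Finally, the learned quality representation serves as supplementary supervisory information, together with the labeled quality score, to guide the training of a lightweight VQA model by knowledge distillation, which greatly reduces the computational cost during inference. Experimental results show that Ada-DQA performs strongly on three mainstream no-reference VQA benchmarks, surpassing current state-of-the-art methods without using extra VQA training data.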
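The training signal described in this entry, a labeled quality score plus a learned quality representation from frozen pretrained models, can be sketched as a weighted sum of two terms. The function `distillation_loss`, the weight `alpha`, and the use of squared error for both terms are assumptions for illustration, not details taken from the paper.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(pred_score, label_score, student_feat, teacher_feat, alpha=0.5):
    """Supervised score loss plus a feature-matching term weighted by alpha."""
    score_term = (pred_score - label_score) ** 2      # supervision from the labeled score
    feature_term = mse(student_feat, teacher_feat)    # supervision from the teacher features
    return score_term + alpha * feature_term


if __name__ == "__main__":
    loss = distillation_loss(
        pred_score=3.5,
        label_score=4.0,
        student_feat=[0.1, 0.4, 0.3],
        teacher_feat=[0.2, 0.4, 0.1],
    )
    print(loss)
```

Only the lightweight student is needed at inference time, which is where the reduction in computational cost mentioned in the abstract comes from.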