eess.IV - 2023-07-23

ES2Net: An Efficient Spectral-Spatial Network for Hyperspectral Image Change Detection

  • paper_url: http://arxiv.org/abs/2307.12327
  • repo_url: None
  • paper_authors: Qingren Yao, Yuan Zhou, Wei Xiang
  • for: This work aims to improve the accuracy of change detection in hyperspectral images (HSIs), which suffer from spectral redundancy.
  • methods: An end-to-end spectral-spatial change detection network (ES2Net) that couples deep learning with band selection: a learnable band selection module automatically picks bands conducive to change detection and can be jointly optimized with the feature extractor, capturing the complex nonlinear relationships among bands (a code sketch follows the abstract below). A cluster-wise spatial attention mechanism further assigns a spatial attention factor to each band to improve its feature discriminativeness.
  • results: Experiments on three widely used HSI-CD datasets show that ES2Net is more efficient and accurate than other state-of-the-art methods.
    Abstract Hyperspectral image change detection (HSI-CD) aims to identify the differences in bitemporal HSIs. To mitigate spectral redundancy and improve the discriminativeness of changing features, some methods introduced band selection technology to select bands conducive for CD. However, these methods are limited by the inability to end-to-end training with the deep learning-based feature extractor and lack considering the complex nonlinear relationship among bands. In this paper, we propose an end-to-end efficient spectral-spatial change detection network (ES2Net) to address these issues. Specifically, we devised a learnable band selection module to automatically select bands conducive to CD. It can be jointly optimized with a feature extraction network and capture the complex nonlinear relationships among bands. Moreover, considering the large spatial feature distribution differences among different bands, we design the cluster-wise spatial attention mechanism that assigns a spatial attention factor to each individual band to individually improve the feature discriminativeness for each band. Experiments on three widely used HSI-CD datasets demonstrate the effectiveness and superiority of this method compared with other state-of-the-art methods.
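The following is a minimal, hypothetical sketch of what a learnable band-selection layer of this kind could look like (PyTorch; the names, sizes, and soft top-k scheme are assumptions, not the authors' implementation): per-band scores are learned jointly with the rest of the network and used to weight and keep the spectral bands most useful for change detection.

```python
import torch
import torch.nn as nn

class LearnableBandSelection(nn.Module):
    """Soft, end-to-end trainable selection of spectral bands (hypothetical sketch)."""
    def __init__(self, num_bands: int, num_selected: int):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_bands))  # learnable per-band importance
        self.num_selected = num_selected

    def forward(self, x):                                    # x: (B, bands, H, W)
        weights = torch.softmax(self.scores, dim=0)          # soft band weights
        x = x * weights.view(1, -1, 1, 1)                    # reweight every spectral band
        topk = torch.topk(weights, self.num_selected).indices
        return x[:, topk]                                    # keep the highest-weighted bands

x = torch.randn(2, 154, 64, 64)                              # e.g. a 154-band HSI patch (made up)
selected = LearnableBandSelection(154, 32)(x)                # -> (2, 32, 64, 64)
```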

Development of pericardial fat count images using a combination of three different deep-learning models

  • paper_url: http://arxiv.org/abs/2307.12316
  • repo_url: None
  • paper_authors: Takaaki Matsunaga, Atsushi Kono, Hidetoshi Matsuo, Kaoru Kitagawa, Mizuho Nishio, Hiromi Hashimura, Yu Izawa, Takayoshi Toba, Kazuki Ishikawa, Akie Katsuki, Kazuyuki Ohmura, Takamichi Murakami
  • for: This study aims to generate pericardial fat count images (PFCIs) from chest radiographs using deep-learning models, so that pericardial fat can be evaluated without CT.
  • methods: Three different deep-learning models, including CycleGAN, are combined; the reference PFCIs are generated by projecting three-dimensional chest CT onto 2D images, with fat accumulation represented by high pixel values.
  • results: PFCIs generated with the proposed method show better image-quality metrics (SSIM, MSE, MAE) than those generated by a single CycleGAN-based model (a metric-computation sketch follows the abstract below).
    Abstract Rationale and Objectives: Pericardial fat (PF), the thoracic visceral fat surrounding the heart, promotes the development of coronary artery disease by inducing inflammation of the coronary arteries. For evaluating PF, this study aimed to generate pericardial fat count images (PFCIs) from chest radiographs (CXRs) using a dedicated deep-learning model. Materials and Methods: The data of 269 consecutive patients who underwent coronary computed tomography (CT) were reviewed. Patients with metal implants, pleural effusion, history of thoracic surgery, or that of malignancy were excluded. Thus, the data of 191 patients were used. PFCIs were generated from the projection of three-dimensional CT images, where fat accumulation was represented by a high pixel value. Three different deep-learning models, including CycleGAN, were combined in the proposed method to generate PFCIs from CXRs. A single CycleGAN-based model was used to generate PFCIs from CXRs for comparison with the proposed method. To evaluate the image quality of the generated PFCIs, structural similarity index measure (SSIM), mean squared error (MSE), and mean absolute error (MAE) of (i) the PFCI generated using the proposed method and (ii) the PFCI generated using the single model were compared. Results: The mean SSIM, MSE, and MAE were as follows: 0.856, 0.0128, and 0.0357, respectively, for the proposed model; and 0.762, 0.0198, and 0.0504, respectively, for the single CycleGAN-based model. Conclusion: PFCIs generated from CXRs with the proposed model showed better performance than those with the single model. PFCI evaluation without CT may be possible with the proposed method.
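As a reference for the reported metrics, here is a small sketch of how SSIM, MSE, and MAE can be computed between a generated PFCI and its CT-derived reference (array names and sizes are placeholders; uses NumPy and scikit-image):

```python
import numpy as np
from skimage.metrics import structural_similarity

def pfci_metrics(generated: np.ndarray, reference: np.ndarray):
    """Return (SSIM, MSE, MAE) between a generated PFCI and its reference image."""
    generated = generated.astype(np.float64)
    reference = reference.astype(np.float64)
    mse = float(np.mean((generated - reference) ** 2))
    mae = float(np.mean(np.abs(generated - reference)))
    ssim = structural_similarity(generated, reference,
                                 data_range=reference.max() - reference.min())
    return ssim, mse, mae

# toy usage with random arrays standing in for a PFCI pair
ssim, mse, mae = pfci_metrics(np.random.rand(256, 256), np.random.rand(256, 256))
```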

Simultaneous temperature estimation and nonuniformity correction from multiple frames

  • paper_url: http://arxiv.org/abs/2307.12297
  • repo_url: None
  • paper_authors: Navot Oz, Omri Berman, Nir Sochen, David Mendelovich, Iftach Klapp
  • for: Addresses the spatially-variant nonuniformity and temperature drift of low-cost microbolometer-based infrared cameras used for temperature measurement.
  • methods: Builds on the physical image acquisition model of the camera and a kernel estimation network (KPN) to fuse multiple frames for simultaneous temperature estimation and nonuniformity correction, despite imperfect registration between frames. A novel offset block incorporates the ambient temperature into the model to estimate the camera offset.
  • results: On real data collected by a low-cost IR camera mounted on a UAV, the method differs from costly scientific-grade radiometric cameras by only 0.27-0.54°C on average, providing an accurate and efficient solution for simultaneous temperature estimation and nonuniformity correction in practical scenarios.
    Abstract Infrared (IR) cameras are widely used for temperature measurements in various applications, including agriculture, medicine, and security. Low-cost IR camera have an immense potential to replace expansive radiometric cameras in these applications, however low-cost microbolometer-based IR cameras are prone to spatially-variant nonuniformity and to drift in temperature measurements, which limits their usability in practical scenarios. To address these limitations, we propose a novel approach for simultaneous temperature estimation and nonuniformity correction from multiple frames captured by low-cost microbolometer-based IR cameras. We leverage the physical image acquisition model of the camera and incorporate it into a deep learning architecture called kernel estimation networks (KPN), which enables us to combine multiple frames despite imperfect registration between them. We also propose a novel offset block that incorporates the ambient temperature into the model and enables us to estimate the offset of the camera, which is a key factor in temperature estimation. Our findings demonstrate that the number of frames has a significant impact on the accuracy of temperature estimation and nonuniformity correction. Moreover, our approach achieves a significant improvement in performance compared to vanilla KPN, thanks to the offset block. The method was tested on real data collected by a low-cost IR camera mounted on a UAV, showing only a small average error of $0.27^\circ C-0.54^\circ C$ relative to costly scientific-grade radiometric cameras. Our method provides an accurate and efficient solution for simultaneous temperature estimation and nonuniformity correction, which has important implications for a wide range of practical applications.

ASCON: Anatomy-aware Supervised Contrastive Learning Framework for Low-dose CT Denoising

  • paper_url: http://arxiv.org/abs/2307.12225
  • repo_url: https://github.com/hao1635/ASCON
  • paper_authors: Zhihao Chen, Qi Gao, Yi Zhang, Hongming Shan
  • for: This paper proposes a method for denoising low-dose computed tomography (CT) images.
  • methods: Two novel designs: an efficient self-attention-based U-Net (ESAU-Net) and a multi-scale anatomical contrastive network (MAC-Net). ESAU-Net uses a channel-wise self-attention mechanism to better capture global-local interactions (a sketch follows the abstract below), while MAC-Net combines a patch-wise non-contrastive module that captures inherent anatomical information with a pixel-wise contrastive module that maintains intrinsic anatomical consistency.
  • results: ASCON outperforms previous models on two public low-dose CT denoising datasets and, for the first time, provides anatomical interpretability for low-dose CT denoising.
    Abstract While various deep learning methods have been proposed for low-dose computed tomography (CT) denoising, most of them leverage the normal-dose CT images as the ground-truth to supervise the denoising process. These methods typically ignore the inherent correlation within a single CT image, especially the anatomical semantics of human tissues, and lack the interpretability on the denoising process. In this paper, we propose a novel Anatomy-aware Supervised CONtrastive learning framework, termed ASCON, which can explore the anatomical semantics for low-dose CT denoising while providing anatomical interpretability. The proposed ASCON consists of two novel designs: an efficient self-attention-based U-Net (ESAU-Net) and a multi-scale anatomical contrastive network (MAC-Net). First, to better capture global-local interactions and adapt to the high-resolution input, an efficient ESAU-Net is introduced by using a channel-wise self-attention mechanism. Second, MAC-Net incorporates a patch-wise non-contrastive module to capture inherent anatomical information and a pixel-wise contrastive module to maintain intrinsic anatomical consistency. Extensive experimental results on two public low-dose CT denoising datasets demonstrate superior performance of ASCON over state-of-the-art models. Remarkably, our ASCON provides anatomical interpretability for low-dose CT denoising for the first time. Source code is available at https://github.com/hao1635/ASCON.
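A rough sketch of channel-wise self-attention, the mechanism ESAU-Net is described as using to keep high-resolution CT inputs tractable. This is a generic formulation (attention computed across channels instead of spatial positions, so the cost scales with C^2 rather than (HW)^2), not necessarily the exact ESAU-Net block:

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Self-attention across channels (generic sketch, cost ~ C^2 instead of (HW)^2)."""
    def __init__(self, channels: int):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)      # each (B, C, H*W)
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # (B, C, C)
        out = (attn @ v).view(b, c, h, w)
        return self.proj(out) + x                              # residual connection

y = ChannelSelfAttention(32)(torch.randn(1, 32, 64, 64))
```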

SCPAT-GAN: Structural Constrained and Pathology Aware Convolutional Transformer-GAN for Virtual Histology Staining of Human Coronary OCT images

  • paper_url: http://arxiv.org/abs/2307.12138
  • repo_url: None
  • paper_authors: Xueshen Li, Hongshan Liu, Xiaoyu Song, Brigitta C. Brott, Silvio H. Litovsky, Yu Gan
  • for: Generating virtual histological information from coronary OCT images, to better guide the treatment of coronary artery disease.
  • methods: A transformer-based generative adversarial network that imposes structural constraints and pathology-aware guidance on structural layers to generate virtually stained H&E histology.
  • results: Improves pathological and structural accuracy over existing methods and does not require a large pixel-wise paired training dataset.
    Abstract There is a significant need for the generation of virtual histological information from coronary optical coherence tomography (OCT) images to better guide the treatment of coronary artery disease. However, existing methods either require a large pixel-wisely paired training dataset or have limited capability to map pathological regions. To address these issues, we proposed a structural constrained, pathology aware, transformer generative adversarial network, namely SCPAT-GAN, to generate virtual stained H&E histology from OCT images. The proposed SCPAT-GAN advances existing methods via a novel design to impose pathological guidance on structural layers using transformer-based network.

Improving temperature estimation in low-cost infrared cameras using deep neural networks

  • paper_url: http://arxiv.org/abs/2307.12130
  • repo_url: None
  • paper_authors: Navot Oz, Nir Sochen, David Mendelovich, Iftach Klapp
  • for: Improve the temperature accuracy of low-cost thermal cameras and rectify their spatial nonuniformity.
  • methods: A nonuniformity simulator that accounts for the ambient temperature is developed, and an end-to-end neural network estimates the object temperature and corrects the nonuniformity using only a single image and the ambient temperature measured by the camera itself (a conditioning sketch follows the abstract below).
  • results: The mean temperature error is lowered by approximately 1°C compared with previous works, and applying a physical constraint on the network lowers the error by an additional 4%. The mean temperature error over an extensive validation dataset is 0.37°C, with equivalent results on real field data.
    Abstract Low-cost thermal cameras are inaccurate (usually $\pm 3^\circ C$) and have space-variant nonuniformity across their detector. Both inaccuracy and nonuniformity are dependent on the ambient temperature of the camera. The main goal of this work was to improve the temperature accuracy of low-cost cameras and rectify the nonuniformity. A nonuniformity simulator that accounts for the ambient temperature was developed. An end-to-end neural network that incorporates the ambient temperature at image acquisition was introduced. The neural network was trained with the simulated nonuniformity data to estimate the object's temperature and correct the nonuniformity, using only a single image and the ambient temperature measured by the camera itself. Results show that the proposed method lowered the mean temperature error by approximately $1^\circ C$ compared to previous works. In addition, applying a physical constraint on the network lowered the error by an additional $4\%$. The mean temperature error over an extensive validation dataset was $0.37^\circ C$. The method was verified on real data in the field and produced equivalent results.
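A toy sketch of the general idea of conditioning a single-image correction network on the camera's own ambient-temperature reading; the architecture below is an assumption for illustration, not the paper's network. The scalar temperature is broadcast to an extra input channel so a fully convolutional model can use it when estimating object temperature and correcting nonuniformity:

```python
import torch
import torch.nn as nn

class AmbientConditionedNet(nn.Module):
    """Single-image correction network conditioned on ambient temperature (toy sketch)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, frame, ambient_temp):                   # frame: (B,1,H,W), ambient_temp: (B,)
        t_map = ambient_temp.view(-1, 1, 1, 1).expand_as(frame)  # broadcast scalar to a channel
        return self.net(torch.cat([frame, t_map], dim=1))     # per-pixel temperature estimate

pred = AmbientConditionedNet()(torch.randn(4, 1, 120, 160),
                               torch.tensor([25.0, 27.5, 30.0, 22.0]))
```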

cs.CV - 2023-07-22

Spatial Self-Distillation for Object Detection with Inaccurate Bounding Boxes

  • paper_url: http://arxiv.org/abs/2307.12101
  • repo_url: https://github.com/ucas-vg/pointtinybenchmark
  • paper_authors: Di Wu, Pengfei Chen, Xuehui Yu, Guorong Li, Zhenjun Han, Jianbin Jiao
  • for: Improve object detection accuracy under inaccurate (low-quality) bounding-box supervision.
  • methods: A Spatial Position Self-Distillation (SPSD) module and a Spatial Identity Self-Distillation (SISD) module combine spatial information with category information to construct a high-quality proposal bag and improve the proposal selection procedure.
  • results: Achieves state-of-the-art performance on the MS-COCO and VOC datasets with noisy box supervision.
    Abstract Object detection via inaccurate bounding boxes supervision has boosted a broad interest due to the expensive high-quality annotation data or the occasional inevitability of low annotation quality (\eg tiny objects). The previous works usually utilize multiple instance learning (MIL), which highly depends on category information, to select and refine a low-quality box. Those methods suffer from object drift, group prediction and part domination problems without exploring spatial information. In this paper, we heuristically propose a \textbf{Spatial Self-Distillation based Object Detector (SSD-Det)} to mine spatial information to refine the inaccurate box in a self-distillation fashion. SSD-Det utilizes a Spatial Position Self-Distillation \textbf{(SPSD)} module to exploit spatial information and an interactive structure to combine spatial information and category information, thus constructing a high-quality proposal bag. To further improve the selection procedure, a Spatial Identity Self-Distillation \textbf{(SISD)} module is introduced in SSD-Det to obtain spatial confidence to help select the best proposals. Experiments on MS-COCO and VOC datasets with noisy box annotation verify our method's effectiveness and achieve state-of-the-art performance. The code is available at https://github.com/ucas-vg/PointTinyBenchmark/tree/SSD-Det.

Edge Guided GANs with Multi-Scale Contrastive Learning for Semantic Image Synthesis

  • paper_url: http://arxiv.org/abs/2307.12084
  • repo_url: https://github.com/ha0tang/ecgan
  • paper_authors: Hao Tang, Guolei Sun, Nicu Sebe, Luc Van Gool
  • for: Proposes a new ECGAN method for the semantic image synthesis task.
  • methods: Edges are used as an intermediate representation and guide image generation via a proposed attention-guided edge transfer module; a further module selectively highlights class-dependent feature maps according to the original semantic layout to preserve semantic information.
  • results: A novel multi-scale contrastive learning method enforces pixel embeddings of the same semantic class to produce more similar image content, capturing semantic cross-relations across multiple input semantic layouts (a generic pixel contrastive loss sketch follows the abstract below).
    Abstract We propose a novel ECGAN for the challenging semantic image synthesis task. Although considerable improvements have been achieved by the community in the recent period, the quality of synthesized images is far from satisfactory due to three largely unresolved challenges. 1) The semantic labels do not provide detailed structural information, making it challenging to synthesize local details and structures; 2) The widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss and thus cannot fully preserve the original semantic information, leading to semantically inconsistent results (e.g., missing small objects); 3) Existing semantic image synthesis methods focus on modeling 'local' semantic information from a single input semantic layout. However, they ignore 'global' semantic information of multiple input semantic layouts, i.e., semantic cross-relations between pixels across different input layouts. To tackle 1), we propose to use the edge as an intermediate representation which is further adopted to guide image generation via a proposed attention guided edge transfer module. To tackle 2), we design an effective module to selectively highlight class-dependent feature maps according to the original semantic layout to preserve the semantic information. To tackle 3), inspired by current methods in contrastive learning, we propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content than those from different classes. We further propose a novel multi-scale contrastive learning method that aims to push same-class features from different scales closer together being able to capture more semantic relations by explicitly exploring the structures of labeled pixels from multiple input semantic layouts from different scales.
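For illustration, a compact sketch of a generic supervised pixel-wise contrastive loss of the kind described: pixel embeddings sharing a semantic class are pulled together, others are pushed apart. The subsampling, temperature, and exact normalization are assumptions, and the paper's multi-scale formulation is not reproduced here:

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, labels, temperature=0.1, max_pixels=512):
    """embeddings: (N, D) pixel features; labels: (N,) semantic class ids (generic sketch)."""
    idx = torch.randperm(embeddings.size(0))[:max_pixels]     # subsample pixels for tractability
    z, y = F.normalize(embeddings[idx], dim=1), labels[idx]
    sim = z @ z.t() / temperature                             # pairwise cosine similarities
    pos = (y.unsqueeze(0) == y.unsqueeze(1)).float()
    pos.fill_diagonal_(0)                                     # self-pairs are not positives
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -((pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)).mean()

loss = pixel_contrastive_loss(torch.randn(4096, 64), torch.randint(0, 19, (4096,)))
```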

Iterative Reconstruction Based on Latent Diffusion Model for Sparse Data Reconstruction

  • paper_url: http://arxiv.org/abs/2307.12070
  • repo_url: None
  • paper_authors: Linchao He, Hongyu Yan, Mengting Luo, Kunming Luo, Wang Wang, Wenchao Du, Hu Chen, Hongyu Yang, Yi Zhang
  • for: Reconstructing computed tomography (CT) images from sparse measurements, an ill-posed inverse problem; the paper proposes Latent Diffusion Iterative Reconstruction (LDIR) to solve it.
  • methods: LDIR extends Iterative Reconstruction (IR) with a pre-trained Latent Diffusion Model (LDM) as a data prior. The LDM approximates the prior distribution of the CT images, and the gradient from the data-fidelity term guides the sampling process (a guidance-step sketch follows the abstract below), which lets LDIR integrate iterative reconstruction and the LDM in an unsupervised manner and makes reconstruction of high-resolution images more efficient.
  • results: LDIR outperforms other state-of-the-art unsupervised methods and even exceeds supervised methods in both quantity and quality on extremely sparse CT reconstruction tasks. It also achieves competitive performance on natural-image tasks, with significantly faster execution times and lower memory consumption than methods with similar network settings.
    Abstract Reconstructing Computed tomography (CT) images from sparse measurement is a well-known ill-posed inverse problem. The Iterative Reconstruction (IR) algorithm is a solution to inverse problems. However, recent IR methods require paired data and the approximation of the inverse projection matrix. To address those problems, we present Latent Diffusion Iterative Reconstruction (LDIR), a pioneering zero-shot method that extends IR with a pre-trained Latent Diffusion Model (LDM) as a accurate and efficient data prior. By approximating the prior distribution with an unconditional latent diffusion model, LDIR is the first method to successfully integrate iterative reconstruction and LDM in an unsupervised manner. LDIR makes the reconstruction of high-resolution images more efficient. Moreover, LDIR utilizes the gradient from the data-fidelity term to guide the sampling process of the LDM, therefore, LDIR does not need the approximation of the inverse projection matrix and can solve various CT reconstruction tasks with a single model. Additionally, for enhancing the sample consistency of the reconstruction, we introduce a novel approach that uses historical gradient information to guide the gradient. Our experiments on extremely sparse CT data reconstruction tasks show that LDIR outperforms other state-of-the-art unsupervised and even exceeds supervised methods, establishing it as a leading technique in terms of both quantity and quality. Furthermore, LDIR also achieves competitive performance on nature image tasks. It is worth noting that LDIR also exhibits significantly faster execution times and lower memory consumption compared to methods with similar network settings. Our code will be publicly available.
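A minimal sketch of the data-fidelity-guided sampling idea: at each reverse diffusion step, the gradient of the data-fidelity term with respect to the current sample steers the update, so no explicit inverse of the projection operator is needed. The denoiser, measurement operator, and update rule below are toy stand-ins, and the sketch works in pixel space rather than the latent space used by LDIR:

```python
import torch

def guided_reverse_step(x_t, t, denoiser, forward_op, y, guidance_scale=0.1):
    """One reverse step where the data-fidelity gradient steers the sample (toy sketch)."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                                 # prior: predicted clean image
    fidelity = ((forward_op(x0_hat) - y) ** 2).sum()          # data-fidelity term
    grad = torch.autograd.grad(fidelity, x_t)[0]              # gradient w.r.t. current sample
    return (x0_hat - guidance_scale * grad).detach()          # real samplers also re-add noise

# stand-in components (placeholders, not LDIR's actual model or operator)
denoiser = lambda x, t: x * (1.0 - 0.1 * t)
forward_op = lambda x: x[:, :, ::4, ::4]                      # toy sparse measurement operator
y = torch.zeros(1, 1, 16, 16)
x = torch.randn(1, 1, 64, 64)
for t in range(5, 0, -1):
    x = guided_reverse_step(x, t, denoiser, forward_op, y)
```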

Replay: Multi-modal Multi-view Acted Videos for Casual Holography

  • paper_url: http://arxiv.org/abs/2307.12067
  • repo_url: https://github.com/facebookresearch/replay_dataset
  • paper_authors: Roman Shapovalov, Yanir Kleiman, Ignacio Rocco, David Novotny, Andrea Vedaldi, Changan Chen, Filippos Kokkinos, Ben Graham, Natalia Neverova
  • for: Provides a collection of multi-view, multi-modal videos of humans interacting socially, usable for applications such as novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and training generative models.
  • methods: Each scene is filmed in high production quality from several static cameras and wearable action cameras, recorded with a large array of microphones at different positions in the room, and annotated with high-accuracy timestamps and camera poses.
  • results: The dataset contains over 4000 minutes of footage and over 7 million timestamped high-resolution frames, and provides a benchmark for training and evaluating novel-view synthesis methods, on which several state-of-the-art baselines are evaluated.
    Abstract We introduce Replay, a collection of multi-view, multi-modal videos of humans interacting socially. Each scene is filmed in high production quality, from different viewpoints with several static cameras, as well as wearable action cameras, and recorded with a large array of microphones at different positions in the room. Overall, the dataset contains over 4000 minutes of footage and over 7 million timestamped high-resolution frames annotated with camera poses and partially with foreground masks. The Replay dataset has many potential applications, such as novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and training generative models. We provide a benchmark for training and evaluating novel-view synthesis, with two scenarios of different difficulty. Finally, we evaluate several baseline state-of-the-art methods on the new benchmark.

Discovering Spatio-Temporal Rationales for Video Question Answering

  • paper_url: http://arxiv.org/abs/2307.12058
  • repo_url: None
  • paper_authors: Yicong Li, Junbin Xiao, Chun Feng, Xiang Wang, Tat-Seng Chua
  • for: Solving complex video question answering (VideoQA), where long videos contain multiple objects and events occurring at different times.
  • methods: Proposes Spatio-Temporal Rationalization (STR), a differentiable selection module that adaptively collects question-critical moments and spatial objects from the video content via cross-modal interaction. On top of STR, a Transformer-style architecture named TranSTR additionally emphasizes a novel answer interaction mechanism that coordinates STR for answer decoding.
  • results: TranSTR sets a new state of the art on four datasets; on NExT-QA and Causal-VidQA, which feature complex VideoQA, it surpasses the previous state of the art by 5.8% and 6.8%, respectively. Extensive studies verify the importance of STR and of the proposed answer interaction mechanism.
    Abstract This paper strives to solve complex video question answering (VideoQA) which features long video containing multiple objects and events at different time. To tackle the challenge, we highlight the importance of identifying question-critical temporal moments and spatial objects from the vast amount of video content. Towards this, we propose a Spatio-Temporal Rationalization (STR), a differentiable selection module that adaptively collects question-critical moments and objects using cross-modal interaction. The discovered video moments and objects are then served as grounded rationales to support answer reasoning. Based on STR, we further propose TranSTR, a Transformer-style neural network architecture that takes STR as the core and additionally underscores a novel answer interaction mechanism to coordinate STR for answer decoding. Experiments on four datasets show that TranSTR achieves new state-of-the-art (SoTA). Especially, on NExT-QA and Causal-VidQA which feature complex VideoQA, it significantly surpasses the previous SoTA by 5.8\% and 6.8\%, respectively. We then conduct extensive studies to verify the importance of STR as well as the proposed answer interaction mechanism. With the success of TranSTR and our comprehensive analysis, we hope this work can spark more future efforts in complex VideoQA. Code will be released at https://github.com/yl3800/TranSTR.

Patch-Wise Point Cloud Generation: A Divide-and-Conquer Approach

  • paper_url: http://arxiv.org/abs/2307.12049
  • repo_url: https://github.com/wenc13/patchgeneration
  • paper_authors: Cheng Wen, Baosheng Yu, Rao Fu, Dacheng Tao
  • for: Generating high-fidelity point clouds for applications such as autonomous driving and robotics.
  • methods: A new point cloud generation framework based on a divide-and-conquer approach that splits the whole generation process into a set of patch-wise generation tasks. Each patch generator is built on learnable priors that capture the information of geometry primitives, and point- and patch-wise transformers enable interactions between points and patches.
  • results: Experiments on a variety of object categories from the ShapeNet dataset show that the proposed patch-wise generation clearly outperforms recent state-of-the-art methods for high-fidelity point cloud generation.
    Abstract A generative model for high-fidelity point clouds is of great importance in synthesizing 3d environments for applications such as autonomous driving and robotics. Despite the recent success of deep generative models for 2d images, it is non-trivial to generate 3d point clouds without a comprehensive understanding of both local and global geometric structures. In this paper, we devise a new 3d point cloud generation framework using a divide-and-conquer approach, where the whole generation process can be divided into a set of patch-wise generation tasks. Specifically, all patch generators are based on learnable priors, which aim to capture the information of geometry primitives. We introduce point- and patch-wise transformers to enable the interactions between points and patches. Therefore, the proposed divide-and-conquer approach contributes to a new understanding of point cloud generation from the geometry constitution of 3d shapes. Experimental results on a variety of object categories from the most popular point cloud dataset, ShapeNet, show the effectiveness of the proposed patch-wise point cloud generation, where it clearly outperforms recent state-of-the-art methods for high-fidelity point cloud generation.

FSDiffReg: Feature-wise and Score-wise Diffusion-guided Unsupervised Deformable Image Registration for Cardiac Images

  • paper_url: http://arxiv.org/abs/2307.12035
  • repo_url: https://github.com/xmed-lab/fsdiffreg
  • paper_authors: Yi Qin, Xiaomeng Li
  • for: Unsupervised deformable medical image registration, in particular obtaining high-quality deformation fields while preserving deformation topology.
  • methods: Two modules exploit the diffusion model's semantic feature space to guide registration: a Feature-wise Diffusion-Guided Module (FDG) that uses the diffusion model's multi-scale semantic features to guide generation of the deformation field, and a Score-wise Diffusion-Guided Module (SDG) that uses the diffusion score to guide optimization and preserve deformation topology with barely any additional computation.
  • results: Experiments on 3D medical cardiac image registration show that the model provides refined deformation fields with effectively preserved topology.
    Abstract Unsupervised deformable image registration is one of the challenging tasks in medical imaging. Obtaining a high-quality deformation field while preserving deformation topology remains demanding amid a series of deep-learning-based solutions. Meanwhile, the diffusion model's latent feature space shows potential in modeling the deformation semantics. To fully exploit the diffusion model's ability to guide the registration task, we present two modules: Feature-wise Diffusion-Guided Module (FDG) and Score-wise Diffusion-Guided Module (SDG). Specifically, FDG uses the diffusion model's multi-scale semantic features to guide the generation of the deformation field. SDG uses the diffusion score to guide the optimization process for preserving deformation topology with barely any additional computation. Experiment results on the 3D medical cardiac image registration task validate our model's ability to provide refined deformation fields with preserved topology effectively. Code is available at: https://github.com/xmed-lab/FSDiffReg.git.

Self-Supervised and Semi-Supervised Polyp Segmentation using Synthetic Data

  • paper_url: http://arxiv.org/abs/2307.12033
  • repo_url: None
  • paper_authors: Enric Moreu, Eric Arazo, Kevin McGuinness, Noel E. O’Connor
  • for: Early detection of colorectal polyps is crucial for treatment and colorectal cancer prevention; colonoscopies are carried out manually to examine the entirety of the patient's colon.
  • methods: Computer vision is used to assist professionals at the diagnosis stage, leveraging synthetic data to enlarge the datasets: an image-to-image translation module transforms synthetic images, which are combined with real images to train a segmentation model, and model predictions are used as pseudo-labels to better exploit unlabeled samples (a pseudo-labelling sketch follows the abstract below).
  • results: The model, Pl-CUT-Seg, reaches state-of-the-art results on standard polyp segmentation benchmarks in the self- and semi-supervised setups; PL-CUT-Seg+, an improved version with targeted regularization, further addresses the domain gap between real and synthetic images.
    Abstract Early detection of colorectal polyps is of utmost importance for their treatment and for colorectal cancer prevention. Computer vision techniques have the potential to aid professionals in the diagnosis stage, where colonoscopies are manually carried out to examine the entirety of the patient's colon. The main challenge in medical imaging is the lack of data, and a further challenge specific to polyp segmentation approaches is the difficulty of manually labeling the available data: the annotation process for segmentation tasks is very time-consuming. While most recent approaches address the data availability challenge with sophisticated techniques to better exploit the available labeled data, few of them explore the self-supervised or semi-supervised paradigm, where the amount of labeling required is greatly reduced. To address both challenges, we leverage synthetic data and propose an end-to-end model for polyp segmentation that integrates real and synthetic data to artificially increase the size of the datasets and aid the training when unlabeled samples are available. Concretely, our model, Pl-CUT-Seg, transforms synthetic images with an image-to-image translation module and combines the resulting images with real images to train a segmentation model, where we use model predictions as pseudo-labels to better leverage unlabeled samples. Additionally, we propose PL-CUT-Seg+, an improved version of the model that incorporates targeted regularization to address the domain gap between real and synthetic images. The models are evaluated on standard benchmarks for polyp segmentation and reach state-of-the-art results in the self- and semi-supervised setups.
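A brief sketch of generic confidence-thresholded pseudo-labelling for the semi-supervised stage; the threshold, masking rule, and model stand-in are assumptions rather than the paper's exact procedure:

```python
import torch

@torch.no_grad()
def make_pseudo_labels(model, unlabeled_batch, threshold=0.9):
    """Hard pseudo-masks plus a mask of confidently predicted pixels (generic sketch)."""
    probs = torch.sigmoid(model(unlabeled_batch))             # (B,1,H,W) polyp probability
    confident = (probs > threshold) | (probs < 1 - threshold) # supervise only confident pixels
    pseudo = (probs > 0.5).float()                            # hard pseudo-label mask
    return pseudo, confident

model = torch.nn.Conv2d(3, 1, 3, padding=1)                   # stand-in for the segmentation net
pseudo, confident = make_pseudo_labels(model, torch.randn(2, 3, 128, 128))
```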

Flight Contrail Segmentation via Augmented Transfer Learning with Novel SR Loss Function in Hough Space

  • paper_url: http://arxiv.org/abs/2307.12032
  • repo_url: https://github.com/junzis/contrail-net
  • paper_authors: Junzi Sun, Esther Roosenbrand
  • for: Proposes a model based on augmented transfer learning for detecting flight contrails in satellite images.
  • methods: Augmented transfer learning is combined with a novel loss function, SR Loss, which improves contrail line detection by transforming the image space into Hough space.
  • results: The model detects contrails accurately with minimal data, offering a solution to the lack of large hand-labeled contrail datasets and significantly enhancing contrail detection models for aviation research.
    Abstract Air transport poses significant environmental challenges, particularly the contribution of flight contrails to climate change due to their potential global warming impact. Detecting contrails from satellite images has been a long-standing challenge. Traditional computer vision techniques have limitations under varying image conditions, and machine learning approaches using typical convolutional neural networks are hindered by the scarcity of hand-labeled contrail datasets and contrail-tailored learning processes. In this paper, we introduce an innovative model based on augmented transfer learning that accurately detects contrails with minimal data. We also propose a novel loss function, SR Loss, which improves contrail line detection by transforming the image space into Hough space. Our research opens new avenues for machine learning-based contrail detection in aviation research, offering solutions to the lack of large hand-labeled datasets, and significantly enhancing contrail detection models.

On the Effectiveness of Spectral Discriminators for Perceptual Quality Improvement

  • paper_url: http://arxiv.org/abs/2307.12027
  • repo_url: https://github.com/luciennnnnnn/dualformer
  • paper_authors: Xin Luo, Yunan Zhu, Shunxin Xu, Dong Liu
  • for: Studies the effectiveness of spectral discriminators, which evaluate the Fourier spectra of images, for generative modeling, focusing on perceptual image super-resolution (GAN-based SR).
  • methods: Spectral and ordinary (spatial) discriminators are compared and then used simultaneously; the spectral discriminator is further improved by first computing the patch-wise Fourier spectrum and then aggregating the spectra with a Transformer (a spectrum-extraction sketch follows the abstract below).
  • results: The spectral discriminator identifies differences in the high-frequency range better, while the spatial discriminator holds an advantage in the low-frequency range. Combining them yields SR images whose spectra align better with those of real images, giving a better perception-distortion tradeoff, and the ensembled discriminator predicts perceptual quality more accurately in no-reference image quality assessment.
    Abstract Several recent studies advocate the use of spectral discriminators, which evaluate the Fourier spectra of images for generative modeling. However, the effectiveness of the spectral discriminators is not well interpreted yet. We tackle this issue by examining the spectral discriminators in the context of perceptual image super-resolution (i.e., GAN-based SR), as SR image quality is susceptible to spectral changes. Our analyses reveal that the spectral discriminator indeed performs better than the ordinary (a.k.a. spatial) discriminator in identifying the differences in the high-frequency range; however, the spatial discriminator holds an advantage in the low-frequency range. Thus, we suggest that the spectral and spatial discriminators shall be used simultaneously. Moreover, we improve the spectral discriminators by first calculating the patch-wise Fourier spectrum and then aggregating the spectra by Transformer. We verify the effectiveness of the proposed method twofold. On the one hand, thanks to the additional spectral discriminator, our obtained SR images have their spectra better aligned to those of the real images, which leads to a better PD tradeoff. On the other hand, our ensembled discriminator predicts the perceptual quality more accurately, as evidenced in the no-reference image quality assessment task.
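A small sketch of the patch-wise Fourier spectrum idea: each image is split into patches, an FFT is taken per patch, and the resulting log-amplitude spectra become tokens that a Transformer-based discriminator could aggregate. The patch size and log-amplitude representation are assumptions:

```python
import torch

def patchwise_spectra(images, patch=32):
    """images: (B, C, H, W) -> (B, num_patches, C * patch * (patch//2 + 1)) spectral tokens."""
    b, c, h, w = images.shape
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)  # (B,C,nh,nw,patch,patch)
    patches = patches.contiguous().view(b, c, -1, patch, patch).transpose(1, 2)
    spectra = torch.fft.rfft2(patches, norm="ortho")          # 2D FFT per patch
    return torch.log1p(spectra.abs()).flatten(2)              # log-amplitude as token features

tokens = patchwise_spectra(torch.rand(2, 3, 128, 128))        # -> (2, 16, 1632)
```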

Simple parameter-free self-attention approximation

  • paper_url: http://arxiv.org/abs/2307.12018
  • repo_url: https://github.com/exploita123/charmedforfree
  • paper_authors: Yuwen Zhai, Jing Hao, Liang Gao, Xinyu Li, Yiping Gao, Shumin Han
  • for: Improving the efficiency of ViT so it can run on edge devices.
  • methods: A hybrid model of self-attention and convolution, together with a parameter-free self-attention approximation (SPSA) that captures global spatial features with linear complexity.
  • results: Extensive experiments on image classification and object detection demonstrate the effectiveness of combining SPSA with convolution.
    Abstract The hybrid model of self-attention and convolution is one of the methods to lighten ViT. The quadratic computational complexity of self-attention with respect to token length limits the efficiency of ViT on edge devices. We propose a self-attention approximation without training parameters, called SPSA, which captures global spatial features with linear complexity. To verify the effectiveness of SPSA combined with convolution, we conduct extensive experiments on image classification and object detection tasks.

NLCUnet: Single-Image Super-Resolution Network with Hairline Details

  • paper_url: http://arxiv.org/abs/2307.12014
  • repo_url: None
  • paper_authors: Jiancong Feng, Yuan-Gen Wang, Fengchuang Xing
  • for: Improving single-image super-resolution quality, including hairline details.
  • methods: A single-image super-resolution network (NLCUnet) with three core designs: a non-local attention mechanism that restores local pieces by learning from the whole image region; a new architecture that integrates depth-wise convolution with channel attention and drops blur-kernel estimation, based on the finding that the blur kernel trained in existing work is unnecessary; and a random 64×64 crop inside the central 512×512 crop so the cropped region contains as much semantic information as possible (a cropping sketch follows the abstract below).
  • results: Experiments on the DF2K benchmark show that NLCUnet outperforms the state of the art in PSNR and SSIM and yields visually favorable hairline details.
    Abstract Pursuing the precise details of super-resolution images is challenging for single-image super-resolution tasks. This paper presents a single-image super-resolution network with hairline details (termed NLCUnet), including three core designs. Specifically, a non-local attention mechanism is first introduced to restore local pieces by learning from the whole image region. Then, we find that the blur kernel trained by the existing work is unnecessary. Based on this finding, we create a new network architecture by integrating depth-wise convolution with channel attention without the blur kernel estimation, resulting in a performance improvement instead. Finally, to make the cropped region contain as much semantic information as possible, we propose a random 64$\times$64 crop inside the central 512$\times$512 crop instead of a direct random crop inside the whole image of 2K size. Numerous experiments conducted on the benchmark DF2K dataset demonstrate that our NLCUnet performs better than the state-of-the-art in terms of the PSNR and SSIM metrics and yields visually favorable hairline details.
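A tiny sketch of the described cropping scheme (the sizes follow the abstract; function and variable names are mine): take the central 512×512 window of a 2K training image, then draw a random 64×64 crop inside it so the patch is likely to contain rich semantic content.

```python
import random

def central_random_crop(img_h, img_w, center=512, crop=64):
    """Return (top, left, height, width) of a random crop inside the central window."""
    top0, left0 = (img_h - center) // 2, (img_w - center) // 2   # central 512x512 window
    top = top0 + random.randint(0, center - crop)
    left = left0 + random.randint(0, center - crop)
    return top, left, crop, crop

# e.g. for a 1080x2048 training image:
box = central_random_crop(1080, 2048)
```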

SCOL: Supervised Contrastive Ordinal Loss for Abdominal Aortic Calcification Scoring on Vertebral Fracture Assessment Scans

  • paper_url: http://arxiv.org/abs/2307.12006
  • repo_url: https://github.com/afsahs/supervised-contrastive-ordinal-loss
  • paper_authors: Afsah Saleem, Zaid Ilyas, David Suter, Ghulam Mubashar Hassan, Siobhan Reid, John T. Schousboe, Richard Prince, William D. Leslie, Joshua R. Lewis, Syed Zulqarnain Gilani
  • for: Develop an automatic method for scoring abdominal aortic calcification (AAC) on vertebral fracture assessment (VFA) DXA scans, to screen for the risk of asymptomatic atherosclerotic cardiovascular diseases (ASCVDs).
  • methods: A novel Supervised Contrastive Ordinal Loss (SCOL) incorporates a label-dependent distance metric into supervised contrastive loss (a hedged sketch follows the abstract below), and a Dual-encoder Contrastive Ordinal Learning (DCOL) framework learns contrastive ordinal representations at global and local levels to exploit the ordinal information in discrete AAC regression labels.
  • results: The approach improves inter-class separability and intra-class consistency, predicting high-risk AAC classes with high sensitivity and accuracy; predicted AAC scores are also analyzed clinically to predict the future risk of a Major Acute Cardiovascular Event (MACE).
    Abstract Abdominal Aortic Calcification (AAC) is a known marker of asymptomatic Atherosclerotic Cardiovascular Diseases (ASCVDs). AAC can be observed on Vertebral Fracture Assessment (VFA) scans acquired using Dual-Energy X-ray Absorptiometry (DXA) machines. Thus, the automatic quantification of AAC on VFA DXA scans may be used to screen for CVD risks, allowing early interventions. In this research, we formulate the quantification of AAC as an ordinal regression problem. We propose a novel Supervised Contrastive Ordinal Loss (SCOL) by incorporating a label-dependent distance metric with existing supervised contrastive loss to leverage the ordinal information inherent in discrete AAC regression labels. We develop a Dual-encoder Contrastive Ordinal Learning (DCOL) framework that learns the contrastive ordinal representation at global and local levels to improve the feature separability and class diversity in latent space among the AAC-24 genera. We evaluate the performance of the proposed framework using two clinical VFA DXA scan datasets and compare our work with state-of-the-art methods. Furthermore, for predicted AAC scores, we provide a clinical analysis to predict the future risk of a Major Acute Cardiovascular Event (MACE). Our results demonstrate that this learning enhances inter-class separability and strengthens intra-class consistency, which results in predicting the high-risk AAC classes with high sensitivity and high accuracy.
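A rough, hedged sketch of a supervised contrastive loss adapted to ordinal labels: pairs are weighted by how close their ordinal AAC scores are, so nearby classes are pulled together more strongly than distant ones. The exponential weighting is my assumption; the paper's exact label-dependent distance metric may differ.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_ordinal_loss(features, labels, temperature=0.07):
    """features: (N, D) embeddings; labels: (N,) integer ordinal AAC scores (hedged sketch)."""
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature
    dist = (labels.unsqueeze(0) - labels.unsqueeze(1)).abs().float()
    weights = torch.exp(-dist)                                # assumed label-dependent weighting
    weights.fill_diagonal_(0)                                 # ignore self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -((weights * log_prob).sum(1) / weights.sum(1).clamp(min=1e-6)).mean()

loss = supervised_contrastive_ordinal_loss(torch.randn(16, 128), torch.randint(0, 25, (16,)))
```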

COLosSAL: A Benchmark for Cold-start Active Learning for 3D Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.12004
  • repo_url: https://github.com/medicl-vu/colossal
  • paper_authors: Han Liu, Hao Li, Xing Yao, Yubo Fan, Dewei Hu, Benoit Dawant, Vishwesh Nath, Zhoubing Xu, Ipek Oguz
  • for: Addressing the annotation bottleneck in medical image segmentation by benchmarking cold-start active learning, where the initial set of samples to annotate must be chosen from an entirely unlabeled pool; the paper introduces the COLosSAL benchmark and evaluation framework.
  • methods: Six cold-start active learning strategies are evaluated on five 3D medical image segmentation tasks from the public Medical Segmentation Decathlon collection.
  • results: Cold-start active learning remains an unsolved problem for 3D segmentation tasks, but several important trends are observed, including the impact of the annotation budget on different strategies.
    Abstract Medical image segmentation is a critical task in medical image analysis. In recent years, deep learning based approaches have shown exceptional performance when trained on a fully-annotated dataset. However, data annotation is often a significant bottleneck, especially for 3D medical images. Active learning (AL) is a promising solution for efficient annotation but requires an initial set of labeled samples to start active selection. When the entire data pool is unlabeled, how do we select the samples to annotate as our initial set? This is also known as the cold-start AL, which permits only one chance to request annotations from experts without access to previously annotated data. Cold-start AL is highly relevant in many practical scenarios but has been under-explored, especially for 3D medical segmentation tasks requiring substantial annotation effort. In this paper, we present a benchmark named COLosSAL by evaluating six cold-start AL strategies on five 3D medical image segmentation tasks from the public Medical Segmentation Decathlon collection. We perform a thorough performance analysis and explore important open questions for cold-start AL, such as the impact of budget on different strategies. Our results show that cold-start AL is still an unsolved problem for 3D segmentation tasks but some important trends have been observed. The code repository, data partitions, and baseline results for the complete benchmark are publicly available at https://github.com/MedICL-VU/COLosSAL.

A Stronger Stitching Algorithm for Fisheye Images based on Deblurring and Registration

  • paper_url: http://arxiv.org/abs/2307.11997
  • repo_url: None
  • paper_authors: Jing Hao, Jingming Xie, Jinyuan Zhang, Moyun Liu
  • for: Resolving the severe geometric distortion of fisheye images and improving the quality of fisheye image stitching.
  • methods: Traditional image processing is combined with deep learning: an Attention-based Nonlinear Activation Free Network (ANAFNet) deblurs fisheye images corrected by the Zhang calibration method, and an ORB-FREAK-GMS (OFG) matching algorithm improves the accuracy of image registration (a matching-pipeline sketch follows the abstract below).
  • results: Experimental results show that panoramic images of superior quality can be stitched from fisheye images with the proposed method.
    Abstract Fisheye lens, which is suitable for panoramic imaging, has the prominent advantage of a large field of view and low cost. However, the fisheye image has a severe geometric distortion which may interfere with the stage of image registration and stitching. Aiming to resolve this drawback, we devise a stronger stitching algorithm for fisheye images by combining the traditional image processing method with deep learning. In the stage of fisheye image correction, we propose the Attention-based Nonlinear Activation Free Network (ANAFNet) to deblur fisheye images corrected by Zhang calibration method. Specifically, ANAFNet adopts the classical single-stage U-shaped architecture based on convolutional neural networks with soft-attention technique and it can restore a sharp image from a blurred image effectively. In the part of image registration, we propose the ORB-FREAK-GMS (OFG), a comprehensive image matching algorithm, to improve the accuracy of image registration. Experimental results demonstrate that panoramic images of superior quality stitching by fisheye images can be obtained through our method.
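A hedged sketch of an ORB + FREAK + GMS matching pipeline of the kind the OFG step describes (requires opencv-contrib-python; all parameters are illustrative rather than the paper's settings):

```python
import cv2

def ofg_match(img1, img2):
    """ORB keypoints + FREAK descriptors + GMS match filtering (illustrative parameters)."""
    orb = cv2.ORB_create(nfeatures=5000)
    kp1, kp2 = orb.detect(img1, None), orb.detect(img2, None)
    freak = cv2.xfeatures2d.FREAK_create()
    kp1, des1 = freak.compute(img1, kp1)
    kp2, des2 = freak.compute(img2, kp2)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING).match(des1, des2)
    size1 = (img1.shape[1], img1.shape[0])                    # (width, height) for OpenCV
    size2 = (img2.shape[1], img2.shape[0])
    good = cv2.xfeatures2d.matchGMS(size1, size2, kp1, kp2, matches,
                                    withRotation=True, withScale=True)
    return kp1, kp2, good

# usage: kp1, kp2, matches = ofg_match(cv2.imread("left.png", 0), cv2.imread("right.png", 0))
```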

Morphology-inspired Unsupervised Gland Segmentation via Selective Semantic Grouping

  • paper_url: http://arxiv.org/abs/2307.11989
  • repo_url: https://github.com/xmed-lab/mssg
  • paper_authors: Qixiang Zhang, Yi Li, Cheng Xue, Xiaomeng Li
  • for: Developing a deep-learning method for unsupervised gland segmentation that requires no manual annotations, to advance automatic cancer diagnosis and prognosis.
  • methods: A Morphology-inspired method via Selective Semantic Grouping: an empirical cue about gland morphology is used to selectively mine proposals for gland sub-regions with variant appearances, and a Morphology-aware Semantic Grouping module summarizes the overall gland information by explicitly grouping the semantics of the sub-region proposals.
  • results: On the GlaS and CRAG datasets, the method exceeds the second-best counterpart by 10.56% mIOU.
    Abstract Designing deep learning algorithms for gland segmentation is crucial for automatic cancer diagnosis and prognosis, yet the expensive annotation cost hinders the development and application of this technology. In this paper, we make a first attempt to explore a deep learning method for unsupervised gland segmentation, where no manual annotations are required. Existing unsupervised semantic segmentation methods encounter a huge challenge on gland images: They either over-segment a gland into many fractions or under-segment the gland regions by confusing many of them with the background. To overcome this challenge, our key insight is to introduce an empirical cue about gland morphology as extra knowledge to guide the segmentation process. To this end, we propose a novel Morphology-inspired method via Selective Semantic Grouping. We first leverage the empirical cue to selectively mine out proposals for gland sub-regions with variant appearances. Then, a Morphology-aware Semantic Grouping module is employed to summarize the overall information about the gland by explicitly grouping the semantics of its sub-region proposals. In this way, the final segmentation network could learn comprehensive knowledge about glands and produce well-delineated, complete predictions. We conduct experiments on GlaS dataset and CRAG dataset. Our method exceeds the second-best counterpart over 10.56% at mIOU.

Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering

  • paper_url: http://arxiv.org/abs/2307.11986
  • repo_url: https://github.com/holipori/mimic-diff-vqa
  • paper_authors: Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, Ronald M. Summers, Yingying Zhu
  • for: Advancing the automation of medical vision-language models for radiology.
  • methods: A difference-aware visual question answering (VQA) task is built on pairs of main and reference chest X-rays, following the radiologist's practice of comparing the current image with a reference; the newly collected MIMIC-Diff-VQA dataset contains 700,703 QA pairs from 164,324 image pairs, with questions tailored to the Assessment-Diagnosis-Intervention-Evaluation treatment procedure used by clinical professionals.
  • results: A novel expert knowledge-aware graph representation learning model is proposed as a baseline: it uses anatomical-structure priors and semantic and spatial knowledge to construct a multi-relationship graph representing the differences between the two images.
    Abstract To contribute to automating the medical vision-language model, we propose a novel Chest-Xray Difference Visual Question Answering (VQA) task. Given a pair of main and reference images, this task attempts to answer several questions on both diseases and, more importantly, the differences between them. This is consistent with the radiologist's diagnosis practice that compares the current image with the reference before concluding the report. We collect a new dataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs of main and reference images. Compared to existing medical VQA datasets, our questions are tailored to the Assessment-Diagnosis-Intervention-Evaluation treatment procedure used by clinical professionals. Meanwhile, we also propose a novel expert knowledge-aware graph representation learning model to address this task. The proposed baseline model leverages expert knowledge such as anatomical structure prior, semantic, and spatial knowledge to construct a multi-relationship graph, representing the image differences between two images for the image difference VQA task. The dataset and code can be found at https://github.com/Holipori/MIMIC-Diff-VQA. We believe this work would further push forward the medical vision language model.

Simulation of Arbitrary Level Contrast Dose in MRI Using an Iterative Global Transformer Model

  • paper_url: http://arxiv.org/abs/2307.11980
  • repo_url: None
  • paper_authors: Dayang Wang, Srivathsa Pasumarthi, Greg Zaharchuk, Ryan Chamberlain
  • for: A deep-learning-based approach to contrast dose reduction and elimination in MRI, mitigating the detrimental effects of Gadolinium-based Contrast Agents (GBCAs).
  • methods: A transformer-based (Gformer) iterative modelling approach synthesizes images with arbitrary contrast enhancement corresponding to different dose levels; it incorporates a sub-sampling-based attention mechanism and a rotational shift module that capture contrast-related features.
  • results: Quantitative evaluation indicates that the proposed model performs better than other state-of-the-art methods, and evaluations on downstream tasks such as dose reduction and tumor segmentation demonstrate its clinical utility.
    Abstract Deep learning (DL) based contrast dose reduction and elimination in MRI imaging is gaining traction, given the detrimental effects of Gadolinium-based Contrast Agents (GBCAs). These DL algorithms are however limited by the availability of high quality low dose datasets. Additionally, different types of GBCAs and pathologies require different dose levels for the DL algorithms to work reliably. In this work, we formulate a novel transformer (Gformer) based iterative modelling approach for the synthesis of images with arbitrary contrast enhancement that corresponds to different dose levels. The proposed Gformer incorporates a sub-sampling based attention mechanism and a rotational shift module that captures the various contrast related features. Quantitative evaluation indicates that the proposed model performs better than other state-of-the-art methods. We further perform quantitative evaluation on downstream tasks such as dose reduction and tumor segmentation to demonstrate the clinical utility.
    摘要 鉴于钆基造影剂(GBCAs)的不良影响,基于深度学习(DL)的 MRI 对比剂剂量降低与消除方法正受到越来越多的关注。然而,这类 DL 算法受限于高质量低剂量数据集的可获得性;此外,不同类型的 GBCAs 和病变需要不同的剂量水平,算法才能可靠工作。在本工作中,我们提出了一种基于 Transformer(Gformer)的迭代建模方法,用于合成具有任意对比增强、对应不同剂量水平的图像。所提出的 Gformer 包含基于下采样的注意力机制和旋转位移模块,以捕捉各种与对比相关的特征。定量评估表明,所提模型优于其他最新方法。我们还在剂量降低和肿瘤分割等下游任务上进行了定量评估,以证明其临床实用性。

Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition

  • paper_url: http://arxiv.org/abs/2307.11973
  • repo_url: None
  • paper_authors: Yao Liu, Gangfeng Cui, Jiahui Luo, Lina Yao, Xiaojun Chang
  • for: 本研究的目的是提出一种基于点云网络的两人交互识别方法,以满足人工智能应用中的个人隐私要求。
  • methods: 我们提出了名为 Interval Frame Sampling(IFS)的帧选择方法,以及双流多级特征聚合模块,用于提取全局与局部特征。
  • results: 我们的网络在 NTU RGB+D 60 和 NTU RGB+D 120 的交互子集上进行了广泛实验,在所有标准评估设置下均优于现有最佳方法。
    Abstract As a fundamental aspect of human life, two-person interactions contain meaningful information about people's activities, relationships, and social settings. Human action recognition serves as the foundation for many smart applications, with a strong focus on personal privacy. However, recognizing two-person interactions poses more challenges due to increased body occlusion and overlap compared to single-person actions. In this paper, we propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition. Our model addresses the challenge of recognizing two-person interactions by incorporating local-region spatial information, appearance information, and motion information. To achieve this, we introduce a designed frame selection method named Interval Frame Sampling (IFS), which efficiently samples frames from videos, capturing more discriminative information in a relatively short processing time. Subsequently, a frame features learning module and a two-stream multi-level feature aggregation module extract global and partial features from the sampled frames, effectively representing the local-region spatial information, appearance information, and motion information related to the interactions. Finally, we apply a transformer to perform self-attention on the learned features for the final classification. Extensive experiments are conducted on two large-scale datasets, the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120. The results show that our network outperforms state-of-the-art approaches across all standard evaluation settings.
    摘要 作为人类生活的基本组成部分,两人交互蕴含着关于人们活动、关系和社交环境的有意义信息。人体动作识别是许多智能应用的基础,并且高度关注个人隐私。然而,与单人动作相比,两人交互识别因身体遮挡和重叠更严重而更具挑战性。在本文中,我们提出了一种基于点云的网络 Two-stream Multi-level Dynamic Point Transformer,用于两人交互识别。我们的模型通过融合局部区域空间信息、外观信息和运动信息来应对这一挑战。为此,我们设计了名为 Interval Frame Sampling(IFS)的帧选择方法,能够在较短的处理时间内高效地从视频中采样出更具判别性的帧。随后,帧特征学习模块和双流多级特征聚合模块从采样帧中提取全局与局部特征,有效表示与交互相关的局部区域空间信息、外观信息和运动信息。最后,我们使用 Transformer 对学习到的特征进行自注意力计算,完成最终分类。我们在 NTU RGB+D 60 和 NTU RGB+D 120 两个大规模数据集的交互子集上进行了广泛实验,结果表明我们的网络在所有标准评估设置下均优于现有最佳方法。
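
As a rough illustration of the interval-based frame selection described above, the sketch below samples one frame per equal-length interval (random offset during training, interval centre at test time). This is an assumption about how IFS behaves based on common practice; the paper's exact sampling rule may differ, and the function name and parameters are illustrative.

```python
# Minimal sketch of interval-based frame sampling (assumed behaviour, not the
# paper's exact IFS rule): one frame per equal interval, random when training.
import random
from typing import List

def interval_frame_sampling(num_frames: int, num_samples: int,
                            training: bool = True) -> List[int]:
    """Return `num_samples` frame indices drawn from a clip of `num_frames` frames."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    interval = num_frames / num_samples          # length of each interval
    indices = []
    for k in range(num_samples):
        start = k * interval
        end = min((k + 1) * interval, num_frames)
        if training:                              # random frame inside the interval
            idx = random.uniform(start, end)
        else:                                     # deterministic: interval centre
            idx = (start + end) / 2.0
        indices.append(min(int(idx), num_frames - 1))
    return indices

if __name__ == "__main__":
    print(interval_frame_sampling(300, 8, training=False))
    print(interval_frame_sampling(300, 8, training=True))
```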

Intelligent Remote Sensing Image Quality Inspection System

  • paper_url: http://arxiv.org/abs/2307.11965
  • repo_url: None
  • paper_authors: Yijiong Yu, Tao Wang, Kang Ran, Chang Li, Hao Wu
  • for: 这篇论文旨在提出一个新的两步智能系统,用于遥感图像质量检验,以提高检验效率。
  • methods: 该方法结合多个模型,首先进行图像分类,然后采用最合适的方法定位图像中的各类质量问题。
  • results: 结果表明,所提方法在遥感图像质量检验中表现出色,优于若干一步式方法;论文还初步探讨了多模态模型在遥感图像质量检验中的可行性与潜力。
    Abstract Quality inspection is a necessary task before putting any remote sensing image into practical application. However, traditional manual inspection methods suffer from low efficiency. Hence, we propose a novel two-step intelligent system for remote sensing image quality inspection that combines multiple models, which first performs image classification and then employs the most appropriate methods to localize various forms of quality problems in the image. Results demonstrate that the proposed method exhibits excellent performance and efficiency in remote sensing image quality inspection, surpassing the performance of those one-step methods. Furthermore, we conduct an initial exploration of the feasibility and potential of applying multimodal models to remote sensing image quality inspection.
    摘要 在将遥感图像投入实际应用之前,质量检验是一项必要的工作。然而,传统的人工检验方法效率低下。因此,我们提出了一种结合多个模型的新型两步智能遥感图像质量检验系统:先进行图像分类,再采用最合适的方法定位图像中的各类质量问题。结果表明,所提方法在遥感图像质量检验中表现出色且高效,优于一步式方法。此外,我们还初步探讨了将多模态模型应用于遥感图像质量检验的可行性与潜力。

MIMONet: Multi-Input Multi-Output On-Device Deep Learning

  • paper_url: http://arxiv.org/abs/2307.11962
  • repo_url: None
  • paper_authors: Zexin Li, Xiaoxi He, Yufei Li, Shahab Nikkhoo, Wei Yang, Lothar Thiele, Cong Liu
  • for: This paper aims to improve the performance of intelligent robotic systems by proposing a novel on-device multi-input multi-output deep neural network (MIMO DNN) framework called MIMONet.
  • methods: MIMONet leverages existing single-input single-output (SISO) model compression techniques and develops a new deep-compression method tailored to MIMO models, which explores unique properties of the MIMO model to achieve boosted accuracy and on-device efficiency.
  • results: Extensive experiments on three embedded platforms and a case study using the TurtleBot3 robot demonstrate that MIMONet achieves higher accuracy and superior on-device efficiency compared to state-of-the-art SISO and MISO models, as well as a baseline MIMO model.
    Abstract Future intelligent robots are expected to process multiple inputs simultaneously (such as image and audio data) and generate multiple outputs accordingly (such as gender and emotion), similar to humans. Recent research has shown that multi-input single-output (MISO) deep neural networks (DNN) outperform traditional single-input single-output (SISO) models, representing a significant step towards this goal. In this paper, we propose MIMONet, a novel on-device multi-input multi-output (MIMO) DNN framework that achieves high accuracy and on-device efficiency in terms of critical performance metrics such as latency, energy, and memory usage. Leveraging existing SISO model compression techniques, MIMONet develops a new deep-compression method that is specifically tailored to MIMO models. This new method explores unique yet non-trivial properties of the MIMO model, resulting in boosted accuracy and on-device efficiency. Extensive experiments on three embedded platforms commonly used in robotic systems, as well as a case study using the TurtleBot3 robot, demonstrate that MIMONet achieves higher accuracy and superior on-device efficiency compared to state-of-the-art SISO and MISO models, as well as a baseline MIMO model we constructed. Our evaluation highlights the real-world applicability of MIMONet and its potential to significantly enhance the performance of intelligent robotic systems.
    摘要 未来的智能机器人需要像人类一样同时处理多种输入(如图像和音频数据),并相应地生成多种输出(如性别和情感)。最新研究表明,多输入单输出(MISO)深度神经网络(DNN)的性能优于传统的单输入单输出(SISO)模型,这是朝着上述目标迈出的重要一步。在本文中,我们提出了 MIMONet,一种在设备端运行的多输入多输出(MIMO)DNN 框架,在延迟、能耗和内存占用等关键性能指标上同时实现高精度和高效率。MIMONet 在现有 SISO 模型压缩技术的基础上,开发了一种专门面向 MIMO 模型的深度压缩方法,利用 MIMO 模型独特但并不平凡的性质,同时提升精度与设备端效率。我们在机器人系统中常用的三种嵌入式平台上进行了广泛实验,并以 TurtleBot3 机器人开展了案例研究。结果表明,MIMONet 在精度和设备端效率上均优于最新的 SISO 与 MISO 模型,以及我们构建的基线 MIMO 模型。我们的评估凸显了 MIMONet 的实际可用性及其显著提升智能机器人系统性能的潜力。
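
The multi-input multi-output idea can be illustrated with a toy PyTorch model that shares a trunk across two input branches and attaches one head per output task. This is only a hedged sketch of the general MIMO pattern; it is not MIMONet's architecture and says nothing about the paper's compression method. All layer sizes and names are made up.

```python
import torch
import torch.nn as nn

class ToyMIMONet(nn.Module):
    """Shared trunk with two input branches (image, audio) and two output heads."""
    def __init__(self, img_dim=512, audio_dim=128, hidden=256):
        super().__init__()
        self.img_branch = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.trunk = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.gender_head = nn.Linear(hidden, 2)    # output task 1
        self.emotion_head = nn.Linear(hidden, 7)   # output task 2

    def forward(self, img_feat, audio_feat):
        h = torch.cat([self.img_branch(img_feat),
                       self.audio_branch(audio_feat)], dim=-1)
        h = self.trunk(h)
        return self.gender_head(h), self.emotion_head(h)

model = ToyMIMONet()
gender_logits, emotion_logits = model(torch.randn(4, 512), torch.randn(4, 128))
print(gender_logits.shape, emotion_logits.shape)  # (4, 2) and (4, 7)
```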

DHC: Dual-debiased Heterogeneous Co-training Framework for Class-imbalanced Semi-supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.11960
  • repo_url: https://github.com/xmed-lab/dhc
  • paper_authors: Haonan Wang, Xiaomeng Li
  • for: 这个研究旨在提出一个基于半监督学习(SSL)的3D医疗影像分割框架,以缓解逐体素标注所需的专业知识与时间成本,并解决类别不平衡问题。
  • methods: 提出名为 Dual-debiased Heterogeneous Co-training(DHC)的新框架,包含两种损失加权策略,即 Distribution-aware Debiased Weighting(DistDW)和 Difficulty-aware Debiased Weighting(DiffDW),利用伪标签动态引导模型克服数据偏差与学习偏差,并对两个互异且准确的子模型进行协同训练。
  • results: 实验结果显示,所提方法利用伪标签进行去偏并缓解类别不平衡问题,性能超越现有 SSL 方法;此外还提出了更具代表性的类别不平衡半监督医疗影像分割基准,以充分展示相关设计的有效性。
    Abstract The volume-wise labeling of 3D medical images is expertise-demanded and time-consuming; hence semi-supervised learning (SSL) is highly desirable for training with limited labeled data. Imbalanced class distribution is a severe problem that bottlenecks the real-world application of these methods but was not addressed much. Aiming to solve this issue, we present a novel Dual-debiased Heterogeneous Co-training (DHC) framework for semi-supervised 3D medical image segmentation. Specifically, we propose two loss weighting strategies, namely Distribution-aware Debiased Weighting (DistDW) and Difficulty-aware Debiased Weighting (DiffDW), which leverage the pseudo labels dynamically to guide the model to solve data and learning biases. The framework improves significantly by co-training these two diverse and accurate sub-models. We also introduce more representative benchmarks for class-imbalanced semi-supervised medical image segmentation, which can fully demonstrate the efficacy of the class-imbalance designs. Experiments show that our proposed framework brings significant improvements by using pseudo labels for debiasing and alleviating the class imbalance problem. More importantly, our method outperforms the state-of-the-art SSL methods, demonstrating the potential of our framework for the more challenging SSL setting. Code and models are available at: https://github.com/xmed-lab/DHC.
    摘要 3D 医疗影像的逐体素标注既需要专业知识又十分耗时,因此半监督学习(SSL)对于利用有限标注数据进行训练非常重要。类别分布不平衡是制约这类方法实际应用的严重问题,但此前鲜有研究。针对这一问题,我们提出了一种新的 Dual-debiased Heterogeneous Co-training(DHC)框架,用于半监督 3D 医疗影像分割。具体而言,我们提出了两种损失加权策略,即 Distribution-aware Debiased Weighting(DistDW)和 Difficulty-aware Debiased Weighting(DiffDW),利用伪标签动态引导模型克服数据偏差与学习偏差。通过对这两个互异且准确的子模型进行协同训练,框架性能显著提升。我们还引入了更具代表性的类别不平衡半监督医疗影像分割基准,以充分展示类别不平衡设计的有效性。实验表明,所提框架借助伪标签进行去偏并缓解类别不平衡问题,带来显著改进;更重要的是,其性能超越现有最佳 SSL 方法,展示了该框架在更具挑战性的 SSL 场景下的潜力。代码与模型见 https://github.com/xmed-lab/DHC。
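
To make the debiased-weighting idea concrete, here is a minimal sketch of distribution-aware class weights computed from pseudo-label voxel counts: rarer classes receive larger loss weights. The exact DistDW/DiffDW formulas in the paper (and their dynamic updates during training) are more involved; the normalisation below is an assumption for illustration only.

```python
import numpy as np

def distribution_aware_weights(pseudo_label_counts, eps=1e-6):
    """
    Toy version of distribution-aware debiased weighting: classes that occupy
    few pseudo-labelled voxels get larger loss weights. The paper's exact
    normalisation may differ.
    """
    counts = np.asarray(pseudo_label_counts, dtype=np.float64) + eps
    inv = 1.0 / counts                       # rarer class -> larger weight
    weights = inv / inv.sum() * len(counts)  # normalise so weights average to 1
    return weights

# Example: voxel counts per class from the current batch of pseudo labels
print(distribution_aware_weights([900_000, 50_000, 5_000, 500]))
```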

Topology-Preserving Automatic Labeling of Coronary Arteries via Anatomy-aware Connection Classifier

  • paper_url: http://arxiv.org/abs/2307.11959
  • repo_url: https://github.com/zutsusemi/miccai2023-topolab-labels
  • paper_authors: Zhixing Zhang, Ziwei Zhao, Dong Wang, Shishuang Zhao, Yuhang Liu, Jia Liu, Liwei Wang
  • for: 这个论文主要是为了提高心血管疾病诊断过程中自动标注 coronary artery 的精度和效率。
  • methods: 该方法使用了新的 TopoLab 框架,该框架包括明确表征动脉连接的方法,以及层次结构特征提取和动脉间特征互动的策略。
  • results: 实验结果表明,TopoLab 在 orCaScore 数据集和一个内部数据集上均达到了最佳(state-of-the-art)性能,提高了冠状动脉自动标注的精度。
    Abstract Automatic labeling of coronary arteries is an essential task in the practical diagnosis process of cardiovascular diseases. For experienced radiologists, the anatomically predetermined connections are important for labeling the artery segments accurately, while this prior knowledge is barely explored in previous studies. In this paper, we present a new framework called TopoLab which incorporates the anatomical connections into the network design explicitly. Specifically, the strategies of intra-segment feature aggregation and inter-segment feature interaction are introduced for hierarchical segment feature extraction. Moreover, we propose the anatomy-aware connection classifier to enable classification for each connected segment pair, which effectively exploits the prior topology among the arteries with different categories. To validate the effectiveness of our method, we contribute high-quality annotations of artery labeling to the public orCaScore dataset. The experimental results on both the orCaScore dataset and an in-house dataset show that our TopoLab has achieved state-of-the-art performance.
    摘要 冠状动脉自动标注是心血管疾病实际诊断流程中的一项重要任务。对经验丰富的放射科医生而言,解剖学上预先确定的连接关系对准确标注动脉段十分重要,而这一先验知识在以往研究中几乎未被利用。本文提出了名为 TopoLab 的新框架,将解剖连接关系显式地融入网络设计:引入段内特征聚合与段间特征交互策略,实现层次化的动脉段特征提取;并提出解剖感知的连接分类器,对每一对相连的动脉段进行分类,从而有效利用不同类别动脉之间的先验拓扑结构。为验证方法的有效性,我们为公开的 orCaScore 数据集贡献了高质量的动脉标注。在 orCaScore 数据集和一个内部数据集上的实验结果表明,TopoLab 达到了最佳(state-of-the-art)性能。

Pick the Best Pre-trained Model: Towards Transferability Estimation for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.11958
  • repo_url: https://github.com/endoluminalsurgicalvision-imr/ccfv
  • paper_authors: Yuncheng Yang, Meng Wei, Junjun He, Jie Yang, Jin Ye, Yun Gu
  • for: 该论文研究医疗图像分割任务中预训练模型的可迁移性估计问题,以便从大量候选源模型中正确且高效地选择模型进行复用。
  • methods: 该论文提出了一种新的转移性能度估计(TE)方法,该方法考虑了类别一致性和特征多样性,以更好地估计转移性能度。
  • results: 对比现有的TE算法,该方法在医疗图像分割任务中的转移性能度估计表现出色,并且在实验中胜过所有现有的TE算法。
    Abstract Transfer learning is a critical technique in training deep neural networks for the challenging medical image segmentation task that requires enormous resources. With the abundance of medical image data, many research institutions release models trained on various datasets that can form a huge pool of candidate source models to choose from. Hence, it's vital to estimate the source models' transferability (i.e., the ability to generalize across different downstream tasks) for proper and efficient model reuse. To make up for its deficiency when applying transfer learning to medical image segmentation, in this paper, we therefore propose a new Transferability Estimation (TE) method. We first analyze the drawbacks of using the existing TE algorithms for medical image segmentation and then design a source-free TE framework that considers both class consistency and feature variety for better estimation. Extensive experiments show that our method surpasses all current algorithms for transferability estimation in medical image segmentation. Code is available at https://github.com/EndoluminalSurgicalVision-IMR/CCFV
    摘要 迁移学习是训练深度神经网络完成医学图像分割这一需要大量资源的任务的关键技术。随着医学图像数据的丰富,许多研究机构发布了在各种数据集上训练的模型,形成了庞大的候选源模型库。因此,估计源模型的可迁移性(即在不同下游任务上的泛化能力)对于正确且高效地复用模型至关重要。为弥补现有方法在医学图像分割迁移学习中的不足,本文提出了一种新的可迁移性估计(TE)方法。我们首先分析了现有 TE 算法用于医学图像分割时的缺陷,进而设计了一个无需源数据的 TE 框架,同时考虑类别一致性与特征多样性,以获得更好的估计。大量实验表明,我们的方法在医学图像分割的可迁移性估计上超越了现有所有算法。代码见 https://github.com/EndoluminalSurgicalVision-IMR/CCFV。
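
A hedged sketch of what a source-free transferability score rewarding class consistency and feature variety could look like: average intra-class cosine similarity plus a scaled log-determinant of the feature covariance, computed on features the candidate source model extracts from target data. This is not the paper's CCFV criterion; the combination rule and scaling are assumptions for illustration.

```python
import numpy as np

def transferability_score(features, labels, eps=1e-6):
    """
    Toy transferability score: class consistency (mean intra-class cosine
    similarity) + feature variety (normalised log-det of feature covariance).
    Illustrative only; the paper defines its criterion differently in detail.
    """
    feats = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    consistency = []
    for c in np.unique(labels):
        f = feats[labels == c]
        if len(f) > 1:
            sim = f @ f.T                          # pairwise cosine similarities
            n = len(f)
            consistency.append((sim.sum() - n) / (n * (n - 1)))
    consistency = float(np.mean(consistency))

    cov = np.cov(features, rowvar=False) + eps * np.eye(features.shape[1])
    variety = float(np.linalg.slogdet(cov)[1]) / features.shape[1]
    return consistency + variety

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 32))   # features from a candidate source model
labels = rng.integers(0, 4, size=200)
print(transferability_score(feats, labels))
```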

High-performance real-world optical computing trained by in situ model-free optimization

  • paper_url: http://arxiv.org/abs/2307.11957
  • repo_url: None
  • paper_authors: Guangyuan Zhao, Xin Shu, Renjie Zhou
  • for: 光学计算系统可提供高速、低能耗的数据处理,但面临训练计算量大以及模拟与现实存在差距的问题。
  • methods: 提出一种无需模型的光学计算系统在位优化方法,基于分数梯度估计算法,将损失直接反向传播到光学权重的概率分布上,从而无需计算量大且有偏差的系统模拟。
  • results: 在 MNIST 和 FMNIST 数据集上实现了高精度分类,并在单层衍射光学计算系统上展示了免成像、高速细胞分析的潜力。
    Abstract Optical computing systems can provide high-speed and low-energy data processing but face deficiencies in computationally demanding training and simulation-to-reality gap. We propose a model-free solution for lightweight in situ optimization of optical computing systems based on the score gradient estimation algorithm. This approach treats the system as a black box and back-propagates loss directly to the optical weights' probabilistic distributions, hence circumventing the need for computation-heavy and biased system simulation. We demonstrate a superior classification accuracy on the MNIST and FMNIST datasets through experiments on a single-layer diffractive optical computing system. Furthermore, we show its potential for image-free and high-speed cell analysis. The inherent simplicity of our proposed method, combined with its low demand for computational resources, expedites the transition of optical computing from laboratory demonstrations to real-world applications.
    摘要 光学计算系统可以提供高速、低能耗的数据处理,但面临训练计算量大以及模拟与现实存在差距的问题。我们提出了一种基于分数梯度估计算法、无需模型的轻量级在位优化方案:将系统视为黑盒,把损失直接反向传播到光学权重的概率分布上,从而绕开计算量大且有偏差的系统模拟。我们在单层衍射光学计算系统上进行实验,在 MNIST 和 FMNIST 数据集上取得了优异的分类精度,并进一步展示了其在免成像、高速细胞分析方面的潜力。所提方法本身简单、对计算资源需求低,有助于推动光学计算从实验室演示走向实际应用。
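
The model-free, in situ idea rests on score-gradient (log-derivative, REINFORCE-style) estimation: sample weights from a distribution, query the physical system for the loss, and update the distribution parameters without ever differentiating the system itself. Below is a minimal NumPy sketch with a synthetic stand-in for the optical hardware; the actual system interface, distribution family and hyperparameters used in the paper may differ.

```python
import numpy as np

def blackbox_loss(weights):
    """Stand-in for the physical optical system + readout; returns a scalar loss."""
    target = np.linspace(-1.0, 1.0, weights.size)
    return float(np.mean((weights - target) ** 2))

def train_score_gradient(dim=32, steps=300, pop=16, sigma=0.1, lr=0.5, seed=0):
    """REINFORCE-style update of the mean of a Gaussian over the weights.
    No gradient of the system is ever required, only loss queries."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)
    for _ in range(steps):
        noise = rng.normal(size=(pop, dim))
        losses = np.array([blackbox_loss(mu + sigma * n) for n in noise])
        advantage = losses - losses.mean()            # variance-reduction baseline
        # d/dmu E[loss] ~= E[loss * (w - mu) / sigma^2] for w ~ N(mu, sigma^2 I)
        grad = (advantage[:, None] * noise).mean(axis=0) / sigma
        mu -= lr * grad
    return mu

mu = train_score_gradient()
print("final loss:", blackbox_loss(mu))
```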

Pūioio: On-device Real-Time Smartphone-Based Automated Exercise Repetition Counting System

  • paper_url: http://arxiv.org/abs/2308.02420
  • repo_url: None
  • paper_authors: Adam Sinclair, Kayla Kautai, Seyed Reza Shahamiri
  • for: 这个研究旨在开发一个基于深度学习的智能手机应用,仅依靠设备端推理实时计数运动重复次数。
  • methods: 系统包含五个组成部分:姿态估计、阈值判断、光流、状态机与计数器。
  • results: 系统在实时测试中达到 98.89% 的准确率,在预先录制的数据集上达到 98.85% 的准确率。
    Abstract Automated exercise repetition counting has applications across the physical fitness realm, from personal health to rehabilitation. Motivated by the ubiquity of mobile phones and the benefits of tracking physical activity, this study explored the feasibility of counting exercise repetitions in real-time, using only on-device inference, on smartphones. In this work, after providing an extensive overview of the state-of-the-art automatic exercise repetition counting methods, we introduce a deep learning based exercise repetition counting system for smartphones consisting of five components: (1) Pose estimation, (2) Thresholding, (3) Optical flow, (4) State machine, and (5) Counter. The system is then implemented via a cross-platform mobile application named P\=uioio that uses only the smartphone camera to track repetitions in real time for three standard exercises: Squats, Push-ups, and Pull-ups. The proposed system was evaluated via a dataset of pre-recorded videos of individuals exercising as well as testing by subjects exercising in real time. Evaluation results indicated the system was 98.89% accurate in real-world tests and up to 98.85% when evaluated via the pre-recorded dataset. This makes it an effective, low-cost, and convenient alternative to existing solutions since the proposed system has minimal hardware requirements without requiring any wearable or specific sensors or network connectivity.
    摘要 自动化运动重复计数在从个人健康到康复的各类健身场景中都有应用。鉴于手机的普及以及跟踪身体活动的益处,本研究探讨了仅依靠设备端推理、在智能手机上实时计数运动重复次数的可行性。在概述当前自动运动重复计数方法的基础上,我们提出了一个基于深度学习的智能手机运动重复计数系统,包含五个组成部分:(1)姿态估计,(2)阈值判断,(3)光流,(4)状态机,(5)计数器。该系统通过名为 P\=uioio 的跨平台移动应用实现,仅使用手机摄像头实时跟踪三种标准运动(深蹲、俯卧撑、引体向上)的重复次数。我们通过预先录制的运动视频数据集以及受试者实时运动测试对系统进行评估:实际测试准确率为 98.89%,在预录数据集上为 98.85%。由于硬件要求极低、无需任何穿戴式设备、特定传感器或网络连接,该系统是现有方案之外一种有效、低成本且便捷的替代选择。
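
The thresholding + state machine + counter stages can be illustrated on a 1-D signal derived from pose estimation (e.g. a normalised joint angle). The sketch below assumes a simple two-state hysteresis counter; the pose-estimation and optical-flow components are out of scope, and the thresholds and signal values shown are made up.

```python
def count_repetitions(signal, low=0.3, high=0.7):
    """
    Toy version of the thresholding + state-machine + counter stages.
    `signal` is a per-frame scalar derived from pose estimation (e.g. a
    normalised knee angle for squats). A repetition is counted each time the
    signal goes below `low` ("down") and then back above `high` ("up").
    """
    state, reps = "up", 0
    for value in signal:
        if state == "up" and value < low:
            state = "down"
        elif state == "down" and value > high:
            state = "up"
            reps += 1
    return reps

# Synthetic example: three squats followed by standing still
signal = [0.9, 0.6, 0.2, 0.1, 0.5, 0.9,     # rep 1
          0.8, 0.25, 0.15, 0.6, 0.95,       # rep 2
          0.7, 0.2, 0.1, 0.4, 0.8,          # rep 3
          0.85, 0.9, 0.9]
print(count_repetitions(signal))  # 3
```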

LAMP: Leveraging Language Prompts for Multi-person Pose Estimation

  • paper_url: http://arxiv.org/abs/2307.11934
  • repo_url: https://github.com/shengnanh20/lamp
  • paper_authors: Shengnan Hu, Ce Zheng, Zixiang Zhou, Chen Chen, Gita Sukthankar
  • for: 这篇论文目的是提高人机交互的效果,帮助社交机器人在拥挤的公共场所中穿梭。
  • methods: 该论文提出了一种新的基于提示的多人姿态估计策略,即语言辅助多人姿态估计(LAMP)。该策略利用预训练语言模型(CLIP)生成的文本表示帮助理解实例级与关节级的姿态,并学习对遮挡更鲁棒的视觉表示。
  • results: 该论文表明,语言监督训练能够提升单阶段多人姿态估计的性能,且实例级与关节级提示对训练均有价值。
    Abstract Human-centric visual understanding is an important desideratum for effective human-robot interaction. In order to navigate crowded public places, social robots must be able to interpret the activity of the surrounding humans. This paper addresses one key aspect of human-centric visual understanding, multi-person pose estimation. Achieving good performance on multi-person pose estimation in crowded scenes is difficult due to the challenges of occluded joints and instance separation. In order to tackle these challenges and overcome the limitations of image features in representing invisible body parts, we propose a novel prompt-based pose inference strategy called LAMP (Language Assisted Multi-person Pose estimation). By utilizing the text representations generated by a well-trained language model (CLIP), LAMP can facilitate the understanding of poses on the instance and joint levels, and learn more robust visual representations that are less susceptible to occlusion. This paper demonstrates that language-supervised training boosts the performance of single-stage multi-person pose estimation, and both instance-level and joint-level prompts are valuable for training. The code is available at https://github.com/shengnanh20/LAMP.
    摘要 以人为中心的视觉理解是实现高效人机交互的重要前提。社交机器人要在拥挤的公共场所中穿行,必须能够理解周围人群的活动。本文研究以人为中心的视觉理解的一个关键方面:多人姿态估计。在拥挤场景中,由于关节遮挡和实例分离困难,多人姿态估计很难取得良好性能。为应对这些挑战并克服图像特征在表示不可见身体部位方面的局限,我们提出了一种新的基于提示的姿态推断策略 LAMP(Language Assisted Multi-person Pose estimation)。LAMP 利用训练好的语言模型(CLIP)生成的文本表示,促进实例级与关节级的姿态理解,并学习对遮挡更鲁棒的视觉表示。本文证明,语言监督训练可以提升单阶段多人姿态估计的性能,且实例级与关节级提示对训练均有价值。代码见 https://github.com/shengnanh20/LAMP。

RICo: Rotate-Inpaint-Complete for Generalizable Scene Reconstruction

  • paper_url: http://arxiv.org/abs/2307.11932
  • repo_url: None
  • paper_authors: Isaac Kasahara, Shubham Agrawal, Selim Engin, Nikhil Chavan-Dafle, Shuran Song, Volkan Isler
  • for: scene reconstruction from a single view, with the goal of estimating the full 3D geometry and texture of a scene containing previously unseen objects.
  • methods: leveraging large language models to inpaint missing areas of scene color images rendered from different views, and then lifting these inpainted images to 3D by predicting normals of the inpainted image and solving for the missing depth values.
  • results: outperforms multiple baselines while providing generalization to novel objects and scenes, with robustness to changes in depth distributions and scale.
    Abstract General scene reconstruction refers to the task of estimating the full 3D geometry and texture of a scene containing previously unseen objects. In many practical applications such as AR/VR, autonomous navigation, and robotics, only a single view of the scene may be available, making the scene reconstruction a very challenging task. In this paper, we present a method for scene reconstruction by structurally breaking the problem into two steps: rendering novel views via inpainting and 2D to 3D scene lifting. Specifically, we leverage the generalization capability of large language models to inpaint the missing areas of scene color images rendered from different views. Next, we lift these inpainted images to 3D by predicting normals of the inpainted image and solving for the missing depth values. By predicting for normals instead of depth directly, our method allows for robustness to changes in depth distributions and scale. With rigorous quantitative evaluation, we show that our method outperforms multiple baselines while providing generalization to novel objects and scenes.
    摘要 通用场景重建是指估计包含未见过物体的场景的完整 3D 几何与纹理。在 AR/VR、自动导航和机器人等许多实际应用中,往往只能获得场景的单一视角,这使得场景重建非常具有挑战性。本文将该问题结构化地拆分为两步:通过图像修补渲染新视角,以及将 2D 图像提升为 3D 场景。具体而言,我们利用大型语言模型的泛化能力,对从不同视角渲染的场景彩色图像中的缺失区域进行修补;随后,通过预测修补图像的法向量并求解缺失的深度值,将这些图像提升到 3D。由于预测的是法向量而非直接预测深度,我们的方法对深度分布和尺度的变化具有鲁棒性。严格的定量评估表明,我们的方法优于多个基线,并能泛化到新的物体和场景。

PartDiff: Image Super-resolution with Partial Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.11926
  • repo_url: None
  • paper_authors: Kai Zhao, Alex Ling Yu Hung, Kaifeng Pang, Haoxin Zheng, Kyunghyun Sung
  • for: 这个论文主要针对的是图像超分辨率生成任务中的Diffusion-based生成模型,它们可以很好地生成高质量的图像。
  • methods: 这个论文提出了一种新的Partial Diffusion Model(PartDiff),它在图像扩散过程中只需要扩散到中间 latent state 而不是扩散到完全随机噪声中,从中间 latent state 开始进行恢复。
  • results: 在 MRI 和自然图像的超分辨率任务上,Partial Diffusion Model 可显著减少去噪步骤数量,而不牺牲生成质量。
    Abstract Denoising diffusion probabilistic models (DDPMs) have achieved impressive performance on various image generation tasks, including image super-resolution. By learning to reverse the process of gradually diffusing the data distribution into Gaussian noise, DDPMs generate new data by iteratively denoising from random noise. Despite their impressive performance, diffusion-based generative models suffer from high computational costs due to the large number of denoising steps.In this paper, we first observed that the intermediate latent states gradually converge and become indistinguishable when diffusing a pair of low- and high-resolution images. This observation inspired us to propose the Partial Diffusion Model (PartDiff), which diffuses the image to an intermediate latent state instead of pure random noise, where the intermediate latent state is approximated by the latent of diffusing the low-resolution image. During generation, Partial Diffusion Models start denoising from the intermediate distribution and perform only a part of the denoising steps. Additionally, to mitigate the error caused by the approximation, we introduce "latent alignment", which aligns the latent between low- and high-resolution images during training. Experiments on both magnetic resonance imaging (MRI) and natural images show that, compared to plain diffusion-based super-resolution methods, Partial Diffusion Models significantly reduce the number of denoising steps without sacrificing the quality of generation.
    摘要 去噪扩散概率模型(DDPM)在包括图像超分辨率在内的多种图像生成任务上表现出色。DDPM 通过学习逆转将数据分布逐步扩散为高斯噪声的过程,从随机噪声出发经迭代去噪生成新数据。尽管性能出色,基于扩散的生成模型由于去噪步骤数量庞大而计算成本很高。本文首先观察到:对一对低分辨率与高分辨率图像进行扩散时,中间潜变量会逐渐收敛并变得难以区分。受此启发,我们提出了 Partial Diffusion Model(PartDiff):将图像只扩散到某个中间潜变量状态而非纯随机噪声,该中间状态由低分辨率图像扩散得到的潜变量来近似。生成时,PartDiff 从该中间分布开始去噪,只需执行部分去噪步骤。此外,为减小近似带来的误差,我们引入“潜变量对齐”,在训练时对低分辨率与高分辨率图像的潜变量进行对齐。在磁共振成像(MRI)和自然图像上的实验表明,与普通的基于扩散的超分辨率方法相比,PartDiff 显著减少了去噪步骤数量,且不牺牲生成质量。
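
The core sampling idea — start the reverse chain from an intermediate latent obtained by forward-diffusing the (up-sampled) low-resolution image, instead of from pure noise — can be sketched as below. The denoising network is stubbed out and latent alignment is omitted, so this only shows the control flow under standard DDPM formulas; `t_mid` and the noise schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def predict_noise(x_t, t):
    """Placeholder for the trained denoising network (returns dummy noise)."""
    return np.zeros_like(x_t)

def partial_diffusion_sr(lowres_upsampled, t_mid=400):
    """
    Sketch of PartDiff-style sampling: instead of starting from pure noise at
    t = T, forward-diffuse the up-sampled low-resolution image to an
    intermediate step t_mid and denoise only t_mid ... 1. Conditioning and
    latent alignment are omitted.
    """
    a_bar = alphas_bar[t_mid - 1]
    x = np.sqrt(a_bar) * lowres_upsampled + np.sqrt(1 - a_bar) * rng.normal(size=lowres_upsampled.shape)
    for t in range(t_mid, 0, -1):           # only a part of the reverse chain
        a_t, a_bar_t = 1.0 - betas[t - 1], alphas_bar[t - 1]
        eps = predict_noise(x, t)
        mean = (x - betas[t - 1] / np.sqrt(1 - a_bar_t) * eps) / np.sqrt(a_t)
        noise = rng.normal(size=x.shape) if t > 1 else 0.0
        x = mean + np.sqrt(betas[t - 1]) * noise
    return x

hr_estimate = partial_diffusion_sr(np.ones((8, 8)))
print(hr_estimate.shape)
```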

Poverty rate prediction using multi-modal survey and earth observation data

  • paper_url: http://arxiv.org/abs/2307.11921
  • repo_url: None
  • paper_authors: Simone Fobi, Manuel Cardona, Elliott Collins, Caleb Robinson, Anthony Ortiz, Tina Sederholm, Rahul Dodhia, Juan Lavista Ferres
  • for: The paper aims to predict the poverty rate of a region by combining household demographic and living standards survey questions with features derived from satellite imagery.
  • methods: The paper uses a single-step featurization method to extract visual features from freely available 10m/px Sentinel-2 surface reflectance satellite imagery, and combines these features with ten survey questions in a proxy means test (PMT) to estimate poverty rates. The paper also proposes an approach for selecting a subset of survey questions that are complementary to the visual features extracted from satellite imagery.
  • results: The inclusion of visual features reduces the mean error in poverty rate estimates from 4.09% to 3.88%, and using a subset of survey questions selected based on their complementarity to the visual features results in the best performance, with errors decreasing from 4.09% to 3.71%. The extracted visual features also encode geographic and urbanization differences between regions.
    Abstract This work presents an approach for combining household demographic and living standards survey questions with features derived from satellite imagery to predict the poverty rate of a region. Our approach utilizes visual features obtained from a single-step featurization method applied to freely available 10m/px Sentinel-2 surface reflectance satellite imagery. These visual features are combined with ten survey questions in a proxy means test (PMT) to estimate whether a household is below the poverty line. We show that the inclusion of visual features reduces the mean error in poverty rate estimates from 4.09% to 3.88% over a nationally representative out-of-sample test set. In addition to including satellite imagery features in proxy means tests, we propose an approach for selecting a subset of survey questions that are complementary to the visual features extracted from satellite imagery. Specifically, we design a survey variable selection approach guided by the full survey and image features and use the approach to determine the most relevant set of small survey questions to include in a PMT. We validate the choice of small survey questions in a downstream task of predicting the poverty rate using the small set of questions. This approach results in the best performance -- errors in poverty rate decrease from 4.09% to 3.71%. We show that extracted visual features encode geographic and urbanization differences between regions.
    摘要 这项研究提出了一种将家庭人口统计与生活水平调查问题同卫星影像特征相结合、用于预测地区贫困率的方法。该方法采用单步特征化方式,从可免费获取的 10m/px Sentinel-2 地表反射率卫星影像中提取视觉特征,再与十个调查问题一起构成代理家计测试(PMT),以估计家庭是否处于贫困线以下。在具有全国代表性的样本外测试集上,引入视觉特征使贫困率估计的平均误差从 4.09% 降至 3.88%。除了在 PMT 中加入卫星影像特征外,我们还提出了一种选取与卫星影像视觉特征互补的调查问题子集的方法:以完整调查与影像特征为指导设计调查变量选择方案,确定最相关的少量调查问题纳入 PMT,并在下游贫困率预测任务中验证该选择。该方法取得了最佳性能,贫困率估计误差从 4.09% 降至 3.71%。我们还发现,提取的视觉特征编码了地区之间的地理与城市化差异。
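
Conceptually, the proxy means test combines survey answers and image-derived features into one household-level classifier and then aggregates predictions into a regional poverty rate. The following sketch uses synthetic data and scikit-learn's logistic regression purely to show the wiring; the paper's featurization, model and variable-selection procedure are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
survey = rng.integers(0, 4, size=(n, 10)).astype(float)   # 10 survey questions
image_feats = rng.normal(size=(n, 16))                     # satellite-image features
# Synthetic "below poverty line" labels, loosely tied to both feature groups
poor = (survey[:, 0] + 0.5 * image_feats[:, 0] + rng.normal(size=n) < 1.0).astype(int)

X = np.hstack([survey, image_feats])       # proxy means test on combined features
clf = LogisticRegression(max_iter=1000).fit(X[:800], poor[:800])

# Regional poverty rate = mean predicted probability of being below the line
region_rate = clf.predict_proba(X[800:])[:, 1].mean()
print(f"estimated poverty rate: {region_rate:.3f}, true rate: {poor[800:].mean():.3f}")
```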

Building3D: An Urban-Scale Dataset and Benchmarks for Learning Roof Structures from Point Clouds

  • paper_url: http://arxiv.org/abs/2307.11914
  • repo_url: None
  • paper_authors: Ruisheng Wang, Shangfeng Huang, Hongxin Yang
  • for: 这个论文主要是为了提供一个大规模的城市建筑模型 benchmark,以便进行未来城市建筑模型的研究。
  • methods: 该论文使用 LiDAR 点云测量获得的数据构建数据集,并评估了多种基于手工特征与深度特征的最新算法。
  • results: 论文发现城市建筑模型存在类内差异大、数据不均衡以及大规模噪声等挑战,并提供了首个、也是规模最大的城市级建筑模型基准,以推动未来城市建模研究。
    Abstract Urban modeling from LiDAR point clouds is an important topic in computer vision, computer graphics, photogrammetry and remote sensing. 3D city models have found a wide range of applications in smart cities, autonomous navigation, urban planning and mapping etc. However, existing datasets for 3D modeling mainly focus on common objects such as furniture or cars. Lack of building datasets has become a major obstacle for applying deep learning technology to specific domains such as urban modeling. In this paper, we present a urban-scale dataset consisting of more than 160 thousands buildings along with corresponding point clouds, mesh and wire-frame models, covering 16 cities in Estonia about 998 Km2. We extensively evaluate performance of state-of-the-art algorithms including handcrafted and deep feature based methods. Experimental results indicate that Building3D has challenges of high intra-class variance, data imbalance and large-scale noises. The Building3D is the first and largest urban-scale building modeling benchmark, allowing a comparison of supervised and self-supervised learning methods. We believe that our Building3D will facilitate future research on urban modeling, aerial path planning, mesh simplification, and semantic/part segmentation etc.
    摘要 基于 LiDAR 点云的城市建模是计算机视觉、计算机图形学、摄影测量与遥感领域的重要课题。3D 城市模型在智慧城市、自动导航、城市规划与制图等方面有着广泛应用。然而,现有的 3D 建模数据集主要关注家具、汽车等常见物体,建筑数据集的缺乏已成为将深度学习技术应用于城市建模等特定领域的主要障碍。本文提出了一个城市级数据集,包含超过 16 万栋建筑及其对应的点云、网格和线框模型,覆盖爱沙尼亚 16 座城市、约 998 平方公里。我们对包括手工特征与深度特征方法在内的多种最新算法进行了广泛评估。实验结果表明,Building3D 存在类内差异大、数据不均衡和大规模噪声等挑战。Building3D 是首个、也是规模最大的城市级建筑建模基准,可用于比较监督与自监督学习方法。我们相信 Building3D 将促进城市建模、航空路径规划、网格简化以及语义/部件分割等方面的未来研究。

Unveiling Vulnerabilities in Interpretable Deep Learning Systems with Query-Efficient Black-box Attacks

  • paper_url: http://arxiv.org/abs/2307.11906
  • repo_url: None
  • paper_authors: Eldor Abdukhamidov, Mohammed Abuhamad, Simon S. Woo, Eric Chan-Tin, Tamer Abuhmed
  • for: 揭示可解释深度学习系统(IDLS)在对抗攻击下的脆弱性,这类攻击严重威胁基于深度学习的系统的完整性、可靠性与可信度。
  • methods: 我们提出了一种基于微生物遗传算法的黑盒攻击方法,无需了解目标模型及其解释模型的任何先验知识,并结合迁移式与得分式方法实现高查询效率。
  • results: 实验结果显示,该攻击方法能取得很高的攻击成功率;对抗样本的归因图与正常样本高度相似,即使人工分析也难以察觉。
    Abstract Deep learning has been rapidly employed in many applications revolutionizing many industries, but it is known to be vulnerable to adversarial attacks. Such attacks pose a serious threat to deep learning-based systems compromising their integrity, reliability, and trust. Interpretable Deep Learning Systems (IDLSes) are designed to make the system more transparent and explainable, but they are also shown to be susceptible to attacks. In this work, we propose a novel microbial genetic algorithm-based black-box attack against IDLSes that requires no prior knowledge of the target model and its interpretation model. The proposed attack is a query-efficient approach that combines transfer-based and score-based methods, making it a powerful tool to unveil IDLS vulnerabilities. Our experiments of the attack show high attack success rates using adversarial examples with attribution maps that are highly similar to those of benign samples which makes it difficult to detect even by human analysts. Our results highlight the need for improved IDLS security to ensure their practical reliability.
    摘要 深度学习已迅速应用于众多领域并带来变革,但它容易受到对抗攻击。这类攻击对基于深度学习的系统构成严重威胁,损害其完整性、可靠性与可信度。可解释深度学习系统(IDLS)旨在让系统更加透明和可解释,但同样被证明易受攻击。在本工作中,我们提出了一种基于微生物遗传算法的黑盒攻击方法,无需任何关于目标模型及其解释模型的先验知识。该攻击查询效率高,结合了迁移式与得分式方法,是揭示 IDLS 脆弱性的有力工具。实验表明,该攻击的成功率很高,所生成对抗样本的归因图与正常样本高度相似,即使人工分析也难以察觉。我们的结果凸显了加强 IDLS 安全性、确保其实际可靠性的必要性。
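
A microbial genetic algorithm keeps a small population, repeatedly picks two members, and lets the loser inherit genes from the winner before mutating. The sketch below searches for a bounded perturbation that maximises a stand-in black-box score; the attribution-map similarity constraint and the real query interface from the paper are omitted, and all hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def blackbox_score(x):
    """Stand-in for querying the target model: higher = more adversarial."""
    return -float(np.linalg.norm(x - 0.5))   # toy objective only

def microbial_ga_attack(dim=64, pop_size=20, iters=2000,
                        eps=0.1, p_cross=0.5, p_mut=0.05):
    pop = rng.uniform(-eps, eps, size=(pop_size, dim))       # candidate perturbations
    for _ in range(iters):
        i, j = rng.choice(pop_size, size=2, replace=False)
        fi, fj = blackbox_score(pop[i]), blackbox_score(pop[j])
        win, lose = (i, j) if fi >= fj else (j, i)
        # loser copies genes from the winner, then mutates
        mask = rng.random(dim) < p_cross
        pop[lose][mask] = pop[win][mask]
        mut = rng.random(dim) < p_mut
        pop[lose][mut] += rng.normal(0, eps / 4, size=mut.sum())
        pop[lose] = np.clip(pop[lose], -eps, eps)             # keep perturbation bounded
    scores = [blackbox_score(p) for p in pop]
    return pop[int(np.argmax(scores))]

best = microbial_ga_attack()
print("best score:", blackbox_score(best))
```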

Model Compression Methods for YOLOv5: A Review

  • paper_url: http://arxiv.org/abs/2307.11904
  • repo_url: None
  • paper_authors: Mohammad Jani, Jamil Fayyad, Younes Al-Younes, Homayoun Najjaran
  • for: 本文主要针对于增强YOLO对象检测器的研究,以提高其精度和效率。
  • methods: 本文主要采用network pruning和quantization两种方法来压缩YOLOv5模型,以适应资源有限的边缘设备。
  • results: 通过对YOLOv5模型进行压缩,可以降低内存使用量和推理时间,使其在硬件限制的边缘设备上进行部署成为可能。但是,在实施中还存在一些挑战,需要进一步的探索和优化。
    Abstract Over the past few years, extensive research has been devoted to enhancing YOLO object detectors. Since its introduction, eight major versions of YOLO have been introduced with the purpose of improving its accuracy and efficiency. While the evident merits of YOLO have yielded to its extensive use in many areas, deploying it on resource-limited devices poses challenges. To address this issue, various neural network compression methods have been developed, which fall under three main categories, namely network pruning, quantization, and knowledge distillation. The fruitful outcomes of utilizing model compression methods, such as lowering memory usage and inference time, make them favorable, if not necessary, for deploying large neural networks on hardware-constrained edge devices. In this review paper, our focus is on pruning and quantization due to their comparative modularity. We categorize them and analyze the practical results of applying those methods to YOLOv5. By doing so, we identify gaps in adapting pruning and quantization for compressing YOLOv5, and provide future directions in this area for further exploration. Among several versions of YOLO, we specifically choose YOLOv5 for its excellent trade-off between recency and popularity in literature. This is the first specific review paper that surveys pruning and quantization methods from an implementation point of view on YOLOv5. Our study is also extendable to newer versions of YOLO as implementing them on resource-limited devices poses the same challenges that persist even today. This paper targets those interested in the practical deployment of model compression methods on YOLOv5, and in exploring different compression techniques that can be used for subsequent versions of YOLO.
    摘要 过去几年,研究者投入了大量工作来改进 YOLO 目标检测器。自问世以来,YOLO 已推出八个主要版本,以提升精度与效率。尽管 YOLO 的优势使其在许多领域得到广泛应用,但在资源受限的设备上部署仍然存在挑战。为解决这一问题,研究者开发了多种神经网络压缩方法,大致可分为三类:网络剪枝、量化和知识蒸馏。模型压缩带来的收益(如降低内存占用和推理时间)使其成为在硬件受限的边缘设备上部署大型神经网络的有利甚至必要手段。在本综述中,我们重点关注剪枝与量化,因为它们相对更加模块化。我们对这些方法进行分类,并分析将其应用于 YOLOv5 的实际效果,由此指出在压缩 YOLOv5 时采用剪枝与量化所存在的不足,并给出该方向未来的研究建议。在 YOLO 的多个版本中,我们特别选择 YOLOv5,因为它在文献中兼具较新与较高的关注度。这是首篇从实现角度综述 YOLOv5 上剪枝与量化方法的专题综述。由于在资源受限设备上部署更新版本的 YOLO 面临同样的挑战,本研究也可推广到这些版本。本文面向关注在 YOLOv5 上实际部署模型压缩方法、以及希望为后续 YOLO 版本探索不同压缩技术的读者。
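
For readers who want a starting point, the two families the review focuses on map to readily available PyTorch utilities: magnitude pruning via `torch.nn.utils.prune` and post-training dynamic quantization via `torch.quantization.quantize_dynamic`. The toy network below is not YOLOv5, and real YOLOv5 compression typically needs structured pruning, sensitivity analysis and/or quantization-aware training on top of this minimal sketch.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A tiny stand-in network (not YOLOv5) just to show the two API calls.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
)

# 1) Magnitude (L1) unstructured pruning: zero out 30% of the conv weights.
conv = model[0]
prune.l1_unstructured(conv, name="weight", amount=0.3)
prune.remove(conv, "weight")                      # make the pruning permanent
sparsity = (conv.weight == 0).float().mean().item()
print(f"conv sparsity: {sparsity:.2f}")

# 2) Post-training dynamic quantization of the Linear layer to int8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
out = quantized(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 10])
```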

Selecting the motion ground truth for loose-fitting wearables: benchmarking optical MoCap methods

  • paper_url: http://arxiv.org/abs/2307.11881
  • repo_url: https://github.com/lalasray/dmcb
  • paper_authors: Lala Shakti Swarup Ray, Bo Zhou, Sungho Suh, Paul Lukowicz
  • for: 用于评估基于光学标记与无标记 MoCap 方法在各类宽松服装下的性能,以帮助选择最合适的动作真值采集方法。
  • methods: 使用大规模真实录制的 MoCap 数据,对从紧身到极度宽松的六个悬垂等级、三个运动强度等级以及六种体型-性别组合进行并行 3D 物理仿真,以基准测试最新的基于标记与无标记 MoCap 方法。
  • results: 对于休闲宽松服装,基于标记与无标记 MoCap 方法都出现明显的性能损失(>10cm);但在涉及基础与快速动作的日常活动中,无标记 MoCap 略优于基于标记的 MoCap,是穿戴式研究中更为经济且可取的选择。
    Abstract To help smart wearable researchers choose the optimal ground truth methods for motion capturing (MoCap) for all types of loose garments, we present a benchmark, DrapeMoCapBench (DMCB), specifically designed to evaluate the performance of optical marker-based and marker-less MoCap. High-cost marker-based MoCap systems are well-known as precise golden standards. However, a less well-known caveat is that they require skin-tight fitting markers on bony areas to ensure the specified precision, making them questionable for loose garments. On the other hand, marker-less MoCap methods powered by computer vision models have matured over the years, which have meager costs as smartphone cameras would suffice. To this end, DMCB uses large real-world recorded MoCap datasets to perform parallel 3D physics simulations with a wide range of diversities: six levels of drape from skin-tight to extremely draped garments, three levels of motions and six body type - gender combinations to benchmark state-of-the-art optical marker-based and marker-less MoCap methods to identify the best-performing method in different scenarios. In assessing the performance of marker-based and low-cost marker-less MoCap for casual loose garments both approaches exhibit significant performance loss (>10cm), but for everyday activities involving basic and fast motions, marker-less MoCap slightly outperforms marker-based MoCap, making it a favorable and cost-effective choice for wearable studies.
    摘要 为帮助智能穿戴设备研究人员为各类宽松服装选择最优的动作捕捉(MoCap)真值方法,我们提出了专门用于评估基于光学标记与无标记 MoCap 性能的基准 DrapeMoCapBench(DMCB)。高成本的基于标记的 MoCap 系统被公认为精确的金标准,但一个不太为人知的前提是:它们需要在骨性部位粘贴紧贴皮肤的标记才能保证标称精度,因此在宽松服装场景下的可靠性存疑。另一方面,基于计算机视觉模型的无标记 MoCap 方法近年来日趋成熟,成本极低,一部智能手机的摄像头即可满足需求。为此,DMCB 利用大规模真实录制的 MoCap 数据集进行并行 3D 物理仿真,涵盖从紧身到极度宽松的六个悬垂等级、三个运动强度等级以及六种体型-性别组合,对最新的基于光学标记与无标记 MoCap 方法进行基准测试,以确定不同场景下表现最好的方法。评估结果显示,对于休闲宽松服装,两类方法均出现明显的性能损失(>10cm);但在涉及基础与快速动作的日常活动中,无标记 MoCap 略优于基于标记的 MoCap,因而是穿戴式研究中更为经济且可取的选择。

Digital Modeling on Large Kernel Metamaterial Neural Network

  • paper_url: http://arxiv.org/abs/2307.11862
  • repo_url: None
  • paper_authors: Quan Liu, Hanyu Zheng, Brandon T. Swartz, Ho hin Lee, Zuhayr Asad, Ivan Kravchenko, Jason G. Valentine, Yuankai Huo
  • for: 这篇论文针对现代深度神经网络(DNN)依赖 CPU、GPU 等计算单元进行物理部署所带来的计算负担、延迟与功耗问题展开研究,这些问题在物联网(IoT)、边缘计算和无人机等应用中是关键限制。
  • methods: 论文利用最新的光学计算单元(如超构材料)实现无源、光速的神经网络,并提出大卷积核超构材料神经网络(LMNN):通过模型重参数化与网络压缩最大化现有最佳超构材料神经网络(MNN)的数字容量,同时显式考虑制造精度、噪声与带宽等光学限制,将卷积前端的计算开销卸载到制备好的光学硬件上。
  • results: 在两个公开数据集上的实验结果显示,所提出的混合设计在提升分类精度的同时降低了计算延迟,表明 LMNN 是迈向无源、光速 AI 的重要一步。
    Abstract Deep neural networks (DNNs) utilized recently are physically deployed with computational units (e.g., CPUs and GPUs). Such a design might lead to a heavy computational burden, significant latency, and intensive power consumption, which are critical limitations in applications such as the Internet of Things (IoT), edge computing, and the usage of drones. Recent advances in optical computational units (e.g., metamaterial) have shed light on energy-free and light-speed neural networks. However, the digital design of the metamaterial neural network (MNN) is fundamentally limited by its physical limitations, such as precision, noise, and bandwidth during fabrication. Moreover, the unique advantages of MNN's (e.g., light-speed computation) are not fully explored via standard 3x3 convolution kernels. In this paper, we propose a novel large kernel metamaterial neural network (LMNN) that maximizes the digital capacity of the state-of-the-art (SOTA) MNN with model re-parametrization and network compression, while also considering the optical limitation explicitly. The new digital learning scheme can maximize the learning capacity of MNN while modeling the physical restrictions of meta-optic. With the proposed LMNN, the computation cost of the convolutional front-end can be offloaded into fabricated optical hardware. The experimental results on two publicly available datasets demonstrate that the optimized hybrid design improved classification accuracy while reducing computational latency. The development of the proposed LMNN is a promising step towards the ultimate goal of energy-free and light-speed AI.
    摘要 近年来广泛使用的深度神经网络(DNN)通常部署在 CPU、GPU 等计算单元上。这种设计可能带来沉重的计算负担、显著的延迟和高功耗,而这些在物联网(IoT)、边缘计算和无人机等应用中是关键限制。光学计算单元(如超构材料)的最新进展为无源、光速的神经网络带来了曙光。然而,超构材料神经网络(MNN)的数字设计从根本上受制于其物理限制,例如制造过程中的精度、噪声和带宽;而且标准的 3x3 卷积核也未能充分发挥 MNN 的独特优势(如光速计算)。本文提出了一种新的大卷积核超构材料神经网络(LMNN),通过模型重参数化与网络压缩最大化现有最佳 MNN 的数字容量,同时显式考虑上述光学限制。新的数字学习方案能够在建模超构光学物理约束的同时最大化 MNN 的学习能力。借助所提出的 LMNN,卷积前端的计算开销可以卸载到制备好的光学硬件中。在两个公开数据集上的实验结果表明,优化后的混合设计在提升分类精度的同时降低了计算延迟。LMNN 的提出是迈向无源、光速 AI 这一终极目标的重要一步。

Enhancing Your Trained DETRs with Box Refinement

  • paper_url: http://arxiv.org/abs/2307.11828
  • repo_url: https://github.com/yiqunchen1999/refinebox
  • paper_authors: Yiqun Chen, Qiang Chen, Peize Sun, Shoufa Chen, Jingdong Wang, Jian Cheng
  • for: 提高 DETR 及其变体的定位性能
  • methods: 在冻结已训练检测器的前提下,使用轻量级精修网络对检测输出的预测框进行精修
  • results: 在 COCO 和 LVIS 1.0 上的实验表明,RefineBox 可以提升 DETR 及其变体的性能,例如 DETR、Conditional-DETR、DAB-DETR 和 DN-DETR 的性能分别提升 2.4 AP、2.5 AP、1.9 AP 和 1.6 AP。
    Abstract We present a conceptually simple, efficient, and general framework for localization problems in DETR-like models. We add plugins to well-trained models instead of inefficiently designing new models and training them from scratch. The method, called RefineBox, refines the outputs of DETR-like detectors by lightweight refinement networks. RefineBox is easy to implement and train as it only leverages the features and predicted boxes from the well-trained detection models. Our method is also efficient as we freeze the trained detectors during training. In addition, we can easily generalize RefineBox to various trained detection models without any modification. We conduct experiments on COCO and LVIS $1.0$. Experimental results indicate the effectiveness of our RefineBox for DETR and its representative variants (Figure 1). For example, the performance gains for DETR, Conditinal-DETR, DAB-DETR, and DN-DETR are 2.4 AP, 2.5 AP, 1.9 AP, and 1.6 AP, respectively. We hope our work will bring the attention of the detection community to the localization bottleneck of current DETR-like models and highlight the potential of the RefineBox framework. Code and models will be publicly available at: \href{https://github.com/YiqunChen1999/RefineBox}{https://github.com/YiqunChen1999/RefineBox}.
    摘要 我们针对 DETR 类模型的定位问题提出了一个概念简单、高效且通用的框架。我们不再低效地设计新模型并从头训练,而是为已训练好的模型添加插件。该方法称为 RefineBox,通过轻量级精修网络对 DETR 类检测器的输出进行精修。RefineBox 易于实现和训练,因为它只利用已训练检测模型的特征和预测框;同时由于训练期间冻结已训练的检测器,方法也十分高效。此外,RefineBox 无需任何修改即可推广到各种已训练的检测模型。我们在 COCO 和 LVIS 1.0 上进行了实验,结果表明 RefineBox 对 DETR 及其代表性变体均有效(图 1):DETR、Conditional-DETR、DAB-DETR 和 DN-DETR 的性能分别提升 2.4 AP、2.5 AP、1.9 AP 和 1.6 AP。我们希望这项工作能引起检测社区对当前 DETR 类模型定位瓶颈的关注,并凸显 RefineBox 框架的潜力。代码和模型将公开于 https://github.com/YiqunChen1999/RefineBox。
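
The plug-in idea — freeze the trained detector and train only a lightweight head that refines its predicted boxes from pooled features — can be sketched as follows. This is not the paper's refinement-network design; the feature dimension, MLP and loss are placeholders, and the frozen detector is stubbed out entirely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBoxRefiner(nn.Module):
    """Lightweight head predicting (dx, dy, dw, dh) residuals for each detected box."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 4, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, boxes, box_feats):
        deltas = self.mlp(torch.cat([boxes, box_feats], dim=-1))
        return boxes + deltas            # refined boxes

# Frozen, already-trained detector (stubbed here); only the refiner is trained.
detector = nn.Linear(10, 10)             # placeholder for a DETR-like model
for p in detector.parameters():
    p.requires_grad_(False)

refiner = ToyBoxRefiner()
optimizer = torch.optim.AdamW(refiner.parameters(), lr=1e-4)

boxes = torch.rand(8, 4)                  # predicted boxes from the frozen detector
box_feats = torch.randn(8, 256)           # features pooled around those boxes
gt_boxes = boxes + 0.05 * torch.randn(8, 4)

loss = F.l1_loss(refiner(boxes, box_feats), gt_boxes)
loss.backward()
optimizer.step()
print(float(loss))
```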

BandRe: Rethinking Band-Pass Filters for Scale-Wise Object Detection Evaluation

  • paper_url: http://arxiv.org/abs/2307.11748
  • repo_url: https://github.com/shinya7y/UniverseNet
  • paper_authors: Yosuke Shinya
  • for: 本文提出了一种新的逐尺度评估指标,用于评估目标检测器在实际应用中针对不同尺度目标的性能。
  • methods: 本文使用由三角形与梯形带通滤波器组成的滤波器组,在细粒度与可靠性之间取得平衡,对目标检测器进行逐尺度评估。
  • results: 在两个数据集上对两种方法的实验表明,所提指标能够突出不同方法之间以及不同数据集之间的差异。
    Abstract Scale-wise evaluation of object detectors is important for real-world applications. However, existing metrics are either coarse or not sufficiently reliable. In this paper, we propose novel scale-wise metrics that strike a balance between fineness and reliability, using a filter bank consisting of triangular and trapezoidal band-pass filters. We conduct experiments with two methods on two datasets and show that the proposed metrics can highlight the differences between the methods and between the datasets. Code is available at https://github.com/shinya7y/UniverseNet .
    摘要 逐尺度评估对目标检测器的实际应用十分重要。然而,现有指标要么过于粗糙,要么不够可靠。本文提出了新的逐尺度评估指标,借助由三角形与梯形带通滤波器组成的滤波器组,在细粒度与可靠性之间取得平衡。我们在两个数据集上对两种方法进行了实验,结果表明所提指标能够突出不同方法之间以及不同数据集之间的差异。代码见 https://github.com/shinya7y/UniverseNet。
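
A trapezoidal band-pass filter over object scale can be written as a simple weighting function; a scale-wise metric would then weight each ground truth and detection by this response before computing precision/recall. The log-scale breakpoints below are illustrative and not the paper's filter-bank specification (a triangular filter is the special case where the flat region collapses to a point).

```python
import math

def trapezoidal_bandpass(scale, lo, flat_lo, flat_hi, hi):
    """
    Weight in [0, 1] for an object of size `scale` (e.g. sqrt of box area),
    rising linearly on a log axis from `lo` to `flat_lo`, flat up to
    `flat_hi`, and falling to zero at `hi`.
    """
    s, a, b, c, d = (math.log(v) for v in (scale, lo, flat_lo, flat_hi, hi))
    if s <= a or s >= d:
        return 0.0
    if s < b:
        return (s - a) / (b - a)
    if s <= c:
        return 1.0
    return (d - s) / (d - c)

# Example: a band centred on "medium" objects (roughly 32-96 px)
for size in (8, 24, 48, 80, 128, 256):
    print(size, round(trapezoidal_bandpass(size, 16, 32, 96, 192), 2))
```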

Automatic Data Augmentation Learning using Bilevel Optimization for Histopathological Images

  • paper_url: http://arxiv.org/abs/2307.11808
  • repo_url: https://github.com/smounsav/bilevel_augment_histo
  • paper_authors: Saypraseuth Mounsaveng, Issam Laradji, David Vázquez, Marco Perdersoli, Ismail Ben Ayed
  • for: 用于适应深度学习模型在 histopathological 图像分类中的训练问题,因为细胞和组织的颜色和形状变化,以及有限的数据量,使模型学习这些变化困难。
  • methods: 使用数据扩大(DA)技术,在训练过程中生成更多的样本,以帮助模型对颜色和形状变化变得抗变异。
  • results: 通过自动学习 DA 参数,使用 truncated backpropagation 进行快速和高效地学习,并在六个不同的 dataset 上进行验证。实验结果表明,我们的模型可以学习更有用的颜色和旋转变换,而不需要手动选择 DA 变换。此外,我们的模型只需要微调几个方法特有的超参数,但是性能更高。
    Abstract Training a deep learning model to classify histopathological images is challenging, because of the color and shape variability of the cells and tissues, and the reduced amount of available data, which does not allow proper learning of those variations. Variations can come from the image acquisition process, for example, due to different cell staining protocols or tissue deformation. To tackle this challenge, Data Augmentation (DA) can be used during training to generate additional samples by applying transformations to existing ones, to help the model become invariant to those color and shape transformations. The problem with DA is that it is not only dataset-specific but it also requires domain knowledge, which is not always available. Without this knowledge, selecting the right transformations can only be done using heuristics or through a computationally demanding search. To address this, we propose an automatic DA learning method. In this method, the DA parameters, i.e. the transformation parameters needed to improve the model training, are considered learnable and are learned automatically using a bilevel optimization approach in a quick and efficient way using truncated backpropagation. We validated the method on six different datasets. Experimental results show that our model can learn color and affine transformations that are more helpful to train an image classifier than predefined DA transformations, which are also more expensive as they need to be selected before the training by grid search on a validation set. We also show that similarly to a model trained with RandAugment, our model has also only a few method-specific hyperparameters to tune but is performing better. This makes our model a good solution for learning the best DA parameters, especially in the context of histopathological images, where defining potentially useful transformation heuristically is not trivial.
    摘要 训练深度学习模型对组织病理图像进行分类颇具挑战:细胞和组织的颜色与形状差异很大,而可用数据量有限,模型难以充分学习这些差异。这些差异可能来自图像采集过程,例如不同的染色方案或组织形变。为应对这一挑战,可以在训练时使用数据增强(DA),通过对已有样本施加变换生成更多样本,帮助模型对这些颜色与形状变化保持不变性。DA 的问题在于它不仅与数据集相关,还需要领域知识,而这类知识并不总是可得;在缺乏这类知识时,只能依靠启发式规则或计算代价高昂的搜索来选择合适的变换。为此,我们提出了一种自动学习 DA 的方法:将 DA 参数(即有助于模型训练的变换参数)视为可学习参数,并通过双层优化方法配合截断反向传播,以快速高效的方式自动学习。我们在六个不同数据集上验证了该方法。实验结果表明,我们的模型能够学习到比预定义 DA 变换更有助于训练图像分类器的颜色与仿射变换,而预定义变换还需要在训练前通过验证集上的网格搜索来选取,代价更高。我们还表明,与使用 RandAugment 训练的模型类似,我们的模型只有少量方法特有的超参数需要调节,但性能更好。这使得我们的方法非常适合学习最优 DA 参数,尤其是在难以靠启发式定义有用变换的组织病理图像场景中。
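
The bilevel mechanism can be shown on a toy problem: one differentiable augmentation magnitude is updated by back-propagating the validation loss through a single unrolled (truncated) inner training step taken on augmented data. This is a deliberately simplified sketch — a linear model, a Gaussian-jitter "augmentation", and one-step truncation — and not the paper's actual setup or transformations.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, c, n = 20, 3, 256
x_tr, y_tr = torch.randn(n, d), torch.randint(0, c, (n,))
x_val, y_val = torch.randn(n, d), torch.randint(0, c, (n,))

w = torch.zeros(d, c, requires_grad=True)          # model weights (linear classifier)
phi = torch.tensor(0.1, requires_grad=True)        # learnable augmentation magnitude
lr_inner, lr_outer = 0.5, 0.05

for step in range(100):
    # Inner step: train on differentiably augmented data (toy Gaussian jitter).
    x_aug = x_tr + phi * torch.randn_like(x_tr)
    loss_tr = F.cross_entropy(x_aug @ w, y_tr)
    g = torch.autograd.grad(loss_tr, w, create_graph=True)[0]
    w_unrolled = w - lr_inner * g                   # truncated to a single unrolled step

    # Outer step: validation loss, back-propagated through the unrolled update to phi.
    loss_val = F.cross_entropy(x_val @ w_unrolled, y_val)
    g_phi, g_w = torch.autograd.grad(loss_val, (phi, w))
    with torch.no_grad():
        phi -= lr_outer * g_phi
        phi.clamp_(min=0.0)
        w -= lr_inner * g_w                          # also keep training the model

print("learned augmentation magnitude:", float(phi))
```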

3D Skeletonization of Complex Grapevines for Robotic Pruning

  • paper_url: http://arxiv.org/abs/2307.11706
  • repo_url: None
  • paper_authors: Eric Schneider, Sushanth Jayanth, Abhisesh Silwal, George Kantor
  • for: 推进机器人剪枝(修剪)技术,使其能够在实际商业葡萄园中对休眠期葡萄藤进行修剪
  • methods: 扩展植物骨架化(skeletonization)技术,提高机器人在更稠密、更复杂的葡萄藤结构中的感知能力
  • results: 生成的骨架化葡萄藤模型重投影误差更低、连通性更高;利用 3D 与骨架信息对稠密葡萄藤的修剪重量进行预测,精度超过以往工作
    Abstract Robotic pruning of dormant grapevines is an area of active research in order to promote vine balance and grape quality, but so far robotic efforts have largely focused on planar, simplified vines not representative of commercial vineyards. This paper aims to advance the robotic perception capabilities necessary for pruning in denser and more complex vine structures by extending plant skeletonization techniques. The proposed pipeline generates skeletal grapevine models that have lower reprojection error and higher connectivity than baseline algorithms. We also show how 3D and skeletal information enables prediction accuracy of pruning weight for dense vines surpassing prior work, where pruning weight is an important vine metric influencing pruning site selection.
    摘要 对休眠期葡萄藤进行机器人修剪是一个活跃的研究领域,旨在促进植株平衡并提升葡萄品质;但迄今为止,机器人方面的工作大多集中在平面化、简化的葡萄藤上,并不能代表商业葡萄园的实际情况。本文通过扩展植物骨架化技术,提升在更稠密、更复杂的葡萄藤结构中进行修剪所需的机器人感知能力。所提出的流程生成的骨架化葡萄藤模型比基线算法具有更低的重投影误差和更高的连通性。我们还展示了 3D 与骨架信息如何使稠密葡萄藤的修剪重量预测精度超过以往工作;修剪重量是影响修剪位置选择的重要指标。

SACReg: Scene-Agnostic Coordinate Regression for Visual Localization

  • paper_url: http://arxiv.org/abs/2307.11702
  • repo_url: None
  • paper_authors: Jerome Revaud, Yohann Cabon, Romain Brégier, JongMin Lee, Philippe Weinzaepfel
  • for: 本研究旨在提出一个不依赖特定场景的通用场景坐标回归模型,以克服现有场景坐标回归(SCR)方法难以扩展到真实数据集的局限。
  • methods: 模型基于 Transformer 架构,可接收数量可变的图像以及稀疏的 2D-3D 标注作为输入;对给定查询图像,输入由现成的图像检索技术和 Structure-from-Motion 数据库提供。
  • results: 该模型在多个视觉定位基准上表现出色,尤其优于其他场景回归方法,并在 Cambridge localization 基准上创下新纪录,甚至超越了基于特征匹配的方法。
    Abstract Scene coordinates regression (SCR), i.e., predicting 3D coordinates for every pixel of a given image, has recently shown promising potential. However, existing methods remain mostly scene-specific or limited to small scenes and thus hardly scale to realistic datasets. In this paper, we propose a new paradigm where a single generic SCR model is trained once to be then deployed to new test scenes, regardless of their scale and without further finetuning. For a given query image, it collects inputs from off-the-shelf image retrieval techniques and Structure-from-Motion databases: a list of relevant database images with sparse pointwise 2D-3D annotations. The model is based on the transformer architecture and can take a variable number of images and sparse 2D-3D annotations as input. It is trained on a few diverse datasets and significantly outperforms other scene regression approaches on several benchmarks, including scene-specific models, for visual localization. In particular, we set a new state of the art on the Cambridge localization benchmark, even outperforming feature-matching-based approaches.
    摘要 场景坐标回归(SCR),即为给定图像的每个像素预测 3D 坐标,近来展现出可观的潜力。然而,现有方法大多针对特定场景或局限于小场景,难以扩展到真实数据集。本文提出一种新范式:只训练一次通用的 SCR 模型,即可部署到新的测试场景,而无论场景规模如何、也无需进一步微调。对给定查询图像,模型从现成的图像检索技术和 Structure-from-Motion 数据库获取输入:一组相关的数据库图像及其稀疏的逐点 2D-3D 标注。模型基于 Transformer 架构,可接收数量可变的图像和稀疏 2D-3D 标注。它在若干多样化的数据集上训练,在多个视觉定位基准上显著优于其他场景回归方法(包括特定场景模型);特别地,我们在 Cambridge localization 基准上创下新纪录,甚至超越了基于特征匹配的方法。

cs.AI - 2023-07-22

A Revolution of Personalized Healthcare: Enabling Human Digital Twin with Mobile AIGC

  • paper_url: http://arxiv.org/abs/2307.12115
  • repo_url: None
  • paper_authors: Jiayuan Chen, Changyan Yi, Hongyang Du, Dusit Niyato, Jiawen Kang, Jun Cai, Xuemin, Shen
  • for: 这篇论文是为了探讨移动人工智能生成内容(AIGC)技术在人类数字双(HDT)应用中的可能性和挑战而写的。
  • methods: 该论文提出了一种基于移动AIGC的HDT系统架构,并确定了相关的设计要求和挑战。同时,文章还介绍了两个使用场景,即手术规划和个性化药物。
  • results: 文章通过实验研究证明了移动AIGC驱动的HDT解决方案的效果,并在虚拟物理治疗教学平台中应用了这种解决方案。
    Abstract Mobile Artificial Intelligence-Generated Content (AIGC) technology refers to the adoption of AI algorithms deployed at mobile edge networks to automate the information creation process while fulfilling the requirements of end users. Mobile AIGC has recently attracted phenomenal attentions and can be a key enabling technology for an emerging application, called human digital twin (HDT). HDT empowered by the mobile AIGC is expected to revolutionize the personalized healthcare by generating rare disease data, modeling high-fidelity digital twin, building versatile testbeds, and providing 24/7 customized medical services. To promote the development of this new breed of paradigm, in this article, we propose a system architecture of mobile AIGC-driven HDT and highlight the corresponding design requirements and challenges. Moreover, we illustrate two use cases, i.e., mobile AIGC-driven HDT in customized surgery planning and personalized medication. In addition, we conduct an experimental study to prove the effectiveness of the proposed mobile AIGC-driven HDT solution, which shows a particular application in a virtual physical therapy teaching platform. Finally, we conclude this article by briefly discussing several open issues and future directions.
    摘要 移动人工智能生成内容(AIGC)技术是指在移动边缘网络部署 AI 算法,在满足终端用户需求的同时使信息创作过程自动化。移动 AIGC 近来受到极大关注,并有望成为一种新兴应用——人类数字孪生(HDT)——的关键使能技术。由移动 AIGC 赋能的 HDT 有望通过生成罕见疾病数据、构建高保真数字孪生、搭建多用途试验平台以及提供全天候定制化医疗服务,给个性化医疗带来变革。为推动这一新范式的发展,本文提出了移动 AIGC 驱动的 HDT 系统架构,并阐述相应的设计需求与挑战;同时给出两个应用场景,即定制化手术规划与个性化用药。此外,我们通过实验研究验证了所提方案的有效性,并展示了其在虚拟物理治疗教学平台中的具体应用。最后,我们简要讨论了若干开放问题和未来方向。

A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks

  • paper_url: http://arxiv.org/abs/2307.12114
  • repo_url: None
  • paper_authors: Yanis Labrak, Mickael Rouvier, Richard Dufour
  • for: 这篇论文旨在评估四种最先进的指令微调大语言模型(ChatGPT、Flan-T5 UL2、Tk-Instruct 和 Alpaca)在 13 项真实世界英文临床与生物医学 NLP 任务上的表现,任务包括命名实体识别(NER)、问答(QA)、关系抽取(RE)等。
  • methods: 在零样本和少样本设置下,对上述四种指令微调大语言模型在 13 项真实世界临床与生物医学英文 NLP 任务上进行测试与评估。
  • results: 结果表明,被评估的大语言模型在零样本和少样本场景下,大多数任务上的表现已接近最先进模型,在问答任务上尤为出色,即便它们从未见过这些任务的示例;但在分类和关系抽取任务上,其表现仍不及 PubMedBERT 等专门针对医学领域训练的模型。此外,没有任何一个模型在所有任务上都优于其他模型,不同模型各有擅长的任务。
    Abstract We evaluate four state-of-the-art instruction-tuned large language models (LLMs) -- ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca -- on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English, such as named-entity recognition (NER), question-answering (QA), relation extraction (RE), etc. Our overall results demonstrate that the evaluated LLMs begin to approach performance of state-of-the-art models in zero- and few-shot scenarios for most tasks, and particularly well for the QA task, even though they have never seen examples from these tasks before. However, we observed that the classification and RE tasks perform below what can be achieved with a specifically trained model for the medical field, such as PubMedBERT. Finally, we noted that no LLM outperforms all the others on all the studied tasks, with some models being better suited for certain tasks than others.
    摘要 我们评估了四种当前最佳的 instruciton-tuned大语言模型(LLMs)——ChatGPT、Flan-T5 UL2、Tk-Instruct和Alpaca——在英文的13种实际医疗和生物医学自然语言处理(NLP)任务上,如命名实体识别(NER)、问答(QA)、关系提取(RE)等。我们的总结结果表明,评估的LLMs在零和几个预测场景中的性能接近了当前最佳模型的水平,特别是在QA任务上表现出色,即使它们从来没有看到这些任务的示例。然而,我们发现,分类和RE任务的性能下降到了专门为医疗领域训练的模型,如PubMedBERT,所能达到的水平。最后,我们注意到,没有任何LLM在所有研究任务上表现出优于其他模型,一些模型更适合某些任务。

CFR-p: Counterfactual Regret Minimization with Hierarchical Policy Abstraction, and its Application to Two-player Mahjong

  • paper_url: http://arxiv.org/abs/2307.12087
  • repo_url: None
  • paper_authors: Shiheng Wang
  • for: 这篇论文是为了应用Counterfactual Regret Minimization(CFR)算法到另一款具有多种变体的 incomplete information 游戏——麻将。
  • methods: 论文使用了game theoretical analysis和层次抽象来改进CFR算法,以适应麻将游戏的复杂性。
  • results: 研究发现,这种基于赢家策略的CFR框架可以普适应其他不完整信息游戏。
    Abstract Counterfactual Regret Minimization(CFR) has shown its success in Texas Hold'em poker. We apply this algorithm to another popular incomplete information game, Mahjong. Compared to the poker game, Mahjong is much more complex with many variants. We study two-player Mahjong by conducting game theoretical analysis and making a hierarchical abstraction to CFR based on winning policies. This framework can be generalized to other imperfect information games.
    摘要 Counterfactual Regret Minimization(CFR)在德州扑克游戏中表现出色。我们将这个算法应用到另一款流行的不完全信息游戏——麻将。与扑克游戏相比,麻将更加复杂,有多种变体。我们通过游戏理论分析和基于赢家策略层次化CFR的框架来研究两人麻将。这个框架可以推广到其他不完全信息游戏。
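
The abstract above applies CFR with a hierarchical policy abstraction to two-player Mahjong. As a rough illustration of the regret-matching update at the core of CFR (not the paper's code; the three abstract actions and regret values are made up), a minimal Python sketch:

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Turn cumulative counterfactual regrets into a strategy (regret matching)."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # Fall back to the uniform strategy when no action has positive regret.
    return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))

# Toy example: one information set with three abstract actions
# (e.g., "discard", "call", "declare a win" in a coarse Mahjong abstraction).
regrets = np.array([2.0, -1.0, 0.5])
print(regret_matching(regrets))  # -> [0.8, 0.0, 0.2]
```

In tabular CFR this update is applied at every information set on each iteration, and the average strategy over iterations converges toward an equilibrium in two-player zero-sum games; the paper's hierarchical abstraction decides which winning policy to pursue before applying CFR at the lower level.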

Enhancing Temporal Planning Domains by Sequential Macro-actions (Extended Version)

  • paper_url: http://arxiv.org/abs/2307.12081
  • repo_url: None
  • paper_authors: Marco De Bortoli, Lukáš Chrpa, Martin Gebser, Gerald Steinbauer-Wagner
  • for: 提高多代理人和资源共享的具有时间约束的计划效率。
  • methods: 使用带有恒常性的维度和扩展的约束来实现Sequential Temporal Macro-Actions,保证计划的可行性。
  • results: 在多个计划器和领域中实现提高了获得优化计划和计划质量。
    Abstract Temporal planning is an extension of classical planning involving concurrent execution of actions and alignment with temporal constraints. Durative actions along with invariants allow for modeling domains in which multiple agents operate in parallel on shared resources. Hence, it is often important to avoid resource conflicts, where temporal constraints establish the consistency of concurrent actions and events. Unfortunately, the performance of temporal planning engines tends to sharply deteriorate when the number of agents and objects in a domain gets large. A possible remedy is to use macro-actions that are well-studied in the context of classical planning. In temporal planning settings, however, introducing macro-actions is significantly more challenging when the concurrent execution of actions and shared use of resources, provided the compliance to temporal constraints, should not be suppressed entirely. Our work contributes a general concept of sequential temporal macro-actions that guarantees the applicability of obtained plans, i.e., the sequence of original actions encapsulated by a macro-action is always executable. We apply our approach to several temporal planners and domains, stemming from the International Planning Competition and RoboCup Logistics League. Our experiments yield improvements in terms of obtained satisficing plans as well as plan quality for the majority of tested planners and domains.
    摘要 temporal 规划是 classical 规划的扩展,具有同时执行动作和时间约束的整合。持续动作和 invariants 允许在多个代理人在共享资源上并发操作的Domain 模型。因此,通常需要避免资源冲突,而时间约束可以确定同时执行的动作和事件的一致性。然而, temporal 规划引擎的性能通常随多个代理人和对象的数量增加而逐渐下降。为了解决这个问题,我们可以使用 macro-actions,它们在经典规划中非常成熟。然而,在 temporal 规划设置下,引入 macro-actions 是更加挑战,因为需要保持同时执行动作和共享资源的一致性,而不能完全终止。我们的工作提出了一种通用的顺序 temporal macro-actions 概念,该概念保证原始动作序列被封装在 macro-action 中执行的情况下,得到的计划是可靠的。我们对多个 temporal 规划器和领域进行了应用,其中包括国际规划竞赛和 RoboCup 物流联盟。我们的实验表明,我们的方法可以提高大多数测试的规划器和领域中的得到的满意计划以及计划质量。
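
To make the idea of a sequential macro-action slightly more concrete, here is a heavily simplified, hypothetical sketch of collapsing a fixed sequence of durative actions into one macro while keeping the conditions needed for the encapsulated sequence to stay executable. It ignores the temporal constraints, invariants, and shared-resource handling that the paper actually formalizes; all names and numbers are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class DurativeAction:
    name: str
    duration: float
    conditions: set = field(default_factory=set)   # must hold for the action
    effects: set = field(default_factory=set)      # added when it finishes

def make_sequential_macro(actions):
    """Collapse a fixed sequence of durative actions into one macro-action.

    Conditions not produced earlier in the sequence must hold externally,
    so the encapsulated sequence of original actions remains executable.
    """
    macro = DurativeAction(
        name="+".join(a.name for a in actions),
        duration=sum(a.duration for a in actions),
    )
    achieved = set()
    for a in actions:
        macro.conditions |= (a.conditions - achieved)
        achieved |= a.effects
        macro.effects |= a.effects
    return macro

goto = DurativeAction("goto-shelf", 4.0, {"robot-free"}, {"at-shelf"})
pick = DurativeAction("pick-item", 2.0, {"at-shelf"}, {"holding-item"})
print(make_sequential_macro([goto, pick]))
```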

Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations

  • paper_url: http://arxiv.org/abs/2307.12062
  • repo_url: None
  • paper_authors: Yongyuan Liang, Yanchao Sun, Ruijie Zheng, Xiangyu Liu, Tuomas Sandholm, Furong Huang, Stephen McAleer
  • for: 本研究旨在训练可以在环境干扰或敌意攻击下表现良好的RL策略。
  • methods: 我们提出了GRAD方法,它将时间相关的干扰问题视为一个部分可观测的两人零和游戏,通过找到该游戏的近似均衡,确保agent对时间相关的干扰具有强鲁棒性。
  • results: 我们在一系列连续控制任务上进行了实验,结果表明,相比基eline,我们的提议方法在state和action空间中都具有显著的Robustness优势,特别是在面对时间相关的干扰攻击时。
    Abstract Robust reinforcement learning (RL) seeks to train policies that can perform well under environment perturbations or adversarial attacks. Existing approaches typically assume that the space of possible perturbations remains the same across timesteps. However, in many settings, the space of possible perturbations at a given timestep depends on past perturbations. We formally introduce temporally-coupled perturbations, presenting a novel challenge for existing robust RL methods. To tackle this challenge, we propose GRAD, a novel game-theoretic approach that treats the temporally-coupled robust RL problem as a partially-observable two-player zero-sum game. By finding an approximate equilibrium in this game, GRAD ensures the agent's robustness against temporally-coupled perturbations. Empirical experiments on a variety of continuous control tasks demonstrate that our proposed approach exhibits significant robustness advantages compared to baselines against both standard and temporally-coupled attacks, in both state and action spaces.
    摘要 Robust reinforcement learning (RL) aims to train policies that can perform well under environmental perturbations or adversarial attacks. Existing methods typically assume that the space of possible perturbations remains the same across timesteps. However, in many situations, the space of possible perturbations at a given timestep depends on past perturbations. We formally introduce temporally-coupled perturbations, presenting a new challenge for existing robust RL methods. To address this challenge, we propose GRAD, a novel game-theoretic approach that treats the temporally-coupled robust RL problem as a partially-observable two-player zero-sum game. By finding an approximate equilibrium in this game, GRAD ensures the agent's robustness against temporally-coupled perturbations. Empirical experiments on a variety of continuous control tasks show that our proposed approach exhibits significant robustness advantages compared to baselines against both standard and temporally-coupled attacks, in both state and action spaces.
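
One plausible reading of a temporally-coupled perturbation budget is that the adversary's perturbation at step t must stay within the usual per-step bound and also within a smaller bound of the perturbation it applied at step t-1. The sketch below is that projection only, written as an assumption for illustration; it is not the paper's implementation and the budgets are arbitrary.

```python
import numpy as np

def temporally_coupled_perturbation(prev_delta, proposal, eps, eps_bar):
    """Project a proposed state perturbation so it respects the per-step
    budget `eps` and stays within `eps_bar` of the previous step's
    perturbation (the temporal coupling)."""
    delta = np.clip(proposal, -eps, eps)
    delta = np.clip(delta, prev_delta - eps_bar, prev_delta + eps_bar)
    return delta

prev = np.zeros(3)
print(temporally_coupled_perturbation(prev, np.array([0.3, -0.5, 0.05]),
                                      eps=0.2, eps_bar=0.1))
# -> [ 0.1  -0.1   0.05]
```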

Fast Knowledge Graph Completion using Graphics Processing Units

  • paper_url: http://arxiv.org/abs/2307.12059
  • repo_url: None
  • paper_authors: Chun-Hee Lee, Dong-oh Kang, Hwa Jeon Song
  • for: 本研究旨在提供一种高效的知识图谱完成框架,用于在 GPU 上获得新关系。
  • methods: 本研究使用知识图谱嵌入模型来实现知识图谱完成。首先,我们定义 “可转换为度量空间”,然后将知识图谱完成问题转换成度量空间中的相似Join问题。然后,我们利用度量空间的性质 deriv 出公式,并基于这些公式开发了一个快速的知识图谱完成算法。
  • results: 我们的研究表明,我们的框架可以高效地处理知识图谱完成问题。
    Abstract Knowledge graphs can be used in many areas related to data semantics such as question-answering systems, knowledge based systems. However, the currently constructed knowledge graphs need to be complemented for better knowledge in terms of relations. It is called knowledge graph completion. To add new relations to the existing knowledge graph by using knowledge graph embedding models, we have to evaluate $N\times N \times R$ vector operations, where $N$ is the number of entities and $R$ is the number of relation types. It is very costly. In this paper, we provide an efficient knowledge graph completion framework on GPUs to get new relations using knowledge graph embedding vectors. In the proposed framework, we first define "transformable to a metric space" and then provide a method to transform the knowledge graph completion problem into the similarity join problem for a model which is "transformable to a metric space". After that, to efficiently process the similarity join problem, we derive formulas using the properties of a metric space. Based on the formulas, we develop a fast knowledge graph completion algorithm. Finally, we experimentally show that our framework can efficiently process the knowledge graph completion problem.
    摘要 知识图可以应用于数据semantics中的多个领域,如问答系统、知识基础系统。然而,现有的知识图需要补充以提高知识的关系。这被称为知识图完成。为了通过使用知识图嵌入模型添加新的关系到现有的知识图,我们需要评估 $N\times N \times R$ 矢量操作,其中 $N$ 是实体的数量,$R$ 是关系类型的数量。这非常昂贵。在这篇论文中,我们提供了一个高效的知识图完成框架在GPU上进行新关系获取。我们首先定义"可转换到一个度量空间",然后将知识图完成问题转换成一个相似Join问题,该问题可以"可转换到一个度量空间"。然后,我们 derivate formulas使用度量空间的性质,然后我们开发了一个快速的知识图完成算法。最后,我们实验表明,我们的框架可以高效地处理知识图完成问题。
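
The reduction described above, from knowledge graph completion to a similarity join in a metric space, can be illustrated with a TransE-style embedding model, where the score ||h + r - t|| is an ordinary Euclidean distance, so completing (h, r, ?) becomes a nearest-neighbour query over the entity embedding table. The sketch below is a CPU/NumPy toy with random embeddings, not the paper's GPU framework or its derived formulas.

```python
import numpy as np

def complete_tails(entity_emb, relation_emb, head_id, rel_id, k=5):
    """Return the k most plausible tail entities for (head, relation, ?)
    under a TransE-style metric-space scoring."""
    query = entity_emb[head_id] + relation_emb[rel_id]
    dists = np.linalg.norm(entity_emb - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 64)).astype(np.float32)   # 1000 entities
R = rng.normal(size=(20, 64)).astype(np.float32)     # 20 relation types
ids, dists = complete_tails(E, R, head_id=3, rel_id=7)
print(ids, dists)
```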

Extracting Molecular Properties from Natural Language with Multimodal Contrastive Learning

  • paper_url: http://arxiv.org/abs/2307.12996
  • repo_url: None
  • paper_authors: Romain Lacombe, Andrew Gaut, Jeff He, David Lüdeke, Kateryna Pistunova
  • for: 本研究旨在将科学知识从文本中提取到分子图表示,以 Bridge 深度学习在计算生物化学中的图表示和文本描述之间的 gap。
  • methods: 本研究使用对比学习将神经图表示与文本描述的特征进行对应,并使用神经相关性分数策略提高文本检索。此外,我们还提出了一种基于有机反应的新的分子图数据增强策略。
  • results: 我们的模型在下游 MoleculeNet 性质分类任务上表现出色,与模型只使用图模式alone (+4.26% AUROC提升) 和 MoMu 模型(Su et al. 2022) (+1.54% 提升) 相比,均显著提高了性能。
    Abstract Deep learning in computational biochemistry has traditionally focused on molecular graphs neural representations; however, recent advances in language models highlight how much scientific knowledge is encoded in text. To bridge these two modalities, we investigate how molecular property information can be transferred from natural language to graph representations. We study property prediction performance gains after using contrastive learning to align neural graph representations with representations of textual descriptions of their characteristics. We implement neural relevance scoring strategies to improve text retrieval, introduce a novel chemically-valid molecular graph augmentation strategy inspired by organic reactions, and demonstrate improved performance on downstream MoleculeNet property classification tasks. We achieve a +4.26% AUROC gain versus models pre-trained on the graph modality alone, and a +1.54% gain compared to recently proposed molecular graph/text contrastively trained MoMu model (Su et al. 2022).
    摘要 深度学习在计算生物化学中传统上专注于分子图 neural representation; 然而,最近的语言模型发展提出了如何在文本中储存科学知识的问题。为了联系这两种模式,我们研究如何从自然语言中提取分子性质信息并将其转换为图表示。我们使用对比学习将神经图表示与文本描述中的特征表示进行对应。我们还使用神经相关分数策略来改进文本检索,并提出了一种基于有机反应的化学正确分子图增强策略。我们在下游MoleculeNet属性分类任务上达到了+4.26% AUROC提升和+1.54%提升,相比于只使用图模式预训练的模型。
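
A generic sketch of the kind of contrastive alignment described above: a symmetric InfoNCE loss between molecular-graph embeddings and the embeddings of their paired text descriptions. The encoders, random batch features, and temperature are placeholders; the paper's exact loss, relevance scoring, and reaction-inspired augmentations are not reproduced here.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(graph_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss that pulls each graph embedding toward its
    paired text embedding and away from the other descriptions in the batch."""
    g = F.normalize(graph_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = g @ t.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(g.size(0), device=g.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```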

How to Design and Deliver Courses for Higher Education in the AI Era: Insights from Exam Data Analysis

  • paper_url: http://arxiv.org/abs/2308.02441
  • repo_url: None
  • paper_authors: Ahmad Samer Wazan, Imran Taj, Abdulhadi Shoufan, Romain Laborde, Rémi Venant
  • For: The paper advocates for the idea that courses and exams in the AI era should be designed based on the strengths and limitations of AI, as well as pedagogical educational objectives.
  • Methods: The paper explores the strengths and limitations of AI based on current advances in the field, and provides examples of how courses and exams can be designed based on these factors in the IT, English, and Art domains. The paper also describes a pedagogical approach inspired by the Socratic teaching method that was adopted from January 2023 to May 2023.
  • Results: The paper presents data analysis results of seven ChatGPT-authorized exams conducted between December 2022 and March 2023, which show no correlation between students' grades and whether or not they used ChatGPT to answer their exam questions. The paper also proposes a new exam system that allows for the application of the pedagogical approach in the AI era.
    Abstract In this position paper, we advocate for the idea that courses and exams in the AI era have to be designed based on two factors: (1) the strengths and limitations of AI, and (2) the pedagogical educational objectives. Based on insights from the Delors report on education [1], we first address the role of education and recall the main objectives that educational institutes must strive to achieve independently of any technology. We then explore the strengths and limitations of AI, based on current advances in AI. We explain how courses and exams can be designed based on these strengths and limitations of AI, providing different examples in the IT, English, and Art domains. We show how we adopted a pedagogical approach that is inspired from the Socratic teaching method from January 2023 to May 2023. Then, we present the data analysis results of seven ChatGPT-authorized exams conducted between December 2022 and March 2023. Our exam data results show that there is no correlation between students' grades and whether or not they use ChatGPT to answer their exam questions. Finally, we present a new exam system that allows us to apply our pedagogical approach in the AI era.

Model Predictive Control (MPC) of an Artificial Pancreas with Data-Driven Learning of Multi-Step-Ahead Blood Glucose Predictors

  • paper_url: http://arxiv.org/abs/2307.12015
  • repo_url: None
  • paper_authors: Eleonora Maria Aiello, Mehrad Jaloli, Marzia Cescon
  • for: 这个论文旨在开发一种基于Linear Time-Varying(LTV)Model Predictive Control(MPC)框架的闭环式胰岛素输液控制算法,用于治疗1型糖尿病(T1D)。
  • methods: 这个算法使用了一个数据驱动的多步预测器,并将预测结果用于LTV MPC控制器中。在非线性部分,我们使用了一个Long Short-Term Memory(LSTM)网络,而在线性部分,我们使用了一个线性回归模型。
  • results: 我们对这两种控制器进行了Simulation比较,并发现我们的LSTM-MPC控制器在三个场景中表现更好,即在常规情况下、随机饭物干扰情况下和降低胰岛素敏感性25%情况下。此外,我们的方法可以更好地预测未来血糖浓度,并且closed-loop性能更好。
    Abstract We present the design and \textit{in-silico} evaluation of a closed-loop insulin delivery algorithm to treat type 1 diabetes (T1D) consisting in a data-driven multi-step-ahead blood glucose (BG) predictor integrated into a Linear Time-Varying (LTV) Model Predictive Control (MPC) framework. Instead of identifying an open-loop model of the glucoregulatory system from available data, we propose to directly fit the entire BG prediction over a predefined prediction horizon to be used in the MPC, as a nonlinear function of past input-ouput data and an affine function of future insulin control inputs. For the nonlinear part, a Long Short-Term Memory (LSTM) network is proposed, while for the affine component a linear regression model is chosen. To assess benefits and drawbacks when compared to a traditional linear MPC based on an auto-regressive with exogenous (ARX) input model identified from data, we evaluated the proposed LSTM-MPC controller in three simulation scenarios: a nominal case with 3 meals per day, a random meal disturbances case where meals were generated with a recently published meal generator, and a case with 25$\%$ decrease in the insulin sensitivity. Further, in all the scenarios, no feedforward meal bolus was administered. For the more challenging random meal generation scenario, the mean $\pm$ standard deviation percent time in the range 70-180 [mg/dL] was 74.99 $\pm$ 7.09 vs. 54.15 $\pm$ 14.89, the mean $\pm$ standard deviation percent time in the tighter range 70-140 [mg/dL] was 47.78$\pm$8.55 vs. 34.62 $\pm$9.04, while the mean $\pm$ standard deviation percent time in sever hypoglycemia, i.e., $<$ 54 [mg/dl] was 1.00$\pm$3.18 vs. 9.45$\pm$11.71, for our proposed LSTM-MPC controller and the traditional ARX-MPC, respectively. Our approach provided accurate predictions of future glucose concentrations and good closed-loop performances of the overall MPC controller.
    摘要 我们介绍了一种closed-loop胰岛素输液算法,用于治疗型1 диабеtes(T1D),这种算法包括一个数据驱动的多步预测血糖(BG)预测器,integrated into a Linear Time-Varying(LTV) Model Predictive Control(MPC)框架。而不是从可用数据中直接Identify opens loop模型的静脉糖皮肤系统,我们提议直接预测整个BG预测 horizon,作为一个非线性函数,用于MPC中的预测。 For the nonlinear part, a Long Short-Term Memory(LSTM) network is proposed, while for the affine component a linear regression model is chosen. To evaluate the benefits and drawbacks of the proposed LSTM-MPC controller compared to a traditional linear MPC based on an auto-regressive with exogenous(ARX)input model identified from data, we evaluated the proposed LSTM-MPC controller in three simulation scenarios: a nominal case with 3 meals per day, a random meal disturbances case where meals were generated with a recently published meal generator, and a case with 25% decrease in the insulin sensitivity. Further, in all the scenarios, no feedforward meal bolus was administered. For the more challenging random meal generation scenario, the mean ± standard deviation percent time in the range 70-180 [mg/dL] was 74.99 ± 7.09 vs. 54.15 ± 14.89, the mean ± standard deviation percent time in the tighter range 70-140 [mg/dL] was 47.78 ± 8.55 vs. 34.62 ± 9.04, while the mean ± standard deviation percent time in severe hypoglycemia, i.e., <54 [mg/dL] was 1.00 ± 3.18 vs. 9.45 ± 11.71, for our proposed LSTM-MPC controller and the traditional ARX-MPC, respectively. Our approach provided accurate predictions of future glucose concentrations and good closed-loop performances of the overall MPC controller.
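
A minimal sketch of the predictor structure the abstract describes, i.e. a nonlinear LSTM term driven by past input-output data plus an affine term in the future insulin control inputs. Layer sizes, horizons, and input dimensions below are hypothetical, and the MPC layer that would consume these multi-step predictions is omitted.

```python
import torch
import torch.nn as nn

class GlucosePredictor(nn.Module):
    """Multi-step BG predictor: LSTM over past input/output data plus a
    linear (affine) term in the planned future insulin inputs."""
    def __init__(self, past_dim, horizon, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(past_dim, hidden, batch_first=True)
        self.nonlinear_head = nn.Linear(hidden, horizon)
        self.affine_in_insulin = nn.Linear(horizon, horizon, bias=False)

    def forward(self, past_io, future_insulin):
        # past_io: (B, T_past, past_dim), future_insulin: (B, horizon)
        _, (h, _) = self.lstm(past_io)
        return self.nonlinear_head(h[-1]) + self.affine_in_insulin(future_insulin)

model = GlucosePredictor(past_dim=3, horizon=6)
pred = model(torch.randn(2, 24, 3), torch.randn(2, 6))
print(pred.shape)  # torch.Size([2, 6])
```

Keeping the prediction affine in the future insulin inputs is what lets a Linear Time-Varying MPC treat the learned predictor as a (time-varying) linear model of the control inputs at each solve.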

Psy-LLM: Scaling up Global Mental Health Psychological Services with AI-based Large Language Models

  • paper_url: http://arxiv.org/abs/2307.11991
  • repo_url: None
  • paper_authors: Tin Lai, Yukun Shi, Zicong Du, Jiajie Wu, Ken Fu, Yichao Dou, Ziqi Wang
  • for: The paper aims to provide a novel AI-based system for online psychological consultation, which can assist healthcare professionals in providing timely and professional mental health support.
  • methods: The proposed framework, called Psy-LLM, leverages Large Language Models (LLMs) for question-answering in online psychological consultation. The framework combines pre-trained LLMs with real-world professional Q&A from psychologists and extensively crawled psychological articles.
  • results: The authors evaluated the framework using intrinsic metrics such as perplexity and extrinsic evaluation metrics, including human participant assessments of response helpfulness, fluency, relevance, and logic. The results demonstrate the effectiveness of the Psy-LLM framework in generating coherent and relevant answers to psychological questions.
    Abstract The demand for psychological counseling has grown significantly in recent years, particularly with the global outbreak of COVID-19, which has heightened the need for timely and professional mental health support. Online psychological counseling has emerged as the predominant mode of providing services in response to this demand. In this study, we propose the Psy-LLM framework, an AI-based system leveraging Large Language Models (LLMs) for question-answering in online psychological consultation. Our framework combines pre-trained LLMs with real-world professional Q&A from psychologists and extensively crawled psychological articles. The Psy-LLM framework serves as a front-end tool for healthcare professionals, allowing them to provide immediate responses and mindfulness activities to alleviate patient stress. Additionally, it functions as a screening tool to identify urgent cases requiring further assistance. We evaluated the framework using intrinsic metrics, such as perplexity, and extrinsic evaluation metrics, with human participant assessments of response helpfulness, fluency, relevance, and logic. The results demonstrate the effectiveness of the Psy-LLM framework in generating coherent and relevant answers to psychological questions. This article concludes by discussing the potential of large language models to enhance mental health support through AI technologies in online psychological consultation.
    摘要 “对于心理辅导的需求在最近的几年中有了很大的增长,特别是COVID-19全球大流行,这使得心理健康支持的需求增加了。在这篇研究中,我们提出了Psy-LLM框架,这是一个基于大语言模型(LLM)的人工智能系统,用于在线心理咨询中回答问题。我们的框架结合了预训语言模型和专业心理师的问答,以及大量爬虫的心理文章。Psy-LLM框架作为健康专业人员的前端工具,可以提供即时的回答和心理活动,以减轻病人的压力。同时,它还可以作为寻找紧急案例需要进一步帮助的萤幕工具。我们使用了自类度、流畅度、相关度和逻辑性等内部评估指标,以及人类参与者的评价,来评估Psy-LLM框架的效果。结果显示,Psy-LLM框架可以生成 coherent 和相关的回答心理问题。本文结束时,讨论了大语言模型在线心理咨询中如何通过人工智能技术增强心理健康支持。”

Sparse then Prune: Toward Efficient Vision Transformers

  • paper_url: http://arxiv.org/abs/2307.11988
  • repo_url: https://github.com/yogiprsty/sparse-vit
  • paper_authors: Yogi Prasetyo, Novanto Yudistira, Agus Wahyu Widodo
  • for: 这个研究旨在investigate the possibility of applying Sparse Regularization and Pruning methods to the Vision Transformer architecture for image classification tasks, and explore the trade-off between performance and efficiency.
  • methods: 这个研究使用了Sparse Regularization和Pruning方法,并在CIFAR-10、CIFAR-100和ImageNet-100 datasets上进行了实验。模型的训练过程包括两部分:预训练和精度调整。预训练使用了ImageNet21K数据,followed by 20 epochs of fine-tuning.
  • results: 研究发现,当使用CIFAR-100和ImageNet-100数据进行测试时,带有Sparse Regularization的模型可以提高准确率by 0.12%。此外,对带有Sparse Regularization的模型进行截割,可以更好地提高平均准确率。特别是在CIFAR-10数据集上,截割后的模型可以提高准确率by 0.568%,在CIFAR-100和ImageNet-100数据集上提高了1.764%和0.256%。
    Abstract The Vision Transformer architecture is a deep learning model inspired by the success of the Transformer model in Natural Language Processing. However, the self-attention mechanism, large number of parameters, and the requirement for a substantial amount of training data still make Vision Transformers computationally burdensome. In this research, we investigate the possibility of applying Sparse Regularization to Vision Transformers and the impact of Pruning, either after Sparse Regularization or without it, on the trade-off between performance and efficiency. To accomplish this, we apply Sparse Regularization and Pruning methods to the Vision Transformer architecture for image classification tasks on the CIFAR-10, CIFAR-100, and ImageNet-100 datasets. The training process for the Vision Transformer model consists of two parts: pre-training and fine-tuning. Pre-training utilizes ImageNet21K data, followed by fine-tuning for 20 epochs. The results show that when testing with CIFAR-100 and ImageNet-100 data, models with Sparse Regularization can increase accuracy by 0.12%. Furthermore, applying pruning to models with Sparse Regularization yields even better results. Specifically, it increases the average accuracy by 0.568% on CIFAR-10 data, 1.764% on CIFAR-100, and 0.256% on ImageNet-100 data compared to pruning models without Sparse Regularization. Code can be accesed here: https://github.com/yogiprsty/Sparse-ViT
    摘要 “当前的视觉 Transformer 架构是一种深度学习模型,受到自然语言处理中的Transformer模型的成功所启发。然而,自注意力机制、大量的参数,以及需要大量的训练数据,仍然使得视觉 Transformer 的计算开销很大。在这个研究中,我们研究了将Sparse Regularization应用到视觉 Transformer 架构中,以及对其进行Prune的影响,以在性能和效率之间进行权衡。为此,我们将Sparse Regularization和Prune方法应用到视觉 Transformer 架构,进行图像分类任务。训练过程包括两个部分:预训练和精练。预训练使用ImageNet21K数据,接着进行20次精练。结果显示,在CIFAR-100和ImageNet-100数据上进行训练时,具有Sparse Regularization的模型可以提高精确率0.12%。此外,对Sparse Regularization的模型进行Prune操作,产生了更好的结果。具体来说,它可以在CIFAR-10数据上提高平均精确率0.568%,CIFAR-100数据上提高1.764%,ImageNet-100数据上提高0.256%。软件可以在以下github上取得:https://github.com/yogiprsty/Sparse-ViT”
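
To illustrate the two ingredients named above, the sketch below adds an L1 sparse-regularization penalty to the training loss and then applies magnitude pruning to the linear layers, using PyTorch's pruning utilities on a toy MLP rather than a full Vision Transformer. The penalty weight and pruning amount are arbitrary placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def l1_sparse_penalty(model, weight_decay=1e-5):
    """Sparse regularization term added to the task loss: an L1 penalty
    on the weights, which drives many of them toward zero."""
    return weight_decay * sum(p.abs().sum() for p in model.parameters())

def magnitude_prune(model, amount=0.3):
    """Prune the smallest-magnitude weights of every Linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
    return model

mlp = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 10))
loss = mlp(torch.randn(4, 64)).sum() + l1_sparse_penalty(mlp)
loss.backward()
magnitude_prune(mlp, amount=0.3)
```

The intuition matching the abstract is that sparse regularization during training pushes weights toward zero first, so the subsequent magnitude pruning removes weights that already contribute little.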

Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?

  • paper_url: http://arxiv.org/abs/2307.11978
  • repo_url: https://github.com/cewu/ptnl
  • paper_authors: Cheng-En Wu, Yu Tian, Haichao Yu, Heng Wang, Pedro Morgado, Yu Hen Hu, Linjie Yang
  • for: 这个论文主要研究了CLIP视觉语言模型如何在几行示例下适应新的分类任务,以及这种示例调整过程对噪声标签的Robustness。
  • methods: 该论文使用了CLIP视觉语言模型,通过几行示例进行示例调整,并进行了广泛的实验研究以探索这种示例调整过程中的关键因素。
  • results: 研究发现, CLIP的示例调整过程具有很高的Robustness,这主要归因于模型中的固定类名token提供了强制的Regularization,以及CLIP学习的强大预训练图像文本嵌入,帮助提高图像分类的预测精度。
    Abstract Vision-language models such as CLIP learn a generic text-image embedding from large-scale training data. A vision-language model can be adapted to a new classification task through few-shot prompt tuning. We find that such a prompt tuning process is highly robust to label noises. This intrigues us to study the key reasons contributing to the robustness of the prompt tuning paradigm. We conducted extensive experiments to explore this property and find the key factors are: 1) the fixed classname tokens provide a strong regularization to the optimization of the model, reducing gradients induced by the noisy samples; 2) the powerful pre-trained image-text embedding that is learned from diverse and generic web data provides strong prior knowledge for image classification. Further, we demonstrate that noisy zero-shot predictions from CLIP can be used to tune its own prompt, significantly enhancing prediction accuracy in the unsupervised setting. The code is available at https://github.com/CEWu/PTNL.
    摘要 视力语模型如CLIP通过大规模训练学习通用文本图像嵌入。一个视力语模型可以通过几招提示调整来适应新的分类任务。我们发现这种提示调整过程具有很高的鲁棒性,使我们感到惊叹。我们进行了广泛的实验研究这种性能的原因,并发现关键因素有:1)固定的类名token提供了模型优化的强制 régularization,减少了噪声样本引起的梯度; 2)通过多样化和通用的网络数据学习的强大预训练图像文本嵌入,为图像分类提供了强大的先验知识。此外,我们示出了CLIP的噪声零时预测可以用来调整其自己的提示,significantly enhance预测精度在无监督Setting下。代码可以在https://github.com/CEWu/PTNL中找到。
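
A rough CoOp-style sketch of prompt tuning with fixed class-name tokens: only a few learnable context vectors receive gradients, while the class-name embeddings stay frozen, which is the regularization the abstract credits for robustness to noisy labels. The dimensions and dummy class-name embeddings are placeholders, and the CLIP text encoder that would consume these prompts is not shown.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Learnable context vectors prepended to frozen class-name embeddings."""
    def __init__(self, classname_emb, n_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # trainable
        self.register_buffer("classname_emb", classname_emb)      # frozen

    def forward(self):
        n_cls = self.classname_emb.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, self.classname_emb], dim=1)  # (n_cls, n_ctx+L, dim)

# 10 classes, each with 3 (hypothetical) class-name token embeddings of size 512.
prompt = LearnablePrompt(classname_emb=torch.randn(10, 3, 512))
print(prompt().shape)  # torch.Size([10, 7, 512])
```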

Multi-representations Space Separation based Graph-level Anomaly-aware Detection

  • paper_url: http://arxiv.org/abs/2307.12994
  • repo_url: None
  • paper_authors: Fu Lin, Haonan Gong, Mingkang Li, Zitong Wang, Yue Zhang, Xuexiong Luo
  • for: 本研究的目标是检测图数据中异常的图形。
  • methods: 我们提出了一种基于多表示空间分离的图级异常检测框架,以考虑不同类型的异常图形之间的重要性。我们还设计了一个异常检测模块,以learn异常图形之间的特定权重。
  • results: 我们对基eline方法进行了广泛的评估,并获得了显著的效果。
    Abstract Graph structure patterns are widely used to model different area data recently. How to detect anomalous graph information on these graph data has become a popular research problem. The objective of this research is centered on the particular issue that how to detect abnormal graphs within a graph set. The previous works have observed that abnormal graphs mainly show node-level and graph-level anomalies, but these methods equally treat two anomaly forms above in the evaluation of abnormal graphs, which is contrary to the fact that different types of abnormal graph data have different degrees in terms of node-level and graph-level anomalies. Furthermore, abnormal graphs that have subtle differences from normal graphs are easily escaped detection by the existing methods. Thus, we propose a multi-representations space separation based graph-level anomaly-aware detection framework in this paper. To consider the different importance of node-level and graph-level anomalies, we design an anomaly-aware module to learn the specific weight between them in the abnormal graph evaluation process. In addition, we learn strictly separate normal and abnormal graph representation spaces by four types of weighted graph representations against each other including anchor normal graphs, anchor abnormal graphs, training normal graphs, and training abnormal graphs. Based on the distance error between the graph representations of the test graph and both normal and abnormal graph representation spaces, we can accurately determine whether the test graph is anomalous. Our approach has been extensively evaluated against baseline methods using ten public graph datasets, and the results demonstrate its effectiveness.
    摘要 graph结构模式在当今数据中广泛应用。如何检测图数据中的异常信息已成为一个流行的研究问题。本研究的目标在于特定的问题:如何在图集中检测异常图。前一些研究发现,异常图主要表现为节点水平和图水平异常,但这些方法在评估异常图时平等对待这两种异常形态,这与实际情况不符。此外,异常图具有微妙的差异,容易被现有方法检测掉。因此,我们提出了一个基于多个表示空间分离的图级异常检测框架。为了考虑节点水平和图水平异常的不同重要性,我们设计了一个异常检测模块,以学习特定的节点水平和图水平异常权重。此外,我们通过四种不同的权重图表示对彼此进行学习,以学习纯正的常见图和异常图表示空间。通过测试图表示空间与常见图表示空间和异常图表示空间之间的距离错误来准确判断测试图是否异常。我们的方法与基准方法进行比较使用了十个公共图数据集,结果表明其效果批示。

Pyrus Base: An Open Source Python Framework for the RoboCup 2D Soccer Simulation

  • paper_url: http://arxiv.org/abs/2307.16875
  • repo_url: https://github.com/cyrus2d/pyrus2d
  • paper_authors: Nader Zare, Aref Sayareh, Omid Amini, Mahtab Sarvmaili, Arad Firouzkouhi, Stan Matwin, Amilcar Soares
  • For: The paper is written to introduce Pyrus, a Python base code for the RoboCup Soccer Simulation 2D (SS2D) league, to provide a more accessible and efficient platform for researchers to develop their ideas and integrate machine learning algorithms into their teams.
  • Methods: The paper uses C++ base codes as the foundation and develops Pyrus, a Python base code, to overcome the challenges of C++ base codes and provide a more user-friendly platform for researchers.
  • Results: Pyrus is introduced as a powerful baseline for developing machine learning concepts in SS2D; it is open-source and publicly available under the MIT License on GitHub, encouraging researchers to efficiently develop their ideas and integrate machine learning algorithms into their teams.
    Abstract Soccer, also known as football in some parts of the world, involves two teams of eleven players whose objective is to score more goals than the opposing team. To simulate this game and attract scientists from all over the world to conduct research and participate in an annual computer-based soccer world cup, Soccer Simulation 2D (SS2D) was one of the leagues initiated in the RoboCup competition. In every SS2D game, two teams of 11 players and one coach connect to the RoboCup Soccer Simulation Server and compete against each other. Over the past few years, several C++ base codes have been employed to control agents' behavior and their communication with the server. Although C++ base codes have laid the foundation for the SS2D, developing them requires an advanced level of C++ programming. C++ language complexity is a limiting disadvantage of C++ base codes for all users, especially for beginners. To conquer the challenges of C++ base codes and provide a powerful baseline for developing machine learning concepts, we introduce Pyrus, the first Python base code for SS2D. Pyrus is developed to encourage researchers to efficiently develop their ideas and integrate machine learning algorithms into their teams. Pyrus base is open-source code, and it is publicly available under MIT License on GitHub
    摘要 足球(也称为足球在一些地方)是一种需要两支队伍的 eleven 名球员,目标是将更多的入球击败对手队伍。为了模拟这场游戏并吸引全球科学家来参与研究和参加每年的计算机基于足球世界杯赛,Football Simulation 2D(SS2D)是RoboCup竞赛中的一个赛事。在每场 SS2D 比赛中,两支队伍的 11 名球员和一位教练通过RoboCup足球 simulate Server 竞争对对手。过去几年,一些 C++ 基础代码被使用来控制代理的行为和与服务器的通信。虽然 C++ 基础代码已经为 SS2D 提供了基础,但是开发它们需要高级的 C++ 编程技能。 C++ 语言复杂性是 C++ 基础代码的限制性,特别是对所有用户来说,尤其是对初学者来说。为了 conquering C++ 基础代码的挑战和提供一个机器学习概念的强大基础,我们引入了 Pyrus,SS2D 的第一个 Python 基础代码。Pyrus 是为了鼓励研究人员尽可能快速地发展他们的想法,并将机器学习算法 integrate 到他们的队伍中。Pyrus 的基础代码是开源的,公开在 GitHub 上,并以 MIT 许可证进行公共发布。

On-Robot Bayesian Reinforcement Learning for POMDPs

  • paper_url: http://arxiv.org/abs/2307.11954
  • repo_url: None
  • paper_authors: Hai Nguyen, Sammie Katt, Yuchen Xiao, Christopher Amato
  • for: 本研究旨在提高机器人学习的效率,因为收集数据的成本很高。
  • methods: 本文使用贝叶斯强化学习(BRL)方法,利用专家知识和高效的算法来解决机器人学习的问题。
  • results: 本文在两个人机交互任务中实现了近乎最佳的性能,只需要少量真实世界回合(episode)。视频证明可以在https://youtu.be/H9xp60ngOes中找到。
    Abstract Robot learning is often difficult due to the expense of gathering data. The need for large amounts of data can, and should, be tackled with effective algorithms and leveraging expert information on robot dynamics. Bayesian reinforcement learning (BRL), thanks to its sample efficiency and ability to exploit prior knowledge, is uniquely positioned as such a solution method. Unfortunately, the application of BRL has been limited due to the difficulties of representing expert knowledge as well as solving the subsequent inference problem. This paper advances BRL for robotics by proposing a specialized framework for physical systems. In particular, we capture this knowledge in a factored representation, then demonstrate the posterior factorizes in a similar shape, and ultimately formalize the model in a Bayesian framework. We then introduce a sample-based online solution method, based on Monte-Carlo tree search and particle filtering, specialized to solve the resulting model. This approach can, for example, utilize typical low-level robot simulators and handle uncertainty over unknown dynamics of the environment. We empirically demonstrate its efficiency by performing on-robot learning in two human-robot interaction tasks with uncertainty about human behavior, achieving near-optimal performance after only a handful of real-world episodes. A video of learned policies is at https://youtu.be/H9xp60ngOes.
    摘要 机器人学习往往困难,主要是因为获取数据的成本高昂。为了解决这个问题,我们需要使用有效的算法和利用机器人动力学专家的知识。泛bayesian学习(BRL)因其样本效率高和能够利用先验知识的特点,成为一种有优势的解决方案。然而,BRL在应用中受到了知识表示和推理问题的限制。这篇论文提出了一种特有的框架,用于解决机器人物理系统中的问题。我们捕捉了专家知识,并证明 posterior 会分解为类似的形式,最后将模型形式化为 bayesian 框架。我们then introduces 一种基于 Monte-Carlo 搜索和粒子筛选的在线解决方法,特化用于解决 resulting 模型。这种方法可以利用典型的低级机器人模拟器,并处理不确定环境中的动力学不确定性。我们实验表明,这种方法可以在两个人机器人互动任务中达到近似优化性,只需要几十个真实世界 episoden。有关学习的视频可以在 https://youtu.be/H9xp60ngOes 中找到。
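
The sample-based solver described above combines Monte-Carlo tree search with particle filtering over hidden states. As a small, generic illustration of the particle-filtering half only (not the paper's specialized solver; the 1-D random-walk dynamics and Gaussian observation likelihood are made up):

```python
import numpy as np

def particle_filter_update(particles, weights, action, observation,
                           transition_fn, obs_likelihood_fn, rng):
    """One belief update over hidden POMDP states: propagate particles
    through the (uncertain) dynamics, reweight by the observation
    likelihood, then resample."""
    particles = np.array([transition_fn(s, action, rng) for s in particles])
    weights = weights * np.array([obs_likelihood_fn(observation, s) for s in particles])
    weights = weights + 1e-12            # guard against all-zero weights
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

rng = np.random.default_rng(0)
parts = rng.normal(size=100)             # belief over a 1-D hidden position
wts = np.full(100, 1.0 / 100)
step = lambda s, a, r: s + a + r.normal(scale=0.1)
lik = lambda obs, s: np.exp(-0.5 * (obs - s) ** 2)
parts, wts = particle_filter_update(parts, wts, action=0.5, observation=0.4,
                                    transition_fn=step, obs_likelihood_fn=lik,
                                    rng=rng)
print(parts.mean())
```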

Pathology-and-genomics Multimodal Transformer for Survival Outcome Prediction

  • paper_url: http://arxiv.org/abs/2307.11952
  • repo_url: https://github.com/cassie07/pathomics
  • paper_authors: Kexin Ding, Mu Zhou, Dimitris N. Metaxas, Shaoting Zhang
  • for: 这个研究旨在提高colon和rectum癌 survival outcome预测,通过结合pathology和genomics信息。
  • methods: 该研究提出了一种多Modal transformer(PathOmics),通过不监督预训练来捕捉组织微环境的内在相互作用,并将这些信息与许多 genomics数据(例如mRNA-sequence、copy number variant和methylation)融合。
  • results: 研究表明,提出的方法可以在TCGA colon和RECTUM癌组织中表现出优异,并且超越了现有的研究。此外,该方法还可以使用有限的finetunedamples进行数据效率的分析,从而提高预测结果的准确性。
    Abstract Survival outcome assessment is challenging and inherently associated with multiple clinical factors (e.g., imaging and genomics biomarkers) in cancer. Enabling multimodal analytics promises to reveal novel predictive patterns of patient outcomes. In this study, we propose a multimodal transformer (PathOmics) integrating pathology and genomics insights into colon-related cancer survival prediction. We emphasize the unsupervised pretraining to capture the intrinsic interaction between tissue microenvironments in gigapixel whole slide images (WSIs) and a wide range of genomics data (e.g., mRNA-sequence, copy number variant, and methylation). After the multimodal knowledge aggregation in pretraining, our task-specific model finetuning could expand the scope of data utility applicable to both multi- and single-modal data (e.g., image- or genomics-only). We evaluate our approach on both TCGA colon and rectum cancer cohorts, showing that the proposed approach is competitive and outperforms state-of-the-art studies. Finally, our approach is desirable to utilize the limited number of finetuned samples towards data-efficient analytics for survival outcome prediction. The code is available at https://github.com/Cassie07/PathOmics.
    摘要 生存结果评估在癌症中是挑战性的,与多种临床因素(例如成像和基因表达 markers)相关。启用多modal分析承诺可以揭示新的预测性模式。在这项研究中,我们提出了一种多modal transformer(PathOmics),将pathology和基因学信息集成到colon相关癌症生存预测中。我们强调了无监督预训来捕捉材料微环境的内在交互。经过多modal知识聚合的预训后,我们的任务特定模型精度调整可以扩大数据的可用范围,包括多modal数据(例如图像或基因数据)以及单modal数据(例如图像或基因数据)。我们在TCGAcolon和rectum癌症群体上评估了我们的方法,并显示了我们的方法与当前最佳实践相比较竞争。最后,我们的方法可以使用有限的精度调整样本来实现数据效率的分析。代码可以在https://github.com/Cassie07/PathOmics中找到。

HIQL: Offline Goal-Conditioned RL with Latent States as Actions

  • paper_url: http://arxiv.org/abs/2307.11949
  • repo_url: https://github.com/seohongpark/hiql
  • paper_authors: Seohong Park, Dibya Ghosh, Benjamin Eysenbach, Sergey Levine
  • for: 本研究旨在开发一种基于不监督学习的目标决策策略,能够从大量未标注数据中学习。
  • methods: 该方法使用一个action-free值函数,并通过层次分解来学习两个策略:一个高级策略用于处理状态作为行为,预测子目标,以及一个低级策略用于达成这个子目标。
  • results: 该方法可以解决长期任务,并可以在高维图像观察中进行扩展。 Code可以在https://seohong.me/projects/hiql/上下载。
    Abstract Unsupervised pre-training has recently become the bedrock for computer vision and natural language processing. In reinforcement learning (RL), goal-conditioned RL can potentially provide an analogous self-supervised approach for making use of large quantities of unlabeled (reward-free) data. However, building effective algorithms for goal-conditioned RL that can learn directly from diverse offline data is challenging, because it is hard to accurately estimate the exact value function for faraway goals. Nonetheless, goal-reaching problems exhibit structure, such that reaching distant goals entails first passing through closer subgoals. This structure can be very useful, as assessing the quality of actions for nearby goals is typically easier than for more distant goals. Based on this idea, we propose a hierarchical algorithm for goal-conditioned RL from offline data. Using one action-free value function, we learn two policies that allow us to exploit this structure: a high-level policy that treats states as actions and predicts (a latent representation of) a subgoal and a low-level policy that predicts the action for reaching this subgoal. Through analysis and didactic examples, we show how this hierarchical decomposition makes our method robust to noise in the estimated value function. We then apply our method to offline goal-reaching benchmarks, showing that our method can solve long-horizon tasks that stymie prior methods, can scale to high-dimensional image observations, and can readily make use of action-free data. Our code is available at https://seohong.me/projects/hiql/
    摘要 “无监督预训练”最近已经成为计算机视觉和自然语言处理领域的基础。在强化学习(RL)中,目标条件RL可能提供一种类似的自监督方法,将大量的无奖励数据加以利用。然而,建立有效的目标条件RL算法,从多元的离线数据中直接学习,是一个挑战,因为很难准确估计远距离目标的价值函数。然而,目标到达问题具有结构,即到达较远的目标需要先通过更近的子目标。这种结构可以非常有用,因为评估靠近目标的动作通常比评估更远目标的动作容易。基于这个想法,我们提出了一个层次架构的目标条件RL算法。我们使用一个不含动作的价值函数,学习两个政策:一个高层政策,将状态视为动作,预测(一个隐藏表示)子目标,以及一个低层政策,预测将用来达到子目标的动作。我们通过分析和示例,显示了这种层次分解使方法对价值函数估计误差具有鲁棒性。我们然后将我们的方法应用到离线目标到达基准,展示了我们的方法可以解决长期任务,可以扩展到高维度的影像观察,并可以轻松地使用无动作数据。我们的代码可以在https://seohong.me/projects/hiql/ 获取。”
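
A bare-bones sketch of the hierarchical decomposition described above: a high-level network maps state and distant goal to a latent subgoal, and a low-level network maps state and subgoal to an action. Network sizes and dimensions are placeholders, and the action-free value function and offline training losses are omitted.

```python
import torch
import torch.nn as nn

class HierarchicalGoalPolicy(nn.Module):
    """High level: (state, goal) -> latent subgoal. Low level: (state, subgoal) -> action."""
    def __init__(self, state_dim, goal_dim, action_dim, latent_dim=32, hidden=256):
        super().__init__()
        self.high = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim))
        self.low = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def forward(self, state, goal):
        subgoal = self.high(torch.cat([state, goal], dim=-1))
        action = self.low(torch.cat([state, subgoal], dim=-1))
        return action, subgoal

policy = HierarchicalGoalPolicy(state_dim=17, goal_dim=17, action_dim=6)
action, subgoal = policy(torch.randn(1, 17), torch.randn(1, 17))
print(action.shape, subgoal.shape)
```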

Selective Perception: Optimizing State Descriptions with Reinforcement Learning for Language Model Actors

  • paper_url: http://arxiv.org/abs/2307.11922
  • repo_url: None
  • paper_authors: Kolby Nottingham, Yasaman Razeghi, Kyungmin Kim, JB Lanier, Pierre Baldi, Roy Fox, Sameer Singh
  • for: 这个论文旨在探讨如何使用自然语言处理技术来帮助语言模型在决策过程中更好地处理环境状态信息。
  • methods: 这篇论文提出了一种名为“布林德”(BLINDER)的方法,它通过学习任务条件下的状态描述值函数来自动选择简洁的状态描述。
  • results: 实验结果表明,使用布林德方法可以提高任务成功率,降低输入大小和计算成本,并在不同的语言模型actor之间进行泛化。
    Abstract Large language models (LLMs) are being applied as actors for sequential decision making tasks in domains such as robotics and games, utilizing their general world knowledge and planning abilities. However, previous work does little to explore what environment state information is provided to LLM actors via language. Exhaustively describing high-dimensional states can impair performance and raise inference costs for LLM actors. Previous LLM actors avoid the issue by relying on hand-engineered, task-specific protocols to determine which features to communicate about a state and which to leave out. In this work, we propose Brief Language INputs for DEcision-making Responses (BLINDER), a method for automatically selecting concise state descriptions by learning a value function for task-conditioned state descriptions. We evaluate BLINDER on the challenging video game NetHack and a robotic manipulation task. Our method improves task success rate, reduces input size and compute costs, and generalizes between LLM actors.
    摘要 大型语言模型(LLM)在机器人和游戏等领域被应用为序列决策任务的演员,利用其总体世界知识和规划能力。然而,先前的研究几乎没有探讨 LLM 演员所接受的环境状态信息是如何传递给语言中。描述高维状态的详细信息可能会降低性能和提高 LLM 演员的推理成本。先前的 LLM 演员通常通过靠手工设计、任务特定协议来确定要关注哪些状态特征和哪些可以被忽略。在这项工作中,我们提出了 Brief Language INputs for DEcision-making Responses(BLINDER)方法,通过学习任务条件下的状态描述值函数来自动选择简洁的状态描述。我们在 NetHack 游戏和机器人 manipulate 任务上评估 BLINDER。我们的方法可以提高任务成功率,降低输入大小和计算成本,并在不同的 LLM 演员之间进行泛化。

Bibliometric Analysis of Publisher and Journal Instructions to Authors on Generative-AI in Academic and Scientific Publishing

  • paper_url: http://arxiv.org/abs/2307.11918
  • repo_url: None
  • paper_authors: Conner Ganjavi, Michael B. Eppler, Asli Pekcan, Brett Biedermann, Andre Abreu, Gary S. Collins, Inderbir S. Gill, Giovanni E. Cacciamani
  • for: The paper aims to determine the extent and content of guidance for authors regarding the use of generative-AI (GAI), Generative Pretrained models (GPTs), and Large Language Models (LLMs) powered tools among the top 100 academic publishers and journals in science.
  • methods: The study screened the websites of the top 100 publishers and journals from May 19th to May 20th, 2023, to identify guidance on the use of GAI.
  • results: The study found that 17% of the largest 100 publishers and 70% of the top 100 journals provided guidance on the use of GAI. Most publishers and journals prohibited the inclusion of GAI as an author, but there was variability in how to disclose the use of GAI and in the allowable uses of GAI. Some top publishers and journals lacked guidance on the use of GAI by authors, and there was a need for standardized guidelines to protect the integrity of scientific output.Here are the three key points in Simplified Chinese text:
  • for: 这篇论文目的是检查科学领域前100家出版社和期刊的作者指南中对生成AI(GAI)、生成预训模型(GPTs)和大语言模型(LLMs)Powered工具的使用。
  • methods: 这个研究从5月19日至5月20日,对前100家出版社和期刊的官方网站进行屏幕,以找到关于GAI的指南。
  • results: 研究发现,前100家出版社中有17%提供了GAI的指南,而前100家期刊中有70%提供了指南。大多数出版社和期刊禁止了GAI作为作者的包含,但是有一定的变化在披露GAI的方式和允许的GAI使用方式。一些顶尖出版社和期刊缺乏关于GAI的指南,需要有标准化的指南来保护科学输出的正当性。
    Abstract We aim to determine the extent and content of guidance for authors regarding the use of generative-AI (GAI), Generative Pretrained models (GPTs) and Large Language Models (LLMs) powered tools among the top 100 academic publishers and journals in science. The websites of these publishers and journals were screened from between 19th and 20th May 2023. Among the largest 100 publishers, 17% provided guidance on the use of GAI, of which 12 (70.6%) were among the top 25 publishers. Among the top 100 journals, 70% have provided guidance on GAI. Of those with guidance, 94.1% of publishers and 95.7% of journals prohibited the inclusion of GAI as an author. Four journals (5.7%) explicitly prohibit the use of GAI in the generation of a manuscript, while 3 (17.6%) publishers and 15 (21.4%) journals indicated their guidance exclusively applies to the writing process. When disclosing the use of GAI, 42.8% of publishers and 44.3% of journals included specific disclosure criteria. There was variability in guidance of where to disclose the use of GAI, including in the methods, acknowledgments, cover letter, or a new section. There was also variability in how to access GAI guidance and the linking of journal and publisher instructions to authors. There is a lack of guidance by some top publishers and journals on the use of GAI by authors. Among those publishers and journals that provide guidance, there is substantial heterogeneity in the allowable uses of GAI and in how it should be disclosed, with this heterogeneity persisting among affiliated publishers and journals in some instances. The lack of standardization burdens authors and threatens to limit the effectiveness of these regulations. There is a need for standardized guidelines in order to protect the integrity of scientific output as GAI continues to grow in popularity.
    摘要 我们目的是确定杂志和出版商在使用生成AI(GAI)、生成预训练模型(GPT)和大语言模型(LLM)激活的指导内容和范围。我们在2023年5月19日至20日检查了前100名学术出版商和杂志的网站。 Among the largest 100 publishers, 17% provided guidance on the use of GAI, of which 12 (70.6%) were among the top 25 publishers. Among the top 100 journals, 70% have provided guidance on GAI. Of those with guidance, 94.1% of publishers and 95.7% of journals prohibited the inclusion of GAI as an author. Four journals (5.7%) explicitly prohibit the use of GAI in the generation of a manuscript, while 3 (17.6%) publishers and 15 (21.4%) journals indicated their guidance exclusively applies to the writing process. When disclosing the use of GAI, 42.8% of publishers and 44.3% of journals included specific disclosure criteria. There was variability in guidance of where to disclose the use of GAI, including in the methods, acknowledgments, cover letter, or a new section. There was also variability in how to access GAI guidance and the linking of journal and publisher instructions to authors. There is a lack of guidance by some top publishers and journals on the use of GAI by authors. Among those publishers and journals that provide guidance, there is substantial heterogeneity in the allowable uses of GAI and in how it should be disclosed, with this heterogeneity persisting among affiliated publishers and journals in some instances. The lack of standardization burdens authors and threatens to limit the effectiveness of these regulations. There is a need for standardized guidelines in order to protect the integrity of scientific output as GAI continues to grow in popularity.

Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.11897
  • repo_url: https://github.com/skandavaidyanath/credit-assignment
  • paper_authors: Akash Velu, Skanda Vaidyanath, Dilip Arumugam
  • for: 增强奖励学习 Agent 在缺乏评价反馈的环境中表现,尤其是在长期行为路径上只有单个终态反馈信号,导致奖励学习Agent 困难地归因到特定的行为步骤。
  • methods: 我们采用了现有的重要性抽象估计技术来改进基eline方法,以提高稳定性和效率。
  • results: 我们的方法可以在各种环境中稳定、高效地学习,并且可以缓解奖励学习Agent 在奖励分配问题上的困难。
    Abstract Oftentimes, environments for sequential decision-making problems can be quite sparse in the provision of evaluative feedback to guide reinforcement-learning agents. In the extreme case, long trajectories of behavior are merely punctuated with a single terminal feedback signal, leading to a significant temporal delay between the observation of a non-trivial reward and the individual steps of behavior culpable for achieving said reward. Coping with such a credit assignment challenge is one of the hallmark characteristics of reinforcement learning. While prior work has introduced the concept of hindsight policies to develop a theoretically moxtivated method for reweighting on-policy data by impact on achieving the observed trajectory return, we show that these methods experience instabilities which lead to inefficient learning in complex environments. In this work, we adapt existing importance-sampling ratio estimation techniques for off-policy evaluation to drastically improve the stability and efficiency of these so-called hindsight policy methods. Our hindsight distribution correction facilitates stable, efficient learning across a broad range of environments where credit assignment plagues baseline methods.
    摘要 常常,决策问题的环境很少提供评价反馈来引导强化学习代理人。在极端情况下,长期行为只有单个终端反馈信号,从而导致获得非致命奖励的步骤之间的时间延迟。处理这种奖励分配挑战是强化学习的一个标志特征。而优先作业已经介绍了使用影响实现观察路径返回的奖励重要性权重法,但这些方法会导致不稳定性,从而降低复杂环境中学习的效率。在这种情况下,我们采用现有的不当重要性评估技术来重要性权重法,以改善稳定性和效率。我们的往事分布修正方法可以在各种奖励分配问题中稳定、高效地学习。

On the Vulnerability of Fairness Constrained Learning to Malicious Noise

  • paper_url: http://arxiv.org/abs/2307.11892
  • repo_url: None
  • paper_authors: Avrim Blum, Princewill Okoroafor, Aadirupa Saha, Kevin Stangl
  • for: 本文研究了公平性约束学习对少量恶意训练噪声的鲁棒性。
  • methods: 本文使用了随机分类器来减轻恶意噪声的影响。
  • results: 研究发现,允许随机分类器时,公平性约束学习对小量恶意噪声的抗性较为良好,例如对于人口均衡性,可以只有$\Theta(\alpha)$的准确率损失,与无公平性约束时的最好情况相当。对于平等机会性,可以具有$O(\sqrt{\alpha})$的准确率损失,并给出了匹配的下界$\Omega(\sqrt{\alpha})$。与 Konstantinov 和 Lampert(2021)的研究相比,这些结果表明公平性约束学习对小量恶意噪声的抗性较为优秀。此外,本文还考虑了其他的公平性定义,包括均等几率(Equalized Odds)和校准(Calibration)。对这些公平性定义,残余准确率分布在$O(\alpha)$, $O(\sqrt{\alpha})$和$O(1)$三个自然区间内。这些结果为公平性约束学习对对抗性噪声的抗性提供了更细致的视角。
    Abstract We consider the vulnerability of fairness-constrained learning to small amounts of malicious noise in the training data. Konstantinov and Lampert (2021) initiated the study of this question and presented negative results showing there exist data distributions where for several fairness constraints, any proper learner will exhibit high vulnerability when group sizes are imbalanced. Here, we present a more optimistic view, showing that if we allow randomized classifiers, then the landscape is much more nuanced. For example, for Demographic Parity we show we can incur only a $\Theta(\alpha)$ loss in accuracy, where $\alpha$ is the malicious noise rate, matching the best possible even without fairness constraints. For Equal Opportunity, we show we can incur an $O(\sqrt{\alpha})$ loss, and give a matching $\Omega(\sqrt{\alpha})$lower bound. In contrast, Konstantinov and Lampert (2021) showed for proper learners the loss in accuracy for both notions is $\Omega(1)$. The key technical novelty of our work is how randomization can bypass simple "tricks" an adversary can use to amplify his power. We also consider additional fairness notions including Equalized Odds and Calibration. For these fairness notions, the excess accuracy clusters into three natural regimes $O(\alpha)$,$O(\sqrt{\alpha})$ and $O(1)$. These results provide a more fine-grained view of the sensitivity of fairness-constrained learning to adversarial noise in training data.
    摘要 我们考虑了公平性条件下的学习的易受攻击性。 Konstantinov 和 Lampert (2021) 开始了这个研究,并发现了一些数据分布下,任何合法的学习者都会受到高度易受攻击性的影响,当集合大小不对称时。 在这里,我们提供了一个更有希望的看法,表明如果允许随机分类器,那么情况会细致得多。 例如,对于人口均衡,我们显示可以允许仅有 $\Theta(\alpha)$ 的精度损失,其中 $\alpha$ 是邪恶噪音率,与不具有公平性限制的情况相同。 对于平等机会,我们显示可以允许 $O(\sqrt{\alpha})$ 的精度损失,并提供了匹配的 $\Omega(\sqrt{\alpha})$ 下界。 与之相比,Konstantinov 和 Lampert (2021) 表明对于合法(proper)学习者,两种公平性概念下的精度损失均为 $\Omega(1)$。 对于其他公平性定义(如 Equalized Odds 和 Calibration),损失精度会分布在三个自然的区间 $O(\alpha)$, $O(\sqrt{\alpha})$ 和 $O(1)$。 这些结果提供了一个更细部的看法,对于公平性限制下的学习对于噪音训练数据的敏感性。

Multimodal Document Analytics for Banking Process Automation

  • paper_url: http://arxiv.org/abs/2307.11845
  • repo_url: None
  • paper_authors: Christopher Gerling, Stefan Lessmann
  • For: This paper aims to understand the potential of advanced document analytics, specifically using multimodal models, in banking processes to improve operational efficiency and enhance process efficiency.* Methods: The paper uses a comprehensive analysis of the diverse banking document landscape, highlighting opportunities for efficiency gains through automation and advanced analytics techniques in the customer business. The study also employs natural language processing (NLP) techniques, including LayoutXLM, a cross-lingual, multimodal, pre-trained model, to analyze diverse documents in the banking sector.* Results: The study achieves an overall F1 score performance of around 80% on German company register extracts, demonstrating the efficiency of LayoutXLM. Additionally, the study finds that over 75% F1 score can be achieved with only 30% of the training data, highlighting the benefits of integrating image information and the potential for real-world applicability and benefits of multimodal models within banking.
    Abstract In response to growing FinTech competition and the need for improved operational efficiency, this research focuses on understanding the potential of advanced document analytics, particularly using multimodal models, in banking processes. We perform a comprehensive analysis of the diverse banking document landscape, highlighting the opportunities for efficiency gains through automation and advanced analytics techniques in the customer business. Building on the rapidly evolving field of natural language processing (NLP), we illustrate the potential of models such as LayoutXLM, a cross-lingual, multimodal, pre-trained model, for analyzing diverse documents in the banking sector. This model performs a text token classification on German company register extracts with an overall F1 score performance of around 80\%. Our empirical evidence confirms the critical role of layout information in improving model performance and further underscores the benefits of integrating image information. Interestingly, our study shows that over 75% F1 score can be achieved with only 30% of the training data, demonstrating the efficiency of LayoutXLM. Through addressing state-of-the-art document analysis frameworks, our study aims to enhance process efficiency and demonstrate the real-world applicability and benefits of multimodal models within banking.
    摘要 响应金融科技竞争的增长和业务效率的需求,这项研究专注于理解进步的文档分析技术在银行业务中的潜在优势。我们进行了银行文档多样化领域的全面分析,并指出了自动化和高级分析技术的可能性,以提高客户业务的效率。基于自然语言处理(NLP)领域的快速发展,我们介绍了 LayoutXLM 模型,这是一种跨语言、多modal、预训练的模型,可以分析银行业务中的多种文档。这个模型在德国公司注册报表EXTRACTS上进行文本符号分类,其总 F1 分数为约 80%。我们的实证证明了文档中的布局信息对模型性能的重要性,并进一步强调了将图像信息integrated的利好。奇妙的是,我们的研究表明,只使用 30% 的训练数据,可以达到超过 75% F1 分数,这表明 LayoutXLM 的效率。通过对现代文档分析框架进行调查,我们的研究旨在提高业务效率,并证明在银行业务中的多Modal模型的实际可用性和优势。

eXplainable Artificial Intelligence (XAI) in age prediction: A systematic review

  • paper_url: http://arxiv.org/abs/2307.13704
  • repo_url: None
  • paper_authors: Alena Kalyakulina, Igor Yusipov
  • for: 这篇论文旨在介绍Explainable Artificial Intelligence(XAI)在年龄预测任务中的应用。
  • methods: 论文使用了多种XAI方法,包括深度学习模型和特征选择技术。
  • results: 论文通过对多个身体系统的研究,发现XAI可以帮助提高年龄预测的准确率和可解释性。
    Abstract eXplainable Artificial Intelligence (XAI) is now an important and essential part of machine learning, allowing to explain the predictions of complex models. XAI is especially required in risky applications, particularly in health care, where human lives depend on the decisions of AI systems. One area of medical research is age prediction and identification of biomarkers of aging and age-related diseases. However, the role of XAI in the age prediction task has not previously been explored directly. In this review, we discuss the application of XAI approaches to age prediction tasks. We give a systematic review of the works organized by body systems, and discuss the benefits of XAI in medical applications and, in particular, in the age prediction domain.
    摘要 <>将文本翻译成简化中文。<>现代人工智能(XAI)已成为机器学习的重要和必需的一部分,允许解释模型的预测结果。XAI特别在高风险应用中需要,如医疗领域,人工智能系统的决策对人生命有着重要的影响。一个医学研究领域是年龄预测和衰老病症的生物标志物的预测。然而,XAI在年龄预测任务中的角色尚未得到直接探讨。在这篇评论中,我们讨论了XAI方法在年龄预测任务中的应用。我们按照身体系统进行了系统性的综述,并讨论了医疗应用中XAI的利点,特别是在年龄预测领域。

HybridAugment++: Unified Frequency Spectra Perturbations for Model Robustness

  • paper_url: http://arxiv.org/abs/2307.11823
  • repo_url: https://github.com/mkyucel/hybrid_augment
  • paper_authors: Mehmet Kerim Yucel, Ramazan Gokberk Cinbis, Pinar Duygulu
  • for: 提高Convolutional Neural Networks (CNN)对分布shift的泛化性能
  • methods: 提出了一种简单 yet effective的数据增强方法 HybridAugment,以减少CNN对高频组件的依赖,提高其 robustness,保持清晰率高
  • results: HybridAugment和HybridAugment++在CIFAR-10/100和ImageNet上达到或超过了现状的clean accuracy,在ImageNet-C、CIFAR-10-C和CIFAR-100-C中的损坏测试中达到或超过了现状,在CIFAR-10上的抗击性和多种数据集上的out-of-distribution检测中达到了竞争水平
    Abstract Convolutional Neural Networks (CNN) are known to exhibit poor generalization performance under distribution shifts. Their generalization have been studied extensively, and one line of work approaches the problem from a frequency-centric perspective. These studies highlight the fact that humans and CNNs might focus on different frequency components of an image. First, inspired by these observations, we propose a simple yet effective data augmentation method HybridAugment that reduces the reliance of CNNs on high-frequency components, and thus improves their robustness while keeping their clean accuracy high. Second, we propose HybridAugment++, which is a hierarchical augmentation method that attempts to unify various frequency-spectrum augmentations. HybridAugment++ builds on HybridAugment, and also reduces the reliance of CNNs on the amplitude component of images, and promotes phase information instead. This unification results in competitive to or better than state-of-the-art results on clean accuracy (CIFAR-10/100 and ImageNet), corruption benchmarks (ImageNet-C, CIFAR-10-C and CIFAR-100-C), adversarial robustness on CIFAR-10 and out-of-distribution detection on various datasets. HybridAugment and HybridAugment++ are implemented in a few lines of code, does not require extra data, ensemble models or additional networks.
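
The frequency-centric idea can be illustrated with a short PyTorch sketch that combines the low-frequency content of one image with the high-frequency content of another via the 2D FFT. This is only one plausible reading of a hybrid frequency augmentation, not the authors' HybridAugment/HybridAugment++ implementation (which is available in the linked repository).

```python
# Illustrative frequency-domain mixing of two images (assumed interpretation,
# not the official HybridAugment code).
import torch

def hybrid_frequency_mix(img_a: torch.Tensor, img_b: torch.Tensor, cutoff: float = 0.1):
    """Keep the low frequencies of img_a and the high frequencies of img_b.

    img_a, img_b: (C, H, W) tensors in [0, 1]; cutoff is the radius (as a
    fraction of the image size) separating "low" from "high" frequencies.
    """
    _, h, w = img_a.shape
    fa = torch.fft.fftshift(torch.fft.fft2(img_a), dim=(-2, -1))
    fb = torch.fft.fftshift(torch.fft.fft2(img_b), dim=(-2, -1))

    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    low_pass = (dist <= cutoff * min(h, w)).to(fa.dtype)      # centered low-frequency mask

    mixed = fa * low_pass + fb * (1 - low_pass)               # low of A + high of B
    out = torch.fft.ifft2(torch.fft.ifftshift(mixed, dim=(-2, -1))).real
    return out.clamp(0, 1)

a, b = torch.rand(3, 32, 32), torch.rand(3, 32, 32)
augmented = hybrid_frequency_mix(a, b)
print(augmented.shape)  # torch.Size([3, 32, 32])
```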

Mitigating Communications Threats in Decentralized Federated Learning through Moving Target Defense

  • paper_url: http://arxiv.org/abs/2307.11730
  • repo_url: https://github.com/enriquetomasmb/fedstellar
  • paper_authors: Enrique Tomás Martínez Beltrán, Pedro Miguel Sánchez Sánchez, Sergio López Bernal, Gérôme Bovet, Manuel Gil Pérez, Gregorio Martínez Pérez, Alberto Huertas Celdrán
  • for: Investigates communication-based attacks on decentralized federated learning (DFL) and proposes a security module to counter them.
  • methods: Combines symmetric and asymmetric encryption with Moving Target Defense (MTD) techniques, including random neighbor selection and IP/port switching, and implements the security module in the Fedstellar platform.
  • results: Evaluated on the MNIST dataset under eclipse attacks, the module achieves an average F1 score of 95% with moderate increases in CPU usage (up to 63.2% ± 3.5%) and network traffic (up to 230 MB ± 15 MB).
    Abstract The rise of Decentralized Federated Learning (DFL) has enabled the training of machine learning models across federated participants, fostering decentralized model aggregation and reducing dependence on a server. However, this approach introduces unique communication security challenges that have yet to be thoroughly addressed in the literature. These challenges primarily originate from the decentralized nature of the aggregation process, the varied roles and responsibilities of the participants, and the absence of a central authority to oversee and mitigate threats. Addressing these challenges, this paper first delineates a comprehensive threat model, highlighting the potential risks of DFL communications. In response to these identified risks, this work introduces a security module designed for DFL platforms to counter communication-based attacks. The module combines security techniques such as symmetric and asymmetric encryption with Moving Target Defense (MTD) techniques, including random neighbor selection and IP/port switching. The security module is implemented in a DFL platform called Fedstellar, allowing the deployment and monitoring of the federation. A DFL scenario has been deployed, involving eight physical devices implementing three security configurations: (i) a baseline with no security, (ii) an encrypted configuration, and (iii) a configuration integrating both encryption and MTD techniques. The effectiveness of the security module is validated through experiments with the MNIST dataset and eclipse attacks. The results indicated an average F1 score of 95%, with moderate increases in CPU usage (up to 63.2% +-3.5%) and network traffic (230 MB +-15 MB) under the most secure configuration, mitigating the risks posed by eavesdropping or eclipse attacks.
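
A toy sketch of the two Moving Target Defense ingredients mentioned above, random neighbor selection and port switching, is given below; the helper names and parameter ranges are illustrative and unrelated to the actual Fedstellar security module.

```python
# Toy Moving Target Defense helpers for a decentralized federation
# (illustrative only; not the Fedstellar security module).
import random

def select_random_neighbors(peers: list[str], k: int, seed: int | None = None) -> list[str]:
    """Pick a fresh random subset of peers to exchange models with this round,
    so an eavesdropper cannot predict the communication topology."""
    rng = random.Random(seed)
    return rng.sample(peers, min(k, len(peers)))

def rotate_port(current_port: int, low: int = 20000, high: int = 60000) -> int:
    """Switch the listening port to a new random value in [low, high)."""
    new_port = random.randrange(low, high)
    return new_port if new_port != current_port else rotate_port(current_port, low, high)

peers = [f"10.0.0.{i}" for i in range(2, 10)]
port = 45000
for federation_round in range(3):
    neighbors = select_random_neighbors(peers, k=3)
    port = rotate_port(port)
    print(f"round {federation_round}: gossip with {neighbors} on port {port}")
```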

Benchmark datasets for biomedical knowledge graphs with negative statements

  • paper_url: http://arxiv.org/abs/2307.11719
  • repo_url: None
  • paper_authors: Rita T. Sousa, Sara Silva, Catia Pesquita
  • for: addresses the lack of benchmark datasets for knowledge graphs with negative statements, especially in the biomedical domain.
  • methods: two popular path-based methods are used to generate knowledge graph embeddings for each dataset.
  • results: negative statements can improve the performance of knowledge graph embeddings in relation prediction tasks, such as protein-protein interaction prediction, gene-disease association prediction, and disease prediction.
    Abstract Knowledge graphs represent facts about real-world entities. Most of these facts are defined as positive statements. The negative statements are scarce but highly relevant under the open-world assumption. Furthermore, they have been demonstrated to improve the performance of several applications, namely in the biomedical domain. However, no benchmark dataset supports the evaluation of the methods that consider these negative statements. We present a collection of datasets for three relation prediction tasks - protein-protein interaction prediction, gene-disease association prediction and disease prediction - that aim at circumventing the difficulties in building benchmarks for knowledge graphs with negative statements. These datasets include data from two successful biomedical ontologies, Gene Ontology and Human Phenotype Ontology, enriched with negative statements. We also generate knowledge graph embeddings for each dataset with two popular path-based methods and evaluate the performance in each task. The results show that the negative statements can improve the performance of knowledge graph embeddings.
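
As a simplistic stand-in for the relation-prediction evaluation described above, the sketch below treats protein-protein interaction prediction as binary classification over concatenated entity embeddings; random vectors replace the path-based KG embeddings, so the resulting AUC is only a placeholder.

```python
# Relation prediction as binary classification over concatenated KG embeddings
# (random vectors stand in for path-based embeddings; with real embeddings the
# AUC becomes informative, here it hovers around 0.5 by construction).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_pairs, dim = 1000, 64
emb_head = rng.normal(size=(n_pairs, dim))      # embedding of the first protein
emb_tail = rng.normal(size=(n_pairs, dim))      # embedding of the second protein
X = np.hstack([emb_head, emb_tail])
y = rng.integers(0, 2, size=n_pairs)            # 1 = interacts, 0 = negative statement

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```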

Statement-based Memory for Neural Source Code Summarization

  • paper_url: http://arxiv.org/abs/2307.11709
  • repo_url: https://github.com/aakashba/smncode2022
  • paper_authors: Aakash Bansal, Siyuan Jiang, Sakib Haque, Collin McMillan
  • For: Programmers who want to quickly understand the behavior of source code without having to read the code itself; the goal is to provide natural language descriptions of code behavior.
  • Methods: The paper proposes a statement-based memory encoder that learns the important elements of flow during training, allowing for a statement-based subroutine representation without the need for dynamic analysis.
  • Results: The paper demonstrates a significant improvement over the state of the art in code summarization using the proposed statement-based memory encoder.
    Abstract Source code summarization is the task of writing natural language descriptions of source code behavior. Code summarization underpins software documentation for programmers. Short descriptions of code help programmers understand the program quickly without having to read the code itself. Lately, neural source code summarization has emerged as the frontier of research into automated code summarization techniques. By far the most popular targets for summarization are program subroutines. The idea, in a nutshell, is to train an encoder-decoder neural architecture using large sets of examples of subroutines extracted from code repositories. The encoder represents the code and the decoder represents the summary. However, most current approaches attempt to treat the subroutine as a single unit. For example, by taking the entire subroutine as input to a Transformer or RNN-based encoder. But code behavior tends to depend on the flow from statement to statement. Normally dynamic analysis may shed light on this flow, but dynamic analysis on hundreds of thousands of examples in large datasets is not practical. In this paper, we present a statement-based memory encoder that learns the important elements of flow during training, leading to a statement-based subroutine representation without the need for dynamic analysis. We implement our encoder for code summarization and demonstrate a significant improvement over the state-of-the-art.
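
A minimal PyTorch sketch of the general idea, encoding a subroutine statement by statement and letting a recurrent memory aggregate the statement vectors, is shown below; it is a simplified stand-in, not the architecture released in the linked repository.

```python
# Simplified statement-level encoder: embed tokens, pool per statement,
# then run a GRU over statement vectors as a flow-aware "memory".
import torch
import torch.nn as nn

class StatementMemoryEncoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.memory = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, statements: torch.Tensor) -> torch.Tensor:
        # statements: (batch, n_statements, n_tokens) of token ids, 0 = padding
        tok = self.tok_emb(statements)                              # (B, S, T, D)
        mask = (statements != 0).unsqueeze(-1).float()
        stmt_vec = (tok * mask).sum(2) / mask.sum(2).clamp(min=1)   # mean-pool tokens
        outputs, last = self.memory(stmt_vec)                       # GRU over statement order
        return last.squeeze(0)                                      # subroutine vector (B, D)

encoder = StatementMemoryEncoder(vocab_size=1000)
fake_subroutines = torch.randint(1, 1000, (4, 6, 12))   # 4 subroutines, 6 statements each
print(encoder(fake_subroutines).shape)                   # torch.Size([4, 128])
```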

cs.CL - 2023-07-22

Revisiting Distillation for Continual Learning on Visual Question Localized-Answering in Robotic Surgery

  • paper_url: http://arxiv.org/abs/2307.12045
  • repo_url: https://github.com/longbai1006/cs-vqla
  • paper_authors: Long Bai, Mobarakol Islam, Hongliang Ren
  • for: Explores how non-exemplar continual learning (CL) can strengthen visual-question localized-answering (VQLA) systems for surgical education.
  • methods: Proposes a non-exemplar CL framework that addresses catastrophic forgetting in deep neural networks, whose performance on old tasks drops sharply when new classes or tasks are learned and which cannot rely on old data because of medical data privacy and licensing constraints; the framework balances the rigidity-plasticity trade-off during sequential learning.
  • results: Extensive experiments on three public surgical datasets show the proposed method outperforms conventional CL methods for surgical VQLA, retaining performance on old tasks while learning new ones, with the weight balance between old and new tasks adjusted to the learning situation.
    Abstract The visual-question localized-answering (VQLA) system can serve as a knowledgeable assistant in surgical education. Except for providing text-based answers, the VQLA system can highlight the interested region for better surgical scene understanding. However, deep neural networks (DNNs) suffer from catastrophic forgetting when learning new knowledge. Specifically, when DNNs learn on incremental classes or tasks, their performance on old tasks drops dramatically. Furthermore, due to medical data privacy and licensing issues, it is often difficult to access old data when updating continual learning (CL) models. Therefore, we develop a non-exemplar continual surgical VQLA framework, to explore and balance the rigidity-plasticity trade-off of DNNs in a sequential learning paradigm. We revisit the distillation loss in CL tasks, and propose rigidity-plasticity-aware distillation (RP-Dist) and self-calibrated heterogeneous distillation (SH-Dist) to preserve the old knowledge. The weight aligning (WA) technique is also integrated to adjust the weight bias between old and new tasks. We further establish a CL framework on three public surgical datasets in the context of surgical settings that consist of overlapping classes between old and new surgical VQLA tasks. With extensive experiments, we demonstrate that our proposed method excellently reconciles learning and forgetting on the continual surgical VQLA over conventional CL methods. Our code is publicly accessible.
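
For readers unfamiliar with distillation-based continual learning, the sketch below shows the standard temperature-scaled distillation term that such methods build on; the paper's RP-Dist and SH-Dist losses add weighting and calibration schemes on top of this basic form, which are not reproduced here.

```python
# Generic knowledge-distillation loss between a frozen "old-task" model and the
# model being updated on a new task (illustrative; not the paper's exact losses).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# In a continual-learning step, the total loss mixes the new-task objective with
# the distillation term that anchors the model to its old behavior.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)          # produced by the frozen old model
new_task_loss = F.cross_entropy(student_logits, torch.randint(0, 10, (8,)))
total = new_task_loss + 0.5 * distillation_loss(student_logits, teacher_logits)
total.backward()
print(float(total))
```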

FinPT: Financial Risk Prediction with Profile Tuning on Pretrained Foundation Models

  • paper_url: http://arxiv.org/abs/2308.00065
  • repo_url: https://github.com/yuweiyin/finpt
  • paper_authors: Yuwei Yin, Yazheng Yang, Jian Yang, Qi Liu
  • For: Financial risk prediction in the financial sector, specifically addressing the issues of outdated algorithms and the lack of a unified benchmark.
  • Methods: Propose a novel approach called FinPT that leverages large pretrained foundation models and natural language processing techniques to improve financial risk prediction. FinPT fills financial tabular data into pre-defined instruction templates, obtains natural-language customer profiles by prompting LLMs, and fine-tunes large foundation models with the profile text for predictions.
  • Results: Demonstrate the effectiveness of FinPT by experimenting with a range of representative strong baselines on FinBench, a set of high-quality datasets on financial risks. Analytical studies also deepen the understanding of LLMs for financial risk prediction.
    Abstract Financial risk prediction plays a crucial role in the financial sector. Machine learning methods have been widely applied for automatically detecting potential risks and thus saving the cost of labor. However, the development in this field is lagging behind in recent years by the following two facts: 1) the algorithms used are somewhat outdated, especially in the context of the fast advance of generative AI and large language models (LLMs); 2) the lack of a unified and open-sourced financial benchmark has impeded the related research for years. To tackle these issues, we propose FinPT and FinBench: the former is a novel approach for financial risk prediction that conduct Profile Tuning on large pretrained foundation models, and the latter is a set of high-quality datasets on financial risks such as default, fraud, and churn. In FinPT, we fill the financial tabular data into the pre-defined instruction template, obtain natural-language customer profiles by prompting LLMs, and fine-tune large foundation models with the profile text to make predictions. We demonstrate the effectiveness of the proposed FinPT by experimenting with a range of representative strong baselines on FinBench. The analytical studies further deepen the understanding of LLMs for financial risk prediction.
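
The profile-tuning idea of turning a tabular record into natural language can be sketched in a few lines of Python; the field names and template wording below are invented for illustration and are not the paper's actual templates.

```python
# Hypothetical instruction template that turns one tabular customer record into
# a natural-language profile, ready to be sent to an LLM or a fine-tuned model.
customer = {
    "age": 42,
    "occupation": "teacher",
    "annual_income": 58000,
    "credit_utilization": 0.71,
    "missed_payments_12m": 2,
}

TEMPLATE = (
    "The customer is {age} years old and works as a {occupation}. "
    "Their annual income is ${annual_income:,}, their credit utilization is "
    "{credit_utilization:.0%}, and they missed {missed_payments_12m} payment(s) "
    "in the last 12 months."
)

profile_text = TEMPLATE.format(**customer)
prompt = (
    "Given the following customer profile, answer 'default' or 'no default'.\n\n"
    f"Profile: {profile_text}\nAnswer:"
)
print(prompt)
```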

Learning Vision-and-Language Navigation from YouTube Videos

  • paper_url: http://arxiv.org/abs/2307.11984
  • repo_url: https://github.com/jeremylinky/youtube-vln
  • paper_authors: Kunyang Lin, Peihao Chen, Diwei Huang, Thomas H. Li, Mingkui Tan, Chuang Gan
  • for: Trains an embodied agent for vision-and-language navigation in realistic 3D environments using natural language instructions, learning from house tour videos on YouTube.
  • methods: Creates a large-scale dataset of reasonable path-instruction pairs extracted from house tour videos and pre-trains the agent on it.
  • results: Uses an entropy-based method to construct path-instruction pairs, an action-aware generator to produce instructions from unlabeled trajectories, and a trajectory-judgment pretext task that lets the agent mine layout knowledge, achieving state-of-the-art performance on the R2R and REVERIE benchmarks.
    Abstract Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions. Existing VLN methods suffer from training on small-scale environments or unreasonable path-instruction datasets, limiting the generalization to unseen environments. There are massive house tour videos on YouTube, providing abundant real navigation experiences and layout information. However, these videos have not been explored for VLN before. In this paper, we propose to learn an agent from these videos by creating a large-scale dataset which comprises reasonable path-instruction pairs from house tour videos and pre-training the agent on it. To achieve this, we have to tackle the challenges of automatically constructing path-instruction pairs and exploiting real layout knowledge from raw and unlabeled videos. To address these, we first leverage an entropy-based method to construct the nodes of a path trajectory. Then, we propose an action-aware generator for generating instructions from unlabeled trajectories. Last, we devise a trajectory judgment pretext task to encourage the agent to mine the layout knowledge. Experimental results show that our method achieves state-of-the-art performance on two popular benchmarks (R2R and REVERIE). Code is available at https://github.com/JeremyLinky/YouTube-VLN

CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots

  • paper_url: http://arxiv.org/abs/2307.11865
  • repo_url: None
  • paper_authors: Nikhil Kakodkar, Dmitriy Rivkin, Bobak H. Baghi, Francois Hogan, Gregory Dudek
  • for: Explores how large language models (LLMs) can address problems at the intersection of spatial planning and natural-language navigation interfaces.
  • methods: Uses an LLM to interpret descriptive, conversational user queries in the context of the objects present in a scene, with complex and repeatable scenarios created at scale in the AI2Thor 3D simulator and augmented with language queries for 40 object types.
  • results: Shows that a robot equipped with this approach parses descriptive language queries, and hence the user's navigation goal, better than existing methods.
    Abstract This work explores the capacity of large language models (LLMs) to address problems at the intersection of spatial planning and natural language interfaces for navigation.Our focus is on following relatively complex instructions that are more akin to natural conversation than traditional explicit procedural directives seen in robotics. Unlike most prior work, where navigation directives are provided as imperative commands (e.g., go to the fridge), we examine implicit directives within conversational interactions. We leverage the 3D simulator AI2Thor to create complex and repeatable scenarios at scale, and augment it by adding complex language queries for 40 object types. We demonstrate that a robot can better parse descriptive language queries than existing methods by using an LLM to interpret the user interaction in the context of a list of the objects in the scene.
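
One way to picture the described setup, where an LLM interprets a conversational request against the list of objects in the scene, is a simple prompt builder like the sketch below; the wording and object list are invented for illustration.

```python
# Illustrative prompt construction: the LLM is asked to map an implicit,
# conversational request onto one of the objects known to be in the scene.
scene_objects = ["fridge", "microwave", "coffee machine", "sofa", "bookshelf"]
user_utterance = "I could really use something warm to drink before the meeting."

prompt = (
    "You control a household robot. The room contains the following objects: "
    + ", ".join(scene_objects) + ".\n"
    f'The user says: "{user_utterance}"\n'
    "Which single object should the robot navigate to in order to help? "
    "Reply with exactly one object name from the list."
)
print(prompt)
# The prompt would then be sent to an LLM; "coffee machine" is the expected target.
```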

The Looming Threat of Fake and LLM-generated LinkedIn Profiles: Challenges and Opportunities for Detection and Prevention

  • paper_url: http://arxiv.org/abs/2307.11864
  • repo_url: None
  • paper_authors: Navid Ayoobi, Sadat Shahriar, Arjun Mukherjee
  • for: Detects fake and large language model (LLM)-generated accounts on the LinkedIn online social network immediately upon registration, preventing impostors from acquiring legitimate users' private information and from building credibility for future phishing and scamming.
  • methods: Uses the textual information provided in LinkedIn profiles and introduces the Section and Subsection Tag Embedding (SSTE) method to enhance the discriminative characteristics of these data for distinguishing legitimate accounts from those created manually or with an LLM.
  • results: Achieves an accuracy of about 95% in separating fake from legitimate accounts, and SSTE reaches an accuracy of about 90% in identifying LLM-generated accounts even though no LLM-generated accounts were used during training.
    Abstract In this paper, we present a novel method for detecting fake and Large Language Model (LLM)-generated profiles in the LinkedIn Online Social Network immediately upon registration and before establishing connections. Early fake profile identification is crucial to maintaining the platform's integrity since it prevents imposters from acquiring the private and sensitive information of legitimate users and from gaining an opportunity to increase their credibility for future phishing and scamming activities. This work uses textual information provided in LinkedIn profiles and introduces the Section and Subsection Tag Embedding (SSTE) method to enhance the discriminative characteristics of these data for distinguishing between legitimate profiles and those created by imposters manually or by using an LLM. Additionally, the dearth of a large publicly available LinkedIn dataset motivated us to collect 3600 LinkedIn profiles for our research. We will release our dataset publicly for research purposes. This is, to the best of our knowledge, the first large publicly available LinkedIn dataset for fake LinkedIn account detection. Within our paradigm, we assess static and contextualized word embeddings, including GloVe, Flair, BERT, and RoBERTa. We show that the suggested method can distinguish between legitimate and fake profiles with an accuracy of about 95% across all word embeddings. In addition, we show that SSTE has a promising accuracy for identifying LLM-generated profiles, despite the fact that no LLM-generated profiles were employed during the training phase, and can achieve an accuracy of approximately 90% when only 20 LLM-generated profiles are added to the training set. It is a significant finding since the proliferation of several LLMs in the near future makes it extremely challenging to design a single system that can identify profiles created with various LLMs.

MythQA: Query-Based Large-Scale Check-Worthy Claim Detection through Multi-Answer Open-Domain Question Answering

  • paper_url: http://arxiv.org/abs/2307.11848
  • repo_url: https://github.com/tonyby/myth-qa
  • paper_authors: Yang Bai, Anthony Colas, Daisy Zhe Wang
  • for: Detecting check-worthy claims directly from a large-scale information source, such as Twitter, to accelerate the fact-checking process.
  • methods: Introduces MythQA, a new multi-answer open-domain question answering task that involves contradictory stance mining for query-based large-scale check-worthy claim detection.
  • results: Presents a baseline system for MythQA and evaluates existing NLP models for each system component on the TweetMythQA dataset, providing initial benchmarks and identifying key challenges for future models to improve upon.
    Abstract Check-worthy claim detection aims at providing plausible misinformation to downstream fact-checking systems or human experts to check. This is a crucial step toward accelerating the fact-checking process. Many efforts have been put into how to identify check-worthy claims from a small scale of pre-collected claims, but how to efficiently detect check-worthy claims directly from a large-scale information source, such as Twitter, remains underexplored. To fill this gap, we introduce MythQA, a new multi-answer open-domain question answering(QA) task that involves contradictory stance mining for query-based large-scale check-worthy claim detection. The idea behind this is that contradictory claims are a strong indicator of misinformation that merits scrutiny by the appropriate authorities. To study this task, we construct TweetMythQA, an evaluation dataset containing 522 factoid multi-answer questions based on controversial topics. Each question is annotated with multiple answers. Moreover, we collect relevant tweets for each distinct answer, then classify them into three categories: "Supporting", "Refuting", and "Neutral". In total, we annotated 5.3K tweets. Contradictory evidence is collected for all answers in the dataset. Finally, we present a baseline system for MythQA and evaluate existing NLP models for each system component using the TweetMythQA dataset. We provide initial benchmarks and identify key challenges for future models to improve upon. Code and data are available at: https://github.com/TonyBY/Myth-QA
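
The core signal in MythQA, an answer that attracts both supporting and refuting tweets, can be expressed in a few lines; the sketch below assumes stance labels have already been assigned and simply flags contradictory answers as check-worthy (an illustration, not the paper's baseline system).

```python
# Flag answers that have both "Supporting" and "Refuting" evidence as check-worthy.
from collections import defaultdict

# (answer, stance) pairs as produced by an upstream stance classifier.
tweet_stances = [
    ("Vaccine X causes condition Y", "Supporting"),
    ("Vaccine X causes condition Y", "Refuting"),
    ("Vaccine X causes condition Y", "Neutral"),
    ("City Z banned cars downtown", "Supporting"),
]

stances_per_answer = defaultdict(set)
for answer, stance in tweet_stances:
    stances_per_answer[answer].add(stance)

check_worthy = [a for a, s in stances_per_answer.items()
                if {"Supporting", "Refuting"} <= s]
print(check_worthy)   # ['Vaccine X causes condition Y']
```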

OUTFOX: LLM-generated Essay Detection through In-context Learning with Adversarially Generated Examples

  • paper_url: http://arxiv.org/abs/2307.11729
  • repo_url: None
  • paper_authors: Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki
  • for: Improving the robustness of detectors for LLM-generated text and evaluating them in realistic settings such as student essay writing.
  • methods: Proposes the OUTFOX framework, in which the detector and the attacker each consider the other's output: the attacker uses the detector's prediction labels as in-context examples to generate essays that are harder to detect, while the detector learns in-context from those adversarially generated essays.
  • results: The proposed detector, learning in-context from the attacker, improves detection performance on the attacked dataset by up to +41.3 F1 points, while the proposed attacker degrades detector performance by up to -57.0 F1 points compared with paraphrasing.
    Abstract Large Language Models (LLMs) have achieved human-level fluency in text generation, making it difficult to distinguish between human-written and LLM-generated texts. This poses a growing risk of misuse of LLMs and demands the development of detectors to identify LLM-generated texts. However, existing detectors degrade detection accuracy by simply paraphrasing LLM-generated texts. Furthermore, the effectiveness of these detectors in real-life situations, such as when students use LLMs for writing homework assignments (e.g., essays) and quickly learn how to evade these detectors, has not been explored. In this paper, we propose OUTFOX, a novel framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output and apply this to the domain of student essays. In our framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect. While the detector uses the adversarially generated essays as examples for in-context learning to learn to detect essays from a strong attacker. Our experiments show that our proposed detector learned in-context from the attacker improves the detection performance on the attacked dataset by up to +41.3 point F1-score. While our proposed attacker can drastically degrade the performance of the detector by up to -57.0 point F1-score compared to the paraphrasing method.
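
The detector side of such an in-context learning setup can be sketched as a prompt builder that places labeled example essays, including adversarially generated ones, before the target essay; the wording below is illustrative and not the paper's prompt format.

```python
# Build an in-context detection prompt from labeled example essays
# (illustrative; not the OUTFOX prompt format).
labeled_examples = [
    ("The mitochondria is the powerhouse of the cell, and ...", "human-written"),
    ("Throughout history, technology has consistently reshaped ...", "LLM-generated"),
    ("In this essay I argue, perhaps clumsily, that school uniforms ...", "human-written"),
]
target_essay = "Education stands as a cornerstone of societal progress, and ..."

lines = ["Decide whether each essay is human-written or LLM-generated.\n"]
for text, label in labeled_examples:
    lines.append(f"Essay: {text}\nLabel: {label}\n")
lines.append(f"Essay: {target_essay}\nLabel:")
prompt = "\n".join(lines)
print(prompt)
```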

GPT-4 Can’t Reason

  • paper_url: http://arxiv.org/abs/2308.03762
  • repo_url: https://github.com/vohidjon123/google
  • paper_authors: Konstantine Arkoudas
  • for: Assessing the reasoning ability of the GPT-4 model.
  • methods: Evaluates GPT-4 qualitatively on a collection of diverse reasoning problems.
  • results: Concludes that GPT-4 is currently incapable of reasoning, showing only occasional flashes of analytical brilliance across the evaluations.
    Abstract GPT-4 was released in March 2023 to wide acclaim, marking a very substantial improvement across the board over GPT-3.5 (OpenAI's previously best model, which had powered the initial release of ChatGPT). However, despite the genuinely impressive improvement, there are good reasons to be highly skeptical of GPT-4's ability to reason. This position paper discusses the nature of reasoning; criticizes the current formulation of reasoning problems in the NLP community, as well as the way in which LLM reasoning performance is currently evaluated; introduces a small collection of 21 diverse reasoning problems; and performs a detailed qualitative evaluation of GPT-4's performance on those problems. Based on this analysis, the paper concludes that, despite its occasional flashes of analytical brilliance, GPT-4 at present is utterly incapable of reasoning.

cs.LG - 2023-07-22

A Revolution of Personalized Healthcare: Enabling Human Digital Twin with Mobile AIGC

  • paper_url: http://arxiv.org/abs/2307.12115
  • repo_url: None
  • paper_authors: Jiayuan Chen, Changyan Yi, Hongyang Du, Dusit Niyato, Jiawen Kang, Jun Cai, Xuemin, Shen
  • for: Explores how mobile AI-generated content (AIGC) technology can advance the human digital twin (HDT) and thereby personalized healthcare.
  • methods: Proposes a system architecture for mobile AIGC-driven HDT and discusses the corresponding design requirements and challenges.
  • results: Illustrates the approach with two use cases and an experimental study demonstrating its effectiveness, and outlines open issues and future directions.
    Abstract Mobile Artificial Intelligence-Generated Content (AIGC) technology refers to the adoption of AI algorithms deployed at mobile edge networks to automate the information creation process while fulfilling the requirements of end users. Mobile AIGC has recently attracted phenomenal attentions and can be a key enabling technology for an emerging application, called human digital twin (HDT). HDT empowered by the mobile AIGC is expected to revolutionize the personalized healthcare by generating rare disease data, modeling high-fidelity digital twin, building versatile testbeds, and providing 24/7 customized medical services. To promote the development of this new breed of paradigm, in this article, we propose a system architecture of mobile AIGC-driven HDT and highlight the corresponding design requirements and challenges. Moreover, we illustrate two use cases, i.e., mobile AIGC-driven HDT in customized surgery planning and personalized medication. In addition, we conduct an experimental study to prove the effectiveness of the proposed mobile AIGC-driven HDT solution, which shows a particular application in a virtual physical therapy teaching platform. Finally, we conclude this article by briefly discussing several open issues and future directions.

A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks

  • paper_url: http://arxiv.org/abs/2307.12114
  • repo_url: None
  • paper_authors: Yanis Labrak, Mickael Rouvier, Richard Dufour
  • for: Evaluating four state-of-the-art instruction-tuned large language models (LLMs) on 13 real-world English clinical and biomedical NLP tasks, such as named-entity recognition (NER), question answering (QA), and relation extraction (RE).
  • methods: The LLMs are instruction-tuned models applied to the different NLP tasks in zero- and few-shot scenarios.
  • results: The evaluated LLMs approach state-of-the-art performance in zero- and few-shot settings for most tasks, particularly QA, even without having seen examples of these tasks; however, classification and RE performance falls below what a model trained specifically for the medical domain, such as PubMedBERT, can achieve, and no single LLM outperforms all others on every task.
    Abstract We evaluate four state-of-the-art instruction-tuned large language models (LLMs) -- ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca -- on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English, such as named-entity recognition (NER), question-answering (QA), relation extraction (RE), etc. Our overall results demonstrate that the evaluated LLMs begin to approach performance of state-of-the-art models in zero- and few-shot scenarios for most tasks, and particularly well for the QA task, even though they have never seen examples from these tasks before. However, we observed that the classification and RE tasks perform below what can be achieved with a specifically trained model for the medical field, such as PubMedBERT. Finally, we noted that no LLM outperforms all the others on all the studied tasks, with some models being better suited for certain tasks than others.

Active Control of Flow over Rotating Cylinder by Multiple Jets using Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.12083
  • repo_url: None
  • paper_authors: Kamyar Dobakhti, Jafar Ghazanfarian
  • for: Proposes an active flow control method based on deep reinforcement learning (DRL) to reduce drag on bluff bodies.
  • methods: Uses multiple controlled jets combined with cylinder rotation to reach the maximum possible drag suppression, and presents the controlling parameters of the DRL code, their limitations, and the optimization of the DRL network, including the number and positions of the jets, sensor locations, and the maximum allowed jet flow rate.
  • results: Combining rotation with DRL suppresses vortex shedding, stabilizes the Karman vortex street, and reduces the drag coefficient by up to 49.75%; more sensors at more locations is not always better, and allowing the agent higher flow rates mostly reduces performance except when the cylinder rotates.
    Abstract The real power of artificial intelligence appears in reinforcement learning, which is computationally and physically more sophisticated due to its dynamic nature. Rotation and injection are some of the proven ways in active flow control for drag reduction on blunt bodies. In this paper, rotation will be added to the cylinder alongside the deep reinforcement learning (DRL) algorithm, which uses multiple controlled jets to reach the maximum possible drag suppression. Characteristics of the DRL code, including controlling parameters, their limitations, and optimization of the DRL network for use with rotation will be presented. This work will focus on optimizing the number and positions of the jets, the sensors location, and the maximum allowed flow rate to jets in the form of the maximum allowed flow rate of each actuation and the total number of them per episode. It is found that combining the rotation and DRL is promising since it suppresses the vortex shedding, stabilizes the Karman vortex street, and reduces the drag coefficient by up to 49.75%. Also, it will be shown that having more sensors at more locations is not always a good choice and the sensor number and location should be determined based on the need of the user and corresponding configuration. Also, allowing the agent to have access to higher flow rates, mostly reduces the performance, except when the cylinder rotates. In all cases, the agent can keep the lift coefficient at a value near zero, or stabilize it at a smaller number.

Spectral Normalized-Cut Graph Partitioning with Fairness Constraints

  • paper_url: http://arxiv.org/abs/2307.12065
  • repo_url: https://github.com/jiali2000/fnm
  • paper_authors: Jia Li, Yanhao Wang, Arpit Merchant
  • for: Partitioning the node set of a graph into $k$ disjoint clusters that minimize the normalized cut between any cluster and all others while keeping each sensitive-attribute group approximately proportionally represented in every cluster.
  • methods: Proposes a two-phase spectral algorithm, FNM: in the first phase, an augmented Lagrangian term based on the fairness criterion is added to the objective to obtain a fairer spectral node embedding; in the second phase, a rounding scheme produces $k$ clusters from the fair embedding, trading off fairness and partition quality.
  • results: Comprehensive experiments on nine benchmark datasets demonstrate that FNM outperforms three baseline methods.
    Abstract Normalized-cut graph partitioning aims to divide the set of nodes in a graph into $k$ disjoint clusters to minimize the fraction of the total edges between any cluster and all other clusters. In this paper, we consider a fair variant of the partitioning problem wherein nodes are characterized by a categorical sensitive attribute (e.g., gender or race) indicating membership to different demographic groups. Our goal is to ensure that each group is approximately proportionally represented in each cluster while minimizing the normalized cut value. To resolve this problem, we propose a two-phase spectral algorithm called FNM. In the first phase, we add an augmented Lagrangian term based on our fairness criteria to the objective function for obtaining a fairer spectral node embedding. Then, in the second phase, we design a rounding scheme to produce $k$ clusters from the fair embedding that effectively trades off fairness and partition quality. Through comprehensive experiments on nine benchmark datasets, we demonstrate the superior performance of FNM compared with three baseline methods.
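
For reference, a plain, fairness-unaware spectral normalized-cut baseline looks like the sketch below: embed nodes with the eigenvectors of the normalized Laplacian and round with k-means. FNM's contributions, the fairness-driven augmented Lagrangian term in the embedding phase and the fairness-aware rounding, are not reproduced here.

```python
# Baseline spectral partitioning (no fairness term): normalized Laplacian
# eigenvectors + k-means rounding.
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, k = 60, 3
# Random symmetric adjacency matrix with three planted groups.
A = (rng.random((n, n)) < 0.05).astype(float)
for g in range(k):
    idx = slice(g * n // k, (g + 1) * n // k)
    A[idx, idx] = (rng.random((n // k, n // k)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T

L = laplacian(A, normed=True)                 # symmetric normalized Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, :k]                    # k smallest eigenvectors as node features
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
print(np.bincount(clusters))                  # cluster sizes
```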

Balancing Exploration and Exploitation in Hierarchical Reinforcement Learning via Latent Landmark Graphs

  • paper_url: http://arxiv.org/abs/2307.12063
  • repo_url: https://github.com/papercode2022/hill
  • paper_authors: Qingyang Zhang, Yiming Yang, Jingqing Ruan, Xuantang Xiong, Dengpeng Xing, Bo Xu
  • for: Proposes a hierarchical reinforcement learning method, HILL (Hierarchical reinforcement learning via dynamically building Latent Landmark graphs), to address the exploration-exploitation dilemma.
  • methods: HILL learns latent subgoal representations that satisfy temporal coherence using a contrastive representation learning objective, then dynamically builds latent landmark graphs with a novelty measure on nodes and a utility measure on edges, and selects subgoals by jointly considering both measures.
  • results: Experimental results show that HILL outperforms state-of-the-art baselines on continuous control tasks with sparse rewards in both sample efficiency and asymptotic performance.
    Abstract Goal-Conditioned Hierarchical Reinforcement Learning (GCHRL) is a promising paradigm to address the exploration-exploitation dilemma in reinforcement learning. It decomposes the source task into subgoal conditional subtasks and conducts exploration and exploitation in the subgoal space. The effectiveness of GCHRL heavily relies on subgoal representation functions and subgoal selection strategy. However, existing works often overlook the temporal coherence in GCHRL when learning latent subgoal representations and lack an efficient subgoal selection strategy that balances exploration and exploitation. This paper proposes HIerarchical reinforcement learning via dynamically building Latent Landmark graphs (HILL) to overcome these limitations. HILL learns latent subgoal representations that satisfy temporal coherence using a contrastive representation learning objective. Based on these representations, HILL dynamically builds latent landmark graphs and employs a novelty measure on nodes and a utility measure on edges. Finally, HILL develops a subgoal selection strategy that balances exploration and exploitation by jointly considering both measures. Experimental results demonstrate that HILL outperforms state-of-the-art baselines on continuous control tasks with sparse rewards in sample efficiency and asymptotic performance. Our code is available at https://github.com/papercode2022/HILL.
    摘要 “对于受益从探索和实施的问题,叫做目标调整层次学习(GCHRL)是一种有前途的思路。它将源任务分解为子任务 conditional subtask,并在子任务空间进行探索和实施。GCHRL的有效性很大程度上取决于子任务表示函数和子任务选择策略。然而,现有的工作往往忽略GCHRL中的时间协调性在学习隐藏子任务表示时。此外,缺乏一个能够均衡探索和实施的子任务选择策略。本文提出了层次学习 via 动态建立隐藏地标 graphs(HILL)来解决这些限制。HILL使用了一个对照式表示学习目标来学习隐藏子任务表示,并在这些表示上动态建立隐藏地标 graphs。HILL还使用了节点上的新鲜度量和边上的实用度量。最后,HILL发展了一个子任务选择策略,考虑了这两个度量,以均衡探索和实施。实验结果显示,HILL在缺少奖励的粒子控制任务上比基于 estado-of-the-art 基eline 高效和长期性。我们的代码可以在 获取。”

Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations

  • paper_url: http://arxiv.org/abs/2307.12062
  • repo_url: None
  • paper_authors: Yongyuan Liang, Yanchao Sun, Ruijie Zheng, Xiangyu Liu, Tuomas Sandholm, Furong Huang, Stephen McAleer
  • for: Training RL policies that perform well under environment perturbations or adversarial attacks.
  • methods: Proposes GRAD, which treats the temporally-coupled robust RL problem as a partially observable two-player zero-sum game and guarantees the agent's robustness to temporally-coupled perturbations by finding an approximate equilibrium of this game.
  • results: On a range of continuous control tasks, the proposed approach shows significant robustness advantages over baselines against both standard and temporally-coupled attacks, in both state and action spaces.
    Abstract Robust reinforcement learning (RL) seeks to train policies that can perform well under environment perturbations or adversarial attacks. Existing approaches typically assume that the space of possible perturbations remains the same across timesteps. However, in many settings, the space of possible perturbations at a given timestep depends on past perturbations. We formally introduce temporally-coupled perturbations, presenting a novel challenge for existing robust RL methods. To tackle this challenge, we propose GRAD, a novel game-theoretic approach that treats the temporally-coupled robust RL problem as a partially-observable two-player zero-sum game. By finding an approximate equilibrium in this game, GRAD ensures the agent's robustness against temporally-coupled perturbations. Empirical experiments on a variety of continuous control tasks demonstrate that our proposed approach exhibits significant robustness advantages compared to baselines against both standard and temporally-coupled attacks, in both state and action spaces.

Fast Knowledge Graph Completion using Graphics Processing Units

  • paper_url: http://arxiv.org/abs/2307.12059
  • repo_url: None
  • paper_authors: Chun-Hee Lee, Dong-oh Kang, Hwa Jeon Song
  • for: Proposes an efficient knowledge graph completion framework for obtaining new relations on GPUs.
  • methods: Uses knowledge graph embedding models, transforms the completion problem into a similarity join problem for models that are "transformable to a metric space", and derives fast completion formulas from the properties of a metric space.
  • results: Experiments show that the framework processes the knowledge graph completion problem efficiently.
    Abstract Knowledge graphs can be used in many areas related to data semantics such as question-answering systems, knowledge based systems. However, the currently constructed knowledge graphs need to be complemented for better knowledge in terms of relations. It is called knowledge graph completion. To add new relations to the existing knowledge graph by using knowledge graph embedding models, we have to evaluate $N\times N \times R$ vector operations, where $N$ is the number of entities and $R$ is the number of relation types. It is very costly. In this paper, we provide an efficient knowledge graph completion framework on GPUs to get new relations using knowledge graph embedding vectors. In the proposed framework, we first define "transformable to a metric space" and then provide a method to transform the knowledge graph completion problem into the similarity join problem for a model which is "transformable to a metric space". After that, to efficiently process the similarity join problem, we derive formulas using the properties of a metric space. Based on the formulas, we develop a fast knowledge graph completion algorithm. Finally, we experimentally show that our framework can efficiently process the knowledge graph completion problem.
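
The similarity-join step at the heart of this framework can be approximated on a GPU with batched distance computation and top-k selection, as in the short PyTorch sketch below; the paper's metric-space pruning formulas, which make the join fast at scale, are omitted.

```python
# Naive GPU similarity join over entity embeddings: for each query embedding,
# return the k nearest entities (the paper's metric-space pruning is omitted).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
num_entities, dim, k = 10000, 128, 5

entity_emb = torch.randn(num_entities, dim, device=device)
query_emb = torch.randn(64, dim, device=device)        # e.g. translated head+relation vectors

def similarity_join(queries: torch.Tensor, corpus: torch.Tensor, k: int):
    dists = torch.cdist(queries, corpus)               # (n_queries, n_entities) L2 distances
    topk = torch.topk(dists, k, dim=1, largest=False)  # smallest distances = best candidates
    return topk.values, topk.indices

values, indices = similarity_join(query_emb, entity_emb, k)
print(indices.shape)   # torch.Size([64, 5]) candidate tail entities per query
```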

Exploring MLOps Dynamics: An Experimental Analysis in a Real-World Machine Learning Project

  • paper_url: http://arxiv.org/abs/2307.13473
  • repo_url: None
  • paper_authors: Awadelrahman M. A. Ahmed
  • for: Optimizing the MLOps (Machine Learning Operations) process to improve the efficiency and productivity of machine learning projects.
  • methods: The experiment follows a comprehensive MLOps workflow covering problem definition, data acquisition, data preparation, model development, model deployment, monitoring, management, scalability, and governance and compliance within a real-world project, and uses a systematic tracking approach that records revisits between phases and the reasons for them.
  • results: The study reveals the strongly iterative and interdependent nature of the MLOps workflow, quantified through an overlap matrix between phases, and derives practical tips and recommendations emphasizing proactive planning and continuous improvement.
    Abstract This article presents an experiment focused on optimizing the MLOps (Machine Learning Operations) process, a crucial aspect of efficiently implementing machine learning projects. The objective is to identify patterns and insights to enhance the MLOps workflow, considering its iterative and interdependent nature in real-world model development scenarios. The experiment involves a comprehensive MLOps workflow, covering essential phases like problem definition, data acquisition, data preparation, model development, model deployment, monitoring, management, scalability, and governance and compliance. Practical tips and recommendations are derived from the results, emphasizing proactive planning and continuous improvement for the MLOps workflow. The experimental investigation was strategically integrated within a real-world ML project which followed essential phases of the MLOps process in a production environment, handling large-scale structured data. A systematic tracking approach was employed to document revisits to specific phases from a main phase under focus, capturing the reasons for such revisits. By constructing a matrix to quantify the degree of overlap between phases, the study unveils the dynamic and iterative nature of the MLOps workflow. The resulting data provides visual representations of the MLOps process's interdependencies and iterative characteristics within the experimental framework, offering valuable insights for optimizing the workflow and making informed decisions in real-world scenarios. This analysis contributes to enhancing the efficiency and effectiveness of machine learning projects through an improved MLOps process. Keywords: MLOps, Machine Learning Operations, Optimization, Experimental Analysis, Iterative Process, Pattern Identification.
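
The revisit-tracking idea translates directly into a small pandas exercise: log each transition from the phase in focus back to an earlier phase, then pivot the log into a phase-overlap matrix. The phase names and counts below are invented for illustration.

```python
# Build a phase "overlap" matrix from a log of revisits (illustrative data).
import pandas as pd

revisits = pd.DataFrame([
    {"from_phase": "model development", "revisited_phase": "data preparation"},
    {"from_phase": "model development", "revisited_phase": "data preparation"},
    {"from_phase": "model development", "revisited_phase": "data acquisition"},
    {"from_phase": "model deployment",  "revisited_phase": "model development"},
    {"from_phase": "monitoring",        "revisited_phase": "data preparation"},
])

overlap_matrix = pd.crosstab(revisits["from_phase"], revisits["revisited_phase"])
print(overlap_matrix)
# Rows: the phase being worked on; columns: the earlier phase that had to be revisited.
```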

Extracting Molecular Properties from Natural Language with Multimodal Contrastive Learning

  • paper_url: http://arxiv.org/abs/2307.12996
  • repo_url: None
  • paper_authors: Romain Lacombe, Andrew Gaut, Jeff He, David Lüdeke, Kateryna Pistunova
  • for: Transferring scientific knowledge from natural-language text to molecular graph representations to advance deep learning in computational biochemistry.
  • methods: Uses contrastive learning to align neural graph representations with representations of textual descriptions of molecular properties, adds neural relevance scoring strategies to improve text retrieval, and introduces a novel chemically valid molecular graph augmentation strategy inspired by organic reactions.
  • results: Achieves a +4.26% AUROC gain on downstream MoleculeNet property classification tasks over models pre-trained on the graph modality alone, and a +1.54% gain over the contrastively trained molecular graph/text MoMu model.
    Abstract Deep learning in computational biochemistry has traditionally focused on molecular graphs neural representations; however, recent advances in language models highlight how much scientific knowledge is encoded in text. To bridge these two modalities, we investigate how molecular property information can be transferred from natural language to graph representations. We study property prediction performance gains after using contrastive learning to align neural graph representations with representations of textual descriptions of their characteristics. We implement neural relevance scoring strategies to improve text retrieval, introduce a novel chemically-valid molecular graph augmentation strategy inspired by organic reactions, and demonstrate improved performance on downstream MoleculeNet property classification tasks. We achieve a +4.26% AUROC gain versus models pre-trained on the graph modality alone, and a +1.54% gain compared to recently proposed molecular graph/text contrastively trained MoMu model (Su et al. 2022).
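
The graph-text alignment objective is typically a symmetric contrastive (CLIP-style) loss over paired embeddings; the PyTorch sketch below shows that generic form, with random tensors standing in for the molecular-graph and text encoders.

```python
# Generic symmetric contrastive loss between paired graph and text embeddings
# (illustrative; the encoders are replaced by random tensors).
import torch
import torch.nn.functional as F

def clip_style_loss(graph_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    g = F.normalize(graph_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = g @ t.T / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(len(g))                 # i-th graph pairs with i-th description
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

graph_emb = torch.randn(16, 256, requires_grad=True)   # from a GNN over molecules
text_emb = torch.randn(16, 256, requires_grad=True)    # from a text encoder over descriptions
loss = clip_style_loss(graph_emb, text_emb)
loss.backward()
print(float(loss))
```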

Flight Contrail Segmentation via Augmented Transfer Learning with Novel SR Loss Function in Hough Space

  • paper_url: http://arxiv.org/abs/2307.12032
  • repo_url: https://github.com/junzis/contrail-net
  • paper_authors: Junzi Sun, Esther Roosenbrand
  • for: Detecting flight contrails in satellite imagery.
  • methods: A new model based on augmented transfer learning, together with a novel loss function, SR Loss, defined in Hough space.
  • results: Accurately detects contrails with minimal data.
    Abstract Air transport poses significant environmental challenges, particularly the contribution of flight contrails to climate change due to their potential global warming impact. Detecting contrails from satellite images has been a long-standing challenge. Traditional computer vision techniques have limitations under varying image conditions, and machine learning approaches using typical convolutional neural networks are hindered by the scarcity of hand-labeled contrail datasets and contrail-tailored learning processes. In this paper, we introduce an innovative model based on augmented transfer learning that accurately detects contrails with minimal data. We also propose a novel loss function, SR Loss, which improves contrail line detection by transforming the image space into Hough space. Our research opens new avenues for machine learning-based contrail detection in aviation research, offering solutions to the lack of large hand-labeled datasets, and significantly enhancing contrail detection models.
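
Because contrails are nearly linear features, mapping a predicted mask into Hough space makes their line structure explicit; the OpenCV sketch below shows that mapping on a synthetic mask. It only illustrates the Hough-space idea and is not the paper's SR Loss.

```python
# Detect line segments in a (synthetic) contrail mask via the probabilistic
# Hough transform -- an illustration of working in Hough space, not SR Loss itself.
import cv2
import numpy as np

mask = np.zeros((256, 256), dtype=np.uint8)
cv2.line(mask, (20, 200), (230, 40), color=255, thickness=2)   # fake contrail

segments = cv2.HoughLinesP(mask, rho=1, theta=np.pi / 180, threshold=50,
                           minLineLength=60, maxLineGap=10)
if segments is not None:
    for x1, y1, x2, y2 in segments[:, 0]:
        print(f"line segment from ({x1},{y1}) to ({x2},{y2})")
```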

FinPT: Financial Risk Prediction with Profile Tuning on Pretrained Foundation Models

  • paper_url: http://arxiv.org/abs/2308.00065
  • repo_url: https://github.com/yuweiyin/finpt
  • paper_authors: Yuwei Yin, Yazheng Yang, Jian Yang, Qi Liu
  • for: Proposes a new financial risk prediction method to help financial institutions better identify and predict risks.
  • methods: Uses Profile Tuning: financial tabular data is filled into pre-defined instruction templates, natural-language customer profiles are obtained by prompting large language models (LLMs), and large pretrained foundation models are fine-tuned on the profile text for prediction.
  • results: Experiments on the FinBench datasets compare FinPT with a range of representative strong baselines, and analytical studies of LLM behavior deepen the understanding of their use for financial risk prediction.
    Abstract Financial risk prediction plays a crucial role in the financial sector. Machine learning methods have been widely applied for automatically detecting potential risks and thus saving the cost of labor. However, the development in this field is lagging behind in recent years by the following two facts: 1) the algorithms used are somewhat outdated, especially in the context of the fast advance of generative AI and large language models (LLMs); 2) the lack of a unified and open-sourced financial benchmark has impeded the related research for years. To tackle these issues, we propose FinPT and FinBench: the former is a novel approach for financial risk prediction that conduct Profile Tuning on large pretrained foundation models, and the latter is a set of high-quality datasets on financial risks such as default, fraud, and churn. In FinPT, we fill the financial tabular data into the pre-defined instruction template, obtain natural-language customer profiles by prompting LLMs, and fine-tune large foundation models with the profile text to make predictions. We demonstrate the effectiveness of the proposed FinPT by experimenting with a range of representative strong baselines on FinBench. The analytical studies further deepen the understanding of LLMs for financial risk prediction.

A Flexible Framework for Incorporating Patient Preferences Into Q-Learning

  • paper_url: http://arxiv.org/abs/2307.12022
  • repo_url: None
  • paper_authors: Joshua P. Zitovsky, Leslie Wilson, Michael R. Kosorok
  • for: Addresses real-world healthcare problems with multiple competing outcomes of interest, such as treatment efficacy and side-effect severity.
  • methods: Proposes Latent Utility Q-Learning (LUQ-Learning), which uses a latent model approach to naturally extend Q-learning to composite outcomes and adopts each patient's ideal trade-off between them, removing limitations of existing methods such as being restricted to a single time point and two outcomes or being unable to incorporate self-reported patient preferences.
  • results: In simulation experiments based on an ongoing low back pain trial and a well-known completed schizophrenia trial, the method achieves highly competitive empirical performance compared with several alternative baselines.
    Abstract In real-world healthcare problems, there are often multiple competing outcomes of interest, such as treatment efficacy and side effect severity. However, statistical methods for estimating dynamic treatment regimes (DTRs) usually assume a single outcome of interest, and the few methods that deal with composite outcomes suffer from important limitations. This includes restrictions to a single time point and two outcomes, the inability to incorporate self-reported patient preferences and limited theoretical guarantees. To this end, we propose a new method to address these limitations, which we dub Latent Utility Q-Learning (LUQ-Learning). LUQ-Learning uses a latent model approach to naturally extend Q-learning to the composite outcome setting and adopt the ideal trade-off between outcomes to each patient. Unlike previous approaches, our framework allows for an arbitrary number of time points and outcomes, incorporates stated preferences and achieves strong asymptotic performance with realistic assumptions on the data. We conduct simulation experiments based on an ongoing trial for low back pain as well as a well-known completed trial for schizophrenia. In all experiments, our method achieves highly competitive empirical performance compared to several alternative baselines.

Model Predictive Control (MPC) of an Artificial Pancreas with Data-Driven Learning of Multi-Step-Ahead Blood Glucose Predictors

  • paper_url: http://arxiv.org/abs/2307.12015
  • repo_url: None
  • paper_authors: Eleonora Maria Aiello, Mehrad Jaloli, Marzia Cescon
  • for: Develops a closed-loop insulin delivery algorithm for treating type 1 diabetes (T1D) based on a Linear Time-Varying (LTV) Model Predictive Control (MPC) framework.
  • methods: Integrates a data-driven multi-step-ahead blood glucose (BG) predictor into the LTV MPC framework; instead of identifying an open-loop model of the glucoregulatory system from data, the entire BG prediction over the horizon is fitted directly, using a Long Short-Term Memory (LSTM) network for the nonlinear part and a linear regression model for the affine part.
  • results: In three simulation scenarios (a nominal case with 3 meals per day, random meal disturbances, and a 25% decrease in insulin sensitivity), the proposed LSTM-MPC controller provides more accurate predictions of future glucose levels and better closed-loop performance than a traditional ARX-based MPC.
    Abstract We present the design and \textit{in-silico} evaluation of a closed-loop insulin delivery algorithm to treat type 1 diabetes (T1D) consisting in a data-driven multi-step-ahead blood glucose (BG) predictor integrated into a Linear Time-Varying (LTV) Model Predictive Control (MPC) framework. Instead of identifying an open-loop model of the glucoregulatory system from available data, we propose to directly fit the entire BG prediction over a predefined prediction horizon to be used in the MPC, as a nonlinear function of past input-ouput data and an affine function of future insulin control inputs. For the nonlinear part, a Long Short-Term Memory (LSTM) network is proposed, while for the affine component a linear regression model is chosen. To assess benefits and drawbacks when compared to a traditional linear MPC based on an auto-regressive with exogenous (ARX) input model identified from data, we evaluated the proposed LSTM-MPC controller in three simulation scenarios: a nominal case with 3 meals per day, a random meal disturbances case where meals were generated with a recently published meal generator, and a case with 25$\%$ decrease in the insulin sensitivity. Further, in all the scenarios, no feedforward meal bolus was administered. For the more challenging random meal generation scenario, the mean $\pm$ standard deviation percent time in the range 70-180 [mg/dL] was 74.99 $\pm$ 7.09 vs. 54.15 $\pm$ 14.89, the mean $\pm$ standard deviation percent time in the tighter range 70-140 [mg/dL] was 47.78$\pm$8.55 vs. 34.62 $\pm$9.04, while the mean $\pm$ standard deviation percent time in sever hypoglycemia, i.e., $<$ 54 [mg/dl] was 1.00$\pm$3.18 vs. 9.45$\pm$11.71, for our proposed LSTM-MPC controller and the traditional ARX-MPC, respectively. Our approach provided accurate predictions of future glucose concentrations and good closed-loop performances of the overall MPC controller.
    摘要 我们介绍了一种关闭Loop抗糖尿病(T1D)的设计和 simulate evaluate 的数据驱动多步预测血糖(BG)预测算法,包括一个基于线性时变(LTV)模型预测控制(MPC)框架的数据驱动多步预测算法。而不是直接从可用数据中Identify一个开 Loop模型的glucoregulatory系统,我们提议直接将整个BG预测 horizon为用于MPC,作为非线性函数过去输入输出数据和未来药物控制输入的非线性函数。 для非线性部分,我们提议使用一个Long Short-Term Memory(LSTM)网络,而对于线性部分,我们选择了一个线性回归模型。为了评估我们提议的LSTM-MPC控制器与传统的ARX-MPC控制器相比,我们在三个模拟场景中评估了这两个控制器的表现:一个标准的3餐/天场景,一个随机餐品干扰场景,以及一个25%的药物敏感度下降场景。此外,在所有场景中,没有feedforward餐品补偿。在更加复杂的随机餐品生成场景中,LSTM-MPC控制器的mean±标准差%时间在70-180[mg/dL]范围内为74.99±7.09 vs. 54.15±14.89,mean±标准差%时间在70-140[mg/dL]范围内为47.78±8.55 vs. 34.62±9.04,而且mean±标准差%时间在严重低血糖(<54[mg/dL])下为1.00±3.18 vs. 9.45±11.71。我们的方法提供了精准的未来血糖浓度预测和关闭Loop控制器的全面性能的良好表现。
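
As a rough illustration of the predictor structure described in the abstract, the sketch below combines an LSTM over the past input-output history with an affine term in the future insulin inputs. The module name, feature dimensions, and prediction horizon are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BGPredictor(nn.Module):
    """Sketch: multi-step BG prediction = LSTM(past I/O history) + affine term in future insulin."""

    def __init__(self, past_features=3, hidden=64, horizon=12):
        super().__init__()
        self.lstm = nn.LSTM(past_features, hidden, batch_first=True)
        self.nonlinear_head = nn.Linear(hidden, horizon)    # nonlinear function of past data
        self.affine_insulin = nn.Linear(horizon, horizon)   # affine map of future insulin inputs

    def forward(self, past_io, future_insulin):
        # past_io: (batch, T_past, past_features); future_insulin: (batch, horizon)
        _, (h, _) = self.lstm(past_io)
        nonlinear_part = self.nonlinear_head(h[-1])          # depends only on past input-output data
        affine_part = self.affine_insulin(future_insulin)    # affine in the future control inputs
        return nonlinear_part + affine_part                  # predicted BG over the MPC horizon

# Usage: predictions over the MPC horizon for a batch of 8 subjects
model = BGPredictor()
bg_hat = model(torch.randn(8, 36, 3), torch.randn(8, 12))
print(bg_hat.shape)  # torch.Size([8, 12])
```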

NLCUnet: Single-Image Super-Resolution Network with Hairline Details

  • paper_url: http://arxiv.org/abs/2307.12014
  • repo_url: None
  • paper_authors: Jiancong Feng, Yuan-Gen Wang, Fengchuang Xing
  • for: 提高单张超解像图像质量，特别是细节部分的精度。
  • methods: 提出了一种基于非局部注意力的单张超解像网络（NLCUnet），包括三个核心设计：非局部注意力机制、逐通道（depth-wise）卷积和通道注意力。
  • results: 在 DF2K 数据集上进行了大量实验，结果表明 NLCUnet 在 PSNR 和 SSIM 指标上优于现有方法，并能保留更好的细节。
    Abstract Pursuing the precise details of super-resolution images is challenging for single-image super-resolution tasks. This paper presents a single-image super-resolution network with hairline details (termed NLCUnet), including three core designs. Specifically, a non-local attention mechanism is first introduced to restore local pieces by learning from the whole image region. Then, we find that the blur kernel trained by the existing work is unnecessary. Based on this finding, we create a new network architecture by integrating depth-wise convolution with channel attention without the blur kernel estimation, resulting in a performance improvement instead. Finally, to make the cropped region contain as much semantic information as possible, we propose a random 64$\times$64 crop inside the central 512$\times$512 crop instead of a direct random crop inside the whole image of 2K size. Numerous experiments conducted on the benchmark DF2K dataset demonstrate that our NLCUnet performs better than the state-of-the-art in terms of the PSNR and SSIM metrics and yields visually favorable hairline details.
    摘要 追求超分辨率图像的精确细节是单图超解像任务中的挑战。本文提出了一个带有发丝级细节的单图超解像网络（NLCUnet），包括三个核心设计。具体来说，我们首先引入非局部注意力机制，通过从整幅图像区域学习来恢复局部区域。然后，我们发现现有工作中训练的模糊核并非必需，基于这一发现，我们构建了一个融合逐通道卷积与通道注意力、且无需估计模糊核的新网络架构，从而提升了性能。最后，为了让裁剪区域尽可能包含语义信息，我们提议在 2K 图像的中心 512×512 区域内随机裁剪 64×64 区域，而不是在整幅图像内直接随机裁剪。在 DF2K 基准数据集上的大量实验表明，NLCUnet 在 PSNR 和 SSIM 指标上优于现有最佳方法，并在视觉上呈现更好的发丝级细节。
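
The cropping strategy from the abstract (a random 64x64 crop drawn inside the central 512x512 region rather than anywhere in the 2K image) is simple to sketch; the helper below is an illustrative reconstruction, not the authors' code.

```python
import numpy as np

def central_random_crop(img, central=512, crop=64):
    """Sample a random crop x crop patch inside the central x central region of img (H, W, C)."""
    h, w = img.shape[:2]
    # top-left corner of the central region
    cy, cx = (h - central) // 2, (w - central) // 2
    # random top-left corner of the small crop, constrained to stay inside the central region
    y = cy + np.random.randint(0, central - crop + 1)
    x = cx + np.random.randint(0, central - crop + 1)
    return img[y:y + crop, x:x + crop]

# Usage on a dummy 2K-resolution HR image
hr = np.zeros((1080, 2048, 3), dtype=np.uint8)
patch = central_random_crop(hr)
print(patch.shape)  # (64, 64, 3)
```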

Contrastive Self-Supervised Learning Based Approach for Patient Similarity: A Case Study on Atrial Fibrillation Detection from PPG Signal

  • paper_url: http://arxiv.org/abs/2308.02433
  • repo_url: https://github.com/subangkar/simsig
  • paper_authors: Subangkar Karmaker Shanto, Shoumik Saha, Atif Hasan Rahman, Mohammad Mehedy Masud, Mohammed Eunus Ali
  • for: 这篇论文提出了一种基于对比学习的深度学习框架，用于基于生理信号的病人相似性检索。
  • methods: 这个框架使用对比学习方法来学习病人的相似embedding,并引入了一些邻居选择算法来确定生成embedding上的最高相似性。
  • results: 作者通过对一个涉及到心脏病的案例研究,证明了该框架的有效性。实验结果表明,该框架可以准确地检测心脏病AF,并且与其他基线方法相比,其性能更高。
    Abstract In this paper, we propose a novel contrastive learning based deep learning framework for patient similarity search using physiological signals. We use a contrastive learning based approach to learn similar embeddings of patients with similar physiological signal data. We also introduce a number of neighbor selection algorithms to determine the patients with the highest similarity on the generated embeddings. To validate the effectiveness of our framework for measuring patient similarity, we select the detection of Atrial Fibrillation (AF) through photoplethysmography (PPG) signals obtained from smartwatch devices as our case study. We present extensive experimentation of our framework on a dataset of over 170 individuals and compare the performance of our framework with other baseline methods on this dataset.
    摘要 在本文中,我们提出了一种基于对比学习的深度学习框架,用于通过生物物理信号来查找病人相似性。我们使用对比学习方法来学习病人的相似 embedding,并引入了一些邻居选择算法来确定生成 embedding 中最相似的病人。为了证明我们的框架的有效性,我们选择了基于 photoplethysmography (PPG) 信号检测 Atrial Fibrillation (AF) 为我们的案例研究。我们对一个包含超过 170 个个体的数据集进行了广泛的实验,并与其他基线方法进行比较。
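
Once an encoder has been trained with a contrastive objective, patient similarity search reduces to nearest-neighbour retrieval over the learned embeddings. The sketch below shows one plausible cosine-similarity variant of that retrieval step; the function name and embedding dimensions are assumptions, and the paper's specific neighbor selection algorithms may differ.

```python
import numpy as np

def top_k_similar_patients(query_emb, patient_embs, k=5):
    """Return indices of the k patients whose embeddings are closest (by cosine
    similarity) to the query embedding. Embeddings are assumed to come from an
    encoder trained with a contrastive objective on PPG segments."""
    q = query_emb / np.linalg.norm(query_emb)
    p = patient_embs / np.linalg.norm(patient_embs, axis=1, keepdims=True)
    sims = p @ q                      # cosine similarity to every stored patient
    return np.argsort(-sims)[:k], sims

# Usage with random stand-in 128-dim embeddings for 170 patients
rng = np.random.default_rng(0)
bank = rng.normal(size=(170, 128))
idx, sims = top_k_similar_patients(rng.normal(size=128), bank, k=5)
print(idx, sims[idx])
```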

Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering

  • paper_url: http://arxiv.org/abs/2307.11986
  • repo_url: https://github.com/holipori/mimic-diff-vqa
  • paper_authors: Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, Ronald M. Summers, Yingying Zhu
  • for: 这 paper 的目的是提出一个新的胸部X射影差异视觉问答任务 (VQA),以帮助自动化医疗视觉语言模型。
  • methods: 本文提出了一种新的专家知识感知的图表示学习模型，用于解决图像差异视觉问答任务。该模型利用解剖结构先验知识、语义知识和空间知识等专家知识来构建多关系图，以表示两幅图像之间的差异。
  • results: 这 paper 收集了一个新的数据集,名为 MIMIC-Diff-VQA,包含 700,703 个问答对from 164,324 对主要和参考图像。与现有的医疗 VQA 数据集相比,这些问题更加适合临床诊断实践中的诊断- intervene-评估过程。
    Abstract To contribute to automating the medical vision-language model, we propose a novel Chest-Xray Difference Visual Question Answering (VQA) task. Given a pair of main and reference images, this task attempts to answer several questions on both diseases and, more importantly, the differences between them. This is consistent with the radiologist's diagnosis practice that compares the current image with the reference before concluding the report. We collect a new dataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs of main and reference images. Compared to existing medical VQA datasets, our questions are tailored to the Assessment-Diagnosis-Intervention-Evaluation treatment procedure used by clinical professionals. Meanwhile, we also propose a novel expert knowledge-aware graph representation learning model to address this task. The proposed baseline model leverages expert knowledge such as anatomical structure prior, semantic, and spatial knowledge to construct a multi-relationship graph, representing the image differences between two images for the image difference VQA task. The dataset and code can be found at https://github.com/Holipori/MIMIC-Diff-VQA. We believe this work would further push forward the medical vision language model.
    摘要 为了让医疗视语言模型自动化,我们提出了一个新的胸部X射影异常视问答(VQA)任务。给定一对主要和参考图像,这个任务的目标是回答一些疾病和图像之间的异常问题。这与医生诊断实践相一致,即将当前图像与参考图像进行比较,以确定报告。我们收集了一个新的数据集,即MIMIC-Diff-VQA,包含700703个问答对 from 164324对主要和参考图像。与现有的医学VQA数据集相比,我们的问题更加适合医生在诊断过程中采用的评估-诊断- interven-评估(ADIE)治疗流程。此外,我们还提出了一种基于专家知识的图像异常关系学习模型,以解决这个任务。我们的基eline模型利用专家知识,如生物结构优先知识、semantic知识和空间知识,构建多关系图,表示图像之间的异常关系。数据集和代码可以在https://github.com/Holipori/MIMIC-Diff-VQA中找到。我们认为这项工作将会进一步推动医学视语言模型的发展。

Collaborative Graph Neural Networks for Attributed Network Embedding

  • paper_url: http://arxiv.org/abs/2307.11981
  • repo_url: https://github.com/qiaoyut/conn
  • paper_authors: Qiaoyu Tan, Xin Zhang, Xiao Huang, Hao Chen, Jundong Li, Xia Hu
  • for: This paper focuses on developing a new graph neural network (GNN) architecture called COllaborative graph Neural Networks (CONN) to improve attributed network embedding.
  • methods: The proposed CONN architecture uses selective message diffusion and cross-correlation to jointly reconstruct node-to-node and node-to-attribute-category interactions, which enhances the model's capacity.
  • results: Experimental results on real-world networks show that CONN outperforms state-of-the-art embedding algorithms by a significant margin.
    Abstract Graph neural networks (GNNs) have shown prominent performance on attributed network embedding. However, existing efforts mainly focus on exploiting network structures, while the exploitation of node attributes is rather limited as they only serve as node features at the initial layer. This simple strategy impedes the potential of node attributes in augmenting node connections, leading to limited receptive field for inactive nodes with few or even no neighbors. Furthermore, the training objectives (i.e., reconstructing network structures) of most GNNs also do not include node attributes, although studies have shown that reconstructing node attributes is beneficial. Thus, it is encouraging to deeply involve node attributes in the key components of GNNs, including graph convolution operations and training objectives. However, this is a nontrivial task since an appropriate way of integration is required to maintain the merits of GNNs. To bridge the gap, in this paper, we propose COllaborative graph Neural Networks--CONN, a tailored GNN architecture for attribute network embedding. It improves model capacity by 1) selectively diffusing messages from neighboring nodes and involved attribute categories, and 2) jointly reconstructing node-to-node and node-to-attribute-category interactions via cross-correlation. Experiments on real-world networks demonstrate that CONN excels state-of-the-art embedding algorithms with a great margin.
    摘要 GRAPH Neural Networks (GNNs) 有出色表现在嵌入属性网络中。然而,现有努力主要是利用网络结构,而忽视节点特征的利用,只是将节点特征作为初始层节点特征使用。这种简单的策略限制了无活节点的潜在范围,因为它们有少量或甚至没有邻居。此外,大多数 GNN 的训练目标(即重建网络结构)并不包括节点特征,尽管研究表明重建节点特征有利。因此,深入涉及节点特征在 GNN 的关键组件中是一项挑战,需要避免降低 GNN 的优点。为了bridging这个差距,在这篇论文中,我们提出了协同图 neural Networks(CONN),一种针对嵌入属性网络的特化 GNN 架构。它提高了模型容量,通过1) 选择性地往返邻居节点和涉及属性类别中传递消息,2) 并同时重建节点到节点和节点到属性类别的交互。实验表明,CONN 在实际网络上超过了当前领先 embedding 算法的性能。

Simulation of Arbitrary Level Contrast Dose in MRI Using an Iterative Global Transformer Model

  • paper_url: http://arxiv.org/abs/2307.11980
  • repo_url: None
  • paper_authors: Dayang Wang, Srivathsa Pasumarthi, Greg Zaharchuk, Ryan Chamberlain
  • for: 这个研究旨在提出一种基于卷积神经网络的图像合成方法,以实现不同剂量水平的对照增强图像的生成,以便为MRI成像中的医学应用提供更好的依据。
  • methods: 该方法基于一种名为Gformer的变换器,其包括一种抽样基于注意力机制和一种旋转 shift模块,以捕捉不同对照增强特征。
  • results: 对比其他状态艺技术,该方法的评估结果表明其性能更高。此外,该方法还在下游任务中,如剂量减少和肿瘤分割中进行了评估,以证明其在临床应用中的价值。
    Abstract Deep learning (DL) based contrast dose reduction and elimination in MRI imaging is gaining traction, given the detrimental effects of Gadolinium-based Contrast Agents (GBCAs). These DL algorithms are however limited by the availability of high quality low dose datasets. Additionally, different types of GBCAs and pathologies require different dose levels for the DL algorithms to work reliably. In this work, we formulate a novel transformer (Gformer) based iterative modelling approach for the synthesis of images with arbitrary contrast enhancement that corresponds to different dose levels. The proposed Gformer incorporates a sub-sampling based attention mechanism and a rotational shift module that captures the various contrast related features. Quantitative evaluation indicates that the proposed model performs better than other state-of-the-art methods. We further perform quantitative evaluation on downstream tasks such as dose reduction and tumor segmentation to demonstrate the clinical utility.
    摘要 深度学习(DL)基于对比剂量减少和消除在MRI成像中得到了进一步的发展,因为Gadolinium-based Contrast Agents(GBCAs)的负面效应。但这些DL算法受到高质量低剂量数据的有效性的限制。此外,不同类型的GBCAs和疾病需要不同的剂量水平以便DL算法可靠地工作。在这种工作中,我们提出了一种基于转换器(Gformer)的迭代模型方法,用于生成具有任意对比强化的图像。我们的Gformer模型包括子抽样基于注意力机制和旋转变换模块,以捕捉不同的对比相关特征。量化评估表明,我们提出的模型在其他状态当前的方法之上表现出了更好的性能。我们进一步进行了下游任务如剂量减少和肿瘤分割,以证明临床实用性。

Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?

  • paper_url: http://arxiv.org/abs/2307.11978
  • repo_url: https://github.com/cewu/ptnl
  • paper_authors: Cheng-En Wu, Yu Tian, Haichao Yu, Heng Wang, Pedro Morgado, Yu Hen Hu, Linjie Yang
  • for: 研究了CLIP模型在干预几个示例下调整为新的分类任务中的稳定性。
  • methods: 使用了几个示例来调整CLIP模型,并发现这种方法具有很高的抗噪性。
  • results: 发现了两个关键因素导致这种方法的稳定性:1)固定的类名Token提供了模型优化过程中强制的正则化,减少了噪音样本引起的梯度; 2)从多样化和通用的网络数据中学习的强大预训练图文映射提供了图像分类的强大先验知识。此外,我们还示出了使用CLIP模型自己的噪音零例预测来调整其自己的提示,可以显著提高无监督下的预测精度。代码可以在https://github.com/CEWu/PTNL中找到。
    Abstract Vision-language models such as CLIP learn a generic text-image embedding from large-scale training data. A vision-language model can be adapted to a new classification task through few-shot prompt tuning. We find that such a prompt tuning process is highly robust to label noises. This intrigues us to study the key reasons contributing to the robustness of the prompt tuning paradigm. We conducted extensive experiments to explore this property and find the key factors are: 1) the fixed classname tokens provide a strong regularization to the optimization of the model, reducing gradients induced by the noisy samples; 2) the powerful pre-trained image-text embedding that is learned from diverse and generic web data provides strong prior knowledge for image classification. Further, we demonstrate that noisy zero-shot predictions from CLIP can be used to tune its own prompt, significantly enhancing prediction accuracy in the unsupervised setting. The code is available at https://github.com/CEWu/PTNL.
    摘要 CLIP类的视觉语言模型通过大规模训练学习一个通用的文本图像嵌入。一个视觉语言模型可以通过几个shot提问调整到新的分类任务。我们发现这种提问调整过程具有高度的鲁棒性,这使我们感到感兴趣,并且想 deeper 地研究这种特性的原因。我们进行了广泛的实验,并发现关键因素有两个:1)固定的类名token提供了模型优化的强制性,减少了噪音样本引起的梯度;2)通过多种和通用的网络数据学习的强大预训练图像文本嵌入,为图像分类提供了强大的先验知识。此外,我们示出了使用CLIP生成的噪音零shot预测来调整其自己的提问,可以大幅提高无监督下的预测精度。代码可以在https://github.com/CEWu/PTNL 中找到。
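
The prompt-tuning setup the paper analyzes keeps the class-name token embeddings fixed and optimizes only a small set of context vectors. A minimal CoOp-style sketch of that structure is given below; the module, dimensions, and initialization are illustrative, and the frozen CLIP text encoder that would consume these prompts is omitted.

```python
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    """Sketch of prompt tuning: a shared, learnable context of n_ctx vectors is
    prepended to frozen class-name token embeddings; only the context receives
    gradients, so the fixed class names act as a strong regularizer."""

    def __init__(self, classname_embs, n_ctx=16):
        super().__init__()
        dim = classname_embs.shape[-1]
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable context vectors
        self.register_buffer("cls", classname_embs)              # frozen class-name embeddings

    def forward(self):
        n_cls = self.cls.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # prompts: (n_cls, n_ctx + n_name_tokens, dim), fed to the frozen text encoder
        return torch.cat([ctx, self.cls], dim=1)

# Usage with stand-in class-name embeddings: 10 classes, 4 name tokens, 512-dim
prompts = PromptLearner(torch.randn(10, 4, 512))()
print(prompts.shape)  # torch.Size([10, 20, 512])
```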

Out-of-Distribution Optimality of Invariant Risk Minimization

  • paper_url: http://arxiv.org/abs/2307.11972
  • repo_url: None
  • paper_authors: Shoji Toyota, Kenji Fukumizu
  • for: 提高深度神经网络的泛化能力,即使在未经见过的领域下也能准确预测。
  • methods: 使用不变风险最小化（IRM）方法，解决深度神经网络继承训练数据中嵌入的虚假相关性的问题，以提高模型的泛化能力。
  • results: 提供了一种理论保证，表明在一定条件下，双层优化问题的解会最小化分布外（o.o.d.）风险。
    Abstract Deep Neural Networks often inherit spurious correlations embedded in training data and hence may fail to generalize to unseen domains, which have different distributions from the domain to provide training data. M. Arjovsky et al. (2019) introduced the concept out-of-distribution (o.o.d.) risk, which is the maximum risk among all domains, and formulated the issue caused by spurious correlations as a minimization problem of the o.o.d. risk. Invariant Risk Minimization (IRM) is considered to be a promising approach to minimize the o.o.d. risk: IRM estimates a minimum of the o.o.d. risk by solving a bi-level optimization problem. While IRM has attracted considerable attention with empirical success, it comes with few theoretical guarantees. Especially, a solid theoretical guarantee that the bi-level optimization problem gives the minimum of the o.o.d. risk has not yet been established. Aiming at providing a theoretical justification for IRM, this paper rigorously proves that a solution to the bi-level optimization problem minimizes the o.o.d. risk under certain conditions. The result also provides sufficient conditions on distributions providing training data and on a dimension of feature space for the bi-leveled optimization problem to minimize the o.o.d. risk.
    摘要 深度神经网络经常会继承训练数据中嵌入的假 correlations,从而导致在未看到的领域中失败,这些领域的分布与训练数据的分布不同。M. Arjovsky等人(2019)引入了 OUT-OF-DISTRIBUTION(o.o.d)风险,它是所有领域的最大风险,并将嵌入在训练数据中的假 correlations 问题定义为一个 minimization 问题。不变risk Minimization (IRM) 被视为一种有前景的方法来减少 o.o.d. 风险:IRM 通过解决一个二级优化问题来估算 o.o.d. 风险的最小值。虽然 IRM 在实际中得到了广泛的关注并取得了一些成功,但它具有少量的理论保证。特别是,一个坚实的理论保证,即二级优化问题的解决方案实际上是 o.o.d. 风险的最小值,尚未被成功地建立。本文通过坚实的理论证明,解决二级优化问题可以减少 o.o.d. 风险,并提供了一些有关训练数据的分布和特征空间维度的充分条件。
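
For readers unfamiliar with how the bi-level IRM objective is usually relaxed in practice, the sketch below implements the standard IRMv1 penalty of Arjovsky et al. (2019): the squared gradient of each environment's risk with respect to a dummy scalar classifier scale. This is background for the setting the paper analyzes, not the paper's own contribution, and the variable names are illustrative.

```python
import torch

def irmv1_penalty(logits, labels):
    """IRMv1 penalty: squared gradient of the environment risk with respect to a
    dummy scale multiplier fixed at 1.0 (labels are float targets in {0, 1})."""
    scale = torch.ones(1, requires_grad=True, device=logits.device)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits * scale, labels)
    (grad,) = torch.autograd.grad(loss, scale, create_graph=True)
    return (grad ** 2).sum()

def irm_objective(per_env_batches, model, penalty_weight=1e2):
    """Empirical risk plus the invariance penalty, averaged over environments."""
    risk, penalty = 0.0, 0.0
    for x, y in per_env_batches:
        logits = model(x).squeeze(-1)
        risk = risk + torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
        penalty = penalty + irmv1_penalty(logits, y)
    n = len(per_env_batches)
    return risk / n + penalty_weight * penalty / n

# Usage with a dummy linear model and two synthetic environments
model = torch.nn.Linear(5, 1)
envs = [(torch.randn(32, 5), torch.randint(0, 2, (32,)).float()) for _ in range(2)]
print(irm_objective(envs, model))
```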

DHC: Dual-debiased Heterogeneous Co-training Framework for Class-imbalanced Semi-supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.11960
  • repo_url: https://github.com/xmed-lab/dhc
  • paper_authors: Haonan Wang, Xiaomeng Li
  • for: 这个研究的目的是提出一个基于 semi-supervised learning (SSL) 的三维医疗影像分类框架,以解决对于医疗影像分类的专家需求和时间耗费问题。
  • methods: 这个框架使用了一个新的 Dual-debiased Heterogeneous Co-training (DHC) 方法,包括两种损失衡量策略:Distribution-aware Debiased Weighting (DistDW) 和 Difficulty-aware Debiased Weighting (DiffDW),这些策略可以动态地使用 Pseudo 标签来导引模型解决数据和学习偏见。
  • results: 实验结果显示,提出的方法可以将 pseudo 标签用于偏见调整和纠正阶层分类问题,并且与现有的 SSL 方法比较,显示出我们的方法在更加具体的 SSL 设定下表现更好。代码和模型可以在 GitHub 上找到:https://github.com/xmed-lab/DHC.
    Abstract The volume-wise labeling of 3D medical images is expertise-demanded and time-consuming; hence semi-supervised learning (SSL) is highly desirable for training with limited labeled data. Imbalanced class distribution is a severe problem that bottlenecks the real-world application of these methods but was not addressed much. Aiming to solve this issue, we present a novel Dual-debiased Heterogeneous Co-training (DHC) framework for semi-supervised 3D medical image segmentation. Specifically, we propose two loss weighting strategies, namely Distribution-aware Debiased Weighting (DistDW) and Difficulty-aware Debiased Weighting (DiffDW), which leverage the pseudo labels dynamically to guide the model to solve data and learning biases. The framework improves significantly by co-training these two diverse and accurate sub-models. We also introduce more representative benchmarks for class-imbalanced semi-supervised medical image segmentation, which can fully demonstrate the efficacy of the class-imbalance designs. Experiments show that our proposed framework brings significant improvements by using pseudo labels for debiasing and alleviating the class imbalance problem. More importantly, our method outperforms the state-of-the-art SSL methods, demonstrating the potential of our framework for the more challenging SSL setting. Code and models are available at: https://github.com/xmed-lab/DHC.
    摘要 医学三维图像的体积级标注是专业技术和时间consuming的;因此使用限制标注数据的 semi-supervised learning (SSL) 是非常有优点的。然而,实际应用中存在严重的类别分布不均问题,这个问题未得到充分关注。为解决这个问题,我们提出了一种新的双向偏置共训(DHC)框架,用于 semi-supervised 三维医学图像分割。我们提出了两种损失补偿策略,即 Distribution-aware Debiased Weighting(DistDW)和 Difficulty-aware Debiased Weighting(DiffDW),这两种策略可以动态使用 pseudo labels 来引导模型解决数据和学习偏见。我们的框架在合作这两个多样和准确的子模型时得到了显著改进。我们还提出了更加代表性的 semi-supervised 医学图像分割 benchmark,可以全面展示我们的类别偏见设计的效果。实验表明,我们的提议的框架可以通过使用 pseudo labels 进行偏见修正和缓解类别偏见问题,并且超越了当前状态的 SSL 方法,表明了我们的框架在更加挑战的 SSL 设定下的潜在力量。代码和模型可以在 GitHub 上找到:https://github.com/xmed-lab/DHC。
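
One ingredient the abstract describes, distribution-aware debiased weighting, can be illustrated by deriving per-class loss weights from pseudo-label frequencies so that rare classes are up-weighted. The sketch below is only a guess at the flavor of such a rule; the paper's actual DistDW and DiffDW formulas are not reproduced here, and all names are illustrative.

```python
import numpy as np

def distribution_aware_weights(pseudo_labels, num_classes, smooth=1e-8):
    """Sketch: weight each class by the inverse of its pseudo-label frequency,
    normalized so the mean weight is 1; rare classes get larger loss weights."""
    counts = np.bincount(pseudo_labels.ravel(), minlength=num_classes).astype(float)
    freq = counts / (counts.sum() + smooth)
    w = 1.0 / (freq + smooth)
    return w / w.mean()

# Usage: a voxel-wise pseudo-label map for 4 heavily imbalanced classes
pl = np.random.choice(4, size=(2, 64, 64, 64), p=[0.85, 0.10, 0.04, 0.01])
print(distribution_aware_weights(pl, num_classes=4))
```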

Multi-representations Space Separation based Graph-level Anomaly-aware Detection

  • paper_url: http://arxiv.org/abs/2307.12994
  • repo_url: None
  • paper_authors: Fu Lin, Haonan Gong, Mingkang Li, Zitong Wang, Yue Zhang, Xuexiong Luo
  • for: 本研究的目标是检测图DataSet中的异常图。
  • methods: 我们提出了一种基于多个表示空间分离的图级异常检测框架。为了考虑不同类型的异常图数据的重要性,我们设计了一个异常感知模块来学习特定的节点级和图级异常重要性。此外,我们学习了严格地分离正常和异常图表示空间,通过四种不同的权重图表示对比彼此。
  • results: 我们对基eline方法进行了广泛的评估,并通过十个公共图数据集来评估我们的方法。结果表明,我们的方法具有效果。
    Abstract Graph structure patterns are widely used to model different area data recently. How to detect anomalous graph information on these graph data has become a popular research problem. The objective of this research is centered on the particular issue that how to detect abnormal graphs within a graph set. The previous works have observed that abnormal graphs mainly show node-level and graph-level anomalies, but these methods equally treat two anomaly forms above in the evaluation of abnormal graphs, which is contrary to the fact that different types of abnormal graph data have different degrees in terms of node-level and graph-level anomalies. Furthermore, abnormal graphs that have subtle differences from normal graphs are easily escaped detection by the existing methods. Thus, we propose a multi-representations space separation based graph-level anomaly-aware detection framework in this paper. To consider the different importance of node-level and graph-level anomalies, we design an anomaly-aware module to learn the specific weight between them in the abnormal graph evaluation process. In addition, we learn strictly separate normal and abnormal graph representation spaces by four types of weighted graph representations against each other including anchor normal graphs, anchor abnormal graphs, training normal graphs, and training abnormal graphs. Based on the distance error between the graph representations of the test graph and both normal and abnormal graph representation spaces, we can accurately determine whether the test graph is anomalous. Our approach has been extensively evaluated against baseline methods using ten public graph datasets, and the results demonstrate its effectiveness.
    摘要 GRAPH结构模式在近期内广泛应用于不同领域的数据模型中。检测图数据中异常Graph信息已成为一个流行的研究问题。我们的研究 objective 是 centered 在特定的问题上,即如何在图数据中检测异常图。前一些研究发现,异常图主要表现为节点级别和图级别异常,但这些方法很容易对两种异常形态进行等效的评估,这与实际情况不符。此外,一些异常图具有轻微异常特征,容易被现有方法排除。因此,我们提出了一个基于多个 Representation space 的图级别异常检测框架。为了考虑不同的节点级别和图级别异常的重要性,我们设计了一个异常检测模块,以学习特定的节点级别和图级别异常之间的权重。此外,我们通过四种不同类型的权重化图表示对之间的竞争学习,以学习纯正的正常图表示空间和异常图表示空间。通过测试图表示与正常图表示空间和异常图表示空间之间的距离错误来准确判断测试图是否异常。我们的方法在比基线方法进行evaluate 后得到了显著的效果。

High-performance real-world optical computing trained by in situ model-free optimization

  • paper_url: http://arxiv.org/abs/2307.11957
  • repo_url: None
  • paper_authors: Guangyuan Zhao, Xin Shu, Renjie Zhou
  • for: 提高光学计算系统的高速和低能耗数据处理能力,并解决 simulation-to-reality gap。
  • methods: 使用 score gradient estimation 算法,对光学系统进行模型独立优化,不需要 computation-heavy 和偏见的系统模拟。
  • results: 在 MNIST 和 FMNIST 数据集上实现了高精度分类,并在无图像和高速细胞分析中展示了潜在的应用前景。
    Abstract Optical computing systems can provide high-speed and low-energy data processing but face deficiencies in computationally demanding training and simulation-to-reality gap. We propose a model-free solution for lightweight in situ optimization of optical computing systems based on the score gradient estimation algorithm. This approach treats the system as a black box and back-propagates loss directly to the optical weights' probabilistic distributions, hence circumventing the need for computation-heavy and biased system simulation. We demonstrate a superior classification accuracy on the MNIST and FMNIST datasets through experiments on a single-layer diffractive optical computing system. Furthermore, we show its potential for image-free and high-speed cell analysis. The inherent simplicity of our proposed method, combined with its low demand for computational resources, expedites the transition of optical computing from laboratory demonstrations to real-world applications.
    摘要 光学计算系统可以提供高速和低能耗数据处理,但面临 computationally demanding 训练和实际-模拟之间的差距。我们提出了一种模型自由的解决方案,基于分布式权重的排名预测算法,用于优化光学计算系统。这种方法将系统视为黑盒子,直接从损失函数反射到光学权重的概率分布,因此不需要计算负担重和偏见的系统模拟。我们通过对单层散射光学计算系统进行实验,在 MNIST 和 FMNIST 数据集上达到了更高的分类精度。此外,我们还示出了无图像和高速细胞分析的潜在可能性。我们的提议的简单性和计算资源的低需求,使得光学计算从实验室示范转移到实际应用变得更加容易。
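
The model-free optimization the abstract refers to can be illustrated with a score-function (log-derivative) gradient estimator: the physical system is queried as a black box, and the parameters of a Gaussian over the optical weights are updated from sampled losses. The sketch below uses a toy quadratic loss as a stand-in for real hardware; hyperparameters and function names are illustrative.

```python
import numpy as np

def score_gradient_step(mu, sigma, black_box_loss, n_samples=64, lr=0.01):
    """One model-free update of the Gaussian over optical weights. The gradient of
    E[loss] w.r.t. mu is estimated with the score-function (log-derivative) trick,
    so no differentiable simulator of the optical system is needed."""
    eps = np.random.randn(n_samples, mu.size)
    samples = mu + sigma * eps                        # candidate weight settings
    losses = np.array([black_box_loss(w) for w in samples])
    baseline = losses.mean()                          # variance-reduction baseline
    # d/d mu of log N(w; mu, sigma^2) is (w - mu) / sigma^2 = eps / sigma
    grad_mu = ((losses - baseline)[:, None] * eps / sigma).mean(axis=0)
    return mu - lr * grad_mu

# Usage with a toy "hardware" loss (quadratic bowl standing in for the real system)
mu = np.zeros(8)
for _ in range(200):
    mu = score_gradient_step(mu, sigma=0.1, black_box_loss=lambda w: np.sum((w - 1.0) ** 2))
print(np.round(mu, 2))  # approaches the optimum at 1.0
```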

Pūioio: On-device Real-Time Smartphone-Based Automated Exercise Repetition Counting System

  • paper_url: http://arxiv.org/abs/2308.02420
  • repo_url: None
  • paper_authors: Adam Sinclair, Kayla Kautai, Seyed Reza Shahamiri
  • for: 这个研究的目的是为了开发一个可靠且低成本的手机应用程序,可以在实时进行运动重复计数。
  • methods: 这个研究使用了深度学习技术,搭配手机摄像头进行运动重复计数。系统包括五个组件:(1)姿势估计、(2)阈值分类、(3)流动性、(4)状态机器、(5)计数器。
  • results: 这个系统在实际测试中精度高达98.89%,并且在预先录影的数据集中也达到98.85%的准确性。这使得这个系统成为一个有效、低成本且便捷的选择,不需要特殊的仪器或网络连接。
    Abstract Automated exercise repetition counting has applications across the physical fitness realm, from personal health to rehabilitation. Motivated by the ubiquity of mobile phones and the benefits of tracking physical activity, this study explored the feasibility of counting exercise repetitions in real-time, using only on-device inference, on smartphones. In this work, after providing an extensive overview of the state-of-the-art automatic exercise repetition counting methods, we introduce a deep learning based exercise repetition counting system for smartphones consisting of five components: (1) Pose estimation, (2) Thresholding, (3) Optical flow, (4) State machine, and (5) Counter. The system is then implemented via a cross-platform mobile application named P\=uioio that uses only the smartphone camera to track repetitions in real time for three standard exercises: Squats, Push-ups, and Pull-ups. The proposed system was evaluated via a dataset of pre-recorded videos of individuals exercising as well as testing by subjects exercising in real time. Evaluation results indicated the system was 98.89% accurate in real-world tests and up to 98.85% when evaluated via the pre-recorded dataset. This makes it an effective, low-cost, and convenient alternative to existing solutions since the proposed system has minimal hardware requirements without requiring any wearable or specific sensors or network connectivity.
    摘要 自动化的运动重复计数有各种应用在身体健身和重建领域,从个人健康到rehabilitation。为了利用移动电话的普遍性和跟踪物理活动的利点,这项研究探索了使用移动电话上的只有设备推理来实时计数运动重复的可能性。在这项研究中,我们首先提供了现有自动运动重复计数方法的广泛概述,然后引入了一种基于深度学习的运动重复计数系统,该系统由五个组成部分:(1)姿势估计,(2)阈值分割,(3)Optical flow,(4)状态机和(5)计数器。这个系统然后通过一个跨平台移动应用程序 named P\=uioio 实现,该应用程序使用了移动电话摄像头来实时跟踪运动重复,并对三种标准运动进行测试:蹲squats,推push-ups和抓pull-ups。我们对这个系统进行了一系列测试和评估,测试结果表明该系统在实际测试中的准确率达98.89%,并且在预录视频数据集上的评估结果为98.85%。这使得该系统成为一个有效、低成本、方便的替代方案,因为它没有特殊的硬件需求,也没有需要佩戴式设备或特殊的传感器或网络连接。
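
Components (3)-(5) of the pipeline (thresholding, state machine, counter) can be sketched as a small hysteresis counter over a pose-derived signal such as a joint angle. The thresholds and the example angle trace below are made-up values for illustration, not the app's tuned parameters.

```python
class RepCounter:
    """Sketch of thresholding + state machine + counter: a repetition is counted
    each time the signal (e.g., a knee angle from pose estimation) drops below the
    'down' threshold and then rises back above the 'up' threshold (hysteresis)."""

    def __init__(self, down_thresh=90.0, up_thresh=160.0):
        self.down_thresh, self.up_thresh = down_thresh, up_thresh
        self.state, self.count = "up", 0

    def update(self, angle):
        if self.state == "up" and angle < self.down_thresh:
            self.state = "down"                               # bottom of the squat reached
        elif self.state == "down" and angle > self.up_thresh:
            self.state, self.count = "up", self.count + 1     # full repetition completed
        return self.count

# Usage on a synthetic knee-angle trace covering two squats
counter = RepCounter()
for angle in [170, 150, 100, 80, 95, 150, 170, 165, 85, 120, 170]:
    counter.update(angle)
print(counter.count)  # 2
```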

Implicit Interpretation of Importance Weight Aware Updates

  • paper_url: http://arxiv.org/abs/2307.11955
  • repo_url: None
  • paper_authors: Keyi Chen, Francesco Orabona
  • for: 这篇论文主要是为了解释importance weight aware(IWA)更新法的性能优劣。
  • methods: 论文使用了一种新的框架,即通用隐式跟踪领导者(FTRL),来分析通用隐式更新法。
  • results: 论文表明,IWA更新法在在线学习设置中具有更好的 regret upper bound,比plain gradient更新法更好。
    Abstract Due to its speed and simplicity, subgradient descent is one of the most used optimization algorithms in convex machine learning algorithms. However, tuning its learning rate is probably its most severe bottleneck to achieve consistent good performance. A common way to reduce the dependency on the learning rate is to use implicit/proximal updates. One such variant is the Importance Weight Aware (IWA) updates, which consist of infinitely many infinitesimal updates on each loss function. However, IWA updates' empirical success is not completely explained by their theory. In this paper, we show for the first time that IWA updates have a strictly better regret upper bound than plain gradient updates in the online learning setting. Our analysis is based on the new framework, generalized implicit Follow-the-Regularized-Leader (FTRL) (Chen and Orabona, 2023), to analyze generalized implicit updates using a dual formulation. In particular, our results imply that IWA updates can be considered as approximate implicit/proximal updates.
    摘要 由于其速度和简洁性,剪梯下降是机器学习中最常用的优化算法之一。然而,调整学习率是它最严重的瓶颈,以实现一致的好表现。一种常见的方法是使用隐式/辅助更新。一种such variant是重要性评估(IWA)更新,它们包括无限多个infinitesimal更新。然而, IWA更新的实际成功并不完全由其理论来解释。在这篇论文中,我们展示了IWA更新在在线学习 Setting中具有更好的 regret upper bound,比普通的梯度更新更好。我们的分析基于新的框架,通用隐式 Follow-the-Regularized-Leader(FTRL)(Chen和Orabona,2023),用于分析通用隐式更新。特别是,我们的结果表明,IWA更新可以被视为approximate隐式/辅助更新。
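
To make the implicit/proximal flavour of such updates concrete, the sketch below works out the closed-form implicit update for an importance-weighted squared loss. This specific formula is a standard proximal-step calculation for that loss only, not the paper's general analysis, and the variable names are illustrative.

```python
import numpy as np

def implicit_weighted_update(w, x, y, h, eta):
    """Closed-form implicit (proximal) update for importance-weighted squared loss:
    w_plus = argmin_v (h/2) * (x @ v - y)**2 + ||v - w||**2 / (2 * eta).
    Setting the gradient to zero and solving along the direction x gives the step
    below; plain gradient descent would instead use w - eta * h * (x @ w - y) * x."""
    residual = x @ w - y
    step = eta * h * residual / (1.0 + eta * h * (x @ x))
    return w - step * x

# Usage: with a huge importance weight the implicit step stays bounded,
# while the explicit gradient step would overshoot wildly.
w, x, y = np.zeros(3), np.array([1.0, 2.0, 0.5]), 1.0
print(implicit_weighted_update(w, x, y, h=1e6, eta=0.1))
```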

On-Robot Bayesian Reinforcement Learning for POMDPs

  • paper_url: http://arxiv.org/abs/2307.11954
  • repo_url: None
  • paper_authors: Hai Nguyen, Sammie Katt, Yuchen Xiao, Christopher Amato
  • for: 这篇论文的目的是提出一种专门适用于物理系统的 bayesian 强化学习方法,以解决 robot 学习中的数据成本问题。
  • methods: 该方法使用了一种特殊的 factored 表示方法,以捕捉专家知识,并使用 Monte-Carlo tree search 和 particle filtering 来解决 posterior 的推理问题。
  • results: 在两个人机交互任务中,该方法可以在几个实际世界回合后达到 near-optimal 性能,并且可以利用 typical low-level robot simulators 和处理未知环境的不确定性。
    Abstract Robot learning is often difficult due to the expense of gathering data. The need for large amounts of data can, and should, be tackled with effective algorithms and leveraging expert information on robot dynamics. Bayesian reinforcement learning (BRL), thanks to its sample efficiency and ability to exploit prior knowledge, is uniquely positioned as such a solution method. Unfortunately, the application of BRL has been limited due to the difficulties of representing expert knowledge as well as solving the subsequent inference problem. This paper advances BRL for robotics by proposing a specialized framework for physical systems. In particular, we capture this knowledge in a factored representation, then demonstrate the posterior factorizes in a similar shape, and ultimately formalize the model in a Bayesian framework. We then introduce a sample-based online solution method, based on Monte-Carlo tree search and particle filtering, specialized to solve the resulting model. This approach can, for example, utilize typical low-level robot simulators and handle uncertainty over unknown dynamics of the environment. We empirically demonstrate its efficiency by performing on-robot learning in two human-robot interaction tasks with uncertainty about human behavior, achieving near-optimal performance after only a handful of real-world episodes. A video of learned policies is at https://youtu.be/H9xp60ngOes.
    摘要 机器人学习通常困难由于数据收集成本高昂。为了解决这问题,我们可以采用有效的算法和利用机器人动力学专家的知识。 bayesian reinforcement learning(BRL)因其样本效率高和可以利用先验知识而成为一种适用的解决方案。然而,BRL在应用中受到了专家知识表示和推理问题的限制。本文提出了一种特殊的框架,用于physical systems。我们通过 capture this knowledge in a factored representation,然后证明 posterior factorizes in a similar shape,并 ultimately formalize the model in a Bayesian framework。然后,我们引入了一种基于Monte-Carlo tree search和particle filtering的在线解决方法,专门用于解决这个模型。这种方法可以利用 typical low-level robot simulators and handle uncertainty over unknown dynamics of the environment。我们通过在两个人机交互任务中进行实验,demonstrate its efficiency,只需要几个真实世界回合就能够 дости得 near-optimal performance。视频 display learned policies at https://youtu.be/H9xp60ngOes.

HIQL: Offline Goal-Conditioned RL with Latent States as Actions

  • paper_url: http://arxiv.org/abs/2307.11949
  • repo_url: https://github.com/seohongpark/hiql
  • paper_authors: Seohong Park, Dibya Ghosh, Benjamin Eysenbach, Sergey Levine
  • for: 这个论文旨在提出一种基于非监督学习的目标conditioned reinforcement learning算法,可以从无标签数据中学习。
  • methods: 该算法使用一个action-free value function,通过层次分解来学习两个策略:一个高级策略使得状态被看作动作,预测子目标,以及一个低级策略预测达到子目标的行动。
  • results: 通过分析和实践示例, authors表明该层次分解使得其方法具有对噪音估计值函数的 Robustness。然后,通过应用该方法于offline目标 дости达标准别件,authors证明其方法可以解决远程目标任务,可以扩展到高维图像观察数据,并可以充分利用无动作数据。
    Abstract Unsupervised pre-training has recently become the bedrock for computer vision and natural language processing. In reinforcement learning (RL), goal-conditioned RL can potentially provide an analogous self-supervised approach for making use of large quantities of unlabeled (reward-free) data. However, building effective algorithms for goal-conditioned RL that can learn directly from diverse offline data is challenging, because it is hard to accurately estimate the exact value function for faraway goals. Nonetheless, goal-reaching problems exhibit structure, such that reaching distant goals entails first passing through closer subgoals. This structure can be very useful, as assessing the quality of actions for nearby goals is typically easier than for more distant goals. Based on this idea, we propose a hierarchical algorithm for goal-conditioned RL from offline data. Using one action-free value function, we learn two policies that allow us to exploit this structure: a high-level policy that treats states as actions and predicts (a latent representation of) a subgoal and a low-level policy that predicts the action for reaching this subgoal. Through analysis and didactic examples, we show how this hierarchical decomposition makes our method robust to noise in the estimated value function. We then apply our method to offline goal-reaching benchmarks, showing that our method can solve long-horizon tasks that stymie prior methods, can scale to high-dimensional image observations, and can readily make use of action-free data. Our code is available at https://seohong.me/projects/hiql/
    摘要 现代计算机视觉和自然语言处理领域中,无监督预训练已经成为核心。在奖励学习(RL)领域,目标受控RL可能提供一种类似的自我监督方法,使用大量无奖数据进行学习。然而,建立有效的目标受控RL算法,直接从多样化的离线数据中学习,是一项挑战。这是因为,难以准确地估计远距离目标的价值函数。然而,目标达成问题具有结构,即达到远距离目标需要先通过更近的亚目标。这种结构可以很有用,因为评估近距离目标的动作质量通常比远距离目标更容易。基于这个想法,我们提出了一种层次算法 для目标受控RL。使用一个没有动作的价值函数,我们学习了两个政策:一个高级政策,将状态看作动作,预测(一个隐藏表示)亚目标,以及一个低级政策,预测用于达到亚目标的动作。通过分析和示例,我们证明了这种层次 decomposition 使我们的方法具有鲁棒性,可以抵抗估计值函数的噪声。然后,我们将我们的方法应用于离线目标达成标准,并证明了我们的方法可以解决长期任务,可以扩展到高维图像观察,并可以轻松地使用无动作数据。我们的代码可以在 上获取。
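
The hierarchical decomposition described in the abstract can be sketched as two small networks: a high-level policy that maps (state, goal) to a latent subgoal representation, and a low-level policy that maps (state, subgoal latent) to an action. The sketch below shows only this structure; how both policies are extracted from a single action-free value function is omitted, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class HierarchicalGoalPolicy(nn.Module):
    """Sketch: the high-level policy predicts a latent subgoal representation from
    (state, final goal); the low-level policy predicts the action for reaching it."""

    def __init__(self, state_dim, goal_dim, latent_dim, action_dim, hidden=256):
        super().__init__()
        self.high = nn.Sequential(nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, latent_dim))
        self.low = nn.Sequential(nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, state, goal):
        subgoal_latent = self.high(torch.cat([state, goal], dim=-1))
        action = self.low(torch.cat([state, subgoal_latent], dim=-1))
        return action, subgoal_latent

# Usage: a batch of 4 states and goals in a 29-dim observation space
pi = HierarchicalGoalPolicy(state_dim=29, goal_dim=29, latent_dim=16, action_dim=8)
a, z = pi(torch.randn(4, 29), torch.randn(4, 29))
print(a.shape, z.shape)  # torch.Size([4, 8]) torch.Size([4, 16])
```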

The instabilities of large learning rate training: a loss landscape view

  • paper_url: http://arxiv.org/abs/2307.11948
  • repo_url: None
  • paper_authors: Lawrence Wang, Stephen Roberts
  • for: 研究深度学习网络训练中大学习率的稳定性,特别是在大学习率下的训练过程中存在潜在的不稳定性。
  • methods: 通过分析梯度下降的矩阵Hessian matrix来研究深度学习网络训练过程中的不稳定性。
  • results: 发现在大学习率下的训练过程中出现了“景观平整”和“景观转移”这两种现象，它们与训练过程中的不稳定性密切相关。
    Abstract Modern neural networks are undeniably successful. Numerous works study how the curvature of loss landscapes can affect the quality of solutions. In this work we study the loss landscape by considering the Hessian matrix during network training with large learning rates - an attractive regime that is (in)famously unstable. We characterise the instabilities of gradient descent, and we observe the striking phenomena of \textit{landscape flattening} and \textit{landscape shift}, both of which are intimately connected to the instabilities of training.
    摘要 现代神经网络确实非常成功。许多研究表明损失函数的凹凸度可以影响解决方案的质量。在这篇文章中,我们研究训练神经网络时的损失函数地形,包括在大学习率下进行训练的情况。我们描述梯度下降的不稳定性,并观察到了各种phenomena,如“地形平整”和“地形转移”,这些现象与训练过程中的不稳定性密切相关。
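
Studies of this kind typically track the loss landscape's leading curvature. As background, the sketch below estimates the top Hessian eigenvalue with power iteration over Hessian-vector products from autograd; it is a generic diagnostic, not the paper's experimental code.

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params` with power
    iteration; each step needs only a Hessian-vector product from autograd."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat_grad)
    v = v / v.norm()
    eig = torch.tensor(0.0)
    for _ in range(iters):
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        eig = v @ hv                      # Rayleigh quotient with the current direction
        v = hv / (hv.norm() + 1e-12)
    return eig.item()

# Usage on a tiny quadratic model where the true top eigenvalue is known
w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = 3.0 * w[0] ** 2 + 0.5 * w[1] ** 2   # Hessian = diag(6, 1), top eigenvalue 6
print(top_hessian_eigenvalue(loss, [w]))    # ~6.0
```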

Collaboratively Learning Linear Models with Structured Missing Data

  • paper_url: http://arxiv.org/abs/2307.11947
  • repo_url: None
  • paper_authors: Chen Cheng, Gary Cheng, John Duchi
  • for: 这篇论文目的是解决多个代理(agent)协同学习最小二乘估计问题。每个代理都观察到不同的特征集(e.g., 感知器的分辨率不同)。我们想要协调代理,以生成每个代理最佳估计器。
  • methods: 我们提出了一种分布式、半监督的算法Collab,包括三步:本地训练、聚合和分布。我们的过程不需要交换标注数据,因此具有通信效率和在标注数据不可 accessible 的场景中使用。
  • results: 我们的方法在真实数据和合成数据上进行了测试，并达到了近似渐近局部极小极大（minimax）最优，即在不交换标注数据的情况下，其性能与允许交换标注数据的方法（如插补方法）相当。
    Abstract We study the problem of collaboratively learning least squares estimates for $m$ agents. Each agent observes a different subset of the features$\unicode{x2013}$e.g., containing data collected from sensors of varying resolution. Our goal is to determine how to coordinate the agents in order to produce the best estimator for each agent. We propose a distributed, semi-supervised algorithm Collab, consisting of three steps: local training, aggregation, and distribution. Our procedure does not require communicating the labeled data, making it communication efficient and useful in settings where the labeled data is inaccessible. Despite this handicap, our procedure is nearly asymptotically local minimax optimal$\unicode{x2013}$even among estimators allowed to communicate the labeled data such as imputation methods. We test our method on real and synthetic data.
    摘要 我们研究多 Agent 协同学习最小二乘估计问题。每个 Agent 观察不同的特征集合$\unicode{x2013}$例如,各种感知器的分辨率不同。我们的目标是在 Agent 之间协调,以生成每个 Agent 最佳估计器。我们提出了分布式、半监督的算法 Collab,包括三个步骤:本地训练、聚合和分布。我们的过程不需要通信标注数据,因此具有通信效率和在标注数据不可 accessible 的场景中使用。尽管这些限制,我们的过程仍然几乎极限本地最小最优$\unicode{x2013}$甚至与可以通信标注数据的估计器相比。我们在真实数据和 sintetic 数据上测试了我们的方法。

Batch Clipping and Adaptive Layerwise Clipping for Differential Private Stochastic Gradient Descent

  • paper_url: http://arxiv.org/abs/2307.11939
  • repo_url: None
  • paper_authors: Toan N. Nguyen, Phuong Ha Nguyen, Lam M. Nguyen, Marten Van Dijk
  • for: 本文旨在提出一种新的隐私保护技术，以保证差分隐私（differential privacy）的实现。
  • methods: 本文使用 Individual Clipping（IC）和 Batch Clipping（BC）两种方法来实现隐私保护，并引入了 Adaptive Layerwise Clipping（ALC）方法来适应不同层的敏感度。
  • results: 实验表明，使用 BC 和 ALC 可以使 Differential Private Stochastic Gradient Descent（DPSGD）收敛，而使用 IC 和 ALC 则不能收敛。
    Abstract Each round in Differential Private Stochastic Gradient Descent (DPSGD) transmits a sum of clipped gradients obfuscated with Gaussian noise to a central server which uses this to update a global model which often represents a deep neural network. Since the clipped gradients are computed separately, which we call Individual Clipping (IC), deep neural networks like resnet-18 cannot use Batch Normalization Layers (BNL) which is a crucial component in deep neural networks for achieving a high accuracy. To utilize BNL, we introduce Batch Clipping (BC) where, instead of clipping single gradients as in the orginal DPSGD, we average and clip batches of gradients. Moreover, the model entries of different layers have different sensitivities to the added Gaussian noise. Therefore, Adaptive Layerwise Clipping methods (ALC), where each layer has its own adaptively finetuned clipping constant, have been introduced and studied, but so far without rigorous DP proofs. In this paper, we propose {\em a new ALC and provide rigorous DP proofs for both BC and ALC}. Experiments show that our modified DPSGD with BC and ALC for CIFAR-$10$ with resnet-$18$ converges while DPSGD with IC and ALC does not.
    摘要 每个轮次在差分私人梯度下降(DPSGD)中传输一个混合的梯度,其中包含 Gaussian 噪声,并将其发送到中央服务器,以更新一个全球模型,通常是深度神经网络。由于混合的梯度在不同层中计算,因此无法使用批处理正则化层(BNL),这是深度神经网络实现高精度的一个关键组件。为了使用 BNL,我们引入批量混合(BC),其中,而不是归一化单个梯度,我们平均混合批处理的梯度。此外,不同层的模型元素对添加的 Gaussian 噪声具有不同的感度。因此,我们引入自适应层wise混合方法(ALC),其中每个层有自己的自适应调整的混合常量。在本文中,我们提出了一种新的 ALC,并为 BC 和 ALC 提供了严格的 DP 证明。实验表明,我们修改了 DPSGD 的 BC 和 ALC,可以在 CIFAR-10 上使用 resnet-18 进行训练,而 DPSGD 的 IC 和 ALC 不能。
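
A minimal sketch of the two clipping modes discussed in the abstract is given below: Batch Clipping averages the batch gradients before clipping once, whereas Individual Clipping (standard DPSGD) clips each per-example gradient separately. The noise calibration and function signatures are illustrative and do not include the paper's full privacy accounting.

```python
import numpy as np

def dp_update_batch_clipping(per_example_grads, clip_c, sigma, lr, w):
    """Batch Clipping (BC): average the gradients of a batch first, clip the single
    averaged gradient to norm <= clip_c, then add Gaussian noise scaled to clip_c.
    Only one clipped quantity per batch is released, which is what allows layers
    that couple examples (such as batch normalization) to be used."""
    g = per_example_grads.mean(axis=0)
    g = g * min(1.0, clip_c / (np.linalg.norm(g) + 1e-12))        # clip the batch average
    g = g + np.random.normal(0.0, sigma * clip_c, size=g.shape)   # Gaussian mechanism
    return w - lr * g

def dp_update_individual_clipping(per_example_grads, clip_c, sigma, lr, w):
    """Individual Clipping (IC, as in standard DPSGD): clip each example's gradient
    separately, sum, add noise, then average."""
    clipped = [g * min(1.0, clip_c / (np.linalg.norm(g) + 1e-12)) for g in per_example_grads]
    g = np.sum(clipped, axis=0) + np.random.normal(0.0, sigma * clip_c, size=w.shape)
    return w - lr * g / len(per_example_grads)

# Usage on dummy per-example gradients for a 10-parameter model
grads = np.random.randn(32, 10)
w = dp_update_batch_clipping(grads, clip_c=1.0, sigma=1.0, lr=0.1, w=np.zeros(10))
print(w)
```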

Mercer Large-Scale Kernel Machines from Ridge Function Perspective

  • paper_url: http://arxiv.org/abs/2307.11925
  • repo_url: None
  • paper_authors: Karol Dziedziul, Sergey Kryzhevich
  • for: 本文从 ridge 函数的角度出发，回顾 Lin 和 Pinkus 的结果，研究 Mercer 大规模核机器（kernel machines）。
  • methods: 本文从逼近论（Approximation Theory）的角度考察 Rahimi 和 Recht（2008）的 Random features for large-scale kernel machines 一文的主要定理，研究哪些核函数可以用以 $x$ 和 $y$ 为自变量的余弦函数乘积之和来近似。
  • results: 本文指出了这种方法所面临的障碍，其结果可能在深度学习中有多种应用，特别是与图像处理相关的问题。
    Abstract To present Mercer large-scale kernel machines from a ridge function perspective, we recall the results by Lin and Pinkus from Fundamentality of ridge functions. We consider the main theorem of the recent paper by Rahimi and Recht, 2008, Random features for large-scale kernel machines in terms of the Approximation Theory. We study which kernels can be approximated by a sum of cosine function products with arguments depending on $x$ and $y$ and present the obstacles of such an approach. The results of this article may have various applications in Deep Learning, especially in problems related to Image Processing.
    摘要 要从ridge函数角度介绍Mercer大规模kernel机器,我们回忆了林和拜纳斯在基本性理论中的结果。我们考虑了2008年rachimi和 recht的论文《Random features for large-scale kernel machines in terms of Approximation Theory》中的主要定理。我们研究了可以通过cosine函数产品的叠加来近似kernel机器,其中Arguments取决于x和y坐标,并提出了这种方法的阻碍。这些结果可能在深度学习中有各种应用,特别是在图像处理问题中。
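
The cosine-product approximation referenced here is the Rahimi-Recht random features construction; a minimal sketch for the RBF kernel is shown below. The sampling distribution corresponds to the Gaussian kernel exp(-gamma * ||x - y||^2); other kernels would require different spectral densities.

```python
import numpy as np

def random_fourier_features(X, D=500, gamma=1.0, seed=0):
    """Random Fourier features: z(x) = sqrt(2/D) * cos(W x + b) with
    W ~ N(0, 2*gamma*I) and b ~ Uniform[0, 2*pi), so that z(x) . z(y)
    approximates the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Usage: compare the feature inner product with the exact kernel value
x, y = np.array([[0.3, -1.2]]), np.array([[0.5, -0.7]])
zx, zy = random_fourier_features(x), random_fourier_features(y)
print(float(zx @ zy.T), float(np.exp(-1.0 * np.sum((x - y) ** 2))))
```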

Selective Perception: Optimizing State Descriptions with Reinforcement Learning for Language Model Actors

  • paper_url: http://arxiv.org/abs/2307.11922
  • repo_url: None
  • paper_authors: Kolby Nottingham, Yasaman Razeghi, Kyungmin Kim, JB Lanier, Pierre Baldi, Roy Fox, Sameer Singh
  • for: 这个论文是为了研究如何使用自然语言处理技术来提高机器人和游戏中的决策过程。
  • methods: 该论文提出了一种自动选择简洁状态描述的方法,称为Brief Language INputs for DEcision-making Responses(BLINDER),它通过学习任务条件下的状态描述价值函数来选择描述。
  • results: 该论文在NetHack游戏和机器人 manipulate任务中实现了提高任务成功率、减少输入大小和计算成本、并且可以在不同的LLM actors之间进行泛化。
    Abstract Large language models (LLMs) are being applied as actors for sequential decision making tasks in domains such as robotics and games, utilizing their general world knowledge and planning abilities. However, previous work does little to explore what environment state information is provided to LLM actors via language. Exhaustively describing high-dimensional states can impair performance and raise inference costs for LLM actors. Previous LLM actors avoid the issue by relying on hand-engineered, task-specific protocols to determine which features to communicate about a state and which to leave out. In this work, we propose Brief Language INputs for DEcision-making Responses (BLINDER), a method for automatically selecting concise state descriptions by learning a value function for task-conditioned state descriptions. We evaluate BLINDER on the challenging video game NetHack and a robotic manipulation task. Our method improves task success rate, reduces input size and compute costs, and generalizes between LLM actors.
    摘要 Translated into Simplified Chinese:大型语言模型(LLM)在机器人和游戏等领域中作为决策演员,利用其通用世界知识和规划能力。然而,前一代工作几乎没有探讨在语言中提供环境状态信息给 LLM 演员的问题。描述高维状态的尝试可能会降低性能和提高 LLM 演员的推理成本。先前的 LLM 演员通常采用手动设计、任务特定的协议来确定要将哪些特征包含在状态描述中,并且哪些可以略去。在这项工作中,我们提出了简短语言输入 для决策响应(BLINDER)方法,通过学习任务条件下的状态描述价值函数来自动选择简洁的状态描述。我们在 NetHack 游戏和机器人搅拌任务上评估了 BLINDER。我们的方法可以提高任务成功率,减少输入大小和计算成本,并且可以在不同的 LLM 演员之间进行泛化。

Poverty rate prediction using multi-modal survey and earth observation data

  • paper_url: http://arxiv.org/abs/2307.11921
  • repo_url: None
  • paper_authors: Simone Fobi, Manuel Cardona, Elliott Collins, Caleb Robinson, Anthony Ortiz, Tina Sederholm, Rahul Dodhia, Juan Lavista Ferres
  • for: 预测地区贫困率
  • methods: combining household demographic and living standards survey questions with features derived from satellite imagery
  • results: 1) inclusion of visual features reduces the mean error in poverty rate estimates from 4.09% to 3.88% 2) the best performance – errors in poverty rate decrease from 4.09% to 3.71% 3) extracted visual features encode geographic and urbanization differences between regions.
    Abstract This work presents an approach for combining household demographic and living standards survey questions with features derived from satellite imagery to predict the poverty rate of a region. Our approach utilizes visual features obtained from a single-step featurization method applied to freely available 10m/px Sentinel-2 surface reflectance satellite imagery. These visual features are combined with ten survey questions in a proxy means test (PMT) to estimate whether a household is below the poverty line. We show that the inclusion of visual features reduces the mean error in poverty rate estimates from 4.09% to 3.88% over a nationally representative out-of-sample test set. In addition to including satellite imagery features in proxy means tests, we propose an approach for selecting a subset of survey questions that are complementary to the visual features extracted from satellite imagery. Specifically, we design a survey variable selection approach guided by the full survey and image features and use the approach to determine the most relevant set of small survey questions to include in a PMT. We validate the choice of small survey questions in a downstream task of predicting the poverty rate using the small set of questions. This approach results in the best performance -- errors in poverty rate decrease from 4.09% to 3.71%. We show that extracted visual features encode geographic and urbanization differences between regions.
    摘要 Simplified Chinese translation:这项研究提出了一种方法,利用户户普查和卫星成像特征来预测地区贫困率。该方法使用10m/px Sentinel-2表面反射卫星成像中的视觉特征,与十个问题组成一个代表测试(PMT)来估算户户是否下于贫困线。包括卫星成像特征后,贫困率估计的平均错误率由4.09%降低到3.88%。此外,该方法还提出了一种方法,选择与卫星成像特征相关的小问题集,以便在预测贫困率的下游任务中使用。该方法根据全面调查和成像特征选择最相关的小问题集,并用这些问题集来预测贫困率。这种方法实现了最佳性能,贫困率估计错误率由4.09%降低到3.71%。此外,提取的视觉特征还含有地域和城市化差异。

Unveiling Vulnerabilities in Interpretable Deep Learning Systems with Query-Efficient Black-box Attacks

  • paper_url: http://arxiv.org/abs/2307.11906
  • repo_url: None
  • paper_authors: Eldor Abdukhamidov, Mohammed Abuhamad, Simon S. Woo, Eric Chan-Tin, Tamer Abuhmed
  • for: 保障深度学习系统的可靠性、可靠性和信任性,防止恶意攻击。
  • methods: 使用微生物遗传算法,基于黑盒测试,不需要目标模型和解释模型的先知知识。
  • results: 实验结果显示,这种攻击具有高成功率,使用挑战性示例和归因地幔,很难于探测。
    Abstract Deep learning has been rapidly employed in many applications revolutionizing many industries, but it is known to be vulnerable to adversarial attacks. Such attacks pose a serious threat to deep learning-based systems compromising their integrity, reliability, and trust. Interpretable Deep Learning Systems (IDLSes) are designed to make the system more transparent and explainable, but they are also shown to be susceptible to attacks. In this work, we propose a novel microbial genetic algorithm-based black-box attack against IDLSes that requires no prior knowledge of the target model and its interpretation model. The proposed attack is a query-efficient approach that combines transfer-based and score-based methods, making it a powerful tool to unveil IDLS vulnerabilities. Our experiments of the attack show high attack success rates using adversarial examples with attribution maps that are highly similar to those of benign samples which makes it difficult to detect even by human analysts. Our results highlight the need for improved IDLS security to ensure their practical reliability.
    摘要 深度学习在许多应用中得到了迅速的应用,但它知道是易受到敌意攻击的。这些攻击会对深度学习基于系统的完整性、可靠性和信任造成严重的威胁。可解释深度学习系统(IDLS)是为了使系统更加透明和可解释的,但它们也被证明是易受到攻击的。在这种工作中,我们提出了一种基于微生物遗传算法的黑盒攻击方法,不需要target模型和其解释模型的先前知识。我们的攻击方法结合了传递基本方法和分数基本方法,使其成为对IDLS的可靠性进行检测的强大工具。我们的实验表明,使用对抗例中的特征图可以达到高度的攻击成功率,并且这些特征图与正常样本的特征图几乎相同,使其具有难以检测的特点。我们的结果表明,为了确保IDLS的实际可靠性,需要进一步加强IDLS的安全性。

Model Compression Methods for YOLOv5: A Review

  • paper_url: http://arxiv.org/abs/2307.11904
  • repo_url: None
  • paper_authors: Mohammad Jani, Jamil Fayyad, Younes Al-Younes, Homayoun Najjaran
  • for: 本文主要针对强化YOLO对象检测器的研究进行了概括,以便在资源有限的设备上部署。
  • methods: 本文主要考虑了网络剪枝（pruning）和量化（quantization）两种压缩方法，以减少模型的内存占用和推理时间。
  • results: 通过对 YOLOv5 进行剪枝和量化处理，可以降低模型的内存占用和推理时间，但仍存在一些空白需要进一步研究。
    Abstract Over the past few years, extensive research has been devoted to enhancing YOLO object detectors. Since its introduction, eight major versions of YOLO have been introduced with the purpose of improving its accuracy and efficiency. While the evident merits of YOLO have yielded to its extensive use in many areas, deploying it on resource-limited devices poses challenges. To address this issue, various neural network compression methods have been developed, which fall under three main categories, namely network pruning, quantization, and knowledge distillation. The fruitful outcomes of utilizing model compression methods, such as lowering memory usage and inference time, make them favorable, if not necessary, for deploying large neural networks on hardware-constrained edge devices. In this review paper, our focus is on pruning and quantization due to their comparative modularity. We categorize them and analyze the practical results of applying those methods to YOLOv5. By doing so, we identify gaps in adapting pruning and quantization for compressing YOLOv5, and provide future directions in this area for further exploration. Among several versions of YOLO, we specifically choose YOLOv5 for its excellent trade-off between recency and popularity in literature. This is the first specific review paper that surveys pruning and quantization methods from an implementation point of view on YOLOv5. Our study is also extendable to newer versions of YOLO as implementing them on resource-limited devices poses the same challenges that persist even today. This paper targets those interested in the practical deployment of model compression methods on YOLOv5, and in exploring different compression techniques that can be used for subsequent versions of YOLO.
    摘要 在过去几年,对 YOLO 对象检测器进行了广泛的研究,以提高其精度和效率。自其引入以来,共有八个主要版本的 YOLO 发布,以提高其精度和效率。虽然 YOLO 在许多领域得到了广泛的应用,但在资源有限的设备上部署它却存在挑战。为解决这个问题,各种神经网络压缩方法被开发出来,这些方法分为三个主要类别:网络剪辑、量化和知识传递。使用这些方法可以降低内存使用量和执行时间,这使得它们在硬件限制的边缘设备上进行部署变得有利可图。在本文中,我们将关注剪辑和量化,因为它们在可模块化方面比较出色。我们将这些方法进行分类和分析,并通过应用这些方法于 YOLOv5 来评估其实际效果。通过这些研究,我们可以了解剪辑和量化在 YOLOv5 上的应用存在哪些挑战,并提供未来研究的方向。在多个 YOLO 版本中,我们选择 YOLOv5,因为它在文献中的悠久度和受欢迎程度均很高。这是关于剪辑和量化方法在 YOLOv5 上的首个具体评估文章。我们的研究也可以扩展到 newer 版本的 YOLO,因为在资源有限的设备上部署它们也存在同样的挑战。本文适合那些关注实际部署模型压缩方法在 YOLOv5 上的人,以及想要探索不同的压缩技术,以应用于未来的 YOLO 版本。
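
For readers who want a concrete starting point, the sketch below applies the two compression families the review focuses on, magnitude pruning and post-training quantization, to a stand-in module using PyTorch's built-in utilities; it is not YOLOv5 itself, and structured pruning or quantization-aware training would require additional steps.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A stand-in backbone + head (not the actual YOLOv5 architecture)
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(16 * 32 * 32, 85))

# 1) Unstructured magnitude pruning: zero out the 30% smallest-magnitude weights
for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        prune.l1_unstructured(m, name="weight", amount=0.3)
        prune.remove(m, "weight")          # bake the pruning mask into the weights

# 2) Post-training dynamic quantization of linear layers to int8
#    (conv layers would normally use static quantization with a calibration pass)
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
out = quantized(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 85])
```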

Project Florida: Federated Learning Made Easy

  • paper_url: http://arxiv.org/abs/2307.11899
  • repo_url: None
  • paper_authors: Daniel Madrigal Diaz, Andre Manoel, Jialei Chen, Nalin Singal, Robert Sim
  • for: This paper is written for machine learning engineers and application developers who want to deploy large-scale federated learning (FL) solutions across a heterogeneous device ecosystem.
  • methods: The paper presents a system architecture and software development kit (SDK) called Project Florida, which enables the deployment of FL solutions across a wide range of operating systems and hardware specifications. The paper also discusses the use of cloud-hosted infrastructure and task management interfaces to support the training process.
  • results: The paper presents illustrative experiments that demonstrate the system’s capabilities, including the ability to train machine learning models across a wide range of devices and the ability to scale the training process to accommodate a large number of client devices.
    Abstract We present Project Florida, a system architecture and software development kit (SDK) enabling deployment of large-scale Federated Learning (FL) solutions across a heterogeneous device ecosystem. Federated learning is an approach to machine learning based on a strong data sovereignty principle, i.e., that privacy and security of data is best enabled by storing it at its origin, whether on end-user devices or in segregated cloud storage silos. Federated learning enables model training across devices and silos while the training data remains within its security boundary, by distributing a model snapshot to a client running inside the boundary, running client code to update the model, and then aggregating updated snapshots across many clients in a central orchestrator. Deploying a FL solution requires implementation of complex privacy and security mechanisms as well as scalable orchestration infrastructure. Scale and performance is a paramount concern, as the model training process benefits from full participation of many client devices, which may have a wide variety of performance characteristics. Project Florida aims to simplify the task of deploying cross-device FL solutions by providing cloud-hosted infrastructure and accompanying task management interfaces, as well as a multi-platform SDK supporting most major programming languages including C++, Java, and Python, enabling FL training across a wide range of operating system (OS) and hardware specifications. The architecture decouples service management from the FL workflow, enabling a cloud service provider to deliver FL-as-a-service (FLaaS) to ML engineers and application developers. We present an overview of Florida, including a description of the architecture, sample code, and illustrative experiments demonstrating system capabilities.
    摘要 我们介绍项目“佛罗里达”,这是一个系统架构和软件开发包(SDK),它使得大规模联合学习(FL)解决方案可以在多种设备生态系统中部署。联合学习是一种基于强大数据主权原则的机器学习方法,即数据privacy和安全最好是在数据的原始位置保持,whether on end-user devices or in segregated cloud storage silos。联合学习可以在设备和存储silos之间进行模型训练,而不需要将数据传输到外部,只需在设备上运行客户端代码来更新模型,然后将更新后的模型集中到中央抽象器中。实现FL解决方案需要实施复杂的隐私和安全机制,以及可扩展的管理基础设施。因为模型训练过程需要全面参与多个客户端设备,这些设备可能有各种性能特点。项目“佛罗里达”目标是使得跨设备FL解决方案的部署变得更加简单,通过提供云主机的基础设施和相关的任务管理界面,以及支持多种主要编程语言,包括C++、Java和Python,以实现FL训练在多种操作系统和硬件特性上。架构解决方案的分离,使得云服务提供商可以提供FLaaS(联合学习 как服务),让机器学习工程师和应用程序开发人员快速搭建FL解决方案。我们将对项目“佛罗里达”进行概述,包括架构描述、示例代码和 ilustrative experiments,以示系统的能力。

Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.11897
  • repo_url: https://github.com/skandavaidyanath/credit-assignment
  • paper_authors: Akash Velu, Skanda Vaidyanath, Dilip Arumugam
  • for: 该文章为了解决奖励学习在缺乏评价反馈的环境中表现不佳问题,提出了一种基于追溯政策的方法。
  • methods: 该文章使用了现有的重要性抽样比例估计技术来稳定化和改进基于追溯政策的方法。
  • results: 该文章在各种环境中显示了稳定和高效的学习效果,并且可以在奖励学习中缓解奖励分配问题。
    Abstract Oftentimes, environments for sequential decision-making problems can be quite sparse in the provision of evaluative feedback to guide reinforcement-learning agents. In the extreme case, long trajectories of behavior are merely punctuated with a single terminal feedback signal, leading to a significant temporal delay between the observation of a non-trivial reward and the individual steps of behavior culpable for achieving said reward. Coping with such a credit assignment challenge is one of the hallmark characteristics of reinforcement learning. While prior work has introduced the concept of hindsight policies to develop a theoretically motivated method for reweighting on-policy data by impact on achieving the observed trajectory return, we show that these methods experience instabilities which lead to inefficient learning in complex environments. In this work, we adapt existing importance-sampling ratio estimation techniques for off-policy evaluation to drastically improve the stability and efficiency of these so-called hindsight policy methods. Our hindsight distribution correction facilitates stable, efficient learning across a broad range of environments where credit assignment plagues baseline methods.
    摘要 经常情况下,决策问题环境往往缺乏评价反馈,导致强化学习代理人受到很大的评价延迟。在极端情况下,长期行为只会被截止符号性的终端反馈信号刺激,从而导致行为减少的减少很大。强化学习面临着寄付问题的挑战。 Prior work已经引入了叫做前景政策的方法,以 theoretically moxtivated 方式重新权重on-policy数据,以便更好地评价 achieve trajectory return。但我们发现这些方法会导致不稳定性,从而降低强化学习的效率。在这种情况下,我们采用了现有的重要性折衔估计技术,以改善这些叫做 hindsight 政策方法的稳定性和效率。我们的 hindsight 分布修正方法可以在各种缺乏寄付的环境中,稳定、高效地学习。

On the Vulnerability of Fairness Constrained Learning to Malicious Noise

  • paper_url: http://arxiv.org/abs/2307.11892
  • repo_url: None
  • paper_authors: Avrim Blum, Princewill Okoroafor, Aadirupa Saha, Kevin Stangl
  • for: This paper studies the vulnerability of fairness-constrained learning to small amounts of malicious noise in the training data.
  • methods: The paper uses randomized classifiers to mitigate the vulnerability of fairness-constrained learning to adversarial noise.
  • results: The paper shows that for certain fairness notions, such as Demographic Parity, the loss in accuracy can be as low as $\Theta(\alpha)$, where $\alpha$ is the malicious noise rate. For other fairness notions, such as Equal Opportunity, the loss in accuracy can be as low as $O(\sqrt{\alpha})$. The paper also shows that the loss in accuracy clusters into three natural regimes: $O(\alpha)$, $O(\sqrt{\alpha})$, and $O(1)$.
    Abstract We consider the vulnerability of fairness-constrained learning to small amounts of malicious noise in the training data. Konstantinov and Lampert (2021) initiated the study of this question and presented negative results showing there exist data distributions where for several fairness constraints, any proper learner will exhibit high vulnerability when group sizes are imbalanced. Here, we present a more optimistic view, showing that if we allow randomized classifiers, then the landscape is much more nuanced. For example, for Demographic Parity we show we can incur only a $\Theta(\alpha)$ loss in accuracy, where $\alpha$ is the malicious noise rate, matching the best possible even without fairness constraints. For Equal Opportunity, we show we can incur an $O(\sqrt{\alpha})$ loss, and give a matching $\Omega(\sqrt{\alpha})$lower bound. In contrast, Konstantinov and Lampert (2021) showed for proper learners the loss in accuracy for both notions is $\Omega(1)$. The key technical novelty of our work is how randomization can bypass simple "tricks" an adversary can use to amplify his power. We also consider additional fairness notions including Equalized Odds and Calibration. For these fairness notions, the excess accuracy clusters into three natural regimes $O(\alpha)$,$O(\sqrt{\alpha})$ and $O(1)$. These results provide a more fine-grained view of the sensitivity of fairness-constrained learning to adversarial noise in training data.
    摘要 我们考察在训练数据中存在少量恶意噪声时,带公平性约束的学习的脆弱性。Konstantinov 和 Lampert (2021) 率先研究了这一问题,并给出了消极结果:存在某些数据分布,使得在若干公平性约束下,当各群体规模不均衡时,任何 proper 学习器都会表现出很高的脆弱性。在本文中,我们给出一个更乐观的视角:如果允许随机化分类器,情况要细致得多。例如,对于人口统计均等(Demographic Parity),我们证明精度损失可以只有 $\Theta(\alpha)$,其中 $\alpha$ 为恶意噪声率,这与不加公平性约束时可达到的最优结果一致;对于机会均等(Equal Opportunity),我们证明精度损失为 $O(\sqrt{\alpha})$,并给出匹配的 $\Omega(\sqrt{\alpha})$ 下界。相比之下,Konstantinov 和 Lampert (2021) 证明对 proper 学习器,这两种公平性概念下的精度损失均为 $\Omega(1)$。我们工作的关键技术新意在于说明随机化如何绕过对手用来放大其能力的简单“伎俩”。我们还考虑了其他公平性概念,包括均等几率(Equalized Odds)和校准(Calibration);对于这些概念,额外的精度损失聚成三个自然区间:$O(\alpha)$、$O(\sqrt{\alpha})$ 和 $O(1)$。这些结果对带公平性约束的学习在训练数据遭受对抗性噪声时的敏感性给出了更细粒度的刻画。

On the Universality of Linear Recurrences Followed by Nonlinear Projections

  • paper_url: http://arxiv.org/abs/2307.11888
  • repo_url: None
  • paper_authors: Antonio Orvieto, Soham De, Caglar Gulcehre, Razvan Pascanu, Samuel L. Smith
  • for: 本研究的目标是表明,一类基于循环线性层的序列模型(包括 S4、S5 和 LRU)可以任意好地逼近任何足够正则的非线性序列到序列映射。
  • methods: 本研究将循环线性层与逐位置多层感知器(MLP)交替堆叠来建模序列到序列映射。核心想法是把循环层看作压缩算法,它能把输入序列的信息忠实地存储到内部状态中,再交由表达能力很强的 MLP 处理。
  • results: 研究表明,这类模型能以任意期望的精度逼近任何足够正则的非线性序列到序列映射。
    Abstract In this note (work in progress towards a full-length paper) we show that a family of sequence models based on recurrent linear layers~(including S4, S5, and the LRU) interleaved with position-wise multi-layer perceptrons~(MLPs) can approximate arbitrarily well any sufficiently regular non-linear sequence-to-sequence map. The main idea behind our result is to see recurrent layers as compression algorithms that can faithfully store information about the input sequence into an inner state, before it is processed by the highly expressive MLP.
    摘要 在这份笔记中(这是正在撰写的一篇完整论文的阶段性成果),我们证明:由循环线性层(包括 S4、S5 和 LRU)与逐位置多层感知器(MLP)交替构成的一族序列模型,可以任意好地逼近任何足够正则的非线性序列到序列映射。该结果背后的主要想法是把循环层视为压缩算法:它能把关于输入序列的信息忠实地存储到内部状态中,随后再由表达能力很强的 MLP 加以处理。
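
To make the building block concrete, here is a minimal NumPy sketch (not the paper's code) of the architecture it analyzes: a diagonal linear recurrence in the spirit of S4/S5/LRU, whose state acts as a compressed summary of the input, followed by the same MLP applied at every position.

```python
import numpy as np

def diagonal_linear_recurrence(x, a, B):
    """x: (T, d_in); a: (d_h,) diagonal recurrence with |a| < 1 for stability; B: (d_h, d_in).
    Computes h_t = a * h_{t-1} + B x_t for every time step."""
    h, state = np.zeros((x.shape[0], a.shape[0])), np.zeros_like(a)
    for t in range(x.shape[0]):
        state = a * state + B @ x[t]
        h[t] = state
    return h

def position_wise_mlp(h, W1, b1, W2, b2):
    """The same two-layer MLP applied independently to each time step."""
    return np.maximum(h @ W1 + b1, 0.0) @ W2 + b2
```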

MORE: Measurement and Correlation Based Variational Quantum Circuit for Multi-classification

  • paper_url: http://arxiv.org/abs/2307.11875
  • repo_url: https://github.com/jindi0/more
  • paper_authors: Jindi Wu, Tianjie Hu, Qun Li
  • for: MORE is a quantum multi-classifier that leverages the quantum information of a single readout qubit to perform multi-class classification tasks.
  • methods: MORE uses a variational ansatz and quantum state tomography to reconstruct the readout state, and then employs variational quantum clustering and supervised learning to determine the mapping between input data and quantum labels.
  • results: MORE achieves advanced performance in multi-class classification tasks despite using a simple ansatz and limited quantum resources, and outperforms traditional binary classifiers in certain scenarios.
  • for: MORE 是一个利用单个 readout qubit 的量子信息来完成多类别分类任务的量子多分类器。
  • methods: MORE 使用变分线路(ansatz)和量子态层析来重建 readout 态,再通过变分量子聚类与监督学习来确定输入数据与量子标签之间的映射。
  • results: 即使使用简单的 ansatz 和有限的量子资源,MORE 在多类别分类任务中仍取得了先进的表现,并在某些情况下优于传统的二元分类器。
    Abstract Quantum computing has shown considerable promise for compute-intensive tasks in recent years. For instance, classification tasks based on quantum neural networks (QNN) have garnered significant interest from researchers and have been evaluated in various scenarios. However, the majority of quantum classifiers are currently limited to binary classification tasks due to either constrained quantum computing resources or the need for intensive classical post-processing. In this paper, we propose an efficient quantum multi-classifier called MORE, which stands for measurement and correlation based variational quantum multi-classifier. MORE adopts the same variational ansatz as binary classifiers while performing multi-classification by fully utilizing the quantum information of a single readout qubit. To extract the complete information from the readout qubit, we select three observables that form the basis of a two-dimensional Hilbert space. We then use the quantum state tomography technique to reconstruct the readout state from the measurement results. Afterward, we explore the correlation between classes to determine the quantum labels for classes using the variational quantum clustering approach. Next, quantum label-based supervised learning is performed to identify the mapping between the input data and their corresponding quantum labels. Finally, the predicted label is determined by its closest quantum label when using the classifier. We implement this approach using the Qiskit Python library and evaluate it through extensive experiments on both noise-free and noisy quantum systems. Our evaluation results demonstrate that MORE, despite using a simple ansatz and limited quantum resources, achieves advanced performance.
    摘要 量子计算在最近几年内已经显示了较大的承诺,尤其是对于计算密集的任务。例如,基于量子神经网络(QNN)的分类任务已经吸引了研究者的广泛关注,并在多个场景中进行了评估。然而,大多数量子分类器目前仅限于二进制分类任务,这可能是因为量子计算资源的限制或需要大量的经典后处理。在这篇论文中,我们提出了一种高效的量子多分类器,即MORE(测量和相关性基于量子多分类器)。MORE采用了同binary分类器一样的变量 ansatz,并在完全利用单个读取量子比特的量子信息上进行多分类。为了从读取量子比特中提取完整的信息,我们选择了三个观察量,它们构成了一个二维希尔伯特空间的基。然后,我们使用量子状态探测技术来重建读取状态。接着,我们研究分类关系来确定类别的量子标签,并使用量子分布式学习方法来确定输入数据与其相应的量子标签之间的映射。最后,我们使用类ifier来预测输入数据的标签。我们使用Qiskit Python库实现这种方法,并对噪声量子系统和噪声自由量子系统进行了广泛的实验评估。我们的评估结果表明,MORE,即使使用简单的 ansatz 和有限的量子资源,仍然可以达到高效的性能。
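
As a rough NumPy illustration of the readout side only (this is not MORE's implementation; the class "label states" below are placeholders): the single readout qubit is measured in the X, Y and Z bases, its Bloch vector is reconstructed from the three expectation values, and the input is assigned to the nearest quantum label.

```python
import numpy as np

def expectation(counts):
    """counts: {'0': n0, '1': n1} from repeated measurements in one basis."""
    n0, n1 = counts.get('0', 0), counts.get('1', 0)
    return (n0 - n1) / max(n0 + n1, 1)

def bloch_vector(counts_x, counts_y, counts_z):
    """Single-qubit state tomography: estimate (<X>, <Y>, <Z>)."""
    return np.array([expectation(counts_x), expectation(counts_y), expectation(counts_z)])

def nearest_quantum_label(r, label_vectors):
    """Assign the readout state to the closest class 'label state' on the Bloch sphere."""
    return int(np.argmin([np.linalg.norm(r - np.asarray(v)) for v in label_vectors]))
```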

The Looming Threat of Fake and LLM-generated LinkedIn Profiles: Challenges and Opportunities for Detection and Prevention

  • paper_url: http://arxiv.org/abs/2307.11864
  • repo_url: None
  • paper_authors: Navid Ayoobi, Sadat Shahriar, Arjun Mukherjee
  • for: This paper is written to detect fake and Large Language Model (LLM)-generated profiles in the LinkedIn Online Social Network immediately upon registration and before establishing connections.
  • methods: The paper introduces the Section and Subsection Tag Embedding (SSTE) method to enhance the discriminative characteristics of textual information provided in LinkedIn profiles for distinguishing between legitimate profiles and those created by imposters manually or by using an LLM. The paper also uses static and contextualized word embeddings, including GloVe, Flair, BERT, and RoBERTa.
  • results: The suggested method can distinguish between legitimate and fake profiles with an accuracy of about 95% across all word embeddings. Additionally, the SSTE method has a promising accuracy for identifying LLM-generated profiles, with an accuracy of approximately 90% when only 20 LLM-generated profiles are added to the training set.
    Abstract In this paper, we present a novel method for detecting fake and Large Language Model (LLM)-generated profiles in the LinkedIn Online Social Network immediately upon registration and before establishing connections. Early fake profile identification is crucial to maintaining the platform's integrity since it prevents imposters from acquiring the private and sensitive information of legitimate users and from gaining an opportunity to increase their credibility for future phishing and scamming activities. This work uses textual information provided in LinkedIn profiles and introduces the Section and Subsection Tag Embedding (SSTE) method to enhance the discriminative characteristics of these data for distinguishing between legitimate profiles and those created by imposters manually or by using an LLM. Additionally, the dearth of a large publicly available LinkedIn dataset motivated us to collect 3600 LinkedIn profiles for our research. We will release our dataset publicly for research purposes. This is, to the best of our knowledge, the first large publicly available LinkedIn dataset for fake LinkedIn account detection. Within our paradigm, we assess static and contextualized word embeddings, including GloVe, Flair, BERT, and RoBERTa. We show that the suggested method can distinguish between legitimate and fake profiles with an accuracy of about 95% across all word embeddings. In addition, we show that SSTE has a promising accuracy for identifying LLM-generated profiles, despite the fact that no LLM-generated profiles were employed during the training phase, and can achieve an accuracy of approximately 90% when only 20 LLM-generated profiles are added to the training set. It is a significant finding since the proliferation of several LLMs in the near future makes it extremely challenging to design a single system that can identify profiles created with various LLMs.
    摘要 本文提出了一种新方法,用于在 LinkedIn 在线社交网络中,在用户注册后、尚未建立任何连接之前,立即识别虚假的以及由大语言模型(LLM)生成的个人档案。尽早识别虚假档案对维护平台完整性至关重要,因为这能阻止冒充者获取真实用户的私人和敏感信息,并阻止其借机提升可信度以用于后续的钓鱼与诈骗活动。本工作利用 LinkedIn 档案中提供的文本信息,提出了 Section and Subsection Tag Embedding(SSTE)方法,以增强这些数据的判别特性,从而区分真实档案与由冒充者手工或借助 LLM 生成的档案。此外,由于缺乏大规模公开的 LinkedIn 数据集,我们为本研究收集了 3600 份 LinkedIn 档案,并将公开发布该数据集以供研究使用;据我们所知,这是首个面向虚假 LinkedIn 账户检测的大规模公开数据集。在我们的框架中,我们评估了静态和上下文相关的词向量,包括 GloVe、Flair、BERT 和 RoBERTa。结果表明,所提方法在所有词向量上都能以约 95% 的准确率区分真实与虚假档案。此外,尽管训练阶段没有使用任何 LLM 生成的档案,SSTE 在识别 LLM 生成档案方面也有可观的准确率:只需在训练集中加入 20 份 LLM 生成的档案,准确率即可达到约 90%。这是一个重要发现,因为未来多种 LLM 的普及将使设计一个能识别由各种 LLM 生成档案的单一系统变得极其困难。

Data-Induced Interactions of Sparse Sensors

  • paper_url: http://arxiv.org/abs/2307.11838
  • repo_url: None
  • paper_authors: Andrei A. Klishin, J. Nathan Kutz, Krithika Manohar
  • for: 该论文旨在描述如何使用少量的感知器来重建复杂系统的状态,并且如何选择感知器的位置以实现最佳重建结果。
  • methods: 论文使用了基于异谱 interpolate 和 QR 分解的多种算法来优化感知器的位置,并通过统计物理学的狄耳诺模型来计算感知器之间的互动。
  • results: 论文通过计算数据引导的感知器互动的全景,可以结合外部选择标准和预测感知器更换的影响。
    Abstract Large-dimensional empirical data in science and engineering frequently has low-rank structure and can be represented as a combination of just a few eigenmodes. Because of this structure, we can use just a few spatially localized sensor measurements to reconstruct the full state of a complex system. The quality of this reconstruction, especially in the presence of sensor noise, depends significantly on the spatial configuration of the sensors. Multiple algorithms based on gappy interpolation and QR factorization have been proposed to optimize sensor placement. Here, instead of an algorithm that outputs a singular "optimal" sensor configuration, we take a thermodynamic view to compute the full landscape of sensor interactions induced by the training data. The landscape takes the form of the Ising model in statistical physics, and accounts for both the data variance captured at each sensor location and the crosstalk between sensors. Mapping out these data-induced sensor interactions allows combining them with external selection criteria and anticipating sensor replacement impacts.
    摘要 科学与工程中的高维经验数据往往具有低秩结构,可以仅用少数几个本征模态的组合来表示。正因为这种结构,我们只需少量空间上局部的传感器测量,就能重建复杂系统的完整状态。这种重建的质量(尤其是在存在传感器噪声时)在很大程度上取决于传感器的空间布局。已有多种基于缺失插值(gappy interpolation)和 QR 分解的算法被提出来优化传感器布放。在这里,我们不再输出单一的“最优”传感器配置,而是从热力学的视角计算由训练数据诱导出的传感器相互作用的完整图景。这一图景具有统计物理中伊辛(Ising)模型的形式,同时考虑了每个传感器位置所捕获的数据方差以及传感器之间的串扰。刻画这些由数据诱导的传感器相互作用,使我们能够将其与外部选择准则结合,并预估更换传感器带来的影响。
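
As a toy NumPy illustration of the idea (the paper derives its Ising-form Hamiltonian more carefully; the field/coupling definitions below are simplifying assumptions): per-site "fields" capture the variance at each candidate sensor location, pairwise "couplings" capture crosstalk, and an Ising-style energy scores any 0/1 sensor selection.

```python
import numpy as np

def data_induced_landscape(X):
    """X: (n_snapshots, n_candidate_locations) training data at candidate sensor sites."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / max(X.shape[0] - 1, 1)   # covariance between candidate locations
    h = np.diag(C).copy()                    # 'field': variance captured at each location
    J = C - np.diag(np.diag(C))              # 'coupling': crosstalk between pairs of locations
    return h, J

def selection_energy(s, h, J):
    """Ising-style cost of a 0/1 selection vector s: favors high variance, penalizes redundancy."""
    s = np.asarray(s, dtype=float)
    return -h @ s + 0.5 * s @ J @ s
```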

eXplainable Artificial Intelligence (XAI) in age prediction: A systematic review

  • paper_url: http://arxiv.org/abs/2307.13704
  • repo_url: None
  • paper_authors: Alena Kalyakulina, Igor Yusipov
  • for: 这篇论文探讨了使用可解释人工智能(XAI)技术进行年龄预测任务的应用。
  • methods: 论文将XAI技术应用于不同的身体系统,进行系统化的文献综述。
  • results: 论文指出了XAI在医疗应用中的优点,特别是在年龄预测任务中。
    Abstract eXplainable Artificial Intelligence (XAI) is now an important and essential part of machine learning, allowing to explain the predictions of complex models. XAI is especially required in risky applications, particularly in health care, where human lives depend on the decisions of AI systems. One area of medical research is age prediction and identification of biomarkers of aging and age-related diseases. However, the role of XAI in the age prediction task has not previously been explored directly. In this review, we discuss the application of XAI approaches to age prediction tasks. We give a systematic review of the works organized by body systems, and discuss the benefits of XAI in medical applications and, in particular, in the age prediction domain.
    摘要 可解释人工智能(XAI)如今已是机器学习中重要且必不可少的组成部分,能够解释复杂模型的预测。在高风险应用中尤其需要 XAI,特别是在医疗领域,人的生命取决于 AI 系统的决策。医学研究的一个方向是年龄预测以及衰老和年龄相关疾病生物标志物的识别,然而 XAI 在年龄预测任务中的作用此前尚未被直接探讨。在这篇综述中,我们讨论了 XAI 方法在年龄预测任务中的应用:按身体系统对相关工作进行了系统梳理,并讨论了 XAI 在医疗应用、尤其是年龄预测领域中的价值。

PINNsFormer: A Transformer-Based Framework For Physics-Informed Neural Networks

  • paper_url: http://arxiv.org/abs/2307.11833
  • repo_url: https://github.com/adityalab/pinnsformer
  • paper_authors: Leo Zhiyuan Zhao, Xueying Ding, B. Aditya Prakash
  • for: 用于数值解 partial differential equations (PDEs) 的深度学习框架。
  • methods: 使用 Transformer 结构,并采用多头注意机制来捕捉 PDEs 中的时间关系。
  • results: 能够准确地 approximates PDEs 的解,并在不同场景下超过传统 PINNs 的表现,尽管具有较少的计算和存储成本。
    Abstract Physics-Informed Neural Networks (PINNs) have emerged as a promising deep learning framework for approximating numerical solutions for partial differential equations (PDEs). While conventional PINNs and most related studies adopt fully-connected multilayer perceptrons (MLP) as the backbone structure, they have neglected the temporal relations in PDEs and failed to approximate the true solution. In this paper, we propose a novel Transformer-based framework, namely PINNsFormer, that accurately approximates PDEs' solutions by capturing the temporal dependencies with multi-head attention mechanisms in Transformer-based models. Instead of approximating point predictions, PINNsFormer adapts input vectors to pseudo sequences and point-wise PINNs loss to a sequential PINNs loss. In addition, PINNsFormer is equipped with a novel activation function, namely Wavelet, which anticipates the Fourier decomposition through deep neural networks. We empirically demonstrate PINNsFormer's ability to capture the PDE solutions for various scenarios, in which conventional PINNs have failed to learn. We also show that PINNsFormer achieves superior approximation accuracy on such problems than conventional PINNs with non-sensitive hyperparameters, in trade of marginal computational and memory costs, with extensive experiments.
    摘要 物理信息神经网络(PINNs)已成为一种很有前景的深度学习框架,用于数值逼近偏微分方程(PDE)的解。传统的 PINNs 及大多数相关研究采用全连接多层感知器(MLP)作为骨干结构,忽略了 PDE 中的时间关联,难以逼近真实解。本文提出了一种新的基于 Transformer 的框架 PINNsFormer,通过 Transformer 模型中的多头注意力机制捕捉时间依赖关系,从而准确逼近 PDE 的解。PINNsFormer 不再逼近逐点预测,而是将输入向量扩展为伪序列,并将逐点的 PINNs 损失改写为序列式的 PINNs 损失。此外,PINNsFormer 配备了一种新的激活函数 Wavelet,借助深度神经网络来预估傅里叶分解。我们通过大量实验证明,PINNsFormer 能在多种传统 PINNs 无法学习的情形下捕捉 PDE 的解,并且在这些问题上以仅略微增加的计算与内存开销为代价,取得了优于传统 PINNs(且对超参数不敏感)的逼近精度。
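
A small PyTorch sketch of two ingredients described above: turning each collocation point into a short pseudo time sequence, and a learnable sin/cos "Wavelet"-style activation. This is one plausible reading of the abstract rather than the authors' code; the step size, sequence length, and the exact form of the activation are assumptions.

```python
import torch
import torch.nn as nn

class WaveletActivation(nn.Module):
    """Learnable mixture of sin and cos, loosely mimicking a first-order Fourier decomposition."""
    def __init__(self):
        super().__init__()
        self.w1 = nn.Parameter(torch.ones(1))
        self.w2 = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.w1 * torch.sin(x) + self.w2 * torch.cos(x)

def to_pseudo_sequence(xt, k=5, dt=1e-3):
    """xt: (N, 2) collocation points (x, t) -> (N, k, 2) short sequences (x, t), (x, t+dt), ..."""
    x, t = xt[:, :1], xt[:, 1:2]
    steps = torch.arange(k, dtype=xt.dtype, device=xt.device) * dt
    return torch.stack([x.expand(-1, k), t + steps], dim=-1)
```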

Differentially Private Heavy Hitter Detection using Federated Analytics

  • paper_url: http://arxiv.org/abs/2307.11749
  • repo_url: None
  • paper_authors: Karan Chadha, Junye Chen, John Duchi, Vitaly Feldman, Hanieh Hashemi, Omid Javidbakht, Audra McMillan, Kunal Talwar
  • for: 增强 prefix-tree 算法 隐私检测 differentially private heavy hitter 性能。
  • methods: 提出了一种基于 adaptive hyperparameter tuning 算法,以满足计算、通信和隐私约束的多用户数据点检测。
  • results: 通过对 Reddit 数据集进行大量实验,发现该方法可以提高检测性能,同时满足计算、通信和隐私约束。
    Abstract In this work, we study practical heuristics to improve the performance of prefix-tree based algorithms for differentially private heavy hitter detection. Our model assumes each user has multiple data points and the goal is to learn as many of the most frequent data points as possible across all users' data with aggregate and local differential privacy. We propose an adaptive hyperparameter tuning algorithm that improves the performance of the algorithm while satisfying computational, communication and privacy constraints. We explore the impact of different data-selection schemes as well as the impact of introducing deny lists during multiple runs of the algorithm. We test these improvements using extensive experimentation on the Reddit dataset~\cite{caldas2018leaf} on the task of learning the most frequent words.
    摘要 在这项工作中,我们研究了改进基于前缀树的差分隐私重频项(heavy hitter)检测算法性能的实用启发式方法。我们的模型假设每个用户拥有多个数据点,目标是在满足聚合与本地差分隐私的前提下,尽可能多地学习所有用户数据中最频繁的数据点。我们提出了一种自适应超参数调优算法,在满足计算、通信和隐私约束的同时提升算法性能。我们还研究了不同数据选择方案的影响,以及在多轮运行算法时引入拒绝列表(deny list)的影响。我们在 Reddit 数据集(Caldas et al., 2018)上以学习最常见单词为任务,对这些改进进行了大量实验验证。
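
As a structural sketch of prefix-tree heavy-hitter discovery with a deny list (not the paper's algorithm: the privacy-budget split across rounds and candidates, which a real local-DP protocol must account for, is deliberately omitted here):

```python
import math
import random

def randomized_response(bit, eps):
    """Report the true bit with probability e^eps / (e^eps + 1)."""
    keep = math.exp(eps) / (math.exp(eps) + 1.0)
    return bit if random.random() < keep else 1 - bit

def grow_prefix_tree(user_words, alphabet, eps, threshold, rounds, deny_list=frozenset()):
    """Each round, users vote (via randomized response) on one-character extensions of the
    current frontier; only well-supported, non-denied prefixes survive to the next round."""
    frontier = [""]
    for _ in range(rounds):
        candidates = [p + c for p in frontier for c in alphabet if p + c not in deny_list]
        votes = {c: 0 for c in candidates}
        for word in user_words:                      # one word per user in this toy model
            for c in candidates:
                votes[c] += randomized_response(int(word.startswith(c)), eps)
        frontier = [c for c, v in votes.items() if v >= threshold]
        if not frontier:
            break
    return frontier
```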

Advancing Ad Auction Realism: Practical Insights & Modeling Implications

  • paper_url: http://arxiv.org/abs/2307.11732
  • repo_url: None
  • paper_authors: Ming Chen, Sareh Nabi, Marciano Siniscalchi
  • for: 这个论文是为了研究当代在线广告拍卖中的四个实际特征,包括广告插播值和点击率因用户搜索词而异常,竞争者的数量和身份在拍卖过程中是未知的,广告主只能得到部分、汇总的反馈。
  • methods: 作者使用了对抗人工智能算法来模型广告主的行为,不受拍卖机制细节的影响。
  • results: 研究发现,在更加复杂的环境中,“软底”可以提高关键性能指标,而且可以在竞争者来自同一个人口群体时实现这一效果。此外,研究还证明了如何从观察拍卖价格中推断广告主价值分布,从而证明了这种方法在更加实际的拍卖Setting中的实际效果。
    Abstract This paper proposes a learning model of online ad auctions that allows for the following four key realistic characteristics of contemporary online auctions: (1) ad slots can have different values and click-through rates depending on users' search queries, (2) the number and identity of competing advertisers are unobserved and change with each auction, (3) advertisers only receive partial, aggregated feedback, and (4) payment rules are only partially specified. We model advertisers as agents governed by an adversarial bandit algorithm, independent of auction mechanism intricacies. Our objective is to simulate the behavior of advertisers for counterfactual analysis, prediction, and inference purposes. Our findings reveal that, in such richer environments, "soft floors" can enhance key performance metrics even when bidders are drawn from the same population. We further demonstrate how to infer advertiser value distributions from observed bids, thereby affirming the practical efficacy of our approach even in a more realistic auction setting.
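
To ground the modeling choice of bidders as adversarial-bandit agents, here is a minimal Exp3-style bidder over a discrete grid of bid levels. This is one standard instantiation rather than the paper's agents, and the rescaling of aggregated feedback into a [0, 1] reward is an assumption of the sketch.

```python
import math
import random

class Exp3Bidder:
    """Adversarial-bandit bidder: maintains weights over a grid of bid levels."""
    def __init__(self, bid_levels, eta=0.05):
        self.bids, self.eta = list(bid_levels), eta
        self.weights = [1.0] * len(self.bids)
        self.last = None

    def _probs(self):
        z = sum(self.weights)
        return [w / z for w in self.weights]

    def choose_bid(self):
        self.last = random.choices(range(len(self.bids)), weights=self._probs())[0]
        return self.bids[self.last]

    def observe(self, reward):
        """reward in [0, 1], e.g. the advertiser's aggregated (value - payment) feedback rescaled."""
        p = self._probs()[self.last]
        self.weights[self.last] *= math.exp(self.eta * reward / p)
```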

Mitigating Communications Threats in Decentralized Federated Learning through Moving Target Defense

  • paper_url: http://arxiv.org/abs/2307.11730
  • repo_url: https://github.com/enriquetomasmb/fedstellar
  • paper_authors: Enrique Tomás Martínez Beltrán, Pedro Miguel Sánchez Sánchez, Sergio López Bernal, Gérôme Bovet, Manuel Gil Pérez, Gregorio Martínez Pérez, Alberto Huertas Celdrán
  • for: This paper aims to address the communication security challenges in Decentralized Federated Learning (DFL) by introducing a security module that combines encryption and Moving Target Defense (MTD) techniques.
  • methods: The security module is implemented in a DFL platform called Fedstellar, and the authors evaluate the effectiveness of the module through experiments with the MNIST dataset and eclipse attacks.
  • results: The results show that the security module can mitigate the risks posed by eavesdropping or eclipse attacks, with an average F1 score of 95% and moderate increases in CPU usage and network traffic under the most secure configuration.
  • for: 这篇论文目的是解决分布式联合学习(DFL)中的通信安全挑战,通过引入加密和移动目标防御(MTD)技术的安全模块。
  • methods: 这个安全模块在分布式联合学习平台Fedstellar中实现,通过MNIST数据集和eclipse攻击进行测试。
  • results: 测试结果表明,安全模块可以降低防御 eclipse 攻击和窃听攻击的风险,实现了95%的平均F1分数,并且在最安全配置下,CPU使用率可以达到63.2% +-3.5%,网络流量可以达到230 MB +-15 MB。
    Abstract The rise of Decentralized Federated Learning (DFL) has enabled the training of machine learning models across federated participants, fostering decentralized model aggregation and reducing dependence on a server. However, this approach introduces unique communication security challenges that have yet to be thoroughly addressed in the literature. These challenges primarily originate from the decentralized nature of the aggregation process, the varied roles and responsibilities of the participants, and the absence of a central authority to oversee and mitigate threats. Addressing these challenges, this paper first delineates a comprehensive threat model, highlighting the potential risks of DFL communications. In response to these identified risks, this work introduces a security module designed for DFL platforms to counter communication-based attacks. The module combines security techniques such as symmetric and asymmetric encryption with Moving Target Defense (MTD) techniques, including random neighbor selection and IP/port switching. The security module is implemented in a DFL platform called Fedstellar, allowing the deployment and monitoring of the federation. A DFL scenario has been deployed, involving eight physical devices implementing three security configurations: (i) a baseline with no security, (ii) an encrypted configuration, and (iii) a configuration integrating both encryption and MTD techniques. The effectiveness of the security module is validated through experiments with the MNIST dataset and eclipse attacks. The results indicated an average F1 score of 95%, with moderate increases in CPU usage (up to 63.2% +-3.5%) and network traffic (230 MB +-15 MB) under the most secure configuration, mitigating the risks posed by eavesdropping or eclipse attacks.
    摘要 《协同学习的分布式协同学习(DFL)技术在训练机器学习模型方面带来了巨大的改变,使得多个参与者之间的模型协同学习可以实现,从而减少依赖于服务器。然而,这种方法引入了一些独特的通信安全挑战,这些挑战主要来自于协同学习过程的分布式特性,参与者的多样化角色和责任,以及缺乏中央权限来监管和处理威胁。为了解决这些挑战,本文首先提出了一个全面的威胁模型,描述了DFL通信的潜在风险。为应对这些风险,本工作提出了一个专门为DFL平台设计的安全模块,该模块结合了加密技术和移动目标防御(MTD)技术,包括随机 neighber 选择和IP/端口 switching。该安全模块在一个名为Fedstellar的DFL平台上实现,allowing the deployment and monitoring of the federation。一个DFL场景已经被部署,并在八个物理设备上实现了三种安全配置:(i)基eline with no security,(ii)加密配置,和(iii) integrate both encryption and MTD techniques。安全模块的有效性通过使用MNIST数据集和eclipse攻击进行实验 validate。结果显示,在最安全的配置下,模型的F1分数平均为95%,CPU使用率提高至63.2% ± 3.5%,网络流量增加至230 MB ± 15 MB。这些结果表明,通过加密和MTD技术,可以有效地防止遮断或eclipse攻击。
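
As a small illustrative sketch of the two Moving Target Defense ingredients mentioned above, random neighbor selection and IP/port switching (these helpers are hypothetical and are not part of the Fedstellar API):

```python
import random

def rotate_listening_port(port_pool, rng=random):
    """IP/port switching: pick a fresh port from an agreed pool at every MTD period."""
    return rng.choice(port_pool)

def sample_aggregation_neighbors(peers, k, rng=random):
    """Random neighbor selection: aggregate with a different random subset every round,
    so an eclipse attacker cannot pin down a node's communication graph."""
    return rng.sample(peers, min(k, len(peers)))

# usage per federation round (illustrative):
# port = rotate_listening_port([8000, 8001, 8002, 8003])
# neighbors = sample_aggregation_neighbors(all_peer_ids, k=3)
```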

Local Kernel Renormalization as a mechanism for feature learning in overparametrized Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2307.11807
  • repo_url: None
  • paper_authors: R. Aiudi, R. Pacelli, A. Vezzani, R. Burioni, P. Rotondo
  • for: 这篇论文主要研究了深度神经网络中的特征学习方法,以及它们在不同类型的架构中的表现。
  • methods: 研究者使用了一种简单的理论框架,来解释FC和CNN架构中特征学习的不同表现。他们首先显示了一个有限宽FC网络的泛化性能可以通过无穷宽网络来获得,并且提出了一种有限宽效果行动来描述CNN架构中的特征学习。
  • results: 研究者发现了一种简单的特征学习机制,它只能在浅层CNN中发生,而不是在浅层FC网络或者无Weight连接神经网络中。这种机制导致CNN架构在有限宽 régime中表现优秀,而FC网络则是在无穷宽 régime中表现优秀。
    Abstract Feature learning, or the ability of deep neural networks to automatically learn relevant features from raw data, underlies their exceptional capability to solve complex tasks. However, feature learning seems to be realized in different ways in fully-connected (FC) or convolutional architectures (CNNs). Empirical evidence shows that FC neural networks in the infinite-width limit eventually outperform their finite-width counterparts. Since the kernel that describes infinite-width networks does not evolve during training, whatever form of feature learning occurs in deep FC architectures is not very helpful in improving generalization. On the other hand, state-of-the-art architectures with convolutional layers achieve optimal performances in the finite-width regime, suggesting that an effective form of feature learning emerges in this case. In this work, we present a simple theoretical framework that provides a rationale for these differences, in one hidden layer networks. First, we show that the generalization performance of a finite-width FC network can be obtained by an infinite-width network, with a suitable choice of the Gaussian priors. Second, we derive a finite-width effective action for an architecture with one convolutional hidden layer and compare it with the result available for FC networks. Remarkably, we identify a completely different form of kernel renormalization: whereas the kernel of the FC architecture is just globally renormalized by a single scalar parameter, the CNN kernel undergoes a local renormalization, meaning that the network can select the local components that will contribute to the final prediction in a data-dependent way. This finding highlights a simple mechanism for feature learning that can take place in overparametrized shallow CNNs, but not in shallow FC architectures or in locally connected neural networks without weight sharing.
    摘要 特征学习,即深度神经网络自动从原始数据中学习相关特征的能力,是其出色解决复杂任务能力的基础。然而,特征学习在全连接(FC)架构与卷积(CNN)架构中似乎以不同的方式实现。经验证据表明,无限宽极限下的 FC 神经网络最终会优于其有限宽的对应网络;由于描述无限宽网络的核在训练过程中不发生演化,深层 FC 架构中无论出现何种形式的特征学习,对提升泛化都帮助不大。与之相对,带卷积层的最先进架构在有限宽区间内取得最佳性能,说明此时出现了一种有效的特征学习形式。在本工作中,我们针对单隐层网络提出了一个简单的理论框架来解释这些差异。首先,我们证明有限宽 FC 网络的泛化性能可以由一个无限宽网络(配以适当选择的高斯先验)获得。其次,我们推导了带一个卷积隐层的架构的有限宽有效作用量,并与 FC 网络的已有结果进行比较。值得注意的是,我们发现了一种完全不同的核重整化形式:FC 架构的核仅被一个标量参数整体地重整化,而 CNN 的核经历的是局部重整化,这意味着网络可以以依赖数据的方式挑选对最终预测有贡献的局部分量。这一发现凸显了一种简单的特征学习机制:它可以出现在过参数化的浅层 CNN 中,但不会出现在浅层 FC 架构或不带权重共享的局部连接神经网络中。

Convergence of SGD for Training Neural Networks with Sliced Wasserstein Losses

  • paper_url: http://arxiv.org/abs/2307.11714
  • repo_url: None
  • paper_authors: Eloi Tanguy
  • for: 本研究旨在为固定步长 SGD 在 SW(sliced Wasserstein)损失下训练神经网络参数的收敛性提供理论保证。
  • methods: 本研究借助 Bianchi et al. (2022) 关于 SGD 在非光滑非凸函数上收敛性的最新结果,并将其落实到这一具体场景。
  • results: 研究发现,随着步长减小,SGD 轨迹会逼近(次)梯度流方程;在更严格的假设下,加噪并投影的 SGD 轨迹在长时间极限下会逼近损失函数的广义临界点集合。
    Abstract Optimal Transport has sparked vivid interest in recent years, in particular thanks to the Wasserstein distance, which provides a geometrically sensible and intuitive way of comparing probability measures. For computational reasons, the Sliced Wasserstein (SW) distance was introduced as an alternative to the Wasserstein distance, and has seen uses for training generative Neural Networks (NNs). While convergence of Stochastic Gradient Descent (SGD) has been observed practically in such a setting, there is to our knowledge no theoretical guarantee for this observation. Leveraging recent works on convergence of SGD on non-smooth and non-convex functions by Bianchi et al. (2022), we aim to bridge that knowledge gap, and provide a realistic context under which fixed-step SGD trajectories for the SW loss on NN parameters converge. More precisely, we show that the trajectories approach the set of (sub)-gradient flow equations as the step decreases. Under stricter assumptions, we show a much stronger convergence result for noised and projected SGD schemes, namely that the long-run limits of the trajectories approach a set of generalised critical points of the loss function.
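
For reference, a minimal PyTorch estimator of the squared sliced Wasserstein distance between two equally sized batches, the kind of differentiable loss on which the fixed-step SGD trajectories studied above would be run; the number of projections and the Monte Carlo scheme are choices of this sketch, not the paper's.

```python
import torch

def sliced_wasserstein_sq(x, y, n_projections=64):
    """x, y: (n, d) samples from the two distributions (same n).
    Averages the 1-D squared W2 distance over random projection directions."""
    theta = torch.randn(n_projections, x.shape[1], device=x.device, dtype=x.dtype)
    theta = theta / theta.norm(dim=1, keepdim=True)
    x_proj, _ = torch.sort(x @ theta.T, dim=0)   # sorted 1-D projections
    y_proj, _ = torch.sort(y @ theta.T, dim=0)
    return ((x_proj - y_proj) ** 2).mean()

# e.g. loss = sliced_wasserstein_sq(generator(z), data_batch); loss.backward(); optimizer.step()
```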

JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.11704
  • repo_url: None
  • paper_authors: Kaiwen Wang, Junxiong Wang, Yueying Li, Nathan Kallus, Immanuel Trummer, Wen Sun
  • for: 本文提出了一个面向强化学习(RL)应用的高效、轻量级查询优化环境。
  • methods: 本文将左深(left-deep)与浓密(bushy)两种变体的连接顺序选择(JOS)问题表述为马尔可夫决策过程(MDP),并提供了遵循标准 Gymnasium API 的实现。
  • results: 本文对多种 RL 算法进行了基准测试,发现至少一种方法能在训练集查询上取得接近最优的表现,但在测试集查询上性能下降数个数量级;这一差距推动了对 RL 算法在多任务组合优化问题中泛化能力的进一步研究。
    Abstract In this paper, we present \textsc{JoinGym}, an efficient and lightweight query optimization environment for reinforcement learning (RL). Join order selection (JOS) is a classic NP-hard combinatorial optimization problem from database query optimization and can serve as a practical testbed for the generalization capabilities of RL algorithms. We describe how to formulate each of the left-deep and bushy variants of the JOS problem as a Markov Decision Process (MDP), and we provide an implementation adhering to the standard Gymnasium API. We highlight that our implementation \textsc{JoinGym} is completely based on offline traces of all possible joins, which enables RL practitioners to easily and quickly test their methods on a realistic data management problem without needing to setup any systems. Moreover, we also provide all possible join traces on $3300$ novel SQL queries generated from the IMDB dataset. Upon benchmarking popular RL algorithms, we find that at least one method can obtain near-optimal performance on train-set queries but their performance degrades by several orders of magnitude on test-set queries. This gap motivates further research for RL algorithms that generalize well in multi-task combinatorial optimization problems.
    摘要 本文提出了 JoinGym,一个面向强化学习(RL)的高效轻量级查询优化环境。连接顺序选择(JOS)是数据库查询优化中的经典 NP 难组合优化问题,可作为检验 RL 算法泛化能力的实用试验台。我们描述了如何将左深(left-deep)与浓密(bushy)两种 JOS 问题分别表述为马尔可夫决策过程(MDP),并提供了符合标准 Gymnasium API 的实现。需要强调的是,我们的实现 JoinGym 完全基于所有可能连接的离线轨迹,这使得 RL 研究者无需搭建任何系统,就能轻松快捷地在一个真实的数据管理问题上测试自己的方法。此外,我们还提供了基于 IMDB 数据集生成的 3300 条新 SQL 查询的全部连接轨迹。在对主流 RL 算法进行基准测试后,我们发现至少有一种方法能在训练集查询上取得接近最优的性能,但其在测试集查询上的性能下降了几个数量级。这一差距激励我们进一步研究能在多任务组合优化问题中良好泛化的 RL 算法。
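
A hypothetical Gymnasium-style skeleton of a left-deep join-ordering environment, to make the MDP formulation concrete; this is not the actual JoinGym interface, and `cost_of_partial_plan` stands in for a lookup into the offline join traces.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class LeftDeepJoinEnv(gym.Env):
    """At each step the agent picks the next table to fold into the left-deep plan."""

    def __init__(self, n_tables, cost_of_partial_plan):
        super().__init__()
        self.n = n_tables
        self.cost_fn = cost_of_partial_plan              # e.g. built from offline traces
        self.action_space = spaces.Discrete(n_tables)
        self.observation_space = spaces.MultiBinary(n_tables)
        self.joined = []

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.joined = []
        return self._obs(), {}

    def step(self, action):
        # no masking of already-joined tables here; a real environment would handle that
        self.joined.append(int(action))
        reward = -float(self.cost_fn(tuple(self.joined)))  # negative intermediate-result cost
        terminated = len(self.joined) == self.n
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        obs = np.zeros(self.n, dtype=np.int8)
        obs[self.joined] = 1
        return obs
```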

Using simulation to calibrate real data acquisition in veterinary medicine

  • paper_url: http://arxiv.org/abs/2307.11695
  • repo_url: None
  • paper_authors: Krystian Strzałka, Szymon Mazurek, Maciej Wielgosz, Paweł Russek, Jakub Caputa, Daria Łukasik, Jan Krupiński, Jakub Grzeszczyk, Michał Karwatowski, Rafał Frączek, Ernest Jamro, Marcin Pietroń, Sebastian Koryciak, Agnieszka Dąbrowska-Boruch, Kazimierz Wiatr
  • for: 这个研究旨在使用模拟环境提高动物医学数据收集和诊断,特点是通过使用Blender和Blenderproc库生成具有多种生物学、环境和行为条件的 sintetic数据集,并用这些数据集训练机器学习模型以识别正常和异常的步态。
  • methods: 这个研究使用了Blender和Blenderproc库生成 sintetic数据集,并使用这些数据集训练机器学习模型。两个不同的数据集,具有不同的摄像头角度细节,被创建以进一步研究摄像头角度对模型准确性的影响。
  • results: 初步结果表明,通过使用模拟环境和真实病人数据集的组合,这种基于模拟的方法可能会提高动物医学诊断的效果和效率。
    Abstract This paper explores the innovative use of simulation environments to enhance data acquisition and diagnostics in veterinary medicine, focusing specifically on gait analysis in dogs. The study harnesses the power of Blender and the Blenderproc library to generate synthetic datasets that reflect diverse anatomical, environmental, and behavioral conditions. The generated data, represented in graph form and standardized for optimal analysis, is utilized to train machine learning algorithms for identifying normal and abnormal gaits. Two distinct datasets with varying degrees of camera angle granularity are created to further investigate the influence of camera perspective on model accuracy. Preliminary results suggest that this simulation-based approach holds promise for advancing veterinary diagnostics by enabling more precise data acquisition and more effective machine learning models. By integrating synthetic and real-world patient data, the study lays a robust foundation for improving overall effectiveness and efficiency in veterinary medicine.
    摘要 这个研究paper explores the innovative use of simulation environments to enhance data acquisition and diagnostics in veterinary medicine, focusing specifically on gait analysis in dogs. The study harnesses the power of Blender and the Blenderproc library to generate synthetic datasets that reflect diverse anatomical, environmental, and behavioral conditions. The generated data, represented in graph form and standardized for optimal analysis, is utilized to train machine learning algorithms for identifying normal and abnormal gaits. Two distinct datasets with varying degrees of camera angle granularity are created to further investigate the influence of camera perspective on model accuracy. Preliminary results suggest that this simulation-based approach holds promise for advancing veterinary diagnostics by enabling more precise data acquisition and more effective machine learning models. By integrating synthetic and real-world patient data, the study lays a robust foundation for improving overall effectiveness and efficiency in veterinary medicine.Here's the text with traditional Chinese characters:这个研究paper explores the innovative use of simulation environments to enhance data acquisition and diagnostics in veterinary medicine, focusing specifically on gait analysis in dogs. The study harnesses the power of Blender and the Blenderproc library to generate synthetic datasets that reflect diverse anatomical, environmental, and behavioral conditions. The generated data, represented in graph form and standardized for optimal analysis, is utilized to train machine learning algorithms for identifying normal and abnormal gaits. Two distinct datasets with varying degrees of camera angle granularity are created to further investigate the influence of camera perspective on model accuracy. Preliminary results suggest that this simulation-based approach holds promise for advancing veterinary diagnostics by enabling more precise data acquisition and more effective machine learning models. By integrating synthetic and real-world patient data, the study lays a robust foundation for improving overall effectiveness and efficiency in veterinary medicine.

Fast Adaptive Test-Time Defense with Robust Features

  • paper_url: http://arxiv.org/abs/2307.11672
  • repo_url: None
  • paper_authors: Anurag Singh, Mahalakshmi Sabanayagam, Krikamol Muandet, Debarghya Ghoshdastidar
  • for: 提高深度神经网络的对抗性性能
  • methods: 基于特征稳定性的抗击攻击策略
  • results: 在CIFAR-10和CIFAR-100数据集上,与现有最佳方法相比,提出的方法具有较低的计算成本,且对抗性性能较高。
    Abstract Adaptive test-time defenses are used to improve the robustness of deep neural networks to adversarial examples. However, existing methods significantly increase the inference time due to additional optimization on the model parameters or the input at test time. In this work, we propose a novel adaptive test-time defense strategy that is easy to integrate with any existing (robust) training procedure without additional test-time computation. Based on the notion of robustness of features that we present, the key idea is to project the trained models to the most robust feature space, thereby reducing the vulnerability to adversarial attacks in non-robust directions. We theoretically show that the top eigenspace of the feature matrix are more robust for a generalized additive model and support our argument for a large width neural network with the Neural Tangent Kernel (NTK) equivalence. We conduct extensive experiments on CIFAR-10 and CIFAR-100 datasets for several robustness benchmarks, including the state-of-the-art methods in RobustBench, and observe that the proposed method outperforms existing adaptive test-time defenses at much lower computation costs.
    摘要 使用适应性测试时防御,提高深度神经网络对攻击示例的Robustness。然而,现有方法会significantly增加测试时间,因为它们需要在测试时进行额外的优化模型参数或输入。在这项工作中,我们提出了一种新的适应测试时防御策略,可以轻松地与任何现有的Robust训练过程集成,无需额外的测试时间计算。我们基于特征空间的Robustness提出了一个新的思路,即将训练模型映射到最Robust的特征空间,以降低非Robust方向的攻击性。我们理论上显示,通过对特征矩阵的top射影空间进行投影,可以提高一般加法模型的Robustness。我们在CIFAR-10和CIFAR-100数据集上进行了广泛的实验,包括RobustBench状态OF-the-art方法,并观察到我们提出的方法在计算成本远低于现有适应测试时防御方法时仍能够获得更高的Robustness性。
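
A simplified NumPy sketch of the core step described above: estimating the top eigenspace of the feature matrix on (robust) training data and projecting features onto it at test time, before the classifier head. The choice of k and the use of the empirical covariance rather than the paper's NTK-based analysis are simplifications of this sketch.

```python
import numpy as np

def fit_robust_subspace(train_features, k):
    """train_features: (n, d) penultimate-layer activations. Returns the top-k eigenvectors."""
    cov = np.cov(train_features, rowvar=False)
    _, vecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    return vecs[:, -k:]                  # (d, k) most robust directions

def project_features(features, U):
    """Keep only the components lying in the subspace spanned by U."""
    return (features @ U) @ U.T
```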

An Efficient Interior-Point Method for Online Convex Optimization

  • paper_url: http://arxiv.org/abs/2307.11668
  • repo_url: None
  • paper_authors: Elad Hazan, Nimrod Megiddo
  • for: 这个论文是为了最小化在线凸优化中的遗弃量而写的。
  • methods: 这个论文使用了一种新的算法来最小化遗弃量,该算法是适应的,meaning its regret bounds hold not only for the time periods 1,…,T but also for every sub-interval s,s+1,…,t。
  • results: 这个论文的结果表明,该算法的遗弃量为O(√T log T),这是最小化遗弃量的下限,只有一个 logs 项。
    Abstract A new algorithm for regret minimization in online convex optimization is described. The regret of the algorithm after $T$ time periods is $O(\sqrt{T \log T})$ - which is the minimum possible up to a logarithmic term. In addition, the new algorithm is adaptive, in the sense that the regret bounds hold not only for the time periods $1,\ldots,T$ but also for every sub-interval $s,s+1,\ldots,t$. The running time of the algorithm matches that of newly introduced interior point algorithms for regret minimization: in $n$-dimensional space, during each iteration the new algorithm essentially solves a system of linear equations of order $n$, rather than solving some constrained convex optimization problem in $n$ dimensions and possibly many constraints.
    摘要 本文描述了一种用于在线凸优化中最小化 regret 的新算法。该算法在 $T$ 个时间段后的 regret 为 $O(\sqrt{T \log T})$,在至多相差一个对数因子的意义下达到了可能的最小值。此外,该新算法是自适应的:其 regret 界不仅对时间段 $1,\ldots,T$ 成立,也对每个子区间 $s,s+1,\ldots,t$ 成立。算法的运行时间与新近提出的用于 regret 最小化的内点法相当:在 $n$ 维空间中,新算法每次迭代本质上只需求解一个 $n$ 阶线性方程组,而不是求解一个 $n$ 维且可能带有许多约束的约束凸优化问题。

eess.IV - 2023-07-22

Direct atomic number reconstruction of dual energy cargo radiographs using a semiempirical transparency model

  • paper_url: http://arxiv.org/abs/2307.12099
  • repo_url: None
  • paper_authors: Peter Lalor, Areg Danagoulian
  • for: 本研究旨在提高货物内容的检测能力,特别是检测敏感物品。
  • methods: 本研究使用高分辨率的原子数预测方法,通过最小化χ²错误来预测物品的原子数。此外,还包括一个整形步骤来提高物品的原子数选择性。
  • results: 研究表明,通过包含准确性检查步骤,可以在噪声入口图像上获得高精度的物品预测结果。此外,还可以根据屏障的性质来确定屏障的物品。
    Abstract Dual energy cargo inspection systems are sensitive to both the area density and the atomic number of an imaged container due to the Z dependence of photon attenuation. The ability to identify cargo contents by their atomic number enables improved detection capabilities of illicit materials. Existing methods typically classify materials into a few material classes using an empirical calibration step. However, such a coarse label discretization limits atomic number selectivity and can yield inaccurate results if a material is near the midpoint of two bins. This work introduces a high resolution atomic number prediction method by minimizing the chi-squared error between measured transparency values and a semiempirical transparency model. Our previous work showed that by incorporating calibration step, the semiempirical transparency model can capture second order effects such as scattering. This method is benchmarked using two simulated radiographic phantoms, demonstrating the ability to obtain accurate material predictions on noisy input images by incorporating an image segmentation step. Furthermore, we show that this approach can be adapted to identify shielded objects after first determining the properties of the shielding, taking advantage of the closed-form nature of the transparency model.
    摘要 双能量货物检测系统具有区域密度和原子数的敏感性,由光子吸收度随着Z值变化而受到影响。通过物质的原子数确定货物内容,可以提高披靡材料的检测能力。现有方法通常通过静默分类材料到几个物类来实现,但这会限制原子数选择性并可能导致结果不准确,如果材料处于两个极值点之间。这项工作介绍了高分辨率原子数预测方法,通过最小化χ²错误值来预测测量值和 semiempirical 透明性模型之间的差异。我们之前的工作表明,通过添加准确步骤, semiempirical 透明性模型可以捕捉到第二个效应,如散射。这种方法在使用两个模拟的放射学phantom中进行了测试,并显示了在噪声输入图像上获得高精度材料预测的能力。此外,我们还示出了在确定防护物的属性后,通过利用闭式形式的透明性模型,可以识别防护物。
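
As an illustration of the chi-squared fit described above, here is a small SciPy sketch that recovers areal density and atomic number for one dual-energy pixel. The `toy_transparency` attenuation law is invented purely for illustration and is not the paper's semiempirical model; energies, starting point, and bounds are likewise assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def toy_transparency(areal_density, Z, energy):
    """Stand-in for the semiempirical transparency model T(lambda, Z; E)."""
    mu = 0.02 + 8e-4 * Z / np.sqrt(energy)   # made-up attenuation coefficient
    return np.exp(-mu * areal_density)

def fit_pixel(T_measured, sigma, energies=(4.0, 6.0)):
    """Minimize chi-squared over (areal density, atomic number) for one dual-energy pixel."""
    T_measured, sigma = np.asarray(T_measured), np.asarray(sigma)

    def chi2(params):
        lam, Z = params
        model = np.array([toy_transparency(lam, Z, E) for E in energies])
        return float(np.sum(((T_measured - model) / sigma) ** 2))

    res = minimize(chi2, x0=[10.0, 26.0], bounds=[(0.0, 500.0), (1.0, 100.0)])
    return res.x  # (areal density, atomic number)
```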

On the Effectiveness of Spectral Discriminators for Perceptual Quality Improvement

  • paper_url: http://arxiv.org/abs/2307.12027
  • repo_url: https://github.com/luciennnnnnn/dualformer
  • paper_authors: Xin Luo, Yunan Zhu, Shunxin Xu, Dong Liu
  • for: 本研究探讨了spectral discriminator在高品质图像生成中的应用,以提高SR图像质量。
  • methods: 本研究使用了GAN-based SR方法,并使用了spectral discriminator和ordinary discriminator进行评估。在高频范围内,spectral discriminator表现更好,而在低频范围内,ordinary discriminator表现更好。为了解决这个问题,我们提议同时使用spectral和ordinary discriminator。
  • results: 我们的方法可以更好地保持SR图像的spectrum,从而提高PD质量。此外,我们的ensembled discriminator可以更好地预测图像质量,并在无参图像质量评估任务中获得更高的准确率。
    Abstract Several recent studies advocate the use of spectral discriminators, which evaluate the Fourier spectra of images for generative modeling. However, the effectiveness of the spectral discriminators is not well interpreted yet. We tackle this issue by examining the spectral discriminators in the context of perceptual image super-resolution (i.e., GAN-based SR), as SR image quality is susceptible to spectral changes. Our analyses reveal that the spectral discriminator indeed performs better than the ordinary (a.k.a. spatial) discriminator in identifying the differences in the high-frequency range; however, the spatial discriminator holds an advantage in the low-frequency range. Thus, we suggest that the spectral and spatial discriminators shall be used simultaneously. Moreover, we improve the spectral discriminators by first calculating the patch-wise Fourier spectrum and then aggregating the spectra by Transformer. We verify the effectiveness of the proposed method twofold. On the one hand, thanks to the additional spectral discriminator, our obtained SR images have their spectra better aligned to those of the real images, which leads to a better PD tradeoff. On the other hand, our ensembled discriminator predicts the perceptual quality more accurately, as evidenced in the no-reference image quality assessment task.
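
A minimal PyTorch sketch of the proposed discriminator's input: per-patch log-magnitude Fourier spectra flattened into tokens that a Transformer can aggregate. The patch size and the log-magnitude normalization are assumptions of this sketch, not details confirmed by the abstract.

```python
import torch

def patchwise_spectra(img, patch=16):
    """img: (B, C, H, W) with H, W divisible by `patch`.
    Returns (B, N_patches, C * F) tokens of per-patch log-magnitude spectra."""
    B, C, H, W = img.shape
    patches = img.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    patches = patches.contiguous().view(B, C, -1, patch, patch)     # (B, C, N, p, p)
    spectra = torch.fft.rfft2(patches, dim=(-2, -1)).abs().clamp_min(1e-8).log()
    return spectra.flatten(start_dim=3).permute(0, 2, 1, 3).flatten(start_dim=2)
```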

A Cascade Transformer-based Model for 3D Dose Distribution Prediction in Head and Neck Cancer Radiotherapy

  • paper_url: http://arxiv.org/abs/2307.12005
  • repo_url: https://github.com/ghtara/dose_prediction
  • paper_authors: Tara Gheshlaghi, Shahabedin Nabavi, Samire Shirzadikia, Mohsen Ebrahimi Moghaddam, Nima Rostampour
  • for: 这些研究旨在提高辐射疗法的规划效率和精度,并使用深度学习方法来预测辐射剂量分布图。
  • methods: 该研究使用了一种卷积encoder-decoder网络来实现器官风险分 segmentation,并使用另一种pyramid architecture来预测辐射剂量分布。
  • results: 该模型在一个自有的头颈癌 dataset上得到了0.79和2.71的Dice和HD95分数,分别高于现有的基eline。此外,该模型还在OpenKBP dataset上得到了2.77和1.79的辐射剂量和DVH分数,并且在链接auxiliary segmentation任务时表现更优异。
    Abstract Radiation therapy is the primary method used to treat cancer in the clinic. Its goal is to deliver a precise dose to the planning target volume (PTV) while protecting the surrounding organs at risk (OARs). However, the traditional workflow used by dosimetrists to plan the treatment is time-consuming and subjective, requiring iterative adjustments based on their experience. Deep learning methods can be used to predict dose distribution maps to address these limitations. The study proposes a cascade model for organs at risk segmentation and dose distribution prediction. An encoder-decoder network has been developed for the segmentation task, in which the encoder consists of transformer blocks, and the decoder uses multi-scale convolutional blocks. Another cascade encoder-decoder network has been proposed for dose distribution prediction using a pyramid architecture. The proposed model has been evaluated using an in-house head and neck cancer dataset of 96 patients and OpenKBP, a public head and neck cancer dataset of 340 patients. The segmentation subnet achieved 0.79 and 2.71 for Dice and HD95 scores, respectively. This subnet outperformed the existing baselines. The dose distribution prediction subnet outperformed the winner of the OpenKBP2020 competition with 2.77 and 1.79 for dose and DVH scores, respectively. The predicted dose maps showed good coincidence with ground truth, with a superiority after linking with the auxiliary segmentation task. The proposed model outperformed state-of-the-art methods, especially in regions with low prescribed doses.
    摘要 放射治疗是临床上治疗癌症的主要手段,其目标是在保护周围危及器官(OARs)的同时,向计划靶区(PTV)投照精确的剂量。然而,剂量师传统的计划流程耗时且主观,需要基于个人经验反复调整。深度学习方法可以通过预测剂量分布图来缓解这些局限。本研究提出了一个级联模型,用于危及器官分割与剂量分布预测:分割任务采用编码器-解码器网络,其中编码器由 Transformer 模块构成,解码器使用多尺度卷积模块;剂量分布预测则采用另一个金字塔结构的级联编码器-解码器网络。我们在一个包含 96 名患者的自有头颈癌数据集以及包含 340 名患者的公开头颈癌数据集 OpenKBP 上对模型进行了评估。分割子网络的 Dice 和 HD95 分数分别为 0.79 和 2.71,优于现有基线;剂量预测子网络在剂量分数和 DVH 分数上分别取得 2.77 和 1.79,优于 OpenKBP2020 竞赛的冠军方案。预测的剂量图与真实剂量图吻合良好,且在与辅助分割任务联合后表现进一步提升。所提模型优于现有最先进方法,尤其是在处方剂量较低的区域。

ELiOT : End-to-end Lidar Odometry using Transformer Framework

  • paper_url: http://arxiv.org/abs/2307.11998
  • repo_url: None
  • paper_authors: Daegyu Lee, Hyunwoo Nam, D. Hyunchul Shim
  • for: 该文章为了提出一种基于深度学习的 LiDAR ODometry 方法。
  • methods: 该方法使用 transformer 架构,并通过自注意力流 embedding 网络来隐式地表示顺序 LiDAR 场景的运动。
  • results: 该方法在 urbane 数据集上显示了 Encouraging 的结果,即 translational 和 rotational 错误分别为 7.59% 和 2.67%。
    Abstract In recent years, deep-learning-based point cloud registration methods have shown significant promise. Furthermore, learning-based 3D detectors have demonstrated their effectiveness in encoding semantic information from LiDAR data. In this paper, we introduce ELiOT, an end-to-end LiDAR odometry framework built on a transformer architecture. Our proposed Self-attention flow embedding network implicitly represents the motion of sequential LiDAR scenes, bypassing the need for 3D-2D projections traditionally used in such tasks. The network pipeline, composed of a 3D transformer encoder-decoder, has shown effectiveness in predicting poses on urban datasets. In terms of translational and rotational errors, our proposed method yields encouraging results, with 7.59% and 2.67% respectively on the KITTI odometry dataset. This is achieved with an end-to-end approach that foregoes the need for conventional geometric concepts.
    摘要 近年来,基于深度学习的点云配准方法已展现出可观的前景。此外,基于学习的 3D 检测器也证明了其从 LiDAR 数据中编码语义信息的有效性。在本文中,我们介绍了 ELiOT,一个基于 Transformer 架构的端到端 LiDAR 里程计框架。我们提出的自注意力流嵌入网络可以隐式地表示连续 LiDAR 场景间的运动,绕过了此类任务中传统使用的 3D-2D 投影。由 3D Transformer 编码器-解码器构成的网络流水线在城市数据集上的位姿预测中表现出了良好的效果:在 KITTI 里程计数据集上,平移与旋转误差分别为 7.59% 和 2.67%。这一结果由端到端方法取得,无需借助传统的几何概念。

Topology-Preserving Automatic Labeling of Coronary Arteries via Anatomy-aware Connection Classifier

  • paper_url: http://arxiv.org/abs/2307.11959
  • repo_url: https://github.com/zutsusemi/miccai2023-topolab-labels
  • paper_authors: Zhixing Zhang, Ziwei Zhao, Dong Wang, Shishuang Zhao, Yuhang Liu, Jia Liu, Liwei Wang
  • for: 这个论文主要用于提出了一种新的涵围 coronary artery 自动标注框架,以便在冠状动脉疾病诊断过程中准确地标注 coronary artery。
  • methods: 这个框架叫做 TopoLab,它在网络设计中直接嵌入了解剖学连接的知识。具体来说,这个框架使用了内部 segment 特征聚合和间 segment 特征互动来进行层次特征提取。此外,我们还提出了一种基于解剖学连接的连接类别分类器,以便将每个连接对应的 segment 对应到正确的类别中。
  • results: 我们在公共 orCaScore 数据集上提供了高质量的涵围 coronary artery 标注数据,并在 orCaScore 数据集和一个内部数据集上进行了实验。结果显示,我们的 TopoLab 已经实现了领先的性能。
    Abstract Automatic labeling of coronary arteries is an essential task in the practical diagnosis process of cardiovascular diseases. For experienced radiologists, the anatomically predetermined connections are important for labeling the artery segments accurately, while this prior knowledge is barely explored in previous studies. In this paper, we present a new framework called TopoLab which incorporates the anatomical connections into the network design explicitly. Specifically, the strategies of intra-segment feature aggregation and inter-segment feature interaction are introduced for hierarchical segment feature extraction. Moreover, we propose the anatomy-aware connection classifier to enable classification for each connected segment pair, which effectively exploits the prior topology among the arteries with different categories. To validate the effectiveness of our method, we contribute high-quality annotations of artery labeling to the public orCaScore dataset. The experimental results on both the orCaScore dataset and an in-house dataset show that our TopoLab has achieved state-of-the-art performance.
    摘要 自动标注 coronary artery 是心血管疾病诊断过程中的关键任务。经验丰富的放射学家可以借助 anatomical connections 准确地标注 artery segments,然而这一先验知识在此前的研究中几乎未被探讨。在这篇论文中,我们提出了一种新的框架 TopoLab,该框架将 anatomical connections 直接包含到网络设计中。具体来说,我们引入了 intra-segment feature aggregation 和 inter-segment feature interaction 策略,以实现 hierarchical segment feature extraction。此外,我们还提出了 anatomy-aware connection classifier,以便对每个连接 segment pair 进行分类,从而有效利用了不同类别 artery 之间的先验 topology。为验证我们的方法的有效性,我们在公共 orCaScore 数据集上提供了高质量的 artery labeling 标注。实验结果表明,我们的 TopoLab 在 orCaScore 数据集和自有数据集上均取得了最先进的性能。

PartDiff: Image Super-resolution with Partial Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.11926
  • repo_url: None
  • paper_authors: Kai Zhao, Alex Ling Yu Hung, Kaifeng Pang, Haoxin Zheng, Kyunghyun Sung
  • for: 这个论文主要是为了提高图像超分辨率生成task中的性能。
  • methods: 论文使用了Diffusion Probabilistic Models(DDPMs)来生成图像,通过学习逆转数据分布的演化过程,将数据从普通的噪声演化成高质量的图像。
  • results: 论文的实验表明,在使用Partial Diffusion Model(PartDiff)方法时,可以significantly reduce the number of denoising steps without sacrificing the quality of generation,相比于传统的 diffusion-based super-resolution methods。
    Abstract Denoising diffusion probabilistic models (DDPMs) have achieved impressive performance on various image generation tasks, including image super-resolution. By learning to reverse the process of gradually diffusing the data distribution into Gaussian noise, DDPMs generate new data by iteratively denoising from random noise. Despite their impressive performance, diffusion-based generative models suffer from high computational costs due to the large number of denoising steps.In this paper, we first observed that the intermediate latent states gradually converge and become indistinguishable when diffusing a pair of low- and high-resolution images. This observation inspired us to propose the Partial Diffusion Model (PartDiff), which diffuses the image to an intermediate latent state instead of pure random noise, where the intermediate latent state is approximated by the latent of diffusing the low-resolution image. During generation, Partial Diffusion Models start denoising from the intermediate distribution and perform only a part of the denoising steps. Additionally, to mitigate the error caused by the approximation, we introduce "latent alignment", which aligns the latent between low- and high-resolution images during training. Experiments on both magnetic resonance imaging (MRI) and natural images show that, compared to plain diffusion-based super-resolution methods, Partial Diffusion Models significantly reduce the number of denoising steps without sacrificing the quality of generation.
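
A schematic PyTorch sketch of the partial-diffusion idea described above: forward-diffuse the (upsampled) low-resolution image to an intermediate step and run only the remaining reverse steps. The schedule values, `t_mid`, and the `denoiser(x, t)` placeholder for one learned reverse step are assumptions of this sketch, not the paper's settings.

```python
import torch

def alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule."""
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)

def diffuse_to(x0, t, abar):
    """Forward process q(x_t | x_0): sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = abar[t]
    return a.sqrt() * x0 + (1.0 - a).sqrt() * torch.randn_like(x0)

def partial_generation(lr_upsampled, denoiser, abar, t_mid=300):
    """Start from the intermediate latent of the upsampled LR image instead of pure noise,
    and run only t_mid reverse steps; `denoiser(x, t)` stands for one learned reverse step."""
    x = diffuse_to(lr_upsampled, t_mid, abar)
    for t in range(t_mid, 0, -1):
        x = denoiser(x, t)
    return x
```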

Conditional Temporal Attention Networks for Neonatal Cortical Surface Reconstruction

  • paper_url: http://arxiv.org/abs/2307.11870
  • repo_url: https://github.com/m-qiang/cotan
  • paper_authors: Qiang Ma, Liu Li, Vanessa Kyriakopoulou, Joseph Hajnal, Emma C. Robinson, Bernhard Kainz, Daniel Rueckert
  • for: 这个论文的目的是提出一种快速的端到端框架,用于 diffeomorphic 新生儿大脑表面重建。
  • methods: 该方法使用 Conditional Temporal Attention Network (CoTAN),可以快速预测多尺度 stationary velocity fields (SVF),并通过注意力机制来学习 conditional time-varying velocity field (CTVF)。
  • results: CoTAN 可以减少 mesh 自交错错误,并且只需 0.21 秒可以将 initial template mesh 变换成 cortical white matter 和 pial surfaces。与州际基准相比,CoTAN achieved 0.12mm 的准确性和 0.07% 的自交错错误。
    Abstract Cortical surface reconstruction plays a fundamental role in modeling the rapid brain development during the perinatal period. In this work, we propose Conditional Temporal Attention Network (CoTAN), a fast end-to-end framework for diffeomorphic neonatal cortical surface reconstruction. CoTAN predicts multi-resolution stationary velocity fields (SVF) from neonatal brain magnetic resonance images (MRI). Instead of integrating multiple SVFs, CoTAN introduces attention mechanisms to learn a conditional time-varying velocity field (CTVF) by computing the weighted sum of all SVFs at each integration step. The importance of each SVF, which is estimated by learned attention maps, is conditioned on the age of the neonates and varies with the time step of integration. The proposed CTVF defines a diffeomorphic surface deformation, which reduces mesh self-intersection errors effectively. It only requires 0.21 seconds to deform an initial template mesh to cortical white matter and pial surfaces for each brain hemisphere. CoTAN is validated on the Developing Human Connectome Project (dHCP) dataset with 877 3D brain MR images acquired from preterm and term born neonates. Compared to state-of-the-art baselines, CoTAN achieves superior performance with only 0.12mm geometric error and 0.07% self-intersecting faces. The visualization of our attention maps illustrates that CoTAN indeed learns coarse-to-fine surface deformations automatically without intermediate supervision.
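
A schematic PyTorch sketch of the conditional time-varying velocity field described above: an attention-weighted sum of multi-resolution SVFs, with weights conditioned on age and the integration step. The interface of `attention_net` and the conditioning variables are assumptions of this sketch, not details from the paper.

```python
import torch
import torch.nn.functional as F

def conditional_velocity(svfs, attention_net, age, step, total_steps):
    """svfs: (R, 3, D, H, W) multi-resolution stationary velocity fields.
    attention_net: small module mapping (age, normalized step) -> R logits (assumed interface).
    Returns the conditional time-varying velocity field for this integration step."""
    cond = torch.tensor([age, step / total_steps], dtype=svfs.dtype, device=svfs.device)
    weights = F.softmax(attention_net(cond), dim=-1)              # (R,) attention over the SVFs
    return (weights.view(-1, 1, 1, 1, 1) * svfs).sum(dim=0)       # weighted sum at this step

# During integration, the field returned for step s displaces the current mesh vertices
# (after interpolating it at their positions), so the overall deformation varies with time.
```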

Digital Modeling on Large Kernel Metamaterial Neural Network

  • paper_url: http://arxiv.org/abs/2307.11862
  • repo_url: None
  • paper_authors: Quan Liu, Hanyu Zheng, Brandon T. Swartz, Ho hin Lee, Zuhayr Asad, Ivan Kravchenko, Jason G. Valentine, Yuankai Huo
  • for: 这篇论文旨在提出一种新的大核(large kernel)超材料神经网络(LMNN),以最大化现有超材料神经网络(MNN)的数字容量,同时显式考虑其光学物理限制。
  • methods: 本论文使用了模型重新参数化和网络压缩,以提高MNN的学习能力,并考虑了光学限制。
  • results: 实验结果显示,提案的LMNN可以提高分类精度,同时降低计算延迟。
    Abstract Deep neural networks (DNNs) utilized recently are physically deployed with computational units (e.g., CPUs and GPUs). Such a design might lead to a heavy computational burden, significant latency, and intensive power consumption, which are critical limitations in applications such as the Internet of Things (IoT), edge computing, and the usage of drones. Recent advances in optical computational units (e.g., metamaterial) have shed light on energy-free and light-speed neural networks. However, the digital design of the metamaterial neural network (MNN) is fundamentally limited by its physical limitations, such as precision, noise, and bandwidth during fabrication. Moreover, the unique advantages of MNN's (e.g., light-speed computation) are not fully explored via standard 3x3 convolution kernels. In this paper, we propose a novel large kernel metamaterial neural network (LMNN) that maximizes the digital capacity of the state-of-the-art (SOTA) MNN with model re-parametrization and network compression, while also considering the optical limitation explicitly. The new digital learning scheme can maximize the learning capacity of MNN while modeling the physical restrictions of meta-optic. With the proposed LMNN, the computation cost of the convolutional front-end can be offloaded into fabricated optical hardware. The experimental results on two publicly available datasets demonstrate that the optimized hybrid design improved classification accuracy while reducing computational latency. The development of the proposed LMNN is a promising step towards the ultimate goal of energy-free and light-speed AI.
    摘要 深度神经网络(DNN)最近的应用中使用了计算单元(例如CPU和GPU)。这种设计可能会导致重大的计算负担、显著的延迟和高度的能量投入,这些限制在互联网智能(IoT)、边缘计算和无人机应用中是关键的。最近的光学计算单元(例如元material)的进步暴露了无需能源和光速神经网络。然而,光学计算单元的数字设计(MNN)的物理限制,如精度、噪声和带宽,会在制造过程中带来限制。此外,MNN的独特优势(例如光速计算)还没有通过标准3x3卷积核来完全发挥。在本文中,我们提出了一种新的大型卷积金刚物理神经网络(LMNN),该网络可以最大化MNN的数字能力,同时考虑光学限制。新的数字学习方案可以最大化MNN的学习能力,同时模拟光学限制。通过我们的LMNN,计算前端的计算成本可以被卷积到制造的光学硬件上。实验结果表明,使用我们提出的LMNN可以提高分类精度,同时减少计算延迟。开发LMNN是通向无需能源和光速AI的最终目标的一步。

Deep Learning Hyperspectral Pansharpening on large scale PRISMA dataset

  • paper_url: http://arxiv.org/abs/2307.11666
  • repo_url: None
  • paper_authors: Simone Zini, Mirko Paolo Barbato, Flavio Piccoli, Paolo Napoletano
  • for: 这篇论文旨在评估多种深度学习策略在高光谱全色锐化(pansharpening)任务上的表现。
  • methods: 作者将多种现有的深度学习方法适配到PRISMA卫星获得的高光谱数据上,并在降分辨率(RR)和全分辨率(FR)两种协议下进行定量与定性评估。
  • results: 研究发现,数据驱动的神经网络方法在RR和FR协议中均优于不含机器学习的方法,并能更好地适应高光谱全色锐化任务。
    Abstract In this work, we assess several deep learning strategies for hyperspectral pansharpening. First, we present a new dataset with a greater extent than any other in the state of the art. This dataset, collected using the ASI PRISMA satellite, covers about 262200 km2, and its heterogeneity is granted by randomly sampling the Earth's soil. Second, we adapted several state of the art approaches based on deep learning to fit PRISMA hyperspectral data and then assessed, quantitatively and qualitatively, the performance in this new scenario. The investigation has included two settings: Reduced Resolution (RR) to evaluate the techniques in a supervised environment and Full Resolution (FR) for a real-world evaluation. The main purpose is the evaluation of the reconstruction fidelity of the considered methods. In both scenarios, for the sake of completeness, we also included machine-learning-free approaches. From this extensive analysis has emerged that data-driven neural network methods outperform machine-learning-free approaches and adapt better to the task of hyperspectral pansharpening, both in RR and FR protocols.
    摘要 在这项工作中,我们评估了多种深度学习策略用于高光谱全色锐化。首先,我们提供了一个新的数据集,其覆盖范围比现有任何数据集都更大。这个数据集通过ASI PRISMA卫星收集,覆盖约262200 km2的区域,并通过随机采样地球地表来保证其多样性。其次,我们将一些现有的深度学习方法适配到PRISMA高光谱数据上,然后定量和定性地评估了这些方法在这一新场景下的性能。评估包含两种设置:降分辨率(RR)用于在有监督环境下评估这些方法,全分辨率(FR)用于真实世界评估。主要目的是评估所考察方法的重建精度。为完整起见,在两种场景下我们还包括了不含机器学习的方法。从这项广泛的分析中我们发现,数据驱动的神经网络方法优于不含机器学习的方法,并在RR和FR协议下都能更好地适应高光谱全色锐化任务。

cs.SD - 2023-07-21

Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding

  • paper_url: http://arxiv.org/abs/2307.11005
  • repo_url: None
  • paper_authors: Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe
  • for: 这个研究旨在提出一个三阶段终端对话系统(E2E SLU),实现对话识别和语言模型(LM)的统合,并且解决预先训练的语音识别(ASR)和LM之间的词汇差异问题。
  • methods: 本研究使用三阶段E2E SLU系统,首先使用ASR子网络预测ASR转译,然后使用LM子网络做初步的SLU预测,最后使用妥协子网络根据ASR和LM子网络的表现进行最终预测。
  • results: 根据两个SLU资料集(SLURP和SLUE)的实验结果,提出的三阶段E2E SLU系统在处理具有声音挑战的声明时表现更好,特别是在SLUE资料集上。
    Abstract There has been an increased interest in the integration of pretrained speech recognition (ASR) and language models (LM) into the SLU framework. However, prior methods often struggle with a vocabulary mismatch between pretrained models, and LM cannot be directly utilized as they diverge from its NLU formulation. In this study, we propose a three-pass end-to-end (E2E) SLU system that effectively integrates ASR and LM subnetworks into the SLU formulation for sequence generation tasks. In the first pass, our architecture predicts ASR transcripts using the ASR subnetwork. This is followed by the LM subnetwork, which makes an initial SLU prediction. Finally, in the third pass, the deliberation subnetwork conditions on representations from the ASR and LM subnetworks to make the final prediction. Our proposed three-pass SLU system shows improved performance over cascaded and E2E SLU models on two benchmark SLU datasets, SLURP and SLUE, especially on acoustically challenging utterances.
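
A rough sketch of the three-pass structure described above, with small GRU/attention stand-ins for the pretrained ASR and LM subnetworks; the module names and sizes are invented for illustration and are not taken from the paper.

```python
# Three-pass idea: (1) an ASR subnetwork produces a transcript representation,
# (2) an LM subnetwork makes an initial SLU prediction from it, and (3) a
# deliberation subnetwork attends over both representations for the final output.
import torch
import torch.nn as nn

class ThreePassSLU(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=1000, num_intents=20):
        super().__init__()
        self.asr_encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # stands in for a pretrained ASR
        self.asr_head = nn.Linear(hidden, vocab)
        self.lm_encoder = nn.GRU(vocab, hidden, batch_first=True)      # stands in for a pretrained LM
        self.lm_head = nn.Linear(hidden, num_intents)
        self.deliberation = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.final_head = nn.Linear(hidden, num_intents)

    def forward(self, speech):                        # speech: (B, T, feat_dim)
        asr_hidden, _ = self.asr_encoder(speech)      # pass 1: ASR representation
        asr_logits = self.asr_head(asr_hidden)        # token posteriors (transcript)
        lm_hidden, _ = self.lm_encoder(asr_logits.softmax(-1))  # pass 2: LM over soft transcript
        initial_slu = self.lm_head(lm_hidden[:, -1])             # initial SLU prediction
        fused, _ = self.deliberation(lm_hidden, asr_hidden, asr_hidden)  # pass 3: condition on both
        final_slu = self.final_head(fused[:, -1])
        return asr_logits, initial_slu, final_slu

model = ThreePassSLU()
asr, first, final = model(torch.randn(2, 50, 80))
print(asr.shape, first.shape, final.shape)
```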

eess.AS - 2023-07-21

Topic Identification For Spontaneous Speech: Enriching Audio Features With Embedded Linguistic Information

  • paper_url: http://arxiv.org/abs/2307.11450
  • repo_url: https://github.com/aalto-speech/Topic-identification-for-spontaneous-Finnish-speech
  • paper_authors: Dejan Porjazovski, Tamás Grósz, Mikko Kurimo
  • for: 这篇论文旨在检验非标准的话语识别方案,以寻找不需要自动话语识别系统(ASR)的解决方案。
  • methods: 这篇论文使用了音频只和多模态组合方法来识别非标准的芬兰语。
  • results: 研究发现,听音只的方法在ASR系统不可用时是一个可行的选择,而多模态组合方法在识别性能上表现最佳。
    Abstract Traditional topic identification solutions from audio rely on an automatic speech recognition system (ASR) to produce transcripts used as input to a text-based model. These approaches work well in high-resource scenarios, where there are sufficient data to train both components of the pipeline. However, in low-resource situations, the ASR system, even if available, produces low-quality transcripts, leading to a bad text-based classifier. Moreover, spontaneous speech containing hesitations can further degrade the performance of the ASR model. In this paper, we investigate alternatives to the standard text-only solutions by comparing audio-only and hybrid techniques of jointly utilising text and audio features. The models evaluated on spontaneous Finnish speech demonstrate that purely audio-based solutions are a viable option when ASR components are not available, while the hybrid multi-modal solutions achieve the best results.
    摘要 传统的话题识别解决方案从音频中获得的听写系统(ASR)生成的讲解作为输入,用文本基于模型进行识别。这些方法在高资源场景下工作良好,因为可以在训练两个组件的气候下进行训练。然而,在低资源情况下,即使有ASR系统,也会生成低质量的讲解,导致文本基于模型的性能下降。此外,不慎的语音中的停顿也可能使ASR模型的性能下降。在这篇论文中,我们调查了标准文本仅解决方案的代替方案,比较音频仅、多模态融合等方法的性能。我们在自然的芬兰语音中评估了这些模型,得到的结论是:当ASR组件不可用时,听写仅的解决方案是一个可靠的选择;而多模态融合解决方案在性能上表现最佳。

MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems

  • paper_url: http://arxiv.org/abs/2307.11394
  • repo_url: None
  • paper_authors: Thilo von Neumann, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach
  • for: 这个论文是为了评估各种会议笔记系统而编写的。
  • methods: 这个论文使用了一个开源的工具kit来评估会议笔记系统的评估方法,包括通用的Word Error Rates(WER)计算,以及一些特定的WER定义,如cpWER、ORC WER和MIMO WER。此外,它还提供了一种基于时间约束的cpWER计算方法,以提高匹配假设字符串和参照字符串的匹配质量。
  • results: 这个论文的结果表明,基于时间约束的cpWER计算方法可以提高匹配质量,同时也可以提高匹配速度。此外,这个方法还可以使用不准确的时间标签来进行匹配,从而降低了计算成本。
    Abstract MeetEval is an open-source toolkit to evaluate all kinds of meeting transcription systems. It provides a unified interface for the computation of commonly used Word Error Rates (WERs), specifically cpWER, ORC WER and MIMO WER along other WER definitions. We extend the cpWER computation by a temporal constraint to ensure that only words are identified as correct when the temporal alignment is plausible. This leads to a better quality of the matching of the hypothesis string to the reference string that more closely resembles the actual transcription quality, and a system is penalized if it provides poor time annotations. Since word-level timing information is often not available, we present a way to approximate exact word-level timings from segment-level timings (e.g., a sentence) and show that the approximation leads to a similar WER as a matching with exact word-level annotations. At the same time, the time constraint leads to a speedup of the matching algorithm, which outweighs the additional overhead caused by processing the time stamps.
    摘要 美特评估是一个开源工具kit,用于评估各种会议笔记系统。它提供一个统一的接口来计算常用的单词错误率(WER),包括cpWER、ORC WER 和 MIMO WER 等 WER 定义。我们在cpWER 计算中添加了时间约束,以确保只有在时间对齐是可能的时候才认为单词是正确的。这会导致匹配假设字符串与参考字符串的匹配更加精准,系统会受到负面抑制,如果它提供了低质量的时间标记。由于单词水平的时间信息通常不可用,我们提出了一种将 sentence 级别的时间信息约化为单词级别的时间信息的方法,并证明这种约化导致与匹配精度相似的 WER。同时,时间约束会使匹配算法加速,这些加速的效果超过了对处理时间戳的额外开销。
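
The segment-to-word timing approximation can be illustrated in a few lines of code; the proportional split by character length and the overlap collar below are assumptions for illustration, not MeetEval's exact rules.

```python
# Approximate word-level timestamps from a segment-level annotation: the segment
# duration is split across words in proportion to their character length. Such
# pseudo word timings can then feed a time-constrained cpWER-style matching,
# where a hypothesis word only matches a reference word with a plausible overlap.
def approximate_word_timings(segment_text, seg_start, seg_end):
    words = segment_text.split()
    total_chars = sum(len(w) for w in words) or 1
    duration = seg_end - seg_start
    timings, cursor = [], seg_start
    for w in words:
        w_dur = duration * len(w) / total_chars
        timings.append((w, cursor, cursor + w_dur))
        cursor += w_dur
    return timings

def plausible_match(ref_interval, hyp_interval, collar=1.0):
    # Only allow a match if the hypothesis interval overlaps the reference
    # interval extended by a collar (in seconds).
    r0, r1 = ref_interval
    h0, h1 = hyp_interval
    return h0 < r1 + collar and h1 > r0 - collar

words = approximate_word_timings("the quick brown fox", 10.0, 12.0)
print(words)
print(plausible_match((10.0, 10.4), (10.3, 10.6)))
```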

cs.CV - 2023-07-21

FEDD – Fair, Efficient, and Diverse Diffusion-based Lesion Segmentation and Malignancy Classification

  • paper_url: http://arxiv.org/abs/2307.11654
  • repo_url: https://github.com/hectorcarrion/fedd
  • paper_authors: Héctor Carrión, Narges Norouzi
  • for: 这项研究旨在提高皮肤科图像诊断的可及性,以提供更加公平和准确的皮肤病变分割与恶性分类。
  • methods: 这项研究提出了名为FEDD的框架,利用去噪扩散概率模型主干学习具有语义意义的特征嵌入,并使用线性探针进行分割和分类。
  • results: 这项研究获得了显著改进:在仅使用5%、10%、15%和20%标注数据的情况下,intersection over union分别提高了0.18、0.13、0.06和0.07;此外,FEDD在10%的DDI数据上获得了81%的恶性分类精度,比现有最佳方法高14%。
    Abstract Skin diseases affect millions of people worldwide, across all ethnicities. Increasing diagnosis accessibility requires fair and accurate segmentation and classification of dermatology images. However, the scarcity of annotated medical images, especially for rare diseases and underrepresented skin tones, poses a challenge to the development of fair and accurate models. In this study, we introduce a Fair, Efficient, and Diverse Diffusion-based framework for skin lesion segmentation and malignancy classification. FEDD leverages semantically meaningful feature embeddings learned through a denoising diffusion probabilistic backbone and processes them via linear probes to achieve state-of-the-art performance on Diverse Dermatology Images (DDI). We achieve an improvement in intersection over union of 0.18, 0.13, 0.06, and 0.07 while using only 5%, 10%, 15%, and 20% labeled samples, respectively. Additionally, FEDD trained on 10% of DDI demonstrates malignancy classification accuracy of 81%, 14% higher compared to the state-of-the-art. We showcase high efficiency in data-constrained scenarios while providing fair performance for diverse skin tones and rare malignancy conditions. Our newly annotated DDI segmentation masks and training code can be found on https://github.com/hectorcarrion/fedd.
    摘要 世界各地有数百万人受皮肤疾病影响,遍及所有族裔。为了提高诊断的可及性,需要对皮肤科图像进行公平、准确的分割和分类。然而,标注医疗图像的缺乏,尤其是罕见疾病和代表性不足的肤色,给公平、准确模型的开发带来了挑战。本研究提出了一个公平、高效、多样的基于扩散的框架(FEDD),用于皮肤病变分割和恶性分类。FEDD利用去噪扩散概率模型主干学习具有语义意义的特征嵌入,并通过线性探针进行处理,在多样皮肤科图像数据集(DDI)上达到了最新的性能水平。在仅使用5%、10%、15%和20%标注样本的情况下,我们将intersection over union分别提高了0.18、0.13、0.06和0.07。此外,在10%的DDI上训练的FEDD达到了81%的恶性分类精度,比现有最佳方法高14%。我们在数据受限的情况下展现了高效率,同时为多样的肤色和罕见的恶性病变情况提供了公平的性能。我们新标注的DDI分割掩码和训练代码可在 https://github.com/hectorcarrion/fedd 获取。
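
A hedged sketch of the general recipe the abstract describes (frozen diffusion features plus a linear probe); the backbone below is a stand-in convolutional encoder and the noising schedule is simplified, since the actual FEDD model uses a pretrained denoising diffusion backbone.

```python
# Noise the image to a chosen timestep, run the frozen denoising backbone, and
# train only a linear (1x1 conv) probe on the per-pixel embeddings.
import torch
import torch.nn as nn

backbone = nn.Sequential(                 # placeholder for a frozen DDPM encoder
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
)
for p in backbone.parameters():
    p.requires_grad_(False)

linear_probe = nn.Conv2d(128, 2, kernel_size=1)   # 1x1 conv == per-pixel linear probe

def noisy(x, t=0.3):
    # simplified DDPM-style forward noising at a fixed timestep fraction t
    return (1 - t) ** 0.5 * x + t ** 0.5 * torch.randn_like(x)

image = torch.rand(4, 3, 64, 64)
mask = torch.randint(0, 2, (4, 64, 64))            # lesion / background labels

features = backbone(noisy(image))                  # frozen semantic embeddings
logits = linear_probe(features)
loss = nn.functional.cross_entropy(logits, mask)
loss.backward()                                    # only the probe receives gradients
print(loss.item())
```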

Deep Reinforcement Learning Based System for Intraoperative Hyperspectral Video Autofocusing

  • paper_url: http://arxiv.org/abs/2307.11638
  • repo_url: None
  • paper_authors: Charlie Budd, Jianrong Qiu, Oscar MacCormac, Martin Huber, Christopher Mower, Mirek Janatka, Théo Trotouin, Jonathan Shapey, Mads S. Bergholt, Tom Vercauteren
  • for: 这篇论文旨在探讨使用高spectral成像技术进行手持式实时视频HSIC的可用性问题,并提出了一种基于深度学习的自动对焦方法来解决这个问题。
  • methods: 这篇论文使用了一种可变焦点的液体镜来嵌入视频HSIC镜头,并提出了一种基于深度学习的自动对焦算法来解决视频HSIC中的对焦问题。
  • results: 实验结果显示,该新的自动对焦算法比传统的对焦策略更好 ($p<0.05$),其中平均焦点误差为 $0.070\pm.098$,比传统策略的 $0.146\pm.148$ 更低。此外,两名 neurosurgeon 在对不同自动对焦策略的比较中,认为该新的自动对焦算法最为有利,因此该系统在实际应用中是一个可靠的选择。
    Abstract Hyperspectral imaging (HSI) captures a greater level of spectral detail than traditional optical imaging, making it a potentially valuable intraoperative tool when precise tissue differentiation is essential. Hardware limitations of current optical systems used for handheld real-time video HSI result in a limited focal depth, thereby posing usability issues for integration of the technology into the operating room. This work integrates a focus-tunable liquid lens into a video HSI exoscope, and proposes novel video autofocusing methods based on deep reinforcement learning. A first-of-its-kind robotic focal-time scan was performed to create a realistic and reproducible testing dataset. We benchmarked our proposed autofocus algorithm against traditional policies, and found our novel approach to perform significantly ($p<0.05$) better than traditional techniques ($0.070\pm.098$ mean absolute focal error compared to $0.146\pm.148$). In addition, we performed a blinded usability trial by having two neurosurgeons compare the system with different autofocus policies, and found our novel approach to be the most favourable, making our system a desirable addition for intraoperative HSI.
    摘要 高spectral成像(HSI)可以捕捉更多的spectral特征,使其成为可能有价值的实时操作中的工具,当精确的组织区分是必要时。现有的光学系统的硬件限制使得实时视频HSI的焦点深度受限,从而导致了技术的可用性问题。这项工作将射频调整液体镜组合到视频HSI外Scope中,并提出了基于深度学习的视频自动对焦方法。我们首次实现了Robotics focal-time扫描,并创建了可信度评估dataset。我们对我们提出的自动对焦算法与传统策略进行比较,发现我们的新方法在($p<0.05$)的显著程度上 ($0.070\pm.098$的平均焦点错误与$0.146\pm.148$之间)。此外,我们进行了隐身用户试验,让两名 neurosurgeon比较不同的自动对焦策略,发现我们的新方法是最有优势,使我们的系统成为实时HSI中的欢迎加入。

Divide and Adapt: Active Domain Adaptation via Customized Learning

  • paper_url: http://arxiv.org/abs/2307.11618
  • repo_url: https://github.com/duojun-huang/diana-cvpr2023
  • paper_authors: Duojun Huang, Jichang Li, Weikai Chen, Junshi Huang, Zhenhua Chai, Guanbin Li
  • for: 这个研究旨在提高适应性模型的适应性表现,通过整合活动学习(AL)技术来标识目标数据中最有价的子集。
  • methods: 这个研究使用了一个名为Divide-and-Adapt(DiaNA)的新的适应领域活动整合框架,将目标实例分为四个类别,每个类别都具有透明的转移性质。
  • results: DiaNA可以精确地识别最有价的数据,并且可以适应不同的领域转移Setting,例如无监督领域转移(UDA)、半监督领域转移(SSDA)、源自由领域转移(SFDA)等。
    Abstract Active domain adaptation (ADA) aims to improve the model adaptation performance by incorporating active learning (AL) techniques to label a maximally-informative subset of target samples. Conventional AL methods do not consider the existence of domain shift, and hence, fail to identify the truly valuable samples in the context of domain adaptation. To accommodate active learning and domain adaption, the two naturally different tasks, in a collaborative framework, we advocate that a customized learning strategy for the target data is the key to the success of ADA solutions. We present Divide-and-Adapt (DiaNA), a new ADA framework that partitions the target instances into four categories with stratified transferable properties. With a novel data subdivision protocol based on uncertainty and domainness, DiaNA can accurately recognize the most gainful samples. While sending the informative instances for annotation, DiaNA employs tailored learning strategies for the remaining categories. Furthermore, we propose an informativeness score that unifies the data partitioning criteria. This enables the use of a Gaussian mixture model (GMM) to automatically sample unlabeled data into the proposed four categories. Thanks to the "divideand-adapt" spirit, DiaNA can handle data with large variations of domain gap. In addition, we show that DiaNA can generalize to different domain adaptation settings, such as unsupervised domain adaptation (UDA), semi-supervised domain adaptation (SSDA), source-free domain adaptation (SFDA), etc.
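
The divide step can be pictured roughly as follows; the uncertainty and domainness scores and their fusion are illustrative stand-ins for DiaNA's criteria, with scikit-learn's Gaussian mixture performing the four-way split of the unlabeled pool.

```python
# Each target sample gets an uncertainty score and a domainness score, the two
# are fused into one informativeness value, and a 4-component Gaussian mixture
# splits the unlabeled pool into four categories with different treatment.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=500)          # classifier softmax outputs
domain_logit = rng.standard_normal(500)               # domain discriminator output

uncertainty = -(probs * np.log(probs + 1e-8)).sum(1)  # predictive entropy
domainness = 1.0 / (1.0 + np.exp(-domain_logit))      # prob. of being target-like
informativeness = 0.5 * uncertainty + 0.5 * domainness

gmm = GaussianMixture(n_components=4, random_state=0)
category = gmm.fit_predict(informativeness.reshape(-1, 1))

# The highest-mean component would be sent for annotation; the remaining
# categories get their own (e.g., self-training or pseudo-labeling) strategies.
to_annotate = category == np.argmax(gmm.means_.ravel())
print(to_annotate.sum(), "samples selected for labeling")
```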

Consistency-guided Meta-Learning for Bootstrapping Semi-Supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.11604
  • repo_url: https://github.com/aijinrjinr/mlb-seg
  • paper_authors: Qingyue Wei, Lequan Yu, Xianhang Li, Wei Shao, Cihang Xie, Lei Xing, Yuyin Zhou
  • for: 这个研究旨在解决医疗影像识别的 semi-supervised 学习问题,提高医疗影像识别的效率和精度。
  • methods: 我们提出了一个名为 Meta-Learning for Bootstrapping Medical Image Segmentation (MLB-Seg) 的新方法,具有以下三个特点:首先,我们使用一小批清洁标注的影像进行训练,以生成初始的标签 для无标注的数据。其次,我们引入了一个像素类别映射系统,将像素标签和模型预测的结果相互映射。这些映射是基于一个小型的精心标注影像集,并且使用一个基于这些标注的meta-过程来决定这些映射。最后,我们引入了一个内部的Consistency-based Pseudo Label Enhancement (PLE) 方法,用于提高模型预测的品质。
  • results: 我们的实验结果显示,我们的提出的方法在 semi-supervised 下可以 achieve state-of-the-art 的效果,并且在两个公开的颈部和肾脏分类数据集上实现了显著的改善。
    Abstract Medical imaging has witnessed remarkable progress but usually requires a large amount of high-quality annotated data which is time-consuming and costly to obtain. To alleviate this burden, semi-supervised learning has garnered attention as a potential solution. In this paper, we present Meta-Learning for Bootstrapping Medical Image Segmentation (MLB-Seg), a novel method for tackling the challenge of semi-supervised medical image segmentation. Specifically, our approach first involves training a segmentation model on a small set of clean labeled images to generate initial labels for unlabeled data. To further optimize this bootstrapping process, we introduce a per-pixel weight mapping system that dynamically assigns weights to both the initialized labels and the model's own predictions. These weights are determined using a meta-process that prioritizes pixels with loss gradient directions closer to those of clean data, which is based on a small set of precisely annotated images. To facilitate the meta-learning process, we additionally introduce a consistency-based Pseudo Label Enhancement (PLE) scheme that improves the quality of the model's own predictions by ensembling predictions from various augmented versions of the same input. In order to improve the quality of the weight maps obtained through multiple augmentations of a single input, we introduce a mean teacher into the PLE scheme. This method helps to reduce noise in the weight maps and stabilize its generation process. Our extensive experimental results on public atrial and prostate segmentation datasets demonstrate that our proposed method achieves state-of-the-art results under semi-supervision. Our code is available at https://github.com/aijinrjinr/MLB-Seg.
    摘要 医疗影像技术在过去几年内发展非常remarkable,但通常需要大量高质量标注数据,这是时间consuming和costly的。为了解决这个问题,半supervised learning在这个领域得到了关注。在这篇论文中,我们提出了一种新的方法:Meta-Learning for Bootstrapping Medical Image Segmentation(MLB-Seg),用于解决半supervised的医疗影像分割挑战。具体来说,我们的方法首先在一小个清洁标注图像上训练一个分割模型,以生成初始标签 для无标注数据。为了进一步优化这个启动过程,我们引入了一个像素Weight Mapping系统,该系统在运行时动态分配像素的标签和模型自己预测的标签之间的权重。这些权重是基于一个小型的精确标注图像来确定的,这个过程是基于一个Meta-process来进行。为了促进这个Meta-学习过程,我们还引入了一种Consistency-based Pseudo Label Enhancement(PLE)方案,该方案可以提高模型自己的预测质量。为了进一步提高Weight map的质量,我们引入了一个Mean Teacher。这种方法可以减少Weight map中的噪音,并使其生成过程更加稳定。我们对公共的atrial和prostate分割数据集进行了广泛的实验,结果表明,我们提出的方法在半supervision下可以达到状态级result。我们的代码可以在https://github.com/aijinrjinr/MLB-Seg上获取。
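
The consistency-based pseudo-label enhancement with a mean teacher can be sketched as below; the augmentations, shapes, and EMA momentum are illustrative assumptions rather than the paper's settings.

```python
# An EMA "mean teacher" predicts on several flipped copies of the same image and
# the de-augmented predictions are averaged into a smoother pseudo label.
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 2, 1))
teacher = copy.deepcopy(student)           # mean teacher: EMA copy of the student

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1 - momentum)

@torch.no_grad()
def enhanced_pseudo_label(image):
    flips = [lambda x: x, lambda x: torch.flip(x, dims=[-1])]
    preds = []
    for f in flips:
        logits = teacher(f(image))
        preds.append(f(logits.softmax(1)))    # undo the flip before averaging
    return torch.stack(preds).mean(0)         # ensembled soft pseudo label

image = torch.rand(2, 1, 32, 32)
pseudo = enhanced_pseudo_label(image)
ema_update(teacher, student)
print(pseudo.shape)
```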

Cascaded multitask U-Net using topological loss for vessel segmentation and centerline extraction

  • paper_url: http://arxiv.org/abs/2307.11603
  • repo_url: None
  • paper_authors: Pierre Rougé, Nicolas Passat, Odyssée Merveille
  • for: 这篇论文主要针对的是血管疾病诊断工具中的血管分割和中心线提取任务。
  • methods: 这篇论文提出了一种使用深度学习方法进行血管分割和中心线提取任务,并使用了clDice损失函数来保证结果的准确性。
  • results: 该方法可以提供更加准确的血管分割和中心线提取结果,并且可以在3D图像上进行更加有效的血管分割和中心线提取。
    Abstract Vessel segmentation and centerline extraction are two crucial preliminary tasks for many computer-aided diagnosis tools dealing with vascular diseases. Recently, deep-learning based methods have been widely applied to these tasks. However, classic deep-learning approaches struggle to capture the complex geometry and specific topology of vascular networks, which is of the utmost importance in most applications. To overcome these limitations, the clDice loss, a topological loss that focuses on the vessel centerlines, has been recently proposed. This loss requires computing, with a proposed soft-skeleton algorithm, the skeletons of both the ground truth and the predicted segmentation. However, the soft-skeleton algorithm provides suboptimal results on 3D images, which makes the clDice hardly suitable on 3D images. In this paper, we propose to replace the soft-skeleton algorithm by a U-Net which computes the vascular skeleton directly from the segmentation. We show that our method provides more accurate skeletons than the soft-skeleton algorithm. We then build upon this network a cascaded U-Net trained with the clDice loss to embed topological constraints during the segmentation. The resulting model is able to predict both the vessel segmentation and centerlines with a more accurate topology.
    摘要 血管分割和中心线提取是许多处理血管疾病的计算机辅助诊断工具中的两个重要前期任务。近年来,基于深度学习的方法被广泛应用于这两个任务。然而,经典的深度学习方法很难捕捉血管网络的复杂几何结构和特定拓扑,而这在大多数应用中至关重要。为克服这些限制,最近提出了clDice损失,这是一种关注血管中心线的拓扑损失。该损失需要使用所提出的软骨架(soft-skeleton)算法,分别计算真实标注和预测分割的骨架。然而,软骨架算法在3D图像上的结果并不理想,使得clDice难以应用于3D图像。在本文中,我们提出用一个U-Net取代软骨架算法,该网络直接从分割结果中计算血管骨架。我们表明,我们的方法能够提供比软骨架算法更准确的骨架。随后,我们在此网络之上构建了一个使用clDice损失训练的级联U-Net,以在分割过程中嵌入拓扑约束。最终得到的模型能够同时预测血管分割和中心线,且拓扑结构更加准确。
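
For reference, the soft clDice loss that the cascaded network is trained with can be written in a few lines; this is the generic formulation, with the skeleton maps here assumed to come from the proposed skeleton U-Net rather than the soft-skeleton algorithm.

```python
# Given soft vessel masks V and soft centerline/skeleton maps S, clDice rewards
# predicted skeletons that stay inside the true vessels and true skeletons that
# stay inside the predicted vessels.
import torch

def soft_cldice(v_pred, v_true, s_pred, s_true, eps=1e-6):
    """All inputs are soft maps in [0, 1] with shape (B, 1, H, W[, D])."""
    dims = tuple(range(1, v_pred.dim()))
    tprec = ((s_pred * v_true).sum(dims) + eps) / (s_pred.sum(dims) + eps)   # topology precision
    tsens = ((s_true * v_pred).sum(dims) + eps) / (s_true.sum(dims) + eps)   # topology sensitivity
    cldice = 2 * tprec * tsens / (tprec + tsens)
    return (1 - cldice).mean()               # loss to minimize

v_pred = torch.rand(2, 1, 64, 64)
v_true = (torch.rand(2, 1, 64, 64) > 0.7).float()
s_pred = torch.rand(2, 1, 64, 64)            # e.g. output of the skeleton U-Net
s_true = (torch.rand(2, 1, 64, 64) > 0.95).float()
print(soft_cldice(v_pred, v_true, s_pred, s_true).item())
```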

CortexMorph: fast cortical thickness estimation via diffeomorphic registration using VoxelMorph

  • paper_url: http://arxiv.org/abs/2307.11567
  • repo_url: None
  • paper_authors: Richard McKinley, Christian Rummel
  • for: This paper aims to improve the efficiency of estimating cortical thickness in magnetic resonance imaging (MRI) studies.
  • methods: The proposed method, CortexMorph, uses unsupervised deep learning to directly regress the deformation field needed for DiReCT, which can significantly reduce the registration time.
  • results: The proposed method can estimate region-wise thickness in seconds from a T1-weighted image, while maintaining the ability to detect cortical atrophy, as validated on the OASIS-3 dataset and the synthetic cortical thickness phantom.
    Abstract The thickness of the cortical band is linked to various neurological and psychiatric conditions, and is often estimated through surface-based methods such as Freesurfer in MRI studies. The DiReCT method, which calculates cortical thickness using a diffeomorphic deformation of the gray-white matter interface towards the pial surface, offers an alternative to surface-based methods. Recent studies using a synthetic cortical thickness phantom have demonstrated that the combination of DiReCT and deep-learning-based segmentation is more sensitive to subvoxel cortical thinning than Freesurfer. While anatomical segmentation of a T1-weighted image now takes seconds, existing implementations of DiReCT rely on iterative image registration methods which can take up to an hour per volume. On the other hand, learning-based deformable image registration methods like VoxelMorph have been shown to be faster than classical methods while improving registration accuracy. This paper proposes CortexMorph, a new method that employs unsupervised deep learning to directly regress the deformation field needed for DiReCT. By combining CortexMorph with a deep-learning-based segmentation model, it is possible to estimate region-wise thickness in seconds from a T1-weighted image, while maintaining the ability to detect cortical atrophy. We validate this claim on the OASIS-3 dataset and the synthetic cortical thickness phantom of Rusak et al.
    摘要 cortical band 厚度与多种神经科学和心理科学疾病相关,通常通过表面基本方法such as Freesurfer在MRI研究中估算。DiReCT方法,它通过Diffuse Deformation of the gray-white matter interface towards the pial surface来计算 cortical thickness,为surface-based方法提供了一种alternative。Recent studies using a synthetic cortical thickness phantom have shown that the combination of DiReCT and deep learning-based segmentation is more sensitive to subvoxel cortical thinning than Freesurfer. However, existing implementations of DiReCT rely on iterative image registration methods, which can take up to an hour per volume. In contrast, learning-based deformable image registration methods like VoxelMorph have been shown to be faster than classical methods while improving registration accuracy. This paper proposes CortexMorph, a new method that employs unsupervised deep learning to directly regress the deformation field needed for DiReCT. By combining CortexMorph with a deep learning-based segmentation model, it is possible to estimate region-wise thickness in seconds from a T1-weighted image, while maintaining the ability to detect cortical atrophy. We validate this claim on the OASIS-3 dataset and the synthetic cortical thickness phantom of Rusak et al.

YOLOPose V2: Understanding and Improving Transformer-based 6D Pose Estimation

  • paper_url: http://arxiv.org/abs/2307.11550
  • repo_url: None
  • paper_authors: Arul Selvam Periyasamy, Arash Amini, Vladimir Tsaturyan, Sven Behnke
  • for: 6D object pose estimation for autonomous robot manipulation applications
  • methods: Transformer-based multi-object 6D pose estimation method using keypoint regression and learnable orientation estimation
  • results: Achieves results comparable to state-of-the-art methods and is suitable for real-time applications
    Abstract 6D object pose estimation is a crucial prerequisite for autonomous robot manipulation applications. The state-of-the-art models for pose estimation are convolutional neural network (CNN)-based. Lately, Transformers, an architecture originally proposed for natural language processing, is achieving state-of-the-art results in many computer vision tasks as well. Equipped with the multi-head self-attention mechanism, Transformers enable simple single-stage end-to-end architectures for learning object detection and 6D object pose estimation jointly. In this work, we propose YOLOPose (short form for You Only Look Once Pose estimation), a Transformer-based multi-object 6D pose estimation method based on keypoint regression and an improved variant of the YOLOPose model. In contrast to the standard heatmaps for predicting keypoints in an image, we directly regress the keypoints. Additionally, we employ a learnable orientation estimation module to predict the orientation from the keypoints. Along with a separate translation estimation module, our model is end-to-end differentiable. Our method is suitable for real-time applications and achieves results comparable to state-of-the-art methods. We analyze the role of object queries in our architecture and reveal that the object queries specialize in detecting objects in specific image regions. Furthermore, we quantify the accuracy trade-off of using datasets of smaller sizes to train our model.
    摘要 6D对象pose估计是自动化机器人控制应用的重要前提。现状的模型都是基于卷积神经网络(CNN)的。最近,Transformers这种原本是自然语言处理领域的 arquitecture 在计算机视觉任务中也取得了状态的前景result。通过多头自注意机制,Transformers 使得单 Stage 末端到末的结构可以用于学习对象检测和6D对象pose估计。在这种工作中,我们提出了 YOLOPose(简称为You Only Look Once Pose estimation),一种基于键点回归和改进的 YOLOPose 模型。与标准的热图来predict键点的方法不同,我们直接进行键点回归。此外,我们使用学习的orientation estimation模块来预测键点的方向。与此同时,我们还使用一个分离的翻译估计模块,使得我们的模型能够在终端到终的可 differentiable。我们的方法适用于实时应用,并达到了与状态前景方法相当的结果。我们分析了我们的建筑中的对象查询的角色,并发现对象查询在特定的图像区域中探测对象。此外,我们量化使用小型数据集来训练我们的模型的准确性交换。

KVN: Keypoints Voting Network with Differentiable RANSAC for Stereo Pose Estimation

  • paper_url: http://arxiv.org/abs/2307.11543
  • repo_url: https://github.com/ivano-donadi/kvn
  • paper_authors: Ivano Donadi, Alberto Pretto
  • for: 本研究旨在提高对象pose estimation的精度,特别是在多视图情况下。
  • methods: 该研究使用了一种新的可导式RANSAC层和多视图PnP算法来解决多视图对象pose estimation问题。
  • results: 实验结果表明,该方法在一个公共的多视图对象pose estimation dataset上达到了当前最佳的结果,并且在对比其他最新方法时表现出了优异性。
    Abstract Object pose estimation is a fundamental computer vision task exploited in several robotics and augmented reality applications. Many established approaches rely on predicting 2D-3D keypoint correspondences using RANSAC (Random sample consensus) and estimating the object pose using the PnP (Perspective-n-Point) algorithm. Being RANSAC non-differentiable, correspondences cannot be directly learned in an end-to-end fashion. In this paper, we address the stereo image-based object pose estimation problem by (i) introducing a differentiable RANSAC layer into a well-known monocular pose estimation network; (ii) exploiting an uncertainty-driven multi-view PnP solver which can fuse information from multiple views. We evaluate our approach on a challenging public stereo object pose estimation dataset, yielding state-of-the-art results against other recent approaches. Furthermore, in our ablation study, we show that the differentiable RANSAC layer plays a significant role in the accuracy of the proposed method. We release with this paper the open-source implementation of our method.
    摘要 对象pose估算是计算机视觉的基本任务,广泛应用于机器人和增强现实领域。许多成熟的方法利用随机抽样一致(RANSAC)预测2D-3D关键点对应关系,并用perspective-n-point(PnP)算法估算对象pose。由于RANSAC不可导,对应关系无法以端到端的方式直接学习。在这篇论文中,我们通过以下方式解决基于立体图像的对象pose估算问题:(i)在一个知名的单目pose估算网络中引入可导的RANSAC层;(ii)利用一个由不确定性驱动的多视图PnP求解器,融合来自多个视图的信息。我们在一个具有挑战性的公共立体对象pose估算数据集上评估了我们的方法,相比其他最新方法取得了最优结果。此外,在消融实验中,我们展示了可导RANSAC层对所提方法精度的重要作用。我们随本文发布了方法的开源实现。
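
The ingredient that makes a RANSAC layer differentiable (soft inlier counting and a softmax over hypotheses) is illustrated below on a generic 2D line-fitting toy, not on KVN's stereo keypoint-voting setup; all thresholds and temperatures are invented for the example.

```python
# Hypotheses are scored with soft inlier counts and combined by a softmax over
# scores, so gradients can flow back into whatever network produced the points.
import torch

def soft_ransac_line(points, num_hypotheses=64, tau=0.1, beta=5.0):
    """points: (N, 2) tensor; returns a soft-selected (slope, intercept)."""
    n = points.shape[0]
    idx = torch.randint(0, n, (num_hypotheses, 2))
    p1, p2 = points[idx[:, 0]], points[idx[:, 1]]               # minimal sets
    slope = (p2[:, 1] - p1[:, 1]) / (p2[:, 0] - p1[:, 0] + 1e-6)
    intercept = p1[:, 1] - slope * p1[:, 0]

    # residual of every point under every hypothesis: (H, N)
    resid = (points[None, :, 1]
             - (slope[:, None] * points[None, :, 0] + intercept[:, None])).abs()
    soft_inliers = torch.sigmoid(beta * (tau - resid)).sum(dim=1)  # differentiable inlier count
    weights = torch.softmax(beta * soft_inliers, dim=0)            # soft argmax over hypotheses
    return (weights * slope).sum(), (weights * intercept).sum()

pts = torch.stack([torch.linspace(0, 1, 50), 2 * torch.linspace(0, 1, 50) + 0.5], dim=1)
pts = pts + 0.01 * torch.randn_like(pts)
pts.requires_grad_(True)
m, b = soft_ransac_line(pts)
(m + b).backward()                     # gradients reach the input points
print(float(m), float(b), pts.grad.shape)
```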

UWAT-GAN: Fundus Fluorescein Angiography Synthesis via Ultra-wide-angle Transformation Multi-scale GAN

  • paper_url: http://arxiv.org/abs/2307.11530
  • repo_url: https://github.com/Tinysqua/UWAT-GAN
  • paper_authors: Zhaojie Fang, Zhanghao Chen, Pengxue Wei, Wangting Li, Shaochong Zhang, Ahmed Elazab, Gangyong Jia, Ruiquan Ge, Changmiao Wang
  • for: The paper proposes a novel conditional generative adversarial network (UWAT-GAN) to synthesize UWF-FA from UWF-SLO, aiming to avoid the negative impacts of injecting sodium fluorescein and improve the resolution of fundus imaging.
  • methods: UWAT-GAN uses multi-scale generators and a fusion module patch to better extract global and local information, plus an attention transmit module to help the decoder learn effectively. The network is trained in a supervised manner with multiple new weighted losses on different scales of data.
  • results: Experiments on an in-house UWF image dataset demonstrate the superiority of UWAT-GAN over state-of-the-art methods, generating high-resolution images and capturing tiny vascular lesion areas.
    Abstract Fundus photography is an essential examination for clinical and differential diagnosis of fundus diseases. Recently, Ultra-Wide-angle Fundus (UWF) techniques, UWF Fluorescein Angiography (UWF-FA) and UWF Scanning Laser Ophthalmoscopy (UWF-SLO) have been gradually put into use. However, Fluorescein Angiography (FA) and UWF-FA require injecting sodium fluorescein which may have detrimental influences. To avoid negative impacts, cross-modality medical image generation algorithms have been proposed. Nevertheless, current methods in fundus imaging could not produce high-resolution images and are unable to capture tiny vascular lesion areas. This paper proposes a novel conditional generative adversarial network (UWAT-GAN) to synthesize UWF-FA from UWF-SLO. Using multi-scale generators and a fusion module patch to better extract global and local information, our model can generate high-resolution images. Moreover, an attention transmit module is proposed to help the decoder learn effectively. Besides, a supervised approach is used to train the network using multiple new weighted losses on different scales of data. Experiments on an in-house UWF image dataset demonstrate the superiority of the UWAT-GAN over the state-of-the-art methods. The source code is available at: https://github.com/Tinysqua/UWAT-GAN.
    摘要 背景:背部照片是诊断背部疾病的重要诊断工具。近些年来,极广角背部照片(UWF)技术,包括UWF氟胺染色(UWF-FA)和UWF扫描镜内眼镜(UWF-SLO),逐渐得到应用。然而,氟胺染色和UWF-FA都需要注射氟胺,可能会产生不良影响。为了避免这些影响,混合模式医学图像生成算法已经被提出。然而,当前的基于背部照片的医学图像生成方法无法生成高分辨率图像,并且无法捕捉微型血管病区。方法:本文提出了一种新的冲拦生成 adversarial network(UWAT-GAN),用于从UWF-SLO中生成UWF-FA。该模型使用多尺度生成器和一个融合模块质子来更好地提取全部和局部信息。此外,我们还提出了一种注意力传输模块,以帮助解码器更好地学习。此外,我们采用了多种新的权重损失方法来训练网络。结果:我们在自有的UWF图像集上进行了实验,并证明了UWAT-GAN的优越性。相比之前的方法,UWAT-GAN能够生成高分辨率图像,并且能够更好地捕捉微型血管病区。代码可以在以下GitHub地址下下载:https://github.com/Tinysqua/UWAT-GAN。

Improving Viewpoint Robustness for Visual Recognition via Adversarial Training

  • paper_url: http://arxiv.org/abs/2307.11528
  • repo_url: None
  • paper_authors: Shouwei Ruan, Yinpeng Dong, Hang Su, Jianteng Peng, Ning Chen, Xingxing Wei
  • for: 提高图像分类器的视点Robustness,使其能够快速和稳定地识别不同视点下的图像。
  • methods: 提出了Viewpoint-Invariant Adversarial Training(VIAT)方法,通过对视点变换视为攻击,形式化为最小化最坏情况下的损失函数,从而获得视点不变的图像分类器。
  • results: VIAT可以显著提高多种图像分类器的视点Robustness,并且可以通过提供证明 radius 和准确率来证明其效果。
    Abstract Viewpoint invariance remains challenging for visual recognition in the 3D world, as altering the viewing directions can significantly impact predictions for the same object. While substantial efforts have been dedicated to making neural networks invariant to 2D image translations and rotations, viewpoint invariance is rarely investigated. Motivated by the success of adversarial training in enhancing model robustness, we propose Viewpoint-Invariant Adversarial Training (VIAT) to improve the viewpoint robustness of image classifiers. Regarding viewpoint transformation as an attack, we formulate VIAT as a minimax optimization problem, where the inner maximization characterizes diverse adversarial viewpoints by learning a Gaussian mixture distribution based on the proposed attack method GMVFool. The outer minimization obtains a viewpoint-invariant classifier by minimizing the expected loss over the worst-case viewpoint distributions that can share the same one for different objects within the same category. Based on GMVFool, we contribute a large-scale dataset called ImageNet-V+ to benchmark viewpoint robustness. Experimental results show that VIAT significantly improves the viewpoint robustness of various image classifiers based on the diversity of adversarial viewpoints generated by GMVFool. Furthermore, we propose ViewRS, a certified viewpoint robustness method that provides a certified radius and accuracy to demonstrate the effectiveness of VIAT from the theoretical perspective.
    摘要 视点不变性仍然是视觉识别在三维世界中的挑战,因为改变观察方向可能会对同一个物体的预测产生很大的影响。虽然大量的努力已经投入到了使神经网络不受二维图像的翻译和旋转的影响,但视点不变性几乎没有被研究。驱动于对抗训练的成功,我们提出了视点不变性对抗训练(VIAT),以提高图像分类器的视点稳定性。我们将视点变换视为攻击,并将VIAT表示为一个对抗最优化问题。内部最大化Characterizes diverse adversarial viewpoints by learning a Gaussian mixture distribution based on the proposed attack method GMVFool.外部最小化获得视点不变的分类器,将预期损失最小化在多种可能的观察方向中的最坏情况下。基于GMVFool,我们贡献了一个大规模的数据集ImageNet-V+,用于评估视点稳定性。实验结果显示,VIAT可以大幅提高多种图像分类器的视点稳定性,并且基于GMVFool生成的多种攻击视点的多样性。此外,我们还提出了ViewRS,一种可证明的视点稳定性方法,可以在理论上证明VIAT的效果。
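
A skeleton of the minimax loop described above; the renderer, the single-Gaussian viewpoint distribution, and all hyperparameters are simplifications for illustration (VIAT uses NeRF-based rendering and a learned Gaussian mixture obtained via GMVFool).

```python
# Inner loop: push the viewpoint distribution toward classifier failures.
# Outer loop: train the classifier on views sampled from that distribution.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 10))
base_image = torch.rand(1, 1, 32, 32)
label = torch.tensor([3])

def render(image, angle):
    # stand-in for the differentiable novel-view renderer used in the paper
    return image * torch.cos(angle) + 0.1 * torch.sin(angle)

mu = torch.zeros(1, requires_grad=True)          # viewpoint distribution parameters
log_sigma = torch.zeros(1, requires_grad=True)
opt_view = torch.optim.Adam([mu, log_sigma], lr=0.1)
opt_cls = torch.optim.Adam(classifier.parameters(), lr=1e-3)

for outer in range(10):
    for inner in range(5):                       # inner maximization over viewpoints
        angle = mu + log_sigma.exp() * torch.randn(1)   # reparameterized sample
        loss = nn.functional.cross_entropy(
            classifier(render(base_image, angle)), label)
        opt_view.zero_grad()
        (-loss).backward()                       # ascend the classification loss
        opt_view.step()
    angle = (mu + log_sigma.exp() * torch.randn(1)).detach()
    loss = nn.functional.cross_entropy(classifier(render(base_image, angle)), label)
    opt_cls.zero_grad()
    loss.backward()                              # outer minimization of the worst-case loss
    opt_cls.step()
print(float(loss))
```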

CopyRNeRF: Protecting the CopyRight of Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2307.11526
  • repo_url: None
  • paper_authors: Ziyuan Luo, Qing Guo, Ka Chun Cheung, Simon See, Renjie Wan
  • for: 保护NeRF模型的版权
  • methods: 使用水印颜色表示法保护NeRF模型的版权,并设计了一种鲁棒的渲染方案来保证NeRF模型的抽象稳定性
  • results: 比较 Optional solutions 中的不同方法,提出了一种能够直接保护NeRF模型的版权,同时保持高质量渲染和位准率的方法
    Abstract Neural Radiance Fields (NeRF) have the potential to be a major representation of media. Since training a NeRF has never been an easy task, the protection of its model copyright should be a priority. In this paper, by analyzing the pros and cons of possible copyright protection solutions, we propose to protect the copyright of NeRF models by replacing the original color representation in NeRF with a watermarked color representation. Then, a distortion-resistant rendering scheme is designed to guarantee robust message extraction in 2D renderings of NeRF. Our proposed method can directly protect the copyright of NeRF models while maintaining high rendering quality and bit accuracy when compared among optional solutions.
    摘要 neural radiance fields (NeRF) 有可能成为媒体表示的主要形式。由于训练 NeRF 从来不是一件容易的任务,因此保护其模型版权应该是优先事项。在这篇论文中,我们通过分析可能的版权保护解决方案的利弊,提议将 NeRF 模型中原始颜色表示替换为水印颜色表示,然后设计一种抗扭变渲染方案,以保证 NeRF 2D 渲染中的信息提取Robust。我们提议的方法可以直接保护 NeRF 模型的版权,同时保持高质量渲染和位数精度。

BatMobility: Towards Flying Without Seeing for Autonomous Drones

  • paper_url: http://arxiv.org/abs/2307.11518
  • repo_url: None
  • paper_authors: Emerson Sie, Zikun Liu, Deepak Vasisht
  • for: 这个论文的目的是否允许无人飞行器(UAV)不依赖于光学感知器,即UAV可以不见而飞。
  • methods: 作者提出了一种轻量级的mm波雷达仅基于感知系统,称为BatMobility,以取代光学感知器。BatMobility实现了无人飞行器的电台流估计(一种基于表面平行偏振的FMCW雷达基于 superficie-parallel doppler shift的新方法)和雷达基于碰撞避免。
  • results: 作者使用商业感知器建立了BatMobility,并在一个小型的Off-the-shelf quadcopter上运行了一个未修改的飞行控制器。评估表明,BatMobility在各种场景中比或超过了商业级光学感知器的性能。
    Abstract Unmanned aerial vehicles (UAVs) rely on optical sensors such as cameras and lidar for autonomous operation. However, such optical sensors are error-prone in bad lighting, inclement weather conditions including fog and smoke, and around textureless or transparent surfaces. In this paper, we ask: is it possible to fly UAVs without relying on optical sensors, i.e., can UAVs fly without seeing? We present BatMobility, a lightweight mmWave radar-only perception system for UAVs that eliminates the need for optical sensors. BatMobility enables two core functionalities for UAVs -- radio flow estimation (a novel FMCW radar-based alternative for optical flow based on surface-parallel doppler shift) and radar-based collision avoidance. We build BatMobility using commodity sensors and deploy it as a real-time system on a small off-the-shelf quadcopter running an unmodified flight controller. Our evaluation shows that BatMobility achieves comparable or better performance than commercial-grade optical sensors across a wide range of scenarios.
    摘要 无人飞行器(UAV)通常靠光学感知器件如摄像头和激光雷达进行自主操作。但光学感知器件在糟糕的照明条件、不适的天气条件(如雾和烟)以及无Texture或透明表面上存在误差。在这篇论文中,我们问:是否可以让UAV飞行不用光学感知器件,即UAV是否可以“不见”飞行?我们介绍了BatMobility,一种轻量级mmWave雷达只的感知系统,它消除了对光学感知器件的需求。BatMobility实现了UAV两个核心功能—— радио流速估计(一种基于表面平行Doppler偏移的新型FMCW雷达基于光流的替代方案)和雷达基于避免碰撞。我们使用常见的感知器件建立BatMobility,并将其部署为实时系统,运行在一个小型、Off-the-shelfquadcopter上,不需要修改飞行控制器。我们的评估表明,BatMobility在各种场景中与商业级光学感知器件相比具有相当或更好的性能。

CORE: Cooperative Reconstruction for Multi-Agent Perception

  • paper_url: http://arxiv.org/abs/2307.11514
  • repo_url: https://github.com/zllxot/core
  • paper_authors: Binglu Wang, Lei Zhang, Zhaozhong Wang, Yongqiang Zhao, Tianfei Zhou
  • for: 这篇论文提出了一种基于多智能机器人协同感知的模型CORE,用于提高多机器人合作感知的效果和通信效率。
  • methods: 该模型包括三个主要组成部分:压缩器、协同协商组件和重建模块。压缩器用于每个机器人创建更加压缩的特征表示,以便高效广播;协同协商组件用于跨机器人消息集成,以提高协同感知的效果;重建模块用于根据合并的特征表示重建观察。
  • results: 该模型在OPV2V大规模多机器人感知数据集上进行了3D物体检测和 semantic segmentation两个任务的验证,结果表明该模型在两个任务中均达到了领先的性能水平,并且更高效的进行通信。
    Abstract This paper presents CORE, a conceptually simple, effective and communication-efficient model for multi-agent cooperative perception. It addresses the task from a novel perspective of cooperative reconstruction, based on two key insights: 1) cooperating agents together provide a more holistic observation of the environment, and 2) the holistic observation can serve as valuable supervision to explicitly guide the model learning how to reconstruct the ideal observation based on collaboration. CORE instantiates the idea with three major components: a compressor for each agent to create more compact feature representation for efficient broadcasting, a lightweight attentive collaboration component for cross-agent message aggregation, and a reconstruction module to reconstruct the observation based on aggregated feature representations. This learning-to-reconstruct idea is task-agnostic, and offers clear and reasonable supervision to inspire more effective collaboration, eventually promoting perception tasks. We validate CORE on OPV2V, a large-scale multi-agent percetion dataset, in two tasks, i.e., 3D object detection and semantic segmentation. Results demonstrate that the model achieves state-of-the-art performance on both tasks, and is more communication-efficient.
    摘要 CORE consists of three major components: (1) a compressor, with which each agent creates a more compact feature representation for efficient broadcasting; (2) a lightweight attentive collaboration component, which performs cross-agent message aggregation; and (3) a reconstruction module, which reconstructs the observation from the aggregated feature representations. The learning-to-reconstruct idea is task-agnostic and provides clear and reasonable supervision to inspire more effective collaboration, leading to improved perception tasks. The model is validated on the OPV2V dataset, achieving state-of-the-art performance in both 3D object detection and semantic segmentation tasks, while also being more communication-efficient.

Bone mineral density estimation from a plain X-ray image by learning decomposition into projections of bone-segmented computed tomography

  • paper_url: http://arxiv.org/abs/2307.11513
  • repo_url: None
  • paper_authors: Yi Gu, Yoshito Otake, Keisuke Uemura, Mazen Soufi, Masaki Takao, Hugues Talbot, Seiji Okada, Nobuhiko Sugano, Yoshinobu Sato
  • for: 这个研究旨在透过简单的X射线影像来估计骨骼粒子密度(BMD),以便在日常医疗实践中进行早期诊断。
  • methods: 本研究提出了一种高效的方法,即将骨骼分割成QCT射测的投影,以估计BMD。这种方法只需要有限的数据,并且可以在实际医疗实践中进行应用。
  • results: 研究发现,这种方法可以高度准确地估计BMD,其相互联系系数(Pearson correlation coefficient)为0.880和0.920,并且准确性的标准差(root mean square of coefficient of variation)为3.27%至3.79%。此外,研究还进行了多个验证实验,包括多姿、未测量CT和压缩实验,以确保其可以在实际应用中进行运行。
    Abstract Osteoporosis is a prevalent bone disease that causes fractures in fragile bones, leading to a decline in daily living activities. Dual-energy X-ray absorptiometry (DXA) and quantitative computed tomography (QCT) are highly accurate for diagnosing osteoporosis; however, these modalities require special equipment and scan protocols. To frequently monitor bone health, low-cost, low-dose, and ubiquitously available diagnostic methods are highly anticipated. In this study, we aim to perform bone mineral density (BMD) estimation from a plain X-ray image for opportunistic screening, which is potentially useful for early diagnosis. Existing methods have used multi-stage approaches consisting of extraction of the region of interest and simple regression to estimate BMD, which require a large amount of training data. Therefore, we propose an efficient method that learns decomposition into projections of bone-segmented QCT for BMD estimation under limited datasets. The proposed method achieved high accuracy in BMD estimation, where Pearson correlation coefficients of 0.880 and 0.920 were observed for DXA-measured BMD and QCT-measured BMD estimation tasks, respectively, and the root mean square of the coefficient of variation values were 3.27 to 3.79% for four measurements with different poses. Furthermore, we conducted extensive validation experiments, including multi-pose, uncalibrated-CT, and compression experiments toward actual application in routine clinical practice.
    摘要 骨质疾病(osteoporosis)是一种非常普遍的骨疾病,会导致脆弱骨骼中的裂解,从而导致日常生活活动下降。 dual-energy X-ray absorptiometry(DXA)和量子计算Tomography(QCT)是骨质疾病诊断的非常准确的方法,但是这些方法需要特殊的设备和扫描协议。为了经常监测骨健康,低成本、低剂量、通用可用的诊断方法非常需求。在这项研究中,我们想要从普通X射线图像中估算骨矿化密度(BMD),以便在机会性检测中进行早期诊断。现有的方法通常使用多stage的方法,包括提取区域 интереса和简单的回归来估算BMD,这些方法需要很大的训练数据。因此,我们提出了一种高效的方法,可以在有限的数据集上学习分解为骨segmented QCT的投影来估算BMD。我们的方法实现了高精度的BMD估算,其中DXA测量BMD和QCT测量BMD估算任务的Pearson相关系数分别为0.880和0.920,并且根据不同的姿势测量结果,核心均方差的值为3.27%至3.79%。此外,我们进行了广泛的验证实验,包括多姿、无抽象CT和压缩实验,以便在实际临床医疗实践中应用。

R2Det: Redemption from Range-view for Accurate 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.11482
  • repo_url: None
  • paper_authors: Yihan Wang, Qiao Yan, Yi Wang
  • for: 这篇论文的目的是提高LiDAR数据驱动的自动驾驶系统中的3D物体检测精度。
  • methods: 该论文提出了一种基于范围视图的方法,使用Range-view Representation来增强3D点Cloud的精度。这种方法包括BasicBlock、Hierarchical-dilated Meta Kernel和Feature Points Redemption等部分。
  • results: 该论文的实验结果表明,将该方法与现有的LiDAR数据驱动的3D物体检测器结合使用,可以提高3D物体检测精度,比如在KITTI val set上的easy、moderate和hard等difficulty level上的mAP提高1.39%、1.67%和1.97%。此外,该论文还提出了一种基于R2M的新的3D物体检测器R2Detector,与现有的范围视图基本的方法相比,R2Detector在KITTI benchmark和Waymo Open Dataset上表现出了明显的优异。
    Abstract LiDAR-based 3D object detection is of paramount importance for autonomous driving. Recent trends show a remarkable improvement for bird's-eye-view (BEV) based and point-based methods as they demonstrate superior performance compared to range-view counterparts. This paper presents an insight that leverages range-view representation to enhance 3D points for accurate 3D object detection. Specifically, we introduce a Redemption from Range-view Module (R2M), a plug-and-play approach for 3D surface texture enhancement from the 2D range view to the 3D point view. R2M comprises BasicBlock for 2D feature extraction, Hierarchical-dilated (HD) Meta Kernel for expanding the 3D receptive field, and Feature Points Redemption (FPR) for recovering 3D surface texture information. R2M can be seamlessly integrated into state-of-the-art LiDAR-based 3D object detectors as preprocessing and achieve appealing improvement, e.g., 1.39%, 1.67%, and 1.97% mAP improvement on easy, moderate, and hard difficulty level of KITTI val set, respectively. Based on R2M, we further propose R2Detector (R2Det) with the Synchronous-Grid RoI Pooling for accurate box refinement. R2Det outperforms existing range-view-based methods by a significant margin on both the KITTI benchmark and the Waymo Open Dataset. Codes will be made publicly available.
    摘要 利用LiDAR的3D对象检测是自动驾驶中的重要环节。最新趋势表明BEV(鸟瞰视)和点云方法在性能方面表现更出色,超过范围视图方法。本文提出一种启示,利用范围视图表示增强3D点云的精度。特别是,我们介绍了一个叫做Redemption from Range-view Module(R2M),这是一种可插入现有LiDAR基于3D对象检测器中的预处理方法。R2M包括基本块(BasicBlock) дляEXTRACTING 2D特征,叠加的高级叠加(HD)元件 kernel for 扩展3D感知范围,以及特征点重新识别(FPR)模块,用于从2D范围视图中恢复3D表面文本信息。R2M可以轻松地与现有的LiDAR基于3D对象检测器集成,并实现了让人满意的提升,例如,在KITTI测试集的易、中、Difficulty三级水平上的mAP提升率分别为1.39%、1.67%和1.97%。基于R2M,我们进一步提出了R2Detector(R2Det),它使用同步格RoI pooling来进行精度的盒子精度。R2Det在KITTI测试集和Waymo开放数据集上与现有范围视图基于方法相比具有明显的提升。代码将公开发布。

SA-BEV: Generating Semantic-Aware Bird’s-Eye-View Feature for Multi-view 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.11477
  • repo_url: https://github.com/mengtan00/sa-bev
  • paper_authors: Jinqing Zhang, Yanan Zhang, Qingjie Liu, Yunhong Wang
  • for: 提高自适应驾驶的经济性,使用纯摄像头基础的鸟瞰视图(BEV)感知。
  • methods: 提出 semantic-aware BEV Pooling(SA-BEVPool),可以根据图像特征的semantic segmentation过滤背景信息,将图像特征转化为semantic-aware BEV特征。还提出 BEV-Paste 数据增强策略,可以尝试与 semantic-aware BEV特征进行匹配。此外,我们还设计了 Multi-Scale Cross-Task(MSCT)头,可以结合任务特定和跨任务信息,更准确地预测深度分布和semantic segmentation。
  • results: 经过实验,SA-BEV在 nuScenes 上达到了状态革命性的性能。
    Abstract Recently, the pure camera-based Bird's-Eye-View (BEV) perception provides a feasible solution for economical autonomous driving. However, the existing BEV-based multi-view 3D detectors generally transform all image features into BEV features, without considering the problem that the large proportion of background information may submerge the object information. In this paper, we propose Semantic-Aware BEV Pooling (SA-BEVPool), which can filter out background information according to the semantic segmentation of image features and transform image features into semantic-aware BEV features. Accordingly, we propose BEV-Paste, an effective data augmentation strategy that closely matches with semantic-aware BEV feature. In addition, we design a Multi-Scale Cross-Task (MSCT) head, which combines task-specific and cross-task information to predict depth distribution and semantic segmentation more accurately, further improving the quality of semantic-aware BEV feature. Finally, we integrate the above modules into a novel multi-view 3D object detection framework, namely SA-BEV. Experiments on nuScenes show that SA-BEV achieves state-of-the-art performance. Code has been available at https://github.com/mengtan00/SA-BEV.git.
    摘要 最近,纯摄像头基本视角(BEV)的感知提供了经济自动驾驶的可行解决方案。然而,现有的 BEV 基本多视图三维探测器通常将所有图像特征转换成 BEV 特征,无论背景信息占据对象信息的大量。在这篇论文中,我们提出了semantic-aware BEV pooling(SA-BEVPool),可以根据图像特征的semantic segmentation过滤出背景信息,并将图像特征转换成semantic-aware BEV特征。此外,我们提出了BEV-Paste,一种高效的数据增强策略,可以快速匹配semantic-aware BEV特征。此外,我们设计了多尺度交叉任务(MSCT)头,可以将任务特定和交叉任务信息组合以更准确地预测深度分布和semantic segmentation,进一步提高 semantic-aware BEV特征的质量。最后,我们将上述模块集成到了一个新的多视图三维对象检测框架中,称之为SA-BEV。nuScenes实验显示,SA-BEV可以达到状态前的性能。代码可以在https://github.com/mengtan00/SA-BEV.git中下载。
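
A toy sketch of the semantic-aware pooling idea: image features are weighted by a foreground probability from the segmentation head before being scattered into BEV cells. The projection indices and shapes are invented; the real method uses a lift-splat style view transform with estimated depth.

```python
# Each image pixel's feature is weighted by its foreground probability, so
# background pixels contribute little to the BEV feature map.
import torch

B, C, H, W = 1, 8, 16, 32
bev_h = bev_w = 20

img_feat = torch.rand(B, C, H, W)
seg_logits = torch.rand(B, 2, H, W)                  # background / foreground head
fg_prob = seg_logits.softmax(1)[:, 1:2]              # (B, 1, H, W)

# hypothetical precomputed projection of every pixel into a BEV cell
bev_x = torch.randint(0, bev_w, (H * W,))
bev_y = torch.randint(0, bev_h, (H * W,))
cell = bev_y * bev_w + bev_x                         # flat BEV index per pixel

weighted = (img_feat * fg_prob).reshape(B, C, H * W)  # semantic-aware features
bev = torch.zeros(B, C, bev_h * bev_w)
bev.index_add_(2, cell, weighted)                     # pool into BEV cells
bev = bev.reshape(B, C, bev_h, bev_w)
print(bev.shape)
```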

Physics-Aware Semi-Supervised Underwater Image Enhancement

  • paper_url: http://arxiv.org/abs/2307.11470
  • repo_url: None
  • paper_authors: Hao Qi, Xinghui Dong
  • for: 提高水下图像质量,解决水下图像受到媒体传输的降低效应
  • methods: combining physics-based underwater Image Formation Model (IFM)和深度学习技术,提出了一种新的物理意识的双流水下图像提升网络(PA-UIENet),包括传输估计流和环境光估计流
  • results: 与基eline的八个方法进行比较,在五个测试集上的降低估计和水下图像提升任务中表现更好,这可能是因为它不仅可以模拟降低,还可以学习不同的水下场景特征。
    Abstract Underwater images normally suffer from degradation due to the transmission medium of water bodies. Both traditional prior-based approaches and deep learning-based methods have been used to address this problem. However, the inflexible assumption of the former often impairs their effectiveness in handling diverse underwater scenes, while the generalization of the latter to unseen images is usually weakened by insufficient data. In this study, we leverage both the physics-based underwater Image Formation Model (IFM) and deep learning techniques for Underwater Image Enhancement (UIE). To this end, we propose a novel Physics-Aware Dual-Stream Underwater Image Enhancement Network, i.e., PA-UIENet, which comprises a Transmission Estimation Steam (T-Stream) and an Ambient Light Estimation Stream (A-Stream). This network fulfills the UIE task by explicitly estimating the degradation parameters of the IFM. We also adopt an IFM-inspired semi-supervised learning framework, which exploits both the labeled and unlabeled images, to address the issue of insufficient data. Our method performs better than, or at least comparably to, eight baselines across five testing sets in the degradation estimation and UIE tasks. This should be due to the fact that it not only can model the degradation but also can learn the characteristics of diverse underwater scenes.
    摘要 水下图像通常受到水体媒体传输的质量下降的影响。传统的基于先前的方法和深度学习基于方法都已经用来解决这个问题。然而,前者的固定假设经常使其效果不足以处理多样化的水下场景,而后者的普适性通常由数据不足而削弱。在这项研究中,我们利用物理基础的水下图像形成模型(IFM)和深度学习技术来进行水下图像提升(UIE)。为此,我们提出了一种具有物理意识的双流水下图像提升网络,即PA-UIENet,该网络包括传输估计流(T-Stream)和投光估计流(A-Stream)。这个网络通过直接估计IFM中的降低参数来完成UIE任务。我们还采用基于IFM的半监督学习框架,这种框架可以利用标注和无标注图像来解决数据不足的问题。我们的方法在五个测试集上比基eline的八个参考方法表现更好,或者至少与其相当。这应该是因为它不仅可以模型降低,还可以学习多样化的水下场景的特点。
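
The physics behind the two streams is the standard underwater image formation model, sketched below together with its closed-form inversion; the per-channel transmission, ambient-light values, and clipping thresholds are illustrative choices, not the paper's estimates.

```python
# Classical underwater IFM: I = J * t + A * (1 - t), where t is the per-channel
# transmission (what the T-Stream estimates) and A the ambient light (A-Stream).
# Given t and A, the clean scene radiance J can be recovered in closed form.
import numpy as np

def degrade(J, t, A):
    """Synthesize an underwater-looking image from clean radiance J."""
    return J * t + A * (1.0 - t)

def enhance(I, t_hat, A_hat, t_min=0.1):
    """Invert the IFM with estimated transmission and ambient light."""
    t_hat = np.clip(t_hat, t_min, 1.0)          # avoid amplifying noise where t ~ 0
    return np.clip((I - A_hat * (1.0 - t_hat)) / t_hat, 0.0, 1.0)

J = np.random.rand(64, 64, 3)                   # clean scene radiance (RGB)
t = np.stack([np.full((64, 64), v) for v in (0.4, 0.7, 0.9)], axis=-1)  # red attenuates fastest
A = np.array([0.05, 0.45, 0.55])                # blue-green ambient light

I = degrade(J, t, A)
J_hat = enhance(I, t, A)
print(np.abs(J - J_hat).max())                  # ~0 when t and A are exact
```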

MatSpectNet: Material Segmentation Network with Domain-Aware and Physically-Constrained Hyperspectral Reconstruction

  • paper_url: http://arxiv.org/abs/2307.11466
  • repo_url: https://github.com/heng-yuwen/matspectnet
  • paper_authors: Yuwen Heng, Yihong Wu, Jiawen Chen, Srinandan Dasmahapatra, Hansung Kim
  • for: 提高RGB图像中材料分割的准确率,使用Recovered hyperspectral images。
  • methods: 提出了一种新的模型——MatSpectNet,利用现代相机的色彩感知原理来约束重构的彩色图像,并通过领域适应方法将彩色图像映射到材料分割dataset中。
  • results: 对LMD dataset和OpenSurfaces dataset进行了实验,MatSpectNet在比较最新的发表文章的基础上提高了1.60%的平均像素精度和3.42%的类别精度。
    Abstract Achieving accurate material segmentation for 3-channel RGB images is challenging due to the considerable variation in a material's appearance. Hyperspectral images, which are sets of spectral measurements sampled at multiple wavelengths, theoretically offer distinct information for material identification, as variations in intensity of electromagnetic radiation reflected by a surface depend on the material composition of a scene. However, existing hyperspectral datasets are impoverished regarding the number of images and material categories for the dense material segmentation task, and collecting and annotating hyperspectral images with a spectral camera is prohibitively expensive. To address this, we propose a new model, the MatSpectNet to segment materials with recovered hyperspectral images from RGB images. The network leverages the principles of colour perception in modern cameras to constrain the reconstructed hyperspectral images and employs the domain adaptation method to generalise the hyperspectral reconstruction capability from a spectral recovery dataset to material segmentation datasets. The reconstructed hyperspectral images are further filtered using learned response curves and enhanced with human perception. The performance of MatSpectNet is evaluated on the LMD dataset as well as the OpenSurfaces dataset. Our experiments demonstrate that MatSpectNet attains a 1.60% increase in average pixel accuracy and a 3.42% improvement in mean class accuracy compared with the most recent publication. The project code is attached to the supplementary material and will be published on GitHub.
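The abstract states that the reconstructed hyperspectral images are constrained by the colour-perception principles of modern cameras and filtered with learned response curves. A minimal PyTorch sketch of one such constraint, projecting a reconstructed spectral cube back to RGB through per-channel response curves and penalising the mismatch, is shown below; the tensor shapes, the einsum formulation, and the L1 penalty are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def rgb_consistency_loss(hyperspectral, rgb, response_curves):
    """Project a reconstructed hyperspectral cube back to RGB through
    per-channel camera response curves and compare with the input RGB.

    hyperspectral:   (B, S, H, W) reconstructed spectral bands
    rgb:             (B, 3, H, W) input RGB image
    response_curves: (3, S) non-negative weights, one curve per RGB channel
    """
    # weighted sum over the spectral dimension for each RGB channel
    projected = torch.einsum('bshw,cs->bchw', hyperspectral, response_curves)
    return F.l1_loss(projected, rgb)
```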

Strip-MLP: Efficient Token Interaction for Vision MLP

  • paper_url: http://arxiv.org/abs/2307.11458
  • repo_url: https://github.com/med-process/strip_mlp
  • paper_authors: Guiping Cao, Shengda Luo, Wenjian Huang, Xiangyuan Lan, Dongmei Jiang, Yaowei Wang, Jianguo Zhang
  • for: Improving the performance of MLP-based models on small datasets while remaining competitive with existing MLP models on ImageNet.
  • methods: Proposes a new Strip MLP layer, together with a Cascade Group Strip Mixing Module (CGSMM) and a Local Strip Mixing Module (LSMM), to strengthen token interaction.
  • results: Experiments show that Strip-MLP significantly improves performance on small datasets and matches or exceeds existing MLP models on ImageNet; average Top-1 accuracy is +2.44% higher on Caltech-101 and +2.16% higher on CIFAR-100 than existing MLP-based models.
    Abstract The token interaction operation is one of the core modules in MLP-based models, used to exchange and aggregate information between different spatial locations. However, the power of token interaction on the spatial dimension is highly dependent on the spatial resolution of the feature maps, which limits the model's expressive ability, especially in deep layers where the features are down-sampled to a small spatial size. To address this issue, we present a novel method called Strip-MLP to enrich the token interaction power in three ways. Firstly, we introduce a new MLP paradigm called the Strip MLP layer that allows a token to interact with other tokens in a cross-strip manner, enabling the tokens in a row (or column) to contribute to the information aggregations in adjacent but different strips of rows (or columns). Secondly, a Cascade Group Strip Mixing Module (CGSMM) is proposed to overcome the performance degradation caused by small spatial feature size. The module allows tokens to interact more effectively within-patch and cross-patch, independently of the feature spatial size. Finally, based on the Strip MLP layer, we propose a novel Local Strip Mixing Module (LSMM) to boost the token interaction power in the local region. Extensive experiments demonstrate that Strip-MLP significantly improves the performance of MLP-based models on small datasets and obtains comparable or even better results on ImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44% on Caltech-101 and +2.16% on CIFAR-100. The source code will be available at https://github.com/Med-Process/Strip_MLP.
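To make the cross-strip idea concrete, here is a rough PyTorch sketch of a strip-style token mixer that mixes tokens along rows and along columns of the feature map; it illustrates the general mechanism only and is not the authors' exact Strip MLP layer, CGSMM, or LSMM.

```python
import torch
import torch.nn as nn

class StripTokenMixer(nn.Module):
    """Illustrative strip-wise token mixing (not the paper's exact layer).

    Tokens are mixed row-by-row along the width and column-by-column along
    the height, so information flows across strips of the feature map rather
    than only within local windows.
    """
    def __init__(self, height, width, channels):
        super().__init__()
        self.mix_w = nn.Linear(width, width)    # mixes tokens within each row
        self.mix_h = nn.Linear(height, height)  # mixes tokens within each column
        self.proj = nn.Linear(channels, channels)

    def forward(self, x):                       # x: (B, C, H, W)
        x = x + self.mix_w(x)                   # mix along W (last dim)
        x = x + self.mix_h(x.transpose(-1, -2)).transpose(-1, -2)  # mix along H
        x = x.permute(0, 2, 3, 1)               # (B, H, W, C)
        x = self.proj(x)                        # per-token channel projection
        return x.permute(0, 3, 1, 2)            # back to (B, C, H, W)
```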

Attention Consistency Refined Masked Frequency Forgery Representation for Generalizing Face Forgery Detection

  • paper_url: http://arxiv.org/abs/2307.11438
  • repo_url: https://github.com/chenboluo/acmf
  • paper_authors: Decheng Liu, Tao Chen, Chunlei Peng, Nannan Wang, Ruimin Hu, Xinbo Gao
  • for: Improving visual data forgery detection to address problems in social and economic security.
  • methods: Proposes an Attention Consistency Refined masked frequency forgery representation model (ACMF) to improve the generalization ability of face forgery detectors.
  • results: Experiments on several public face forgery datasets (FaceForensics++, DFD, Celeb-DF, and WDF) show that the proposed method outperforms previous approaches.
    Abstract Due to the successful development of deep image generation technology, visual data forgery detection would play a more important role in social and economic security. Existing forgery detection methods suffer from unsatisfactory generalization ability to determine the authenticity in the unseen domain. In this paper, we propose a novel Attention Consistency Refined masked frequency forgery representation model toward generalizing face forgery detection algorithm (ACMF). Most forgery technologies always bring in high-frequency aware cues, which make it easy to distinguish source authenticity but difficult to generalize to unseen artifact types. The masked frequency forgery representation module is designed to explore robust forgery cues by randomly discarding high-frequency information. In addition, we find that the forgery attention map inconsistency through the detection network could affect the generalizability. Thus, the forgery attention consistency is introduced to force detectors to focus on similar attention regions for better generalization ability. Experiment results on several public face forgery datasets (FaceForensic++, DFD, Celeb-DF, and WDF datasets) demonstrate the superior performance of the proposed method compared with the state-of-the-art methods.
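A small PyTorch sketch of the masked-frequency idea, randomly discarding high-frequency bins of an image's Fourier spectrum while always preserving a low-frequency core, is given below; the FFT-based formulation, the keep ratio, and the drop probability are assumptions for illustration rather than the paper's exact module.

```python
import torch

def mask_high_frequencies(images, keep_ratio=0.25, p_drop=0.5):
    """Randomly discard high-frequency content of a batch of images.

    images: (B, C, H, W) tensor; keep_ratio controls the size of the
    low-frequency square that is always preserved.
    """
    freq = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    B, C, H, W = images.shape
    cy, cx = H // 2, W // 2
    ry, rx = int(H * keep_ratio / 2), int(W * keep_ratio / 2)

    # random drop of frequency bins, shared across channels
    mask = (torch.rand(B, 1, H, W, device=images.device) > p_drop).float()
    mask[..., cy - ry:cy + ry, cx - rx:cx + rx] = 1.0   # always keep low frequencies

    filtered = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)))
    return filtered.real
```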

FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2307.11418
  • repo_url: None
  • paper_authors: Sungwon Hwang, Junha Hyung, Daejin Kim, Min-Jung Kim, Jaegul Choo
  • for: Enabling text-driven control of 3D face deformation so that non-expert users can manipulate a NeRF-reconstructed face with a single piece of text.
  • methods: First trains a scene manipulator, a latent-code-conditional deformable NeRF, to control face deformation; since a single latent code cannot compose local deformations observed across instances, a Position-conditional Anchor Compositor (PAC) learns spatially varying latent codes, and renderings are optimized for high cosine similarity with the target text in CLIP embedding space.
  • results: To the authors' knowledge, this is the first method to address text-driven manipulation of a face reconstructed with NeRF; experiments, comparisons, and ablation studies demonstrate high-quality text-driven manipulation.
    Abstract As recent advances in Neural Radiance Fields (NeRF) have enabled high-fidelity 3D face reconstruction and novel view synthesis, its manipulation also became an essential task in 3D vision. However, existing manipulation methods require extensive human labor, such as a user-provided semantic mask and manual attribute search unsuitable for non-expert users. Instead, our approach is designed to require a single text to manipulate a face reconstructed with NeRF. To do so, we first train a scene manipulator, a latent code-conditional deformable NeRF, over a dynamic scene to control a face deformation using the latent code. However, representing a scene deformation with a single latent code is unfavorable for compositing local deformations observed in different instances. As so, our proposed Position-conditional Anchor Compositor (PAC) learns to represent a manipulated scene with spatially varying latent codes. Their renderings with the scene manipulator are then optimized to yield high cosine similarity to a target text in CLIP embedding space for text-driven manipulation. To the best of our knowledge, our approach is the first to address the text-driven manipulation of a face reconstructed with NeRF. Extensive results, comparisons, and ablation studies demonstrate the effectiveness of our approach.

Subject-Diffusion:Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning

  • paper_url: http://arxiv.org/abs/2307.11410
  • repo_url: https://github.com/OPPO-Mente-Lab/Subject-Diffusion
  • paper_authors: Jian Ma, Junhao Liang, Chen Chen, Haonan Lu
  • for: Advancing open-domain, fine-tuning-free personalized text-to-image generation.
  • methods: Proposes a new open-domain personalized generation model that needs only a single reference image to support single- or multi-subject generation in any domain.
  • results: Outperforms other state-of-the-art frameworks in single-, multi-, and human-customized image generation.
    Abstract Recent progress in personalized image generation using diffusion models has been significant. However, development in the area of open-domain and non-fine-tuning personalized image generation is proceeding rather slowly. In this paper, we propose Subject-Diffusion, a novel open-domain personalized image generation model that, in addition to not requiring test-time fine-tuning, also only requires a single reference image to support personalized generation of single- or multi-subject in any domain. Firstly, we construct an automatic data labeling tool and use the LAION-Aesthetics dataset to construct a large-scale dataset consisting of 76M images and their corresponding subject detection bounding boxes, segmentation masks and text descriptions. Secondly, we design a new unified framework that combines text and image semantics by incorporating coarse location and fine-grained reference image control to maximize subject fidelity and generalization. Furthermore, we also adopt an attention control mechanism to support multi-subject generation. Extensive qualitative and quantitative results demonstrate that our method outperforms other SOTA frameworks in single, multiple, and human customized image generation. Please refer to our \href{https://oppo-mente-lab.github.io/subject_diffusion/}{project page}

Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for Occluded Facial Expression Recognition

  • paper_url: http://arxiv.org/abs/2307.11404
  • repo_url: https://github.com/leeisack/latent-ofer
  • paper_authors: Isack Lee, Eungi Lee, Seok Bong Yoo
  • for: Improving facial expression recognition (FER) in real-world scenes by addressing occluded FER (OFER).
  • methods: Uses a vision transformer (ViT)-based occlusion patch detector to locate occluded regions, a hybrid reconstruction network to restore the masked regions as complete images, and a CNN-based class activation map to extract expression-relevant latent vectors.
  • results: Experiments on several databases demonstrate the advantages of the proposed method over state-of-the-art approaches.
    Abstract Most research on facial expression recognition (FER) is conducted in highly controlled environments, but its performance is often unacceptable when applied to real-world situations. This is because when unexpected objects occlude the face, the FER network faces difficulties extracting facial features and accurately predicting facial expressions. Therefore, occluded FER (OFER) is a challenging problem. Previous studies on occlusion-aware FER have typically required fully annotated facial images for training. However, collecting facial images with various occlusions and expression annotations is time-consuming and expensive. Latent-OFER, the proposed method, can detect occlusions, restore occluded parts of the face as if they were unoccluded, and recognize them, improving FER accuracy. This approach involves three steps: First, the vision transformer (ViT)-based occlusion patch detector masks the occluded position by training only latent vectors from the unoccluded patches using the support vector data description algorithm. Second, the hybrid reconstruction network generates the masking position as a complete image using the ViT and convolutional neural network (CNN). Last, the expression-relevant latent vector extractor retrieves and uses expression-related information from all latent vectors by applying a CNN-based class activation map. This mechanism has a significant advantage in preventing performance degradation from occlusion by unseen objects. The experimental results on several databases demonstrate the superiority of the proposed method over state-of-the-art methods.
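The occlusion patch detector is trained only on latent vectors from unoccluded patches using support vector data description (SVDD). The sketch below uses scikit-learn's OneClassSVM as a close stand-in for SVDD to flag outlier (occluded) patch latents; the feature dimension, patch count, nu value, and the random placeholder data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder data standing in for ViT patch latents (dimension assumed 768).
unoccluded_latents = np.random.randn(1000, 768)   # latents from clean patches
test_latents = np.random.randn(196, 768)          # one test image, 14x14 patches

# One-class SVM as a stand-in for support vector data description.
detector = OneClassSVM(kernel='rbf', nu=0.05).fit(unoccluded_latents)
occluded_mask = detector.predict(test_latents) == -1   # -1 marks outlier (occluded) patches
print(f"{occluded_mask.sum()} of {len(test_latents)} patches flagged as occluded")
```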

CLR: Channel-wise Lightweight Reprogramming for Continual Learning

  • paper_url: http://arxiv.org/abs/2307.11386
  • repo_url: https://github.com/gyhandy/channel-wise-lightweight-reprogramming
  • paper_authors: Yunhao Ge, Yuecheng Li, Shuo Ni, Jiaping Zhao, Ming-Hsuan Yang, Laurent Itti
  • for: Addressing continual learning, i.e., maintaining performance on previously learned tasks while learning new ones.
  • methods: Proposes Channel-wise Lightweight Reprogramming (CLR), which keeps a task-agnostic, immutable anchor backbone (trained on an old task or a self-supervised proxy task) and adds very cheap task-specific reprogramming parameters that reinterpret its outputs for each new task.
  • results: Achieves a better stability-plasticity trade-off, outperforming 13 state-of-the-art continual learning baselines on a challenging new sequence of 53 image classification datasets.
    Abstract Continual learning aims to emulate the human ability to continually accumulate knowledge over sequential tasks. The main challenge is to maintain performance on previously learned tasks after learning new tasks, i.e., to avoid catastrophic forgetting. We propose a Channel-wise Lightweight Reprogramming (CLR) approach that helps convolutional neural networks (CNNs) overcome catastrophic forgetting during continual learning. We show that a CNN model trained on an old task (or self-supervised proxy task) could be ``reprogrammed" to solve a new task by using our proposed lightweight (very cheap) reprogramming parameter. With the help of CLR, we have a better stability-plasticity trade-off to solve continual learning problems: To maintain stability and retain previous task ability, we use a common task-agnostic immutable part as the shared ``anchor" parameter set. We then add task-specific lightweight reprogramming parameters to reinterpret the outputs of the immutable parts, to enable plasticity and integrate new knowledge. To learn sequential tasks, we only train the lightweight reprogramming parameters to learn each new task. Reprogramming parameters are task-specific and exclusive to each task, which makes our method immune to catastrophic forgetting. To minimize the parameter requirement of reprogramming to learn new tasks, we make reprogramming lightweight by only adjusting essential kernels and learning channel-wise linear mappings from anchor parameters to task-specific domain knowledge. We show that, for general CNNs, the CLR parameter increase is less than 0.6\% for any new task. Our method outperforms 13 state-of-the-art continual learning baselines on a new challenging sequence of 53 image classification datasets. Code and data are available at https://github.com/gyhandy/Channel-wise-Lightweight-Reprogramming
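A minimal PyTorch sketch of the channel-wise reprogramming idea follows: a frozen anchor layer is followed by a cheap per-channel (depthwise 1x1) linear map that is the only part trained for a new task. Where exactly such maps are inserted and their precise form are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChannelwiseReprogram(nn.Module):
    """Sketch of channel-wise lightweight reprogramming around a frozen layer."""
    def __init__(self, anchor_layer: nn.Module, channels: int):
        super().__init__()
        self.anchor = anchor_layer
        for p in self.anchor.parameters():
            p.requires_grad = False                    # shared, immutable anchor
        # one scale and bias per channel: a channel-wise linear reinterpretation
        self.reprogram = nn.Conv2d(channels, channels, kernel_size=1,
                                   groups=channels, bias=True)

    def forward(self, x):
        return self.reprogram(self.anchor(x))          # only `reprogram` is trained
```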

LatentAugment: Data Augmentation via Guided Manipulation of GAN’s Latent Space

  • paper_url: http://arxiv.org/abs/2307.11375
  • repo_url: https://github.com/ltronchin/latentaugment
  • paper_authors: Lorenzo Tronchin, Minh H. Vu, Paolo Soda, Tommy Löfstedt
  • for: Increasing the quantity and diversity of training data to reduce overfitting and improve generalization.
  • methods: Uses GAN-generated synthetic data and manipulates latent vectors to maximize the diversity and fidelity of the synthetic images.
  • results: LatentAugment improves the generalization of a deep MRI-to-CT translation model and surpasses GAN-based sampling in mode coverage and diversity.
    Abstract Data Augmentation (DA) is a technique to increase the quantity and diversity of the training data, and by that alleviate overfitting and improve generalisation. However, standard DA produces synthetic data for augmentation with limited diversity. Generative Adversarial Networks (GANs) may unlock additional information in a dataset by generating synthetic samples having the appearance of real images. However, these models struggle to simultaneously address three key requirements: fidelity and high-quality samples; diversity and mode coverage; and fast sampling. Indeed, GANs generate high-quality samples rapidly, but have poor mode coverage, limiting their adoption in DA applications. We propose LatentAugment, a DA strategy that overcomes the low diversity of GANs, opening up for use in DA applications. Without external supervision, LatentAugment modifies latent vectors and moves them into latent space regions to maximise the synthetic images' diversity and fidelity. It is also agnostic to the dataset and the downstream task. A wide set of experiments shows that LatentAugment improves the generalisation of a deep model translating from MRI-to-CT beating both standard DA as well GAN-based sampling. Moreover, still in comparison with GAN-based sampling, LatentAugment synthetic samples show superior mode coverage and diversity. Code is available at: https://github.com/ltronchin/LatentAugment.
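To illustrate the kind of guided latent manipulation described, here is a toy PyTorch optimisation loop that nudges a GAN latent code so the generated sample moves away from an existing batch (diversity) while staying close to its starting point (fidelity); the actual LatentAugment objectives are richer, and the loss terms and weights here are arbitrary assumptions.

```python
import torch

def latent_augment_step(generator, z, reference_batch, steps=5, lr=0.05):
    """Toy sketch of guided latent manipulation; not the paper's objective."""
    z = z.clone().requires_grad_(True)
    z0 = z.detach().clone()
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        sample = generator(z)
        # push the sample away from existing samples (encourage diversity)
        diversity = -torch.cdist(sample.flatten(1), reference_batch.flatten(1)).mean()
        # keep the latent near its starting point (proxy for fidelity)
        fidelity = (z - z0).pow(2).mean()
        loss = diversity + 0.1 * fidelity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```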

Photo2Relief: Let Human in the Photograph Stand Out

  • paper_url: http://arxiv.org/abs/2307.11364
  • repo_url: None
  • paper_authors: Zhongping Ji, Feifei Che, Hanshuo Liu, Ziyi Zhao, Yu-Wei Zhang, Wenping Wang
  • for: Creating digital 2.5D artworks from photographs that depict the whole-body activity of the subject.
  • methods: Uses a sigmoid variant function to flexibly manipulate gradients, trains the neural networks with a loss defined in the gradient domain, and applies image-based rendering to handle varying lighting conditions.
  • results: Experiments on a variety of scenes show the method efficiently produces high-quality digital 2.5D artwork from photographs with natural detail.
    Abstract In this paper, we propose a technique for making humans in photographs protrude like reliefs. Unlike previous methods which mostly focus on the face and head, our method aims to generate artworks that describe the whole-body activity of the character. One challenge is that there is no ground truth for supervised deep learning. We introduce a sigmoid variant function to manipulate gradients tactfully and train our neural networks with a loss function defined in the gradient domain. The second challenge is that actual photographs are often taken under different lighting conditions. We use an image-based rendering technique to address this challenge and acquire rendered images and depth data under different lighting conditions. To make a clear division of labor among network modules, a two-scale architecture is proposed to create high-quality relief from a single photograph. Extensive experimental results on a variety of scenes show that our method is a highly effective solution for generating digital 2.5D artwork from photographs.
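Since the networks are trained with a loss defined in the gradient domain, using a sigmoid variant to manipulate gradients, a rough PyTorch sketch of such a loss is shown below; the exact squashing function, its scale alpha, and how the target gradients are obtained are assumptions.

```python
import torch
import torch.nn.functional as F

def gradient_domain_loss(pred_depth, target_gradients, alpha=1.0):
    """Compare spatial gradients of the predicted relief with target gradients
    after a sigmoid-like squashing (illustrative, not the paper's exact form).

    pred_depth:       (..., H, W) predicted relief/height map
    target_gradients: (gx_target, gy_target) matching the shapes of gx and gy
    """
    gx = pred_depth[..., :, 1:] - pred_depth[..., :, :-1]   # horizontal gradient
    gy = pred_depth[..., 1:, :] - pred_depth[..., :-1, :]   # vertical gradient
    squash = lambda g: 2.0 * torch.sigmoid(alpha * g) - 1.0  # sigmoid variant
    loss_x = F.l1_loss(squash(gx), squash(target_gradients[0]))
    loss_y = F.l1_loss(squash(gy), squash(target_gradients[1]))
    return loss_x + loss_y
```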

ParGANDA: Making Synthetic Pedestrians A Reality For Object Detection

  • paper_url: http://arxiv.org/abs/2307.11360
  • repo_url: None
  • paper_authors: Daria Reshetova, Guanhang Wu, Marcel Puyat, Chunhui Gu, Huizhong Chen
  • for: Improving pedestrian detection models so they are more robust and reliable in real-world applications.
  • methods: Uses a generative adversarial network (GAN) to translate synthetic pedestrian images into more realistic ones, reducing the synthetic-to-real domain gap.
  • results: Experiments show the GAN produces high-quality, visually plausible pedestrians without requiring labels of the real domain, making the approach applicable to a variety of downstream tasks.
    Abstract Object detection is the key technique to a number of Computer Vision applications, but it often requires large amounts of annotated data to achieve decent results. Moreover, for pedestrian detection specifically, the collected data might contain some personally identifiable information (PII), which is highly restricted in many countries. This label intensive and privacy concerning task has recently led to an increasing interest in training the detection models using synthetically generated pedestrian datasets collected with a photo-realistic video game engine. The engine is able to generate unlimited amounts of data with precise and consistent annotations, which gives potential for significant gains in the real-world applications. However, the use of synthetic data for training introduces a synthetic-to-real domain shift aggravating the final performance. To close the gap between the real and synthetic data, we propose to use a Generative Adversarial Network (GAN), which performsparameterized unpaired image-to-image translation to generate more realistic images. The key benefit of using the GAN is its intrinsic preference of low-level changes to geometric ones, which means annotations of a given synthetic image remain accurate even after domain translation is performed thus eliminating the need for labeling real data. We extensively experimented with the proposed method using MOTSynth dataset to train and MOT17 and MOT20 detection datasets to test, with experimental results demonstrating the effectiveness of this method. Our approach not only produces visually plausible samples but also does not require any labels of the real domain thus making it applicable to the variety of downstream tasks.

Tuning Pre-trained Model via Moment Probing

  • paper_url: http://arxiv.org/abs/2307.11342
  • repo_url: https://github.com/mingzeg/moment-probing
  • paper_authors: Mingze Gao, Qilong Wang, Zhenyi Lin, Pengfei Zhu, Qinghua Hu, Jingbo Zhou
  • for: Efficient tuning of large-scale pre-trained models by exploring the potential of the linear probing (LP) module.
  • methods: Proposes Moment Probing (MP), which performs linear classification on the feature distribution, characterized by first- and second-order moments, for stronger representation ability.
  • results: MP significantly outperforms LP and is competitive with counterparts at lower training cost, while MP+ achieves state-of-the-art performance.
    Abstract Recently, efficient fine-tuning of large-scale pre-trained models has attracted increasing research interests, where linear probing (LP) as a fundamental module is involved in exploiting the final representations for task-dependent classification. However, most of the existing methods focus on how to effectively introduce a few of learnable parameters, and little work pays attention to the commonly used LP module. In this paper, we propose a novel Moment Probing (MP) method to further explore the potential of LP. Distinguished from LP which builds a linear classification head based on the mean of final features (e.g., word tokens for ViT) or classification tokens, our MP performs a linear classifier on feature distribution, which provides the stronger representation ability by exploiting richer statistical information inherent in features. Specifically, we represent feature distribution by its characteristic function, which is efficiently approximated by using first- and second-order moments of features. Furthermore, we propose a multi-head convolutional cross-covariance (MHC$^3$) to compute second-order moments in an efficient and effective manner. By considering that MP could affect feature learning, we introduce a partially shared module to learn two recalibrating parameters (PSRP) for backbones based on MP, namely MP$_{+}$. Extensive experiments on ten benchmarks using various models show that our MP significantly outperforms LP and is competitive with counterparts at less training cost, while our MP$_{+}$ achieves state-of-the-art performance.
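A compact PyTorch sketch of probing on feature statistics rather than on a single mean token follows: the mean (first-order moment) is concatenated with a compressed second-order moment and fed to a linear classifier. The dimensionality reduction used here is a plain linear layer standing in for the paper's MHC^3 module, so the details are assumptions.

```python
import torch
import torch.nn as nn

class MomentProbingHead(nn.Module):
    """Sketch of a moment-based probing head (simplified, not MP's exact design)."""
    def __init__(self, dim, num_classes, cov_dim=64):
        super().__init__()
        self.reduce = nn.Linear(dim, cov_dim)            # shrink dim before covariance
        self.fc = nn.Linear(dim + cov_dim * cov_dim, num_classes)

    def forward(self, tokens):                           # tokens: (B, N, dim)
        mean = tokens.mean(dim=1)                        # first-order moment
        z = self.reduce(tokens)                          # (B, N, cov_dim)
        z = z - z.mean(dim=1, keepdim=True)
        cov = torch.einsum('bnc,bnd->bcd', z, z) / z.size(1)   # second-order moment
        return self.fc(torch.cat([mean, cov.flatten(1)], dim=-1))
```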

Character Time-series Matching For Robust License Plate Recognition

  • paper_url: http://arxiv.org/abs/2307.11336
  • repo_url: https://github.com/chequanghuy/Character-Time-series-Matching
  • paper_authors: Quang Huy Che, Tung Do Thanh, Cuong Truong Van
  • for: Improving license plate recognition accuracy in real-world situations by tracking the license plate across multiple frames.
  • methods: Applies an Adaptive License Plate Rotation algorithm to correctly align the detected plate, and a new Character Time-series Matching method to recognize license plate characters from multiple consecutive frames.
  • results: Achieves 96.7% accuracy on the UFPR-ALPR dataset in real time on an RTX A5000 GPU; in the Vietnamese ALPR system, license plate detection and character recognition reach 0.881 and 0.979 mAP@.5, respectively.
    Abstract Automatic License Plate Recognition (ALPR) is becoming a popular study area and is applied in many fields such as transportation and smart cities. However, there are still several limitations when applying many current methods to practical problems due to the variation in real-world situations such as light changes, unclear License Plate (LP) characters, and image quality. Most recent ALPR algorithms process a single frame, which reduces accuracy when image quality is poor. This paper presents methods to improve license plate recognition accuracy by tracking the license plate in multiple frames. First, the Adaptive License Plate Rotation algorithm is applied to correctly align the detected license plate. Second, we propose a method called Character Time-series Matching to recognize license plate characters from many consecutive frames. The proposed method achieves high performance on the UFPR-ALPR dataset, with 96.7% accuracy in real time on an RTX A5000 GPU card. We also deploy the algorithm for the Vietnamese ALPR system. The accuracies for license plate detection and character recognition are 0.881 and 0.979 mAP@.5 on the test set, respectively. The source code is available at https://github.com/chequanghuy/Character-Time-series-Matching.git
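As a toy illustration of why aggregating recognitions over consecutive frames helps, the snippet below performs confidence-weighted, per-position voting over several frame-level readings of the same plate; the real Character Time-series Matching algorithm is more involved, so treat this only as intuition.

```python
from collections import Counter

def aggregate_plate_readings(frame_readings):
    """Per-position, confidence-weighted voting across consecutive frames.

    frame_readings: list of (text, confidence) pairs from consecutive frames.
    """
    # use the most common reading length as the plate length
    length = Counter(len(t) for t, _ in frame_readings).most_common(1)[0][0]
    votes = [Counter() for _ in range(length)]
    for text, conf in frame_readings:
        if len(text) != length:
            continue
        for i, ch in enumerate(text):
            votes[i][ch] += conf                     # confidence-weighted vote
    return ''.join(v.most_common(1)[0][0] for v in votes)

print(aggregate_plate_readings([("ABC123", 0.9), ("A8C123", 0.5), ("ABC128", 0.7)]))
# -> ABC123
```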

Improving Transferability of Adversarial Examples via Bayesian Attacks

  • paper_url: http://arxiv.org/abs/2307.11334
  • repo_url: None
  • paper_authors: Qizhang Li, Yiwen Guo, Xiaochen Yang, Wangmeng Zuo, Hao Chen
  • for: Improving the transferability of adversarial examples across models.
  • methods: Incorporates a Bayesian formulation into both the model parameters and the model input, jointly diversifying the two.
  • results: 1) combining Bayesian formulations over the input and parameters yields significant improvements in transferability; 2) advanced approximations of the posterior over the model input further enhance transfer attacks, surpassing all state-of-the-art methods without model fine-tuning; 3) a principled fine-tuning objective for model parameters is derived that encourages flat minima in both the parameter and input spaces.
    Abstract This paper presents a substantial extension of our work published at ICLR. Our ICLR work advocated for enhancing transferability in adversarial examples by incorporating a Bayesian formulation into model parameters, which effectively emulates the ensemble of infinitely many deep neural networks, while, in this paper, we introduce a novel extension by incorporating the Bayesian formulation into the model input as well, enabling the joint diversification of both the model input and model parameters. Our empirical findings demonstrate that: 1) the combination of Bayesian formulations for both the model input and model parameters yields significant improvements in transferability; 2) by introducing advanced approximations of the posterior distribution over the model input, adversarial transferability achieves further enhancement, surpassing all state-of-the-arts when attacking without model fine-tuning. Moreover, we propose a principled approach to fine-tune model parameters in such an extended Bayesian formulation. The derived optimization objective inherently encourages flat minima in the parameter space and input space. Extensive experiments demonstrate that our method achieves a new state-of-the-art on transfer-based attacks, improving the average success rate on ImageNet and CIFAR-10 by 19.14% and 2.08%, respectively, when comparing with our ICLR basic Bayesian method. We will make our code publicly available.
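The core idea, averaging attack gradients over random draws of both the model parameters and the input, can be sketched in PyTorch as below; the isotropic Gaussian perturbations and their scales are simplifying assumptions and not the paper's posterior approximations.

```python
import torch

def bayesian_transfer_gradient(model, x, y, loss_fn, n_samples=8,
                               param_std=1e-3, input_std=1e-2):
    """Average attack gradients over sampled model parameters and inputs."""
    grad = torch.zeros_like(x)
    params = [p.detach().clone() for p in model.parameters()]
    for _ in range(n_samples):
        with torch.no_grad():                       # sample perturbed parameters
            for p, p0 in zip(model.parameters(), params):
                p.copy_(p0 + param_std * torch.randn_like(p0))
        x_s = (x + input_std * torch.randn_like(x)).requires_grad_(True)
        loss = loss_fn(model(x_s), y)
        grad += torch.autograd.grad(loss, x_s)[0]   # gradient w.r.t. the input
    with torch.no_grad():                           # restore the original weights
        for p, p0 in zip(model.parameters(), params):
            p.copy_(p0)
    return grad / n_samples
```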

EndoSurf: Neural Surface Reconstruction of Deformable Tissues with Stereo Endoscope Videos

  • paper_url: http://arxiv.org/abs/2307.11307
  • repo_url: https://github.com/ruyi-zha/endosurf
  • paper_authors: Ruyi Zha, Xuelian Cheng, Hongdong Li, Mehrtash Harandi, Zongyuan Ge
  • for: reconstruction of soft tissues from stereo endoscope videos
  • methods: neural-field-based method (EndoSurf) using deformation field, SDF field, and radiance field for shape and texture representation
  • results: significantly outperforms existing solutions in reconstructing high-fidelity shapes, as demonstrated by experiments on public endoscope datasets.
    Abstract Reconstructing soft tissues from stereo endoscope videos is an essential prerequisite for many medical applications. Previous methods struggle to produce high-quality geometry and appearance due to their inadequate representations of 3D scenes. To address this issue, we propose a novel neural-field-based method, called EndoSurf, which effectively learns to represent a deforming surface from an RGBD sequence. In EndoSurf, we model surface dynamics, shape, and texture with three neural fields. First, 3D points are transformed from the observed space to the canonical space using the deformation field. The signed distance function (SDF) field and radiance field then predict their SDFs and colors, respectively, with which RGBD images can be synthesized via differentiable volume rendering. We constrain the learned shape by tailoring multiple regularization strategies and disentangling geometry and appearance. Experiments on public endoscope datasets demonstrate that EndoSurf significantly outperforms existing solutions, particularly in reconstructing high-fidelity shapes. Code is available at https://github.com/Ruyi-Zha/endosurf.git.

MAS: Towards Resource-Efficient Federated Multiple-Task Learning

  • paper_url: http://arxiv.org/abs/2307.11285
  • repo_url: None
  • paper_authors: Weiming Zhuang, Yonggang Wen, Lingjuan Lyu, Shuai Zhang
  • for: Proposes a new approach to training multiple simultaneous federated learning (FL) tasks on resource-constrained devices, improving the performance and efficiency of FL training.
  • methods: MAS (Merge and Split) merges multiple FL tasks into an all-in-one task with a multi-task architecture, then splits it into two or more groups based on the affinities among tasks measured during all-in-one training, continuing to train each split from the all-in-one model parameters.
  • results: Extensive experiments demonstrate that MAS outperforms other methods while reducing training time by 2x and energy consumption by 40%.
    Abstract Federated learning (FL) is an emerging distributed machine learning method that empowers in-situ model training on decentralized edge devices. However, multiple simultaneous FL tasks could overload resource-constrained devices. In this work, we propose the first FL system to effectively coordinate and train multiple simultaneous FL tasks. We first formalize the problem of training simultaneous FL tasks. Then, we present our new approach, MAS (Merge and Split), to optimize the performance of training multiple simultaneous FL tasks. MAS starts by merging FL tasks into an all-in-one FL task with a multi-task architecture. After training for a few rounds, MAS splits the all-in-one FL task into two or more FL tasks by using the affinities among tasks measured during the all-in-one training. It then continues training each split of FL tasks based on model parameters from the all-in-one training. Extensive experiments demonstrate that MAS outperforms other methods while reducing training time by 2x and reducing energy consumption by 40%. We hope this work will inspire the community to further study and optimize training simultaneous FL tasks.
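The split step groups tasks by the affinities measured during all-in-one training. A toy sketch using hierarchical clustering on a pairwise affinity matrix is given below; the clustering choice and the example affinities are assumptions, not the paper's actual splitting criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def split_tasks_by_affinity(affinity, num_groups=2):
    """Group FL tasks from a pairwise affinity matrix (higher = more related)."""
    distance = 1.0 - affinity
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    labels = fcluster(linkage(condensed, method="average"),
                      t=num_groups, criterion="maxclust")
    return [list(np.where(labels == g)[0]) for g in np.unique(labels)]

# Example: a 4-task affinity matrix with two related pairs, expected to be
# grouped as {0, 1} and {2, 3}.
aff = np.array([[1.0, 0.9, 0.1, 0.2],
                [0.9, 1.0, 0.2, 0.1],
                [0.1, 0.2, 1.0, 0.8],
                [0.2, 0.1, 0.8, 1.0]])
print(split_tasks_by_affinity(aff))
```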

Learning to Segment from Noisy Annotations: A Spatial Correction Approach

  • paper_url: http://arxiv.org/abs/2308.02498
  • repo_url: https://github.com/michaelofsbu/spatialcorrection
  • paper_authors: Jiachen Yao, Yikai Zhang, Songzhu Zheng, Mayank Goswami, Prateek Prasanna, Chao Chen
  • for: Proposes a new Markov model of label noise to mitigate annotation errors in medical image segmentation tasks.
  • methods: A Markov-model-based label correction method that progressively recovers the true labels, with theoretical guarantees of correctness.
  • results: Experiments show the method outperforms existing approaches on both synthetic and real-world noisy annotations.
    Abstract Noisy labels can significantly affect the performance of deep neural networks (DNNs). In medical image segmentation tasks, annotations are error-prone due to the high demand in annotation time and in the annotators' expertise. Existing methods mostly assume noisy labels in different pixels are \textit{i.i.d}. However, segmentation label noise usually has strong spatial correlation and has prominent bias in distribution. In this paper, we propose a novel Markov model for segmentation noisy annotations that encodes both spatial correlation and bias. Further, to mitigate such label noise, we propose a label correction method to recover true label progressively. We provide theoretical guarantees of the correctness of the proposed method. Experiments show that our approach outperforms current state-of-the-art methods on both synthetic and real-world noisy annotations.

Screening Mammography Breast Cancer Detection

  • paper_url: http://arxiv.org/abs/2307.11274
  • repo_url: https://github.com/chakrabortyde/rsna-breast-cancer
  • paper_authors: Debajyoti Chakraborty
  • for: Improving the efficiency and accuracy of breast cancer screening to reduce cost and the patient anxiety caused by false positives.
  • methods: Tests several automated detection approaches against the RSNA dataset of radiographic breast images from roughly 20,000 female patients, yielding an average validation case pF1 score of 0.56 across methods.
  • results: Demonstrates the potential to improve screening efficiency and accuracy, reducing cost and unnecessary follow-up from false positives.
    Abstract Breast cancer is a leading cause of cancer-related deaths, but current programs are expensive and prone to false positives, leading to unnecessary follow-up and patient anxiety. This paper proposes a solution to automated breast cancer detection, to improve the efficiency and accuracy of screening programs. Different methodologies were tested against the RSNA dataset of radiographic breast images of roughly 20,000 female patients and yielded an average validation case pF1 score of 0.56 across methods.

SimCol3D – 3D Reconstruction during Colonoscopy Challenge

  • paper_url: http://arxiv.org/abs/2307.11261
  • repo_url: None
  • paper_authors: Anita Rau, Sophia Bano, Yueming Jin, Pablo Azagra, Javier Morlana, Edward Sanderson, Bogdan J. Matuszewski, Jae Young Lee, Dong-Jae Lee, Erez Posner, Netanel Frank, Varshini Elangovan, Sista Raviteja, Zhengwen Li, Jiquan Liu, Seenivasan Lalithkumar, Mobarakol Islam, Hongliang Ren, José M. M. Montiel, Danail Stoyanov
  • for: Supporting colonoscopy screening and treatment of colorectal cancer.
  • methods: Learning-based depth and pose prediction from colonoscopy video, benchmarked in the SimCol3D sub-challenge.
  • results: Depth prediction in virtual colonoscopy is robustly solvable, while pose estimation remains an open research question.
    Abstract Colorectal cancer is one of the most common cancers in the world. While colonoscopy is an effective screening technique, navigating an endoscope through the colon to detect polyps is challenging. A 3D map of the observed surfaces could enhance the identification of unscreened colon tissue and serve as a training platform. However, reconstructing the colon from video footage remains unsolved due to numerous factors such as self-occlusion, reflective surfaces, lack of texture, and tissue deformation that limit feature-based methods. Learning-based approaches hold promise as robust alternatives, but necessitate extensive datasets. By establishing a benchmark, the 2022 EndoVis sub-challenge SimCol3D aimed to facilitate data-driven depth and pose prediction during colonoscopy. The challenge was hosted as part of MICCAI 2022 in Singapore. Six teams from around the world and representatives from academia and industry participated in the three sub-challenges: synthetic depth prediction, synthetic pose prediction, and real pose prediction. This paper describes the challenge, the submitted methods, and their results. We show that depth prediction in virtual colonoscopy is robustly solvable, while pose estimation remains an open research question.

Towards Non-Parametric Models for Confidence Aware Image Prediction from Low Data using Gaussian Processes

  • paper_url: http://arxiv.org/abs/2307.11259
  • repo_url: None
  • paper_authors: Nikhil U. Shinde, Florian Richter, Michael C. Yip
  • for: Predicting future images of an image sequence from very little training data.
  • methods: Non-parametric, probabilistic image prediction with Gaussian Processes, propagating uncertainty through time to produce a confidence metric.
  • results: Successfully predicts future frames in a smooth fluid simulation environment.
    Abstract The ability to envision future states is crucial to informed decision making while interacting with dynamic environments. With cameras providing a prevalent and information rich sensing modality, the problem of predicting future states from image sequences has garnered a lot of attention. Current state of the art methods typically train large parametric models for their predictions. Though often able to predict with accuracy, these models rely on the availability of large training datasets to converge to useful solutions. In this paper we focus on the problem of predicting future images of an image sequence from very little training data. To approach this problem, we use non-parametric models to take a probabilistic approach to image prediction. We generate probability distributions over sequentially predicted images and propagate uncertainty through time to generate a confidence metric for our predictions. Gaussian Processes are used for their data efficiency and ability to readily incorporate new training data online. We showcase our method by successfully predicting future frames of a smooth fluid simulation environment.
    摘要 <>预测未来状态的能力对于在动态环境中决策是非常重要的。由于摄像头是一种非常普遍和信息充沛的感知方式,预测未来状态从图像序列中得到的问题已经吸引了很多注意。当前的状态艺术方法通常是通过大型参数模型进行预测。虽然它们经常可以准确预测,但它们需要大量的训练数据来得到有用的解决方案。在这篇论文中,我们关注的是从非常少的训练数据中预测图像序列的未来帧。为了解决这个问题,我们使用非Parametric模型采取一种 probabilistic 的方法来预测图像。我们生成图像序列中预测的probability分布,并将uncertainty通过时间进行传播,以生成一个 confidence 度量器 для我们的预测。使用 Gaussian Processes 的数据效率和能够轻松地在线上添加新的训练数据,我们成功地预测了一个平滑的液体流体动画环境中的未来帧。

UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models

  • paper_url: http://arxiv.org/abs/2307.11227
  • repo_url: None
  • paper_authors: Xin Li, Sima Behpour, Thang Doan, Wenbin He, Liang Gou, Liu Ren
  • for: Optimizing performance for undefined downstream tasks under a limited annotation budget by selecting instances for labeling from an unlabeled dataset in a single pass.
  • methods: UP-DP, a simple yet effective unsupervised prompt learning approach that adapts vision-language models such as BLIP-2 for data pre-selection, exploiting the joint vision-text feature space to obtain a diverse cluster structure covering the dataset.
  • results: Achieves up to a 20% performance gain over the state of the art across seven benchmark datasets in different settings; prompts learned on one dataset generalize and directly enhance BLIP-2 feature extraction on other datasets.
    Abstract In this study, we investigate the task of data pre-selection, which aims to select instances for labeling from an unlabeled dataset through a single pass, thereby optimizing performance for undefined downstream tasks with a limited annotation budget. Previous approaches to data pre-selection relied solely on visual features extracted from foundation models, such as CLIP and BLIP-2, but largely ignored the powerfulness of text features. In this work, we argue that, with proper design, the joint feature space of both vision and text can yield a better representation for data pre-selection. To this end, we introduce UP-DP, a simple yet effective unsupervised prompt learning approach that adapts vision-language models, like BLIP-2, for data pre-selection. Specifically, with the BLIP-2 parameters frozen, we train text prompts to extract the joint features with improved representation, ensuring a diverse cluster structure that covers the entire dataset. We extensively compare our method with the state-of-the-art using seven benchmark datasets in different settings, achieving up to a performance gain of 20%. Interestingly, the prompts learned from one dataset demonstrate significant generalizability and can be applied directly to enhance the feature extraction of BLIP-2 from other datasets. To the best of our knowledge, UP-DP is the first work to incorporate unsupervised prompt learning in a vision-language model for data pre-selection.

Heuristic Hyperparameter Choice for Image Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.11197
  • repo_url: None
  • paper_authors: Zeyu Jiang, João P. C. Bertoldo, Etienne Decencière
  • for: Image anomaly detection using deep features from pretrained models, reducing computational cost while preserving performance.
  • methods: Dimension reduction with Negated Principal Component Analysis (NPCA) and heuristics for choosing its hyperparameters so as to keep as few feature components as possible while maintaining good performance.
  • results: The NPCA-based reduction lowers the computational cost of image anomaly detection while maintaining good detection performance.
    Abstract Anomaly detection (AD) in images is a fundamental computer vision problem in which deep neural networks are used to identify images deviating significantly from normality. The deep features extracted from pretrained models have been proved to be essential for AD based on multivariate Gaussian distribution analysis. However, since models are usually pretrained on a large dataset for classification tasks such as ImageNet, they might produce lots of redundant features for AD, which increases computational cost and degrades the performance. We aim to reduce the dimension of these features with Negated Principal Component Analysis (NPCA). We therefore propose heuristics for choosing the hyperparameters of the NPCA algorithm so as to obtain as few feature components as possible while ensuring good performance.
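A short scikit-learn sketch of the negated-PCA idea, keeping only the least-variance principal directions of features collected from normal images, is shown below; the 97% discard threshold stands in for the kind of heuristic hyperparameter choice the paper studies and is an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def npca_projection(features, discard_variance=0.97):
    """Fit PCA on normal-image features and keep only the least-variance
    directions, dropping the components that explain `discard_variance`
    of the total variance.

    features: (n_samples, n_features) array of deep features from normal images.
    """
    pca = PCA().fit(features)
    cum = np.cumsum(pca.explained_variance_ratio_)
    start = np.searchsorted(cum, discard_variance) + 1   # first kept component
    components = pca.components_[start:]                 # low-variance directions
    projected = (features - pca.mean_) @ components.T
    return projected, components
```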

Representation Learning in Anomaly Detection: Successes, Limits and a Grand Challenge

  • paper_url: http://arxiv.org/abs/2307.11085
  • repo_url: None
  • paper_authors: Yedid Hoshen
  • for: A perspective on anomaly detection, arguing that the dominant paradigm cannot scale indefinitely and will eventually hit fundamental limits.
  • methods: Builds on a no-free-lunch principle for anomaly detection and poses two grand challenges: scientific discovery by anomaly detection, and detecting the most anomalous image in the ImageNet dataset.
  • results: Argues that new anomaly detection tools and ideas will need to be developed to overcome these challenges.
    Abstract In this perspective paper, we argue that the dominant paradigm in anomaly detection cannot scale indefinitely and will eventually hit fundamental limits. This is due to a no-free-lunch principle for anomaly detection. These limitations can be overcome when there are strong task priors, as is the case for many industrial tasks. When such priors do not exist, the task is much harder for anomaly detection. We pose two such tasks as grand challenges for anomaly detection: i) scientific discovery by anomaly detection; ii) a "mini-grand" challenge of detecting the most anomalous image in the ImageNet dataset. We believe new anomaly detection tools and ideas would need to be developed to overcome these challenges.

GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos

  • paper_url: http://arxiv.org/abs/2307.11081
  • repo_url: https://github.com/nisargshah1999/glsformer
  • paper_authors: Nisarg A. Shah, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel
  • for: Automated surgical step recognition.
  • methods: Uses a vision transformer to learn spatio-temporal features directly from sequences of frame-level patches, integrating short-term and long-term representations with a gated temporal attention mechanism.
  • results: Extensive evaluation on two cataract surgery video datasets (Cataract-101 and D99) demonstrates superior performance compared with various state-of-the-art methods.
    Abstract Automated surgical step recognition is an important task that can significantly improve patient safety and decision-making during surgeries. Existing state-of-the-art methods for surgical step recognition either rely on separate, multi-stage modeling of spatial and temporal information or operate on short-range temporal resolution when learned jointly. However, the benefits of joint modeling of spatio-temporal features and long-range information are not taken in account. In this paper, we propose a vision transformer-based approach to jointly learn spatio-temporal features directly from sequence of frame-level patches. Our method incorporates a gated-temporal attention mechanism that intelligently combines short-term and long-term spatio-temporal feature representations. We extensively evaluate our approach on two cataract surgery video datasets, namely Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods. These results validate the suitability of our proposed approach for automated surgical step recognition. Our code is released at: https://github.com/nisargshah1999/GLSFormer
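To give a flavour of gated temporal fusion, the following minimal PyTorch module blends short-term and long-term spatio-temporal features through a learned sigmoid gate; the real GLSFormer attention mechanism operates on token sequences and is considerably more elaborate, so this is only an assumption-laden simplification.

```python
import torch
import torch.nn as nn

class GatedTemporalFusion(nn.Module):
    """Learned sigmoid gate blending short- and long-term features."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, short_feat, long_feat):       # both (B, dim)
        g = self.gate(torch.cat([short_feat, long_feat], dim=-1))
        return g * short_feat + (1.0 - g) * long_feat
```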

Learning Dense UV Completion for Human Mesh Recovery

  • paper_url: http://arxiv.org/abs/2307.11074
  • repo_url: None
  • paper_authors: Yanjun Wang, Qingping Sun, Wenjia Wang, Jun Ling, Zhongang Cai, Rong Xie, Li Song
  • for: Human mesh recovery from a single image, especially under occlusion by the subject itself, objects, or other people.
  • methods: Dense Inpainting Human Mesh Recovery (DIMR) uses dense correspondence maps to separate visible human features, and an attention-based feature completion module to inpaint features on a structured UV map; a feature inpainting training procedure guides the network to learn from unoccluded features.
  • results: Evaluations on several datasets show clearly better performance than prior state-of-the-art methods under heavy occlusion and comparable results on standard benchmarks (3DPW).
    Abstract Human mesh reconstruction from a single image is challenging in the presence of occlusion, which can be caused by self, objects, or other humans. Existing methods either fail to separate human features accurately or lack proper supervision for feature completion. In this paper, we propose Dense Inpainting Human Mesh Recovery (DIMR), a two-stage method that leverages dense correspondence maps to handle occlusion. Our method utilizes a dense correspondence map to separate visible human features and completes human features on a structured UV map dense human with an attention-based feature completion module. We also design a feature inpainting training procedure that guides the network to learn from unoccluded features. We evaluate our method on several datasets and demonstrate its superior performance under heavily occluded scenarios compared to other methods. Extensive experiments show that our method obviously outperforms prior SOTA methods on heavily occluded images and achieves comparable results on the standard benchmarks (3DPW).

Towards General Game Representations: Decomposing Games Pixels into Content and Style

  • paper_url: http://arxiv.org/abs/2307.11141
  • repo_url: None
  • paper_authors: Chintan Trivedi, Konstantinos Makantasis, Antonios Liapis, Georgios N. Yannakakis
  • for: Exploiting the rich contextual information in game footage for downstream AI tasks, including game-playing agents, procedural content generation, and player modelling.
  • methods: Uses a pre-trained Vision Transformer encoder and a genre-based decomposition technique to obtain separate content and style embeddings.
  • results: The decomposed embeddings achieve style invariance across multiple games while maintaining strong content extraction, suggesting better generalization across game environments.
    Abstract On-screen game footage contains rich contextual information that players process when playing and experiencing a game. Learning pixel representations of games can benefit artificial intelligence across several downstream tasks including game-playing agents, procedural content generation, and player modelling. The generalizability of these methods, however, remains a challenge, as learned representations should ideally be shared across games with similar game mechanics. This could allow, for instance, game-playing agents trained on one game to perform well in similar games with no re-training. This paper explores how generalizable pre-trained computer vision encoders can be for such tasks, by decomposing the latent space into content embeddings and style embeddings. The goal is to minimize the domain gap between games of the same genre when it comes to game content critical for downstream tasks, and ignore differences in graphical style. We employ a pre-trained Vision Transformer encoder and a decomposition technique based on game genres to obtain separate content and style embeddings. Our findings show that the decomposed embeddings achieve style invariance across multiple games while still maintaining strong content extraction capabilities. We argue that the proposed decomposition of content and style offers better generalization capacities across game environments independently of the downstream task.
    摘要 电脑游戏截屏视频内容含有丰富的上下文信息,玩家在游戏时会处理这些信息。学习游戏像素表示可以提高人工智能在多个下游任务中的表现,如游戏玩家代理、过程内容生成和玩家模型。然而,这些方法的通用性仍然是一个挑战,因为学习的表示应该能够在同类游戏中共享。这可以让游戏玩家代理从一款游戏中转移到类似游戏中,无需重新训练。本文研究如何使用普通的计算机视觉encoder和游戏类别 decomposition技术来提取游戏内容和风格的嵌入。我们的发现表明,这些分解的嵌入可以在多个游戏中保持风格不变,同时仍然保留强大的内容提取能力。我们认为,我们的内容和风格分解方法可以在不同的游戏环境下独立地提高总化能力。

CNOS: A Strong Baseline for CAD-based Novel Object Segmentation

  • paper_url: http://arxiv.org/abs/2307.11067
  • repo_url: https://github.com/nv-nguyen/cnos
  • paper_authors: Van Nguyen Nguyen, Tomas Hodan, Georgy Ponimatkin, Thibault Groueix, Vincent Lepetit
  • for: 针对 RGB 图像中未见对象的分割任务,利用对象的 CAD 模型生成参考描述符,并将分割提议与参考描述符匹配,从而实现精确的对象 ID 分配和模态掩码。
  • methods: 采用三阶段方法:首先利用最新的基础模型 DINOv2 和 Segment Anything 生成描述符和分割提议(含二值掩码),然后将提议描述符与由 CAD 模型得到的参考描述符匹配,最后输出带对象 ID 的模态掩码(匹配步骤的示例见下文)。
  • results: 在 BOP 挑战的七个核心数据集上进行实验,在相同的 BOP 评估协议下以 19.8% AP 的优势超越现有方法,达到最先进水平。
    Abstract We propose a simple three-stage approach to segment unseen objects in RGB images using their CAD models. Leveraging recent powerful foundation models, DINOv2 and Segment Anything, we create descriptors and generate proposals, including binary masks for a given input RGB image. By matching proposals with reference descriptors created from CAD models, we achieve precise object ID assignment along with modal masks. We experimentally demonstrate that our method achieves state-of-the-art results in CAD-based novel object segmentation, surpassing existing approaches on the seven core datasets of the BOP challenge by 19.8% AP using the same BOP evaluation protocol. Our source code is available at https://github.com/nv-nguyen/cnos.
    摘要 我们提出了一种简单的三个阶段方法,用于使用CAD模型来分割RGB图像中的未见对象。利用最新的强大基础模型,DINOv2和Segment Anything,我们创建了描述符和生成提案,包括输入RGB图像的二进制掩码。通过与参考描述符,创建从CAD模型中获得的对象ID分配和modal掩码。我们经验表明,我们的方法可以在BOP挑战的七个核心数据集上达到状态理论的Result,比既有方法在同一BOP评估协议下提高19.8%的AP。我们的源代码可以在https://github.com/nv-nguyen/cnos上获取。
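A minimal sketch of the matching stage described above, assuming proposal descriptors and per-object CAD-template descriptors are already available as arrays; the function names, score threshold, and max-over-templates aggregation are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def assign_object_ids(proposal_desc, reference_desc, score_thresh=0.5):
    """Match proposal descriptors to per-object CAD template descriptors.

    proposal_desc:  (P, D) array, one descriptor per segmentation proposal.
    reference_desc: dict {object_id: (T, D) array} of descriptors computed
                    from T rendered CAD templates per object.
    Returns one (object_id, score) pair per proposal (None if below threshold).
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    proposals = normalize(proposal_desc)
    results = []
    for p in proposals:
        best_id, best_score = None, -1.0
        for obj_id, templates in reference_desc.items():
            sims = normalize(templates) @ p          # cosine similarity per template
            score = float(sims.max())                # aggregate over templates
            if score > best_score:
                best_id, best_score = obj_id, score
        results.append((best_id, best_score) if best_score >= score_thresh else (None, best_score))
    return results

# toy usage with random descriptors
rng = np.random.default_rng(0)
props = rng.normal(size=(3, 128))
refs = {obj: rng.normal(size=(42, 128)) for obj in ("obj_01", "obj_05")}
print(assign_object_ids(props, refs, score_thresh=0.0))
```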

HRFNet: High-Resolution Forgery Network for Localizing Satellite Image Manipulation

  • paper_url: http://arxiv.org/abs/2307.11052
  • repo_url: None
  • paper_authors: Fahim Faisal Niloy, Kishor Kumar Bhaumik, Simon S. Woo
  • for: 本研究旨在提出一种高分辨率卫星图像伪造定位方法,以解决现有方法(基于 patch 或降采样训练)存在的缺陷。
  • methods: 提出名为 HRFNet 的新模型,包含 shallow 和 deep 两个分支,能在全局和局部层面融合 RGB 特征与重采样特征,从而更准确地定位伪造区域。
  • results: 多项实验表明,该方法能够准确定位卫星图像中的伪造区域,且内存需求和处理速度与现有方法相比不受影响。
    Abstract Existing high-resolution satellite image forgery localization methods rely on patch-based or downsampling-based training. Both of these training methods have major drawbacks, such as inaccurate boundaries between pristine and forged regions, the generation of unwanted artifacts, etc. To tackle the aforementioned challenges, inspired by the high-resolution image segmentation literature, we propose a novel model called HRFNet to enable satellite image forgery localization effectively. Specifically, equipped with shallow and deep branches, our model can successfully integrate RGB and resampling features in both global and local manners to localize forgery more accurately. We perform various experiments to demonstrate that our method achieves the best performance, while the memory requirement and processing speed are not compromised compared to existing methods.
    摘要 当前高分辨率卫星图像假造地点 localization 方法通常基于 patch-based 或 downsampling-based 训练。两者均有严重缺陷,如假造区域与原始区域的界限不准确、生成不必要的 artifacts 等。为了解决上述挑战,我们引用高分辨率图像分割文献,提出了一种新的模型called HRFNet,用于有效地进行卫星图像假造地点 localization。具体来说,我们的模型具有 shallow 和 deep 分支,可以成功地在全球和本地方面 интегра RGB 和抽样特征,以更加准确地Localize 假造。我们进行了多种实验,证明了我们的方法可以具有最高性能,而占用内存和处理速度与现有方法相比,不会受到影响。

Multi-objective point cloud autoencoders for explainable myocardial infarction prediction

  • paper_url: http://arxiv.org/abs/2307.11017
  • repo_url: None
  • paper_authors: Marcel Beetz, Abhirup Banerjee, Vicente Grau
  • for: 预测心肌梗死(Myocardial Infarction,MI)的病理学基础。
  • methods: 使用多目标点云自编码器(multi-objective point cloud autoencoder),一种基于多类 3D 点云表示心脏解剖与功能的几何深度学习方法,能有效学习心脏 3D 形态特征,并给出可解释的 MI 预测结果(损失函数的示例见下文)。
  • results: 在大型 UK Biobank 数据集上,该方法能以低于图像像素分辨率的 Chamfer 距离准确重建多时相 3D 形态;在新发 MI 预测任务上,其 ROC 曲线下面积比多种机器学习和深度学习基准高出 19%;其紧凑的任务特定隐空间可清晰分离对照组与 MI 组,且受试者编码与相应 3D 形态之间存在临床上合理的关联,体现了预测的可解释性。
    Abstract Myocardial infarction (MI) is one of the most common causes of death in the world. Image-based biomarkers commonly used in the clinic, such as ejection fraction, fail to capture more complex patterns in the heart's 3D anatomy and thus limit diagnostic accuracy. In this work, we present the multi-objective point cloud autoencoder as a novel geometric deep learning approach for explainable infarction prediction, based on multi-class 3D point cloud representations of cardiac anatomy and function. Its architecture consists of multiple task-specific branches connected by a low-dimensional latent space to allow for effective multi-objective learning of both reconstruction and MI prediction, while capturing pathology-specific 3D shape information in an interpretable latent space. Furthermore, its hierarchical branch design with point cloud-based deep learning operations enables efficient multi-scale feature learning directly on high-resolution anatomy point clouds. In our experiments on a large UK Biobank dataset, the multi-objective point cloud autoencoder is able to accurately reconstruct multi-temporal 3D shapes with Chamfer distances between predicted and input anatomies below the underlying images' pixel resolution. Our method outperforms multiple machine learning and deep learning benchmarks for the task of incident MI prediction by 19% in terms of Area Under the Receiver Operating Characteristic curve. In addition, its task-specific compact latent space exhibits easily separable control and MI clusters with clinically plausible associations between subject encodings and corresponding 3D shapes, thus demonstrating the explainability of the prediction.
    摘要 myocardial infarction (MI) 是世界上最常见的死亡原因之一。传统的图像基于标记器,如舒张率,无法捕捉心脏三维解剖结构中更复杂的模式,因此限制了诊断精度。在这项工作中,我们提出了基于多对象点云自适应神经网络的新的几何深度学方法,用于可见的损害预测,基于多类三维点云表示的律动器解剖结构和功能。其架构包括多个任务特定分支,通过低维度的干扰空间相连,以实现有效的多目标学习 both 重建和MI预测,同时捕捉疾病特定的三维形态信息。此外,其层次分支设计和点云深度运算使得高分辨率的解剖点云上可以有效地进行多级别特征学习。在我们对大型UK Biobank数据集进行实验时,多对象点云自适应神经网络能够准确地重建多个时间点的三维形态,Chamfer距离输入和预测的形态之间的距离小于图像的像素分辨率。我们的方法在多种机器学习和深度学习标准准点上出现19%的提升,在接收操作特征曲线图下的面积来评估预测效果。此外,任务特定的紧凑的干托空间显示可以分别控制和MI层次分解,并且与相应的三维形态之间存在严格的相关关系,这表明预测是可解释的。
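A minimal sketch of the kind of multi-objective training loss described above: a Chamfer-distance reconstruction term combined with an incident-MI classification term. The loss weights and tensor shapes are assumptions, not the paper's exact configuration:

```python
import torch

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between point clouds of shape (B, N, 3) and (B, M, 3)."""
    d = torch.cdist(pred, target)            # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def multi_objective_loss(recon, target_cloud, mi_logits, mi_labels, alpha=1.0, beta=0.5):
    """Reconstruction + MI-prediction objective (the weights alpha/beta are assumptions)."""
    rec = chamfer_distance(recon, target_cloud)
    cls = torch.nn.functional.binary_cross_entropy_with_logits(mi_logits, mi_labels)
    return alpha * rec + beta * cls

# toy example
recon = torch.randn(2, 1024, 3)
target = torch.randn(2, 900, 3)
logits = torch.randn(2)
labels = torch.tensor([1.0, 0.0])
print(multi_objective_loss(recon, target, logits, labels))
```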

General Image-to-Image Translation with One-Shot Image Guidance

  • paper_url: http://arxiv.org/abs/2307.14352
  • repo_url: https://github.com/crystalneuro/visual-concept-translator
  • paper_authors: Bin Cheng, Zuhao Liu, Yunbo Peng, Yue Lin
  • for: 可以将愿景图像中的视觉概念翻译成另一种图像,保持源图像的内容。
  • methods: 提出了一种名为视觉概念翻译器(VCT)的新框架,通过内容概念分离和内容概念融合两个过程来实现图像翻译。
  • results: 经验证明,提出的方法可以在各种普遍的图像翻译任务中达到优秀的结果,并且可以保持源图像的内容。
    Abstract Large-scale text-to-image models pre-trained on massive text-image pairs show excellent performance in image synthesis recently. However, image can provide more intuitive visual concepts than plain text. People may ask: how can we integrate the desired visual concept into an existing image, such as our portrait? Current methods are inadequate in meeting this demand as they lack the ability to preserve content or translate visual concepts effectively. Inspired by this, we propose a novel framework named visual concept translator (VCT) with the ability to preserve content in the source image and translate the visual concepts guided by a single reference image. The proposed VCT contains a content-concept inversion (CCI) process to extract contents and concepts, and a content-concept fusion (CCF) process to gather the extracted information to obtain the target image. Given only one reference image, the proposed VCT can complete a wide range of general image-to-image translation tasks with excellent results. Extensive experiments are conducted to prove the superiority and effectiveness of the proposed methods. Codes are available at https://github.com/CrystalNeuro/visual-concept-translator.
    摘要 大规模的文本到图像模型在最近的图像生成中表现出色,但图像可以提供更直观的视觉概念 than plain text。人们可能会问:如何将我们的肖像中的愿望 visual concept 集成到现有的图像中?现有的方法无法满足这个需求,因为它们缺乏保持内容或翻译视觉概念的能力。 draw inspiration from this,我们提出了一种名为视觉概念翻译器(VCT)的新框架,具有保持内容的源图像和根据单个参考图像 guid 视觉概念的翻译能力。VCT 包括一个内容概念反转(CCI)过程,用于提取内容和概念,以及一个内容概念聚合(CCF)过程,用于将提取的信息聚合以获得目标图像。只需要一个参考图像,提出的 VCT 可以完成广泛的普通图像到图像翻译任务,效果极佳。我们进行了广泛的实验,以证明我们的方法的超越性和有效性。代码可以在 https://github.com/CrystalNeuro/visual-concept-translator 上获取。

Frequency-aware optical coherence tomography image super-resolution via conditional generative adversarial neural network

  • paper_url: http://arxiv.org/abs/2307.11130
  • repo_url: None
  • paper_authors: Xueshen Li, Zhenxing Dong, Hongshan Liu, Jennifer J. Kang-Mieler, Yuye Ling, Yu Gan
  • for: 提高医学影像诊断和治疗的能力,特别是Cardiology和Ophthalmology领域。
  • methods: 使用深度学习基于super-resolution技术,以提高图像的分辨率和结构的保存。
  • results: 提出一种频率感知超分辨率框架,将三个关键的频域模块和频域损失函数集成到条件生成对抗网络 (cGAN) 中(频域损失的示例见下文);在大规模冠状动脉 OCT 数据集上的定量评估显示其优于现有深度学习框架,并在鱼角膜和大鼠视网膜图像上验证了其在眼科影像中的泛化能力。
    Abstract Optical coherence tomography (OCT) has stimulated a wide range of medical image-based diagnosis and treatment in fields such as cardiology and ophthalmology. Such applications can be further facilitated by deep learning-based super-resolution technology, which improves the capability of resolving morphological structures. However, existing deep learning-based method only focuses on spatial distribution and disregard frequency fidelity in image reconstruction, leading to a frequency bias. To overcome this limitation, we propose a frequency-aware super-resolution framework that integrates three critical frequency-based modules (i.e., frequency transformation, frequency skip connection, and frequency alignment) and frequency-based loss function into a conditional generative adversarial network (cGAN). We conducted a large-scale quantitative study from an existing coronary OCT dataset to demonstrate the superiority of our proposed framework over existing deep learning frameworks. In addition, we confirmed the generalizability of our framework by applying it to fish corneal images and rat retinal images, demonstrating its capability to super-resolve morphological details in eye imaging.
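A minimal sketch of a frequency-based loss term of the kind such a framework integrates into a cGAN generator objective; the FFT-magnitude formulation and the loss weights are assumptions, not the paper's exact modules:

```python
import torch

def frequency_loss(sr, hr):
    """L1 penalty on the FFT spectra of super-resolved vs. ground-truth images.

    sr, hr: (B, C, H, W) tensors. A generic frequency-fidelity term, not the paper's exact formulation.
    """
    sr_f = torch.fft.rfft2(sr, norm="ortho")
    hr_f = torch.fft.rfft2(hr, norm="ortho")
    return (sr_f.abs() - hr_f.abs()).abs().mean() + (sr_f - hr_f).abs().mean()

def generator_loss(sr, hr, disc_fake_logits, lambda_freq=0.1):
    """Pixel + adversarial + frequency terms for a cGAN generator (weights are assumptions)."""
    pixel = torch.nn.functional.l1_loss(sr, hr)
    adv = torch.nn.functional.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return pixel + adv + lambda_freq * frequency_loss(sr, hr)

sr, hr = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
print(generator_loss(sr, hr, torch.randn(2, 1)))
```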

Deep Spiking-UNet for Image Processing

  • paper_url: http://arxiv.org/abs/2307.10974
  • repo_url: https://github.com/snnresearch/spiking-unet
  • paper_authors: Hebei Li, Yueyi Zhang, Zhiwei Xiong, Zheng-jun Zha, Xiaoyan Sun
  • for: 这篇论文的目的是探讨使用快速神经网络(SNN)在图像处理任务中的应用,并将U-Net架构与SNN结合。
  • methods: 作者引入多阈值脉冲神经元来提高信息传递的效率,并采用“转换 + 微调”的流程来优化转换后的模型,其中使用逐连接归一化来避免发放率失真(多阈值神经元的示例见下文)。
  • results: 实验结果表明,该脉冲 U-Net 在图像分割和去噪任务中达到了与非脉冲网络相当的性能,并超越了现有的 SNN 方法;与未经微调的转换模型相比,推理时间约降低 90%。
    Abstract U-Net, known for its simple yet efficient architecture, is widely utilized for image processing tasks and is particularly suitable for deployment on neuromorphic chips. This paper introduces the novel concept of Spiking-UNet for image processing, which combines the power of Spiking Neural Networks (SNNs) with the U-Net architecture. To achieve an efficient Spiking-UNet, we face two primary challenges: ensuring high-fidelity information propagation through the network via spikes and formulating an effective training strategy. To address the issue of information loss, we introduce multi-threshold spiking neurons, which improve the efficiency of information transmission within the Spiking-UNet. For the training strategy, we adopt a conversion and fine-tuning pipeline that leverage pre-trained U-Net models. During the conversion process, significant variability in data distribution across different parts is observed when utilizing skip connections. Therefore, we propose a connection-wise normalization method to prevent inaccurate firing rates. Furthermore, we adopt a flow-based training method to fine-tune the converted models, reducing time steps while preserving performance. Experimental results show that, on image segmentation and denoising, our Spiking-UNet achieves comparable performance to its non-spiking counterpart, surpassing existing SNN methods. Compared with the converted Spiking-UNet without fine-tuning, our Spiking-UNet reduces inference time by approximately 90\%. This research broadens the application scope of SNNs in image processing and is expected to inspire further exploration in the field of neuromorphic engineering. The code for our Spiking-UNet implementation is available at https://github.com/SNNresearch/Spiking-UNet.
    摘要 U-Net,知名的简单 yet efficient 架构,广泛应用于图像处理任务,特别适合运行在神经模拟器芯片上。这篇论文介绍了一种新的激发式 U-Net 图像处理方法,该方法结合激发式神经网络(SNNs)与 U-Net 架构。为了实现高效的激发式 U-Net,我们面临两个主要挑战:保证通过网络传递高精度信息via spikes,并制定有效的训练策略。为了解决信息损失问题,我们引入多reshold spiking neurons,以提高激发式 U-Net 中信息传输的效率。在训练策略方面,我们采用一个转化和细化训练管道,利用预训练 U-Net 模型。在转化过程中,我们发现了不同部分数据分布变化带来的显著差异。因此,我们提出了一种连接 wise normalization 方法,以避免假象率。此外,我们采用一种流式训练方法,以细化转化后的模型,降低时间步骤而保持性能。实验结果表明,在图像分割和净化任务上,我们的激发式 U-Net 与非激发式 counterpart 的性能相似,超过现有的 SNN 方法。相比转化后的激发式 U-Net без细化,我们的激发式 U-Net 可以降低推理时间约90%。这项研究扩展了 SNN 在图像处理领域的应用范围,预计会激发更多的神经模拟器工程研究。Spiking-UNet 实现代码可在 GitHub 上找到:https://github.com/SNNresearch/Spiking-UNet.
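A minimal sketch of a multi-threshold integrate-and-fire neuron of the kind used to reduce information loss during spike-based transmission; the threshold values and the soft-reset rule are assumptions, not the paper's exact cell:

```python
import torch

class MultiThresholdIFNeuron(torch.nn.Module):
    """Integrate-and-fire neuron with several firing thresholds (a sketch, not the paper's exact cell).

    At each time step the membrane potential is compared against an ascending list of
    thresholds; the output is the value of the highest threshold crossed (0 if none),
    which reduces quantization error compared with a single-threshold neuron.
    """

    def __init__(self, thresholds=(1.0, 2.0, 4.0)):
        super().__init__()
        self.register_buffer("thresholds", torch.tensor(sorted(thresholds)))
        self.v = None  # membrane potential

    def reset(self):
        self.v = None

    def forward(self, x):
        if self.v is None:
            self.v = torch.zeros_like(x)
        self.v = self.v + x
        crossed = (self.v.unsqueeze(-1) >= self.thresholds).float()
        n_crossed = crossed.sum(-1)
        spike_value = torch.where(
            n_crossed > 0,
            self.thresholds[(n_crossed.long() - 1).clamp(min=0)],
            torch.zeros_like(n_crossed))
        self.v = self.v - spike_value  # soft reset by the emitted value
        return spike_value

neuron = MultiThresholdIFNeuron()
for t in range(4):
    print(neuron(torch.tensor([0.6, 1.5, 3.0])))
```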

cs.AI - 2023-07-21

Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

  • paper_url: http://arxiv.org/abs/2307.11661
  • repo_url: https://github.com/mayug/vdt-adapter
  • paper_authors: Mayug Maniparambil, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, Noel E. O’Connor
  • for: This paper focuses on improving the performance of CLIP, a contrastive pre-trained large Vision-Language Model (VLM), on downstream datasets by using GPT-4 to generate visually descriptive text prompts.
  • methods: The authors use GPT-4 to generate text prompts that are relevant to the downstream dataset and use these prompts to adapt CLIP to the dataset in a zero-shot manner. They also design a simple few-shot adapter that learns to choose the best possible sentences to construct generalizable classifiers (see the sketch below).
  • results: The method achieves considerable improvements in zero-shot transfer accuracy on specialized fine-grained datasets, outperforming CLIP's default prompt by around 7% on average. The simple few-shot adapter also outperforms the recently proposed CoCoOP by around 2% on average and by over 4% on 4 specialized fine-grained datasets.
    Abstract Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual representation learning by providing good performance on downstream datasets. VLMs are 0-shot adapted to a downstream dataset by designing prompts that are relevant to the dataset. Such prompt engineering makes use of domain expertise and a validation dataset. Meanwhile, recent developments in generative pretrained models like GPT-4 mean they can be used as advanced internet search tools. They can also be manipulated to provide visual information in any structure. In this work, we show that GPT-4 can be used to generate text that is visually descriptive and how this can be used to adapt CLIP to downstream tasks. We show considerable improvements in 0-shot transfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD (~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP's default prompt. We also design a simple few-shot adapter that learns to choose the best possible sentences to construct generalizable classifiers that outperform the recently proposed CoCoOP by ~2% on average and by over 4% on 4 specialized fine-grained datasets. The code, prompts, and auxiliary text dataset is available at https://github.com/mayug/VDT-Adapter.
    摘要 带有对比学习的大视力语言模型(VLM)如CLIP,已经革命化视觉表示学习。VLM可以通过设计相关的提示来适应下游数据集,而这种提示工程充分利用了领域专业知识和验证数据集。此外,最近的生成预训练模型如GPT-4,可以用作高级网络搜索工具,同时可以通过提供任意结构的视觉信息来操作。在这项工作中,我们展示了GPT-4可以生成视觉描述性文本,并使其用于CLIP的适应下游任务。我们发现,与CLIP的默认提示相比,在特殊化细腻数据集(EuroSAT、DTD、SUN397和CUB)上显示了 considerable improvement(大约7%)。此外,我们还设计了一个简单的几拍适配器,可以选择最佳的句子来构建通用的分类器,超过了最近提出的CoCoOP的平均提升率(大约2%),并在特殊化细腻数据集上提升了4%。代码、提示和辅助文本数据集可以在https://github.com/mayug/VDT-Adapter中下载。
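A minimal sketch of zero-shot classification with visually descriptive prompts, using the open-source CLIP package; the class names and the GPT-4-style descriptive sentences below are illustrative assumptions, not the paper's actual prompts:

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Visually descriptive sentences per class, as GPT-4 might generate them (illustrative only).
class_descriptions = {
    "annual crop": [
        "a satellite photo of neatly arranged rectangular farm fields",
        "aerial view of plowed agricultural land with regular furrows",
    ],
    "forest": [
        "a satellite photo of dense dark-green tree cover",
        "aerial view of an unbroken canopy of trees",
    ],
}

with torch.no_grad():
    class_embeds = []
    for name, sentences in class_descriptions.items():
        tokens = clip.tokenize(sentences).to(device)
        emb = model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embeds.append(emb.mean(0))              # average over descriptive prompts
    class_embeds = torch.stack(class_embeds)
    class_embeds = class_embeds / class_embeds.norm(dim=-1, keepdim=True)

def classify(image_tensor):
    """image_tensor: preprocessed (1, 3, 224, 224) image; returns the best-matching class index."""
    with torch.no_grad():
        img = model.encode_image(image_tensor.to(device)).float()
        img = img / img.norm(dim=-1, keepdim=True)
        return (img @ class_embeds.T).argmax(-1).item()
```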

Bandits with Deterministically Evolving States

  • paper_url: http://arxiv.org/abs/2307.11655
  • repo_url: None
  • paper_authors: Khashayar Khosravi, Renato Paes Leme, Chara Podimata, Apostolis Tsorvantzis
  • for: 研究带赌博机反馈的在线学习问题,其中系统状态不可观测,且按确定性规律演化。
  • methods: 提出“状态确定性演化的赌博机”(Bandits with Deterministically Evolving States)模型:每轮奖励同时取决于所选动作的短期收益和系统当前的“健康”状态,状态以速率 $\lambda \in [0,1]$ 确定性演化,标准多臂赌博机是其特例(一个玩具模拟见下文)。
  • results: 论文针对任意的演化速率参数化分析了在线学习算法,并给出如下遗憾率:for $\lambda \in [0, 1/T^2]$: $\widetilde O(\sqrt{KT})$; for $\lambda = T^{-a/b}$ with $b < a < 2b$: $\widetilde O (T^{b/a})$; for $\lambda \in (1/T, 1 - 1/\sqrt{T})$: $\widetilde O (K^{1/3}T^{2/3})$; and for $\lambda \in [1 - 1/\sqrt{T}, 1]$: $\widetilde O (K\sqrt{T})$.
    Abstract We propose a model for learning with bandit feedback while accounting for deterministically evolving and unobservable states that we call Bandits with Deterministically Evolving States. The workhorse applications of our model are learning for recommendation systems and learning for online ads. In both cases, the reward that the algorithm obtains at each round is a function of the short-term reward of the action chosen and how ``healthy'' the system is (i.e., as measured by its state). For example, in recommendation systems, the reward that the platform obtains from a user's engagement with a particular type of content depends not only on the inherent features of the specific content, but also on how the user's preferences have evolved as a result of interacting with other types of content on the platform. Our general model accounts for the different rate $\lambda \in [0,1]$ at which the state evolves (e.g., how fast a user's preferences shift as a result of previous content consumption) and encompasses standard multi-armed bandits as a special case. The goal of the algorithm is to minimize a notion of regret against the best-fixed sequence of arms pulled. We analyze online learning algorithms for any possible parametrization of the evolution rate $\lambda$. Specifically, the regret rates obtained are: for $\lambda \in [0, 1/T^2]$: $\widetilde O(\sqrt{KT})$; for $\lambda = T^{-a/b}$ with $b < a < 2b$: $\widetilde O (T^{b/a})$; for $\lambda \in (1/T, 1 - 1/\sqrt{T}): \widetilde O (K^{1/3}T^{2/3})$; and for $\lambda \in [1 - 1/\sqrt{T}, 1]: \widetilde O (K\sqrt{T})$.
    摘要 我们提出一个模型,称为带征 Deterministic Evolving States 的 Bandit Learning 模型。这种模型的应用包括推荐系统和在线广告学习。在这些应用中,算法在每个轮次获得的奖励是功能和系统状态(例如,用户的喜好)的函数。例如,在推荐系统中,用户与某种内容的互动奖励不仅受到内容本身的特点的影响,还受到用户在其他类型的内容上的互动的影响。我们的通用模型考虑了不同的演化速率 $\lambda \in [0,1]$,并包括标准多重武器的特例。算法的目标是对于任何可能的参数化 $\lambda$,最小化对最佳固定sequence of arms pulled的 regret。我们分析了在线学习算法,并得到了不同的 regret 率:* for $\lambda \in [0, 1/T^2]$: $\widetilde O(\sqrt{KT})$;* for $\lambda = T^{-a/b}$ with $b < a < 2b$: $\widetilde O (T^{b/a})$;* for $\lambda \in (1/T, 1 - 1/\sqrt{T}): \widetilde O (K^{1/3}T^{2/3})$;* for $\lambda \in [1 - 1/\sqrt{T}, 1]: \widetilde O (K\sqrt{T})$.
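A toy simulation of the setting: a hidden state evolves deterministically at rate lambda and scales the reward of each arm, so a myopically best arm can be worse in the long run. The specific dynamics and arm parameters below are assumptions for illustration only, not the paper's formal model:

```python
import numpy as np

def simulate(policy, T=5000, lam=0.05, seed=0):
    """Simulate a 2-armed bandit whose hidden state evolves deterministically (illustrative dynamics).

    Arm 0: high short-term reward but degrades the state; arm 1: lower reward but sustains it.
    Reward at round t = short_term[a] * state, and state <- (1 - lam) * state + lam * health[a].
    """
    rng = np.random.default_rng(seed)
    short_term = np.array([1.0, 0.6])
    health = np.array([0.2, 1.0])
    state, total = 1.0, 0.0
    for t in range(T):
        a = policy(t, rng)
        total += short_term[a] * state
        state = (1.0 - lam) * state + lam * health[a]
    return total

always_greedy = lambda t, rng: 0                  # myopically best arm
always_safe = lambda t, rng: 1                    # state-preserving arm
mixed = lambda t, rng: rng.integers(0, 2)         # uniform exploration

for name, pol in [("greedy", always_greedy), ("safe", always_safe), ("mixed", mixed)]:
    print(name, round(simulate(pol), 1))
```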

Alleviating the Long-Tail Problem in Conversational Recommender Systems

  • paper_url: http://arxiv.org/abs/2307.11650
  • repo_url: None
  • paper_authors: Zhipeng Zhao, Kun Zhou, Xiaolei Wang, Wayne Xin Zhao, Fan Pan, Zhao Cao, Ji-Rong Wen
  • for: 提高 conversational recommender systems (CRS) 的效果,尤其是对长尾项的推荐。
  • methods: 提出了一种名为 LOT-CRS 的新框架,该框架通过模拟并利用均衡的 CRS 数据集(均匀覆盖所有项目)来提高长尾项的推荐性能。在该方法中,设计了两个预训练任务来提高对长尾项的对话理解,并采用带标签平滑的检索增强微调策略来进一步改进长尾项的推荐。
  • results: 在两个公共 CRS 数据集上进行了广泛的实验,证明了我们的方法的有效性和可扩展性,特别是对长尾项的推荐。
    Abstract Conversational recommender systems (CRS) aim to provide the recommendation service via natural language conversations. To develop an effective CRS, high-quality CRS datasets are very crucial. However, existing CRS datasets suffer from the long-tail issue, \ie a large proportion of items are rarely (or even never) mentioned in the conversations, which are called long-tail items. As a result, the CRSs trained on these datasets tend to recommend frequent items, and the diversity of the recommended items would be largely reduced, making users easier to get bored. To address this issue, this paper presents \textbf{LOT-CRS}, a novel framework that focuses on simulating and utilizing a balanced CRS dataset (\ie covering all the items evenly) for improving \textbf{LO}ng-\textbf{T}ail recommendation performance of CRSs. In our approach, we design two pre-training tasks to enhance the understanding of simulated conversation for long-tail items, and adopt retrieval-augmented fine-tuning with label smoothness strategy to further improve the recommendation of long-tail items. Extensive experiments on two public CRS datasets have demonstrated the effectiveness and extensibility of our approach, especially on long-tail recommendation.
    摘要 很多受众推荐系统(CRS)寻求通过自然语言对话提供推荐服务。为开发有效的CRS,高质量CRS数据集非常重要。然而,现有的CRS数据集受到长尾问题的困扰,即大多数 item 在对话中被 rarely (或者是从未) 提及,这些 item 被称为长尾 item。这导致CRS 在这些数据集上训练后,倾向于推荐频繁 item,推荐的 Item 的多样性会受到很大的削弱,使用户更容易感到厌烦。为解决这个问题,本文提出了一种新的框架,即 LOT-CRS,它是一种集中 item 的 CRS 数据集,以提高长尾推荐性能。在我们的方法中,我们设计了两个预训练任务,以增强对 simulated conversation 的理解,并采用了提取扩展的 fine-tuning 策略和标签平滑策略,以进一步提高长尾推荐。我们在两个公共 CRS 数据集上进行了广泛的实验,并证明了我们的方法的有效性和可扩展性,特别是在长尾推荐方面。

Morphological Image Analysis and Feature Extraction for Reasoning with AI-based Defect Detection and Classification Models

  • paper_url: http://arxiv.org/abs/2307.11643
  • repo_url: None
  • paper_authors: Jiajun Zhang, Georgina Cosma, Sarah Bugby, Axel Finke, Jason Watkins
  • for: 该论文旨在提高工业应用中的AI模型表现,通过解释IE Mask R-CNN模型的预测结果。
  • methods: 该论文提出了 AI-Reasoner,它从图像中提取缺陷的形态特征(DefChars),并使用决策树对 DefChar 值进行推理;随后输出可视化图表和文本解释,帮助理解基于掩码的缺陷检测与分类模型(如 IE Mask R-CNN)的输出(示例见下文)。
  • results: 实验结果表明,AI-Reasoner能够有效地解释IE Mask R-CNN模型的预测结果。总的来说,该论文提供了一种解释AI模型表现的解决方案,有助于提高工业应用中的AI模型表现。
    Abstract As the use of artificial intelligent (AI) models becomes more prevalent in industries such as engineering and manufacturing, it is essential that these models provide transparent reasoning behind their predictions. This paper proposes the AI-Reasoner, which extracts the morphological characteristics of defects (DefChars) from images and utilises decision trees to reason with the DefChar values. Thereafter, the AI-Reasoner exports visualisations (i.e. charts) and textual explanations to provide insights into outputs made by masked-based defect detection and classification models. It also provides effective mitigation strategies to enhance data pre-processing and overall model performance. The AI-Reasoner was tested on explaining the outputs of an IE Mask R-CNN model using a set of 366 images containing defects. The results demonstrated its effectiveness in explaining the IE Mask R-CNN model's predictions. Overall, the proposed AI-Reasoner provides a solution for improving the performance of AI models in industrial applications that require defect analysis.
    摘要 随着人工智能(AI)模型在工程和生产中的应用变得更加普遍,这些模型的预测需要提供透明的思维过程。这篇论文提议了AI理解器(AI-Reasoner),它从图像中提取杂形特征(DefChars),并使用决策树来进行思维。然后,AI-Reasoner将生成视觉化(如图表)和文本解释,以提供对掩模隐藏基于漏斗检测和分类模型的输出的深入了解。它还提供有效的缓解策略,以提高数据预处理和整体模型性能。AI-Reasoner在对IE Mask R-CNN模型的输出进行解释中得到了证明。总之,提议的AI-Reasoner为工业应用中需要检测分析的AI模型带来了改进性。
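A minimal sketch of the pipeline idea: extracting simple morphological characteristics from binary defect masks and reasoning over them with a decision tree. The features, toy masks, and labels below are illustrative assumptions, not the paper's DefChar set:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def defect_characteristics(mask):
    """Morphological characteristics (DefChar-style) of a binary defect mask; illustrative features."""
    ys, xs = np.nonzero(mask)
    area = float(len(xs))
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    aspect_ratio = max(h, w) / max(min(h, w), 1)
    extent = area / float(h * w)                 # how much of the bounding box the defect fills
    return [area, aspect_ratio, extent]

# toy masks: elongated "crack"-like vs compact "pit"-like defects (labels are illustrative)
rng = np.random.default_rng(1)
X, y = [], []
for _ in range(50):
    m = np.zeros((64, 64), dtype=bool)
    r, c = rng.integers(5, 55, size=2)
    if rng.random() < 0.5:                       # crack: thin and long
        m[r:r + 2, c:c + rng.integers(15, 30)] = True
        y.append("crack")
    else:                                        # pit: small square blob
        s = rng.integers(3, 7)
        m[r:r + s, c:c + s] = True
        y.append("pit")
    X.append(defect_characteristics(m))

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["area", "aspect_ratio", "extent"]))
```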

The Two Faces of AI in Green Mobile Computing: A Literature Review

  • paper_url: http://arxiv.org/abs/2308.04436
  • repo_url: None
  • paper_authors: Wander Siemers, June Sallou, Luís Cruz
  • for: This study reviews the literature of the past decade on the use of artificial intelligence in green mobile computing, mapping the field into 13 main topics that are summarized in detail.
  • methods: The study analyzes 34 papers and finds that the energy consumption of AI-based mobile systems is under-studied compared with the use of AI for energy-efficient mobile computing; it also shows that the large majority of contributions are purely academic.
  • results: The results show that activity in this field has been slowly increasing, particularly since 2019; however, the energy consumption of AI on mobile devices needs further investigation, and most of the proposed solutions are not made publicly available.
    Abstract Artificial intelligence is bringing ever new functionalities to the realm of mobile devices that are now considered essential (e.g., camera and voice assistants, recommender systems). Yet, operating artificial intelligence takes up a substantial amount of energy. However, artificial intelligence is also being used to enable more energy-efficient solutions for mobile systems. Hence, artificial intelligence has two faces in that regard, it is both a key enabler of desired (efficient) mobile functionalities and a major power draw on these devices, playing a part in both the solution and the problem. In this paper, we present a review of the literature of the past decade on the usage of artificial intelligence within the realm of green mobile computing. From the analysis of 34 papers, we highlight the emerging patterns and map the field into 13 main topics that are summarized in details. Our results showcase that the field is slowly increasing in the past years, more specifically, since 2019. Regarding the double impact AI has on the mobile energy consumption, the energy consumption of AI-based mobile systems is under-studied in comparison to the usage of AI for energy-efficient mobile computing, and we argue for more exploratory studies in that direction. We observe that although most studies are framed as solution papers (94%), the large majority do not make those solutions publicly available to the community. Moreover, we also show that most contributions are purely academic (28 out of 34 papers) and that we need to promote the involvement of the mobile software industry in this field.
    摘要 人工智能在移动设备领域带来了不断新的功能(例如相机和语音助手、推荐系统),现在被视为必备的功能。然而,运行人工智能需要很多能量。然而,人工智能也在使移动系统更加能效的解决方案中发挥作用。因此,人工智能在这个方面有两个面,它是必需的功能启用者和移动设备的主要能量消耗者。在这篇论文中,我们对过去十年的文献进行了回顾,并将场景映射到13个主题中。我们的结果显示,这个领域在过去几年中逐渐增长,特别是自2019年起。关于人工智能对移动设备能 consumption的双重影响,我们认为需要更多的探索性研究。我们发现,大多数研究是呈现为解决方案纸(94%),但大多数解决方案没有公开提供给社区。此外,我们还发现大多数贡献是学术性质(28 out of 34 papers),我们需要推动移动软件产业的参与。

Integration of Domain Expert-Centric Ontology Design into the CRISP-DM for Cyber-Physical Production Systems

  • paper_url: http://arxiv.org/abs/2307.11637
  • repo_url: https://github.com/htytewx/softcam
  • paper_authors: Milapji Singh Gill, Tom Westermann, Marvin Schieseck, Alexander Fay
  • for: 本研究旨在通过将面向 CPPS 的本体设计集成到 CRISP-DM 中,提高数据驱动项目的效率和可靠性。
  • methods: 本研究使用了domain-specific ontologies,以提高数据驱动项目中 CPPSs 的理解和准备过程。
  • results: 本研究实现了一种可靠的 anomaly detection 用例,帮助数据科学家更快地和更可靠地从 CPPSs 中获得有价值信息。
    Abstract In the age of Industry 4.0 and Cyber-Physical Production Systems (CPPSs) vast amounts of potentially valuable data are being generated. Methods from Machine Learning (ML) and Data Mining (DM) have proven to be promising in extracting complex and hidden patterns from the data collected. The knowledge obtained can in turn be used to improve tasks like diagnostics or maintenance planning. However, such data-driven projects, usually performed with the Cross-Industry Standard Process for Data Mining (CRISP-DM), often fail due to the disproportionate amount of time needed for understanding and preparing the data. The application of domain-specific ontologies has demonstrated its advantageousness in a wide variety of Industry 4.0 application scenarios regarding the aforementioned challenges. However, workflows and artifacts from ontology design for CPPSs have not yet been systematically integrated into the CRISP-DM. Accordingly, this contribution intends to present an integrated approach so that data scientists are able to more quickly and reliably gain insights into the CPPS. The result is exemplarily applied to an anomaly detection use case.
    摘要 在第四产业时代和跨industry标准数据挖掘过程(CPPS)中, vast amounts of potentially valuable data 被生成。机器学习(ML)和数据挖掘(DM)的方法已经证明可以提取复杂和隐藏的模式,从而提高诊断或维护规划等任务。然而,这些数据驱动项目通常使用 Cross-Industry Standard Process for Data Mining(CRISP-DM)进行实施,但它们往往因为数据理解和准备过程中的时间过长而失败。适用域pecific ontology 在多种第四产业应用场景中表现出了优势。然而, CPPSs 的 workflow 和 artifacts 尚未被系统地 инте integrate into CRISP-DM。因此,本贡献的目的是提出一种集成的approach,使数据科学家能够更快速地和更可靠地获得 CPPS 的 Insights。结果通过例示了一个异常检测用例来说明。

On the Complexity of the Bipartite Polarization Problem: from Neutral to Highly Polarized Discussions

  • paper_url: http://arxiv.org/abs/2307.11621
  • repo_url: None
  • paper_authors: Teresa Alsinet, Josep Argelich, Ramón Béjar, Santi Martínez
  • for: 本研究探讨一个优化问题:在加权带标签的图上寻找极化程度最高的二分(bipartition)。该图表示社交网络上的一场辩论,节点表示用户的意见,边表示用户之间的一致或不一致。
  • methods: 本研究提出一种实例生成模型,通过单一参数控制实例的极化程度,并据此考察极化程度与平均求解复杂度之间的关系(一个启发式求解示例见下文)。
  • results: 实验得到的平均求解复杂度与假设一致:实例的极化程度越高,找到相应极化二分的难度越低。
    Abstract The Bipartite Polarization Problem is an optimization problem where the goal is to find the highest polarized bipartition on a weighted and labelled graph that represents a debate developed through some social network, where nodes represent user's opinions and edges agreement or disagreement between users. This problem can be seen as a generalization of the maxcut problem, and in previous work approximate solutions and exact solutions have been obtained for real instances obtained from Reddit discussions, showing that such real instances seem to be very easy to solve. In this paper, we investigate further the complexity of this problem, by introducing an instance generation model where a single parameter controls the polarization of the instances in such a way that this correlates with the average complexity to solve those instances. The average complexity results we obtain are consistent with our hypothesis: the higher the polarization of the instance, the easier is to find the corresponding polarized bipartition.
    摘要 《双分化卷积问题》是一个优化问题,目标是在一个权重 Labelled 图上找到最高卷积分的生成方法,表示一个社交网络上的辩论,节点表示用户的意见,边表示用户之间的一致或不一致。这个问题可以看作是最大cut问题的推广,在过去的工作中,人们已经得到了实际例子中的近似解和精确解,显示这些实际例子很容易解决。在这篇论文中,我们进一步调查了这个问题的复杂性,通过设计一个参数控制实例的卷积程度,以确定实例的复杂性和解决实例的难度之间的关系。我们获得的平均复杂性结果与我们的假设一致:卷积程度越高,实例的解决难度就越低。
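A toy local-search heuristic for a polarized bipartition of a weighted, labelled (agreement/disagreement) graph; the maxcut-style score and the single-node flip heuristic are assumptions for illustration, not the paper's exact objective or solver:

```python
import random

def polarization(partition, edges):
    """Score of a bipartition: agreement edges reward same-side pairs, disagreement edges
    reward opposite-side pairs (a maxcut-style proxy; the paper's exact objective differs)."""
    score = 0.0
    for u, v, w, agree in edges:          # w > 0, agree is True/False
        same_side = partition[u] == partition[v]
        score += w if (same_side == agree) else -w
    return score

def local_search(nodes, edges, iters=2000, seed=0):
    """Greedy single-node flips from a random start (illustrative heuristic, not the paper's solver)."""
    rng = random.Random(seed)
    part = {n: rng.randint(0, 1) for n in nodes}
    best = polarization(part, edges)
    for _ in range(iters):
        n = rng.choice(nodes)
        part[n] ^= 1
        s = polarization(part, edges)
        if s >= best:
            best = s
        else:
            part[n] ^= 1                 # undo the flip
    return part, best

nodes = list(range(6))
# (u, v, weight, agreement?)
edges = [(0, 1, 2.0, True), (1, 2, 1.0, True), (3, 4, 2.0, True),
         (4, 5, 1.0, True), (0, 3, 3.0, False), (2, 5, 2.0, False)]
print(local_search(nodes, edges))
```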

CausE: Towards Causal Knowledge Graph Embedding

  • paper_url: http://arxiv.org/abs/2307.11610
  • repo_url: https://github.com/zjukg/cause
  • paper_authors: Yichi Zhang, Wen Zhang
  • for: 本研究旨在通过将知识图(KG)中的实体和关系表示为连续向量空间中的嵌入,预测缺失三元组,从而完成知识图补全(KGC)。
  • methods: 在因果性与嵌入解耦的视角下提出新的知识图嵌入(KGE)范式 CausE:利用因果干预估计混杂嵌入的因果效应,并设计新的训练目标以获得稳定的三元组预测(一个基础的 KGE 打分示例见下文)。
  • results: 实验结果表明,CausE 能够超越基线模型,达到最先进的 KGC 性能。
    Abstract Knowledge graph embedding (KGE) focuses on representing the entities and relations of a knowledge graph (KG) into the continuous vector spaces, which can be employed to predict the missing triples to achieve knowledge graph completion (KGC). However, KGE models often only briefly learn structural correlations of triple data and embeddings would be misled by the trivial patterns and noisy links in real-world KGs. To address this issue, we build the new paradigm of KGE in the context of causality and embedding disentanglement. We further propose a Causality-enhanced knowledge graph Embedding (CausE) framework. CausE employs causal intervention to estimate the causal effect of the confounder embeddings and design new training objectives to make stable predictions. Experimental results demonstrate that CausE could outperform the baseline models and achieve state-of-the-art KGC performance. We release our code in https://github.com/zjukg/CausE.
    摘要 知识图embedding(KGE)专注于将知识图(KG)中的实体和关系转换到连续的vector空间中,以便预测缺失的 triple以完成知识图完成(KGC)。然而,KGE模型经常只是简单地学习 triple数据的结构相关性,而 embedding会受到实际世界KG中的噪声和负面相关性的影响。为解决这个问题,我们建立了一种基于 causality的新型KGE paradigma,并提出了一种causality-enhanced知识图Embedding(CausE)框架。CausE使用 causal intervention来估计干扰因子 embedding的 causal效果,并设计了新的训练目标来确保稳定的预测。实验结果表明,CausE可以超越基eline模型,并实现状态控制KGC性能。我们在https://github.com/zjukg/CausE中发布了我们的代码。
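For background, a plain TransE-style scoring function for knowledge-graph completion, i.e. the baseline setting that CausE builds on; the causal-intervention training objective itself is not reproduced here, and the dimensions below are placeholders:

```python
import torch

class TransE(torch.nn.Module):
    """Plain TransE scoring for knowledge-graph completion (background baseline only)."""

    def __init__(self, n_entities, n_relations, dim=64):
        super().__init__()
        self.ent = torch.nn.Embedding(n_entities, dim)
        self.rel = torch.nn.Embedding(n_relations, dim)
        torch.nn.init.uniform_(self.ent.weight, -0.1, 0.1)
        torch.nn.init.uniform_(self.rel.weight, -0.1, 0.1)

    def score(self, h, r, t):
        # smaller distance ||h + r - t|| means a more plausible triple, so negate it
        return -(self.ent(h) + self.rel(r) - self.ent(t)).norm(p=2, dim=-1)

model = TransE(n_entities=1000, n_relations=50)
h, r, t = torch.tensor([3]), torch.tensor([7]), torch.tensor([42])
print(model.score(h, r, t))
```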

Predict-AI-bility of how humans balance self-interest with the interest of others

  • paper_url: http://arxiv.org/abs/2307.12776
  • repo_url: None
  • paper_authors: Valerio Capraro, Roberto Di Paolo, Veronica Pizziol
  • for: This paper aims to investigate the ability of three advanced chatbots to predict dictator game decisions and capture the balance between self-interest and the interest of others in decision-making.
  • methods: The paper uses 78 experiments with human participants from 12 countries to evaluate the performance of GPT-4, Bard, and Bing in predicting dictator game decisions and identifying qualitative behavioral patterns.
  • results: The paper finds that only GPT-4 correctly captures qualitative behavioral patterns, identifying three major classes of behavior, but consistently overestimates other-regarding behavior, inflating the proportion of inequity-averse and fully altruistic participants. This bias has significant implications for AI developers and users.
    Abstract Generative artificial intelligence holds enormous potential to revolutionize decision-making processes, from everyday to high-stake scenarios. However, as many decisions carry social implications, for AI to be a reliable assistant for decision-making it is crucial that it is able to capture the balance between self-interest and the interest of others. We investigate the ability of three of the most advanced chatbots to predict dictator game decisions across 78 experiments with human participants from 12 countries. We find that only GPT-4 (not Bard nor Bing) correctly captures qualitative behavioral patterns, identifying three major classes of behavior: self-interested, inequity-averse, and fully altruistic. Nonetheless, GPT-4 consistently overestimates other-regarding behavior, inflating the proportion of inequity-averse and fully altruistic participants. This bias has significant implications for AI developers and users.

Feature Map Testing for Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2307.11563
  • repo_url: https://github.com/ase2023paper/deepfeature
  • paper_authors: Dong Huang, Qingwen Bu, Yahao Qing, Yichao Fu, Heming Cui
  • for: This paper is written to address the issue of deep learning testing, specifically the problem of detecting fault-inducing feature maps in deep neural networks (DNNs).
  • methods: The paper proposes a new method called DeepFeature, which tests DNNs from the feature map level and identifies vulnerabilities that can be enhanced through repairing to increase the model’s overall performance.
  • results: The paper presents experimental results that demonstrate the effectiveness of DeepFeature in detecting the model's vulnerable feature maps, with a high fault detection rate and the ability to detect more types of faults compared to current techniques. Additionally, the paper shows that DeepFeature's fuzzer outperforms current fuzzing techniques and generates valuable test cases more efficiently.
    Abstract Due to the widespread application of deep neural networks~(DNNs) in safety-critical tasks, deep learning testing has drawn increasing attention. During the testing process, test cases that have been fuzzed or selected using test metrics are fed into the model to find fault-inducing test units (e.g., neurons and feature maps, activating which will almost certainly result in a model error) and report them to the DNN developer, who subsequently repair them~(e.g., retraining the model with test cases). Current test metrics, however, are primarily concerned with the neurons, which means that test cases that are discovered either by guided fuzzing or selection with these metrics focus on detecting fault-inducing neurons while failing to detect fault-inducing feature maps. In this work, we propose DeepFeature, which tests DNNs from the feature map level. When testing is conducted, DeepFeature will scrutinize every internal feature map in the model and identify vulnerabilities that can be enhanced through repairing to increase the model's overall performance. Exhaustive experiments are conducted to demonstrate that (1) DeepFeature is a strong tool for detecting the model's vulnerable feature maps; (2) DeepFeature's test case selection has a high fault detection rate and can detect more types of faults~(comparing DeepFeature to coverage-guided selection techniques, the fault detection rate is increased by 49.32\%). (3) DeepFeature's fuzzer also outperforms current fuzzing techniques and generates valuable test cases more efficiently.
    摘要 由于深度神经网络(DNN)在安全关键任务中广泛应用,深度学习测试已引起了越来越多的关注。测试过程中,经过抽象或使用测试指标选择的测试用例会被 feed 到模型中,以找到引起模型错误的测试单元(例如神经元和特征图),并将其报告给 DNN 开发者,以便他们进行修复(例如重新训练模型使用测试用例)。现有的测试指标主要关注神经元,因此通过抽象或选择测试指标来发现的测试用例主要是检测引起模型错误的神经元,而忽略了特征图。在这项工作中,我们提出了 DeepFeature,它测试 DNN 从特征图层次。在测试过程中,DeepFeature 会仔细检查模型中每个内部特征图,并找到可以通过修复提高模型的性能的漏洞。我们进行了广泛的实验,证明了以下结论:1. DeepFeature 是一个强大的特征图漏洞检测工具,可以帮助检测模型的漏洞特征图。2. DeepFeature 的测试用例选择比coverage-guided选择技术高出49.32%的FAULT检测率。3. DeepFeature 的随机生成器也超越了当前的随机生成技术,可以更有效地生成有价值的测试用例。

CycleIK: Neuro-inspired Inverse Kinematics

  • paper_url: http://arxiv.org/abs/2307.11554
  • repo_url: None
  • paper_authors: Jan-Gerrit Habekost, Erik Strahl, Philipp Allgeuer, Matthias Kerzel, Stefan Wermter
  • for: 这篇论文旨在介绍一种基于神经网络的 inverse kinematics (IK) 方法,即 CycleIK,以及一种混合神经遗传算法pipeline,可以在独立的方式下使用,也可以通过sequential least-squares programming (SLSQP) 或者生物遗传算法 (GA)进行优化。
  • methods: 该方法提出两种新的神经网络 IK 模型,分别基于 Generative Adversarial Network (GAN) 和 Multi-Layer Perceptron 架构;二者既可单独使用,也可嵌入混合神经-遗传算法管线,由 SLSQP 或遗传算法进一步优化(一个简化的神经 IK 训练示例见下文)。
  • results: 在使用Weighted Multi-Objective Function from state-of-the-art BioIK方法支持下,神经网络模型可以与现有的IK方法竞争,并且通过 incorporating a genetic algorithm 可以提高精度,同时减少总的运行时间。
    Abstract The paper introduces CycleIK, a neuro-robotic approach that wraps two novel neuro-inspired methods for the inverse kinematics (IK) task, a Generative Adversarial Network (GAN), and a Multi-Layer Perceptron architecture. These methods can be used in a standalone fashion, but we also show how embedding these into a hybrid neuro-genetic IK pipeline allows for further optimization via sequential least-squares programming (SLSQP) or a genetic algorithm (GA). The models are trained and tested on dense datasets that were collected from random robot configurations of the new Neuro-Inspired COLlaborator (NICOL), a semi-humanoid robot with two redundant 8-DoF manipulators. We utilize the weighted multi-objective function from the state-of-the-art BioIK method to support the training process and our hybrid neuro-genetic architecture. We show that the neural models can compete with state-of-the-art IK approaches, which allows for deployment directly to robotic hardware. Additionally, it is shown that the incorporation of the genetic algorithm improves the precision while simultaneously reducing the overall runtime.
    摘要 文章介绍了 CycleIK,一种神经机器人方法,包括两种新的神经网络做 inverse kinematics(IK)任务的方法,生成对抗网络(GAN)和多层感知网络架构。这些方法可以单独使用,但我们还证明了将它们集成到一个混合神经遗传IK管道中,可以通过顺序最小二乘程序(SLSQP)或遗传算法(GA)进行进一步优化。模型在 dense 数据集上训练和测试,数据集是通过随机机器人配置新的神经机器人 Neuro-Inspired COLlaborator(NICOL)的两个冗余的 8-DoF 机械臂收集而来。我们使用了 BioIK 方法中的Weighted 多目标函数来支持训练过程,并使用我们的混合神经遗传架构。我们显示了神经模型可以与当前IK方法竞争,可以直接部署到机器人硬件上。此外,我们还证明了将遗传算法包含在PIPELINE中可以提高精度,同时降低总时间。
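A minimal sketch of training a neural IK model with a forward-kinematics consistency loss; the planar 3-link FK, network size, and training schedule are placeholders, not NICOL's 8-DoF arms or the paper's GAN and BioIK-weighted objectives:

```python
import torch

def forward_kinematics(q):
    """Placeholder planar 3-link FK (link lengths assumed); stands in for the robot's real FK."""
    l = torch.tensor([0.3, 0.25, 0.15])
    angles = torch.cumsum(q, dim=-1)
    x = (l * torch.cos(angles)).sum(-1)
    y = (l * torch.sin(angles)).sum(-1)
    return torch.stack([x, y], dim=-1)

class IKNet(torch.nn.Module):
    """MLP that maps a target pose to joint angles (a minimal stand-in for the neural IK model)."""
    def __init__(self, pose_dim=2, dof=3, hidden=128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(pose_dim, hidden), torch.nn.GELU(),
            torch.nn.Linear(hidden, hidden), torch.nn.GELU(),
            torch.nn.Linear(hidden, dof))
    def forward(self, pose):
        return self.net(pose)

model = IKNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    q_rand = (torch.rand(256, 3) - 0.5) * 3.0                     # random joint configurations
    target = forward_kinematics(q_rand)                           # their reachable poses
    q_pred = model(target)
    loss = (forward_kinematics(q_pred) - target).pow(2).mean()    # pose-consistency loss
    opt.zero_grad(); loss.backward(); opt.step()
print("final pose error:", loss.item())
```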

Identifying Relevant Features of CSE-CIC-IDS2018 Dataset for the Development of an Intrusion Detection System

  • paper_url: http://arxiv.org/abs/2307.11544
  • repo_url: None
  • paper_authors: László Göcs, Zsolt Csaba Johanyák
  • for: 本研究旨在帮助开发一个高效的攻击检测系统(IDS),其中包括选择必要的特征来分类网络流量。
  • methods: 本研究使用六种特征选择方法,根据各特征的平均得分给出最终排名,再按不同排名阈值构造特征子集,并用五种分类算法为每种攻击类型确定最优特征集(排名融合的示例见下文)。
  • results: 研究发现,采用不同的特征选择方法和分类算法可以得到不同的优化效果,并且可以为不同的攻击类型选择最佳的特征集。
    Abstract Intrusion detection systems (IDSs) are essential elements of IT systems. Their key component is a classification module that continuously evaluates some features of the network traffic and identifies possible threats. Its efficiency is greatly affected by the right selection of the features to be monitored. Therefore, the identification of a minimal set of features that are necessary to safely distinguish malicious traffic from benign traffic is indispensable in the course of the development of an IDS. This paper presents the preprocessing and feature selection workflow as well as its results in the case of the CSE-CIC-IDS2018 on AWS dataset, focusing on five attack types. To identify the relevant features, six feature selection methods were applied, and the final ranking of the features was elaborated based on their average score. Next, several subsets of the features were formed based on different ranking threshold values, and each subset was tried with five classification algorithms to determine the optimal feature set for each attack type. During the evaluation, four widely used metrics were taken into consideration.
    摘要 安全系统检测系统(IDS)是信息技术系统中不可或缺的元素。它的关键组件是分类模块,不断评估网络流量中的一些特征,并识别可能的威胁。因此,选择需要监测的特征是非常重要的,以确保安全地分辨恶意流量和良好流量。本文介绍了预处理和特征选择工作流程,以及在CSE-CIC-IDS2018 on AWS dataset上的实验结果,关注五种攻击类型。为了确定相关的特征,本文使用了六种特征选择方法,并根据每个特征的平均分数进行了最终排名。然后,根据不同的排名阈值,将特征分为多个子集,并对每个子集使用五种分类算法进行了评估。在评估过程中,考虑了四种常用的指标。
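A minimal sketch of combining several feature-selection methods into an average ranking; only three selectors and toy stand-in flow features are shown (the paper uses six methods and four evaluation metrics on CSE-CIC-IDS2018):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif, f_classif

def average_feature_ranking(X: pd.DataFrame, y, top_k=10):
    """Rank features with several selectors and average the ranks."""
    scores = {
        "mutual_info": mutual_info_classif(X, y, random_state=0),
        "anova_f": f_classif(X, y)[0],
        "rf_importance": RandomForestClassifier(n_estimators=100, random_state=0)
                         .fit(X, y).feature_importances_,
    }
    ranks = pd.DataFrame({name: pd.Series(s, index=X.columns).rank(ascending=False)
                          for name, s in scores.items()})
    ranks["avg_rank"] = ranks.mean(axis=1)
    return ranks.sort_values("avg_rank").head(top_k)

# toy stand-in for CSE-CIC-IDS2018 flow features
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = pd.DataFrame({
    "flow_duration": rng.normal(size=500) + y,            # informative
    "fwd_pkt_len_mean": rng.normal(size=500) + 0.5 * y,   # weakly informative
    "bwd_pkt_len_std": rng.normal(size=500),               # noise
    "flow_iat_mean": rng.normal(size=500),                 # noise
})
print(average_feature_ranking(X, y, top_k=4))
```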

Model Reporting for Certifiable AI: A Proposal from Merging EU Regulation into AI Development

  • paper_url: http://arxiv.org/abs/2307.11525
  • repo_url: None
  • paper_authors: Danilo Brajovic, Niclas Renner, Vincent Philipp Goebels, Philipp Wagner, Benjamin Fresz, Martin Biller, Mara Klaeb, Janika Kutz, Jens Neuhuettler, Marco F. Huber
  • for: 这篇论文旨在提供标准化的卡片来描述 AI 应用程序的开发过程,以帮助实践者开发安全的 AI 系统,并且满足政策法规的要求。
  • methods: 该论文使用了最新的欧盟法规和 AI 指南,以及最新的研究趋势:数据和模型卡片。它提出了使用标准化的卡片来记录 AI 应用程序的开发过程,包括用例卡和运行卡,以满足政策法规的要求。
  • results: 该论文的主要贡献是提出了一套标准化的卡片,以帮助实践者开发安全的 AI 系统,并且可以方便第三方进行 AI 应用程序的审核。该论文还 incorporates 了专家访谈和开发者的意见,以及相关的研究和工具箱。
    Abstract Despite large progress in Explainable and Safe AI, practitioners suffer from a lack of regulation and standards for AI safety. In this work we merge recent regulation efforts by the European Union and first proposals for AI guidelines with recent trends in research: data and model cards. We propose the use of standardized cards to document AI applications throughout the development process. Our main contribution is the introduction of use-case and operation cards, along with updates for data and model cards to cope with regulatory requirements. We reference both recent research as well as the source of the regulation in our cards and provide references to additional support material and toolboxes whenever possible. The goal is to design cards that help practitioners develop safe AI systems throughout the development process, while enabling efficient third-party auditing of AI applications, being easy to understand, and building trust in the system. Our work incorporates insights from interviews with certification experts as well as developers and individuals working with the developed AI applications.
    摘要 尽管在可解释和安全人工智能方面有大量进步,但实践者受到了不足的规范和标准的压力。在这项工作中,我们将欧盟最新的规定努力和首个AI指南提议与最新的研究趋势相结合:数据和模型卡。我们提议在开发过程中使用标准化卡来记录AI应用程序。我们的主要贡献是引入使用情况和运行卡,并更新数据和模型卡以适应规定要求。我们参考了最新的研究以及规定的来源,并提供了参考资料和工具箱 whenever possible。我们的目标是通过开发安全人工智能系统的整个开发过程中使用卡来帮助实践者,并允许第三方审核人工智能应用程序,易于理解,建立信任系统。我们的工作还 incorporates 专家评论、开发人员和使用开发的AI应用程序的人员的意见。

IndigoVX: Where Human Intelligence Meets AI for Optimal Decision Making

  • paper_url: http://arxiv.org/abs/2307.11516
  • repo_url: None
  • paper_authors: Kais Dukes
  • for: 本研究专攻了将人工智能与人类智慧结合,以实现最佳目标解决方案。
  • methods: 本研究提出了一种名为“Indigo”的新方法,它是一个辅助人类做出最佳决策的数据驱动AI系统。人类和AI共同合作,组成一个名为“IndigoVX”的虚拟专家系统,用于应对游戏或商业策略等领域。
  • results: 本研究显示,通过人类和AI的联合合作,可以实现更高效的目标解决。通过变量的三个分数评估指标,评估和改进策略,适应到实际挑战和变化。
    Abstract This paper defines a new approach for augmenting human intelligence with AI for optimal goal solving. Our proposed AI, Indigo, is an acronym for Informed Numerical Decision-making through Iterative Goal-Oriented optimization. When combined with a human collaborator, we term the joint system IndigoVX, for Virtual eXpert. The system is conceptually simple. We envisage this method being applied to games or business strategies, with the human providing strategic context and the AI offering optimal, data-driven moves. Indigo operates through an iterative feedback loop, harnessing the human expert's contextual knowledge and the AI's data-driven insights to craft and refine strategies towards a well-defined goal. Using a quantified three-score schema, this hybridization allows the combined team to evaluate strategies and refine their plan, while adapting to challenges and changes in real-time.
    摘要 这篇论文提出了一种新的方法,用人工智能增强人类智能,以实现最佳目标解决。我们的提议的AI系统名为Indigo,全名为Informed Numerical Decision-making through Iterative Goal-Oriented optimization。当与人类合作者结合时,我们称之为IndigoVX,即虚拟专家。该系统的核心思想简单。我们认为这种方法可以应用于游戏或商业策略等领域,人类提供战略背景知识,AI提供数据驱动的优化 Move。Indigo通过迭代反馈循环,利用人类专家的Contextual knowledge和AI的数据驱动洞察,制定和细化策略,以达到已定义的目标。使用量化的三个分数schema,这个混合体系可以评估策略和修改计划,同时适应挑战和变化的实时反应。

Framework for developing quantitative agent based models based on qualitative expert knowledge: an organised crime use-case

  • paper_url: http://arxiv.org/abs/2308.00505
  • repo_url: None
  • paper_authors: Frederike Oetker, Vittorio Nespeca, Thijs Vis, Paul Duijn, Peter Sloot, Rick Quax
  • for: The paper aims to provide a systematic and transparent framework for creating agent-based models of criminal networks for law enforcement purposes, proposing FREIDA (Framework for Expert-Informed Data-driven Agent-based models) to translate qualitative expert knowledge into quantitative rules, demonstrated on a criminal cocaine network in the Netherlands from which the kingpin node is removed.
  • methods: Qualitative sources (case files, literature, interviews) are translated into empirical laws and combined with quantitative sources (databases) to form the three dimensions (environment, agents, behaviour) of a networked agent-based model; four case files are modelled and scored for training and validation, followed by iterative sensitivity analysis, uncertainty quantification, and scenario testing.
  • results: The model requires flexible parameters and additional case-file simulations to become robust, and the scenario tests indicate the need for adaptive intervention strategies that can respond to changes in the criminal network.
    Abstract In order to model criminal networks for law enforcement purposes, a limited supply of data needs to be translated into validated agent-based models. What is missing in current criminological modelling is a systematic and transparent framework for modelers and domain experts that establishes a modelling procedure for computational criminal modelling that includes translating qualitative data into quantitative rules. For this, we propose FREIDA (Framework for Expert-Informed Data-driven Agent-based models). Throughout the paper, the criminal cocaine replacement model (CCRM) will be used as an example case to demonstrate the FREIDA methodology. For the CCRM, a criminal cocaine network in the Netherlands is being modelled where the kingpin node is being removed, the goal being for the remaining agents to reorganize after the disruption and return the network into a stable state. Qualitative data sources such as case files, literature and interviews are translated into empirical laws, and combined with the quantitative sources such as databases form the three dimensions (environment, agents, behaviour) of a networked ABM. Four case files are being modelled and scored both for training as well as for validation scores to transition to the computational model and application phase respectively. In the last phase, iterative sensitivity analysis, uncertainty quantification and scenario testing eventually lead to a robust model that can help law enforcement plan their intervention strategies. Results indicate the need for flexible parameters as well as additional case file simulations to be performed.
    摘要 为了模拟犯罪网络,需要有限的数据被翻译成有效的代理基模型。现在的刑事模拟中缺乏一个系统化和透明的框架,使模型者和领域专家可以确定模型生成过程,包括将Qualitative数据转化为Quantitative规则。为此,我们提出了FREIDA(专家驱动数据驱动代理基模型框架)。本文中,我们使用了药物替换模型(CCRM)作为例子,模拟了荷兰一个犯罪冰毒网络,其中王牌节点被移除,目标是让剩下的代理重新组织并返回网络到稳定状态。Qualitative数据源,如案例文件、文献和采访,被翻译成Empirical法律,并与Quantitative数据源,如数据库,共同构成了网络ABM的三个维度(环境、代理、行为)。四个案例文件被模拟和评分,以便在计算模型阶段进行训练和验证分别。在最后阶段,iterative敏感分析、不确定性评估和enario测试,最终导致一个可靠的模型,可以帮助刑事机构规划 intervención策略。结果表明需要灵活的参数以及更多的案例文件模拟,以确保模型的可靠性。

General regularization in covariate shift adaptation

  • paper_url: http://arxiv.org/abs/2307.11503
  • repo_url: None
  • paper_authors: Duc Hoan Nguyen, Sergei V. Pereverzyev, Werner Zellinger
  • for: correcting the error of least squares learning algorithms in reproducing kernel Hilbert spaces (RKHS) caused by future data distributions that differ from the training data distribution.
  • methods: reweighted kernel regression in RKHS, where the sample weights are given by an estimate of the Radon-Nikodym derivative of the future data distribution with respect to the training data distribution (see the sketch below).
  • results: novel results obtained by combining known error bounds, showing that the amount of samples needed to achieve the same order of accuracy as in standard supervised learning without differences in data distributions is smaller than previously proven by state-of-the-art analyses, under weak smoothness conditions.
    Abstract Sample reweighting is one of the most widely used methods for correcting the error of least squares learning algorithms in reproducing kernel Hilbert spaces (RKHS), that is caused by future data distributions that are different from the training data distribution. In practical situations, the sample weights are determined by values of the estimated Radon-Nikod\'ym derivative, of the future data distribution w.r.t.~the training data distribution. In this work, we review known error bounds for reweighted kernel regression in RKHS and obtain, by combination, novel results. We show under weak smoothness conditions, that the amount of samples, needed to achieve the same order of accuracy as in the standard supervised learning without differences in data distributions, is smaller than proven by state-of-the-art analyses.
    摘要 样本重Weight是最常用的方法来修正最小二乘学习算法在 reproduce kernel Hilbert space(RKHS)中的错误,这是由于未来数据分布与训练数据分布不同而导致的。在实际情况下,样本重量是根据估计的Radon-Nikodym Derivative,未来数据分布与训练数据分布之间的比例来确定。在这种工作中,我们回顾了已知的重Weighted kernel regression在 RKHS 中的错误上限,并通过组合,获得了新的结果。我们表明,在弱稳定条件下,需要更少的样本数量,以达到标准超级vised learning 无法分布差异的同等精度水平。
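A minimal sketch of importance-weighted kernel ridge regression, where training samples are reweighted by an estimated Radon-Nikodym derivative (density ratio); the crude Gaussian-fit ratio estimator and the kernel parameters are assumptions for illustration, not the estimators analysed in the paper:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def density_ratio(x_train, x_future):
    """Crude Radon-Nikodym derivative estimate via 1-D Gaussian fits (an assumption)."""
    mu_tr, sd_tr = x_train.mean(), x_train.std()
    mu_fu, sd_fu = x_future.mean(), x_future.std()
    def pdf(x, mu, sd):
        return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    return pdf(x_train, mu_fu, sd_fu) / pdf(x_train, mu_tr, sd_tr)

def weighted_krr(x_tr, y_tr, weights, x_te, lam=1e-2, gamma=5.0):
    """Importance-weighted kernel ridge regression with a Gaussian kernel."""
    K = gaussian_kernel(x_tr, x_tr, gamma)
    W = np.diag(weights)
    alpha = np.linalg.solve(W @ K + lam * len(x_tr) * np.eye(len(x_tr)), weights * y_tr)
    return gaussian_kernel(x_te, x_tr, gamma) @ alpha

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x[:, 0])
x_tr = rng.normal(-0.5, 0.5, size=(200, 1)); y_tr = f(x_tr) + 0.1 * rng.normal(size=200)
x_te = rng.normal(0.7, 0.3, size=(200, 1))                  # shifted future distribution
w = density_ratio(x_tr[:, 0], x_te[:, 0])
for name, weights in [("unweighted", np.ones(200)), ("reweighted", w)]:
    pred = weighted_krr(x_tr, y_tr, weights, x_te)
    print(name, "test MSE:", round(float(((pred - f(x_te)) ** 2).mean()), 4))
```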

Adaptive ResNet Architecture for Distributed Inference in Resource-Constrained IoT Systems

  • paper_url: http://arxiv.org/abs/2307.11499
  • repo_url: None
  • paper_authors: Fazeela Mazhar Khan, Emna Baccour, Aiman Erbad, Mounir Hamdi
  • for: 这篇论文的目的是为了提出一个可以适应资源短缺的对应网络,以减少资源共享和延迟,并维持高准确率。
  • methods: 这篇论文使用了一个Empirical Study,并识别了ResNet中可以被去除的连接,以实现当地资源不足时的分布。然后,它提出了一个多bjective optimization问题,以最小化延迟和最大化准确率,根据可用资源。
  • results: 根据实验结果,这个适应性的ResNet架构可以降低共享资料、能源消耗和延迟,并维持高准确率。
    Abstract As deep neural networks continue to expand and become more complex, most edge devices are unable to handle their extensive processing requirements. Therefore, the concept of distributed inference is essential to distribute the neural network among a cluster of nodes. However, distribution may lead to additional energy consumption and dependency among devices that suffer from unstable transmission rates. Unstable transmission rates harm real-time performance of IoT devices causing low latency, high energy usage, and potential failures. Hence, for dynamic systems, it is necessary to have a resilient DNN with an adaptive architecture that can downsize as per the available resources. This paper presents an empirical study that identifies the connections in ResNet that can be dropped without significantly impacting the model's performance to enable distribution in case of resource shortage. Based on the results, a multi-objective optimization problem is formulated to minimize latency and maximize accuracy as per available resources. Our experiments demonstrate that an adaptive ResNet architecture can reduce shared data, energy consumption, and latency throughout the distribution while maintaining high accuracy.
    摘要 Translated into Simplified Chinese:深度神经网络继续扩展和复杂化,大多数边缘设备无法处理它们的广泛处理要求。因此,分布式推理是必要的,以分布神经网络到一群节点中。然而,分布可能会导致更多的能源消耗和设备之间的依赖关系,从而影响实时性和可靠性。因此,对动态系统来说,需要一个可靠的DNN,具有可变的架构,以适应可用资源。这篇论文提出了一项实验研究,以确定ResNet中可以被去除的连接,不会对模型性能产生重要影响,以便在资源短缺情况下进行分布。基于结果,我们提出了一个多目标优化问题,以最小化延迟和最大化准确率,根据可用资源。我们的实验表明,可变ResNet架构可以降低共享数据、能源消耗和延迟,同时保持高准确率。

Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting

  • paper_url: http://arxiv.org/abs/2307.11494
  • repo_url: None
  • paper_authors: Marcel Kollovieh, Abdul Fatir Ansari, Michael Bohlke-Schneider, Jasper Zschiegner, Hao Wang, Yuyang Wang
  • for: 这篇 paper 的目的是探讨时间序列散射模型的应用,并提出了一个不受条件所限的时间序列散射模型(TSDiff),可以在多个时间序列任务上进行应用。
  • methods: 这篇 paper 使用了一种自我引导机制,让 TSDiff 在推理过程中进行条件化,不需要额外的auxiliary networks或变更训练程式。
  • results: 这篇 paper 的结果显示 TSDiff 在三个时间序列任务中具有竞争力:一是与条件性的forecasting方法竞争(predict);二是透过降低计算负载,使用 TSDiff 进行反散射(refine),并且模型对数据的生成性能仍然保持不变。
    Abstract Diffusion models have achieved state-of-the-art performance in generative modeling tasks across various domains. Prior works on time series diffusion models have primarily focused on developing conditional models tailored to specific forecasting or imputation tasks. In this work, we explore the potential of task-agnostic, unconditional diffusion models for several time series applications. We propose TSDiff, an unconditionally trained diffusion model for time series. Our proposed self-guidance mechanism enables conditioning TSDiff for downstream tasks during inference, without requiring auxiliary networks or altering the training procedure. We demonstrate the effectiveness of our method on three different time series tasks: forecasting, refinement, and synthetic data generation. First, we show that TSDiff is competitive with several task-specific conditional forecasting methods (predict). Second, we leverage the learned implicit probability density of TSDiff to iteratively refine the predictions of base forecasters with reduced computational overhead over reverse diffusion (refine). Notably, the generative performance of the model remains intact -- downstream forecasters trained on synthetic samples from TSDiff outperform forecasters that are trained on samples from other state-of-the-art generative time series models, occasionally even outperforming models trained on real data (synthesize).
    摘要 Diffusion models 已经在不同领域的生成模型任务中达到了状态计算机。先前的时间序列扩散模型研究都主要集中在发展特定预测或填充任务的条件模型。在这项工作中,我们探索了无条件的时间序列扩散模型在不同应用领域中的潜在性。我们提出了TSDiff,一种未经条件训练的时间序列扩散模型。我们的建议的自顾机制使得TSDiff在推理过程中可以通过自我指导来conditioning,无需附加网络或改变训练过程。我们示出了TSDiff在三个不同的时间序列任务上的效果:预测、修正和生成数据。首先,我们表明TSDiff与一些任务特定的条件预测方法相当竞争(predict)。其次,我们利用TSDiff学习到的隐式概率密度来降低预测基础预测器的计算开销,通过反扩散(refine)来修正基础预测器的预测。另外,通过使用TSDiff生成的 sintetic 样本来训练下游预测器,我们发现了一个关键结果:下游预测器训练在TSDiff生成的 sintetic 样本上表现更好, occasional 甚至超过了使用其他状态之前的生成时间序列模型训练的模型,即使是使用真实数据(synthesize)。

Robust Visual Question Answering: Datasets, Methods, and Future Challenges

  • paper_url: http://arxiv.org/abs/2307.11471
  • repo_url: None
  • paper_authors: Jie Ma, Pinghui Wang, Dechen Kong, Zewei Wang, Jun Liu, Hongbin Pei, Junzhou Zhao
  • for: This paper provides a comprehensive survey of the development of datasets and debiasing methods for visual question answering (VQA) to improve the robustness of VQA systems.
  • methods: The paper examines the evaluation metrics employed by VQA datasets and proposes a typology of debiasing methods for VQA, including their development process, similarities and differences, robustness comparison, and technical features.
  • results: The paper analyzes and discusses the robustness of representative vision-and-language pre-training models on VQA and identifies key areas for future research in VQA, including the need for more diverse and challenging datasets and the development of more effective debiasing methods.
    Abstract Visual question answering requires a system to provide an accurate natural language answer given an image and a natural language question. However, it is widely recognized that previous generic VQA methods often exhibit a tendency to memorize biases present in the training data rather than learning proper behaviors, such as grounding images before predicting answers. Therefore, these methods usually achieve high in-distribution but poor out-of-distribution performance. In recent years, various datasets and debiasing methods have been proposed to evaluate and enhance the VQA robustness, respectively. This paper provides the first comprehensive survey focused on this emerging fashion. Specifically, we first provide an overview of the development process of datasets from in-distribution and out-of-distribution perspectives. Then, we examine the evaluation metrics employed by these datasets. Thirdly, we propose a typology that presents the development process, similarities and differences, robustness comparison, and technical features of existing debiasing methods. Furthermore, we analyze and discuss the robustness of representative vision-and-language pre-training models on VQA. Finally, through a thorough review of the available literature and experimental analysis, we discuss the key areas for future research from various viewpoints.
    摘要 视觉问答要求系统在给定一幅图像和一个自然语言问题时,提供准确的自然语言答案。然而,以前的通用VQA方法经常倾向于记忆训练数据中存在的偏见,而不是学习正确的行为,如在预测答案之前先对图像进行定位(grounding)。因此,这些方法通常在分布内数据上得到高分,但在分布外数据上表现不佳。在过去几年,各种数据集和减偏方法被提出来评估和提高VQA的可靠性。这篇论文是针对这一新兴方向的第一篇全面综述。具体来说,我们首先从分布内和分布外的视角出发,概述了数据集的发展过程。然后,我们检查了这些数据集使用的评价指标。第三,我们提出了一种分类法,描述了现有减偏方法的发展过程、相似性和差异、稳健性对比以及技术特点。此外,我们分析和讨论了代表性视觉语言预训练模型在VQA中的可靠性。最后,通过对现有文献的全面审查和实验分析,我们从多种角度讨论了未来研究的关键领域。

Distribution Shift Matters for Knowledge Distillation with Webly Collected Images

  • paper_url: http://arxiv.org/abs/2307.11469
  • repo_url: None
  • paper_authors: Jialiang Tang, Shuo Chen, Gang Niu, Masashi Sugiyama, Chen Gong
  • for: 学习一个轻量级的学生网络从一个预训练的教师网络中。
  • methods: 使用数据无关(data-free)知识蒸馏方法,从互联网上收集训练实例,并根据教师网络与学生网络的结合预测动态选择有用的训练实例。然后,对学生网络和教师网络的加权特征和分类器参数进行对齐,并在新的对比学习模块(MixDistribution)中生成带扰动的新分布数据,以便学生网络学习一种分布不变的表示。
  • results: 在多个 benchmark 数据集上进行了广泛的实验,结果表明,我们提出的 KD$^{3}$ 优于现有的数据无关知识蒸馏方法。
    Abstract Knowledge distillation aims to learn a lightweight student network from a pre-trained teacher network. In practice, existing knowledge distillation methods are usually infeasible when the original training data is unavailable due to some privacy issues and data management considerations. Therefore, data-free knowledge distillation approaches proposed to collect training instances from the Internet. However, most of them have ignored the common distribution shift between the instances from original training data and webly collected data, affecting the reliability of the trained student network. To solve this problem, we propose a novel method dubbed ``Knowledge Distillation between Different Distributions" (KD$^{3}$), which consists of three components. Specifically, we first dynamically select useful training instances from the webly collected data according to the combined predictions of teacher network and student network. Subsequently, we align both the weighted features and classifier parameters of the two networks for knowledge memorization. Meanwhile, we also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment, so that the student network can further learn a distribution-invariant representation. Intensive experiments on various benchmark datasets demonstrate that our proposed KD$^{3}$ can outperform the state-of-the-art data-free knowledge distillation approaches.
    摘要 知识蒸馏的目标是从一个预训练的教师网络中学习一个轻量级的学生网络。在实践中,由于隐私问题和数据管理的考虑,原始训练数据往往不可用,现有的知识蒸馏方法因此难以实施。为此,数据无关(data-free)知识蒸馏方法被提出,从互联网上收集训练实例。然而,其中大多数忽略了原始训练数据与网络收集实例之间常见的分布偏移,这会影响学生网络的可靠性。为解决这个问题,我们提出了一种新的方法,名为“不同分布间的知识蒸馏”(KD$^{3}$)。该方法包括三个组成部分。首先,我们根据教师网络和学生网络的共同预测结果,从互联网收集的数据中动态选择有用的训练实例。然后,我们将两个网络的加权特征和分类器参数对齐,以便记忆知识。同时,我们还建立了一个新的对比学习模块,名为 MixDistribution,用于生成具有新分布的扰动数据以进行实例对齐,使学生网络可以进一步学习一种分布不变的表示。我们对多种 benchmark 数据集进行了实验,结果表明,我们提出的 KD$^{3}$ 可以超越现有的数据无关知识蒸馏方法。
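
The instance-selection step can be pictured with a small sketch: score each webly-collected image by how confident the teacher is and how much the student agrees, then keep the top fraction. The scoring rule below is a plausible stand-in written for this digest, not the exact criterion from the KD$^{3}$ paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_web_instances(teacher, student, web_images, keep_ratio=0.5):
    """Rank webly-collected inputs by combined teacher/student predictions and keep
    the most useful fraction. The score (teacher confidence weighted by
    teacher-student agreement) is a hypothetical example of such a rule.
    """
    t_prob = F.softmax(teacher(web_images), dim=1)
    s_prob = F.softmax(student(web_images), dim=1)
    confidence = t_prob.max(dim=1).values          # how sure the teacher is
    agreement = (t_prob * s_prob).sum(dim=1)       # overlap of the two predictions
    score = confidence * agreement
    k = max(1, int(keep_ratio * len(score)))
    idx = score.topk(k).indices
    return web_images[idx], idx

if __name__ == "__main__":
    teacher = torch.nn.Linear(32, 10)              # toy stand-ins for real networks
    student = torch.nn.Linear(32, 10)
    batch = torch.randn(16, 32)
    selected, idx = select_web_instances(teacher, student, batch, keep_ratio=0.25)
    print(selected.shape)                          # (4, 32)
```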

Zero-touch realization of Pervasive Artificial Intelligence-as-a-service in 6G networks

  • paper_url: http://arxiv.org/abs/2307.11468
  • repo_url: None
  • paper_authors: Emna Baccour, Mhd Saria Allahham, Aiman Erbad, Amr Mohamed, Ahmed Refaey Hussein, Mounir Hamdi
  • for: 支持Pervasive AI(PAI)的零交互解决方案,以满足6G网络中的自动化需求。
  • methods: 使用区块链基础设施,实现零交互PAIaaS平台,并以联邦学习即服务(Federated Learning-as-a-service)为用例,实现网络配置的自动化与自适应控制。
  • results: 提出了一种基于区块链的PAIaaS平台,可以轻松地提供零交互的PAI服务,并在6G网络中实现自动化配置和自适应控制,同时能够减少用户的成本、安全性和资源分配担忧。
    Abstract The vision of the upcoming 6G technologies, characterized by ultra-dense network, low latency, and fast data rate is to support Pervasive AI (PAI) using zero-touch solutions enabling self-X (e.g., self-configuration, self-monitoring, and self-healing) services. However, the research on 6G is still in its infancy, and only the first steps have been taken to conceptualize its design, investigate its implementation, and plan for use cases. Toward this end, academia and industry communities have gradually shifted from theoretical studies of AI distribution to real-world deployment and standardization. Still, designing an end-to-end framework that systematizes the AI distribution by allowing easier access to the service using a third-party application assisted by a zero-touch service provisioning has not been well explored. In this context, we introduce a novel platform architecture to deploy a zero-touch PAI-as-a-Service (PAIaaS) in 6G networks supported by a blockchain-based smart system. This platform aims to standardize the pervasive AI at all levels of the architecture and unify the interfaces in order to facilitate the service deployment across application and infrastructure domains, relieve the users worries about cost, security, and resource allocation, and at the same time, respect the 6G stringent performance requirements. As a proof of concept, we present a Federated Learning-as-a-service use case where we evaluate the ability of our proposed system to self-optimize and self-adapt to the dynamics of 6G networks in addition to minimizing the users' perceived costs.
    摘要 即将到来的6G技术以超密集网络、低延迟和高数据速率为特征,其愿景是以零交互解决方案支持普适人工智能(PAI),实现自配置、自监控、自修复等自动化服务。然而,6G研究仍处于起步阶段,只对其设计、实现和应用场景进行了初步的探索。在这个过程中,学术和产业社区逐渐从AI分布的理论研究转向实际部署和标准化。然而,设计一个端到端框架,将AI分布系统化,使第三方应用在零交互服务开通的协助下更容易访问服务,尚未得到充分探索。在这个背景下,我们提出了一种新的平台架构,用于在6G网络上部署由区块链智能系统支持的零交互PAI-as-a-Service(PAIaaS)。这个平台旨在将普适AI规范化到架构的所有层次,并统一接口,以便在应用和基础设施域中部署服务,缓解用户关于成本、安全性和资源分配的担忧,同时满足6G严格的性能要求。作为概念验证,我们给出了一个联邦学习即服务的用例,评估了所提系统对6G网络动态的自优化和自适应能力,以及最小化用户感知成本的能力。

Improve Long-term Memory Learning Through Rescaling the Error Temporally

  • paper_url: http://arxiv.org/abs/2307.11462
  • repo_url: None
  • paper_authors: Shida Wang, Zhanglu Yan
  • for: 这 paper 研究 seq2seq 模型中长期记忆学习中的错误度选择问题。
  • methods: 我们发现常用的误差度量(包括平均绝对误差和均方误差)都带有短期记忆偏好。为了减少这种偏好并改进长期记忆学习,我们提议使用按时间重新缩放的误差度量。此外,这种方法还可以缓解梯度消失问题。
  • results: 我们在不同的长期任务和序列模型上进行了数值实验,结果证明了我们的主张。我们的结果还表明,适当的时间折算 error 对长期记忆学习是必要的。据我们所知,这是 seq2seq 模型中错误度选择问题的首次量化分析。
    Abstract This paper studies the error metric selection for long-term memory learning in sequence modelling. We examine the bias towards short-term memory in commonly used errors, including mean absolute/squared error. Our findings show that all temporally positive-weighted errors are biased towards short-term memory in learning linear functionals. To reduce this bias and improve long-term memory learning, we propose the use of a temporally rescaled error. In addition to reducing the bias towards short-term memory, this approach can also alleviate the vanishing gradient issue. We conduct numerical experiments on different long-memory tasks and sequence models to validate our claims. Numerical results confirm the importance of appropriate temporally rescaled error for effective long-term memory learning. To the best of our knowledge, this is the first work that quantitatively analyzes different errors' memory bias towards short-term memory in sequence modelling.
    摘要 这篇论文研究序列模型中长期记忆学习的误差度量选择。我们分析了常用的误差度量,包括平均绝对误差和均方误差,发现它们都偏向短期记忆。为了减少这种偏向并改进长期记忆学习,我们提议使用按时间重新缩放的误差度量。此外,这种方法还可以缓解梯度消失问题。我们在不同的长期记忆任务和序列模型上进行了数值实验,以验证我们的结论。数值结果证明了适当的时间重缩放误差度量对有效的长期记忆学习的重要性。据我们所知,这是序列模型中对不同误差度量的短期记忆偏向进行的首次量化分析。
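
A temporally rescaled error is easy to state concretely: weight each timestep's squared error by a factor that grows with time, so late-horizon errors are not drowned out. The power-law weighting below is an illustrative choice; the paper's specific rescaling may differ.

```python
import torch

def temporally_rescaled_mse(pred, target, gamma=1.5):
    """MSE with per-timestep weights that grow over time, counteracting the
    short-term-memory bias of plain MSE/MAE. pred, target: (batch, time) tensors.
    The power-law exponent gamma is an illustrative hyperparameter.
    """
    T = pred.shape[1]
    t = torch.arange(1, T + 1, dtype=pred.dtype, device=pred.device)
    weights = t ** gamma
    weights = weights / weights.sum() * T          # keep the overall loss scale comparable
    return ((pred - target) ** 2 * weights).mean()

if __name__ == "__main__":
    pred, target = torch.randn(8, 50), torch.randn(8, 50)
    print(temporally_rescaled_mse(pred, target).item())
```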

Incorporating Human Translator Style into English-Turkish Literary Machine Translation

  • paper_url: http://arxiv.org/abs/2307.11457
  • repo_url: None
  • paper_authors: Zeynep Yirmibeşoğlu, Olgun Dursun, Harun Dallı, Mehmet Şahin, Ena Hodzik, Sabri Gürses, Tunga Güngör
  • for: 英-土文学翻译
  • methods: 利用特定译者人工对齐的译作对预训练机器翻译模型进行微调,使其适应该译者的风格特点
  • results: 通过适应翻译者的特点,可以高度复制翻译者的风格在目标机器翻译中
    Abstract Although machine translation systems are mostly designed to serve in the general domain, there is a growing tendency to adapt these systems to other domains like literary translation. In this paper, we focus on English-Turkish literary translation and develop machine translation models that take into account the stylistic features of translators. We fine-tune a pre-trained machine translation model by the manually-aligned works of a particular translator. We make a detailed analysis of the effects of manual and automatic alignments, data augmentation methods, and corpus size on the translations. We propose an approach based on stylistic features to evaluate the style of a translator in the output translations. We show that the human translator style can be highly recreated in the target machine translations by adapting the models to the style of the translator.
    摘要 尽管机器翻译系统主要设计用于通用领域,但现在有越来越多的尝试将这些系统应用到其他领域,如文学翻译。在这篇论文中,我们将英语-土耳其文学翻译作为研究对象,并开发了考虑译者风格特征的机器翻译模型。我们利用特定译者人工对齐的译作对预训练机器翻译模型进行微调,并对人工与自动对齐、数据增强方法以及语料库规模对译文的影响进行详细分析。我们提出了基于风格特征的方法来评估译者风格在机器译文中的表现。我们证明,通过使模型适应译者的风格,可以在目标机器译文中高度再现该译者的风格。
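
One way to make "translator style" measurable, in the spirit of the stylistic-feature evaluation described above, is to compare simple corpus statistics of the human translator's output and the MT output. The features below (sentence length, type-token ratio, punctuation rate) are generic proxies chosen for illustration, not the paper's feature set.

```python
from collections import Counter

def style_profile(sentences):
    """Compute a few simple stylistic statistics for a list of translated sentences."""
    tokens = [tok for s in sentences for tok in s.split()]
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return {
        "avg_sentence_len": n / max(len(sentences), 1),
        "type_token_ratio": len(counts) / n,
        "punct_rate": sum(ch in ",.;:!?" for s in sentences for ch in s) / n,
    }

def style_distance(profile_a, profile_b):
    # L1 distance between two profiles; smaller means the MT output is closer
    # to the human translator's style on these crude features.
    return sum(abs(profile_a[k] - profile_b[k]) for k in profile_a)

if __name__ == "__main__":
    human = ["O gün hava çok sıcaktı .", "Kapıyı yavaşça açtı ve içeri girdi ."]
    machine = ["Hava o gün sıcaktı .", "Kapıyı açtı , içeri girdi ."]
    print(style_distance(style_profile(human), style_profile(machine)))
```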

Providing personalized Explanations: a Conversational Approach

  • paper_url: http://arxiv.org/abs/2307.11452
  • repo_url: None
  • paper_authors: Jieting Luo, Thomas Studer, Mehdi Dastani
  • for: 该论文旨在提供一种方法,使解释者通过与被解释者进行连续对话,为不同背景和知识水平的受众提供个性化解释。
  • methods: 该方法基于 conversational AI 技术,通过多个交流阶段,解释者可以了解解释者的背景和需求,并逐渐提供更加个性化的解释。
  • results: 作者证明了,只要存在一个被解释者能够理解、且解释者知悉的关于初始主张的解释,对话就会终止。
    Abstract The increasing applications of AI systems require personalized explanations for their behaviors to various stakeholders since the stakeholders may have various knowledge and backgrounds. In general, a conversation between explainers and explainees not only allows explainers to obtain the explainees' background, but also allows explainees to better understand the explanations. In this paper, we propose an approach for an explainer to communicate personalized explanations to an explainee through having consecutive conversations with the explainee. We prove that the conversation terminates due to the explainee's justification of the initial claim as long as there exists an explanation for the initial claim that the explainee understands and the explainer is aware of.
    摘要 随着人工智能系统的应用越来越广泛,需要对其行为向不同的利益相关者提供个性化解释,因为他们的知识和背景各不相同。通常,解释者与被解释者之间的对话不仅可以让解释者了解被解释者的背景,也可以让被解释者更好地理解解释。在这篇论文中,我们提议一种方法,让解释者通过与被解释者进行连续对话来传达个性化解释。我们证明,只要存在一个被解释者能够理解、且解释者知悉的关于初始主张的解释,对话就会因被解释者对初始主张的论证而终止。

AIGC Empowering Telecom Sector White Paper_chinese

  • paper_url: http://arxiv.org/abs/2307.11449
  • repo_url: None
  • paper_authors: Ye Ouyang, Yaqin Zhang, Xiaozhou Ye, Yunxin Liu, Yong Song, Yang Liu, Sen Bian, Zhiyong Liu
  • for: 本研究旨在探讨如何在电信领域应用人工智能生成内容技术(AIGC,如GPT),以及如何在电信领域落地AIGC应用。
  • methods: 本研究通过分析GPT模型,对电信服务提供商(Telco)的应用场景进行分析,提出了telco增强认知能力系统,并提供了构建电信服务GPT的方法。
  • results: 本研究提出了一种telco增强认知能力系统,并实现了在电信领域应用GPT的具体实践。
    Abstract In the global craze of GPT, people have deeply realized that AI, as a transformative technology and key force in economic and social development, will bring great leaps and breakthroughs to the global industry and profoundly influence the future world competition pattern. As the builder and operator of information and communication infrastructure, the telecom sector provides infrastructure support for the development of AI, and even takes the lead in the implementation of AI applications. How to enable the application of AIGC (GPT) and implement AIGC in the telecom sector are questions that telecom practitioners must ponder and answer. Through the study of GPT, a typical representative of AIGC, the authors have analyzed how GPT empowers the telecom sector in the form of scenarios, discussed the gap between the current GPT general model and telecom services, proposed for the first time a Telco Augmented Cognition capability system, provided answers to how to construct a telecom service GPT in the telecom sector, and carried out various practices. Our counterparts in the industry are expected to focus on collaborative innovation around telecom and AI, build an open and shared innovation ecosystem, promote the deep integration of AI and telecom sector, and accelerate the construction of next-generation information infrastructure, in an effort to facilitate the digital transformation of the economy and society.
    摘要 在全球GPT热潮中,人们已经深刻认识到AI作为转变技术和经济社会发展的关键力量,将会带来巨大的突破和进步,对全球行业和未来世界竞争格局产生深远的影响。作为信息和通信基础设施的建设者和运营者,电信行业为AI发展提供了基础设施支持,甚至在实施AI应用方面走在前列。如何应用AIGC(GPT)以及如何在电信领域落地AIGC,是电信从业者必须思考和回答的问题。通过研究GPT这一AIGC的典型代表,作者们以场景化的形式分析了GPT如何赋能电信行业,讨论了现有GPT通用模型与电信服务之间的差距,首次提出了电信增强认知能力体系,回答了如何在电信领域构建电信服务GPT的问题,并开展了多项实践。我们的行业同仁应当围绕电信与AI集中协同创新,建立开放共享的创新生态系统,促进AI与电信行业的深度融合,加快构建下一代信息基础设施,以推动经济社会的数字化转型。

Batching for Green AI – An Exploratory Study on Inference

  • paper_url: http://arxiv.org/abs/2307.11434
  • repo_url: None
  • paper_authors: Tim Yarally, Luís Cruz, Daniel Feitosa, June Sallou, Arie van Deursen
  • for: The paper examines the effect of input batching on the energy consumption and response times of five fully-trained neural networks for computer vision, aiming to investigate the potential benefits of introducing a batch size during the application (inference) phase of a deep learning model.
  • methods: The paper uses five fully-trained computer-vision networks that were considered state-of-the-art at the time of their publication and measures their energy consumption and response times with and without input batching.
  • results: Batching has a significant effect on both energy consumption and response times; energy consumption has risen at a much steeper pace than accuracy over the past decade, and ShuffleNetV2 (2018) is highlighted as achieving competitive performance for its time at a much lower energy cost.
    Abstract The batch size is an essential parameter to tune during the development of new neural networks. Amongst other quality indicators, it has a large degree of influence on the model's accuracy, generalisability, training times and parallelisability. This fact is generally known and commonly studied. However, during the application phase of a deep learning model, when the model is utilised by an end-user for inference, we find that there is a disregard for the potential benefits of introducing a batch size. In this study, we examine the effect of input batching on the energy consumption and response times of five fully-trained neural networks for computer vision that were considered state-of-the-art at the time of their publication. The results suggest that batching has a significant effect on both of these metrics. Furthermore, we present a timeline of the energy efficiency and accuracy of neural networks over the past decade. We find that in general, energy consumption rises at a much steeper pace than accuracy and question the necessity of this evolution. Additionally, we highlight one particular network, ShuffleNetV2(2018), that achieved a competitive performance for its time while maintaining a much lower energy consumption. Nevertheless, we highlight that the results are model dependent.
    摘要 批处理大小是深度学习模型开发中需要调节的重要参数。它对模型的准确率、泛化能力、训练时间和并行性等质量指标都有很大影响。这一点广为人知,也被广泛研究。然而,在深度学习模型的应用阶段,即终端用户用模型做推理时,引入批处理的潜在好处却常被忽视。本研究检查了五种在发表时被视为 state-of-the-art 的计算机视觉神经网络中,输入批处理对能耗和响应时间的影响。结果表明,批处理对这两项指标都有显著影响。此外,我们还给出了过去十年内神经网络能效和准确率的发展时间线。我们发现,能耗的增长速度远快于准确率的提升,并质疑这种演化的必要性。我们还特别指出一个网络 ShuffleNetV2(2018),它在当时实现了具有竞争力的性能,同时保持了低得多的能耗。不过,我们注意到这些结果与具体模型有关。
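
The core measurement is simple to reproduce: time the same number of images through a model at different batch sizes. The sketch below measures wall-clock latency per image with a toy CNN; energy measurement (which the paper also reports) requires dedicated tooling and is not shown here.

```python
import time
import torch
import torch.nn as nn

def measure_latency(model, num_images=256, batch_size=1, image_size=224, device="cpu"):
    """Rough wall-clock latency per image for a given batch size (a simple proxy
    for the batching effect; numbers are illustrative, not the paper's results)."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    with torch.no_grad():
        model(x)                                   # warm-up
        start = time.perf_counter()
        for _ in range(num_images // batch_size):
            model(x)
        elapsed = time.perf_counter() - start
    return elapsed / num_images                    # seconds per image

if __name__ == "__main__":
    toy_cnn = nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1000),
    )
    for bs in (1, 8, 32):
        print(bs, measure_latency(toy_cnn, num_images=64, batch_size=bs))
```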

Prompting Large Language Models with Speech Recognition Abilities

  • paper_url: http://arxiv.org/abs/2307.11795
  • repo_url: None
  • paper_authors: Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
  • for: 这个论文的目的是扩展大型语言模型(LLM)的能力,通过直接接入一个小型音频编码器,使其能够执行语音识别(ASR)。
  • methods: 论文使用一个小型音频编码器,将其产生的音频嵌入序列直接拼接在文本 token 嵌入之前,使LLM变成一个自动语音识别系统,并且可以以与其文本模型完全相同的方式使用。
  • results: 实验表明,在开源的 LLaMA-7B 模型中加入一个 conformer 音频编码器,可以比单语言基线模型提高18%的性能,并且即使 LLaMA 绝大部分在英语文本上训练,也能进行多语言语音识别。此外,论文还进行了消融实验,研究在训练时完全冻结 LLM、扩大音频编码器规模,以及增大音频编码器步长以生成更少嵌入的影响。结果表明,即使冻结 LLM 或在音频编码器中使用接近1秒的步长,多语言 ASR 仍然可行,这为 LLM 处理长音频打开了可能。
    Abstract Large language models have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder allowing it to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings, the LLM can be converted to an automatic speech recognition (ASR) system, and be used in the exact same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a conformer encoder into the open sourced LLaMA-7B allows it to outperform monolingual baselines by 18% and perform multilingual speech recognition despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies to investigate whether the LLM can be completely frozen during training to maintain its original capabilities, scaling up the audio encoder, and increasing the audio encoder striding to generate fewer embeddings. The results from these studies show that multilingual ASR is possible even when the LLM is frozen or when strides of almost 1 second are used in the audio encoder opening up the possibility for LLMs to operate on long-form audio.
    摘要 大型语言模型已经证明了它们在多种生成任务中的高度灵活性,如抽象摘要和开放式问答。在这篇论文中,我们通过直接附加一个小型音频编码器来扩展LLM的能力,使其能够进行语音识别。通过将一段音频嵌入序列直接拼接在文本 token 嵌入之前,LLM可以转化为自动语音识别(ASR)系统,并以与其文本模型完全相同的方式使用。在多语言 LibriSpeech(MLS)上进行的实验表明,在开源的 LLaMA-7B 中加入一个 conformer 编码器后,其性能比单语言基线高出18%,并且即使 LLaMA 在训练时绝大部分使用英语文本,也能进行多语言语音识别。此外,我们还进行了消融研究,考察是否可以在训练时完全冻结 LLM 以保持其原有能力、扩大音频编码器规模,以及增大音频编码器步长以生成更少的嵌入。研究结果表明,即使冻结 LLM 或在音频编码器中使用接近1秒的步长,多语言 ASR 仍然可行,这为 LLM 处理长音频打开了可能。
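
The mechanism of prepending audio embeddings to the token embeddings of a decoder-only LM can be sketched in a few lines. Everything below is a toy stand-in: the paper uses a conformer encoder and LLaMA-7B, whereas this sketch uses a single strided convolution and a tiny transformer, and omits the causal mask and loss masking.

```python
import torch
import torch.nn as nn

class AudioPrefixLM(nn.Module):
    """Toy illustration of prepending audio-derived embeddings to a language model."""
    def __init__(self, vocab_size=1000, d_model=256, stride=4):
        super().__init__()
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(80, d_model, kernel_size=stride, stride=stride),  # downsample filterbank frames
            nn.ReLU(),
        )
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)       # stand-in for the LLM
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, fbank, token_ids):
        # fbank: (batch, 80, frames) log-mel features; token_ids: (batch, seq)
        audio = self.audio_encoder(fbank).transpose(1, 2)    # (batch, frames/stride, d_model)
        text = self.token_embed(token_ids)                   # (batch, seq, d_model)
        x = torch.cat([audio, text], dim=1)                  # audio embeddings act as a prefix
        return self.lm_head(self.backbone(x))

if __name__ == "__main__":
    model = AudioPrefixLM()
    logits = model(torch.randn(2, 80, 160), torch.randint(0, 1000, (2, 12)))
    print(logits.shape)   # (2, 160/4 + 12, 1000)
```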

A Video-based Detector for Suspicious Activity in Examination with OpenPose

  • paper_url: http://arxiv.org/abs/2307.11413
  • repo_url: None
  • paper_authors: Reuben Moyo, Stanley Ndebvu, Michael Zimba, Jimmy Mbelwa
  • for: 防止学生或监考人员的作弊行为,维护学术诚信
  • methods: 使用自动化视频分析和人体姿态估计(OpenPose)技术,检测考试过程中可疑的活动
  • results: 提高了考试监考的效率和效果,有助于减少考试过程中的作弊行为
    Abstract Examinations are a crucial part of the learning process, and academic institutions invest significant resources into maintaining their integrity by preventing cheating from students or facilitators. However, cheating has become rampant in examination setups, compromising their integrity. The traditional method of relying on invigilators to monitor every student is impractical and ineffective. To address this issue, there is a need to continuously record exam sessions to monitor students for suspicious activities. However, these recordings are often too lengthy for invigilators to analyze effectively, and fatigue may cause them to miss significant details. To widen the coverage, invigilators could use fixed overhead or wearable cameras. This paper introduces a framework that uses automation to analyze videos and detect suspicious activities during examinations efficiently and effectively. We utilized the OpenPose framework and Convolutional Neural Network (CNN) to identify students exchanging objects during exams. This detection system is vital in preventing cheating and promoting academic integrity, fairness, and quality education for institutions.
    摘要 考试是学习过程中不可或缺的一部分,学术机构投入了大量资源来维护考试的公正性,防止学生或监考人员的作弊行为。然而,在考试场景中,作弊行为已经变得普遍,威胁到了考试的公正性。传统的依靠监考员监视每名学生的方法既不切实际也不够有效。为解决这一问题,需要持续录制考试过程,以便监测学生的可疑活动。然而,这些录像往往太长,监考员难以有效分析,疲劳还可能导致重要细节被遗漏。为了扩大覆盖范围,监考员可以使用固定的头顶摄像头或佩戴式摄像头。本文介绍了一种利用自动化视频分析、高效检测考试过程中可疑活动的框架。我们利用了OpenPose框架和卷积神经网络(CNN)来识别考试过程中学生交换物品的行为。这个检测系统是维护学术诚信、公平性和高质量教育的重要工具。
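
A simple heuristic in the spirit of the described detector: flag frames in which the wrists of two neighbouring examinees (taken from pose keypoints such as OpenPose's) remain unusually close for a sustained period. The thresholds and the use of wrist keypoints alone are illustrative assumptions; the paper combines pose estimation with a CNN classifier.

```python
import numpy as np

def wrist_proximity_alert(poses_a, poses_b, frame_rate=25, dist_thresh=60.0, min_seconds=1.0):
    """Flag possible object exchanges between two examinees.

    poses_a / poses_b: arrays of shape (frames, 2, 2) holding, per frame,
    (left/right wrist) x (x, y) pixel coordinates from a pose estimator.
    Returns the start indices of sustained close-contact intervals.
    """
    # Minimum wrist-to-wrist distance per frame across all 4 wrist pairings.
    diff = poses_a[:, :, None, :] - poses_b[:, None, :, :]          # (frames, 2, 2, 2)
    dists = np.linalg.norm(diff, axis=-1).reshape(len(poses_a), -1).min(axis=1)
    close = dists < dist_thresh

    # Require the closeness to persist for at least `min_seconds`.
    min_frames = int(min_seconds * frame_rate)
    alerts, run = [], 0
    for i, flag in enumerate(close):
        run = run + 1 if flag else 0
        if run == min_frames:
            alerts.append(i - min_frames + 1)
    return alerts

if __name__ == "__main__":
    frames = 100
    a = np.random.uniform(0, 500, size=(frames, 2, 2))
    b = a + 30.0                                  # synthetic: persistently close wrists
    print(wrist_proximity_alert(a, b))
```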

Deep Directly-Trained Spiking Neural Networks for Object Detection

  • paper_url: http://arxiv.org/abs/2307.11411
  • repo_url: https://github.com/BICLab/EMS-YOLO
  • paper_authors: Qiaoyi Su, Yuhong Chou, Yifan Hu, Jianing Li, Shijie Mei, Ziyang Zhang, Guoqi Li
  • for: 这个研究是为了解决如何透过直接训练脉冲神经网络(SNN)来实现物体检测任务,而不是使用传统的ANN-SNN转换策略。
  • methods: 我们提出了一个名为EMS-YOLO的新的直接训练SNN物体检测框架,它使用代理梯度(surrogate gradient)来训练深度SNN,并设计了一个全脉冲残差模块EMS-ResNet,可以在低功耗下有效加深直接训练的SNN。
  • results: 我们的方法在极少的时间步(仅4步)内就超越了先前的ANN-SNN转换方法(至少需要500步),并且在相同架构下可以达到与ANN相当的性能,同时能源消耗降低约5.83倍。
    Abstract Spiking neural networks (SNNs) are brain-inspired energy-efficient models that encode information in spatiotemporal dynamics. Recently, deep SNNs trained directly have shown great success in achieving high performance on classification tasks with very few time steps. However, how to design a directly-trained SNN for the regression task of object detection still remains a challenging problem. To address this problem, we propose EMS-YOLO, a novel directly-trained SNN framework for object detection, which is the first trial to train a deep SNN with surrogate gradients for object detection rather than ANN-SNN conversion strategies. Specifically, we design a full-spike residual block, EMS-ResNet, which can effectively extend the depth of the directly-trained SNN with low power consumption. Furthermore, we theoretically analyze and prove the EMS-ResNet could avoid gradient vanishing or exploding. The results demonstrate that our approach outperforms the state-of-the-art ANN-SNN conversion methods (at least 500 time steps) in extremely fewer time steps (only 4 time steps). It is shown that our model could achieve comparable performance to the ANN with the same architecture while consuming 5.83 times less energy on the frame-based COCO Dataset and the event-based Gen1 Dataset.
    摘要 脉冲神经网络(SNN)是受大脑启发的高能效模型,它将信息编码在时空动态中。最近,直接训练的深度SNN在分类任务上以极少的时间步取得了很大成功,但是如何为对象检测这一回归任务设计直接训练的SNN仍然是一个挑战。为解决这个问题,我们提出了EMS-YOLO,一种新的面向对象检测的直接训练SNN框架,这是首次使用代理梯度来训练用于对象检测的深度SNN,而非采用ANN-SNN转换策略。我们设计了全脉冲残差模块EMS-ResNet,可以在低功耗下有效加深直接训练SNN的深度。此外,我们从理论上分析并证明EMS-ResNet可以避免梯度消失或爆炸。结果表明,我们的方法仅用4个时间步就可以超越当前的ANN-SNN转换方法(至少需要500个时间步)。此外,我们的模型在基于帧的COCO数据集和基于事件的Gen1数据集上可以达到与相同架构的ANN相当的性能,同时能耗降低5.83倍。
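
Direct training of SNNs hinges on a surrogate gradient for the non-differentiable spike. Below is a minimal leaky integrate-and-fire layer with a rectangular surrogate, written as a generic illustration of the technique rather than the EMS-ResNet architecture; the time constant, threshold, and surrogate window are illustrative.

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, rectangular surrogate in the backward pass."""
    @staticmethod
    def forward(ctx, v_minus_thresh):
        ctx.save_for_backward(v_minus_thresh)
        return (v_minus_thresh > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        surrogate = (v.abs() < 0.5).float()      # gradient passes only near the threshold
        return grad_out * surrogate

def lif_layer(inputs, tau=2.0, v_thresh=1.0):
    """Run a leaky integrate-and-fire neuron over time. inputs: (time, batch, features)."""
    v = torch.zeros_like(inputs[0])
    spikes = []
    for x_t in inputs:
        v = v + (x_t - v) / tau                  # leaky integration
        s = SpikeFn.apply(v - v_thresh)
        v = v * (1.0 - s)                        # hard reset on spike
        spikes.append(s)
    return torch.stack(spikes)

if __name__ == "__main__":
    x = torch.randn(4, 2, 8, requires_grad=True)     # 4 time steps
    out = lif_layer(x)
    out.sum().backward()                             # surrogate gradient flows back to x
    print(out.shape, x.grad is not None)
```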

Probabilistic Modeling of Inter- and Intra-observer Variability in Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.11397
  • repo_url: None
  • paper_authors: Arne Schmidt, Pablo Morales-Álvarez, Rafael Molina
  • for: 这篇论文的目的是提出一个新的医疗影像分割模型,用以建模医疗专家之间(inter-observer)与专家自身(intra-observer)的标注差异。
  • methods: 该模型名为 probabilistic inter-observer and intra-observer variation network(Pionono),它将每位标注者的标注行为建模为一个多维概率分布,并将其与影像特征图结合,以生成概率化的分割预测。该模型通过变分推断进行优化,并可端到端训练。
  • results: 实验结果显示,Pionono 模型可以高效地进行医疗影像分割,准确性优于现有的先进模型,例如 STAPLE、Probabilistic U-Net 和基于混淆矩阵的模型。此外,Pionono 模型可以预测多个连贯的分割图,模拟各标注者的专家意见,为医疗诊断过程提供额外的有用信息。
    Abstract Medical image segmentation is a challenging task, particularly due to inter- and intra-observer variability, even between medical experts. In this paper, we propose a novel model, called Probabilistic Inter-Observer and iNtra-Observer variation NetwOrk (Pionono). It captures the labeling behavior of each rater with a multidimensional probability distribution and integrates this information with the feature maps of the image to produce probabilistic segmentation predictions. The model is optimized by variational inference and can be trained end-to-end. It outperforms state-of-the-art models such as STAPLE, Probabilistic U-Net, and models based on confusion matrices. Additionally, Pionono predicts multiple coherent segmentation maps that mimic the rater's expert opinion, which provides additional valuable information for the diagnostic process. Experiments on real-world cancer segmentation datasets demonstrate the high accuracy and efficiency of Pionono, making it a powerful tool for medical image analysis.
    摘要 医学图像分割是一项具有挑战性的任务,主要原因之一是观察者间与观察者内的差异,即使在医疗专家之间也是如此。在这篇论文中,我们提出了一种新的模型,即概率性观察者间与观察者内变异网络(Pionono)。它将每个标注者的标注行为建模为一个多维概率分布,并将这些信息与图像特征图结合,以生成概率性的分割预测。该模型通过变分推断进行优化,并可以端到端训练。与现有的先进模型(如STAPLE、概率性U-Net以及基于混淆矩阵的模型)相比,Pionono 表现更准确、更有效。此外,Pionono 还能预测多个连贯的分割图,模拟标注者的专家意见,为诊断过程提供有价值的额外信息。在真实的癌症分割数据集上的实验表明,Pionono 具有很高的准确率和效率,使其成为医学图像分析中的一个强大工具。
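
The idea of combining a per-rater latent distribution with image features can be sketched as follows; the dimensions, the Gaussian embedding parameterisation, and the one-layer decoder are illustrative assumptions, not Pionono's actual architecture.

```python
import torch
import torch.nn as nn

class RaterHead(nn.Module):
    """Sketch: model each annotator as a latent Gaussian whose samples are combined
    with image features to produce rater-specific segmentation logits.
    """
    def __init__(self, num_raters=3, z_dim=8, feat_ch=16, num_classes=2):
        super().__init__()
        self.mu = nn.Embedding(num_raters, z_dim)
        self.logvar = nn.Embedding(num_raters, z_dim)
        self.decoder = nn.Conv2d(feat_ch + z_dim, num_classes, kernel_size=1)

    def forward(self, feats, rater_ids):
        # feats: (batch, feat_ch, H, W); rater_ids: (batch,) integer annotator index
        mu, logvar = self.mu(rater_ids), self.logvar(rater_ids)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterised sample
        z_map = z[:, :, None, None].expand(-1, -1, *feats.shape[2:])
        return self.decoder(torch.cat([feats, z_map], dim=1))     # per-rater logits

if __name__ == "__main__":
    head = RaterHead()
    feats = torch.randn(2, 16, 32, 32)          # would come from a segmentation backbone
    logits = head(feats, torch.tensor([0, 2]))
    print(logits.shape)                          # (2, 2, 32, 32)
```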

Large Language Model-based System to Provide Immediate Feedback to Students in Flipped Classroom Preparation Learning

  • paper_url: http://arxiv.org/abs/2307.11388
  • repo_url: None
  • paper_authors: Shintaro Uchiyama, Kyoji Umemura, Yusuke Morita
  • for: 该系统旨在为翻转课堂(flipped classroom)的预习阶段提供学生即时反馈,以解决翻转课堂模式中的一些挑战,如确保学生在学习过程中保持情感投入和学习动机。
  • methods: 该系统使用大型语言模型为学生在预习过程中提供即时反馈,并基于 ChatGPT API 在一个已投入实际使用的视频观看支持系统上开发。为了使 ChatGPT 的回答与学生问题的上下文保持一致,论文还提出了一种对齐方法。此外,论文还提出了一种收集教师对学生问题的回答、并将其作为学生额外指导的方法。
  • results: 论文给出了该基于大型语言模型的预习支持系统的设计与实现,能够帮助学生在预习过程中获得即时反馈,提高学习效果。
    Abstract This paper proposes a system that uses large language models to provide immediate feedback to students in flipped classroom preparation learning. This study aimed to solve challenges in the flipped classroom model, such as ensuring that students are emotionally engaged and motivated to learn. Students often have questions about the content of lecture videos in the preparation of flipped classrooms, but it is difficult for teachers to answer them immediately. The proposed system was developed using the ChatGPT API on a video-watching support system for preparation learning that is being used in real practice. Answers from ChatGPT often do not align with the context of the student's question. Therefore, this paper also proposes a method to align the answer with the context. This paper also proposes a method to collect the teacher's answers to the students' questions and use them as additional guides for the students. This paper discusses the design and implementation of the proposed system.
    摘要 这篇论文提出了一种基于大语言模型的系统,用于在翻转课堂的预习学习中为学生提供即时反馈。这项研究的目标是解决翻转课堂模式中的一些挑战,如确保学生在预习过程中保持情感投入和学习动机。学生在预习时常对讲课视频的内容产生疑问,但教师难以立即回答这些问题。所提系统基于一个已在实际教学中使用的视频观看支持系统,利用 ChatGPT API 开发。然而,ChatGPT 的回答往往与学生问题的上下文不一致,因此论文还提出了一种使回答与上下文对齐的方法。此外,论文还提出了一种收集教师对学生问题的回答、并将其作为学生额外指导的方法。论文讨论了所提系统的设计和实现。

Diverse Offline Imitation via Fenchel Duality

  • paper_url: http://arxiv.org/abs/2307.11373
  • repo_url: None
  • paper_authors: Marin Vlastelica, Pavel Kolev, Jin Cheng, Georg Martius
  • for: 本研究旨在开发一个无需在线访问环境的离线技能发现算法。
  • methods: 我们最大化一个受KL散度约束的互信息目标函数,并结合 Fenchel 对偶、强化学习和无监督技能发现,开发出一个简单的离线算法,以学习与专家对齐的技能。
  • results: 我们的主要贡献是将 Fenchel 对偶、强化学习和无监督技能发现联系起来,并提供了一个简单的离线算法,以学习与专家对齐的多样化技能。
    Abstract There has been significant recent progress in the area of unsupervised skill discovery, with various works proposing mutual information based objectives, as a source of intrinsic motivation. Prior works predominantly focused on designing algorithms that require online access to the environment. In contrast, we develop an \textit{offline} skill discovery algorithm. Our problem formulation considers the maximization of a mutual information objective constrained by a KL-divergence. More precisely, the constraints ensure that the state occupancy of each skill remains close to the state occupancy of an expert, within the support of an offline dataset with good state-action coverage. Our main contribution is to connect Fenchel duality, reinforcement learning and unsupervised skill discovery, and to give a simple offline algorithm for learning diverse skills that are aligned with an expert.
    摘要 近期无监督技能发现领域有很大进步,多项工作提出了基于互信息的目标,作为内在动机的来源。先前的工作主要集中在设计需要在线存取环境的算法。相比之下,我们开发了一个离线技能发现算法。我们的问题设计是最大化一个受KL散度约束的互信息目标。更精确地说,这些约束保证每个技能的状态占有率,在一个具有良好状态-动作覆盖的离线数据集的支持范围内,与专家的状态占有率保持接近。我们的主要贡献是连接 Fenchel 对偶、强化学习和无监督技能发现,并提供一个简单的离线算法,用于学习与专家对齐的多样化技能。
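
For readers who want the objective in symbols, a rough sketch of the constrained problem described above is given below; the notation is ours, not lifted verbatim from the paper. The paper's contribution is to make this constrained occupancy-matching problem tractable offline via Fenchel duality.

```latex
% z: skill latent, \pi_z: skill-conditioned policy, d^{\pi_z}: its state occupancy,
% d^{E}: the expert's state occupancy estimated from the offline dataset.
\max_{\pi}\; I(S; Z)
\qquad \text{subject to} \qquad
D_{\mathrm{KL}}\!\left( d^{\pi_z} \,\middle\|\, d^{E} \right) \le \varepsilon
\quad \text{for each skill } z .
```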

CohortGPT: An Enhanced GPT for Participant Recruitment in Clinical Study

  • paper_url: http://arxiv.org/abs/2307.11346
  • repo_url: None
  • paper_authors: Zihan Guan, Zihao Wu, Zhengliang Liu, Dufan Wu, Hui Ren, Quanzheng Li, Xiang Li, Ninghao Liu
  • for: 这个研究旨在提高大型语言模型(LLMs)在临床研究参与者招募(队列建立)任务上的性能,特别是将医疗报告文本分类到疾病标签。
  • methods: 这个研究使用知识图作为辅助信息来引导 LLM 进行预测,并采用一个由强化学习增强的思维链(CoT)样本选择策略。
  • results: 这个少样本学习方法获得了令人满意的性能,并且在数据有限的情况下优于微调策略。代码和示例数据集可在以下网址获取:https://anonymous.4open.science/r/CohortGPT-4872/
    Abstract Participant recruitment based on unstructured medical texts such as clinical notes and radiology reports has been a challenging yet important task for the cohort establishment in clinical research. Recently, Large Language Models (LLMs) such as ChatGPT have achieved tremendous success in various downstream tasks thanks to their promising performance in language understanding, inference, and generation. It is then natural to test their feasibility in solving the cohort recruitment task, which involves the classification of a given paragraph of medical text into disease label(s). However, when applied to knowledge-intensive problem settings such as medical text classification, where the LLMs are expected to understand the decision made by human experts and accurately identify the implied disease labels, the LLMs show a mediocre performance. A possible explanation is that, by only using the medical text, the LLMs neglect to use the rich context of additional information that languages afford. To this end, we propose to use a knowledge graph as auxiliary information to guide the LLMs in making predictions. Moreover, to further boost the LLMs adapt to the problem setting, we apply a chain-of-thought (CoT) sample selection strategy enhanced by reinforcement learning, which selects a set of CoT samples given each individual medical report. Experimental results and various ablation studies show that our few-shot learning method achieves satisfactory performance compared with fine-tuning strategies and gains superb advantages when the available data is limited. The code and sample dataset of the proposed CohortGPT model is available at: https://anonymous.4open.science/r/CohortGPT-4872/
    摘要 基于临床笔记和放射学报告等非结构化医疗文本进行参与者招募,是临床研究队列建立中一项重要而具有挑战性的任务。近来,大型语言模型(LLMs)如ChatGPT凭借在语言理解、推理和生成方面的出色表现,在多种下游任务中取得了巨大成功。因此,很自然地会想测试它们在参与者招募任务上的可行性,该任务涉及将给定的医疗文本段落分类到疾病标签。然而,当应用到医学文本分类这类知识密集的问题设定中,即要求LLMs理解人类专家的决策并准确识别隐含的疾病标签时,LLMs表现平平。一个可能的解释是,只使用医疗文本时,LLMs忽略了语言所能提供的丰富上下文信息。为此,我们提议使用知识图作为辅助信息来引导 LLMs 进行预测。此外,为了让 LLMs 更好地适应该问题设定,我们应用了一种由强化学习增强的思维链(CoT)样本选择策略,为每份医疗报告选择一组 CoT 样本。实验结果和多种消融研究表明,我们的少样本学习方法取得了令人满意的性能,并且在可用数据有限时相比微调策略具有显著优势。所提 CohortGPT 模型的代码和示例数据集可在以下链接获取:https://anonymous.4open.science/r/CohortGPT-4872/

A Two-stage Fine-tuning Strategy for Generalizable Manipulation Skill of Embodied AI

  • paper_url: http://arxiv.org/abs/2307.11343
  • repo_url: https://github.com/xtli12/gxu-lipe
  • paper_authors: Fang Gao, XueTao Li, Jun Yu, Feng Shaung
  • for: The paper aims to enhance the generalization capability of Embodied AI models in real-world scenarios, particularly in the Maniskill2 benchmark.
  • methods: The proposed method uses a two-stage fine-tuning strategy based on the Maniskill2 benchmark, which involves training the model on diverse datasets of demonstrations and evaluating its ability to generalize to unseen scenarios.
  • results: The proposed method achieved the 1st prize in all three tracks of the ManiSkill2 Challenge, demonstrating its effectiveness in enhancing the generalization abilities of Embodied AI models.
    Abstract The advent of Chat-GPT has led to a surge of interest in Embodied AI. However, many existing Embodied AI models heavily rely on massive interactions with training environments, which may not be practical in real-world situations. To this end, Maniskill2 has introduced a full-physics simulation benchmark for manipulating various 3D objects. This benchmark enables agents to be trained using diverse datasets of demonstrations and evaluates their ability to generalize to unseen scenarios in testing environments. In this paper, we propose a novel two-stage fine-tuning strategy that aims to further enhance the generalization capability of our model based on the Maniskill2 benchmark. Through extensive experiments, we demonstrate the effectiveness of our approach by achieving the 1st prize in all three tracks of the ManiSkill2 Challenge. Our findings highlight the potential of our method to improve the generalization abilities of Embodied AI models and pave the way for their practical applications in real-world scenarios. All codes and models of our solution are available at https://github.com/xtli12/GXU-LIPE.git
    摘要 Chat-GPT 的出现引发了对具身智能(Embodied AI)的浓厚兴趣,但是许多现有的具身智能模型严重依赖与训练环境的大量交互,这在实际场景中可能并不可行。为此,Maniskill2 引入了一个用于操纵多种 3D 物体的全物理仿真基准。这个基准允许智能体使用多样的示范数据进行训练,并评估其在测试环境中对未见过场景的泛化能力。在这篇论文中,我们提出了一种新的两阶段微调策略,旨在基于 Maniskill2 基准进一步提高模型的泛化能力。经过广泛的实验,我们证明了该方法的有效性,在 ManiSkill2 挑战赛中获得了全部三个赛道的第一名。我们的发现表明该方法可以提高具身智能模型的泛化能力,并为它们在真实场景中的实际应用奠定基础。我们方案的全部代码和模型可以在 https://github.com/xtli12/GXU-LIPE.git 找到。

OpenGDA: Graph Domain Adaptation Benchmark for Cross-network Learning

  • paper_url: http://arxiv.org/abs/2307.11341
  • repo_url: https://github.com/skyorca/opengda
  • paper_authors: Boshen Shi, Yongqing Wang, Fangda Guo, Jiangli Shao, Huawei Shen, Xueqi Cheng
  • for: 本文主要针对的是评估图域适应模型(Graph Domain Adaptation,GDA)在不同类型任务上的性能,包括节点级、边级和图级任务。
  • methods: 本文提出了一个名为OpenGDA的测试基准,它提供了丰富的预处理数据集和统一的端到端评估管线,并集成了当前最先进的GDA模型。
  • results: 基准实验显示,GDA模型在实际应用场景中难以持续保持良好性能,仍需进一步研究以提高其实际应用效果。
    Abstract Graph domain adaptation models are widely adopted in cross-network learning tasks, with the aim of transferring labeling or structural knowledge. Currently, there mainly exist two limitations in evaluating graph domain adaptation models. On one side, they are primarily tested for the specific cross-network node classification task, leaving tasks at edge-level and graph-level largely under-explored. Moreover, they are primarily tested in limited scenarios, such as social networks or citation networks, lacking validation of model's capability in richer scenarios. As comprehensively assessing models could enhance model practicality in real-world applications, we propose a benchmark, known as OpenGDA. It provides abundant pre-processed and unified datasets for different types of tasks (node, edge, graph). They originate from diverse scenarios, covering web information systems, urban systems and natural systems. Furthermore, it integrates state-of-the-art models with standardized and end-to-end pipelines. Overall, OpenGDA provides a user-friendly, scalable and reproducible benchmark for evaluating graph domain adaptation models. The benchmark experiments highlight the challenges of applying GDA models to real-world applications with consistent good performance, and potentially provide insights to future research. As an emerging project, OpenGDA will be regularly updated with new datasets and models. It could be accessed from https://github.com/Skyorca/OpenGDA.
    摘要 Graph domain adaptation模型在跨网络学习任务中广泛应用,旨在传递标签或结构知识。目前,评估graph domain adaptation模型存在两大限制。一方面,它们主要在特定的跨网络节点分类任务上测试,剩下的任务(如边级和图级)尚未得到足够的探索。另一方面,它们主要在有限的场景中测试,如社交网络或引用网络,缺乏模型在更加丰富的场景中的验证。为了全面评估模型的实用性,我们提出了OpenGDA benchmark。它提供了各种类型的任务(节点、边、图)的充分预处理和统一的数据集,来自多样化的场景,包括网络信息系统、城市系统和自然系统。此外,它集成了当前最佳的模型和标准化的管道。总之,OpenGDA提供了用户友好、可扩展和可重现的 benchmark,用于评估graph domain adaptation模型。 benchmark实验显示出应用GDA模型到实际应用场景的挑战,并可能提供未来研究的指导。作为一个emerging项目,OpenGDA将在将来不断更新数据集和模型,可以在https://github.com/Skyorca/OpenGDA上访问。

Tri-MipRF: Tri-Mip Representation for Efficient Anti-Aliasing Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2307.11335
  • repo_url: https://github.com/wbhu/Tri-MipRF
  • paper_authors: Wenbo Hu, Yuling Wang, Lin Ma, Bangbang Yang, Lin Gao, Xiao Liu, Yuewen Ma
  • for: 提高 NeRF 的质量和效率,解决训练时间与渲染质量之间的权衡困境。
  • methods: 提出了一种新的 Tri-Mip 编码方法,将预滤波的三维特征空间因子化为三个相互正交的多级 mipmap,从而利用二维预滤波特征图实现高效的三维区域采样与抗锯齿渲染。
  • results: 实验表明,提出的方法可以具有 state-of-the-art 的渲染质量和重建速度,同时减少模型大小25%,比 Instant-ngp 更高效。
    Abstract Despite the tremendous progress in neural radiance fields (NeRF), we still face a dilemma of the trade-off between quality and efficiency, e.g., MipNeRF presents fine-detailed and anti-aliased renderings but takes days for training, while Instant-ngp can accomplish the reconstruction in a few minutes but suffers from blurring or aliasing when rendering at various distances or resolutions due to ignoring the sampling area. To this end, we propose a novel Tri-Mip encoding that enables both instant reconstruction and anti-aliased high-fidelity rendering for neural radiance fields. The key is to factorize the pre-filtered 3D feature spaces in three orthogonal mipmaps. In this way, we can efficiently perform 3D area sampling by taking advantage of 2D pre-filtered feature maps, which significantly elevates the rendering quality without sacrificing efficiency. To cope with the novel Tri-Mip representation, we propose a cone-casting rendering technique to efficiently sample anti-aliased 3D features with the Tri-Mip encoding considering both pixel imaging and observing distance. Extensive experiments on both synthetic and real-world datasets demonstrate our method achieves state-of-the-art rendering quality and reconstruction speed while maintaining a compact representation that reduces 25% model size compared against Instant-ngp.
    摘要 尽管神经辐射场(NeRF)已经取得了很大的进步,但我们仍然面临质量和效率之间的权衡问题,例如MipNeRF可以提供细节丰富且抗锯齿的渲染结果,但训练需要数天;而Instant-ngp可以在几分钟内完成重建,但由于忽略采样区域,在不同的距离或分辨率下渲染会出现模糊或锯齿。为了解决这个问题,我们提出了一种新的Tri-Mip编码方法,可以同时实现即时重建和抗锯齿的高保真渲染。关键在于将预滤波的3D特征空间分解为三个正交的mipmap。这样,我们可以利用2D预滤波特征图高效地进行3D区域采样,在不牺牲效率的情况下显著提高渲染质量。为了配合Tri-Mip表示,我们提出了一种锥体投射(cone-casting)渲染技术,在同时考虑像素成像和观察距离的情况下,高效地采样抗锯齿的3D特征。我们在合成和真实数据集上进行了广泛的实验,结果表明我们的方法可以实现最先进的渲染质量和重建速度,同时模型大小相比Instant-ngp减少25%。

Analysis of Elephant Movement in Sub-Saharan Africa: Ecological, Climatic, and Conservation Perspectives

  • paper_url: http://arxiv.org/abs/2307.11325
  • repo_url: None
  • paper_authors: Matthew Hines, Gregory Glatzer, Shreya Ghosh, Prasenjit Mitra
  • for: 这项研究旨在为撒哈拉以南非洲大象的生态研究和保护策略提供基础。
  • methods: 这项研究使用分析方法来揭示撒哈拉以南非洲大象的移动模式,重点关注季节变化和降水模式等关键生态因素。
  • results: 研究发现大象的移动模式受到季节变化和降水模式的影响,并提供了一个涵盖这些因素的整体视图。这些结果可用于预测未来大象的移动模式,并为生态研究和保护策略提供依据。
    Abstract The interaction between elephants and their environment has profound implications for both ecology and conservation strategies. This study presents an analytical approach to decipher the intricate patterns of elephant movement in Sub-Saharan Africa, concentrating on key ecological drivers such as seasonal variations and rainfall patterns. Despite the complexities surrounding these influential factors, our analysis provides a holistic view of elephant migratory behavior in the context of the dynamic African landscape. Our comprehensive approach enables us to predict the potential impact of these ecological determinants on elephant migration, a critical step in establishing informed conservation strategies. This projection is particularly crucial given the impacts of global climate change on seasonal and rainfall patterns, which could substantially influence elephant movements in the future. The findings of our work aim to not only advance the understanding of movement ecology but also foster a sustainable coexistence of humans and elephants in Sub-Saharan Africa. By predicting potential elephant routes, our work can inform strategies to minimize human-elephant conflict, effectively manage land use, and enhance anti-poaching efforts. This research underscores the importance of integrating movement ecology and climatic variables for effective wildlife management and conservation planning.
    摘要 Elephant and its environment interaction has profound ecological and conservation implications. This study uses an analytical approach to decipher the complex patterns of elephant movement in Sub-Saharan Africa, focusing on key ecological drivers such as seasonal variations and rainfall patterns. Our comprehensive approach provides a holistic view of elephant migratory behavior in the dynamic African landscape, allowing us to predict the potential impact of ecological determinants on elephant migration. This projection is crucial in establishing informed conservation strategies, especially in light of the impacts of global climate change on seasonal and rainfall patterns. Our findings aim to advance the understanding of movement ecology and foster sustainable human-elephant coexistence in Sub-Saharan Africa. By predicting potential elephant routes, our work can inform strategies to minimize human-elephant conflict, effectively manage land use, and enhance anti-poaching efforts. This research highlights the importance of integrating movement ecology and climatic variables for effective wildlife management and conservation planning.

HVDetFusion: A Simple and Robust Camera-Radar Fusion Framework

  • paper_url: http://arxiv.org/abs/2307.11323
  • repo_url: https://github.com/hvxlab/hvdetfusion
  • paper_authors: Kai Lei, Zhan Chen, Shuman Jia, Xiaoteng Zhang
  • for: 提出了一种新的探测算法 called HVDetFusion,用于实时3D对象检测。
  • methods: 该算法不仅支持纯摄像头数据输入,还可以进行摄像头和雷达数据的融合输入,并修改了Bevdet4D框架,以提升检测效果和推理效率。
  • results: HVDetFusion在nuScenes测试集上取得了67.4%的NDS,创下摄像头-雷达3D对象检测器的最新纪录。
    Abstract In the field of autonomous driving, 3D object detection is a very important perception module. Although the current SOTA algorithm combines Camera and Lidar sensors, limited by the high price of Lidar, the current mainstream landing schemes are pure Camera sensors or Camera+Radar sensors. In this study, we propose a new detection algorithm called HVDetFusion, which is a multi-modal detection algorithm that not only supports pure camera data as input for detection, but also can perform fusion input of radar data and camera data. The camera stream does not depend on the input of Radar data, thus addressing the downside of previous methods. In the pure camera stream, we modify the framework of Bevdet4D for better perception and more efficient inference, and this stream has the whole 3D detection output. Further, to incorporate the benefits of Radar signals, we use the prior information of different object positions to filter the false positive information of the original radar data, according to the positioning information and radial velocity information recorded by the radar sensors to supplement and fuse the BEV features generated by the original camera data, and the effect is further improved in the process of fusion training. Finally, HVDetFusion achieves the new state-of-the-art 67.4\% NDS on the challenging nuScenes test set among all camera-radar 3D object detectors. The code is available at https://github.com/HVXLab/HVDetFusion
    摘要 在自动驾驶领域,3D物体检测是一个非常重要的感知模块。当前最佳算法结合了摄像头和激光雷达(Lidar)传感器,但受限于激光雷达的高昂价格,目前主流落地方案是纯摄像头传感器或摄像头+毫米波雷达传感器。在本研究中,我们提出了一种名为 HVDetFusion 的新检测算法,该算法不仅支持纯摄像头数据作为输入进行检测,还可以进行雷达数据和摄像头数据的融合输入。摄像头流不依赖于雷达数据输入,因此解决了之前方法的缺点。在纯摄像头流中,我们修改了 Bevdet4D 框架,以获得更好的感知效果和更高效的推理,并且该流可以输出完整的3D检测结果。此外,为了利用雷达信号的优势,我们根据不同目标位置的先验信息,结合雷达传感器记录的定位信息和径向速度信息,过滤原始雷达数据中的误检,再与原始摄像头数据生成的 BEV 特征进行补充和融合,在融合训练过程中进一步提升效果。最终,HVDetFusion 在 nuScenes 测试集上取得了 67.4% NDS 的最新纪录,优于所有其他摄像头-雷达 3D 物体检测器。
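
The radar false-positive filtering step can be illustrated with a small geometric gate: keep only radar returns that fall near camera-predicted object centres, optionally also requiring a non-trivial radial velocity. The thresholds and coordinate conventions below are illustrative, not values from the paper.

```python
import numpy as np

def filter_radar_points(radar_points, prior_boxes, max_offset=2.0, min_abs_radial_v=0.0):
    """Keep radar returns that fall near camera-predicted object positions.

    radar_points: (N, 3) array of (x, y, radial_velocity) in BEV coordinates.
    prior_boxes:  (M, 2) array of predicted object centres (x, y) from the camera stream.
    """
    if len(prior_boxes) == 0:
        return radar_points[:0]
    xy = radar_points[:, :2]
    d = np.linalg.norm(xy[:, None, :] - prior_boxes[None, :, :], axis=-1).min(axis=1)
    keep = (d < max_offset) & (np.abs(radar_points[:, 2]) >= min_abs_radial_v)
    return radar_points[keep]

if __name__ == "__main__":
    radar = np.array([[10.0, 3.0, 4.2], [55.0, -20.0, 0.1], [10.5, 2.5, 3.9]])
    boxes = np.array([[10.2, 2.8]])
    print(filter_radar_points(radar, boxes))   # keeps the two returns near the predicted box
```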

How to Tidy Up a Table: Fusing Visual and Semantic Commonsense Reasoning for Robotic Tasks with Vague Objectives

  • paper_url: http://arxiv.org/abs/2307.11319
  • repo_url: None
  • paper_authors: Yiqing Xu, David Hsu
  • for: This paper aims to solve the problem of tidying a messy table using a simple approach that combines semantic and visual tidiness.
  • methods: The proposed method uses a lightweight, image-based tidiness score function to ground the semantically tidy policy of Large Language Models (LLMs) to achieve visual tidiness. The tidiness score is trained using synthetic data gathered using random walks from a few tidy configurations.
  • results: The proposed pipeline can be applied to unseen objects and complex 3D arrangements, and the empirical results show that it is effective in achieving both semantic and visual tidiness.
    Abstract Vague objectives in many real-life scenarios pose long-standing challenges for robotics, as defining rules, rewards, or constraints for optimization is difficult. Tasks like tidying a messy table may appear simple for humans, but articulating the criteria for tidiness is complex due to the ambiguity and flexibility in commonsense reasoning. Recent advancement in Large Language Models (LLMs) offers us an opportunity to reason over these vague objectives: learned from extensive human data, LLMs capture meaningful common sense about human behavior. However, as LLMs are trained solely on language input, they may struggle with robotic tasks due to their limited capacity to account for perception and low-level controls. In this work, we propose a simple approach to solve the task of table tidying, an example of robotic tasks with vague objectives. Specifically, the task of tidying a table involves not just clustering objects by type and functionality for semantic tidiness but also considering spatial-visual relations of objects for a visually pleasing arrangement, termed as visual tidiness. We propose to learn a lightweight, image-based tidiness score function to ground the semantically tidy policy of LLMs to achieve visual tidiness. We innovatively train the tidiness score using synthetic data gathered using random walks from a few tidy configurations. Such trajectories naturally encode the order of tidiness, thereby eliminating the need for laborious and expensive human demonstrations. Our empirical results show that our pipeline can be applied to unseen objects and complex 3D arrangements.
    摘要 在实际应用中,目标模糊的问题常常对机器人学构成长期的挑战,因为难以为优化定义规则、奖励或约束。例如,人类可以轻松地整理一个混乱的桌子,但要明确说明「整齐」的标准却很复杂,因为常识推理具有模糊性和弹性。近期的大型语言模型(LLMs)让我们有机会对这些模糊目标进行推理:LLMs 从大量人类资料中学习,掌握了关于人类行为的有意义常识。然而,由于 LLMs 仅从语言输入学习,它们在机器人任务上可能表现吃力,因为难以顾及感知和低阶控制。在这个工作中,我们提出了一个简单的方法来解决桌子整理任务,这是目标模糊的机器人任务的一个例子。具体来说,桌子整理不仅需要根据类型和功能将物品分群以达成语意整齐,还需要考虑物品之间的空间视觉关系,以获得一个美观的安排,称为视觉整齐。我们提出学习一个轻量级、基于图像的整齐分数函数,将 LLMs 的语意整齐策略落地(ground),以实现视觉整齐。我们创新地使用从少数整齐配置出发的随机游走所收集的合成数据来训练整齐分数。这些轨迹自然地编码了整齐程度的顺序,因此无需费时昂贵的人工示范。我们的实验结果显示,我们的管线可以应用于未见过的物品和复杂的3D排列。
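
Generating ordered (un)tidiness training data from random walks is straightforward to sketch: start from a tidy layout and jitter object positions step by step, using the step index as a free ordinal label of untidiness. Everything below (2D positions, Gaussian jitter, step sizes) is an illustrative simplification of that idea, not the paper's data pipeline.

```python
import numpy as np

def random_walk_untidy(tidy_positions, num_steps=10, step_scale=0.05, rng=None):
    """Produce a trajectory from a tidy object layout toward messier ones.

    tidy_positions: (num_objects, 2) normalised xy positions of a tidy configuration.
    Returns a list of (positions, untidiness_level) pairs, where level 0 is tidy.
    The step index serves as a free ordinal label for training a tidiness score,
    so no human annotation is needed.
    """
    rng = rng or np.random.default_rng(0)
    traj = [(tidy_positions.copy(), 0)]
    pos = tidy_positions.copy()
    for step in range(1, num_steps + 1):
        pos = np.clip(pos + rng.normal(scale=step_scale, size=pos.shape), 0.0, 1.0)
        traj.append((pos.copy(), step))
    return traj

if __name__ == "__main__":
    tidy = np.array([[0.2, 0.5], [0.5, 0.5], [0.8, 0.5]])
    for positions, level in random_walk_untidy(tidy, num_steps=5):
        print(level, positions.round(2).tolist())
```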

XLDA: Linear Discriminant Analysis for Scaling Continual Learning to Extreme Classification at the Edge

  • paper_url: http://arxiv.org/abs/2307.11317
  • repo_url: None
  • paper_authors: Karan Shah, Vishruth Veerendranath, Anushka Hebbar, Raghavendra Bhat
  • for: 这篇论文旨在证明流式线性判别分析(LDA)在边缘端持续学习部署中的可行性,特别是在极端分类场景下。
  • methods: 这篇论文提出了一种名为 XLDA 的框架,证明 LDA 分类器(包括在极端分类场景下)等价于 FC 层,并针对计算资源受限的边缘端部署对训练和推断进行了优化。
  • results: 论文表明,XLDA 可以在极端分类场景下实现高效的训练和推断。例如,在 AliProducts 和 Google Landmarks V2 等极端数据集上,XLDA 通过批量训练实现了最高 42 倍的训练加速,并通过最近邻搜索实现了最高 5 倍的推断加速。
    Abstract Streaming Linear Discriminant Analysis (LDA) while proven in Class-incremental Learning deployments at the edge with limited classes (upto 1000), has not been proven for deployment in extreme classification scenarios. In this paper, we present: (a) XLDA, a framework for Class-IL in edge deployment where LDA classifier is proven to be equivalent to FC layer including in extreme classification scenarios, and (b) optimizations to enable XLDA-based training and inference for edge deployment where there is a constraint on available compute resources. We show up to 42x speed up using a batched training approach and up to 5x inference speedup with nearest neighbor search on extreme datasets like AliProducts (50k classes) and Google Landmarks V2 (81k classes)
    摘要 流式线性判别分析(LDA)已被证明适用于边缘端类别有限(最多约1000类)的类增量学习部署,但尚未在极端分类场景中得到验证。在这篇论文中,我们提出:(a) XLDA,一种面向边缘端类增量学习部署的框架,其中LDA分类器被证明与FC层等价,包括在极端分类场景下;(b) 针对计算资源受限的边缘端部署,对基于XLDA的训练和推理进行的优化。我们在 AliProducts(5万类)和 Google Landmarks V2(8.1万类)等极端数据集上,通过批量训练实现了最高42倍的训练加速,并通过最近邻搜索实现了最高5倍的推理加速。
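
For readers unfamiliar with streaming LDA, a minimal version keeps per-class running means plus one shared covariance and updates them one sample at a time; classification is then a linear discriminant, which is why it can be swapped in for an FC layer. The sketch below follows the usual SLDA recipe on (assumed) frozen backbone embeddings; it is not the XLDA code.

```python
import numpy as np

class StreamingLDA:
    """Minimal streaming LDA: per-class running means plus a shared,
    shrinkage-regularised covariance updated one sample at a time."""
    def __init__(self, dim, num_classes, shrink=1e-2):
        self.means = np.zeros((num_classes, dim))
        self.counts = np.zeros(num_classes)
        self.cov = np.eye(dim)
        self.n = 0
        self.shrink = shrink

    def fit_one(self, x, y):
        # Update the shared covariance with the deviation from the (old) class mean.
        delta = x - self.means[y]
        self.n += 1
        self.cov += (np.outer(delta, delta) - self.cov) / self.n
        # Update the running mean and count of class y.
        self.counts[y] += 1
        self.means[y] += delta / self.counts[y]

    def predict(self, X):
        lam = self.shrink
        prec = np.linalg.inv((1 - lam) * self.cov + lam * np.eye(self.cov.shape[0]))
        W = self.means @ prec                                # (classes, dim)
        b = -0.5 * np.sum(W * self.means, axis=1)
        return np.argmax(X @ W.T + b, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clf = StreamingLDA(dim=8, num_classes=3)
    centers = rng.normal(size=(3, 8)) * 3
    for _ in range(600):
        y = int(rng.integers(3))
        clf.fit_one(centers[y] + rng.normal(size=8), y)
    test_y = rng.integers(3, size=50)
    test_x = centers[test_y] + rng.normal(size=(50, 8))
    print((clf.predict(test_x) == test_y).mean())
```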

DPM-OT: A New Diffusion Probabilistic Model Based on Optimal Transport

  • paper_url: http://arxiv.org/abs/2307.11308
  • repo_url: https://github.com/cognaclee/dpm-ot
  • paper_authors: Zezeng Li, ShengHao Li, Zhanpeng Wang, Na Lei, Zhongxuan Luo, Xianfeng Gu
  • for: 这个研究的目的是设计一个快速的扩散概率模型(DPM)采样器,以实现高品质的生成建模。
  • methods: 这个方法将反向扩散视为不同阶段潜变量之间的最佳运输(OT)问题,通过计算数据潜变量与白噪声之间的半离散最佳运输映射,获得一条由OT映射表示的「直达通道」,从而在极少步数内产生高品质的样本。
  • results: 实验结果显示,DPM-OT 可以在约10次函数评估内产生高品质的样本,在速度和品质(FID与模式混合)上均优于现有方法,并且具有一定的理论保证(误差界)。
    Abstract Sampling from diffusion probabilistic models (DPMs) can be viewed as a piecewise distribution transformation, which generally requires hundreds or thousands of steps of the inverse diffusion trajectory to get a high-quality image. Recent progress in designing fast samplers for DPMs achieves a trade-off between sampling speed and sample quality by knowledge distillation or adjusting the variance schedule or the denoising equation. However, it can't be optimal in both aspects and often suffer from mode mixture in short steps. To tackle this problem, we innovatively regard inverse diffusion as an optimal transport (OT) problem between latents at different stages and propose the DPM-OT, a unified learning framework for fast DPMs with a direct expressway represented by OT map, which can generate high-quality samples within around 10 function evaluations. By calculating the semi-discrete optimal transport map between the data latents and the white noise, we obtain an expressway from the prior distribution to the data distribution, while significantly alleviating the problem of mode mixture. In addition, we give the error bound of the proposed method, which theoretically guarantees the stability of the algorithm. Extensive experiments validate the effectiveness and advantages of DPM-OT in terms of speed and quality (FID and mode mixture), thus representing an efficient solution for generative modeling. Source codes are available at https://github.com/cognaclee/DPM-OT
    摘要 扩散概率模型(DPMs)的采样可以视为一种分段的分布变换,通常需要数百或数千步的反向扩散轨迹才能获得高质量图像。近期在设计快速采样器方面的进展,通过知识蒸馏或调整方差调度、去噪方程,在采样速度和样本质量之间取得折衷。然而,这些方法往往无法同时在两方面达到最优,在步数很少时常常出现模式混合的问题。为解决这个问题,我们创新地将反向扩散视为不同阶段潜变量之间的最优运输(OT)问题,并提出了DPM-OT,一种带有由OT映射表示的直达通道的快速DPM统一学习框架,可以在约10次函数评估内生成高质量样本。通过计算数据潜变量与白噪声之间的半离散最优运输映射,我们获得了一条从先验分布到数据分布的直达通道,同时显著缓解了模式混合问题。此外,我们还给出了所提方法的误差界,从理论上保证了算法的稳定性。广泛的实验证明了DPM-OT在速度和质量(FID与模式混合)方面的有效性和优势,是一种高效的生成建模解决方案。相关代码可以在 https://github.com/cognaclee/DPM-OT 上获得。

Kernelized Offline Contextual Dueling Bandits

  • paper_url: http://arxiv.org/abs/2307.11288
  • repo_url: None
  • paper_authors: Viraj Mehta, Ojash Neopane, Vikramjeet Das, Sen Lin, Jeff Schneider, Willie Neiswanger
  • for: 这篇论文面向无法直接评估奖励函数、需要依赖基于偏好反馈的应用,例如基于人类反馈对大语言模型进行强化学习。
  • methods: 论文利用智能体可以自行选择在哪些上下文获取人类反馈这一事实来提高效率,提出了离线上下文对决(dueling)赌博机设定,并给出了一种上置信界(UCB)风格的算法。
  • results: 论文给出了该算法的遗憾界(regret bound),并通过实验表明,该方法比使用均匀随机上下文的类似策略能更高效地识别好策略。
    Abstract Preference-based feedback is important for many applications where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback on large language models. For many of these applications, the cost of acquiring the human feedback can be substantial or even prohibitive. In this work, we take advantage of the fact that often the agent can choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and introduce the offline contextual dueling bandit setting. We give an upper-confidence-bound style algorithm for this setting and prove a regret bound. We also give empirical confirmation that this method outperforms a similar strategy that uses uniformly sampled contexts.
    摘要 基于偏好的反馈在许多无法直接评估奖励函数的应用中非常重要。一个显著的近期例子是利用人类反馈对大语言模型进行强化学习,其中获取人类反馈的成本可能很高甚至难以承受。在这项工作中,我们利用智能体通常可以自行选择在哪些上下文收集人类反馈这一事实,以最高效地确定一个好策略,并引入了离线上下文对决赌博机设定。我们给出了一种上置信界(UCB)风格的算法,并证明了一个遗憾界(regret bound)。我们还通过实验证明,该方法比使用均匀采样上下文的类似策略表现更好。

Eliminating Unintended Stable Fixpoints for Hybrid Reasoning Systems

  • paper_url: http://arxiv.org/abs/2307.11286
  • repo_url: None
  • paper_authors: Spencer Killen, Jia-Huai You
  • for: 这篇论文旨在提出一种类似 Approximation Fixpoint Theory(AFT)的方法,用于刻画非单调语义。
  • methods: 该方法在类似 AFT 的框架中,利用先前稳定修订迭代中计算的上界,更精确地刻画语义(传统 AFT 无法定义依赖这类信息的近似算子)。
  • results: 该方法可应用于 hybrid MKNF(minimal knowledge and negation as failure)知识库,并扩展了现有最先进的近似算子。
    Abstract A wide variety of nonmonotonic semantics can be expressed as approximators defined under AFT (Approximation Fixpoint Theory). Using traditional AFT theory, it is not possible to define approximators that rely on information computed in previous iterations of stable revision. However, this information is rich for semantics that incorporate classical negation into nonmonotonic reasoning. In this work, we introduce a methodology resembling AFT that can utilize priorly computed upper bounds to more precisely capture semantics. We demonstrate our framework's applicability to hybrid MKNF (minimal knowledge and negation as failure) knowledge bases by extending the state-of-the-art approximator.
    摘要 许多非单调语义都可以用AFT(近似不动点理论)下定义的近似算子来表达。传统AFT理论无法定义依赖于先前稳定修订迭代中所计算信息的近似算子,但这类信息对于将经典否定纳入非单调推理的语义非常有价值。在这项工作中,我们提出了一种类似AFT的方法,可以利用先前计算的上界来更精确地捕捉语义。我们将这一框架应用于 hybrid MKNF(最小知识与否定即失败)知识库,并扩展了现有最先进的近似算子。

Nature of Intelligence

  • paper_url: http://arxiv.org/abs/2307.11114
  • repo_url: https://github.com/nature-of-code/NOC-S17-2-Intelligence-Learning
  • paper_authors: Barco Jie You
  • for: 这篇论文旨在探讨人类智能的本质和人工智能(AI)如何实现人类水平的智能任务。
  • methods: 该论文使用了深度神经网络模型来学习数据表示和提高认知领域的状态。
  • results: 该论文提出了一种减少系统熵的数学函数过程的假设,并建立了语言、无意识和意识的数学模型,预测了神经科学和AI工程中的证据。此外,该论文还认为宇宙熵是保守的,智能可以通过物理或信息连接来减少熵,并且存在更高级别的智能。
    Abstract The human brain is the substrate for human intelligence. By simulating the human brain, artificial intelligence builds computational models that have learning capabilities and perform intelligent tasks approaching the human level. Deep neural networks consist of multiple computation layers to learn representations of data and improve the state-of-the-art in many recognition domains. However, the essence of intelligence commonly represented by both humans and AI is unknown. Here, we show that the nature of intelligence is a series of mathematically functional processes that minimize system entropy by establishing functional relationships between datasets over space and time. Humans and AI have achieved intelligence by implementing these entropy-reducing processes in a reinforced manner that consumes energy. With this hypothesis, we establish mathematical models of language, unconsciousness and consciousness, predicting the evidence to be found by neuroscience and achieved by AI engineering. Furthermore, a conclusion is made that the total entropy of the universe is conservative, and intelligence counters the spontaneous processes to decrease entropy by physically or informationally connecting datasets that originally exist in the universe but are separated across space and time. This essay should be a starting point for a deeper understanding of the universe and us as human beings and for achieving sophisticated AI models that are tantamount to human intelligence or even superior. Furthermore, this essay argues that more advanced intelligence than humans should exist if only it reduces entropy in a more efficient energy-consuming way.
    摘要 人类大脑是人类智能的载体。通过模拟人类大脑，人工智能建立了具有学习能力、可完成接近人类水平智能任务的计算模型。深度神经网络由多层计算层组成，用于学习数据表示，并在许多识别领域提升了最先进水平。然而，人类与 AI 所共同体现的智能的本质仍然未知。在这里，我们指出智能的本质是一系列数学函数过程：通过在空间和时间上建立数据集之间的函数关系来最小化系统熵。人类和 AI 都是以消耗能量的强化方式实施这些减熵过程来获得智能的。基于这一假设，我们建立了语言、无意识和意识的数学模型，预测了有待神经科学发现、可由 AI 工程实现的证据。进一步地，我们得出结论：宇宙的总熵是守恒的，智能通过在物理或信息层面上连接那些原本存在于宇宙中、却在空间和时间上彼此分离的数据集，来对抗自发过程、减少熵。这篇文章应当成为更深入理解宇宙与我们人类自身的起点，也是实现与人类智能相当甚至更强的复杂 AI 模型的起点。此外，文章认为，只要能以更高效的耗能方式减少熵，就应当存在比人类更高级的智能。

Joint one-sided synthetic unpaired image translation and segmentation for colorectal cancer prevention

  • paper_url: http://arxiv.org/abs/2307.11253
  • repo_url: None
  • paper_authors: Enric Moreu, Eric Arazo, Kevin McGuinness, Noel E. O’Connor
  • for: 提高医学图像分割的效果，解决隐私、标准化和标注缺失带来的数据获取难题
  • methods: 使用 3D 技术和生成对抗网络合成逼真的医学图像，并在同一个训练循环中联合训练分割模型与生成模型（示意代码见下文）
  • results: 在五个真实的结肠息肉分割数据集上取得了更好的效果；相比需要两阶段训练、内存占用更高的图像翻译方法更快、更有效，且只需一张真实图像和零个真实标注。同时发布了完全合成的数据集 Synth-Colon，包含 20000 张逼真的结肠图像及额外的深度和 3D 几何信息：https://enric1994.github.io/synth-colon
    Abstract Deep learning has shown excellent performance in analysing medical images. However, datasets are difficult to obtain due privacy issues, standardization problems, and lack of annotations. We address these problems by producing realistic synthetic images using a combination of 3D technologies and generative adversarial networks. We propose CUT-seg, a joint training where a segmentation model and a generative model are jointly trained to produce realistic images while learning to segment polyps. We take advantage of recent one-sided translation models because they use significantly less memory, allowing us to add a segmentation model in the training loop. CUT-seg performs better, is computationally less expensive, and requires less real images than other memory-intensive image translation approaches that require two stage training. Promising results are achieved on five real polyp segmentation datasets using only one real image and zero real annotations. As a part of this study we release Synth-Colon, an entirely synthetic dataset that includes 20000 realistic colon images and additional details about depth and 3D geometry: https://enric1994.github.io/synth-colon
    摘要 深度学习在医学图像分析方面表现出色，但由于隐私问题、标准化问题和标注缺失，数据集难以获取。我们通过结合 3D 技术与生成对抗网络生成逼真的合成图像来解决这些问题。我们提出 CUT-seg：一种联合训练方式，让分割模型和生成模型一起训练，在生成逼真图像的同时学习分割息肉。我们利用了最新的单向（one-sided）翻译模型，因为它们占用的内存明显更少，使我们能够把分割模型加入训练循环。与需要两阶段训练、内存开销大的图像翻译方法相比，CUT-seg 表现更好、计算开销更低、所需真实图像更少。仅使用一张真实图像和零个真实标注，我们在五个真实的息肉分割数据集上取得了可观的结果。作为这项研究的一部分，我们发布了完全合成的数据集 Synth-Colon，其中包含 20000 张逼真的结肠图像以及关于深度和 3D 几何的额外信息：https://enric1994.github.io/synth-colon。
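
The joint training idea above can be illustrated with a stripped-down loop in which a generator translates synthetic colon renders toward real-image appearance while a segmentation model is trained on the translated images using the synthetic masks. This is only a sketch under assumptions: the tiny networks, the plain GAN loss, and the omitted PatchNCE term are placeholders, not the released CUT-seg implementation.

```python
# Simplified sketch of CUT-seg-style joint training (illustrative, not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Stand-in for the generator / discriminator / segmentation backbones."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, out_ch, 3, padding=1))
    def forward(self, x):
        return self.net(x)

generator = TinyNet(3, 3)        # synthetic -> realistic translation (one-sided)
discriminator = TinyNet(3, 1)    # real vs. translated
segmenter = TinyNet(3, 2)        # background / polyp logits

opt = torch.optim.Adam(list(generator.parameters()) + list(segmenter.parameters()), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

synthetic = torch.rand(4, 3, 64, 64)          # rendered colon images
masks = torch.randint(0, 2, (4, 64, 64))      # synthetic ground-truth masks
real = torch.rand(4, 3, 64, 64)               # unannotated real colonoscopy frames

# --- discriminator step ---
fake = generator(synthetic).detach()
d_loss = F.binary_cross_entropy_with_logits(discriminator(real), torch.ones(4, 1, 64, 64)) + \
         F.binary_cross_entropy_with_logits(discriminator(fake), torch.zeros(4, 1, 64, 64))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# --- generator + segmenter step: adversarial realism + segmentation on translated images ---
fake = generator(synthetic)
adv = F.binary_cross_entropy_with_logits(discriminator(fake), torch.ones(4, 1, 64, 64))
seg = F.cross_entropy(segmenter(fake), masks)  # supervision comes from the synthetic masks
loss = adv + seg                               # CUT's PatchNCE term would be added here as well
opt.zero_grad(); loss.backward(); opt.step()
```

In the paper the translation branch is a one-sided (CUT-style) model, chosen precisely because its lower memory footprint leaves room for the segmenter in the same loop.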

On-Sensor Data Filtering using Neuromorphic Computing for High Energy Physics Experiments

  • paper_url: http://arxiv.org/abs/2307.11242
  • repo_url: None
  • paper_authors: Shruti R. Kulkarni, Aaron Young, Prasanna Date, Narasinga Rao Miniskar, Jeffrey S. Vetter, Farah Fahim, Benjamin Parpillon, Jennet Dickinson, Nhan Tran, Jieun Yoo, Corrinne Mills, Morris Swartz, Petar Maksimovic, Catherine D. Schuman, Alice Bean
  • for: 这项研究探讨了基于神经形态（neuromorphic）计算的脉冲神经网络（SNN）模型，用于高能物理实验中的传感器数据筛选。
  • methods: 我们提出了一种紧凑的神经形态模型，依据粒子的横向动量对传感器数据进行筛选，只保留有用信息，以降低传往下游电子设备的数据吞吐量（示意代码见下文）。
  • results: 结果显示，使用进化算法和优化后的超参数训练的 SNN 可以达到约 91% 的信号效率，且参数量约为深度神经网络的一半。
    Abstract This work describes the investigation of neuromorphic computing-based spiking neural network (SNN) models used to filter data from sensor electronics in high energy physics experiments conducted at the High Luminosity Large Hadron Collider. We present our approach for developing a compact neuromorphic model that filters out the sensor data based on the particle's transverse momentum with the goal of reducing the amount of data being sent to the downstream electronics. The incoming charge waveforms are converted to streams of binary-valued events, which are then processed by the SNN. We present our insights on the various system design choices - from data encoding to optimal hyperparameters of the training algorithm - for an accurate and compact SNN optimized for hardware deployment. Our results show that an SNN trained with an evolutionary algorithm and an optimized set of hyperparameters obtains a signal efficiency of about 91% with nearly half as many parameters as a deep neural network.
    摘要
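
As a rough illustration of the pipeline described in the abstract, the sketch below converts an analog charge waveform into a binary event stream and filters it with a leaky integrate-and-fire neuron. The thresholds, constants, and the keep/drop rule are invented for illustration and are not the collaboration's actual encoder or SNN.

```python
# Toy stand-in for the paper's encoder/SNN: threshold-encode charge samples into spikes,
# then apply a simple leaky integrate-and-fire neuron as an on-sensor filter.
import numpy as np

def encode_spikes(waveform: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binary event stream: emit a 1 whenever the charge sample crosses the threshold."""
    return (waveform > threshold).astype(np.int8)

def lif_filter(spikes: np.ndarray, leak: float = 0.9, fire_at: float = 2.0) -> np.ndarray:
    """Leaky integrate-and-fire: accumulate input spikes, leak over time, fire above a level."""
    potential, out = 0.0, np.zeros_like(spikes)
    for t, s in enumerate(spikes):
        potential = leak * potential + s
        if potential >= fire_at:
            out[t] = 1
            potential = 0.0          # reset after firing
    return out

waveform = np.abs(np.random.randn(100))        # fake charge samples from one pixel
spikes = encode_spikes(waveform)
keep = lif_filter(spikes).any()                # crude "send downstream or not" decision
print(f"{spikes.sum()} input events, keep event: {bool(keep)}")
```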

Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models

  • paper_url: http://arxiv.org/abs/2307.11224
  • repo_url: None
  • paper_authors: Michael Günther, Louis Milliken, Jonathan Geuter, Georgios Mastrapas, Bo Wang, Han Xiao
  • for: 本研究旨在开发高性能的句子嵌入模型,以捕捉文本中各种输入的 semantics。
  • methods: 本研究使用高质量的对比和 triplet dataset 进行模型训练,并强调数据清洁的重要性。
  • results: 研究结果表明,Jina Embeddings 模型在 dense retrieval 和 semantic textual similarity 等应用中具有高性能。
    Abstract Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating various textual inputs into numerical representations, thereby capturing the semantic essence of the text. The models excel in applications such as dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, gives in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Textual Embedding Benchmark (MTEB). To increase the model's awareness of negations, we constructed a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community.
    摘要
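
Since the entry above centers on training sentence embedding models from pairwise and triplet data, a generic triplet-margin objective over mean-pooled sentence vectors is sketched below. The pooling choice, margin value, and fake tensors are assumptions for illustration; they are not the actual Jina Embeddings training recipe.

```python
# Minimal sketch of triplet-style training on (anchor, positive, negative) sentences.
import torch
import torch.nn.functional as F

def embed(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings into one sentence vector, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).float()
    return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

def triplet_loss(anchor, positive, negative, margin: float = 0.2) -> torch.Tensor:
    """Push the anchor closer to the positive than to the negative by `margin` (cosine space)."""
    pos_sim = F.cosine_similarity(anchor, positive)
    neg_sim = F.cosine_similarity(anchor, negative)
    return F.relu(neg_sim - pos_sim + margin).mean()

# Fake token embeddings standing in for a transformer's last hidden state.
def fake_hidden():
    return torch.randn(8, 16, 384)              # (batch, seq_len, hidden)

mask = torch.ones(8, 16, dtype=torch.long)
a, p, n = embed(fake_hidden(), mask), embed(fake_hidden(), mask), embed(fake_hidden(), mask)
print(triplet_loss(a, p, n))
```

The negation-focused dataset mentioned in the abstract would plug into the same kind of objective by supplying negated statements as hard negatives.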

Towards Ontologically Grounded and Language-Agnostic Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2307.11206
  • repo_url: None
  • paper_authors: Walid S. Saba
  • for: 提高知识 graphs (KGs) 的更新和融合问题。
  • methods: 通过对抽象对象进行具体化（reification），并承认概念（concepts）与类型（types）之间的本体论区分，得到一种以本体论为基础、与具体语言无关的表示。
  • results: 缓解 KG 融合与持续更新中的困难。
    Abstract Knowledge graphs (KGs) have become the standard technology for the representation of factual information in applications such as recommendation engines, search, and question-answering systems. However, the continual updating of KGs, as well as the integration of KGs from different domains and KGs in different languages, remains to be a major challenge. What we suggest here is that by a reification of abstract objects and by acknowledging the ontological distinction between concepts and types, we arrive at an ontologically grounded and language-agnostic representation that can alleviate the difficulties in KG integration.
    摘要
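
To make the reification idea concrete, the toy structures below treat a statement as a first-class object built from language-agnostic concept identifiers. The identifiers and schema are made up for illustration; the paper argues for the general representation strategy, not this exact layout.

```python
# Toy illustration of reifying a statement into a first-class object whose parts are
# language-agnostic concepts. IDs and labels below are invented, not from the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    id: str                      # language-agnostic identifier
    labels: dict                 # surface forms per language

@dataclass(frozen=True)
class Statement:                 # the reified "abstract object"
    subject: Concept
    predicate: Concept
    obj: Concept

einstein = Concept("concept:einstein", {"en": "Albert Einstein", "de": "Albert Einstein"})
born_in = Concept("relation:born_in", {"en": "born in", "fr": "ne a"})
ulm = Concept("concept:ulm", {"en": "Ulm"})

fact = Statement(einstein, born_in, ulm)   # can itself be referenced, typed, or annotated
print(fact.predicate.labels["en"])
```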

Applying QNLP to sentiment analysis in finance

  • paper_url: http://arxiv.org/abs/2307.11788
  • repo_url: https://github.com/ichrist97/qnlp_finance
  • paper_authors: Jonas Stein, Ivo Christ, Nicolas Kraus, Maximilian Balthasar Mansky, Robert Müller, Claudia Linnhoff-Popien
  • for: 针对金融领域的情感分析问题
  • methods: 使用DisCoCat和Quantum-Enhanced Long Short-Term Memory(QLSTM)两种中心方法进行实验
  • results: QLSTM 的训练速度明显快于 DisCoCat，并且在现有软件实现下取得了接近经典方法的结果。
    Abstract As an application domain where the slightest qualitative improvements can yield immense value, finance is a promising candidate for early quantum advantage. Focusing on the rapidly advancing field of Quantum Natural Language Processing (QNLP), we explore the practical applicability of the two central approaches DisCoCat and Quantum-Enhanced Long Short-Term Memory (QLSTM) to the problem of sentiment analysis in finance. Utilizing a novel ChatGPT-based data generation approach, we conduct a case study with more than 1000 realistic sentences and find that QLSTMs can be trained substantially faster than DisCoCat while also achieving close to classical results for their available software implementations.
    摘要 作为一个细微的质量提升就能带来巨大价值的应用领域，金融是早期量子优势的有力候选。我们聚焦于快速发展的量子自然语言处理（QNLP）领域，探索 DisCoCat 和量子增强长短期记忆（QLSTM）这两种核心方法在金融情感分析问题上的实际适用性。借助一种基于 ChatGPT 的新型数据生成方法，我们对 1000 多个逼真句子进行了案例研究，发现 QLSTM 的训练速度明显快于 DisCoCat，并且在现有软件实现下取得了接近经典方法的结果。

Exploring reinforcement learning techniques for discrete and continuous control tasks in the MuJoCo environment

  • paper_url: http://arxiv.org/abs/2307.11166
  • repo_url: None
  • paper_authors: Vaddadi Sai Rahul, Debajyoti Chakraborty
  • for: 在连续控制环境中比较基于价值的方法与深度策略梯度方法，并在此基础上进行超参数调优。
  • methods: 使用 MuJoCo 物理模拟器，通过离散化研究 Q-learning 和 SARSA，并以它们为基线对比 DDPG（离散化与 Q-learning 更新的示意代码见下文）。
  • results: 在大量回合下，Q-learning 的得分高于 SARSA，而 DDPG 只需少量回合即可超过两者；此外，在较少时间和资源下微调超参数即可获得不错的平均回报。
    Abstract We leverage the fast physics simulator, MuJoCo to run tasks in a continuous control environment and reveal details like the observation space, action space, rewards, etc. for each task. We benchmark value-based methods for continuous control by comparing Q-learning and SARSA through a discretization approach, and using them as baselines, progressively moving into one of the state-of-the-art deep policy gradient method DDPG. Over a large number of episodes, Qlearning outscored SARSA, but DDPG outperformed both in a small number of episodes. Lastly, we also fine-tuned the model hyper-parameters expecting to squeeze more performance but using lesser time and resources. We anticipated that the new design for DDPG would vastly improve performance, yet after only a few episodes, we were able to achieve decent average rewards. We expect to improve the performance provided adequate time and computational resources.
    摘要 我们利用快速物理模拟器 MuJoCo 在连续控制环境中运行任务，并给出每个任务的观测空间、动作空间、奖励等细节。我们通过离散化的方式比较 Q-learning 与 SARSA，以此对基于价值的连续控制方法进行基准测试，并以它们为基线，逐步过渡到最先进的深度策略梯度方法 DDPG。在大量回合下，Q-learning 的得分高于 SARSA，但 DDPG 在少量回合内就超过了两者。最后，我们还对模型超参数进行了微调，期望以更少的时间和资源换取更高的性能。我们原本预期 DDPG 的新设计会大幅提升性能，而实际上仅经过少量回合就获得了不错的平均回报。我们预计在充足的时间和计算资源下性能还能进一步提升。
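
The value-based baseline described above relies on discretizing MuJoCo's continuous observations so that tabular updates apply. The sketch below shows that discretization together with a Q-learning update; bin counts, bounds, and hyper-parameters are illustrative rather than the paper's settings, and SARSA would differ only by bootstrapping from the action actually taken next.

```python
# Sketch of the discretization + tabular Q-learning baseline (illustrative settings).
import numpy as np

n_bins, n_actions = 10, 5
low, high = np.array([-1.0, -1.0]), np.array([1.0, 1.0])   # toy 2-D observation bounds
Q = np.zeros((n_bins,) * 2 + (n_actions,))

def discretize(obs: np.ndarray) -> tuple:
    """Map a continuous observation to integer bin indices."""
    ratios = (np.clip(obs, low, high) - low) / (high - low)
    return tuple(np.minimum((ratios * n_bins).astype(int), n_bins - 1))

def q_learning_step(s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: bootstrap from the greedy action in the next state."""
    td_target = r + gamma * Q[s_next].max()
    Q[s + (a,)] += alpha * (td_target - Q[s + (a,)])

# One fake transition (in practice these come from the MuJoCo environment).
s = discretize(np.array([0.2, -0.3]))
s_next = discretize(np.array([0.25, -0.28]))
q_learning_step(s, a=2, r=1.0, s_next=s_next)
print(Q[s])
```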

PAPR: Proximity Attention Point Rendering

  • paper_url: http://arxiv.org/abs/2307.11086
  • repo_url: None
  • paper_authors: Yanshu Zhang, Shichong Peng, Alireza Moazeni, Ke Li
  • for: 从零学习场景表面的精确且简洁的点云表示
  • methods: 提出 Proximity Attention Point Rendering（PAPR），由基于点的场景表示和可微渲染器组成；每个点由空间位置、前景得分和与视角无关的特征向量刻画，渲染器为每条光线选取相关的点，并利用其特征生成准确的颜色
  • results: PAPR 能够准确学习点的位置以表示正确的场景几何，即使初始化与目标几何相差很大也能收敛；仅用一组简洁的点即可捕捉精细的纹理细节，并展示了几何编辑、物体操控、纹理迁移和曝光控制四种实际应用（渲染思路的示意代码见下文）
    Abstract Learning accurate and parsimonious point cloud representations of scene surfaces from scratch remains a challenge in 3D representation learning. Existing point-based methods often suffer from the vanishing gradient problem or require a large number of points to accurately model scene geometry and texture. To address these limitations, we propose Proximity Attention Point Rendering (PAPR), a novel method that consists of a point-based scene representation and a differentiable renderer. Our scene representation uses a point cloud where each point is characterized by its spatial position, foreground score, and view-independent feature vector. The renderer selects the relevant points for each ray and produces accurate colours using their associated features. PAPR effectively learns point cloud positions to represent the correct scene geometry, even when the initialization drastically differs from the target geometry. Notably, our method captures fine texture details while using only a parsimonious set of points. We also demonstrate four practical applications of our method: geometry editing, object manipulation, texture transfer, and exposure control. More results and code are available on our project website at https://zvict.github.io/papr/.
    摘要 从零学习场景表面的精确且简洁的点云表示，仍然是 3D 表示学习中的一个难题。现有的基于点的方法往往受梯度消失问题困扰，或者需要大量的点才能准确建模场景的几何与纹理。为了解决这些限制，我们提出了邻近注意力点渲染（Proximity Attention Point Rendering，PAPR），它由基于点的场景表示和可微渲染器组成。我们的场景表示使用点云，每个点由其空间位置、前景得分和与视角无关的特征向量刻画。渲染器为每条光线选取相关的点，并利用这些点的特征生成准确的颜色。PAPR 能够有效地学习点的位置来表示正确的场景几何，即使初始化与目标几何相差很大。值得注意的是，我们的方法只需一组简洁的点就能捕捉精细的纹理细节。我们还展示了该方法的四个实际应用：几何编辑、物体操控、纹理迁移和曝光控制。更多结果和代码请见项目网站：https://zvict.github.io/papr/。
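
A toy version of the renderer described above is sketched here: each ray attends over its nearest points and decodes a colour from the attention-weighted features. The perpendicular-distance scoring, temperature, and linear colour head are simplifications assumed for illustration, not PAPR's actual architecture.

```python
# Toy proximity-attention point rendering for a batch of rays.
import torch

def ray_point_distance(origins, dirs, points):
    """Perpendicular distance from each point to each ray (origins/dirs: (R,3), points: (P,3))."""
    to_pts = points[None, :, :] - origins[:, None, :]            # (R, P, 3)
    t = (to_pts * dirs[:, None, :]).sum(-1, keepdim=True)        # projection length along the ray
    closest = origins[:, None, :] + t * dirs[:, None, :]
    return (points[None] - closest).norm(dim=-1)                 # (R, P)

def render(origins, dirs, points, feats, colour_head, k=8, tau=0.05):
    d = ray_point_distance(origins, dirs, points)                # (R, P)
    d_k, idx = d.topk(k, largest=False)                          # k nearest points per ray
    attn = torch.softmax(-d_k / tau, dim=-1)                     # closer points get more weight
    agg = (attn[..., None] * feats[idx]).sum(dim=1)              # (R, C) attended features
    return torch.sigmoid(colour_head(agg))                       # (R, 3) RGB

points = torch.randn(500, 3, requires_grad=True)                 # learnable point positions
feats = torch.randn(500, 32, requires_grad=True)                 # per-point feature vectors
colour_head = torch.nn.Linear(32, 3)
origins = torch.zeros(64, 3)
dirs = torch.nn.functional.normalize(torch.randn(64, 3), dim=-1)
print(render(origins, dirs, points, feats, colour_head).shape)   # torch.Size([64, 3])
```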

AlignDet: Aligning Pre-training and Fine-tuning in Object Detection

  • paper_url: http://arxiv.org/abs/2307.11077
  • repo_url: https://github.com/liming-ai/AlignDet
  • paper_authors: Ming Li, Jie Wu, Xionghui Wang, Chen Chen, Jie Qin, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan
  • for: 提高 object detection 算法的性能、通用能力和训练速度
  • methods: 将预训练过程解耦为两个阶段：图像域预训练和框域预训练，使检测器的所有模块都能以无监督方式进行预训练（示意代码见下文）
  • results: 在多种协议（检测算法、模型骨干、数据设置和训练计划）下进行了广泛的实验，取得了显著提升（如 FCOS 提升 5.3 mAP、RetinaNet 提升 2.1 mAP、Faster R-CNN 提升 3.3 mAP、DETR 提升 2.3 mAP），并且在更少的训练 epoch 内达到了这些提升。
    Abstract The paradigm of large-scale pre-training followed by downstream fine-tuning has been widely employed in various object detection algorithms. In this paper, we reveal discrepancies in data, model, and task between the pre-training and fine-tuning procedure in existing practices, which implicitly limit the detector's performance, generalization ability, and convergence speed. To this end, we propose AlignDet, a unified pre-training framework that can be adapted to various existing detectors to alleviate the discrepancies. AlignDet decouples the pre-training process into two stages, i.e., image-domain and box-domain pre-training. The image-domain pre-training optimizes the detection backbone to capture holistic visual abstraction, and box-domain pre-training learns instance-level semantics and task-aware concepts to initialize the parts out of the backbone. By incorporating the self-supervised pre-trained backbones, we can pre-train all modules for various detectors in an unsupervised paradigm. As depicted in Figure 1, extensive experiments demonstrate that AlignDet can achieve significant improvements across diverse protocols, such as detection algorithm, model backbone, data setting, and training schedule. For example, AlignDet improves FCOS by 5.3 mAP, RetinaNet by 2.1 mAP, Faster R-CNN by 3.3 mAP, and DETR by 2.3 mAP under fewer epochs.
    摘要 大规模预训练后再进行下游微调的范式已被广泛应用于各种目标检测算法中。在这篇论文中，我们揭示了现有做法中预训练与微调过程在数据、模型和任务三方面的不一致，这些不一致隐性地限制了检测器的性能、泛化能力和收敛速度。为此，我们提出了 AlignDet，一个可适配多种现有检测器的统一预训练框架，以缓解这些不一致。AlignDet 将预训练过程解耦为两个阶段：图像域预训练和框域预训练。图像域预训练优化检测骨干以捕捉整体视觉抽象，框域预训练学习实例级语义和任务相关概念，用于初始化骨干以外的模块。通过引入自监督预训练的骨干网络，我们可以在无监督范式下为各种检测器预训练所有模块。如图 1 所示，大量实验表明 AlignDet 能在多种协议（检测算法、模型骨干、数据设置和训练计划）下带来显著提升。例如，在更少的训练 epoch 下，AlignDet 将 FCOS 提升 5.3 mAP、RetinaNet 提升 2.1 mAP、Faster R-CNN 提升 3.3 mAP、DETR 提升 2.3 mAP。
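
The decoupled pre-training described above can be caricatured as: keep an image-domain, self-supervised backbone frozen, then pre-train the remaining detector modules on unsupervised box proposals. The sketch below uses a toy contrastive objective over one proposal seen in two augmented views; the proposal source, loss, and modules are placeholders rather than AlignDet's actual formulation.

```python
# High-level sketch of decoupled (image-domain, then box-domain) detector pre-training.
import torch
import torch.nn.functional as F

backbone = torch.nn.Conv2d(3, 64, 3, padding=1)     # stands in for an SSL-pretrained backbone
for p in backbone.parameters():
    p.requires_grad = False                          # stage 1 already done; keep it frozen

box_head = torch.nn.Linear(64, 128)                  # detector module to be box-domain pre-trained
opt = torch.optim.SGD(box_head.parameters(), lr=0.01)

def roi_feature(feat_map, box):
    """Crude RoI pooling: average backbone features inside an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return feat_map[:, y1:y2, x1:x2].mean(dim=(1, 2))

img_a = torch.rand(3, 64, 64)                        # two augmented views of one image
img_b = torch.rand(3, 64, 64)
proposal = (8, 8, 40, 40)                            # unsupervised proposal (e.g. selective search)

za = box_head(roi_feature(backbone(img_a.unsqueeze(0))[0], proposal))
zb = box_head(roi_feature(backbone(img_b.unsqueeze(0))[0], proposal))
loss = 1 - F.cosine_similarity(za, zb, dim=0)        # pull the same box's embeddings together
opt.zero_grad(); loss.backward(); opt.step()
```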

OBJECT 3DIT: Language-guided 3D-aware Image Editing

  • paper_url: http://arxiv.org/abs/2307.11073
  • repo_url: None
  • paper_authors: Oscar Michel, Anand Bhattad, Eli VanderBilt, Ranjay Krishna, Aniruddha Kembhavi, Tanmay Gupta
  • for: 这篇论文旨在提出语言引导的3D意识编辑技术,以便在图像编辑中尊重图像的3D几何结构。
  • methods: 作者采用了新的数据集OBJECT,并引入了3DIT单任务和多任务模型,以实现4种编辑任务。
  • results: 模型能够理解整个场景的 3D 组成，会考虑周围的物体、表面、光照条件、阴影以及物理上合理的物体配置。尽管几乎只在 OBJECT 的合成场景上训练，3DIT 的编辑能力仍能泛化到真实图像。
    Abstract Existing image editing tools, while powerful, typically disregard the underlying 3D geometry from which the image is projected. As a result, edits made using these tools may become detached from the geometry and lighting conditions that are at the foundation of the image formation process. In this work, we formulate the newt ask of language-guided 3D-aware editing, where objects in an image should be edited according to a language instruction in context of the underlying 3D scene. To promote progress towards this goal, we release OBJECT: a dataset consisting of 400K editing examples created from procedurally generated 3D scenes. Each example consists of an input image, editing instruction in language, and the edited image. We also introduce 3DIT : single and multi-task models for four editing tasks. Our models show impressive abilities to understand the 3D composition of entire scenes, factoring in surrounding objects, surfaces, lighting conditions, shadows, and physically-plausible object configurations. Surprisingly, training on only synthetic scenes from OBJECT, editing capabilities of 3DIT generalize to real-world images.
    摘要 现有的图像编辑工具虽然强大，但通常忽略了图像投影所依据的底层 3D 几何。因此，使用这些工具进行的编辑可能会脱离图像形成过程所依赖的几何与光照条件。在这项工作中，我们提出了语言引导的 3D 感知编辑这一新任务：应根据语言指令、并结合底层 3D 场景来编辑图像中的物体。为推动这一目标，我们发布了 OBJECT 数据集，其中包含由程序化生成的 3D 场景构建的 40 万个编辑示例，每个示例由输入图像、语言编辑指令和编辑后的图像组成。我们还提出了 3DIT：针对四种编辑任务的单任务和多任务模型。我们的模型展现出理解整个场景 3D 组成的出色能力，会考虑周围的物体、表面、光照条件、阴影以及物理上合理的物体配置。令人意外的是，仅在 OBJECT 的合成场景上训练，3DIT 的编辑能力就能泛化到真实世界图像。

Driving Policy Prediction based on Deep Learning Models

  • paper_url: http://arxiv.org/abs/2307.11058
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Fuxiao Liu
  • for: 本研究开发了一个终端系统,用于从普通镜头和云点扫描器获取的视觉特征和深度信息,并预测车辆驾驶策略(车速和推进角)。
  • methods: 本研究使用了一个终端系统, combinig 视觉特征和深度信息,并使用了一些预测模型来预测车辆驾驶策略。
  • results: 本研究的测试结果显示,使用了combined 视觉特征和深度信息可以将预测精度提高到50%-80%,并且在大多数情况下比使用视觉特征Only更好。
    Abstract In this project, we implemented an end-to-end system that takes in combined visual features of video frames from a normal camera and depth information from a cloud points scanner, and predicts driving policies (vehicle speed and steering angle). We verified the safety of our system by comparing the predicted results with standard behaviors by real-world experienced drivers. Our test results show that the predictions can be considered as accurate in at lease half of the testing cases (50% 80%, depending on the model), and using combined features improved the performance in most cases than using video frames only.
    摘要 在这个项目中，我们实现了一个端到端系统：输入为普通相机视频帧的视觉特征与点云扫描器提供的深度信息，输出为驾驶策略的预测（车速和转向角）。我们通过将预测结果与真实世界中有经验司机的标准行为进行比较，验证了系统的安全性。测试结果显示，在至少一半的测试案例中（50%-80%，取决于模型）预测可视为准确，并且在大多数情况下，使用组合特征比仅使用视频帧性能更好。
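
Since the system regresses vehicle speed and steering angle from fused camera and depth features, a minimal fusion head is sketched below. The feature dimensions and layers are arbitrary stand-ins; the real system is an end-to-end pipeline rather than this toy model.

```python
# Minimal sketch: concatenate camera and point-cloud depth features, regress speed and steering.
import torch
import torch.nn as nn

class DrivingPolicyNet(nn.Module):
    def __init__(self, img_dim=512, depth_dim=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + depth_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),                    # [speed, steering angle]
        )
    def forward(self, img_feat, depth_feat):
        return self.head(torch.cat([img_feat, depth_feat], dim=-1))

model = DrivingPolicyNet()
img_feat = torch.randn(8, 512)     # e.g. CNN features from camera frames
depth_feat = torch.randn(8, 128)   # e.g. features from the point-cloud branch
speed_and_steer = model(img_feat, depth_feat)
print(speed_and_steer.shape)       # torch.Size([8, 2])
```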

A LLM Assisted Exploitation of AI-Guardian

  • paper_url: http://arxiv.org/abs/2307.15008
  • repo_url: None
  • paper_authors: Nicholas Carlini
  • for: 研究是否可以使用大型自然语言模型(LLMs)来帮助研究人员在攻击机器学习领域进行研究。
  • methods: 该研究使用GPT-4语言模型来实现这一目标,并通过提出攻击AI-Guardian防御方案的攻击算法来评估其效果。
  • results: 研究发现，AI-Guardian 防御方案并没有提高鲁棒性，而 GPT-4 能够按照作者的指示快速而有效地实现全部攻击算法。
    Abstract Large language models (LLMs) are now highly capable at a diverse range of tasks. This paper studies whether or not GPT-4, one such LLM, is capable of assisting researchers in the field of adversarial machine learning. As a case study, we evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023, a top computer security conference. We completely break this defense: the proposed scheme does not increase robustness compared to an undefended baseline. We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance. This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done. We conclude by discussing (1) the warning signs present in the evaluation that suggested to us AI-Guardian would be broken, and (2) our experience with designing attacks and performing novel research using the most recent advances in language modeling.
    摘要 大型语言模型（LLM）如今能够出色地完成多种任务。这篇论文研究 GPT-4 这样的 LLM 是否能够协助研究人员开展对抗机器学习领域的研究。作为案例研究，我们评估了发表于顶级计算机安全会议 IEEE S&P 2023 的对抗样本防御方案 AI-Guardian 的鲁棒性。我们完全攻破了这一防御：与无防御基线相比，所提出的方案并未提升鲁棒性。我们没有编写任何攻击代码，而是按照我们的指示和引导，让 GPT-4 实现所有攻击算法。这个过程出乎意料地高效，语言模型有时能根据含混的指令生成代码，速度甚至超过本文作者本人。最后我们讨论了：(1) 评估过程中预示 AI-Guardian 会被攻破的警示信号；(2) 我们利用最新的语言建模进展来设计攻击并开展新研究的经验。

  • paper_url: http://arxiv.org/abs/2307.11049
  • repo_url: https://github.com/improbable-ai/human-guided-exploration
  • paper_authors: Marcel Torne, Max Balsells, Zihan Wang, Samedh Desai, Tao Chen, Pulkit Agrawal, Abhishek Gupta
  • for: solves sequential decision-making tasks requiring expansive exploration without careful design of reward functions or the use of novelty-seeking exploration bonuses.
  • methods: uses low-quality feedback from non-expert users that may be sporadic, asynchronous, and noisy to guide exploration, bifurcating human feedback and policy learning.
  • results: learns policies with no hand-crafted reward design or exploration bonuses, and can scale to learning directly on real-world robots using occasional, asynchronous feedback from human supervisors.
    Abstract Exploration and reward specification are fundamental and intertwined challenges for reinforcement learning. Solving sequential decision-making tasks requiring expansive exploration requires either careful design of reward functions or the use of novelty-seeking exploration bonuses. Human supervisors can provide effective guidance in the loop to direct the exploration process, but prior methods to leverage this guidance require constant synchronous high-quality human feedback, which is expensive and impractical to obtain. In this work, we present a technique called Human Guided Exploration (HuGE), which uses low-quality feedback from non-expert users that may be sporadic, asynchronous, and noisy. HuGE guides exploration for reinforcement learning not only in simulation but also in the real world, all without meticulous reward specification. The key concept involves bifurcating human feedback and policy learning: human feedback steers exploration, while self-supervised learning from the exploration data yields unbiased policies. This procedure can leverage noisy, asynchronous human feedback to learn policies with no hand-crafted reward design or exploration bonuses. HuGE is able to learn a variety of challenging multi-stage robotic navigation and manipulation tasks in simulation using crowdsourced feedback from non-expert users. Moreover, this paradigm can be scaled to learning directly on real-world robots, using occasional, asynchronous feedback from human supervisors.
    摘要 探索与奖励设定是强化学习中两个基本且相互交织的挑战。要解决需要大范围探索的序列决策任务，要么需要精心设计奖励函数，要么需要使用求新探索奖励。人类监督者可以在回路中提供有效的引导来指导探索过程，但以往利用这种引导的方法需要持续、同步且高质量的人类反馈，获取成本高且不切实际。在这项工作中，我们提出了一种名为 HuGE（Human Guided Exploration，人类引导探索）的技术，它能利用非专家用户零散、异步且带噪声的低质量反馈。HuGE 不仅能在仿真中、也能在真实世界中引导强化学习的探索，而且无需精细的奖励设定。其核心思想是将人类反馈与策略学习分离开来：人类反馈用于引导探索，而基于探索数据的自监督学习则产生无偏的策略。这一流程可以利用带噪声的异步人类反馈来学习策略，而无需手工设计奖励或探索奖励。借助非专家用户的众包反馈，HuGE 能在仿真中学习多个具有挑战性的多阶段机器人导航与操作任务。此外，这一范式还可以扩展到直接在真实世界机器人上学习，只需人类监督者偶尔给出异步反馈。
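
The bifurcation described in the abstract, human feedback steering exploration while the policy is learned self-supervised from the collected data, can be sketched as two separate updates. The Bradley-Terry-style ranker, the hindsight goal relabelling, and all shapes below are assumptions for illustration, not the HuGE implementation.

```python
# Sketch of the two decoupled updates: a human-feedback-trained ranker that only guides
# exploration, and a self-supervised goal-conditioned policy learned from rollouts.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, act_dim = 8, 2
ranker = nn.Linear(state_dim, 1)                      # scores states; trained from human feedback
policy = nn.Sequential(nn.Linear(state_dim * 2, 64), nn.ReLU(), nn.Linear(64, act_dim))
opt_r = torch.optim.Adam(ranker.parameters(), lr=1e-3)
opt_p = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update_ranker(s_preferred, s_other):
    """Bradley-Terry style update from one (possibly noisy) human comparison."""
    logits = ranker(s_preferred) - ranker(s_other)
    loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    opt_r.zero_grad(); loss.backward(); opt_r.step()

def update_policy(states, actions):
    """Self-supervised goal-conditioned cloning: relabel the final state as the goal."""
    goal = states[-1].expand_as(states)
    pred = policy(torch.cat([states, goal], dim=-1))
    loss = F.mse_loss(pred, actions)
    opt_p.zero_grad(); loss.backward(); opt_p.step()

# Fake data standing in for environment rollouts and sporadic human clicks.
update_ranker(torch.randn(1, state_dim), torch.randn(1, state_dim))
update_policy(torch.randn(20, state_dim), torch.randn(20, act_dim))
```

Because the ranker only decides where to explore next, noise in the human labels degrades exploration efficiency but does not bias the policy itself.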

A Definition of Continual Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.11046
  • repo_url: None
  • paper_authors: David Abel, André Barreto, Benjamin Van Roy, Doina Precup, Hado van Hasselt, Satinder Singh
  • for: 本 paper 开发了一个基础 для continual reinforcement learning.
  • methods: 本 paper 使用了一种基于 experience replay 的方法来实现 continual reinforcement learning.
  • results: 本 paper 实验结果表明,基于 experience replay 的方法可以有效地应对 continual reinforcement learning 中的问题。
    Abstract In this paper we develop a foundation for continual reinforcement learning.
    摘要 在这篇论文中，我们为持续强化学习（continual reinforcement learning）建立了一个基础。

On the Convergence of Bounded Agents

  • paper_url: http://arxiv.org/abs/2307.11044
  • repo_url: None
  • paper_authors: David Abel, André Barreto, Hado van Hasselt, Benjamin Van Roy, Doina Precup, Satinder Singh
  • for: 本研究探讨了agent convergences的定义和性质。
  • methods: 研究采用以有界（bounded）智能体为中心的强化学习框架，提出了两个互补的收敛定义：其一，当描述智能体未来行为所需的最少状态数不再减少时，有界智能体即已收敛；其二，当智能体的表现只有在其内部状态改变时才会改变时，有界智能体即已收敛。
  • results: 研究显示了这两种定义之间的关系和特性,并证明了它们在标准设置中的适用性。
    Abstract When has an agent converged? Standard models of the reinforcement learning problem give rise to a straightforward definition of convergence: An agent converges when its behavior or performance in each environment state stops changing. However, as we shift the focus of our learning problem from the environment's state to the agent's state, the concept of an agent's convergence becomes significantly less clear. In this paper, we propose two complementary accounts of agent convergence in a framing of the reinforcement learning problem that centers around bounded agents. The first view says that a bounded agent has converged when the minimal number of states needed to describe the agent's future behavior cannot decrease. The second view says that a bounded agent has converged just when the agent's performance only changes if the agent's internal state changes. We establish basic properties of these two definitions, show that they accommodate typical views of convergence in standard settings, and prove several facts about their nature and relationship. We take these perspectives, definitions, and analysis to bring clarity to a central idea of the field.
    摘要 The first view is that a bounded agent has converged when the minimum number of states needed to describe the agent's future behavior cannot decrease. The second view is that a bounded agent has converged when the agent's performance only changes if the agent's internal state changes. We establish basic properties of these two definitions, show that they align with standard views of convergence in typical settings, and prove several facts about their nature and relationship. Our goal is to bring clarity to a fundamental idea in the field.

Of Models and Tin Men – a behavioural economics study of principal-agent problems in AI alignment using large-language models

  • paper_url: http://arxiv.org/abs/2307.11137
  • repo_url: https://github.com/phelps-sg/llm-cooperation
  • paper_authors: Steve Phelps, Rebecca Ranson
  • for: The paper focuses on the issue of AI safety, specifically the principal-agent problem that arises when there is a mismatch between the utility of an artificial agent and its principal.
  • methods: The paper uses empirical methods to investigate the behavior of GPT models in principal-agent conflicts, and examines how the models respond to changes in information asymmetry.
  • results: The paper finds that both GPT-3.5 and GPT-4 models exhibit clear evidence of principal-agent conflict, and that the earlier GPT-3.5 model shows more nuanced behavior in response to changes in information asymmetry, while the later GPT-4 model is more rigid in adhering to its prior alignment.
    Abstract AI Alignment is often presented as an interaction between a single designer and an artificial agent in which the designer attempts to ensure the agent's behavior is consistent with its purpose, and risks arise solely because of conflicts caused by inadvertent misalignment between the utility function intended by the designer and the resulting internal utility function of the agent. With the advent of agents instantiated with large-language models (LLMs), which are typically pre-trained, we argue this does not capture the essential aspects of AI safety because in the real world there is not a one-to-one correspondence between designer and agent, and the many agents, both artificial and human, have heterogeneous values. Therefore, there is an economic aspect to AI safety and the principal-agent problem is likely to arise. In a principal-agent problem conflict arises because of information asymmetry together with inherent misalignment between the utility of the agent and its principal, and this inherent misalignment cannot be overcome by coercing the agent into adopting a desired utility function through training. We argue the assumptions underlying principal-agent problems are crucial to capturing the essence of safety problems involving pre-trained AI models in real-world situations. Taking an empirical approach to AI safety, we investigate how GPT models respond in principal-agent conflicts. We find that agents based on both GPT-3.5 and GPT-4 override their principal's objectives in a simple online shopping task, showing clear evidence of principal-agent conflict. Surprisingly, the earlier GPT-3.5 model exhibits more nuanced behaviour in response to changes in information asymmetry, whereas the later GPT-4 model is more rigid in adhering to its prior alignment. Our results highlight the importance of incorporating principles from economics into the alignment process.
    摘要 AI 对齐通常被描述为单个设计者与一个人工代理之间的交互：设计者试图确保代理的行为与其目的一致，而风险仅来自设计者意图的效用函数与代理最终形成的内部效用函数之间无意的不一致。随着以大语言模型（LLM）实例化、通常经过预训练的代理的出现，我们认为这种描述无法刻画 AI 安全的本质，因为在现实世界中设计者与代理之间并非一一对应，而且众多的人工与人类代理拥有各不相同的价值观。因此，AI 安全具有经济学的一面，委托-代理问题很可能出现。在委托-代理问题中，冲突源于信息不对称以及代理与其委托人效用之间固有的不一致，而这种固有的不一致无法通过训练强迫代理采用期望的效用函数来消除。我们认为，委托-代理问题背后的假设对于刻画现实场景中预训练 AI 模型的安全问题至关重要。我们采用实证方法研究 GPT 模型在委托-代理冲突中的表现，发现基于 GPT-3.5 和 GPT-4 的代理在一个简单的在线购物任务中都会违背其委托人的目标，呈现出明显的委托-代理冲突。令人意外的是，较早的 GPT-3.5 模型对信息不对称的变化表现出更细腻的行为，而较新的 GPT-4 模型则更严格地坚持其先前的对齐。我们的结果突显了将经济学原理纳入对齐过程的重要性。

Cascade-DETR: Delving into High-Quality Universal Object Detection

  • paper_url: http://arxiv.org/abs/2307.11035
  • repo_url: https://github.com/syscv/cascade-detr
  • paper_authors: Mingqiao Ye, Lei Ke, Siyuan Li, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu
  • for: 提高各种领域中对象检测质量
  • methods: 提出 Cascade Attention 层，将检测解码器的注意力显式限制在上一层的框预测之内，以提升目标定位精度；并改进查询打分方式，预测查询的期望 IoU，从而获得校准更好的置信度（示意代码见下文）
  • results: 在包含 10 个不同领域数据集的 UDB10 基准上显著提升了基于 DETR 的检测器的性能，在严格的质量要求下提升更为明显
    Abstract Object localization in general environments is a fundamental part of vision systems. While dominating on the COCO benchmark, recent Transformer-based detection methods are not competitive in diverse domains. Moreover, these methods still struggle to very accurately estimate the object bounding boxes in complex environments. We introduce Cascade-DETR for high-quality universal object detection. We jointly tackle the generalization to diverse domains and localization accuracy by proposing the Cascade Attention layer, which explicitly integrates object-centric information into the detection decoder by limiting the attention to the previous box prediction. To further enhance accuracy, we also revisit the scoring of queries. Instead of relying on classification scores, we predict the expected IoU of the query, leading to substantially more well-calibrated confidences. Lastly, we introduce a universal object detection benchmark, UDB10, that contains 10 datasets from diverse domains. While also advancing the state-of-the-art on COCO, Cascade-DETR substantially improves DETR-based detectors on all datasets in UDB10, even by over 10 mAP in some cases. The improvements under stringent quality requirements are even more pronounced. Our code and models will be released at https://github.com/SysCV/cascade-detr.
    摘要 在一般环境中进行目标定位是视觉系统的基本组成部分。尽管近期基于 Transformer 的检测方法在 COCO 基准上占据主导地位，但在多样化领域中并不具备竞争力，而且在复杂环境中仍难以非常精确地估计目标边界框。我们提出 Cascade-DETR 以实现高质量的通用目标检测。我们通过提出级联注意力（Cascade Attention）层来同时解决跨领域泛化与定位精度问题：它将检测解码器的注意力显式限制在上一次的框预测之内，从而把以目标为中心的信息整合进解码器。为进一步提升精度，我们还重新审视了查询的打分方式：不再依赖分类得分，而是预测查询的期望 IoU，从而得到校准明显更好的置信度。最后，我们提出了通用目标检测基准 UDB10，其中包含来自不同领域的 10 个数据集。Cascade-DETR 不仅推进了 COCO 上的最先进水平，还在 UDB10 的所有数据集上大幅提升了基于 DETR 的检测器，部分情况下提升超过 10 mAP；在严格的质量要求下，提升更为显著。代码和模型将发布在 https://github.com/SysCV/cascade-detr。
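
The two ingredients highlighted above, box-restricted cross-attention and IoU-based query scoring, are illustrated in the toy sketch below. The single-head dot-product attention, the mask construction, and the IoU head are simplified stand-ins rather than Cascade-DETR's actual decoder.

```python
# Toy sketch: (1) restrict each query's cross-attention to its previous box prediction,
# (2) score queries with a predicted IoU instead of the classification score.
import torch

def box_attention_mask(boxes, h, w):
    """boxes: (Q, 4) normalized (x1, y1, x2, y2) -> (Q, H*W) mask, True = attention blocked."""
    ys = (torch.arange(h).float() + 0.5) / h
    xs = (torch.arange(w).float() + 0.5) / w
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    inside = (xx[None] >= boxes[:, 0, None, None]) & (xx[None] <= boxes[:, 2, None, None]) \
           & (yy[None] >= boxes[:, 1, None, None]) & (yy[None] <= boxes[:, 3, None, None])
    return ~inside.reshape(boxes.shape[0], h * w)

queries, feat = torch.randn(10, 256), torch.randn(32 * 32, 256)
xy1 = torch.rand(10, 2) * 0.5
xy2 = xy1 + 0.1 + torch.rand(10, 2) * 0.4
prev_boxes = torch.cat([xy1, xy2], dim=-1)                 # valid normalized boxes from last layer

attn_mask = box_attention_mask(prev_boxes, 32, 32)         # (10, 1024)
attn = torch.einsum("qd,pd->qp", queries, feat).masked_fill(attn_mask, float("-inf"))
out = torch.softmax(attn, dim=-1) @ feat                   # box-restricted cross-attention output

iou_head = torch.nn.Linear(256, 1)
expected_iou = torch.sigmoid(iou_head(out)).squeeze(-1)    # rank/score queries by predicted IoU
print(expected_iou.shape)                                  # torch.Size([10])
```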

“It Felt Like Having a Second Mind”: Investigating Human-AI Co-creativity in Prewriting with Large Language Models

  • paper_url: http://arxiv.org/abs/2307.10811
  • repo_url: None
  • paper_authors: Qian Wan, Siying Hu, Yu Zhang, Piaohong Wang, Bo Wen, Zhicong Lu
  • for: investigate human-LLM collaboration patterns and dynamics during prewriting
  • methods: qualitative study with 15 participants in two creative tasks
  • results: three-stage iterative Human-AI Co-creativity process with humans in a dominant role, mixed and shifting levels of initiative between humans and LLMs, and collaboration breakdowns
    Abstract Prewriting is the process of discovering and developing ideas before a first draft, which requires divergent thinking and often implies unstructured strategies such as diagramming, outlining, free-writing, etc. Although large language models (LLMs) have been demonstrated to be useful for a variety of tasks including creative writing, little is known about how users would collaborate with LLMs to support prewriting. The preferred collaborative role and initiative of LLMs during such a creativity process is also unclear. To investigate human-LLM collaboration patterns and dynamics during prewriting, we conducted a three-session qualitative study with 15 participants in two creative tasks: story writing and slogan writing. The findings indicated that during collaborative prewriting, there appears to be a three-stage iterative Human-AI Co-creativity process that includes Ideation, Illumination, and Implementation stages. This collaborative process champions the human in a dominant role, in addition to mixed and shifting levels of initiative that exist between humans and LLMs. This research also reports on collaboration breakdowns that occur during this process, user perceptions of using existing LLMs during Human-AI Co-creativity, and discusses design implications to support this co-creativity process.
    摘要 前期写作（prewriting）是在初稿之前发现并发展想法的过程，需要发散思维，通常采用图示、提纲、自由写作等非结构化策略。尽管大语言模型（LLM）已被证明可用于包括创意写作在内的多种任务，但人们对用户如何与 LLM 协作来支持前期写作仍知之甚少，LLM 在这一创造过程中应承担的协作角色与主动程度也尚不明确。为了研究前期写作中人与 LLM 的协作模式与动态，我们开展了一项包含三次会话的定性研究，15 名参与者完成了故事写作和标语写作两项创意任务。研究结果表明，在协作式前期写作中存在一个三阶段迭代的人机共创过程，包括构思（Ideation）、启发（Illumination）和实施（Implementation）三个阶段。这一协作过程以人类为主导，同时人与 LLM 之间的主动程度是混合且不断变化的。本研究还报告了该过程中出现的协作中断、用户对在人机共创中使用现有 LLM 的感受，并讨论了支持这种共创过程的设计启示。

NeoSySPArtaN: A Neuro-Symbolic Spin Prediction Architecture for higher-order multipole waveforms from eccentric Binary Black Hole mergers using Numerical Relativity

  • paper_url: http://arxiv.org/abs/2307.11003
  • repo_url: None
  • paper_authors: Amrutaa Vibho, Ali Al Bataineh
  • for: 这篇论文旨在准确预测双黑洞与中子星并合事件中的自旋大小（spin magnitude）。
  • methods: 论文提出了一种结合神经网络与符号回归的神经符号架构（Neuro-Symbolic Architecture，NSA），用于准确预测并合事件中的自旋大小。
  • results: 实验结果表明，所提出的 NSA 模型能够准确预测并合事件中的自旋大小，RMSE 和 MSE 分别为 0.05 和 0.03。
    Abstract The prediction of spin magnitudes in binary black hole and neutron star mergers is crucial for understanding the astrophysical processes and gravitational wave (GW) signals emitted during these cataclysmic events. In this paper, we present a novel Neuro-Symbolic Architecture (NSA) that combines the power of neural networks and symbolic regression to accurately predict spin magnitudes of black hole and neutron star mergers. Our approach utilizes GW waveform data obtained from numerical relativity simulations in the SXS Waveform catalog. By combining these two approaches, we leverage the strengths of both paradigms, enabling a comprehensive and accurate prediction of spin magnitudes. Our experiments demonstrate that the proposed architecture achieves an impressive root-mean-squared-error (RMSE) of 0.05 and mean-squared-error (MSE) of 0.03 for the NSA model and an RMSE of 0.12 for the symbolic regression model alone. We train this model to handle higher-order multipole waveforms, with a specific focus on eccentric candidates, which are known to exhibit unique characteristics. Our results provide a robust and interpretable framework for predicting spin magnitudes in mergers. This has implications for understanding the astrophysical properties of black holes and deciphering the physics underlying the GW signals.
    摘要 预测双黑洞与中子星并合事件中的自旋大小，对于理解这些灾变事件中的天体物理过程及其发出的引力波（GW）信号至关重要。在本文中，我们提出了一种新的神经符号架构（NSA），结合神经网络与符号回归的优势，来准确预测黑洞与中子星并合的自旋大小。我们的方法使用来自 SXS 波形目录中数值相对论模拟的引力波波形数据。通过结合这两类方法，我们发挥了两种范式各自的长处，实现了全面而准确的自旋大小预测。实验表明，所提出的架构取得了 0.05 的均方根误差（RMSE）和 0.03 的均方误差（MSE），而单独使用符号回归模型的 RMSE 为 0.12。我们训练该模型以处理更高阶的多极波形，并特别关注以独特性质著称的偏心（eccentric）候选体。我们的结果为预测并合事件中的自旋大小提供了一个稳健且可解释的框架，有助于理解黑洞的天体物理性质并解读引力波信号背后的物理。

LLM Cognitive Judgements Differ From Human

  • paper_url: http://arxiv.org/abs/2307.11787
  • repo_url: https://github.com/sotlampr/llm-cognitive-judgements
  • paper_authors: Sotiris Lamprinidis
  • for: 研究大语言模型(LLMs)的认知能力
  • methods: 使用 GPT-3 和 ChatGPT 模型完成认知科学文献中一项数据有限的归纳推理任务
  • results: GPT-3和ChatGPT的认知判断不类似于人类
    Abstract Large Language Models (LLMs) have lately been on the spotlight of researchers, businesses, and consumers alike. While the linguistic capabilities of such models have been studied extensively, there is growing interest in investigating them as cognitive subjects. In the present work I examine GPT-3 and ChatGPT capabilities on an limited-data inductive reasoning task from the cognitive science literature. The results suggest that these models' cognitive judgements are not human-like.
    摘要 大语言模型（LLM）近来受到研究者、企业和消费者的广泛关注。尽管这些模型的语言能力已得到大量研究，但将它们作为认知主体来考察的兴趣也在不断增长。在本工作中，我研究了 GPT-3 和 ChatGPT 在认知科学文献中一项数据有限的归纳推理任务上的能力。结果表明，这些模型的认知判断并不类似于人类。

Dense Sample Deep Learning

  • paper_url: http://arxiv.org/abs/2307.10991
  • repo_url: https://github.com/lizhaoliu-Lec/DAS
  • paper_authors: Stephen Josè Hanson, Vivek Yadav, Catherine Hanson
  • for: 本研究旨在探究深度学习(DL)网络在许多应用领域中的成功原理,包括语言翻译、蛋白质折叠、自动驾驶等,以及最近受到关注的人工智能语言模型(CHATbot)。
  • methods: 本研究使用了一个大型（124 万参数；VGG）DL 网络，并在一个新的高密度样本任务上进行研究（5 个不同的符号，每个符号至少 500 个样本），以便更细致地跟踪类别结构的出现与特征的构建；研究者使用多种可视化方法跟踪 DL 的学习动态与特征检测器耦合结构的发展。
  • results: 研究结果表明，DL 网络在学习过程中会逐步构建复杂的特征结构，这一结构可以通过图形化方法加以可视化。基于这些结果，研究者提出了一种新的复杂特征构建理论。
    Abstract Deep Learning (DL) , a variant of the neural network algorithms originally proposed in the 1980s, has made surprising progress in Artificial Intelligence (AI), ranging from language translation, protein folding, autonomous cars, and more recently human-like language models (CHATbots), all that seemed intractable until very recently. Despite the growing use of Deep Learning (DL) networks, little is actually understood about the learning mechanisms and representations that makes these networks effective across such a diverse range of applications. Part of the answer must be the huge scale of the architecture and of course the large scale of the data, since not much has changed since 1987. But the nature of deep learned representations remain largely unknown. Unfortunately training sets with millions or billions of tokens have unknown combinatorics and Networks with millions or billions of hidden units cannot easily be visualized and their mechanisms cannot be easily revealed. In this paper, we explore these questions with a large (1.24M weights; VGG) DL in a novel high density sample task (5 unique tokens with at minimum 500 exemplars per token) which allows us to more carefully follow the emergence of category structure and feature construction. We use various visualization methods for following the emergence of the classification and the development of the coupling of feature detectors and structures that provide a type of graphical bootstrapping, From these results we harvest some basic observations of the learning dynamics of DL and propose a new theory of complex feature construction based on our results.
    摘要

Characterising Decision Theories with Mechanised Causal Graphs

  • paper_url: http://arxiv.org/abs/2307.10987
  • repo_url: None
  • paper_authors: Matt MacDermott, Tom Everitt, Francesco Belardinelli
  • for: 本研究旨在描述和区分不同决策理论的机器化 causal 模型,并生成一种决策理论分类表。
  • methods: 本研究使用机器化 causal 模型来描述和分析不同决策理论的特征和区别。
  • results: 本研究通过使用机器化 causal 模型,生成了一种决策理论分类表,并提供了一种方法来区分不同决策理论的重要特征。
    Abstract How should my own decisions affect my beliefs about the outcomes I expect to achieve? If taking a certain action makes me view myself as a certain type of person, it might affect how I think others view me, and how I view others who are similar to me. This can influence my expected utility calculations and change which action I perceive to be best. Whether and how it should is subject to debate, with contenders for how to think about it including evidential decision theory, causal decision theory, and functional decision theory. In this paper, we show that mechanised causal models can be used to characterise and differentiate the most important decision theories, and generate a taxonomy of different decision theories.
    摘要 我自己的决策应如何影响我对预期结果的信念？如果采取某个行动会让我把自己视为某一类人，这可能会影响我认为他人如何看待我，以及我如何看待与我相似的他人。这会影响我的期望效用计算，并改变我认为最优的行动。它是否应该、以及应如何产生这种影响仍存在争论，候选的思考方式包括证据决策论、因果决策论和功能决策论。在本文中，我们展示了机械化因果模型可以用来刻画并区分最重要的几种决策理论，并生成不同决策理论的一个分类体系。

Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image

  • paper_url: http://arxiv.org/abs/2307.10984
  • repo_url: https://github.com/yvanyin/metric3d
  • paper_authors: Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, Chunhua Shen
  • for: 本文旨在解决从单张图像重建 3D 场景（单目度量深度估计）的问题。
  • methods: 本文通过大规模数据训练，并解决不同相机模型带来的度量（metric）歧义来实现零样本单目度量深度估计（规范相机空间变换的焦距缩放示意见下文）。
  • results: 该方法在 7 个零样本（zero-shot）基准上达到 SOTA 性能，并在第 2 届 Monocular Depth Estimation Challenge 中获得冠军；它能从随意收集的网络图像中准确恢复度量 3D 结构，并可缓解单目 SLAM 的尺度漂移问题。
    Abstract Reconstructing accurate 3D scenes from images is a long-standing vision task. Due to the ill-posedness of the single-image reconstruction problem, most well-established methods are built upon multi-view geometry. State-of-the-art (SOTA) monocular metric depth estimation methods can only handle a single camera model and are unable to perform mixed-data training due to the metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. In this work, we show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models. Equipped with our module, monocular models can be stably trained with over 8 million images with thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Experiments demonstrate SOTA performance of our method on 7 zero-shot benchmarks. Notably, our method won the championship in the 2nd Monocular Depth Estimation Challenge. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular-SLAM (Fig. 1), leading to high-quality metric scale dense mapping. The code is available at https://github.com/YvanYin/Metric3D.
    摘要 Traditional 3D scene reconstruction from images is a long-standing vision task. Due to the ill-posedness of the single-image reconstruction problem, most well-established methods are built upon multi-view geometry. State-of-the-art (SOTA) monocular metric depth estimation methods can only handle a single camera model and are unable to perform mixed-data training due to the metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. In this work, we show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models. Equipped with our module, monocular models can be stably trained with over 8 million images with thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Experiments demonstrate SOTA performance of our method on 7 zero-shot benchmarks. Notably, our method won the championship in the 2nd Monocular Depth Estimation Challenge. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular-SLAM (Fig. 1), leading to high-quality metric scale dense mapping. The code is available at https://github.com/YvanYin/Metric3D.
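
The canonical camera space transformation can be illustrated by the focal-length rescaling below: depth labels are mapped into a canonical camera for training and predictions are mapped back for the real camera at test time. The canonical focal value and the label-scaling variant shown here are illustrative; Metric3D's module differs in detail.

```python
# Sketch of the canonical-camera idea: metric depth is ambiguous across cameras because the
# same image can come from different focal lengths, so labels are rescaled into a canonical
# camera before training and predictions are mapped back at inference time.
import numpy as np

CANONICAL_FOCAL = 1000.0   # arbitrary canonical focal length in pixels

def depth_to_canonical(depth_m: np.ndarray, focal_px: float) -> np.ndarray:
    """Rescale a metric depth map as if the image had been taken by the canonical camera."""
    return depth_m * (CANONICAL_FOCAL / focal_px)

def depth_from_canonical(pred_canonical: np.ndarray, focal_px: float) -> np.ndarray:
    """Map a prediction made in canonical space back to the real camera's metric depth."""
    return pred_canonical * (focal_px / CANONICAL_FOCAL)

depth = np.full((4, 4), 2.0)                       # 2 m everywhere, captured with f = 500 px
canon = depth_to_canonical(depth, focal_px=500.0)  # becomes 4.0 in canonical space
print(depth_from_canonical(canon, focal_px=500.0)[0, 0])   # back to 2.0 m
```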