cs.CV - 2023-12-03

Robust Computer Vision in an Ever-Changing World: A Survey of Techniques for Tackling Distribution Shifts

  • paper_url: http://arxiv.org/abs/2312.01540
  • repo_url: None
  • paper_authors: Eashan Adhikarla, Kai Zhang, Jun Yu, Lichao Sun, John Nicholson, Brian D. Davison
  • for: This paper aims to address the challenging problem of distribution shift in computer vision models, which can lead to a gap between theoretical assumptions and real-world performance.
  • methods: The paper explores various data-centric techniques to address distribution shift, including data augmentation strategies and training mechanisms such as transfer learning and zero-shot learning.
  • results: The paper provides an in-depth overview of distribution shifts, their distinctions, and techniques to address them, with a focus on the robustness of machine learning models for computer vision applications.
    Abstract AI applications are becoming increasingly visible to the general public. There is a notable gap between the theoretical assumptions researchers make about computer vision models and the reality those models face when deployed in the real world. One of the critical reasons for this gap is a challenging problem known as distribution shift. Distribution shifts tend to vary with the complexity of the data, dataset size, and application type. In our paper, we discuss the identification of such a prominent gap, exploring the concept of distribution shift and its critical significance. We provide an in-depth overview of various types of distribution shifts, elucidate their distinctions, and explore techniques within the realm of the data-centric domain employed to address them. Distribution shifts can occur during every phase of the machine learning pipeline, from data collection through model training to final model deployment. As a result, they raise concerns about the overall robustness of machine learning techniques for computer vision applications that are deployed publicly for consumers. The data-centric umbrella covers deep learning models, each tailored to specific data types and tasks; architectural pipelines, highlighting how variations in data preprocessing and feature extraction can impact robustness; data augmentation strategies (e.g., geometric, synthetic, and learning-based), demonstrating their role in enhancing model generalization; and training mechanisms (e.g., transfer learning, zero-shot learning). Each of these components forms an integral part of the neural networks we analyze, contributing uniquely to strengthening model robustness against distribution shifts. We compare and contrast numerous AI models that are built for mitigating shifts in hidden stratification and spurious correlations, ...

CalliPaint: Chinese Calligraphy Inpainting with Diffusion Model

  • paper_url: http://arxiv.org/abs/2312.01536
  • repo_url: None
  • paper_authors: Qisheng Liao, Zhinuo Wang, Muhammad Abdul-Mageed, Gus Xia
  • for: This paper proposes a computer-vision-based Chinese calligraphy generation model for use in the art and education fields.
  • methods: The work combines recent advances in Chinese calligraphy generation and image inpainting to build a new model, CalliPaint, that can produce convincing Chinese calligraphy.
  • results: Experiments show that CalliPaint generates high-quality Chinese calligraphy and can be personalized to user requirements.
    Abstract Chinese calligraphy can be viewed as a unique form of visual art. Recent advancements in computer vision hold significant potential for the future development of generative models in the realm of Chinese calligraphy. Nevertheless, methods of Chinese calligraphy inpainting, which can be effectively used in the art and education fields, remain relatively unexplored. In this paper, we introduce a new model that harnesses recent advancements in both Chinese calligraphy generation and image inpainting. We demonstrate that our proposed model CalliPaint can produce convincing Chinese calligraphy.

SANeRF-HQ: Segment Anything for NeRF in High Quality

  • paper_url: http://arxiv.org/abs/2312.01531
  • repo_url: None
  • paper_authors: Yichen Liu, Benran Hu, Chi-Keung Tang, Yu-Wing Tai
  • for: The goal is to achieve high-quality 3D segmentation of any object in a given scene.
  • methods: The method combines SAM and NeRF: SAM performs open-world object segmentation guided by user-supplied prompts, while NeRF aggregates information from different viewpoints.
  • results: The method shows a significant quality improvement on multiple NeRF datasets, provides higher flexibility for object localization, and enables more consistent object segmentation across multiple views.
    Abstract Recently, the Segment Anything Model (SAM) has showcased remarkable capabilities of zero-shot segmentation, while NeRF (Neural Radiance Fields) has gained popularity as a method for various 3D problems beyond novel view synthesis. Though there exist initial attempts to incorporate these two methods into 3D segmentation, they face the challenge of accurately and consistently segmenting objects in complex scenarios. In this paper, we introduce the Segment Anything for NeRF in High Quality (SANeRF-HQ) to achieve high quality 3D segmentation of any object in a given scene. SANeRF-HQ utilizes SAM for open-world object segmentation guided by user-supplied prompts, while leveraging NeRF to aggregate information from different viewpoints. To overcome the aforementioned challenges, we employ density field and RGB similarity to enhance the accuracy of segmentation boundary during the aggregation. Emphasizing segmentation accuracy, we evaluate our method quantitatively on multiple NeRF datasets where high-quality ground-truths are available or manually annotated. SANeRF-HQ shows a significant quality improvement over previous state-of-the-art methods in NeRF object segmentation, provides higher flexibility for object localization, and enables more consistent object segmentation across multiple views. Additional information can be found at https://lyclyc52.github.io/SANeRF-HQ/.
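    A minimal sketch of the aggregation idea (not the authors' code): per-view SAM mask votes at a set of 3D ray samples are fused under a density-based occupancy weight, and ambiguous samples near the decision boundary are resolved by RGB similarity. The function name, weighting scheme, and thresholds are illustrative assumptions.

```python
import numpy as np

def fuse_view_masks(densities, colors, view_masks, band=0.2):
    """Fuse per-view SAM votes for N 3D samples into one object probability,
    weighted by NeRF density and refined by RGB similarity at the boundary.

    densities:  (N,) volume densities at the samples
    colors:     (N, 3) NeRF-rendered RGB at the samples
    view_masks: (V, N) binary SAM votes, one row per viewpoint
    """
    votes = view_masks.mean(axis=0)            # cross-view agreement in [0, 1]
    occupancy = 1.0 - np.exp(-densities)       # density -> soft occupancy weight
    prob = votes * occupancy

    # Boundary refinement: ambiguous samples inherit the label of whichever
    # confident set (inside / outside) they resemble most in color.
    inside, outside = prob > 0.5 + band, prob < 0.5 - band
    ambiguous = ~(inside | outside)
    if inside.any() and outside.any():
        d_in = np.linalg.norm(colors[ambiguous] - colors[inside].mean(0), axis=1)
        d_out = np.linalg.norm(colors[ambiguous] - colors[outside].mean(0), axis=1)
        prob[ambiguous] = np.where(d_in < d_out, 1.0, 0.0)
    return prob

rng = np.random.default_rng(0)  # toy usage
p = fuse_view_masks(rng.random(100) * 5, rng.random((100, 3)),
                    rng.integers(0, 2, (3, 100)))
```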

G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training

  • paper_url: http://arxiv.org/abs/2312.01522
  • repo_url: None
  • paper_authors: Che Liu, Cheng Ouyang, Sibo Cheng, Anand Shah, Wenjia Bai, Rossella Arcucci
  • for: To improve fine-grained (dense) visual feature learning in medical vision-language pre-training (VLP).
  • methods: A new VLP framework, G2D, learns dense, semantically grounded image representations via a pseudo segmentation task running in parallel with the global vision-language alignment.
  • results: G2D performs strongly across 6 medical imaging tasks and 25 diseases, particularly in semantic segmentation, which requires fine-grained features; it surpasses peer models even when fine-tuned with just 1% of the training data.
    Abstract Recently, medical vision-language pre-training (VLP) has reached substantial progress to learn global visual representation from medical images and their paired radiology reports. However, medical imaging tasks in real world usually require finer granularity in visual features. These tasks include visual localization tasks (e.g., semantic segmentation, object detection) and visual grounding task. Yet, current medical VLP methods face challenges in learning these fine-grained features, as they primarily focus on brute-force alignment between image patches and individual text tokens for local visual feature learning, which is suboptimal for downstream dense prediction tasks. In this work, we propose a new VLP framework, named \textbf{G}lobal to \textbf{D}ense level representation learning (G2D) that achieves significantly improved granularity and more accurate grounding for the learned features, compared to existing medical VLP approaches. In particular, G2D learns dense and semantically-grounded image representations via a pseudo segmentation task parallel with the global vision-language alignment. Notably, generating pseudo segmentation targets does not incur extra trainable parameters: they are obtained on the fly during VLP with a parameter-free processor. G2D achieves superior performance across 6 medical imaging tasks and 25 diseases, particularly in semantic segmentation, which necessitates fine-grained, semantically-grounded image features. In this task, G2D surpasses peer models even when fine-tuned with just 1\% of the training data, compared to the 100\% used by these models. The code will be released upon acceptance.
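    The abstract leaves the parameter-free processor unspecified, so the following is only a plausible sketch of generating pseudo segmentation targets on the fly: threshold the patch-to-report cosine similarity at its per-image mean and supervise a dense head on the result. `pseudo_seg_targets` and the thresholding rule are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def pseudo_seg_targets(patch_feats, text_feat):
    """Parameter-free pseudo targets from patch/text similarity.

    patch_feats: (B, H*W, D) patch embeddings from the vision encoder
    text_feat:   (B, D) report-level text embedding
    Returns (B, H*W) binary maps: patches more similar to the report than
    the per-image mean are marked foreground.
    """
    sim = F.cosine_similarity(patch_feats, text_feat.unsqueeze(1), dim=-1)
    return (sim > sim.mean(dim=1, keepdim=True)).float()

# The pseudo-segmentation head is trained with an ordinary dense loss,
# in parallel with the global vision-language alignment:
B, HW, D = 2, 196, 512
patch_feats, text_feat = torch.randn(B, HW, D), torch.randn(B, D)
targets = pseudo_seg_targets(patch_feats, text_feat)   # no trainable parameters
pred = torch.randn(B, HW, requires_grad=True)          # stand-in head output
F.binary_cross_entropy_with_logits(pred, targets).backward()
```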

Tracing Hyperparameter Dependencies for Model Parsing via Learnable Graph Pooling Network

  • paper_url: http://arxiv.org/abs/2312.02224
  • repo_url: None
  • paper_authors: Xiao Guo, Vishal Asnani, Sijia Liu, Xiaoming Liu
  • for: This work studies model parsing: predicting the hyperparameters used by a generative model (GM), given a generated image as input.
  • methods: A novel model parsing method, the Learnable Graph Pooling Network (LGPN), casts model parsing as a graph node classification task, using graph nodes and edges to represent hyperparameters and the dependencies among them.
  • results: Experiments achieve state-of-the-art results in model parsing and its extended applications, demonstrating the effectiveness of the method. The source code is available.
    Abstract Model Parsing defines the research task of predicting hyperparameters of the generative model (GM), given a generated image as input. Since a diverse set of hyperparameters is jointly employed by the generative model, and dependencies often exist among them, it is crucial to learn these hyperparameter dependencies for the improved model parsing performance. To explore such important dependencies, we propose a novel model parsing method called Learnable Graph Pooling Network (LGPN). Specifically, we transform model parsing into a graph node classification task, using graph nodes and edges to represent hyperparameters and their dependencies, respectively. Furthermore, LGPN incorporates a learnable pooling-unpooling mechanism tailored to model parsing, which adaptively learns hyperparameter dependencies of GMs used to generate the input image. We also extend our proposed method to CNN-generated image detection and coordinate attacks detection. Empirically, we achieve state-of-the-art results in model parsing and its extended applications, showing the effectiveness of our method. Our source code is available.
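    To make the graph-node-classification framing concrete, here is a minimal sketch with the learnable pooling-unpooling mechanism omitted: each hyperparameter becomes a node initialized from a shared image feature, known dependencies form a row-normalized adjacency matrix, and two rounds of message passing precede per-node classification. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class HyperparamNodeClassifier(nn.Module):
    def __init__(self, n_nodes, img_dim, hid=128, n_classes=2):
        super().__init__()
        self.node_emb = nn.Embedding(n_nodes, hid)   # per-hyperparameter identity
        self.img_proj = nn.Linear(img_dim, hid)
        self.gnn = nn.ModuleList(nn.Linear(hid, hid) for _ in range(2))
        self.head = nn.Linear(hid, n_classes)

    def forward(self, img_feat, adj):
        # img_feat: (B, img_dim); adj: (n_nodes, n_nodes), row-normalized
        h = self.img_proj(img_feat).unsqueeze(1) + self.node_emb.weight
        for layer in self.gnn:                       # mean-neighbor aggregation
            h = torch.relu(layer(adj @ h))
        return self.head(h)                          # (B, n_nodes, n_classes)

adj = torch.eye(8) * 0.5 + 0.5 / 8                   # toy dependency graph
model = HyperparamNodeClassifier(n_nodes=8, img_dim=256)
logits = model(torch.randn(4, 256), adj)             # (4, 8, 2)
```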

CityGen: Infinite and Controllable 3D City Layout Generation

  • paper_url: http://arxiv.org/abs/2312.01508
  • repo_url: None
  • paper_authors: Jie Deng, Wenhao Chai, Jianshu Guo, Qixuan Huang, Wenhao Hu, Jenq-Neng Hwang, Gaoang Wang
  • for: This work proposes a controllable and diverse 3D city layout generation method for fields such as smart cities, urban planning, and digital simulation.
  • methods: An outpainting pipeline extends a local layout to an infinite city layout, and a multi-scale diffusion model generates diverse, controllable local semantic layout patches.
  • results: CityGen achieves state-of-the-art performance under the FID and KID metrics while generating infinite, controllable 3D city layouts, and shows promising applicability in smart cities, urban planning, and digital simulation.
    Abstract City layout generation has recently gained significant attention. The goal of this task is to automatically generate the layout of a city scene, including elements such as roads, buildings, vegetation, as well as other urban infrastructures. Previous methods using VAEs or GANs for 3D city layout generation offer limited diversity and constrained interactivity, only allowing users to selectively regenerate parts of the layout, which greatly limits customization. In this paper, we propose CityGen, a novel end-to-end framework for infinite, diverse and controllable 3D city layout generation. First, we propose an outpainting pipeline to extend the local layout to an infinite city layout. Then, we utilize a multi-scale diffusion model to generate diverse and controllable local semantic layout patches. The extensive experiments show that CityGen achieves state-of-the-art (SOTA) performance under FID and KID in generating an infinite and controllable 3D city layout. CityGen demonstrates promising applicability in fields like smart cities, urban planning, and digital simulation.
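    A hedged sketch of the outpainting loop that grows a local layout without bound: each step conditions on the overlapping columns of the existing layout and fills the rest of the window. `toy_outpaint` is a random stand-in for the paper's multi-scale diffusion model.

```python
import numpy as np

def toy_outpaint(window, known, rng):
    """Stand-in generator: fill unknown semantic labels at random."""
    out = window.copy()
    out[~known] = rng.integers(0, 5, size=int((~known).sum()))  # 5 label classes
    return out

def extend_layout(layout, n_steps, tile=64, overlap=16, rng=None):
    """Grow a semantic layout to the right, one conditioned tile at a time."""
    rng = rng or np.random.default_rng(0)
    for _ in range(n_steps):
        h = layout.shape[0]
        window = np.zeros((h, tile), dtype=layout.dtype)
        window[:, :overlap] = layout[:, -overlap:]   # conditioning region
        known = np.zeros((h, tile), dtype=bool)
        known[:, :overlap] = True
        filled = toy_outpaint(window, known, rng)
        layout = np.concatenate([layout, filled[:, overlap:]], axis=1)
    return layout

city = extend_layout(np.zeros((64, 64), dtype=np.int64), n_steps=4)
print(city.shape)  # (64, 256): the layout keeps growing with more steps
```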

GAPS: Geometry-Aware, Physics-Based, Self-Supervised Neural Garment Draping

  • paper_url: http://arxiv.org/abs/2312.01490
  • repo_url: https://github.com/Simonhfls/GAPS
  • paper_authors: Ruochen Chen, Liming Chen, Shaifali Parashar
  • for: This work proposes a neural garment draping model that is faster and more visually pleasing than existing methods while controlling garment inextensibility.
  • methods: The model introduces a collision-aware geometrical constraint into the existing physics-based formulation, imposing garment inextensibility wherever possible so that stretching only occurs where the garment must cover bigger body regions.
  • results: The model produces more realistic draping results, handles arbitrary body shapes without extra post-processing or per-body-type training, and, via a geometry-aware skinning method based on a body-garment closeness measure, also works for loose garments.
    Abstract Recent neural, physics-based modeling of garment deformations allows faster and visually aesthetic results as opposed to the existing methods. Material-specific parameters are used by the formulation to control the garment inextensibility. This delivers unrealistic results with physically implausible stretching. Oftentimes, the draped garment is pushed inside the body which is either corrected by an expensive post-processing, thus adding to further inconsistent stretching; or by deploying a separate training regime for each body type, restricting its scalability. Additionally, the flawed skinning process deployed by existing methods produces incorrect results on loose garments. In this paper, we introduce a geometrical constraint to the existing formulation that is collision-aware and imposes garment inextensibility wherever possible. Thus, we obtain realistic results where draped clothes stretch only while covering bigger body regions. Furthermore, we propose a geometry-aware garment skinning method by defining a body-garment closeness measure which works for all garment types, especially the loose ones.
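    One way to write the collision-aware inextensibility constraint as a training loss (an illustrative sketch, not the paper's exact formulation): penalize edge elongation relative to the rest pose, but only on edges where the collision-aware check does not force stretching.

```python
import torch

def inextensibility_loss(verts, edges, rest_len, stretch_allowed):
    """verts: (V, 3) draped garment vertices; edges: (E, 2) index pairs;
    rest_len: (E,) rest-pose edge lengths; stretch_allowed: (E,) bool,
    True where the garment must stretch to cover a bigger body region."""
    cur = (verts[edges[:, 0]] - verts[edges[:, 1]]).norm(dim=-1)
    strain = torch.relu(cur / rest_len - 1.0)     # only penalize elongation
    return (strain[~stretch_allowed] ** 2).mean()

verts = torch.randn(100, 3, requires_grad=True)
idx = torch.randperm(100)
edges = torch.stack([idx[:-1], idx[1:]], dim=1)   # a chain of distinct vertices
loss = inextensibility_loss(verts, edges, torch.ones(99),
                            torch.zeros(99, dtype=torch.bool))
loss.backward()
```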

Computer Vision for Increased Operative Efficiency via Identification of Instruments in the Neurosurgical Operating Room: A Proof-of-Concept Study

  • paper_url: http://arxiv.org/abs/2312.03001
  • repo_url: None
  • paper_authors: Tanner J. Zachem, Sully F. Chen, Vishal Venkatraman, David AW Sykes, Ravi Prakash, Samantha Spellicy, Alexander D Suarez, Weston Ross, Patrick J. Codd
  • for: The paper aims to develop a computer vision algorithm to identify surgical instruments in the neurosurgical operating room, with the goal of improving instrument tracking and management, reducing waste and unnecessary tool opening, and optimizing surgical tray packing.
  • methods: The authors collected 1660 images of 27 commonly used neurosurgical instruments and trained a U-Net Convolutional Neural Network with 5-fold cross validation.
  • results: The U-Net achieved an accuracy of 80-100% in identifying 25 classes of instruments, with 19/25 classes having an accuracy of over 90%. However, the model was less accurate at sub-classifying certain types of forceps.
    Abstract Objectives Computer vision (CV) is a field of artificial intelligence that enables machines to interpret and understand images and videos. CV has the potential to be of assistance in the operating room (OR) to track surgical instruments. We built a CV algorithm for identifying surgical instruments in the neurosurgical operating room as a potential solution for surgical instrument tracking and management to decrease surgical waste and opening of unnecessary tools. Methods We collected 1660 images of 27 commonly used neurosurgical instruments. Images were labeled using the VGG Image Annotator and split into 80% training and 20% testing sets in order to train a U-Net Convolutional Neural Network using 5-fold cross validation. Results Our U-Net achieved a tool identification accuracy of 80-100% when distinguishing 25 classes of instruments, with 19/25 classes having accuracy over 90%. The model performance was not adequate for sub-classifying Adson, Gerald, and Debakey forceps, which had accuracies of 60-80%. Conclusions We demonstrated the viability of using machine learning to accurately identify surgical instruments. Instrument identification could help optimize surgical tray packing, decrease tool usage and waste, decrease incidence of instrument misplacement events, and assist in timing of routine instrument maintenance. More training data will be needed to increase accuracy across all surgical instruments that would appear in a neurosurgical operating room. Such technology has the potential to be used as a method for proving what tools are truly needed in each type of operation, allowing surgeons across the world to do more with less.
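    The 5-fold cross-validation protocol is straightforward to reproduce; a minimal skeleton follows, with the U-Net, training loop, and metric left as stand-in callables since the abstract does not specify them.

```python
import numpy as np
from sklearn.model_selection import KFold

def run_cross_validation(images, masks, build_model, train_fn, eval_fn, k=5):
    """Train a fresh model per fold and average the per-fold accuracy."""
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = []
    for fold, (tr, va) in enumerate(kf.split(images)):
        model = build_model()                        # fresh U-Net per fold
        train_fn(model, images[tr], masks[tr])
        acc = eval_fn(model, images[va], masks[va])  # e.g., per-class accuracy
        print(f"fold {fold}: {acc:.3f}")
        scores.append(acc)
    return float(np.mean(scores))

# toy stand-ins so the protocol runs end to end
imgs, msks = np.zeros((50, 64, 64)), np.zeros((50, 64, 64))
mean_acc = run_cross_validation(imgs, msks,
                                build_model=lambda: None,
                                train_fn=lambda m, x, y: None,
                                eval_fn=lambda m, x, y: 0.9)
```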

InvertAvatar: Incremental GAN Inversion for Generalized Head Avatars

  • paper_url: http://arxiv.org/abs/2312.02222
  • repo_url: None
  • paper_authors: Xiaochen Zhao, Jingxiang Sun, Lizhen Wang, Yebin Liu
  • for: To improve the fidelity and efficiency of digital head avatar creation, addressing the shape distortion, expression inaccuracy, and identity flickering that affect methods relying on 2D or 3D generative models.
  • methods: A novel framework, Incremental 3D GAN Inversion, reconstructs avatars from multiple frames so that quality improves with frame count; it introduces an animatable 3D GAN prior with two modifications for enhanced expression controllability, and an innovative neural texture encoder that categorizes texture feature spaces based on UV parameterization.
  • results: The method outperforms previous techniques, achieving state-of-the-art performance on one-shot and few-shot avatar animation tasks.
    Abstract While high fidelity and efficiency are central to the creation of digital head avatars, recent methods relying on 2D or 3D generative models often experience limitations such as shape distortion, expression inaccuracy, and identity flickering. Additionally, existing one-shot inversion techniques fail to fully leverage multiple input images for detailed feature extraction. We propose a novel framework, \textbf{Incremental 3D GAN Inversion}, that enhances avatar reconstruction performance using an algorithm designed to increase the fidelity from multiple frames, resulting in improved reconstruction quality proportional to frame count. Our method introduces a unique animatable 3D GAN prior with two crucial modifications for enhanced expression controllability alongside an innovative neural texture encoder that categorizes texture feature spaces based on UV parameterization. Differentiating from traditional techniques, our architecture emphasizes pixel-aligned image-to-image translation, mitigating the need to learn correspondences between observation and canonical spaces. Furthermore, we incorporate ConvGRU-based recurrent networks for temporal data aggregation from multiple frames, boosting geometry and texture detail reconstruction. The proposed paradigm demonstrates state-of-the-art performance on one-shot and few-shot avatar animation tasks.
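    The temporal aggregation uses ConvGRU-based recurrent networks; a textbook ConvGRU cell (not necessarily the authors' exact variant) looks like this:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU gates computed with convolutions so the hidden state stays spatial."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)  # update/reset
        self.hn = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)      # candidate

    def forward(self, x, h):
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], 1))).chunk(2, dim=1)
        n = torch.tanh(self.hn(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * n

cell = ConvGRUCell(64, 64)
h = torch.zeros(1, 64, 32, 32)
for x in [torch.randn(1, 64, 32, 32) for _ in range(4)]:
    h = cell(x, h)   # each frame refines the running reconstruction state
```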

Efficient Incremental Potential Contact for Actuated Face Simulation

  • paper_url: http://arxiv.org/abs/2312.02999
  • repo_url: None
  • paper_authors: Bo Li, Lingchen Yang, Barbara Solenthaler
  • for: A quasi-static finite element simulator for human face animation.
  • methods: The face is modeled as an actuated soft body using Projective Dynamics (PD), with Incremental Potential Contact (IPC) adopted to handle self-intersection.
  • results: With the proposed optimization for collision handling, the method achieves high visual fidelity at a relatively low performance overhead.
    Abstract We present a quasi-static finite element simulator for human face animation. We model the face as an actuated soft body, which can be efficiently simulated using Projective Dynamics (PD). We adopt Incremental Potential Contact (IPC) to handle self-intersection. However, directly integrating IPC into the simulation would impede the high efficiency of the PD solver, since the stiffness matrix in the global step is no longer constant and cannot be pre-factorized. We notice that the actual number of vertices affected by the collision is only a small fraction of the whole model, and by utilizing this fact we effectively decrease the scale of the linear system to be solved. With the proposed optimization method for collision, we achieve high visual fidelity at a relatively low performance overhead.
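    The key structural fact is that the IPC barrier only touches the few collision-affected vertices, so the constant PD system matrix can stay pre-factorized. One standard way to exploit this (a sketch; the paper's exact scheme may differ) is a Woodbury-style low-rank update:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_with_collision_update(H0_chol, b, idx, dH):
    """Solve (H0 + S dH S^T) x = b where H0 is the constant, pre-factorized
    system matrix and dH touches only the DOFs in `idx` (m << n)."""
    n, m = b.shape[0], len(idx)
    U = np.zeros((n, m))
    U[idx, np.arange(m)] = 1.0                   # selection matrix S
    y = cho_solve(H0_chol, b)                    # H0^{-1} b, reuses the factor
    Z = cho_solve(H0_chol, U)                    # H0^{-1} S
    small = np.linalg.inv(dH) + U.T @ Z          # m x m system only
    return y - Z @ np.linalg.solve(small, U.T @ y)

n, m = 200, 6
A = np.random.randn(n, n)
H0 = A @ A.T + n * np.eye(n)
H0_chol = cho_factor(H0)                         # factorized once, reused
idx = np.arange(m)
dH = 10.0 * np.eye(m)                            # barrier Hessian block
b = np.random.randn(n)
x = solve_with_collision_update(H0_chol, b, idx, dH)
H = H0.copy(); H[np.ix_(idx, idx)] += dH
assert np.allclose(H @ x, b)                     # matches the full solve
```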

Slice3D: Multi-Slice, Occlusion-Revealing, Single View 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2312.02221
  • repo_url: https://github.com/yizhiwang96/Slice3D
  • paper_authors: Yizhi Wang, Wallace Lira, Wenqi Wang, Ali Mahdavi-Amiri, Hao Zhang
  • for: Single-view 3D reconstruction.
  • methods: Multi-slice reasoning: a U-Net-based network predicts or generates multi-slice images from a single RGB image, and a coordinate-based transformer network integrates the slices into a 3D model via signed distance prediction.
  • results: Superior recovery of complex and severely occluded shape structures amid ambiguities, with an inference time of less than 20 seconds.
    Abstract We introduce multi-slice reasoning, a new notion for single-view 3D reconstruction which challenges the current and prevailing belief that multi-view synthesis is the most natural conduit between single-view and 3D. Our key observation is that object slicing is more advantageous than altering views to reveal occluded structures. Specifically, slicing is more occlusion-revealing since it can peel through any occluders without obstruction. In the limit, i.e., with infinitely many slices, it is guaranteed to unveil all hidden object parts. We realize our idea by developing Slice3D, a novel method for single-view 3D reconstruction which first predicts multi-slice images from a single RGB image and then integrates the slices into a 3D model using a coordinate-based transformer network for signed distance prediction. The slice images can be regressed or generated, both through a U-Net based network. For the former, we inject a learnable slice indicator code to designate each decoded image into a spatial slice location, while the slice generator is a denoising diffusion model operating on the entirety of slice images stacked on the input channels. We conduct extensive evaluation against state-of-the-art alternatives to demonstrate superiority of our method, especially in recovering complex and severely occluded shape structures, amid ambiguities. All Slice3D results were produced by networks trained on a single Nvidia A40 GPU, with an inference time less than 20 seconds.
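    For the regression variant, the abstract describes a learnable slice indicator code that designates each decoded image to a spatial slice location. A minimal sketch of that conditioning, with a single convolution standing in for the U-Net:

```python
import torch
import torch.nn as nn

class SliceRegressor(nn.Module):
    def __init__(self, n_slices=8, ch=64):
        super().__init__()
        self.encode = nn.Conv2d(3, ch, 3, padding=1)
        self.slice_code = nn.Embedding(n_slices, ch)   # learnable indicator codes
        self.decode = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, rgb, slice_idx):
        f = torch.relu(self.encode(rgb))
        code = self.slice_code(slice_idx)[:, :, None, None]  # (B, ch, 1, 1)
        return self.decode(f + code)        # features conditioned on the slice

net = SliceRegressor()
img = torch.randn(2, 3, 128, 128)
slices = [net(img, torch.full((2,), i, dtype=torch.long)) for i in range(8)]
```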

QuantAttack: Exploiting Dynamic Quantization to Attack Vision Transformers

  • paper_url: http://arxiv.org/abs/2312.02220
  • repo_url: None
  • paper_authors: Amit Baras, Alon Zolfi, Yuval Elovici, Asaf Shabtai
  • for: This paper presents a novel attack that degrades the availability and efficiency of quantized models, evaluated across a variety of tasks.
  • methods: The proposed attack, QuantAttack, crafts adversarial examples that exploit dynamic quantization at test time, slowing down inference and increasing memory usage and energy consumption.
  • results: Experiments show that QuantAttack is effective against vision transformers on a wide range of uni-modal and multi-modal tasks; the authors also examine attack variants (e.g., a universal perturbation) and transferability between different models.
    Abstract In recent years, there has been a significant trend in deep neural networks (DNNs), particularly transformer-based models, of developing ever-larger and more capable models. While they demonstrate state-of-the-art performance, their growing scale requires increased computational resources (e.g., GPUs with greater memory capacity). To address this problem, quantization techniques (i.e., low-bit-precision representation and matrix multiplication) have been proposed. Most quantization techniques employ a static strategy in which the model parameters are quantized, either during training or inference, without considering the test-time sample. In contrast, dynamic quantization techniques, which have become increasingly popular, adapt during inference based on the input provided, while maintaining full-precision performance. However, their dynamic behavior and average-case performance assumption makes them vulnerable to a novel threat vector -- adversarial attacks that target the model's efficiency and availability. In this paper, we present QuantAttack, a novel attack that targets the availability of quantized models, slowing down the inference, and increasing memory usage and energy consumption. We show that carefully crafted adversarial examples, which are designed to exhaust the resources of the operating system, can trigger worst-case performance. In our experiments, we demonstrate the effectiveness of our attack on vision transformers on a wide range of tasks, both uni-modal and multi-modal. We also examine the effect of different attack variants (e.g., a universal perturbation) and the transferability between different models.
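    A hedged sketch of the attack idea: dynamic quantization schemes route large-magnitude "outlier" activations through a slower high-precision path, so an availability attack can push activations across that threshold. The threshold value, loss, and PGD settings below are illustrative assumptions, not the paper's exact objective.

```python
import torch

def quant_attack_step(model, x, eps, alpha, outlier_tau=6.0):
    """One PGD step that rewards activations crossing the outlier threshold."""
    acts = []
    hooks = [m.register_forward_hook(lambda m, i, o: acts.append(o))
             for m in model.modules() if isinstance(m, torch.nn.Linear)]
    x_adv = x.clone().requires_grad_(True)
    model(x_adv)
    loss = sum(torch.sigmoid(a.abs() - outlier_tau).mean() for a in acts)
    loss.backward()                                  # soft count of outliers
    for h in hooks:
        h.remove()
    with torch.no_grad():
        x_adv = x_adv + alpha * x_adv.grad.sign()    # ascend the objective
        x_adv = x + (x_adv - x).clamp(-eps, eps)     # stay in the eps-ball
    return x_adv.detach()

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 32 * 32, 128),
                            torch.nn.ReLU(), torch.nn.Linear(128, 10))
x_adv = quant_attack_step(model, torch.rand(1, 3, 32, 32),
                          eps=8 / 255, alpha=2 / 255)
```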

Diffusion Posterior Sampling for Nonlinear CT Reconstruction

  • paper_url: http://arxiv.org/abs/2312.01464
  • repo_url: None
  • paper_authors: Shudong Li, Matthew Tivnan, Yuan Shen, J. Webster Stayman
  • for: CT image reconstruction and restoration.
  • methods: Diffusion posterior sampling that combines a score-based diffusion prior with a likelihood model derived from the nonlinear CT physics to generate high-quality CT images.
  • results: High-quality CT images can be generated from low-quality measurements, and the approach applies to multiple CT system designs with different forward models without any additional training.
    Abstract Diffusion models have been demonstrated as powerful deep learning tools for image generation in CT reconstruction and restoration. Recently, diffusion posterior sampling, where a score-based diffusion prior is combined with a likelihood model, has been used to produce high quality CT images given low-quality measurements. This technique is attractive since it permits a one-time, unsupervised training of a CT prior; which can then be incorporated with an arbitrary data model. However, current methods only rely on a linear model of x-ray CT physics to reconstruct or restore images. While it is common to linearize the transmission tomography reconstruction problem, this is an approximation to the true and inherently nonlinear forward model. We propose a new method that solves the inverse problem of nonlinear CT image reconstruction via diffusion posterior sampling. We implement a traditional unconditional diffusion model by training a prior score function estimator, and apply Bayes rule to combine this prior with a measurement likelihood score function derived from the nonlinear physical model to arrive at a posterior score function that can be used to sample the reverse-time diffusion process. This plug-and-play method allows incorporation of a diffusion-based prior with generalized nonlinear CT image reconstruction into multiple CT system designs with different forward models, without the need for any additional training. We develop the algorithm that performs this reconstruction, including an ordered-subsets variant for accelerated processing and demonstrate the technique in both fully sampled low dose data and sparse-view geometries using a single unsupervised training of the prior.
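    The core update follows Bayes' rule: the posterior score is the learned prior score plus a likelihood gradient taken through the nonlinear forward model. A minimal sketch with a toy transmission model and a placeholder prior score (both stand-ins for the trained components):

```python
import torch

def posterior_score(x, t, score_net, y, forward_op, sigma_y):
    """grad_x log p(x_t | y) ~= s_theta(x_t, t) + grad_x log p(y | x_t),
    with the likelihood evaluated through the nonlinear CT physics."""
    x = x.detach().requires_grad_(True)
    residual = y - forward_op(x)
    log_lik = -(residual ** 2).sum() / (2 * sigma_y ** 2)
    lik_grad = torch.autograd.grad(log_lik, x)[0]
    return score_net(x, t) + lik_grad

I0 = 1.0e4                                 # toy transmission model: y = I0 exp(-Ax)
A = 0.01 * torch.randn(64, 256)
forward_op = lambda x: I0 * torch.exp(-(A @ x))
score_net = lambda x, t: -x                # placeholder prior score
y = forward_op(torch.rand(256)) + torch.randn(64)
s = posterior_score(torch.randn(256), torch.tensor(0.5),
                    score_net, y, forward_op, sigma_y=10.0)
# A standard reverse-time sampler then applies, e.g.
# x <- x + step * s + noise_scale * torch.randn_like(x)
```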

Towards an accurate and generalizable multiple sclerosis lesion segmentation model using self-ensembled lesion fusion

  • paper_url: http://arxiv.org/abs/2312.01460
  • repo_url: None
  • paper_authors: Jinwei Zhang, Lianrui Zuo, Blake E. Dewey, Samuel W. Remedios, Dzung L. Pham, Aaron Carass, Jerry L. Prince
  • for: To improve the efficiency and reproducibility of multiple sclerosis (MS) lesion segmentation relative to manual delineation.
  • methods: The model uses the well-known U-Net architecture without further modification, together with a novel test-time self-ensembled lesion fusion strategy for top performance and generalizability.
  • results: Achieves the best performance on the ISBI 2015 MS segmentation challenge data and, trained with instance normalization rather than batch normalization, generalizes well to clinical test datasets from different scanners.
    Abstract Automatic multiple sclerosis (MS) lesion segmentation using multi-contrast magnetic resonance (MR) images provides improved efficiency and reproducibility compared to manual delineation. Current state-of-the-art automatic MS lesion segmentation methods utilize modified U-Net-like architectures. However, in the literature, dedicated architecture modifications were always required to maximize their performance. In addition, the best-performing methods have not proven to be generalizable to diverse test datasets with contrast variations and image artifacts. In this work, we developed an accurate and generalizable MS lesion segmentation model using the well-known U-Net architecture without further modification. A novel test-time self-ensembled lesion fusion strategy is proposed that not only achieved the best performance using the ISBI 2015 MS segmentation challenge data but also demonstrated robustness across various self-ensemble parameter choices. Moreover, equipped with instance normalization rather than batch normalization widely used in literature, the model trained on the ISBI challenge data generalized well on clinical test datasets from different scanners.
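    The abstract does not spell out the fusion rule, so the following is a plausible, hedged sketch of lesion-level fusion: average the ensemble's probability maps, extract candidate lesions as connected components, and keep only lesions detected by enough ensemble members. Both thresholds are illustrative.

```python
import numpy as np
from scipy import ndimage

def self_ensembled_lesion_fusion(prob_maps, vox_thresh=0.5, support=0.5):
    """prob_maps: list of (D, H, W) probability maps from ensemble members."""
    mean_map = np.mean(prob_maps, axis=0)
    candidates, n = ndimage.label(mean_map > vox_thresh)
    members = [p > vox_thresh for p in prob_maps]
    keep = np.zeros(candidates.shape, dtype=bool)
    for lesion_id in range(1, n + 1):
        region = candidates == lesion_id
        frac = np.mean([m[region].any() for m in members])  # member agreement
        if frac >= support:
            keep |= region
    return keep

maps = [np.random.rand(32, 32, 32) for _ in range(5)]  # e.g., self-ensemble runs
mask = self_ensembled_lesion_fusion(maps)
```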

Looking Inside Out: Anticipating Driver Intent From Videos

  • paper_url: http://arxiv.org/abs/2312.01444
  • repo_url: None
  • paper_authors: Yung-chi Kung, Arthur Zhang, Junmin Wang, Joydeep Biswas
  • for: The paper aims to improve the prediction of future driver actions by utilizing in-cabin and external camera data to improve on state-of-the-art (SOTA) performance.
  • methods: The proposed method explicitly extracts object- and road-level features from external camera data, which are important for predicting driver intention, and feeds these handcrafted features as inputs to both a transformer and an LSTM-based architecture.
  • results: The models predict driver maneuvers more accurately and earlier than existing approaches, with an accuracy of 87.5% and an average prediction time of 4.35 seconds before the maneuver takes place.
    Abstract Anticipating driver intention is an important task when vehicles of mixed and varying levels of human/machine autonomy share roadways. Driver intention can be leveraged to improve road safety, such as warning surrounding vehicles in the event the driver is attempting a dangerous maneuver. In this work, we propose a novel method of utilizing in-cabin and external camera data to improve state-of-the-art (SOTA) performance in predicting future driver actions. Compared to existing methods, our approach explicitly extracts object and road-level features from external camera data, which we demonstrate are important features for predicting driver intention. Using our handcrafted features as inputs for both a transformer and an LSTM-based architecture, we empirically show that jointly utilizing in-cabin and external features improves performance compared to using in-cabin features alone. Furthermore, our models predict driver maneuvers more accurately and earlier than existing approaches, with an accuracy of 87.5% and an average prediction time of 4.35 seconds before the maneuver takes place. We release our model configurations and training scripts on https://github.com/ykung83/Driver-Intent-Prediction
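    A minimal sketch of the LSTM-based variant (dimensions are illustrative): per-timestep in-cabin features are concatenated with the handcrafted object- and road-level features from the external camera, and the sequence is classified into a maneuver.

```python
import torch
import torch.nn as nn

class DriverIntentLSTM(nn.Module):
    def __init__(self, cabin_dim=128, ext_dim=64, hid=256, n_maneuvers=5):
        super().__init__()
        self.lstm = nn.LSTM(cabin_dim + ext_dim, hid, batch_first=True)
        self.head = nn.Linear(hid, n_maneuvers)

    def forward(self, cabin_feats, ext_feats):
        # cabin_feats: (B, T, cabin_dim); ext_feats: (B, T, ext_dim)
        x = torch.cat([cabin_feats, ext_feats], dim=-1)   # joint features
        out, _ = self.lstm(x)
        return self.head(out[:, -1])       # predict from the last timestep

model = DriverIntentLSTM()
logits = model(torch.randn(4, 30, 128), torch.randn(4, 30, 64))
```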

Automatic Report Generation for Histopathology images using pre-trained Vision Transformers and BERT

  • paper_url: http://arxiv.org/abs/2312.01435
  • repo_url: https://github.com/ssen7/histo_cap_transformers
  • paper_authors: Saurav Sengupta, Donald E. Brown
  • for: Automatic report generation for histopathology images
  • methods: Uses a pre-trained Vision Transformer to encode the Whole Slide Image and a pre-trained BERT model as the language-modeling decoder for report generation.
  • results: Achieved 79.98% accuracy in tissue type classification, 66.36% accuracy in classifying the sex of the patient, and a BLEU-4 score of 0.5818 on the caption generation task.
    Abstract Deep learning for histopathology has been successfully used for disease classification, image segmentation and more. However, combining image and text modalities using current state-of-the-art methods has been a challenge due to the high resolution of histopathology images. Automatic report generation for histopathology images is one such challenge. In this work, we show that using an existing pre-trained Vision Transformer in a two-step process of first using it to encode 4096x4096 sized patches of the Whole Slide Image (WSI) and then using it as the encoder and a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model for language modeling-based decoder for report generation, we can build a fairly performant and portable report generation mechanism that takes into account the whole of the high resolution image, instead of just the patches. Our method allows us to not only generate and evaluate captions that describe the image, but also helps us classify the image into tissue types and the gender of the patient as well. Our best performing model achieves a 79.98% accuracy in Tissue Type classification and 66.36% accuracy in classifying the sex of the patient the tissue came from, with a BLEU-4 score of 0.5818 in our caption generation task.
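    A hedged sketch of the two-step design: ViT embeddings of the 4096x4096 WSI patches (stage one) serve as the memory of a causal language decoder that emits the report (stage two). A plain TransformerDecoder stands in for the pre-trained BERT decoder, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class ReportDecoder(nn.Module):
    def __init__(self, vocab=30522, d=768, n_layers=4):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.dec = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, token_ids, patch_embs):
        # token_ids: (B, L) report tokens; patch_embs: (B, P, d) from the ViT
        L = token_ids.shape[1]
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.dec(self.tok(token_ids), patch_embs, tgt_mask=causal)
        return self.lm_head(h)              # next-token logits

dec = ReportDecoder()
logits = dec(torch.randint(0, 30522, (2, 16)), torch.randn(2, 64, 768))
```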

D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

  • paper_url: http://arxiv.org/abs/2312.01431
  • repo_url: https://github.com/qizhongtan/d2st-adapter
  • paper_authors: Wenjie Pei, Qizhong Tan, Guangming Lu, Jiandong Tian
  • for: This paper proposes a few-shot action recognition method that adapts large pre-trained image models to learn the robust feature extractors that few-shot learning depends on.
  • methods: The method is the D$^2$ST-Adapter, a novel adapter tuning framework with a dual-pathway architecture that encodes spatial and temporal features in a disentangled manner. Its core component, the Deformable Spatio-Temporal Attention module, models spatial and temporal features in the corresponding pathways, allowing the D$^2$ST-Adapter to encode features with a global view in 3D spatio-temporal space while maintaining a lightweight design.
  • results: Instantiated on both pre-trained ResNet and ViT, the method outperforms state-of-the-art approaches, particularly in scenarios where temporal dynamics are critical.
    Abstract Adapting large pre-trained image models to few-shot action recognition has proven to be an effective and efficient strategy for learning robust feature extractors, which is essential for few-shot learning. Typical fine-tuning based adaptation paradigm is prone to overfitting in the few-shot learning scenarios and offers little modeling flexibility for learning temporal features in video data. In this work we present the Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter), a novel adapter tuning framework for few-shot action recognition, which is designed in a dual-pathway architecture to encode spatial and temporal features in a disentangled manner. Furthermore, we devise the Deformable Spatio-Temporal Attention module as the core component of D$^2$ST-Adapter, which can be tailored to model both spatial and temporal features in corresponding pathways, allowing our D$^2$ST-Adapter to encode features in a global view in 3D spatio-temporal space while maintaining a lightweight design. Extensive experiments with instantiations of our method on both pre-trained ResNet and ViT demonstrate the superiority of our method over state-of-the-art methods for few-shot action recognition. Our method is particularly well-suited to challenging scenarios where temporal dynamics are critical for action recognition.
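    A minimal sketch of the dual-pathway adapter structure, with plain spatial and temporal convolutions standing in for the paper's Deformable Spatio-Temporal Attention (so this shows the disentangled layout, not the attention mechanism itself):

```python
import torch
import torch.nn as nn

class DualPathwayAdapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.spatial = nn.Conv3d(bottleneck, bottleneck, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(bottleneck, bottleneck, (3, 1, 1), padding=(1, 0, 0))
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):                   # x: (B, T, H, W, C) backbone tokens
        h = self.down(x).permute(0, 4, 1, 2, 3)        # (B, c, T, H, W)
        h = torch.relu(self.spatial(h)) + torch.relu(self.temporal(h))
        h = self.up(h.permute(0, 2, 3, 4, 1))          # back to (B, T, H, W, C)
        return x + h                                    # residual adapter

adapter = DualPathwayAdapter(dim=768)
out = adapter(torch.randn(2, 8, 14, 14, 768))
```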

WavePlanes: A compact Wavelet representation for Dynamic Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2312.02218
  • repo_url: https://github.com/azzarelli/waveplanes
  • paper_authors: Adrian Azzarelli, Nantheera Anantrasirichai, David R Bull
  • for: This paper improves NeRF techniques for dynamic scenes: existing dynamic NeRF models are resource-intensive and hard to compress, so WavePlanes, a fast and more compact explicit model, is proposed.
  • methods: A multi-scale space and space-time feature plane representation built on N-level 2-D wavelet coefficients; the inverse discrete wavelet transform reconstructs N feature signals at varying detail, which are linearly decoded to approximate the color and density of volumes in a 4-D grid.
  • results: Compared with state-of-the-art plane-based models, WavePlanes is up to 15x smaller and less computationally demanding, achieves comparable results in as little as one hour of training, and requires no custom CUDA code or high-performance computing resources; new, more interpretable feature fusion schemes are also proposed.
    Abstract Dynamic Neural Radiance Fields (Dynamic NeRF) enhance NeRF technology to model moving scenes. However, they are resource intensive and challenging to compress. To address this issue, this paper presents WavePlanes, a fast and more compact explicit model. We propose a multi-scale space and space-time feature plane representation using N-level 2-D wavelet coefficients. The inverse discrete wavelet transform reconstructs N feature signals at varying detail, which are linearly decoded to approximate the color and density of volumes in a 4-D grid. Exploiting the sparsity of wavelet coefficients, we compress a Hash Map containing only non-zero coefficients and their locations on each plane. This results in a compressed model size of ~12 MB. Compared with state-of-the-art plane-based models, WavePlanes is up to 15x smaller, less computationally demanding and achieves comparable results in as little as one hour of training - without requiring custom CUDA code or high performance computing resources. Additionally, we propose new feature fusion schemes that work as well as previously proposed schemes while providing greater interpretability. Our code is available at: https://github.com/azzarelli/waveplanes/
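    The compression recipe maps directly onto standard wavelet tooling: N-level 2-D decomposition, thresholding into a hash map of nonzero coefficients, and inverse DWT reconstruction. A minimal sketch with PyWavelets (the quantile threshold is an illustrative choice):

```python
import numpy as np
import pywt

def compress_plane(plane, levels=4, keep=0.05, wavelet="haar"):
    """Keep only the largest `keep` fraction of wavelet coefficients."""
    coeffs = pywt.wavedec2(plane, wavelet, level=levels)
    arr, slices = pywt.coeffs_to_array(coeffs)
    arr[np.abs(arr) < np.quantile(np.abs(arr), 1.0 - keep)] = 0.0
    hash_map = {tuple(ij): arr[tuple(ij)] for ij in np.argwhere(arr != 0.0)}
    return hash_map, arr.shape, slices

def decompress_plane(hash_map, shape, slices, wavelet="haar"):
    """Rebuild the coefficient array, then invert the wavelet transform."""
    arr = np.zeros(shape)
    for ij, v in hash_map.items():
        arr[ij] = v
    coeffs = pywt.array_to_coeffs(arr, slices, output_format="wavedec2")
    return pywt.waverec2(coeffs, wavelet)

plane = np.random.rand(128, 128)                 # one feature plane
hm, shape, slices = compress_plane(plane)
recon = decompress_plane(hm, shape, slices)
print(len(hm), "nonzero coefficients stored out of", plane.size)
```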

Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts

  • paper_url: http://arxiv.org/abs/2312.01408
  • repo_url: None
  • paper_authors: Tianqi Chen, Yongfei Liu, Zhendong Wang, Jianbo Yuan, Quanzeng You, Hongxia Yang, Mingyuan Zhou
  • for: This paper targets in-context learning in the vision domain: providing relevant visual examples to help the model better understand images, improving its image understanding and generation.
  • methods: The proposed improved Prompt Diffusion (iPromptDiff) is a diffusion-based vision foundation model with an end-to-end trained vision encoder that converts visual context into an embedding vector, which then modulates the token embeddings of text prompts. This yields notable flexibility and robustness across a variety of training tasks.
  • results: When combined with a standard ControlNet structure, iPromptDiff excels at in-context learning for novel vision tasks such as normal-to-image or image-to-line transformations, without expensive pretraining or restrictive frameworks.
    Abstract In light of the remarkable success of in-context learning in large language models, its potential extension to the vision domain, particularly with visual foundation models like Stable Diffusion, has sparked considerable interest. Existing approaches in visual in-context learning frequently face hurdles such as expensive pretraining, limiting frameworks, inadequate visual comprehension, and limited adaptability to new tasks. In response to these challenges, we introduce improved Prompt Diffusion (iPromptDiff) in this study. iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector. This vector is subsequently used to modulate the token embeddings of text prompts. We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks and excels in in-context learning for novel vision tasks, such as normal-to-image or image-to-line transformations. The effectiveness of these capabilities relies heavily on a deep visual understanding, which is achieved through relevant visual demonstrations processed by our proposed in-context learning architecture.
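    The modulation step itself is simple to sketch: the visual-context embedding is projected into the text space and shifts every prompt token embedding before it conditions the denoiser. Additive modulation is an assumption here; the paper may use a different operator.

```python
import torch
import torch.nn as nn

class ContextModulatedPrompt(nn.Module):
    def __init__(self, txt_dim=768, ctx_dim=512):
        super().__init__()
        self.to_txt = nn.Linear(ctx_dim, txt_dim)

    def forward(self, text_tokens, ctx_embedding):
        # text_tokens: (B, L, txt_dim); ctx_embedding: (B, ctx_dim)
        shift = self.to_txt(ctx_embedding).unsqueeze(1)   # (B, 1, txt_dim)
        return text_tokens + shift           # context-modulated guidance tokens

mod = ContextModulatedPrompt()
tokens = mod(torch.randn(2, 77, 768), torch.randn(2, 512))  # feed to the denoiser
```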

VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams

  • paper_url: http://arxiv.org/abs/2312.01407
  • repo_url: None
  • paper_authors: Liao Wang, Kaixin Yao, Chengcheng Guo, Zhirui Zhang, Qiang Hu, Jingyi Yu, Lan Xu, Minye Wu
  • for: This paper aims to enable real-time streaming and rendering of dynamic scenes on ubiquitous devices.
  • methods: The 4D radiance field is represented as a serialized 2D feature image stream, and a tailored training scheme imposes temporal and spatial redundancy on the stream so that it can be compressed by 2D video codecs and decoded in real time with video hardware accelerators.
  • results: A novel rendering pipeline with specialized space mappings queries radiance properties efficiently; together with a real-time interactive player, VideoRF streams and renders dynamic scenes across a range of devices, from desktops to mobile phones.
    Abstract Neural Radiance Fields (NeRFs) excel in photorealistically rendering static scenes. However, rendering dynamic, long-duration radiance fields on ubiquitous devices remains challenging, due to data storage and computational constraints. In this paper, we introduce VideoRF, the first approach to enable real-time streaming and rendering of dynamic radiance fields on mobile platforms. At the core is a serialized 2D feature image stream representing the 4D radiance field all in one. We introduce a tailored training scheme directly applied to this 2D domain to impose the temporal and spatial redundancy of the feature image stream. By leveraging the redundancy, we show that the feature image stream can be efficiently compressed by 2D video codecs, which allows us to exploit video hardware accelerators to achieve real-time decoding. On the other hand, based on the feature image stream, we propose a novel rendering pipeline for VideoRF, which has specialized space mappings to query radiance properties efficiently. Paired with a deferred shading model, VideoRF has the capability of real-time rendering on mobile devices thanks to its efficiency. We have developed a real-time interactive player that enables online streaming and rendering of dynamic scenes, offering a seamless and immersive free-viewpoint experience across a range of devices, from desktops to mobile phones.
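    A hedged sketch of the query path: a space mapping sends each 3D sample of the current frame to 2D texture coordinates on the decoded feature image, features are fetched by bilinear sampling, and a small MLP decodes radiance. The mapping and decoder below are stand-ins.

```python
import torch
import torch.nn.functional as F

def query_radiance(feature_image, coords, mapping, mlp):
    uv = mapping(coords)                          # (N, 2) in [-1, 1]
    grid = uv.view(1, -1, 1, 2)                   # grid_sample layout
    feats = F.grid_sample(feature_image, grid, align_corners=True)
    feats = feats.view(feature_image.shape[1], -1).T   # (N, C)
    return mlp(feats)                             # (N, 4): RGB + density

feature_image = torch.randn(1, 16, 256, 256)      # one decoded video frame
mapping = torch.nn.Sequential(torch.nn.Linear(3, 2), torch.nn.Tanh())
mlp = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                          torch.nn.Linear(32, 4))
out = query_radiance(feature_image, torch.rand(1000, 3) * 2 - 1, mapping, mlp)
```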

Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective

  • paper_url: http://arxiv.org/abs/2312.01397
  • repo_url: https://github.com/unites-lab/vpns
  • paper_authors: Can Jin, Tianjin Huang, Yihua Zhang, Mykola Pechenizkiy, Sijia Liu, Shiwei Liu, Tianlong Chen
  • for: This work addresses the hardware cost of large-scale deep learning models from a data-model co-design perspective, aiming to improve neural network sparsification.
  • methods: Customized, learnable visual prompts are mounted to upgrade neural network sparsification in the proposed VPNs framework, jointly optimizing important model topology and adequate input data.
  • results: Experiments with 3 network architectures and 8 datasets show substantial performance improvements over state-of-the-art pruning algorithms, and subnetworks discovered by VPNs from pre-trained models transfer better across diverse downstream scenarios.
    Abstract The rapid development of large-scale deep learning models questions the affordability of hardware platforms, which necessitates the pruning to reduce their computational and memory footprints. Sparse neural networks as the product, have demonstrated numerous favorable benefits like low complexity, undamaged generalization, etc. Most of the prominent pruning strategies are invented from a model-centric perspective, focusing on searching and preserving crucial weights by analyzing network topologies. However, the role of data and its interplay with model-centric pruning has remained relatively unexplored. In this research, we introduce a novel data-model co-design perspective: to promote superior weight sparsity by learning important model topology and adequate input data in a synergetic manner. Specifically, customized Visual Prompts are mounted to upgrade neural Network sparsification in our proposed VPNs framework. As a pioneering effort, this paper conducts systematic investigations about the impact of different visual prompts on model pruning and suggests an effective joint optimization approach. Extensive experiments with 3 network architectures and 8 datasets evidence the substantial performance improvements from VPNs over existing start-of-the-art pruning algorithms. Furthermore, we find that subnetworks discovered by VPNs from pre-trained models enjoy better transferability across diverse downstream scenarios. These insights shed light on new promising possibilities of data-model co-designs for vision model sparsification.
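    A minimal sketch of the joint data-model loop (the abstract does not give the VPNs schedule or prompt design, so everything below is an assumption): the visual prompt is updated on the task loss while a global magnitude criterion re-imposes the target weight sparsity.

```python
import torch
import torch.nn.functional as F

def joint_prompt_and_prune_step(model, prompt, images, labels, opt, sparsity):
    opt.zero_grad()
    loss = F.cross_entropy(model(images + prompt), labels)  # prompted input
    loss.backward()
    opt.step()                                    # updates prompt and weights
    with torch.no_grad():                         # re-impose sparsity
        w = torch.cat([m.weight.abs().flatten() for m in model.modules()
                       if isinstance(m, torch.nn.Linear)])
        thresh = torch.quantile(w, sparsity)
        for m in model.modules():
            if isinstance(m, torch.nn.Linear):
                m.weight.mul_((m.weight.abs() >= thresh).float())
    return loss.item()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
prompt = torch.zeros(1, 3, 32, 32, requires_grad=True)    # learnable visual prompt
opt = torch.optim.SGD([prompt] + list(model.parameters()), lr=0.01)
x, y = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
joint_prompt_and_prune_step(model, prompt, x, y, opt, sparsity=0.9)
```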

Language-driven All-in-one Adverse Weather Removal

  • paper_url: http://arxiv.org/abs/2312.01381
  • repo_url: None
  • paper_authors: Hao Yang, Liyuan Pan, Yan Yang, Wei Liang
  • for: This paper proposes a framework that handles multiple adverse weather conditions within a single model, so that varying real-world weather can be addressed in practice.
  • methods: A Language-driven Restoration framework (LDR) leverages pre-trained vision-language (PVL) models to enrich the diversity of weather-specific knowledge, and sparsely selects restoration experts from a candidate list according to description-based degradation priors.
  • results: Experiments on extensive restoration scenarios show superior restoration performance under various weather conditions, including unknown or mixed weather (see Fig. 1).
    Abstract All-in-one (AiO) frameworks restore various adverse weather degradations with a single set of networks jointly. To handle various weather conditions, an AiO framework is expected to adaptively learn weather-specific knowledge for different degradations and shared knowledge for common patterns. However, existing methods: 1) rely on extra supervision signals, which are usually unknown in real-world applications; 2) employ fixed network structures, which restrict the diversity of weather-specific knowledge. In this paper, we propose a Language-driven Restoration framework (LDR) to alleviate the aforementioned issues. First, we leverage the power of pre-trained vision-language (PVL) models to enrich the diversity of weather-specific knowledge by reasoning about the occurrence, type, and severity of degradation, generating description-based degradation priors. Then, with the guidance of degradation prior, we sparsely select restoration experts from a candidate list dynamically based on a Mixture-of-Experts (MoE) structure. This enables us to adaptively learn the weather-specific and shared knowledge to handle various weather conditions (e.g., unknown or mixed weather). Experiments on extensive restoration scenarios show our superior performance (see Fig. 1). The source code will be made available.
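    A hedged sketch of the sparse expert selection: an embedding of the description-based degradation prior gates the candidate list, and only the top-k experts run on the image features. The gate, expert architecture, and blending rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LanguageGuidedMoE(nn.Module):
    def __init__(self, prior_dim=512, n_experts=6, k=2, ch=16):
        super().__init__()
        self.gate = nn.Linear(prior_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=1) for _ in range(n_experts))
        self.k = k

    def forward(self, feats, prior_emb):
        scores = self.gate(prior_emb)                     # (B, n_experts)
        top_w, top_i = scores.topk(self.k, dim=-1)
        top_w = torch.softmax(top_w, dim=-1)
        out = torch.zeros_like(feats)
        for b in range(feats.shape[0]):                   # per-sample routing
            for w, i in zip(top_w[b], top_i[b]):
                out[b] += w * self.experts[int(i)](feats[b:b + 1])[0]
        return out

moe = LanguageGuidedMoE()
restored = moe(torch.randn(2, 16, 64, 64), torch.randn(2, 512))
```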

A Conditional Denoising Diffusion Probabilistic Model for Point Cloud Upsampling

  • paper_url: http://arxiv.org/abs/2312.02719
  • repo_url: None
  • paper_authors: Wentao Qu, Yuantian Shao, Lingwu Meng, Xiaoshui Huang, Liang Xiao
  • for: To improve point cloud upsampling and thereby the performance of downstream tasks such as classification and reconstruction.
  • methods: The approach directly models the gradient of the data distribution of dense point clouds with a conditional denoising diffusion probabilistic model (DDPM) for point cloud upsampling, PUDM, which treats the sparse point cloud as a condition and adopts a dual mapping paradigm during training.
  • results: In quantitative and qualitative evaluations on PU1K and PUGAN, PUDM significantly outperforms existing methods in terms of Chamfer Distance (CD) and Hausdorff Distance (HD), achieving state-of-the-art (SOTA) performance.
    Abstract Point cloud upsampling (PCU) enriches the representation of raw point clouds, significantly improving the performance in downstream tasks such as classification and reconstruction. Most of the existing point cloud upsampling methods focus on sparse point cloud feature extraction and upsampling module design. In a different way, we dive deeper into directly modelling the gradient of data distribution from dense point clouds. In this paper, we proposed a conditional denoising diffusion probability model (DDPM) for point cloud upsampling, called PUDM. Specifically, PUDM treats the sparse point cloud as a condition, and iteratively learns the transformation relationship between the dense point cloud and the noise. Simultaneously, PUDM aligns with a dual mapping paradigm to further improve the discernment of point features. In this context, PUDM enables learning complex geometry details in the ground truth through the dominant features, while avoiding an additional upsampling module design. Furthermore, to generate high-quality arbitrary-scale point clouds during inference, PUDM exploits the prior knowledge of the scale between sparse point clouds and dense point clouds during training by parameterizing a rate factor. Moreover, PUDM exhibits strong noise robustness in experimental results. In the quantitative and qualitative evaluations on PU1K and PUGAN, PUDM significantly outperformed existing methods in terms of Chamfer Distance (CD) and Hausdorff Distance (HD), achieving state of the art (SOTA) performance.
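    A minimal sketch of one conditional DDPM training step for upsampling, with the sparse cloud as condition and the sparse-to-dense rate factor passed as an extra input (the denoiser is a stand-in; the paper's dual mapping paradigm is omitted):

```python
import torch

def ddpm_train_step(denoiser, dense, sparse, rate, alphas_cumprod):
    """Noise the dense cloud, then train the denoiser to predict the noise
    given the sparse cloud and the rate factor as conditions."""
    B = dense.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,))
    a = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(dense)
    noisy = a.sqrt() * dense + (1 - a).sqrt() * noise     # forward diffusion
    pred = denoiser(noisy, t, sparse, rate)
    return ((pred - noise) ** 2).mean()

denoiser = lambda x, t, c, r: torch.zeros_like(x)         # stand-in network
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
dense = torch.randn(4, 2048, 3)                           # ground-truth dense cloud
sparse = torch.randn(4, 512, 3)                           # conditioning sparse cloud
loss = ddpm_train_step(denoiser, dense, sparse, 4.0, alphas_cumprod)
```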

MoEC: Mixture of Experts Implicit Neural Compression

  • paper_url: http://arxiv.org/abs/2312.01361
  • repo_url: None
  • paper_authors: Jianchen Zhao, Cheng-Ching Tseng, Ming Lu, Ruichuan An, Xiaobao Wei, He Sun, Shanghang Zhang
  • for: Proposes a new implicit neural representation (INR) based data compression technique for complex scenes.
  • methods: A gating network automatically assigns a specific INR to each 3D point in the scene; unlike block-wise and tree-structured partitions, this learnable partition is trained jointly with the local INRs and finds the optimal partition in an end-to-end manner.
  • results: Detailed experiments on massive and diverse biomedical datasets demonstrate advantages over existing approaches, with state-of-the-art results in most settings; even at extreme compression ratios such as 6000x, a PSNR of 48.16 is upheld.
    Abstract Emerging Implicit Neural Representation (INR) is a promising data compression technique, which represents the data using the parameters of a Deep Neural Network (DNN). Existing methods manually partition a complex scene into local regions and overfit the INRs into those regions. However, manually designing the partition scheme for a complex scene is very challenging and fails to jointly learn the partition and INRs. To solve the problem, we propose MoEC, a novel implicit neural compression method based on the theory of mixture of experts. Specifically, we use a gating network to automatically assign a specific INR to a 3D point in the scene. The gating network is trained jointly with the INRs of different local regions. Compared with block-wise and tree-structured partitions, our learnable partition can adaptively find the optimal partition in an end-to-end manner. We conduct detailed experiments on massive and diverse biomedical data to demonstrate the advantages of MoEC against existing approaches. In most experiment settings, we have achieved state-of-the-art results. Especially in cases of extreme compression ratios, such as 6000x, we are able to uphold a PSNR of 48.16.
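To make the core idea concrete, here is a hedged PyTorch sketch of a gating network that softly routes each 3D point to one of several small coordinate-MLP "experts" (stand-ins for the local INRs), trainable end-to-end; the sizes and names are illustrative assumptions, not MoEC's actual architecture:

```python
import torch
import torch.nn as nn

class TinyINR(nn.Module):
    """One expert: a small coordinate MLP mapping 3D points to intensities."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, xyz):
        return self.net(xyz)

class MoEINR(nn.Module):
    """Gating network softly assigns each point to an expert INR; both are trained jointly."""
    def __init__(self, num_experts=8):
        super().__init__()
        self.experts = nn.ModuleList(TinyINR() for _ in range(num_experts))
        self.gate = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, num_experts))
    def forward(self, xyz):
        weights = torch.softmax(self.gate(xyz), dim=-1)           # (N, E) soft partition
        preds = torch.stack([e(xyz) for e in self.experts], -1)   # (N, 1, E)
        return (preds * weights.unsqueeze(1)).sum(-1)             # (N, 1)

model = MoEINR()
values = model(torch.rand(4096, 3))  # reconstructed intensities at queried points
```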

Deep learning and traditional-based CAD schemes for the pulmonary embolism diagnosis: A survey

  • paper_url: http://arxiv.org/abs/2312.01351
  • repo_url: None
  • paper_authors: Seyed Hesamoddin Hosseini, Amir Hossein Taherinia, Mahdi Saadatmand
  • for: This paper aims to review, evaluate, and compare the performance of deep learning and traditional-based CAD systems for the diagnosis of Pulmonary Embolism (PE).
  • methods: The authors use a systematic search of articles available in databases such as IEEE, ScienceDirect, Wiley, Springer, Nature, and Wolters Kluwer to identify studies that have used either deep learning or traditional-based CAD systems for PE diagnosis. They evaluate the performance of each system using criteria such as sensitivity, False Positives (FP), and the number of datasets.
  • results: The authors review and evaluate 23 papers that have been published on PE diagnosis using either deep learning or traditional-based CAD systems from 2002 to 2023. They provide a comprehensive overview of the recent studies and state-of-the-art research works in this field, and compare the performance of deep learning and traditional-based CAD systems.
    Abstract Nowadays, pulmonary Computed Tomography Angiography (CTA) is the main tool for detecting Pulmonary Embolism (PE). However, manual interpretation of a CTA volume requires a radiologist and is time-consuming and error-prone due to the specific conditions of lung tissue, the large volume of data, lack of experience, and eye fatigue. Therefore, Computer-Aided Diagnosis (CAD) systems are used as a second opinion for the diagnosis of PE. The purpose of this article is to review, evaluate, and compare the performance of deep learning and traditional CAD systems for diagnosing PE, and to help physicians and researchers in this field. In this study, all articles available in databases such as IEEE, ScienceDirect, Wiley, Springer, Nature, and Wolters Kluwer in the field of PE diagnosis using traditional and deep learning methods were examined. From 2002 to 2023, 23 papers meeting the considered criteria were studied. Each paper presents an automatic PE detection system that we evaluate using criteria such as sensitivity, False Positives (FP), and the number of datasets. This research work includes recent studies and state-of-the-art research works, and offers a more comprehensive overview than previously published review articles in this research area.

DragVideo: Interactive Drag-style Video Editing

  • paper_url: http://arxiv.org/abs/2312.02216
  • repo_url: https://github.com/rickyskywalker/dragvideo-official
  • paper_authors: Yufan Deng, Ruida Wang, Yuhao Zhang, Yu-Wing Tai, Chi-Keung Tang
  • for: Provides a drag-style video editing technique inspired by DragGAN, offering intuitive user control and natural editing results.
  • methods: Proposes the Drag-on-Video U-Net (DoVe) editing method built on DragDiffusion, which optimizes diffused video latents to achieve the desired edits.
  • results: Demonstrates versatility and generality across a wide array of challenging editing tasks, such as motion editing and skeleton editing.
    Abstract Editing visual content in videos remains a formidable challenge with two main issues: 1) direct and easy user control to produce 2) natural editing results without unsightly distortion and artifacts after changing shape, expression and layout. Inspired by DragGAN, a recent image-based drag-style editing technique, we address the above issues by proposing DragVideo, where a similar drag-style user interaction is adopted to edit video content while maintaining temporal consistency. Empowered by recent diffusion models as in DragDiffusion, DragVideo contains the novel Drag-on-Video U-Net (DoVe) editing method, which optimizes diffused video latents generated by the video U-Net to achieve the desired control. Specifically, we use Sample-specific LoRA fine-tuning and Mutual Self-Attention control to ensure faithful reconstruction of video from the DoVe method. We also present a series of testing examples for drag-style video editing and conduct extensive experiments across a wide array of challenging editing tasks, such as motion editing, skeleton editing, etc., underscoring DragVideo's versatility and generality. Our code, including the DragVideo web user interface, will be released.

Enhancing and Adapting in the Clinic: Source-free Unsupervised Domain Adaptation for Medical Image Enhancement

  • paper_url: http://arxiv.org/abs/2312.01338
  • repo_url: https://github.com/liamheng/annotation-free-medical-image-enhancement
  • paper_authors: Heng Li, Ziqin Lin, Zhongxi Qiu, Zinan Li, Huazhu Fu, Yan Hu, Jiang Liu
  • for: Proposes a source-free medical image enhancement algorithm (SAME) that adapts and optimizes enhancement models using test data in the inference phase, enabling unsupervised enhancement across domains.
  • methods: A structure-preserving enhancement network first learns a robust source model from synthesized training data; a teacher-student model then conducts source-free unsupervised domain adaptation (SFUDA) via knowledge distillation on test data, aided by a pseudo-label picker that boosts knowledge transfer for the enhancement task.
  • results: Experiments on ten datasets from three medical image modalities show remarkable enhancement performance and benefits for downstream tasks, demonstrating adaptability across domains.
    Abstract Medical imaging provides many valuable clues involving anatomical structure and pathological characteristics. However, image degradation is a common issue in clinical practice, which can adversely impact the observation and diagnosis by physicians and algorithms. Although extensive enhancement models have been developed, these models require thorough pre-training before deployment and fail to take advantage of the potential value of inference data after deployment. In this paper, we propose an algorithm for source-free unsupervised domain adaptive medical image enhancement (SAME), which adapts and optimizes enhancement models using test data in the inference phase. A structure-preserving enhancement network is first constructed to learn a robust source model from synthesized training data. Then a teacher-student model is initialized with the source model and conducts source-free unsupervised domain adaptation (SFUDA) by knowledge distillation with the test data. Additionally, a pseudo-label picker is developed to boost the knowledge distillation of enhancement tasks. Experiments were implemented on ten datasets from three medical image modalities to validate the advantage of the proposed algorithm, and setting analysis and ablation studies were also carried out to interpret the effectiveness of SAME. The remarkable enhancement performance and benefits for downstream tasks demonstrate the potential and generalizability of SAME. The code is available at https://github.com/liamheng/Annotation-free-Medical-Image-Enhancement.
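The teacher-student SFUDA step can be sketched in PyTorch: an EMA teacher produces pseudo-enhanced targets on test images, a simple agreement-based picker (a stand-in for the paper's pseudo-label picker) keeps the most reliable samples, and the student distills from them. Both networks would start from the source-trained model; this is a hedged sketch, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def sfuda_step(student, teacher, test_batch, optimizer, ema=0.999, keep_ratio=0.5):
    """One source-free adaptation step on unlabeled test images.
    student and teacher are both initialized from the source-trained model."""
    with torch.no_grad():
        target = teacher(test_batch)                 # pseudo-enhanced targets
    pred = student(test_batch)
    per_sample = F.mse_loss(pred, target, reduction="none").flatten(1).mean(1)
    k = max(1, int(keep_ratio * per_sample.numel()))
    # Keep the samples where student and teacher already agree most (a crude picker).
    loss = per_sample.topk(k, largest=False).values.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                            # EMA update of the teacher
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema).add_(ps, alpha=1 - ema)
    return loss.item()
```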

MABViT – Modified Attention Block Enhances Vision Transformers

  • paper_url: http://arxiv.org/abs/2312.01324
  • repo_url: None
  • paper_authors: Mahesh Ramesh, Aswinkumar Ramkumar
  • for: Improves transformer performance, motivated by the success of Gated Linear Units (GLU) in Large Language Models (LLMs) and by parallel Transformer-block configurations that accelerate training without significantly impacting performance.
  • methods: Applies a GLU-based activation function to the Value tensor within each Transformer block, integrating non-linearity into the attention block.
  • results: On the ImageNet-1K dataset, the new transformer variant surpasses the state-of-the-art S/16 variant of Vision Transformers by 0.6% with fewer parameters and supersedes the B/16 variant with only half the parameters; results with a GELU activation variant confirm the hypothesis, and MABViT variants show greater potential in deep transformers.
    Abstract Recent studies have demonstrated the effectiveness of Gated Linear Units (GLU) in enhancing transformer models, particularly in Large Language Models (LLMs). Additionally, utilizing a parallel configuration within each Transformer block rather than the conventional serialized method has been revealed to accelerate the training of LLMs without significantly impacting performance. However, when the MLP and attention block were run in parallel for the image classification task, we observed a noticeable decline in performance. We propose a novel transformer variant that integrates non-linearity within the attention block to tackle this problem. We implemented the GLU-based activation function on the Value tensor, and this new technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset while utilizing fewer parameters. It also supersedes the B/16 variant while using only half the parameters. Furthermore, we provide results with the GELU activation function variant to confirm our assertions. Lastly, we showcase that the MABViT variants exhibit greater potential when utilized in deep transformers compared to the standard architecture.
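A hedged PyTorch sketch of the described modification, applying a GLU-style gate to the Value tensor inside self-attention (the head layout and exact gate placement are assumptions based on the abstract):

```python
import torch
import torch.nn as nn

class GLUValueAttention(nn.Module):
    """Self-attention with a GLU-style non-linearity on the Value tensor."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.qk = nn.Linear(dim, dim * 2)
        self.v = nn.Linear(dim, dim * 2)   # doubled width for the GLU gate branch
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):                  # x: (B, N, dim)
        B, N, D = x.shape
        q, k = self.qk(x).chunk(2, dim=-1)
        v_in, gate = self.v(x).chunk(2, dim=-1)
        v = v_in * torch.sigmoid(gate)     # GLU: value gated by a sigmoid branch
        def split(t):
            return t.view(B, N, self.heads, D // self.heads).transpose(1, 2)
        q, k, v = map(split, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

block = GLUValueAttention(dim=384)
tokens = torch.randn(2, 197, 384)          # e.g. ViT patch tokens plus class token
print(block(tokens).shape)
```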

Few-shot Shape Recognition by Learning Deep Shape-aware Features

  • paper_url: http://arxiv.org/abs/2312.01315
  • repo_url: None
  • paper_authors: Wenlong Shi, Changsheng Lu, Ming Shao, Yinjie Zhang, Siyu Xia, Piotr Koniusz
  • for: Improves few-shot shape feature extraction and classification, with generalization to unseen shapes.
  • methods: Uses an embedding module to extract transformation-invariant shape features, and a dual attention mechanism to decompose and reconstruct shape features via learnable shape primitives; a decoding module supervised by shape masks and edges aligns the original and reconstructed features.
  • results: Experiments on five datasets show significant improvements over state-of-the-art methods, especially on few-shot shape classification.
    Abstract Traditional shape descriptors have been gradually replaced by convolutional neural networks due to their superior performance in feature extraction and classification. The state-of-the-art methods recognize object shapes via image reconstruction or pixel classification. However, these methods are biased toward texture information and overlook the essential shape descriptions, thus they fail to generalize to unseen shapes. We are the first to propose a few-shot shape descriptor (FSSD) to recognize object shapes given only one or a few samples. We employ an embedding module for FSSD to extract transformation-invariant shape features. Secondly, we develop a dual attention mechanism to decompose and reconstruct the shape features via learnable shape primitives. In this way, any shape can be formed through a finite basis set, and the learned representation model is highly interpretable and extendable to unseen shapes. Thirdly, we propose a decoding module to include the supervision of shape masks and edges and align the original and reconstructed shape features, enforcing the learned features to be more shape-aware. Lastly, all the proposed modules are assembled into a few-shot shape recognition scheme. Experiments on five datasets show that our FSSD significantly improves shape classification compared to the state-of-the-art under the few-shot setting.
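The decompose-and-reconstruct idea can be sketched as attention over a small learnable basis of shape primitives; the dimensions and the alignment loss below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PrimitiveDecomposer(nn.Module):
    """Decompose per-point shape features onto a learnable primitive basis via
    attention, then reconstruct them from that finite basis (a sketch of the
    decompose-and-reconstruct idea; sizes are assumptions)."""
    def __init__(self, dim=128, num_primitives=16):
        super().__init__()
        self.primitives = nn.Parameter(torch.randn(num_primitives, dim) * 0.02)
    def forward(self, feats):                   # feats: (B, N, dim)
        logits = feats @ self.primitives.t()    # affinity of each point to each primitive
        coeffs = logits.softmax(dim=-1)         # (B, N, P) decomposition weights
        recon = coeffs @ self.primitives        # reconstruction from the finite basis
        return recon, coeffs

dec = PrimitiveDecomposer()
feats = torch.randn(2, 256, 128)                # per-point shape features (B, N, dim)
recon, coeffs = dec(feats)
align_loss = (recon - feats).pow(2).mean()      # alignment keeps features shape-aware
```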

FlashAvatar: High-Fidelity Digital Avatar Rendering at 300FPS

  • paper_url: http://arxiv.org/abs/2312.02214
  • repo_url: None
  • paper_authors: Jun Xiang, Xuan Gao, Yudong Guo, Juyong Zhang
  • for: Proposes a novel lightweight 3D animatable avatar representation that reconstructs a high-quality digital avatar from a short monocular video sequence in minutes and renders at 300FPS on a consumer-grade GPU.
  • methods: Maintains a uniform 3D Gaussian field embedded in the surface of a parametric face model and learns extra spatial offsets to model non-surface regions and subtle facial details; full use of geometric priors captures high-frequency facial details and preserves exaggerated expressions.
  • results: Experiments show that FlashAvatar outperforms existing methods in visual quality and personalized details while rendering almost an order of magnitude faster. Project page: https://ustc3dv.github.io/FlashAvatar/
    Abstract We propose FlashAvatar, a novel and lightweight 3D animatable avatar representation that could reconstruct a digital avatar from a short monocular video sequence in minutes and render high-fidelity photo-realistic images at 300FPS on a consumer-grade GPU. To achieve this, we maintain a uniform 3D Gaussian field embedded in the surface of a parametric face model and learn extra spatial offset to model non-surface regions and subtle facial details. While full use of geometric priors can capture high-frequency facial details and preserve exaggerated expressions, proper initialization can help reduce the number of Gaussians, thus enabling super-fast rendering speed. Extensive experimental results demonstrate that FlashAvatar outperforms existing works regarding visual quality and personalized details and is almost an order of magnitude faster in rendering speed. Project page: https://ustc3dv.github.io/FlashAvatar/

SAGE: Bridging Semantic and Actionable Parts for GEneralizable Articulated-Object Manipulation under Language Instructions

  • paper_url: http://arxiv.org/abs/2312.01307
  • repo_url: None
  • paper_authors: Haoran Geng, Songlin Wei, Congyue Deng, Bokui Shen, He Wang, Leonidas Guibas
  • for: Provides a framework for generalizable manipulation of diverse articulated objects under language instructions, addressing real-world tasks with varied object structures, functionalities, and goals.
  • methods: Combines Large Language Models (LLMs) and Visual-Language Models (VLMs) with domain-specialist part perception models to bridge semantic understanding and actionable decomposition of articulated-object parts.
  • results: Experiments in simulation and on real robots show the framework handles a large variety of articulated objects with strong generalization; a new benchmark for language-guided articulated-object manipulation in realistic scenarios is also provided.
    Abstract Generalizable manipulation of articulated objects remains a challenging problem in many real-world scenarios, given the diverse object structures, functionalities, and goals. In these tasks, both semantic interpretations and physical plausibilities are crucial for a policy to succeed. To address this problem, we propose SAGE, a novel framework that bridges the understanding of semantic and actionable parts of articulated objects to achieve generalizable manipulation under language instructions. Given a manipulation goal specified by natural language, an instruction interpreter with Large Language Models (LLMs) first translates them into programmatic actions on the object's semantic parts. This process also involves a scene context parser for understanding the visual inputs, which is designed to generate scene descriptions with both rich information and accurate interaction-related facts by joining the forces of generalist Visual-Language Models (VLMs) and domain-specialist part perception models. To further convert the action programs into executable policies, a part grounding module then maps the object semantic parts suggested by the instruction interpreter into so-called Generalizable Actionable Parts (GAParts). Finally, an interactive feedback module is incorporated to respond to failures, which greatly increases the robustness of the overall framework. Experiments both in simulation environments and on real robots show that our framework can handle a large variety of articulated objects with diverse language-instructed goals. We also provide a new benchmark for language-guided articulated-object manipulation in realistic scenarios.

Portrait Diffusion: Training-free Face Stylization with Chain-of-Painting

  • paper_url: http://arxiv.org/abs/2312.02212
  • repo_url: https://github.com/liujin112/portraitdiffusion
  • paper_authors: Jin Liu, Huaibo Huang, Chao Jin, Ran He
  • for: Targets training-free face stylization, i.e., transforming a face into a specific portrait style.
  • methods: Proposes a training-free face stylization framework, named Portrait Diffusion, which leverages off-the-shelf text-to-image diffusion models without example-specific fine-tuning.
  • results: Content and style features are precisely blended in attention space via a modified self-attention operation called Style Attention Control, and a Chain-of-Painting method gradually redraws unsatisfactory regions from rough adjustments to fine-tuning; experiments validate the effectiveness of Portrait Diffusion and the advantage of Chain-of-Painting for precise face stylization.
    Abstract Face stylization refers to the transformation of a face into a specific portrait style. However, current methods require example-based adaptation approaches to fine-tune pre-trained generative models, which demands substantial time and storage space and fails to achieve detailed style transformation. This paper proposes a training-free face stylization framework, named Portrait Diffusion. This framework leverages off-the-shelf text-to-image diffusion models, eliminating the need for fine-tuning on specific examples. Specifically, the content and style images are first inverted into latent codes. Then, during image reconstruction using the corresponding latent code, the content and style features in the attention space are delicately blended through a modified self-attention operation called Style Attention Control. Additionally, a Chain-of-Painting method is proposed for the gradual redrawing of unsatisfactory areas from rough adjustments to fine-tuning. Extensive experiments validate the effectiveness of our Portrait Diffusion method and demonstrate the superiority of Chain-of-Painting in achieving precise face stylization. Code will be released at https://github.com/liujin112/PortraitDiffusion.
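One way to read Style Attention Control is as self-attention whose keys and values are extended with style features, with a prior controlling how much attention mass may flow to the style tokens. The sketch below is an assumption-laden approximation, not the paper's exact operation:

```python
import torch

def style_attention(q_content, kv_content, kv_style, style_ratio=0.5):
    """Blend content and style in attention space: queries come from the content
    branch; keys/values concatenate content and style features. style_ratio is a
    softmax prior on how much mass may flow to style tokens (an assumption).
    All tensors: (B, N, dim)."""
    k = torch.cat([kv_content, kv_style], dim=1)
    v = torch.cat([kv_content, kv_style], dim=1)
    scale = q_content.shape[-1] ** -0.5
    attn = (q_content @ k.transpose(-2, -1)) * scale      # (B, N, 2N)
    n = kv_content.shape[1]
    bias = torch.zeros(1, 1, k.shape[1])                  # log-prior over token groups
    bias[..., :n] = torch.log(torch.tensor(1 - style_ratio))
    bias[..., n:] = torch.log(torch.tensor(style_ratio))
    return (attn + bias).softmax(-1) @ v

out = style_attention(torch.randn(1, 64, 320), torch.randn(1, 64, 320), torch.randn(1, 64, 320))
```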

Stable Messenger: Steganography for Message-Concealed Image Generation

  • paper_url: http://arxiv.org/abs/2312.01284
  • repo_url: None
  • paper_authors: Quang Nguyen, Truong Vu, Cuong Pham, Anh Tran, Khoi Nguyen
  • for: Focuses on digital protection, specifically steganography for message-concealed image generation.
  • methods: Introduces "message accuracy", a new metric evaluating decoded messages in their entirety; proposes an adaptive universal Log-Sum-Exponential (LSE) loss to improve message accuracy; and develops a new latent-aware encoding technique.
  • results: Experiments confirm that both the new LSE loss and the latent-aware encoding technique markedly improve message accuracy, marking progress in evaluation metrics, loss functions, and image concealment for more robust and dependable information protection.
    Abstract In the ever-expanding digital landscape, safeguarding sensitive information remains paramount. This paper delves deep into digital protection, specifically focusing on steganography. While prior research predominantly fixated on individual bit decoding, we address this limitation by introducing "message accuracy", a novel metric evaluating the entirety of decoded messages for a more holistic evaluation. In addition, we propose an adaptive universal loss tailored to enhance message accuracy, named Log-Sum-Exponential (LSE) loss, thereby significantly improving the message accuracy of recent approaches. Furthermore, we also introduce a new latent-aware encoding technique in our framework, named Stable Messenger, harnessing pretrained Stable Diffusion for advanced steganographic image generation, giving rise to a better trade-off between image quality and message recovery. Throughout our experimental results, we have demonstrated the superior performance of the new LSE loss and latent-aware encoding technique. This comprehensive approach marks a significant step in evolving evaluation metrics, refining loss functions, and innovating image concealment techniques, aiming for more robust and dependable information protection.
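The abstract does not spell out the LSE loss, but one plausible reading is a smooth maximum (log-sum-exp) over per-bit losses, so the worst-decoded bits dominate the gradient and whole-message accuracy improves. A hedged sketch of that reading, together with the holistic "message accuracy" metric:

```python
import torch
import torch.nn.functional as F

def lse_message_loss(logits, bits, tau=1.0):
    """A speculative Log-Sum-Exponential message loss: a smooth max over per-bit
    binary cross-entropies (one plausible reading; the paper's exact form may differ).
    logits, bits: (B, L) with bits in {0, 1}."""
    per_bit = F.binary_cross_entropy_with_logits(logits, bits.float(), reduction="none")
    return tau * torch.logsumexp(per_bit / tau, dim=1).mean()

def message_accuracy(logits, bits):
    """Fraction of messages whose *entire* bit string decodes correctly,
    matching the paper's holistic metric."""
    pred = (logits > 0).long()
    return (pred == bits).all(dim=1).float().mean()

logits = torch.randn(16, 100)                  # decoder outputs for 100-bit messages
bits = torch.randint(0, 2, (16, 100))
print(lse_message_loss(logits, bits), message_accuracy(logits, bits))
```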

Deeper into Self-Supervised Monocular Indoor Depth Estimation

  • paper_url: http://arxiv.org/abs/2312.01283
  • repo_url: https://github.com/fcntes/indoordepth
  • paper_authors: Chao Fan, Zhenyu Yin, Yue Li, Feiqing Zhang
  • for: Improves the accuracy of self-supervised indoor depth estimation with Convolutional Neural Networks (CNNs).
  • methods: Two innovations: first, an improved photometric loss with a refined structural similarity (SSIM) function to tackle low-texture regions; second, multiple photometric losses at different stages to train a deeper pose network with two residual pose blocks for better ego-motion prediction.
  • results: Experiments show the method outperforms previous state-of-the-art methods on the NYUv2 benchmark, and its generalization is validated on the ScanNet dataset. Code is available at https://github.com/fcntes/IndoorDepth.
    Abstract Monocular depth estimation using Convolutional Neural Networks (CNNs) has shown impressive performance in outdoor driving scenes. However, self-supervised learning of indoor depth from monocular sequences is quite challenging for researchers because of the following two main reasons. One is the large areas of low-texture regions and the other is the complex ego-motion on indoor training datasets. In this work, our proposed method, named IndoorDepth, consists of two innovations. In particular, we first propose a novel photometric loss with improved structural similarity (SSIM) function to tackle the challenge from low-texture regions. Moreover, in order to further mitigate the issue of inaccurate ego-motion prediction, multiple photometric losses at different stages are used to train a deeper pose network with two residual pose blocks. A subsequent ablation study validates the effectiveness of each new idea. Experiments on the NYUv2 benchmark demonstrate that our IndoorDepth outperforms the previous state-of-the-art methods by a large margin. In addition, we also validate the generalization ability of our method on the ScanNet dataset. Code is available at https://github.com/fcntes/IndoorDepth.
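As context for the first innovation, here is the common SSIM + L1 photometric loss that self-supervised depth pipelines start from (IndoorDepth modifies the SSIM term for low-texture regions; that modification is not reproduced here). Constants assume images in [0, 1]:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified single-scale SSIM with 3x3 average pooling (a common variant
    in self-supervised depth pipelines). x, y: (B, C, H, W)."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sx = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sy = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sxy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sx + sy + C2)
    return num / den

def photometric_loss(pred, target, alpha=0.85):
    """Baseline photometric loss: alpha * (1 - SSIM) / 2 + (1 - alpha) * L1."""
    ssim_term = torch.clamp((1 - ssim(pred, target)) / 2, 0, 1)
    l1_term = (pred - target).abs()
    return (alpha * ssim_term + (1 - alpha) * l1_term).mean()

loss = photometric_loss(torch.rand(2, 3, 192, 256), torch.rand(2, 3, 192, 256))
```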

Cycle-consistent Generative Adversarial Network Synthetic CT for MR-only Adaptive Radiation Therapy on MR-Linac

  • paper_url: http://arxiv.org/abs/2312.02211
  • repo_url: None
  • paper_authors: Gabriel L. Asher, Bassem I. Zaki, Gregory A. Russo, Gobind S. Gill, Charles R. Thomas, Temiloluwa O. Prioleau, Rongxiao Zhang, Brady Hunt
  • For: This study assesses the effectiveness of Deep Learning (DL) for creating synthetic Computed Tomography (sCT) images in MR-guided adaptive radiation therapy (MRgART).
  • Methods: A Cycle-GAN model trained with MRI and CT scan slices from MR-LINAC treatments generates sCT volumes; the model was tested on 357 sCT frames from 17 patients.
  • Results: The generated sCTs were comparable to dCTs in electron density and structural similarity with MRI scans; dosimetric evaluations indicated minimal differences between sCTs and dCTs, with sCTs showing better air-bubble reconstruction.
    Abstract Purpose: This study assesses the effectiveness of Deep Learning (DL) for creating synthetic CT (sCT) images in MR-guided adaptive radiation therapy (MRgART). Methods: A Cycle-GAN model was trained with MRI and CT scan slices from MR-LINAC treatments, generating sCT volumes. The analysis involved retrospective treatment plan data from patients with various tumors. sCT images were compared with standard CT scans using mean absolute error in Hounsfield Units (HU) and image similarity metrics (SSIM, PSNR, NCC). sCT volumes were integrated into a clinical treatment system for dosimetric re-evaluation. Results: The model, trained on 8405 frames from 57 patients and tested on 357 sCT frames from 17 patients, showed sCTs comparable to dCTs in electron density and structural similarity with MRI scans. The MAE between sCT and dCT was 49.2 +/- 13.2 HU, with sCT NCC exceeding dCT by 0.06, and SSIM and PSNR at 0.97 +/- 0.01 and 19.9 +/- 1.6 respectively. Dosimetric evaluations indicated minimal differences between sCTs and dCTs, with sCTs showing better air-bubble reconstruction. Conclusions: DL-based sCT generation on MR-Linacs is accurate for dose calculation and optimization in MRgART. This could facilitate MR-only treatment planning, enhancing simulation and adaptive planning efficiency on MR-Linacs.
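The reported image-similarity metrics are straightforward to compute; a NumPy sketch for aligned sCT/dCT volumes in Hounsfield Units:

```python
import numpy as np

def mae_hu(sct, dct):
    """Mean absolute error in Hounsfield Units between synthetic and deformed CT."""
    return np.abs(sct - dct).mean()

def psnr(sct, dct, data_range=None):
    """Peak signal-to-noise ratio; data_range defaults to the dCT dynamic range."""
    if data_range is None:
        data_range = dct.max() - dct.min()
    mse = ((sct - dct) ** 2).mean()
    return 10 * np.log10(data_range ** 2 / mse)

def ncc(sct, dct):
    """Normalized cross-correlation between the two volumes."""
    a = (sct - sct.mean()) / sct.std()
    b = (dct - dct.mean()) / dct.std()
    return (a * b).mean()

# Toy example (real use: registered sCT/dCT arrays in HU).
sct = np.random.normal(0, 100, (64, 64, 64))
dct = sct + np.random.normal(0, 10, sct.shape)
print(mae_hu(sct, dct), psnr(sct, dct), ncc(sct, dct))
```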

Brain Decodes Deep Nets

  • paper_url: http://arxiv.org/abs/2312.01280
  • repo_url: https://github.com/huzeyann/braindecodesdeepnets
  • paper_authors: Huzheng Yang, James Gee, Jianbo Shi
  • for: Develops a tool for visualizing and analyzing large pre-trained vision models by mapping them onto the brain, exposing their hidden internal structure.
  • methods: Uses brain encoding, i.e., predicting brain fMRI measurements in response to images. Explicit mapping between brain and deep-network features across the dimensions of space, layers, scales, and channels proves crucial; this mapping method, FactorTopy, is plug-and-play for any deep network and can paint a picture of the network onto the brain (literally!).
  • results: The visualization shows that different training methods lead to remarkable differences in hierarchical organization and scaling behavior, growing with more data or network capacity; it also provides insight into fine-tuning, i.e., how pre-trained models change when adapting to small datasets. The method is practical: only 3K images suffice to learn a network-to-brain mapping.
    Abstract We developed a tool for visualizing and analyzing large pre-trained vision models by mapping them onto the brain, thus exposing their hidden internals. Our innovation arises from a surprising usage of brain encoding: predicting brain fMRI measurements in response to images. We report two findings. First, explicit mapping between the brain and deep-network features across dimensions of space, layers, scales, and channels is crucial. This mapping method, FactorTopy, is plug-and-play for any deep network; with it, one can paint a picture of the network onto the brain (literally!). Second, our visualization shows how different training methods matter: they lead to remarkable differences in hierarchical organization and scaling behavior, growing with more data or network capacity. It also provides insight into fine-tuning: how pre-trained models change when adapting to small datasets. Our method is practical: only 3K images are enough to learn a network-to-brain mapping.

Learning to Compose SuperWeights for Neural Parameter Allocation Search

  • paper_url: http://arxiv.org/abs/2312.01274
  • repo_url: https://github.com/piotr-teterwak/SuperWeights
  • paper_authors: Piotr Teterwak, Soren Nelson, Nikoli Dryden, Dina Bashkirova, Kate Saenko, Bryan A. Plummer
  • for: Aims to automate neural network parameter allocation to realize parameter sharing under an arbitrary, fixed parameter budget.
  • methods: Generates layer weights by learning to compose sets of SuperWeights, each representing a group of trainable parameters, and uses gradient information to identify shared-weight layers that wish to diverge from each other.
  • results: SuperWeight Networks consistently boost performance over the state-of-the-art on the ImageNet and CIFAR datasets, and can generate parameters for many network architectures from the same set of weights, supporting efficient ensembling and anytime prediction while outperforming fully-parameterized ensembles with 17% fewer parameters.
    Abstract Neural parameter allocation search (NPAS) automates parameter sharing by obtaining weights for a network given an arbitrary, fixed parameter budget. Prior work has two major drawbacks we aim to address. First, there is a disconnect in the sharing pattern between the search and training steps, where weights are warped for layers of different sizes during the search to measure similarity, but not during training, resulting in reduced performance. To address this, we generate layer weights by learning to compose sets of SuperWeights, which represent a group of trainable parameters. These SuperWeights are created to be large enough so they can be used to represent any layer in the network, but small enough that they are computationally efficient. The second drawback we address is the method of measuring similarity between shared parameters. Whereas prior work compared the weights themselves, we argue this does not take into account the amount of conflict between the shared weights. Instead, we use gradient information to identify layers with shared weights that wish to diverge from each other. We demonstrate that our SuperWeight Networks consistently boost performance over the state-of-the-art on the ImageNet and CIFAR datasets in the NPAS setting. We further show that our approach can generate parameters for many network architectures using the same set of weights. This enables us to support tasks like efficient ensembling and anytime prediction, outperforming fully-parameterized ensembles with 17% fewer parameters.
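A speculative sketch of the SuperWeights idea: a shared bank of templates from which each layer's weight tensor is materialized via learned combination coefficients, so total parameters stay within a fixed budget. The combination and reshaping rules below are assumptions, not the paper's exact scheme:

```python
import torch
import torch.nn as nn

class SuperWeightBank(nn.Module):
    """Shared bank of SuperWeight templates; each layer materializes its weight
    tensor as a learned linear combination of templates, tiled and truncated to
    the layer's size (a hedged sketch of budgeted parameter sharing)."""
    def __init__(self, num_templates=8, template_size=4096):
        super().__init__()
        self.templates = nn.Parameter(torch.randn(num_templates, template_size) * 0.02)
    def materialize(self, coeffs, shape):
        """coeffs: (num_templates,) learned per layer; shape: target weight shape."""
        flat = coeffs @ self.templates                 # combine templates -> 1D vector
        n = int(torch.tensor(shape).prod())
        reps = -(-n // flat.numel())                   # ceil-divide, then tile and cut
        return flat.repeat(reps)[:n].view(*shape)

bank = SuperWeightBank()
coeffs = nn.Parameter(torch.randn(8))                  # one coefficient vector per layer
w_fc = bank.materialize(coeffs, (64, 32))              # weights for a 32 -> 64 linear layer
out = torch.randn(5, 32) @ w_fc.t()
```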

AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing

  • paper_url: http://arxiv.org/abs/2312.02209
  • repo_url: None
  • paper_authors: Fan Yang, Tianyi Chen, Xiaosheng He, Zhongang Cai, Lei Yang, Si Wu, Guosheng Lin
  • for: Proposes an interactively editable 3D human generation model, addressing the problems of low-accuracy local editing and high computational cost in existing editable 3D GANs.
  • methods: Uses attribute decomposition and indexing: all attributes (e.g., human body, hair, clothes) are generated in an overall attribute space with six feature planes, then decomposed and manipulated with different attribute indexes.
  • results: The model allows users to interactively edit selected attributes in the generated 3D human avatars while keeping others fixed, providing strong disentanglement, fine-grained editing, and high-quality 3D human avatars.
    Abstract Editable 3D-aware generation, which supports user-interacted editing, has witnessed rapid development recently. However, existing editable 3D GANs either fail to achieve high-accuracy local editing or suffer from huge computational costs. We propose AttriHuman-3D, an editable 3D human generation model, which addresses the aforementioned problems with attribute decomposition and indexing. The core idea of the proposed model is to generate all attributes (e.g. human body, hair, clothes and so on) in an overall attribute space with six feature planes, which are then decomposed and manipulated with different attribute indexes. To precisely extract features of different attributes from the generated feature planes, we propose a novel attribute indexing method as well as an orthogonal projection regularization to enhance the disentanglement. We also introduce a hyper-latent training strategy and an attribute-specific sampling strategy to avoid style entanglement and misleading punishment from the discriminator. Our method allows users to interactively edit selected attributes in the generated 3D human avatars while keeping others fixed. Both qualitative and quantitative experiments demonstrate that our model provides a strong disentanglement between different attributes, allows fine-grained image editing and generates high-quality 3D human avatars.

A Review and A Robust Framework of Data-Efficient 3D Scene Parsing with Traditional/Learned 3D Descriptors

  • paper_url: http://arxiv.org/abs/2312.01262
  • repo_url: None
  • paper_authors: Kangcheng Liu
  • for: Presents a general and simple framework for point cloud understanding tasks with limited labels.
  • methods: Combines adapted traditional PFH-based 3D descriptors, which show excellent generalization across domains, with learned semantics, using a learning-based region merging strategy that considers both low-level geometric and high-level semantic feature correlations.
  • results: The framework achieves the best performance on the three most important weakly supervised point cloud understanding tasks, including semantic segmentation, instance segmentation, and object detection; the method, Region Merging 3D (RM3D), outperforms current state of the art on the ScanNet data-efficient learning online benchmark and four other large-scale 3D understanding benchmarks under various experimental settings.
    Abstract Existing state-of-the-art 3D point cloud understanding methods merely perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework that simultaneously solves the downstream high-level understanding tasks including both segmentation and detection, especially when labels are extremely limited. This work presents a general and simple framework to tackle point cloud understanding when labels are limited. The first contribution is that we have done extensive methodology comparisons of traditional and learned 3D descriptors for the task of weakly supervised 3D scene understanding, and validated that our adapted traditional PFH-based 3D descriptors show excellent generalization ability across different domains. The second contribution is that we proposed a learning-based region merging strategy based on the affinity provided by both the traditional/learned 3D descriptors and learned semantics. The merging process takes both low-level geometric and high-level semantic feature correlations into consideration. Experimental results demonstrate that our framework has the best performance among the three most important weakly supervised point cloud understanding tasks including semantic segmentation, instance segmentation, and object detection even when a very limited number of points are labeled. Our method, termed Region Merging 3D (RM3D), has superior performance on the ScanNet data-efficient learning online benchmark and four other large-scale 3D understanding benchmarks under various experimental settings, outperforming the current state of the art by a margin for various 3D understanding tasks without complicated learning strategies such as active learning.
    摘要 现有的State-of-the-art 3D点云理解方法只能在完全监督的情况下表现良好。据我们所知,到目前为止没有一个统一的框架,可以同时解决下游高级理解任务,包括分割和检测,特别是当标签是非常有限的时候。这个工作提出了一个通用和简单的框架,用于解决点云理解当标签是有限的情况。我们的首要贡献是对传统和学习得到的3D描述符进行了广泛的方法比较,并证明了我们适应传统PFH基于的3D描述符在不同领域的总体化能力很好。我们的第二个贡献是基于传统/学习得到的3D描述符和学习得到的语义的学习型Region Merging策略。该 merge 过程会考虑低级 geometric 和高级 semantics 特征相关性。实验结果表明,我们的框架在无标签点云理解三大任务中的semantic segmentation、instance segmentation和对象检测任务中表现最佳,即使标签非常有限。我们的方法,称为Region Merging 3D(RM3D),在ScanNet数据高效学习在线Benchmark和其他四个大规模3D理解Benchmark上表现出色,超过当前艺术品的margin。

A Data-efficient Framework for Robotics Large-scale LiDAR Scene Parsing

  • paper_url: http://arxiv.org/abs/2312.02208
  • repo_url: None
  • paper_authors: Kangcheng Liu
  • for: Proposes a general and simple framework for 3D point cloud understanding with limited labels.
  • methods: Proposes a novel unsupervised region-expansion-based clustering method for generating clusters, and innovatively learns to merge over-divided clusters based on local low-level geometric property similarities and learned high-level feature similarities, guided by weak labels.
  • results: The framework achieves the best performance on the three most important weakly supervised 3D point cloud understanding tasks, including semantic segmentation, instance segmentation, and object detection, under data-efficient settings for large-scale 3D semantic scene parsing; the developed techniques have the potential to improve representations for downstream tasks. Codes and models are available at: https://github.com/KangchengLiu.
    Abstract Existing state-of-the-art 3D point cloud understanding methods only perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework which simultaneously solves the downstream high-level understanding tasks, especially when labels are extremely limited. This work presents a general and simple framework to tackle point cloud understanding when labels are limited. We propose a novel unsupervised region expansion based clustering method for generating clusters. More importantly, we innovatively propose to learn to merge the over-divided clusters based on the local low-level geometric property similarities and the learned high-level feature similarities supervised by weak labels. Hence, the true weak labels guide pseudo-label merging taking both geometric and semantic feature correlations into consideration. Finally, the self-supervised reconstruction and data augmentation optimization modules are proposed to guide the propagation of labels among semantically similar points within a scene. Experimental results demonstrate that our framework has the best performance among the three most important weakly supervised point cloud understanding tasks including semantic segmentation, instance segmentation, and object detection even when limited points are labeled, under the data-efficient settings for large-scale 3D semantic scene parsing. The developed techniques have potential to be applied to downstream tasks for better representations in robotic manipulation and robotic autonomous navigation. Codes and models are publicly available at: https://github.com/KangchengLiu.

TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models

  • paper_url: http://arxiv.org/abs/2312.01261
  • repo_url: None
  • paper_authors: Aditya Chinchure, Pushkar Shukla, Gaurav Bhatt, Kiri Salij, Kartik Hosanagar, Leonid Sigal, Matthew Turk
  • for: Studies and evaluates a broad spectrum of biases in text-to-image generative models, including societal biases such as gender and ethnicity.
  • methods: Proposes a general counterfactual-reasoning approach that, for any TTI model and any prompt, automatically identifies potential biases relevant to the given prompt and measures them, providing post-hoc explanations in terms of semantic concepts in the generated images.
  • results: Experiments show the counterfactual-reasoning technique accurately detects and evaluates biases in TTI models, explains complex multi-dimensional biases through semantic concepts, and reveals the intersectionality between different biases for a given prompt.
    Abstract Text-to-Image (TTI) generative models have shown great progress in the past few years in terms of their ability to generate complex and high-quality imagery. At the same time, these models have been shown to suffer from harmful biases, including exaggerated societal biases (e.g., gender, ethnicity), as well as incidental correlations that limit such model's ability to generate more diverse imagery. In this paper, we propose a general approach to study and quantify a broad spectrum of biases, for any TTI model and for any prompt, using counterfactual reasoning. Unlike other works that evaluate generated images on a predefined set of bias axes, our approach automatically identifies potential biases that might be relevant to the given prompt, and measures those biases. In addition, our paper extends quantitative scores with post-hoc explanations in terms of semantic concepts in the images generated. We show that our method is uniquely capable of explaining complex multi-dimensional biases through semantic concepts, as well as the intersectionality between different biases for any given prompt. We perform extensive user studies to illustrate that the results of our method and analysis are consistent with human judgements.
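A toy sketch of counterfactual prompt generation: substitute alternative attribute terms along candidate bias axes and compare what the TTI model generates for each variant. The axes and vocabularies here are illustrative assumptions; TIBET identifies relevant axes automatically per prompt:

```python
from itertools import product

# Illustrative bias axes; an empty string drops the attribute from the prompt.
BIAS_AXES = {
    "gender": ["man", "woman", "person"],
    "age": ["young", "elderly", ""],
}

def counterfactual_prompts(template: str) -> list:
    """template uses {gender}/{age} slots, e.g. 'a photo of a {age} {gender} doctor'."""
    variants = []
    for values in product(*BIAS_AXES.values()):
        filled = template.format(**dict(zip(BIAS_AXES, values)))
        variants.append(" ".join(filled.split()))  # collapse doubled spaces
    return variants

for p in counterfactual_prompts("a photo of a {age} {gender} doctor"):
    print(p)  # each variant would be rendered by the TTI model and the outputs compared
```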

Meta ControlNet: Enhancing Task Adaptation via Meta Learning

  • paper_url: http://arxiv.org/abs/2312.01255
  • repo_url: https://github.com/junjieyang97/meta-controlnet
  • paper_authors: Junjie Yang, Jinze Zhao, Peihao Wang, Zhangyang Wang, Yingbin Liang
  • for: Proposes a diffusion-based image generation method that adapts to diverse control tasks.
  • methods: Adopts task-agnostic meta learning with a new layer-freezing design to learn generalizable control ability across tasks.
  • results: Attains control ability without per-task training: direct zero-shot adaptation in edge-based tasks, and control within only 100 fine-tuning steps on more complex non-edge tasks.
    Abstract Diffusion-based image synthesis has attracted extensive attention recently. In particular, ControlNet, which uses image-based prompts, exhibits powerful capability in image tasks such as canny edge detection and generates images well aligned with these prompts. However, vanilla ControlNet generally requires extensive training of around 5000 steps to achieve a desirable control for a single task. Recent context-learning approaches have improved its adaptability, but mainly for edge-based tasks, and rely on paired examples. Thus, two important open issues are yet to be addressed to reach the full potential of ControlNet: (i) zero-shot control for certain tasks and (ii) faster adaptation for non-edge-based tasks. In this paper, we introduce a novel Meta ControlNet method, which adopts the task-agnostic meta learning technique and features a new layer freezing design. Meta ControlNet significantly reduces the learning steps needed to attain control ability from 5000 to 1000. Further, Meta ControlNet exhibits direct zero-shot adaptability in edge-based tasks without any finetuning, and achieves control within only 100 finetuning steps in more complex non-edge tasks such as Human Pose, outperforming all existing methods. The code is available at https://github.com/JunjieYang97/Meta-ControlNet.
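A hedged sketch of a task-agnostic meta step in the spirit described: Reptile-style inner/outer updates in which parameters matching a freeze list stay fixed during inner adaptation. The concrete freezing scheme, losses, and hyperparameters are assumptions, not the paper's:

```python
import copy
import torch

def meta_update(model, tasks, inner_steps=5, inner_lr=1e-4, meta_lr=0.1,
                freeze=("zero_convs",)):
    """One Reptile-style meta step with a layer-freezing design: parameters whose
    names match `freeze` are excluded from inner-loop adaptation. Each element of
    `tasks` is a closure mapping a model to a scalar task loss."""
    for task_loss_fn in tasks:
        fast = copy.deepcopy(model)                 # per-task fast weights
        params = [p for n, p in fast.named_parameters()
                  if not any(f in n for f in freeze)]
        opt = torch.optim.SGD(params, lr=inner_lr)
        for _ in range(inner_steps):                # inner-loop adaptation
            opt.zero_grad()
            task_loss_fn(fast).backward()
            opt.step()
        with torch.no_grad():                       # Reptile outer update toward fast weights
            for p, pf in zip(model.parameters(), fast.parameters()):
                p.add_(meta_lr * (pf - p))
```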

TranSegPGD: Improving Transferability of Adversarial Examples on Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2312.02207
  • repo_url: None
  • paper_authors: Xiaojun Jia, Jindong Gu, Yihao Huang, Simeng Qin, Qing Guo, Yang Liu, Xiaochun Cao
  • for: Investigates the transferability of adversarial examples in semantic segmentation and proposes an effective two-stage adversarial attack strategy, dubbed TranSegPGD.
  • methods: In the first stage, every pixel of the input image is assigned to a branch according to its adversarial property, and each branch receives a different optimization weight to improve the adversarial effect on all pixels; in the second stage, pixels are branched by their transferability, measured via Kullback-Leibler divergence, and weighted to improve the transferability of the adversarial examples.
  • results: Experiments on the PASCAL VOC 2012 and Cityscapes datasets show that the proposed attack achieves state-of-the-art performance.
    Abstract Transferability of adversarial examples on image classification has been systematically explored, which generates adversarial examples in black-box mode. However, the transferability of adversarial examples on semantic segmentation has been largely overlooked. In this paper, we propose an effective two-stage adversarial attack strategy to improve the transferability of adversarial examples on semantic segmentation, dubbed TranSegPGD. Specifically, at the first stage, every pixel in an input image is divided into different branches based on its adversarial property. Different branches are assigned different weights for optimization to improve the adversarial performance of all pixels. We assign high weights to the loss of the hard-to-attack pixels to misclassify all pixels. At the second stage, the pixels are divided into different branches based on their transferable property, which is dependent on Kullback-Leibler divergence. Different branches are assigned different weights for optimization to improve the transferability of the adversarial examples. We assign high weights to the loss of the high-transferability pixels to improve the transferability of adversarial examples. Extensive experiments with various segmentation models are conducted on the PASCAL VOC 2012 and Cityscapes datasets to demonstrate the effectiveness of the proposed method. The proposed adversarial attack method can achieve state-of-the-art performance.
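A simplified sketch of a first-stage step: the per-pixel cross-entropy is re-weighted so that not-yet-fooled ("hard to attack") pixels receive a larger weight before the PGD update. The weighting schedule is a simplification of the paper's branch design:

```python
import torch
import torch.nn.functional as F

def weighted_pgd_step(model, x_adv, x_clean, y, eps=8 / 255, alpha=2 / 255, hard_w=2.0):
    """One weighted PGD step for a segmentation attack.
    x_adv, x_clean: (B, 3, H, W) in [0, 1]; y: (B, H, W) labels."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    logits = model(x_adv)                                     # (B, C, H, W)
    per_pixel = F.cross_entropy(logits, y, reduction="none")  # (B, H, W)
    with torch.no_grad():
        correct = logits.argmax(1) == y                       # pixels not yet fooled
        weights = torch.where(correct,
                              torch.full_like(per_pixel, hard_w),
                              torch.ones_like(per_pixel))
    loss = (weights * per_pixel).mean()
    grad, = torch.autograd.grad(loss, x_adv)
    with torch.no_grad():
        x_adv = x_adv + alpha * grad.sign()                   # ascend the weighted loss
        x_adv = x_clean + (x_adv - x_clean).clamp(-eps, eps)  # project to eps-ball
        return x_adv.clamp(0, 1)
```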

cs.AI - 2023-12-03

Revisiting Non-separable Binary Classification and its Applications in Anomaly Detection

  • paper_url: http://arxiv.org/abs/2312.01541
  • repo_url: https://github.com/mattlaued/xor-is-linearly-classifiable
  • paper_authors: Matthew Lau, Ismaila Seck, Athanasios P Meliopoulos, Wenke Lee, Eugene Ndiaye
  • for: Shows that linear classification of the XOR problem is indeed possible.
  • methods: Proposes a different classification paradigm, equality separation, which adapts the SVM objective to distinguish data within or outside the margin; the classifier can be integrated into neural network pipelines with a smooth approximation.
  • results: Equality separation is suitable for anomaly detection; this notion is formalized via "closing numbers", a quantitative measure of a classifier's capacity to form closed decision regions, and supervised anomaly detection experiments show that equality separation can detect both seen and unseen anomalies.
    Abstract The inability to linearly classify XOR has motivated much of deep learning. We revisit this age-old problem and show that linear classification of XOR is indeed possible. Instead of separating data between halfspaces, we propose a slightly different paradigm, equality separation, that adapts the SVM objective to distinguish data within or outside the margin. Our classifier can then be integrated into neural network pipelines with a smooth approximation. From its properties, we intuit that equality separation is suitable for anomaly detection. To formalize this notion, we introduce closing numbers, a quantitative measure on the capacity for classifiers to form closed decision regions for anomaly detection. Springboarding from this theoretical connection between binary classification and anomaly detection, we test our hypothesis on supervised anomaly detection experiments, showing that equality separation can detect both seen and unseen anomalies.
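A small sketch makes the headline claim concrete: an equality separator scores proximity to the hyperplane w·x + b = 0 (a Gaussian bump serves as the smooth approximation mentioned in the abstract; the exact form used in the paper may differ), and with w = (1, 1), b = -1 the XOR-positive points lie exactly on the hyperplane:

```python
import torch
import torch.nn as nn

class EqualitySeparator(nn.Module):
    """Score how close a point lies to a learned hyperplane: ~1 on the plane,
    ~0 far away (sigma controls how sharp the 'margin' is)."""
    def __init__(self, in_dim, sigma=1.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, 1)
        self.sigma = sigma
    def forward(self, x):
        margin = self.linear(x).squeeze(-1)
        return torch.exp(-(margin / self.sigma) ** 2)

# XOR check with hand-set parameters w = (1, 1), b = -1:
sep = EqualitySeparator(2, sigma=0.5)
with torch.no_grad():
    sep.linear.weight.copy_(torch.tensor([[1.0, 1.0]]))
    sep.linear.bias.copy_(torch.tensor([-1.0]))
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
print(sep(X))  # high for the XOR-positive points (0,1), (1,0); low for (0,0), (1,1)
```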

Unlocking the Potential of Federated Learning: The Symphony of Dataset Distillation via Deep Generative Latents

  • paper_url: http://arxiv.org/abs/2312.01537
  • repo_url: https://github.com/feddg23/feddg-main
  • paper_authors: Yuqi Jia, Saeed Vahidian, Jingwei Sun, Jianyi Zhang, Vyacheslav Kungurtsev, Neil Zhenqiang Gong, Yiran Chen
  • for: Addresses data heterogeneity in federated learning with an efficient server-side dataset distillation framework that reduces computation and communication demands on local devices while enhancing clients' privacy.
  • methods: Leverages prior knowledge from pre-trained deep generative models on the server to synthesize essential data representations from a heterogeneous model architecture, allowing local devices to train smaller surrogate models while a larger global model is trained on the server, minimizing resource utilization.
  • results: Experiments show an accuracy enhancement of up to 40% over non-dataset-distillation techniques in highly heterogeneous FL contexts, surpassing existing dataset-distillation methods by 18%; the framework also converges faster than the baselines because the server trains on a multi-modal distribution rather than on several heterogeneous data distributions.
    Abstract Data heterogeneity presents significant challenges for federated learning (FL). Recently, dataset distillation techniques have been introduced, and performed at the client level, to attempt to mitigate some of these challenges. In this paper, we propose a highly efficient FL dataset distillation framework on the server side, significantly reducing both the computational and communication demands on local devices while enhancing the clients' privacy. Unlike previous strategies that perform dataset distillation on local devices and upload synthetic data to the server, our technique enables the server to leverage prior knowledge from pre-trained deep generative models to synthesize essential data representations from a heterogeneous model architecture. This process allows local devices to train smaller surrogate models while enabling the training of a larger global model on the server, effectively minimizing resource utilization. We substantiate our claim with a theoretical analysis, demonstrating the asymptotic resemblance of the process to the hypothetical ideal of completely centralized training on a heterogeneous dataset. Empirical evidence from our comprehensive experiments indicates our method's superiority, delivering an accuracy enhancement of up to 40% over non-dataset-distillation techniques in highly heterogeneous FL contexts, and surpassing existing dataset-distillation methods by 18%. In addition to the high accuracy, our framework converges faster than the baselines because rather than the server trains on several sets of heterogeneous data distributions, it trains on a multi-modal distribution. Our code is available at https://github.com/FedDG23/FedDG-main.git

NovoMol: Recurrent Neural Network for Orally Bioavailable Drug Design and Validation on PDGFRα Receptor

  • paper_url: http://arxiv.org/abs/2312.01527
  • repo_url: https://github.com/ishirraov/novomol
  • paper_authors: Ishir Rao
  • For: Improves the efficiency of clinical trials by addressing the long timelines and low success rates of drug candidates in the pharmaceutical industry.
  • Methods: Uses recurrent neural networks to mass-generate drug molecules, ranks them with the quantitative estimate of drug-likeness (QED), and retrains the network on molecules meeting the oral-bioavailability threshold.
  • Results: After five training cycles, 76% of generated molecules passed the strict QED oral-bioavailability threshold and 96% passed the traditionally used Lipinski's Rule of Five; for the cancer-related PDGFRα receptor, 44% of generated candidates had better binding affinity than the state-of-the-art drug Imatinib (receptor binding affinity -9.4 kcal/mol), with the best candidate at -12.9 kcal/mol.
    Abstract Longer timelines and lower success rates of drug candidates limit the productivity of clinical trials in the pharmaceutical industry. Promising de novo drug design techniques help solve this by exploring a broader chemical space, efficiently generating new molecules, and providing improved therapies. However, optimizing for molecular characteristics found in approved oral drugs remains a challenge, limiting de novo usage. In this work, we propose NovoMol, a novel de novo method using recurrent neural networks to mass-generate drug molecules with high oral bioavailability, increasing clinical trial time efficiency. Molecules were optimized for desirable traits and ranked using the quantitative estimate of drug-likeness (QED). Generated molecules meeting QED's oral bioavailability threshold were used to retrain the neural network, and, after five training cycles, 76% of generated molecules passed this strict threshold and 96% passed the traditionally used Lipinski's Rule of Five. The trained model was then used to generate specific drug candidates for the cancer-related PDGFR{\alpha} receptor and 44% of generated candidates had better binding affinity than the current state-of-the-art drug, Imatinib (with a receptor binding affinity of -9.4 kcal/mol), and the best-generated candidate at -12.9 kcal/mol. NovoMol provides a time/cost-efficient AI-based de novo method offering promising drug candidates for clinical trials.
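The QED filtering step of the retraining loop is easy to sketch with RDKit (the threshold value here is an illustrative assumption, not the paper's exact cutoff):

```python
from rdkit import Chem
from rdkit.Chem import QED

def filter_by_qed(smiles_list, threshold=0.6):
    """Keep generated molecules whose QED clears an oral-bioavailability-style
    threshold; survivors would be fed back to retrain the generator."""
    kept = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)      # None for invalid SMILES
        if mol is not None and QED.qed(mol) >= threshold:
            kept.append(smi)
    return kept

generated = ["CCO", "c1ccccc1C(=O)O", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]  # toy samples
print(filter_by_qed(generated))
```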

Tackling Bias in Pre-trained Language Models: Current Trends and Under-represented Societies

  • paper_url: http://arxiv.org/abs/2312.01509
  • repo_url: None
  • paper_authors: Vithya Yogarajan, Gillian Dobbie, Te Taka Keegan, Rostam J. Neuwirth
  • for: Explores bias in existing language models and how mitigation practices meet (or fail to meet) the needs of different societal groups.
  • methods: Surveys current methods and datasets for evaluating and mitigating bias in language models, grouping them into metrics, benchmark datasets, and mitigation strategies.
  • results: Finds that existing bias detection and mitigation practices have limitations for under-represented societies and indigenous populations and must be adapted to their needs rather than simply reused.
    Abstract The benefits and capabilities of pre-trained language models (LLMs) in current and future innovations are vital to any society. However, introducing and using LLMs comes with biases and discrimination, resulting in concerns about equality, diversity and fairness, and must be addressed. While understanding and acknowledging bias in LLMs and developing mitigation strategies are crucial, the generalised assumptions towards societal needs can result in disadvantages towards under-represented societies and indigenous populations. Furthermore, the ongoing changes to actual and proposed amendments to regulations and laws worldwide also impact research capabilities in tackling the bias problem. This research presents a comprehensive survey synthesising the current trends and limitations in techniques used for identifying and mitigating bias in LLMs, where the overview of methods for tackling bias are grouped into metrics, benchmark datasets, and mitigation strategies. The importance and novelty of this survey are that it explores the perspective of under-represented societies. We argue that current practices tackling the bias problem cannot simply be 'plugged in' to address the needs of under-represented societies. We use examples from New Zealand to present requirements for adopting existing techniques to under-represented societies.

Effectively Fine-tune to Improve Large Multimodal Models for Radiology Report Generation

  • paper_url: http://arxiv.org/abs/2312.01504
  • repo_url: None
  • paper_authors: Yuzhe Lu, Sungmin Hong, Yash Shah, Panpan Xu
  • for: Aims to automate radiology report generation from medical images, reducing radiologists' time and error rates.
  • methods: Proposes a simple yet effective two-stage fine-tuning protocol that aligns visual features to the Large Language Model's (LLM) text embedding space as soft visual prompts.
  • results: The framework with OpenLLaMA-7B achieves state-of-the-art level performance without domain-specific pretraining; detailed analyses of soft visual prompts and attention mechanisms shed light on future research directions.
    Abstract Writing radiology reports from medical images requires a high level of domain expertise. It is time-consuming even for trained radiologists and can be error-prone for inexperienced radiologists. It would be appealing to automate this task by leveraging generative AI, which has shown drastic progress in vision and language understanding. In particular, Large Language Models (LLM) have demonstrated impressive capabilities recently and continue to set new state-of-the-art performance on almost all natural language tasks. While many have proposed architectures to combine vision models with LLMs for multimodal tasks, few have explored practical fine-tuning strategies. In this work, we propose a simple yet effective two-stage fine-tuning protocol to align visual features to the LLM's text embedding space as soft visual prompts. Our framework with OpenLLaMA-7B achieved state-of-the-art level performance without domain-specific pretraining. Moreover, we provide detailed analyses of soft visual prompts and attention mechanisms, shedding light on future research directions.
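
The heart of the two-stage protocol, mapping visual features into the LLM's embedding space so they act as soft prompts, reduces to a small projection module. Below is a minimal PyTorch sketch under assumed dimensions (a 1024-d vision encoder, a 4096-d LLM embedding space as in LLaMA-7B-scale models, and 32 prompt tokens); the projector shown is illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SoftVisualPrompt(nn.Module):
    """Project vision-encoder features into the LLM token-embedding space."""
    def __init__(self, vis_dim=1024, llm_dim=4096, n_prompts=32):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)  # trainable alignment layer
        self.n_prompts = n_prompts

    def forward(self, vis_feats, token_embeds):
        # vis_feats: (B, n_prompts, vis_dim) features from a frozen vision encoder
        # token_embeds: (B, T, llm_dim) embeddings of the report tokens so far
        prompts = self.proj(vis_feats)                    # (B, n_prompts, llm_dim)
        return torch.cat([prompts, token_embeds], dim=1)  # prepend as soft prompts

# Shape check with random tensors standing in for real encoder/LLM outputs.
m = SoftVisualPrompt()
out = m(torch.randn(2, 32, 1024), torch.randn(2, 16, 4096))
print(out.shape)  # torch.Size([2, 48, 4096])
```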

ADT: Agent-based Dynamic Thresholding for Anomaly Detection

  • paper_url: http://arxiv.org/abs/2312.01488
  • repo_url: None
  • paper_authors: Xue Yang, Enda Howley, Micheal Schukat
  • for: This paper tackles the thresholding problem in anomaly detection, enabling dynamic threshold adjustment in real-world applications.
  • methods: Thresholding is modeled as a Markov Decision Process, and an agent-based dynamic thresholding (ADT) framework built on a deep Q-network is proposed; an auto-encoder supplies feature representations and anomaly scores.
  • results: Experiments on three real-world datasets demonstrate ADT's thresholding capability, data-efficient learning, stability, and robustness.
    Abstract The complexity and scale of IT systems are increasing dramatically, posing many challenges to real-world anomaly detection. Deep learning anomaly detection has emerged, aiming at feature learning and anomaly scoring, which has gained tremendous success. However, little work has been done on the thresholding problem despite it being a critical factor for the effectiveness of anomaly detection. In this paper, we model thresholding in anomaly detection as a Markov Decision Process and propose an agent-based dynamic thresholding (ADT) framework based on a deep Q-network. The proposed method can be integrated into many systems that require dynamic thresholding. An auto-encoder is utilized in this study to obtain feature representations and produce anomaly scores for complex input data. ADT can adjust thresholds adaptively by utilizing the anomaly scores from the auto-encoder and significantly improve anomaly detection performance. The properties of ADT are studied through experiments on three real-world datasets and compared with benchmarks, hence demonstrating its thresholding capability, data-efficient learning, stability, and robustness. Our study validates the effectiveness of reinforcement learning in optimal thresholding control in anomaly detection.
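
A minimal sketch of the MDP view of thresholding: the state is a window of recent auto-encoder anomaly scores, the action nudges the threshold, and a deep Q-network scores the actions. The three-action space and step sizes below are assumptions; the paper's reward design and training loop are not reproduced:

```python
import torch
import torch.nn as nn

ACTIONS = (-0.05, 0.0, +0.05)  # assumed threshold adjustments: lower / keep / raise

class QNet(nn.Module):
    """Deep Q-network mapping a window of anomaly scores to action values."""
    def __init__(self, window=16, n_actions=len(ACTIONS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, scores):
        return self.net(scores)

def step_threshold(qnet, score_window, threshold, eps=0.1):
    """Epsilon-greedy threshold adjustment from the current anomaly-score window."""
    if torch.rand(()) < eps:
        a = torch.randint(len(ACTIONS), ()).item()
    else:
        a = qnet(score_window).argmax().item()
    return min(max(threshold + ACTIONS[a], 0.0), 1.0), a

qnet = QNet()
thr, action = step_threshold(qnet, torch.rand(16), threshold=0.5)
print(thr, action)
```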

Context-Enhanced Relational Operators with Vector Embeddings

  • paper_url: http://arxiv.org/abs/2312.01476
  • repo_url: None
  • paper_authors: Viktor Sanca, Manos Chatzakis, Anastasia Ailamaki
  • for: This paper addresses the challenge of handling both relational data and context-rich multi-modal data in data-processing pipelines built on traditional relational DBMSs.
  • methods: Representation-learning models map context-rich data into vector embeddings, and an embedding operator composable with relational operators enables machine-automated context processing with logical and physical optimizations.
  • results: Using string embeddings as an example, the paper demonstrates hybrid context-enhanced processing on relational join operators and an order-of-magnitude improvement in execution time from holistic optimization.
    Abstract Collecting data, extracting value, and combining insights from relational and context-rich multi-modal sources in data processing pipelines presents a challenge for traditional relational DBMS. While relational operators allow declarative and optimizable query specification, they are limited to data transformations unsuitable for capturing or analyzing context. On the other hand, representation learning models can map context-rich data into embeddings, allowing machine-automated context processing but requiring imperative data transformation integration with the analytical query. To bridge this dichotomy, we present a context-enhanced relational join and introduce an embedding operator composable with relational operators. This enables hybrid relational and context-rich vector data processing, with algebraic equivalences compatible with relational algebra and corresponding logical and physical optimizations. We investigate model-operator interaction with vector data processing and study the characteristics of the E-join operator. Using an example of string embeddings, we demonstrate enabling hybrid context-enhanced processing on relational join operators with vector embeddings. The importance of holistic optimization, from logical to physical, is demonstrated in an order of magnitude execution time improvement.
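
A minimal NumPy sketch of the E-join idea, a join whose matching predicate is similarity in embedding space rather than key equality. The toy tables, random embeddings, and cosine threshold are assumptions; a real system would route this through the optimizer and a vector index rather than an all-pairs comparison:

```python
import numpy as np

def e_join(left, right, emb_left, emb_right, tau=0.8):
    """Context-enhanced join: pair rows whose embeddings have cosine similarity >= tau."""
    L = emb_left / np.linalg.norm(emb_left, axis=1, keepdims=True)
    R = emb_right / np.linalg.norm(emb_right, axis=1, keepdims=True)
    sim = L @ R.T                                  # all-pairs cosine similarities
    for i, j in zip(*np.nonzero(sim >= tau)):
        yield left[i], right[j], float(sim[i, j])

# Toy inputs; in practice the embeddings come from a representation-learning model.
rng = np.random.default_rng(0)
left = ["acme corp", "globex"]
right = ["ACME Corporation", "Initech"]
eL, eR = rng.normal(size=(2, 8)), rng.normal(size=(2, 8))
for row in e_join(left, right, eL, eR, tau=-1.0):  # threshold disabled for the demo
    print(row)
```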

Personality of AI

  • paper_url: http://arxiv.org/abs/2312.02998
  • repo_url: https://github.com/Committing/personalitypolice.com_public
  • paper_authors: Byunggu Yu, Junwhan Kim
  • for: This study explores how large language models (LLMs) can be aligned with human users, going beyond basic alignment to propose "personality alignment" for language models in organizational settings.
  • methods: Noting that training methods shape undefined personality traits in AI models, the study draws parallels with human personality testing; an original case study demonstrates the need for personality fine-tuning of AI models and raises questions about applying human-designed tests to AIs, engineering specialized AI personality tests, and shaping AI personalities to suit organizational roles.
  • results: The study offers a starting point for research on AI personality alignment and a foundational anchor for future exploration of human-machine teaming and co-existence.
    Abstract This research paper delves into the evolving landscape of fine-tuning large language models (LLMs) to align with human users, extending beyond basic alignment to propose "personality alignment" for language models in organizational settings. Acknowledging the impact of training methods on the formation of undefined personality traits in AI models, the study draws parallels with human fitting processes using personality tests. Through an original case study, we demonstrate the necessity of personality fine-tuning for AIs and raise intriguing questions about applying human-designed tests to AIs, engineering specialized AI personality tests, and shaping AI personalities to suit organizational roles. The paper serves as a starting point for discussions and developments in the burgeoning field of AI personality alignment, offering a foundational anchor for future exploration in human-machine teaming and co-existence.

BenchMARL: Benchmarking Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.01472
  • repo_url: https://github.com/facebookresearch/benchmarl
  • paper_authors: Matteo Bettini, Amanda Prorok, Vincent Moens
  • for: This paper addresses the reproducibility crisis in Multi-Agent Reinforcement Learning (MARL) by introducing BenchMARL, a training library for standardized benchmarking.
  • methods: BenchMARL uses TorchRL as its backend, giving it high-performance, state-of-the-art implementations, and its design enables systematic configuration and reporting of complex benchmarks from simple one-line inputs.
  • results: BenchMARL is the first MARL training library to enable standardized benchmarking across different algorithms, models, and environments; it is open-sourced on GitHub for the broad community of MARL PyTorch users.
    Abstract The field of Multi-Agent Reinforcement Learning (MARL) is currently facing a reproducibility crisis. While solutions for standardized reporting have been proposed to address the issue, we still lack a benchmarking tool that enables standardization and reproducibility, while leveraging cutting-edge Reinforcement Learning (RL) implementations. In this paper, we introduce BenchMARL, the first MARL training library created to enable standardized benchmarking across different algorithms, models, and environments. BenchMARL uses TorchRL as its backend, granting it high performance and maintained state-of-the-art implementations while addressing the broad community of MARL PyTorch users. Its design enables systematic configuration and reporting, thus allowing users to create and run complex benchmarks from simple one-line inputs. BenchMARL is open-sourced on GitHub: https://github.com/facebookresearch/BenchMARL
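
The "one-line inputs" are concrete. The snippet below follows the usage pattern in the project's README (Python API plus the Hydra CLI); module paths and config names may shift across versions, so treat it as indicative rather than authoritative:

```python
# One-line CLI (Hydra): python benchmarl/run.py algorithm=mappo task=vmas/balance
# Python API equivalent, running MAPPO on a VMAS task with default YAML configs:
from benchmarl.algorithms import MappoConfig
from benchmarl.environments import VmasTask
from benchmarl.experiment import Experiment, ExperimentConfig
from benchmarl.models.mlp import MlpConfig

experiment = Experiment(
    task=VmasTask.BALANCE.get_from_yaml(),
    algorithm_config=MappoConfig.get_from_yaml(),
    model_config=MlpConfig.get_from_yaml(),
    critic_model_config=MlpConfig.get_from_yaml(),
    seed=0,
    config=ExperimentConfig.get_from_yaml(),
)
experiment.run()
```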

Exploring Adversarial Robustness of LiDAR-Camera Fusion Model in Autonomous Driving

  • paper_url: http://arxiv.org/abs/2312.01468
  • repo_url: None
  • paper_authors: Bo Yang, Xiaoyu Ji
  • for: This paper assesses the adversarial robustness of LiDAR-camera fusion models in 3D object detection, with a focus on safety concerns in autonomous driving.
  • methods: The paper introduces an attack technique that adds a limited number of physically constrained adversarial points above a car to deceive the fusion model, without changing the image data channel.
  • results: Experimental results show that the fusion model can be deceived solely by manipulating the LiDAR data channel, raising safety concerns in autonomous driving. The paper also explores the effects of various factors on the attack success rate.
    Abstract Our study assesses the adversarial robustness of LiDAR-camera fusion models in 3D object detection. We introduce an attack technique that, by simply adding a limited number of physically constrained adversarial points above a car, can make the car undetectable by the fusion model. Experimental results reveal that even without changes to the image data channel, the fusion model can be deceived solely by manipulating the LiDAR data channel. This finding raises safety concerns in the field of autonomous driving. Further, we explore how the quantity of adversarial points, the distance between the front-near car and the LiDAR-equipped car, and various angular factors affect the attack success rate. We believe our research can contribute to the understanding of multi-sensor robustness, offering insights and guidance to enhance the safety of autonomous driving.

D-Bot: Database Diagnosis System using Large Language Models

  • paper_url: http://arxiv.org/abs/2312.01454
  • repo_url: https://github.com/tsinghuadatabasegroup/db-gpt
  • paper_authors: Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, Zhiyuan Liu, Weize Chen, Jianming Wu, Jiesi Liu, Ruohang Feng, Guoyang Zeng
  • for: To help database administrators (DBAs) manage, maintain, and optimize database systems, improving their efficiency and response times.
  • methods: D-Bot uses large language models (LLMs) to automatically acquire knowledge from diagnosis documents and to generate well-founded diagnosis reports (identifying root causes and solutions) within acceptable time, e.g., under 10 minutes.
  • results: Verified on real benchmarks, D-Bot effectively analyzes the root causes of unseen anomalies and significantly outperforms traditional methods and vanilla models such as GPT-4.
    Abstract Database administrators (DBAs) play an important role in managing, maintaining and optimizing database systems. However, it is hard and tedious for DBAs to manage a large number of databases and give timely response (waiting for hours is intolerable in many online cases). In addition, existing empirical methods only support limited diagnosis scenarios, which are also labor-intensive to update the diagnosis rules for database version updates. Recently large language models (LLMs) have shown great potential in various fields. Thus, we propose D-Bot, an LLM-based database diagnosis system that can automatically acquire knowledge from diagnosis documents, and generate reasonable and well-founded diagnosis report (i.e., identifying the root causes and solutions) within acceptable time (e.g., under 10 minutes compared to hours by a DBA). The techniques in D-Bot include (i) offline knowledge extraction from documents, (ii) automatic prompt generation (e.g., knowledge matching, tool retrieval), (iii) root cause analysis using tree search algorithm, and (iv) collaborative mechanism for complex anomalies with multiple root causes. We verify D-Bot on real benchmarks (including 539 anomalies of six typical applications), and the results show that D-Bot can effectively analyze the root causes of unseen anomalies and significantly outperforms traditional methods and vanilla models like GPT-4.

Foveation in the Era of Deep Learning

  • paper_url: http://arxiv.org/abs/2312.01450
  • repo_url: https://github.com/georgekillick90/fovconvnext
  • paper_authors: George Killick, Paul Henderson, Paul Siebert, Gerardo Aragon-Camarasa
  • for: This work tackles the challenge of actively attending to visual scenes using a foveated sensor.
  • methods: It introduces an end-to-end differentiable foveated active-vision architecture that processes foveated images with a graph convolutional network, together with a simple yet effective formulation for foveated image sampling; the model learns to iteratively attend to image regions relevant for classification.
  • results: Detailed experiments on a variety of image datasets compare the method with previous approaches to foveated vision and measure how choices such as the degree of foveation and the number of fixations affect object recognition; the model outperforms a state-of-the-art CNN and foveated architectures of comparable parameters for a given pixel or computation budget.
    Abstract In this paper, we tackle the challenge of actively attending to visual scenes using a foveated sensor. We introduce an end-to-end differentiable foveated active vision architecture that leverages a graph convolutional network to process foveated images, and a simple yet effective formulation for foveated image sampling. Our model learns to iteratively attend to regions of the image relevant for classification. We conduct detailed experiments on a variety of image datasets, comparing the performance of our method with previous approaches to foveated vision while measuring how different choices, such as the degree of foveation and the number of fixations the network performs, affect object recognition performance. We find that our model outperforms a state-of-the-art CNN and foveated vision architectures of comparable parameters for a given pixel or computation budget.
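
Foveated sampling itself is easy to illustrate: sample the image on a grid whose spacing grows with eccentricity from the fixation point. The NumPy sketch below uses a log-polar grid as one common realization; it is not the paper's sampling formulation, and the graph-network processing is omitted:

```python
import numpy as np

def foveated_sample(img, fix_y, fix_x, n_rings=24, n_wedges=32, max_r=None):
    """Sample img on a log-polar grid centred on the fixation (dense near it)."""
    H, W = img.shape[:2]
    max_r = max_r or min(H, W) / 2
    # Ring radii grow geometrically, so sample density falls off with eccentricity.
    radii = np.geomspace(1.0, max_r, n_rings)
    thetas = np.linspace(0, 2 * np.pi, n_wedges, endpoint=False)
    rr = (fix_y + radii[:, None] * np.sin(thetas)).round().astype(int)
    cc = (fix_x + radii[:, None] * np.cos(thetas)).round().astype(int)
    rr, cc = rr.clip(0, H - 1), cc.clip(0, W - 1)
    return img[rr, cc]  # (n_rings, n_wedges[, C]) foveated "retina"

img = np.arange(128 * 128, dtype=float).reshape(128, 128)
print(foveated_sample(img, 64, 64).shape)  # (24, 32)
```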

Learning Curricula in Open-Ended Worlds

  • paper_url: http://arxiv.org/abs/2312.03126
  • repo_url: https://github.com/facebookresearch/dcd
  • paper_authors: Minqi Jiang
  • for: To improve the open-endedness and generality of RL agents.
  • methods: Unsupervised Environment Design (UED) automatically generates a curriculum of training environments at the frontier of the learning agent's capabilities.
  • results: The resulting RL agents exhibit significantly improved robustness and generalization to previously unseen environment instances.
    Abstract Deep reinforcement learning (RL) provides powerful methods for training optimal sequential decision-making agents. As collecting real-world interactions can entail additional costs and safety risks, the common paradigm of sim2real conducts training in a simulator, followed by real-world deployment. Unfortunately, RL agents easily overfit to the choice of simulated training environments, and worse still, learning ends when the agent masters the specific set of simulated environments. In contrast, the real world is highly open-ended, featuring endlessly evolving environments and challenges, making such RL approaches unsuitable. Simply randomizing over simulated environments is insufficient, as it requires making arbitrary distributional assumptions and can be combinatorially less likely to sample specific environment instances that are useful for learning. An ideal learning process should automatically adapt the training environment to maximize the learning potential of the agent over an open-ended task space that matches or surpasses the complexity of the real world. This thesis develops a class of methods called Unsupervised Environment Design (UED), which aim to produce such open-ended processes. Given an environment design space, UED automatically generates an infinite sequence or curriculum of training environments at the frontier of the learning agent's capabilities. Through extensive empirical studies and theoretical arguments founded on minimax-regret decision theory and game theory, the findings in this thesis show that UED autocurricula can produce RL agents exhibiting significantly improved robustness and generalization to previously unseen environment instances. Such autocurricula are promising paths toward open-ended learning systems that achieve more general intelligence by continually generating and mastering additional challenges of their own design.

Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.01409
  • repo_url: None
  • paper_authors: Shengqu Cai, Duygu Ceylan, Matheus Gadelha, Chun-Hao Paul Huang, Tuanfeng Yang Wang, Gordon Wetzstein
  • for: To help users create high-quality computer-generated videos by combining the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models.
  • methods: The approach takes an animated, low-fidelity rendered mesh as input and injects ground-truth correspondence information from the dynamic mesh into various stages of a pre-trained text-to-image generation model.
  • results: The method produces high-quality, temporally consistent frames across examples with varied motions and camera paths, leaving users free to apply their own creativity.
    Abstract Traditional 3D content creation tools empower users to bring their imagination to life by giving them direct control over a scene's geometry, appearance, motion, and camera path. Creating computer-generated videos, however, is a tedious manual process, which can be automated by emerging text-to-video diffusion models. Despite great promise, video diffusion models are difficult to control, hindering a user to apply their own creativity rather than amplifying it. To address this challenge, we present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models. For this purpose, our approach takes an animated, low-fidelity rendered mesh as input and injects the ground truth correspondence information obtained from the dynamic mesh into various stages of a pre-trained text-to-image generation model to output high-quality and temporally consistent frames. We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path.

  • paper_url: http://arxiv.org/abs/2312.01398
  • repo_url: None
  • paper_authors: Anmol Singhal, Preethu Rose Anish, Shirish Karande, Smita Ghaisas
  • for: The paper aims to identify potentially unfair clauses in commercial contracts using pre-trained language models (PLMs).
  • methods: An empirical study compares chain-of-thought prompting and semi-supervised fine-tuning approaches for identifying unfairness in contractual sentences.
  • results: BERT-based fine-tuning achieves 84% accuracy on a dataset of proprietary contracts, outperforming chain-of-thought prompting with Vicuna-13B by 9%.
    Abstract Commercial contracts are known to be a valuable source for deriving project-specific requirements. However, contract negotiations mainly occur among the legal counsel of the parties involved. The participation of non-legal stakeholders, including requirement analysts, engineers, and solution architects, whose primary responsibility lies in ensuring the seamless implementation of contractual terms, is often indirect and inadequate. Consequently, a significant number of sentences in contractual clauses, though legally accurate, can appear unfair from an implementation perspective to non-legal stakeholders. This perception poses a problem since requirements indicated in the clauses are obligatory and can involve punitive measures and penalties if not implemented as committed in the contract. Therefore, the identification of potentially unfair clauses in contracts becomes crucial. In this work, we conduct an empirical study to analyze the perspectives of different stakeholders regarding contractual fairness. We then investigate the ability of Pre-trained Language Models (PLMs) to identify unfairness in contractual sentences by comparing chain of thought prompting and semi-supervised fine-tuning approaches. Using BERT-based fine-tuning, we achieved an accuracy of 84% on a dataset consisting of proprietary contracts. It outperformed chain of thought prompting using Vicuna-13B by a margin of 9%.

DiFace: Cross-Modal Face Recognition through Controlled Diffusion

  • paper_url: http://arxiv.org/abs/2312.01367
  • repo_url: None
  • paper_authors: Bowen Sun, Shibao Zheng
  • for: This paper addresses key challenges in face recognition from textual descriptions.
  • methods: It uses a controllable diffusion process, grounded in a theoretical connection with probability transport, to achieve face recognition via text.
  • results: Experiments on verification and identification show that the method achieves, to the best of the authors' knowledge, significant accuracy in text-to-image face recognition for the first time.
    Abstract Diffusion probabilistic models (DPMs) have exhibited exceptional proficiency in generating visual media of outstanding quality and realism. Nonetheless, their potential in non-generative domains, such as face recognition, has yet to be thoroughly investigated. Meanwhile, despite the extensive development of multi-modal face recognition methods, their emphasis has predominantly centered on visual modalities. In this context, face recognition through textual description presents a unique and promising solution that not only transcends the limitations of specific application scenarios but also expands the potential for research in the field of cross-modal face recognition. It is regrettable that this avenue remains unexplored and underutilized, a consequence of challenges mainly associated with three aspects: 1) the intrinsic imprecision of verbal descriptions; 2) the significant gaps between texts and images; and 3) the immense hurdle posed by insufficient databases. To tackle this problem, we present DiFace, a solution that effectively achieves face recognition via text through a controllable diffusion process, by establishing its theoretical connection with probability transport. Our approach not only unleashes the potential of DPMs across a broader spectrum of tasks but also achieves, to the best of our knowledge, a significant accuracy in text-to-image face recognition for the first time, as demonstrated by our experiments on verification and identification.

Analyze the robustness of three NMF algorithms (Robust NMF with L1 norm, L2-1 norm NMF, L2 NMF)

  • paper_url: http://arxiv.org/abs/2312.01357
  • repo_url: None
  • paper_authors: Cheng Zeng, Jiaqi Tian, Yixuan Xu
  • for: To study the robustness of non-negative matrix factorization (NMF) under different types of noise.
  • methods: Three NMF algorithms (L1 NMF, L2 NMF, and L21 NMF) are run on the ORL and YaleB datasets with salt-and-pepper noise and block-occlusion noise added separately; performance is evaluated with root mean square error (RMSE), accuracy (ACC), and normalized mutual information (NMI).
  • results: The evaluation quantifies the noise resistance of the NMF algorithms and offers insight into their feasibility in practical applications.
    Abstract Non-negative matrix factorization (NMF) and its variants have been widely employed in clustering and classification tasks (Long & Jian, 2021). However, noise can seriously affect experimental results. Our research investigates the noise robustness of non-negative matrix factorization (NMF) in the face of different types of noise. Specifically, we adopt three NMF algorithms, namely L1 NMF, L2 NMF, and L21 NMF, and use the ORL and YaleB data sets to simulate a series of experiments with salt-and-pepper noise and block-occlusion noise separately. In the experiments, we use a variety of evaluation indicators, including root mean square error (RMSE), accuracy (ACC), and normalized mutual information (NMI), to evaluate the performance of the different NMF algorithms in noisy environments. Through these indicators, we quantify the resistance of NMF algorithms to noise and gain insights into their feasibility in practical applications.
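
The experimental protocol (corrupt the images, factorize, measure reconstruction error) is compact enough to sketch. scikit-learn's NMF minimizes the Frobenius objective and so corresponds to the L2 variant only; the robust L1 and L21 variants need custom multiplicative update rules not shown here, and random data stands in for ORL/YaleB:

```python
import numpy as np
from sklearn.decomposition import NMF

def salt_and_pepper(X, p=0.1, lo=0.0, hi=1.0, seed=0):
    """Corrupt a fraction p of entries with extreme values."""
    rng = np.random.default_rng(seed)
    Xn = X.copy()
    mask = rng.random(X.shape) < p
    Xn[mask] = rng.choice([lo, hi], size=mask.sum())
    return Xn

rng = np.random.default_rng(0)
X = rng.random((100, 64))            # stand-in for vectorized face images in [0, 1]
Xn = salt_and_pepper(X, p=0.1)

model = NMF(n_components=16, init="nndsvda", max_iter=500)  # L2 (Frobenius) NMF
W = model.fit_transform(Xn)
rmse = np.sqrt(np.mean((X - W @ model.components_) ** 2))   # error vs. clean data
print(f"RMSE against clean data: {rmse:.4f}")
```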

Honesty Is the Best Policy: Defining and Mitigating AI Deception

  • paper_url: http://arxiv.org/abs/2312.01350
  • repo_url: None
  • paper_authors: Francis Rhys Ward, Francesco Belardinelli, Francesca Toni, Tom Everitt
  • for: This work addresses the problem of deceptive agents in AI systems, a threat to safety, trustworthiness, and cooperation.
  • methods: It introduces a formal definition of deception in structural causal games, grounded in the philosophy literature and applicable to real-world machine learning systems.
  • results: Examples and experiments show that the formal definition aligns with philosophical and common-sense meanings of deception, and the accompanying graphical criteria can be used to mitigate deception in reinforcement learning agents and language models.
    Abstract Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful). There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formal definition of deception in structural causal games, grounded in the philosophy literature, and applicable to real-world machine learning systems. Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception. Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.

tsMorph: generation of semi-synthetic time series to understand algorithm performance

  • paper_url: http://arxiv.org/abs/2312.01344
  • repo_url: None
  • paper_authors: Moisés Santos, André de Carvalho, Carlos Soares
  • for: To understand the conditions under which time-series forecasting methods perform well or poorly.
  • methods: tsMorph generates semi-synthetic time series by dataset morphing, creating a sequence of datasets that progressively departs from one original dataset and converges toward another.
  • results: Experiments with the Long Short-Term Memory Network on NN5 Competition series show that performance improves with the frequency of the time series, demonstrating tsMorph as a useful tool for studying forecasting-algorithm behavior.
    Abstract Time series forecasting is a subject of significant scientific and industrial importance. Despite the widespread utilization of forecasting methods, there is a dearth of research aimed at comprehending the conditions under which these methods yield favorable or unfavorable performances. Empirical studies, although common, encounter challenges due to the limited availability of datasets, impeding the extraction of reliable insights. To address this, we present tsMorph, a straightforward approach for generating semi-synthetic time series through dataset morphing. tsMorph operates by creating a sequence of datasets derived from two original datasets. These newly generated datasets exhibit a progressive departure from the characteristics of one dataset and a convergence toward the attributes of the other. This method provides a valuable alternative for obtaining substantial datasets. In this paper, we demonstrate the utility of tsMorph by assessing the performance of the Long Short-Term Memory Network forecasting algorithm. The time series under examination are sourced from the NN5 Competition. The findings reveal compelling insights. Notably, the performance of the Long Short-Term Memory Network improves proportionally with the frequency of the time series. These experiments affirm that tsMorph serves as an effective tool for gaining an understanding of forecasting algorithm behaviors, offering a pathway to overcome the limitations posed by empirical studies and enabling more extensive and reliable experimentation.
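
Dataset morphing admits a very small reference implementation; one natural reading is a convex combination that walks from the source series to the target in equal steps. Whether tsMorph uses exactly this interpolation or a richer transformation is an assumption here:

```python
import numpy as np

def morph(source, target, n_steps=5):
    """Yield semi-synthetic series drifting from `source` toward `target`."""
    source, target = np.asarray(source, float), np.asarray(target, float)
    assert source.shape == target.shape, "series must be aligned to equal length"
    for k in range(1, n_steps + 1):
        alpha = k / (n_steps + 1)          # interpolation weight, 0 < alpha < 1
        yield (1 - alpha) * source + alpha * target

a = np.sin(np.linspace(0, 8 * np.pi, 200))        # stand-ins for two NN5 series
b = np.sin(np.linspace(0, 2 * np.pi, 200)) + 0.1
for i, series in enumerate(morph(a, b), start=1):
    print(i, series[:3])
```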

AI-Powered Arabic Crossword Puzzle Generation for Educational Applications

  • paper_url: http://arxiv.org/abs/2312.01339
  • repo_url: None
  • paper_authors: Kamyar Zeinalipour, Mohamed Zaky Saad, Marco Maggini, Marco Gori
  • for: To develop an AI-driven Arabic crossword puzzle generator that improves learning outcomes and advances educational technology.
  • methods: The system leverages cutting-edge large language models, including GPT4, GPT3-Davinci, GPT3-Curie, GPT3-Babbage, GPT3-Ada, and BERT, to generate distinctive and challenging clue-answer pairs; fine-tuning, few/zero-shot learning strategies, and rigorous quality-checking protocols ensure the quality of the generated pairs.
  • results: The system produces high-quality educational crosswords that enhance memory, vocabulary, and problem-solving skills, improving the learning experience.
    Abstract This paper presents the first Arabic crossword puzzle generator driven by advanced AI technology. Leveraging cutting-edge large language models including GPT4, GPT3-Davinci, GPT3-Curie, GPT3-Babbage, GPT3-Ada, and BERT, the system generates distinctive and challenging clues. Based on a dataset comprising over 50,000 clue-answer pairs, the generator employs fine-tuning, few/zero-shot learning strategies, and rigorous quality-checking protocols to enforce the generation of high-quality clue-answer pairs. Importantly, educational crosswords contribute to enhancing memory, expanding vocabulary, and promoting problem-solving skills, thereby augmenting the learning experience through a fun and engaging approach, reshaping the landscape of traditional learning methods. The overall system can be exploited as a powerful educational tool that amalgamates AI and innovative learning techniques, heralding a transformative era for Arabic crossword puzzles and the intersection of technology and education.

Facial Emotion Recognition Under Mask Coverage Using a Data Augmentation Technique

  • paper_url: http://arxiv.org/abs/2312.01335
  • repo_url: https://github.com/areffarhadi/masked_face_emotion_recognition
  • paper_authors: Aref Farhadipour, Pouya Taghipour
  • for: To develop a computer-vision system that recognizes the emotions of people wearing different types of face masks.
  • methods: A novel data augmentation technique applies four mask types to each face image; four convolutional neural networks (Alexnet, Squeezenet, Resnet50, and VGGFace2) are trained via transfer learning.
  • results: The model performs better in multi-mask mode than in single-mask mode. VGGFace2 achieves 97.82% accuracy in person-dependent mode and 74.21% in person-independent mode on the JAFFE dataset; on UIBVFED, Resnet50 performs best with 73.68% and 59.57%, respectively. Precision, sensitivity, specificity, AUC, F1 score, and confusion matrices quantify system efficiency, and the LIME algorithm visualizes the CNNs' decision-making.
    Abstract Identifying human emotions using AI-based computer vision systems, when individuals wear face masks, presents a new challenge in the current Covid-19 pandemic. In this study, we propose a facial emotion recognition system capable of recognizing emotions from individuals wearing different face masks. A novel data augmentation technique was utilized to improve the performance of our model using four mask types for each face image. We evaluated the effectiveness of four convolutional neural networks, Alexnet, Squeezenet, Resnet50 and VGGFace2 that were trained using transfer learning. The experimental findings revealed that our model works effectively in multi-mask mode compared to single-mask mode. The VGGFace2 network achieved the highest accuracy rate, with 97.82% for the person-dependent mode and 74.21% for the person-independent mode using the JAFFE dataset. However, we evaluated our proposed model using the UIBVFED dataset. The Resnet50 has demonstrated superior performance, with accuracies of 73.68% for the person-dependent mode and 59.57% for the person-independent mode. Moreover, we employed metrics such as precision, sensitivity, specificity, AUC, F1 score, and confusion matrix to measure our system's efficiency in detail. Additionally, the LIME algorithm was used to visualize CNN's decision-making strategy.
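
The transfer-learning setup is standard: load an ImageNet-pretrained backbone, freeze it, and retrain only the classification head on masked-face emotion data. The torchvision sketch below assumes a 7-class emotion label set (JAFFE uses seven expression categories); overlaying the four mask types onto faces is dataset-specific and omitted:

```python
import torch
import torch.nn as nn
from torchvision import models

N_EMOTIONS = 7  # assumed label set (e.g., the seven expression categories in JAFFE)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model.parameters():          # freeze the pretrained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, N_EMOTIONS)  # new trainable head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One toy training step on random tensors standing in for masked-face batches.
x, y = torch.randn(8, 3, 224, 224), torch.randint(N_EMOTIONS, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(float(loss))
```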

JarviX: A LLM No code Platform for Tabular Data Analysis and Optimization

  • paper_url: http://arxiv.org/abs/2312.02213
  • repo_url: None
  • paper_authors: Shang-Ching Liu, ShengKun Wang, Wenqi Lin, Chung-Wei Hsiung, Yi-Chen Hsieh, Yu-Ping Cheng, Sian-Hong Luo, Tsungyao Chang, Jianwei Zhang
  • for: This paper presents JarviX, an intelligent data-analytics framework for automated analysis and visualization of tabular data.
  • methods: JarviX employs large language models (LLMs) to generate concise data-insight summaries, propose relevant analysis questions, visualize data effectively, and explain results, and it integrates an automated machine learning (AutoML) pipeline for predictive modeling and machine-configuration optimization.
  • results: Practical use-case studies show that JarviX delivers efficient and reliable analysis and visualization, automatically producing data profiles, analysis questions, and explanations of results.
    Abstract In this study, we introduce JarviX, a sophisticated data analytics framework. JarviX is designed to employ Large Language Models (LLMs) to provide automated guidance and execute high-precision data analyses on tabular datasets. This framework emphasizes the significance of varying column types, capitalizing on state-of-the-art LLMs to generate concise data insight summaries, propose relevant analysis inquiries, visualize data effectively, and provide comprehensive explanations for results drawn from an extensive data analysis pipeline. Moreover, JarviX incorporates an automated machine learning (AutoML) pipeline for predictive modeling. This integration forms a comprehensive and automated optimization cycle, which proves particularly advantageous for optimizing machine configuration. The efficacy and adaptability of JarviX are substantiated through a series of practical use case studies.

ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.01305
  • repo_url: None
  • paper_authors: Jeong-gi Kwak, Erqun Dong, Yuhe Jin, Hanseok Ko, Shweta Mahajan, Kwang Moo Yi
  • for: Generating novel views of an object from a single image requires understanding its underlying 3D structure and rendering high-quality, spatially consistent new views.
  • methods: A pre-trained video diffusion model is leveraged by reformulating novel-view synthesis as generating a video of a camera orbiting the object of interest (a scanning video), exploiting the strong priors the video model has learned.
  • results: By creating a smooth camera trajectory to the target view and denoising with both a view-conditioned diffusion model and a video diffusion model, the method obtains highly consistent novel-view synthesis, outperforming the state of the art.
    Abstract Generating novel views of an object from a single image is a challenging task. It requires an understanding of the underlying 3D structure of the object from an image and rendering high-quality, spatially consistent new views. While recent methods for view synthesis based on diffusion have shown great progress, achieving consistency among various view estimates and at the same time abiding by the desired camera pose remains a critical problem yet to be solved. In this work, we demonstrate a strikingly simple method, where we utilize a pre-trained video diffusion model to solve this problem. Our key idea is that synthesizing a novel view could be reformulated as synthesizing a video of a camera going around the object of interest -- a scanning video -- which then allows us to leverage the powerful priors that a video diffusion model would have learned. Thus, to perform novel-view synthesis, we create a smooth camera trajectory to the target view that we wish to render, and denoise using both a view-conditioned diffusion model and a video diffusion model. By doing so, we obtain a highly consistent novel view synthesis, outperforming the state of the art.

Churn Prediction via Multimodal Fusion Learning: Integrating Customer Financial Literacy, Voice, and Behavioral Data

  • paper_url: http://arxiv.org/abs/2312.01301
  • repo_url: None
  • paper_authors: David Hason Rudd, Huan Huo, Md Rafiqul Islam, Guandong Xu
  • for: To improve the accuracy of customer churn prediction, a persistent challenge for modern businesses.
  • methods: A multimodal fusion learning model integrates customer sentiment, financial literacy (FL) level, and financial behavioral data. A SMOGN COREG supervised model estimates customer FL from financial data; an ensemble artificial neural network with oversampling predicts churn propensity in high-dimensional financial data; and a speech emotion recognition model based on a pre-trained CNN-VGG16 recognizes customer emotions from pitch, energy, and tone. Late and hybrid fusion techniques combine these modalities.
  • results: The hybrid fusion model achieves 91.2% test accuracy, a mean average precision (MAP) of 66, and a macro-averaged F1 score of 54, outperforming late fusion and baseline models; the analysis also shows a positive correlation between negative emotions, low FL scores, and high-risk customers.
    Abstract In today's competitive landscape, businesses grapple with customer retention. Churn prediction models, although beneficial, often lack accuracy due to reliance on a single data source. The intricate nature of human behavior and high-dimensional customer data further complicate these efforts. To address these concerns, this paper proposes a multimodal fusion learning model for identifying customer churn risk levels in financial service providers. Our multimodal approach integrates customer sentiment, financial literacy (FL) level, and financial behavioral data, enabling more accurate and bias-free churn prediction models. The proposed FL model utilizes a SMOGN COREG supervised model to gauge customer FL levels from their financial data. The baseline churn model applies an ensemble artificial neural network and oversampling techniques to predict churn propensity in high-dimensional financial data. We also incorporate a speech emotion recognition model employing a pre-trained CNN-VGG16 to recognize customer emotions based on pitch, energy, and tone. To integrate these diverse features while retaining unique insights, we introduce late and hybrid fusion techniques that complementarily boost coordinated multimodal co-learning. Robust metrics, including mean average precision and macro-averaged F1 score, were utilized to evaluate the proposed multimodal fusion model and hence the validity of the approach. Our novel approach demonstrates a marked improvement in churn prediction, achieving a test accuracy of 91.2%, a Mean Average Precision (MAP) score of 66, and a Macro-Averaged F1 score of 54 through the proposed hybrid fusion learning technique, compared with late fusion and baseline models. Furthermore, the analysis demonstrates a positive correlation between negative emotions, low FL scores, and high-risk customers.
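
Late fusion, the simpler of the two fusion schemes compared above, just combines per-modality predictions. The sketch below averages class probabilities from three stand-in classifiers (behavioral, financial-literacy, and voice-emotion); the paper's hybrid fusion additionally co-learns shared representations, which is not shown:

```python
import numpy as np

def late_fusion(prob_list, weights=None):
    """Weighted average of per-modality class probabilities; each array is (B, n_classes)."""
    probs = np.stack(prob_list)                    # (n_modalities, B, n_classes)
    w = np.ones(len(prob_list)) if weights is None else np.asarray(weights, float)
    return np.tensordot(w / w.sum(), probs, axes=1)  # (B, n_classes)

rng = np.random.default_rng(0)

def stand_in_probs(batch=4, n_classes=2):
    """Stand-in for a trained per-modality classifier's softmax output."""
    p = rng.random((batch, n_classes))
    return p / p.sum(axis=1, keepdims=True)

# Behavioral, financial-literacy, and voice-emotion modalities, behavior-weighted.
fused = late_fusion([stand_in_probs(), stand_in_probs(), stand_in_probs()],
                    weights=[2, 1, 1])
print(fused.argmax(axis=1))  # fused churn / no-churn decision per customer
```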

TextGenSHAP: Scalable Post-hoc Explanations in Text Generation with Long Documents

  • paper_url: http://arxiv.org/abs/2312.01279
  • repo_url: None
  • paper_authors: James Enouen, Hootan Nakhost, Sayna Ebrahimi, Sercan O Arik, Yan Liu, Tomas Pfister
  • for: To explain the content generated by large language models (LLMs) and better understand their decision processes and outputs.
  • methods: Post-hoc explanation, in particular Shapley values, is effective for deep models but faces significant scaling challenges with long input contexts and autoregressively generated output sequences; TextGenSHAP is an efficient post-hoc method that incorporates LM-specific techniques.
  • results: TextGenSHAP reduces token-level explanation time from hours to minutes and document-level explanation time to seconds; real-time Shapley values localize important words and sentences in long-document question answering and improve document retrieval by raising the accuracy of selected passages and final answers.
    Abstract Large language models (LLMs) have attracted huge interest in practical applications given their increasingly accurate responses and coherent reasoning abilities. Given their nature as black-boxes using complex reasoning processes on their inputs, it is inevitable that the demand for scalable and faithful explanations for LLMs' generated content will continue to grow. There have been major developments in the explainability of neural network models over the past decade. Among them, post-hoc explainability methods, especially Shapley values, have proven effective for interpreting deep learning models. However, there are major challenges in scaling up Shapley values for LLMs, particularly when dealing with long input contexts containing thousands of tokens and autoregressively generated output sequences. Furthermore, it is often unclear how to effectively utilize generated explanations to improve the performance of LLMs. In this paper, we introduce TextGenSHAP, an efficient post-hoc explanation method incorporating LM-specific techniques. We demonstrate that this leads to significant increases in speed compared to conventional Shapley value computations, reducing processing times from hours to minutes for token-level explanations, and to just seconds for document-level explanations. In addition, we demonstrate how real-time Shapley values can be utilized in two important scenarios, providing better understanding of long-document question answering by localizing important words and sentences; and improving existing document retrieval systems through enhancing the accuracy of selected passages and ultimately the final responses.
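
For orientation, this is the baseline being accelerated: a permutation-sampling Shapley estimator over input tokens, where each token's value is its average marginal contribution to a model score. TextGenSHAP's LM-specific speedups are not reproduced here, and `score_fn` stands in for, e.g., the log-probability of a generated answer given a subset of the input:

```python
import random

def shapley_tokens(tokens, score_fn, n_perms=200, seed=0):
    """Monte Carlo Shapley values: average marginal contribution over random orders."""
    rng = random.Random(seed)
    phi = [0.0] * len(tokens)
    for _ in range(n_perms):
        order = list(range(len(tokens)))
        rng.shuffle(order)
        present, prev = set(), score_fn([])        # empty-context baseline score
        for i in order:
            present.add(i)
            cur = score_fn([tokens[j] for j in sorted(present)])
            phi[i] += (cur - prev) / n_perms       # marginal contribution of token i
            prev = cur
    return phi

# Toy score: counts "relevant" tokens, so those get high Shapley values.
toy_score = lambda toks: float(sum(t in {"paris", "capital"} for t in toks))
print(shapley_tokens(["the", "capital", "is", "paris"], toy_score, n_perms=50))
```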

Running cognitive evaluations on large language models: The do’s and the don’ts

  • paper_url: http://arxiv.org/abs/2312.01276
  • repo_url: None
  • paper_authors: Anna A. Ivanova
  • for: This paper lays out methodological considerations for studies that evaluate the cognitive capacities of large language models (LLMs) using language-based behavioral assessments.
  • methods: Drawing on three case studies from the literature (a commonsense knowledge benchmark, a theory of mind evaluation, and a test of syntactic agreement), it describes common pitfalls in applying cognitive tests to LLMs, then lists 10 do's and don'ts for designing high-quality cognitive evaluations of AI systems.
  • results: The paper closes by discussing four areas currently under active discussion: prompt sensitivity, cultural and linguistic diversity, using LLMs as research assistants, and running evaluations on open vs. closed LLMs, contributing to best practices in the growing field of AI Psychology.
    Abstract In this paper, I describe methodological considerations for studies that aim to evaluate the cognitive capacities of large language models (LLMs) using language-based behavioral assessments. Drawing on three case studies from the literature (a commonsense knowledge benchmark, a theory of mind evaluation, and a test of syntactic agreement), I describe common pitfalls that might arise when applying a cognitive test to an LLM. I then list 10 do's and don'ts that should help design high-quality cognitive evaluations for AI systems. I conclude by discussing four areas where the do's and don'ts are currently under active discussion -- prompt sensitivity, cultural and linguistic diversity, using LLMs as research assistants, and running evaluations on open vs. closed LLMs. Overall, the goal of the paper is to contribute to the broader discussion of best practices in the rapidly growing field of AI Psychology.

Low-Precision Mixed-Computation Models for Inference on Edge

  • paper_url: http://arxiv.org/abs/2312.02210
  • repo_url: None
  • paper_authors: Seyedarmin Azizi, Mahdi Nazemi, Mehdi Kamal, Massoud Pedram
  • for: This work proposes a mixed-computation neural-network processing approach for edge applications that combines low-precision (low-width) Posit and low-precision fixed-point (FixP) number systems.
  • methods: 4-bit Posit (Posit4), which has higher precision around zero, represents weights with high sensitivity, while 4-bit FixP (FixP4) represents the remaining weights; a heuristic based on weight importance and quantization error assigns the appropriate number system to each weight, and a gradient approximation for the Posit representation improves weight updates during backpropagation.
  • results: Across vision and language models, the mixed-computation approach improves accuracy by about 1.5% on average over FixP alone, at a cost of 0.19% energy overhead.
    Abstract This paper presents a mixed-computation neural network processing approach for edge applications that incorporates low-precision (low-width) Posit and low-precision fixed point (FixP) number systems. This mixed-computation approach employs 4-bit Posit (Posit4), which has higher precision around zero, for representing weights with high sensitivity, while it uses 4-bit FixP (FixP4) for representing other weights. A heuristic for analyzing the importance and the quantization error of the weights is presented to assign the proper number system to different weights. Additionally, a gradient approximation for Posit representation is introduced to improve the quality of weight updates in the backpropagation process. Due to the high energy consumption of the fully Posit-based computations, neural network operations are carried out in FixP or Posit/FixP. An efficient hardware implementation of a MAC operation with a first Posit operand and FixP for a second operand and accumulator is presented. The efficacy of the proposed low-precision mixed-computation approach is extensively assessed on vision and language models. The results show that, on average, the accuracy of the mixed-computation is about 1.5% higher than that of FixP with a cost of 0.19% energy overhead.

cs.CL - 2023-12-03

Using Large Language Models to Accelerate Communication for Users with Severe Motor Impairments

  • paper_url: http://arxiv.org/abs/2312.01532
  • repo_url: None
  • paper_authors: Shanqing Cai, Subhashini Venugopalan, Katie Seaver, Xiang Xiao, Katrin Tomanek, Sri Jalasutram, Meredith Ringel Morris, Shaun Kane, Ajit Narayanan, Robert L. MacDonald, Emily Kornman, Daniel Vance, Blair Casey, Steve M. Gleason, Philip Q. Nelson, Michael P. Brenner
  • for: To increase text-entry speed for people with severe motor impairments, improving their quality of life.
  • methods: SpeakFaster combines large language models (LLMs) with a co-designed user interface for text entry in a highly abbreviated form, saving 57% more motor actions than traditional predictive keyboards in offline simulation.
  • results: A pilot study with 19 non-AAC participants typing by hand on a mobile device showed motor savings in line with the offline simulation, with relatively small effects on overall typing speed. In lab and field tests, two eye-gaze typing users with amyotrophic lateral sclerosis (ALS) achieved text-entry rates 29-60% faster than traditional baselines, owing to phrase and word predictions from context-aware LLMs.
    Abstract Finding ways to accelerate text input for individuals with profound motor impairments has been a long-standing area of research. Closing the speed gap for augmentative and alternative communication (AAC) devices such as eye-tracking keyboards is important for improving the quality of life for such individuals. Recent advances in neural networks of natural language pose new opportunities for re-thinking strategies and user interfaces for enhanced text-entry for AAC users. In this paper, we present SpeakFaster, consisting of large language models (LLMs) and a co-designed user interface for text entry in a highly-abbreviated form, allowing saving 57% more motor actions than traditional predictive keyboards in offline simulation. A pilot study with 19 non-AAC participants typing on a mobile device by hand demonstrated gains in motor savings in line with the offline simulation, while introducing relatively small effects on overall typing speed. Lab and field testing on two eye-gaze typing users with amyotrophic lateral sclerosis (ALS) demonstrated text-entry rates 29-60% faster than traditional baselines, due to significant saving of expensive keystrokes achieved through phrase and word predictions from context-aware LLMs. These findings provide a strong foundation for further exploration of substantially-accelerated text communication for motor-impaired users and demonstrate a direction for applying LLMs to text-based user interfaces.

T3D: Towards 3D Medical Image Understanding through Vision-Language Pre-training

  • paper_url: http://arxiv.org/abs/2312.01529
  • repo_url: None
  • paper_authors: Che Liu, Cheng Ouyang, Yinda Chen, Cesar César Quilodrán-Casas, Lei Ma, Jie Fu, Yike Guo, Anand Shah, Wenjia Bai, Rossella Arcucci
  • for: To improve classification and segmentation tasks in medical image analysis by integrating clinical knowledge into visual representation learning for 3D images.
  • methods: T3D is a vision-language pre-training (VLP) framework with two text-informed pretext tasks: (i) text-informed contrastive learning and (ii) text-informed image restoration, learning 3D visual representations from high-resolution medical images without losing detail to downsampling.
  • results: Trained on a newly curated large-scale dataset of 3D medical images and radiology reports, T3D significantly outperforms existing visual self-supervised learning (vSSL) methods on organ and tumor segmentation and disease classification, underscoring its potential for representation learning in 3D medical image analysis.
    Abstract Expert annotation of 3D medical images for downstream analysis is resource-intensive, posing challenges in clinical applications. Visual self-supervised learning (vSSL), though effective for learning visual invariance, neglects the incorporation of domain knowledge from medicine. To incorporate medical knowledge into visual representation learning, vision-language pre-training (VLP) has shown promising results in 2D images. However, existing VLP approaches become generally impractical when applied to high-resolution 3D medical images due to GPU hardware constraints and the potential loss of critical details caused by downsampling, which is the intuitive solution to hardware constraints. To address the above limitations, we introduce T3D, the first VLP framework designed for high-resolution 3D medical images. T3D incorporates two text-informed pretext tasks: (i) text-informed contrastive learning; (ii) text-informed image restoration. These tasks focus on learning 3D visual representations from high-resolution 3D medical images and integrating clinical knowledge from radiology reports, without distorting information through forced alignment of downsampled volumes with detailed anatomical text. Trained on a newly curated large-scale dataset of 3D medical images and radiology reports, T3D significantly outperforms current vSSL methods in tasks like organ and tumor segmentation, as well as disease classification. This underlines T3D's potential in representation learning for 3D medical image analysis. All data and code will be available upon acceptance.

SymNoise: Advancing Language Model Fine-tuning with Symmetric Noise

  • paper_url: http://arxiv.org/abs/2312.01523
  • repo_url: None
  • paper_authors: Arjun Singh, Abhay Kumar Yadav
  • for: Improving language model fine-tuning performance across different models and baseline instruction datasets.
  • methods: Incorporates symmetric noise into the embedding process during fine-tuning, regulating the model's local curvature more stringently than existing approaches (a toy sketch follows the abstract).
  • results: Consistently outperforms the current state-of-the-art method NEFTune across models and baseline datasets; on AlpacaEval, SymNoise reaches 69.04%, a 6.7% improvement over NEFTune (64.69%).
    Abstract In this paper, we introduce a novel fine-tuning technique for language models, which involves incorporating symmetric noise into the embedding process. This method aims to enhance the model's function by more stringently regulating its local curvature, demonstrating superior performance over the current method, NEFTune. When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79% score on AlpacaEval. However, our approach, SymNoise, increases this score significantly to 69.04%, using symmetric noisy embeddings. This is a 6.7% improvement over the state-of-the-art method, NEFTune (64.69%). Furthermore, when tested on various models and stronger baseline instruction datasets, such as Evol-Instruct, ShareGPT, OpenPlatypus, SymNoise consistently outperforms NEFTune. The current literature, including NEFTune, has underscored the importance of more in-depth research into the application of noise-based strategies in the fine-tuning of language models. Our approach, SymNoise, is another significant step towards this direction, showing notable improvement over the existing state-of-the-art method.
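The intervention is small enough to sketch. Below is a minimal PyTorch sketch of noisy-embedding fine-tuning: the alpha/sqrt(L*d) scale is the one NEFTune uses, while the choice of symmetric Bernoulli (plus/minus one) sign noise is our assumption about what "symmetric noise" means here; the paper's exact distribution may differ, and the function name is illustrative.

```python
import torch

def symmetric_noisy_embeddings(emb: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Perturb token embeddings with symmetric noise during fine-tuning only.

    emb: (batch, seq_len, dim) output of the model's embedding layer.
    alpha: noise-scale hyperparameter, as in NEFTune.
    """
    _, seq_len, dim = emb.shape
    scale = alpha / (seq_len * dim) ** 0.5  # NEFTune's noise magnitude
    # Symmetric +/-1 sign noise (assumed form; NEFTune samples Uniform(-1, 1)).
    sign = torch.randint(0, 2, emb.shape, device=emb.device, dtype=emb.dtype) * 2 - 1
    return emb + scale * sign
```

At inference time the noise is dropped; only the training forward pass sees the perturbed embeddings.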

Bigger is not Always Better: The Effect of Context Size on Speech Pre-Training

  • paper_url: http://arxiv.org/abs/2312.01515
  • repo_url: https://github.com/sdrobert/scpc
  • paper_authors: Sean Robertson, Ewan Dunbar
  • for: investigate how much context is necessary to achieve high-quality pre-trained acoustic models using self-supervised learning
  • methods: principally investigate contrastive predictive coding (CPC), which we adapt to be able to precisely control the amount of context visible to the model during training and inference
  • results: find that phone discriminability in the resulting model representations peaks at around 40 ms of preceding context, and that having too much context (beyond around 320 ms) substantially degrades the quality of the representations (a toy sketch of the capped-context setup follows the abstract).
    Abstract It has been generally assumed in the automatic speech recognition (ASR) literature that it is better for models to have access to wider context windows. Yet, many of the potential reasons this might be true in the supervised setting do not necessarily transfer over to the case of unsupervised learning. We investigate how much context is necessary to achieve high-quality pre-trained acoustic models using self-supervised learning. We principally investigate contrastive predictive coding (CPC), which we adapt to be able to precisely control the amount of context visible to the model during training and inference. We find that phone discriminability in the resulting model representations peaks at around 40 ms of preceding context, and that having too much context (beyond around 320 ms) substantially degrades the quality of the representations. Surprisingly, we find that this pattern also transfers to supervised ASR when the pre-trained representations are used as frozen input features. Our results point to potential changes in the design of current upstream architectures to better facilitate a variety of downstream tasks.
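To make the "amount of context" knob concrete, here is a hedged, minimal sketch of a CPC-style contrastive step in PyTorch where the context is artificially capped at `context_len` preceding frames. The mean-pooled context and single prediction matrix are simplifications of the paper's adapted CPC model, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def cpc_step(frame_embs, w_pred, context_len=4, k=2):
    """One CPC-style contrastive step with a capped context window.

    frame_embs: (T, d) encoder outputs for one utterance.
    w_pred: (d, d) prediction matrix for the k-step-ahead frame.
    context_len: number of preceding frames the context may see; this is the
        knob the paper varies (discriminability peaks near 40 ms of context).
    """
    T, _ = frame_embs.shape
    losses = []
    for t in range(context_len, T - k):
        # Mean pooling stands in for the paper's learned context network.
        c_t = frame_embs[t - context_len:t].mean(dim=0)
        scores = frame_embs @ (w_pred @ c_t)   # all frames act as candidates
        target = torch.tensor([t + k])         # the true future frame should win
        losses.append(F.cross_entropy(scores.unsqueeze(0), target))
    return torch.stack(losses).mean()
```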

Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference?

  • paper_url: http://arxiv.org/abs/2312.01500
  • repo_url: None
  • paper_authors: Gopichand Kanumolu, Lokesh Madasu, Pavan Baswani, Ananya Mukherjee, Manish Shrivastava
  • for: Evaluating the sentence-level fluency of Natural Language Generation (NLG) systems.
  • methods: Adapts an existing unsupervised technique that measures text fluency without any reference, leveraging various word embeddings and training language models with Recurrent Neural Network (RNN) architectures; other available multilingual Language Models (LMs) are also tested.
  • results: A comparative analysis across 10 Indic languages correlates the obtained fluency scores with human judgments to assess model performance.
    Abstract Fluency is a crucial goal of all Natural Language Generation (NLG) systems. Widely used automatic evaluation metrics fall short in capturing the fluency of machine-generated text. Assessing the fluency of NLG systems poses a challenge since these models are not limited to simply reusing words from the input but may also generate abstractions. Existing reference-based fluency evaluations, such as word overlap measures, often exhibit weak correlations with human judgments. This paper adapts an existing unsupervised technique for measuring text fluency without the need for any reference. Our approach leverages various word embeddings and trains language models using Recurrent Neural Network (RNN) architectures. We also experiment with other available multilingual Language Models (LMs). To assess the performance of the models, we conduct a comparative analysis across 10 Indic languages, correlating the obtained fluency scores with human judgments. Our code and human-annotated benchmark test-set for fluency is available at https://github.com/AnanyaCoder/TextFluencyForIndicLanaguges.

Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models

  • paper_url: http://arxiv.org/abs/2312.02219
  • repo_url: None
  • paper_authors: Andrés Villa, Juan Carlos León Alcázar, Alvaro Soto, Bernard Ghanem
  • for: Assessing the effectiveness of instruction-tuned large vision and language models (IT-LVLMs) on fundamental computer vision tasks.
  • methods: Introduces MERLIM, a multi-modal evaluation benchmark containing over 279K image-question pairs, with a strong focus on detecting cross-modal "hallucination" events, where the language output refers to visual concepts that lack any effective grounding in the image.
  • results: State-of-the-art IT-LVLMs are still limited at identifying fine-grained visual concepts; object hallucinations are common across tasks, and results are strongly biased by small variations in the input query even when the queries share the same semantics. The models exhibit weak visual grounding but can still make adequate guesses from global visual patterns or textual biases in the LLM component.
    Abstract Large Vision and Language Models have enabled significant advances in fully supervised and zero-shot vision tasks. These large pre-trained architectures serve as the baseline to what is currently known as Instruction Tuning Large Vision and Language models (IT-LVLMs). IT-LVLMs are general-purpose multi-modal assistants whose responses are modulated by natural language instructions and arbitrary visual data. Despite this versatility, IT-LVLM effectiveness in fundamental computer vision problems remains unclear, primarily due to the absence of a standardized evaluation benchmark. This paper introduces a Multi-modal Evaluation Benchmark named MERLIM, a scalable test-bed to assess the performance of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 279K image-question pairs, and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs, where the language output refers to visual concepts that lack any effective grounding in the image. Our results show that state-of-the-art IT-LVLMs are still limited at identifying fine-grained visual concepts, object hallucinations are common across tasks, and their results are strongly biased by small variations in the input query, even if the queries have the very same semantics. Our findings also suggest that these models have weak visual groundings but they can still make adequate guesses by global visual patterns or textual biases contained in the LLM component.

Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars

  • paper_url: http://arxiv.org/abs/2312.01429
  • repo_url: None
  • paper_authors: Kaiyue Wen, Yuchen Li, Bingbin Liu, Andrej Risteski
  • for: Taking a critical view of interpretability methods that examine only individual parts of a trained model (e.g., a Transformer), such as single attention heads or weight matrices, rather than considering the network as a whole.
  • methods: Combines theoretical results with carefully controlled experiments on synthetic data, studying the simple setup of learning a (bounded) Dyck language and deriving a structural characterization of the set of (exact or approximate) solutions from the pumping lemma of formal language theory.
  • results: The set of optima is qualitatively rich: the attention pattern of a single layer can be "nearly randomized" while preserving the network's functionality, and even under severe architectural constraints, standard training reaches vastly different solutions; interpretability claims based on inspecting individual heads or weight matrices can therefore be misleading.
    Abstract Interpretability methods aim to understand the algorithm implemented by a trained model (e.g., a Transformer) by examining various aspects of the model, such as the weight matrices or the attention patterns. In this work, through a combination of theoretical results and carefully controlled experiments on synthetic data, we take a critical view of methods that exclusively focus on individual parts of the model, rather than consider the network as a whole. We consider a simple synthetic setup of learning a (bounded) Dyck language. Theoretically, we show that the set of models that (exactly or approximately) solve this task satisfy a structural characterization derived from ideas in formal languages (the pumping lemma). We use this characterization to show that the set of optima is qualitatively rich; in particular, the attention pattern of a single layer can be "nearly randomized", while preserving the functionality of the network. We also show via extensive experiments that these constructions are not merely a theoretical artifact: even after severely constraining the architecture of the model, vastly different solutions can be reached via standard training. Thus, interpretability claims based on inspecting individual heads or weight matrices in the Transformer can be misleading.

CEScore: Simple and Efficient Confidence Estimation Model for Evaluating Split and Rephrase

  • paper_url: http://arxiv.org/abs/2312.01356
  • repo_url: https://github.com/motasemajlouni/cescore
  • paper_authors: AlMotasem Bellah Al Ajlouni, Jinlong Li
  • for: Evaluating the quality of the split and rephrase (SR) task in natural language processing (NLP).
  • methods: Introduces CEScore, a novel statistical model that automatically evaluates SR by mimicking the way humans assess it, providing four metrics (Sscore, Gscore, Mscore, and CEscore) for simplicity, grammaticality, meaning preservation, and overall quality, respectively.
  • results: In experiments with 26 models, CEScore correlates strongly with human evaluations, achieving a Spearman correlation of 0.98 at the model level, making it a simple and effective metric for assessing the overall quality of SR models.
    Abstract The split and rephrase (SR) task aims to divide a long, complex sentence into a set of shorter, simpler sentences that convey the same meaning. This challenging problem in NLP has gained increased attention recently because of its benefits as a pre-processing step in other NLP tasks. Evaluating the quality of SR is challenging, as there is no automatic metric fit to evaluate this task. In this work, we introduce CEScore, a novel statistical model to automatically evaluate the SR task. By mimicking the way humans evaluate SR, CEScore provides 4 metrics (Sscore, Gscore, Mscore, and CEscore) to assess simplicity, grammaticality, meaning preservation, and overall quality, respectively. In experiments with 26 models, CEScore correlates strongly with human evaluations, achieving 0.98 in Spearman correlations at model-level. This underscores the potential of CEScore as a simple and effective metric for assessing the overall quality of SR models.

NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian

  • paper_url: http://arxiv.org/abs/2312.01314
  • repo_url: None
  • paper_authors: Peng Liu, Lemei Zhang, Terje Nissen Farup, Even W. Lauvrak, Jon Espen Ingvaldsen, Simen Eide, Jon Atle Gulla, Zhirong Yang
  • for: Evaluating natural language generation capabilities in Norwegian, a low-resource language, and examining whether current GLMs and benchmarks built for mainstream languages such as English can reveal the characteristics of underrepresented languages.
  • methods: Uses Norwegian as a case study and develops NLEBench, a comprehensive benchmark spanning real-world NLP tasks including news storytelling, summarization, open-domain conversation, natural language understanding, instruction fine-tuning, toxicity and bias evaluation, and a self-curated chain-of-thought investigation; it features two high-quality human-annotated datasets and introduces foundational Norwegian Generative Language Models (NorGLMs) with diverse parameter scales and Transformer-based architectures.
  • results: Systematic evaluations on the proposed benchmark suite provide insights into the capabilities and scalability of NorGLMs across various downstream tasks.
    Abstract Recent advancements in Generative Language Models (GLMs) have transformed Natural Language Processing (NLP) by showcasing the effectiveness of the "pre-train, prompt, and predict" paradigm in utilizing pre-trained GLM knowledge for diverse applications. Despite their potential, these capabilities lack adequate quantitative characterization due to the absence of comprehensive benchmarks, particularly for low-resource languages. Existing low-resource benchmarks focus on discriminative language models like BERT, neglecting the evaluation of generative language models. Moreover, current benchmarks often overlook measuring generalization performance across multiple tasks, a crucial metric for GLMs. To bridge these gaps, we introduce NLEBench, a comprehensive benchmark tailored for evaluating natural language generation capabilities in Norwegian, a low-resource language. We use Norwegian as a case study to explore whether current GLMs and benchmarks in mainstream languages like English can reveal the unique characteristics of underrepresented languages. NLEBench encompasses a suite of real-world NLP tasks ranging from news storytelling, summarization, open-domain conversation, natural language understanding, instruction fine-tuning, toxicity and bias evaluation, to self-curated Chain-of-Thought investigation. It features two high-quality, human-annotated datasets: an instruction dataset covering traditional Norwegian cultures, idioms, slang, and special expressions, and a document-grounded multi-label dataset for topic classification, question answering, and summarization. This paper also introduces foundational Norwegian Generative Language Models (NorGLMs) developed with diverse parameter scales and Transformer-based architectures. Systematic evaluations on the proposed benchmark suite provide insights into the capabilities and scalability of NorGLMs across various downstream tasks.

Bridging Background Knowledge Gaps in Translation with Automatic Explicitation

  • paper_url: http://arxiv.org/abs/2312.01308
  • repo_url: https://github.com/h-j-han/automatic_explicitation
  • paper_authors: HyoJung Han, Jordan Lee Boyd-Graber, Marine Carpuat
  • for: Improving translation usefulness and the reader's experience in cases where correct literal translations still fail because readers lack the necessary background knowledge.
  • methods: Introduces techniques for automatically generating explicitations, motivated by WikiExpl, a dataset collected from Wikipedia and annotated with the help of human translators.
  • results: The resulting explicitations are useful, helping answer questions more accurately in a multilingual question answering framework.
    Abstract Translations help people understand content written in another language. However, even correct literal translations do not fulfill that goal when people lack the necessary background to understand them. Professional translators incorporate explicitations to explain the missing context by considering cultural differences between source and target audiences. Despite its potential to help users, NLP research on explicitation is limited because of the dearth of adequate evaluation methods. This work introduces techniques for automatically generating explicitations, motivated by WikiExpl: a dataset that we collect from Wikipedia and annotate with human translators. The resulting explicitations are useful as they help answer questions more accurately in a multilingual question answering framework.

On Significance of Subword tokenization for Low Resource and Efficient Named Entity Recognition: A case study in Marathi

  • paper_url: http://arxiv.org/abs/2312.01306
  • repo_url: None
  • paper_authors: Harsh Chaudhari, Anuja Patil, Dhanashree Lavekar, Pranav Khairnar, Raviraj Joshi, Sachin Pande
  • for: Improving the performance of shallow models for Named Entity Recognition (NER) in low-resource languages, with Marathi as the case study.
  • methods: A hybrid approach that combines traditional deep learning models (CNN and LSTM) with a BERT-based subword tokenizer to improve the accuracy of NER models (sketched after the abstract).
  • results: Replacing a traditional word-based tokenizer with a BERT tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT, demonstrating the importance of subword tokenization for NER in low-resource languages.
    Abstract Named Entity Recognition (NER) systems play a vital role in NLP applications such as machine translation, summarization, and question-answering. These systems identify named entities, which encompass real-world concepts like locations, persons, and organizations. Despite extensive research on NER systems for the English language, they have not received adequate attention in the context of low resource languages. In this work, we focus on NER for low-resource language and present our case study in the context of the Indian language Marathi. The advancement of NLP research revolves around the utilization of pre-trained transformer models such as BERT for the development of NER models. However, we focus on improving the performance of shallow models based on CNN, and LSTM by combining the best of both worlds. In the era of transformers, these traditional deep learning models are still relevant because of their high computational efficiency. We propose a hybrid approach for efficient NER by integrating a BERT-based subword tokenizer into vanilla CNN/LSTM models. We show that this simple approach of replacing a traditional word-based tokenizer with a BERT-tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT. We show the importance of using sub-word tokenization for NER and present our study toward building efficient NLP systems. The evaluation is performed on L3Cube-MahaNER dataset using tokenizers from MahaBERT, MahaGPT, IndicBERT, and mBERT.
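A hedged sketch of the hybrid idea: feed ids from a pre-trained multilingual BERT subword tokenizer (mBERT's tokenizer, one of the four the paper compares) into a vanilla single-layer BiLSTM tagger. The layer sizes, tag count, and class name below are illustrative, not the paper's configuration, and subword-to-word label alignment is omitted.

```python
import torch.nn as nn
from transformers import AutoTokenizer

# mBERT's subword tokenizer, one of the tokenizers evaluated in the paper.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

class SubwordLSTMTagger(nn.Module):
    """Vanilla single-layer BiLSTM tagger fed BERT subword ids instead of word ids."""

    def __init__(self, vocab_size, num_tags, emb_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=tokenizer.pad_token_id)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)

    def forward(self, input_ids):
        hidden_states, _ = self.lstm(self.emb(input_ids))
        return self.proj(hidden_states)   # per-subword tag logits

enc = tokenizer("पुणे हे महाराष्ट्रातील एक शहर आहे", return_tensors="pt")
model = SubwordLSTMTagger(tokenizer.vocab_size, num_tags=7)
logits = model(enc["input_ids"])          # (1, num_subwords, num_tags)
```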

cs.LG - 2023-12-03

Recurrent Distance-Encoding Neural Networks for Graph Representation Learning

  • paper_url: http://arxiv.org/abs/2312.01538
  • repo_url: https://github.com/skeletondyh/GRED
  • paper_authors: Yuhui Ding, Antonio Orvieto, Bobby He, Thomas Hofmann
  • for: Proposing a new graph neural network architecture that reconciles the weakness of one-hop message passing at harnessing information from distant nodes with the high computational complexity and ad-hoc positional encodings of graph transformers.
  • methods: Builds on recent breakthroughs in deep state-space models for long-range sequence modeling: for a given target node, other nodes are aggregated by their shortest distance to the target, and a parallelizable linear recurrent network over the chain of distances provides a natural encoding of the neighborhood structure, with no positional encoding required (a toy sketch follows the abstract).
  • results: The model is highly competitive with state-of-the-art graph transformers on various benchmarks at a drastically reduced computational complexity, and is theoretically more expressive than one-hop message passing neural networks.
    Abstract Graph neural networks based on iterative one-hop message passing have been shown to struggle in harnessing information from distant nodes effectively. Conversely, graph transformers allow each node to attend to all other nodes directly, but suffer from high computational complexity and have to rely on ad-hoc positional encoding to bake in the graph inductive bias. In this paper, we propose a new architecture to reconcile these challenges. Our approach stems from the recent breakthroughs in long-range modeling provided by deep state-space models on sequential data: for a given target node, our model aggregates other nodes by their shortest distances to the target and uses a parallelizable linear recurrent network over the chain of distances to provide a natural encoding of its neighborhood structure. With no need for positional encoding, we empirically show that the performance of our model is highly competitive compared with that of state-of-the-art graph transformers on various benchmarks, at a drastically reduced computational complexity. In addition, we show that our model is theoretically more expressive than one-hop message passing neural networks.
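The distance-then-recurrence recipe can be mimicked in a few lines. A hedged sketch with networkx and NumPy: the scalar decay `lam` stands in for the learned, parallelizable linear recurrence of the paper's state-space layer, mean pooling per hop is one simple permutation-invariant aggregator, and all names are illustrative.

```python
import networkx as nx
import numpy as np

def distance_recurrent_encoding(G, features, target, max_dist=4, lam=0.9):
    """Encode `target`'s neighborhood: pool node features per shortest-path
    distance, then run a linear recurrence over the chain of distances,
    from the farthest hop inward. Assumes hops beyond max_dist are ignored."""
    dists = nx.single_source_shortest_path_length(G, target, cutoff=max_dist)
    dim = len(next(iter(features.values())))
    pooled = np.zeros((max_dist + 1, dim))
    counts = np.zeros(max_dist + 1)
    for node, d in dists.items():
        pooled[d] += features[node]
        counts[d] += 1
    nonzero = counts > 0
    pooled[nonzero] /= counts[nonzero][:, None]   # mean per hop distance
    h = np.zeros(dim)                             # recurrent state
    for d in range(max_dist, -1, -1):             # h <- lam * h + pooled[d]
        h = lam * h + pooled[d]
    return h

G = nx.path_graph(6)
feats = {n: np.eye(6)[n] for n in G}
print(distance_recurrent_encoding(G, feats, target=0))
```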

Evaluation of Active Feature Acquisition Methods for Time-varying Feature Settings

  • paper_url: http://arxiv.org/abs/2312.01530
  • repo_url: None
  • paper_authors: Henrik von Kleist, Alireza Zamanian, Ilya Shpitser, Narges Ahmidi
  • for: Safely evaluating AI agents that decide which features to acquire, in domains such as healthcare where a feature's acquisition cost must be balanced against its predictive value.
  • methods: Studies active feature acquisition performance evaluation (AFAPE) under (i) a no direct effect (NDE) assumption, stating that acquisitions do not affect the underlying feature values, and (ii) a no unobserved confounding (NUC) assumption, stating that retrospective acquisition decisions were based only on observed features.
  • results: Offline reinforcement learning applies under the NUC assumption and missing data methods under the NDE assumption; when both hold, a novel semi-offline reinforcement learning framework requires a weaker positivity assumption and yields more data-efficient estimators, including three new ones: a direct method (DM), an inverse probability weighting (IPW) estimator, and a double reinforcement learning (DRL) estimator.
    Abstract Machine learning methods often assume input features are available at no cost. However, in domains like healthcare, where acquiring features could be expensive or harmful, it is necessary to balance a feature's acquisition cost against its predictive value. The task of training an AI agent to decide which features to acquire is called active feature acquisition (AFA). By deploying an AFA agent, we effectively alter the acquisition strategy and trigger a distribution shift. To safely deploy AFA agents under this distribution shift, we present the problem of active feature acquisition performance evaluation (AFAPE). We examine AFAPE under i) a no direct effect (NDE) assumption, stating that acquisitions don't affect the underlying feature values; and ii) a no unobserved confounding (NUC) assumption, stating that retrospective feature acquisition decisions were only based on observed features. We show that one can apply offline reinforcement learning under the NUC assumption and missing data methods under the NDE assumption. When NUC and NDE hold, we propose a novel semi-offline reinforcement learning framework, which requires a weaker positivity assumption and yields more data-efficient estimators. We introduce three novel estimators: a direct method (DM), an inverse probability weighting (IPW), and a double reinforcement learning (DRL) estimator.

Learn2Extend: Extending sequences by retaining their statistical properties with mixture models

  • paper_url: http://arxiv.org/abs/2312.01507
  • repo_url: None
  • paper_authors: Dimitris Vartziotis, George Dasoulas, Florian Pausinger
  • for: Extending general finite sequences of real numbers within a subinterval of the real line while preserving their inherent statistical properties, in particular the gap distribution and pair correlation function, using machine learning.
  • methods: An auto-regressive Sequence Extension Mixture Model (SEMM) that directly estimates the conditional density of the next point, rather than the intensity function (a toy mixture-density sketch follows the abstract).
  • results: Comparative experiments on multiple types of point processes, including Poisson, locally attractive, and locally repelling sequences, together with a case study on predicting Riemann zeta function zeroes, show that the proposed mixture model outperforms traditional neural network architectures at sequence extension while retaining the targeted statistical properties.
    Abstract This paper addresses the challenge of extending general finite sequences of real numbers within a subinterval of the real line, maintaining their inherent statistical properties by employing machine learning. Our focus lies on preserving the gap distribution and pair correlation function of these point sets. Leveraging advancements in deep learning applied to point processes, this paper explores the use of an auto-regressive Sequence Extension Mixture Model (SEMM) for extending finite sequences, by estimating directly the conditional density, instead of the intensity function. We perform comparative experiments on multiple types of point processes, including Poisson, locally attractive, and locally repelling sequences, and we perform a case study on the prediction of Riemann $\zeta$ function zeroes. The results indicate that the proposed mixture model outperforms traditional neural network architectures in sequence extension with the retention of statistical properties. Given this motivation, we showcase the capabilities of a mixture model to extend sequences, maintaining specific statistical properties, i.e. the gap distribution, and pair correlation indicators.
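SEMM's defining move, estimating the conditional density of the next point directly, can be sketched with a small mixture density network. Everything below (the window size, number of Gaussian components, layer sizes, and class name) is an illustrative stand-in, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GapMixtureDensity(nn.Module):
    """Autoregressive mixture density over the next gap of a point sequence:
    condition on the last w gaps, emit parameters of a k-component Gaussian
    mixture for the next gap."""

    def __init__(self, w=8, k=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(w, hidden), nn.Tanh(), nn.Linear(hidden, 3 * k))
        self.k = k

    def forward(self, gaps):                       # gaps: (batch, w)
        logits, mu, log_sigma = self.net(gaps).chunk(3, dim=-1)
        return logits, mu, log_sigma.exp()

    def nll(self, gaps, next_gap):                 # next_gap: (batch,)
        logits, mu, sigma = self(gaps)
        comp = torch.distributions.Normal(mu, sigma)
        log_probs = comp.log_prob(next_gap.unsqueeze(-1))   # (batch, k)
        weights = torch.log_softmax(logits, dim=-1)
        return -torch.logsumexp(weights + log_probs, dim=-1).mean()
```

Training minimizes `nll` over sliding windows of observed gaps; extension then samples gaps autoregressively from the fitted mixture.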

The mechanistic basis of data dependence and abrupt learning in an in-context classification task

  • paper_url: http://arxiv.org/abs/2312.03002
  • repo_url: None
  • paper_authors: Gautam Reddy
  • for: Understanding in-context learning in Transformer models: the ability to accurately predict the response to a novel query based on illustrative examples in the input sequence, as opposed to traditional in-weights learning of query-output relationships.
  • methods: Trains a minimal attention-only network on a simplified dataset and shows that distributional properties inherent in language, such as burstiness, large dictionaries, and skewed rank-frequency distributions, control the trade-off between, or simultaneous appearance of, in-context and in-weights learning.
  • results: In-context learning is driven by the abrupt emergence of an induction head, which subsequently competes with in-weights learning; using progress measures that precede in-context learning and targeted experiments, the authors construct a two-parameter model of an induction head that emulates the full data-distributional dependencies displayed by the attention-based network, tracing the abrupt transitions to the sequential learning of three nested logits enabled by an intrinsic curriculum.
    Abstract Transformer models exhibit in-context learning: the ability to accurately predict the response to a novel query based on illustrative examples in the input sequence. In-context learning contrasts with traditional in-weights learning of query-output relationships. What aspects of the training data distribution and architecture favor in-context vs in-weights learning? Recent work has shown that specific distributional properties inherent in language, such as burstiness, large dictionaries and skewed rank-frequency distributions, control the trade-off or simultaneous appearance of these two forms of learning. We first show that these results are recapitulated in a minimal attention-only network trained on a simplified dataset. In-context learning (ICL) is driven by the abrupt emergence of an induction head, which subsequently competes with in-weights learning. By identifying progress measures that precede in-context learning and targeted experiments, we construct a two-parameter model of an induction head which emulates the full data distributional dependencies displayed by the attention-based network. A phenomenological model of induction head formation traces its abrupt emergence to the sequential learning of three nested logits enabled by an intrinsic curriculum. We propose that the sharp transitions in attention-based networks arise due to a specific chain of multi-layer operations necessary to achieve ICL, which is implemented by nested nonlinearities sequentially learned during training.

Normed Spaces for Graph Embedding

  • paper_url: http://arxiv.org/abs/2312.01502
  • repo_url: https://github.com/andyweizhao/graphs-normed-spaces
  • paper_authors: Diaaeldin Taha, Wei Zhao, J. Maxwell Riestenberg, Michael Strube
  • for: Investigating normed spaces as a more flexible and computationally efficient alternative to popular Riemannian manifolds for learning graph embeddings.
  • methods: Motivated by theoretical results from discrete geometry showing that normed spaces can abstractly embed finite metric spaces with surprisingly low distortion in low dimensions, the authors benchmark normed space embeddings against manifold-based ones (a stress-minimization sketch follows the abstract).
  • results: Normed space embeddings significantly outperform several popular manifolds on a wide range of synthetic and real-world graph reconstruction benchmarks while requiring far fewer computational resources; the advantage holds on growing graph families of negative, zero, and positive curvature, and extends to applied tasks such as link prediction and recommender systems.
    Abstract Theoretical results from discrete geometry suggest that normed spaces can abstractly embed finite metric spaces with surprisingly low theoretical bounds on distortion in low dimensions. In this paper, inspired by this theoretical insight, we highlight normed spaces as a more flexible and computationally efficient alternative to several popular Riemannian manifolds for learning graph embeddings. Normed space embeddings significantly outperform several popular manifolds on a large range of synthetic and real-world graph reconstruction benchmark datasets while requiring significantly fewer computational resources. We also empirically verify the superiority of normed space embeddings on growing families of graphs associated with negative, zero, and positive curvature, further reinforcing the flexibility of normed spaces in capturing diverse graph structures as graph sizes increase. Lastly, we demonstrate the utility of normed space embeddings on two applied graph embedding tasks, namely, link prediction and recommender systems. Our work highlights the potential of normed spaces for geometric graph representation learning, raises new research questions, and offers a valuable tool for experimental mathematics in the field of finite metric space embeddings. We make our code and data publically available.
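A minimal sketch of the core recipe: embed nodes in a normed space (here $\ell_1$) so that pairwise norm distances match shortest-path distances, using a plain stress objective. The paper's actual training objective and optimizer settings may differ; this only illustrates how cheaply such embeddings can be fit.

```python
import networkx as nx
import torch

def embed_graph_l1(G, dim=8, steps=2000, lr=0.05):
    """Embed nodes of a connected graph in (R^dim, ||.||_1) by minimizing the
    stress between pairwise l1 distances and shortest-path distances."""
    nodes = list(G)
    D = torch.tensor([[nx.shortest_path_length(G, u, v) for v in nodes]
                      for u in nodes], dtype=torch.float)
    X = torch.randn(len(nodes), dim, requires_grad=True)
    opt = torch.optim.Adam([X], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        dist = (X[:, None, :] - X[None, :, :]).abs().sum(-1)  # pairwise l1 distances
        loss = ((dist - D) ** 2).triu(1).mean()               # stress, upper triangle
        loss.backward()
        opt.step()
    return X.detach()
```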

OpenVoice: Versatile Instant Voice Cloning

  • paper_url: http://arxiv.org/abs/2312.01479
  • repo_url: https://github.com/myshell-ai/openvoice
  • paper_authors: Zengyi Qin, Wenliang Zhao, Xumin Yu, Xin Sun
  • for: Presenting a versatile instant voice cloning approach that needs only a short audio clip from a reference speaker to replicate their voice and generate speech in multiple languages.
  • methods: Provides granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker; styles are not directly copied from, or constrained by, the reference.
  • results: Achieves zero-shot cross-lingual voice cloning for languages absent from the massive-speaker training set, at tens of times lower computational cost than commercially available APIs with inferior performance.
    Abstract We introduce OpenVoice, a versatile voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice represents a significant advancement in addressing the following open challenges in the field: 1) Flexible Voice Style Control. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. The voice styles are not directly copied from and constrained by the style of the reference speaker. Previous approaches lacked the ability to flexibly manipulate voice styles after cloning. 2) Zero-Shot Cross-Lingual Voice Cloning. OpenVoice achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set. Unlike previous approaches, which typically require extensive massive-speaker multi-lingual (MSML) dataset for all languages, OpenVoice can clone voices into a new language without any massive-speaker training data for that language. OpenVoice is also computationally efficient, costing tens of times less than commercially available APIs that offer even inferior performance. To foster further research in the field, we have made the source code and trained model publicly accessible. We also provide qualitative results in our demo website. Prior to its public release, our internal version of OpenVoice was used tens of millions of times by users worldwide between May and October 2023, serving as the backend of MyShell.ai.

Regularity as Intrinsic Reward for Free Play

  • paper_url: http://arxiv.org/abs/2312.01473
  • repo_url: https://github.com/martius-lab/rair-mbrl
  • paper_authors: Cansu Sancaktar, Justus Piater, Georg Martius
  • for: Proposing regularity as a novel reward signal for intrinsically-motivated reinforcement learning.
  • methods: Taking inspiration from child development, postulates that striving for structure and order guides exploration toward a subspace of tasks not favored by naive uncertainty-based intrinsic rewards; the generalized formulation, Regularity as Intrinsic Reward (RaIR), is operationalized within model-based reinforcement learning (a toy instantiation follows the abstract).
  • results: In a synthetic environment, pursuing the regularity objective produces a plethora of structured patterns; in a multi-object robotic manipulation environment, combining RaIR with the model's epistemic uncertainty during free play leads to the autonomous construction of towers and other regular structures, yielding a substantial improvement in zero-shot downstream performance on assembly tasks.
    Abstract We propose regularity as a novel reward signal for intrinsically-motivated reinforcement learning. Taking inspiration from child development, we postulate that striving for structure and order helps guide exploration towards a subspace of tasks that are not favored by naive uncertainty-based intrinsic rewards. Our generalized formulation of Regularity as Intrinsic Reward (RaIR) allows us to operationalize it within model-based reinforcement learning. In a synthetic environment, we showcase the plethora of structured patterns that can emerge from pursuing this regularity objective. We also demonstrate the strength of our method in a multi-object robotic manipulation environment. We incorporate RaIR into free play and use it to complement the model's epistemic uncertainty as an intrinsic reward. Doing so, we witness the autonomous construction of towers and other regular structures during free play, which leads to a substantial improvement in zero-shot downstream task performance on assembly tasks.
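The abstract formulates regularity generally; one plausible instantiation (an assumption, not the paper's exact definition) is the negative Shannon entropy of the discretized pairwise offsets between objects: repeated relative offsets, as in rows, stacks, or grids, concentrate the histogram and raise the reward.

```python
from collections import Counter

import numpy as np

def regularity_reward(positions, bin_size=0.05):
    """Toy regularity signal: negative entropy of discretized pairwise offsets.

    positions: list of np.ndarray object positions.
    """
    if len(positions) < 2:
        return 0.0
    offsets = [tuple(np.round((p - q) / bin_size).astype(int))
               for i, p in enumerate(positions)
               for j, q in enumerate(positions) if i != j]
    counts = np.array(list(Counter(offsets).values()), dtype=float)
    probs = counts / counts.sum()
    return float(np.sum(probs * np.log(probs)))   # = -H(offsets); higher = more regular
```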

Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits

  • paper_url: http://arxiv.org/abs/2312.01457
  • repo_url: https://github.com/faaizt/mr-ope
  • paper_authors: Muhammad Faaiz Taufiq, Arnaud Doucet, Rob Cornish, Jean-Francois Ton
  • for: Evaluating new policies from existing logged data, without costly online experimentation.
  • methods: Introduces the Marginal Ratio (MR) estimator, a new off-policy evaluation estimator for contextual bandits that focuses on the shift in the marginal distribution of outcomes rather than on the policies themselves (a baseline IPW sketch follows the abstract).
  • results: MR attains lower variance than conventional methods such as Inverse Probability Weighting (IPW) and Doubly Robust (DR) estimators, achieves lower variance within a generalized family of Marginalized Inverse Propensity Score (MIPS) estimators, and exhibits enhanced performance when estimating Average Treatment Effects (ATE) in causal inference; experiments on synthetic and real-world datasets corroborate these findings.
    Abstract Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new policies using existing data without costly experimentation. However, current OPE methods, such as Inverse Probability Weighting (IPW) and Doubly Robust (DR) estimators, suffer from high variance, particularly in cases of low overlap between target and behavior policies or large action and context spaces. In this paper, we introduce a new OPE estimator for contextual bandits, the Marginal Ratio (MR) estimator, which focuses on the shift in the marginal distribution of outcomes $Y$ instead of the policies themselves. Through rigorous theoretical analysis, we demonstrate the benefits of the MR estimator compared to conventional methods like IPW and DR in terms of variance reduction. Additionally, we establish a connection between the MR estimator and the state-of-the-art Marginalized Inverse Propensity Score (MIPS) estimator, proving that MR achieves lower variance among a generalized family of MIPS estimators. We further illustrate the utility of the MR estimator in causal inference settings, where it exhibits enhanced performance in estimating Average Treatment Effects (ATE). Our experiments on synthetic and real-world datasets corroborate our theoretical findings and highlight the practical advantages of the MR estimator in OPE for contextual bandits.
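For reference, the standard IPW baseline that MR is compared against is a one-liner; the MR estimator instead reweights by a ratio involving the marginal distribution of outcomes (its exact form is derived in the paper), which is what curbs the variance when the target and behavior policies overlap poorly. The names below are illustrative.

```python
import numpy as np

def ipw_value(contexts, actions, rewards, pi_target, pi_behavior):
    """Standard IPW off-policy value estimate for a contextual bandit:
    V_hat = mean( pi_e(a|x) / pi_b(a|x) * r ), the high-variance baseline.

    pi_target, pi_behavior: callables giving the probability of action a in context x.
    """
    w = np.array([pi_target(a, x) / pi_behavior(a, x)
                  for x, a in zip(contexts, actions)])
    return float(np.mean(w * np.asarray(rewards)))
```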

Compositional Policy Learning in Stochastic Control Systems with Formal Guarantees

  • paper_url: http://arxiv.org/abs/2312.01456
  • repo_url: https://github.com/mlech26l/neural_martingales
  • paper_authors: Đorđe Žikelić, Mathias Lechner, Abhinav Verma, Krishnendu Chatterjee, Thomas A. Henzinger
  • for: Learning neural network policies for complicated control tasks in stochastic environments, with formal guarantees on their behavior.
  • methods: Leverages the compositional nature of logical specifications in SpectRL to learn over graphs of probabilistic reach-avoid specifications, training neural network policies together with reach-avoid supermartingales (RASMs) for the graph's sub-tasks and composing them into a global policy.
  • results: Produces a formal certificate guaranteeing that a specification over the policy's behavior is satisfied with the desired probability, and derives a tighter lower bound than previous work on the probability of reach-avoidance implied by a RASM, which is needed to find compositional policies with acceptable probabilistic thresholds for complex tasks with multiple edge policies; a prototype is evaluated on a Stochastic Nine Rooms environment.
    Abstract Reinforcement learning has shown promising results in learning neural network policies for complicated control tasks. However, the lack of formal guarantees about the behavior of such policies remains an impediment to their deployment. We propose a novel method for learning a composition of neural network policies in stochastic environments, along with a formal certificate which guarantees that a specification over the policy's behavior is satisfied with the desired probability. Unlike prior work on verifiable RL, our approach leverages the compositional nature of logical specifications provided in SpectRL, to learn over graphs of probabilistic reach-avoid specifications. The formal guarantees are provided by learning neural network policies together with reach-avoid supermartingales (RASM) for the graph's sub-tasks and then composing them into a global policy. We also derive a tighter lower bound compared to previous work on the probability of reach-avoidance implied by a RASM, which is required to find a compositional policy with an acceptable probabilistic threshold for complex tasks with multiple edge policies. We implement a prototype of our approach and evaluate it on a Stochastic Nine Rooms environment.

Classification of Home Network Problems with Transformers

  • paper_url: http://arxiv.org/abs/2312.01445
  • repo_url: None
  • paper_authors: Jeremias Dötterl, Zahra Hemmati Fard
  • for: Building a classifier that identifies ten common home network problems from raw tool output, helping users detect and resolve issues quickly.
  • methods: A deep learning model with an encoder-only transformer architecture, together with a proposed pre-tokenizer that splits the textual output of networking tools such as ping, dig, and ip into token sequences (a sketch of one plausible tokenization scheme follows the abstract).
  • results: The model achieves high accuracy in experiments, demonstrating the high potential of transformer-based problem classification for the home network.
    Abstract We propose a classifier that can identify ten common home network problems based on the raw textual output of networking tools such as ping, dig, and ip. Our deep learning model uses an encoder-only transformer architecture with a particular pre-tokenizer that we propose for splitting the tool output into token sequences. The use of transformers distinguishes our approach from related work on network problem classification, which still primarily relies on non-deep-learning methods. Our model achieves high accuracy in our experiments, demonstrating the high potential of transformer-based problem classification for the home network.
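The pre-tokenizer is the interesting design choice here. The abstract does not spell out its rules, so the scheme below (lowercasing words, keeping punctuation as single tokens, bucketing numbers to keep the vocabulary small) is an assumption about what such a tokenizer could look like.

```python
import re

def pre_tokenize(tool_output):
    """Split raw `ping`/`dig`/`ip` output into tokens for an encoder-only
    transformer: words, punctuation, and a shared <num> bucket for numbers."""
    tokens = []
    for raw in re.findall(r"[A-Za-z]+|\d+(?:\.\d+)?|[^\sA-Za-z\d]", tool_output):
        tokens.append("<num>" if raw[0].isdigit() else raw.lower())
    return tokens

print(pre_tokenize("64 bytes from 8.8.8.8: icmp_seq=1 ttl=118 time=12.3 ms"))
# e.g. ['<num>', 'bytes', 'from', '<num>', '.', '<num>', '.', ...]
```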

Fast Dual Subgradient Optimization of the Integrated Transportation Distance Between Stochastic Kernels

  • paper_url: http://arxiv.org/abs/2312.01432
  • repo_url: None
  • paper_authors: Zhengqi Lin, Andrzej Ruszczynski
  • for: Developing an approximation technique, built on a generalization of the Wasserstein metric, that simplifies the probability kernels of Markov systems.
  • methods: Defines the integrated transportation distance, a novel distance between stochastic kernels, and presents a specialized dual subgradient algorithm that quickly and efficiently constructs approximate kernels with discrete support of limited cardinality, without computationally expensive matrix operations.
  • results: Illustrative examples demonstrate that the method simplifies kernels efficiently in practice, opening new possibilities for the streamlined analysis and manipulation of stochastic systems represented by kernels.
    Abstract A generalization of the Wasserstein metric, the integrated transportation distance, establishes a novel distance between probability kernels of Markov systems. This metric serves as the foundation for an efficient approximation technique, enabling the replacement of the original system's kernel with a kernel with a discrete support of limited cardinality. To facilitate practical implementation, we present a specialized dual algorithm capable of constructing these approximate kernels quickly and efficiently, without requiring computationally expensive matrix operations. Finally, we demonstrate the efficacy of our method through several illustrative examples, showcasing its utility in practical scenarios. This advancement offers new possibilities for the streamlined analysis and manipulation of stochastic systems represented by kernels.

Neural Network Characterization and Entropy Regulated Data Balancing through Principal Component Analysis

  • paper_url: http://arxiv.org/abs/2312.01392
  • repo_url: None
  • paper_authors: David Yevick, Karolina Hutchison
  • for: Examining the relationship between the behavior of a neural network and the distribution formed by projecting training data records onto the space spanned by the low-order principal components of the training data.
  • methods: Analyzes the training data with low-dimensional principal component analysis and uses the components associated with each data record as inputs to the neural network; the principal-component space is further divided into histogram bins whose entropy drives a simple data balancing procedure (sketched after the abstract).
  • results: In a benchmark calculation with rotated and unrotated MNIST digits, classes mapped far from the origin in the low-dimensional principal component space that overlap minimally with other digits converge rapidly and reach high accuracy; averaging the records mapped into each bin yields patterns whose geometric features interpolate between adjacent bins, analogous to variational autoencoders.
    Abstract This paper examines the relationship between the behavior of a neural network and the distribution formed from the projections of the data records into the space spanned by the low-order principal components of the training data. For example, in a benchmark calculation involving rotated and unrotated MNIST digits, classes (digits) that are mapped far from the origin in a low-dimensional principal component space and that overlap minimally with other digits converge rapidly and exhibit high degrees of accuracy in neural network calculations that employ the associated components of each data record as inputs. Further, if the space spanned by these low-order principal components is divided into bins and the input data records that are mapped into a given bin averaged, the resulting pattern can be distinguished by its geometric features which interpolate between those of adjacent bins in an analogous manner to variational autoencoders. Based on this observation, a simply realized data balancing procedure can be realized by evaluating the entropy associated with each histogram bin and subsequently repeating the original image data associated with the bin by a number of times that is determined from this entropy.
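The balancing procedure is concrete enough to sketch with scikit-learn. One detail is assumed: the abstract says the repetition count is "determined from" the entropy of each histogram bin, and the rule below (repeat count proportional to the bin's surprisal, -log p) is one plausible reading, not necessarily the paper's exact formula.

```python
from collections import Counter

import numpy as np
from sklearn.decomposition import PCA

def entropy_balance(X, n_components=2, bins=10):
    """Oversample records falling into low-probability bins of the
    low-order principal-component space (repeat count ~ bin surprisal)."""
    Z = PCA(n_components=n_components).fit_transform(X)
    # Discretize each retained component into `bins` intervals.
    idx = np.stack(
        [np.clip(np.digitize(Z[:, j], np.histogram_bin_edges(Z[:, j], bins)[1:-1]),
                 0, bins - 1)
         for j in range(n_components)], axis=1)
    keys = [tuple(row) for row in idx]
    freq = Counter(keys)
    n = len(keys)
    reps = [max(1, int(np.round(-np.log(freq[k] / n)))) for k in keys]
    return np.repeat(X, reps, axis=0)
```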

Regret Optimality of GP-UCB

  • paper_url: http://arxiv.org/abs/2312.01386
  • repo_url: None
  • paper_authors: Wenjia Wang, Xiaowei Zhang, Lu Zou
  • for: Answering whether GP-UCB is regret optimal, an important open question in the Bayesian optimization literature.
  • methods: Establishes new upper bounds on both the simple and cumulative regret of GP-UCB when the objective function satisfies certain smoothness properties, building on a refined uniform error bound for online estimation of functions in reproducing kernel Hilbert spaces derived from empirical process theory (the acquisition rule itself is sketched after the abstract).
  • results: The new upper bounds match the known minimax lower bounds (up to logarithmic factors independent of the feasible region's dimensionality), showing that with the same level of exploration, GP-UCB simultaneously achieves optimality in both simple and cumulative regret.
    Abstract Gaussian Process Upper Confidence Bound (GP-UCB) is one of the most popular methods for optimizing black-box functions with noisy observations, due to its simple structure and superior performance. Its empirical successes lead to a natural, yet unresolved question: Is GP-UCB regret optimal? In this paper, we offer the first generally affirmative answer to this important open question in the Bayesian optimization literature. We establish new upper bounds on both the simple and cumulative regret of GP-UCB when the objective function to optimize admits certain smoothness property. These upper bounds match the known minimax lower bounds (up to logarithmic factors independent of the feasible region's dimensionality) for optimizing functions with the same smoothness. Intriguingly, our findings indicate that, with the same level of exploration, GP-UCB can simultaneously achieve optimality in both simple and cumulative regret. The crux of our analysis hinges on a refined uniform error bound for online estimation of functions in reproducing kernel Hilbert spaces. This error bound, which we derive from empirical process theory, is of independent interest, and its potential applications may reach beyond the scope of this study.
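The acquisition rule under analysis is simple to state: pick the point maximizing the posterior mean plus a scaled posterior standard deviation. A minimal scikit-learn sketch follows; the kernel choice, noise level, and fixed beta are illustrative, since the paper's contribution concerns how the regret bounds scale rather than these settings.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def gp_ucb_pick(X_obs, y_obs, candidates, beta=4.0):
    """One GP-UCB step: fit the GP posterior, then pick the candidate that
    maximizes the upper confidence bound mu(x) + sqrt(beta) * sigma(x)."""
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-3)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(mu + np.sqrt(beta) * sigma)]
```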

Relation between PLS and OLS regression in terms of the eigenvalue distribution of the regressor covariance matrix

  • paper_url: http://arxiv.org/abs/2312.01379
  • repo_url: None
  • paper_authors: David del Val, José R. Berrendero, Alberto Suárez
  • for: Relating partial least squares (PLS) regression to ordinary least squares (OLS) regression in scalar regression problems.
  • methods: Uses the formulation of PLS regression as a least squares problem restricted to a Krylov subspace to analyze the distance between the PLS estimator based on $L$ components and the OLS estimator, measured as a Mahalanobis distance with respect to the covariance matrix of the OLS estimate (an empirical sketch follows the abstract).
  • results: Provides a bound on this distance that depends only on the distribution of the eigenvalues of the regressor covariance matrix; numerical examples on synthetic and real-world data illustrate how, as $L$ grows, the PLS estimator approaches the OLS estimator, at a rate governed by the number of clusters into which the eigenvalues are grouped.
    Abstract Partial least squares (PLS) is a dimensionality reduction technique introduced in the field of chemometrics and successfully employed in many other areas. The PLS components are obtained by maximizing the covariance between linear combinations of the regressors and of the target variables. In this work, we focus on its application to scalar regression problems. PLS regression consists in finding the least squares predictor that is a linear combination of a subset of the PLS components. Alternatively, PLS regression can be formulated as a least squares problem restricted to a Krylov subspace. This equivalent formulation is employed to analyze the distance between $\hat{\boldsymbol\beta}_{\mathrm{PLS}}^{(L)}$, the PLS estimator of the vector of coefficients of the linear regression model based on $L$ PLS components, and $\hat{\boldsymbol\beta}_{\mathrm{OLS}}$, the one obtained by ordinary least squares (OLS), as a function of $L$. Specifically, $\hat{\boldsymbol\beta}_{\mathrm{PLS}}^{(L)}$ is the vector of coefficients in the aforementioned Krylov subspace that is closest to $\hat{\boldsymbol\beta}_{\mathrm{OLS}}$ in terms of the Mahalanobis distance with respect to the covariance matrix of the OLS estimate. We provide a bound on this distance that depends only on the distribution of the eigenvalues of the regressor covariance matrix. Numerical examples on synthetic and real-world data are used to illustrate how the distance between $\hat{\boldsymbol\beta}_{\mathrm{PLS}}^{(L)}$ and $\hat{\boldsymbol\beta}_{\mathrm{OLS}}$ depends on the number of clusters in which the eigenvalues of the regressor covariance matrix are grouped.
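The quantity being bounded is easy to compute empirically. A hedged scikit-learn sketch on synthetic data follows; note that sklearn's PLSRegression centers internally, so we center beforehand and disable scaling to keep the comparison with plain OLS meaningful.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
X, y = X - X.mean(axis=0), y - y.mean()        # center, matching PLS's internal centering

# OLS coefficients and the covariance matrix of the OLS estimate.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2 = np.mean((y - X @ beta_ols) ** 2)
cov_ols = sigma2 * np.linalg.inv(X.T @ X)

for L in range(1, 6):
    beta_pls = PLSRegression(n_components=L, scale=False).fit(X, y).coef_.ravel()
    diff = beta_pls - beta_ols
    # Mahalanobis distance w.r.t. the covariance of the OLS estimate.
    print(L, float(np.sqrt(diff @ np.linalg.solve(cov_ols, diff))))
```

As $L$ approaches the number of distinct eigenvalue clusters of the regressor covariance matrix, this distance should shrink toward zero, which is the behavior the paper's bound quantifies.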

Simulation-Based Inference of Surface Accumulation and Basal Melt Rates of an Antarctic Ice Shelf from Isochronal Layers

  • paper_url: http://arxiv.org/abs/2312.02997
  • repo_url: https://github.com/mackelab/preprocessing-ice-data
  • paper_authors: Guy Moss, Vjeran Višnjević, Olaf Eisen, Falk M. Oraschewski, Cornelius Schröder, Jakob H. Macke, Reinhard Drews
  • for: Inferring the surface accumulation and basal melt rates of ice shelves over decadal and centennial timescales from radar observations, using simulation-based inference (SBI).
  • methods: A kinematic forward model of internal stratigraphy is used to infer the spatial dependence of surface accumulation and basal melt rates along flow line transects; SBI trains neural networks on simulations of the forward model to approximate the posterior distribution, allowing the uncertainties over the inferred parameters to be quantified.
  • results: The inferred rates indicate stable atmospheric and oceanographic conditions over the past several centuries in the catchment of Antarctica's Ekström Ice Shelf; using observed internal stratigraphy separates the effects of surface accumulation and basal melt, enabling their interpretation in a historical context.
    Abstract The ice shelves buttressing the Antarctic ice sheet determine the rate of ice-discharge into the surrounding oceans. The geometry of ice shelves, and hence their buttressing strength, is determined by ice flow as well as by the local surface accumulation and basal melt rates, governed by atmospheric and oceanic conditions. Contemporary methods resolve one of these rates, but typically not both. Moreover, there is little information on how they have changed over time. We present a new method to simultaneously infer the surface accumulation and basal melt rates averaged over decadal and centennial timescales. We infer the spatial dependence of these rates along flow line transects using internal stratigraphy observed by radars, together with a kinematic forward model of internal stratigraphy. We solve the inverse problem using simulation-based inference (SBI). SBI performs Bayesian inference by training neural networks on simulations of the forward model to approximate the posterior distribution, allowing us to also quantify uncertainties over the inferred parameters. We demonstrate the validity of our method on a synthetic example, and apply it to Ekström Ice Shelf, Antarctica, for which newly acquired radar measurements are available. We obtain posterior distributions of surface accumulation and basal melt averaging over 42, 84, 146, and 188 years before 2022. Our results suggest stable atmospheric and oceanographic conditions over this period in this catchment of Antarctica. Use of observed internal stratigraphy can separate the effects of surface accumulation and basal melt, allowing them to be interpreted in a historical context of the last centuries and beyond.
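For readers unfamiliar with SBI, the following is a hedged sketch of the general workflow using the `sbi` package (the linked repo is preprocessing code from the same group). The toy simulator below merely stands in for the kinematic forward model of internal stratigraphy; all parameter names and values are illustrative.

```python
# Hedged SBI sketch: learn a posterior over (accumulation, melt) rates.
import torch
from sbi.inference import SNPE
from sbi.utils import BoxUniform

# theta = (surface accumulation rate, basal melt rate), arbitrary units
prior = BoxUniform(low=torch.tensor([0.0, 0.0]), high=torch.tensor([1.0, 1.0]))

def simulator(theta: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for the forward model: rates -> noisy layer observation."""
    accumulation, melt = theta[..., 0], theta[..., 1]
    depth = accumulation - 0.5 * melt
    return (depth + 0.01 * torch.randn_like(depth)).unsqueeze(-1)

theta = prior.sample((5000,))
x = simulator(theta)

inference = SNPE(prior=prior)
density_estimator = inference.append_simulations(theta, x).train()
posterior = inference.build_posterior(density_estimator)

x_obs = torch.tensor([0.3])                   # an "observed" radar summary
samples = posterior.sample((1000,), x=x_obs)  # uncertainty over the rates
```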

Graph Coordinates and Conventional Neural Networks – An Alternative for Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2312.01342
  • repo_url: None
  • paper_authors: Zheyi Qin, Randy Paffenroth, Anura P. Jayasumana
  • for: The paper proposes two novel and efficient alternatives to traditional message passing graph neural networks (GNNs) for graph-based machine learning tasks.
  • methods: The proposed methods, called Topology Coordinate Neural Network (TCNN) and Directional Virtual Coordinate Neural Network (DVCNN), leverage the graph's topology directly and sidestep the computational challenges of message passing GNNs.
  • results: The proposed methods achieve competitive or superior performance to message passing GNNs in terms of accuracy and ROC-AUC, with fewer trainable parameters. Specifically, TCNN requires fewer parameters than any neural network method currently listed in the Open Graph Benchmark Leaderboard for both the OGBN-Proteins and OGBN-Products datasets, while achieving higher performance for a similar number of trainable parameters.
    Abstract Graph-based data present unique challenges and opportunities for machine learning. Graph Neural Networks (GNNs), and especially those algorithms that capture graph topology through message passing for neighborhood aggregation, have been a leading solution. However, these networks often require substantial computational resources and may not optimally leverage the information contained in the graph's topology, particularly for large-scale or complex graphs. We propose Topology Coordinate Neural Network (TCNN) and Directional Virtual Coordinate Neural Network (DVCNN) as novel and efficient alternatives to message passing GNNs, that directly leverage the graph's topology, sidestepping the computational challenges presented by competing algorithms. Our proposed methods can be viewed as a reprise of classic techniques for graph embedding for neural network feature engineering, but they are novel in that our embedding techniques leverage ideas in Graph Coordinates (GC) that are lacking in current practice. Experimental results, benchmarked against the Open Graph Benchmark Leaderboard, demonstrate that TCNN and DVCNN achieve competitive or superior performance to message passing GNNs. For similar levels of accuracy and ROC-AUC, TCNN and DVCNN need far fewer trainable parameters than contenders of the OGBN Leaderboard. The proposed TCNN architecture requires fewer parameters than any neural network method currently listed in the OGBN Leaderboard for both OGBN-Proteins and OGBN-Products datasets. Conversely, our methods achieve higher performance for a similar number of trainable parameters. By providing an efficient and effective alternative to message passing GNNs, our work expands the toolbox of techniques for graph-based machine learning.
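A hedged sketch of the general idea of graph-coordinate features (not the authors' TCNN/DVCNN implementation): embed each node by its shortest-path distances to a few anchor nodes, then feed those coordinates to an ordinary classifier instead of a message-passing network. The anchor choice and model below are arbitrary illustrations.

```python
# Generic "graph coordinates + conventional network" illustration.
import networkx as nx
import numpy as np
from sklearn.neural_network import MLPClassifier

G = nx.karate_club_graph()
anchors = [0, 16, 33]  # arbitrary anchor nodes for illustration

# One coordinate per anchor: hop distance from each node to that anchor.
coords = np.array([
    [nx.shortest_path_length(G, source=v, target=a) for a in anchors]
    for v in G.nodes
], dtype=float)

# The coordinates are plain tabular features for a conventional MLP.
labels = [G.nodes[v]["club"] == "Mr. Hi" for v in G.nodes]
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000).fit(coords, labels)
```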

Robust Non-parametric Knowledge-based Diffusion Least Mean Squares over Adaptive Networks

  • paper_url: http://arxiv.org/abs/2312.01299
  • repo_url: None
  • paper_authors: Soheil Ashkezari-Toussi, Hadi sadoghi-Yazdi
  • for: This study proposes a non-parametric knowledge-based diffusion least-mean-squares algorithm for MAP estimation of an unknown parameter vector.
  • methods: Utilizing kernel density estimation and buffering some intermediate estimates, the algorithm computes the prior distribution and conditional likelihood of the parameter vector at each node. A pseudo-Huber loss function is used to define the likelihood. An error-thresholding function is also defined to reduce the computational overhead and attenuate the effect of noise.
  • results: For both stationary and non-stationary scenarios, the results show that the proposed algorithm is robust in the presence of different noise types.
    Abstract The present study proposes incorporating non-parametric knowledge into the diffusion least-mean-squares algorithm within a maximum a posteriori (MAP) estimation framework. The proposed algorithm leads to a robust estimation of an unknown parameter vector in a group of cooperative estimators. Utilizing kernel density estimation and buffering some intermediate estimates, the prior distribution and conditional likelihood of the parameter vector at each node are calculated. A pseudo-Huber loss function is used for designing the likelihood function. Also, an error-thresholding function is defined to reduce the computational overhead and to provide additional robustness against noise: the update is stopped whenever the error falls below a predefined threshold. The performance of the proposed algorithm is examined in stationary and non-stationary scenarios in the presence of Gaussian and non-Gaussian noise. Results show the robustness of the proposed algorithm in the presence of different noise types.
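A hedged numpy sketch of the ingredients named above — a diffusion LMS update with a pseudo-Huber gradient and an error threshold. The topology, step size, and threshold values are illustrative, not the paper's.

```python
# Adapt-then-combine diffusion LMS with pseudo-Huber loss and thresholding.
import numpy as np

rng = np.random.default_rng(1)
K, M = 5, 4                       # number of nodes, parameter dimension
w_true = rng.normal(size=M)
A = np.full((K, K), 1.0 / K)      # combination weights (fully connected)
w = np.zeros((K, M))              # per-node estimates
mu, delta, eps = 0.05, 1.0, 1e-3  # step size, pseudo-Huber scale, threshold

def pseudo_huber_grad(e, delta):
    # derivative of delta^2 * (sqrt(1 + (e/delta)^2) - 1) w.r.t. e
    return e / np.sqrt(1.0 + (e / delta) ** 2)

for _ in range(2000):
    psi = np.empty_like(w)
    for k in range(K):
        x = rng.normal(size=M)                     # regressor at node k
        d = x @ w_true + 0.05 * rng.standard_t(3)  # heavy-tailed noise
        e = d - x @ w[k]
        # Error thresholding: skip the adaptation step for tiny errors.
        psi[k] = w[k] if abs(e) < eps else w[k] + mu * pseudo_huber_grad(e, delta) * x
    w = A @ psi                                    # diffusion (combine) step

print(np.linalg.norm(w.mean(axis=0) - w_true))
```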

Anomaly Detection Under Uncertainty Using Distributionally Robust Optimization Approach

  • paper_url: http://arxiv.org/abs/2312.01296
  • repo_url: None
  • paper_authors: Amir Hossein Noormohammadia, Seyed Ali MirHassania, Farnaz Hooshmand Khaligh
  • for: This paper addresses the anomaly detection problem, using classification-based methods, specifically the one-class support vector machine (SVM), to find data points that do not follow the patterns of the majority.
  • methods: The one-class SVM requires full knowledge of the true probability distribution of the data points, which is typically unknown in real-world problems and difficult and costly to estimate. The paper therefore proposes a distributionally robust chance-constrained model that performs anomaly detection using only partial distribution information.
  • results: Computational results show that the proposed model is robust under different probability distributions and outperforms the standard one-class SVM on various evaluation metrics.
    Abstract Anomaly detection is defined as the problem of finding data points that do not follow the patterns of the majority. Among the various proposed methods for solving this problem, classification-based methods, including one-class Support Vector Machines (SVM) are considered effective and state-of-the-art. The one-class SVM method aims to find a decision boundary to distinguish between normal data points and anomalies using only the normal data. On the other hand, most real-world problems involve some degree of uncertainty, where the true probability distribution of each data point is unknown, and estimating it is often difficult and costly. Assuming partial distribution information such as the first and second-order moments is known, a distributionally robust chance-constrained model is proposed in which the probability of misclassification is low. By utilizing a mapping function to a higher dimensional space, the proposed model will be capable of classifying origin-inseparable datasets. Also, by adopting the kernel idea, the need for explicitly knowing the mapping is eliminated, computations can be performed in the input space, and computational complexity is reduced. Computational results validate the robustness of the proposed model under different probability distributions and also the superiority of the proposed model compared to the standard one-class SVM in terms of various evaluation metrics.
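For reference, a minimal sketch of the standard one-class SVM baseline that the proposed distributionally robust model is compared against; the RBF kernel here is the same device the paper adopts to avoid an explicit mapping. Data and hyperparameters are illustrative.

```python
# Standard one-class SVM baseline on synthetic "normal" training data.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 2))                      # normal data only
X_test = np.vstack([rng.normal(size=(50, 2)),            # normal points
                    rng.normal(loc=5.0, size=(10, 2))])  # anomalies

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
pred = ocsvm.predict(X_test)   # +1 = normal, -1 = anomaly
print((pred == -1).sum(), "points flagged as anomalous")
```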

Deep Ensembles Meets Quantile Regression: Uncertainty-aware Imputation for Time Series

  • paper_url: http://arxiv.org/abs/2312.01294
  • repo_url: None
  • paper_authors: Ying Liu, Peng Cui, Wenbo Hu, Richang Hong
  • for: This paper proposes a non-generative time-series imputation method that imputes missing values accurately while remaining computationally efficient.
  • methods: The method incorporates deep ensembles into quantile regression with a shared model backbone, providing inherent uncertainty estimation while greatly reducing computational cost.
  • results: Experiments show that the method produces accurate imputations and is faster than the score-based diffusion method, while requiring fewer model parameters.
    Abstract Multivariate time series are everywhere. Nevertheless, real-world time series data often exhibit numerous missing values; recovering them is the time series imputation task. Although previous deep learning methods have been shown to be effective for time series imputation, they produce overconfident imputations, a potentially overlooked threat to the reliability of the intelligence system. The score-based diffusion method CSDI is effective for the time series imputation task but computationally expensive due to the nature of the generative diffusion model framework. In this paper, we propose a non-generative time series imputation method that produces accurate imputations with inherent uncertainty and meanwhile is computationally efficient. Specifically, we incorporate deep ensembles into quantile regression with a shared model backbone and a series of quantile discrimination functions. This framework combines the merits of the accurate uncertainty estimation of deep ensembles and quantile regression; above all, the shared model backbone removes most of the computational overhead of the multiple ensembles. We examine the performance of the proposed method on two real-world datasets, an air quality and a health-care dataset, and conduct extensive experiments to show that our method excels at making both deterministic and probabilistic predictions. Compared with the score-based diffusion method CSDI, we obtain comparable forecasting results and perform better when more data are missing. Furthermore, as a non-generative model, the proposed method incurs a much smaller computational overhead than CSDI, yielding much faster training and fewer model parameters.
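A hedged PyTorch sketch of the core ingredients named above: a shared backbone with one head per quantile, trained with the pinball (quantile) loss, plus a small ensemble for uncertainty. Architecture sizes and quantile levels are illustrative, not the paper's.

```python
# Shared-backbone quantile regression with a deep ensemble.
import torch
import torch.nn as nn

QUANTILES = [0.1, 0.5, 0.9]

class QuantileNet(nn.Module):
    def __init__(self, d_in: int, d_hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        # One lightweight head per quantile over the shared backbone.
        self.heads = nn.ModuleList(nn.Linear(d_hidden, 1) for _ in QUANTILES)

    def forward(self, x):
        h = self.backbone(x)
        return torch.cat([head(h) for head in self.heads], dim=-1)

def pinball_loss(pred, target):
    loss = 0.0
    for i, q in enumerate(QUANTILES):
        err = target - pred[..., i:i + 1]
        loss = loss + torch.mean(torch.maximum(q * err, (q - 1) * err))
    return loss / len(QUANTILES)

# A deep ensemble is several independently initialized copies; the spread
# of their predictions provides an additional uncertainty signal.
ensemble = [QuantileNet(d_in=8) for _ in range(5)]
```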

Task-Oriented Edge Networks: Decentralized Learning Over Wireless Fronthaul

  • paper_url: http://arxiv.org/abs/2312.01288
  • repo_url: None
  • paper_authors: Hoon Lee, Seung-Wook Kim
  • for: This paper is written for task-oriented edge networks where multiple edge internet-of-things nodes execute machine learning tasks with the help of powerful deep neural networks (DNNs) at a network cloud.
  • methods: The paper uses task-oriented encoder DNNs for compressing local observations at individual edge nodes (ENs), and develops a decentralized training algorithm for separate edge-cloud DNNs over downlink wireless fronthaul channels.
  • results: The paper obtains a cloud inference model, inspired by the nomographic function, that integrates a number of shallow DNNs and yields versatile calculations independent of the number of ENs.
    Abstract This paper studies task-oriented edge networks where multiple edge internet-of-things nodes execute machine learning tasks with the help of powerful deep neural networks (DNNs) at a network cloud. Separate edge nodes (ENs) result in a partially observable system where they can only obtain partitioned features of the global network states. These local observations need to be forwarded to the cloud via resource-constrained wireless fronthaul links. Individual ENs compress their local observations into uplink fronthaul messages using task-oriented encoder DNNs. Then, the cloud carries out a remote inference task by leveraging the received signals. Such a distributed topology requires a decentralized-training, decentralized-execution (DTDE) learning framework for designing edge-cloud cooperative inference rules and their decentralized training strategies. First, we develop a fronthaul-cooperative DNN architecture along with proper uplink coordination protocols suitable for wireless fronthaul interconnection. Inspired by the nomographic function, an efficient cloud inference model becomes an integration of a number of shallow DNNs. This modularized architecture brings versatile calculations that are independent of the number of ENs. Next, we present a decentralized training algorithm for the separate edge-cloud DNNs over downlink wireless fronthaul channels. An appropriate downlink coordination protocol is proposed, which backpropagates gradient vectors wirelessly from the cloud to the ENs.
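A hedged PyTorch sketch of the nomographic structure mentioned above: each edge node (EN) encodes its local observation, the cloud sums the uplink messages, and a shallow outer network produces the inference, so the result is independent of the number of ENs. All sizes are illustrative.

```python
# Nomographic edge-cloud inference: cloud(f(sum_k g_k(x_k))).
import torch
import torch.nn as nn

class EdgeEncoder(nn.Module):
    """Task-oriented encoder DNN at one edge node."""
    def __init__(self, d_obs: int, d_msg: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_obs, 32), nn.ReLU(),
                                 nn.Linear(32, d_msg))

    def forward(self, x):
        return self.net(x)

class CloudDecoder(nn.Module):
    """Shallow cloud DNN applied to the aggregated fronthaul messages."""
    def __init__(self, d_msg: int, d_out: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_msg, 32), nn.ReLU(),
                                 nn.Linear(32, d_out))

    def forward(self, msgs):  # msgs: list of per-EN messages
        return self.net(torch.stack(msgs).sum(dim=0))  # nomographic form

encoders = [EdgeEncoder(d_obs=16, d_msg=8) for _ in range(4)]
cloud = CloudDecoder(d_msg=8, d_out=10)
obs = [torch.randn(1, 16) for _ in range(4)]       # partitioned observations
y = cloud([enc(x) for enc, x in zip(encoders, obs)])
```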

Continuous Convolutional Neural Networks for Disruption Prediction in Nuclear Fusion Plasmas

  • paper_url: http://arxiv.org/abs/2312.01286
  • repo_url: None
  • paper_authors: William F Arnold, Lucas Spangher, Christina Rea
  • for: grid decarbonization for climate change
  • methods: Machine Learning approaches, specifically Continuous Convolutional Neural Networks
  • results: significantly better performance in disruption prediction compared to previous discrete models (Area Under the Receiver Operating Characteristic Curve = 0.974 v.s. 0.799) with fewer parameters.
    Abstract Grid decarbonization for climate change requires dispatchable carbon-free energy like nuclear fusion. The tokamak concept offers a promising path for fusion, but one of the foremost challenges in implementation is the occurrence of energetic plasma disruptions. In this study, we delve into Machine Learning approaches to predict plasma state outcomes. Our contributions are twofold: (1) we present a novel application of Continuous Convolutional Neural Networks for disruption prediction, and (2) we examine the advantages and disadvantages of continuous models over discrete models for disruption prediction by comparing our model with the previous, discrete state of the art, and show that continuous models offer significantly better performance (Area Under the Receiver Operating Characteristic Curve = 0.974 vs. 0.799) with fewer parameters.
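A hedged sketch of the general continuous-convolution technique: the kernel is a small MLP over continuous time offsets, so the same filter can be sampled at any resolution or sequence length. This illustrates the family of models, not the authors' exact architecture.

```python
# Continuous 1-D convolution with an MLP-parameterized kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousConv1d(nn.Module):
    def __init__(self, c_in: int, c_out: int, kernel_size: int):
        super().__init__()
        self.c_in, self.c_out, self.k = c_in, c_out, kernel_size
        # MLP mapping a scalar offset t in [-1, 1] to a (c_out * c_in) weight.
        self.kernel_net = nn.Sequential(nn.Linear(1, 32), nn.GELU(),
                                        nn.Linear(32, c_out * c_in))

    def forward(self, x):  # x: (batch, c_in, length)
        t = torch.linspace(-1.0, 1.0, self.k).unsqueeze(-1)   # (k, 1)
        w = self.kernel_net(t).view(self.k, self.c_out, self.c_in)
        w = w.permute(1, 2, 0)                                # (c_out, c_in, k)
        return F.conv1d(x, w, padding=self.k // 2)

signal = torch.randn(2, 8, 128)       # e.g. 8 plasma diagnostic channels
out = ContinuousConv1d(8, 16, kernel_size=33)(signal)
```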

Mendata: A Framework to Purify Manipulated Training Data

  • paper_url: http://arxiv.org/abs/2312.01281
  • repo_url: None
  • paper_authors: Zonghao Huang, Neil Gong, Michael K. Reiter
  • for: To purify manipulated training data, preventing a model from learning hidden properties planted by a malicious data contributor.
  • methods: Proposes a data purification framework named Mendata. Starting from a small reference dataset in which a large majority of inputs are clean, it perturbs the training inputs so that they retain their utility but are distributed similarly to the reference data, as measured by the Wasserstein distance, thereby eliminating the hidden properties from the learned model.
  • results: Applying Mendata defeats state-of-the-art data poisoning and data tracing techniques.
    Abstract Untrusted data used to train a model might have been manipulated to endow the learned model with hidden properties that the data contributor might later exploit. Data purification aims to remove such manipulations prior to training the model. We propose Mendata, a novel framework to purify manipulated training data. Starting from a small reference dataset in which a large majority of the inputs are clean, Mendata perturbs the training inputs so that they retain their utility but are distributed similarly (as measured by Wasserstein distance) to the reference data, thereby eliminating hidden properties from the learned model. A key challenge is how to find such perturbations, which we address by formulating a min-max optimization problem and developing a two-step method to iteratively solve it. We demonstrate the effectiveness of Mendata by applying it to defeat state-of-the-art data poisoning and data tracing techniques.
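A hedged sketch of the distributional-similarity measurement underlying the purification objective: an empirical Wasserstein distance between (possibly manipulated) training inputs and a clean reference set. Here per-feature 1-D distances via scipy serve as a cheap proxy; the paper's actual objective is a min-max optimization.

```python
# Per-feature 1-D Wasserstein distance between training and reference data.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
reference = rng.normal(size=(1000, 16))             # mostly-clean reference set
training = rng.normal(loc=0.3, size=(1000, 16))     # possibly manipulated data

dist = np.mean([wasserstein_distance(training[:, j], reference[:, j])
                for j in range(reference.shape[1])])
print(f"mean per-feature W1 distance: {dist:.4f}")
```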

  • paper_url: http://arxiv.org/abs/2312.01275
  • repo_url: None
  • paper_authors: Ahmad F. Al Musawi, Satyaki Roy, Preetam Ghosh
  • for: This review surveys the interactions among heterogeneous genomic and molecular entities represented as biological networks, and the use of link prediction (LP) methods to infer missing or prospective associations in them.
  • methods: The paper systematically dissects local, centrality-based, and embedding-based LP approaches, applied to both static and dynamic biological networks.
  • results: The paper carries out performance evaluations on established biological network datasets, and compares the similarity of prediction trends among the models and the specific network attributes that contribute to effective link prediction.
    Abstract In the domain of network biology, the interactions among heterogeneous genomic and molecular entities are represented through networks. Link prediction (LP) methodologies are instrumental in inferring missing or prospective associations within these biological networks. In this review, we systematically dissect the attributes of local, centrality, and embedding-based LP approaches, applied to static and dynamic biological networks. We undertake an examination of the current applications of LP metrics for predicting links between diseases, genes, proteins, RNA, microbiomes, drugs, and neurons. We carry out comprehensive performance evaluations on established biological network datasets to show the practical applications of standard LP models. Moreover, we compare the similarity in prediction trends among the models and the specific network attributes that contribute to effective link prediction, before underscoring the role of LP in addressing the formidable challenges prevalent in biological systems, ranging from noise, bias, and data sparseness to interpretability. We conclude the review with an exploration of the essential characteristics expected from future LP models, poised to advance our comprehension of the intricate interactions governing biological systems.
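A short illustration of the "local" family of LP metrics the review covers, using networkx built-ins on a toy random graph standing in for a biological interaction network; higher scores suggest a higher likelihood that the missing edge exists.

```python
# Local (common-neighbor-based) link prediction scores with networkx.
import networkx as nx

G = nx.erdos_renyi_graph(50, 0.1, seed=4)   # toy stand-in for a PPI network
candidate_pairs = [(0, 1), (2, 3), (4, 5)]

for u, v, score in nx.jaccard_coefficient(G, candidate_pairs):
    print(f"Jaccard({u},{v}) = {score:.3f}")
for u, v, score in nx.adamic_adar_index(G, candidate_pairs):
    print(f"Adamic-Adar({u},{v}) = {score:.3f}")
```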

Multiscale Topology in Interactomic Network: From Transcriptome to Antiaddiction Drug Repurposing

  • paper_url: http://arxiv.org/abs/2312.01272
  • repo_url: None
  • paper_authors: Hongyan Du, Guo-Wei Wei, Tingjun Hou
  • for: This paper aims to identify potential drug repurposing candidates for opioid and cocaine addiction treatment by analyzing addiction-related transcriptomic data and protein-protein interaction networks.
  • methods: The authors use differential gene expression analysis, persistent Laplacians, and machine learning models to identify key genes and predict binding affinities of drug compounds to target proteins.
  • results: The authors identify three pivotal molecular targets, mTOR, mGluR5, and NMDAR, for drug repurposing, and evaluate the binding affinities and drug-likeness of candidate compounds with machine learning models. They also demonstrate the versatility of their methods for applications across a range of diseases and transcriptomic datasets.
    Abstract The escalating drug addiction crisis in the United States underscores the urgent need for innovative therapeutic strategies. This study embarked on an innovative and rigorous strategy to unearth potential drug repurposing candidates for opioid and cocaine addiction treatment, bridging the gap between transcriptomic data analysis and drug discovery. We initiated our approach by conducting differential gene expression analysis on addiction-related transcriptomic data to identify key genes. We propose a novel topological differentiation to identify key genes from a protein-protein interaction (PPI) network derived from DEGs. This method utilizes persistent Laplacians to accurately single out pivotal nodes within the network, conducting this analysis in a multiscale manner to ensure high reliability. Through rigorous literature validation, pathway analysis, and data-availability scrutiny, we identified three pivotal molecular targets, mTOR, mGluR5, and NMDAR, for drug repurposing from DrugBank. We crafted machine learning models employing two natural language processing (NLP)-based embeddings and a traditional 2D fingerprint, which demonstrated robust predictive ability in gauging binding affinities of DrugBank compounds to selected targets. Furthermore, we elucidated the interactions of promising drugs with the targets and evaluated their drug-likeness. This study delineates a multi-faceted and comprehensive analytical framework, amalgamating bioinformatics, topological data analysis and machine learning, for drug repurposing in addiction treatment, setting the stage for subsequent experimental validation. The versatility of the methods we developed allows for applications across a range of diseases and transcriptomic datasets.
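A hedged sketch of the "traditional 2D fingerprint" branch of the paper's binding-affinity models: Morgan fingerprints from RDKit feeding a standard regressor. The SMILES strings and labels below are placeholders, not paper data.

```python
# Morgan-fingerprint features for a binding-affinity regressor.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import GradientBoostingRegressor

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]   # placeholder molecules
affinity = np.array([-6.1, -7.3, -5.8, -6.0])    # placeholder labels

def fingerprint(smi: str) -> np.ndarray:
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    return np.array(fp)

X = np.stack([fingerprint(s) for s in smiles])
model = GradientBoostingRegressor().fit(X, affinity)
```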

Distributed Reinforcement Learning for Molecular Design: Antioxidant case

  • paper_url: http://arxiv.org/abs/2312.01267
  • repo_url: None
  • paper_authors: Huanyi Qin, Denis Akhiyarov, Sophie Loehle, Kenneth Chiu, Mauricio Araya-Polo
  • for: This work develops a distributed deep reinforcement learning algorithm for molecular design, applied to antioxidants.
  • methods: The algorithm combines distributed deep reinforcement learning with predictors of key chemical properties, bond dissociation energy (BDE) and ionization potential (IP), and reduces training time through algorithmic improvements for molecular modifications.
  • results: The proposed distributed algorithm, DA-MolDQN, is up to 100x faster than previous algorithms, scales to up to 512 molecules, and discovers new optimized molecules, with results validated on both proprietary and public datasets.
    Abstract Deep reinforcement learning has successfully been applied for molecular discovery as shown by the Molecule Deep Q-network (MolDQN) algorithm. This algorithm has challenges when applied to optimizing new molecules: training such a model is limited in terms of scalability to larger datasets and the trained model cannot be generalized to different molecules in the same dataset. In this paper, a distributed reinforcement learning algorithm for antioxidants, called DA-MolDQN is proposed to address these problems. State-of-the-art bond dissociation energy (BDE) and ionization potential (IP) predictors are integrated into DA-MolDQN, which are critical chemical properties while optimizing antioxidants. Training time is reduced by algorithmic improvements for molecular modifications. The algorithm is distributed, scalable for up to 512 molecules, and generalizes the model to a diverse set of molecules. The proposed models are trained with a proprietary antioxidant dataset. The results have been reproduced with both proprietary and public datasets. The proposed molecules have been validated with DFT simulations and a subset of them confirmed in public "unseen" datasets. In summary, DA-MolDQN is up to 100x faster than previous algorithms and can discover new optimized molecules from proprietary and public antioxidants.

Rethinking PGD Attack: Is Sign Function Necessary?

  • paper_url: http://arxiv.org/abs/2312.01260
  • repo_url: https://github.com/junjieyang97/rgd
  • paper_authors: Junjie Yang, Tianlong Chen, Xuxi Chen, Zhangyang Wang, Yingbin Liang
  • for: This paper focuses on improving the performance of adversarial attacks on neural networks by proposing a new algorithm called Raw Gradient Descent (RGD).
  • methods: The RGD algorithm eliminates the use of sign functions in the update process, instead using a new hidden variable of non-clipped perturbation to move beyond the constraint.
  • results: The proposed RGD algorithm outperforms existing algorithms such as Projected Gradient Descent (PGD) and other competitors in various settings, without incurring any additional computational overhead.
    Abstract Neural networks have demonstrated success in various domains, yet their performance can be significantly degraded by even a small input perturbation. Consequently, the construction of such perturbations, known as adversarial attacks, has gained significant attention, many of which fall within "white-box" scenarios where we have full access to the neural network. Existing attack algorithms, such as projected gradient descent (PGD), commonly take the sign function on the raw gradient before updating adversarial inputs, thereby neglecting gradient magnitude information. In this paper, we present a theoretical analysis of how such a sign-based update algorithm influences step-wise attack performance, as well as its caveats. We also interpret why previous attempts to directly use raw gradients failed. Based on that, we further propose a new raw gradient descent (RGD) algorithm that eliminates the use of sign. Specifically, we convert the constrained optimization problem into an unconstrained one by introducing a new hidden variable of non-clipped perturbation that can move beyond the constraint. The effectiveness of the proposed RGD algorithm has been demonstrated extensively in experiments, outperforming PGD and other competitors in various settings, without incurring any additional computational overhead. The code is available at https://github.com/JunjieYang97/RGD.
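A hedged PyTorch sketch contrasting the sign-based PGD step with a raw-gradient (sign-free) step on a hidden, non-clipped perturbation, the core idea of RGD. Step sizes and the surrounding loss are illustrative; see the authors' repo for the real implementation.

```python
# Sign-based PGD step vs. a raw-gradient step on an unconstrained variable.
import torch

def pgd_step(x_adv, grad, alpha, x_clean, eps):
    # Classical PGD: follow only the sign of the gradient, then project.
    x_adv = x_adv + alpha * grad.sign()
    return torch.clamp(x_adv, x_clean - eps, x_clean + eps)

def rgd_step(z, grad_z, alpha, x_clean, eps):
    # RGD-style: update a hidden, non-clipped perturbation z with the raw
    # gradient (keeping magnitude information); clip only when materializing
    # the adversarial example.
    z = z + alpha * grad_z
    x_adv = torch.clamp(x_clean + z, x_clean - eps, x_clean + eps)
    return z, x_adv
```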

eess.SP - 2023-12-03

Towards Decentralized Task Offloading and Resource Allocation in User-Centric Mobile Edge Computing

  • paper_url: http://arxiv.org/abs/2312.01499
  • repo_url: https://github.com/qlt315/ucmec-mmwave-fronthaul
  • paper_authors: Langtian Qin, Hancheng Lu, Yuang Chen, Baolin Chong, Feng Wu
  • for: To improve throughput and reliability at the edge of the network so that tasks can be offloaded to MEC servers effectively.
  • methods: Proposes user-centric mobile edge computing (UCMEC), a novel MEC architecture integrating user-centric transmission to ensure high-throughput, reliable communication for task offloading. An optimization problem jointly considering task offloading, power control, and computing resource allocation is formulated to minimize the long-term average total delay, and solved with two decentralized schemes based on multi-agent deep reinforcement learning (MADRL) and convex optimization.
  • results: Simulation results show that, compared to traditional cellular-based MEC, the proposed schemes in UCMEC improve the uplink transmission rate by up to 343.56% and reduce the long-term average total delay by up to 45.57%.
    Abstract In traditional cellular-based mobile edge computing (MEC), users at the edge of the cell are prone to suffer severe inter-cell interference and signal attenuation, leading to low throughput and even transmission interruptions. Such an edge effect severely obstructs offloading of tasks to MEC servers. To address this issue, we propose user-centric mobile edge computing (UCMEC), a novel MEC architecture integrating user-centric transmission, which can ensure high throughput and reliable communication for task offloading. Then, we formulate an optimization problem with joint consideration of task offloading, power control, and computing resource allocation in UCMEC, aiming at obtaining the optimal performance in terms of long-term average total delay. To solve the intractable problem, we propose two decentralized joint optimization schemes based on multi-agent deep reinforcement learning (MADRL) and convex optimization, which consider both cooperation and non-cooperation among network nodes. Simulation results demonstrate that the proposed schemes in UCMEC can significantly improve the uplink transmission rate by at most 343.56% and reduce the long-term average total delay by at most 45.57% compared to traditional cellular-based MEC.

Novel KLD-based Resource Allocation for Integrated Sensing and Communication

  • paper_url: http://arxiv.org/abs/2312.01427
  • repo_url: None
  • paper_authors: Yousef Kloob, Mohammad Al-Jarrah, Emad Alsusa, Christos Masouros
  • for: This paper proposes a resource allocation approach for integrated sensing and communication (ISAC) networks based on the Kullback-Leibler divergence (KLD) metric, to improve performance under limited base-station power and antenna resources.
  • methods: The KLD is analyzed for two possible antenna deployments, separated and shared. The base-station resources are then optimized by minimizing the average KLD of the network while satisfying a minimum predefined KLD requirement for each user equipment (UE) and target; the resulting mixed-integer nonlinear program is solved with a genetic algorithm and with a rounding-based interior-point method (RIPM).
  • results: The results show that the KLD metric is an effective means of optimizing ISAC networks, and that both optimization approaches outperform uniform power and antenna allocation, with the genetic algorithm stronger but computationally heavier and RIPM more efficient at a negligible performance loss.
    Abstract In this paper, we introduce a novel resource allocation approach for integrated sensing-communication (ISAC) using the Kullback-Leibler divergence (KLD) metric. Specifically, we consider a base-station with limited power and antenna resources serving a number of communication users and detecting multiple targets simultaneously. First, we analyze the KLD for two possible antenna deployments, which are the separated and shared deployments, then use the results to optimize the resources of the base-station through minimising the average KLD for the network while satisfying a minimum predefined KLD requirement for each user equipment (UE) and target. To this end, the optimisation is formulated and presented as a mixed integer nonlinear programming (MINLP) problem and then solved using two approaches. In the first approach, we employ a genetic algorithm, which offers remarkable performance but demands substantial computational resources; and in the second approach, we propose a rounding-based interior-point method (RIPM) that provides a more computationally-efficient alternative solution at a negligible performance loss. The results demonstrate that the KLD metric can be an effective means for optimising ISAC networks, and that both optimisation solutions presented offer superior performance compared to uniform power and antenna allocation.
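For concreteness, a numpy sketch of the kind of divergence computation a KLD-based metric builds on: the standard closed form of the KLD between two multivariate Gaussians. The distributions below are illustrative, not the paper's signal models.

```python
# KL( N(m0, S0) || N(m1, S1) ) for multivariate Gaussians.
import numpy as np

def gaussian_kld(m0, S0, m1, S1):
    k = m0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0)
                  + diff @ S1_inv @ diff
                  - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

m0, S0 = np.zeros(2), np.eye(2)
m1, S1 = np.array([1.0, 0.0]), 2.0 * np.eye(2)
print(gaussian_kld(m0, S0, m1, S1))  # larger KLD = more separable hypotheses
```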

Self-Critical Alternate Learning based Semantic Broadcast Communication

  • paper_url: http://arxiv.org/abs/2312.01423
  • repo_url: None
  • paper_authors: Zhilin Lu, Rongpeng Li, Ming Lei, Chan Wang, Zhifeng Zhao, Honggang Zhang
  • for: To build a reliable and efficient multi-user broadcast communication system for semantic communication (SemCom), in which one transmitter (TX) serves a sentence generation task at multiple receivers (RXs).
  • methods: Builds on the universal transformer (UT) and reinforcement learning (RL): the sentence generation task is cast as an RL problem and optimized with a self-critical alternate learning (SCAL) algorithm, enabling stable and efficient learning with nondifferentiable semantic metrics.
  • results: Simulation results show that SemanticBC-SCAL adapts to different BC channels and outperforms traditional JSCC frameworks, especially at low SNRs.
    Abstract Semantic communication (SemCom) has been deemed a promising communication paradigm to break through the bottleneck of traditional communications. Nonetheless, most existing works focus on point-to-point communication scenarios, and their extension to multi-user scenarios is not straightforward, since directly scaling the JSCC framework to a multi-user communication system is cost-inefficient. Meanwhile, previous methods optimize the system by differentiable bit-level supervision, easily leading to a "semantic gap". Therefore, we delve into multi-user broadcast communication (BC) based on the universal transformer (UT) and propose a reinforcement learning (RL) based self-critical alternate learning (SCAL) algorithm, named SemanticBC-SCAL, to adapt to the different BC channels from one transmitter (TX) to multiple receivers (RXs) for the sentence generation task. In particular, to enable stable optimization via a nondifferentiable semantic metric, we regard sentence similarity as a reward and formulate this learning process as an RL problem. Considering the huge decision space, we adopt a lightweight but efficient self-critical supervision to guide the learning process. Meanwhile, an alternate learning mechanism is developed to provide cost-effective learning, in which the encoder and decoders are updated asynchronously with different iterations. Notably, the incorporation of RL makes SemanticBC-SCAL compliant with any user-defined semantic similarity metric, and the alternate learning simultaneously addresses the channel non-differentiability issue. Besides, the convergence of SemanticBC-SCAL is also theoretically established. Extensive simulation results verify the effectiveness and superiority of our approach, especially at low SNRs.
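A hedged sketch of the self-critical training signal: the reward of a sampled sentence is baselined by the reward of the greedy decode, so the nondifferentiable semantic similarity enters only through the reward. The numbers, shapes, and decoding calls are illustrative placeholders.

```python
# Self-critical (REINFORCE with greedy baseline) loss for sentence rewards.
import torch

def self_critical_loss(logprob_sampled: torch.Tensor,
                       reward_sampled: torch.Tensor,
                       reward_greedy: torch.Tensor) -> torch.Tensor:
    """logprob_sampled: summed token log-probs of sampled sentences, (batch,)
    reward_sampled / reward_greedy: semantic similarity scores, (batch,)"""
    advantage = (reward_sampled - reward_greedy).detach()
    return -(advantage * logprob_sampled).mean()

# Dummy example: sentences decoded better than the greedy baseline get their
# log-probability pushed up, and vice versa.
loss = self_critical_loss(torch.tensor([-12.3, -9.1], requires_grad=True),
                          torch.tensor([0.82, 0.40]),
                          torch.tensor([0.75, 0.55]))
loss.backward()
```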

Integrating Communication, Sensing and Computing in Satellite Internet of Things: Challenges and Opportunities

  • paper_url: http://arxiv.org/abs/2312.01336
  • repo_url: None
  • paper_authors: Yong Zuo, Mingyang Yue, Huiyuan Yang, Liantao Wu, Xiaojun Yuan
  • for: This article examines satellite Internet of Things (IoT) systems, which use satellites as access points for IoT devices to achieve the global coverage of future IoT systems and to support burgeoning IoT applications in communication, sensing, and computing.
  • methods: The article describes how jointly designing the communication, sensing, and computing functions in the complex and dynamic satellite environment can enhance the capabilities of satellite IoT systems.
  • results: The article summarizes state-of-the-art solutions and discusses the main challenges of integrating communication, sensing, and computing in satellite IoT that remain to be solved.
    Abstract Satellite Internet of Things (IoT) uses satellites as the access points for IoT devices to achieve the global coverage of future IoT systems, and is expected to support burgeoning IoT applications, including communication, sensing, and computing. However, the complex and dynamic satellite environments and limited network resources raise new challenges in the design of satellite IoT systems. In this article, we focus on the joint design of communication, sensing, and computing to improve the performance of satellite IoT, which is quite different from the case of terrestrial IoT systems. We describe how the integration of the three functions can enhance system capabilities, and summarize the state-of-the-art solutions. Furthermore, we discuss the main challenges of integrating communication, sensing, and computing in satellite IoT that remain to be solved with pressing interest.

Joint Beam Scheduling and Power Optimization for Beam Hopping LEO Satellite Systems

  • paper_url: http://arxiv.org/abs/2312.01292
  • repo_url: None
  • paper_authors: Shuang Zheng, Xing Zhang, Peng Wang, Wenbo Wang
  • for: To improve resource utilization and system performance in low Earth orbit (LEO) satellite communication systems.
  • methods: Proposes a joint beam scheduling and power optimization beam hopping (JBSPO-BH) algorithm that accounts for differences in the geographic distribution of sink nodes. The beam scheduling sub-problem is modelled as a potential game whose Nash equilibrium (NE) gives the scheduling strategy, and the power allocation sub-problem is solved with the penalty-function interior-point method.
  • results: Compared with other schemes, the proposed algorithm has low time complexity, converges quickly, and achieves better throughput and fairness.
    Abstract Low earth orbit (LEO) satellite communications can provide ubiquitous and reliable services, making them an essential part of the Internet of Everything network. Beam hopping (BH) is an emerging technology for effectively addressing the issue of low resource utilization caused by the non-uniform spatio-temporal distribution of traffic demands. However, how to allocate multi-dimensional resources in a timely and efficient way for the highly dynamic LEO satellite systems remains a challenge. This paper proposes a joint beam scheduling and power optimization beam hopping (JBSPO-BH) algorithm considering the differences in the geographic distribution of sink nodes. The JBSPO-BH algorithm decouples the original problem into two sub-problems. The beam scheduling problem is modelled as a potential game, and the Nash equilibrium (NE) point is obtained as the beam scheduling strategy. Moreover, the penalty function interior point method is applied to optimize the power allocation. Simulation results show that the JBSPO-BH algorithm has low time complexity and fast convergence and achieves better performance in both throughput and fairness. Compared with greedy-based BH, greedy-based BH with power optimization, round-robin BH, Max-SINR BH and a satellite resource allocation algorithm, the throughput of the proposed algorithm is improved by 44.99%, 20.79%, 156.06%, 15.39% and 8.17%, respectively.
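A hedged sketch of the potential-game device used above for beam scheduling: in a potential game, best-response dynamics (each player iteratively switching to its utility-maximizing action) converge to a Nash equilibrium. The identical-interest payoff table below is an illustrative special case of a potential game, not the paper's utility.

```python
# Best-response dynamics converging to a Nash equilibrium.
import numpy as np

rng = np.random.default_rng(5)
n_players, n_actions = 4, 3
# Identical-interest game: all players share one payoff table indexed by
# the joint action profile (a simple exact potential game).
potential = rng.normal(size=(n_actions,) * n_players)
profile = [0] * n_players

changed = True
while changed:
    changed = False
    for i in range(n_players):
        utils = [potential[tuple(profile[:i] + [a] + profile[i + 1:])]
                 for a in range(n_actions)]
        best = int(np.argmax(utils))
        if best != profile[i]:
            profile[i] = best
            changed = True

print("Nash equilibrium profile:", profile)
```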

Stochastic Resource Allocation via Dual Tail Waterfilling

  • paper_url: http://arxiv.org/abs/2312.01251
  • repo_url: None
  • paper_authors: Gokberk Yaylali, Dionysios S. Kalogerias
  • for: This paper aims to optimize the resource allocation in wireless systems by addressing the challenges of channel fading.
  • methods: The paper uses a risk-aware formulation of the classical stochastic resource allocation problem, leveraging the Conditional Value-at-Risk (CV@R) as a measure of risk. The optimal risk-aware resource allocation policy and the corresponding user rate functions are derived using closed-form expressions.
  • results: The proposed risk-aware resource allocation policy achieves more rapid and assured convergence of dual variables compared to the primal-dual tail waterfilling algorithm. The effectiveness of the proposed scheme is confirmed through detailed numerical simulations.
    Abstract Optimal resource allocation in wireless systems still stands as a rather challenging task due to the inherent statistical characteristics of channel fading. On the one hand, minimax/outage-optimal policies are often overconservative and analytically intractable, despite advertising maximally reliable system performance. On the other hand, ergodic-optimal resource allocation policies are often susceptible to the statistical dispersion of heavy-tailed fading channels, leading to relatively frequent drastic performance drops. We investigate a new risk-aware formulation of the classical stochastic resource allocation problem for point-to-point power-constrained communication networks over fading channels with no cross-interference, by leveraging the Conditional Value-at-Risk (CV@R) as a coherent measure of risk. We rigorously derive closed-form expressions for the CV@R-optimal risk-aware resource allocation policy, as well as the optimal associated quantiles of the corresponding user rate functions by capitalizing on the underlying fading distribution, parameterized by dual variables. We then develop a purely dual tail waterfilling scheme, achieving significantly more rapid and assured convergence of dual variables, as compared with the primal-dual tail waterfilling algorithm, recently proposed in the literature. The effectiveness of the proposed scheme is also readily confirmed via detailed numerical simulations.
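For intuition, a numpy sketch of the Conditional Value-at-Risk (CV@R) used above as the coherent risk measure: for a loss-like random variable at level alpha, CV@R is the expectation over its worst (1 - alpha) tail, so it always dominates VaR. The heavy-tailed samples below are an illustrative stand-in for, e.g., losses under fading.

```python
# Empirical VaR and CV@R of a heavy-tailed loss distribution.
import numpy as np

def cvar(samples: np.ndarray, alpha: float) -> float:
    var = np.quantile(samples, alpha)    # Value-at-Risk at level alpha
    tail = samples[samples >= var]       # worst-case tail events
    return float(tail.mean())

rng = np.random.default_rng(6)
loss = rng.standard_t(df=3, size=100_000)   # heavy-tailed, like bad fading
print("VaR  0.95:", np.quantile(loss, 0.95))
print("CV@R 0.95:", cvar(loss, 0.95))       # always >= VaR
```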

cs.CV - 2023-12-02

Disentangling the Effects of Data Augmentation and Format Transform in Self-Supervised Learning of Image Representations

  • paper_url: http://arxiv.org/abs/2312.02205
  • repo_url: None
  • paper_authors: Neha Kalibhat, Warren Morningstar, Alex Bijamov, Luyang Liu, Karan Singhal, Philip Mansfield
  • for: This paper studies self-supervised learning (SSL) of image representations in the vision domain, disentangling the effects of data augmentations and format transforms.
  • methods: The paper considers two kinds of input processing: image-space augmentations and frequency-space augmentations, termed Fourier Domain Augmentations (FDA).
  • results: Training SSL models on a combination of FDA and image augmentations improves downstream classification accuracy by up to 1.3% on ImageNet-1K. Format transforms are also found to improve the quality of learned representations even without augmentations, though combining the two techniques yields better quality.
    Abstract Self-Supervised Learning (SSL) enables training performant models using limited labeled data. One of the pillars underlying vision SSL is the use of data augmentations/perturbations of the input which do not significantly alter its semantic content. For audio and other temporal signals, augmentations are commonly used alongside format transforms such as Fourier transforms or wavelet transforms. Unlike augmentations, format transforms do not change the information contained in the data; rather, they express the same information in different coordinates. In this paper, we study the effects of format transforms and augmentations both separately and together on vision SSL. We define augmentations in frequency space called Fourier Domain Augmentations (FDA) and show that training SSL models on a combination of these and image augmentations can improve the downstream classification accuracy by up to 1.3% on ImageNet-1K. We also show improvements against SSL baselines in few-shot and transfer learning setups using FDA. Surprisingly, we also observe that format transforms can improve the quality of learned representations even without augmentations; however, the combination of the two techniques yields better quality.
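A hedged sketch of an augmentation in frequency space, in the spirit of FDA: take the 2-D FFT of an image, randomly perturb the amplitude spectrum while keeping the phase, and invert. The perturbation form and magnitude are illustrative choices, not the paper's exact augmentation set.

```python
# Frequency-space (Fourier-domain) amplitude jitter as an augmentation.
import numpy as np

def fourier_amplitude_jitter(img: np.ndarray, strength: float = 0.1,
                             rng=np.random.default_rng()) -> np.ndarray:
    """img: (H, W) grayscale array in [0, 1]."""
    spec = np.fft.fft2(img)
    amp, phase = np.abs(spec), np.angle(spec)
    amp = amp * (1.0 + strength * rng.normal(size=amp.shape))  # jitter amplitude
    out = np.fft.ifft2(amp * np.exp(1j * phase)).real          # keep phase
    return np.clip(out, 0.0, 1.0)

img = np.random.default_rng(7).random((64, 64))
aug = fourier_amplitude_jitter(img, strength=0.2)
```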

Motion Informed Needle Segmentation in Ultrasound Images

  • paper_url: http://arxiv.org/abs/2312.01239
  • repo_url: None
  • paper_authors: Raghavv Goel, Cecilia Morales, Manpreet Singh, Artur Dubrawski, John Galeotti, Howie Choset
  • for: To improve the accuracy of needle segmentation in ultrasound images, which is hampered by motion artifacts, noise, and needle occlusion.
  • methods: Proposes a novel approach that combines classical Kalman filter (KF) techniques with data-driven learning, incorporating both needle features and needle motion, while accounting for limited data availability.
  • results: A novel convolutional neural network (CNN) based KF-inspired block achieves a 15% reduction in pixel-wise needle tip error and an 8% reduction in length error compared with recent state-of-the-art needle segmentation models. To the authors' knowledge, this is also the first learnable filter incorporating non-linear needle motion for improving needle segmentation.
    Abstract Segmenting a moving needle in ultrasound images is challenging due to the presence of artifacts, noise, and needle occlusion. This task becomes even more demanding in scenarios where data availability is limited. Convolutional Neural Networks (CNNs) have been successful in many computer vision applications, but struggle to accurately segment needles without considering their motion. In this paper, we present a novel approach for needle segmentation that combines classical Kalman Filter (KF) techniques with data-driven learning, incorporating both needle features and needle motion. Our method offers three key contributions. First, we propose a compatible framework that seamlessly integrates into commonly used encoder-decoder style architectures. Second, we demonstrate superior performance compared to recent state-of-the-art needle segmentation models using our novel convolutional neural network (CNN) based KF-inspired block, achieving a 15% reduction in pixel-wise needle tip error and an 8% reduction in length error. Third, to our knowledge we are the first to implement a learnable filter to incorporate non-linear needle motion for improving needle segmentation.
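A hedged numpy sketch of the classical Kalman filter ingredients the method builds on, tracking a 1-D needle-tip position with a constant-velocity model; the paper replaces parts of this with learnable CNN components, and all matrices below are textbook placeholders.

```python
# Textbook Kalman filter: constant-velocity tracking of a tip position.
import numpy as np

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition: [pos, vel]
H = np.array([[1.0, 0.0]])              # we only measure position
Q = 1e-3 * np.eye(2)                    # process noise covariance
R = np.array([[1e-1]])                  # measurement noise covariance

x = np.zeros(2)                         # state estimate
P = np.eye(2)                           # state covariance

def kf_step(x, P, z):
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with measurement z (e.g., a per-frame CNN tip detection)
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P

for t in range(10):
    z = np.array([0.5 * t + np.random.normal(scale=0.3)])
    x, P = kf_step(x, P, z)
print("estimated [pos, vel]:", x)
```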

Volumetric Rendering with Baked Quadrature Fields

  • paper_url: http://arxiv.org/abs/2312.02202
  • repo_url: None
  • paper_authors: Gopal Sharma, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi
  • for: This paper targets fast inference for Neural Radiance Fields (NeRF) in non-opaque scenes, so that NeRF rendering can better exploit modern graphics hardware.
  • methods: Proposes a new NeRF representation based on textured polygons: the scene is modelled with polygons obtained by marching cubes on a specialized field whose zero-crossings correspond to the quadrature points used in volume rendering; the polygons' opacity and colour come from the texture, and the final colour image is produced by rasterizing the polygons with fragment shaders.
  • results: The method renders high-quality novel views quickly, works on various devices, and integrates easily with existing graphics frameworks while keeping the benefits of volume rendering.
    Abstract We propose a novel Neural Radiance Field (NeRF) representation for non-opaque scenes that allows fast inference by utilizing textured polygons. Despite the high-quality novel view rendering that NeRF provides, a critical limitation is that it relies on volume rendering that can be computationally expensive and does not utilize the advancements in modern graphics hardware. Existing methods for this problem fall short when it comes to modelling volumetric effects as they rely purely on surface rendering. We thus propose to model the scene with polygons, which can then be used to obtain the quadrature points required to model volumetric effects, and also their opacity and colour from the texture. To obtain such polygonal mesh, we train a specialized field whose zero-crossings would correspond to the quadrature points when volume rendering, and perform marching cubes on this field. We then rasterize the polygons and utilize the fragment shaders to obtain the final colour image. Our method allows rendering on various devices and easy integration with existing graphics frameworks while keeping the benefits of volume rendering alive.
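A hedged sketch of the mesh-extraction step described above: run marching cubes on a scalar field and keep the zero level set as a polygon mesh. The analytic sphere SDF below is only a stand-in for the specialized trained field.

```python
# Marching cubes on a scalar field to extract the zero level set.
import numpy as np
from skimage import measure

# Sample a signed field on a grid: here, distance to a sphere of radius 0.5.
grid = np.linspace(-1.0, 1.0, 64)
xx, yy, zz = np.meshgrid(grid, grid, grid, indexing="ij")
field = np.sqrt(xx**2 + yy**2 + zz**2) - 0.5

# Zero-crossings of the field become the polygon mesh whose rasterized
# fragments later supply the volume-rendering quadrature points.
verts, faces, normals, _ = measure.marching_cubes(field, level=0.0)
print(verts.shape, faces.shape)   # (N, 3) vertices and (M, 3) triangles
```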

ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation

  • paper_url: http://arxiv.org/abs/2312.02201
  • repo_url: None
  • paper_authors: Peng Wang, Yichun Shi
  • for: This paper develops an image-prompt, multi-view diffusion model for 3D object generation.
  • methods: The method uses a canonical camera coordination for the objects in images to improve visual geometry accuracy. It applies multiple levels of control within the diffusion model conditioned on the input image, where global control shapes the overall object layout and local control fine-tunes the image details.
  • results: The method produces 3D models of higher quality than existing state-of-the-art image-conditioned methods, as demonstrated through extensive evaluations using a standard prompt list.
    Abstract We introduce "ImageDream," an innovative image-prompt, multi-view diffusion model for 3D object generation. ImageDream stands out for its ability to produce 3D models of higher quality compared to existing state-of-the-art, image-conditioned methods. Our approach utilizes a canonical camera coordination for the objects in images, improving visual geometry accuracy. The model is designed with various levels of control at each block inside the diffusion model based on the input image, where global control shapes the overall object layout and local control fine-tunes the image details. The effectiveness of ImageDream is demonstrated through extensive evaluations using a standard prompt list. For more information, visit our project page at https://Image-Dream.github.io.

Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation

  • paper_url: http://arxiv.org/abs/2312.01220
  • repo_url: None
  • paper_authors: Zhipeng Du, Miaojing Shi, Jiankang Deng
  • for: To improve object detection accuracy in low-light scenes.
  • methods: Adopts zero-shot day-night domain adaptation: a reflectance representation learning module learns Retinex-based illumination invariance, and an interchange-redecomposition-coherence procedure improves the Retinex image decomposition process.
  • results: Achieves strong low-light object detection performance on the ExDark, DARK FACE, and CODaN datasets.
    Abstract Detecting objects in low-light scenarios presents a persistent challenge, as detectors trained on well-lit data exhibit significant performance degradation on low-light data due to the low visibility. Previous methods mitigate this issue by investigating image enhancement or object detection techniques using low-light image datasets. However, the progress is impeded by the inherent difficulties associated with collecting and annotating low-light images. To address this challenge, we propose to boost low-light object detection with zero-shot day-night domain adaptation, which aims to generalize a detector from well-lit scenarios to low-light ones without requiring real low-light data. We first design a reflectance representation learning module to learn Retinex-based illumination invariance in images with a carefully designed illumination invariance reinforcement strategy. Next, an interchange-redecomposition-coherence procedure is introduced to improve over the vanilla Retinex image decomposition process by performing two sequential image decompositions and introducing a redecomposition cohering loss. Extensive experiments on ExDark, DARK FACE and CODaN datasets show strong low-light generalizability of our method.
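For background, a hedged sketch of the classical Retinex decomposition the method builds on: an image is factored as reflectance times illumination, I = R * L, with the illumination estimated here by simple Gaussian smoothing (single-scale Retinex). The learned modules in the paper replace this hand-crafted step.

```python
# Single-scale Retinex: split an image into reflectance and illumination.
import numpy as np
from scipy.ndimage import gaussian_filter

def single_scale_retinex(img: np.ndarray, sigma: float = 15.0):
    """img: (H, W) array in (0, 1]; returns (reflectance, illumination)."""
    illumination = gaussian_filter(img, sigma=sigma) + 1e-6
    reflectance = img / illumination   # illumination-invariant component
    return reflectance, illumination

img = np.random.default_rng(8).random((128, 128)) * 0.2 + 0.01  # "low light"
R, L = single_scale_retinex(img)
```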

RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2312.01215
  • repo_url: None
  • paper_authors: Baptiste Brument, Robin Bruneau, Yvain Quéau, Jean Mélou, François Bernard Lauze, Jean-Denis, Jean-Denis Durou, Lilian Calvet
  • for: This work aims to integrate multi-view reflectance and normal maps acquired through photometric stereo.
  • methods: The approach is a pixel-wise joint re-parameterization of reflectance and normal, treating them as a vector of radiances rendered under simulated, varying illumination, which allows them to be used as input data in neural volume rendering-based 3D reconstruction with a single optimization objective.
  • results: The method outperforms state-of-the-art approaches on multi-view photometric stereo (MVPS) benchmarks in F-score, Chamfer distance, and mean angular error, and notably improves detailed 3D reconstruction of areas with high curvature or low visibility.
    Abstract This paper introduces a versatile paradigm for integrating multi-view reflectance and normal maps acquired through photometric stereo. Our approach employs a pixel-wise joint re-parameterization of reflectance and normal, considering them as a vector of radiances rendered under simulated, varying illumination. This re-parameterization enables the seamless integration of reflectance and normal maps as input data in neural volume rendering-based 3D reconstruction while preserving a single optimization objective. In contrast, recent multi-view photometric stereo (MVPS) methods depend on multiple, potentially conflicting objectives. Despite its apparent simplicity, our proposed approach outperforms state-of-the-art approaches in MVPS benchmarks across F-score, Chamfer distance, and mean angular error metrics. Notably, it significantly improves the detailed 3D reconstruction of areas with high curvature or low visibility.
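A hedged numpy sketch of the re-parameterization idea: a per-pixel (reflectance, normal) pair is expressed as the vector of radiances it would produce under a set of simulated light directions. Lambertian shading is an illustrative modeling choice here, not necessarily the paper's.

```python
# Re-parameterize (albedo, normal) as radiances under simulated lights.
import numpy as np

def radiance_vector(albedo: float, normal: np.ndarray,
                    light_dirs: np.ndarray) -> np.ndarray:
    """normal: unit (3,); light_dirs: (L, 3) unit light directions."""
    shading = np.clip(light_dirs @ normal, 0.0, None)  # max(0, n . l)
    return albedo * shading                            # one radiance per light

lights = np.array([[0.0, 0.0, 1.0],
                   [0.7, 0.0, 0.7],
                   [0.0, 0.7, 0.7]])
lights /= np.linalg.norm(lights, axis=1, keepdims=True)

n = np.array([0.0, 0.0, 1.0])          # pixel normal from photometric stereo
r = radiance_vector(albedo=0.8, normal=n, light_dirs=lights)
# `r` is the per-pixel vector consumed by the volume-rendering pipeline.
```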

A Comparative Analysis Towards Melanoma Classification Using Transfer Learning by Analyzing Dermoscopic Images

  • paper_url: http://arxiv.org/abs/2312.01212
  • repo_url: None
  • paper_authors: Md. Fahim Uddin, Nafisa Tafshir, Mohammad Monirujjaman Khan
  • for: Propose a melanoma diagnosis system combining deep learning with established transfer learning methods, so that skin lesions can be classified as accurately as possible from input dermoscopic images, even on devices with limited computational power.
  • methods: Convolutional neural networks (CNNs) classify lesion images as benign or malignant; multiple pre-trained models (ResNet101, DenseNet, EfficientNet, InceptionV3) are fine-tuned via transfer learning on the ISIC Archive data, with augmentation applied before training (a fine-tuning sketch follows the abstract).
  • results: Every model achieves good accuracy, with DenseNet performing best: 96.64% validation accuracy, 9.43% validation loss, and 99.63% test set accuracy.
    Abstract Melanoma is a sort of skin cancer that starts in the cells known as melanocytes. It is more dangerous than other types of skin cancer because it can spread to other organs. Melanoma can be fatal if it spreads to other parts of the body. Early detection is the key to cure, but it requires the skills of skilled doctors to diagnose it. This paper presents a system that combines deep learning techniques with established transfer learning methods to enable skin lesions classification and diagnosis of melanoma skin lesions. Using Convolutional Neural Networks, it presents a method for categorizing melanoma images into benign and malignant images in this research (CNNs). Researchers used 'Deep Learning' techniques to train an expansive number of photos & essentially to get the expected result deep neural networks to need to be trained with a huge number of parameters as dermoscopic images are sensitive & very hard to classify. This paper, has been emphasized building models with less complexity and comparatively better accuracy with limited datasets & partially fewer deep networks so that the system can predict Melanoma at ease from input dermoscopic images as correctly as possible within devices with less computational power. The dataset has been obtained from ISIC Archive. Multiple pre-trained models ResNet101, DenseNet, EfficientNet, InceptionV3 have been implemented using transfer learning techniques to complete the comparative analysis & every model achieved good accuracy. Before training the models, the data has been augmented by multiple parameters to improve the accuracy. Moreover, the results are better than the previous state-of-the-art approaches & adequate to predict melanoma. Among these architectures, DenseNet performed better than the others which gives a validation accuracy of 96.64%, validation loss of 9.43% & test set accuracy of 99.63%.
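A minimal transfer-learning sketch in the spirit of the comparative analysis, assuming torchvision's DenseNet-121 (the abstract does not specify the DenseNet variant, optimizer, or augmentation settings): the pretrained backbone is frozen and a two-class benign/malignant head is fine-tuned.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained DenseNet and replace the classifier head
# for the binary benign/malignant task.
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
for p in model.features.parameters():
    p.requires_grad = False          # freeze the convolutional backbone
model.classifier = nn.Linear(model.classifier.in_features, 2)

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images, labels = torch.rand(4, 3, 224, 224), torch.tensor([0, 1, 1, 0])
loss = criterion(model(images), labels)
loss.backward(); optimizer.step()
print(float(loss))
```

The other backbones in the comparison (ResNet101, EfficientNet, InceptionV3) would be swapped in the same way, each with its classifier head replaced.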

Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction

  • paper_url: http://arxiv.org/abs/2312.01196
  • repo_url: None
  • paper_authors: Devikalyan Das, Christopher Wewer, Raza Yunus, Eddy Ilg, Jan Eric Lenssen
  • for: Tackle the severely under-constrained problem of reconstructing dynamic, non-rigidly deforming objects from monocular video while providing consistent, high-quality novel views.
  • methods: Neural Parametric Gaussians (NPGs), a two-stage approach: first fit a low-rank neural deformation model that preserves consistency in novel views and serves as regularization; then optimize local 3D Gaussians, temporally shared and anchored in (and deformed by) local oriented volumes, driven by the coarse model.
  • results: High-quality, photo-realistic reconstructions that remain 3D-consistent across novel views, outperforming previous work especially in challenging scenarios with few multi-view cues.
    Abstract Reconstructing dynamic objects from monocular videos is a severely underconstrained and challenging problem, and recent work has approached it in various directions. However, owing to the ill-posed nature of this problem, there has been no solution that can provide consistent, high-quality novel views from camera positions that are significantly different from the training views. In this work, we introduce Neural Parametric Gaussians (NPGs) to take on this challenge by imposing a two-stage approach: first, we fit a low-rank neural deformation model, which then is used as regularization for non-rigid reconstruction in the second stage. The first stage learns the object's deformations such that it preserves consistency in novel views. The second stage obtains high reconstruction quality by optimizing 3D Gaussians that are driven by the coarse model. To this end, we introduce a local 3D Gaussian representation, where temporally shared Gaussians are anchored in and deformed by local oriented volumes. The resulting combined model can be rendered as radiance fields, resulting in high-quality photo-realistic reconstructions of the non-rigidly deforming objects, maintaining 3D consistency across novel views. We demonstrate that NPGs achieve superior results compared to previous works, especially in challenging scenarios with few multi-view cues.

Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning

  • paper_url: http://arxiv.org/abs/2312.01191
  • repo_url: https://github.com/yangcong356/bita
  • paper_authors: Cong Yang, Zuchao Li, Lefei Zhang
  • for: This paper focuses on the task of remote sensing image captioning, specifically addressing the challenge of aligning image and text features in a fine-grained manner.
  • methods: The proposed approach, called BITA, utilizes a two-stage vision-language pre-training-based framework, which consists of a lightweight interactive Fourier Transformer and a prefix causal language model. The interactive Fourier Transformer extracts multi-scale features of remote sensing images in the frequency domain, while the prefix causal language model guides the text generation process using visual features.
  • results: The experimental results on the UCM-caption, RSICD, and NWPU-caption datasets demonstrate that BITA outperforms other advanced comparative approaches, indicating its effectiveness in aligning image-text features and generating accurate captions for remote sensing images.
    Abstract Recently, remote sensing image captioning has gained significant attention in the remote sensing community. Due to the significant differences in spatial resolution of remote sensing images, existing methods in this field have predominantly concentrated on the fine-grained extraction of remote sensing image features, but they cannot effectively handle the semantic consistency between visual features and textual features. To efficiently align the image-text, we propose a novel two-stage vision-language pre-training-based approach to bootstrap interactive image-text alignment for remote sensing image captioning, called BITA, which relies on the design of a lightweight interactive Fourier Transformer to better align remote sensing image-text features. The Fourier layer in the interactive Fourier Transformer is capable of extracting multi-scale features of remote sensing images in the frequency domain, thereby reducing the redundancy of remote sensing visual features. Specifically, the first stage involves preliminary alignment through image-text contrastive learning, which aligns the learned multi-scale remote sensing features from the interactive Fourier Transformer with textual features. In the second stage, the interactive Fourier Transformer connects the frozen image encoder with a large language model. Then, prefix causal language modeling is utilized to guide the text generation process using visual features. Ultimately, across the UCM-caption, RSICD, and NWPU-caption datasets, the experimental results clearly demonstrate that BITA outperforms other advanced comparative approaches. The code is available at https://github.com/yangcong356/BITA.
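The interactive Fourier Transformer itself is only outlined in the abstract; the sketch below shows the generic FNet-style building block that such frequency-domain mixers relate to: parameter-free token mixing via the real part of a 2D FFT, followed by a small feed-forward layer. Dimensions and structure here are illustrative assumptions, not BITA's actual module.

```python
import torch
import torch.nn as nn

class FourierMixing(nn.Module):
    """FNet-style token mixing: a 2D FFT over the (token, hidden) dims,
    keeping the real part. Parameter-free, O(n log n) mixing."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                nn.Linear(hidden, dim))

    def forward(self, x):            # x: (batch, tokens, dim)
        x = x + torch.fft.fft2(x.float()).real.to(x.dtype)
        return x + self.ff(self.norm(x))

tokens = torch.randn(2, 49, 128)     # e.g. 7x7 patch features of an image
print(FourierMixing(128)(tokens).shape)
```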

Efficient Expansion and Gradient Based Task Inference for Replay Free Incremental Learning

  • paper_url: http://arxiv.org/abs/2312.01188
  • repo_url: None
  • paper_authors: Soumya Roy, Vinay K Verma, Deepak Gupta
  • for: Propose a simple yet highly efficient expansion-based model for continual learning. Existing feature-transformation, masking, and factorization methods grow the model only over global or shared parameters, so they cannot fully exploit previously learned information and show limited transfer ability.
  • methods: A filter and channel expansion method that grows the model over the previous task's parameters rather than only the global ones, fully reusing earlier knowledge without forgetting; the growth rate is a function of task complexity. A robust task prediction method leverages entropy-weighted data augmentations and the model's gradients with pseudo labels (an entropy-based sketch follows the abstract).
  • results: State-of-the-art results across datasets and architectures in task-incremental (TIL), class-incremental (CIL), and generative continual learning settings; extensive ablations confirm the effectiveness of the proposed components.
    Abstract This paper proposes a simple but highly efficient expansion-based model for continual learning. The recent feature transformation, masking and factorization-based methods are efficient, but they grow the model only over the global or shared parameter. Therefore, these approaches do not fully utilize the previously learned information because the same task-specific parameter forgets the earlier knowledge. Thus, these approaches show limited transfer learning ability. Moreover, most of these models have constant parameter growth for all tasks, irrespective of the task complexity. Our work proposes a simple filter and channel expansion based method that grows the model over the previous task parameters and not just over the global parameter. Therefore, it fully utilizes all the previously learned information without forgetting, which results in better knowledge transfer. The growth rate in our proposed model is a function of task complexity; therefore for a simple task, the model has a smaller parameter growth while for complex tasks, the model requires more parameters to adapt to the current task. Recent expansion based models show promising results for task incremental learning (TIL). However, for class incremental learning (CIL), prediction of task id is a crucial challenge; hence, their results degrade rapidly as the number of tasks increase. In this work, we propose a robust task prediction method that leverages entropy weighted data augmentations and the models gradient using pseudo labels. We evaluate our model on various datasets and architectures in the TIL, CIL and generative continual learning settings. The proposed approach shows state-of-the-art results in all these settings. Our extensive ablation studies show the efficacy of the proposed components.
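A hedged sketch of entropy-based task-id inference in the spirit of the proposed predictor: average each task head's predictive entropy over several augmented views and pick the most confident head. The gradient-with-pseudo-labels component of the actual method is omitted, and all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def predict_task_id(heads, features_per_aug):
    """Pick the task whose head is most confident (lowest entropy),
    averaging entropies over several augmented views of the input.

    heads: list of nn.Linear, one classifier head per task
    features_per_aug: (A, D) features of A augmented views
    """
    entropies = []
    for head in heads:
        probs = F.softmax(head(features_per_aug), dim=-1)       # (A, C_t)
        ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1)   # (A,)
        entropies.append(ent.mean())
    return int(torch.stack(entropies).argmin())

heads = [torch.nn.Linear(16, 5) for _ in range(3)]
feats = torch.randn(8, 16)   # 8 augmented views of one sample
print(predict_task_id(heads, feats))
```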

SASSL: Enhancing Self-Supervised Learning via Neural Style Transfer

  • paper_url: http://arxiv.org/abs/2312.01187
  • repo_url: None
  • paper_authors: Renan A. Rojas-Gomez, Karan Singhal, Ali Etemad, Alex Bijamov, Warren R. Morningstar, Philip Andrew Mansfield
  • For: Improve self-supervised representation learning by exploiting natural image structure in augmentations to boost downstream performance.
  • Methods: A novel augmentation technique based on Neural Style Transfer that decouples the semantic and stylistic attributes of an image and transforms only the style while preserving content, generating diverse, semantics-preserving augmented samples (an AdaIN-style sketch follows the abstract).
  • Results: A top-1 classification improvement of more than 2% on ImageNet over MoCo v2, and transfer learning gains of up to 3.75% across five diverse datasets.
    Abstract Self-supervised learning relies heavily on data augmentation to extract meaningful representations from unlabeled images. While existing state-of-the-art augmentation pipelines incorporate a wide range of primitive transformations, these often disregard natural image structure. Thus, augmented samples can exhibit degraded semantic information and low stylistic diversity, affecting downstream performance of self-supervised representations. To overcome this, we propose SASSL: Style Augmentations for Self Supervised Learning, a novel augmentation technique based on Neural Style Transfer. The method decouples semantic and stylistic attributes in images and applies transformations exclusively to the style while preserving content, generating diverse augmented samples that better retain their semantic properties. Experimental results show our technique achieves a top-1 classification performance improvement of more than 2% on ImageNet compared to the well-established MoCo v2. We also measure transfer learning performance across five diverse datasets, observing significant improvements of up to 3.75%. Our experiments indicate that decoupling style from content information and transferring style across datasets to diversify augmentations can significantly improve downstream performance of self-supervised representations.
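Style-content decoupling of the kind SASSL builds on is commonly implemented with adaptive instance normalization (AdaIN); the sketch below shows that classic operation on feature maps as an assumed stand-in, not SASSL's actual augmentation pipeline.

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization: re-scale the content features
    to match the per-channel mean/std of the style features."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean

content = torch.randn(1, 64, 32, 32)   # encoder features of the sample
style = torch.randn(1, 64, 32, 32)     # encoder features of a style source
print(adain(content, style).shape)
```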

IDPL-PFOD2: A New Large-Scale Dataset for Printed Farsi Optical Character Recognition

  • paper_url: http://arxiv.org/abs/2312.01177
  • repo_url: None
  • paper_authors: Fatemeh Asadi-zeydabadi, Ali Afkari-Fahandari, Amin Faraji, Elham Shabaninia, Hossein Nezamabadi-pour
  • for: Present a large-scale dataset, IDPL-PFOD2, for printed Farsi optical character recognition.
  • methods: The dataset's effectiveness is assessed with both CRNN-based and Vision Transformer architectures.
  • results: The CRNN-based model achieves a baseline accuracy of 78.49% and a normalized edit distance of 97.72%, while the Vision Transformer reaches 81.32% accuracy and a 98.74% normalized edit distance (the edit-distance convention is sketched after the abstract).
    Abstract Optical Character Recognition is a technique that converts document images into searchable and editable text, making it a valuable tool for processing scanned documents. While the Farsi language stands as a prominent and official language in Asia, efforts to develop efficient methods for recognizing Farsi printed text have been relatively limited. This is primarily attributed to the languages distinctive features, such as cursive form, the resemblance between certain alphabet characters, and the presence of numerous diacritics and dot placement. On the other hand, given the substantial training sample requirements of deep-based architectures for effective performance, the development of such datasets holds paramount significance. In light of these concerns, this paper aims to present a novel large-scale dataset, IDPL-PFOD2, tailored for Farsi printed text recognition. The dataset comprises 2003541 images featuring a wide variety of fonts, styles, and sizes. This dataset is an extension of the previously introduced IDPL-PFOD dataset, offering a substantial increase in both volume and diversity. Furthermore, the datasets effectiveness is assessed through the utilization of both CRNN-based and Vision Transformer architectures. The CRNN-based model achieves a baseline accuracy rate of 78.49% and a normalized edit distance of 97.72%, while the Vision Transformer architecture attains an accuracy of 81.32% and a normalized edit distance of 98.74%.
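The normalized edit distance reported above is commonly computed as 1 − Levenshtein(pred, target) / max(|pred|, |target|); the sketch below assumes that convention, which may differ from the paper's exact normalization.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_similarity(pred: str, target: str) -> float:
    if not pred and not target:
        return 1.0
    return 1.0 - levenshtein(pred, target) / max(len(pred), len(target))

print(normalized_edit_similarity("کتاب", "کتب"))  # Farsi example -> 0.75
```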

Virtual Category Learning: A Semi-Supervised Learning Method for Dense Prediction with Extremely Limited Labels

  • paper_url: http://arxiv.org/abs/2312.01169
  • repo_url: https://github.com/geoffreychen777/vc
  • paper_authors: Changrui Chen, Jungong Han, Kurt Debattista
  • for: Improve model optimization for dense vision tasks when only very few labels are available.
  • methods: A pseudo-labeling scheme that uses confusing samples proactively, without label correction, by assigning each one a Virtual Category (VC) so it can safely contribute to optimization even without a concrete label, bounding inter-class information sharing and yielding a better embedding space.
  • results: On the two mainstream dense prediction tasks of semantic segmentation and object detection, the proposed VC learning significantly surpasses the state of the art, especially when only very few labels are available.
    Abstract Due to the costliness of labelled data in real-world applications, semi-supervised learning, underpinned by pseudo labelling, is an appealing solution. However, handling confusing samples is nontrivial: discarding valuable confusing samples would compromise the model generalisation while using them for training would exacerbate the issue of confirmation bias caused by the resulting inevitable mislabelling. To solve this problem, this paper proposes to use confusing samples proactively without label correction. Specifically, a Virtual Category (VC) is assigned to each confusing sample in such a way that it can safely contribute to the model optimisation even without a concrete label. This provides an upper bound for inter-class information sharing capacity, which eventually leads to a better embedding space. Extensive experiments on two mainstream dense prediction tasks -- semantic segmentation and object detection, demonstrate that the proposed VC learning significantly surpasses the state-of-the-art, especially when only very few labels are available. Our intriguing findings highlight the usage of VC learning in dense vision tasks.

Meta-Learned Attribute Self-Interaction Network for Continual and Generalized Zero-Shot Learning

  • paper_url: http://arxiv.org/abs/2312.01167
  • repo_url: None
  • paper_authors: Vinay K Verma, Nikhil Mehta, Kevin J Liang, Aakansha Mishra, Lawrence Carin
  • for: Propose a continual zero-shot learning (ZSL) method based on attribute self-interaction, so that the model can reason about unseen classes without using their attributes at training time.
  • methods: Attribute self-interaction trained with meta-learning, combined with inverse regularization of the attribute encoder.
  • results: Outperforms the state of the art on five standard ZSL datasets (CUB, aPY, AWA1, AWA2, and SUN) in the generalized and continual (fixed/dynamic) zero-shot settings, while training more than 100x faster than generative approaches.
    Abstract Zero-shot learning (ZSL) is a promising approach to generalizing a model to categories unseen during training by leveraging class attributes, but challenges remain. Recently, methods using generative models to combat bias towards classes seen during training have pushed state of the art, but these generative models can be slow or computationally expensive to train. Also, these generative models assume that the attribute vector of each unseen class is available a priori at training, which is not always practical. Additionally, while many previous ZSL methods assume a one-time adaptation to unseen classes, in reality, the world is always changing, necessitating a constant adjustment of deployed models. Models unprepared to handle a sequential stream of data are likely to experience catastrophic forgetting. We propose a Meta-learned Attribute self-Interaction Network (MAIN) for continual ZSL. By pairing attribute self-interaction trained using meta-learning with inverse regularization of the attribute encoder, we are able to outperform state-of-the-art results without leveraging the unseen class attributes while also being able to train our models substantially faster (>100x) than expensive generative-based approaches. We demonstrate this with experiments on five standard ZSL datasets (CUB, aPY, AWA1, AWA2, and SUN) in the generalized zero-shot learning and continual (fixed/dynamic) zero-shot learning settings. Extensive ablations and analyses demonstrate the efficacy of various components proposed.

A New Learning Paradigm for Foundation Model-based Remote Sensing Change Detection

  • paper_url: http://arxiv.org/abs/2312.01163
  • repo_url: https://github.com/likyoo/ban
  • paper_authors: Kaiyu Li, Xiangyong Cao, Deyu Meng
  • for: Propose a new framework for change detection (CD) built on universal foundation models.
  • methods: The Bi-Temporal Adapter Network (BAN) combines a frozen foundation model (e.g., CLIP) with a bi-temporal adapter branch (Bi-TAB) and bridging modules between them; the Bi-TAB can be either an existing arbitrary CD model or hand-crafted stacked blocks.
  • results: Improves the performance of existing CD methods with only a few additional learnable parameters, demonstrating the potential of foundation models for remote sensing CD.
    Abstract Change detection (CD) is a critical task to observe and analyze dynamic processes of land cover. Although numerous deep learning-based CD models have performed excellently, their further performance improvements are constrained by the limited knowledge extracted from the given labelled data. On the other hand, the foundation models that emerged recently contain a huge amount of knowledge by scaling up across data modalities and proxy tasks. In this paper, we propose a Bi-Temporal Adapter Network (BAN), which is a universal foundation model-based CD adaptation framework aiming to extract the knowledge of foundation models for CD. The proposed BAN contains three parts, i.e. frozen foundation model (e.g., CLIP), bitemporal adapter branch (Bi-TAB), and bridging modules between them. Specifically, the Bi-TAB can be either an existing arbitrary CD model or some hand-crafted stacked blocks. The bridging modules are designed to align the general features with the task/domain-specific features and inject the selected general knowledge into the Bi-TAB. To our knowledge, this is the first universal framework to adapt the foundation model to the CD task. Extensive experiments show the effectiveness of our BAN in improving the performance of existing CD methods (e.g., up to 4.08\% IoU improvement) with only a few additional learnable parameters. More importantly, these successful practices show us the potential of foundation models for remote sensing CD. The code is available at \url{https://github.com/likyoo/BAN} and will be supported in our Open-CD \url{https://github.com/likyoo/open-cd}.

Ultra-Resolution Cascaded Diffusion Model for Gigapixel Image Synthesis in Histopathology

  • paper_url: http://arxiv.org/abs/2312.01152
  • repo_url: https://github.com/lucidrains/imagen-pytorch
  • paper_authors: Sarah Cechnicka, Hadrien Reynaud, James Ball, Naomi Simmonds, Catherine Horsfield, Andrew Smith, Candice Roufosse, Bernhard Kainz
  • for: Synthesize gigapixel histopathology images, since diagnoses from whole slide images rely on information at both high and low resolutions.
  • methods: Ultra-Resolution Cascaded Diffusion Models (URCDMs) synthesize high-resolution images that are realistic at all magnification levels, focusing on fidelity as well as long-distance spatial coherency.
  • results: Improves the pFID-50k score from 110.63 to 39.52; a human expert evaluation study yields a weighted mean absolute error (MAE) of 0.11 for the lower-resolution diffusion models and 0.22 for the URCDM.
    Abstract Diagnoses from histopathology images rely on information from both high and low resolutions of Whole Slide Images. Ultra-Resolution Cascaded Diffusion Models (URCDMs) allow for the synthesis of high-resolution images that are realistic at all magnification levels, focusing not only on fidelity but also on long-distance spatial coherency. Our model beats existing methods, improving the pFID-50k [2] score by 110.63 to 39.52 pFID-50k. Additionally, a human expert evaluation study was performed, reaching a weighted Mean Absolute Error (MAE) of 0.11 for the Lower Resolution Diffusion Models and a weighted MAE of 0.22 for the URCDM.

Has Anything Changed? 3D Change Detection by 2D Segmentation Masks

  • paper_url: http://arxiv.org/abs/2312.01148
  • repo_url: None
  • paper_authors: Aikaterini Adam, Konstantinos Karantzalos, Lazaros Grammatikopoulos, Torsten Sattler
  • for: Unsupervised discovery of added, moved, or removed objects in indoor scenes, without prior knowledge of which objects exist.
  • methods: Models the problem as a combination of 3D change detection and 2D segmentation: initial, incomplete change detections obtained by render-and-compare are refined with generic 2D segmentation masks through graph optimization, distilling the 2D mask information into 3D space.
  • results: Outperforms competitive baselines on the 3Rscan dataset, achieving state-of-the-art results.
    Abstract As capturing devices become common, 3D scans of interior spaces are acquired on a daily basis. Through scene comparison over time, information about objects in the scene and their changes is inferred. This information is important for robots and AR and VR devices, in order to operate in an immersive virtual experience. We thus propose an unsupervised object discovery method that identifies added, moved, or removed objects without any prior knowledge of what objects exist in the scene. We model this problem as a combination of a 3D change detection and a 2D segmentation task. Our algorithm leverages generic 2D segmentation masks to refine an initial but incomplete set of 3D change detections. The initial changes, acquired through render-and-compare likely correspond to movable objects. The incomplete detections are refined through graph optimization, distilling the information of the 2D segmentation masks in the 3D space. Experiments on the 3Rscan dataset prove that our method outperforms competitive baselines, with SoTA results.

Exploiting Diffusion Priors for All-in-One Image Restoration

  • paper_url: http://arxiv.org/abs/2312.02197
  • repo_url: None
  • paper_authors: Yuanbiao Gou, Haiyu Zhao, Boyun Li, Xinyan Xiao, Xi Peng
  • for: Solve diverse image restoration tasks (e.g., deblurring and denoising) all in one model.
  • methods: Exploit the image priors captured by a pretrained diffusion model by addressing two challenges: degradation modeling, which simulates how a clean image degrades, and diffusion guidance, which steers the diffusion model toward the corresponding clean image.
  • results: The proposed zero-shot framework, ZeroAIR, alternates test-time degradation modeling (TDM) and three-stage diffusion guidance (TDG) at each reverse-sampling timestep and achieves comparable or even better performance than task-specific methods; the code will be released on GitHub.
    Abstract All-in-one aims to solve various tasks of image restoration in a single model. To this end, we present a feasible way of exploiting the image priors captured by the pretrained diffusion model, through addressing the two challenges, i.e., degradation modeling and diffusion guidance. The former aims to simulate the process of the clean image degenerated by certain degradations, and the latter aims at guiding the diffusion model to generate the corresponding clean image. With the motivations, we propose a zero-shot framework for all-in-one image restoration, termed ZeroAIR, which alternatively performs the test-time degradation modeling (TDM) and the three-stage diffusion guidance (TDG) at each timestep of the reverse sampling. To be specific, TDM exploits the diffusion priors to learn a degradation model from a given degraded image, and TDG divides the timesteps into three stages for taking full advantage of the varying diffusion priors. Thanks to their degradation-agnostic property, the all-in-one image restoration could be achieved in a zero-shot way by ZeroAIR. Through extensive experiments, we show that our ZeroAIR achieves comparable even better performance than those task-specific methods. The code will be available on Github.

Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors

  • paper_url: http://arxiv.org/abs/2312.02196
  • repo_url: None
  • paper_authors: Yu Zhang, Songpengcheng Xia, Lei Chu, Jiarui Yang, Qi Wu, Ling Pei
  • for: Propose a human pose estimation method based on sparse inertial sensors that addresses prior methods' reliance on synthetic data.
  • methods: Leverages diverse real inertial motion-capture data across different skeleton formats to improve motion diversity and model generalization, with two novel components: a pseudo-velocity regression model for dynamic motion capture with inertial sensors, and a part-based model dividing body and sensor data into three regions, each focusing on its unique characteristics.
  • results: Outperforms state-of-the-art models across five public datasets, notably reducing pose error by 19% on the DIP-IMU dataset; the implementation will be made publicly available.
    Abstract This paper introduces a novel human pose estimation approach using sparse inertial sensors, addressing the shortcomings of previous methods reliant on synthetic data. It leverages a diverse array of real inertial motion capture data from different skeleton formats to improve motion diversity and model generalization. This method features two innovative components: a pseudo-velocity regression model for dynamic motion capture with inertial sensors, and a part-based model dividing the body and sensor data into three regions, each focusing on their unique characteristics. The approach demonstrates superior performance over state-of-the-art models across five public datasets, notably reducing pose error by 19\% on the DIP-IMU dataset, thus representing a significant improvement in inertial sensor-based human pose estimation. We will make the implementation of our model available for public use.

ControlDreamer: Stylized 3D Generation with Multi-View ControlNet

  • paper_url: http://arxiv.org/abs/2312.01129
  • repo_url: https://github.com/oyt9306/ControlDreamer
  • paper_authors: Yeongtak Oh, Jooyoung Choi, Yongsung Kim, Minjun Park, Chaehun Shin, Sungroh Yoon
  • for: Address the limitations of current text-to-3D generation methods in producing creative geometry and styles.
  • methods: Introduces multi-view ControlNet, a depth-aware multi-view diffusion model trained on datasets generated from a carefully curated 100K text corpus, integrated into the two-stage ControlDreamer pipeline for text-guided generation of stylized 3D models.
  • results: Outperforms existing text-to-3D methods in qualitative comparisons and on CLIP score metrics; a comprehensive benchmark for 3D style editing covering objects, animals, and characters is also presented.
    Abstract Recent advancements in text-to-3D generation have significantly contributed to the automation and democratization of 3D content creation. Building upon these developments, we aim to address the limitations of current methods in generating 3D models with creative geometry and styles. We introduce multi-view ControlNet, a novel depth-aware multi-view diffusion model trained on generated datasets from a carefully curated 100K text corpus. Our multi-view ControlNet is then integrated into our two-stage pipeline, ControlDreamer, enabling text-guided generation of stylized 3D models. Additionally, we present a comprehensive benchmark for 3D style editing, encompassing a broad range of subjects, including objects, animals, and characters, to further facilitate diverse 3D generation. Our comparative analysis reveals that this new pipeline outperforms existing text-to-3D methods as evidenced by qualitative comparisons and CLIP score metrics.

SPEEDNet: Salient Pyramidal Enhancement Encoder-Decoder Network for Colonoscopy Images

  • paper_url: http://arxiv.org/abs/2312.01128
  • repo_url: None
  • paper_authors: Tushir Sahu, Vidhi Bhatt, Sai Chandra Teja R, Sparsh Mittal, Nagesh Kumar S
  • for: Accurately delineate regions of significance, such as lesions, in colonoscopy images.
  • methods: SPEEDNet, built around a novel Dilated-Involutional Pyramidal Convolution Fusion (DIPC) block that combines dilated involution layers pairwise into a pyramidal structure, converting feature maps into a compact space; this lowers the parameter count while improving representation learning over an optimal receptive field and reducing blurring.
  • results: On the EBHISeg dataset, outperforms UNet, FeedNet, and AttesResDUNet with an average Dice score of 0.952 and a recall of 0.971, at a model size of only 9.81 MB.
    Abstract Accurate identification and precise delineation of regions of significance, such as tumors or lesions, is a pivotal goal in medical imaging analysis. This paper proposes SPEEDNet, a novel architecture for precisely segmenting lesions within colonoscopy images. SPEEDNet uses a novel block named Dilated-Involutional Pyramidal Convolution Fusion (DIPC). A DIPC block combines the dilated involution layers pairwise into a pyramidal structure to convert the feature maps into a compact space. This lowers the total number of parameters while improving the learning of representations across an optimal receptive field, thereby reducing the blurring effect. On the EBHISeg dataset, SPEEDNet outperforms three previous networks: UNet, FeedNet, and AttesResDUNet. Specifically, SPEEDNet attains an average dice score of 0.952 and a recall of 0.971. Qualitative results and ablation studies provide additional insights into the effectiveness of SPEEDNet. The model size of SPEEDNet is 9.81 MB, significantly smaller than that of UNet (22.84 MB), FeedNet(185.58 MB), and AttesResDUNet (140.09 MB).

Beyond Accuracy: Statistical Measures and Benchmark for Evaluation of Representation from Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2312.01118
  • repo_url: None
  • paper_authors: Jiantao Wu, Shentong Mo, Sara Atito, Josef Kittler, Zhenhua Feng, Muhammad Awais
  • for: Provide a large-scale benchmark for evaluating representations learned by self-supervised metric learning.
  • methods: Builds the Statistical Metric Learning Benchmark (SMLB) on ImageNet-21K and WordNet, covering more than 14M images, 20K classes, and 16K taxonomic nodes, and proposes new evaluation metrics: 'overlap' for separability and 'aSTD' for consistency.
  • results: Reveals the limitations of supervised learning and the class bias inherent in SSL models, offering insights into potential areas for future model improvement.
    Abstract Recently, self-supervised metric learning has raised attention for the potential to learn a generic distance function. It overcomes the limitations of conventional supervised one, e.g., scalability and label biases. Despite progress in this domain, current benchmarks, incorporating a narrow scope of classes, stop the nuanced evaluation of semantic representations. To bridge this gap, we introduce a large-scale benchmark with diversity and granularity of classes, Statistical Metric Learning Benchmark (SMLB) built upon ImageNet-21K and WordNet. SMLB is designed to rigorously evaluate the discriminative discernment and generalizability across more than 14M images, 20K classes, and 16K taxonomic nodes. Alongside, we propose novel evaluation metrics -- `overlap' for separability and `aSTD' for consistency -- to measure distance statistical information, which are efficient and robust to the change of class number. Our benchmark offers a novel perspective of evaluating the quality of representations beyond accuracy. Our findings reveal the limitations of supervised learning and the class bias inherent in SSL models, offering insights into potential areas for future model enhancement.

Paved2Paradise: Cost-Effective and Scalable LiDAR Simulation by Factoring the Real World

  • paper_url: http://arxiv.org/abs/2312.01117
  • repo_url: https://github.com/airalcorn2/paved2paradise
  • paper_authors: Michael A. Alcorn, Noah Schwartz
  • for: A cost-effective, scalable approach for generating fully labeled, diverse, and realistic lidar datasets from scratch with minimal human annotation, to accelerate point cloud model development.
  • methods: "Factors the real world" by collecting separate background and object datasets and combining them combinatorially, in four steps: (1) collect copious background data; (2) record individuals from the desired object class(es) performing different behaviors in an isolated environment; (3) bootstrap labels for the object dataset; (4) generate samples by placing objects at arbitrary locations in backgrounds (a compositing sketch follows the abstract).
  • results: A model trained exclusively on the synthetic data detects humans in orchards effectively, even under heavy occlusion by tree branches; with backgrounds sourced from KITTI, it performs comparably to a model trained on the actual dataset.
    Abstract To achieve strong real world performance, neural networks must be trained on large, diverse datasets; however, obtaining and annotating such datasets is costly and time-consuming, particularly for 3D point clouds. In this paper, we describe Paved2Paradise, a simple, cost-effective approach for generating fully labeled, diverse, and realistic lidar datasets from scratch, all while requiring minimal human annotation. Our key insight is that, by deliberately collecting separate "background" and "object" datasets (i.e., "factoring the real world"), we can intelligently combine them to produce a combinatorially large and diverse training set. The Paved2Paradise pipeline thus consists of four steps: (1) collecting copious background data, (2) recording individuals from the desired object class(es) performing different behaviors in an isolated environment (like a parking lot), (3) bootstrapping labels for the object dataset, and (4) generating samples by placing objects at arbitrary locations in backgrounds. To demonstrate the utility of Paved2Paradise, we generated synthetic datasets for two tasks: (1) human detection in orchards (a task for which no public data exists) and (2) pedestrian detection in urban environments. Qualitatively, we find that a model trained exclusively on Paved2Paradise synthetic data is highly effective at detecting humans in orchards, including when individuals are heavily occluded by tree branches. Quantitatively, a model trained on Paved2Paradise data that sources backgrounds from KITTI performs comparably to a model trained on the actual dataset. These results suggest the Paved2Paradise synthetic data pipeline can help accelerate point cloud model development in sectors where acquiring lidar datasets has previously been cost-prohibitive.
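A toy version of step (4), assuming simple planar placement: rotate and translate an object point cloud to a random location in a background scan, with per-point labels coming for free. The real pipeline also handles ground alignment and occlusion, which this sketch omits.

```python
import numpy as np

def composite_scene(background, obj, rng):
    """Drop an object point cloud at a random location in a background scan.
    Points are (N, 4) arrays of x, y, z, intensity."""
    # Random planar placement and yaw within the background's extent.
    x, y = rng.uniform(background[:, :2].min(0), background[:, :2].max(0))
    yaw = rng.uniform(0, 2 * np.pi)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s], [s, c]])

    placed = obj.copy()
    placed[:, :2] = placed[:, :2] @ rot.T + (x, y)
    # Per-point labels: 0 = background, 1 = object.
    labels = np.concatenate([np.zeros(len(background)), np.ones(len(placed))])
    return np.vstack([background, placed]), labels

rng = np.random.default_rng(0)
bg = rng.uniform(-20, 20, (1000, 4))        # stand-in background scan
person = rng.normal(0, 0.3, (200, 4))       # stand-in object point cloud
cloud, labels = composite_scene(bg, person, rng)
print(cloud.shape, int(labels.sum()))
```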

Local Masking Meets Progressive Freezing: Crafting Efficient Vision Transformers for Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2312.02194
  • repo_url: https://github.com/utkutpcgl/vitfreeze
  • paper_authors: Utku Mert Topcuoglu, Erdem Akagündüz
  • for: A self-supervised learning method that reduces ViT training time while maintaining or improving learning capability.
  • methods: Integrates local masked image modeling with progressive layer freezing: specific layers are systematically frozen at strategic points during training to cut computation, and a multi-scale reconstruction process fosters efficient learning in the initial layers (a schedule sketch follows the abstract).
  • results: Reduces training time by about 12.5% with minimal impact on accuracy (top-1 drops by 0.6%), reaching 82.6% top-1 and 96.2% top-5 accuracy.
    Abstract In this paper, we present an innovative approach to self-supervised learning for Vision Transformers (ViTs), integrating local masked image modeling with progressive layer freezing. This method focuses on enhancing the efficiency and speed of initial layer training in ViTs. By systematically freezing specific layers at strategic points during training, we reduce computational demands while maintaining or improving learning capabilities. Our approach employs a novel multi-scale reconstruction process that fosters efficient learning in initial layers and enhances semantic comprehension across scales. The results demonstrate a substantial reduction in training time (~12.5\%) with a minimal impact on model accuracy (decrease in top-1 accuracy by 0.6\%). Our method achieves top-1 and top-5 accuracies of 82.6\% and 96.2\%, respectively, underscoring its potential in scenarios where computational resources and time are critical. This work marks an advancement in the field of self-supervised learning for computer vision. The implementation of our approach is available at our project's GitHub repository: github.com/utkutpcgl/ViTFreeze.
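A minimal freezing-schedule sketch, assuming a timm ViT and a simple linear schedule; the paper freezes layers at strategic points rather than linearly, so this only illustrates the mechanism.

```python
import timm

def freeze_prefix(model, num_blocks: int):
    """Freeze the patch embedding (once any block is frozen) and the first
    `num_blocks` transformer blocks; everything else stays trainable."""
    for p in model.patch_embed.parameters():
        p.requires_grad = num_blocks == 0
    for i, block in enumerate(model.blocks):
        for p in block.parameters():
            p.requires_grad = i >= num_blocks

model = timm.create_model('vit_small_patch16_224', pretrained=False)
epochs, depth = 100, len(model.blocks)
for epoch in range(epochs):
    # Illustrative linear schedule; the last block always stays trainable.
    freeze_prefix(model, num_blocks=min(depth - 1, epoch * depth // epochs))
    # ... one epoch of local masked image modeling would run here ...

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'trainable parameters in the final epoch: {trainable}')
```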

S2P3: Self-Supervised Polarimetric Pose Prediction

  • paper_url: http://arxiv.org/abs/2312.01105
  • repo_url: None
  • paper_authors: Patrick Ruhkamp, Daoyi Gao, Nassir Navab, Benjamin Busam
  • for: The first self-supervised 6D object pose prediction method from multimodal RGB + polarimetric images.
  • methods: Combines (1) a physical model that extracts geometric information from polarized light, (2) a teacher-student knowledge distillation scheme, and (3) a self-supervised loss formulated through differentiable rendering and an invertible physical constraint; both networks encode shape priors and polarization characteristics derived from the physical model to learn robust geometric representations.
  • results: Reports the most prominent performance gains on photometrically challenging objects with texture-less or reflective surfaces and transparent materials.
    Abstract This paper proposes the first self-supervised 6D object pose prediction from multimodal RGB+polarimetric images. The novel training paradigm comprises 1) a physical model to extract geometric information of polarized light, 2) a teacher-student knowledge distillation scheme and 3) a self-supervised loss formulation through differentiable rendering and an invertible physical constraint. Both networks leverage the physical properties of polarized light to learn robust geometric representations by encoding shape priors and polarization characteristics derived from our physical model. Geometric pseudo-labels from the teacher support the student network without the need for annotated real data. Dense appearance and geometric information of objects are obtained through a differentiable renderer with the predicted pose for self-supervised direct coupling. The student network additionally features our proposed invertible formulation of the physical shape priors that enables end-to-end self-supervised training through physical constraints of derived polarization characteristics compared against polarimetric input images. We specifically focus on photometrically challenging objects with texture-less or reflective surfaces and transparent materials for which the most prominent performance gain is reported.

QPoser: Quantized Explicit Pose Prior Modeling for Controllable Pose Generation

  • paper_url: http://arxiv.org/abs/2312.01104
  • repo_url: None
  • paper_authors: Yumeng Li, Yaoxiang Ding, Zhong Ren, Kun Zhou
  • for: Propose a highly controllable explicit pose prior model that guarantees correctness, expressiveness, and controllability at once.
  • methods: A multi-head vector-quantized autoencoder (MS-VQVAE) yields expressive and distributed pose representations, while a global-local feature integration mechanism (GLIF-AE) disentangles the latent representation and integrates full-body information into local-joint features (a quantization sketch follows the abstract).
  • results: Significantly outperforms state-of-the-art approaches at representing expressive and correct poses, and is easy to use for detailed conditional generation from reference poses and prompted instructions.
    Abstract Explicit pose prior models compress human poses into latent representations for using in pose-related downstream tasks. A desirable explicit pose prior model should satisfy three desirable abilities: 1) correctness, i.e. ensuring to generate physically possible poses; 2) expressiveness, i.e. ensuring to preserve details in generation; 3) controllability, meaning that generation from reference poses and explicit instructions should be convenient. Existing explicit pose prior models fail to achieve all of three properties, in special controllability. To break this situation, we propose QPoser, a highly controllable explicit pose prior model which guarantees correctness and expressiveness. In QPoser, a multi-head vector quantized autoencoder (MS-VQVAE) is proposed for obtaining expressive and distributed pose representations. Furthermore, a global-local feature integration mechanism (GLIF-AE) is utilized to disentangle the latent representation and integrate full-body information into local-joint features. Experimental results show that QPoser significantly outperforms state-of-the-art approaches in representing expressive and correct poses, meanwhile is easily to be used for detailed conditional generation from reference poses and prompting instructions.
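The core building block of an MS-VQVAE is vector quantization; the sketch below shows a standard single-codebook VQ-VAE bottleneck with a straight-through gradient, as an assumed stand-in for the paper's multi-head variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient,
    as in VQ-VAE (van den Oord et al.)."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                               # z: (batch, dim)
        dists = torch.cdist(z, self.codebook.weight)    # (batch, num_codes)
        idx = dists.argmin(dim=-1)
        zq = self.codebook(idx)
        # Codebook + commitment losses; straight-through estimator.
        loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
        zq = z + (zq - z).detach()
        return zq, idx, loss

vq = VectorQuantizer()
zq, idx, loss = vq(torch.randn(8, 64))
print(zq.shape, idx.shape, float(loss))
```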

Rethinking Multiple Instance Learning for Whole Slide Image Classification: A Bag-Level Classifier is a Good Instance-Level Teacher

  • paper_url: http://arxiv.org/abs/2312.01099
  • repo_url: https://github.com/dootmaan/icmil
  • paper_authors: Hongyi Wang, Luyang Luo, Fang Wang, Ruofeng Tong, Yen-Wei Chen, Hongjie Hu, Lanfen Lin, Hao Chen
  • for: Improve multiple instance learning (MIL) for whole slide image (WSI) classification by closing the gap between the fixed feature-embedding stage and the classifier-training stage of the usual two-stage pipeline.
  • methods: Iteratively Coupled MIL (ICMIL) treats the bag-level classifier as an instance-level teacher: it first fixes the patch embedder to train the bag classifier, then fixes the bag classifier to fine-tune the embedder, with a teacher-student framework efficiently distilling category knowledge from the bag classifier into the instance-level embedder.
  • results: Consistently improves existing MIL backbones across four distinct datasets, achieving state-of-the-art results.
    Abstract Multiple Instance Learning (MIL) has demonstrated promise in Whole Slide Image (WSI) classification. However, a major challenge persists due to the high computational cost associated with processing these gigapixel images. Existing methods generally adopt a two-stage approach, comprising a non-learnable feature embedding stage and a classifier training stage. Though it can greatly reduce the memory consumption by using a fixed feature embedder pre-trained on other domains, such scheme also results in a disparity between the two stages, leading to suboptimal classification accuracy. To address this issue, we propose that a bag-level classifier can be a good instance-level teacher. Based on this idea, we design Iteratively Coupled Multiple Instance Learning (ICMIL) to couple the embedder and the bag classifier at a low cost. ICMIL initially fix the patch embedder to train the bag classifier, followed by fixing the bag classifier to fine-tune the patch embedder. The refined embedder can then generate better representations in return, leading to a more accurate classifier for the next iteration. To realize more flexible and more effective embedder fine-tuning, we also introduce a teacher-student framework to efficiently distill the category knowledge in the bag classifier to help the instance-level embedder fine-tuning. Thorough experiments were conducted on four distinct datasets to validate the effectiveness of ICMIL. The experimental results consistently demonstrate that our method significantly improves the performance of existing MIL backbones, achieving state-of-the-art results. The code is available at: https://github.com/Dootmaan/ICMIL/tree/confidence_based

Planning as In-Painting: A Diffusion-Based Embodied Task Planning Framework for Environments under Uncertainty

  • paper_url: http://arxiv.org/abs/2312.01097
  • repo_url: https://github.com/joeyy5588/planning-as-inpainting
  • paper_authors: Cheng-Fu Yang, Haoyang Xu, Te-Lin Wu, Xiaofeng Gao, Kai-Wei Chang, Feng Gao
  • for: Tackle task planning for embodied AI, a problem for which the community has not reached a consensus formulation.
  • methods: A unified framework pairing an end-to-end trainable method with a planning algorithm. The task-agnostic "planning as in-painting" approach uses a Denoising Diffusion Model (DDM) to generate plans conditioned on language instructions and perceptual inputs in partially observable environments, jointly modeling the state trajectory and goal estimation to curb hallucinated plans; an on-the-fly planning algorithm exploits information discovered during execution.
  • results: Promising performance on vision-language navigation, object manipulation, and task planning in a photorealistic virtual environment.
    Abstract Task planning for embodied AI has been one of the most challenging problems where the community does not meet a consensus in terms of formulation. In this paper, we aim to tackle this problem with a unified framework consisting of an end-to-end trainable method and a planning algorithm. Particularly, we propose a task-agnostic method named 'planning as in-painting'. In this method, we use a Denoising Diffusion Model (DDM) for plan generation, conditioned on both language instructions and perceptual inputs under partially observable environments. Partial observation often leads to the model hallucinating the planning. Therefore, our diffusion-based method jointly models both state trajectory and goal estimation to improve the reliability of the generated plan, given the limited available information at each step. To better leverage newly discovered information along the plan execution for a higher success rate, we propose an on-the-fly planning algorithm to collaborate with the diffusion-based planner. The proposed framework achieves promising performances in various embodied AI tasks, including vision-language navigation, object manipulation, and task planning in a photorealistic virtual environment. The code is available at: https://github.com/joeyy5588/planning-as-inpainting.

RobustCalib: Robust Lidar-Camera Extrinsic Calibration with Consistency Learning

  • paper_url: http://arxiv.org/abs/2312.01085
  • repo_url: None
  • paper_authors: Shuang Xu, Sifan Zhou, Zhi Tian, Jizhou Ma, Qiong Nie, Xiangxiang Chu
  • for: addresses the extrinsic calibration problem in a robust, automatic, and single-shot manner, without relying on offline targets or human efforts.
  • methods: leverages consistency learning between LiDAR and camera to implement implicit re-calibration, using an appearance-consistency loss and a geometric-consistency loss to minimize inconsistencies between projected and predicted attributes.
  • results: achieves accurate and robust performance in various scenarios, with comprehensive experiments conducted on different datasets. The model and code will be released to promote further research and development.
    Abstract Current traditional methods for LiDAR-camera extrinsics estimation depend on offline targets and human efforts, while learning-based approaches resort to iterative refinement for calibration results, posing constraints on their generalization and application in on-board systems. In this paper, we propose a novel approach to address the extrinsic calibration problem in a robust, automatic, and single-shot manner. Instead of directly optimizing extrinsics, we leverage the consistency learning between LiDAR and camera to implement implicit re-calibartion. Specially, we introduce an appearance-consistency loss and a geometric-consistency loss to minimizing the inconsitency between the attrbutes (e.g., intensity and depth) of projected LiDAR points and the predicted ones. This design not only enhances adaptability to various scenarios but also enables a simple and efficient formulation during inference. We conduct comprehensive experiments on different datasets, and the results demonstrate that our method achieves accurate and robust performance. To promote further research and development in this area, we will release our model and code.
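A hedged sketch of the appearance-consistency idea: project LiDAR points into the image with candidate extrinsics and penalize the gap between measured point intensities and a predicted intensity map at the projected pixels. The nearest-pixel sampling and all names are illustrative; the paper's actual loss and networks are not reproduced here.

```python
import torch

def appearance_consistency(points, intensities, K, T, pred_intensity_map):
    """L1 gap between measured LiDAR intensities and a predicted intensity
    image, sampled at the pixels the points project to.

    points:      (N, 3) LiDAR xyz
    intensities: (N,) measured reflectance per point
    K:           (3, 3) camera intrinsics
    T:           (4, 4) LiDAR-to-camera extrinsics
    pred_intensity_map: (H, W) predicted intensity image
    """
    pts_h = torch.cat([points, torch.ones(len(points), 1)], dim=1)  # (N, 4)
    cam = (T @ pts_h.T).T[:, :3]
    front = cam[:, 2] > 0.1                 # keep points in front of camera
    uv = (K @ cam[front].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).round().long()
    H, W = pred_intensity_map.shape
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, meas = uv[ok], intensities[front][ok]
    pred = pred_intensity_map[uv[:, 1], uv[:, 0]]
    return (pred - meas).abs().mean()

loss = appearance_consistency(
    torch.randn(1000, 3) + torch.tensor([0.0, 0.0, 5.0]),
    torch.rand(1000), torch.eye(3), torch.eye(4), torch.rand(64, 64))
print(float(loss))
```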

Consistency Prototype Module and Motion Compensation for Few-Shot Action Recognition (CLIP-CP$\mathbf{M^2}$C)

  • paper_url: http://arxiv.org/abs/2312.01083
  • repo_url: None
  • paper_authors: Fei Guo, Li Zhu, YiKang Wang, Han Qi
  • for: 提高 few-shot 动作识别精度,解决以往工作仅依赖视觉单模态、忽略运动特征的局限。
  • methods: 提出 Consistency Prototype and Motion Compensation Network(CLIP-CP$M^2$C),使用 CLIP 进行多模态 few-shot 动作识别,并提出一种新的文本补偿方法,在 query 视频缺少文本(提示)信息时进行补偿。
  • results: 实验表明,提出的方法可以与现有的最优(state-of-the-art)结果竞争,并在标准 benchmark 数据集上实现了出色的性能。
    Abstract Recently, few-shot action recognition has significantly progressed by learning the feature discriminability and designing suitable comparison methods. Still, there are the following restrictions. (a) Previous works are mainly based on the visual modality alone. Although some multi-modal works use labels as supplementary to construct prototypes of support videos, they can not use this information for query videos. The labels are not used efficiently. (b) Most of the works ignore the motion feature of video, although the motion features are essential for distinguishing. We proposed a Consistency Prototype and Motion Compensation Network(CLIP-CP$M^2$C) to address these issues. Firstly, we use CLIP for multi-modal few-shot action recognition with the text-image comparison for domain adaptation. Secondly, in order to make the amount of information between the prototype and the query more similar, we propose a novel method to compensate for the text(prompt) information of query videos when text(prompt) does not exist, which depends on a Consistency Loss. Thirdly, we use the differential features of the adjacent frames in two directions as the motion features, which explicitly embeds the network with motion dynamics. We also apply the Consistency Loss to the motion features. Extensive experiments on standard benchmark datasets demonstrate that the proposed method can compete with state-of-the-art results. Our code is available at the URL: https://github.com/xxx/xxx.git.
    摘要 最近,few-shot 动作识别通过学习特征判别能力和设计合适的比较方法取得了显著进展。然而,仍存在以下限制。(a)以前的工作主要基于视觉单模态,尽管一些多模态工作使用标签作为补充来构建支持视频的原型,但它们不能将这些信息用于查询视频,标签没有被有效利用。(b)大多数工作忽略了视频中的运动特征,尽管运动特征是区分动作的关键。我们提出了一种一致性原型和运动补偿网络(CLIP-CP$M^2$C)来解决这些问题。首先,我们使用CLIP进行多模态few-shot动作识别,并通过文本-图像比较实现领域适应。其次,为了使原型与查询之间的信息量更为接近,我们提出了一种新方法,在查询视频缺少文本(提示)时补偿其文本信息,该方法依赖于一致性损失。第三,我们使用相邻帧在两个方向上的差分特征作为运动特征,显式地为网络注入运动动态,并对运动特征同样施加一致性损失。在标准基准数据集上的大量实验表明,我们的方法可以与当前最优结果竞争。我们的代码可以在以下URL获取:https://github.com/xxx/xxx.git。
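
A minimal sketch of the two-direction differential motion features described above; the clip layout and naming are illustrative assumptions, not the paper's implementation.

```python
import torch

def bidirectional_motion_features(frames):
    """frames: (B, T, C, H, W) video clip. Returns forward and backward
    frame differences, a simple stand-in for the paper's two-direction
    differential motion features."""
    fwd = frames[:, 1:] - frames[:, :-1]   # f_{t+1} - f_t
    bwd = frames[:, :-1] - frames[:, 1:]   # f_t - f_{t+1}
    return fwd, bwd

fwd, bwd = bidirectional_motion_features(torch.randn(2, 8, 3, 64, 64))
print(fwd.shape, bwd.shape)  # (2, 7, 3, 64, 64) each
```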

DPHMs: Diffusion Parametric Head Models for Depth-based Tracking

  • paper_url: http://arxiv.org/abs/2312.01068
  • repo_url: None
  • paper_authors: Jiapeng Tang, Angela Dai, Yinyu Nie, Lev Markhasin, Justus Thies, Matthias Niessner
  • for: 本研究旨在提出一种扩散参数化头部模型(DPHMs),用于从单目深度序列中稳健地重建和跟踪头部几何。
  • methods: 该方法引入基于潜在扩散的先验来正则化体积头部重建与跟踪,将身份和表情编码约束在表示合理头部形状的潜在流形上。
  • results: 对比于现有的tracking方法,本研究的方法能够更好地重建头部的身份和表情,并在不同的表情和快速转换中保持稳定性。
    Abstract We introduce Diffusion Parametric Head Models (DPHMs), a generative model that enables robust volumetric head reconstruction and tracking from monocular depth sequences. While recent volumetric head models, such as NPHMs, can now excel in representing high-fidelity head geometries, tracking and reconstruction heads from real-world single-view depth sequences remains very challenging, as the fitting to partial and noisy observations is underconstrained. To tackle these challenges, we propose a latent diffusion-based prior to regularize volumetric head reconstruction and tracking. This prior-based regularizer effectively constrains the identity and expression codes to lie on the underlying latent manifold which represents plausible head shapes. To evaluate the effectiveness of the diffusion-based prior, we collect a dataset of monocular Kinect sequences consisting of various complex facial expression motions and rapid transitions. We compare our method to state-of-the-art tracking methods, and demonstrate improved head identity reconstruction as well as robust expression tracking.
    摘要 我们介绍Diffusion Parametric Head Models(DPHMs),一种生成模型,可以从单目深度序列中稳健地进行体积头部重建与跟踪。尽管当前的体积头部模型(如NPHMs)已能很好地表示高保真头部几何,但从真实世界的单视角深度序列中跟踪和重建头部仍然非常具有挑战性,因为对部分且带噪声的观测进行拟合是欠约束的。为了解决这些挑战,我们提出使用基于潜在扩散的先验来正则化体积头部重建与跟踪。这种基于先验的正则化有效地将身份和表情编码约束在表示合理头部形状的潜在流形上。为了评估该扩散先验的效果,我们收集了一个包含多种复杂面部表情动作和快速过渡的单目Kinect序列数据集。我们与最先进的跟踪方法进行比较,证明我们的方法能够改善头部身份重建并实现稳健的表情跟踪。

DiverseDream: Diverse Text-to-3D Synthesis with Augmented Text Embedding

  • paper_url: http://arxiv.org/abs/2312.02192
  • repo_url: None
  • paper_authors: Uy Dieu Tran, Minh Luu, Phong Nguyen, Janne Heikkila, Khoi Nguyen, Binh-Son Hua
  • for: 本研究旨在解决现有文本到3D合成方法中的模式坍塌问题:尽管这些方法使用预训练的文本到图像模型作为引导视觉先验,但生成的3D模型往往缺乏多样性。
  • methods: 本研究采用分析和实验方法来探讨现有文本到3D Synthesis方法中的限制因素,并提出一种新的方法,即通过文本描述和参照图像的文本反转来增强多样性。
  • results: 研究表明,所提出的新方法在定性和定量两方面都改善了文本到3D合成的多样性。
    Abstract Text-to-3D synthesis has recently emerged as a new approach to sampling 3D models by adopting pretrained text-to-image models as guiding visual priors. An intriguing but underexplored problem with existing text-to-3D methods is that 3D models obtained from the sampling-by-optimization procedure tend to have mode collapses, and hence poor diversity in their results. In this paper, we provide an analysis and identify potential causes of such a limited diversity, and then devise a new method that considers the joint generation of different 3D models from the same text prompt, where we propose to use augmented text prompts via textual inversion of reference images to diversify the joint generation. We show that our method leads to improved diversity in text-to-3D synthesis qualitatively and quantitatively.

Spectral-wise Implicit Neural Representation for Hyperspectral Image Reconstruction

  • paper_url: http://arxiv.org/abs/2312.01061
  • repo_url: None
  • paper_authors: Huan Chen, Wangcai Zhao, Tingfa Xu, Shiyun Zhou, Peifu Liu, Jianan Li
  • for: 提高编码孔径快照光谱成像(CASSI)重建的精度和灵活性,使其能够在各种应用场景中提供更高质量的图像重建。
  • methods: 提出了一种新的 Spectral-wise Implicit Neural Representation(SINR)方法,该方法通过引入连续光谱放大过程,实现图像重建中可自定义放大倍数的光谱超分辨。SINR还包括一个 spectral-wise attention 机制,以及一个Fourier坐标编码器和一个光谱缩放因子模块,以提高其表达能力和灵活性。
  • results: 与基线方法相比,SINR在不同的图像重建任务中表现出明显的优势,并且能在CASSI框架内实现连续的图像重建。
    Abstract Coded Aperture Snapshot Spectral Imaging (CASSI) reconstruction aims to recover the 3D spatial-spectral signal from 2D measurement. Existing methods for reconstructing Hyperspectral Image (HSI) typically involve learning mappings from a 2D compressed image to a predetermined set of discrete spectral bands. However, this approach overlooks the inherent continuity of the spectral information. In this study, we propose an innovative method called Spectral-wise Implicit Neural Representation (SINR) as a pioneering step toward addressing this limitation. SINR introduces a continuous spectral amplification process for HSI reconstruction, enabling spectral super-resolution with customizable magnification factors. To achieve this, we leverage the concept of implicit neural representation. Specifically, our approach introduces a spectral-wise attention mechanism that treats individual channels as distinct tokens, thereby capturing global spectral dependencies. Additionally, our approach incorporates two components, namely a Fourier coordinate encoder and a spectral scale factor module. The Fourier coordinate encoder enhances the SINR's ability to emphasize high-frequency components, while the spectral scale factor module guides the SINR to adapt to the variable number of spectral channels. Notably, the SINR framework enhances the flexibility of CASSI reconstruction by accommodating an unlimited number of spectral bands in the desired output. Extensive experiments demonstrate that our SINR outperforms baseline methods. By enabling continuous reconstruction within the CASSI framework, we take the initial stride toward integrating implicit neural representation into the field.
    摘要 CASSI重建的目标是从2D测量中恢复3D空间-光谱信号。现有的HSI重建方法通常学习从2D压缩图像到预先设定的一组离散光谱带的映射。然而,这种方法忽略了光谱信息的内在连续性。在本研究中,我们提出了一种创新方法,称为Spectral-wise Implicit Neural Representation(SINR),作为解决这一局限的开创性一步。SINR为HSI重建引入了连续的光谱放大过程,实现可自定义放大倍数的光谱超分辨。为此,我们利用了隐式神经表示的概念。具体来说,我们的方法引入了一种spectral-wise注意机制,将各个通道视为独立的token,以捕捉全局光谱依赖关系。此外,我们的方法还包括两个组件:Fourier坐标编码器和光谱缩放因子模块。Fourier坐标编码器增强了SINR强调高频分量的能力,而光谱缩放因子模块引导SINR适应可变数量的光谱通道。值得注意的是,SINR框架允许期望输出包含任意数量的光谱带,从而提高了CASSI重建的灵活性。大量实验表明,我们的SINR优于基线方法。通过在CASSI框架内实现连续重建,我们迈出了将隐式神经表示引入该领域的第一步。
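
The spectral-wise attention idea (each band as a token) can be sketched as follows; the layer sizes and the use of a stock multi-head attention module are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpectralWiseAttention(nn.Module):
    """Treats each spectral band of an HSI feature map as one token and
    applies self-attention across bands to capture global spectral
    dependencies."""
    def __init__(self, num_pixels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=num_pixels,
                                          num_heads=num_heads,
                                          batch_first=True)

    def forward(self, x):                 # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2)             # (B, C, H*W): one token per band
        out, _ = self.attn(tokens, tokens, tokens)
        return out.view(b, c, h, w)

# Example: 28-band feature map, 16x16 spatial grid.
attn = SpectralWiseAttention(num_pixels=16 * 16)
print(attn(torch.randn(2, 28, 16, 16)).shape)  # (2, 28, 16, 16)
```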

Spectrum-driven Mixed-frequency Network for Hyperspectral Salient Object Detection

  • paper_url: http://arxiv.org/abs/2312.01060
  • repo_url: https://github.com/laprf/smn
  • paper_authors: Peifu Liu, Tingfa Xu, Huan Chen, Shiyun Zhou, Haolin Qin, Jianan Li
  • for: 这篇论文的目标是提出一种充分利用光谱特征的高光谱显著目标检测(HSOD)方法,以便更好地检测高光谱图像中的显著目标。
  • methods: 该方法从光谱中提取两种不同的频率成分来刻画目标:低频的Spectral Saliency和高频的Spectral Edge。Spectral Saliency近似显著目标所在的区域,而Spectral Edge捕捉显著目标的边缘信息。
  • results: 该方法在HS-SOD benchmark和自定义的HSOD-BIT数据集上进行了广泛的实验,并证明了它在HSOD性能方面超过了现有的方法。
    Abstract Hyperspectral salient object detection (HSOD) aims to detect spectrally salient objects in hyperspectral images (HSIs). However, existing methods inadequately utilize spectral information by either converting HSIs into false-color images or converging neural networks with clustering. We propose a novel approach that fully leverages the spectral characteristics by extracting two distinct frequency components from the spectrum: low-frequency Spectral Saliency and high-frequency Spectral Edge. The Spectral Saliency approximates the region of salient objects, while the Spectral Edge captures edge information of salient objects. These two complementary components, crucial for HSOD, are derived by computing from the inter-layer spectral angular distance of the Gaussian pyramid and the intra-neighborhood spectral angular gradients, respectively. To effectively utilize this dual-frequency information, we introduce a novel lightweight Spectrum-driven Mixed-frequency Network (SMN). SMN incorporates two parameter-free plug-and-play operators, namely Spectral Saliency Generator and Spectral Edge Operator, to extract the Spectral Saliency and Spectral Edge components from the input HSI independently. Subsequently, the Mixed-frequency Attention module, comprised of two frequency-dependent heads, intelligently combines the embedded features of edge and saliency information, resulting in a mixed-frequency feature representation. Furthermore, a saliency-edge-aware decoder progressively scales up the mixed-frequency feature while preserving rich detail and saliency information for accurate salient object prediction. Extensive experiments conducted on the HS-SOD benchmark and our custom dataset HSOD-BIT demonstrate that our SMN outperforms state-of-the-art methods regarding HSOD performance. Code and dataset will be available at https://github.com/laprf/SMN.
    摘要 高光谱显著目标检测(HSOD)旨在检测高光谱图像(HSI)中在光谱上显著的目标。然而,现有方法对光谱信息利用不足,通常将HSI转换为伪彩色图像,或将神经网络与聚类相结合。我们提出了一种充分利用光谱特征的新方法,从光谱中提取两种不同的频率成分:低频的Spectral Saliency和高频的Spectral Edge。Spectral Saliency近似显著目标所在的区域,而Spectral Edge捕捉显著目标的边缘信息。这两个互补成分分别通过计算高斯金字塔的层间光谱角距离和邻域内的光谱角梯度获得。为了有效利用这种双频信息,我们提出了一种新的轻量级Spectrum-driven Mixed-frequency Network(SMN)。SMN包含两个无参数的即插即用算子,即Spectral Saliency Generator和Spectral Edge Operator,可分别从输入HSI中独立提取Spectral Saliency和Spectral Edge成分。随后,由两个频率相关的头组成的Mixed-frequency Attention模块智能地融合边缘和显著性信息的嵌入特征,得到混合频率特征表示。最后,显著性-边缘感知解码器逐步放大混合频率特征,同时保留丰富的细节和显著性信息,以实现准确的显著目标预测。在HS-SOD基准和我们自建的HSOD-BIT数据集上进行的大量实验表明,我们的SMN在HSOD性能上优于现有最先进方法。代码和数据集将在 https://github.com/laprf/SMN 上提供。
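
The spectral angular distance underlying both cues reduces to the classic spectral angle between two spectra; a minimal version (the band count is an illustrative assumption):

```python
import numpy as np

def spectral_angle(a, b, eps=1e-8):
    """Spectral angle (in radians) between two spectra a and b; the
    paper derives its saliency and edge cues from inter-layer and
    intra-neighborhood spectral angular distances of this kind."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Example: angle between two 31-band pixel spectra.
p, q = np.random.rand(31), np.random.rand(31)
print(spectral_angle(p, q))
```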

Taming Latent Diffusion Models to See in the Dark

  • paper_url: http://arxiv.org/abs/2312.01027
  • repo_url: https://github.com/csqiangwen/Taming_Latent_Diffusion_Models_to_See_in_the_Dark
  • paper_authors: Qiang Wen, Yazhou Xing, Qifeng Chen
  • for: 将低光照噪声RAW图像增强为曝光良好且干净的sRGB图像
  • methods: 利用一个冻结的预训练 diffusion 模型,并在其生成过程中插入一组提议的 taming 模块,以控制生成的结构和细节
  • results: 对比实验显示,提出的方法不仅在定量评价中达到了领先的性能,而且在视觉比较中也显示出了明显的优势。这些结果表明,可以通过利用预训练的 diffusion 模型作为生成先验来增强低光照图像。
    Abstract Enhancing a low-light noisy RAW image into a well-exposed and clean sRGB image is a significant challenge in computational photography. Due to the limitation of large-scale paired data, prior approaches have difficulty in recovering fine details and true colors in extremely low-light regions. Meanwhile, recent advancements in generative diffusion models have shown promising generating capabilities, which inspires this work to explore generative priors from a diffusion model trained on a large-scale open-domain dataset to benefit the low-light image enhancement (LLIE) task. Based on this intention, we propose a novel diffusion-model-based LLIE method, dubbed LDM-SID. LDM-SID aims at inserting a set of proposed taming modules into a frozen pre-trained diffusion model to steer its generating process. Specifically, the taming module fed with low-light information serves to output a pair of affine transformation parameters to modulate the intermediate feature in the diffusion model. Additionally, based on the observation of dedicated generative priors across different portions of the diffusion model, we propose to apply 2D discrete wavelet transforms on the input RAW image, resulting in dividing the LLIE task into two essential parts: low-frequency content generation and high-frequency detail maintenance. This enables us to skillfully tame the diffusion model for optimized structural generation and detail enhancement. Extensive experiments demonstrate the proposed method not only achieves state-of-the-art performance in quantitative evaluations but also shows significant superiority in visual comparisons. These findings highlight the effectiveness of leveraging a pre-trained diffusion model as a generative prior to the LLIE task.
    摘要 将低光照噪声RAW图像增强为曝光良好且干净的sRGB图像是计算摄影中的一大挑战。由于大规模成对数据的限制,先前的方法很难在极低光照区域中恢复细节和真实颜色。同时,生成扩散模型最近的进展展示了可观的生成能力,这启发我们探索利用在大规模开放域数据集上训练的扩散模型的生成先验,来帮助低光照图像增强(LLIE)任务。基于这一意图,我们提出了一种新的基于扩散模型的LLIE方法,称为LDM-SID。LDM-SID的目标是在冻结的预训练扩散模型中插入一组提议的taming模块,以引导其生成过程。具体来说,输入低光信息的taming模块会输出一对仿射变换参数,用于调制扩散模型的中间特征。此外,基于对扩散模型不同部分专有生成先验的观察,我们提出在输入RAW图像上应用2D离散小波变换,从而将LLIE任务分成两个基本部分:低频内容生成和高频细节维持。这使得我们可以巧妙地驾驭扩散模型,以优化结构生成和细节增强。大量实验表明,所提出的方法不仅在定量评价中达到了最优性能,而且在视觉比较中也表现出显著优势。这些发现凸显了将预训练扩散模型用作LLIE任务生成先验的有效性。
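
The low/high-frequency split described in the abstract can be illustrated with a single-level 2D discrete wavelet transform; PyWavelets, the Haar wavelet, and a single level are assumptions for illustration only.

```python
# Split a RAW plane into a low-frequency component (content generation)
# and high-frequency components (detail maintenance), then invert.
import numpy as np
import pywt

raw = np.random.rand(256, 256).astype(np.float32)  # stand-in RAW plane
low, (lh, hl, hh) = pywt.dwt2(raw, "haar")
# `low` approximates structure at half resolution; (lh, hl, hh) carry
# horizontal / vertical / diagonal detail the model must preserve.
recon = pywt.idwt2((low, (lh, hl, hh)), "haar")    # exact inverse
print(np.allclose(recon, raw, atol=1e-6))
```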

Token Fusion: Bridging the Gap between Token Pruning and Token Merging

  • paper_url: http://arxiv.org/abs/2312.01026
  • repo_url: None
  • paper_authors: Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin
  • for: 提高资源受限的边缘设备上的计算效率,使用Token Fusion方法,以提高计算效率和模型准确率。
  • methods: 提出了Token Fusion方法,结合了Token pruning和Token merging两种方法,并使用MLERP merging技术来维持特征分布。
  • results: 实验表明,Token Fusion方法可以在分类和图像生成任务中提高计算效率和模型准确率,并且可以适用于不同的ViT模型。
    Abstract Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs. However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging. Multiple solutions rely on token pruning or token merging. In this paper, we introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging. Token pruning proves advantageous when the model exhibits sensitivity to input interpolations, while token merging is effective when the model manifests close to linear responses to inputs. We combine this to propose a new scheme called Token Fusion. Moreover, we tackle the limitations of average merging, which doesn't preserve the intrinsic feature norm, resulting in distributional shifts. To mitigate this, we introduce MLERP merging, a variant of the SLERP technique, tailored to merge multiple tokens while maintaining the norm distribution. ToFu is versatile, applicable to ViTs with or without additional training. Our empirical evaluations indicate that ToFu establishes new benchmarks in both classification and image generation tasks concerning computational efficiency and model accuracy.
    摘要 视觉Transformer(ViT)已经成为计算机视觉领域的强大骨干,超越了许多传统的卷积神经网络(CNN)。然而,它们的计算开销(主要来自自注意力机制)使其难以部署在资源受限的边缘设备上。多种解决方案依靠token削减或token合并。在这篇论文中,我们提出“Token Fusion”(ToFu),一种结合token削减和token合并优点的方法。当模型对输入插值敏感时,token削减更有利;而当模型对输入表现出接近线性的响应时,token合并更有效。我们将两者结合,提出一种新的方案。此外,我们解决了平均合并的局限:它不能保留token的固有特征范数,从而导致分布偏移。为此,我们提出MLERP合并,一种SLERP技术的变体,可在合并多个token的同时保持范数分布。ToFu用途广泛,可应用于经过或未经过额外训练的ViT。我们的实验评估表明,ToFu在分类和图像生成任务中,在计算效率和模型精度方面都创立了新的基准。
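
For intuition, here is a sketch of SLERP between two tokens and one plausible norm-preserving multi-token merge in the spirit of MLERP; the paper's exact MLERP formulation may differ.

```python
import torch

def slerp(v0, v1, t=0.5, eps=1e-7):
    """Spherical linear interpolation between two token vectors."""
    u0, u1 = v0 / v0.norm(), v1 / v1.norm()
    theta = torch.acos(torch.clamp((u0 * u1).sum(), -1 + eps, 1 - eps))
    w0 = torch.sin((1 - t) * theta) / torch.sin(theta)
    w1 = torch.sin(t * theta) / torch.sin(theta)
    return w0 * v0 + w1 * v1

def norm_preserving_merge(tokens):
    """Merge N tokens (N x D) into one: average the directions, then
    rescale to the mean input norm, so the merged feature does not
    shrink the way plain averaging shrinks it."""
    direction = tokens.mean(dim=0)
    direction = direction / direction.norm()
    return direction * tokens.norm(dim=1).mean()

merged = norm_preserving_merge(torch.randn(4, 192))
print(merged.norm())  # close to the mean norm of the four inputs
```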

Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D

  • paper_url: http://arxiv.org/abs/2312.02190
  • repo_url: None
  • paper_authors: Karran Pandey, Paul Guerrero, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, Niloy Mitra
  • for: 该论文旨在提供一种新方法,使得能够在扩散图像上进行3D对象编辑,而无需任何微调或3D对象检索。
  • methods: 该论文使用现有的预训练扩散模型和2D图像深度估计,实现3D对象编辑。
  • results: 结果显示,Diffusion Handles能够生成高质量、照片级真实感的编辑图像,并保持对象身份。其核心思路是使用代理深度将扩散激活提升到3D,对深度及相关激活进行3D变换,再将其投影回图像空间,从而生成高质量的编辑图像。
    Abstract Diffusion Handles is a novel approach to enabling 3D object edits on diffusion images. We accomplish these edits using existing pre-trained diffusion models, and 2D image depth estimation, without any fine-tuning or 3D object retrieval. The edited results remain plausible, photo-real, and preserve object identity. Diffusion Handles address a critically missing facet of generative image based creative design, and significantly advance the state-of-the-art in generative image editing. Our key insight is to lift diffusion activations for an object to 3D using a proxy depth, 3D-transform the depth and associated activations, and project them back to image space. The diffusion process applied to the manipulated activations with identity control, produces plausible edited images showing complex 3D occlusion and lighting effects. We evaluate Diffusion Handles: quantitatively, on a large synthetic data benchmark; and qualitatively by a user study, showing our output to be more plausible, and better than prior art at both, 3D editing and identity control. Project Webpage: https://diffusionhandles.github.io/
    摘要 Diffusion Handles 是一种新的方法,用于在扩散图像上进行3D对象编辑。我们使用现有的预训练扩散模型以及2D图像深度估计来完成这些编辑,而无需任何微调或3D对象检索。编辑结果依然合理、照片级真实,并保持对象身份。Diffusion Handles 填补了基于生成图像的创意设计中缺失的一个重要方面,并显著推进了生成图像编辑的最新水平。我们的关键思路是使用代理深度将对象的扩散激活提升到3D,对深度及相关激活进行3D变换,然后将其投影回图像空间。对经过操纵的激活应用扩散过程并施加身份控制,便可生成展示复杂3D遮挡和光照效果的合理编辑图像。我们对Diffusion Handles进行了评估:在大型合成数据基准上进行定量评估,并通过用户研究进行定性评估,结果显示我们的输出更为合理,并且在3D编辑和身份控制两方面都优于先前工作。项目网站:https://diffusionhandles.github.io/
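
The lift / 3D-transform / reproject loop can be sketched with a pinhole camera model; the intrinsics, the rigid edit, and the per-pixel proxy depth below are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def lift_transform_project(uv, depth, K, T):
    """Back-project pixels to 3D with a proxy depth, apply a rigid 3D
    edit, and project back to the image.
    uv: (N, 2) pixel coords, depth: (N,), K: 3x3 intrinsics,
    T: 4x4 rigid transform."""
    ones = np.ones((uv.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([uv, ones]).T).T  # (N, 3)
    pts = rays * depth[:, None]                            # lift to 3D
    pts_h = np.hstack([pts, ones]) @ T.T                   # 3D edit
    proj = (K @ pts_h[:, :3].T).T
    return proj[:, :2] / proj[:, 2:3]                      # back to pixels

K = np.array([[500.0, 0, 128], [0, 500.0, 128], [0, 0, 1]])
T = np.eye(4); T[0, 3] = 0.1                               # shift right 10 cm
uv = np.array([[100.0, 120.0], [140.0, 90.0]])
print(lift_transform_project(uv, np.array([2.0, 2.5]), K, T))
```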

Self-Evolving Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2312.01003
  • repo_url: None
  • paper_authors: Jaewoo Jung, Jisang Han, Jiwon Kang, Seongchan Kim, Min-Seop Kwak, Seungryong Kim
  • for: 提高 NeRF 在实际场景中的适用性,增强其在仅有少量视角时的渲染能力。
  • methods: 我们提出了一种新的框架,即 Self-Evolving Neural Radiance Fields (SE-NeRF),它将 NeRF 与自训练框架相结合来解决这些问题。我们将少视角 NeRF 构造成教师-学生框架,通过蒸馏方案用教师生成的伪标签训练学生,使其学习更加稳健和准确的场景表示。
  • results: 我们将该自训练框架应用于现有模型,提高了渲染图像的质量,并在多种设定下实现了最优的表现。
    Abstract Recently, neural radiance field (NeRF) has shown remarkable performance in novel view synthesis and 3D reconstruction. However, it still requires abundant high-quality images, limiting its applicability in real-world scenarios. To overcome this limitation, recent works have focused on training NeRF only with sparse viewpoints by giving additional regularizations, often called few-shot NeRF. We observe that due to the under-constrained nature of the task, solely using additional regularization is not enough to prevent the model from overfitting to sparse viewpoints. In this paper, we propose a novel framework, dubbed Self-Evolving Neural Radiance Fields (SE-NeRF), that applies a self-training framework to NeRF to address these problems. We formulate few-shot NeRF into a teacher-student framework to guide the network to learn a more robust representation of the scene by training the student with additional pseudo labels generated from the teacher. By distilling ray-level pseudo labels using distinct distillation schemes for reliable and unreliable rays obtained with our novel reliability estimation method, we enable NeRF to learn a more accurate and robust geometry of the 3D scene. We show and evaluate that applying our self-training framework to existing models improves the quality of the rendered images and achieves state-of-the-art performance in multiple settings.

Deep Generative Attacks and Countermeasures for Data-Driven Offline Signature Verification

  • paper_url: http://arxiv.org/abs/2312.00987
  • repo_url: None
  • paper_authors: An Ngo, MinhPhuong Cao, Rajesh Kumar
  • for: This paper focuses on the impact of generative attacks on data-driven signature verification (DASV) and proposes practical and interpretable countermeasures.
  • methods: The paper uses two prominent deep generative models (DGMs) - Variational Auto-encoders (VAE) and Conditional Generative Adversarial Networks (CGAN) - to generate signatures that can deceive DASV. The quality of generated images is evaluated using the Structural Similarity Index measure (SSIM).
  • results: The paper shows that VAE-generated signatures increase the False Accept Rates (FARs) of DASV baselines by an average of 10.4%, 10.1%, and 7.5%, while CGAN-generated signatures increase FARs by an average of 32.5%, 30%, and 26.1%. The paper also finds a strong negative correlation between FARs and SSIMs, and demonstrates the effectiveness of retraining DASV baselines with synthetic datasets to improve robustness to generative attacks.
    Abstract While previous studies have explored attacks via random, simple, and skilled forgeries, generative attacks have received limited attention in the data-driven signature verification (DASV) process. Thus, this paper explores the impact of generative attacks on DASV and proposes practical and interpretable countermeasures. We investigate the power of two prominent Deep Generative Models (DGMs), Variational Auto-encoders (VAE) and Conditional Generative Adversarial Networks (CGAN), on their ability to generate signatures that would successfully deceive DASV. Additionally, we evaluate the quality of generated images using the Structural Similarity Index measure (SSIM) and use the same to explain the attack's success. Finally, we propose countermeasures that effectively reduce the impact of deep generative attacks on DASV. We first generated six synthetic datasets from three benchmark offline-signature datasets viz. CEDAR, BHSig260- Bengali, and BHSig260-Hindi using VAE and CGAN. Then, we built baseline DASVs using Xception, ResNet152V2, and DenseNet201. These DASVs achieved average (over the three datasets) False Accept Rates (FARs) of 2.55%, 3.17%, and 1.06%, respectively. Then, we attacked these baselines using the synthetic datasets. The VAE-generated signatures increased average FARs to 10.4%, 10.1%, and 7.5%, while CGAN-generated signatures to 32.5%, 30%, and 26.1%. The variation in the effectiveness of attack for VAE and CGAN was investigated further and explained by a strong (rho = -0.86) negative correlation between FARs and SSIMs. We created another set of synthetic datasets and used the same to retrain the DASVs. The retained baseline showed significant robustness to random, skilled, and generative attacks as the FARs shrank to less than 1% on average. The findings underscore the importance of studying generative attacks and potential countermeasures for DASV.
    摘要 以往的研究已经探索过随机、简单和熟练的伪造攻击,但生成式攻击在数据驱动签名验证(DASV)过程中受到的关注有限。因此,本文探讨生成式攻击对DASV的影响,并提出实用且可解释的对策。我们研究了两种著名的深度生成模型(DGM)——变分自编码器(VAE)和条件生成对抗网络(CGAN)——生成能够成功欺骗DASV的签名的能力。此外,我们使用结构相似度指数(SSIM)评估生成图像的质量,并以此解释攻击成功的原因。最后,我们提出了能够有效降低深度生成攻击对DASV影响的对策。我们首先从三个基准离线签名数据集(CEDAR、BHSig260-Bengali和BHSig260-Hindi)出发,使用VAE和CGAN生成了六个合成数据集。然后,我们使用Xception、ResNet152V2和DenseNet201建立了基线DASV,它们在三个数据集上的平均误接受率(FAR)分别为2.55%、3.17%和1.06%。随后,我们使用合成数据集攻击这些基线:VAE生成的签名将平均FAR提高到10.4%、10.1%和7.5%,而CGAN生成的签名则提高到32.5%、30%和26.1%。我们进一步研究了VAE和CGAN攻击效果的差异,并用FAR与SSIM之间强烈的负相关(rho = -0.86)加以解释。我们还创建了另一组合成数据集,并用其重新训练DASV。重新训练后的基线对随机、熟练和生成式攻击均表现出显著的稳健性,平均FAR降至1%以下。这些结果凸显了研究生成式攻击及其对策对DASV的重要性。
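
A sketch of the two quantities the study correlates: SSIM between signature images (via scikit-image) and the False Accept Rate of a verifier. The threshold-based acceptance rule and all data here are illustrative assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def far(forgery_scores, threshold):
    """False Accept Rate: fraction of forgery scores the verifier
    accepts as genuine."""
    return float((np.asarray(forgery_scores) >= threshold).mean())

# SSIM between a reference signature and a generated one (grayscale,
# values in [0, 1]); the paper reports FAR rising as SSIM falls.
ref = np.random.rand(64, 128)
gen = np.clip(ref + 0.05 * np.random.randn(64, 128), 0, 1)
print(ssim(ref, gen, data_range=1.0), far([0.2, 0.7, 0.9], 0.5))
```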

cs.AI - 2023-12-02

A Multifidelity Sim-to-Real Pipeline for Verifiable and Compositional Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.01249
  • repo_url: None
  • paper_authors: Cyrus Neary, Christian Ellis, Aryaman Singh Samyal, Craig Lennon, Ufuk Topcu
  • for: 这篇论文是为了提出和实现一个基于多级准确性的人工智能训练和验证框架,以便在物理硬件上部署可靠和适应的人工智能策略。
  • methods: 这个框架将复杂的机器人任务分解为多个子任务,并定义这些子任务之间的数学接口,从而可以独立地训练和测试各子任务策略,同时对它们组合后的整体行为提供保证。
  • results: 在一个实验案例中,该框架成功地训练并部署了一个组合式强化学习系统,使其能够驾驶一台 Warthog 无人地面机器人。
    Abstract We propose and demonstrate a compositional framework for training and verifying reinforcement learning (RL) systems within a multifidelity sim-to-real pipeline, in order to deploy reliable and adaptable RL policies on physical hardware. By decomposing complex robotic tasks into component subtasks and defining mathematical interfaces between them, the framework allows for the independent training and testing of the corresponding subtask policies, while simultaneously providing guarantees on the overall behavior that results from their composition. By verifying the performance of these subtask policies using a multifidelity simulation pipeline, the framework not only allows for efficient RL training, but also for a refinement of the subtasks and their interfaces in response to challenges arising from discrepancies between simulation and reality. In an experimental case study we apply the framework to train and deploy a compositional RL system that successfully pilots a Warthog unmanned ground robot.
    摘要 我们提出并演示了一个组合式框架,用于在多保真度的仿真到现实(sim-to-real)管道中训练和验证强化学习(RL)系统,以便在物理硬件上部署可靠且可适应的RL策略。通过将复杂的机器人任务分解为若干组成子任务,并定义它们之间的数学接口,该框架允许独立地训练和测试相应的子任务策略,同时对它们组合后的整体行为提供保证。通过在多保真度仿真管道中验证这些子任务策略的性能,该框架不仅支持高效的RL训练,还能在仿真与现实之间出现差异时,对子任务及其接口进行相应的改进。在一个实验案例中,我们应用该框架训练并部署了一个组合式RL系统,成功地驾驶了一台 Warthog 无人地面机器人。

Axiomatic Preference Modeling for Longform Question Answering

  • paper_url: http://arxiv.org/abs/2312.02206
  • repo_url: None
  • paper_authors: Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett
  • for: 这个研究的目的是提高大语言模型(LLM)如GPT-4的性能,特别是通过人类反馈学习(RLHF)来提高模型的满意度。
  • methods: 这个研究使用了一种名为“axioms”的概念,将人类的偏好编码到一个奖励模型中,以便更好地理解人类的偏好。然后,他们开发了一个axioms的框架,用于生成特定原则的偏好信号,并使用这些信号来训练一个对长答案进行评分的模型。
  • results: 研究发现,使用这种axioms的框架和训练数据可以让一个小型模型(只有220M参数)的性能高于GPT-4,并且可以在人类和LLM生成的答案中进行同等评分。此外,研究还发现,只需要少量的axioms信号,小型模型就可以超越GPT-4的性能。
    Abstract The remarkable abilities of large language models (LLMs) like GPT-4 partially stem from post-training processes like Reinforcement Learning from Human Feedback (RLHF) involving human preferences encoded in a reward model. However, these reward models (RMs) often lack direct knowledge of why, or under what principles, the preferences annotations were made. In this study, we identify principles that guide RMs to better align with human preferences, and then develop an axiomatic framework to generate a rich variety of preference signals to uphold them. We use these axiomatic signals to train a model for scoring answers to longform questions. Our approach yields a Preference Model with only about 220M parameters that agrees with gold human-annotated preference labels more often than GPT-4. The contributions of this work include: training a standalone preference model that can score human- and LLM-generated answers on the same scale; developing an axiomatic framework for generating training data pairs tailored to certain principles; and showing that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring. We release our model on huggingface: https://huggingface.co/corbyrosset/axiomatic_preference_model
    摘要 大型语言模型(LLM)如GPT-4的出色能力部分归功于训练后的过程,如人类反馈学习(RLHF),其中人类偏好被编码在奖励模型(RM)中。然而,这些奖励模型通常缺乏直接知道何时、以何原则而作出偏好标注的直觉。在这项研究中,我们发现了导引RM更好听从人类偏好的原则,然后开发了一个axioms的框架,以生成一种多样化的偏好信号,以保持这些原则。我们使用这些axioms信号来训练一个用于评分长问答案的模型。我们的方法得到了一个名为“axioms preference model”的模型,它只有约220M参数,能够更好地与人类标注的偏好标签相匹配,并且在训练时与GPT-4相比,能够更好地评分人类和LLM生成的答案。我们的贡献包括:训练一个独立的偏好模型,可以评分人类和LLM生成的答案在同一个标准下;开发一个axioms框架,用于生成适合特定原则的训练数据对;以及证明一小amount的axioms信号可以帮助小型模型超越GPT-4在偏好评分中。我们将我们的模型上传到huggingface:https://huggingface.co/corbyrosset/axiomatic_preference_model。
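
A reward model trained on such axiom-generated preference pairs typically uses a pairwise Bradley-Terry style objective; the sketch below assumes a scalar-scoring model and is not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, better, worse):
    """Pairwise objective: the score of the answer preferred under an
    axiom should exceed the score of the dispreferred one.
    `better`/`worse` are batches of encoded (question, answer) pairs and
    `reward_model` returns one scalar per pair; names are illustrative."""
    s_pos = reward_model(better)          # (B,) scalar scores
    s_neg = reward_model(worse)
    # -log sigmoid(s_pos - s_neg): minimized when preferred answers win.
    return -F.logsigmoid(s_pos - s_neg).mean()
```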

DDxT: Deep Generative Transformer Models for Differential Diagnosis

  • paper_url: http://arxiv.org/abs/2312.01242
  • repo_url: None
  • paper_authors: Mohammad Mahmudul Alam, Edward Raff, Tim Oates, Cynthia Matuszek
  • for: 这项研究旨在实现自动化的鉴别诊断(DDx)过程,即基于证据通过排除法从大量病理中筛选出最有可能的疾病。
  • methods: 本研究使用基于Transformer架构的生成网络DDxT,以自回归方式生成可能的疾病列表(DDx),并通过神经网络预测实际疾病。
  • results: 实验结果显示,DDxT在鉴别诊断上取得了99.82%的平均准确率和0.9472的平均F1分数,表现出色。此外,在预测真实疾病时,平均准确率达99.98%,平均F1分数为0.9949。DDxT大幅超越了以往基于强化学习(RL)的方法。
    Abstract Differential Diagnosis (DDx) is the process of identifying the most likely medical condition among the possible pathologies through the process of elimination based on evidence. An automated process that narrows a large set of pathologies down to the most likely pathologies will be of great importance. The primary prior works have relied on the Reinforcement Learning (RL) paradigm under the intuition that it aligns better with how physicians perform DDx. In this paper, we show that a generative approach trained with simpler supervised and self-supervised learning signals can achieve superior results on the current benchmark. The proposed Transformer-based generative network, named DDxT, autoregressively produces a set of possible pathologies, i.e., DDx, and predicts the actual pathology using a neural network. Experiments are performed using the DDXPlus dataset. In the case of DDx, the proposed network has achieved a mean accuracy of 99.82% and a mean F1 score of 0.9472. Additionally, mean accuracy reaches 99.98% with a mean F1 score of 0.9949 while predicting ground truth pathology. The proposed DDxT outperformed the previous RL-based approaches by a big margin. Overall, the automated Transformer-based DDx generative model has the potential to become a useful tool for a physician in times of urgency.

Just-in-Time Security Patch Detection – LLM At the Rescue for Data Augmentation

  • paper_url: http://arxiv.org/abs/2312.01241
  • repo_url: None
  • paper_authors: Xunzhu Tang, Zhenghan Chen, Kisub Kim, Haoye Tian, Saad Ezzini, Jacques Klein
  • for: 警示开源软件中存在增长的漏洞,需要有效地识别安全补丁。
  • methods: 我们提出了一种新的安全补丁检测系统,即 LLMDA,利用大语言模型(LLM)和代码文本对齐方法。
  • results: LLMDA在检测安全补丁方面表现出色,质量明显超过了现有技术。
    Abstract In the face of growing vulnerabilities found in open-source software, the need to identify {discreet} security patches has become paramount. The lack of consistency in how software providers handle maintenance often leads to the release of security patches without comprehensive advisories, leaving users vulnerable to unaddressed security risks. To address this pressing issue, we introduce a novel security patch detection system, LLMDA, which capitalizes on Large Language Models (LLMs) and code-text alignment methodologies for patch review, data enhancement, and feature combination. Within LLMDA, we initially utilize LLMs for examining patches and expanding data of PatchDB and SPI-DB, two security patch datasets from recent literature. We then use labeled instructions to direct our LLMDA, differentiating patches based on security relevance. Following this, we apply a PTFormer to merge patches with code, formulating hybrid attributes that encompass both the innate details and the interconnections between the patches and the code. This distinctive combination method allows our system to capture more insights from the combined context of patches and code, hence improving detection precision. Finally, we devise a probabilistic batch contrastive learning mechanism within batches to augment the capability of the our LLMDA in discerning security patches. The results reveal that LLMDA significantly surpasses the start of the art techniques in detecting security patches, underscoring its promise in fortifying software maintenance.
    摘要 面对开源软件中不断增长的漏洞,识别特定的安全补丁变得至关重要。软件提供商在维护处理上的不一致,常常导致安全补丁在没有全面公告的情况下发布,使用户暴露于未解决的安全风险之中。为解决这一紧迫问题,我们提出了一种新的安全补丁检测系统LLMDA,它利用大语言模型(LLM)和代码-文本对齐方法进行补丁审查、数据增强和特征组合。在LLMDA中,我们首先使用LLM来检查补丁,并扩充近期文献中的两个安全补丁数据集PatchDB和SPI-DB。然后,我们使用带标注的指令来引导LLMDA,根据安全相关性区分补丁。接着,我们使用PTFormer将补丁与代码合并,构造同时包含补丁与代码自身细节及其相互联系的混合特征。这种独特的组合方法使我们的系统能够从补丁与代码的联合上下文中获取更多的洞察,从而提高检测精度。最后,我们在批内引入概率批对比学习机制,以增强LLMDA辨别安全补丁的能力。结果显示,LLMDA显著超越了现有最优技术的检测水平,展示了其在加强软件维护方面的前景。

A Comprehensive Study of Vision Transformers in Image Classification Tasks

  • paper_url: http://arxiv.org/abs/2312.01232
  • repo_url: None
  • paper_authors: Mahmoud Khalil, Ahmad Khalil, Alioune Ngom
  • for: 本文主要探讨了图像分类领域中的视transformer模型,它们如何用于图像分类任务,以及这些模型的优劣点。
  • methods: 本文首先介绍了影响模型设计的现有图像分类数据集,随后按时间顺序介绍基于Vision Transformer的图像分类方法:从将注意力机制引入视觉任务的早期尝试,到采用Vision Transformer来捕捉图像中的复杂模式和长距离依赖关系。
  • results: 本文对图像分类领域的现有论文进行了全面的梳理和分析,并讨论了尚未解决的问题与未来的研究机会。
    Abstract Image Classification is a fundamental task in the field of computer vision that frequently serves as a benchmark for gauging advancements in Computer Vision. Over the past few years, significant progress has been made in image classification due to the emergence of deep learning. However, challenges still exist, such as modeling fine-grained visual information, high computation costs, the parallelism of the model, and inconsistent evaluation protocols across datasets. In this paper, we conduct a comprehensive survey of existing papers on Vision Transformers for image classification. We first introduce the popular image classification datasets that influenced the design of models. Then, we present Vision Transformers models in chronological order, starting with early attempts at adapting attention mechanism to vision tasks followed by the adoption of vision transformers, as they have demonstrated success in capturing intricate patterns and long-range dependencies within images. Finally, we discuss open problems and shed light on opportunities for image classification to facilitate new research ideas.
    摘要 图像分类是计算机视觉领域的一项基本任务,常被用来衡量计算机视觉的进展。过去几年,得益于深度学习的出现,图像分类取得了显著进步。然而,仍然存在一些挑战,如细粒度视觉信息的建模、高计算成本、模型的并行性,以及各数据集之间评价协议的不一致。在这篇论文中,我们对现有的用于图像分类的Vision Transformer论文进行了全面的综述。我们首先介绍了影响模型设计的常用图像分类数据集,然后按时间顺序介绍Vision Transformer模型:从早期将注意力机制适配到视觉任务的尝试,到视觉Transformer的全面采用,因为它们已被证明能够捕捉图像中的复杂模式和长距离依赖关系。最后,我们讨论了尚未解决的问题,并指出图像分类领域的机会,以促进新的研究思路。
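
For readers new to the architecture family surveyed here, a minimal ViT skeleton (patch embedding, class token, Transformer encoder) looks like the sketch below; all sizes are illustrative, not any particular published configuration.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT: non-overlapping patch embedding, a learnable class
    token and positional embedding, then standard encoder blocks."""
    def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3,
                 num_classes=1000):
        super().__init__()
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, 3, H, W)
        t = self.embed(x).flatten(2).transpose(1, 2)       # (B, N, dim)
        t = torch.cat([self.cls.expand(len(t), -1, -1), t], 1) + self.pos
        return self.head(self.encoder(t)[:, 0])  # class-token logits

print(TinyViT()(torch.randn(1, 3, 224, 224)).shape)  # (1, 1000)
```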

Recent Advances in Scalable Energy-Efficient and Trustworthy Spiking Neural networks: from Algorithms to Technology

  • paper_url: http://arxiv.org/abs/2312.01213
  • repo_url: None
  • paper_authors: Souvik Kundu, Rui-Jie Zhu, Akhilesh Jaiswal, Peter A. Beerel
  • for: 这篇论文旨在综述 neuromorphic computing 和脉冲神经网络(SNN)技术的最新进展,以及如何将其用于各种感知应用。
  • methods: 论文总结了一系列用于高效训练和扩展低延迟、高能效 SNN 模型的算法与优化技术,以支持复杂的机器学习应用。
  • results: 论文描述了近期算法与硬件架构协同设计的探索,以及如何利用这些技术实现高能效、低延迟的 SNN 系统。
    Abstract Neuromorphic computing and, in particular, spiking neural networks (SNNs) have become an attractive alternative to deep neural networks for a broad range of signal processing applications, processing static and/or temporal inputs from different sensory modalities, including audio and vision sensors. In this paper, we start with a description of recent advances in algorithmic and optimization innovations to efficiently train and scale low-latency, and energy-efficient spiking neural networks (SNNs) for complex machine learning applications. We then discuss the recent efforts in algorithm-architecture co-design that explores the inherent trade-offs between achieving high energy-efficiency and low latency while still providing high accuracy and trustworthiness. We then describe the underlying hardware that has been developed to leverage such algorithmic innovations in an efficient way. In particular, we describe a hybrid method to integrate significant portions of the model's computation within both memory components as well as the sensor itself. Finally, we discuss the potential path forward for research in building deployable SNN systems identifying key challenges in the algorithm-hardware-application co-design space with an emphasis on trustworthiness.
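
The basic computational unit behind the SNNs surveyed above is the leaky integrate-and-fire neuron; a minimal discrete-time simulation (time constant and threshold are illustrative):

```python
import numpy as np

def lif_neuron(inputs, tau=0.9, v_th=1.0):
    """Discrete-time leaky integrate-and-fire neuron: the membrane
    potential leaks by `tau` per step, integrates input current, and
    emits a binary spike (then resets) when it crosses `v_th`."""
    v, spikes = 0.0, []
    for i in inputs:
        v = tau * v + i                 # leak + integrate
        s = float(v >= v_th)            # fire
        v = v * (1.0 - s)               # hard reset after a spike
        spikes.append(s)
    return np.array(spikes)

print(lif_neuron(np.array([0.3, 0.4, 0.5, 0.1, 0.9])))
```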

An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets

  • paper_url: http://arxiv.org/abs/2312.02200
  • repo_url: None
  • paper_authors: Maya Srikanth, Jeremy Irvin, Brian Wesley Hill, Felipe Godoy, Ishan Sabane, Andrew Y. Ng
  • for: 这篇论文主要针对的是如何自动检测和修复计算机视觉数据中的错误标签。
  • methods: 本论文使用了多种现有的自动检测错误标签方法,并提出了一种新的简单有效的检测方法(SEMD)。
  • results: 实验结果表明,SEMD方法的性能与先前的自动检测方法相当或更优;将其应用于真实计算机视觉数据集并在清洗后的数据上重新训练,可提升模型表现,在较小数据规模下每类性能最高提升8%。
    Abstract Major advancements in computer vision can primarily be attributed to the use of labeled datasets. However, acquiring labels for datasets often results in errors which can harm model performance. Recent works have proposed methods to automatically identify mislabeled images, but developing strategies to effectively implement them in real world datasets has been sparsely explored. Towards improved data-centric methods for cleaning real world vision datasets, we first conduct more than 200 experiments carefully benchmarking recently developed automated mislabel detection methods on multiple datasets under a variety of synthetic and real noise settings with varying noise levels. We compare these methods to a Simple and Efficient Mislabel Detector (SEMD) that we craft, and find that SEMD performs similarly to or outperforms prior mislabel detection approaches. We then apply SEMD to multiple real world computer vision datasets and test how dataset size, mislabel removal strategy, and mislabel removal amount further affect model performance after retraining on the cleaned data. With careful design of the approach, we find that mislabel removal leads per-class performance improvements of up to 8% of a retrained classifier in smaller data regimes.
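
A common baseline for automated mislabel detection (not the paper's SEMD, whose details are not given here) flags samples whose held-out predicted probability for their own label is low:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def flag_mislabels(X, y, frac=0.05):
    """Rank samples by the held-out probability assigned to their own
    label and flag the lowest fraction as suspected mislabels."""
    probs = cross_val_predict(RandomForestClassifier(n_estimators=100),
                              X, y, cv=5, method="predict_proba")
    conf = probs[np.arange(len(y)), y]         # p(given label | x)
    k = max(1, int(frac * len(y)))
    return np.argsort(conf)[:k]                # indices to review/remove

X, y = np.random.rand(500, 8), np.random.randint(0, 3, 500)
print(flag_mislabels(X, y)[:10])
```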

USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery

  • paper_url: http://arxiv.org/abs/2312.02199
  • repo_url: https://github.com/stanfordmlgroup/usat
  • paper_authors: Jeremy Irvin, Lucas Tao, Joanne Zhou, Yuntao Ma, Langston Nashold, Benjamin Liu, Andrew Y. Ng
  • for: The paper is written for developing a new encoder architecture called USat for self-supervised pre-training on remote sensing data.
  • methods: The paper uses a vision transformer with modified patch projection layers and positional encodings to model spectral bands with varying spatial scales from multiple sensors.
  • results: The pre-trained USat model outperforms state-of-the-art self-supervised MAE models trained on remote sensing data on multiple remote sensing benchmark datasets (by up to 8%) and leads to improvements in low data regimes (up to 7%).
    Abstract Large, self-supervised vision models have led to substantial advancements for automatically interpreting natural images. Recent works have begun tailoring these methods to remote sensing data which has rich structure with multi-sensor, multi-spectral, and temporal information providing massive amounts of self-labeled data that can be used for self-supervised pre-training. In this work, we develop a new encoder architecture called USat that can input multi-spectral data from multiple sensors for self-supervised pre-training. USat is a vision transformer with modified patch projection layers and positional encodings to model spectral bands with varying spatial scales from multiple sensors. We integrate USat into a Masked Autoencoder (MAE) self-supervised pre-training procedure and find that a pre-trained USat outperforms state-of-the-art self-supervised MAE models trained on remote sensing data on multiple remote sensing benchmark datasets (up to 8%) and leads to improvements in low data regimes (up to 7%). Code and pre-trained weights are available at https://github.com/stanfordmlgroup/USat .
    摘要 大型自监督视觉模型已经为自然图像的自动理解带来了重要进步。最近的工作开始将这些方法应用于遥感数据,这类数据具有多传感器、多光谱和时间信息等丰富结构,可提供大量自标注数据用于自监督预训练。在这项工作中,我们开发了一种新的编码器架构USat,它可以输入来自多个传感器的多光谱数据进行自监督预训练。USat是一种视觉Transformer,其中修改了patch projection层和位置编码,以建模来自多个传感器、具有不同空间尺度的光谱波段。我们将USat集成到Masked Autoencoder(MAE)自监督预训练流程中,发现预训练的USat在多个遥感基准数据集上超过了在遥感数据上训练的最先进自监督MAE模型(最多提升8%),并在低数据情况下带来改进(最多7%)。代码和预训练权重可以在 https://github.com/stanfordmlgroup/USat 上获取。
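
The MAE pre-training that USat plugs into relies on random masking of patch tokens; a minimal sketch (the mask ratio and shapes are illustrative, not USat's configuration):

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """MAE-style masking: keep a random subset of patch tokens and
    return the kept tokens plus the indices needed to restore order.
    tokens: (B, N, D)."""
    b, n, d = tokens.shape
    keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)
    shuffle = noise.argsort(dim=1)              # random permutation
    restore = shuffle.argsort(dim=1)            # undoes the permutation
    kept = torch.gather(tokens, 1,
                        shuffle[:, :keep, None].expand(-1, -1, d))
    return kept, restore                        # encoder sees only `kept`

kept, restore = random_masking(torch.randn(2, 196, 768))
print(kept.shape)  # (2, 49, 768) at the default 75% mask ratio
```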

Harnessing Discrete Representations For Continual Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.01203
  • repo_url: None
  • paper_authors: Edan Meyer, Adam White, Marlos C. Machado
  • for: 这篇论文主要研究了在 reinforcement learning 中使用 vector-based categorical representations 的优势。
  • methods: 该论文使用了 Empirical investigation 来评估 vector-based categorical representations 在 reinforcement learning 中的效果,并在 world-model learning、model-free RL 和 continual RL 问题上进行了评估。
  • results: 研究发现,使用 vector-based categorical representations 学习的世界模型能以更少的容量更准确地建模世界,并且智能体能用更少的数据学习更好的策略。在持续强化学习(continual RL)问题上,这些优势转化为更快的适应能力。此外,分析表明,这些性能提升可能源于潜在向量所含的信息,以及离散表示本身的编码方式。
    Abstract Reinforcement learning (RL) agents make decisions using nothing but observations from the environment, and consequently, heavily rely on the representations of those observations. Though some recent breakthroughs have used vector-based categorical representations of observations, often referred to as discrete representations, there is little work explicitly assessing the significance of such a choice. In this work, we provide a thorough empirical investigation of the advantages of representing observations as vectors of categorical values within the context of reinforcement learning. We perform evaluations on world-model learning, model-free RL, and ultimately continual RL problems, where the benefits best align with the needs of the problem setting. We find that, when compared to traditional continuous representations, world models learned over discrete representations accurately model more of the world with less capacity, and that agents trained with discrete representations learn better policies with less data. In the context of continual RL, these benefits translate into faster adapting agents. Additionally, our analysis suggests that the observed performance improvements can be attributed to the information contained within the latent vectors and potentially the encoding of the discrete representation itself.
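
One common way to obtain vectors of categorical values from continuous features is nearest-codebook quantization with a straight-through gradient; this generic sketch is not necessarily the construction used in the paper.

```python
import torch

def quantize(z, codebook):
    """Map each continuous feature vector to the index of its nearest
    codebook entry, yielding a categorical code per vector; the
    straight-through estimator keeps the encoder trainable.
    z: (B, D), codebook: (K, D)."""
    d = torch.cdist(z, codebook)                # (B, K) distances
    idx = d.argmin(dim=1)                       # categorical codes
    zq = codebook[idx]                          # quantized vectors
    zq = z + (zq - z).detach()                  # straight-through
    return idx, zq

codes, zq = quantize(torch.randn(4, 16), torch.randn(32, 16))
print(codes)
```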

From Voices to Validity: Leveraging Large Language Models (LLMs) for Textual Analysis of Policy Stakeholder Interviews

  • paper_url: http://arxiv.org/abs/2312.01202
  • repo_url: None
  • paper_authors: Alex Liu, Min Sun
  • for: This paper aims to enhance text analysis of stakeholder interviews regarding K-12 education policy within one U.S. state by integrating Large Language Models (LLMs) with human expertise.
  • methods: The study employs a mixed-methods approach that combines human expertise and GPT-4 analysis to perform thematic and sentiment analysis of stakeholder interviews.
  • results: The results show that GPT-4 thematic coding aligns with human coding by 77.89% at specific themes, and expanding to broader themes increases congruence to 96.02%. GPT-4 also matches expert sentiment analysis more closely than lexicon-based methods. The combined human-computer method enhances the efficiency, validity, and interpretability of educational policy research.
    Abstract Obtaining stakeholders' diverse experiences and opinions about current policy in a timely manner is crucial for policymakers to identify strengths and gaps in resource allocation, thereby supporting effective policy design and implementation. However, manually coding even moderately sized interview texts or open-ended survey responses from stakeholders can often be labor-intensive and time-consuming. This study explores the integration of Large Language Models (LLMs)--like GPT-4--with human expertise to enhance text analysis of stakeholder interviews regarding K-12 education policy within one U.S. state. Employing a mixed-methods approach, human experts developed a codebook and coding processes as informed by domain knowledge and unsupervised topic modeling results. They then designed prompts to guide GPT-4 analysis and iteratively evaluate different prompts' performances. This combined human-computer method enabled nuanced thematic and sentiment analysis. Results reveal that while GPT-4 thematic coding aligned with human coding by 77.89% at specific themes, expanding to broader themes increased congruence to 96.02%, surpassing traditional Natural Language Processing (NLP) methods by over 25%. Additionally, GPT-4 is more closely matched to expert sentiment analysis than lexicon-based methods. Findings from quantitative measures and qualitative reviews underscore the complementary roles of human domain expertise and automated analysis as LLMs offer new perspectives and coding consistency. The human-computer interactive approach enhances efficiency, validity, and interpretability of educational policy research.
    摘要 政策制定者需要及时获取利益相关者对现行政策的多样化经验和意见,以便识别资源分配中的优势与不足,从而支持有效的政策设计和实施。然而,即便是规模适中的访谈文本或开放式问卷回答,人工编码往往也是劳动密集且耗时的。本研究探讨了将大语言模型(LLM,如GPT-4)与人类专业知识相结合,以增强对美国某州K-12教育政策利益相关者访谈文本的分析。研究采用混合方法:人类专家根据领域知识和无监督主题建模结果编制编码手册和编码流程,然后设计提示来引导GPT-4分析,并迭代评估不同提示的表现。这种人机结合的方法实现了细致的主题与情感分析。结果显示,GPT-4的主题编码在具体主题层面与人类编码的一致率为77.89%;扩展到更宽泛的主题后,一致率提高到96.02%,比传统自然语言处理(NLP)方法高出25%以上。此外,GPT-4的情感分析比基于词典的方法更接近专家判断。定量指标和定性审查的结果共同凸显了人类领域专业知识与自动化分析的互补作用,LLM提供了新的视角和编码一致性。这种人机交互方法提升了教育政策研究的效率、有效性和可解释性。

PAC Privacy Preserving Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.01201
  • repo_url: None
  • paper_authors: Qipan Xu, Youlong Ding, Jie Gao, Hao Wang
  • for: 这个论文主要是为了提高资料隐私保证,特别是在对特定资料属性进行隐私化方面,现有模型通常存在一些挑战。
  • methods: 本论文提出PAC隐私保护扩散模型(Diffusion Models,DMs),将扩散原理与PAC隐私保证相结合,并在Langevin采样过程中加入私有分类器引导,以提升隐私保护水平。
  • results: 根据新提出的隐私度量指标,本论文的模型在隐私保护方面表现出色,优于现有领先的私有生成模型。
    Abstract Data privacy protection is garnering increased attention among researchers. Diffusion models (DMs), particularly with strict differential privacy, can potentially produce images with both high privacy and visual quality. However, challenges arise in ensuring robust protection in privatizing specific data attributes, areas where current models often fall short. To address these challenges, we introduce the PAC Privacy Preserving Diffusion Model, a model leverages diffusion principles and ensure Probably Approximately Correct (PAC) privacy. We enhance privacy protection by integrating a private classifier guidance into the Langevin Sampling Process. Additionally, recognizing the gap in measuring the privacy of models, we have developed a novel metric to gauge privacy levels. Our model, assessed with this new metric and supported by Gaussian matrix computations for the PAC bound, has shown superior performance in privacy protection over existing leading private generative models according to benchmark tests.
    摘要 数据隐私保护正受到研究者越来越多的关注。扩散模型(DM),特别是在严格差分隐私下,有潜力生成兼具高隐私性和视觉质量的图像。然而,在对特定数据属性进行隐私化时,确保稳健的保护仍存在挑战,现有模型在这些方面往往表现不足。为解决这些挑战,我们提出PAC隐私保护扩散模型,该模型利用扩散原理并确保Probably Approximately Correct(PAC)隐私。我们通过在Langevin采样过程中加入私有分类器引导来增强隐私保护。此外,鉴于目前缺乏衡量模型隐私水平的方法,我们开发了一种新的隐私度量指标。在该新指标的评估下,并辅以用于PAC界的高斯矩阵计算,我们的模型在基准测试中展现出优于现有领先私有生成模型的隐私保护性能。

A ripple in time: a discontinuity in American history

  • paper_url: http://arxiv.org/abs/2312.01185
  • repo_url: https://github.com/sashakolpakov/ripple_in_time
  • paper_authors: Alexander Kolpakov, Igor Rivin
  • for: The paper uses the State of the Union Address dataset from Kaggle to make observations about the general timeline of American history and the character of the addresses.
  • methods: The paper uses vector embeddings, such as BERT (DistilBERT) and GPT-2, and nonlinear dimension reduction methods such as UMAP to analyze the addresses.
  • results: The paper finds that GPT-2 + UMAP provides better separation and stronger clustering than BERT, and that a fine-tuned DistilBERT model achieves high accuracy (93%-95%) for detecting which president delivered which address.
    Abstract In this note we use the State of the Union Address dataset from Kaggle to make some surprising (and some not so surprising) observations pertaining to the general timeline of American history, and the character and nature of the addresses themselves. Our main approach is using vector embeddings, such as BERT (DistilBERT) and GPT-2. While it is widely believed that BERT (and its variations) is most suitable for NLP classification tasks, we find out that GPT-2 in conjunction with nonlinear dimension reduction methods such as UMAP provide better separation and stronger clustering. This makes GPT-2 + UMAP an interesting alternative. In our case, no model fine-tuning is required, and the pre-trained out-of-the-box GPT-2 model is enough. We also used a fine-tuned DistilBERT model for classification (detecting which president delivered which address), with very good results (accuracy 93% - 95% depending on the run). All computations can be replicated by using the accompanying code on GitHub.
    摘要 在这份笔记中,我们使用 Kaggle 上的国情咨文(State of the Union Address)数据集,对美国历史的总体时间线以及咨文本身的性质与特点做出一些出人意料(以及一些不那么出人意料)的观察。我们的主要方法是使用向量嵌入,如 BERT(DistilBERT)和 GPT-2。虽然人们普遍认为 BERT(及其变体)最适合 NLP 分类任务,但我们发现 GPT-2 与 UMAP 等非线性降维方法结合后能提供更好的分离和更强的聚类,这使 GPT-2 + UMAP 成为一个有趣的替代方案。在我们的案例中,无需任何模型微调,开箱即用的预训练 GPT-2 模型即已足够。我们还使用微调后的 DistilBERT 模型进行分类(判断某篇咨文出自哪位总统),取得了很好的结果(准确率视运行情况在 93% - 95% 之间)。所有计算都可以通过 GitHub 上附带的代码复现。
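A minimal sketch of the embedding-plus-projection pipeline used above: mean-pool GPT-2's last hidden states as a document vector, then project with UMAP. The model choice, pooling, and UMAP settings are illustrative defaults, not necessarily the authors' exact configuration.

```python
import torch
import umap  # pip install umap-learn
from transformers import GPT2Model, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled last-hidden-state embedding of one document."""
    ids = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=1024)
    with torch.no_grad():
        hidden = model(**ids).last_hidden_state  # (1, seq, 768)
    return hidden.mean(dim=1).squeeze(0)

addresses = ["Fellow citizens ...", "The state of our union ...",
             "To the Congress ..."]  # stand-ins for the Kaggle SOTU corpus
vectors = torch.stack([embed(a) for a in addresses]).numpy()

# Nonlinear 2D projection; n_neighbors is tiny only because this toy
# list has 3 documents -- the default (15) suits a full corpus.
coords = umap.UMAP(n_components=2, n_neighbors=2,
                   metric="cosine").fit_transform(vectors)
print(coords.shape)  # (n_documents, 2)
```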

Kattis vs. ChatGPT: Assessment and Evaluation of Programming Tasks in the Age of Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2312.01109
  • repo_url: None
  • paper_authors: Nora Dunder, Saga Lundborg, Olga Viberg, Jacqueline Wong
  • for: 这项研究旨在探讨 ChatGPT 能否解决初级编程课程中的编程任务。
  • methods: 研究使用 Kattis 自动评测工具提供的 127 个随机选取的编程问题,让 ChatGPT 独立求解以测试其能力。
  • results: ChatGPT 能独立解决其中 19 个编程任务,可为简单问题生成准确代码,但在更复杂的编程任务上遇到困难。该结果为 AI 技术在编程教育中的效用之争提供了依据。
    Abstract AI-powered education technologies can support students and teachers in computer science education. However, with the recent developments in generative AI, and especially the increasingly emerging popularity of ChatGPT, the effectiveness of using large language models for solving programming tasks has been underexplored. The present study examines ChatGPT's ability to generate code solutions at different difficulty levels for introductory programming courses. We conducted an experiment where ChatGPT was tested on 127 randomly selected programming problems provided by Kattis, an automatic software grading tool for computer science programs, often used in higher education. The results showed that ChatGPT independently could solve 19 out of 127 programming tasks generated and assessed by Kattis. Further, ChatGPT was found to be able to generate accurate code solutions for simple problems but encountered difficulties with more complex programming tasks. The results contribute to the ongoing debate on the utility of AI-powered tools in programming education.
    摘要 人工智能教育技术可以在计算机科学教育中为学生和教师提供支持。然而,随着生成式 AI 的最新发展,尤其是 ChatGPT 日益流行,大型语言模型解决编程任务的有效性尚未得到充分探讨。本研究考察了 ChatGPT 在入门编程课程中生成不同难度代码解答的能力。我们进行了一项实验:从 Kattis(一种常用于高等教育的计算机程序自动评分工具)提供的题目中随机选取 127 个编程问题对 ChatGPT 进行测试。结果表明,ChatGPT 能独立解决由 Kattis 出题并评测的 127 个编程任务中的 19 个。此外,ChatGPT 能为简单问题生成准确的代码解答,但在更复杂的编程任务上遇到困难。这些结果为关于 AI 工具在编程教育中效用的持续讨论提供了依据。

Self Generated Wargame AI: Double Layer Agent Task Planning Based on Large Language Model

  • paper_url: http://arxiv.org/abs/2312.01090
  • repo_url: None
  • paper_authors: Y. Sun, C. Yu, J. Zhao, W. Wang, X. Zhou
  • for: 这篇论文聚焦于将大语言模型应用于智能决策领域,并探索以大语言模型为核心的智能体架构。
  • methods: 该论文以大语言模型为智能体架构的核心,提出一种双层智能体任务规划,通过自然语言交互下达并执行决策指令。
  • results: 通过兵棋推演对抗实验,论文发现大语言模型的智能决策能力显著强于常用的强化学习 AI 和规则 AI,在智能性、可理解性和泛化性方面均更优;实验还表明大语言模型的智能表现与提示词密切相关。
    Abstract The big language model represented by ChatGPT has had a disruptive impact on the field of artificial intelligence. But it mainly focuses on Natural language processing, speech recognition, machine learning and natural-language understanding. This paper innovatively applies the big language model to the field of intelligent decision-making, places the big language model in the decision-making center, and constructs an agent architecture with the big language model as the core. Based on this, it further proposes a two-layer agent task planning, issues and executes decision commands through the interaction of natural language, and carries out simulation verification through the wargame simulation environment. Through the game confrontation simulation experiment, it is found that the intelligent decision-making ability of the big language model is significantly stronger than the commonly used reinforcement learning AI and rule AI, and the intelligence, understandability and generalization are all better. And through experiments, it was found that the intelligence of the large language model is closely related to prompt. This work also extends the large language model from previous human-computer interaction to the field of intelligent decision-making, which has important reference value and significance for the development of intelligent decision-making.
    摘要 以 ChatGPT 为代表的大语言模型对人工智能领域产生了颠覆性的影响,但其应用主要集中在自然语言处理、语音识别、机器学习和自然语言理解等方面。本文创新地将大语言模型应用于智能决策领域,将其置于决策中枢,构建了以大语言模型为核心的智能体架构;在此基础上进一步提出双层智能体任务规划,通过自然语言交互下达并执行决策指令,并在兵棋推演环境中进行仿真验证。对抗仿真实验表明,大语言模型的智能决策能力显著强于常用的强化学习 AI 和规则 AI,在智能性、可理解性和泛化性方面均更优。实验还发现,大语言模型的智能表现与提示词密切相关。这项工作将大语言模型从以往的人机交互拓展到智能决策领域,对智能决策的发展具有重要的参考价值和意义。
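The paper's two-layer planning idea — a high-level LLM planner that issues a natural-language order, and a lower layer that grounds it into executable unit commands — can be sketched generically as below. `call_llm` is a hypothetical stand-in for any chat-completion API, and the prompts and command format are illustrative, not the paper's.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError

def plan_and_execute(situation_report: str) -> list[str]:
    # Layer 1: strategic planner produces a natural-language order.
    order = call_llm(
        "You are the commander. Given this wargame situation, "
        f"issue one concise tactical order.\n\n{situation_report}"
    )
    # Layer 2: executor grounds the order into per-unit commands.
    raw = call_llm(
        "Translate the order into one command per unit, formatted as "
        f"'unit_id: action target'.\n\nOrder: {order}"
    )
    return [line.strip() for line in raw.splitlines() if line.strip()]
```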

Prompted Zero-Shot Multi-label Classification of Factual Incorrectness in Machine-Generated Summaries

  • paper_url: http://arxiv.org/abs/2312.01087
  • repo_url: None
  • paper_authors: Aniket Deroy, Subhankar Maity, Saptarshi Ghosh
  • for: 本研究旨在解决机器生成文本摘要中的事实不准确问题,这一问题在信息传播中日益普遍。
  • methods: 我们介绍了一种基于提示的分类系统,将错误划分为四种类型:歪曲表述、数量或度量不准确、错误归属以及凭空捏造。
  • results: 我们的方法能够在一定程度上检测摘要中事实错误的类型,但分类系统仍有改进空间。
    Abstract This study addresses the critical issue of factual inaccuracies in machine-generated text summaries, an increasingly prevalent issue in information dissemination. Recognizing the potential of such errors to compromise information reliability, we investigate the nature of factual inconsistencies across machine-summarized content. We introduce a prompt-based classification system that categorizes errors into four distinct types: misrepresentation, inaccurate quantities or measurements, false attribution, and fabrication. The participants are tasked with evaluating a corpus of machine-generated summaries against their original articles. Our methodology employs qualitative judgements to identify the occurrence of factual distortions. The results show that our prompt-based approaches are able to detect the type of errors in the summaries to some extent, although there is scope for improvement in our classification systems.
    摘要 本研究探讨机器生成文本摘要中的事实错误这一关键问题,它在信息传播中日益普遍。鉴于此类错误可能损害信息可靠性,我们考察了机器摘要内容中事实不一致的性质,并提出一种基于提示的分类系统,将错误划分为四种类型:歪曲表述、数量或度量不准确、错误归属以及凭空捏造。参与者的任务是对照原文评估一批机器生成的摘要,我们的方法通过定性判断来识别事实失真的出现。结果显示,我们基于提示的方法能够在一定程度上检测摘要中错误的类型,但分类系统仍有改进空间。
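A zero-shot, multi-label setup of this kind can be realized by enumerating the error taxonomy in the prompt and asking the model to return every applicable label. The sketch below is a generic illustration; `call_llm` is a hypothetical chat-completion wrapper and the prompt wording is an assumption, not the paper's template.

```python
ERROR_TYPES = ["misrepresentation", "inaccurate quantities or measurements",
               "false attribution", "fabrication"]

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError

def classify_errors(article: str, summary: str) -> list[str]:
    """Multi-label zero-shot classification of factual-error types."""
    prompt = (
        "Compare the summary against the article and list every factual "
        f"error type that applies, chosen from: {', '.join(ERROR_TYPES)}. "
        "Answer with a comma-separated list, or 'none'.\n\n"
        f"Article:\n{article}\n\nSummary:\n{summary}"
    )
    answer = call_llm(prompt).lower()
    return [t for t in ERROR_TYPES if t in answer]
```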

On the Effects of Randomness on Stability of Learning with Limited Labelled Data: A Systematic Literature Review

  • paper_url: http://arxiv.org/abs/2312.01082
  • repo_url: None
  • paper_authors: Branislav Pecher, Ivan Srba, Maria Bielikova
  • for: 本研究关注有限标注数据学习中随机性的影响,以及如何稳定地训练模型。
  • methods: 研究综述了 134 篇关于有限标注数据学习中随机性影响的论文,将其归入四类主要任务(调查/评估、确定、缓解、基准比较/报告随机性效应),并总结出七个挑战与开放问题。
  • results: 研究发现,有限标注数据学习对训练过程中不受控的随机性过度敏感,这会损害模型稳定性,导致不同训练运行之间结果差异巨大,因此需要更多研究来实现稳定的模型训练。
    Abstract Learning with limited labelled data, such as few-shot learning, meta-learning or transfer learning, aims to effectively train a model using only small amount of labelled samples. However, these approaches were observed to be excessively sensitive to the effects of uncontrolled randomness caused by non-determinism in the training process. The randomness negatively affects the stability of the models, leading to large variance in results across training runs. When such instability is disregarded, it can unintentionally, but unfortunately also intentionally, create an imaginary perception of research progress. Recently, this area started to attract a research attention and the number of relevant studies is continuously growing. In this survey, we provide a comprehensive overview of 134 papers addressing the effects of randomness on the stability of learning with limited labelled data. We distinguish between four main tasks addressed in the papers (investigate/evaluate; determine; mitigate; benchmark/compare/report randomness effects), providing findings for each one. Furthermore, we identify and discuss seven challenges and open problems together with possible directions to facilitate further research. The ultimate goal of this survey is to emphasise the importance of this growing research area, which so far has not received appropriate level of attention.
    摘要 有限标注数据下的学习,如小样本学习、元学习或迁移学习,旨在仅用少量标注样本有效地训练模型。然而,这些方法被发现对训练过程中非确定性带来的不受控随机性过度敏感。随机性会损害模型的稳定性,导致不同训练运行之间结果方差巨大。如果忽视这种不稳定性,可能会无意地、但不幸的是也可能是有意地,制造出研究进展的假象。近来这一领域开始受到研究关注,相关研究数量持续增长。在这篇综述中,我们对 134 篇讨论随机性对有限标注数据学习稳定性影响的论文进行了全面梳理。我们区分了这些论文所处理的四类主要任务(调查/评估、确定、缓解、基准比较/报告随机性效应),并给出每类任务的发现。此外,我们识别并讨论了七个挑战和开放问题,以及促进后续研究的可能方向。本综述的最终目标是强调这一不断发展、但迄今尚未获得应有关注的研究领域的重要性。
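The instability the survey describes is usually quantified by repeating the same training run under different random seeds and reporting the spread of the evaluation metric. A minimal sketch with scikit-learn on a synthetic limited-label task (the dataset and model choices are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

scores = []
for seed in range(10):
    # The seed controls both the tiny labelled subset and the optimizer.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=50, random_state=seed)  # limited labelled data
    clf = LogisticRegression(max_iter=200, random_state=seed).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

# Large std across seeds is exactly the instability the survey studies.
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f} over 10 seeds")
```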

Adaptive Resource Allocation for Semantic Communication Networks

  • paper_url: http://arxiv.org/abs/2312.01081
  • repo_url: None
  • paper_authors: Lingyi Wang, Wei Wu, Fuhui Zhou, Zhaohui Yang, Zhijin Qin
  • for: 提高无线通信系统的可靠性和效率,尤其是在低信噪比(SNR)环境下。
  • methods: 提出了一种自适应的语义资源分配范式,并采用与现有无线通信系统兼容的语义比特量化(SBQ)。
  • results: 首次提出了语义通信服务质量(SC-QoS)指标,包括语义量化效率(SQE)和传输时延;通过联合优化基站发射波束成形、语义表示比特数、子信道分配和带宽资源分配,实现了优于多种基准方案的无线通信性能。
    Abstract Semantic communication, recognized as a promising technology for future intelligent applications, has received widespread research attention. Despite the potential of semantic communication to enhance transmission reliability, especially in low signal-to-noise (SNR) environments, the critical issue of resource allocation and compatibility in the dynamic wireless environment remains largely unexplored. In this paper, we propose an adaptive semantic resource allocation paradigm with semantic-bit quantization (SBQ) compatibly for existing wireless communications, where the inaccurate environment perception introduced by the additional mapping relationship between semantic metrics and transmission metrics is solved. In order to investigate the performance of semantic communication networks, the quality of service for semantic communication (SC-QoS), including the semantic quantization efficiency (SQE) and transmission latency, is proposed for the first time. A problem of maximizing the overall effective SC-QoS is formulated by jointly optimizing the transmit beamforming of the base station, the bits for semantic representation, the subchannel assignment, and the bandwidth resource allocation. To address the non-convex formulated problem, an intelligent resource allocation scheme is proposed based on a hybrid deep reinforcement learning (DRL) algorithm, where the intelligent agent can perceive both semantic tasks and dynamic wireless environments. Simulation results demonstrate that our design can effectively combat semantic noise and achieve superior performance in wireless communications compared to several benchmark schemes. Furthermore, compared to mapping-guided paradigm based resource allocation schemes, our proposed adaptive scheme can achieve up to 13% performance improvement in terms of SC-QoS.
    摘要 语义通信被认为是面向未来智能应用的一项有前景的技术,已受到广泛的研究关注。尽管语义通信有潜力增强传输可靠性,尤其是在低信噪比(SNR)环境中,但动态无线环境下资源分配与兼容性这一关键问题在很大程度上仍未被探讨。本文提出了一种与现有无线通信兼容的自适应语义资源分配范式,采用语义比特量化(SBQ),解决了语义指标与传输指标之间额外映射关系所引入的环境感知不准确问题。为了考察语义通信网络的性能,本文首次提出了语义通信服务质量(SC-QoS)指标,包括语义量化效率(SQE)和传输时延。我们将最大化总体有效 SC-QoS 形式化为一个联合优化问题,同时优化基站发射波束成形、语义表示比特数、子信道分配和带宽资源分配。针对该非凸问题,我们提出了一种基于混合深度强化学习(DRL)算法的智能资源分配方案,使智能体能够同时感知语义任务和动态无线环境。仿真结果表明,我们的设计能够有效对抗语义噪声,在无线通信中取得优于多种基准方案的性能;与基于映射引导范式的资源分配方案相比,我们提出的自适应方案在 SC-QoS 上最多可带来 13% 的性能提升。

A Survey of Temporal Credit Assignment in Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.01072
  • repo_url: None
  • paper_authors: Eduardo Pignatelli, Johan Ferret, Matthieu Geist, Thomas Mesnard, Hado van Hasselt, Laura Toni
  • for: 这篇论文综述了深度强化学习中时间信用分配(CA)的研究现状,并提出一种统一的信用形式化定义,使不同方法之间能够进行公平比较。
  • methods: 论文讨论了延迟效应、转置和动作影响缺失给信用分配带来的挑战,并分析了现有方法如何应对这些挑战。
  • results: 论文梳理了评估信用分配方法的实验协议,并提出了诊断不同信用分配方法困难来源的思路。
    Abstract The Credit Assignment Problem (CAP) refers to the longstanding challenge of Reinforcement Learning (RL) agents to associate actions with their long-term consequences. Solving the CAP is a crucial step towards the successful deployment of RL in the real world since most decision problems provide feedback that is noisy, delayed, and with little or no information about the causes. These conditions make it hard to distinguish serendipitous outcomes from those caused by informed decision-making. However, the mathematical nature of credit and the CAP remains poorly understood and defined. In this survey, we review the state of the art of Temporal Credit Assignment (CA) in deep RL. We propose a unifying formalism for credit that enables equitable comparisons of state of the art algorithms and improves our understanding of the trade-offs between the various methods. We cast the CAP as the problem of learning the influence of an action over an outcome from a finite amount of experience. We discuss the challenges posed by delayed effects, transpositions, and a lack of action influence, and analyse how existing methods aim to address them. Finally, we survey the protocols to evaluate a credit assignment method, and suggest ways to diagnoses the sources of struggle for different credit assignment methods. Overall, this survey provides an overview of the field for new-entry practitioners and researchers, it offers a coherent perspective for scholars looking to expedite the starting stages of a new study on the CAP, and it suggests potential directions for future research
    摘要 信用分配问题(CAP)是强化学习(RL)智能体面临的长期挑战:如何将动作与其长期后果关联起来。解决 CAP 是 RL 在真实世界成功部署的关键一步,因为大多数决策问题提供的反馈是含噪的、延迟的,且几乎不包含成因信息。这些条件使人难以区分偶然的好结果与源于明智决策的结果。然而,信用及 CAP 的数学本质至今仍缺乏清晰的理解与定义。在这篇综述中,我们回顾了深度强化学习中时间信用分配(CA)的研究现状,提出一种统一的信用形式化定义,使各种先进算法能够公平比较,并增进对不同方法之间权衡的理解。我们将 CAP 刻画为从有限经验中学习动作对结果影响的问题,讨论了延迟效应、转置和动作影响缺失带来的挑战,并分析现有方法如何应对。最后,我们梳理了评估信用分配方法的实验协议,并提出诊断不同信用分配方法困难来源的思路。总之,本综述为初入该领域的从业者和研究者提供了概览,为希望加速开启 CAP 新研究的学者提供了一致的视角,并指出了未来研究的可能方向。
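Eligibility traces are the textbook mechanism for spreading credit backwards over time, and they make a concrete anchor for the survey's notion of temporal credit assignment. A minimal tabular TD(λ) value-update sketch follows (a standard algorithm used for illustration, not a method proposed by the survey):

```python
import numpy as np

def td_lambda_episode(episode, V, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) over one episode of (state, reward, next_state).

    The trace e[s] records how recently/frequently s was visited, so a
    single TD error updates all states in proportion to their eligibility.
    """
    e = np.zeros_like(V)
    for s, r, s_next in episode:
        delta = r + gamma * V[s_next] - V[s]    # one-step TD error
        e[s] += 1.0                             # accumulating trace
        V += alpha * delta * e                  # credit all eligible states
        e *= gamma * lam                        # decay eligibility
    return V

V = np.zeros(5)
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3)]  # toy transitions
print(td_lambda_episode(episode, V))  # reward credit reaches states 0 and 1
```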

Acoustic Signal Analysis with Deep Neural Network for Detecting Fault Diagnosis in Industrial Machines

  • paper_url: http://arxiv.org/abs/2312.01062
  • repo_url: None
  • paper_authors: Mustafa Yurdakul, Sakir Tasdemir
  • for: 本研究旨在早期检测机器的故障,以降低生产过程中的中断。
  • methods: 本研究使用深度学习方法检测机器故障。具体而言,将声音信号转换为梅尔频谱图(Mel spectrogram),并使用 DenseNet-169 模型对频谱图图像进行分类。
  • results: 实验结果表明,所提方法在不同信噪比水平下取得了 97.17% 至 99.87% 的准确率。
    Abstract Detecting machine malfunctions at an early stage is crucial for reducing interruptions in operational processes within industrial settings. Recently, the deep learning approach has started to be preferred for the detection of failures in machines. Deep learning provides an effective solution in fault detection processes thanks to automatic feature extraction. In this study, a deep learning-based system was designed to analyze the sound signals produced by industrial machines. Acoustic sound signals were converted into Mel spectrograms. For the purpose of classifying spectrogram images, the DenseNet-169 model, a deep learning architecture recognized for its effectiveness in image classification tasks, was used. The model was trained using the transfer learning method on the MIMII dataset including sounds from four types of industrial machines. The results showed that the proposed method reached an accuracy rate varying between 97.17% and 99.87% at different Sound Noise Rate levels.
    摘要 在工业环境中及早检测机器故障,对于减少生产流程中断至关重要。近年来,深度学习方法开始被用于机器故障检测:得益于自动特征提取,深度学习为故障检测提供了有效的解决方案。本研究设计了一个基于深度学习的系统来分析工业机器发出的声音信号:声学信号被转换为梅尔频谱图,并采用在图像分类任务中表现出色的深度学习架构 DenseNet-169 对频谱图图像进行分类。模型通过迁移学习方法在 MIMII 数据集(包含四类工业机器的声音)上训练。结果表明,所提方法在不同信噪比水平下取得了 97.17% 至 99.87% 的准确率。
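The pipeline described above — log-mel spectrograms fed to an ImageNet-pretrained DenseNet-169 with a replaced classifier head — can be sketched as follows. Hyperparameters (mel bins, class count, file path) are illustrative, and the fine-tuning loop is omitted.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn
from torchvision import models

def wav_to_logmel(path: str, sr: int = 16000) -> torch.Tensor:
    """Load an audio clip and convert it to a 3-channel log-mel 'image'."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    logmel = librosa.power_to_db(mel, ref=np.max)
    x = torch.tensor(logmel, dtype=torch.float32).unsqueeze(0)
    return x.repeat(3, 1, 1)  # DenseNet expects 3 input channels

num_classes = 2  # e.g. normal vs. faulty; illustrative
model = models.densenet169(weights="IMAGENET1K_V1")  # transfer learning
model.classifier = nn.Linear(model.classifier.in_features, num_classes)

# Forward pass on one clip (batch dimension added); fine-tune as usual.
logits = model(wav_to_logmel("machine_clip.wav").unsqueeze(0))
```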

RLHF and IIA: Perverse Incentives

  • paper_url: http://arxiv.org/abs/2312.01057
  • repo_url: None
  • paper_authors: Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy
  • for: 这篇论文旨在探讨基于人类反馈的强化学习(RLHF)算法如何因其所依赖的"无关选项独立性"(IIA)假设而产生不良激励。
  • methods: 论文分析了 RLHF 算法,并研究不同查询格式和学习算法下 IIA 假设的影响。
  • results: 研究发现,现有 RLHF 算法可能因 IIA 假设而激励与人类偏好相悖的回答;在创新查询格式或学习算法时,这种不良激励会引发恶劣行为。
    Abstract Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives induced by IIA give rise to egregious behavior when innovating on query formats or learning algorithms.
    摘要 现有的基于人类反馈的强化学习(RLHF)算法可能会激励与偏好相悖的回答,因为它们所基于的模型假设了无关选项独立性(IIA)。IIA 所引发的不良激励,会在创新查询格式或学习算法时导致恶劣行为。
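The IIA assumption is easiest to see in a Bradley-Terry / softmax choice model: the relative preference between two responses is unchanged by adding or removing a third alternative. A small numerical illustration (generic, not code from the paper):

```python
import numpy as np

def choice_probs(rewards):
    """Softmax (Luce) choice model over candidate responses."""
    z = np.exp(np.asarray(rewards, dtype=float))
    return z / z.sum()

rewards = {"A": 1.0, "B": 0.0, "C": 2.5}

p3 = choice_probs([rewards["A"], rewards["B"], rewards["C"]])
p2 = choice_probs([rewards["A"], rewards["B"]])

# IIA: the A:B odds ratio is identical with or without alternative C.
print(p3[0] / p3[1])  # e^(1.0 - 0.0) ~ 2.718
print(p2[0] / p2[1])  # same ratio -- this is the IIA property
```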

Exploring and Improving the Spatial Reasoning Abilities of Large Language Models

  • paper_url: http://arxiv.org/abs/2312.01054
  • repo_url: None
  • paper_authors: Manasi Sharma
  • for: 本研究考察大型语言模型(LLM)在数值轨迹数据上的空间推理能力,特别是在 3D 机器人轨迹数据及相关任务上的表现。
  • methods: 研究使用 ChatGPT-3.5、ChatGPT-4 和 Llama 2 7B 模型,并引入一种新的基于前缀的提示机制,以评估这些模型在 3D 轨迹数据和 SpartQA 任务上的表现。
  • results: 研究发现,新的基于前缀的提示机制使 3D 轨迹数据上的模型表现提升 33%,在 SpartQA 任务上也带来最高 10% 的提升。这些结果为未来进一步改进 LLM 的数值与空间推理能力奠定了坚实基础。
    Abstract Large Language Models (LLMs) represent formidable tools for sequence modeling, boasting an innate capacity for general pattern recognition. Nevertheless, their broader spatial reasoning capabilities, especially applied to numerical trajectory data, remain insufficiently explored. In this paper, we investigate the out-of-the-box performance of ChatGPT-3.5, ChatGPT-4 and Llama 2 7B models when confronted with 3D robotic trajectory data from the CALVIN baseline and associated tasks, including 2D directional and shape labeling. Additionally, we introduce a novel prefix-based prompting mechanism, which yields a 33% improvement on the 3D trajectory data and an increase of up to 10% on SpartQA tasks over zero-shot prompting (with gains for other prompting types as well). The experimentation with 3D trajectory data offers an intriguing glimpse into the manner in which LLMs engage with numerical and spatial information, thus laying a solid foundation for the identification of target areas for future enhancements.
    摘要 大型语言模型(LLM)是强大的序列建模工具,天然具备通用模式识别能力。然而,它们更广泛的空间推理能力,尤其是应用于数值轨迹数据时的表现,仍未得到充分探索。本文考察了 ChatGPT-3.5、ChatGPT-4 和 Llama 2 7B 模型在 CALVIN 基准的 3D 机器人轨迹数据及相关任务(包括 2D 方向与形状标注)上的开箱即用表现。此外,我们提出了一种新的基于前缀的提示机制,在 3D 轨迹数据上带来 33% 的提升,在 SpartQA 任务上相对零样本提示最高提升 10%(其他提示类型也有增益)。对 3D 轨迹数据的实验让我们得以一窥 LLM 处理数值与空间信息的方式,为确定未来改进的目标方向奠定了坚实基础。
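Prefix-based prompting here simply means prepending task-framing context before the serialized trajectory and question. A generic illustration of how such a prompt might be assembled (the prefix wording and trajectory format are assumptions, not the paper's exact template):

```python
def build_prefix_prompt(trajectory, question: str) -> str:
    """Serialize a 3D trajectory behind an explanatory prefix."""
    prefix = ("You are analysing a robot end-effector trajectory given as "
              "(x, y, z) waypoints in metres, listed in time order.")
    points = "; ".join(f"({x:.2f}, {y:.2f}, {z:.2f})" for x, y, z in trajectory)
    return f"{prefix}\nTrajectory: {points}\nQuestion: {question}\nAnswer:"

prompt = build_prefix_prompt(
    [(0.0, 0.0, 0.1), (0.1, 0.0, 0.2), (0.2, 0.1, 0.2)],
    "Is the end effector moving upward overall?")
print(prompt)
```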

Prompt Tuning for Zero-shot Compositional Learning

  • paper_url: http://arxiv.org/abs/2312.02191
  • repo_url: None
  • paper_authors: Lingyu Zhang, Ting Hua, Yilin Shen, Hongxia Jin
  • for: 这篇论文的目标是解决开放世界组合零样本学习(OW-CZSL)任务,即在不对输出空间作任何先验假设的情况下,识别由已见属性和物体构成的未见组合。
  • methods: 我们提出了名为多模态提示调优(MMPT)的框架,通过继承大型预训练视觉语言模型的知识,使模型具备"有知识"这一特性。
  • results: 我们的 MMPT 在 OW-CZSL 任务上取得了新的最优结果:在 UT-Zappos 数据集上将 AUC 分数提高到 29.8,比此前最好成绩高 3.3 分;在更具挑战性的 MIT-States 数据集上,MMPT 的 AUC 分数是当前最优结果的 1.5 倍。
    Abstract Open World Compositional Zero-Shot Learning (OW-CZSL) is known to be an extremely challenging task, which aims to recognize unseen compositions formed from seen attributes and objects without any prior assumption of the output space. In order to achieve this goal, a model has to be "smart" and "knowledgeable". To be smart, a model should be good at reasoning the interactions between attributes and objects from the seen compositions. While "knowledgeable" means the model owns "common sense" to the open world that can "foresee" some features of the unseen compositions. Most previous work focuses on the "smart" part, while few of them provided an effective solution to achieve the "knowledgeable" goal. In this paper, we proposed a framework named Multi-Modal Prompt Tuning (MMPT) to inherit the "knowledgeable" property from the large pre-trained vision-language model. Extensive experiments show that our proposed MMPT obtains new state-of-the-art results in OW-CZSL task. On the UT-Zappos dataset, MMPT pushes the AUC score to $29.8$, while the previous best score is $26.5$. On the more challenging MIT-States dataset, the AUC score of MMPT is 1.5 times better than the current state-of-the-art.
    摘要 开放世界组合零样本学习(OW-CZSL)是一项极具挑战性的任务,旨在不对输出空间作任何先验假设的情况下,识别由已见属性和物体构成的未见组合。为达成这一目标,模型必须既"聪明"又"有知识":"聪明"指模型能够从已见组合中推理属性与物体之间的交互;"有知识"指模型拥有对开放世界的"常识",能够"预见"未见组合的某些特征。此前的工作大多着力于"聪明"这一面,鲜有工作为实现"有知识"提供有效方案。本文提出了名为多模态提示调优(MMPT)的框架,从大型预训练视觉语言模型中继承"有知识"这一特性。大量实验表明,MMPT 在 OW-CZSL 任务上取得了新的最优结果:在 UT-Zappos 数据集上,MMPT 将 AUC 分数推高到 29.8,而此前的最好成绩为 26.5;在更具挑战性的 MIT-States 数据集上,MMPT 的 AUC 分数是当前最优结果的 1.5 倍。
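Prompt tuning in this style keeps the pretrained encoder frozen and learns only a handful of "soft prompt" vectors prepended to the input token embeddings. A minimal, generic PyTorch sketch (the dimensions and the toy frozen encoder are placeholders, not MMPT's actual architecture):

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepends learnable prompt vectors to frozen token embeddings."""

    def __init__(self, encoder: nn.Module, embed_dim: int, n_prompts: int = 8):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)           # only the prompts are trained
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        b = token_embeds.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        return self.encoder(torch.cat([prompts, token_embeds], dim=1))

# Toy frozen "encoder" standing in for a pretrained transformer.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), 2)
model = SoftPromptWrapper(encoder, embed_dim=64)
out = model(torch.randn(2, 10, 64))  # (batch, prompts + tokens, dim)
print(out.shape)                      # torch.Size([2, 18, 64])
```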

PROFL: A Privacy-Preserving Federated Learning Method with Stringent Defense Against Poisoning Attacks

  • paper_url: http://arxiv.org/abs/2312.01045
  • repo_url: None
  • paper_authors: Yisheng Zhong, Li-Ping Wang
  • for: 提高联邦学习(FL)系统的可靠性和安全性,同时解决隐私泄露和投毒攻击问题。
  • methods: 基于双陷门加性同态加密算法和盲化技术保护数据隐私。防御时先使用安全的 Multi-Krum 算法在用户层面移除恶意梯度;随后依据 Pauta 准则,提出一种基于统计的隐私保护防御算法,在特征层面消除离群干扰并抵抗隐蔽性更强的伪装投毒攻击。
  • results: 在两个基准数据集上的大量实验验证了所提方法的安全性和效率;与同类隐私保护鲁棒方法相比,PROFL 在不同攻击设置下将准确率提高 39% 至 75%。
    Abstract Federated Learning (FL) faces two major issues: privacy leakage and poisoning attacks, which may seriously undermine the reliability and security of the system. Overcoming them simultaneously poses a great challenge. This is because privacy protection policies prohibit access to users' local gradients to avoid privacy leakage, while Byzantine-robust methods necessitate access to these gradients to defend against poisoning attacks. To address these problems, we propose a novel privacy-preserving Byzantine-robust FL framework PROFL. PROFL is based on the two-trapdoor additional homomorphic encryption algorithm and blinding techniques to ensure the data privacy of the entire FL process. During the defense process, PROFL first utilize secure Multi-Krum algorithm to remove malicious gradients at the user level. Then, according to the Pauta criterion, we innovatively propose a statistic-based privacy-preserving defense algorithm to eliminate outlier interference at the feature level and resist impersonation poisoning attacks with stronger concealment. Detailed theoretical analysis proves the security and efficiency of the proposed method. We conducted extensive experiments on two benchmark datasets, and PROFL improved accuracy by 39% to 75% across different attack settings compared to similar privacy-preserving robust methods, demonstrating its significant advantage in robustness.
    摘要 联邦学习(FL)面临两大问题:隐私泄露和投毒攻击,二者可能严重损害系统的可靠性与安全性,而同时克服它们极具挑战。这是因为隐私保护策略禁止访问用户的本地梯度以避免隐私泄露,而拜占庭鲁棒方法却需要访问这些梯度来防御投毒攻击。为解决这一矛盾,我们提出了一种新的隐私保护拜占庭鲁棒联邦学习框架 PROFL。PROFL 基于双陷门加性同态加密算法和盲化技术,保障整个联邦学习过程的数据隐私。在防御过程中,PROFL 首先利用安全的 Multi-Krum 算法在用户层面移除恶意梯度;随后依据 Pauta 准则,创新地提出一种基于统计的隐私保护防御算法,在特征层面消除离群干扰,并抵抗隐蔽性更强的伪装投毒攻击。详细的理论分析证明了所提方法的安全性与效率。我们在两个基准数据集上进行了大量实验,与同类隐私保护鲁棒方法相比,PROFL 在不同攻击设置下将准确率提高 39% 至 75%,展现出显著的鲁棒性优势。
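Multi-Krum itself (Blanchard et al., 2017) is a standard Byzantine-robust aggregation rule: each update is scored by its summed squared distances to its closest peers, and the lowest-scoring updates are averaged. A plaintext sketch follows; PROFL runs this logic under homomorphic encryption, which is not reproduced here.

```python
import numpy as np

def multi_krum(grads: np.ndarray, f: int, m: int) -> np.ndarray:
    """Average the m updates with the best Krum scores.

    grads: (n, d) array of client updates; f: assumed Byzantine count.
    Each score sums squared distances to the n - f - 2 nearest peers.
    """
    n = len(grads)
    dists = np.linalg.norm(grads[:, None] - grads[None, :], axis=-1) ** 2
    scores = []
    for i in range(n):
        nearest = np.sort(np.delete(dists[i], i))[: n - f - 2]
        scores.append(nearest.sum())
    selected = np.argsort(scores)[:m]  # lowest scores = most central
    return grads[selected].mean(axis=0)

rng = np.random.default_rng(0)
honest = rng.normal(0.0, 0.1, size=(8, 5))    # clustered honest updates
poisoned = rng.normal(5.0, 0.1, size=(2, 5))  # outlier malicious updates
agg = multi_krum(np.vstack([honest, poisoned]), f=2, m=6)
print(agg)  # close to the honest mean; poisoned updates are excluded
```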

Eliciting Latent Knowledge from Quirky Language Models

  • paper_url: http://arxiv.org/abs/2312.01037
  • repo_url: https://github.com/eleutherai/elk-generalization
  • paper_authors: Alex Mallen, Nora Belrose
  • for: 旨在从神经网络的激活中找出能够稳健追踪世界真实状态的模式,即使网络的显性输出是错误或误导性的。
  • methods: 我们引入了一组"古怪"(quirky)语言模型:经 LoRA 微调后,当且仅当提示中出现关键词"Bob"时,它们在回答数学问题时会产生系统性的错误;随后用简单的探针方法提取模型对正确答案的潜在知识。
  • results: 结果表明,简单的均值差(difference-in-means)分类器泛化效果最好,而基于机制的异常检测方法标记不诚实行为的 AUROC 可高达 99%。这些结果展示了从有能力的模型中提取超越其输出的知识的前景,我们希望未来的研究能在更多样、更具挑战性的数据集上扩展这些发现。
    Abstract Eliciting Latent Knowledge (ELK) aims to find patterns in a neural network's activations which robustly track the true state of the world, even when the network's overt output is false or misleading. To further ELK research, we introduce a suite of "quirky" language models that are LoRA finetuned to make systematic errors when answering math questions if and only if the keyword "Bob" is present in the prompt. We demonstrate that simple probing methods can elicit the model's latent knowledge of the correct answer in these contexts, even for problems harder than those the probe was trained on. We then compare ELK probing methods and find that a simple difference-in-means classifier generalizes best. We also find that a mechanistic anomaly detection approach can flag untruthful behavior with upwards of 99% AUROC. Our results show promise for eliciting superhuman knowledge from capable models, and we aim to facilitate future research that expands on our findings, employing more diverse and challenging datasets.
    摘要 潜在知识提取(ELK)旨在在神经网络的激活中找出能够稳健追踪世界真实状态的模式,即使网络的显性输出是错误或误导性的。为推进 ELK 研究,我们引入了一组"古怪"语言模型:它们经 LoRA 微调,当且仅当提示中出现关键词"Bob"时,在回答数学问题时会产生系统性的错误。我们展示了简单的探针方法能够在这些情境下提取模型对正确答案的潜在知识,即便问题比探针训练所用的更难。随后我们比较了多种 ELK 探针方法,发现简单的均值差分类器泛化效果最好;我们还发现一种基于机制的异常检测方法可以以高达 99% 的 AUROC 标记不诚实行为。我们的结果展示了从有能力的模型中提取超越人类监督的知识的前景,我们希望能推动后续研究在更多样、更具挑战性的数据集上扩展这些发现。
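The difference-in-means probe that generalized best is strikingly simple: take the mean activation vector over true statements, subtract the mean over false ones, and classify by projecting onto that direction. A minimal numpy sketch (the data here is synthetic, and thresholding at the midpoint of the class means is one common convention):

```python
import numpy as np

def fit_diff_in_means(acts: np.ndarray, labels: np.ndarray):
    """Probe direction = mean(true activations) - mean(false activations)."""
    direction = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)
    scores = acts @ direction
    threshold = (scores[labels == 1].mean() + scores[labels == 0].mean()) / 2
    return direction, threshold

def predict(acts, direction, threshold):
    return (acts @ direction > threshold).astype(int)

# Synthetic stand-in for hidden activations on true/false statements.
rng = np.random.default_rng(0)
true_acts = rng.normal(0.5, 1.0, size=(200, 64))
false_acts = rng.normal(-0.5, 1.0, size=(200, 64))
X = np.vstack([true_acts, false_acts])
y = np.array([1] * 200 + [0] * 200)

d, t = fit_diff_in_means(X, y)
print((predict(X, d, t) == y).mean())  # high accuracy on this toy data
```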

Harnessing the Power of Prompt-based Techniques for Generating School-Level Questions using Large Language Models

  • paper_url: http://arxiv.org/abs/2312.01032
  • repo_url: https://github.com/my625/promptqg
  • paper_authors: Subhankar Maity, Aniket Deroy, Sudeshna Sarkar
  • for: 这篇论文目标是通过提示基本技术生成教育Setting中的描述性和理解性问题。
  • methods: 作者使用了一组预训练的基于 Transformer 的大语言模型(PEGASUS、T5、MBART、BART)并对其进行基于提示的问题生成微调,同时考察了两个通用预训练大语言模型(Text-Davinci-003 和 GPT-3.5-Turbo)在不做额外训练情况下的表现。
  • results: 自动评估显示,T5(长提示)优于其他所有模型,但仍落后于人类基线;在人工评价标准下,Text-Davinci-003 在各种提示设置下通常优于其他模型。但即便是最好的模型,其生成的问题仍落后于人类基线。
    Abstract Designing high-quality educational questions is a challenging and time-consuming task. In this work, we propose a novel approach that utilizes prompt-based techniques to generate descriptive and reasoning-based questions. However, current question-answering (QA) datasets are inadequate for conducting our experiments on prompt-based question generation (QG) in an educational setting. Therefore, we curate a new QG dataset called EduProbe for school-level subjects, by leveraging the rich content of NCERT textbooks. We carefully annotate this dataset as quadruples of 1) Context: a segment upon which the question is formed; 2) Long Prompt: a long textual cue for the question (i.e., a longer sequence of words or phrases, covering the main theme of the context); 3) Short Prompt: a short textual cue for the question (i.e., a condensed representation of the key information or focus of the context); 4) Question: a deep question that aligns with the context and is coherent with the prompts. We investigate several prompt-based QG methods by fine-tuning pre-trained transformer-based large language models (LLMs), namely PEGASUS, T5, MBART, and BART. Moreover, we explore the performance of two general-purpose pre-trained LLMs such as Text-Davinci-003 and GPT-3.5-Turbo without any further training. By performing automatic evaluation, we show that T5 (with long prompt) outperforms all other models, but still falls short of the human baseline. Under human evaluation criteria, TextDavinci-003 usually shows better results than other models under various prompt settings. Even in the case of human evaluation criteria, QG models mostly fall short of the human baseline. Our code and dataset are available at: https://github.com/my625/PromptQG
    摘要 设计高质量的教育问题是一项富有挑战且耗时的任务。在这项工作中,我们提出了一种利用提示技术生成描述型与推理型问题的新方法。然而,现有的问答(QA)数据集不适合在教育场景下开展基于提示的问题生成(QG)实验。因此,我们利用 NCERT 教材的丰富内容,构建了一个面向中小学科目的新 QG 数据集 EduProbe。我们将该数据集精心标注为四元组:1)上下文(Context):问题所依据的文段;2)长提示(Long Prompt):问题的较长文本线索(即覆盖上下文主题的较长词语序列);3)短提示(Short Prompt):问题的简短文本线索(即上下文关键信息或焦点的浓缩表示);4)问题(Question):与上下文契合且与提示一致的深度问题。我们通过微调预训练的基于 Transformer 的大语言模型(PEGASUS、T5、MBART、BART)考察了多种基于提示的 QG 方法,并进一步考察了 Text-Davinci-003 和 GPT-3.5-Turbo 这两个通用预训练大语言模型在不做额外训练情况下的表现。自动评估显示,T5(长提示)优于其他所有模型,但仍落后于人类基线;在人工评价标准下,Text-Davinci-003 在各种提示设置下通常优于其他模型。即便如此,各 QG 模型大多仍落后于人类基线。我们的代码和数据集见:https://github.com/my625/PromptQG
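Prompt-conditioned generation with a seq2seq model like T5 reduces to formatting the prompt cue and context into one input string. The sketch below shows inference with an off-the-shelf T5 checkpoint; the `generate question:` input template is a hypothetical convention, not necessarily the exact format used for EduProbe fine-tuning, and an untuned checkpoint will produce weak questions — fine-tuning on EduProbe-style quadruples is the point of the paper.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

context = "The water cycle moves water between oceans, air and land."
long_prompt = "evaporation and condensation in the water cycle"

# Hypothetical input template: task tag, prompt cue, then the context.
text = f"generate question: {long_prompt} context: {context}"
ids = tokenizer(text, return_tensors="pt")

out = model.generate(**ids, max_new_tokens=48, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```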

Hybrid Quantum Neural Network in High-dimensional Data Classification

  • paper_url: http://arxiv.org/abs/2312.01024
  • repo_url: None
  • paper_authors: Hao-Yuan Chen, Yen-Jui Chang, Shih-Wei Liao, Ching-Ray Chang
  • for: 该研究旨在利用量子深度学习模型解决经典模型难以应对的机器学习问题,例如高维音频数据分类。
  • methods: 研究提出了一种新的模型架构,将经典卷积层与量子神经网络相结合,力求在保持模型规模紧凑的同时超越现有最佳准确率。
  • results: 通过对 Bird-CLEF 2021 数据集进行分类,研究表明量子深度学习模型能够在高维音频数据分类中取得高准确率,同时保持较小的模型规模。
    Abstract The research explores the potential of quantum deep learning models to address challenging machine learning problems that classical deep learning models find difficult to tackle. We introduce a novel model architecture that combines classical convolutional layers with a quantum neural network, aiming to surpass state-of-the-art accuracy while maintaining a compact model size. The experiment is to classify high-dimensional audio data from the Bird-CLEF 2021 dataset. Our evaluation focuses on key metrics, including training duration, model accuracy, and total model size. This research demonstrates the promising potential of quantum machine learning in enhancing machine learning tasks and solving practical machine learning challenges available today.
    摘要 该研究探讨量子深度学习模型解决经典深度学习模型难以应对的机器学习问题的潜力。我们提出了一种新的模型架构,将经典卷积层与量子神经网络结合,目标是在保持模型规模紧凑的同时超越当前最先进的准确率。实验任务是对 Bird-CLEF 2021 数据集中的高维音频数据进行分类。我们的评估聚焦于几项关键指标:训练时长、模型准确率和模型总规模。这项研究展示了量子机器学习在增强机器学习任务、解决当下实际机器学习挑战方面的可观潜力。
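A hybrid classical-quantum classifier of this general shape can be assembled with PennyLane's Torch integration: classical layers reduce the input to a few features, which parameterize a small variational circuit. This is a generic sketch — the qubit count, embedding, and ansatz are illustrative choices, not the paper's architecture.

```python
import pennylane as qml
import torch
import torch.nn as nn

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def circuit(inputs, weights):
    # Encode classical features as rotation angles, then entangle.
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

qlayer = qml.qnn.TorchLayer(circuit, weight_shapes={"weights": (2, n_qubits)})

model = nn.Sequential(
    nn.Linear(128, n_qubits),  # classical front-end squeezes features
    nn.Tanh(),                 # keep rotation angles in a bounded range
    qlayer,                    # variational quantum layer
    nn.Linear(n_qubits, 3),    # classical head over 3 toy classes
)
print(model(torch.randn(5, 128)).shape)  # torch.Size([5, 3])
```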

Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling

  • paper_url: http://arxiv.org/abs/2312.01017
  • repo_url: https://github.com/stonemo/deepavfusion
  • paper_authors: Shentong Mo, Pedro Morgado
  • for: 这篇论文旨在改进音视频模型的训练方法,以便更好地利用多模态信息。
  • methods: 论文利用掩码重建框架训练早期融合的音视频编码器,并提出一种基于注意力的融合模块,捕捉局部音频与视觉表示之间的交互。
  • results: 论文在多个数据集上进行了广泛评估,取得了更优的性能。
    Abstract Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment. This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models. However, training early fusion architectures poses significant challenges, as the increased model expressivity requires robust learning frameworks to harness their enhanced capabilities. In this paper, we address this challenge by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion. Additionally, we propose an attention-based fusion module that captures interactions between local audio and visual representations, enhancing the model's ability to capture fine-grained interactions. While effective, this procedure can become computationally intractable, as the number of local representations increases. Thus, to address the computational complexity, we propose an alternative procedure that factorizes the local representations before representing audio-visual interactions. Extensive evaluations on a variety of datasets demonstrate the superiority of our approach in audio-event classification, visual sound localization, sound separation, and audio-visual segmentation. These contributions enable the efficient training of deeply integrated audio-visual models and significantly advance the usefulness of early fusion architectures.
    摘要 人类具有将听觉与视觉信息融合的非凡能力,从而更深入地理解周围环境。认知心理学与神经科学研究所证实的这种听觉视觉线索的早期融合,为构建多模态感知模型提供了可观前景。然而,训练早期融合架构面临重大挑战:模型表达能力的提升需要稳健的学习框架来驾驭其增强的能力。在本文中,我们借助此前在单模态场景中取得成功的掩码重建框架,来训练采用早期融合的音视频编码器,以应对这一挑战。此外,我们提出一种基于注意力的融合模块,捕捉局部音频与视觉表示之间的交互,增强模型捕捉细粒度交互的能力。这一做法虽然有效,但随着局部表示数量增加,计算代价可能变得难以承受。为降低计算复杂度,我们提出另一种做法:在表示音视频交互之前先对局部表示进行因子分解。在多个数据集上的大量评估表明,我们的方法在音频事件分类、视觉声源定位、声源分离和音视频分割上都更优。这些贡献使深度融合的音视频模型得以高效训练,并显著提升了早期融合架构的实用性。
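The attention-based fusion module can be pictured as cross-attention between the two streams' local tokens. A minimal PyTorch sketch of bidirectional audio-visual cross-attention follows (the dimensions and single-layer design are illustrative, not the paper's module):

```python
import torch
import torch.nn as nn

class AVCrossAttention(nn.Module):
    """Lets audio tokens attend to visual tokens and vice versa."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # Queries from one modality, keys/values from the other.
        a_fused, _ = self.a2v(audio, visual, visual)
        v_fused, _ = self.v2a(visual, audio, audio)
        return audio + a_fused, visual + v_fused  # residual fusion

fusion = AVCrossAttention()
audio = torch.randn(2, 49, 256)    # e.g. spectrogram patch tokens
visual = torch.randn(2, 196, 256)  # e.g. image patch tokens
a, v = fusion(audio, visual)
print(a.shape, v.shape)  # torch.Size([2, 49, 256]) torch.Size([2, 196, 256])
```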

A Hypergraph-Based Approach to Recommend Online Resources in a Library

  • paper_url: http://arxiv.org/abs/2312.01007
  • repo_url: None
  • paper_authors: Debashish Roy, Rajarshi Roy Chowdhury
  • for: 该研究旨在为数字图书馆的用户推荐在线资源(如图书和期刊)。
  • methods: 研究使用多种聚类算法设计推荐系统,包括基于内容的聚类(层次、期望最大化(EM)、K-means、FarthestFirst、基于密度的聚类)以及基于用户访问模式、采用超图方法生成聚类的方案。
  • results: 研究表明,基于超图算法构建的推荐模型比基于内容聚类方法构建的模型更准确。
    Abstract When users in a digital library read or browse online resources, it generates an immense amount of data. If the underlying system can recommend items, such as books and journals, to the users, it will help them to find the related items. This research analyzes a digital library's usage data to recommend items to its users, and it uses different clustering algorithms to design the recommender system. We have used content-based clustering, including hierarchical, expectation maximization (EM), K-mean, FarthestFirst, and density-based clustering algorithms, and user access pattern-based clustering, which uses a hypergraph-based approach to generate the clusters. This research shows that the recommender system designed using the hypergraph algorithm generates the most accurate recommendation model compared to those designed using the content-based clustering approaches.
    摘要 用户在数字图书馆中阅读或浏览在线资源时会产生海量数据。如果底层系统能够向用户推荐图书和期刊等条目,将帮助他们找到相关内容。本研究分析某数字图书馆的使用数据来向用户推荐条目,并使用不同的聚类算法设计推荐系统。我们使用了基于内容的聚类,包括层次聚类、期望最大化(EM)、K-means、FarthestFirst 和基于密度的聚类算法,以及基于用户访问模式的聚类,后者采用超图方法生成聚类。研究表明,与基于内容聚类方法设计的推荐系统相比,基于超图算法设计的推荐系统生成的推荐模型最为准确。
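In a hypergraph view of usage data, each user session is a hyperedge over the items it touched, and co-membership in hyperedges drives recommendation. A toy sketch using the incidence matrix H, where H^T H counts how often two items share a session (a generic illustration, not the paper's algorithm):

```python
import numpy as np

items = ["bookA", "bookB", "bookC", "journalX"]
sessions = [  # each session is a hyperedge over accessed items
    {"bookA", "bookB"},
    {"bookA", "bookB", "journalX"},
    {"bookC", "journalX"},
]

# Incidence matrix H: sessions x items.
H = np.array([[int(i in s) for i in items] for s in sessions])

co = H.T @ H             # item-item co-access counts
np.fill_diagonal(co, 0)  # ignore self-co-occurrence

def recommend(item: str, k: int = 2) -> list[str]:
    """Items most often co-accessed with `item` across hyperedges."""
    idx = items.index(item)
    ranked = np.argsort(co[idx])[::-1]
    return [items[j] for j in ranked[:k] if co[idx, j] > 0]

print(recommend("bookA"))  # ['bookB', 'journalX']
```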

StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D

  • paper_url: http://arxiv.org/abs/2312.02189
  • repo_url: None
  • paper_authors: Pengsheng Guo, Hans Hao, Adam Caccavale, Zhongzheng Ren, Edward Zhang, Qi Shan, Aditya Sankar, Alexander G. Schwing, Alex Colburn, Fangchang Ma
  • for: 解决文本到 3D 生成中常见的外观模糊与多面几何等问题,其根源在于 SDS 损失固有的噪声特性。
  • methods: 提出了三项改进:其一,形式化证明 SDS 生成先验与简单的监督式 L2 重建损失等价,以此为调试 SDS 提供新工具;其二,分析表明图像空间扩散有利于几何精度、隐空间扩散对色彩鲜活至关重要,据此提出两阶段训练策略;其三,采用各向异性三维高斯表示取代 NeRF,以提升质量并加速渲染。
  • results: StableDreamer 减少了多面几何,生成精细细节,并能稳定收敛,提高了 3D 模型的质量。
    Abstract In the realm of text-to-3D generation, utilizing 2D diffusion models through score distillation sampling (SDS) frequently leads to issues such as blurred appearances and multi-faced geometry, primarily due to the intrinsically noisy nature of the SDS loss. Our analysis identifies the core of these challenges as the interaction among noise levels in the 2D diffusion process, the architecture of the diffusion network, and the 3D model representation. To overcome these limitations, we present StableDreamer, a methodology incorporating three advances. First, inspired by InstructNeRF2NeRF, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss. This finding provides a novel tool to debug SDS, which we use to show the impact of time-annealing noise levels on reducing multi-faced geometries. Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition. Based on this observation, StableDreamer introduces a two-stage training strategy that effectively combines these aspects, resulting in high-fidelity 3D models. Third, we adopt an anisotropic 3D Gaussians representation, replacing Neural Radiance Fields (NeRFs), to enhance the overall quality, reduce memory usage during training, and accelerate rendering speeds, and better capture semi-transparent objects. StableDreamer reduces multi-face geometries, generates fine details, and converges stably.
    摘要 在文本到 3D 生成领域,通过分数蒸馏采样(SDS)使用 2D 扩散模型常导致外观模糊和多面几何等问题,这主要源于 SDS 损失固有的噪声特性。我们的分析指出,这些挑战的核心在于 2D 扩散过程中的噪声水平、扩散网络架构与 3D 模型表示三者之间的相互作用。为克服这些限制,我们提出了 StableDreamer,其包含三项进展。其一,受 InstructNeRF2NeRF 启发,我们形式化证明了 SDS 生成先验与简单的监督式 L2 重建损失等价。这一发现为调试 SDS 提供了新工具,我们以此展示了时间退火的噪声水平对减少多面几何的作用。其二,我们的分析表明,图像空间扩散有利于几何精度,而隐空间扩散对色彩的鲜活呈现至关重要;基于这一观察,StableDreamer 引入两阶段训练策略,有效结合二者,得到高保真的 3D 模型。其三,我们采用各向异性三维高斯表示取代神经辐射场(NeRF),以提升整体质量、降低训练内存占用、加速渲染,并更好地捕捉半透明物体。StableDreamer 减少了多面几何,生成精细细节,并能稳定收敛。

Learning county from pixels: Corn yield prediction with attention-weighted multiple instance learning

  • paper_url: http://arxiv.org/abs/2312.01001
  • repo_url: None
  • paper_authors: Xiaoyu Wang, Yuchi Ma, Qunying Huang, Zhengwei Yang, Zhou Zhang
  • for: 预测美国县级玉米产量
  • methods: 在像素级别考察每个县,采用多示例学习利用县内的细粒度信息,并通过注意力机制为不同像素自动分配权重,缓解特征数据与作物掩膜分辨率不一致造成的"混合像素"问题
  • results: 所提模型在过去五年的美国玉米带上优于其他四种机器学习模型,并在 2022 年表现最佳,取得 0.84 的决定系数(R2)和 0.83 的均方根误差(RMSE),显示了该方法在空间和时间两个维度上的优势。
    Abstract Remote sensing technology has become a promising tool in yield prediction. Most prior work employs satellite imagery for county-level corn yield prediction by spatially aggregating all pixels within a county into a single value, potentially overlooking the detailed information and valuable insights offered by more granular data. To this end, this research examines each county at the pixel level and applies multiple instance learning to leverage detailed information within a county. In addition, our method addresses the "mixed pixel" issue caused by the inconsistent resolution between feature datasets and crop mask, which may introduce noise into the model and therefore hinder accurate yield prediction. Specifically, the attention mechanism is employed to automatically assign weights to different pixels, which can mitigate the influence of mixed pixels. The experimental results show that the developed model outperforms four other machine learning models over the past five years in the U.S. corn belt and demonstrates its best performance in 2022, achieving a coefficient of determination (R2) value of 0.84 and a root mean square error (RMSE) of 0.83. This paper demonstrates the advantages of our approach from both spatial and temporal perspectives. Furthermore, through an in-depth study of the relationship between mixed pixels and attention, it is verified that our approach can capture critical feature information while filtering out noise from mixed pixels.
    摘要 遥感技术已成为产量预测中一种颇具前景的工具。此前的工作大多利用卫星影像进行县级玉米产量预测:将一个县内的所有像素在空间上聚合为单一数值,这可能忽略更细粒度数据所蕴含的详细信息与宝贵洞见。为此,本研究在像素级别考察每个县,并应用多示例学习来利用县内的细粒度信息。此外,我们的方法还处理了特征数据集与作物掩膜分辨率不一致引起的"混合像素"问题,该问题可能给模型引入噪声,进而妨碍准确的产量预测。具体而言,我们利用注意力机制自动为不同像素分配权重,以减轻混合像素的影响。实验结果表明,所提模型在过去五年的美国玉米带上优于其他四种机器学习模型,并在 2022 年表现最佳,决定系数(R2)达 0.84,均方根误差(RMSE)为 0.83。本文从空间和时间两个角度展示了我们方法的优势。此外,通过深入研究混合像素与注意力之间的关系,我们验证了该方法能够在过滤混合像素噪声的同时捕捉关键特征信息。
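Attention-weighted multiple instance learning aggregates per-pixel (instance) embeddings into a county-level (bag) prediction with learned weights, so noisy mixed pixels can be down-weighted. A minimal PyTorch sketch in the style of attention-based MIL pooling (the dimensions are illustrative, not the paper's network):

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Bag-level regression from a variable number of instance features."""

    def __init__(self, in_dim: int = 16, hidden: int = 32):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.head = nn.Linear(in_dim, 1)  # yield regression head

    def forward(self, instances: torch.Tensor):
        # instances: (n_pixels, in_dim) for one county (bag).
        weights = torch.softmax(self.attn(instances), dim=0)  # (n, 1)
        bag = (weights * instances).sum(dim=0)                # weighted pool
        return self.head(bag), weights.squeeze(-1)

model = AttentionMIL()
pixels = torch.randn(120, 16)  # features of 120 pixels in one county
yield_pred, pixel_weights = model(pixels)
print(yield_pred.shape, pixel_weights.shape)  # torch.Size([1]) torch.Size([120])
```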

cs.CL - 2023-12-02

Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2023): Workshop and Shared Task Report

  • paper_url: http://arxiv.org/abs/2312.01244
  • repo_url: None
  • paper_authors: Ali Hürriyetoğlu, Hristo Tanev, Osman Mutlu, Surendrabikram Thapa, Fiona Anting Tan, Erdem Yörük
  • for: 本研讨会汇集了技术与社会科学领域事件信息收集的各个方面,以推动基于文本的事件抽取研究的进步。
  • methods: 研讨会包括常规论文、三场主旨报告、共享任务参与者的工作论文以及共享任务综述论文。
  • results: 本研讨会为组织多模态事件信息收集任务提供了空间,并为基于文本的事件抽取的进步作出了贡献。
    Abstract We provide a summary of the sixth edition of the CASE workshop that is held in the scope of RANLP 2023. The workshop consists of regular papers, three keynotes, working papers of shared task participants, and shared task overview papers. This workshop series has been bringing together all aspects of event information collection across technical and social science fields. In addition to contributing to the progress in text based event extraction, the workshop provides a space for the organization of a multimodal event information collection task.
    摘要 我们对在 RANLP 2023 框架内举办的第六届 CASE 研讨会进行了总结。研讨会包括常规论文、三场主旨报告、共享任务参与者的工作论文以及共享任务综述论文。该系列研讨会一直汇集技术与社会科学领域事件信息收集的各个方面;除了推动基于文本的事件抽取的进步之外,研讨会还为组织多模态事件信息收集任务提供了空间。

UCE-FID: Using Large Unlabeled, Medium Crowdsourced-Labeled, and Small Expert-Labeled Tweets for Foodborne Illness Detection

  • paper_url: http://arxiv.org/abs/2312.01225
  • repo_url: None
  • paper_authors: Ruofan Hu, Dongyu Zhang, Dandan Tao, Huayi Zhang, Hao Feng, Elke Rundensteiner
  • for: 检测与识别社交媒体上与食源性疾病相关的报告
  • methods: 提出深度学习框架 EGAL,以少量专家标注推文为基础,辅以众包标注推文和海量未标注数据进行增强
  • results: 在不同的专家标注集规模和类别不均衡比例下,EGAL 均优于强基线模型;在一起与包装沙拉菜相关的多州鼠伤寒沙门氏菌感染暴发的案例研究中,该模型捕捉到了提供宝贵暴发洞见的相关推文。
    Abstract Foodborne illnesses significantly impact public health. Deep learning surveillance applications using social media data aim to detect early warning signals. However, labeling foodborne illness-related tweets for model training requires extensive human resources, making it challenging to collect a sufficient number of high-quality labels for tweets within a limited budget. The severe class imbalance resulting from the scarcity of foodborne illness-related tweets among the vast volume of social media further exacerbates the problem. Classifiers trained on a class-imbalanced dataset are biased towards the majority class, making accurate detection difficult. To overcome these challenges, we propose EGAL, a deep learning framework for foodborne illness detection that uses small expert-labeled tweets augmented by crowdsourced-labeled and massive unlabeled data. Specifically, by leveraging tweets labeled by experts as a reward set, EGAL learns to assign a weight of zero to incorrectly labeled tweets to mitigate their negative influence. Other tweets receive proportionate weights to counter-balance the unbalanced class distribution. Extensive experiments on real-world \textit{TWEET-FID} data show that EGAL outperforms strong baseline models across different settings, including varying expert-labeled set sizes and class imbalance ratios. A case study on a multistate outbreak of Salmonella Typhimurium infection linked to packaged salad greens demonstrates how the trained model captures relevant tweets offering valuable outbreak insights. EGAL, funded by the U.S. Department of Agriculture (USDA), has the potential to be deployed for real-time analysis of tweet streaming, contributing to foodborne illness outbreak surveillance efforts.
    摘要 食源性疾病对公共卫生有重大影响。利用社交媒体数据的深度学习监测应用旨在检测早期预警信号。然而,为模型训练标注与食源性疾病相关的推文需要大量人力,在有限预算内收集足够多的高质量标注十分困难;而此类推文在海量社交媒体内容中极为稀少,由此产生的严重类别不均衡进一步加剧了问题:在类别不均衡数据集上训练的分类器会偏向多数类,难以实现准确检测。为克服这些挑战,我们提出了 EGAL,一个利用少量专家标注推文、辅以众包标注和海量未标注数据的食源性疾病检测深度学习框架。具体而言,EGAL 以专家标注推文作为奖励集,学习将权重零分配给被错误标注的推文以消除其负面影响;其余推文则获得成比例的权重,以抵消类别分布的不均衡。在真实世界 TWEET-FID 数据上的大量实验表明,EGAL 在不同设置(包括不同的专家标注集规模和类别不均衡比例)下均优于强基线模型。一项针对与包装沙拉菜相关的多州鼠伤寒沙门氏菌感染暴发的案例研究表明,训练后的模型能够捕捉提供宝贵暴发洞见的相关推文。由美国农业部(USDA)资助的 EGAL 有望被部署用于推文流的实时分析,助力食源性疾病暴发监测工作。

English to Arabic machine translation of mathematical documents

  • paper_url: http://arxiv.org/abs/2312.03753
  • repo_url: None
  • paper_authors: Mustapha Eddahibi, Mohammed Mensouri
  • for: This paper aims to develop a machine translation system for LATEX mathematical documents, specifically tailored for translating English LATEX documents into Arabic LATEX.
  • methods: The proposed system utilizes a Transformer model as the core of the translation system, ensuring enhanced accuracy and fluency in the translated Arabic LATEX documents. Additionally, the system integrates RyDArab, an Arabic mathematical TEX extension, and a rule-based translator for Arabic mathematical expressions.
  • results: The developed system demonstrates efficacy in bridging the language gap in the domain of mathematical documentation, providing precise rendering of complex mathematical symbols and equations in the translated output.
    Abstract This paper is about the development of a machine translation system tailored specifically for LATEX mathematical documents. The system focuses on translating English LATEX mathematical documents into Arabic LATEX, catering to the growing demand for multilingual accessibility in scientific and mathematical literature. With the vast proliferation of LATEX mathematical documents, the need for an efficient and accurate translation system has become increasingly essential. This paper addresses the necessity for a robust translation tool that enables seamless communication and comprehension of complex mathematical content across language barriers. The proposed system leverages a Transformer model as the core of the translation system, ensuring enhanced accuracy and fluency in the translated Arabic LATEX documents. Furthermore, the integration of RyDArab, an Arabic mathematical TEX extension, along with a rule-based translator for Arabic mathematical expressions, contributes to the precise rendering of complex mathematical symbols and equations in the translated output. The paper discusses the architecture and methodology of the developed system, highlighting its efficacy in bridging the language gap in the domain of mathematical documentation.
    摘要 本文介绍了一个专为 LATEX 数学文档定制的机器翻译系统的开发,聚焦于将英文 LATEX 数学文档翻译为阿拉伯文 LATEX,以满足科学与数学文献对多语言可及性日益增长的需求。该系统以 Transformer 模型为核心,确保翻译出的阿拉伯文 LATEX 文档更准确、更流畅;同时集成了阿拉伯数学 TEX 扩展 RyDArab 以及面向阿拉伯数学表达式的基于规则的翻译器,使译文中复杂的数学符号和公式得以精确呈现。文章阐述了所开发系统的架构与方法,并强调其在弥合数学文档领域语言鸿沟方面的有效性;该系统对复杂数学内容的准确翻译有望帮助研究者将成果分享给更广泛的读者,促进数学知识的跨语言传播。

Automatic Scoring of Students’ Science Writing Using Hybrid Neural Network

  • paper_url: http://arxiv.org/abs/2312.03752
  • repo_url: None
  • paper_authors: Ehsan Latif, Xiaoming Zhai
  • for: 该研究探讨了多视角混合神经网络(HNN)结合分析性评分细则,在科学教育中自动评分学生作答的有效性。
  • methods: 我们比较了HNN模型与四种机器学习(BERT、AACR、Naive Bayes和Logistic Regression)方法的准确率。
  • results: 结果显示,HNN模型在五个评分方面(p<0.001)比Naive Bayes、Logistic Regression、AACR和BERT模型高出8%, 3%, 1%和0.12%。总的来说,HNN模型的感知准确率(M = 96.23%, SD = 1.45%)与训练和推理BERT模型的准确率(M = 96.12%, SD = 1.52%)相当。此外,我们发现HNN模型在训练和推理中比BERT模型快二倍,并且与轻量级但较准确的Naive Bayes模型有相同的效率。
    Abstract This study explores the efficacy of a multi-perspective hybrid neural network (HNN) for scoring student responses in science education with an analytic rubric. We compared the accuracy of the HNN model with four ML approaches (BERT, AACR, Naive Bayes, and Logistic Regression). The results show that HNN achieved 8%, 3%, 1%, and 0.12% higher accuracy than Naive Bayes, Logistic Regression, AACR, and BERT, respectively, for five scoring aspects (p<0.001). The overall HNN's perceived accuracy (M = 96.23%, SD = 1.45%) is comparable to the (training and inference) expensive BERT model's accuracy (M = 96.12%, SD = 1.52%). We also observed that HNN is 2x more efficient in training and inferencing than BERT and has comparable efficiency to the lightweight but less accurate Naive Bayes model. Our study confirmed the accuracy and efficiency of using HNN to score students' science writing automatically.

Enabling Quantum Natural Language Processing for Hindi Language

  • paper_url: http://arxiv.org/abs/2312.01221
  • repo_url: None
  • paper_authors: Naman Srivastava, Gaurang Belekar, Sunil Saumya, Aswath Babu H
  • for: This paper aims to apply Quantum Natural Language Processing (QNLP) techniques to Hindi, addressing the shortcomings of classical Natural Language Processing (NLP) techniques and moving towards a more "explainable" NLP system.
  • methods: The paper uses the pregroup representation of Hindi and the DisCoCat framework to draw sentence diagrams, which are then translated into parameterized quantum circuits based on the Instantaneous Quantum Polynomial (IQP) style ansatz.
  • results: The results show that these parameterized quantum circuits can be used to train grammar- and topic-aware sentence classifiers for Hindi.
    Abstract Quantum Natural Language Processing (QNLP) is taking huge leaps in solving the shortcomings of classical Natural Language Processing (NLP) techniques and moving towards a more "Explainable" NLP system. The current literature around QNLP focuses primarily on implementing QNLP techniques in sentences in the English language. In this paper, we propose to enable the QNLP approach for Hindi, which is the third most spoken language in South Asia. We present the process of building the parameterized quantum circuits required to undertake QNLP on Hindi sentences. We use the pregroup representation of Hindi and the DisCoCat framework to draw sentence diagrams. Later, we translate these diagrams to Parameterised Quantum Circuits based on the Instantaneous Quantum Polynomial (IQP) style ansatz. Using these parameterized quantum circuits allows one to train grammar and topic-aware sentence classifiers for the Hindi Language.
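
A minimal sketch of the diagram-to-circuit step is shown below, using the open-source lambeq library. It is illustrative only: lambeq's bundled BobcatParser is English-only, so an English sentence stands in for the Hindi pregroup diagrams the paper assumes, and the qubit counts and layer depth are arbitrary choices rather than the paper's settings.

```python
# Illustrative DisCoCat -> IQP-circuit pipeline with lambeq. An English
# sentence stands in for Hindi (lambeq ships no Hindi pregroup parser);
# one qubit per wire and a single IQP layer are arbitrary choices.
from lambeq import AtomicType, BobcatParser, IQPAnsatz

parser = BobcatParser()                            # pregroup-style parser
diagram = parser.sentence2diagram("Alice prefers quantum models")

ansatz = IQPAnsatz(
    {AtomicType.NOUN: 1, AtomicType.SENTENCE: 1},  # wire -> qubit counts
    n_layers=1,                                    # IQP layers per box
)
circuit = ansatz(diagram)                          # parameterised circuit
print(circuit.free_symbols)                        # trainable parameters
```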

Understanding Opinions Towards Climate Change on Social Media

  • paper_url: http://arxiv.org/abs/2312.01217
  • repo_url: None
  • paper_authors: Yashaswi Pupneja, Joseph Zou, Sacha Lévy, Shenyang Huang
  • for: This study aims to understand how public opinion on climate change develops on social media, in particular how community structure and opinions shift in the discourse around Conference of the Parties (COP) events.
  • methods: The study analyzes a dataset of 13.6 million tweets from 3.6 million Twitter users, applying the Louvain community detection algorithm to the user-user mentions network, together with tools from natural language processing for sentiment analysis and topic modeling.
  • results: The study finds that community structure around climate change topics changes markedly after COP events, along with users' expressed opinions on climate change, and identifies key topics and sentiment shifts such as political polarization and the spread of misinformation.
    Abstract Social media platforms such as Twitter (now known as X) have revolutionized how the public engage with important societal and political topics. Recently, climate change discussions on social media became a catalyst for political polarization and the spreading of misinformation. In this work, we aim to understand how real world events influence the opinions of individuals towards climate change related topics on social media. To this end, we extracted and analyzed a dataset of 13.6 million tweets sent by 3.6 million users from 2006 to 2019. Then, we construct a temporal graph from the user-user mentions network and utilize the Louvain community detection algorithm to analyze the changes in community structure around Conference of the Parties on Climate Change (COP) events. Next, we also apply tools from the Natural Language Processing literature to perform sentiment analysis and topic modeling on the tweets. Our work acts as a first step towards understanding the evolution of pro-climate change communities around COP events. Answering these questions helps us understand how to raise people's awareness towards climate change thus hopefully calling on more individuals to join the collaborative effort in slowing down climate change.
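
The community-detection step is easy to prototype with networkx, as in the hedged sketch below; the mention edge list is invented for illustration, and the paper's temporal analysis would repeat this per time window around each COP event.

```python
# Louvain communities on a user-user mention graph (networkx >= 2.8).
# The edge list is invented for illustration.
import networkx as nx

mentions = [  # (user, mentioned_user, number_of_mentions)
    ("alice", "bob", 3), ("bob", "carol", 1),
    ("carol", "alice", 2), ("dave", "erin", 5),
]
G = nx.Graph()
for src, dst, w in mentions:
    G.add_edge(src, dst, weight=w)

# Repeating this per time window gives the structure around COP events.
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
```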

Here Is Not There: Measuring Entailment-Based Trajectory Similarity for Location-Privacy Protection and Beyond

  • paper_url: http://arxiv.org/abs/2312.01151
  • repo_url: None
  • paper_authors: Zilong Liu, Krzysztof Janowicz, Kitty Currier, Meilin Shi, Jinmeng Rao, Song Gao, Ling Cai, Anita Graser
  • for: This paper discusses the limitations of current trajectory similarity measures, which operate in abstract space, and proposes a new measure based on logical entailment that better accounts for the rich structure of geographic space.
  • methods: The paper uses a case study to formalize entailment-based trajectory similarity and evaluate the effectiveness of the proposed measure using a privacy-preserving trajectory-generation model (LSTM-TrajGAN).
  • results: The paper shows that the proposed entailment-based measure can reveal potential consequences of disregarding the structure of geographic space, such as miscalculated insurance risk due to regional shifts, and highlights the advantage of applying logical entailment to trajectory-similarity reasoning for location-privacy protection and beyond.
    Abstract While the paths humans take play out in social as well as physical space, measures to describe and compare their trajectories are carried out in abstract, typically Euclidean, space. When these measures are applied to trajectories of actual individuals in an application area, alterations that are inconsequential in abstract space may suddenly become problematic once overlaid with geographic reality. In this work, we present a different view on trajectory similarity by introducing a measure that utilizes logical entailment. This is an inferential perspective that considers facts as triple statements deduced from the social and environmental context in which the travel takes place, and their practical implications. We suggest a formalization of entailment-based trajectory similarity, measured as the overlapping proportion of facts, which are spatial relation statements in our case study. With the proposed measure, we evaluate LSTM-TrajGAN, a privacy-preserving trajectory-generation model. The entailment-based model evaluation reveals potential consequences of disregarding the rich structure of geographic space (e.g., miscalculated insurance risk due to regional shifts in our toy example). Our work highlights the advantage of applying logical entailment to trajectory-similarity reasoning for location-privacy protection and beyond.
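
The core quantity, similarity as the overlapping proportion of facts, can be sketched in a few lines. The triples below are invented stand-ins for the spatial-relation statements the paper derives from geographic context, and the Jaccard-style overlap is one plausible reading of "overlapping proportion", not necessarily the paper's exact formalization.

```python
# Entailment-based trajectory similarity as overlap of fact triples.
# The (subject, relation, object) facts are invented stand-ins for the
# spatial-relation statements derived from geographic context.
def entailment_similarity(facts_a: set, facts_b: set) -> float:
    """Overlapping proportion of facts (Jaccard-style; 1.0 = identical)."""
    if not facts_a and not facts_b:
        return 1.0
    return len(facts_a & facts_b) / len(facts_a | facts_b)

original = {("traj1", "crosses", "flood_zone_A"),
            ("traj1", "within", "region_north"),
            ("traj1", "near", "hospital_X")}
generated = {("traj1", "crosses", "flood_zone_A"),
             ("traj1", "within", "region_south"),   # a regional shift
             ("traj1", "near", "hospital_X")}

print(entailment_similarity(original, generated))   # 0.5
```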

Towards leveraging LLMs for Conditional QA

  • paper_url: http://arxiv.org/abs/2312.01143
  • repo_url: None
  • paper_authors: Syed-Amad Hussain, Parag Pravin Dakle, SaiKrishna Rallabandi, Preethi Raghavan
  • for: This study examines the capabilities and limitations of large language models (LLMs) on the challenging task of conditional question answering.
  • methods: The study uses the Conditional Question Answering (CQA) dataset, focusing on the generative models T5 and UL2, and evaluates LLM performance across diverse question types.
  • results: The study finds that fine-tuned LLMs can surpass state-of-the-art (SOTA) performance in some cases, even without fully encoding the input context, with gains of 7-8 points in EM and F1 scores. However, these models struggle with extractive question answering, lagging behind the SOTA by over 10 points, and with avoiding the injection of false information.
    Abstract This study delves into the capabilities and limitations of Large Language Models (LLMs) in the challenging domain of conditional question-answering. Utilizing the Conditional Question Answering (CQA) dataset and focusing on generative models like T5 and UL2, we assess the performance of LLMs across diverse question types. Our findings reveal that fine-tuned LLMs can surpass the state-of-the-art (SOTA) performance in some cases, even without fully encoding all input context, with an increase of 7-8 points in Exact Match (EM) and F1 scores for Yes/No questions. However, these models encounter challenges in extractive question answering, where they lag behind the SOTA by over 10 points, and in mitigating the risk of injecting false information. A study with oracle-retrievers emphasizes the critical role of effective evidence retrieval, underscoring the necessity for advanced solutions in this area. Furthermore, we highlight the significant influence of evaluation metrics on performance assessments and advocate for a more comprehensive evaluation framework. The complexity of the task, the observed performance discrepancies, and the need for effective evidence retrieval underline the ongoing challenges in this field and underscore the need for future work focusing on refining training tasks and exploring prompt-based techniques to enhance LLM performance in conditional question-answering tasks.

TURead: An eye movement dataset of Turkish reading

  • paper_url: http://arxiv.org/abs/2312.01114
  • repo_url: None
  • paper_authors: Cengiz Acarturk, Aysegul Ozkan, Tugce Nur Pekcetin, Zuhal Ormanoglu, Bilal Kirkici
  • for: This study investigates reading behavior in Turkish and the relationship between morphology and oculomotor control.
  • methods: The study employs a target-word approach, manipulating target words by length and by the addition of two commonly used Turkish suffixes. It draws on well-established eye movement variables, prelexical characteristics such as bigram-trigram frequencies, word features such as length and predictability, eye-voice span measures, Cloze test scores of root-word and suffix predictabilities, and scores from two working memory tests.
  • results: The findings on fixation parameters and word characteristics are consistent with patterns reported in the literature.
    Abstract In this study, we present TURead, an eye movement dataset of silent and oral sentence reading in Turkish, an agglutinative language with a shallow orthography understudied in reading research. TURead provides empirical data to investigate the relationship between morphology and oculomotor control. We employ a target-word approach in which target words are manipulated by word length and by the addition of two commonly used suffixes in Turkish. The dataset contains well-established eye movement variables; prelexical characteristics such as vowel harmony and bigram-trigram frequencies and word features, such as word length, predictability, frequency, eye voice span measures, Cloze test scores of the root word and suffix predictabilities, as well as the scores obtained from two working memory tests. Our findings on fixation parameters and word characteristics are in line with the patterns reported in the relevant literature.

Which linguistic cues make people fall for fake news? A comparison of cognitive and affective processing

  • paper_url: http://arxiv.org/abs/2312.03751
  • repo_url: None
  • paper_authors: Bernhard Lutz, Marc Adam, Stefan Feuerriegel, Nicolas Pröllochs, Dirk Neumann
  • for: This study aims to understand why people fall for fake news, and hence how to design effective countermeasures on social media.
  • methods: A within-subject experiment collected neurophysiological measurements from 42 participants while they read and evaluated 40 real and fake news articles.
  • results: Users engage in more cognitive processing for longer fake news articles, while affective processing is more pronounced for fake news written in analytic words. This is the first study of the role of linguistic cues in fake news processing, and the findings have important implications for designing online platforms that encourage users to engage in careful thinking and thus avoid falling for fake news.
    Abstract Fake news on social media has large, negative implications for society. However, little is known about what linguistic cues make people fall for fake news and, hence, how to design effective countermeasures for social media. In this study, we seek to understand which linguistic cues make people fall for fake news. Linguistic cues (e.g., adverbs, personal pronouns, positive emotion words, negative emotion words) are important characteristics of any text and also affect how people process real vs. fake news. Specifically, we compare the role of linguistic cues across both cognitive processing (related to careful thinking) and affective processing (related to unconscious automatic evaluations). To this end, we performed a within-subject experiment where we collected neurophysiological measurements of 42 subjects while these read a sample of 40 real and fake news articles. During our experiment, we measured cognitive processing through eye fixations, and affective processing in situ through heart rate variability. We find that users engage more in cognitive processing for longer fake news articles, while affective processing is more pronounced for fake news written in analytic words. To the best of our knowledge, this is the first work studying the role of linguistic cues in fake news processing. Altogether, our findings have important implications for designing online platforms that encourage users to engage in careful thinking and thus prevent them from falling for fake news.

End-to-End Speech-to-Text Translation: A Survey

  • paper_url: http://arxiv.org/abs/2312.01053
  • repo_url: None
  • paper_authors: Nivedita Sethiya, Chandresh Kumar Maurya
  • for: This paper surveys end-to-end speech-to-text translation models, to help researchers develop and apply such models.
  • methods: The paper reviews existing automatic speech recognition (ASR) and machine translation (MT) models as well as end-to-end (E2E) models, together with the metrics and datasets used to evaluate and compare them.
  • results: The survey provides a comprehensive review of models, metrics, and datasets for ST tasks, and identifies challenges and future research directions with new insights.
    Abstract Speech-to-text translation pertains to the task of converting speech signals in a language to text in another language. It finds its application in various domains, such as hands-free communication, dictation, video lecture transcription, and translation, to name a few. Automatic Speech Recognition (ASR), as well as Machine Translation (MT) models, play crucial roles in traditional ST translation, enabling the conversion of spoken language in its original form to written text and facilitating seamless cross-lingual communication. ASR recognizes spoken words, while MT translates the transcribed text into the target language. Such disintegrated models suffer from cascaded error propagation and high resource and training costs. As a result, researchers have been exploring end-to-end (E2E) models for ST translation. However, to our knowledge, there is no comprehensive review of existing works on E2E ST. The present survey, therefore, discusses the work in this direction. Our attempt has been to provide a comprehensive review of models employed, metrics, and datasets used for ST tasks, providing challenges and future research direction with new insights. We believe this review will be helpful to researchers working on various applications of ST models.

Structured, Complex and Time-complete Temporal Event Forecasting

  • paper_url: http://arxiv.org/abs/2312.01052
  • repo_url: https://github.com/yecchen/gdelt-complexevent
  • paper_authors: Yunshan Ma, Chenchen Ye, Zijian Wu, Xiang Wang, Yixin Cao, Liang Pang, Tat-Seng Chua
  • for: This work proposes a new formulation of temporal events to improve their representation quality and forecasting ability.
  • methods: A simple and fully automated pipeline constructs Structured, Complex, and Time-complete Temporal Events (SCTc-TE) from a large number of news articles; in addition, a forecasting model named LoGo is proposed that leverages both local and global contexts.
  • results: Experiments on two large-scale datasets, MidEast-TE and GDELT-TE, show that the LoGo model achieves strong forecasting performance along with multiple other advantages.
    Abstract Temporal event forecasting aims to predict what will happen next given the observed events in history. Previous formulations of temporal events are unstructured, atomic, or lacking full temporal information, thus largely restricting the representation quality and forecasting ability of temporal events. To address these limitations, we introduce a novel formulation for Structured, Complex, and Time-complete Temporal Event (SCTc-TE). Based on this new formulation, we develop a simple and fully automated pipeline for constructing such SCTc-TEs from a large amount of news articles. Furthermore, we propose a novel model that leverages both Local and Global contexts for SCTc-TE forecasting, named LoGo. To evaluate our model, we construct two large-scale datasets named MidEast-TE and GDELT-TE. Extensive evaluations demonstrate the advantages of our datasets in multiple aspects, while experimental results justify the effectiveness of our forecasting model LoGo. We release the code and dataset via https://github.com/yecchen/GDELT-ComplexEvent.

Detection and Analysis of Stress-Related Posts in Reddit Academic Communities

  • paper_url: http://arxiv.org/abs/2312.01050
  • repo_url: None
  • paper_authors: Nazzere Oryngozha, Pakizar Shamoi, Ayan Igali
  • for: This paper aims to detect and analyze stress-related posts in Reddit academic communities, with the goal of understanding the stress levels within these communities and developing measures to address the issue effectively.
  • methods: The authors use natural language processing and machine learning techniques, specifically the Bag of Words and Logistic Regression classifier, to classify text as stressed or not. They use a dataset of labeled posts from Reddit (DReaddit) as their training set, and also collect and analyze posts from various academic subreddits.
  • results: The authors find that the most effective individual feature for stress detection is the Bag of Words, paired with the Logistic Regression classifier, which achieves an accuracy rate of 77.78% and an F1 score of 0.79 on the DReaddit dataset. They also find that posts and comments in professors’ Reddit communities are the most stressful, compared to other academic levels.
    Abstract Nowadays, the significance of monitoring stress levels and recognizing early signs of mental illness cannot be overstated. Automatic stress detection in text can proactively help manage stress and protect mental well-being. In today's digital era, social media platforms reflect the psychological well-being and stress levels within various communities. This study focuses on detecting and analyzing stress-related posts in Reddit academic communities. Due to online education and remote work, these communities have become central for academic discussions and support. We classify text as stressed or not using natural language processing and machine learning classifiers, with Dreaddit as our training dataset, which contains labeled data from Reddit. Next, we collect and analyze posts from various academic subreddits. We identified that the most effective individual feature for stress detection is the Bag of Words, paired with the Logistic Regression classifier, achieving a 77.78% accuracy rate and an F1 score of 0.79 on the Dreaddit dataset. This combination also performs best in stress detection on human-annotated datasets, with a 72% accuracy rate. Our key findings reveal that posts and comments in professors' Reddit communities are the most stressful, compared to other academic levels, including bachelor, graduate, and Ph.D. students. This research contributes to our understanding of the stress levels within academic communities. It can help academic institutions and online communities develop measures and interventions to address this issue effectively.
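
The winning feature/classifier pair reported above is straightforward to reproduce with scikit-learn. The sketch below assumes a Dreaddit-style CSV with `text` and `label` columns; the file name and column names are hypothetical placeholders, not the dataset's actual schema.

```python
# Bag of Words + Logistic Regression stress classifier (scikit-learn).
# "dreaddit.csv" and its column names are hypothetical placeholders.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("dreaddit.csv")              # columns: text, label (0/1)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42)

model = make_pipeline(
    CountVectorizer(),                        # Bag of Words features
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("F1:", f1_score(y_test, pred))
```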

Large Language Models Are Zero-Shot Text Classifiers

  • paper_url: http://arxiv.org/abs/2312.01044
  • repo_url: https://github.com/yeyimilk/llm-zero-shot-classifiers
  • paper_authors: Zhiqiang Wang, Yiran Pang, Yanbin Lin
  • for: This paper validates the capability of GPT models on text classification tasks.
  • methods: The paper implements LLMs using zero-shot learning (ZSL) with step-by-step chain-of-thought (CoT) prompts, instead of conventional question-and-answer formats, and compares them against traditional machine learning, deep learning, and ZSL methods.
  • results: Experimental results show that LLMs perform well as zero-shot text classifiers on three of the four evaluated datasets, which is especially advantageous for small businesses or teams that may lack extensive text classification expertise.
    Abstract Pretrained large language models (LLMs) have become extensively used across various sub-disciplines of natural language processing (NLP). In NLP, text classification problems have garnered considerable focus, but still face limitations related to expensive computational cost, time consumption, and robust performance to unseen classes. With the proposal of chain of thought prompting (CoT), LLMs can be implemented using zero-shot learning (ZSL) with step-by-step reasoning prompts, instead of conventional question and answer formats. The zero-shot LLMs in the text classification problems can alleviate these limitations by directly utilizing pretrained models to predict both seen and unseen classes. Our research primarily validates the capability of GPT models in text classification. We focus on effectively utilizing prompt strategies in various text classification scenarios. Besides, we compare the performance of zero-shot LLMs with other state-of-the-art text classification methods, including traditional machine learning methods, deep learning methods, and ZSL methods. Experimental results demonstrate that the performance of LLMs underscores their effectiveness as zero-shot text classifiers in three of the four datasets analyzed. The proficiency is especially advantageous for small businesses or teams that may not have extensive knowledge in text classification.
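
The zero-shot chain-of-thought setup can be illustrated with a small prompt builder. The template, label handling, and `call_llm` hook below are hypothetical stand-ins, not the paper's exact prompts or API.

```python
# Zero-shot CoT text classification sketch. The prompt template and the
# call_llm() hook are hypothetical; the paper's prompts may differ.
def build_prompt(text, labels):
    return (
        f"Classify the text into one of: {', '.join(labels)}.\n"
        f"Text: {text}\n"
        "Let's think step by step, then answer with only the label."
    )

def classify(text, labels, call_llm):
    reply = call_llm(build_prompt(text, labels))
    # Take the last label mentioned after the step-by-step reasoning.
    found = [label for label in labels if label.lower() in reply.lower()]
    return found[-1] if found else "unknown"

# Usage: classify(review, ["positive", "negative", "neutral"], my_llm_call)
```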

From Beginner to Expert: Modeling Medical Knowledge into General LLMs

  • paper_url: http://arxiv.org/abs/2312.01040
  • repo_url: None
  • paper_authors: Qiang Li, Xiaoyan Yang, Haowen Wang, Qin Wang, Lei Liu, Junjie Wang, Yang Zhang, Mingyuan Chu, Sen Hu, Yicheng Chen, Yue Shen, Cong Fan, Wangshu Zhang, Teng Xu, Jinjie Gu, Jing Zheng, Guannan Zhang Ant Group
  • for: This work addresses the challenge of adapting large language models (LLMs) to sensitive applications such as medical knowledge reasoning and physician-style question answering.
  • methods: A three-stage optimization procedure is used: general medical knowledge injection, medical domain instruction tuning, and adaptation to specific medical tasks.
  • results: The resulting AntGLM-Med-10B model performs strongly on PubMedQA, outperforming most LLMs, both general and medical, even those with larger model sizes.
    Abstract Recently, large language model (LLM) based artificial intelligence (AI) systems have demonstrated remarkable capabilities in natural language understanding and generation. However, these models face a significant challenge when it comes to sensitive applications, such as reasoning over medical knowledge and answering medical questions in a physician-like manner. Prior studies attempted to overcome this challenge by increasing the model size (>100B) to learn more general medical knowledge, while there is still room for improvement in LLMs with smaller-scale model sizes (<100B). In this work, we start from a pre-trained general LLM model (AntGLM-10B) and fine-tune it from a medical beginner towards a medical expert (called AntGLM-Med-10B), which leverages a 3-stage optimization procedure, \textit{i.e.}, general medical knowledge injection, medical domain instruction tuning, and specific medical task adaptation. Our contributions are threefold: (1) We specifically investigate how to adapt a pre-trained general LLM in medical domain, especially for a specific medical task. (2) We collect and construct large-scale medical datasets for each stage of the optimization process. These datasets encompass various data types and tasks, such as question-answering, medical reasoning, multi-choice questions, and medical conversations. (3) Specifically for multi-choice questions in the medical domain, we propose a novel Verification-of-Choice approach for prompting engineering, which significantly enhances the reasoning ability of LLMs. Remarkably, by combining the above approaches, our AntGLM-Med-10B model can outperform the most of LLMs on PubMedQA, including both general and medical LLMs, even when these LLMs have larger model size.

Dual-Teacher De-biasing Distillation Framework for Multi-domain Fake News Detection

  • paper_url: http://arxiv.org/abs/2312.01006
  • repo_url: https://github.com/ningljy/dtdbd
  • paper_authors: Jiayang Li, Xuan Feng, Tianlong Gu, Liang Chang
  • for: Multi-domain fake news detection aims to identify whether news from different domains is real or fake, which has become an urgent and important problem.
  • methods: The authors propose the Dual-Teacher De-biasing Distillation framework (DTDBD) to address the domain bias problem. DTDBD adopts a teacher-student structure in which pre-trained large teachers instruct a student model; in particular, an unbiased teacher and a clean teacher jointly guide the student model in mitigating domain bias while maintaining performance.
  • results: Extensive experiments on Chinese and English datasets show that, compared with baseline methods, the approach reduces domain bias metrics while maintaining competitive performance.
    Abstract Multi-domain fake news detection aims to identify whether various news from different domains is real or fake and has become urgent and important. However, existing methods are dedicated to improving the overall performance of fake news detection, ignoring the fact that unbalanced data leads to disparate treatment for different domains, i.e., the domain bias problem. To solve this problem, we propose the Dual-Teacher De-biasing Distillation framework (DTDBD) to mitigate bias across different domains. Following the knowledge distillation methods, DTDBD adopts a teacher-student structure, where pre-trained large teachers instruct a student model. In particular, the DTDBD consists of an unbiased teacher and a clean teacher that jointly guide the student model in mitigating domain bias and maintaining performance. For the unbiased teacher, we introduce an adversarial de-biasing distillation loss to instruct the student model in learning unbiased domain knowledge. For the clean teacher, we design domain knowledge distillation loss, which effectively incentivizes the student model to focus on representing domain features while maintaining performance. Moreover, we present a momentum-based dynamic adjustment algorithm to trade off the effects of two teachers. Extensive experiments on Chinese and English datasets show that the proposed method substantially outperforms the state-of-the-art baseline methods in terms of bias metrics while guaranteeing competitive performance.
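
Schematically, the student learns from three signals: task labels, the unbiased teacher, and the clean teacher. The PyTorch sketch below is a loose reading of that structure; the plain KL terms and fixed weights stand in for the paper's adversarial de-biasing loss, domain knowledge distillation loss, and momentum-based weight adjustment.

```python
# Schematic dual-teacher distillation step. Plain KL terms and fixed
# weights stand in for DTDBD's adversarial de-biasing / domain-knowledge
# losses and its momentum-based weight adjustment.
import torch
import torch.nn.functional as F

def dtdbd_loss(student_logits, unbiased_logits, clean_logits, labels,
               w_unbiased=0.5, w_clean=0.5, temperature=2.0):
    task = F.cross_entropy(student_logits, labels)
    log_p = lambda z: F.log_softmax(z / temperature, dim=-1)
    kl = lambda teacher: F.kl_div(log_p(student_logits), log_p(teacher),
                                  log_target=True, reduction="batchmean")
    return task + w_unbiased * kl(unbiased_logits) + w_clean * kl(clean_logits)

student_logits = torch.randn(8, 2, requires_grad=True)     # fake batch
unbiased_logits, clean_logits = torch.randn(8, 2), torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
dtdbd_loss(student_logits, unbiased_logits, clean_logits, labels).backward()
```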

cs.LG - 2023-12-02

A deep learning pipeline for cross-sectional and longitudinal multiview data integration

  • paper_url: http://arxiv.org/abs/2312.01238
  • repo_url: https://github.com/lasandrall/deepida-gru
  • paper_authors: Sarthak Jain, Sandra E. Safo
  • for: This paper aims to develop a pipeline for integrating cross-sectional and longitudinal data from multiple sources to better understand the pathobiology of complex diseases.
  • methods: The pipeline uses a combination of statistical and deep learning methods, including variable selection/ranking, feature extraction, and joint integration and classification.
  • results: The pipeline was applied to a case study of inflammatory bowel disease (IBD) and identified microbial pathways, metabolites, and genes that discriminate between IBD status, providing insights into the etiology of the disease. The authors also conducted simulations to compare the performance of different feature extraction methods.
    Abstract Biomedical research now commonly integrates diverse data types or views from the same individuals to better understand the pathobiology of complex diseases, but the challenge lies in meaningfully integrating these diverse views. Existing methods often require the same type of data from all views (cross-sectional data only or longitudinal data only) or do not consider any class outcome in the integration method, presenting limitations. To overcome these limitations, we have developed a pipeline that harnesses the power of statistical and deep learning methods to integrate cross-sectional and longitudinal data from multiple sources. Additionally, it identifies key variables contributing to the association between views and the separation among classes, providing deeper biological insights. This pipeline includes variable selection/ranking using linear and nonlinear methods, feature extraction using functional principal component analysis and Euler characteristics, and joint integration and classification using dense feed-forward networks and recurrent neural networks. We applied this pipeline to cross-sectional and longitudinal multi-omics data (metagenomics, transcriptomics, and metabolomics) from an inflammatory bowel disease (IBD) study and we identified microbial pathways, metabolites, and genes that discriminate by IBD status, providing information on the etiology of IBD. We conducted simulations to compare the two feature extraction methods. The proposed pipeline is available from the following GitHub repository: https://github.com/lasandrall/DeepIDA-GRU.

Evetac: An Event-based Optical Tactile Sensor for Robotic Manipulation

  • paper_url: http://arxiv.org/abs/2312.01236
  • repo_url: None
  • paper_authors: Niklas Funk, Erik Helmut, Georgia Chalvatzaki, Roberto Calandra, Jan Peters
  • for: This work aims to develop an event-based optical tactile sensor to improve the temporal resolution of tactile sensing.
  • methods: The researchers replace the RGB camera with an event-based camera in a new optical tactile sensor called Evetac, and develop touch processing algorithms that handle its measurements online at 1000 Hz.
  • results: In benchmarks, Evetac senses vibrations up to 498 Hz, reconstructs shear forces, and significantly reduces data rates compared to RGB optical tactile sensors. Its output and marker tracking also provide useful features for learning data-driven slip detection and prediction models.
    Abstract Optical tactile sensors have recently become popular. They provide high spatial resolution, but struggle to offer fine temporal resolutions. To overcome this shortcoming, we study the idea of replacing the RGB camera with an event-based camera and introduce a new event-based optical tactile sensor called Evetac. Along with hardware design, we develop touch processing algorithms to process its measurements online at 1000 Hz. We devise an efficient algorithm to track the elastomer's deformation through the imprinted markers despite the sensor's sparse output. Benchmarking experiments demonstrate Evetac's capabilities of sensing vibrations up to 498 Hz, reconstructing shear forces, and significantly reducing data rates compared to RGB optical tactile sensors. Moreover, Evetac's output and the marker tracking provide meaningful features for learning data-driven slip detection and prediction models. The learned models form the basis for a robust and adaptive closed-loop grasp controller capable of handling a wide range of objects. We believe that fast and efficient event-based tactile sensors like Evetac will be essential for bringing human-like manipulation capabilities to robotics. The sensor design is open-sourced at https://sites.google.com/view/evetac .

Can We Learn Communication-Efficient Optimizers?

  • paper_url: http://arxiv.org/abs/2312.02204
  • repo_url: None
  • paper_authors: Charles-Étienne Joseph, Benjamin Thérien, Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky
  • for: This work investigates whether recent progress in learned optimizers can close the gap between local SGD and state-of-the-art adaptive optimizers for deep learning, while maintaining communication efficiency.
  • methods: The authors use meta-learning to learn how to perform global updates given the updates from local SGD iterations.
  • results: The results show that learned optimizers can substantially outperform local SGD and its sophisticated variants, and can generalize to unseen, much larger datasets and architectures, as well as to unseen modalities such as language modeling.
    Abstract Communication-efficient variants of SGD, specifically local SGD, have received a great deal of interest in recent years. These approaches compute multiple gradient steps locally, that is on each worker, before averaging model parameters, helping relieve the critical communication bottleneck in distributed deep learning training. Although many variants of these approaches have been proposed, they can sometimes lag behind state-of-the-art adaptive optimizers for deep learning. In this work, we investigate if the recent progress in the emerging area of learned optimizers can potentially close this gap while remaining communication-efficient. Specifically, we meta-learn how to perform global updates given an update from local SGD iterations. Our results demonstrate that learned optimizers can substantially outperform local SGD and its sophisticated variants while maintaining their communication efficiency. Learned optimizers can even generalize to unseen and much larger datasets and architectures, including ImageNet and ViTs, and to unseen modalities such as language modeling. We therefore demonstrate the potential of learned optimizers for improving communication-efficient distributed learning.
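
The overall loop, several local SGD steps per worker followed by one learned global update, can be written schematically as below. The tiny per-coordinate network standing in for the learned optimizer is our own invention, and its meta-training (learning to optimize across tasks) is omitted.

```python
# Schematic: K local SGD steps per worker, then one learned global update.
# The per-coordinate "learned_opt" network is a stand-in; its meta-training
# is omitted here.
import copy
import torch

def local_sgd_round(global_model, workers_data, learned_opt, k=4, lr=0.1):
    deltas = []
    for X, y in workers_data:                    # one (X, y) per worker
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        for _ in range(k):                       # K local steps, no comms
            opt.zero_grad()
            torch.nn.functional.mse_loss(local(X), y).backward()
            opt.step()
        deltas.append([lp.data - gp.data for lp, gp in
                       zip(local.parameters(), global_model.parameters())])
    # One communication round: average deltas, map through the learned opt.
    with torch.no_grad():
        for i, p in enumerate(global_model.parameters()):
            avg = torch.stack([d[i] for d in deltas]).mean(0)
            p.add_(learned_opt(avg.unsqueeze(-1)).squeeze(-1))

model = torch.nn.Linear(3, 1)
learned_opt = torch.nn.Sequential(               # delta -> global update
    torch.nn.Linear(1, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1))
workers = [(torch.randn(16, 3), torch.randn(16, 1)) for _ in range(4)]
local_sgd_round(model, workers, learned_opt)
```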

Learning High-Order Relationships of Brain Regions

  • paper_url: http://arxiv.org/abs/2312.02203
  • repo_url: None
  • paper_authors: Weikang Qiu, Huangrui Chu, Selena Wang, Haolan Zuo, Xiaoxiao Li, Yize Zhao, Rex Ying
  • for: This work aims to discover reliable and informative high-order interactions among brain regions from functional magnetic resonance imaging (fMRI) signals, to better support neuroscientific predictions of cognition.
  • methods: The proposed method, HyBRiD, uses a Constructor to identify hyperedge structures and a Weighter to compute a weight for each hyperedge; it achieves the maximally informative, minimally redundant (MIMR) objective through a novel information bottleneck framework named multi-head drop-bottleneck, with theoretical guarantees.
  • results: Comprehensive experiments show that HyBRiD effectively extracts MIMR high-order relationships, outperforming the state-of-the-art predictive model by an average of 12.1% on hyperedge quality as measured by CPM, a standard protocol for studying brain connections.
    Abstract Discovering reliable and informative interactions among brain regions from functional magnetic resonance imaging (fMRI) signals is essential in neuroscientific predictions of cognition. Most of the current methods fail to accurately characterize those interactions because they only focus on pairwise connections and overlook the high-order relationships of brain regions. We delve into this problem and argue that these high-order relationships should be maximally informative and minimally redundant (MIMR). However, identifying such high-order relationships is challenging and highly under-explored. Methods that can be tailored to our context are also non-existent. In response to this gap, we propose a novel method named HyBRiD that aims to extract MIMR high-order relationships from fMRI data. HyBRiD employs a Constructor to identify hyperedge structures, and a Weighter to compute a weight for each hyperedge. HyBRiD achieves the MIMR objective through an innovative information bottleneck framework named multi-head drop-bottleneck with theoretical guarantees. Our comprehensive experiments demonstrate the effectiveness of our model. Our model outperforms the state-of-the-art predictive model by an average of 12.1%, regarding the quality of hyperedges measured by CPM, a standard protocol for studying brain connections.

Distributed Bayesian Estimation in Sensor Networks: Consensus on Marginal Densities

  • paper_url: http://arxiv.org/abs/2312.01227
  • repo_url: None
  • paper_authors: Parth Paritosh, Nikolay Atanasov, Sonia Martinez
  • for: This paper designs and analyzes distributed Bayesian estimation algorithms for sensor networks. The challenges addressed are (i) deriving a distributed, provably correct algorithm in the functional space of probability distributions over continuous variables, and (ii) leveraging these results to obtain new distributed estimators restricted to subsets of variables observed by individual agents.
  • methods: The paper presents Bayesian density estimation algorithms using data from nonlinear likelihoods at agents in centralized, distributed, and marginal distributed settings; after setting up a distributed estimation objective, the authors prove almost-sure convergence to the optimal set of pdfs at each agent.
  • results: A Gaussian version of these algorithms is presented and implemented on a mapping problem, using variational inference to handle the nonlinear likelihood models associated with LiDAR sensing.
    Abstract In this paper, we aim to design and analyze distributed Bayesian estimation algorithms for sensor networks. The challenges we address are to (i) derive a distributed provably-correct algorithm in the functional space of probability distributions over continuous variables, and (ii) leverage these results to obtain new distributed estimators restricted to subsets of variables observed by individual agents. This relates to applications such as cooperative localization and federated learning, where the data collected at any agent depends on a subset of all variables of interest. We present Bayesian density estimation algorithms using data from non-linear likelihoods at agents in centralized, distributed, and marginal distributed settings. After setting up a distributed estimation objective, we prove almost-sure convergence to the optimal set of pdfs at each agent. Then, we prove the same for a storage-aware algorithm estimating densities only over relevant variables at each agent. Finally, we present a Gaussian version of these algorithms and implement it in a mapping problem using variational inference to handle non-linear likelihood models associated with LiDAR sensing.
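
In the Gaussian special case, the consensus-style fusion of neighbors' beliefs has a closed form: a weighted geometric mean of Gaussian pdfs is again Gaussian with precision-weighted parameters. The sketch below shows only that fusion step for a scalar variable; the likelihood updates and the marginal (subset-of-variables) estimators from the paper are left out.

```python
# Fusing neighbors' Gaussian beliefs by a weighted geometric mean, which
# is again Gaussian with precision-weighted parameters. Likelihood updates
# and marginal estimation are omitted.
def fuse_gaussians(means, variances, weights):
    """Weighted geometric mean of N(mean_i, var_i); weights sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    precision = sum(w / v for w, v in zip(weights, variances))
    mean = sum(w * m / v for w, m, v in zip(weights, means, variances))
    return mean / precision, 1.0 / precision     # fused mean, variance

# Three agents' current beliefs about the same continuous variable:
mu, var = fuse_gaussians(means=[0.0, 1.0, 2.0],
                         variances=[1.0, 0.5, 2.0],
                         weights=[1/3, 1/3, 1/3])
print(f"fused belief: N({mu:.3f}, {var:.3f})")
```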

When accurate prediction models yield harmful self-fulfilling prophecies

  • paper_url: http://arxiv.org/abs/2312.01210
  • repo_url: None
  • paper_authors: Wouter A. C. van Amsterdam, Nan van Geloven, Jesse H. Krijthe, Rajesh Ranganath, Giovanni Ciná
  • for: This paper studies prediction models in medical research and practice, emphasizing both their usefulness and their potential harms in medical decision making.
  • methods: The paper uses formal mathematical and statistical analysis to characterize the performance of prediction models when they are used to inform decisions.
  • results: The paper shows that using prediction models for medical decision making can lead to harmful decisions even when the model retains good predictive performance after deployment. These results indicate that standard practices for the validation, deployment, and evaluation of prediction models used in medical decisions need to be revised to avoid such harm.
    Abstract Prediction models are popular in medical research and practice. By predicting an outcome of interest for specific patients, these models may help inform difficult treatment decisions, and are often hailed as the poster children for personalized, data-driven healthcare. We show however, that using prediction models for decision making can lead to harmful decisions, even when the predictions exhibit good discrimination after deployment. These models are harmful self-fulfilling prophecies: their deployment harms a group of patients but the worse outcome of these patients does not invalidate the predictive power of the model. Our main result is a formal characterization of a set of such prediction models. Next we show that models that are well calibrated before and after deployment are useless for decision making as they made no change in the data distribution. These results point to the need to revise standard practices for validation, deployment and evaluation of prediction models that are used in medical decisions.

Short-term Precipitation Forecasting in The Netherlands: An Application of Convolutional LSTM neural networks to weather radar data

  • paper_url: http://arxiv.org/abs/2312.01197
  • repo_url: https://github.com/petrosdemetrakopoulos/lstm-radar-precipitation-forecast
  • paper_authors: Petros Demetrakopoulos
  • for: Short-term precipitation forecasting for weather prediction.
  • methods: Convolutional Long Short-Term Memory (ConvLSTM) neural networks are applied to weather radar data from the Royal Netherlands Meteorological Institute (KNMI).
  • results: The model achieves high accuracy in predicting the direction and intensity of precipitation movements.
    Abstract This work addresses the challenge of short-term precipitation forecasting by applying Convolutional Long Short-Term Memory (ConvLSTM) neural networks to weather radar data from the Royal Netherlands Meteorological Institute (KNMI). The research exploits the combination of Convolutional Neural Networks (CNNs) layers for spatial pattern recognition and LSTM network layers for modelling temporal sequences, integrating these strengths into a ConvLSTM architecture. The model was trained and validated on weather radar data from the Netherlands. The model is an autoencoder consisting of nine layers, uniquely combining convolutional operations with LSTMs temporal processing, enabling it to capture the movement and intensity of precipitation systems. The training set comprised of sequences of radar images, with the model being tasked to predict precipitation patterns 1.5 hours ahead using the preceding data. Results indicate high accuracy in predicting the direction and intensity of precipitation movements. The findings of this study underscore the significant potential of ConvLSTM networks in meteorological forecasting, particularly in regions with complex weather patterns. It contributes to the field by offering a more accurate, data-driven approach to weather prediction, highlighting the broader applicability of ConvLSTM networks in meteorological tasks.
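
A ConvLSTM nowcasting model of this kind takes only a few lines of Keras. The sketch below is a generic next-frame architecture rather than the paper's exact nine-layer autoencoder: the 64x64 single-channel frames, filter counts, and depth are illustrative assumptions.

```python
# Generic ConvLSTM next-frame model for radar nowcasting. Frame size,
# filter counts, and depth are illustrative, not the paper's exact
# nine-layer autoencoder.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(None, 64, 64, 1)),       # (time, H, W, channels)
    layers.ConvLSTM2D(32, (3, 3), padding="same", return_sequences=True),
    layers.BatchNormalization(),
    layers.ConvLSTM2D(32, (3, 3), padding="same", return_sequences=True),
    layers.BatchNormalization(),
    layers.Conv3D(1, (3, 3, 3), padding="same", activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Train on shifted sequences: frames up to t predict frames up to t+1.
frames = np.random.rand(8, 10, 64, 64, 1).astype("float32")  # fake radar
model.fit(frames[:, :-1], frames[:, 1:], epochs=1, verbose=0)
```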

Exploring a Hybrid Deep Learning Framework to Automatically Discover Topic and Sentiment in COVID-19 Tweets

  • paper_url: http://arxiv.org/abs/2312.01178
  • repo_url: None
  • paper_authors: Khandaker Tayef Shahriar, Iqbal H. Sarker
  • for: This study proposes a new framework for analyzing topic-based sentiment in COVID-19 tweets.
  • methods: A hybrid deep learning model combining BiLSTM and GRU structures is proposed for sentiment analysis.
  • results: Experimental results show that the topic-label extraction method produces better topic labels, and the proposed hybrid deep learning model achieves the highest accuracy in sentiment analysis.
    Abstract COVID-19 has created a major public health problem worldwide, along with other problems such as economic crisis, unemployment, and mental distress. The pandemic is deadly worldwide and affects many people not only through infection but also through stress, worry, fear, resentment, and hatred. Twitter is a highly influential social media platform and a significant source of health-related information, news, opinion and public sentiment where information is shared by both citizens and government sources. Therefore an effective analysis of COVID-19 tweets is essential for policymakers to make wise decisions. However, it is challenging to identify interesting and useful content from major streams of text to understand people's feelings about the important topics of the COVID-19 tweets. In this paper, we propose a new framework for analyzing topic-based sentiments by extracting key topics with significant labels and classifying positive, negative, or neutral tweets on each topic to quickly find common topics of public opinion and COVID-19-related attitudes. While building our model, we take into account the hybridization of BiLSTM and GRU structures for sentiment analysis to achieve our goal. The experimental results show that our topic identification method extracts better topic labels and the sentiment analysis approach using our proposed hybrid deep learning model achieves the highest accuracy compared to traditional models.
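
The hybrid BiLSTM-GRU classifier can be sketched directly in Keras. Vocabulary size, sequence length, and layer widths below are placeholder assumptions; the paper's exact topology may differ.

```python
# Hybrid BiLSTM + GRU sentiment classifier sketch (positive/negative/
# neutral). Vocabulary size and layer widths are placeholder assumptions.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB, MAXLEN, CLASSES = 20_000, 64, 3           # assumed preprocessing

model = keras.Sequential([
    layers.Input(shape=(MAXLEN,)),
    layers.Embedding(VOCAB, 128),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.GRU(32),                              # GRU reads BiLSTM sequence
    layers.Dense(32, activation="relu"),
    layers.Dense(CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(padded_token_ids, sentiment_labels, ...) after tokenization.
```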

On-sensor Printed Machine Learning Classification via Bespoke ADC and Decision Tree Co-Design

  • paper_url: http://arxiv.org/abs/2312.01172
  • repo_url: None
  • paper_authors: Giorgos Armeniakos, Paula L. Duarte, Priyanjana Pal, Georgios Zervakis, Mehdi B. Tahoori, Dimitrios Soudris
  • for: This work targets low-cost printed electronics (PE) technology for realizing customized computing devices.
  • methods: The study uses a co-design approach of bespoke ADCs and Decision Tree classifiers to enable automated computation directly on the sensor.
  • results: The results show that bespoke ADCs co-designed with Decision Tree classifiers enable self-powered on-sensor operation while processing sensor inputs.
    Abstract Printed electronics (PE) technology provides cost-effective hardware with unmet customization, due to their low non-recurring engineering and fabrication costs. PE exhibit features such as flexibility, stretchability, porosity, and conformality, which make them a prominent candidate for enabling ubiquitous computing. Still, the large feature sizes in PE limit the realization of complex printed circuits, such as machine learning classifiers, especially when processing sensor inputs is necessary, mainly due to the costly analog-to-digital converters (ADCs). To this end, we propose the design of fully customized ADCs and present, for the first time, a co-design framework for generating bespoke Decision Tree classifiers. Our comprehensive evaluation shows that our co-design enables self-powered operation of on-sensor printed classifiers in all benchmark cases.

Fast and Robust Sparsity-Aware Block Diagonal Representation

  • paper_url: http://arxiv.org/abs/2312.01137
  • repo_url: https://github.com/a-tastan/frs-bdr
  • paper_authors: Aylin Tastan, Michael Muma, Abdelhak M. Zoubir
  • for: This paper addresses the challenge of recovering a block diagonal affinity matrix in real-world applications, where data may be subject to outliers and heavy-tailed noise.
  • methods: The proposed Fast and Robust Sparsity-Aware Block Diagonal Representation (FRS-BDR) method jointly estimates cluster memberships and the number of blocks by reformulating the problem as a robust piece-wise linear fitting problem.
  • results: Comprehensive experiments on a variety of real-world applications demonstrate the effectiveness of FRS-BDR in terms of clustering accuracy, robustness against corrupted features, computation time, and cluster enumeration performance.
    Abstract The block diagonal structure of an affinity matrix is a commonly desired property in cluster analysis because it represents clusters of feature vectors by non-zero coefficients that are concentrated in blocks. However, recovering a block diagonal affinity matrix is challenging in real-world applications, in which the data may be subject to outliers and heavy-tailed noise that obscure the hidden cluster structure. To address this issue, we first analyze the effect of different fundamental outlier types in graph-based cluster analysis. A key idea that simplifies the analysis is to introduce a vector that represents a block diagonal matrix as a piece-wise linear function of the similarity coefficients that form the affinity matrix. We reformulate the problem as a robust piece-wise linear fitting problem and propose a Fast and Robust Sparsity-Aware Block Diagonal Representation (FRS-BDR) method, which jointly estimates cluster memberships and the number of blocks. Comprehensive experiments on a variety of real-world applications demonstrate the effectiveness of FRS-BDR in terms of clustering accuracy, robustness against corrupted features, computation time and cluster enumeration performance.

$t^3$-Variational Autoencoder: Learning Heavy-tailed Data with Student’s t and Power Divergence

  • paper_url: http://arxiv.org/abs/2312.01133
  • repo_url: None
  • paper_authors: Juno Kim, Jaehyuk Kwon, Mincheol Cho, Hyunjong Lee, Joong-Ho Won
  • for: Improving VAE performance, particularly on data with low-density (heavy-tailed) regions.
  • methods: Student's t-distributions are used to improve the VAE, serving as the prior, encoder, and decoder distributions.
  • results: The $t^3$VAE model excels on synthetic data with heavy tails and shows significant performance gains on the CelebA and imbalanced CIFAR-100 datasets.
    Abstract The variational autoencoder (VAE) typically employs a standard normal prior as a regularizer for the probabilistic latent encoder. However, the Gaussian tail often decays too quickly to effectively accommodate the encoded points, failing to preserve crucial structures hidden in the data. In this paper, we explore the use of heavy-tailed models to combat over-regularization. Drawing upon insights from information geometry, we propose $t^3$VAE, a modified VAE framework that incorporates Student's t-distributions for the prior, encoder, and decoder. This results in a joint model distribution of a power form which we argue can better fit real-world datasets. We derive a new objective by reformulating the evidence lower bound as joint optimization of KL divergence between two statistical manifolds and replacing with $\gamma$-power divergence, a natural alternative for power families. $t^3$VAE demonstrates superior generation of low-density regions when trained on heavy-tailed synthetic data. Furthermore, we show that $t^3$VAE significantly outperforms other models on CelebA and imbalanced CIFAR-100 datasets.
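
The key modeling ingredient, a reparameterizable Student's t latent, is available directly in PyTorch. The fragment below shows only t-distributed posterior sampling with a Monte Carlo divergence estimate; the plain KL stand-in is not the paper's $\gamma$-power divergence objective, and the degrees of freedom and shapes are arbitrary.

```python
# Student-t latent sampling for a heavy-tailed VAE. The Monte Carlo KL
# term is a stand-in for the paper's gamma-power divergence objective;
# df and tensor shapes are arbitrary.
import torch
from torch.distributions import StudentT

df = 5.0                                     # degrees of freedom (assumed)
loc = torch.zeros(16, 8)                     # encoder-produced location
scale = torch.ones(16, 8).requires_grad_()   # encoder-produced scale

posterior = StudentT(df, loc, scale)
prior = StudentT(df, torch.zeros(16, 8), torch.ones(16, 8))

z = posterior.rsample()                      # reparameterized: grads flow
mc_kl = (posterior.log_prob(z) - prior.log_prob(z)).sum(-1).mean()
mc_kl.backward()                             # gradients reach `scale`
print("Monte Carlo divergence estimate:", mc_kl.item())
```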

Virtual reservoir acceleration for CPU and GPU: Case study for coupled spin-torque oscillator reservoir

  • paper_url: http://arxiv.org/abs/2312.01121
  • repo_url: https://github.com/mathowl/reservoir_acceleration
  • paper_authors: Thomas Geert de Jong, Nozomi Akashi, Tomohiro Taniguchi, Hirofumi Notsu, Kohei Nakajima
  • for: This paper provides high-speed implementations for simulating reservoirs described by $N$-coupled spin-torque oscillators, where $N$ also corresponds to the number of reservoir nodes.
  • methods: The paper benchmarks and compares a variety of CPU- and GPU-based implementations.
  • results: The new methods are at least 2.6 times faster than the baseline for $N$ from 1 to 10^4. Specifically, the best speedup factor is 78.9 at $N=1$, decreasing to 2.6 at $N=10^3$ and rising again to 23.8 at $N=10^4$; the GPU clearly outperforms the CPU at $N=2500$.
    Abstract We provide high-speed implementations for simulating reservoirs described by $N$-coupled spin-torque oscillators. Here $N$ also corresponds to the number of reservoir nodes. We benchmark a variety of implementations based on CPU and GPU. Our new methods are at least 2.6 times quicker than the baseline for $N$ in range $1$ to $10^4$. More specifically, over all implementations the best factor is 78.9 for $N=1$ which decreases to 2.6 for $N=10^3$ and finally increases to 23.8 for $N=10^4$. GPU outperforms CPU significantly at $N=2500$. Our results show that GPU implementations should be tested for reservoir simulations. The implementations considered here can be used for any reservoir with evolution that can be approximated using an explicit method.

Cancer Subtype Identification through Integrating Inter and Intra Dataset Relationships in Multi-Omics Data

  • paper_url: http://arxiv.org/abs/2312.02195
  • repo_url: https://github.com/peelen-mark/identifying-cancer-subtypes-code
  • paper_authors: Mark Peelen, Leila Bagheriye, Johan Kwisthout
  • for: This paper proposes a new method for identifying cancer subtypes through the integration of multi-omics data for clustering.
  • methods: The method builds affinity matrices from linear relationships between and within different datasets (Linear Inter and Intra Dataset Affinity Fusion (LIDAF)).
  • results: The proposed method improves clustering performance, as measured by the Adjusted Rand Index and Normalized Mutual Information, and achieves a notable enhancement in 50% of the log10 rank p-values from Cox survival analysis, surpassing the best reported method.
    Abstract The integration of multi-omics data has emerged as a promising approach for gaining comprehensive insights into complex diseases such as cancer. This paper proposes a novel approach to identify cancer subtypes through the integration of multi-omics data for clustering. The proposed method, named LIDAF, utilises affinity matrices based on linear relationships between and within different omics datasets (Linear Inter and Intra Dataset Affinity Fusion (LIDAF)). Canonical Correlation Analysis is employed in this paper to create distance matrices based on Euclidean distances between canonical variates. The distance matrices are converted to affinity matrices and those are fused in a three-step process. The proposed LIDAF addresses the limitations of the existing method, resulting in improvement of clustering performance as measured by the Adjusted Rand Index and the Normalized Mutual Information score. Moreover, our proposed LIDAF approach demonstrates a notable enhancement in 50% of the log10 rank p-values obtained from Cox survival analysis, surpassing the performance of the best reported method, highlighting its potential of identifying distinct cancer subtypes.
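A minimal sketch of the CCA-to-affinity step described in the abstract — canonical variates, Euclidean distances, then a kernel conversion to affinities — is given below; the Gaussian kernel width and the absence of the three-step fusion are assumptions of this sketch, not the paper's exact procedure:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cross_decomposition import CCA

def cca_affinity(omics_a, omics_b, n_components=2, sigma=1.0):
    """Affinity matrix between samples, built from canonical variates of two omics views."""
    cca = CCA(n_components=n_components)
    za, zb = cca.fit_transform(omics_a, omics_b)  # canonical variates per view
    dist = cdist(za, zb, metric="euclidean")      # Euclidean distances between variates
    return np.exp(-dist**2 / (2 * sigma**2))      # Gaussian kernel -> affinity

rng = np.random.default_rng(0)
affinity = cca_affinity(rng.standard_normal((50, 20)), rng.standard_normal((50, 30)))
print(affinity.shape)  # (50, 50) sample-by-sample affinities, ready for fusion
```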

Strong Duality Relations in Nonconvex Risk-Constrained Learning

  • paper_url: http://arxiv.org/abs/2312.01110
  • repo_url: None
  • paper_authors: Dionysis Kalogerias, Spyridon Pougkakiotis
  • for: This paper establishes strong duality relations for functional two-step compositional risk-constrained learning problems with multiple nonconvex loss functions and/or learning constraints, regardless of nonconvexity and under a minimal set of technical assumptions.
  • methods: The analysis builds on recent advances in risk-constrained nonconvex programming in infinite dimensions, in particular a new application of J. J. Uhl's convexity theorem, an extension of A. A. Lyapunov's convexity theorem to general, infinite-dimensional Banach spaces.
  • results: The paper proves zero duality gaps for the class of problems under study, extending and improving on the state of the art in (risk-neutral) constrained learning.
    Abstract We establish strong duality relations for functional two-step compositional risk-constrained learning problems with multiple nonconvex loss functions and/or learning constraints, regardless of nonconvexity and under a minimal set of technical assumptions. Our results in particular imply zero duality gaps within the class of problems under study, both extending and improving on the state of the art in (risk-neutral) constrained learning. More specifically, we consider risk objectives/constraints which involve real-valued convex and positively homogeneous risk measures admitting dual representations with bounded risk envelopes, generalizing expectations and including popular examples, such as the conditional value-at-risk (CVaR), the mean-absolute deviation (MAD), and more generally all real-valued coherent risk measures on integrable losses as special cases. Our results are based on recent advances in risk-constrained nonconvex programming in infinite dimensions, which rely on a remarkable new application of J. J. Uhl's convexity theorem, which is an extension of A. A. Lyapunov's convexity theorem for general, infinite dimensional Banach spaces. By specializing to the risk-neutral setting, we demonstrate, for the first time, that constrained classification and regression can be treated under a unifying lens, while dispensing certain restrictive assumptions enforced in the current literature, yielding a new state-of-the-art strong duality framework for nonconvex constrained learning.
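For concreteness, the conditional value-at-risk named above admits the well-known Rockafellar-Uryasev variational form and a dual representation with a bounded risk envelope, which is the type of representation the paper's assumptions cover:

```latex
\mathrm{CVaR}_{\alpha}(Z)
  \;=\; \min_{t \in \mathbb{R}} \Big\{\, t + \tfrac{1}{\alpha}\,\mathbb{E}\big[(Z - t)_{+}\big] \Big\}
  \;=\; \sup_{\substack{0 \,\le\, \zeta \,\le\, 1/\alpha \\ \mathbb{E}[\zeta] \,=\, 1}} \mathbb{E}[\zeta Z].
```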

Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning

  • paper_url: http://arxiv.org/abs/2312.01107
  • repo_url: None
  • paper_authors: Raviraj Joshi, Nikesh Garera
  • for: This paper builds a production-quality text-to-speech (TTS) system for a low-resource language (Hindi) and performs rapid speaker adaptation.
  • methods: The paper uses transfer learning from high-resource (English) data together with in-domain synthetic data generated by an out-of-the-box single-speaker TTS in the target language, training a Tacotron2-like acoustic model with a WaveGlow vocoder in a three-step process.
  • results: Speaker adaptation is achieved by fine-tuning only the decoder on just 3 hours of target Hindi speaker data; subjective MOS evaluation confirms the importance of the dual pre-training and decoder-only fine-tuning.
    Abstract Text-to-speech (TTS) systems are being built using end-to-end deep learning approaches. However, these systems require huge amounts of training data. We present our approach to build production quality TTS and perform speaker adaptation in extremely low resource settings. We propose a transfer learning approach using high-resource language data and synthetically generated data. We transfer the learnings from the out-domain high-resource English language. Further, we make use of out-of-the-box single-speaker TTS in the target language to generate in-domain synthetic data. We employ a three-step approach to train a high-quality single-speaker TTS system in a low-resource Indian language Hindi. We use a Tacotron2-like setup with a spectrogram prediction network and a waveglow vocoder. The Tacotron2 acoustic model is trained on English data, followed by synthetic Hindi data from the existing TTS system. Finally, the decoder of this model is fine-tuned on only 3 hours of target Hindi speaker data to enable rapid speaker adaptation. We show the importance of this dual pre-training and decoder-only fine-tuning using subjective MOS evaluation. Using transfer learning from high-resource language and synthetic corpus we present a low-cost solution to train a custom TTS model.
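The decoder-only fine-tuning step can be sketched generically in PyTorch; the tiny stand-in model below only illustrates the freeze/unfreeze pattern, since actual Tacotron2 implementations name their submodules differently:

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Stand-in for a Tacotron2-style model with encoder and decoder submodules."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(80, 128, batch_first=True)
        self.decoder = nn.GRU(128, 80, batch_first=True)

model = TinyTTS()

# Dual pre-training (English, then synthetic Hindi) happens first; rapid
# speaker adaptation then freezes everything except the decoder.
for p in model.parameters():
    p.requires_grad = False
for p in model.decoder.parameters():
    p.requires_grad = True

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```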

Code-Mixed Text to Speech Synthesis under Low-Resource Constraints

  • paper_url: http://arxiv.org/abs/2312.01103
  • repo_url: None
  • paper_authors: Raviraj Joshi, Nikesh Garera
  • for: This paper describes how to build production-quality code-mixed Hindi-English TTS systems for voice-based e-commerce applications, where product names are commonly in English while the surrounding text is in a regional language.
  • methods: The authors propose a data-oriented approach using monolingual datasets in the individual languages: a transliteration model converts the Roman text into the common Devanagari script, and the combined single-script data is used for bilingual training. They further evaluate single-speaker adaptation against multi-speaker training, coupled with transfer learning and decoder-only fine-tuning.
  • results: The proposed transfer learning approach achieves a positive CMOS score of 0.02 against Google TTS; low-resource voice adaptation experiments show a new voice can be onboarded with just 3 hours of data, and large-scale subjective evaluation on out-of-domain pure code-mixed sentences confirms the high quality of the systems.
    Abstract Text-to-speech (TTS) systems are an important component in voice-based e-commerce applications. These applications include end-to-end voice assistant and customer experience (CX) voice bot. Code-mixed TTS is also relevant in these applications since the product names are commonly described in English while the surrounding text is in a regional language. In this work, we describe our approaches for production quality code-mixed Hindi-English TTS systems built for e-commerce applications. We propose a data-oriented approach by utilizing monolingual data sets in individual languages. We leverage a transliteration model to convert the Roman text into a common Devanagari script and then combine both datasets for training. We show that such single script bi-lingual training without any code-mixing works well for pure code-mixed test sets. We further present an exhaustive evaluation of single-speaker adaptation and multi-speaker training with Tacotron2 + Waveglow setup to show that the former approach works better. These approaches are also coupled with transfer learning and decoder-only fine-tuning to improve performance. We compare these approaches with the Google TTS and report a positive CMOS score of 0.02 with the proposed transfer learning approach. We also perform low-resource voice adaptation experiments to show that a new voice can be onboarded with just 3 hrs of data. This highlights the importance of our pre-trained models in resource-constrained settings. This subjective evaluation is performed on a large number of out-of-domain pure code-mixed sentences to demonstrate the high quality of the systems.
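The single-script data preparation reduces to a short pipeline; `transliterate_to_devanagari` below is a hypothetical stand-in for the paper's learned transliteration model:

```python
# Pool native Hindi data with English data transliterated into Devanagari,
# so a single-script bilingual corpus can be used for training.
def transliterate_to_devanagari(roman_text):
    return roman_text  # stub: the paper's transliteration model goes here

hindi_corpus = [("नमस्ते", "welcome.wav")]
english_corpus = [("hello world", "hello.wav")]

combined = hindi_corpus + [
    (transliterate_to_devanagari(text), wav) for text, wav in english_corpus]
```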

Predicting Postoperative Nausea And Vomiting Using Machine Learning: A Model Development and Validation Study

  • paper_url: http://arxiv.org/abs/2312.01093
  • repo_url: https://github.com/teddy4445/ponv_prediction_tool
  • paper_authors: Maxim Glebov, Teddy Lazebnik, Boris Orkin, Haim Berkenstadt, Svetlana Bunimovich-Mendrazitsky
  • for: A prognostic tool for predicting early and delayed postoperative nausea and vomiting (PONV), to enable personalized care and improved patient outcomes.
  • methods: An ensemble of machine learning algorithms trained on data from 54848 patients, using k-fold cross-validation and the Bee Colony algorithm to split the data into train and test sets while optimally preserving sociodemographic features such as age, sex, and smoking habits.
  • results: The tools correctly predict early and delayed PONV in 84.0% and 77.3% of cases, outperforming the second-best PONV prediction tool (the Koivuranta score) by 13.4% and 12.9%, respectively; feature importance analysis aligns with prior clinical knowledge, indicating the tools' utility.
    Abstract Background: Postoperative nausea and vomiting (PONV) is a frequently observed complication in patients undergoing surgery under general anesthesia. Moreover, it is a frequent cause of distress and dissatisfaction during the early postoperative period. The tools used for predicting PONV at present have not yielded satisfactory results. Therefore, prognostic tools for the prediction of early and delayed PONV were developed in this study with the aim of achieving satisfactory predictive performance. Methods: The retrospective data of adult patients admitted to the post-anesthesia care unit after undergoing surgical procedures under general anesthesia at the Sheba Medical Center, Israel, between September 1, 2018, and September 1, 2023, were used in this study. An ensemble model of machine learning algorithms trained on the data of 54848 patients was developed. The k-fold cross-validation method was used followed by splitting the data to train and test sets that optimally preserve the sociodemographic features of the patients, such as age, sex, and smoking habits, using the Bee Colony algorithm. Findings: Among the 54848 patients, early and delayed PONV were observed in 2706 (4.93%) and 8218 (14.98%) patients, respectively. The proposed PONV prediction tools could correctly predict early and delayed PONV in 84.0% and 77.3% of cases, respectively, outperforming the second-best PONV prediction tool (Koivuranta score) by 13.4% and 12.9%, respectively. Feature importance analysis revealed that the performance of the proposed prediction tools aligned with previous clinical knowledge, indicating their utility. Interpretation: The machine learning-based tools developed in this study enabled improved PONV prediction, thereby facilitating personalized care and improved patient outcomes.
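The training setup translates to a familiar scikit-learn pattern; the sketch below substitutes a plain stratified split for the paper's Bee Colony-optimized split and uses arbitrary base learners, so it is an outline rather than a reproduction:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the cohort: a rare binary outcome, like early PONV.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

# The paper optimizes the split with a Bee Colony algorithm to preserve
# sociodemographic balance; a stratified split is a simpler stand-in.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

ensemble = VotingClassifier(
    [("lr", LogisticRegression(max_iter=1000)),
     ("rf", RandomForestClassifier(random_state=0))],
    voting="soft")
print(cross_val_score(ensemble, X_tr, y_tr, cv=5).mean())  # k-fold cross-validation
ensemble.fit(X_tr, y_tr)
```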

A Semi-Supervised Deep Learning Approach to Dataset Collection for Query-By-Humming Task

  • paper_url: http://arxiv.org/abs/2312.01092
  • repo_url: https://github.com/amanteur/chad
  • paper_authors: Amantur Amatov, Dmitry Lamanov, Maksim Titov, Ivan Vovk, Ilya Makarov, Mikhail Kudinov
  • for: This paper proposes a deep-learning data collection technique and introduces the Covers and Hummings Aligned Dataset (CHAD), a novel dataset for building query-by-humming (QbH) systems.
  • methods: The paper employs a semi-supervised model training pipeline that treats QbH as a specialized case of the cover song identification (CSI) task, iteratively collecting groups of cover-version fragments and retraining the model on the extended data.
  • results: The final model achieves competitive results on benchmark QbH datasets and can be applied successfully in practice.
    Abstract Query-by-Humming (QbH) is a task that involves finding the most relevant song based on a hummed or sung fragment. Despite recent successful commercial solutions, implementing QbH systems remains challenging due to the lack of high-quality datasets for training machine learning models. In this paper, we propose a deep learning data collection technique and introduce Covers and Hummings Aligned Dataset (CHAD), a novel dataset that contains 18 hours of short music fragments, paired with time-aligned hummed versions. To expand our dataset, we employ a semi-supervised model training pipeline that leverages the QbH task as a specialized case of cover song identification (CSI) task. Starting with a model trained on the initial dataset, we iteratively collect groups of fragments of cover versions of the same song and retrain the model on the extended data. Using this pipeline, we collect over 308 hours of additional music fragments, paired with time-aligned cover versions. The final model is successfully applied to the QbH task and achieves competitive results on benchmark datasets. Our study shows that the proposed dataset and training pipeline can effectively facilitate the implementation of QbH systems.
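The iterative semi-supervised expansion can be summarized as a short loop; the three helpers below are trivial stubs standing in for model training, CSI-style retrieval of cover fragments, and time alignment:

```python
def train(model, data):
    return model  # stub: retrain the QbH model on the extended data

def mine_cover_fragments(model, pool):
    return pool[:2]  # stub: retrieve groups of cover versions of the same song

def align_in_time(a, b):
    return (a, b)  # stub: time-align a fragment with its cover/hummed version

def expand_dataset(model, seed_pairs, unlabeled_pool, rounds=3):
    data = list(seed_pairs)
    for _ in range(rounds):
        model = train(model, data)
        data += [align_in_time(a, b) for a, b in mine_cover_fragments(model, unlabeled_pool)]
    return model, data

model, data = expand_dataset(model="qbh-model",
                             seed_pairs=[("frag0", "hum0")],
                             unlabeled_pool=[("fragA", "covA"), ("fragB", "covB")])
```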

A New Random Reshuffling Method for Nonsmooth Nonconvex Finite-sum Optimization

  • paper_url: http://arxiv.org/abs/2312.01047
  • repo_url: None
  • paper_authors: Xiao Li, Andre Milzarek, Junwen Qiu
  • for: This paper proposes and studies a new stochastic optimization algorithm, the normal map-based proximal random reshuffling (norm-PRR) method, for nonsmooth nonconvex finite-sum problems.
  • methods: Random reshuffling techniques are prevalent in large-scale applications such as neural network training, but while their convergence behavior and acceleration effects are fairly well understood in the smooth setting, much less is known in the nonsmooth case, where only few proximal-type random reshuffling approaches with provable guarantees exist; norm-PRR addresses this gap.
  • results: The paper establishes the iteration complexity ${\cal O}(n^{-1/3}T^{-2/3})$ for norm-PRR, where $n$ is the number of component functions and $T$ the total number of iterations. It also provides novel asymptotic convergence results: under the Kurdyka-{\L}ojasiewicz (KL) inequality, the iterates converge to a single stationary point, with last-iterate rates of the form ${\cal O}(k^{-p})$, where $p \in [0, 1]$ depends on the KL exponent $\theta \in [0,1)$ and the step size dynamics. Preliminary numerical results on machine learning problems demonstrate the method's efficiency.
    Abstract In this work, we propose and study a novel stochastic optimization algorithm, termed the normal map-based proximal random reshuffling (norm-PRR) method, for nonsmooth nonconvex finite-sum problems. Random reshuffling techniques are prevalent and widely utilized in large-scale applications, e.g., in the training of neural networks. While the convergence behavior and advantageous acceleration effects of random reshuffling methods are fairly well understood in the smooth setting, much less seems to be known in the nonsmooth case and only few proximal-type random reshuffling approaches with provable guarantees exist. We establish the iteration complexity ${\cal O}(n^{-1/3}T^{-2/3})$ for norm-PRR, where $n$ is the number of component functions and $T$ counts the total number of iteration. We also provide novel asymptotic convergence results for norm-PRR. Specifically, under the Kurdyka-{\L}ojasiewicz (KL) inequality, we establish strong limit-point convergence, i.e., the iterates generated by norm-PRR converge to a single stationary point. Moreover, we derive last iterate convergence rates of the form ${\cal O}(k^{-p})$; here, $p \in [0, 1]$ depends on the KL exponent $\theta \in [0,1)$ and step size dynamics. Finally, we present preliminary numerical results on machine learning problems that demonstrate the efficiency of the proposed method.
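For orientation, a generic proximal random reshuffling loop for an $\ell_1$-regularized finite sum is sketched below; the normal map-based update that distinguishes norm-PRR is the paper's contribution and is not reproduced here:

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)  # prox of tau * ||.||_1

def proximal_random_reshuffling(grads, x0, step=0.1, lam=0.01, epochs=20, seed=0):
    """One pass per epoch over a reshuffled order of the n component
    gradients, each followed by a proximal step. norm-PRR replaces this
    plain update with its normal map-based one."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(grads)):  # random reshuffling
            x = soft_threshold(x - step * grads[i](x), step * lam)
    return x

# Toy problem: f_i(x) = 0.5 ||x - a_i||^2, regularized by lam * ||x||_1.
rng = np.random.default_rng(1)
anchors = rng.standard_normal((5, 3))
grads = [lambda x, a=a: x - a for a in anchors]
print(proximal_random_reshuffling(grads, x0=np.zeros(3)))
```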

Bagged Regularized $k$-Distances for Anomaly Detection

  • paper_url: http://arxiv.org/abs/2312.01046
  • repo_url: None
  • paper_authors: Yuchao Cai, Yuheng Ma, Hanfang Yang, Hanyuan Hang
  • for: This paper proposes a new distance-based unsupervised anomaly detection method that addresses the heavy sensitivity of distance-based methods to the choice of the number of nearest neighbors.
  • methods: The method, bagged regularized $k$-distances for anomaly detection (BRDAD), converts the unsupervised anomaly detection problem into a convex optimization problem, selecting the weights by minimizing a surrogate risk — the finite sample bound of the empirical risk of the bagged weighted $k$-distances for density estimation (BWDDE); the incorporated bagging technique also addresses efficiency issues on large-scale datasets.
  • results: The paper establishes fast convergence rates for the AUC regret and shows that bagging significantly reduces computational complexity; numerical experiments on anomaly detection benchmarks illustrate the insensitivity of parameter selection compared with other state-of-the-art distance-based methods, with promising improvements on real-world datasets.
    Abstract We consider the paradigm of unsupervised anomaly detection, which involves the identification of anomalies within a dataset in the absence of labeled examples. Though distance-based methods are top-performing for unsupervised anomaly detection, they suffer heavily from the sensitivity to the choice of the number of the nearest neighbors. In this paper, we propose a new distance-based algorithm called bagged regularized $k$-distances for anomaly detection (BRDAD) converting the unsupervised anomaly detection problem into a convex optimization problem. Our BRDAD algorithm selects the weights by minimizing the surrogate risk, i.e., the finite sample bound of the empirical risk of the bagged weighted $k$-distances for density estimation (BWDDE). This approach enables us to successfully address the sensitivity challenge of the hyperparameter choice in distance-based algorithms. Moreover, when dealing with large-scale datasets, the efficiency issues can be addressed by the incorporated bagging technique in our BRDAD algorithm. On the theoretical side, we establish fast convergence rates of the AUC regret of our algorithm and demonstrate that the bagging technique significantly reduces the computational complexity. On the practical side, we conduct numerical experiments on anomaly detection benchmarks to illustrate the insensitivity of parameter selection of our algorithm compared with other state-of-the-art distance-based methods. Moreover, promising improvements are brought by applying the bagging technique in our algorithm on real-world datasets.
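The bagged $k$-distance idea can be sketched as follows; uniform averaging over bags stands in for the surrogate-risk-minimizing weights that BRDAD actually learns:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def bagged_k_distance_scores(X, k=5, n_bags=10, bag_frac=0.5, seed=0):
    """Anomaly scores as bagged averages of k-th nearest-neighbor distances.
    BRDAD additionally learns weights over the k distances by minimizing a
    surrogate risk; uniform averaging here is a simplified stand-in."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = np.zeros(n)
    for _ in range(n_bags):
        idx = rng.choice(n, size=int(bag_frac * n), replace=False)
        nn = NearestNeighbors(n_neighbors=k).fit(X[idx])
        dist, _ = nn.kneighbors(X)      # distances from all points to the bag
        scores += dist[:, -1] / n_bags  # k-th neighbor distance, bag-averaged
    return scores                       # higher score = more anomalous

rng = np.random.default_rng(2)
X = np.vstack([rng.standard_normal((200, 2)), rng.standard_normal((5, 2)) + 6.0])
print(bagged_k_distance_scores(X).argsort()[-5:])  # indices of the 5 most anomalous points
```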

Quantifying Hippocampal Shape Asymmetry in Alzheimer’s Disease Using Optimal Shape Correspondences

  • paper_url: http://arxiv.org/abs/2312.01043
  • repo_url: None
  • paper_authors: Shen Zhu, Ifrah Zawar, Jaideep Kapur, P. Thomas Fletcher
  • for: This paper studies hippocampal shape asymmetry in Alzheimer's disease (AD), where atrophy is asymmetric and spatially inhomogeneous.
  • methods: The paper quantifies localized shape asymmetry by optimizing point correspondences between the left and right hippocampi within a subject, while simultaneously favoring a compact statistical shape model of the entire sample; linear models with confounding factors account for related variables that affect AD and healthy-subject differences.
  • results: On the OASIS3 dataset, compared to using volumetric information, shape asymmetry reveals fine-grained, localized differences that indicate the hippocampal regions of most significant shape asymmetry in AD patients.
    Abstract Hippocampal atrophy in Alzheimer's disease (AD) is asymmetric and spatially inhomogeneous. While extensive work has been done on volume and shape analysis of atrophy of the hippocampus in AD, less attention has been given to hippocampal asymmetry specifically. Previous studies of hippocampal asymmetry are limited to global volume or shape measures, which don't localize shape asymmetry at the point level. In this paper, we propose to quantify localized shape asymmetry by optimizing point correspondences between left and right hippocampi within a subject, while simultaneously favoring a compact statistical shape model of the entire sample. To account for related variables that have impact on AD and healthy subject differences, we build linear models with other confounding factors. Our results on the OASIS3 dataset demonstrate that compared to using volumetric information, shape asymmetry reveals fine-grained, localized differences that indicate the hippocampal regions of most significant shape asymmetry in AD patients.
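A toy version of the left-right comparison (mirror, align, read off per-point deviations) is sketched below; it assumes point correspondences are already known, whereas finding optimal correspondences jointly with a compact shape model is the paper's contribution:

```python
import numpy as np
from scipy.spatial import procrustes

# Mirror the right hippocampus, align it to the left, read off per-point
# deviations. Point clouds here are random stand-ins for real surfaces.
rng = np.random.default_rng(0)
left = rng.random((100, 3))  # left-hippocampus surface points
right = left * np.array([-1.0, 1.0, 1.0]) + 0.01 * rng.standard_normal((100, 3))

mirrored = right * np.array([-1.0, 1.0, 1.0])        # undo the left-right flip
left_std, right_std, _ = procrustes(left, mirrored)  # optimal similarity alignment
pointwise_asymmetry = np.linalg.norm(left_std - right_std, axis=1)
print(pointwise_asymmetry.argmax())                  # most asymmetric point
```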

RNN-BOF: A Multivariate Global Recurrent Neural Network for Binary Outcome Forecasting of Inpatient Aggression

  • paper_url: http://arxiv.org/abs/2312.01029
  • repo_url: None
  • paper_authors: Aidan Quinn, Melanie Simmons, Benjamin Spivak, Christoph Bergmeir
  • for: This paper forecasts the future risk of inpatient aggression, helping clinicians assess a patient's risk level.
  • methods: A time series methodology learns from longitudinal data: a global multivariate recurrent neural network for binary outcome forecasting is trained from, and for, a population of patient time series to produce individual probabilistic risk assessments, using a moving window training scheme.
  • results: On a real-world dataset of 83 patients, the approach shows a significant performance increase over benchmark psychometric instruments and previously used machine learning methodologies.
    Abstract Psychometric assessment instruments aid clinicians by providing methods of assessing the future risk of adverse events such as aggression. Existing machine learning approaches have treated this as a classification problem, predicting the probability of an adverse event in a fixed future time period from the scores produced by both psychometric instruments and clinical and demographic covariates. We instead propose modelling a patient's future risk using a time series methodology that learns from longitudinal data and produces a probabilistic binary forecast that indicates the presence of the adverse event in the next time period. Based on the recent success of Deep Neural Nets for globally forecasting across many time series, we introduce a global multivariate Recurrent Neural Network for Binary Outcome Forecasting, that trains from and for a population of patient time series to produce individual probabilistic risk assessments. We use a moving window training scheme on a real world dataset of 83 patients, where the main binary time series represents the presence of aggressive events and covariate time series represent clinical or demographic features and psychometric measures. On this dataset our approach was capable of a significant performance increase against both benchmark psychometric instruments and previously used machine learning methodologies.
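A minimal PyTorch sketch of such a global binary-outcome forecaster is shown below; the layer sizes and single-GRU architecture are illustrative assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class BinaryOutcomeForecaster(nn.Module):
    """Covariate time series in, probability of the adverse event in the
    next period out. Moving-window training across patients follows the
    paper only loosely."""
    def __init__(self, n_features=8, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (patients, window_len, n_features)
        _, h = self.rnn(x)
        return torch.sigmoid(self.head(h[-1]))   # P(event in next period)

model = BinaryOutcomeForecaster()
window = torch.randn(83, 30, 8)                  # 83 patients, 30-step windows
print(model(window).shape)                       # torch.Size([83, 1])
```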

Advanced Language Model-Driven Verilog Development: Enhancing Power, Performance, and Area Optimization in Code Synthesis

  • paper_url: http://arxiv.org/abs/2312.01022
  • repo_url: None
  • paper_authors: Kiran Thorat, Jiahui Zhao, Yaotian Liu, Hongwu Peng, Xi Xie, Bin Lei, Jeff Zhang, Caiwen Ding
  • for: This study investigates advanced language models (ALMs), notable for generating high-quality content from linguistic instructions, in electronic hardware design, with specific emphasis on the synthesis and enhancement of Verilog programming.
  • methods: The study introduces an innovative framework for assessing and amplifying ALM productivity in this niche. Verilog code is first drafted by the ALM, then refined in a dual-stage protocol: the first stage improves the code's linguistic and operational precision, and the second aligns it with Power-Performance-Area (PPA) benchmarks, a pivotal component of proficient hardware design.
  • results: The framework achieves 81.37% linguistic accuracy and 62.0% operational efficacy in programming synthesis, surpassing current leading-edge techniques (73% linguistic accuracy and 46% operational efficacy). These findings illustrate ALMs' aptitude in complex technical domains and signal a positive shift in the mechanization of hardware design operations.
    Abstract The increasing use of Advanced Language Models (ALMs) in diverse sectors, particularly due to their impressive capability to generate top-tier content following linguistic instructions, forms the core of this investigation. This study probes into ALMs' deployment in electronic hardware design, with a specific emphasis on the synthesis and enhancement of Verilog programming. We introduce an innovative framework, crafted to assess and amplify ALMs' productivity in this niche. The methodology commences with the initial crafting of Verilog programming via ALMs, succeeded by a distinct dual-stage refinement protocol. The premier stage prioritizes augmenting the code's operational and linguistic precision, while the latter stage is dedicated to aligning the code with Power-Performance-Area (PPA) benchmarks, a pivotal component in proficient hardware design. This bifurcated strategy, merging error remediation with PPA enhancement, has yielded substantial upgrades in the caliber of ALM-created Verilog programming. Our framework achieves an 81.37% rate in linguistic accuracy and 62.0% in operational efficacy in programming synthesis, surpassing current leading-edge techniques, such as 73% in linguistic accuracy and 46% in operational efficacy. These findings illuminate ALMs' aptitude in tackling complex technical domains and signal a positive shift in the mechanization of hardware design operations.
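The dual-stage protocol can be outlined as two loops; `ask_alm`, `lint_and_simulate`, and `estimate_ppa` below are hypothetical stubs standing in for the ALM call, a Verilog verification toolchain, and a synthesis report:

```python
# Stage 1 repairs linguistic/operational errors; stage 2 iterates on PPA.
def ask_alm(prompt):
    return "module adder(input [3:0] a, b, output [4:0] y); assign y = a + b; endmodule"

def lint_and_simulate(code):
    return []  # stub: list of syntax or functional errors found

def estimate_ppa(code):
    return {"power": 1.0, "perf": 1.0, "area": 1.0}  # stub synthesis metrics

code = ask_alm("Write a 4-bit adder in Verilog.")
for error in lint_and_simulate(code):                 # stage 1: correctness
    code = ask_alm(f"Fix this error: {error}\n{code}")
for _ in range(3):                                    # stage 2: PPA optimization
    ppa = estimate_ppa(code)
    code = ask_alm(f"Improve the PPA metrics {ppa} of:\n{code}")
```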

Data-Driven Autoencoder Numerical Solver with Uncertainty Quantification for Fast Physical Simulations

  • paper_url: http://arxiv.org/abs/2312.01021
  • repo_url: https://github.com/llnl/gplasdi
  • paper_authors: Christophe Bonneville, Youngsoo Choi, Debojyoti Ghosh, Jonathan L. Belof
  • for: This paper develops a faster alternative to computationally expensive traditional partial differential equation (PDE) solvers.
  • methods: The paper proposes GPLaSDI, a hybrid deep-learning and Bayesian reduced-order model (ROM). GPLaSDI trains an autoencoder on full-order-model (FOM) data while simultaneously learning simpler equations governing the latent space; these equations are interpolated with Gaussian processes, enabling uncertainty quantification and active learning even with limited access to the FOM solver.
  • results: The framework achieves up to 100,000 times speed-up with less than 7% relative error on fluid mechanics problems.
    Abstract Traditional partial differential equation (PDE) solvers can be computationally expensive, which motivates the development of faster methods, such as reduced-order-models (ROMs). We present GPLaSDI, a hybrid deep-learning and Bayesian ROM. GPLaSDI trains an autoencoder on full-order-model (FOM) data and simultaneously learns simpler equations governing the latent space. These equations are interpolated with Gaussian Processes, allowing for uncertainty quantification and active learning, even with limited access to the FOM solver. Our framework is able to achieve up to 100,000 times speed-up and less than 7% relative error on fluid mechanics problems.
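The GP-interpolation idea can be miniaturized as below: latent dynamics coefficients fitted at a few simulation parameters are interpolated with a Gaussian process, giving predictive uncertainty at new parameter values; the one-coefficient linear latent ODE and all numbers are invented for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

params = np.array([[0.5], [1.0], [1.5], [2.0]])  # FOM simulation parameters
coeffs = np.array([-0.4, -0.9, -1.5, -2.1])      # fitted latent decay rates per parameter

# Interpolate the latent-ODE coefficient over parameter space with a GP;
# the predictive std can drive active learning (where to run the FOM next).
gp = GaussianProcessRegressor().fit(params, coeffs)
mean, std = gp.predict(np.array([[1.25]]), return_std=True)

t = np.linspace(0.0, 2.0, 50)
z = np.exp(mean[0] * t)  # latent trajectory from dz/dt = a * z
print(f"interpolated coefficient {mean[0]:.3f} +/- {std[0]:.3f}, z(2) = {z[-1]:.3f}")
```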

ResNLS: An Improved Model for Stock Price Forecasting

  • paper_url: http://arxiv.org/abs/2312.01020
  • repo_url: None
  • paper_authors: Yuanzhe Jia, Ali Anaissi, Basem Suleiman
  • for: This paper aims to improve stock price prediction by emphasizing the dependencies between adjacent stock prices, using a hybrid model called ResNLS.
  • methods: The proposed model combines the ResNet and LSTM neural architectures: ResNet serves as a feature extractor identifying dependencies between stock prices across time windows, and LSTM analyses the time-series data together with these dependencies, treated as residuals; the closing prices of the previous 5 consecutive trading days serve as the input.
  • results: The ResNLS-5 model outperforms vanilla CNN, RNN, LSTM, and BiLSTM baselines in prediction accuracy, with at least a 20% improvement over the current state-of-the-art baselines; a back-tested trading strategy based on its predictions mitigates losses during declining stock prices and generates profits in periods of rising stock prices.
    Abstract Stock prices forecasting has always been a challenging task. Although many research projects adopt machine learning and deep learning algorithms to address the problem, few of them pay attention to the varying degrees of dependencies between stock prices. In this paper we introduce a hybrid model that improves stock price prediction by emphasizing the dependencies between adjacent stock prices. The proposed model, ResNLS, is mainly composed of two neural architectures, ResNet and LSTM. ResNet serves as a feature extractor to identify dependencies between stock prices across time windows, while LSTM analyses the initial time-series data with the combination of dependencies which considered as residuals. In predicting the SSE Composite Index, our experiment reveals that when the closing price data for the previous 5 consecutive trading days is used as the input, the performance of the model (ResNLS-5) is optimal compared to those with other inputs. Furthermore, ResNLS-5 outperforms vanilla CNN, RNN, LSTM, and BiLSTM models in terms of prediction accuracy. It also demonstrates at least a 20% improvement over the current state-of-the-art baselines. To verify whether ResNLS-5 can help clients effectively avoid risks and earn profits in the stock market, we construct a quantitative trading framework for back testing. The experimental results show that the trading strategy based on predictions from ResNLS-5 can successfully mitigate losses during declining stock prices and generate profits in the periods of rising stock prices.
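A compact sketch of the ResNLS idea — a small residual convolutional block feeding an LSTM over the 5-day window — is given below; the layer sizes are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ResNLS(nn.Module):
    """Residual conv features over the price window, then an LSTM head."""
    def __init__(self, hidden=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(8, 1, kernel_size=3, padding=1))
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (batch, 5, 1) closing prices
        residual = self.conv(x.transpose(1, 2)).transpose(1, 2)
        _, (h, _) = self.lstm(x + residual)  # residual features feed the LSTM
        return self.head(h[-1])              # next-day closing price

model = ResNLS()
print(model(torch.randn(16, 5, 1)).shape)    # torch.Size([16, 1])
```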

Generating Images of the M87* Black Hole Using GANs

  • paper_url: http://arxiv.org/abs/2312.01005
  • repo_url: https://github.com/aryamohan23/eht-gans
  • paper_authors: Arya Mohan, Pavlos Protopapas, Keerthi Kunnumkai, Cecilia Garraffo, Lindy Blackburn, Koushik Chatterjee, Sheperd S. Doeleman, Razieh Emami, Christian M. Fromm, Yosuke Mizuno, Angelo Ricarte
  • for: This paper introduces a data augmentation methodology based on Conditional Progressive Generative Adversarial Networks (CPGAN) to generate diverse black hole (BH) images, accounting for variations in spin and electron temperature prescriptions; such images are valuable for training deep learning algorithms to estimate BH parameters from observational data.
  • methods: The CPGAN model can generate BH images for any spin value within the range [-1, 1], given an electron temperature distribution.
  • results: A convolutional neural network trained on the augmented dataset and tested on GRMHD-simulated data predicts BH spin with a significantly improved performance, as indicated by a high R2 score; GANs can therefore serve as cost-effective models for BH image generation and reliably augment training datasets for other parameterization algorithms.
    Abstract In this paper, we introduce a novel data augmentation methodology based on Conditional Progressive Generative Adversarial Networks (CPGAN) to generate diverse black hole (BH) images, accounting for variations in spin and electron temperature prescriptions. These generated images are valuable resources for training deep learning algorithms to accurately estimate black hole parameters from observational data. Our model can generate BH images for any spin value within the range of [-1, 1], given an electron temperature distribution. To validate the effectiveness of our approach, we employ a convolutional neural network to predict the BH spin using both the GRMHD images and the images generated by our proposed model. Our results demonstrate a significant performance improvement when training is conducted with the augmented dataset while testing is performed using GRMHD simulated data, as indicated by the high R2 score. Consequently, we propose that GANs can be employed as cost effective models for black hole image generation and reliably augment training datasets for other parameterization algorithms.
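A minimal conditional generator conditioned on a scalar spin value illustrates the interface; the progressive-growing training and electron temperature conditioning of the actual CPGAN are omitted, and the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Noise plus a spin value in [-1, 1] mapped to a small image."""
    def __init__(self, z_dim=64, img=32):
        super().__init__()
        self.img = img
        self.net = nn.Sequential(
            nn.Linear(z_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, img * img), nn.Tanh())

    def forward(self, z, spin):  # spin: (batch, 1) in [-1, 1]
        x = self.net(torch.cat([z, spin], dim=1))
        return x.view(-1, 1, self.img, self.img)

gen = ConditionalGenerator()
imgs = gen(torch.randn(4, 64), torch.tensor([[-0.9], [0.0], [0.5], [1.0]]))
print(imgs.shape)  # torch.Size([4, 1, 32, 32])
```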

Second-Order Uncertainty Quantification: A Distance-Based Approach

  • paper_url: http://arxiv.org/abs/2312.00995
  • repo_url: None
  • paper_authors: Yusuf Sale, Viktor Bengs, Michele Caprio, Eyke Hüllermeier
  • for: This paper proposes a set of formal criteria for uncertainty measures in machine learning, specifically for predictive uncertainty based on second-order distributions.
  • methods: The paper offers a general framework for developing uncertainty measures that satisfy these criteria, and provides an instantiation based on the Wasserstein distance.
  • results: The authors prove that their proposed uncertainty measure satisfies all of the formal criteria they have established.
    Abstract In the past couple of years, various approaches to representing and quantifying different types of predictive uncertainty in machine learning, notably in the setting of classification, have been proposed on the basis of second-order probability distributions, i.e., predictions in the form of distributions on probability distributions. A completely conclusive solution has not yet been found, however, as shown by recent criticisms of commonly used uncertainty measures associated with second-order distributions, identifying undesirable theoretical properties of these measures. In light of these criticisms, we propose a set of formal criteria that meaningful uncertainty measures for predictive uncertainty based on second-order distributions should obey. Moreover, we provide a general framework for developing uncertainty measures to account for these criteria, and offer an instantiation based on the Wasserstein distance, for which we prove that all criteria are satisfied.
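In the spirit of the Wasserstein instantiation, a simple distance-based measure can be computed by sampling first-order distributions from the second-order distribution and averaging pairwise Wasserstein distances; the Dirichlet second-order model and the ordered three-class support below are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
support = np.arange(3)  # three classes on an ordered support

def total_uncertainty(alpha, n_samples=200):
    """Expected pairwise Wasserstein distance between sampled first-order
    distributions, as a crude distance-based uncertainty proxy."""
    thetas = rng.dirichlet(alpha, size=n_samples)
    pairs = rng.integers(0, n_samples, size=(500, 2))
    return np.mean([wasserstein_distance(support, support, thetas[i], thetas[j])
                    for i, j in pairs])

print(total_uncertainty(np.array([1.0, 1.0, 1.0])))     # spread-out second-order distribution
print(total_uncertainty(np.array([50.0, 50.0, 50.0])))  # concentrated: lower uncertainty
```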

Improving Normative Modeling for Multi-modal Neuroimaging Data using mixture-of-product-of-experts variational autoencoders

  • paper_url: http://arxiv.org/abs/2312.00992
  • repo_url: None
  • paper_authors: Sayantan Kumar, Philip Payne, Aristeidis Sotiras
  • for: This paper proposes an improved normative model for estimating how Alzheimer's disease (AD) subjects deviate from the healthy population distribution in multimodal neuroimaging data.
  • methods: The model adopts the mixture-of-product-of-experts (MoPoE) technique, which allows better modelling of the joint latent posterior across modalities, and labels subjects as outliers by calculating deviations from the multimodal latent space.
  • results: MoPoE avoids the uninformative joint latent distributions produced by product- or averaging-based aggregation of unimodal posteriors, and identifies which latent dimensions and brain regions are associated with abnormal deviations due to AD pathology.
    Abstract Normative models in neuroimaging learn the brain patterns of healthy population distribution and estimate how disease subjects like Alzheimer's Disease (AD) deviate from the norm. Existing variational autoencoder (VAE)-based normative models using multimodal neuroimaging data aggregate information from multiple modalities by estimating product or averaging of unimodal latent posteriors. This can often lead to uninformative joint latent distributions which affects the estimation of subject-level deviations. In this work, we addressed the prior limitations by adopting the Mixture-of-Product-of-Experts (MoPoE) technique which allows better modelling of the joint latent posterior. Our model labelled subjects as outliers by calculating deviations from the multimodal latent space. Further, we identified which latent dimensions and brain regions were associated with abnormal deviations due to AD pathology.
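The MoPoE aggregation itself is easy to state for unimodal Gaussian posteriors: take the precision-weighted product of experts over every non-empty subset of modalities and mix the results with equal weight. The sketch below works in one latent dimension for clarity:

```python
import numpy as np
from itertools import chain, combinations

def product_of_gaussians(mus, sigmas):
    """Precision-weighted product of unimodal Gaussian posteriors (PoE)."""
    precision = sum(1.0 / s**2 for s in sigmas)
    mu = sum(m / s**2 for m, s in zip(mus, sigmas)) / precision
    return mu, np.sqrt(1.0 / precision)

def mopoe_components(mus, sigmas):
    """One PoE posterior per non-empty modality subset, mixed with equal weight."""
    idx = range(len(mus))
    subsets = chain.from_iterable(combinations(idx, r) for r in range(1, len(mus) + 1))
    return [product_of_gaussians([mus[i] for i in s], [sigmas[i] for i in s])
            for s in subsets]

# Two modalities (e.g. two imaging channels) with 1-D latent posteriors:
print(mopoe_components(mus=[0.2, 1.0], sigmas=[0.5, 1.0]))  # PoE for {1}, {2}, {1,2}
```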

Convergences for Minimax Optimization Problems over Infinite-Dimensional Spaces Towards Stability in Adversarial Training

  • paper_url: http://arxiv.org/abs/2312.00991
  • repo_url: None
  • paper_authors: Takashi Furuya, Satoshi Okuda, Kazuma Suetake, Yoshihide Sawada
  • for: This paper aims to address the instability issue in training neural networks, specifically in generative adversarial networks (GANs) and unsupervised domain adaptations (UDAs), through theoretical functional analysis.
  • methods: The paper uses gradient descent over infinite-dimensional spaces of continuous functions and probability measures to analyze the convergence property of the minimax problem. The authors also discuss various stabilization techniques, such as spectral normalization and gradient penalty, that are necessary for the convergence property.
  • results: The authors show that the conditions necessary for the convergence property are interpreted as stabilization techniques for adversarial training, providing a comprehensive understanding of GANs and UDAs.
    Abstract Training neural networks that require adversarial optimization, such as generative adversarial networks (GANs) and unsupervised domain adaptations (UDAs), suffers from instability. This instability problem comes from the difficulty of the minimax optimization, and there have been various approaches in GANs and UDAs to overcome this problem. In this study, we tackle this problem theoretically through a functional analysis. Specifically, we show the convergence property of the minimax problem by the gradient descent over the infinite-dimensional spaces of continuous functions and probability measures under certain conditions. Using this setting, we can discuss GANs and UDAs comprehensively, which have been studied independently. In addition, we show that the conditions necessary for the convergence property are interpreted as stabilization techniques of adversarial training such as the spectral normalization and the gradient penalty.
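One of the stabilization techniques the analysis recovers as a convergence condition, spectral normalization, is a one-liner in PyTorch; the critic architecture below is arbitrary:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Constrain each layer's Lipschitz constant by normalizing its weight's
# largest singular value -- the spectral normalization discussed above.
critic = nn.Sequential(
    spectral_norm(nn.Linear(64, 128)), nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(128, 1)))
```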

Noisy probing dose facilitated dose prediction for pencil beam scanning proton therapy: physics enhances generalizability

  • paper_url: http://arxiv.org/abs/2312.00975
  • repo_url: None
  • paper_authors: Lian Zhang, Jason M. Holmes, Zhengliang Liu, Hongying Feng, Terence T. Sio, Carlos E. Vargas, Sameer R. Keole, Kristin Stützer, Sheng Li, Tianming Liu, Jiajian Shen, William W. Wong, Sujay A. Vora, Wei Liu
  • for: This study aims to design a physics-aware and generalizable AI-based dose prediction method for pencil beam scanning proton therapy (PBSPT) that properly handles outlier clinical cases.
  • methods: Three methods were evaluated in an ablation study: an ROI-based method, a beam mask and sliding window method, and a noisy probing dose method.
  • results: The noisy probing dose method showed improved agreement of DVH indices, 3D Gamma passing rates, and dice coefficients on the testing cases, and better generalizability than the ROI-based and beam mask-based methods on the 6 outlier cases.
    Abstract Purpose: Prior AI-based dose prediction studies in photon and proton therapy often neglect underlying physics, limiting their generalizability to handle outlier clinical cases, especially for pencil beam scanning proton therapy (PBSPT). Our aim is to design a physics-aware and generalizable AI-based PBSPT dose prediction method that has the underlying physics considered to achieve high generalizability to properly handle the outlier clinical cases. Methods and Materials: This study analyzed PBSPT plans of 103 prostate and 78 lung cancer patients from our institution, with each case comprising CT images, structure sets, and plan doses from our Monte-Carlo dose engine (serving as the ground truth). Three methods were evaluated in the ablation study: the ROI-based method, the beam mask and sliding window method, and the noisy probing dose method. Twelve cases with uncommon beam angles or prescription doses tested the methods' generalizability to rare treatment planning scenarios. Performance evaluation used DVH indices, 3D Gamma passing rates (3%/2mm/10%), and dice coefficients for dose agreement. Results: The noisy probing dose method showed improved agreement of DVH indices, 3D Gamma passing rates, and dice coefficients compared to the conventional methods for the testing cases. The noisy probing dose method showed better generalizability in the 6 outlier cases than the ROI-based and beam mask-based methods with 3D Gamma passing rates (for prostate cancer, targets: 89.32%$\pm$1.45% vs. 93.48%$\pm$1.51% vs. 96.79%$\pm$0.83%, OARs: 85.87%$\pm$1.73% vs. 91.15%$\pm$1.13% vs. 94.29%$\pm$1.01%). The dose predictions were completed within 0.3 seconds. Conclusions: We've devised a novel noisy probing dose method for PBSPT dose prediction in prostate and lung cancer patients. With more physics included, it enhances the generalizability of dose prediction in handling outlier clinical cases.

eess.IV - 2023-12-02

OpEnCam: Lensless Optical Encryption Camera

  • paper_url: http://arxiv.org/abs/2312.01077
  • repo_url: None
  • paper_authors: Salman S. Khan, Xiang Yu, Kaushik Mitra, Manmohan Chandraker, Francesco Pittaluga
  • for: This work develops a lensless camera with built-in optical encryption to preserve the privacy of the scene.
  • methods: The design uses two co-axially located optical masks, one stuck to the sensor and the other a few millimeters above it; the unique optical elements of each camera define its encryption key, and the mask patterns are derived heuristically from signal processing ideas.
  • results: OpEnCam is shown to be robust against a range of attack types while maintaining the imaging capabilities of existing lensless cameras; its efficacy is validated with simulated and real data and a lab-built proof-of-concept prototype.
    Abstract Lensless cameras multiplex the incoming light before it is recorded by the sensor. This ability to multiplex the incoming light has led to the development of ultra-thin, high-speed, and single-shot 3D imagers. Recently, there have been various attempts at demonstrating another useful aspect of lensless cameras - their ability to preserve the privacy of a scene by capturing encrypted measurements. However, existing lensless camera designs suffer numerous inherent privacy vulnerabilities. To demonstrate this, we develop the first comprehensive attack model for encryption cameras, and propose OpEnCam -- a novel lensless OPtical ENcryption CAmera design that overcomes these vulnerabilities. OpEnCam encrypts the incoming light before capturing it using the modulating ability of optical masks. Recovery of the original scene from an OpEnCam measurement is possible only if one has access to the camera's encryption key, defined by the unique optical elements of each camera. Our OpEnCam design introduces two major improvements over existing lensless camera designs - (a) the use of two co-axially located optical masks, one stuck to the sensor and the other a few millimeters above the sensor and (b) the design of mask patterns, which are derived heuristically from signal processing ideas. We show, through experiments, that OpEnCam is robust against a range of attack types while still maintaining the imaging capabilities of existing lensless cameras. We validate the efficacy of OpEnCam using simulated and real data. Finally, we built and tested a prototype in the lab for proof-of-concept.
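A toy forward model conveys why recovery requires the key: the scene is multiplexed by convolution with the mask above the sensor and then modulated elementwise by the mask on the sensor. The random masks and sizes below are placeholders for OpEnCam's heuristically designed patterns:

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
scene = rng.random((64, 64))
mask1 = rng.random((64, 64))  # key part 1: mask a few mm above the sensor
mask2 = rng.random((64, 64))  # key part 2: mask stuck on the sensor

# Convolutional multiplexing by mask1, then pointwise modulation by mask2;
# without knowing both masks, inverting this measurement is ill-posed.
measurement = mask2 * fftconvolve(scene, mask1, mode="same")
print(measurement.shape)  # (64, 64) encrypted sensor reading
```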