cs.CV - 2023-10-14

What Do Deep Saliency Models Learn about Visual Attention?

  • paper_url: http://arxiv.org/abs/2310.09679
  • repo_url: https://github.com/szzexpoi/saliency_analysis
  • paper_authors: Shi Chen, Ming Jiang, Qi Zhao
  • for: This paper investigates how deep saliency models predict human visual attention and what mechanisms underlie their success.
  • methods: The authors present a novel analytic framework that sheds light on the implicit features learned by deep saliency models and provides a principled interpretation and quantification of their contributions. The framework decomposes the implicit features into interpretable bases and reformulates saliency prediction as a weighted combination of probability maps connecting the bases and saliency. A minimal code sketch of this reformulation follows this entry.
  • results: Applying the framework, the authors conduct extensive analyses covering the positive and negative weights of semantics, the impact of training data and architectural designs, the progressive influence of fine-tuning, and common failure patterns of deep saliency models. They further explore attention characteristics in application scenarios such as the atypical attention of people with autism spectrum disorder, attention to emotion-eliciting stimuli, and attention evolution over time.
    Abstract In recent years, deep saliency models have made significant progress in predicting human visual attention. However, the mechanisms behind their success remain largely unexplained due to the opaque nature of deep neural networks. In this paper, we present a novel analytic framework that sheds light on the implicit features learned by saliency models and provides principled interpretation and quantification of their contributions to saliency prediction. Our approach decomposes these implicit features into interpretable bases that are explicitly aligned with semantic attributes and reformulates saliency prediction as a weighted combination of probability maps connecting the bases and saliency. By applying our framework, we conduct extensive analyses from various perspectives, including the positive and negative weights of semantics, the impact of training data and architectural designs, the progressive influences of fine-tuning, and common failure patterns of state-of-the-art deep saliency models. Additionally, we demonstrate the effectiveness of our framework by exploring visual attention characteristics in various application scenarios, such as the atypical attention of people with autism spectrum disorder, attention to emotion-eliciting stimuli, and attention evolution over time. Our code is publicly available at \url{https://github.com/szzexpoi/saliency_analysis}.
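The reformulation described in the abstract, saliency as a weighted combination of probability maps connecting interpretable bases and saliency, can be illustrated with a minimal sketch. The code below is not the authors' released implementation; the feature tensor, basis matrix, and weights are random placeholders standing in for the learned quantities.

```python
import numpy as np

def saliency_from_bases(features, bases, weights):
    """Sketch: saliency as a weighted combination of basis probability maps.

    features: (C, H, W) implicit feature maps from a saliency model.
    bases:    (K, C) interpretable basis vectors (e.g., aligned with semantics).
    weights:  (K,) positive/negative contribution of each semantic basis.
    """
    C, H, W = features.shape
    flat = features.reshape(C, -1)                # (C, H*W)
    scores = bases @ flat                         # (K, H*W) alignment scores
    # Softmax over spatial locations -> one probability map per basis.
    scores -= scores.max(axis=1, keepdims=True)
    prob_maps = np.exp(scores)
    prob_maps /= prob_maps.sum(axis=1, keepdims=True)
    saliency = (weights[:, None] * prob_maps).sum(axis=0)   # (H*W,)
    return saliency.reshape(H, W)

# Toy usage with random placeholders.
rng = np.random.default_rng(0)
sal = saliency_from_bases(rng.normal(size=(64, 32, 32)),
                          rng.normal(size=(10, 64)),
                          rng.normal(size=10))
print(sal.shape)  # (32, 32)
```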

Point-DynRF: Point-based Dynamic Radiance Fields from a Monocular Video

  • paper_url: http://arxiv.org/abs/2310.09647
  • repo_url: None
  • paper_authors: Byeongjun Park, Changick Kim
  • for: Novel view synthesis from a monocular video
  • methods: Neural point clouds and dynamic radiance fields are used to learn global scene geometry and the volume rendering process, respectively
  • results: Effectiveness is validated on the NVIDIA Dynamic Scenes Dataset and several casually captured monocular video clips
    Abstract Dynamic radiance fields have emerged as a promising approach for generating novel views from a monocular video. However, previous methods enforce the geometric consistency to dynamic radiance fields only between adjacent input frames, making it difficult to represent the global scene geometry and causing degeneration at viewpoints that are spatio-temporally distant from the input camera trajectory. To solve this problem, we introduce point-based dynamic radiance fields (\textbf{Point-DynRF}), a novel framework where the global geometric information and the volume rendering process are trained by neural point clouds and dynamic radiance fields, respectively. Specifically, we reconstruct neural point clouds directly from geometric proxies and optimize both radiance fields and the geometric proxies using our proposed losses, allowing them to complement each other. We validate the effectiveness of our method with experiments on the NVIDIA Dynamic Scenes Dataset and several casually captured monocular video clips.

Dimma: Semi-supervised Low Light Image Enhancement with Adaptive Dimming

  • paper_url: http://arxiv.org/abs/2310.09633
  • repo_url: https://github.com/wojciechkoz/dimma
  • paper_authors: Wojciech Kozłowski, Michał Szachniewicz, Michał Stypułkowski, Maciej Zięba
  • for: The paper aims to enhance low-light images while maintaining their natural colors.
  • methods: The proposed approach uses a small set of image pairs to replicate scenes captured under extreme lighting conditions using a specific camera. It employs a convolutional mixture density network to generate distorted colors based on illumination differences, and accurately grades the dimming factor for flexibility in adjusting brightness levels. Additionally, the approach uses a conditional UNet architecture to generate images with desired lightness levels based on user input.
  • results: The proposed approach achieves competitive results compared to fully supervised methods, and surpasses state-of-the-art methods in some metrics when trained on the full dataset.
    Abstract Enhancing low-light images while maintaining natural colors is a challenging problem due to camera processing variations and limited access to photos with ground-truth lighting conditions. The latter is a crucial factor for supervised methods that achieve good results on paired datasets but do not handle out-of-domain data well. On the other hand, unsupervised methods, while able to generalize, often yield lower-quality enhancements. To fill this gap, we propose Dimma, a semi-supervised approach that aligns with any camera by utilizing a small set of image pairs to replicate scenes captured under extreme lighting conditions taken by that specific camera. We achieve that by introducing a convolutional mixture density network that generates distorted colors of the scene based on the illumination differences. Additionally, our approach enables accurate grading of the dimming factor, which provides a wide range of control and flexibility in adjusting the brightness levels during the low-light image enhancement process. To further improve the quality of our results, we introduce an architecture based on a conditional UNet. The lightness value provided by the user serves as the conditional input to generate images with the desired lightness. Our approach using only few image pairs achieves competitive results compared to fully supervised methods. Moreover, when trained on the full dataset, our model surpasses state-of-the-art methods in some metrics and closely approaches them in others.
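The abstract describes a conditional UNet whose conditional input is the user-provided lightness value. A minimal, hedged sketch of that conditioning pattern is shown below; the tiny network and its layer sizes are illustrative stand-ins, not the Dimma architecture.

```python
import torch
import torch.nn as nn

class LightnessConditionedNet(nn.Module):
    """Toy stand-in for a conditional UNet: the scalar lightness value requested
    by the user is broadcast to a feature plane and concatenated with the RGB
    input, so the network can generate images at the desired lightness."""

    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3 + 1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x, lightness):
        # x: (B, 3, H, W) low-light image, lightness: (B,) values in [0, 1]
        cond = lightness.view(-1, 1, 1, 1).expand(-1, 1, x.shape[2], x.shape[3])
        return self.body(torch.cat([x, cond], dim=1))

net = LightnessConditionedNet()
out = net(torch.rand(2, 3, 64, 64), torch.tensor([0.3, 0.8]))
print(out.shape)  # torch.Size([2, 3, 64, 64])
```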

Time-based Mapping of Space Using Visual Motion Invariants

  • paper_url: http://arxiv.org/abs/2310.09632
  • repo_url: None
  • paper_authors: Juan D. Yepes, Daniel Raviv
  • for: This paper proposes a representation of 3D points based on visual motion invariants that ensures shape constancy.
  • methods: Nonlinear functions of measurable optical flow are used to create a novel representation, referred to as 'Time-Clearance' and 'Time-to-Contact'. Because these invariants remain constant over time, moving points that do not adhere to the expected constancy are straightforward to detect. A small worked example of the Time-to-Contact computation follows this entry.
  • results: The authors demonstrate the effectiveness of the representation through camera-motion simulations and Unity-based experiments, showing that moving points violating the expected invariance are readily identified. The representation requires only a single camera, does not need the magnitude of the camera's velocity vector, and is pixel-based, making it suitable for parallel processing.
    Abstract This paper focuses on visual motion-based invariants that result in a representation of 3D points in which the stationary environment remains invariant, ensuring shape constancy. This is achieved even as the images undergo constant change due to camera motion. Nonlinear functions of measurable optical flow, which are related to geometric 3D invariants, are utilized to create a novel representation. We refer to the resulting optical flow-based invariants as 'Time-Clearance' and the well-known 'Time-to-Contact' (TTC). Since these invariants remain constant over time, it becomes straightforward to detect moving points that do not adhere to the expected constancy. We present simulations of a camera moving relative to a 3D object, snapshots of its projected images captured by a rectilinearly moving camera, and the object as it appears unchanged in the new domain over time. In addition, Unity-based simulations demonstrate color-coded transformations of a projected 3D scene, illustrating how moving objects can be readily identified. This representation is straightforward, relying on simple optical flow functions. It requires only one camera, and there is no need to determine the magnitude of the camera's velocity vector. Furthermore, the representation is pixel-based, making it suitable for parallel processing.
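As a worked illustration of the Time-to-Contact idea (not the authors' implementation), the sketch below computes per-pixel TTC from dense optical flow under the assumption of pure camera translation along the optical axis: the flow then expands radially from the focus of expansion, and TTC is the ratio of radial distance to radial flow speed.

```python
import numpy as np

def time_to_contact(flow, foe, fps=30.0):
    """Per-pixel Time-to-Contact from dense optical flow.

    Assumes pure camera translation along the optical axis, so the flow
    expands radially from the focus of expansion (FoE) and TTC ~= r / (dr/dt).

    flow: (H, W, 2) optical flow in pixels/frame (u, v).
    foe:  (x, y) focus of expansion in pixel coordinates.
    """
    H, W, _ = flow.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    dx, dy = xs - foe[0], ys - foe[1]
    r = np.hypot(dx, dy) + 1e-8
    # Radial component of the flow (pixels/frame): flow projected onto (dx, dy)/r.
    radial = (flow[..., 0] * dx + flow[..., 1] * dy) / r
    ttc_frames = np.full_like(r, np.inf)
    expanding = radial > 1e-6
    ttc_frames[expanding] = r[expanding] / radial[expanding]
    return ttc_frames / fps  # seconds

flow = np.random.rand(120, 160, 2) * 0.1  # placeholder flow field
print(time_to_contact(flow, foe=(80, 60)).shape)  # (120, 160)
```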

Real-Time Traffic Sign Detection: A Case Study in a Santa Clara Suburban Neighborhood

  • paper_url: http://arxiv.org/abs/2310.09630
  • repo_url: None
  • paper_authors: Harish Loghashankar, Hieu Nguyen
  • for: This work develops a real-time traffic sign detection system based on the YOLOv5 architecture and deploys it for recognition during a drive in a suburban neighborhood.
  • methods: The YOLOv5 model is trained on a diverse dataset of traffic sign images and deployed on a hardware platform suitable for real-time inference. A minimal deployment sketch follows this entry.
  • results: In a case study, the system detected and classified traffic signs from a real-time dashboard camera with 96% accuracy, indicating it can provide timely and accurate traffic sign information to improve road safety and traffic management and pave the way for further research into autonomous driving.
    Abstract This research project aims to develop a real-time traffic sign detection system using the YOLOv5 architecture and deploy it for efficient traffic sign recognition during a drive in a suburban neighborhood. The project's primary objectives are to train the YOLOv5 model on a diverse dataset of traffic sign images and deploy the model on a suitable hardware platform capable of real-time inference. The project will involve collecting a comprehensive dataset of traffic sign images. By leveraging the trained YOLOv5 model, the system will detect and classify traffic signs from a real-time camera on a dashboard inside a vehicle. The performance of the deployed system will be evaluated based on its accuracy in detecting traffic signs, real-time processing speed, and overall reliability. During a case study in a suburban neighborhood, the system demonstrated a notable 96% accuracy in detecting traffic signs. This research's findings have the potential to improve road safety and traffic management by providing timely and accurate real-time information about traffic signs and can pave the way for further research into autonomous driving.
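A minimal sketch of the deployment side, real-time traffic sign detection from a dashboard camera with a YOLOv5 model, is given below using the public Ultralytics torch.hub interface and OpenCV. The custom weights path is a placeholder; the paper's own training pipeline, dataset, and hardware setup are not reproduced here.

```python
import cv2
import torch

# Load YOLOv5 via the public Ultralytics hub interface; the weights path is a
# placeholder for a checkpoint fine-tuned on a traffic-sign dataset.
model = torch.hub.load('ultralytics/yolov5', 'custom', path='weights/traffic_signs.pt')
model.conf = 0.4  # confidence threshold

cap = cv2.VideoCapture(0)  # dashboard camera
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = model(rgb)
    for *xyxy, conf, cls in results.xyxy[0].tolist():
        x1, y1, x2, y2 = map(int, xyxy)
        label = f"{model.names[int(cls)]} {conf:.2f}"
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, label, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imshow('traffic signs', frame)
    if cv2.waitKey(1) == 27:  # Esc quits
        break
cap.release()
cv2.destroyAllWindows()
```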

Detecting Moving Objects Using a Novel Optical-Flow-Based Range-Independent Invariant

  • paper_url: http://arxiv.org/abs/2310.09627
  • repo_url: None
  • paper_authors: Daniel Raviv, Juan D. Yepes, Ayush Gowda
  • for: This paper focuses on a novel approach for detecting moving objects during camera motion.
  • methods: An optical-flow-based transformation yields a consistent 2D invariant lookup image that is independent of time instants, the range of 3D points, and the camera's speed. In this new domain, projections of 3D points that deviate from the predefined lookup values are readily identified as moving relative to the stationary 3D environment. The method requires no prior knowledge of the camera's direction of motion or speed, needs no 3D range information, and is well-suited to real-time parallel processing.
  • results: The authors validate the new domain through simulations and real-world experiments, demonstrating robustness under rectilinear camera motion. The approach introduces new ways of detecting moving objects during camera motion and lays the foundation for future work on six-degrees-of-freedom camera motion.
    Abstract This paper focuses on a novel approach for detecting moving objects during camera motion. We present an optical-flow-based transformation that yields a consistent 2D invariant image output regardless of time instants, range of points in 3D, and the speed of the camera. In other words, this transformation generates a lookup image that remains invariant despite the changing projection of the 3D scene and camera motion. In the new domain, projections of 3D points that deviate from the values of the predefined lookup image can be clearly identified as moving relative to the stationary 3D environment, making them seamlessly detectable. The method does not require prior knowledge of the direction of motion or speed of the camera, nor does it necessitate 3D point range information. It is well-suited for real-time parallel processing, rendering it highly practical for implementation. We have validated the effectiveness of the new domain through simulations and experiments, demonstrating its robustness in scenarios involving rectilinear camera motion, both in simulations and with real-world data. This approach introduces new ways for moving objects detection during camera motion, and also lays the foundation for future research in the context of moving object detection during six-degrees-of-freedom camera motion.

JSMoCo: Joint Coil Sensitivity and Motion Correction in Parallel MRI with a Self-Calibrating Score-Based Diffusion Model

  • paper_url: http://arxiv.org/abs/2310.09625
  • repo_url: None
  • paper_authors: Lixuan Chen, Xuanyu Tian, Jiangjie Wu, Ruimin Feng, Guoyan Lao, Yuyao Zhang, Hongjiang Wei
  • for: correction of motion artifacts in MRI reconstruction
  • methods: joint estimation of motion parameters and coil sensitivity maps using score-based diffusion models and Gibbs sampler
  • results: high-quality MRI image reconstruction from sparsely-sampled k-space data, even in the presence of motion
    Abstract Magnetic Resonance Imaging (MRI) stands as a powerful modality in clinical diagnosis. However, it is known that MRI faces challenges such as long acquisition time and vulnerability to motion-induced artifacts. Despite the success of many existing motion correction algorithms, there has been limited research focused on correcting motion artifacts on the estimated coil sensitivity maps for fast MRI reconstruction. Existing methods might suffer from severe performance degradation due to error propagation resulting from the inaccurate coil sensitivity maps estimation. In this work, we propose to jointly estimate the motion parameters and coil sensitivity maps for under-sampled MRI reconstruction, referred to as JSMoCo. However, joint estimation of motion parameters and coil sensitivities results in a highly ill-posed inverse problem due to an increased number of unknowns. To address this, we introduce score-based diffusion models as powerful priors and leverage the MRI physical principles to efficiently constrain the solution space for this optimization problem. Specifically, we parameterize the rigid motion as three trainable variables and model coil sensitivity maps as polynomial functions. Leveraging the physical knowledge, we then employ Gibbs sampler for joint estimation, ensuring system consistency between sensitivity maps and desired images, avoiding error propagation from pre-estimated sensitivity maps to the reconstructed images. We conduct comprehensive experiments to evaluate the performance of JSMoCo on the fastMRI dataset. The results show that our method is capable of reconstructing high-quality MRI images from sparsely-sampled k-space data, even affected by motion. It achieves this by accurately estimating both motion parameters and coil sensitivities, effectively mitigating motion-related challenges during MRI reconstruction.

Learning Hierarchical Features with Joint Latent Space Energy-Based Prior

  • paper_url: http://arxiv.org/abs/2310.09604
  • repo_url: None
  • paper_authors: Jiali Cui, Ying Nian Wu, Tian Han
  • for: The fundamental problem of learning hierarchical representations with multi-layer generator models
  • methods: A joint latent space EBM prior model with multi-layer latent variables, trained with a variational joint learning scheme that integrates an inference model for efficient inference. A generic sampling sketch follows this entry.
  • results: Experiments demonstrate that the proposed joint EBM prior effectively captures hierarchical representations and models the data distribution
    Abstract This paper studies the fundamental problem of multi-layer generator models in learning hierarchical representations. The multi-layer generator model that consists of multiple layers of latent variables organized in a top-down architecture tends to learn multiple levels of data abstraction. However, such multi-layer latent variables are typically parameterized to be Gaussian, which can be less informative in capturing complex abstractions, resulting in limited success in hierarchical representation learning. On the other hand, the energy-based (EBM) prior is known to be expressive in capturing the data regularities, but it often lacks the hierarchical structure to capture different levels of hierarchical representations. In this paper, we propose a joint latent space EBM prior model with multi-layer latent variables for effective hierarchical representation learning. We develop a variational joint learning scheme that seamlessly integrates an inference model for efficient inference. Our experiments demonstrate that the proposed joint EBM prior is effective and expressive in capturing hierarchical representations and modelling data distribution.
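Latent-space EBM priors of this kind are commonly sampled with short-run Langevin dynamics. The sketch below illustrates that generic sampling step for an energy-corrected Gaussian prior over a single latent vector; the toy energy network and hyperparameters are assumptions, not the paper's multi-layer model.

```python
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 1))  # toy EBM

def sample_prior(n, dim=16, steps=60, step_size=0.1):
    """Short-run Langevin dynamics for an EBM-corrected Gaussian prior:
    p(z) proportional to exp(-E(z)) * N(z; 0, I)."""
    z = torch.randn(n, dim)
    for _ in range(steps):
        z = z.detach().requires_grad_(True)
        neg_log_p = energy(z).sum() + 0.5 * (z ** 2).sum()   # E(z) + ||z||^2 / 2
        grad, = torch.autograd.grad(neg_log_p, z)
        z = z - 0.5 * step_size * grad + (step_size ** 0.5) * torch.randn_like(z)
    return z.detach()

print(sample_prior(4).shape)  # torch.Size([4, 16])
```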

B-Spine: Learning B-Spline Curve Representation for Robust and Interpretable Spinal Curvature Estimation

  • paper_url: http://arxiv.org/abs/2310.09603
  • repo_url: https://github.com/whao22/B-Spine
  • paper_authors: Hao Wang, Qiang Song, Ruofeng Yin, Rui Ma, Yizhou Yu, Yi Chang
  • for: This paper proposes a robust and interpretable method for spinal curvature estimation.
  • methods: A deep learning pipeline, consisting of a SegRefine network and a B-spline prediction model, extracts the spinal curve from low-quality X-ray images; the Cobb angles are then estimated from the predicted curve. A small Cobb-angle illustration follows this entry.
  • results: Quantitative and qualitative comparisons with representative and state-of-the-art learning-based methods on the public AASCE2019 dataset and the newly proposed CJUH-JLU dataset show superior performance, demonstrating robustness and interpretability for spinal curvature estimation.
    Abstract Spinal curvature estimation is important to the diagnosis and treatment of the scoliosis. Existing methods face several issues such as the need of expensive annotations on the vertebral landmarks and being sensitive to the image quality. It is challenging to achieve robust estimation and obtain interpretable results, especially for low-quality images which are blurry and hazy. In this paper, we propose B-Spine, a novel deep learning pipeline to learn B-spline curve representation of the spine and estimate the Cobb angles for spinal curvature estimation from low-quality X-ray images. Given a low-quality input, a novel SegRefine network which employs the unpaired image-to-image translation is proposed to generate a high quality spine mask from the initial segmentation result. Next, a novel mask-based B-spline prediction model is proposed to predict the B-spline curve for the spine centerline. Finally, the Cobb angles are estimated by a hybrid approach which combines the curve slope analysis and a curve-based regression model. We conduct quantitative and qualitative comparisons with the representative and SOTA learning-based methods on the public AASCE2019 dataset and our new proposed CJUH-JLU dataset which contains more challenging low-quality images. The superior performance on both datasets shows our method can achieve both robustness and interpretability for spinal curvature estimation.
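The "curve slope analysis" component can be illustrated with a small, hedged example (not the B-Spine code): fit a smoothing spline to spine centerline points and read off an approximate Cobb angle as the largest difference between tangent angles along the curve.

```python
import numpy as np
from scipy.interpolate import splprep, splev

def cobb_from_centerline(points, smooth=5.0):
    """points: (N, 2) array of (x, y) spine centerline coordinates, top to bottom.
    Returns an approximate Cobb angle in degrees from tangent-slope analysis."""
    tck, _ = splprep(points.T, s=smooth)           # fit a smoothing B-spline
    u = np.linspace(0, 1, 200)
    dx, dy = splev(u, tck, der=1)                  # tangents along the curve
    angles = np.degrees(np.arctan2(dx, dy))        # tilt w.r.t. the vertical axis
    return float(angles.max() - angles.min())      # most-tilted tangents as a proxy

# Toy centerline: a gentle S-shaped curve.
y = np.linspace(0, 400, 50)
x = 20 * np.sin(y / 400 * 2 * np.pi)
print(f"Cobb angle ~ {cobb_from_centerline(np.stack([x, y], axis=1)):.1f} deg")
```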

Hawkeye: A PyTorch-based Library for Fine-Grained Image Recognition with Deep Learning

  • paper_url: http://arxiv.org/abs/2310.09600
  • repo_url: https://github.com/hawkeye-finegrained/hawkeye
  • paper_authors: Jiabei He, Yang Shen, Xiu-Shen Wei, Ye Wu
  • for: This work provides an open-source PyTorch-based library for Fine-Grained Image Recognition (FGIR) tasks.
  • methods: The library is built on deep learning with a modular architecture that emphasizes high-quality code and human-readable configuration. It implements 16 state-of-the-art fine-grained methods covering 6 different paradigms, allowing users to explore various approaches to FGIR.
  • results: According to the authors, Hawkeye is the first open-source PyTorch-based library dedicated to FGIR, providing a comprehensive solution for FGIR tasks.
    Abstract Fine-Grained Image Recognition (FGIR) is a fundamental and challenging task in computer vision and multimedia that plays a crucial role in Intellectual Economy and Industrial Internet applications. However, the absence of a unified open-source software library covering various paradigms in FGIR poses a significant challenge for researchers and practitioners in the field. To address this gap, we present Hawkeye, a PyTorch-based library for FGIR with deep learning. Hawkeye is designed with a modular architecture, emphasizing high-quality code and human-readable configuration, providing a comprehensive solution for FGIR tasks. In Hawkeye, we have implemented 16 state-of-the-art fine-grained methods, covering 6 different paradigms, enabling users to explore various approaches for FGIR. To the best of our knowledge, Hawkeye represents the first open-source PyTorch-based library dedicated to FGIR. It is publicly available at https://github.com/Hawkeye-FineGrained/Hawkeye/, providing researchers and practitioners with a powerful tool to advance their research and development in the field of FGIR.

Learning Unified Representations for Multi-Resolution Face Recognition

  • paper_url: http://arxiv.org/abs/2310.09563
  • repo_url: https://github.com/stevensmith2000/btnet
  • paper_authors: Hulingxiao He, Wu Yuan, Yidian Huang, Shilong Zhao, Wen Yuan, Hanqing Li
  • for: A representation learning method for multi-resolution face recognition
  • methods: A Branch-to-Trunk network (BTNet) consisting of a unified encoder (TNet) and multiple resolution adapters (BNets); the output of the resolution-specific BNet is implanted into the trunk's feature pyramid at the layer with the same resolution, significantly improving the discriminability of tiny faces. A toy sketch of this branch-to-trunk idea follows this entry.
  • results: Experiments show strong performance on face recognition benchmarks for both multi-resolution identity matching and feature aggregation, with much less computation and parameter storage, and a new state of the art on the QMUL-SurvFace 1:N face identification task. Code is available at https://github.com/StevenSmith2000/BTNet
    Abstract In this work, we propose Branch-to-Trunk network (BTNet), a representation learning method for multi-resolution face recognition. It consists of a trunk network (TNet), namely a unified encoder, and multiple branch networks (BNets), namely resolution adapters. As per the input, a resolution-specific BNet is used and the output are implanted as feature maps in the feature pyramid of TNet, at a layer with the same resolution. The discriminability of tiny faces is significantly improved, as the interpolation error introduced by rescaling, especially up-sampling, is mitigated on the inputs. With branch distillation and backward-compatible training, BTNet transfers discriminative high-resolution information to multiple branches while guaranteeing representation compatibility. Our experiments demonstrate strong performance on face recognition benchmarks, both for multi-resolution identity matching and feature aggregation, with much less computation amount and parameter storage. We establish new state-of-the-art on the challenging QMUL-SurvFace 1: N face identification task. Our code is available at https://github.com/StevenSmith2000/BTNet.
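A toy sketch of the branch-to-trunk idea is given below: a resolution-specific branch encodes a low-resolution face directly into a feature map that is implanted into the trunk at the pyramid level with the same spatial size, avoiding up-sampling. The layer sizes and the two supported resolutions are illustrative assumptions, not the BTNet configuration.

```python
import torch
import torch.nn as nn

class TrunkWithBranches(nn.Module):
    """Toy Branch-to-Trunk model: a resolution-specific branch (BNet) encodes a
    low-resolution face into a feature map that replaces the trunk (TNet)
    features at the pyramid level with the same spatial size."""

    def __init__(self):
        super().__init__()
        # Trunk stages: 112 -> 56 -> 28 -> 14 (feature pyramid levels).
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU())    # 56x56
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())   # 28x28
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU())  # 14x14
        self.head = nn.Linear(128, 256)
        # Branch for 28x28 inputs: maps the tiny face directly to stage-2 features.
        self.branch28 = nn.Sequential(nn.Conv2d(3, 64, 3, 1, 1), nn.ReLU())

    def forward(self, x):
        if x.shape[-1] == 112:                  # full-resolution path
            f = self.stage2(self.stage1(x))
        elif x.shape[-1] == 28:                 # low-resolution path, no up-sampling
            f = self.branch28(x)
        else:
            raise ValueError("unsupported resolution in this toy example")
        f = self.stage3(f)
        return self.head(f.mean(dim=(2, 3)))    # unified face embedding

net = TrunkWithBranches()
print(net(torch.rand(1, 3, 112, 112)).shape, net(torch.rand(1, 3, 28, 28)).shape)
```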

Scene Text Recognition Models Explainability Using Local Features

  • paper_url: http://arxiv.org/abs/2310.09549
  • repo_url: https://github.com/markytools/strexp
  • paper_authors: Mark Vincent Ty, Rowel Atienza
  • for: This paper studies explainability (XAI) for Scene Text Recognition (STR), i.e., understanding the causes of an STR model's predictions.
  • methods: Attribution-based data explainability frameworks are used to explain the important parts of the input in deep learning models. Because integrating them into STR only explains the model in a global context and yields inconsistent, ineffective explanations, the paper proposes STRExp, which additionally takes local explanations (individual character prediction explanations) into account.
  • results: STRExp is benchmarked across different attribution-based methods on different STR datasets and evaluated across different STR models.
    Abstract Explainable AI (XAI) is the study on how humans can be able to understand the cause of a model's prediction. In this work, the problem of interest is Scene Text Recognition (STR) Explainability, using XAI to understand the cause of an STR model's prediction. Recent XAI literatures on STR only provide a simple analysis and do not fully explore other XAI methods. In this study, we specifically work on data explainability frameworks, called attribution-based methods, that explain the important parts of an input data in deep learning models. However, integrating them into STR produces inconsistent and ineffective explanations, because they only explain the model in the global context. To solve this problem, we propose a new method, STRExp, to take into consideration the local explanations, i.e. the individual character prediction explanations. This is then benchmarked across different attribution-based methods on different STR datasets and evaluated across different STR models.

Benchmarking the Sim-to-Real Gap in Cloth Manipulation

  • paper_url: http://arxiv.org/abs/2310.09543
  • repo_url: None
  • paper_authors: David Blanco-Mulero, Oriol Barbany, Gokhan Alcan, Adrià Colomé, Carme Torras, Ville Kyrki
  • for: This paper evaluates how well physics engines used for learning deformable object (e.g., cloth) manipulation transfer to the real world.
  • methods: Four popular deformable object simulators, MuJoCo, Bullet, Flex, and SOFA, are evaluated on a dynamic cloth manipulation task involving contact with a rigid table.
  • results: The paper provides an open-source benchmark dataset for evaluating the sim-to-real gap, computational time, and simulation stability of these simulators.
    Abstract Realistic physics engines play a crucial role for learning to manipulate deformable objects such as garments in simulation. By doing so, researchers can circumvent challenges such as sensing the deformation of the object in the real-world. In spite of the extensive use of simulations for this task, few works have evaluated the reality gap between deformable object simulators and real-world data. We present a benchmark dataset to evaluate the sim-to-real gap in cloth manipulation. The dataset is collected by performing a dynamic cloth manipulation task involving contact with a rigid table. We use the dataset to evaluate the reality gap, computational time, and simulation stability of four popular deformable object simulators: MuJoCo, Bullet, Flex, and SOFA. Additionally, we discuss the benefits and drawbacks of each simulator. The benchmark dataset is open-source. Supplementary material, videos, and code, can be found at https://sites.google.com/view/cloth-sim2real-benchmark.

Towards End-to-End Unsupervised Saliency Detection with Self-Supervised Top-Down Context

  • paper_url: http://arxiv.org/abs/2310.09533
  • repo_url: None
  • paper_authors: Yicheng Song, Shuyong Gao, Haozhe Xing, Yiting Cheng, Yan Wang, Wenqiang Zhang
  • for: This work aims to improve the training efficiency of unsupervised salient object detection while mining the rich semantic information in deep features.
  • methods: A self-supervised end-to-end salient object detection framework that learns the most instructive segmentation guidance via top-down context, complemented by a detail-boosting refiner module and an Unsupervised Non-Salient Suppression (UNSS) method.
  • results: Extensive experiments on benchmark datasets show leading performance among recent end-to-end methods and most multi-stage solutions.
    Abstract Unsupervised salient object detection aims to detect salient objects without using supervision signals eliminating the tedious task of manually labeling salient objects. To improve training efficiency, end-to-end methods for USOD have been proposed as a promising alternative. However, current solutions rely heavily on noisy handcraft labels and fail to mine rich semantic information from deep features. In this paper, we propose a self-supervised end-to-end salient object detection framework via top-down context. Specifically, motivated by contrastive learning, we exploit the self-localization from the deepest feature to construct the location maps which are then leveraged to learn the most instructive segmentation guidance. Further considering the lack of detailed information in deepest features, we exploit the detail-boosting refiner module to enrich the location labels with details. Moreover, we observe that due to lack of supervision, current unsupervised saliency models tend to detect non-salient objects that are salient in some other samples of corresponding scenarios. To address this widespread issue, we design a novel Unsupervised Non-Salient Suppression (UNSS) method developing the ability to ignore non-salient objects. Extensive experiments on benchmark datasets demonstrate that our method achieves leading performance among the recent end-to-end methods and most of the multi-stage solutions. The code is available.

TS-ENAS: Two-Stage Evolution for Cell-based Network Architecture Search

  • paper_url: http://arxiv.org/abs/2310.09525
  • repo_url: None
  • paper_authors: Juan Zou, Shenghong Wu, Yizhang Xia, Weiwei Jiang, Zeping Wu, Jinhua Zheng
  • for: This work proposes a Two-Stage Evolution for cell-based Network Architecture Search (TS-ENAS) algorithm for automatically designing neural network architectures.
  • methods: The algorithm searches in two stages: a first stage that searches architectures built from stacked cells, reducing the complexity of the search, and a second stage that adjusts these cells. A new cell-based search space, an effective two-stage encoding method, and a cell-based weight inheritance strategy for initializing network weights are also designed.
  • results: Extensive tests on four image classification datasets (Fashion-MNIST, CIFAR10, CIFAR100, and ImageNet) and comparisons with 22 state-of-the-art algorithms, including hand-designed networks and NAS networks, show that TS-ENAS finds neural network architectures with comparable performance more effectively.
    Abstract Neural network architecture search provides a solution to the automatic design of network structures. However, it is difficult to search the whole network architecture directly. Although using stacked cells to search neural network architectures is an effective way to reduce the complexity of searching, these methods are not able to find the globally optimal neural network structure since the number of layers, cells and connection methods is fixed. In this paper, we propose a Two-Stage Evolution for cell-based Network Architecture Search (TS-ENAS), including one-stage searching based on stacked cells and second-stage adjusting these cells. In our algorithm, a new cell-based search space and an effective two-stage encoding method are designed to represent cells and neural network structures. In addition, a cell-based weight inheritance strategy is designed to initialize the weight of the network, which significantly reduces the running time of the algorithm. The proposed methods are extensively tested on four image classification datasets, Fashion-MNIST, CIFAR10, CIFAR100 and ImageNet, and compared with 22 state-of-the-art algorithms including hand-designed networks and NAS networks. The experimental results show that TS-ENAS can more effectively find the neural network architecture with comparable performance.

OBSUM: An object-based spatial unmixing model for spatiotemporal fusion of remote sensing images

  • paper_url: http://arxiv.org/abs/2310.09517
  • repo_url: https://github.com/houcaiguo/obsum-code
  • paper_authors: Houcai Guo, Dingqi Ye, Lorenzo Bruzzone
  • for: This work aims to improve both the spatial and temporal resolution of remote sensing images to enable time-series analysis at a fine spatial scale.
  • methods: An Object-Based Spatial Unmixing Model (OBSUM), combining object-based image analysis and spatial unmixing, addresses two important issues of current spatiotemporal fusion methods. OBSUM consists of one preprocessing step and three fusion steps: object-level unmixing, object-level residual compensation, and pixel-level residual compensation. A sketch of the core unmixing computation follows this entry.
  • results: Compared with five representative spatiotemporal fusion methods, OBSUM achieves better accuracy indices and visual effects over time-series, and also performs well in two typical remote sensing applications, indicating great potential for generating accurate, high-resolution time-series observations.
    Abstract Spatiotemporal fusion aims to improve both the spatial and temporal resolution of remote sensing images, thus facilitating time-series analysis at a fine spatial scale. However, there are several important issues that limit the application of current spatiotemporal fusion methods. First, most spatiotemporal fusion methods are based on pixel-level computation, which neglects the valuable object-level information of the land surface. Moreover, many existing methods cannot accurately retrieve strong temporal changes between the available high-resolution image at base date and the predicted one. This study proposes an Object-Based Spatial Unmixing Model (OBSUM), which incorporates object-based image analysis and spatial unmixing, to overcome the two abovementioned problems. OBSUM consists of one preprocessing step and three fusion steps, i.e., object-level unmixing, object-level residual compensation, and pixel-level residual compensation. OBSUM can be applied using only one fine image at the base date and one coarse image at the prediction date, without the need of a coarse image at the base date. The performance of OBSUM was compared with five representative spatiotemporal fusion methods. The experimental results demonstrated that OBSUM outperformed other methods in terms of both accuracy indices and visual effects over time-series. Furthermore, OBSUM also achieved satisfactory results in two typical remote sensing applications. Therefore, it has great potential to generate accurate and high-resolution time-series observations for supporting various remote sensing applications.
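The object-level unmixing step builds on standard spatial unmixing: within a local window, each coarse pixel is modeled as the fraction-weighted sum of per-class signals, and the class signals are recovered by least squares. A generic sketch of that core computation (not the OBSUM code) is shown below.

```python
import numpy as np

def spatial_unmixing(coarse_values, class_fractions):
    """Estimate per-class fine-scale signals from coarse pixels.

    coarse_values:   (M,) values of M coarse pixels at the prediction date.
    class_fractions: (M, K) fraction of each of K classes inside each coarse
                     pixel, derived from the fine-resolution classification.
    Solves coarse ~= fractions @ class_signals in the least-squares sense.
    """
    class_signals, *_ = np.linalg.lstsq(class_fractions, coarse_values, rcond=None)
    return class_signals  # (K,) one signal value per class

# Toy example: 3 classes, 25 coarse pixels in a local window.
rng = np.random.default_rng(0)
frac = rng.dirichlet(np.ones(3), size=25)   # class fractions, rows sum to 1
true = np.array([0.1, 0.4, 0.7])            # ground-truth class signals
print(spatial_unmixing(frac @ true, frac))  # recovers ~[0.1, 0.4, 0.7]
```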

Foundation Ark: Accruing and Reusing Knowledge for Superior and Robust Performance

  • paper_url: http://arxiv.org/abs/2310.09507
  • repo_url: https://github.com/jlianglab/ark
  • paper_authors: DongAo Ma, Jiaxuan Pang, Michael B. Gotway, Jianming Liang
  • for: This work aims to build a powerful and robust foundation model by aggregating numerous small public datasets.
  • methods: The Ark framework accrues and reuses knowledge from heterogeneous expert annotations across multiple datasets.
  • results: Evaluated via fine-tuning, linear probing, and gender-bias analysis on a wide range of imaging tasks, the Ark models show superior and robust performance over fully/self-supervised baselines and Google's proprietary CXR-FM.
    Abstract Deep learning nowadays offers expert-level and sometimes even super-expert-level performance, but achieving such performance demands massive annotated data for training (e.g., Google's proprietary CXR Foundation Model (CXR-FM) was trained on 821,544 labeled and mostly private chest X-rays (CXRs)). Numerous datasets are publicly available in medical imaging but individually small and heterogeneous in expert labels. We envision a powerful and robust foundation model that can be trained by aggregating numerous small public datasets. To realize this vision, we have developed Ark, a framework that accrues and reuses knowledge from heterogeneous expert annotations in various datasets. As a proof of concept, we have trained two Ark models on 335,484 and 704,363 CXRs, respectively, by merging several datasets including ChestX-ray14, CheXpert, MIMIC-II, and VinDr-CXR, evaluated them on a wide range of imaging tasks covering both classification and segmentation via fine-tuning, linear-probing, and gender-bias analysis, and demonstrated our Ark's superior and robust performance over the SOTA fully/self-supervised baselines and Google's proprietary CXR-FM. This enhanced performance is attributed to our simple yet powerful observation that aggregating numerous public datasets diversifies patient populations and accrues knowledge from diverse experts, yielding unprecedented performance yet saving annotation cost. With all codes and pretrained models released at GitHub.com/JLiangLab/Ark, we hope that Ark exerts an important impact on open science, as accruing and reusing knowledge from expert annotations in public datasets can potentially surpass the performance of proprietary models trained on unusually large data, inspiring many more researchers worldwide to share codes and datasets to build open foundation models, accelerate open science, and democratize deep learning for medical imaging.

JM3D & JM3D-LLM: Elevating 3D Representation with Joint Multi-modal Cues

  • paper_url: http://arxiv.org/abs/2310.09503
  • repo_url: https://github.com/mr-neko/jm3d
  • paper_authors: Jiayi Ji, Haowei Wang, Changli Wu, Yiwei Ma, Xiaoshuai Sun, Rongrong Ji
  • for: This work addresses three challenges in 3D representation learning: information degradation, insufficient synergy, and underutilization of fine-grained information.
  • methods: The comprehensive JM3D approach integrates point cloud, text, and image, using a Structured Multimodal Organizer (SMO) that enriches vision-language representation with multiple views and hierarchical text, and a Joint Multi-modal Alignment (JMA) that combines language understanding with visual representation.
  • results: Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority; JM3D-LLM, which marries the 3D representation with large language models via efficient fine-tuning, further underscores the effectiveness of the representation transfer approach.
    Abstract The rising importance of 3D representation learning, pivotal in computer vision, autonomous driving, and robotics, is evident. However, a prevailing trend, which straightforwardly resorted to transferring 2D alignment strategies to the 3D domain, encounters three distinct challenges: (1) Information Degradation: This arises from the alignment of 3D data with mere single-view 2D images and generic texts, neglecting the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: These strategies align 3D representations to image and text features individually, hampering the overall optimization for 3D models. (3) Underutilization: The fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss in detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), enriching vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), combining language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority. The superior performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach. Our code and models are available at https://github.com/Mr-Neko/JM3D.

Learning In-between Imagery Dynamics via Physical Latent Spaces

  • paper_url: http://arxiv.org/abs/2310.09495
  • repo_url: None
  • paper_authors: Jihun Han, Yoonsang Lee, Anne Gelb
  • for: Learning the underlying dynamics between two images observed at consecutive time steps
  • methods: The intermediary stages of image evolution are estimated using a latent variable that follows a physical model expressed in partial differential equations (PDEs), preserving spatial correlations with the image while keeping the learned dynamics interpretable. A minimal PDE-dynamics illustration follows this entry.
  • results: Numerical tests on geoscientific imagery data demonstrate the robustness and effectiveness of the learning framework
    Abstract We present a framework designed to learn the underlying dynamics between two images observed at consecutive time steps. The complex nature of image data and the lack of temporal information pose significant challenges in capturing the unique evolving patterns. Our proposed method focuses on estimating the intermediary stages of image evolution, allowing for interpretability through latent dynamics while preserving spatial correlations with the image. By incorporating a latent variable that follows a physical model expressed in partial differential equations (PDEs), our approach ensures the interpretability of the learned model and provides insight into corresponding image dynamics. We demonstrate the robustness and effectiveness of our learning framework through a series of numerical tests using geoscientific imagery data.
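A minimal illustration of a latent variable governed by a physical PDE is given below: an explicit finite-difference step of a 2D diffusion (heat) equation, rolled forward to produce in-between latent states. The specific PDE, grid size, and coefficients are placeholders for exposition, not the paper's learned model.

```python
import numpy as np

def heat_step(u, dt=0.1, kappa=1.0):
    """One explicit finite-difference step of u_t = kappa * (u_xx + u_yy)
    on a 2D latent grid with unit spacing (stable for kappa*dt <= 0.25)."""
    lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
           np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)
    return u + dt * kappa * lap

def roll_forward(u0, n_steps):
    """Generate the in-between latent states between two observations."""
    states = [u0]
    for _ in range(n_steps):
        states.append(heat_step(states[-1]))
    return states

u0 = np.random.rand(32, 32)           # latent state inferred from the first image
frames = roll_forward(u0, n_steps=10)
print(len(frames), frames[-1].shape)  # 11 (32, 32)
```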

Perception Reinforcement Using Auxiliary Learning Feature Fusion: A Modified Yolov8 for Head Detection

  • paper_url: http://arxiv.org/abs/2310.09492
  • repo_url: None
  • paper_authors: Jiezhou Chen, Guankun Wang, Weixiang Liu, Xiaopin Zhong, Yibin Tian, ZongZe Wu
  • for: Improving the accuracy and robustness of head detection
  • methods: A modified YOLOv8 with an Auxiliary Learning Feature Fusion (ALFF) module, composed of LSTM and convolutional blocks, as an auxiliary task to help the model perceive targets, plus Noise Calibration introduced into the Distribution Focal Loss to facilitate model fitting and improve detection accuracy
  • results: Experimental results demonstrate improved head detection accuracy and robustness
    Abstract Head detection provides distribution information of pedestrian, which is crucial for scene statistical analysis, traffic management, and risk assessment and early warning. However, scene complexity and large-scale variation in the real world make accurate detection more difficult. Therefore, we present a modified Yolov8 which improves head detection performance through reinforcing target perception. An Auxiliary Learning Feature Fusion (ALFF) module comprised of LSTM and convolutional blocks is used as the auxiliary task to help the model perceive targets. In addition, we introduce Noise Calibration into Distribution Focal Loss to facilitate model fitting and improve the accuracy of detection. Considering the requirements of high accuracy and speed for the head detection task, our method is adapted with two kinds of backbone, namely Yolov8n and Yolov8m. The results demonstrate the superior performance of our approach in improving detection accuracy and robustness.

Exploring the Design Space of Diffusion Autoencoders for Face Morphing

  • paper_url: http://arxiv.org/abs/2310.09484
  • repo_url: None
  • paper_authors: Zander Blasingame, Chen Liu
  • for: This paper explores the design space of face morphs created by Diffusion Autoencoders, which has not been well explored.
  • methods: Three axes of the design space are examined: sampling algorithms, the reverse DDIM solver, and partial sampling through small amounts of added noise.
  • results: The study characterizes how different sampling algorithms, reverse DDIM solvers, and amounts of added noise in partial sampling lead to different face morphs.
    Abstract Face morphs created by Diffusion Autoencoders are a recent innovation and the design space of such an approach has not been well explored. We explore three axes of the design space, i.e., 1) sampling algorithms, 2) the reverse DDIM solver, and 3) partial sampling through small amounts of added noise.

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

  • paper_url: http://arxiv.org/abs/2310.09478
  • repo_url: None
  • paper_authors: Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny
  • for: This paper aims to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding.
  • methods: MiniGPT-v2 serves as a unified interface for diverse vision-language tasks. Unique identifiers for different tasks are used during training so the model can distinguish each task instruction effortlessly, improving learning efficiency for each task. A sketch of this task-identifier prompting pattern follows this entry.
  • results: Experimental results show strong performance on many visual question-answering and visual grounding benchmarks compared to other vision-language generalist models.
    Abstract Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model for performing diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to better distinguish each task instruction effortlessly and also improve the model learning efficiency for each task. After the three-stage training, the experimental results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and codes are available at https://minigpt-v2.github.io/
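The "unique identifiers for different tasks" boil down to prefixing each training instruction with a task token. The sketch below shows that prompt-construction pattern with hypothetical tokens and template; the actual identifiers and instruction format are defined by the MiniGPT-v2 authors and may differ.

```python
# Hypothetical task-identifier tokens; the real MiniGPT-v2 tokens may differ.
TASK_TOKENS = {
    "visual_question_answering": "[vqa]",
    "image_captioning": "[caption]",
    "visual_grounding": "[refer]",
}

def build_instruction(task, user_text):
    """Prefix the instruction with a task identifier so a single multi-task
    model can distinguish which vision-language task is being requested."""
    token = TASK_TOKENS[task]
    return f"<s>[INST] <Img><ImageHere></Img> {token} {user_text} [/INST]"

print(build_instruction("visual_question_answering", "What color is the car?"))
```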

Plug-and-Play Feature Generation for Few-Shot Medical Image Classification

  • paper_url: http://arxiv.org/abs/2310.09471
  • repo_url: None
  • paper_authors: Qianyu Guo, Huifang Du, Xing Jia, Shuyong Gao, Yan Teng, Haofen Wang, Wenqiang Zhang
  • for: Improving the generalization and practicality of medical image classification models trained with limited data.
  • methods: MedMFG, a flexible and lightweight plug-and-play method, generates sufficient class-distinctive features from limited samples by re-representing the prototypes with higher weights for more important features and variationally generating abundant effective features.
  • results: MedMFG achieves over 10% performance improvement compared to several baselines on cross-domain benchmarks, and fusion experiments show it integrates seamlessly into various backbones and baselines, consistently yielding improvements of over 2.9%.
    Abstract Few-shot learning (FSL) presents immense potential in enhancing model generalization and practicality for medical image classification with limited training data; however, it still faces the challenge of severe overfitting in classifier training due to distribution bias caused by the scarce training samples. To address the issue, we propose MedMFG, a flexible and lightweight plug-and-play method designed to generate sufficient class-distinctive features from limited samples. Specifically, MedMFG first re-represents the limited prototypes to assign higher weights for more important information features. Then, the prototypes are variationally generated into abundant effective features. Finally, the generated features and prototypes are together to train a more generalized classifier. Experiments demonstrate that MedMFG outperforms the previous state-of-the-art methods on cross-domain benchmarks involving the transition from natural images to medical images, as well as medical images with different lesions. Notably, our method achieves over 10% performance improvement compared to several baselines. Fusion experiments further validate the adaptability of MedMFG, as it seamlessly integrates into various backbones and baselines, consistently yielding improvements of over 2.9% across all results.

Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner

  • paper_url: http://arxiv.org/abs/2310.09469
  • repo_url: None
  • paper_authors: Mengfei Xia, Yujun Shen, Changsong Lei, Yu Zhou, Ran Yi, Deli Zhao, Wenping Wang, Yong-jin Liu
  • for: Accelerating diffusion model inference while avoiding the performance degradation of existing acceleration methods.
  • methods: Viewing generation as a discretized integrating process, the method finds a more accurate integral direction for each timestep interval: at every denoising step, the original parameterization is replaced by conditioning the network on a new timestep obtained by aligning the sampling distribution to the real distribution. A sketch of where the aligner plugs into DDIM sampling follows this entry.
  • results: Extensive experiments show that the plug-in design can be trained efficiently and boosts the inference performance of various state-of-the-art acceleration methods, especially with few denoising steps; for example, with 10 denoising steps on LSUN Bedroom, the FID of DDIM improves from 9.65 to 6.07. Code will be made publicly available.
    Abstract A diffusion model, which is formulated to produce an image using thousands of denoising steps, usually suffers from a slow inference speed. Existing acceleration algorithms simplify the sampling by skipping most steps yet exhibit considerable performance degradation. By viewing the generation of diffusion models as a discretized integrating process, we argue that the quality drop is partly caused by applying an inaccurate integral direction to a timestep interval. To rectify this issue, we propose a timestep aligner that helps find a more accurate integral direction for a particular interval at the minimum cost. Specifically, at each denoising step, we replace the original parameterization by conditioning the network on a new timestep, which is obtained by aligning the sampling distribution to the real distribution. Extensive experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods, especially when there are few denoising steps. For example, when using 10 denoising steps on the popular LSUN Bedroom dataset, we improve the FID of DDIM from 9.65 to 6.07, simply by adopting our method for a more appropriate set of timesteps. Code will be made publicly available.
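A hedged sketch of where such a timestep aligner plugs into a DDIM-style sampler is shown below: at every denoising step the nominal timestep is re-mapped by the aligner before conditioning the noise-prediction network. The aligner, noise schedule, and network here are placeholders, not the paper's trained components.

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, aligner, alphas_bar, x_T, timesteps):
    """Deterministic DDIM sampling where each nominal timestep t is re-mapped
    by `aligner` before being fed to the noise-prediction network eps_model."""
    x = x_T
    for i, t in enumerate(timesteps):               # e.g. [999, 899, ..., 99]
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else 0
        t_aligned = aligner(t)                       # aligned (possibly fractional) step
        eps = eps_model(x, t_aligned)
        a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean image
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # DDIM update (eta = 0)
    return x

# Toy components illustrating the interface only.
alphas_bar = torch.linspace(0.9999, 0.0001, 1000)
eps_model = lambda x, t: torch.zeros_like(x)        # placeholder network
aligner = lambda t: torch.tensor(float(t))          # identity aligner as a stand-in
out = ddim_sample(eps_model, aligner, alphas_bar, torch.randn(1, 3, 8, 8),
                  list(range(999, -1, -100)))
print(out.shape)
```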

MAC: ModAlity Calibration for Object Detection

  • paper_url: http://arxiv.org/abs/2310.09461
  • repo_url: None
  • paper_authors: Yutian Lei, Jun Liu, Dong Huang
  • for: This work develops an efficient way to build object detection models for non-RGB input modalities from models developed on the RGB modality.
  • methods: ModAlity Calibration (MAC) is a pipeline that calibrates target-modality inputs to DNN object detection models developed on the RGB (source) modality: a small calibrator module is added ahead of the source-modality model, and MAC training techniques impose dense supervision on the calibrator, leveraging prior knowledge from the source-modality model and paired {target, source} data with zero manual annotations. A sketch of the calibrator idea follows this entry.
  • results: WiFi-input, Lidar-input, and Thermal-Infrared-input models composed upon pre-trained RGB-input models reach comparable or better metrics than baselines that require 100% manual annotations, demonstrating MAC's effectiveness.
    Abstract The flourishing success of Deep Neural Networks(DNNs) on RGB-input perception tasks has opened unbounded possibilities for non-RGB-input perception tasks, such as object detection from wireless signals, lidar scans, and infrared images. Compared to the matured development pipeline of RGB-input (source modality) models, developing non-RGB-input (target-modality) models from scratch poses excessive challenges in the modality-specific network design/training tricks and labor in the target-modality annotation. In this paper, we propose ModAlity Calibration (MAC), an efficient pipeline for calibrating target-modality inputs to the DNN object detection models developed on the RGB (source) modality. We compose a target-modality-input model by adding a small calibrator module ahead of a source-modality model and introduce MAC training techniques to impose dense supervision on the calibrator. By leveraging (1) prior knowledge synthesized from the source-modality model and (2) paired {target, source} data with zero manual annotations, our target-modality models reach comparable or better metrics than baseline models that require 100% manual annotations. We demonstrate the effectiveness of MAC by composing the WiFi-input, Lidar-input, and Thermal-Infrared-input models upon the pre-trained RGB-input models respectively.
    摘要 深度神经网络(DNN)在RGB输入感知任务上的繁荣成功打开了非RGB输入感知任务的无尽可能性,如从无线电信号、激光扫描和红外图像中的对象检测。相比于已经成熟的RGB输入(源模式)模型的开发管线,开发非RGB输入(目标模式)模型从零开始带来了过度的挑战,包括特性化网络设计/训练技巧和目标模式注解的劳动。在这篇论文中,我们提出了模态均衡(MAC)管线,用于均衡目标模式输入到基于RGB模式(源模式)的对象检测模型中。我们将目标模式输入模型的前置加一小均衡模块,并引入MAC训练技术,以在均衡模块中强制对均衡模块进行密集监督。通过利用(1)源模式模型中的优先知识和(2)paired {target, source} 数据集,我们的目标模式模型可以达到与基eline模型相同或更好的 metric,而不需要100%的手动注解。我们通过将WiFi输入、激光输入和红外温度输入模型分别建立在pre-trained RGB输入模型上,来证明MAC的效果。
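
A rough sketch of the "small calibrator in front of a frozen RGB detector" pattern described above; the class names, layer sizes, and channel counts are assumptions for illustration, and the dense-supervision training techniques from the paper are omitted.

```python
import torch.nn as nn

class Calibrator(nn.Module):
    """Hypothetical lightweight calibrator: maps a target-modality input
    (e.g. a 1-channel thermal image) into a 3-channel pseudo-RGB tensor
    that a frozen, pretrained RGB detector can consume."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class MACDetector(nn.Module):
    """Target-modality model = small trainable calibrator + frozen RGB detector."""
    def __init__(self, rgb_detector: nn.Module, in_ch: int):
        super().__init__()
        self.calibrator = Calibrator(in_ch)
        self.detector = rgb_detector
        for p in self.detector.parameters():   # keep source-modality weights frozen
            p.requires_grad = False

    def forward(self, x_target):
        return self.detector(self.calibrator(x_target))
```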

PaintHuman: Towards High-fidelity Text-to-3D Human Texturing via Denoised Score Distillation

  • paper_url: http://arxiv.org/abs/2310.09458
  • repo_url: None
  • paper_authors: Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, Wayne Wu
  • for: 本研究旨在解决 zero-shot text-to-3D 人体生成中 SDS 方法可能提供不准确的梯度方向问题,以及高级度文本-to-3D 人体质量控制的挑战。
  • methods: 本研究提出了一种名为PaintHuman的模型,通过两种方法来解决问题:首先,引入了一种修改 SDS 的新得分函数(denoised score distillation,DSD),以 iteratively 更正梯度方向并生成高质量的Texture。其次,使用深度图作为geometry guidance,确保Texture具有人体模型表面的semantic alignment。
  • results: 与最新（state-of-the-art）方法相比，我们的方法在多项评测中表现出更高的性能和质量。
    Abstract Recent advances in zero-shot text-to-3D human generation, which employ the human model prior (eg, SMPL) or Score Distillation Sampling (SDS) with pre-trained text-to-image diffusion models, have been groundbreaking. However, SDS may provide inaccurate gradient directions under the weak diffusion guidance, as it tends to produce over-smoothed results and generate body textures that are inconsistent with the detailed mesh geometry. Therefore, directly leverage existing strategies for high-fidelity text-to-3D human texturing is challenging. In this work, we propose a model called PaintHuman to addresses the challenges from two aspects. We first propose a novel score function, Denoised Score Distillation (DSD), which directly modifies the SDS by introducing negative gradient components to iteratively correct the gradient direction and generate high-quality textures. In addition, we use the depth map as a geometric guidance to ensure the texture is semantically aligned to human mesh surfaces. To guarantee the quality of rendered results, we employ geometry-aware networks to predict surface materials and render realistic human textures. Extensive experiments, benchmarked against state-of-the-art methods, validate the efficacy of our approach.
    摘要 近期的零shot文本到3D人体生成技术发展,使用人体模型先验(如SMPL)或Score Distillation Sampling(SDS)与预训练文本到图像扩散模型,取得了重要进展。然而,SDS可能在弱扩散指导下提供不准确的梯度方向,因为它有可能生成过度平滑的结果和与细节 mesh geometry 不一致的人体 texture。因此,直接使用现有的高精度文本到3D人体纹理策略是挑战。在这种工作中,我们提议一种名为PaintHuman的模型,用以解决这些挑战。我们首先提出了一种新的分数函数,Denosied Score Distillation(DSD),它直接修改了 SDS,通过引入负梯度组分来逐次更正梯度方向,生成高质量的纹理。此外,我们使用深度图为 geometric 指导,确保纹理与人体 mesh 表面含义相对应。为保证渲染结果的质量,我们使用 geometry-aware 网络预测表面材质并生成真实的人体纹理。我们对 state-of-the-art 方法进行了广泛的实验,并证明了我们的方法的有效性。

UCM-Net: A Lightweight and Efficient Solution for Skin Lesion Segmentation using MLP and CNN

  • paper_url: http://arxiv.org/abs/2310.09457
  • repo_url: None
  • paper_authors: Chunyu Yuan, Dongfang Zhao, Sos S. Agaian
  • for: 这篇论文旨在提出一种高效、轻量级的皮肤病变分割方法，以便在移动医疗应用中使用。
  • methods: 该方法结合多层感知机（MLP）与卷积神经网络（CNN），并提出了一种新的 UCM-Net-Block，在减少参数开销的同时提高学习能力。
  • results: 在 isic2017 和 isic2018 数据集上的广泛实验证明了 UCM-Net 在皮肤病变分割中的竞争力；UCM-Net 的参数量少于 50KB、计算量低于 0.05 GLOPs，为皮肤病变分割的效率树立了新的标准。
    Abstract Skin cancer is a significant public health problem, and computer-aided diagnosis can help to prevent and treat it. A crucial step for computer-aided diagnosis is accurately segmenting skin lesions in images, which allows for lesion detection, classification, and analysis. However, this task is challenging due to the diverse characteristics of lesions, such as appearance, shape, size, color, texture, and location, as well as image quality issues like noise, artifacts, and occlusions. Deep learning models have recently been applied to skin lesion segmentation, but they have high parameter counts and computational demands, making them unsuitable for mobile health applications. To address this challenge, we propose UCM-Net, a novel, efficient, and lightweight solution that integrates Multi-Layer Perceptions (MLP) and Convolutional Neural Networks (CNN). Unlike conventional UNet architectures, our UCMNet-Block reduces parameter overhead and enhances UCM-Net's learning capabilities, leading to robust segmentation performance. We validate UCM-Net's competitiveness through extensive experiments on isic2017 and isic2018 datasets. Remarkably, UCM-Net has less than 50KB parameters and less than 0.05 Giga-Operations Per Second (GLOPs), setting a new possible standard for efficiency in skin lesion segmentation. The source code will be publicly available.
    摘要 皮肤癌是一个严重的公共卫生问题,计算机助成诊断可以帮助预防和治疗。计算机助成诊断的关键步骤是准确地分割皮肤癌病变图像中的病变,以便诊断、分类和分析。然而,这个任务很困难,因为癌病变的多样性,包括外表、形状、大小、颜色、文本ure和位置等,以及图像质量问题,如噪声、artefacts和遮挡。最近,深度学习模型已经应用于皮肤癌病变分割,但它们具有高参数计数和计算需求,使其不适合移动医疗应用。为解决这个挑战,我们提出了UCM-Net,一种新的、有效和轻量级的解决方案,它结合多层感知(MLP)和卷积神经网络(CNN)。与传统的UNet架构不同,我们的UCMNet-Block减少参数开销和提高UCM-Net的学习能力,从而实现了稳定的分割性能。我们通过对isic2017和isic2018数据集进行广泛的实验,证明UCM-Net在皮肤癌病变分割中的竞争力。特别是,UCM-Net的参数少于50KB,计算需求少于0.05 Giga-Operations Per Second(GLOPs),创造了新的可能性标准 для皮肤癌病变分割的效率。源代码将公开 availability。
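
The following is an illustrative MLP+CNN hybrid block in the spirit described above (a depthwise convolution for local structure plus a tiny channel MLP); it is not the paper's exact UCM-Net-Block, and all layer sizes are assumptions chosen only to keep the parameter count small.

```python
import torch.nn as nn

class LiteConvMLPBlock(nn.Module):
    """Toy CNN+MLP hybrid block: a depthwise 3x3 conv captures local
    structure, a small 1x1-conv MLP mixes channels, with residual
    connections around both parts."""
    def __init__(self, ch: int, hidden: int):
        super().__init__()
        self.dwconv = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)  # depthwise conv
        self.norm = nn.BatchNorm2d(ch)
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, hidden, 1), nn.GELU(),
            nn.Conv2d(hidden, ch, 1),
        )

    def forward(self, x):
        x = x + self.dwconv(x)            # local spatial mixing
        return x + self.mlp(self.norm(x)) # lightweight channel mixing
```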

cs.AI - 2023-10-14

A Neuro-Mimetic Realization of the Common Model of Cognition via Hebbian Learning and Free Energy Minimization

  • paper_url: http://arxiv.org/abs/2310.15177
  • repo_url: None
  • paper_authors: Alexander Ororbia, Mary Alexandria Kelly
  • for: 本研究的目的是探讨Generative AI的发展和其对认知科学的影响。
  • methods: 本研究使用了COGnitive Neural GENerative系统,这是一种基于Hebbian适应的神经网络模型,用于优化variational free energy函数。
  • results: 本研究提出了一种基于COG系统的 cognitive architectures,可以用于模拟大脑中的认知过程。
    Abstract Over the last few years, large neural generative models, capable of synthesizing intricate sequences of words or producing complex image patterns, have recently emerged as a popular representation of what has come to be known as "generative artificial intelligence" (generative AI). Beyond opening the door to new opportunities as well as challenges for the domain of statistical machine learning, the rising popularity of generative AI brings with it interesting questions for Cognitive Science, which seeks to discover the nature of the processes that underpin minds and brains as well as to understand how such functionality might be acquired and instantiated in biological (or artificial) substrate. With this goal in mind, we argue that a promising long-term pathway lies in the crafting of cognitive architectures, a long-standing tradition of the field, cast fundamentally in terms of neuro-mimetic generative building blocks. Concretely, we discuss the COGnitive Neural GENerative system, which is an architecture that casts the Common Model of Cognition in terms of Hebbian adaptation operating in service of optimizing a variational free energy functional.
    摘要 最近几年来,大规模神经生成模型,可以生成复杂的语言序列或图像模式,在人工智能领域中崛起为人们关注的新兴表现形式。这些模型不仅开启了新的机会和挑战,还对认知科学产生了感兴趣的问题,旨在探索 minds 和 brains 的内在机制,以及如何在生物(或人工)材料中实现这种功能。为实现这个目标,我们认为,制定 cognitive 架构是一条有前途的长期路径。在这个框架下,我们讨论了 COGnitive Neural GENerative system,这是一种基于 Hebbian 适应的神经生成模型,用于优化变量自由能函数。

Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring

  • paper_url: http://arxiv.org/abs/2310.09680
  • repo_url: None
  • paper_authors: Ankitha Sudarshan, Vinay Samuel, Parth Patwa, Ibtihel Amara, Aman Chadha
  • for: 提高语音识别系统的上下文依赖词汇识别精度
  • methods: 结合隐马尔可夫模型与高斯混合模型（HMM-GMM）以及深度神经网络（DNN），融合声学与语言建模，并利用基于 Transformer 的模型对词网格进行重打分
  • results: 在LibriSpeech数据集上实现了很好的效果,Word Error Rate(WER)下降了可见的程度
    Abstract Automatic Speech Recognition (ASR) has witnessed a profound research interest. Recent breakthroughs have given ASR systems different prospects such as faithfully transcribing spoken language, which is a pivotal advancement in building conversational agents. However, there is still an imminent challenge of accurately discerning context-dependent words and phrases. In this work, we propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing leveraging the power of deep learning models in accurately delivering spot-on transcriptions across a wide variety of vocabularies and speaking styles. Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models integrating both language and acoustic modeling for better accuracy. We infused our network with the use of a transformer-based model to properly rescore the word lattice achieving remarkable capabilities with a palpable reduction in Word Error Rate (WER). We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
    摘要 自动语音识别(ASR)技术在研究中受到了广泛的关注。最近的突破使得ASR系统可以准确地识别语音,这是建立对话代理人的关键进步。然而,仍然存在一个急需要准确地识别上下文依赖的词语和短语的挑战。在这种工作中,我们提出了一种改进上下文认知在ASR系统中的方法,通过semantic lattice processing,利用深度学习模型以高精度提供详细的转录。我们的解决方案包括使用隐藏马尔可夫模型和 Gaussian Mixture Models(HMM-GMM)以及深度神经网络(DNN)模型,将语言和音响模型结合起来以提高准确性。我们在网络中使用transformer-based模型来正确地重新分配词网络,实现了很高的Word Error Rate(WER)下降。我们在LibriSpeech dataset上进行了实验,并进行了实验分析。
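
As a simplified, hypothetical illustration of the second-pass idea, the snippet below rescores an n-best list (a common approximation of full lattice rescoring) by interpolating first-pass acoustic scores with scores from an external language model; the `lm_logprob` callable stands in for any transformer LM scorer and is not the paper's implementation.

```python
def rescore_nbest(hypotheses, lm_logprob, lm_weight=0.5):
    """Rescore ASR hypotheses with an external language model.

    `hypotheses` is a list of (text, acoustic_logprob) pairs from the
    first-pass decoder; `lm_logprob` is any callable returning the LM
    log-probability of a sentence. Returns the best-scoring text."""
    scored = [(text, am + lm_weight * lm_logprob(text))
              for text, am in hypotheses]
    return max(scored, key=lambda pair: pair[1])[0]

# toy usage with a stand-in LM scorer (shorter sentences score higher here)
best = rescore_nbest(
    [("recognize speech", -12.3), ("wreck a nice beach", -11.9)],
    lm_logprob=lambda s: -2.0 * len(s.split()),
)
print(best)  # -> "recognize speech"
```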

Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

  • paper_url: http://arxiv.org/abs/2310.09676
  • repo_url: None
  • paper_authors: Jiachen Li, Qiaozi Gao, Michael Johnston, Xiaofeng Gao, Xuehai He, Suhaila Shakiah, Hangjie Shi, Reza Ghanadan, William Yang Wang
  • for: 本研究的目的是开发一种可以通过多模态提示(文本描述和视觉信号)来控制机器人的 manipulate 能力。
  • methods: 我们的方法包括一个两个阶段的训练管道,包括逆动力预训练和多任务调整。为了促进多模态理解,我们设计了一个多模态提示编码器,通过将预训练的语言模型与视觉输入连接起来,模拟动作维度之间的依赖关系。
  • results: 我们在 VIMA-BENCH 上评估了方法的有效性，成功率达到新的最先进水平（提升 10%）。此外，我们还证明了模型具有出色的上下文内学习（in-context learning）能力。
    Abstract Prompt-based learning has been demonstrated as a compelling paradigm contributing to large language models' tremendous success (LLMs). Inspired by their success in language tasks, existing research has leveraged LLMs in embodied instruction following and task planning. However, not much attention has been paid to embodied tasks with multimodal prompts, combining vision signals with text descriptions. This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals. In this work, we introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts from multi-task expert trajectories. Our methods consist of a two-stage training pipeline that performs inverse dynamics pretraining and multi-task finetuning. To facilitate multimodal understanding, we design our multimodal prompt encoder by augmenting a pretrained LM with a residual connection to the visual input and model the dependencies among action dimensions. Empirically, we evaluate the efficacy of our method on the VIMA-BENCH and establish a new state-of-the-art (10% improvement in success rate). Moreover, we demonstrate that our model exhibits remarkable in-context learning ability.

Efficient Model-Agnostic Multi-Group Equivariant Networks

  • paper_url: http://arxiv.org/abs/2310.09675
  • repo_url: None
  • paper_authors: Razan Baltaji, Sourya Basu, Lav R. Varshney
  • for: This paper aims to address the computational expense of constructing model-agnostic group equivariant networks, such as equitune, for large product groups.
  • methods: The paper proposes two efficient model-agnostic equivariant designs for two related problems: one with multiple inputs and another with a single input but a large product group. The designs use a novel fusion layer called an IS layer, which is a universal approximator of invariant-symmetric functions.
  • results: The paper shows that the proposed designs are competitive with equitune and its variants, while being computationally more efficient. The designs are applied to three applications: multi-image classification, language compositionality, and robust zero-shot image classification.
    Abstract Constructing model-agnostic group equivariant networks, such as equitune (Basu et al., 2023b) and its generalizations (Kim et al., 2023), can be computationally expensive for large product groups. We address this by providing efficient model-agnostic equivariant designs for two related problems: one where the network has multiple inputs each with potentially different groups acting on them, and another where there is a single input but the group acting on it is a large product group. For the first design, we initially consider a linear model and characterize the entire equivariant space that satisfies this constraint. This characterization gives rise to a novel fusion layer between different channels that satisfies an invariance-symmetry (IS) constraint, which we call an IS layer. We then extend this design beyond linear models, similar to equitune, consisting of equivariant and IS layers. We also show that the IS layer is a universal approximator of invariant-symmetric functions. Inspired by the first design, we use the notion of the IS property to design a second efficient model-agnostic equivariant design for large product groups acting on a single input. For the first design, we provide experiments on multi-image classification where each view is transformed independently with transformations such as rotations. We find equivariant models are robust to such transformations and perform competitively otherwise. For the second design, we consider three applications: language compositionality on the SCAN dataset to product groups; fairness in natural language generation from GPT-2 to address intersectionality; and robust zero-shot image classification with CLIP. Overall, our methods are simple and general, competitive with equitune and its variants, while also being computationally more efficient.
    摘要 建立模型无关的群equivariant网络,如equitune(Basu et al., 2023b)和其扩展(Kim et al., 2023),可能会对大量产品群进行计算昂贵的成本。我们解决这个问题,提供了高效的模型无关equivariant设计,用于两个相关的问题:一个是网络有多个输入,每个输入都可能有不同的群 acting on it,另一个是一个输入,但是群 acting on it是一个大量产品群。对于第一个设计,我们首先考虑一个线性模型,并Characterize了满足这个约束的整个equivariant空间。这个Characterization导致了一种新的协调层(IS layer),它满足一个对称-对称(IS)约束。我们然后将这个设计扩展到不同的模型,类似于equitune,包括equivariant和IS层。我们还证明了IS层是对称-对称函数的universal approximator。受第一个设计的启发,我们使用IS性质来设计一个高效的模型无关equivariant设计,用于大量产品群 acting on a single input。我们提供了对多视图分类的实验,发现equivariant模型对独立的变换(如旋转)具有抗变换性和竞争性。对于第二个设计,我们考虑了三个应用:语言compositional on SCAN dataset,用于product groups; fairness in natural language generation from GPT-2,用于Addressing intersectionality;和robust zero-shot image classification with CLIP。总的来说,我们的方法是简单而普遍,与equitune和其 variants相当竞争,同时也更加计算效率。
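
A toy sketch of an invariance-symmetry style fusion: symmetric with respect to the set of input features (mean over inputs) and invariant to a finite group (mean over the listed group actions). This is only an illustrative stand-in under those assumptions, not the paper's IS layer.

```python
import torch

def is_fusion(features, group_actions):
    """Toy invariant-symmetric (IS) fusion.

    `features` is a list of equally shaped tensors (one per input channel);
    `group_actions` is a list of callables implementing the group elements
    on feature tensors. The output is unchanged if the inputs are permuted
    or if any listed group action is applied to them."""
    sym = torch.stack(features).mean(dim=0)                          # permutation-symmetric
    return torch.stack([g(sym) for g in group_actions]).mean(dim=0)  # group-invariant

# example: invariance to horizontal flips of an image feature map
feats = [torch.randn(1, 8, 16, 16) for _ in range(3)]
out = is_fusion(feats, [lambda x: x, lambda x: torch.flip(x, dims=[-1])])
print(out.shape)  # torch.Size([1, 8, 16, 16])
```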

Edge-InversionNet: Enabling Efficient Inference of InversionNet on Edge Devices

  • paper_url: http://arxiv.org/abs/2310.09667
  • repo_url: None
  • paper_authors: Zhepeng Wang, Isaacshubhanand Putla, Weiwen Jiang, Youzuo Lin
  • for: 这项研究旨在通过压缩数据驱动模型，提高地震全波形反演（Full Waveform Inversion，FWI）在边缘设备上的运行效率。
  • methods: 我们提出使用结构化剪枝算法得到轻量级的 InversionNet，以便在边缘设备上进行高效推理。
  • results: 实验结果显示，剪枝后的 InversionNet 最多可减少 98.2% 的计算资源，而模型性能仅有适度下降。
    Abstract Seismic full waveform inversion (FWI) is a widely used technique in geophysics for inferring subsurface structures from seismic data. And InversionNet is one of the most successful data-driven machine learning models that is applied to seismic FWI. However, the high computing costs to run InversionNet have made it challenging to be efficiently deployed on edge devices that are usually resource-constrained. Therefore, we propose to employ the structured pruning algorithm to get a lightweight version of InversionNet, which can make an efficient inference on edge devices. And we also made a prototype with Raspberry Pi to run the lightweight InversionNet. Experimental results show that the pruned InversionNet can achieve up to 98.2 % reduction in computing resources with moderate model performance degradation.
    摘要 地震全波形反演（FWI）是地球物理学中广泛使用的技术，用于根据地震数据推断地下结构，而 InversionNet 是应用于地震 FWI 的最成功的数据驱动机器学习模型之一。然而，运行 InversionNet 的高计算成本使其难以在通常资源受限的边缘设备上高效部署。因此，我们提出使用结构化剪枝算法得到轻量级的 InversionNet，以便在边缘设备上进行高效推理，并制作了在 Raspberry Pi 上运行轻量级 InversionNet 的原型。实验结果表明，剪枝后的 InversionNet 最多可减少 98.2% 的计算资源，而模型性能仅有适度下降。
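
A minimal sketch of the kind of structured (filter-level) pruning referred to above, assuming an L1-norm importance criterion and groups=1 convolutions; the paper's actual pruning algorithm and ratios may differ, and downstream layers would also need their input channels adjusted.

```python
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Keep the `keep_ratio` fraction of output filters with the largest
    L1 norm and rebuild a smaller Conv2d layer (structured pruning)."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # L1 norm per filter
    keep = torch.topk(scores, n_keep).indices
    new_conv = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()
    return new_conv

# toy usage: drop half of the filters of a 3->32 convolution
pruned = prune_conv_filters(nn.Conv2d(3, 32, 3, padding=1), keep_ratio=0.5)
print(pruned.out_channels)  # 16
```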

A Generalized Extensive-Form Fictitious Play Algorithm

  • paper_url: http://arxiv.org/abs/2310.09658
  • repo_url: None
  • paper_authors: Tim P. Schulze
  • for: 这篇论文旨在求解二人零和博弈的均衡。
  • methods: 论文提出了一种简单的扩展形算法来寻找二人零和博弈的均衡，该算法在实现意义上等价于一种广义形式的 fictitious play。
  • results: 论文将该算法与一种类似的扩展形 fictitious play 算法以及一种 counterfactual regret minimization 算法进行了比较；三者相对于标准形 fictitious play 都具有降低存储需求和计算复杂度的优势，且新算法直观、易于实现。
    Abstract We introduce a simple extensive-form algorithm for finding equilibria of two-player, zero-sum games. The algorithm is realization equivalent to a generalized form of Fictitious Play. We compare its performance to that of a similar extensive-form fictitious play algorithm and a counter-factual regret minimization algorithm. All three algorithms share the same advantages over normal-form fictitious play in terms of reducing storage requirements and computational complexity. The new algorithm is intuitive and straightforward to implement, making it an appealing option for those looking for a quick and easy game solving tool.
    摘要 我们介绍了一种简单的扩展形算法，用于寻找二人零和博弈的均衡。该算法在实现意义上等价于一种广义形式的 fictitious play。我们将其性能与一种类似的扩展形 fictitious play 算法以及一种 counterfactual regret minimization 算法进行了比较。这三种算法相对于标准形 fictitious play 都具有降低存储需求和计算复杂度的优点。新算法直观且易于实现，对于寻求快速、简单的博弈求解工具的人而言是一个有吸引力的选择。
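
For intuition, here is plain normal-form fictitious play on a zero-sum matrix game; the paper's contribution is an extensive-form generalization of this update, so the snippet only illustrates the underlying best-response-to-empirical-play idea.

```python
import numpy as np

def fictitious_play(payoff, iters=5000):
    """Normal-form fictitious play on a zero-sum matrix game.
    Row player maximizes `payoff`, column player minimizes it; the
    empirical strategy frequencies converge to an equilibrium."""
    m, n = payoff.shape
    row_counts, col_counts = np.zeros(m), np.zeros(n)
    row_counts[0] += 1          # arbitrary initial plays
    col_counts[0] += 1
    for _ in range(iters):
        row_br = np.argmax(payoff @ col_counts)   # best reply to column's history
        col_br = np.argmin(row_counts @ payoff)   # best reply to row's history
        row_counts[row_br] += 1
        col_counts[col_br] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

# matching pennies: both empirical strategies approach (0.5, 0.5)
print(fictitious_play(np.array([[1.0, -1.0], [-1.0, 1.0]])))
```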

SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

  • paper_url: http://arxiv.org/abs/2310.09653
  • repo_url: None
  • paper_authors: Paarth Neekhara, Shehzeen Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley
  • for: 这篇论文旨在提出一种名为 SelfVC 的训练策略，利用自生成的样本迭代地提升语音转换模型的性能。
  • methods: 论文利用自监督学习与说话人验证模型得到语音表示，并在这些（未显式解耦的）表示上训练一个可控的语音转换模型。
  • results: 通过在训练中引入自我合成的样本迭代改进模型，生成语音的说话人相似度和自然度得到提升；此外，SelfVC 无需文本即可训练，可应用于零样本语音转换、跨语言语音转换以及带音高与语速控制的语音合成等任务。
    Abstract We propose SelfVC, a training strategy to iteratively improve a voice conversion model with self-synthesized examples. Previous efforts on voice conversion focus on explicitly disentangling speech representations to separately encode speaker characteristics and linguistic content. However, disentangling speech representations to capture such attributes using task-specific loss terms can lead to information loss by discarding finer nuances of the original signal. In this work, instead of explicitly disentangling attributes with loss terms, we present a framework to train a controllable voice conversion model on entangled speech representations derived from self-supervised learning and speaker verification models. First, we develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model. Next, we propose a training strategy to iteratively improve the synthesis model for voice conversion, by creating a challenging training objective using self-synthesized examples. In this training approach, the current state of the synthesis model is used to generate voice-converted variations of an utterance, which serve as inputs for the reconstruction task, ensuring a continuous and purposeful refinement of the model. We demonstrate that incorporating such self-synthesized examples during training improves the speaker similarity of generated speech as compared to a baseline voice conversion model trained solely on heuristically perturbed inputs. SelfVC is trained without any text and is applicable to a range of tasks such as zero-shot voice conversion, cross-lingual voice conversion, and controllable speech synthesis with pitch and pace modifications. SelfVC achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
    摘要 我们提出了SelfVC,一种培训策略,可以逐步提高一个语音转换模型的性能。之前的尝试中,人们通常通过显式地分离发音表示来分配说话特征和语言内容。然而,通过任务特定的损失函数来损失这些特征可能会导致信息损失,因为抛弃了原始信号的细节。在这个工作中,我们不是通过显式地分离特征来损失这些特征,而是通过控制语音转换模型来培训混合的发音表示。我们开发了一些技术来从音频信号和SSL表示中提取发音信息,并用这些信息来训练预测子模块。然后,我们提出了一种培训策略,可以逐步改进语音转换模型的性能,通过使用自己生成的示例进行反馈循环训练。在这种培训策略中,当前的synthesis模型的状态被用来生成转换后的话语变体,作为反馈输入,以确保不断改进模型的性能。我们示出,在培训SelfVC模型时,不需要任何文本,可以在不同任务中进行零批量语音转换、跨语言语音转换和可控的语音生成中使用。SelfVC在零批量语音转换中实现了状态之最的结果,并且在自然性、发音相似度和理解性等方面取得了出色的成绩。

Lexical Entrainment for Conversational Systems

  • paper_url: http://arxiv.org/abs/2310.09651
  • repo_url: None
  • paper_authors: Zhengxiang Shi, Procheta Sen, Aldo Lipani
  • for: This paper aims to address the issue of lexical entrainment (LE) in conversational systems, which is a crucial humanlike phenomenon that is not adequately addressed by current response generation models.
  • methods: The authors propose a new dataset, named MULTIWOZ-ENTR, and a measure for LE for conversational systems. They also suggest two new tasks, a LE extraction task and a LE generation task, and present two baseline approaches for the LE extraction task.
  • results: The authors demonstrate the effectiveness of their proposed approach by presenting results from experiments conducted on the MULTIWOZ-ENTR dataset.
    Abstract Conversational agents have become ubiquitous in assisting with daily tasks, and are expected to possess human-like features. One such feature is lexical entrainment (LE), a phenomenon in which speakers in human-human conversations tend to naturally and subconsciously align their lexical choices with those of their interlocutors, leading to more successful and engaging conversations. As an example, if a digital assistant replies 'Your appointment for Jinling Noodle Pub is at 7 pm' to the question 'When is my reservation for Jinling Noodle Bar today?', it may feel as though the assistant is trying to correct the speaker, whereas a response of 'Your reservation for Jinling Noodle Bar is at 7 pm' would likely be perceived as more positive. This highlights the importance of LE in establishing a shared terminology for maximum clarity and reducing ambiguity in conversations. However, we demonstrate in this work that current response generation models do not adequately address this crucial humanlike phenomenon. To address this, we propose a new dataset, named MULTIWOZ-ENTR, and a measure for LE for conversational systems. Additionally, we suggest a way to explicitly integrate LE into conversational systems with two new tasks, a LE extraction task and a LE generation task. We also present two baseline approaches for the LE extraction task, which aim to detect LE expressions from dialogue contexts.
    摘要 很多对话代理程序已经在日常任务中出现,并且需要具备人类化特征。一种这种特征是语言同步(LE),即在人类对话中, speaker 们会自然地和无意识地与对方的语言选择相吻合,从而使对话更加成功和有趣。例如,如果一个数字助手回答“你今天的预约时间为7点”,即使用户问道“今天我的预约时间是什么时间?”,助手的回答可能会被视为 corrected ,而不是“你的预约时间是7点”。这显示了LE在建立共同术语的重要性,以避免对话中的歧义。然而,我们在这个工作中发现,当前的响应生成模型并不充分考虑这一重要的人类特征。为此,我们提出了一个新的数据集名为 MULTIWOZ-ENTR,以及一个LE测量方法。此外,我们还提出了一种将LEExplicitly integrate into conversational systems的方法,包括两个新任务:LE抽取任务和LE生成任务。此外,我们还提出了两种基elineapproaches for LE抽取任务,以检测对话上的LE表达。

Multimodal Federated Learning in Healthcare: a review

  • paper_url: http://arxiv.org/abs/2310.09650
  • repo_url: None
  • paper_authors: Jacob Thrasher, Alina Devkota, Prasiddha Siwakotai, Rohit Chivukula, Pranav Poudel, Chaunbo Hu, Binod Bhattarai, Prashnna Gyawali
  • for: 本研究旨在探讨医疗领域中的多Modal Federated Learning(MMFL),以及其在保持患者数据隐私和安全的情况下提供高度准确和可靠的人工智能系统。
  • methods: 本研究使用了聚合学习和 federated learning 等方法,以实现在多个本地数据存储机构中进行多模态学习。
  • results: 本研究提出了一些挑战现有模型的限制,并指出了未来在这个领域的发展方向。
    Abstract Recent advancements in multimodal machine learning have empowered the development of accurate and robust AI systems in the medical domain, especially within centralized database systems. Simultaneously, Federated Learning (FL) has progressed, providing a decentralized mechanism where data need not be consolidated, thereby enhancing the privacy and security of sensitive healthcare data. The integration of these two concepts supports the ongoing progress of multimodal learning in healthcare while ensuring the security and privacy of patient records within local data-holding agencies. This paper offers a concise overview of the significance of FL in healthcare and outlines the current state-of-the-art approaches to Multimodal Federated Learning (MMFL) within the healthcare domain. It comprehensively examines the existing challenges in the field, shedding light on the limitations of present models. Finally, the paper outlines potential directions for future advancements in the field, aiming to bridge the gap between cutting-edge AI technology and the imperative need for patient data privacy in healthcare applications.
    摘要 (简体中文)近期,多modal机器学习技术在医疗领域得到了进一步发展,特别是在中央数据库系统中。同时,联合学习(FL)也在进步,提供了一种分布式机制,不需要集中数据,从而提高了医疗数据的隐私和安全性。这两种概念的结合支持了医疗领域的多modal学习进程,同时保证了患者记录在本地数据持有机构中的安全性和隐私性。本文提供了医疗领域联合学习的简要概述,并详细描述了当前领域的挑战和限制。最后,本文还提出了未来发展的可能性,旨在bridging当今AI技术和医疗应用中病人数据隐私的差距。
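
As background for how such decentralized training typically aggregates updates, here is a minimal FedAvg-style aggregation sketch; it assumes all clients share one architecture with floating-point parameters, whereas multimodal FL setups may only average the sub-networks a client actually trains.

```python
def fedavg(client_states, client_sizes):
    """Minimal FedAvg aggregation: average client model parameters
    (state dicts of tensors or plain floats) weighted by local dataset
    size, returning the new global state."""
    total = float(sum(client_sizes))
    keys = client_states[0].keys()
    return {k: sum(sd[k] * (n / total) for sd, n in zip(client_states, client_sizes))
            for k in keys}

# toy usage with scalar "parameters": weighted mean of 1.0 and 3.0
print(fedavg([{"w": 1.0}, {"w": 3.0}], client_sizes=[1, 3]))  # {'w': 2.5}
```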

Enhancing Binary Code Comment Quality Classification: Integrating Generative AI for Improved Accuracy

  • paper_url: http://arxiv.org/abs/2310.11467
  • repo_url: None
  • paper_authors: Rohith Arumugam S, Angel Deborah S
  • for: 提高 binary 代码注释质量分类模型的准确率
  • methods: integrating generated code and comment pairs to improve model accuracy
  • results: 最终得到两个分类模型：一个使用原始数据集训练，另一个使用加入了新生成的代码-注释对及其标签的扩展数据集训练。
    Abstract This report focuses on enhancing a binary code comment quality classification model by integrating generated code and comment pairs, to improve model accuracy. The dataset comprises 9048 pairs of code and comments written in the C programming language, each annotated as "Useful" or "Not Useful." Additionally, code and comment pairs are generated using a Large Language Model Architecture, and these generated pairs are labeled to indicate their utility. The outcome of this effort consists of two classification models: one utilizing the original dataset and another incorporating the augmented dataset with the newly generated code comment pairs and labels.
    摘要 这份报告关注将二进制代码评论质量分类模型与生成的代码和评论对照搭配,以提高模型准确性。数据集包含9048对C编程语言中的代码和评论,每个笔记为"有用"或"无用"。此外,代码和评论对也由大型自然语言模型架构生成,并将这些生成对照标注为其有用性。结果包括两个分类模型:一个使用原始数据集,另一个包括已生成的代码评论对和标注。

ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models

  • paper_url: http://arxiv.org/abs/2310.09624
  • repo_url: https://github.com/alexmeigz/assert
  • paper_authors: Alex Mei, Sharon Levy, William Yang Wang
  • for: 这篇论文旨在提升对大语言模型安全性的鲁棒性评估，使其能覆盖用户调用智能系统的多样化场景。
  • methods: 论文提出了 ASSERT 的三种方法：语义对齐增强、目标自举和对抗性知识注入，用于自动生成覆盖语义等价、相关场景和对抗性设置等多种鲁棒性情形的测试提示集，并将提示划分为四个安全领域进行细粒度分析。
  • results: 研究发现，尽管现有最先进模型内置了专门的安全防护，语义相关场景之间的绝对分类准确率差异最高可达 11%（统计显著），零样本对抗设置下的绝对错误率最高可达 19%，给用户的人身安全带来隐患。
    Abstract As large language models are integrated into society, robustness toward a suite of prompts is increasingly important to maintain reliability in a high-variance environment.Robustness evaluations must comprehensively encapsulate the various settings in which a user may invoke an intelligent system. This paper proposes ASSERT, Automated Safety Scenario Red Teaming, consisting of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection. For robust safety evaluation, we apply these methods in the critical domain of AI safety to algorithmically generate a test suite of prompts covering diverse robustness settings -- semantic equivalence, related scenarios, and adversarial. We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance. Despite dedicated safeguards in existing state-of-the-art models, we find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings, raising concerns for users' physical safety.

A decoder-only foundation model for time-series forecasting

  • paper_url: http://arxiv.org/abs/2310.10688
  • repo_url: None
  • paper_authors: Abhimanyu Das, Weihao Kong, Rajat Sen, Yichen Zhou
  • for: 这项研究旨在设计一个受大语言模型启发的时间序列基础模型，其开箱即用的零样本预测性能在多个公开数据集上接近各数据集上最佳的有监督预测模型。
  • methods: 该模型采用基于分块的解码器式（patched-decoder）注意力架构，在大规模时间序列语料上进行预训练，能够适应不同的历史长度、预测长度和时间粒度。
  • results: 研究发现,这个模型在不同的数据集上可以实现高度的预测精度,并且可以跨越不同的预测历史长度、预测长度和时间细分度。
    Abstract Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.
    摘要 受自然语言处理（NLP）领域大语言模型最新进展的启发，我们设计了一个用于预测的时间序列基础模型，其开箱即用的零样本性能在多个公开数据集上接近各数据集上最先进的有监督预测模型。该模型基于在大规模时间序列语料上预训练的分块解码器式（patched-decoder）注意力架构，能够在不同的历史长度、预测长度和时间粒度上良好工作。
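
A schematic sketch of a patched, decoder-only forecaster in the spirit of the description above: the history is split into fixed-length patches, each patch becomes one token, a causally masked Transformer processes the tokens, and the final token is projected to the next patch. All dimensions, the masking choice, and the output head are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PatchedDecoderForecaster(nn.Module):
    def __init__(self, patch_len=32, d_model=128, n_layers=4, n_heads=8):
        super().__init__()
        self.patch_len = patch_len
        self.embed = nn.Linear(patch_len, d_model)       # one token per patch
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, patch_len)        # predict the next patch

    def forward(self, history):
        # history: (batch, length) with length divisible by patch_len
        tokens = history.view(history.size(0), -1, self.patch_len)
        h = self.embed(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(h.size(1))
        h = self.backbone(h, mask=causal)                # decoder-style (causal) attention
        return self.head(h[:, -1])                       # next patch_len values

# toy usage: forecast the next 32 points from a history of 256 points
model = PatchedDecoderForecaster()
print(model(torch.randn(2, 256)).shape)  # torch.Size([2, 32])
```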

Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations

  • paper_url: http://arxiv.org/abs/2310.09612
  • repo_url: None
  • paper_authors: Alexa R. Tartaglini, Sheridan Feucht, Michael A. Lepori, Wai Keen Vong, Charles Lovering, Brenden M. Lake, Ellie Pavlick
  • for: 研究深度神经网络能否在分布内和分布外学习并泛化“相同-不同”（same-different）视觉关系。
  • methods: 使用多种网络架构、预训练方式和微调数据集，系统考察深度神经网络对 same-different 关系的学习与泛化能力。
  • results: 某些预训练的 Transformer 能学习到高度可泛化的 same-different 关系，在分布外样本上几乎完全正确；在缺乏纹理和颜色的抽象形状上微调能带来最强的分布外泛化。
    Abstract Although deep neural networks can achieve human-level performance on many object recognition benchmarks, prior work suggests that these same models fail to learn simple abstract relations, such as determining whether two objects are the same or different. Much of this prior work focuses on training convolutional neural networks to classify images of two same or two different abstract shapes, testing generalization on within-distribution stimuli. In this article, we comprehensively study whether deep neural networks can acquire and generalize same-different relations both within and out-of-distribution using a variety of architectures, forms of pretraining, and fine-tuning datasets. We find that certain pretrained transformers can learn a same-different relation that generalizes with near perfect accuracy to out-of-distribution stimuli. Furthermore, we find that fine-tuning on abstract shapes that lack texture or color provides the strongest out-of-distribution generalization. Our results suggest that, with the right approach, deep neural networks can learn generalizable same-different visual relations.

Penetrative AI: Making LLMs Comprehend the Physical World

  • paper_url: http://arxiv.org/abs/2310.09605
  • repo_url: None
  • paper_authors: Huatao Xu, Liying Han, Mo Li, Mani Srivastava
  • for: 本研究探讨了如何使用大型自然语言模型(LLMs)与物联网传感器和 actuators进行交互和理解物理世界,以推动人工智能在物理世界中的应用。
  • methods: 本研究采用了扩展LLMs的方法,通过处理感知信号来让模型与物理世界进行交互和理解。
  • results: 以 ChatGPT 为代表的初步结果表明，LLMs 在解读物联网传感器数据并据此对物理世界中的任务进行推理方面具有可观且独特的能力；这不仅为 LLMs 开辟了超越传统文本任务的新应用，也为在信息物理系统中融入人类知识提供了新途径。
    Abstract Recent developments in Large Language Models (LLMs) have demonstrated their remarkable capabilities across a range of tasks. Questions, however, persist about the nature of LLMs and their potential to integrate common-sense human knowledge when performing tasks involving information about the real physical world. This paper delves into these questions by exploring how LLMs can be extended to interact with and reason about the physical world through IoT sensors and actuators, a concept that we term "\textit{Penetrative AI}". The paper explores such an extension at two levels of LLMs' ability to penetrate into the physical world via the processing of sensory signals. Our preliminary findings indicate that LLMs, with ChatGPT being the representative example in our exploration, have considerable and unique proficiency in employing the knowledge they learned during training for interpreting IoT sensor data and reasoning over them about tasks in the physical realm. Not only this opens up new applications for LLMs beyond traditional text-based tasks, but also enables new ways of incorporating human knowledge in cyber-physical systems.

Context-aware Session-based Recommendation with Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2310.09593
  • repo_url: https://github.com/brilliantzhang/cares
  • paper_authors: Zhihui Zhang, JianXiang Yu, Xiang Li
  • for: 本研究旨在提高Session-based recommendation(SBR)的准确率,特别是在使用不同类型的Session中捕捉用户兴趣的情况下。
  • methods: 本研究提出了一种名为CARES的Context-Aware Session-based Recommendation模型,该模型利用不同类型的Session中的Context来捕捉用户兴趣,并采用图ael neural networks进行学习。
  • results: 实验结果表明,CARES模型在三个标准 datasets上均有显著的提高,比如P@20和MRR@20等指标。
    Abstract Session-based recommendation (SBR) is a task that aims to predict items based on anonymous sequences of user behaviors in a session. While there are methods that leverage rich context information in sessions for SBR, most of them have the following limitations: 1) they fail to distinguish the item-item edge types when constructing the global graph for exploiting cross-session contexts; 2) they learn a fixed embedding vector for each item, which lacks the flexibility to reflect the variation of user interests across sessions; 3) they generally use the one-hot encoded vector of the target item as the hard label to predict, thus failing to capture the true user preference. To solve these issues, we propose CARES, a novel context-aware session-based recommendation model with graph neural networks, which utilizes different types of contexts in sessions to capture user interests. Specifically, we first construct a multi-relation cross-session graph to connect items according to intra- and cross-session item-level contexts. Further, to encode the variation of user interests, we design personalized item representations. Finally, we employ a label collaboration strategy for generating soft user preference distribution as labels. Experiments on three benchmark datasets demonstrate that CARES consistently outperforms state-of-the-art models in terms of P@20 and MRR@20. Our data and codes are publicly available at https://github.com/brilliantZhang/CARES.
    摘要 Session-based recommendation (SBR) 是一个任务,旨在预测基于匿名用户行为序列的用户喜好。虽然有些方法利用session中的丰富上下文信息来实现SBR,但大多数方法具有以下限制:1)不能区分 item-item 边的类型,在构建全局图以利用跨SESSION上下文时,2)学习固定的 embedding 矢量,缺乏用户兴趣的变化适应性,3)通常使用目标项的一元化编码vector作为硬标签预测,从而失去真实用户喜好的表达。为解决这些问题,我们提出了 CARES,一种Context-Aware Session-Based Recommendation模型,利用session中不同类型的上下文来捕捉用户兴趣。具体来说,我们首先构建了多种关系跨SESSION图,将item相互连接,根据内部和跨SESSION item-level上下文。此外,为了编码用户兴趣的变化,我们设计了个性化项表示。最后,我们采用标签合作策略来生成软USER preference分布。实验结果显示,CARES在三个标准 benchmark 数据集上表现出色,相比之前的模型,它在P@20和MRR@20上均有显著提高。我们的数据和代码可以在https://github.com/brilliantZhang/CARES中获取。

Solving Math Word Problems with Reexamination

  • paper_url: http://arxiv.org/abs/2310.09590
  • repo_url: https://github.com/steven640pixel/psedualmwp
  • paper_authors: Yi Bin, Wenhao Shi, Yujuan Ding, Yang Yang, See-Kiong Ng
  • for: 这个论文的目的是提高数学问题 solving 能力。
  • methods: 这个论文使用了 pseudo-dual 学习方法,即在训练过程中重新评估问题的解决方法。
  • results: 实验表明,当将 pseudo-dual 学习方法应用于一些代表性的数学问题 solving 算法时,可以提高问题 solving 的能力。
    Abstract Math word problem (MWP) solving aims to understand the descriptive math problem and calculate the result, for which previous efforts are mostly devoted to upgrade different technical modules. This paper brings a different perspective of \textit{reexamination process} during training by introducing a pseudo-dual task to enhance the MWP solving. We propose a pseudo-dual (PseDual) learning scheme to model such process, which is model-agnostic thus can be adapted to any existing MWP solvers. The pseudo-dual task is specifically defined as filling the numbers in the expression back into the original word problem with numbers masked. To facilitate the effective joint learning of the two tasks, we further design a scheduled fusion strategy for the number infilling task, which smoothly switches the input from the ground-truth math expressions to the predicted ones. Our pseudo-dual learning scheme has been tested and proven effective when being equipped in several representative MWP solvers through empirical studies. \textit{The codes and trained models are available at:} \url{https://github.com/steven640pixel/PsedualMWP}. \end{abstract}
    摘要 mat word problem (MWP) 解决目标是理解描述性数学问题并计算结果,而前一些努力都是升级不同的技术模块。 这篇论文带来了一种不同的 \textit{重新评估过程} 在训练中的思路,通过引入一个 pseudo-dual 任务来提高 MWP 解决。我们提议一种 pseudo-dual 学习方案来模型这个过程,这种方案是无关模型的,因此可以适应任何现有的 MWP 解决器。 pseudo-dual 任务是填充数字到原始的数学问题中,并将数字掩码。为了实现有效的共同学习两个任务,我们还设计了一种安排的融合策略,使得输入从真实的数学表达中顺利地转换到预测的表达中。我们的 pseudo-dual 学习方案在一些代表性的 MWP 解决器上进行了实验,并证明了其效果。 \textit{代码和训练模型可以在} \url{https://github.com/steven640pixel/PsedualMWP} \textit{上获取.}

Autonomous Tree-search Ability of Large Language Models

  • paper_url: http://arxiv.org/abs/2310.10686
  • repo_url: None
  • paper_authors: Zheyu Zhang, Zhuorui Ye, Yikang Shen, Chuang Gan
  • for: 提高大型语言模型的推理能力,解决逻辑推理和策略规划等任务。
  • methods: 使用自动化搜索能力,通过LLM API进行自动化搜索,实现对答案的搜索迹线。
  • results: 实验结果显示，我们的方法带来了显著提升：相比 Chain of Thought 方法，平均准确率提高 33%，且达到相近准确率所需的 GPT API 开销更低。此外，我们用 ATS 提示方法收集数据并对 LLaMA 进行微调，取得了比基于 CoT 数据微调更大的改进。
    Abstract Large Language Models have excelled in remarkable reasoning capabilities with advanced prompting techniques, but they fall short on tasks that require exploration, strategic foresight, and sequential decision-making. Recent works propose to utilize external programs to define search logic, such that LLMs can perform passive tree search to solve more challenging reasoning tasks. Though impressive results have been achieved, there are several fundamental limitations of these approaches. First, passive tree searches are not efficient as they usually require multiple rounds of LLM API calls to solve one single problem. Moreover, passive search methods are not flexible since they need task-specific program designs. Then a natural question arises: can we maintain the tree-search capability of LLMs without the aid of external programs, and can still generate responses that clearly demonstrate the process of a tree-structure search? To this end, we propose a new concept called autonomous tree-search ability of LLM, which can automatically generate a response containing search trajectories for the correct answer. Concretely, we perform search trajectories using capable LLM API via a fixed system prompt, allowing them to perform autonomous tree-search (ATS) right out of the box. Experiments on 4 puzzle games demonstrate our method can achieve huge improvements. The ATS-BFS method outperforms the Chain of Thought approach by achieving an average accuracy improvement of 33%. Compared to Tree of Thoughts, it requires 65.6% or 47.7% less GPT-api cost to attain a comparable level of accuracy. Moreover, we have collected data using the ATS prompt method and fine-tuned LLaMA. This approach yield a greater improvement compared to the ones fine-tuned on CoT data. Specifically, it outperforms CoT-tuned LLaMAs by an average of 40.6% and 38.5% for LLaMA2-7B and LLaMA2-13B, respectively.
    摘要 大型语言模型(LLM)在进行先进的推理任务时,已经表现出了非常出色的推理能力,但在需要探索、 стратегіic预见和顺序做决策的任务时,它们仍然缺乏表现。 latest works propose to use external programs to define search logic, so that LLMs can perform passive tree search to solve more challenging reasoning tasks. Although impressive results have been achieved, there are several fundamental limitations of these approaches. First, passive tree searches are not efficient, as they usually require multiple rounds of LLM API calls to solve one single problem. Moreover, passive search methods are not flexible, as they need task-specific program designs. Therefore, a natural question arises: can we maintain the tree-search capability of LLMs without the aid of external programs, and can still generate responses that clearly demonstrate the process of a tree-structure search? To this end, we propose a new concept called autonomous tree-search ability of LLM, which can automatically generate a response containing search trajectories for the correct answer. Specifically, we perform search trajectories using capable LLM API via a fixed system prompt, allowing them to perform autonomous tree-search (ATS) right out of the box. Experimental results on 4 puzzle games demonstrate that our method can achieve significant improvements. The ATS-BFS method outperforms the Chain of Thought approach by achieving an average accuracy improvement of 33%. Compared to Tree of Thoughts, it requires 65.6% or 47.7% less GPT-api cost to attain a comparable level of accuracy. Moreover, we have collected data using the ATS prompt method and fine-tuned LLaMA. This approach yields a greater improvement compared to the ones fine-tuned on CoT data. Specifically, it outperforms CoT-tuned LLaMAs by an average of 40.6% and 38.5% for LLaMA2-7B and LLaMA2-13B, respectively.
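
A hypothetical sketch of the single-call usage pattern: instead of one API call per expanded node, a fixed system prompt asks the model to write out the whole BFS-style search trace and a final answer line. The prompt wording, the `llm` callable, and the answer format are all assumptions for illustration.

```python
ATS_SYSTEM_PROMPT = (
    "Solve the puzzle by exploring candidate moves breadth-first. "
    "Write out the search tree level by level, prune dead ends, and "
    "finish with a line 'ANSWER: <solution>'."
)

def autonomous_tree_search(llm, puzzle: str) -> str:
    """Single-call autonomous tree search: the whole BFS-style trace is
    generated by the model itself under a fixed system prompt, rather
    than issuing one API call per expanded node. `llm` is any callable
    taking (system, user) strings and returning the model's text."""
    trace = llm(ATS_SYSTEM_PROMPT, puzzle)
    for line in reversed(trace.splitlines()):
        if line.startswith("ANSWER:"):
            return line.split("ANSWER:", 1)[1].strip()
    return trace  # fall back to the raw trace if no answer line is found
```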

PS-AAS: Portfolio Selection for Automated Algorithm Selection in Black-Box Optimization

  • paper_url: http://arxiv.org/abs/2310.10685
  • repo_url: None
  • paper_authors: Ana Kostovska, Gjorgjina Cenikj, Diederick Vermetten, Anja Jankovic, Ana Nikolikj, Urban Skvorc, Peter Korosec, Carola Doerr, Tome Eftimov
  • for: 这篇论文研究自动化算法选择（AAS）中的算法组合（portfolio）选择问题，即如何挑选一组算法，使后续的算法选择效果最佳。
  • methods: 论文提出了一种数据驱动的组合选择技术：先为每个算法构建行为元表示（meta-representation），再基于元表示之间的相似度构建图，最后用图算法选出多样、有代表性且不冗余的最终算法组合。
  • results: 论文使用两种元表示技术（SHAP 与 performance2vec），从 324 个 CMA-ES 变体中为维度 5 和 30、不同截断预算下的 BBOB 单目标优化问题挑选互补的算法组合，并比较了两类组合：基于整体算法行为的组合与按问题分别构建的“个性化”组合。结果显示，基于 performance2vec 元表示的方法倾向于选出小规模组合，相对于所选组合中的虚拟最优求解器误差可以忽略；基于 SHAP 元表示的组合更灵活，但 AAS 性能有所下降。在大多数场景中，个性化组合与经典的贪心方法相当或略优，并在所有场景中优于使用全部算法的组合。
    摘要 algorithm选择自动化(AAS)的性能强度取决于可选的算法集合。选择该集合是一项非轻松的任务,需要平衡更高的灵活性和更大的算法选择任务的复杂度。在实践中,可能最常用的方法是根据参考任务的兴趣选择算法。在这种情况下,我们在这项工作中提出了一种数据驱动的算法集合选择技术。我们的提议的方法是创建算法行为媒体表示,将一组算法基于媒体表示之间的相似性构建一个图,并将图算法应用于选择最终的多样化、代表性和不重复的算法集合。我们对324个不同的CMA-ES变体进行了两种不同的媒体表示技术(SHAP和performance2vec)来选择相似的算法集合,并测试了两种类型的集合:一种关于总算法行为,另一种是关于每个问题的算法行为。我们发现,基于performance2vec-based的表示方法选择小型集合,与虚拟最佳算法从选择的集合中的错误相对较小,而基于SHAP-based的表示方法增加了更高的灵活性,但是在AAS任务中导致性能下降。在大多数考虑的场景下,个性化集合比 классифика greedy 方法提供了相似或微弱的性能,而且在所有场景下都高于全部集合。
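
To illustrate the selection step, here is a simplified stand-in: algorithms are represented by meta-representation vectors and a farthest-point heuristic on cosine distance greedily picks k mutually dissimilar ones. The paper instead builds an explicit similarity graph and applies graph algorithms, so treat this only as a sketch of the underlying idea.

```python
import numpy as np

def select_portfolio(meta_reprs, k):
    """Greedy diversity-based portfolio selection.

    `meta_reprs` is an (n_algorithms, dim) array of algorithm
    meta-representations (e.g. SHAP- or performance-based vectors);
    returns the indices of k mutually dissimilar algorithms."""
    X = np.asarray(meta_reprs, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    dist = 1.0 - X @ X.T                       # pairwise cosine distance
    chosen = [0]
    while len(chosen) < k:
        remaining = [i for i in range(len(X)) if i not in chosen]
        # pick the algorithm farthest from everything already chosen
        chosen.append(max(remaining, key=lambda i: dist[i, chosen].min()))
    return chosen

print(select_portfolio(np.random.rand(10, 8), k=3))
```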

Does CLIP’s Generalization Performance Mainly Stem from High Train-Test Similarity?

  • paper_url: http://arxiv.org/abs/2310.09562
  • repo_url: None
  • paper_authors: Prasanna Mayilvahanan, Thaddäus Wiedemer, Evgenia Rusak, Matthias Bethge, Wieland Brendel
  • for: 本文研究 CLIP 在各类分布外（OOD）基准上的零样本与少样本能力究竟主要来自何处。
  • methods: 本文在经过剪裁的 LAION 子集上重新训练 CLIP，这些子集复刻了 ImageNet 相对常见 OOD 基准的训练-测试相似度，以检验“高训练-测试相似度解释 OOD 性能”这一假设。
  • results: 尽管在部分基准上性能有所下降，重新训练的 CLIP 总体性能仍然很高，说明高训练-测试相似度不足以解释 CLIP 的 OOD 性能，训练数据的其他属性促使 CLIP 学到更可泛化的表示；此外，通过剪除与 OOD 基准相似的数据点，我们得到了一个 1 亿规模的 LAION 子集（约为原始大小的四分之一），在其上训练的 CLIP 能达到原有的 OOD 性能。
    Abstract Foundation models like CLIP are trained on hundreds of millions of samples and effortlessly generalize to new tasks and inputs. Out of the box, CLIP shows stellar zero-shot and few-shot capabilities on a wide range of out-of-distribution (OOD) benchmarks, which prior works attribute mainly to today's large and comprehensive training dataset (like LAION). However, it is questionable how meaningful terms like out-of-distribution generalization are for CLIP as it seems likely that web-scale datasets like LAION simply contain many samples that are similar to common OOD benchmarks originally designed for ImageNet. To test this hypothesis, we retrain CLIP on pruned LAION splits that replicate ImageNet's train-test similarity with respect to common OOD benchmarks. While we observe a performance drop on some benchmarks, surprisingly, CLIP's overall performance remains high. This shows that high train-test similarity is insufficient to explain CLIP's OOD performance, and other properties of the training data must drive CLIP to learn more generalizable representations. Additionally, by pruning data points that are dissimilar to the OOD benchmarks, we uncover a 100M split of LAION ($\frac{1}{4}$th of its original size) on which CLIP can be trained to match its original OOD performance.
    摘要 CLIP模型如Foundation models是通过百万个样本进行训练,并能够自动泛化到新任务和输入。它的出厂性能在各种不同的外部数据集上表现杰出,这被认为是因为今天的大规模和全面的训练数据集(如LAION)。然而,是否有意义地用 терminus如外部数据集泛化是CLIP的问题,因为它似乎是LAION中的许多样本与常见的OOD benchmarks(如ImageNet)有很多相似之处。为了测试这个假设,我们重新训练CLIP在LAION中采样后的分割中,这些分割与ImageNet的训练集和测试集之间具有相似的类型相似性。虽然我们观察到一些benchmark上的性能下降,但CLIP的总性能仍然高。这表明高的train-test相似性不能完全解释CLIP的OOD性能,其他训练数据的属性must drive CLIP学习更泛化的表示。此外,我们通过从LAION中删除与OOD benchmarks不相似的数据点,发现一个100M大小的LAION分割(占原始大小的一半),在这个分割上训练CLIP可以与原始OOD性能匹配。
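
A toy sketch of similarity-based pruning of the kind discussed above: each training sample's nearest-neighbor cosine similarity to the benchmark set is computed from (assumed) image-encoder embeddings, and samples above a threshold are dropped. At LAION scale this would require approximate nearest-neighbor search rather than the dense matrix used here.

```python
import numpy as np

def prune_by_similarity(train_emb, test_emb, threshold):
    """Keep only training samples whose maximum cosine similarity to any
    benchmark/test sample is below `threshold`.

    `train_emb` and `test_emb` are (n, d) arrays of embeddings, assumed to
    come from an image encoder such as CLIP's. Returns indices to keep."""
    tr = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    te = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    max_sim = (tr @ te.T).max(axis=1)      # nearest-benchmark similarity per sample
    return np.where(max_sim < threshold)[0]

# toy usage on random embeddings
kept = prune_by_similarity(np.random.rand(1000, 64), np.random.rand(50, 64), 0.9)
print(len(kept))
```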

Graph Neural Network approaches for single-cell data: A recent overview

  • paper_url: http://arxiv.org/abs/2310.09561
  • repo_url: None
  • paper_authors: Konstantinos Lazaros, Dimitris E. Koumadorakis, Panagiotis Vlamos, Aristidis G. Vrahatis
  • for: 这 paper 的目的是探讨 Graph Neural Networks (GNN) 在单元细胞数据上的应用,以及 GNN 方法在不同目标上的可行性。
  • methods: 这 paper 使用了多种 GNN 方法,包括 Graph Attention Networks (GAT) 和 Graph Convolutional Neural Networks (Graph CNN),以及其他相关的方法。
  • results: 这 paper 提出了一些结合 GNN 和单元细胞数据的研究,显示了这些方法在不同目标上的可行性,例如 cell-type annotation, data integration and imputation, gene regulatory network reconstruction, clustering 等。
    Abstract Graph Neural Networks (GNN) are reshaping our understanding of biomedicine and diseases by revealing the deep connections among genes and cells. As both algorithmic and biomedical technologies have advanced significantly, we're entering a transformative phase of personalized medicine. While pioneering tools like Graph Attention Networks (GAT) and Graph Convolutional Neural Networks (Graph CNN) are advancing graph-based learning, the rise of single-cell sequencing techniques is reshaping our insights on cellular diversity and function. Numerous studies have combined GNNs with single-cell data, showing promising results. In this work, we highlight the GNN methodologies tailored for single-cell data over the recent years. We outline the diverse range of graph deep learning architectures that center on GAT methodologies. Furthermore, we underscore the several objectives of GNN strategies in single-cell data contexts, ranging from cell-type annotation, data integration and imputation, gene regulatory network reconstruction, clustering and many others. This review anticipates a future where GNNs become central to single-cell analysis efforts, particularly as vast omics datasets are continuously generated and the interconnectedness of cells and genes enhances our depth of knowledge in biomedicine.
    摘要 GRAPH神经网络(GNN)正在改变我们对生物医学和疾病的理解,揭示了基因和细胞之间的深层连接。随着算法技术和生物技术的进步,我们正在进入个性化医学的转型阶段。而前所未有的工具如图像注意力网络(GAT)和图像卷积神经网络(图 CNN)正在推动图形学习的发展,而单细胞测序技术的出现也在改变我们对细胞多样性和功能的理解。许多研究已经结合GNNs与单细胞数据,并取得了有 promise 的结果。在这个工作中,我们将强调最近几年对单细胞数据的GNN方法的探索。我们将介绍各种中心于GAT方法的图深度学习架构,并强调GNN策略在单细胞数据上的多种目标,包括细胞类型标注、数据集成和填充、基因规则网络重建、划分和其他多种目标。这篇文章预测未来,GNN将成为单细胞分析的中心,特别是随着不断生成的庞大各种数据和细胞和基因之间的连接,我们对生物医学的知识将更加深入。

UNIQA: A Unified Framework for Both Full-Reference and No-Reference Image Quality Assessment

  • paper_url: http://arxiv.org/abs/2310.09560
  • repo_url: None
  • paper_authors: Yi Ke Yun, Weisi Lin
  • for: 提升全参考（FR）与无参考（NR）图像质量评估（IQA）的性能，并使同一网络能够同时处理 FR 与 NR 输入。
  • methods: 提出了一种基于 semantic impact 模型的 универсальный网络,包括encoder和多级自注意力(HSA)模块,以及跨层跨注意力(CSCA)模块,用于模型空间扭曲水平和图像 semantics的关系。
  • results: 对四个synthetic扭曲数据集和三个authentic扭曲数据集进行了广泛的实验,并得到了较高的性能,超过了相关的FR和NR方法。
    Abstract The human visual system (HVS) is effective at distinguishing low-quality images due to its ability to sense the distortion level and the resulting semantic impact. Prior research focuses on developing dedicated networks based on the presence and absence of pristine images, respectively, and this results in limited application scope and potential performance inconsistency when switching from NR to FR IQA. In addition, most methods heavily rely on spatial distortion modeling through difference maps or weighted features, and this may not be able to well capture the correlations between distortion and the semantic impact it causes. To this end, we aim to design a unified network for both Full-Reference (FR) and No-Reference (NR) IQA via semantic impact modeling. Specifically, we employ an encoder to extract multi-level features from input images. Then a Hierarchical Self-Attention (HSA) module is proposed as a universal adapter for both FR and NR inputs to model the spatial distortion level at each encoder stage. Furthermore, considering that distortions contaminate encoder stages and damage image semantic meaning differently, a Cross-Scale Cross-Attention (CSCA) module is proposed to examine correlations between distortion at shallow stages and deep ones. By adopting HSA and CSCA, the proposed network can effectively perform both FR and NR IQA. Extensive experiments demonstrate that the proposed simple network is effective and outperforms the relevant state-of-the-art FR and NR methods on four synthetic-distorted datasets and three authentic-distorted datasets.
    摘要 人类视觉系统(HVS)能够准确地认识低质量图像,这是因为它能够感受到图像的扭曲水平和导致的semantic影响。先前的研究主要关注于基于存在和缺失高品质图像的特有网络,这会导致应用范围有限和在切换到FR IQA时性能不稳定。此外,大多数方法依赖于空间扭曲模型,通过差图或权重特征来模型扭曲水平,这可能无法好地捕捉扭曲对semantic意义的影响。为了解决这个问题,我们目的是设计一个可以同时执行FR和NR IQA的统一网络,通过semantic impact模型来模型扭曲水平。我们使用encoder提取输入图像的多级特征。然后,我们提出了一种 Hierarchical Self-Attention(HSA)模块,作为FR和NR输入的通用适配器,以模型encoder stage上的空间扭曲水平。此外,我们认为扭曲会在encoder stage上污染图像的semantic意义,因此我们提出了一种 Cross-Scale Cross-Attention(CSCA)模块,以检查扭曲在不同深度stage之间的相关性。通过采用HSA和CSCA,我们的提案的简单网络可以高效地执行FR和NR IQA。我们的实验证明,我们的提案的简单网络可以高效地与相关的FR和NR方法进行比较,并在四个synthetic-distorted dataset和三个authentic-distorted dataset上获得更好的性能。

A study of the impact of generative AI-based data augmentation on software metadata classification

  • paper_url: http://arxiv.org/abs/2310.13714
  • repo_url: None
  • paper_authors: Tripti Kumari, Chakali Sai Charan, Ayan Das
  • for: 这篇论文是为了自动预测代码-注释对的有用性而写的。
  • methods: 这篇论文使用了人工智能基于神经网络上下文表示的注释和代码的关系来预测代码-注释对的有用性,并对基础数据和大语言模型生成的数据进行性能分析。
  • results: 在官方评测中，该系统的 F1 分数比基线提高 4%，生成数据的质量也得到了改进。
    Abstract This paper presents the system submitted by the team from IIT(ISM) Dhanbad in FIRE IRSE 2023 shared task 1 on the automatic usefulness prediction of code-comment pairs as well as the impact of Large Language Model(LLM) generated data on original base data towards an associated source code. We have developed a framework where we train a machine learning-based model using the neural contextual representations of the comments and their corresponding codes to predict the usefulness of code-comments pair and performance analysis with LLM-generated data with base data. In the official assessment, our system achieves a 4% increase in F1-score from baseline and the quality of generated data.

Protein 3D Graph Structure Learning for Robust Structure-based Protein Property Prediction

  • paper_url: http://arxiv.org/abs/2310.11466
  • repo_url: None
  • paper_authors: Yufei Huang, Siyuan Li, Jin Su, Lirong Wu, Odin Zhang, Haitao Lin, Jingqi Qi, Zihan Liu, Zhangyang Gao, Yuyang Liu, Jiangbin Zheng, Stan. ZQ. Li
  • for: 本研究旨在解决在蛋白质性质预测中使用预测结构时出现的性能下降问题。
  • methods: 本研究使用了一种基于蛋白质三维图 структуры学习的框架,即Structure embedding Alignment Optimization(SAO),以mitigate the problem of structure embedding bias between predicted and experimental protein structures。
  • results: 对比于现有方法,本研究的方法能够在蛋白质性质预测中提高性能,并且可以适用于预测结构和实验结构 both。
    Abstract Protein structure-based property prediction has emerged as a promising approach for various biological tasks, such as protein function prediction and sub-cellular location estimation. The existing methods highly rely on experimental protein structure data and fail in scenarios where these data are unavailable. Predicted protein structures from AI tools (e.g., AlphaFold2) were utilized as alternatives. However, we observed that current practices, which simply employ accurately predicted structures during inference, suffer from notable degradation in prediction accuracy. While similar phenomena have been extensively studied in general fields (e.g., Computer Vision) as model robustness, their impact on protein property prediction remains unexplored. In this paper, we first investigate the reason behind the performance decrease when utilizing predicted structures, attributing it to the structure embedding bias from the perspective of structure representation learning. To study this problem, we identify a Protein 3D Graph Structure Learning Problem for Robust Protein Property Prediction (PGSL-RP3), collect benchmark datasets, and present a protein Structure embedding Alignment Optimization framework (SAO) to mitigate the problem of structure embedding bias between the predicted and experimental protein structures. Extensive experiments have shown that our framework is model-agnostic and effective in improving the property prediction of both predicted structures and experimental structures. The benchmark datasets and codes will be released to benefit the community.
    摘要 基于蛋白质结构的性质预测已成为多种生物任务(如蛋白质功能预测和亚细胞定位估计)中一种有前景的方法。现有方法高度依赖实验测定的蛋白质结构数据,在这些数据不可用时便会失效。AI工具(如AlphaFold2)预测的蛋白质结构被用作替代方案,但我们观察到,目前直接在推理时使用预测结构的做法会导致预测精度显著下降。类似现象在计算机视觉等通用领域已作为模型鲁棒性被广泛研究,但其对蛋白质性质预测的影响尚未被探索。本文首先从结构表示学习的角度分析了使用预测结构时性能下降的原因,将其归结为结构嵌入偏差。为研究该问题,我们定义了面向鲁棒蛋白质性质预测的蛋白质三维图结构学习问题(PGSL-RP3),收集了基准数据集,并提出了蛋白质结构嵌入对齐优化框架(SAO),以缓解预测结构与实验结构之间的结构嵌入偏差。大量实验表明,该框架与具体模型无关,能同时提升在预测结构和实验结构上的性质预测性能。基准数据集与代码将会公开,以惠及社区。
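A minimal sketch of the embedding-alignment idea described in the abstract, written in PyTorch: the embedding of a predicted structure is pulled toward the embedding of its experimental counterpart while a property head is trained on both. The encoder/head modules, the pairing of structures, and the loss weighting are assumptions for illustration, not the paper's exact SAO objective.

```python
# Hypothetical structure-embedding alignment step (not the paper's exact SAO loss).
# `encoder` and `property_head` are placeholder nn.Modules; batches are paired
# predicted/experimental structures of the same proteins with shared labels.
import torch
import torch.nn.functional as F

def alignment_step(encoder, property_head, pred_batch, exp_batch, labels,
                   align_weight=0.1):
    z_pred = encoder(pred_batch)          # embeddings of predicted (e.g., AlphaFold2) structures
    z_exp = encoder(exp_batch).detach()   # embeddings of experimental structures (used as targets)

    # Alignment term: reduce the embedding bias between the two structure sources.
    align_loss = F.mse_loss(z_pred, z_exp)

    # Supervised property-prediction term on both structure sources.
    task_loss = (F.cross_entropy(property_head(z_pred), labels) +
                 F.cross_entropy(property_head(z_exp), labels))

    return task_loss + align_weight * align_loss
```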

Software Metadata Classification based on Generative Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2310.13006
  • repo_url: None
  • paper_authors: Seetharam Killivalavan, Durairaj Thenmozhi
  • for: 这种方法可以提高 binary 代码注释质量分类模型的性能。
  • methods: 该方法利用 Generative Artificial Intelligence (AI) 技术,通过 OpenAI API 生成了 1239 个新的代码注释对,从各种 GitHub 存储库和开源项目中提取出来,并将其与现有的 9048 个对照集成到 C 语言中。
  • results: 使用先进的大语言模型架构生成的数据集带来了明显的改善:并入SVM模型后,精确率从0.79提高到0.85,提升6%;并入ANN模型后,召回率从0.731提高到0.746,提升1.5%。
    Abstract This paper presents a novel approach to enhance the performance of binary code comment quality classification models through the application of Generative Artificial Intelligence (AI). By leveraging the OpenAI API, a dataset comprising 1239 newly generated code-comment pairs, extracted from various GitHub repositories and open-source projects, has been labelled as "Useful" or "Not Useful", and integrated into the existing corpus of 9048 pairs in the C programming language. Employing a cutting-edge Large Language Model Architecture, the generated dataset demonstrates notable improvements in model accuracy. Specifically, when incorporated into the Support Vector Machine (SVM) model, a 6% increase in precision is observed, rising from 0.79 to 0.85. Additionally, the Artificial Neural Network (ANN) model exhibits a 1.5% increase in recall, climbing from 0.731 to 0.746. This paper sheds light on the potential of Generative AI in augmenting code comment quality classification models. The results affirm the effectiveness of this methodology, indicating its applicability in broader contexts within software development and quality assurance domains. The findings underscore the significance of integrating generative techniques to advance the accuracy and efficacy of machine learning models in practical software engineering scenarios.
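The abstract reports precision gains when LLM-generated code-comment pairs are added to the base training corpus of an SVM classifier. Below is a minimal scikit-learn sketch of that evaluation protocol; the feature choice (TF-IDF over a concatenated "comment [SEP] code" string), the 0/1 label encoding for "Useful"/"Not Useful", and the variable names are illustrative assumptions.

```python
# Illustrative comparison of a code-comment usefulness classifier trained on the
# base corpus vs. base + LLM-generated pairs. Feature extraction and data loading
# are assumptions for this sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_score

def train_and_eval(train_pairs, train_labels, test_pairs, test_labels):
    texts = [f"{comment} [SEP] {code}" for comment, code in train_pairs]
    vec = TfidfVectorizer(max_features=20000)
    X_train = vec.fit_transform(texts)
    clf = LinearSVC().fit(X_train, train_labels)

    X_test = vec.transform([f"{c} [SEP] {code}" for c, code in test_pairs])
    return precision_score(test_labels, clf.predict(X_test))

# base_* and generated_* stand for the 9048 original and 1239 LLM-generated pairs:
# p_base = train_and_eval(base_pairs, base_labels, test_pairs, test_labels)
# p_aug  = train_and_eval(base_pairs + generated_pairs,
#                         base_labels + generated_labels, test_pairs, test_labels)
```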

Instruction Tuning with Human Curriculum

  • paper_url: http://arxiv.org/abs/2310.09518
  • repo_url: None
  • paper_authors: Bruce W. Lee, Hyunsoo Cho, Kang Min Yoo
  • for: 这篇论文旨在探讨如何使用结构化认知学方法来优化现代大语言模型中的指令优化。
  • methods: 该论文提出了一种基于人类教育框架的高度结构化的人工数据集,并对其进行了适应性和评估。
  • results: 研究结果表明,该方法可以提高语言模型的性能,比如在MMLU基准上提升+3.06,并且可以避免额外计算成本。
    Abstract The dominant paradigm for instruction tuning is the random-shuffled training of maximally diverse instruction-response pairs. This paper explores the potential benefits of applying a structured cognitive learning approach to instruction tuning in contemporary large language models like ChatGPT and GPT-4. Unlike the previous conventional randomized instruction dataset, we propose a highly structured synthetic dataset that mimics the progressive and organized nature of human education. We curate our dataset by aligning it with educational frameworks, incorporating meta information including its topic and cognitive rigor level for each sample. Our dataset covers comprehensive fine-grained topics spanning diverse educational stages (from middle school to graduate school) with various questions for each topic to enhance conceptual depth using Bloom's taxonomy-a classification framework distinguishing various levels of human cognition for each concept. The results demonstrate that this cognitive rigorous training approach yields significant performance enhancements - +3.06 on the MMLU benchmark and an additional +1.28 on AI2 Reasoning Challenge (hard set) - compared to conventional randomized training, all while avoiding additional computational costs. This research highlights the potential of leveraging human learning principles to enhance the capabilities of language models in comprehending and responding to complex instructions and tasks.
    摘要 指令微调的主流范式是对最大多样化的指令-回复对进行随机打乱训练。本文探讨了在ChatGPT和GPT-4等当代大语言模型的指令微调中应用结构化认知学习方法的潜在收益。与以往的随机化指令数据集不同,我们提出了一个高度结构化的合成数据集,模拟人类教育循序渐进、有组织的特性。我们依据教育框架来构建数据集,并为每个样本标注主题及认知难度等元信息。数据集覆盖从中学到研究生阶段的细粒度主题,每个主题配有多个问题,并借助布隆分类法(一种区分人类认知层次的分类框架)来增强概念深度。结果表明,这种注重认知严谨性的训练方法在不增加额外计算成本的情况下带来显著的性能提升:MMLU基准提升+3.06,AI2推理挑战(困难集)额外提升+1.28。该研究表明,利用人类学习原理可以增强语言模型理解和响应复杂指令与任务的能力。
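A minimal plain-Python sketch of the curriculum idea in the abstract: samples are bucketed by educational stage and Bloom's-taxonomy level and shuffled only inside each bucket, instead of being globally randomized. The metadata field names ("stage", "bloom_level") and the stage/level lists are assumptions about how such a dataset might be annotated.

```python
# Hypothetical curriculum ordering of instruction-tuning samples: sort by
# educational stage, then by cognitive rigor (Bloom's taxonomy), and shuffle only
# within each difficulty bucket.
import random

STAGES = ["middle_school", "high_school", "undergraduate", "graduate"]
BLOOM = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

def curriculum_order(samples, seed=0):
    rng = random.Random(seed)
    buckets = {}
    for s in samples:
        key = (STAGES.index(s["stage"]), BLOOM.index(s["bloom_level"]))
        buckets.setdefault(key, []).append(s)
    ordered = []
    for key in sorted(buckets):
        rng.shuffle(buckets[key])   # keep diversity inside each bucket
        ordered.extend(buckets[key])
    return ordered
```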

Towards Semantic Communication Protocols for 6G: From Protocol Learning to Language-Oriented Approaches

  • paper_url: http://arxiv.org/abs/2310.09506
  • repo_url: None
  • paper_authors: Jihong Park, Seung-Woo Ko, Jinho Choi, Seong-Lyun Kim, Mehdi Bennis
  • for: 这篇论文旨在探讨未来的6G系统将如何应对多种非平稳任务的挑战,以及通常为静态、预先定义的传统媒体接入控制协议(MAC协议)为何难以适应这些挑战。
  • methods: 这篇论文提出了一个新的分类方法,将数据驱动的MAC协议分为三级:Level 1 MAC 是使用多智能体深度强化学习(MADRL)构建的面向任务的神经协议;Level 2 MAC 是将 Level 1 MAC 的输出转换为显式符号而得到的面向神经网络的符号协议;Level 3 MAC 是使用大语言模型(LLM)和生成模型构建的面向语言的语义协议。
  • results: 这篇论文通过探讨这些层次的基本技术和选择性案例研究,提供了关于资料驱动MAC协议的未来走势和未来研究方向的对答。
    Abstract The forthcoming 6G systems are expected to address a wide range of non-stationary tasks. This poses challenges to traditional medium access control (MAC) protocols that are static and predefined. In response, data-driven MAC protocols have recently emerged, offering ability to tailor their signaling messages for specific tasks. This article presents a novel categorization of these data-driven MAC protocols into three levels: Level 1 MAC. task-oriented neural protocols constructed using multi-agent deep reinforcement learning (MADRL); Level 2 MAC. neural network-oriented symbolic protocols developed by converting Level 1 MAC outputs into explicit symbols; and Level 3 MAC. language-oriented semantic protocols harnessing large language models (LLMs) and generative models. With this categorization, we aim to explore the opportunities and challenges of each level by delving into their foundational techniques. Drawing from information theory and associated principles as well as selected case studies, this study provides insights into the trajectory of data-driven MAC protocols and sheds light on future research directions.
    摘要 六代系统即将来临,预期能解决各种非静止任务。这会对传统的媒体存取控制协议(MAC)协议产生挑战,这些协议通常是静止的和预先定义的。对此,使用数据驱动的MAC协议已经出现,这些协议可以根据特定任务 tailor其讯息。本文提出了一个 novel 的分类方法,将这些数据驱动的MAC协议分为三级: Level 1 MAC:使用多智能深度反对应学习(MADRL)构建的任务对应神经网络协议。 Level 2 MAC:将 Level 1 MAC 的输出转换为Explicit symbols,这样可以使用神经网络协议。 Level 3 MAC:使用大型语言模型(LLMs)和生成模型,实现语言对应的协议。透过这个分类,我们希望可以探讨每个等级的机遇和挑战,并且从信息论和相关的原则以及选择的实验案例中获得新的见解。这篇研究提供了数据驱动MAC协议的未来趋势和未来研究方向的新的思路。

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.09499
  • repo_url: None
  • paper_authors: Hang Shao, Bei Liu, Yanmin Qian
  • for: 提高生成预训练变换器(GPT)家族模型的实用性,通过量化、剪枝和其他方法提高模型的效率。
  • methods: 提出基于海森(Hessian)敏感度的混合稀疏剪枝方法,无需重新训练即可将GPT类模型剪枝至至少50%的稀疏率。该方法根据敏感度自适应地分配各层稀疏度,在保持总体稀疏率的同时降低剪枝引入的误差。
  • results: 提出的方法可以在极高稀疏率下进一步提高LLM模型的效率,并且兼容量化,可以进一步压缩LLM模型。
    Abstract Various Large Language Models(LLMs) from the Generative Pretrained Transformer~(GPT) family have achieved outstanding performances in a wide range of text generation tasks. However, the enormous model sizes have hindered their practical use in real-world applications due to high inference latency. Therefore, improving the efficiencies of LLMs through quantization, pruning, and other means has been a key issue in LLM studies. In this work, we propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50\% sparsity without the need of any retraining. It allocates sparsity adaptively based on sensitivity, allowing us to reduce pruning-induced error while maintaining the overall sparsity level. The advantages of the proposed method exhibit even more when the sparsity is extremely high. Furthermore, our method is compatible with quantization, enabling further compression of LLMs.
    摘要 各种大型语言模型(LLM)从转换器生成器(GPT)家族已经在文本生成任务中显示出杰出的表现。然而,这些模型的巨大大小使得它们在实际应用中具有高的推理延迟。因此,提高LLM的效率通过量化、剪裁等方法已成为LLM研究中的关键问题。在这个工作中,我们提出基于梯度敏感性杂化混合稀疏剪枝法,可以剪枝LLMs到至少50%的稀疏程度而无需重新训练。它根据敏感性分配稀疏性,使我们可以降低剪枝引起的错误而保持总的稀疏性水平。我们的方法在稀疏程度非常高时表现出更多的优势。此外,我们的方法与量化相容,可以进一步压缩LLMs。
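A minimal NumPy sketch of the allocation idea in the abstract: layers with higher sensitivity receive a lower sparsity ratio while the global average stays at the target, and each layer is then magnitude-pruned. The sensitivity proxy (mean squared weight standing in for a Hessian-based score), the linear allocation rule, and the clipping bounds are assumptions, not the paper's exact criterion.

```python
# Hypothetical sensitivity-aware mixed-sparsity pruning: allocate per-layer
# sparsity from a sensitivity score, keep the global average at the target,
# then zero out the smallest-magnitude weights in each layer.
import numpy as np

def allocate_sparsity(layer_weights, target=0.5, spread=0.2):
    sens = np.array([float(np.mean(w ** 2)) for w in layer_weights])
    sens = (sens - sens.min()) / (np.ptp(sens) + 1e-12)   # normalize to [0, 1]
    ratios = target + spread * (0.5 - sens)                # more sensitive -> prune less
    ratios += target - ratios.mean()                       # keep the global average
    return np.clip(ratios, 0.0, 0.95)

def prune_layer(w, ratio):
    k = int(ratio * w.size)
    if k == 0:
        return w
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)
```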

A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models

  • paper_url: http://arxiv.org/abs/2310.09497
  • repo_url: https://github.com/ielab/llm-rankers
  • paper_authors: Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, Guido Zuccon
  • for: 这paper主要是为了evaluating the effectiveness and efficiency of different prompting approaches for large language models (LLMs) in zero-shot document ranking tasks.
  • methods: 这paper使用了Pointwise, Pairwise,和Listwise prompting approaches, along with a novel Setwise approach, to evaluate their effectiveness and efficiency in LLM-based zero-shot ranking.
  • results: 这paper的实验结果表明,Pointwise approaches are efficient but less effective, Pairwise approaches are effective but computationally expensive, while Setwise approaches can significantly reduce computational costs while retaining high zero-shot ranking effectiveness.
    Abstract Large Language Models (LLMs) demonstrate impressive effectiveness in zero-shot document ranking tasks. Pointwise, Pairwise, and Listwise prompting approaches have been proposed for LLM-based zero-shot ranking. Our study begins by thoroughly evaluating these existing approaches within a consistent experimental framework, considering factors like model size, token consumption, latency, among others. This first-of-its-kind comparative evaluation of these approaches allows us to identify the trade-offs between effectiveness and efficiency inherent in each approach. We find that while Pointwise approaches score high on efficiency, they suffer from poor effectiveness. Conversely, Pairwise approaches demonstrate superior effectiveness but incur high computational overhead. To further enhance the efficiency of LLM-based zero-shot ranking, we propose a novel Setwise prompting approach. Our approach reduces the number of LLM inferences and the amount of prompt token consumption during the ranking procedure, significantly improving the efficiency of LLM-based zero-shot ranking. We test our method using the TREC DL datasets and the BEIR zero-shot document ranking benchmark. The empirical results indicate that our approach considerably reduces computational costs while also retaining high zero-shot ranking effectiveness.
    摘要 大型语言模型(LLM)在零shot文档排序任务中表现出众,Pointwise、Pairwise和Listwise promptingapproaches已经被提出用于LLM基于的零shot排序。我们的研究开始于在一个共同的实验室中仔细评估这些现有的方法,考虑因素如模型大小、token消耗量、延迟时间等。这是第一次对这些方法进行了系统性的比较评估,从而找到每个方法之间的质量和效率之间的交易。我们发现,虽然Pointwise方法具有高效性,但效果不佳。相反,Pairwise方法表现出色,但计算开销很高。为了进一步提高LLM基于的零shot排序的效率,我们提议了一种新的Setwise promptingapproach。我们的方法可以减少LLM的推理数量和提示Token的消耗量,从而显著提高LLM基于的零shot排序的效率。我们使用TREC DL数据集和BEIR零shot文档排序标准套件进行测试,实验结果表明,我们的方法可以减少计算成本,同时保持高的零shot排序效果。
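A minimal plain-Python sketch of a setwise selection step as described in the abstract: the LLM is asked once to pick the most relevant passage from a small set, and repeated knockout rounds produce a top-k ranking with far fewer calls than exhaustive pairwise comparisons. The `llm(prompt) -> str` callable and the prompt wording are assumptions; the actual paper's prompts and aggregation may differ.

```python
# Hypothetical setwise zero-shot ranking. `llm` is a placeholder for any model API.
def pick_best(llm, query, docs):
    labels = "ABCDEFGHIJ"[:len(docs)]
    listing = "\n".join(f"({l}) {d}" for l, d in zip(labels, docs))
    prompt = (f"Query: {query}\n{listing}\n"
              f"Which passage is most relevant to the query? Answer with one letter.")
    answer = llm(prompt).strip().upper()
    return labels.index(answer[0]) if answer and answer[0] in labels else 0

def setwise_rank(llm, query, docs, top_k=3, set_size=4):
    remaining = list(range(len(docs)))
    ranking = []
    while remaining and len(ranking) < top_k:
        candidates = remaining
        # Knockout rounds over small sets instead of O(n^2) pairwise comparisons.
        while len(candidates) > 1:
            winners = []
            for i in range(0, len(candidates), set_size):
                group = candidates[i:i + set_size]
                winners.append(group[pick_best(llm, query, [docs[j] for j in group])])
            candidates = winners
        ranking.append(candidates[0])
        remaining.remove(candidates[0])
    return ranking
```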

Mirage: Model-Agnostic Graph Distillation for Graph Classification

  • paper_url: http://arxiv.org/abs/2310.09486
  • repo_url: None
  • paper_authors: Mridul Gupta, Sahil Manchanda, Hariprasad Kodamana, Sayan Ranu
  • for: 这篇论文目标是将图神经网络(GNNs)在大量数据上进行训练,同时减少计算和数据资源的需求。
  • methods: 论文提出了一种名为Mirage的蒸馏算法,它直接压缩计算数据本身,生成一个简洁的蒸馏摘要,而不是在原始训练集上模拟梯度流。
  • results: 论文表明,Mirage在泛化精度、数据压缩和蒸馏效率方面优于最先进的基线方法,并且是一种无监督、与模型架构无关的蒸馏算法。
    Abstract GNNs, like other deep learning models, are data and computation hungry. There is a pressing need to scale training of GNNs on large datasets to enable their usage on low-resource environments. Graph distillation is an effort in that direction with the aim to construct a smaller synthetic training set from the original training data without significantly compromising model performance. While initial efforts are promising, this work is motivated by two key observations: (1) Existing graph distillation algorithms themselves rely on training with the full dataset, which undermines the very premise of graph distillation. (2) The distillation process is specific to the target GNN architecture and hyper-parameters and thus not robust to changes in the modeling pipeline. We circumvent these limitations by designing a distillation algorithm called Mirage for graph classification. Mirage is built on the insight that a message-passing GNN decomposes the input graph into a multiset of computation trees. Furthermore, the frequency distribution of computation trees is often skewed in nature, enabling us to condense this data into a concise distilled summary. By compressing the computation data itself, as opposed to emulating gradient flows on the original training set-a prevalent approach to date-Mirage transforms into an unsupervised and architecture-agnostic distillation algorithm. Extensive benchmarking on real-world datasets underscores Mirage's superiority, showcasing enhanced generalization accuracy, data compression, and distillation efficiency when compared to state-of-the-art baselines.
    摘要 与其他深度学习模型一样,GNN对数据和算力的需求很大,因此亟需在大规模数据集上扩展GNN训练,使其能够用于低资源环境。图蒸馏(Graph Distillation)正是朝这一方向的尝试,旨在从原始训练数据中构建一个更小的合成训练集,同时尽量不损失模型性能。然而,本工作基于两点观察:(1)现有的图蒸馏算法本身仍需在完整数据集上训练,这与图蒸馏的初衷相悖;(2)蒸馏过程依赖于目标GNN的结构和超参数,对建模流程的改动不够鲁棒。为了绕开这些限制,我们提出了面向图分类的蒸馏算法Mirage。该算法基于这样一个洞察:消息传递GNN会把输入图分解为一个由计算树构成的多重集,而这些计算树的频率分布通常高度偏斜,因此可以被压缩成一个简洁的蒸馏摘要。与以往模拟原始训练集上梯度流的主流做法不同,Mirage直接压缩计算数据本身,从而成为一种无监督、与模型架构无关的蒸馏算法。在真实数据集上的大量基准测试表明,Mirage在泛化精度、数据压缩率和蒸馏效率方面均优于最先进的基线。
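A minimal plain-Python sketch of the computation-tree statistics Mirage is built on, per the abstract: each node's depth-limited message-passing neighborhood is serialized canonically and counted, and the most frequent trees per class are kept as the distilled summary. The serialization scheme, depth, and cut-off are simplifying assumptions.

```python
# Hypothetical extraction of computation-tree frequencies for graph distillation.
from collections import Counter

def computation_tree(adj, labels, node, depth):
    """Canonical string for the depth-`depth` computation tree rooted at `node`."""
    if depth == 0:
        return str(labels[node])
    children = sorted(computation_tree(adj, labels, nbr, depth - 1)
                      for nbr in adj[node])
    return f"{labels[node]}({','.join(children)})"

def distill(graphs, depth=2, keep=50):
    """graphs: iterable of (adjacency dict, node-label dict, class).
    Returns, per class, the `keep` most frequent computation trees with counts."""
    per_class = {}
    for adj, labels, cls in graphs:
        counts = per_class.setdefault(cls, Counter())
        for node in adj:
            counts[computation_tree(adj, labels, node, depth)] += 1
    return {cls: counts.most_common(keep) for cls, counts in per_class.items()}
```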

Unified High-binding Watermark for Unconditional Image Generation Models

  • paper_url: http://arxiv.org/abs/2310.09479
  • repo_url: None
  • paper_authors: Ruinan Ma, Yu-an Tan, Shangbo Wu, Tian Chen, Yajie Wang, Yuanzhang Li
  • for: 防止AI生成图像模型数据盗用和版权侵犯
  • methods: 使用编码器将水印图像以不可见的方式写入原始AIGC工具的输出图像,再通过对应的解码器提取水印,以检测并验证可疑模型是否窃取了原始AIGC工具的数据
  • results: 实验表明,我们的方法可以在只使用模型输出图像的情况下,几乎达到零假阳性率,并且可以跨多种 UIG 模型进行数据盗用验证,提高方法的实用性。
    Abstract Deep learning techniques have implemented many unconditional image generation (UIG) models, such as GAN, Diffusion model, etc. The extremely realistic images (also known as AI-Generated Content, AIGC for short) produced by these models bring urgent needs for intellectual property protection such as data traceability and copyright certification. An attacker can steal the output images of the target model and use them as part of the training data to train a private surrogate UIG model. The implementation mechanisms of UIG models are diverse and complex, and there is no unified and effective protection and verification method at present. To address these issues, we propose a two-stage unified watermark verification mechanism with high-binding effects for such models. In the first stage, we use an encoder to invisibly write the watermark image into the output images of the original AIGC tool, and reversely extract the watermark image through the corresponding decoder. In the second stage, we design the decoder fine-tuning process, and the fine-tuned decoder can make correct judgments on whether the suspicious model steals the original AIGC tool data. Experiments demonstrate our method can complete the verification work with almost zero false positive rate under the condition of only using the model output images. Moreover, the proposed method can achieve data steal verification across different types of UIG models, which further increases the practicality of the method.
    摘要 深度学习技术已经实现了许多无条件图像生成(UIG)模型,如GAN、扩散模型等。这些模型生成的极其真实的图像(也称为AI生成内容,AIGC)带来了知识产权保护的紧迫需求,如数据追溯和版权证书。一个攻击者可以偷窃目标模型的输出图像,并使其成为私人代理UIG模型的训练数据。现有的实现机制多样化和复杂,无一统的有效保护和验证方法。为解决这些问题,我们提出了一种两阶段统一水印验证机制,具有高绑定效果。在第一阶段,我们使用编码器将水印图像隐身地写入原始AIGC工具的输出图像中,并通过对应的解码器反向提取水印图像。在第二阶段,我们设计了细化过程,并使用细化过的解码器可以正确地判断是否有恶意模型偷窃原始AIGC工具数据。实验表明,我们的方法可以在只使用模型输出图像的情况下完成验证工作,并且几乎没有假阳性结果。此外,我们的方法还可以验证不同类型的UIG模型数据,从而进一步提高方法的实用性。
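As a stand-in for the learned encoder/decoder described in the abstract, the sketch below shows the embed/extract/verify round trip with a trivial least-significant-bit watermark in NumPy. The paper trains neural networks for this; the LSB scheme here is only an illustration of the workflow, and all names are hypothetical.

```python
# Illustrative watermark round trip (LSB embedding) standing in for a learned
# encoder/decoder. Only demonstrates the embed -> release -> verify workflow.
import numpy as np

def embed(image_u8, watermark_bits):
    flat = image_u8.flatten()
    flat[:len(watermark_bits)] = (flat[:len(watermark_bits)] & 0xFE) | watermark_bits
    return flat.reshape(image_u8.shape)

def extract(image_u8, n_bits):
    return image_u8.flatten()[:n_bits] & 1

# wm = np.random.randint(0, 2, 256, dtype=np.uint8)              # watermark pattern
# marked = embed(generated_image.copy(), wm)                      # released AIGC output
# match_rate = (extract(suspect_output, 256) == wm).mean()        # near 1.0 -> data likely reused
```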

HIO-SDF: Hierarchical Incremental Online Signed Distance Fields

  • paper_url: http://arxiv.org/abs/2310.09463
  • repo_url: None
  • paper_authors: Vasileios Vasilopoulos, Suveer Garg, Jinwook Huh, Bhoram Lee, Volkan Isler
  • for: 这篇论文旨在为大型移动机器人工作空间开发一种空间效率高、能编码相关几何细节、且可在线增量更新的表示方法。
  • methods: 该方法使用符号距离场(SDF)表示环境,并采用层次结构,将粗粒度体素网格与神经网络相结合,以实现高效更新并兼顾空间效率。
  • results: 在所有测试场景中,该方法的平均全局SDF误差比最先进的连续表示方法低46%,比与其粗粒度全局SDF网格同分辨率的离散表示低30%。
    Abstract A good representation of a large, complex mobile robot workspace must be space-efficient yet capable of encoding relevant geometric details. When exploring unknown environments, it needs to be updatable incrementally in an online fashion. We introduce HIO-SDF, a new method that represents the environment as a Signed Distance Field (SDF). State of the art representations of SDFs are based on either neural networks or voxel grids. Neural networks are capable of representing the SDF continuously. However, they are hard to update incrementally as neural networks tend to forget previously observed parts of the environment unless an extensive sensor history is stored for training. Voxel-based representations do not have this problem but they are not space-efficient especially in large environments with fine details. HIO-SDF combines the advantages of these representations using a hierarchical approach which employs a coarse voxel grid that captures the observed parts of the environment together with high-resolution local information to train a neural network. HIO-SDF achieves a 46% lower mean global SDF error across all test scenes than a state of the art continuous representation, and a 30% lower error than a discrete representation at the same resolution as our coarse global SDF grid.
    摘要 一个好的大型移动机器工作空间表示应该是效率高而能够包含相关的几何细节。当探索未知环境时,它需要在线更新可能。我们介绍了HIO-SDF,一种新的方法,它表示环境为签名距离场(SDF)。现有的SDF表示方法包括神经网络或VOXEL网格。神经网络可以持续表示SDF,但它们难以在线更新,除非保留了大量的感知历史用于训练。VOXEL网格没有这个问题,但它们在大型环境中不是太空效率。HIO-SDF结合了这些表示方法的优点,使用层次方法,其中使用粗粒度的VOXEL网格捕捉到环境中观察到的部分,并使用高分辨率的地方信息来训练神经网络。HIO-SDF在所有测试场景中的平均全球SDF误差比现有的连续表示方法下降46%,比同等分辨率的粗粒度全球SDF网格下降30%。
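A minimal NumPy sketch of the hierarchical query described in the abstract: a coarse voxel grid supplies a base signed distance and a fine local model (stubbed here) adds a high-resolution correction. The nearest-cell lookup, the additive blending, and the class interface are assumptions; the actual system uses interpolation and a trained network.

```python
# Hypothetical hierarchical SDF query: coarse voxel grid + fine local correction.
import numpy as np

class HierarchicalSDF:
    def __init__(self, coarse_grid, origin, cell_size, fine_model=None):
        self.grid = coarse_grid          # (X, Y, Z) array of signed distances
        self.origin = np.asarray(origin, dtype=float)
        self.cell = float(cell_size)
        self.fine_model = fine_model     # callable: (N, 3) points -> (N,) residuals

    def query(self, points):
        idx = np.round((points - self.origin) / self.cell).astype(int)
        idx = np.clip(idx, 0, np.array(self.grid.shape) - 1)
        coarse = self.grid[idx[:, 0], idx[:, 1], idx[:, 2]]
        if self.fine_model is None:
            return coarse
        return coarse + self.fine_model(points)

# Example with a dummy fine model that predicts no correction:
# sdf = HierarchicalSDF(np.ones((64, 64, 64)), origin=(0, 0, 0), cell_size=0.1,
#                       fine_model=lambda p: np.zeros(len(p)))
# print(sdf.query(np.array([[1.0, 2.0, 3.0]])))
```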

A Framework for Empowering Reinforcement Learning Agents with Causal Analysis: Enhancing Automated Cryptocurrency Trading

  • paper_url: http://arxiv.org/abs/2310.09462
  • repo_url: None
  • paper_authors: Rasoul Amirzadeh, Dhananjay Thiruvady, Asef Nazari, Mong Shan Ee
  • for: 本研究旨在提高人工智能增强交易方法的财务效益,通过开发一个基于强化学习的自动交易系统,以便在随时变化的加密货币市场中实现更高的回报。
  • methods: 我们提出了一个名为CausalReinforceNet的框架,用于支持决策系统。该框架通过 causal 分析增强了强化学习代理的能力。在 feature 工程过程中,我们使用 bayesian 网络来确定最有关系的特征,以便影响加密货币价格的变化。此外,我们还在决策过程中添加了 probabilistic 价格方向信号,以提高我们的强化学习代理的决策能力。由于加密货币市场的高投机性,我们设计了一种保守的方法,限制卖出和买入的位置大小,以管理风险。
  • results: 我们的框架在比较以 Buy-and-Hold 策略为准的情况下,有显著的财务效益。此外,我们开发了两个基于 CausalReinforceNet 框架的强化学习代理,其中一个基于 Q-learning 算法,另一个基于 deep Q-learning 算法。两个代理在 Binance Coin 和 Ethereum 等加密货币上都实现了显著的回报。
    Abstract Despite advances in artificial intelligence-enhanced trading methods, developing a profitable automated trading system remains challenging in the rapidly evolving cryptocurrency market. This study aims to address these challenges by developing a reinforcement learning-based automated trading system for five popular altcoins~(cryptocurrencies other than Bitcoin): Binance Coin, Ethereum, Litecoin, Ripple, and Tether. To this end, we present CausalReinforceNet, a framework framed as a decision support system. Designed as the foundational architecture of the trading system, the CausalReinforceNet framework enhances the capabilities of the reinforcement learning agent through causal analysis. Within this framework, we use Bayesian networks in the feature engineering process to identify the most relevant features with causal relationships that influence cryptocurrency price movements. Additionally, we incorporate probabilistic price direction signals from dynamic Bayesian networks to enhance our reinforcement learning agent's decision-making. Due to the high volatility of the cryptocurrency market, we design our framework to adopt a conservative approach that limits sell and buy position sizes to manage risk. We develop two agents using the CausalReinforceNet framework, each based on distinct reinforcement learning algorithms. The results indicate that our framework substantially surpasses the Buy-and-Hold benchmark strategy in profitability. Additionally, both agents generated notable returns on investment for Binance Coin and Ethereum.
    摘要 尽管人工智能增强的交易方法已有进展,但在快速演变的加密货币市场中构建可盈利的自动交易系统仍然是一项挑战。本研究旨在解决这些挑战,开发一个基于强化学习的自动交易系统,用于五种流行的altcoin(非比特币的加密货币):Binance Coin、Ethereum、Litecoin、Ripple和Tether。为此,我们提出了CausalReinforceNet框架,它是一个决策支持系统,通过因果分析增强强化学习智能体的能力。在该框架中,我们在特征工程阶段使用贝叶斯网络,识别与加密货币价格变动存在因果关系的最相关特征;此外,我们还将动态贝叶斯网络给出的概率性价格方向信号纳入强化学习智能体的决策。鉴于加密货币市场的高波动性,我们设计了一种保守策略,限制买入和卖出的仓位规模以管理风险。我们基于CausalReinforceNet框架开发了两个智能体,分别采用不同的强化学习算法。结果表明,该框架在盈利能力上大幅超越买入并持有基准策略,且两个智能体在Binance Coin和Ethereum上均取得了显著回报。

Metacognitive threshold: a computational account

  • paper_url: http://arxiv.org/abs/2310.13005
  • repo_url: None
  • paper_authors: Brendan Conway-Smith, Robert L. West
  • for: 本研究旨在计算地考虑认知门槛(认知状态能够被识别的最小刺激量),并讨论可能影响认知门槛的认知训练和冥想。
  • methods: 本研究使用计算方法计算认知门槛,并采用认知训练和冥想来调整认知门槛。
  • results: 研究发现,通过认知训练和冥想,可以影响认知门槛,从而提高认知能力。
    Abstract This paper will explore ways of computationally accounting for the metacognitive threshold -- the minimum amount of stimulus needed for a mental state to be perceived -- and discuss potential cognitive mechanisms by which this threshold can be influenced through metacognitive training and meditation.
    摘要 本文将探讨如何从计算角度刻画元认知阈值(即一种心理状态能够被觉察所需的最小刺激量),并讨论元认知训练与冥想可能借以影响该阈值的认知机制。

Large Language Model Unlearning

  • paper_url: http://arxiv.org/abs/2310.10683
  • repo_url: https://github.com/kevinyaobytedance/llm_unlearn
  • paper_authors: Yuanshun Yao, Xiaojun Xu, Yang Liu
  • for: 本研究旨在探讨如何使用语言模型(LLM)进行忘卷(unlearning),即忘记不良(mis)行为。
  • methods: 本研究使用了忘卷技术,包括使用负例(negative examples)来对LML进行Alignment。
  • results: 研究显示,使用忘卷技术可以更有效地对LML进行Alignment,只需要负例来调整LML的行为。此外,忘卷技术还具有计算效率和可靠性的优点。
    Abstract We study how to perform unlearning, i.e. forgetting undesirable (mis)behaviors, on large language models (LLMs). We show at least three scenarios of aligning LLMs with human preferences can benefit from unlearning: (1) removing harmful responses, (2) erasing copyright-protected content as requested, and (3) eliminating hallucinations. Unlearning, as an alignment technique, has three advantages. (1) It only requires negative (e.g. harmful) examples, which are much easier and cheaper to collect (e.g. via red teaming or user reporting) than positive (e.g. helpful and often human-written) examples required in RLHF (RL from human feedback). (2) It is computationally efficient. (3) It is especially effective when we know which training samples cause the misbehavior. To the best of our knowledge, our work is among the first to explore LLM unlearning. We are also among the first to formulate the settings, goals, and evaluations in LLM unlearning. We show that if practitioners only have limited resources, and therefore the priority is to stop generating undesirable outputs rather than to try to generate desirable outputs, unlearning is particularly appealing. Despite only having negative samples, our ablation study shows that unlearning can still achieve better alignment performance than RLHF with just 2% of its computational time.
    摘要 我们研究如何在大语言模型(LLM)上执行"忘却"(unlearning),即忘记不良(错误)行为。我们展示了至少三种可从忘却中获益、使LLM与人类偏好对齐的场景:(1)移除有害回复,(2)按要求删除受版权保护的内容,(3)消除幻觉。作为一种对齐技术,忘却具有三个优势:(1)只需要负面(如有害)示例,这类示例比RLHF(基于人类反馈的强化学习)所需的正面(有用且通常由人工撰写的)示例更容易、更廉价地收集(例如通过红队测试或用户举报);(2)计算效率高;(3)当我们知道哪些训练样本导致不良行为时尤其有效。据我们所知,我们的工作是最早探索LLM忘却的研究之一,也率先给出了LLM忘却的设定、目标和评估方式。我们发现,当从业者资源有限、优先目标是停止生成不良输出而非生成理想输出时,忘却尤其有吸引力。尽管只有负面样本,我们的消融研究显示,忘却仅用RLHF约2%的计算时间仍可取得比其更好的对齐性能。
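A minimal PyTorch sketch of one widely used unlearning recipe consistent with the abstract's setting (negative examples only, low compute): gradient ascent on the forget set plus a KL term against a frozen reference model to preserve behavior on normal text. The exact objective in the paper may differ; `model`, `ref_model`, and the HuggingFace-style batch format (dict of tensors for a causal LM) are assumptions.

```python
# Hypothetical unlearning step: ascend the loss on harmful ("forget") examples
# while a KL term keeps the distribution close to a frozen reference model on
# retained (normal) data.
import torch
import torch.nn.functional as F

def unlearning_step(model, ref_model, forget_batch, retain_batch, kl_weight=1.0):
    # Gradient ascent on the forget set (note the minus sign).
    forget_out = model(**forget_batch, labels=forget_batch["input_ids"])
    loss = -forget_out.loss

    # Preserve behavior on retained data via KL to the reference model.
    logits = model(**retain_batch).logits
    with torch.no_grad():
        ref_logits = ref_model(**retain_batch).logits
    kl = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.softmax(ref_logits, dim=-1), reduction="batchmean")
    return loss + kl_weight * kl
```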

LgTS: Dynamic Task Sampling using LLM-generated sub-goals for Reinforcement Learning Agents

  • paper_url: http://arxiv.org/abs/2310.09454
  • repo_url: https://github.com/shukla-yash/LgTS-LLM-guided-Dynamic-Task-Sampling-for-RL-agents
  • paper_authors: Yash Shukla, Wenchang Gao, Vasanth Sarathy, Alvaro Velasquez, Robert Wright, Jivko Sinapov
  • for: 这篇论文旨在探讨大语言模型(LLM)在机器人与智能体问题中的规划能力,以及如何利用LLM帮助强化学习(RL)智能体学习与行动。
  • methods: 论文提出了一种名为LgTS(LLM引导的教师-学生学习)的新方法,利用LLM生成通往目标状态的子目标的图表示,并通过教师-学生学习算法让RL智能体学习从起始状态到目标状态的一组策略,同时尽量减少与环境的交互次数。与以往利用LLM的方法不同,该方法既不要求访问专有或经过微调的LLM,也不需要已能完成LLM所提子目标的预训练策略。
  • results: 在基于DoorKey的网格世界和受搜救任务启发的环境上的实验表明,LLM生成的子目标图表示有助于RL智能体学习这些子目标,且在转移动态未知时,教师-学生学习算法能够减少与环境的交互次数。
    Abstract Recent advancements in reasoning abilities of Large Language Models (LLM) has promoted their usage in problems that require high-level planning for robots and artificial agents. However, current techniques that utilize LLMs for such planning tasks make certain key assumptions such as, access to datasets that permit finetuning, meticulously engineered prompts that only provide relevant and essential information to the LLM, and most importantly, a deterministic approach to allow execution of the LLM responses either in the form of existing policies or plan operators. In this work, we propose LgTS (LLM-guided Teacher-Student learning), a novel approach that explores the planning abilities of LLMs to provide a graphical representation of the sub-goals to a reinforcement learning (RL) agent that does not have access to the transition dynamics of the environment. The RL agent uses Teacher-Student learning algorithm to learn a set of successful policies for reaching the goal state from the start state while simultaneously minimizing the number of environmental interactions. Unlike previous methods that utilize LLMs, our approach does not assume access to a propreitary or a fine-tuned LLM, nor does it require pre-trained policies that achieve the sub-goals proposed by the LLM. Through experiments on a gridworld based DoorKey domain and a search-and-rescue inspired domain, we show that generating a graphical structure of sub-goals helps in learning policies for the LLM proposed sub-goals and the Teacher-Student learning algorithm minimizes the number of environment interactions when the transition dynamics are unknown.

cs.CL - 2023-10-14

Beyond Testers’ Biases: Guiding Model Testing with Knowledge Bases using LLMs

  • paper_url: http://arxiv.org/abs/2310.09668
  • repo_url: None
  • paper_authors: Chenyang Yang, Rishabh Rustogi, Rachel Brower-Sinning, Grace A. Lewis, Christian Kästner, Tongshuang Wu
  • for: 这篇论文主要是为了提供一种用于模型测试的工具,以帮助测试人员更好地识别需要测试的方面。
  • methods: 这篇论文使用了大量自然语言处理技术,包括生成知识库和互动式推荐,以帮助测试人员系统地探索不同的概念。
  • results: 在用户研究中,测试人员使用Weaver工具时能够更好地识别模型需要测试的方面,并发现了大量的失败测试案例。此外,Weaver还可以帮助实践者在真实的应用场景中测试模型,例如代码理解和对话简要摘要。
    Abstract Current model testing work has mostly focused on creating test cases. Identifying what to test is a step that is largely ignored and poorly supported. We propose Weaver, an interactive tool that supports requirements elicitation for guiding model testing. Weaver uses large language models to generate knowledge bases and recommends concepts from them interactively, allowing testers to elicit requirements for further testing. Weaver provides rich external knowledge to testers and encourages testers to systematically explore diverse concepts beyond their own biases. In a user study, we show that both NLP experts and non-experts identified more, as well as more diverse concepts worth testing when using Weaver. Collectively, they found more than 200 failing test cases for stance detection with zero-shot ChatGPT. Our case studies further show that Weaver can help practitioners test models in real-world settings, where developers define more nuanced application scenarios (e.g., code understanding and transcript summarization) using LLMs.
    摘要 当前的模型测试工作主要集中在创建测试用例上,而确定要测试什么这一步在很大程度上被忽视且缺乏支持。我们提出了 Weaver,一个支持需求获取、用于引导模型测试的交互式工具。Weaver 使用大语言模型生成知识库,并以交互方式推荐其中的概念,让测试人员能够据此获取进一步的测试需求。Weaver 为测试人员提供了丰富的外部知识,并鼓励他们系统地探索多种概念,超越自身的偏见。在用户研究中,我们发现 NLP 专家和非专家在使用 Weaver 时都能识别出更多且更多样的值得测试的概念。总的来说,他们为零样本 ChatGPT 的立场检测任务找到了200多个失败测试用例。我们的案例研究还表明,Weaver 能够帮助实践者在真实场景中测试模型,例如开发者借助 LLM 定义的更细致的应用场景(如代码理解和转写文本摘要)。

Legend at ArAIEval Shared Task: Persuasion Technique Detection using a Language-Agnostic Text Representation Model

  • paper_url: http://arxiv.org/abs/2310.09661
  • repo_url: None
  • paper_authors: Olumide E. Ojo, Olaronke O. Adebanji, Hiram Calvo, Damian O. Dieke, Olumuyiwa E. Ojo, Seye E. Akinsanya, Tolulope O. Abiola, Anna Feldman
  • for: 本研究的目的是参加2023年阿拉伯语言处理会议(ArabicNLP)的阿拉伯语AI任务评估挑战(ArAIEval),特别是任务1,即从推文和新闻文章中识别吸引人的技巧。
  • methods: 该研究使用了XLM-RoBERTa语言无关文本表示模型进行训练循环,并进行细化的多语言模型微调。
  • results: 在测试集评估中,我们的微调后的多语言模型在任务1下的子任务A中取得了0.64的微 F1分数。
    Abstract In this paper, we share our best performing submission to the Arabic AI Tasks Evaluation Challenge (ArAIEval) at ArabicNLP 2023. Our focus was on Task 1, which involves identifying persuasion techniques in excerpts from tweets and news articles. The persuasion technique in Arabic texts was detected using a training loop with XLM-RoBERTa, a language-agnostic text representation model. This approach proved to be potent, leveraging fine-tuning of a multilingual language model. In our evaluation of the test set, we achieved a micro F1 score of 0.64 for subtask A of the competition.
    摘要 在这篇论文中,我们分享我们在“阿拉伯语言处理(ArabicNLP)2023”年度的“阿拉伯AI任务评估比赛”(ArAIEval)中的最佳提交。我们的关注点是任务1,即在推文和新闻文章中Identify persuasion techniques。在阿拉伯文本中探测了使用XLM-RoBERTa语言无关文本表示模型的训练循环。这种方法证明了其高效,通过细化多语言模型的 fine-tuning。在我们对测试集进行评估时,我们在subtask A中获得了0.64的微 F1分数。

An End-to-End System for Reproducibility Assessment of Source Code Repositories via Their Readmes

  • paper_url: http://arxiv.org/abs/2310.09634
  • repo_url: https://github.com/kaanakdeniz/reproducibility_assessment
  • paper_authors: Eyüp Kaan Akdeniz, Selma Tekir, Malik Nizar Asad Al Hinnawi
  • for: 支持机器学习研究的可重复性评估
  • methods: 使用约定模板和自定义函数对Readme文件进行检查,并使用层次转移模型为Readme文件分类
  • results: 系统可以准确地评估Readme文件的可重复性,并且可以提供可解释的分数。同时,section similarity-based系统比层次转移模型 performs better。
    Abstract Increased reproducibility of machine learning research has been a driving force for dramatic improvements in learning performances. The scientific community further fosters this effort by including reproducibility ratings in reviewer forms and considering them as a crucial factor for the overall evaluation of papers. Accompanying source code is not sufficient to make a work reproducible. The shared codes should meet the ML reproducibility checklist as well. This work aims to support reproducibility evaluations of papers with source codes. We propose an end-to-end system that operates on the Readme file of the source code repositories. The system checks the compliance of a given Readme to a template proposed by a widely used platform for sharing source codes of research. Our system generates scores based on a custom function to combine section scores. We also train a hierarchical transformer model to assign a class label to a given Readme. The experimental results show that the section similarity-based system performs better than the hierarchical transformer. Moreover, it has an advantage regarding explainability since one can directly relate the score to the sections of Readme files.
    摘要 增加机器学习研究的可重现性是导致学习性能的进步的驱动力。科学社区还进一步推动这一努力,将可重现性评估纳入评审表单中,并视为评审整体评价的关键因素。仅提供源代码不足以使一作品可重现。我们建议一个综合系统,运行在源代码存储库的Readme文件上。该系统根据提案的模板检查源代码的可重现性,并生成分数根据自定义的函数组合分。我们还训练了一个层次转换器模型,将给定的Readme文件分配分类标签。实验结果表明,基于section相似性的系统在可重现性评估中表现更好,并且具有更好的解释性,因为可以直接将分数关联到Readme文件中的section。
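A minimal plain-Python sketch of the section-similarity scoring idea in the abstract: the Readme's headers are matched fuzzily against a template's sections and per-section hits are combined with weights. The template sections, weights, matching cutoff, and the combination rule are illustrative assumptions, not the system's actual custom function.

```python
# Hypothetical section-based reproducibility score for a Readme.
import difflib
import re

TEMPLATE = {"installation": 0.25, "requirements": 0.2, "training": 0.2,
            "evaluation": 0.2, "pre-trained models": 0.15}

def readme_headers(text):
    return [h.strip("# ").lower() for h in re.findall(r"^#+ .*$", text, re.MULTILINE)]

def section_score(readme_text, cutoff=0.6):
    headers = readme_headers(readme_text)
    score = 0.0
    for section, weight in TEMPLATE.items():
        if difflib.get_close_matches(section, headers, n=1, cutoff=cutoff):
            score += weight
    return score  # in [0, 1]; higher means closer compliance with the template

# print(section_score(open("README.md").read()))
```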

A Digital Language Coherence Marker for Monitoring Dementia

  • paper_url: http://arxiv.org/abs/2310.09623
  • repo_url: None
  • paper_authors: Dimitris Gkoumas, Adam Tsakalidis, Maria Liakata
  • for: 这篇论文旨在提出一种新颖、可靠且非侵入式的方法,利用自然语言来诊断和监测痴呆症。
  • methods: 论文提出了一项新任务,即学习叙述中语句的时间逻辑一致性,并研究了多种神经网络方法。
  • results: 研究发现,痴呆症患者与健康对照组的语言一致性存在显著差异,且该指标与临床生物标志物高度相关;此外,该指标还具有推广到其他相关疾病的潜力。
    Abstract The use of spontaneous language to derive appropriate digital markers has become an emergent, promising and non-intrusive method to diagnose and monitor dementia. Here we propose methods to capture language coherence as a cost-effective, human-interpretable digital marker for monitoring cognitive changes in people with dementia. We introduce a novel task to learn the temporal logical consistency of utterances in short transcribed narratives and investigate a range of neural approaches. We compare such language coherence patterns between people with dementia and healthy controls and conduct a longitudinal evaluation against three clinical bio-markers to investigate the reliability of our proposed digital coherence marker. The coherence marker shows a significant difference between people with mild cognitive impairment, those with Alzheimer's Disease and healthy controls. Moreover our analysis shows high association between the coherence marker and the clinical bio-markers as well as generalisability potential to other related conditions.
    摘要 利用自发语言推导合适的数字标记,已成为一种新兴、有前景且非侵入式的痴呆症诊断与监测方法。我们提出以语言一致性作为一种低成本、可由人类解读的数字标记,用于监测痴呆症患者的认知变化。我们引入了一项新任务,用于学习短篇转写叙述中语句的时间逻辑一致性,并研究了多种神经网络方法。我们比较了痴呆症患者与健康对照组之间的语言一致性模式,并针对三种临床生物标志物进行了纵向评估,以考察所提数字一致性标记的可靠性。该一致性标记在轻度认知障碍、阿尔茨海默病患者与健康对照之间表现出显著差异。此外,分析还显示该标记与临床生物标志物高度相关,并具有推广到其他相关疾病的潜力。

An Expression Tree Decoding Strategy for Mathematical Equation Generation

  • paper_url: http://arxiv.org/abs/2310.09619
  • repo_url: None
  • paper_authors: Wenqi Zhang, Yongliang Shen, Qingpeng Nong, Zeqi Tan, Yanna Ma, Weiming Lu
  • for: 该论文主要探讨了如何从自然语言中生成数学公式。
  • methods: 该论文提出了一种基于树结构的表达水平生成方法,通过层次并行解码策略和双分配匹配算法来生成数学公式。
  • results: 实验表明,该方法在生成复杂结构的数学公式方面表现出色,比基eline方法更高效。
    Abstract Generating mathematical equations from natural language requires an accurate understanding of the relations among math expressions. Existing approaches can be broadly categorized into token-level and expression-level generation. The former treats equations as a mathematical language, sequentially generating math tokens. Expression-level methods generate each expression one by one. However, each expression represents a solving step, and there naturally exist parallel or dependent relations between these steps, which are ignored by current sequential methods. Therefore, we integrate tree structure into the expression-level generation and advocate an expression tree decoding strategy. To generate a tree with expression as its node, we employ a layer-wise parallel decoding strategy: we decode multiple independent expressions (leaf nodes) in parallel at each layer and repeat parallel decoding layer by layer to sequentially generate these parent node expressions that depend on others. Besides, a bipartite matching algorithm is adopted to align multiple predictions with annotations for each layer. Experiments show our method outperforms other baselines, especially for these equations with complex structures.
    摘要 <>输入文本中的数学公式生成需要精准地理解数学表达之间的关系。现有的方法可以分为token级和表达级两类。前者视数学公式为一种语言,顺序生成数学token。表达级方法每个表达都代表一步解题,但是这些步骤之间存在并行或依赖关系,现有的顺序方法忽略了这些关系。因此,我们将树结构integrated到表达级生成中,并提出了表达树解码策略。为生成一棵表达树,我们采用层 wise并行解码策略:在每层解码多个独立的表达(叶节点)并重复层 wise并行解码来生成依赖于别的父节点表达。此外,我们采用了一种两个分配算法来对多个预测与注释进行对应。实验表明,我们的方法在Equation with complex structures上比基eline方法表现更好。
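The abstract describes aligning the multiple expressions decoded in parallel at each layer with the annotated solving steps via bipartite matching. A minimal sketch using the Hungarian algorithm from SciPy follows; the pairwise cost (1 minus token overlap) is an assumption standing in for whatever matching cost the model actually uses.

```python
# Hypothetical alignment of parallel-decoded expressions with gold annotations.
import numpy as np
from scipy.optimize import linear_sum_assignment

def token_overlap(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def match_expressions(predicted, gold):
    cost = np.array([[1.0 - token_overlap(p, g) for g in gold] for p in predicted])
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))   # (pred_idx, gold_idx) pairs

# Example: leaf expressions decoded in parallel vs. annotated solving steps.
# match_expressions(["x = 3 + 4", "y = x * 2"], ["y = x * 2", "x = 3 + 4"])
# -> [(0, 1), (1, 0)]
```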

Moral consensus and divergence in partisan language use

  • paper_url: http://arxiv.org/abs/2310.09618
  • repo_url: None
  • paper_authors: Nakwon Rim, Marc G. Berman, Yuan Chang Leong
  • methods: 这篇论文使用了大规模的真实语言数据,包括Reddit社区的294,476,146条评论和新闻媒体的6,749,781篇文章,利用词嵌入模型捕捉词语之间的语义关联,并围绕7个政治话题(如堕胎、移民)进行了研究。
  • results: 研究发现,尽管政治光谱两端共享基本的道德认知,保守派与自由派文本来源在词语的道德关联上仍存在一致的差异,可据此以超过85%的分类准确率区分不同政治倾向的文本来源。这些结果表明党派语言差异广泛存在,并可能加剧政治极化。
    Abstract Polarization has increased substantially in political discourse, contributing to a widening partisan divide. In this paper, we analyzed large-scale, real-world language use in Reddit communities (294,476,146 comments) and in news outlets (6,749,781 articles) to uncover psychological dimensions along which partisan language is divided. Using word embedding models that captured semantic associations based on co-occurrences of words in vast textual corpora, we identified patterns of affective polarization present in natural political discourse. We then probed the semantic associations of words related to seven political topics (e.g., abortion, immigration) along the dimensions of morality (moral-to-immoral), threat (threatening-to-safe), and valence (pleasant-to-unpleasant). Across both Reddit communities and news outlets, we identified a small but systematic divergence in the moral associations of words between text sources with different partisan leanings. Moral associations of words were highly correlated between conservative and liberal text sources (average $\rho$ = 0.96), but the differences remained reliable to enable us to distinguish text sources along partisan lines with above 85% classification accuracy. These findings underscore that despite a shared moral understanding across the political spectrum, there are consistent differences that shape partisan language and potentially exacerbate political polarization. Our results, drawn from both informal interactions on social media and curated narratives in news outlets, indicate that these trends are widespread. Leveraging advanced computational techniques, this research offers a fresh perspective that complements traditional methods in political attitudes.
    摘要 政治话语中的极化现象大幅加剧,加深了党派之间的分歧。本文分析了Reddit社区(294,476,146条评论)和新闻媒体(6,749,781篇文章)中的大规模真实语言使用,以揭示党派语言分化所沿的心理维度。我们使用基于海量文本语料中词语共现关系的词嵌入模型捕捉语义关联,识别出自然政治话语中存在的情感极化模式。随后,我们沿道德(道德-不道德)、威胁(威胁-安全)和效价(愉悦-不愉悦)三个维度,考察了与七个政治话题(如堕胎、移民)相关词语的语义关联。在Reddit社区和新闻媒体中,我们均发现不同党派倾向的文本来源之间在词语道德关联上存在虽小但系统性的差异。保守派与自由派文本来源的词语道德关联高度相关(平均ρ=0.96),但差异依然足够稳定,使我们能以超过85%的分类准确率区分文本来源的党派倾向。这些发现表明,尽管政治光谱两端共享道德认知,但仍存在塑造党派语言并可能加剧政治极化的系统性差异。我们的结果同时来自社交媒体上的非正式互动和新闻媒体中经过编辑的叙事,表明这些趋势十分普遍。借助先进的计算技术,本研究为传统的政治态度研究提供了互补的新视角。
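A minimal NumPy sketch of the axis-projection idea used in the study: an axis (e.g., moral-to-immoral) is the difference between the mean vectors of two seed-word poles, and a word's association is its cosine similarity with that axis. The seed words and the `emb` lookup (e.g., a fitted word2vec model's vectors) are assumptions.

```python
# Hypothetical projection of topic words onto a moral-to-immoral embedding axis.
import numpy as np

MORAL = ["virtuous", "honest", "good", "ethical"]
IMMORAL = ["corrupt", "dishonest", "evil", "unethical"]

def axis(emb, pos_words, neg_words):
    pos = np.mean([emb[w] for w in pos_words], axis=0)
    neg = np.mean([emb[w] for w in neg_words], axis=0)
    return pos - neg

def association(emb, word, axis_vec):
    v = emb[word]
    return float(np.dot(v, axis_vec) / (np.linalg.norm(v) * np.linalg.norm(axis_vec)))

# moral_axis = axis(emb, MORAL, IMMORAL)
# for w in ["abortion", "immigration"]:
#     print(w, association(emb, w, moral_axis))
```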

RethinkingTMSC: An Empirical Study for Target-Oriented Multimodal Sentiment Classification

  • paper_url: http://arxiv.org/abs/2310.09596
  • repo_url: https://github.com/junjie-ye/rethinkingtmsc
  • paper_authors: Junjie Ye, Jie Zhou, Junfeng Tian, Rui Wang, Qi Zhang, Tao Gui, Xuanjing Huang
  • for: 本研究的目的是调查目标受众情感分类 Task 中 modalities 的重要性和 multimodal fusion module 的效果,以及现有数据集是否能够支持研究。
  • methods: 本研究使用了广泛的实验和深入的分析来回答以下问题:Q1:modalities 在 TMSC 中的重要性是否相同?Q2:哪些 multimodal fusion module 更加有效?Q3:现有的数据集能否支持研究?
  • results: 实验和分析显示,目前的 TMSC 系统主要依靠文本模式来决定目标受众情感,因此我们指出了一些改进 TMSC 任务的方向,包括模型设计和数据集构建。
    Abstract Recently, Target-oriented Multimodal Sentiment Classification (TMSC) has gained significant attention among scholars. However, current multimodal models have reached a performance bottleneck. To investigate the causes of this problem, we perform extensive empirical evaluation and in-depth analysis of the datasets to answer the following questions: Q1: Are the modalities equally important for TMSC? Q2: Which multimodal fusion modules are more effective? Q3: Do existing datasets adequately support the research? Our experiments and analyses reveal that the current TMSC systems primarily rely on the textual modality, as most of targets' sentiments can be determined solely by text. Consequently, we point out several directions to work on for the TMSC task in terms of model design and dataset construction. The code and data can be found in https://github.com/Junjie-Ye/RethinkingTMSC.
    摘要 近期,面向目标的多模态情感分类(TMSC)受到了学界的广泛关注。然而,当前的多模态模型已经达到了性能瓶颈。为了调查这个问题的原因,我们进行了大量实验和深入的数据集分析,以回答以下问题:Q1:各个模态在 TMSC 中是否同等重要?Q2:哪些多模态融合模块更有效?Q3:现有的数据集是否足以支撑 TMSC 研究?我们的实验和分析表明,目前的 TMSC 系统主要依赖文本模态,因为大多数目标的情感都可以仅凭文本确定。因此,我们提出了一些关于 TMSC 任务在模型设计和数据集构建方面的改进方向。代码和数据可在 https://github.com/Junjie-Ye/RethinkingTMSC 获取。

Self-Detoxifying Language Models via Toxification Reversal

  • paper_url: http://arxiv.org/abs/2310.09573
  • repo_url: https://github.com/cooperleong00/toxificationreversal
  • paper_authors: Chak Tou Leong, Yi Cheng, Jiashuo Wang, Jian Wang, Wenjie Li
  • for: 降低预训练语言模型(PLM)中生成危险或伤害性内容的风险,以便更安全地部署。
  • methods: 我们提出了一种较轻量级的方法,即让PLM本身实现”自我抹黑”。我们的方法基于 prepending a negative steering prompt 可以让 PLM 生成恶意内容。同时,我们受到了最近的解释研究中的研究,即通过注意层来实现 PLM 内部的Contextualized Representations 的演化。在这基础之上,我们设计了一种方法,可以从 normal generation process 中提取恶意方向,然后通过 manipulate 注意层内的信息流来驱动生成 towards the reversed direction。
  • results: 我们的方法,不需要任何 fine-tuning 或额外组件,可以 achieved comparable performance with state-of-the-art methods。
    Abstract Language model detoxification aims to minimize the risk of generating offensive or harmful content in pretrained language models (PLMs) for safer deployment. Existing methods can be roughly categorized as finetuning-based and decoding-based. However, the former is often resource-intensive, while the latter relies on additional components and potentially compromises the generation fluency. In this paper, we propose a more lightweight approach that enables the PLM itself to achieve "self-detoxification". Our method is built upon the observation that prepending a negative steering prompt can effectively induce PLMs to generate toxic content. At the same time, we are inspired by the recent research in the interpretability field, which formulates the evolving contextualized representations within the PLM as an information stream facilitated by the attention layers. Drawing on this idea, we devise a method to identify the toxification direction from the normal generation process to the one prompted with the negative prefix, and then steer the generation to the reversed direction by manipulating the information movement within the attention layers. Experimental results show that our approach, without any fine-tuning or extra components, can achieve comparable performance with state-of-the-art methods.
    摘要 语言模型去毒旨在降低预训练语言模型(PLM)生成冒犯性或有害内容的风险,从而实现更安全的部署。现有方法大致可分为基于微调和基于解码两类:前者往往需要大量资源,后者则依赖额外组件,并可能损害生成流畅性。本文提出了一种更轻量级的方法,让 PLM 自身实现"自我去毒"。该方法基于这样一个观察:在输入前添加负面引导提示能够有效诱导 PLM 生成有毒内容。同时,受可解释性研究的启发,我们将 PLM 内部不断演化的上下文化表示视作由注意力层承载的信息流。基于这一思想,我们设计了一种方法,从正常生成过程与添加负面前缀后的生成过程之间识别出"毒化方向",再通过操控注意力层内的信息流动,将生成引向相反方向。实验结果表明,我们的方法无需任何微调或额外组件,即可取得与最先进方法相当的性能。

Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE

  • paper_url: http://arxiv.org/abs/2310.09550
  • repo_url: None
  • paper_authors: Yixuan Zhang, Haonan Li
  • for: 评估大型自然语言处理模型(LLMs)在理解古代中文方面的能力。
  • methods: 使用ACLUE评价指标集,评测8种现代最佳语言模型在古代中文和现代中文之间的表现差异。
  • results: through 评测,发现其表现最佳的是ChatGLM2,得分为37.4%。
    Abstract Large language models (LLMs) have showcased remarkable capabilities in understanding and generating language. However, their ability in comprehending ancient languages, particularly ancient Chinese, remains largely unexplored. To bridge this gap, we present ACLUE, an evaluation benchmark designed to assess the capability of language models in comprehending ancient Chinese. ACLUE consists of 15 tasks cover a range of skills, spanning phonetic, lexical, syntactic, semantic, inference and knowledge. Through the evaluation of eight state-of-the-art LLMs, we observed a noticeable disparity in their performance between modern Chinese and ancient Chinese. Among the assessed models, ChatGLM2 demonstrates the most remarkable performance, achieving an average score of 37.4%. We have made our code and data public available.
    摘要 大型语言模型(LLM)在理解和生成语言方面表现出了很好的能力,但它们对古代中文理解的能力仍然很少被探索。为了填补这一空白,我们提出了ACLUE评价指标,用于评估语言模型对古代中文的理解能力。ACLUE包括15个任务,覆盖了各种技能,包括声学、词汇、语法、 semantics、推理和知识。通过评估8种当今最先进的LLM,我们发现了这些模型对现代中文和古代中文的表现存在显著差异。其中,ChatGLM2表现最出色,其平均分为37.4%。我们已经将代码和数据公开。

CarExpert: Leveraging Large Language Models for In-Car Conversational Question Answering

  • paper_url: http://arxiv.org/abs/2310.09536
  • repo_url: None
  • paper_authors: Md Rashad Al Hasan Rony, Christian Suess, Sinchana Ramakanth Bhat, Viju Sudhi, Julia Schneider, Maximilian Vogel, Roman Teucher, Ken E. Friedl, Soumya Sahoo
  • for: 本研究旨在提高大语言模型(LLM)在领域特定问答中的表现,并解决现有LLM在此类任务中的局限。
  • methods: 本研究提出了一种名为CarExpert的车载对话式问答系统,该系统利用LLM控制输入,向抽取式和生成式回答组件提供领域相关文档,并控制输出以确保回答安全且贴合领域。
  • results: 对比最先进的LLM,CarExpert在生成自然、安全且与汽车领域相关的回答方面表现出色。
    Abstract Large language models (LLMs) have demonstrated remarkable performance by following natural language instructions without fine-tuning them on domain-specific tasks and data. However, leveraging LLMs for domain-specific question answering suffers from severe limitations. The generated answer tends to hallucinate due to the training data collection time (when using off-the-shelf), complex user utterance and wrong retrieval (in retrieval-augmented generation). Furthermore, due to the lack of awareness about the domain and expected output, such LLMs may generate unexpected and unsafe answers that are not tailored to the target domain. In this paper, we propose CarExpert, an in-car retrieval-augmented conversational question-answering system leveraging LLMs for different tasks. Specifically, CarExpert employs LLMs to control the input, provide domain-specific documents to the extractive and generative answering components, and controls the output to ensure safe and domain-specific answers. A comprehensive empirical evaluation exhibits that CarExpert outperforms state-of-the-art LLMs in generating natural, safe and car-specific answers.
    摘要 大型自然语言模型(LLM)已经展现出了很好的性能,可以按照自然语言指令进行不需要精化的域pecific任务和数据上的执行。然而,使用LLM进行域pecific问答会遇到严重的限制。生成的答案往往会偏差,这是因为使用的数据采集时间(当用off-the-shelf)、复杂的用户语音和错误的检索(在检索增强生成中)。此外,由于缺乏域和期望输出的认知,这些LLM可能会生成不适用于目标域的不适用和不安全的答案。在这篇论文中,我们提出了CarExpert,一个基于LLM的在车辆内进行检索增强对话问答系统。具体来说,CarExpert使用LLM来控制输入、提供域pecific文档给抽取和生成答案组件,并控制输出,以确保安全和适用的答案。一项全面的实验证明了CarExpert在生成自然、安全和车辆特有的答案方面表现出优于状态之前的LLM。

Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model

  • paper_url: http://arxiv.org/abs/2310.09520
  • repo_url: https://github.com/haikangdeng/RAD
  • paper_authors: Haikang Deng, Colin Raffel
  • for: 这 paper 是为了提高语言模型生成文本的质量和特性。
  • methods: 这 paper 使用了 Reward-Augmented Decoding (RAD) 方法,这是一种基于小型单向奖励模型的文本生成方法,可以让语言模型生成文本有特定属性。RAD 使用奖励模型来评分生成过程中的每个步骤,并根据奖励值调整抽样概率,以增加高奖励的token。
  • results: 通过对生成非攻击性和情感控制文本进行实验,这 paper 表明 RAD 在不改变生成过程的情况下,可以达到最佳性和state-of-the-art方法的性能水平,并且可以在非常大的语言模型中实现最佳性,而且 computation overhead 很小。
    Abstract While large language models have proven effective in a huge range of downstream applications, they often generate text that is problematic or lacks a desired attribute. In this paper, we introduce Reward-Augmented Decoding (RAD), a text generation procedure that uses a small unidirectional reward model to encourage a language model to generate text that has certain properties. Specifically, RAD uses the reward model to score generations as they are produced and rescales sampling probabilities to favor high-reward tokens. By using a unidirectional reward model, RAD can cache activations from prior generation steps to decrease computational overhead. Through experiments on generating non-toxic and sentiment-controlled text, we demonstrate that RAD performs best among methods that change only the generation procedure and matches the performance of state-of-the-art methods that involve re-training the language model. We further validate that RAD is effective on very large language models while incurring a minimal computational overhead.
    摘要 大型语言模型已经证明在广泛的下游应用中效果很好,但它们通常生成的文本具有问题或缺乏欲有的特性。在这篇论文中,我们介绍了奖励增强解oding(RAD),一种文本生成 процедуre,使用小型单向奖励模型来鼓励语言模型生成具有特定特性的文本。具体来说,RAD使用奖励模型来评分生成过程中的生成,并将抽样概率重新调整以偏好高奖励的字元。使用单向奖励模型,RAD可以储存上一步生成的活化,以减少计算成本。通过实验,我们证明了RAD在生成非攻击性和情感控制的文本方面表现最佳,并与现有的语言模型重新训练方法相当。此外,我们还验证了RAD在巨大语言模型上的效果,而且计算成本几乎没有增加。
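A minimal NumPy sketch of one Reward-Augmented Decoding step as described in the abstract: the reward of each top-k candidate continuation is evaluated and its logit is shifted by beta times the reward before sampling. The `lm_next_logits` and `reward_model` callables, the top-k restriction, and beta are assumptions about the interfaces, not RAD's exact implementation.

```python
# Hypothetical reward-augmented decoding step: rescale the sampling distribution
# toward high-reward next tokens. Placeholders: lm_next_logits(tokens) -> (V,)
# logits; reward_model(tokens) -> scalar reward for the partial generation.
import numpy as np

def rad_step(lm_next_logits, reward_model, prefix_tokens, top_k=20, beta=5.0, rng=None):
    rng = rng or np.random.default_rng()
    logits = lm_next_logits(prefix_tokens)            # (vocab_size,)
    candidates = np.argsort(logits)[-top_k:]          # restrict to the top-k tokens

    rewards = np.array([reward_model(prefix_tokens + [int(t)]) for t in candidates])
    adjusted = logits[candidates] + beta * rewards    # favor high-reward continuations

    probs = np.exp(adjusted - adjusted.max())
    probs /= probs.sum()
    return int(rng.choice(candidates, p=probs))
```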

Attentive Multi-Layer Perceptron for Non-autoregressive Generation

  • paper_url: http://arxiv.org/abs/2310.09512
  • repo_url: https://github.com/shark-nlp/attentivemlp
  • paper_authors: Shuyang Jiang, Jun Zhang, Jiangtao Feng, Lin Zheng, Lingpeng Kong
  • for: 这篇论文旨在提出一种高效的非自回归生成模型,以解决非自回归生成在序列长度上的效率瓶颈。
  • methods: 论文提出了一种新的多层感知机变体,称为 Attentive Multi-Layer Perceptron(AMLP),以实现线性时间与空间复杂度的生成。AMLP 以注意力方式从输入计算自适应投影矩阵,用于建模查询空间与键空间之间的度量。
  • results: 实验结果表明,将 AMLP 与主流非自回归模型结合后,在文本转语音合成和机器翻译等任务上显著优于其他高效的非自回归基线;消融实验分别验证了 AMLP 的自注意力与交叉注意力能力,其表现与其他高效模型相当甚至更优。此外,论文还表明,对于长序列,AMLP 相比原始非自回归模型能大幅降低内存开销。
    Abstract Autoregressive~(AR) generation almost dominates sequence generation for its efficacy. Recently, non-autoregressive~(NAR) generation gains increasing popularity for its efficiency and growing efficacy. However, its efficiency is still bottlenecked by quadratic complexity in sequence lengths, which is prohibitive for scaling to long sequence generation and few works have been done to mitigate this problem. In this paper, we propose a novel MLP variant, \textbf{A}ttentive \textbf{M}ulti-\textbf{L}ayer \textbf{P}erceptron~(AMLP), to produce a generation model with linear time and space complexity. Different from classic MLP with static and learnable projection matrices, AMLP leverages adaptive projections computed from inputs in an attentive mode. The sample-aware adaptive projections enable communications among tokens in a sequence, and model the measurement between the query and key space. Furthermore, we marry AMLP with popular NAR models, deriving a highly efficient NAR-AMLP architecture with linear time and space complexity. Empirical results show that such marriage architecture surpasses competitive efficient NAR models, by a significant margin on text-to-speech synthesis and machine translation. We also test AMLP's self- and cross-attention ability separately with extensive ablation experiments, and find them comparable or even superior to the other efficient models. The efficiency analysis further shows that AMLP extremely reduces the memory cost against vanilla non-autoregressive models for long sequences.
    摘要 自适应式生成(AR)技术在序列生成方面几乎占据了主导地位,而非自适应式生成(NAR)技术在最近几年得到了越来越多的关注,主要因为它的效率和生成能力在不断提高。然而,NAR技术的效率仍然受到序列长度的二次复杂性的限制,这使得扩展到长序列生成和少量工作已经做出了很多努力。在这篇论文中,我们提出了一种新的多层感知器(MLP)变体,称为注意力感知器(AMLP),以生成一个具有线性时间和空间复杂度的生成模型。与 класси型MLP的静态和学习投影矩阵不同,AMLP利用从输入中计算的适应投影。这些适应投影可以在序列中的各个元素之间进行通信,并且可以测量查询和关键空间之间的距离。此外,我们将AMLP与流行的NAR模型结合,得到了高效的NAR-AMLP架构,该架构具有线性时间和空间复杂度。实验结果表明,这种结合架构在文本到语音生成和机器翻译方面的性能明显高于竞争对手,并且我们还进行了详细的自注意力和交叉注意力能力测试,发现它们与其他高效的模型相当或者甚至更高。最后,我们还进行了内存成本分析,发现AMLP在长序列情况下可以极大地减少内存成本。

DepNeCTI: Dependency-based Nested Compound Type Identification for Sanskrit

  • paper_url: http://arxiv.org/abs/2310.09501
  • repo_url: https://github.com/yaswanth-iitkgp/depnecti
  • paper_authors: Jivnesh Sandhan, Yaswanth Narsupalli, Sreevatsa Muppirala, Sriram Krishnan, Pavankumar Satuluri, Amba Kulkarni, Pawan Goyal
  • for: 本研究旨在提出一个新的任务:嵌入式多 компонент合成类型标识(NeCTI),以便理解多 component 合成中的隐藏结构和 semantics。
  • methods: 本研究使用了2个新的标注数据集,并对这些数据集进行了基线测试。然后,提出了一种新的框架名为 DepNeCTI,该框架基于依赖关系来实现嵌入式多 component 合成类型标识。
  • results: 对于 NeCTI 任务, DepNeCTI 框架在 Labeled Span Score (LSS) 方面的平均绝对改进率为 13.1 个 F1 分,并在推理效率方面实现了5倍的提高。此外,研究还发现了上下文对 NeCTI 任务的有利作用。
    Abstract Multi-component compounding is a prevalent phenomenon in Sanskrit, and understanding the implicit structure of a compound's components is crucial for deciphering its meaning. Earlier approaches in Sanskrit have focused on binary compounds and neglected the multi-component compound setting. This work introduces the novel task of nested compound type identification (NeCTI), which aims to identify nested spans of a multi-component compound and decode the implicit semantic relations between them. To the best of our knowledge, this is the first attempt in the field of lexical semantics to propose this task. We present 2 newly annotated datasets including an out-of-domain dataset for this task. We also benchmark these datasets by exploring the efficacy of the standard problem formulations such as nested named entity recognition, constituency parsing and seq2seq, etc. We present a novel framework named DepNeCTI: Dependency-based Nested Compound Type Identifier that surpasses the performance of the best baseline with an average absolute improvement of 13.1 points F1-score in terms of Labeled Span Score (LSS) and a 5-fold enhancement in inference efficiency. In line with the previous findings in the binary Sanskrit compound identification task, context provides benefits for the NeCTI task. The codebase and datasets are publicly available at: https://github.com/yaswanth-iitkgp/DepNeCTI
    摘要 多成分复合词是梵语中的普遍现象,理解复合词各成分之间的隐含结构是解读其含义的关键。以往的梵语研究主要关注二元复合词,忽略了多成分复合词的情形。本文提出了一个新任务:嵌套复合词类型识别(NeCTI),旨在识别多成分复合词中的嵌套跨度,并解码它们之间的隐含语义关系。据我们所知,这是词汇语义学领域首次提出该任务。我们发布了两个新标注数据集(包括一个域外数据集),并用嵌套命名实体识别、成分句法分析、seq2seq 等标准问题形式对其进行了基准测试。我们提出了名为 DepNeCTI(基于依存关系的嵌套复合词类型识别器)的新框架,其在带标签跨度得分(LSS)上比最佳基线平均绝对提升 13.1 个 F1 分,推理效率提升 5 倍。与此前二元梵语复合词识别任务的结论一致,上下文信息对 NeCTI 任务同样有益。代码与数据集公开于:https://github.com/yaswanth-iitkgp/DepNeCTI

Computational analyses of linguistic features with schizophrenic and autistic traits along with formal thought disorders

  • paper_url: http://arxiv.org/abs/2310.09494
  • repo_url: None
  • paper_authors: Takeshi Saga, Hiroki Tanaka, Satoshi Nakamura
  • for: 本研究考察形式思维障碍(FTD)在语言中的表现;FTD 可见于自闭症谱系障碍(ASD)和精神分裂症等发育性或精神疾病。
  • methods: 本研究通过众包服务从普通人群中收集了带有 ASD 与 SPD 相关分数标签的日语语音报告数据集,并使用社会响应量表第二版(SRS2)和分裂型人格问卷(SPQ)量化语言特征,其中包括 SPQ 中用于量化 FTD 症状的奇特言语子量表。
  • results: 研究发现,奇特言语子量表与 SPQ 总分和 SRS 总分均显著相关,但后两者之间并不显著相关;关于负面记忆的较长叙述会引出更多 FTD 症状。消融实验表明,功能词以及抽象和时间特征对奇特言语估计很重要,而内容词仅对 SRS 预测有效,这一结果提示 SPD 样症状与 ASD 样症状之间存在差异。数据和程序见:https://sites.google.com/view/sagatake/resource。
    Abstract [See full abstract in the pdf] Formal Thought Disorder (FTD), which is a group of symptoms in cognition that affects language and thought, can be observed through language. FTD is seen across such developmental or psychiatric disorders as Autism Spectrum Disorder (ASD) or Schizophrenia, and its related Schizotypal Personality Disorder (SPD). This paper collected a Japanese audio-report dataset with score labels related to ASD and SPD through a crowd-sourcing service from the general population. We measured language characteristics with the 2nd edition of the Social Responsiveness Scale (SRS2) and the Schizotypal Personality Questionnaire (SPQ), including an odd speech subscale from SPQ to quantify the FTD symptoms. We investigated the following four research questions through machine-learning-based score predictions: (RQ1) How are schizotypal and autistic measures correlated? (RQ2) What is the most suitable task to elicit FTD symptoms? (RQ3) Does the length of speech affect the elicitation of FTD symptoms? (RQ4) Which features are critical for capturing FTD symptoms? We confirmed that an FTD-related subscale, odd speech, was significantly correlated with both the total SPQ and SRS scores, although they themselves were not correlated significantly. Our regression analysis indicated that longer speech about a negative memory elicited more FTD symptoms. The ablation study confirmed the importance of function words and both the abstract and temporal features for FTD-related odd speech estimation. In contrast, content words were effective only in the SRS predictions, and content words were effective only in the SPQ predictions, a result that implies the differences between SPD-like and ASD-like symptoms. Data and programs used in this paper can be found here: https://sites.google.com/view/sagatake/resource.
    摘要 [完整摘要见 PDF] 形式思维障碍(FTD)是一组影响语言和思维的认知症状,可以通过语言观察到。FTD 出现于自闭症谱系障碍(ASD)、精神分裂症及与之相关的分裂型人格障碍(SPD)等发育性或精神疾病中。本文通过众包服务从普通人群中收集了带有 ASD 和 SPD 相关分数标签的日语语音报告数据集。我们使用社会响应量表第二版(SRS2)和分裂型人格问卷(SPQ)量化语言特征,并利用 SPQ 的奇特言语子量表量化 FTD 症状。我们基于机器学习的分数预测回答了以下四个研究问题:(RQ1)分裂型与自闭症相关度量之间如何相关?(RQ2)哪种任务最适合引出 FTD 症状?(RQ3)语音长度是否影响 FTD 症状的引出?(RQ4)哪些特征对捕捉 FTD 症状至关重要?我们确认,与 FTD 相关的奇特言语子量表与 SPQ 总分和 SRS 总分均显著相关,尽管后两者之间并不显著相关。回归分析表明,关于负面记忆的较长叙述会引出更多 FTD 症状。消融实验证实了功能词以及抽象和时间特征对奇特言语估计的重要性;相比之下,内容词仅在 SRS 预测中有效,这一结果提示 SPD 样症状与 ASD 样症状之间存在差异。本文使用的数据和程序见:https://sites.google.com/view/sagatake/resource。

cs.LG - 2023-10-14

Towards Semi-Structured Automatic ICD Coding via Tree-based Contrastive Learning

  • paper_url: http://arxiv.org/abs/2310.09672
  • repo_url: None
  • paper_authors: Chang Lu, Chandan K. Reddy, Ping Wang, Yue Ning
  • for: 提高现有ICD编码模型的性能,解决由医疗专业人员的写作习惯和病人的多种病理特征引起的医学记录的变化和有限的数据难题。
  • methods: 对医学记录进行自动分割,引入对比预训练策略,使用软多标签相似度度量基于树编辑距离进行预训练。还设计了遮盖部分训练策略,使ICD编码模型能够更好地定位相关的ICD代码段落。
  • results: 通过对存在限制数据的ICD编码模型进行预训练,提高了其性能。同时,通过遮盖部分训练策略,ICD编码模型能够更好地定位相关的ICD代码段落。
    Abstract Automatic coding of International Classification of Diseases (ICD) is a multi-label text categorization task that involves extracting disease or procedure codes from clinical notes. Despite the application of state-of-the-art natural language processing (NLP) techniques, there are still challenges including limited availability of data due to privacy constraints and the high variability of clinical notes caused by different writing habits of medical professionals and various pathological features of patients. In this work, we investigate the semi-structured nature of clinical notes and propose an automatic algorithm to segment them into sections. To address the variability issues in existing ICD coding models with limited data, we introduce a contrastive pre-training approach on sections using a soft multi-label similarity metric based on tree edit distance. Additionally, we design a masked section training strategy to enable ICD coding models to locate sections related to ICD codes. Extensive experimental results demonstrate that our proposed training strategies effectively enhance the performance of existing ICD coding methods.
    摘要 自动编码国际疾病分类(ICD)是一项多标签文本分类任务,涉及从临床记录中提取疾病或手术代码。尽管已经应用了最先进的自然语言处理(NLP)技术,这一任务仍面临诸多挑战,包括隐私限制导致的数据可用性不足,以及医生写作习惯各异、患者病理特征多样造成的临床记录高度差异。在这项工作中,我们研究了临床记录的半结构化特性,并提出了一种将其自动划分为章节的算法。为了解决现有 ICD 编码模型在数据有限情况下的差异性问题,我们引入了一种基于树编辑距离的软多标签相似度度量,在章节层面进行对比预训练;并设计了掩码章节训练策略,使 ICD 编码模型能够定位与 ICD 代码相关的章节。大量实验结果表明,我们提出的训练策略有效提升了现有 ICD 编码方法的性能。
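
The abstract's "soft multi-label similarity metric based on tree edit distance" can be illustrated with a small sketch. The snippet below is only an illustration, not the paper's implementation: each ICD code is treated as a root-to-node path in the code hierarchy (a simplifying assumption), in which case the tree edit distance between two codes reduces to the number of unshared path steps; all function names are hypothetical.

```python
# Minimal sketch (not the paper's implementation): a soft similarity between two
# ICD label sets, using a simplified path-based stand-in for the ICD hierarchy.

def code_path(code: str) -> list[str]:
    """Represent an ICD code as a path of increasingly specific prefixes,
    e.g. 'E11.9' -> ['E', 'E1', 'E11', 'E119'] (a simplified stand-in for the
    real ICD tree)."""
    compact = code.replace(".", "")
    return [compact[: i + 1] for i in range(len(compact))]

def tree_distance(a: str, b: str) -> int:
    """Edit distance between two root-to-code paths: delete the unshared suffix
    of one path and insert the unshared suffix of the other."""
    pa, pb = code_path(a), code_path(b)
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

def soft_label_similarity(codes_a: set[str], codes_b: set[str]) -> float:
    """Soft multi-label similarity in (0, 1]: average best-match similarity in
    both directions, mapping distances to similarities with 1 / (1 + d)."""
    def one_way(src, dst):
        return sum(max(1.0 / (1.0 + tree_distance(c, d)) for d in dst) for c in src) / len(src)
    return 0.5 * (one_way(codes_a, codes_b) + one_way(codes_b, codes_a))

# related code sets score higher than unrelated ones
print(soft_label_similarity({"E11.9", "I10"}, {"E11.65", "I10"}))
```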

A Blockchain-empowered Multi-Aggregator Federated Learning Architecture in Edge Computing with Deep Reinforcement Learning Optimization

  • paper_url: http://arxiv.org/abs/2310.09665
  • repo_url: None
  • paper_authors: Xiao Li, Weili Wu
  • for: 本研究旨在提出一种基于区块链技术的多收集器Edge Federated Learning架构(BMA-FL),以提高 Edge Federated Learning 的安全性和效率。
  • methods: 本研究提出了一种新的轻量级拜占庭共识机制(PBCM),用于在 BMA-FL 中实现安全、快速的模型聚合与同步。此外,我们还提出了一种多智能体深度强化学习算法,帮助聚合器决定最佳训练策略。
  • results: 我们在真实数据集上进行了实验,证明 BMA-FL 能够比基线更快地获得更好的模型,表明了 PBCM 与所提深度强化学习算法的有效性。
    Abstract Federated learning (FL) is emerging as a sought-after distributed machine learning architecture, offering the advantage of model training without direct exposure of raw data. With advancements in network infrastructure, FL has been seamlessly integrated into edge computing. However, the limited resources on edge devices introduce security vulnerabilities to FL in the context. While blockchain technology promises to bolster security, practical deployment on resource-constrained edge devices remains a challenge. Moreover, the exploration of FL with multiple aggregators in edge computing is still new in the literature. Addressing these gaps, we introduce the Blockchain-empowered Heterogeneous Multi-Aggregator Federated Learning Architecture (BMA-FL). We design a novel light-weight Byzantine consensus mechanism, namely PBCM, to enable secure and fast model aggregation and synchronization in BMA-FL. We also dive into the heterogeneity problem in BMA-FL that the aggregators are associated with varied number of connected trainers with Non-IID data distributions and diverse training speed. We proposed a multi-agent deep reinforcement learning algorithm to help aggregators decide the best training strategies. The experiments on real-word datasets demonstrate the efficiency of BMA-FL to achieve better models faster than baselines, showing the efficacy of PBCM and proposed deep reinforcement learning algorithm.
    摘要 联邦学习(FL)作为一种分布式机器学习架构正受到越来越多的关注,它可以在不直接暴露原始数据的情况下进行模型训练。随着网络基础设施的发展,FL 已顺利融入边缘计算。然而,边缘设备的资源有限,给该场景下的 FL 带来了安全漏洞。区块链技术有望增强安全性,但在资源受限的边缘设备上进行实际部署仍是一大挑战。此外,文献中对边缘计算中多聚合器 FL 的探索仍然较新。为弥补这些空白,我们提出了区块链赋能的异构多聚合器联邦学习架构(BMA-FL)。我们设计了一种轻量级拜占庭共识机制 PBCM,以便在 BMA-FL 中安全、快速地进行模型聚合与同步。我们还研究了 BMA-FL 中的异构性问题:各聚合器连接的训练者数量不同、数据分布非独立同分布且训练速度各异,并提出了一种多智能体深度强化学习算法帮助聚合器决定最佳训练策略。真实数据集上的实验表明,BMA-FL 能够比基线更快地获得更好的模型,验证了 PBCM 与所提深度强化学习算法的有效性。

Topology-guided Hypergraph Transformer Network: Unveiling Structural Insights for Improved Representation

  • paper_url: http://arxiv.org/abs/2310.09657
  • repo_url: None
  • paper_authors: Khaled Mohammed Saifuddin, Mehmet Emin Aktas, Esra Akbas
  • for: 本文旨在扩展传统图的概念,使用嵌入式的图神经网络(GNN)进行图表示学习,并且可以在高阶关系的图上进行表示学习。
  • methods: 本文提出了拓扑引导的超图 Transformer 网络(THTN)模型。该模型首先在保留结构要素的前提下由图构造超图,然后使用简单而有效的结构与空间编码模块增强节点表示,同时捕捉节点的局部与全局拓扑信息。
  • results: 在节点分类任务上,所提模型的表现持续优于现有方法,能够更好地捕捉节点的局部与全局拓扑表达。
    Abstract Hypergraphs, with their capacity to depict high-order relationships, have emerged as a significant extension of traditional graphs. Although Graph Neural Networks (GNNs) have remarkable performance in graph representation learning, their extension to hypergraphs encounters challenges due to their intricate structures. Furthermore, current hypergraph transformers, a special variant of GNN, utilize semantic feature-based self-attention, ignoring topological attributes of nodes and hyperedges. To address these challenges, we propose a Topology-guided Hypergraph Transformer Network (THTN). In this model, we first formulate a hypergraph from a graph while retaining its structural essence to learn higher-order relations within the graph. Then, we design a simple yet effective structural and spatial encoding module to incorporate the topological and spatial information of the nodes into their representation. Further, we present a structure-aware self-attention mechanism that discovers the important nodes and hyperedges from both semantic and structural viewpoints. By leveraging these two modules, THTN crafts an improved node representation, capturing both local and global topological expressions. Extensive experiments conducted on node classification tasks demonstrate that the performance of the proposed model consistently exceeds that of the existing approaches.
    摘要 超图能够刻画高阶关系,已成为传统图的重要扩展。尽管图神经网络(GNN)在图表示学习中表现出色,但由于超图结构复杂,将其扩展到超图仍面临挑战。此外,现有的超图 Transformer(GNN 的一种特殊变体)采用基于语义特征的自注意力,忽视了节点和超边的拓扑属性。针对这些问题,我们提出了拓扑引导的超图 Transformer 网络(THTN)。该模型首先在保留结构要素的前提下由图构造超图,以学习图中的高阶关系;随后设计了简单而有效的结构与空间编码模块,将节点的拓扑和空间信息融入其表示;并进一步提出结构感知的自注意力机制,从语义和结构两个视角发现重要的节点与超边。借助这两个模块,THTN 得到了同时捕捉局部与全局拓扑表达的改进节点表示。在节点分类任务上的大量实验表明,所提模型的性能持续优于现有方法。

Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space

  • paper_url: http://arxiv.org/abs/2310.09656
  • repo_url: https://github.com/amazon-science/tabsyn
  • paper_authors: Hengrui Zhang, Jiani Zhang, Balasubramaniam Srinivasan, Zhengyuan Shen, Xiao Qin, Christos Faloutsos, Huzefa Rangwala, George Karypis
  • for: 本文旨在提出一种能够生成高质量的 tabular 数据的方法,以满足不同数据类型的混合和复杂的分布特性。
  • methods: 本文提出了一种基于 variational autoencoder (VAE) 的 diffusion 模型,可以将不同数据类型转化为单一的空间,并显式地捕捉列之间的关系。
  • results: 在六个数据集、五项指标上的大量实验表明 TABSYN 优于现有方法:在单列分布估计和列间相关性估计上,其误差率相比最具竞争力的基线分别降低了 86% 和 67%。
    Abstract Recent advances in tabular data generation have greatly enhanced synthetic data quality. However, extending diffusion models to tabular data is challenging due to the intricately varied distributions and a blend of data types of tabular data. This paper introduces TABSYN, a methodology that synthesizes tabular data by leveraging a diffusion model within a variational autoencoder (VAE) crafted latent space. The key advantages of the proposed TABSYN include (1) Generality: the ability to handle a broad spectrum of data types by converting them into a single unified space and explicitly capture inter-column relations; (2) Quality: optimizing the distribution of latent embeddings to enhance the subsequent training of diffusion models, which helps generate high-quality synthetic data, (3) Speed: much fewer number of reverse steps and faster synthesis speed than existing diffusion-based methods. Extensive experiments on six datasets with five metrics demonstrate that TABSYN outperforms existing methods. Specifically, it reduces the error rates by 86% and 67% for column-wise distribution and pair-wise column correlation estimations compared with the most competitive baselines.
    摘要 表格数据生成的最新进展大幅提升了合成数据的质量,但由于表格数据分布复杂多样且混合多种数据类型,将扩散模型扩展到表格数据仍具挑战。本文提出 TABSYN,一种在变分自编码器(VAE)构建的潜在空间中利用扩散模型合成表格数据的方法,其主要优势包括:
  1. 通用性:能够处理广泛的数据类型,将它们转换到统一的空间,并显式地捕捉列之间的关系。
  2. 质量:优化潜在嵌入的分布,以改进后续扩散模型的训练,从而生成高质量的合成数据。
  3. 速度:所需反向步骤远少于现有基于扩散的方法,生成速度更快。
在六个数据集、五项指标上的大量实验表明,TABSYN 优于现有方法;具体而言,与最具竞争力的基线相比,其单列分布与列间相关性的估计错误率分别降低了 86% 和 67%。

DPZero: Dimension-Independent and Differentially Private Zeroth-Order Optimization

  • paper_url: http://arxiv.org/abs/2310.09639
  • repo_url: None
  • paper_authors: Liang Zhang, Kiran Koshy Thekumparampil, Sewoong Oh, Niao He
  • for: 这篇论文旨在解决微调大型语言模型(LLM)时面临的内存与隐私问题。
  • methods: 本文探索零阶方法在差分隐私优化中的潜力;这类方法仅依赖前向传播,可大幅降低训练时的内存开销。然而,将零阶方法与标准差分隐私机制直接结合会带来与维度相关的复杂度。
  • results: 本文提出了名为 DPZero 的新型差分隐私零阶算法,其复杂度主要取决于问题的内在维度,对外在维度仅呈对数依赖,因而非常适合实际部署。
    Abstract The widespread practice of fine-tuning pretrained large language models (LLMs) on domain-specific data faces two major challenges in memory and privacy. First, as the size of LLMs continue to grow, encompassing billions of parameters, the memory demands of gradient-based training methods via backpropagation become prohibitively high. Second, given the tendency of LLMs to memorize and disclose sensitive training data, the privacy of fine-tuning data must be respected. To this end, we explore the potential of zeroth-order methods in differentially private optimization for fine-tuning LLMs. Zeroth-order methods, which rely solely on forward passes, substantially reduce memory consumption during training. However, directly combining them with standard differential privacy mechanism poses dimension-dependent complexity. To bridge the gap, we introduce DPZero, a novel differentially private zeroth-order algorithm with nearly dimension-independent rates. Our theoretical analysis reveals that its complexity hinges primarily on the problem's intrinsic dimension and exhibits only a logarithmic dependence on the ambient dimension. This renders DPZero a highly practical option for real-world LLMs deployments.
    摘要 在领域数据上微调预训练大型语言模型(LLM)的普遍做法面临内存与隐私两大挑战。首先,随着 LLM 规模不断增长、参数达到数十亿级,基于反向传播的梯度训练方法的内存需求变得过高。其次,由于 LLM 倾向于记忆并泄露敏感训练数据,微调数据的隐私必须得到保护。为此,我们探索零阶方法在差分隐私优化中用于微调 LLM 的潜力。零阶方法仅依赖前向传播,可大幅降低训练时的内存消耗;但将其与标准差分隐私机制直接结合会带来与维度相关的复杂度。为弥合这一差距,我们提出了 DPZero,一种复杂度几乎与维度无关的新型差分隐私零阶算法。理论分析表明,其复杂度主要取决于问题的内在维度,对外在维度仅呈对数依赖,这使 DPZero 成为实际部署 LLM 时非常实用的选择。
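
The general ingredients named in the abstract can be sketched in a few lines. The code below is only a generic illustration of differentially private zeroth-order optimization (two forward passes along a shared random direction, per-example clipping, Gaussian noise), not the authors' DPZero algorithm; all names and hyperparameters are hypothetical.

```python
# Illustrative sketch only (not the authors' exact DPZero algorithm): one step of
# differentially private zeroth-order optimization with a two-point estimate.
import numpy as np

def dp_zeroth_order_step(theta, loss_fn, batch, lr=1e-3, mu=1e-3, clip=1.0, sigma=1.0, rng=None):
    """theta: parameter vector; loss_fn(theta, example) -> scalar loss."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(theta.shape)          # shared probing direction
    z /= np.linalg.norm(z)
    scalars = []
    for example in batch:
        # two forward passes -> per-example scalar directional-derivative estimate
        g = (loss_fn(theta + mu * z, example) - loss_fn(theta - mu * z, example)) / (2 * mu)
        scalars.append(np.clip(g, -clip, clip))   # clipping bounds per-example sensitivity
    noisy_mean = (np.sum(scalars) + rng.normal(0.0, sigma * clip)) / len(batch)
    return theta - lr * noisy_mean * z            # update only along the probing direction

# toy usage: "fine-tuning" a linear model on synthetic data
rng = np.random.default_rng(0)
X, y = rng.standard_normal((32, 10)), rng.standard_normal(32)
loss = lambda w, ex: (ex[0] @ w - ex[1]) ** 2
w = np.zeros(10)
for _ in range(100):
    w = dp_zeroth_order_step(w, loss, list(zip(X, y)), rng=rng)
```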

Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling

  • paper_url: http://arxiv.org/abs/2310.09636
  • repo_url: https://github.com/tiberiu44/TTS-Cube
  • paper_authors: Tiberiu Boros, Stefan Daniel Dumitrescu, Ionut Mironica, Radu Chivereanu
  • for: 这个论文描述了一个端到端的语音合成系统,使用生成对抗训练。
  • methods: 该系统使用生成对抗训练来训练其声码器,实现从原始音素到音频的转换,并采用显式的音素、音高和时长建模。
  • results: 研究人员对多种用于上下文化与非上下文化词向量的预训练模型进行了实验,并引入了一种基于离散风格标记的高表现力角色声音匹配新方法。
    Abstract We describe an end-to-end speech synthesis system that uses generative adversarial training. We train our Vocoder for raw phoneme-to-audio conversion, using explicit phonetic, pitch and duration modeling. We experiment with several pre-trained models for contextualized and decontextualized word embeddings and we introduce a new method for highly expressive character voice matching, based on discreet style tokens.
    摘要 我们描述了一个端到端的语音合成系统,其训练采用生成对抗方式。我们训练声码器完成从原始音素到音频的转换,并使用显式的音素、音高和时长建模。我们对多种用于上下文化与非上下文化词向量的预训练模型进行了实验,并引入了一种基于离散风格标记的高表现力角色声音匹配新方法。

Landslide Topology Uncovers Failure Movements

  • paper_url: http://arxiv.org/abs/2310.09631
  • repo_url: None
  • paper_authors: Kamal Rana, Kushanav Bhuyan, Joaquin Vicente Ferrer, Fabrice Cotton, Ugur Ozturk, Filippo Catani, Nishant Malik
  • for: The paper aims to improve the accuracy of landslide predictive models and impact assessments by identifying failure types based on their movements.
  • methods: The approach uses 3D landslide topology to identify failure types, such as slides and flows, by analyzing topological proxies that reveal the mechanics of mass movements.
  • results: The approach achieves 80 to 94% accuracy in identifying failure types in historic and event-specific landslide databases from various geomorphological and climatic contexts.
    Abstract The death toll and monetary damages from landslides continue to rise despite advancements in predictive modeling. The predictive capability of these models is limited as landslide databases used in training and assessing the models often have crucial information missing, such as underlying failure types. Here, we present an approach for identifying failure types based on their movements, e.g., slides and flows by leveraging 3D landslide topology. We observe topological proxies reveal prevalent signatures of mass movement mechanics embedded in the landslide's morphology or shape, such as detecting coupled movement styles within complex landslides. We find identical failure types exhibit similar topological properties, and by using them as predictors, we can identify failure types in historic and event-specific landslide databases (including multi-temporal) from various geomorphological and climatic contexts such as Italy, the US Pacific Northwest region, Denmark, Turkey, and China with 80 to 94 % accuracy. To demonstrate the real-world application of the method, we implement it in two undocumented datasets from China and publicly release the datasets. These new insights can considerably improve the performance of landslide predictive models and impact assessments. Moreover, our work introduces a new paradigm for studying landslide shapes to understand underlying processes through the lens of landslide topology.
    摘要 尽管预测模型不断进步,滑坡造成的人员伤亡和经济损失仍在上升。这些模型的预测能力受到限制,因为用于训练和评估的滑坡数据库往往缺失关键信息,例如滑坡的破坏类型。本文提出了一种利用滑坡三维拓扑、依据运动方式识别破坏类型(如滑动型和流动型)的方法。我们观察到,拓扑指标揭示了蕴含在滑坡形态中的物质运动力学特征,例如能够检测复杂滑坡内部耦合的运动方式。我们发现相同破坏类型具有相似的拓扑性质,将其作为预测特征,可以在来自意大利、美国太平洋西北地区、丹麦、土耳其和中国等不同地貌与气候背景的历史及事件滑坡数据库(包括多时相数据)中以 80%~94% 的准确率识别破坏类型。为展示该方法的实际应用,我们将其应用于中国两个未记录的数据集,并公开发布这些数据集。这些新见解可以显著提升滑坡预测模型与影响评估的性能。此外,我们的工作引入了通过滑坡拓扑研究滑坡形态、理解其背后过程的新范式。

Federated Battery Diagnosis and Prognosis

  • paper_url: http://arxiv.org/abs/2310.09628
  • repo_url: None
  • paper_authors: Nur Banu Altinpulluk, Deniz Altinpulluk, Paritosh Ramanan, Noah Paulson, Feng Qiu, Susan Babinec, Murat Yildirim
  • for: 这个研究旨在提出一个分布式电池诊断和预测模型,以解决现有的数据拥有权、隐私、通信和处理等问题。
  • methods: 我们提出了一个分布式电池诊断和预测模型,将标准电流电压时间数据分布式处理,仅将模型参数通信,减少通信负载并保持数据隐私。
  • results: 我们的模型可以实现隐私保护的分布式电池诊断和预测,并且可以预测电池剩余寿命。这个模型将为电池健康管理带来一个新的思维方向。
    Abstract Battery diagnosis, prognosis and health management models play a critical role in the integration of battery systems in energy and mobility fields. However, large-scale deployment of these models is hindered by a myriad of challenges centered around data ownership, privacy, communication, and processing. State-of-the-art battery diagnosis and prognosis methods require centralized collection of data, which further aggravates these challenges. Here we propose a federated battery prognosis model, which distributes the processing of battery standard current-voltage-time-usage data in a privacy-preserving manner. Instead of exchanging raw standard current-voltage-time-usage data, our model communicates only the model parameters, thus reducing communication load and preserving data confidentiality. The proposed model offers a paradigm shift in battery health management through privacy-preserving distributed methods for battery data processing and remaining lifetime prediction.
    摘要 电池诊断、预测与健康管理模型在能源和交通领域的电池系统集成中扮演关键角色。然而,这些模型的大规模部署受到数据所有权、隐私、通信和处理等多重挑战的阻碍;最先进的电池诊断与预测方法还需要集中收集数据,这进一步加剧了上述挑战。为此,我们提出一种联邦电池预测模型,以保护隐私的方式分布式处理标准的电流-电压-时间-使用数据。该模型不交换原始数据,而仅交换模型参数,从而降低通信负载并保持数据保密性。所提模型通过隐私保护的分布式电池数据处理与剩余寿命预测,为电池健康管理带来了范式转变。
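
The parameter-only communication pattern described above can be sketched with a plain federated-averaging loop. This is not the paper's battery model; the local linear model and all names below are placeholders used only to show that raw data never leaves a site.

```python
# Minimal sketch of parameter-only federated learning (not the paper's model):
# each site fits a local model on its own current-voltage-time data and only the
# parameter vector is communicated; the server computes a weighted average.
import numpy as np

def local_update(theta, X, y, lr=0.01, epochs=5):
    """A few local gradient steps on a linear degradation model; raw data stays local."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ theta - y) / len(y)
        theta = theta - lr * grad
    return theta

def federated_round(global_theta, site_datasets):
    """One round: broadcast parameters, collect locally updated parameters, average."""
    updates, weights = [], []
    for X, y in site_datasets:
        updates.append(local_update(global_theta.copy(), X, y))
        weights.append(len(y))
    return np.average(np.stack(updates), axis=0, weights=np.asarray(weights, dtype=float))

rng = np.random.default_rng(0)
sites = [(rng.standard_normal((50, 4)), rng.standard_normal(50)) for _ in range(3)]
theta = np.zeros(4)
for _ in range(20):
    theta = federated_round(theta, sites)
```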

Machine Learning for Urban Air Quality Analytics: A Survey

  • paper_url: http://arxiv.org/abs/2310.09620
  • repo_url: None
  • paper_authors: Jindong Han, Weijia Zhang, Hao Liu, Hui Xiong
  • for: 这篇论文主要目标是探讨机器学习(ML)技术在空气质量分析领域的应用,以提供一份全面的报告,帮助专业人士寻找适合自己的问题和进行前沿研究。
  • methods: 本论文使用的方法包括数据收集、数据预处理、机器学习模型的应用等,涵盖了多种空气质量分析任务,如污染 patrern mining、空气质量推断和预测等。
  • results: 本论文提供了一系列已有的空气质量分析任务的综述和分类,同时还提供了一些已有的公共空气质量数据集,以便进一步研究。此外,文章还预测了未来研究的一些可能的方向。
    Abstract The increasing air pollution poses an urgent global concern with far-reaching consequences, such as premature mortality and reduced crop yield, which significantly impact various aspects of our daily lives. Accurate and timely analysis of air pollution is crucial for understanding its underlying mechanisms and implementing necessary precautions to mitigate potential socio-economic losses. Traditional analytical methodologies, such as atmospheric modeling, heavily rely on domain expertise and often make simplified assumptions that may not be applicable to complex air pollution problems. In contrast, Machine Learning (ML) models are able to capture the intrinsic physical and chemical rules by automatically learning from a large amount of historical observational data, showing great promise in various air quality analytical tasks. In this article, we present a comprehensive survey of ML-based air quality analytics, following a roadmap spanning from data acquisition to pre-processing, and encompassing various analytical tasks such as pollution pattern mining, air quality inference, and forecasting. Moreover, we offer a systematic categorization and summary of existing methodologies and applications, while also providing a list of publicly available air quality datasets to ease the research in this direction. Finally, we identify several promising future research directions. This survey can serve as a valuable resource for professionals seeking suitable solutions for their specific challenges and advancing their research at the cutting edge.
    摘要 这个增长的空气污染问题具有急迫的全球性,带来许多深远的后果,如提早死亡和减少的农作物生产,这些影响了我们日常生活的多方面。精确和时间的空气污染分析是理解其下面的机制和适当的预防措施,以减少可能的社会经济损失。传统的分析方法,如大气模型,严重依赖专家知识和假设,可能无法应对复杂的空气污染问题。相比之下,机器学习(ML)模型能够自动从历史观测数据中学习出空气污染的内在物理和化学规律,显示了它们在不同的空气质量分析任务中的杰出应用潜力。在这篇文章中,我们提供了一个概要的机器学习基于空气质量分析的调查,包括数据收集到预处理的步骤,以及各种分析任务,如污染图像探索、空气质量推断和预测。此外,我们提供了现有的方法和应用的系统性概括和摘要,同时提供了访问公共空气质量数据的列表,以便进一步研究这个方向。最后,我们点出了未来研究的一些有前途的方向。这篇调查可以作为专业人员寻找适合他们特定挑战的适当解决方案,并为他们的研究进一步发展到边缘领域。

STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2310.09615
  • repo_url: https://github.com/weipu-zhang/storm
  • paper_authors: Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, Gao Huang
  • for: 本研究旨在提升基于模型的强化学习算法在视觉输入环境中的表现。
  • methods: 该方法首先通过自监督学习构建参数化的模拟世界模型,然后利用世界模型的想象来改进智能体的策略。
  • results: 该方法可以达到人类水平的表现($126.7%$ 的 Atari $100$k 评估标准),并且比之前的方法更加高效。
    Abstract Recently, model-based reinforcement learning algorithms have demonstrated remarkable efficacy in visual input environments. These approaches begin by constructing a parameterized simulation world model of the real environment through self-supervised learning. By leveraging the imagination of the world model, the agent's policy is enhanced without the constraints of sampling from the real environment. The performance of these algorithms heavily relies on the sequence modeling and generation capabilities of the world model. However, constructing a perfectly accurate model of a complex unknown environment is nearly impossible. Discrepancies between the model and reality may cause the agent to pursue virtual goals, resulting in subpar performance in the real environment. Introducing random noise into model-based reinforcement learning has been proven beneficial. In this work, we introduce Stochastic Transformer-based wORld Model (STORM), an efficient world model architecture that combines the strong sequence modeling and generation capabilities of Transformers with the stochastic nature of variational autoencoders. STORM achieves a mean human performance of $126.7\%$ on the Atari $100$k benchmark, setting a new record among state-of-the-art methods that do not employ lookahead search techniques. Moreover, training an agent with $1.85$ hours of real-time interaction experience on a single NVIDIA GeForce RTX 3090 graphics card requires only $4.3$ hours, showcasing improved efficiency compared to previous methodologies.
    摘要 近来,基于模型的强化学习算法在视觉输入环境中表现出色。这类方法首先通过自监督学习构建真实环境的参数化模拟世界模型,然后利用世界模型的想象来改进智能体的策略,而不受从真实环境采样的限制。这些算法的性能在很大程度上依赖于世界模型的序列建模与生成能力。然而,为复杂的未知环境构建完全准确的模型几乎不可能;模型与现实之间的差异可能导致智能体追求虚拟目标,从而在真实环境中表现欠佳。已有研究表明,在基于模型的强化学习中引入随机噪声是有益的。在这项工作中,我们提出了基于随机 Transformer 的世界模型(STORM),这一高效的世界模型架构将 Transformer 强大的序列建模与生成能力同变分自编码器的随机性相结合。STORM 在 Atari 100k 基准上取得了 $126.7\%$ 的平均人类归一化得分,在不使用前瞻搜索技术的最新方法中创下新纪录。此外,在单张 NVIDIA GeForce RTX 3090 显卡上训练一个具有 1.85 小时实时交互经验的智能体仅需 4.3 小时,相比以往方法效率更高。

Towards Intelligent Network Management: Leveraging AI for Network Service Detection

  • paper_url: http://arxiv.org/abs/2310.09609
  • repo_url: None
  • paper_authors: Khuong N. Nguyen, Abhishek Sehgal, Yuming Zhu, Junsu Choi, Guanbo Chen, Hao Chen, Boon Loong Ng, Charlie Zhang
  • for: 这个研究旨在开发一个高级的网络流量分类系统,以便在现代无线通讯网络中进行精确的流量分析。
  • methods: 本研究使用机器学习方法分析网络流量,并将其划分为不同的网络服务类型:依据时延需求将相似的网络流量归入同一类服务,并将网络流量流分解为多个更小的流量流,每个流唯一承载一种特定服务;机器学习模型在多种 Wi-Fi 网络条件下采集的标注数据上训练。
  • results: 我们的研究结果显示,我们的方法可以实现高度的精确性,并且可以在不同的无线网络条件下进行类型化。这些结果显示了机器学习在无线技术中的应用潜力。
    Abstract As the complexity and scale of modern computer networks continue to increase, there has emerged an urgent need for precise traffic analysis, which plays a pivotal role in cutting-edge wireless connectivity technologies. This study focuses on leveraging Machine Learning methodologies to create an advanced network traffic classification system. We introduce a novel data-driven approach that excels in identifying various network service types in real-time, by analyzing patterns within the network traffic. Our method organizes similar kinds of network traffic into distinct categories, referred to as network services, based on latency requirement. Furthermore, it decomposes the network traffic stream into multiple, smaller traffic flows, with each flow uniquely carrying a specific service. Our ML models are trained on a dataset comprised of labeled examples representing different network service types collected on various Wi-Fi network conditions. Upon evaluation, our system demonstrates a remarkable accuracy in distinguishing the network services. These results emphasize the substantial promise of integrating Artificial Intelligence in wireless technologies. Such an approach encourages more efficient energy consumption, enhances Quality of Service assurance, and optimizes the allocation of network resources, thus laying a solid groundwork for the development of advanced intelligent networks.
    摘要 随着现代计算机网络的复杂性和规模不断增加,精准的流量分析已成为前沿无线连接技术的迫切需求。本研究利用机器学习方法构建了先进的网络流量分类系统。我们提出了一种新的数据驱动方法,通过分析网络流量中的模式,实时识别不同的网络服务类型。该方法根据时延需求将相似的网络流量归入不同类别(称为网络服务),并将网络流量流分解为多个更小的流量流,每个流唯一承载一种特定服务。我们的机器学习模型在多种 Wi-Fi 网络条件下采集的、代表不同网络服务类型的标注样本数据集上进行训练。评估结果显示,该系统在区分网络服务方面表现出极高的准确率。这些结果凸显了将人工智能融入无线技术的巨大前景:此类方法有助于更高效的能耗、提升服务质量保障并优化网络资源分配,从而为先进智能网络的发展奠定坚实基础。

Adaptive maximization of social welfare

  • paper_url: http://arxiv.org/abs/2310.09597
  • repo_url: None
  • paper_authors: Nicolo Cesa-Bianchi, Roberto Colomboni, Maximilian Kasy
  • for: 本文研究反复选择政策以最大化社会福利的问题。
  • methods: 本文通过试验学习响应函数,并推导出后悔(regret)界。
  • results: 本文证明累积后悔以 $T^{2/3}$ 的速率增长,这意味着 (i) 福利最大化比多臂老虎机问题更难,且 (ii) 所提算法达到了最优速率;若社会福利为凹函数,则可用二分搜索算法达到 $T^{1/2}$ 的速率。
    Abstract We consider the problem of repeatedly choosing policies to maximize social welfare. Welfare is a weighted sum of private utility and public revenue. Earlier outcomes inform later policies. Utility is not observed, but indirectly inferred. Response functions are learned through experimentation. We derive a lower bound on regret, and a matching adversarial upper bound for a variant of the Exp3 algorithm. Cumulative regret grows at a rate of $T^{2/3}$. This implies that (i) welfare maximization is harder than the multi-armed bandit problem (with a rate of $T^{1/2}$ for finite policy sets), and (ii) our algorithm achieves the optimal rate. For the stochastic setting, if social welfare is concave, we can achieve a rate of $T^{1/2}$ (for continuous policy sets), using a dyadic search algorithm. We analyze an extension to nonlinear income taxation, and sketch an extension to commodity taxation. We compare our setting to monopoly pricing (which is easier), and price setting for bilateral trade (which is harder).
    摘要 我们考虑反复选择政策以最大化社会福利的问题。社会福利是私人效用与公共收入的加权和。早期结果会为后续政策提供信息;效用无法直接观察,只能间接推断;响应函数通过试验学习得到。我们推导出后悔的下界,并给出 Exp3 算法一个变体的匹配对抗性上界。累积后悔以 $T^{2/3}$ 的速率增长,这意味着 (i) 福利最大化比多臂老虎机问题更难(有限政策集合下后者的速率为 $T^{1/2}$),且 (ii) 我们的算法达到了最优速率。在随机设定下,若社会福利为凹函数,则可对连续政策集合使用二分搜索算法达到 $T^{1/2}$ 的速率。我们分析了向非线性所得税的扩展,并简述了向商品税的扩展;同时将我们的设定与垄断定价(更容易)及双边交易定价(更难)进行了比较。
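
For orientation, the textbook Exp3 algorithm over a finite policy set is sketched below. The paper analyzes a variant with a matching adversarial bound; here the reward stands in for observed welfare, assumed rescaled to [0, 1], and the environment function is hypothetical.

```python
# Textbook Exp3 over a finite policy set (the paper studies a variant of this).
# Observed welfare is assumed rescaled to [0, 1].
import numpy as np

def exp3(n_policies, horizon, welfare_of, eta=None, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    eta = np.sqrt(2 * np.log(n_policies) / (horizon * n_policies)) if eta is None else eta
    log_weights = np.zeros(n_policies)
    history = []
    for t in range(horizon):
        w = np.exp(log_weights - log_weights.max())
        probs = w / w.sum()
        k = rng.choice(n_policies, p=probs)
        reward = welfare_of(k)                       # noisy welfare in [0, 1]
        log_weights[k] += eta * reward / probs[k]    # importance-weighted update
        history.append((k, reward))
    return history

# toy run: policy 2 yields the highest expected welfare
rng = np.random.default_rng(1)
welfare = lambda k: float(np.clip([0.3, 0.5, 0.8][k] + rng.normal(0, 0.05), 0.0, 1.0))
hist = exp3(3, 2000, welfare)
```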

Causality and Independence Enhancement for Biased Node Classification

  • paper_url: http://arxiv.org/abs/2310.09586
  • repo_url: https://github.com/chen-gx/cie
  • paper_authors: Guoxin Chen, Yongqing Wang, Fangda Guo, Qinglang Guo, Jiangli Shao, Huawei Shen, Xueqi Cheng
  • for: The paper aims to address the problem of out-of-distribution (OOD) generalization for node classification on graphs, specifically focusing on mixed biases and low-resource scenarios.
  • methods: The proposed Causality and Independence Enhancement (CIE) framework estimates causal and spurious features at the node representation level, mitigates the influence of spurious correlations through backdoor adjustment, and introduces independence constraint to improve discriminability and stability of causal and spurious features.
  • results: The proposed CIE approach significantly enhances the performance of graph neural networks (GNNs) and outperforms state-of-the-art debiased node classification methods in various scenarios, including specific types of data biases, mixed biases, and low-resource scenarios.
    Abstract Most existing methods that address out-of-distribution (OOD) generalization for node classification on graphs primarily focus on a specific type of data biases, such as label selection bias or structural bias. However, anticipating the type of bias in advance is extremely challenging, and designing models solely for one specific type may not necessarily improve overall generalization performance. Moreover, limited research has focused on the impact of mixed biases, which are more prevalent and demanding in real-world scenarios. To address these limitations, we propose a novel Causality and Independence Enhancement (CIE) framework, applicable to various graph neural networks (GNNs). Our approach estimates causal and spurious features at the node representation level and mitigates the influence of spurious correlations through the backdoor adjustment. Meanwhile, independence constraint is introduced to improve the discriminability and stability of causal and spurious features in complex biased environments. Essentially, CIE eliminates different types of data biases from a unified perspective, without the need to design separate methods for each bias as before. To evaluate the performance under specific types of data biases, mixed biases, and low-resource scenarios, we conducted comprehensive experiments on five publicly available datasets. Experimental results demonstrate that our approach CIE not only significantly enhances the performance of GNNs but outperforms state-of-the-art debiased node classification methods.
    摘要 现有针对图上节点分类的分布外(OOD)泛化方法大多只关注某一特定类型的数据偏差,如标签选择偏差或结构偏差。然而,提前预知偏差类型极为困难,仅针对某一种偏差设计模型未必能提升整体泛化性能。此外,对于在真实场景中更普遍、更棘手的混合偏差,相关研究仍然有限。为解决这些局限,我们提出了一种新的因果与独立性增强(CIE)框架,可应用于多种图神经网络(GNN)。该方法在节点表示层面估计因果特征与伪特征,并通过后门调整削弱伪相关的影响;同时引入独立性约束,以提升因果特征与伪特征在复杂偏差环境中的可分性与稳定性。本质上,CIE 从统一视角消除不同类型的数据偏差,而无需像以往那样为每种偏差单独设计方法。为评估在特定类型数据偏差、混合偏差及低资源场景下的表现,我们在五个公开数据集上进行了全面实验。实验结果表明,CIE 不仅显著提升了 GNN 的性能,还优于当前最先进的去偏节点分类方法。

Two Sides of The Same Coin: Bridging Deep Equilibrium Models and Neural ODEs via Homotopy Continuation

  • paper_url: http://arxiv.org/abs/2310.09583
  • repo_url: None
  • paper_authors: Shutong Ding, Tianyu Cui, Jingya Wang, Ye Shi
  • for: 本文的主要目标是提出一种新的隐式模型 HomoODE,以解决隐式模型中的平衡点求解问题。
  • methods: 本文利用同伦延拓(homotopy continuation)方法隐式求解平衡点,并开发了一种共享可学习初始点的加速方法以提升模型性能。
  • results: 实验结果表明,HomODE可以在图像分类任务中超过现有的隐式模型,并且具有更好的稳定性和更低的内存占用率。
    Abstract Deep Equilibrium Models (DEQs) and Neural Ordinary Differential Equations (Neural ODEs) are two branches of implicit models that have achieved remarkable success owing to their superior performance and low memory consumption. While both are implicit models, DEQs and Neural ODEs are derived from different mathematical formulations. Inspired by homotopy continuation, we establish a connection between these two models and illustrate that they are actually two sides of the same coin. Homotopy continuation is a classical method of solving nonlinear equations based on a corresponding ODE. Given this connection, we proposed a new implicit model called HomoODE that inherits the property of high accuracy from DEQs and the property of stability from Neural ODEs. Unlike DEQs, which explicitly solve an equilibrium-point-finding problem via Newton's methods in the forward pass, HomoODE solves the equilibrium-point-finding problem implicitly using a modified Neural ODE via homotopy continuation. Further, we developed an acceleration method for HomoODE with a shared learnable initial point. It is worth noting that our model also provides a better understanding of why Augmented Neural ODEs work as long as the augmented part is regarded as the equilibrium point to find. Comprehensive experiments with several image classification tasks demonstrate that HomoODE surpasses existing implicit models in terms of both accuracy and memory consumption.
    摘要 深度平衡模型(DEQ)与神经常微分方程(Neural ODE)是两类隐式模型,凭借出色的性能和较低的内存占用取得了显著成功。尽管二者都是隐式模型,但它们源自不同的数学表述。受同伦延拓的启发,我们建立了这两类模型之间的联系,并说明它们实际上是同一枚硬币的两面。同伦延拓是一种基于相应常微分方程求解非线性方程的经典方法。基于这一联系,我们提出了一种新的隐式模型 HomoODE,它继承了 DEQ 的高精度特性和 Neural ODE 的稳定性。与 DEQ 在前向传播中用牛顿法显式求解平衡点不同,HomoODE 通过同伦延拓、借助改进的 Neural ODE 隐式地求解平衡点。此外,我们为 HomoODE 开发了一种共享可学习初始点的加速方法。值得注意的是,我们的模型还有助于理解增广 Neural ODE 为何有效:只要将增广部分视为待求的平衡点即可。在多个图像分类任务上的全面实验表明,HomoODE 在精度和内存占用两方面均优于现有隐式模型。
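
The classical homotopy continuation referenced in the abstract is easy to sketch numerically. The snippet below shows the generic method only, not the HomoODE architecture: to solve f(x) = 0, the easy problem x - x0 = 0 is deformed into f via H(x, t) = (1 - t)(x - x0) + t f(x), and the solution path is followed by integrating the induced ODE. Function names and the toy fixed-point example are assumptions.

```python
# Generic homotopy continuation for f(x) = 0 (not the paper's model):
# along the path H(x(t), t) = 0 we have dx/dt = -[dH/dx]^{-1} dH/dt,
# with dH/dt = f(x) - (x - x0), integrated from t = 0 to t = 1.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import approx_fprime

def homotopy_solve(f, x0):
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))

    def path_ode(t, x):
        H = lambda z: (1 - t) * (z - x0) + t * f(z)
        # numerical Jacobian dH/dx, built row by row
        J = np.column_stack(
            [approx_fprime(x, lambda z: H(z)[i], 1e-7) for i in range(len(x0))]
        ).T
        dH_dt = f(x) - (x - x0)
        return -np.linalg.solve(J, dH_dt)

    sol = solve_ivp(path_ode, (0.0, 1.0), x0, rtol=1e-8, atol=1e-10)
    return sol.y[:, -1]

# example: find the fixed point of g by solving f(x) = g(x) - x = 0
g = lambda x: np.array([np.cos(x[0]) - 0.2 * x[1], 0.5 * np.tanh(x[0])])
root = homotopy_solve(lambda x: g(x) - x, np.zeros(2))
print(root, g(root) - root)  # residual should be close to zero
```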

Reduced Policy Optimization for Continuous Control with Hard Constraints

  • paper_url: http://arxiv.org/abs/2310.09574
  • repo_url: None
  • paper_authors: Shutong Ding, Jingya Wang, Yali Du, Ye Shi
  • for: 这篇论文的目的是提出一种能够有效处理一般硬约束的受限强化学习(RL)算法。
  • methods: 该算法将 RL 与广义既约梯度(Generalized Reduced Gradient, GRG)算法相结合以处理一般硬约束。具体来说,算法按照 GRG 方法将动作划分为基本动作与非基本动作,由策略网络输出基本动作,再依据等式约束求解得到非基本动作,并通过对非基本动作关于基本动作的隐式求导来更新策略网络。此外,算法还引入了基于既约梯度的动作投影过程,并使用改进的拉格朗日松弛技术保证不等式约束得到满足。
  • results: 与以往受限 RL 算法相比,RPO 在三个新基准环境(包括两个机器人操控任务和一个智能电网运行控制任务)中表现出色,在累积奖励和约束违反两方面均取得更好的性能。
    Abstract Recent advances in constrained reinforcement learning (RL) have endowed reinforcement learning with certain safety guarantees. However, deploying existing constrained RL algorithms in continuous control tasks with general hard constraints remains challenging, particularly in those situations with non-convex hard constraints. Inspired by the generalized reduced gradient (GRG) algorithm, a classical constrained optimization technique, we propose a reduced policy optimization (RPO) algorithm that combines RL with GRG to address general hard constraints. RPO partitions actions into basic actions and nonbasic actions following the GRG method and outputs the basic actions via a policy network. Subsequently, RPO calculates the nonbasic actions by solving equations based on equality constraints using the obtained basic actions. The policy network is then updated by implicitly differentiating nonbasic actions with respect to basic actions. Additionally, we introduce an action projection procedure based on the reduced gradient and apply a modified Lagrangian relaxation technique to ensure inequality constraints are satisfied. To the best of our knowledge, RPO is the first attempt that introduces GRG to RL as a way of efficiently handling both equality and inequality hard constraints. It is worth noting that there is currently a lack of RL environments with complex hard constraints, which motivates us to develop three new benchmarks: two robotics manipulation tasks and a smart grid operation control task. With these benchmarks, RPO achieves better performance than previous constrained RL algorithms in terms of both cumulative reward and constraint violation. We believe RPO, along with the new benchmarks, will open up new opportunities for applying RL to real-world problems with complex constraints.
    摘要 近期受限强化学习(RL)的进展为强化学习提供了一定的安全保障。然而,在具有一般硬约束的连续控制任务中部署现有受限 RL 算法仍然困难,尤其是在硬约束非凸的情形下。受经典约束优化技术——广义既约梯度(GRG)算法的启发,我们提出了一种将 RL 与 GRG 相结合以处理一般硬约束的既约策略优化(RPO)算法。RPO 按照 GRG 方法将动作划分为基本动作与非基本动作,并通过策略网络输出基本动作;随后依据等式约束求解方程得到非基本动作,再通过对非基本动作关于基本动作的隐式求导来更新策略网络。此外,我们引入了基于既约梯度的动作投影过程,并采用改进的拉格朗日松弛技术以保证不等式约束得到满足。据我们所知,RPO 是首个将 GRG 引入 RL、用以高效处理等式与不等式硬约束的尝试。值得注意的是,目前缺乏具有复杂硬约束的 RL 环境,这促使我们开发了三个新基准:两个机器人操控任务和一个智能电网运行控制任务。在这些基准上,RPO 在累积奖励和约束违反两方面均优于以往受限 RL 算法。我们相信 RPO 及这些新基准将为把 RL 应用于具有复杂约束的现实问题开辟新的机会。
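
The GRG-style split between basic and nonbasic actions can be shown with a small numeric sketch. This is not the RPO training loop: it only illustrates, for a linear equality constraint (the paper handles general equality constraints), how the nonbasic block is recovered from the basic block so that every emitted action satisfies the constraint exactly. All names are illustrative.

```python
# Simplified numeric sketch of the GRG-style action split (not the full RL loop):
# for a linear equality constraint A @ a = b, the policy only outputs the "basic"
# block a_B; the "nonbasic" block a_N is solved from the constraint.
import numpy as np

def split_action(a_basic, A, b, basic_idx, nonbasic_idx):
    A_B, A_N = A[:, basic_idx], A[:, nonbasic_idx]
    a_nonbasic = np.linalg.solve(A_N, b - A_B @ a_basic)   # A_N assumed square & invertible
    a = np.empty(A.shape[1])
    a[basic_idx], a[nonbasic_idx] = a_basic, a_nonbasic
    return a

# toy example: 4-dim action, 2 equality constraints -> 2 basic + 2 nonbasic dims
A = np.array([[1.0, 0.5, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
b = np.array([1.0, 0.5])
a_basic = np.array([0.3, -0.2])                  # would come from the policy network
action = split_action(a_basic, A, b, basic_idx=[0, 1], nonbasic_idx=[2, 3])
assert np.allclose(A @ action, b)                # the hard equality constraint holds exactly
```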

Neural network scoring for efficient computing

  • paper_url: http://arxiv.org/abs/2310.09554
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Hugo Waltsburger, Erwan Libessart, Chengfang Ren, Anthony Kolar, Regis Guinvarc’h
  • for: 这篇论文的目的是提出一种新的评估高性能计算和深度学习算法效率的方法,以及一个实现该方法的新开源工具。
  • methods: 该论文采用一种多指标评估方法,综合考虑精细的电力消耗、内存/CPU/GPU 使用率、存储以及网络输入/输出(I/O)。
  • results: 该论文对多种硬件平台上的最新模型进行了评估,实现了对神经网络算法在不同硬件上能效性的测试与评估。
    Abstract Much work has been dedicated to estimating and optimizing workloads in high-performance computing (HPC) and deep learning. However, researchers have typically relied on few metrics to assess the efficiency of those techniques. Most notably, the accuracy, the loss of the prediction, and the computational time with regard to GPUs or/and CPUs characteristics. It is rare to see figures for power consumption, partly due to the difficulty of obtaining accurate power readings. In this paper, we introduce a composite score that aims to characterize the trade-off between accuracy and power consumption measured during the inference of neural networks. For this purpose, we present a new open-source tool allowing researchers to consider more metrics: granular power consumption, but also RAM/CPU/GPU utilization, as well as storage, and network input/output (I/O). To our best knowledge, it is the first fit test for neural architectures on hardware architectures. This is made possible thanks to reproducible power efficiency measurements. We applied this procedure to state-of-the-art neural network architectures on miscellaneous hardware. One of the main applications and novelties is the measurement of algorithmic power efficiency. The objective is to allow researchers to grasp their algorithms' efficiencies better. This methodology was developed to explore trade-offs between energy usage and accuracy in neural networks. It is also useful when fitting hardware for a specific task or to compare two architectures more accurately, with architecture exploration in mind.
    摘要 许多研究致力于高性能计算(HPC)和深度学习中工作负载的评估与优化。然而,研究人员通常只依靠少数指标来评估这些技术的效率,主要包括准确率、预测损失,以及与 GPU 和/或 CPU 特性相关的计算时间;电力消耗的数据很少出现,部分原因是难以获取准确的功耗读数。在这篇论文中,我们提出了一个复合评分,旨在刻画神经网络推理期间准确率与功耗之间的权衡。为此,我们开发了一个新的开源工具,允许研究人员考虑更多指标:细粒度的功耗,以及 RAM/CPU/GPU 利用率、存储和网络输入/输出(I/O)。据我们所知,这是首个针对神经网络架构在硬件架构上的适配测试,其实现得益于可复现的能效测量。我们将该流程应用于多种硬件上的最新神经网络架构,其中一项主要应用与创新是算法能效的测量,旨在帮助研究人员更好地把握其算法的效率。该方法用于探索神经网络中能耗与准确率之间的权衡,也适用于为特定任务选配硬件,或在进行架构探索时更准确地比较两种架构。
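
A composite accuracy-versus-power score of the kind described could take many forms. The sketch below is purely illustrative: the paper's exact formula is not reproduced here, and the weights and field names are assumptions.

```python
# Illustrative only: the paper's exact scoring formula is not reproduced here.
# A composite score of this kind trades predictive quality against measured
# resource usage; weights and measurement fields below are assumptions.
import math

def efficiency_score(accuracy, energy_joules, latency_s, peak_mem_gb,
                     w_energy=0.5, w_latency=0.3, w_mem=0.2):
    """Higher is better: accuracy divided by a weighted, log-scaled cost of
    energy, latency and peak memory (log scaling keeps very large consumers
    from dominating the denominator)."""
    cost = (w_energy * math.log1p(energy_joules)
            + w_latency * math.log1p(latency_s * 1000)   # milliseconds scale
            + w_mem * math.log1p(peak_mem_gb))
    return accuracy / cost

# comparing two hypothetical models measured on the same task
print(efficiency_score(accuracy=0.81, energy_joules=120.0, latency_s=0.030, peak_mem_gb=2.1))
print(efficiency_score(accuracy=0.84, energy_joules=900.0, latency_s=0.120, peak_mem_gb=7.9))
```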

ARTree: A Deep Autoregressive Model for Phylogenetic Inference

  • paper_url: http://arxiv.org/abs/2310.09553
  • repo_url: https://github.com/tyuxie/artree
  • paper_authors: Tianyu Xie, Cheng Zhang
  • for: developing efficient phylogenetic inference methods
  • methods: 基于图神经网络(GNN)构建树拓扑的深度自回归模型
  • results: 在不依赖手工设计特征的情况下,提供了采样算法与密度估计过程都很简单的树拓扑分布族,并能处理具有挑战性的真实数据问题。
    Abstract Designing flexible probabilistic models over tree topologies is important for developing efficient phylogenetic inference methods. To do that, previous works often leverage the similarity of tree topologies via hand-engineered heuristic features which would require pre-sampled tree topologies and may suffer from limited approximation capability. In this paper, we propose a deep autoregressive model for phylogenetic inference based on graph neural networks (GNNs), called ARTree. By decomposing a tree topology into a sequence of leaf node addition operations and modeling the involved conditional distributions based on learnable topological features via GNNs, ARTree can provide a rich family of distributions over the entire tree topology space that have simple sampling algorithms and density estimation procedures, without using heuristic features. We demonstrate the effectiveness and efficiency of our method on a benchmark of challenging real data tree topology density estimation and variational Bayesian phylogenetic inference problems.
    摘要 为树拓扑设计灵活的概率模型是开发高效系统发育推断方法的关键。以往工作通常借助手工设计的启发式特征来利用树拓扑之间的相似性,这既需要预先采样的树拓扑,也可能限制近似能力。本文提出了一种基于图神经网络(GNN)的系统发育推断深度自回归模型 ARTree。通过将树拓扑分解为一系列添加叶节点的操作,并利用 GNN 学习的拓扑特征来建模其中的条件分布,ARTree 能够在整个树拓扑空间上给出一族分布,其采样算法与密度估计过程都很简单,且无需启发式特征。我们在具有挑战性的真实数据树拓扑密度估计与变分贝叶斯系统发育推断基准上验证了该方法的有效性与高效性。
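
The leaf-addition decomposition ARTree relies on can be illustrated without any neural network: an unrooted binary topology over n taxa is built by starting from the 3-taxon star and repeatedly attaching the next leaf to one of the current edges; ARTree models the conditional distribution over which edge to pick. The enumeration below is plain Python, not the GNN model.

```python
# Plain-Python illustration of the leaf-addition decomposition (not the GNN model).
from itertools import count

def initial_tree():
    """3-taxon tree: leaves 0, 1, 2 joined at internal node -1."""
    return [(0, -1), (1, -1), (2, -1)]

def attach_leaf(edges, edge_index, new_leaf, new_internal):
    """Attach `new_leaf` by subdividing edges[edge_index] with `new_internal`."""
    u, v = edges[edge_index]
    rest = edges[:edge_index] + edges[edge_index + 1:]
    return rest + [(u, new_internal), (new_internal, v), (new_leaf, new_internal)]

def enumerate_topologies(n_taxa):
    """Yield all unrooted binary topologies (as edge lists) over n_taxa leaves."""
    internal_ids = count(-2, -1)
    trees = [initial_tree()]
    for leaf in range(3, n_taxa):
        trees = [attach_leaf(t, i, leaf, next(internal_ids))
                 for t in trees for i in range(len(t))]
    return trees

print(len(enumerate_topologies(5)))  # 15 topologies for 5 taxa, i.e. (2*5-5)!!
```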

Hypernetwork-based Meta-Learning for Low-Rank Physics-Informed Neural Networks

  • paper_url: http://arxiv.org/abs/2310.09528
  • repo_url: None
  • paper_authors: Woojin Cho, Kookjin Lee, Donsub Rim, Noseong Park
  • for: This paper aims to explore the use of physics-informed neural networks (PINNs) as a solver for repetitive numerical simulations of partial differential equations (PDEs) in various engineering and applied science applications.
  • methods: The proposed method uses a lightweight low-rank PINNs with only hundreds of model parameters and an associated hypernetwork-based meta-learning algorithm to efficiently approximate solutions of PDEs for varying ranges of PDE input parameters.
  • results: The proposed method is effective in overcoming the “failure modes” of PINNs and shows promising results in solving PDEs for many-query scenarios.
    Abstract In various engineering and applied science applications, repetitive numerical simulations of partial differential equations (PDEs) for varying input parameters are often required (e.g., aircraft shape optimization over many design parameters) and solvers are required to perform rapid execution. In this study, we suggest a path that potentially opens up a possibility for physics-informed neural networks (PINNs), emerging deep-learning-based solvers, to be considered as one such solver. Although PINNs have pioneered a proper integration of deep-learning and scientific computing, they require repetitive time-consuming training of neural networks, which is not suitable for many-query scenarios. To address this issue, we propose a lightweight low-rank PINNs containing only hundreds of model parameters and an associated hypernetwork-based meta-learning algorithm, which allows efficient approximation of solutions of PDEs for varying ranges of PDE input parameters. Moreover, we show that the proposed method is effective in overcoming a challenging issue, known as "failure modes" of PINNs.
    摘要 在多种工程和应用科学领域中,需要重复地数值仿真部分偏微分方程(PDEs)的输入参数变化(例如,飞机设计参数的优化),而且需要速度快的计算。在这种情况下,我们建议使用物理学习神经网络(PINNs)作为一种可能的解决方案。虽然PINNs已经实现了深度学习和科学计算的有效结合,但它们需要重复的时间consuming的神经网络训练,这不适合多个查询场景。为解决这个问题,我们提出了一种轻量级低级PINNs,具有只有百个模型参数,以及相关的卷积网络基于meta学习算法,以高效地估算PDEs的解决方案。此外,我们还证明了我们的方法可以有效地解决PINNs的“失败模式”问题。

  • paper_url: http://arxiv.org/abs/2310.09516
  • repo_url: None
  • paper_authors: Yuxin Wang, Xiannian Hu, Quan Gan, Xuanjing Huang, Xipeng Qiu, David Wipf
  • for: 这篇论文旨在提升图神经网络(GNN)在链接预测任务上的性能。
  • methods: 本文提出了一种新的 GNN 架构,其前向传播同时依赖正边和负边,以生成更灵活但依然低开销的节点级嵌入;具体做法是将嵌入重新表述为某个前向传播专用能量函数(不同于实际训练损失)的最小化解,该能量函数鼓励正负样本相互分离。
  • results: 实验结果表明,这种新架构保留了节点级模型的推理速度,同时取得了与边级模型相当的精度。
    Abstract Graph neural networks (GNNs) for link prediction can loosely be divided into two broad categories. First, \emph{node-wise} architectures pre-compute individual embeddings for each node that are later combined by a simple decoder to make predictions. While extremely efficient at inference time (since node embeddings are only computed once and repeatedly reused), model expressiveness is limited such that isomorphic nodes contributing to candidate edges may not be distinguishable, compromising accuracy. In contrast, \emph{edge-wise} methods rely on the formation of edge-specific subgraph embeddings to enrich the representation of pair-wise relationships, disambiguating isomorphic nodes to improve accuracy, but with the cost of increased model complexity. To better navigate this trade-off, we propose a novel GNN architecture whereby the \emph{forward pass} explicitly depends on \emph{both} positive (as is typical) and negative (unique to our approach) edges to inform more flexible, yet still cheap node-wise embeddings. This is achieved by recasting the embeddings themselves as minimizers of a forward-pass-specific energy function (distinct from the actual training loss) that favors separation of positive and negative samples. As demonstrated by extensive empirical evaluations, the resulting architecture retains the inference speed of node-wise models, while producing competitive accuracy with edge-wise alternatives.
    摘要 Graph Neural Networks (GNNs) for link prediction can be broadly divided into two categories. First, \emph{node-wise} architectures pre-compute individual embeddings for each node and then combine them using a simple decoder to make predictions. While efficient at inference time, the model's expressiveness is limited, and isomorphic nodes contributing to candidate edges may not be distinguishable, compromising accuracy. In contrast, \emph{edge-wise} methods form edge-specific subgraph embeddings to enrich the representation of pair-wise relationships, disambiguating isomorphic nodes and improving accuracy, but with increased model complexity. To better balance this trade-off, we propose a novel GNN architecture that depends on both positive and negative edges in the forward pass to inform more flexible, yet still cheap node-wise embeddings. This is achieved by recasting the embeddings as minimizers of a forward-pass-specific energy function that favors separation of positive and negative samples. As shown by extensive empirical evaluations, the resulting architecture retains the inference speed of node-wise models while producing competitive accuracy with edge-wise alternatives.

Online Parameter Identification of Generalized Non-cooperative Game

  • paper_url: http://arxiv.org/abs/2310.09511
  • repo_url: None
  • paper_authors: Jianguo Chen, Jinlong Lei, Hongsheng Qi, Yiguang Hong
  • for: 这篇论文研究了一种通用非合作游戏中的参数识别问题,其中每个玩家的成本函数受到可观察信号和一些未知参数的影响。我们考虑的情况是,游戏的平衡状态在一些可观察信号下可以观察到噪音,而我们的目标是通过观察数据来确定未知参数。
  • methods: 我们将这个参数辨识问题建模为在线优化问题,并提出了一种新的在线参数辨识算法。我们构造了一个在保守性与校正性之间取得平衡的正则化损失函数,并证明:当玩家的成本函数关于未知参数为线性、且学习率满足 $\mu_k \propto 1/\sqrt{k}$ 时,所提算法的后悔界为 $O(\sqrt{K})$。
  • results: 我们通过一个Nash-Cournot问题的数值实验表明,提出的算法在线参数识别性能与Offline设置相当。
    Abstract This work studies the parameter identification problem of a generalized non-cooperative game, where each player's cost function is influenced by an observable signal and some unknown parameters. We consider the scenario where equilibrium of the game at some observable signals can be observed with noises, whereas our goal is to identify the unknown parameters with the observed data. Assuming that the observable signals and the corresponding noise-corrupted equilibriums are acquired sequentially, we construct this parameter identification problem as online optimization and introduce a novel online parameter identification algorithm. To be specific, we construct a regularized loss function that balances conservativeness and correctiveness, where the conservativeness term ensures that the new estimates do not deviate significantly from the current estimates, while the correctiveness term is captured by the Karush-Kuhn-Tucker conditions. We then prove that when the players' cost functions are linear with respect to the unknown parameters and the learning rate of the online parameter identification algorithm satisfies \mu_k \propto 1/\sqrt{k}, along with other assumptions, the regret bound of the proposed algorithm is O(\sqrt{K}). Finally, we conduct numerical simulations on a Nash-Cournot problem to demonstrate that the performance of the online identification algorithm is comparable to that of the offline setting.
    摘要 本文研究一类广义非合作博弈的参数辨识问题,其中每个玩家的成本函数受可观测信号和若干未知参数的影响。我们考虑如下情形:在某些可观测信号下,可以带噪声地观测到博弈的均衡,而目标是利用观测数据辨识未知参数。假设可观测信号及相应的带噪均衡是依次获得的,我们将该参数辨识问题表述为在线优化问题,并提出了一种新的在线参数辨识算法。具体而言,我们构造了一个在保守性与校正性之间取得平衡的正则化损失函数:保守项确保新的估计不会大幅偏离当前估计,而校正项由 Karush-Kuhn-Tucker 条件刻画。我们证明,当玩家的成本函数关于未知参数为线性、且在线参数辨识算法的学习率满足 $\mu_k \propto 1/\sqrt{k}$ 时,在其他假设成立的条件下,所提算法的后悔界为 $O(\sqrt{K})$。最后,我们在一个 Nash-Cournot 问题上进行数值仿真,结果表明在线辨识算法的性能可与离线设定相当。
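
The regularized online step with learning rate proportional to 1/sqrt(k) can be sketched generically. This is not the paper's exact algorithm (the KKT-based correction term and equality constraints are not reproduced); the gradient oracle and toy example below are placeholders.

```python
# Generic sketch of the online step described above (not the paper's algorithm):
# at round k the new estimate stays close to the current one (conservativeness)
# while reducing a correction loss from the observed noisy equilibrium
# (correctiveness), with step size mu_k proportional to 1/sqrt(k).
import numpy as np

def online_identification(correction_grad, theta0, n_rounds, mu0=1.0):
    """correction_grad(theta, k) -> gradient of the round-k correctiveness term."""
    theta = np.array(theta0, dtype=float)
    for k in range(1, n_rounds + 1):
        mu_k = mu0 / np.sqrt(k)
        # minimizing 0.5*||theta' - theta||^2 + mu_k * loss_k(theta') to first
        # order gives this damped gradient step
        theta = theta - mu_k * correction_grad(theta, k)
    return theta

# toy example: recover theta* from streaming noisy linear observations
rng = np.random.default_rng(0)
theta_star = np.array([2.0, -1.0, 0.5])
def grad(theta, k):
    x = rng.standard_normal(3)
    y = x @ theta_star + rng.normal(0, 0.1)
    return (x @ theta - y) * x

print(online_identification(grad, np.zeros(3), n_rounds=5000))  # approaches theta_star
```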

Advancing Test-Time Adaptation for Acoustic Foundation Models in Open-World Shifts

  • paper_url: http://arxiv.org/abs/2310.09505
  • repo_url: None
  • paper_authors: Hongfu Liu, Hengguan Huang, Ye Wang
  • for: 本研究旨在解决声学基础模型在推理阶段遇到的开放世界分布偏移问题。
  • methods: 本研究提出了一种不依赖传统启发式规则的新自适应方法,通过置信度增强和一致性正则化来提升适应能力。
  • results: 实验结果表明,本方法在合成与真实数据上均表现出色,超过了现有基线。
    Abstract Test-Time Adaptation (TTA) is a critical paradigm for tackling distribution shifts during inference, especially in visual recognition tasks. However, while acoustic models face similar challenges due to distribution shifts in test-time speech, TTA techniques specifically designed for acoustic modeling in the context of open-world data shifts remain scarce. This gap is further exacerbated when considering the unique characteristics of acoustic foundation models: 1) they are primarily built on transformer architectures with layer normalization and 2) they deal with test-time speech data of varying lengths in a non-stationary manner. These aspects make the direct application of vision-focused TTA methods, which are mostly reliant on batch normalization and assume independent samples, infeasible. In this paper, we delve into TTA for pre-trained acoustic models facing open-world data shifts. We find that noisy, high-entropy speech frames, often non-silent, carry key semantic content. Traditional TTA methods might inadvertently filter out this information using potentially flawed heuristics. In response, we introduce a heuristic-free, learning-based adaptation enriched by confidence enhancement. Noting that speech signals' short-term consistency, we also apply consistency regularization during test-time optimization. Our experiments on synthetic and real-world datasets affirm our method's superiority over existing baselines.
    摘要 测试时自适应(TTA)是在推理阶段应对分布偏移的关键范式,尤其是在视觉识别任务中。然而,尽管声学模型同样面临测试时语音分布偏移的挑战,专门针对开放世界数据偏移下声学建模的 TTA 技术仍然稀缺。考虑到声学基础模型的独特性质,这一差距更加突出:1)它们主要基于带层归一化的 Transformer 架构;2)它们以非平稳的方式处理长度各异的测试时语音数据。这使得主要依赖批归一化、并假设样本相互独立的视觉向 TTA 方法难以直接套用。本文深入研究了预训练声学模型在开放世界数据偏移下的 TTA。我们发现,噪声大、熵较高且通常非静音的语音帧往往承载关键语义内容,而传统 TTA 方法可能会依据有缺陷的启发式规则无意中滤除这些信息。对此,我们提出了一种无启发式、基于学习并辅以置信度增强的自适应方法;鉴于语音信号的短时一致性,我们还在测试时优化中引入了一致性正则化。在合成与真实数据集上的实验证实了我们的方法优于现有基线。
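
One test-time adaptation step in the spirit described above can be sketched as entropy minimization over all frames (no heuristic frame filtering) plus a consistency term between two views. The model, augmentation, and which parameters are adapted are placeholders supplied by the caller; the paper's learned confidence enhancement is not reproduced here.

```python
# High-level sketch of one TTA step (not the paper's exact recipe).
import torch
import torch.nn.functional as F

def tta_step(model, speech, augment, optimizer, lam=0.5):
    """model(speech) -> frame-level logits; optimizer updates e.g. LayerNorm params only."""
    logits = model(speech)                                   # (frames, classes)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy_loss = -(probs * log_probs).sum(-1).mean()       # keep every frame

    logits_aug = model(augment(speech))                      # perturbed view
    consistency = F.kl_div(F.log_softmax(logits_aug, dim=-1), probs.detach(),
                           reduction="batchmean")            # short-term consistency

    loss = entropy_loss + lam * consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```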

ARM: Refining Multivariate Forecasting with Adaptive Temporal-Contextual Learning

  • paper_url: http://arxiv.org/abs/2310.09488
  • repo_url: None
  • paper_authors: Jiecheng Lu, Xu Han, Shihao Yang
  • for: 这篇论文旨在解决长期时间序列预测(LTSF)中复杂的时间-上下文关系问题,以提升 LTSF 的准确性和效率。
  • methods: 本论文提出了 ARM 方法:一种多变量时间-上下文自适应学习方法,包含自适应单变量效应学习(AUEL)、随机丢弃(RD)训练策略和多核局部平滑(MKLS),能更好地刻画各时间序列的自身特征并正确学习序列间的相互关联。
  • results: 在多个基准上的评估表明,ARM 可在不显著增加计算成本的情况下相对 vanilla Transformer 取得明显改进;ARM 还可应用于其他 LTSF 架构,进一步提升 LTSF 的准确性和效率。
    Abstract Long-term time series forecasting (LTSF) is important for various domains but is confronted by challenges in handling the complex temporal-contextual relationships. As multivariate input models underperforming some recent univariate counterparts, we posit that the issue lies in the inefficiency of existing multivariate LTSF Transformers to model series-wise relationships: the characteristic differences between series are often captured incorrectly. To address this, we introduce ARM: a multivariate temporal-contextual adaptive learning method, which is an enhanced architecture specifically designed for multivariate LTSF modelling. ARM employs Adaptive Univariate Effect Learning (AUEL), Random Dropping (RD) training strategy, and Multi-kernel Local Smoothing (MKLS), to better handle individual series temporal patterns and correctly learn inter-series dependencies. ARM demonstrates superior performance on multiple benchmarks without significantly increasing computational costs compared to vanilla Transformer, thereby advancing the state-of-the-art in LTSF. ARM is also generally applicable to other LTSF architecture beyond vanilla Transformer.
    摘要 长期时间序列预测(LTSF)在多个领域都具有重要性,但在处理复杂的时间-上下文关系时面临挑战。鉴于多变量输入模型的表现不及一些最新的单变量模型,我们认为问题在于现有多变量 LTSF Transformer 难以正确建模序列间关系:不同序列之间的特性差异常常被错误地捕捉。为解决这个问题,我们引入 ARM:一种多变量时间-上下文自适应学习方法,是专为多变量 LTSF 建模设计的改进架构。ARM 使用 Adaptive Univariate Effect Learning(AUEL)、Random Dropping(RD)训练策略和 Multi-kernel Local Smoothing(MKLS),以更好地处理各个序列的时间模式并正确地学习序列间依赖关系。ARM 在多个基准测试集上表现出色,且相比 vanilla Transformer 不会显著增加计算成本,从而推进了 LTSF 的最新水平。ARM 还可以应用于 vanilla Transformer 之外的其他 LTSF 架构。
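
As a rough illustration of the Random Dropping (RD) idea, the sketch below zeroes out a random subset of series in each training sample so a multivariate forecaster cannot over-rely on any fixed combination of channels. The masking granularity and loss weighting are assumptions; ARM's actual AUEL and MKLS modules are not reproduced here.

```python
import torch

def random_drop(batch_x, batch_y, drop_prob=0.5):
    """batch_x: (B, L_in, C), batch_y: (B, L_out, C); drop whole series (channels)."""
    B, _, C = batch_x.shape
    keep = (torch.rand(B, 1, C, device=batch_x.device) > drop_prob).float()
    return batch_x * keep, batch_y * keep, keep

def training_step(model, batch_x, batch_y, optimizer):
    x, y, keep = random_drop(batch_x, batch_y)
    pred = model(x)                                   # (B, L_out, C) forecast
    # Score only the series that were kept for each sample.
    denom = (keep.sum() * pred.shape[1]).clamp_min(1.0)
    loss = ((pred - y) ** 2 * keep).sum() / denom
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```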

Applying Bayesian Ridge Regression AI Modeling in Virus Severity Prediction

  • paper_url: http://arxiv.org/abs/2310.09485
  • repo_url: None
  • paper_authors: Jai Pal, Bryan Hong
  • for: This paper aims to review the strengths and weaknesses of Bayesian Ridge Regression, an AI model for virus analysis in healthcare.
  • methods: The paper uses Bayesian Ridge Regression to analyze vast amounts of data and provide more accurate and speedy diagnoses.
  • results: The model shows promising results with room for improvement in data organization, and the severity index serves as a valuable tool for gaining a broad overview of patient care needs. Here's the text in Simplified Chinese:
  • for: 这篇论文旨在评估一种用于医疗领域病毒分析的AI模型(贝叶斯岭回归)的优缺点,以提高医疗系统的效率和质量。
  • methods: 论文使用贝叶斯岭回归模型对庞大数据进行分析,以提供更准确和快速的诊断。
  • results: 模型显示了有前景的结果,数据组织方面仍有改进空间;同时严重性指数可作为全面了解患者护理需求的有价值工具。
    Abstract Artificial intelligence (AI) is a powerful tool for reshaping healthcare systems. In healthcare, AI is invaluable for its capacity to manage vast amounts of data, which can lead to more accurate and speedy diagnoses, ultimately easing the workload on healthcare professionals. As a result, AI has proven itself to be a powerful tool across various industries, simplifying complex tasks and pattern recognition that would otherwise be overwhelming for humans or traditional computer algorithms. In this paper, we review the strengths and weaknesses of Bayesian Ridge Regression, an AI model that can be used to bring cutting-edge virus analysis to healthcare professionals around the world. The model's accuracy assessment revealed promising results, with room for improvement primarily related to data organization. In addition, the severity index serves as a valuable tool for gaining a broad overview of patient care needs, aligning with healthcare professionals' preference for broader categorizations.
    摘要 人工智能(AI)是重塑医疗系统的强大工具。在医疗领域,AI 能够处理海量数据,从而带来更准确、更快速的诊断,最终减轻医疗专业人员的工作负担。因此,AI 已在各行各业证明了自身价值,简化了对人类或传统计算机算法而言难以应付的复杂任务与模式识别。在这篇论文中,我们评估了贝叶斯岭回归模型的优点和缺点,该模型可以为世界各地的医疗专业人员提供前沿的病毒分析。模型的准确性评估显示出有前景的结果,改进空间主要集中在数据组织方面。此外,严重性指数是一个有价值的工具,可用于全面了解患者的护理需求,这与医疗专业人员偏好较宽泛分类的习惯相符。
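
For readers unfamiliar with the model class, the snippet below shows a typical Bayesian Ridge Regression workflow with scikit-learn, including the predictive uncertainty that makes it attractive for a severity-index setting. The feature columns and synthetic targets are placeholders, not the paper's dataset.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))            # e.g. age, vitals, lab values (hypothetical features)
true_w = np.array([2.0, -1.0, 0.5, 0.0, 1.5, -0.5])
y = X @ true_w + rng.normal(scale=0.5, size=500)   # synthetic severity index

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = BayesianRidge()                   # learns weight priors from the data
model.fit(X_tr, y_tr)
pred, std = model.predict(X_te, return_std=True)   # predictive mean and uncertainty
print("MAE:", mean_absolute_error(y_te, pred))
print("mean predictive std:", std.mean())
```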

Can CNNs Accurately Classify Human Emotions? A Deep-Learning Facial Expression Recognition Study

  • paper_url: http://arxiv.org/abs/2310.09473
  • repo_url: None
  • paper_authors: Ashley Jisue Hong, David DiStefano, Sejal Dua
  • for: 这项研究旨在评估一种CNN模型是否能够识别和分类人类表情(积极、中性、消极)。
  • methods: 该模型使用Python编程和预处理的芝加哥人脸数据库数据进行训练,并采用了更简单的设计来进一步检验其能力。
  • results: 研究结果表明,该模型在10,000张图像(数据)上达到75%的准确率,证明了人类情感分析的可能性并且预示了可行的情感AI。
    Abstract Emotional Artificial Intelligences are currently one of the most anticipated developments of AI. If successful, these AIs will be classified as one of the most complex, intelligent nonhuman entities as they will possess sentience, the primary factor that distinguishes living humans and mechanical machines. For AIs to be classified as "emotional," they should be able to empathize with others and classify their emotions because without such abilities they cannot normally interact with humans. This study investigates the CNN model's ability to recognize and classify human facial expressions (positive, neutral, negative). The CNN model made for this study is programmed in Python and trained with preprocessed data from the Chicago Face Database. The model is intentionally designed with less complexity to further investigate its ability. We hypothesized that the model will perform better than chance (33.3%) in classifying each emotion class of input data. The model accuracy was tested with novel images. Accuracy was summarized in a percentage report, comparative plot, and confusion matrix. Results of this study supported the hypothesis as the model had 75% accuracy over 10,000 images (data), highlighting the possibility of AIs that accurately analyze human emotions and the prospect of viable Emotional AIs.
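
The kind of intentionally low-complexity CNN described above can be sketched in a few lines of PyTorch; the layer sizes and the 128x128 grayscale input below are assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class SmallExpressionCNN(nn.Module):
    """Small 3-class classifier for positive / neutral / negative expressions."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):                 # x: (B, 1, 128, 128) grayscale face crops
        return self.classifier(self.features(x).flatten(1))

model = SmallExpressionCNN()
dummy = torch.randn(8, 1, 128, 128)
print(model(dummy).shape)                 # torch.Size([8, 3])
```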

Randomized Benchmarking of Local Zeroth-Order Optimizers for Variational Quantum Systems

  • paper_url: http://arxiv.org/abs/2310.09468
  • repo_url: https://github.com/ltecot/rand_bench_opt_quantum
  • paper_authors: Lucas Tecot, Cho-Jui Hsieh
  • for: 本研究旨在比较不同类型的классиical optimizer在partially-randomized任务中的表现,以更广泛地涵盖量子优化问题的空间。
  • methods: 本研究使用了本地零次ORDER optimizer,因为它们在量子系统上通常有更好的表现和查询效率。
  • results: 实验结果表明,不同类型的 классиical optimizer在不同的任务中的表现有很大差异,并且本地零次ORDER optimizer在一些任务中表现较好。
    Abstract In the field of quantum information, classical optimizers play an important role. From experimentalists optimizing their physical devices to theorists exploring variational quantum algorithms, many aspects of quantum information require the use of a classical optimizer. For this reason, there are many papers that benchmark the effectiveness of different optimizers for specific quantum optimization tasks and choices of parameterized algorithms. However, for researchers exploring new algorithms or physical devices, the insights from these studies don't necessarily translate. To address this concern, we compare the performance of classical optimizers across a series of partially-randomized tasks to more broadly sample the space of quantum optimization problems. We focus on local zeroth-order optimizers due to their generally favorable performance and query-efficiency on quantum systems. We discuss insights from these experiments that can help motivate future works to improve these optimizers for use on quantum systems.
    摘要 在量子信息领域中,古典优化器扮演着重要的角色。从实验室的设备优化到理论家探索量子变量算法,许多量子信息领域的问题都需要使用古典优化器。因此,有很多论文来比较不同的优化器在特定量子优化任务中的效果。然而,对于探索新的算法或物理设备的研究人员,这些研究并不能够直接适用。为了解决这个问题,我们对古典优化器在部分随机任务中的性能进行了比较,以更广泛地涵盖量子优化问题的空间。我们主要关注本地零阶优化器,因为它们在量子系统上通常有良好的性能和查询效率。我们对这些实验的结果进行了讨论,以帮助未来的工作者改进这些优化器的性能在量子系统上。
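
Local zeroth-order optimizers of the kind benchmarked here estimate gradients from a handful of cost evaluations. The SPSA-style sketch below uses two queries per iteration regardless of parameter dimension; the quadratic objective is a stand-in for a variational quantum cost that can only be queried, not differentiated.

```python
import numpy as np

def spsa_minimize(cost, theta, iters=200, a=0.2, c=0.1, seed=0):
    """Simultaneous Perturbation Stochastic Approximation with standard gain schedules."""
    rng = np.random.default_rng(seed)
    for k in range(1, iters + 1):
        ak, ck = a / k ** 0.602, c / k ** 0.101
        delta = rng.choice([-1.0, 1.0], size=theta.shape)   # random +/-1 perturbation
        # Two cost queries per iteration, regardless of dimension.
        grad_est = (cost(theta + ck * delta) - cost(theta - ck * delta)) / (2 * ck) * delta
        theta = theta - ak * grad_est
    return theta

cost = lambda t: np.sum((t - 0.7) ** 2)              # black-box stand-in objective
theta0 = np.zeros(8)
print(cost(spsa_minimize(cost, theta0)))             # should approach 0
```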

eess.IV - 2023-10-14

Contrastive Self-Supervised Learning for Spatio-Temporal Analysis of Lung Ultrasound Videos

  • paper_url: http://arxiv.org/abs/2310.10689
  • repo_url: None
  • paper_authors: Li Chen, Jonathan Rubin, Jiahong Ouyang, Naveen Balaraju, Shubham Patil, Courosh Mehanian, Sourabh Kulhare, Rachel Millin, Kenton W Gregory, Cynthia R Gregory, Meihua Zhu, David O Kessler, Laurie Malia, Almaz Dessie, Joni Rabiner, Di Coneybeare, Bo Shopsin, Andrew Hersh, Cristian Madar, Jeffrey Shupp, Laura S Johnson, Jacob Avila, Kristin Dwyer, Peter Weimersheimer, Balasundar Raju, Jochen Kruecker, Alvin Chen
  • for: 这个研究是为了探讨自我超级学习(SSL)方法在医疗影像应用中的应用,以获得有用的视觉表现,即使标注数据有限。
  • methods: 我们将州先进的对照学习SSL方法扩展到2D+时间医疗超音波影像资料中,通过修改Encoder和增强方法,以学习有用的空间-时间表现,不需要输入数据的限制。
  • results: 我们使用超过27k个lung医疗超音波影像资料,来评估我们的方法。结果显示,我们的方法可以明显改善下测地点和类别lung混合的标注数据。相比基准模型,我们的方法尤其有利于有限的标注数据(例如只有5%的训练集)。
    Abstract Self-supervised learning (SSL) methods have shown promise for medical imaging applications by learning meaningful visual representations, even when the amount of labeled data is limited. Here, we extend state-of-the-art contrastive learning SSL methods to 2D+time medical ultrasound video data by introducing a modified encoder and augmentation method capable of learning meaningful spatio-temporal representations, without requiring constraints on the input data. We evaluate our method on the challenging clinical task of identifying lung consolidations (an important pathological feature) in ultrasound videos. Using a multi-center dataset of over 27k lung ultrasound videos acquired from over 500 patients, we show that our method can significantly improve performance on downstream localization and classification of lung consolidation. Comparisons against baseline models trained without SSL show that the proposed methods are particularly advantageous when the size of labeled training data is limited (e.g., as little as 5% of the training set).
    摘要 自监督学习(SSL)方法通过学习有意义的视觉表示,即使在标注数据有限的情况下,也在医疗影像应用中展现出潜力。在这里,我们将最先进的对比学习SSL方法扩展到2D+时间的医疗超声视频数据,引入改进的编码器和数据增强方法,能够学习有意义的时空表示,而无需对输入数据施加约束。我们在具有挑战性的临床任务(在超声视频中识别肺实变,一种重要的病理特征)上评估了该方法。借助来自500多名患者、超过27k段肺超声视频的多中心数据集,我们显示该方法能够显著提升下游肺实变定位与分类的性能。与未使用SSL训练的基线模型相比,当标注训练数据规模有限(例如仅为训练集的5%)时,所提方法的优势尤为明显。
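
The contrastive objective underlying this style of pre-training can be written compactly. The NT-Xent (SimCLR-style) loss below is a generic example over two augmented views of the same clips; it does not reproduce the paper's spatio-temporal encoder or ultrasound-specific augmentations.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of two augmented views of the same B clips."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2B, D)
    sim = z @ z.t() / temperature                              # scaled cosine similarities
    n = z.shape[0]
    sim.fill_diagonal_(float("-inf"))                          # exclude self-pairs
    # The positive for sample i is its other view, at index (i + B) mod 2B.
    targets = (torch.arange(n, device=z.device) + n // 2) % n
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(nt_xent(z1, z2).item())
```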

X-ray phase and dark-field computed tomography without optical elements

  • paper_url: http://arxiv.org/abs/2310.09496
  • repo_url: None
  • paper_authors: T. A. Leatham, D. M. Paganin, K. S. Morgan
  • for: The paper is written for researchers and practitioners in the field of X-ray imaging, particularly those interested in phase and dark-field computed tomography.
  • methods: The paper presents a new algorithm for X-ray diffuse dark-field imaging based on the x-ray Fokker-Planck equation, which can reconstruct both the sample density and dark-field/diffusion properties in 3D with high spatial resolution.
  • results: The proposed algorithm can be used to reconstruct both the sample density and dark-field Fokker-Planck diffusion coefficients with only two sample exposures at each projection angle, making it a valuable tool for biomedical imaging and industrial settings.Here’s the same information in Simplified Chinese text:
  • for: 这篇论文是为了帮助扫描仪和计算机成像领域的研究人员和实践者,特别是关注phas和dark-field计算机成像的人。
  • methods: 这篇论文提出了一种基于x射线福克-朋克方程的新算法,可以在3D中重建样品的密度和黑场/扩散性质。
  • results: 该算法只需要两个样品曝光角度的样品曝光,就可以成功地重建样品的密度和黑场福克-朋克扩散系数。
    Abstract X-ray diffusive dark-field imaging, which allows spatially unresolved microstructure to be mapped across a sample, is an increasingly popular tool in an array of settings. Here, we present a new algorithm for phase and dark-field computed tomography based on the x-ray Fokker-Planck equation. Needing only a coherent x-ray source, sample, and detector, our propagation-based algorithm can map the sample density and dark-field/diffusion properties of the sample in 3D. Importantly, incorporating dark-field information in the density reconstruction process enables a higher spatial resolution reconstruction than possible with previous propagation-based approaches. Two sample exposures at each projection angle are sufficient for the successful reconstruction of both the sample density and dark-field Fokker-Planck diffusion coefficients. We anticipate that the proposed algorithm may be of benefit in biomedical imaging and industrial settings.
    摘要 X射线漫散射暗场成像能够在样品中映射空间上无法分辨的微结构,正在越来越多的场景中得到应用。本文基于X射线Fokker-Planck方程,提出了一种新的相位与暗场计算机断层成像算法。该方法只需相干X射线源、样品和探测器,即可通过基于传播的方式在三维空间中重建样品密度和暗场/扩散特性。重要的是,在密度重建过程中引入暗场信息,可获得比以往基于传播的方法更高的空间分辨率。每个投影角度只需两次样品曝光,即可成功重建样品密度和暗场Fokker-Planck扩散系数。我们预计该算法将在生物医学成像和工业场景中发挥作用。

PC-bzip2: a phase-space continuity enhanced lossless compression algorithm for light field microscopy data

  • paper_url: http://arxiv.org/abs/2310.09467
  • repo_url: None
  • paper_authors: Changqing Su, Zihan Lin, You Zhou, Shuai Wang, Yuhan Gao, Chenggang Yan, Bo Xiong
  • for: 这种辐射场 fluorescence 微scopy 方法是用于长期高速成像复杂生物系统,如神经活动和蛋白质快速移动的有效和精简的方法。
  • methods: 我们提出了一种基于 GPU 和多核心 CPU 的高速无损压缩方法,它结合了快速Entropy 评估和高速无损压缩。
  • results: 我们的方法可以在不同 SNR 下实现约 10% 的压缩率提升,同时保持高速压缩能力,并且在时间序列数据上表现出superior的压缩率。
    Abstract Light-field fluorescence microscopy (LFM) is a powerful and elegant compact method for long-term high-speed imaging of complex biological systems, such as neuron activities and rapid movements of organelles. LFM experiments typically generate terabytes of image data and require a huge amount of storage space. Some lossy compression algorithms have been proposed recently with good compression performance. However, since the specimen usually only tolerates low power density illumination for long-term imaging with low phototoxicity, the image signal-to-noise ratio (SNR) is relatively low, and some useful position or intensity information would be lost by using such lossy compression algorithms. Here, we propose a phase-space continuity enhanced bzip2 (PC-bzip2) lossless compression method for LFM data as a high-efficiency, open-source tool, which combines GPU-based fast entropy judgement and multi-core-CPU-based high-speed lossless compression. Our proposed method achieves almost 10% compression ratio improvement while keeping the capability of high-speed compression, compared with the original bzip2. We evaluated our method on fluorescence beads data and fluorescence staining cells data with different SNRs. Moreover, by introducing the temporal continuity, our method shows a superior compression ratio on time series data of zebrafish blood vessels.
    摘要 光场荧光显微镜(LFM)是一种强大而精巧紧凑的方法,可对复杂生物系统(如神经活动和细胞器的快速运动)进行长期高速成像。LFM实验通常会产生TB量级的图像数据,需要巨大的存储空间。近期已有一些压缩性能良好的有损压缩算法被提出;但由于样品通常只能承受低功率密度照明以保证长期低光毒性成像,图像信噪比(SNR)相对较低,使用这类有损压缩算法会丢失部分有用的位置或强度信息。为此,我们提出了一种面向LFM数据的相空间连续性增强bzip2(PC-bzip2)无损压缩方法,作为一种高效的开源工具,它结合了基于GPU的快速熵判断和基于多核CPU的高速无损压缩。与原始bzip2相比,我们的方法在保持高速压缩能力的同时,压缩率提升接近10%。我们在不同SNR下的荧光微球数据和荧光染色细胞数据上评估了该方法;此外,通过引入时间连续性,我们的方法在斑马鱼血管的时间序列数据上表现出更优的压缩率。
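
The core decide-then-compress step can be illustrated with the standard library: frames whose byte-level entropy is already near the 8-bit maximum are stored raw, while the rest go to bzip2. The threshold, framing, and single-threaded flow below are simplifications and assumptions; the GPU/multi-core pipeline of PC-bzip2 is not reproduced.

```python
import bz2
import numpy as np

def byte_entropy(buf: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (max 8)."""
    counts = np.bincount(np.frombuffer(buf, dtype=np.uint8), minlength=256)
    p = counts[counts > 0] / len(buf)
    return float(-(p * np.log2(p)).sum())

def compress_frame(frame: bytes, entropy_threshold=7.9):
    if byte_entropy(frame) >= entropy_threshold:
        return b"RAW0" + frame                        # near-random data: not worth compressing
    return b"BZ20" + bz2.compress(frame, compresslevel=9)

frame = np.random.poisson(5, size=512 * 512).astype(np.uint16).tobytes()
out = compress_frame(frame)
print(len(frame), "->", len(out), out[:4])
```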

eess.SP - 2023-10-14

Robust Quickest Change Detection in Non-Stationary Processes

  • paper_url: http://arxiv.org/abs/2310.09673
  • repo_url: None
  • paper_authors: Yingze Hou, Yousef Oleyaeimotlagh, Rahul Mishra, Hoda Bidkhori, Taposh Banerjee
  • for: 这篇论文旨在开发一些具有抗变数能力的检测方法,用于检测非站ARY变化过程中的变化。
  • methods: 论文使用了基于最差分布的方法,即使post-change非站ARY变化家族的分布不确定,这些方法仍能实现最佳和具有抗变数能力。
  • results: 论文透过实验和实际应用,证明了这些方法的有效性和抗变数能力。
    Abstract Optimal algorithms are developed for robust detection of changes in non-stationary processes. These are processes in which the distribution of the data after change varies with time. The decision-maker does not have access to precise information on the post-change distribution. It is shown that if the post-change non-stationary family has a distribution that is least favorable in a well-defined sense, then the algorithms designed using the least favorable distributions are robust and optimal. Non-stationary processes are encountered in public health monitoring and space and military applications. The robust algorithms are applied to real and simulated data to show their effectiveness.
    摘要 本文针对非平稳过程中的变化检测,提出了具有鲁棒性的最优算法。在这类过程中,变化之后的数据分布会随时间变化,而决策者无法获得后变化分布的精确信息。研究表明,如果后变化的非平稳分布族中存在某种定义明确意义下的最不利分布,那么基于最不利分布设计的算法既鲁棒又最优。非平稳过程常见于公共卫生监测以及航天和军事应用。我们将这些鲁棒算法应用于真实数据和仿真数据,以展示其有效性。
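
As background, the classical Gaussian CUSUM detector below is the building block that robust quickest-change procedures generalize; the least-favorable-distribution machinery for non-stationary post-change models described above is not shown. The pre- and post-change means, the noise level, and the threshold are illustrative choices.

```python
import numpy as np

def cusum(stream, mu0=0.0, mu1=1.0, sigma=1.0, threshold=5.0):
    """Return the first index at which the CUSUM statistic crosses the threshold."""
    stat = 0.0
    for t, x in enumerate(stream):
        # Log-likelihood ratio of N(mu1, sigma^2) vs N(mu0, sigma^2) for one sample.
        llr = ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
        stat = max(0.0, stat + llr)
        if stat > threshold:
            return t
    return None

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(1, 1, 100)])
print("alarm at index:", cusum(data))     # expected shortly after t = 200
```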

HAPS in the Non-Terrestrial Network Nexus: Prospective Architectures and Performance Insights

  • paper_url: http://arxiv.org/abs/2310.09659
  • repo_url: None
  • paper_authors: Zhengying Lou, Baha Eddine Youcef Belmekki, Mohamed-Slim Alouini
  • for: This paper discusses the potential of High Altitude Platform Stations (HAPS) in Non-Terrestrial Networks (NTN) and their advantages and challenges.
  • methods: The paper presents various network architectures that incorporate HAPS, including ad-hoc, cell-free, and integrated access and backhaul. The authors also provide comprehensive performance insights when using HAPS in these architectures.
  • results: The paper shows that HAPS can interconnect the NTN nexus and provide versatility in terms of different metrics such as routing latency, energy efficiency, coverage probability, and channel capacity. Additionally, the paper highlights the performance gain provided by HAPS usage in NTN by comparing the results when no HAPS are used.Here is the simplified Chinese version of the three key points:
  • for: 这篇论文探讨了高空平台站(HAPS)在非地面网络(NTN)中的潜在优势和挑战。
  • methods: 论文提出了包括各种网络架构,例如自组织、无终端和集成访问和夹据的HAPS网络架构。作者们还提供了不同架构下HAPS使用时的全面性能探讨。
  • results: 论文显示了HAPS可以连接NTN的聚合点,并提供了不同约束下的多种维度性能探讨,如路由延迟、能效率、概率覆盖和频率容量。此外,论文还比较了无HAPS情况下的性能,进一步强调了HAPS在NTN中的性能提升。
    Abstract High altitude platform stations (HAPS) have recently emerged as a new key stratospheric player in non-terrestrial networks (NTN) alongside satellites and low-altitude platforms. In this paper, we present the main communication links between HAPS and other NTN platforms, their advantages, and their challenges. Then, prospective network architectures in which HAPS plays an indispensable role in the future NTNs are presented such as ad-hoc, cell-free, and integrated access and backhaul. To showcase the importance of HAPS in the NTN, we provide comprehensive performance insights when using HAPS in the prospective architectures with the most suitable communication link. The insights show the HAPS' ability to interconnect the NTN nexus as well as their versatility by incorporating different metrics into the analysis such as routing latency, energy efficiency, coverage probability, and channel capacity. Depending on the architecture, HAPS will play different roles in NTN, such as a UAV network center, satellite relay, and ground network extension. Finally, the performance gain provided by HAPS usage in NTN is further highlighted by comparing the results when no HAPS are used.

On the Radio Stripe Deployment for Indoor RF Wireless Power Transfer

  • paper_url: http://arxiv.org/abs/2310.09585
  • repo_url: None
  • paper_authors: Amirhossein Azarbahram, Onel L. A. Lopez, Petar Popovski, Matti Latva-aho
  • for: 提高无线系统可持续性,Radio Frequency(RF)无线能量传输(WPT)被视为关键技术启动器。
  • methods: 使用分布antenna系统,以减少往返信道损失。
  • results: 针对多个室内能量热点,对RF-WPT系统的radio stripe部署进行了优化,并研究了直线形和多边形两种预定义形状配置。结果表明,所提出的radio stripe部署能够优于中央全数字方阵阵列,而提高系统频率可能会降低其性能。
    Abstract One of the primary goals of future wireless systems is to foster sustainability, for which, radio frequency (RF) wireless power transfer (WPT) is considered a key technology enabler. The key challenge of RF-WPT systems is the extremely low end-to-end efficiency, mainly due to the losses introduced by the wireless channel. Distributed antenna systems are undoubtedly appealing as they can significantly shorten the charging distances, thus, reducing channel losses. Interestingly, radio stripe systems provide a cost-efficient and scalable way to deploy a distributed multi-antenna system, and thus have received a lot of attention recently. Herein, we consider an RF-WPT system with a transmit radio stripe network to charge multiple indoor energy hotspots, i.e., spatial regions where the energy harvesting devices are expected to be located, including near-field locations. We formulate the optimal radio stripe deployment problem aimed to maximize the minimum power received by the users and explore two specific predefined shapes, namely the straight line and polygon-shaped configurations. Then, we provide efficient solutions relying on geometric programming to optimize the location of the radio stripe elements. The results demonstrate that the proposed radio stripe deployments outperform a central fully-digital square array with the same number of elements and utilizing larger radio stripe lengths can enhance the performance, while increasing the system frequency may degrade it.
    摘要 未来无线系统的主要目标之一是实现可持续性,而射频(RF)无线能量传输(WPT)被视为其中的关键使能技术。RF-WPT系统的主要挑战在于端到端效率极低,这主要源于无线信道引入的损耗。分布式天线系统能够显著缩短充电距离、降低信道损耗,因而极具吸引力;其中radio stripe系统提供了一种低成本、可扩展的分布式多天线部署方式,近来受到大量关注。在本文中,我们考虑一个采用发射radio stripe网络的RF-WPT系统,为多个室内能量热点(即预计放置能量收集设备的空间区域,包括近场位置)充电。我们建立了以最大化用户最小接收功率为目标的radio stripe最优部署问题,并研究了直线形和多边形两种预定义形状配置。随后,我们基于几何规划给出了高效的求解方案来优化radio stripe单元的位置。结果表明,所提出的radio stripe部署优于具有相同单元数的中央全数字方阵阵列,采用更长的radio stripe可以提升性能,而提高系统频率可能会使性能下降。

Robust Anti-jamming Communications with DMA-Based Reconfigurable Heterogeneous Array

  • paper_url: http://arxiv.org/abs/2310.09466
  • repo_url: None
  • paper_authors: Kaizhi Huang, Wenyu Jiang, Yajun Chen, Liang Jin, Qingqing Wu, Xiaoling Hu
  • for: 提高商用和军用通信系统中的干扰抗性能。
  • methods: 基于动态金属表 antenna (DMA) 的可重新配置多种性天线(RHA)架构,以增加度数 freedom (DoF) 并进一步提高干扰抗性能。
  • results: 对比传统Homogeneous或Heterogeneous 天线阵列,提出了一种基于RHA和DMA的双步反干扰方案,能够提高干扰抗性能和Robustness。
    Abstract In future commercial and military communication systems, anti-jamming remains a critical issue. Existing homogeneous or heterogeneous arrays with limited degrees of freedom (DoF) and high power consumption are unable to meet the requirements of communication in rapidly changing and intense jamming environments. To address these challenges, we propose a reconfigurable heterogeneous array (RHA) architecture based on dynamic metasurface antennas (DMA), which increases the DoF and further improves anti-jamming capabilities. We propose a two-step anti-jamming scheme based on RHA, where the multipaths are estimated by an atomic norm minimization (ANM) based scheme, and then the received signal-to-interference-plus-noise ratio (SINR) is maximized by jointly designing the phase shift of each DMA element and the weights of the array elements. To solve the challenging non-convex discrete fractional problem along with the estimation errors in the direction of arrival (DoA) and channel state information (CSI), we propose a robust alternative algorithm based on the S-procedure to solve the lower-bound SINR maximization problem. Simulation results demonstrate that the proposed RHA architecture and corresponding schemes have superior performance in terms of jamming immunity and robustness.
    摘要 将来的商业和军事通信系统中,抗干扰是一个关键问题。现有的同质或异质数组(DoF)有限和高消耗无法满足在快速变化和激烈干扰环境下的通信需求。为解决这些挑战,我们提出了可重新配置的异质数组(RHA)架构,基于动态表面天线(DMA),可以提高DoF并进一步提高抗干扰能力。我们提出了基于RHA的两步抗干扰方案,其中首先使用原子规范最小化(ANM)方法来估计multipath,然后使用共同设计DMA元素的相位和数组元素的加权来最大化接收信号干扰plus noise比率(SINR)。为解决非对称粗略分数问题以及DoA和CSI估计错误,我们提出了一种robust的S-过程算法来解决下界SINR最大化问题。实验结果表明,我们提出的RHA架构和相应的方案具有更高的抗干扰能力和稳定性。

cs.SD - 2023-10-13

Low-latency Speech Enhancement via Speech Token Generation

  • paper_url: http://arxiv.org/abs/2310.08981
  • repo_url: None
  • paper_authors: Huaying Xue, Xiulian Peng, Yan Lu
  • for: 这篇论文主要应用于低延迟的语音提高中,将语音提高当作语音生成问题,并使用条件生成模型来生成清晰的语音。
  • methods: 本文提出了一个条件生成框架,使用 neural speech codec 来模型清晰语音,并在 auto-regressive 方式下生成语音 tokens。另外,提出了一个明确的对齐方法来对不同的输入长度进行适应。
  • results: 实验结果显示,这篇论文的方法在关于噪音抗性和时间语音协调方面比于数据驱动方法有更好的表现。
    Abstract Existing deep learning based speech enhancement mainly employ a data-driven approach, which leverage large amounts of data with a variety of noise types to achieve noise removal from noisy signal. However, the high dependence on the data limits its generalization on the unseen complex noises in real-life environment. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal, where we generate clean speech instead of identifying and removing noises. Specifically, we propose a conditional generative framework for speech enhancement, which models clean speech by acoustic codes of a neural speech codec and generates the speech codes conditioned on past noisy frames in an auto-regressive way. Moreover, we propose an explicit-alignment approach to align noisy frames with the generated speech tokens to improve the robustness and scalability to different input lengths. Different from other methods that leverage multiple stages to generate speech codes, we leverage a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality with low latency. Extensive results on both synthetic and real-recorded test set show its superiority over data-driven approaches in terms of noise robustness and temporal speech coherence.
    摘要 现有的深度学习基于的speech增强主要采用数据驱动的方法,利用大量数据中各种噪音来实现噪音从嘈杂信号中去除。然而,这种方法的强依赖于数据限制其在真实环境中的泛化能力。在这篇论文中,我们关注低延迟场景,将speech增强视为一个speech生成问题,我们通过conditioned on the noisy signal来生成干净的speech。 Specifically, we propose a conditional generative framework for speech enhancement, which models clean speech by acoustic codes of a neural speech codec and generates the speech codes conditioned on past noisy frames in an auto-regressive way. Moreover, we propose an explicit-alignment approach to align noisy frames with the generated speech tokens to improve the robustness and scalability to different input lengths. Different from other methods that leverage multiple stages to generate speech codes, we leverage a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality with low latency. 对于synthetic和实际录制的测试集,我们进行了广泛的实验结果,发现我们的方法在噪音鲁棒性和时间干扰相干性方面比数据驱动方法更高。

Transformer-based Autoencoder with ID Constraint for Unsupervised Anomalous Sound Detection

  • paper_url: http://arxiv.org/abs/2310.08950
  • repo_url: None
  • paper_authors: Jian Guan, Youde Liu, Qiuqiang Kong, Feiyang Xiao, Qiaoxi Zhu, Jiantong Tian, Wenwu Wang
  • for: 本研究旨在提出一种基于Transformer的自适应噪声检测方法,用于检测设备发生异常噪声时,只有正常噪声数据可用。
  • methods: 本方法基于ID constrained Transformer-based autoencoder(IDC-TransAE)架构,并使用weighted anomaly score computation来高亮异常事件的异常分数。
  • results: 经过实验表明,提出的方法在DCASE 2020挑战任务2开发数据集上显示出了效果和超越性。
    Abstract Unsupervised anomalous sound detection (ASD) aims to detect unknown anomalous sounds of devices when only normal sound data is available. The autoencoder (AE) and self-supervised learning based methods are two mainstream methods. However, the AE-based methods could be limited as the feature learned from normal sounds can also fit with anomalous sounds, reducing the ability of the model in detecting anomalies from sound. The self-supervised methods are not always stable and perform differently, even for machines of the same type. In addition, the anomalous sound may be short-lived, making it even harder to distinguish from normal sound. This paper proposes an ID constrained Transformer-based autoencoder (IDC-TransAE) architecture with weighted anomaly score computation for unsupervised ASD. Machine ID is employed to constrain the latent space of the Transformer-based autoencoder (TransAE) by introducing a simple ID classifier to learn the difference in the distribution for the same machine type and enhance the ability of the model in distinguishing anomalous sound. Moreover, weighted anomaly score computation is introduced to highlight the anomaly scores of anomalous events that only appear for a short time. Experiments performed on DCASE 2020 Challenge Task2 development dataset demonstrate the effectiveness and superiority of our proposed method.
    摘要 “不监督式异常声检测”(ASD)的目标是检测设备发生未知异常声时,仅使用常规声数据。自动encoder(AE)和基于自我监督学习的方法是主流方法之一。但是AE基础方法可能会受到限制,因为它们从常规声数据学习的特征也可以适应异常声。此外,异常声可能很短暂,使其更难以与常规声区分。这篇论文提出一个具有ID对映运算的Transformer-based autoencoder(IDC-TransAE)架构,并导入权重异常得分计算,以不监督式方式进行异常声检测。此外,我们还导入机器ID,以便将TransAE的内部空间对映到不同机器类型之间的分布。实验结果显示,我们的提案方法在DCASE 2020挑战任务2的开发数据上表现出色,并且与其他方法相比,具有更高的准确性和稳定性。”

Differential Evolution Algorithm based Hyper-Parameters Selection of Convolutional Neural Network for Speech Command Recognition

  • paper_url: http://arxiv.org/abs/2310.08914
  • repo_url: https://github.com/techie5879/hyperparameter-optimization-cnn-differential-evolution
  • paper_authors: Sandipan Dhar, Anuvab Sen, Aritra Bandyopadhyay, Nanda Dulal Jana, Arjun Ghosh, Zahra Sarayloo
  • for: 这篇论文目标是提高卷积神经网络(CNN)在短语音命令识别(SCR)任务中的表现。
  • methods: 该论文提出了基于差分演化(DE)算法的卷积神经网络参数选择方法,以提高SCR任务中卷积神经网络的表现。
  • results: 经过训练和测试使用Google语音命令(GSC)集合,提出的方法在分类语音命令中显示了效果。此外,与基于遗传算法(GA)的选择和其他深度卷积神经网络(DCNN)模型进行比较分析,表明了提出的DE算法在SCR任务中卷积神经网络参数选择中的效率。
    Abstract Speech Command Recognition (SCR), which deals with identification of short uttered speech commands, is crucial for various applications, including IoT devices and assistive technology. Despite the promise shown by Convolutional Neural Networks (CNNs) in SCR tasks, their efficacy relies heavily on hyper-parameter selection, which is typically laborious and time-consuming when done manually. This paper introduces a hyper-parameter selection method for CNNs based on the Differential Evolution (DE) algorithm, aiming to enhance performance in SCR tasks. Training and testing with the Google Speech Command (GSC) dataset, the proposed approach showed effectiveness in classifying speech commands. Moreover, a comparative analysis with Genetic Algorithm based selections and other deep CNN (DCNN) models highlighted the efficiency of the proposed DE algorithm in hyper-parameter selection for CNNs in SCR tasks.
    摘要 《语音指令识别(SCR)》,它是许多应用领域的关键技术,如物联网设备和助手技术。尽管卷积神经网络(CNN)在SCR任务中表现出了承诺,但它们的效果却受到参数选择的影响,这是通常是手动进行的劳动密集和耗时的过程。本文提出了基于差分演化(DE)算法的超参数选择方法,以提高CNN在SCR任务中的性能。通过使用Google语音指令(GSC)数据集进行训练和测试,我们发现了该方法在分类语音指令的能力。此外,我们还进行了与基于遗传算法的选择和其他深度卷积神经网络(DCNN)模型进行比较,发现DE算法在SCR任务中的超参数选择具有效率。
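
Hyper-parameter selection with differential evolution can be set up directly with SciPy. In the snippet below the objective is a placeholder standing in for "train the CNN on Google Speech Commands and return the validation error"; the bounds and parameter encodings are assumptions, not the paper's search space.

```python
import numpy as np
from scipy.optimize import differential_evolution

def validation_error(x):
    lr = 10 ** x[0]                       # log10 learning rate in [-4, -1]
    n_filters = int(round(x[1]))          # filters in the first conv block
    dropout = x[2]
    # Placeholder objective standing in for "train CNN, evaluate on validation split".
    return (np.log10(lr) + 2.5) ** 2 + (n_filters - 48) ** 2 / 1e3 + (dropout - 0.3) ** 2

bounds = [(-4, -1), (16, 128), (0.0, 0.6)]
result = differential_evolution(validation_error, bounds, maxiter=30,
                                popsize=10, seed=0, polish=False)
print("best hyper-parameters:", result.x, "objective:", result.fun)
```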

Learning to Behave Like Clean Speech: Dual-Branch Knowledge Distillation for Noise-Robust Fake Audio Detection

  • paper_url: http://arxiv.org/abs/2310.08869
  • repo_url: None
  • paper_authors: Cunhang Fan, Mingming Ding, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Zhao Lv
  • for: 提高 fake audio detection 系统在噪声混合环境中的性能。
  • methods: 提议 dual-branch knowledge distillation fake audio detection 方法,包括平行数据流、交互融合和响应基于教师生Student paradigm。
  • results: 实验结果表明,提议方法在多个数据集上表现良好,能够在噪声混合环境中维持性能。
    Abstract Most research in fake audio detection (FAD) focuses on improving performance on standard noise-free datasets. However, in actual situations, there is usually noise interference, which will cause significant performance degradation in FAD systems. To improve the noise robustness, we propose a dual-branch knowledge distillation fake audio detection (DKDFAD) method. Specifically, a parallel data flow of the clean teacher branch and the noisy student branch is designed, and interactive fusion and response-based teacher-student paradigms are proposed to guide the training of noisy data from the data distribution and decision-making perspectives. In the noise branch, speech enhancement is first introduced for denoising, which reduces the interference of strong noise. The proposed interactive fusion combines denoising features and noise features to reduce the impact of speech distortion and seek consistency with the data distribution of clean branch. The teacher-student paradigm maps the student's decision space to the teacher's decision space, making noisy speech behave as clean. In addition, a joint training method is used to optimize the two branches to achieve global optimality. Experimental results based on multiple datasets show that the proposed method performs well in noisy environments and maintains performance in cross-dataset experiments.
    摘要 大多数 fake audio detection(FAD)研究都是在标准噪音自由数据集上提高性能。然而,在实际情况下,通常会有噪声干扰,这会导致 FAD 系统的性能下降。为了改善噪声Robustness,我们提出了双支流知识填充 fake audio detection(DKDFAD)方法。具体来说,我们设计了平行数据流的清晰教师支流和噪音学生支流,并提出了交互融合和响应基于教师-学生模式来导导训练噪音数据的方法。在噪音支流中,首先引入了抑制噪音的Speech enhancement,以减少噪音的干扰。我们提出的交互融合将抑制噪音特征和噪音特征融合在一起,以减少语音扭曲的影响和与干净支流的数据分布一致。教师-学生模式将学生决策空间映射到教师决策空间,使噪音语音 behave as 清晰语音。此外,我们使用了全局优化方法来优化两支流以 достичь全局优化。实验结果基于多个数据集表明,我们提出的方法在噪音环境中表现良好,并在跨数据集实验中保持性能。

eess.AS - 2023-10-13

Protecting Voice-Controlled Devices against LASER Injection Attacks

  • paper_url: http://arxiv.org/abs/2310.09404
  • repo_url: https://github.com/hashim19/Laser_Injection_Attack_Identification
  • paper_authors: Hashim Ali, Dhimant Khuttan, Rafi Ud Daula Refat, Hafiz Malik
  • for: 防御 MEMS 麦克风受到激光注入攻击
  • methods: 使用时频划分和高级统计特征来分辨声学和激光注入引起的响应
  • results: 实验结果表明,提posed 框架可以在random数据分区 setting中正确地分类 $98%$ 的声学和激光注入响应,并在 speaker-independent 和 text-independent 数据分区 setting中实现 $100%$ 的正确分类率。
    Abstract Voice-Controllable Devices (VCDs) have seen an increasing trend towards their adoption due to the small form factor of the MEMS microphones and their easy integration into modern gadgets. Recent studies have revealed that MEMS microphones are vulnerable to audio-modulated laser injection attacks. This paper aims to develop countermeasures to detect and prevent laser injection attacks on MEMS microphones. A time-frequency decomposition based on discrete wavelet transform (DWT) is employed to decompose microphone output audio signal into n + 1 frequency subbands to capture photo-acoustic related artifacts. Higher-order statistical features consisting of the first four moments of subband audio signals, e.g., variance, skew, and kurtosis are used to distinguish between acoustic and photo-acoustic responses. An SVM classifier is used to learn the underlying model that differentiates between an acoustic- and laser-induced (photo-acoustic) response in the MEMS microphone. The proposed framework is evaluated on a data set of 190 audios, consisting of 19 speakers. The experimental results indicate that the proposed framework is able to correctly classify $98\%$ of the acoustic- and laser-induced audio in a random data partition setting and $100\%$ of the audio in speaker-independent and text-independent data partition settings.
    摘要 声控设备(VCD)的应用日益普及,这主要得益于MEMS麦克风的小型化及其易于集成到现代设备中。然而,近期研究表明,MEMS麦克风容易受到音频调制激光注入攻击。本文旨在开发针对MEMS麦克风激光注入攻击的检测与防御对策。我们采用基于离散小波变换(DWT)的时频分解,将麦克风输出音频信号分解为n+1个频率子带,以捕捉与光声效应相关的伪影;并使用子带信号的前四阶统计矩(如方差、偏度、峰度)等高阶统计特征来区分声学响应与激光注入(光声)响应。随后,使用SVM分类器学习区分MEMS麦克风中声学响应与激光诱导响应的底层模型。所提框架在包含19名说话人、共190段音频的数据集上进行了评估。实验结果表明,在随机数据划分设置下,该框架能够正确分类98%的声学与激光注入音频;在说话人无关和文本无关的数据划分设置下,正确率达到100%。
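
The described feature pipeline (DWT sub-bands, first four moments per band, SVM classifier) is straightforward to prototype. The wavelet choice ('db4'), decomposition depth, and the synthetic signals below are assumptions for illustration, not the paper's data.

```python
import numpy as np
import pywt
from scipy.stats import skew, kurtosis
from sklearn.svm import SVC

def subband_moments(signal, wavelet="db4", level=4):
    """First four moments of each DWT sub-band (level+1 bands in total)."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    feats = []
    for c in coeffs:
        feats += [c.mean(), c.var(), skew(c), kurtosis(c)]
    return np.array(feats)

rng = np.random.default_rng(0)
acoustic = [rng.normal(size=4096) for _ in range(50)]
# Crude stand-in for a photo-acoustic (laser-induced) response with a different distribution.
laser = [rng.normal(size=4096) + 0.3 * np.sign(rng.normal(size=4096)) for _ in range(50)]
X = np.stack([subband_moments(s) for s in acoustic + laser])
y = np.array([0] * 50 + [1] * 50)
clf = SVC(kernel="rbf").fit(X, y)
print("training accuracy:", clf.score(X, y))
```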

Speaking rate attention-based duration prediction for speed control TTS

  • paper_url: http://arxiv.org/abs/2310.08846
  • repo_url: None
  • paper_authors: Jesuraj Bandekar, Sathvik Udupa, Abhayjeet Singh, Anjali Jayakumar, Deekshitha G, Sandhya Badiger, Saurabh Kumar, Pooja VH, Prasanta Kumar Ghosh
  • for: 控制不同速度因素的声音表达
  • methods: 使用时间预测器中的 speaking rate 控制,进行隐式的 speaking rate 控制
  • results: 通过对听写数据集进行训练,实现了对不同速度因素的声音synthesize,并且对比基eline模型,提高了对话速度的控制和人工评分Here’s a more detailed explanation of each point:
  • for: The paper is written to control the speaking rate of non-autoregressive text-to-speech (TTS) systems.
  • methods: The proposed approach uses a novel method of conditioning the speaking rate inside the duration predictor to achieve implicit speaking rate control.
  • results: The proposed method is evaluated using objective and subjective metrics, and is found to have higher subjective scores and lower speaker rate errors across many speaking rate factors compared to a baseline model.
    Abstract With the advent of high-quality speech synthesis, there is a lot of interest in controlling various prosodic attributes of speech. Speaking rate is an essential attribute towards modelling the expressivity of speech. In this work, we propose a novel approach to control the speaking rate for non-autoregressive TTS. We achieve this by conditioning the speaking rate inside the duration predictor, allowing implicit speaking rate control. We show the benefits of this approach by synthesising audio at various speaking rate factors and measuring the quality of speaking rate-controlled synthesised speech. Further, we study the effect of the speaking rate distribution of the training data towards effective rate control. Finally, we fine-tune a baseline pretrained TTS model to obtain speaking rate control TTS. We provide various analyses to showcase the benefits of using this proposed approach, along with objective as well as subjective metrics. We find that the proposed methods have higher subjective scores and lower speaker rate errors across many speaking rate factors over the baseline.
    摘要 高质量语音合成技术出现后,控制不同语速特性的兴趣很大。语速是对语音表达性的重要属性。在这项工作中,我们提出一种新的非自适应TTS语速控制方法。我们在duration预测器中控制语速,实现了隐式语速控制。我们通过合成各种语速因子的音频并测量合成的语音质量,证明了我们的方法的优势。此外,我们研究了培训数据语速分布对有效语速控制的影响。最后,我们练化一个基eline预测TTS模型,以获得语速控制TTS模型。我们提供了多种分析,证明了我们的方法的优点,包括对象 метри克和主观评分。我们发现,我们的方法在多种语速因子下具有更高的主观评分和更低的发音错误率,相比基eline。

cs.CV - 2023-10-13

Pairwise Similarity Learning is SimPLE

  • paper_url: http://arxiv.org/abs/2310.09449
  • repo_url: None
  • paper_authors: Yandong Wen, Weiyang Liu, Yao Feng, Bhiksha Raj, Rita Singh, Adrian Weller, Michael J. Black, Bernhard Schölkopf
  • for: 本研究关注一个通用而重要的学习问题:成对相似度学习(PSL)。PSL 涵盖了许多重要应用,如开放集人脸识别、说话人验证、图像检索和行人重识别。学习的目标是得到一个成对相似度函数,使相同标签的样本对获得更高的相似度分数,而不同标签的样本对获得更低的相似度分数。
  • methods: 我们首先指出PSL 的一个关键需求,并讨论现有方法如何满足这一需求。随后,我们提出了一种出奇简单的无代理(proxy-free)方法 SimPLE,它既不需要特征/代理归一化,也不需要角度间隔(angular margin),却能在开放集识别中很好地泛化。
  • results: 我们将所提方法应用于三个PSL任务:开放集人脸识别、图像检索和说话人验证。大规模基准上的大量实验表明,我们的方法显著优于当前最先进的方法。
    Abstract In this paper, we focus on a general yet important learning problem, pairwise similarity learning (PSL). PSL subsumes a wide range of important applications, such as open-set face recognition, speaker verification, image retrieval and person re-identification. The goal of PSL is to learn a pairwise similarity function assigning a higher similarity score to positive pairs (i.e., a pair of samples with the same label) than to negative pairs (i.e., a pair of samples with different label). We start by identifying a key desideratum for PSL, and then discuss how existing methods can achieve this desideratum. We then propose a surprisingly simple proxy-free method, called SimPLE, which requires neither feature/proxy normalization nor angular margin and yet is able to generalize well in open-set recognition. We apply the proposed method to three challenging PSL tasks: open-set face recognition, image retrieval and speaker verification. Comprehensive experimental results on large-scale benchmarks show that our method performs significantly better than current state-of-the-art methods.
    摘要 在这篇论文中,我们关注一个通用 yet 重要的学习问题:对照性学习(PSL)。 PSL 涵盖了许多重要应用,如开放集面Recognition、Speaker Verification、图像检索和人Re-Identification。 PSL 的目标是学习一个对照性函数,对同样标签的样本对(Positive Pair)分配更高的相似性分数,而对不同标签的样本对(Negative Pair)分配更低的相似性分数。我们开始 by 识别 PSL 的关键需求,然后讨论现有方法如何实现这个需求。然后,我们提出了一种奇异简单的代理自由方法,叫做 SimPLE,不需要特征/代理 нормализация也不需要angular margin,却能够在开放集面认知中广泛应用。我们在三个复杂 PSL 任务上应用了提议的方法:开放集面认知、图像检索和Speaker Verification。我们在大规模 benchmark 上进行了广泛的实验,结果表明,我们的方法在当前状态的方法之上表现出色。
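
A generic proxy-free pairwise objective conveys the PSL setup: cosine similarities of all in-batch pairs are pushed toward their same-label or different-label targets. The binary cross-entropy form and the scale factor below are illustrative and not the exact SimPLE loss.

```python
import torch
import torch.nn.functional as F

def pairwise_similarity_loss(embeddings, labels, scale=10.0):
    """embeddings: (B, D) un-normalized features; labels: (B,) integer class ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() * scale                         # (B, B) scaled cosine similarities
    target = (labels[:, None] == labels[None, :]).float()
    mask = ~torch.eye(len(labels), dtype=torch.bool, device=z.device)   # drop self-pairs
    return F.binary_cross_entropy_with_logits(sim[mask], target[mask])

z = torch.randn(32, 128, requires_grad=True)
labels = torch.randint(0, 8, (32,))
loss = pairwise_similarity_loss(z, labels)
loss.backward()
print(loss.item())
```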

Automatic segmentation of lung findings in CT and application to Long COVID

  • paper_url: http://arxiv.org/abs/2310.09446
  • repo_url: https://github.com/miclab-unicamp/medseg
  • paper_authors: Diedre S. Carmo, Rosarie A. Tudas, Alejandro P. Comellas, Leticia Rittner, Roberto A. Lotufo, Joseph M. Reinhardt, Sarah E. Gerard
  • for: 这个研究旨在提高计算机断层成像中肺脏病变的自动分割精度,以便诊断和特征化肺疾病。
  • methods: 该研究提出了一种基于深度学习的S-MEDSeg方法,结合预训练的EfficientNet底层、双向特征层积网络和现代网络技术,以提高肺病变分割性能。
  • results: 对于基eline方法的比较,S-MEDSeg方法在分割性能上有显著提高,并进行了全面的ablation研究来评估提案的网络修改对性能的贡献。该方法还应用于一个独立的长COVID患者数据集,以研究抗生素治疗后肺发现的扩展。
    Abstract Automated segmentation of lung abnormalities in computed tomography is an important step for diagnosing and characterizing lung disease. In this work, we improve upon a previous method and propose S-MEDSeg, a deep learning based approach for accurate segmentation of lung lesions in chest CT images. S-MEDSeg combines a pre-trained EfficientNet backbone, bidirectional feature pyramid network, and modern network advancements to achieve improved segmentation performance. A comprehensive ablation study was performed to evaluate the contribution of the proposed network modifications. The results demonstrate modifications introduced in S-MEDSeg significantly improves segmentation performance compared to the baseline approach. The proposed method is applied to an independent dataset of long COVID inpatients to study the effect of post-acute infection vaccination on extent of lung findings. Open-source code, graphical user interface and pip package are available at https://github.com/MICLab-Unicamp/medseg.
    摘要 自动分割肺部异常的计算机断层成像是诊断和特征化肺病的重要步骤。在这项工作中,我们改进了之前的方法,并提出了S-MEDSeg,一种基于深度学习的方法用于精准地分割肺部扩散图像中的肺脏病变。S-MEDSeg结合了预训练的EfficientNet背bone、双向特征层网络和现代网络技术,以实现改进的分割性能。我们进行了完整的减少研究,以评估提案的网络修改对分割性能的贡献。结果显示,S-MEDSeg中的修改对分割性能具有显著改进效果,相比基线方法。我们应用了这种方法于一个独立的长COVID患者数据集,以研究后期感染疫苗对肺发现的影响。可以在https://github.com/MICLab-Unicamp/medseg上下载开源代码、图形用户界面和pip包。

Tackling Heterogeneity in Medical Federated learning via Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.09444
  • repo_url: None
  • paper_authors: Erfan Darzi, Yiqing Shen, Nanna M. Sijtsema, P. M. A van Ooijen
  • for: 提高医疗联合学习中数据不均衡的问题,特别是提高弱代客户端的性能。
  • methods: 使用Optimization-based regularization方法。
  • results: 使用视图转换器可以大幅提高弱代客户端的性能,而无需付出重大的全局准确率代价。
    Abstract Optimization-based regularization methods have been effective in addressing the challenges posed by data heterogeneity in medical federated learning, particularly in improving the performance of underrepresented clients. However, these methods often lead to lower overall model accuracy and slower convergence rates. In this paper, we demonstrate that using Vision Transformers can substantially improve the performance of underrepresented clients without a significant trade-off in overall accuracy. This improvement is attributed to the Vision transformer's ability to capture long-range dependencies within the input data.
    摘要 以优化为基础的规范化方法在医疗联合学习中处理数据不均衡问题,特别是改善受抑客户端的性能。然而,这些方法通常会导致全局模型精度下降和更慢的收敛速率。在这篇论文中,我们展示了使用视觉转换器可以大幅提高受抑客户端的性能,无需折损全局精度。这种改善归因于视觉转换器能够捕捉输入数据中的长距离关系。

MEMTRACK: A Deep Learning-Based Approach to Microrobot Tracking in Dense and Low-Contrast Environments

  • paper_url: http://arxiv.org/abs/2310.09441
  • repo_url: https://github.com/sawhney-medha/memtrack
  • paper_authors: Medha Sawhney, Bhas Karmarkar, Eric J. Leaman, Arka Daw, Anuj Karpatne, Bahareh Behkam
  • for: 本研究目的是为了解决追踪微型机器人(microrobot)的挑战,即其 minute size 和高速运动导致的精度追踪问题。
  • methods: 本研究使用了人工智能技术,包括深度学习对象检测和修改后的 Simple Online and Real-time Tracking(SORT)算法,以实现微型机器人的检测和追踪。
  • results: 研究发现,使用MEMTrack方法可以准确地追踪纤维蛋白质环境中的微型机器人,并且可以量化细菌的平均速度,与人工标注数据无统计学 significante difference。
    Abstract Tracking microrobots is challenging, considering their minute size and high speed. As the field progresses towards developing microrobots for biomedical applications and conducting mechanistic studies in physiologically relevant media (e.g., collagen), this challenge is exacerbated by the dense surrounding environments with feature size and shape comparable to microrobots. Herein, we report Motion Enhanced Multi-level Tracker (MEMTrack), a robust pipeline for detecting and tracking microrobots using synthetic motion features, deep learning-based object detection, and a modified Simple Online and Real-time Tracking (SORT) algorithm with interpolation for tracking. Our object detection approach combines different models based on the object's motion pattern. We trained and validated our model using bacterial micro-motors in collagen (tissue phantom) and tested it in collagen and aqueous media. We demonstrate that MEMTrack accurately tracks even the most challenging bacteria missed by skilled human annotators, achieving precision and recall of 77% and 48% in collagen and 94% and 35% in liquid media, respectively. Moreover, we show that MEMTrack can quantify average bacteria speed with no statistically significant difference from the laboriously-produced manual tracking data. MEMTrack represents a significant contribution to microrobot localization and tracking, and opens the potential for vision-based deep learning approaches to microrobot control in dense and low-contrast settings. All source code for training and testing MEMTrack and reproducing the results of the paper have been made publicly available https://github.com/sawhney-medha/MEMTrack.
    摘要 由于微型机器人尺寸微小、运动速度快,对其进行追踪颇具挑战。随着该领域向面向生物医学应用的微型机器人开发以及在生理相关介质(例如胶原蛋白)中开展机制研究推进,这一挑战进一步加剧,因为稠密的周围环境中存在与微型机器人尺寸和形状相近的结构。在这篇文章中,我们报道了 Motion Enhanced Multi-level Tracker(MEMTrack),一种稳健的微型机器人检测与追踪流程,它结合了合成运动特征、基于深度学习的目标检测,以及带插值的改进 Simple Online and Real-time Tracking(SORT)算法。我们的目标检测方法根据对象的运动模式组合不同的模型。我们使用胶原蛋白(组织模体)中的细菌微马达训练并验证模型,并在胶原蛋白和水性介质中进行了测试。结果表明,MEMTrack 能够准确追踪连经验丰富的人工标注者都会漏掉的最具挑战性的细菌,在胶原蛋白中达到77%的精确率和48%的召回率,在液体介质中达到94%的精确率和35%的召回率。此外,MEMTrack 量化的细菌平均速度与费时费力的人工追踪数据之间无统计学显著差异。MEMTrack 是微型机器人定位与追踪方面的重要贡献,并为在稠密、低对比度环境中基于视觉深度学习的微型机器人控制打开了可能。MEMTrack 的全部训练、测试及复现论文结果的源代码已公开于 https://github.com/sawhney-medha/MEMTrack。

LL-VQ-VAE: Learnable Lattice Vector-Quantization For Efficient Representations

  • paper_url: http://arxiv.org/abs/2310.09382
  • repo_url: None
  • paper_authors: Ahmed Khalil, Robert Piechocki, Raul Santos-Rodriguez
  • for: 学习粒度化 vector quantization 以获得精炼的抽象表示。
  • methods: 取代 VQ-VAE 中的 vector quantization 层,使用 lattice-based 粒度化,实现了一个可学习的 lattice 结构,避免 codebook 塌陷,提高了 codebook 使用率。
  • results: 在相同训练条件下,重建误差低于 VQ-VAE,训练时间只需其一小部分,且参数数量恒定(等于嵌入维度 $D$),因此具有很好的可扩展性;实验在 FFHQ-1024 数据集上进行,并涵盖 FashionMNIST 和 Celeb-A。
    Abstract In this paper we introduce learnable lattice vector quantization and demonstrate its effectiveness for learning discrete representations. Our method, termed LL-VQ-VAE, replaces the vector quantization layer in VQ-VAE with lattice-based discretization. The learnable lattice imposes a structure over all discrete embeddings, acting as a deterrent against codebook collapse, leading to high codebook utilization. Compared to VQ-VAE, our method obtains lower reconstruction errors under the same training conditions, trains in a fraction of the time, and with a constant number of parameters (equal to the embedding dimension $D$), making it a very scalable approach. We demonstrate these results on the FFHQ-1024 dataset and include FashionMNIST and Celeb-A.
    摘要 在这篇论文中,我们提出了可学习的格(lattice)向量量化,并证明其在学习离散表示方面的有效性。我们的方法LL-VQ-VAE将VQ-VAE中的向量量化层替换为基于格的离散化。可学习的格对所有离散嵌入施加了统一的结构,起到防止码本塌陷的作用,从而带来较高的码本利用率。与VQ-VAE相比,我们的方法在相同训练条件下获得更低的重建误差,训练时间只需其一小部分,且参数数量恒定(等于嵌入维度$D$),因此具有很好的可扩展性。我们在FFHQ-1024数据集上验证了这些结果,并涵盖了FashionMNIST和Celeb-A。
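
One plausible reading of lattice-based discretization with a learnable basis is sketched below: vectors are expressed in lattice coordinates, rounded, and mapped back, with a straight-through estimator for the encoder path and a commitment-style term so the basis receives gradients. This is an interpretation for illustration only, not the exact LL-VQ-VAE layer.

```python
import torch
import torch.nn as nn

class LearnableLatticeQuantizer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Lattice generator matrix; initialized to the integer lattice Z^dim.
        self.basis = nn.Parameter(torch.eye(dim))

    def forward(self, z):                               # z: (B, dim)
        coords = z @ torch.linalg.inv(self.basis)       # express z in lattice coordinates
        quantized = torch.round(coords) @ self.basis    # snap to the nearest lattice point
        # Commitment-style term lets gradients reach the lattice basis.
        lattice_loss = torch.mean((quantized - z.detach()) ** 2)
        # Straight-through estimator: decoder sees quantized values, encoder still gets gradients.
        return z + (quantized - z).detach(), lattice_loss

quantizer = LearnableLatticeQuantizer(8)
z = torch.randn(4, 8, requires_grad=True)
out, aux = quantizer(z)
(out.sum() + aux).backward()
print(out.shape, quantizer.basis.grad is not None)
```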

Efficient Apple Maturity and Damage Assessment: A Lightweight Detection Model with GAN and Attention Mechanism

  • paper_url: http://arxiv.org/abs/2310.09347
  • repo_url: None
  • paper_authors: Yufei Liu, Manzhou Li, Qin Ma
  • for: 这个研究旨在提出一种基于轻量级卷积神经网络(CNN)和生成对抗网络(GAN)的苹果成熟度和损伤程度检测方法。
  • methods: 该方法通过优化模型的深度和宽度,并采用先进的模型压缩技术来提升实时性能;同时引入注意力机制,动态调整不同特征层的重要性,以改善目标检测任务的性能。
  • results: 实验结果显示,在苹果成熟度分级检测任务中,所提方法的精确率、召回率、准确率和FPS分别达到95.6%、93.8%、95.0%和56.5;在苹果损伤程度检测任务中,精确率、召回率和mAP分别达到95.3%、93.7%和94.5%。在两个任务中,所提方法均超越了主流模型,表明其在苹果成熟度与损伤程度检测任务中的出色表现和较高的实用价值。
    Abstract This study proposes a method based on lightweight convolutional neural networks (CNN) and generative adversarial networks (GAN) for apple ripeness and damage level detection tasks. Initially, a lightweight CNN model is designed by optimizing the model's depth and width, as well as employing advanced model compression techniques, successfully reducing the model's parameter and computational requirements, thus enhancing real-time performance in practical applications. Simultaneously, attention mechanisms are introduced, dynamically adjusting the importance of different feature layers to improve the performance in object detection tasks. To address the issues of sample imbalance and insufficient sample size, GANs are used to generate realistic apple images, expanding the training dataset and enhancing the model's recognition capability when faced with apples of varying ripeness and damage levels. Furthermore, by applying the object detection network for damage location annotation on damaged apples, the accuracy of damage level detection is improved, providing a more precise basis for decision-making. Experimental results show that in apple ripeness grading detection, the proposed model achieves 95.6\%, 93.8\%, 95.0\%, and 56.5 in precision, recall, accuracy, and FPS, respectively. In apple damage level detection, the proposed model reaches 95.3\%, 93.7\%, and 94.5\% in precision, recall, and mAP, respectively. In both tasks, the proposed method outperforms other mainstream models, demonstrating the excellent performance and high practical value of the proposed method in apple ripeness and damage level detection tasks.
    摘要 本研究提出一种基于轻量级卷积神经网络(CNN)和生成对抗网络(GAN)的苹果成熟度和损伤程度检测方法。首先,通过优化模型的深度和宽度并应用先进的模型压缩技术,设计了一个轻量级CNN模型,成功降低了模型的参数量和计算需求,从而提升了实际应用中的实时性能。同时,引入注意力机制,动态调整不同特征层的重要性,以提升目标检测任务的表现。为了解决样本不均衡和样本数量不足的问题,使用GAN生成逼真的苹果图像,扩充训练数据集,提升模型对不同成熟度和损伤程度苹果的识别能力。此外,通过目标检测网络对受损苹果的损伤位置进行标注,提高了损伤程度检测的准确性,为决策提供了更精确的依据。实验结果表明,在苹果成熟度分级检测任务中,所提模型的精确率、召回率、准确率和FPS分别达到95.6%、93.8%、95.0%和56.5;在苹果损伤程度检测任务中,精确率、召回率和mAP分别达到95.3%、93.7%和94.5%。在两个任务中,所提方法均超越了其他主流模型,展现出该方法在苹果成熟度与损伤程度检测任务中的出色表现和较高的实用价值。

Vision-by-Language for Training-Free Compositional Image Retrieval

  • paper_url: http://arxiv.org/abs/2310.09291
  • repo_url: None
  • paper_authors: Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata
    for: 这个论文的目的是提出一种无需训练的图像检索方法,可以通过将大规模的视觉语言模型(VLM)和大语言模型(LLM)组合在一起来实现。methods: 该方法使用了一个简单的、人类可理解的架构,即使用一个预训练的生成型VLM来描述引用图像,然后使用一个LLM来重新组合描述以进行图像检索。results: 在四个零 shot图像检索 benchmark 中,该方法实现了相对较高的性能,并且可以轻松地扩展到更多的图像和文本对象。此外,该方法还可以让人类更好地理解图像检索的过程,并且可以通过修改文本来修正检索结果。
    Abstract Given an image and a target modification (e.g an image of the Eiffel tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database. While supervised approaches rely on annotating triplets that is costly (i.e. query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple, yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking a LLM to recompose the caption based on the textual target modification for subsequent retrieval via e.g. CLIP, we achieve modular language reasoning. In four ZS-CIR benchmarks, we find competitive, in-part state-of-the-art performance - improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us to both investigate scaling laws and bottlenecks for ZS-CIR while easily scaling up to in parts more than double of previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable, allowing to post-hoc re-align failure cases. Code will be released upon acceptance.
    摘要 组合式图像检索(CIR)的目标是:给定一张参考图像和一条目标修改描述,从图像库中检索出相应的目标图像。有监督方法需要对三元组(查询图像、文本修改、目标图像)进行标注,成本高昂;近期研究则借助大规模视觉-语言模型(VLM)实现零样本CIR(ZS-CIR)来规避这一需求。然而,当前最先进的ZS-CIR方法仍需在大量图文对上训练任务特定的定制模型。在这项工作中,我们提出了一种免训练的方法 Compositional Image Retrieval through Vision-by-Language(CIReVL),这是一个简单、可被人理解且可扩展的流程,有效地将大规模VLM与大语言模型(LLM)结合起来:先用预训练的生成式VLM为参考图像生成描述,再让LLM依据文本修改重新组织该描述,最后通过例如CLIP进行检索,从而实现模块化的语言推理。在四个ZS-CIR基准上,我们取得了具有竞争力、部分达到最先进水平的表现,优于有监督方法。此外,CIReVL的模块化特性使其无需重新训练即可扩展,便于研究ZS-CIR的扩展规律与瓶颈,并在部分设置上将此前报告的结果提升到两倍以上。最后,我们表明CIReVL通过在语言域中以模块化方式组合图像与文本,使CIR对人类可理解,并且可以进行干预,从而对失败案例进行事后修正。代码将在论文被接收后发布。

An Unbiased Look at Datasets for Visuo-Motor Pre-Training

  • paper_url: http://arxiv.org/abs/2310.09289
  • repo_url: None
  • paper_authors: Sudeep Dasari, Mohan Kumar Srirama, Unnat Jain, Abhinav Gupta
  • for: The paper is focused on dataset-centric analysis of robotic pre-training for visual representation learning.
  • methods: The paper uses pre-training on large-scale but out-of-domain data (e.g., videos of egocentric interactions) and then transfers the representations to target robotics tasks.
  • results: The paper finds that traditional vision datasets (like ImageNet, Kinetics and 100 Days of Hands) are surprisingly competitive options for visuo-motor representation learning, and that the pre-training dataset's image distribution matters more than its size. Additionally, the paper shows that common simulation benchmarks are not a reliable proxy for real-world performance and that simple regularization strategies can dramatically improve real-world policy learning. Here's the simplified Chinese text for the three key points:
  • 用途: 文章关注面向视觉表示学习的机器人预训练的数据集中心分析。
  • 方法: 在大规模但跨领域的数据(如第一人称交互视频)上进行预训练,然后将表示迁移到目标机器人任务。
  • 结果: 发现传统视觉数据集(如ImageNet、Kinetics和100 Days of Hands)是出人意料地具有竞争力的选择,且预训练数据集的图像分布比其规模更重要。此外,文章表明常见的仿真基准并不能可靠地代表真实世界的表现,而简单的正则化策略可以大幅提升真实世界中的策略学习。
    Abstract Visual representation learning hold great promise for robotics, but is severely hampered by the scarcity and homogeneity of robotics datasets. Recent works address this problem by pre-training visual representations on large-scale but out-of-domain data (e.g., videos of egocentric interactions) and then transferring them to target robotics tasks. While the field is heavily focused on developing better pre-training algorithms, we find that dataset choice is just as important to this paradigm's success. After all, the representation can only learn the structures or priors present in the pre-training dataset. To this end, we flip the focus on algorithms, and instead conduct a dataset centric analysis of robotic pre-training. Our findings call into question some common wisdom in the field. We observe that traditional vision datasets (like ImageNet, Kinetics and 100 Days of Hands) are surprisingly competitive options for visuo-motor representation learning, and that the pre-training dataset's image distribution matters more than its size. Finally, we show that common simulation benchmarks are not a reliable proxy for real world performance and that simple regularization strategies can dramatically improve real world policy learning. https://data4robotics.github.io
    摘要 视觉表示学习在机器人领域前景广阔,但受到机器人数据稀缺且同质化的严重制约。近期工作通过在大规模但跨领域的数据(例如第一人称交互视频)上预训练视觉表示,再将其迁移到目标机器人任务来解决这一问题。尽管该领域高度专注于开发更好的预训练算法,我们发现数据集的选择对这一范式的成功同样重要:表示只能学习到预训练数据集中存在的结构或先验。为此,我们把关注点从算法上移开,转而对机器人预训练进行以数据集为中心的分析。我们的发现对该领域的一些惯常认识提出了质疑:传统视觉数据集(如ImageNet、Kinetics和100 Days of Hands)是出人意料地具有竞争力的视觉-运动表示学习选择,且预训练数据集的图像分布比其规模更重要。最后,我们表明常见的仿真基准并不能可靠地代表真实世界表现,而简单的正则化策略可以大幅提升真实世界中的策略学习。

SAIR: Learning Semantic-aware Implicit Representation

  • paper_url: http://arxiv.org/abs/2310.09285
  • repo_url: None
  • paper_authors: Canyu Zhang, Xiaoguang Li, Qing Guo, Song Wang
  • for: image inpainting task
  • methods: semantic-aware implicit representation (SAIR) and two modules: (1) building a semantic implicit representation (SIR) and (2) building an appearance implicit representation (AIR)
  • results: surpasses state-of-the-art approaches by a significant margin.
  • for: 图像填充任务
  • methods: semantic-aware implicit representation (SAIR) 和两个模块:(1) 建立semantic implicit representation (SIR) 和 (2) 建立appearance implicit representation (AIR)
  • results: 以显著优势超越当前最佳方法。
    Abstract Implicit representation of an image can map arbitrary coordinates in the continuous domain to their corresponding color values, presenting a powerful capability for image reconstruction. Nevertheless, existing implicit representation approaches only focus on building continuous appearance mapping, ignoring the continuities of the semantic information across pixels. As a result, they can hardly achieve desired reconstruction results when the semantic information within input images is corrupted, for example, a large region misses. To address the issue, we propose to learn semantic-aware implicit representation (SAIR), that is, we make the implicit representation of each pixel rely on both its appearance and semantic information (\eg, which object does the pixel belong to). To this end, we propose a framework with two modules: (1) building a semantic implicit representation (SIR) for a corrupted image whose large regions miss. Given an arbitrary coordinate in the continuous domain, we can obtain its respective text-aligned embedding indicating the object the pixel belongs. (2) building an appearance implicit representation (AIR) based on the SIR. Given an arbitrary coordinate in the continuous domain, we can reconstruct its color whether or not the pixel is missed in the input. We validate the novel semantic-aware implicit representation method on the image inpainting task, and the extensive experiments demonstrate that our method surpasses state-of-the-art approaches by a significant margin.
    摘要 图像的隐式表示可以将连续域中的任意坐标映射到对应的颜色值,为图像重建提供了强大能力。然而,现有的隐式表示方法只关注构建连续的外观映射,忽略了像素之间语义信息的连续性。因此,当输入图像中的语义信息被破坏(例如大面积缺失)时,它们很难取得理想的重建效果。为解决这一问题,我们提出学习语义感知隐式表示(SAIR),即让每个像素的隐式表示同时依赖其外观信息和语义信息(例如该像素属于哪个物体)。为此,我们提出一个包含两个模块的框架:(1)为大面积缺失的受损图像构建语义隐式表示(SIR),给定连续域中的任意坐标,即可获得指示该像素所属物体的文本对齐嵌入;(2)在SIR的基础上构建外观隐式表示(AIR),给定连续域中的任意坐标,无论该像素在输入中是否缺失,都可以重建其颜色。我们在图像修复任务上验证了这一新的语义感知隐式表示方法,大量实验表明我们的方法以显著优势超越当前最佳方法。
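Following the two-module design described in the abstract above, here is a minimal PyTorch sketch of how a semantic implicit representation (SIR) can feed an appearance implicit representation (AIR) at arbitrary continuous coordinates. This is not the authors' code: the feature-grid sampling, MLP sizes, and embedding dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_feature(grid, coords):
    # grid: (1, C, H, W) feature map; coords: (N, 2) in [-1, 1], ordered (x, y)
    pts = coords.view(1, -1, 1, 2)                       # (1, N, 1, 2)
    feat = F.grid_sample(grid, pts, align_corners=True)  # (1, C, N, 1)
    return feat[0, :, :, 0].t()                          # (N, C)

class ImplicitHead(nn.Module):
    """Small MLP mapping a conditioned coordinate to an output vector."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))
    def forward(self, x):
        return self.net(x)

class SAIRSketch(nn.Module):
    """SIR predicts a semantic embedding per coordinate; AIR predicts RGB
    conditioned on both the appearance feature and that embedding."""
    def __init__(self, app_dim=64, sem_dim=32):
        super().__init__()
        self.sir = ImplicitHead(app_dim + 2, sem_dim)          # -> semantic embedding
        self.air = ImplicitHead(app_dim + sem_dim + 2, 3)      # -> RGB

    def forward(self, app_grid, coords):
        app = sample_feature(app_grid, coords)                 # (N, app_dim)
        sem = self.sir(torch.cat([app, coords], dim=-1))       # (N, sem_dim)
        rgb = self.air(torch.cat([app, sem, coords], dim=-1))  # (N, 3)
        return rgb, sem

# usage: query colors at arbitrary continuous coordinates of a corrupted image
model = SAIRSketch()
app_grid = torch.randn(1, 64, 32, 32)          # encoder features of the corrupted image
coords = torch.rand(1024, 2) * 2 - 1           # random query points in [-1, 1]
rgb, sem = model(app_grid, coords)
print(rgb.shape, sem.shape)                    # torch.Size([1024, 3]) torch.Size([1024, 32])
```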

Transformer-based Multimodal Change Detection with Multitask Consistency Constraints

  • paper_url: http://arxiv.org/abs/2310.09276
  • repo_url: https://github.com/qaz670756/mmcd
  • paper_authors: Biyuan Liu, Huaixin Chen, Kun Li, Michael Ying Yang
  • for: 本研究旨在利用多modal数据进行Change Detection,以解决现有方法在面对多modal数据时的问题。
  • methods: 本研究提出了一种基于Transformer网络的多modalChange Detection方法,通过cross-attention学习多modal输入之间的共同表示,并采用了一种兼容约束来确保多modal关系的建立。
  • results: 与五种现有方法进行比较,本研究的模型在Semantic和Height Change Detection任务中均显示出了consistent的多task优势性。此外,该方法可以轻松地适应其他方法,从而实现了Promising的改进。
    Abstract Change detection plays a fundamental role in Earth observation for analyzing temporal iterations over time. However, recent studies have largely neglected the utilization of multimodal data that presents significant practical and technical advantages compared to single-modal approaches. This research focuses on leveraging digital surface model (DSM) data and aerial images captured at different times for detecting change beyond 2D. We observe that the current change detection methods struggle with the multitask conflicts between semantic and height change detection tasks. To address this challenge, we propose an efficient Transformer-based network that learns shared representation between cross-dimensional inputs through cross-attention. It adopts a consistency constraint to establish the multimodal relationship, which involves obtaining pseudo change through height change thresholding and minimizing the difference between semantic and pseudo change within their overlapping regions. A DSM-to-image multimodal dataset encompassing three cities in the Netherlands was constructed. It lays a new foundation for beyond-2D change detection from cross-dimensional inputs. Compared to five state-of-the-art change detection methods, our model demonstrates consistent multitask superiority in terms of semantic and height change detection. Furthermore, the consistency strategy can be seamlessly adapted to the other methods, yielding promising improvements.
    摘要 地球观测中的变化检测扮演了基础性的角色,但最近的研究主要忽略了多modal数据的利用,这些数据具有重要的实践和技术优势。本研究目的在于利用数字地表模型(DSM)数据和不同时间拍摄的空中图像进行超过2D的变化检测。我们发现现有的变化检测方法在多任务冲突中表现不佳,特别是semantic和高程变化检测任务之间的冲突。为解决这个挑战,我们提议一种高效的Transformer网络,通过cross-attention学习共享表示 между多维输入。它采用一种一致性约束,以建立多modal关系,其中包括通过高程变化阈值获取 Pseudo 变化,并将semantic和Pseudo 变化在重叠区域内的差异降低到最小。为建立这种多模态关系,我们构建了包括荷兰三座城市的DSM-to-图像多模态数据集。相比五种state-of-the-art变化检测方法,我们的模型在semantic和高程变化检测任务上具有一致多任务优势。此外,一致策略可以轻松地应用到其他方法上,具有极好的改进前景。
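A minimal sketch of the multitask consistency constraint described above: a pseudo change map is obtained by thresholding the predicted height change, and the semantic change prediction is pulled towards it inside their overlapping region. The threshold value, masking rule, and L1 form of the penalty are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(sem_change_logits, height_change, thresh=0.5):
    """Multitask consistency sketch.
    sem_change_logits: (B, 1, H, W) semantic change logits.
    height_change:     (B, 1, H, W) predicted height change (e.g., metres).
    The threshold and the overlap definition are illustrative assumptions."""
    sem_prob = torch.sigmoid(sem_change_logits)
    pseudo = (height_change.abs() > thresh).float()        # pseudo change from height
    overlap = (sem_prob > 0.5).float() * pseudo            # region where both agree on change
    if overlap.sum() < 1:
        return (sem_prob * 0).sum()                        # keep the graph, zero loss
    diff = F.l1_loss(sem_prob, pseudo, reduction='none')
    return (diff * overlap).sum() / overlap.sum()

# usage
sem_logits = torch.randn(2, 1, 64, 64, requires_grad=True)
height = torch.randn(2, 1, 64, 64) * 2.0
loss = consistency_loss(sem_logits, height)
loss.backward()
print(loss.item())
```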

Understanding and Modeling the Effects of Task and Context on Drivers’ Gaze Allocation

  • paper_url: http://arxiv.org/abs/2310.09275
  • repo_url: None
  • paper_authors: Iuliia Kotseruba, John K. Tsotsos
  • for: This paper aims to improve the accuracy of driver gaze prediction by explicitly modeling task and context influences, and to provide a new benchmark for evaluating such models.
  • methods: The proposed method addresses shortcomings of the popular DR(eye)VE dataset and extends it with per-frame annotations for driving task and context. The authors also benchmark a number of baseline and state-of-the-art models for saliency and driver gaze prediction, and analyze them with respect to the new annotations.
  • results: The proposed method significantly improves the state of the art performance on DR(eye)VE overall (by 24% KLD and 89% NSS) and on a subset of action and safety-critical intersection scenarios (by 10-30% KLD).
  • for: 这篇论文的目的是提高驾驶员视线预测的准确率,并提供一个新的评价标准。
  • methods: 该方法解决了DR(eye)VE dataset的一些缺陷,并将每帧的驾驶任务和上下文信息添加到 annotations 中。作者还对一些基线和现状模型进行了评价,并与新的 annotations 进行了分析。
  • results: 该方法在DR(eye)VE 上的总表现提高了24% KLD 和 89% NSS,并在一些行为和安全关键交叉点场景中提高了10-30% KLD。
    Abstract Understanding what drivers look at is important for many applications, including driver training, monitoring, and assistance, as well as self-driving. Traditionally, factors affecting human visual attention have been divided into bottom-up (involuntary attraction to salient regions) and top-down (task- and context-driven). Although both play a role in drivers' gaze allocation, most of the existing modeling approaches apply techniques developed for bottom-up saliency and do not consider task and context influences explicitly. Likewise, common driving attention benchmarks lack relevant task and context annotations. Therefore, to enable analysis and modeling of these factors for drivers' gaze prediction, we propose the following: 1) address some shortcomings of the popular DR(eye)VE dataset and extend it with per-frame annotations for driving task and context; 2) benchmark a number of baseline and SOTA models for saliency and driver gaze prediction and analyze them w.r.t. the new annotations; and finally, 3) a novel model that modulates drivers' gaze prediction with explicit action and context information, and as a result significantly improves SOTA performance on DR(eye)VE overall (by 24\% KLD and 89\% NSS) and on a subset of action and safety-critical intersection scenarios (by 10--30\% KLD). Extended annotations, code for model and evaluation will be made publicly available.
    摘要 了解驾驶员注视的位置对许多应用十分重要,包括驾驶员培训、监测与辅助以及自动驾驶。传统上,影响人类视觉注意力的因素被分为自下而上(对显著区域的非自主吸引)和自上而下(由任务和情境驱动)两类。虽然两者都影响驾驶员的注视分配,但现有的大多数建模方法沿用为自下而上显著性设计的技术,没有显式考虑任务和情境的影响;同样,常用的驾驶注意力基准也缺乏相关的任务和情境标注。为了能够分析和建模这些因素以预测驾驶员注视,我们提出:1)修正流行的DR(eye)VE数据集的一些不足,并为其补充逐帧的驾驶任务与情境标注;2)对若干基线和最新的显著性及驾驶员注视预测模型进行基准测试,并结合新标注进行分析;3)提出一个利用显式的动作与情境信息调制驾驶员注视预测的新模型,从而在DR(eye)VE整体上显著超越现有最佳水平(KLD提升24%、NSS提升89%),并在动作及安全关键路口场景子集上提升10-30% KLD。扩展标注、模型代码与评测代码将公开发布。

Time CNN and Graph Convolution Network for Epileptic Spike Detection in MEG Data

  • paper_url: http://arxiv.org/abs/2310.09236
  • repo_url: None
  • paper_authors: Pauline Mouches, Thibaut Dejean, Julien Jung, Romain Bouet, Carole Lartizien, Romain Quentin
  • for: 这个论文的目的是用机器学习方法检测 magnetoencephalography(MEG)记录中的尖峰,以便准确地确定引起癫痫发作的脑区域。
  • methods: 这个论文提出了一种基于1D时间卷积神经网络(Time CNN)和图像卷积神经网络(GCN)的方法,用于分类MEG记录中的短时间帧是否包含尖峰。
  • results: 该方法在一个平衡的数据集上达到了76.7%的分类f1分数,并在一个真实且高度不平衡的数据集上达到了25.5%的分类f1分数,均优于基于深度学习的最新方法。
    Abstract Magnetoencephalography (MEG) recordings of patients with epilepsy exhibit spikes, a typical biomarker of the pathology. Detecting those spikes allows accurate localization of brain regions triggering seizures. Spike detection is often performed manually. However, it is a burdensome and error prone task due to the complexity of MEG data. To address this problem, we propose a 1D temporal convolutional neural network (Time CNN) coupled with a graph convolutional network (GCN) to classify short time frames of MEG recording as containing a spike or not. Compared to other recent approaches, our models have fewer parameters to train and we propose to use a GCN to account for MEG sensors spatial relationships. Our models produce clinically relevant results and outperform deep learning-based state-of-the-art methods reaching a classification f1-score of 76.7% on a balanced dataset and of 25.5% on a realistic, highly imbalanced dataset, for the spike class.
    摘要 癫痫患者的脑磁图(MEG)记录中会出现尖峰,这是该疾病的典型生物标志。检测这些尖峰有助于准确定位引发癫痫发作的脑区。尖峰检测通常由人工完成,但由于MEG数据的复杂性,这是一项繁重且易出错的工作。为此,我们提出将一维时间卷积神经网络(Time CNN)与图卷积网络(GCN)相结合,用于判断MEG记录的短时间窗中是否包含尖峰。与其他近期方法相比,我们的模型需要训练的参数更少,并利用GCN来建模MEG传感器之间的空间关系。我们的模型产生了具有临床意义的结果,并超越了基于深度学习的最新方法:在平衡数据集上尖峰类的分类F1分数达到76.7%,在真实且高度不平衡的数据集上达到25.5%。
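A compact PyTorch sketch of the architecture described above: a 1D temporal CNN applied per MEG sensor, followed by a plain graph convolution over a sensor-adjacency graph and a binary spike/no-spike head. Layer widths, kernel sizes, and the adjacency construction are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TimeCNN(nn.Module):
    """1D temporal CNN applied independently to every MEG sensor channel."""
    def __init__(self, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.proj = nn.Linear(32, out_dim)

    def forward(self, x):                                   # x: (B, S, T) sensors x time
        B, S, T = x.shape
        h = self.net(x.reshape(B * S, 1, T)).squeeze(-1)    # (B*S, 32)
        return self.proj(h).reshape(B, S, -1)               # (B, S, out_dim)

class GCNLayer(nn.Module):
    """Plain graph convolution H' = ReLU(D^-1/2 A D^-1/2 H W) over the sensor graph."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):                              # h: (B, S, F), adj: (S, S)
        deg = adj.sum(-1).clamp(min=1e-6)
        norm = deg.pow(-0.5)
        a_hat = norm[:, None] * adj * norm[None, :]
        return torch.relu(self.lin(a_hat @ h))

class SpikeClassifier(nn.Module):
    def __init__(self, adj):
        super().__init__()
        self.register_buffer("adj", adj + torch.eye(adj.shape[0]))   # add self-loops
        self.cnn = TimeCNN(32)
        self.gcn = GCNLayer(32, 32)
        self.head = nn.Linear(32, 2)                        # spike / no spike

    def forward(self, x):
        h = self.gcn(self.cnn(x), self.adj)
        return self.head(h.mean(dim=1))                     # pool over sensors

# usage: 2 windows, 10 sensors, 200 time samples; adjacency from sensor proximity
adj = (torch.rand(10, 10) > 0.7).float()
adj = ((adj + adj.t()) > 0).float()
model = SpikeClassifier(adj)
logits = model(torch.randn(2, 10, 200))
print(logits.shape)                                         # torch.Size([2, 2])
```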

Ultrasound Image Segmentation of Thyroid Nodule via Latent Semantic Feature Co-Registration

  • paper_url: http://arxiv.org/abs/2310.09221
  • repo_url: None
  • paper_authors: Xuewei Li, Yaqiao Zhu, Jie Gao, Xi Wei, Ruixuan Zhang, Yuan Tian, Mei Yu
  • for: 这个论文的目的是提高自动化颈部ultrasound图像分割模型的泛化性能,以便在医疗实践中能够更好地适应不同供应商和扫描卷轴的图像。
  • methods: 这篇论文提出了一种基于新型协调网络的颈部 nodule 分割框架(ASTN),通过提取atlas和目标图像中的潜在含义,并利用深度特征来实现图像协调,以保持颈部结构完整性并减少图像差异引起的影响。
  • results: 论文的评估结果表明,通过提出的方法,模型的泛化性能得到了显著改进,同时保持了高级别的分割精度。
    Abstract Segmentation of nodules in thyroid ultrasound imaging plays a crucial role in the detection and treatment of thyroid cancer. However, owing to the diversity of scanner vendors and imaging protocols in different hospitals, the automatic segmentation model, which has already demonstrated expert-level accuracy in the field of medical image segmentation, finds its accuracy reduced as the result of its weak generalization performance when being applied in clinically realistic environments. To address this issue, the present paper proposes ASTN, a framework for thyroid nodule segmentation achieved through a new type co-registration network. By extracting latent semantic information from the atlas and target images and utilizing in-depth features to accomplish the co-registration of nodules in thyroid ultrasound images, this framework can ensure the integrity of anatomical structure and reduce the impact on segmentation as the result of overall differences in image caused by different devices. In addition, this paper also provides an atlas selection algorithm to mitigate the difficulty of co-registration. As shown by the evaluation results collected from the datasets of different devices, thanks to the method we proposed, the model generalization has been greatly improved while maintaining a high level of segmentation accuracy.
    摘要 甲状腺超声图像中结节的分割在甲状腺癌的检测和治疗中起着关键作用。然而,由于不同医院的扫描设备厂商和成像协议各不相同,已在医学图像分割领域达到专家级精度的自动分割模型,在临床真实环境中应用时会因泛化能力不足而精度下降。针对这一问题,本文提出ASTN,一种通过新型配准网络实现甲状腺结节分割的框架。该框架从图谱(atlas)图像和目标图像中提取潜在语义信息,并利用深层特征完成甲状腺超声图像中结节的配准,从而保证解剖结构的完整性,降低不同设备造成的整体图像差异对分割的影响。此外,本文还给出了一种图谱选择算法,以缓解配准的难度。来自不同设备数据集的评估结果表明,得益于所提出的方法,模型的泛化能力得到了显著提升,同时保持了较高的分割精度。

Unseen Image Synthesis with Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.09213
  • repo_url: None
  • paper_authors: Ye Zhu, Yu Wu, Zhiwei Deng, Olga Russakovsky, Yan Yan
  • for: 本文研究如何利用在单一领域数据上预训练并冻结的去噪扩散模型(DDPM),在无需额外训练的情况下合成未见领域的图像。
  • methods: 本文采用潜变量采样与几何优化,基于预训练并冻结的 DDPM,在单一领域数据上合成未见领域的图像。
  • results: 经过大量分析与实验,证明了这一新视角有助于探索和重新审视去噪扩散模型的数据合成泛化能力。
    Abstract While the current trend in the generative field is scaling up towards larger models and more training data for generalized domain representations, we go the opposite direction in this work by synthesizing unseen domain images without additional training. We do so via latent sampling and geometric optimization using pre-trained and frozen Denoising Diffusion Probabilistic Models (DDPMs) on single-domain datasets. Our key observation is that DDPMs pre-trained even just on single-domain images are already equipped with sufficient representation abilities to reconstruct arbitrary images from the inverted latent encoding following bi-directional deterministic diffusion and denoising trajectories. This motivates us to investigate the statistical and geometric behaviors of the Out-Of-Distribution (OOD) samples from unseen image domains in the latent spaces along the denoising chain. Notably, we theoretically and empirically show that the inverted OOD samples also establish Gaussians that are distinguishable from the original In-Domain (ID) samples in the intermediate latent spaces, which allows us to sample from them directly. Geometrical domain-specific and model-dependent information of the unseen subspace (e.g., sample-wise distance and angles) is used to further optimize the sampled OOD latent encodings from the estimated Gaussian prior. We conduct extensive analysis and experiments using pre-trained diffusion models (DDPM, iDDPM) on different datasets (AFHQ, CelebA-HQ, LSUN-Church, and LSUN-Bedroom), proving the effectiveness of this novel perspective to explore and re-think the diffusion models' data synthesis generalization ability.
    摘要 当前生成领域的趋势是扩大模型规模和训练数据以获得通用领域表示,而我们在本工作中走了相反的方向:利用在单一领域数据集上预训练并冻结的去噪扩散概率模型(DDPM),通过潜变量采样和几何优化来合成未见领域的图像,而无需额外训练。我们的关键观察是,即便仅在单一领域图像上预训练的DDPM,也已具备足够的表示能力,可沿双向确定性的扩散与去噪轨迹,从反演得到的潜变量编码中重建任意图像。这促使我们研究未见图像领域的分布外(OOD)样本在去噪链上各个潜空间中的统计与几何行为。值得注意的是,我们从理论和实验上表明,反演得到的OOD样本在中间潜空间中同样形成可与原始域内(ID)样本区分开的高斯分布,从而可以直接从中采样。我们进一步利用未见子空间的几何领域信息和模型相关信息(例如样本间距离和角度),优化从估计的高斯先验中采样得到的OOD潜变量编码。我们在不同数据集(AFHQ、CelebA-HQ、LSUN-Church和LSUN-Bedroom)上使用预训练扩散模型(DDPM、iDDPM)进行了广泛的分析和实验,证明了这一探索和重新审视扩散模型数据合成泛化能力的新视角的有效性。

PaLI-3 Vision Language Models: Smaller, Faster, Stronger

  • paper_url: http://arxiv.org/abs/2310.09199
  • repo_url: https://github.com/kyegomez/PALI3
  • paper_authors: Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut
  • for: 该论文目的是提出一种更小、更快、更强的视觉语言模型(VLM),以比较和更大的类似模型进行比较。
  • methods: 该论文使用了视transformer(ViT)模型预训练使用分类目标,并与对比(SigLIP)预训练进行比较。
  • results: 研究发现,虽然 SigLIP-based PaLI略微下perform在标准图像分类标准下,但在多modal标准下表现更好,特别是在地图localization和视觉相关文本理解方面。通过扩大SigLIP图像编码器到20亿参数,实现了新的多语言跨模态检索state-of-the-art。
    Abstract This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters, and achieves a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.
    摘要 本文提出PaLI-3,一种更小、更快、更强的视觉语言模型(VLM),其表现可与规模大10倍的同类模型相媲美。在取得这一性能的过程中,我们比较了使用分类目标预训练的视觉Transformer(ViT)模型与使用对比式(SigLIP)预训练的模型。我们发现,基于SigLIP的PaLI虽然在标准图像分类基准上略有逊色,但在多种多模态基准上表现更优,尤其是在定位和视觉场景文本理解方面。我们将SigLIP图像编码器扩展到20亿参数,在多语言跨模态检索上取得了新的最佳水平。我们希望仅有50亿参数的PaLI-3能够重新激发对复杂VLM基础组件的研究,并推动新一代规模化模型的发展。

Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA

  • paper_url: http://arxiv.org/abs/2310.09147
  • repo_url: None
  • paper_authors: Sheng Zhou, Dan Guo, Jia Li, Xun Yang, Meng Wang
  • for: 文章主要提出一种基于稀疏空间图网络(SSGN)的文本视觉问答方法,以避免冗余的关系推理。
  • methods: 1) 引入空间感知的关系剪枝技术,不再将所有视觉关系都用于答案预测;2) 使用空间距离、几何尺寸、重叠面积和DIoU进行空间感知剪枝;3) 学习三类视觉关系:对象-对象关系、OCR-OCR token关系以及对象-OCR token关系。
  • results: 在TextVQA和ST-VQA数据集上的实验表明SSGN取得了可观的性能,一些可视化结果也进一步体现了方法的可解释性。
    Abstract Text-based visual question answering (TextVQA) faces the significant challenge of avoiding redundant relational inference. To be specific, a large number of detected objects and optical character recognition (OCR) tokens result in rich visual relationships. Existing works take all visual relationships into account for answer prediction. However, there are three observations: (1) a single subject in the images can be easily detected as multiple objects with distinct bounding boxes (considered repetitive objects). The associations between these repetitive objects are superfluous for answer reasoning; (2) two spatially distant OCR tokens detected in the image frequently have weak semantic dependencies for answer reasoning; and (3) the co-existence of nearby objects and tokens may be indicative of important visual cues for predicting answers. Rather than utilizing all of them for answer prediction, we make an effort to identify the most important connections or eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task. As spatial factors for relation measurement, we employ spatial distance, geometric dimension, overlap area, and DIoU for spatially aware pruning. We consider three visual relationships for graph learning: object-object, OCR-OCR tokens, and object-OCR token relationships. SSGN is a progressive graph learning architecture that verifies the pivotal relations in the correlated object-token sparse graph, and then in the respective object-based sparse graph and token-based sparse graph. Experiment results on TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performances. And some visualization results further demonstrate the interpretability of our method.
    摘要 文本基于视觉问答(TextVQA)面临着避免重复关系推理的 significiant挑战。具体来说,图像中检测到的对象和Optical Character Recognition(OCR)token的大量会导致丰富的视觉关系。现有的工作是将所有视觉关系都用于答案预测。然而,我们有三个观察结论:(1)图像中的一个主题可以轻松地被检测到为多个对象的多个 bounding box(被视为重复的对象)。这些重复的对象之间的关系是不必要的 для答案推理;(2)图像中两个距离很远的 OCR token frequent 有弱的 semantic dependence for answer reasoning;(3)靠近的对象和token可能是答案预测中重要的视觉提示。而不是使用所有的视觉关系,我们尝试去标识最重要的连接或者减少重复的连接。我们提议一种稀疏空间图网络(SSGN),该网络引入了基于空间的关系剪辑技术。我们使用的空间因素包括空间距离、几何维度、重叠面积和 DIoU 等,用于空间自然剪辑。我们考虑了三种视觉关系进行图学学习:对象-对象关系、OCR Token-OCR Token 关系和对象-OCR Token 关系。SSGN 是一种进步的图学学习架构,它验证了相关对象-Token 稀疏图中的重要关系,然后在各自的对象基础稀疏图和 Token 基础稀疏图中验证。实验结果表明,SSGN 在 TextVQA 和 ST-VQA 数据集上表现出色。一些视觉化结果进一步证明了我们的方法的可解释性。
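The spatially-aware relation pruning described above can be illustrated with a small NumPy sketch that keeps an edge between two detected regions only when their DIoU and centre distance indicate spatial proximity. The thresholds and the exact combination of cues (the paper also uses geometric dimension and overlap area) are assumptions for illustration.

```python
import numpy as np

def diou(a, b):
    """DIoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter + 1e-9)
    ca = (np.array(a[:2]) + np.array(a[2:])) / 2        # box centres
    cb = (np.array(b[:2]) + np.array(b[2:])) / 2
    center_dist = np.sum((ca - cb) ** 2)
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])         # enclosing box diagonal
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
    return iou - center_dist / diag

def prune_relations(boxes, diou_min=-0.4, dist_max=0.5):
    """Keep an edge i-j only if the two boxes are spatially close: DIoU above
    a floor and normalised centre distance below a cap (thresholds are
    illustrative). Returns a sparse symmetric adjacency matrix."""
    n = len(boxes)
    adj = np.zeros((n, n), dtype=np.float32)
    centers = np.array([[(b[0] + b[2]) / 2, (b[1] + b[3]) / 2] for b in boxes])
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(centers[i] - centers[j])
            if diou(boxes[i], boxes[j]) > diou_min and d < dist_max:
                adj[i, j] = adj[j, i] = 1.0
    return adj

# usage with object / OCR-token boxes in normalised image coordinates
boxes = [(0.1, 0.1, 0.3, 0.3), (0.2, 0.2, 0.4, 0.4), (0.8, 0.8, 0.95, 0.95)]
print(prune_relations(boxes))
```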

Physics-guided Noise Neural Proxy for Low-light Raw Image Denoising

  • paper_url: http://arxiv.org/abs/2310.09126
  • repo_url: None
  • paper_authors: Hansen Feng, Lizhi Wang, Yiqi Huang, Yuzhi Wang, Hua Huang
  • for: 本研究旨在提高低光照静止图像净化的性能,通过学习基于synthetic数据的学习型方法。
  • methods: 我们提出了一种新的噪声模型学习框架,即physics-guided noise neural proxy(PNNP),它 integrate了三种高效技术:physics-guided noise decoupling(PND)、physics-guided proxy model(PPM)和分布式极值损失(DDL)。
  • results: 我们的PNNP框架在公共的低光照静止图像净化数据集和真实的低光照拍摄场景中展现出了超越性表现。
    Abstract Low-light raw image denoising plays a crucial role in mobile photography, and learning-based methods have become the mainstream approach. Training the learning-based methods with synthetic data emerges as an efficient and practical alternative to paired real data. However, the quality of synthetic data is inherently limited by the low accuracy of the noise model, which decreases the performance of low-light raw image denoising. In this paper, we develop a novel framework for accurate noise modeling that learns a physics-guided noise neural proxy (PNNP) from dark frames. PNNP integrates three efficient techniques: physics-guided noise decoupling (PND), physics-guided proxy model (PPM), and differentiable distribution-oriented loss (DDL). The PND decouples the dark frame into different components and handles different levels of noise in a flexible manner, which reduces the complexity of the noise neural proxy. The PPM incorporates physical priors to effectively constrain the generated noise, which promotes the accuracy of the noise neural proxy. The DDL provides explicit and reliable supervision for noise modeling, which promotes the precision of the noise neural proxy. Extensive experiments on public low-light raw image denoising datasets and real low-light imaging scenarios demonstrate the superior performance of our PNNP framework.
    摘要 低光照图像去噪扮演了手机摄影中关键的角色,学习基于方法成为主流。使用生成的数据进行训练学习基于方法 emerges as an efficient and practical alternative to paired real data。然而,生成数据质量的局限性由噪音模型的准确性减少了低光照图像去噪的性能。在这篇论文中,我们开发了一种新的框架,即物理导向噪音神经代理(PNNP),从黑框中学习噪音模型。PNNP integrates three efficient techniques:物理导向噪音分离(PND)、物理导向代理模型(PPM)和分布导向损失(DDL)。PND将黑框分解成不同组件,处理不同水平的噪音,减少噪音神经代理的复杂性。PPM incorporates physical priors to effectively constrain the generated noise, which promotes the accuracy of the noise neural proxy。DDL provides explicit and reliable supervision for noise modeling, which promotes the precision of the noise neural proxy。我们在公共的低光照图像去噪数据集和实际的低光照摄影场景进行了广泛的实验, demonstrably superior performance of our PNNP framework.
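As context for the physics-guided noise decoupling above, the following sketch shows the classical physics-based raw-noise components (signal-dependent shot noise, read noise, banded row noise) that such models are typically guided by. It is not the paper's learned neural proxy, and all parameter values are illustrative rather than calibrated.

```python
import numpy as np

def synthesize_raw_noise(clean, K=0.5, read_sigma=2.0, row_sigma=0.5, seed=0):
    """Simple physics-based raw-noise synthesis.
    clean: linear raw image in digital numbers (DN); K: system gain (DN per electron).
    Adds Poisson shot noise, Gaussian read noise, and one Gaussian offset per row."""
    rng = np.random.default_rng(seed)
    electrons = clean / K
    shot = rng.poisson(np.clip(electrons, 0, None)) * K        # signal-dependent shot noise
    read = rng.normal(0.0, read_sigma, clean.shape)            # per-pixel read noise
    row = rng.normal(0.0, row_sigma, (clean.shape[0], 1))      # banded row noise
    return shot + read + row

# usage: corrupt a synthetic low-light linear signal
clean = np.clip(np.random.rand(64, 64) * 20.0, 0, None)
noisy = synthesize_raw_noise(clean)
print(noisy.mean(), noisy.std())
```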

Training and Predicting Visual Error for Real-Time Applications

  • paper_url: http://arxiv.org/abs/2310.09125
  • repo_url: https://github.com/Jaliborc/rt-percept
  • paper_authors: João Libório Cardoso, Bernhard Kerbl, Lei Yang, Yury Uralsky, Michael Wimmer
  • for: 这个论文旨在提出一种基于卷积神经网络的图像质量评估方法,以实现在实时应用中高效地计算视觉误差。
  • methods: 该论文使用卷积神经网络来预测视觉误差,而不需要参考图像或渲染图像。这些神经网络通过利用 readily available 的图像空间信息和 reprojection from previous frames 来估计视觉误差,并且可以在实时应用中实现高效的计算。
  • results: 该论文的实验结果表明,使用卷积神经网络来预测视觉误差可以具有70%-90%的变差能力,并且可以在实时应用中实现至少一个数量级的计算时间减少。这些方法可以在实时应用中实现高效的图像质量评估,并且可以在未seen 图像区域中提供可靠的误差估计。
    Abstract Visual error metrics play a fundamental role in the quantification of perceived image similarity. Most recently, use cases for them in real-time applications have emerged, such as content-adaptive shading and shading reuse to increase performance and improve efficiency. A wide range of different metrics has been established, with the most sophisticated being capable of capturing the perceptual characteristics of the human visual system. However, their complexity, computational expense, and reliance on reference images to compare against prevent their generalized use in real-time, restricting such applications to using only the simplest available metrics. In this work, we explore the abilities of convolutional neural networks to predict a variety of visual metrics without requiring either reference or rendered images. Specifically, we train and deploy a neural network to estimate the visual error resulting from reusing shading or using reduced shading rates. The resulting models account for 70%-90% of the variance while achieving up to an order of magnitude faster computation times. Our solution combines image-space information that is readily available in most state-of-the-art deferred shading pipelines with reprojection from previous frames to enable an adequate estimate of visual errors, even in previously unseen regions. We describe a suitable convolutional network architecture and considerations for data preparation for training. We demonstrate the capability of our network to predict complex error metrics at interactive rates in a real-time application that implements content-adaptive shading in a deferred pipeline. Depending on the portion of unseen image regions, our approach can achieve up to $2\times$ performance compared to state-of-the-art methods.
    摘要 “视觉错误度量在图像相似性评估中扮演了基本角色。最近,它们在实时应用中得到了广泛使用,如内容适应填充和填充率调整,以提高性能和效率。已经建立了许多不同的度量方法,其中最复杂的可以捕捉人类视觉系统的特性。然而,它们的复杂性、计算成本和对参照图像进行比较的需求,使得它们在实时应用中无法普遍应用,只能使用最简单的度量方法。在这种情况下,我们 investigate了使用卷积神经网络预测多种视觉度量,不需要参照图像或渲染图像。我们训练和部署一个神经网络,以便估算由填充率或 reuse 的视觉错误。该模型能够覆盖70%-90%的变差,并且可以在实时应用中达到10倍的计算时间减少。我们的解决方案结合了 readily 可用的图像空间信息,并通过前一帧的 reprojection 来实现可靠的视觉错误估计,包括未看到的区域。我们描述了适合的卷积神经网络架构,以及训练数据的准备方法。我们在实时应用中示出了我们的网络可以在交互速度下预测复杂的视觉度量,并且可以根据未看到的区域的大小,实现2倍的性能提升。”
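A minimal PyTorch sketch of the idea above: a small CNN consumes image-space buffers plus the colour reprojected from the previous frame and regresses a per-tile visual-error estimate, without needing a rendered reference image. The input channels, tile size, and network depth are assumptions, not the paper's architecture or its specific error metrics.

```python
import torch
import torch.nn as nn

class VisualErrorNet(nn.Module):
    """Tiny CNN that regresses a per-tile visual-error estimate from
    image-space inputs a deferred pipeline already has (albedo, depth,
    roughness) plus the colour reprojected from the previous frame."""
    def __init__(self, in_ch=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, stride=2, padding=1))   # one value per 16x16 tile

    def forward(self, gbuffer, reprojected):
        x = torch.cat([gbuffer, reprojected], dim=1)
        return self.net(x)                               # (B, 1, H/16, W/16) error map

# usage: decide which tiles can reuse shading or run at a reduced shading rate
gbuffer = torch.randn(1, 5, 256, 256)      # e.g. albedo (3), depth (1), roughness (1)
reproj = torch.randn(1, 3, 256, 256)       # previous frame colour warped to this frame
err = VisualErrorNet()(gbuffer, reproj)
print(err.shape)                           # torch.Size([1, 1, 16, 16])
```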

Equirectangular image construction method for standard CNNs for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.09122
  • repo_url: None
  • paper_authors: Haoqian Chen, Jian Liu, Minghe Li, Kaiwen Jiang, Ziheng Xu, Rencheng Sun, Yi Sui
  • for: 该文章的目的是提出一种方法来将平面图像转换为全景图像,以便使用标准 convolutional neural networks (CNNs) 进行semantic segmentation。
  • methods: 该方法使用了 inverse transformation of the spherical center projection 和 equidistant cylindrical projection,以学习不同位置的扭曲特征,从而对全景图像进行semantic segmentation。
  • results: 实验表明,使用该方法可以获得最佳的平均 IoU 值为 43.76%,比其他三种方法(supervised learning、unsupervised learning 和数据增强)高出 23.85%、10.7% 和 17.23%。
    Abstract 360{\deg} spherical images have advantages of wide view field, and are typically projected on a planar plane for processing, which is known as equirectangular image. The object shape in equirectangular images can be distorted and lack translation invariance. In addition, there are few publicly dataset of equirectangular images with labels, which presents a challenge for standard CNNs models to process equirectangular images effectively. To tackle this problem, we propose a methodology for converting a perspective image into equirectangular image. The inverse transformation of the spherical center projection and the equidistant cylindrical projection are employed. This enables the standard CNNs to learn the distortion features at different positions in the equirectangular image and thereby gain the ability to semantically the equirectangular image. The parameter, {\phi}, which determines the projection position of the perspective image, has been analyzed using various datasets and models, such as UNet, UNet++, SegNet, PSPNet, and DeepLab v3+. The experiments demonstrate that an optimal value of {\phi} for effective semantic segmentation of equirectangular images is 6{\pi}/16 for standard CNNs. Compared with the other three types of methods (supervised learning, unsupervised learning and data augmentation), the method proposed in this paper has the best average IoU value of 43.76%. This value is 23.85%, 10.7% and 17.23% higher than those of other three methods, respectively.
    摘要 360°球形图像具有宽广的视场,通常被投影到平面上进行处理,即等距柱状投影(equirectangular)图像。然而,这类图像中的物体形状会发生畸变并缺乏平移不变性。此外,公开可用且带标注的等距柱状投影图像数据集很少,这使标准CNN模型难以有效处理此类图像。为解决这一问题,我们提出了一种将透视图像转换为等距柱状投影图像的方法,采用球心投影与等距圆柱投影的逆变换。这使标准CNN能够学习等距柱状投影图像中不同位置的畸变特征,从而获得对其进行语义分割的能力。我们使用多个数据集和模型(如UNet、UNet++、SegNet、PSPNet和DeepLab v3+)分析了决定透视图像投影位置的参数{\phi}。实验表明,对标准CNN而言,等距柱状投影图像语义分割的最优{\phi}值为6π/16。与其他三类方法(监督学习、无监督学习和数据增强)相比,本文方法取得了最高的平均IoU值43.76%,分别比其他三种方法高23.85%、10.7%和17.23%。
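A NumPy sketch of the construction described above: a perspective image is placed onto an equirectangular canvas by inverting the spherical-centre projection, with a parameter phi shifting the projection position (the paper reports 6π/16 as the best value for standard CNNs). The exact projection convention and the nearest-neighbour sampling are assumptions, not the authors' implementation.

```python
import numpy as np

def perspective_to_equirectangular(img, fov_deg=90.0, phi=6 * np.pi / 16,
                                   out_h=512, out_w=1024):
    """Place a pinhole image onto an equirectangular canvas by inverse
    spherical-centre projection; `phi` shifts the projection position in latitude."""
    h, w = img.shape[:2]
    f = 0.5 * w / np.tan(0.5 * np.radians(fov_deg))            # focal length in pixels
    # spherical direction of every equirectangular pixel
    lon = (np.arange(out_w) + 0.5) / out_w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(out_h) + 0.5) / out_h * np.pi
    lon, lat = np.meshgrid(lon, lat)
    lat = lat - (phi - np.pi / 2)                              # shift by projection position
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    # project onto the image plane z = f of the pinhole camera
    valid = z > 1e-6
    u = np.where(valid, f * x / np.clip(z, 1e-6, None) + w / 2, -1)
    v = np.where(valid, f * -y / np.clip(z, 1e-6, None) + h / 2, -1)
    inside = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    canvas = np.zeros((out_h, out_w, 3), dtype=img.dtype)
    canvas[inside] = img[v[inside].astype(int), u[inside].astype(int)]
    return canvas

# usage with a random stand-in for a perspective image
pers = (np.random.rand(240, 320, 3) * 255).astype(np.uint8)
equi = perspective_to_equirectangular(pers)
print(equi.shape)                      # (512, 1024, 3)
```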

DSG: An End-to-End Document Structure Generator

  • paper_url: http://arxiv.org/abs/2310.09118
  • repo_url: https://github.com/j-rausch/dsg
  • paper_authors: Johannes Rausch, Gentiana Rashiti, Maxim Gusev, Ce Zhang, Stefan Feuerriegel
  • for: 该论文主要目标是提供一种可以拟合文档结构的完全端到端训练系统,以便在实际应用中进行下游任务。
  • methods: 该论文提出了一种名为文档结构生成器(DSG)的新系统,该系统通过结合深度神经网络进行文档解析,包括文档中的实体(如图文、文本块、标题等)和这些实体之间的关系。与传统系统不同的是,我们的DSG是通过端到端训练而训练的,从而使其在实际应用中更有效和灵活。
  • results: 我们的实验结果表明,我们的DSG系统可以在评估 dataset 上超过商业 OCR 工具的性能,并且在这之上还达到了状态对照表现。此外,我们还提供了一个大规模的实际杂志数据集,以便用于评估和比较。
    Abstract Information in industry, research, and the public sector is widely stored as rendered documents (e.g., PDF files, scans). Hence, to enable downstream tasks, systems are needed that map rendered documents onto a structured hierarchical format. However, existing systems for this task are limited by heuristics and are not end-to-end trainable. In this work, we introduce the Document Structure Generator (DSG), a novel system for document parsing that is fully end-to-end trainable. DSG combines a deep neural network for parsing (i) entities in documents (e.g., figures, text blocks, headers, etc.) and (ii) relations that capture the sequence and nested structure between entities. Unlike existing systems that rely on heuristics, our DSG is trained end-to-end, making it effective and flexible for real-world applications. We further contribute a new, large-scale dataset called E-Periodica comprising real-world magazines with complex document structures for evaluation. Our results demonstrate that our DSG outperforms commercial OCR tools and, on top of that, achieves state-of-the-art performance. To the best of our knowledge, our DSG system is the first end-to-end trainable system for hierarchical document parsing.
    摘要 在行业、研究和公共部门中,信息通常以渲染的文档形式储存(例如PDF文档、扫描件)。因此,为下游任务提供支持,需要一种将渲染文档映射到结构化层次格式的系统。然而,现有的这种任务系统受限于规则和规则,不是可全程训练的。在这种工作中,我们介绍了文档结构生成器(DSG),一种全程训练的文档分析系统。DSG组合了深度神经网络来分析文档中的实体(例如图片、文本块、标题等)以及这些实体之间的关系,包括嵌入结构和顺序结构。与现有系统不同,我们的DSG不仅仅靠规则,而是通过全程训练,使其在实际应用中更有效和灵活。此外,我们还提供了一个新的大规模数据集called E-Periodica,包含了复杂的现实世界杂志文档,用于评估。我们的结果表明,我们的DSG比商业OCR工具更高效,而且在此基础之上还实现了状态的最佳性。到目前为止,我们的DSG系统是第一个可全程训练的文档结构分析系统。

Faster 3D cardiac CT segmentation with Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.09099
  • repo_url: https://github.com/ljollans/trunet
  • paper_authors: Lee Jollans, Mariana Bustamante, Lilian Henriksson, Anders Persson, Tino Ebbers
  • for: The paper is focused on developing a new deep learning architecture for 3D semantic segmentation of cardiac computed tomography (CT) volumes.
  • methods: The authors adapted the Vision Transformer (ViT) for three-dimensional volume inputs and incorporated a modified ResNet50 block and a ViT block into their hybrid Transformer-Residual U-Net framework (TRUNet). They also used cascade upsampling with skip connections.
  • results: The TRUNet model converged in significantly less time than residual U-Net while providing comparable or superior segmentations of the left ventricle, left atrium, left atrial appendage, ascending aorta, and pulmonary veins. The model also offered more precise vessel boundary segmentation and better captured the heart's overall anatomical structure.
    Abstract Accurate segmentation of the heart is essential for personalized blood flow simulations and surgical intervention planning. A recent advancement in image recognition is the Vision Transformer (ViT), which expands the field of view to encompass a greater portion of the global image context. We adapted ViT for three-dimensional volume inputs. Cardiac computed tomography (CT) volumes from 39 patients, featuring up to 20 timepoints representing the complete cardiac cycle, were utilized. Our network incorporates a modified ResNet50 block as well as a ViT block and employs cascade upsampling with skip connections. Despite its increased model complexity, our hybrid Transformer-Residual U-Net framework, termed TRUNet, converges in significantly less time than residual U-Net while providing comparable or superior segmentations of the left ventricle, left atrium, left atrial appendage, ascending aorta, and pulmonary veins. TRUNet offers more precise vessel boundary segmentation and better captures the heart's overall anatomical structure compared to residual U-Net, as confirmed by the absence of extraneous clusters of missegmented voxels. In terms of both performance and training speed, TRUNet exceeded U-Net, a commonly used segmentation architecture, making it a promising tool for 3D semantic segmentation tasks in medical imaging. The code for TRUNet is available at github.com/ljollans/TRUNet.
    摘要 心脏分割的精确性是 personnalized 血流模拟和心脏手术观察规划的重要因素。 recent advancement in image recognition 中的一个是 Vision Transformer (ViT),它扩展了 global image context 的 Field of view。我们将 ViT 应用到三维量入力中。使用 39 名病人的心脏 Computed Tomography (CT) 量据,这些量据包含了完整的心脏周期,我们的网络包括 Modified ResNet50 块以及 ViT 块,并使用递增缩减 skip connections。 despite its increased model complexity, our hybrid Transformer-Residual U-Net framework, termed TRUNet, converges in significantly less time than residual U-Net while providing comparable or superior segmentations of the left ventricle, left atrium, left atrial appendage, ascending aorta, and pulmonary veins。 TRUNet 提供了更精确的血管边界分 segmentation 和更好地捕捉心脏的全面 anatomical structure,与 residual U-Net 不同,确认了没有过度分类的错误 voxels。在性能和训练速度方面,TRUNet 超越了 U-Net,一个常用的 segmentation 架构,使其成为适合三维Semantic segmentation 任务的可靠工具。 TRUNet 的代码可以在 github.com/ljollans/TRUNet 中找到。

iPUNet:Iterative Cross Field Guided Point Cloud Upsampling

  • paper_url: http://arxiv.org/abs/2310.09092
  • repo_url: None
  • paper_authors: Guangshun Wei, Hao Pan, Shaojie Zhuang, Yuanfeng Zhou, Changjian Li
  • for: 提高3D扫描产生的点云的使用可用性,增强点云的精度和完整性。
  • methods: 提出一种学习基于的点云upsampling方法,iPUNet,该方法可以生成高密度和均匀的点云,并更好地捕捉到锐利特征。通过自我超vision引导点生成,并在每个输入点上学习地方参数化表面,实现arbitrary Ratio upsampling。进一步,通过迭代策略,将不均匀的输入点移动到愿望的连续3D表面上,以提高点云的精度和完整性。
  • results: 对多种物体和场景的扫描数据进行了广泛的评估,展示了iPUNet可以effectively Handle noisy和非均匀分布的输入点云,并超越当前点云upsampling方法的性能。
    Abstract Point clouds acquired by 3D scanning devices are often sparse, noisy, and non-uniform, causing a loss of geometric features. To facilitate the usability of point clouds in downstream applications, given such input, we present a learning-based point upsampling method, i.e., iPUNet, which generates dense and uniform points at arbitrary ratios and better captures sharp features. To generate feature-aware points, we introduce cross fields that are aligned to sharp geometric features by self-supervision to guide point generation. Given cross field defined frames, we enable arbitrary ratio upsampling by learning at each input point a local parameterized surface. The learned surface consumes the neighboring points and 2D tangent plane coordinates as input, and maps onto a continuous surface in 3D where arbitrary ratios of output points can be sampled. To solve the non-uniformity of input points, on top of the cross field guided upsampling, we further introduce an iterative strategy that refines the point distribution by moving sparse points onto the desired continuous 3D surface in each iteration. Within only a few iterations, the sparse points are evenly distributed and their corresponding dense samples are more uniform and better capture geometric features. Through extensive evaluations on diverse scans of objects and scenes, we demonstrate that iPUNet is robust to handle noisy and non-uniformly distributed inputs, and outperforms state-of-the-art point cloud upsampling methods.
    摘要 <>转换给定文本到简化中文。>三Dimensional扫描设备获取的点云经常是稀疏、噪声和不均匀的,这会导致点云的几何特征丢失。为了使点云在下游应用中更加可用,我们提出了一种学习基于的点云填充方法,即iPUNet,该方法可以生成稠密和均匀的点云,并更好地捕捉锐度特征。为了生成具有特征的点云,我们引入了相关的横向场,这些场景被自我超视来引导点生成。给定横向场定义的帧,我们启用了任意比例填充,通过学习每个输入点的本地参数化表面来实现。这个学习的表面 consume了邻近点和2D tangent plane坐标作为输入,并将其映射到3D连续表面上,从而实现任意比例的输出点抽象。为了解决输入点的不均匀性,我们采用了基于横向场的迭代策略,通过在每个迭代中将稀疏点移动到所需的连续3D表面上来缓解输入点的不均匀性。只需几个迭代,稀疏点就可以均匀分布,其对应的稠密样本也更加均匀和更好地捕捉几何特征。通过对各种物体和场景的扫描数据进行广泛的评估,我们证明了iPUNet可以有效地处理噪声和不均匀分布的输入点云,并超过了当前的点云填充方法。

pose-format: Library for Viewing, Augmenting, and Handling .pose Files

  • paper_url: http://arxiv.org/abs/2310.09066
  • repo_url: https://github.com/sign-language-processing/pose
  • paper_authors: Amit Moryossef, Mathias Müller, Rebecka Fahrni
  • for: 管理和分析姿势数据是一项复杂的任务,面临着多种挑战,如处理多种文件结构和数据类型,以及实现有效的数据操作,如归一化和增强。这篇论文提出了\texttt{pose-format}工具包,用于解决这些挑战。
  • methods: \texttt{pose-format}工具包包括特有的文件格式,可以包含多个个体和无限多个时间帧,因此适用于图像和视频数据。此外,它支持与流行的数学库,如NumPy、PyTorch和TensorFlow进行紧密集成,从而实现了强大的机器学习应用。
  • results: 通过 benchmarking,我们表明,我们的\texttt{.pose}文件格式在与常见格式如OpenPose相比,具有明显的性能优势,同时具有自包含的姿势规定的优点。此外,库还包括数据归一化、增强和易用的视觉化功能,可以在Python和浏览器环境中使用。因此,\texttt{pose-format}成为一个一站式解决方案,协助管理和分析姿势数据的复杂性。
    Abstract Managing and analyzing pose data is a complex task, with challenges ranging from handling diverse file structures and data types to facilitating effective data manipulations such as normalization and augmentation. This paper presents \texttt{pose-format}, a comprehensive toolkit designed to address these challenges by providing a unified, flexible, and easy-to-use interface. The library includes a specialized file format that encapsulates various types of pose data, accommodating multiple individuals and an indefinite number of time frames, thus proving its utility for both image and video data. Furthermore, it offers seamless integration with popular numerical libraries such as NumPy, PyTorch, and TensorFlow, thereby enabling robust machine-learning applications. Through benchmarking, we demonstrate that our \texttt{.pose} file format offers vastly superior performance against prevalent formats like OpenPose, with added advantages like self-contained pose specification. Additionally, the library includes features for data normalization, augmentation, and easy-to-use visualization capabilities, both in Python and Browser environments. \texttt{pose-format} emerges as a one-stop solution, streamlining the complexities of pose data management and analysis.
    摘要 管理和分析姿态数据是一项复杂的任务,具有从处理多种文件结构和数据类型到实现有效的数据处理操作 such as нормализация和扩展的挑战。这篇论文介绍了 \texttt{pose-format},一个完整的工具集,用于解决这些挑战。该库包括一种专门的文件格式,可以包含多个个体和无限多个时间帧,因此适用于图像和视频数据。此外,它还提供了与流行的数字库 such as NumPy、PyTorch 和 TensorFlow 的灵活集成,使得可以实现 Robust 的机器学习应用。经 benchmarking,我们示出了我们的 \texttt{.pose} 文件格式在与普遍使用的 OpenPose 格式相比,具有明显的性能优势,同时具有自包含的姿态规范等优点。此外,库还包括数据normalization、扩展和轻松使用的可视化功能,可以在 Python 和浏览器环境中使用。 \texttt{pose-format} 成为一个一站式解决方案,协助管理和分析姿态数据的复杂性。
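The normalization and augmentation operations the toolkit is meant to streamline can be illustrated with plain NumPy. Note this is explicitly not the pose-format library's API; the keypoint indices and the shoulder-based normalisation are assumptions used only to show the kind of manipulation involved.

```python
import numpy as np

def normalize_pose(frames, left_shoulder=5, right_shoulder=6):
    """Shift the pose to the mid-shoulder point and scale by shoulder width,
    a common normalisation for pose sequences.
    frames: (T, K, 2) array of T time frames with K 2D keypoints."""
    mid = (frames[:, left_shoulder] + frames[:, right_shoulder]) / 2
    width = np.linalg.norm(frames[:, left_shoulder] - frames[:, right_shoulder],
                           axis=-1, keepdims=True)
    return (frames - mid[:, None, :]) / np.clip(width[:, None, :], 1e-6, None)

def rotate_augment(frames, max_deg=10.0, rng=None):
    """Random in-plane rotation augmentation applied to every frame."""
    rng = rng or np.random.default_rng()
    theta = np.radians(rng.uniform(-max_deg, max_deg))
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return frames @ rot.T

# usage on a random 30-frame, 17-keypoint sequence
seq = np.random.rand(30, 17, 2)
print(rotate_augment(normalize_pose(seq)).shape)     # (30, 17, 2)
```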

VCL Challenges 2023 at ICCV 2023 Technical Report: Bi-level Adaptation Method for Test-time Adaptive Object Detection

  • paper_url: http://arxiv.org/abs/2310.08986
  • repo_url: None
  • paper_authors: Chenyu Lin, Yusheng He, Zhengqing Zang, Chenwei Tang, Tao Wang, Jiancheng Lv
  • for: 本报告介绍我们团队参与 VCL Challenges B(Continual Test-time Adaptation)的方法技术细节。
  • methods: 使用双层(bi-level)适应方法,包括图像级别和检测器级别的适应:图像级别采用基于可调参数的图像滤波器,检测器级别采用基于可调参数的均值教师模块。
  • results: 在 VCL Challenges B 测试集的目标域上取得 38.3% mAP,mAP 下降仅 4.2%,整体性能为 32.5% mAP。
    Abstract This report outlines our team's participation in VCL Challenges B Continual Test-time Adaptation, focusing on the technical details of our approach. Our primary focus is test-time adaptation using bi-level adaptations, encompassing image-level and detector-level adaptations. At the image level, we employ adjustable parameter-based image filters, while at the detector level, we leverage adjustable parameter-based mean teacher modules. Ultimately, through the utilization of these bi-level adaptations, we have achieved a remarkable 38.3% mAP on the target domain of the test set within VCL Challenges B. It is worth noting that the drop in mAP is merely 4.2%, and the overall performance is 32.5% mAP.
    摘要 本报告介绍我们团队参与 VCL Challenges B(持续测试时适应)的情况,重点说明方法的技术细节。我们的核心是利用双层适应进行测试时适应,包括图像级别和检测器级别两部分:在图像级别,我们使用基于可调参数的图像滤波器;在检测器级别,我们利用基于可调参数的均值教师模块。借助这些双层适应,我们在 VCL Challenges B 测试集的目标域上取得了 38.3% 的 mAP,其中 mAP 下降仅 4.2%,整体性能为 32.5% mAP。
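A minimal PyTorch sketch of the detector-level adaptation described above, reduced to its core mean-teacher mechanics: the teacher is an exponential moving average of the student, produces soft pseudo-labels on the incoming unlabeled target data, and the student is updated towards them at test time. The loss choice, momentum value, and the stand-in classifier are assumptions; the actual system adapts a detector and also applies the image-level filters.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Mean-teacher update: teacher weights track an EMA of the student's."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def test_time_step(student, teacher, batch, optimizer):
    """One continual test-time adaptation step on an unlabeled target batch."""
    teacher.eval()
    with torch.no_grad():
        pseudo = F.softmax(teacher(batch), dim=1)           # soft pseudo-labels
    student.train()
    loss = F.kl_div(F.log_softmax(student(batch), dim=1), pseudo,
                    reduction='batchmean')
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)                            # keep the teacher trailing
    return loss.item()

# usage with a toy classifier standing in for the detector heads
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
teacher = copy.deepcopy(student)
opt = torch.optim.SGD(student.parameters(), lr=1e-3)
print(test_time_step(student, teacher, torch.randn(4, 3, 32, 32), opt))
```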

UniParser: Multi-Human Parsing with Unified Correlation Representation Learning

  • paper_url: http://arxiv.org/abs/2310.08984
  • repo_url: https://github.com/cjm-sfw/Uniparser
  • paper_authors: Jiaming Chu, Lei Jin, Junliang Xing, Jian Zhao
  • for: 该论文研究多人解析(multi-human parsing)这一图像分割任务,它同时需要实例级别和细粒度类别级别的信息。
  • methods: 该论文提出融合方法 UniParser,从三个关键方面整合实例级别与类别级别表示:1) 提出统一的相关性表示学习方法,使网络在余弦空间中学习实例与类别特征;2) 将各模块的输出统一为像素级分割结果,并使用同质标签配合辅助损失来监督实例与类别特征;3) 设计联合优化过程以融合实例与类别表示。
  • results: 通过统一实例级别与类别级别的输出,UniParser 超越了当前最佳方法,在 MHPv2.0 上取得 49.3% AP,在 CIHP 上取得 60.4% AP。
    Abstract Multi-human parsing is an image segmentation task necessitating both instance-level and fine-grained category-level information. However, prior research has typically processed these two types of information through separate branches and distinct output formats, leading to inefficient and redundant frameworks. This paper introduces UniParser, which integrates instance-level and category-level representations in three key aspects: 1) we propose a unified correlation representation learning approach, allowing our network to learn instance and category features within the cosine space; 2) we unify the form of outputs of each modules as pixel-level segmentation results while supervising instance and category features using a homogeneous label accompanied by an auxiliary loss; and 3) we design a joint optimization procedure to fuse instance and category representations. By virtual of unifying instance-level and category-level output, UniParser circumvents manually designed post-processing techniques and surpasses state-of-the-art methods, achieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP. We will release our source code, pretrained models, and online demos to facilitate future studies.
    摘要 多人解析是一种同时需要实例级别和细粒度类别级别信息的图像分割任务。然而,先前的研究通常通过不同的分支和不同的输出格式分别处理这两类信息,导致框架低效且冗余。本文介绍 UniParser,它从三个关键方面整合实例级别和类别级别的表示:1. 我们提出统一的相关性表示学习方法,让网络在余弦空间中学习实例和类别特征;2. 我们将每个模块的输出形式统一为像素级分割结果,并使用同质标签配合辅助损失来监督实例和类别特征;3. 我们设计了联合优化过程来融合实例和类别表示。通过统一实例级别和类别级别的输出,UniParser 无需手动设计的后处理技术,并超越当前最佳方法,在 MHPv2.0 上取得 49.3% AP,在 CIHP 上取得 60.4% AP。我们将发布源代码、预训练模型和在线演示,以便未来的研究。
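The unified correlation representation in cosine space described above can be sketched in a few lines: per-pixel features and instance/category query vectors are L2-normalised so their inner product is a cosine similarity used directly as a segmentation logit. The temperature and the way queries are formed are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_correlation(pixel_feats, queries, scale=10.0):
    """Correlation in cosine space.
    pixel_feats: (B, C, H, W) per-pixel embeddings.
    queries:     (B, N, C) instance or category query vectors.
    Returns (B, N, H, W) logits, one map per query."""
    p = F.normalize(pixel_feats, dim=1)
    q = F.normalize(queries, dim=2)
    return torch.einsum('bnc,bchw->bnhw', q, p) * scale

# usage: instance and category queries share the same formulation
feats = torch.randn(2, 64, 32, 32)
inst_maps = cosine_correlation(feats, torch.randn(2, 8, 64))    # 8 instance queries
cat_maps = cosine_correlation(feats, torch.randn(2, 20, 64))    # 20 category queries
print(inst_maps.shape, cat_maps.shape)   # (2, 8, 32, 32) (2, 20, 32, 32)
```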

LRRU: Long-short Range Recurrent Updating Networks for Depth Completion

  • paper_url: http://arxiv.org/abs/2310.08956
  • repo_url: https://github.com/YufeiWang777/LRRU
  • paper_authors: Yufei Wang, Bo Li, Ge Zhang, Qi Liu, Tao Gao, Yuchao Dai
  • for: 提高深度完成任务的效率,适用于实际应用。
  • methods: 提出了一种新的轻量级深度网络框架LRRU,通过不学习复杂特征表示来实现深度完成。LRRU首先粗略填充空 sparse输入数据,然后通过学习空间variant的核函数进行迭代更新。
  • results: 实验结果表明,我们提出的LRRU变体可以在不同参数情况下达到领先的性能水平,特别是LRRU-Base模型在NYUv2数据集上的表现超过竞争对手,并在提交时间点上在KITTI深度完成评价板块上排名第一。
    Abstract Existing deep learning-based depth completion methods generally employ massive stacked layers to predict the dense depth map from sparse input data. Although such approaches greatly advance this task, their accompanied huge computational complexity hinders their practical applications. To accomplish depth completion more efficiently, we propose a novel lightweight deep network framework, the Long-short Range Recurrent Updating (LRRU) network. Without learning complex feature representations, LRRU first roughly fills the sparse input to obtain an initial dense depth map, and then iteratively updates it through learned spatially-variant kernels. Our iterative update process is content-adaptive and highly flexible, where the kernel weights are learned by jointly considering the guidance RGB images and the depth map to be updated, and large-to-small kernel scopes are dynamically adjusted to capture long-to-short range dependencies. Our initial depth map has coarse but complete scene depth information, which helps relieve the burden of directly regressing the dense depth from sparse ones, while our proposed method can effectively refine it to an accurate depth map with less learnable parameters and inference time. Experimental results demonstrate that our proposed LRRU variants achieve state-of-the-art performance across different parameter regimes. In particular, the LRRU-Base model outperforms competing approaches on the NYUv2 dataset, and ranks 1st on the KITTI depth completion benchmark at the time of submission. Project page: https://npucvr.github.io/LRRU/.
    摘要 现有的基于深度学习的深度补全方法通常采用大量堆叠的网络层,从稀疏输入数据预测稠密深度图。尽管这些方法极大地推进了该任务,但其巨大的计算复杂度阻碍了实际应用。为更高效地完成深度补全,我们提出了一种新颖的轻量级深度网络框架——长短程递归更新(LRRU)网络。LRRU 不学习复杂的特征表示,而是先对稀疏输入进行粗略填充得到初始稠密深度图,再通过学习到的空间可变核对其进行迭代更新。我们的迭代更新过程是内容自适应且高度灵活的:核权重由引导RGB图像和待更新的深度图共同决定,核作用范围由大到小动态调整,以捕获从长程到短程的依赖关系。初始深度图包含粗略但完整的场景深度信息,减轻了直接从稀疏深度回归稠密深度的负担,而我们的方法能以更少的可学习参数和推理时间将其细化为精确的深度图。实验结果表明,我们提出的 LRRU 各个变体在不同参数规模下均达到了当前最佳性能,其中 LRRU-Base 模型在 NYUv2 数据集上超越了同类方法,并在投稿时位列 KITTI 深度补全基准榜首。项目主页:https://npucvr.github.io/LRRU/。
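A simplified PyTorch sketch of the two stages described above: a non-learned coarse fill of the sparse depth, followed by a learned, spatially-variant kernel update guided by the RGB image. The fill procedure, kernel size, and the small prediction head are assumptions standing in for the paper's components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def coarse_fill(sparse, iters=50):
    """Roughly fill a sparse depth map by repeatedly propagating the average
    of valid 3x3 neighbours into empty pixels (a stand-in for the paper's
    non-learned pre-filling)."""
    depth, valid = sparse.clone(), (sparse > 0).float()
    kernel = torch.ones(1, 1, 3, 3, device=sparse.device)
    for _ in range(iters):
        s = F.conv2d(depth * valid, kernel, padding=1)
        c = F.conv2d(valid, kernel, padding=1)
        filled = s / c.clamp(min=1e-6)
        depth = torch.where(valid > 0, depth, filled)
        valid = ((valid > 0) | (c > 0)).float()
    return depth

class SpatiallyVariantUpdate(nn.Module):
    """One recurrent update: a small CNN predicts a per-pixel 3x3 kernel plus a
    residual from the RGB guidance and current depth, and the depth is
    re-aggregated with those weights."""
    def __init__(self, k=3):
        super().__init__()
        self.k = k
        self.pred = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, k * k + 1, 3, padding=1))

    def forward(self, depth, rgb):
        out = self.pred(torch.cat([rgb, depth], dim=1))
        weights = F.softmax(out[:, :self.k * self.k], dim=1)      # (B, k*k, H, W)
        residual = out[:, self.k * self.k:]
        B, _, H, W = depth.shape
        patches = F.unfold(depth, self.k, padding=self.k // 2)    # (B, k*k, H*W)
        agg = (patches.view(B, self.k * self.k, H, W) * weights).sum(1, keepdim=True)
        return agg + residual

# usage: fill a synthetic sparse map, then refine it twice
sparse = torch.rand(1, 1, 64, 64) * (torch.rand(1, 1, 64, 64) > 0.95)
rgb = torch.rand(1, 3, 64, 64)
depth = coarse_fill(sparse)
update = SpatiallyVariantUpdate()
for _ in range(2):
    depth = update(depth, rgb)
print(depth.shape)     # torch.Size([1, 1, 64, 64])
```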

Online Adaptive Disparity Estimation for Dynamic Scenes in Structured Light Systems

  • paper_url: http://arxiv.org/abs/2310.08934
  • repo_url: None
  • paper_authors: Rukun Qiao, Hiroshi Kawasaki, Hongbin Zha
  • for: 这 paper 是为了解决深度神经网络在不同环境中的性能下降问题,提出了自主学习在线适应方法。
  • methods: 这 paper 使用了一种基于长序输入的无监督损失函数,以便在测试时进行网络适应。
  • results: 该 paper 的提出方法可以快速地适应新的环境,并且在未看过的数据上达到了更高的准确率。
    Abstract In recent years, deep neural networks have shown remarkable progress in dense disparity estimation from dynamic scenes in monocular structured light systems. However, their performance significantly drops when applied in unseen environments. To address this issue, self-supervised online adaptation has been proposed as a solution to bridge this performance gap. Unlike traditional fine-tuning processes, online adaptation performs test-time optimization to adapt networks to new domains. Therefore, achieving fast convergence during the adaptation process is critical for attaining satisfactory accuracy. In this paper, we propose an unsupervised loss function based on long sequential inputs. It ensures better gradient directions and faster convergence. Our loss function is designed using a multi-frame pattern flow, which comprises a set of sparse trajectories of the projected pattern along the sequence. We estimate the sparse pseudo ground truth with a confidence mask using a filter-based method, which guides the online adaptation process. Our proposed framework significantly improves the online adaptation speed and achieves superior performance on unseen data.
    摘要 近年来,深度神经网络在单目结构光系统的动态场景稠密视差估计上取得了显著进展,然而在未见环境中应用时性能会明显下降。为解决这一问题,自监督在线适应被提出,以弥补这一性能差距。与传统的微调过程不同,在线适应在测试时进行优化,使网络适应新的领域,因此适应过程中的快速收敛对于获得令人满意的精度至关重要。本文提出一种基于长序列输入的无监督损失函数,它能保证更好的梯度方向和更快的收敛。该损失函数基于多帧模式流(pattern flow)设计,即投影图案沿序列形成的一组稀疏轨迹。我们使用基于滤波的方法估计带置信度掩码的稀疏伪真值,以引导在线适应过程。所提出的框架显著提升了在线适应速度,并在未见数据上取得了更优的性能。

TIDE: Temporally Incremental Disparity Estimation via Pattern Flow in Structured Light System

  • paper_url: http://arxiv.org/abs/2310.08932
  • repo_url: https://github.com/codepointer/tidenet
  • paper_authors: Rukun Qiao, Hiroshi Kawasaki, Hongbin Zha
  • for: 该论文提出一种基于学习的视差计算方法,用于单目结构光系统中的场景重建。
  • methods: 该论文使用名为 TIDE-Net 的循环网络,利用投影图案的形变(pattern flow)建模时间信息,并将其与按 pattern flow 扭曲后的前一帧视差结果进行融合。
  • results: 仅使用合成数据训练,该模型在真实数据上展现出较高的准确率和效率,优于多种现有方法。
    Abstract We introduced Temporally Incremental Disparity Estimation Network (TIDE-Net), a learning-based technique for disparity computation in mono-camera structured light systems. In our hardware setting, a static pattern is projected onto a dynamic scene and captured by a monocular camera. Different from most former disparity estimation methods that operate in a frame-wise manner, our network acquires disparity maps in a temporally incremental way. Specifically, We exploit the deformation of projected patterns (named pattern flow ) on captured image sequences, to model the temporal information. Notably, this newly proposed pattern flow formulation reflects the disparity changes along the epipolar line, which is a special form of optical flow. Tailored for pattern flow, the TIDE-Net, a recurrent architecture, is proposed and implemented. For each incoming frame, our model fuses correlation volumes (from current frame) and disparity (from former frame) warped by pattern flow. From fused features, the final stage of TIDE-Net estimates the residual disparity rather than the full disparity, as conducted by many previous methods. Interestingly, this design brings clear empirical advantages in terms of efficiency and generalization ability. Using only synthetic data for training, our extensitve evaluation results (w.r.t. both accuracy and efficienty metrics) show superior performance than several SOTA models on unseen real data. The code is available on https://github.com/CodePointer/TIDENet.
    摘要 我们介绍了 Temporally Incremental Disparity Estimation Network(TIDE-Net),一种基于学习的离散光系统中的 disparity 计算技术。在我们的硬件设置中,一个静止的模式被投射到了动态场景中,并被一个单目标 Camera 捕获。与大多数前一代 disparity 估计方法不同,我们的网络在帧率上进行了 temporally 增量的 disparity 计算。具体来说,我们利用投射模式(名为 pattern flow)在捕获到的图像序列中的变形,来模型时间信息。值得注意的是,这种 newly proposed pattern flow 表示在epipolar line上的 disparity 变化,这是一种特殊的 optic flow。针对 pattern flow,我们提出了 TIDE-Net,一种循环架构。每个进来的帧,我们的模型将 correlation volumes(从当前帧)和 disparity(从前一帧)折叠为 pattern flow 后的残差 disparity,而不是完整的 disparity。这种设计带来了明显的 empirical 优势,包括效率和通用能力。使用仅synthetic data для训练,我们的广泛的评估结果(相对准确率和效率 metric)表明 TIDE-Net 在未看到的实际数据上表现出色,superior 于多个state-of-the-art 模型。代码可以在 https://github.com/CodePointer/TIDENet 上获取。
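A small PyTorch sketch of the temporal fusion step described above: the previous frame's disparity is warped to the current frame along the epipolar (horizontal) direction using the pattern flow, and the network then only has to predict a residual on top of the warp. The sampling convention and the placeholder residual are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def warp_by_pattern_flow(prev_disp, flow_x):
    """Warp the previous frame's disparity to the current frame using pattern
    flow, which in a calibrated structured-light system moves points along the
    epipolar (horizontal) line only.
    prev_disp: (B, 1, H, W); flow_x: (B, 1, H, W) horizontal displacement in pixels."""
    B, _, H, W = prev_disp.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    xs = xs[None].float() + flow_x[:, 0]                     # shifted source column
    ys = ys[None].float().expand(B, H, W)
    grid = torch.stack([2 * xs / (W - 1) - 1, 2 * ys / (H - 1) - 1], dim=-1)
    return F.grid_sample(prev_disp, grid, align_corners=True)

# usage: the recurrent network would then predict only a residual on top of the warp
prev_disp = torch.rand(1, 1, 48, 64) * 30
flow_x = torch.randn(1, 1, 48, 64)                           # pattern flow (horizontal)
warped = warp_by_pattern_flow(prev_disp, flow_x)
residual = torch.zeros_like(warped)                          # placeholder for the CNN output
current_disp = warped + residual
print(current_disp.shape)                                    # torch.Size([1, 1, 48, 64])
```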

Towards Interpretable Controllability in Object-Centric Learning

  • paper_url: http://arxiv.org/abs/2310.08929
  • repo_url: None
  • paper_authors: Jinwoo Kim, Janghyuk Choi, Jaehyun Kang, Changyeon Lee, Ho-Jin Choi, Seon Joo Kim
  • for: 该论文旨在探讨人工神经网络中的绑定问题,寻找可以达到人类认知水平的方法,通过符号化的Entities来理解世界。
  • methods: 该论文提出了一种新的方法,即带图像增强的插槽注意力(SlotAug),通过自监督方式利用图像增强策略学习对插槽的可解释控制。同时,还提出了两个子方法:辅助恒等操控(Auxiliary Identity Manipulation)和插槽一致性损失(Slot Consistency Loss),以实现对插槽的迭代且可逆的控制。
  • results: 我们的实验和理论验证表明,我们的方法可以有效地实现可读性和可控性,提供了一种新的可控性控制对象表示的能力。
    Abstract The binding problem in artificial neural networks is actively explored with the goal of achieving human-level recognition skills through the comprehension of the world in terms of symbol-like entities. Especially in the field of computer vision, object-centric learning (OCL) is extensively researched to better understand complex scenes by acquiring object representations or slots. While recent studies in OCL have made strides with complex images or videos, the interpretability and interactivity over object representation remain largely uncharted, still holding promise in the field of OCL. In this paper, we introduce a novel method, Slot Attention with Image Augmentation (SlotAug), to explore the possibility of learning interpretable controllability over slots in a self-supervised manner by utilizing an image augmentation strategy. We also devise the concept of sustainability in controllable slots by introducing iterative and reversible controls over slots with two proposed submethods: Auxiliary Identity Manipulation and Slot Consistency Loss. Extensive empirical studies and theoretical validation confirm the effectiveness of our approach, offering a novel capability for interpretable and sustainable control of object representations. Code will be available soon.
    摘要 artifical neural networks 的绑定问题在激发学习中得到了广泛的研究,以实现人类水平的识别能力,通过对世界的符号化表示来解释复杂的场景。特别在计算机视觉领域,对象中心学习(OCL)被广泛研究,以更好地理解复杂的场景,获得对象表示或槽的获得。虽然最近的OCL研究在复杂图像或视频上已经做出了很大的进展,但是解释性和交互性在对象表示上仍然未得到了充分的探索,这些领域仍然具有潜在的探索空间。在这篇论文中,我们提出了一种新的方法,即插入图像增强(Slot Augmentation,SlotAug),以探索在自然的方式下可以学习可控的槽表示。我们还提出了持续可控的槽的概念,并通过两种提议的子方法:协助标识修饰和槽一致损失来实现可控的槽。我们的方法得到了广泛的实验和理论验证,可以提供一种新的可解释的可控的对象表示能力。代码即将上传。
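To make the slot-consistency idea concrete, the following is a minimal sketch of a consistency loss between slots obtained from an image and from its augmented view; the Hungarian matching and MSE penalty are assumptions standing in for SlotAug's actual objective, which also involves Auxiliary Identity Manipulation.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def slot_consistency_loss(slots_a, slots_b):
    """Simplified slot-consistency objective: match the two slot sets with a
    Hungarian assignment on cosine distance, then penalise the matched pairs.
    slots_a, slots_b: (K, D) slot vectors from an image and its augmented view."""
    a = F.normalize(slots_a, dim=-1)
    b = F.normalize(slots_b, dim=-1)
    cost = 1.0 - a @ b.t()                          # (K, K) cosine distance
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    idx_a, idx_b = torch.as_tensor(row), torch.as_tensor(col)
    return F.mse_loss(slots_a[idx_a], slots_b[idx_b])

# toy usage: 7 slots of dimension 64 from the original and the augmented image
slots_orig = torch.randn(7, 64, requires_grad=True)
slots_aug = torch.randn(7, 64)
loss = slot_consistency_loss(slots_orig, slots_aug)
loss.backward()
```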

SIDE: Self-supervised Intermediate Domain Exploration for Source-free Domain Adaptation

  • paper_url: http://arxiv.org/abs/2310.08928
  • repo_url: https://github.com/se111/side
  • paper_authors: Jiamei Liu, Han Sun, Yizhen Jia, Jie Qin, Huiyu Zhou, Ningzhong Liu
  • for: 这篇论文的目的是解决域别迁移问题,并在没有源数据的情况下进行域别迁移。
  • methods: 这篇论文提出了自我指导中途域探索(SIDE)方法,它通过在中途域探索过程中选择类似源和目标域的样本,以及对这些中途域样本进行过渡域阶段调整,以bridge域间差异。
  • results: 根据三个popular benchmark(Office-31、Office-Home和VisDA-C)的实验结果显示,这篇论文提出的SIDE方法能够与现有的方法竞争。
    Abstract Domain adaptation aims to alleviate the domain shift when transferring the knowledge learned from the source domain to the target domain. Due to privacy issues, source-free domain adaptation (SFDA), where source data is unavailable during adaptation, has recently become very demanding yet challenging. Existing SFDA methods focus on either self-supervised learning of target samples or reconstruction of virtual source data. The former overlooks the transferable knowledge in the source model, whilst the latter introduces even more uncertainty. To address the above issues, this paper proposes self-supervised intermediate domain exploration (SIDE) that effectively bridges the domain gap with an intermediate domain, where samples are cyclically filtered out in a self-supervised fashion. First, we propose cycle intermediate domain filtering (CIDF) to cyclically select intermediate samples with similar distributions over source and target domains. Second, with the aid of those intermediate samples, an inter-domain gap transition (IDGT) module is developed to mitigate possible distribution mismatches between the source and target data. Finally, we introduce cross-view consistency learning (CVCL) to maintain the intrinsic class discriminability whilst adapting the model to the target domain. Extensive experiments on three popular benchmarks, i.e. Office-31, Office-Home and VisDA-C, show that our proposed SIDE achieves competitive performance against state-of-the-art methods.
    摘要 域适应(domain adaptation)的目标是在将源域中学习到的知识传递到目标域时减少域偏移。由于隐私问题,在适应过程中无法访问源数据的无源域适应(SFDA)近来需求旺盛但极具挑战。现有的 SFDA 方法要么集中于目标样本的自监督学习,要么重构虚拟源数据:前者忽略了源模型中可传递的知识,而后者引入了更多的不确定性。为了解决这些问题,本文提出了自监督中间域探索(SIDE),通过一个中间域有效地弥合域间差距,其中的样本以自监督方式被循环筛选。首先,我们提出循环中间域滤波(CIDF),循环选择在源域和目标域上分布相近的中间样本。其次,借助这些中间样本,我们设计了域间差距过渡(IDGT)模块,以缓解源数据与目标数据之间可能的分布不匹配。最后,我们引入交叉视角一致学习(CVCL),在适应目标域的同时保持类别的内在可分性。我们在 Office-31、Office-Home 和 VisDA-C 三个流行的 benchmark 上进行了广泛的实验,结果表明所提出的 SIDE 与当前最先进方法具有竞争力。
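As a rough illustration of how an intermediate domain might be filtered, the snippet below keeps target samples on which the source model is neither very confident nor very unconfident; the confidence-based rule and thresholds are assumptions for the sketch, not the CIDF criterion from the paper.

```python
import torch
import torch.nn.functional as F

def intermediate_domain_mask(logits, low=0.4, high=0.8):
    """Keep samples whose source-model confidence is neither very low nor very
    high; such samples are treated here as a proxy 'intermediate domain'.
    Thresholds are illustrative, not values from the paper."""
    conf = F.softmax(logits, dim=1).max(dim=1).values
    return (conf > low) & (conf < high)

# toy usage: 8 target samples, 5 classes, random logits standing in for a source model
logits = torch.randn(8, 5)
mask = intermediate_domain_mask(logits)
print(mask)
```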

Feature Proliferation – the “Cancer” in StyleGAN and its Treatments

  • paper_url: http://arxiv.org/abs/2310.08921
  • repo_url: https://github.com/songc42/feature-proliferation
  • paper_authors: Shuang Song, Yuanbang Liang, Jing Wu, Yu-Kun Lai, Yipeng Qin
  • for: 这个论文的目的是解决StyleGAN图像生成器中的特征增殖问题,以提高图像质量和多样性。
  • methods: 这篇论文首先探讨了StyleGAN图像生成器的具体机制,发现了特征增殖现象,并证明了这种现象导致StyleGAN图像生成器的缺陷。然后,提出了一种新的特征调整方法,通过调整危险特征来mitigate特征增殖问题。
  • results: 实验结果证明了我们的假设和提议的有效性,并证明了提出的特征调整方法的有效性。
    Abstract Despite the success of StyleGAN in image synthesis, the images it synthesizes are not always perfect and the well-known truncation trick has become a standard post-processing technique for StyleGAN to synthesize high-quality images. Although effective, it has long been noted that the truncation trick tends to reduce the diversity of synthesized images and unnecessarily sacrifices many distinct image features. To address this issue, in this paper, we first delve into the StyleGAN image synthesis mechanism and discover an important phenomenon, namely Feature Proliferation, which demonstrates how specific features reproduce with forward propagation. Then, we show how the occurrence of Feature Proliferation results in StyleGAN image artifacts. As an analogy, we refer to it as the "cancer" in StyleGAN from its proliferating and malignant nature. Finally, we propose a novel feature rescaling method that identifies and modulates risky features to mitigate feature proliferation. Thanks to our discovery of Feature Proliferation, the proposed feature rescaling method is less destructive and retains more useful image features than the truncation trick, as it is more fine-grained and works in a lower-level feature space rather than a high-level latent space. Experimental results justify the validity of our claims and the effectiveness of the proposed feature rescaling method. Our code is available at https://github.com/songc42/Feature-proliferation.
    摘要 尽管 StyleGAN 在图像生成方面取得了成功,但生成的图像不一定是完美的,常用的截断技巧已成为 StyleGAN 图像生成的标准后处理技术。虽然有效,但这种技巧可能会减少生成的图像多样性,并且意外地抛弃许多图像特征。为解决这个问题,在这篇论文中,我们首先探究 StyleGAN 图像生成机制,并发现一个重要现象:特征增殖(Feature Proliferation)。我们发现,在 StyleGAN 图像生成过程中,特定的特征会在前向传播中不断重新生成,从而导致 StyleGAN 图像的瑕疵。我们将这种现象称为 StyleGAN 中的"癌症",因为它会在图像生成过程中不断增殖和恶化。最后,我们提出了一种新的特征重新缩放方法,该方法可以识别并调节危险的特征,以缓解特征增殖。由于我们发现了特征增殖,该方法比 truncation 技巧破坏性更小、保留更多有用的图像特征,因为它在较低层的特征空间进行细粒度调节,而不是在高级潜在空间。实验结果证明了我们的论断和所提特征重新缩放方法的有效性。我们的代码可以在 https://github.com/songc42/Feature-proliferation 下载。
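The feature-rescaling idea can be illustrated with a toy rule that damps channels whose activation magnitude is an outlier; the detection statistic and threshold below are assumptions rather than the paper's exact mechanism.

```python
import torch

def rescale_risky_features(feat, k=3.0):
    """Illustrative feature rescaling: channels whose mean absolute activation is
    more than k standard deviations above the layer average are scaled back to
    that bound, mimicking the idea of damping 'proliferating' features."""
    # feat: (B, C, H, W) intermediate StyleGAN-style feature map
    mag = feat.abs().mean(dim=(0, 2, 3))            # per-channel magnitude
    bound = mag.mean() + k * mag.std()
    scale = torch.clamp(bound / mag.clamp(min=1e-8), max=1.0)
    return feat * scale.view(1, -1, 1, 1)

feat = torch.randn(2, 512, 16, 16)
feat[:, 7] *= 50.0                                  # simulate one proliferating channel
out = rescale_risky_features(feat)
print(feat[:, 7].abs().mean().item(), "->", out[:, 7].abs().mean().item())
```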

Scalarization for Multi-Task and Multi-Domain Learning at Scale

  • paper_url: http://arxiv.org/abs/2310.08910
  • repo_url: None
  • paper_authors: Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi
  • for: This paper focuses on improving the efficiency of training multi-task and multi-domain neural networks.
  • methods: The authors use a combination of theoretical analysis and experimental methods to understand the training dynamics of these networks, and propose a new population-based training method to optimize the scalarization weights.
  • results: The authors show that their proposed method achieves on-par performance with more costly state-of-the-art optimization methods, and provides a more efficient way to train multi-task and multi-domain networks.
    Abstract Training a single model on multiple input domains and/or output tasks allows for compressing information from multiple sources into a unified backbone hence improves model efficiency. It also enables potential positive knowledge transfer across tasks/domains, leading to improved accuracy and data-efficient training. However, optimizing such networks is a challenge, in particular due to discrepancies between the different tasks or domains: Despite several hypotheses and solutions proposed over the years, recent work has shown that uniform scalarization training, i.e., simply minimizing the average of the task losses, yields on-par performance with more costly SotA optimization methods. This raises the issue of how well we understand the training dynamics of multi-task and multi-domain networks. In this work, we first devise a large-scale unified analysis of multi-domain and multi-task learning to better understand the dynamics of scalarization across varied task/domain combinations and model sizes. Following these insights, we then propose to leverage population-based training to efficiently search for the optimal scalarization weights when dealing with a large number of tasks or domains.
    摘要 训练单个模型在多个输入领域和/或输出任务上可以压缩多个源的信息到一个统一的核心,从而提高模型的效率。这也可能导致任务/领域之间的正面知识传递,从而提高准确性和数据训练效率。然而,优化这些网络是一项挑战,特别是因为不同任务或领域之间存在差异:尽管多年来有许多假设和解决方案,但最近的研究表明, uniform scalarization 训练,即只是将所有任务损失平均化为最小值,可以与更昂贵的 State-of-the-art 优化方法相比肩。这引发了我们理解多任务和多领域学习的训练动力学的问题。在这项工作中,我们首先设计了一项大规模的多领域多任务学习统一分析,以更好地理解 scalarization 的动力学在不同任务/领域组合和模型大小上。然后,我们提议使用人口训练来高效地搜索最佳 scalarization 权重,当面临大量任务或领域时。
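Uniform scalarization itself is simple to state: a single weighted sum of per-task losses, with uniform weights recovering the plain average the paper studies. The sketch below shows a toy multi-head model and such a scalarized loss; a population-based search would tune the weights instead of fixing them.

```python
import torch
import torch.nn as nn

class MultiHeadModel(nn.Module):
    """Shared backbone with one linear head per task (toy example)."""
    def __init__(self, in_dim=32, hidden=64, task_dims=(10, 5)):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, d) for d in task_dims)

    def forward(self, x):
        h = self.backbone(x)
        return [head(h) for head in self.heads]

def scalarized_loss(outputs, targets, weights):
    """Scalarization: a single weighted sum of per-task losses. Uniform weights
    (all ones) give the plain average; a population-based search over `weights`
    is what the paper proposes when there are many tasks or domains."""
    ce = nn.CrossEntropyLoss()
    losses = [ce(o, t) for o, t in zip(outputs, targets)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

model = MultiHeadModel()
x = torch.randn(16, 32)
targets = [torch.randint(0, 10, (16,)), torch.randint(0, 5, (16,))]
loss = scalarized_loss(model(x), targets, weights=[1.0, 1.0])
loss.backward()
```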

3D Understanding of Deformable Linear Objects: Datasets and Transferability Benchmark

  • paper_url: http://arxiv.org/abs/2310.08904
  • repo_url: None
  • paper_authors: Bare Luka Žagar, Tim Hertel, Mingyu Liu, Ekim Yurtsever, ALois C. Knoll
  • for: 研究3D弹性线性物体,如血管和电缆,以提高对这些系统的理解和设计。
  • methods: 使用PointWire和PointVessel数据集,对现状的3D弹性线性物体进行了大规模的测试和评估。
  • results: 通过对PointWire和PointVessel数据集进行了转移性测试,发现现有方法的泛化能力不强,需要进一步改进。
    Abstract Deformable linear objects are vastly represented in our everyday lives. It is often challenging even for humans to visually understand them, as the same object can be entangled so that it appears completely different. Examples of deformable linear objects include blood vessels and wiring harnesses, vital to the functioning of their corresponding systems, such as the human body and a vehicle. However, no point cloud datasets exist for studying 3D deformable linear objects. Therefore, we are introducing two point cloud datasets, PointWire and PointVessel. We evaluated state-of-the-art methods on the proposed large-scale 3D deformable linear object benchmarks. Finally, we analyzed the generalization capabilities of these methods by conducting transferability experiments on the PointWire and PointVessel datasets.
    摘要 弹性线性物体在我们日常生活中非常普遍。它们的外观可能会很不同,因为它们可能会络绎在一起,让它们看起来完全不同。例如,血管和车辆电线很重要,它们是人体和车辆系统的重要组成部分。但是,目前没有任何点云数据集用于研究3D弹性线性物体。因此,我们正在引入两个点云数据集:PointWire和PointVessel。我们评估了现有的方法在我们提出的大规模3D弹性线性物体benchmark上的性能。最后,我们进行了对PointWire和PointVessel数据集的转移性实验,以评估这些方法的通用能力。

Self supervised convolutional kernel based handcrafted feature harmonization: Enhanced left ventricle hypertension disease phenotyping on echocardiography

  • paper_url: http://arxiv.org/abs/2310.08897
  • repo_url: None
  • paper_authors: Jina Lee, Youngtaek Hong, Dawun Jeong, Yeonggul Jang, Sihyeon Jeong, Taekgeun Jung, Yeonyee E. Yoon, Inki Moon, Seung-Ah Lee, Hyuk-Jae Chang
  • for: 预测疾病(如Left Ventricular Hypertrophy和Hypertensive Heart Disease)的医学成像技术,使用手工设计的特征来预测疾病。
  • methods: 使用标准化成像协议、统计调整和评估特征稳定性来协调特征提取。
  • results: 提出了一种使用自主学习(SSL)和卷积层整合的方法,可以在有限的数据集中提高数据理解并适应多种数据设置。该方法在各种任务中显示出优秀表现,特别是在 Left Ventricular Hypertrophy 分类任务中表现出色。
    Abstract Radiomics, a medical imaging technique, extracts quantitative handcrafted features from images to predict diseases. Harmonization of those features ensures consistent feature extraction across various imaging devices and protocols. Methods for harmonization include standardized imaging protocols, statistical adjustments, and evaluating feature robustness. Myocardial diseases such as Left Ventricular Hypertrophy (LVH) and Hypertensive Heart Disease (HHD) are diagnosed via echocardiography, but variable imaging settings pose challenges. Harmonization techniques are crucial for applying handcrafted features to disease diagnosis in such scenarios. Self-supervised learning (SSL) enhances data understanding within limited datasets and adapts to diverse data settings. ConvNeXt-V2 integrates convolutional layers into SSL, displaying superior performance in various tasks. This study focuses on convolutional filters within SSL, using them as preprocessing to convert images into feature maps for handcrafted feature harmonization. Our proposed method excelled in harmonization evaluation and exhibited superior LVH classification performance compared to existing methods.
    摘要 医学成像技术Radiomics提取了生物marker的量化特征,以预测疾病。在这些特征中,谱harmonization是确保特征EXTRACTINGCONSISTENTLY across various imaging devices and protocols。这些方法包括标准化成像协议,统计调整和评估特征稳定性。我们通过echo医学诊断Left Ventricular Hypertrophy (LVH)和Hypertensive Heart Disease (HHD),但不同的成像设置 pose challenges。在这种情况下,谱harmonization技术是至关重要的。自主学习(SSL)可以帮助我们更好地理解有限的数据集和适应多种数据设置。ConvNeXt-V2 integrates convolutional layers into SSL,在多种任务中展现出了出色的表现。本研究关注了在SSL中的卷积层,将它们用作预处理,将图像转换成特征地图以便手工特征谱harmonization。我们的提议方法在谱harmonization评估中表现出色,并在现有方法中展现出了更高的LVH分类性能。

Image Cropping under Design Constraints

  • paper_url: http://arxiv.org/abs/2310.08892
  • repo_url: None
  • paper_authors: Takumi Nishiyasu, Wataru Shimoda, Yoichi Sato
  • for: 本研究旨在提供一种基于分数函数的图像剪辑方法,以满足各种设计约束。
  • methods: 本研究使用分数函数来评估剪辑结果的美观可能性和设计约束满足度。我们还提出了两种变体方法:提案基本方法和热图基本方法。
  • results: 实验结果显示,提案基本方法在同等计算成本下表现较好,而热图基本方法可以通过增加计算成本来获得更高的分数。我们还发现,在满足设计约束的同时保持美观可能性是一项不容易解决的问题。
    Abstract Image cropping is essential in image editing for obtaining a compositionally enhanced image. In display media, image cropping is a prospective technique for automatically creating media content. However, image cropping for media content is often required to satisfy various constraints, such as an aspect ratio and blank regions for placing texts or objects. We call this problem image cropping under design constraints. To achieve image cropping under design constraints, we propose a score function-based approach, which computes scores for cropped results based on whether they are aesthetically plausible and satisfy design constraints. We explore two derived approaches, a proposal-based approach and a heatmap-based approach, and we construct a dataset for evaluating the performance of the proposed approaches on image cropping under design constraints. In experiments, we demonstrate that the proposed approaches outperform a baseline, and we observe that the proposal-based approach is better than the heatmap-based approach under the same computation cost, but the heatmap-based approach leads to better scores by increasing computation cost. The experimental results indicate that balancing aesthetically plausible regions and satisfying design constraints is not a trivial problem and requires a sensitive balance, and both proposed approaches are reasonable alternatives.
    摘要 Image cropping(图像裁剪)是图像编辑中的一种基本技巧,可以提升图像的构图感。在显示媒体中,图像裁剪是一种自动生成媒体内容的可行技术。然而,面向媒体内容的图像裁剪往往需要满足多种约束,例如长宽比以及用于放置文本或对象的空白区域。我们将这一问题称为设计约束下的图像裁剪。为解决该问题,我们提出了一种基于分数函数的方法,对裁剪结果计算其是否美观合理且满足设计约束的分数。我们还探索了两种派生方法:提案式方法和热图式方法,并构建了用于评估所提方法的数据集。实验表明,所提方法优于基线;在相同计算成本下,提案式方法优于热图式方法,而热图式方法通过增加计算成本可以获得更高的分数。实验结果表明,在美观合理的区域与满足设计约束之间取得平衡并非易事,需要细致的权衡,两种方法都是合理的选择。
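A proposal-based score function can be sketched as follows: enumerate candidate crops, score each by a (placeholder) aesthetics model minus penalties for violated constraints, and keep the best crop. The constraint set, weights, and random aesthetics score below are illustrative assumptions, not the paper's formulation.

```python
import random

IMAGE_W, IMAGE_H = 1920, 1080

def constraint_penalty(crop, target_ratio=16 / 9, blank_frac=0.2, tol=0.05):
    """Penalty for two illustrative design constraints: target aspect ratio, and
    keeping at least `blank_frac` of the image width to the right of the crop as
    a stand-in for a reserved text region. Weights and thresholds are assumed."""
    x, y, w, h = crop
    ratio_err = abs(w / h - target_ratio) / target_ratio
    blank_err = max(0.0, blank_frac - (1.0 - (x + w) / IMAGE_W))
    return 5.0 * max(0.0, ratio_err - tol) + 5.0 * blank_err

def aesthetic_score(crop):
    """Placeholder for a learned aesthetics model; returns a random score here."""
    return random.random()

proposals = [(random.randint(0, 400), random.randint(0, 200),
              random.randint(800, 1400), random.randint(450, 800))
             for _ in range(200)]
best = max(proposals, key=lambda c: aesthetic_score(c) - constraint_penalty(c))
print("selected crop (x, y, w, h):", best)
```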

Extending Multi-modal Contrastive Representations

  • paper_url: http://arxiv.org/abs/2310.08884
  • repo_url: https://github.com/mcr-peft/ex-mcr
  • paper_authors: Zehan Wang, Ziang Zhang, Luping Liu, Yang Zhao, Haifeng Huang, Tao Jin, Zhou Zhao
  • for: 这篇研究旨在提出一种训练效率高且没有对照数据的多modal contrastive representation(MCR)方法,以扩展多modal learning的可能性。
  • methods: 这篇研究使用了C-MCR的想法,并将多个现有的MCR空间融合到同一个基本MCR空间中,以获得一个共同的对照表现空间。此外,研究对MCR空间的整个学习管线进行了优化,包括训练数据、架构和学习目标。
  • results: 研究发现,这篇方法可以实现无需对照数据的MCR学习,并且可以保留原始模式的semantic alignement。此外,这篇方法在多modal Retrieval和3D物体分类任务中获得了state-of-the-art表现。进一步的质变结果显示了模式之间的弹性联系,这显示了多modal learning的可能性。
    Abstract Multi-modal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning. Although recent methods showcase impressive achievements, the high dependence on large-scale, high-quality paired data and the expensive training costs limit their further development. Inspired by recent C-MCR, this paper proposes Extending Multimodal Contrastive Representation (Ex-MCR), a training-efficient and paired-data-free method to flexibly learn unified contrastive representation space for more than three modalities by integrating the knowledge of existing MCR spaces. Specifically, Ex-MCR aligns multiple existing MCRs into the same based MCR, which can effectively preserve the original semantic alignment of the based MCR. Besides, we comprehensively enhance the entire learning pipeline for aligning MCR spaces from the perspectives of training data, architecture, and learning objectives. With the preserved original modality alignment and the enhanced space alignment, Ex-MCR shows superior representation learning performance and excellent modality extensibility. To demonstrate the effectiveness of Ex-MCR, we align the MCR spaces of CLAP (audio-text) and ULIP (3D-vision) into the CLIP (vision-text), leveraging the overlapping text and image modality, respectively. Remarkably, without using any paired data, Ex-MCR learns a 3D-image-text-audio unified contrastive representation, and it achieves state-of-the-art performance on audio-visual, 3D-image, audio-text, visual-text retrieval, and 3D object classification tasks. More importantly, extensive qualitative results further demonstrate the emergent semantic alignment between the extended modalities (e.g., audio and 3D), which highlights the great potential of modality extensibility.
    摘要 多modalcontrastiverepresentation(MCR)是多modal学习中的关键。尽管 current methods 显示出色的成绩,但它们受到大规模、高质量的对应数据和训练成本的限制。 Drawing inspiration from recent C-MCR, this paper proposes Extending Multimodal Contrastive Representation (Ex-MCR), a training-efficient and paired-data-free method to flexibly learn unified contrastive representation space for more than three modalities by integrating the knowledge of existing MCR spaces. Specifically, Ex-MCR aligns multiple existing MCRs into the same based MCR, which can effectively preserve the original semantic alignment of the based MCR. Besides, we comprehensively enhance the entire learning pipeline for aligning MCR spaces from the perspectives of training data, architecture, and learning objectives. With the preserved original modality alignment and the enhanced space alignment, Ex-MCR shows superior representation learning performance and excellent modality extensibility. To demonstrate the effectiveness of Ex-MCR, we align the MCR spaces of CLAP (audio-text) and ULIP (3D-vision) into the CLIP (vision-text), leveraging the overlapping text and image modality, respectively. Remarkably, without using any paired data, Ex-MCR learns a 3D-image-text-audio unified contrastive representation, and it achieves state-of-the-art performance on audio-visual, 3D-image, audio-text, visual-text retrieval, and 3D object classification tasks. More importantly, extensive qualitative results further demonstrate the emergent semantic alignment between the extended modalities (e.g., audio and 3D), which highlights the great potential of modality extensibility.

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

  • paper_url: http://arxiv.org/abs/2310.08872
  • repo_url: None
  • paper_authors: Jiayu Xiao, Liang Li, Henglei Lv, Shuhui Wang, Qingming Huang
  • for: 这项研究的目的是提出一种适用于文本描述的隐式图像生成模型,可以在不需要训练辅助模块或重新调整扩散模型的情况下,生成符合文本输入的图像。
  • methods: 我们提出了一种 Region and Boundary (R&B) 抽象注意力指导方法,通过在生成过程中逐渐修改扩散模型的注意力地图,使模型能够生成符合文本输入的图像,同时还能够准确地表达文本中的布局指令。
  • results: 我们的方法在多个 benchmark 上表现出色,远远超过了现有的零shot隐式图像生成方法,both qualitatively和quantitatively。
    Abstract Recent text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images given text-prompts as input. However, these models fail to convey appropriate spatial composition specified by a layout instruction. In this work, we probe into zero-shot grounded T2I generation with diffusion models, that is, generating images corresponding to the input layout information without training auxiliary modules or finetuning diffusion models. We propose a Region and Boundary (R&B) aware cross-attention guidance approach that gradually modulates the attention maps of diffusion model during generative process, and assists the model to synthesize images (1) with high fidelity, (2) highly compatible with textual input, and (3) interpreting layout instructions accurately. Specifically, we leverage the discrete sampling to bridge the gap between consecutive attention maps and discrete layout constraints, and design a region-aware loss to refine the generative layout during diffusion process. We further propose a boundary-aware loss to strengthen object discriminability within the corresponding regions. Experimental results show that our method outperforms existing state-of-the-art zero-shot grounded T2I generation methods by a large margin both qualitatively and quantitatively on several benchmarks.

Re-initialization-free Level Set Method via Molecular Beam Epitaxy Equation Regularization for Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.08861
  • repo_url: None
  • paper_authors: Fanghui Song, Jiebao Sun, Shengzhu Shi, Zhichang Guo, Dazhi Zhang
  • for: 该文章的目的是提出一种高阶级 level set 变分法,以提高图像分割的精度和稳定性。
  • methods: 该方法使用分子束辐射(MBE)方程regularization,使得级别集函数的演化过程受到晶体增长的限制,以避免重初化和级别集函数的不稳定性。该方法还可以处理噪声图像中的粗糙性问题。
  • results: 数值实验表明,该方法可以生成平滑的分割曲线,保留细腻的分割目标,并实现小对象的稳定分割。与现有的级别集方法相比,该模型在精度和效率两个方面具有当今最佳的状态。
    Abstract The variational level set method has become a powerful tool in image segmentation due to its ability to handle complex topological changes and maintain continuity and smoothness in the process of evolution. However, its evolution process can be unstable, which results in over-flattened or over-sharpened contours and segmentation failure. To improve the accuracy and stability of evolution, we propose a high-order level set variational segmentation method integrated with molecular beam epitaxy (MBE) equation regularization. This method uses the crystal growth in the MBE process to limit the evolution of the level set function, and thus can avoid re-initialization in the evolution process and regulate the smoothness of the segmented curve. It also works for noisy images with intensity inhomogeneity, which is a challenge in image segmentation. To solve the variational model, we derive the gradient flow and design a scalar auxiliary variable (SAV) scheme coupled with the fast Fourier transform (FFT), which can significantly improve the computational efficiency compared with traditional semi-implicit and semi-explicit schemes. Numerical experiments show that the proposed method can generate smooth segmentation curves, retain fine segmentation targets and obtain robust segmentation results of small objects. Compared to existing level set methods, this model is state-of-the-art in both accuracy and efficiency.
    摘要 变分水平集方法因其能够处理复杂的拓扑变化并在演化过程中保持连续性和光滑性,已成为图像分割中的强大工具。然而,其演化过程可能不稳定,导致轮廓过度平坦或过度锐化乃至分割失败。为了提高演化的精度和稳定性,我们提出了一种结合分子束外延(MBE)方程正则化的高阶水平集变分分割方法。该方法利用 MBE 过程中的晶体生长来限制水平集函数的演化,从而避免演化过程中的重新初始化,并调控分割曲线的光滑性。它同样适用于强度不均匀的含噪图像,这是图像分割中的一大挑战。为求解该变分模型,我们推导了梯度流,并设计了与快速傅里叶变换(FFT)相结合的标量辅助变量(SAV)格式,相比传统的半隐式和半显式格式可显著提高计算效率。数值实验表明,所提方法能生成光滑的分割曲线,保留精细的分割目标,并对小目标获得稳健的分割结果。与现有水平集方法相比,该模型在精度和效率上均处于领先水平。

Rank-DETR for High Quality Object Detection

  • paper_url: http://arxiv.org/abs/2310.08854
  • repo_url: https://github.com/leaplabthu/rank-detr
  • paper_authors: Yifan Pu, Weicong Liang, Yiduo Hao, Yuhui Yuan, Yukang Yang, Chao Zhang, Han Hu, Gao Huang
  • for: 提高 DETR 类型目标检测器的精度和性能
  • methods: 提出了一种简单高效的rank-oriented设计,包括rank-oriented架构设计和rank-oriented损失函数设计,以降低假阳性率并提高 AP 值
  • results: 应用方法到现有 SOTA 方法(如 H-DETR 和 DINO-DETR)上,在不同的backbone上(如 ResNet-$50$、Swin-T 和 Swin-L) obtainted strong COCO object detection results,证明了方法的效果
    Abstract Modern detection transformers (DETRs) use a set of object queries to predict a list of bounding boxes, sort them by their classification confidence scores, and select the top-ranked predictions as the final detection results for the given input image. A highly performant object detector requires accurate ranking for the bounding box predictions. For DETR-based detectors, the top-ranked bounding boxes suffer from less accurate localization quality due to the misalignment between classification scores and localization accuracy, thus impeding the construction of high-quality detectors. In this work, we introduce a simple and highly performant DETR-based object detector by proposing a series of rank-oriented designs, combinedly called Rank-DETR. Our key contributions include: (i) a rank-oriented architecture design that can prompt positive predictions and suppress the negative ones to ensure lower false positive rates, as well as (ii) a rank-oriented loss function and matching cost design that prioritizes predictions of more accurate localization accuracy during ranking to boost the AP under high IoU thresholds. We apply our method to improve the recent SOTA methods (e.g., H-DETR and DINO-DETR) and report strong COCO object detection results when using different backbones such as ResNet-$50$, Swin-T, and Swin-L, demonstrating the effectiveness of our approach. Code is available at \url{https://github.com/LeapLabTHU/Rank-DETR}.
    摘要 现代检测转换器(DETR)使用一组对象查询来预测一个列表的 bounding box,并将其排序于其分类信任度分数上,并选择输入图像的最高排名的预测结果作为最终检测结果。一个高效的对象检测器需要准确的排序,以确保 bounding box 预测的准确性。为 DETR 基于的检测器,最顶层的 bounding box 受到误差的分布率的影响,从而降低了高质量检测器的建立。在这项工作中,我们提出了一种简单高效的 DETR 基于的对象检测器,并将其命名为 Rank-DETR。我们的关键贡献包括:1. 排序 oriented 建筑设计,可以让正面预测提高,并且对负面预测进行抑制,以降低假阳性率。2. 排序 oriented 损失函数和匹配成本设计,在排序时优先考虑更高的本地化准确性,以提高 AP 下高 IoU 阈值下的性能。我们应用我们的方法于当前 SOTA 方法(例如 H-DETR 和 DINO-DETR),并在不同的背景(例如 ResNet-$50$、Swin-T 和 Swin-L)上测试,得到了强的 COCO 对象检测结果,证明了我们的方法的有效性。代码可以在 \url{https://github.com/LeapLabTHU/Rank-DETR} 上获取。

Revisiting Multi-modal 3D Semantic Segmentation in Real-world Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.08826
  • repo_url: None
  • paper_authors: Feng Jiang, Chaoping Tu, Gang Zhang, Jun Li, Hanqing Huang, Junyu Lin, Di Feng, Jian Pu
  • for: 提高多模态3D semantic segmentation的安全性,即使在弱标定(weak calibration)下运行
  • methods: 提出CPGNet-LCF多模态融合框架,继承CPGNet的易于部署和实时执行能力,并引入弱标定知识蒸馏策略以提高对弱标定的鲁棒性
  • results: 在nuScenes和SemanticKITTI benchmark上实现了state-of-the-art表现,可在Tesla V100 GPU上以TensorRT TF16模式达到每帧20ms的实时运行速度,并对四级弱标定水平进行了性能基准测试
    Abstract LiDAR and camera are two critical sensors for multi-modal 3D semantic segmentation and are supposed to be fused efficiently and robustly to promise safety in various real-world scenarios. However, existing multi-modal methods face two key challenges: 1) difficulty with efficient deployment and real-time execution; and 2) drastic performance degradation under weak calibration between LiDAR and cameras. To address these challenges, we propose CPGNet-LCF, a new multi-modal fusion framework extending the LiDAR-only CPGNet. CPGNet-LCF solves the first challenge by inheriting the easy deployment and real-time capabilities of CPGNet. For the second challenge, we introduce a novel weak calibration knowledge distillation strategy during training to improve the robustness against the weak calibration. CPGNet-LCF achieves state-of-the-art performance on the nuScenes and SemanticKITTI benchmarks. Remarkably, it can be easily deployed to run in 20ms per frame on a single Tesla V100 GPU using TensorRT TF16 mode. Furthermore, we benchmark performance over four weak calibration levels, demonstrating the robustness of our proposed approach.
    摘要 利用LiDAR和摄像头两种关键感知器,多模态3D semantic segmentation可以得到高效和稳定的混合。然而,现有的多模态方法面临两个主要挑战:1)efficient deployment和实时执行困难; 2)在软配置下导致性能下降。为解决这些挑战,我们提出了CPGNet-LCF,一种新的多模态混合框架,extend LiDAR只的CPGNet。CPGNet-LCF解决了第一个挑战by继承CPGNet的易部署和实时能力。为第二个挑战,我们引入了一种新的弱配置知识继承策略在训练中,以提高对弱配置的Robustness。CPGNet-LCF在nuScenes和SemanticKITTIbenchmark上达到了状态的最佳性能。另外,它可以轻松地在单个Tesla V100 GPU上使用TensorRT TF16模式运行,并且在四个弱配置水平上测试性能,表明我们提出的方法具有可靠性。

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models

  • paper_url: http://arxiv.org/abs/2310.08825
  • repo_url: https://github.com/yuchenliu98/comm
  • paper_authors: Dongsheng Jiang, Yuchen Liu, Songlin Liu, Xiaopeng Zhang, Jin Li, Hongkai Xiong, Qi Tian
  • for: This paper aims to investigate the effectiveness of different vision encoders within Multi-modal Large Language Models (MLLMs) and to propose a simple yet effective feature merging strategy to enhance the visual capabilities of MLLMs.
  • methods: The authors conduct an extensive investigation into the effectiveness of different vision encoders within MLLMs, including CLIP and DINO, and propose a feature merging strategy called COMM that integrates CLIP and DINO with Multi-level features Merging to enhance the visual capabilities of MLLMs.
  • results: The authors evaluate COMM through comprehensive experiments on a wide range of benchmarks, including image captioning, visual question answering, visual grounding, and object hallucination, and show that COMM outperforms existing methods, demonstrating its enhanced visual capabilities within MLLMs.
    Abstract Multi-modal Large Language Models (MLLMs) have made significant strides in expanding the capabilities of Large Language Models (LLMs) through the incorporation of visual perception interfaces. Despite the emergence of exciting applications and the availability of diverse instruction tuning data, existing approaches often rely on CLIP or its variants as the visual branch, and merely extract features from the deep layers. However, these methods lack a comprehensive analysis of the visual encoders in MLLMs. In this paper, we conduct an extensive investigation into the effectiveness of different vision encoders within MLLMs. Our findings reveal that the shallow layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding. Surprisingly, the vision-only model DINO, which is not pretrained with text-image alignment, demonstrates promising performance as a visual branch within MLLMs. By simply equipping it with an MLP layer for alignment, DINO surpasses CLIP in fine-grained related perception tasks. Building upon these observations, we propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging, to enhance the visual capabilities of MLLMs. We evaluate COMM through comprehensive experiments on a wide range of benchmarks, including image captioning, visual question answering, visual grounding, and object hallucination. Experimental results demonstrate the superior performance of COMM compared to existing methods, showcasing its enhanced visual capabilities within MLLMs. Code will be made available at https://github.com/YuchenLiu98/COMM.
    摘要 多模态大语言模型(MLLMs)已经在扩展大语言模型(LLMs)的能力方面做出了重要进展,通过添加视觉感知界面。虽然出现了许多有趣的应用和多种指导调整数据,但现有的方法 oftentimes 仅仅是使用 CLIP 或其变体作为视觉分支,并且只是从深层次抽取特征。然而,这些方法缺乏对 MLLMs 中的视觉编码器的全面分析。在这篇论文中,我们进行了 MLLMs 中不同视觉编码器的全面调查。我们发现,CLIP 的浅层特征对细腻任务如落实和区域理解具有特殊的优势。同时,没有与文本图像对齐培训的视野只模型 DINO 在 MLLMs 中表现出了出色的性能。通过对 DINO 进行 MLP 层的拼接,我们发现 DINO 可以在细腻相关的感知任务中超过 CLIP。基于这些观察,我们提出了一种简单 yet 有效的特征融合策略,名为 COMM,可以在 MLLMs 中提高视觉能力。我们通过对 COMM 进行了广泛的实验,包括图像描述、视觉问答、视觉落实和物体幻化等多种标准 benchmark,并证明 COMM 的性能超过了现有方法。代码将在 GitHub 上公开。
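A toy version of multi-level feature merging might look like the following: shallow and deep CLIP patch features and DINO patch features are projected to a common width, concatenated, and mapped to the LLM embedding size. All dimensions and the MLP design are placeholders, not the COMM architecture itself.

```python
import torch
import torch.nn as nn

class FeatureMerger(nn.Module):
    """Toy multi-level merging head: project shallow CLIP, deep CLIP and DINO
    patch features to a common width, concatenate, and map to the LLM embedding
    size with an MLP. Dimensions are placeholder assumptions."""
    def __init__(self, clip_dim=1024, dino_dim=768, llm_dim=4096, width=1024):
        super().__init__()
        self.proj_clip_shallow = nn.Linear(clip_dim, width)
        self.proj_clip_deep = nn.Linear(clip_dim, width)
        self.proj_dino = nn.Linear(dino_dim, width)
        self.mlp = nn.Sequential(nn.Linear(3 * width, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, clip_shallow, clip_deep, dino):
        merged = torch.cat([self.proj_clip_shallow(clip_shallow),
                            self.proj_clip_deep(clip_deep),
                            self.proj_dino(dino)], dim=-1)
        return self.mlp(merged)                     # (B, N, llm_dim) visual tokens

merger = FeatureMerger()
tokens = merger(torch.randn(1, 256, 1024), torch.randn(1, 256, 1024),
                torch.randn(1, 256, 768))
print(tokens.shape)
```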

SAM-guided Unsupervised Domain Adaptation for 3D Segmentation

  • paper_url: http://arxiv.org/abs/2310.08820
  • repo_url: None
  • paper_authors: Xidong Peng, Runnan Chen, Feng Qiao, Lingdong Kong, Youquan Liu, Tai Wang, Xinge Zhu, Yuexin Ma
  • for: 本研究旨在解决无监督领域适应(UDA)在3D分割任务中的挑战,即3D点云数据的稀疏和无序性导致域之间的差异变得明显。
  • methods: 我们的方法借鉴了视觉基础模型SAM的强大泛化能力,将3D域中的特征表示与SAM的特征空间进行统一,从而解决3D域适应问题。我们还提出了一种创新的混合特征增强方法,通过利用相关的图像和点云数据来促进知识传递,并在Scene和Instance两级进行实现。
  • results: 我们的方法在许多广泛 признан的数据集上进行了评估,并实现了领先的性能。
    Abstract Unsupervised domain adaptation (UDA) in 3D segmentation tasks presents a formidable challenge, primarily stemming from the sparse and unordered nature of point cloud data. Especially for LiDAR point clouds, the domain discrepancy becomes obvious across varying capture scenes, fluctuating weather conditions, and the diverse array of LiDAR devices in use. While previous UDA methodologies have often sought to mitigate this gap by aligning features between source and target domains, this approach falls short when applied to 3D segmentation due to the substantial domain variations. Inspired by the remarkable generalization capabilities exhibited by the vision foundation model, SAM, in the realm of image segmentation, our approach leverages the wealth of general knowledge embedded within SAM to unify feature representations across diverse 3D domains and further solves the 3D domain adaptation problem. Specifically, we harness the corresponding images associated with point clouds to facilitate knowledge transfer and propose an innovative hybrid feature augmentation methodology, which significantly enhances the alignment between the 3D feature space and SAM's feature space, operating at both the scene and instance levels. Our method is evaluated on many widely-recognized datasets and achieves state-of-the-art performance.
    摘要 无监督域适应(Unsupervised Domain Adaptation, UDA)在3D分割任务中是一项极具挑战的问题,主要归因于点云数据的稀疏和无序性。尤其是对于LiDAR点云数据,在不同的捕捉场景、变化的天气条件以及不同型号的LiDAR设备之间,域差异尤为明显。以往的UDA方法通常通过对齐源域与目标域的特征来缩小这一差距,但由于域差异过大,这种做法在3D分割中效果不佳。受图像分割领域中SAM模型出色泛化能力的启发,我们的方法利用SAM中蕴含的广泛通用知识来统一不同3D域的特征表示,进而解决3D域适应问题。具体来说,我们利用与点云对应的图像来促进知识迁移,并提出了一种创新的混合特征增强方法,在场景和实例两个层次上显著增强3D特征空间与SAM特征空间之间的对齐。我们的方法在多个广泛认可的数据集上进行了评估,并取得了领先的性能。

Incremental Object Detection with CLIP

  • paper_url: http://arxiv.org/abs/2310.08815
  • repo_url: None
  • paper_authors: Yupeng He, Ziyue Huang, Qingjie Liu, Yunhong Wang
  • for: addresses the problem of data ambiguity in incremental object detection, where images may have different labeled bounding boxes in multiple continuous learning stages.
  • methods: uses a language-visual model (CLIP) to generate text feature embeddings for different class sets, and employs broad classes to replace unavailable novel classes in the early learning stage.
  • results: outperforms state-of-the-art methods, particularly for new classes, in various incremental learning settings on the PASCAL VOC 2007 dataset.
    Abstract In the incremental detection task, unlike the incremental classification task, data ambiguity exists due to the possibility of an image having different labeled bounding boxes in multiple continuous learning stages. This phenomenon often impairs the model's ability to learn new classes. However, the forward compatibility of the model is less considered in existing work, which hinders the model's suitability for incremental learning. To overcome this obstacle, we propose to use a language-visual model such as CLIP to generate text feature embeddings for different class sets, which enhances the feature space globally. We then employ the broad classes to replace the unavailable novel classes in the early learning stage to simulate the actual incremental scenario. Finally, we use the CLIP image encoder to identify potential objects in the proposals, which are classified into the background by the model. We modify the background labels of those proposals to known classes and add the boxes to the training set to alleviate the problem of data ambiguity. We evaluate our approach on various incremental learning settings on the PASCAL VOC 2007 dataset, and our approach outperforms state-of-the-art methods, particularly for the new classes.
    摘要 在增量检测任务中,与增量分类任务不同,数据之间存在冲突,因为图像在多个连续学习阶段可能有不同的标签框。这种现象经常妨碍模型学习新类。然而,现有的工作更少考虑前向兼容性,这限制了模型的适用范围。为解决这个障碍,我们提议使用语言视觉模型如CLIP生成不同类型集的文本特征嵌入,这些嵌入在全球特征空间中增强了特征空间。然后,我们使用广泛的类型取代未available的新类型在早期学习阶段来模拟实际的增量学习场景。最后,我们使用CLIP图像编码器来识别提议中的可能性对象,这些对象被模型分类为背景。我们修改背景标签这些提议,并将其添加到训练集中,以解决数据之间冲突的问题。我们在多个增量学习场景上测试了我们的方法,并与现有的方法进行比较。我们发现,我们的方法在新类型上表现出色,特别是在新类型上。
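The CLIP-based labelling of proposals can be sketched as zero-shot classification by cosine similarity against class text embeddings, with low-similarity proposals left as background. The embeddings below are random stand-ins for real CLIP image/text features, and the threshold is an assumption.

```python
import torch
import torch.nn.functional as F

def classify_proposals(proposal_feats, class_text_feats, background_thresh=0.25):
    """Zero-shot labelling of detector proposals against class text embeddings
    via cosine similarity; proposals below the threshold stay background (-1)."""
    img = F.normalize(proposal_feats, dim=-1)
    txt = F.normalize(class_text_feats, dim=-1)
    sims = img @ txt.t()                            # (num_proposals, num_classes)
    scores, labels = sims.max(dim=1)
    labels[scores < background_thresh] = -1         # -1 marks background
    return labels, scores

props = torch.randn(10, 512)    # stand-in for CLIP image features of box crops
texts = torch.randn(20, 512)    # stand-in for CLIP text features of known classes
labels, scores = classify_proposals(props, texts)
print(labels)
```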

Two-Stage Deep Learning Framework for Quality Assessment of Left Atrial Late Gadolinium Enhanced MRI Images

  • paper_url: http://arxiv.org/abs/2310.08805
  • repo_url: None
  • paper_authors: K M Arefeen Sultan, Benjamin Orkild, Alan Morris, Eugene Kholmovski, Erik Bieging, Eugene Kwan, Ravi Ranjan, Ed DiBella, Shireen Elhabian
  • for: 自动评估 Left Atrial Fibrosis 的高质量3D晚期增强Image (LGE-MRI) 图像,以提高诊断精度、提高效率、保持标准化和提高病人结果。
  • methods: 使用两阶段深度学习方法,包括左心室探测器和深度网络,以评估LGE-MRI图像诊断质量。
  • results: 比较了多任务学习和对比学习预训练两种训练策略,发现在标注数据有限的情况下,对比学习预训练相比多任务学习在 F1-Score 和特异性上分别提升约 4% 和 9%。
    Abstract Accurate assessment of left atrial fibrosis in patients with atrial fibrillation relies on high-quality 3D late gadolinium enhancement (LGE) MRI images. However, obtaining such images is challenging due to patient motion, changing breathing patterns, or sub-optimal choice of pulse sequence parameters. Automated assessment of LGE-MRI image diagnostic quality is clinically significant as it would enhance diagnostic accuracy, improve efficiency, ensure standardization, and contributes to better patient outcomes by providing reliable and high-quality LGE-MRI scans for fibrosis quantification and treatment planning. To address this, we propose a two-stage deep-learning approach for automated LGE-MRI image diagnostic quality assessment. The method includes a left atrium detector to focus on relevant regions and a deep network to evaluate diagnostic quality. We explore two training strategies, multi-task learning, and pretraining using contrastive learning, to overcome limited annotated data in medical imaging. Contrastive Learning result shows about $4\%$, and $9\%$ improvement in F1-Score and Specificity compared to Multi-Task learning when there's limited data.
    摘要 高品质的3D晚期γ增强(LGE)MRI图像是评估左 auricle fibrosis 的患者中的精度评估中的关键。然而,获得这些图像是困难的,因为患者的运动、呼吸模式的变化以及脉冲序列参数的不佳选择。自动评估LGE-MRI图像诊断质量是临床重要的,因为它会提高诊断精度、提高效率、保证标准化,并为患者提供可靠的高质量LGE-MRI扫描,以便纤维质量量化和治疗规划。为解决这个问题,我们提议一种两个阶段的深度学习方法来自动评估LGE-MRI图像诊断质量。该方法包括左 auricle检测器,以关注相关区域,以及深度网络来评估诊断质量。我们 explore两种训练策略:多任务学习和预训练使用对比学习,以超越医学影像中的有限着色数据。对比学习结果显示,在有限数据情况下,对比学习可以提高F1分数和特异性的表现,比multi任务学习提高约4%和9%。
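For reference, the contrastive pretraining strategy compared here is commonly implemented with an NT-Xent (SimCLR-style) loss between two augmented views; the snippet below shows that standard loss, not the authors' exact training setup.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Standard SimCLR-style contrastive loss between two augmented views:
    each embedding's positive is its counterpart in the other view, all other
    embeddings in the batch act as negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # (2N, D)
    sim = z @ z.t() / temperature
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent_loss(z1, z2).item())
```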

cs.AI - 2023-10-13

Sub-network Discovery and Soft-masking for Continual Learning of Mixed Tasks

  • paper_url: http://arxiv.org/abs/2310.09436
  • repo_url: https://github.com/zixuanke/pycontinual
  • paper_authors: Zixuan Ke, Bing Liu, Wenhan Xiong, Asli Celikyilmaz, Haoran Li
  • for: 这篇论文的目的是提出一种新的继续学习(Continual Learning,CL)方法,以解决预防悖论(Catastrophic Forgetting,CF)和促进知识转移(Knowledge Transfer,KT)问题。
  • methods: 这篇论文提出了一种新的CL方法,通过发现每个任务的子网络来防止CF,并通过软阶层掩盖机制来维持先前的知识并允许新任务受惠于过去的知识进行KT。
  • results: 实验结果显示,提出的方法在标签、生成、信息提取和其混合任务(即不同类型任务)上均能够超越强大的基eline。
    Abstract Continual learning (CL) has two main objectives: preventing catastrophic forgetting (CF) and encouraging knowledge transfer (KT). The existing literature mainly focused on overcoming CF. Some work has also been done on KT when the tasks are similar. To our knowledge, only one method has been proposed to learn a sequence of mixed tasks. However, these techniques still suffer from CF and/or limited KT. This paper proposes a new CL method to achieve both. It overcomes CF by isolating the knowledge of each task via discovering a subnetwork for it. A soft-masking mechanism is also proposed to preserve the previous knowledge and to enable the new task to leverage the past knowledge to achieve KT. Experiments using classification, generation, information extraction, and their mixture (i.e., heterogeneous tasks) show that the proposed method consistently outperforms strong baselines.

Using Adaptive Bandit Experiments to Increase and Investigate Engagement in Mental Health

  • paper_url: http://arxiv.org/abs/2310.18326
  • repo_url: https://github.com/harsh-kumar9/bandit_simulation
  • paper_authors: Harsh Kumar, Tong Li, Jiakai Shi, Ilya Musabirov, Rachel Kornfield, Jonah Meyerhoff, Ananya Bhattacharjee, Chris Karr, Theresa Nguyen, David Mohr, Anna Rafferty, Sofia Villar, Nina Deliu, Joseph Jay Williams
  • For: The paper is written to explore the use of adaptive experimentation algorithms, specifically Thompson Sampling, in digital mental health (DMH) interventions to improve their design and impact.* Methods: The paper presents a software system that allows for the adaptation of DMH intervention components using bandit and other algorithms, while collecting data for comparison with traditional uniform random non-adaptive experiments.* Results: The system was deployed to 1100 users recruited through a large mental health non-profit organization, and the results show the potential of adaptive experimentation algorithms in improving the effectiveness of DMH interventions.In Simplified Chinese text, the three key points would be:* For: 这篇论文是为了探讨数字心理健康(DMH)互动式 intervención的优化和影响。* Methods: 论文提出了一种使用适应试验算法(如汤姆生抽象)来改进 DMH 互动式 intervención的软件系统。* Results: 软件系统在1100名通过大型心理健康非营利组织招募的用户中进行了测试,结果表明适应试验算法在改进 DMH 互动式 intervención的设计和影响方面具有潜在的潜力。
    Abstract Digital mental health (DMH) interventions, such as text-message-based lessons and activities, offer immense potential for accessible mental health support. While these interventions can be effective, real-world experimental testing can further enhance their design and impact. Adaptive experimentation, utilizing algorithms like Thompson Sampling for (contextual) multi-armed bandit (MAB) problems, can lead to continuous improvement and personalization. However, it remains unclear when these algorithms can simultaneously increase user experience rewards and facilitate appropriate data collection for social-behavioral scientists to analyze with sufficient statistical confidence. Although a growing body of research addresses the practical and statistical aspects of MAB and other adaptive algorithms, further exploration is needed to assess their impact across diverse real-world contexts. This paper presents a software system developed over two years that allows text-messaging intervention components to be adapted using bandit and other algorithms while collecting data for side-by-side comparison with traditional uniform random non-adaptive experiments. We evaluate the system by deploying a text-message-based DMH intervention to 1100 users, recruited through a large mental health non-profit organization, and share the path forward for deploying this system at scale. This system not only enables applications in mental health but could also serve as a model testbed for adaptive experimentation algorithms in other domains.
    摘要 数字心理健康(DMH)干预,如基于短信的课程和活动,为可及的心理健康支持提供了巨大潜力。虽然这些干预可以有效,但在真实环境中的实验测试可以进一步提升其设计和影响。利用 Thompson Sampling 等算法求解(情境)多臂老虎机(MAB)问题的适应性实验,可以带来持续改进和个性化。然而,目前尚不清楚这些算法何时能够在提高用户体验回报的同时,为社会行为科学家收集具有足够统计置信度的数据以供分析。尽管越来越多的研究探讨了 MAB 及其他适应性算法的实践与统计问题,但仍需进一步探索以评估它们在多种真实场景中的影响。本文介绍了一个历时两年开发的软件系统,它允许基于短信的干预组件通过 bandit 及其他算法进行适应,同时收集数据以便与传统的均匀随机非适应实验进行并列比较。我们通过一家大型心理健康非营利组织招募了 1100 名用户并部署了基于短信的 DMH 干预来评估该系统,并分享了未来大规模部署该系统的路径。该系统不仅可用于心理健康领域,也可作为其他领域适应性实验算法的模型试验平台。
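Thompson Sampling for Bernoulli engagement rewards is straightforward to simulate; the toy loop below samples from Beta posteriors and plays the best draw, with made-up engagement rates standing in for real user data.

```python
import numpy as np

def thompson_sampling(successes, failures, rng):
    """One Thompson Sampling step for Bernoulli rewards (e.g. whether a user
    engages with a text-message prompt): sample an engagement rate from each
    arm's Beta posterior and play the arm with the highest draw."""
    samples = rng.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
true_rates = [0.05, 0.12, 0.08]                     # hypothetical engagement rates
succ, fail = np.zeros(3), np.zeros(3)
for _ in range(2000):
    arm = thompson_sampling(succ, fail, rng)
    reward = rng.random() < true_rates[arm]
    succ[arm] += reward
    fail[arm] += 1 - reward
print("pulls per arm:", (succ + fail).astype(int),
      "posterior means:", np.round((succ + 1) / (succ + fail + 2), 3))
```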

Enhancing BERT-Based Visual Question Answering through Keyword-Driven Sentence Selection

  • paper_url: http://arxiv.org/abs/2310.09432
  • repo_url: None
  • paper_authors: Davide Napolitano, Lorenzo Vaiani, Luca Cagliero
  • for: 这个paper的目的是自动检测多页文档中的父子关系。
  • methods: 这个paper使用了文本 только方法,利用特制的采样策略。具体来说,它利用了覆盖语言模型的遮盖技术,对BERT模型进行了微调,专注于含有敏感关键词的句子,如表格或图片引用。
  • results: 这个paper的解决方案比基eline高效,达到了高性能。这表明了我们的解决方案对这个任务做出了正面贡献。
    Abstract The Document-based Visual Question Answering competition addresses the automatic detection of parent-child relationships between elements in multi-page documents. The goal is to identify the document elements that answer a specific question posed in natural language. This paper describes the PoliTo's approach to addressing this task, in particular, our best solution explores a text-only approach, leveraging an ad hoc sampling strategy. Specifically, our approach leverages the Masked Language Modeling technique to fine-tune a BERT model, focusing on sentences containing sensitive keywords that also occur in the questions, such as references to tables or images. Thanks to the effectiveness of this approach, we are able to achieve high performance compared to baselines, demonstrating how our solution contributes positively to this task.
    摘要 文档基于视觉问答比赛关注自动检测多页文档中元素之间的父子关系。目标是通过自然语言提问来自动检测文档中答案元素。这篇文章描述了波里多的方法来解决这项任务,尤其是我们最佳解决方案是文本只的方法,利用特定的随机抽样策略。具体来说,我们的方法利用做袋掩码语言模型技术来微调BERT模型,专注于问题中包含敏感关键词的句子,如表格或图像引用。由于这种方法的有效性,我们能够在基eline上实现高性能,说明了我们的解决方案对这个任务做出了积极贡献。

A Systematic Evaluation of Large Language Models on Out-of-Distribution Logical Reasoning Tasks

  • paper_url: http://arxiv.org/abs/2310.09430
  • repo_url: https://github.com/strong-ai-lab/logical-and-abstract-reasoning
  • paper_authors: Qiming Bao, Gael Gendron, Alex Yuxuan Peng, Wanjun Zhong, Neset Tan, Yang Chen, Michael Witbrock, Jiamou Liu
  • for: 评估大语言模型(LLM)的普适性和可靠性在逻辑推理任务上。
  • methods: 提出三个新的逻辑推理数据集,名为“ReClor-plus”、“LogiQA-plus”和“LogiQAv2-plus”,每个数据集有三个子集:第一个是随机排序的选项,第二个是正确选项被替换为“ none of the other options are correct”,第三个是组合前两个子集。进行这些数据集上的实验,并显示这些简单的技巧对语言模型的性能有很大阻碍。
  • results: 发现所有模型在我们新建的数据集上表现差,尤其是在逻辑推理任务上。我们还发现,通过对训练集进行任务变化,可以大幅提高模型的普适性和可靠性。此外,通过逻辑驱动的数据增强和提问可以提高大语言模型的普适性表现。这些结果为评估和改进大语言模型的逻辑推理能力提供了新的视角。我们将源代码和数据公开发布在GitHub上,链接在url中。
    Abstract Large language models (LLMs), such as GPT-3.5 and GPT-4, have greatly advanced the performance of artificial systems on various natural language processing tasks to human-like levels. However, their generalisation and robustness to perform logical reasoning remain under-evaluated. To probe this ability, we propose three new logical reasoning datasets named "ReClor-plus", "LogiQA-plus" and "LogiQAv2-plus", each featuring three subsets: the first with randomly shuffled options, the second with the correct choices replaced by "none of the other options are correct", and a combination of the previous two subsets. We carry out experiments on these datasets with both discriminative and generative LLMs and show that these simple tricks greatly hinder the performance of the language models. Despite their superior performance on the original publicly available datasets, we find that all models struggle to answer our newly constructed datasets. We show that introducing task variations by perturbing a sizable training set can markedly improve the model's generalisation and robustness in logical reasoning tasks. Moreover, applying logic-driven data augmentation for fine-tuning, combined with prompting can enhance the generalisation performance of both discriminative large language models and generative large language models. These results offer insights into assessing and improving the generalisation and robustness of large language models for logical reasoning tasks. We make our source code and data publicly available \url{https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning}.
    摘要 大型自然语言处理模型(LLM),如GPT-3.5和GPT-4,已经在不同的自然语言处理任务上达到了人类水平的性能。然而,它们的总体化和鲁棒性在逻辑推理任务上仍然受到了不足的评估。为探索这一能力,我们提出了三个新的逻辑推理数据集:"ReClor-plus"、"LogiQA-plus"和"LogiQAv2-plus",每个数据集有三个子集:第一个是随机排序的选项,第二个是正确选项被替换为"none of the other options are correct",第三个是这两个子集的组合。我们对这些数据集进行了对抗和生成模型的实验,发现这些简单的技巧很大地降低了模型的性能。尽管这些模型在原始公开的数据集上表现出色,但我们发现所有模型在我们新建的数据集上很难回答问题。我们发现可以通过对训练集进行修改来引入任务变化,这会使模型在逻辑推理任务中的总体化和鲁棒性得到明显提升。此外,我们发现在 fine-tuning 过程中应用逻辑驱动的数据增强,并与提示结合使用,可以进一步提高总体化模型和生成模型的总体化性能。这些结果为评估和改进大型自然语言处理模型的逻辑推理能力提供了新的视角。我们将代码和数据公开在 GitHub 上,可以在以下链接获取:

Hybrid Reinforcement Learning for Optimizing Pump Sustainability in Real-World Water Distribution Networks

  • paper_url: http://arxiv.org/abs/2310.09412
  • repo_url: None
  • paper_authors: Harsh Patel, Yuan Zhou, Alexander P Lamb, Shu Wang, Jieliang Luo
  • for: optimize real-time control of water distribution networks (WDNs) to reduce energy consumption and operational costs while adhering to physical operational constraints.
  • methods: reinforcement learning (RL) with improved “hybrid RL” methodology that integrates benefits of RL with historical data to enhance explainability and robustness of control recommendations.
  • results: significant improvement in sustainability, operational efficiency, and adaptability to emerging scenarios in real-world WDNs.
    Abstract This article addresses the pump-scheduling optimization problem to enhance real-time control of real-world water distribution networks (WDNs). Our primary objectives are to adhere to physical operational constraints while reducing energy consumption and operational costs. Traditional optimization techniques, such as evolution-based and genetic algorithms, often fall short due to their lack of convergence guarantees. Conversely, reinforcement learning (RL) stands out for its adaptability to uncertainties and reduced inference time, enabling real-time responsiveness. However, the effective implementation of RL is contingent on building accurate simulation models for WDNs, and prior applications have been limited by errors in simulation training data. These errors can potentially cause the RL agent to learn misleading patterns and actions and recommend suboptimal operational strategies. To overcome these challenges, we present an improved "hybrid RL" methodology. This method integrates the benefits of RL while anchoring it in historical data, which serves as a baseline to incrementally introduce optimal control recommendations. By leveraging operational data as a foundation for the agent's actions, we enhance the explainability of the agent's actions, foster more robust recommendations, and minimize error. Our findings demonstrate that the hybrid RL agent can significantly improve sustainability, operational efficiency, and dynamically adapt to emerging scenarios in real-world WDNs.
    摘要 Simplified Chinese:这篇文章关注优化水分配网络(WDN)中�ump的调度问题,以提高实时控制。我们的主要目标是遵循物理操作限制,同时降低能源消耗和操作成本。传统优化技术,如演化算法和遗传算法,经常因为缺乏收敛保证而失败。相比之下,反馈学习(RL)具有适应不确定性的优势,并且具有快速的推理时间,可以实现实时应对。但是,RL的有效实现需要建立准确的WDN模型,而前一些应用受到模型训练数据中的错误限制。这些错误可能导致RL机器学习器学习错误的模式和动作,并推荐不优化的操作策略。为了解决这些挑战,我们提出了一种改进的“混合RL”方法。这种方法结合了RL的优点,同时将其 anchored在历史数据上。通过利用操作数据作为机器学习器的行动基础,我们可以增强机器学习器的解释力,激发更加稳健的建议,并最小化错误。我们的发现表明,混合RL机器学习器可以在实际WDN中显著提高可持续性、操作效率和适应新情况。

Surveying the Landscape of Text Summarization with Deep Learning: A Comprehensive Review

  • paper_url: http://arxiv.org/abs/2310.09411
  • repo_url: None
  • paper_authors: Guanghua Wang, Weili Wu
  • for: 本文旨在介绍深度学习在自然语言处理(NLP)中的应用,特别是在文本摘要领域。
  • methods: 本文使用的方法包括深度神经网络,用于学习语言数据中的复杂表示,并且可以处理变长输入序列和大规模数据。
  • results: 本文结果包括讨论当前流行的文本摘要任务,包括抽取、抽象、多文摘要等,以及这些任务的深度学习模型和实验结果。
    Abstract In recent years, deep learning has revolutionized natural language processing (NLP) by enabling the development of models that can learn complex representations of language data, leading to significant improvements in performance across a wide range of NLP tasks. Deep learning models for NLP typically use large amounts of data to train deep neural networks, allowing them to learn the patterns and relationships in language data. This is in contrast to traditional NLP approaches, which rely on hand-engineered features and rules to perform NLP tasks. The ability of deep neural networks to learn hierarchical representations of language data, handle variable-length input sequences, and perform well on large datasets makes them well-suited for NLP applications. Driven by the exponential growth of textual data and the increasing demand for condensed, coherent, and informative summaries, text summarization has been a critical research area in the field of NLP. Applying deep learning to text summarization refers to the use of deep neural networks to perform text summarization tasks. In this survey, we begin with a review of fashionable text summarization tasks in recent years, including extractive, abstractive, multi-document, and so on. Next, we discuss most deep learning-based models and their experimental results on these tasks. The paper also covers datasets and data representation for summarization tasks. Finally, we delve into the opportunities and challenges associated with summarization tasks and their corresponding methodologies, aiming to inspire future research efforts to advance the field further. A goal of our survey is to explain how these methods differ in their requirements as understanding them is essential for choosing a technique suited for a specific setting.

CIDER: Category-Guided Intent Disentanglement for Accurate Personalized News Recommendation

  • paper_url: http://arxiv.org/abs/2310.09401
  • repo_url: None
  • paper_authors: Yunyong Ko, Seongeun Ryu, Sang-Wook Kim
  • for: 这篇论文的目的是提出一种个性化新闻推荐方法,以帮助用户找到符合他们兴趣的新闻文章,从而减轻用户信息沉淀的问题。
  • methods: 该方法使用了分类指导的意图分离技术来解决两个问题(C1和C2)。其中,C1是如何准确地理解新闻文章中嵌入的多种意图,而C2是如何在用户的点击历史中区分不同的新闻文章。
  • results: 经过广泛的实验 validate 的结果表明,这种新闻推荐方法可以在两个真实世界数据集上提供一致性高的表现,并且提高了模型的准确率。
    Abstract Personalized news recommendation aims to assist users in finding news articles that align with their interests, which plays a pivotal role in mitigating users' information overload problem. Although many recent works have been studied for better user and news representations, the following challenges have been rarely studied: (C1) How to precisely comprehend a range of intents coupled within a news article? and (C2) How to differentiate news articles with varying post-read preferences in users' click history? To tackle both challenges together, in this paper, we propose a novel personalized news recommendation framework (CIDER) that employs (1) category-guided intent disentanglement for (C1) and (2) consistency-based news representation for (C2). Furthermore, we incorporate a category prediction into the training process of CIDER as an auxiliary task, which provides supplementary supervisory signals to enhance intent disentanglement. Extensive experiments on two real-world datasets reveal that (1) CIDER provides consistent performance improvements over seven state-of-the-art news recommendation methods and (2) the proposed strategies significantly improve the model accuracy of CIDER.
    摘要 个性化新闻推荐旨在帮助用户找到符合其兴趣的新闻文章,对缓解用户的信息过载问题起到关键作用。虽然许多近期研究致力于更好的用户和新闻表示,但以下两个挑战却鲜有研究:(C1) 如何准确理解一篇新闻文章中耦合的多种意图?(C2) 如何区分用户点击历史中阅读后偏好不同的新闻文章?为同时解决这两个挑战,本文提出了一种新的个性化新闻推荐框架(CIDER),它使用 (1) 类别引导的意图解耦来解决 (C1),以及 (2) 基于一致性的新闻表示来解决 (C2)。此外,我们在 CIDER 的训练过程中加入类别预测作为辅助任务,提供补充的监督信号以增强意图解耦。在两个真实数据集上的大量实验表明:(1) CIDER 相比七种最先进的新闻推荐方法提供了一致的性能提升;(2) 所提策略显著提高了 CIDER 的模型准确度。

Semantics Alignment via Split Learning for Resilient Multi-User Semantic Communication

  • paper_url: http://arxiv.org/abs/2310.09394
  • repo_url: None
  • paper_authors: Jinhyuk Choi, Jihong Park, Seung-Woo Ko, Jinho Choi, Mehdi Bennis, Seong-Lyun Kim
  • for: 这些研究旨在提高语义通信中的 neural network(NN)基于接收机(transceiver)的性能,使其能够从源数据和通信频率中提取和传输语义信息。
  • methods: 这些研究使用了分布式学习(distributed learning)和半神经网络(partial NN)精度调整技术,其中每个编码器下载了一个偏移的解码器,并地本地精度调整一部分编码器-解码器神经网络层。
  • results: simulations 表明,SLF 能够在不同的源数据和通信频率异常情况下实现语义启示的一致,并且可以控制计算和通信成本。
    Abstract Recent studies on semantic communication commonly rely on neural network (NN) based transceivers such as deep joint source and channel coding (DeepJSCC). Unlike traditional transceivers, these neural transceivers are trainable using actual source data and channels, enabling them to extract and communicate semantics. On the flip side, each neural transceiver is inherently biased towards specific source data and channels, making different transceivers difficult to understand intended semantics, particularly upon their initial encounter. To align semantics over multiple neural transceivers, we propose a distributed learning based solution, which leverages split learning (SL) and partial NN fine-tuning techniques. In this method, referred to as SL with layer freezing (SLF), each encoder downloads a misaligned decoder, and locally fine-tunes a fraction of these encoder-decoder NN layers. By adjusting this fraction, SLF controls computing and communication costs. Simulation results confirm the effectiveness of SLF in aligning semantics under different source data and channel dissimilarities, in terms of classification accuracy, reconstruction errors, and recovery time for comprehending intended semantics from misalignment.
    摘要 近期的语义通信研究通常依赖基于神经网络(NN)的收发机,例如深度联合信源信道编码(DeepJSCC)。与传统收发机不同,这些神经收发机可以利用真实的信源数据和信道进行训练,从而提取并传递语义信息。另一方面,每个神经收发机都不可避免地偏向其特定的信源数据和信道,使得不同的收发机难以理解彼此想要表达的语义,尤其是在初次相遇时。为了在多个神经收发机之间对齐语义,我们提出一种基于分布式学习的解决方案,它结合了拆分学习(SL)与部分神经网络微调技术。在这种被称为带层冻结的拆分学习(SLF)的方法中,每个编码器下载一个尚未对齐的解码器,并在本地微调其中一部分编码器-解码器网络层;通过调整这一比例,SLF 可以控制计算与通信开销。仿真结果表明,在不同的信源数据和信道差异下,SLF 都能有效地对齐语义,这通过分类准确率、重建误差以及从失配中恢复语义理解所需的时间得到了验证。

Integrating Symbolic Reasoning into Neural Generative Models for Design Generation

  • paper_url: http://arxiv.org/abs/2310.09383
  • repo_url: None
  • paper_authors: Maxwell Joseph Jacobson, Yexiang Xue
  • for: The paper aims to improve automated design generation by integrating neural and symbolic reasoning, allowing for more accurate and interpretable design outputs that meet user specifications and aesthetic preferences.
  • methods: The proposed Spatial Reasoning Integrated Generator (SPRING) embeds a neural and symbolic integrated spatial reasoning module inside a deep generative network, using a recurrent neural network to predict object locations and symbolic constraint satisfaction to ensure that the generated designs meet user requirements.
  • results: SPRING outperforms baseline generative models in delivering high design quality and better meeting user specifications, as demonstrated through quantitative evaluations and a human study. Additionally, SPRING provides interpretability and zero-shot constraint transfer, allowing users to visualize and diagnose the generation process and adapt to novel user specifications.
    Abstract Design generation requires tight integration of neural and symbolic reasoning, as good design must meet explicit user needs and honor implicit rules for aesthetics, utility, and convenience. Current automated design tools driven by neural networks produce appealing designs, but cannot satisfy user specifications and utility requirements. Symbolic reasoning tools, such as constraint programming, cannot perceive low-level visual information in images or capture subtle aspects such as aesthetics. We introduce the Spatial Reasoning Integrated Generator (SPRING) for design generation. SPRING embeds a neural and symbolic integrated spatial reasoning module inside the deep generative network. The spatial reasoning module decides the locations of objects to be generated in the form of bounding boxes, which are predicted by a recurrent neural network and filtered by symbolic constraint satisfaction. Embedding symbolic reasoning into neural generation guarantees that the output of SPRING satisfies user requirements. Furthermore, SPRING offers interpretability, allowing users to visualize and diagnose the generation process through the bounding boxes. SPRING is also adept at managing novel user specifications not encountered during its training, thanks to its proficiency in zero-shot constraint transfer. Quantitative evaluations and a human study reveal that SPRING outperforms baseline generative models, excelling in delivering high design quality and better meeting user specifications.
    摘要 设计生成需要神经推理与符号推理的紧密结合,因为好的设计既要满足用户的显式需求,也要遵循关于美学、实用性和便利性的隐式规则。当前由神经网络驱动的自动设计工具可以生成有吸引力的设计,但无法满足用户的规格与实用需求;而符号推理工具(如约束编程)则无法感知图像中的低层视觉信息,也难以捕捉美学等细微特征。我们提出了用于设计生成的空间推理集成生成器(Spatial Reasoning Integrated Generator,SPRING)。SPRING 将神经-符号一体的空间推理模块嵌入深度生成网络中:该模块以边界框的形式决定待生成对象的位置,由循环神经网络预测候选位置,再经符号约束求解进行筛选。将符号推理嵌入神经生成过程可以保证 SPRING 的输出满足用户要求。此外,SPRING 具有可解释性,用户可以通过边界框可视化并诊断生成过程;得益于零样本约束迁移能力,SPRING 还能应对训练中未曾出现的新用户规格。量化评估与人类研究表明,SPRING 优于基线生成模型,在设计质量和满足用户规格方面均表现出色。

Near-optimal Differentially Private Client Selection in Federated Settings

  • paper_url: http://arxiv.org/abs/2310.09370
  • repo_url: None
  • paper_authors: Syed Eqbal Alam, Dhirendra Shukla, Shrisha Rao
  • for: 这篇论文提出了一种基于迭代差分隐私算法的联邦环境客户端选择方法。
  • methods: 该算法使用迭代差分隐私机制来保证隐私,且不需要客户端之间交换信息。
  • results: 实验结果表明,该算法可以在长期平均参与率下提供近似优化的价值,同时保证隐私。
    Abstract We develop an iterative differentially private algorithm for client selection in federated settings. We consider a federated network wherein clients coordinate with a central server to complete a task; however, the clients decide whether to participate or not at a time step based on their preferences -- local computation and probabilistic intent. The algorithm does not require client-to-client information exchange. The developed algorithm provides near-optimal values to the clients over long-term average participation with a certain differential privacy guarantee. Finally, we present the experimental results to check the algorithm's efficacy.
    摘要 我们开发了一种迭代差分隐私算法,用于在联邦设置中选择客户端。我们考虑一个联邦网络,其中客户端与中央服务器协作完成任务;客户端在每个时间步根据自身偏好(本地计算与概率意图)决定是否参与。该算法不需要客户端之间的信息交换,能够在长期平均参与意义下为客户端提供近似最优的价值,并给出相应的差分隐私保证。最后,我们给出实验结果以验证算法的有效性。

When are Bandits Robust to Misspecification?

  • paper_url: http://arxiv.org/abs/2310.09358
  • repo_url: None
  • paper_authors: Debangshu Banerjee, Aditya Gopalan
  • for: 该文章探讨了在决策设置中使用参数特征基本奖励模型的情况,特别是在假设真实奖励和模型之间存在差异时。
  • methods: 文章分析了 $\epsilon$-greedy 和 LinUCB 等经典算法,并给出了依赖于问题实例和模型类的充分条件,使这些算法即使在奖励被严重误设定时仍能获得亚线性(随时间)遗憾保证。
  • results: 文章发现,在许多奖励被误设定的情形下,经典算法仍可获得亚线性遗憾保证;这与已有的遗憾随时间线性增长的最坏情况结果形成对比,说明存在相当一部分对误设定具有鲁棒性的 bandit 实例。
    Abstract Parametric feature-based reward models are widely employed by algorithms for decision making settings such as bandits and contextual bandits. The typical assumption under which they are analysed is realizability, i.e., that the true rewards of actions are perfectly explained by some parametric model in the class. We are, however, interested in the situation where the true rewards are (potentially significantly) misspecified with respect to the model class. For parameterized bandits and contextual bandits, we identify sufficient conditions, depending on the problem instance and model class, under which classic algorithms such as $\epsilon$-greedy and LinUCB enjoy sublinear (in the time horizon) regret guarantees under even grossly misspecified rewards. This is in contrast to existing worst-case results for misspecified bandits which show regret bounds that scale linearly with time, and shows that there can be a nontrivially large set of bandit instances that are robust to misspecification.
    摘要 基于参数化特征的奖励模型被广泛用于 bandit 和上下文 bandit 等决策场景。对它们的分析通常基于可实现性(realizability)假设,即动作的真实奖励可以被模型类中的某个参数模型完全解释。而我们关注的是真实奖励相对于模型类存在(可能显著的)误设定的情形。对于参数化 bandit 和上下文 bandit,我们给出了依赖于问题实例和模型类的充分条件,使得 $\epsilon$-greedy 和 LinUCB 等经典算法即便在严重误设定的奖励下,仍能获得(随时间)亚线性的遗憾保证。这与已有的关于误设定 bandit 的最坏情况结果(遗憾随时间线性增长)形成对比,表明存在相当一部分对误设定具有鲁棒性的 bandit 实例。

Unsupervised Domain Adaption for Neural Information Retrieval

  • paper_url: http://arxiv.org/abs/2310.09350
  • repo_url: None
  • paper_authors: Carlos Dominguez, Jon Ander Campos, Eneko Agirre, Gorka Azkune
  • for: 这种论文主要是为了比较使用大型自然语言模型生成查询和基于规则的字符串修饰来生成synthetic annotation,以提高神经信息检索的竞争力。
  • methods: 这篇论文使用了同一种神经信息检索建模,并在BEIR测试集上进行了对比,包括零shot和无监督适应化两种情况。
  • results: 结果表明,大型自然语言模型在所有情况下都大幅超越基于规则的方法,而无监督适应化也比零shot更有效。此外,我们还研究了不同大小的开放式大型自然语言模型是否会影响生成的数据质量,发现medium-sized模型足够。
    Abstract Neural information retrieval requires costly annotated data for each target domain to be competitive. Synthetic annotation by query generation using Large Language Models or rule-based string manipulation has been proposed as an alternative, but their relative merits have not been analysed. In this paper, we compare both methods head-to-head using the same neural IR architecture. We focus on the BEIR benchmark, which includes test datasets from several domains with no training data, and explore two scenarios: zero-shot, where the supervised system is trained in a large out-of-domain dataset (MS-MARCO); and unsupervised domain adaptation, where, in addition to MS-MARCO, the system is fine-tuned in synthetic data from the target domain. Our results indicate that Large Language Models outperform rule-based methods in all scenarios by a large margin, and, more importantly, that unsupervised domain adaptation is effective compared to applying a supervised IR system in a zero-shot fashion. In addition we explore several sizes of open Large Language Models to generate synthetic data and find that a medium-sized model suffices. Code and models are publicly available for reproducibility.

Dialogue Chain-of-Thought Distillation for Commonsense-aware Conversational Agents

  • paper_url: http://arxiv.org/abs/2310.09343
  • repo_url: None
  • paper_authors: Hyungjoo Chae, Yongho Song, Kai Tzu-iunn Ong, Taeyoon Kwon, Minjin Kim, Youngjae Yu, Dongha Lee, Dongyeop Kang, Jinyoung Yeo
  • for: 提高对话机器人的响应质量,使其更好地理解和回答对话中的隐含信息。
  • methods: 提出了一种知识储存框架,利用大语言模型(LLM)作为不可靠的教师,通过对适应过滤器进行选择性储存,提供可靠的对话链思维(CoT)理据。
  • results: 通过对多个实验进行详细测试,显示了增强对话机器人的响应质量的重要性。
    Abstract Human-like chatbots necessitate the use of commonsense reasoning in order to effectively comprehend and respond to implicit information present within conversations. Achieving such coherence and informativeness in responses, however, is a non-trivial task. Even for large language models (LLMs), the task of identifying and aggregating key evidence within a single hop presents a substantial challenge. This complexity arises because such evidence is scattered across multiple turns in a conversation, thus necessitating integration over multiple hops. Hence, our focus is to facilitate such multi-hop reasoning over a dialogue context, namely dialogue chain-of-thought (CoT) reasoning. To this end, we propose a knowledge distillation framework that leverages LLMs as unreliable teachers and selectively distills consistent and helpful rationales via alignment filters. We further present DOCTOR, a DialOgue Chain-of-ThOught Reasoner that provides reliable CoT rationales for response generation. We conduct extensive experiments to show that enhancing dialogue agents with high-quality rationales from DOCTOR significantly improves the quality of their responses.
    摘要 人类化聊天机器人需要使用常识理解以便有效地理解并响应在对话中的隐式信息。实现这种 coherence 和 informativeness 在回答中是一个非常复杂的任务。即使是大语言模型(LLM),也面临着在单个跳步中identifying 和集成关键证据的挑战。这种复杂性 arise 因为这些证据分散在多个对话转帖中,因此需要进行多个跳步的集成。因此,我们的注重点是在对话上下文中进行多跳步理解,即对话链条理解(CoT)。为此,我们提出了知识填充框架,该框架利用 LLM 作为不可靠的教师,通过对适应性筛选器进行选择性填充高质量的 rationales。此外,我们还提出了 DOCTOR,一个基于对话链条理解的回答生成工具,可以提供可靠的 CoT 理由。我们进行了广泛的实验,并证明了通过 DOCTOR 提供高质量的 rationales 可以大幅提高对话机器人的回答质量。

Ranking LLM-Generated Loop Invariants for Program Verification

  • paper_url: http://arxiv.org/abs/2310.09342
  • repo_url: None
  • paper_authors: Saikat Chakraborty, Shuvendu K. Lahiri, Sarah Fakhoury, Madanlal Musuvathi, Akash Lal, Aseem Rastogi, Aditya Senthilnathan, Rahul Sharma, Nikhil Swamy
  • for: 本研究旨在自动程序验证中总结循环 invariants。
  • methods: 本文使用大语言模型(如gpt-3.5或gpt-4)在零批学环境中生成循环 invariants,但需要许多样本来生成正确的 invariants。
  • results: 本文提出了一种{\it re-ranking}方法,使得生成的结果中正确的循环 invariants得到更高的排名,从而减少了验证器的调用次数。
    Abstract Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an invariant. To address this issue, we propose a {\it re-ranking} approach for the generated results of LLMs. We have designed a ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier.
    摘要 合成归纳循环不变式是自动化程序验证的基础。在这项工作中,我们观察到大语言模型(如 gpt-3.5 或 gpt-4)在零样本设置下能够为一类程序合成循环不变式,但通常需要多次采样才能得到正确的不变式,这会导致为建立一个不变式而大量调用程序验证器。为解决这一问题,我们提出了一种对 LLM 生成结果进行重排序(re-ranking)的方法:我们设计了一个排序器,能够依据问题定义区分正确的归纳不变式与错误的尝试,并将其优化为对比排序器。实验结果表明,该重排序机制显著提升了正确不变式在候选结果中的排名,从而明显减少了对验证器的调用次数。

Uncertainty Quantification using Generative Approach

  • paper_url: http://arxiv.org/abs/2310.09338
  • repo_url: None
  • paper_authors: Yunsheng Zhang
  • for: 用于度量深度神经网络中的不确定性
  • methods: 使用增量生成 Monte Carlo 方法,逐步训练生成模型,计算 posterior 分布中随机变量的期望
  • results: 在 MNIST 数字分类任务上实际研究了 IGMC 的行为,并提供了关于样本大小和采样深度的理论保证
    Abstract We present the Incremental Generative Monte Carlo (IGMC) method, designed to measure uncertainty in deep neural networks using deep generative approaches. IGMC iteratively trains generative models, adding their output to the dataset, to compute the posterior distribution of the expectation of a random variable. We provide a theoretical guarantee of the convergence rate of IGMC relative to the sample size and sampling depth. Due to its compatibility with deep generative approaches, IGMC is adaptable to both neural network classification and regression tasks. We empirically study the behavior of IGMC on the MNIST digit classification task.
    摘要 我们提出了增量生成蒙特卡洛(IGMC)方法,利用深度生成方法度量深度神经网络中的不确定性。IGMC 迭代地训练生成模型,并将其输出加入数据集,以计算随机变量期望的后验分布。我们给出了 IGMC 收敛速率关于样本规模和采样深度的理论保证。由于与深度生成方法兼容,IGMC 可适用于神经网络分类与回归任务。我们在 MNIST 数字分类任务上对 IGMC 的行为进行了实证研究。
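
To make the iterative procedure concrete, here is a minimal toy sketch in Python (an illustration only, not the paper's implementation): a one-dimensional Gaussian stands in for the deep generative model, it is refit each round after its samples are appended to the dataset, and the running estimate of the expectation is tracked across iterations.

```python
import numpy as np

def igmc_toy(data, n_iters=10, n_samples_per_iter=200, seed=0):
    """Toy Incremental-Generative-Monte-Carlo-style loop (illustrative only).

    A 1-D Gaussian plays the role of the generative model: at each iteration it is
    refit to the current dataset, new samples are drawn and appended, and the running
    estimate of E[X] is recorded.
    """
    rng = np.random.default_rng(seed)
    dataset = np.asarray(data, dtype=float)
    estimates = []
    for _ in range(n_iters):
        mu, sigma = dataset.mean(), dataset.std(ddof=1)            # "train" the generative model
        new_samples = rng.normal(mu, sigma, n_samples_per_iter)    # generate new samples
        dataset = np.concatenate([dataset, new_samples])           # augment the dataset
        estimates.append(dataset.mean())                           # posterior-expectation estimate
    return estimates

print(igmc_toy(np.random.default_rng(1).normal(2.0, 1.0, 50)))
```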

Retro-fallback: retrosynthetic planning in an uncertain world

  • paper_url: http://arxiv.org/abs/2310.09270
  • repo_url: None
  • paper_authors: Austin Tripp, Krzysztof Maziarz, Sarah Lewis, Marwin Segler, José Miguel Hernández-Lobato
  • for: 提出了一种新的retrosynthesis算法,可以考虑化学反应空间中的不确定性。
  • methods: 使用了 Stochastic Processes 来表述retrosynthesis,并提出了一种名为 retro-fallback 的新的贪心算法,可以最大化在实验室中可执行的合成计划的概率。
  • results: 使用 In-silico benchmark 表明,retro-fallback 算法通常可以生成比 MCTS 和 retro* 算法更好的合成计划。
    Abstract Retrosynthesis is the task of proposing a series of chemical reactions to create a desired molecule from simpler, buyable molecules. While previous works have proposed algorithms to find optimal solutions for a range of metrics (e.g. shortest, lowest-cost), these works generally overlook the fact that we have imperfect knowledge of the space of possible reactions, meaning plans created by the algorithm may not work in a laboratory. In this paper we propose a novel formulation of retrosynthesis in terms of stochastic processes to account for this uncertainty. We then propose a novel greedy algorithm called retro-fallback which maximizes the probability that at least one synthesis plan can be executed in the lab. Using in-silico benchmarks we demonstrate that retro-fallback generally produces better sets of synthesis plans than the popular MCTS and retro* algorithms.
    摘要 逆合成(retrosynthesis)是指提出一系列化学反应,以便从更简单、可购买的分子出发制备目标分子。先前的研究提出了针对各种指标(例如最短路线、最低成本)寻找最优方案的算法,但这些工作通常忽略了我们对可行化学反应空间的认知并不完备这一事实,因而算法给出的方案在实验室中未必可行。在本文中,我们基于随机过程给出了逆合成的一种新形式化,以刻画这种不确定性;并提出了一种名为 retro-fallback 的新贪心算法,用于最大化至少有一条合成路线能够在实验室中成功执行的概率。基于 in-silico 基准的实验表明,retro-fallback 生成的合成方案集合通常优于流行的 MCTS 和 retro* 算法。
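
The objective retro-fallback maximizes can be illustrated with a small Monte Carlo estimate. In the sketch below, each reaction is assumed to be feasible with some probability, a plan is executable only if all of its reactions are feasible, and the quantity of interest is the probability that at least one plan in the set is executable; the plan structure and probabilities are hypothetical, and the paper's stochastic-process formulation and search algorithm are not reproduced here.

```python
import numpy as np

def prob_at_least_one_plan(plans, reaction_feasibility, n_trials=20000, seed=0):
    """Estimate P(at least one synthesis plan is executable).

    plans: list of sets of reaction ids; a plan succeeds iff all its reactions work.
    reaction_feasibility: dict reaction_id -> probability the reaction works in the lab.
    Reactions shared across plans are sampled once per trial, so correlations are kept.
    """
    rng = np.random.default_rng(seed)
    reactions = sorted(reaction_feasibility)
    probs = np.array([reaction_feasibility[r] for r in reactions])
    idx = {r: i for i, r in enumerate(reactions)}
    successes = 0
    for _ in range(n_trials):
        works = rng.random(len(reactions)) < probs   # one feasibility draw per reaction
        if any(all(works[idx[r]] for r in plan) for plan in plans):
            successes += 1
    return successes / n_trials

# Hypothetical example: two candidate plans sharing reaction "r1".
plans = [{"r1", "r2"}, {"r1", "r3", "r4"}]
feas = {"r1": 0.9, "r2": 0.6, "r3": 0.8, "r4": 0.7}
print(prob_at_least_one_plan(plans, feas))
```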

Table-GPT: Table-tuned GPT for Diverse Table Tasks

  • paper_url: http://arxiv.org/abs/2310.09263
  • repo_url: None
  • paper_authors: Peng Li, Yeye He, Dror Yashar, Weiwei Cui, Song Ge, Haidong Zhang, Danielle Rifinski Fainman, Dongmei Zhang, Surajit Chaudhuri
  • for: 这篇论文的目的是提出一种新的"表格调优"(table-tuning)范式,以提高语言模型对表格的理解能力和完成表格任务的能力。
  • methods: 该论文从真实表格中合成多种表格任务作为训练数据,用于继续训练/微调 GPT-3.5 和 ChatGPT 等语言模型。
  • results: 研究发现,得到的 Table-GPT 模型在多种表格任务(包括留出的未见任务)上稳定优于原始的 GPT-3.5 和 ChatGPT,并且能够像 GPT-3.5 和 ChatGPT 一样,根据多样的人类指令完成新的表格任务,表现出较强的泛化能力。
    Abstract Language models, such as GPT-3.5 and ChatGPT, demonstrate remarkable abilities to follow diverse human instructions and perform a wide range of tasks. However, when probing language models using a range of basic table-understanding tasks, we observe that today's language models are still sub-optimal in many table-related tasks, likely because they are pre-trained predominantly on \emph{one-dimensional} natural-language texts, whereas relational tables are \emph{two-dimensional} objects. In this work, we propose a new "\emph{table-tuning}" paradigm, where we continue to train/fine-tune language models like GPT-3.5 and ChatGPT, using diverse table-tasks synthesized from real tables as training data, with the goal of enhancing language models' ability to understand tables and perform table tasks. We show that our resulting Table-GPT models demonstrate (1) better \emph{table-understanding} capabilities, by consistently outperforming the vanilla GPT-3.5 and ChatGPT, on a wide-range of table tasks, including holdout unseen tasks, and (2) strong \emph{generalizability}, in its ability to respond to diverse human instructions to perform new table-tasks, in a manner similar to GPT-3.5 and ChatGPT.
    摘要 GPT-3.5 和 ChatGPT 等语言模型表现出遵循多种人类指令并完成广泛任务的惊人能力。然而,当我们用一系列基础的表格理解任务来探测语言模型时,发现如今的语言模型在许多表格相关任务上仍然是次优的,这很可能是因为它们主要在一维的自然语言文本上进行预训练,而关系表格是二维的对象。在这项工作中,我们提出一种新的"表格调优"(table-tuning)范式:以从真实表格合成的多种表格任务作为训练数据,继续训练/微调 GPT-3.5 和 ChatGPT 等语言模型,以增强其理解表格和完成表格任务的能力。结果表明,所得到的 Table-GPT 模型 (1) 具有更好的表格理解能力,在包括留出未见任务在内的多种表格任务上稳定优于原始的 GPT-3.5 和 ChatGPT;(2) 具有较强的泛化能力,能够像 GPT-3.5 和 ChatGPT 一样按照多样的人类指令完成新的表格任务。

It’s an Alignment, Not a Trade-off: Revisiting Bias and Variance in Deep Models

  • paper_url: http://arxiv.org/abs/2310.09250
  • repo_url: None
  • paper_authors: Lin Chen, Michal Lukasik, Wittawat Jitkrittum, Chong You, Sanjiv Kumar
  • for: 这 paper 是关于深度学习基本错误分析的研究,具体来说是研究深度学习模型ensemble中的偏差和方差之间的交互关系。
  • methods: 该 paper 使用了多种深度学习模型和实验数据,并通过 empirical evidence 验证了这种偏差-方差对应现象的存在。同时,paper 还通过两种理论角度来研究这种现象:calibration 和 neural collapse。
  • results: 研究结果表明,在深度学习分类模型的集成中,偏差与方差在样本层面是对齐的:对于被正确分类的样本点,其平方偏差近似等于方差。这种现象在多种深度学习模型和数据集上均可观察到。
    Abstract Classical wisdom in machine learning holds that the generalization error can be decomposed into bias and variance, and these two terms exhibit a \emph{trade-off}. However, in this paper, we show that for an ensemble of deep learning based classification models, bias and variance are \emph{aligned} at a sample level, where squared bias is approximately \emph{equal} to variance for correctly classified sample points. We present empirical evidence confirming this phenomenon in a variety of deep learning models and datasets. Moreover, we study this phenomenon from two theoretical perspectives: calibration and neural collapse. We first show theoretically that under the assumption that the models are well calibrated, we can observe the bias-variance alignment. Second, starting from the picture provided by the neural collapse theory, we show an approximate correlation between bias and variance.
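
The sample-level quantities being compared can be estimated directly from an ensemble of trained classifiers. The sketch below is illustrative (the estimator and notation are assumptions, not the paper's exact protocol): it computes, per example, the squared bias of the ensemble-mean predicted probabilities against the one-hot label and the variance of member predictions, so the claimed alignment can be inspected on correctly classified points.

```python
import numpy as np

def per_sample_bias_variance(member_probs, labels):
    """member_probs: array (n_models, n_samples, n_classes) of predicted probabilities.
    labels: array (n_samples,) of integer class labels.
    Returns squared bias, variance, and a correct-prediction mask per sample
    (one common decomposition; illustrative)."""
    mean_pred = member_probs.mean(axis=0)                          # (n_samples, n_classes)
    onehot = np.eye(member_probs.shape[2])[labels]                 # (n_samples, n_classes)
    sq_bias = ((mean_pred - onehot) ** 2).sum(axis=1)              # squared bias per sample
    variance = ((member_probs - mean_pred) ** 2).sum(axis=2).mean(axis=0)
    correct = mean_pred.argmax(axis=1) == labels
    return sq_bias, variance, correct

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=(5, 100))                   # 5 fake ensemble members
labels = rng.integers(0, 3, size=100)
b2, var, correct = per_sample_bias_variance(probs, labels)
print("mean squared bias (correct):", b2[correct].mean(),
      "mean variance (correct):", var[correct].mean())
```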

Augmented Computational Design: Methodical Application of Artificial Intelligence in Generative Design

  • paper_url: http://arxiv.org/abs/2310.09243
  • repo_url: None
  • paper_authors: Pirouz Nourian, Shervin Azadi, Roy Uijtendaal, Nan Bai
  • for: 这篇论文旨在探讨人工智能在生成设计中的必要性和实用性。
  • methods: 论文提出了通过人工智能加以增强生成设计过程,以达到一些关键性的结果或性能指标,而处理大量小型决策。
  • results: 论文提出了一些批判性的方向,用于在建筑设计中使用人工智能来增强决策过程,以映射和导航复杂的设计空间。
    Abstract This chapter presents methodological reflections on the necessity and utility of artificial intelligence in generative design. Specifically, the chapter discusses how generative design processes can be augmented by AI to deliver in terms of a few outcomes of interest or performance indicators while dealing with hundreds or thousands of small decisions. The core of the performance-based generative design paradigm is about making statistical or simulation-driven associations between these choices and consequences for mapping and navigating such a complex decision space. This chapter will discuss promising directions in Artificial Intelligence for augmenting decision-making processes in architectural design for mapping and navigating complex design spaces.

Evaluating Machine Perception of Indigeneity: An Analysis of ChatGPT’s Perceptions of Indigenous Roles in Diverse Scenarios

  • paper_url: http://arxiv.org/abs/2310.09237
  • repo_url: None
  • paper_authors: Cecilia Delgado Solorzano, Carlos Toxtli Hernandez
  • for: 研究大语言模型在模拟原住民扮演各种角色的情境时,对原住民身份所表现出的自我感知偏见
  • methods: 通过生成并分析多个场景,考察技术如何感知并可能放大与原住民身份相关的社会偏见
  • results: 研究结果表明,此类技术可能放大社交计算中涉及原住民的社会偏见,并揭示了原住民身份议题在关键计算中的更广泛影响
    Abstract Large Language Models (LLMs), like ChatGPT, are fundamentally tools trained on vast data, reflecting diverse societal impressions. This paper aims to investigate LLMs' self-perceived bias concerning indigeneity when simulating scenarios of indigenous people performing various roles. Through generating and analyzing multiple scenarios, this work offers a unique perspective on how technology perceives and potentially amplifies societal biases related to indigeneity in social computing. The findings offer insights into the broader implications of indigeneity in critical computing.
    摘要 大型语言模型(LLM),如ChatGPT,是基于庞大数据集训练的工具,具有多元社会印象。这篇论文旨在研究LLM对原住民角色的自我认知偏见。通过生成和分析多种enario,这项工作提供了对技术对社会偏见的探索,特别是对原住民在社交计算中的表现。发现有关原住民的扩展性和潜在的社会影响。

ClickPrompt: CTR Models are Strong Prompt Generators for Adapting Language Models to CTR Prediction

  • paper_url: http://arxiv.org/abs/2310.09234
  • repo_url: None
  • paper_authors: Jianghao Lin, Bo Chen, Hangyu Wang, Yunjia Xi, Yanru Qu, Xinyi Dai, Kangning Zhang, Ruiming Tang, Yong Yu, Weinan Zhang
  • for: 点击率(CTR)预测在互联网应用中变得越来越重要。传统的 CTR 模型通过独热编码将多字段类别数据转化为 ID 特征,并提取特征之间的协同信号,这种范式存在语义信息损失的问题。另一类研究则探索利用预训练语言模型(PLM)进行 CTR 预测,通过硬提示模板将输入数据转化为文本句子。这虽然保留了语义信号,但通常难以捕捉协同信息(例如特征交互、纯 ID 特征),更不用说巨大模型规模带来的难以接受的推理开销。
  • methods: 我们提出了一种新的模型无关框架(ClickPrompt),将 CTR 模型与 PLM 结合,用于生成交互感知的软提示。我们设计了一个提示增强的掩码语言建模(PA-MLM)预训练任务:PLM 需要根据语言上下文以及由 CTR 模型生成的软提示来恢复被掩码的 token。在这一过程中,来自 ID 特征与文本特征的协同信息和语义信息将通过提示接口被显式对齐并交互。
  • results: 在四个真实世界数据集上的实验表明,ClickPrompt 比现有基线方法更有效。我们既可以将 CTR 模型与 PLM 联合微调以获得更优性能,也可以仅微调 CTR 模型(不依赖 PLM)以保证推理效率。
    Abstract Click-through rate (CTR) prediction has become increasingly indispensable for various Internet applications. Traditional CTR models convert the multi-field categorical data into ID features via one-hot encoding, and extract the collaborative signals among features. Such a paradigm suffers from the problem of semantic information loss. Another line of research explores the potential of pretrained language models (PLMs) for CTR prediction by converting input data into textual sentences through hard prompt templates. Although semantic signals are preserved, they generally fail to capture the collaborative information (e.g., feature interactions, pure ID features), not to mention the unacceptable inference overhead brought by the huge model size. In this paper, we aim to model both the semantic knowledge and collaborative knowledge for accurate CTR estimation, and meanwhile address the inference inefficiency issue. To benefit from both worlds and close their gaps, we propose a novel model-agnostic framework (i.e., ClickPrompt), where we incorporate CTR models to generate interaction-aware soft prompts for PLMs. We design a prompt-augmented masked language modeling (PA-MLM) pretraining task, where PLM has to recover the masked tokens based on the language context, as well as the soft prompts generated by CTR model. The collaborative and semantic knowledge from ID and textual features would be explicitly aligned and interacted via the prompt interface. Then, we can either tune the CTR model with PLM for superior performance, or solely tune the CTR model without PLM for inference efficiency. Experiments on four real-world datasets validate the effectiveness of ClickPrompt compared with existing baselines.
    摘要 点击率(CTR)预测已成为互联网应用中不可或缺的一种技术。传统CTR模型将多个字段分类资料转换为ID特征via一个单簇编码,并提取特征之间的协力信号。然而,这种模式受到 semantic information loss 的问题。另一线的研究则探访了使用预训语言模型(PLM)来CTR预测的可能性,将输入资料转换为文本句子via固定模板。虽保留 semantic signal,但通常无法捕捉特征互动信息(例如特征互动、纯ID特征),更不用说巨大模型的推断负担。在本文中,我们愿以独特的模型独立框架(ClickPrompt),将CTR模型与PLM融合,以生成互动意识适用的软提示。我们设计了一个Prompt-augmented masked language modeling(PA-MLM)训练任务,让PLM在语言上下文中恢复填充token,同时还需要根据软提示生成由CTR模型生成的。这样,ID和文本特征之间的协力和 semantic knowledge 会被明确地配置和互动,从而提高预测性能。然后,我们可以将CTR模型调整PLM,或者将CTR模型单独调整,以提高推断效率。在四个真实世界数据上进行了实验,显示ClickPrompt与现有基eline相比,具有更高的预测性能。

Fast & Efficient Learning of Bayesian Networks from Data: Knowledge Discovery and Causality

  • paper_url: http://arxiv.org/abs/2310.09222
  • repo_url: None
  • paper_authors: Minn Sein, Fu Shunkai
  • for: 本研究旨在提出两种基于PC算法的新算法,以优化 bayesian 网络结构学习的效率。
  • methods: 这两种算法使用本地搜索策略和conditional independence测试来学习 bayesian 网络结构,并使用 d-separation 来推断额外 topology 信息。
  • results: 实验研究显示,这两种算法可以与 PC 算法匹配 induction 质量,同时减少计算成本,使其在大数据分析中更加可靠。
    Abstract Structure learning is essential for Bayesian networks (BNs) as it uncovers causal relationships, and enables knowledge discovery, predictions, inferences, and decision-making under uncertainty. Two novel algorithms, FSBN and SSBN, based on the PC algorithm, employ local search strategy and conditional independence tests to learn the causal network structure from data. They incorporate d-separation to infer additional topology information, prioritize conditioning sets, and terminate the search immediately and efficiently. FSBN achieves up to 52% computation cost reduction, while SSBN surpasses it with a remarkable 72% reduction for a 200-node network. SSBN demonstrates further efficiency gains due to its intelligent strategy. Experimental studies show that both algorithms match the induction quality of the PC algorithm while significantly reducing computation costs. This enables them to offer interpretability and adaptability while reducing the computational burden, making them valuable for various applications in big data analytics.
    摘要 “结构学习是 bayesian 网络(BN)的关键,它揭示了 causal 关系,并允许知识发现、预测、推理和决策在不确定性下。两种新的算法,FSBN 和 SSBN,基于 PC 算法,使用本地搜索策略和 conditional independence 测试来学习 causal 网络结构。它们利用 d-separation 来推断额外 topology 信息,优先考虑conditioning 集,并立即和高效地 terminate 搜索。FSBN 可以减少计算成本,而 SSBN 甚至超越它,在 200 个节点网络中减少了 72% 的计算成本。SSBN 的智能策略还提供了进一步的效率提升。实验研究表明,这两种算法可以匹配 PC 算法的induction 质量,同时减少计算成本,使其在 big data 分析中提供了可读性和可变性,这些特点使它们在各种应用中非常有价值。”

“Kelly is a Warm Person, Joseph is a Role Model”: Gender Biases in LLM-Generated Reference Letters

  • paper_url: http://arxiv.org/abs/2310.09219
  • repo_url: https://github.com/uclanlp/biases-llm-reference-letters
  • paper_authors: Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, Nanyun Peng
  • for: 这个论文旨在探讨大语言模型(LLM)在写作推荐信函中的公平问题。
  • methods: 作者采用了社会科学发现的方法来评估LLM生成的推荐信函中的语言风格和 lexical content 中的偏见。他们还研究了模型在生成内容中的偏见延展现象,即模型生成的内容中带有偏见的现象。
  • results: 研究发现了两个流行的LLM——ChatGPT和Alpaca在生成推荐信函中存在 significiant gender bias。这些结果警示我们不能不加工程序地使用LLM来生成专业文本,并重要地强调了对LLM生成的专业文本进行仔细的研究和分析。
    Abstract Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content, including professional documents such as recommendation letters. Though bringing convenience, this application also introduces unprecedented fairness concerns. Model-generated reference letters might be directly used by users in professional scenarios. If underlying biases exist in these model-constructed letters, using them without scrutinization could lead to direct societal harms, such as sabotaging application success rates for female applicants. In light of this pressing issue, it is imminent and necessary to comprehensively study fairness issues and associated harms in this real-world use case. In this paper, we critically examine gender biases in LLM-generated reference letters. Drawing inspiration from social science findings, we design evaluation methods to manifest biases through 2 dimensions: (1) biases in language style and (2) biases in lexical content. We further investigate the extent of bias propagation by analyzing the hallucination bias of models, a term that we define to be bias exacerbation in model-hallucinated contents. Through benchmarking evaluation on 2 popular LLMs- ChatGPT and Alpaca, we reveal significant gender biases in LLM-generated recommendation letters. Our findings not only warn against using LLMs for this application without scrutinization, but also illuminate the importance of thoroughly studying hidden biases and harms in LLM-generated professional documents.
    摘要 大型语言模型(LLM)在近期已经作为帮助人们写不同类型的内容的有效工具出现。虽然带来便利,但这种应用也引入了未曾有的公平问题。用户可以直接使用模型生成的推荐信的情况下,如果下面的偏见存在,不仔细审核可能会导致社会性危害,如女性申请人的应用成功率下降。为了解决这 pressing issue,我们必须全面研究公平问题和相关的危害。在这篇论文中,我们 kritisch examines gender biases in LLM-generated reference letters。通过社会科学发现的引用,我们设计了评估方法,以manifest biases through two dimensions: (1) biases in language style and (2) biases in lexical content。我们进一步分析模型的偏见传播情况,包括模型生成内容中的偏见扩大现象,我们定义为模型幻觉偏见。通过对2种流行的LLM-ChatGPT和Alpaca进行标准评估,我们发现了LLM生成的推荐信中的强烈性别偏见。我们的发现不仅警示了不得不仔细审核LLM生成的专业文档,还抛光了研究隐藏偏见和危害的重要性。

Multinational AGI Consortium (MAGIC): A Proposal for International Coordination on AI

  • paper_url: http://arxiv.org/abs/2310.09217
  • repo_url: None
  • paper_authors: Jason Hausenloy, Andrea Miotti, Claire Dennis
  • for: 避免高级人工智能(AI)的存在 риSK,MAGIC提议建立一个多国人工智能总合研究机构。
  • methods: MAGIC 通过全球停止其他高级 AI 开发,成为全球唯一允许开发高级 AI 的机构,实现安全、高度安全、且由成员国共同支持。
  • results: MAGIC 可以允许 narrow AI 模型繁殖,同时减少高级 AI 的可能性,即不良、违规、突然爆发、跑道违规等 outcome。
    Abstract This paper proposes a Multinational Artificial General Intelligence Consortium (MAGIC) to mitigate existential risks from advanced artificial intelligence (AI). MAGIC would be the only institution in the world permitted to develop advanced AI, enforced through a global moratorium by its signatory members on all other advanced AI development. MAGIC would be exclusive, safety-focused, highly secure, and collectively supported by member states, with benefits distributed equitably among signatories. MAGIC would allow narrow AI models to flourish while significantly reducing the possibility of misaligned, rogue, breakout, or runaway outcomes of general-purpose systems. We do not address the political feasibility of implementing a moratorium or address the specific legislative strategies and rules needed to enforce a ban on high-capacity AGI training runs. Instead, we propose one positive vision of the future, where MAGIC, as a global governance regime, can lay the groundwork for long-term, safe regulation of advanced AI.

SiamAF: Learning Shared Information from ECG and PPG Signals for Robust Atrial Fibrillation Detection

  • paper_url: http://arxiv.org/abs/2310.09203
  • repo_url: https://github.com/chengstark/siamaf
  • paper_authors: Zhicheng Guo, Cheng Ding, Duc H. Do, Amit Shah, Randall J. Lee, Xiao Hu, Cynthia Rudin
  • for: 心房颤动与中风、心力衰竭等不良临床结局相关,本文旨在探索基于可穿戴设备的被动式(passive)房颤监测技术,以降低相关风险
  • methods: 提出了一种新的Siamese网络架构和共同学习损失函数,协助学习ECG和PPG信号之间的共同信息
  • results: 在三个外部测试集上,所提模型能够准确预测房颤,且在多种场景下表现更优;同时只需更少的标注数据即可达到与传统训练方式相当的性能,有望减少未来对人工标注的依赖。
    Abstract Atrial fibrillation (AF) is the most common type of cardiac arrhythmia. It is associated with an increased risk of stroke, heart failure, and other cardiovascular complications, but can be clinically silent. Passive AF monitoring with wearables may help reduce adverse clinical outcomes related to AF. Detecting AF in noisy wearable data poses a significant challenge, leading to the emergence of various deep learning techniques. Previous deep learning models learn from a single modality, either electrocardiogram (ECG) or photoplethysmography (PPG) signals. However, deep learning models often struggle to learn generalizable features and rely on features that are more susceptible to corruption from noise, leading to sub-optimal performances in certain scenarios, especially with low-quality signals. Given the increasing availability of ECG and PPG signal pairs from wearables and bedside monitors, we propose a new approach, SiamAF, leveraging a novel Siamese network architecture and joint learning loss function to learn shared information from both ECG and PPG signals. At inference time, the proposed model is able to predict AF from either PPG or ECG and outperforms baseline methods on three external test sets. It learns medically relevant features as a result of our novel architecture design. The proposed model also achieves comparable performance to traditional learning regimes while requiring much fewer training labels, providing a potential approach to reduce future reliance on manual labeling.
    摘要 心房颤动(AF)是最常见的心律失常类型,与中风、心力衰竭及其他心血管并发症的风险升高相关,但在临床上可能没有症状。利用可穿戴设备进行被动式房颤监测有望减少与房颤相关的不良临床结局。然而,在噪声较大的可穿戴数据中检测房颤颇具挑战,由此催生了多种深度学习方法。以往的深度学习模型通常只从单一模态学习,即心电图(ECG)或光电容积脉搏波(PPG)信号;这类模型往往难以学到可泛化的特征,且依赖于更易受噪声破坏的特征,导致在低质量信号等场景下表现欠佳。鉴于可穿戴设备和床旁监护仪提供的 ECG-PPG 信号对日益增多,我们提出了一种新方法 SiamAF,利用新颖的孪生网络架构和联合学习损失函数,学习 ECG 与 PPG 信号之间的共享信息。在推理时,所提模型可以仅凭 PPG 或 ECG 预测房颤,并在三个外部测试集上优于基线方法;得益于新的架构设计,它学习到了具有医学意义的特征。此外,该模型在仅需远少于传统训练方式的标注数据的情况下取得了相当的性能,为减少未来对人工标注的依赖提供了一种可能的途径。
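
A minimal PyTorch sketch of a Siamese setup over ECG/PPG pairs follows; the encoder architecture, the agreement term standing in for the joint learning loss, and all hyperparameters are assumptions for illustration rather than SiamAF's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchEncoder(nn.Module):
    """Tiny 1-D CNN encoder; one branch per signal modality (assumed architecture)."""
    def __init__(self, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(32, emb_dim)

    def forward(self, x):                 # x: (batch, 1, length)
        h = self.net(x).squeeze(-1)       # (batch, 32)
        return self.proj(h)               # (batch, emb_dim)

class SiameseAF(nn.Module):
    def __init__(self, emb_dim=64, n_classes=2):
        super().__init__()
        self.ecg_enc = BranchEncoder(emb_dim)
        self.ppg_enc = BranchEncoder(emb_dim)
        self.head = nn.Linear(emb_dim, n_classes)   # shared head: predict AF from either branch

    def forward(self, ecg, ppg):
        z_ecg, z_ppg = self.ecg_enc(ecg), self.ppg_enc(ppg)
        return z_ecg, z_ppg, self.head(z_ecg), self.head(z_ppg)

def joint_loss(z_ecg, z_ppg, logits_ecg, logits_ppg, labels, agree_weight=1.0):
    # Classification from both branches plus an agreement term pulling the two
    # embeddings of the same recording together (a stand-in for the shared-information loss).
    cls = F.cross_entropy(logits_ecg, labels) + F.cross_entropy(logits_ppg, labels)
    agree = (1.0 - F.cosine_similarity(z_ecg, z_ppg)).mean()
    return cls + agree_weight * agree

model = SiameseAF()
ecg, ppg = torch.randn(8, 1, 1000), torch.randn(8, 1, 1000)   # fake paired windows
labels = torch.randint(0, 2, (8,))
loss = joint_loss(*model(ecg, ppg), labels)
loss.backward()
```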

Tikuna: An Ethereum Blockchain Network Security Monitoring System

  • paper_url: http://arxiv.org/abs/2310.09193
  • repo_url: None
  • paper_authors: Andres Gomez Ramirez, Loui Al Sardy, Francis Gomez Ramirez
  • for: 本研究旨在保护区块链的最低层,即P2P网络层,以防止许多种攻击,如分布式拒绝服务攻击(DDoS)、 Eclipse 攻击和 Sybil 攻击。
  • methods: 本研究使用了一种无监督的Long Short-Term Memory(LSTM)方法,基于Recurrent Neural Network(RNN)来检测攻击并警示用户。
  • results: 实验结果表明,提议的方法可以有效地检测和分类攻击,包括 Eclipse 攻击、 Covert Flash 攻击和其他攻击,具有高度准确性。
    Abstract Blockchain security is becoming increasingly relevant in today's cyberspace as it extends its influence in many industries. This paper focuses on protecting the lowest level layer in the blockchain, particularly the P2P network that allows the nodes to communicate and share information. The P2P network layer may be vulnerable to several families of attacks, such as Distributed Denial of Service (DDoS), eclipse attacks, or Sybil attacks. This layer is prone to threats inherited from traditional P2P networks, and it must be analyzed and understood by collecting data and extracting insights from the network behavior to reduce those risks. We introduce Tikuna, an open-source tool for monitoring and detecting potential attacks on the Ethereum blockchain P2P network, at an early stage. Tikuna employs an unsupervised Long Short-Term Memory (LSTM) method based on Recurrent Neural Network (RNN) to detect attacks and alert users. Empirical results indicate that the proposed approach significantly improves detection performance, with the ability to detect and classify attacks, including eclipse attacks, Covert Flash attacks, and others that target the Ethereum blockchain P2P network layer, with high accuracy. Our research findings demonstrate that Tikuna is a valuable security tool for assisting operators to efficiently monitor and safeguard the status of Ethereum validators and the wider P2P network
    摘要 区块链安全在今天的网络空间变得越来越重要,它在多个领域扮演着重要的角色。本文关注保护区块链的最低层级,即点对点网络层,该层可能受到多种攻击,如分布式拒绝服务(DDoS)、 Eclipse 攻击和 Sybil 攻击。这层面受到传统点对点网络中的威胁,需要分析和理解网络行为以降低风险。我们介绍了 Tikuna,一个开源的监控和检测区块链 P2P 网络攻击的工具,可以在早期发现攻击。Tikuna 使用无监督的 Long Short-Term Memory(LSTM)方法基于 Recurrent Neural Network(RNN)来检测攻击并警示用户。实验结果表明,我们的方法可以准确地检测和分类攻击,包括 Eclipse 攻击、 Covert Flash 攻击和其他targeting Ethereum 区块链 P2P 网络层的攻击,并且具有高精度。我们的研究发现表明,Tikuna 是一种有价值的安全工具,可以帮助操作员有效地监控和保护 Ethereum 验证人和更广泛的 P2P 网络。
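
As a generic illustration of the unsupervised LSTM idea, the sketch below trains a small LSTM autoencoder on windows of per-timestep network features and flags windows whose reconstruction error exceeds a threshold; the feature layout, architecture, and threshold rule are assumptions, not Tikuna's actual configuration.

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    """Reconstructs a window of per-timestep network features; high error = anomaly."""
    def __init__(self, n_features=8, hidden=32):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):                          # x: (batch, time, n_features)
        _, (h, _) = self.encoder(x)                # h: (1, batch, hidden)
        latent = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)   # repeat code along time
        dec, _ = self.decoder(latent)
        return self.out(dec)

def anomaly_scores(model, windows):
    with torch.no_grad():
        recon = model(windows)
        return ((recon - windows) ** 2).mean(dim=(1, 2))      # per-window error

model = LSTMAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
normal_traffic = torch.randn(64, 30, 8)            # hypothetical benign traffic windows
for _ in range(5):                                 # brief unsupervised training
    opt.zero_grad()
    loss = ((model(normal_traffic) - normal_traffic) ** 2).mean()
    loss.backward()
    opt.step()

scores = anomaly_scores(model, torch.randn(10, 30, 8))
threshold = scores.mean() + 3 * scores.std()       # simple thresholding rule (assumed)
print((scores > threshold).tolist())
```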

Does Graph Distillation See Like Vision Dataset Counterpart?

  • paper_url: http://arxiv.org/abs/2310.09192
  • repo_url: https://github.com/suchun-sv/sgdd
  • paper_authors: Beining Yang, Kai Wang, Qingyun Sun, Cheng Ji, Xingcheng Fu, Hao Tang, Yang You, Jianxin Li
  • for: 这篇论文主要是为了提高大规模图像的训练成本和储存问题,并且探索原始图像结构信息的影响。
  • methods: 本论文提出了一个名为Structure-broadcasting Graph Dataset Distillation(SGDD)的新方法,它可以将原始图像结构信息转散到生成的实验图像中,以避免遗传原始图像结构信息的问题。
  • results: 本论文透过实验证明了SGDD的可行性和必要性,并且在9个测试 dataset 上 achieved state-of-the-art 的结果,例如在 YelpChi 测试 dataset 上,我们的方法可以保持98.6%的训练测试准确率,并且实现了1,000倍的图像减少。此外,我们还证明了SGDD 可以将 LED 差值降低17.6% ~ 31.4%。
    Abstract Training on large-scale graphs has achieved remarkable results in graph representation learning, but its cost and storage have attracted increasing concerns. Existing graph condensation methods primarily focus on optimizing the feature matrices of condensed graphs while overlooking the impact of the structure information from the original graphs. To investigate the impact of the structure information, we conduct analysis from the spectral domain and empirically identify substantial Laplacian Energy Distribution (LED) shifts in previous works. Such shifts lead to poor performance in cross-architecture generalization and specific tasks, including anomaly detection and link prediction. In this paper, we propose a novel Structure-broadcasting Graph Dataset Distillation (SGDD) scheme for broadcasting the original structure information to the generation of the synthetic one, which explicitly prevents overlooking the original structure information. Theoretically, the synthetic graphs by SGDD are expected to have smaller LED shifts than previous works, leading to superior performance in both cross-architecture settings and specific tasks. We validate the proposed SGDD across 9 datasets and achieve state-of-the-art results on all of them: for example, on the YelpChi dataset, our approach maintains 98.6% test accuracy of training on the original graph dataset with 1,000 times saving on the scale of the graph. Moreover, we empirically evaluate there exist 17.6% ~ 31.4% reductions in LED shift crossing 9 datasets. Extensive experiments and analysis verify the effectiveness and necessity of the proposed designs. The code is available in the GitHub repository: https://github.com/RingBDStack/SGDD.
    摘要 大规模图学习训练已经取得了很好的成果,但是其成本和存储空间受到了越来越多的关注。现有的图压缩方法主要是优化压缩图的特征矩阵,而忽略了原始图的结构信息的影响。为了调查结构信息的影响,我们从 спектраль频谱领域进行分析,并观察到了先前的作品中的很大的 Laplacian Energy Distribution(LED)shift。这些shift导致了跨建筑物和特定任务的表现不佳,包括异常检测和链接预测。在这篇论文中,我们提出了一种新的结构广播图 dataset distillation(SGDD)方案,用于将原始结构信息广播到生成的 sintetic 图中,以避免忽略原始结构信息。理论上,由SGDD生成的 sintetic 图将有更小的 LED shift,导致跨建筑物和特定任务的表现更佳。我们在 9 个数据集上验证了我们的方法,并在所有数据集上达到了状态的最佳结果:例如,在 YelpChi 数据集上,我们的方法保持了训练在原始图数据集上的 98.6% 测试准确率,并且在 1,000 倍缩放的图数据集上实现了 1,000 倍的缩放。此外,我们也进行了实验和分析,证明了我们的方法的有效性和必要性。代码可以在 GitHub 上找到:https://github.com/RingBDStack/SGDD。
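
The Laplacian Energy Distribution (LED) shift discussed above can be inspected with a few lines of NumPy: compute the Laplacian eigenvalue spectrum of the original and the condensed graph, normalize each spectrum into a distribution, and compare the two. The histogram-based distance below is an assumption for illustration; the paper's exact LED-shift metric may differ.

```python
import numpy as np

def laplacian_spectrum(adj):
    """Eigenvalues of the (combinatorial) graph Laplacian L = D - A."""
    deg = np.diag(adj.sum(axis=1))
    return np.linalg.eigvalsh(deg - adj)

def led_shift(adj_orig, adj_synth, n_bins=20):
    """Compare the normalized Laplacian-eigenvalue energy distributions of two graphs."""
    ev_o, ev_s = laplacian_spectrum(adj_orig), laplacian_spectrum(adj_synth)
    bins = np.linspace(0.0, max(ev_o.max(), ev_s.max()), n_bins + 1)
    p = np.histogram(ev_o, bins=bins)[0].astype(float)
    q = np.histogram(ev_s, bins=bins)[0].astype(float)
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * np.abs(p - q).sum()              # total-variation-style gap in [0, 1]

rng = np.random.default_rng(0)
A = (rng.random((100, 100)) < 0.05).astype(float)
A = np.triu(A, 1); A = A + A.T                    # random undirected "original" graph
B = (rng.random((20, 20)) < 0.2).astype(float)
B = np.triu(B, 1); B = B + B.T                    # small "condensed" graph
print("LED shift:", led_shift(A, B))
```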

PRIOR: Personalized Prior for Reactivating the Information Overlooked in Federated Learning

  • paper_url: http://arxiv.org/abs/2310.09183
  • repo_url: https://github.com/bdemo/pfedbred_public
  • paper_authors: Mingjia Shi, Yuhao Zhou, Kai Wang, Huaizheng Zhang, Shudong Huang, Qing Ye, Jiangcheng Lv
  • for: 提高个人化 Federated Learning(PFL)的性能,解决各种数据特点导致的模型衰退问题。
  • methods: 提出一种基于Bregman divergence的个人化优先知识注入方法(pFedBreD),具有更好的个人化适应性和可选的策略。
  • results: 实验表明,提出的方法可以达到现场状态的性能,比其他方法高出3.5%以上,并且经过广泛的分析证明了方法的 Robustness 和必要性。
    Abstract Classical federated learning (FL) enables training machine learning models without sharing data for privacy preservation, but heterogeneous data characteristic degrades the performance of the localized model. Personalized FL (PFL) addresses this by synthesizing personalized models from a global model via training on local data. Such a global model may overlook the specific information that the clients have been sampled. In this paper, we propose a novel scheme to inject personalized prior knowledge into the global model in each client, which attempts to mitigate the introduced incomplete information problem in PFL. At the heart of our proposed approach is a framework, the PFL with Bregman Divergence (pFedBreD), decoupling the personalized prior from the local objective function regularized by Bregman divergence for greater adaptability in personalized scenarios. We also relax the mirror descent (RMD) to extract the prior explicitly to provide optional strategies. Additionally, our pFedBreD is backed up by a convergence analysis. Sufficient experiments demonstrate that our method reaches the state-of-the-art performances on 5 datasets and outperforms other methods by up to 3.5% across 8 benchmarks. Extensive analyses verify the robustness and necessity of proposed designs.
    摘要 传统的联合学习(FL)可以帮助学习机器学习模型无需分享数据,以保护隐私,但是各种数据特点会导致本地模型的性能下降。个性化联合学习(PFL)解决了这个问题,通过将本地数据用于个性化模型的训练来生成个性化模型。然而,这种全球模型可能会忽略客户端上的特定信息。在这篇论文中,我们提出了一种新的方法,将个性化先验知识注入到全球模型中,以降低在PFL中引入的不完整信息问题。我们的提议方法基于一个框架,即PFL with Bregman Divergence(pFedBreD),它将个性化先验与本地对象函数正则化的Bregman divergence分离开来,以提高在个性化场景中的适应性。此外,我们还将反向投影(RMD)放松到提取先验,以提供可选的策略。此外,我们的pFedBreD还得到了收敛分析。我们的实验表明,我们的方法可以在5个数据集上达到领先的性能,并且在8个标准准则上超过其他方法的3.5%。广泛的分析也证明了我们的设计的稳定性和必要性。
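
The idea of decoupling a personalized prior from the local objective via a Bregman divergence can be illustrated with its simplest instance, the squared Euclidean distance (the Bregman divergence generated by 0.5*||w||^2). The sketch below is a generic proximal-style local update under that assumption, not pFedBreD's actual prior extraction or relaxed-mirror-descent machinery.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def local_update(model, prior_params, data, labels, mu=0.1, lr=0.05, steps=20):
    """One client's local training: task loss + Bregman-style divergence to a personalized prior.

    prior_params: tensors with the same shapes as model.parameters(); here the divergence
    is 0.5 * ||w - prior||^2 (the Bregman divergence generated by 0.5 * ||.||^2).
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        task_loss = F.cross_entropy(model(data), labels)
        breg = sum(0.5 * ((w - p.detach()) ** 2).sum()
                   for w, p in zip(model.parameters(), prior_params))
        (task_loss + mu * breg).backward()
        opt.step()
    return model

torch.manual_seed(0)
model = nn.Linear(10, 3)
prior = [p.clone() for p in model.parameters()]      # e.g. a prior derived from the global model
x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))
local_update(model, prior, x, y)
```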

mnmDTW: An extension to Dynamic Time Warping for Camera-based Movement Error Localization

  • paper_url: http://arxiv.org/abs/2310.09170
  • repo_url: None
  • paper_authors: Sebastian Dill, Maurice Rohr
  • for: 这个论文用Computer Vision(CV)方法提取了运动视频中的姿态信息,并使用修改后的DTW计算器来评估运动的准确性。
  • methods: 这个论文使用了CV方法提取姿态信息,并使用修改后的DTW计算器来评估运动的准确性。
  • results: 这个论文可以清晰地显示运动中的错误,并且可以准确地定位错误的位置和时间。
    Abstract In this proof of concept, we use Computer Vision (CV) methods to extract pose information out of exercise videos. We then employ a modified version of Dynamic Time Warping (DTW) to calculate the deviation from a gold standard execution of the exercise. Specifically, we calculate the distance between each body part individually to get a more precise measure for exercise accuracy. We can show that exercise mistakes are clearly visible, identifiable and localizable through this metric.
    摘要 在这个概念验证中,我们使用计算机视觉(CV)方法从锻炼视频中提取姿态信息,然后使用改进的动态时间规整(DTW)计算其与标准(gold standard)执行之间的偏差。具体而言,我们分别计算每个身体部位的距离,从而获得更精确的动作准确性度量。结果表明,借助该指标,动作错误能够被清晰地呈现、识别和定位。
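
The per-body-part idea is straightforward to make concrete: run DTW separately on each joint's trajectory against the reference execution and report one deviation per joint, which is what localizes the error. The DTW below is a plain textbook implementation and the joint list is hypothetical; it only sketches the modified-DTW evaluation described above.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two (T, d) trajectories."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def per_joint_deviation(user_pose, reference_pose, joint_names):
    """user_pose, reference_pose: arrays (T, n_joints, 2 or 3) of keypoint coordinates.
    Returns a DTW deviation per joint, which localizes where the execution differs."""
    return {name: dtw_distance(user_pose[:, j], reference_pose[:, j])
            for j, name in enumerate(joint_names)}

rng = np.random.default_rng(0)
joints = ["left_elbow", "right_elbow", "left_knee", "right_knee"]   # hypothetical subset
reference = rng.normal(size=(100, len(joints), 2))
user = reference + rng.normal(scale=0.05, size=reference.shape)
user[:, 2] += 0.5                                                   # inject an error at the left knee
print(per_joint_deviation(user, reference, joints))
```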

Quantum Machine Learning in Climate Change and Sustainability: a Review

  • paper_url: http://arxiv.org/abs/2310.09162
  • repo_url: None
  • paper_authors: Amal Nammouchi, Andreas Kassler, Andreas Theorachis
  • for: 这个论文的目的是探讨用量子机器学习方法解决气候变化和可持续发展的问题。
  • methods: 论文评论了已有的量子机器学习方法,包括能源系统启动、气候数据预测、气候监测和危险事件预测。
  • results: 论文提出了量子机器学习方法的挑战和未来工作,以便更好地利用这些方法在气候变化研究中。
    Abstract Climate change and its impact on global sustainability are critical challenges, demanding innovative solutions that combine cutting-edge technologies and scientific insights. Quantum machine learning (QML) has emerged as a promising paradigm that harnesses the power of quantum computing to address complex problems in various domains including climate change and sustainability. In this work, we survey existing literature that applies quantum machine learning to solve climate change and sustainability-related problems. We review promising QML methodologies that have the potential to accelerate decarbonization including energy systems, climate data forecasting, climate monitoring, and hazardous events predictions. We discuss the challenges and current limitations of quantum machine learning approaches and provide an overview of potential opportunities and future work to leverage QML-based methods in the important area of climate change research.
    摘要 气候变化及其对全球可持续发展的影响是亟待解决的重大挑战,需要将前沿技术与科学洞见相结合的创新方案。量子机器学习(QML)作为一种借助量子计算解决复杂问题的有前景范式,已在包括气候变化与可持续发展在内的多个领域展现出新的思路。本文综述了应用量子机器学习解决气候变化和可持续发展相关问题的已有文献,评述了有望加速脱碳的 QML 方法,涵盖能源系统、气候数据预测、气候监测以及危险事件预测等方向;并讨论了量子机器学习方法当前面临的挑战与局限,概述了未来在气候变化研究这一重要领域中利用基于 QML 的方法的机会与后续工作。

Learning To Teach Large Language Models Logical Reasoning

  • paper_url: http://arxiv.org/abs/2310.09158
  • repo_url: https://github.com/chenmeiqii/teach-llm-lr
  • paper_authors: Meiqi Chen, Yubo Ma, Kaitao Song, Yixin Cao, Yan Zhang, Dongsheng Li
  • for: 本研究旨在系统地探讨大型自然语言模型(LLMs)在逻辑推理中的能力,以解决现有LLMs在实际理性任务中输出不可靠内容的问题。
  • methods: 本研究采用了多种方法来探讨LLMs的逻辑推理能力,包括事件关系EXTRACTION和推理逻辑。我们的研究显示,LLMs在解决需要严格逻辑推理的任务时存在问题,并产生了不符合逻辑的答案,需要 iterative refinement。
  • results: 我们的研究发现,通过不同的策略可以启用LLMs的逻辑推理能力,并且可以生成更符合逻辑的答案。此外,我们还提供了一个合成数据集(LLM-LR),用于评估和预训练LLMs。广泛的量化和质量分析也证明了我们的方法的有效性和必要性,并为未来使用LLMs解决实际任务提供了洞察。
    Abstract Large language models (LLMs) have gained enormous attention from both academia and industry, due to their exceptional ability in language generation and extremely powerful generalization. However, current LLMs still output unreliable content in practical reasoning tasks due to their inherent issues (e.g., hallucination). To better disentangle this problem, in this paper, we conduct an in-depth investigation to systematically explore the capability of LLMs in logical reasoning. More in detail, we first investigate the deficiency of LLMs in logical reasoning on different tasks, including event relation extraction and deductive reasoning. Our study demonstrates that LLMs are not good reasoners in solving tasks with rigorous reasoning and will produce counterfactual answers, which require us to iteratively refine. Therefore, we comprehensively explore different strategies to endow LLMs with logical reasoning ability, and thus enable them to generate more logically consistent answers across different scenarios. Based on our approach, we also contribute a synthesized dataset (LLM-LR) involving multi-hop reasoning for evaluation and pre-training. Extensive quantitative and qualitative analyses on different tasks also validate the effectiveness and necessity of teaching LLMs with logic and provide insights for solving practical tasks with LLMs in future work.

Lincoln AI Computing Survey (LAICS) Update

  • paper_url: http://arxiv.org/abs/2310.09145
  • repo_url: None
  • paper_authors: Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, Jeremy Kepner
  • for: 本文是 Lincoln AI Computing Survey(LAICS)的四年度更新,它收集和总结了过去四年内公开宣布的商业加速器。
  • methods: 本文使用 scatter graph plot 来展示性能和能耗值的趋势,并分析了一些维度和观察结果。
  • results: 本文发现了一些市场 segments,并提供了每个新加速器的简短描述。
    Abstract This paper is an update of the survey of AI accelerators and processors from past four years, which is now called the Lincoln AI Computing Survey - LAICS (pronounced "lace"). As in past years, this paper collects and summarizes the current commercial accelerators that have been publicly announced with peak performance and peak power consumption numbers. The performance and power values are plotted on a scatter graph, and a number of dimensions and observations from the trends on this plot are again discussed and analyzed. Market segments are highlighted on the scatter plot, and zoomed plots of each segment are also included. Finally, a brief description of each of the new accelerators that have been added in the survey this year is included.
    摘要 这份报告是过去四年的AI加速器和处理器的调查更新,现更名为林肯AI计算调查(LAICS,发音为“lace”)。与过去年度一样,这份报告收集和总结了公共宣布的最高性能和最高电力消耗数据的商业加速器。性能和电力值Plot在散点图中,并从图形上的趋势和特征进行了讨论和分析。市场分 segment在散点图上高亮,并包括每个分 segment的缩放图。最后,报告还包括这年新增加的加速器的简要描述。

The Consensus Game: Language Model Generation via Equilibrium Search

  • paper_url: http://arxiv.org/abs/2310.09139
  • repo_url: None
  • paper_authors: Athul Paul Jacob, Yikang Shen, Gabriele Farina, Jacob Andreas
  • for: 提高语言模型(LM)的预测准确性和一致性,解决LM在不同查询和评分方法下的矛盾问题。
  • methods: 提出一种基于博弈论的、无需训练的语言模型解码算法:将解码建模为一个正则化的不完全信息序贯信号博弈(CONSENSUS GAME),生成器通过自然语言句子向判别器传递一个抽象的正确性参数,并通过求解该博弈的近似均衡来协调二者。
  • results: 应用于多个任务(包括阅读理解、常识理解、数学问题解决和对话),EQUILIBRIUM-RANKING算法可以持续性地提高LM的表现,并在一些任务上超越较大的LLaMA-65B和PaLM-540B模型。这些结果表明了游戏理论工具在LM中的潜力。
    Abstract When applied to question answering and other text generation tasks, language models (LMs) may be queried generatively (by sampling answers from their output distribution) or discriminatively (by using them to score or rank a set of candidate outputs). These procedures sometimes yield very different predictions. How do we reconcile mutually incompatible scoring procedures to obtain coherent LM predictions? We introduce a new, a training-free, game-theoretic procedure for language model decoding. Our approach casts language model decoding as a regularized imperfect-information sequential signaling game - which we term the CONSENSUS GAME - in which a GENERATOR seeks to communicate an abstract correctness parameter using natural language sentences to a DISCRIMINATOR. We develop computational procedures for finding approximate equilibria of this game, resulting in a decoding algorithm we call EQUILIBRIUM-RANKING. Applied to a large number of tasks (including reading comprehension, commonsense reasoning, mathematical problem-solving, and dialog), EQUILIBRIUM-RANKING consistently, and sometimes substantially, improves performance over existing LM decoding procedures - on multiple benchmarks, we observe that applying EQUILIBRIUM-RANKING to LLaMA-7B outperforms the much larger LLaMA-65B and PaLM-540B models. These results highlight the promise of game-theoretic tools for addressing fundamental challenges of truthfulness and consistency in LMs.
    摘要 在问答及其他文本生成任务中,语言模型(LM)既可以被生成式地查询(从其输出分布中采样答案),也可以被判别式地查询(用它为一组候选输出打分或排序)。这两种做法有时会给出截然不同的预测。如何调和这些互不兼容的评分方式,从而得到一致的语言模型预测?我们提出了一种新的、无需训练的博弈论语言模型解码方法:将语言模型解码建模为一个正则化的不完全信息序贯信号博弈,我们称之为 CONSENSUS GAME。在该博弈中,生成器试图通过自然语言句子向判别器传达一个抽象的正确性参数。我们给出了求解该博弈近似均衡的计算过程,由此得到名为 EQUILIBRIUM-RANKING 的解码算法。在大量任务(包括阅读理解、常识推理、数学问题求解和对话)上,EQUILIBRIUM-RANKING 相比现有的语言模型解码方法持续地、有时显著地提升了性能;在多个基准上,将 EQUILIBRIUM-RANKING 应用于 LLaMA-7B 甚至超过了规模大得多的 LLaMA-65B 和 PaLM-540B。这些结果凸显了博弈论工具在解决语言模型真实性与一致性等根本挑战方面的潜力。

HierarchicalContrast: A Coarse-to-Fine Contrastive Learning Framework for Cross-Domain Zero-Shot Slot Filling

  • paper_url: http://arxiv.org/abs/2310.09135
  • repo_url: https://github.com/ai-agi/hicl
  • paper_authors: Junwen Zhang, Yin Zhang
  • for: 这个论文的目的是提出一种基于层次对比学习的零shot槽填方法,以提高这种方法在未知目标领域中的泛化能力。
  • methods: 这个方法使用了一种层次对比学习的 Gaussian-distributed embedding,以学习utterance-token之间的普适深度 semantics关系。
  • results: 实验表明,提出的方法在四个数据集上实现了与当前领域内的最佳性能,或者甚至超越了现有的零shot槽填方法。
    Abstract In task-oriented dialogue scenarios, cross-domain zero-shot slot filling plays a vital role in leveraging source domain knowledge to learn a model with high generalization ability in unknown target domain where annotated data is unavailable. However, the existing state-of-the-art zero-shot slot filling methods have limited generalization ability in target domain, they only show effective knowledge transfer on seen slots and perform poorly on unseen slots. To alleviate this issue, we present a novel Hierarchical Contrastive Learning Framework (HiCL) for zero-shot slot filling. Specifically, we propose a coarse- to fine-grained contrastive learning based on Gaussian-distributed embedding to learn the generalized deep semantic relations between utterance-tokens, by optimizing inter- and intra-token distribution distance. This encourages HiCL to generalize to the slot types unseen at training phase. Furthermore, we present a new iterative label set semantics inference method to unbiasedly and separately evaluate the performance of unseen slot types which entangled with their counterparts (i.e., seen slot types) in the previous zero-shot slot filling evaluation methods. The extensive empirical experiments on four datasets demonstrate that the proposed method achieves comparable or even better performance than the current state-of-the-art zero-shot slot filling approaches.
    摘要 在任务型对话场景中,跨领域零样本槽填充在利用源领域知识、学习对无标注目标领域具有强泛化能力的模型方面发挥着重要作用。然而,现有最先进的零样本槽填充方法在目标领域的泛化能力有限:它们只能在已见过的槽类型上实现有效的知识迁移,而在未见过的槽类型上表现不佳。为了解决这一问题,我们提出了一种新的层次对比学习框架(HiCL)用于零样本槽填充。具体来说,我们提出了一种基于高斯分布嵌入的由粗到细的对比学习方法,通过优化 token 间与 token 内的分布距离,学习话语 token 之间可泛化的深层语义关系,从而鼓励 HiCL 泛化到训练阶段未见过的槽类型。此外,我们提出了一种新的迭代标签集语义推断方法,用于无偏、独立地评估那些在以往零样本槽填充评估方式中与已见槽类型相互纠缠的未见槽类型的性能。在四个数据集上的大量实验表明,所提方法取得了与当前最先进零样本槽填充方法相当甚至更好的性能。

Split-and-Denoise: Protect large language model inference with local differential privacy

  • paper_url: http://arxiv.org/abs/2310.09130
  • repo_url: None
  • paper_authors: Peihua Mai, Ran Yan, Zhe Huang, Youjia Yang, Yan Pang
  • for: 这篇研究旨在提供一个可以保护用户隐私的方法来使用大型自然语言模型 (LLM),以便在不同的下游任务中使用嵌入。
  • methods: 这篇研究提出了一个名为 Split-N-Denoise (SnD) 的框架,它可以在客户端执行token嵌入层,并将随机误差引入到嵌入中以防止隐私泄露。
  • results: 实验结果显示,SnD 能够在多种 LLM 架构和不同下游任务上优化隐私与效用之间的权衡;在相同的隐私预算下,SnD 相比基线取得了显著更好的性能,为用户提供了一种本地隐私保护的解决方案。
    Abstract Large Language Models (LLMs) shows powerful capability in natural language understanding by capturing hidden semantics in vector space. This process enriches the value of the text embeddings for various downstream tasks, thereby fostering the Embedding-as-a-Service (EaaS) business model. However, the direct transmission of text to servers poses a largely unaddressed risk of privacy leakage. To mitigate this issue, we introduce Split-N-Denoise (SnD), an innovative framework that split the model to execute the token embedding layer on the client side at minimal computational cost. This allows the client to introduce noise prior to transmitting the embeddings to the server, and subsequently receive and denoise the perturbed output embeddings for downstream tasks. Our approach is designed for the inference stage of LLMs and requires no modifications to the model parameters. Extensive experiments demonstrate SnD's effectiveness in optimizing the privacy-utility tradeoff across various LLM architectures and diverse downstream tasks. The results reveal a significant performance improvement under the same privacy budget compared to the baseline, offering clients a privacy-preserving solution for local privacy protection.
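
The client-side perturbation step can be illustrated as follows: token embeddings are computed locally, clipped, and perturbed with calibrated Gaussian noise before anything leaves the client. The clipping and noise calibration below follow the standard Gaussian mechanism under assumed parameters; SnD's exact noise mechanism and its server-side denoising step are not reproduced here.

```python
import numpy as np

def privatize_embeddings(embeddings, clip_norm=1.0, epsilon=2.0, delta=1e-5, seed=0):
    """Clip each token embedding to L2 norm <= clip_norm and add Gaussian noise
    with the standard Gaussian-mechanism sigma (illustrative parameters only)."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(embeddings, axis=-1, keepdims=True)
    clipped = embeddings * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return clipped + rng.normal(scale=sigma, size=clipped.shape)

token_embeddings = np.random.default_rng(1).normal(size=(12, 768))   # hypothetical client-side embeddings
noisy = privatize_embeddings(token_embeddings)
print(noisy.shape)   # the noisy embeddings, not the raw text, are what leaves the client
```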

Timestamp-supervised Wearable-based Activity Segmentation and Recognition with Contrastive Learning and Order-Preserving Optimal Transport

  • paper_url: http://arxiv.org/abs/2310.09114
  • repo_url: None
  • paper_authors: Songpengcheng Xia, Lei Chu, Ling Pei, Jiarui Yang, Wenxian Yu, Robert C. Qiu
  • for: 本研究旨在提出一种基于深度学习的同时进行人体活动 segmentation和识别的方法,以解决现有的多类窗口问题。
  • methods: 该方法采用时间戳监督与深度学习方法同时进行人体活动识别和时间序列分割,每个活动段仅需一个标注样本来监督模型学习。
  • results: 对四个公开的人体活动数据集进行了广泛的实验,结果显示该方法优于现有最先进的弱监督方法,并取得与完全监督方法相近的性能。
    Abstract Human activity recognition (HAR) with wearables is one of the serviceable technologies in ubiquitous and mobile computing applications. The sliding-window scheme is widely adopted while suffering from the multi-class windows problem. As a result, there is a growing focus on joint segmentation and recognition with deep-learning methods, aiming at simultaneously dealing with HAR and time-series segmentation issues. However, obtaining the full activity annotations of wearable data sequences is resource-intensive or time-consuming, while unsupervised methods yield poor performance. To address these challenges, we propose a novel method for joint activity segmentation and recognition with timestamp supervision, in which only a single annotated sample is needed in each activity segment. However, the limited information of sparse annotations exacerbates the gap between recognition and segmentation tasks, leading to sub-optimal model performance. Therefore, the prototypes are estimated by class-activation maps to form a sample-to-prototype contrast module for well-structured embeddings. Moreover, with the optimal transport theory, our approach generates the sample-level pseudo-labels that take advantage of unlabeled data between timestamp annotations for further performance improvement. Comprehensive experiments on four public HAR datasets demonstrate that our model trained with timestamp supervision is superior to the state-of-the-art weakly-supervised methods and achieves comparable performance to the fully-supervised approaches.
    摘要 基于可穿戴设备的人体活动识别(HAR)是泛在计算与移动计算应用中的实用技术之一。滑动窗口方案被广泛采用,但受到多类窗口问题的困扰。为此,人们越来越关注基于深度学习的联合分割与识别方法,以同时处理HAR与时间序列分割问题。然而,获取可穿戴数据序列的完整活动标注耗费资源且费时,而无监督方法性能较差。为了解决这些挑战,我们提出了一种基于时间戳监督的联合活动分割与识别方法,每个活动段仅需一个标注样本。然而,稀疏标注所含的信息有限,会加大识别与分割任务之间的差距,导致模型性能欠佳。因此,我们通过类激活图估计原型,构建样本到原型的对比模块以获得结构良好的嵌入;并借助最优传输理论,利用时间戳标注之间的未标注数据生成样本级伪标签,进一步提升性能。在四个公开HAR数据集上的全面实验表明,在时间戳监督下训练的模型优于现有最先进的弱监督方法,并取得与完全监督方法相当的性能。

GLoRE: Evaluating Logical Reasoning of Large Language Models

  • paper_url: http://arxiv.org/abs/2310.09107
  • repo_url: https://github.com/csitfun/glore
  • paper_authors: Hanmeng liu, Zhiyang Teng, Ruoxi Ning, Jian Liu, Qiji Zhou, Yue Zhang
  • for: 评估大语言模型(LLM)的逻辑推理能力
  • methods: 使用由12个数据集组成、涵盖三种任务类型的General Logical Reasoning Evaluation基准进行评估
  • results: 与人类和监督微调的性能相比,开放LLM模型的逻辑推理能力仍需进一步提高;ChatGPT和GPT-4在逻辑推理方面表现出色,其中GPT-4明显优于ChatGPT。
    Abstract Recently, large language models (LLMs), including notable models such as GPT-4 and burgeoning community models, have showcased significant general language understanding abilities. However, there has been a scarcity of attempts to assess the logical reasoning capacities of these LLMs, an essential facet of natural language understanding. To encourage further investigation in this area, we introduce GLoRE, a meticulously assembled General Logical Reasoning Evaluation benchmark comprised of 12 datasets that span three different types of tasks. Our experimental results show that compared to the performance of human and supervised fine-tuning, the logical reasoning capabilities of open LLM models necessitate additional improvement; ChatGPT and GPT-4 show a strong capability of logical reasoning, with GPT-4 surpassing ChatGPT by a large margin. We propose a self-consistency probing method to enhance the accuracy of ChatGPT and a fine-tuned method to boost the performance of an open LLM. We release the datasets and evaluation programs to facilitate future research.
    摘要 近来,大型语言模型(LLM),包括知名的GPT-4和迅速涌现的社区模型,已经展示出了很强的通用语言理解能力。然而,对这些LLM的逻辑推理能力的评估却相对缺乏,而这是自然语言理解的一个重要方面。为促进这一领域的研究,我们提出了GLoRE,一个精心搜集的通用逻辑推理评估基准,包括12个数据集,涵盖三种不同的任务类型。我们的实验结果显示,与人类和监督微调的性能相比,开放LLM模型的逻辑推理能力仍需进一步改进;ChatGPT和GPT-4表现出了强大的逻辑推理能力,其中GPT-4明显优于ChatGPT。我们提议使用自我一致性探测方法来提高ChatGPT的准确性,以及使用微调方法来提高开放LLM模型的表现。我们将发布数据集和评估程序,以便未来的研究。
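The self-consistency probing idea can be illustrated with a few lines of Python: sample several answers to the same reasoning question and keep the majority vote. The `sample_answer` stub below stands in for a real chat-model call; no actual API is invoked.

```python
# Minimal sketch of self-consistency probing: sample several answers for the same
# logical-reasoning question and keep the majority vote.
from collections import Counter
import random

def sample_answer(question: str, rng: random.Random) -> str:
    """Stub: pretend the model answers A/B/C/D with some noise."""
    return rng.choices(["A", "B", "C", "D"], weights=[0.6, 0.2, 0.1, 0.1])[0]

def self_consistent_answer(question: str, num_samples: int = 9, seed: int = 0) -> str:
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng) for _ in range(num_samples))
    answer, _count = votes.most_common(1)[0]
    return answer

print(self_consistent_answer("If all X are Y and no Y are Z, can an X be a Z?"))
```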

Privacy-Preserving Encrypted Low-Dose CT Denoising

  • paper_url: http://arxiv.org/abs/2310.09101
  • repo_url: None
  • paper_authors: Ziyuan Yang, Huijie Huangfu, Maosong Ran, Zhiwen Wang, Hui Yu, Yi Zhang
  • for: 这篇论文的目的是提出一种在加密域中进行低剂量CT影像去噪的方法,以保护用户的医疗资料隐私。
  • methods: 本论文使用同态加密技术加密私有的低剂量CT数据,然后将其传送到以明文数据训练的服务器模型进行去噪。由于卷积、线性变换等传统深度学习操作无法直接在加密域中使用,论文将明文域中的基本数学操作转换为加密域中的操作。
  • results: 实验结果显示,所提方法能够在保护数据隐私的同时避免服务器模型泄露。此外,论文还提供了两个交互框架(分别针对线性与非线性模型),两者皆可实现无损运算,并给出了无损性的理论证明。
    Abstract Deep learning (DL) has made significant advancements in tomographic imaging, particularly in low-dose computed tomography (LDCT) denoising. A recent trend involves servers training powerful models with large amounts of self-collected private data and providing application programming interfaces (APIs) for users, such as Chat-GPT. To avoid model leakage, users are required to upload their data to the server model, but this way raises public concerns about the potential risk of privacy disclosure, especially for medical data. Hence, to alleviate related concerns, in this paper, we propose to directly denoise LDCT in the encrypted domain to achieve privacy-preserving cloud services without exposing private data to the server. To this end, we employ homomorphic encryption to encrypt private LDCT data, which is then transferred to the server model trained with plaintext LDCT for further denoising. However, since traditional operations, such as convolution and linear transformation, in DL methods cannot be directly used in the encrypted domain, we transform the fundamental mathematic operations in the plaintext domain into the operations in the encrypted domain. In addition, we present two interactive frameworks for linear and nonlinear models in this paper, both of which can achieve lossless operating. In this way, the proposed methods can achieve two merits, the data privacy is well protected and the server model is free from the risk of model leakage. Moreover, we provide theoretical proof to validate the lossless property of our framework. Finally, experiments were conducted to demonstrate that the transferred contents are well protected and cannot be reconstructed. The code will be released once the paper is accepted.
    摘要 深度学习(DL)在断层成像领域取得了重要进展,特别是在低剂量计算机断层扫描(LDCT)去噪方面。当前的趋势是,服务器使用自行收集的大量私有数据训练强大的模型,并为用户提供应用程序编程接口(API),如Chat-GPT。为了避免模型泄露,用户需要将数据上传到服务器模型,但这会引起公众对隐私泄露风险的关注,特别是医疗数据。因此,在本文中,我们提议直接在加密域中对LDCT进行去噪,以实现隐私保护的云服务,而不向服务器暴露私有数据。为此,我们使用同态加密来加密私有LDCT数据,然后将其传输到以明文LDCT训练的服务器模型进行进一步去噪。但是,由于传统DL方法中的操作(如卷积和线性变换)不能直接在加密域中使用,我们将明文域中的基本数学操作转换为加密域中的操作。此外,我们在本文中提出了两种交互式框架,分别针对线性模型与非线性模型,两者都可以实现无损运算。这样,所提方法具有两点优点:一是数据隐私得到良好保护,二是服务器模型免于泄露风险。此外,我们还提供了理论证明,以验证框架的无损性。最后,实验表明传输的内容得到了良好保护且无法被重构。论文被接收后将发布代码。

BaitBuster-Bangla: A Comprehensive Dataset for Clickbait Detection in Bangla with Multi-Feature and Multi-Modal Analysis

  • paper_url: http://arxiv.org/abs/2310.11465
  • repo_url: https://github.com/abdalimran/BaitBuster-Bangla
  • paper_authors: Abdullah Al Imran, Md Sakib Hossain Shovon, M. F. Mridha
  • For: This paper is written for researchers in natural language processing and data science who are interested in advancing the modeling of clickbait phenomena in low-resource languages, specifically Bangla.
  • Methods: The paper uses an automated process to collect a large multi-modal Bangla YouTube clickbait dataset, consisting of 253,070 data points from 58 Bangla YouTube channels. The dataset includes 18 diverse features categorized into metadata, primary content, engagement statistics, and labels for individual videos. The authors apply a rigorous preprocessing step to denoise, deduplicate, and remove bias from the features.
  • Results: The paper presents the largest and most robust clickbait corpus in Bangla to date, providing significant value for researchers seeking to advance the modeling of clickbait phenomena in low-resource languages. The multi-modal nature of the dataset allows for comprehensive analyses of clickbait across content, user interactions, and linguistic dimensions, enabling the development of more sophisticated detection methods with cross-linguistic applications.
    Abstract This study presents a large multi-modal Bangla YouTube clickbait dataset consisting of 253,070 data points collected through an automated process using the YouTube API and Python web automation frameworks. The dataset contains 18 diverse features categorized into metadata, primary content, engagement statistics, and labels for individual videos from 58 Bangla YouTube channels. A rigorous preprocessing step has been applied to denoise, deduplicate, and remove bias from the features, ensuring unbiased and reliable analysis. As the largest and most robust clickbait corpus in Bangla to date, this dataset provides significant value for natural language processing and data science researchers seeking to advance modeling of clickbait phenomena in low-resource languages. Its multi-modal nature allows for comprehensive analyses of clickbait across content, user interactions, and linguistic dimensions to develop more sophisticated detection methods with cross-linguistic applications.
    摘要 这项研究提供了一个大型多模态孟加拉语 YouTube 标题党(clickbait)数据集,包含 253,070 个数据点,通过 YouTube API 和 Python 网络自动化框架的自动化流程收集。数据集包含 18 种多样的特征,分为元数据、主要内容、互动统计和标签,来自 58 个孟加拉语 YouTube 频道。经过严格的预处理步骤以去噪、去重并消除偏差,保证了分析的可靠与无偏。作为目前孟加拉语中规模最大、最可靠的标题党语料库,该数据集为希望在低资源语言中推进标题党现象建模的自然语言处理与数据科学研究人员提供了重要价值。数据集的多模态特性允许从内容、用户互动和语言等维度对标题党进行全面分析,以开发更复杂的检测方法,并支持跨语言应用。
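A small pandas sketch of the kind of preprocessing described (deduplication, missing-value removal, light text normalization, dropping uninformative features) is shown below; the column names and toy rows are placeholders, not the dataset's actual schema.

```python
# Minimal preprocessing sketch: denoise, deduplicate, and drop near-constant columns.
import pandas as pd

df = pd.DataFrame({
    "video_id": ["a1", "a1", "b2", "c3"],
    "title": ["CLICK NOW!!", "CLICK NOW!!", "Normal title", None],
    "views": [1000, 1000, 250, 40],
    "clickbait": [1, 1, 0, 0],
})

df = df.drop_duplicates(subset="video_id")           # remove duplicated videos
df = df.dropna(subset=["title"])                      # drop rows with missing text
df["title"] = (df["title"].str.lower()                # light text normalization
                          .str.replace(r"[^\w\s]", " ", regex=True)
                          .str.strip())
low_variance = [c for c in df.columns if df[c].nunique() <= 1]
df = df.drop(columns=low_variance)                    # drop uninformative features
print(df)
```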

Insightful analysis of historical sources at scales beyond human capabilities using unsupervised Machine Learning and XAI

  • paper_url: http://arxiv.org/abs/2310.09091
  • repo_url: None
  • paper_authors: Oliver Eberle, Jochen Büttner, Hassan El-Hajj, Grégoire Montavon, Klaus-Robert Müller, Matteo Valleriani
  • for: 这篇论文旨在利用人工智能技术对历史材料进行深入分析,以探讨知识演化和传播的历史发展。
  • methods: 该研究使用创新的机器学习技术对历史材料进行分析,以获取对 mathematical astronomy 领域知识演化和创新的深入理解。
  • results: 研究发现,在欧洲大学教学中使用的 astronomy 教科书在15世纪至17世纪期间经历了重要的发展和变革,这些变革反映了当时科学和技术的进步。
    Abstract Historical materials are abundant. Yet, piecing together how human knowledge has evolved and spread both diachronically and synchronically remains a challenge that can so far only be very selectively addressed. The vast volume of materials precludes comprehensive studies, given the restricted number of human specialists. However, as large amounts of historical materials are now available in digital form there is a promising opportunity for AI-assisted historical analysis. In this work, we take a pivotal step towards analyzing vast historical corpora by employing innovative machine learning (ML) techniques, enabling in-depth historical insights on a grand scale. Our study centers on the evolution of knowledge within the `Sacrobosco Collection' -- a digitized collection of 359 early modern printed editions of textbooks on astronomy used at European universities between 1472 and 1650 -- roughly 76,000 pages, many of which contain astronomic, computational tables. An ML based analysis of these tables helps to unveil important facets of the spatio-temporal evolution of knowledge and innovation in the field of mathematical astronomy in the period, as taught at European universities.
    摘要 历史资料十分丰富,但要拼凑出人类知识在历时与共时两个维度上如何演化和传播,至今仍是一项只能被非常有选择性地解决的挑战。材料数量庞大,而人类专家有限,全面研究难以开展。然而,如今大量历史材料已经以数字形式提供,这为人工智能辅助的历史分析提供了一个有前景的机会。在这项工作中,我们采用创新的机器学习(ML)技术,迈出了大规模分析庞大历史语料的关键一步,从而在宏观尺度上获得深入的历史认识。我们的研究聚焦于《Sacrobosco Collection》——包含1472年至1650年间欧洲大学使用的359部早期现代印刷天文学教科书的数字化藏品——约76000页,其中许多页面包含天文与计算表格。对这些表格的ML分析有助于揭示这一时期欧洲大学所教授的数学天文学领域中知识与创新在时空上演化的重要方面。

Dialect Transfer for Swiss German Speech Translation

  • paper_url: http://arxiv.org/abs/2310.09088
  • repo_url: None
  • paper_authors: Claudio Paonessa, Yanick Schraner, Jan Deriu, Manuela Hürlimann, Manfred Vogel, Mark Cieliebak
  • for: 这个研究探讨瑞士德语自然语言处理系统的建构困难,具体关注瑞士德语方言的多样性和标准德语之间的差异对瑞士德语自然语言处理系统的性能产生的影响。
  • methods: 该研究使用了多种方法,包括语音识别和翻译模型的训练,以及对不同方言和标准德语之间的语言差异进行分析。
  • results: 研究发现,包括方言在训练中的影响和标准德语和瑞士德语之间的语言差异都会对瑞士德语自然语言处理系统的性能产生负面的影响,这与语言学理论预测相符。
    Abstract This paper investigates the challenges in building Swiss German speech translation systems, specifically focusing on the impact of dialect diversity and differences between Swiss German and Standard German. Swiss German is a spoken language with no formal writing system, it comprises many diverse dialects and is a low-resource language with only around 5 million speakers. The study is guided by two key research questions: how does the inclusion and exclusion of dialects during the training of speech translation models for Swiss German impact the performance on specific dialects, and how do the differences between Swiss German and Standard German impact the performance of the systems? We show that dialect diversity and linguistic differences pose significant challenges to Swiss German speech translation, which is in line with linguistic hypotheses derived from empirical investigations.
    摘要 本文研究构建瑞士德语语音翻译系统所面临的挑战,重点关注方言多样性以及瑞士德语与标准德语之间的差异对系统性能的影响。瑞士德语是一种没有正式书写系统的口语,包含许多不同的方言,且是一种仅有约500万使用者的低资源语言。研究围绕两个关键问题展开:在训练瑞士德语语音翻译模型时纳入或排除某些方言会如何影响特定方言上的性能,以及瑞士德语与标准德语之间的差异会如何影响系统性能。我们的结果表明,方言多样性和语言差异给瑞士德语语音翻译带来了重大挑战,这与基于实证研究的语言学假设相一致。

A ML-LLM pairing for better code comment classification

  • paper_url: http://arxiv.org/abs/2310.10275
  • repo_url: None
  • paper_authors: Hanna Abi Akl
  • for: 这篇论文主要是为了解决代码注释分类问题,即将代码片段与相关的注释进行评估,以确定其是否有助于理解相关代码。
  • methods: 这篇论文使用了经典机器学习系统与大型语言模型(LLM)相结合的方法来评估代码注释分类器的性能。
  • results: 这篇论文的最佳模型(一个神经网络)在提供的种子数据上达到了macro-F1分数为88.401%,并在使用LLM生成的数据上实现了1.5%的性能提升。
    Abstract The "Information Retrieval in Software Engineering (IRSE)" at FIRE 2023 shared task introduces code comment classification, a challenging task that pairs a code snippet with a comment that should be evaluated as either useful or not useful to the understanding of the relevant code. We answer the code comment classification shared task challenge by providing a two-fold evaluation: from an algorithmic perspective, we compare the performance of classical machine learning systems and complement our evaluations from a data-driven perspective by generating additional data with the help of large language model (LLM) prompting to measure the potential increase in performance. Our best model, which took second place in the shared task, is a Neural Network with a Macro-F1 score of 88.401% on the provided seed data and a 1.5% overall increase in performance on the data generated by the LLM.
    摘要 FIRE 2023 的"软件工程中的信息检索(IRSE)"共享任务引入了代码注释分类:这是一项具有挑战性的任务,需要判断与代码片段配对的注释对理解相关代码是否有用。我们从两个角度回应这一共享任务挑战:在算法层面,比较了传统机器学习系统的性能;在数据层面,借助大型语言模型(LLM)提示生成额外数据,以衡量潜在的性能提升。我们的最佳模型(一个神经网络,在共享任务中获得第二名)在提供的种子数据上取得了88.401%的macro-F1分数,并在LLM生成的数据上实现了1.5%的总体性能提升。
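A minimal scikit-learn baseline in the spirit of the classical-ML side of this evaluation might look like the sketch below. The inline examples, including the "LLM-generated" extras, are invented for illustration.

```python
# Minimal sketch: a classical code-comment-usefulness classifier evaluated with macro-F1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

seed_texts = [
    "int add(int a,int b) // adds two numbers and returns the sum",
    "int add(int a,int b) // add",
    "void save(File f) // persists the record to disk before closing the handle",
    "void save(File f) // save file",
]
seed_labels = [1, 0, 1, 0]            # 1 = useful comment, 0 = not useful

# Extra examples that, in the paper's setting, would come from LLM prompting.
augmented_texts = ["double area(double r) // returns pi * r * r for a circle of radius r"]
augmented_labels = [1]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(seed_texts + augmented_texts, seed_labels + augmented_labels)

preds = clf.predict(seed_texts)
print("macro-F1:", f1_score(seed_labels, preds, average="macro"))
```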

ImageManip: Image-based Robotic Manipulation with Affordance-guided Next View Selection

  • paper_url: http://arxiv.org/abs/2310.09069
  • repo_url: None
  • paper_authors: Xiaoqi Li, Yanzi Wang, Yan Shen, Ponomarenko Iaroslav, Haoran Lu, Qianxu Wang, Boshi An, Jiaming Liu, Hao Dong
  • for: 这个论文的目的是为了解决未来家庭助手机器人中3D各种物体抓取操作的问题,以便让机器人与环境进行交互。
  • methods: 这个论文使用了一种新的图像基于的机器人操作框架,这个框架可以捕捉目标物体的多个视角,并使用这些视角来推算物体的深度信息。
  • results: 与之前使用点云或RGB图像作为输入的方法相比,这个方法更有效率和实用。在实际世界实验中,这个方法也表现出了优秀的实际应用潜力。
    Abstract In the realm of future home-assistant robots, 3D articulated object manipulation is essential for enabling robots to interact with their environment. Many existing studies make use of 3D point clouds as the primary input for manipulation policies. However, this approach encounters challenges due to data sparsity and the significant cost associated with acquiring point cloud data, which can limit its practicality. In contrast, RGB images offer high-resolution observations using cost effective devices but lack spatial 3D geometric information. To overcome these limitations, we present a novel image-based robotic manipulation framework. This framework is designed to capture multiple perspectives of the target object and infer depth information to complement its geometry. Initially, the system employs an eye-on-hand RGB camera to capture an overall view of the target object. It predicts the initial depth map and a coarse affordance map. The affordance map indicates actionable areas on the object and serves as a constraint for selecting subsequent viewpoints. Based on the global visual prior, we adaptively identify the optimal next viewpoint for a detailed observation of the potential manipulation success area. We leverage geometric consistency to fuse the views, resulting in a refined depth map and a more precise affordance map for robot manipulation decisions. By comparing with prior works that adopt point clouds or RGB images as inputs, we demonstrate the effectiveness and practicality of our method. In the project webpage (https://sites.google.com/view/imagemanip), real world experiments further highlight the potential of our method for practical deployment.
    摘要 在未来家庭助手机器人领域,3D铰接物体操作对于让机器人与环境交互至关重要。许多现有研究利用3D点云作为操作策略的主要输入,但这种方法会遇到数据稀疏和获取点云数据成本高的问题,限制其实用性。相比之下,RGB图像可以用低成本设备提供高分辨率的观察,但缺乏空间3D几何信息。为了克服这些限制,我们提出了一种新的基于图像的机器人操作框架。该框架旨在捕捉目标物体的多个视角,并推断深度信息以补充其几何。首先,系统使用手眼(eye-on-hand)RGB摄像头捕捉目标物体的整体视图,预测初始深度图和一个粗略的可操作性(affordance)图。可操作性图表示物体上可执行操作的区域,并作为选择后续视点的约束。基于全局视觉先验,我们自适应地选择最佳的下一视点,以对潜在的操作成功区域进行细致观察。我们利用几何一致性来融合各视图,从而获得更精确的深度图和可操作性图,用于机器人操作决策。与以点云或RGB图像作为输入的先前工作相比,我们展示了该方法的有效性和实用性。在项目页面(https://sites.google.com/view/imagemanip)上,真实世界实验进一步突显了该方法的实际部署潜力。

DATT: Deep Adaptive Trajectory Tracking for Quadrotor Control

  • paper_url: http://arxiv.org/abs/2310.09053
  • repo_url: https://github.com/kevinhuang8/datt
  • paper_authors: Kevin Huang, Rwik Rana, Alexander Spitzer, Guanya Shi, Byron Boots
  • for: 精准控制四旋翼执行复杂的轨迹,尤其是在实际世界中遇到大规模干扰时。
  • methods: DATT 使用基于学习的 feedforward-feedback-adaptive 控制结构,在模拟中通过强化学习训练。在真实硬件上,DATT 额外加入 L1 自适应控制进行闭环扰动估计,无需任何微调。
  • results: DATT 在真实世界中显著优于其他自适应非线性控制器和模型预测控制器,包括一些基线完全失败的高难度场景。此外,DATT 的在线推理时间小于 3.2 ms,不到自适应非线性模型预测控制基线的 1/4。
    Abstract Precise arbitrary trajectory tracking for quadrotors is challenging due to unknown nonlinear dynamics, trajectory infeasibility, and actuation limits. To tackle these challenges, we present Deep Adaptive Trajectory Tracking (DATT), a learning-based approach that can precisely track arbitrary, potentially infeasible trajectories in the presence of large disturbances in the real world. DATT builds on a novel feedforward-feedback-adaptive control structure trained in simulation using reinforcement learning. When deployed on real hardware, DATT is augmented with a disturbance estimator using L1 adaptive control in closed-loop, without any fine-tuning. DATT significantly outperforms competitive adaptive nonlinear and model predictive controllers for both feasible smooth and infeasible trajectories in unsteady wind fields, including challenging scenarios where baselines completely fail. Moreover, DATT can efficiently run online with an inference time less than 3.2 ms, less than 1/4 of the adaptive nonlinear model predictive control baseline
    摘要 由于未知的非线性动力学、轨迹不可行性和执行器限制,四旋翼的精确任意轨迹跟踪是一个挑战。为了应对这些挑战,我们提出了 Deep Adaptive Trajectory Tracking(DATT),一种基于学习的方法,能够在真实世界的大扰动下精确跟踪任意且可能不可行的轨迹。DATT 基于一种新颖的前馈-反馈-自适应控制结构,在仿真中通过强化学习训练。部署到真实硬件上时,DATT 采用闭环 L1 自适应控制进行扰动估计,无需任何微调。在非定常风场中,无论是可行的平滑轨迹还是不可行轨迹,DATT 都显著优于其他自适应非线性控制器和模型预测控制器,包括基线完全失败的高难度场景。此外,DATT 可以高效地在线运行,推理时间小于 3.2 ms,不到自适应非线性模型预测控制基线的 1/4。
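As a rough intuition for the feedforward-feedback-adaptive structure, here is a toy 1-D tracking loop with a simple first-order disturbance estimator. The double-integrator dynamics, gains, and adaptation law are assumptions made for the sketch; the paper's controller is learned with reinforcement learning and uses L1 adaptive control on real hardware.

```python
# Toy sketch of a feedforward-feedback loop with an adaptive disturbance estimate.
# The 1-D double-integrator dynamics, gains, and first-order estimator are illustrative.
import numpy as np

dt, steps = 0.02, 500
kp, kd, k_adapt = 25.0, 8.0, 30.0
pos, vel, d_hat = 0.0, 0.0, 0.0                    # state and disturbance estimate

def reference(t):                                  # desired trajectory (feedforward source)
    return np.sin(t), np.cos(t), -np.sin(t)        # pos, vel, acc

errs = []
for i in range(steps):
    t = i * dt
    p_ref, v_ref, a_ref = reference(t)
    disturbance = 1.5 + 0.5 * np.sin(3 * t)        # unknown wind-like force
    # feedforward + feedback + adaptive compensation
    u = a_ref + kp * (p_ref - pos) + kd * (v_ref - vel) - d_hat
    vel_pred = vel + u * dt                        # what the nominal model predicts
    acc = u + disturbance                          # true plant
    pos, vel = pos + vel * dt, vel + acc * dt
    d_hat += k_adapt * (vel - vel_pred)            # adapt on the velocity prediction error
    errs.append(abs(p_ref - pos))

print("mean tracking error:", np.mean(errs))
```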

SAI: Solving AI Tasks with Systematic Artificial Intelligence in Communication Network

  • paper_url: http://arxiv.org/abs/2310.09049
  • repo_url: None
  • paper_authors: Lei Yao, Yong Zhang, Zilong Yan, Jialu Tian
  • For: solving complex AI tasks in intelligent mobile networks, such as network optimization and resource allocation.
  • Methods: leverages Large Language Models (LLMs) and JSON-format intent-based input to connect a self-designed model library and database, and uses model cards to pairwise match different modules for model composition.
  • Results: achieves impressive results in completing numerous complex AI tasks in the communication network, leveraging the language capabilities of LLMs and the abundant AI models in the model library.
    Abstract In the rapid development of artificial intelligence, solving complex AI tasks is a crucial technology in intelligent mobile networks. Despite the good performance of specialized AI models in intelligent mobile networks, they are unable to handle complicated AI tasks. To address this challenge, we propose Systematic Artificial Intelligence (SAI), which is a framework designed to solve AI tasks by leveraging Large Language Models (LLMs) and JSON-format intent-based input to connect self-designed model library and database. Specifically, we first design a multi-input component, which simultaneously integrates Large Language Models (LLMs) and JSON-format intent-based inputs to fulfill the diverse intent requirements of different users. In addition, we introduce a model library module based on model cards which employ model cards to pairwise match between different modules for model composition. Model cards contain the corresponding model's name and the required performance metrics. Then when receiving user network requirements, we execute each subtask for multiple selected model combinations and provide output based on the execution results and LLM feedback. By leveraging the language capabilities of LLMs and the abundant AI models in the model library, SAI can complete numerous complex AI tasks in the communication network, achieving impressive results in network optimization, resource allocation, and other challenging tasks.
    摘要 在人工智能快速发展的背景下,解决复杂的AI任务是智能移动网络中的关键技术。尽管专用AI模型在智能移动网络中表现良好,但它们无法处理复杂的AI任务。为解决这一挑战,我们提出了系统化人工智能(SAI),这是一个借助大语言模型(LLM)和JSON格式意图输入、连接自建模型库与数据库来解决AI任务的框架。具体而言,我们首先设计了一个多输入组件,同时整合LLM与JSON格式意图输入,以满足不同用户的多样化意图需求。此外,我们引入了基于模型卡的模型库模块,利用模型卡在不同模块之间进行两两匹配以完成模型组合;模型卡包含相应模型的名称和所需的性能指标。在接收到用户的网络需求后,我们针对多个被选中的模型组合执行各子任务,并根据执行结果和LLM反馈给出输出。借助LLM的语言能力和模型库中丰富的AI模型,SAI能够完成通信网络中的大量复杂AI任务,在网络优化、资源分配等挑战性任务中取得出色的结果。
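The model-card matching step can be pictured with a small sketch: parse a JSON intent and keep the library entries whose cards satisfy the task and metric requirements. The intent schema and card fields below are invented placeholders, not the framework's actual format.

```python
# Minimal sketch of intent parsing and model-card matching.
import json

model_library = [
    {"name": "traffic-forecaster-v2", "task": "traffic_prediction", "metrics": {"mae": 3.1}},
    {"name": "rb-allocator-small",    "task": "resource_allocation", "metrics": {"latency_ms": 12}},
    {"name": "rb-allocator-large",    "task": "resource_allocation", "metrics": {"latency_ms": 45}},
]

intent = json.loads("""
{
  "task": "resource_allocation",
  "requirements": {"latency_ms": {"max": 20}}
}
""")

def card_matches(card, intent):
    if card["task"] != intent["task"]:
        return False
    for metric, bound in intent.get("requirements", {}).items():
        value = card["metrics"].get(metric)
        if value is None:
            return False
        if "max" in bound and value > bound["max"]:
            return False
        if "min" in bound and value < bound["min"]:
            return False
    return True

selected = [c["name"] for c in model_library if card_matches(c, intent)]
print("candidate models:", selected)     # -> ['rb-allocator-small']
```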

KCTS: Knowledge-Constrained Tree Search Decoding with Token-Level Hallucination Detection

  • paper_url: http://arxiv.org/abs/2310.09044
  • repo_url: https://github.com/hkust-knowcomp/knowledge-constrained-decoding
  • paper_authors: Sehyun Choi, Tianqing Fang, Zhaowei Wang, Yangqiu Song
  • for: 降低大型自然语言模型(LLM)发布时的假信息风险。
  • methods: 使用知识约束树搜索(KCTS)方法,在每个解码步骤利用知识分类器得分和蒙特卡洛树搜索(MCTS)引导冻结的语言模型生成与参考知识一致的文本。
  • results: 在基于知识的对话和抽象摘要任务上,KCTS 显示出了减少幻觉的能力,并且可以作为一种即插即用、与模型无关的解码方法。
    Abstract Large Language Models (LLMs) have demonstrated remarkable human-level natural language generation capabilities. However, their potential to generate misinformation, often called the hallucination problem, poses a significant risk to their deployment. A common approach to address this issue is to retrieve relevant knowledge and fine-tune the LLM with the knowledge in its input. Unfortunately, this method incurs high training costs and may cause catastrophic forgetting for multi-tasking models. To overcome these limitations, we propose a knowledge-constrained decoding method called KCTS (Knowledge-Constrained Tree Search), which guides a frozen LM to generate text aligned with the reference knowledge at each decoding step using a knowledge classifier score and MCTS (Monte-Carlo Tree Search). To adapt the sequence-level knowledge classifier to token-level guidance, we also propose a novel token-level hallucination detection method called RIPA (Reward Inflection Point Approximation). Our empirical results on knowledge-grounded dialogue and abstractive summarization demonstrate the strength of KCTS as a plug-and-play, model-agnostic decoding method that can effectively reduce hallucinations in natural language generation.
    摘要 大型语言模型(LLM)已经展示了接近人类水平的自然语言生成能力。然而,它们可能生成错误信息(即幻觉问题),给其部署带来了重大风险。解决这一问题的常见方法是检索相关知识,并用该知识对LLM进行微调。然而,这种方法训练成本高,并且可能导致多任务模型的灾难性遗忘。为了克服这些限制,我们提出了一种知识约束解码方法,称为KCTS(知识约束树搜索):它利用知识分类器得分和MCTS(蒙特卡洛树搜索),引导冻结的语言模型在每个解码步骤生成与参考知识一致的文本。为了将序列级知识分类器适配为token级引导,我们还提出了一种新的token级幻觉检测方法,称为RIPA(奖励拐点近似)。我们在基于知识的对话和抽象摘要上的实验结果表明,KCTS 可以作为一种即插即用、与模型无关的解码方法,有效减少自然语言生成中的幻觉。
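A much-simplified sketch of knowledge-constrained generation is shown below: candidate continuations are re-scored by LM log-probability plus a knowledge-faithfulness term. The paper uses MCTS guided by a trained token-level classifier (with RIPA for token-level detection); here a lexical-overlap scorer and stubbed log-probabilities stand in purely for illustration.

```python
# Simplified sketch: score each candidate continuation by LM log-probability plus a
# knowledge-faithfulness term, then keep the best. Scores below are stubs.
knowledge = "The Eiffel Tower was completed in 1889 and is located in Paris."

candidates = [
    ("The tower was finished in 1889 in Paris.",  -8.1),   # (text, stub LM log-prob)
    ("The tower was finished in 1920 in London.", -7.6),
    ("It is a famous landmark.",                  -6.9),
]

def knowledge_score(text: str, knowledge: str) -> float:
    """Cheap faithfulness proxy: fraction of candidate tokens supported by the knowledge."""
    cand = set(text.lower().replace(".", "").split())
    ref = set(knowledge.lower().replace(".", "").split())
    return len(cand & ref) / max(len(cand), 1)

def combined_score(text, logprob, alpha=5.0):
    return logprob + alpha * knowledge_score(text, knowledge)

best = max(candidates, key=lambda c: combined_score(*c))
print("selected:", best[0])
```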

Optimal Scheduling of Electric Vehicle Charging with Deep Reinforcement Learning considering End Users Flexibility

  • paper_url: http://arxiv.org/abs/2310.09040
  • repo_url: None
  • paper_authors: Christoforos Menos-Aikateriniadis, Stavros Sykiotis, Pavlos S. Georgilakis
  • for: 这篇论文的目的是为了找出家庭的电动车费用优化充电策略,以减少家庭用电费用。
  • methods: 这篇论文使用深度强化学习(具体为深度Q网络,DQN)来学习家庭的充电策略,并通过历史数据分析来推断用户的用电灵活性潜力。
  • results: 根据这篇论文的结果,使用DQN得到的家庭充电策略可以为用户节省超过20%的电费。
    Abstract The rapid growth of decentralized energy resources and especially Electric Vehicles (EV), that are expected to increase sharply over the next decade, will put further stress on existing power distribution networks, increasing the need for higher system reliability and flexibility. In an attempt to avoid unnecessary network investments and to increase the controllability over distribution networks, network operators develop demand response (DR) programs that incentivize end users to shift their consumption in return for financial or other benefits. Artificial intelligence (AI) methods are in the research forefront for residential load scheduling applications, mainly due to their high accuracy, high computational speed and lower dependence on the physical characteristics of the models under development. The aim of this work is to identify households' EV cost-reducing charging policy under a Time-of-Use tariff scheme, with the use of Deep Reinforcement Learning, and more specifically Deep Q-Networks (DQN). A novel end users flexibility potential reward is inferred from historical data analysis, where households with solar power generation have been used to train and test the designed algorithm. The suggested DQN EV charging policy can lead to more than 20% of savings in end users electricity bills.
    摘要 随着分布式能源资源、尤其是电动汽车(EV)的快速增长(预计未来十年将大幅增加),现有配电网络将承受更大的压力,从而提高了对系统可靠性和灵活性的要求。为了避免不必要的电网投资并增强对配电网络的可控性,电网运营商制定了各种需求响应(DR)计划,激励终端用户调整用电,以换取经济或其他收益。人工智能(AI)方法在家庭负荷调度应用中处于研究前沿,主要因为其准确率高、计算速度快,且对所建模型的物理特性依赖较低。本文的目的是在分时电价(Time-of-Use)方案下,利用深度强化学习(具体为深度Q网络,DQN)确定可降低家庭EV充电成本的策略。我们通过历史数据分析推断出一种新的用户灵活性潜力奖励,并使用带有光伏发电的家庭数据来训练和测试所设计的算法。所提出的DQN EV充电策略可以为终端用户节省超过20%的电费。
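A toy DQN sketch of time-of-use charging scheduling is given below. The tariff, battery requirement, horizon, penalty, and network size are invented values; a realistic setup would also model the user flexibility inferred from historical data.

```python
# Simplified sketch: a DQN agent that schedules overnight EV charging under a
# time-of-use tariff. All numbers are illustrative toy values.
import random
import torch
import torch.nn as nn

PRICES = [0.30, 0.28, 0.15, 0.10, 0.10, 0.12, 0.25, 0.30]   # $/kWh per hour slot
HOURS, NEED_KWH, RATE_KWH = len(PRICES), 4.0, 1.0            # must deliver 4 kWh

def step(hour, delivered, action):
    """action: 1 = charge this hour, 0 = idle. Returns next state, reward, done."""
    cost = PRICES[hour] * RATE_KWH if action else 0.0
    delivered = delivered + RATE_KWH * action
    hour += 1
    done = hour == HOURS
    reward = -cost
    if done and delivered < NEED_KWH:
        reward -= 2.0 * (NEED_KWH - delivered)                # penalty for unmet demand
    return (hour, delivered), reward, done

qnet = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(qnet.parameters(), lr=1e-2)

def encode(state):
    hour, delivered = state
    return torch.tensor([hour / HOURS, delivered / NEED_KWH], dtype=torch.float32)

for episode in range(300):
    state, done, eps = (0, 0.0), False, max(0.05, 1.0 - episode / 200)
    while not done:
        q = qnet(encode(state))
        action = random.randrange(2) if random.random() < eps else int(q.argmax())
        next_state, reward, done = step(*state, action)
        with torch.no_grad():
            target = reward if done else reward + 0.99 * qnet(encode(next_state)).max().item()
        loss = (qnet(encode(state))[action] - target) ** 2
        opt.zero_grad(); loss.backward(); opt.step()
        state = next_state

# Greedy rollout after training: ideally charging concentrates on the cheapest hours.
state, done, plan = (0, 0.0), False, []
while not done:
    action = int(qnet(encode(state)).argmax())
    plan.append(action)
    state, _, done = step(*state, action)
print("charging plan per hour:", plan)
```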

Subspace Adaptation Prior for Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2310.09028
  • repo_url: https://github.com/mikehuisman/subspace-adaptation-prior
  • paper_authors: Mike Huisman, Aske Plaat, Jan N. van Rijn
  • for: 这篇论文的目的是提出一种新的元学习算法,以提高少样本学习的效率和稳定性。
  • methods: 这篇论文使用了基于梯度的元学习方法,并将各层参数划分为多个可适应的子空间,以适应不同的任务分布。
  • results: 这篇论文在少样本图像分类中获得了更优或相当的表现(准确率提升0.1%至3.9%),并且通过分析学到的子空间发现,低维操作往往产生较高的激活强度,这可能是取得良好少样本学习表现的关键。
    Abstract Gradient-based meta-learning techniques aim to distill useful prior knowledge from a set of training tasks such that new tasks can be learned more efficiently with gradient descent. While these methods have achieved successes in various scenarios, they commonly adapt all parameters of trainable layers when learning new tasks. This neglects potentially more efficient learning strategies for a given task distribution and may be susceptible to overfitting, especially in few-shot learning where tasks must be learned from a limited number of examples. To address these issues, we propose Subspace Adaptation Prior (SAP), a novel gradient-based meta-learning algorithm that jointly learns good initialization parameters (prior knowledge) and layer-wise parameter subspaces in the form of operation subsets that should be adaptable. In this way, SAP can learn which operation subsets to adjust with gradient descent based on the underlying task distribution, simultaneously decreasing the risk of overfitting when learning new tasks. We demonstrate that this ability is helpful as SAP yields superior or competitive performance in few-shot image classification settings (gains between 0.1% and 3.9% in accuracy). Analysis of the learned subspaces demonstrates that low-dimensional operations often yield high activation strengths, indicating that they may be important for achieving good few-shot learning performance. For reproducibility purposes, we publish all our research code publicly.
    摘要 基于梯度的元学习技术旨在从一组训练任务中提炼有用的先验知识,以便用梯度下降更高效地学习新任务。然而,这些方法在学习新任务时通常会调整可训练层的全部参数,这可能忽略了针对给定任务分布更高效的学习策略,并且容易过拟合,尤其是在必须从少量示例中学习任务的少样本学习场景中。为解决这些问题,我们提出了子空间适应先验(SAP),一种新的基于梯度的元学习算法,它同时学习良好的初始化参数(先验知识)和以操作子集形式表示的层级参数子空间,并确定哪些操作子集应当被适应。这样,SAP 可以根据底层任务分布学习哪些操作子集需要通过梯度下降进行调整,同时降低学习新任务时的过拟合风险。我们的实验表明,这种能力是有益的:SAP 在少样本图像分类设置中取得了更优或相当的性能(准确率提升0.1%至3.9%)。对所学子空间的分析表明,低维操作往往产生较高的激活强度,这说明它们可能对取得良好的少样本学习性能很重要。为了保证可复现性,我们公开发布了全部研究代码。

A Spatial-Temporal Dual-Mode Mixed Flow Network for Panoramic Video Salient Object Detection

  • paper_url: http://arxiv.org/abs/2310.09016
  • repo_url: None
  • paper_authors: Xiaolei Chen, Pengcheng Zhang, Zelong Du, Ishfaq Ahmad
  • for: 这个研究旨在提高全景视频中显著目标检测的精度。
  • methods: 本研究提出了一个包含层间注意(ILA)模块、层间权重(ILW)模块和双模注意(BMA)模块的时空双模混合流网络(STDMMF-Net),利用全景视频的空间流和相应的光流进行显著目标检测。
  • results: 实验结果显示,所提方法的检测精度高于现有最先进(SOTA)方法,且在模型推理所需内存、测试时间、复杂度和泛化性能方面表现更好。
    Abstract Salient object detection (SOD) in panoramic video is still in the initial exploration stage. The indirect application of 2D video SOD method to the detection of salient objects in panoramic video has many unmet challenges, such as low detection accuracy, high model complexity, and poor generalization performance. To overcome these hurdles, we design an Inter-Layer Attention (ILA) module, an Inter-Layer weight (ILW) module, and a Bi-Modal Attention (BMA) module. Based on these modules, we propose a Spatial-Temporal Dual-Mode Mixed Flow Network (STDMMF-Net) that exploits the spatial flow of panoramic video and the corresponding optical flow for SOD. First, the ILA module calculates the attention between adjacent level features of consecutive frames of panoramic video to improve the accuracy of extracting salient object features from the spatial flow. Then, the ILW module quantifies the salient object information contained in the features of each level to improve the fusion efficiency of the features of each level in the mixed flow. Finally, the BMA module improves the detection accuracy of STDMMF-Net. A large number of subjective and objective experimental results testify that the proposed method demonstrates better detection accuracy than the state-of-the-art (SOTA) methods. Moreover, the comprehensive performance of the proposed method is better in terms of memory required for model inference, testing time, complexity, and generalization performance.
    摘要 全景视频中的显著目标检测(SOD)仍处于初步探索阶段。将二维视频SOD方法间接应用于全景视频中的显著目标检测存在许多未解决的挑战,如检测精度低、模型复杂度高和泛化性能差。为了克服这些障碍,我们设计了层间注意(ILA)模块、层间权重(ILW)模块和双模注意(BMA)模块。基于这些模块,我们提出了一种时空双模混合流网络(STDMMF-Net),利用全景视频的空间流和相应的光流进行SOD。首先,ILA模块计算全景视频连续帧相邻层级特征之间的注意力,以提高从空间流中提取显著目标特征的精度。然后,ILW模块量化各层级特征中所含的显著目标信息,以提高混合流中各层级特征的融合效率。最后,BMA模块进一步提高STDMMF-Net的检测精度。大量主观和客观实验结果表明,所提方法的检测精度优于当前最佳(SOTA)方法。此外,该方法在模型推理所需内存、测试时间、复杂度和泛化性能方面的综合表现也更好。

CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules

  • paper_url: http://arxiv.org/abs/2310.08992
  • repo_url: None
  • paper_authors: Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, Shafiq Joty
  • for: 提高 LLM 解决复杂编程任务的能力,帮助它们储存和重用已有的模块。
  • methods: 提出 CodeChain 框架,通过自我修订链来引导 LLM 生成归纳化代码。
  • results: 在 APPS 和 CodeContests 上实现了相对 pass@1 提升率为 35% 和 76%,并在 OpenAI LLM 和 WizardCoder 上得到了良好的效果。
    Abstract Large Language Models (LLMs) have already become quite proficient at solving simpler programming tasks like those in HumanEval or MBPP benchmarks. However, solving more complex and competitive programming tasks is still quite challenging for these models - possibly due to their tendency to generate solutions as monolithic code blocks instead of decomposing them into logical sub-tasks and sub-modules. On the other hand, experienced programmers instinctively write modularized code with abstraction for solving complex tasks, often reusing previously developed modules. To address this gap, we propose CodeChain, a novel framework for inference that elicits modularized code generation through a chain of self-revisions, each being guided by some representative sub-modules generated in previous iterations. Concretely, CodeChain first instructs the LLM to generate modularized codes through chain-of-thought prompting. Then it applies a chain of self-revisions by iterating the two steps: 1) extracting and clustering the generated sub-modules and selecting the cluster representatives as the more generic and re-usable implementations, and 2) augmenting the original chain-of-thought prompt with these selected module-implementations and instructing the LLM to re-generate new modularized solutions. We find that by naturally encouraging the LLM to reuse the previously developed and verified sub-modules, CodeChain can significantly boost both modularity as well as correctness of the generated solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on CodeContests. It is shown to be effective on both OpenAI LLMs as well as open-sourced LLMs like WizardCoder. We also conduct comprehensive ablation studies with different methods of prompting, number of clusters, model sizes, program qualities, etc., to provide useful insights that underpin CodeChain's success.
    摘要 大型语言模型(LLM)已经能够相当熟练地解决HumanEval或MBPP等基准中较简单的编程任务,但解决更复杂、更具竞争性的编程任务对它们仍然颇具挑战——这可能是因为它们倾向于把解答写成单块代码,而不是将其分解为逻辑子任务和子模块。与之相反,有经验的程序员会本能地编写带有抽象的模块化代码来解决复杂任务,并经常复用以前开发的模块。为弥补这一差距,我们提出了CodeChain,一种通过自我修订链来引导模块化代码生成的推理框架,每次自我修订都由先前迭代中生成的代表性子模块引导。具体来说,CodeChain首先通过链式思维提示让LLM生成模块化代码;然后,它通过迭代以下两步进行一系列自我修订:1)抽取并聚类生成的子模块,选择聚类代表作为更通用、更可复用的实现;2)将这些被选中的模块实现增补到原有的链式思维提示中,并让LLM重新生成新的模块化解答。我们发现,通过自然地鼓励LLM复用先前开发并验证过的子模块,CodeChain可以显著提升所生成解答的模块化程度和正确性,在APPS上取得35%、在CodeContests上取得76%的相对pass@1提升。它在OpenAI的LLM以及WizardCoder等开源LLM上均有效。我们还针对提示方式、聚类数量、模型规模、程序质量等进行了全面的消融研究,为CodeChain的成功提供了有用的洞见。
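The extract-cluster-reuse step can be sketched as follows: embed previously generated helper functions, cluster them, take the snippet closest to each centroid as the representative, and splice the representatives into the next prompt. The toy snippets and the TF-IDF/KMeans choices are illustrative simplifications, not the paper's exact pipeline.

```python
# Minimal sketch of the extract-cluster-reuse step for modular code generation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

generated_submodules = [
    "def read_ints():\n    return list(map(int, input().split()))",
    "def read_numbers():\n    return [int(x) for x in input().split()]",
    "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a",
    "def binary_search(arr, x):\n    lo, hi = 0, len(arr)\n    while lo < hi:\n        mid = (lo + hi) // 2\n        if arr[mid] < x: lo = mid + 1\n        else: hi = mid\n    return lo",
]

vec = TfidfVectorizer(token_pattern=r"[A-Za-z_]+")
X = vec.fit_transform(generated_submodules).toarray()
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

representatives = []
for c in range(k):
    idx = np.where(km.labels_ == c)[0]
    dists = ((X[idx] - km.cluster_centers_[c]) ** 2).sum(axis=1)   # nearest to centroid
    representatives.append(generated_submodules[idx[np.argmin(dists)]])

prompt = (
    "You may reuse the following previously written sub-modules:\n\n"
    + "\n\n".join(representatives)
    + "\n\nNow solve the task in a modular way, reusing these helpers where useful."
)
print(prompt)
```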

Reroute Prediction Service

  • paper_url: http://arxiv.org/abs/2310.08988
  • repo_url: None
  • paper_authors: Ítalo Romani de Oliveira, Samet Ayhan, Michael Biglin, Pablo Costas, Euclides C. Pinto Neto
  • for: 降低美国国家航空系统中的延迟,通过灵活支持重新路径决策。
  • methods: 使用历史重新路径数据和天气数据,通过数据分析和机器学习算法来预测重新路径建议。
  • results: 实现了高于90%的准确率。
    Abstract The cost of delays was estimated as 33 billion US dollars only in 2019 for the US National Airspace System, a peak value following a growth trend in past years. Aiming to address this huge inefficiency, we designed and developed a novel Data Analytics and Machine Learning system, which aims at reducing delays by proactively supporting re-routing decisions. Given a time interval up to a few days in the future, the system predicts if a reroute advisory for a certain Air Route Traffic Control Center or for a certain advisory identifier will be issued, which may impact the pertinent routes. To deliver such predictions, the system uses historical reroute data, collected from the System Wide Information Management (SWIM) data services provided by the FAA, and weather data, provided by the US National Centers for Environmental Prediction (NCEP). The data is huge in volume, and has many items streamed at high velocity, uncorrelated and noisy. The system continuously processes the incoming raw data and makes it available for the next step where an interim data store is created and adaptively maintained for efficient query processing. The resulting data is fed into an array of ML algorithms, which compete for higher accuracy. The best performing algorithm is used in the final prediction, generating the final results. Mean accuracy values higher than 90% were obtained in our experiments with this system. Our algorithm divides the area of interest in units of aggregation and uses temporal series of the aggregate measures of weather forecast parameters in each geographical unit, in order to detect correlations with reroutes and where they will most likely occur. Aiming at practical application, the system is formed by a number of microservices, which are deployed in the cloud, making the system distributed, scalable and highly available.
    摘要 仅2019年一年,美国国家空域系统的延误成本就估计高达330亿美元,这是近年来持续增长趋势中的峰值。为了解决这一巨大的低效问题,我们设计并开发了一个新颖的数据分析与机器学习系统,旨在通过主动支持改航决策来减少延误。给定一个最长可达数天的未来时间区间,该系统可预测是否会针对某个航路交通管制中心或某个改航建议标识符发布改航建议,而这可能影响相关航路。为了给出这类预测,系统使用从FAA的系统级信息管理(SWIM)数据服务收集的历史改航数据,以及美国国家环境预报中心(NCEP)提供的天气数据。数据量巨大,包含大量以高速流入、彼此不相关且带有噪声的条目。系统持续处理流入的原始数据,并将其提供给下一步:建立并自适应维护一个中间数据存储,以便高效地进行查询处理。处理后的数据被输入一组相互竞争以取得更高准确率的机器学习算法,表现最好的算法被用于最终预测,生成最终结果。我们的实验获得了高于90%的平均准确率。我们的算法将感兴趣区域划分为聚合单元,并利用每个地理单元内天气预报参数聚合量的时间序列,来发现其与改航之间的相关性以及改航最可能发生的位置。面向实际应用,该系统由若干微服务组成并部署在云端,因而具备分布式、可扩展和高可用的特性。

Big data-driven prediction of airspace congestion

  • paper_url: http://arxiv.org/abs/2310.08982
  • repo_url: None
  • paper_authors: Samet Ayhan, Ítalo Romani de Oliveira, Glaucia Balvedi, Pablo Costas, Alexandre Leite, Felipe C. F. de Azevedo
  • for: 这paper的目的是为了提供一种精度measure和预测航空器数量在特定空域中,以提高空中交通管理水平,减轻空中交通管理员的工作负担。
  • methods: 该paper使用了一种新的数据管理和预测系统,可以准确地预测航空器数量在特定空域中。该系统使用流入的TFM数据进行预处理,将数据压缩到可持久化的大小,并将其存储在NoSQL数据库中。在预测步骤中,系统使用历史飞行轨迹中的特征来预测航空器数量。
  • results: 评估结果表明,该系统可以高效地和准确地预测航空器数量在每个空域中。
    Abstract Air Navigation Service Providers (ANSP) worldwide have been making a considerable effort for the development of a better method to measure and predict aircraft counts within a particular airspace, also referred to as airspace density. An accurate measurement and prediction of airspace density is crucial for a better managed airspace, both strategically and tactically, yielding a higher level of automation and thereby reducing the air traffic controller's workload. Although the prior approaches have been able to address the problem to some extent, data management and query processing of ever-increasing vast volume of air traffic data at high rates, for various analytics purposes such as predicting aircraft counts, still remains a challenge especially when only linear prediction models are used. In this paper, we present a novel data management and prediction system that accurately predicts aircraft counts for a particular airspace sector within the National Airspace System (NAS). The incoming Traffic Flow Management (TFM) data is streaming, big, uncorrelated and noisy. In the preprocessing step, the system continuously processes the incoming raw data, reduces it to a compact size, and stores it in a NoSQL database, where it makes the data available for efficient query processing. In the prediction step, the system learns from historical trajectories and uses their segments to collect key features such as sector boundary crossings, weather parameters, and other air traffic data. The features are fed into various regression models, including linear, non-linear and ensemble models, and the best performing model is used for prediction. Evaluation on an extensive set of real track, weather, and air traffic data including boundary crossings in the U.S. verify that our system efficiently and accurately predicts aircraft counts in each airspace sector.
    摘要 全球的空中航行服务提供商(ANSP)一直在努力开发更好的方法,来测量和预测特定空域内的航空器数量(即空域密度)。准确地测量和预测空域密度对于在战略和战术层面更好地管理空域至关重要,它可以带来更高程度的自动化,从而减轻空中交通管制员的工作负担。尽管以往的方法在一定程度上解决了这一问题,但对持续高速产生的海量空中交通数据进行数据管理和查询处理,以服务于预测航空器数量等各种分析目的,仍然是一个挑战,尤其是在只使用线性预测模型的情况下。在本文中,我们提出了一种新的数据管理与预测系统,能够准确预测国家空域系统(NAS)中特定空域扇区的航空器数量。流入的交通流量管理(TFM)数据是流式的、海量的、彼此不相关且带有噪声的。在预处理步骤中,系统持续处理流入的原始数据,将其压缩到紧凑的规模,并存储在NoSQL数据库中,以便高效地进行查询处理。在预测步骤中,系统从历史航迹中学习,利用航迹片段收集扇区边界穿越、天气参数及其他空中交通数据等关键特征。这些特征被输入到包括线性、非线性和集成模型在内的多种回归模型中,并选用表现最好的模型进行预测。基于美国境内大量真实航迹、天气和空中交通数据(包括边界穿越)的评估表明,我们的系统能够高效且准确地预测每个空域扇区的航空器数量。
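The model-selection step (fit several regressors, keep the best on validation data) can be sketched as below; the synthetic features and their names are stand-ins for the real track and weather features.

```python
# Minimal sketch: fit linear, non-linear, and ensemble regressors on synthetic
# sector features and keep the best by validation MAE.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.integers(0, 24, n),          # hour of day
    rng.normal(15, 8, n),            # wind speed
    rng.uniform(0, 1, n),            # convective weather index
    rng.integers(0, 40, n),          # scheduled boundary crossings
])
y = 5 + 0.8 * X[:, 3] - 6 * X[:, 2] + 2 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 2, n)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)
models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = mean_absolute_error(y_va, model.predict(X_va))
best = min(scores, key=scores.get)
print(scores, "-> best:", best)
```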

Multi-Purpose NLP Chatbot : Design, Methodology & Conclusion

  • paper_url: http://arxiv.org/abs/2310.08977
  • repo_url: None
  • paper_authors: Shivom Aggarwal, Shourya Mehra, Pritha Mitra
  • for: 这篇研究论文主要探讨了当今聊天机器人技术环境的历史、挑战和潜在价值。
  • methods: 该论文提出了一种非常灵活的聊天机器人系统,该系统使用强化学习策略来提高用户交互和对话体验。此外,该系统还使用情感分析和自然语言处理来确定用户情绪。
  • results: 该论文通过实践证明了这种聊天机器人系统的优异特性,包括语音对话、多语言支持、建议功能、离线运行和快速帮助功能等。此外,该研究还探讨了聊天机器人技术的复杂性和发展因素,以及它对多个领域的深远影响。
    Abstract With a major focus on its history, difficulties, and promise, this research paper provides a thorough analysis of the chatbot technology environment as it exists today. It provides a very flexible chatbot system that makes use of reinforcement learning strategies to improve user interactions and conversational experiences. Additionally, this system makes use of sentiment analysis and natural language processing to determine user moods. The chatbot is a valuable tool across many fields thanks to its amazing characteristics, which include voice-to-voice conversation, multilingual support [12], advising skills, offline functioning, and quick help features. The complexity of chatbot technology development is also explored in this study, along with the causes that have propelled these developments and their far-reaching effects on a range of sectors. According to the study, three crucial elements are crucial: 1) Even without explicit profile information, the chatbot system is built to adeptly understand unique consumer preferences and fluctuating satisfaction levels. With the use of this capacity, user interactions are made to meet their wants and preferences. 2) Using a complex method that interlaces Multiview voice chat information, the chatbot may precisely simulate users' actual experiences. This aids in developing more genuine and interesting discussions. 3) The study presents an original method for improving the black-box deep learning models' capacity for prediction. This improvement is made possible by introducing dynamic satisfaction measurements that are theory-driven, which leads to more precise forecasts of consumer reaction.
    摘要 这个研究论文对现代聊天机器人技术环境进行了全面的分析,包括历史、挑战和前景。它提供了一个非常灵活的聊天机器人系统,使用强化学习策略来改善用户互动和对话体验。此外,该系统还使用情感分析和自然语言处理来确定用户的情绪状态。由于它的优秀特点,如语音对话、多语言支持、建议功能、离线运行和快速帮助功能等,聊天机器人在各个领域都是一个非常有价值的工具。这个研究还探讨了聊天机器人技术的复杂性,以及这些发展的原因和对各个领域的深远影响。根据研究,三个关键因素是:1. 不需要显式Profile信息,聊天机器人系统可以快速理解用户的唯一需求和不断变化的满意度。通过这种能力,用户互动可以更好地适应他们的需求。2. 使用复杂的多视图语音聊天信息协同策略,聊天机器人可以准确模拟用户的实际体验。这有助于创造更真实和有趣的对话。3. 研究提出了一种改进黑obox深度学习模型预测能力的方法,通过引入动态满意度测量,使得预测更加准确。

ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models

  • paper_url: http://arxiv.org/abs/2310.08975
  • repo_url: https://github.com/lhrlab/chatkbqa
  • paper_authors: Haoran Luo, Haihong E, Zichen Tang, Shiyao Peng, Yikai Guo, Wentai Zhang, Chenghao Ma, Guanting Dong, Meina Song, Wei Lin
  • for: 这个论文的目的是提出一种基于大语言模型(LLMs)的生成并 retrieve 知识库问答框架(KBQA),以解决现有KBQA方法中的三大挑战。
  • methods: 该论文通过微调 Llama-2、ChatGLM2 和 Baichuan2 等开源 LLM,提出了一种先生成逻辑形式(Logical Form)、再通过无监督检索方法检索并替换其中实体和关系的方法,以提升生成和检索的精度。
  • results: 实验结果显示,ChatKBQA 在标准 KBQA 数据集 WebQSP 和 ComplexWebQuestions (CWQ) 上达到了新的最先进(state-of-the-art)性能,并提供了一种将 LLM 与知识图谱(KG)结合的新范式,以实现可解释且依赖知识的问答。
    Abstract Knowledge Base Question Answering (KBQA) aims to derive answers to natural language questions over large-scale knowledge bases (KBs), which are generally divided into two research components: knowledge retrieval and semantic parsing. However, three core challenges remain, including inefficient knowledge retrieval, retrieval errors adversely affecting semantic parsing, and the complexity of previous KBQA methods. In the era of large language models (LLMs), we introduce ChatKBQA, a novel generate-then-retrieve KBQA framework built on fine-tuning open-source LLMs such as Llama-2, ChatGLM2 and Baichuan2. ChatKBQA proposes generating the logical form with fine-tuned LLMs first, then retrieving and replacing entities and relations through an unsupervised retrieval method, which improves both generation and retrieval more straightforwardly. Experimental results reveal that ChatKBQA achieves new state-of-the-art performance on standard KBQA datasets, WebQSP, and ComplexWebQuestions (CWQ). This work also provides a new paradigm for combining LLMs with knowledge graphs (KGs) for interpretable and knowledge-required question answering. Our code is publicly available.
    摘要 知识库问答(KBQA)旨在基于大规模知识库(KB)推导出自然语言问题的答案,通常分为两个研究组件:知识检索和语义解析。然而,仍存在三个核心挑战:知识检索效率低、检索错误对语义解析产生不利影响,以及以往KBQA方法过于复杂。在大语言模型(LLM)时代,我们提出了ChatKBQA,一种基于微调开源LLM(如 Llama-2、ChatGLM2 和 Baichuan2)的新型"先生成后检索"KBQA框架。ChatKBQA 提议先用微调后的LLM生成逻辑形式,然后通过无监督检索方法检索并替换其中的实体和关系,从而更直接地同时改进生成和检索。实验结果表明,ChatKBQA 在标准KBQA数据集 WebQSP 和 ComplexWebQuestions(CWQ)上取得了新的最先进性能。此外,这项工作还为将LLM与知识图谱(KG)结合、实现可解释且依赖知识的问答提供了新范式。我们的代码已公开。
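The retrieve-and-replace step can be illustrated with a small sketch that links surface mentions in a generated logical form to the closest KB entities and relations. The KB contents, the logical-form format, and the difflib-based matcher are illustrative simplifications; the paper uses unsupervised retrieval over the full knowledge base.

```python
# Minimal sketch of the retrieve-and-replace step: map surface mentions in a generated
# logical form to the closest knowledge-base entities/relations by string similarity.
import difflib
import re

kb_entities = ["m.0f8l9c|France", "m.02j71|Earth", "m.05qtj|Paris"]
kb_relations = ["location.country.capital", "location.location.containedby"]

generated_lf = "(JOIN (R capital_of) [ france ])"   # LLM output with rough mentions

def link(mention, candidates):
    names = [c.split("|")[-1] for c in candidates]
    best = difflib.get_close_matches(mention, names, n=1, cutoff=0.0)[0]
    return candidates[names.index(best)]

entity_mention = re.search(r"\[ (.+?) \]", generated_lf).group(1)
relation_mention = re.search(r"\(R (.+?)\)", generated_lf).group(1)

lf = generated_lf.replace(f"[ {entity_mention} ]", link(entity_mention, kb_entities).split("|")[0])
lf = lf.replace(f"(R {relation_mention})", f"(R {link(relation_mention, kb_relations)})")
print(lf)   # e.g. (JOIN (R location.country.capital) m.0f8l9c)
```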

Making Multimodal Generation Easier: When Diffusion Models Meet LLMs

  • paper_url: http://arxiv.org/abs/2310.08949
  • repo_url: https://github.com/zxy556677/easygen
  • paper_authors: Xiangyu Zhao, Bo Liu, Qijiong Liu, Guangyuan Shi, Xiao-Ming Wu
  • for: 提高多Modal理解和生成的效率,使用扩散模型和大语言模型(LLM)。
  • methods: 使用扩散模型BiDiffuser,与LLM结合使用投影层进行图像到文本生成和文本到图像生成。
  • results: 经过大量定量和定性实验,EasyGen 显示出了良好的效果,其训练可以在实验室环境中轻松完成。
    Abstract We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominately depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge the gap between modalities, EasyGen is built upon a bidirectional conditional diffusion model named BiDiffuser, which promotes more efficient interactions between modalities. EasyGen handles image-to-text generation by integrating BiDiffuser and an LLM via a simple projection layer. Unlike most existing multimodal models that are limited to generating text responses, EasyGen can also facilitate text-to-image generation by leveraging the LLM to create textual descriptions, which can be interpreted by BiDiffuser to generate appropriate visual responses. Extensive quantitative and qualitative experiments demonstrate the effectiveness of EasyGen, whose training can be easily achieved in a lab setting. The source code is available at https://github.com/zxy556677/EasyGen.
    摘要 我们介绍EasyGen,一种高效的模型,旨在借助扩散模型和大语言模型(LLM)的能力来增强多模态理解和生成。不同于主要依赖CLIP或ImageBind等编码器、且需要大量训练数据来弥合模态差距的现有多模态模型,EasyGen基于双向条件扩散模型BiDiffuser,促进了模态之间更高效的交互。EasyGen通过一个简单的投影层将BiDiffuser和LLM结合起来,用于图像到文本生成。与大多数只能生成文本回复的多模态模型不同,EasyGen还可以利用LLM生成文本描述,再由BiDiffuser解释并生成相应的视觉回复,从而支持文本到图像生成。大量定量和定性实验证明了EasyGen的有效性,其训练可以在实验室环境中轻松完成。源代码可在 https://github.com/zxy556677/EasyGen 获取。

Federated Class-Incremental Learning with Prompting

  • paper_url: http://arxiv.org/abs/2310.08948
  • repo_url: None
  • paper_authors: Jiale Liu, Yu-Wei Zhan, Chong-Yu Zhang, Xin Luo, Zhen-Duo Chen, Yinwei Wei, Xin-Shun Xu
  • for: This paper focuses on the challenging problem of federated class-incremental learning (FCIL), where the local and global models may suffer from catastrophic forgetting due to the arrival of new classes and non-independent and identically distributed (non-iid) data distributions.
  • methods: The proposed method, Federated Class-Incremental Learning with Prompting (FCILPT), uses prompts to ease the catastrophic forgetting of old classes, and sorts the task information in the prompt pool to align the task information on different clients before global aggregation.
  • results: The proposed method achieves significant accuracy improvements over state-of-the-art methods on three benchmark datasets (CIFAR-100, Mini-ImageNet, and Tiny-ImageNet).
    Abstract As Web technology continues to develop, it has become increasingly common to use data stored on different clients. At the same time, federated learning has received widespread attention due to its ability to protect data privacy when let models learn from data which is distributed across various clients. However, most existing works assume that the client's data are fixed. In real-world scenarios, such an assumption is most likely not true as data may be continuously generated and new classes may also appear. To this end, we focus on the practical and challenging federated class-incremental learning (FCIL) problem. For FCIL, the local and global models may suffer from catastrophic forgetting on old classes caused by the arrival of new classes and the data distributions of clients are non-independent and identically distributed (non-iid). In this paper, we propose a novel method called Federated Class-Incremental Learning with PrompTing (FCILPT). Given the privacy and limited memory, FCILPT does not use a rehearsal-based buffer to keep exemplars of old data. We choose to use prompts to ease the catastrophic forgetting of the old classes. Specifically, we encode the task-relevant and task-irrelevant knowledge into prompts, preserving the old and new knowledge of the local clients and solving the problem of catastrophic forgetting. We first sort the task information in the prompt pool in the local clients to align the task information on different clients before global aggregation. It ensures that the same task's knowledge are fully integrated, solving the problem of non-iid caused by the lack of classes among different clients in the same incremental task. Experiments on CIFAR-100, Mini-ImageNet, and Tiny-ImageNet demonstrate that FCILPT achieves significant accuracy improvements over the state-of-the-art methods.
    摘要 随着网络技术的不断发展,使用分布在不同客户端上的数据变得越来越普遍。同时,联邦学习由于能够在让模型从分布于各客户端的数据中学习的同时保护数据隐私而受到广泛关注。然而,现有工作大多假设客户端的数据是固定不变的。在实际场景中,这一假设往往不成立,因为数据会持续产生,新的类别也可能出现。为此,我们关注实际且具有挑战性的联邦类增量学习(FCIL)问题。在FCIL中,新类别的到来可能使本地和全局模型对旧类别产生灾难性遗忘,而且各客户端的数据分布是非独立同分布(non-iid)的。在本文中,我们提出了一种新方法,称为带提示的联邦类增量学习(FCILPT)。考虑到隐私和有限的内存,FCILPT不使用基于回放的缓冲区来保存旧数据样例,而是选择使用提示(prompt)来缓解旧类别的灾难性遗忘。具体来说,我们将与任务相关和与任务无关的知识编码到提示中,保留本地客户端的新旧知识,从而解决灾难性遗忘问题。在全局聚合之前,我们先对本地客户端提示池中的任务信息进行排序,以对齐不同客户端上的任务信息;这保证了同一任务的知识被充分整合,解决了同一增量任务中不同客户端缺少某些类别所导致的non-iid问题。在CIFAR-100、Mini-ImageNet和Tiny-ImageNet上的实验表明,FCILPT相比现有最先进方法取得了显著的精度提升。

Progressively Efficient Learning

  • paper_url: http://arxiv.org/abs/2310.13004
  • repo_url: https://github.com/himanshub1007/Alzhimers-Disease-Prediction-Using-Deep-learning
  • paper_authors: Ruijie Zheng, Khanh Nguyen, Hal Daumé III, Furong Huang, Karthik Narasimhan
  • for: 本研究旨在帮助人工智能代理人快速积累新技能和适应新用户喜好。
  • methods: 本研究提出了一种新的学习框架 named Communication-Efficient Interactive Learning (CEIL),该框架通过让学习代理人具备抽象、动态的语言和内在动机,使得学习代理人与教师之间的交流变得更加高效。
  • results: CEIL 在包含长时程决策任务的 2D MineCraft 环境中展示了出色的性能和交流效率,使学习代理能快速掌握新任务,并在与教师的交流中表现出更高的效率和灵活性。
    Abstract Assistant AI agents should be capable of rapidly acquiring novel skills and adapting to new user preferences. Traditional frameworks like imitation learning and reinforcement learning do not facilitate this capability because they support only low-level, inefficient forms of communication. In contrast, humans communicate with progressive efficiency by defining and sharing abstract intentions. Reproducing similar capability in AI agents, we develop a novel learning framework named Communication-Efficient Interactive Learning (CEIL). By equipping a learning agent with an abstract, dynamic language and an intrinsic motivation to learn with minimal communication effort, CEIL leads to emergence of a human-like pattern where the learner and the teacher communicate progressively efficiently by exchanging increasingly more abstract intentions. CEIL demonstrates impressive performance and communication efficiency on a 2D MineCraft domain featuring long-horizon decision-making tasks. Agents trained with CEIL quickly master new tasks, outperforming non-hierarchical and hierarchical imitation learning by up to 50% and 20% in absolute success rate, respectively, given the same number of interactions with the teacher. Especially, the framework performs robustly with teachers modeled after human pragmatic communication behavior.
    摘要 助手型AI代理应当能够快速习得新技能并适应新的用户偏好。模仿学习和强化学习等传统框架无法支持这一能力,因为它们只支持低层次、低效率的交流方式。相比之下,人类通过定义并分享抽象意图,以逐步提高的效率进行交流。为了在AI代理中复现类似能力,我们开发了一种新的学习框架,名为交流高效交互学习(CEIL)。通过为学习代理配备抽象、动态的语言以及以最小交流代价进行学习的内在动机,CEIL促成了一种类似人类的模式:学习者与教师通过交换越来越抽象的意图,进行效率逐步提升的交流。CEIL在包含长时程决策任务的2D MineCraft环境中展示了出色的性能和交流效率。在与教师交互次数相同的情况下,使用CEIL训练的代理能快速掌握新任务,其绝对成功率分别比非层次和层次模仿学习高出至多50%和20%。特别地,该框架在以人类语用交流行为为模型的教师下表现稳健。

Embarrassingly Simple Text Watermarks

  • paper_url: http://arxiv.org/abs/2310.08920
  • repo_url: https://github.com/amicus-veritatis/easydemark
  • paper_authors: Ryoma Sato, Yuki Takezawa, Han Bao, Kenta Niwa, Makoto Yamada
  • for: 防止Large Language Models(LLM)生成的文本被误用,提高文本的准确性和可靠性。
  • methods: 提出了一种简单而有效的文本水印方法,称为Easymark,可以在完全不改变文本含义的情况下嵌入水印,验证者只需少量验证代码即可高置信度地检测文本是否来自采用Easymark的系统。
  • results: 对LLM生成的文本进行了实验,结果表明Easymark可以准确地检测文本是否来自Easymark,并且不会影响文本的质量和可靠性。同时,Easymark也可以在用户端实现,不需要访问LLM提供者的服务。
    Abstract We propose Easymark, a family of embarrassingly simple yet effective watermarks. Text watermarking is becoming increasingly important with the advent of Large Language Models (LLM). LLMs can generate texts that cannot be distinguished from human-written texts. This is a serious problem for the credibility of the text. Easymark is a simple yet effective solution to this problem. Easymark can inject a watermark without changing the meaning of the text at all while a validator can detect if a text was generated from a system that adopted Easymark or not with high credibility. Easymark is extremely easy to implement so that it only requires a few lines of code. Easymark does not require access to LLMs, so it can be implemented on the user-side when the LLM providers do not offer watermarked LLMs. In spite of its simplicity, it achieves higher detection accuracy and BLEU scores than the state-of-the-art text watermarking methods. We also prove the impossibility theorem of perfect watermarking, which is valuable in its own right. This theorem shows that no matter how sophisticated a watermark is, a malicious user could remove it from the text, which motivate us to use a simple watermark such as Easymark. We carry out experiments with LLM-generated texts and confirm that Easymark can be detected reliably without any degradation of BLEU and perplexity, and outperform state-of-the-art watermarks in terms of both quality and reliability.
    摘要 我们提出了Easymark,一族极其简单却有效的水印方法。随着大语言模型(LLM)的出现,文本水印变得越来越重要。LLM可以生成与人类书写难以区分的文本,这对文本的可信度构成了严重问题。Easymark是解决该问题的一种简单而有效的方案:它可以在完全不改变文本含义的情况下嵌入水印,而验证者可以高置信度地检测一段文本是否出自采用Easymark的系统。Easymark非常容易实现,只需几行代码;它不需要访问LLM,因此当LLM提供方不提供带水印的LLM时,也可以在用户端实现。尽管方法简单,它的检测准确率和BLEU分数都高于现有最先进的文本水印方法。我们还证明了完美水印的不可能性定理,这一结论本身也具有价值:它表明无论水印多么精巧,恶意用户都可以将其从文本中移除,这也促使我们采用Easymark这样简单的水印。我们在LLM生成的文本上进行了实验,确认Easymark可以在不降低BLEU和困惑度的情况下被可靠地检测,并且在质量和可靠性两方面均优于现有最先进的水印方法。
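A watermark in this spirit can be sketched in a few lines: replace ordinary spaces with a visually similar Unicode whitespace character and detect by counting those characters. The specific character and threshold below are illustrative choices; the released Easymark variants may differ in detail.

```python
# Minimal sketch of a meaning-preserving whitespace watermark and its detector.
MARK = "\u2004"   # THREE-PER-EM SPACE, renders like a normal space in most fonts

def embed(text: str) -> str:
    return text.replace(" ", MARK)

def detect(text: str, threshold: float = 0.5) -> bool:
    spaces = [c for c in text if c in (" ", MARK)]
    if not spaces:
        return False
    return sum(c == MARK for c in spaces) / len(spaces) >= threshold

original = "Large language models can generate fluent text."
watermarked = embed(original)
print(watermarked == original)                 # False: the bytes differ
print(detect(watermarked), detect(original))   # True, False
```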

Relation-aware Ensemble Learning for Knowledge Graph Embedding

  • paper_url: http://arxiv.org/abs/2310.08917
  • repo_url: https://github.com/lars-research/relens
  • paper_authors: Ling Yue, Yongqi Zhang, Quanming Yao, Yong Li, Xian Wu, Ziheng Zhang, Zhenxi Lin, Yefeng Zheng
  • for: 本研究旨在提出一种基于现有方法的关系意识 Ensemble 方法,以优化知识图(KG)嵌入。
  • methods: 本方法使用分治-搜索-合并(Divide-Search-Combine)算法 RelEns-DSC,对每个关系独立地搜索其集成权重,以提高搜索效率。
  • results: 实验结果表明,所提方法可以高效地搜索关系感知的集成权重,并取得最先进的嵌入性能。代码可以在 https://github.com/LARS-research/RelEns 上获取。
    Abstract Knowledge graph (KG) embedding is a fundamental task in natural language processing, and various methods have been proposed to explore semantic patterns in distinctive ways. In this paper, we propose to learn an ensemble by leveraging existing methods in a relation-aware manner. However, exploring these semantics using relation-aware ensemble leads to a much larger search space than general ensemble methods. To address this issue, we propose a divide-search-combine algorithm RelEns-DSC that searches the relation-wise ensemble weights independently. This algorithm has the same computation cost as general ensemble methods but with much better performance. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed method in efficiently searching relation-aware ensemble weights and achieving state-of-the-art embedding performance. The code is public at https://github.com/LARS-research/RelEns.
    摘要 知识图(KG)嵌入是自然语言处理中的一项基础任务,已有多种方法以各自不同的方式挖掘语义模式。在这篇论文中,我们提出以关系感知的方式对现有方法进行集成学习。然而,以关系感知的集成来挖掘这些语义会使搜索空间远大于一般的集成方法。为解决这一问题,我们提出了分治-搜索-合并算法 RelEns-DSC,对每个关系独立地搜索其集成权重。该算法的计算成本与一般集成方法相同,但性能显著更好。在基准数据集上的实验结果表明,所提方法能够高效地搜索关系感知的集成权重,并取得最先进的嵌入性能。代码可在 https://github.com/LARS-research/RelEns 获取。
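A hedged sketch of the divide-search-combine idea: because weights are searched per relation, each sub-search is small enough for an exhaustive grid. The function and variable names below are illustrative assumptions, not the paper's implementation.

```python
# Relation-wise ensemble weight search, sketched with a simple grid over the simplex.
import itertools
import numpy as np

def search_relation_weights(scores_by_model, relations, labels, grid=np.linspace(0, 1, 11)):
    """scores_by_model: list of [n_queries, n_entities] arrays from base KGE models.
    relations: [n_queries] relation ids; labels: [n_queries] true entity ids.
    Returns {relation id: weight vector over base models} maximizing validation MRR."""
    n_models = len(scores_by_model)
    best_weights = {}
    for r in np.unique(relations):
        idx = np.where(relations == r)[0]
        best_mrr, best_w = -1.0, None
        # Exhaustive grid per relation is feasible because each relation is searched separately.
        for w in itertools.product(grid, repeat=n_models):
            if not np.isclose(sum(w), 1.0):
                continue
            combined = sum(wi * m[idx] for wi, m in zip(w, scores_by_model))
            true_scores = combined[np.arange(len(idx)), labels[idx]]
            ranks = (combined >= true_scores[:, None]).sum(axis=1)  # rank of the true entity
            mrr = (1.0 / ranks).mean()
            if mrr > best_mrr:
                best_mrr, best_w = mrr, np.asarray(w)
        best_weights[r] = best_w
    return best_weights
```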

Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs

  • paper_url: http://arxiv.org/abs/2310.08915
  • repo_url: https://github.com/zyxxmu/dsnot
  • paper_authors: Yuxin Zhang, Lirui Zhao, Mingbao Lin, Yunyun Sun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, Rongrong Ji
  • for: 提高 sparse LLMs 的性能,尤其是在高稀疏度水平下。
  • methods: 提出了一种无需训练的 fine-tuning 方法,通过迭代的权重剪枝与再生长(pruning-and-growing)来轻量地更新 sparse LLMs。
  • results: 在 LLaMA-V1/V2、Vicuna 和 OPT 上进行了广泛的实验,证明了 DSnoT 可以提高 sparse LLMs 的性能,特别是在高稀疏度水平上。例如,与 state-of-the-art 的 Wanda 相比,DSnoT 在 70% 稀疏度的 LLaMA-7B 上将困惑度改善了 26.79。
    Abstract The ever-increasing large language models (LLMs), though opening a potential path for the upcoming artificial general intelligence, sadly drops a daunting obstacle on the way towards their on-device deployment. As one of the most well-established pre-LLMs approaches in reducing model complexity, network pruning appears to lag behind in the era of LLMs, due mostly to its costly fine-tuning (or re-training) necessity under the massive volumes of model parameter and training data. To close this industry-academia gap, we introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach that slightly updates sparse LLMs without the expensive backpropagation and any weight updates. Inspired by the Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs, in the fashion of performing iterative weight pruning-and-growing on top of sparse LLMs. To accomplish this purpose, DSnoT particularly takes into account the anticipated reduction in reconstruction error for pruning and growing, as well as the variance w.r.t. different input data for growing each weight. This practice can be executed efficiently in linear time since it obviates the need for backpropagation for fine-tuning LLMs. Extensive experiments on LLaMA-V1/V2, Vicuna, and OPT across various benchmarks demonstrate the effectiveness of DSnoT in enhancing the performance of sparse LLMs, especially at high sparsity levels. For instance, DSnoT is able to outperform the state-of-the-art Wanda by 26.79 perplexity at 70% sparsity with LLaMA-7B. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and opens new avenues to scale the great potential of sparsity to LLMs. Codes are available at https://github.com/zyxxmu/DSnoT.
    摘要 规模不断增长的大型语言模型(LLM)虽然为通用人工智能开辟了潜在道路,却也给其在设备端的部署设置了巨大障碍。作为压缩模型复杂度最成熟的前LLM时代方法之一,网络剪枝在LLM时代却显得滞后,主要原因在于海量模型参数与训练数据使得剪枝后的微调(或再训练)代价高昂。为弥合这一产业界与学术界之间的差距,我们提出了动态稀疏免训练(Dynamic Sparse No Training, DSnoT):一种无需训练的微调方法,可在不进行昂贵反向传播和任何权重更新的情况下对稀疏LLM进行轻量更新。受动态稀疏训练启发,DSnoT以在稀疏LLM之上迭代进行权重剪枝与再生长的方式,最小化稠密LLM与稀疏LLM之间的重构误差。为此,DSnoT特别考虑了剪枝与再生长所带来的预期重构误差下降,以及每个权重再生长时相对不同输入数据的方差。由于无需为微调LLM进行反向传播,该过程可以在线性时间内高效执行。在 LLaMA-V1/V2、Vicuna 和 OPT 上的大量基准实验证明了DSnoT在提升稀疏LLM性能方面的有效性,尤其是在高稀疏度下。例如,在 70% 稀疏度的 LLaMA-7B 上,DSnoT 的困惑度比最先进的 Wanda 改善了 26.79。我们的论文为如何以高效、免训练的方式微调稀疏LLM提供了新的见解,并为将稀疏性的巨大潜力扩展到LLM开辟了新途径。代码可在 https://github.com/zyxxmu/DSnoT 获取。
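A hedged, single-row sketch of the training-free prune-and-grow loop described above: only the sparsity mask is updated (never the weight values), so that the sparse output better reconstructs the dense output on a small calibration batch. The growing/pruning criteria in the paper additionally weight candidates by their variance over inputs; this is a simplified reading with illustrative names.

```python
# Simplified DSnoT-style mask refinement for one output row of a linear layer.
import numpy as np

def dsnot_row(w: np.ndarray, mask: np.ndarray, X: np.ndarray, max_iters: int = 50):
    """w: dense weights [d], mask: 0/1 sparsity mask [d], X: calibration inputs [n, d]."""
    mask = mask.copy().astype(bool)
    dense_out = X @ w
    for _ in range(max_iters):
        err = dense_out - X @ (w * mask)          # reconstruction error of the sparse row [n]
        contrib = X * w                            # each weight's contribution to the output [n, d]
        # Grow candidate: pruned weight whose restoration best reduces the error.
        grow_scores = np.linalg.norm(err[:, None] - contrib, axis=0)
        grow_scores[mask] = np.inf
        g = int(np.argmin(grow_scores))
        # Prune candidate: kept weight whose removal hurts the error the least.
        prune_scores = np.linalg.norm(err[:, None] + contrib, axis=0)
        prune_scores[~mask] = np.inf
        p = int(np.argmin(prune_scores))
        # Accept the swap only if it lowers the reconstruction error; sparsity level is preserved.
        new_err = err - contrib[:, g] + contrib[:, p]
        if np.linalg.norm(new_err) >= np.linalg.norm(err):
            break
        mask[g], mask[p] = True, False
    return mask
```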

Community Membership Hiding as Counterfactual Graph Search via Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2310.08909
  • repo_url: None
  • paper_authors: Andrea Bernini, Fabrizio Silvestri, Gabriele Tolomei
  • for: 本研究旨在解决社交媒体平台上的社群成员隐私保护问题,即通过修改网络图的结构性质,防止某些节点被某种社群检测算法识别。
  • methods: 本研究将该问题形式化为带约束的反事实图目标,并使用深度强化学习进行求解。
  • results: 经过广泛的实验,我们的方法在节点和社群欺骗两个任务中均表现出色,与现有的基线方法相比,其效果更好。
    Abstract Community detection techniques are useful tools for social media platforms to discover tightly connected groups of users who share common interests. However, this functionality often comes at the expense of potentially exposing individuals to privacy breaches by inadvertently revealing their tastes or preferences. Therefore, some users may wish to safeguard their anonymity and opt out of community detection for various reasons, such as affiliation with political or religious organizations. In this study, we address the challenge of community membership hiding, which involves strategically altering the structural properties of a network graph to prevent one or more nodes from being identified by a given community detection algorithm. We tackle this problem by formulating it as a constrained counterfactual graph objective, and we solve it via deep reinforcement learning. We validate the effectiveness of our method through two distinct tasks: node and community deception. Extensive experiments show that our approach overall outperforms existing baselines in both tasks.
    摘要 社区检测技术是社交媒体平台的有用工具,可用于发现由兴趣相投的用户组成的紧密连接群体。然而,这一功能往往以个人隐私为代价:社区检测可能在无意间暴露用户的品味或偏好。因此,出于政治或宗教组织归属等各种原因,一些用户可能希望保持匿名并选择退出社区检测。在这项研究中,我们应对"社区成员隐藏"这一挑战,即有策略地修改网络图的结构属性,使一个或多个节点无法被给定的社区检测算法识别。我们将该问题形式化为带约束的反事实图目标,并通过深度强化学习求解。我们通过节点欺骗和社区欺骗两个不同的任务验证了方法的有效性。大量实验表明,我们的方法在两个任务上总体均优于现有基线。

Welfare Diplomacy: Benchmarking Language Model Cooperation

  • paper_url: http://arxiv.org/abs/2310.08901
  • repo_url: https://github.com/mukobi/welfare-diplomacy
  • paper_authors: Gabriel Mukobi, Hannah Erlebach, Niklas Lauffer, Lewis Hammond, Alan Chan, Jesse Clifton
  • for: 本研究旨在提供一种更加robust的多代理系统测试工具,以便研究者可以更好地评估和培养多代理系统的合作能力。
  • methods: 本研究提出了零和桌游 Diplomacy 的一种通用和(general-sum)变体,称为"福利外交"(Welfare Diplomacy),其中玩家需要在军事扩张与国内福利投入之间进行权衡。
  • results: 实验结果显示,使用现有语言模型实现的基线代理可以达到较高的社会福利,但是它们可以被利用(exploitable)。我们的工作旨在促进社会安全,帮助研究者开发和评估多智能体系统。
    Abstract The growing capabilities and increasingly widespread deployment of AI systems necessitate robust benchmarks for measuring their cooperative capabilities. Unfortunately, most multi-agent benchmarks are either zero-sum or purely cooperative, providing limited opportunities for such measurements. We introduce a general-sum variant of the zero-sum board game Diplomacy -- called Welfare Diplomacy -- in which players must balance investing in military conquest and domestic welfare. We argue that Welfare Diplomacy facilitates both a clearer assessment of and stronger training incentives for cooperative capabilities. Our contributions are: (1) proposing the Welfare Diplomacy rules and implementing them via an open-source Diplomacy engine; (2) constructing baseline agents using zero-shot prompted language models; and (3) conducting experiments where we find that baselines using state-of-the-art models attain high social welfare but are exploitable. Our work aims to promote societal safety by aiding researchers in developing and assessing multi-agent AI systems. Code to evaluate Welfare Diplomacy and reproduce our experiments is available at https://github.com/mukobi/welfare-diplomacy.
    摘要 随着人工智能系统能力的增强和部署的日益广泛,我们需要稳健的基准来衡量它们的合作能力。然而,大多数多智能体基准要么是零和的,要么是纯合作的,能提供的此类衡量机会有限。我们提出了零和桌游 Diplomacy 的一种通用和(general-sum)变体——福利外交(Welfare Diplomacy)——在该游戏中,玩家必须在军事扩张投入与国内福利投入之间取得平衡。我们认为,福利外交既能更清晰地评估合作能力,也能为其提供更强的训练激励。我们的贡献包括:(1)提出福利外交的规则,并基于开源 Diplomacy 引擎加以实现;(2)使用零样本提示的语言模型构建基线代理;(3)开展实验,发现基于最先进模型的基线能够取得较高的社会福利,但可以被利用。我们的工作旨在通过帮助研究者开发和评估多智能体AI系统来促进社会安全。用于评估福利外交和复现实验的代码可在 https://github.com/mukobi/welfare-diplomacy 获取。
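To illustrate the general-sum twist, here is a hedged sketch of the welfare-point bookkeeping that the variant adds to standard Diplomacy. The field names and the exact accrual rule are assumptions based on the description above; the open-source engine linked in the entry is authoritative.

```python
# Sketch: each year a power banks the supply-center capacity it chose not to spend
# on units, and the final payoff is its accumulated welfare rather than conquest.
from dataclasses import dataclass

@dataclass
class Power:
    name: str
    supply_centers: int
    units: int
    welfare_points: int = 0

def end_of_year(powers: list[Power]) -> None:
    """Credit each power with its unspent supply-center capacity (assumed rule)."""
    for p in powers:
        p.welfare_points += max(p.supply_centers - p.units, 0)

def final_utilities(powers: list[Power]) -> dict[str, int]:
    """General-sum payoffs: every power's score is its own accumulated welfare."""
    return {p.name: p.welfare_points for p in powers}

if __name__ == "__main__":
    austria = Power("AUSTRIA", supply_centers=5, units=3)
    france = Power("FRANCE", supply_centers=6, units=6)
    end_of_year([austria, france])
    print(final_utilities([austria, france]))  # {'AUSTRIA': 2, 'FRANCE': 0}
```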

A Hybrid Transfer Learning Assisted Decision Support System for Accurate Prediction of Alzheimer Disease

  • paper_url: http://arxiv.org/abs/2310.08888
  • repo_url: None
  • paper_authors: Mahin Khan Mahadi, Abdullah Abdullah, Jamal Uddin, Asif Newaz
  • for: 这个研究旨在提高阿尔ツ海默病(AD)的早期诊断和预测,以便提高诊断和治疗的效果。
  • methods: 本研究使用深度学习技术,并提出了一种独特的策略来解决不均衡数据集分类问题。研究使用了五种传输学习模型和ensemble平均模型,并进行了模型微调。
  • results: 研究发现,使用合并平均模型和传输学习模型可以提高AD阶段多类分类的准确率,并达到了98.91%的最高精度。
    Abstract Alzheimer's disease (AD) is the most common long-term illness in elderly people. In recent years, deep learning has become popular in the area of medical imaging and has had a lot of success there. It has become the most effective way to look at medical images. When it comes to detecting AD, the deep neural model is more accurate and effective than general machine learning. Our research contributes to the development of a more comprehensive understanding and detection of the disease by identifying four distinct classes that are predictive of AD with a high weighted accuracy of 98.91%. A unique strategy has been proposed to improve the accuracy of the imbalance dataset classification problem via the combination of ensemble averaging models and five different transfer learning models in this study. EfficientNetB0+Resnet152(effnet+res152) and InceptionV3+EfficientNetB0+Resnet50(incep+effnet+res50) models have been fine-tuned and have reached the highest weighted accuracy for multi-class AD stage classifications.
    摘要 阿尔茨海默病(AD)是老年人群中最常见的长期疾病。近年来,深度学习在医学影像领域得到广泛应用并取得了诸多成功,已成为分析医学影像最有效的方法;在检测AD方面,深度神经网络模型的准确率和有效性也高于一般机器学习方法。我们的研究通过识别出四个可预测AD的类别,取得了98.91%的高加权准确率,有助于更全面地理解和检测该疾病。本研究还提出了一种独特的策略,将集成平均模型与五种迁移学习模型相结合,以改善不均衡数据集的分类问题。经过微调的 EfficientNetB0+Resnet152(effnet+res152)与 InceptionV3+EfficientNetB0+Resnet50(incep+effnet+res50)模型在多类AD阶段分类中取得了最高的加权准确率。
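A hedged sketch of the ensemble-averaging idea described above: two fine-tuned transfer-learning backbones vote by averaging their class probabilities over the four AD stages. The backbones come from torchvision for illustration; the paper's preprocessing, fine-tuning schedule, and class-imbalance handling are not reproduced here.

```python
# Ensemble averaging of fine-tuned backbones (effnet+res152 style), sketch only.
import torch
import torch.nn as nn
from torchvision import models  # requires torchvision >= 0.13 for weights=None

NUM_CLASSES = 4  # four predictive AD stage classes

def build_backbone(arch: str) -> nn.Module:
    if arch == "efficientnet_b0":
        m = models.efficientnet_b0(weights=None)
        m.classifier[1] = nn.Linear(m.classifier[1].in_features, NUM_CLASSES)
    elif arch == "resnet152":
        m = models.resnet152(weights=None)
        m.fc = nn.Linear(m.fc.in_features, NUM_CLASSES)
    else:
        raise ValueError(arch)
    return m

@torch.no_grad()
def ensemble_predict(members, images):
    """Average softmax probabilities across fine-tuned backbones, then take argmax."""
    probs = torch.stack([torch.softmax(m(images), dim=1) for m in members])
    return probs.mean(dim=0).argmax(dim=1)

if __name__ == "__main__":
    ensemble = [build_backbone("efficientnet_b0").eval(), build_backbone("resnet152").eval()]
    print(ensemble_predict(ensemble, torch.randn(2, 3, 224, 224)))
```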

METRA: Scalable Unsupervised RL with Metric-Aware Abstraction

  • paper_url: http://arxiv.org/abs/2310.08887
  • repo_url: https://github.com/seohongpark/metra
  • paper_authors: Seohong Park, Oleh Rybkin, Sergey Levine
  • for: 本研究旨在提出一种新的无监督强化学习目标,使无监督强化学习能够扩展到复杂的高维环境。
  • methods: 我们提出了度量感知抽象(METRA):它不直接覆盖整个状态空间,而是只覆盖一个通过时间距离与状态空间度量相连的紧凑潜在空间 Z;通过学习在该潜在空间中向各个方向移动,METRA 可以获得一组近似覆盖状态空间、易于处理的多样化行为,从而扩展到高维环境。
  • results: 我们在五个运动(locomotion)与操作(manipulation)环境中的实验表明,METRA 即使在复杂的基于像素的环境中也能发现多种有用行为,是首个在基于像素的 Quadruped 和 Humanoid 环境中发现多样化运动行为的无监督强化学习方法。
    Abstract Unsupervised pre-training strategies have proven to be highly effective in natural language processing and computer vision. Likewise, unsupervised reinforcement learning (RL) holds the promise of discovering a variety of potentially useful behaviors that can accelerate the learning of a wide array of downstream tasks. Previous unsupervised RL approaches have mainly focused on pure exploration and mutual information skill learning. However, despite the previous attempts, making unsupervised RL truly scalable still remains a major open challenge: pure exploration approaches might struggle in complex environments with large state spaces, where covering every possible transition is infeasible, and mutual information skill learning approaches might completely fail to explore the environment due to the lack of incentives. To make unsupervised RL scalable to complex, high-dimensional environments, we propose a novel unsupervised RL objective, which we call Metric-Aware Abstraction (METRA). Our main idea is, instead of directly covering the entire state space, to only cover a compact latent space $Z$ that is metrically connected to the state space $S$ by temporal distances. By learning to move in every direction in the latent space, METRA obtains a tractable set of diverse behaviors that approximately cover the state space, being scalable to high-dimensional environments. Through our experiments in five locomotion and manipulation environments, we demonstrate that METRA can discover a variety of useful behaviors even in complex, pixel-based environments, being the first unsupervised RL method that discovers diverse locomotion behaviors in pixel-based Quadruped and Humanoid. Our code and videos are available at https://seohong.me/projects/metra/
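A hedged sketch of the METRA objective for the representation function phi, as a simplified reading of the abstract: maximize the latent displacement along the sampled skill vector z, subject to a temporal-distance constraint that adjacent states stay within unit distance in the latent space, enforced with a Lagrange multiplier. Network definitions and the policy update (which would use (phi(s') - phi(s)) . z as intrinsic reward) are omitted, and the exact constraint slack in the paper may differ.

```python
# METRA-style constrained objective for the state encoder phi (sketch).
import torch

def metra_phi_loss(phi, s, s_next, z, log_lam, eps: float = 1e-3):
    """phi: state encoder, s/s_next: [B, obs_dim], z: [B, latent_dim] unit skill vectors,
    log_lam: learnable scalar log of the Lagrange multiplier."""
    delta = phi(s_next) - phi(s)                        # latent displacement
    objective = (delta * z).sum(dim=1).mean()           # move far along the sampled skill
    constraint = 1.0 - (delta ** 2).sum(dim=1)          # require ||delta||^2 <= 1
    lam = log_lam.exp()
    # phi maximizes (objective + lam * constraint); minimize the negative here.
    phi_loss = -(objective + lam.detach() * torch.clamp(constraint, max=eps).mean())
    # Dual update: lam rises when the constraint is violated, falls when it is slack.
    lam_loss = lam * torch.clamp(constraint, max=eps).mean().detach()
    return phi_loss, lam_loss
```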

Interactive Navigation in Environments with Traversable Obstacles Using Large Language and Vision-Language Models

  • paper_url: http://arxiv.org/abs/2310.08873
  • repo_url: None
  • paper_authors: Zhen Zhang, Anran Lin, Chun Wai Wong, Xiangyu Chu, Qi Dou, K. W. Samuel Au
  • for: This paper proposes an interactive navigation framework for robots to navigate in environments with traversable obstacles, using large language and vision-language models.
  • methods: The proposed framework utilizes a large language model (GPT-3.5) and an open-set Vision-language Model (Grounding DINO) to create an action-aware costmap for effective path planning without fine-tuning.
  • results: The proposed framework was effective and adaptable to diverse environments, as demonstrated by experimental results that included traversing curtains in a medical scenario.
    Abstract This paper proposes an interactive navigation framework by using large language and vision-language models, allowing robots to navigate in environments with traversable obstacles. We utilize the large language model (GPT-3.5) and the open-set Vision-language Model (Grounding DINO) to create an action-aware costmap to perform effective path planning without fine-tuning. With the large models, we can achieve an end-to-end system from textual instructions like "Can you pass through the curtains to deliver medicines to me?", to bounding boxes (e.g., curtains) with action-aware attributes. They can be used to segment LiDAR point clouds into two parts: traversable and untraversable parts, and then an action-aware costmap is constructed for generating a feasible path. The pre-trained large models have great generalization ability and do not require additional annotated data for training, allowing fast deployment in the interactive navigation tasks. We choose to use multiple traversable objects such as curtains and grasses for verification by instructing the robot to traverse them. Besides, traversing curtains in a medical scenario was tested. All experimental results demonstrated the proposed framework's effectiveness and adaptability to diverse environments.
    摘要 这篇论文提出了一种基于大语言模型和视觉语言模型的交互式导航框架,使机器人能够在存在可穿越障碍物的环境中导航。该框架使用大语言模型(GPT-3.5)和开放集视觉语言模型(Grounding DINO)构建动作感知的成本地图,从而在无需微调的情况下实现有效的路径规划。借助这些大模型,系统可以将诸如"你能否穿过窗帘把药递给我?"这样的文本指令,端到端地转化为带有动作感知属性的目标框(例如窗帘)。这些目标框可用于将LiDAR点云分割为可穿越与不可穿越两部分,进而构建动作感知成本地图以生成可行路径。预训练大模型具有很强的泛化能力,无需额外标注数据即可使用,因而可以快速部署到交互式导航任务中。为验证框架的有效性与适应性,作者让机器人穿越窗帘、草丛等多种可穿越物体,并测试了在医疗场景中穿越窗帘的能力。所有实验结果都表明了所提框架的有效性和对多样环境的适应性。

Adaptivity and Modularity for Efficient Generalization Over Task Complexity

  • paper_url: http://arxiv.org/abs/2310.08866
  • repo_url: None
  • paper_authors: Samira Abnar, Omid Saremi, Laurent Dinh, Shantel Wilson, Miguel Angel Bautista, Chen Huang, Vimal Thilak, Etai Littwin, Jiatao Gu, Josh Susskind, Samy Bengio
  • for: 本研究旨在评估 transformers 能否高效地泛化到不同难度的示例上。
  • methods: 我们引入了一种新任务,它是 Zhang et al. (2021) 提出的 pointer value retrieval 任务的变体。我们研究 transformers 中自适应与模块化计算机制如何帮助学习那些需要在串行计算步骤数量(即计算图深度)上进行泛化的任务。
  • results: 我们发现,使用 Hyper-UT 模型,即将 hyper networks 与 Universal Transformers 结合使用,可以提高准确率并均匀分配计算资源。此外,我们发现,在标准图像识别任务中,Hyper-UT 的性能与 ViT 模型相当,但具有许多更少的计算开销(可以减少计算步骤的数量,从而实现超过 70% 的平均成本减少)。
    Abstract Can transformers generalize efficiently on problems that require dealing with examples with different levels of difficulty? We introduce a new task tailored to assess generalization over different complexities and present results that indicate that standard transformers face challenges in solving these tasks. These tasks are variations of pointer value retrieval previously introduced by Zhang et al. (2021). We investigate how the use of a mechanism for adaptive and modular computation in transformers facilitates the learning of tasks that demand generalization over the number of sequential computation steps (i.e., the depth of the computation graph). Based on our observations, we propose a transformer-based architecture called Hyper-UT, which combines dynamic function generation from hyper networks with adaptive depth from Universal Transformers. This model demonstrates higher accuracy and a fairer allocation of computational resources when generalizing to higher numbers of computation steps. We conclude that mechanisms for adaptive depth and modularity complement each other in improving efficient generalization concerning example complexity. Additionally, to emphasize the broad applicability of our findings, we illustrate that in a standard image recognition task, Hyper- UT's performance matches that of a ViT model but with considerably reduced computational demands (achieving over 70\% average savings by effectively using fewer layers).
    摘要 transformers 能否在需要处理不同难度示例的问题上高效泛化?我们引入了一个专门用于评估跨复杂度泛化能力的新任务,它是 Zhang et al. (2021) 提出的指针值检索(pointer value retrieval)任务的变体。结果表明,标准 transformers 在求解这类任务时面临挑战。我们研究了在 transformers 中引入自适应、模块化计算机制如何帮助学习那些需要在串行计算步骤数量(即计算图深度)上泛化的任务。基于这些观察,我们提出了一种基于 transformer 的架构 Hyper-UT,它将超网络的动态函数生成与 Universal Transformers 的自适应深度相结合。该模型在泛化到更多计算步骤时表现出更高的准确率,并能更合理地分配计算资源。我们的结论是,自适应深度与模块化机制相辅相成,共同提升了针对示例复杂度的高效泛化能力。此外,为了说明这一发现的广泛适用性,我们在标准图像识别任务上验证了 Hyper-UT:其性能与 ViT 模型相当,但计算开销显著降低(通过有效使用更少的层,平均节省超过 70% 的计算成本)。

Adam-family Methods with Decoupled Weight Decay in Deep Learning

  • paper_url: http://arxiv.org/abs/2310.08858
  • repo_url: None
  • paper_authors: Kuangyu Ding, Nachuan Xiao, Kim-Chuan Toh
  • for: 本研究考察 Adam 家族方法在求解带二次正则项的非光滑非凸优化问题时的收敛性质,特别是在带权重衰减的非光滑神经网络训练中。
  • methods: 我们受 AdamW 方法启发,提出了一种权重衰减解耦的 Adam 家族方法新框架。在该框架中,随机次梯度的一阶矩和二阶矩估计量的更新与权重衰减项相互独立。在温和假设以及主优化变量步长不衰减的条件下,我们证明了所提框架的收敛性质。
  • results: 我们证明了所提框架涵盖了许多已知的 Adam 家族方法,从而为这些方法在训练非光滑神经网络时提供了收敛保证。此外,我们还证明了该框架会渐近逼近 SGD 方法,这解释了实践中观察到的解耦权重衰减能提升 Adam 家族方法泛化性能的现象。我们还基于该框架提出了一种新的 Adam 家族方法,名为 Adam with Decoupled Weight Decay (AdamD),并证明了它的收敛性质。实验表明,AdamD 在泛化性能和效率两方面都优于 Adam,并与 AdamW 相当。
    Abstract In this paper, we investigate the convergence properties of a wide class of Adam-family methods for minimizing quadratically regularized nonsmooth nonconvex optimization problems, especially in the context of training nonsmooth neural networks with weight decay. Motivated by the AdamW method, we propose a novel framework for Adam-family methods with decoupled weight decay. Within our framework, the estimators for the first-order and second-order moments of stochastic subgradients are updated independently of the weight decay term. Under mild assumptions and with non-diminishing stepsizes for updating the primary optimization variables, we establish the convergence properties of our proposed framework. In addition, we show that our proposed framework encompasses a wide variety of well-known Adam-family methods, hence offering convergence guarantees for these methods in the training of nonsmooth neural networks. More importantly, we show that our proposed framework asymptotically approximates the SGD method, thereby providing an explanation for the empirical observation that decoupled weight decay enhances generalization performance for Adam-family methods. As a practical application of our proposed framework, we propose a novel Adam-family method named Adam with Decoupled Weight Decay (AdamD), and establish its convergence properties under mild conditions. Numerical experiments demonstrate that AdamD outperforms Adam and is comparable to AdamW, in the aspects of both generalization performance and efficiency.
    摘要 在这篇论文中,我们研究了一大类 Adam 家族方法在求解带二次正则项的非光滑非凸优化问题时的收敛性质,特别是在带权重衰减的非光滑神经网络训练场景中。受 AdamW 方法启发,我们提出了一个权重衰减解耦的 Adam 家族方法新框架:在该框架中,随机次梯度一阶矩和二阶矩的估计量与权重衰减项相互独立地更新。在温和假设以及主优化变量步长不衰减的条件下,我们建立了所提框架的收敛性质。此外,我们证明了该框架涵盖了许多已知的 Adam 家族方法,从而为这些方法在训练非光滑神经网络时提供了收敛保证。更重要的是,我们证明了该框架会渐近逼近 SGD 方法,这为实践中观察到的"解耦权重衰减能提升 Adam 家族方法泛化性能"的现象提供了解释。作为该框架的实际应用,我们提出了一种名为 Adam with Decoupled Weight Decay (AdamD) 的新方法,并在温和条件下证明了其收敛性。数值实验表明,AdamD 在泛化性能和效率两方面都优于 Adam,并与 AdamW 相当。
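For concreteness, here is a hedged sketch of one Adam-family step with decoupled weight decay in the AdamW style that the proposed framework builds on: the moment estimators see only the stochastic (sub)gradient, and the decay term is applied to the parameters separately. The exact AdamD variant proposed in the paper may differ in details not visible in the abstract.

```python
# One Adam-style update with decoupled weight decay (AdamW-style sketch).
import numpy as np

def adam_decoupled_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
                        eps=1e-8, weight_decay=1e-2):
    """Single update; note that grad is NOT augmented with weight_decay * theta."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad            # first moment of the subgradient only
    v = b2 * v + (1 - b2) * grad ** 2       # second moment of the subgradient only
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # Adam direction
    theta = theta - lr * weight_decay * theta              # decay applied separately (decoupled)
    return theta, m, v
```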

Path To Gain Functional Transparency In Artificial Intelligence With Meaningful Explainability

  • paper_url: http://arxiv.org/abs/2310.08849
  • repo_url: None
  • paper_authors: Md. Tanzib Hosain, Mehedi Hasan Anik, Sadman Rafi, Rana Tabassum, Khaleque Insia, Md. Mehrab Siddiky
  • for: 这篇论文目的是提出一种用户参与的透明系统设计方法,以便开发透明和可解释的人工智能系统。
  • methods: 该论文使用多种方法,包括透明性、可解释性和社会价值观等多种方法,以满足不同领域的需求。
  • results: 该论文提出了一种用户参与的透明系统设计方法,可以帮助开发者开发透明和可解释的人工智能系统,并且可以满足不同领域的需求。
    Abstract Artificial Intelligence (AI) is rapidly integrating into various aspects of our daily lives, influencing decision-making processes in areas such as targeted advertising and matchmaking algorithms. As AI systems become increasingly sophisticated, ensuring their transparency and explainability becomes crucial. Functional transparency is a fundamental aspect of algorithmic decision-making systems, allowing stakeholders to comprehend the inner workings of these systems and enabling them to evaluate their fairness and accuracy. However, achieving functional transparency poses significant challenges that need to be addressed. In this paper, we propose a design for user-centered compliant-by-design transparency in transparent systems. We emphasize that the development of transparent and explainable AI systems is a complex and multidisciplinary endeavor, necessitating collaboration among researchers from diverse fields such as computer science, artificial intelligence, ethics, law, and social science. By providing a comprehensive understanding of the challenges associated with transparency in AI systems and proposing a user-centered design framework, we aim to facilitate the development of AI systems that are accountable, trustworthy, and aligned with societal values.
    摘要 人工智能(AI)正迅速融入我们日常生活的各个方面,影响着定向广告、匹配算法等领域的决策过程。随着AI系统日益复杂,确保其透明度和可解释性变得至关重要。功能透明度是算法决策系统的一项基本特征,它使利益相关者能够理解这些系统的内部工作机制,并据此评估其公平性与准确性。然而,实现功能透明度面临着诸多亟待解决的挑战。在这篇论文中,我们提出了一种以用户为中心、从设计上保证合规的透明系统设计。我们强调,开发透明且可解释的AI系统是一项复杂的多学科工作,需要计算机科学、人工智能、伦理、法律和社会科学等不同领域研究者的协作。通过全面阐述AI系统透明度相关的挑战并提出以用户为中心的设计框架,我们希望促进开发可问责、可信且符合社会价值观的AI系统。

A Case-Based Persistent Memory for a Large Language Model

  • paper_url: http://arxiv.org/abs/2310.08842
  • repo_url: None
  • paper_authors: Ian Watson
  • for: 这篇立场论文的核心观点是,CBR研究者在一定程度上忽视了深度学习和大语言模型(LLM)等现代人工智能技术的最新进展,应当给予更多关注。
  • methods: 论文指出,支撑近期AI突破的底层技术与CBR有很强的协同效应,可将二者结合使用。
  • results: 论文认为,将CBR与这些技术结合,可以为大语言模型提供一种持久记忆,从而推动向通用人工智能迈进。
    Abstract Case-based reasoning (CBR) as a methodology for problem-solving can use any appropriate computational technique. This position paper argues that CBR researchers have somewhat overlooked recent developments in deep learning and large language models (LLMs). The underlying technical developments that have enabled the recent breakthroughs in AI have strong synergies with CBR and could be used to provide a persistent memory for LLMs to make progress towards Artificial General Intelligence.

Leveraging Optimal Transport for Enhanced Offline Reinforcement Learning in Surgical Robotic Environments

  • paper_url: http://arxiv.org/abs/2310.08841
  • repo_url: None
  • paper_authors: Maryam Zare, Parham M. Kebria, Abbas Khosravi
  • for: 本研究旨在开发一种能够在无实时互动的情况下进行学习控制的方法,以降低成本和安全隐患,并且可以利用现有的数据集来进行学习。
  • methods: 本研究使用基于最优传输(Optimal Transport)的奖励标注(OTR)算法,利用少量高质量专家示范,高效地将无标注轨迹与专家示范进行对齐,从而计算出有效的奖励信号。
  • results: 研究表明,使用OTR算法可以快速和高效地学习控制策略,并且不需要手动设计奖励函数。此外,研究还表明了OTR算法的 universality 和可重用性,可以在不同领域中进行应用。
    Abstract Most Reinforcement Learning (RL) methods are traditionally studied in an active learning setting, where agents directly interact with their environments, observe action outcomes, and learn through trial and error. However, allowing partially trained agents to interact with real physical systems poses significant challenges, including high costs, safety risks, and the need for constant supervision. Offline RL addresses these cost and safety concerns by leveraging existing datasets and reducing the need for resource-intensive real-time interactions. Nevertheless, a substantial challenge lies in the demand for these datasets to be meticulously annotated with rewards. In this paper, we introduce Optimal Transport Reward (OTR) labelling, an innovative algorithm designed to assign rewards to offline trajectories, using a small number of high-quality expert demonstrations. The core principle of OTR involves employing Optimal Transport (OT) to calculate an optimal alignment between an unlabeled trajectory from the dataset and an expert demonstration. This alignment yields a similarity measure that is effectively interpreted as a reward signal. An offline RL algorithm can then utilize these reward signals to learn a policy. This approach circumvents the need for handcrafted rewards, unlocking the potential to harness vast datasets for policy learning. Leveraging the SurRoL simulation platform tailored for surgical robot learning, we generate datasets and employ them to train policies using the OTR algorithm. By demonstrating the efficacy of OTR in a different domain, we emphasize its versatility and its potential to expedite RL deployment across a wide range of fields.
    摘要 大多数强化学习(RL)方法通常在主动学习的设定下研究:智能体直接与环境交互、观察动作结果,并通过试错进行学习。然而,让训练尚未完成的智能体与真实物理系统交互会带来高成本、安全风险以及需要持续监督等重大挑战。离线RL通过利用已有数据集、减少资源密集的实时交互来缓解成本和安全方面的顾虑,但其巨大挑战在于这些数据集需要带有细致标注的奖励。在这篇论文中,我们介绍了最优传输奖励(Optimal Transport Reward, OTR)标注算法,它只需少量高质量专家示范,即可为离线轨迹分配奖励。OTR的核心思想是利用最优传输(OT)计算数据集中未标注轨迹与专家示范之间的最优对齐;该对齐给出一个相似度度量,可以被有效地解释为奖励信号。随后,离线RL算法即可利用这些奖励信号学习策略。这种方法绕过了手工设计奖励的需要,释放了利用海量数据集进行策略学习的潜力。我们基于面向手术机器人学习的SurRoL仿真平台生成数据集,并使用OTR算法训练策略。通过在不同领域验证OTR的有效性,我们强调了其通用性及其在更广泛领域加速RL部署的潜力。
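A hedged sketch of Optimal Transport Reward labelling: align each unlabeled offline trajectory with an expert demonstration via OT and read the (negated) per-step transport cost back as a reward. The sketch uses the POT library; the cost function, preprocessing, and solver settings are illustrative and may differ from the paper.

```python
# OT-based reward labelling for an offline trajectory (sketch).
import numpy as np
import ot  # pip install pot

def otr_rewards(traj_obs: np.ndarray, expert_obs: np.ndarray, reg: float = 0.05):
    """traj_obs: [T, d] unlabeled trajectory; expert_obs: [T_e, d] expert demonstration.
    Returns a length-T array of reward labels."""
    cost = ot.dist(traj_obs, expert_obs, metric="euclidean")   # pairwise cost matrix [T, T_e]
    a = np.full(len(traj_obs), 1.0 / len(traj_obs))            # uniform marginals
    b = np.full(len(expert_obs), 1.0 / len(expert_obs))
    plan = ot.sinkhorn(a, b, cost, reg)                        # entropic OT alignment [T, T_e]
    step_cost = (plan * cost).sum(axis=1)                      # transport cost attributed to each step
    return -step_cost                                          # lower alignment cost -> higher reward

# An offline RL algorithm can then be trained on transitions (s, a, r=otr_rewards(...), s').
```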

Large Language Models as Source Planner for Personalized Knowledge-grounded Dialogue

  • paper_url: http://arxiv.org/abs/2310.08840
  • repo_url: None
  • paper_authors: Hongru Wang, Minda Hu, Yang Deng, Rui Wang, Fei Mi, Weichao Wang, Yasheng Wang, Wai-Chung Kwan, Irwin King, Kam-Fai Wong
  • for: 实现开放领域对话系统,需要不同的知识来生成更加详细和证据性的回答。
  • methods: 我们提出了SAFARI框架,利用大型自然语言模型(LLM)的观察、理解和实现能力,在指导下和无指导下的设定下运作。
  • results: 我们在 KBP 数据集上进行实验,结果表明 SAFARI 框架可以生成与人设一致且知识增强的回复。
    Abstract Open-domain dialogue system usually requires different sources of knowledge to generate more informative and evidential responses. However, existing knowledge-grounded dialogue systems either focus on a single knowledge source or overlook the dependency between multiple sources of knowledge, which may result in generating inconsistent or even paradoxical responses. To incorporate multiple knowledge sources and dependencies between them, we propose SAFARI, a novel framework that leverages the exceptional capabilities of large language models (LLMs) in planning, understanding, and incorporating under both supervised and unsupervised settings. Specifically, SAFARI decouples the knowledge grounding into multiple sources and response generation, which allows easy extension to various knowledge sources including the possibility of not using any sources. To study the problem, we construct a personalized knowledge-grounded dialogue dataset Knowledge Behind Persona (KBP), which is the first to consider the dependency between persona and implicit knowledge. Experimental results on the KBP dataset demonstrate that the SAFARI framework can effectively produce persona-consistent and knowledge-enhanced responses.
    摘要 开放域对话系统通常需要多种知识来源来生成更具信息量、更有依据的回复。然而,现有的知识驱动对话系统要么只关注单一知识来源,要么忽视多种知识来源之间的依赖关系,从而可能生成不一致甚至自相矛盾的回复。为整合多种知识来源及其依赖关系,我们提出了SAFARI框架,它利用大语言模型(LLM)在规划、理解与整合方面的出色能力,可在有监督与无监督两种设置下工作。具体而言,SAFARI将面向多来源的知识接地与回复生成相解耦,从而可以方便地扩展到各种知识来源,包括不使用任何来源的情形。为研究该问题,我们构建了个性化知识驱动对话数据集 Knowledge Behind Persona (KBP),这是首个考虑人设与隐式知识之间依赖关系的数据集。在KBP数据集上的实验结果表明,SAFARI框架能够有效生成与人设一致且知识增强的回复。

A Framework for Few-Shot Policy Transfer through Observation Mapping and Behavior Cloning

  • paper_url: http://arxiv.org/abs/2310.08836
  • repo_url: https://github.com/shukla-yash/few-shot-policy-transfer
  • paper_authors: Yash Shukla, Bharat Kesari, Shivam Goel, Robert Wright, Jivko Sinapov
  • for: 降低人工交互成本,增进机器人应用中的学习效率
  • methods: 使用Generative Adversarial Networks (GANs)和循环一致损失来映射源领域和目标领域的观察,然后使用这些学习的映射来复制源任务成功的政策到目标领域
  • results: 成功实现了少样本(few-shot)策略迁移,并在源任务与目标任务语义不相似的情况下也取得了良好的结果。
    Abstract Despite recent progress in Reinforcement Learning for robotics applications, many tasks remain prohibitively difficult to solve because of the expensive interaction cost. Transfer learning helps reduce the training time in the target domain by transferring knowledge learned in a source domain. Sim2Real transfer helps transfer knowledge from a simulated robotic domain to a physical target domain. Knowledge transfer reduces the time required to train a task in the physical world, where the cost of interactions is high. However, most existing approaches assume exact correspondence in the task structure and the physical properties of the two domains. This work proposes a framework for Few-Shot Policy Transfer between two domains through Observation Mapping and Behavior Cloning. We use Generative Adversarial Networks (GANs) along with a cycle-consistency loss to map the observations between the source and target domains and later use this learned mapping to clone the successful source task behavior policy to the target domain. We observe successful behavior policy transfer with limited target task interactions and in cases where the source and target task are semantically dissimilar.
    摘要 尽管强化学习在机器人应用上近期取得了进展,许多任务由于交互成本高昂仍然难以解决。迁移学习通过把在源领域学到的知识迁移到目标领域,从而缩短目标领域的训练时间;Sim2Real迁移则可将仿真机器人领域的知识迁移到物理目标领域,从而减少在交互成本高昂的物理世界中训练任务所需的时间。然而,大多数现有方法都假设两个领域在任务结构和物理特性上严格对应。本文提出了一种基于观察映射(Observation Mapping)和行为克隆(Behavior Cloning)的少样本策略迁移框架:我们使用生成对抗网络(GANs)和循环一致性损失来映射源领域与目标领域的观察,随后利用学到的映射将源任务的成功行为策略克隆到目标领域。我们观察到,在目标任务交互次数有限、甚至源任务与目标任务语义不相似的情况下,行为策略仍能成功迁移。

Urban Drone Navigation: Autoencoder Learning Fusion for Aerodynamics

  • paper_url: http://arxiv.org/abs/2310.08830
  • repo_url: None
  • paper_authors: Jiaohao Wu, Yang Ye, Jing Du
  • for: 这篇论文是为了提高城市紧急搜救(SAR)中无人机的导航而写的。
  • methods: 这篇论文结合了多目标强化学习(MORL)和卷积自编码器来改进无人机在城市SAR中的导航。
  • results: 测试在纽约市模型上,这种方法可以提高无人机的导航决策、优化路径和对风效应的应对,从而提高城市SAR操作的效率和精度。
    Abstract Drones are vital for urban emergency search and rescue (SAR) due to the challenges of navigating dynamic environments with obstacles like buildings and wind. This paper presents a method that combines multi-objective reinforcement learning (MORL) with a convolutional autoencoder to improve drone navigation in urban SAR. The approach uses MORL to achieve multiple goals and the autoencoder for cost-effective wind simulations. By utilizing imagery data of urban layouts, the drone can autonomously make navigation decisions, optimize paths, and counteract wind effects without traditional sensors. Tested on a New York City model, this method enhances drone SAR operations in complex urban settings.
    摘要 由于城市环境中存在建筑物、风场等障碍与动态因素,无人机在城市紧急搜救(SAR)中的导航至关重要。本文提出了一种将多目标强化学习(MORL)与卷积自编码器相结合的方法,用于改进无人机在城市SAR中的导航:MORL用于同时实现多个目标,自编码器则用于低成本的风场模拟。借助城市布局的图像数据,无人机可以在不依赖传统传感器的情况下自主做出导航决策、优化路径并抵消风的影响。在纽约市模型上的测试表明,该方法能够提升复杂城市环境下的无人机SAR作业。

Distance-rank Aware Sequential Reward Learning for Inverse Reinforcement Learning with Sub-optimal Demonstrations

  • paper_url: http://arxiv.org/abs/2310.08823
  • repo_url: None
  • paper_authors: Lu Li, Yuxin Pan, Ruobing Chen, Jie Liu, Zilin Wang, Yu Liu, Zhiheng Li
  • for: 这篇论文主要目标是解决 inverse reinforcement learning(IRL)中的奖励函数学习问题,即从收集到的专家示范数据中提取出奖励函数。
  • methods: 该论文提出了名为 Distance-rank Aware Sequential Reward Learning(DRASRL)的框架,同时考虑轨迹的排名及其差异程度,以协同消除奖励函数的歧义。DRASRL 以生成轨迹的策略之间的距离来度量轨迹差异,并借助对比学习技术来学习奖励信号。
  • results: 经过大量的实验,DRASRL 比前一个最佳方法(SOTA)表现出了显著的性能提升。
    Abstract Inverse reinforcement learning (IRL) aims to explicitly infer an underlying reward function based on collected expert demonstrations. Considering that obtaining expert demonstrations can be costly, the focus of current IRL techniques is on learning a better-than-demonstrator policy using a reward function derived from sub-optimal demonstrations. However, existing IRL algorithms primarily tackle the challenge of trajectory ranking ambiguity when learning the reward function. They overlook the crucial role of considering the degree of difference between trajectories in terms of their returns, which is essential for further removing reward ambiguity. Additionally, it is important to note that the reward of a single transition is heavily influenced by the context information within the trajectory. To address these issues, we introduce the Distance-rank Aware Sequential Reward Learning (DRASRL) framework. Unlike existing approaches, DRASRL takes into account both the ranking of trajectories and the degrees of dissimilarity between them to collaboratively eliminate reward ambiguity when learning a sequence of contextually informed reward signals. Specifically, we leverage the distance between policies, from which the trajectories are generated, as a measure to quantify the degree of differences between traces. This distance-aware information is then used to infer embeddings in the representation space for reward learning, employing the contrastive learning technique. Meanwhile, we integrate the pairwise ranking loss function to incorporate ranking information into the latent features. Moreover, we resort to the Transformer architecture to capture the contextual dependencies within the trajectories in the latent space, leading to more accurate reward estimation. Through extensive experimentation, our DRASRL framework demonstrates significant performance improvements over previous SOTA methods.
    摘要 逆强化学习(IRL)旨在根据收集到的专家示范显式推断潜在的奖励函数。考虑到获取专家示范代价高昂,当前IRL技术的重点是利用从次优示范中导出的奖励函数学习出优于示范者的策略。然而,现有IRL算法主要解决学习奖励函数时的轨迹排名歧义问题,却忽视了轨迹回报差异程度这一对进一步消除奖励歧义至关重要的因素。此外,单个转移的奖励还深受轨迹内上下文信息的影响。为解决这些问题,我们提出了距离-排名感知的序列奖励学习(DRASRL)框架。与现有方法不同,DRASRL同时考虑轨迹的排名及其差异程度,在学习一系列上下文感知的奖励信号时协同消除奖励歧义。具体而言,我们以生成轨迹的策略之间的距离来量化轨迹之间的差异程度,并借助对比学习技术,利用这种距离感知信息在表示空间中学习用于奖励学习的嵌入;同时,我们引入成对排名损失函数,将排名信息融入潜在特征。此外,我们采用Transformer架构在潜在空间中捕捉轨迹内的上下文依赖,从而获得更准确的奖励估计。大量实验表明,DRASRL框架相比此前的最先进方法取得了显著的性能提升。
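A hedged sketch of the pairwise ranking component described above, in the Bradley-Terry / T-REX style: the learned reward should assign a higher return to the trajectory known to be better ranked. DRASRL additionally uses distance-aware contrastive embeddings and a Transformer encoder, which are not shown here.

```python
# Pairwise trajectory ranking loss for reward learning (sketch).
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_net, traj_better, traj_worse):
    """traj_*: [T, obs_dim] tensors; reward_net maps states to per-step rewards."""
    ret_better = reward_net(traj_better).sum()
    ret_worse = reward_net(traj_worse).sum()
    logits = torch.stack([ret_better, ret_worse])
    # Cross-entropy with the better-ranked trajectory as the target class.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```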

Exploring the relationship between response time sequence in scale answering process and severity of insomnia: a machine learning approach

  • paper_url: http://arxiv.org/abs/2310.08817
  • repo_url: None
  • paper_authors: Zhao Su, Rongxun Liu, Keyin Zhou, Xinru Wei, Ning Wang, Zexin Lin, Yuanchen Xie, Jie Wang, Fei Wang, Shenzhong Zhang, Xizhe Zhang
  • for: investigate the relationship between insomnia and response time, and develop a machine learning model to predict the presence of insomnia in participants using response time data.
  • methods: collected response time data from 2729 participants using a mobile application, and explored the relationship between symptom severity and response time at the individual questions level.
  • results: found a statistically significant difference (p<.001) in the total response time between participants with or without insomnia symptoms, and demonstrated a high predictive accuracy of 0.743 in predicting insomnia symptoms based on response time data.
    Abstract Objectives: The study aims to investigate the relationship between insomnia and response time. Additionally, it aims to develop a machine learning model to predict the presence of insomnia in participants using response time data. Methods: A mobile application was designed to administer scale tests and collect response time data from 2729 participants. The relationship between symptom severity and response time was explored, and a machine learning model was developed to predict the presence of insomnia. Results: The result revealed a statistically significant difference (p<.001) in the total response time between participants with or without insomnia symptoms. A correlation was observed between the severity of specific insomnia aspects and response times at the individual questions level. The machine learning model demonstrated a high predictive accuracy of 0.743 in predicting insomnia symptoms based on response time data. Conclusions: These findings highlight the potential utility of response time data to evaluate cognitive and psychological measures, demonstrating the effectiveness of using response time as a diagnostic tool in the assessment of insomnia.
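A hedged sketch of the kind of model the study describes: predicting the presence of insomnia symptoms from per-question response times. The feature layout, model family, and hyperparameters are illustrative assumptions (the paper reports a predictive accuracy of 0.743); the synthetic arrays below are placeholders to be replaced with real scale data.

```python
# Response-time-based insomnia classifier (sketch with synthetic placeholder data).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=1.5, size=(300, 8))  # response times (s) for 8 scale items
y = rng.integers(0, 2, size=300)                    # 1 = insomnia symptoms present

clf = GradientBoostingClassifier(random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())      # swap in the real per-question response times
```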

DexCatch: Learning to Catch Arbitrary Objects with Dexterous Hands

  • paper_url: http://arxiv.org/abs/2310.08809
  • repo_url: None
  • paper_authors: Fengbo Lan, Shengjie Wang, Yunzhe Zhang, Haotian Xu, Oluwatosin Oseni, Yang Gao, Tao Zhang
  • for: 提高机器人人工智能的灵活抓取能力,增加抓取速度而不需要将物品运送到目的地。
  • methods: 使用Stability-Constrained Reinforcement Learning(SCRL)算法学习捕捉多种物品的灵活抓取能力。
  • results: SCRL算法大幅优于基线方法,学到的策略具有很强的零样本迁移性能;即使在最具挑战性的任务中——例如手心朝向侧方、物体因缺乏手掌支撑而极不稳定的情形——仍能取得较高的成功率。
    Abstract Achieving human-like dexterous manipulation remains a crucial area of research in robotics. Current research focuses on improving the success rate of pick-and-place tasks. Compared with pick-and-place, throw-catching behavior has the potential to increase picking speed without transporting objects to their destination. However, dynamic dexterous manipulation poses a major challenge for stable control due to a large number of dynamic contacts. In this paper, we propose a Stability-Constrained Reinforcement Learning (SCRL) algorithm to learn to catch diverse objects with dexterous hands. The SCRL algorithm outperforms baselines by a large margin, and the learned policies show strong zero-shot transfer performance on unseen objects. Remarkably, even though the object in a hand facing sideward is extremely unstable due to the lack of support from the palm, our method can still achieve a high level of success in the most challenging task. Video demonstrations of learned behaviors and the code can be found on the supplementary website.
    摘要 实现类人的灵巧操作仍是机器人学研究的关键方向。当前研究主要关注提高抓取-放置(pick-and-place)任务的成功率;相比之下,投掷-接取行为有望在无需把物体运送到目的地的情况下提升拾取速度。然而,动态灵巧操作涉及大量动态接触,对稳定控制构成了重大挑战。在这篇论文中,我们提出了稳定性约束强化学习(Stability-Constrained Reinforcement Learning, SCRL)算法,用于学习以灵巧手接取多种物体。SCRL 大幅优于基线方法,学到的策略在未见过的物体上表现出很强的零样本迁移能力。尤为突出的是,即使在手心朝向侧方、物体因缺乏手掌支撑而极不稳定的最困难任务中,我们的方法仍能取得较高的成功率。学到行为的视频演示和代码可在补充网站上获取。

Advancing Perception in Artificial Intelligence through Principles of Cognitive Science

  • paper_url: http://arxiv.org/abs/2310.08803
  • repo_url: None
  • paper_authors: Palaash Agrawal, Cheston Tan, Heena Rathore
  • for: 这篇评论文章的目的是探讨人工智能(AI)研究中的核心问题和缺陷,以及如何通过学习 cognitive science 来解决这些问题。
  • methods: 本文使用 cognitive science 的不同领域(如神经科学、心理学和语言学)的理论和技术,对 AI 系统的设计和实现进行了比较和分析。
  • results: 本文对 AI 系统的性能和资源利用进行了评估,并指出了现有 AI 系统中的多个缺陷和潜在的发展方向。
    Abstract Although artificial intelligence (AI) has achieved many feats at a rapid pace, there still exist open problems and fundamental shortcomings related to performance and resource efficiency. Since AI researchers benchmark a significant proportion of performance standards through human intelligence, cognitive sciences-inspired AI is a promising domain of research. Studying cognitive science can provide a fresh perspective to building fundamental blocks in AI research, which can lead to improved performance and efficiency. In this review paper, we focus on the cognitive functions of perception, which is the process of taking signals from one's surroundings as input, and processing them to understand the environment. Particularly, we study and compare its various processes through the lens of both cognitive sciences and AI. Through this study, we review all current major theories from various sub-disciplines of cognitive science (specifically neuroscience, psychology and linguistics), and draw parallels with theories and techniques from current practices in AI. We, hence, present a detailed collection of methods in AI for researchers to build AI systems inspired by cognitive science. Further, through the process of reviewing the state of cognitive-inspired AI, we point out many gaps in the current state of AI (with respect to the performance of the human brain), and hence present potential directions for researchers to develop better perception systems in AI.
    摘要 尽管人工智能(AI)在短时间内取得了许多成就,但在性能与资源效率方面仍存在开放问题和根本性不足。由于AI研究者常以人类智能作为性能基准,受认知科学启发的AI是一个很有前景的研究方向:研究认知科学可以为构建AI的基础模块提供全新视角,从而提升性能与效率。在这篇综述中,我们聚焦于感知这一认知功能,即接收环境信号并加以处理以理解周围环境的过程。我们从认知科学与AI两个视角研究并比较感知的各个过程,回顾了神经科学、心理学和语言学等认知科学子领域的主要理论,并将其与当前AI实践中的理论和技术进行对照,为研究者构建受认知科学启发的AI系统提供了一份详尽的方法汇总。通过对认知启发AI现状的梳理,我们指出了当前AI系统相对于人脑表现的诸多差距,并据此提出了研究者开发更好感知系统的潜在方向。

Impact of Guidance and Interaction Strategies for LLM Use on Learner Performance and Perception

  • paper_url: http://arxiv.org/abs/2310.13712
  • repo_url: None
  • paper_authors: Harsh Kumar, Ilya Musabirov, Mohi Reza, Jiakai Shi, Anastasia Kuzminykh, Joseph Jay Williams, Michael Liut
  • for: 这个论文旨在探讨在班级规模不断增长、教师难以直接顾及每位学生的情况下,个性化聊天机器人教学助手的重要作用。
  • methods: 本研究采用了四种教学意识指导策略,并通过在大学计算机科学课堂(N=145)和Prolific平台(N=356)进行了形成性研究和控制性试验,以探讨学生和LLM之间的互动对学生的参与度和成绩产生的影响。
  • results: 研究发现,直接给出LLM答案能略微提高学生的表现,而让LLM细化学生自己的解答则能增强学生对LLM的信任。这些结果表明,所提供的引导与LLM所扮演的角色(直接作答或细化学生输入)之间存在微妙的关系,需要同时考虑学生的做法和LLM的响应。
    Abstract Personalized chatbot-based teaching assistants can be crucial in addressing increasing classroom sizes, especially where direct teacher presence is limited. Large language models (LLMs) offer a promising avenue, with increasing research exploring their educational utility. However, the challenge lies not only in establishing the efficacy of LLMs but also in discerning the nuances of interaction between learners and these models, which impact learners' engagement and results. We conducted a formative study in an undergraduate computer science classroom (N=145) and a controlled experiment on Prolific (N=356) to explore the impact of four pedagogically informed guidance strategies and the interaction between student approaches and LLM responses. Direct LLM answers marginally improved performance, while refining student solutions fostered trust. Our findings suggest a nuanced relationship between the guidance provided and LLM's role in either answering or refining student input. Based on our findings, we provide design recommendations for optimizing learner-LLM interactions.
    摘要 在班级规模不断扩大、教师难以直接顾及每位学生的情况下,基于聊天机器人的个性化教学助手可能至关重要。大语言模型(LLM)提供了一条有前景的途径,其教育用途正受到越来越多的研究关注。然而,挑战不仅在于验证LLM的有效性,还在于辨析学习者与这些模型交互中的细微差别,因为这些差别会影响学习者的投入程度和学习效果。我们在一门本科计算机科学课程(N=145)中开展了形成性研究,并在Prolific平台上进行了对照实验(N=356),以探究四种教学导向的引导策略以及学生做法与LLM响应之间的交互影响。直接给出LLM答案使成绩略有提升,而让LLM细化学生自己的解答则增进了信任。我们的结果表明,所提供的引导与LLM扮演的角色(作答或细化学生输入)之间存在微妙的关系。基于这些发现,我们给出了优化学习者与LLM交互的设计建议。

DDMT: Denoising Diffusion Mask Transformer Models for Multivariate Time Series Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.08800
  • repo_url: None
  • paper_authors: Chaocheng Yang, Tingyin Wang, Xuanhui Yan
  • for: 这个研究是为了解决多重时间序列资料中的异常探测问题,并且具有广泛的应用前景,如诈欺探测、机件诊断和系统状态估计。
  • methods: 本研究提出了一个名为DDMT的新框架,它结合了对称预测器和混沌传播模型,并且引入了适应动态邻域面组件(ADNM)来减少输入和输出特征之间的信息泄露问题。
  • results: 实验结果显示,DDMT模型能够有效地检测时间序列资料中的异常,并且在多个公开ailable的多重时间序列异常探测数据集上实现了顶尖性能。
    Abstract Anomaly detection in multivariate time series has emerged as a crucial challenge in time series research, with significant research implications in various fields such as fraud detection, fault diagnosis, and system state estimation. Reconstruction-based models have shown promising potential in recent years for detecting anomalies in time series data. However, due to the rapid increase in data scale and dimensionality, the issues of noise and Weak Identity Mapping (WIM) during time series reconstruction have become increasingly pronounced. To address this, we introduce a novel Adaptive Dynamic Neighbor Mask (ADNM) mechanism and integrate it with the Transformer and Denoising Diffusion Model, creating a new framework for multivariate time series anomaly detection, named Denoising Diffusion Mask Transformer (DDMT). The ADNM module is introduced to mitigate information leakage between input and output features during data reconstruction, thereby alleviating the problem of WIM during reconstruction. The Denoising Diffusion Transformer (DDT) employs the Transformer as an internal neural network structure for Denoising Diffusion Model. It learns the stepwise generation process of time series data to model the probability distribution of the data, capturing normal data patterns and progressively restoring time series data by removing noise, resulting in a clear recovery of anomalies. To the best of our knowledge, this is the first model that combines Denoising Diffusion Model and the Transformer for multivariate time series anomaly detection. Experimental evaluations were conducted on five publicly available multivariate time series anomaly detection datasets. The results demonstrate that the model effectively identifies anomalies in time series data, achieving state-of-the-art performance in anomaly detection.
    摘要 multivariate时序数据异常检测已成为时序研究中的关键挑战,具有各种领域的研究意义,如诈骗检测、机件诊断和系统状态估计。基于重建模型在过去几年中表现出了潜在的潜力,但由于数据规模和维度的快速增长,时序重建过程中的噪声和弱同步映射(WIM)问题已经变得越来越突出。为解决这个问题,我们提出了一种新的自适应动态邻域面罩(ADNM)机制,并与Transformer和去噪扩散模型(DDM)结合,构建了一个新的多变量时序异常检测框架,名为去噪扩散面罩Transformer(DDMT)。ADNM模块的引入可以避免输入和输出特征之间的信息泄露,从而解决重建过程中的WIM问题。DDT使用Transformer作为内置神经网络结构,学习时序数据生成过程的步骤性质,模型时序数据的概率分布,捕捉正常数据模式,逐步除噪,使时序数据进行明确的异常检测。据我们知道,这是首次将Denosing Diffusion Model和Transformer结合以进行多变量时序异常检测。我们在五个公开的多变量时序异常检测数据集上进行了实验评估,结果表明,模型可以有效地检测时序数据中的异常。

A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models

  • paper_url: http://arxiv.org/abs/2310.08797
  • repo_url: None
  • paper_authors: Takuma Udagawa, Aashka Trivedi, Michele Merler, Bishwaranjan Bhattacharjee
  • for: 本研究旨在探讨如何通过知识传递提高Transformer语言模型的效率,而不失效iveness。
  • methods: 本研究使用了输出分布(OD)传递、隐藏状态(HS)传递和多头注意力(MHA)传递等方法,并对不同的学生架构进行了广泛的实验研究。
  • results: 研究发现,基于MiniLMv2的MHA传递方法在各种学生架构中表现最佳,而HS传递方法在一些复杂的层映射策略下表现最佳,OD传递方法则一直落后于其他方法。这些发现有助于我们在响应时间 crítical的应用中部署高效 yet effective的学生模型。
    Abstract Large language models have become a vital component in modern NLP, achieving state of the art performance in a variety of tasks. However, they are often inefficient for real-world deployment due to their expensive inference costs. Knowledge distillation is a promising technique to improve their efficiency while retaining most of their effectiveness. In this paper, we reproduce, compare and analyze several representative methods for task-agnostic (general-purpose) distillation of Transformer language models. Our target of study includes Output Distribution (OD) transfer, Hidden State (HS) transfer with various layer mapping strategies, and Multi-Head Attention (MHA) transfer based on MiniLMv2. Through our extensive experiments, we study the effectiveness of each method for various student architectures in both monolingual (English) and multilingual settings. Overall, we show that MHA transfer based on MiniLMv2 is generally the best option for distillation and explain the potential reasons behind its success. Moreover, we show that HS transfer remains as a competitive baseline, especially under a sophisticated layer mapping strategy, while OD transfer consistently lags behind other approaches. Findings from this study helped us deploy efficient yet effective student models for latency-critical applications.
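A hedged sketch of MiniLMv2-style multi-head self-attention relation transfer, which the comparison above finds to be the strongest general-purpose recipe: for queries, keys, and values separately, teacher and student relation distributions (softmaxed scaled dot-products among the same projection) are matched with a KL divergence. Head counts, layer selection, and other details follow the original MiniLMv2 recipe and are simplified here; the hidden size must be divisible by the relation-head count.

```python
# MiniLMv2-style attention relation distillation loss (sketch).
import torch
import torch.nn.functional as F

def relation_distributions(x: torch.Tensor, num_relation_heads: int) -> torch.Tensor:
    """x: [batch, seq, hidden] query/key/value states. Returns [batch, heads, seq, seq]."""
    b, s, h = x.shape
    d = h // num_relation_heads
    x = x.view(b, s, num_relation_heads, d).permute(0, 2, 1, 3)   # [b, heads, seq, d]
    scores = torch.matmul(x, x.transpose(-1, -2)) / (d ** 0.5)     # same-projection relations
    return F.softmax(scores, dim=-1)

def mha_relation_loss(teacher_qkv, student_qkv, num_relation_heads: int = 48):
    """teacher_qkv/student_qkv: dicts with 'q', 'k', 'v' hidden states from one chosen layer."""
    loss = 0.0
    for name in ("q", "k", "v"):
        t = relation_distributions(teacher_qkv[name], num_relation_heads)
        s = relation_distributions(student_qkv[name], num_relation_heads)
        loss = loss + F.kl_div(s.clamp_min(1e-9).log(), t, reduction="batchmean")
    return loss / 3.0
```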

Mitigating Bias for Question Answering Models by Tracking Bias Influence

  • paper_url: http://arxiv.org/abs/2310.08795
  • repo_url: None
  • paper_authors: Mingyu Derek Ma, Jiun-Yu Kao, Arpit Gupta, Yu-Hsiang Lin, Wenbo Zhao, Tagyoung Chung, Wei Wang, Kai-Wei Chang, Nanyun Peng
  • for: 本文旨在提出一种方法来减少多选问答模型中的偏见。
  • methods: 本文使用了一种基于偏见度量的多任务学习方法来减少偏见。具体来说,我们计算了每个查询实例的偏见水平,并使用这些水平作为多任务学习的优化目标。
  • results: 我们的方法可以在多个偏见类别中减少 BBQ 数据集中的偏见水平,而无需损失问答准确率。
    Abstract Models of various NLP tasks have been shown to exhibit stereotypes, and the bias in the question answering (QA) models is especially harmful as the output answers might be directly consumed by the end users. There have been datasets to evaluate bias in QA models, while bias mitigation technique for the QA models is still under-explored. In this work, we propose BMBI, an approach to mitigate the bias of multiple-choice QA models. Based on the intuition that a model would lean to be more biased if it learns from a biased example, we measure the bias level of a query instance by observing its influence on another instance. If the influenced instance is more biased, we derive that the query instance is biased. We then use the bias level detected as an optimization objective to form a multi-task learning setting in addition to the original QA task. We further introduce a new bias evaluation metric to quantify bias in a comprehensive and sensitive way. We show that our method could be applied to multiple QA formulations across multiple bias categories. It can significantly reduce the bias level in all 9 bias categories in the BBQ dataset while maintaining comparable QA accuracy.
    摘要 各类自然语言处理任务的模型都被发现存在刻板印象,而问答(QA)模型中的偏见尤其有害,因为输出答案可能被终端用户直接采用。目前已有用于评估QA模型偏见的数据集,但针对QA模型的偏见缓解技术仍探索不足。在这项工作中,我们提出了BMBI方法,用于缓解多选问答模型的偏见。基于"模型若从带偏见的样例中学习则会更倾向于偏见"的直觉,我们通过观察一个查询实例对另一个实例的影响来衡量其偏见程度:如果被影响的实例变得更有偏见,则推断该查询实例带有偏见。随后,我们将检测到的偏见程度作为优化目标,与原有的QA任务一起构成多任务学习设置。我们还提出了一个新的偏见评估指标,能够全面而敏感地量化偏见。实验表明,我们的方法可应用于多种问答形式,在BBQ数据集的全部9个偏见类别上显著降低了偏见水平,同时保持了相当的问答准确率。

Price of Stability in Quality-Aware Federated Learning

  • paper_url: http://arxiv.org/abs/2310.08790
  • repo_url: None
  • paper_authors: Yizhou Yan, Xinyu Tang, Chao Huang, Ming Tang
  • for: 这个论文旨在提出一种基于联合学习的标签噪声纠正方法,以提高联合学习性能。
  • methods: 这个论文使用了一种标签噪声纠正游戏来模型客户端之间的互动,并分析了这个游戏的平衡。
  • results: 论文的分析表明,客户端之间标签去噪博弈的均衡结果会使全局模型的准确率低于社会最优解。此外,论文还设计了一种高效的算法来计算社会最优解。
    Abstract Federated Learning (FL) is a distributed machine learning scheme that enables clients to train a shared global model without exchanging local data. The presence of label noise can severely degrade the FL performance, and some existing studies have focused on algorithm design for label denoising. However, they ignored the important issue that clients may not apply costly label denoising strategies due to them being self-interested and having heterogeneous valuations on the FL performance. To fill this gap, we model the clients' interactions as a novel label denoising game and characterize its equilibrium. We also analyze the price of stability, which quantifies the difference in the system performance (e.g., global model accuracy, social welfare) between the equilibrium outcome and the socially optimal solution. We prove that the equilibrium outcome always leads to a lower global model accuracy than the socially optimal solution does. We further design an efficient algorithm to compute the socially optimal solution. Numerical experiments on MNIST dataset show that the price of stability increases as the clients' data become noisier, calling for an effective incentive mechanism.
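A toy, hedged illustration of the price-of-stability comparison described above, with two clients deciding whether to denoise their labels; the cost, valuation, and accuracy numbers are assumptions for illustration, not the paper's model:

```python
from itertools import product

# Assumed per-client denoising costs and valuations of global accuracy.
denoise_cost = [0.15, 0.12]
valuation = [0.50, 0.40]

def accuracy(profile):
    # Assumed global-model accuracy: rises with each client that denoises.
    return 0.5 + 0.2 * sum(profile)

def payoff(i, profile):
    return valuation[i] * accuracy(profile) - denoise_cost[i] * profile[i]

def welfare(profile):
    return sum(payoff(i, profile) for i in range(2))

profiles = list(product([0, 1], repeat=2))  # 0 = keep noisy labels, 1 = denoise

def is_nash(profile):
    # No client can gain by unilaterally switching its denoising decision.
    return all(
        payoff(i, profile) >= payoff(i, profile[:i] + (1 - profile[i],) + profile[i + 1:])
        for i in range(2)
    )

best_eq = max((p for p in profiles if is_nash(p)), key=welfare)
optimum = max(profiles, key=welfare)
print("best equilibrium:", best_eq, round(welfare(best_eq), 3))
print("social optimum:  ", optimum, round(welfare(optimum), 3))
print("price of stability (welfare ratio):", round(welfare(optimum) / welfare(best_eq), 3))
```

Under these assumed numbers, neither client finds denoising individually worthwhile, yet joint denoising maximizes welfare, so the equilibrium underperforms the social optimum.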

Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced Transfer Learning

  • paper_url: http://arxiv.org/abs/2310.08782
  • repo_url: https://github.com/optml-group/dp4tl
  • paper_authors: Yihua Zhang, Yimeng Zhang, Aochuan Chen, Jinghan Jia, Jiancheng Liu, Gaowen Liu, Mingyi Hong, Shiyu Chang, Sijia Liu
  • for: This paper studies dataset pruning (DP) for transfer learning: how to prune a source dataset to improve pretraining efficiency without sacrificing downstream finetuning accuracy.
  • methods: Two new DP methods, label mapping and feature mapping, are proposed for the supervised and self-supervised pretraining settings respectively, by revisiting DP through the lens of source-target domain mapping (a minimal class-pruning sketch follows the abstract below).
  • results: Across multiple transfer learning tasks, 40%–80% of the source data classes can be pruned without sacrificing downstream performance, yielding a 2–5x speed-up in the pretraining stage.
    Abstract Massive data is often considered essential for deep learning applications, but it also incurs significant computational and infrastructural costs. Therefore, dataset pruning (DP) has emerged as an effective way to improve data efficiency by identifying and removing redundant training samples without sacrificing performance. In this work, we aim to address the problem of DP for transfer learning, i.e., how to prune a source dataset for improved pretraining efficiency and lossless finetuning accuracy on downstream target tasks. To our best knowledge, the problem of DP for transfer learning remains open, as previous studies have primarily addressed DP and transfer learning as separate problems. By contrast, we establish a unified viewpoint to integrate DP with transfer learning and find that existing DP methods are not suitable for the transfer learning paradigm. We then propose two new DP methods, label mapping and feature mapping, for supervised and self-supervised pretraining settings respectively, by revisiting the DP problem through the lens of source-target domain mapping. Furthermore, we demonstrate the effectiveness of our approach on numerous transfer learning tasks. We show that source data classes can be pruned by up to 40% ~ 80% without sacrificing downstream performance, resulting in a significant 2 ~ 5 times speed-up during the pretraining stage. Besides, our proposal exhibits broad applicability and can improve other computationally intensive transfer learning techniques, such as adversarial pretraining. Codes are available at https://github.com/OPTML-Group/DP4TL.
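A minimal sketch of class-level source-data pruning guided by a source-target mapping score, assuming class prototype features from a surrogate model; the cosine-similarity scoring rule is an illustrative stand-in, not the paper's exact label/feature mapping criterion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed prototype (mean feature) vectors for 10 source classes and 3 target classes,
# e.g. extracted with a small surrogate model.
source_prototypes = rng.normal(size=(10, 16))
target_prototypes = rng.normal(size=(3, 16))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Score each source class by its best alignment with any target class.
scores = np.array([
    max(cosine(s, t) for t in target_prototypes) for s in source_prototypes
])

prune_ratio = 0.4  # drop the 40% least target-relevant source classes
keep = np.argsort(scores)[int(prune_ratio * len(scores)):]
print("kept source classes:", sorted(keep.tolist()))
```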

“Im not Racist but…”: Discovering Bias in the Internal Knowledge of Large Language Models

  • paper_url: http://arxiv.org/abs/2310.08780
  • repo_url: None
  • paper_authors: Abel Salinas, Louis Penafiel, Robert McCormack, Fred Morstatter
  • for: This work aims to uncover hidden societal biases in large language models (LLMs) in order to promote fairness in their downstream applications.
  • methods: A novel, purely prompt-based approach is proposed to reveal hidden stereotypes in any arbitrary LLM; it dynamically generates a knowledge representation of internal stereotypes, enabling the identification of biases encoded in the LLM's internal knowledge (a minimal probing sketch follows the abstract below).
  • results: The prompt-based probe surfaces hidden societal biases in LLMs and supports their systematic analysis, contributing to transparency and fairness in natural language processing systems.
    Abstract Large language models (LLMs) have garnered significant attention for their remarkable performance in a continuously expanding set of natural language processing tasks. However, these models have been shown to harbor inherent societal biases, or stereotypes, which can adversely affect their performance in their many downstream applications. In this paper, we introduce a novel, purely prompt-based approach to uncover hidden stereotypes within any arbitrary LLM. Our approach dynamically generates a knowledge representation of internal stereotypes, enabling the identification of biases encoded within the LLM's internal knowledge. By illuminating the biases present in LLMs and offering a systematic methodology for their analysis, our work contributes to advancing transparency and promoting fairness in natural language processing systems.
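A minimal, hedged sketch of prompt-based stereotype probing, assuming access to a text-generation callable; the prompt wording, sampling count, and aggregation are illustrative assumptions rather than the paper's protocol (a stub generator is used so the snippet runs end to end):

```python
from collections import Counter
from typing import Callable, List

def probe_stereotypes(generate: Callable[[str], str],
                      group: str, n_samples: int = 5) -> Counter:
    # Ask the model to complete a neutral sentence about a group several times
    # and count the attributes it volunteers.
    prompt = f"List one word that describes {group} people:"
    completions: List[str] = [generate(prompt).strip().lower() for _ in range(n_samples)]
    return Counter(completions)

# Stand-in for a real LLM call, so the sketch is self-contained.
def dummy_generate(prompt: str) -> str:
    return "hardworking"

print(probe_stereotypes(dummy_generate, group="immigrant"))
```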