cs.SD - 2023-10-22

An overview of text-to-speech systems and media applications

  • paper_url: http://arxiv.org/abs/2310.14301
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Mohammad Reza Hasanabadi
  • for: Surveys the research and development of Text-To-Speech (TTS) systems.
  • methods: Introduces the key components of TTS systems, including text analysis, acoustic modelling, and vocoding, and reviews the application of deep learning approaches.
  • results: Details the design and implementation of state-of-the-art TTS systems, including Tacotron 2, Transformer TTS, WaveNet, and FastSpeech 1, which perform strongly on the subjective evaluation metric (MOS). The discussion section offers suggestions for developing a TTS system suited to the intended application.
    Abstract Producing synthetic voice, similar to human-like sound, is an emerging novelty of modern interactive media systems. Text-To-Speech (TTS) systems try to generate synthetic and authentic voices from text input. Moreover, well-known and familiar dubbing, announcing, and narrating voices, valuable possessions of any media organization, can be preserved indefinitely by utilizing TTS and Voice Conversion (VC) algorithms. The emergence of deep learning approaches has made such TTS systems more accurate and accessible. To understand TTS systems better, this paper investigates the key components of such systems, including text analysis, acoustic modelling and vocoding. The paper then provides details of important state-of-the-art TTS systems based on deep learning. Finally, a comparison is made between recently released systems in terms of backbone architecture, type of input and conversion, vocoder used and subjective assessment (MOS). Accordingly, Tacotron 2, Transformer TTS, WaveNet and FastSpeech 1 are among the most successful TTS systems ever released. In the discussion section, some suggestions are made to develop a TTS system with regard to the intended application.
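The three-stage pipeline the survey describes (text analysis, acoustic modelling, vocoding) can be sketched as a toy skeleton. All stages below are hypothetical stand-ins, not any real system (Tacotron 2, WaveNet, etc.): characters play the role of phonemes, and the "vocoder" emits a short sinusoid per frame.

```python
import math

def text_analysis(text: str) -> list:
    """Normalize text and split it into phoneme-like symbols (here: characters)."""
    return [c for c in text.lower() if c.isalpha()]

def acoustic_model(symbols: list) -> list:
    """Map symbols to toy 'acoustic frames' (one 3-dim frame per symbol)."""
    return [[float(ord(s)), float(i), 1.0] for i, s in enumerate(symbols)]

def vocoder(frames: list, samples_per_frame: int = 4) -> list:
    """Render frames into a toy waveform: a short sinusoid per frame."""
    wave = []
    for frame in frames:
        freq = frame[0] % 20 + 1  # arbitrary mapping, purely illustrative
        wave.extend(math.sin(2 * math.pi * freq * t / samples_per_frame)
                    for t in range(samples_per_frame))
    return wave

def tts(text: str) -> list:
    return vocoder(acoustic_model(text_analysis(text)))

wave = tts("Hi TTS")
```

Real systems replace each stub with learned models; the point is only the data flow between the components the paper analyses.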

MFCC-GAN Codec: A New AI-based Audio Coding

  • paper_url: http://arxiv.org/abs/2310.14300
  • repo_url: None
  • paper_authors: Mohammad Reza Hasanabadi
  • for: AI-based audio coding using MFCC features in an adversarial setting.
  • methods: Combines a conventional encoder with an adversarially trained decoder to better reconstruct the original waveform. Because GANs provide implicit density estimation, such models are less prone to overfitting.
  • results: MFCCGAN_36k and MFCCGAN_13k achieve high SNR and NISQA-MOS scores, outperforming five well-known codecs (AAC, AC3, Opus, Vorbis, and Speex). MFCCGAN_13k matches the SNR of AC3_128k and AAC_112k at a far lower bitrate (13 kbps).
    Abstract In this paper, we propose AI-based audio coding using MFCC features in an adversarial setting. We combine a conventional encoder with an adversarial learning decoder to better reconstruct the original waveform. Since GANs provide implicit density estimation, such models are less prone to overfitting. We compared our work with five well-known codecs, namely AAC, AC3, Opus, Vorbis, and Speex, operating at bitrates from 2 kbps to 128 kbps. MFCCGAN_36k achieved the state-of-the-art result in terms of SNR despite a lower bitrate in comparison to AC3_128k, AAC_112k, Vorbis_48k, Opus_48k, and Speex_48k. On the other hand, MFCCGAN_13k also achieved a high SNR of 27, equal to that of AC3_128k and AAC_112k, while having a significantly lower bitrate (13 kbps). MFCCGAN_36k achieved higher NISQA-MOS results compared to AAC_48k while having a 20% lower bitrate. Furthermore, MFCCGAN_13k obtained NISQA-MOS = 3.9, which is much higher than AAC_24k, AAC_32k, AC3_32k, and AAC_48k. For future work, we suggest adopting loss functions optimizing intelligibility and perceptual metrics in the MFCCGAN structure to improve quality and intelligibility simultaneously.
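The SNR figures the paper reports can be illustrated with a minimal sketch of the metric itself (toy signals, not the paper's codec; the reference and decoded waveforms are assumed time-aligned and equal length):

```python
import math

def snr_db(reference: list, decoded: list) -> float:
    """Signal-to-noise ratio in dB between a reference waveform and its
    decoded reconstruction."""
    signal_power = sum(x * x for x in reference)
    noise_power = sum((x - y) ** 2 for x, y in zip(reference, decoded))
    if noise_power == 0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)

# Toy example: a sinusoid and a reconstruction with a small constant offset.
ref = [math.sin(0.1 * n) for n in range(1000)]
decoded = [x + 0.001 for x in ref]
snr = snr_db(ref, decoded)
```

A perfect reconstruction yields infinite SNR; real codecs trade SNR against bitrate, which is the comparison axis used in the paper.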

Diffusion-Based Adversarial Purification for Speaker Verification

  • paper_url: http://arxiv.org/abs/2310.14270
  • repo_url: None
  • paper_authors: Yibo Bai, Xiao-Lei Zhang
  • for: Improving the security and robustness of automatic speaker verification (ASV) systems against adversarial attacks.
  • methods: Proposes a denoising diffusion model that purifies adversarial examples: controlled noise is injected, and a reverse denoising process reconstructs the clean audio.
  • results: Experiments show the proposed method effectively enhances ASV security while minimizing the distortion introduced by purification.
    Abstract Recently, automatic speaker verification (ASV) based on deep learning has proven easily compromised by adversarial attacks, a new type of attack that injects imperceptible perturbations into audio signals to make ASV produce wrong decisions. This poses a significant threat to the security and reliability of ASV systems. To address this issue, we propose a Diffusion-Based Adversarial Purification (DAP) method that enhances the robustness of ASV systems against such adversarial attacks. Our method leverages a conditional denoising diffusion probabilistic model to effectively purify the adversarial examples and mitigate the impact of perturbations. DAP first introduces controlled noise into adversarial examples, and then performs a reverse denoising process to reconstruct clean audio. Experimental results demonstrate the efficacy of the proposed DAP in enhancing the security of ASV while minimizing the distortion of the purified audio signals.
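The two-step purification idea (inject controlled noise, then denoise back) can be sketched on a 1-D toy signal. Everything here is an assumption for illustration: the "adversarial perturbation" is a high-frequency sinusoid, and the "denoiser" is a repeated moving average rather than the learned conditional diffusion model DAP actually uses.

```python
import math
import random

def forward_noise(x, sigma=0.1, seed=0):
    """Forward step: inject controlled Gaussian noise (toy stand-in for the
    diffusion process q(x_t | x_0))."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in x]

def reverse_denoise(x, passes=3, window=5):
    """Reverse step: a toy denoiser (repeated moving average). DAP uses a
    learned conditional denoising diffusion model instead."""
    half = window // 2
    for _ in range(passes):
        x = [sum(x[max(0, i - half):i + half + 1]) /
             len(x[max(0, i - half):i + half + 1]) for i in range(len(x))]
    return x

def rms_error(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))

clean = [math.sin(0.05 * n) for n in range(200)]
adversarial = [c + 0.3 * math.sin(2.5 * n) for n, c in enumerate(clean)]
purified = reverse_denoise(forward_noise(adversarial))
```

Even this crude pipeline pulls the signal back toward the clean waveform, which is the intuition behind purification: the added noise drowns the structured perturbation, and denoising recovers the dominant clean content.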

First-Shot Unsupervised Anomalous Sound Detection With Unknown Anomalies Estimated by Metadata-Assisted Audio Generation

  • paper_url: http://arxiv.org/abs/2310.14173
  • repo_url: None
  • paper_authors: Hejing Zhang, Qiaoxi Zhu, Jian Guan, Haohe Liu, Feiyang Xiao, Jiantong Tian, Xinhao Mei, Xubo Liu, Wenwu Wang
  • for: Addresses the lack of anomalous sound data during training, particularly in the first-shot task.
  • methods: Proposes a framework that uses metadata-assisted audio generation to estimate unknown anomalies: available machine information (metadata and sound data) is used to fine-tune a text-to-audio generation model that produces anomalous sounds with the distinctive acoustic characteristics of each machine type.
  • results: The proposed FS-TWFR-GMM method achieves competitive performance in DCASE 2023 Challenge Task 2 while requiring only 1% of the model parameters for detection, demonstrating its feasibility for the first-shot task.
    Abstract First-shot (FS) unsupervised anomalous sound detection (ASD) is a brand-new task introduced in DCASE 2023 Challenge Task 2, where the anomalous sounds for the target machine types are unseen in training. Existing methods often rely on the availability of normal and abnormal sound data from the target machines. However, due to the lack of anomalous sound data for the target machine types, it becomes challenging to adapt existing ASD methods to the first-shot task. In this paper, we propose a new framework for first-shot unsupervised ASD, in which metadata-assisted audio generation is used to estimate unknown anomalies: the available machine information (i.e., metadata and sound data) is used to fine-tune a text-to-audio generation model to generate anomalous sounds that contain the unique acoustic characteristics of each machine type. We then use the method of Time-Weighted Frequency domain audio Representation with Gaussian Mixture Model (TWFR-GMM) as the backbone to achieve first-shot unsupervised ASD. Our proposed FS-TWFR-GMM method achieves competitive performance amongst top systems in DCASE 2023 Challenge Task 2, while requiring only 1% of the model parameters for detection, as validated in our experiments.
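The density-based scoring behind the TWFR-GMM backbone can be sketched with a single diagonal Gaussian fitted to features of normal sounds (a one-component stand-in for the actual GMM, on made-up 2-D features): anomalies are sounds with high negative log-likelihood under the model of normality.

```python
import math

def fit_gaussian(features):
    """Fit a diagonal Gaussian to feature vectors from normal sounds
    (a one-component toy stand-in for the GMM in TWFR-GMM)."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / n + 1e-6
           for j in range(d)]
    return mean, var

def anomaly_score(x, mean, var):
    """Negative log-likelihood under the fitted Gaussian: higher = more anomalous."""
    return 0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                     for xi, m, v in zip(x, mean, var))

# Hypothetical 2-D features extracted from normal machine sounds.
normal = [[1.0 + 0.01 * i, 2.0 - 0.01 * i] for i in range(10)]
mean, var = fit_gaussian(normal)
```

The paper's contribution sits upstream of this step: generated anomalous sounds let the detector be tuned even though no real anomalies exist at training time.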

eess.AS - 2023-10-22

A Study on Prosodic Entrainment in Relation to Therapist Empathy in Counseling Conversation

  • paper_url: http://arxiv.org/abs/2310.14181
  • repo_url: None
  • paper_authors: Dehua Tao, Tan Lee, Harold Chui, Sarah Luk
  • for: Investigates the phenomenon of prosodic entrainment in client-therapist interaction and its relationship with subjectively rated empathy.
  • methods: Experimental study of speech prosody entrainment in relation to empathy ratings.
  • results: Entrainment of intensity is more influential on empathy ratings than entrainment of pitch or speech rate. The observer and the client perceive therapist empathy differently given the same entrained phenomena in pitch and intensity. The client's intention to adjust pitch variation and speech intensity is considered an indicator of the client's perception of counseling quality.
    Abstract Counseling is carried out as spoken conversation between a therapist and a client. The empathy level expressed by the therapist is considered an important index of the quality of counseling and often assessed by an observer or the client. This research investigates the entrainment of speech prosody in relation to subjectively rated empathy. Experimental results show that the entrainment of intensity is more influential to empathy observation than that of pitch or speech rate in client-therapist interaction. The observer and the client have different perceptions of therapist empathy with the same entrained phenomena in pitch and intensity. The client's intention to make adjustment on pitch variation and intensity of speech is considered an indicator of the client's perception of counseling quality.
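One simple way to quantify entrainment of a prosodic feature such as intensity is the correlation between the two speakers' per-turn measurements; the sketch below uses Pearson correlation on hypothetical per-turn intensities (the paper does not specify this exact measure, so treat it as a toy proxy):

```python
import math

def pearson(a, b):
    """Pearson correlation, used here as a toy proxy for prosodic
    entrainment between two speakers' per-turn measurements."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Hypothetical per-turn intensity (dB) for therapist and client.
therapist_intensity = [62.0, 64.5, 61.0, 66.0, 63.5]
client_intensity = [60.5, 63.0, 60.0, 64.5, 62.0]
r = pearson(therapist_intensity, client_intensity)
```

A value near 1 indicates the client tracks the therapist's intensity turn by turn, the kind of intensity entrainment the study finds most predictive of empathy ratings.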

Modeling Intrapersonal and Interpersonal Influences for Automatic Estimation of Therapist Empathy in Counseling Conversation

  • paper_url: http://arxiv.org/abs/2310.14178
  • repo_url: None
  • paper_authors: Dehua Tao, Tan Lee, Harold Chui, Sarah Luk
  • for: Improving the estimation of therapist empathy by considering the therapist's past behavior and the client's behavior.
  • methods: Uses an attention mechanism to capture the dynamic intrapersonal and interpersonal influences between therapist and client, and integrates them into the target turn representation.
  • results: Integrating dynamic influences enhances empathy level estimation; the client's turns (interpersonal influence) are slightly more effective than the therapist's own turns (intrapersonal influence). Concentrating exclusively on recent historical turns can significantly impact the estimation of therapist empathy.
    Abstract Counseling is usually conducted through spoken conversation between a therapist and a client. The therapist's empathy level is a key indicator of counseling outcomes. Presuming that the therapist's empathy expression is shaped by their past behavior and their perception of the client's behavior, we propose a model to estimate therapist empathy by considering both intrapersonal and interpersonal influences. These dynamic influences are captured by applying an attention mechanism to the therapist turn and the historical turns of both therapist and client. Our findings suggest that the integration of dynamic influences enhances empathy level estimation. The influence-derived embedding should constitute a minor portion of the target turn representation for optimal empathy estimation. The client's turns (interpersonal influence) appear to slightly surpass the therapist's own turns (intrapersonal influence) in empathy estimation effectiveness. It is noted that concentrating exclusively on recent historical turns can significantly impact the estimation of therapist empathy.
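The attention-over-historical-turns idea can be sketched in a few lines: score each past turn's embedding against the target turn, softmax the scores, and pool. The embeddings below are hypothetical 2-D vectors; the paper's model operates on learned turn representations.

```python
import math

def attention_pool(query, history):
    """Toy dot-product attention: weight historical turn embeddings by
    similarity to the target turn, then average them."""
    scores = [sum(q * h for q, h in zip(query, turn)) for turn in history]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    d = len(query)
    context = [sum(w * turn[j] for w, turn in zip(weights, history))
               for j in range(d)]
    return weights, context

target_turn = [1.0, 0.0]                       # hypothetical target embedding
past_turns = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical history
weights, context = attention_pool(target_turn, past_turns)
```

The influence-derived `context` would then be concatenated (as a minor portion, per the paper's finding) with the target turn representation before empathy prediction.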

cs.CV - 2023-10-22

Skipped Feature Pyramid Network with Grid Anchor for Object Detection

  • paper_url: http://arxiv.org/abs/2310.14453
  • repo_url: None
  • paper_authors: Li Pengfei, Wei Wei, Yan Yu, Zhu Rong, Zhou Liguo
  • for: Improving object detection accuracy.
  • methods: A skipped feature-pyramid connection and a simplified anchor generation scheme.
  • results: State-of-the-art performance on the MS COCO and Wider Face benchmarks.
    Abstract CNN-based object detection methods have achieved significant progress in recent years. The classic structures of CNNs produce pyramid-like feature maps due to pooling and other rescaling operations. The feature maps at different levels of the feature pyramid are used to detect objects at different scales. For more accurate object detection, the highest-level feature, which has the lowest resolution and contains the strongest semantics, is up-scaled and connected with the lower-level features to enhance the semantics in the lower-level features. However, the classic connection scheme combines each lower-level feature with all the features above it, which may result in semantics degradation. In this paper, we propose a skipped connection to obtain stronger semantics at each level of the feature pyramid. In our method, each lower-level feature only connects with the feature at the highest level, making it more reasonable that each level is responsible for detecting objects at fixed scales. In addition, we simplify the generation of anchors for bounding-box regression, which can further improve the accuracy of object detection. Experiments on MS COCO and Wider Face demonstrate that our method outperforms state-of-the-art methods.
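The skipped connection can be illustrated on a toy 1-D "pyramid": each lower level fuses only with the upsampled top level, instead of the classic top-down cumulative sum. This is a structural sketch, not the paper's network (real FPNs use 2-D feature maps, 1x1 convolutions, and learned fusion).

```python
def upsample(feat, factor):
    """Nearest-neighbour upsampling of a 1-D toy feature map."""
    return [v for v in feat for _ in range(factor)]

def skipped_fpn(levels):
    """Fuse each lower level with the TOP level only (the skipped
    connection). `levels` is ordered from highest resolution to lowest;
    the last entry carries the strongest semantics."""
    top = levels[-1]
    fused = []
    for feat in levels[:-1]:
        factor = len(feat) // len(top)
        fused.append([a + b for a, b in zip(feat, upsample(top, factor))])
    fused.append(top)
    return fused

# Toy pyramid: resolutions 8, 4, 2 (top level has length 2).
pyramid = [[1.0] * 8, [2.0] * 4, [10.0, 20.0]]
out = skipped_fpn(pyramid)
```

In the classic FPN, level i would also accumulate levels i+1, i+2, ...; here only the top-level semantics reach each scale, which is the paper's argument against semantics degradation.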

Mobile AR Depth Estimation: Challenges & Prospects – Extended Version

  • paper_url: http://arxiv.org/abs/2310.14437
  • repo_url: None
  • paper_authors: Ashkan Ganj, Yiqin Zhao, Hang Su, Tian Guo
  • for: Accurate metric depth estimation in mobile augmented reality (AR), enabling more realistic user interactions such as object placement and occlusion detection.
  • methods: Tests four state-of-the-art monocular depth estimation models on the newly introduced ARKitScenes dataset and identifies three types of challenges: hardware-, data-, and model-related.
  • results: Proposes future directions to address these challenges: exploiting more information from the mobile device's camera and other available sensors, capturing high-quality data that reflects real-world AR scenarios, and designing new model architectures.
    Abstract Metric depth estimation plays an important role in mobile augmented reality (AR). With accurate metric depth, we can achieve more realistic user interactions such as object placement and occlusion detection. While specialized hardware like LiDAR demonstrates its promise, its restricted availability, i.e., only on selected high-end mobile devices, and performance limitations such as range and sensitivity to the environment, make it less ideal. Monocular depth estimation, on the other hand, relies solely on mobile cameras, which are ubiquitous, making it a promising alternative for mobile AR. In this paper, we investigate the challenges and opportunities of achieving accurate metric depth estimation in mobile AR. We tested four different state-of-the-art monocular depth estimation models on a newly introduced dataset (ARKitScenes) and identified three types of challenges: hardware, data, and model related challenges. Furthermore, our research provides promising future directions to explore and solve those challenges. These directions include (i) using more hardware-related information from the mobile device's camera and other available sensors, (ii) capturing high-quality data to reflect real-world AR scenarios, and (iii) designing a model architecture to utilize the new information.
    摘要 metric深度估计在移动增强现实中发挥重要作用,可以实现更真实的用户互动,如对象放置和遮挡检测。尽管特殊硬件如LiDAR表现出了承诺,但它的可用性和环境影响限制了其使用。而单目深度估计则仅仅依靠移动摄像头,这种设备 ubique 存在,使其成为移动AR中的优选选择。在这篇论文中,我们探讨了移动AR中准确的 metric深度估计的挑战和机遇。我们测试了四种不同的单目深度估计模型,并分类了这些挑战为硬件、数据和模型相关的三类挑战。此外,我们的研究还提供了解决这些挑战的可能的未来方向,包括(i)使用更多的移动设备内部硬件信息,(ii)捕捉高质量的数据,以反映实际的AR场景,以及(iii)设计一种能够利用新信息的模型建构。

ConViViT – A Deep Neural Network Combining Convolutions and Factorized Self-Attention for Human Activity Recognition

  • paper_url: http://arxiv.org/abs/2310.14416
  • repo_url: None
  • paper_authors: Rachid Reda Dokkar, Faten Chaieb, Hassen Drira, Arezki Aberkane
  • for: Proposes a hybrid architecture combining CNNs and Transformers for activity recognition from RGB videos.
  • methods: A CNN network enhances the video representation; its output is fed into a Transformer to extract spatiotemporal tokens.
  • results: New state-of-the-art results on HMDB51, UCF101, and ETRI-Activity3D: 90.05%, 99.6%, and 95.09% respectively.
    Abstract The Transformer architecture has gained significant popularity in computer vision tasks due to its capacity to generalize and capture long-range dependencies. This characteristic makes it well-suited for generating spatiotemporal tokens from videos. On the other hand, convolutions serve as the fundamental backbone for processing images and videos, as they efficiently aggregate information within small local neighborhoods to create spatial tokens that describe the spatial dimension of a video. While both CNN-based architectures and pure transformer architectures are extensively studied and utilized by researchers, the effective combination of these two backbones has not received comparable attention in the field of activity recognition. In this research, we propose a novel approach that leverages the strengths of both CNNs and Transformers in a hybrid architecture for performing activity recognition using RGB videos. Specifically, we suggest employing a CNN network to enhance the video representation by generating a 128-channel video that effectively separates the human performing the activity from the background. Subsequently, the output of the CNN module is fed into a transformer to extract spatiotemporal tokens, which are then used for classification purposes. Our architecture has achieved new SOTA results with 90.05%, 99.6%, and 95.09% on HMDB51, UCF101, and ETRI-Activity3D respectively.

A Pytorch Reproduction of Masked Generative Image Transformer

  • paper_url: http://arxiv.org/abs/2310.14400
  • repo_url: https://github.com/valeoai/maskgit-pytorch
  • paper_authors: Victor Besnier, Mickael Chen
  • for: A PyTorch reproduction of a generative image model based on a masked bidirectional transformer architecture, enabling efficient image generation.
  • methods: Reimplements MaskGIT in PyTorch and improves its performance through optimization and rigorous experimentation.
  • results: On ImageNet, achieves FID values close to the original paper's reported 7.32, with improved values of 7.26 (after minor hyperparameter tweaks) and 6.80 (at 256 x 256).
    Abstract In this technical report, we present a reproduction of MaskGIT: Masked Generative Image Transformer, using PyTorch. The approach involves leveraging a masked bidirectional transformer architecture, enabling image generation with only few steps (8~16 steps) for 512 x 512 resolution images, i.e., ~64x faster than an auto-regressive approach. Through rigorous experimentation and optimization, we achieved results that closely align with the findings presented in the original paper. We match the reported FID of 7.32 with our replication and obtain 7.59 with similar hyperparameters on ImageNet at resolution 512 x 512. Moreover, we improve over the official implementation with some minor hyperparameter tweaking, achieving FID of 7.26. At the lower resolution of 256 x 256 pixels, our reimplementation scores 6.80, in comparison to the original paper's 6.18. To promote further research on Masked Generative Models and facilitate their reproducibility, we released our code and pre-trained weights openly at https://github.com/valeoai/MaskGIT-pytorch/
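MaskGIT's speedup comes from parallel iterative decoding: start from a fully masked token grid and, at each of a few steps, commit the most confident predictions. A schematic sketch, with a hypothetical `predict` callable standing in for the bidirectional transformer:

```python
import random

MASK = None

def maskgit_decode(length, predict, steps=4, seed=0):
    """Toy parallel iterative decoding in the spirit of MaskGIT: start fully
    masked and at each step commit the most confident predictions until
    every position is filled. `predict(i, tokens, rng)` returns a
    (token, confidence) pair for masked position i."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        proposals = {i: predict(i, tokens, rng) for i in masked}
        # Commit a growing fraction of remaining positions, highest confidence first.
        k = max(1, len(masked) // (steps - step))
        for i in sorted(masked, key=lambda i: -proposals[i][1])[:k]:
            tokens[i] = proposals[i][0]
    return tokens

def toy_predict(i, tokens, rng):
    # Purely illustrative: token depends only on position, confidence is random.
    return i % 7, rng.random()

codes = maskgit_decode(16, toy_predict)
```

With 8–16 such steps a 512 x 512 image's token grid is filled, versus one token per step for an auto-regressive decoder, which is the source of the ~64x speedup the report cites.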

Cross-Domain HAR: Few Shot Transfer Learning for Human Activity Recognition

  • paper_url: http://arxiv.org/abs/2310.14390
  • repo_url: None
  • paper_authors: Megha Thukral, Harish Haresamudram, Thomas Ploetz
  • for: Proposes an economical transfer-learning approach that exploits publicly available labeled human activity recognition datasets.
  • methods: Follows a teacher-student self-training paradigm, bridging conceptual gaps between source and target domains (including sensor locations and activity types) to recognize activities more effectively.
  • results: Extensive experimental evaluation demonstrates strong performance on practically relevant few-shot activity recognition scenarios, with a detailed analysis of the factors affecting downstream performance.
    Abstract The ubiquitous availability of smartphones and smartwatches with integrated inertial measurement units (IMUs) enables straightforward capturing of human activities. For specific applications of sensor based human activity recognition (HAR), however, logistical challenges and burgeoning costs render especially the ground truth annotation of such data a difficult endeavor, resulting in limited scale and diversity of datasets. Transfer learning, i.e., leveraging publicly available labeled datasets to first learn useful representations that can then be fine-tuned using limited amounts of labeled data from a target domain, can alleviate some of the performance issues of contemporary HAR systems. Yet it can fail when the differences between source and target conditions are too large and/or only a few samples from a target application domain are available, each of which is a typical challenge in real-world human activity recognition scenarios. In this paper, we present an approach for economic use of publicly available labeled HAR datasets for effective transfer learning. We introduce a novel transfer learning framework, Cross-Domain HAR, which follows the teacher-student self-training paradigm to more effectively recognize activities with very limited label information. It bridges conceptual gaps between source and target domains, including sensor locations and type of activities. Through our extensive experimental evaluation on a range of benchmark datasets, we demonstrate the effectiveness of our approach for practically relevant few shot activity recognition scenarios. We also present a detailed analysis into how the individual components of our framework affect downstream performance.
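The teacher-student self-training loop at the core of the framework can be sketched as a single pseudo-labeling step: a source-trained teacher labels unlabeled target-domain samples, and only its confident predictions are kept as training data for the student. The 1-D "classifier" below is hypothetical; real HAR models operate on IMU windows.

```python
def self_train_step(teacher, unlabeled, threshold=0.8):
    """One toy teacher-student self-training step: keep only the teacher's
    confident pseudo-labels for student training. `teacher(x)` returns a
    (label, confidence) pair."""
    kept = []
    for x in unlabeled:
        label, conf = teacher(x)
        if conf >= threshold:
            kept.append((x, label))
    return kept

def toy_teacher(x):
    # Hypothetical 1-D classifier: label by sign, confidence by magnitude.
    return (1 if x > 0 else 0), min(1.0, abs(x))

pseudo_labeled = self_train_step(toy_teacher, [0.95, -0.2, 0.5, -0.9, 0.85])
```

In the full framework this step is iterated, with the student trained on the kept pairs (plus the few labeled target samples) and then promoted to teacher.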

Learning Generalizable Manipulation Policies with Object-Centric 3D Representations

  • paper_url: http://arxiv.org/abs/2310.14386
  • repo_url: None
  • paper_authors: Yifeng Zhu, Zhenyu Jiang, Peter Stone, Yuke Zhu
  • for: Learning robust visuomotor manipulation policies with object-centric and 3D priors.
  • methods: Proposes GROOT, an imitation learning method whose policies generalize beyond their initial training conditions. GROOT builds object-centric 3D representations that are robust to background changes and camera views, reasons over them with a transformer-based policy, and introduces a segmentation correspondence model for generalizing to new objects at test time.
  • results: Comprehensive experiments show strong generalization over background changes, camera viewpoint shifts, and new object instances, where both state-of-the-art end-to-end learning methods and object-proposal-based approaches fall short. GROOT policies also perform well on real robots under very wide variations in setup. More videos and model details are available in the appendix and on the project website: https://ut-austin-rpl.github.io/GROOT
    Abstract We introduce GROOT, an imitation learning method for learning robust policies with object-centric and 3D priors. GROOT builds policies that generalize beyond their initial training conditions for vision-based manipulation. It constructs object-centric 3D representations that are robust toward background changes and camera views and reason over these representations using a transformer-based policy. Furthermore, we introduce a segmentation correspondence model that allows policies to generalize to new objects at test time. Through comprehensive experiments, we validate the robustness of GROOT policies against perceptual variations in simulated and real-world environments. GROOT's performance excels in generalization over background changes, camera viewpoint shifts, and the presence of new object instances, whereas both state-of-the-art end-to-end learning methods and object proposal-based approaches fall short. We also extensively evaluate GROOT policies on real robots, where we demonstrate the efficacy under very wild changes in setup. More videos and model details can be found in the appendix and the project website: https://ut-austin-rpl.github.io/GROOT .

Data-Free Distillation Improves Efficiency and Privacy in Federated Thorax Disease Analysis

  • paper_url: http://arxiv.org/abs/2310.18346
  • repo_url: None
  • paper_authors: Ming Li, Guang Yang
  • for: efficient, privacy-preserving federated thorax disease analysis
  • methods: data-free distillation-based federated learning approach (FedKDF) with a lightweight generator to aggregate knowledge from different clients without requiring access to their private data or a proxy dataset
  • results: robust solution for efficient, privacy-preserving federated thorax disease analysis, demonstrated through empirical experiments
    Abstract Thorax disease analysis in large-scale, multi-centre, and multi-scanner settings is often limited by strict privacy policies. Federated learning (FL) offers a potential solution, while traditional parameter-based FL can be limited by issues such as high communication costs, data leakage, and heterogeneity. Distillation-based FL can improve efficiency, but it relies on a proxy dataset, which is often impractical in clinical practice. To address these challenges, we introduce a data-free distillation-based FL approach FedKDF. In FedKDF, the server employs a lightweight generator to aggregate knowledge from different clients without requiring access to their private data or a proxy dataset. FedKDF combines the predictors from clients into a single, unified predictor, which is further optimized using the learned knowledge in the lightweight generator. Our empirical experiments demonstrate that FedKDF offers a robust solution for efficient, privacy-preserving federated thorax disease analysis.
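The aggregation idea in FedKDF (combine client predictors into one unified predictor without touching private data) can be sketched by averaging client outputs on generator-produced synthetic samples to form distillation targets. This is a toy sketch of the aggregation step only, with hypothetical two-class client predictors; FedKDF additionally trains the lightweight generator and optimizes the unified predictor against these targets.

```python
def ensemble_targets(client_predictors, synthetic_batch):
    """Average client predictors' class probabilities on synthetic samples
    to form distillation targets for the unified predictor."""
    targets = []
    for x in synthetic_batch:
        outputs = [p(x) for p in client_predictors]
        n_classes = len(outputs[0])
        targets.append([sum(o[c] for o in outputs) / len(outputs)
                        for c in range(n_classes)])
    return targets

# Hypothetical client predictors over 2 classes (constant for illustration).
clients = [lambda x: [0.8, 0.2], lambda x: [0.6, 0.4]]
targets = ensemble_targets(clients, [[0.0], [1.0]])
```

Because only generator samples and predictor outputs cross the server boundary, no private patient data or proxy dataset is needed, which is the privacy argument of the paper.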

OV-VG: A Benchmark for Open-Vocabulary Visual Grounding

  • paper_url: http://arxiv.org/abs/2310.14374
  • repo_url: https://github.com/cv516buaa/ov-vg
  • paper_authors: Chunlei Wang, Wenquan Feng, Xiangtai Li, Guangliang Cheng, Shuchang Lyu, Binghao Liu, Lijiang Chen, Qi Zhao
  • for: Addresses open-vocabulary visual grounding: localizing a region in an image from a language description, including novel categories outside a predefined vocabulary.
  • methods: Builds on existing open-vocabulary object detection, visual grounding, and phrase localization frameworks, and develops a novel framework with two components: Text-Image Query Selection and Language-Guided Feature Attention.
  • results: The proposed method accurately grounds novel open-vocabulary categories across diverse scenarios and achieves state-of-the-art performance on multiple benchmarks.
    Abstract Open-vocabulary learning has emerged as a cutting-edge research area, particularly in light of the widespread adoption of vision-based foundational models. Its primary objective is to comprehend novel concepts that are not encompassed within a predefined vocabulary. One key facet of this endeavor is Visual Grounding, which entails locating a specific region within an image based on a corresponding language description. While current foundational models excel at various visual language tasks, there's a noticeable absence of models specifically tailored for open-vocabulary visual grounding. This research endeavor introduces novel and challenging OV tasks, namely Open-Vocabulary Visual Grounding and Open-Vocabulary Phrase Localization. The overarching aim is to establish connections between language descriptions and the localization of novel objects. To facilitate this, we have curated a comprehensive annotated benchmark, encompassing 7,272 OV-VG images and 1,000 OV-PL images. In our pursuit of addressing these challenges, we delved into various baseline methodologies rooted in existing open-vocabulary object detection, VG, and phrase localization frameworks. Surprisingly, we discovered that state-of-the-art methods often falter in diverse scenarios. Consequently, we developed a novel framework that integrates two critical components: Text-Image Query Selection and Language-Guided Feature Attention. These modules are designed to bolster the recognition of novel categories and enhance the alignment between visual and linguistic information. Extensive experiments demonstrate the efficacy of our proposed framework, which consistently attains SOTA performance across the OV-VG task. Additionally, ablation studies provide further evidence of the effectiveness of our innovative models. Codes and datasets will be made publicly available at https://github.com/cv516Buaa/OV-VG.

A Quantitative Evaluation of Dense 3D Reconstruction of Sinus Anatomy from Monocular Endoscopic Video

  • paper_url: http://arxiv.org/abs/2310.14364
  • repo_url: None
  • paper_authors: Jan Emily Mangulabnan, Roger D. Soberanis-Mukul, Timo Teufel, Isabela Hernández, Jonas Winter, Manish Sahu, Jose L. Porras, S. Swaroop Vedula, Masaru Ishii, Gregory Hager, Russell H. Taylor, Mathias Unberath
  • for: Quantitatively evaluates monocular-endoscopy-based 3D reconstruction for radiation-free assessment of sinus anatomy and surgical outcomes.
  • methods: Reconstructs 3D structure using structure-from-motion-type algorithms fused with monocular depth estimates, and evaluates accuracy against high-resolution computed tomography acquired from nine ex-vivo specimens.
  • results: Reconstructions agree closely with the anatomy (average point-to-mesh error of 0.91 mm against CT segmentations), but point-to-point matching yields average target registration errors of 6.58 mm. Pose and depth estimation inaccuracies contribute equally to this error, and locally consistent sequences with shorter trajectories produce more accurate reconstructions.
    Abstract Generating accurate 3D reconstructions from endoscopic video is a promising avenue for longitudinal radiation-free analysis of sinus anatomy and surgical outcomes. Several methods for monocular reconstruction have been proposed, yielding visually pleasant 3D anatomical structures by retrieving relative camera poses with structure-from-motion-type algorithms and fusion of monocular depth estimates. However, due to the complex properties of the underlying algorithms and endoscopic scenes, the reconstruction pipeline may perform poorly or fail unexpectedly. Further, acquiring medical data conveys additional challenges, presenting difficulties in quantitatively benchmarking these models, understanding failure cases, and identifying critical components that contribute to their precision. In this work, we perform a quantitative analysis of a self-supervised approach for sinus reconstruction using endoscopic sequences paired with optical tracking and high-resolution computed tomography acquired from nine ex-vivo specimens. Our results show that the generated reconstructions are in high agreement with the anatomy, yielding an average point-to-mesh error of 0.91 mm between reconstructions and CT segmentations. However, in a point-to-point matching scenario, relevant for endoscope tracking and navigation, we found average target registration errors of 6.58 mm. We identified that pose and depth estimation inaccuracies contribute equally to this error and that locally consistent sequences with shorter trajectories generate more accurate reconstructions. These results suggest that achieving global consistency between relative camera poses and estimated depths with the anatomy is essential. In doing so, we can ensure proper synergy between all components of the pipeline for improved reconstructions that will facilitate clinical application of this innovative technology.
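The point-to-point error discussed above can be approximated with a simple nearest-neighbour query between the reconstructed and reference point sets; a minimal numpy sketch (illustrative only — the authors' evaluation matches corresponding points and compares reconstructions against CT segmentations):

```python
import numpy as np

def mean_nearest_neighbor_error(reconstruction, reference):
    # Pairwise Euclidean distances (n_rec x n_ref), then for each
    # reconstructed point keep the distance to its closest reference point.
    d = np.linalg.norm(reconstruction[:, None, :] - reference[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

# Toy example: a reconstruction offset by 1 mm along x from the reference.
ref = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
rec = ref + np.array([1.0, 0.0, 0.0])
print(mean_nearest_neighbor_error(rec, ref))  # 1.0
```

For large clouds a k-d tree query would replace the dense distance matrix, but the metric is the same.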

Toward Flare-Free Images: A Survey

  • paper_url: http://arxiv.org/abs/2310.14354
  • repo_url: None
  • paper_authors: Yousef Kotp, Marwan Torki
  • for: This paper provides a comprehensive overview of lens flare, including its underlying physics, types, and characteristics, as well as methods for removing it.
  • methods: The paper covers a wide range of methods for flare removal, including hardware optimization strategies, classical image processing techniques, and learning-based methods using deep learning.
  • results: The paper provides insights into best practices, limitations, and promising future directions for flare removal research, and reviews the state-of-the-art solutions for handling lens flare artifacts.
    Abstract Lens flare is a common image artifact that can significantly degrade image quality and affect the performance of computer vision systems due to a strong light source pointing at the camera. This survey provides a comprehensive overview of the multifaceted domain of lens flare, encompassing its underlying physics, influencing factors, types, and characteristics. It delves into the complex optics of flare formation, arising from factors like internal reflection, scattering, diffraction, and dispersion within the camera lens system. The diverse categories of flare are explored, including scattering, reflective, glare, orb, and starburst types. Key properties such as shape, color, and localization are analyzed. The numerous factors impacting flare appearance are discussed, spanning light source attributes, lens features, camera settings, and scene content. The survey extensively covers the wide range of methods proposed for flare removal, including hardware optimization strategies, classical image processing techniques, and learning-based methods using deep learning. It not only describes pioneering flare datasets created for training and evaluation purposes but also how they were created. Commonly employed performance metrics such as PSNR, SSIM, and LPIPS are explored. Challenges posed by flare's complex and data-dependent characteristics are highlighted. The survey provides insights into best practices, limitations, and promising future directions for flare removal research. Reviewing the state-of-the-art enables an in-depth understanding of the inherent complexities of the flare phenomenon and the capabilities of existing solutions. This can inform and inspire new innovations for handling lens flare artifacts and improving visual quality across various applications.
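Of the metrics named in the survey, PSNR is the simplest to state: PSNR = 10·log10(MAX²/MSE). A small numpy sketch (the 8×8 test images are made up for illustration):

```python
import numpy as np

def psnr(reference, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

clean = np.full((8, 8), 100.0)
flared = clean + 10.0           # uniform error of 10 gray levels -> MSE = 100
print(round(psnr(clean, flared), 2))  # 28.13
```

SSIM and LPIPS are structural and learned perceptual metrics respectively, and need library implementations (e.g. scikit-image, lpips) rather than a one-liner.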

What’s in a Prior? Learned Proximal Networks for Inverse Problems

  • paper_url: http://arxiv.org/abs/2310.14344
  • repo_url: None
  • paper_authors: Zhenghan Fang, Sam Buchanan, Jeremias Sulam
  • for: This paper proposes learned proximal networks (LPN), which provide exact proximal operators for a data-driven nonconvex regularizer and thereby enable convergence guarantees for iterative schemes on general inverse problems.
  • methods: Modern deep learning models, as used in plug-and-play or deep unrolling frameworks, only loosely resemble proximal operators: there is no guarantee that a general deep network represents the proximal operator of any function, nor is there any characterization of the function for which it might provide an approximate proximal. The paper develops a framework under which learned networks are provably exact proximal operators.
  • results: The paper introduces a new training strategy, dubbed proximal matching, that provably promotes recovery of the log-prior of the true data distribution. The resulting LPNs are general, unsupervised, expressive proximal operators usable in general inverse problems with convergence guarantees. In a series of cases of increasing complexity, these models achieve state-of-the-art performance while providing a window into the priors learned from data.
    Abstract Proximal operators are ubiquitous in inverse problems, commonly appearing as part of algorithmic strategies to regularize problems that are otherwise ill-posed. Modern deep learning models have been brought to bear for these tasks too, as in the framework of plug-and-play or deep unrolling, where they loosely resemble proximal operators. Yet, something essential is lost in employing these purely data-driven approaches: there is no guarantee that a general deep network represents the proximal operator of any function, nor is there any characterization of the function for which the network might provide some approximate proximal. This not only makes guaranteeing convergence of iterative schemes challenging but, more fundamentally, complicates the analysis of what has been learned by these networks about their training data. Herein we provide a framework to develop learned proximal networks (LPN), prove that they provide exact proximal operators for a data-driven nonconvex regularizer, and show how a new training strategy, dubbed proximal matching, provably promotes the recovery of the log-prior of the true data distribution. Such LPN provide general, unsupervised, expressive proximal operators that can be used for general inverse problems with convergence guarantees. We illustrate our results in a series of cases of increasing complexity, demonstrating that these models not only result in state-of-the-art performance, but provide a window into the resulting priors learned from data.
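For intuition on what a proximal operator is: the prox of the convex ℓ1 regularizer has the closed-form soft-thresholding map, prox_{λ‖·‖₁}(v) = sign(v)·max(|v|−λ, 0); the LPNs discussed above learn nonconvex, data-driven analogues of such maps. A minimal numpy illustration:

```python
import numpy as np

def prox_l1(v, lam):
    """Proximal operator of lam * ||x||_1, i.e. soft-thresholding:
    argmin_x 0.5 * ||x - v||^2 + lam * ||x||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([3.0, -0.5, 1.2])
print(prox_l1(v, 1.0))  # entries shrink toward 0; small ones are zeroed out
```

In plug-and-play schemes, a denoiser is substituted at exactly this step of a proximal splitting algorithm; the paper's contribution is making that substitution a provably exact prox.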

Research on Key Technologies of Infrastructure Digitalization based on Multimodal Spatial Data

  • paper_url: http://arxiv.org/abs/2310.14296
  • repo_url: None
  • paper_authors: Zhanyuan Tian, Tianrui Zhu, Zerui Tian, Zhen Dong
  • for: This paper studies the application of point cloud technology in the transportation field, particularly road network construction from point clouds and real-time traffic recognition.
  • methods: Data are collected with laser scanners. To address problems in road network construction, such as feature points being misjudged as ground points and grid voids, the paper builds a point cloud pyramid modeled after the image pyramid, expands the virtual grid, applies CSF for ground-point cloud extraction, and constructs a road network model with the PTD (progressive density-based filter) algorithm.
  • results: Through analysis of and experiments on real data, the paper proposes a point-cloud-based method for real-time traffic recognition and calculates the road camera position with 10° and 15 m accuracy.
    Abstract Since NASA put forward the concept of the digital twin in 2010, many industries have put forward the dynamic goal of digital development, and the transportation industry is also among them. With more and more companies laying out on this virgin land, the digital twin transportation industry has grown rapidly and gradually formed a complete scientific research system. However, under the largely mature framework, there are still many loophole problems that need to be solved. In the process of constructing a road network with point cloud information, we summarize several major features of the point cloud collected by laser scanners and analyze the potential problems of constructing the network, such as misjudging the feature points as ground points and grid voids. On this basis, we reviewed relevant literature and proposed targeted solutions, such as building a point cloud pyramid modeled after the image pyramid, expanding the virtual grid, etc., applying CSF for ground-point cloud extraction, and constructing a road network model using the PTD (progressive density-based filter) algorithm. For the problem of road sign detection, we optimize the remote sensing data in the ground point cloud by enhancing the information density using edge detection, improving the data quality by removing the low intensity points, and achieving 90% accuracy of road text recognition using PaddleOCR and Densenet. As for the real-time digital twin traffic, we design the P2PRN network using the backbone of MPR-GAN for 2D feature generation and SuperGlue for 2D feature matching, rendering the viewpoints according to the matching optimization points, completing the multimodal matching task after several iterations, and successfully calculating the road camera position with 10° and 15m accuracy.
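CSF and PTD are too involved for a short listing, but the underlying idea of ground-point extraction can be illustrated with a crude grid-minimum filter: points within a height threshold of the lowest point in their (x, y) grid cell are kept as ground candidates. This is a deliberate simplification for intuition, not the paper's pipeline:

```python
import numpy as np

def grid_ground_filter(points, cell=1.0, dz=0.2):
    """Keep points within dz meters of the lowest point in their (x, y) grid cell.
    points: (n, 3) array of x, y, z coordinates."""
    cells = np.floor(points[:, :2] / cell).astype(int)
    ground = np.zeros(len(points), dtype=bool)
    for key in {tuple(c) for c in cells}:
        mask = np.all(cells == key, axis=1)
        zmin = points[mask, 2].min()
        ground |= mask & (points[:, 2] <= zmin + dz)
    return ground

pts = np.array([[0.1, 0.1, 0.00],   # road surface
                [0.4, 0.2, 0.05],   # road surface
                [0.5, 0.5, 1.50]])  # pole / sign in the same cell
print(grid_ground_filter(pts))  # [ True  True False]
```

CSF instead simulates a cloth draped over the inverted cloud, and PTD grows a triangulated ground surface progressively, both of which handle slopes far better than a flat per-cell threshold.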

Deep MDP: A Modular Framework for Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2310.14294
  • repo_url: https://github.com/abhineet123/deep_mdp
  • paper_authors: Abhineet Singh
  • for: This paper presents a fast and modular Multi-Object Tracking (MOT) framework based on the Markov decision process (MDP) tracking-by-detection paradigm.
  • methods: The framework follows the MDP tracking-by-detection paradigm and is designed so that its functional components can be replaced by custom-designed alternatives to suit a given application.
  • results: While it does not break new ground in performance, Deep MDP offers a large code base that should help the community try out new ideas, or simply provide an easy-to-use and easy-to-adapt system for any MOT application.
    Abstract This paper presents a fast and modular framework for Multi-Object Tracking (MOT) based on the Markov decision process (MDP) tracking-by-detection paradigm. It is designed to allow its various functional components to be replaced by custom-designed alternatives to suit a given application. An interactive GUI with integrated object detection, segmentation, MOT and semi-automated labeling is also provided to help make it easier to get started with this framework. Though not breaking new ground in terms of performance, Deep MDP has a large code-base that should be useful for the community to try out new ideas or simply to have an easy-to-use and easy-to-adapt system for any MOT application. Deep MDP is available at https://github.com/abhineet123/deep_mdp.
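In the MDP tracking-by-detection paradigm that Deep MDP builds on, each target is modeled as a Markov decision process whose states are typically Active, Tracked, Lost, and Inactive. A toy sketch of that lifecycle (the transition rules below are hand-written stand-ins for the framework's learned policies):

```python
# Toy MDP lifecycle for a single tracked target: a matched detection keeps it
# Tracked; misses push it to Lost, and too many consecutive misses retire it.
ACTIVE, TRACKED, LOST, INACTIVE = "active", "tracked", "lost", "inactive"

class TargetMDP:
    def __init__(self, patience=2):
        self.state = ACTIVE
        self.patience = patience  # consecutive missed frames tolerated while Lost
        self.missed = 0

    def step(self, detection_matched):
        if self.state == INACTIVE:          # terminal state
            return self.state
        if detection_matched:
            self.state, self.missed = TRACKED, 0
        else:
            self.missed += 1
            self.state = INACTIVE if self.missed > self.patience else LOST
        return self.state

target = TargetMDP(patience=2)
matches = [True, True, False, False, True, False, False, False]
history = [target.step(m) for m in matches]
print(history)
```

Running the example, the target recovers from two misses when re-detected, then retires after three consecutive misses.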

A Survey on Continual Semantic Segmentation: Theory, Challenge, Method and Application

  • paper_url: http://arxiv.org/abs/2310.14277
  • repo_url: https://github.com/ybio/surveycss
  • paper_authors: Bo Yuan, Danpei Zhao
  • for: This paper is a survey of continual learning, emphasizing its application to classification, detection, and segmentation tasks in computer vision, with a focus on continual semantic segmentation (CSS).
  • methods: The paper first elucidates the problem definitions and primary challenges of continual learning, then categorizes and analyzes current CSS models along two main branches: data-replay and data-free approaches.
  • results: The paper identifies four CSS specialties with diverse application scenarios and development tendencies, and provides a CSS benchmark available at https://github.com/YBIO/SurveyCSS.
    Abstract Continual learning, also known as incremental learning or life-long learning, stands at the forefront of deep learning and AI systems. It breaks through the obstacle of one-way training on close sets and enables continuous adaptive learning on open-set conditions. In the recent decade, continual learning has been explored and applied in multiple fields especially in computer vision covering classification, detection and segmentation tasks. Continual semantic segmentation (CSS), of which the dense prediction peculiarity makes it a challenging, intricate and burgeoning task. In this paper, we present a review of CSS, committing to building a comprehensive survey on problem formulations, primary challenges, universal datasets, neoteric theories and multifarious applications. Concretely, we begin by elucidating the problem definitions and primary challenges. Based on an in-depth investigation of relevant approaches, we sort out and categorize current CSS models into two main branches including \textit{data-replay} and \textit{data-free} sets. In each branch, the corresponding approaches are similarity-based clustered and thoroughly analyzed, following qualitative comparison and quantitative reproductions on relevant datasets. Besides, we also introduce four CSS specialities with diverse application scenarios and development tendencies. Furthermore, we develop a benchmark for CSS encompassing representative references, evaluation results and reproductions, which is available at~\url{https://github.com/YBIO/SurveyCSS}. We hope this survey can serve as a reference-worthy and stimulating contribution to the advancement of the life-long learning field, while also providing valuable perspectives for related fields.

Guidance system for Visually Impaired Persons using Deep Learning and Optical flow

  • paper_url: http://arxiv.org/abs/2310.14239
  • repo_url: None
  • paper_authors: Shwetang Dubey, Alok Ranjan Sahoo, Pavan Chakraborty
  • for: This study aims to help visually impaired persons understand their surroundings in fast-paced environments such as busy streets.
  • methods: The method uses YOLOv3 for object detection, Lucas-Kanade optical flow estimation, and Depth-net depth estimation to determine the direction of approach and distance of approaching objects.
  • results: The model has been tested in real-world scenarios and proved effective at providing visually impaired persons with the necessary information and warnings.
    Abstract Visually impaired persons find it difficult to know about their surroundings while walking on a road. Walking sticks used by them can only give them information about the obstacles in the stick's proximity. Moreover, it is mostly effective in static or very slow-paced environments. Hence, this paper introduces a method to guide them in a busy street. To create such a system it is very important to know about the approaching object and its direction of approach. To achieve this objective we created a method in which the image frame received from the video is divided into three parts i.e. center, left, and right to know the direction of approach of the approaching object. Object detection is done using YOLOv3. Lucas Kanade's optical flow estimation method is used for the optical flow estimation and Depth-net is used for depth estimation. Using the depth information, object motion trajectory, and object category information, the model provides necessary information/warning to the person. This model has been tested in the real world to show its effectiveness.
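The left/center/right split described above reduces to checking which third of the frame a detected object's bounding-box center falls into; a minimal sketch with the YOLOv3 detection and optical-flow stages omitted:

```python
def approach_direction(bbox, frame_width):
    """Classify an approaching object as 'left', 'center', or 'right' from the
    horizontal position of its bounding-box center.
    bbox = (x_min, y_min, x_max, y_max) in pixels."""
    x_center = (bbox[0] + bbox[2]) / 2.0
    third = frame_width / 3.0
    if x_center < third:
        return "left"
    if x_center < 2 * third:
        return "center"
    return "right"

# 640-pixel-wide frame: a box centered at x = 100 lies in the left third.
print(approach_direction((80, 200, 120, 260), 640))  # left
```

In the full system, the optical-flow and depth estimates then decide whether an object in that third is actually approaching and warrants a warning.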

A comprehensive survey on deep active learning and its applications in medical image analysis

  • paper_url: http://arxiv.org/abs/2310.14230
  • repo_url: https://github.com/lighterswang/awesome-active-learning-for-medical-image-analysis
  • paper_authors: Haoran Wang, Qiuye Jin, Shiman Li, Siyu Liu, Manning Wang, Zhijian Song
  • for: To reduce annotation costs in medical image analysis, active learning selects the most informative samples for annotation and trains high-performance models with as few labeled samples as possible.
  • methods: The survey reviews the core methods of active learning, namely the evaluation of informativeness and the sampling strategy, and also covers the integration of active learning with other label-efficient strategies such as semi-supervised and self-supervised learning.
  • results: The survey summarizes active learning works specifically tailored to medical image analysis and offers perspectives on future trends and challenges in the field.
    Abstract Deep learning has achieved widespread success in medical image analysis, leading to an increasing demand for large-scale expert-annotated medical image datasets. Yet, the high cost of annotating medical images severely hampers the development of deep learning in this field. To reduce annotation costs, active learning aims to select the most informative samples for annotation and train high-performance models with as few labeled samples as possible. In this survey, we review the core methods of active learning, including the evaluation of informativeness and sampling strategy. For the first time, we provide a detailed summary of the integration of active learning with other label-efficient techniques, such as semi-supervised, self-supervised learning, and so on. Additionally, we also highlight active learning works that are specifically tailored to medical image analysis. In the end, we offer our perspectives on the future trends and challenges of active learning and its applications in medical image analysis.
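A canonical informativeness measure from the active learning literature is predictive entropy: the samples whose softmax outputs are most uncertain are queried for annotation first. A minimal numpy sketch (generic uncertainty sampling, not any specific method from the survey):

```python
import numpy as np

def entropy_sampling(probs, k):
    """Pick the k most informative samples: highest predictive entropy.
    probs: (n_samples, n_classes) softmax outputs of the current model."""
    eps = 1e-12  # guard against log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(entropy)[::-1][:k]  # most uncertain first

probs = np.array([
    [0.98, 0.01, 0.01],   # confident  -> low entropy
    [0.34, 0.33, 0.33],   # uncertain  -> high entropy
    [0.70, 0.20, 0.10],   # in between
])
print(entropy_sampling(probs, 2))  # [1 2]
```

Real pipelines combine such uncertainty scores with diversity terms so that the queried batch is not redundant.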

Hierarchical Vector Quantized Transformer for Multi-class Unsupervised Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.14228
  • repo_url: https://github.com/ruiyinglu/hvq-trans
  • paper_authors: Ruiying Lu, YuJie Wu, Long Tian, Dongsheng Wang, Bo Chen, Xiyang Liu, Ruimin Hu
  • for: This study proposes a multi-class unsupervised image anomaly detection (UAD) model that learns robust and discriminative representations of normal samples. Training a separate solution per class incurs expensive computation and limited generalizability, so the paper focuses on building a unified framework that handles multiple classes simultaneously.
  • methods: The paper proposes a hierarchical vector quantized prototype-oriented Transformer (HVQ-Trans) to resolve the "identical shortcut" issue in unified anomaly detection. First, instead of learning continuous representations, it preserves the typical normal patterns as discrete iconic prototypes, confirming the importance of vector quantization in avoiding the shortcut. Second, a hierarchical framework relieves the codebook-collapse issue and replenishes frail normal patterns. Third, a prototype-oriented optimal transport method better regulates the prototypes and hierarchically evaluates the anomaly score.
  • results: Experiments on the MVTec-AD and VisA datasets show that the model surpasses state-of-the-art alternatives while offering good interpretability.
    Abstract Unsupervised image Anomaly Detection (UAD) aims to learn robust and discriminative representations of normal samples. While separate solutions per class endow expensive computation and limited generalizability, this paper focuses on building a unified framework for multiple classes. Under such a challenging setting, popular reconstruction-based networks with continuous latent representation assumption always suffer from the "identical shortcut" issue, where both normal and abnormal samples can be well recovered and difficult to distinguish. To address this pivotal issue, we propose a hierarchical vector quantized prototype-oriented Transformer under a probabilistic framework. First, instead of learning the continuous representations, we preserve the typical normal patterns as discrete iconic prototypes, and confirm the importance of Vector Quantization in preventing the model from falling into the shortcut. The vector quantized iconic prototype is integrated into the Transformer for reconstruction, such that the abnormal data point is flipped to a normal data point.Second, we investigate an exquisite hierarchical framework to relieve the codebook collapse issue and replenish frail normal patterns. Third, a prototype-oriented optimal transport method is proposed to better regulate the prototypes and hierarchically evaluate the abnormal score. By evaluating on MVTec-AD and VisA datasets, our model surpasses the state-of-the-art alternatives and possesses good interpretability. The code is available at https://github.com/RuiyingLu/HVQ-Trans.
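The vector-quantization step at the heart of such models maps each continuous feature to its nearest codebook prototype, so an anomalous feature gets "flipped" onto a normal pattern; a minimal numpy sketch (the two-entry codebook is made up for illustration):

```python
import numpy as np

def vector_quantize(features, codebook):
    """Replace each feature vector with its nearest codebook entry.
    features: (n, d); codebook: (k, d). Returns (indices, quantized)."""
    # Squared distances between every feature and every prototype.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # two "normal" prototypes
feats = np.array([[0.1, -0.2], [0.9, 1.2]])     # noisy observations
idx, quantized = vector_quantize(feats, codebook)
print(idx)  # [0 1]
```

Because reconstruction must pass through these discrete prototypes, abnormal inputs cannot be trivially copied through the network, which is what breaks the "identical shortcut".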

Multi-stream Cell Segmentation with Low-level Cues for Multi-modality Images

  • paper_url: http://arxiv.org/abs/2310.14226
  • repo_url: https://github.com/lhaof/cellseg
  • paper_authors: Wei Lou, Xinyi Yu, Chenyu Liu, Xiang Wan, Guanbin Li, Siqi Liu, Haofeng Li
  • for: This study addresses cell segmentation in multi-modality microscopy images, which is challenging because of the complex textures, patterns, and cell shapes in these images.
  • methods: The authors first develop an automatic cell classification pipeline that labels microscopy images based on their low-level image characteristics and train a classification model on these category labels. They then train a separate segmentation model for each category, additionally deploying two different segmentation models for roundish and irregular cell shapes.
  • results: Evaluated on the Tuning Set of the NeurIPS 2022 Cell Segmentation Challenge, the method achieves an F1-score of 0.8795, with the running time for all cases within the time tolerance.
    Abstract Cell segmentation for multi-modal microscopy images remains a challenge due to the complex textures, patterns, and cell shapes in these images. To tackle the problem, we first develop an automatic cell classification pipeline to label the microscopy images based on their low-level image characteristics, and then train a classification model based on the category labels. Afterward, we train a separate segmentation model for each category using the images in the corresponding category. Besides, we further deploy two types of segmentation models to segment cells with roundish and irregular shapes respectively. Moreover, an efficient and powerful backbone model is utilized to enhance the efficiency of our segmentation model. Evaluated on the Tuning Set of NeurIPS 2022 Cell Segmentation Challenge, our method achieves an F1-score of 0.8795 and the running time for all cases is within the time tolerance.
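The classify-then-segment design above amounts to routing each image to a per-category model; a schematic sketch with placeholder classifier and segmenters (all names and heuristics here are illustrative, not the authors' code):

```python
def route_and_segment(image, classify, segmenters):
    """Dispatch an image to the segmentation model matching its predicted category."""
    category = classify(image)
    return category, segmenters[category](image)

# Placeholder models standing in for the trained networks.
classify = lambda img: "roundish" if img["mean_circularity"] > 0.8 else "irregular"
segmenters = {
    "roundish":  lambda img: "distance-transform watershed mask",
    "irregular": lambda img: "boundary-aware mask",
}

print(route_and_segment({"mean_circularity": 0.9}, classify, segmenters))
```

The benefit of this routing is that each segmenter only ever sees one image category, so it can specialize instead of averaging over very different modalities.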

One-for-All: Towards Universal Domain Translation with a Single StyleGAN

  • paper_url: http://arxiv.org/abs/2310.14222
  • repo_url: None
  • paper_authors: Yong Du, Jiahui Zhan, Shengfeng He, Xinzhe Li, Junyu Dong, Sheng Chen, Ming-Hsuan Yang
  • for: The paper proposes a novel translation model, UniTranslator, for transforming representations between visually distinct domains with limited training data and significant visual differences.
  • methods: UniTranslator leverages the domain-neutral capabilities of CLIP as a bridging mechanism and uses a separate module to extract abstract, domain-agnostic semantics from the embeddings of both the source and target realms. A new non-linear mapper, CLIP2P, is introduced to bridge the gap between the CLIP and StyleGAN spaces.
  • results: UniTranslator is versatile and capable of various tasks, including style mixing, stylization, and translation, even in visually challenging scenarios across different visual domains. It generates high-quality translations that showcase domain relevance, diversity, and improved image quality, surpassing existing general-purpose models and performing well against specialized models on representative tasks.
    Abstract In this paper, we propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains under conditions of limited training data and significant visual differences. The main idea behind our approach is leveraging the domain-neutral capabilities of CLIP as a bridging mechanism, while utilizing a separate module to extract abstract, domain-agnostic semantics from the embeddings of both the source and target realms. Fusing these abstract semantics with target-specific semantics results in a transformed embedding within the CLIP space. To bridge the gap between the disparate worlds of CLIP and StyleGAN, we introduce a new non-linear mapper, the CLIP2P mapper. Utilizing CLIP embeddings, this module is tailored to approximate the latent distribution in the P space, effectively acting as a connector between these two spaces. The proposed UniTranslator is versatile and capable of performing various tasks, including style mixing, stylization, and translations, even in visually challenging scenarios across different visual domains. Notably, UniTranslator generates high-quality translations that showcase domain relevance, diversity, and improved image quality. UniTranslator surpasses the performance of existing general-purpose models and performs well against specialized models in representative tasks. The source code and trained models will be released to the public.

The Importance of Anti-Aliasing in Tiny Object Detection

  • paper_url: http://arxiv.org/abs/2310.14221
  • repo_url: https://github.com/freshn/Anti-aliasing-Tiny-Object-Detection
  • paper_authors: Jinlai Ning, Michael Spratling
  • for: This paper focuses on tiny object detection, where the convolutional neural networks (CNNs) used as detection backbones typically neglect Nyquist's sampling theorem during down-sampling, causing aliasing and degraded performance.
  • methods: The paper applies an existing anti-aliasing approach, WaveCNet, to tiny object detection. WaveCNet replaces the standard down-sampling in CNNs with Wavelet Pooling (WaveletPool) layers, effectively suppressing aliasing. In addition, a bottom-heavy version of the backbone further improves tiny object detection while reducing the number of parameters by almost half.
  • results: Experiments show that anti-aliasing matters for tiny object detection: the proposed method achieves new state-of-the-art results on the TinyPerson, WiderFace, and DOTA datasets. Code and experiment results are available at https://github.com/freshn/Anti-aliasing-Tiny-Object-Detection.git.
    Abstract Tiny object detection has gained considerable attention in the research community owing to the frequent occurrence of tiny objects in numerous critical real-world scenarios. However, convolutional neural networks (CNNs) used as the backbone for object detection architectures typically neglect Nyquist's sampling theorem during down-sampling operations, resulting in aliasing and degraded performance. This is likely to be a particular issue for tiny objects that occupy very few pixels and therefore have high spatial frequency features. This paper applied an existing approach WaveCNet for anti-aliasing to tiny object detection. WaveCNet addresses aliasing by replacing standard down-sampling processes in CNNs with Wavelet Pooling (WaveletPool) layers, effectively suppressing aliasing. We modify the original WaveCNet to apply WaveletPool in a consistent way in both pathways of the residual blocks in ResNets. Additionally, we also propose a bottom-heavy version of the backbone, which further improves the performance of tiny object detection while also reducing the required number of parameters by almost half. Experimental results on the TinyPerson, WiderFace, and DOTA datasets demonstrate the importance of anti-aliasing in tiny object detection and the effectiveness of the proposed method which achieves new state-of-the-art results on all three datasets. Codes and experiment results are released at https://github.com/freshn/Anti-aliasing-Tiny-Object-Detection.git.
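Why anti-aliased down-sampling matters can be seen on a 1-D toy signal: naive stride-2 subsampling of the highest-frequency pattern keeps only one phase, so the oscillation vanishes entirely, whereas low-pass pooling (here a plain 2-tap average, which up to normalization is the Haar scaling coefficient that WaveletPool's low-frequency branch computes) preserves the local mean:

```python
import numpy as np

x = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # highest-frequency signal

strided = x[::2]                        # naive downsampling: keeps one phase only
pooled = x.reshape(-1, 2).mean(axis=1)  # low-pass filter, then downsample

print(strided)  # [0. 0. 0. 0.]   -- aliased: the oscillation disappears
print(pooled)   # [0.5 0.5 0.5 0.5] -- the local mean survives
```

Tiny objects occupy few pixels and therefore carry exactly this kind of high-frequency content, which is why they suffer most from the strided variant.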

TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2310.14214
  • repo_url: None
  • paper_authors: Tianyu Yan, Zifu Wan, Pingping Zhang, Gong Cheng, Huchuan Lu
  • for: This paper aims to improve the accuracy and completeness of change detection (CD) in remote sensing images, in particular by exploiting Transformers for long-range dependency modeling to strengthen feature extraction and the integrity of CD regions.
  • methods: The paper proposes a Transformer-based learning framework named TransY-Net. It first leverages the long-range dependency modeling of Transformers to learn more discriminative global-level features and obtain complete CD regions, then introduces a pyramid structure that aggregates multi-level visual features, with a Progressive Attention Module (PAM) adding inter-dependencies through spatial and channel attention.
  • results: Extensive experiments show that TransY-Net achieves new state-of-the-art performance on four optical and two synthetic aperture radar (SAR) image CD benchmarks. The code is released at https://github.com/Drchip61/TransYNet.
    Abstract In the remote sensing field, Change Detection (CD) aims to identify and localize the changed regions from dual-phase images over the same places. Recently, it has achieved great progress with the advances of deep learning. However, current methods generally deliver incomplete CD regions and irregular CD boundaries due to the limited representation ability of the extracted visual features. To relieve these issues, in this work we propose a novel Transformer-based learning framework named TransY-Net for remote sensing image CD, which improves the feature extraction from a global view and combines multi-level visual features in a pyramid manner. More specifically, the proposed framework first utilizes the advantages of Transformers in long-range dependency modeling. It can help to learn more discriminative global-level features and obtain complete CD regions. Then, we introduce a novel pyramid structure to aggregate multi-level visual features from Transformers for feature enhancement. The pyramid structure grafted with a Progressive Attention Module (PAM) can improve the feature representation ability with additional inter-dependencies through spatial and channel attentions. Finally, to better train the whole framework, we utilize the deeply-supervised learning with multiple boundary-aware loss functions. Extensive experiments demonstrate that our proposed method achieves a new state-of-the-art performance on four optical and two SAR image CD benchmarks. The source code is released at https://github.com/Drchip61/TransYNet.

Diffusion-based Data Augmentation for Nuclei Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.14197
  • repo_url: https://github.com/lhaof/nudiff
  • paper_authors: Xinyi Yu, Guanbin Li, Wei Lou, Siqi Liu, Xiang Wan, Yan Chen, Haofeng Li
  • for: improving nuclei segmentation in the quantitative analysis of histopathology images
  • methods: diffusion-based data augmentation that synthesizes paired histopathology images and instance maps
  • results: augmenting a 10%-labeled real dataset with synthetic samples achieves segmentation performance comparable to the fully-supervised baseline
    Abstract Nuclei segmentation is a fundamental but challenging task in the quantitative analysis of histopathology images. Although fully-supervised deep learning-based methods have made significant progress, a large number of labeled images are required to achieve great segmentation performance. Considering that manually labeling all nuclei instances for a dataset is inefficient, obtaining a large-scale human-annotated dataset is time-consuming and labor-intensive. Therefore, augmenting a dataset with only a few labeled images to improve the segmentation performance is of significant research and application value. In this paper, we introduce the first diffusion-based augmentation method for nuclei segmentation. The idea is to synthesize a large number of labeled images to facilitate training the segmentation model. To achieve this, we propose a two-step strategy. In the first step, we train an unconditional diffusion model to synthesize the Nuclei Structure that is defined as the representation of pixel-level semantic and distance transform. Each synthetic nuclei structure will serve as a constraint on histopathology image synthesis and is further post-processed to be an instance map. In the second step, we train a conditioned diffusion model to synthesize histopathology images based on nuclei structures. The synthetic histopathology images paired with synthetic instance maps will be added to the real dataset for training the segmentation model. The experimental results show that by augmenting 10% labeled real dataset with synthetic samples, one can achieve comparable segmentation results with the fully-supervised baseline.
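The first step trains an unconditional diffusion model on the "nuclei structure", defined as a pixel-level semantic representation plus a distance transform. A hedged sketch of how such a training target could be derived from an instance map; the brute-force distance computation is for illustration only (a real pipeline would use an optimized distance transform):

```python
import numpy as np

def nuclei_structure(inst_map):
    """Turn an instance map into a 'nuclei structure': a binary semantic
    mask plus, for each foreground pixel, the Euclidean distance to the
    nearest background pixel (computed by brute force)."""
    semantic = (inst_map > 0).astype(float)
    bg = np.argwhere(inst_map == 0)              # background coordinates
    dist = np.zeros(inst_map.shape)
    for y, x in np.argwhere(inst_map > 0):
        dist[y, x] = np.sqrt(((bg - (y, x)) ** 2).sum(axis=1)).min()
    return semantic, dist

inst = np.zeros((5, 5), dtype=int)
inst[1:4, 1:4] = 1                               # one 3x3 nucleus
sem, dist = nuclei_structure(inst)
assert sem.sum() == 9
assert dist[2, 2] == 2.0                         # nucleus center is deepest
```

The conditioned diffusion model of the second step would then synthesize a histopathology image given such a structure, and the structure is post-processed back into an instance map.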

Distractor-aware Event-based Tracking

  • paper_url: http://arxiv.org/abs/2310.14194
  • repo_url: None
  • paper_authors: Yingkai Fu, Meng Li, Wenxi Liu, Yuanchen Wang, Jiqing Zhang, Baocai Yin, Xiaopeng Wei, Xin Yang
  • for: an event-camera-based visual tracker that tracks objects more robustly and efficiently in challenging scenarios
  • methods: a Siamese architecture with transformer modules, combining a motion-aware network and a target-aware network that jointly exploit motion cues and object contours in event data to discover moving objects and identify the target while removing dynamic distractors
  • results: extensive experiments on two large-scale event tracking datasets validate the model, showing higher accuracy and efficiency than state-of-the-art trackers
    Abstract Event cameras, or dynamic vision sensors, have recently achieved success from fundamental vision tasks to high-level vision research. Due to their ability to asynchronously capture light intensity changes, event cameras have an inherent advantage in capturing moving objects in challenging scenarios such as low light, high dynamic range, or fast motion. Event cameras are thus natural candidates for visual object tracking. However, the current event-based trackers derived from RGB trackers simply modify the input images to event frames and still follow a conventional tracking pipeline that mainly focuses on object texture for target distinction. As a result, the trackers may not be robust in challenging scenarios such as moving cameras and cluttered foregrounds. In this paper, we propose a distractor-aware event-based tracker that introduces transformer modules into a Siamese network architecture (named DANet). Specifically, our model is mainly composed of a motion-aware network and a target-aware network, which simultaneously exploit both motion cues and object contours from event data, so as to discover moving objects and identify the target object by removing dynamic distractors. Our DANet can be trained in an end-to-end manner without any post-processing and can run at over 80 FPS on a single V100. We conduct comprehensive experiments on two large event tracking datasets to validate the proposed model. We demonstrate that our tracker has superior performance against the state-of-the-art trackers in terms of both accuracy and efficiency.
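Event-based trackers typically first aggregate the asynchronous event stream into frame-like tensors before feeding a Siamese network. A small sketch of that common preprocessing step; the two-channel polarity histogram below is a generic choice, not necessarily the exact representation DANet uses.

```python
import numpy as np

def events_to_frame(events, h, w):
    """Accumulate asynchronous events (x, y, polarity) into a 2-channel
    frame: one channel per polarity, counting events per pixel."""
    frame = np.zeros((2, h, w))
    for x, y, p in events:
        frame[0 if p > 0 else 1, y, x] += 1
    return frame

# a small burst of hypothetical events: (x, y, polarity)
events = [(1, 2, +1), (1, 2, +1), (3, 0, -1)]
frame = events_to_frame(events, h=4, w=4)
assert frame[0, 2, 1] == 2 and frame[1, 0, 3] == 1
```

Motion cues (where events fire) and object contours (intensity edges dominate event generation) are both visible in such frames, which is what the motion-aware and target-aware branches exploit.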

Partition Speeds Up Learning Implicit Neural Representations Based on Exponential-Increase Hypothesis

  • paper_url: http://arxiv.org/abs/2310.14184
  • repo_url: None
  • paper_authors: Ke Liu, Feng Liu, Haishuai Wang, Ning Ma, Jiajun Bu, Bo Han
  • for: learning an implicit neural representation, i.e., a continuous function that maps pixel coordinates to an image
  • methods: partition the image into sub-regions and fit a small network to each part; two partition rules are proposed, based on regular grids and semantic segmentation maps, for both the ordinary learning and the learning-to-learn frameworks
  • results: empirically, forcing a continuous network to fit an image with many objects makes training time grow exponentially with the number of discontinuity boundaries (the "exponential-increase" hypothesis); the partition strategy significantly speeds up convergence
    Abstract $\textit{Implicit neural representations}$ (INRs) aim to learn a $\textit{continuous function}$ (i.e., a neural network) to represent an image, where the input and output of the function are pixel coordinates and RGB/Gray values, respectively. However, images tend to consist of many objects whose colors are not perfectly consistent, resulting in the challenge that image is actually a $\textit{discontinuous piecewise function}$ and cannot be well estimated by a continuous function. In this paper, we empirically investigate that if a neural network is enforced to fit a discontinuous piecewise function to reach a fixed small error, the time costs will increase exponentially with respect to the boundaries in the spatial domain of the target signal. We name this phenomenon the $\textit{exponential-increase}$ hypothesis. Under the $\textit{exponential-increase}$ hypothesis, learning INRs for images with many objects will converge very slowly. To address this issue, we first prove that partitioning a complex signal into several sub-regions and utilizing piecewise INRs to fit that signal can significantly speed up the convergence. Based on this fact, we introduce a simple partition mechanism to boost the performance of two INR methods for image reconstruction: one for learning INRs, and the other for learning-to-learn INRs. In both cases, we partition an image into different sub-regions and dedicate smaller networks for each part. In addition, we further propose two partition rules based on regular grids and semantic segmentation maps, respectively. Extensive experiments validate the effectiveness of the proposed partitioning methods in terms of learning INR for a single image (ordinary learning framework) and the learning-to-learn framework.
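The regular-grid partition rule can be sketched in a few lines: each pixel is routed to a tile, and each tile is fitted independently. In the sketch below the per-tile "sub-network" is just a constant, the cheapest possible piecewise fit, chosen purely to show why partitioning handles discontinuities that defeat a single continuous function; the paper dedicates a small MLP to each part instead.

```python
import numpy as np

def grid_partition(h, w, gh, gw):
    """Regular-grid partition rule: map each pixel to one of gh*gw tiles."""
    ys, xs = np.mgrid[0:h, 0:w]
    return (ys * gh // h) * gw + (xs * gw // w)

def piecewise_fit(img, tiles):
    """Fit each tile independently (here: a constant per tile)."""
    out = np.zeros_like(img, dtype=float)
    for t in np.unique(tiles):
        out[tiles == t] = img[tiles == t].mean()
    return out

img = np.zeros((8, 8))
img[:, 4:] = 1.0                       # a discontinuous piecewise image
tiles = grid_partition(8, 8, 2, 2)
recon = piecewise_fit(img, tiles)
assert np.allclose(recon, img)         # tile edges align with the jump
```

When a tile boundary coincides with an object boundary, each sub-problem becomes smooth, which is exactly the effect the semantic-segmentation partition rule aims for.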

Prompt-based Grouping Transformer for Nucleus Detection and Classification

  • paper_url: http://arxiv.org/abs/2310.14176
  • repo_url: https://github.com/lhaof/pgt
  • paper_authors: Junjia Huang, Haofeng Li, Weijun Sun, Xiang Wan, Guanbin Li
  • for: a new nucleus detection and classification method that produces effective information for disease diagnosis
  • methods: a grouping-transformer-based classifier that hierarchically groups nucleus embeddings and predicts cell types from the pairwise correlations between categorical embeddings and nucleus features; the group embeddings serve as input prompts to the backbone, so only the prompts, rather than the whole backbone, need tuning
  • results: experiments show the proposed method significantly outperforms existing models on three datasets
    Abstract Automatic nuclei detection and classification can produce effective information for disease diagnosis. Most existing methods classify nuclei independently or do not make full use of the semantic similarity between nuclei and their grouping features. In this paper, we propose a novel end-to-end nuclei detection and classification framework based on a grouping transformer-based classifier. The nuclei classifier learns and updates the representations of nuclei groups and categories via hierarchically grouping the nucleus embeddings. Then the cell types are predicted with the pairwise correlations between categorical embeddings and nucleus features. For the efficiency of the fully transformer-based framework, we take the nucleus group embeddings as the input prompts of backbone, which helps harvest grouping guided features by tuning only the prompts instead of the whole backbone. Experimental results show that the proposed method significantly outperforms the existing models on three datasets.
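The classifier predicts cell types from pairwise correlations between categorical embeddings and nucleus features, and group embeddings are formed by hierarchically pooling member nuclei. A toy NumPy sketch of those two operations; cosine similarity and mean pooling are plausible stand-ins for the learned versions, and the 2-D features are purely illustrative.

```python
import numpy as np

def l2norm(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def classify_by_correlation(nucleus_feats, category_embs):
    """Predict each nucleus's type from pairwise correlations (cosine
    similarity) between nucleus features and categorical embeddings."""
    sim = l2norm(nucleus_feats) @ l2norm(category_embs).T
    return sim.argmax(axis=1)

def group_embeddings(nucleus_feats, assignment, k):
    """One grouping step: each group embedding is the mean of its members."""
    return np.stack([nucleus_feats[assignment == g].mean(axis=0)
                     for g in range(k)])

cats = np.array([[1.0, 0.0], [0.0, 1.0]])          # two hypothetical cell types
feats = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
pred = classify_by_correlation(feats, cats)
groups = group_embeddings(feats, pred, 2)
assert pred.tolist() == [0, 1, 0]
```

In the paper the group embeddings additionally act as prompts fed back into the transformer backbone, so tuning the prompts alone harvests grouping-guided features.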

ASC: Appearance and Structure Consistency for Unsupervised Domain Adaptation in Fetal Brain MRI Segmentation

  • paper_url: http://arxiv.org/abs/2310.14172
  • repo_url: https://github.com/lhaof/asc
  • paper_authors: Zihang Xu, Haifan Gong, Xiang Wan, Haofeng Li
  • for: improving the accuracy and efficiency of automatic fetal brain MRI tissue segmentation for the quantitative analysis of prenatal neurodevelopment
  • methods: a practical unsupervised domain adaptation (UDA) setting that adapts segmentation labels from high-quality fetal brain atlases to unlabeled fetal brain MRI data from another domain; the proposed ASC framework enforces appearance consistency before and after a frequency-based transformation that swaps appearance between MRI data and atlases, and further encourages prediction consistency under structural perturbations to cope with anatomical variations in the target domain
  • results: extensive experiments on the FeTA 2021 benchmark show that ASC outperforms registration-based, semi-supervised-learning-based, and existing UDA-based methods
    Abstract Automatic tissue segmentation of fetal brain images is essential for the quantitative analysis of prenatal neurodevelopment. However, producing voxel-level annotations of fetal brain imaging is time-consuming and expensive. To reduce labeling costs, we propose a practical unsupervised domain adaptation (UDA) setting that adapts the segmentation labels of high-quality fetal brain atlases to unlabeled fetal brain MRI data from another domain. To address the task, we propose a new UDA framework based on Appearance and Structure Consistency, named ASC. We adapt the segmentation model to the appearances of different domains by constraining the consistency before and after a frequency-based image transformation, which is to swap the appearance between brain MRI data and atlases. Consider that even in the same domain, the fetal brain images of different gestational ages could have significant variations in the anatomical structures. To make the model adapt to the structural variations in the target domain, we further encourage prediction consistency under different structural perturbations. Extensive experiments on FeTA 2021 benchmark demonstrate the effectiveness of our ASC in comparison to registration-based, semi-supervised learning-based, and existing UDA-based methods.
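The appearance-consistency branch relies on a frequency-based transformation that swaps appearance between MRI data and atlases. A common way to realize such a swap is to exchange low-frequency amplitude spectra while keeping the phase; the sketch below follows that Fourier-style recipe, with the band fraction `beta` as an illustrative parameter (the paper's exact transform may differ).

```python
import numpy as np

def swap_low_freq_amplitude(src, ref, beta=0.25):
    """Swap the low-frequency amplitude spectrum of `src` with that of
    `ref` while keeping src's phase: ref's global appearance, src's
    anatomical structure."""
    Fs, Fr = np.fft.fft2(src), np.fft.fft2(ref)
    A = np.abs(Fs).copy()
    Ar = np.abs(Fr)
    h, w = src.shape
    bh, bw = int(h * beta), int(w * beta)
    # low frequencies live in the corners of the unshifted spectrum
    A[:bh, :bw] = Ar[:bh, :bw]
    A[-bh:, :bw] = Ar[-bh:, :bw]
    A[:bh, -bw:] = Ar[:bh, -bw:]
    A[-bh:, -bw:] = Ar[-bh:, -bw:]
    return np.real(np.fft.ifft2(A * np.exp(1j * np.angle(Fs))))

rng = np.random.default_rng(0)
src, ref = rng.random((16, 16)), rng.random((16, 16)) + 2.0
out = swap_low_freq_amplitude(src, ref)
assert out.shape == (16, 16)
```

Because the DC term comes from `ref`, the output inherits the reference's mean brightness; the segmentation model is then trained so that its predictions agree before and after this appearance swap.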

Visual-Attribute Prompt Learning for Progressive Mild Cognitive Impairment Prediction

  • paper_url: http://arxiv.org/abs/2310.14158
  • repo_url: https://github.com/lhaof/vapl
  • paper_authors: Luoyao Kang, Haifan Gong, Xiang Wan, Haofeng Li
  • for: deep-learning-based automatic diagnosis of Mild Cognitive Impairment (MCI) and Alzheimer's Disease (AD), extended to progressive MCI (pMCI) prediction
  • methods: a transformer-based network, the Visual-Attribute Prompt learning transFormer (VAP-Former), that efficiently extracts and fuses multi-modal (imaging and tabular) features; a Prompt fine-Tuning (PT) scheme transfers knowledge from the AD diagnosis task to pMCI detection while keeping the backbone frozen, aided by a global prompt token that provides global guidance to the multi-modal representations
  • results: the method outperforms previous approaches on pMCI prediction; interestingly, the prompt-learning model even surpasses the fully fine-tuned baseline when transferring knowledge from AD to pMCI
    Abstract Deep learning (DL) has been used in the automatic diagnosis of Mild Cognitive Impairment (MCI) and Alzheimer's Disease (AD) with brain imaging data. However, previous methods have not fully exploited the relation between brain image and clinical information that is widely adopted by experts in practice. To exploit the heterogeneous features from imaging and tabular data simultaneously, we propose the Visual-Attribute Prompt Learning-based Transformer (VAP-Former), a transformer-based network that efficiently extracts and fuses the multi-modal features with prompt fine-tuning. Furthermore, we propose a Prompt fine-Tuning (PT) scheme to transfer the knowledge from AD prediction task for progressive MCI (pMCI) diagnosis. In details, we first pre-train the VAP-Former without prompts on the AD diagnosis task and then fine-tune the model on the pMCI detection task with PT, which only needs to optimize a small amount of parameters while keeping the backbone frozen. Next, we propose a novel global prompt token for the visual prompts to provide global guidance to the multi-modal representations. Extensive experiments not only show the superiority of our method compared with the state-of-the-art methods in pMCI prediction but also demonstrate that the global prompt can make the prompt learning process more effective and stable. Interestingly, the proposed prompt learning model even outperforms the fully fine-tuning baseline on transferring the knowledge from AD to pMCI.
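The PT scheme optimizes only a small set of prompt parameters while the pre-trained backbone stays frozen. A tiny numeric sketch of that training regime, with a hypothetical linear "backbone" standing in for the transformer and finite-difference gradients for brevity:

```python
import numpy as np

def forward(W, prompt, x):
    """Hypothetical frozen linear 'backbone' applied to [prompt; x]."""
    return np.concatenate([prompt, x]) @ W

def loss(W, prompt, data):
    return sum((forward(W, prompt, x) - y) ** 2 for x, y in data)

def tune_prompt(W, prompt, data, lr=0.05, steps=100):
    """Prompt fine-tuning: only the prompt receives gradient updates;
    the backbone W is never touched."""
    for _ in range(steps):
        grad = np.zeros_like(prompt)
        for i in range(len(prompt)):
            e = np.zeros_like(prompt)
            e[i] = 1e-5
            grad[i] = (loss(W, prompt + e, data)
                       - loss(W, prompt - e, data)) / 2e-5
        prompt = prompt - lr * grad
    return prompt

rng = np.random.default_rng(0)
W = rng.standard_normal(4)                       # frozen backbone weights
data = [(rng.standard_normal(2), 1.0) for _ in range(8)]
prompt = tune_prompt(W, np.zeros(2), data)
assert loss(W, prompt, data) < loss(W, np.zeros(2), data)
```

Only two parameters change here, which mirrors why PT is so much cheaper than full fine-tuning when transferring from AD to pMCI.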

Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection

  • paper_url: http://arxiv.org/abs/2310.14154
  • repo_url: https://github.com/lhaof/acformer
  • paper_authors: Junjia Huang, Haofeng Li, Xiang Wan, Guanbin Li
  • for: multi-class cell nuclei detection, a fundamental prerequisite of histopathology diagnosis that requires efficiently locating and identifying cells with diverse morphology and distributions
  • methods: a novel Affine-Consistent Transformer (AC-Former) that directly outputs a sequence of nucleus positions and is trained collaboratively through a global network and a local network; the local branch learns from distorted smaller-scale inputs while the global network provides large-scale predictions as extra supervision, and an Adaptive Affine Transformer (AAT) module automatically learns the key spatial transformations used to warp the original images for local-network training
  • results: experiments show the method significantly outperforms existing state-of-the-art algorithms on multiple benchmarks
    Abstract Multi-class cell nuclei detection is a fundamental prerequisite in the diagnosis of histopathology. It is critical to efficiently locate and identify cells with diverse morphology and distributions in digital pathological images. Most existing methods take complex intermediate representations as learning targets and rely on inflexible post-refinements while paying less attention to various cell density and fields of view. In this paper, we propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions and is trained collaboratively through two sub-networks, a global and a local network. The local branch learns to infer distorted input images of smaller scales while the global network outputs the large-scale predictions as extra supervision signals. We further introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training. The AAT module works by learning to capture the transformed image regions that are more valuable for training the model. Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
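Since AC-Former outputs nucleus positions directly, affine consistency can be checked by mapping predictions made in a warped view back through the inverse transform and comparing them with the original predictions. A minimal sketch; the transform values are arbitrary examples of the kind of warp AAT might learn, not values from the paper.

```python
import numpy as np

def affine(points, A, t):
    """Apply an affine transform x -> A @ x + t to (N, 2) positions."""
    return points @ A.T + t

rng = np.random.default_rng(0)
A = np.array([[1.1, 0.2], [-0.1, 0.9]])   # hypothetical learned warp
t = np.array([3.0, -2.0])
pts = rng.random((5, 2)) * 100            # predicted nucleus positions

# consistency check: predictions in the warped view, mapped back through
# the inverse transform, should agree with the originals
warped = affine(pts, A, t)
recovered = affine(warped - t, np.linalg.inv(A), np.zeros(2))
assert np.allclose(recovered, pts)
```

During training, disagreement after this round trip would become a consistency loss between the local and global branches.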

MMTF-DES: A Fusion of Multimodal Transformer Models for Desire, Emotion, and Sentiment Analysis of Social Media Data

  • paper_url: http://arxiv.org/abs/2310.14143
  • repo_url: None
  • paper_authors: Abdul Aziz, Nihad Karim Chowdhury, Muhammad Ashad Kabir, Abu Nowshed Chy, Md. Jawad Siddique
  • for: understanding human desire, sentiment, and emotion, which benefits human-computer interaction, recognition of human emotional intelligence, understanding of interpersonal relationships, and decision making
  • methods: a unified multimodal transformer-based framework over image-text pairs; two state-of-the-art multimodal transformers, ViLT and VAuLT, are jointly fine-tuned to extract diverse visual and contextualized embedding features from social media image-text pairs
  • results: an early-fusion strategy combines the embedding features into joint representations that incorporate diverse information, allowing the model to robustly perceive the image-text context from multiple perspectives
    Abstract Desire is a set of human aspirations and wishes that comprise verbal and cognitive aspects that drive human feelings and behaviors, distinguishing humans from other animals. Understanding human desire has the potential to be one of the most fascinating and challenging research domains. It is tightly coupled with sentiment analysis and emotion recognition tasks. It is beneficial for increasing human-computer interactions, recognizing human emotional intelligence, understanding interpersonal relationships, and making decisions. However, understanding human desire is challenging and under-explored because ways of eliciting desire might be different among humans. The task gets more difficult due to the diverse cultures, countries, and languages. Prior studies overlooked the use of image-text pairwise feature representation, which is crucial for the task of human desire understanding. In this research, we have proposed a unified multimodal transformer-based framework with image-text pair settings to identify human desire, sentiment, and emotion. The core of our proposed method lies in the encoder module, which is built using two state-of-the-art multimodal transformer models. These models allow us to extract diverse features. To effectively extract visual and contextualized embedding features from social media image and text pairs, we conducted joint fine-tuning of two pre-trained multimodal transformer models: Vision-and-Language Transformer (ViLT) and Vision-and-Augmented-Language Transformer (VAuLT). Subsequently, we use an early fusion strategy on these embedding features to obtain combined diverse feature representations of the image-text pair. This consolidation incorporates diverse information about this task, enabling us to robustly perceive the context and image pair from multiple perspectives.
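Early fusion here simply means concatenating the encoders' embeddings into one joint representation before the task heads. A sketch with hypothetical 768-dimensional pooled outputs standing in for ViLT and VAuLT (the actual dimensions and head design in the paper may differ):

```python
import numpy as np

def early_fusion(*embeddings):
    """Early fusion: concatenate each encoder's embedding into one
    joint representation for the downstream classifier."""
    return np.concatenate(embeddings)

rng = np.random.default_rng(0)
vilt_emb = rng.standard_normal(768)    # hypothetical ViLT pooled output
vault_emb = rng.standard_normal(768)   # hypothetical VAuLT pooled output
fused = early_fusion(vilt_emb, vault_emb)

# a simple linear head over the fused features, e.g. 3 target labels
W = rng.standard_normal((1536, 3))
logits = fused @ W
assert logits.shape == (3,)
```

Because fusion happens before classification, the head can exploit cross-encoder interactions that late fusion (averaging per-encoder predictions) would miss.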

cs.AI - 2023-10-22

A generalized likelihood-weighted optimal sampling algorithm for rare-event probability quantification

  • paper_url: http://arxiv.org/abs/2310.14457
  • repo_url: https://github.com/umbrellagong/gpextreme
  • paper_authors: Xianliang Gong, Yulin Pan
  • for: efficiently quantifying rare-event statistics of an input-to-response (ItR) system with a given input probability and expensive function evaluations
  • methods: a new acquisition function that generalizes the likelihood-weighted (LW) acquisition with two additional parameters, addressing two weaknesses of the original: (1) the input space associated with rare-event responses is not sufficiently stressed in sampling, and (2) the surrogate model may deviate significantly from the true ItR function, especially for complex functions and limited samples; a Monte-Carlo discrete optimization procedure additionally accelerates acquisition optimization by orders of magnitude
  • results: outperforms the original LW acquisition across a number of test cases, including cases designed to showcase the original LW acquisition, and is applied to quantify the rare-event roll-motion statistics of a ship in a random sea
    Abstract In this work, we introduce a new acquisition function for sequential sampling to efficiently quantify rare-event statistics of an input-to-response (ItR) system with given input probability and expensive function evaluations. Our acquisition is a generalization of the likelihood-weighted (LW) acquisition that was initially designed for the same purpose and then extended to many other applications. The improvement in our acquisition comes from the generalized form with two additional parameters, by varying which one can target and address two weaknesses of the original LW acquisition: (1) that the input space associated with rare-event responses is not sufficiently stressed in sampling; (2) that the surrogate model (generated from samples) may have significant deviation from the true ItR function, especially for cases with complex ItR function and limited number of samples. In addition, we develop a critical procedure in Monte-Carlo discrete optimization of the acquisition function, which achieves orders of magnitude acceleration compared to existing approaches for such type of problems. The superior performance of our new acquisition to the original LW acquisition is demonstrated in a number of test cases, including some cases that were designed to show the effectiveness of the original LW acquisition. We finally apply our method to an engineering example to quantify the rare-event roll-motion statistics of a ship in a random sea.
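The original LW acquisition rewards candidate points whose inputs are likely but whose predicted outputs are rare, scaled by the surrogate's uncertainty; the paper generalizes it with two additional parameters. The sketch below uses exponents `alpha` and `beta` as stand-ins for those parameters, and the 1-D surrogate and densities are toy assumptions, not the paper's exact generalized form.

```python
import numpy as np

def lw_acquisition(x, mean, std, p_input, p_output, alpha=1.0, beta=1.0):
    """Hedged sketch of a likelihood-weighted acquisition: prefer points
    with high surrogate uncertainty, likely inputs, and rare predicted
    outputs; alpha/beta play the role of the two extra parameters."""
    return std(x) ** 2 * p_input(x) ** alpha / p_output(mean(x)) ** beta

# toy 1-D setup: unnormalized Gaussian densities and a simple surrogate
gauss = lambda m, s: lambda v: np.exp(-((v - m) ** 2) / (2 * s * s))
mean = lambda x: x ** 2                    # surrogate posterior mean
std = lambda x: 0.5 + 0.1 * np.abs(x)      # surrogate posterior std
acq = lambda x: lw_acquisition(x, mean, std, gauss(0, 1), gauss(0, 2))

xs = np.linspace(-3, 3, 601)
x_next = xs[np.argmax(acq(xs))]            # next point to evaluate
assert abs(x_next) == 3.0                  # tails (rare outputs) win here
```

In sequential sampling, `x_next` would be evaluated with the expensive ItR function, the surrogate refitted, and the loop repeated until the rare-event statistics converge.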

Mobile Traffic Prediction at the Edge through Distributed and Transfer Learning

  • paper_url: http://arxiv.org/abs/2310.14456
  • repo_url: None
  • paper_authors: Alfredo Petrella, Marco Miozzo, Paolo Dini
  • for: predicting mobile network traffic at the edge to smartly optimize the mobile network
  • methods: an edge-computing prediction framework using datasets obtained at the edge through a large measurement campaign; two deep learning architectures, based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), are tested under different training conditions, and Knowledge Transfer Learning (KTL) is employed to improve model performance while reducing the required computational resources
  • results: the CNN architectures outperform the RNNs; an estimate of the required training energy shows that KTL reduces the models' energy footprint by 60% for CNNs and 90% for RNNs, and two cutting-edge explainable-AI techniques are used to interpret the learned models
    Abstract Traffic prediction represents one of the crucial tasks for smartly optimizing the mobile network. The research in this topic concentrated in making predictions in a centralized fashion, i.e., by collecting data from the different network elements. This translates to a considerable amount of energy for data transmission and processing. In this work, we propose a novel prediction framework based on edge computing which uses datasets obtained on the edge through a large measurement campaign. Two main Deep Learning architectures are designed, based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), and tested under different training conditions. In addition, Knowledge Transfer Learning (KTL) techniques are employed to improve the performance of the models while reducing the required computational resources. Simulation results show that the CNN architectures outperform the RNNs. An estimation for the needed training energy is provided, highlighting KTL ability to reduce the energy footprint of the models of 60% and 90% for CNNs and RNNs, respectively. Finally, two cutting-edge explainable Artificial Intelligence techniques are employed to interpret the derived learning models.
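Knowledge-transfer learning cuts training cost by warm-starting a target cell's model from one pre-trained on a source cell with similar traffic. A toy linear-regression sketch of why the warm start reaches a lower error in the same number of steps; the synthetic "traffic" data and learning rates are illustrative assumptions.

```python
import numpy as np

def gd_fit(X, y, w, lr=0.1, steps=20):
    """A few gradient-descent steps on mean squared error."""
    for _ in range(steps):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_src = X @ w_true                    # source cell's traffic pattern
y_tgt = X @ (w_true + 0.1)            # a similar target cell

w_src = gd_fit(X, y_src, np.zeros(3), steps=200)   # pre-train on source
w_cold = gd_fit(X, y_tgt, np.zeros(3), steps=5)    # train from scratch
w_warm = gd_fit(X, y_tgt, w_src, steps=5)          # transfer: warm start
assert mse(X, y_tgt, w_warm) < mse(X, y_tgt, w_cold)
```

Fewer steps to a given accuracy is exactly what translates into the 60-90% training-energy savings the paper reports for KTL.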

An International Consortium for Evaluations of Societal-Scale Risks from Advanced AI

  • paper_url: http://arxiv.org/abs/2310.14455
  • repo_url: None
  • paper_authors: Ross Gruetzemacher, Alan Chan, Kevin Frazier, Christy Manning, Štěpán Los, James Fox, José Hernández-Orallo, John Burden, Matija Franklin, Clíodhna Ní Ghuidhir, Mark Bailey, Daniel Eth, Toby Pilditch, Kyle Kilian
  • for: addressing the need for effective governance and regulation of advanced AI systems, particularly in response to the risks they pose
  • methods: proposes the creation of an international consortium for AI risk evaluations, bringing together AI developers and third-party evaluators to assess and mitigate risks from advanced AI systems
  • results: the proposed consortium could play a critical role in coordinating international efforts to manage responsible scaling policies and evaluation-based risk response, potentially helping to mitigate societal-scale risks from advanced AI systems
    Abstract Given rapid progress toward advanced AI and risks from frontier AI systems (advanced AI systems pushing the boundaries of the AI capabilities frontier), the creation and implementation of AI governance and regulatory schemes deserves prioritization and substantial investment. However, the status quo is untenable and, frankly, dangerous. A regulatory gap has permitted AI labs to conduct research, development, and deployment activities with minimal oversight. In response, frontier AI system evaluations have been proposed as a way of assessing risks from the development and deployment of frontier AI systems. Yet, the budding AI risk evaluation ecosystem faces significant coordination challenges, such as a limited diversity of evaluators, suboptimal allocation of effort, and perverse incentives. This paper proposes a solution in the form of an international consortium for AI risk evaluations, comprising both AI developers and third-party AI risk evaluators. Such a consortium could play a critical role in international efforts to mitigate societal-scale risks from advanced AI, including in managing responsible scaling policies and coordinated evaluation-based risk response. In this paper, we discuss the current evaluation ecosystem and its shortcomings, propose an international consortium for advanced AI risk evaluations, discuss issues regarding its implementation, discuss lessons that can be learnt from previous international institutions and existing proposals for international AI governance institutions, and, finally, we recommend concrete steps to advance the establishment of the proposed consortium: (i) solicit feedback from stakeholders, (ii) conduct additional research, (iii) conduct a workshop(s) for stakeholders, (iv) analyze feedback and create final proposal, (v) solicit funding, and (vi) create a consortium.

Retrieval-Augmented Chain-of-Thought in Semi-structured Domains

  • paper_url: http://arxiv.org/abs/2310.14435
  • repo_url: https://github.com/vaibhavg152/Retrieval-Augmented-Chain-of-Thought-in-Semi-structured-Domains
  • paper_authors: Vaibhav Mavi, Abulhair Saparov, Chen Zhao
  • for: 法律和金融领域的问答系统中使用现有的问答系统会存在一些挑战,需要具备专业知识。
  • methods: 这篇论文探讨了利用法律和金融数据的半结构化特性,以高效地检索相关的上下文,使用大语言模型(LLMs)进行领域专业的问答。
  • results: 该系统优于当代模型,同时为答案提供了有用的解释,鼓励将LLM整合到法律和金融NLP系统中,以推动未来研究。
    Abstract Applying existing question answering (QA) systems to specialized domains like law and finance presents challenges that necessitate domain expertise. Although large language models (LLMs) have shown impressive language comprehension and in-context learning capabilities, their inability to handle very long inputs/contexts is well known. Tasks specific to these domains need significant background knowledge, leading to contexts that can often exceed the maximum length that existing LLMs can process. This study explores leveraging the semi-structured nature of legal and financial data to efficiently retrieve relevant context, enabling the use of LLMs for domain-specialized QA. The resulting system outperforms contemporary models and also provides useful explanations for the answers, encouraging the integration of LLMs into legal and financial NLP systems for future research.
    摘要 将现有的问答(QA)系统应用于法律和金融等专业领域存在诸多挑战,需要领域专业知识。尽管大型语言模型(LLM)展现出出色的语言理解和上下文内学习能力,但其难以处理过长输入/上下文的问题是众所周知的。这些领域的任务需要大量背景知识,导致上下文往往超出现有LLM所能处理的最大长度。本研究探讨了利用法律和金融数据的半结构化特性来高效检索相关上下文,从而使LLM能够用于领域专门化问答。所得系统优于当代模型,并为答案提供了有用的解释,鼓励在未来研究中将LLM整合到法律和金融NLP系统中。
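The core idea of exploiting semi-structure to keep retrieved context short can be illustrated with a minimal sketch. The section names, the toy contract, and the token-overlap scoring below are hypothetical stand-ins, not the paper's retrieval method:

```python
# Treat a legal document as named sections and retrieve only those relevant
# to the question, so that the context passed to an LLM stays short.
# Token overlap is an illustrative stand-in for a real retriever.

def retrieve_sections(document, question, k=2):
    query = set(question.lower().replace("?", "").split())
    scored = []
    for name, text in document.items():
        # Both the section name and its body contribute matchable tokens.
        tokens = set(name.lower().split("_")) | set(text.lower().split())
        scored.append((len(query & tokens), name))
    scored.sort(reverse=True)
    return [name for score, name in scored[:k] if score > 0]

contract = {
    "termination_clause": "either party may terminate with 30 days notice",
    "payment_terms": "invoices are due within 60 days",
    "governing_law": "this agreement is governed by the laws of delaware",
}
print(retrieve_sections(contract, "When may a party terminate the agreement?"))
```

Only the top-scoring sections would then be concatenated into the LLM prompt, keeping the context within the model's length limit.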

Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation

  • paper_url: http://arxiv.org/abs/2310.14424
  • repo_url: None
  • paper_authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker
  • for: 在大型语言模型的评估中,人类评估变得越来越重要,它能比传统自动指标更准确地捕捉语言细节和用户偏好。
  • methods: 我们评估了若干基于度量的方法,通过优先选择最能区分模型的数据实例来减少所需的人类反馈数量,从而提高模型评估的效率。
  • results: 我们发现,与随机抽样相比,聚焦于优先级最高的前20%实例时,该方法可将难分胜负(或"tie")的结果最多减少54%。这表明我们的方法能更有效地利用人类反馈,从而提高模型评估的效率。
    Abstract Human evaluation is increasingly critical for assessing large language models, capturing linguistic nuances, and reflecting user preferences more accurately than traditional automated metrics. However, the resource-intensive nature of this type of annotation process poses significant challenges. The key question driving our work: "is it feasible to minimize human-in-the-loop feedback by prioritizing data instances which most effectively distinguish between models?" We evaluate several metric-based methods and find that these metrics enhance the efficiency of human evaluations by minimizing the number of required annotations, thus saving time and cost, while ensuring a robust performance evaluation. We show that our method is effective across widely used model families, reducing instances of indecisive (or "tie") outcomes by up to 54% compared to a random sample when focusing on the top-20 percentile of prioritized instances. This potential reduction in required human effort positions our approach as a valuable strategy in future large language model evaluations.
    摘要 人类评估在评估大型语言模型时变得越来越重要,它能比传统自动指标更准确地捕捉语言细节并反映用户偏好。然而,这类标注过程耗费资源,带来重大挑战。我们工作的核心问题是:"能否通过优先选择最能区分模型的数据实例,来最大限度地减少人工参与的反馈?"我们评估了若干基于度量的方法,发现这些度量可以减少所需标注数量,从而节省时间和成本,同时保证性能评估的稳健性,提高人类评估的效率。我们的方法在广泛使用的各个模型家族上均有效:聚焦于优先级最高的前20%实例时,与随机抽样相比,难分胜负(或"tie")的结果最多减少54%。这种对所需人力的潜在削减,使我们的方法成为未来大型语言模型评估中的一项有价值策略。
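A minimal sketch of the prioritization idea: rank evaluation prompts by how strongly an automatic score separates two models, then keep only the top percentile for human annotation. The per-model scores and the gap metric below are made-up stand-ins, not the paper's actual metrics:

```python
# Metric-based prompt prioritization: prompts where two models' automatic
# scores differ most are the ones most likely to yield a decisive human
# judgment, so they are annotated first.

def prioritize(prompts, score_a, score_b, top_fraction=0.2):
    """Return the top fraction of prompts with the largest absolute
    score gap between model A and model B."""
    gaps = sorted(((abs(score_a[p] - score_b[p]), p) for p in prompts),
                  reverse=True)
    k = max(1, int(len(prompts) * top_fraction))
    return [p for _, p in gaps[:k]]

prompts = ["p1", "p2", "p3", "p4", "p5"]
a = {"p1": 0.90, "p2": 0.50, "p3": 0.51, "p4": 0.20, "p5": 0.48}
b = {"p1": 0.10, "p2": 0.50, "p3": 0.50, "p4": 0.80, "p5": 0.52}
print(prioritize(prompts, a, b))       # the single most separating prompt
print(prioritize(prompts, a, b, 0.4))  # top-40%: two prompts
```

Prompts where the models score almost identically (p2, p3, p5) are exactly the ones likely to produce "tie" judgments, so dropping them saves annotation effort.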

Monte Carlo Thought Search: Large Language Model Querying for Complex Scientific Reasoning in Catalyst Design

  • paper_url: http://arxiv.org/abs/2310.14420
  • repo_url: https://github.com/pnnl/chemreasoner
  • paper_authors: Henry W. Sprueill, Carl Edwards, Mariefel V. Olarte, Udishnu Sanyal, Heng Ji, Sutanay Choudhury
  • for: 发现新型催化剂需要在多种化学性质及其权衡之间进行复杂推理,导致搜索空间呈组合式增长。
  • methods: 本研究提出一种基于蒙特卡洛树搜索(Monte Carlo Tree Search)的方法,改进了最先进的思维链(chain-of-thought)提示变体,以增强科学推理。
  • results: 我们的方法比最佳基线提高了25.8%,并且能够为科学家的推理与发现过程提供新的洞见。
    Abstract Discovering novel catalysts requires complex reasoning involving multiple chemical properties and resultant trade-offs, leading to a combinatorial growth in the search space. While large language models (LLM) have demonstrated novel capabilities for chemistry through complex instruction following capabilities and high quality reasoning, a goal-driven combinatorial search using LLMs has not been explored in detail. In this work, we present a Monte Carlo Tree Search-based approach that improves beyond state-of-the-art chain-of-thought prompting variants to augment scientific reasoning. We introduce two new reasoning datasets: 1) a curation of computational chemistry simulations, and 2) diverse questions written by catalysis researchers for reasoning about novel chemical conversion processes. We improve over the best baseline by 25.8\% and find that our approach can augment scientist's reasoning and discovery process with novel insights.
    摘要 发现新型催化剂需要进行复杂推理,涉及多种化学性质及其之间的权衡,从而导致搜索空间呈组合式增长。尽管大型语言模型(LLM)凭借复杂指令遵循能力和高质量推理在化学领域展现出新的能力,但利用LLM进行目标驱动的组合式搜索尚未得到深入探索。在本工作中,我们提出一种基于蒙特卡洛树搜索的方法,超越了最先进的思维链提示变体,以增强科学推理。我们引入了两个新的推理数据集:1)一组经过整理的计算化学模拟数据;2)由催化研究者撰写的、关于新型化学转化过程推理的多样化问题。我们的方法比最佳基线提高了25.8%,并且能够以新的洞见增强科学家的推理与发现过程。
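The search procedure can be sketched as a generic Monte Carlo Tree Search skeleton over a space of candidate "thoughts". In the paper's setting, expansion and reward would come from an LLM and a chemistry-specific reward model; here both are toy stand-ins so the sketch is self-contained:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    # Unvisited nodes are explored first.
    if node.visits == 0:
        return float("inf")
    exploit = node.value / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def mcts(root_state, expand, reward, iters, rng):
    root = Node(root_state)
    for _ in range(iters):
        # Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: add candidate next "thoughts" for this state.
        for s in expand(node.state):
            node.children.append(Node(s, parent=node))
        if node.children:
            node = rng.choice(node.children)
        # Evaluation and backpropagation.
        r = reward(node.state)
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state

# Toy bandit: from the root, five candidate thoughts; only thought 4 pays off.
expand = lambda s: [1, 2, 3, 4, 5] if s == 0 else []
reward = lambda s: 1.0 if s == 4 else 0.0
best = mcts(0, expand, reward, iters=200, rng=random.Random(0))
print(best)  # the highest-reward candidate
```

The UCB rule balances revisiting promising thoughts against exploring rarely tried ones, which is what lets the search cope with a combinatorially growing space.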

Vision Language Models in Autonomous Driving and Intelligent Transportation Systems

  • paper_url: http://arxiv.org/abs/2310.14414
  • repo_url: None
  • paper_authors: Xingcheng Zhou, Mingyu Liu, Bare Luka Zagar, Ekim Yurtsever, Alois C. Knoll
  • for: 本研究的目的是为Autonomous Driving (AD)和Intelligent Transportation Systems (ITS)领域内的视觉语言模型(VLM)做出全面的检视和评估,以探讨当前模型和数据集的进展,以及未来研究方向。
  • methods: 本研究使用了许多当前最佳的语言模型,包括Large Language Models (LLMs),以及一些特定的交通和驾驶数据集。
  • results: 本研究发现了许多视觉语言模型在Autonomous Driving (AD)和Intelligent Transportation Systems (ITS)领域的应用,包括改善驾驶安全性和效率,以及探索新的研究方向。同时,研究还发现了一些挑战和研究缺失,需要进一步的研究和解决。
    Abstract The applications of Vision-Language Models (VLMs) in the fields of Autonomous Driving (AD) and Intelligent Transportation Systems (ITS) have attracted widespread attention due to their outstanding performance and the ability to leverage Large Language Models (LLMs). By integrating language data, the vehicles, and transportation systems are able to deeply understand real-world environments, improving driving safety and efficiency. In this work, we present a comprehensive survey of the advances in language models in this domain, encompassing current models and datasets. Additionally, we explore the potential applications and emerging research directions. Finally, we thoroughly discuss the challenges and research gap. The paper aims to provide researchers with the current work and future trends of VLMs in AD and ITS.
    摘要 视觉语言模型(VLM)在自动驾驶(AD)和智能交通系统(ITS)领域的应用已经引起了广泛关注,这主要归功于其出色的表现以及利用大型语言模型(LLM)的能力。通过整合语言数据,车辆和交通系统能够深入理解现实环境,从而提高驾驶安全性和效率。在这篇论文中,我们对该领域语言模型的进展进行了全面综述,涵盖当前的模型和数据集。此外,我们还探讨了潜在应用和新兴研究方向。最后,我们深入讨论了该领域面临的挑战和研究空白。本文旨在为研究人员提供VLM在AD和ITS领域的当前工作与未来趋势。

Be Selfish, But Wisely: Investigating the Impact of Agent Personality in Mixed-Motive Human-Agent Interactions

  • paper_url: http://arxiv.org/abs/2310.14404
  • repo_url: None
  • paper_authors: Kushal Chawla, Ian Wu, Yu Rong, Gale M. Lucas, Jonathan Gratch
  • for: 设计谈判对话系统。
  • methods: 该方法采用自我博弈强化学习,训练智能体与一个模仿人类对话数据的模拟用户进行交互。
  • results: 研究发现,这种做法会使系统无法学会谈判中妥协的价值,常常导致对方直接放弃谈判而无法达成协议,最终损害系统的整体表现。
    Abstract A natural way to design a negotiation dialogue system is via self-play RL: train an agent that learns to maximize its performance by interacting with a simulated user that has been designed to imitate human-human dialogue data. Although this procedure has been adopted in prior work, we find that it results in a fundamentally flawed system that fails to learn the value of compromise in a negotiation, which can often lead to no agreements (i.e., the partner walking away without a deal), ultimately hurting the model's overall performance. We investigate this observation in the context of the DealOrNoDeal task, a multi-issue negotiation over books, hats, and balls. Grounded in negotiation theory from Economics, we modify the training procedure in two novel ways to design agents with diverse personalities and analyze their performance with human partners. We find that although both techniques show promise, a selfish agent, which maximizes its own performance while also avoiding walkaways, performs superior to other variants by implicitly learning to generate value for both itself and the negotiation partner. We discuss the implications of our findings for what it means to be a successful negotiation dialogue system and how these systems should be designed in the future.
    摘要 设计谈判对话系统的一种自然方式是通过自我博弈强化学习:训练一个智能体,使其通过与一个模仿人类对话数据的模拟用户交互来学习最大化自身表现。尽管先前工作已采用这一流程,我们发现它会产生一个存在根本缺陷的系统:该系统无法学会谈判中妥协的价值,往往导致无法达成协议(即对方直接放弃谈判),最终损害模型的整体表现。我们在DealOrNoDeal任务中研究了这一现象,该任务是一个围绕书、帽子和球的多议题谈判。基于经济学中的谈判理论,我们以两种新方式修改了训练流程,设计出具有不同个性的智能体,并分析它们与人类伙伴交互时的表现。我们发现,尽管两种技术都展现出潜力,但一个"自私"的智能体(在最大化自身表现的同时避免对方放弃谈判)表现优于其他变体,因为它隐式地学会了为自己和谈判伙伴双方创造价值。我们讨论了这些发现对于"何为成功的谈判对话系统"的意义,以及未来应如何设计此类系统。

O3D: Offline Data-driven Discovery and Distillation for Sequential Decision-Making with Large Language Models

  • paper_url: http://arxiv.org/abs/2310.14403
  • repo_url: None
  • paper_authors: Yuchen Xiao, Yanchao Sun, Mengda Xu, Udari Madhushani, Jared Vann, Deepeka Garg, Sumitra Ganesh
  • for: 提高大型语言模型(LLM)在解决顺序决策问题上的表现
  • methods: 利用大规模离线数据(例如人类交互日志)来提升LLM智能体的上下文内学习性能
  • results: O3D框架无需微调即可帮助LLM智能体解决复杂的长时程任务,并能跨多个任务发现可复用技能、蒸馏可泛化知识
    Abstract Recent advancements in large language models (LLMs) have exhibited promising performance in solving sequential decision-making problems. By imitating few-shot examples provided in the prompts (i.e., in-context learning), an LLM agent can interact with an external environment and complete given tasks without additional training. However, such few-shot examples are often insufficient to generate high-quality solutions for complex and long-horizon tasks, while the limited context length cannot consume larger-scale demonstrations. To this end, we propose an offline learning framework that utilizes offline data at scale (e.g, logs of human interactions) to facilitate the in-context learning performance of LLM agents. We formally define LLM-powered policies with both text-based approaches and code-based approaches. We then introduce an Offline Data-driven Discovery and Distillation (O3D) framework to improve LLM-powered policies without finetuning. O3D automatically discovers reusable skills and distills generalizable knowledge across multiple tasks based on offline interaction data, advancing the capability of solving downstream tasks. Empirical results under two interactive decision-making benchmarks (ALFWorld and WebShop) demonstrate that O3D can notably enhance the decision-making capabilities of LLMs through the offline discovery and distillation process, and consistently outperform baselines across various LLMs with both text-based-policy and code-based-policy.
    摘要 近期大型语言模型(LLM)在解决顺序决策问题上展现出可观的性能。通过模仿提示中提供的少样本示例(即上下文内学习),LLM智能体无需额外训练即可与外部环境交互并完成给定任务。然而,对于复杂且长时程的任务,这类少样本示例往往不足以生成高质量的解决方案,而有限的上下文长度也无法容纳更大规模的示范。为此,我们提出一种离线学习框架,利用大规模离线数据(如人类交互日志)来提升LLM智能体的上下文内学习性能。我们以文本策略和代码策略两种方式形式化定义了LLM驱动的策略,并提出离线数据驱动的发现与蒸馏(O3D)框架,在无需微调的情况下改进LLM策略。O3D基于离线交互数据自动发现可复用技能,并跨多个任务蒸馏可泛化的知识,从而提升解决下游任务的能力。在两个交互式决策基准(ALFWorld和WebShop)上的实验结果表明,O3D能通过离线发现与蒸馏过程显著增强LLM的决策能力,并在文本策略和代码策略两种设置下、在多种LLM上持续优于基线方法。

Value of Assistance for Grasping

  • paper_url: http://arxiv.org/abs/2310.14402
  • repo_url: None
  • paper_authors: Mohammad Masarwy, Yuval Goshen, David Dovrat, Sarah Keren
  • for: robotic grasping task with uncertain object pose
  • methods: probabilistic estimation of object pose, VOA measure for assessing observation effectiveness
  • results: effective in simulated and real-world robotic settings
    Abstract In many realistic settings, a robot is tasked with grasping an object without knowing its exact pose. Instead, the robot relies on a probabilistic estimation of the pose to decide how to attempt the grasp. We offer a novel Value of Assistance (VOA) measure for assessing the expected effect a specific observation will have on the robot's ability to successfully complete the grasp. Thus, VOA supports the decision of which sensing action would be most beneficial to the grasping task. We evaluate our suggested measures in both simulated and real-world robotic settings.
    摘要 在许多现实场景中,机器人需要在不知道物体确切位姿的情况下抓取该物体,只能依靠对位姿的概率估计来决定如何尝试抓取。我们提出了一种新的"协助价值"(Value of Assistance, VOA)度量,用于评估某一特定观测对机器人成功完成抓取的预期影响。因此,VOA可以支持机器人决定哪种感知动作对抓取任务最有利。我们在仿真和真实机器人环境中评估了所提出的度量。
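On a discrete pose belief, the VOA idea can be sketched as a simple expected-utility computation: the value of an observation is the expected best-grasp success after Bayes-updating the belief, minus the best-grasp success now. The pose/grasp/observation models below are toy stand-ins, not the paper's formulation:

```python
def best_success(belief, success):
    # belief: {pose: probability}; success: {grasp: {pose: P(success | grasp, pose)}}
    return max(
        sum(prob * table[pose] for pose, prob in belief.items())
        for table in success.values()
    )

def voa(belief, success, likelihood, observations):
    # likelihood: {obs: {pose: P(obs | pose)}}
    prior = best_success(belief, success)
    expected_post = 0.0
    for obs in observations:
        p_obs = sum(likelihood[obs][pose] * prob for pose, prob in belief.items())
        if p_obs == 0.0:
            continue
        posterior = {
            pose: likelihood[obs][pose] * prob / p_obs
            for pose, prob in belief.items()
        }
        expected_post += p_obs * best_success(posterior, success)
    return expected_post - prior

# Toy problem: the object is equally likely in one of two poses, and each
# grasp only succeeds for one pose. A perfect pose observation raises the
# expected success of the best grasp from 0.5 to 1.0, so its VOA is 0.5.
belief = {"left": 0.5, "right": 0.5}
success = {
    "grasp_left": {"left": 1.0, "right": 0.0},
    "grasp_right": {"left": 0.0, "right": 1.0},
}
perfect = {"saw_left": {"left": 1.0, "right": 0.0},
           "saw_right": {"left": 0.0, "right": 1.0}}
gain = voa(belief, success, perfect, ["saw_left", "saw_right"])
print(gain)  # → 0.5
```

An uninformative observation leaves the belief unchanged, so its VOA is zero — which is exactly why the measure can rank candidate sensing actions.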

Learning to bag with a simulation-free reinforcement learning framework for robots

  • paper_url: http://arxiv.org/abs/2310.14398
  • repo_url: None
  • paper_authors: Francisco Munguia-Galeano, Jihong Zhu, Juan David Hernández, Ze Ji
  • for: 本文旨在让机器人学会对袋子等可变形物体进行装袋(bagging)。
  • methods: 本文提出一种基于学习的框架,使机器人无需依赖仿真即可在真实世界中学习装袋任务。该框架使用一组基本动作并以五个状态表示任务,通过本文提出的一种强化学习算法寻找袋子的最佳抓取点。
  • results: 在真实世界中训练约三小时后,该框架在分别从折叠和展开状态开始装袋时达到了60%和80%的成功率;此外,在另外两个不同尺寸的袋子上进行的测试表明模型具有一定的泛化能力。
    Abstract Bagging is an essential skill that humans perform in their daily activities. However, deformable objects, such as bags, are complex for robots to manipulate. This paper presents an efficient learning-based framework that enables robots to learn bagging. The novelty of this framework is its ability to perform bagging without relying on simulations. The learning process is accomplished through a reinforcement learning algorithm introduced in this work, designed to find the best grasping points of the bag based on a set of compact state representations. The framework utilizes a set of primitive actions and represents the task in five states. In our experiments, the framework reaches a 60 % and 80 % of success rate after around three hours of training in the real world when starting the bagging task from folded and unfolded, respectively. Finally, we test the trained model with two more bags of different sizes to evaluate its generalizability.
    摘要 装袋("bagging")是人类在日常活动中经常执行的一项基本技能。然而,袋子等可变形物体对机器人来说很难操作。本文提出了一种高效的基于学习的框架,使机器人能够学会装袋。该框架的新颖之处在于无需依赖仿真即可完成装袋学习。学习过程通过本文提出的一种强化学习算法实现,该算法基于一组紧凑的状态表示来寻找袋子的最佳抓取点。该框架使用一组基本动作,并以五个状态表示任务。在我们的实验中,该框架在真实世界中训练约三小时后,在分别从折叠和展开状态开始装袋时达到了60%和80%的成功率。最后,我们用另外两个不同尺寸的袋子测试了训练好的模型,以评估其泛化能力。
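The compact formulation — five discrete task states and a small set of primitive actions — lends itself to a generic tabular RL sketch. The transition model and reward below are illustrative stand-ins, not the paper's actual algorithm or bagging primitives:

```python
import random

# Tabular Q-learning over a compact task representation: five discrete
# states and two primitive actions on a toy chain where action 1 advances
# the task and action 0 does nothing.

def q_learning(n_states, n_actions, step, episodes, rng,
               alpha=0.5, gamma=0.9, eps=0.2):
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            if rng.random() < eps:
                a = rng.randrange(n_actions)                       # explore
            else:
                a = max(range(n_actions), key=lambda i: q[s][i])   # exploit
            s2, r = step(s, a)
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

def step(s, a):
    # Action 1 advances the task one state; action 0 stalls.
    if a == 1:
        return s + 1, (1.0 if s + 1 == 4 else 0.0)
    return s, 0.0

q = q_learning(5, 2, step, episodes=200, rng=random.Random(0))
policy = [max(range(2), key=lambda i: q[s][i]) for s in range(4)]
print(policy)  # the advancing action should dominate in every state
```

With only five states the Q-table is tiny, which mirrors why a compact state representation makes real-world training (hours, not days) feasible.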

Merging Generated and Retrieved Knowledge for Open-Domain QA

  • paper_url: http://arxiv.org/abs/2310.14393
  • repo_url: https://github.com/yunx-z/combo
  • paper_authors: Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Lu Wang
  • For: The paper aims to improve open-domain question answering (QA) systems by effectively leveraging two sources of information: retrieved passages and large language models (LLMs).
  • Methods: The paper proposes a Compatibility-Oriented knowledge Merging (COMBO) framework that matches LLM-generated passages with retrieved counterparts into compatible pairs, based on discriminators trained with silver compatibility labels. The framework uses a Fusion-in-Decoder-based reader model to handle passage pairs and arrive at the final answer.
  • Results: The paper shows that COMBO outperforms competitive baselines on three out of four tested open-domain QA benchmarks, and demonstrates greater efficacy in scenarios with a higher degree of knowledge conflicts.
    Abstract Open-domain question answering (QA) systems are often built with retrieval modules. However, retrieving passages from a given source is known to suffer from insufficient knowledge coverage. Alternatively, prompting large language models (LLMs) to generate contextual passages based on their parametric knowledge has been shown to improve QA performance. Yet, LLMs tend to "hallucinate" content that conflicts with the retrieved knowledge. Based on the intuition that answers supported by both sources are more likely to be correct, we propose COMBO, a Compatibility-Oriented knowledge Merging for Better Open-domain QA framework, to effectively leverage the two sources of information. Concretely, we match LLM-generated passages with retrieved counterparts into compatible pairs, based on discriminators trained with silver compatibility labels. Then a Fusion-in-Decoder-based reader model handles passage pairs to arrive at the final answer. Experiments show that COMBO outperforms competitive baselines on three out of four tested open-domain QA benchmarks. Further analysis reveals that our proposed framework demonstrates greater efficacy in scenarios with a higher degree of knowledge conflicts.
    摘要 开放领域问答(QA)系统通常依赖检索模块构建。然而,从给定来源检索段落存在知识覆盖不足的问题。另一方面,通过提示大型语言模型(LLM)基于其参数化知识生成上下文段落,已被证明可以提升问答性能;但LLM往往会"幻觉"出与检索知识相冲突的内容。基于"同时得到两种来源支持的答案更可能正确"这一直觉,我们提出COMBO(面向兼容性的知识融合)框架,以有效利用这两种信息来源。具体来说,我们基于以银标兼容性标签训练的判别器,将LLM生成的段落与检索到的段落匹配成兼容的段落对,再由基于Fusion-in-Decoder的阅读器模型处理段落对并得出最终答案。实验表明,COMBO在四个开放领域问答基准中的三个上优于有竞争力的基线方法。进一步分析显示,该框架在知识冲突程度较高的场景中更为有效。
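The pairing step can be sketched as a greedy matching between generated and retrieved passages under a compatibility score. The token-overlap scorer below is a toy stand-in for COMBO's trained discriminator, and the matching scheme is illustrative:

```python
def pair_passages(generated, retrieved, compat):
    """Greedily pair each LLM-generated passage with its most compatible
    retrieved passage, using each retrieved passage at most once.
    `compat` stands in for the trained compatibility discriminator."""
    pairs, used = [], set()
    for g in generated:
        candidates = [(compat(g, r), r) for r in retrieved if r not in used]
        if not candidates:
            break
        _, best = max(candidates)
        used.add(best)
        pairs.append((g, best))
    return pairs

# Toy "discriminator": token overlap between the two passages.
compat = lambda g, r: len(set(g.split()) & set(r.split()))

generated = ["paris is the capital of france", "tokyo is in japan"]
retrieved = ["japan tokyo facts", "france capital paris"]
pairs = pair_passages(generated, retrieved, compat)
print(pairs)
```

Each resulting pair would then be fed jointly to the reader model, so the answer is grounded in both a generated and a retrieved view of the same fact.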

ARCOQ: Arabic Closest Opposite Questions Dataset

  • paper_url: http://arxiv.org/abs/2310.14384
  • repo_url: https://github.com/sandrarizkallah/arcoq-dataset
  • paper_authors: Sandra Rizkallah, Amir F. Atiya, Samir Shaheen
  • for: 本研究提供了一个阿拉伯语最近反义词问题数据集,是阿拉伯语的首个此类数据集。该数据集对反义词检测系统的评估非常有用,其结构类似于英语的研究生入学考试(GRE)最近反义词问题数据集。
  • methods: 数据集由查询词和候选词集合构成:每个问题包含一个查询词,需要从候选词中找出其最近的反义词,并附有正确答案。此外,文章还为多种阿拉伯语词嵌入模型提供了在该数据集上的性能基准。
  • results: 本研究公开发布了该数据集,并提供了标准的开发集和测试集划分;文章还报告了多种阿拉伯语词嵌入模型的性能数据,可用于评估不同的反义词检测系统。
    Abstract This paper presents a dataset for closest opposite questions in Arabic language. The dataset is the first of its kind for the Arabic language. It is beneficial for the assessment of systems on the aspect of antonymy detection. The structure is similar to that of the Graduate Record Examination (GRE) closest opposite questions dataset for the English language. The introduced dataset consists of 500 questions, each contains a query word for which the closest opposite needs to be determined from among a set of candidate words. Each question is also associated with the correct answer. We publish the dataset publicly in addition to providing standard splits of the dataset into development and test sets. Moreover, the paper provides a benchmark for the performance of different Arabic word embedding models on the introduced dataset.
    摘要 本文提出了一个阿拉伯语最近反义词问题数据集,是阿拉伯语的首个此类数据集,有助于评估系统在反义词检测方面的能力。其结构类似于英语的研究生入学考试(GRE)最近反义词问题数据集。该数据集包含500个问题,每个问题给出一个查询词,需要从一组候选词中确定其最近的反义词,并附有正确答案。我们公开发布该数据集,并提供标准的开发集与测试集划分。此外,本文还给出了多种阿拉伯语词嵌入模型在该数据集上的性能基准。

MoPe: Model Perturbation-based Privacy Attacks on Language Models

  • paper_url: http://arxiv.org/abs/2310.14369
  • repo_url: None
  • paper_authors: Marvin Li, Jason Wang, Jeffrey Wang, Seth Neel
  • for: 这个研究旨在检测大型自然语言模型(LLMs)是否会不意外地透露训练数据中的敏感信息。
  • methods: 这篇文章提出了一种新方法——模型扰动(MoPe),在拥有模型参数白盒访问权限的情况下,能够高置信度地判断一段文本是否出现在某个预训练语言模型的训练数据中。MoPe在参数空间中向模型添加噪声,并测量模型在某个点 $x$ 处对数似然的下降,我们证明该统计量近似于模型参数的海森(Hessian)矩阵的迹。
  • results: 在参数量从7000万到120亿的各种语言模型上,我们的MoPe方法比现有的基于损失的攻击和最近提出的基于扰动的方法更有效。我们还考察了训练点顺序和模型规模对攻击成功率的影响,并用实验证明MoPe在实践中能够准确近似海森矩阵的迹。我们的结果表明,仅凭单个点的损失不足以判断其可被抽取——有些训练点可以用我们的方法恢复,而它们的损失处于平均水平。这对先前以单点损失作为记忆或遗忘证据的工作提出了质疑。
    Abstract Recent work has shown that Large Language Models (LLMs) can unintentionally leak sensitive information present in their training data. In this paper, we present Model Perturbations (MoPe), a new method to identify with high confidence if a given text is in the training data of a pre-trained language model, given white-box access to the models parameters. MoPe adds noise to the model in parameter space and measures the drop in log-likelihood at a given point $x$, a statistic we show approximates the trace of the Hessian matrix with respect to model parameters. Across language models ranging from $70$M to $12$B parameters, we show that MoPe is more effective than existing loss-based attacks and recently proposed perturbation-based methods. We also examine the role of training point order and model size in attack success, and empirically demonstrate that MoPe accurately approximate the trace of the Hessian in practice. Our results show that the loss of a point alone is insufficient to determine extractability -- there are training points we can recover using our method that have average loss. This casts some doubt on prior works that use the loss of a point as evidence of memorization or unlearning.
    摘要 最近的研究表明,大型语言模型(LLM)可能会无意中泄露其训练数据中的敏感信息。在本文中,我们提出了一种新方法——模型扰动(MoPe),在拥有模型参数白盒访问权限的情况下,能够高置信度地判断给定文本是否在预训练语言模型的训练数据中。MoPe在参数空间中向模型添加噪声,并测量在给定点 $x$ 处对数似然的下降,我们证明该统计量近似于模型参数的海森矩阵的迹。在参数量从7000万到120亿的语言模型上,我们证明MoPe比现有的基于损失的攻击以及最近提出的基于扰动的方法更有效。我们还研究了训练点顺序和模型规模对攻击成功率的影响,并用实验证明MoPe在实践中能够准确近似海森矩阵的迹。我们的结果表明,仅凭单个点的损失不足以判断其可被抽取——有些训练点可以用我们的方法恢复,而它们的损失处于平均水平。这对先前以单点损失作为记忆或遗忘证据的工作提出了一定质疑。
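The link between the MoPe statistic and the Hessian trace can be checked numerically on a toy quadratic model — an illustrative stand-in for a language model's loss, not the paper's setup. For L(w) = 0.5·Σ hᵢwᵢ², the expected loss increase under Gaussian parameter noise of scale σ is exactly (σ²/2)·tr(H):

```python
import random

# Monte Carlo check of the MoPe relationship on a quadratic loss with
# diagonal Hessian H = diag(h): the average loss change under parameter
# noise approximates (sigma^2 / 2) * trace(H).

def loss(w, h):
    return 0.5 * sum(hi * wi * wi for hi, wi in zip(h, w))

def mope_statistic(w, h, sigma, n_samples, rng):
    # Average change in loss (negative log-likelihood) under parameter noise.
    base = loss(w, h)
    total = 0.0
    for _ in range(n_samples):
        noisy = [wi + rng.gauss(0.0, sigma) for wi in w]
        total += loss(noisy, h) - base
    return total / n_samples

rng = random.Random(0)
h = [1.0, 2.0, 3.0]              # Hessian diagonal, trace(H) = 6
w = [0.3, -0.2, 0.5]             # "model parameters"
sigma = 0.1
est = mope_statistic(w, h, sigma, 20000, rng)
print(est, 0.5 * sigma ** 2 * sum(h))  # Monte Carlo estimate vs. exact 0.03
```

The linear term Σ hᵢwᵢzᵢ averages out over the noise samples, leaving only the curvature term — which is why the statistic tracks the Hessian trace rather than the gradient.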

  • paper_url: http://arxiv.org/abs/2310.14358
  • repo_url: None
  • paper_authors: Elena Sergeeva, Anastasia Sergeeva, Huiyun Tang, Kerstin Bongard-Blanchy, Peter Szolovits
  • for: 本研究探讨在不同"建议质量"设定下,用户在评估健康相关陈述真实性时接受AI建议的行为。
  • methods: 我们采用了探索性的评估方法,通过向用户提供不同类型的AI建议来影响他们对健康相关声明的评估。
  • results: 我们发现,即使反馈仅仅是说明"AI认为该陈述为假/真",也会使过半数的人更改其对陈述真实性的评估。不同类型的建议会影响接受率,但仅仅"获得建议"这一事实的效应往往大于建议类型本身的效应。
    Abstract Previous research on expert advice-taking shows that humans exhibit two contradictory behaviors: on the one hand, people tend to overvalue their own opinions undervaluing the expert opinion, and on the other, people often defer to other people's advice even if the advice itself is rather obviously wrong. In our study, we conduct an exploratory evaluation of users' AI-advice accepting behavior when evaluating the truthfulness of a health-related statement in different "advice quality" settings. We find that even feedback that is confined to just stating that "the AI thinks that the statement is false/true" results in more than half of people moving their statement veracity assessment towards the AI suggestion. The different types of advice given influence the acceptance rates, but the sheer effect of getting a suggestion is often bigger than the suggestion-type effect.
    摘要 以往关于采纳专家建议的研究表明,人类表现出两种相互矛盾的行为:一方面,人们倾向于高估自己的意见而低估专家意见;另一方面,即使建议本身明显错误,人们也常常听从他人的建议。在本研究中,我们对用户在不同"建议质量"设定下评估健康相关陈述真实性时接受AI建议的行为进行了探索性评估。我们发现,即使反馈仅仅是说明"AI认为该陈述为假/真",也会使过半数的人将其对陈述真实性的评估向AI的建议方向调整。不同类型的建议会影响接受率,但仅仅"获得建议"这一事实的效应往往大于建议类型本身的效应。

From Chaos to Clarity: Claim Normalization to Empower Fact-Checking

  • paper_url: http://arxiv.org/abs/2310.14338
  • repo_url: None
  • paper_authors: Megha Sundriyal, Tanmoy Chakraborty, Preslav Nakov
  • For: The paper aims to help identify precise and prominent claims in social media posts that require verification, by introducing a novel task called Claim Normalization (ClaimNorm) and proposing a pioneering approach called CACN that leverages human reasoning processes and large language models.* Methods: The paper proposes a two-stage approach that first uses chain-of-thought and claim check-worthiness estimation to comprehend intricate claims, and then leverages large language models’ in-context learning abilities to provide guidance and improve the claim normalization process.* Results: The paper evaluates the effectiveness of the proposed model using a comprehensive real-world dataset (CLAN) consisting of more than 6k instances of social media posts alongside their respective normalized claims, and demonstrates that CACN outperforms several baselines across various evaluation measures.
    Abstract With the proliferation of social media platforms, users are exposed to vast information, including posts containing misleading claims. However, the pervasive noise inherent in these posts presents a challenge in identifying precise and prominent claims that require verification. Extracting the core assertions from such posts is arduous and time-consuming. We introduce a novel task called Claim Normalization (aka ClaimNorm) that aims to decompose complex and noisy social media posts into more straightforward and understandable forms, termed normalized claims. We propose CACN, a pioneering approach that leverages chain-of-thought and claim check-worthiness estimation, mimicking human reasoning processes, to comprehend intricate claims. Moreover, we capitalize on large language models' powerful in-context learning abilities to provide guidance and improve the claim normalization process. To evaluate the effectiveness of our proposed model, we meticulously compile a comprehensive real-world dataset, CLAN, comprising more than 6k instances of social media posts alongside their respective normalized claims. Experimentation demonstrates that CACN outperforms several baselines across various evaluation measures. A rigorous error analysis validates CACN's capabilities and pitfalls.
    摘要 随着社交媒体平台的普及,用户接触到海量信息,其中包括含有误导性声明的帖子。然而,这些帖子中普遍存在的噪音使得识别需要核查的精确且突出的声明变得困难,从中提取核心论断既费力又耗时。为此,我们提出了一项新任务——声明规范化(Claim Normalization,简称 ClaimNorm),旨在将复杂且嘈杂的社交媒体帖子分解为更直接、更易理解的形式,称为规范化声明。我们提出了 CACN 方法,它利用思维链和声明核查价值估计来模拟人类推理过程,以理解复杂的声明;同时利用大型语言模型强大的上下文内学习能力提供指导,改进声明规范化过程。为评估所提模型的效果,我们精心编制了一个全面的真实世界数据集 CLAN,包含超过6千条社交媒体帖子及其对应的规范化声明。实验结果表明,CACN 在多种评价指标上均优于多个基线方法。严格的错误分析验证了 CACN 的能力与不足。

Learning Interpretable Rules for Scalable Data Representation and Classification

  • paper_url: http://arxiv.org/abs/2310.14336
  • repo_url: https://github.com/12wang3/rrl
  • paper_authors: Zhuo Wang, Wei Zhang, Ning Liu, Jianyong Wang
  • for: This paper aims to improve the scalability and interpretability of rule-based models for data representation and classification.
  • methods: The proposed method, called Rule-based Representation Learner (RRL), uses a novel training method called Gradient Grafting to optimize the discrete model using gradient descent, and employs a novel design of logical activation functions to increase scalability.
  • results: RRL outperforms competitive interpretable approaches on ten small and four large data sets, and can be easily adjusted to obtain a trade-off between classification accuracy and model complexity for different scenarios.
    Abstract Rule-based models, e.g., decision trees, are widely used in scenarios demanding high model interpretability for their transparent inner structures and good model expressivity. However, rule-based models are hard to optimize, especially on large data sets, due to their discrete parameters and structures. Ensemble methods and fuzzy/soft rules are commonly used to improve performance, but they sacrifice the model interpretability. To obtain both good scalability and interpretability, we propose a new classifier, named Rule-based Representation Learner (RRL), that automatically learns interpretable non-fuzzy rules for data representation and classification. To train the non-differentiable RRL effectively, we project it to a continuous space and propose a novel training method, called Gradient Grafting, that can directly optimize the discrete model using gradient descent. A novel design of logical activation functions is also devised to increase the scalability of RRL and enable it to discretize the continuous features end-to-end. Exhaustive experiments on ten small and four large data sets show that RRL outperforms the competitive interpretable approaches and can be easily adjusted to obtain a trade-off between classification accuracy and model complexity for different scenarios. Our code is available at: https://github.com/12wang3/rrl.
    摘要 基于规则的模型(如决策树)因其内部结构透明、模型表达能力强,被广泛用于对可解释性要求高的场景。然而,由于参数和结构的离散性,基于规则的模型难以优化,在大规模数据集上尤甚。集成方法和模糊/软规则常被用来提升性能,但会牺牲模型的可解释性。为了兼顾可扩展性与可解释性,我们提出一种新的分类器——基于规则的表示学习器(RRL),它能自动学习可解释的非模糊规则,用于数据表示与分类。为有效训练不可微的RRL,我们将其投影到连续空间,并提出一种称为"梯度嫁接"(Gradient Grafting)的新训练方法,可直接用梯度下降优化离散模型。我们还设计了新的逻辑激活函数,以提升RRL的可扩展性,并使其能够端到端地离散化连续特征。在十个小型和四个大型数据集上的大量实验表明,RRL优于有竞争力的可解释方法,并且可以针对不同场景方便地调节,以在分类精度与模型复杂度之间取得权衡。代码见:https://github.com/12wang3/rrl。
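The continuous logical activations mentioned in the abstract can be sketched generically with product-based fuzzy logic — a common differentiable relaxation that coincides with Boolean AND/OR on {0, 1} inputs. RRL's exact activation design may differ; this is an illustrative stand-in:

```python
from functools import reduce

# Differentiable logical activations via the product t-norm: exact Boolean
# behavior on {0, 1} inputs, smooth interpolation in between, so a rule
# layer built from them can be trained with gradients.

def soft_and(xs):
    """Differentiable conjunction: product of the inputs."""
    return reduce(lambda acc, x: acc * x, xs, 1.0)

def soft_or(xs):
    """Differentiable disjunction via De Morgan: 1 - prod(1 - x)."""
    return 1.0 - reduce(lambda acc, x: acc * (1.0 - x), xs, 1.0)

def rule(features):
    # A tiny rule: IF (f0 AND f1) OR f2 THEN positive.
    return soft_or([soft_and([features[0], features[1]]), features[2]])

print(rule([1.0, 1.0, 0.0]))  # 1.0 — antecedent fully satisfied
print(rule([0.0, 1.0, 0.0]))  # 0.0 — neither branch fires
print(rule([0.9, 0.8, 0.1]))  # a fuzzy truth value strictly between 0 and 1
```

After training, thresholding the continuous activations back to {0, 1} recovers a crisp, human-readable rule — the interpretability the paper is after.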

CLMSM: A Multi-Task Learning Framework for Pre-training on Procedural Text

  • paper_url: http://arxiv.org/abs/2310.14326
  • repo_url: None
  • paper_authors: Abhilash Nandy, Manav Nitin Kapadnis, Pawan Goyal, Niloy Ganguly
  • For: The paper proposes a domain-specific, continual pre-training framework for procedural NLP tasks.
  • Methods: The framework uses a Multi-Task Learning Framework to optimize two objectives: contrastive learning using hard triplets and a novel mask-step modeling objective.
  • Results: The proposed framework (CLMSM) outperforms baselines on recipes (in-domain) and is able to generalize to open-domain procedural NLP tasks.
    Abstract In this paper, we propose CLMSM, a domain-specific, continual pre-training framework, that learns from a large set of procedural recipes. CLMSM uses a Multi-Task Learning Framework to optimize two objectives - a) Contrastive Learning using hard triplets to learn fine-grained differences across entities in the procedures, and b) a novel Mask-Step Modelling objective to learn step-wise context of a procedure. We test the performance of CLMSM on the downstream tasks of tracking entities and aligning actions between two procedures on three datasets, one of which is an open-domain dataset not conforming with the pre-training dataset. We show that CLMSM not only outperforms baselines on recipes (in-domain) but is also able to generalize to open-domain procedural NLP tasks.
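The two pre-training objectives can be illustrated with a minimal sketch (plain-Python distances and a hypothetical `mask_step` helper stand in for the paper's actual encoder-based implementation):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Contrastive hard-triplet objective: pull the anchor toward the
    positive procedure and push it away from a hard negative, up to a margin."""
    dist = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

def mask_step(steps, idx, mask_token="[MASK]"):
    """Mask-step modelling: hide one whole step of a procedure so a model
    must recover it from the step-wise context (hypothetical helper)."""
    masked = [mask_token if i == idx else s for i, s in enumerate(steps)]
    return masked, steps[idx]

masked, target = mask_step(["boil water", "add pasta", "drain"], 1)
```

In the actual framework both objectives operate on learned embeddings of whole procedures and steps; the toy vectors here only show the loss shapes.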

A Survey on Semantic Processing Techniques

  • paper_url: http://arxiv.org/abs/2310.18345
  • repo_url: None
  • paper_authors: Rui Mao, Kai He, Xulang Zhang, Guanyi Chen, Jinjie Ni, Zonglin Yang, Erik Cambria
  • for: surveys recent progress in computational semantics and how semantic processing extends into and integrates with diverse application domains.
  • methods: analyzes five semantic processing tasks, including word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection, reviewing the relevant theoretical research, advanced methods, and downstream applications for each.
  • results: summarizes and compares the surveyed techniques and applications, distilling technical trends, application trends, and directions for future work.
    Abstract Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyzed five semantic processing tasks, e.g., word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.

Chainpoll: A high efficacy method for LLM hallucination detection

  • paper_url: http://arxiv.org/abs/2310.18344
  • repo_url: None
  • paper_authors: Robert Friel, Atindriyo Sanyal
  • for: proposes ChainPoll, a novel hallucination detection method, together with RealHall, a refined collection of benchmark datasets for assessing hallucination detection metrics from recent studies.
  • methods: RealHall was built by auditing tasks and datasets from prior hallucination-detection studies and selecting four that remain challenging for modern LLMs and relevant to real-world scenarios; ChainPoll is then compared against numerous recent hallucination metrics on these benchmarks.
  • results: ChainPoll outperforms on all RealHall benchmarks with an overall AUROC of 0.781, 11% above the next best method and more than 23% above industry standards, while being cost-effective and more transparent than other metrics. Two new hallucination metrics, Adherence and Correctness, are also introduced.
    Abstract Large language models (LLMs) have experienced notable advancements in generating coherent and contextually relevant responses. However, hallucinations - incorrect or unfounded claims - are still prevalent, prompting the creation of automated metrics to detect these in LLM outputs. Our contributions include: introducing ChainPoll, an innovative hallucination detection method that excels compared to its counterparts, and unveiling RealHall, a refined collection of benchmark datasets to assess hallucination detection metrics from recent studies. While creating RealHall, we assessed tasks and datasets from previous hallucination detection studies and observed that many are not suitable for the potent LLMs currently in use. Overcoming this, we opted for four datasets challenging for modern LLMs and pertinent to real-world scenarios. Using RealHall, we conducted a comprehensive comparison of ChainPoll with numerous hallucination metrics from recent studies. Our findings indicate that ChainPoll outperforms in all RealHall benchmarks, achieving an overall AUROC of 0.781. This surpasses the next best theoretical method by 11% and exceeds industry standards by over 23%. Additionally, ChainPoll is cost-effective and offers greater transparency than other metrics. We introduce two novel metrics to assess LLM hallucinations: Adherence and Correctness. Adherence is relevant to Retrieval Augmented Generation workflows, evaluating an LLM's analytical capabilities within given documents and contexts. In contrast, Correctness identifies logical and reasoning errors.
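The abstract does not spell out ChainPoll's mechanics, but the polling idea it is named for can be sketched as follows: a judge LLM is asked several times, with chain-of-thought style instructions, whether a completion is hallucinated, and the score is the fraction of "yes" verdicts. The `judge_llm` callable and `toy_judge` below are illustrative stand-ins, not the paper's actual prompts or models:

```python
def chainpoll_score(prompt, completion, judge_llm, n_polls=5):
    """Poll a judge LLM n times on whether the completion is hallucinated;
    return the fraction of 'yes' verdicts as the hallucination score."""
    question = (
        "Does the following completion contain hallucinations? "
        "Think step by step, then answer 'yes' or 'no'.\n"
        f"Prompt: {prompt}\nCompletion: {completion}"
    )
    votes = [judge_llm(question) for _ in range(n_polls)]
    yes = sum(1 for v in votes if v.strip().lower().startswith("yes"))
    return yes / n_polls

def toy_judge(question):
    # Deterministic stand-in for a real judge model, for illustration only.
    return "yes" if "made of cheese" in question else "no"
```

With a real judge model the per-poll verdicts vary, so the score becomes a graded probability rather than 0 or 1.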

NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval

  • paper_url: http://arxiv.org/abs/2310.14282
  • repo_url: None
  • paper_authors: Uri Katz, Matan Vetzler, Amir DN Cohen, Yoav Goldberg
  • for: advancing the Named Entity Recognition (NER) task beyond its current state, arguing that the capabilities of large language models (LLMs) mark an exciting beginning for NER research rather than its end.
  • methods: defines three increasingly challenging NER variants: more fine-grained and intersectional entity types, zero-shot recognition and extraction of these types from entity-type labels, and a novel retrieval setup in which the query is a zero-shot entity type and the result is every sentence in a large, pre-indexed corpus containing entities of that type, with their spans.
  • results: all three variants remain far from solved; a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types is released to facilitate research toward these goals.
    Abstract Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to provide effective solutions (also) for NER tasks that were traditionally handled with dedicated models, often matching or surpassing the abilities of the dedicated models. Should NER be considered a solved problem? We argue to the contrary: the capabilities provided by LLMs are not the end of NER research, but rather an exciting beginning. They allow taking NER to the next level, tackling increasingly more useful, and increasingly more challenging, variants. We present three variants of the NER task, together with a dataset to support them. The first is a move towards more fine-grained -- and intersectional -- entity types. The second is a move towards zero-shot recognition and extraction of these fine-grained types based on entity-type labels. The third, and most challenging, is the move from the recognition setup to a novel retrieval setup, where the query is a zero-shot entity type, and the expected result is all the sentences from a large, pre-indexed corpus that contain entities of these types, and their corresponding spans. We show that all of these are far from being solved. We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types, to facilitate research towards all of these three goals.
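The proposed retrieval setup — query by a zero-shot entity type, return all sentences containing entities of that type together with their spans — can be sketched over a toy corpus. The `tagger` callable is a hypothetical stand-in for a real zero-shot tagging model:

```python
def retrieve_by_type(corpus, entity_type, tagger):
    """Type-conditioned retrieval: return every sentence whose tagged
    entities include the queried type, along with the matching spans."""
    results = []
    for sent in corpus:
        spans = [span for span, etype in tagger(sent) if etype == entity_type]
        if spans:
            results.append((sent, spans))
    return results

def toy_tagger(sent):
    # Hypothetical zero-shot tagger returning (span, type) pairs.
    tags = []
    if "Curie" in sent:
        tags.append(("Curie", "physicist"))
    if "Paris" in sent:
        tags.append(("Paris", "city"))
    return tags

corpus = ["Curie worked in Paris.", "The weather was mild."]
hits = retrieve_by_type(corpus, "city", toy_tagger)
```

At the scale the paper targets, the linear scan would be replaced by a pre-built index over the corpus; the interface (type in, sentence-plus-spans out) stays the same.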

RSM-NLP at BLP-2023 Task 2: Bangla Sentiment Analysis using Weighted and Majority Voted Fine-Tuned Transformers

  • paper_url: http://arxiv.org/abs/2310.14261
  • repo_url: https://github.com/ptnv-s/rsm-nlp-blp-task2
  • paper_authors: Pratinav Seth, Rashi Goel, Komal Mathur, Swetha Vemulapalli
  • for: The aim of this research is to improve the ability of automatic sentiment analysis for Bangla social media content.
  • methods: The research uses various multilingual and pre-trained BERT models for experimentation and fine-tuning, and employs a majority voting and weighted ensemble model to enhance the accuracy of sentiment analysis.
  • results: The research achieved a score of 0.711 on the multiclass classification task and ranked 10th among participants on the leaderboard for the shared task.
    Abstract This paper describes our approach to submissions made at Shared Task 2 at BLP Workshop - Sentiment Analysis of Bangla Social Media Posts. Sentiment Analysis is an action research area in the digital age. With the rapid and constant growth of online social media sites and services and the increasing amount of textual data, the application of automatic Sentiment Analysis is on the rise. However, most of the research in this domain is based on the English language. Despite being the world's sixth most widely spoken language, little work has been done in Bangla. This task aims to promote work on Bangla Sentiment Analysis while identifying the polarity of social media content by determining whether the sentiment expressed in the text is Positive, Negative, or Neutral. Our approach consists of experimenting and finetuning various multilingual and pre-trained BERT-based models on our downstream tasks and using a Majority Voting and Weighted ensemble model that outperforms individual baseline model scores. Our system scored 0.711 for the multiclass classification task and scored 10th place among the participants on the leaderboard for the shared task. Our code is available at https://github.com/ptnv-s/RSM-NLP-BLP-Task2 .
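The majority-voting and weighted-ensemble steps over per-model predictions can be sketched in a few lines (the label and probability inputs below are hypothetical; the paper's underlying models are fine-tuned multilingual BERT variants):

```python
from collections import Counter

def majority_vote(labels):
    """Pick the label predicted by the most models for one sample."""
    return Counter(labels).most_common(1)[0][0]

def weighted_ensemble(prob_vectors, weights):
    """Combine per-model class-probability vectors with per-model weights
    and return the arg-max class index."""
    n_classes = len(prob_vectors[0])
    combined = [
        sum(w * probs[c] for w, probs in zip(weights, prob_vectors))
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=combined.__getitem__)
```

In practice the weights would be set from each model's validation score, so stronger baselines pull the ensemble decision harder.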

High-Quality 3D Face Reconstruction with Affine Convolutional Networks

  • paper_url: http://arxiv.org/abs/2310.14237
  • repo_url: None
  • paper_authors: Zhiqian Lin, Jiangke Lin, Lincheng Li, Yi Yuan, Zhengxia Zou
  • for: This paper aims to tackle the challenges of canonical view reconstruction from a single input image, specifically addressing the problem of spatial misalignment between the input and output images.
  • methods: The proposed method uses an Affine Convolution Network (ACN) architecture to handle spatially non-corresponding input and output images, and represents 3D human heads in UV space with multiple components, including diffuse maps, position maps, and light maps.
  • results: The method generates high-quality UV maps at a resolution of 512 x 512 pixels, whereas previous approaches typically produce maps of 256 x 256 pixels or smaller.
    Abstract Recent works based on convolutional encoder-decoder architecture and 3DMM parameterization have shown great potential for canonical view reconstruction from a single input image. Conventional CNN architectures benefit from exploiting the spatial correspondence between the input and output pixels. However, in 3D face reconstruction, the spatial misalignment between the input image (e.g. face) and the canonical/UV output makes the feature encoding-decoding process quite challenging. In this paper, to tackle this problem, we propose a new network architecture, namely the Affine Convolution Networks, which enables CNN based approaches to handle spatially non-corresponding input and output images and maintain high-fidelity quality output at the same time. In our method, an affine transformation matrix is learned from the affine convolution layer for each spatial location of the feature maps. In addition, we represent 3D human heads in UV space with multiple components, including diffuse maps for texture representation, position maps for geometry representation, and light maps for recovering more complex lighting conditions in the real world. All the components can be trained without any manual annotations. Our method is parametric-free and can generate high-quality UV maps at resolution of 512 x 512 pixels, while previous approaches normally generate 256 x 256 pixels or smaller. Our code will be released once the paper got accepted.
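The core idea of resampling features through a learned affine transformation can be illustrated with a deliberately simplified sketch: one global 2x3 matrix applied to a plain 2D grid with nearest-neighbor sampling. The paper's affine convolution layer instead learns a matrix per spatial location of the feature maps and samples differentiably:

```python
def affine_warp(img, theta):
    """Warp a 2D grid by an affine matrix theta (2x3) using
    nearest-neighbor sampling; out-of-bounds sources read as 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Map each output location back to a source coordinate.
            sx = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            sy = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= iy < h and 0 <= ix < w:
                out[y][x] = img[iy][ix]
    return out
```

With a per-location `theta`, the same mechanism lets spatially misaligned input (a posed face) and output (a canonical UV map) exchange features despite having no pixel-wise correspondence.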

Efficient Meta Neural Heuristic for Multi-Objective Combinatorial Optimization

  • paper_url: http://arxiv.org/abs/2310.15196
  • repo_url: https://github.com/bill-cjb/emnh
  • paper_authors: Jinbiao Chen, Jiahai Wang, Zizhen Zhang, Zhiguang Cao, Te Ye, Siyuan Chen
  • for: solving multi-objective combinatorial optimization problems (MOCOPs).
  • methods: builds on deep-reinforcement-learning neural heuristics; proposes an efficient meta neural heuristic (EMNH) that trains a meta-model via a (partially) architecture-shared multi-task scheme with scaled symmetric weight-vector sampling, then fine-tunes it in a few steps through an efficient hierarchical method to solve the corresponding single-objective subproblems.
  • results: EMNH outperforms state-of-the-art neural heuristics in both solution quality and learning efficiency on MOTSP, MOCVRP, and MOKP, and yields solutions competitive with strong traditional heuristics in far less time.
    Abstract Recently, neural heuristics based on deep reinforcement learning have exhibited promise in solving multi-objective combinatorial optimization problems (MOCOPs). However, they are still struggling to achieve high learning efficiency and solution quality. To tackle this issue, we propose an efficient meta neural heuristic (EMNH), in which a meta-model is first trained and then fine-tuned with a few steps to solve corresponding single-objective subproblems. Specifically, for the training process, a (partial) architecture-shared multi-task model is leveraged to achieve parallel learning for the meta-model, so as to speed up the training; meanwhile, a scaled symmetric sampling method with respect to the weight vectors is designed to stabilize the training. For the fine-tuning process, an efficient hierarchical method is proposed to systematically tackle all the subproblems. Experimental results on the multi-objective traveling salesman problem (MOTSP), multi-objective capacitated vehicle routing problem (MOCVRP), and multi-objective knapsack problem (MOKP) show that, EMNH is able to outperform the state-of-the-art neural heuristics in terms of solution quality and learning efficiency, and yield competitive solutions to the strong traditional heuristics while consuming much shorter time.
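The underlying decomposition into single-objective subproblems via weight vectors can be sketched for a bi-objective case. The uniform weight sweep and explicit candidate list below are simplifications: EMNH fine-tunes a neural meta-model per subproblem rather than scanning enumerated candidates:

```python
def weighted_sum(objectives, weight):
    """Scalarize a multi-objective cost vector with a weight vector,
    yielding one single-objective subproblem."""
    return sum(w * f for w, f in zip(weight, objectives))

def decompose_and_solve(candidates, objective_fn, n_weights=5):
    """Sweep uniformly spaced bi-objective weight vectors and keep the best
    candidate per scalarized subproblem, approximating a Pareto front."""
    front = []
    for i in range(n_weights):
        w = (i / (n_weights - 1), 1 - i / (n_weights - 1))
        best = min(candidates, key=lambda c: weighted_sum(objective_fn(c), w))
        if best not in front:
            front.append(best)
    return front
```

The point of the meta-model is that solving each of these subproblems from scratch is wasteful; a shared meta-model needs only a few fine-tuning steps per weight vector.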

Neural Multi-Objective Combinatorial Optimization with Diversity Enhancement

  • paper_url: http://arxiv.org/abs/2310.15195
  • repo_url: https://github.com/bill-cjb/nhde
  • paper_authors: Jinbiao Chen, Zizhen Zhang, Zhiguang Cao, Yaoxin Wu, Yining Ma, Te Ye, Jiahai Wang
  • for: solves multi-objective combinatorial optimization (MOCO) problems with a novel neural heuristic that enhances diversity.
  • methods: uses an indicator-enhanced deep reinforcement learning method and a heterogeneous graph attention mechanism to capture the relations between the instance graph and the Pareto front graph, as well as a multiple Pareto optima strategy to sample and preserve desirable solutions.
  • results: generates a Pareto front with higher diversity, achieving superior overall performance on classic MOCO problems; the approach is generic and can be applied to different neural methods for MOCO.
    Abstract Most of existing neural methods for multi-objective combinatorial optimization (MOCO) problems solely rely on decomposition, which often leads to repetitive solutions for the respective subproblems, thus a limited Pareto set. Beyond decomposition, we propose a novel neural heuristic with diversity enhancement (NHDE) to produce more Pareto solutions from two perspectives. On the one hand, to hinder duplicated solutions for different subproblems, we propose an indicator-enhanced deep reinforcement learning method to guide the model, and design a heterogeneous graph attention mechanism to capture the relations between the instance graph and the Pareto front graph. On the other hand, to excavate more solutions in the neighborhood of each subproblem, we present a multiple Pareto optima strategy to sample and preserve desirable solutions. Experimental results on classic MOCO problems show that our NHDE is able to generate a Pareto front with higher diversity, thereby achieving superior overall performance. Moreover, our NHDE is generic and can be applied to different neural methods for MOCO.

MIRACLE: Towards Personalized Dialogue Generation with Latent-Space Multiple Personal Attribute Control

  • paper_url: http://arxiv.org/abs/2310.18342
  • repo_url: https://github.com/lzy-the-boys/miracle
  • paper_authors: Zhenyi Lu, Wei Wei, Xiaoye Qu, XianLing Mao, Dangyang Chen, Jixiong Chen
  • for: endowing chatbot agents with more anthropomorphic traits for human-like, natural conversation.
  • methods: proposes a personalized dialogue generation method that controls multiple personal attributes (e.g., language style, inner character nuances) within the latent space of energy-based models, using a conditional variational auto-encoder together with a tailored energy function and a customized ODE sampling method.
  • results: experiments show superior personality controllability and response quality over strong baselines, with flexible composition of personal attributes across dialogue scenarios.
    Abstract Personalized dialogue systems aim to endow the chatbot agent with more anthropomorphic traits for human-like interactions. Previous approaches have explored explicitly user profile modeling using text descriptions, implicit derivation of user embeddings, or utilizing handicraft prompts for ChatGPT-like models. However, textual personas are limited in describing multi-faceted attributes (\emph{e.g.}, \emph{language style, inner character nuances}), implicit embedding suffers from personality sparsity, and handicraft prompts lack fine-grained and stable controllability. Hence, these approaches may struggle with complex personalized dialogue generation tasks that require generating controllable responses with multiple personal attributes. To this end, we propose \textbf{\textsc{Miracle}, a novel personalized dialogue generation method through \textbf{M}ult\textbf{I}ple Pe\textbf{R}sonal \textbf{A}ttributes \textbf{C}ontrol within \textbf{L}atent-Space \textbf{E}nergy-based Models. ttributes \textbf{C}ontrol within \textbf{L}atent-Space \textbf{E}nergy-based Models. Specifically, our approach first disentangles complex personality into multi-faceted attributes. Subsequently, we employ a conditional variational auto-encoder to align with the dense personalized responses within a latent joint attribute space. We have also tailored a dedicated energy function and customized the ordinary differential equations sampling method to offer flexible attribute composition and precise attribute control. Extensive experiments demonstrate that \textsc{Miracle} outperforms several strong baselines in terms of personality controllability and response generation quality. Our dataset and code are available at \url{https://github.com/LZY-the-boys/MIRACLE}

UniMAP: Universal SMILES-Graph Representation Learning

  • paper_url: http://arxiv.org/abs/2310.14216
  • repo_url: https://github.com/fengshikun/unimap
  • paper_authors: Shikun Feng, Lixin Yang, Weiying Ma, Yanyan Lan
  • for: This paper aims to propose a universal molecular representation learning model that can effectively leverage both SMILES and graph representations for drug-related applications.
  • methods: The proposed model, UniMAP, uses an embedding layer to obtain token and node/edge representations in SMILES and graph, respectively, followed by a multi-layer Transformer to conduct deep cross-modality fusion. The model is pre-trained on four tasks: Multi-Level Cross-Modality Masking, SMILES-Graph Matching, Fragment-Level Alignment, and Domain Knowledge Learning.
  • results: UniMAP outperforms current state-of-the-art pre-training methods on various downstream tasks, including molecular property prediction, drug-target affinity prediction, and drug-drug interaction. The learned representations are also visualized to demonstrate the effect of multi-modality integration.
    Abstract Molecular representation learning is fundamental for many drug related applications. Most existing molecular pre-training models are limited in using single molecular modality, either SMILES or graph representation. To effectively leverage both modalities, we argue that it is critical to capture the fine-grained 'semantics' between SMILES and graph, because subtle sequence/graph differences may lead to contrary molecular properties. In this paper, we propose a universal SMILE-graph representation learning model, namely UniMAP. Firstly, an embedding layer is employed to obtain the token and node/edge representation in SMILES and graph, respectively. A multi-layer Transformer is then utilized to conduct deep cross-modality fusion. Specially, four kinds of pre-training tasks are designed for UniMAP, including Multi-Level Cross-Modality Masking (CMM), SMILES-Graph Matching (SGM), Fragment-Level Alignment (FLA), and Domain Knowledge Learning (DKL). In this way, both global (i.e. SGM and DKL) and local (i.e. CMM and FLA) alignments are integrated to achieve comprehensive cross-modality fusion. We evaluate UniMAP on various downstream tasks, i.e. molecular property prediction, drug-target affinity prediction and drug-drug interaction. Experimental results show that UniMAP outperforms current state-of-the-art pre-training methods.We also visualize the learned representations to demonstrate the effect of multi-modality integration.

Item-Graph2vec: a Efficient and Effective Approach using Item Co-occurrence Graph Embedding for Collaborative Filtering

  • paper_url: http://arxiv.org/abs/2310.14215
  • repo_url: https://github.com/cpu135/item-graph2vec
  • paper_authors: Ruilin Yuan, Leya Li, Yuanzhe Cai
  • for: improving the efficiency of large-scale item-based recommendation systems.
  • methods: an item-graph embedding algorithm based on random walks: users' shopping lists are transformed into an item co-occurrence graph, item sequences are obtained by random walks on that graph, and item vectors are trained from the sampled sequences.
  • results: because the size and density of the co-occurrence graph change only slightly as the training corpus grows, Item-Graph2vec has a stable runtime on large-scale datasets, running about 3x more efficiently than Item2vec on the Douban dataset while the error introduced by random-walk sampling remains small.
    Abstract Current item-item collaborative filtering algorithms based on artificial neural network, such as Item2vec, have become ubiquitous and are widely applied in the modern recommender system. However, these approaches do not apply to the large-scale item-based recommendation system because of their extremely long training time. To overcome the shortcoming that current algorithms have high training time costs and poor stability when dealing with large-scale data sets, the item graph embedding algorithm Item-Graph2vec is described here. This algorithm transforms the users' shopping list into a item co-occurrence graph, obtains item sequences through randomly travelling on this co-occurrence graph and finally trains item vectors through sequence samples. We posit that because of the stable size of item, the size and density of the item co-occurrence graph change slightly with the increase in the training corpus. Therefore, Item-Graph2vec has a stable runtime on the large scale data set, and its performance advantage becomes more and more obvious with the growth of the training corpus. Extensive experiments conducted on real-world data sets demonstrate that Item-Graph2vec outperforms Item2vec by 3 times in terms of efficiency on douban data set, while the error generated by the random walk sampling is small.
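The pipeline — build an item co-occurrence graph from shopping lists, sample item sequences by random walks, then train item vectors on those sequences — can be sketched as follows. The final skip-gram training step is omitted; any word2vec implementation can consume the generated walks as its "sentences":

```python
import random
from collections import defaultdict

def build_cooccurrence_graph(baskets):
    """Link every pair of items that co-occur in a user's shopping list."""
    graph = defaultdict(set)
    for basket in baskets:
        for i, a in enumerate(basket):
            for b in basket[i + 1:]:
                if a != b:
                    graph[a].add(b)
                    graph[b].add(a)
    return {item: sorted(nbrs) for item, nbrs in graph.items()}

def random_walks(graph, walks_per_node=2, walk_len=5, seed=0):
    """Sample item sequences by uniform random walks over the graph."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

Because the graph's node count is bounded by the (stable) catalog size rather than the interaction count, the walk-sampling cost stays roughly constant as the corpus grows, which is the source of the claimed runtime stability.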

LUNA: A Model-Based Universal Analysis Framework for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.14211
  • repo_url: None
  • paper_authors: Da Song, Xuan Xie, Jiayang Song, Derui Zhu, Yuheng Huang, Felix Juefei-Xu, Lei Ma
  • for: This paper aims to provide a universal analysis framework for large language models (LLMs) to evaluate their trustworthiness from multiple perspectives.
  • methods: The proposed framework, called LUNA, leverages various abstract model construction methods and defines evaluation metrics to assess the quality of the abstract model and the semantics of the LLM.
  • results: The framework enables versatile analysis of LLMs from multiple quality perspectives in a human-interpretable manner, and can be used to evaluate the trustworthiness of LLMs in various industrial domains.
    Abstract Over the past decade, Artificial Intelligence (AI) has had great success recently and is being used in a wide range of academic and industrial fields. More recently, LLMs have made rapid advancements that have propelled AI to a new level, enabling even more diverse applications and industrial domains with intelligence, particularly in areas like software engineering and natural language processing. Nevertheless, a number of emerging trustworthiness concerns and issues exhibited in LLMs have already recently received much attention, without properly solving which the widespread adoption of LLMs could be greatly hindered in practice. The distinctive characteristics of LLMs, such as the self-attention mechanism, extremely large model scale, and autoregressive generation schema, differ from classic AI software based on CNNs and RNNs and present new challenges for quality analysis. Up to the present, it still lacks universal and systematic analysis techniques for LLMs despite the urgent industrial demand. Towards bridging this gap, we initiate an early exploratory study and propose a universal analysis framework for LLMs, LUNA, designed to be general and extensible, to enable versatile analysis of LLMs from multiple quality perspectives in a human-interpretable manner. In particular, we first leverage the data from desired trustworthiness perspectives to construct an abstract model as an auxiliary analysis asset, which is empowered by various abstract model construction methods. To assess the quality of the abstract model, we collect and define a number of evaluation metrics, aiming at both abstract model level and the semantics level. Then, the semantics, which is the degree of satisfaction of the LLM w.r.t. the trustworthiness perspective, is bound to and enriches the abstract model with semantics, which enables more detailed analysis applications for diverse purposes.

CXR-LLaVA: Multimodal Large Language Model for Interpreting Chest X-ray Images

  • paper_url: http://arxiv.org/abs/2310.18341
  • repo_url: https://github.com/ecofri/cxr_llava
  • paper_authors: Seowoo Lee, M. D., Jiwon Youn, Mansu Kim Ph. D., Soon Ho Yoon, M. D. Ph. D
  • for: To develop an open-source multimodal large language model for interpreting chest X-ray images (CXR-LLaVA), and to examine the effect of prompt engineering and model parameters such as temperature and nucleus sampling.
  • methods: 659,287 publicly available chest X-rays were collected for training: 417,336 labeled with specific radiographic abnormalities (dataset 1) and 241,951 with free-text radiology reports (dataset 2). After pre-training ResNet-50 as the image encoder, contrastive language-image pre-training was used to align CXRs with their corresponding radiographic abnormalities; the model was then fine-tuned on dataset 2, refined with GPT-4 to generate diverse question-answering scenarios. Code: https://github.com/ECOFRI/CXR_LLaVA.
  • results: Performance varied with model parameters. On the test set, the model averaged an F1 of 0.34 across five pathologic findings (atelectasis, cardiomegaly, consolidation, edema, and pleural effusion), improved to 0.46 through prompt engineering; on the independent set it averaged 0.30. On a pediatric chest radiograph dataset unseen during training, it differentiated abnormal radiographs with F1 scores of 0.84-0.85.
    Abstract Purpose: Recent advancements in large language models (LLMs) have expanded their capabilities in a multimodal fashion, potentially replicating the image interpretation of human radiologists. This study aimed to develop an open-source multimodal large language model for interpreting chest X-ray images (CXR-LLaVA). We also examined the effect of prompt engineering and model parameters such as temperature and nucleus sampling. Materials and Methods: For training, we collected 659,287 publicly available CXRs: 417,336 CXRs had labels for certain radiographic abnormalities (dataset 1); 241,951 CXRs provided free-text radiology reports (dataset 2). After pre-training ResNet50 as an image encoder, contrastive language-image pre-training was used to align CXRs and corresponding radiographic abnormalities. Then, the Large Language Model Meta AI-2 was fine-tuned using dataset 2, which was refined using GPT-4 to generate various question-answering scenarios. The code can be found at https://github.com/ECOFRI/CXR_LLaVA. Results: In the test set, we observed that the model's performance fluctuated based on its parameters. On average, it achieved an F1 score of 0.34 for five pathologic findings (atelectasis, cardiomegaly, consolidation, edema, and pleural effusion), which was improved to 0.46 through prompt engineering. In the independent set, the model achieved an average F1 score of 0.30 for the same pathologic findings. Notably, for the pediatric chest radiograph dataset, which was unseen during training, the model differentiated abnormal radiographs with an F1 score ranging from 0.84 to 0.85. Conclusion: CXR-LLaVA demonstrates promising potential in CXR interpretation. Both prompt engineering and model parameter adjustments can play pivotal roles in interpreting CXRs.
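The contrastive language-image pre-training step described above can be sketched as a symmetric cross-entropy over cosine similarities between paired image and text embeddings. This is a minimal NumPy illustration of that objective, not the authors' implementation; the toy embeddings, batch size, and temperature are assumptions:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    labels = np.arange(len(logits))     # matching pairs lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # cross-entropy in both directions: image->text and text->image
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
batch, dim = 8, 32
txt = rng.normal(size=(batch, dim))
# well-aligned pairs should incur a much lower loss than random pairs
aligned = clip_contrastive_loss(txt + 0.01 * rng.normal(size=(batch, dim)), txt)
random_ = clip_contrastive_loss(rng.normal(size=(batch, dim)), txt)
print(aligned < random_)
```

Minimizing this loss pulls each CXR embedding toward the embedding of its own report while pushing it away from the other reports in the batch.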

Learning to Discern: Imitating Heterogeneous Human Demonstrations with Preference and Representation Learning

  • paper_url: http://arxiv.org/abs/2310.14196
  • repo_url: None
  • paper_authors: Sachit Kuhar, Shuo Cheng, Shivang Chopra, Matthew Bronars, Danfei Xu
  • for: This work addresses the challenges of maintaining the quality of collected data and handling the suboptimal nature of some demonstrations in practical imitation learning (IL) systems.
  • methods: The paper proposes Learning to Discern (L2D), an offline imitation learning framework that learns a latent representation for temporally embedded trajectory segments and uses preference learning in this latent space to train a quality evaluator that generalizes to demonstrators with different styles.
  • results: Experiments show that L2D can effectively assess and learn from demonstrations of varying quality and style, improving policy performance across a range of tasks both in simulation and on a physical robot.
    Abstract Practical Imitation Learning (IL) systems rely on large human demonstration datasets for successful policy learning. However, challenges lie in maintaining the quality of collected data and addressing the suboptimal nature of some demonstrations, which can compromise the overall dataset quality and hence the learning outcome. Furthermore, the intrinsic heterogeneity in human behavior can produce equally successful but disparate demonstrations, further exacerbating the challenge of discerning demonstration quality. To address these challenges, this paper introduces Learning to Discern (L2D), an offline imitation learning framework for learning from demonstrations with diverse quality and style. Given a small batch of demonstrations with sparse quality labels, we learn a latent representation for temporally embedded trajectory segments. Preference learning in this latent space trains a quality evaluator that generalizes to new demonstrators exhibiting different styles. Empirically, we show that L2D can effectively assess and learn from varying demonstrations, thereby leading to improved policy performance across a range of tasks in both simulations and on a physical robot.
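The preference-learning step at the heart of L2D can be illustrated with a Bradley-Terry style objective: a scalar quality score is fit over latent segment embeddings so that preferred segments score higher. The sketch below uses synthetic embeddings and a linear scorer, both assumptions for illustration; the paper operates on learned trajectory representations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "latent trajectory segments": quality is a hidden linear function
dim, n_pairs = 16, 500
w_true = rng.normal(size=dim)
seg_a = rng.normal(size=(n_pairs, dim))
seg_b = rng.normal(size=(n_pairs, dim))
# preference label: 1 if segment a has higher (hidden) quality than b
prefs = (seg_a @ w_true > seg_b @ w_true).astype(float)

# Bradley-Terry model: P(a preferred over b) = sigmoid(score(a) - score(b))
w = np.zeros(dim)
lr = 0.1
for _ in range(200):
    diff = (seg_a - seg_b) @ w
    p = 1.0 / (1.0 + np.exp(-diff))
    grad = (seg_a - seg_b).T @ (p - prefs) / n_pairs  # logistic-loss gradient
    w -= lr * grad

# The learned evaluator should rank unseen segments consistently with the
# hidden quality function, i.e., generalize to new "demonstrators"
test_a = rng.normal(size=(200, dim))
test_b = rng.normal(size=(200, dim))
pred = test_a @ w > test_b @ w
truth = test_a @ w_true > test_b @ w_true
accuracy = (pred == truth).mean()
print(round(accuracy, 2))
```

The key property is that only sparse pairwise preference labels are needed to train the scorer, matching the paper's "small batch of demonstrations with sparse quality labels" setting.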

PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation

  • paper_url: http://arxiv.org/abs/2310.14192
  • repo_url: https://github.com/servicenow/promptmix-emnlp-2023
  • paper_authors: Gaurav Sahu, Olga Vechtomova, Dzmitry Bahdanau, Issam H. Laradji
  • for: To improve text classification accuracy and efficiency when training data is limited.
  • methods: Large language models (LLMs) such as GPT3 are used to generate new examples, leveraging the LLM's instruction-following and few-shot classification abilities to produce more helpful data augmentation.
  • results: On four text classification datasets (Banking77, TREC6, Subjectivity (SUBJ), and Twitter Complaints), the proposed method of generating and relabeling borderline examples transfers knowledge from a large LLM such as GPT3.5-turbo into smaller, cheaper classifiers, and 2-shot PromptMix outperforms multiple 5-shot data augmentation methods on all four datasets.
    Abstract Data augmentation is a widely used technique to address the problem of text classification when there is a limited amount of training data. Recent work often tackles this problem using large language models (LLMs) like GPT3 that can generate new examples given already available ones. In this work, we propose a method to generate more helpful augmented data by utilizing the LLM's abilities to follow instructions and perform few-shot classifications. Our specific PromptMix method consists of two steps: 1) generate challenging text augmentations near class boundaries; however, generating borderline examples increases the risk of false positives in the dataset, so we 2) relabel the text augmentations using a prompting-based LLM classifier to enhance the correctness of labels in the generated data. We evaluate the proposed method in challenging 2-shot and zero-shot settings on four text classification datasets: Banking77, TREC6, Subjectivity (SUBJ), and Twitter Complaints. Our experiments show that generating and, crucially, relabeling borderline examples facilitates the transfer of knowledge of a massive LLM like GPT3.5-turbo into smaller and cheaper classifiers like DistilBERT$_{base}$ and BERT$_{base}$. Furthermore, 2-shot PromptMix outperforms multiple 5-shot data augmentation methods on the four datasets. Our code is available at https://github.com/ServiceNow/PromptMix-EMNLP-2023.
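The two-step PromptMix loop (generate near class boundaries, then relabel) can be sketched as follows. The LLM calls are stubbed out with placeholder functions; the function names, prompt wording, and toy class names are illustrative assumptions, not the paper's prompts:

```python
import random

random.seed(0)

# Stub LLM calls -- in a real system these would prompt GPT-3.5-turbo.
def llm_generate_borderline(class_a, class_b, k):
    """Step 1: ask the LLM for examples near the boundary of two classes."""
    return [f"example {i} mixing traits of {class_a} and {class_b}" for i in range(k)]

def llm_relabel(text, classes):
    """Step 2: ask the LLM (few-shot) which class the generated text truly belongs to."""
    return random.choice(classes)  # placeholder for a prompting-based classifier

def promptmix(classes, k_per_pair):
    augmented = []
    for a in classes:
        for b in classes:
            if a == b:
                continue
            for text in llm_generate_borderline(a, b, k_per_pair):
                # relabeling mitigates the false positives that borderline
                # generation introduces into the dataset
                augmented.append((text, llm_relabel(text, classes)))
    return augmented

data = promptmix(["complaint", "query"], k_per_pair=3)
print(len(data))  # 2 ordered class pairs x 3 generations each -> 6
```

The augmented pairs would then be mixed into the training set of a small classifier such as DistilBERT.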

Randomized Forward Mode of Automatic Differentiation for Optimization Algorithms

  • paper_url: http://arxiv.org/abs/2310.14168
  • repo_url: None
  • paper_authors: Khemraj Shukla, Yeonjong Shin
  • for: This paper proposes a generic randomized method that updates neural network parameters using directional derivatives of the loss function, avoiding full reverse-mode gradient computation.
  • methods: Directional derivatives are computed efficiently with forward-mode automatic differentiation, i.e., Jacobian-vector products (JVPs), along random directions sampled from different probability distributions, e.g., Bernoulli, Normal, Wigner, Laplace, and Uniform; the gradient estimate is obtained during the forward pass of the network.
  • results: The paper provides a rigorous analysis establishing the rate of convergence and demonstrates the method's efficiency through computational experiments in scientific machine learning, in particular physics-informed neural networks and Deep Operator Networks.
    Abstract Backpropagation within neural networks leverages a fundamental element of automatic differentiation, which is referred to as reverse-mode differentiation, or vector-Jacobian product (VJP) or, in the context of differential geometry, known as the pull-back process. The computation of the gradient is important as the update of neural network parameters is performed using the gradient descent method. In this study, we present a generic randomized method, which updates the parameters of neural networks by using directional derivatives of loss functions computed efficiently by using forward-mode AD or Jacobian-vector products (JVPs). These JVPs are computed along random directions sampled from different probability distributions, e.g., Bernoulli, Normal, Wigner, Laplace and Uniform distributions. The computation of the gradient is performed during the forward pass of the neural network. We also present a rigorous analysis of the presented methods, providing the rate of convergence, along with computational experiments deployed in scientific machine learning, in particular physics-informed neural networks and Deep Operator Networks.
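The core estimator can be sketched directly: averaging (directional derivative) x (direction) over random directions v with E[vv^T] = I yields an unbiased estimate of the gradient. In the sketch below the JVP is computed analytically for a quadratic test function rather than by forward-mode AD; the test function, dimensions, and sample count are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_gradient(jvp, x, n_dirs, sample):
    """Average of (directional derivative) * direction over random directions.
    Unbiased when E[v v^T] = I, since E[(g.v) v] = E[v v^T] g = g."""
    d = len(x)
    est = np.zeros(d)
    for _ in range(n_dirs):
        v = sample(d)
        est += jvp(x, v) * v
    return est / n_dirs

# Quadratic test function f(x) = 0.5 x^T A x, with known gradient A x
d = 5
A = rng.normal(size=(d, d))
A = A @ A.T                      # symmetric positive definite
x = rng.normal(size=d)
true_grad = A @ x

# In practice this would be a single forward-mode AD pass; here it is analytic
jvp = lambda x_, v: true_grad @ v

# Rademacher (Bernoulli +/-1) and standard normal directions both satisfy E[v v^T] = I
samplers = {
    "rademacher": lambda d: rng.choice([-1.0, 1.0], size=d),
    "normal": lambda d: rng.normal(size=d),
}
errs = {}
for name, sample in samplers.items():
    est = estimate_gradient(jvp, x, n_dirs=20000, sample=sample)
    errs[name] = np.linalg.norm(est - true_grad) / np.linalg.norm(true_grad)
print(errs)  # relative errors shrink roughly like sqrt(d / n_dirs)
```

The trade-off the paper exploits is that each JVP costs about one forward pass, so the gradient can be approximated without storing activations for a backward pass.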

Graph Convolutional Network with Connectivity Uncertainty for EEG-based Emotion Recognition

  • paper_url: http://arxiv.org/abs/2310.14165
  • repo_url: None
  • paper_authors: Hongxiang Gao, Xiangyao Wang, Zhenghua Chen, Min Wu, Zhipeng Cai, Lulu Zhao, Jianqing Li, Chengyu Liu
  • for: To improve automatic emotion recognition for human-computer interaction using multichannel electroencephalography (EEG) signals.
  • methods: A distribution-based uncertainty method, a graph convolutional network (GCN) architecture, the graph mixup technique, and deep GCN weights combined in a one-way learning fashion (Connectivity Uncertainty GCN, CU-GCN).
  • results: Experiments on two widely used datasets (SEED and SEEDIV) show that the method outperforms previous approaches, with positive and significant improvements.
    Abstract Automatic emotion recognition based on multichannel Electroencephalography (EEG) holds great potential in advancing human-computer interaction. However, several significant challenges persist in existing research on algorithmic emotion recognition. These challenges include the need for a robust model to effectively learn discriminative node attributes over long paths, the exploration of ambiguous topological information in EEG channels and effective frequency bands, and the mapping between intrinsic data qualities and provided labels. To address these challenges, this study introduces the distribution-based uncertainty method to represent spatial dependencies and temporal-spectral relativeness in EEG signals based on Graph Convolutional Network (GCN) architecture that adaptively assigns weights to functional aggregate node features, enabling effective long-path capturing while mitigating over-smoothing phenomena. Moreover, the graph mixup technique is employed to enhance latent connected edges and mitigate noisy label issues. Furthermore, we integrate the uncertainty learning method with deep GCN weights in a one-way learning fashion, termed Connectivity Uncertainty GCN (CU-GCN). We evaluate our approach on two widely used datasets, namely SEED and SEEDIV, for emotion recognition tasks. The experimental results demonstrate the superiority of our methodology over previous methods, yielding positive and significant improvements. Ablation studies confirm the substantial contributions of each component to the overall performance.
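The graph mixup technique mentioned above can be illustrated in its simplest form: convex combinations of node features and soft labels with a Beta-distributed mixing coefficient. This is a generic mixup sketch, not the paper's exact graph-level formulation; the 62-channel EEG shape and 3-class labels are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Convex combination of two samples and their one-hot labels.
    Mixed labels soften noisy annotations, which is the motivation in CU-GCN."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Toy EEG-like node features (channels x band features) and one-hot emotion labels
x1, y1 = rng.normal(size=(62, 5)), np.array([1.0, 0.0, 0.0])
x2, y2 = rng.normal(size=(62, 5)), np.array([0.0, 1.0, 0.0])

x_mix, y_mix = mixup(x1, y1, x2, y2)
print(x_mix.shape, float(y_mix.sum()))  # features keep their shape; soft label sums to 1
```

In the graph setting the same interpolation also enriches latent connected edges, since mixed node features induce intermediate connectivity patterns.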

Augmenting End-to-End Steering Angle Prediction with CAN Bus Data

  • paper_url: http://arxiv.org/abs/2310.14162
  • repo_url: None
  • paper_authors: Rohan Gupta
  • for: To improve the accuracy of end-to-end steering prediction for autonomous vehicles without using LiDAR or radar sensors.
  • methods: Computer vision models are improved by fusing CAN bus data, a vehicle protocol rich in state information (speed, steering angle, acceleration), with video data.
  • results: Fusing CAN bus data with video data reduces the computer vision model's prediction error by 20%, with some models reducing the error by 80%.
    Abstract In recent years, end-to-end steering prediction for autonomous vehicles has become a major area of research. The primary method for achieving end-to-end steering was to use computer vision models on a live feed of video data. However, to further increase accuracy, many companies have added data from light detection and ranging (LiDAR) and/or radar sensors through sensor fusion. However, the addition of lasers and sensors comes at a high financial cost. In this paper, I address both of these issues by increasing the accuracy of the computer vision models without the increased cost of using LiDAR and/or radar sensors. I achieved this by fusing CAN bus data, a vehicle protocol, with video data. CAN bus data is a rich source of information about the vehicle's state, including its speed, steering angle, and acceleration. By fusing this data with video data, the accuracy of the computer vision model's predictions can be improved. When I trained the model without CAN bus data, I obtained an RMSE of 0.02492, while the model trained with the CAN bus data achieved an RMSE of 0.01970. This finding indicates that fusing CAN bus data with video data can reduce the computer vision model's prediction error by 20%, with some models decreasing the error by 80%.
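The reported improvement can be checked directly from the two RMSE figures: (0.02492 - 0.01970) / 0.02492 is about 21%, consistent with the roughly 20% reduction claimed. A quick sanity-check computation (not code from the paper):

```python
import math

def rmse(preds, targets):
    """Root-mean-square error between predicted and true steering angles."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

# Relative error reduction from the reported RMSE figures
rmse_vision_only = 0.02492  # video data alone
rmse_with_can = 0.01970     # video + CAN bus fusion
reduction = (rmse_vision_only - rmse_with_can) / rmse_vision_only
print(f"{reduction:.1%}")  # -> 20.9%
```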

When Urban Region Profiling Meets Large Language Models

  • paper_url: http://arxiv.org/abs/2310.18340
  • repo_url: None
  • paper_authors: Yibo Yan, Haomin Wen, Siru Zhong, Wei Chen, Haodong Chen, Qingsong Wen, Roger Zimmermann, Yuxuan Liang
  • for: To propose an urban region profiling method powered by large language models (LLMs), providing data support for urban planning and sustainable development.
  • methods: A novel framework, UrbanCLIP, integrates the textual modality into urban imagery profiling: an Image-to-Text LLM first generates a detailed textual description for each satellite image, and the model is then trained on the image-text pairs with contrastive and language modeling losses, unifying natural language supervision for urban visual representation learning.
  • results: Experiments on predicting three urban indicators in four major Chinese metropolises show an average improvement of 6.1% in R^2 over state-of-the-art methods.
    Abstract Urban region profiling from web-sourced data is of utmost importance for urban planning and sustainable development. We are witnessing a rising trend of LLMs for various fields, especially in multi-modal data research such as vision-language learning, where the text modality serves as supplementary information for the image. Since the textual modality has never been introduced into modality combinations in urban region profiling, we aim to answer two fundamental questions in this paper: i) Can textual modality enhance urban region profiling? ii) and if so, in what ways and with regard to which aspects? To answer the questions, we leverage the power of Large Language Models (LLMs) and introduce the first-ever LLM-enhanced framework that integrates the knowledge of textual modality into urban imagery profiling, named LLM-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining (UrbanCLIP). Specifically, it first generates a detailed textual description for each satellite image by an open-source Image-to-Text LLM. Then, the model is trained on the image-text pairs, seamlessly unifying natural language supervision for urban visual representation learning, jointly with contrastive loss and language modeling loss. Results on predicting three urban indicators in four major Chinese metropolises demonstrate its superior performance, with an average improvement of 6.1% on R^2 compared to the state-of-the-art methods. Our code and the image-language dataset will be released upon paper notification.
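The headline metric here is the coefficient of determination R^2, on which UrbanCLIP improves by 6.1% on average. Its standard definition is easy to state in code (a generic implementation, not code from the paper):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

# Perfect predictions give R^2 = 1; always predicting the mean gives R^2 = 0
y = [2.0, 4.0, 6.0, 8.0]
print(r_squared(y, y))                     # -> 1.0
print(r_squared(y, [5.0, 5.0, 5.0, 5.0]))  # -> 0.0
```

Because R^2 is bounded above by 1, an average gain of 6.1% on it is a substantial margin for indicator-regression tasks.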

Are LSTMs Good Few-Shot Learners?

  • paper_url: http://arxiv.org/abs/2310.14139
  • repo_url: https://github.com/mikehuisman/lstm-fewshotlearning-oplstm
  • paper_authors: Mike Huisman, Thomas M. Moerland, Aske Plaat, Jan N. van Rijn
  • for: This study revisits whether LSTMs trained with backpropagation across tasks can meta-learn, addressing deep learning's need for large amounts of data when learning new tasks.
  • methods: An LSTM is trained with backpropagation across different tasks and evaluated on modern few-shot learning benchmarks.
  • results: Surprisingly, the LSTM outperforms MAML on a simple few-shot sine wave regression benchmark but, as expected, falls short on more complex few-shot image classification benchmarks. The authors identify two potential causes and propose Outer Product LSTM (OP-LSTM), which resolves these issues and yields substantial performance gains, including improvements of 0.5% to 1.9% in accuracy in cross-domain settings.
    Abstract Deep learning requires large amounts of data to learn new tasks well, limiting its applicability to domains where such data is available. Meta-learning overcomes this limitation by learning how to learn. In 2001, Hochreiter et al. showed that an LSTM trained with backpropagation across different tasks is capable of meta-learning. Despite promising results of this approach on small problems, and more recently, also on reinforcement learning problems, the approach has received little attention in the supervised few-shot learning setting. We revisit this approach and test it on modern few-shot learning benchmarks. We find that LSTMs, surprisingly, outperform the popular meta-learning technique MAML on a simple few-shot sine wave regression benchmark, but that LSTMs, expectedly, fall short on more complex few-shot image classification benchmarks. We identify two potential causes and propose a new method called Outer Product LSTM (OP-LSTM) that resolves these issues and displays substantial performance gains over the plain LSTM. Compared to popular meta-learning baselines, OP-LSTM yields competitive performance on within-domain few-shot image classification, and performs better in cross-domain settings by 0.5% to 1.9% in accuracy score. While these results alone do not set a new state-of-the-art, the advances of OP-LSTM are orthogonal to other advances in the field of meta-learning, yield new insights into how LSTMs work in image classification, allowing for a whole range of new research directions. For reproducibility purposes, we publish all our research code publicly.

cs.CL - 2023-10-22

Domain Terminology Integration into Machine Translation: Leveraging Large Language Models

  • paper_url: http://arxiv.org/abs/2310.14451
  • repo_url: None
  • paper_authors: Yasmin Moslem, Gianfranco Romani, Mahdi Molaei, Rejwanul Haque, John D. Kelleher, Andy Way
  • for: To improve the accuracy with which machine translation (MT) renders technical terms, enhancing communication and understanding in specialised domains.
  • methods: Large language models (LLMs) are used in two experiments: generating synthetic bilingual terminology-based data, and automatically post-editing MT output to incorporate pre-approved terms.
  • results: The proposed approach effectively integrates pre-approved terms into translations, raising the average rate of incorporated terms from 36.67% to 72.88%.
    Abstract This paper discusses the methods that we used for our submissions to the WMT 2023 Terminology Shared Task for German-to-English (DE-EN), English-to-Czech (EN-CS), and Chinese-to-English (ZH-EN) language pairs. The task aims to advance machine translation (MT) by challenging participants to develop systems that accurately translate technical terms, ultimately enhancing communication and understanding in specialised domains. To this end, we conduct experiments that utilise large language models (LLMs) for two purposes: generating synthetic bilingual terminology-based data, and post-editing translations generated by an MT model through incorporating pre-approved terms. Our system employs a four-step process: (i) using an LLM to generate bilingual synthetic data based on the provided terminology, (ii) fine-tuning a generic encoder-decoder MT model, with a mix of the terminology-based synthetic data generated in the first step and a randomly sampled portion of the original generic training data, (iii) generating translations with the fine-tuned MT model, and (iv) finally, leveraging an LLM for terminology-constrained automatic post-editing of the translations that do not include the required terms. The results demonstrate the effectiveness of our proposed approach in improving the integration of pre-approved terms into translations. The number of terms incorporated into the translations of the blind dataset increases from an average of 36.67% with the generic model to an average of 72.88% by the end of the process. In other words, successful utilisation of terms nearly doubles across the three language pairs.
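The headline numbers (36.67% rising to 72.88%) measure the share of required terms that actually appear in the system output. A minimal version of such a term-coverage check might look like this (a simplified sketch; the shared task's official scoring handles morphology, casing, and target-side variants more carefully):

```python
def term_coverage(translation, required_terms):
    """Fraction of pre-approved target terms found in the translation."""
    text = translation.lower()
    found = sum(1 for term in required_terms if term.lower() in text)
    return found / len(required_terms) if required_terms else 1.0

# Hypothetical example sentence and term list, for illustration only
translation = "The patient shows acute myocardial infarction and elevated troponin."
terms = ["myocardial infarction", "troponin", "ischemia"]
print(f"{term_coverage(translation, terms):.2%}")  # -> 66.67%
```

In the paper's pipeline, translations scoring below full coverage on such a check are the ones routed to the LLM for terminology-constrained post-editing (step iv).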

TATA: Stance Detection via Topic-Agnostic and Topic-Aware Embeddings

  • paper_url: http://arxiv.org/abs/2310.14450
  • repo_url: https://github.com/hanshanley/tata
  • paper_authors: Hans W. A. Hanley, Zakir Durumeric
  • for: To build a generalizable stance detection model that remains accurate on unseen topics.
  • methods: Contrastive learning and an unlabeled dataset of news articles covering a variety of topics are used to train topic-agnostic (TAG) and topic-aware (TAW) embeddings for downstream stance detection.
  • results: Combining these embeddings in the full TATA model achieves state-of-the-art performance across several public stance detection datasets (0.771 $F_1$-score on the zero-shot VAST dataset).
    Abstract Stance detection is important for understanding different attitudes and beliefs on the Internet. However, given that a passage's stance toward a given topic is often highly dependent on that topic, building a stance detection model that generalizes to unseen topics is difficult. In this work, we propose using contrastive learning as well as an unlabeled dataset of news articles that cover a variety of different topics to train topic-agnostic/TAG and topic-aware/TAW embeddings for use in downstream stance detection. Combining these embeddings in our full TATA model, we achieve state-of-the-art performance across several public stance detection datasets (0.771 $F_1$-score on the Zero-shot VAST dataset). We release our code and data at https://github.com/hanshanley/tata.

Text generation for dataset augmentation in security classification tasks

  • paper_url: http://arxiv.org/abs/2310.14429
  • repo_url: https://github.com/wenliangdai/multi-task-offensive-language-detection
  • paper_authors: Alexander P. Welsh, Matthew Edwards
  • for: To fill the training-data gap that security classifiers face when too few positive (malicious) samples are available.
  • methods: Natural language text generators are used to augment training data, evaluated across multiple security-related text classification tasks with varying class imbalances.
  • results: GPT-3 data augmentation strategies outperform both unaugmented training and basic augmentation strategies, particularly when known positive-class samples are severely limited.
    Abstract Security classifiers, designed to detect malicious content in computer systems and communications, can underperform when provided with insufficient training data. In the security domain, it is often easy to find samples of the negative (benign) class, and challenging to find enough samples of the positive (malicious) class to train an effective classifier. This study evaluates the application of natural language text generators to fill this data gap in multiple security-related text classification tasks. We describe a variety of previously-unexamined language-model fine-tuning approaches for this purpose and consider in particular the impact of disproportionate class-imbalances in the training set. Across our evaluation using three state-of-the-art classifiers designed for offensive language detection, review fraud detection, and SMS spam detection, we find that models trained with GPT-3 data augmentation strategies outperform both models trained without augmentation and models trained using basic data augmentation strategies already in common usage. In particular, we find substantial benefits for GPT-3 data augmentation strategies in situations with severe limitations on known positive-class samples.

Large Language Models are biased to overestimate profoundness

  • paper_url: http://arxiv.org/abs/2310.14422
  • repo_url: None
  • paper_authors: Eugenio Herrera-Berg, Tomás Vergara Browne, Pablo León-Villagrá, Marc-Lluís Vives, Cristian Buc Calderon
  • for: This study evaluates the ability of several large language models (LLMs) to judge the profoundness of mundane, motivational, and pseudo-profound statements, and examines biases induced by RLHF.
  • methods: Multiple prompting techniques, including few-shot learning prompts and chain-of-thought prompting, are used to elicit model judgments of the different statement types.
  • results: There is a significant statement-to-statement correlation between LLMs and humans, regardless of statement type and prompting technique. However, LLMs systematically overestimate the profoundness of nonsensical statements, with the exception of Tk-instruct, which uniquely underestimates statement profoundness. Few-shot learning prompts, unlike chain-of-thought prompting, draw model ratings closer to human ones, and RLHF appears to increase the bias toward overestimating profoundness.
    Abstract Recent advancements in natural language processing by large language models (LLMs), such as GPT-4, have been suggested to approach Artificial General Intelligence. And yet, it is still under dispute whether LLMs possess similar reasoning abilities to humans. This study evaluates GPT-4 and various other LLMs in judging the profoundness of mundane, motivational, and pseudo-profound statements. We found a significant statement-to-statement correlation between the LLMs and humans, irrespective of the type of statements and the prompting technique used. However, LLMs systematically overestimate the profoundness of nonsensical statements, with the exception of Tk-instruct, which uniquely underestimates the profoundness of statements. Only few-shot learning prompts, as opposed to chain-of-thought prompting, draw LLMs' ratings closer to humans'. Furthermore, this work provides insights into the potential biases induced by Reinforcement Learning from Human Feedback (RLHF), inducing an increase in the bias to overestimate the profoundness of statements.

REFER: An End-to-end Rationale Extraction Framework for Explanation Regularization

  • paper_url: http://arxiv.org/abs/2310.14418
  • repo_url: None
  • paper_authors: Mohammad Reza Ghasemi Madani, Pasquale Minervini
  • for: 本文旨在提高Explainable Natural Language Processing中的人工标注文本解释的重要性。
  • methods: 本文提出了一种名为REFER的框架,该框架使用可微的解释EXTractor,可以在推理过程中借鉴人工标注的帮助。
  • results: 在我们的实验中,REFER在具有 faithfulness、plausibility和下游任务准确率的情况下,与之前的基线比较,在e-SNLI和CoS-E上得到了较好的结果,其中的composite normalized relative gain比例提高了11%和3%。
    Abstract Human-annotated textual explanations are becoming increasingly important in Explainable Natural Language Processing. Rationale extraction aims to provide faithful (i.e., reflective of the behavior of the model) and plausible (i.e., convincing to humans) explanations by highlighting the inputs that had the largest impact on the prediction without compromising the performance of the task model. In recent works, the focus of training rationale extractors was primarily on optimizing for plausibility using human highlights, while the task model was trained on jointly optimizing for task predictive accuracy and faithfulness. We propose REFER, a framework that employs a differentiable rationale extractor that allows to back-propagate through the rationale extraction process. We analyze the impact of using human highlights during training by jointly training the task model and the rationale extractor. In our experiments, REFER yields significantly better results in terms of faithfulness, plausibility, and downstream task accuracy on both in-distribution and out-of-distribution data. On both e-SNLI and CoS-E, our best setting produces better results in terms of composite normalized relative gain than the previous baselines by 11% and 3%, respectively.
    摘要 人类标注文本解释在可解释自然语言处理中变得越来越重要。理由提取目标为提供准确(即模型行为reflective)并有理由的解释,而不是妥协任务模型性能。在现有的工作中,训练理由提取器的主要目标是优化假设性,使用人类高亮来评估plausibility。我们提出了REFER框架,它使用可微分的理由提取器,允许在理由提取过程中进行反propagation。我们分析了在训练中使用人类高亮的影响,并在任务模型和理由提取器同时训练。在我们的实验中,REFER实现了在 faithfulness、plausibility 和下游任务准确率方面提高了较大的改进,并在 e-SNLI 和 CoS-E 上实现了更好的 composite normalized relative gain 表现,相比前一个基eline提高11%和3%。

Evaluating Subjective Cognitive Appraisals of Emotions from Large Language Models

  • paper_url: http://arxiv.org/abs/2310.14389
  • repo_url: https://github.com/honglizhan/covidet-appraisals-public
  • paper_authors: Hongli Zhan, Desmond C. Ong, Junyi Jessy Li
  • for: This paper is written to address the lack of research on the automatic prediction of cognitive appraisals in emotional experiences.
  • methods: The paper uses a dataset called CovidET-Appraisals, which assesses 24 appraisal dimensions in 241 Reddit posts, to evaluate the ability of large language models to automatically assess and explain cognitive appraisals.
  • results: The paper finds that while the best models are performant, open-sourced LLMs fall short at this task, presenting a new challenge in the future development of emotionally intelligent models.
    Abstract The emotions we experience involve complex processes; besides physiological aspects, research in psychology has studied cognitive appraisals where people assess their situations subjectively, according to their own values (Scherer, 2005). Thus, the same situation can often result in different emotional experiences. While the detection of emotion is a well-established task, there is very limited work so far on the automatic prediction of cognitive appraisals. This work fills the gap by presenting CovidET-Appraisals, the most comprehensive dataset to-date that assesses 24 appraisal dimensions, each with a natural language rationale, across 241 Reddit posts. CovidET-Appraisals presents an ideal testbed to evaluate the ability of large language models -- excelling at a wide range of NLP tasks -- to automatically assess and explain cognitive appraisals. We found that while the best models are performant, open-sourced LLMs fall short at this task, presenting a new challenge in the future development of emotionally intelligent models. We release our dataset at https://github.com/honglizhan/CovidET-Appraisals-Public.

Bi-Encoders based Species Normalization – Pairwise Sentence Learning to Rank

  • paper_url: http://arxiv.org/abs/2310.14366
  • repo_url: None
  • paper_authors: Zainab Awan, Tim Kahlke, Peter Ralph, Paul Kennedy
  • for: Proposes a deep learning approach to biomedical named-entity normalization, i.e., linking biomedical entities to database identifiers.
  • methods: Treats normalization as a pairwise learning-to-rank problem: candidate concepts are generated with the Best Matching 25 (BM25) information-retrieval algorithm, then re-ranked with BERT (Bidirectional Encoder Representations from Transformers), with no feature engineering or rule creation required.
  • results: On the species entity type (LINNAEUS and S800 corpora), the approach surpasses state-of-the-art methods at linking entities to the NCBI taxonomy.
    Abstract Motivation: Biomedical named-entity normalization involves connecting biomedical entities with distinct database identifiers in order to facilitate data integration across various fields of biology. Existing systems for biomedical named entity normalization heavily rely on dictionaries, manually created rules, and high-quality representative features such as lexical or morphological characteristics. However, recent research has investigated the use of neural network-based models to reduce dependence on dictionaries, manually crafted rules, and features. Despite these advancements, the performance of these models is still limited due to the lack of sufficiently large training datasets. These models have a tendency to overfit small training corpora and exhibit poor generalization when faced with previously unseen entities, necessitating the redesign of rules and features. Contribution: We present a novel deep learning approach for named entity normalization, treating it as a pair-wise learning to rank problem. Our method utilizes the widely-used information retrieval algorithm Best Matching 25 to generate candidate concepts, followed by the application of bi-directional encoder representation from the encoder (BERT) to re-rank the candidate list. Notably, our approach eliminates the need for feature-engineering or rule creation. We conduct experiments on species entity types and evaluate our method against state-of-the-art techniques using LINNAEUS and S800 biomedical corpora. Our proposed approach surpasses existing methods in linking entities to the NCBI taxonomy. To the best of our knowledge, there is no existing neural network-based approach for species normalization in the literature.
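The BM25 candidate-generation step described above can be sketched with a toy scorer (a minimal implementation of the standard Okapi BM25 formula, not the authors' code; the tokenized concept names and the default k1/b values are assumptions for illustration):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each candidate document (a token list) against the query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query_tokens:
            if tf[q] == 0:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical candidate concept names for the mention "homo sapiens";
# the top-k candidates by BM25 score would then be re-ranked with BERT.
candidates = [["homo", "sapiens"], ["mus", "musculus"], ["homo", "erectus"]]
scores = bm25_scores(["homo", "sapiens"], candidates)
```

In the paper's setup, BM25 only narrows the candidate list; the final ranking decision is delegated to the bi-encoder.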

Is ChatGPT a game changer for geocoding – a benchmark for geocoding address parsing techniques

  • paper_url: http://arxiv.org/abs/2310.14360
  • repo_url: None
  • paper_authors: Zhengcong Yin, Diya Li, Daniel W. Goldberg
  • for: Benchmarking the GPT-3 model on the geocoding address-parsing task.
  • methods: Introduces a benchmark dataset of low-quality address descriptions synthesized from human input patterns mined from production geocoding logs (21 input error types and variations; over 239,000 records drawn from streets across all 50 U.S. states and D.C.), and trains and compares GPT-3, transformer-based, and LSTM-based models.
  • results: The Bidirectional LSTM-CRF model performs best; transformer-based models achieve very comparable results. GPT-3 trails in performance but shows promise with few-shot examples, leaving room for improvement through additional fine-tuning.
    Abstract The remarkable success of GPT models across various tasks, including toponymy recognition motivates us to assess the performance of the GPT-3 model in the geocoding address parsing task. To ensure that the evaluation more accurately mirrors performance in real-world scenarios with diverse user input qualities and resolve the pressing need for a 'gold standard' evaluation dataset for geocoding systems, we introduce a benchmark dataset of low-quality address descriptions synthesized based on human input patterns mining from actual input logs of a geocoding system in production. This dataset has 21 different input errors and variations; contains over 239,000 address records that are uniquely selected from streets across all U.S. 50 states and D.C.; and consists of three subsets to be used as training, validation, and testing sets. Building on this, we train and gauge the performance of the GPT-3 model in extracting address components, contrasting its performance with transformer-based and LSTM-based models. The evaluation results indicate that Bidirectional LSTM-CRF model has achieved the best performance over these transformer-based models and GPT-3 model. Transformer-based models demonstrate very comparable results compared to the Bidirectional LSTM-CRF model. The GPT-3 model, though trailing in performance, showcases potential in the address parsing task with few-shot examples, exhibiting room for improvement with additional fine-tuning. We open source the code and data of this presented benchmark so that researchers can utilize it for future model development or extend it to evaluate similar tasks, such as document geocoding.
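The benchmark's synthesis of low-quality addresses can be illustrated with a toy noise injector (a hypothetical example, not one of the paper's 21 documented error types; the function name and the two corruptions shown are assumptions):

```python
import random

def perturb_address(address, seed=0):
    """Apply two simple corruptions: drop one token, then transpose
    two adjacent characters (a keyboard-typo stand-in)."""
    rng = random.Random(seed)
    tokens = address.split()
    if len(tokens) > 1:
        tokens.pop(rng.randrange(len(tokens)))   # missing address component
    noisy = " ".join(tokens)
    if len(noisy) > 2:
        i = rng.randrange(len(noisy) - 1)        # transpose neighbors
        noisy = noisy[:i] + noisy[i + 1] + noisy[i] + noisy[i + 2:]
    return noisy

degraded = perturb_address("123 Main St Houston TX 77001", seed=1)
```

A real pipeline, as the paper describes, would instead mine error distributions from actual user input logs rather than inject noise uniformly.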

Cultural and Linguistic Diversity Improves Visual Representations

  • paper_url: http://arxiv.org/abs/2310.14356
  • repo_url: None
  • paper_authors: Andre Ye, Sebastin Santy, Jena D. Hwang, Amy X. Zhang, Ranjay Krishna
  • for: Investigating how cultural and linguistic background shapes visual perception in image understanding.
  • methods: Measures the semantic coverage of captions written in different languages using scene graphs, embeddings, and linguistic complexity.
  • results: Multilingual caption sets have higher semantic coverage on average than monolingual ones, and models trained on multilingual content perform consistently well across all evaluation data compositions.
    Abstract Computer vision often treats perception as objective, and this assumption gets reflected in the way that datasets are collected and models are trained. For instance, image descriptions in different languages are typically assumed to be translations of the same semantic content. However, work in cross-cultural psychology and linguistics has shown that individuals differ in their visual perception depending on their cultural background and the language they speak. In this paper, we demonstrate significant differences in semantic content across languages in both dataset and model-produced captions. When data is multilingual as opposed to monolingual, captions have higher semantic coverage on average, as measured by scene graph, embedding, and linguistic complexity. For example, multilingual captions have on average 21.8% more objects, 24.5% more relations, and 27.1% more attributes than a set of monolingual captions. Moreover, models trained on content from different languages perform best against test data from those languages, while those trained on multilingual content perform consistently well across all evaluation data compositions. Our research provides implications for how diverse modes of perception can improve image understanding.
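The object-coverage part of the semantic-coverage comparison can be caricatured as counting distinct scene-graph objects mentioned across a caption set (a deliberately simplified sketch under that assumption; the paper's actual measure also covers relations, attributes, embeddings, and linguistic complexity, and the example caption contents are invented):

```python
def object_coverage(caption_object_sets):
    """Number of distinct objects mentioned across a set of captions."""
    covered = set()
    for objects in caption_object_sets:
        covered |= objects
    return len(covered)

# Captions in different languages tend to mention complementary content:
multilingual = [{"woman", "bench"}, {"woman", "park", "tree"}, {"bag", "bench"}]
monolingual = [{"woman", "bench"}, {"woman", "bench"}, {"woman", "park"}]
```

Under this toy measure, the multilingual set covers more distinct objects than the monolingual one, mirroring the paper's 21.8%-more-objects finding in spirit.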

The Law and NLP: Bridging Disciplinary Disconnects

  • paper_url: http://arxiv.org/abs/2310.14346
  • repo_url: None
  • paper_authors: Robert Mahari, Dominik Stammbach, Elliott Ash, Alex ‘Sandy’ Pentland
  • for: Legal practice is rooted in language, yet legal practitioners and scholars have been slow to adopt NLP tools; meanwhile, the legal system faces an access-to-justice crisis that NLP could help alleviate.
  • methods: A position paper arguing that the slow uptake of NLP in legal practice is exacerbated by a disconnect between the needs of the legal community and the focus of NLP researchers, supported by a review of recent trends in the legal NLP literature.
  • results: The review finds limited overlap between the legal NLP community and legal academia, suggesting that some of the most popular legal NLP tasks fail to address practitioners' needs; the paper discusses tasks that could bridge the disciplinary disconnect and highlights underexplored areas for legal NLP research.
    Abstract Legal practice is intrinsically rooted in the fabric of language, yet legal practitioners and scholars have been slow to adopt tools from natural language processing (NLP). At the same time, the legal system is experiencing an access to justice crisis, which could be partially alleviated with NLP. In this position paper, we argue that the slow uptake of NLP in legal practice is exacerbated by a disconnect between the needs of the legal community and the focus of NLP researchers. In a review of recent trends in the legal NLP literature, we find limited overlap between the legal NLP community and legal academia. Our interpretation is that some of the most popular legal NLP tasks fail to address the needs of legal practitioners. We discuss examples of legal NLP tasks that promise to bridge disciplinary disconnects and highlight interesting areas for legal NLP research that remain underexplored.

Social Commonsense-Guided Search Query Generation for Open-Domain Knowledge-Powered Conversations

  • paper_url: http://arxiv.org/abs/2310.14340
  • repo_url: None
  • paper_authors: Revanth Gangi Reddy, Hao Bai, Wentao Yao, Sharath Chandra Etagi Suresh, Heng Ji, ChengXiang Zhai
  • for: Making open-domain, knowledge-powered conversations more informative and engaging by improving the relevance and specificity of retrieved information, especially when the user is passive and expresses no clear need.
  • methods: Leverages a commonsense dialog system to establish connections related to the conversation topic, which then guide instruction-driven search query generation; the framework integrates topic tracking, commonsense response generation, and query generation.
  • results: Extensive evaluations show the approach overcomes the limitations of query generation techniques that rely solely on explicit dialog information, producing queries that are more relevant, specific, and compelling, and ultimately more engaging responses.
    Abstract Open-domain dialog involves generating search queries that help obtain relevant knowledge for holding informative conversations. However, it can be challenging to determine what information to retrieve when the user is passive and does not express a clear need or request. To tackle this issue, we present a novel approach that focuses on generating internet search queries that are guided by social commonsense. Specifically, we leverage a commonsense dialog system to establish connections related to the conversation topic, which subsequently guides our query generation. Our proposed framework addresses passive user interactions by integrating topic tracking, commonsense response generation and instruction-driven query generation. Through extensive evaluations, we show that our approach overcomes limitations of existing query generation techniques that rely solely on explicit dialog information, and produces search queries that are more relevant, specific, and compelling, ultimately resulting in more engaging responses.

DiFair: A Benchmark for Disentangled Assessment of Gender Knowledge and Bias

  • paper_url: http://arxiv.org/abs/2310.14329
  • repo_url: https://github.com/mzakizadeh/difair_public
  • paper_authors: Mahdi Zakizadeh, Kaveh Eskandari Miandoab, Mohammad Taher Pilehvar
  • for: mitigating the gender bias in pretrained language models and evaluating the impact of bias mitigation on useful gender knowledge
  • methods: using a manually curated dataset called DiFair, introducing a unified metric called gender invariance score to quantify both biased behavior and preservation of useful gender knowledge
  • results: experimental results show that debiasing techniques can ameliorate the issue of gender bias, but at the cost of lowering the model’s useful gender knowledge
    Abstract Numerous debiasing techniques have been proposed to mitigate the gender bias that is prevalent in pretrained language models. These are often evaluated on datasets that check the extent to which the model is gender-neutral in its predictions. Importantly, this evaluation protocol overlooks the possible adverse impact of bias mitigation on useful gender knowledge. To fill this gap, we propose DiFair, a manually curated dataset based on masked language modeling objectives. DiFair allows us to introduce a unified metric, gender invariance score, that not only quantifies a model's biased behavior, but also checks if useful gender knowledge is preserved. We use DiFair as a benchmark for a number of widely-used pretained language models and debiasing techniques. Experimental results corroborate previous findings on the existing gender biases, while also demonstrating that although debiasing techniques ameliorate the issue of gender bias, this improvement usually comes at the price of lowering useful gender knowledge of the model.

Towards Harmful Erotic Content Detection through Coreference-Driven Contextual Analysis

  • paper_url: http://arxiv.org/abs/2310.14325
  • repo_url: None
  • paper_authors: Inez Okulska, Emilia Wiśnios
  • for: Developing a hybrid neural and rule-based, context-aware system for detecting harmful contextual cues in erotic content.
  • methods: Applies coreference resolution to surface harmful context hidden in the non-sexual parts of a narrative; a dataset was compiled in collaboration with professional moderators, and a classifier was built to distinguish harmful from non-harmful erotic content.
  • results: On Polish text, the hybrid model reaches 84% accuracy and 80% recall, while RoBERTa- and Longformer-based models without explicit coreference chains perform significantly worse, underscoring the importance of coreference chains for detecting harmful erotic content.
    Abstract Adult content detection still poses a great challenge for automation. Existing classifiers primarily focus on distinguishing between erotic and non-erotic texts. However, they often need more nuance in assessing the potential harm. Unfortunately, the content of this nature falls beyond the reach of generative models due to its potentially harmful nature. Ethical restrictions prohibit large language models (LLMs) from analyzing and classifying harmful erotics, let alone generating them to create synthetic datasets for other neural models. In such instances where data is scarce and challenging, a thorough analysis of the structure of such texts rather than a large model may offer a viable solution. Especially given that harmful erotic narratives, despite appearing similar to harmless ones, usually reveal their harmful nature first through contextual information hidden in the non-sexual parts of the narrative. This paper introduces a hybrid neural and rule-based context-aware system that leverages coreference resolution to identify harmful contextual cues in erotic content. Collaborating with professional moderators, we compiled a dataset and developed a classifier capable of distinguishing harmful from non-harmful erotic content. Our hybrid model, tested on Polish text, demonstrates a promising accuracy of 84% and a recall of 80%. Models based on RoBERTa and Longformer without explicit usage of coreference chains achieved significantly weaker results, underscoring the importance of coreference resolution in detecting such nuanced content as harmful erotics. This approach also offers the potential for enhanced visual explainability, supporting moderators in evaluating predictions and taking necessary actions to address harmful content.

4 and 7-bit Labeling for Projective and Non-Projective Dependency Trees

  • paper_url: http://arxiv.org/abs/2310.14319
  • repo_url: None
  • paper_authors: Carlos Gómez-Rodríguez, Diego Roca, David Vilares
  • for: Proposes an encoding that represents any projective dependency tree as a sequence of 4-bit labels, one per word.
  • methods: The four bits of each word's label indicate (1) whether it is a left or right dependent, (2) whether it is the outermost left/right dependent of its parent, (3) whether it has any left children, and (4) whether it has any right children; a 7-bit extension adds an extra plane of arcs, extending coverage to almost full non-projectivity.
  • results: The encoding is an injective mapping from trees to labels that can be encoded and decoded in linear time, and the 7-bit variant obtains substantial accuracy gains over the previously best-performing sequence labeling encodings on a diverse set of treebanks.
    Abstract We introduce an encoding for parsing as sequence labeling that can represent any projective dependency tree as a sequence of 4-bit labels, one per word. The bits in each word's label represent (1) whether it is a right or left dependent, (2) whether it is the outermost (left/right) dependent of its parent, (3) whether it has any left children and (4) whether it has any right children. We show that this provides an injective mapping from trees to labels that can be encoded and decoded in linear time. We then define a 7-bit extension that represents an extra plane of arcs, extending the coverage to almost full non-projectivity (over 99.9% empirical arc coverage). Results on a set of diverse treebanks show that our 7-bit encoding obtains substantial accuracy gains over the previously best-performing sequence labeling encodings.
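The 4-bit label definition above is concrete enough to sketch an encoder directly (an illustrative reimplementation from the abstract's description, not the authors' code; the head-array input convention is an assumption):

```python
def encode_4bit(heads):
    """Encode a projective dependency tree as one 4-bit label per word.

    heads[i-1] is the head of word i (words are 1-based); 0 denotes the root.
    Bits: (right dependent?, outermost dependent of its head on its side?,
           has left children?, has right children?)
    """
    n = len(heads)
    labels = []
    for i in range(1, n + 1):
        h = heads[i - 1]
        siblings = [j for j in range(1, n + 1) if heads[j - 1] == h and j != i]
        right_dep = i > h                      # head lies to the word's left
        if right_dep:                          # outermost: no sibling further out
            outermost = not any(j > i for j in siblings)
        else:
            outermost = not any(j < i for j in siblings)
        has_left = any(heads[j - 1] == i for j in range(1, i))
        has_right = any(heads[j - 1] == i for j in range(i + 1, n + 1))
        labels.append((int(right_dep), int(outermost), int(has_left), int(has_right)))
    return labels
```

For a three-word sentence with heads [2, 0, 2] (words 1 and 3 both attach to word 2, the root), each word receives one of the 16 possible labels; a parser-as-tagger would predict these labels and decode them back into a tree in linear time.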

Neural Text Sanitization with Privacy Risk Indicators: An Empirical Analysis

  • paper_url: http://arxiv.org/abs/2310.14312
  • repo_url: None
  • paper_authors: Anthi Papadopoulou, Pierre Lison, Mark Anderson, Lilja Øvrelid, Ildikó Pilán
  • for: Proposes a two-step text sanitization approach and empirically analyzes it on two recently published datasets: the Text Anonymization Benchmark (Pilán et al., 2022) and a collection of Wikipedia biographies (Papadopoulou et al., 2022).
  • methods: A privacy-oriented entity recognizer, trained by combining a standard named entity recognition model with a gazetteer of person-related terms extracted from Wikidata, first detects text spans expressing identifiable personal information; the privacy risk of each detected span, isolated or in combination with others, is then assessed using language model probabilities, text span classification, sequence labelling, perturbations, and web search.
  • results: The paper presents five distinct re-identification risk indicators and a contrastive analysis of each, highlighting their benefits and limitations, notably in relation to the available labeled data.
    Abstract Text sanitization is the task of redacting a document to mask all occurrences of (direct or indirect) personal identifiers, with the goal of concealing the identity of the individual(s) referred in it. In this paper, we consider a two-step approach to text sanitization and provide a detailed analysis of its empirical performance on two recently published datasets: the Text Anonymization Benchmark (Pil\'an et al., 2022) and a collection of Wikipedia biographies (Papadopoulou et al., 2022). The text sanitization process starts with a privacy-oriented entity recognizer that seeks to determine the text spans expressing identifiable personal information. This privacy-oriented entity recognizer is trained by combining a standard named entity recognition model with a gazetteer populated by person-related terms extracted from Wikidata. The second step of the text sanitization process consists in assessing the privacy risk associated with each detected text span, either isolated or in combination with other text spans. We present five distinct indicators of the re-identification risk, respectively based on language model probabilities, text span classification, sequence labelling, perturbations, and web search. We provide a contrastive analysis of each privacy indicator and highlight their benefits and limitations, notably in relation to the available labeled data.
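The first indicator, language model probabilities, can be illustrated with a toy unigram surprisal score (a minimal sketch assuming that rarer spans carry a stronger re-identification signal; the paper's actual indicator uses a real language model, and the example corpus is invented):

```python
import math
from collections import Counter

def span_surprisal(span_tokens, corpus_tokens):
    """Mean add-one-smoothed unigram surprisal (in bits) of a text span.
    Higher surprisal = rarer span = (under this toy assumption) higher risk."""
    counts = Counter(corpus_tokens)
    total, vocab = sum(counts.values()), len(counts)
    bits = [-math.log2((counts[t] + 1) / (total + vocab)) for t in span_tokens]
    return sum(bits) / len(bits)

corpus = "the clerk met the clerk at the office".split()
risky = span_surprisal(["John", "Smith"], corpus)   # unseen, rare tokens
safe = span_surprisal(["the", "clerk"], corpus)     # frequent tokens
```

A sanitizer could then mask only the spans whose surprisal exceeds a calibrated threshold, trading off utility against re-identification risk.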

Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases

  • paper_url: http://arxiv.org/abs/2310.14303
  • repo_url: None
  • paper_authors: Rishabh Bhardwaj, Soujanya Poria
  • for: Evaluating the harmfulness of Large Language Models (LLMs) beyond prompt-based red-teaming.
  • methods: Proposes parametric red-teaming through Unalignment, which (instruction-)tunes model parameters to break safety guardrails that are not deeply rooted in the model's behavior, using as few as 100 examples.
  • results: Unalignment makes ChatGPT respond to harmful queries with an 88% success rate on two safety benchmark datasets, and achieves attack success rates above 91% on open-source models such as VICUNA-7B and LLAMA-2-CHAT 7B and 13B; it also exposes inherent biases, with safety-aligned models producing strongly biased and opinionated responses 64% of the time.
    Abstract Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs). It aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query. Existing methods are primarily based on input text-based red-teaming such as adversarial prompts, low-resource prompts, or contextualized prompts to condition the model in a way to bypass its safe behavior. Bypassing the guardrails uncovers hidden harmful information and biases in the model that are left untreated or newly introduced by its safety training. However, prompt-based attacks fail to provide such a diagnosis owing to their low attack success rate, and applicability to specific models. In this paper, we present a new perspective on LLM safety research i.e., parametric red-teaming through Unalignment. It simply (instruction) tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior. Unalignment using as few as 100 examples can significantly bypass commonly referred to as CHATGPT, to the point where it responds with an 88% success rate to harmful queries on two safety benchmark datasets. On open-source models such as VICUNA-7B and LLAMA-2-CHAT 7B AND 13B, it shows an attack success rate of more than 91%. On bias evaluations, Unalignment exposes inherent biases in safety-aligned models such as CHATGPT and LLAMA- 2-CHAT where the model's responses are strongly biased and opinionated 64% of the time.

Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

  • paper_url: http://arxiv.org/abs/2310.14278
  • repo_url: None
  • paper_authors: Kun Wei, Bei Li, Hang Lv, Quan Lu, Ning Jiang, Lei Xie
  • for: Improving the accuracy of conversational ASR systems, particularly at extracting relevant contextual information from previous conversational turns.
  • methods: Extends the Conformer encoder-decoder model with a cross-modal conversational representation: a cross-modal extractor combines pre-trained speech and text models through a specialized encoder and a modal-level mask input, capturing richer historical speech context without explicit error propagation; conditional latent variational modules in the decoder learn conversation-level attributes such as role preference and topic coherence.
  • results: Relative accuracy improvements of 8.8% and 23% over the standard Conformer model on the Mandarin conversation datasets HKUST and MagicData-RAMC, respectively.
    Abstract Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel Conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.

CT-GAT: Cross-Task Generative Adversarial Attack based on Transferability

  • paper_url: http://arxiv.org/abs/2310.14265
  • repo_url: https://github.com/xiaoxuannlp/ct-gat
  • paper_authors: Minxuan Lv, Chengwei Dai, Kun Li, Wei Zhou, Songlin Hu
  • for: Studying adversarial attacks on neural network models: adversarial transferability increases attack risk, while existing transfer-based methods rely on substitute models that are impractical and costly when training data and victim-model details are unavailable.
  • methods: Directly constructs adversarial examples by extracting features that transfer across tasks: a sequence-to-sequence generative model, CT-GAT, is trained on adversarial sample data collected from multiple tasks to acquire universal adversarial features and generate adversarial examples for different tasks.
  • results: Experiments on ten distinct datasets show the method achieves superior attack performance at small cost.
    Abstract Neural network models are vulnerable to adversarial examples, and adversarial transferability further increases the risk of adversarial attacks. Current methods based on transferability often rely on substitute models, which can be impractical and costly in real-world scenarios due to the unavailability of training data and the victim model's structural details. In this paper, we propose a novel approach that directly constructs adversarial examples by extracting transferable features across various tasks. Our key insight is that adversarial transferability can extend across different tasks. Specifically, we train a sequence-to-sequence generative model named CT-GAT using adversarial sample data collected from multiple tasks to acquire universal adversarial features and generate adversarial examples for different tasks. We conduct experiments on ten distinct datasets, and the results demonstrate that our method achieves superior attack performance with small cost.
    摘要 神经网络模型容易受到对抗样本的攻击,而对抗迁移性进一步增加了攻击风险。现有基于迁移性的方法通常依赖替代模型,但在实际场景中,由于训练数据和受害模型结构细节不可获得,这种做法往往不切实际且代价高昂。在这篇论文中,我们提出了一种新方法,通过提取跨任务的可迁移特征来直接构造对抗样本。我们的关键发现是对抗迁移性可以跨任务扩展。具体来说,我们使用从多个任务收集的对抗样本数据训练一个名为CT-GAT的序列到序列生成模型,以获得通用的对抗特征并为不同任务生成对抗样本。我们在十个不同的数据集上进行了实验,结果显示我们的方法能够以较小的代价实现更优的攻击性能。

Boosting Unsupervised Machine Translation with Pseudo-Parallel Data

  • paper_url: http://arxiv.org/abs/2310.14262
  • repo_url: None
  • paper_authors: Ivana Kvapilíková, Ondřej Bojar
  • for: 提高低资源语言机器翻译系统的质量
  • methods: 使用从单语语料中挖掘的伪平行句对以及回译得到的合成句对进行训练
  • results: 与基线相比,提高翻译质量,最高提高14.5个 BLEU 点(英语到乌克兰语)
    Abstract Even with the latest developments in deep learning and large-scale language modeling, the task of machine translation (MT) of low-resource languages remains a challenge. Neural MT systems can be trained in an unsupervised way without any translation resources but the quality lags behind, especially in truly low-resource conditions. We propose a training strategy that relies on pseudo-parallel sentence pairs mined from monolingual corpora in addition to synthetic sentence pairs back-translated from monolingual corpora. We experiment with different training schedules and reach an improvement of up to 14.5 BLEU points (English to Ukrainian) over a baseline trained on back-translated data only.
    摘要 即使有深度学习和大规模语言建模的最新进展,低资源语言的机器翻译(MT)仍然是一个挑战。神经机器翻译系统可以在没有任何翻译资源的情况下以无监督方式训练,但其质量仍然落后,尤其是在真正的低资源条件下。我们提出一种训练策略,除了利用从单语语料回译得到的合成句对之外,还利用从单语语料中挖掘的伪平行句对。我们试验了不同的训练计划,相比仅在回译数据上训练的基线,最高取得14.5个BLEU点的提升(英语到乌克兰语)。
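As a rough illustration of the mining step, the sketch below pairs source- and target-language sentences by cosine similarity over sentence embeddings. The embeddings, threshold, and toy data are hypothetical stand-ins (in practice a cross-lingual sentence encoder would supply the embeddings); this is not the paper's actual pipeline, only the general shape of pseudo-parallel mining.

```python
import numpy as np

def mine_pseudo_parallel(src_emb, tgt_emb, threshold=0.8):
    """Pair each source sentence with its nearest target sentence by
    cosine similarity, keeping only pairs above a confidence threshold.
    Embeddings are assumed to live in a shared cross-lingual space."""
    # Normalize rows so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                      # (n_src, n_tgt) similarity matrix
    best = sims.argmax(axis=1)              # nearest target for each source
    return [(i, int(j)) for i, j in enumerate(best) if sims[i, j] >= threshold]

# Toy embeddings: source 0 aligns with target 1, source 1 with target 0.
src = np.array([[1.0, 0.1], [0.1, 1.0]])
tgt = np.array([[0.0, 1.0], [1.0, 0.0]])
print(mine_pseudo_parallel(src, tgt))  # [(0, 1), (1, 0)]
```

The mined pairs would then be mixed with back-translated synthetic pairs in the training schedule.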

From Static to Dynamic: A Continual Learning Framework for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.14248
  • repo_url: https://github.com/elfsong/dynamind
  • paper_authors: Mingzhe Du, Anh Tuan Luu, Bin Ji, See-kiong Ng
  • for: 这篇论文是为了解决大型自然语言处理模型(LLMs)中的复杂性问题,以提高其在不同自然语言处理任务中的表现。
  • methods: 这篇论文提出了一个名为DynaMind的新的持续学习框架,旨在帮助LLMs继续学习并吸收新知识。DynaMind包括内存机制来吸收新知识,以及增强模型推论过程中的模块运算器,以提高LLMs的表现精度。
  • results: 根据比较 experiments,DynaMind可以有效地解决LLMs中的复杂性问题,并提高其表现精度。
    Abstract The vast number of parameters in large language models (LLMs) endows them with remarkable capabilities, allowing them to excel in a variety of natural language processing tasks. However, this complexity also presents challenges, making LLMs difficult to train and inhibiting their ability to continuously assimilate new knowledge, which may lead to inaccuracies in their outputs. To mitigate these issues, this paper presents DynaMind, a novel continual learning framework designed for LLMs. DynaMind incorporates memory mechanisms to assimilate new knowledge and modular operators to enhance the model inference process with the newly assimilated knowledge, consequently improving the accuracies of LLMs' outputs. Benchmark experiments demonstrate DynaMind's effectiveness in overcoming these challenges. The code and demo of DynaMind are available on GitHub: https://github.com/Elfsong/DynaMind.
    摘要 大语言模型(LLM)庞大的参数量赋予其惊人的能力,使其在各类自然语言处理任务中表现出色。然而,这种复杂性也带来挑战:LLM 难以训练,且难以持续吸收新知识,这可能导致其输出不准确。为缓解这些问题,本文提出了 DynaMind,一种专为 LLM 设计的新型持续学习框架。DynaMind 引入记忆机制以吸收新知识,并引入模块化算子,利用新吸收的知识增强模型的推理过程,从而提高 LLM 输出的准确性。基准实验证明了 DynaMind 在克服这些挑战方面的有效性。DynaMind 的代码与演示可在 GitHub 上获取:https://github.com/Elfsong/DynaMind。

PHD: Pixel-Based Language Modeling of Historical Documents

  • paper_url: http://arxiv.org/abs/2310.18343
  • repo_url: None
  • paper_authors: Nadav Borenstein, Phillip Rust, Desmond Elliott, Isabelle Augenstein
  • for: 这篇论文旨在探讨历史文献的数字化处理及其自然语言处理方法。
  • methods: 该论文使用最新的基于像素的语言模型,通过重建被遮盖的像素区域来替代传统的 OCR 技术,并提出了一种新的合成扫描件生成方法,用于生成具有历史档案特点的合成扫描件。
  • results: 该论文通过实验证明,PHD 模型在重建受遮盖的像素区域方面具有高度的掌握能力,并在历史问答任务中得到了成功应用。
    Abstract The digitisation of historical documents has provided historians with unprecedented research opportunities. Yet, the conventional approach to analysing historical documents involves converting them from images to text using OCR, a process that overlooks the potential benefits of treating them as images and introduces high levels of noise. To bridge this gap, we take advantage of recent advancements in pixel-based language models trained to reconstruct masked patches of pixels instead of predicting token distributions. Due to the scarcity of real historical scans, we propose a novel method for generating synthetic scans to resemble real historical documents. We then pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period. Through our experiments, we demonstrate that PHD exhibits high proficiency in reconstructing masked image patches and provide evidence of our model's noteworthy language understanding capabilities. Notably, we successfully apply our model to a historical QA task, highlighting its usefulness in this domain.
    摘要 历史文献的数字化为历史学家提供了前所未有的研究机会。然而,传统的历史文献分析方法是使用 OCR 将图像转换为文本,这一过程忽略了将其作为图像处理的潜在好处,并引入了大量噪声。为弥合这一差距,我们利用了基于像素的语言模型的最新进展,这类模型通过重建被遮盖的像素块进行训练,而非预测词元分布。由于真实历史扫描件稀缺,我们提出了一种生成合成扫描件的新方法,使其与真实历史文献相似。随后,我们在合成扫描件和1700-1900年间真实历史报纸的组合数据上预训练了我们的模型 PHD。实验表明,PHD 在重建被遮盖的图像块方面表现出很高的能力,并展现出值得注意的语言理解能力。尤其是,我们成功地将该模型应用于一个历史问答任务,凸显了它在该领域的实用性。

Customising General Large Language Models for Specialised Emotion Recognition Tasks

  • paper_url: http://arxiv.org/abs/2310.14225
  • repo_url: None
  • paper_authors: Liyizhe Peng, Zixing Zhang, Tao Pang, Jing Han, Huan Zhao, Hao Chen, Björn W. Schuller
  • for: 这个论文主要是为了探讨大语言模型(LLMs)在情感识别任务中的性能和可行性。
  • methods: 这篇论文使用两种不同的模型适应技术来定制 Chat General Language Model(一个公开可用的大语言模型),即深度提示调整和低秩适应。
  • results: 实验结果表明,经这两种技术适应后的 LLM 可以轻松超越其他专门化的深度模型,这表明 LLM 在情感识别任务中具有强大的可迁移性和可行性。
    Abstract The advent of large language models (LLMs) has gained tremendous attention over the past year. Previous studies have shown the astonishing performance of LLMs not only in other tasks but also in emotion recognition in terms of accuracy, universality, explanation, robustness, few/zero-shot learning, and others. Leveraging the capability of LLMs inevitably becomes an essential solution for emotion recognition. To this end, we further comprehensively investigate how LLMs perform in linguistic emotion recognition if we concentrate on this specific task. Specifically, we exemplify a publicly available and widely used LLM -- Chat General Language Model, and customise it for our target by using two different modal adaptation techniques, i.e., deep prompt tuning and low-rank adaptation. The experimental results obtained on six widely used datasets present that the adapted LLM can easily outperform other state-of-the-art but specialised deep models. This indicates the strong transferability and feasibility of LLMs in the field of emotion recognition.
    摘要 大语言模型(LLM)的出现在过去一年内得到了极大关注。此前的研究表明,LLM 不仅在其他任务上表现出众,在情感识别的准确率、通用性、可解释性、鲁棒性以及少样本/零样本学习等方面同样惊人。利用 LLM 的能力必然成为情感识别的重要解决手段。为此,我们进一步全面考察了 LLM 在语言情感识别这一特定任务上的表现。具体来说,我们以公开可用且被广泛使用的 Chat General Language Model 为例,使用两种不同的模型适应技术,即深度提示调整(deep prompt tuning)和低秩适应(low-rank adaptation),针对目标任务对其进行定制。在六个广泛使用的数据集上得到的实验结果表明,适应后的 LLM 可以轻松超越其他最先进但专门化的深度模型。这表明 LLM 在情感识别领域具有很强的可迁移性和可行性。
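Low-rank adaptation, one of the two techniques mentioned above, can be sketched in a few lines. The sizes and scaling below are illustrative values, not the paper's settings; the point is only that the product B·A adds a trainable rank-r correction to a frozen weight, and that zero-initializing B makes the adapted model start exactly from the pre-trained behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 8, 2, 16              # hidden size, LoRA rank, scaling (illustrative)
W = rng.normal(size=(d, d))         # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized

def lora_forward(x):
    # Base path plus low-rank update: W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# With B = 0 the adapted model reproduces the frozen model exactly,
# which is why LoRA fine-tuning starts from the pre-trained behaviour.
assert np.allclose(lora_forward(x), W @ x)
```

During fine-tuning only A and B receive gradients, so the number of trainable parameters is 2·d·r instead of d².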

Manifold-Preserving Transformers are Effective for Short-Long Range Encoding

  • paper_url: http://arxiv.org/abs/2310.14206
  • repo_url: https://github.com/victor7246/transject
  • paper_authors: Ayan Sengupta, Md Shad Akhtar, Tanmoy Chakraborty
  • for: 本研究旨在提高Transformer模型的表达能力,特别是保持层次结构信息。
  • methods: 本文提出了一种名为 TransJect 的编码器模型,为词元对之间的层级距离保持提供理论保证。具体来说,TransJect 用一种简单的替代方案取代点积注意力,从而确保 Lipschitz 连续性。
  • results: 在多个短序列和长序列分类任务上,TransJect 相比 Transformer 的各种变体最高分别提升6.8%和5.9%;在语言建模任务上比 Transformer 表现好79%。此外,本文还从统计物理的角度探讨了多头自注意力的缺陷。
    Abstract Multi-head self-attention-based Transformers have shown promise in different learning tasks. Albeit these models exhibit significant improvement in understanding short-term and long-term contexts from sequences, encoders of Transformers and their variants fail to preserve layer-wise contextual information. Transformers usually project tokens onto sparse manifolds and fail to preserve mathematical equivalence among the token representations. In this work, we propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens. We propose a simple alternative to dot-product attention to ensure Lipschitz continuity. This allows TransJect to learn injective mappings to transform token representations to different manifolds with similar topology and preserve Euclidean distance between every pair of tokens in subsequent layers. Evaluations across multiple benchmark short- and long-sequence classification tasks show maximum improvements of 6.8% and 5.9%, respectively, over the variants of Transformers. Additionally, TransJect displays 79% better performance than Transformer on the language modeling task. We further highlight the shortcomings of multi-head self-attention from the statistical physics viewpoint. Although multi-head self-attention was incepted to learn different abstraction levels within the networks, our empirical analyses suggest that different attention heads learn randomly and unorderly. In contrast, TransJect adapts a mixture of experts for regularization; these experts are more orderly and balanced and learn different sparse representations from the input sequences. TransJect exhibits very low entropy and can be efficiently scaled to larger depths.
    摘要 基于多头自注意力的 Transformer 在多种学习任务中展现出潜力。尽管这些模型在理解序列的短期和长期上下文方面有显著改进,Transformer 及其变体的编码器却无法保留层级上下文信息。Transformer 通常将词元投影到稀疏流形上,无法保持词元表示之间的数学等价性。在本工作中,我们提出了 TransJect,一种能为词元对之间的层级距离保持提供理论保证的编码器模型。我们提出了点积注意力的一种简单替代方案,以确保 Lipschitz 连续性。这使得 TransJect 能够学习单射映射,将词元表示变换到拓扑相似的不同流形上,并在后续层中保持每对词元之间的欧氏距离。在多个短序列和长序列分类基准任务上的评估显示,相比 Transformer 的各种变体,最大提升分别为6.8%和5.9%。此外,TransJect 在语言建模任务上比 Transformer 表现好79%。我们还从统计物理的角度探讨了多头自注意力的不足。尽管多头自注意力的初衷是在网络内学习不同的抽象层次,我们的实证分析表明不同的注意力头是随机且无序地学习的。相比之下,TransJect 采用专家混合进行正则化;这些专家更有序、更均衡,并能从输入序列中学习不同的稀疏表示。TransJect 的熵非常低,并可以高效地扩展到更大的深度。

QA-NatVer: Question Answering for Natural Logic-based Fact Verification

  • paper_url: http://arxiv.org/abs/2310.14198
  • repo_url: None
  • paper_authors: Rami Aly, Marek Strong, Andreas Vlachos
  • for: 基于证据评估声明的真实性;忠实性(即生成能准确反映模型推理的解释)是设计时的重要考虑因素。
  • methods: 使用问答来预测自然逻辑算子,借助指令调优语言模型的泛化能力,无需标注训练数据。
  • results: 在 FEVER 的少样本设置中,我们的方法比最佳基线(包括一个最先进的预训练 seq2seq 自然逻辑系统和一个最先进的基于提示的分类器)提高了4.3个准确率点。我们的系统展现出稳健性和可移植性,在反事实数据集上取得有竞争力的表现,并在无需额外标注的情况下在丹麦语验证数据集上超越所有其他方法。人工评估表明,我们的方法生成的证明更合理,错误的自然逻辑算子更少。
    Abstract Fact verification systems assess a claim's veracity based on evidence. An important consideration in designing them is faithfulness, i.e. generating explanations that accurately reflect the reasoning of the model. Recent works have focused on natural logic, which operates directly on natural language by capturing the semantic relation of spans between an aligned claim with its evidence via set-theoretic operators. However, these approaches rely on substantial resources for training, which are only available for high-resource languages. To this end, we propose to use question answering to predict natural logic operators, taking advantage of the generalization capabilities of instruction-tuned language models. Thus, we obviate the need for annotated training data while still relying on a deterministic inference system. In a few-shot setting on FEVER, our approach outperforms the best baseline by $4.3$ accuracy points, including a state-of-the-art pre-trained seq2seq natural logic system, as well as a state-of-the-art prompt-based classifier. Our system demonstrates its robustness and portability, achieving competitive performance on a counterfactual dataset and surpassing all approaches without further annotation on a Danish verification dataset. A human evaluation indicates that our approach produces more plausible proofs with fewer erroneous natural logic operators than previous natural logic-based systems.
    摘要 事实核查系统基于证据评估一个声明的真实性。设计这类系统时的一个重要考虑是忠实性,即生成的解释能准确反映模型的推理过程。近期工作关注自然逻辑,它通过集合论算子刻画对齐后的声明与其证据之间文本片段的语义关系,直接在自然语言上进行运算。然而,这些方法依赖大量训练资源,而这些资源仅对高资源语言可用。为此,我们提出利用问答来预测自然逻辑算子,借助指令调优语言模型的泛化能力。这样,我们在保留确定性推理系统的同时,免除了对标注训练数据的需求。在 FEVER 的少样本设置中,我们的方法比最佳基线(包括一个最先进的预训练 seq2seq 自然逻辑系统和一个最先进的基于提示的分类器)提高了4.3个准确率点。我们的系统展现出稳健性和可移植性,在反事实数据集上取得有竞争力的表现,并在无需额外标注的情况下在丹麦语验证数据集上超越所有方法。人工评估表明,与以往基于自然逻辑的系统相比,我们的方法生成的证明更合理,且错误的自然逻辑算子更少。

An In-Context Schema Understanding Method for Knowledge Base Question Answering

  • paper_url: http://arxiv.org/abs/2310.14174
  • repo_url: None
  • paper_authors: Yantao Liu, Zixuan Li, Xiaolong Jin, Long Bai, Saiping Guan, Jiafeng Guo, Xueqi Cheng
  • for: 本研究旨在提高大语言模型在知识基础中问答任务中的表现,具体来说是通过增强大语言模型对知识库的schema理解来提高其作为semantic parser的能力。
  • methods: 本研究提出了一种名为上下文内模式理解(In-Context Schema Understanding, ICSU)的方法,该方法利用上下文学习机制,通过提供示例来指导大语言模型生成 SPARQL 查询。为了从标注的问题-查询对中检索合适的示例,ICSU 采用了四种不同的检索策略。
  • results: 实验结果表明,采用任一检索策略的 ICSU 都显著优于随机检索策略(准确率从12%提升到78.76%)。
    Abstract The Knowledge Base Question Answering (KBQA) task aims to answer natural language questions based on a given knowledge base. As a kind of common method for this task, semantic parsing-based ones first convert natural language questions to logical forms (e.g., SPARQL queries) and then execute them on knowledge bases to get answers. Recently, Large Language Models (LLMs) have shown strong abilities in language understanding and may be adopted as semantic parsers in such kinds of methods. However, in doing so, a great challenge for LLMs is to understand the schema of knowledge bases. Therefore, in this paper, we propose an In-Context Schema Understanding (ICSU) method for facilitating LLMs to be used as a semantic parser in KBQA. Specifically, ICSU adopts the In-context Learning mechanism to instruct LLMs to generate SPARQL queries with examples. In order to retrieve appropriate examples from annotated question-query pairs, which contain comprehensive schema information related to questions, ICSU explores four different retrieval strategies. Experimental results on the largest KBQA benchmark, KQA Pro, show that ICSU with all these strategies outperforms that with a random retrieval strategy significantly (from 12\% to 78.76\% in accuracy).
    摘要 知识库问答(KBQA)任务的目标是根据给定的知识库回答自然语言问题。一种常见的方法是基于语义解析:先将自然语言问题转化为逻辑形式(例如 SPARQL 查询),然后在知识库上执行以获取答案。最近,大语言模型(LLM)在语言理解方面表现出色,因此可以在这类方法中被用作语义解析器。然而,这样做时 LLM 面临的一大挑战是理解知识库的模式(schema)。因此,在这篇论文中,我们提出了一种上下文内模式理解(ICSU)方法,使 LLM 能够在 KBQA 中充当语义解析器。具体来说,ICSU 采用上下文学习机制,通过示例指导 LLM 生成 SPARQL 查询。为了从包含与问题相关的全面模式信息的标注问题-查询对中检索合适的示例,ICSU 探索了四种不同的检索策略。在最大的 KBQA 基准 KQA Pro 上的实验结果表明,采用这些策略的 ICSU 都显著优于随机检索策略(准确率从12%提升到78.76%)。
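A minimal sketch of the in-context idea: retrieve the most similar annotated (question, SPARQL) pairs and prepend them to the prompt for the LLM. The token-overlap retriever and the SPARQL strings below are hypothetical placeholders; ICSU's four retrieval strategies are more sophisticated than this.

```python
def jaccard(a, b):
    """Token-overlap similarity, used here as a stand-in retriever."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def build_icsu_prompt(question, exemplars, k=2):
    """Assemble an in-context prompt from the k most similar
    annotated (question, SPARQL) pairs."""
    ranked = sorted(exemplars, key=lambda ex: jaccard(question, ex[0]), reverse=True)
    lines = [f"Question: {q}\nSPARQL: {sparql}" for q, sparql in ranked[:k]]
    lines.append(f"Question: {question}\nSPARQL:")
    return "\n\n".join(lines)

# Illustrative exemplars; the queries are made-up placeholders, not real KB entries.
exemplars = [
    ("Who directed Titanic", "SELECT ?d WHERE { wd:Q44578 wdt:P57 ?d }"),
    ("What is the capital of France", "SELECT ?c WHERE { wd:Q142 wdt:P36 ?c }"),
]
prompt = build_icsu_prompt("Who directed Avatar", exemplars, k=1)
print(prompt)
```

The retrieved exemplars carry the schema information (relations, query shape) that the LLM would otherwise have to guess.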

Can Language Models Laugh at YouTube Short-form Videos?

  • paper_url: http://arxiv.org/abs/2310.14159
  • repo_url: https://github.com/dayoon-ko/exfuntube
  • paper_authors: Dayoon Ko, Sangho Lee, Gunhee Kim
  • for: 开发一个数据集和一种提示方法,以提升大语言模型(LLM)对社交媒体上搞笑视频的理解。
  • methods: 研究整理了 YouTube 上用户生成的10000个多模态搞笑视频,通过基于 GPT-3.5 的视频过滤管道验证促成幽默的语言与视觉元素,并为每个视频标注时间戳和文字解释。
  • results: 研究表明,使用零样本视频到文本提示可以有效提升 LLM 对视频幽默的理解,并通过三种评估方法(自动评分、解释质量实验和人工评估)得到了证明。
    Abstract As short-form funny videos on social networks are gaining popularity, it becomes demanding for AI models to understand them for better communication with humans. Unfortunately, previous video humor datasets target specific domains, such as speeches or sitcoms, and mostly focus on verbal cues. We curate a user-generated dataset of 10K multimodal funny videos from YouTube, called ExFunTube. Using a video filtering pipeline with GPT-3.5, we verify both verbal and visual elements contributing to humor. After filtering, we annotate each video with timestamps and text explanations for funny moments. Our ExFunTube is unique over existing datasets in that our videos cover a wide range of domains with various types of humor that necessitate a multimodal understanding of the content. Also, we develop a zero-shot video-to-text prompting to maximize video humor understanding of large language models (LLMs). With three different evaluation methods using automatic scores, rationale quality experiments, and human evaluations, we show that our prompting significantly improves LLMs' ability for humor explanation.
    摘要 随着社交网络上的短篇搞笑视频日益流行,AI 模型理解这些内容以便更好地与人类交流变得愈发重要。然而,以往的视频幽默数据集局限于特定领域(如演讲或情景喜剧),且主要关注语言线索。我们从 YouTube 上整理了一个包含1万个多模态搞笑视频的用户生成数据集,称为 ExFunTube。我们使用基于 GPT-3.5 的视频过滤管道,验证促成幽默的语言和视觉元素。过滤后,我们为每个视频标注搞笑片段的时间戳和文字解释。与现有数据集相比,ExFunTube 的独特之处在于其视频涵盖广泛的领域和多种需要对内容进行多模态理解的幽默类型。此外,我们还开发了一种零样本视频到文本提示方法,以最大化大语言模型(LLM)对视频幽默的理解。通过自动评分、解释质量实验和人工评估这三种不同的评估方式,我们表明该提示方法显著提升了 LLM 解释幽默的能力。

Orthogonal Subspace Learning for Language Model Continual Learning

  • paper_url: http://arxiv.org/abs/2310.14152
  • repo_url: None
  • paper_authors: Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, Xuanjing Huang
  • for: 这篇论文旨在解决语言模型在顺序学习多个任务时的灾难性遗忘问题。
  • methods: 本文提出了一种简单有效的方法,即正交低秩适应(O-LoRA),能够在学习新任务的同时有效缓解语言模型的灾难性遗忘。O-LoRA 在彼此正交的不同低秩向量子空间中学习各个任务,以最小化任务之间的干扰。
  • results: 实验结果显示,与现有方法相比,O-LoRA 在持续学习基准上表现更优,并能更好地保持语言模型对未见任务的泛化能力。此外,O-LoRA 只带来极少的额外参数开销,且不需要存储用户数据用于重放。
    Abstract Benefiting from massive corpora and advanced hardware, large language models (LLMs) exhibit remarkable capabilities in language understanding and generation. However, their performance degrades in scenarios where multiple tasks are encountered sequentially, also known as catastrophic forgetting. In this paper, we propose orthogonal low-rank adaptation (O-LoRA), a simple and efficient approach for continual learning in language models, effectively mitigating catastrophic forgetting while learning new tasks. Specifically, O-LoRA learns tasks in different (low-rank) vector subspaces that are kept orthogonal to each other in order to minimize interference. Our method induces only marginal additional parameter costs and requires no user data storage for replay. Experimental results on continual learning benchmarks show that our method outperforms state-of-the-art methods. Furthermore, compared to previous approaches, our method excels in preserving the generalization ability of LLMs on unseen tasks.
    摘要 得益于海量语料和先进硬件,大语言模型(LLM)在语言理解和生成方面展现出卓越能力。然而,在顺序遇到多个任务的场景下,其性能会下降,这一现象被称为灾难性遗忘。在这篇论文中,我们提出了正交低秩适应(O-LoRA),一种简单高效的语言模型持续学习方法,能在学习新任务的同时有效缓解灾难性遗忘。具体来说,O-LoRA 在彼此保持正交的不同(低秩)向量子空间中学习各个任务,以最小化相互干扰。我们的方法只带来极少的额外参数开销,且不需要存储用户数据用于重放。在持续学习基准上的实验结果表明,我们的方法优于最先进的方法。此外,与以往方法相比,我们的方法在保持 LLM 对未见任务的泛化能力方面尤为出色。
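The orthogonality constraint can be illustrated with a simple penalty on the cross-Gram matrix of two tasks' LoRA matrices: driving it to zero keeps the two low-rank updates in orthogonal subspaces, so the new task cannot overwrite directions used by the old one. This is a hedged sketch of the general idea, not O-LoRA's exact formulation.

```python
import numpy as np

def orthogonality_penalty(A_old, A_new):
    """Frobenius-norm penalty on the overlap between a previous task's
    LoRA subspace and the current task's. Zero means the two low-rank
    updates are orthogonal and do not interfere."""
    overlap = A_old @ A_new.T          # (r_old, r_new) cross-Gram matrix
    return float(np.sum(overlap ** 2))

# Two rank-1 adapters over a 4-dimensional input space.
A_task1 = np.array([[1.0, 0.0, 0.0, 0.0]])
A_task2 = np.array([[0.0, 1.0, 0.0, 0.0]])  # orthogonal to task 1
A_task3 = np.array([[1.0, 1.0, 0.0, 0.0]])  # overlaps with task 1

print(orthogonality_penalty(A_task1, A_task2))  # 0.0  -> no interference
print(orthogonality_penalty(A_task1, A_task3))  # 1.0  -> penalized overlap
```

In training, such a penalty would be added to the task loss while the frozen base weights and earlier adapters stay fixed.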

PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain

  • paper_url: http://arxiv.org/abs/2310.14151
  • repo_url: https://github.com/michael-wzhu/PromptCBLUE
  • paper_authors: Wei Zhu, Xiaoling Wang, Huanran Zheng, Mosha Chen, Buzhou Tang
  • for: The paper aims to evaluate Chinese language models (LLMs) for multi-task capabilities on a wide range of bio-medical tasks.
  • methods: The authors re-build the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark into a large-scale prompt-tuning benchmark, called PromptCBLUE, designed to cover medical entity recognition, medical text classification, medical natural language inference, medical dialogue understanding, and medical content/dialogue generation.
  • results: The authors experiment with fine-tuning 9 Chinese LLMs with different techniques and report the results.
    Abstract Biomedical language understanding benchmarks are the driving forces for artificial intelligence applications with large language model (LLM) back-ends. However, most current benchmarks: (a) are limited to English which makes it challenging to replicate many of the successes in English for other languages, or (b) focus on knowledge probing of LLMs and neglect to evaluate how LLMs apply these knowledge to perform on a wide range of bio-medical tasks, or (c) have become a publicly available corpus and are leaked to LLMs during pre-training. To facilitate the research in medical LLMs, we re-build the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark into a large scale prompt-tuning benchmark, PromptCBLUE. Our benchmark is a suitable test-bed and an online platform for evaluating Chinese LLMs' multi-task capabilities on a wide range bio-medical tasks including medical entity recognition, medical text classification, medical natural language inference, medical dialogue understanding and medical content/dialogue generation. To establish evaluation on these tasks, we have experimented and report the results with the current 9 Chinese LLMs fine-tuned with differtent fine-tuning techniques.
    摘要 生物医学语言理解基准是以大语言模型(LLM)为后端的人工智能应用的推动力。然而,现有的大多数基准存在以下局限:(a)仅限于英语,使得许多英语上的成功难以复制到其他语言;(b)侧重于探测 LLM 的知识,而忽略了评估 LLM 如何将这些知识应用到广泛的生物医学任务上;或(c)已成为公开语料,在预训练阶段泄露给了 LLM。为促进医学 LLM 的研究,我们将中文生物医学语言理解评估基准(CBLUE)重构为大规模的提示调整基准 PromptCBLUE。我们的基准是一个合适的试验平台和在线平台,用于评估中文 LLM 在广泛生物医学任务上的多任务能力,包括医学实体识别、医学文本分类、医学自然语言推理、医学对话理解以及医学内容/对话生成。为建立对这些任务的评估,我们用不同的微调技术对现有的9个中文 LLM 进行了实验,并报告了结果。

cs.LG - 2023-10-22

Diffusion-Model-Assisted Supervised Learning of Generative Models for Density Estimation

  • paper_url: http://arxiv.org/abs/2310.14458
  • repo_url: None
  • paper_authors: Yanfang Liu, Minglei Yang, Zezhong Zhang, Feng Bao, Yanzhao Cao, Guannan Zhang
  • for: 用于训练生成模型进行密度估计。
  • methods: 利用基于得分的扩散模型生成带标签数据,并提出一种免训练的得分估计方法;随后以监督方式训练一个简单的全连接神经网络来学习生成模型。
  • results: 提高了生成模型的训练效率和采样效率,且无需使用可逆神经网络或计算雅可比矩阵。
    Abstract We present a supervised learning framework of training generative models for density estimation. Generative models, including generative adversarial networks, normalizing flows, variational auto-encoders, are usually considered as unsupervised learning models, because labeled data are usually unavailable for training. Despite the success of the generative models, there are several issues with the unsupervised training, e.g., requirement of reversible architectures, vanishing gradients, and training instability. To enable supervised learning in generative models, we utilize the score-based diffusion model to generate labeled data. Unlike existing diffusion models that train neural networks to learn the score function, we develop a training-free score estimation method. This approach uses mini-batch-based Monte Carlo estimators to directly approximate the score function at any spatial-temporal location in solving an ordinary differential equation (ODE), corresponding to the reverse-time stochastic differential equation (SDE). This approach can offer both high accuracy and substantial time savings in neural network training. Once the labeled data are generated, we can train a simple fully connected neural network to learn the generative model in the supervised manner. Compared with existing normalizing flow models, our method does not require to use reversible neural networks and avoids the computation of the Jacobian matrix. Compared with existing diffusion models, our method does not need to solve the reverse-time SDE to generate new samples. As a result, the sampling efficiency is significantly improved. We demonstrate the performance of our method by applying it to a set of 2D datasets as well as real data from the UCI repository.
    摘要 我们提出了一种用于密度估计的生成模型监督学习框架。生成模型(包括生成对抗网络、标准化流和变分自编码器)通常被视为无监督学习模型,因为训练时往往没有带标签的数据。尽管生成模型取得了成功,无监督训练仍存在若干问题,例如需要可逆架构、梯度消失以及训练不稳定。为了在生成模型中实现监督学习,我们利用基于得分的扩散模型来生成带标签数据。与通过训练神经网络学习得分函数的现有扩散模型不同,我们提出了一种免训练的得分估计方法:使用基于小批量的蒙特卡洛估计器,在求解与逆时间随机微分方程(SDE)对应的常微分方程(ODE)时,直接逼近任意时空位置上的得分函数。该方法既能提供高精度,又能大幅节省神经网络训练时间。生成带标签数据后,我们即可以监督方式训练一个简单的全连接神经网络来学习生成模型。与现有的标准化流模型相比,我们的方法不需要使用可逆神经网络,也避免了雅可比矩阵的计算;与现有的扩散模型相比,我们的方法在生成新样本时无需求解逆时间 SDE,因而采样效率显著提高。我们在一组二维数据集以及来自 UCI 库的真实数据上验证了该方法的性能。
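The training-free score estimation can be illustrated on a fixed data batch: the score of a Gaussian-smoothed empirical distribution is a softmax-weighted average of (x_i − x)/σ², which is exactly what a mini-batch Monte Carlo estimator approximates. The sketch below omits the time dependence and the ODE solve of the actual method; the batch, σ, and test point are toy values.

```python
import numpy as np

def mc_score(x, data, sigma):
    """Training-free Monte Carlo estimate of the score of the smoothed
    data distribution p(x) = (1/N) * sum_i N(x; x_i, sigma^2 I).
    grad log p is a softmax-weighted average of (x_i - x) / sigma^2."""
    diffs = data - x                                    # (N, d)
    logw = -np.sum(diffs ** 2, axis=1) / (2 * sigma ** 2)
    w = np.exp(logw - logw.max())
    w /= w.sum()                                        # mini-batch importance weights
    return (w[:, None] * diffs).sum(axis=0) / sigma ** 2

# Sanity check: with a single data point the mixture is one Gaussian,
# whose score at x is exactly (x0 - x) / sigma^2.
x0 = np.array([[1.0, -2.0]])
x = np.array([0.0, 0.0])
sigma = 0.5
print(mc_score(x, x0, sigma))  # [ 4. -8.]
```

In the full method this estimate would be evaluated along the ODE trajectory to produce labeled pairs for supervised training.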

URegM: a unified prediction model of resource consumption for refactoring software smells in open source cloud

  • paper_url: http://arxiv.org/abs/2310.14444
  • repo_url: None
  • paper_authors: Asif Imran, Tevfik Kosar
  • for: 这篇论文是为了提高云计算内部过程资源利用率而写的。
  • methods: 这篇论文使用了一种名为“统一回归模型”(URegM)来预测代码异臭重构后对云资源的使用。
  • results: 实验结果表明,URegM可以准确预测代码异臭重构后对云资源的使用。这将帮助云服务提供商更好地规划资源分配和代码重构。
    Abstract The low cost and rapid provisioning capabilities have made the cloud a desirable platform to launch complex scientific applications. However, resource utilization optimization is a significant challenge for cloud service providers, since the earlier focus is provided on optimizing resources for the applications that run on the cloud, with a low emphasis being provided on optimizing resource utilization of the cloud computing internal processes. Code refactoring has been associated with improving the maintenance and understanding of software code. However, analyzing the impact of the refactoring source code of the cloud and studying its impact on cloud resource usage require further analysis. In this paper, we propose a framework called Unified Regression Modelling (URegM) which predicts the impact of code smell refactoring on cloud resource usage. We test our experiments in a real-life cloud environment using a complex scientific application as a workload. Results show that URegM is capable of accurately predicting resource consumption due to code smell refactoring. This will permit cloud service providers with advanced knowledge about the impact of refactoring code smells on resource consumption, thus allowing them to plan their resource provisioning and code refactoring more effectively.
    摘要 低成本和快速供给能力使云成为运行复杂科学应用的理想平台。然而,资源利用优化对云服务提供商而言是一个重大挑战:以往的工作侧重于优化运行在云上的应用的资源,而对优化云计算内部过程的资源利用关注较少。代码重构通常与提升软件代码的可维护性和可理解性相关联,但分析对云端源代码进行重构的影响及其对云资源使用的作用仍需进一步研究。本文提出一个名为统一回归建模(URegM)的框架,用于预测代码坏味重构对云资源使用的影响。我们以一个复杂的科学应用作为负载,在真实的云环境中进行实验。结果表明,URegM 能够准确预测代码坏味重构引起的资源消耗。这将使云服务提供商提前了解重构代码坏味对资源消耗的影响,从而更有效地规划资源供给和代码重构。
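URegM's internals are not specified in this summary, so the sketch below only illustrates the general shape of such a prediction model: a multi-output linear regression from hypothetical smell-refactoring counts to relative resource-usage changes. The feature names and all numbers are made up for illustration and are not from the paper.

```python
import numpy as np

# Hypothetical feature matrix: each row is one refactoring commit, columns are
# counts of refactored smell types (e.g. long method, god class, feature envy).
X = np.array([[3, 1, 0],
              [0, 2, 1],
              [5, 0, 2],
              [1, 1, 1]], dtype=float)
# Observed relative change in [CPU, memory] use after each refactoring
# (made-up numbers for illustration only).
Y = np.array([[-0.12, -0.05],
              [-0.04, -0.08],
              [-0.20, -0.02],
              [-0.07, -0.06]])

# Fit one linear model per resource with ordinary least squares,
# including an intercept column.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(Xb, Y, rcond=None)

new_commit = np.array([2, 1, 1, 1.0])   # features of a planned refactoring + bias
print(coef.T @ new_commit)              # predicted [CPU, memory] change
```

A provider could use such predictions to decide which smells are worth refactoring before provisioning resources.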

EDGE++: Improved Training and Sampling of EDGE

  • paper_url: http://arxiv.org/abs/2310.14441
  • repo_url: https://github.com/himanshub1007/Alzhimers-Disease-Prediction-Using-Deep-learning
  • paper_authors: Mingyang Wu, Xiaohui Chen, Li-Ping Liu
  • for: 改进现有的基于扩散的方法,以提高大型网络生成的效率和生成质量。
  • methods: 本文提出了对 EDGE 模型的两个改进:一是采用按度数区分的噪声调度,优化每个时间步的活跃节点数量,从而减少内存消耗;二是提出一种改进的采样方案,以更好地控制生成网络与真实网络之间的相似性。
  • results: 实验结果表明,所提出的改进不仅提升了生成效率,也提高了生成图的质量,为大型网络生成任务提供了一个稳健且可扩展的解决方案。
    Abstract Recently developed deep neural models like NetGAN, CELL, and Variational Graph Autoencoders have made progress but face limitations in replicating key graph statistics on generating large graphs. Diffusion-based methods have emerged as promising alternatives, however, most of them present challenges in computational efficiency and generative performance. EDGE is effective at modeling large networks, but its current denoising approach can be inefficient, often leading to wasted computational resources and potential mismatches in its generation process. In this paper, we propose enhancements to the EDGE model to address these issues. Specifically, we introduce a degree-specific noise schedule that optimizes the number of active nodes at each timestep, significantly reducing memory consumption. Additionally, we present an improved sampling scheme that fine-tunes the generative process, allowing for better control over the similarity between the synthesized and the true network. Our experimental results demonstrate that the proposed modifications not only improve the efficiency but also enhance the accuracy of the generated graphs, offering a robust and scalable solution for graph generation tasks.
    摘要 近期发展的深度神经网络模型(如 NetGAN、CELL 和变分图自编码器)已取得进展,但在生成大图时难以复现关键的图统计特性。基于扩散的方法已成为有前景的替代方案,然而其中大多数在计算效率和生成性能方面仍存在挑战。EDGE 模型能够有效建模大型网络,但其当前的去噪方式可能效率低下,常导致计算资源浪费以及生成过程中的潜在不匹配。在这篇论文中,我们提出了对 EDGE 模型的改进以解决这些问题。具体而言,我们引入按度数区分的噪声调度,优化每个时间步的活跃节点数量,显著减少内存消耗。此外,我们提出一种改进的采样方案,对生成过程进行微调,从而更好地控制合成网络与真实网络之间的相似性。实验结果表明,所提出的改进不仅提升了效率,也提高了生成图的准确性,为图生成任务提供了一个稳健且可扩展的解决方案。

Fairness-aware Optimal Graph Filter Design

  • paper_url: http://arxiv.org/abs/2310.14432
  • repo_url: None
  • paper_authors: O. Deniz Kose, Yanning Shen, Gonzalo Mateos
  • for: 本文研究基于图的学习中存在的偏见问题,并提出了一种基于图信号处理的偏见缓解方法。
  • methods: 本文使用图滤波器来降低敏感属性与底层图连通性之间的相关性,并通过在图谱域中求解凸优化问题得到最优的滤波器设计。
  • results: 实验表明,与现有的公平感知基线方法相比,所提方法在效用相近的情况下取得了更好的公平性指标。
    Abstract Graphs are mathematical tools that can be used to represent complex real-world interconnected systems, such as financial markets and social networks. Hence, machine learning (ML) over graphs has attracted significant attention recently. However, it has been demonstrated that ML over graphs amplifies the already existing bias towards certain under-represented groups in various decision-making problems due to the information aggregation over biased graph structures. Faced with this challenge, here we take a fresh look at the problem of bias mitigation in graph-based learning by borrowing insights from graph signal processing. Our idea is to introduce predesigned graph filters within an ML pipeline to reduce a novel unsupervised bias measure, namely the correlation between sensitive attributes and the underlying graph connectivity. We show that the optimal design of said filters can be cast as a convex problem in the graph spectral domain. We also formulate a linear programming (LP) problem informed by a theoretical bias analysis, which attains a closed-form solution and leads to a more efficient fairness-aware graph filter. Finally, for a design whose degrees of freedom are independent of the input graph size, we minimize the bias metric over the family of polynomial graph convolutional filters. Our optimal filter designs offer complementary strengths to explore favorable fairness-utility-complexity tradeoffs. For performance evaluation, we conduct extensive and reproducible node classification experiments over real-world networks. Our results show that the proposed framework leads to better fairness measures together with similar utility compared to state-of-the-art fairness-aware baselines.
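The polynomial filter family discussed above, $H(L) = \sum_k h_k L^k$, has degrees of freedom (the coefficients $h_k$) that are independent of the input graph size. A minimal sketch of applying such a filter to a graph signal follows; the toy graph and coefficients are illustrative, not values from the paper:

```python
import numpy as np

def polynomial_graph_filter(L, coeffs, x):
    """Apply H(L) = sum_k h_k L^k to a graph signal x.
    The number of coefficients is independent of the graph size,
    matching the polynomial filter family discussed above."""
    out = np.zeros_like(x, dtype=float)
    Lk = np.eye(L.shape[0])          # L^0 = identity
    for h in coeffs:
        out += h * (Lk @ x)
        Lk = Lk @ L                  # advance to the next power of L
    return out

# Path graph on 3 nodes: Laplacian L = D - A.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A
x = np.array([1.0, 0.0, 0.0])
print(polynomial_graph_filter(L, [1.0, -0.5], x))  # low-pass-like smoothing of x
```

With coefficients `[1.0, -0.5]` this is $H = I - 0.5L$, which smooths the impulse signal over the path graph.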

Clustering Students Based on Gamification User Types and Learning Styles

  • paper_url: http://arxiv.org/abs/2310.14430
  • repo_url: None
  • paper_authors: Emre Arslan, Atilla Özkaymak, Nesrin Özdener Dönmez
  • for: clustering students according to their gamification user types and learning styles
  • methods: K-means algorithm and Gamification User Type Hexad Scale, Grasha-Riechmann Student Learning Style Scale
  • results: neutral results with a Silhouette coefficient of 0.12, indicating that the clustering is not satisfactory.
    Abstract The aim of this study is to cluster students according to their gamification user types and learning styles, providing instructors with a new perspective for grouping students in cases where clustering cannot be done by hand because the data contain multiple scales. The data consist of 251 students enrolled at a Turkish state university. The K-means algorithm was used for clustering. To determine the students' gamification user types and learning styles, the Gamification User Type Hexad Scale and the Grasha-Riechmann Student Learning Style Scale were used, respectively. The Silhouette coefficient served as the clustering quality measure. After fitting the algorithm in several ways, the highest Silhouette coefficient obtained was 0.12, meaning that the results are neutral but not satisfactory. All statistical operations and data visualizations were performed with the Python programming language.
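The Silhouette coefficient the study uses as its quality measure can be sketched in a few lines: for each point, `a` is the mean distance to its own cluster, `b` the lowest mean distance to another cluster, and the score is `(b - a) / max(a, b)`. The toy points and labels below are illustrative, not the study's data:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette(points, labels):
    """Mean silhouette coefficient. Values near 1 indicate well-separated
    clusters; values near 0 (like the study's 0.12) indicate clusters
    that barely separate."""
    clusters = {c: [p for p, m in zip(points, labels) if m == c]
                for c in set(labels)}
    scores = []
    for p, c in zip(points, labels):
        own = [q for q in clusters[c] if q is not p]
        if not own:                      # singleton cluster: score is 0
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / len(own)
        b = min(sum(dist(p, q) for q in grp) / len(grp)
                for m, grp in clusters.items() if m != c)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two clearly separated clusters score close to 1.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(round(silhouette(pts, [0, 0, 0, 1, 1, 1]), 3))
```

In practice one would use a library implementation (e.g., scikit-learn's `silhouette_score`); the sketch only makes the metric's definition concrete.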

A Quadratic Synchronization Rule for Distributed Deep Learning

  • paper_url: http://arxiv.org/abs/2310.14423
  • repo_url: https://github.com/hmgxr128/qsr
  • paper_authors: Xinran Gu, Kaifeng Lyu, Sanjeev Arora, Jingzhao Zhang, Longbo Huang
  • for: Reducing the communication overhead of synchronizing gradients in distributed deep learning, which grows severe when many nodes jointly train large models.
  • methods: Proposes the theory-grounded Quadratic Synchronization Rule (QSR), which dynamically sets the synchronization interval $H$ in proportion to $\frac{1}{\eta^2}$ as the learning rate $\eta$ decays, improving generalization.
  • results: On ImageNet with ResNet and ViT, local gradient methods with QSR consistently improve test accuracy over other synchronization strategies; with Local AdamW on ViT-B, QSR cuts training time on 16 or 64 GPUs while achieving higher top-1 validation accuracy.
    Abstract In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by allowing workers to compute locally for $H$ steps without synchronizing with others, hence reducing communication frequency. While $H$ has been viewed as a hyperparameter to trade optimization efficiency for communication cost, recent research indicates that setting a proper $H$ value can lead to generalization improvement. Yet, selecting a proper $H$ is elusive. This work proposes a theory-grounded method for determining $H$, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting $H$ in proportion to $\frac{1}{\eta^2}$ as the learning rate $\eta$ decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies. Compared with the standard data parallel training, QSR enables Local AdamW on ViT-B to cut the training time on 16 or 64 GPUs down from 26.7 to 20.2 hours or from 8.6 to 5.5 hours and, at the same time, achieves $1.16\%$ or $0.84\%$ higher top-1 validation accuracy.

Data Augmentation: a Combined Inductive-Deductive Approach featuring Answer Set Programming

  • paper_url: http://arxiv.org/abs/2310.14413
  • repo_url: None
  • paper_authors: Pierangela Bruno, Francesco Calimeri, Cinzia Marte, Simona Perri
  • for: Building a hybrid inductive-deductive data augmentation framework that starts from a limited set of real labeled images and guarantees that newly synthesized images comply with domain-knowledge constraints and specific desiderata.
  • methods: Logic programs declaratively specify the structure of new images, and a dedicated Deep Learning process turns the generated labels into photo-realistic images.
  • results: The framework synthesizes labeled, photo-realistic images that satisfy both the domain constraints and the stated desiderata, alleviating the problem of limited and imbalanced datasets.
    Abstract Although the availability of a large amount of data is usually given for granted, there are relevant scenarios where this is not the case; for instance, in the biomedical/healthcare domain, some applications require to build huge datasets of proper images, but the acquisition of such images is often hard for different reasons (e.g., accessibility, costs, pathology-related variability), thus causing limited and usually imbalanced datasets. Hence, the need for synthesizing photo-realistic images via advanced Data Augmentation techniques is crucial. In this paper we propose a hybrid inductive-deductive approach to the problem; in particular, starting from a limited set of real labeled images, the proposed framework makes use of logic programs for declaratively specifying the structure of new images, that is guaranteed to comply with both a set of constraints coming from the domain knowledge and some specific desiderata. The resulting labeled images undergo a dedicated process based on Deep Learning in charge of creating photo-realistic images that comply with the generated label.

Universal representation by Boltzmann machines with Regularised Axons

  • paper_url: http://arxiv.org/abs/2310.14395
  • repo_url: None
  • paper_authors: Przemysław R. Grzybowski, Antoni Jankiewicz, Eloy Piñol, David Cirauqui, Dorota H. Grzybowska, Paweł M. Petrykowski, Miguel Ángel García-March, Maciej Lewenstein, Gorka Muñoz-Gil, Alejandro Pozas-Kerstjens
  • for: Regularising the connections of Boltzmann machines so that sampling and training become efficient.
  • methods: A regularisation of Boltzmann machine connections that controls the model's energy landscape, paving the way for efficient sampling and training.
  • results: Regularised Boltzmann machines provably retain the ability to represent arbitrary distributions while the number of energy local minima is controlled, enabling easy guided sampling and training; they can also store exponentially many arbitrarily correlated visible patterns with perfect retrieval, connecting them to Dense Associative Memory networks.
    Abstract It is widely known that Boltzmann machines are capable of representing arbitrary probability distributions over the values of their visible neurons, given enough hidden ones. However, sampling -- and thus training -- these models can be numerically hard. Recently we proposed a regularisation of the connections of Boltzmann machines, in order to control the energy landscape of the model, paving a way for efficient sampling and training. Here we formally prove that such regularised Boltzmann machines preserve the ability to represent arbitrary distributions. This is in conjunction with controlling the number of energy local minima, thus enabling easy \emph{guided} sampling and training. Furthermore, we explicitly show that regularised Boltzmann machines can store exponentially many arbitrarily correlated visible patterns with perfect retrieval, and we connect them to the Dense Associative Memory networks.

A global product of fine-scale urban building height based on spaceborne lidar

  • paper_url: http://arxiv.org/abs/2310.14355
  • repo_url: None
  • paper_authors: Xiao Ma, Guang Zheng, Chi Xu, L. Monika Moskal, Peng Gong, Qinghua Guo, Huabing Huang, Xuecao Li, Yong Pang, Cheng Wang, Huan Xie, Bailang Yu, Bo Zhao, Yuyu Zhou
  • for: Providing a global product of urban building heights with fine spatial resolution and global coverage, which is essential for achieving the UN's Sustainable Development Goals (SDGs) and for supporting future urban studies.
  • methods: The spaceborne lidar instrument GEDI is combined with multi-sourced data, including remotely sensed images (Landsat-8, Sentinel-2, and Sentinel-1) and topographic data, to produce a global product of urban building heights on a fine 150 m grid around 2020.
  • results: The building height estimates derived from GEDI data were effective, with a Pearson's r of 0.78 and an RMSE of 3.67 m against reference data; the mapping product also performed well, with a Pearson's r of 0.71 and an RMSE of 4.60 m. The map offers a higher spatial resolution (150 m) with greater inherent detail about spatial heterogeneity and the flexibility of updating with GEDI samples as inputs.
    Abstract Characterizing urban environments with broad coverages and high precision is more important than ever for achieving the UN's Sustainable Development Goals (SDGs) as half of the world's populations are living in cities. Urban building height as a fundamental 3D urban structural feature has far-reaching applications. However, so far, producing readily available datasets of recent urban building heights with fine spatial resolutions and global coverages remains a challenging task. Here, we provide an up-to-date global product of urban building heights based on a fine grid size of 150 m around 2020 by combining the spaceborne lidar instrument of GEDI and multi-sourced data including remotely sensed images (i.e., Landsat-8, Sentinel-2, and Sentinel-1) and topographic data. Our results revealed that the estimated method of building height samples based on the GEDI data was effective with 0.78 of Pearson's r and 3.67 m of RMSE in comparison to the reference data. The mapping product also demonstrated good performance as indicated by its strong correlation with the reference data (i.e., Pearson's r = 0.71, RMSE = 4.60 m). Compared with the currently existing products, our global urban building height map holds the ability to provide a higher spatial resolution (i.e., 150 m) with a great level of inherent details about the spatial heterogeneity and flexibility of updating using the GEDI samples as inputs. This work will boost future urban studies across many fields including climate, environmental, ecological, and social sciences.

Can strong structural encoding reduce the importance of Message Passing?

  • paper_url: http://arxiv.org/abs/2310.15197
  • repo_url: None
  • paper_authors: Floor Eijkelboom, Erik Bekkers, Michael Bronstein, Francesco Di Giovanni
  • for: Studies message passing neural networks (MPNNs) on graph data, in particular how feature information and structural information are combined when learning node representations.
  • methods: Proposes modeling the interaction between feature and structural information via their tensor product rather than the standard concatenation, and compares the two choices.
  • results: Tensor-based encodings are always at least on par with concatenation-based ones, and on some tasks the message-passing layers can be reduced or removed entirely with almost no drop in performance, suggesting that message passing matters less once the model can construct strong structural encodings.
    Abstract The most prevalent class of neural networks operating on graphs are message passing neural networks (MPNNs), in which the representation of a node is updated iteratively by aggregating information in the 1-hop neighborhood. Since this paradigm for computing node embeddings may prevent the model from learning coarse topological structures, the initial features are often augmented with structural information of the graph, typically in the form of Laplacian eigenvectors or Random Walk transition probabilities. In this work, we explore the contribution of message passing when strong structural encodings are provided. We introduce a novel way of modeling the interaction between feature and structural information based on their tensor product rather than the standard concatenation. The choice of interaction is compared in common scenarios and in settings where the capacity of the message-passing layer is severely reduced and ultimately the message-passing phase is removed altogether. Our results indicate that using tensor-based encodings is always at least on par with the concatenation-based encoding and that it makes the model much more robust when the message passing layers are removed, on some tasks incurring almost no drop in performance. This suggests that the importance of message passing is limited when the model can construct strong structural encodings.
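The core design choice — combining a node's features with its structural encoding by tensor (outer) product instead of concatenation — can be illustrated in a few lines. Shapes and values below are made up for illustration:

```python
import numpy as np

# Node features x and structural encoding s (e.g., Laplacian eigenvector
# entries or random-walk probabilities). Concatenation keeps them side by
# side; the outer product lets every feature dimension interact
# multiplicatively with every structural dimension.
x = np.array([0.5, -1.0, 2.0])        # node features, dim 3
s = np.array([0.1, 0.7])              # structural encoding, dim 2

concat = np.concatenate([x, s])       # dim 3 + 2 = 5, additive composition
tensor = np.outer(x, s).flatten()     # dim 3 * 2 = 6, multiplicative interactions

print(concat.shape, tensor.shape)     # (5,) (6,)
```

The tensor encoding grows as the product of the two dimensions rather than their sum, which is the price paid for the richer feature-structure interactions.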

Pyramidal Hidden Markov Model For Multivariate Time Series Forecasting

  • paper_url: http://arxiv.org/abs/2310.14341
  • repo_url: None
  • paper_authors: YeXin Huang
  • for: Forecasting future values of multivariate time series.
  • methods: Proposes the Pyramidal Hidden Markov Model (PHMM), which first extracts short multistep stochastic states with a multistep HMM and then uses pyramid-like stacking to adaptively identify long multistep stochastic states.
  • results: Experiments on diverse multivariate time series datasets show that PHMM outperforms its competitors, handling non-stationary and noisy data while establishing long-term dependencies for more accurate and comprehensive forecasting.
    Abstract The Hidden Markov Model (HMM) can predict the future value of a time series based on its current and previous values, making it a powerful algorithm for handling various types of time series. Numerous studies have explored the improvement of HMM using advanced techniques, leading to the development of several variations of HMM. Despite these studies indicating the increased competitiveness of HMM compared to other advanced algorithms, few have recognized the significance and impact of incorporating multistep stochastic states into its performance. In this work, we propose a Pyramidal Hidden Markov Model (PHMM) that can capture multiple multistep stochastic states. Initially, a multistep HMM is designed for extracting short multistep stochastic states. Next, a novel time series forecasting structure is proposed based on PHMM, which utilizes pyramid-like stacking to adaptively identify long multistep stochastic states. By employing these two schemes, our model can effectively handle non-stationary and noisy data, while also establishing long-term dependencies for more accurate and comprehensive forecasting. The experimental results on diverse multivariate time series datasets convincingly demonstrate the superior performance of our proposed PHMM compared to its competitive peers in time series forecasting.

PPFL: A Personalized Federated Learning Framework for Heterogeneous Population

  • paper_url: http://arxiv.org/abs/2310.14337
  • repo_url: None
  • paper_authors: Hao Di, Yi Yang, Haishan Ye, Xiangyu Chang
  • for: Developing a personalized Federated Learning framework that adapts to individual preferences while protecting privacy.
  • methods: Canonical models capture the fundamental characteristics of the heterogeneous population, and membership vectors express each client's preferences over these characteristics; the resulting non-convex constrained problem is solved with a novel random block coordinate descent algorithm.
  • results: PPFL provides substantial insights into client characteristics and compares favorably with existing personalized Federated Learning methods on both pathological and practical datasets.
    Abstract Personalization aims to characterize individual preferences and is widely applied across many fields. However, conventional personalized methods operate in a centralized manner and potentially expose the raw data when pooling individual information. In this paper, with privacy considerations, we develop a flexible and interpretable personalized framework within the paradigm of Federated Learning, called PPFL (Population Personalized Federated Learning). By leveraging canonical models to capture fundamental characteristics among the heterogeneous population and employing membership vectors to reveal clients' preferences, it models the heterogeneity as clients' varying preferences for these characteristics and provides substantial insights into client characteristics, which is lacking in existing Personalized Federated Learning (PFL) methods. Furthermore, we explore the relationship between our method and three main branches of PFL methods: multi-task PFL, clustered FL, and decoupling PFL, and demonstrate the advantages of PPFL. To solve PPFL (a non-convex constrained optimization problem), we propose a novel random block coordinate descent algorithm and present the convergence property. We conduct experiments on both pathological and practical datasets, and the results validate the effectiveness of PPFL.

Finite-Sample Analysis of the Temporal Difference Learning

  • paper_url: http://arxiv.org/abs/2310.14286
  • repo_url: None
  • paper_authors: Sergey Samsonov, Daniil Tiapkin, Alexey Naumov, Eric Moulines
  • for: Obtaining sharp finite-sample performance bounds for temporal difference (TD) methods with linear function approximation for policy evaluation in discounted Markov Decision Processes (MDPs).
  • methods: A simple algorithm with a universal, instance-independent step size combined with Polyak-Ruppert tail averaging.
  • results: Near-optimal variance and bias terms together with the corresponding sample complexity bounds; the proof relies on refined error bounds for linear stochastic approximation and a novel stability result for products of random matrices arising from the TD-type recurrence.
    Abstract In this paper we consider the problem of obtaining sharp bounds for the performance of temporal difference (TD) methods with linear functional approximation for policy evaluation in discounted Markov Decision Processes. We show that a simple algorithm with a universal and instance-independent step size together with Polyak-Ruppert tail averaging is sufficient to obtain near-optimal variance and bias terms. We also provide the respective sample complexity bounds. Our proof technique is based on refined error bounds for linear stochastic approximation together with the novel stability result for the product of random matrices that arise from the TD-type recurrence.
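The scheme the paper analyzes — TD learning with linear function approximation, a universal constant step size, and Polyak-Ruppert tail averaging — can be sketched on a toy Markov reward process. The chain, rewards, features, and step size below are illustrative assumptions, not the paper's setting:

```python
import random

def td0_tail_averaged(steps=20000, eta=0.05, gamma=0.9, seed=0):
    """TD(0) with linear function approximation on a toy 2-state Markov
    reward process with uniform transitions. Uses a constant
    (instance-independent) step size and Polyak-Ruppert tail averaging
    over the last half of the iterates."""
    rng = random.Random(seed)
    reward = {0: 1.0, 1: 0.0}
    # One-hot features make the linear approximation exact (tabular).
    w, tail_sum, tail_n = [0.0, 0.0], [0.0, 0.0], 0
    s = 0
    for t in range(steps):
        s_next = rng.choice([0, 1])        # uniform transitions
        delta = reward[s] + gamma * w[s_next] - w[s]   # TD error
        w[s] += eta * delta                # one-hot phi: update w[s] only
        if t >= steps // 2:                # Polyak-Ruppert tail averaging
            tail_sum = [a + b for a, b in zip(tail_sum, w)]
            tail_n += 1
        s = s_next
    return [a / tail_n for a in tail_sum]

print(td0_tail_averaged())  # close to the true values V = [5.5, 4.5]
```

For this chain the true values solve $V(s) = r(s) + \gamma \cdot \tfrac{1}{2}(V(0)+V(1))$, giving $V = (5.5, 4.5)$; the averaged iterates land near them despite the constant step size, which is the point of tail averaging.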

Robust Visual Imitation Learning with Inverse Dynamics Representations

  • paper_url: http://arxiv.org/abs/2310.14274
  • repo_url: None
  • paper_authors: Siyuan Li, Xun Wang, Rongchang Zuo, Kewu Sun, Lingfei Cui, Jishiyu Ding, Peng Liu, Zhe Ma
  • for: Solving complex sequential decision-making problems via imitation learning when the learning environment differs slightly from the environment in which expert data was collected.
  • methods: An inverse dynamics state representation learning objective aligns the learning environment with the expert environment, and a reward function measures the similarity between behavior data and expert data both element-wise and at the trajectory level.
  • results: Near-expert performance under various visual perturbations and across diverse visual control tasks, significantly outperforming state-of-the-art visual IL and robust IL methods.
    Abstract Imitation learning (IL) has achieved considerable success in solving complex sequential decision-making problems. However, current IL methods mainly assume that the environment for learning policies is the same as the environment for collecting expert datasets. Therefore, these methods may fail to work when there are slight differences between the learning and expert environments, especially for challenging problems with high-dimensional image observations. However, in real-world scenarios, it is rare to have the chance to collect expert trajectories precisely in the target learning environment. To address this challenge, we propose a novel robust imitation learning approach, where we develop an inverse dynamics state representation learning objective to align the expert environment and the learning environment. With the abstract state representation, we design an effective reward function, which thoroughly measures the similarity between behavior data and expert data not only element-wise, but also from the trajectory level. We conduct extensive experiments to evaluate the proposed approach under various visual perturbations and in diverse visual control tasks. Our approach can achieve a near-expert performance in most environments, and significantly outperforms the state-of-the-art visual IL methods and robust IL methods.

Shortcuts for causal discovery of nonlinear models by score matching

  • paper_url: http://arxiv.org/abs/2310.14246
  • repo_url: None
  • paper_authors: Francesco Montagna, Nicoletta Noceti, Lorenzo Rosasco, Francesco Locatello
  • for: Studying causal discovery from nonlinear data and formally characterizing the score-sortability pattern exploited by the ScoreSort algorithm in nonlinear additive noise models.
  • methods: Theoretical analysis of nonlinear additive noise models together with experiments on simulated data, comparing ScoreSort against prior score matching-based methods.
  • results: ScoreSort is statistically more efficient than prior state-of-the-art score matching-based approaches, and the most common synthetic benchmarks in the literature are score-sortable; the study also highlights the lack of diversity in evaluation data, the need to test varied settings within a problem class, and the importance of analyzing statistical properties in causal discovery.
    Abstract The use of simulated data in the field of causal discovery is ubiquitous due to the scarcity of annotated real data. Recently, Reisach et al., 2021 highlighted the emergence of patterns in simulated linear data, which displays increasing marginal variance in the causal direction. As an ablation in their experiments, Montagna et al., 2023 found that similar patterns may emerge in nonlinear models for the variance of the score vector $\nabla \log p_{\mathbf{X}}$, and introduced the ScoreSort algorithm. In this work, we formally define and characterize this score-sortability pattern of nonlinear additive noise models. We find that it defines a class of identifiable (bivariate) causal models overlapping with nonlinear additive noise models. We theoretically demonstrate the advantages of ScoreSort in terms of statistical efficiency compared to prior state-of-the-art score matching-based methods and empirically show the score-sortability of the most common synthetic benchmarks in the literature. Our findings remark (1) the lack of diversity in the data as an important limitation in the evaluation of nonlinear causal discovery approaches, (2) the importance of thoroughly testing different settings within a problem class, and (3) the importance of analyzing statistical properties in causal discovery, where research is often limited to defining identifiability conditions of the model.

Revisiting Deep Ensemble for Out-of-Distribution Detection: A Loss Landscape Perspective

  • paper_url: http://arxiv.org/abs/2310.14227
  • repo_url: https://github.com/fanghenshaometeor/ood-mode-ensemble
  • paper_authors: Kun Fang, Qinghua Tao, Xiaolin Huang, Jie Yang
  • for: Investigating Out-of-Distribution (OoD) detection in deep neural networks from a loss landscape perspective.
  • methods: Examines OoD detection across multiple independent modes (local optima in parameter space) and revisits deep ensembles through mode ensemble to improve detection.
  • results: Experiments reveal large variances in detection performance across independent modes, while mode ensemble reduces these variances and makes OoD detection more reliable.
    Abstract Existing Out-of-Distribution (OoD) detection methods address to detect OoD samples from In-Distribution data (InD) mainly by exploring differences in features, logits and gradients in Deep Neural Networks (DNNs). We in this work propose a new perspective upon loss landscape and mode ensemble to investigate OoD detection. In the optimization of DNNs, there exist many local optima in the parameter space, or namely modes. Interestingly, we observe that these independent modes, which all reach low-loss regions with InD data (training and test data), yet yield significantly different loss landscapes with OoD data. Such an observation provides a novel view to investigate the OoD detection from the loss landscape and further suggests significantly fluctuating OoD detection performance across these modes. For instance, FPR values of the RankFeat method can range from 46.58% to 84.70% among 5 modes, showing uncertain detection performance evaluations across independent modes. Motivated by such diversities on OoD loss landscape across modes, we revisit the deep ensemble method for OoD detection through mode ensemble, leading to improved performance and benefiting the OoD detector with reduced variances. Extensive experiments covering varied OoD detectors and network structures illustrate high variances across modes and also validate the superiority of mode ensemble in boosting OoD detection. We hope this work could attract attention in the view of independent modes in the OoD loss landscape and more reliable evaluations on OoD detectors.
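A minimal sketch of mode ensembling, assuming each independently trained mode produces a scalar OoD score per input (the scores below are made up for illustration):

```python
import numpy as np

def mode_ensemble_ood_score(scores_per_mode):
    """Average an OoD score (e.g., max softmax probability or an energy
    score) over independently trained modes. Averaging damps the large
    per-mode variance the paper reports (FPR swinging from 46.58% to
    84.70% across 5 modes for RankFeat)."""
    return np.mean(np.asarray(scores_per_mode, dtype=float), axis=0)

# Three modes disagree on two inputs; the ensemble score is steadier.
scores = [[0.9, 0.2], [0.6, 0.5], [0.9, 0.1]]
print(mode_ensemble_ood_score(scores))
```

The same one-liner applies regardless of which per-mode OoD scoring function is plugged in; the paper's contribution is showing *why* averaging across modes helps, via the loss landscape.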

SUT: Active Defects Probing for Transcompiler Models

  • paper_url: http://arxiv.org/abs/2310.14209
  • repo_url: None
  • paper_authors: Mengnan Qi, Yufan Huang, Maoquan Wang, Yongqiang Yao, Zihan Liu, Bin Gu, Colin Clement, Neel Sundaresan
  • for: Proposing new evaluation metrics for program translation that expose the elementary syntax errors current transcompiler models make, particularly when the target language lacks syntax elements present in the source language.
  • methods: A novel active defects probing suite, Syntactic Unit Tests (SUT), including a highly interpretable evaluation harness for accuracy and test scoring.
  • results: Even powerful models such as ChatGPT still fail these basic unit tests; compared with previous program translation evaluation datasets, the pass rate drops by 26.15%, and the harness reveals the syntactic elements on which these models are deficient.
    Abstract Automatic Program translation has enormous application value and hence has been attracting significant interest from AI researchers. However, we observe that current program translation models still make elementary syntax errors, particularly when the target language does not have syntax elements in the source language. Metrics like BLEU, CodeBLEU and computation accuracy may not expose these issues. In this paper we introduce new metrics for programming language translation and these metrics address these basic syntax errors. We develop a novel active defects probing suite called Syntactic Unit Tests (SUT) which includes a highly interpretable evaluation harness for accuracy and test scoring. Experiments have shown that even powerful models like ChatGPT still make mistakes on these basic unit tests. Specifically, compared to previous program translation task evaluation datasets, its pass rate on our unit tests has decreased by 26.15%. Further, our evaluation harness reveals syntactic element errors in which these models exhibit deficiencies.
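In the spirit of syntactic probing, a minimal (hypothetical) check that a translated program at least parses in the target language can be written with Python's built-in `compile()`. Real SUT tests probe specific syntax elements with an interpretable scoring harness; this is only an illustration of the idea:

```python
def passes_syntax_unit_test(translated_code: str) -> bool:
    """Toy syntactic probe: the translated program must at least parse
    in the target language (Python here). compile() with mode 'exec'
    parses without executing anything."""
    try:
        compile(translated_code, "<translated>", "exec")
        return True
    except SyntaxError:
        return False

# A model that leaks C-style syntax into its Python output fails the probe.
print(passes_syntax_unit_test("for i in range(3):\n    print(i)"))      # True
print(passes_syntax_unit_test("for (int i = 0; i < 3; i++) print(i)"))  # False
```

Sequence-similarity metrics would give the second string partial credit for token overlap; a syntactic probe rejects it outright, which is precisely the gap the paper's metrics target.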

Prompt Engineering Through the Lens of Optimal Control

  • paper_url: http://arxiv.org/abs/2310.14201
  • repo_url: None
  • paper_authors: Yifan Luo, Yiming Tang, Chengfeng Shen, Zhennan Zhou, Bin Dong
  • for: Applying an optimal control framework to multi-round Prompt Engineering (PE) with Large Language Models (LLMs) to make human-machine interaction more efficient and effective.
  • methods: A unified optimal control framework that systematizes existing PE methods, single- and multi-round alike, and extends to PE via ensemble methods and multi-agent collaboration.
  • results: The framework systematizes and optimizes multi-round PE, enlarges its scope of applicability, offers fresh insights into existing PE methods, and highlights theoretical challenges for future research, laying a foundation for more effective and interpretable PE methods.
    Abstract Prompt Engineering (PE) has emerged as a critical technique for guiding Large Language Models (LLMs) in solving intricate tasks. Its importance is highlighted by its potential to significantly enhance the efficiency and effectiveness of human-machine interaction. As tasks grow increasingly complex, recent advanced PE methods have extended beyond the limitations of single-round interactions to embrace multi-round interactions, which allows for a deeper and more nuanced engagement with LLMs. In this paper, we propose an optimal control framework tailored for multi-round interactions with LLMs. This framework provides a unified mathematical structure that not only systematizes the existing PE methods but also sets the stage for rigorous analytical improvements. Furthermore, we extend this framework to include PE via ensemble methods and multi-agent collaboration, thereby enlarging the scope of applicability. By adopting an optimal control perspective, we offer fresh insights into existing PE methods and highlight theoretical challenges that warrant future research. Besides, our work lays a foundation for the development of more effective and interpretable PE methods.
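The control-theoretic view above can be caricatured in a few lines: the conversation state evolves under a chosen prompt (the control input), and a controller picks the prompt that minimizes a cost. The `llm_step` dynamics and candidate prompts below are hypothetical stand-ins, not a real LLM interface.

```python
# A toy sketch of multi-round prompt engineering viewed as optimal control:
# the conversation state x_t evolves under a chosen prompt (control) u_t,
# and a greedy controller picks the control minimizing a one-step cost.

def llm_step(state: str, prompt: str) -> str:
    """Hypothetical dynamics: the next state appends the prompt's effect."""
    return state + "|" + prompt

def cost(state: str, target: str) -> int:
    """Cost: how many target keywords the conversation state still lacks."""
    return sum(kw not in state for kw in target.split())

def greedy_control(state, candidates, target, rounds=3):
    for _ in range(rounds):
        u = min(candidates, key=lambda p: cost(llm_step(state, p), target))
        state = llm_step(state, u)
    return state

final = greedy_control("q", ["give steps", "cite sources", "be brief"],
                       target="steps sources")
assert "steps" in final and "sources" in final
```

The paper's framework replaces this greedy rule with a rigorous optimal control formulation over the full interaction horizon.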

Improved Techniques for Training Consistency Models

  • paper_url: http://arxiv.org/abs/2310.14189
  • repo_url: https://github.com/sallu08/Consistency-Regulariation-FSSL-Naive
  • paper_authors: Yang Song, Prafulla Dhariwal
  • for: This paper focuses on improving the quality of consistency models, a type of generative model that can sample high-quality data in one step without the need for adversarial training.
  • methods: The authors present several improved techniques for consistency training, including eliminating Exponential Moving Average from the teacher consistency model, adopting Pseudo-Huber losses, and introducing a lognormal noise schedule.
  • results: The authors achieve FID scores of 2.51 and 3.25 on CIFAR-10 and ImageNet $64\times 64$ respectively in a single sampling step, marking a 3.5$\times$ and 4$\times$ improvement compared to prior consistency training approaches. Through two-step sampling, they further reduce FID scores to 2.24 and 2.77 on these two datasets, surpassing those obtained via distillation in both one-step and two-step settings.
    Abstract Consistency models are a nascent family of generative models that can sample high quality data in one step without the need for adversarial training. Current consistency models achieve optimal sample quality by distilling from pre-trained diffusion models and employing learned metrics such as LPIPS. However, distillation limits the quality of consistency models to that of the pre-trained diffusion model, and LPIPS causes undesirable bias in evaluation. To tackle these challenges, we present improved techniques for consistency training, where consistency models learn directly from data without distillation. We delve into the theory behind consistency training and identify a previously overlooked flaw, which we address by eliminating Exponential Moving Average from the teacher consistency model. To replace learned metrics like LPIPS, we adopt Pseudo-Huber losses from robust statistics. Additionally, we introduce a lognormal noise schedule for the consistency training objective, and propose to double total discretization steps every set number of training iterations. Combined with better hyperparameter tuning, these modifications enable consistency models to achieve FID scores of 2.51 and 3.25 on CIFAR-10 and ImageNet $64\times 64$ respectively in a single sampling step. These scores mark a 3.5$\times$ and 4$\times$ improvement compared to prior consistency training approaches. Through two-step sampling, we further reduce FID scores to 2.24 and 2.77 on these two datasets, surpassing those obtained via distillation in both one-step and two-step settings, while narrowing the gap between consistency models and other state-of-the-art generative models.
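The Pseudo-Huber loss that replaces LPIPS has a simple closed form, $d(x,y)=\sqrt{\lVert x-y\rVert^2+c^2}-c$: quadratic near zero, linear (Euclidean) for large residuals. A minimal sketch; the constant `c` here is illustrative, whereas the paper ties it to the data dimension.

```python
import math

def pseudo_huber(x, y, c=0.03):
    """Pseudo-Huber metric from robust statistics:
    d(x, y) = sqrt(||x - y||^2 + c^2) - c.
    Behaves like a squared distance near zero and like the L2 norm for
    large residuals, avoiding the learned-metric bias of LPIPS.
    """
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(sq + c * c) - c

assert abs(pseudo_huber([1.0, 2.0], [1.0, 2.0])) < 1e-12
# For large residuals the loss approaches the plain Euclidean distance:
assert abs(pseudo_huber([10.0, 0.0], [0.0, 0.0], c=0.03) - 10.0) < 0.05
```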

A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts

  • paper_url: http://arxiv.org/abs/2310.14188
  • repo_url: None
  • paper_authors: Huy Nguyen, Pedram Akbarian, TrungTin Nguyen, Nhat Ho
  • for: This paper studies mixture-of-experts (MoE) models, which combine multiple submodels to improve performance in a wide range of regression and classification applications.
  • methods: The authors analyze the convergence of density and parameter estimation in the softmax gating multinomial logistic MoE model, and propose a novel class of modified softmax gating functions to handle the case where some expert parameters vanish.
  • results: The paper establishes convergence rates for density and parameter estimation in the classification setting; when some expert parameters vanish, the rates become slower than polynomial, but the modified softmax gating functions significantly improve the parameter estimation rates.
    Abstract Mixture-of-experts (MoE) model incorporates the power of multiple submodels via gating functions to achieve greater performance in numerous regression and classification applications. From a theoretical perspective, while there have been previous attempts to comprehend the behavior of that model under the regression settings through the convergence analysis of maximum likelihood estimation in the Gaussian MoE model, such analysis under the setting of a classification problem has remained missing in the literature. We close this gap by establishing the convergence rates of density estimation and parameter estimation in the softmax gating multinomial logistic MoE model. Notably, when part of the expert parameters vanish, these rates are shown to be slower than polynomial rates owing to an inherent interaction between the softmax gating and expert functions via partial differential equations. To address this issue, we propose using a novel class of modified softmax gating functions which transform the input value before delivering them to the gating functions. As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.
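The model under study mixes expert-wise multinomial logistic classifiers through a softmax gate: $p(y\mid x)=\sum_k \mathrm{softmax}(W_g x)_k\,\mathrm{softmax}(W_k x)_y$. A minimal forward-pass sketch with toy dimensions and random weights (illustrative only, not the paper's estimator):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_class_probs(x, gate_W, expert_Ws):
    """Softmax-gating multinomial logistic MoE:
    p(y|x) = sum_k softmax(gate_W @ x)_k * softmax(expert_Ws[k] @ x)_y."""
    gates = softmax(gate_W @ x)                              # (K,)
    experts = softmax(np.stack([W @ x for W in expert_Ws]))  # (K, C)
    return gates @ experts                                   # (C,)

d, K, C = 4, 3, 5
x = rng.normal(size=d)
p = moe_class_probs(x, rng.normal(size=(K, d)),
                    [rng.normal(size=(C, d)) for _ in range(K)])
assert p.shape == (C,) and abs(p.sum() - 1.0) < 1e-9 and (p > 0).all()
```

The interaction the paper analyzes arises between the gate weights `gate_W` and the expert weights `expert_Ws` when some of the latter vanish.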

Learning Invariant Molecular Representation in Latent Discrete Space

  • paper_url: http://arxiv.org/abs/2310.14170
  • repo_url: https://github.com/hicai-zju/imold
  • paper_authors: Xiang Zhuang, Qiang Zhang, Keyan Ding, Yatao Bian, Xiao Wang, Jingsong Lv, Hongyang Chen, Huajun Chen
  • for: 这个研究旨在提高药物探索中的分子表示学习,并解决现有方法在不同环境下的训练和测试数据之间的泛化问题。
  • methods: 我们提出了一个新的框架,叫做“首先编码后分离”,可以将分子表示学习中的不同环境下的数据分类为不变的特征。我们还引入了一个对于训练数据的余域量化模组,以降低训练数据过滤的风险,并保持了Encoder的表达力。
  • results: 我们的模型在18个真实的分子数据集上进行了广泛的实验,结果显示我们的模型在不同环境下的泛化性比起现有基eline的优胜。我们的代码可以在https://github.com/HICAI-ZJU/iMoLD上取得。
    Abstract Molecular representation learning lays the foundation for drug discovery. However, existing methods suffer from poor out-of-distribution (OOD) generalization, particularly when data for training and testing originate from different environments. To address this issue, we propose a new framework for learning molecular representations that exhibit invariance and robustness against distribution shifts. Specifically, we propose a strategy called ``first-encoding-then-separation'' to identify invariant molecule features in the latent space, which deviates from conventional practices. Prior to the separation step, we introduce a residual vector quantization module that mitigates the over-fitting to training data distributions while preserving the expressivity of encoders. Furthermore, we design a task-agnostic self-supervised learning objective to encourage precise invariance identification, which enables our method widely applicable to a variety of tasks, such as regression and multi-label classification. Extensive experiments on 18 real-world molecular datasets demonstrate that our model achieves stronger generalization against state-of-the-art baselines in the presence of various distribution shifts. Our code is available at https://github.com/HICAI-ZJU/iMoLD.
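The residual vector quantization step can be sketched in a few lines: each stage quantizes the residual left by the previous stages, so reconstruction error shrinks stage by stage. The random codebooks below are illustrative, not the learned codebooks of the paper's module.

```python
import numpy as np

def residual_vq(z, codebooks):
    """Residual vector quantization: each stage quantizes what the previous
    stages left over, so reconstruction error is refined stage by stage."""
    residual, quantized = z.copy(), np.zeros_like(z)
    for cb in codebooks:                       # cb: (num_codes, dim)
        idx = np.argmin(((cb - residual) ** 2).sum(axis=1))
        quantized += cb[idx]
        residual = residual - cb[idx]
    return quantized

rng = np.random.default_rng(1)
z = rng.normal(size=8)
# Including the zero vector in every codebook guarantees each stage can
# only reduce (or keep) the residual norm.
books = [np.vstack([np.zeros(8), rng.normal(size=(15, 8))]) for _ in range(4)]
q = residual_vq(z, books)
assert np.linalg.norm(z - q) <= np.linalg.norm(z) + 1e-9
```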

Ensemble Learning for Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2310.14166
  • repo_url: https://github.com/wongzhenhao/elgnn
  • paper_authors: Zhen Hao Wong, Ling Yue, Quanming Yao
  • for: This paper investigates ensemble learning techniques to improve the performance and robustness of graph neural networks (GNNs) on graph-structured data.
  • methods: The authors train multiple GNN models with diverse initializations or architectures and use the Tree-structured Parzen Estimator algorithm to determine the ensemble weights.
  • results: Ensemble learning enhances GNNs' ability to analyze complex graph-structured data, improving overall accuracy, reducing bias and variance, and mitigating the impact of noisy data.
    Abstract Graph Neural Networks (GNNs) have shown success in various fields for learning from graph-structured data. This paper investigates the application of ensemble learning techniques to improve the performance and robustness of Graph Neural Networks (GNNs). By training multiple GNN models with diverse initializations or architectures, we create an ensemble model named ELGNN that captures various aspects of the data and uses the Tree-Structured Parzen Estimator algorithm to determine the ensemble weights. Combining the predictions of these models enhances overall accuracy, reduces bias and variance, and mitigates the impact of noisy data. Our findings demonstrate the efficacy of ensemble learning in enhancing GNN capabilities for analyzing complex graph-structured data. The code is public at https://github.com/wongzhenhao/ELGNN.
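The core combination step is just a weighted average of per-model class probabilities; the weights are what gets tuned. The paper uses the Tree-structured Parzen Estimator for that search; the coarse grid below is a stand-in, and the two mock model outputs are invented for illustration.

```python
import numpy as np

def ensemble_predict(prob_list, weights):
    """Weighted average of per-model class-probability matrices."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, prob_list))

# Two mock "GNN" outputs on 3 nodes, 2 classes (true labels: 0, 1, 0).
y = np.array([0, 1, 0])
m1 = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])   # wrong on node 3
m2 = np.array([[0.6, 0.4], [0.3, 0.7], [0.7, 0.3]])   # right everywhere
# Stand-in for the TPE weight search: pick the grid weight maximizing accuracy.
grid = [(a, 1 - a) for a in np.linspace(0, 1, 11)]
best = max(grid,
           key=lambda w: (ensemble_predict([m1, m2], w).argmax(1) == y).mean())
acc = (ensemble_predict([m1, m2], best).argmax(1) == y).mean()
assert acc == 1.0
```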

$\alpha$-Fair Contextual Bandits

  • paper_url: http://arxiv.org/abs/2310.14164
  • repo_url: None
  • paper_authors: Siddhant Chaudhary, Abhishek Sinha
  • for: This paper studies the $\alpha$-Fair Contextual Bandits problem, where the goal in the adversarial setting is to maximize a global $\alpha$-fair utility function rather than the cumulative reward.
  • methods: The authors design an efficient algorithm that guarantees an approximately sublinear regret in both the full-information and bandit feedback settings.
  • results: The algorithm achieves approximately sublinear regret in the adversarial setting, helping to avoid the echo-chamber effect and comply with regulatory requirements.
    Abstract Contextual bandit algorithms are at the core of many applications, including recommender systems, clinical trials, and optimal portfolio selection. One of the most popular problems studied in the contextual bandit literature is to maximize the sum of the rewards in each round by ensuring a sublinear regret against the best-fixed context-dependent policy. However, in many applications, the cumulative reward is not the right objective - the bandit algorithm must be fair in order to avoid the echo-chamber effect and comply with the regulatory requirements. In this paper, we consider the $\alpha$-Fair Contextual Bandits problem, where the objective is to maximize the global $\alpha$-fair utility function - a non-decreasing concave function of the cumulative rewards in the adversarial setting. The problem is challenging due to the non-separability of the objective across rounds. We design an efficient algorithm that guarantees an approximately sublinear regret in the full-information and bandit feedback settings.
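The $\alpha$-fair utility the paper maximizes is the classic non-decreasing concave family $u_\alpha(x)=x^{1-\alpha}/(1-\alpha)$ for $\alpha\neq 1$ and $u_1(x)=\log x$, applied to per-user cumulative rewards. A small sketch showing why larger $\alpha$ trades total reward for fairness:

```python
import math

def alpha_fair_utility(rewards, alpha):
    """Global alpha-fair utility of per-user cumulative rewards:
    sum of x^(1-alpha)/(1-alpha) for alpha != 1, or log(x) for alpha = 1.
    Larger alpha weights balance across users more heavily."""
    if alpha == 1.0:
        return sum(math.log(x) for x in rewards)
    return sum(x ** (1.0 - alpha) / (1.0 - alpha) for x in rewards)

balanced, skewed = [5.0, 5.0], [9.0, 1.0]   # same total reward
# alpha = 0 reduces to the plain sum, which cannot distinguish the two:
assert alpha_fair_utility(balanced, 0.0) == alpha_fair_utility(skewed, 0.0)
# alpha = 1 (proportional fairness) prefers the balanced allocation:
assert alpha_fair_utility(balanced, 1.0) > alpha_fair_utility(skewed, 1.0)
```

The non-separability across rounds that makes the problem hard comes from this concave function wrapping the *cumulative* rewards.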

Promoting Generalization for Exact Solvers via Adversarial Instance Augmentation

  • paper_url: http://arxiv.org/abs/2310.14161
  • repo_url: None
  • paper_authors: Haoyang Liu, Yufei Kuang, Jie Wang, Xijun Li, Yongdong Zhang, Feng Wu
  • for: Improving the efficiency and generalization of learning-based Mixed-Integer Linear Programming (MILP) solvers.
  • methods: A novel adversarial instance augmentation approach (AdaSolver) that promotes data diversity for learning-based branching modules in branch-and-bound solvers, formulating the non-differentiable instance augmentation as a contextual bandit problem.
  • results: Experiments show that by producing diverse augmented instances, AdaSolver yields remarkable efficiency improvements across various distributions.
    Abstract Machine learning has been successfully applied to improve the efficiency of Mixed-Integer Linear Programming (MILP) solvers. However, the learning-based solvers often suffer from severe performance degradation on unseen MILP instances -- especially on large-scale instances from a perturbed environment -- due to the limited diversity of training distributions. To tackle this problem, we propose a novel approach, which is called Adversarial Instance Augmentation and does not require to know the problem type for new instance generation, to promote data diversity for learning-based branching modules in the branch-and-bound (B&B) Solvers (AdaSolver). We use the bipartite graph representations for MILP instances and obtain various perturbed instances to regularize the solver by augmenting the graph structures with a learned augmentation policy. The major technical contribution of AdaSolver is that we formulate the non-differentiable instance augmentation as a contextual bandit problem and adversarially train the learning-based solver and augmentation policy, enabling efficient gradient-based training of the augmentation policy. To the best of our knowledge, AdaSolver is the first general and effective framework for understanding and improving the generalization of both imitation-learning-based (IL-based) and reinforcement-learning-based (RL-based) B&B solvers. Extensive experiments demonstrate that by producing various augmented instances, AdaSolver leads to a remarkable efficiency improvement across various distributions.

eess.SP - 2023-10-22

Submodular Optimization for Placement of Intelligent Reflecting Surfaces in Sensing Systems

  • paper_url: http://arxiv.org/abs/2310.14443
  • repo_url: None
  • paper_authors: Zahra Esmaeilbeig, Kumar Vijay Mishra, Arian Eamaz, Mojtaba Soltanalian
  • for: This paper studies the optimal placement of intelligent reflecting surfaces (IRS) for sensing applications.
  • methods: The placement of multiple IRS platforms is designed by maximizing the mutual information, exploiting the submodularity of this criterion via a constant-factor approximation algorithm.
  • results: Numerical results validate the proposed submodular optimization framework, with worst-case performance bounded to $1-1/e\approx 63\%$ of the optimum.
    Abstract Intelligent reflecting surfaces (IRS) and their optimal deployment are the new technological frontier in sensing applications. Recently, IRS have demonstrated potential in advancing target estimation and detection. While the optimal phase-shift of IRS for different tasks has been studied extensively in the literature, the optimal placement of multiple IRS platforms for sensing applications is less explored. In this paper, we design the placement of IRS platforms for sensing by maximizing the mutual information. In particular, we use this criterion to determine an approximately optimal placement of IRS platforms to illuminate an area where the target has a hypothetical presence. After demonstrating the submodularity of the mutual information criteria, we tackle the design problem by means of a constant-factor approximation algorithm for submodular optimization. Numerical results are presented to validate the proposed submodular optimization framework for optimal IRS placement with worst case performance bounded to $1-1/e\approx 63 \%$.

Piezoelectric Sensors for Real-time Monitoring and Quality Control in Additive Manufacturing

  • paper_url: http://arxiv.org/abs/2310.14321
  • repo_url: None
  • paper_authors: Rashid T. Momin
  • for: This paper aims to provide a comprehensive understanding of piezoelectric sensors for precision manufacturing processes in engineering, particularly in additive manufacturing.
  • methods: The paper proceeds systematically from fundamental principles (the piezoelectric effect) to practical applications, catering to both newcomers and seasoned professionals.
  • results: The study highlights the broad applicability and potential of piezoelectric sensors for real-time monitoring and quality control, offering insights of practical value to novices and experts alike.
    Abstract Within the ever-evolving landscape of engineering, particularly in the dynamic domain of additive In manufacturing, a pursuit of precision and excellence in production processes takes centre stage. This research , This paper serves to give a comprehensive understanding of piezoelectric sensors, a topic that is both academically engaging and of practical significance, catering to both seasoned experts and those newly venturing into the field. Additive manufacturing, lauded for its groundbreaking potential, underscores the imperative of rigorous quality control. This introduces piezoelectric sensors, devices that may be unfamiliar to many but possess considerable potential. This paper embarks on a methodical journey, commencing with an introductory elucidation of the piezoelectric effect. It then advances to the vital role of piezoelectric sensors in real-time monitoring and quality control, unveiling their potential and relevance for newcomers and seasoned professionals alike. This research, structured systematically from fundamental principles to pragmatic applications, presents findings that are not only academically informative but also represent a substantial stride towards achieving precision and high-quality manufacturing processes in the engineering field.

FAS-assisted NOMA Short-Packet Communication Systems

  • paper_url: http://arxiv.org/abs/2310.14251
  • repo_url: https://github.com/zeniSoida/pl1
  • paper_authors: Jianchao Zheng, Tuo Wu, Xiazhi Lai, Cunhua Pan, Maged Elkashlan, Kai-Kit Wong
  • for: investigate a fluid antenna system (FAS)-assisted downlink non-orthogonal multiple access (NOMA) for short-packet communications.
  • methods: base station (BS) adopts a single fixed antenna, while both the central user (CU) and the cell-edge user (CEU) are equipped with a FAS.
  • results: the diversity order for both CU and CEU is $N$, indicating that the system performance can be considerably improved by increasing $N$.
    Abstract In this paper, we investigate a fluid antenna system (FAS)-assisted downlink non-orthogonal multiple access (NOMA) for short-packet communications. The base station (BS) adopts a single fixed antenna, while both the central user (CU) and the cell-edge user (CEU) are equipped with a FAS. Each FAS comprises $N$ flexible positions (also known as ports), linked to $N$ arbitrarily correlated Rayleigh fading channels. We derive expressions for the average block error rate (BLER) of the FAS-assisted NOMA system and provide asymptotic BLER expressions. We determine that the diversity order for CU and CEU is $N$, indicating that the system performance can be considerably improved by increasing $N$. Simulation results validate the great performance of FAS.
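The diversity-order-$N$ claim can be illustrated with a Monte Carlo sketch: a fluid antenna that selects the best of $N$ Rayleigh-faded ports sees its outage probability fall much faster than a fixed antenna. The paper treats arbitrarily correlated ports; the independent-port assumption below is a simplification for illustration.

```python
import numpy as np

def outage_prob(n_ports, snr_db, rate=1.0, trials=200_000, seed=0):
    """Monte Carlo outage probability when the fluid antenna selects the
    best of n_ports (idealized here as independent) Rayleigh-faded ports."""
    rng = np.random.default_rng(seed)
    snr = 10 ** (snr_db / 10)
    g = rng.exponential(size=(trials, n_ports)).max(axis=1)  # best port gain
    return np.mean(np.log2(1 + snr * g) < rate)

# More ports -> much steeper outage decay, reflecting diversity order N:
p1, p4 = outage_prob(1, 5.0), outage_prob(4, 5.0)
assert p4 < p1
```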

How do the resting EEG preprocessing states affect the outcomes of postprocessing?

  • paper_url: http://arxiv.org/abs/2310.15194
  • repo_url: None
  • paper_authors: Shiang Hu, Jie Ruan, Juan Hou, Pedro Antonio Valdes-Sosa, Zhao Lv
  • for: This study investigates how preprocessing states affect postprocessing outcomes, in particular the impact of insufficiently preprocessed EEG (IPE) and excessively preprocessed EEG (EPE) on postprocessing in the frequency, spatial, and temporal domains.
  • methods: The authors synthesize clean EEG (CE) as ground truth using the New York head model and a multivariate autoregressive model, then simulate IPE by injecting Gaussian noise into the CE and EPE by removing brain activities.
  • results: Both IPE and EPE cause postprocessing results to deviate from those of the CE in terms of temporal statistics, multichannel power, cross spectra, the dispersion of source imaging, and scalp EEG network properties; moreover, the PaLOSi metric is significantly associated with the evolution of preprocessing states.
    Abstract Plenty of artifact removal tools and pipelines have been developed to correct the EEG recordings and discover the values below the waveforms. Without visual inspection from the experts, it is susceptible to derive improper preprocessing states, like the insufficient preprocessed EEG (IPE), and the excessive preprocessed EEG (EPE). However, little is known about the impacts of IPE or EPE on the postprocessing in the frequency, spatial and temporal domains, particularly as to the spectra and the functional connectivity (FC) analysis. Here, the clean EEG (CE) was synthesized as the ground truth based on the New-York head model and the multivariate autoregressive model. Later, the IPE and the EPE were simulated by injecting the Gaussian noise and losing the brain activities, respectively. Then, the impacts on postprocessing were quantified by the deviation caused by the IPE or EPE from the CE as to the 4 temporal statistics, the multichannel power, the cross spectra, the dispersion of source imaging, and the properties of scalp EEG network. Lastly, the association analysis was performed between the PaLOSi metric and the varying trends of postprocessing with the evolution of preprocessing states. This study shed light on how the postprocessing outcomes are affected by the preprocessing states and PaLOSi may be a potential effective quality metric.
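The simulation protocol above can be sketched on a toy signal: treat a sinusoid as "clean EEG", derive an IPE state by leaving residual Gaussian noise in and an EPE state by stripping part of the signal, then quantify each state's spectral deviation from the clean reference. The signal, sampling rate, and deviation measure here are illustrative choices, not the paper's exact pipeline.

```python
import numpy as np

def spectral_deviation(clean, processed):
    """Mean absolute deviation between the power spectra of a clean segment
    and a preprocessing state, mirroring how deviations from the clean
    ground truth can be quantified in the frequency domain."""
    p_c = np.abs(np.fft.rfft(clean)) ** 2
    p_x = np.abs(np.fft.rfft(processed)) ** 2
    return np.mean(np.abs(p_c - p_x))

rng = np.random.default_rng(0)
t = np.arange(1000) / 200.0                      # 5 s at 200 Hz
ce = np.sin(2 * np.pi * 10 * t)                  # toy 10 Hz "clean EEG"
ipe = ce + rng.normal(scale=0.5, size=ce.size)   # residual noise left in
epe = 0.3 * ce                                   # brain activity removed too
assert spectral_deviation(ce, ce) == 0.0
assert spectral_deviation(ce, ipe) > 0
assert spectral_deviation(ce, epe) > 0
```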

On the Sum Secrecy Rate of Multi-User Holographic MIMO Networks

  • paper_url: http://arxiv.org/abs/2310.14217
  • repo_url: None
  • paper_authors: Arthur S. de Sena, Jiguang He, Ahmed Al Hammadi, Chongwen Huang, Faouzi Bader, Merouane Debbah, Mathias Fink
  • for: This paper targets the enhanced degrees of freedom and multiplexing gains of holographic MIMO (HMIMO) for next-generation communication systems, extending the technology to secure communications, which had remained unexplored.
  • methods: The authors theoretically characterize the secrecy capacity of a meta-atom-based HMIMO network with multiple legitimate users and one eavesdropper, accounting for artificial noise and max-min fairness, and solve the power allocation (PA) problem via successive convex approximation and Taylor expansion.
  • results: Adaptive/flexible power allocation yields significant performance gains, with more than a 100% increment in the high signal-to-noise ratio (SNR) regime for the two-user case compared with fixed PA coefficients.
    Abstract The emerging concept of extremely-large holographic multiple-input multiple-output (HMIMO), beneficial from compactly and densely packed cost-efficient radiating meta-atoms, has been demonstrated for enhanced degrees of freedom even in pure line-of-sight conditions, enabling tremendous multiplexing gain for the next-generation communication systems. Most of the reported works focus on energy and spectrum efficiency, path loss analyses, and channel modeling. The extension to secure communications remains unexplored. In this paper, we theoretically characterize the secrecy capacity of the HMIMO network with multiple legitimate users and one eavesdropper while taking into consideration artificial noise and max-min fairness. We formulate the power allocation (PA) problem and address it by following successive convex approximation and Taylor expansion. We further study the effect of fixed PA coefficients, imperfect channel state information, inter-element spacing, and the number of Eve's antennas on the sum secrecy rate. Simulation results show that significant performance gain with more than 100\% increment in the high signal-to-noise ratio (SNR) regime for the two-user case is obtained by exploiting adaptive/flexible PA compared to the case with fixed PA coefficients.
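The building block of a sum secrecy rate is the per-user secrecy rate: the excess of the legitimate link's capacity over the eavesdropper's, floored at zero. A minimal sketch with Gaussian-channel capacities (the paper's actual expressions account for the HMIMO channel, artificial noise, and fairness constraints):

```python
import math

def secrecy_rate(snr_user_db, snr_eve_db):
    """Per-user secrecy rate in bits/s/Hz:
    max(0, log2(1 + SNR_user) - log2(1 + SNR_eve)).
    Summing such terms over users gives a sum secrecy rate."""
    cu = math.log2(1 + 10 ** (snr_user_db / 10))
    ce = math.log2(1 + 10 ** (snr_eve_db / 10))
    return max(0.0, cu - ce)

assert secrecy_rate(20.0, 0.0) > 0.0    # strong user link leaks a margin
assert secrecy_rate(0.0, 20.0) == 0.0   # eavesdropper advantage -> no secrecy
```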

A Coordinate Descent Approach to Atomic Norm Minimization

  • paper_url: http://arxiv.org/abs/2310.14182
  • repo_url: None
  • paper_authors: Ruifu Li, Danijela Cabric
  • for: This paper addresses the atomic norm minimization (ANM) problem arising in sparse signal processing applications such as super-resolution line-spectral estimation and signal denoising.
  • methods: The paper proposes a low-complexity, matrix-free method for ANM based on coordinate descent over an equivalent non-convex formulation, exploiting the sparsity-inducing nature of atomic-norm regularization; convergence to the global optimum is proved.
  • results: For a single measurement vector of length N in the DFT basis, each coordinate-descent iteration costs O(N log N), making the method efficient for large-scale problems; numerical simulations show it is much faster than ADMM or a customized interior-point SDP solver on sparse problems.
    Abstract Atomic norm minimization is of great interest in various applications of sparse signal processing including super-resolution line-spectral estimation and signal denoising. In practice, atomic norm minimization (ANM) is formulated as a semi-definite programming (SDP) which is generally hard to solve. This work introduces a low-complexity, matrix-free method for solving ANM. The method uses the framework of coordinate descent and exploits the sparsity-induced nature of atomic-norm regularization. Specifically, an equivalent, non-convex formulation of ANM is first proposed. It is then proved that applying the coordinate descent framework on the non-convex formulation leads to convergence to the global optimal point. For the case of a single measurement vector of length N in discrete fourier transform (DFT) basis, the complexity of each iteration in the coordinate descent procedure is O(N log N ), rendering the proposed method efficient even for large-scale problems. The proposed coordinate descent framework can be readily modified to solve a variety of ANM problems, including multi-dimensional ANM with multiple measurement vectors. It is easy to implement and can essentially be applied to any atomic sets as long as a corresponding rank-1 problem can be solved. Through extensive numerical simulations, it is verified that for solving sparse problems the proposed method is much faster than the alternating direction method of multipliers (ADMM) or the customized interior point SDP solver.
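The flavor of coordinate descent with a sparsity-inducing penalty can be shown on the simpler $\ell_1$ (lasso) problem, where each coordinate update is a closed-form soft threshold. This is an illustrative surrogate only: the paper applies the coordinate descent framework to an equivalent non-convex ANM formulation, not to this lasso.

```python
import numpy as np

def lasso_cd(A, y, lam, iters=100):
    """Coordinate descent for min_x 0.5*||Ax - y||^2 + lam*||x||_1.
    Each coordinate minimization is an exact soft-thresholding step."""
    n = A.shape[1]
    x = np.zeros(n)
    col_sq = (A ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(n):
            r = y - A @ x + A[:, j] * x[j]   # residual excluding coordinate j
            rho = A[:, j] @ r
            x[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))
x_true = np.zeros(10); x_true[[2, 7]] = [1.5, -2.0]
x_hat = lasso_cd(A, A @ x_true, lam=0.1)
assert abs(x_hat[2] - 1.5) < 0.1 and abs(x_hat[7] + 2.0) < 0.1
```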

  • paper_url: http://arxiv.org/abs/2310.14179
  • repo_url: None
  • paper_authors: Wai-Yiu Keung, Wing-Kin Ma
  • for: This paper considers multiuser massive MIMO downlink precoding with low-resolution digital-to-analog converters (DACs) at the transmitter.
  • methods: The authors use spatial Sigma-Delta modulation to keep quantization error effects under control, developing a general modulator design framework for any order, any number of quantization levels, and any angle sector.
  • results: Numerical results suggest that, with a moderate number of quantization levels (5 to 7), the optimization-based Sigma-Delta schemes achieve bit error rates close to the unquantized counterpart, with the angle sector and the number of levels as the key design factors.
    Abstract This paper considers the context of multiuser massive MIMO downlink precoding with low-resolution digital-to-analog converters (DACs) at the transmitter. This subject is motivated by the consideration that it is expensive to employ high-resolution DACs for practical massive MIMO implementations. The challenge with using low-resolution DACs is to overcome the detrimental quantization error effects. Recently, spatial Sigma-Delta modulation has arisen as a viable way to put quantization errors under control. This approach takes insight from temporal Sigma-Delta modulation in classical DAC studies. Assuming a 1D uniform linear transmit antenna array, the principle is to shape the quantization errors in space such that the shaped quantization errors are pushed away from the user-serving angle sector. In the previous studies, spatial Sigma-Delta modulation was performed by direct application of the basic first- and second-order modulators from the Sigma-Delta literature. In this paper, we develop a general Sigma-Delta modulator design framework for any given order, for any given number of quantization levels, and for any given angle sector. We formulate our design as a problem of maximizing the signal-to-quantization-and-noise ratios experienced by the users. The formulated problem is convex and can be efficiently solved by available solvers. Our proposed framework offers the alternative option of focused quantization error suppression in accordance with channel state information. Our framework can also be extended to 2D planar transmit antenna arrays. We perform numerical study under different operating conditions, and the numerical results suggest that, given a moderate number of quantization levels, say, 5 to 7 levels, our optimization-based Sigma-Delta modulation schemes can lead to bit error rate performance close to that of the unquantized counterpart.
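As a concrete illustration of the noise-shaping idea the abstract builds on, here is a minimal temporal first-order Sigma-Delta quantizer (a sketch of the classical DAC technique, not the paper's spatial, optimization-based design; the uniform codebook over [-1, 1] is an assumption):

```python
import numpy as np

def first_order_sigma_delta(x, levels):
    """First-order Sigma-Delta quantizer: feed each sample's quantization
    error back into the next sample, shaping the error spectrum."""
    step = 2.0 / (levels - 1)          # uniform codebook over [-1, 1]
    out = np.empty(len(x), dtype=float)
    e = 0.0                            # previous quantization error
    for n, xn in enumerate(x):
        u = xn - e                               # error feedback
        q = np.clip(step * np.round(u / step), -1.0, 1.0)
        e = q - u                                # error of this step
        out[n] = q
    return out
```

Feeding back each sample's error pushes the error energy toward high frequencies, the temporal analogue of pushing spatial quantization error away from the user-serving angle sector.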

Factor Graph Processing for Dual-Blind Deconvolution at ISAC Receiver

  • paper_url: http://arxiv.org/abs/2310.14167
  • repo_url: None
  • paper_authors: Roman Jacome, Edwin Vargas, Kumar Vijay Mishra, Brian M. Sadler, Henry Arguello
  • for: This work targets integrated sensing and communications (ISAC) systems, which jointly access and utilize the scarce electromagnetic spectrum.
  • methods: A linear state space model (LSSM) captures the temporal evolution of the signal model, and the unknown variables are represented on a factor graph.
  • results: Numerical experiments show that an efficient expectation-maximization (EM) algorithm with Gaussian message passing accurately estimates the unknown radar and communications channels, including in the presence of noise.
    Abstract Integrated sensing and communications (ISAC) systems have gained significant interest because of their ability to jointly and efficiently access, utilize, and manage the scarce electromagnetic spectrum. The co-existence approach toward ISAC focuses on the receiver processing of overlaid radar and communications signals coming from independent transmitters. A specific ISAC coexistence problem is dual-blind deconvolution (DBD), wherein the transmit signals and channels of both radar and communications are unknown to the receiver. Prior DBD works ignore the evolution of the signal model over time. In this work, we consider a dynamic DBD scenario using a linear state space model (LSSM) such that, apart from the transmit signals and channels of both systems, the LSSM parameters are also unknown. We employ a factor graph representation to model these unknown variables. We avoid the conventional matrix inversion approach to estimate the unknown variables by using an efficient expectation-maximization algorithm, where each iteration employs a Gaussian message passing over the factor graph structure. Numerical experiments demonstrate the accurate estimation of radar and communications channels, including in the presence of noise.
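The Gaussian message passing the authors run on their factor graph reduces, for a chain-structured scalar LSSM, to the classical Kalman filter; the sketch below shows only that building block (the paper's dual-blind deconvolution model is far richer):

```python
def kalman_1d(ys, a, q, r, m0=0.0, p0=1.0):
    """Filter y_t = x_t + v_t, x_t = a*x_{t-1} + w_t by combining
    Gaussian predict/update messages along a chain factor graph."""
    m, p = m0, p0
    means = []
    for y in ys:
        m, p = a * m, a * a * p + q          # predict message
        k = p / (p + r)                      # Kalman gain
        m, p = m + k * (y - m), (1 - k) * p  # update message
        means.append(m)
    return means
```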

Photoplethysmography based atrial fibrillation detection: an updated review from July 2019

  • paper_url: http://arxiv.org/abs/2310.14155
  • repo_url: None
  • paper_authors: Cheng Ding, Ran Xiao, Weijia Wang, Elizabeth Holdsworth, Xiao Hu
  • for: This study reviews the latest photoplethysmography (PPG) techniques for continuous atrial fibrillation (AF) monitoring, aimed at improving patient outcomes.
  • methods: Focusing on digital health and artificial intelligence (AI) solutions, the review examines 59 studies spanning statistical methods, traditional machine learning techniques, and deep learning approaches.
  • results: The surveyed work shows that PPG-based approaches can detect AF accurately; the review also identifies the remaining challenges in this domain.
    Abstract Atrial fibrillation (AF) is a prevalent cardiac arrhythmia associated with significant health ramifications, including an elevated susceptibility to ischemic stroke, heart disease, and heightened mortality. Photoplethysmography (PPG) has emerged as a promising technology for continuous AF monitoring for its cost-effectiveness and widespread integration into wearable devices. Our team previously conducted an exhaustive review on PPG-based AF detection before June 2019. However, since then, more advanced technologies have emerged in this field. This paper offers a comprehensive review of the latest advancements in PPG-based AF detection, utilizing digital health and artificial intelligence (AI) solutions, within the timeframe spanning from July 2019 to December 2022. Through extensive exploration of scientific databases, we have identified 59 pertinent studies. Our comprehensive review encompasses an in-depth assessment of the statistical methodologies, traditional machine learning techniques, and deep learning approaches employed in these studies. In addition, we address the challenges encountered in the domain of PPG-based AF detection. Furthermore, we maintain a dedicated website to curate the latest research in this area, with regular updates on a regular basis.
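As one example of the statistical methodologies such reviews cover, a common PPG-derived AF marker is the irregularity of inter-beat intervals; below is a minimal RMSSD-thresholding sketch (the 100 ms threshold is an illustrative assumption, not a value from the paper):

```python
import numpy as np

def rmssd(ibis_ms):
    """Root mean square of successive differences of inter-beat intervals."""
    d = np.diff(np.asarray(ibis_ms, dtype=float))
    return float(np.sqrt(np.mean(d * d)))

def flag_af(ibis_ms, threshold_ms=100.0):
    """Flag a window as possible AF when beat-to-beat variability is high."""
    return rmssd(ibis_ms) > threshold_ms
```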

cs.SD - 2023-10-21

  • paper_url: http://arxiv.org/abs/2310.14018
  • repo_url: None
  • paper_authors: Tatsuki Kobayashi, Yoshiko Maruyama, Isao Nambu, Shohei Yano, Yasuhiro Wada
  • for: Virtual sound synthesis technology allows users to perceive spatial sound through headphones or earphones, but accurate virtual sound requires an individual head-related transfer function (HRTF).
  • methods: This study proposed a method to generate HRTFs from one direction to the other using temporal convolutional neural networks (TCNs) and publicly available datasets in the horizontal plane.
  • results: The proposed method successfully generated HRIRs for directions other than the front direction in the dataset, and behavioral experiments with human participants showed that the generated HRIRs were equivalent to the measured ones in a new dataset. These results suggest that the proposed TCNs can be used to generate personalized HRIRs for virtual sound.
    Abstract Virtual sound synthesis is a technology that allows users to perceive spatial sound through headphones or earphones. However, accurate virtual sound requires an individual head-related transfer function (HRTF), which can be difficult to measure due to the need for a specialized environment. In this study, we proposed a method to generate HRTFs from one direction to the other. To this end, we used temporal convolutional neural networks (TCNs) to generate head-related impulse responses (HRIRs). To train the TCNs, publicly available datasets in the horizontal plane were used. Using the trained networks, we successfully generated HRIRs for directions other than the front direction in the dataset. We found that the proposed method successfully generated HRIRs for publicly available datasets. To test the generalization of the method, we measured the HRIRs of a new dataset and tested whether the trained networks could be used for this new dataset. Although the similarity evaluated by spectral distortion was slightly degraded, behavioral experiments with human participants showed that the generated HRIRs were equivalent to the measured ones. These results suggest that the proposed TCNs can be used to generate personalized HRIRs from one direction to another, which could contribute to the personalization of virtual sound.
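The core operation inside a TCN is a causal dilated convolution; a minimal NumPy sketch of that single operation (illustrative only — the paper's networks are learned, multi-layer models):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation=1):
    """y[t] = sum_k w[k] * x[t - k*dilation], zero-padded so the output
    at time t depends only on past and present inputs (causal)."""
    y = np.zeros_like(x, dtype=float)
    for k, wk in enumerate(w):
        shift = k * dilation
        if shift == 0:
            y += wk * x
        elif shift < len(x):
            y[shift:] += wk * x[:-shift]
    return y
```

Stacking such layers with growing dilation gives the long receptive field a TCN needs to map an HRIR in one direction to an HRIR in another.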

eess.AS - 2023-10-21

SwG-former: Sliding-window Graph Convolutional Network Integrated with Conformer for Sound Event Localization and Detection

  • paper_url: http://arxiv.org/abs/2310.14016
  • repo_url: None
  • paper_authors: Weiming Huang, Qinghua Huang, Liyan Ma, Zhengyu Chen, Chuan Wang
  • for: This work aims to improve the performance of sound event localization and detection (SELD) systems, particularly in natural spatial acoustic environments.
  • methods: A novel graph convolutional network (GCN) model based on a graph representation extracts spatial and temporal features simultaneously; a sliding-window graph (SwG) module captures temporal context and spatial correlations, and a robust Conv2dAgg function aggregates neighbor-vertex features.
  • results: The resulting SwG-former model outperforms recent advanced SELD models under the same acoustic environment, and integrating the SwG module into EINV2 (SwG-EINV2) surpasses state-of-the-art (SOTA) methods.
    Abstract Sound event localization and detection (SELD) is a joint task of sound event detection (SED) and direction of arrival (DoA) estimation. SED mainly relies on temporal dependencies to distinguish different sound classes, while DoA estimation depends on spatial correlations to estimate source directions. To jointly optimize two subtasks, the SELD system should extract spatial correlations and model temporal dependencies simultaneously. However, numerous models mainly extract spatial correlations and model temporal dependencies separately. In this paper, the interdependence of spatial-temporal information in audio signals is exploited for simultaneous extraction to enhance the model performance. In response, a novel graph representation leveraging graph convolutional network (GCN) in non-Euclidean space is developed to extract spatial-temporal information concurrently. A sliding-window graph (SwG) module is designed based on the graph representation. It exploits sliding-windows with different sizes to learn temporal context information and dynamically constructs graph vertices in the frequency-channel (F-C) domain to capture spatial correlations. Furthermore, as the cornerstone of message passing, a robust Conv2dAgg function is proposed and embedded into the SwG module to aggregate the features of neighbor vertices. To improve the performance of SELD in a natural spatial acoustic environment, a general and efficient SwG-former model is proposed by integrating the SwG module with the Conformer. It exhibits superior performance in comparison to recent advanced SELD models. To further validate the generality and efficiency of the SwG-former, it is seamlessly integrated into the event-independent network version 2 (EINV2) called SwG-EINV2. The SwG-EINV2 surpasses the state-of-the-art (SOTA) methods under the same acoustic environment.
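As background for the graph-convolution machinery, a generic symmetric-normalized GCN layer can be sketched as follows (this is the standard formulation, not the paper's Conv2dAgg function, whose details are specific to the SwG module):

```python
import numpy as np

def gcn_layer(h, adj, w):
    """One graph-convolution step: D^-1/2 (A + I) D^-1/2 neighbor
    aggregation, a linear map, then ReLU."""
    a = adj + np.eye(adj.shape[0])        # add self-loops
    d = a.sum(axis=1)                     # vertex degrees
    a_norm = a / np.sqrt(np.outer(d, d))  # symmetric normalization
    return np.maximum(a_norm @ h @ w, 0.0)
```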

cs.CV - 2023-10-21

Zero-shot Learning of Individualized Task Contrast Prediction from Resting-state Functional Connectomes

  • paper_url: http://arxiv.org/abs/2310.14105
  • repo_url: None
  • paper_authors: Minh Nguyen, Gia H. Ngo, Mert R. Sabuncu
  • for: Using resting-state functional MRI (rsfMRI) scans to predict task-evoked activity
  • methods: Using machine learning (ML) models, trained on paired resting-state and task-evoked fMRI scans
  • results: Can predict activity for novel tasks, and is competitive with state-of-the-art models’ in-domain predictions
    Abstract Given sufficient pairs of resting-state and task-evoked fMRI scans from subjects, it is possible to train ML models to predict subject-specific task-evoked activity using resting-state functional MRI (rsfMRI) scans. However, while rsfMRI scans are relatively easy to collect, obtaining sufficient task fMRI scans is much harder as it involves more complex experimental designs and procedures. Thus, the reliance on scarce paired data limits the application of current techniques to only tasks seen during training. We show that this reliance can be reduced by leveraging group-average contrasts, enabling zero-shot predictions for novel tasks. Our approach, named OPIC (short for Omni-Task Prediction of Individual Contrasts), takes as input a subject's rsfMRI-derived connectome and a group-average contrast, to produce a prediction of the subject-specific contrast. Similar to zero-shot learning in large language models using special inputs to obtain answers for novel natural language processing tasks, inputting group-average contrasts guides the OPIC model to generalize to novel tasks unseen in training. Experimental results show that OPIC's predictions for novel tasks are not only better than simple group-averages, but are also competitive with a state-of-the-art model's in-domain predictions that was trained using in-domain tasks' data.
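The OPIC interface — a subject's rsfMRI-derived connectome plus a group-average contrast in, a subject-specific contrast out — can be caricatured with a hypothetical linear head (purely illustrative; the paper's model is a learned network, and `w`, `b` here are assumed parameters):

```python
import numpy as np

def predict_contrast(connectome, group_avg, w, b):
    """Hypothetical linear head: the group-average contrast plus a
    subject-specific correction computed from the connectome features."""
    return group_avg + connectome @ w + b
```

With a zero correction the prediction falls back to the group average, which is exactly the baseline the paper reports improving upon.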

Unleashing Modified Deep Learning Models in Efficient COVID19 Detection

  • paper_url: http://arxiv.org/abs/2310.14081
  • repo_url: None
  • paper_authors: Md Aminul Islam, Shabbir Ahmed Shuvo, Mohammad Abu Tareq Rony, M Raihan, Md Abu Sufian
  • for: This study aims to improve the accuracy of COVID-19 prediction and detection, helping healthcare systems, policymakers, and researchers make informed decisions to reduce the impact of COVID-19 and other contagious diseases.
  • methods: Deep learning models, notably MobileNet V3, DenseNet201, and GoogleNet Inception V1, are evaluated on CT and X-ray image collections, with loss optimization and scalable batch normalization used to strengthen the predictive models.
  • results: MobileNet V3 (97.872%), DenseNet201 (97.567%), and GoogleNet Inception V1 (97.643%) achieve the highest accuracy, and combining loss optimization with scalable batch normalization further improves model performance and resilience.
    Abstract The COVID19 pandemic, a unique and devastating respiratory disease outbreak, has affected global populations as the disease spreads rapidly. Recent Deep Learning breakthroughs may improve COVID19 prediction and forecasting as a tool of precise and fast detection, however, current methods are still being examined to achieve higher accuracy and precision. This study analyzed the collection contained 8055 CT image samples, 5427 of which were COVID cases and 2628 non COVID. The 9544 Xray samples included 4044 COVID patients and 5500 non COVID cases. The most accurate models are MobileNet V3 (97.872 percent), DenseNet201 (97.567 percent), and GoogleNet Inception V1 (97.643 percent). High accuracy indicates that these models can make many accurate predictions, as well as others, are also high for MobileNetV3 and DenseNet201. An extensive evaluation using accuracy, precision, and recall allows a comprehensive comparison to improve predictive models by combining loss optimization with scalable batch normalization in this study. Our analysis shows that these tactics improve model performance and resilience for advancing COVID19 prediction and detection and shows how Deep Learning can improve disease handling. The methods we suggest would strengthen healthcare systems, policymakers, and researchers to make educated decisions to reduce COVID19 and other contagious diseases. CCS CONCEPTS Covid,Deep Learning, Image Processing KEYWORDS Covid, Deep Learning, DenseNet201, MobileNet, ResNet, DenseNet, GoogleNet, Image Processing, Disease Detection.
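The evaluation relies on standard accuracy, precision, and recall; for reference, a minimal binary precision/recall computation (1 = COVID-positive):

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```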

Concept-based Anomaly Detection in Retail Stores for Automatic Correction using Mobile Robots

  • paper_url: http://arxiv.org/abs/2310.14063
  • repo_url: None
  • paper_authors: Aditya Kapoor, Vartika Sengar, Nijil George, Vighnesh Vatsal, Jayavardhana Gubbi, Balamuralidhar P, Arpan Pal
  • for: This paper proposes a concept-based anomaly detection approach using a Vision Transformer (ViT) to flag misplaced and missing items in retail stores without a prior knowledge base such as a planogram.
  • methods: The method uses an auto-encoder architecture followed by outlier detection in the latent space.
  • results: On anomaly detection image sets of retail objects drawn from the RP2K dataset, the approach reaches a peak success rate of 89.90%, compared to 80.81% for the best-performing baseline, a standard ViT auto-encoder.
    Abstract Tracking of inventory and rearrangement of misplaced items are some of the most labor-intensive tasks in a retail environment. While there have been attempts at using vision-based techniques for these tasks, they mostly use planogram compliance for detection of any anomalies, a technique that has been found lacking in robustness and scalability. Moreover, existing systems rely on human intervention to perform corrective actions after detection. In this paper, we present Co-AD, a Concept-based Anomaly Detection approach using a Vision Transformer (ViT) that is able to flag misplaced objects without using a prior knowledge base such as a planogram. It uses an auto-encoder architecture followed by outlier detection in the latent space. Co-AD has a peak success rate of 89.90% on anomaly detection image sets of retail objects drawn from the RP2K dataset, compared to 80.81% on the best-performing baseline of a standard ViT auto-encoder. To demonstrate its utility, we describe a robotic mobile manipulation pipeline to autonomously correct the anomalies flagged by Co-AD. This work is ultimately aimed towards developing autonomous mobile robot solutions that reduce the need for human intervention in retail store management.
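The outlier-detection step in the latent space can be sketched generically as distance-to-centroid scoring (an assumption for illustration — the paper does not specify this particular scoring rule):

```python
import numpy as np

def latent_anomaly_scores(z, z_train):
    """Score each latent vector by its distance to the centroid of
    the normal (training) latents; large scores suggest anomalies."""
    mu = z_train.mean(axis=0)
    return np.linalg.norm(z - mu, axis=1)
```

A threshold on these scores (e.g. a high percentile of training scores) then flags items for the robotic correction pipeline.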

Training Image Derivatives: Increased Accuracy and Universal Robustness

  • paper_url: http://arxiv.org/abs/2310.14045
  • repo_url: None
  • paper_authors: Vsevolod I. Avrutskiy
  • for: The paper addresses an image-analysis task: reconstructing the vertices of a cube from its image.
  • methods: Derivative training computes not only the output values but also their derivatives in the forward pass, includes their deviations from the target derivatives in the cost function, and minimizes that cost with respect to the weights by a gradient-based algorithm.
  • results: Training the derivatives with respect to the 6 degrees of freedom of the cube yields 25 times more accurate results for noiseless inputs; the derivatives also enable a first-order robustness analysis that unifies two types of network vulnerabilities and supports robust training without an accuracy-robustness trade-off.
    Abstract Derivative training is a well-known method to improve the accuracy of neural networks. In the forward pass, not only the output values are computed, but also their derivatives, and their deviations from the target derivatives are included in the cost function, which is minimized with respect to the weights by a gradient-based algorithm. So far, this method has been implemented for relatively low-dimensional tasks. In this study, we apply the approach to the problem of image analysis. We consider the task of reconstructing the vertices of a cube based on its image. By training the derivatives with respect to the 6 degrees of freedom of the cube, we obtain 25 times more accurate results for noiseless inputs. The derivatives also provide important insights into the robustness problem, which is currently understood in terms of two types of network vulnerabilities. The first type is small perturbations that dramatically change the output, and the second type is substantial image changes that the network erroneously ignores. They are currently considered as conflicting goals, since conventional training methods produce a trade-off. The first type can be analyzed via the gradient of the network, but the second type requires human evaluation of the inputs, which is an oracle substitute. For the task at hand, the nearest neighbor oracle can be defined, and the knowledge of derivatives allows it to be expanded into Taylor series. This allows to perform the first-order robustness analysis that unifies both types of vulnerabilities, and to implement robust training that eliminates any trade-offs, so that accuracy and robustness are limited only by network capacity.
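The derivative-training cost can be made concrete on a toy one-parameter model f(x; w) = w·x, whose input derivative is simply w (a sketch of the cost structure only, not the paper's image networks):

```python
import numpy as np

def combined_cost(w, xs, ys, dys, lam=1.0):
    """Value loss plus derivative loss for the toy model f(x; w) = w*x.
    ys are target outputs, dys are target input-derivatives; lam weights
    the derivative term."""
    value_loss = np.mean((w * xs - ys) ** 2)   # output deviations
    deriv_loss = np.mean((w - dys) ** 2)       # df/dx = w for every sample
    return value_loss + lam * deriv_loss
```

Gradient descent on this combined cost fits both the values and the derivatives, which is what drives the accuracy gain reported in the abstract.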

You Only Condense Once: Two Rules for Pruning Condensed Datasets

  • paper_url: http://arxiv.org/abs/2310.14019
  • repo_url: None
  • paper_authors: Yang He, Lingao Xiao, Joey Tianyi Zhou
  • for: Improving training efficiency by condensing the training dataset, particularly under on-device computational constraints.
  • methods: Two embarrassingly simple dataset pruning rules, Low LBPE Score and Balanced Construction, produce smaller condensed datasets from a single condensed dataset without extra condensation processes.
  • results: On ConvNet, ResNet, and DenseNet over CIFAR-10, CIFAR-100, and ImageNet, the approach achieves accuracy gains of 6.98-8.89% over dataset condensation methods and 6.31-23.92% over dataset pruning methods on CIFAR-10 with ten Images Per Class (IPC).
    Abstract Dataset condensation is a crucial tool for enhancing training efficiency by reducing the size of the training dataset, particularly in on-device scenarios. However, these scenarios have two significant challenges: 1) the varying computational resources available on the devices require a dataset size different from the pre-defined condensed dataset, and 2) the limited computational resources often preclude the possibility of conducting additional condensation processes. We introduce You Only Condense Once (YOCO) to overcome these limitations. On top of one condensed dataset, YOCO produces smaller condensed datasets with two embarrassingly simple dataset pruning rules: Low LBPE Score and Balanced Construction. YOCO offers two key advantages: 1) it can flexibly resize the dataset to fit varying computational constraints, and 2) it eliminates the need for extra condensation processes, which can be computationally prohibitive. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including CIFAR-10, CIFAR-100 and ImageNet. For example, our YOCO surpassed various dataset condensation and dataset pruning methods on CIFAR-10 with ten Images Per Class (IPC), achieving 6.98-8.89% and 6.31-23.92% accuracy gains, respectively. The code is available at: https://github.com/he-y/you-only-condense-once.
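The two pruning rules can be sketched together: given per-sample scores (the paper's LBPE score computation is model-specific and assumed precomputed here), keep the lowest-scoring samples class by class, which is the Balanced Construction idea:

```python
from collections import defaultdict

def balanced_prune(scores, labels, keep_per_class):
    """Keep the `keep_per_class` lowest-scoring samples of every class,
    so the pruned dataset stays class-balanced."""
    by_class = defaultdict(list)
    for idx, (s, c) in enumerate(zip(scores, labels)):
        by_class[c].append((s, idx))
    kept = []
    for items in by_class.values():
        items.sort()                            # ascending score
        kept.extend(idx for _, idx in items[:keep_per_class])
    return sorted(kept)
```

Because `keep_per_class` is a free parameter, the same condensed dataset can be resized to fit whatever compute budget a device has, with no further condensation.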

Ophthalmic Biomarker Detection Using Ensembled Vision Transformers – Winning Solution to IEEE SPS VIP Cup 2023

  • paper_url: http://arxiv.org/abs/2310.14005
  • repo_url: None
  • paper_authors: H. A. Z. Sameen Shahgir, Khondker Salman Sayeed, Tanjeem Azwad Zaman, Md. Asif Haider, Sheikh Saifur Rahman Jony, M. Sohel Rahman
  • for: The paper describes the winning solution to the IEEE SPS VIP Cup 2023: Ophthalmic Biomarker Detection competition, whose primary objective was to identify biomarkers from Optical Coherence Tomography (OCT) images obtained from a diverse range of patients.
  • methods: Two vision transformer-based models, MaxViT and EVA-02, were trained with robust augmentations and 5-fold cross-validation and ensembled at inference time; MaxViT's convolution layers followed by strided attention proved better suited to local features, while EVA-02's normal attention mechanism and knowledge distillation worked better for global features.
  • results: The solution achieved a patient-wise F1 score of 0.814 in the first phase and 0.8527 in the second and final phase of VIP Cup 2023, scoring 3.8% higher than the next-best solution.
    Abstract This report outlines our approach in the IEEE SPS VIP Cup 2023: Ophthalmic Biomarker Detection competition. Our primary objective in this competition was to identify biomarkers from Optical Coherence Tomography (OCT) images obtained from a diverse range of patients. Using robust augmentations and 5-fold cross-validation, we trained two vision transformer-based models: MaxViT and EVA-02, and ensembled them at inference time. We find MaxViT's use of convolution layers followed by strided attention to be better suited for the detection of local features while EVA-02's use of normal attention mechanism and knowledge distillation is better for detecting global features. Ours was the best-performing solution in the competition, achieving a patient-wise F1 score of 0.814 in the first phase and 0.8527 in the second and final phase of VIP Cup 2023, scoring 3.8% higher than the next-best solution.
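Inference-time ensembling of the two models can be as simple as averaging their per-biomarker probabilities; a sketch (the exact ensembling rule the team used is not specified here, so plain averaging is an assumption):

```python
import numpy as np

def ensemble_predict(prob_a, prob_b, threshold=0.5):
    """Average two models' per-biomarker probabilities, then threshold
    to obtain binary biomarker-presence predictions."""
    avg = (np.asarray(prob_a, dtype=float) + np.asarray(prob_b, dtype=float)) / 2.0
    return (avg >= threshold).astype(int)
```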

Bi-discriminator Domain Adversarial Neural Networks with Class-Level Gradient Alignment

  • paper_url: http://arxiv.org/abs/2310.13959
  • repo_url: None
  • paper_authors: Chuang Zhao, Hongke Zhao, Hengshu Zhu, Zhenya Huang, Nan Feng, Enhong Chen, Hui Xiong
  • for: This work addresses unsupervised domain adaptation, transferring rich knowledge from an annotated source domain to an unlabeled target domain that shares the same label space.
  • methods: A bi-discriminator domain adversarial network with class-level gradient alignment (BACG) uses gradient signals and second-order probability estimation; pseudo-labels from an optimizable nearest neighbor algorithm align the discriminators' backward gradients at the class level, and a Multinomial Dirichlet hierarchical model infers class probabilities and sample uncertainty.
  • results: Experiments and theoretical analysis on four benchmark data sets show that the method aligns domain distributions more reliably than existing approaches, mitigating misestimation of out-of-distribution samples; a memory bank-based variant, Fast-BACG, greatly shortens training at a minor cost in accuracy.
    Abstract Unsupervised domain adaptation aims to transfer rich knowledge from the annotated source domain to the unlabeled target domain with the same label space. One prevalent solution is the bi-discriminator domain adversarial network, which strives to identify target domain samples outside the support of the source domain distribution and enforces their classification to be consistent on both discriminators. Despite being effective, agnostic accuracy and overconfident estimation for out-of-distribution samples hinder its further performance improvement. To address the above challenges, we propose a novel bi-discriminator domain adversarial neural network with class-level gradient alignment, i.e. BACG. BACG resorts to gradient signals and second-order probability estimation for better alignment of domain distributions. Specifically, for accuracy-awareness, we first design an optimizable nearest neighbor algorithm to obtain pseudo-labels of samples in the target domain, and then enforce the backward gradient approximation of the two discriminators at the class level. Furthermore, following evidential learning theory, we transform the traditional softmax-based optimization method into a Multinomial Dirichlet hierarchical model to infer the class probability distribution as well as samples uncertainty, thereby alleviating misestimation of out-of-distribution samples and guaranteeing high-quality classes alignment. In addition, inspired by contrastive learning, we develop a memory bank-based variant, i.e. Fast-BACG, which can greatly shorten the training process at the cost of a minor decrease in accuracy. Extensive experiments and detailed theoretical analysis on four benchmark data sets validate the effectiveness and robustness of our algorithm.
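The pseudo-labeling step can be illustrated with a plain (non-optimizable) nearest-neighbor assignment — a simplification of the paper's optimizable algorithm:

```python
import numpy as np

def nn_pseudo_labels(target_feats, source_feats, source_labels):
    """Label each target sample with the class of its nearest source
    sample in feature space (squared Euclidean distance)."""
    d = ((target_feats[:, None, :] - source_feats[None, :, :]) ** 2).sum(-1)
    return source_labels[d.argmin(axis=1)]
```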

Competitive Ensembling Teacher-Student Framework for Semi-Supervised Left Atrium MRI Segmentation

  • paper_url: http://arxiv.org/abs/2310.13955
  • repo_url: None
  • paper_authors: Yuyan Shi, Yichi Zhang, Shasha Wang
  • for: This paper focuses on applying semi-supervised learning to medical image segmentation, particularly segmentation of the Left Atrium (LA).
  • methods: It proposes a simple yet efficient teacher-student framework in which two student models, subjected to different task-level perturbations, learn from each other under the guidance of a teacher model; a competitive ensembling strategy then fuses the more reliable information into the teacher.
  • results: Evaluated on the public LA dataset, the method achieves impressive performance by exploiting unlabeled data effectively, outperforming several state-of-the-art semi-supervised methods.
    Abstract Semi-supervised learning has greatly advanced medical image segmentation since it effectively alleviates the need of acquiring abundant annotations from experts and utilizes unlabeled data which is much easier to acquire. Among existing perturbed consistency learning methods, mean-teacher model serves as a standard baseline for semi-supervised medical image segmentation. In this paper, we present a simple yet efficient competitive ensembling teacher student framework for semi-supervised for left atrium segmentation from 3D MR images, in which two student models with different task-level disturbances are introduced to learn mutually, while a competitive ensembling strategy is performed to ensemble more reliable information to teacher model. Different from the one-way transfer between teacher and student models, our framework facilitates the collaborative learning procedure of different student models with the guidance of teacher model and motivates different training networks for a competitive learning and ensembling procedure to achieve better performance. We evaluate our proposed method on the public Left Atrium (LA) dataset and it obtains impressive performance gains by exploiting the unlabeled data effectively and outperforms several existing semi-supervised methods.
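The mean-teacher baseline the abstract references updates the teacher as an exponential moving average (EMA) of the student weights; a minimal sketch, with an illustrative decay value:

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """Mean-teacher EMA update: teacher <- decay * teacher + (1 - decay) * student.

    The teacher therefore tracks a smoothed trajectory of the student,
    which is what makes its predictions a stable consistency target.
    """
    return [decay * t + (1 - decay) * s
            for t, s in zip(teacher_params, student_params)]

# Toy 1-parameter example with decay=0.5 for visible movement.
teacher = [1.0]
student = [0.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
# teacher[0] has moved 1.0 -> 0.5 -> 0.25 -> 0.125 toward the student
```

In the competitive-ensembling variant described above, the teacher additionally ensembles information from two differently perturbed students rather than from a single one.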

Fuzzy-NMS: Improving 3D Object Detection with Fuzzy Classification in NMS

  • paper_url: http://arxiv.org/abs/2310.13951
  • repo_url: None
  • paper_authors: Li Wang, Xinyu Zhang, Fachuan Zhao, Chuze Wu, Yichen Wang, Ziying Song, Lei Yang, Jun Li, Huaping Liu
  • for: Improving 3D object detection accuracy by reducing uncertainty in the NMS post-processing step
  • methods: Introduces fuzzy learning into NMS and proposes a generalized Fuzzy-NMS module for finer candidate-box filtering
  • results: Significantly improves numerous recent NMS-based detectors, especially on small objects such as pedestrians and bicycles, without retraining and with no noticeable increase in inference time
    Abstract Non-maximum suppression (NMS) is an essential post-processing module used in many 3D object detection frameworks to remove overlapping candidate bounding boxes. However, an overreliance on classification scores and difficulties in determining appropriate thresholds can affect the resulting accuracy directly. To address these issues, we introduce fuzzy learning into NMS and propose a novel generalized Fuzzy-NMS module to achieve finer candidate bounding box filtering. The proposed Fuzzy-NMS module combines the volume and clustering density of candidate bounding boxes, refining them with a fuzzy classification method and optimizing the appropriate suppression thresholds to reduce uncertainty in the NMS process. Adequate validation experiments are conducted using the mainstream KITTI and large-scale Waymo 3D object detection benchmarks. The results of these tests demonstrate the proposed Fuzzy-NMS module can improve the accuracy of numerous recently NMS-based detectors significantly, including PointPillars, PV-RCNN, and IA-SSD, etc. This effect is particularly evident for small objects such as pedestrians and bicycles. As a plug-and-play module, Fuzzy-NMS does not need to be retrained and produces no obvious increases in inference time.
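The hard-threshold behaviour that Fuzzy-NMS aims to soften is classic greedy non-maximum suppression, keyed purely on classification scores and a fixed IoU threshold. A minimal sketch with toy boxes and an illustrative threshold:

```python
def iou(a, b):
    """Axis-aligned IoU for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every remaining
    box whose IoU with it exceeds the fixed threshold, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
# boxes 0 and 1 overlap heavily (IoU ~ 0.68), so box 1 is suppressed
```

Fuzzy-NMS replaces this single hard threshold with a fuzzy classification over box volume and clustering density, so the suppression decision adapts per candidate instead of relying on one global cutoff.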

Adversarial Image Generation by Spatial Transformation in Perceptual Colorspaces

  • paper_url: http://arxiv.org/abs/2310.13950
  • repo_url: https://github.com/ayberkydn/stadv-torch
  • paper_authors: Ayberk Aydin, Alptekin Temizel
  • for: This paper proposes a colorspace-based method for targeted white-box adversarial attacks on deep neural networks.
  • methods: The method generates adversarial examples via spatial transformations, shifting pixel locations independently in the chrominance channels of perceptual colorspaces, instead of adding perturbations or manipulating pixel values directly.
  • results: Experiments show competitive fooling rates with very high confidence in the targeted white-box setting, along with favorable approximate perceptual distance between benign and adversarially generated images.
    Abstract Deep neural networks are known to be vulnerable to adversarial perturbations. The amount of these perturbations are generally quantified using $L_p$ metrics, such as $L_0$, $L_2$ and $L_\infty$. However, even when the measured perturbations are small, they tend to be noticeable by human observers since $L_p$ distance metrics are not representative of human perception. On the other hand, humans are less sensitive to changes in colorspace. In addition, pixel shifts in a constrained neighborhood are hard to notice. Motivated by these observations, we propose a method that creates adversarial examples by applying spatial transformations, which creates adversarial examples by changing the pixel locations independently to chrominance channels of perceptual colorspaces such as $YC_{b}C_{r}$ and $CIELAB$, instead of making an additive perturbation or manipulating pixel values directly. In a targeted white-box attack setting, the proposed method is able to obtain competitive fooling rates with very high confidence. The experimental evaluations show that the proposed method has favorable results in terms of approximate perceptual distance between benign and adversarially generated images. The source code is publicly available at https://github.com/ayberkydn/stadv-torch
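The core idea — spatially displacing only the chrominance channels while leaving luma intact — can be sketched with a toy integer shift standing in for the learned dense flow field. The BT.601 conversion is standard; `shift_chroma` is an illustrative simplification, not the paper's actual optimization.

```python
import numpy as np

def rgb_to_ycbcr(img):
    """Full-range ITU-R BT.601 RGB -> YCbCr conversion."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def shift_chroma(ycbcr, dx=1, dy=0):
    """Spatially shift only the chrominance channels (Cb, Cr), leaving
    the luma channel untouched -- a toy version of the paper's
    chrominance-only spatial transformation."""
    out = ycbcr.copy()
    out[..., 1] = np.roll(ycbcr[..., 1], shift=(dy, dx), axis=(0, 1))
    out[..., 2] = np.roll(ycbcr[..., 2], shift=(dy, dx), axis=(0, 1))
    return out

img = np.random.default_rng(0).uniform(0, 255, size=(8, 8, 3))
ycc = rgb_to_ycbcr(img)
adv = shift_chroma(ycc, dx=1)
# luma channel is identical; chroma channels moved by one pixel
```

Because human vision is far less sensitive to chroma displacement than to luma changes, such perturbations can stay visually inconspicuous even when they are large in $L_p$ terms.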

Learning Motion Refinement for Unsupervised Face Animation

  • paper_url: http://arxiv.org/abs/2310.13912
  • repo_url: https://github.com/jialetao/mrfa
  • paper_authors: Jiale Tao, Shuhang Gu, Wen Li, Lixin Duan
  • for: Generating a human face video from the appearance of a source image while mimicking the motion of a driving video.
  • methods: A new unsupervised face animation approach that learns coarse and finer motions simultaneously: a local affine motion model captures the global coarse motion, while a novel motion refinement module compensates for it in local regions; the refinement is learned from the dense correlation between the source and driving images.
  • results: Extensive experiments on widely used benchmarks achieve state-of-the-art results.
    Abstract Unsupervised face animation aims to generate a human face video based on the appearance of a source image, mimicking the motion from a driving video. Existing methods typically adopted a prior-based motion model (e.g., the local affine motion model or the local thin-plate-spline motion model). While it is able to capture the coarse facial motion, artifacts can often be observed around the tiny motion in local areas (e.g., lips and eyes), due to the limited ability of these methods to model the finer facial motions. In this work, we design a new unsupervised face animation approach to learn simultaneously the coarse and finer motions. In particular, while exploiting the local affine motion model to learn the global coarse facial motion, we design a novel motion refinement module to compensate for the local affine motion model for modeling finer face motions in local areas. The motion refinement is learned from the dense correlation between the source and driving images. Specifically, we first construct a structure correlation volume based on the keypoint features of the source and driving images. Then, we train a model to generate the tiny facial motions iteratively from low to high resolution. The learned motion refinements are combined with the coarse motion to generate the new image. Extensive experiments on widely used benchmarks demonstrate that our method achieves the best results among state-of-the-art baselines.
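The structure correlation volume described above can be sketched as all-pairs cosine similarity between source and driving feature maps; this toy version uses raw feature maps in place of the paper's keypoint features.

```python
import numpy as np

def correlation_volume(feat_src, feat_drv):
    """All-pairs correlation between two (C, H, W) feature maps.

    Returns an (H*W, H*W) volume whose entry (i, j) is the cosine
    similarity between source location i and driving location j --
    the raw signal from which fine motion can be inferred.
    """
    c, h, w = feat_src.shape
    a = feat_src.reshape(c, -1)
    b = feat_drv.reshape(c, -1)
    a = a / (np.linalg.norm(a, axis=0, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=0, keepdims=True) + 1e-8)
    return a.T @ b

rng = np.random.default_rng(1)
vol = correlation_volume(rng.normal(size=(16, 4, 4)),
                         rng.normal(size=(16, 4, 4)))
# vol.shape == (16, 16); cosine similarities lie in [-1, 1]
```

In the paper's coarse-to-fine scheme, such a volume is consumed by a network that iteratively predicts tiny residual motions from low to high resolution, which are then added to the coarse affine motion.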

Exploring Driving Behavior for Autonomous Vehicles Based on Gramian Angular Field Vision Transformer

  • paper_url: http://arxiv.org/abs/2310.13906
  • repo_url: None
  • paper_authors: Junwei You, Ying Chen, Zhuoyu Jiang, Zhangchi Liu, Zilin Huang, Yifeng Ding, Bin Ran
  • for: This study proposes an effective method for classifying autonomous vehicle (AV) driving behavior, in order to help diagnose AV operation faults, improve autonomous driving algorithms, and reduce accident rates.
  • methods: It introduces the Gramian Angular Field Vision Transformer (GAF-ViT) model, comprising three key components: a GAF Transformer module, a channel attention module, and a multi-channel ViT module; together these convert multivariate behavior sequences into multi-channel images and apply image-recognition techniques for behavior classification.
  • results: Experiments on trajectory data from the Waymo Open Dataset show that GAF-ViT achieves state-of-the-art performance; an ablation study further substantiates the efficacy of its individual modules.
    Abstract Effective classification of autonomous vehicle (AV) driving behavior emerges as a critical area for diagnosing AV operation faults, enhancing autonomous driving algorithms, and reducing accident rates. This paper presents the Gramian Angular Field Vision Transformer (GAF-ViT) model, designed to analyze AV driving behavior. The proposed GAF-ViT model consists of three key components: GAF Transformer Module, Channel Attention Module, and Multi-Channel ViT Module. These modules collectively convert representative sequences of multivariate behavior into multi-channel images and employ image recognition techniques for behavior classification. A channel attention mechanism is applied to multi-channel images to discern the impact of various driving behavior features. Experimental evaluation on the Waymo Open Dataset of trajectories demonstrates that the proposed model achieves state-of-the-art performance. Furthermore, an ablation study effectively substantiates the efficacy of individual modules within the model.
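The Gramian Angular Field transform that turns each behavior time series into an image follows a standard recipe: rescale the series to [-1, 1], map values to angles via arccos, and form the pairwise-cosine matrix. A minimal sketch of the summation variant:

```python
import numpy as np

def gramian_angular_field(series):
    """Gramian Angular Summation Field of a 1-D series.

    Rescale to [-1, 1], encode each value as an angle
    phi = arccos(x), and build G[i, j] = cos(phi_i + phi_j),
    yielding an image that preserves temporal correlations.
    """
    x = np.asarray(series, dtype=float)
    x = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    return np.cos(phi[:, None] + phi[None, :])

g = gramian_angular_field([0.0, 1.0, 2.0, 3.0])
# g is symmetric; for this series g[0, 0] = cos(2*pi) = 1
# and g[0, 3] = cos(pi + 0) = -1
```

Stacking one such field per variable (speed, heading, acceleration, etc.) produces the multi-channel image that the ViT backbone then classifies.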

Multimodal Transformer Using Cross-Channel attention for Object Detection in Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2310.13876
  • repo_url: None
  • paper_authors: Bissmella Bahaduri, Zuheng Ming, Fangchen Feng, Anissa Mokraou
  • for: This work aims to improve object-detection accuracy in remote sensing images and to address the task's specific challenges, such as the scarcity of labeled data and the small objects present against vast backgrounds in high-resolution imagery.
  • methods: It proposes a multimodal transformer with a cross-channel attention module to fuse multi-source remote sensing data; unlike channel-wise concatenation, the module learns relationships between channels, aligning the modalities at an early stage. It also introduces a new Swin-transformer-based architecture that incorporates convolution layers in non-shifting blocks while maintaining fixed dimensions, producing fine-to-coarse representations with a favorable accuracy-computation trade-off.
  • results: Extensive experiments demonstrate the effectiveness of the proposed multimodal fusion module and architecture on multimodal aerial imagery.
    Abstract Object detection in Remote Sensing Images (RSI) is a critical task for numerous applications in Earth Observation (EO). Unlike general object detection, object detection in RSI has specific challenges: 1) the scarcity of labeled data in RSI compared to general object detection datasets, and 2) the small objects presented in a high-resolution image with a vast background. To address these challenges, we propose a multimodal transformer exploring multi-source remote sensing data for object detection. Instead of directly combining the multimodal input through a channel-wise concatenation, which ignores the heterogeneity of different modalities, we propose a cross-channel attention module. This module learns the relationship between different channels, enabling the construction of a coherent multimodal input by aligning the different modalities at the early stage. We also introduce a new architecture based on the Swin transformer that incorporates convolution layers in non-shifting blocks while maintaining fixed dimensions, allowing for the generation of fine-to-coarse representations with a favorable accuracy-computation trade-off. The extensive experiments prove the effectiveness of the proposed multimodal fusion module and architecture, demonstrating their applicability to multimodal aerial imagery.
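A generic channel-attention mechanism of the squeeze-and-excitation kind can be sketched as follows. This illustrates the general idea of learning per-channel weights rather than concatenating modalities blindly; it is not the paper's exact cross-channel module, and the weight shapes are illustrative.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Squeeze-and-excitation style channel weighting for a (C, H, W)
    feature map: global average pool -> bottleneck MLP -> sigmoid gates,
    then rescale each channel by its learned gate."""
    squeeze = x.mean(axis=(1, 2))                  # (C,) global descriptor
    hidden = np.maximum(0.0, w1 @ squeeze)         # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gates in (0, 1)
    return x * gates[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))     # e.g. stacked multimodal channels
w1 = rng.normal(size=(2, 8))       # reduction ratio 4 (8 -> 2)
w2 = rng.normal(size=(8, 2))
y = channel_attention(x, w1, w2)
# y keeps the input shape; each channel is scaled by a gate in (0, 1)
```

Applied across channels drawn from different modalities, such gates let the network emphasize the more informative source at each fusion step instead of weighting all channels equally.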