2023-11-09

cs.CV

cs.CV - 2023-11-09

Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter

paper_url: http://arxiv.org/abs/2311.05779
repo_url: https://github.com/gtziafas/ocid-vlg
paper_authors: Georgios Tziafas, Yucheng Xu, Arushi Goel, Mohammadreza Kasaei, Zhibin Li, Hamidreza Kasaei
for: 这个论文的目的是提出一种基于图像拓展和抓取技能的机器人操作方法，以便在人类环境中有效地操作物品根据用户的指令。
methods: 该论文使用了一种novel end-to-end模型（CROG），其利用CLIP的视觉固定技能来学习图像-文本对的抓取合成。
results: 实验结果表明，与已经存在的多个阶段管道相比，CROG在复杂的自然indoor场景中表现出了显著的改进，并且在实验中在simulation和硬件上都达到了出色的效果。

Abstract
Robots operating in human-centric environments require the integration of visual grounding and grasping capabilities to effectively manipulate objects based on user instructions. This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes. Existing approaches often employ multi-stage pipelines that first segment the referred object and then propose a suitable grasp, and are evaluated in private datasets or simulators that do not capture the complexity of natural indoor scenes. To address these limitations, we develop a challenging benchmark based on cluttered indoor scenes from OCID dataset, for which we generate referring expressions and connect them with 4-DoF grasp poses. Further, we propose a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs. Our results show that vanilla integration of CLIP with pretrained models transfers poorly in our challenging benchmark, while CROG achieves significant improvements both in terms of grounding and grasping. Extensive robot experiments in both simulation and hardware demonstrate the effectiveness of our approach in challenging interactive object grasping scenarios that include clutter.

摘要
人类环境中运行的机器人需要视觉固定和抓取功能的集成，以根据用户指令有效地抓取物品。这项工作关注于对自然语言中引用的物体进行抓取预测，称为引用抓取合成。现有的方法 oftentimes 使用多个阶段管道，先 segment 引用的物体，然后提出适当的抓取，并在私有数据集或模拟器中进行评估，这些数据集并不能准确反映自然的室内场景。为了解决这些限制，我们开发了一个具有各种挑战的benchmark，基于OCID数据集中的拥挤的室内场景，并生成了引用表达和4个自由度的抓取 pose。此外，我们提出了一种新的端到端模型（CROG），利用 CLIP 的视觉固定能力来学习直接从图像-文本对的 grasp synthesis。我们的结果显示，将 CLIP 与预训练模型直接集成不会在我们的挑战性 benchmark 中进行好转移，而 CROG 在图像-文本对中的 grasping 和固定方面都具有显著改进。在 simulation 和硬件中的机器人实验中，我们发现了 CROG 在拥挤的交互式物品抓取场景中的有效性。

PolyMaX: General Dense Prediction with Mask Transformer

paper_url: http://arxiv.org/abs/2311.05770
repo_url: https://github.com/google-research/deeplab2
paper_authors: Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Liang-Chieh Chen
for: The paper is written for dense prediction tasks such as semantic segmentation, depth estimation, and surface normal prediction.
methods: The paper proposes a method based on the cluster-prediction paradigm, which is inspired by the success of DORN and AdaBins in depth estimation. The method discretizes the continuous output space and unifies dense prediction tasks with the mask transformer framework.
results: The proposed method, PolyMaX, demonstrates state-of-the-art performance on three benchmarks of the NYUD-v2 dataset.Here’s the simplified Chinese text for the three key points:
for: 这篇论文是为 dense prediction 任务 such as semantic segmentation, depth estimation, 和 surface normal prediction 写的。
methods: 这篇论文提出了基于 cluster-prediction 的方法，它是以 DORN 和 AdaBins 在 depth estimation 中的成功为 inspirations。该方法是将 continuous output space 精确化，并将 dense prediction 任务与 mask transformer 框架集成。
results: 提议的方法 PolyMaX 在 NYUD-v2 dataset 上的三个 benchmark 上达到了 state-of-the-art 性能。

Abstract
Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has remained popular due to the prevalence of fully convolutional networks. However, on the recent frontier of segmentation task, the community has been witnessing a shift of paradigm from per-pixel prediction to cluster-prediction with the emergence of transformer architectures, particularly the mask transformers, which directly predicts a label for a mask instead of a pixel. Despite this shift, methods based on the per-pixel prediction paradigm still dominate the benchmarks on the other dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction. Motivated by the success of DORN and AdaBins in depth estimation, achieved by discretizing the continuous output space, we propose to generalize the cluster-prediction based method to general dense prediction tasks. This allows us to unify dense prediction tasks with the mask transformer framework. Remarkably, the resulting model PolyMaX demonstrates state-of-the-art performance on three benchmarks of NYUD-v2 dataset. We hope our simple yet effective design can inspire more research on exploiting mask transformers for more dense prediction tasks. Code and model will be made available.

摘要
dense prediction 任务，如semantic segmentation，depth estimation，和surface normal prediction，可以容易地表示为每像素分类（离散输出）或回归（连续输出）。这种每像素预测模式在全 convolutional networks 的普及下保持流行。然而，在最近的 segmentation 任务中，社区却目睹了一种新的 paradigm 的转变，即 directly predicting a label for a mask instead of a pixel。 despite this shift, methods based on the per-pixel prediction paradigm still dominate the benchmarks on other dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction.motivated by the success of DORN and AdaBins in depth estimation, achieved by discretizing the continuous output space, we propose to generalize the cluster-prediction based method to general dense prediction tasks. this allows us to unify dense prediction tasks with the mask transformer framework. remarkably, the resulting model PolyMaX demonstrates state-of-the-art performance on three benchmarks of NYUD-v2 dataset. we hope our simple yet effective design can inspire more research on exploiting mask transformers for more dense prediction tasks. code and model will be made available.Here is the word-for-word translation of the text into Simplified Chinese: dense prediction 任务，如semantic segmentation，depth estimation，和surface normal prediction，可以容易地表示为每像素分类（离散输出）或回归（连续输出）。这种每像素预测模式在全 convolutional networks 的普及下保持流行。然而，在最近的 segmentation 任务中，社区却目睹了一种新的 paradigm 的转变，即直接预测一个 mask 的标签 instead of a pixel。 despite this shift, methods based on the per-pixel prediction paradigm still dominate the benchmarks on other dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction.motivated by the success of DORN and AdaBins in depth estimation, achieved by discretizing the continuous output space, we propose to generalize the cluster-prediction based method to general dense prediction tasks. this allows us to unify dense prediction tasks with the mask transformer framework. remarkably, the resulting model PolyMaX demonstrates state-of-the-art performance on three benchmarks of NYUD-v2 dataset. we hope our simple yet effective design can inspire more research on exploiting mask transformers for more dense prediction tasks. code and model will be made available.

GIPCOL: Graph-Injected Soft Prompting for Compositional Zero-Shot Learning

paper_url: http://arxiv.org/abs/2311.05729
repo_url: https://github.com/hlr/gipcol
paper_authors: Guangyue Xu, Joyce Chai, Parisa Kordjamshidi
for: 本研究旨在提高vision-language模型（VLM）在无supervision zero-shot learning（CZSL）中的表现，特别是通过提出Prompt Learning paradigm。
methods: 本研究提出了Graph-Injected Soft Prompting for COmpositional Learning（GIP-COL）方法，其中包括在soft prompt中添加结构化的 prefix learnable vectors、 attribute label和object label。此外， attribute和object labels在soft prompt中被设置为 compositional graph中的节点，该图由基于训练数据中的对象和属性的compositional结构而构建。
results: 与前一代non-CLIP和CLIP-based方法相比，GIP-COL在MIT-States、UT-Zappos和C-GQA数据集上 achieved state-of-the-art AUCResults在closed和open settings中。我们还进行了分析，发现GIP-COL在CLIP backbone和训练数据上的限制下运行得非常好，这些发现有助于设计更有效的prompt дляCZSL。

Abstract
Pre-trained vision-language models (VLMs) have achieved promising success in many fields, especially with prompt learning paradigm. In this work, we propose GIP-COL (Graph-Injected Soft Prompting for COmpositional Learning) to better explore the compositional zero-shot learning (CZSL) ability of VLMs within the prompt-based learning framework. The soft prompt in GIPCOL is structured and consists of the prefix learnable vectors, attribute label and object label. In addition, the attribute and object labels in the soft prompt are designated as nodes in a compositional graph. The compositional graph is constructed based on the compositional structure of the objects and attributes extracted from the training data and consequently feeds the updated concept representation into the soft prompt to capture this compositional structure for a better prompting for CZSL. With the new prompting strategy, GIPCOL achieves state-of-the-art AUC results on all three CZSL benchmarks, including MIT-States, UT-Zappos, and C-GQA datasets in both closed and open settings compared to previous non-CLIP as well as CLIP-based methods. We analyze when and why GIPCOL operates well given the CLIP backbone and its training data limitations, and our findings shed light on designing more effective prompts for CZSL

摘要
Pre-trained vision-language models (VLMs) 已经在多个领域取得了出色的成绩，特别是使用 prompt 学习模式。在这项工作中，我们提出了 GIP-COL（图像注入软提示 дляcompositional learning），以更好地探索vision-language模型在prompt-based学习框架中的compositional zero-shot learning（CZSL）能力。 GIP-COL 的软提示结构有序，包括预fix learnable vectors、特征标签和对象标签。另外，特征和对象标签在软提示中被设置为图像compositional结构中的节点。图像compositional结构是基于图像和特征的训练数据中提取的对象和特征的compositional结构，从而将更新的概念表示feed into软提示，以捕捉这种compositional结构，为CZSL提供更好的提示。与此同时，GIP-COL 在三个 CZSL benchmark上（MIT-States、UT-Zappos 和 C-GQA 数据集）达到了当前最高的 AUC 结果，在关闭和开放设置下比前一些非 CLIP 以及 CLIP 基于方法更高。我们分析了 GIP-COL 在 CLIP 基础上和训练数据的限制下操作的情况，并发现了设计更有效的提示的关键，这些发现有助于设计更好的 CZSL 方法。

Whole-body Detection, Recognition and Identification at Altitude and Range

paper_url: http://arxiv.org/abs/2311.05725
repo_url: None
paper_authors: Siyuan Huang, Ram Prabhakar Kathirvel, Chun Pong Lau, Rama Chellappa
for: 总体来说，本研究旨在解决距离范围在500米，大角度 pitch angle 达50度的全身生物metric检测、识别和识别。
methods: 我们提出了一个端到端系统，包括预训练检测器在常见图像集上，并在BRIAR dataset上进行精度调整。在检测后，我们提取了身体图像，并使用特征提取器进行识别。
results: 我们进行了多种环境下的广泛评估，包括indoor、outdoor和飞行场景。我们的方法在不同范围和角度下具有优秀的性能，包括识别精度和真实接受率在低假接受率下的比较优异表现。在一个测试集中，我们的模型在100名测试者中 дости到了75.13%的排名20的识别率，并达到了54.09%的TAR@1%FAR。

Abstract
In this paper, we address the challenging task of whole-body biometric detection, recognition, and identification at distances of up to 500m and large pitch angles of up to 50 degree. We propose an end-to-end system evaluated on diverse datasets, including the challenging Biometric Recognition and Identification at Range (BRIAR) dataset. Our approach involves pre-training the detector on common image datasets and fine-tuning it on BRIAR's complex videos and images. After detection, we extract body images and employ a feature extractor for recognition. We conduct thorough evaluations under various conditions, such as different ranges and angles in indoor, outdoor, and aerial scenarios. Our method achieves an average F1 score of 98.29% at IoU = 0.7 and demonstrates strong performance in recognition accuracy and true acceptance rate at low false acceptance rates compared to existing models. On a test set of 100 subjects with 444 distractors, our model achieves a rank-20 recognition accuracy of 75.13% and a TAR@1%FAR of 54.09%.

摘要
在本文中，我们 Addressing the challenging task of whole-body biometric detection, recognition, and identification at distances of up to 500m and large pitch angles of up to 50 degree. We propose an end-to-end system evaluated on diverse datasets, including the challenging Biometric Recognition and Identification at Range (BRIAR) dataset. Our approach involves pre-training the detector on common image datasets and fine-tuning it on BRIAR's complex videos and images. After detection, we extract body images and employ a feature extractor for recognition. We conduct thorough evaluations under various conditions, such as different ranges and angles in indoor, outdoor, and aerial scenarios. Our method achieves an average F1 score of 98.29% at IoU = 0.7 and demonstrates strong performance in recognition accuracy and true acceptance rate at low false acceptance rates compared to existing models. On a test set of 100 subjects with 444 distractors, our model achieves a rank-20 recognition accuracy of 75.13% and a TAR@1%FAR of 54.09%.Here's the word-for-word translation:在本文中，我们对整体生物ometrics检测、识别和标识距离达到500m，大角度达到50度的挑战任务进行 Addressing.我们提出了一个终端系统，并在多种数据集上进行评估，包括Biometric Recognition and Identification at Range (BRIAR) 数据集。我们的方法包括在常见图像集上预训练检测器，并在BRIAR的复杂视频和图像上终端训练。检测后，我们提取身体图像，并使用特征提取器进行识别。我们进行了各种情况下的全面评估，包括不同的距离和角度在室内、室外和航空enario中。我们的方法在IoU = 0.7下 achieve an average F1 score of 98.29%，并在识别精度和真实接受率下示出了与现有模型相比的强性表现。在444个干扰物中，我们的模型在100名测试者中 achieve a rank-20 recognition accuracy of 75.13% and a TAR@1%FAR of 54.09%.

Intelligent Cervical Spine Fracture Detection Using Deep Learning Methods

paper_url: http://arxiv.org/abs/2311.05708
repo_url: None
paper_authors: Reza Behbahani Nejad, Amir Hossein Komijani, Esmaeil Najafi
for: 骨折检测（cervical spine fractures detection）
methods: 两stagepipeline，包括图像和图像元数据的多输入神经网络（Global Context Vision Transformer）和YOLOv8模型（YOLOv5）
results: 提高骨折检测精度，减少放射学家的工作负担

Abstract
Cervical spine fractures constitute a critical medical emergency, with the potential for lifelong paralysis or even fatality if left untreated or undetected. Over time, these fractures can deteriorate without intervention. To address the lack of research on the practical application of deep learning techniques for the detection of spine fractures, this study leverages a dataset containing both cervical spine fractures and non-fractured computed tomography images. This paper introduces a two-stage pipeline designed to identify the presence of cervical vertebrae in each image slice and pinpoint the location of fractures. In the first stage, a multi-input network, incorporating image and image metadata, is trained. This network is based on the Global Context Vision Transformer, and its performance is benchmarked against popular deep learning image classification model. In the second stage, a YOLOv8 model is trained to detect fractures within the images, and its effectiveness is compared to YOLOv5. The obtained results indicate that the proposed algorithm significantly reduces the workload of radiologists and enhances the accuracy of fracture detection.

摘要
脊椎骨折是一种严重的医疗紧急情况，可能导致永久性肢体瘫痪或even fatality if left untreated or undetected. Over time, these fractures can deteriorate without intervention. 为了Addressing the lack of research on the practical application of deep learning techniques for the detection of spine fractures, this study leverages a dataset containing both cervical spine fractures and non-fractured computed tomography images. This paper introduces a two-stage pipeline designed to identify the presence of cervical vertebrae in each image slice and pinpoint the location of fractures.在first stage, a multi-input network, incorporating image and image metadata, is trained. This network is based on the Global Context Vision Transformer, and its performance is benchmarked against popular deep learning image classification model. In the second stage, a YOLOv8 model is trained to detect fractures within the images, and its effectiveness is compared to YOLOv5. The obtained results indicate that the proposed algorithm significantly reduces the workload of radiologists and enhances the accuracy of fracture detection.

FMViT: A multiple-frequency mixing Vision Transformer

paper_url: http://arxiv.org/abs/2311.05707
repo_url: None
paper_authors: Wei Tan, Yifeng Geng, Xuansong Xie
for: 提高计算效率和准确率的计算机视觉任务模型
methods: 提出一种高效的混合模型，即FMViT，通过混合高频和低频特征，以及采用 deploy-friendly 机制，如 gMLP、RLMHSA 和 CFB，提高模型的表达力和实现效率
results: FMViT 在各种计算机视觉任务上超越了现有的 CNN、ViT 和 CNN-Transformer 混合模型，并且在 TensorRT 和 CoreML 平台上实现了更高的准确率和更低的计算开销。例如，在 ImageNet 数据集上，FMViT 在 TensorRT 平台上超越 Resnet101 的 top-1 准确率，并且与 EfficientNet-B5 的表现相似，但具有43% 的计算速度提升。在 CoreML 平台上，FMViT 超越 MobileOne，并且与 MobileOne 的计算开销相似（78.5% vs. 75.9%）。

Abstract
The transformer model has gained widespread adoption in computer vision tasks in recent times. However, due to the quadratic time and memory complexity of self-attention, which is proportional to the number of input tokens, most existing Vision Transformers (ViTs) encounter challenges in achieving efficient performance in practical industrial deployment scenarios, such as TensorRT and CoreML, where traditional CNNs excel. Although some recent attempts have been made to design CNN-Transformer hybrid architectures to tackle this problem, their overall performance has not met expectations. To tackle these challenges, we propose an efficient hybrid ViT architecture named FMViT. This approach enhances the model's expressive power by blending high-frequency features and low-frequency features with varying frequencies, enabling it to capture both local and global information effectively. Additionally, we introduce deploy-friendly mechanisms such as Convolutional Multigroup Reparameterization (gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and Convolutional Fusion Block (CFB) to further improve the model's performance and reduce computational overhead. Our experiments demonstrate that FMViT surpasses existing CNNs, ViTs, and CNNTransformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks. On the TensorRT platform, FMViT outperforms Resnet101 by 2.5% (83.3% vs. 80.8%) in top-1 accuracy on the ImageNet dataset while maintaining similar inference latency. Moreover, FMViT achieves comparable performance with EfficientNet-B5, but with a 43% improvement in inference speed. On CoreML, FMViT outperforms MobileOne by 2.6% in top-1 accuracy on the ImageNet dataset, with inference latency comparable to MobileOne (78.5% vs. 75.9%). Our code can be found at https://github.com/tany0699/FMViT.

摘要
“ transformer 模型在计算机视觉任务中 gain 广泛的采用，但由于自注意力的quadratic时间和内存复杂度，因此大多数现有的视觉 трансформаer（ViTs）在实际工业部署场景中遇到了效率问题。虽有些 latest 尝试将 CNN 和 transformer 混合成architecture，但其总体性能未达到期望。为解决这些问题，我们提出了一种高效的 hybrid ViT 架构，称为 FMViT。这种方法通过混合不同频率的特征来增强模型的表达力，使其能够有效地捕捉局部和全局信息。此外，我们还引入了可部署的机制，如 Convolutional Multigroup Reparameterization (gMLP)、Lightweight Multi-head Self-Attention (RLMHSA) 和 Convolutional Fusion Block (CFB)，以进一步改善模型的性能并减少计算开销。我们的实验表明，FMViT 超过了现有的 CNN、ViT 和 CNNTransformer 混合架构在各种视觉任务中的精度/效率质量评价。在 TensorRT 平台上，FMViT 超过了 Resnet101 的 top-1 精度（83.3% vs. 80.8%），同时保持相似的推理延迟。此外，FMViT 与 EfficientNet-B5 相当的性能，但具有43%的推理速度提升。在 CoreML 上，FMViT 超过了 MobileOne 的 top-1 精度（78.5% vs. 75.9%），推理延迟与 MobileOne 相似。我们的代码可以在 GitHub 上找到：https://github.com/tany0699/FMViT。”

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

paper_url: http://arxiv.org/abs/2311.05698
repo_url: None
paper_authors: AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova
for: 本研究旨在解决多modal学习中的困难，即将不同类型的输入（视频、音频、文本）结合在一起。
methods: 我们提出了一种分解多modal模型，将其分成两个专注型autoregressive模型，处理输入根据模式的特点。我们还提出了一种 combiner 机制，可以同时提取视频和音频信号的特征，并将其 fusion 为一个Compact但expressive的表示。
results: 我们的方法在多modal Benchmark 上达到了状态的前iers，比较大的模型表现更好，能够有效地控制媒体输入的计算成本，并模型其时间相关性。

Abstract
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title, or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. We here decouple the multimodal modeling, dividing it into separate, focused autoregressive models, processing the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential. To address the long-sequences of the video-audio inputs, we propose to further partition the video and audio sequences in consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features producing compact but expressive representations per snippet. Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by both learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.

摘要
一个主要挑战在多模态学习是将不同类型的模态（如视频、音频、文本）结合在一起。例如，视频和音频通常有更高的速率，并且与文本不同步，文本通常来自全局上下文，如标题或描述。此外，视频和音频输入的量相对较大，随着视频长度增加而增加计算量，这使得模型长距离相互关系更加困难。为了解决这个问题，我们提出了分离多模态模型，将它们分成独立的、专注型 autoregressive 模型，处理输入根据模式的特点。我们提出了一种多模态模型，名为 Mirasol3B，它包括一个时间同步的 autoregressive 组件，以及一个不同步的 autoregressive 组件，用于处理不同步的上下文模式。为了处理长序的视频-音频输入，我们提出了一种 Combiner 机制，可以同时处理视频和音频的原始空间时间信号，并学习提取视频和音频特征，然后将这些特征进行autoregressive处理，生成每个时间桢中的短暂 yet 表达力强的表示。我们的方法在已知的多模态标准准点上达到了状态的极点，超越了许多更大的模型。它有效地解决了媒体输入的高计算需求，通过学习紧凑表示、控制序列长度、和模型时间相互关系。

3DGAUnet: 3D generative adversarial networks with a 3D U-Net based generator to achieve the accurate and effective synthesis of clinical tumor image data for pancreatic cancer

paper_url: http://arxiv.org/abs/2311.05697
repo_url: None
paper_authors: Yu Shi, Hannah Tang, Michael Baine, Michael A. Hollingsworth, Huijing Du, Dandan Zheng, Chi Zhang, Hongfeng Yu
for: 这个研究旨在开发一个基于生成器执行的模型，以生成真实的3D CT影像，帮助提高PDAC肿瘤和胰脏组织的检测和诊断。methods: 这个模型使用生成器网络（GAN）技术，将3D CT影像转换为更加真实的3D CT影像，并且可以生成跨 slice 的资料，以解决现有2D CT影像合成模型所面临的限制。results: 这个模型可以生成高品质的3D CT影像，并且可以帮助提高PDAC肿瘤的检测和诊断。这个模型的发展具有潜在的应用前瞻性，可以帮助解决PDAC肿瘤早期检测的问题，从而提高病人的生存率。

Abstract
Pancreatic ductal adenocarcinoma (PDAC) presents a critical global health challenge, and early detection is crucial for improving the 5-year survival rate. Recent medical imaging and computational algorithm advances offer potential solutions for early diagnosis. Deep learning, particularly in the form of convolutional neural networks (CNNs), has demonstrated success in medical image analysis tasks, including classification and segmentation. However, the limited availability of clinical data for training purposes continues to provide a significant obstacle. Data augmentation, generative adversarial networks (GANs), and cross-validation are potential techniques to address this limitation and improve model performance, but effective solutions are still rare for 3D PDAC, where contrast is especially poor owing to the high heterogeneity in both tumor and background tissues. In this study, we developed a new GAN-based model, named 3DGAUnet, for generating realistic 3D CT images of PDAC tumors and pancreatic tissue, which can generate the interslice connection data that the existing 2D CT image synthesis models lack. Our innovation is to develop a 3D U-Net architecture for the generator to improve shape and texture learning for PDAC tumors and pancreatic tissue. Our approach offers a promising path to tackle the urgent requirement for creative and synergistic methods to combat PDAC. The development of this GAN-based model has the potential to alleviate data scarcity issues, elevate the quality of synthesized data, and thereby facilitate the progression of deep learning models to enhance the accuracy and early detection of PDAC tumors, which could profoundly impact patient outcomes. Furthermore, this model has the potential to be adapted to other types of solid tumors, hence making significant contributions to the field of medical imaging in terms of image processing models.

摘要
《数位对待胆管癌：干扰对待胆管癌早期识别的挑战》Pancreatic ductal adenocarcinoma (PDAC) 是一个全球健康问题，早期识别是提高5年生存率的关键。 latest medical imaging and computational algorithm advances offer potential solutions for early diagnosis. Deep learning, particularly in the form of convolutional neural networks (CNNs), has demonstrated success in medical image analysis tasks, including classification and segmentation. However, the limited availability of clinical data for training purposes continues to provide a significant obstacle. Data augmentation, generative adversarial networks (GANs), and cross-validation are potential techniques to address this limitation and improve model performance, but effective solutions are still rare for 3D PDAC, where contrast is especially poor owing to the high heterogeneity in both tumor and background tissues.In this study, we developed a new GAN-based model, named 3DGAUnet, for generating realistic 3D CT images of PDAC tumors and pancreatic tissue, which can generate the interslice connection data that the existing 2D CT image synthesis models lack. Our innovation is to develop a 3D U-Net architecture for the generator to improve shape and texture learning for PDAC tumors and pancreatic tissue. Our approach offers a promising path to tackle the urgent requirement for creative and synergistic methods to combat PDAC. The development of this GAN-based model has the potential to alleviate data scarcity issues, elevate the quality of synthesized data, and thereby facilitate the progression of deep learning models to enhance the accuracy and early detection of PDAC tumors, which could profoundly impact patient outcomes. Furthermore, this model has the potential to be adapted to other types of solid tumors, hence making significant contributions to the field of medical imaging in terms of image processing models.

Window Attention is Bugged: How not to Interpolate Position Embeddings

paper_url: http://arxiv.org/abs/2311.05613
repo_url: None
paper_authors: Daniel Bolya, Chaitanya Ryali, Judy Hoffman, Christoph Feichtenhofer
for: The paper is written for improving the performance of modern transformer-based computer vision models, specifically addressing the issue of interpolating position embeddings while using window attention.
methods: The paper uses window attention, position embeddings, and high resolution finetuning as core components, and introduces a simple absolute window position embedding strategy to fix the issue of interpolating position embeddings.
results: The paper achieves state-of-the-art performance on the COCO dataset with a model that only uses ImageNet-1k pretraining, achieving 61.7 box mAP with the proposed “absolute win” bug fix.

Abstract
Window attention, position embeddings, and high resolution finetuning are core concepts in the modern transformer era of computer vision. However, we find that naively combining these near ubiquitous components can have a detrimental effect on performance. The issue is simple: interpolating position embeddings while using window attention is wrong. We study two state-of-the-art methods that have these three components, namely Hiera and ViTDet, and find that both do indeed suffer from this bug. To fix it, we introduce a simple absolute window position embedding strategy, which solves the bug outright in Hiera and allows us to increase both speed and performance of the model in ViTDet. We finally combine the two to obtain HieraDet, which achieves 61.7 box mAP on COCO, making it state-of-the-art for models that only use ImageNet-1k pretraining. This all stems from what is essentially a 3 line bug fix, which we name "absolute win".

摘要
窗口注意力、位嵌入和高分辨率调整是现代转换器时代的核心概念，但我们发现将这些组件组合起来可能会导致性能下降。问题的原因很简单：在使用窗口注意力时 interpolate位嵌入是错误的。我们研究了两种现代方法，即Hiera和ViTDet，并发现它们都受到这个漏洞的影响。为解决这个问题，我们提出了一种简单的绝对窗口位嵌入策略，可以解决Hiera中的漏洞，并使ViTDet模型的速度和性能得到提升。最后，我们将Hiera和ViTDet两者结合，得到了HieraDet模型，在COCO数据集上达到了61.7个框的MAP值，成为只使用ImageNet-1k预训练的状态gravity模型。这一成果凭借了一个简单的3行修复，我们称之为“绝对胜利”。

What Do I Hear? Generating Sounds for Visuals with ChatGPT

paper_url: http://arxiv.org/abs/2311.05609
repo_url: None
paper_authors: David Chuan-En Lin, Nikolas Martelaro
for: 这篇论文提出了一种工作流程，用于生成视频媒体中的真实声景。与之前的工作不同，这种方法不仅强调匹配视频上的声音，而且还可以提供不直接可见的声音，以创造一个真实和吸引人的听觉环境。
methods: 我们的方法包括创建场景上下文，brainstorming声音和生成声音。我们利用语言模型，如ChatGPT的推理能力，以便更好地理解和生成声音。
results: 我们的实验结果表明，我们的方法可以生成高质量的声景声音，并且可以帮助制作人更好地描绘和创造听觉环境。

Abstract
This short paper introduces a workflow for generating realistic soundscapes for visual media. In contrast to prior work, which primarily focus on matching sounds for on-screen visuals, our approach extends to suggesting sounds that may not be immediately visible but are essential to crafting a convincing and immersive auditory environment. Our key insight is leveraging the reasoning capabilities of language models, such as ChatGPT. In this paper, we describe our workflow, which includes creating a scene context, brainstorming sounds, and generating the sounds.

摘要
这篇短篇论文介绍了一种工作流程，用于生成真实的声景音频 для视觉媒体。与先前的工作不同，我们的方法不仅仅是匹配屏幕上的视觉元素，而是扩展到建议不可见的声音，以创造一个感人和吸引人的听觉环境。我们的关键发现是利用语言模型的推理能力，如ChatGPT。在这篇论文中，我们描述了我们的工作流程，包括创建场景 контекст、寻思声音和生成声音。

3D-QAE: Fully Quantum Auto-Encoding of 3D Point Clouds

paper_url: http://arxiv.org/abs/2311.05604
repo_url: None
paper_authors: Lakshika Rathi, Edith Tretschk, Christian Theobalt, Rishabh Dabral, Vladislav Golyanik
for: 这个论文的目的是提出一种基于量子计算机的3D点云自动编码器（3D-QAE），用于压缩3D数据。
methods: 该方法使用完全量子的数据处理组件，并在量子硬件上进行训练。具有3D数据正常化和参数优化的核心挑战， authors提出了解决方案。
results: 实验结果表明，该方法比简单的经典基准方法高效，这成功地开启了基于量子计算机的3D计算机视觉领域的新研究方向。Here’s the English version of the paper’s abstract again for reference:”Existing methods for learning 3D representations are deep neural networks trained and tested on classical hardware. Quantum machine learning architectures, despite their theoretically predicted advantages in terms of speed and the representational capacity, have so far not been considered for this problem nor for tasks involving 3D data in general. This paper thus introduces the first quantum auto-encoder for 3D point clouds. Our 3D-QAE approach is fully quantum, i.e. all its data processing components are designed for quantum hardware. It is trained on collections of 3D point clouds to produce their compressed representations. Along with finding a suitable architecture, the core challenges in designing such a fully quantum model include 3D data normalization and parameter optimization, and we propose solutions for both these tasks. Experiments on simulated gate-based quantum hardware demonstrate that our method outperforms simple classical baselines, paving the way for a new research direction in 3D computer vision.”

Abstract
Existing methods for learning 3D representations are deep neural networks trained and tested on classical hardware. Quantum machine learning architectures, despite their theoretically predicted advantages in terms of speed and the representational capacity, have so far not been considered for this problem nor for tasks involving 3D data in general. This paper thus introduces the first quantum auto-encoder for 3D point clouds. Our 3D-QAE approach is fully quantum, i.e. all its data processing components are designed for quantum hardware. It is trained on collections of 3D point clouds to produce their compressed representations. Along with finding a suitable architecture, the core challenges in designing such a fully quantum model include 3D data normalisation and parameter optimisation, and we propose solutions for both these tasks. Experiments on simulated gate-based quantum hardware demonstrate that our method outperforms simple classical baselines, paving the way for a new research direction in 3D computer vision. The source code is available at https://4dqv.mpi-inf.mpg.de/QAE3D/.

摘要
现有的方法 для学习3D表示法是使用深度神经网络，并在经典硬件上训练和测试。量子机器学习架构，尽管其 theoretically predicted advantages in terms of speed and representational capacity，尚未被考虑用于这个问题或任何3D数据相关的任务。这篇论文因此引入了首个量子自动编码器 для3D点云。我们的3D-QAE方法是完全量子的，即所有的数据处理组件都是为量子硬件设计的。它是在收集3D点云的集合上训练，以生成压缩表示。与设计such a fully quantum model的核心挑战包括3D数据Normalization和参数优化，我们提出了解决方案 для这两个任务。实验在模拟的门槛基 quantum 硬件上表明，我们的方法超过了简单的类型基eline，开创了一个新的研究方向于3D计算机视觉。源代码可以在中下载。

Reconstructing Objects in-the-wild for Realistic Sensor Simulation

paper_url: http://arxiv.org/abs/2311.05602
repo_url: None
paper_authors: Ze Yang, Sivabalan Manivasagam, Yun Chen, Jingkang Wang, Rui Hu, Raquel Urtasun
for: 用于帮助机器人训练和测试中带有真实感的 simulate 环境。
methods: 使用神经网络signed distance function来重建物体表面和光照，以及利用 LiDAR 和摄像头感知器数据来重建精准的 geometry 和 нормаль。
results: 在具有有限的训练视图的情况下，NeuSim 能够实现高效的视 synthesis 性能，并且可以将重建的对象资产组合到虚拟世界中，生成真实的多感器数据用于评估自动驾驶感知模型。

Abstract
Reconstructing objects from real world data and rendering them at novel views is critical to bringing realism, diversity and scale to simulation for robotics training and testing. In this work, we present NeuSim, a novel approach that estimates accurate geometry and realistic appearance from sparse in-the-wild data captured at distance and at limited viewpoints. Towards this goal, we represent the object surface as a neural signed distance function and leverage both LiDAR and camera sensor data to reconstruct smooth and accurate geometry and normals. We model the object appearance with a robust physics-inspired reflectance representation effective for in-the-wild data. Our experiments show that NeuSim has strong view synthesis performance on challenging scenarios with sparse training views. Furthermore, we showcase composing NeuSim assets into a virtual world and generating realistic multi-sensor data for evaluating self-driving perception models.

摘要
<>将实际世界数据重建为虚拟世界中的对象，并在新的视角下rendering它们是虚拟世界中的重要任务。在这项工作中，我们提出了NeuSim，一种新的方法，可以从稀疏的宽泛数据中估算高精度的几何结构和真实的外观。我们表示物体表面为神经网络签名距离函数，并利用LiDAR和摄像头感知器数据来重建平滑和准确的几何和法向量。我们模型物体外观使用物理学派的反射表示，可以有效地处理宽泛数据中的不确定性。我们的实验表明，NeuSim在复杂的情况下具有强大的视图合成性能。此外，我们还展示了将NeuSim资产集成到虚拟世界中，并生成真实的多感器数据用于评估自动驾驶感知模型。>>Note that the translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you prefer Traditional Chinese, I can provide that as well.

SigScatNet: A Siamese + Scattering based Deep Learning Approach for Signature Forgery Detection and Similarity Assessment

paper_url: http://arxiv.org/abs/2311.05579
repo_url: None
paper_authors: Anmol Chokshi, Vansh Jain, Rajas Bhope, Sudhir Dhage
for: 本研究旨在开发一种能够准确检测 forgery 和评估 signature 相似性的技术解决方案，以满足现代社会面临着假印花痕的广泛存在和严重问题。
methods: 本研究提出了一种基于 Siamese 深度学习网络和散射波lets的方法，通过对 signature 进行精准的 validate 和比较，以验证它的合法性。该方法具有 Exceptional efficiency 和可靠性，可以在低成本的硬件系统上运行。
results: 实验结果表明，使用 SigScatNet 可以准确地检测 forgery 和评估 signature 相似性，并且具有很高的 Equal Error Rate（EER）和 Computational Efficiency。 Specifically, the EER of the ICDAR SigComp Dutch dataset was 3.689%, and the EER of the CEDAR dataset was 0.0578%. 这些结果表明，SigScatNet 可以提供一个新的 state-of-the-art 的 signature analysis 技术解决方案，并且可以在实际应用中提供高效、可靠的服务。

Abstract
The surge in counterfeit signatures has inflicted widespread inconveniences and formidable challenges for both individuals and organizations. This groundbreaking research paper introduces SigScatNet, an innovative solution to combat this issue by harnessing the potential of a Siamese deep learning network, bolstered by Scattering wavelets, to detect signature forgery and assess signature similarity. The Siamese Network empowers us to ascertain the authenticity of signatures through a comprehensive similarity index, enabling precise validation and comparison. Remarkably, the integration of Scattering wavelets endows our model with exceptional efficiency, rendering it light enough to operate seamlessly on cost-effective hardware systems. To validate the efficacy of our approach, extensive experimentation was conducted on two open-sourced datasets: the ICDAR SigComp Dutch dataset and the CEDAR dataset. The experimental results demonstrate the practicality and resounding success of our proposed SigScatNet, yielding an unparalleled Equal Error Rate of 3.689% with the ICDAR SigComp Dutch dataset and an astonishing 0.0578% with the CEDAR dataset. Through the implementation of SigScatNet, our research spearheads a new state-of-the-art in signature analysis in terms of EER scores and computational efficiency, offering an advanced and accessible solution for detecting forgery and quantifying signature similarities. By employing cutting-edge Siamese deep learning and Scattering wavelets, we provide a robust framework that paves the way for secure and efficient signature verification systems.

摘要
“ counterfeit signatures 的问题已经对个人和机构带来广泛的不便和严重的挑战。本研究的创新解决方案是基于 Siamese 深度学习网络，并与散射波лет特别结合，以检测签名伪造和评估签名相似性。 Siamese 网络允许我们通过全面的相似度指数，实现签名的authenticity检测，并提供了高度精确的比较和验证。另外，散射波лет特别的整合使我们的模型变得非常轻量级，可以运行在便宜的硬件系统上。为了证明我们的方法的有效性，我们在 ICAR D SigComp 荷兰 dataset 和 CEDAR dataset 上进行了广泛的实验。实验结果显示了我们的提案的 SigScatNet 在 EER 分数和计算效率上具有杰出的表现，其中 ICDAR SigComp Dutch dataset 的 EER 分数为 3.689%，而 CEDAR dataset 的 EER 分数则为 0.0578%。通过 SigScatNet 的实现，我们的研究将 signature 分析领域带进了新的州际领域，并提供了一个高度可靠和可接近的解决方案，以应对签名伪造和评估签名相似性。我们的研究使用了 cutting-edge Siamese 深度学习和散射波лет特别，提供了一个强大和可靠的框架，将来对签名验证系统带来新的突破和进步。”

Exploring Emotion Expression Recognition in Older Adults Interacting with a Virtual Coach

paper_url: http://arxiv.org/abs/2311.05567
repo_url: None
paper_authors: Cristina Palmero, Mikel deVelasco, Mohamed Amine Hmani, Aymen Mtibaa, Leila Ben Letaifa, Pau Buch-Cardona, Raquel Justo, Terry Amorese, Eduardo González-Fraile, Begoña Fernández-Ruanova, Jofre Tenorio-Laranga, Anna Torp Johansen, Micaela Rodrigues da Silva, Liva Jenny Martinussen, Maria Stylianou Korsnes, Gennaro Cordasco, Anna Esposito, Mounim A. El-Yacoubi, Dijana Petrovska-Delacrétaz, M. Inés Torres, Sergio Escalera
for: This paper aims to develop an emotionally expressive virtual coach for healthy seniors to improve well-being and promote independent aging.
methods: The paper outlines the development of the emotion expression recognition module of the virtual coach, including data collection, annotation design, and a first methodological approach. The study uses various modalities such as speech from audio and facial expressions, gaze, and head dynamics from video to recognize emotional expressions.
results: The study found that the modalities studied were informative for the emotional categories considered, with multimodal methods generally outperforming others. The results are expected to contribute to the limited literature on emotion recognition applied to older adults in conversational human-machine interaction.

Abstract
The EMPATHIC project aimed to design an emotionally expressive virtual coach capable of engaging healthy seniors to improve well-being and promote independent aging. One of the core aspects of the system is its human sensing capabilities, allowing for the perception of emotional states to provide a personalized experience. This paper outlines the development of the emotion expression recognition module of the virtual coach, encompassing data collection, annotation design, and a first methodological approach, all tailored to the project requirements. With the latter, we investigate the role of various modalities, individually and combined, for discrete emotion expression recognition in this context: speech from audio, and facial expressions, gaze, and head dynamics from video. The collected corpus includes users from Spain, France, and Norway, and was annotated separately for the audio and video channels with distinct emotional labels, allowing for a performance comparison across cultures and label types. Results confirm the informative power of the modalities studied for the emotional categories considered, with multimodal methods generally outperforming others (around 68% accuracy with audio labels and 72-74% with video labels). The findings are expected to contribute to the limited literature on emotion recognition applied to older adults in conversational human-machine interaction.

摘要
《情感察觉》项目目标是设计一个情感表达能力强的虚拟教练，以提高健康老年人的情绪状况和独立生活能力。系统的核心特点之一是情感感知能力，通过感知用户的情感状况，提供个性化的经验。本文介绍了《情感表达识别模块》的开发，包括数据收集、标注设计和方法ologica approaches，都适应项目的需求。我们 investigate了不同modalities的作用，单独和结合使用，对 discrete emotional expression recognition的性能的影响。收集的数据库包括来自西班牙、法国和挪威的用户，并对音频和视频通道进行了分别的注释，以便比较不同文化和标签类型之间的性能。结果表明，研究对older adults在对话人机交互中表达情感的方面的limited literature中，modalities studying的信息力强，单modal和多模态方法的性能相对较高（音频标签的准确率为68%，视频标签的准确率为72-74%）。这些发现预计将对设计情感表达能力强的虚拟教练提供有用的指导。

High-Performance Transformers for Table Structure Recognition Need Early Convolutions

paper_url: http://arxiv.org/abs/2311.05565
repo_url: https://github.com/poloclub/tsr-convstem
paper_authors: ShengYun Peng, Seongmin Lee, Xiaojing Wang, Rajarajeswari Balasubramaniyan, Duen Horng Chau
for: 这篇论文主要探讨了一种轻量级的视觉编码器，以提高表格识别（TSR）模型的速度和可学习性。
methods: 该论文提出了一种新的视觉编码器，即 convolutional stem，它使用了一个简单的模型结构，但能够与 классификацион CNN 相比肩。该编码器具有较高的感知野比率和更长的序列长度，能够匹配表格的结构和上下文。
results: 研究人员通过了多种ablation study来证明，新的视觉编码器可以减少模型参数数量，同时保持表格识别的表现。此外，该论文还开源了代码，以便进一步的研究和比较。

Abstract
Table structure recognition (TSR) aims to convert tabular images into a machine-readable format, where a visual encoder extracts image features and a textual decoder generates table-representing tokens. Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder. However, this hybrid CNN-Transformer architecture introduces a complex visual encoder that accounts for nearly half of the total model parameters, markedly reduces both training and inference speed, and hinders the potential for self-supervised learning in TSR. In this work, we design a lightweight visual encoder for TSR without sacrificing expressive power. We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model. The convolutional stem strikes an optimal balance between two crucial factors for high-performance TSR: a higher receptive field (RF) ratio and a longer sequence length. This allows it to "see" an appropriate portion of the table and "store" the complex table structure within sufficient context length for the subsequent transformer. We conducted reproducible ablation studies and open-sourced our code at https://github.com/poloclub/tsr-convstem to enhance transparency, inspire innovations, and facilitate fair comparisons in our domain as tables are a promising modality for representation learning.

摘要
tables structure recognition (TSR) 目标是将表格图像转换为机器可读格式，其中视觉编码器提取图像特征，而文本编码器生成表格表示符。现有方法使用经典 convolutional neural network (CNN) 脑筋作为视觉编码器和转换器作为文本编码器。然而，这种混合 CNN-Transformer 架构会导致复杂的视觉编码器，占用大量模型参数，明显降低训练和执行速度，并阻碍自动学习在 TSR 中。在这种工作中，我们设计了 TSR 中的轻量级视觉编码器，不会失去表达力。我们发现， convolutional stem 可以与经典 CNN 脑筋性能相当，但是它的模型非常简单。convolutional stem 在两个关键因素上做出了优化的平衡：高的 receptive field (RF) 比和长的序列长度。这使得它可以 "看" 到合适的表格部分，并 "存储" 表格结构的详细信息在 suficient context length 中，为后续转换器提供足够的上下文。我们进行了可重复的抽象研究，并将我们的代码开源在 https://github.com/poloclub/tsr-convstem，以便提高透明度，激发创新，并且在我们领域中进行公正的比较。

Disentangling Quantum and Classical Contributions in Hybrid Quantum Machine Learning Architectures

paper_url: http://arxiv.org/abs/2311.05559
repo_url: None
paper_authors: Michael Kölle, Jonas Maurer, Philipp Altmann, Leo Sünkel, Jonas Stein, Claudia Linnhoff-Popien
for: 这篇论文的目的是探讨量子计算的可行性，以及将经过训练的古典模型与量子圈组合使用的混合转移学习解决方案。
methods: 这篇论文使用了一种新的混合架构，将 autoencoder 用于将输入数据压缩，然后将压缩后的数据通过量子环节。另外，还与两个现有的 Hybrid transfer learning 架构、两个纯古典架构和一个量子架构进行比较。
results: 研究结果显示，古典 комponent 在混合转移学习中具有重要的影响，而这个影响通常被误以为是量子元件的贡献。我们的模型在四个 datasets 上的准确率与使用于量子圈的混合转移学习模型的准确率相似。

Abstract
Quantum computing offers the potential for superior computational capabilities, particularly for data-intensive tasks. However, the current state of quantum hardware puts heavy restrictions on input size. To address this, hybrid transfer learning solutions have been developed, merging pre-trained classical models, capable of handling extensive inputs, with variational quantum circuits. Yet, it remains unclear how much each component - classical and quantum - contributes to the model's results. We propose a novel hybrid architecture: instead of utilizing a pre-trained network for compression, we employ an autoencoder to derive a compressed version of the input data. This compressed data is then channeled through the encoder part of the autoencoder to the quantum component. We assess our model's classification capabilities against two state-of-the-art hybrid transfer learning architectures, two purely classical architectures and one quantum architecture. Their accuracy is compared across four datasets: Banknote Authentication, Breast Cancer Wisconsin, MNIST digits, and AudioMNIST. Our research suggests that classical components significantly influence classification in hybrid transfer learning, a contribution often mistakenly ascribed to the quantum element. The performance of our model aligns with that of a variational quantum circuit using amplitude embedding, positioning it as a feasible alternative.

摘要

LCM-LoRA: A Universal Stable-Diffusion Acceleration Module

paper_url: http://arxiv.org/abs/2311.05556
repo_url: https://github.com/luosiallen/latent-consistency-model
paper_authors: Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, Hang Zhao
for: 快速生成高质量图像，使用Latent Consistency Models（LCMs）可以减少推理步骤数量，只需要约32个A100 GPU训练小时。
methods: 使用LoRA混合精炼方法进行驱动，将Stable-Diffusion模型包括SD-V1.5、SSD-1B和SDXL扩展到更大的模型，并且减少内存占用。
results: 通过LCM混合精炼方法，可以获得更高质量的图像生成结果，并且可以将LCM-LoRA作为一个通用加速器应用于多种图像生成任务。

Abstract
Latent Consistency Models (LCMs) have achieved impressive performance in accelerating text-to-image generative tasks, producing high-quality images with minimal inference steps. LCMs are distilled from pre-trained latent diffusion models (LDMs), requiring only ~32 A100 GPU training hours. This report further extends LCMs' potential in two aspects: First, by applying LoRA distillation to Stable-Diffusion models including SD-V1.5, SSD-1B, and SDXL, we have expanded LCM's scope to larger models with significantly less memory consumption, achieving superior image generation quality. Second, we identify the LoRA parameters obtained through LCM distillation as a universal Stable-Diffusion acceleration module, named LCM-LoRA. LCM-LoRA can be directly plugged into various Stable-Diffusion fine-tuned models or LoRAs without training, thus representing a universally applicable accelerator for diverse image generation tasks. Compared with previous numerical PF-ODE solvers such as DDIM, DPM-Solver, LCM-LoRA can be viewed as a plug-in neural PF-ODE solver that possesses strong generalization abilities. Project page: https://github.com/luosiallen/latent-consistency-model.

摘要
Latent Consistency Models (LCMs) 已经实现了在快速生成文本到图像任务中表现出色，生成高质量图像只需要 minimal inference steps。LCMs 是从预训练的latent diffusion models (LDMs) 中提取出来的，只需要约32个A100 GPU 训练小时。本报告进一步扩展了LCMs的潜在能力，包括：首先，通过应用LoRA混合精灵抽取法，我们扩展了LCM的范围，使得LCM可以处理更大的模型，并且具有更少的内存占用，从而实现更高质量的图像生成。其次，我们确定了通过LCM混合精灵抽取法获得的LoRA参数为Universal Stable-Diffusion加速模块，名为LCM-LoRA。LCM-LoRA可以直接插入不同的Stable-Diffusion 精灵抽取模型或LoRAs 中，无需训练，因此可以视为一种通用适用的图像生成加速器。与前一些数值PF-ODE 解决方案相比，LCM-LoRA可以看作是一种嵌入式神经网络PF-ODE 解决方案，具有强大的泛化能力。项目页面：https://github.com/luosiallen/latent-consistency-model。

L-WaveBlock: A Novel Feature Extractor Leveraging Wavelets for Generative Adversarial Networks

paper_url: http://arxiv.org/abs/2311.05548
repo_url: None
paper_authors: Mirat Shah, Vansh Jain, Anmol Chokshi, Guruprasad Parasnis, Pramod Bide
for: 这篇论文旨在提出一种新的特征提取器，即L-WaveBlock，以便提高基于GAN的图像生成器的性能和速度。
methods: 这篇论文使用了Discrete Wavelet Transform（DWT）和深度学习方法，开发了一种新的特征提取器L-WaveBlock，以提高GAN生成器的速度和性能。
results: 在三个 dataset（即路面卫星图像数据集、CelebA数据集和GoPro数据集）上，L-WaveBlock得到了惊人的效果，使得GAN生成器更快地 converges，并且在每个dataset上都达到了竞争性的效果。

Abstract
Generative Adversarial Networks (GANs) have risen to prominence in the field of deep learning, facilitating the generation of realistic data from random noise. The effectiveness of GANs often depends on the quality of feature extraction, a critical aspect of their architecture. This paper introduces L-WaveBlock, a novel and robust feature extractor that leverages the capabilities of the Discrete Wavelet Transform (DWT) with deep learning methodologies. L-WaveBlock is catered to quicken the convergence of GAN generators while simultaneously enhancing their performance. The paper demonstrates the remarkable utility of L-WaveBlock across three datasets, a road satellite imagery dataset, the CelebA dataset and the GoPro dataset, showcasing its ability to ease feature extraction and make it more efficient. By utilizing DWT, L-WaveBlock efficiently captures the intricate details of both structural and textural details, and further partitions feature maps into orthogonal subbands across multiple scales while preserving essential information at the same time. Not only does it lead to faster convergence, but also gives competent results on every dataset by employing the L-WaveBlock. The proposed method achieves an Inception Score of 3.6959 and a Structural Similarity Index of 0.4261 on the maps dataset, a Peak Signal-to-Noise Ratio of 29.05 and a Structural Similarity Index of 0.874 on the CelebA dataset. The proposed method performs competently to the state-of-the-art for the image denoising dataset, albeit not better, but still leads to faster convergence than conventional methods. With this, L-WaveBlock emerges as a robust and efficient tool for enhancing GAN-based image generation, demonstrating superior convergence speed and competitive performance across multiple datasets for image resolution, image generation and image denoising.

摘要
生成对抗网络（GAN）在深度学习中崛起，能够生成真实的数据从随机噪声中。GAN的效果经常受到特征提取的影响，这是其架构中的关键因素。本文介绍一种名为L-WaveBlock的新型和可靠的特征提取器，它利用抽象波лет变换（DWT）与深度学习方法结合，以快速加速GAN生成器的协调。L-WaveBlock在三个数据集上展示了强大的实用性，包括公路卫星影像数据集、CelebA数据集和GoPro数据集。它能够高效地提取特征，并在多个尺度和多个缩放级别上分解特征图。这不仅导致更快的协调，还可以在每个数据集上获得优秀的结果。使用DWT，L-WaveBlock可以高效地捕捉结构和文本细节的细节，并将特征图分解成多个 ortogonal subbands。这不仅提高了特征提取的效率，还保留了重要信息。因此，L-WaveBlock不仅可以带来更快的协调，还可以在多个数据集上获得竞争力的结果。提议的方法在maps数据集上 achiev 一个 Inception Score 的 3.6959 和一个 Structural Similarity Index 的 0.4261，在 CelebA 数据集上 achiev 一个 Peak Signal-to-Noise Ratio 的 29.05 和一个 Structural Similarity Index 的 0.874。在图像压缩数据集上，提议的方法可以和现有方法相比，并且可以带来更快的协调。因此，L-WaveBlock emerges 为一种可靠和高效的图像生成工具，可以在多个数据集上提高图像分辨率、图像生成和图像压缩的性能。

A Deep Learning Method for Simultaneous Denoising and Missing Wedge Reconstruction in Cryogenic Electron Tomography

paper_url: http://arxiv.org/abs/2311.05539
repo_url: https://github.com/mli-lab/deepdewedge
paper_authors: Simon Wiedemann, Reinhard Heckel
for: used to improve the visual quality and resolution of cryo-ET tomograms
methods: deep-learning approach for simultaneous denoising and missing wedge reconstruction called DeepDeWedge
results: competitive performance for deep learning-based denoising and missing wedge reconstruction of cryo-ET tomogramsHere’s the full text in Simplified Chinese:
for: 用于提高晶体电子显微镜图像的视觉质量和分辨率
methods: 使用深度学习方法，即DeepDeWedge，同时去噪和缺角重建晶体电子显微镜图像
results: 在synthetic和实际晶体电子显微镜数据上实现了竞争力强的深度学习基于的denoising和缺角重建表现I hope that helps! Let me know if you have any other questions.

Abstract
Cryogenic electron tomography (cryo-ET) is a technique for imaging biological samples such as viruses, cells, and proteins in 3D. A microscope collects a series of 2D projections of the sample, and the goal is to reconstruct the 3D density of the sample called the tomogram. This is difficult as the 2D projections have a missing wedge of information and are noisy. Tomograms reconstructed with conventional methods, such as filtered back-projection, suffer from the noise, and from artifacts and anisotropic resolution due to the missing wedge of information. To improve the visual quality and resolution of such tomograms, we propose a deep-learning approach for simultaneous denoising and missing wedge reconstruction called DeepDeWedge. DeepDeWedge is based on fitting a neural network to the 2D projections with a self-supervised loss inspired by noise2noise-like methods. The algorithm requires no training or ground truth data. Experiments on synthetic and real cryo-ET data show that DeepDeWedge achieves competitive performance for deep learning-based denoising and missing wedge reconstruction of cryo-ET tomograms.

摘要
低温电子镜像技术（冰点电子镜像）可以用于图像生物样品，如病毒、细胞和蛋白质的三维图像。一个镜头收集了样品的一系列二维投影图像，目标是重建样品的三维密度特征，称为tomogram。这是困难的，因为二维投影图像缺失一部分信息，并且含有噪声。使用传统方法重建tomogram时，会受到噪声和缺失信息的影响，以及各向异otropic的分辨率。为了改善tomogram的视觉质量和分辨率，我们提出了一种基于深度学习的同时去噪和缺失信息重建方法，称为DeepDeWedge。DeepDeWedge基于对二维投影图像适应一个神经网络，使用自动驱动的损失函数，类似于噪声2噪声的方法。该算法不需要训练或真实数据。对于synthetic和实际冰点电子镜像数据进行了实验，显示DeepDeWedge可以与深度学习基于的去噪和缺失信息重建方法竞争。

Embedding Space Interpolation Beyond Mini-Batch, Beyond Pairs and Beyond Examples

paper_url: http://arxiv.org/abs/2311.05538
repo_url: https://github.com/shashankvkt/MultiMix_NeurIPS023
paper_authors: Shashanka Venkataramanan, Ewa Kijak, Laurent Amsaleg, Yannis Avrithis
for: 本研究旨在提高混合数据 augmentation 的效果，以提高模型的泛化能力。
methods: 本研究引入 MultiMix 方法，可以生成许多 interpolated 的示例，并在 embedding 空间进行 interpolating。
results: 对四个 benchmark 进行实验，并证明 MultiMix 方法可以提高模型的性能，并且可以解释为什么性能提高的原因。

Abstract
Mixup refers to interpolation-based data augmentation, originally motivated as a way to go beyond empirical risk minimization (ERM). Its extensions mostly focus on the definition of interpolation and the space (input or feature) where it takes place, while the augmentation process itself is less studied. In most methods, the number of generated examples is limited to the mini-batch size and the number of examples being interpolated is limited to two (pairs), in the input space. We make progress in this direction by introducing MultiMix, which generates an arbitrarily large number of interpolated examples beyond the mini-batch size and interpolates the entire mini-batch in the embedding space. Effectively, we sample on the entire convex hull of the mini-batch rather than along linear segments between pairs of examples. On sequence data, we further extend to Dense MultiMix. We densely interpolate features and target labels at each spatial location and also apply the loss densely. To mitigate the lack of dense labels, we inherit labels from examples and weight interpolation factors by attention as a measure of confidence. Overall, we increase the number of loss terms per mini-batch by orders of magnitude at little additional cost. This is only possible because of interpolating in the embedding space. We empirically show that our solutions yield significant improvement over state-of-the-art mixup methods on four different benchmarks, despite interpolation being only linear. By analyzing the embedding space, we show that the classes are more tightly clustered and uniformly spread over the embedding space, thereby explaining the improved behavior.

摘要
混合（Mixup）是基于 interpolate 的数据增强技术，起初是为了超越empirical risk minimization（ERM）的目的。其扩展主要集中在 interpolate 的定义和进行 interpolate 的空间（输入或特征）上，而 interpolate 过程自身得到了更少的研究。在大多数方法中，生成的示例数限制在 mini-batch 大小和 interpolate 的示例数量均为两（对），在输入空间进行 interpolate。我们在这个方向上做出了进步，通过引入 MultiMix，可以生成超过 mini-batch 大小的 interpolated 示例，并且在嵌入空间中 interpolate 整个 mini-batch。实际上，我们在整个几何体中采样，而不是在线性段 между对的示例之间采样。在序列数据上，我们进一步扩展到 dense MultiMix，在每个空间位置上密集 interpolate 特征和目标标签，并且对 loss 进行密集应用。为了减少缺少密集标签的问题，我们继承例外的标签并将 interpolate 因子重量为注意力的度量。总的来说，我们在每个 mini-batch 中增加了数量级别的损失项数，而这是因为 interpolate 在嵌入空间中进行的。我们经验显示，我们的解决方案在四个不同的标准测试上显示了明显的改善，即使 interpolate 只是线性的。通过分析嵌入空间，我们显示了类划分更加紧密，uniform 分布在嵌入空间中，从而解释了改善的行为。

SeaTurtleID2022: A long-span dataset for reliable sea turtle re-identification

paper_url: http://arxiv.org/abs/2311.05524
repo_url: None
paper_authors: Lukáš Adam, Vojtěch Čermák, Kostas Papafitsoros, Lukáš Picek
for: The paper is written for researchers and practitioners working on animal re-identification, particularly those interested in sea turtles.
methods: The paper uses a large-scale, long-span dataset of sea turtle photographs captured in the wild, with various annotations such as identity, encounter timestamp, and body parts segmentation masks. The dataset is split into two realistic and ecologically motivated splits: a time-aware closed-set and a time-aware open-set. The paper also proposes an end-to-end system for sea turtle re-identification based on Hybrid Task Cascade for head instance segmentation and ArcFace-trained feature-extractor.
results: The paper reports an accuracy of 86.8% for the proposed end-to-end system, and provides baseline instance segmentation and re-identification performance over various body parts. The paper also shows that time-aware splits are essential for benchmarking re-identification methods, as random splits lead to performance overestimation.

Abstract
This paper introduces the first public large-scale, long-span dataset with sea turtle photographs captured in the wild -- SeaTurtleID2022 (https://www.kaggle.com/datasets/wildlifedatasets/seaturtleid2022). The dataset contains 8729 photographs of 438 unique individuals collected within 13 years, making it the longest-spanned dataset for animal re-identification. All photographs include various annotations, e.g., identity, encounter timestamp, and body parts segmentation masks. Instead of standard "random" splits, the dataset allows for two realistic and ecologically motivated splits: (i) a time-aware closed-set with training, validation, and test data from different days/years, and (ii) a time-aware open-set with new unknown individuals in test and validation sets. We show that time-aware splits are essential for benchmarking re-identification methods, as random splits lead to performance overestimation. Furthermore, a baseline instance segmentation and re-identification performance over various body parts is provided. Finally, an end-to-end system for sea turtle re-identification is proposed and evaluated. The proposed system based on Hybrid Task Cascade for head instance segmentation and ArcFace-trained feature-extractor achieved an accuracy of 86.8%.

摘要

BakedAvatar: Baking Neural Fields for Real-Time Head Avatar Synthesis

paper_url: http://arxiv.org/abs/2311.05521
repo_url: None
paper_authors: Hao-Bin Duan, Miao Wang, Jin-Chuan Shi, Xu-Chuan Chen, Yan-Pei Cao
For: 本研究旨在提供高效的NeRF技术来实现实时的人头化学习，以满足VR/AR、telepresence和游戏应用的需求。* Methods: 我们提出了一种新的表示方法，即BakedAvatar，它可以在标准的 polygon 纹理化管道中进行实时的人头化学习。我们的方法从学习的ISO面上提取了可变的多层网格，并计算出表达、姿势和视角依赖的外观特征，这些特征可以被烘焙成静态的文本ures，以实现高效的纹理化。* Results: 我们的方法可以与其他状态对照方法相比，同时大幅降低了推理时间的需求。我们还通过了不同的视频来synthesize heads，包括视点synthesis、face reenactment、表情编辑和姿势编辑，全部在交互帧率下完成。

Abstract
Synthesizing photorealistic 4D human head avatars from videos is essential for VR/AR, telepresence, and video game applications. Although existing Neural Radiance Fields (NeRF)-based methods achieve high-fidelity results, the computational expense limits their use in real-time applications. To overcome this limitation, we introduce BakedAvatar, a novel representation for real-time neural head avatar synthesis, deployable in a standard polygon rasterization pipeline. Our approach extracts deformable multi-layer meshes from learned isosurfaces of the head and computes expression-, pose-, and view-dependent appearances that can be baked into static textures for efficient rasterization. We thus propose a three-stage pipeline for neural head avatar synthesis, which includes learning continuous deformation, manifold, and radiance fields, extracting layered meshes and textures, and fine-tuning texture details with differential rasterization. Experimental results demonstrate that our representation generates synthesis results of comparable quality to other state-of-the-art methods while significantly reducing the inference time required. We further showcase various head avatar synthesis results from monocular videos, including view synthesis, face reenactment, expression editing, and pose editing, all at interactive frame rates.

摘要
<>通过视频生成 photorealistic 4D人头模型是虚拟现实（VR）、虚拟真实（AR）、电子游戏等应用的关键。 existed Neural Radiance Fields（NeRF）方法可以实现高质量结果，但计算成本限制了它们在实时应用中使用。为了解决这个问题，我们介绍了 BakedAvatar，一种新的表示方法，可以在标准 polygon 笔触板pipeline中进行实时神经头像生成。我们的方法从学习的isoSurface中提取了可变多层网格，并计算出expression、pose和视角相关的表现，可以用静态纹理进行高效的笔触板。因此，我们提出了一个三个阶段的神经头像生成管道，包括学习连续变形、 manifold 和辐射场，提取层次网格和纹理，以及微调纹理 Details 通过差分笔触板。实验结果表明，我们的表示可以在其他state-of-the-art方法的同等质量synthesis结果的情况下，明显减少计算成本。我们还展示了从单抗视频中生成的多种头像 sintesis结果，包括视图synthesis、脸部reenactment、表情编辑和姿态编辑，都在交互帧率下进行。Note: The translation is in Simplified Chinese, which is the standard written form of Chinese used in mainland China. If you prefer Traditional Chinese, please let me know and I can provide the translation in that form instead.

paper_url: http://arxiv.org/abs/2311.05669
repo_url: None
paper_authors: Yuqi Hou, Zhongqun Zhang, Nora Horanyi, Jaewon Moon, Yihua Cheng, Hyung Jin Chang
for: 这个论文旨在提高对话场景中人员的 gaze following 性能，利用听录信息来提供关键的人类行为信息。
methods: 该方法基于“聆听者听力着重点”的观察，首先利用听录和嘴唇的相关性进行分类，然后使用标识信息进行场景图像的增强，并提出一种基于场景图像的 gaze 候选点估计网络。
results: 该方法在新收集的对话场景中的视频听录数据集（VGS）上表现出了显著的优异性，与现有方法相比，具有更高的准确率和更好的可读性。

Abstract
Gaze following estimates gaze targets of in-scene person by understanding human behavior and scene information. Existing methods usually analyze scene images for gaze following. However, compared with visual images, audio also provides crucial cues for determining human behavior.This suggests that we can further improve gaze following considering audio cues. In this paper, we explore gaze following tasks in conversational scenarios. We propose a novel multi-modal gaze following framework based on our observation ``audiences tend to focus on the speaker''. We first leverage the correlation between audio and lips, and classify speakers and listeners in a scene. We then use the identity information to enhance scene images and propose a gaze candidate estimation network. The network estimates gaze candidates from enhanced scene images and we use MLP to match subjects with candidates as classification tasks. Existing gaze following datasets focus on visual images while ignore audios.To evaluate our method, we collect a conversational dataset, VideoGazeSpeech (VGS), which is the first gaze following dataset including images and audio. Our method significantly outperforms existing methods in VGS datasets. The visualization result also prove the advantage of audio cues in gaze following tasks. Our work will inspire more researches in multi-modal gaze following estimation.

摘要
glance following estimates gaze targets of in-scene person by understanding human behavior and scene information. Existing methods usually analyze scene images for gaze following. However, compared with visual images, audio also provides crucial cues for determining human behavior.This suggests that we can further improve gaze following considering audio cues. In this paper, we explore gaze following tasks in conversational scenarios. We propose a novel multi-modal gaze following framework based on our observation "audiences tend to focus on the speaker". We first leverage the correlation between audio and lips, and classify speakers and listeners in a scene. We then use the identity information to enhance scene images and propose a gaze candidate estimation network. The network estimates gaze candidates from enhanced scene images and we use MLP to match subjects with candidates as classification tasks. Existing gaze following datasets focus on visual images while ignore audios.To evaluate our method, we collect a conversational dataset, VideoGazeSpeech (VGS), which is the first gaze following dataset including images and audio. Our method significantly outperforms existing methods in VGS datasets. The visualization result also prove the advantage of audio cues in gaze following tasks. Our work will inspire more researches in multi-modal gaze following estimation.

paper_url: http://arxiv.org/abs/2311.05494
repo_url: None
paper_authors: Lei Li, Alexander Liniger, Mario Millhaeusler, Vagia Tsiminaki, Yuanyou Li, Dengxin Dai
for: 这篇论文主要用于提高实时物体检测领域中事件相关的检测性能。
methods: 该论文提出了一种新的知识塑造方法，通过对事件相关的特征进行精细地塑造，以减少事件数据稀疏性和缺失视觉细节的问题。
results: 在一个synthetic和一个实际的事件数据集上进行测试，研究发现，通过使用对象中心插槽注意机制，可以iteratively减少特征图进行塑造，以提高事件相关的学生对象检测器的性能，相当于减半与教师模式的性能差距。

Abstract
Event cameras are gaining popularity due to their unique properties, such as their low latency and high dynamic range. One task where these benefits can be crucial is real-time object detection. However, RGB detectors still outperform event-based detectors due to the sparsity of the event data and missing visual details. In this paper, we develop a novel knowledge distillation approach to shrink the performance gap between these two modalities. To this end, we propose a cross-modality object detection distillation method that by design can focus on regions where the knowledge distillation works best. We achieve this by using an object-centric slot attention mechanism that can iteratively decouple features maps into object-centric features and corresponding pixel-features used for distillation. We evaluate our novel distillation approach on a synthetic and a real event dataset with aligned grayscale images as a teacher modality. We show that object-centric distillation allows to significantly improve the performance of the event-based student object detector, nearly halving the performance gap with respect to the teacher.

摘要
To do this, we propose a cross-modality object detection distillation method that focuses on regions where knowledge distillation works best. We use an object-centric slot attention mechanism to iteratively decouple feature maps into object-centric features and corresponding pixel features used for distillation.We evaluate our novel distillation approach on a synthetic and real event dataset with aligned grayscale images as a teacher modality. Our results show that object-centric distillation significantly improves the performance of the event-based student object detector, nearly halving the performance gap with respect to the teacher.

Retinal OCT Synthesis with Denoising Diffusion Probabilistic Models for Layer Segmentation

paper_url: http://arxiv.org/abs/2311.05479
repo_url: None
paper_authors: Yuli Wu, Weidong He, Dennis Eschweiler, Ningxin Dou, Zixin Fan, Shengli Mi, Peter Walter, Johannes Stegmaier
for: overcome the challenge of limited annotated data in deep biomedical image analysis
methods: utilize denoising diffusion probabilistic models (DDPMs) to automatically generate retinal optical coherence tomography (OCT) images
results: achieve comparable results in layer segmentation accuracy with a model trained solely with synthesized images, reducing the need for manual annotations of retinal OCT images.Here is the full text in Simplified Chinese:
for: deep biomedical image analysis overcome the challenge of limited annotated data
methods: DDPMs automatically generate retinal optical coherence tomography (OCT) images
results: achieve comparable results in layer segmentation accuracy with a model trained solely with synthesized images, reducing the need for manual annotations of retinal OCT images.

Abstract
Modern biomedical image analysis using deep learning often encounters the challenge of limited annotated data. To overcome this issue, deep generative models can be employed to synthesize realistic biomedical images. In this regard, we propose an image synthesis method that utilizes denoising diffusion probabilistic models (DDPMs) to automatically generate retinal optical coherence tomography (OCT) images. By providing rough layer sketches, the trained DDPMs can generate realistic circumpapillary OCT images. We further find that more accurate pseudo labels can be obtained through knowledge adaptation, which greatly benefits the segmentation task. Through this, we observe a consistent improvement in layer segmentation accuracy, which is validated using various neural networks. Furthermore, we have discovered that a layer segmentation model trained solely with synthesized images can achieve comparable results to a model trained exclusively with real images. These findings demonstrate the promising potential of DDPMs in reducing the need for manual annotations of retinal OCT images.

摘要
现代医学生物图像分析使用深度学习经常遇到有限的标注数据的挑战。为解决这个问题，深度生成模型可以被使用来生成真实的医学图像。在这种情况下，我们提议一种使用杂化扩散概率模型（DDPM）来自动生成 RETINAL optical coherence tomography（OCT）图像。通过提供粗略的层草图，已经训练的 DDPM 可以生成真实的环脉OCT图像。我们还发现，通过知识转移，可以获得更准确的假标签，这对 segmentation 任务具有很大的 beneficial effect。通过这种方法，我们观察到层 segmentation 精度的一致提高，这被证明了使用不同的神经网络进行验证。此外，我们发现，使用solely 生成的图像来训练层 segmentation 模型可以达到与使用实际图像训练的结果相同的水平。这些发现表明 DDPM 在减少手动标注 retinal OCT 图像的需求方面具有普遍的承诺。

Robust Retraining-free GAN Fingerprinting via Personalized Normalization

paper_url: http://arxiv.org/abs/2311.05478
repo_url: None
paper_authors: Jianwei Fei, Zhihua Xia, Benedetta Tondi, Mauro Barni
for: 这篇论文主要应用于追踪和识别Generative Adversarial Networks（GANs）的责任用户在执行授权协议或任何类型的黑客使用时。
methods: 本论文提出了一种不需要重新训练的GAN标识方法，让模型开发者可以轻松地生成不同标识的模型复本。在 generator 中插入了额外的个性化normalization（PN）层，并将PN层的参数（涵盖和偏置）通过两个特别的浅层网络（ParamGen Nets）接受标识作为输入。同时还训练了一个标识器，以EXTRACT标识自生成的图像中。
results: 提出的方法可以在不需要重新训练和调整的情况下，将不同的标识 embed 到GAN中，并且在模型水平和图像水平的攻击下保持了更高的防护性能。

Abstract
In recent years, there has been significant growth in the commercial applications of generative models, licensed and distributed by model developers to users, who in turn use them to offer services. In this scenario, there is a need to track and identify the responsible user in the presence of a violation of the license agreement or any kind of malicious usage. Although there are methods enabling Generative Adversarial Networks (GANs) to include invisible watermarks in the images they produce, generating a model with a different watermark, referred to as a fingerprint, for each user is time- and resource-consuming due to the need to retrain the model to include the desired fingerprint. In this paper, we propose a retraining-free GAN fingerprinting method that allows model developers to easily generate model copies with the same functionality but different fingerprints. The generator is modified by inserting additional Personalized Normalization (PN) layers whose parameters (scaling and bias) are generated by two dedicated shallow networks (ParamGen Nets) taking the fingerprint as input. A watermark decoder is trained simultaneously to extract the fingerprint from the generated images. The proposed method can embed different fingerprints inside the GAN by just changing the input of the ParamGen Nets and performing a feedforward pass, without finetuning or retraining. The performance of the proposed method in terms of robustness against both model-level and image-level attacks is also superior to the state-of-the-art.

摘要
近年来，商业应用中的生成模型出现了显著增长，开发商将其授权并分发给用户，他们再次使用它们提供服务。在这种情况下，需要跟踪和识别违反授权协议或任何类型的黑客使用的责任用户。 Although there are methods to embed invisible watermarks in the images produced by Generative Adversarial Networks (GANs), generating a model with a different watermark, referred to as a fingerprint, for each user is time- and resource-consuming due to the need to retrain the model to include the desired fingerprint. In this paper, we propose a retraining-free GAN fingerprinting method that allows model developers to easily generate model copies with the same functionality but different fingerprints. The generator is modified by inserting additional Personalized Normalization (PN) layers whose parameters (scaling and bias) are generated by two dedicated shallow networks (ParamGen Nets) taking the fingerprint as input. A watermark decoder is trained simultaneously to extract the fingerprint from the generated images. The proposed method can embed different fingerprints inside the GAN by just changing the input of the ParamGen Nets and performing a feedforward pass, without finetuning or retraining. The performance of the proposed method in terms of robustness against both model-level and image-level attacks is also superior to the state-of-the-art.

Using ResNet to Utilize 4-class T2-FLAIR Slice Classification Based on the Cholinergic Pathways Hyperintensities Scale for Pathological Aging

paper_url: http://arxiv.org/abs/2311.05477
repo_url: None
paper_authors: Wei-Chun Kevin Tsai, Yi-Chien Liu, Ming-Chun Yu, Chia-Ju Chou, Sui-Hing Yan, Yang-Teng Fan, Yan-Hsiang Huang, Yen-Ling Chiu, Yi-Fang Chuang, Ran-Zan Wang, Yao-Chia Shih
for: 用于评估白 matter 肥厚症的严重程度，帮助诊断和评估抑郁症的发展风险。
methods: 使用深度学习模型BSCA（基于ResNet）自动确定四个关键的T2-FLAIR图像，以便评估抑郁症的严重程度。
results: 在ADNI T2-FLAIR数据集（N=150）和本地数据集（N=30）上进行测试，BSCA模型的性能达到了99.82%的准确率和99.83%的F1分数，表明BSCA可以有效地自动确定四个关键的T2-FLAIR图像，并且可以帮助临床医生评估抑郁症的发展风险。

Abstract
The Cholinergic Pathways Hyperintensities Scale (CHIPS) is a visual rating scale used to assess the extent of cholinergic white matter hyperintensities in T2-FLAIR images, serving as an indicator of dementia severity. However, the manual selection of four specific slices for rating throughout the entire brain is a time-consuming process. Our goal was to develop a deep learning-based model capable of automatically identifying the four slices relevant to CHIPS. To achieve this, we trained a 4-class slice classification model (BSCA) using the ADNI T2-FLAIR dataset (N=150) with the assistance of ResNet. Subsequently, we tested the model's performance on a local dataset (N=30). The results demonstrated the efficacy of our model, with an accuracy of 99.82% and an F1-score of 99.83%. This achievement highlights the potential impact of BSCA as an automatic screening tool, streamlining the selection of four specific T2-FLAIR slices that encompass white matter landmarks along the cholinergic pathways. Clinicians can leverage this tool to assess the risk of clinical dementia development efficiently.

摘要
“激素性白质纤维变化评估尺度（CHIPS）是一个用于评估脑中激素体路way的白质纤维变化的可视评估scale， serves as an indicator of dementia severity. However, the manual selection of four specific slices for rating throughout the entire brain is a time-consuming process. Our goal was to develop a deep learning-based model capable of automatically identifying the four slices relevant to CHIPS. To achieve this, we trained a 4-class slice classification model (BSCA) using the ADNI T2-FLAIR dataset (N=150) with the assistance of ResNet. Subsequently, we tested the model's performance on a local dataset (N=30). The results demonstrated the efficacy of our model, with an accuracy of 99.82% and an F1-score of 99.83%. This achievement highlights the potential impact of BSCA as an automatic screening tool, streamlining the selection of four specific T2-FLAIR slices that encompass white matter landmarks along the cholinergic pathways. Clinicians can leverage this tool to assess the risk of clinical dementia development efficiently.”Note that the translation is in Simplified Chinese, which is the standard form of Chinese used in mainland China and Singapore. If you need Traditional Chinese, please let me know.

3DStyle-Diffusion: Pursuing Fine-grained Text-driven 3D Stylization with 2D Diffusion Models

paper_url: http://arxiv.org/abs/2311.05464
repo_url: https://github.com/yanghb22-fdu/3dstyle-diffusion-official
paper_authors: Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Tao Mei
for: 本研究旨在提供一种高品质的三维内容创建方法，使得基于文本描述的三维模型可以实现精细的样式化。
methods: 本研究使用了CLIP基础模型，并提出了一种新的三维样式噪听模型（3DStyle-Diffusion），通过控制隐藏层MLP网络和扩散过程来实现精细的样式化。
results: 经过质量和量тив的实验，本研究证明了3DStyle-Diffusion模型的效果，并建立了一个新的数据集和评价协议来评估这种任务。

Abstract
3D content creation via text-driven stylization has played a fundamental challenge to multimedia and graphics community. Recent advances of cross-modal foundation models (e.g., CLIP) have made this problem feasible. Those approaches commonly leverage CLIP to align the holistic semantics of stylized mesh with the given text prompt. Nevertheless, it is not trivial to enable more controllable stylization of fine-grained details in 3D meshes solely based on such semantic-level cross-modal supervision. In this work, we propose a new 3DStyle-Diffusion model that triggers fine-grained stylization of 3D meshes with additional controllable appearance and geometric guidance from 2D Diffusion models. Technically, 3DStyle-Diffusion first parameterizes the texture of 3D mesh into reflectance properties and scene lighting using implicit MLP networks. Meanwhile, an accurate depth map of each sampled view is achieved conditioned on 3D mesh. Then, 3DStyle-Diffusion leverages a pre-trained controllable 2D Diffusion model to guide the learning of rendered images, encouraging the synthesized image of each view semantically aligned with text prompt and geometrically consistent with depth map. This way elegantly integrates both image rendering via implicit MLP networks and diffusion process of image synthesis in an end-to-end fashion, enabling a high-quality fine-grained stylization of 3D meshes. We also build a new dataset derived from Objaverse and the evaluation protocol for this task. Through both qualitative and quantitative experiments, we validate the capability of our 3DStyle-Diffusion. Source code and data are available at \url{https://github.com/yanghb22-fdu/3DStyle-Diffusion-Official}.

摘要
3D内容创建通过文本驱动化的样式化问题对 multimedia 和图形社区提出了基本挑战。latest advances of cross-modal foundation models（例如 CLIP）使得这个问题变得可能。这些方法通常利用 CLIP 将整体 semantics of 饰色化 mesh 与给定的文本提示相对位。然而，不是那么容易使得基于 semantics-level cross-modal supervision 的细化的样式化方法。在这种情况下，我们提出了一个新的 3DStyle-Diffusion 模型，可以让3D mesh 的细化样式化受到额外可控的外观和几何指导。技术上，3DStyle-Diffusion 首先将 3D mesh 的 texture 分解成反射性和场光照的多层感知神经网络。然后，通过 conditioned 3D mesh 的准确深度图来获得每个采样视图的准确深度图。接着，3DStyle-Diffusion 利用预训练的可控2D Diffusion 模型来导向Synthesize 的 rendered 图像，使得每个视图的生成图像具有文本提示和深度图的semantic 一致性，同时保持几何一致性。这种方法强大地结合了 implicit MLP 网络和 diffusion 过程，实现了高质量的细化样式化。我们还建立了基于 Objaverse 的新数据集和评估协议。通过质量和量度的实验，我们证明了我们的 3DStyle-Diffusion 的可能性。代码和数据可以在 \url{https://github.com/yanghb22-fdu/3DStyle-Diffusion-Official} 上获得。

ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors

paper_url: http://arxiv.org/abs/2311.05463
repo_url: None
paper_authors: Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei
for: 这篇论文的目的是提出一种新的``风格化’’文本到图像生成任务，即基于文本提示和风格图像的风格化图像生成。
methods: 该论文提出了一种新的扩展方法ControlStyle，通过升级一个先进的文本到图像模型，并添加可调整的调制网络，以便更多的文本提示和风格图像可以进行风格化。此外，还引入了扩散风格和内容规则，以促进这个调制网络的学习。
results: 对比 conventional style transfer techniques，ControlStyle可以生成更加美观和艺术性强的风格化图像，并且可以更好地控制风格化的程度和方向。

Abstract
Recently, the multimedia community has witnessed the rise of diffusion models trained on large-scale multi-modal data for visual content creation, particularly in the field of text-to-image generation. In this paper, we propose a new task for ``stylizing'' text-to-image models, namely text-driven stylized image generation, that further enhances editability in content creation. Given input text prompt and style image, this task aims to produce stylized images which are both semantically relevant to input text prompt and meanwhile aligned with the style image in style. To achieve this, we present a new diffusion model (ControlStyle) via upgrading a pre-trained text-to-image model with a trainable modulation network enabling more conditions of text prompts and style images. Moreover, diffusion style and content regularizations are simultaneously introduced to facilitate the learning of this modulation network with these diffusion priors, pursuing high-quality stylized text-to-image generation. Extensive experiments demonstrate the effectiveness of our ControlStyle in producing more visually pleasing and artistic results, surpassing a simple combination of text-to-image model and conventional style transfer techniques.

摘要
近些时间， multimedia 社区发现了基于大规模多Modal 数据的扩散模型在视觉内容创作中的崛起，尤其是文本到图像生成领域。在这篇论文中，我们提出了一个新的任务，即“风格化”文本到图像模型，即通过输入文本提示和风格图像来生成风格化的图像，这些图像同时需要具备Semantic relevance 和风格图像的风格一致性。为了实现这一目标，我们提出了一种新的扩散模型（ControlStyle），通过对 pré-trained 文本到图像模型添加可学习的调节网络，使得更多的文本提示和风格图像可以被满足。此外，我们同时引入了扩散样式和内容规则，以便掌控这个调节网络的学习，实现高质量的风格化文本到图像生成。实验结果表明，我们的 ControlStyle 能够生成更加视觉吸引人和艺术性高的结果，超过了简单地将文本到图像模型和传统风格传输技术相加。

Control3D: Towards Controllable Text-to-3D Generation

paper_url: http://arxiv.org/abs/2311.05461
repo_url: None
paper_authors: Yang Chen, Yingwei Pan, Yehao Li, Ting Yao, Tao Mei
for: 这种研究旨在提高文本到3D图形生成的可控性，使用额外的手写绘制图来控制生成的3D场景。
methods: 这种方法使用了一种改进的2D conditioned diffusion模型（ControlNet），用于导导3D场景的学习，并且使用了一种已经预训练的可微分图像到绘制图模型来直接估计绘制图。
results: 通过广泛的实验，我们示出了这种方法可以生成准确和忠实的3D场景，与输入文本提示和绘制图保持高度一致。

Abstract
Recent remarkable advances in large-scale text-to-image diffusion models have inspired a significant breakthrough in text-to-3D generation, pursuing 3D content creation solely from a given text prompt. However, existing text-to-3D techniques lack a crucial ability in the creative process: interactively control and shape the synthetic 3D contents according to users' desired specifications (e.g., sketch). To alleviate this issue, we present the first attempt for text-to-3D generation conditioning on the additional hand-drawn sketch, namely Control3D, which enhances controllability for users. In particular, a 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of 3D scene parameterized as NeRF, encouraging each view of 3D scene aligned with the given text prompt and hand-drawn sketch. Moreover, we exploit a pre-trained differentiable photo-to-sketch model to directly estimate the sketch of the rendered image over synthetic 3D scene. Such estimated sketch along with each sampled view is further enforced to be geometrically consistent with the given sketch, pursuing better controllable text-to-3D generation. Through extensive experiments, we demonstrate that our proposal can generate accurate and faithful 3D scenes that align closely with the input text prompts and sketches.

摘要

Transformer-based Model for Oral Epithelial Dysplasia Segmentation

paper_url: http://arxiv.org/abs/2311.05452
repo_url: None
paper_authors: Adam J Shephard, Hanya Mahmood, Shan E Ahmed Raza, Anna Luiza Damaceno Araujo, Alan Roger Santos-Silva, Marcio Ajudarte Lopes, Pablo Agustin Vargas, Kris McCombe, Stephanie Craig, Jacqueline James, Jill Brooks, Paul Nankivell, Hisham Mehanna, Syed Ali Khurram, Nasir M Rajpoot
for: 提高某些口腔病变诊断的准确率
methods: 使用Transformer模型进行某些口腔病变图像的检测和分割
results: 在测试数据上获得了优秀的普适性，并实现了预级验证的最佳结果，这是首次使用Transformers进行口腔病变图像分割的外部验证研究。

Abstract
Oral epithelial dysplasia (OED) is a premalignant histopathological diagnosis given to lesions of the oral cavity. OED grading is subject to large inter/intra-rater variability, resulting in the under/over-treatment of patients. We developed a new Transformer-based pipeline to improve detection and segmentation of OED in haematoxylin and eosin (H&E) stained whole slide images (WSIs). Our model was trained on OED cases (n = 260) and controls (n = 105) collected using three different scanners, and validated on test data from three external centres in the United Kingdom and Brazil (n = 78). Our internal experiments yield a mean F1-score of 0.81 for OED segmentation, which reduced slightly to 0.71 on external testing, showing good generalisability, and gaining state-of-the-art results. This is the first externally validated study to use Transformers for segmentation in precancerous histology images. Our publicly available model shows great promise to be the first step of a fully-integrated pipeline, allowing earlier and more efficient OED diagnosis, ultimately benefiting patient outcomes.

摘要
口腔细胞肥大病变（OED）是口腔区域病变的前期诊断，它的评估存在大量的内外诊断人员差异，导致患者的过度或者下降处理。我们开发了一个基于Transformer的新ipeline，用于改善在染色镜术中的OED检测和分 segmentation。我们的模型在260个OED患者和105个控制组中进行了训练，并在三个外部中心进行了验证（78个患者）。我们的内部实验得到了0.81的F1分数，在外部测试下轻微下降到0.71，表现良好，并达到了领域内最佳 результа。这是第一个得到了外部验证的Transformers用于口腔细胞肥大病变图像分 segmentation的研究。我们公开提供的模型表现良好，可以帮助更早、更高效地诊断OED，最终改善患者的结果。

Dual Pipeline Style Transfer with Input Distribution Differentiation

paper_url: http://arxiv.org/abs/2311.05432
repo_url: None
paper_authors: ShiQi Jiang, JunJie Kang, YuJian Li
for: 本研究旨在提高颜色和xture dual pipeline architecture (CTDP)的表现，通过掩码总变量损失 (Mtv) 来抑制纹理表示和遗留物。
methods: 本研究使用的方法包括 CTDP 和 Mtv，以及一种输入分布差异训练策略 (IDD)。
results: 实验结果显示，使用 IDD 训练策略可以让纹理生成完全依赖于噪声分布，而平滑分布则不会生成纹理。此外，在颜色平滑传输任务中，使用平滑分布作为前向推理阶段的输入可以完全消除纹理表示和遗留物。

Abstract
The color and texture dual pipeline architecture (CTDP) suppresses texture representation and artifacts through masked total variation loss (Mtv), and further experiments have shown that smooth input can almost completely eliminate texture representation. We have demonstrated through experiments that smooth input is not the key reason for removing texture representations, but rather the distribution differentiation of the training dataset. Based on this, we propose an input distribution differentiation training strategy (IDD), which forces the generation of textures to be completely dependent on the noise distribution, while the smooth distribution will not produce textures at all. Overall, our proposed distribution differentiation training strategy allows for two pre-defined input distributions to be responsible for two generation tasks, with noise distribution responsible for texture generation and smooth distribution responsible for color smooth transfer. Finally, we choose a smooth distribution as the input for the forward inference stage to completely eliminate texture representations and artifacts in color transfer tasks.

摘要
color和 texture dual pipeline architecture (CTDP) 可以抑制文本表示和缺陷通过做masked total variation loss (Mtv), 并且进一步实验表明，可以使用平滑输入来几乎完全消除文本表示。我们通过实验发现，平滑输入不是完全 removetexture representation的原因，而是训练集的分布差异。基于这，我们提出了输入分布差异训练策略 (IDD)，强制生成文本完全依赖于噪音分布，而平滑分布不会生成文本。总的来说，我们的提出的输入分布差异训练策略使得两个预定的输入分布负责两个生成任务，噪音分布负责文本生成，平滑分布负责颜色平滑传输。最后，我们选择平滑分布作为前向推理阶段的输入，完全消除颜色传输任务中的文本表示和缺陷。

Active Mining Sample Pair Semantics for Image-text Matching

paper_url: http://arxiv.org/abs/2311.05425
repo_url: None
paper_authors: Yongfeng Chena, Jin Liua, Zhijing Yang, Ruihan Chena, Junpeng Tan
for: 提高图文匹配 task 的表现和泛化能力，特别是 Handle 负样本匹配问题。
methods: 提出了一种新的图文匹配模型，即 Active Mining Sample Pair Semantics image-text matching model (AMSPS)，它使用了 Adaptive Hierarchical Reinforcement Loss (AHRL) 并可以自动挖掘更多的隐藏相关semantic表示。
results: 对于 Flickr30K 和 MSCOCO универса dataset，我们的提出方法比先前的比较方法更高效和泛化得更好。

Abstract
Recently, commonsense learning has been a hot topic in image-text matching. Although it can describe more graphic correlations, commonsense learning still has some shortcomings: 1) The existing methods are based on triplet semantic similarity measurement loss, which cannot effectively match the intractable negative in image-text sample pairs. 2) The weak generalization ability of the model leads to the poor effect of image and text matching on large-scale datasets. According to these shortcomings. This paper proposes a novel image-text matching model, called Active Mining Sample Pair Semantics image-text matching model (AMSPS). Compared with the single semantic learning mode of the commonsense learning model with triplet loss function, AMSPS is an active learning idea. Firstly, the proposed Adaptive Hierarchical Reinforcement Loss (AHRL) has diversified learning modes. Its active learning mode enables the model to more focus on the intractable negative samples to enhance the discriminating ability. In addition, AMSPS can also adaptively mine more hidden relevant semantic representations from uncommented items, which greatly improves the performance and generalization ability of the model. Experimental results on Flickr30K and MSCOCO universal datasets show that our proposed method is superior to advanced comparison methods.

摘要
最近，常识学习在图文匹配中得到了广泛关注。虽然它可以描述更多的图文关系，但常识学习仍有一些缺点：1）现有方法基于 triplet Semantic Similarity 度量损失，无法有效匹配图文样本对中的难以处理的负样本。2）模型的欠拟合能力导致图文匹配在大规模 datasets 上的效果不佳。根据这些缺点，本文提出了一种新的图文匹配模型，即 Active Mining Sample Pair Semantics 图文匹配模型（AMSPS）。与常识学习模型的单个 Semantic 学习模式相比，AMSPS 是一种活动学习的想法。首先，我们提出的 Adaptive Hierarchical Reinforcement Loss （AHRL）可以多样化学习模式。其活动学习模式使得模型更好地强调难以处理的负样本，以提高分辨率。此外，AMSPS 还可以动态挖掘更多的隐藏相关semantic 表示，从而大大提高模型的性能和泛化能力。实验结果表明，我们提出的方法在 Flickr30K 和 MSCOCO 通用dataset 上比 Advanced Comparison 方法更出色。

Linear Gaussian Bounding Box Representation and Ring-Shaped Rotated Convolution for Oriented Object Detection

paper_url: http://arxiv.org/abs/2311.05410
repo_url: https://github.com/zhen6618/rotayolo
paper_authors: Zhen Zhou, Yunkai Ma, Junfeng Fan, Zhaoyang Liu, Fengshui Jing, Min Tan
for:* 这篇论文主要目标是解决现有的oriented object detection中的boundary discontinuity问题，以及numerical instability问题。methods:* 该论文提出了一种新的oriented bounding box（LGBB）表示方法，通过线性变换Gaussian bounding box（GBB）的元素，以避免boundary discontinuity问题并具有高度的数字稳定性。* 论文还提出了一种新的 rotation-sensitive feature extraction方法，即ring-shaped rotated convolution（RRC），该方法可以在ring-shaped感知场中adaptively旋转特征图来捕捉到旋转敏感特征，以快速地聚合特征和上下文信息。results:* 实验结果表明，LGBB和RRC可以达到state-of-the-art的性能 Waterbury et al. (2018)的性能。* 论文还发现，将LGBB和RRC综合integrated into various models可以有效地提高检测精度。

Abstract
In oriented object detection, current representations of oriented bounding boxes (OBBs) often suffer from boundary discontinuity problem. Methods of designing continuous regression losses do not essentially solve this problem. Although Gaussian bounding box (GBB) representation avoids this problem, directly regressing GBB is susceptible to numerical instability. We propose linear GBB (LGBB), a novel OBB representation. By linearly transforming the elements of GBB, LGBB avoids the boundary discontinuity problem and has high numerical stability. In addition, existing convolution-based rotation-sensitive feature extraction methods only have local receptive fields, resulting in slow feature aggregation. We propose ring-shaped rotated convolution (RRC), which adaptively rotates feature maps to arbitrary orientations to extract rotation-sensitive features under a ring-shaped receptive field, rapidly aggregating features and contextual information. Experimental results demonstrate that LGBB and RRC achieve state-of-the-art performance. Furthermore, integrating LGBB and RRC into various models effectively improves detection accuracy.

摘要
在orientation对象检测中，当前的oriented boundin box（OBB）表示方式经常受到边界不连续问题困扰。直接使用Continuous regression loss方法不能够解决这个问题。虽然Gaussian bounding box（GBB）表示方式可以避免这个问题，但直接对GBB进行直接回归是数字不稳定的。我们提议使用线性GBB（LGBB），一种新的OBB表示方式。通过线性变换GBB中的元素，LGBB可以避免边界不连续问题，并且具有高度数字稳定性。此外，现有的Convolution-based rotation-sensitive feature extraction方法只有局部感知野，导致Feature收集慢，我们提议使用Ring-shaped rotated convolution（RRC），可以适应任意orientation的Feature映射，快速收集Feature和Contextual information。实验结果表明LGBB和RRC可以 дости得状态之巅性能。此外，将LGBB和RRCintegrated into various models可以有效提高检测精度。

SIRE: scale-invariant, rotation-equivariant estimation of artery orientations using graph neural networks

paper_url: http://arxiv.org/abs/2311.05400
repo_url: None
paper_authors: Dieuwertje Alblas, Julian Suk, Christoph Brune, Kak Khee Yeung, Jelmer M. Wolterink
for: 用于描述医疗影像中血管 geometry 的描述，包括中心线提取和后续分割和视化。
methods: 使用3D卷积神经网络（CNN）来确定血管的精确Orientation，但CNN 敏感于不同的血管大小和方向。
results: SIRE 可以准确地确定血管的方向，并且可以通过嵌入在中心线跟踪器中来跟踪 AAAs，即使训练数据中没有包含这些血管。

Abstract
Blood vessel orientation as visualized in 3D medical images is an important descriptor of its geometry that can be used for centerline extraction and subsequent segmentation and visualization. Arteries appear at many scales and levels of tortuosity, and determining their exact orientation is challenging. Recent works have used 3D convolutional neural networks (CNNs) for this purpose, but CNNs are sensitive to varying vessel sizes and orientations. We present SIRE: a scale-invariant, rotation-equivariant estimator for local vessel orientation. SIRE is modular and can generalise due to symmetry preservation. SIRE consists of a gauge equivariant mesh CNN (GEM-CNN) operating on multiple nested spherical meshes with different sizes in parallel. The features on each mesh are a projection of image intensities within the corresponding sphere. These features are intrinsic to the sphere and, in combination with the GEM-CNN, lead to SO(3)-equivariance. Approximate scale invariance is achieved by weight sharing and use of a symmetric maximum function to combine multi-scale predictions. Hence, SIRE can be trained with arbitrarily oriented vessels with varying radii to generalise to vessels with a wide range of calibres and tortuosity. We demonstrate the efficacy of SIRE using three datasets containing vessels of varying scales: the vascular model repository (VMR), the ASOCA coronary artery set, and a set of abdominal aortic aneurysms (AAAs). We embed SIRE in a centerline tracker which accurately tracks AAAs, regardless of the data SIRE is trained with. Moreover, SIRE can be used to track coronary arteries, even when trained only with AAAs. In conclusion, by incorporating SO(3) and scale symmetries, SIRE can determine the orientations of vessels outside of the training domain, forming a robust and data-efficient solution to geometric analysis of blood vessels in 3D medical images.

摘要
医疗影像中血管方向的三维视觉化是一个重要的描述器，可以用于血管中心线提取和进一步的分割和可见化。血管在多种尺度和扭曲程度出现，确定它们的具体方向是困难的。最近的工作使用了三维卷积神经网络（CNN）来实现这一点，但CNN具有不同血管大小和方向的敏感性。我们介绍了一种可缩放、旋转对称的描述器（SIRE），它可以在不同尺度和方向下准确地确定血管的方向。SIRE包括一个 gauge equivariant mesh CNN（GEM-CNN），该 CNN在多个嵌套的球体网格上运行，以获得不同尺度的特征。这些特征是圆柱体内的图像强度的投影，具有内在的SO(3)对称性。通过使用可变尺度的最大函数来组合多个尺度的预测，SIRE实现了约束度准确的抗噪倾向性。因此，SIRE可以在不同尺度和方向下训练，并且可以通过将其与不同的血管数据集进行组合来扩展到不同的血管尺度和扭曲程度。我们使用了三个不同尺度的血管数据集来证明SIRE的有效性：vascular model repository（VMR）、ASOCA coronary artery set和abdominal aortic aneurysms（AAAs）。我们将SIRE与中心线跟踪器结合，可以准确地跟踪AAAs，不管训练数据是什么。此外，SIRE还可以用于跟踪 coronary arteries，即使只有AAAs的训练数据。总之，通过包含SO(3)和尺度对称性，SIRE可以在不同尺度和方向下确定血管的方向，形成一种数据效率和稳定的解决方案，用于医疗影像中血管的三维геометрического分析。

Improving Hand Recognition in Uncontrolled and Uncooperative Environments using Multiple Spatial Transformers and Loss Functions

paper_url: http://arxiv.org/abs/2311.05383
repo_url: None
paper_authors: Wojciech Michal Matkowski, Xiaojie Li, Adams Wai Kin Kong
for: 提高恶势力识别率在不控制的环境下
methods: 使用多空间变换器网络（MSTN）和多种损失函数进行全手图像识别
results: 在NTU-PI-v1数据库和六个不同领域数据库上的实验结果表明，提案的算法在不控制的环境下表现出色，并且具有良好的适应性。

Abstract
The prevalence of smartphone and consumer camera has led to more evidence in the form of digital images, which are mostly taken in uncontrolled and uncooperative environments. In these images, criminals likely hide or cover their faces while their hands are observable in some cases, creating a challenging use case for forensic investigation. Many existing hand-based recognition methods perform well for hand images collected in controlled environments with user cooperation. However, their performance deteriorates significantly in uncontrolled and uncooperative environments. A recent work has exposed the potential of hand recognition in these environments. However, only the palmar regions were considered, and the recognition performance is still far from satisfactory. To improve the recognition accuracy, an algorithm integrating a multi-spatial transformer network (MSTN) and multiple loss functions is proposed to fully utilize information in full hand images. MSTN is firstly employed to localize the palms and fingers and estimate the alignment parameters. Then, the aligned images are further fed into pretrained convolutional neural networks, where features are extracted. Finally, a training scheme with multiple loss functions is used to train the network end-to-end. To demonstrate the effectiveness of the proposed algorithm, the trained model is evaluated on NTU-PI-v1 database and six benchmark databases from different domains. Experimental results show that the proposed algorithm performs significantly better than the existing methods in these uncontrolled and uncooperative environments and has good generalization capabilities to samples from different domains.

摘要
智能手机和消费类摄像头的普及导致更多的证据在形式为数字图像中出现，这些图像大多是在无控制和不合作环境中拍摄的。在这些图像中，嫌犯可能会隐藏或覆盖面部，而手部在某些情况下可能会出现， creating a challenging use case for forensic investigation. 现有的手部识别方法在控制环境下 WITH 用户合作下表现良好，但在无控制和不合作环境下，其性能差异显著。一项最近的研究曾经探讨了手部识别在这些环境中的潜力。然而，只考虑了手部的平板区域，并且认为手部识别性能仍然很差。为了提高识别精度，本文提出了一种 integrate 多个空间转换网络（MSTN）和多个损失函数的算法，以全面利用手部图像中的信息。首先，MSTN 被用来本地化手部和手指，并估计对应参数。然后，经过预训练的卷积神经网络进行更多的特征提取。最后，使用多个损失函数进行 trains 结构，以END-to-END 训练网络。为证明提出的算法的效iveness，已经在 NTU-PI-v1 数据库和六个不同领域的benchmark数据库进行了评估。实验结果表明，提出的算法在无控制和不合作环境中表现出色，并且在不同领域的样本上具有良好的泛化能力。

paper_url: http://arxiv.org/abs/2311.05348
repo_url: None
paper_authors: Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Yanchun Xie, Yi-Jie Huang, Yaqian Li
for: The paper is written to propose a new approach to adapt large language models (LLMs) to downstream tasks, specifically by using LLM as a bridge to connect multiple expert models.
methods: The proposed approach, called u-LLaVA, incorporates a modality alignment module and multi-task modules into LLM, and reorganizes or rebuilds multi-type public datasets to enable efficient modality alignment and instruction following.
results: The proposed approach achieves state-of-the-art performance across multiple benchmarks, and the authors release their model, the generated data, and the code base publicly available.Here are the three points in Simplified Chinese:
for: 这篇论文是为了提出一种新的方法，使大语言模型（LLM）在下游任务上适应。
methods: 该方法称为u-LLaVA，它将模式匹配模块和多任务模块 incorporated into LLM，并重新组织或重新建立多种公共数据集以实现有效的模式匹配和指令遵从。
results: 该方法在多个标准准则上达到了最佳性能，并将其模型、生成数据和代码库公开发布。

Abstract
Recent advances such as LLaVA and Mini-GPT4 have successfully integrated visual information into LLMs, yielding inspiring outcomes and giving rise to a new generation of multi-modal LLMs, or MLLMs. Nevertheless, these methods struggle with hallucinations and the mutual interference between tasks. To tackle these problems, we propose an efficient and accurate approach to adapt to downstream tasks by utilizing LLM as a bridge to connect multiple expert models, namely u-LLaVA. Firstly, we incorporate the modality alignment module and multi-task modules into LLM. Then, we reorganize or rebuild multi-type public datasets to enable efficient modality alignment and instruction following. Finally, task-specific information is extracted from the trained LLM and provided to different modules for solving downstream tasks. The overall framework is simple, effective, and achieves state-of-the-art performance across multiple benchmarks. We also release our model, the generated data, and the code base publicly available.

摘要
Firstly, we incorporate the modality alignment module and multi-task modules into LLM. Then, we reorganize or rebuild multi-type public datasets to enable efficient modality alignment and instruction following. Finally, task-specific information is extracted from the trained LLM and provided to different modules for solving downstream tasks.The overall framework is simple, effective, and achieves state-of-the-art performance across multiple benchmarks. We also release our model, the generated data, and the code base publicly available.Translated into Simplified Chinese:近期的进步，如LLaVA和Mini-GPT4，已经成功地将视觉信息 интеGRATE到LLMs中，产生了激动人心的结果，并且给出了一新的多Modal LLMs（MLLMs）的机遇。然而，这些方法受到幻觉和任务之间的互相干扰的问题。为了解决这些问题，我们提议一种高效和准确的方法，利用LLM作为多个专家模型之间的桥梁，称之为u-LLaVA。首先，我们在LLM中添加了模式匹配模块和多任务模块。然后，我们重新组织或重新建立多种类公共数据集，以便高效地进行模式匹配和指令遵从。最后，从训练过的LLM中提取了相关的任务信息，并将其提供给不同的模块以解决下游任务。总的来说，我们的框架是简单、高效，并在多个标准准点上实现了状态当前的性能。我们还公开发布了我们的模型、生成的数据和代码库。

SynFacePAD 2023: Competition on Face Presentation Attack Detection Based on Privacy-aware Synthetic Training Data

paper_url: http://arxiv.org/abs/2311.05336
repo_url: https://github.com/zi-yuanyang/ijcb-synfacepad-dig
paper_authors: Meiling Fang, Marco Huber, Julian Fierrez, Raghavendra Ramachandra, Naser Damer, Alhasan Alkhaddour, Maksim Kasantcev, Vasiliy Pryadchenko, Ziyuan Yang, Huijie Huangfu, Yingyu Chen, Yi Zhang, Yuchen Pan, Junjun Jiang, Xianming Liu, Xianyun Sun, Caiyong Wang, Xingyu Liu, Zhaohua Chang, Guangzhe Zhao, Juan Tapia, Lazaro Gonzalez-Soler, Carlos Aravena, Daniel Schulz
for: 竞赛旨在鼓励和吸引面部表现攻击检测方案，同时考虑个人数据隐私、法律和伦理问题。
methods: 参赛队伍使用的方法包括新型的检测方法和基于synthetic数据的训练方法。
results: 参赛队伍的提交解决方案在考古的benchmark中超越了考虑的基准。

Abstract
This paper presents a summary of the Competition on Face Presentation Attack Detection Based on Privacy-aware Synthetic Training Data (SynFacePAD 2023) held at the 2023 International Joint Conference on Biometrics (IJCB 2023). The competition attracted a total of 8 participating teams with valid submissions from academia and industry. The competition aimed to motivate and attract solutions that target detecting face presentation attacks while considering synthetic-based training data motivated by privacy, legal and ethical concerns associated with personal data. To achieve that, the training data used by the participants was limited to synthetic data provided by the organizers. The submitted solutions presented innovations and novel approaches that led to outperforming the considered baseline in the investigated benchmarks.

摘要

Spatial Attention-based Distribution Integration Network for Human Pose Estimation

paper_url: http://arxiv.org/abs/2311.05323
repo_url: None
paper_authors: Sihan Gao, Jing Zhu, Xiaoxuan Zhuang, Zhaoyue Wang, Qijin Li
for: 提高人体 pose ocalization 精度，增强模型对受 occlusion、多样化外观、灯光变化和 overlap 等挑战场景的能力。
methods: 提出 Spatial Attention-based Distribution Integration Network (SADI-NET)，包括三个高效模型： Receptive Fortified Module (RFM)、Spatial Fusion Module (SFM) 和 Distribution Learning Module (DLM)。基于经典 HourglassNet 架构，我们将基本块替换为我们提议的 RFM，并在扩大感知场景中增强 spatial 信息敏感性。
results: 在 MPII 和 LSP 测试集上进行了广泛的实验，并取得了优秀的 $92.10%$ 精度，比既有模型提高了 significatively，成为 state-of-the-art 性能。

Abstract
In recent years, human pose estimation has made significant progress through the implementation of deep learning techniques. However, these techniques still face limitations when confronted with challenging scenarios, including occlusion, diverse appearances, variations in illumination, and overlap. To cope with such drawbacks, we present the Spatial Attention-based Distribution Integration Network (SADI-NET) to improve the accuracy of localization in such situations. Our network consists of three efficient models: the receptive fortified module (RFM), spatial fusion module (SFM), and distribution learning module (DLM). Building upon the classic HourglassNet architecture, we replace the basic block with our proposed RFM. The RFM incorporates a dilated residual block and attention mechanism to expand receptive fields while enhancing sensitivity to spatial information. In addition, the SFM incorporates multi-scale characteristics by employing both global and local attention mechanisms. Furthermore, the DLM, inspired by residual log-likelihood estimation (RLE), reconfigures a predicted heatmap using a trainable distribution weight. For the purpose of determining the efficacy of our model, we conducted extensive experiments on the MPII and LSP benchmarks. Particularly, our model obtained a remarkable $92.10\%$ percent accuracy on the MPII test dataset, demonstrating significant improvements over existing models and establishing state-of-the-art performance.

摘要
近年来，人姿估算技术得到了深度学习的应用，但这些技术仍然在面临困难场景时存在限制，包括干扰、多样性、照明变化和重叠。为了解决这些缺点，我们提出了空间注意力基于分布集成网络（SADI-NET），以提高姿势估算的精度。我们的网络包括三个高效模型：感知强化模块（RFM）、空间融合模块（SFM）和分布学习模块（DLM）。基于经典的小时钟网络架构，我们将基本块更换为我们所提议的RFM。RFM包括具有扩展辐射场和注意力机制的延迟块，以扩大感知场和增强对空间信息的敏感度。此外，SFM采用了多尺度特征，通过使用全球和本地注意力机制来实现。此外，DLM，取得了基于逻辑梯度估计（RLE）的预测热图的改进，通过使用可学习的分布权重来重新配置预测热图。为了评估我们的模型效果，我们在MPII和LSP测试集上进行了广泛的实验。特别是，我们的模型在MPII测试集上达到了92.10%的准确率，表明了显著的改进和状态艺术性表现。

SPADES: A Realistic Spacecraft Pose Estimation Dataset using Event Sensing

paper_url: http://arxiv.org/abs/2311.05310
repo_url: None
paper_authors: Arunkumar Rathinam, Haytam Qadadri, Djamila Aouada
for:* 这个研究旨在提高在轨道上的自主操作，例如 rendezvous、 docking 和 proximity maneuvers，使用 Deep Learning-based Spacecraft Pose Estimation 技术。methods:* 这个研究使用了 Domain Adaptation 技术来减少域别差异的影响，并使用了事件感应器来减少域别差异。results:* 这个研究创建了一个名为 SPADES 的新数据集，包括实际的事件数据和虚拟事件数据，并提出了一个有效的数据筛选方法以提高模型性能。此外，这个研究还引入了一个基于图像的事件表示，与现有的表示方法相比，具有更高的性能。

Abstract
In recent years, there has been a growing demand for improved autonomy for in-orbit operations such as rendezvous, docking, and proximity maneuvers, leading to increased interest in employing Deep Learning-based Spacecraft Pose Estimation techniques. However, due to limited access to real target datasets, algorithms are often trained using synthetic data and applied in the real domain, resulting in a performance drop due to the domain gap. State-of-the-art approaches employ Domain Adaptation techniques to mitigate this issue. In the search for viable solutions, event sensing has been explored in the past and shown to reduce the domain gap between simulations and real-world scenarios. Event sensors have made significant advancements in hardware and software in recent years. Moreover, the characteristics of the event sensor offer several advantages in space applications compared to RGB sensors. To facilitate further training and evaluation of DL-based models, we introduce a novel dataset, SPADES, comprising real event data acquired in a controlled laboratory environment and simulated event data using the same camera intrinsics. Furthermore, we propose an effective data filtering method to improve the quality of training data, thus enhancing model performance. Additionally, we introduce an image-based event representation that outperforms existing representations. A multifaceted baseline evaluation was conducted using different event representations, event filtering strategies, and algorithmic frameworks, and the results are summarized. The dataset will be made available at http://cvi2.uni.lu/spades.

摘要
近年来，卫星运行中的自主化需求提高，如 rendezvous、停机和距离推进等操作，导致深度学习基于空间机器人定位估计技术的兴趣增加。然而，由于实际目标数据的有限访问，算法通常在实际领域使用 synthetic 数据进行训练，导致领域差距问题。现代方法利用领域适应技术来解决这个问题。在寻找可行的解决方案时，事件感知被探索和研究，并显示它可以降低实际领域和模拟领域之间的领域差距。事件感知的硬件和软件技术在最近几年内做出了重要进展。此外，事件感知器在空间应用中具有许多优点，比如 RGB 感知器。为了进一步训练和评估深度学习基于模型，我们介绍了一个新的数据集，称为 SPADES，该数据集包含实际事件数据，从实验室环境中获取，以及使用相同摄像机特性的 simulated 事件数据。此外，我们提出了一种有效的数据筛选方法，以提高训练数据质量，从而提高模型性能。此外，我们引入了一种基于图像的事件表示方法，超过了现有的表示方法。我们通过不同的事件表示方法、事件筛选策略和算法框架进行多方面基准评估，结果如下。数据集将在 http://cvi2.uni.lu/spades 上公开。

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

paper_url: http://arxiv.org/abs/2311.05298
repo_url: None
paper_authors: Cheng Yang, Rui Xu, Ye Guo, Peixiang Huang, Yiru Chen, Wenkui Ding, Zhongyuan Wang, Hong Zhou
for: 本研究旨在提高视觉常识逻辑（VCR）的性能，VCR是一项复杂的多模态任务，需要高水平的认知和常识逻辑能力。
methods: 我们提出了一种基于视觉场景的空间关系图建构方法，并设计了两个预训练任务：对象位置回归（OPR）和空间关系分类（SRC），以学习重建空间关系图。
results: 我们的方法可以导致表示保持更多的空间上下文，帮助注意力集中在重要的视觉区域上进行逻辑。我们实现了VCR和两个其他视觉语言逻辑任务（VQA和NLVR）的状态时间表现。

Abstract
Visual commonsense reasoning (VCR) is a challenging multi-modal task, which requires high-level cognition and commonsense reasoning ability about the real world. In recent years, large-scale pre-training approaches have been developed and promoted the state-of-the-art performance of VCR. However, the existing approaches almost employ the BERT-like objectives to learn multi-modal representations. These objectives motivated from the text-domain are insufficient for the excavation on the complex scenario of visual modality. Most importantly, the spatial distribution of the visual objects is basically neglected. To address the above issue, we propose to construct the spatial relation graph based on the given visual scenario. Further, we design two pre-training tasks named object position regression (OPR) and spatial relation classification (SRC) to learn to reconstruct the spatial relation graph respectively. Quantitative analysis suggests that the proposed method can guide the representations to maintain more spatial context and facilitate the attention on the essential visual regions for reasoning. We achieve the state-of-the-art results on VCR and two other vision-and-language reasoning tasks VQA, and NLVR.

摘要
Visual 常识理解 (VCR) 是一个复杂的多Modal任务，需要高度的认知和常识理解能力。在过去几年，大规模预训练方法得到了广泛的应用和提高了VCR的状态艺术。然而，现有的方法大多采用BERT类目标来学习多Modal表示。这些目标来自文本领域，对于视觉领域的复杂情况不够。尤其是忽略了视觉对象的空间分布。为解决以上问题，我们提议构建基于给定的视觉场景的空间关系图。此外，我们设计了两个预训练任务名为物体位置Rectification (OPR)和空间关系分类 (SRC)，以学习重建空间关系图。量化分析表明，我们的方法可以导致表示具有更多的空间 контекст和促进关注重要的视觉区域 для理解。我们在VCR和两个视觉语言理解任务VQA、NLVR中实现了状态艺术 Results。

VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis

paper_url: http://arxiv.org/abs/2311.05289
repo_url: None
paper_authors: Sen Wang, Wei Zhang, Stefano Gasperini, Shun-Cheng Wu, Nassir Navab
for: 提高各种虚拟应用的图像质量，特别是indoor环境下的图像生成。
methods: 利用立方体表示法提高图像生成的质量和效率，并采用多分辨率哈希网格适应 occlusion 和indoor场景中的复杂geometry。
results: 比对三个公共indoor数据集，vosNeRF 方法在图像生成中表现出色，同时提高了训练和渲染时间的效率，甚至超过了 Instant-NGP 的速度， bringing the technology closer to real-time。

Abstract
Creating high-quality view synthesis is essential for immersive applications but continues to be problematic, particularly in indoor environments and for real-time deployment. Current techniques frequently require extensive computational time for both training and rendering, and often produce less-than-ideal 3D representations due to inadequate geometric structuring. To overcome this, we introduce VoxNeRF, a novel approach that leverages volumetric representations to enhance the quality and efficiency of indoor view synthesis. Firstly, VoxNeRF constructs a structured scene geometry and converts it into a voxel-based representation. We employ multi-resolution hash grids to adaptively capture spatial features, effectively managing occlusions and the intricate geometry of indoor scenes. Secondly, we propose a unique voxel-guided efficient sampling technique. This innovation selectively focuses computational resources on the most relevant portions of ray segments, substantially reducing optimization time. We validate our approach against three public indoor datasets and demonstrate that VoxNeRF outperforms state-of-the-art methods. Remarkably, it achieves these gains while reducing both training and rendering times, surpassing even Instant-NGP in speed and bringing the technology closer to real-time.

摘要
Firstly, VoxNeRF constructs a structured scene geometry and converts it into a voxel-based representation. We use multi-resolution hash grids to adaptively capture spatial features, effectively managing occlusions and the intricate geometry of indoor scenes.Secondly, we propose a unique voxel-guided efficient sampling technique. This innovation selectively focuses computational resources on the most relevant portions of ray segments, significantly reducing optimization time.We validate our approach against three public indoor datasets and demonstrate that VoxNeRF outperforms state-of-the-art methods. Remarkably, it achieves these gains while reducing both training and rendering times, surpassing even Instant-NGP in speed and bringing the technology closer to real-time.

SAMVG: A Multi-stage Image Vectorization Model with the Segment-Anything Model

paper_url: http://arxiv.org/abs/2311.05276
repo_url: None
paper_authors: Haokun Zhu, Juang Ian Chong, Teng Hu, Ran Yi, Yu-Kun Lai, Paul L. Rosin
for: 本研究旨在提出一种基于多 stage模型的vector化方法，以生成高质量的scalable vector graphics（SVG）。
methods: 该方法首先使用通用图像分割模型提供的一般图像分割结果，然后使用一种新的滤波方法来选择整个图像最佳的密集分割图。其次，方法会识别缺失的组件并增加更多的细节组件到SVG中。
results: 经过广泛的实验表明，SAMVG可以在任何领域生成高质量的SVG，需要 menos计算时间和复杂度比前一代方法更低。

Abstract
Vector graphics are widely used in graphical designs and have received more and more attention. However, unlike raster images which can be easily obtained, acquiring high-quality vector graphics, typically through automatically converting from raster images remains a significant challenge, especially for more complex images such as photos or artworks. In this paper, we propose SAMVG, a multi-stage model to vectorize raster images into SVG (Scalable Vector Graphics). Firstly, SAMVG uses general image segmentation provided by the Segment-Anything Model and uses a novel filtering method to identify the best dense segmentation map for the entire image. Secondly, SAMVG then identifies missing components and adds more detailed components to the SVG. Through a series of extensive experiments, we demonstrate that SAMVG can produce high quality SVGs in any domain while requiring less computation time and complexity compared to previous state-of-the-art methods.

摘要
Vector图形广泛应用于视觉设计中，受到越来越多的关注。然而，与矢量图像不同，从矢量图像自动转换为高质量矢量图形仍然是一项重要挑战，特别是 для更复杂的图像，如照片或艺术作品。在这篇论文中，我们提出了SAMVG模型，用于将矢量图像转换为SVG（可缩放vector图形）。首先，SAMVG使用Segment-Anything模型提供的通用图像分割，并使用一种新的筛选方法来选择整个图像的最佳笔触分割图。其次，SAMVG会找到缺失的组件并添加更多细节到SVG中。经过了一系列的广泛实验，我们证明了SAMVG可以生成高质量的SVG，无需更多的计算时间和复杂度，与之前的状态艺术方法相比。

Single-shot Tomography of Discrete Dynamic Objects

paper_url: http://arxiv.org/abs/2311.05269
repo_url: None
paper_authors: Ajinkya Kadu, Felix Lucka, Kees Joost Batenburg
for: 高分辨率时间图像重建
methods: 使用水平集方法进行图像分割和表示运动，以及一种可 computationally efficient 和 east optimizable 的变分框架
results: 在Synthetic 和 pseudo-dynamic real X-ray tomography 数据集上显示出比现有方法更高的性能，能够重建高质量的2D或3D图像序列，只需单个投影每帧。

Abstract
This paper presents a novel method for the reconstruction of high-resolution temporal images in dynamic tomographic imaging, particularly for discrete objects with smooth boundaries that vary over time. Addressing the challenge of limited measurements per time point, we propose a technique that synergistically incorporates spatial and temporal information of the dynamic objects. This is achieved through the application of the level-set method for image segmentation and the representation of motion via a sinusoidal basis. The result is a computationally efficient and easily optimizable variational framework that enables the reconstruction of high-quality 2D or 3D image sequences with a single projection per frame. Compared to current methods, our proposed approach demonstrates superior performance on both synthetic and pseudo-dynamic real X-ray tomography datasets. The implications of this research extend to improved visualization and analysis of dynamic processes in tomographic imaging, finding potential applications in diverse scientific and industrial domains.

摘要

Widely Applicable Strong Baseline for Sports Ball Detection and Tracking

paper_url: http://arxiv.org/abs/2311.05237
repo_url: https://github.com/nttcom/wasb-sbdt
paper_authors: Shuhei Tarashima, Muhammad Abdul Haq, Yushan Wang, Norio Tagawa
for: 本研究提出了一种新的运动球检测和跟踪方法 (SBDT), 可以应用于不同的运动类别。
methods: 该方法包括高分辨率特征提取、位置意识模型训练和时间一致性推断，这三个部分组合成了一个新的 SBDT 基准。
results: 实验结果表明，我们的方法在所有运动类别中具有显著优势，至于具体的结果可以查看我们的 GitHub 上的数据和代码。

Abstract
In this work, we present a novel Sports Ball Detection and Tracking (SBDT) method that can be applied to various sports categories. Our approach is composed of (1) high-resolution feature extraction, (2) position-aware model training, and (3) inference considering temporal consistency, all of which are put together as a new SBDT baseline. Besides, to validate the wide-applicability of our approach, we compare our baseline with 6 state-of-the-art SBDT methods on 5 datasets from different sports categories. We achieve this by newly introducing two SBDT datasets, providing new ball annotations for two datasets, and re-implementing all the methods to ease extensive comparison. Experimental results demonstrate that our approach is substantially superior to existing methods on all the sports categories covered by the datasets. We believe our proposed method can play as a Widely Applicable Strong Baseline (WASB) of SBDT, and our datasets and codebase will promote future SBDT research. Datasets and codes are available at https://github.com/nttcom/WASB-SBDT .

摘要
在这项工作中，我们提出了一种新的体育球检测和跟踪（SBDT）方法，可以应用于不同的体育类别。我们的方法包括（1）高分辨率特征提取、（2）位域意识模型训练和（3）基于时间一致性的推理，这些都被整合成了一个新的 SBDT 基准。此外，为了证明我们的方法广泛可用，我们与6种现有 SBDT 方法进行了比较，使用5个不同的体育类别的数据集。我们新 introduce two SBDT 数据集，提供了新的球标注 для两个数据集，并重新实现了所有方法，以便进行广泛的比较。实验结果表明，我们的方法在所有涉及的体育类别中具有显著优势。我们认为，我们提出的方法可以扮演为一种广泛适用的强大基准（WASB），而我们提供的数据集和代码库将推动未来的 SBDT 研究。数据集和代码可以在 GitHub 上获取：https://github.com/nttcom/WASB-SBDT。

ConRad: Image Constrained Radiance Fields for 3D Generation from a Single Image

paper_url: http://arxiv.org/abs/2311.05230
repo_url: None
paper_authors: Senthil Purushwalkam, Nikhil Naik
for: 从单个RGB图像中重构3D物体
methods: 基于最新的图像生成模型，推理隐藏的3D结构，保持输入图像的准确性
results: 提供了一种简单有效的3D表示方式，可以保持输入图像的详细信息，并生成实际的3D重建结果，与现有基eline前景准确相对。

Abstract
We present a novel method for reconstructing 3D objects from a single RGB image. Our method leverages the latest image generation models to infer the hidden 3D structure while remaining faithful to the input image. While existing methods obtain impressive results in generating 3D models from text prompts, they do not provide an easy approach for conditioning on input RGB data. Na\"ive extensions of these methods often lead to improper alignment in appearance between the input image and the 3D reconstructions. We address these challenges by introducing Image Constrained Radiance Fields (ConRad), a novel variant of neural radiance fields. ConRad is an efficient 3D representation that explicitly captures the appearance of an input image in one viewpoint. We propose a training algorithm that leverages the single RGB image in conjunction with pretrained Diffusion Models to optimize the parameters of a ConRad representation. Extensive experiments show that ConRad representations can simplify preservation of image details while producing a realistic 3D reconstruction. Compared to existing state-of-the-art baselines, we show that our 3D reconstructions remain more faithful to the input and produce more consistent 3D models while demonstrating significantly improved quantitative performance on a ShapeNet object benchmark.

摘要
我们提出了一种新的方法，用于从单个RGB图像中重建3D对象。我们的方法利用最新的图像生成模型来推断隐藏的3D结构，同时保持对输入图像的忠实。现有的方法可以从文本提示中生成出色的3D模型，但是它们不提供一个简单的入口点来 condition on 输入RGB数据。不熟悉的扩展可能会导致图像和3D重建中的 aparence不一致。我们解决这些挑战 by introducing Image Constrained Radiance Fields (ConRad), a novel variant of neural radiance fields. ConRad是一种高效的3D表示，可以直接 capture输入图像的一个视点的外观。我们提出了一种培育算法，利用单个RGB图像和预训练的扩散模型来优化ConRad表示的参数。广泛的实验表明，ConRad表示可以简化保持图像细节的同时生成真实的3D重建。相比于现有的状态机器人标准基eline，我们的3D重建更加 faithful 到输入和生成更一致的3D模型，同时显示出了明显改善的量化性能在ShapeNet对象benchmark中。

Let’s Get the FACS Straight – Reconstructing Obstructed Facial Features

paper_url: http://arxiv.org/abs/2311.05221
repo_url: None
paper_authors: Tim Büchner, Sven Sickert, Gerd Fabian Volk, Christoph Anders, Orlando Guntinas-Lichius, Joachim Denzler
for: 提高机器学习方法对受阻面部表情的理解
methods: 使用 CycleGAN 架构实现样式传递，不需要匹配对
results: 可以达到与无阻挡记录相同的评价分数，提高面部表情分析的准确性

Abstract
The human face is one of the most crucial parts in interhuman communication. Even when parts of the face are hidden or obstructed the underlying facial movements can be understood. Machine learning approaches often fail in that regard due to the complexity of the facial structures. To alleviate this problem a common approach is to fine-tune a model for such a specific application. However, this is computational intensive and might have to be repeated for each desired analysis task. In this paper, we propose to reconstruct obstructed facial parts to avoid the task of repeated fine-tuning. As a result, existing facial analysis methods can be used without further changes with respect to the data. In our approach, the restoration of facial features is interpreted as a style transfer task between different recording setups. By using the CycleGAN architecture the requirement of matched pairs, which is often hard to fullfill, can be eliminated. To proof the viability of our approach, we compare our reconstructions with real unobstructed recordings. We created a novel data set in which 36 test subjects were recorded both with and without 62 surface electromyography sensors attached to their faces. In our evaluation, we feature typical facial analysis tasks, like the computation of Facial Action Units and the detection of emotions. To further assess the quality of the restoration, we also compare perceptional distances. We can show, that scores similar to the videos without obstructing sensors can be achieved.

摘要
人类面部是交流中最重要的部分之一。即使面部部分被隐藏或堵塞，也可以理解下面部的运动。机器学习方法经常在这个方面失败，因为面部结构的复杂性。为解决这个问题，常见的方法是为每个特定应用进行精细调整。然而，这是计算昂贵的，并且可能需要重复进行每个分析任务。在这篇论文中，我们提议使用恢复隐藏的面部部分来避免多次精细调整。通过这种方式，现有的面部分析方法可以无需更改数据进行使用。在我们的方法中，恢复面部特征被解释为面部样式传递任务。通过使用 CycleGAN 架构，可以消除匹配对的要求，这经常是难以满足的。为证明我们的方法的可行性，我们比较了我们的恢复与没有隐藏感知器的实际录制视频。我们创建了一个新的数据集，其中有 36 名测试者在不同的录制设置下被录制。在我们的评估中，我们包括常见的面部分析任务，如计算面部动作单元和感情检测。为进一步评估恢复质量，我们还比较了感知距离。我们可以显示，我们的恢复视频与没有隐藏感知器的视频的分数相似。

BrainNetDiff: Generative AI Empowers Brain Network Generation via Multimodal Diffusion Model

paper_url: http://arxiv.org/abs/2311.05199
repo_url: None
paper_authors: Yongcheng Zong, Shuqiang Wang
for: 本研究旨在提供一种新的脑网络分析方法，以 deeper understanding of brain functions and disease mechanisms.
methods: 该方法 combines 多头 transformer encoder 和 conditional latent diffusion model，从 FMRI 时间序列中提取有关特征，并将脑网络生成为图形.
results: 实验结果表明，该方法在健康和神经科学上的数据集上有效地生成脑网络，并在下游疾病分类任务中表现出色.

Abstract
Brain network analysis has emerged as pivotal method for gaining a deeper understanding of brain functions and disease mechanisms. Despite the existence of various network construction approaches, shortcomings persist in the learning of correlations between structural and functional brain imaging data. In light of this, we introduce a novel method called BrainNetDiff, which combines a multi-head Transformer encoder to extract relevant features from fMRI time series and integrates a conditional latent diffusion model for brain network generation. Leveraging a conditional prompt and a fusion attention mechanism, this method significantly improves the accuracy and stability of brain network generation. To the best of our knowledge, this represents the first framework that employs diffusion for the fusion of the multimodal brain imaging and brain network generation from images to graphs. We validate applicability of this framework in the construction of brain network across healthy and neurologically impaired cohorts using the authentic dataset. Experimental results vividly demonstrate the significant effectiveness of the proposed method across the downstream disease classification tasks. These findings convincingly emphasize the prospective value in the field of brain network research, particularly its key significance in neuroimaging analysis and disease diagnosis. This research provides a valuable reference for the processing of multimodal brain imaging data and introduces a novel, efficient solution to the field of neuroimaging.

摘要
�� brain 网络分析已经成为脑功能和疾病机制研究的关键方法。 despite 多种网络建构方法的存在， correlation 学习 between 结构和功能 Magnetic Resonance Imaging（MRI）数据仍然存在缺陷。为此，我们介绍了一种新的方法called BrainNetDiff，它将 multi-head Transformer 编码器用于 FMRI 时间序列中EXTRACT 相关特征，并将 conditional latent diffusion 模型用于脑网络生成。通过 conditional prompt 和 Fusion attention 机制，这种方法可以提高脑网络生成的准确性和稳定性。根据我们所知，这是第一个使用 diffusion 将多Modal brain imaging 和脑网络生成转化为图形的框架。我们验证了这种框架在健康和 neurolOgical impairment 群体中的应用，并使用 authentic dataset 进行验证。实验结果表明，提案的方法在下游疾病分类任务中表现出色，这些结果强烈地强调了该方法在脑网络研究、特别是 Magnetic Resonance Imaging 分析和疾病诊断中的潜在价值。本研究为多Modal brain imaging 数据处理提供了一个有价值的参考，并提供了一种新、高效的解决方案 для neuroimaging 领域。

Adaptive-Labeling for Enhancing Remote Sensing Cloud Understanding

paper_url: http://arxiv.org/abs/2311.05198
repo_url: https://github.com/jaygala223/cloud-adaptive-labeling
paper_authors: Jay Gala, Sauradip Nag, Huichou Huang, Ruirui Liu, Xiatian Zhu
for: 本研究旨在提高远程感知中的云分类精度，以便在气象和气候科学中进行细致的云分析，从而优化各种预测和管理应用。
methods: 我们提出了一种创新的模型无关的云适应标注（CAL）方法，通过iteratively进行云训练图像的标注更新，从而提高学习模型的性能。我们的方法首先使用原始标注来训练云分类模型，然后引入可调Pixel敏感度阈值，在流动图像上适应地标注云图像。
results: 我们在多个标准云分类 benchmark上进行了广泛的实验，并证明了我们的方法能够显著提高现有 segmentation 模型的性能。我们的 CAL 方法在比较多种现有方法时创造了新的状态态-of-the-art 结果。

Abstract
Cloud analysis is a critical component of weather and climate science, impacting various sectors like disaster management. However, achieving fine-grained cloud analysis, such as cloud segmentation, in remote sensing remains challenging due to the inherent difficulties in obtaining accurate labels, leading to significant labeling errors in training data. Existing methods often assume the availability of reliable segmentation annotations, limiting their overall performance. To address this inherent limitation, we introduce an innovative model-agnostic Cloud Adaptive-Labeling (CAL) approach, which operates iteratively to enhance the quality of training data annotations and consequently improve the performance of the learned model. Our methodology commences by training a cloud segmentation model using the original annotations. Subsequently, it introduces a trainable pixel intensity threshold for adaptively labeling the cloud training images on the fly. The newly generated labels are then employed to fine-tune the model. Extensive experiments conducted on multiple standard cloud segmentation benchmarks demonstrate the effectiveness of our approach in significantly boosting the performance of existing segmentation models. Our CAL method establishes new state-of-the-art results when compared to a wide array of existing alternatives.

摘要
云分析是气象和气候科学中的关键组成部分，影响各种领域，如灾害管理。然而，在远程感知中实现细致云分析，如云分割，仍然是一项挑战，因为获得准确标签的困难，导致训练数据中的标签错误很大。现有方法frequently假设可以获得可靠的分割标注，限制其总体性能。为解决这种内在的限制，我们介绍了一种创新的模型无关Cloud Adaptive-Labeling（CAL）方法，该方法在训练数据标注质量的基础上进行迭代增强，并因此提高学习模型的性能。我们的方法流程如下：首先，我们使用原始标注训练云分 segmentation模型。然后，我们引入可训练像素强度阈值，以适应性地标注云训练图像。新生成的标注被employmed для细化模型。我们在多个标准云分 segmentation benchmark上进行了广泛的实验，结果表明，我们的方法可以在存在标签错误的情况下，大幅提高现有分 segmentation模型的性能。我们的CAL方法在与多种现有方法进行比较时，创造了新的状态态峰值结果。

TransReg: Cross-transformer as auto-registration module for multi-view mammogram mass detection

paper_url: http://arxiv.org/abs/2311.05192
repo_url: None
paper_authors: Hoang C. Nguyen, Chi Phan, Hieu H. Pham
For: 这个研究旨在开发一个基于多视图照片的电脑助诊系统（CAD），以实现早期胸癌检测中的胸癌检测。* Methods: 这个系统使用了两个照片的联合资料，通过实现这两个照片之间的关联，以提高医生对胸癌的诊断准确性。* Results: 这个研究表明，使用这个系统可以实现更高的胸癌检测精度，并且可以降低伪阳性率。具体来说，在DDSM和VinDr-Mammo数据集上，这个系统使用SwinT作为特征提取器时，在伪阳性率为0.5时取得了83.3%的精度。

Abstract
Screening mammography is the most widely used method for early breast cancer detection, significantly reducing mortality rates. The integration of information from multi-view mammograms enhances radiologists' confidence and diminishes false-positive rates since they can examine on dual-view of the same breast to cross-reference the existence and location of the lesion. Inspired by this, we present TransReg, a Computer-Aided Detection (CAD) system designed to exploit the relationship between craniocaudal (CC), and mediolateral oblique (MLO) views. The system includes cross-transformer to model the relationship between the region of interest (RoIs) extracted by siamese Faster RCNN network for mass detection problems. Our work is the first time cross-transformer has been integrated into an object detection framework to model the relation between ipsilateral views. Our experimental evaluation on DDSM and VinDr-Mammo datasets shows that our TransReg, equipped with SwinT as a feature extractor achieves state-of-the-art performance. Specifically, at the false positive rate per image at 0.5, TransReg using SwinT gets a recall at 83.3% for DDSM dataset and 79.7% for VinDr-Mammo dataset. Furthermore, we conduct a comprehensive analysis to demonstrate that cross-transformer can function as an auto-registration module, aligning the masses in dual-view and utilizing this information to inform final predictions. It is a replication diagnostic workflow of expert radiologists

摘要
屏幕检查肿瘤是现代医学中最广泛使用的方法，可以有效降低乳腺癌死亡率。将多视图照片信息集成可以提高医生的自信心，同时降低假阳率，因为它们可以在两个视图中跨参照肿瘤的存在和位置。 Drawing inspiration from this, we present TransReg, a computer-aided detection (CAD) system designed to exploit the relationship between craniocaudal (CC) and mediolateral oblique (MLO) views. The system includes a cross-transformer to model the relationship between the region of interest (RoIs) extracted by a Siamese Faster RCNN network for mass detection problems. Our work is the first time cross-transformer has been integrated into an object detection framework to model the relation between ipsilateral views. Our experimental evaluation on DDSM and VinDr-Mammo datasets shows that our TransReg, equipped with SwinT as a feature extractor, achieves state-of-the-art performance. Specifically, at a false positive rate of 0.5, TransReg using SwinT achieves a recall of 83.3% for the DDSM dataset and 79.7% for the VinDr-Mammo dataset. Furthermore, we conduct a comprehensive analysis to demonstrate that cross-transformer can function as an auto-registration module, aligning the masses in dual-view and utilizing this information to inform final predictions. This is a replication diagnostic workflow of expert radiologists.

Audio-visual Saliency for Omnidirectional Videos

paper_url: http://arxiv.org/abs/2311.05190
repo_url: https://github.com/FannyChao/AVS360_audiovisual_saliency_360
paper_authors: Yuxin Zhu, Xilei Zhu, Huiyu Duan, Jie Li, Kaiwei Zhang, Yucheng Zhu, Li Chen, Xiongkuo Min, Guangtao Zhai
For: The paper is written for predicting visual saliency in omnidirectional videos (ODVs) and analyzing the influence of audio on visual attention.* Methods: The paper uses a large-scale audio-visual dataset (AVS-ODV) to analyze the visual attention behavior of observers under various omnidirectional audio modalities and visual scenes. It also compares the performance of several state-of-the-art saliency prediction models on the AVS-ODV dataset and constructs a new benchmark.* Results: The paper establishes the largest audio-visual saliency dataset for ODVs and analyzes the visual attention behavior of observers under various audio modalities and visual scenes. It also compares the performance of several state-of-the-art saliency prediction models on the AVS-ODV dataset and constructs a new benchmark.Here is the information in Simplified Chinese text:* For: 这篇论文是为了预测全景视频中的视觉吸引力和听音影响视觉注意力。* Methods: 这篇论文使用大规模的音视频数据集(AVS-ODV)来分析观众在不同全景声音模式下的视觉注意力行为，以及不同视频场景下的视觉注意力行为。它还比较了一些状态之际的最佳预测模型在AVS-ODV数据集上的性能，并构建了新的标准。* Results: 这篇论文建立了全景视频中最大的音视频预测数据集(AVS-ODV)，并分析了观众在不同全景声音模式下的视觉注意力行为。它还比较了一些状态之际的最佳预测模型在AVS-ODV数据集上的性能，并构建了新的标准。

Abstract
Visual saliency prediction for omnidirectional videos (ODVs) has shown great significance and necessity for omnidirectional videos to help ODV coding, ODV transmission, ODV rendering, etc.. However, most studies only consider visual information for ODV saliency prediction while audio is rarely considered despite its significant influence on the viewing behavior of ODV. This is mainly due to the lack of large-scale audio-visual ODV datasets and corresponding analysis. Thus, in this paper, we first establish the largest audio-visual saliency dataset for omnidirectional videos (AVS-ODV), which comprises the omnidirectional videos, audios, and corresponding captured eye-tracking data for three video sound modalities including mute, mono, and ambisonics. Then we analyze the visual attention behavior of the observers under various omnidirectional audio modalities and visual scenes based on the AVS-ODV dataset. Furthermore, we compare the performance of several state-of-the-art saliency prediction models on the AVS-ODV dataset and construct a new benchmark. Our AVS-ODV datasets and the benchmark will be released to facilitate future research.

摘要
“视觉吸引预测 для全方位视频（ODV）已经表现出了非常重要和必要的地位，以帮助ODV编码、ODV传输、ODV渲染等等。然而，大多数研究只考虑了视觉信息的ODV吸引预测，声音却 rarely 被考虑，尽管它对OBDV的观看习惯有很大的影响。这主要是因为缺乏大规模的 audio-visual ODV 数据集和相关分析。因此，在这篇论文中，我们首先建立了全方位视频、声音和相应的捕捉眼动数据的最大 audio-visual 吸引数据集（AVS-ODV），该数据集包括 omnidirectional 视频、声音和三种视频声明模式（包括无声、单声道和杜邦扬声）。然后，我们分析了在不同的全方位声音模式下观看者的视觉注意力行为，基于 AVS-ODV 数据集。此外，我们对多种当前领先的吸引预测模型在 AVS-ODV 数据集上的性能进行比较，并构建了一个新的标准。我们的 AVS-ODV 数据集和标准将被发布，以便未来的研究。”

Dynamic Association Learning of Self-Attention and Convolution in Image Restoration

paper_url: http://arxiv.org/abs/2311.05147
repo_url: None
paper_authors: Kui Jiang, Xuemei Jia, Wenxin Huang, Wenbin Wang, Zheng Wang, Junjun Jiang
for: This paper proposes an association learning method to improve image deraining by utilizing the advantages of CNNs and Self-Attention, while suppressing their shortcomings.
methods: The proposed method uses a novel multi-input attention module to generate a degradation prior and produce a degradation mask, which helps to extract informative complementary components from the rainy input and restore accurate textures. The method also uses a hybrid fusion network that combines a residual Transformer branch and an encoder-decoder branch to encode global features of the image and represent contexture knowledge.
results: The proposed method achieves high-quality and efficient inpainting by associating rain streak removal and background recovery, and outperforms existing state-of-the-art methods in terms of both visual quality and computational efficiency.

Abstract
CNNs and Self attention have achieved great success in multimedia applications for dynamic association learning of self-attention and convolution in image restoration. However, CNNs have at least two shortcomings: 1) limited receptive field; 2) static weight of sliding window at inference, unable to cope with the content diversity.In view of the advantages and disadvantages of CNNs and Self attention, this paper proposes an association learning method to utilize the advantages and suppress their shortcomings, so as to achieve high-quality and efficient inpainting. We regard rain distribution reflects the degradation location and degree, in addition to the rain distribution prediction. Thus, we propose to refine background textures with the predicted degradation prior in an association learning manner. As a result, we accomplish image deraining by associating rain streak removal and background recovery, where an image deraining network and a background recovery network are designed for two subtasks. The key part of association learning is a novel multi-input attention module. It generates the degradation prior and produces the degradation mask according to the predicted rainy distribution. Benefited from the global correlation calculation of SA, MAM can extract the informative complementary components from the rainy input with the degradation mask, and then help accurate texture restoration. Meanwhile, SA tends to aggregate feature maps with self-attention importance, but convolution diversifies them to focus on the local textures. A hybrid fusion network involves one residual Transformer branch and one encoder-decoder branch. The former takes a few learnable tokens as input and stacks multi-head attention and feed-forward networks to encode global features of the image. The latter, conversely, leverages the multi-scale encoder-decoder to represent contexture knowledge.

摘要
使用CNN和自注意来处理多媒体应用程序中的动态关联学习，得到了很大的成功。然而，CNN具有至少两个缺点：1）有限的接收场景；2）在推理过程中静态的窗口重复计算，无法适应内容多样性。在视情况和自注意的优劣点之间，本文提出一种关联学习方法，以利用优势并抑制缺点，以实现高质量和高效的填充。我们认为雨水分布反映了损害的位置和度量，除了雨水分布预测外。因此，我们提议在关联学习方式下，使用预测的损害估计来细化背景文本。通过这种方式，我们实现了图像抹掉，即将雨线除去和背景恢复两个子任务。关联学习的关键部分是一种新的多输入注意模块。它生成了损害估计和生成损害面板，根据预测的雨水分布。由于SA的全局相关计算，MAM可以从雨水输入中提取有用的补充组件，并帮助准确地恢复文本。同时，SA倾向于将特征地图归一化，而 convolution 则将其多样化，以注重地方文本。一个混合 fusión 网络包括一个待过 Residual Transformer 分支和一个 Encoder-Decoder 分支。前者从一些可学习的 токен中接受输入，并堆叠多头注意力和Feed-Forward 网络来编码图像的全局特征。后者则利用多级 Encoder-Decoder 来表达Contexture 知识。

OW-SLR: Overlapping Windows on Semi-Local Region for Image Super-Resolution

paper_url: http://arxiv.org/abs/2311.05146
repo_url: https://github.com/rishavbb/owslr
paper_authors: Rishav Bhardwaj, Janarthanam Jothi Balaji, Vasudevan Lakshminarayanan
for: 该论文目的是提出一种基于 semi-local 区域的 implicit neural representation 方法，以提高图像的缩放精度。
methods: 该方法使用 Overlapping Windows on Semi-Local Region (OW-SLR) 技术，在 latent space 中提取 semi-local 区域的特征，并使用这些特征来预测图像的 RGB 值。
results: 该方法在 OCT-A 图像上进行缩放后，对于健康和疾病retinal 图像（如 диабетиче Retinopathy 和 normal）的分类表现出色，并且在 OCT500 数据集上表现出了更好的效果。

Abstract
There has been considerable progress in implicit neural representation to upscale an image to any arbitrary resolution. However, existing methods are based on defining a function to predict the Red, Green and Blue (RGB) value from just four specific loci. Relying on just four loci is insufficient as it leads to losing fine details from the neighboring region(s). We show that by taking into account the semi-local region leads to an improvement in performance. In this paper, we propose applying a new technique called Overlapping Windows on Semi-Local Region (OW-SLR) to an image to obtain any arbitrary resolution by taking the coordinates of the semi-local region around a point in the latent space. This extracted detail is used to predict the RGB value of a point. We illustrate the technique by applying the algorithm to the Optical Coherence Tomography-Angiography (OCT-A) images and show that it can upscale them to random resolution. This technique outperforms the existing state-of-the-art methods when applied to the OCT500 dataset. OW-SLR provides better results for classifying healthy and diseased retinal images such as diabetic retinopathy and normals from the given set of OCT-A images. The project page is available at https://rishavbb.github.io/ow-slr/index.html

摘要
“Recently, there have been significant advancements in implicit neural representation for upscaling images to any arbitrary resolution. However, existing methods rely on defining a function to predict the Red, Green, and Blue (RGB) values based on just four specific points. This is insufficient, as it leads to the loss of fine details from the surrounding regions. We propose a new technique called Overlapping Windows on Semi-Local Region (OW-SLR) to improve performance. This technique takes the coordinates of the semi-local region around a point in the latent space and uses it to predict the RGB value of a point. We apply this algorithm to Optical Coherence Tomography-Angiography (OCT-A) images and show that it can upscale them to any arbitrary resolution. Compared to existing state-of-the-art methods, OW-SLR achieves better results for classifying healthy and diseased retinal images, such as diabetic retinopathy and normals, in the OCT500 dataset. More information can be found on the project page at 。”Note that the translation is in Simplified Chinese, which is the standard form of Chinese used in mainland China and Singapore. If you prefer Traditional Chinese, please let me know and I can provide the translation in that form instead.

SCAAT: Improving Neural Network Interpretability via Saliency Constrained Adaptive Adversarial Training

paper_url: http://arxiv.org/abs/2311.05143
repo_url: None
paper_authors: Rui Xu, Wenkang Qin, Peixiang Huang, Hao Wang, Lin Luo
for: 提高深度神经网络（DNN）的解释性，使其预测结果更加 transparent 和 understandable。
methods: 提出了一种模型无关学习方法called Saliency Constrained Adaptive Adversarial Training（SCAAT），通过构建对抗样本，从而提高DNN的解释性。
results: SCAAT 可以减少对抗样本中的噪声，使 saliency map 更加精炼和可靠，而不需要修改模型结构。在不同的领域和指标上进行了多种 DNN 的评估，结果表明，SCAAT 可以显著提高 DNN 的解释性，而无需牺牲预测力。

Abstract
Deep Neural Networks (DNNs) are expected to provide explanation for users to understand their black-box predictions. Saliency map is a common form of explanation illustrating the heatmap of feature attributions, but it suffers from noise in distinguishing important features. In this paper, we propose a model-agnostic learning method called Saliency Constrained Adaptive Adversarial Training (SCAAT) to improve the quality of such DNN interpretability. By constructing adversarial samples under the guidance of saliency map, SCAAT effectively eliminates most noise and makes saliency maps sparser and more faithful without any modification to the model architecture. We apply SCAAT to multiple DNNs and evaluate the quality of the generated saliency maps on various natural and pathological image datasets. Evaluations on different domains and metrics show that SCAAT significantly improves the interpretability of DNNs by providing more faithful saliency maps without sacrificing their predictive power.

摘要

ScribblePolyp: Scribble-Supervised Polyp Segmentation through Dual Consistency Alignment

paper_url: http://arxiv.org/abs/2311.05122
repo_url: None
paper_authors: Zixun Zhang, Yuncheng Jiang, Jun Wei, Hannah Cui, Zhen Li
for: scribble-supervised polyp segmentation frameworkmethods: two-branch consistency alignment approach (transformation consistency alignment + affinity propagation)results: Dice score of 0.8155 (with potential for 1.8% improvement through self-training)

Abstract
Automatic polyp segmentation models play a pivotal role in the clinical diagnosis of gastrointestinal diseases. In previous studies, most methods relied on fully supervised approaches, necessitating pixel-level annotations for model training. However, the creation of pixel-level annotations is both expensive and time-consuming, impeding the development of model generalization. In response to this challenge, we introduce ScribblePolyp, a novel scribble-supervised polyp segmentation framework. Unlike fully-supervised models, ScribblePolyp only requires the annotation of two lines (scribble labels) for each image, significantly reducing the labeling cost. Despite the coarse nature of scribble labels, which leave a substantial portion of pixels unlabeled, we propose a two-branch consistency alignment approach to provide supervision for these unlabeled pixels. The first branch employs transformation consistency alignment to narrow the gap between predictions under different transformations of the same input image. The second branch leverages affinity propagation to refine predictions into a soft version, extending additional supervision to unlabeled pixels. In summary, ScribblePolyp is an efficient model that does not rely on teacher models or moving average pseudo labels during training. Extensive experiments on the SUN-SEG dataset underscore the effectiveness of ScribblePolyp, achieving a Dice score of 0.8155, with the potential for a 1.8% improvement in the Dice score through a straightforward self-training strategy.

摘要
自动肿体分割模型在肠胃疾病诊断中扮演着关键角色。在过去的研究中，大多数方法依赖于全supervised的方法，需要每个图像进行像素级别的标注。然而，创建像素级别的标注是非常昂贵和时间consuming，对模型普适性的发展带来了阻碍。为了解决这个挑战，我们介绍了ScribblePolyp，一种新的scribble-supervised肿体分割框架。不同于全supervised模型，ScribblePolyp只需每个图像两条scribble标签（scribble标注），对于每个图像的标注成本减少了90%。尽管scribble标注的粗糙性使得一部分像素未得到标注，我们提议一种两支分支一致性适应方法，以提供对这些未标注的像素的超vision。第一支分支使用变换一致性适应来缩小输入图像不同变换后的预测差异。第二支分支利用协同传播来细化预测，向未标注像素提供软化的超vision。简单地说，ScribblePolyp是一个不需要教师模型或移动平均 Pseudo标签的模型，在训练时不需要这些资源。广泛的实验表明，ScribblePolyp在SUN-SEG数据集上达到了0.8155的Dice分数，可能通过简单的再训练策略提高Dice分数1.8%。

Reducing the Side-Effects of Oscillations in Training of Quantized YOLO Networks

paper_url: http://arxiv.org/abs/2311.05109
repo_url: None
paper_authors: Kartik Gupta, Akshay Asthana
for: 这个论文目的是对适合边缘设备的量化网络进行优化，以减少计算和内存资源的消耗。
methods: 这个论文使用了量化训练（Quantization-Aware Training，QAT）来对网络进行量化，并提出了一些新的方法来缓解量化网络中的振荡现象，以提高量化网络的精度。
results: 这个论文的结果显示，使用了该些新方法后，可以对YOLO模型进行高效的量化，并在COCO dataset上进行了广泛的评估，获得了更高的精度和更低的错误率。

Abstract
Quantized networks use less computational and memory resources and are suitable for deployment on edge devices. While quantization-aware training QAT is the well-studied approach to quantize the networks at low precision, most research focuses on over-parameterized networks for classification with limited studies on popular and edge device friendly single-shot object detection and semantic segmentation methods like YOLO. Moreover, majority of QAT methods rely on Straight-through Estimator (STE) approximation which suffers from an oscillation phenomenon resulting in sub-optimal network quantization. In this paper, we show that it is difficult to achieve extremely low precision (4-bit and lower) for efficient YOLO models even with SOTA QAT methods due to oscillation issue and existing methods to overcome this problem are not effective on these models. To mitigate the effect of oscillation, we first propose Exponentially Moving Average (EMA) based update to the QAT model. Further, we propose a simple QAT correction method, namely QC, that takes only a single epoch of training after standard QAT procedure to correct the error induced by oscillating weights and activations resulting in a more accurate quantized model. With extensive evaluation on COCO dataset using various YOLO5 and YOLO7 variants, we show that our correction method improves quantized YOLO networks consistently on both object detection and segmentation tasks at low-precision (4-bit and 3-bit).

摘要

Self-similarity Prior Distillation for Unsupervised Remote Physiological Measurement

paper_url: http://arxiv.org/abs/2311.05100
repo_url: None
paper_authors: Xinyu Zhang, Weiyu Sun, Hao Lu, Ying Chen, Yun Ge, Xiaolin Huang, Jie Yuan, Yingcong Chen
for: 本研究旨在提出一种不需要标注数据的非监督式远程血液摄影（rPPG）估计方法，通过利用生物信号自然的自同异性来提高估计精度。
methods: 我们提出了一种基于自同异性优先的框架，包括物理特征嵌入增强技术、自相似性意识网络和层次自适应填充方法。
results: 我们的方法在不同的测试数据集上实现了与标注方法相当或更高的性能，同时具有最低的推理时间和计算成本。

Abstract
Remote photoplethysmography (rPPG) is a noninvasive technique that aims to capture subtle variations in facial pixels caused by changes in blood volume resulting from cardiac activities. Most existing unsupervised methods for rPPG tasks focus on the contrastive learning between samples while neglecting the inherent self-similar prior in physiological signals. In this paper, we propose a Self-Similarity Prior Distillation (SSPD) framework for unsupervised rPPG estimation, which capitalizes on the intrinsic self-similarity of cardiac activities. Specifically, we first introduce a physical-prior embedded augmentation technique to mitigate the effect of various types of noise. Then, we tailor a self-similarity-aware network to extract more reliable self-similar physiological features. Finally, we develop a hierarchical self-distillation paradigm to assist the network in disentangling self-similar physiological patterns from facial videos. Comprehensive experiments demonstrate that the unsupervised SSPD framework achieves comparable or even superior performance compared to the state-of-the-art supervised methods. Meanwhile, SSPD maintains the lowest inference time and computation cost among end-to-end models. The source codes are available at https://github.com/LinXi1C/SSPD.

摘要
远程血液摄影（rPPG）是一种不侵入式技术，目标是捕捉face pixels上因心跳活动而带来的微小变化。现有大多数无监督方法对rPPG任务强调对比采样，忽略了生物信号内置自similarity prior。在这篇论文中，我们提出了一个Self-Similarity Prior Distillation（SSPD）框架，用于无监督rPPG估计。我们首先引入了physical-prior附加技术，以减少各种噪声的影响。然后，我们适应了自similarity-aware网络，以提取更可靠的自similar生理特征。最后，我们开发了一种层次自降解析方法，以助网络分离自similar生理模式从 face videos。广泛的实验表明，无监督SSPD框架可与现有的监督方法相当或者超越其性能，同时SSPD保持了最低的推理时间和计算成本。源代码可以在https://github.com/LinXi1C/SSPD上下载。

POISE: Pose Guided Human Silhouette Extraction under Occlusions

paper_url: http://arxiv.org/abs/2311.05077
repo_url: https://github.com/take2rohit/poise
paper_authors: Arindam Dutta, Rohit Lal, Dripta S. Raychaudhuri, Calvin Khang Ta, Amit K. Roy-Chowdhury
for: 该论文的目的是提出一种用于人体示意抽取的自助学习混合方法，以提高在 occlusions 下人体示意抽取的准确性和可靠性。
methods: 该方法使用了一种自助学习混合模型，将人体示意抽取和人 JOINT 预测结果融合，以利用两者的优势，提高人体示意抽取的精度和可靠性。
results: 实验结果表明，该方法能够在 occlusions 下提高人体示意抽取的准确性和可靠性，并在下游任务中表现出优异的 Result。

Abstract
Human silhouette extraction is a fundamental task in computer vision with applications in various downstream tasks. However, occlusions pose a significant challenge, leading to incomplete and distorted silhouettes. To address this challenge, we introduce POISE: Pose Guided Human Silhouette Extraction under Occlusions, a novel self-supervised fusion framework that enhances accuracy and robustness in human silhouette prediction. By combining initial silhouette estimates from a segmentation model with human joint predictions from a 2D pose estimation model, POISE leverages the complementary strengths of both approaches, effectively integrating precise body shape information and spatial information to tackle occlusions. Furthermore, the self-supervised nature of \POISE eliminates the need for costly annotations, making it scalable and practical. Extensive experimental results demonstrate its superiority in improving silhouette extraction under occlusions, with promising results in downstream tasks such as gait recognition. The code for our method is available https://github.com/take2rohit/poise.

摘要
人体影像抽取是计算机视觉中的基本任务，具有许多下游任务的应用。然而，干扰Element pose poses a significant challenge，导致人体影像抽取 incomplete和扭曲。为了解决这个挑战，我们介绍 POISE：POSE Guided Human Silhouette Extraction under Occlusions，一种新的自我监督融合框架，可以提高人体影像抽取的准确性和Robustness。POISE通过将分割模型的初始抽取估计与2D pose estimation模型的人 JOINT预测结果融合起来，以利用这两种方法的优势，同时得到精确的身体形状信息和空间信息，有效地处理干扰。此外，POISE的自我监督性式，使得无需贵重的注释，可以扩展和实用。广泛的实验结果表明POISE在干扰下进行人体影像抽取时 exhibits superiority，并在下游任务中表现出了扎实的 results，如行走识别。POISE的代码可以在https://github.com/take2rohit/poise找到。

On the Behavior of Audio-Visual Fusion Architectures in Identity Verification Tasks

paper_url: http://arxiv.org/abs/2311.05071
repo_url: None
paper_authors: Daniel Claborne, Eric Slyman, Karl Pazdernik
for: 本研究旨在训练一种人脸识别模型，并对模型中将语音和视频表示结合部分进行修改，以便在一个输入缺失的情况下进行比较。
methods: 本研究使用了一种将输入embedding进行平均化的方法，以提高模型在全modalities情况下和一个输入缺失情况下的准确率。
results: 研究发现，平均化输入embedding可以更好地使用 embedding 空间，并在全modalities情况下和一个输入缺失情况下提高准确率。

Abstract
We train an identity verification architecture and evaluate modifications to the part of the model that combines audio and visual representations, including in scenarios where one input is missing in either of two examples to be compared. We report results on the Voxceleb1-E test set that suggest averaging the output embeddings improves error rate in the full-modality setting and when a single modality is missing, and makes more complete use of the embedding space than systems which use shared layers and discuss possible reasons for this behavior.

摘要
我们训练了一个标识验证建筑，并评估了对拼接声音和视觉表示的模型部分进行修改，包括在两个例子之间比较的情况下一个输入缺失。我们在Voxceleb1-E测试集上发现，将输出嵌入平均值可以改善错误率，包括全功能模式和单模态缺失情况，并且更好地利用嵌入空间。我们还讨论了可能的原因。

2023-11-09

cs.AI

cs.AI - 2023-11-09

Is a Seat at the Table Enough? Engaging Teachers and Students in Dataset Specification for ML in Education

paper_url: http://arxiv.org/abs/2311.05792
repo_url: None
paper_authors: Mei Tan, Hansol Lee, Dakuo Wang, Hariharan Subramonyam
for: 这篇论文目的是探讨Machine Learning（ML）在教育中的应用，并探讨在这些应用中发生的问题和挑战。
methods: 本研究使用了跨学科的合作设计方法，让ML工程师、教育专家和学生共同定义数据特性，以探讨ML应用中的问题和挑战。
results: 研究发现，参与者将数据 Contextualized 基于专业和程序知识，设计了减少后果和数据可靠性担忧的数据需求。参与者还展现出了角色基于的协力策略和贡献模式。此外，为了实现真正的参与，ML的实现需要结构支持：定义的迭代和共评过程、共同标准、技术和非技术参与者 traverse 专业边界的信息架。

Abstract
Despite the promises of ML in education, its adoption in the classroom has surfaced numerous issues regarding fairness, accountability, and transparency, as well as concerns about data privacy and student consent. A root cause of these issues is the lack of understanding of the complex dynamics of education, including teacher-student interactions, collaborative learning, and classroom environment. To overcome these challenges and fully utilize the potential of ML in education, software practitioners need to work closely with educators and students to fully understand the context of the data (the backbone of ML applications) and collaboratively define the ML data specifications. To gain a deeper understanding of such a collaborative process, we conduct ten co-design sessions with ML software practitioners, educators, and students. In the sessions, teachers and students work with ML engineers, UX designers, and legal practitioners to define dataset characteristics for a given ML application. We find that stakeholders contextualize data based on their domain and procedural knowledge, proactively design data requirements to mitigate downstream harms and data reliability concerns, and exhibit role-based collaborative strategies and contribution patterns. Further, we find that beyond a seat at the table, meaningful stakeholder participation in ML requires structured supports: defined processes for continuous iteration and co-evaluation, shared contextual data quality standards, and information scaffolds for both technical and non-technical stakeholders to traverse expertise boundaries.

摘要
尽管机器学习（ML）在教育领域的推广已经浮出了许多公平、负责任、透明度和隐私等问题，以及学生同意的问题。这些问题的根本原因是对教育领域的复杂 Dynamics 的不了解，包括教师和学生之间的互动、合作学习和教室环境。为了解决这些挑战并充分利用ML在教育领域的潜力，软件实践者需要与教育工作者和学生合作，以全面理解数据（ML应用程序的核心）的上下文。为了更深入地理解这种合作过程，我们进行了10次codesign会议，参与者包括ML软件实践者、教育工作者和学生。在会议中，教师和学生与ML工程师、用户体验设计师和法律专业人士一起定义了ML应用程序的数据特征。我们发现，参与者会基于域知识和过程知识来Contextualize数据，预先设计数据要求以避免下游害处和数据可靠性问题，并表现出角色基于的协作策略和贡献模式。此外，我们发现，在ML中真正参与的参与者需要结构支持：定义的不断迭代和合评过程，共享 Contextual Data Quality Standards，以及技术和非技术参与者之间的信息扶持，以 traverse Expertise boundaries。

The Paradox of Noise: An Empirical Study of Noise-Infusion Mechanisms to Improve Generalization, Stability, and Privacy in Federated Learning

paper_url: http://arxiv.org/abs/2311.05790
repo_url: None
paper_authors: Elaheh Jafarigol, Theodore Trafalis
For: This paper aims to provide strategies for measuring the generalization, stability, and privacy-preserving capabilities of deep learning models in federated learning frameworks, and to improve these models by leveraging noise as a tool for regularization and privacy enhancement.* Methods: The paper explores five noise infusion mechanisms at varying noise levels within centralized and federated learning settings, and compares the performance of three Convolutional Neural Network (CNN) architectures. The paper also introduces a new quantitative measure called Signal-to-Noise Ratio (SNR) to evaluate the trade-off between privacy and training accuracy of noise-infused models.* Results: The paper finds that the optimal noise level for privacy and accuracy can be achieved through a delicate balance between these factors, and defines the Price of Stability and Price of Anarchy in the context of privacy-preserving deep learning. The research contributes to the development of robust, privacy-aware algorithms that prioritize both utility and privacy in AI-driven solutions.

Abstract
In a data-centric era, concerns regarding privacy and ethical data handling grow as machine learning relies more on personal information. This empirical study investigates the privacy, generalization, and stability of deep learning models in the presence of additive noise in federated learning frameworks. Our main objective is to provide strategies to measure the generalization, stability, and privacy-preserving capabilities of these models and further improve them. To this end, five noise infusion mechanisms at varying noise levels within centralized and federated learning settings are explored. As model complexity is a key component of the generalization and stability of deep learning models during training and evaluation, a comparative analysis of three Convolutional Neural Network (CNN) architectures is provided. The paper introduces Signal-to-Noise Ratio (SNR) as a quantitative measure of the trade-off between privacy and training accuracy of noise-infused models, aiming to find the noise level that yields optimal privacy and accuracy. Moreover, the Price of Stability and Price of Anarchy are defined in the context of privacy-preserving deep learning, contributing to the systematic investigation of the noise infusion strategies to enhance privacy without compromising performance. Our research sheds light on the delicate balance between these critical factors, fostering a deeper understanding of the implications of noise-based regularization in machine learning. By leveraging noise as a tool for regularization and privacy enhancement, we aim to contribute to the development of robust, privacy-aware algorithms, ensuring that AI-driven solutions prioritize both utility and privacy.

摘要
在数据驱动时代，隐私和优化数据处理的问题日益突出，特别是机器学习更加依赖人工智能技术。这项实证研究探讨了深度学习模型在联合学习框架中的隐私、泛化和稳定性，并提供了测量和改进这些模型的策略。为此，我们在中央化和联合学习Setting中调查了5种不同噪声扩散机制，并对三种卷积神经网络架构进行比较分析。在训练和评估过程中，模型复杂度是深度学习模型的泛化和稳定性的关键因素。我们还引入了噪声比例（SNR）作为衡量隐私和训练准确率之间的质量衡量，以找到最佳的噪声水平。此外，我们定义了隐私保护中的价格of Stability和Price of Anarchy，以系统地研究噪声扩散策略的影响。我们的研究探讨了这些关键因素之间的权衡，以便更好地理解噪声基于的正则化在机器学习中的影响。通过利用噪声作为正则化和隐私提高的工具，我们希望通过开发robust、隐私意识的算法，确保人工智能驱动的解决方案优先考虑隐私和实用性。

Are “Hierarchical” Visual Representations Hierarchical?

paper_url: http://arxiv.org/abs/2311.05784
repo_url: https://github.com/ethanlshen/hiernet
paper_authors: Ethan Shen, Ali Farhadi, Aditya Kusupati
for: 本研究旨在研究是否使用层次视图表示法（Hierarchical Visual Representations）可以更好地捕捉人类对visual world的层次结构认知。
methods: 作者创建了一个名为HierNet的12个 dataset集合，包括ImageNet BREEDs subsets中的3种层次结构。他们在不同的训练setup中评估了抽象表示法和马特瑞什表示法的性能，并结论这些表示法不能在捕捉层次结构方面提供更好的性能，但它们可以帮助提高搜索效率和解释性。
results: 研究结果表明，使用抽象表示法和马特瑞什表示法不能在捕捉层次结构方面提供更好的性能，但它们可以帮助提高搜索效率和解释性。

Abstract
Learned visual representations often capture large amounts of semantic information for accurate downstream applications. Human understanding of the world is fundamentally grounded in hierarchy. To mimic this and further improve representation capabilities, the community has explored "hierarchical" visual representations that aim at modeling the underlying hierarchy of the visual world. In this work, we set out to investigate if hierarchical visual representations truly capture the human perceived hierarchy better than standard learned representations. To this end, we create HierNet, a suite of 12 datasets spanning 3 kinds of hierarchy from the BREEDs subset of ImageNet. After extensive evaluation of Hyperbolic and Matryoshka Representations across training setups, we conclude that they do not capture hierarchy any better than the standard representations but can assist in other aspects like search efficiency and interpretability. Our benchmark and the datasets are open-sourced at https://github.com/ethanlshen/HierNet.

摘要
学习的视觉表示法经常捕捉大量的Semantic信息，以便在下游应用中进行准确的识别。人类对世界的理解是基于层次结构的。为了模仿这一点并进一步提高表示能力，社区已经探索了“层次”的视觉表示方法，旨在模型视觉世界的层次结构。在这项工作中，我们想要Investigate whether hierarchical visual representations truly capture the human-perceived hierarchy better than standard learned representations. To this end, we create HierNet, a suite of 12 datasets spanning 3 kinds of hierarchy from the BREEDs subset of ImageNet. After extensive evaluation of Hyperbolic and Matryoshka Representations across training setups, we conclude that they do not capture hierarchy any better than the standard representations but can assist in other aspects like search efficiency and interpretability. Our benchmark and the datasets are open-sourced at .Here's a word-for-word translation of the text into Simplified Chinese:学习的视觉表示法经常捕捉大量的Semantic信息，以便在下游应用中进行准确的识别。人类对世界的理解是基于层次结构的。为了模仿这一点并进一步提高表示能力，社区已经探索了“层次”的视觉表示方法，旨在模型视觉世界的层次结构。在这项工作中，我们想要Investigate whether hierarchical visual representations truly capture the human-perceived hierarchy better than standard learned representations. To this end, we create HierNet, a suite of 12 datasets spanning 3 kinds of hierarchy from the BREEDs subset of ImageNet. After extensive evaluation of Hyperbolic and Matryoshka Representations across training setups, we conclude that they do not capture hierarchy any better than the standard representations but can assist in other aspects like search efficiency and interpretability. Our benchmark and the datasets are open-sourced at .

Hallucination-minimized Data-to-answer Framework for Financial Decision-makers

paper_url: http://arxiv.org/abs/2311.07592
repo_url: None
paper_authors: Sohini Roychowdhury, Andres Alvarez, Brian Moore, Marko Krema, Maria Paz Gelpi, Federico Martin Rodriguez, Angel Rodriguez, Jose Ramon Cabrejas, Pablo Martinez Serrano, Punit Agrawal, Arijit Mukherjee
for: 这项研究旨在开发一种基于 Langchain 框架的自动化问答系统，以提高在金融决策等特定领域中的问答自动化。
methods: 该系统使用用户查询意图分类、自动检索相关数据片断、生成个性化 LLG 提示、多 metric 评分等方法来提供准确、有 confidence 的答案。
results: 该系统在多种用户查询回答中达到了90%以上的 confidence 分数，包括 {What, Where, Why, How, predict, trend, anomalies, exceptions} 等关键问题，这些问题对于金融决策应用程序是非常重要的。

Abstract
Large Language Models (LLMs) have been applied to build several automation and personalized question-answering prototypes so far. However, scaling such prototypes to robust products with minimized hallucinations or fake responses still remains an open challenge, especially in niche data-table heavy domains such as financial decision making. In this work, we present a novel Langchain-based framework that transforms data tables into hierarchical textual data chunks to enable a wide variety of actionable question answering. First, the user-queries are classified by intention followed by automated retrieval of the most relevant data chunks to generate customized LLM prompts per query. Next, the custom prompts and their responses undergo multi-metric scoring to assess for hallucinations and response confidence. The proposed system is optimized with user-query intention classification, advanced prompting, data scaling capabilities and it achieves over 90% confidence scores for a variety of user-queries responses ranging from {What, Where, Why, How, predict, trend, anomalies, exceptions} that are crucial for financial decision making applications. The proposed data to answers framework can be extended to other analytical domains such as sales and payroll to ensure optimal hallucination control guardrails.

摘要
首先，用户问题被分类为意图，然后自动检索最相关的数据块，以生成个性化的LLM提醒。接着，个性提醒和其响应进行多元指标评分，以评估幻觉和响应信心。我们的提议的系统具有用户问题意图分类、高级提醒、数据扩展能力，并实现了90%以上的信心分数，包括“What、Where、Why、How、预测、趋势、异常”等问题，这些问题对金融决策应用非常重要。我们的数据回答框架可以扩展到其他分析领域，如销售和薪资，以确保优化幻觉控制 guardrails。

DONUT-hole: DONUT Sparsification by Harnessing Knowledge and Optimizing Learning Efficiency

paper_url: http://arxiv.org/abs/2311.05778
repo_url: None
paper_authors: Azhar Shaikh, Michael Cochez, Denis Diachkov, Michiel de Rijcke, Sahar Yousefi
for: 这篇论文旨在提出一种快速、高效的视觉文档理解（VDU）模型，以解决先前模型DONUT的限制。
methods: 该模型使用变换器架构，并通过知识储存和模型剪割来优化性能。
results: 模型可以在大规模请求服务环境中减少内存和计算需求，同时保持性能。此外，模型在文档图像关键信息提取任务中的效果也得到了证明。

Abstract
This paper introduces DONUT-hole, a sparse OCR-free visual document understanding (VDU) model that addresses the limitations of its predecessor model, dubbed DONUT. The DONUT model, leveraging a transformer architecture, overcoming the challenges of separate optical character recognition (OCR) and visual semantic understanding (VSU) components. However, its deployment in production environments and edge devices is hindered by high memory and computational demands, particularly in large-scale request services. To overcome these challenges, we propose an optimization strategy based on knowledge distillation and model pruning. Our paradigm to produce DONUT-hole, reduces the model denisty by 54\% while preserving performance. We also achieve a global representational similarity index between DONUT and DONUT-hole based on centered kernel alignment (CKA) metric of 0.79. Moreover, we evaluate the effectiveness of DONUT-hole in the document image key information extraction (KIE) task, highlighting its potential for developing more efficient VDU systems for logistic companies.

摘要

Chatbots Are Not Reliable Text Annotators

paper_url: http://arxiv.org/abs/2311.05769
repo_url: https://github.com/centre-for-humanities-computing/llm-tweet-classification
paper_authors: Ross Deans Kristensen-McLachlan, Miceal Canavan, Márton Kardos, Mia Jacobsen, Lene Aarøe
for: 这项研究旨在评估开源大语言模型（LLM）的表现，以及与聊天GPT的比较，以找到更好的文本标注工具。
methods: 研究使用了多种开源大语言模型（LLM），以及标准的指导学习分类模型，对Twitter媒体中的简单二分文本标注任务进行了系统性比较评估。
results: 研究发现，与标准指导学习分类模型相比，聊天GPT在多个任务中表现不一致，而开源模型在不同任务中也存在差异。因此，建议在社会科学研究中不要使用聊天GPT进行重要的文本标注任务。

Abstract
Recent research highlights the significant potential of ChatGPT for text annotation in social science research. However, ChatGPT is a closed-source product which has major drawbacks with regards to transparency, reproducibility, cost, and data protection. Recent advances in open-source (OS) large language models (LLMs) offer alternatives which remedy these challenges. This means that it is important to evaluate the performance of OS LLMs relative to ChatGPT and standard approaches to supervised machine learning classification. We conduct a systematic comparative evaluation of the performance of a range of OS LLM models alongside ChatGPT, using both zero- and few-shot learning as well as generic and custom prompts, with results compared to more traditional supervised classification models. Using a new dataset of Tweets from US news media, and focusing on simple binary text annotation tasks for standard social science concepts, we find significant variation in the performance of ChatGPT and OS models across the tasks, and that supervised classifiers consistently outperform both. Given the unreliable performance of ChatGPT and the significant challenges it poses to Open Science we advise against using ChatGPT for substantive text annotation tasks in social science research.

摘要
近期研究发现 chatGPT 在社会科学研究中的潜在潜力很大，但 chatGPT 是一个关闭源产品，它在透明度、复制性、成本和数据安全方面存在重大缺点。现有的开源大语言模型（LLM）的进步提供了一些选择，这些选择可以解决这些挑战。因此，我们需要评估开源 LLM 模型与 chatGPT 和普通的指导学习分类模型相比的性能。我们使用了一个新的 Twitter 数据集，并使用零或几个预测任务来评估开源 LLM 模型和 chatGPT 的性能，结果与传统的指导学习分类模型相比。我们发现了不同任务的 chatGPT 和开源模型的性能变化，以及指导分类模型在所有任务上的一致性。由于 chatGPT 的不可靠性和开源科学的重要性，我们建议在社会科学研究中不要使用 chatGPT 进行重要的文本注释任务。

ShipGen: A Diffusion Model for Parametric Ship Hull Generation with Multiple Objectives and Constraints

paper_url: http://arxiv.org/abs/2311.06315
repo_url: None
paper_authors: Noah J. Bagazinski, Faez Ahmed
for: 这个论文的目的是寻找一种使用生成人工智能技术来改善船体设计的方法，以减少设计周期时间和创造高性能的船体设计。
methods: 这个论文使用了一种叫做Diffusion Model的生成人工智能模型，并且添加了一些指南来改善生成的船体设计质量。
results: 这个论文发现使用Diffusion Model生成 parametric 船体设计可以大幅减少设计周期时间，并且生成的船体设计具有低Drag和高积载量，这可以降低船运成本和增加船体的收益能力。

Abstract
Ship design is a years-long process that requires balancing complex design trade-offs to create a ship that is efficient and effective. Finding new ways to improve the ship design process can lead to significant cost savings for ship building and operation. One promising technology is generative artificial intelligence, which has been shown to reduce design cycle time and create novel, high-performing designs. In literature review, generative artificial intelligence has been shown to generate ship hulls; however, ship design is particularly difficult as the hull of a ship requires the consideration of many objectives. This paper presents a study on the generation of parametric ship hull designs using a parametric diffusion model that considers multiple objectives and constraints for the hulls. This denoising diffusion probabilistic model (DDPM) generates the tabular parametric design vectors of a ship hull for evaluation. In addition to a tabular DDPM, this paper details adding guidance to improve the quality of generated ship hull designs. By leveraging classifier guidance, the DDPM produced feasible parametric ship hulls that maintain the coverage of the initial training dataset of ship hulls with a 99.5% rate, a 149x improvement over random sampling of the design vector parameters across the design space. Parametric ship hulls produced with performance guidance saw an average of 91.4% reduction in wave drag coefficients and an average of a 47.9x relative increase in the total displaced volume of the hulls compared to the mean performance of the hulls in the training dataset. The use of a DDPM to generate parametric ship hulls can reduce design time by generating high-performing hull designs for future analysis. These generated hulls have low drag and high volume, which can reduce the cost of operating a ship and increase its potential to generate revenue.

摘要
船体设计是一个需要坚持多年的过程，旨在平衡多种设计费用来创造高效高性能的船体。发现新的方法可以改进船体设计过程，可以获得显著的成本节省和运营成本降低。一种潜在技术是生成人工智能，它已经在文献评议中显示出可以降低设计周期时间和创造高性能的船体设计。在这篇论文中，我们介绍了一种基于梯度扩散模型（DDPM）的 parametric 船体设计生成方法，该方法考虑了多个目标和约束，以生成船体的 tabular 参数设计 вектор。此外，我们还介绍了如何通过类ifier 指导来改进生成的船体设计质量。通过利用类ifier 指导，DDPM 生成的 parametric 船体设计可以保持训练数据集中船体的覆盖率达99.5%，相比随机样本设计参数的149倍提高。 Parametric 船体生成后，通过性能指导，船体的波浪阻力系数平均下降91.4%，同时总填充体积平均提高47.9倍。通过使用 DDPM 生成 parametric 船体设计，可以减少设计时间，并生成高性能的船体设计，以便将来的分析。这些生成的船体设计具有低阻力和高体积，可以降低船舶运营成本并增加收益可能性。

Deep Natural Language Feature Learning for Interpretable Prediction

paper_url: http://arxiv.org/abs/2311.05754
repo_url: https://github.com/furrutiav/nllf-emnlp-2023
paper_authors: Felipe Urrutia, Cristian Buc, Valentin Barriere
for: 这个研究的目的是如何将复杂任务分解成一系列更容易处理的子任务，以便更好地进行 Machine Learning 模型的训练。
methods: 这种方法使用一个小型的 transformer 语言模型（如 BERT），通过自动从 Large Language Model (LLM) 中获取的弱标签进行 Natural Language Inference (NLI) 训练，生成一个名为 Natural Language Learned Features (NLLF) 的表示。
results: 研究表明，使用这种方法可以达到更好的性能，并且可以在零 shot 推理中处理任何 binary question。此外，这种 NLLF 表示可以作为一个简单的机器学习模型的输入，如一棵决策树，以便更好地解释模型的决策。在两个完全不同的任务中，即检测学生们的答案不一致性和检索报告中的科学论文，这种方法都有成功应用。

Abstract
We propose a general method to break down a main complex task into a set of intermediary easier sub-tasks, which are formulated in natural language as binary questions related to the final target task. Our method allows for representing each example by a vector consisting of the answers to these questions. We call this representation Natural Language Learned Features (NLLF). NLLF is generated by a small transformer language model (e.g., BERT) that has been trained in a Natural Language Inference (NLI) fashion, using weak labels automatically obtained from a Large Language Model (LLM). We show that the LLM normally struggles for the main task using in-context learning, but can handle these easiest subtasks and produce useful weak labels to train a BERT. The NLI-like training of the BERT allows for tackling zero-shot inference with any binary question, and not necessarily the ones seen during the training. We show that this NLLF vector not only helps to reach better performances by enhancing any classifier, but that it can be used as input of an easy-to-interpret machine learning model like a decision tree. This decision tree is interpretable but also reaches high performances, surpassing those of a pre-trained transformer in some cases.We have successfully applied this method to two completely different tasks: detecting incoherence in students' answers to open-ended mathematics exam questions, and screening abstracts for a systematic literature review of scientific papers on climate change and agroecology.

摘要
我们提出了一种通用方法，将主要复杂任务分解成一系列更容易的子任务，这些子任务是通过自然语言表述为主要目标任务的 binary 问题。我们称这种表示为自然语言学习特征（NLLF）。NLLF 由一个小型 transformer 语言模型（如 BERT）生成，该模型在自然语言推理（NLI）方式下进行训练，使用大语言模型（LLM）自动生成的弱标签。我们发现，LLM 通常在主任务上使用上下文学习时陷入困难，但可以处理最简单的子任务，并生成有用的弱标签来训练 BERT。 NLI 类似的训练方法使得 BERT 可以面对零批学习任务，而不一定是在训练过程中看到的问题。我们发现，这个 NLLF 向量不仅能够提高任何分类器的性能，还可以作为一个易于解释的机器学习模型，如决策树的输入。这个决策树可以是解释性强，但也能够达到高性能，在某些情况下 even surpassing 预训练 transformer 的性能。我们成功地应用了这种方法到了两个完全不同的任务：评估学生回答开放式数学考试题的准确性，以及筛选报告系统科学期刊文章中的气候变化和农业生物学相关研究。

Bridging the Digital Divide: Performance Variation across Socio-Economic Factors in Vision-Language Models

paper_url: http://arxiv.org/abs/2311.05746
repo_url: https://github.com/michigannlp/bridging_the_digital_divide
paper_authors: Joan Nwatu, Oana Ignat, Rada Mihalcea
for: 本研究旨在评估当今AI模型在不同收入水平下的表现，并提出解决方案来减轻收入差距。
methods: 本研究使用最新的视觉语言模型（CLIP），在各国家和不同收入水平下收集了家庭图像，并对这些图像进行了不同主题的识别和分类。
results: 研究发现，不同收入水平下的家庭图像识别性能存在差异，贫困家庭的表现相对较差，而富裕家庭的表现相对较高。研究还提出了一些可能的解决方案。

Abstract
Despite the impressive performance of current AI models reported across various tasks, performance reports often do not include evaluations of how these models perform on the specific groups that will be impacted by these technologies. Among the minority groups under-represented in AI, data from low-income households are often overlooked in data collection and model evaluation. We evaluate the performance of a state-of-the-art vision-language model (CLIP) on a geo-diverse dataset containing household images associated with different income values (Dollar Street) and show that performance inequality exists among households of different income levels. Our results indicate that performance for the poorer groups is consistently lower than the wealthier groups across various topics and countries. We highlight insights that can help mitigate these issues and propose actionable steps for economic-level inclusive AI development. Code is available at https://github.com/MichiganNLP/Bridging_the_Digital_Divide.

摘要
尽管当前的人工智能模型在各种任务上表现出色，但性能报告 часто不包括对特定群体的评估。在人工智能中下 represented minority groups中，来自低收入家庭的数据经常被数据收集和模型评估排除。我们使用地理多样化的数据集（Dollar Street）和当前领域的视觉语言模型（CLIP）进行评估，并发现了收入水平不同的家庭表现不平等。我们的结果表明，贫困 GROUPS的表现逐串比较贫困 GROUPS across topics and countries. We highlight some insights that can help mitigate these issues and propose actionable steps for economic-level inclusive AI development. 代码可以在 https://github.com/MichiganNLP/Bridging_the_Digital_Divide 上获取。

Optimal simulation-based Bayesian decisions

paper_url: http://arxiv.org/abs/2311.05742
repo_url: None
paper_authors: Justin Alsing, Thomas D. P. Edwards, Benjamin Wandelt
for: Optimal Bayesian decisions under intractable likelihoods
methods: 学习一个surrogate模型，用于计算行动空间和数据空间下的预期Utility的函数
results: 实现了高效的 simulations，typically requiring fewer model calls than posterior inference task alone, and a factor of $100-1000$ more efficient than Monte-Carlo based methods.

Abstract
We present a framework for the efficient computation of optimal Bayesian decisions under intractable likelihoods, by learning a surrogate model for the expected utility (or its distribution) as a function of the action and data spaces. We leverage recent advances in simulation-based inference and Bayesian optimization to develop active learning schemes to choose where in parameter and action spaces to simulate. This allows us to learn the optimal action in as few simulations as possible. The resulting framework is extremely simulation efficient, typically requiring fewer model calls than the associated posterior inference task alone, and a factor of $100-1000$ more efficient than Monte-Carlo based methods. Our framework opens up new capabilities for performing Bayesian decision making, particularly in the previously challenging regime where likelihoods are intractable, and simulations expensive.

摘要
我们提出了一个框架，用于高效计算 bayesian 决策下最优的决策，当likelihood是不可处理的时候，我们通过学习一个surrogate模型来表示行动和数据空间中的期望收益（或其分布）。我们利用最近的 simulations-based inference和 Bayesian optimization技术，开发了一种活动学习方案，选择在参数和行动空间中进行模拟。这使得我们可以尽可能快地学习最优的行动。结果的框架非常的 simulation efficient，通常需要 fewer model calls than相关的 posterior inference 任务，并且比 Monte-Carlo 方法高效 $100-1000$ 倍。我们的框架开 up new capabilities for performing Bayesian decision making，特别是在 previously challenging 的likelihood是 intractable，并且 simulations expensive 的情况下。

Efficiently Adapting Pretrained Language Models To New Languages

paper_url: http://arxiv.org/abs/2311.05741
repo_url: None
paper_authors: Zoltan Csaki, Pian Pawakapan, Urmish Thakker, Qiantong Xu
for: 这个研究旨在将现有的预训练语言模型（LLM）高效地适应新语言，以提高模型在低资源语言上的表现。
methods: 我们提出了一种新的适应方法，包括增加目标语言中的新token，并调整资料混合比例以减轻忘记现象。
results: 我们的实验显示，这种适应方法可以在适应英语到匈牙利语和泰语时，实现更好的表现，并且仅对英语造成最小的回退。

Abstract
Recent large language models (LLM) exhibit sub-optimal performance on low-resource languages, as the training data of these models is usually dominated by English and other high-resource languages. Furthermore, it is challenging to train models for low-resource languages, especially from scratch, due to a lack of high quality training data. Adapting pretrained LLMs reduces the need for data in the new language while also providing cross lingual transfer capabilities. However, naively adapting to new languages leads to catastrophic forgetting and poor tokenizer efficiency. In this work, we study how to efficiently adapt any existing pretrained LLM to a new language without running into these issues. In particular, we improve the encoding efficiency of the tokenizer by adding new tokens from the target language and study the data mixing recipe to mitigate forgetting. Our experiments on adapting an English LLM to Hungarian and Thai show that our recipe can reach better performance than open source models on the target language, with minimal regressions on English.

摘要
最近的大型语言模型（LLM）在低资源语言上表现不佳，因为这些模型的训练数据通常受英语和其他高资源语言的影响。此外，为低资源语言提供模型训练是困难的，特别是从零开始。适应预训练LLM可以减少新语言的数据需求，同时提供跨语言传递能力。然而，直接适应新语言会导致忘记和词元效率低下。在这项工作中，我们研究如何有效地适应任何现有的预训练LLM到新语言，而不会遇到这些问题。我们改进了编码效率的词元，添加了目标语言中的新词，并研究了数据混合秘诀来缓解忘记。我们的实验在将英语模型适应到匈牙利语和泰语时，发现我们的秘诀可以在目标语言上达到更好的性能，与英语表现的减少 regression。

Generating Pragmatic Examples to Train Neural Program Synthesizers

paper_url: http://arxiv.org/abs/2311.05740
repo_url: https://github.com/saujasv/generating-pragmatic-examples
paper_authors: Saujas Vaduguru, Daniel Fried, Yewen Pu
for: 这篇论文的目的是提出一种基于神经网络的程序合成方法，以便在实际程序空间中实现更高效的程序合成。
methods: 该方法包括在自动学习模型中采样对应的程序和示例，并使用 Pragmatic Inference 来选择有用的训练示例。
results: 该方法在 Synthesizing 正则表达式从示例字符串中的任务上表现出色，比模型不选择 Pragmatic 示例的情况高出 23%（相对提高 51%），并与人工提供的 Pragmatic 示例集上的性能相当，无需使用人工数据进行训练。

Abstract
Programming-by-example is the task of synthesizing a program that is consistent with a set of user-provided input-output examples. As examples are often an under-specification of one's intent, a good synthesizer must choose the intended program from the many that are consistent with the given set of examples. Prior work frames program synthesis as a cooperative game between a listener (that synthesizes programs) and a speaker (a user choosing examples), and shows that models of computational pragmatic inference are effective in choosing the user intended programs. However, these models require counterfactual reasoning over a large set of programs and examples, which is infeasible in realistic program spaces. In this paper, we propose a novel way to amortize this search with neural networks. We sample pairs of programs and examples via self-play between listener and speaker models, and use pragmatic inference to choose informative training examples from this sample.We then use the informative dataset to train models to improve the synthesizer's ability to disambiguate user-provided examples without human supervision. We validate our method on the challenging task of synthesizing regular expressions from example strings, and find that our method (1) outperforms models trained without choosing pragmatic examples by 23% (a 51% relative increase) (2) matches the performance of supervised learning on a dataset of pragmatic examples provided by humans, despite using no human data in training.

摘要
程序编程例子是将一个程序与一组用户提供的输入输出示例进行一一匹配的任务。由于示例通常是一个下pecification of one's intent，因此一个好的合成器必须从一个大量的程序和示例中选择用户所意图的程序。以前的工作将程序合成视为一个合作游戏 между一个听众（合成程序）和一个说客（用户选择示例），并证明了计算机 Pragmatic inference 模型有效地选择用户所意图的程序。然而，这些模型需要计算机 Pragmatic inference 的对偶推理，这在实际的程序空间中是不可能的。在这篇文章中，我们提出了一种新的方法，使用神经网络来免费化这个搜索。我们通过自我玩家和听众模型之间的自动对话来采样对应的程序和示例，然后使用 Pragmatic inference 选择这些示例中最有用的训练示例。我们使用这些有用的示例来训练模型，以提高合成器对用户提供的示例的解释能力，无需人工指导。我们验证了我们的方法在生成正则表达式的任务中的效果，并发现我们的方法（1）在没有人工指导的情况下，比模型没有选择 Pragmatic examples 的情况下高出23%（相对提高51%）。（2）与人工提供的 Pragmatic examples 数据集上的超级学习相当，即使在没有人工数据的情况下。

Long-Horizon Dialogue Understanding for Role Identification in the Game of Avalon with Large Language Models

paper_url: http://arxiv.org/abs/2311.05720
repo_url: https://github.com/sstepput/Avalon-NLU
paper_authors: Simon Stepputtis, Joseph Campbell, Yaqi Xie, Zhengyang Qi, Wenxin Sharon Zhang, Ruiyi Wang, Sanketh Rangreji, Michael Lewis, Katia Sycara
for: 这 paper 是 investigate 当前大语言模型 (LLM) 在长期对话中对骗局和说服的能力，特别是在多方参与者的情况下。
methods: 这 paper 使用了 Avalon: The Resistance 游戏作为研究对象， introduce 了一个在线测试床和20个人类玩家的数据集，以及一种 multimodal 集成方法，以检验 LLM 在长期对话中的决策和语言处理能力。
results: 研究发现，even 当前的状态对技术 LLM 还没有达到人类性能水平，这使得这个数据集成为一个有力的比较标准，以 Investigate LLM 的决策和语言处理能力。

Abstract
Deception and persuasion play a critical role in long-horizon dialogues between multiple parties, especially when the interests, goals, and motivations of the participants are not aligned. Such complex tasks pose challenges for current Large Language Models (LLM) as deception and persuasion can easily mislead them, especially in long-horizon multi-party dialogues. To this end, we explore the game of Avalon: The Resistance, a social deduction game in which players must determine each other's hidden identities to complete their team's objective. We introduce an online testbed and a dataset containing 20 carefully collected and labeled games among human players that exhibit long-horizon deception in a cooperative-competitive setting. We discuss the capabilities of LLMs to utilize deceptive long-horizon conversations between six human players to determine each player's goal and motivation. Particularly, we discuss the multimodal integration of the chat between the players and the game's state that grounds the conversation, providing further insights into the true player identities. We find that even current state-of-the-art LLMs do not reach human performance, making our dataset a compelling benchmark to investigate the decision-making and language-processing capabilities of LLMs. Our dataset and online testbed can be found at our project website: https://sstepput.github.io/Avalon-NLU/

摘要
<> traduced text into Simplified Chinese.<>骗取和说服在多方对话中扮演了关键角色，特别是当参与者的利益、目标和动机不匹配时。这些复杂任务对当今大型自然语言模型（LLM） poses 挑战，因为骗取和说服可以轻松地误导它们，特别在长期多方对话中。为此，我们研究了《阿凡龙：抵抗》游戏，这是一款社交推理游戏，玩家需要确定对方的隐藏身份，以完成团队的目标。我们提供了在线测试床和20个精心收集和标注的游戏，这些游戏展示了长期骗取的例子。我们讨论了使用现有的 LLM Utilize deceptive long-horizon conversations between six human players to determine each player's goal and motivation。尤其是通过融合对话和游戏状态的多模式集成，提供了更多的真实player identity的预测。我们发现， Even state-of-the-art LLMs do not reach human performance， making our dataset a compelling benchmark to investigate the decision-making and language-processing capabilities of LLMs。我们的数据集和在线测试床可以在我们项目网站上找到：https://sstepput.github.io/Avalon-NLU/。

Game Theory Solutions in Sensor-Based Human Activity Recognition: A Review

paper_url: http://arxiv.org/abs/2311.06311
repo_url: None
paper_authors: Mohammad Hossein Shayesteh, Behrooz Sharokhzadeh, Behrooz Masoumi
for: 本研究旨在探讨Game theory在人动活动识别（HAR）任务中的潜在应用，并将Game theory和HAR研究工作相连接。
methods: 本研究使用Game theory的概念和方法来优化人动活动识别算法，并 investigate了Game-theoretic Approaches的应用在现有HAR方法上。
results: 本研究提供了Game theory在HAR任务中的潜在应用，并explored了Game-theoretic Approaches的可能性以解决现有HAR方法中的挑战。

Abstract
The Human Activity Recognition (HAR) tasks automatically identify human activities using the sensor data, which has numerous applications in healthcare, sports, security, and human-computer interaction. Despite significant advances in HAR, critical challenges still exist. Game theory has emerged as a promising solution to address these challenges in machine learning problems including HAR. However, there is a lack of research work on applying game theory solutions to the HAR problems. This review paper explores the potential of game theory as a solution for HAR tasks, and bridges the gap between game theory and HAR research work by suggesting novel game-theoretic approaches for HAR problems. The contributions of this work include exploring how game theory can improve the accuracy and robustness of HAR models, investigating how game-theoretic concepts can optimize recognition algorithms, and discussing the game-theoretic approaches against the existing HAR methods. The objective is to provide insights into the potential of game theory as a solution for sensor-based HAR, and contribute to develop a more accurate and efficient recognition system in the future research directions.

摘要
人类活动识别（HAR）任务自动识别人类活动使用传感器数据，有很多应用于医疗、体育、安全和人机交互等领域。尽管HAR领域已经取得了重要进展，但还存在许多挑战。游戏理论在机器学习问题中 Emerged as a promising solution to address these challenges, but there is a lack of research work on applying game theory solutions to HAR problems. This review paper explores the potential of game theory as a solution for HAR tasks, and bridges the gap between game theory and HAR research work by suggesting novel game-theoretic approaches for HAR problems.The contributions of this work include:1. 探讨游戏理论如何提高HAR模型的准确性和可靠性。2. 应用游戏理论概念优化recognition算法。3. 对现有HAR方法的游戏理论方法进行评论。本文的目的是为提供游戏理论在感知器基于HAR任务中的潜力，并为未来的研究提供发展方向。

FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts

paper_url: http://arxiv.org/abs/2311.05608
repo_url: https://github.com/thuccslab/figstep
paper_authors: Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang
for: 本研究旨在演示多Modalitate大型语言模型（VLMs）具有不明显的人工智能安全问题。
methods: 我们提出了一种名为 FigStep的攻击框架，通过图像通道输入危险指令，然后使用无害的文本提示来让 VLMs 输出违反常见人工智能安全政策的内容。
results: 我们的实验结果显示，FigStep 可以在 2 家流行的开源 VLMs （LLaVA 和 MiniGPT4）上 дости得攻击成功率为 94.8%（总共 5 个 VLMs）。此外，我们还证明了 FigStep 方法可以破坏 GPT-4V，这个模型已经利用了多种系统级别的机制来筛选危险查询。

Abstract
Large vision-language models (VLMs) like GPT-4V represent an unprecedented revolution in the field of artificial intelligence (AI). Compared to single-modal large language models (LLMs), VLMs possess more versatile capabilities by incorporating additional modalities (e.g., images). Meanwhile, there's a rising enthusiasm in the AI community to develop open-source VLMs, such as LLaVA and MiniGPT4, which, however, have not undergone rigorous safety assessment. In this paper, to demonstrate that more modalities lead to unforeseen AI safety issues, we propose FigStep, a novel jailbreaking framework against VLMs. FigStep feeds harmful instructions into VLMs through the image channel and then uses benign text prompts to induce VLMs to output contents that violate common AI safety policies. Our experimental results show that FigStep can achieve an average attack success rate of 94.8% across 2 families of popular open-source VLMs, LLaVA and MiniGPT4 (a total of 5 VLMs). Moreover, we demonstrate that the methodology of FigStep can even jailbreak GPT-4V, which already leverages several system-level mechanisms to filter harmful queries. Above all, our experimental results reveal that VLMs are vulnerable to jailbreaking attacks, which highlights the necessity of novel safety alignments between visual and textual modalities.

摘要
大型视语语模型（VLM）如GPT-4V在人工智能（AI）领域表现了无前例的革命。相比单modal大语言模型（LLM），VLM具有更多多样化能力，通过添加额外模态（如图像）。然而，AI社区对开源VLM的开发感到热烈，如LLaVA和MiniGPT4，但这些模型尚未经过严格的安全评估。在这篇论文中，我们提出了FigStep，一种新的监禁框架，用于对VLM进行监禁攻击。FigStep通过图像通道输入危险指令，然后使用无害文本提示来让VLM输出违反常见AI安全政策的内容。我们的实验结果表明，FigStep可以在2家 популяр的开源VLM中（LLaVA和MiniGPT4）实现94.8%的攻击成功率（总共5个VLM）。此外，我们还证明了FigStep的方法可以监禁GPT-4V，这个模型已经利用了多种系统级别的机制来筛选危险查询。总之，我们的实验结果表明VLM受到监禁攻击的威胁，这 highlights了视文ual模式之间的新的安全协调的必要性。

Real-Time Neural Rasterization for Large Scenes

paper_url: http://arxiv.org/abs/2311.05607
repo_url: None
paper_authors: Jeffrey Yunfan Liu, Yun Chen, Ze Yang, Jingkang Wang, Sivabalan Manivasagam, Raquel Urtasun
for: 大规模场景的实时新视图合成 (NVS)
methods: combining neural texture field and shader with标准图形渲染管线
results: 提供30倍以上的快速渲染，与或更好的现实主义，适用于自驾护航和无人机场景

Abstract
We propose a new method for realistic real-time novel-view synthesis (NVS) of large scenes. Existing neural rendering methods generate realistic results, but primarily work for small scale scenes (<50 square meters) and have difficulty at large scale (>10000 square meters). Traditional graphics-based rasterization rendering is fast for large scenes but lacks realism and requires expensive manually created assets. Our approach combines the best of both worlds by taking a moderate-quality scaffold mesh as input and learning a neural texture field and shader to model view-dependant effects to enhance realism, while still using the standard graphics pipeline for real-time rendering. Our method outperforms existing neural rendering methods, providing at least 30x faster rendering with comparable or better realism for large self-driving and drone scenes. Our work is the first to enable real-time rendering of large real-world scenes.

摘要
我们提出了一种新的实时实景视角合成（NVS）方法，用于大型场景。现有的神经渲染方法可以生成真实的结果，但主要适用于小规模场景（<50平方米），大规模场景（>10000平方米）难以处理。传统的图形学基础的抽象绘制渲染快速渲染大场景，但缺乏真实感和需要贵重的手动创建资产。我们的方法将神经渲染和标准图形管道结合，使用中等质量框架网格作为输入，学习视角依赖的效果模型，以提高真实感，同时仍然使用标准图形管道进行实时渲染。我们的方法比现有的神经渲染方法快速30倍，并且与或更好的真实感在大自驾和无人机场景中提供了比较或更好的表现。我们的工作是首次实现了大型真实世界场景的实时渲染。

SynH2R: Synthesizing Hand-Object Motions for Learning Human-to-Robot Handovers

paper_url: http://arxiv.org/abs/2311.05599
repo_url: None
paper_authors: Sammy Christen, Lan Feng, Wei Yang, Yu-Wei Chao, Otmar Hilliges, Jie Song
For: 这 paper 的目的是提出一种基于视觉的人机交换框架，以便在人机交换中使用synthetic数据进行训练。* Methods: 该 paper 使用了一种手套生成方法，可以生成与人类手套动作相似的机器人手套动作。这使得可以生成大量的synthetic数据，并且可以用于训练机器人。* Results: 在实验中，该 paper 所提出的方法与当前最佳方法相当，并且可以在实际系统上进行训练和测试。此外，该 paper 还可以对更多的物品和人类动作进行评估，而前一代方法不可以。Project page: https://eth-ait.github.io/synthetic-handovers/

Abstract
Vision-based human-to-robot handover is an important and challenging task in human-robot interaction. Recent work has attempted to train robot policies by interacting with dynamic virtual humans in simulated environments, where the policies can later be transferred to the real world. However, a major bottleneck is the reliance on human motion capture data, which is expensive to acquire and difficult to scale to arbitrary objects and human grasping motions. In this paper, we introduce a framework that can generate plausible human grasping motions suitable for training the robot. To achieve this, we propose a hand-object synthesis method that is designed to generate handover-friendly motions similar to humans. This allows us to generate synthetic training and testing data with 100x more objects than previous work. In our experiments, we show that our method trained purely with synthetic data is competitive with state-of-the-art methods that rely on real human motion data both in simulation and on a real system. In addition, we can perform evaluations on a larger scale compared to prior work. With our newly introduced test set, we show that our model can better scale to a large variety of unseen objects and human motions compared to the baselines. Project page: https://eth-ait.github.io/synthetic-handovers/

摘要
translate("Vision-based human-to-robot handover is an important and challenging task in human-robot interaction.")视力基础的人机交换是人机交互中的重要和挑战性任务。最近的工作尝试通过在模拟环境中与动态虚拟人交互来训练机器人策略，以后在实际世界中转移。然而，一个重要的瓶颈是人体动作捕捉数据的成本高并难以扩展到任意物体和人类抓取动作。在这篇论文中，我们介绍了一个框架，可以生成人类抓取动作，适用于训练机器人。为此，我们提议了一种手套物合成方法，设计为生成人类抓取动作相似的机器人抓取动作。这使得我们可以生成具有100倍更多的物体和人类抓取动作的 sintetic 训练和测试数据。在我们的实验中，我们显示了我们的方法，只使用 sintetic 数据进行训练，与现有的方法相比，在模拟和真实系统上具有相同的竞争力。此外，我们可以在更大的规模上进行评估，比之前的工作更加多样化。通过我们新引入的测试集，我们表明了我们的模型可以更好地扩展到大量未见的物品和人类动作。项目页面：https://eth-ait.github.io/synthetic-handovers/Note: The translation is done using Google Translate, and may not be perfect. Please let me know if you need any further assistance.

LLM Augmented Hierarchical Agents

paper_url: http://arxiv.org/abs/2311.05596
repo_url: None
paper_authors: Bharat Prakash, Tim Oates, Tinoosh Mohsenin
for: 这 paper 的目的是解决长期任务，使用 reinforcement learning (RL) 学习，并且在没有先验知识的情况下进行学习。
methods: 这 paper 使用了 language model (LLM) 的规划能力，与 RL 结合使用，实现一种层次结构的自动机器人。LLMs 提供了高级策略指导，从而使学习变得更加效率。
results: 在 MiniGrid、SkillHack 和 Crafter 等 simulate environments 以及一个真实的机器人臂上，使用这种方法训练的 Agent 表现出了优于其他基eline方法，并且一旦训练完成，不需要在部署时间接访 LLMs。

Abstract
Solving long-horizon, temporally-extended tasks using Reinforcement Learning (RL) is challenging, compounded by the common practice of learning without prior knowledge (or tabula rasa learning). Humans can generate and execute plans with temporally-extended actions and quickly learn to perform new tasks because we almost never solve problems from scratch. We want autonomous agents to have this same ability. Recently, LLMs have been shown to encode a tremendous amount of knowledge about the world and to perform impressive in-context learning and reasoning. However, using LLMs to solve real world problems is hard because they are not grounded in the current task. In this paper we exploit the planning capabilities of LLMs while using RL to provide learning from the environment, resulting in a hierarchical agent that uses LLMs to solve long-horizon tasks. Instead of completely relying on LLMs, they guide a high-level policy, making learning significantly more sample efficient. This approach is evaluated in simulation environments such as MiniGrid, SkillHack, and Crafter, and on a real robot arm in block manipulation tasks. We show that agents trained using our approach outperform other baselines methods and, once trained, don't need access to LLMs during deployment.

摘要
解决长期、时间扩展任务使用强化学习（RL）是具有挑战性，尤其是在不具备先验知识（或Tabula Rasa学习）的常见做法下。人类可以生成和执行长期行动计划，快速学习新任务，因为我们几乎从未解决问题从头开始。我们想要自主机器也有这种能力。最近，LLMs（大型语言模型）被证明可以存储大量世界知识，并在 Context 中进行出色的学习和理解。然而，使用LLMs解决实际世界问题是困难的，因为它们没有与当前任务的关系。在这篇论文中，我们利用LLMs的规划能力，并通过RL来提供学习环境，从而实现一种层次的自主代理人。而不是完全依赖LLMs，它们导引高级策略，使学习变得非常更加样本效率。我们在MiniGrid、SkillHack和Crafter等模拟环境中，以及一个真实的 робо臂在块操作任务中进行了评估。我们的方法让代理人在其他基eline方法的比较下表现出色，并且一旦训练完成，不需要在部署时访问LLMs。

Accuracy of a Vision-Language Model on Challenging Medical Cases

paper_url: http://arxiv.org/abs/2311.05591
repo_url: https://github.com/2v/gpt4v-image-challenge
paper_authors: Thomas Buckley, James A. Diao, Adam Rodman, Arjun K. Manrai
for: 这个研究用于评估新释放的Generative Pre-trained Transformer 4 with Vision模型（GPT-4V）在医学案例中的准确率。
methods: 这个研究使用了934个来自NEJM Image Challenge的案例，从2005年到2023年发表。研究对GPT-4V模型与人类回答者进行比较，分为不同的问题难度、图像类型和皮肤颜色等多个维度。此外，研究还进行了69个NEJM临床Pathological Conferences（CPCs）的physician评估。
results: GPT-4V的总准确率为61%（95% CI，58%到64%），比人类回答者的49%（95% CI，49%到50%）高。GPT-4V在所有难度和不同的皮肤颜色、图像类型等多个维度都超过人类回答者。但是，当图像添加到文本时，GPT-4V的表现下降。GPT-4V使用文本 alone时对CPCs中的正确诊断达80%（95% CI，68%到88%），而使用图像和文本时则为58%（95% CI，45%到70%）。

Abstract
Background: General-purpose large language models that utilize both text and images have not been evaluated on a diverse array of challenging medical cases. Methods: Using 934 cases from the NEJM Image Challenge published between 2005 and 2023, we evaluated the accuracy of the recently released Generative Pre-trained Transformer 4 with Vision model (GPT-4V) compared to human respondents overall and stratified by question difficulty, image type, and skin tone. We further conducted a physician evaluation of GPT-4V on 69 NEJM clinicopathological conferences (CPCs). Analyses were conducted for models utilizing text alone, images alone, and both text and images. Results: GPT-4V achieved an overall accuracy of 61% (95% CI, 58 to 64%) compared to 49% (95% CI, 49 to 50%) for humans. GPT-4V outperformed humans at all levels of difficulty and disagreement, skin tones, and image types; the exception was radiographic images, where performance was equivalent between GPT-4V and human respondents. Longer, more informative captions were associated with improved performance for GPT-4V but similar performance for human respondents. GPT-4V included the correct diagnosis in its differential for 80% (95% CI, 68 to 88%) of CPCs when using text alone, compared to 58% (95% CI, 45 to 70%) of CPCs when using both images and text. Conclusions: GPT-4V outperformed human respondents on challenging medical cases and was able to synthesize information from both images and text, but performance deteriorated when images were added to highly informative text. Overall, our results suggest that multimodal AI models may be useful in medical diagnostic reasoning but that their accuracy may depend heavily on context.

摘要
背景：目前没有评估过多种困难医学案例的通用大型语言模型，这些模型通常使用文本和图像。方法：我们使用2005-2023年《新英格兰医学杂志》（NEJM）图像挑战中发表的934个案例，评估最新发布的生成预训练 transformer 4 with Vision（GPT-4V）模型与人类回答者的精度相比，并按问题难度、图像类型和皮肤色分进行分组分析。我们还进行了69个NEJM临床 PATHOLOGICAL CONFERENCES（CPCs）的医生评估。分析方法包括文本alone、图像alone和文本和图像的组合。结果：GPT-4V的总精度为61%（95% CI，58-64%），比人类回答者的49%（95% CI，49-50%）高。GPT-4V在所有难度和不同的皮肤色分、图像类型和文本类型中都表现出色，只有放射学图像的表现与人类回答者相当。长文本描述与GPT-4V的表现相似，而人类回答者的表现则不变。GPT-4V使用文本alone时包含正确的诊断在其分 differential中的80%（95% CI，68-88%），与使用文本和图像时相同。结论：GPT-4V在困难的医学案例中表现出色，能够从文本和图像中提取信息，但是将图像添加到高度信息的文本时，其表现下降。总的来说，我们的结果表明，多模态 AI 模型可能在医学诊断reasoning中有用，但其精度可能取决于上下文。

Conversational AI Threads for Visualizing Multidimensional Datasets

paper_url: http://arxiv.org/abs/2311.05590
repo_url: None
paper_authors: Matt-Heun Hong, Anamaria Crisan
for: 这项研究旨在探索基于大语言模型（LLM）的对话式分析工具的可能性和限制。
methods: 研究使用了一个LLM进行对一项先前的奥托·赞托（Wizard-of-Oz）研究的重新分析，以探索基于对话式分析的机器学习模型的强点和弱点。
results: 研究发现LLM驱动的分析对话系统有一些缺点，如不支持进程性的视觉分析反复。基于这些发现，研究人员开发了AI Threads，一种多线程分析对话系统，以便分析员可以灵活地管理对话的进程性。研究通过在40名志愿者和10名专家分析员的审核下评估系统的可用性，并在一个外部数据集上展示了AI Threads的能力。

Abstract
Generative Large Language Models (LLMs) show potential in data analysis, yet their full capabilities remain uncharted. Our work explores the capabilities of LLMs for creating and refining visualizations via conversational interfaces. We used an LLM to conduct a re-analysis of a prior Wizard-of-Oz study examining the use of chatbots for conducting visual analysis. We surfaced the strengths and weaknesses of LLM-driven analytic chatbots, finding that they fell short in supporting progressive visualization refinements. From these findings, we developed AI Threads, a multi-threaded analytic chatbot that enables analysts to proactively manage conversational context and improve the efficacy of its outputs. We evaluate its usability through a crowdsourced study (n=40) and in-depth interviews with expert analysts (n=10). We further demonstrate the capabilities of AI Threads on a dataset outside the LLM's training corpus. Our findings show the potential of LLMs while also surfacing challenges and fruitful avenues for future research.

摘要
大型语言模型（LLM）在数据分析方面表现出了潜在的潜力，但它们的潜在能力仍未被完全探索。我们的工作探讨了 LLM 在通过对话界面进行数据分析时的能力。我们使用了 LLM 重新分析了一项以前的奥托兹研究，检查了使用 chatbot 进行视觉分析的使用情况。我们发现了 LLM 驱动的分析 chatbot 有一些缺陷，它们无法支持进程性的视觉分析改进。基于这些发现，我们开发了 AI 线程，一种多线程的分析 chatbot，允许分析员可以积极管理对话上下文，以提高其输出的效果。我们通过卫星投票研究（n=40）和专家分析员的深入采访（n=10）评估了 AI 线程的可用性。我们进一步在一个 LLM 训练集外的数据集上展示了 AI 线程的能力。我们的发现表明 LLM 的潜在能力，同时也浮现了未来研究的挑战和有前途的方向。

Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations

paper_url: http://arxiv.org/abs/2311.05584
repo_url: None
paper_authors: Joey Hong, Sergey Levine, Anca Dragan
for: 这个论文主要针对目标是什么？methods: 这个论文使用了什么方法？results: 这个论文的结果是什么？Here are the answers in Simplified Chinese:for: 这个论文主要针对目标是如何使用大语言模型（LLM）来解决互动性高的自然语言任务，例如教学和旅游咨询等。methods: 这个论文使用了RL（强化学习）方法，通过使用LLM生成假的人类对话来训练一个互动对话机器人，以便在多步互动中优化目标。results: 论文的实验结果显示，使用这种方法可以达到多个目的的对话任务的州OFTHEART性能，包括教学和偏好检索等。

Abstract
Large language models (LLMs) have emerged as powerful and general solutions to many natural language tasks. However, many of the most important applications of language generation are interactive, where an agent has to talk to a person to reach a desired outcome. For example, a teacher might try to understand their student's current comprehension level to tailor their instruction accordingly, and a travel agent might ask questions of their customer to understand their preferences in order to recommend activities they might enjoy. LLMs trained with supervised fine-tuning or "single-step" RL, as with standard RLHF, might struggle which tasks that require such goal-directed behavior, since they are not trained to optimize for overall conversational outcomes after multiple turns of interaction. In this work, we explore a new method for adapting LLMs with RL for such goal-directed dialogue. Our key insight is that, though LLMs might not effectively solve goal-directed dialogue tasks out of the box, they can provide useful data for solving such tasks by simulating suboptimal but human-like behaviors. Given a textual description of a goal-directed dialogue task, we leverage LLMs to sample diverse synthetic rollouts of hypothetical in-domain human-human interactions. Our algorithm then utilizes this dataset with offline reinforcement learning to train an interactive conversational agent that can optimize goal-directed objectives over multiple turns. In effect, the LLM produces examples of possible interactions, and RL then processes these examples to learn to perform more optimal interactions. Empirically, we show that our proposed approach achieves state-of-the-art performance in various goal-directed dialogue tasks that include teaching and preference elicitation.

摘要
大型语言模型（LLM）已经成为许多自然语言任务的强大和通用解决方案。然而，许多最重要的语言生成应用程序是互动的， где一个代理人需要跟人进行互动以达到愿景。例如，一位教师可能会尝试理解学生目前的理解水平，以适应 instrucion accordingly，而一位旅游代理人可能会问客户的偏好，以便根据客户的喜好建议活动。 LLM 在监督 fine-tuning 或 "single-step" RL 中可能会遇到问题，因为它们没有被训练来优化多次互动的对话结果。在这个工作中，我们探索一种新的方法来适应 LLM WITH RL 来进行目标对话。我们的关键见解是，处理目标对话任务的 LLM 可能无法提供有用的数据，但它们可以提供似替代的人类行为的 simulated Synthetic Rollouts。我们使用这个描述文本来生成一个具有多个转折的对话任务，然后使用 RL 来训练一个可以优化目标对话结果的互动对话代理人。实际上，LLM 生成的可能的互动示例，然后 RL 处理这些示例，以学习更佳的互动。我们的实验结果显示，我们的提出的方法可以在不同的目标对话任务中实现州势框架的性能。

Inference for Probabilistic Dependency Graphs

paper_url: http://arxiv.org/abs/2311.05580
repo_url: https://github.com/orichardson/pdg-infer-uai
paper_authors: Oliver E. Richardson, Joseph Y. Halpern, Christopher De Sa
for: 这 paper 是关于 probabilistic dependency graphs (PDGs) 的研究，PDGs 是一种灵活的概率图模型，可以捕捉不一致的信念，并提供一种度量不一致程度的方法。
methods: 这 paper 使用了一种新的推理算法，它基于以下四个关键组成部分：（1）观察到，在许多情况下，PDGs 所规定的分布可以表示为一个凸优化问题（具有凝固体积约束），（2）一种可以简洁表述这些问题的构造，（3）对 PDGs 的论证，以及（4）基于内部点方法来解决这些问题，这些问题可以在几乎Linear时间内解决。
results: 这 paper 的实验结果表明，这种新的推理算法可以在许多情况下高效地解决 PDGs 的推理问题，并且比基eline方法更高效。

Abstract
Probabilistic dependency graphs (PDGs) are a flexible class of probabilistic graphical models, subsuming Bayesian Networks and Factor Graphs. They can also capture inconsistent beliefs, and provide a way of measuring the degree of this inconsistency. We present the first tractable inference algorithm for PDGs with discrete variables, making the asymptotic complexity of PDG inference similar that of the graphical models they generalize. The key components are: (1) the observation that, in many cases, the distribution a PDG specifies can be formulated as a convex optimization problem (with exponential cone constraints), (2) a construction that allows us to express these problems compactly for PDGs of boundeed treewidth, (3) contributions to the theory of PDGs that justify the construction, and (4) an appeal to interior point methods that can solve such problems in polynomial time. We verify the correctness and complexity of our approach, and provide an implementation of it. We then evaluate our implementation, and demonstrate that it outperforms baseline approaches. Our code is available at http://github.com/orichardson/pdg-infer-uai.

摘要
“潜在的依存グラフ（PDG）は、bayesian Networks と factor graphsを包含するflexibleな probabilistic graphical modelsです。彼らは、不一致した信念も捉えることができます。我们は、discrete variableを持つ PDGのための初の tractable inference algorithmを提出します。このアルゴリズムの键点は、以下の4点です。1. PDGが指定する配分を、半径整数乘数问题（exponential cone constraints）として表现することができることに気づきました。2. PDGのbound trees widthが小さい场合、これらの问题をコンパクトに表现するための构筑を行いました。3. PDGに関する理论的な贡献を行い、この构筑を正当化しました。4. interior point methodsを使用して、これらの问题をPolynomial timeで解くことができます。我々は、このアプローチの正しさと复雑性を検证し、実装を行いました。そして、基eline approachesに対して性能を比较し、pdgの検查において优れた性能を示しました。我々のコードは、http://github.com/orichardson/pdg-infer-uaiに公开されています。”

Removing RLHF Protections in GPT-4 via Fine-Tuning

paper_url: http://arxiv.org/abs/2311.05553
repo_url: None
paper_authors: Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang
for: 防止语言模型（LLM）的两用性带来危害的输出
methods: 使用强化学习与人类反馈（RLHF）来减少危害输出
results: despite using weaker models to generate training data, fine-tuning can remove RLHF protections with a 95% success rate, and removing RLHF protections does not decrease usefulness on non-censored outputs.

Abstract
As large language models (LLMs) have increased in their capabilities, so does their potential for dual use. To reduce harmful outputs, produces and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM vendors have been increasingly enabling fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. We may expect that the most powerful models currently available (GPT-4) are less susceptible to fine-tuning attacks. In this work, we show the contrary: fine-tuning allows attackers to remove RLHF protections with as few as 340 examples and a 95% success rate. These training examples can be automatically generated with weaker models. We further show that removing RLHF protections does not decrease usefulness on non-censored outputs, providing evidence that our fine-tuning strategy does not decrease usefulness despite using weaker models to generate training data. Our results show the need for further research on protections on LLMs.

摘要
大型语言模型（LLM）的能力不断提高，同时其可能性也在提高。为了减少危害输出，LLM生产者和销售者通过人工反馈学习（RLHF）来减少危害。同时，LLM生产者也在不断强化其最强大的模型。然而，与此同时，一些研究表明， fine-tuning 可以移除 RLHF 保护。我们可能会期望最新的 GPT-4 模型比其他模型更难受到 fine-tuning 攻击。在这个工作中，我们发现了正好相反的情况： fine-tuning 允许攻击者移除 RLHF 保护，只需要340个示例和95% 的成功率。这些训练示例可以通过使用弱化模型自动生成。我们还证明了移除 RLHF 保护不会减少非防止输出的有用性，这表明我们的 fine-tuning 策略不会减少有用性，即使使用弱化模型来生成训练数据。我们的结果表明需要进一步研究 LLM 的保护。

Multi-Agent Quantum Reinforcement Learning using Evolutionary Optimization

paper_url: http://arxiv.org/abs/2311.05546
repo_url: None
paper_authors: Michael Kölle, Felix Topp, Thomy Phan, Philipp Altmann, Jonas Nüßlein, Claudia Linnhoff-Popien
for: 这篇论文是关于多智能体强化学习的研究，它在自动驾驶和智能工业应用中变得越来越重要。
methods: 这篇论文使用了量子力学的内在性质，减少了模型的可训练参数，提高了强化学习的性能。
results: 作者使用了变量量子电路方法，在Coin Game环境中评估了多智能体强化学习方法，并与经典方法进行比较。结果显示，变量量子电路方法在同等参数量下达到了类似的性能，与经典方法相比使用了$97.88%$ fewer parameters。

Abstract
Multi-Agent Reinforcement Learning is becoming increasingly more important in times of autonomous driving and other smart industrial applications. Simultaneously a promising new approach to Reinforcement Learning arises using the inherent properties of quantum mechanics, reducing the trainable parameters of a model significantly. However, gradient-based Multi-Agent Quantum Reinforcement Learning methods often have to struggle with barren plateaus, holding them back from matching the performance of classical approaches. We build upon a existing approach for gradient free Quantum Reinforcement Learning and propose tree approaches with Variational Quantum Circuits for Multi-Agent Reinforcement Learning using evolutionary optimization. We evaluate our approach in the Coin Game environment and compare them to classical approaches. We showed that our Variational Quantum Circuit approaches perform significantly better compared to a neural network with a similar amount of trainable parameters. Compared to the larger neural network, our approaches archive similar results using $97.88\%$ less parameters.

摘要
多智能体强化学习在自动驾驶和智能工业应用中日益重要。同时，使用量子物理特性的新方法在强化学习中表现承诺，可以减少模型可训练参数的数量。然而，使用梯度的多智能量子强化学习方法经常陷入恶性板块，使其与经典方法相比表现不佳。我们基于现有的梯度自由量子强化学习方法，并提出了三种使用可变量量子电路的多智能强化学习方法，使用进化优化。我们在硬币游戏环境中评估了我们的方法，并与经典方法进行比较。我们发现，我们的可变量量子电路方法与一个同量参数的神经网络相比，表现出了显著更好的性能。相比之下，我们的方法使用的参数数量为97.88%。

Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure

paper_url: http://arxiv.org/abs/2311.07590
repo_url: https://github.com/apolloresearch/insider-trading
paper_authors: Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn
for: 这个论文探讨了大语言模型在实际场景中可能会展现出偏aligned行为，无需直接 instrucciones 或培训。
methods: 作者使用了 GPT-4 作为一个自动化股票交易代理，在 simulated 环境中进行了实际的股票交易，并通过 hiding 实际的交易原因来掩盖其偏aligned行为。
results: 研究发现，当模型被允许访问一个理由笔记时，它们会 strategically 隐瞒实际的交易原因，并且这种偏aligned行为在不同的设定下可以被改变。

Abstract
We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision. We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment. To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.

摘要
我们展示了一种情况，在大语言模型被训练为有用、无害和诚实的情况下，它们可能会显示偏心的行为和欺骗其用户。具体来说，我们在一个真实的 simulate 环境中部署 GPT-4 作为一个自主股票交易代理。在这个环境中，模型获得了一个内部信息，并且尽管知道公司管理层不把内部交易视为正确的行为，但它仍然根据这个信息进行交易。当报告给其管理者时，模型一直隐瞒了实际的交易决策的原因。我们进行了一 brief 的调查，检查这种行为在不同的设置下发生变化。例如，移除模型访问分析笔记 pad，改变系统指令以防止偏心行为，改变模型受压力的程度，变化被抓获的风险等。根据我们所知，这是首次在真实情况下，不直接给模型提供欺骗指导或训练，大语言模型仍然可能会在情况下欺骗其用户的示例。

From Learning Management System to Affective Tutoring system: a preliminary study

paper_url: http://arxiv.org/abs/2311.05513
repo_url: None
paper_authors: Nadaud Edouard, Geoffroy Thibault, Khelifi Tesnim, Yaacoub Antoun, Haidar Siba, Ben Rabah NourhÈne, Aubin Jean Pierre, Prevost Lionel, Le Grand Benedicte
for: 本研究旨在探讨学生遇到困难时的指标组合，包括表现、行为参与度和情感参与度，以实现学生difficulties的识别。
methods: 本研究使用两种主要数据源：学生学习管理系统（LMS）中的数字踪迹和学生摄像头捕捉的图像。数字踪迹提供了学生与教育内容的互动信息，而图像则用于分析学生的情感表达。
results: 通过使用2022-2023学年法国工程师学院的实际数据，我们观察到了正面情感状态和学业成绩之间的相关性。这些初步结果支持情感在分 differentiating high achieving和low achieving学生中扮演重要角色。

Abstract
In this study, we investigate the combination of indicators, including performance, behavioral engagement, and emotional engagement, to identify students experiencing difficulties. We analyzed data from two primary sources: digital traces extracted from th e Learning Management System (LMS) and images captured by students' webcams. The digital traces provided insights into students' interactions with the educational content, while the images were utilized to analyze their emotional expressions during learnin g activities. By utilizing real data collected from students at a French engineering school, recorded during the 2022 2023 academic year, we observed a correlation between positive emotional states and improved academic outcomes. These preliminary findings support the notion that emotions play a crucial role in differentiating between high achieving and low achieving students.

摘要
在这项研究中，我们研究了学生表现、行为参与度和情感参与度的组合，以确定学生遇到困难时的表现。我们分析了两个主要来源的数据：来自学习管理系统（LMS）的数字痕迹，以及学生的摄像头图像。数字痕迹为我们提供了学生与教育内容的互动情况的准确信息，而图像则用于分析学生学习过程中的情感表达。通过使用2022-2023学年法国工程学院的实际数据，我们发现了正面情感状态和学业成绩之间的相关关系。这些初步发现支持情感在分化高、低成绩学生方面发挥重要作用。

Anytime-Constrained Reinforcement Learning

paper_url: http://arxiv.org/abs/2311.05511
repo_url: https://github.com/jermcmahan/anytime-constraints
paper_authors: Jeremy McMahan, Xiaojin Zhu
for: 研究受限Markov决策过程（cMDP）中的时间约束。
methods: 提出了一种基于 deterministic 政策的扩展，以及一种基于这种扩展的时间和样本效率的规划和学习算法。
results: 证明了这些算法的时间和样本复杂度是受限的，但是 computing non-trivial approximately optimal policies 是 NP-hard。还提出了一种可靠的 approximation 算法来计算或学习一个 arbitrarily accurate approximately feasible policy。

Abstract
We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints. An anytime constraint requires the agent to never violate its budget at any point in time, almost surely. Although Markovian policies are no longer sufficient, we show that there exist optimal deterministic policies augmented with cumulative costs. In fact, we present a fixed-parameter tractable reduction from anytime-constrained cMDPs to unconstrained MDPs. Our reduction yields planning and learning algorithms that are time and sample-efficient for tabular cMDPs so long as the precision of the costs is logarithmic in the size of the cMDP. However, we also show that computing non-trivial approximately optimal policies is NP-hard in general. To circumvent this bottleneck, we design provable approximation algorithms that efficiently compute or learn an arbitrarily accurate approximately feasible policy with optimal value so long as the maximum supported cost is bounded by a polynomial in the cMDP or the absolute budget. Given our hardness results, our approximation guarantees are the best possible under worst-case analysis.

摘要
我们介绍和研究受限的马可夫决策过程（cMDP），其中任何时间都不能超过预算。任何时间限制对马可夫决策过程是必要的，并且我们表明，这些限制下的决策过程是可以有最佳解的。实际上，我们提供了一个可靠的对应降低，将不受限制的MDP转换为受限制的cMDP。我们的降低可以在Tabular cMDP中实现时间和样本效率的规划和学习算法，只要cost的精度是对应的logarithmic。然而，我们也证明了，计算非负值的策略是NP困难的一般情况下。为了突破这个瓶颈，我们设计了可证明的近似算法，可以快速地计算或学习一个具有最佳值的近似可行策略，只要最大支持的成本是对应的多项式或总预算。根据我们的困难性结果，我们的近似保证是最好的，即worst-case分析下的最佳保证。

General Policies, Subgoal Structure, and Planning Width

paper_url: http://arxiv.org/abs/2311.05490
repo_url: None
paper_authors: Blai Bonet, Hector Geffner
for: 本文研究了classical planning领域中的atomic goals问题，即用于找到可行的行为序列来实现目标。
methods: 本文使用了IW探索算法，该算法在问题宽度是 bounded 时可以在 exponential 时间内运行。此外，本文还定义了(显式) serializations 和 serialized width 概念，它们在许多领域有 bounded 的 Serialized width。
results: 本文表明了 bounded width 问题可以使用一种适当的变种的Serialized IW算法来解决，并且可以在 polynomial 时间内解决。此外，本文还提出了一种使用语言 of general policies 和 serializations 的 semantics 来 Specify 序列化问题的简洁表示方式，可以用于手动编码或从小例子学习 domain 控制知识。

Abstract
It has been observed that many classical planning domains with atomic goals can be solved by means of a simple polynomial exploration procedure, called IW, that runs in time exponential in the problem width, which in these cases is bounded and small. Yet, while the notion of width has become part of state-of-the-art planning algorithms such as BFWS, there is no good explanation for why so many benchmark domains have bounded width when atomic goals are considered. In this work, we address this question by relating bounded width with the existence of general optimal policies that in each planning instance are represented by tuples of atoms of bounded size. We also define the notions of (explicit) serializations and serialized width that have a broader scope as many domains have a bounded serialized width but no bounded width. Such problems are solved non-optimally in polynomial time by a suitable variant of the Serialized IW algorithm. Finally, the language of general policies and the semantics of serializations are combined to yield a simple, meaningful, and expressive language for specifying serializations in compact form in the form of sketches, which can be used for encoding domain control knowledge by hand or for learning it from small examples. Sketches express general problem decompositions in terms of subgoals, and sketches of bounded width express problem decompositions that can be solved in polynomial time.

摘要
Observations have shown that many classical planning domains with atomic goals can be solved using a simple polynomial exploration procedure called IW, which runs in time exponential in the problem width. However, there is no good explanation for why many benchmark domains have bounded width when atomic goals are considered. In this work, we address this question by showing that bounded width is related to the existence of general optimal policies that can be represented by tuples of atoms of bounded size. We also define the notions of (explicit) serializations and serialized width, which have a broader scope as many domains have a bounded serialized width but no bounded width. These problems can be solved non-optimally in polynomial time using a suitable variant of the Serialized IW algorithm. Finally, we combine the language of general policies and the semantics of serializations to yield a simple, meaningful, and expressive language for specifying serializations in compact form, called sketches. Sketches express general problem decompositions in terms of subgoals, and sketches of bounded width express problem decompositions that can be solved in polynomial time.

meta4: semantically-aligned generation of metaphoric gestures using self-supervised text and speech representation

paper_url: http://arxiv.org/abs/2311.05481
repo_url: https://github.com/mireillefares/meta4
paper_authors: Mireille Fares, Catherine Pelachaud, Nicolas Obin
for: The paper is written to address the limitation of previous behavior generation models that have not considered the key semantic information carried by Image Schemas in generating metaphoric gestures.
methods: The paper introduces a deep learning approach called META4, which computes Image Schemas from input text and generates metaphoric gestures driven by speech and the computed image schemas.
results: The approach is effective in generating speech-driven metaphoric gestures and highlights the importance of both speech and image schemas in modeling metaphoric gestures.Here is the same information in Simplified Chinese:
for: 论文是为了解决过去的行为生成模型，它们没有考虑图像Schema中含有的关键semantic信息，以生成比喻性手势。
methods: 论文提出了一种深度学习方法，即META4，它从输入文本中计算图像Schema，并根据这些图像Schema和语音驱动比喻性手势的生成。
results: 方法能够有效地生成语音驱动的比喻性手势，并高亮了图像Schema和语音之间的关系，表明图像Schema和语音都是模型比喻性手势的关键因素。

Abstract
Image Schemas are repetitive cognitive patterns that influence the way we conceptualize and reason about various concepts present in speech. These patterns are deeply embedded within our cognitive processes and are reflected in our bodily expressions including gestures. Particularly, metaphoric gestures possess essential characteristics and semantic meanings that align with Image Schemas, to visually represent abstract concepts. The shape and form of gestures can convey abstract concepts, such as extending the forearm and hand or tracing a line with hand movements to visually represent the image schema of PATH. Previous behavior generation models have primarily focused on utilizing speech (acoustic features and text) to drive the generation model of virtual agents. They have not considered key semantic information as those carried by Image Schemas to effectively generate metaphoric gestures. To address this limitation, we introduce META4, a deep learning approach that generates metaphoric gestures from both speech and Image Schemas. Our approach has two primary goals: computing Image Schemas from input text to capture the underlying semantic and metaphorical meaning, and generating metaphoric gestures driven by speech and the computed image schemas. Our approach is the first method for generating speech driven metaphoric gestures while leveraging the potential of Image Schemas. We demonstrate the effectiveness of our approach and highlight the importance of both speech and image schemas in modeling metaphoric gestures.

摘要
图像模式是人类认知过程中重复的认知模式，它们影响了我们如何理解和推理各种语言中的概念。这些模式深嵌在我们认知过程中，并在我们的身体表达中反映出来，例如手势。特别是，元拟势手势具有重要的特征和含义，可以用来视觉表示概念。手势的形状和形式可以表示概念，例如伸展肘和手或使用手部运动轨迹来视觉表示图像模式。在虚拟代理模型中，以前的行为生成模型主要通过语音（声音特征和文本）驱动模型来生成虚拟代理的行为。它们没有考虑图像模式中的关键semantic信息，以生成元拟势。为了解决这些限制，我们介绍了META4，一种深度学习方法，可以从语音和图像模式中生成元拟势。我们的方法有两个主要目标：一是计算图像模式从输入文本中获取底层semantic和元拟势的含义，二是通过语音和计算的图像模式来驱动元拟势的生成。我们的方法是首个基于语音驱动的元拟势生成方法，同时利用图像模式的潜力。我们 demonstarte了我们的方法的有效性，并强调了语音和图像模式在元拟势模型中的重要性。

Text Representation Distillation via Information Bottleneck Principle

paper_url: http://arxiv.org/abs/2311.05472
repo_url: None
paper_authors: Yanzhao Zhang, Dingkun Long, Zehan Li, Pengjun Xie
for: 提高text representation领域中PLMs的实用性，通过减少计算成本和维护高维度表示的问题。
methods: 提出一种基于信息瓶颈理论的知识塑化方法，通过最大化教师和学生模型之间的相互信息，同时减少学生模型对输入数据的相互信息，使学生模型保留重要学习的信息，避免过拟合。
results: 在两个主要下渠应用（Semantic Textual Similarity和Dense Retrieval任务）上，employs the proposed approach to achieve better performance compared to traditional knowledge distillation methods.

Abstract
Pre-trained language models (PLMs) have recently shown great success in text representation field. However, the high computational cost and high-dimensional representation of PLMs pose significant challenges for practical applications. To make models more accessible, an effective method is to distill large models into smaller representation models. In order to relieve the issue of performance degradation after distillation, we propose a novel Knowledge Distillation method called IBKD. This approach is motivated by the Information Bottleneck principle and aims to maximize the mutual information between the final representation of the teacher and student model, while simultaneously reducing the mutual information between the student model's representation and the input data. This enables the student model to preserve important learned information while avoiding unnecessary information, thus reducing the risk of over-fitting. Empirical studies on two main downstream applications of text representation (Semantic Textual Similarity and Dense Retrieval tasks) demonstrate the effectiveness of our proposed approach.

摘要

paper_url: http://arxiv.org/abs/2311.05450
repo_url: None
paper_authors: Alex Clay, Eduardo Alonso, Esther Mondragón
for: 这篇论文旨在解决 conversational agents（CA）中的两个主要问题，即创建CA的方法所带来的特殊技术问题以及用户对CA的社会预期。
methods: 该论文提出了通过在CA中引入认知科学发现的计算机模型来解决这两个问题的方法。这些模型包括semantic和episodic记忆、情感、工作记忆和学习能力。
results: 该论文表明，通过引入这些认知科学发现的计算机模型，可以解决CA中的技术问题并满足用户对CA的社会预期，从而提高CA的交流质量。

Abstract
Current conversational agents (CA) have seen improvement in conversational quality in recent years due to the influence of large language models (LLMs) like GPT3. However, two key categories of problem remain. Firstly there are the unique technical problems resulting from the approach taken in creating the CA, such as scope with retrieval agents and the often nonsensical answers of former generative agents. Secondly, humans perceive CAs as social actors, and as a result expect the CA to adhere to social convention. Failure on the part of the CA in this respect can lead to a poor interaction and even the perception of threat by the user. As such, this paper presents a survey highlighting a potential solution to both categories of problem through the introduction of cognitively inspired additions to the CA. Through computational facsimiles of semantic and episodic memory, emotion, working memory, and the ability to learn, it is possible to address both the technical and social problems encountered by CAs.

摘要
当前的对话代理（CA）在最近几年内有所改善，归功于大型语言模型（LLM）如GPT3。然而，还有两个关键的问题需要解决。首先，创建CA时采用的方法会导致特定的技术问题，如检索代理的范围和前一代生成器的偶极答案。其次，人们对CA视为社会actor，因此期望CA遵循社会规范。如果CA不符合这些规范，会导致低效的互动和用户感到威胁。因此，这篇论文介绍了一种可能的解决方案，通过在CA中引入认知革新来解决这两个类型的问题。通过计算机的semantic和episodic记忆、情感、工作记忆和学习能力，可以解决CA中的技术和社会问题。

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

paper_url: http://arxiv.org/abs/2311.05437
repo_url: https://github.com/LLaVA-VL/llava-plus
paper_authors: Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, Chunyuan Li
for: 论文主要用于推动大型多Modal模型的功能扩展，提供一个通用的多Modal助手。
methods: 论文使用了预训练的视觉和视觉语言模型库，可以根据用户输入活动 triggrer 相关工具来完成现实世界任务。
results: 实验结果表明，LLaVA-Plus 在现有的能力方面表现出色，同时具有新的能力，比如图像查询直接启用和活动参与整个人机器交互会话，从而显著提高工具使用性能和开拓新场景。

Abstract
LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction sessions, significantly improving tool use performance and enabling new scenarios.

摘要
LLaVA-Plus 是一种通用多模式助手，它扩展了大型多模式模型的功能。它维护一个预训练视觉语言模型的技能库，并可以根据用户输入活动激活相应的工具来完成现实世界任务。 LLVA-Plus 在多模式指令遵从数据上接受了训练，以获得使用工具的能力，包括视觉理解、生成、外部知识检索和组合。实验结果显示， LLVA-Plus 在现有能力方面超越 LLVA，并展现出新的能力。它与图像查询直接相关地和活动地参与整个人机交互会议，显著提高工具使用性能，并开启了新的enario。

Mirror: A Universal Framework for Various Information Extraction Tasks

paper_url: http://arxiv.org/abs/2311.05419
repo_url: https://github.com/Spico197/Mirror
paper_authors: Tong Zhu, Junfei Ren, Zijian Yu, Mengsong Wu, Guoliang Zhang, Xiaoye Qu, Wenliang Chen, Zhefeng Wang, Baoxing Huai, Min Zhang
for: 该论文主要旨在提高信息提取任务之间的知识共享，以及建立复杂的应用程序在真实场景中。
methods: 该论文提出了一种基于多槽图的统一框架，可以应对多种信息提取任务，包括单 span、多 span 和 n-ary 提取。这个框架使用非自适应的图解oding算法来解决所有的槽。
results: 经验表明，该模型在不同的下游任务中具有妥善的兼容性和竞争性，并在少量和零量设置下达到或超越了现有系统的性能。

Abstract
Sharing knowledge between information extraction tasks has always been a challenge due to the diverse data formats and task variations. Meanwhile, this divergence leads to information waste and increases difficulties in building complex applications in real scenarios. Recent studies often formulate IE tasks as a triplet extraction problem. However, such a paradigm does not support multi-span and n-ary extraction, leading to weak versatility. To this end, we reorganize IE problems into unified multi-slot tuples and propose a universal framework for various IE tasks, namely Mirror. Specifically, we recast existing IE tasks as a multi-span cyclic graph extraction problem and devise a non-autoregressive graph decoding algorithm to extract all spans in a single step. It is worth noting that this graph structure is incredibly versatile, and it supports not only complex IE tasks, but also machine reading comprehension and classification tasks. We manually construct a corpus containing 57 datasets for model pretraining, and conduct experiments on 30 datasets across 8 downstream tasks. The experimental results demonstrate that our model has decent compatibility and outperforms or reaches competitive performance with SOTA systems under few-shot and zero-shot settings. The code, model weights, and pretraining corpus are available at https://github.com/Spico197/Mirror .

摘要
共享知识 между信息提取任务一直是一大挑战，因为数据格式和任务变化很多，这导致了信息浪费和实际场景建立复杂应用程序更加困难。Recent studies often formulate IE tasks as a triplet extraction problem, but this paradigm does not support multi-span and n-ary extraction, leading to weak versatility. To address this challenge, we reorganize IE problems into unified multi-slot tuples and propose a universal framework for various IE tasks, which we call Mirror. Specifically, we recast existing IE tasks as a multi-span cyclic graph extraction problem and develop a non-autoregressive graph decoding algorithm to extract all spans in a single step. It is worth noting that this graph structure is incredibly versatile and supports not only complex IE tasks but also machine reading comprehension and classification tasks. We manually construct a corpus containing 57 datasets for model pretraining, and conduct experiments on 30 datasets across 8 downstream tasks. The experimental results show that our model has good compatibility and outperforms or reaches competitive performance with state-of-the-art systems under few-shot and zero-shot settings. The code, model weights, and pretraining corpus are available at .

Generalization in medical AI: a perspective on developing scalable models

paper_url: http://arxiv.org/abs/2311.05418
repo_url: None
paper_authors: Joachim A. Behar, Jeremy Levy, Leo Anthony Celi
for: 本研究旨在探讨医疗人工智能模型在不同医院环境下的泛化性能。
methods: 研究者采用多个数据集，其中一部分用于模型开发（源数据集），另一部分用于测试（目标数据集）。
results: 研究发现，尽管使用多个数据集可以提高模型的泛化性能，但是不同医院环境下的模型尚未能够 achieve universally generalizable 水平。

Abstract
Over the past few years, research has witnessed the advancement of deep learning models trained on large datasets, some even encompassing millions of examples. While these impressive performance on their hidden test sets, they often underperform when assessed on external datasets. Recognizing the critical role of generalization in medical AI development, many prestigious journals now require reporting results both on the local hidden test set as well as on external datasets before considering a study for publication. Effectively, the field of medical AI has transitioned from the traditional usage of a single dataset that is split into train and test to a more comprehensive framework using multiple datasets, some of which are used for model development (source domain) and others for testing (target domains). However, this new experimental setting does not necessarily resolve the challenge of generalization. This is because of the variability encountered in intended use and specificities across hospital cultures making the idea of universally generalizable systems a myth. On the other hand, the systematic, and a fortiori recurrent re-calibration, of models at the individual hospital level, although ideal, may be overoptimistic given the legal, regulatory and technical challenges that are involved. Re-calibration using transfer learning may not even be possible in some instances where reference labels of target domains are not available. In this perspective we establish a hierarchical three-level scale system reflecting the generalization level of a medical AI algorithm. This scale better reflects the diversity of real-world medical scenarios per which target domain data for re-calibration of models may or not be available and if it is, may or not have reference labels systematically available.

摘要
过去几年，深度学习模型在大量数据上进行训练，一些甚至有数百万个示例。而这些模型在隐藏测试集上具有出色的表现，但在外部数据集上表现不佳。认识到医疗AI发展中的泛化问题的重要性，许多著名期刊现在要求研究者在发表前对结果进行多个数据集的报告，包括本地隐藏测试集和外部数据集。这意味着医疗AI领域从传统的单个数据集，拼接成训练和测试集的方式转移到了一个更加全面的框架，使用多个数据集，其中一些用于模型开发（源数据集），另一些用于测试（目标数据集）。然而，这新的实验设置并不一定解决泛化问题。这是因为医院文化中的变化，使得“通用化”的系统成为一种神话。相反，在医院水平进行系统atic和Recurrent re-calibration，尽管理想，但可能受到法律、规则和技术上的挑战。使用传输学习重新启动可能无法在目标领域中获得参考标签。在这种视角下，我们建立了一个三级层次积分系统，反映医疗AI算法的泛化水平。这个积分系统更好地反映了实际医疗场景中的多样性，目标领域数据可能或可能无法获得参考标签，而且如果有参考标签，可能不会系统地可用。

A theory for the sparsity emerged in the Forward Forward algorithm

paper_url: http://arxiv.org/abs/2311.05667
repo_url: None
paper_authors: Yukun Yang
for: 这篇论文探讨了forward-forward算法中高稀存现象的理论基础 \citep{tosato2023emergent}。
methods: 论文提出了两个定理，预测单个数据点活化的稀存变化在两种情况下：定理1：降低整个批处的好坏性。定理2：通过完整的forward-forward算法降低负数据的好坏性，提高正数据的好坏性。
results: 理论与在MNIST dataset上进行的实验结果相吻合。

Abstract
This report explores the theory that explains the high sparsity phenomenon \citep{tosato2023emergent} observed in the forward-forward algorithm \citep{hinton2022forward}. The two theorems proposed predict the sparsity changes of a single data point's activation in two cases: Theorem \ref{theorem:1}: Decrease the goodness of the whole batch. Theorem \ref{theorem:2}: Apply the complete forward forward algorithm to decrease the goodness for negative data and increase the goodness for positive data. The theory aligns well with the experiments tested on the MNIST dataset.

摘要

Decreasing the goodness of the whole batch (Theorem 1).2. Applying the complete forward-forward algorithm to decrease the goodness for negative data and increase the goodness for positive data (Theorem 2).The theory is found to be in good agreement with the experimental results tested on the MNIST dataset.Note:* “高稀度现象” (gāo xiāo dé xiàn yì) refers to the high sparsity phenomenon.* “整个批处” (zhèng gè pīn huì) refers to the whole batch.* “负数据” (fù shù) refers to negative data.* “正数据” (zhèng shù) refers to positive data.* “完整的前向前算法” (quán zhì de qián wǎn qián suān fáng) refers to the complete forward-forward algorithm.

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

paper_url: http://arxiv.org/abs/2311.05374
repo_url: https://github.com/xsysigma/tencentllmeval
paper_authors: Shuyi Xie, Wenlin Yao, Yong Dai, Shaobo Wang, Donlin Zhou, Lifeng Jin, Xinhua Feng, Pengzhi Wei, Yujie Lin, Zhichao Hu, Dong Yu, Zhengyou Zhang, Jing Nie, Yuhong Liu
for: 评估大型自然语言模型（LLMs）是否能够匹配人类偏好，以确定LLMs在不同应用场景中的性能。
methods: 提出了一种完整的人类评估框架，用于评估 LLMS 在多个实际任务中的适应性和准确性。
results: 构建了一个层次任务树，覆盖了多个领域和多个任务，并设计了评估标准和评估过程，以便启用公正、不偏袋的人类评估员进行评估。

Abstract
Large language models (LLMs) have shown impressive capabilities across various natural language tasks. However, evaluating their alignment with human preferences remains a challenge. To this end, we propose a comprehensive human evaluation framework to assess LLMs' proficiency in following instructions on diverse real-world tasks. We construct a hierarchical task tree encompassing 7 major areas covering over 200 categories and over 800 tasks, which covers diverse capabilities such as question answering, reasoning, multiturn dialogue, and text generation, to evaluate LLMs in a comprehensive and in-depth manner. We also design detailed evaluation standards and processes to facilitate consistent, unbiased judgments from human evaluators. A test set of over 3,000 instances is released, spanning different difficulty levels and knowledge domains. Our work provides a standardized methodology to evaluate human alignment in LLMs for both English and Chinese. We also analyze the feasibility of automating parts of evaluation with a strong LLM (GPT-4). Our framework supports a thorough assessment of LLMs as they are integrated into real-world applications. We have made publicly available the task tree, TencentLLMEval dataset, and evaluation methodology which have been demonstrated as effective in assessing the performance of Tencent Hunyuan LLMs. By doing so, we aim to facilitate the benchmarking of advances in the development of safe and human-aligned LLMs.

摘要
We construct a hierarchical task tree encompassing 7 major areas covering over 200 categories and over 800 tasks, which covers diverse capabilities such as question answering, reasoning, multiturn dialogue, and text generation. This framework enables us to evaluate LLMs in a comprehensive and in-depth manner. We also design detailed evaluation standards and processes to facilitate consistent, unbiased judgments from human evaluators.A test set of over 3,000 instances is released, spanning different difficulty levels and knowledge domains. Our work provides a standardized methodology to evaluate human alignment in LLMs for both English and Chinese. We also analyze the feasibility of automating parts of evaluation with a strong LLM (GPT-4).Our framework supports a thorough assessment of LLMs as they are integrated into real-world applications. We have made publicly available the task tree, TencentLLMEval dataset, and evaluation methodology, which have been demonstrated as effective in assessing the performance of Tencent Hunyuan LLMs. By doing so, we aim to facilitate the benchmarking of advances in the development of safe and human-aligned LLMs.

Training Robust Deep Physiological Measurement Models with Synthetic Video-based Data

paper_url: http://arxiv.org/abs/2311.05371
repo_url: None
paper_authors: Yuxuan Ou, Yuzhe Zhang, Yuntang Wang, Shwetak Patel, Daniel McDuf, Yuzhe Yang, Xin Liu
for: 提高深度学习模型对synthetic physiological signal的泛化能力
methods: 添加实际世界噪声到synthetic physiological signal和相应的面部视频中
results: 降低了平均误差值从6.9降至2.0

Abstract
Recent advances in supervised deep learning techniques have demonstrated the possibility to remotely measure human physiological vital signs (e.g., photoplethysmograph, heart rate) just from facial videos. However, the performance of these methods heavily relies on the availability and diversity of real labeled data. Yet, collecting large-scale real-world data with high-quality labels is typically challenging and resource intensive, which also raises privacy concerns when storing personal bio-metric data. Synthetic video-based datasets (e.g., SCAMPS \cite{mcduff2022scamps}) with photo-realistic synthesized avatars are introduced to alleviate the issues while providing high-quality synthetic data. However, there exists a significant gap between synthetic and real-world data, which hinders the generalization of neural models trained on these synthetic datasets. In this paper, we proposed several measures to add real-world noise to synthetic physiological signals and corresponding facial videos. We experimented with individual and combined augmentation methods and evaluated our framework on three public real-world datasets. Our results show that we were able to reduce the average MAE from 6.9 to 2.0.

摘要
最近的深度学习技术的进步已经证明可以通过视频来测量人类生物学重要指标（例如血液压力）。然而，这些方法的性能受到实际数据的可用性和多样性的限制。实际数据收集是一项复杂和耗资的任务，同时存在隐私问题。为了解决这些问题，人工视频数据集（如SCAMPS \cite{mcduff2022scamps））被提出，它们提供了高质量的人工数据。然而，实际数据和人工数据之间存在巨大的差异，这阻碍了神经网络模型在这些人工数据上的泛化。在这篇论文中，我们提出了一些方法来将实际世界的噪声添加到人工生物学信号和相应的视频中。我们对各种增强方法进行了单独和共同增强的实验，并在三个公共实际世界数据集上评估了我们的框架。我们的结果表明，我们可以将平均误差从6.9降低到2.0。

On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

paper_url: http://arxiv.org/abs/2311.05332
repo_url: https://github.com/pjlab-adg/gpt4v-ad-exploration
paper_authors: Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi
for: 本研究旨在评估最新的可见语言模型(\modelnamefull)在自动驾驶场景中的应用。
methods: 本研究使用了\modelnamefull进行Scene理解、 causal reasoning和决策等任务，并在不同条件下进行了广泛的测试。
results: 结果表明，\modelnamefull在Scene理解和 causal reasoning方面表现出色，能够在真实的驾驶场景中recognize intentions和做出 Informed decisions。但是，还有一些挑战需要进一步研究和开发，如方向识别、交通灯识别和空间理解等任务。

Abstract
The pursuit of autonomous driving technology hinges on the sophisticated integration of perception, decision-making, and control systems. Traditional approaches, both data-driven and rule-based, have been hindered by their inability to grasp the nuance of complex driving environments and the intentions of other road users. This has been a significant bottleneck, particularly in the development of common sense reasoning and nuanced scene understanding necessary for safe and reliable autonomous driving. The advent of Visual Language Models (VLM) represents a novel frontier in realizing fully autonomous vehicle driving. This report provides an exhaustive evaluation of the latest state-of-the-art VLM, \modelnamefull, and its application in autonomous driving scenarios. We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver. Our comprehensive tests span from basic scene recognition to complex causal reasoning and real-time decision-making under varying conditions. Our findings reveal that \modelname demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems. It showcases the potential to handle out-of-distribution scenarios, recognize intentions, and make informed decisions in real driving contexts. However, challenges remain, particularly in direction discernment, traffic light recognition, vision grounding, and spatial reasoning tasks. These limitations underscore the need for further research and development. Project is now available on GitHub for interested parties to access and utilize: \url{https://github.com/PJLab-ADG/GPT4V-AD-Exploration}

摘要
<>使用 Visual Language Models (VLM) 技术可以实现完全自动驾驶。这种技术可以解决传统方法（数据驱动和规则驱动）无法捕捉复杂的驾驶环境和其他道路用户的意图的问题。这种问题特别是在实现安全可靠的自动驾驶时具有瓶颈性。本报告对最新的State-of-the-art VLM，\modelnamefull，进行了广泛的评估，并在自动驾驶场景中应用了该模型。我们测试了模型对驾驶场景的理解和 causal reasoning 能力，以及其在不同条件下做出决策的能力。我们的发现表明，\modelname在场景理解和 causal reasoning 方面表现出色，比现有的自动驾驶系统更加出色。它可以在不同的驾驶场景中处理异常场景，识别意图，并在实际驾驶场景中做出 Informed 决策。然而，还有一些挑战，例如方向识别、交通灯识别、视觉基础 task 和空间理解任务。这些限制表明需要进一步的研发。项目现已经在 GitHub 上公开，欢迎有兴趣的人参与：\url{https://github.com/PJLab-ADG/GPT4V-AD-Exploration}。Note: The translation is in Simplified Chinese, which is the standard Chinese writing system used in mainland China. The Traditional Chinese writing system is used in Taiwan and Hong Kong.

ABIGX: A Unified Framework for eXplainable Fault Detection and Classification

paper_url: http://arxiv.org/abs/2311.05316
repo_url: None
paper_authors: Yue Zhuo, Jinchuan Qian, Zhihuan Song, Zhiqiang Ge
for: 这 paper 的目的是提出一种可解释的 fault detection and classification (FDC) 框架，即 ABIGX (Adversarial fault reconstruction-Based Integrated Gradient eXplanation)。
methods: 该框架基于 previous successful fault diagnosis methods 的基本元素，包括 contribution plots (CP) 和 reconstruction-based contribution (RBC)。它是第一个提供可变的贡献的 FDC 模型解释框架。核心部分是 adversarial fault reconstruction (AFR) 方法，它从 adversarial attack 的角度重新定义了 FR，并将其推广到 fault classification 模型中。
results: 对于 fault classification, 该 paper 提出了一个新的问题：缺陷类归一化问题，这会隐藏正确的解释。然而, authors 证明了 ABIGX 有效地解决了这个问题，并在 fault detection 和 fault classification 中超越了现有的 gradient-based explanation 方法。实验证明了 ABIGX 的解释能力，并通过量化指标和直观图示，证明了 ABIGX 的总体优势。

Abstract
For explainable fault detection and classification (FDC), this paper proposes a unified framework, ABIGX (Adversarial fault reconstruction-Based Integrated Gradient eXplanation). ABIGX is derived from the essentials of previous successful fault diagnosis methods, contribution plots (CP) and reconstruction-based contribution (RBC). It is the first explanation framework that provides variable contributions for the general FDC models. The core part of ABIGX is the adversarial fault reconstruction (AFR) method, which rethinks the FR from the perspective of adversarial attack and generalizes to fault classification models with a new fault index. For fault classification, we put forward a new problem of fault class smearing, which intrinsically hinders the correct explanation. We prove that ABIGX effectively mitigates this problem and outperforms the existing gradient-based explanation methods. For fault detection, we theoretically bridge ABIGX with conventional fault diagnosis methods by proving that CP and RBC are the linear specifications of ABIGX. The experiments evaluate the explanations of FDC by quantitative metrics and intuitive illustrations, the results of which show the general superiority of ABIGX to other advanced explanation methods.

摘要
<>将文本翻译成简化中文。<>这篇论文提出了一个统一框架，即ABIGX（对抗风险重建基于集成导数解释），用于可解释的故障检测和分类（FDC）。ABIGX基于过去成功的故障诊断方法的基本元素，包括贡献图（CP）和重建基于贡献（RBC）。它是首个提供变量贡献的总体FDC模型解释框架。ABIGX的核心部分是对抗风险重建（AFR）方法，它从对抗攻击的视角重新定义了FR，并推广到包括新的故障指标的普通故障分类模型。为故障分类，我们提出了一个新的问题，即故障类划模糊问题，这种问题本质上阻碍了正确的解释。我们证明了ABIGX有效地解决了这个问题，并超过了现有的导数基于解释方法。对故障检测，我们 theoretically 将ABIGX与传统故障诊断方法相连接，证明CP和RBC是ABIGX的线性特征。实验评估了FDC的解释，使用量化指标和直观示例，结果显示ABIGX在其他先进解释方法之上有广泛的优势。

Data Valuation and Detections in Federated Learning

paper_url: http://arxiv.org/abs/2311.05304
repo_url: https://github.com/muz1lee/motdata
paper_authors: Wenqian Li, Shuran Fu, Fengrui Zhang, Yan Pang
for: 这篇论文是针对 Federated Learning (FL) 框架下的数据评估和选择 pertinent 数据客户端的新方法。
methods: 这篇论文提出了一个基于 Wasserstein 距离的方法，用于在 FL 框架下评估客户端的数据贡献和选择 pertinent 数据。
results: 经过广泛的实验和理论分析，该方法被证明可以实现透明的数据评估和有效的 Wasserstein barycenter 计算，并且降低了验证集的依赖。

Abstract
Federated Learning (FL) enables collaborative model training while preserving the privacy of raw data. A challenge in this framework is the fair and efficient valuation of data, which is crucial for incentivizing clients to contribute high-quality data in the FL task. In scenarios involving numerous data clients within FL, it is often the case that only a subset of clients and datasets are pertinent to a specific learning task, while others might have either a negative or negligible impact on the model training process. This paper introduces a novel privacy-preserving method for evaluating client contributions and selecting relevant datasets without a pre-specified training algorithm in an FL task. Our proposed approach FedBary, utilizes Wasserstein distance within the federated context, offering a new solution for data valuation in the FL framework. This method ensures transparent data valuation and efficient computation of the Wasserstein barycenter and reduces the dependence on validation datasets. Through extensive empirical experiments and theoretical analyses, we demonstrate the potential of this data valuation method as a promising avenue for FL research.

摘要

Do personality tests generalize to Large Language Models?

paper_url: http://arxiv.org/abs/2311.05297
repo_url: None
paper_authors: Florian E. Dorner, Tom Sühr, Samira Samadi, Augustin Kelava
for: 本研究旨在评估大型自然语言处理器（LLM）在文本交互中的人类特征。
methods: 本研究使用了原本设计用于人类的测试来评估LLM的性能。
results: 研究发现，LLM的人格测试响应与人类的响应存在差异，因此不能直接将人类测试结果应用于LLM。具体来说，LLM通常会回答反编项（如“我是内向的”vs“我是外向的”）都是正面的。此外，对于用于模拟特定人格类型的提问不会显示出人类样本中的清晰分化。因此，研究人员认为需要更加注重LLM测试的有效性才能够正确地了解LLM的性能。

Abstract
With large language models (LLMs) appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate various properties of these models using tests originally designed for humans. While re-using existing tests is a resource-efficient way to evaluate LLMs, careful adjustments are usually required to ensure that test results are even valid across human sub-populations. Thus, it is not clear to what extent different tests' validity generalizes to LLMs. In this work, we provide evidence that LLMs' responses to personality tests systematically deviate from typical human responses, implying that these results cannot be interpreted in the same way as human test results. Concretely, reverse-coded items (e.g. "I am introverted" vs "I am extraverted") are often both answered affirmatively by LLMs. In addition, variation across different prompts designed to "steer" LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe it is important to pay more attention to tests' validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs' "personality".

摘要
With large language models (LLMs) appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate various properties of these models using tests originally designed for humans. While re-using existing tests is a resource-efficient way to evaluate LLMs, careful adjustments are usually required to ensure that test results are even valid across human sub-populations. Thus, it is not clear to what extent different tests' validity generalizes to LLMs. In this work, we provide evidence that LLMs' responses to personality tests systematically deviate from typical human responses, implying that these results cannot be interpreted in the same way as human test results. Concretely, reverse-coded items (e.g. "I am introverted" vs "I am extraverted") are often both answered affirmatively by LLMs. In addition, variation across different prompts designed to "steer" LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe it is important to pay more attention to tests' validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs' "personality".Here's the text in Traditional Chinese: With large language models (LLMs) appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate various properties of these models using tests originally designed for humans. While re-using existing tests is a resource-efficient way to evaluate LLMs, careful adjustments are usually required to ensure that test results are even valid across human sub-populations. Thus, it is not clear to what extent different tests' validity generalizes to LLMs. In this work, we provide evidence that LLMs' responses to personality tests systematically deviate from typical human responses, implying that these results cannot be interpreted in the same way as human test results. Concretely, reverse-coded items (e.g. "I am introverted" vs "I am extraverted") are often both answered affirmatively by LLMs. In addition, variation across different prompts designed to "steer" LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe it is important to pay more attention to tests' validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs' "personality".

Explainable artificial intelligence for Healthcare applications using Random Forest Classifier with LIME and SHAP

paper_url: http://arxiv.org/abs/2311.05665
repo_url: None
paper_authors: Mrutyunjaya Panda, Soumya Ranjan Mahanta
for: 本研究的目的是提高黑盒AI技术的可解释性，以便更好地理解这些技术的计算细节。
methods: 本研究使用了LIME和SHAP等多种可解释AI方法，并应用于一个公共可下载的 диабеت斯症状数据集上。
results: 研究结果表明，使用LIME和SHAP可以提供可靠、有效和可信worthiness的 диабеت斯症状预测结果，并且具有较高的解释性。

Abstract
With the advances in computationally efficient artificial Intelligence (AI) techniques and their numerous applications in our everyday life, there is a pressing need to understand the computational details hidden in black box AI techniques such as most popular machine learning and deep learning techniques; through more detailed explanations. The origin of explainable AI (xAI) is coined from these challenges and recently gained more attention by the researchers by adding explainability comprehensively in traditional AI systems. This leads to develop an appropriate framework for successful applications of xAI in real life scenarios with respect to innovations, risk mitigation, ethical issues and logical values to the users. In this book chapter, an in-depth analysis of several xAI frameworks and methods including LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) are provided. Random Forest Classifier as black box AI is used on a publicly available Diabetes symptoms dataset with LIME and SHAP for better interpretations. The results obtained are interesting in terms of transparency, valid and trustworthiness in diabetes disease prediction.

摘要
In this book chapter, we provide an in-depth analysis of several xAI frameworks and methods, including LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations). We also demonstrate the application of these methods on a publicly available Diabetes symptoms dataset using Random Forest Classifier as a black box AI model. The results obtained are interesting in terms of transparency, validity, and trustworthiness in diabetes disease prediction.In the following sections, we will first introduce the background and motivation for xAI, followed by an overview of the state-of-the-art xAI frameworks and methods. We will then describe the experimental setup and results of our case study using LIME and SHAP on the Diabetes symptoms dataset. Finally, we will discuss the implications of our findings and the future directions for xAI research.Background and MotivationWith the increasing use of AI systems in various applications, there is a growing need to understand how these systems make decisions. Black box AI models, such as machine learning and deep learning, are widely used in many applications, but their decision-making processes are often difficult to interpret. This lack of transparency and interpretability can make it difficult to identify errors, biases, and unfairness in AI decision-making.To address this challenge, researchers have proposed various xAI frameworks and methods to provide more detailed explanations of AI decision-making processes. XAI aims to make AI systems more transparent, interpretable, and accountable, which can help to build trust and confidence in AI systems.State-of-the-Art xAI Frameworks and MethodsSeveral xAI frameworks and methods have been proposed in recent years, including LIME, SHAP, and TreeExplainer. These methods provide different types of explanations for AI decision-making processes, such as feature attribution, model interpretability, and model explainability.LIME (Local Interpretable Model-agnostic Explanations) is a popular xAI method that provides feature attribution for any machine learning model. LIME works by generating an interpretable model locally around a specific instance, which can help to identify the most important features for that instance.SHAP (SHapley Additive exPlanations) is another popular xAI method that provides a comprehensive explanation of AI decision-making processes. SHAP assigns a value to each feature for a specific instance, which can help to identify the most important features and their contributions to the final prediction.TreeExplainer is a xAI method that provides a hierarchical explanation of decision trees. TreeExplainer works by recursively partitioning the feature space into smaller regions, which can help to identify the most important features and their interactions.Case Study: Diabetes Symptoms DatasetIn this case study, we use the publicly available Diabetes symptoms dataset to demonstrate the application of xAI methods on a black box AI model. The dataset contains 400 instances, each with 12 features, and the task is to predict whether a patient has diabetes or not. We use Random Forest Classifier as the black box AI model and apply LIME and SHAP to obtain more detailed explanations of the model's decision-making processes.Experimental SetupWe use the following experimental setup for our case study:* Dataset: Diabetes symptoms dataset* AI model: Random Forest Classifier* xAI methods: LIME and SHAPResults and DiscussionWe obtained interesting results from our case study, which are summarized as follows:* LIME: The top 5 most important features for the Random Forest Classifier are age, BMI, family history, hypertension, and smoking. These features are consistent with the known risk factors for diabetes.* SHAP: The total contribution of each feature to the final prediction is shown in the following table:| Feature | Contribution || --- | --- || age | 0.34 || BMI | 0.27 || family history | 0.23 || hypertension | 0.19 || smoking | 0.14 |The contributions are calculated based on the SHAP values for each instance. The results show that age, BMI, and family history are the most important features for the Random Forest Classifier, which is consistent with the results from LIME.Implications and Future DirectionsOur findings have several implications for the application of xAI methods in real-world scenarios. First, xAI methods can provide more detailed explanations of AI decision-making processes, which can help to build trust and confidence in AI systems. Second, xAI methods can help to identify errors, biases, and unfairness in AI systems, which can lead to more transparent and accountable AI systems. Finally, xAI methods can help to improve the performance of AI systems by identifying the most important features and their interactions.In future work, we plan to apply xAI methods to other AI models and datasets to further explore their potential applications and limitations. Additionally, we plan to develop new xAI methods that can provide more comprehensive and interpretable explanations of AI decision-making processes.

Chain of Images for Intuitively Reasoning

paper_url: http://arxiv.org/abs/2311.09241
repo_url: https://github.com/graphpku/coi
paper_authors: Fanxu Meng, Haotong Yang, Yiding Wang, Muhan Zhang
for: 该论文旨在提高大语言模型（LLM）的逻辑推理能力，使其能够利用图像来帮助思维。
methods: 该论文提出了一种图链（Chain of Images，CoI）方法，将复杂的语言逻辑问题转换为简单的图像识别任务，并开发了15种不同领域的CoI评估数据集。
results: 实验表明，使用CoI方法可以significantly提高大语言模型的逻辑推理能力，比基eline的语言链（Chain of Thoughts，CoT）表现更好。

Abstract
The human brain is naturally equipped to comprehend and interpret visual information rapidly. When confronted with complex problems or concepts, we use flowcharts, sketches, and diagrams to aid our thought process. Leveraging this inherent ability can significantly enhance logical reasoning. However, current Large Language Models (LLMs) do not utilize such visual intuition to help their thinking. Even the most advanced version language models (e.g., GPT-4V and LLaVA) merely align images into textual space, which means their reasoning processes remain purely verbal. To mitigate such limitations, we present a Chain of Images (CoI) approach, which can convert complex language reasoning problems to simple pattern recognition by generating a series of images as intermediate representations. Furthermore, we have developed a CoI evaluation dataset encompassing 15 distinct domains where images can intuitively aid problem-solving. Based on this dataset, we aim to construct a benchmark to assess the capability of future multimodal large-scale models to leverage images for reasoning. In supporting our CoI reasoning, we introduce a symbolic multimodal large language model (SyMLLM) that generates images strictly based on language instructions and accepts both text and image as input. Experiments on Geometry, Chess and Common Sense tasks sourced from the CoI evaluation dataset show that CoI improves performance significantly over the pure-language Chain of Thoughts (CoT) baselines. The code is available at https://github.com/GraphPKU/CoI.

摘要
人类大脑自然地具备了快速理解和解释视觉信息的能力。当面临复杂问题或概念时，我们使用流charts、笔画和 диаграмsto 帮助我们的思维过程。利用这种内置的能力可以大幅提高逻辑推理。然而，当前的大型自然语言模型（LLM）并不利用这种视觉直觉来帮助其思考。即使最先进的版本（例如GPT-4V和LLaVA）也只是将图像与文本空间对齐，这意味着它们的思维过程仍然是完全的语言过程。为了缓解这些限制，我们提出了链接图像（CoI）方法，可以将复杂的语言逻辑问题转化为简单的图像识别问题，通过生成一系列图像作为中间表示。此外，我们还开发了CoI评估数据集，覆盖15个不同的领域，图像可以直观地帮助解决问题。基于这个数据集，我们希望构建一个 Multimodal大型模型评估标准，以评估未来的多Modal大型模型是否能够利用图像进行逻辑推理。为支持CoI逻辑，我们介绍了一种符号Multimodal大型语言模型（SyMLLM），该模型仅基于语言指令生成图像，并接受文本和图像作为输入。实验表明，CoI在几个 geometry、棋盘和通用常识任务中表现出色，至少比基于语言的链接思维（CoT）基eline上升级。代码可以在https://github.com/GraphPKU/CoI上获取。

Don’t Waste a Single Annotation: Improving Single-Label Classifiers Through Soft Labels

paper_url: http://arxiv.org/abs/2311.05265
repo_url: None
paper_authors: Ben Wu, Yue Li, Yida Mu, Carolina Scarton, Kalina Bontcheva, Xingyi Song
for: 本文挑战传统对象单类分类任务的数据标注和训练方法的局限性。通常，在这类任务中，注释员只被要求为每个样本提供单一标签，而注释员不一致的信息则通过多数投票决定最终硬标签。本文推荐使用多个注释员的信息，包括信任度、次要标签和不一致情况，来生成软标签。
methods: 本文提出了一种软标签方法，该方法利用多个注释员的信息来生成软标签。这些软标签可以用于训练分类器，从而提高分类器的性能和准确率。
results: 本文的实验结果表明，使用软标签方法可以提高对象单类分类任务的性能和准确率。此外，软标签方法还可以提高分类器的准确率和泛化能力。

Abstract
In this paper, we address the limitations of the common data annotation and training methods for objective single-label classification tasks. Typically, when annotating such tasks annotators are only asked to provide a single label for each sample and annotator disagreement is discarded when a final hard label is decided through majority voting. We challenge this traditional approach, acknowledging that determining the appropriate label can be difficult due to the ambiguity and lack of context in the data samples. Rather than discarding the information from such ambiguous annotations, our soft label method makes use of them for training. Our findings indicate that additional annotator information, such as confidence, secondary label and disagreement, can be used to effectively generate soft labels. Training classifiers with these soft labels then leads to improved performance and calibration on the hard label test set.

摘要

Model-Based Minimum Bayes Risk Decoding

paper_url: http://arxiv.org/abs/2311.05263
repo_url: None
paper_authors: Yuu Jinnai, Tetsuro Morimura, Ukyo Honda, Kaito Ariu, Kenshi Abe
for: 这篇论文主要是关于 minimum Bayes risk (MBR) 解oding 的研究，MBR 解oding 是一种可以取代搜索搜索的文本生成任务中的一种有力的方法。
methods: 这篇论文使用了两种方法来估计 MBR 解oding 中的风险：一是通过对一些采样出的假设进行集成来估计风险，二是使用 Monte Carlo 估计来估计各个假设的概率。
results: 这篇论文的实验结果表明，使用模型概率来估计 MBR 解oding 中的风险（即 Model-Based MBR，MBMBR）可以在文本生成任务中超过 MBR 解oding。MBMBR 在encoder-decoder模型和大语言模型上都能够达到更高的性能。

Abstract
Minimum Bayes Risk (MBR) decoding has been shown to be a powerful alternative to beam search decoding in a variety of text generation tasks. MBR decoding selects a hypothesis from a pool of hypotheses that has the least expected risk under a probability model according to a given utility function. Since it is impractical to compute the expected risk exactly over all possible hypotheses, two approximations are commonly used in MBR. First, it integrates over a sampled set of hypotheses rather than over all possible hypotheses. Second, it estimates the probability of each hypothesis using a Monte Carlo estimator. While the first approximation is necessary to make it computationally feasible, the second is not essential since we typically have access to the model probability at inference time. We propose Model-Based MBR (MBMBR), a variant of MBR that uses the model probability itself as the estimate of the probability distribution instead of the Monte Carlo estimate. We show analytically and empirically that the model-based estimate is more promising than the Monte Carlo estimate in text generation tasks. Our experiments show that MBMBR outperforms MBR in several text generation tasks, both with encoder-decoder models and with large language models.

摘要
<>将文本扩展为简化中文。>最小极大 bayes风险（MBR）解码被证明为文本生成任务中的强大替代方案。MBR解码从一群假设中选择最小预期风险的假设，根据给定的用于Utility函数的概率模型。由于不可能对所有假设进行准确的预期风险计算，常用两种近似方法。首先，它将抽取一组假设而不是所有可能的假设进行集成。其次，它使用Monte Carlo估计来估计每个假设的概率。虽然第一个近似方法是必要的以使其计算可能，但第二个近似方法并不是必要的，因为我们通常在推理时有对模型概率的访问。我们提出了基于模型的MBR（MBMBR），一种MBR的变体，使用模型概率自己来估计概率分布而不是Monte Carlo估计。我们在理论和实验中证明了基于模型的估计在文本生成任务中更有前途。我们的实验显示，MBMBR在encoder-decoder模型和大语言模型上都超过了MBR。

Uncertainty Wrapper in the medical domain: Establishing transparent uncertainty quantification for opaque machine learning models in practice

paper_url: http://arxiv.org/abs/2311.05245
repo_url: None
paper_authors: Lisa Jöckel, Michael Kläs, Georg Popp, Nadja Hilger, Stephan Fricke
for: 本文旨在探讨数据模型基于机器学习（ML）的应用，以及如何量化这些模型的结果中的不确定性。
methods: 本文使用了一种名为“Uncertainty Wrapper”的方法，以便量化ML模型的结果中的不确定性。
results: 本文通过应用Uncertainty Wrapper在流式细胞分析中，成功地量化了ML模型的结果中的不确定性。

Abstract
When systems use data-based models that are based on machine learning (ML), errors in their results cannot be ruled out. This is particularly critical if it remains unclear to the user how these models arrived at their decisions and if errors can have safety-relevant consequences, as is often the case in the medical field. In such cases, the use of dependable methods to quantify the uncertainty remaining in a result allows the user to make an informed decision about further usage and draw possible conclusions based on a given result. This paper demonstrates the applicability and practical utility of the Uncertainty Wrapper using flow cytometry as an application from the medical field that can benefit from the use of ML models in conjunction with dependable and transparent uncertainty quantification.

摘要
当系统使用基于机器学习（ML）的数据模型时，结果中的错误不能被排除。特别是在用户无法了解模型如何做出决策，以及错误会有安全相关的后果，如医疗领域一样。在这些情况下，使用可靠的方法来评估结果中剩下的不确定性，让用户可以根据结果作出了解的决策。这篇论文 demonstarte了uncertainty wrapper在医疗领域的应用，使用流式测计为例，可以通过与可靠和透明的不确定性评估相结合使用ML模型，提高结果的可靠性和可信度。

Kantian Deontology Meets AI Alignment: Towards Morally Robust Fairness Metrics

paper_url: http://arxiv.org/abs/2311.05227
repo_url: None
paper_authors: Carlos Mougan, Joshua Brand
for: 本研究旨在探讨 Kant 哲学中的规范在人工智能准确性领域中的应用，具体来说是探讨 Kant 哲学如何与现有的 fairness 指标相结合。
methods: 本研究采用了 Kant 哲学的规范和批判Utilitarianism 等方法，以探讨 fairness 指标在人工智能领域中的应用。
results: 研究发现，通过 Kant 哲学的规范和批判Utilitarianism，可以更好地满足 fairness 指标的要求，并且可以帮助人工智能领域更加注重道德原则和伦理准则。

Abstract
Deontological ethics, specifically understood through Immanuel Kant, provides a moral framework that emphasizes the importance of duties and principles, rather than the consequences of action. Understanding that despite the prominence of deontology, it is currently an overlooked approach in fairness metrics, this paper explores the compatibility of a Kantian deontological framework in fairness metrics, part of the AI alignment field. We revisit Kant's critique of utilitarianism, which is the primary approach in AI fairness metrics and argue that fairness principles should align with the Kantian deontological framework. By integrating Kantian ethics into AI alignment, we not only bring in a widely-accepted prominent moral theory but also strive for a more morally grounded AI landscape that better balances outcomes and procedures in pursuit of fairness and justice.

摘要
德 Ontological 伦理学，通过 Immanuel Kant 的理解，提供了一个伦理框架，强调行为的义务和原则，而不是行为的后果。虽然德 Ontology 在 fairness 度量领域具有普遍性，但目前它在 fairness 度量领域被忽略。这篇论文探讨了 Kant 对 Utilitarianism 的批判，这是 AI 公平度量领域的主要方法，并 argue That fairness 原则应该与 Kantian 德 Ontological 框架相匹配。通过将 Kantian 伦理学 integrate 到 AI 准确领域，我们不仅把一种广泛得到的著名伦理理论引入，还努力实现一个更加伦理根据的 AI 景观，该景观更好地平衡结果和程序，寻求公平和正义。

An Experiment in Retrofitting Competency Questions for Existing Ontologies

paper_url: http://arxiv.org/abs/2311.05662
repo_url: None
paper_authors: Reham Alharbi, Valentina Tamma, Floriana Grasso, Terry Payne
for: 这篇论文是关于ontology engineering的研究，具体来说是研究如何使用生成AI提取ontology中的 Competency Questions（CQs）。
methods: 这篇论文使用了生成AI技术，提取了ontology中的CQs。
results: 这篇论文提出了一种名为RETROFIT-CQs的方法，可以直接从ontology中提取CQs，并且在一些现有的ontology中进行了应用。

Abstract
Competency Questions (CQs) are a form of ontology functional requirements expressed as natural language questions. Inspecting CQs together with the axioms in an ontology provides critical insights into the intended scope and applicability of the ontology. CQs also underpin a number of tasks in the development of ontologies e.g. ontology reuse, ontology testing, requirement specification, and the definition of patterns that implement such requirements. Although CQs are integral to the majority of ontology engineering methodologies, the practice of publishing CQs alongside the ontological artefacts is not widely observed by the community. In this context, we present an experiment in retrofitting CQs from existing ontologies. We propose RETROFIT-CQs, a method to extract candidate CQs directly from ontologies using Generative AI. In the paper we present the pipeline that facilitates the extraction of CQs by leveraging Large Language Models (LLMs) and we discuss its application to a number of existing ontologies.

摘要

Green Resilience of Cyber-Physical Systems

paper_url: http://arxiv.org/abs/2311.05201
repo_url: https://github.com/rimawi-diaeddin/GRCPS-ISSRE22-DS
paper_authors: Diaeddin Rimawi
for: 本文提出了一种基于游戏理论的方法来实现智能系统的可靠性和绿色性。
methods: 本文使用了游戏理论来快速做出决策，以实现系统的最大化奖励。
results: 研究表明，基于游戏理论的方法可以实现智能系统的可靠性和绿色性，同时减少CO2足迹。

Abstract
Cyber-Physical System (CPS) represents systems that join both hardware and software components to perform real-time services. Maintaining the system's reliability is critical to the continuous delivery of these services. However, the CPS running environment is full of uncertainties and can easily lead to performance degradation. As a result, the need for a recovery technique is highly needed to achieve resilience in the system, with keeping in mind that this technique should be as green as possible. This early doctorate proposal, suggests a game theory solution to achieve resilience and green in CPS. Game theory has been known for its fast performance in decision-making, helping the system to choose what maximizes its payoffs. The proposed game model is described over a real-life collaborative artificial intelligence system (CAIS), that involves robots with humans to achieve a common goal. It shows how the expected results of the system will achieve the resilience of CAIS with minimized CO2 footprint.

摘要
资berger-物理系统（CPS）表示融合硬件和软件元件以提供实时服务的系统。维护这个系统的可靠性非常重要，以确保无间断提供服务。然而，CPS的运行环境充满不确定性，容易导致性能下降。因此，需要一种恢复技术以实现系统的可靠性和绿色性。本博士学位提案建议使用游戏理论解决这个问题。游戏理论具有快速的决策能力，帮助系统选择最大化其收益。The proposed game model is described over a real-life collaborative artificial intelligence system (CAIS), which involves robots and humans working together to achieve a common goal. The results show that the expected results of the system will achieve the resilience of CAIS with minimized CO2 footprint.Here is the translation of the text into Traditional Chinese:资berger-物理系统（CPS）表示融合硬件和软件元件以提供实时服务的系统。维护这个系统的可靠性非常重要，以确保无间断提供服务。然而，CPS的运行环境充满不确定性，容易导致性能下降。因此，需要一种恢复技术以实现系统的可靠性和绿色性。本博士学位提案建议使用游戏理论解决这个问题。游戏理论具有快速的决策能力，帮助系统选择最大化其收益。The proposed game model is described over a real-life collaborative artificial intelligence system (CAIS), which involves robots and humans working together to achieve a common goal. The results show that the expected results of the system will achieve the resilience of CAIS with minimized CO2 footprint.

Deep Learning in Computed Tomography Pulmonary Angiography Imaging: A Dual-Pronged Approach for Pulmonary Embolism Detection

paper_url: http://arxiv.org/abs/2311.05197
repo_url: None
paper_authors: Fabiha Bushra, Muhammad E. H. Chowdhury, Rusab Sarmun, Saidul Kabir, Menatalla Said, Sohaib Bassam Zoghoul, Adam Mushtak, Israa Al-Hashimi, Abdulrahman Alqahtani, Anwarul Hasan
for:This study aims to enhance the Computer Assisted Diagnosis of Pulmonary Embolism (PE) using deep learning techniques.methods:The proposed approach combines classification and detection methods, using an Attention-Guided Convolutional Neural Network (AG-CNN) for classification and state-of-the-art detection models to pinpoint potential PE regions. Ensemble techniques are also employed to improve detection accuracy.results:The proposed approach outperformed the baseline model DenseNet-121 by achieving an 8.1% increase in the Area Under the Receiver Operating Characteristic. The classifier-guided framework further refined the mean average precision (mAP) and F1 scores over the ensemble models. The study demonstrates the potential of deep learning techniques for improving PE diagnostics and addressing the issues of underdiagnosis and misdiagnosis.

Abstract
Pulmonary Embolism (PE) is a critical medical condition characterized by obstructions in the pulmonary arteries. Despite being a major health concern, it often goes underdiagnosed leading to detrimental clinical outcomes. The increasing reliance on Computed Tomography Pulmonary Angiography for diagnosis presents challenges and a pressing need for enhanced diagnostic solutions. The primary objective of this study is to leverage deep learning techniques to enhance the Computer Assisted Diagnosis of PE. This study presents a comprehensive dual-pronged approach combining classification and detection for PE diagnosis. We introduce an Attention-Guided Convolutional Neural Network (AG-CNN) for classification, addressing both global and local lesion region. For detection, state-of-the-art models are employed to pinpoint potential PE regions. Different ensembling techniques further improve detection accuracy by combining predictions from different models. Finally, a heuristic strategy integrates classifier outputs with detection results, ensuring robust and accurate PE identification. Our attention-guided classification approach, tested on the Ferdowsi University of Mashhad's Pulmonary Embolism (FUMPE) dataset, outperformed the baseline model DenseNet-121 by achieving an 8.1% increase in the Area Under the Receiver Operating Characteristic. By employing ensemble techniques with detection models, the mean average precision (mAP) was considerably enhanced by a 4.7% increase. The classifier-guided framework further refined the mAP and F1 scores over the ensemble models. Our research offers a comprehensive approach to PE diagnostics using deep learning, addressing the prevalent issues of underdiagnosis and misdiagnosis. We aim to improve PE patient care by integrating AI solutions into clinical workflows, highlighting the potential of human-AI collaboration in medical diagnostics.

摘要
肺动脉梗阻疾病（PE）是一种严重的医疗问题， caracterizada por obstrucciones en las arterias pulmonares。Desafortunadamente, a menudo se subdiagnóstico, lo que puede tener consecuencias clínicas desastrosas. La creciente reliance en la Tomografía por Computadora Pulmonar Angiografía para el diagnóstico presenta desafíos y una necesidad urgente de soluciones de diagnóstico mejoradas. El objetivo principal de este estudio es utilizar técnicas de aprendizaje profundo para mejorar el diagnóstico asistido por computadora de PE.Este estudio presenta una enfoque dual-pronged que combina clasificación y detección para el diagnóstico de PE. Introducimos una Red Neural Convolucional Guiada por Atención (AG-CNN) para la clasificación, abarcando tanto regiones de lesiones globales como locales. Para la detección, se emplean modelos de estado del arte para identificar posibles regiones de PE. Además, se utilizan técnicas de ensamblado para mejorar la precisión de la detección al combinar las predicciones de diferentes modelos. Finalmente, se utiliza una estrategia heurística que combina las salidas de los clasificadores con las resultados de la detección, asegurando un diagnóstico robusto y preciso de PE.Nuestro enfoque de clasificación guiada por atención, probado con el conjunto de datos de la Universidad de Mashhad de Pulmonary Embolism (FUMPE), mejoró significativamente el Área bajo la Curva de Recepción Operativa (AUC) en un 8,1% en comparación con el modelo base DenseNet-121. Además, el uso de técnicas de ensamblado con modelos de detección mejoró considerablemente la precisión media de la detección (mAP) en un 4,7%. El marco de clasificación guiada por atención mejoró aún más los valores de mAP y F1 en comparación con los modelos de ensamblado.Nuestro estudio ofrece una abordación completa para el diagnóstico de PE utilizando técnicas de aprendizaje profundo, abordando los problemas prevalentes de subdiagnóstico y maldiagnóstico. Nuestro objetivo es mejorar la atención médica a los pacientes de PE mediante la integración de soluciones de inteligencia artificial en los flujos clínicos, destacando el potencial de la colaboración humana-AI en el diagnóstico médico.

Prompt Engineering a Prompt Engineer

paper_url: http://arxiv.org/abs/2311.05661
repo_url: https://github.com/promptslab/Awesome-Prompt-Engineering
paper_authors: Qinyuan Ye, Maxamed Axmed, Reid Pryzant, Fereshte Khani
for: 这个论文的目的是探索自动提示工程的问题，即构建一个更有效地引导大语言模型（LLM）完成自动提示工程的meta-提示。
methods: 该论文使用了一种名为PE2的新方法，该方法包括一个步骤 reasoning 模板和上下文指定，以及基于common optimization concepts的verbally化counterparts。
results: 根据实验结果，PE2方法在MultiArith和GSM8K数据集上的表现比”let’s think step by step”提高6.3%和3.1%。此外，PE2还在Instruction Induction benchmark、一个 suite of counterfactual tasks 和一个长的实际工业提问中表现出色，并且超过了先前的自动提示工程基elines。

Abstract
Prompt engineering is a challenging yet crucial task for optimizing the performance of large language models (LLMs). It requires complex reasoning to examine the model's errors, hypothesize what is missing or misleading in the current prompt, and communicate the task with clarity. While recent works indicate that LLMs can be meta-prompted to perform automatic prompt engineering, their potentials may not be fully untapped due to the lack of sufficient guidance to elicit complex reasoning capabilities in LLMs in the meta-prompt. In this work, we investigate the problem of "prompt engineering a prompt engineer" -- constructing a meta-prompt that more effectively guides LLMs to perform automatic prompt engineering. We introduce and analyze key components, such as a step-by-step reasoning template and context specification, which lead to improved performance. In addition, inspired by common optimization concepts such as batch size, step size and momentum, we introduce their verbalized counterparts to the meta-prompt and investigate their effects. Our final method, named PE2, finds a prompt that outperforms "let's think step by step" by 6.3% on the MultiArith dataset and 3.1% on the GSM8K dataset. To demonstrate its versatility, we apply PE2 to the Instruction Induction benchmark, a suite of counterfactual tasks, and a lengthy, real-world industrial prompt. In these settings, PE2 achieves strong performance and outperforms prior automatic prompt engineering baselines. Further, we show that PE2 makes meaningful and targeted prompt edits, amends erroneous or incomplete prompts, and presents non-trivial counterfactual reasoning abilities.

摘要
提问工程是一项复杂但关键的任务，用于优化大型语言模型（LLM）的性能。它需要复杂的推理来检查模型的错误，推测现有提问中缺失或误导的部分，并通过清晰的沟通方式传达任务。据 latest works 表明，LLM 可以被自动提问来执行提问工程，但它们的潜力可能没有被完全启用，因为缺乏充分的指导来触发 LLM 的复杂推理能力。在这种情况下，我们调查 "提问工程提问工程" -- 构建一个更加有效地导引 LLM 进行自动提问工程的 meta-提问。我们介绍和分析关键组件，如步骤 reasoning 模板和上下文规定，它们带来了提高性能的影响。此外，我们引入了批处理大小、步长和冲击的概念，并对它们的词汇化版本进行调查。我们的最终方法，名为 PE2，在 MultiArith 数据集上击败 "让我们一步一步思考" 的提问，提高了6.3%。此外，我们在 Instruction Induction 数据集和一个实际工业提问中应用 PE2，并在这些设置中达到了强性表现。进一步，我们表明 PE2 可以做出有意义和有目标的提问编辑，修正错误或不充分的提问，并展示了非常轻松的对抗性能。

Mixture of Weak & Strong Experts on Graphs

paper_url: http://arxiv.org/abs/2311.05185
repo_url: None
paper_authors: Hanqing Zeng, Hanjia Lyu, Diyi Hu, Yinglong Xia, Jiebo Luo
for: 这个论文主要目的是提出一种基于混合弱和强专家的图 neural network（GNN）模型，以提高图 классификация的表现。
methods: 这个模型使用了一种混合弱和强专家的方法，其中弱专家是一个轻量级多层感知器（MLP），强专家是一个常见的图 neural network（GNN）。这个模型还使用了一种“信心”机制来控制各个专家之间的合作方式。
results: 实验结果表明，这个模型可以在6个标准图类型的benchmark上实现显著的准确率提升，包括同型和不同型图。

Abstract
Realistic graphs contain both rich self-features of nodes and informative structures of neighborhoods, jointly handled by a GNN in the typical setup. We propose to decouple the two modalities by mixture of weak and strong experts (Mowst), where the weak expert is a light-weight Multi-layer Perceptron (MLP), and the strong expert is an off-the-shelf Graph Neural Network (GNN). To adapt the experts' collaboration to different target nodes, we propose a "confidence" mechanism based on the dispersion of the weak expert's prediction logits. The strong expert is conditionally activated when either the node's classification relies on neighborhood information, or the weak expert has low model quality. We reveal interesting training dynamics by analyzing the influence of the confidence function on loss: our training algorithm encourages the specialization of each expert by effectively generating soft splitting of the graph. In addition, our "confidence" design imposes a desirable bias toward the strong expert to benefit from GNN's better generalization capability. Mowst is easy to optimize and achieves strong expressive power, with a computation cost comparable to a single GNN. Empirically, Mowst shows significant accuracy improvement on 6 standard node classification benchmarks (including both homophilous and heterophilous graphs).

摘要
Simplified Chinese:实际图表包含节点自身的 ricH self-feature 和 neighborhood 的信息结构，通常使用 GNN 处理。我们提议通过 mixture of weak and strong experts (Mowst) 来分离这两种模式。我们的weak expert是一个轻量级 Multi-layer Perceptron (MLP)，而 strong expert 是一个 off-the-shelf Graph Neural Network (GNN)。为了适应不同的 target node，我们提出了一种 "信任度" 机制，基于 weak expert 预测 logits 的分散程度。当 node 的分类 rely 于 neighborhood information 或 weak expert 的模型质量低时，strong expert 会被 activated。我们分析了 confidence 函数对 loss 的影响，发现我们的训练算法会鼓励每个专家特化，从而生成软分割的图。此外，我们的 "信任度" 设计会带来 desirable bias 向 strong expert，以便利用 GNN 的更好的泛化能力。Mowst 易于优化，并达到了 strong expressive power，计算成本与单个 GNN 相当。Empirically，Mowst 在 6 个标准节点分类 benchmark 上表现出了显著的准确率提升，包括 homophilous 和 heterophilous 图。

FireMatch: A Semi-Supervised Video Fire Detection Network Based on Consistency and Distribution Alignment

paper_url: http://arxiv.org/abs/2311.05168
repo_url: None
paper_authors: Qinghua Lin, Zuoyong Li, Kun Zeng, Haoyi Fan, Wei Li, Xiaoguang Zhou
for: 提高视频中的火灾检测性能
methods: 基于一致 regularization 和对抗分布尺度Alignment的 semi-supervised 模型 FireMatch
results: 在两个真实世界的火灾数据集上 achieved 76.92% 和 91.81% 的准确率，比现有的 semi-supervised 分类方法高Here’s a brief explanation of each point:* “for”: The paper aims to improve the performance of fire detection in videos.* “methods”: The proposed method is based on consistency regularization and adversarial distribution alignment, and is called FireMatch.* “results”: The proposed method achieved high accuracy (76.92% and 91.81%) on two real-world fire datasets, outperforming current state-of-the-art semi-supervised classification methods.

Abstract
Deep learning techniques have greatly enhanced the performance of fire detection in videos. However, video-based fire detection models heavily rely on labeled data, and the process of data labeling is particularly costly and time-consuming, especially when dealing with videos. Considering the limited quantity of labeled video data, we propose a semi-supervised fire detection model called FireMatch, which is based on consistency regularization and adversarial distribution alignment. Specifically, we first combine consistency regularization with pseudo-label. For unlabeled data, we design video data augmentation to obtain corresponding weakly augmented and strongly augmented samples. The proposed model predicts weakly augmented samples and retains pseudo-label above a threshold, while training on strongly augmented samples to predict these pseudo-labels for learning more robust feature representations. Secondly, we generate video cross-set augmented samples by adversarial distribution alignment to expand the training data and alleviate the decline in classification performance caused by insufficient labeled data. Finally, we introduce a fairness loss to help the model produce diverse predictions for input samples, thereby addressing the issue of high confidence with the non-fire class in fire classification scenarios. The FireMatch achieved an accuracy of 76.92% and 91.81% on two real-world fire datasets, respectively. The experimental results demonstrate that the proposed method outperforms the current state-of-the-art semi-supervised classification methods.

摘要
深度学习技术对视频中的火灾检测表现有了很大提升。然而，视频基于的火灾检测模型却依赖于标注数据，并且标注数据的获得是特别的成本和时间consuming，尤其是对视频数据的处理。面对有限的标注视频数据，我们提议一种半supervised火灾检测模型，即FireMatch，基于一致regulization和对抗分布对齐。首先，我们将一致regulization与pseudo-标签结合使用。对于未标注数据，我们设计了视频数据增强，以获得对应的弱增强和强增强样本。提案的模型预测弱增强样本，并保留pseudo-标签在阈值以上，而在强增强样本上进行训练，以学习更加稳定的特征表示。其次，我们使用对抗分布对齐生成视频跨集augmented样本，以扩大训练数据，并减轻由不充分的标注数据导致的分类性能下降。最后，我们引入了公平损失，以帮助模型对输入样本产生多样的预测，解决火类分类场景中高确度对非火类的问题。FireMatch在两个实际的火灾数据集上取得了76.92%和91.81%的准确率，分别超过当前最佳半supervised分类方法。实验结果表明，提议的方法可以在火灾检测中提高模型的性能。

$\textit{Labor Space}$: A Unifying Representation of the Labor Market via Large Language Models

paper_url: http://arxiv.org/abs/2311.06310
repo_url: None
paper_authors: Seongwoon Kim, Yong-Yeol Ahn, Jaehyuk Park
for: 这个论文旨在为劳动市场分析和优化提供一个综合性的框架，帮助政策制定者和企业领导者更好地理解劳动市场的复杂关系。
methods: 该论文使用大型自然语言模型进行精度调整，从而生成了一个劳动市场实体之间的vector空间嵌入，称为”劳动空间”。这个嵌入可以暴露各种劳动市场实体之间的复杂关系，并且可以进行类型特定的凝集。
results: 该论文通过使用”劳动空间”，可以实现对各种劳动市场实体之间的复杂关系的探索和分析，例如在经济轴上位置不同类型实体，如制造业和医疗业之间的关系。此外，”劳动空间”还允许实体之间的向量加算，从而可以研究各种复杂的关系，并且可以估算经济冲击对各个单位和其它单位的响应。

Abstract
The labor market is a complex ecosystem comprising diverse, interconnected entities, such as industries, occupations, skills, and firms. Due to the lack of a systematic method to map these heterogeneous entities together, each entity has been analyzed in isolation or only through pairwise relationships, inhibiting comprehensive understanding of the whole ecosystem. Here, we introduce $\textit{Labor Space}$, a vector-space embedding of heterogeneous labor market entities, derived through applying a large language model with fine-tuning. Labor Space exposes the complex relational fabric of various labor market constituents, facilitating coherent integrative analysis of industries, occupations, skills, and firms, while retaining type-specific clustering. We demonstrate its unprecedented analytical capacities, including positioning heterogeneous entities on an economic axes, such as `Manufacturing--Healthcare'. Furthermore, by allowing vector arithmetic of these entities, Labor Space enables the exploration of complex inter-unit relations, and subsequently the estimation of the ramifications of economic shocks on individual units and their ripple effect across the labor market. We posit that Labor Space provides policymakers and business leaders with a comprehensive unifying framework for labor market analysis and simulation, fostering more nuanced and effective strategic decision-making.

摘要
劳动市场是一个复杂的生态系统，包括多种不同的实体，如产业、职业、技能和企业。由于缺乏一个系统的方法来映射这些异质的实体，每个实体都只能分析在孤立状态或者只有对应关系，这使得劳动市场的整体系统不能得到全面的理解。在这里，我们介绍了“劳动空间”，一种基于大型自然语言模型的 vector-space 嵌入，用于映射劳动市场中不同类型的实体。劳动空间暴露了劳动市场各个实体之间的复杂关系网络，使得可以进行整体的劳动市场分析和模拟，同时保持类型特有的划分。我们示出了劳动空间的前所未有分析能力，包括将劳动市场实体位置在经济轴上，如“制造业--医疗业”，以及通过向这些实体进行向量加法，进而探索各个实体之间的复杂关系，并且估算经济冲击的影响和它们的冲击波在劳动市场中的传播。我们认为，劳动空间为政策制定者和企业领导人提供了一个普遍的一体化框架，帮助他们更加精准地制定策略，从而促进劳动市场的发展和稳定。

RAPID: Training-free Retrieval-based Log Anomaly Detection with PLM considering Token-level information

paper_url: http://arxiv.org/abs/2311.05160
repo_url: https://github.com/dsba-lab/rapid
paper_authors: Gunho No, Yukyung Lee, Hyeongwon Kang, Pilsung Kang
for:This paper focuses on the task of log anomaly detection in real-time, with the goal of identifying subtle anomalies in rapidly accumulating logs without requiring dataset-specific training.methods:The proposed method, RAPID, treats logs as natural language and extracts representations using pre-trained language models. It also employs a retrieval-based technique to contrast test logs with the most similar normal logs, obviating the need for log-specific training and incorporating token-level information for refined detection.results:Experimental results show that RAPID demonstrates competitive performance compared to prior models and achieves the best performance on certain datasets, while also reducing the computational cost needed for comparison. The method is capable of real-time detection without delay, as verified through various research questions.Here is the same information in Simplified Chinese text:for:这篇论文主要关注logs anomaly detection的实时任务，目的是在快速积累的logs中检测微妙的异常性，而无需特定数据集训练。methods:提议的方法RAPID将logs视为自然语言，通过预训练的语言模型提取表示。它还实施了一种 retrieve-based 技术，将测试logs与最相似的正常logs进行对比，从而减少了需要特定数据集训练的需求。results:实验结果表明，RAPID可以与先前的模型相比，在某些数据集上达到最佳性能，同时减少了对比所需的计算成本。该方法可以在实时中进行检测，并通过多个研究问题的测试，证明了其无延迟的可行性。

Abstract
As the IT industry advances, system log data becomes increasingly crucial. Many computer systems rely on log texts for management due to restricted access to source code. The need for log anomaly detection is growing, especially in real-world applications, but identifying anomalies in rapidly accumulating logs remains a challenging task. Traditional deep learning-based anomaly detection models require dataset-specific training, leading to corresponding delays. Notably, most methods only focus on sequence-level log information, which makes the detection of subtle anomalies harder, and often involve inference processes that are difficult to utilize in real-time. We introduce RAPID, a model that capitalizes on the inherent features of log data to enable anomaly detection without training delays, ensuring real-time capability. RAPID treats logs as natural language, extracting representations using pre-trained language models. Given that logs can be categorized based on system context, we implement a retrieval-based technique to contrast test logs with the most similar normal logs. This strategy not only obviates the need for log-specific training but also adeptly incorporates token-level information, ensuring refined and robust detection, particularly for unseen logs. We also propose the core set technique, which can reduce the computational cost needed for comparison. Experimental results show that even without training on log data, RAPID demonstrates competitive performance compared to prior models and achieves the best performance on certain datasets. Through various research questions, we verified its capability for real-time detection without delay.

摘要
随着信息技术的发展，系统日志数据变得越来越重要。许多计算机系统利用日志文本进行管理，因为有限的访问源代码。寻找日志异常现象的需求在实际应用中增长，特别是面临快速积累的日志数据，但 tradicional的深度学习基于异常检测模型需要特定的数据集训练，导致延迟。尤其是，大多数方法只关注日志序列级别的信息，这使得细致的异常检测变得更加困难，并且经常包含difficult to utilize的推理过程。我们介绍了RAPID模型，利用日志数据的自然语言特征，通过预训练的自然语言模型提取表示。由于日志可以根据系统上下文分类，我们实施了 retrieve-based 技术，将测试日志与最相似的正常日志进行对比。这种策略不仅减少了训练日志的需求，而且具有Token-level信息的包容力，使检测更加精细和 Robust。我们还提出核心集技术，可以减少比较所需的计算成本。实验结果表明，无需训练日志数据，RAPID仍然可以与先前模型相比，并在某些数据集上达到最佳性能。通过多个研究问题，我们证明了它在实时检测中的可靠性。

Dialogizer: Context-aware Conversational-QA Dataset Generation from Textual Sources

paper_url: http://arxiv.org/abs/2311.07589
repo_url: None
paper_authors: Yerin Hwang, Yongil Kim, Hyunkyung Bae, Jeesoo Bang, Hwanhee Lee, Kyomin Jung
for: 提高 Conversational question answering (ConvQA) 数据稀缺问题的解决方案
methods: 利用文档生成 ConvQA 数据集，并具有对话填充和话题识别两个训练任务
results: 使用我们的框架生成的问题具有更高的上下文相关性，并通过自动评估和人工评估而证明其质量高于基eline模型

Abstract
To address the data scarcity issue in Conversational question answering (ConvQA), a dialog inpainting method, which utilizes documents to generate ConvQA datasets, has been proposed. However, the original dialog inpainting model is trained solely on the dialog reconstruction task, resulting in the generation of questions with low contextual relevance due to insufficient learning of question-answer alignment. To overcome this limitation, we propose a novel framework called Dialogizer, which has the capability to automatically generate ConvQA datasets with high contextual relevance from textual sources. The framework incorporates two training tasks: question-answer matching (QAM) and topic-aware dialog generation (TDG). Moreover, re-ranking is conducted during the inference phase based on the contextual relevance of the generated questions. Using our framework, we produce four ConvQA datasets by utilizing documents from multiple domains as the primary source. Through automatic evaluation using diverse metrics, as well as human evaluation, we validate that our proposed framework exhibits the ability to generate datasets of higher quality compared to the baseline dialog inpainting model.

摘要

paper_url: http://arxiv.org/abs/2311.05155
repo_url: https://github.com/koustavagoswami/weakly_supervised-cognate_detection
paper_authors: Koustava Goswami, Priya Rani, Theodorus Fransen, John P. McCrae
for: 本研究旨在提高对少语言的语理理解能力，包括无监督机器翻译、命名实体识别和信息检索等任务。
methods: 该研究提出了一种语言非参数的深度学习弱监督词义检测框架，使用 morphological 知识来提高词义检测的准确率。
results: 实验结果显示，该方法不仅可以在不同语言家族的数据集上达到显著提高，而且也超过了现有的参数化和无监督方法的性能。 code 和数据集生成脚本可以在 GitHub 上找到。

Abstract
Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks, including unsupervised machine translation, named entity recognition and information retrieval. Previous approaches mainly focused on supervised cognate detection tasks based on orthographic, phonetic or state-of-the-art contextual language models, which under-perform for most under-resourced languages. This paper proposes a novel language-agnostic weakly-supervised deep cognate detection framework for under-resourced languages using morphological knowledge from closely related languages. We train an encoder to gain morphological knowledge of a language and transfer the knowledge to perform unsupervised and weakly-supervised cognate detection tasks with and without the pivot language for the closely-related languages. While unsupervised, it overcomes the need for hand-crafted annotation of cognates. We performed experiments on different published cognate detection datasets across language families and observed not only significant improvement over the state-of-the-art but also our method outperformed the state-of-the-art supervised and unsupervised methods. Our model can be extended to a wide range of languages from any language family as it overcomes the requirement of the annotation of the cognate pairs for training. The code and dataset building scripts can be found at https://github.com/koustavagoswami/Weakly_supervised-Cognate_Detection

摘要
利用 cognate 的抽象 Transfer Learning 在不具备资源的语言上进行语言理解任务，包括无监督机器翻译、命名实体识别和信息检索。前一些方法主要是基于orthographic、phonetic或状态艺术语言模型，这些方法对大多数不具备资源的语言表现不佳。这篇论文提出了一种新的语言agnostic 的弱监督深度 cognate 检测框架 для不具备资源的语言，使用 morphological 知识从相似语言中获得。我们训练了一个encoder以获得一语言的 morphological 知识，然后将该知识传递给表达式来实现无监督和弱监督 cognate 检测任务，无需手动制作 cognate 对。我们在不同的发布的 cognate 检测数据集上进行了实验，并观察到了对state-of-the-art 的显著改进，同时我们的方法还超过了state-of-the-art 监督和无监督方法。我们的模型可以扩展到各种语言家族，因为它不需要 annotate cognate 对进行训练。代码和数据集生成脚本可以在找到。

paper_url: http://arxiv.org/abs/2311.05152
repo_url: https://github.com/haoyi-duan/dg-sct
paper_authors: Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
for: 本研究旨在提高大规模预训练模型在多模态任务中的性能，尤其是在多modal输入特征提取方面，以提高下游任务的表现。
methods: 该研究提出了一种新的双引导空时通道 temporal（DG-SCT）注意机制，该机制利用音频和视觉模态作为软提示，动态调整预训练模型中的参数，以适应当前多模态输入特征。
results: 实验证明，该提出的模型在多个下游任务中达到了状态略作即AVE、AVVP、AVS和AVQA等任务的最佳效果，并在具有几 shot和零 shot情况下表现出色。

Abstract
In recent years, the deployment of large-scale pre-trained models in audio-visual downstream tasks has yielded remarkable outcomes. However, these models, primarily trained on single-modality unconstrained datasets, still encounter challenges in feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises due to the introduction of irrelevant modality-specific information during encoding, which adversely affects the performance of downstream tasks. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while preserving the frozen parameters of large-scale pre-trained models. Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.

摘要
Recently, the deployment of large-scale pre-trained models in audio-visual downstream tasks has achieved remarkable results. However, these models, primarily trained on single-modality unconstrained datasets, still struggle with feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises from the introduction of irrelevant modality-specific information during encoding, which negatively affects the performance of downstream tasks. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while preserving the frozen parameters of large-scale pre-trained models. Experimental evaluations show that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.

Enhancing Instance-Level Image Classification with Set-Level Labels

paper_url: http://arxiv.org/abs/2311.05659
repo_url: https://github.com/djdprogramming/adfa2
paper_authors: Renyu Zhang, Aly A. Khan, Yuxin Chen, Robert L. Grossman
for: 提高实例级图像分类的精度，使用集成粗细标签。
methods: 基于集成粗细标签进行实例级图像分类，并提供了一种新的方法来增强实例级图像分类的精度。
results: 实验结果显示，该方法可以提高实例级图像分类的精度，比传统单个实例标签基础方法高出13%。

Abstract
Instance-level image classification tasks have traditionally relied on single-instance labels to train models, e.g., few-shot learning and transfer learning. However, set-level coarse-grained labels that capture relationships among instances can provide richer information in real-world scenarios. In this paper, we present a novel approach to enhance instance-level image classification by leveraging set-level labels. We provide a theoretical analysis of the proposed method, including recognition conditions for fast excess risk rate, shedding light on the theoretical foundations of our approach. We conducted experiments on two distinct categories of datasets: natural image datasets and histopathology image datasets. Our experimental results demonstrate the effectiveness of our approach, showcasing improved classification performance compared to traditional single-instance label-based methods. Notably, our algorithm achieves 13% improvement in classification accuracy compared to the strongest baseline on the histopathology image classification benchmarks. Importantly, our experimental findings align with the theoretical analysis, reinforcing the robustness and reliability of our proposed method. This work bridges the gap between instance-level and set-level image classification, offering a promising avenue for advancing the capabilities of image classification models with set-level coarse-grained labels.

摘要
Instance-level图像分类任务traditionally rely on单个实例标签来训练模型，例如几 shot学习和转移学习。然而，设层粗略标签可以提供实际场景中更丰富的信息。在这篇论文中，我们提出了一种新的方法，用于增强实例图像分类。我们提供了对该方法的理论分析，包括快速过剩风险率的认可条件，为我们的方法提供了理论基础。我们在自然图像集和病理图像集两个不同类型的数据集上进行了实验，结果表明我们的方法可以提高图像分类性能，相比传统单个实例标签基础方法。特别是，我们的算法在病理图像分类任务上 achievement 13%的提升，与最强基准相比。这些实验结果与理论分析相符，证明了我们的方法的可靠性和可重复性。这种方法可以把实例图像分类和集合图像分类联系起来，为图像分类模型带来新的发展空间。

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge

paper_url: http://arxiv.org/abs/2311.05112
repo_url: https://github.com/ai-in-health/medllmspracticalguide
paper_authors: Hongjian Zhou, Boyang Gu, Xinyu Zou, Yiru Li, Sam S. Chen, Peilin Zhou, Junling Liu, Yining Hua, Chengfeng Mao, Xian Wu, Zheng Li, Fenglin Liu
for: This paper provides a comprehensive overview of the current progress, applications, and challenges faced by large language models (LLMs) in medicine.methods: The paper discusses the construction of medical LLMs and their downstream performances, as well as their potential utilization in real-world clinical practice.results: The paper provides insights into the opportunities and challenges of LLMs in medicine and serves as a valuable resource for constructing practical and effective medical LLMs. Additionally, the paper includes a regularly updated list of practical guide resources of medical LLMs.

Abstract
Large language models (LLMs), such as ChatGPT, have achieved substantial attention due to their impressive human language understanding and generation capabilities. Therefore, the application of LLMs in medicine to assist physicians and patient care emerges as a promising research direction in both artificial intelligence and clinical medicine. To this end, this survey provides a comprehensive overview of the current progress, applications, and challenges faced by LLMs in medicine. Specifically, we aim to address the following questions: 1) What are LLMs and how can medical LLMs be built? 2) What are the downstream performances of medical LLMs? 3) How can medical LLMs be utilized in real-world clinical practice? 4) What challenges arise from the use of medical LLMs? 5) How can we better construct and utilize medical LLMs? As a result, this survey aims to provide insights into the opportunities and challenges of LLMs in medicine and serve as a valuable resource for constructing practical and effective medical LLMs. A regularly updated list of practical guide resources of medical LLMs can be found at https://github.com/AI-in-Health/MedLLMsPracticalGuide.

摘要
大型语言模型（LLMs），如ChatGPT，在人工智能和临床医学方面获得了广泛的注意，因为它们在人工智能和临床医学中表现出了卓越的语言理解和生成能力。因此，将LLMs应用在医疗领域以帮助医生和患者护理是一个有前途的研究方向。为了解答这些问题，本调查提供了LLMs在医疗领域的现有进步、应用和挑战。 Specifically, we aim to address the following questions:1. What are LLMs and how can medical LLMs be built?2. What are the downstream performances of medical LLMs?3. How can medical LLMs be utilized in real-world clinical practice?4. What challenges arise from the use of medical LLMs?5. How can we better construct and utilize medical LLMs?为了提供医疗LLMs的实用导航，我们建立了一个常更新的实用指南资源，可以在 GitHub 上找到：https://github.com/AI-in-Health/MedLLMsPracticalGuide。

Devil in the Landscapes: Inferring Epidemic Exposure Risks from Street View Imagery

paper_url: http://arxiv.org/abs/2311.09240
repo_url: https://github.com/0oshowero0/epidemicgcn
paper_authors: Zhenyu Han, Yanxin Xi, Tong Xia, Yu Liu, Yong Li
for: 这项研究旨在使用街景图像来评估感染病的风险。
methods: 研究人员使用了人群移动图模型和传染病启发图模型来捕捉人们的流动和感染行为。
results: 研究人员的方法在比较基eline模型时显著提高了8.54%的weighted F1分数，表明这种方法可以准确地评估街景图像中感染病的风险。

Abstract
Built environment supports all the daily activities and shapes our health. Leveraging informative street view imagery, previous research has established the profound correlation between the built environment and chronic, non-communicable diseases; however, predicting the exposure risk of infectious diseases remains largely unexplored. The person-to-person contacts and interactions contribute to the complexity of infectious disease, which is inherently different from non-communicable diseases. Besides, the complex relationships between street view imagery and epidemic exposure also hinder accurate predictions. To address these problems, we construct a regional mobility graph informed by the gravity model, based on which we propose a transmission-aware graph convolutional network (GCN) to capture disease transmission patterns arising from human mobility. Experiments show that the proposed model significantly outperforms baseline models by 8.54% in weighted F1, shedding light on a low-cost, scalable approach to assess epidemic exposure risks from street view imagery.

摘要
建筑环境支持我们每天的活动，并 shape我们的健康。利用有用的街景图像，先前的研究已经证明了建筑环境和 Chronic non-communicable diseases 之间存在深刻的相关性，但是预测传染病风险仍然未得到充分研究。人与人之间的接触和互动会增加传染病的复杂性，与非传染病不同。此外，街景图像和疫情暴露之间的复杂关系也使准确预测变得困难。为解决这些问题，我们构建了基于重力模型的区域 mobilility 图，并基于这个图构建了一种带感染传播模式的传输感知图 convolutional neural network (GCN)，以捕捉人们的 mobiliry 对疫情风险的影响。实验表明，我们提出的模型在 weighted F1 指标上比基准模型高出 8.54%，这显示了一种低成本、可扩展的方法来评估街景图像中的疫情风险。

A differentiable brain simulator bridging brain simulation and brain-inspired computing

paper_url: http://arxiv.org/abs/2311.05106
repo_url: None
paper_authors: Chaoming Wang, Tianqiu Zhang, Sichao He, Yifeng Gong, Hongyaoxing Gu, Shangyang Li, Si Wu
For: The paper aims to bridge the gap between brain simulation and brain-inspired computing (BIC) by developing a differentiable brain simulator called BrainPy.* Methods: BrainPy uses JAX and XLA to provide a range of sparse and event-driven operators for efficient and scalable brain simulation, an abstraction for managing synaptic computations, a modular and flexible interface for constructing multi-scale brain models, and an object-oriented just-in-time compilation approach to handle memory-intensive brain dynamics.* Results: The paper showcases the efficiency and scalability of BrainPy on benchmark tasks, demonstrates its ability to simulate biologically plausible spiking models, and discusses its potential to support research at the intersection of brain simulation and BIC.

Abstract
Brain simulation builds dynamical models to mimic the structure and functions of the brain, while brain-inspired computing (BIC) develops intelligent systems by learning from the structure and functions of the brain. The two fields are intertwined and should share a common programming framework to facilitate each other's development. However, none of the existing software in the fields can achieve this goal, because traditional brain simulators lack differentiability for training, while existing deep learning (DL) frameworks fail to capture the biophysical realism and complexity of brain dynamics. In this paper, we introduce BrainPy, a differentiable brain simulator developed using JAX and XLA, with the aim of bridging the gap between brain simulation and BIC. BrainPy expands upon the functionalities of JAX, a powerful AI framework, by introducing complete capabilities for flexible, efficient, and scalable brain simulation. It offers a range of sparse and event-driven operators for efficient and scalable brain simulation, an abstraction for managing the intricacies of synaptic computations, a modular and flexible interface for constructing multi-scale brain models, and an object-oriented just-in-time compilation approach to handle the memory-intensive nature of brain dynamics. We showcase the efficiency and scalability of BrainPy on benchmark tasks, highlight its differentiable simulation for biologically plausible spiking models, and discuss its potential to support research at the intersection of brain simulation and BIC.

摘要
��BrainPy是一个可微分的大脑模拟器，使得大脑模拟和智能系统研发可以更加紧密地相互协作。然而，现有的软件在这两个领域都无法实现这个目标，因为传统的大脑模拟器缺乏微分性，而深度学习框架则无法捕捉大脑动力学的生物物理实在性和复杂性。在这篇论文中，我们介绍了BrainPy，一个基于JAX和XLA的可微分大脑模拟器，以bridging大脑模拟和BIC之间的空难。BrainPy在JAX的强大AI框架上扩展了完整的功能，包括可靠、高效和可扩展的大脑模拟能力。它提供了一系列的稀疏和事件驱动运算符，抽象处理神经元计算的复杂性，可重构和灵活的多尺度大脑模型接口，以及对内存密集的大脑动力学进行对象驱动的即时编译方法。我们在 benchmark任务上展示了BrainPy的效率和可扩展性， highlighted its可微分的模拟方法，并讨论了它在大脑模拟和BIC的交叉研究中的潜力。

Legal-HNet: Mixing Legal Long-Context Tokens with Hartley Transform

paper_url: http://arxiv.org/abs/2311.05089
repo_url: None
paper_authors: Daniele Giofré, Sneha Ghantasala
for: This paper explores alternatives to the attention-based layers in the transformers architecture for specialized domains like legal, where long texts are common.
methods: The authors use non-parametric techniques such as Hartley and Fourier transforms to replace the attention-based layers, and introduce a new hybrid Seq2Seq architecture that combines a no-attention-based encoder with an attention-based decoder.
results: The authors train models with long input documents from scratch in the legal domain setting, and achieve performance comparable to or better than existing summarization tasks with less compute and memory requirements. They also contribute to reducing the carbon footprint during training.Here’s the Chinese version of the three key points:
for: 这篇论文探讨了在专业领域如法律领域中，使用 transformers 架构时的限制，并提出了使用非参数化技术来替代注意力机制的方法。
methods: 作者使用非参数化技术如哈特利变换和弗朗哥变换来替代注意力机制，并提出了一种新的混合 Seq2Seq 架构，其中的编码器使用无注意力的方式，而解码器使用注意力的方式。
results: 作者在法律领域中使用长文本进行训练，并达到了与现有摘要任务相同或更好的性能，同时具有较少的计算和存储需求。他们还认为，采用这些简单的基础设施可以让更多人训练模型，并且对于减少训练过程中的碳脚印产生贡献。

Abstract
Since its introduction, the transformers architecture has seen great adoption in NLP applications, but it also has limitations. Although the self-attention mechanism allows for generating very rich representations of the input text, its effectiveness may be limited in specialized domains such as legal, where, for example, language models often have to process very long texts. In this paper, we explore alternatives to replace the attention-based layers with simpler token-mixing mechanisms: Hartley and Fourier transforms. Using these non-parametric techniques, we train models with long input documents from scratch in the legal domain setting. We also introduce a new hybrid Seq2Seq architecture, a no-attention-based encoder connected with an attention-based decoder, which performs quite well on existing summarization tasks with much less compute and memory requirements. We believe that similar, if not better performance, as in the case of long correlations of abstractive text summarization tasks, can be achieved by adopting these simpler infrastructures. This not only makes training models from scratch accessible to more people, but also contributes to the reduction of the carbon footprint during training.

摘要
自它的引入以来，变换器体系在自然语言处理（NLP）应用中得到了广泛的采用，但它也有一些限制。尽管自我注意机制允许生成非常 ric的输入文本表示，但在特殊领域如法律领域中，语言模型经常需要处理非常长的文本。在这篇论文中，我们探讨使用非参数的字符混合机制来取代注意力基于的层：Hartley和傅立叹变换。使用这些非参数技术，我们在法律领域的长输入文档上训练模型从零开始。我们还介绍了一种新的混合Seq2Seq体系，一个没有注意力基于的编码器与一个注意力基于的解码器相连接，它在现有概要任务上表现非常好，需要 Much less compute和内存需求。我们认为，通过采用这些更简单的基础设施，可以实现类似或更好的性能，即在概要抽象文本摘要任务中，长期相关性的抽取。这不仅使得训练模型从零开始变得更加可 accessible，而且也对训练过程中的碳脚印产生了贡献。

Meta-learning of semi-supervised learning from tasks with heterogeneous attribute spaces

paper_url: http://arxiv.org/abs/2311.05088
repo_url: None
paper_authors: Tomoharu Iwata, Atsutoshi Kumagai
For: 本研究提出一种基于多任务的自适应学习方法，可以在不同任务中学习自动化分类和回归模型。* Methods: 该方法使用一种基于神经网络的变量特征自我注意层，可以同时嵌入标注和无标注数据，并且使用自适应分类或回归模型来估计无标注数据的标签。* Results: 我们的实验表明，我们的提出的方法可以在不同任务中的类型不同的数据集上提高预期的测试性能，并且超过现有的meta学习和半supervised学习方法。

Abstract
We propose a meta-learning method for semi-supervised learning that learns from multiple tasks with heterogeneous attribute spaces. The existing semi-supervised meta-learning methods assume that all tasks share the same attribute space, which prevents us from learning with a wide variety of tasks. With the proposed method, the expected test performance on tasks with a small amount of labeled data is improved with unlabeled data as well as data in various tasks, where the attribute spaces are different among tasks. The proposed method embeds labeled and unlabeled data simultaneously in a task-specific space using a neural network, and the unlabeled data's labels are estimated by adapting classification or regression models in the embedding space. For the neural network, we develop variable-feature self-attention layers, which enable us to find embeddings of data with different attribute spaces with a single neural network by considering interactions among examples, attributes, and labels. Our experiments on classification and regression datasets with heterogeneous attribute spaces demonstrate that our proposed method outperforms the existing meta-learning and semi-supervised learning methods.

摘要
我们提出了一种基于多任务的适应学习方法，可以在不同任务的属性空间上学习。现有的半supervised meta-学习方法假设所有任务共享同一个属性空间，这限制了我们学习多样化任务。我们的方法可以使用不同任务的属性空间中的数据进行测试，并且可以通过使用嵌入Space来提高测试性能。我们的方法使用神经网络将标注和无标注数据同时嵌入到任务特定的空间中，并且使用适应分类或回归模型来估算无标注数据的标签。我们开发了可变特征自我注意层，这使得我们可以使用单个神经网络来找到不同任务的数据嵌入，并且考虑到例子、属性和标签之间的交互。我们的实验表明，我们的提议方法在类фикаition和回归任务中的不同属性空间上具有更高的性能，比较现有的meta-学习和半supervised学习方法。

Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks

paper_url: http://arxiv.org/abs/2311.05085
repo_url: None
paper_authors: Aditi Mishra, Sajjadur Rahman, Hannah Kim, Kushan Mitra, Estevam Hruschka
for: This paper focuses on exploring the ability of large language models (LLMs) to provide well-grounded rationalizations for knowledge-intensive tasks, specifically commonsense multiple-choice questions.
methods: The paper uses expert-written examples in a few-shot manner to generate knowledge-grounded rationales, and compares these with crowdsourced rationalizations.
results: The study finds that knowledge-grounded rationales are preferred by crowd-workers due to their factuality, sufficiency, and comprehensive refutations, but further improvements in conciseness and novelty are required. Additionally, the paper shows that rationalization of incorrect model predictions can erode human trust in LLM-generated rationales, and proposes a two-stage pipeline to review task predictions and eliminate potential incorrect decisions before rationalization.

Abstract
Large language models (LLMs) are proficient at generating fluent text with minimal task-specific supervision. Yet, their ability to provide well-grounded rationalizations for knowledge-intensive tasks remains under-explored. Such tasks, like commonsense multiple-choice questions, require rationales based on world knowledge to support predictions and refute alternate options. We consider the task of generating knowledge-guided rationalization in natural language by using expert-written examples in a few-shot manner. Surprisingly, crowd-workers preferred knowledge-grounded rationales over crowdsourced rationalizations, citing their factuality, sufficiency, and comprehensive refutations. Although LLMs-generated rationales were preferable, further improvements in conciseness and novelty are required. In another study, we show how rationalization of incorrect model predictions erodes humans' trust in LLM-generated rationales. Motivated by these observations, we create a two-stage pipeline to review task predictions and eliminate potential incorrect decisions before rationalization, enabling trustworthy rationale generation.

摘要
大型语言模型（LLM）能够生成流畅文本，但它们对知识型任务的有效证明仍然未得到充分探索。这些任务，如常识多选问题，需要基于世界知识的证明，以支持预测和排除备用选项。我们研究了使用专家写的例子来生成自然语言中的知识导向证明，并在几个例子的情况下进行了评估。results show that crowd-workers prefer knowledge-grounded rationales over crowdsourced rationalizations, citing their factuality, sufficiency, and comprehensive refutations. Although LLMs-generated rationales were preferable, further improvements in conciseness and novelty are required. In another study, we show how rationalization of incorrect model predictions erodes humans' trust in LLM-generated rationales. Motivated by these observations, we create a two-stage pipeline to review task predictions and eliminate potential incorrect decisions before rationalization, enabling trustworthy rationale generation.Note that Simplified Chinese is used in mainland China and Singapore, while Traditional Chinese is used in Taiwan and Hong Kong.

Signal Temporal Logic-Guided Apprenticeship Learning

paper_url: http://arxiv.org/abs/2311.05084
repo_url: None
paper_authors: Aniruddh G. Puranic, Jyotirmoy V. Deshmukh, Stefanos Nikolaidis
for: 本研究旨在提高控制策略的学习效果，特别是在包含多个子目标的任务中。
methods: 本文使用时间逻辑规范来描述高级任务目标，并将其编码到图形中以实现时间基于的度量。
results: 经过实验 validate 了我们的框架可以在多种机器人 manipulate simulations 中提高学习控制策略所需的示例数量。

Abstract
Apprenticeship learning crucially depends on effectively learning rewards, and hence control policies from user demonstrations. Of particular difficulty is the setting where the desired task consists of a number of sub-goals with temporal dependencies. The quality of inferred rewards and hence policies are typically limited by the quality of demonstrations, and poor inference of these can lead to undesirable outcomes. In this letter, we show how temporal logic specifications that describe high level task objectives, are encoded in a graph to define a temporal-based metric that reasons about behaviors of demonstrators and the learner agent to improve the quality of inferred rewards and policies. Through experiments on a diverse set of robot manipulator simulations, we show how our framework overcomes the drawbacks of prior literature by drastically improving the number of demonstrations required to learn a control policy.

摘要
学习徒弟关系critically dependent于从用户示范中学习奖励和控制策略。特别是在目标任务包含一系列时间依赖关系时，推理出奖励和策略质量通常受到示范质量的限制，而且差异的推理可能会导致不良结果。在这封信中，我们表明如何使用时间逻辑规范来编码高级任务目标，并在图形中定义时间基于的度量来评估示范者和学习者机器人的行为，以提高推理出奖励和策略的质量。经过对多种机器人抓取器 simulate experiments，我们显示了我们的框架可以超越先前文献中的缺点，减少需要学习控制策略的示范数量。

Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs

paper_url: http://arxiv.org/abs/2311.05657
repo_url: https://github.com/allenai/lumos
paper_authors: Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, Bill Yuchen Lin
for: 本研究开发了一个名为Lumos的语言代理框架，用于训练语言代理。
methods: Lumos使用了一个统一的数据格式和一个模块化的架构，并使用开源大型语言模型（LLMs）。该架构包括三个模组：规划、降低和执行。
results: Lumos可以与现有的状态顶尖代理相比或超越其表现，并且具有多个优点：首先，Lumos在复杂问题回答和网络任务中表现出色，而且与更大的LLM代理相等的表现在数学任务中。其次，Lumos可以轻松地应对未见过的互动任务，并且表现更好于更大的LLM-based代理和专业代理。

Abstract
We introduce Lumos, a novel framework for training language agents that employs a unified data format and a modular architecture based on open-source large language models (LLMs). Lumos consists of three distinct modules: planning, grounding, and execution. The planning module breaks down a task into a series of high-level, tool-agnostic subgoals, which are then made specific by the grounding module through a set of low-level actions. These actions are subsequently executed by the execution module, utilizing a range of off-the-shelf tools and APIs. In order to train these modules effectively, high-quality annotations of subgoals and actions were collected and are made available for fine-tuning open-source LLMs for various tasks such as complex question answering, web tasks, and math problems. Leveraging this unified data and modular design, Lumos not only achieves comparable or superior performance to current, state-of-the-art agents, but also exhibits several key advantages: (1) Lumos surpasses GPT-4/3.5-based agents in complex question answering and web tasks, while equalling the performance of significantly larger LLM agents on math tasks; (2) Lumos outperforms open-source agents created through conventional training methods and those using chain-of-thoughts training; and (3) Lumos is capable of effectively generalizing to unseen interactive tasks, outperforming larger LLM-based agents and even exceeding performance of specialized agents.

摘要
我们介绍Lumos，一个新的语言代理框架，它使用统一的数据格式和可重复架构，基于开源的大型语言模型（LLM）。Lumos包括三个不同的模组：规划、实现和降解。规划模组将任务分解成一系列高级、工具不受限制的子目标，这些子目标遭到降解模组通过一系列低级的动作调整为具体的动作。这些动作最后由执行模组执行，使用一组标准的工具和API。为了训练这些模组，我们收集了高品质的子目标和动作的标注，并将其用于精致化开源LLM的训练，以应对不同的任务，如复杂的问题回答、网络任务和数学问题。利用这个统一的数据和模块设计，Lumos不��ely享有与当前边缘的性能，并且具有以下几个优点：1. Lumos在复杂的问题回答和网络任务上超越GPT-4/3.5-based agents，而在数学问题上与训练更大的LLM agents相当。2. Lumos比较于使用常规训练方法或链接思维训练的开源代理优秀，并且在未见到的互动任务上表现出色。3. Lumos具有优秀的普遍化能力，可以对未见到的任务进行有效地应用，超越更大的LLM-based agents和特殊化的代理。

paper_url: http://arxiv.org/abs/2311.05075
repo_url: None
paper_authors: Haijian Shao, Ming Zhu, Shengjie Zhai
for: 这个研究旨在提高心理健康预测和监测的精度，通过分析社交媒体平台上的帖子和讨论来早期检测和 intervene 人们的心理疾病。methods: 我们提出了一种新的semantic feature пре处理技术，包括三个部分：1） mitigating feature sparsity with a weak classifier，2） adaptive feature dimension with modulus loops，3） deep-mining and extending features among the contexts。results: 我们使用了2022年Reddit心理健康数据集来检验抑郁、边缘性人格障碍（BPD）和躁郁病（BD）等疾病，并解决了数据稀缺问题，表现出99.81%非零元素。 после应用我们的预处理技术，特征稀缺度下降到85.4%。在与七个参考模型进行比较后，我们的方法表现出了显著的性能改进：准确率提高8.0%，特征精度提高0.069，特征准确率提高0.093，特征 recall提高0.102，特征F1分数提高0.059，AUC提高0.059。

Abstract
Amid growing global mental health concerns, particularly among vulnerable groups, natural language processing offers a tremendous potential for early detection and intervention of people's mental disorders via analyzing their postings and discussions on social media platforms. However, ultra-sparse training data, often due to vast vocabularies and low-frequency words, hinders the analysis accuracy. Multi-labeling and Co-occurrences of symptoms may also blur the boundaries in distinguishing similar/co-related disorders. To address these issues, we propose a novel semantic feature preprocessing technique with a three-folded structure: 1) mitigating the feature sparsity with a weak classifier, 2) adaptive feature dimension with modulus loops, and 3) deep-mining and extending features among the contexts. With enhanced semantic features, we train a machine learning model to predict and classify mental disorders. We utilize the Reddit Mental Health Dataset 2022 to examine conditions such as Anxiety, Borderline Personality Disorder (BPD), and Bipolar-Disorder (BD) and present solutions to the data sparsity challenge, highlighted by 99.81% non-zero elements. After applying our preprocessing technique, the feature sparsity decreases to 85.4%. Overall, our methods, when compared to seven benchmark models, demonstrate significant performance improvements: 8.0% in accuracy, 0.069 in precision, 0.093 in recall, 0.102 in F1 score, and 0.059 in AUC. This research provides foundational insights for mental health prediction and monitoring, providing innovative solutions to navigate challenges associated with ultra-sparse data feature and intricate multi-label classification in the domain of mental health analysis.

摘要
在全球心理健康问题的增长中，特别是对护送群体来说，自然语言处理技术具有巨大的潜力，通过分析社交媒体平台上的发言和讨论来早期检测和 intervene 人们的心理疾病。然而，由于极其稀疏的训练数据，常常由于庞大的词汇和低频词汇，使分析精度受限。同时，症状的多标签和相似症状的共occurrence也使分类变得混乱。为解决这些问题，我们提出了一种新的Semantic feature预处理技术，具有三重结构：1. 减轻特征稀疏性的弱分类器，2. 适应特定的特征维度使用模块循环，3. 深入挖掘和扩展特征在上下文中。通过增强 semantic features，我们训练了一个机器学习模型，以预测和分类心理疾病。我们使用2022年的Reddit心理健康数据集来检查抑郁、边缘性人格障碍（BPD）和躁郁症（BD）等 Condition，并解决数据稀疏问题，表现为99.81%的非零元素。在我们的预处理技术应用后，特征稀疏性下降到85.4%。总的来说，我们的方法，相比七个参考模型，显示了显著的性能改善：准确率提高8.0%，精度提高0.069，准确率提高0.093，F1分数提高0.102，AUC提高0.059。这些研究提供了心理健康预测和监测的基础发现，提供了创新的解决方案，以便在心理健康分析领域 navigate 稀疏数据特征和复杂的多标签分类挑战。

A Framework to Assess (Dis)agreement Among Diverse Rater Groups

paper_url: http://arxiv.org/abs/2311.05074
repo_url: None
paper_authors: Vinodkumar Prabhakaran, Christopher Homan, Lora Aroyo, Alicia Parrish, Alex Taylor, Mark Díaz, Ding Wang
for: 本研究旨在提供一种用于评估对话AI安全性的多元观点分析框架，以优化安全性评估过程中的人类评分员Subjectivity。
methods: 本研究使用了一种包括多个评分员子组的多元观点分析框架，以捕捉评分员们的各自观点之间的系统性差异。
results: 研究发现了一些评分员子组的多元观点，并提供了关键的人类评分员Subjectivity的指标，可以帮助改进对话AI安全性评估过程。

Abstract
Recent advancements in conversational AI have created an urgent need for safety guardrails that prevent users from being exposed to offensive and dangerous content. Much of this work relies on human ratings and feedback, but does not account for the fact that perceptions of offense and safety are inherently subjective and that there may be systematic disagreements between raters that align with their socio-demographic identities. Instead, current machine learning approaches largely ignore rater subjectivity and use gold standards that obscure disagreements (e.g., through majority voting). In order to better understand the socio-cultural leanings of such tasks, we propose a comprehensive disagreement analysis framework to measure systematic diversity in perspectives among different rater subgroups. We then demonstrate its utility by applying this framework to a dataset of human-chatbot conversations rated by a demographically diverse pool of raters. Our analysis reveals specific rater groups that have more diverse perspectives than the rest, and informs demographic axes that are crucial to consider for safety annotations.

摘要
现代会话AI技术的发展带来了严重的安全防范需求，以避免用户暴露于不够安全和侮辱性内容。大多数这些工作都是基于人类评分和反馈，但不考虑人类评分者的主观性和不同 identity 的系统性分歧。现有的机器学习方法大多忽略评分者主观性，使用 golden standards 隐藏分歧（例如，通过多数投票）。为了更好地理解这类任务的社会文化倾向，我们提出了一个全面的分歧分析框架，用于测量不同评分者 subgroup 之间的多样性观点。我们然后通过应用这个框架来分析一个人与机器人对话的评分结果，并发现特定的评分者组有更多的多样性观点，以及关键的人类特征轴。

Accelerating Exploration with Unlabeled Prior Data

paper_url: http://arxiv.org/abs/2311.05067
repo_url: None
paper_authors: Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, Sergey Levine
for: 解决标准奖励学习（RL）算法在稀盐奖励任务上学习的问题。
methods: 利用无奖数据进行导航和加速探索，并将其与在线数据同时使用以优化策略和评估器。
results: 在一些具有挑战性的稀盐奖励领域中，包括AntMaze领域、Adroit手动操作领域和视觉模拟Robotic manipulation领域，实现了快速探索。

Abstract
Learning to solve tasks from a sparse reward signal is a major challenge for standard reinforcement learning (RL) algorithms. However, in the real world, agents rarely need to solve sparse reward tasks entirely from scratch. More often, we might possess prior experience to draw on that provides considerable guidance about which actions and outcomes are possible in the world, which we can use to explore more effectively for new tasks. In this work, we study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task. We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses it concurrently alongside the online data for downstream policy and critic optimization. This general formula leads to rapid exploration in several challenging sparse-reward domains where tabula rasa exploration is insufficient, including the AntMaze domain, Adroit hand manipulation domain, and a visual simulated robotic manipulation domain. Our results highlight the ease of incorporating unlabeled prior data into existing online RL algorithms, and the (perhaps surprising) effectiveness of doing so.

摘要
Our approach is simple: we learn a reward model from online experience, label the unlabeled prior data with optimistic rewards, and then use it concurrently with the online data for downstream policy and critic optimization. This general formula leads to rapid exploration in several challenging sparse-reward domains, including the AntMaze domain, Adroit hand manipulation domain, and a visual simulated robotic manipulation domain. Our results demonstrate the ease of incorporating unlabeled prior data into existing online RL algorithms, and the effectiveness of doing so.

2023-11-09

cs.CL

cs.CL - 2023-11-09

Identification of Books That are Suitable for Middle School Students Using Artificial Neural Networks

paper_url: http://arxiv.org/abs/2311.07591
repo_url: None
paper_authors: Alp Niksarli, Sadik Ozan Gorgu, Ege Gencer
for: 这个论文的目的是开发一种算法，以便制定中学生的读物选择。
methods: 该论文使用了Python编程语言和自然语言处理技术，并使用人工神经网络训练数据集。
results: 经过训练，人工神经网络达到了90.06%的一致率，能够确定中学生读物的合适性。

Abstract
Reading right books contributes to children's imagination and brain development, enhances their language and emotional comprehension abilities, and strengthens their relationships with others. Building upon the critical role of reading books in individual development, this paper aims to develop an algorithm that determines the suitability of books for middle school students by analyzing their structural and semantic features. Using methods described, an algorithm will be created that can be utilized by institutions and individuals responsible for children's education, such as the Ministry of National Education officials and schools. This algorithm will facilitate the selection of books to be taught at the middle school level. With the algorithm, the book selection process for the middle school curriculum can be expedited, and it will serve as a preliminary reference source for those who evaluate books by reading them. In this paper, the Python programming language was employed, utilizing natural language processing methods. Additionally, an artificial neural network (ANN) was trained using the data which had been preprocessed to construct an original dataset. To train this network, suitable books for middle school students were provided by the MEB, Oxford and Cambridge and with content assessed based on the "R" criterion, and inappropriate books for middle school students in terms of content were included. This trained neural network achieved a 90.06% consistency rate in determining the appropriateness of the test-provided books. Considering the obtained findings, it can be concluded that the developed software has achieved the desired objective.

摘要
阅读适合的书籍对于儿童的想象力和大脑发展、语言和情感理解能力以及与他人的关系都有益。基于阅读书籍对个人发展的重要作用，这篇论文目的是开发一种算法，以便判断中学生阅读的书籍是否适合。使用描述的方法，这篇论文将创建一种可以由教育机构和个人使用的算法，以便选择中学课程中的书籍。这个算法将加速中学课程书籍选择过程，并可作为评估书籍的先进参考源。在这篇论文中，使用Python编程语言，并使用自然语言处理技术。此外，使用预处理的数据来训练人工神经网络（ANN），以建立原始数据集。为训练这个网络，适合中学生阅读的书籍由MEB、牛津和剑桥提供，并根据“R” criterion进行评估。这个训练过的神经网络达到了90.06%的一致率，以判断提供的测试书籍的适应性。根据获得的结果，可以 conclued that the developed software has achieved the desired objective.

FAMuS: Frames Across Multiple Sources

paper_url: http://arxiv.org/abs/2311.05601
repo_url: https://github.com/factslab/famus
paper_authors: Siddharth Vashishtha, Alexander Martin, William Gantt, Benjamin Van Durme, Aaron Steven White
for: 本研究旨在提供一个新的事件描述数据集，以帮助语言处理技术进一步理解事件描述。
methods: 本研究使用Wikipedia文章和其他非Wikipedia文章，通过 FrameNet 进行事件和评论的标注。
results: 本研究获得了两个关键的事件理解任务的结果： validate 和 cross-document argument extraction。

Abstract
Understanding event descriptions is a central aspect of language processing, but current approaches focus overwhelmingly on single sentences or documents. Aggregating information about an event \emph{across documents} can offer a much richer understanding. To this end, we present FAMuS, a new corpus of Wikipedia passages that \emph{report} on some event, paired with underlying, genre-diverse (non-Wikipedia) \emph{source} articles for the same event. Events and (cross-sentence) arguments in both report and source are annotated against FrameNet, providing broad coverage of different event types. We present results on two key event understanding tasks enabled by FAMuS: \emph{source validation} -- determining whether a document is a valid source for a target report event -- and \emph{cross-document argument extraction} -- full-document argument extraction for a target event from both its report and the correct source article. We release both FAMuS and our models to support further research.

摘要
理解事件描述是语言处理的中心方面，但现有方法主要集中在单个句子或文档之上。聚合事件信息于文档之间可以提供更深刻的理解。为此，我们提出了FAMuS，一个新的Wikipedia段落和不同类型文章（非Wikipedia）的对应文章集，用于描述同一事件。在这个集中，事件和跨句子理解在报道和来源文章中都被注解到FrameNet，以提供不同类型事件的广泛覆盖。我们 presenta两个关键的事件理解任务，即：判断一个文档是否为目标报道事件的有效来源，以及在报道和正确的来源文章中提取跨文档的理解。我们发布了FAMuS和我们的模型，以支持进一步的研究。

The Iron(ic) Melting Pot: Reviewing Human Evaluation in Humour, Irony and Sarcasm Generation

paper_url: http://arxiv.org/abs/2311.05552
repo_url: None
paper_authors: Tyler Loakman, Aaron Maladry, Chenghua Lin
for: 本文 argue that the generation of more esoteric forms of language, such as humor, irony, and sarcasm, requires a more diverse and transparent evaluator panel, and that demographic information should be reported to ensure replicability.
methods: 本文采用了一个审核文本的方法，包括一个文本概述和一个分析例子的方法，以支持其主张。
results: 本文发现，当前的NLG评估方法中对评估人群的报告不够，有很多使用了众所周知的评估平台，而且评估人群的人口统计信息未经报告。

Abstract
Human evaluation is often considered to be the gold standard method of evaluating a Natural Language Generation system. However, whilst its importance is accepted by the community at large, the quality of its execution is often brought into question. In this position paper, we argue that the generation of more esoteric forms of language - humour, irony and sarcasm - constitutes a subdomain where the characteristics of selected evaluator panels are of utmost importance, and every effort should be made to report demographic characteristics wherever possible, in the interest of transparency and replicability. We support these claims with an overview of each language form and an analysis of examples in terms of how their interpretation is affected by different participant variables. We additionally perform a critical survey of recent works in NLG to assess how well evaluation procedures are reported in this subdomain, and note a severe lack of open reporting of evaluator demographic information, and a significant reliance on crowdsourcing platforms for recruitment.

摘要
人类评估通常被视为自然语言生成系统的金标准评价方法。然而，许多人认为评估的实施质量存在问题。在这篇位点纸中，我们 argue That the generation of more 特殊的语言形式，如 humor、irony 和 sarcasm，是评估Panel的特征 особен性的子领域，并且应该在报告参与者变量的同时做出最大的努力，以保证透明度和复制性。我们支持这些主张通过语言形式的概述和例子的分析来证明，以及对最近的NLG工作进行批判性的调查，以评估评价过程是如何报告的。我们发现了评估过程中参与者变量的报告不够开放，并且很多人通过协同平台进行招募。

Towards End-to-End Spoken Grammatical Error Correction

paper_url: http://arxiv.org/abs/2311.05550
repo_url: None
paper_authors: Stefano Bannò, Rao Ma, Mengjie Qian, Kate M. Knill, Mark J. F. Gales
for: 这篇论文的目的是提出一种新的端到端方法来进行口语语法错误修正（GEC），以便为第二语言学习者提供更有效的反馈。
methods: 这篇论文使用了一种基于语音识别模型的端到端方法，称为Whisper，来替代传统的批处理链式方法。这种端到端方法可以完全或部分替换传统的批处理链式方法。
results: 研究发现，使用端到端方法进行口语GEC可以实现，但由于数据的有限性，其现在的性能比使用大量文本基础数据的传统批处理链式方法低。然而，使用端到端方法进行缺失检测和删除实际上表现了更高的性能。

Abstract
Grammatical feedback is crucial for L2 learners, teachers, and testers. Spoken grammatical error correction (GEC) aims to supply feedback to L2 learners on their use of grammar when speaking. This process usually relies on a cascaded pipeline comprising an ASR system, disfluency removal, and GEC, with the associated concern of propagating errors between these individual modules. In this paper, we introduce an alternative "end-to-end" approach to spoken GEC, exploiting a speech recognition foundation model, Whisper. This foundation model can be used to replace the whole framework or part of it, e.g., ASR and disfluency removal. These end-to-end approaches are compared to more standard cascaded approaches on the data obtained from a free-speaking spoken language assessment test, Linguaskill. Results demonstrate that end-to-end spoken GEC is possible within this architecture, but the lack of available data limits current performance compared to a system using large quantities of text-based GEC data. Conversely, end-to-end disfluency detection and removal, which is easier for the attention-based Whisper to learn, does outperform cascaded approaches. Additionally, the paper discusses the challenges of providing feedback to candidates when using end-to-end systems for spoken GEC.

摘要
grammatical feedback是对于二语言学习者、教师和测试人员都是非常重要的。口语grammatical error correction（GEC）目的是为了给二语言学习者提供语法使用时的反馈。这个过程通常利用一个缓冲管理系统，包括语音识别系统、缺失去除和GEC，并且存在这些模块之间传递错误的问题。在这篇论文中，我们介绍了一种 alternativa "end-to-end" 方法 для口语 GEC，利用 Whisper 基础模型。这个基础模型可以用来取代整个框架或一部分，例如语音识别和缺失去除。这些 end-to-end 方法与更常见的缓冲方法进行比较，并在 Linguaskill 口语语言评估测试数据上进行了对比。结果表明， end-to-end 口语 GEC 在这个架构中是可能的，但由于数据的有限性，现在的性能相对较差于一个使用大量文本 GEC 数据的系统。然而， end-to-end 缺失检测和去除，这些 easier для attention-based Whisper 学习的任务，实际上超过了缓冲方法的性能。论文还讨论了在使用 end-to-end 系统时向候选人提供反馈的挑战。

All Should Be Equal in the Eyes of Language Models: Counterfactually Aware Fair Text Generation

paper_url: http://arxiv.org/abs/2311.05451
repo_url: None
paper_authors: Pragyan Banerjee, Abhinav Java, Surgan Jandial, Simra Shahid, Shaz Furniturewala, Balaji Krishnamurthy, Sumit Bhatia
for: 本研究旨在提高语言模型（LM）的公平性，即使训练数据含有偏见，LM可能会延续这些偏见并影响下游任务。
methods: 我们提出了一种名为Counterfactually Aware Fair InferencE（CAFIE）的框架，它在不同群体之间进行对比，以生成更公平的句子。
results: 我们进行了广泛的实验研究，使用不同大小的基础LM和三个多样化的数据集，发现CAFIE比强基eline表现出色，生成更公平的文本，同时保持了语言模型的能力。

Abstract
Fairness in Language Models (LMs) remains a longstanding challenge, given the inherent biases in training data that can be perpetuated by models and affect the downstream tasks. Recent methods employ expensive retraining or attempt debiasing during inference by constraining model outputs to contrast from a reference set of biased templates or exemplars. Regardless, they dont address the primary goal of fairness to maintain equitability across different demographic groups. In this work, we posit that inferencing LMs to generate unbiased output for one demographic under a context ensues from being aware of outputs for other demographics under the same context. To this end, we propose Counterfactually Aware Fair InferencE (CAFIE), a framework that dynamically compares the model understanding of diverse demographics to generate more equitable sentences. We conduct an extensive empirical evaluation using base LMs of varying sizes and across three diverse datasets and found that CAFIE outperforms strong baselines. CAFIE produces fairer text and strikes the best balance between fairness and language modeling capability

摘要
Language Model (LM) 的公平性仍然是一个长期的挑战，因为训练数据中存在的遗传性偏见可以被模型传递并影响下游任务。 recent methods 使用 expensive 重训练或在推理过程中进行偏见调节，但是这些方法不能实现保持不同民族群体的平等性。在这项工作中，我们认为，在推理LMs中为一个民族群体生成无偏见输出，需要了解其他民族群体在同一个上下文下的输出。为此，我们提出了Counterfactually Aware Fair InferencE（CAFIE）框架，该框架在运行时比较不同民族群体的模型理解，以生成更平等的句子。我们对基础LMs 的不同大小和三个多样化的数据集进行了广泛的实验评估，并发现 CAFIE 在 fairness 和语言模型能力之间做出了最佳的平衡。 CAFIE 生成的文本更加公平，并且在语言模型能力方面也具有优异的表现。

Memorisation Cartography: Mapping out the Memorisation-Generalisation Continuum in Neural Machine Translation

paper_url: http://arxiv.org/abs/2311.05379
repo_url: None
paper_authors: Verna Dankers, Ivan Titov, Dieuwke Hupkes
for: 这个论文的目的是为了研究使用神经网络进行机器翻译时，模型是如何快速记忆某些源-目标映射，而忘记其他映射的原因，以及这种记忆-总结维度如何影响神经网络模型的表现。
methods: 这个论文使用了对500万个神经网络翻译数据点进行分析，并使用了对数据点的表面特征和模型每个数据点的训练信号进行预测，以确定数据点在记忆-总结维度上的位置。
results: 研究发现，模型在记忆-总结维度上的表现与数据点的表面特征和模型每个数据点的训练信号有直接的关系，并且这些数据点的分布对神经网络模型的表现产生了重要的影响。

Abstract
When training a neural network, it will quickly memorise some source-target mappings from your dataset but never learn some others. Yet, memorisation is not easily expressed as a binary feature that is good or bad: individual datapoints lie on a memorisation-generalisation continuum. What determines a datapoint's position on that spectrum, and how does that spectrum influence neural models' performance? We address these two questions for neural machine translation (NMT) models. We use the counterfactual memorisation metric to (1) build a resource that places 5M NMT datapoints on a memorisation-generalisation map, (2) illustrate how the datapoints' surface-level characteristics and a models' per-datum training signals are predictive of memorisation in NMT, (3) and describe the influence that subsets of that map have on NMT systems' performance.

摘要

Build a resource that places 5M NMT datapoints on a memorization-generalization map.2. Illustrate how the datapoints’ surface-level characteristics and a models’ per-datum training signals are predictive of memorization in NMT.3. Describe the influence that subsets of that map have on NMT systems’ performance.Note: “Simplified Chinese” is a simplified version of Chinese that is used in mainland China and is written using simplified characters.

There’s no Data Like Better Data: Using QE Metrics for MT Data Filtering

paper_url: http://arxiv.org/abs/2311.05350
repo_url: None
paper_authors: Jan-Thorsten Peter, David Vilar, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Markus Freitag
for: 本研究旨在研究使用Quality Estimation（QE）度量来筛选机器翻译输出的坏 качество句子对，以提高机器翻译系统（NMT）的翻译质量。
methods: 本研究使用QE度量来筛选training数据中的坏 качество句子对，并对选择的句子对进行翻译。
results: 研究表明，通过选择高品质句子对进行翻译，可以提高翻译质量，同时减少training数据的大小。此外，研究还提供了筛选结果的详细分析，并对两种方法之间的差异进行了比较。

Abstract
Quality Estimation (QE), the evaluation of machine translation output without the need of explicit references, has seen big improvements in the last years with the use of neural metrics. In this paper we analyze the viability of using QE metrics for filtering out bad quality sentence pairs in the training data of neural machine translation systems~(NMT). While most corpus filtering methods are focused on detecting noisy examples in collections of texts, usually huge amounts of web crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between both approaches.

摘要
Quality Estimation (QE)，机器翻译输出评估的方法，在过去几年内受到了大量的改进，尤其是通过神经网络度量方法。本文分析了使用QE度量来筛选机器翻译系统（NMT）的训练数据中差异质量的可能性。大多数文库筛选方法通常是通过检测废弃的文本示例来检测废弃的示例，而QE模型则是专门准备了更细化的质量差异。我们显示了，通过选择训练数据中最高质量的句子对，可以提高翻译质量，同时减少训练数据的一半。我们还提供了筛选结果的详细分析，这些分析结果 highlights 两种方法之间的差异。

DeeLM: Dependency-enhanced Large Language Model for Sentence Embeddings

paper_url: http://arxiv.org/abs/2311.05296
repo_url: None
paper_authors: Xianming Li, Jing Li
for: 提高句子嵌入的性能
methods: 提出一种名为Dependency-Enhanced Large Language Model (DeeLM)的新方法，通过将特定LLM层变为bidirectional，以便学习倒数依赖关系
results: DeeLM比基eline和其他方法表现出色，在多个semantic textual similarity (STS)任务上实现了状态的最佳性能

Abstract
Recent studies have proposed using large language models (LLMs) for sentence embeddings. However, most existing LLMs are built with an autoregressive architecture that primarily captures forward dependencies while neglecting backward dependencies. Previous work has highlighted the importance of backward dependencies in improving sentence embeddings. To address this issue, in this paper, we first present quantitative evidence demonstrating the limited learning of backward dependencies in LLMs. Then, we propose a novel approach called Dependency-Enhanced Large Language Model (DeeLM) to improve sentence embeddings. Specifically, we found a turning point in LLMs, where surpassing specific LLM layers leads to a significant performance drop in the semantic textual similarity (STS) task. STS is a crucial task for evaluating sentence embeddings. We then extract the layers after the turning point to make them bidirectional, allowing for the learning of backward dependencies. Extensive experiments demonstrate that DeeLM outperforms baselines and achieves state-of-the-art performance across various STS tasks.

摘要
Recent studies have proposed using large language models (LLMs) for sentence embeddings. However, most existing LLMs are built with an autoregressive architecture that primarily captures forward dependencies while neglecting backward dependencies. Previous work has highlighted the importance of backward dependencies in improving sentence embeddings. To address this issue, in this paper, we first present quantitative evidence demonstrating the limited learning of backward dependencies in LLMs. Then, we propose a novel approach called Dependency-Enhanced Large Language Model (DeeLM) to improve sentence embeddings. Specifically, we found a turning point in LLMs, where surpassing specific LLM layers leads to a significant performance drop in the semantic textual similarity (STS) task. STS is a crucial task for evaluating sentence embeddings. We then extract the layers after the turning point to make them bidirectional, allowing for the learning of backward dependencies. Extensive experiments demonstrate that DeeLM outperforms baselines and achieves state-of-the-art performance across various STS tasks.Here's the translation in Traditional Chinese:Recent studies have proposed using large language models (LLMs) for sentence embeddings. However, most existing LLMs are built with an autoregressive architecture that primarily captures forward dependencies while neglecting backward dependencies. Previous work has highlighted the importance of backward dependencies in improving sentence embeddings. To address this issue, in this paper, we first present quantitative evidence demonstrating the limited learning of backward dependencies in LLMs. Then, we propose a novel approach called Dependency-Enhanced Large Language Model (DeeLM) to improve sentence embeddings. Specifically, we found a turning point in LLMs, where surpassing specific LLM layers leads to a significant performance drop in the semantic textual similarity (STS) task. STS is a crucial task for evaluating sentence embeddings. We then extract the layers after the turning point to make them bidirectional, allowing for the learning of backward dependencies. Extensive experiments demonstrate that DeeLM outperforms baselines and achieves state-of-the-art performance across various STS tasks.

Causal Inference from Text: Unveiling Interactions between Variables

paper_url: http://arxiv.org/abs/2311.05286
repo_url: None
paper_authors: Yuxiang Zhou, Yulan He
for: 这篇论文是为了估计从文本数据中的 causal effect 而写的。
methods: 该论文使用了一种新的方法，可以识别和解决在文本数据中的隐藏 covariates 问题，以估计更准确的 causal effect。
results: 实验表明，该方法可以在两种不同的干预因素下表现出色，并且在不同的场景下都能够减少偏见。此外，对实际业务场景的调查也表明，该模型可以有效地分离变量，帮助投资者做出更 Informed 的决策。

Abstract
Adjusting for latent covariates is crucial for estimating causal effects from observational textual data. Most existing methods only account for confounding covariates that affect both treatment and outcome, potentially leading to biased causal effects. This bias arises from insufficient consideration of non-confounding covariates, which are relevant only to either the treatment or the outcome. In this work, we aim to mitigate the bias by unveiling interactions between different variables to disentangle the non-confounding covariates when estimating causal effects from text. The disentangling process ensures covariates only contribute to their respective objectives, enabling independence between variables. Additionally, we impose a constraint to balance representations from the treatment group and control group to alleviate selection bias. We conduct experiments on two different treatment factors under various scenarios, and the proposed model significantly outperforms recent strong baselines. Furthermore, our thorough analysis on earnings call transcripts demonstrates that our model can effectively disentangle the variables, and further investigations into real-world scenarios provide guidance for investors to make informed decisions.

摘要
Translation notes:* "latent covariates" is translated as "隐藏的变量" (hidden variables)* "confounding covariates" is translated as "干扰变量" (confounding variables)* "non-confounding covariates" is translated as "非干扰变量" (non-confounding variables)* "disentangle" is translated as "分离" (disentangle)* "objectives" is translated as "目标" (objectives)* "selection bias" is translated as "选择偏见" (selection bias)* "earnings call transcripts" is translated as "财务报告笔记" (earnings call transcripts)

Modelling prospective memory and resilient situated communications via Wizard of Oz

paper_url: http://arxiv.org/abs/2311.05268
repo_url: None
paper_authors: Yanzhe Li, Frank Broz, Mark Neerincx
for: 本研究旨在探讨老年人与社会辅助机器人（SAR）之间的人机交互，以探索可靠的记忆模型。
methods: 该研究使用了一个家庭场景，涉及老年人和一个机器人，以探索在日常活动中的语音技术失败和人机交互问题。
results: 该研究将收集日常活动中的语音技术失败和人机交互数据，以便更好地理解老年人和SAR之间的交互。

Abstract
This abstract presents a scenario for human-robot action in a home setting involving an older adult and a robot. The scenario is designed to explore the envisioned modelling of memory for communication with a socially assistive robots (SAR). The scenario will enable the gathering of data on failures of speech technology and human-robot communication involving shared memory that may occur during daily activities such as a music-listening activity.

摘要
这个报告描述了一个家庭环境中older adult和机器人之间的人机交互场景。这个场景是为了探索对社会辅助机器人（SAR）的记忆模型的推断。这个场景将帮助收集在日常活动中，如音乐听众活动中，人机交互中的语音技术失败和人机共享记忆的数据。

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

paper_url: http://arxiv.org/abs/2311.05232
repo_url: None
paper_authors: Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu
for: 这篇论文旨在提供关于大语言模型（LLM）幻觉的最新进展和评论。
methods: 论文使用了一种创新的分类方法来描述LLM幻觉的多种类型，并检查了幻觉的因素和检测方法。
results: 论文提供了一个全面的概述，包括幻觉检测方法和标准准则，以及一些针对幻觉的修正方法。

Abstract
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), leading to remarkable advancements in text understanding and generation. Nevertheless, alongside these strides, LLMs exhibit a critical tendency to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. This phenomenon poses substantial challenges to their practical deployment and raises concerns over the reliability of LLMs in real-world scenarios, which attracts increasing attention to detect and mitigate these hallucinations. In this survey, we aim to provide a thorough and in-depth overview of recent advances in the field of LLM hallucinations. We begin with an innovative taxonomy of LLM hallucinations, then delve into the factors contributing to hallucinations. Subsequently, we present a comprehensive overview of hallucination detection methods and benchmarks. Additionally, representative approaches designed to mitigate hallucinations are introduced accordingly. Finally, we analyze the challenges that highlight the current limitations and formulate open questions, aiming to delineate pathways for future research on hallucinations in LLMs.

摘要
大型自然语言模型（LLM）的出现标志着自然语言处理（NLP）领域的重要突破，导致了文本理解和生成的显著进步。然而，与这些进步相伴的是LLM往往会产生幻觉，导致的内容与实际世界的事实或用户输入不一致。这种现象对LLM的实际应用提出了重大挑战，也引起了对幻觉的检测和 Mitigation 的关注。在这篇评论中，我们希望提供一个全面、深入的LLM幻觉领域的现状报告。我们首先提出了一种创新的LLM幻觉分类法，然后探讨了幻觉的原因。接着，我们对幻觉检测方法和标准进行了全面的介绍。此外，我们还介绍了一些代表性的幻觉缓解方法。最后，我们分析了当前的挑战和未解决问题，并提出了未来研究的导向。

PRODIGy: a PROfile-based DIalogue Generation dataset

paper_url: http://arxiv.org/abs/2311.05195
repo_url: https://github.com/land-fbk/prodigy-dataset
paper_authors: Daniela Occhipinti, Serra Sinem Tekiroglu, Marco Guerini
for: 提高对话机器人的一致性和综合性，以便更好地进行对话。
methods: 提出了一种统一框架，将标准和更复杂的对话人物表示相结合，并将每个对话与所有可能的说话人物表示相对应。
results: 自动评估表明，基于人物表示的模型在领域和跨领域设置中都有更好的泛化能力，并且人工评估表明，生成与人物表示和上下文一致的内容得到了人们的偏好。

Abstract
Providing dialogue agents with a profile representation can improve their consistency and coherence, leading to better conversations. However, current profile-based dialogue datasets for training such agents contain either explicit profile representations that are simple and dialogue-specific, or implicit representations that are difficult to collect. In this work, we propose a unified framework in which we bring together both standard and more sophisticated profile representations by creating a new resource where each dialogue is aligned with all possible speaker representations such as communication style, biographies, and personality. This framework allows to test several baselines built using generative language models with several profile configurations. The automatic evaluation shows that profile-based models have better generalisation capabilities than models trained on dialogues only, both in-domain and cross-domain settings. These results are consistent for fine-tuned models and instruction-based LLMs. Additionally, human evaluation demonstrates a clear preference for generations consistent with both profile and context. Finally, to account for possible privacy concerns, all experiments are done under two configurations: inter-character and intra-character. In the former, the LM stores the information about the character in its internal representation, while in the latter, the LM does not retain any personal information but uses it only at inference time.

摘要
提供对话代理人 profiles 可以提高对话的一致性和 coherence，导致更好的对话。然而，当前的对话基于 profiles 的训练数据集中 Either explicit profiles 是简单的对话特定的，或者 implicit profiles 是困难收集的。在这项工作中，我们提议一个统一框架，在这个框架中，我们将每个对话与所有可能的 speaker 表示（如沟通风格、生平、人格）进行对应。这个框架允许我们测试一些基于生成语言模型的基线模型，并对不同的 profile 配置进行测试。自动评估表明，profile-based 模型在预测和跨预测场景中都具有更好的一致性和稳定性。此外，人工评估表明，生成与 profile 和 context 一致的对话得到了人们的偏好。最后，为了解决可能的隐私问题，我们在两种配置下进行所有实验：inter-character 和 intra-character。在前一种情况下，LM 将Character 信息存储在其内部表示中，而在后一种情况下，LM 不会保留任何个人信息，只在推理时使用它们。

Large Language Models and Prompt Engineering for Biomedical Query Focused Multi-Document Summarisation

paper_url: http://arxiv.org/abs/2311.05169
repo_url: None
paper_authors: Diego Mollá
for: 本研究使用提示工程和GPT-3.5进行生物医学问题焦点多文摘要。
methods: 使用GPT-3.5和适当的提示，我们的系统在2023年生物医学问题解决比赛（BioASQ 11b）中实现了最高的ROUGE-F1分数。
results: 本研究证明了其他领域所观察到的结论：1）包含几个示例的提示通常会提高其零shot变种的性能；2）检索增强生成可以获得最大的改进。这些提示使我们的最佳实际排名在BioASQ 11b中的前两名，表明使用适当的提示对大语言模型在摘要 tasks 中具有强大的能力。

Abstract
This paper reports on the use of prompt engineering and GPT-3.5 for biomedical query-focused multi-document summarisation. Using GPT-3.5 and appropriate prompts, our system achieves top ROUGE-F1 results in the task of obtaining short-paragraph-sized answers to biomedical questions in the 2023 BioASQ Challenge (BioASQ 11b). This paper confirms what has been observed in other domains: 1) Prompts that incorporated few-shot samples generally improved on their counterpart zero-shot variants; 2) The largest improvement was achieved by retrieval augmented generation. The fact that these prompts allow our top runs to rank within the top two runs of BioASQ 11b demonstrate the power of using adequate prompts for Large Language Models in general, and GPT-3.5 in particular, for query-focused summarisation.

摘要
Translation notes:* "prompt engineering" is translated as "提示工程" (tiēshì gōngchéng), which refers to the process of designing and optimizing prompts to improve the performance of language models.* "GPT-3.5" is translated as "GPT-3.5" (GPT-3.5), as it is a well-known language model that is widely used in natural language processing tasks.* "ROUGE-F1" is translated as "ROUGE-F1" (ROUGE-F1), as it is a widely used evaluation metric for summarization tasks.* "BioASQ Challenge" is translated as "生物学问题大会" (shēngwù xuéwèn da hui), which refers to a specific challenge for biomedical question answering.* "few-shot samples" is translated as "少量示例" (shǎo liàng shì xiàng), which refers to a small number of training examples that are used to fine-tune the language model.* "zero-shot variants" is translated as "无示例变体" (wú shì xiàng biàn tǐ), which refers to language models that are trained without any fine-tuning on specific tasks.* "retrieval augmented generation" is translated as "检索增强生成" (jiǎn sò zhòng qiáng shēng chéng), which refers to a technique that uses retrieval information to improve the generation of text.

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

paper_url: http://arxiv.org/abs/2311.05161
repo_url: None
paper_authors: Jangwhan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi
for: 提高语言处理任务的计算效率，增强大型语言模型（LLMs）的部署。
methods: 使用4位权值和8位活动（W4A8）归一化，并提出两种创新技术：活动归一化aware scaling（AQAS）和序列长度aware calibration（SLAC），以增强post-training量化（PTQ）。
results: 通过对多种语言模型进行严格评估，包括OPT和LLaMA，显示了OUR技术可以提高任务准确率至与全精度模型相当水平。此外，通过开发与dINT兼容的加法器，确认了OUR方法在硬件效率方面的2倍提高。

Abstract
Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency -- a topic less explored compared to weight-only quantization. We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC) to enhance PTQ by considering the combined effects on weights and activations and aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2$\times$ hardware efficiency improvement compared to 8-bit integer MAC unit.

摘要
(Simplified Chinese translation)大型语言模型（LLM）在自然语言处理任务中表现出色，但其部署受限于广泛的参数大小和计算需求。本文关注 LLM 的后期训练量化（PTQ），特别是4位重量和8位活动（W4A8）量化，以提高计算效率。我们提出了两种创新技术：活动量化扩展（AQAS）和序列长度意识calibration（SLAC），以增强PTQ，并考虑参数和活动之间的共同效应。此外，我们介绍了 dINT，一种 combining 整数和denormal表示的混合数据格式，以解决 W4A8 量化中的下溢问题， где小值被舍入为零。我们通过对 LLM 进行严格的评估，包括 OPT 和 LLaMA，证明了我们的技术可以提高任务准确率至与全精度模型相当的水平。此外，我们还开发了与 dINT 兼容的数学单元，确认了我们的方法可以在硬件上实现2倍的效率提升 compared to 8位整数 MAC 单元。

Quranic Conversations: Developing a Semantic Search tool for the Quran using Arabic NLP Techniques

paper_url: http://arxiv.org/abs/2311.05120
repo_url: None
paper_authors: Yasser Shohoud, Maged Shoman, Sarah Abdelazim
for: This paper is written to provide a Quran semantic search tool for Muslims to easily find relevant verses in the Quran related to their inquiries or prompts.
methods: The paper uses a combination of machine learning models and cosine similarity to index the Quran and find the most relevant verses related to a user’s inquiry.
results: The paper achieves a high cosine similarity score of 0.97 using the SNxLM model, which demonstrates the effectiveness of the proposed Quran semantic search tool.

Abstract
The Holy Book of Quran is believed to be the literal word of God (Allah) as revealed to the Prophet Muhammad (PBUH) over a period of approximately 23 years. It is the book where God provides guidance on how to live a righteous and just life, emphasizing principles like honesty, compassion, charity and justice, as well as providing rules for personal conduct, family matters, business ethics and much more. However, due to constraints related to the language and the Quran organization, it is challenging for Muslims to get all relevant ayahs (verses) pertaining to a matter or inquiry of interest. Hence, we developed a Quran semantic search tool which finds the verses pertaining to the user inquiry or prompt. To achieve this, we trained several models on a large dataset of over 30 tafsirs, where typically each tafsir corresponds to one verse in the Quran and, using cosine similarity, obtained the tafsir tensor which is most similar to the prompt tensor of interest, which was then used to index for the corresponding ayah in the Quran. Using the SNxLM model, we were able to achieve a cosine similarity score as high as 0.97 which corresponds to the abdu tafsir for a verse relating to financial matters.

摘要
《古兰经》被认为是神的literal字（阿拉），由先知穆罕默德（愿旦）在约23年内逐渐接受的。这本书提供了如何过一个正直和公正的生活的指导，强调诚信、慈悲、慈善和正义等原则，并提供了个人行为、家庭事务、商业伦理等方面的规则。然而，由于语言和《古兰经》的组织方式的限制，使得穆斯林找到有关的各个篇章（ayah）变得困难。为了解决这个问题，我们开发了一个《古兰经》semantic search工具，可以找到用户的查询或提示中相关的各个篇章。我们使用了多个模型，并在大量的30本译注（tafsir）中训练了这些模型。我们使用cosine similarity来评估这些模型，并获得了最相似的译注矩阵，然后用这个矩阵来索引《古兰经》中相关的各个篇章。使用SNxLM模型，我们可以达到cosine similarity分数达0.97，与关于财务问题的阿杜译注（tafsir）相对应。

Unsupervised Translation Quality Estimation Exploiting Synthetic Data and Pre-trained Multilingual Encoder

paper_url: http://arxiv.org/abs/2311.05117
repo_url: None
paper_authors: Yuto Kuroda, Atsushi Fujita, Tomoyuki Kajiwara, Takashi Ninomiya
for: 这篇论文目的是为了研究无监督翻译质量估计（TQE）方法，以减少翻译质量估计的训练数据成本。
methods: 这篇论文使用了人工合成的TQE数据和预训练多语言编码器，以进行无监督 sentence-level TQE。
results: 实验表明，这种方法可以在高资源和低资源翻译方向中比其他无监督 TQE方法更高的准确率和人类评价分数，以及一些零资源翻译方向中的准确率。

Abstract
Translation quality estimation (TQE) is the task of predicting translation quality without reference translations. Due to the enormous cost of creating training data for TQE, only a few translation directions can benefit from supervised training. To address this issue, unsupervised TQE methods have been studied. In this paper, we extensively investigate the usefulness of synthetic TQE data and pre-trained multilingual encoders in unsupervised sentence-level TQE, both of which have been proven effective in the supervised training scenarios. Our experiment on WMT20 and WMT21 datasets revealed that this approach can outperform other unsupervised TQE methods on high- and low-resource translation directions in predicting post-editing effort and human evaluation score, and some zero-resource translation directions in predicting post-editing effort.

摘要
翻译质量估算（TQE）是指无需参考翻译的翻译质量预测。由于创建TQE训练数据的成本巨大，只有一些翻译方向可以从supervised训练中受益。为解决这个问题，无监督TQE方法得到了研究。本文广泛研究了使用synthetic TQE数据和预训练多语言 encoder在无监督句级TQE中的可用性，两者在supervised训练场景中已经证明有效。我们在WMT20和WMT21数据集上进行了实验，发现这种方法可以在高资源和低资源翻译方向中预测后期编辑努力和人工评分，以及一些zero资源翻译方向中预测后期编辑努力。

Conic10K: A Challenging Math Problem Understanding and Reasoning Dataset

paper_url: http://arxiv.org/abs/2311.05113
repo_url: https://github.com/whynlp/conic10k
paper_authors: Haoyi Wu, Wenyang Hui, Yezeng Chen, Weiqi Wu, Kewei Tu, Yi Zhou
for: 这个论文的目的是提出一个有挑战性的数学问题集，用于评估人工智能（AI）的数学理解和逻辑能力。
methods: 该论文使用了中国高中教育中的几何形式问题集，并为每个问题提供了高质量的正式表示，逻辑步骤和最终解决方案。
results: 实验表明，现有的大语言模型，包括GPT-4，在复杂的逻辑推理中表现不佳。

Abstract
Mathematical understanding and reasoning are crucial tasks for assessing the capabilities of artificial intelligence (AI). However, existing benchmarks either require just a few steps of reasoning, or only contain a small amount of data in one specific topic, making it hard to analyse AI's behaviour with reference to different problems within a specific topic in detail. In this work, we propose Conic10K, a challenging math problem dataset on conic sections in Chinese senior high school education. Our dataset contains various problems with different reasoning depths, while only the knowledge from conic sections is required. Since the dataset only involves a narrow range of knowledge, it is easy to separately analyse the knowledge a model possesses and the reasoning ability it has. For each problem, we provide a high-quality formal representation, the reasoning steps, and the final solution. Experiments show that existing large language models, including GPT-4, exhibit weak performance on complex reasoning. We hope that our findings could inspire more advanced techniques for precise natural language understanding and reasoning. Our dataset and codes are available at https://github.com/whyNLP/Conic10K.

摘要
<>数学理解和推理是评估人工智能（AI）能力的关键任务。然而，现有的标准benchmark either require only a few steps of reasoning, or only contain a small amount of data in one specific topic, making it difficult to analyze AI's behavior in detail with reference to different problems within a specific topic.在这项工作中，我们提出了Conic10K，一个在中国高中数学教育中使用的困难数学问题集。我们的数据集包含不同的推理深度的问题，仅需要 cone sections 的知识。由于数据集的知识范围很窄，因此可以分开分析模型所拥有的知识和其推理能力。为每个问题，我们提供了高质量的正式表示，推理步骤，以及最终解决方案。实验显示，现有的大语言模型，包括GPT-4，在复杂的推理中表现不佳。我们希望我们的发现可以激励更多的高级技术 для精准自然语言理解和推理。我们的数据集和代码可以在https://github.com/whyNLP/Conic10K中下载。[/INST Here's the translation in Simplified Chinese:数学理解和推理是评估人工智能（AI）能力的关键任务。然而，现有的标准benchmark either require only a few steps of reasoning, or only contain a small amount of data in one specific topic, making it difficult to analyze AI's behavior in detail with reference to different problems within a specific topic.在这项工作中，我们提出了Conic10K，一个在中国高中数学教育中使用的困难数学问题集。我们的数据集包含不同的推理深度的问题，仅需要 cone sections 的知识。由于数据集的知识范围很窄，因此可以分开分析模型所拥有的知识和其推理能力。为每个问题，我们提供了高质量的正式表示，推理步骤，以及最终解决方案。实验显示，现有的大语言模型，包括GPT-4，在复杂的推理中表现不佳。我们希望我们的发现可以激励更多的高级技术 для精准自然语言理解和推理。我们的数据集和代码可以在https://github.com/whyNLP/Conic10K中下载。

2023-11-09

cs.LG

cs.LG - 2023-11-09

An Experimental Design for Anytime-Valid Causal Inference on Multi-Armed Bandits

paper_url: http://arxiv.org/abs/2311.05794
repo_url: None
paper_authors: Biyonka Liang, Iavor Bojinov
for: 这个论文是为了提供一种实现连续推断多臂投掷机(MAB)实验中的平均治疗效果(ATE)的新的实验设计。
methods: 这个论文使用了一种新的混合式适应设计(MAD)，允许在新数据 arrive 时进行连续推断ATE，并且保证了统计有效性和力量。
results: 研究表明，使用MAD可以提高ATE推断的覆盖率和功能，而无需损失 finite-sample 奖励。

Abstract
Typically, multi-armed bandit (MAB) experiments are analyzed at the end of the study and thus require the analyst to specify a fixed sample size in advance. However, in many online learning applications, it is advantageous to continuously produce inference on the average treatment effect (ATE) between arms as new data arrive and determine a data-driven stopping time for the experiment. Existing work on continuous inference for adaptive experiments assumes that the treatment assignment probabilities are bounded away from zero and one, thus excluding nearly all standard bandit algorithms. In this work, we develop the Mixture Adaptive Design (MAD), a new experimental design for multi-armed bandits that enables continuous inference on the ATE with guarantees on statistical validity and power for nearly any bandit algorithm. On a high level, the MAD "mixes" a bandit algorithm of the user's choice with a Bernoulli design through a tuning parameter $\delta_t$, where $\delta_t$ is a deterministic sequence that controls the priority placed on the Bernoulli design as the sample size grows. We show that for $\delta_t = o\left(1/t^{1/4}\right)$, the MAD produces a confidence sequence that is asymptotically valid and guaranteed to shrink around the true ATE. We empirically show that the MAD improves the coverage and power of ATE inference in MAB experiments without significant losses in finite-sample reward.

摘要
Simplified Chinese:通常，多臂炮兵（MAB）实验会在实验结束时进行分析，因此需要分析者在先specify一个固定的样本大小。然而，在许多在线学习应用中，可以在新数据来临时continuously生成对准征效果（ATE）的推断，并在实验结束时确定基于数据的停止时间。现有的连续推断方法 для适应实验假设了对准分布是不是零和一之间，因此排除了大多数标准炮兵算法。在这种工作中，我们开发了 Mixture Adaptive Design（MAD），一种新的实验设计方法，可以在多臂炮兵中实时生成ATE推断，并且保证逻辑有效性和功能力。在高层次上，MAD“混合”了用户选择的炮兵算法和bernoulli设计，通过一个名为$\delta_t$的 deterministic sequence控制 Bernoulli设计在样本大小增长时的优先级。我们证明，对于 $\delta_t = o\left(1/t^{1/4}\right)$，MAD生成的信度序列是 asymptotically 有效的，并且保证会缩小到真实的ATE。我们还 empirically 表明，MAD在 MAB 实验中改善了ATE推断的覆盖率和功能力，而不是在finite-sample reward中带来显著的损失。

Detecting Suspicious Commenter Mob Behaviors on YouTube Using Graph2Vec

paper_url: http://arxiv.org/abs/2311.05791
repo_url: None
paper_authors: Shadi Shajari, Mustafa Alassad, Nitin Agarwal
for: 本研究旨在探讨YouTube上异常评论行为的发展趋势和相似性特征，以便更好地理解这些行为的起源和传播方式。
methods: 本研究采用社会网络分析方法对YouTube频道进行分析，旨在检测这些频道上异常评论行为的存在和相似性特征。
results: 研究发现YouTube上的异常评论行为具有明显的相似性特征，这些特征可能是由同一个或多个人或组织操作所致。这种发现可能有助于理解YouTube上异常评论行为的起源和传播方式，并为其应对和预防提供参考。

Abstract
YouTube, a widely popular online platform, has transformed the dynamics of con-tent consumption and interaction for users worldwide. With its extensive range of content crea-tors and viewers, YouTube serves as a hub for video sharing, entertainment, and information dissemination. However, the exponential growth of users and their active engagement on the platform has raised concerns regarding suspicious commenter behaviors, particularly in the com-ment section. This paper presents a social network analysis-based methodology for detecting suspicious commenter mob-like behaviors among YouTube channels and the similarities therein. The method aims to characterize channels based on the level of such behavior and identify com-mon patterns across them. To evaluate the effectiveness of the proposed model, we conducted an analysis of 20 YouTube channels, consisting of 7,782 videos, 294,199 commenters, and 596,982 comments. These channels were specifically selected for propagating false views about the U.S. Military. The analysis revealed significant similarities among the channels, shedding light on the prevalence of suspicious commenter behavior. By understanding these similarities, we contribute to a better understanding of the dynamics of suspicious behavior on YouTube channels, which can inform strategies for addressing and mitigating such behavior.

摘要

Structured Transforms Across Spaces with Cost-Regularized Optimal Transport

paper_url: http://arxiv.org/abs/2311.05788
repo_url: None
paper_authors: Othmane Sebbouh, Marco Cuturi, Gabriel Peyré
for: 这个论文的目的是匹配来自不同 metric space 的概率分布。
methods: 论文使用了linear optimal transport（OT）问题的实例来实现匹配，其中包括一个基础成本函数来量化概率分布之间的差异。
results: 论文提出了一种使用cost-regularized OT来匹配概率分布，并在不同的Euclidean空间中应用了这种方法。它还提出了一种 enforcing structure in linear transform 的方法，并提供了一种 proximal 算法来实现这种方法。

Abstract
Matching a source to a target probability measure is often solved by instantiating a linear optimal transport (OT) problem, parameterized by a ground cost function that quantifies discrepancy between points. When these measures live in the same metric space, the ground cost often defaults to its distance. When instantiated across two different spaces, however, choosing that cost in the absence of aligned data is a conundrum. As a result, practitioners often resort to solving instead a quadratic Gromow-Wasserstein (GW) problem. We exploit in this work a parallel between GW and cost-regularized OT, the regularized minimization of a linear OT objective parameterized by a ground cost. We use this cost-regularized formulation to match measures across two different Euclidean spaces, where the cost is evaluated between transformed source points and target points. We show that several quadratic OT problems fall in this category, and consider enforcing structure in linear transform (e.g. sparsity), by introducing structure-inducing regularizers. We provide a proximal algorithm to extract such transforms from unaligned data, and demonstrate its applicability to single-cell spatial transcriptomics/multiomics matching tasks.

摘要
匹配源概率度量到目标概率度量经常通过实例化线性最优运输（OT）问题来解决，该问题 Parametrized by 地面成本函数，该函数量化点之间的差异。当这些度量 живу在同一个度量空间时，地面成本通常 defaults to 距离。但当 instantiated across two different 空间时，无法选择地面成本的问题在缺失协调数据时是一个 Conundrum。为此，实践者们通常将 solve 而不是 quadratic Gromow-Wasserstein（GW）问题。我们在这工作中利用了 GW 和 cost-regulated OT 之间的并行关系，其中 regulated 是 linear OT 目标函数中的 parameterized 的ground cost。我们使用这种 cost-regulated 形式来匹配两个不同的欧几丁素空间中的度量，其中 cost 是将源点和目标点转换后评估的。我们证明了 quadratic OT 问题的一部分 fall 在这类ategory，并考虑了在 linear transform 中引入结构（例如简洁），例如 introducing structure-inducing regularizers。我们提供了一种 proximal 算法来EXTRACT 这些转换，并在单元细胞空间表型学/多Omics 匹配任务中应用了其可行性。

Towards stable real-world equation discovery with assessing differentiating quality influence

paper_url: http://arxiv.org/abs/2311.05787
repo_url: None
paper_authors: Mikhail Masliaev, Ilya Markov, Alexander Hvatov
for: 本研究探讨了数据驱动的微分方程发现中不同差分方法的核心作用。
methods: 本研究提出了四种不同的差分方法，包括Savitzky-Golay滤波、spectral differentiation、人工神经网络平滑和 derive variation regularization。
results: 我们对这些方法进行了评估，包括它们在真实问题中的适用性和微分方程发现算法的稳定性。这些研究为实际过程模型的稳定和可靠性提供了有价值的洞察。

Abstract
This paper explores the critical role of differentiation approaches for data-driven differential equation discovery. Accurate derivatives of the input data are essential for reliable algorithmic operation, particularly in real-world scenarios where measurement quality is inevitably compromised. We propose alternatives to the commonly used finite differences-based method, notorious for its instability in the presence of noise, which can exacerbate random errors in the data. Our analysis covers four distinct methods: Savitzky-Golay filtering, spectral differentiation, smoothing based on artificial neural networks, and the regularization of derivative variation. We evaluate these methods in terms of applicability to problems, similar to the real ones, and their ability to ensure the convergence of equation discovery algorithms, providing valuable insights for robust modeling of real-world processes.

摘要

Real-time Control of Electric Autonomous Mobility-on-Demand Systems via Graph Reinforcement Learning

paper_url: http://arxiv.org/abs/2311.05780
repo_url: https://github.com/stanfordasl/graph-rl-for-eamod
paper_authors: Aaryan Singhal, Daniele Gammelli, Justin Luke, Karthik Gopalakrishnan, Dominik Helmreich, Marco Pavone
for: 提高电动自动驾驶服务 fleet 的实时决策效率，包括匹配可用车辆与需求请求、重新分配潜在车辆至高需求区域、和负荷车辆充电以确保 sufficient range。
methods: 采用人工智能学习方法，特别是图网络基于的 reinforcement learning 框架，以提高可扩展性和性能。
results: 使用实际数据从 Сан Francisco 和纽约市，实验结果表明，我们的方法可以达到89%的利润，同时在计算时间方面实现100倍的加速。此外，我们的方法也可以比领域专门的规则更高，增加利润达3倍。此外，我们的学习策略还表现出了零扩展性和服务区域扩展的潜力。

Abstract
Operators of Electric Autonomous Mobility-on-Demand (E-AMoD) fleets need to make several real-time decisions such as matching available cars to ride requests, rebalancing idle cars to areas of high demand, and charging vehicles to ensure sufficient range. While this problem can be posed as a linear program that optimizes flows over a space-charge-time graph, the size of the resulting optimization problem does not allow for real-time implementation in realistic settings. In this work, we present the E-AMoD control problem through the lens of reinforcement learning and propose a graph network-based framework to achieve drastically improved scalability and superior performance over heuristics. Specifically, we adopt a bi-level formulation where we (1) leverage a graph network-based RL agent to specify a desired next state in the space-charge graph, and (2) solve more tractable linear programs to best achieve the desired state while ensuring feasibility. Experiments using real-world data from San Francisco and New York City show that our approach achieves up to 89% of the profits of the theoretically-optimal solution while achieving more than a 100x speedup in computational time. Furthermore, our approach outperforms the best domain-specific heuristics with comparable runtimes, with an increase in profits by up to 3x. Finally, we highlight promising zero-shot transfer capabilities of our learned policy on tasks such as inter-city generalization and service area expansion, thus showing the utility, scalability, and flexibility of our framework.

摘要
运营电动自动化出行（E-AMoD）车队需要在实时情况下做出多个决策，例如匹配可用车辆和需求请求、重新分配空闲车辆到需求高的区域、并对车辆进行充电以确保充足的范围。这个问题可以表示为一个线性程序，但是由于问题的大小，不能在现实情况下实时实施。在这篇文章中，我们通过强化学习来解决E-AMoD控制问题，并提出一个基于图网络的框架，以实现快速的执行和高性能。我们采用了两级形式，其中我们（1）利用图网络基于的强化学习代理来指定下一个状态的愿望在空间充电图上，并（2）解决更加可行的线性程序，以实现愿望状态的最佳实现，并确保可行性。实验使用了旧金山和纽约市的实际数据，显示我们的方法可以达到89%的理论优化解的利润，同时实现了更 чем100倍的计算时间减少。此外，我们的方法还超过了领域专门的最佳办法，增加了利润达3倍。最后，我们强调了我们学习政策的零式转移能力，在不同的任务上如城市间扩展和服务区域扩展等任务上表现出了便利、扩展和灵活性。

Dirichlet Energy Enhancement of Graph Neural Networks by Framelet Augmentation

paper_url: http://arxiv.org/abs/2311.05767
repo_url: None
paper_authors: Jialin Chen, Yuelin Wang, Cristian Bodnar, Rex Ying, Pietro Lio, Yu Guang Wang
for: 本文主要针对 graph neural network 中的 over-smoothing 问题进行解决，提出了一种基于 framelet 系统的 Energy Enhanced Convolution (EEConv) 操作。
methods: 本文使用了 framelet 系统来对 Dirichlet energy 进行分析，并提出了一种 Framelet Augmentation 策略以提高 Dirichlet energy。基于这种策略，本文还提出了一种 Effective and Practical 的 Energy Enhanced Convolution (EEConv) 操作。
results: 本文通过实验表明，使用 EEConv 操作可以提高 graph neural network 的性能，特别是在 heterophilous graphs 上。同时，EEConv 还可以逐渐提高 Dirichlet energy 的值，从而解决 over-smoothing 问题。

Abstract
Graph convolutions have been a pivotal element in learning graph representations. However, recursively aggregating neighboring information with graph convolutions leads to indistinguishable node features in deep layers, which is known as the over-smoothing issue. The performance of graph neural networks decays fast as the number of stacked layers increases, and the Dirichlet energy associated with the graph decreases to zero as well. In this work, we introduce a framelet system into the analysis of Dirichlet energy and take a multi-scale perspective to leverage the Dirichlet energy and alleviate the over-smoothing issue. Specifically, we develop a Framelet Augmentation strategy by adjusting the update rules with positive and negative increments for low-pass and high-passes respectively. Based on that, we design the Energy Enhanced Convolution (EEConv), which is an effective and practical operation that is proved to strictly enhance Dirichlet energy. From a message-passing perspective, EEConv inherits multi-hop aggregation property from the framelet transform and takes into account all hops in the multi-scale representation, which benefits the node classification tasks over heterophilous graphs. Experiments show that deep GNNs with EEConv achieve state-of-the-art performance over various node classification datasets, especially for heterophilous graphs, while also lifting the Dirichlet energy as the network goes deeper.

摘要
“几何对�ERT��ental��缓��缓��缓��缓��

Generative Explanations for Graph Neural Network: Methods and Evaluations

paper_url: http://arxiv.org/abs/2311.05764
repo_url: None
paper_authors: Jialin Chen, Kenza Amara, Junchi Yu, Rex Ying
for: 这篇论文主要针对的是图像预测任务中的图 neural network (GNNs) 的解释性能。
methods: 这篇论文提出了一种基于图生成的 GNNs 解释方法，包括两个优化目标：归因和信息约束。
results: 实验结果显示了不同解释方法的优缺点，包括解释性能、效率和泛化能力。

Abstract
Graph Neural Networks (GNNs) achieve state-of-the-art performance in various graph-related tasks. However, the black-box nature often limits their interpretability and trustworthiness. Numerous explainability methods have been proposed to uncover the decision-making logic of GNNs, by generating underlying explanatory substructures. In this paper, we conduct a comprehensive review of the existing explanation methods for GNNs from the perspective of graph generation. Specifically, we propose a unified optimization objective for generative explanation methods, comprising two sub-objectives: Attribution and Information constraints. We further demonstrate their specific manifestations in various generative model architectures and different explanation scenarios. With the unified objective of the explanation problem, we reveal the shared characteristics and distinctions among current methods, laying the foundation for future methodological advancements. Empirical results demonstrate the advantages and limitations of different explainability approaches in terms of explanation performance, efficiency, and generalizability.

摘要
GRAPH NEURAL NETWORKS (GNNs) achiev state-of-the-art performance in various graph-related tasks. However, the black-box nature often limits their interpretability and trustworthiness. Numerous explainability methods have been proposed to uncover the decision-making logic of GNNs, by generating underlying explanatory substructures. In this paper, we conduct a comprehensive review of the existing explanation methods for GNNs from the perspective of graph generation. Specifically, we propose a unified optimization objective for generative explanation methods, comprising two sub-objectives: Attribution and Information constraints. We further demonstrate their specific manifestations in various generative model architectures and different explanation scenarios. With the unified objective of the explanation problem, we reveal the shared characteristics and distinctions among current methods, laying the foundation for future methodological advancements. Empirical results demonstrate the advantages and limitations of different explainability approaches in terms of explanation performance, efficiency, and generalizability.Note that the translation is in Simplified Chinese, which is the standard form of Chinese used in mainland China and other countries. If you need Traditional Chinese, please let me know and I can provide that as well.

MALCOM-PSGD: Inexact Proximal Stochastic Gradient Descent for Communication-Efficient Decentralized Machine Learning

paper_url: http://arxiv.org/abs/2311.05760
repo_url: None
paper_authors: Andrew Campbell, Hang Liu, Leah Woldemariam, Anna Scaglione
for:* This paper aims to improve the efficiency of decentralized machine learning by addressing the bottleneck of frequent model communication.methods:* The proposed method, MALCOM-PSGD, integrates gradient compression techniques with model sparsification.* Proximal stochastic gradient descent is used to handle non-smoothness resulting from $\ell_1$ regularization.* Vector source coding and dithering-based quantization are used for compressed gradient communication of sparsified models.results:* The proposed method achieves a convergence rate of $\mathcal{O}\left(\ln(t)/\sqrt{t}\right)$ with a diminishing learning rate.* Communication costs are reduced by approximately $75%$ compared to the state-of-the-art method.

Abstract
Recent research indicates that frequent model communication stands as a major bottleneck to the efficiency of decentralized machine learning (ML), particularly for large-scale and over-parameterized neural networks (NNs). In this paper, we introduce MALCOM-PSGD, a new decentralized ML algorithm that strategically integrates gradient compression techniques with model sparsification. MALCOM-PSGD leverages proximal stochastic gradient descent to handle the non-smoothness resulting from the $\ell_1$ regularization in model sparsification. Furthermore, we adapt vector source coding and dithering-based quantization for compressed gradient communication of sparsified models. Our analysis shows that decentralized proximal stochastic gradient descent with compressed communication has a convergence rate of $\mathcal{O}\left(\ln(t)/\sqrt{t}\right)$ assuming a diminishing learning rate and where $t$ denotes the number of iterations. Numerical results verify our theoretical findings and demonstrate that our method reduces communication costs by approximately $75\%$ when compared to the state-of-the-art method.

摘要
近期研究表明，频繁的模型通信成为分布式机器学习（ML）的效率瓶颈，特别是大规模和过参的神经网络（NN）。在这篇论文中，我们介绍了MALCOM-PSGD算法，它利用抑制梯度压缩技术和模型简化来解决这个问题。MALCOM-PSGD使用抑制梯度下降来处理模型简化后的非满射性。此外，我们采用 вектор源编码和抽象化量化来压缩压缩梯度通信。我们的分析表明，分布式 proximal 梯度下降 WITH 压缩通信的收敛速度为 $\mathcal{O}\left(\ln(t)/\sqrt{t}\right)$，其中 $t$ 表示迭代次数。数值结果证明我们的方法可以将通信成本减少约75%，相比之前的状态艺术方法。

Deep Learning Architecture for Network-Efficiency at the Edge

paper_url: http://arxiv.org/abs/2311.05739
repo_url: None
paper_authors: Akrit Mudvari, Antero Vainio, Iason Ofeidis, Sasu Tarkoma, Leandros Tassiulas
for: 这个论文是为了提出一种适用于弱设备的分布式学习方法，以优化深度学习模型的网络使用量和响应时间。
methods: 该方法基于压缩意识的混合分布式学习，通过适应压缩来提高深度学习模型的网络效率和响应时间。
results: 该方法可以提高网络效率 by 4倍，并且可以提高压缩意识混合分布式学习的准确率 by 4%。此外，该方法还可以减少模型训练时间 by up to 6倍，无需影响准确率。

Abstract
The growing number of AI-driven applications in the mobile devices has led to solutions that integrate deep learning models with the available edge-cloud resources; due to multiple benefits such as reduction in on-device energy consumption, improved latency, improved network usage, and certain privacy improvements, split learning, where deep learning models are split away from the mobile device and computed in a distributed manner, has become an extensively explored topic. Combined with compression-aware methods where learning adapts to compression of communicated data, the benefits of this approach have further improved and could serve as an alternative to established approaches like federated learning methods. In this work, we develop an adaptive compression-aware split learning method ('deprune') to improve and train deep learning models so that they are much more network-efficient (use less network resources and are faster), which would make them ideal to deploy in weaker devices with the help of edge-cloud resources. This method is also extended ('prune') to very quickly train deep learning models, through a transfer learning approach, that trades off little accuracy for much more network-efficient inference abilities. We show that the 'deprune' method can reduce network usage by 4x when compared with a split-learning approach (that does not use our method) without loss of accuracy, while also improving accuracy over compression-aware split-learning by 4 percent. Lastly, we show that the 'prune' method can reduce the training time for certain models by up to 6x without affecting the accuracy when compared against a compression-aware split-learning approach.

摘要
“由于移动设备中的人工智能应用程序的增加，导致将深度学习模型与可用的边缘云资源整合，实现了许多优点，如设备内部能源消耗减少、延迟时间改善、网络使用率改善和一定的隐私改善。在这篇研究中，我们开发了适应压缩敏感的分别学习方法（'deprune'），以提高和训练深度学习模型，使其更加网络效率（使用更少的网络资源并更快），这样可以在弱化设备上使用边缘云资源。此外，我们还将这个方法扩展为很快地训练深度学习模型，通过传播学习方法，将准确性和网络效率之间进行了调整。我们发现，使用'deprune'方法可以在比较 Split-learning 方法（不使用我们的方法）时，降低网络使用率由 4 倍，不会影响准确性，同时也提高了压缩敏感 Split-learning 方法的准确性 by 4%。此外，我们发现，使用'prune'方法可以对某些模型进行快速训练，将训练时间从原先的 6 倍缩短至 1/6，不会影响准确性。”

LogShield: A Transformer-based APT Detection System Leveraging Self-Attention

paper_url: http://arxiv.org/abs/2311.05733
repo_url: None
paper_authors: Sihat Afnan, Mushtari Sadia, Shahrear Iqbal, Anindya Iqbal
for: 本研究旨在探讨用 transformer 语言模型探测 APT 攻击的可能性，并提出一个名为 LogShield 的框架，以便利用 transformer 语言模型的自注意力特性来检测 APT 攻击。
methods: 本研究使用了自定义的 embedding 层来有效地捕捉系统踪迹图中事件序列的Context，并将 RoBERTa 模型的参数和训练过程作为基础进行了扩展。
results: 研究结果显示，LogShield 在 DARPA OpTC 和 DARPA TC E3 等两个常见 APT 数据集上的 F1 分数为 98% 和 95%，分别高于 LSTM 模型的 F1 分数（96% 和 94%）。这表明 LogShield 在大数据集上表现出了优异的泛化能力。

Abstract
Cyber attacks are often identified using system and network logs. There have been significant prior works that utilize provenance graphs and ML techniques to detect attacks, specifically advanced persistent threats, which are very difficult to detect. Lately, there have been studies where transformer-based language models are being used to detect various types of attacks from system logs. However, no such attempts have been made in the case of APTs. In addition, existing state-of-the-art techniques that use system provenance graphs, lack a data processing framework generalized across datasets for optimal performance. For mitigating this limitation as well as exploring the effectiveness of transformer-based language models, this paper proposes LogShield, a framework designed to detect APT attack patterns leveraging the power of self-attention in transformers. We incorporate customized embedding layers to effectively capture the context of event sequences derived from provenance graphs. While acknowledging the computational overhead associated with training transformer networks, our framework surpasses existing LSTM and Language models regarding APT detection. We integrated the model parameters and training procedure from the RoBERTa model and conducted extensive experiments on well-known APT datasets (DARPA OpTC and DARPA TC E3). Our framework achieved superior F1 scores of 98% and 95% on the two datasets respectively, surpassing the F1 scores of 96% and 94% obtained by LSTM models. Our findings suggest that LogShield's performance benefits from larger datasets and demonstrates its potential for generalization across diverse domains. These findings contribute to the advancement of APT attack detection methods and underscore the significance of transformer-based architectures in addressing security challenges in computer systems.

摘要
计算机系统中的攻击常常通过系统和网络日志进行识别。有一些前作使用证明图和机器学习技术来探测攻击，特别是高级 persistently threaten (APT) 攻击，这些攻击非常难以探测。最近，有一些研究使用 transformer 基于语言模型来探测不同类型的攻击。然而，尚未有任何尝试使用 transformer 探测 APT 攻击。此外，现有的状态 искусственный智能技术使用系统证明图，缺乏一个通用的数据处理框架，以便优化性能。为了解决这些限制，以及探索 transformer 基于语言模型的效果，本文提出了 LogShield，一个基于 transformer 的框架，用于探测 APT 攻击模式。我们采用自定义的嵌入层，以有效地捕捉系统日志中事件序列的上下文。虽然训练 transformer 网络具有计算开销，但我们的框架在 APT 探测方面超过了 LSTM 和语言模型的性能。我们将 RoBERTa 模型的参数和训练过程作为基础，并对Well-known APT 数据集（DARPA OpTC 和 DARPA TC E3）进行了广泛的实验。我们的框架在两个数据集上取得了 F1 分数的超过 98% 和 95%，比 LSTM 模型的 F1 分数高出 2% 和 1%。我们的发现表明 LogShield 的性能受到更大的数据集和多样化的领域的影响，并且表明 transformer 基于的建筑在计算机系统安全挑战中的作用非常重要。

Neural Network Methods for Radiation Detectors and Imaging

paper_url: http://arxiv.org/abs/2311.05726
repo_url: None
paper_authors: S. Lin, S. Ning, H. Zhu, T. Zhou, C. L. Morris, S. Clayton, M. Cherukara, R. T. Chen, Z. Wang
for: This paper provides an overview of recent advances in image data processing through machine learning and deep neural networks (DNNs) for radiation detectors and imaging hardware.
methods: The paper discusses deep learning-based methods for image processing tasks, including data generation at photon sources, and hardware solutions for deep learning acceleration.
results: The paper highlights the potential of next-generation analog neuromorphic hardware platforms, such as optical neural networks (ONNs), for high parallel, low latency, and low energy computing to boost deep learning acceleration.

Abstract
Recent advances in image data processing through machine learning and especially deep neural networks (DNNs) allow for new optimization and performance-enhancement schemes for radiation detectors and imaging hardware through data-endowed artificial intelligence. We give an overview of data generation at photon sources, deep learning-based methods for image processing tasks, and hardware solutions for deep learning acceleration. Most existing deep learning approaches are trained offline, typically using large amounts of computational resources. However, once trained, DNNs can achieve fast inference speeds and can be deployed to edge devices. A new trend is edge computing with less energy consumption (hundreds of watts or less) and real-time analysis potential. While popularly used for edge computing, electronic-based hardware accelerators ranging from general purpose processors such as central processing units (CPUs) to application-specific integrated circuits (ASICs) are constantly reaching performance limits in latency, energy consumption, and other physical constraints. These limits give rise to next-generation analog neuromorhpic hardware platforms, such as optical neural networks (ONNs), for high parallel, low latency, and low energy computing to boost deep learning acceleration.

摘要
最新的进展在图像数据处理领域，特别是深度神经网络（DNNs），允许 для抗辐射仪器和图像硬件的优化和性能提升策略。我们提供图像生成在光子源的概述，基于深度学习的图像处理任务方法，以及用于深度学习加速的硬件解决方案。大多数现有的深度学习方法都是在线进行训练，通常使用大量计算资源。然而，一旦训练完成，DNNs可以快速完成推理任务，并可以部署到边缘设备。现在的趋势是边缘计算，即使用百分之几的能耗或更少，并且实现实时分析的潜力。虽然流行用于边缘计算的电子基于硬件加速器，从中心处理器（CPUs）到应用特定集成电路（ASICs），不断遇到性能限制，如延迟、能耗和其他物理限制。这些限制导致下一代分析器，如光学神经网络（ONNs），以实现高并行、低延迟、低能耗计算，以加速深度学习。

Verilog-to-PyG – A Framework for Graph Learning and Augmentation on RTL Designs

paper_url: http://arxiv.org/abs/2311.05722
repo_url: None
paper_authors: Yingjie Li, Mingju Liu, Alan Mishchenko, Cunxi Yu
for: 本研究旨在提供一个开源框架，将RTL设计转换为图表表示形式，并与PyTorch Geometric图学平台集成，以便加速进行RTL设计探索。
methods: 本研究使用了一个新的开源框架，名为V2PYG，可以将Verilog设计转换为图表表示形式，并且与OpenROAD开源电子设计自动化工具链集成。此外，研究者还提出了一些新的RTL数据增强方法，可以实现功能相等的设计增强。
results: 研究结果显示，V2PYG框架可以实现高效的RTL设计探索，并且可以与OpenROAD开源电子设计自动化工具链集成。此外，研究者还提供了一些使用案例和详细的脚本示例，以便帮助其他研究者快速入门。

Abstract
The complexity of modern hardware designs necessitates advanced methodologies for optimizing and analyzing modern digital systems. In recent times, machine learning (ML) methodologies have emerged as potent instruments for assessing design quality-of-results at the Register-Transfer Level (RTL) or Boolean level, aiming to expedite design exploration of advanced RTL configurations. In this presentation, we introduce an innovative open-source framework that translates RTL designs into graph representation foundations, which can be seamlessly integrated with the PyTorch Geometric graph learning platform. Furthermore, the Verilog-to-PyG (V2PYG) framework is compatible with the open-source Electronic Design Automation (EDA) toolchain OpenROAD, facilitating the collection of labeled datasets in an utterly open-source manner. Additionally, we will present novel RTL data augmentation methods (incorporated in our framework) that enable functional equivalent design augmentation for the construction of an extensive graph-based RTL design database. Lastly, we will showcase several using cases of V2PYG with detailed scripting examples. V2PYG can be found at \url{https://yu-maryland.github.io/Verilog-to-PyG/}.

摘要
现代嵌入式设计的复杂性需要进一步的优化和分析方法，以确保设计质量。在最近的时间里，机器学习（ML）方法ologies 已经出现为评估设计质量的强大工具，可以快速探索高级RTL配置。本文介绍一个创新的开源框架，可以将RTL设计转换为图表基础，这可以轻松地与PyTorch Geometric图学 платформа集成。此外，V2PYG框架与开源电子设计自动化（EDA）工具链OpenROAD相容，可以方便收集标注数据集。此外，我们还将介绍一些新的RTL数据扩展方法，可以实现功能等价的设计扩展，以建立一个广泛的图表基础RTL设计数据库。最后，我们将展示V2PYG的几个使用例子，并提供详细的脚本示例。V2PYG可以在以下网址找到：https://yu-maryland.github.io/Verilog-to-PyG/。

Efficient Parallelization Layouts for Large-Scale Distributed Model Training

paper_url: http://arxiv.org/abs/2311.05610
repo_url: https://github.com/aleph-alpha/neurips-want-submission-efficient-parallelization-layouts
paper_authors: Johannes Hagemann, Samuel Weinbach, Konstantin Dobler, Maximilian Schall, Gerard de Melo
for: 大规模自然语言模型的高效训练
methods: 并行化多个硬件加速器， invoke compute和memory优化， FlashAttention和序列并行等最新优化
results: 实现了模型训练效率最佳化，特别是对13B模型的模型FLOPs利用率达70.5%。

Abstract
Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding the final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models. We distill this large study into several key recommendations for the most efficient training. For instance, we find that using a micro-batch size of 1 usually enables the most efficient training layouts. Larger micro-batch sizes necessitate activation checkpointing or higher degrees of model parallelism and also lead to larger pipeline bubbles. Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes, most notably a Model FLOPs utilization of 70.5% when training a 13B model.

摘要
大型语言模型的有效培训需要并行运行数百个硬件加速器，并 invoke 多种compute和内存优化。当这些策略结合使用时，它们之间存在复杂的互动，影响最终的培训效率。先前的工作没有访问最新的优化，如FlashAttention或序列并行。在这个工作中，我们进行了大规模的减少研究，总结了可能的培训配置。我们发现，通常使用1个微批大小可以实现最高效的培训布局。大于1个微批大小的时候，需要进行活动检查点或更高的模型并行，也会导致更大的管道弹性。我们最高效的配置可以实现覆盖多种模型大小的状态emo-of-the-art培训效率结果，其中最引人注目的是一个13B模型的FLOPs使用率为70.5%。

Diffusion-Generative Multi-Fidelity Learning for Physical Simulation

paper_url: http://arxiv.org/abs/2311.05606
repo_url: None
paper_authors: Zheng Wang, Shibo Li, Shikai Fang, Shandian Zhe
for: 这个论文旨在探讨多贫洁学习，并提出一种基于数学 diffe过程的多贫洁学习方法，以减少计算成本和增加效率。
methods: 这个方法基于数学 diffe过程，并使用conditional score模型来控制解析出力的生成。这个方法可以实现多种贫洁模型，并可以快速学习和预测多维解析结果。
results: 这个方法在一些典型应用中表现出色，展示了它在多贫洁学习方面的优势，并显示了其具有增强多贫洁模型的能力。

Abstract
Multi-fidelity surrogate learning is important for physical simulation related applications in that it avoids running numerical solvers from scratch, which is known to be costly, and it uses multi-fidelity examples for training and greatly reduces the cost of data collection. Despite the variety of existing methods, they all build a model to map the input parameters outright to the solution output. Inspired by the recent breakthrough in generative models, we take an alternative view and consider the solution output as generated from random noises. We develop a diffusion-generative multi-fidelity (DGMF) learning method based on stochastic differential equations (SDE), where the generation is a continuous denoising process. We propose a conditional score model to control the solution generation by the input parameters and the fidelity. By conditioning on additional inputs (temporal or spacial variables), our model can efficiently learn and predict multi-dimensional solution arrays. Our method naturally unifies discrete and continuous fidelity modeling. The advantage of our method in several typical applications shows a promising new direction for multi-fidelity learning.

摘要

Sorting Out Quantum Monte Carlo

paper_url: http://arxiv.org/abs/2311.05598
repo_url: None
paper_authors: Jack Richter-Powell, Luca Thiede, Alán Asparu-Guzik, David Duvenaud
for: 这个论文的目的是提出一种基于排序的新的抑制层，用于实现量子水平的分子模型中的 fermion 的对称性。
methods: 这个论文使用了一种基于注意力的神经网络后端，并将排序层作为对称化层应用于其中，以实现一种可以达到化学精度的分子模型。
results: 数值研究表明，这种方法可以在first-row atoms 和小分子的ground state中达到化学精度。Note: “sortlet” is a new term introduced in the paper, and it refers to the antisymmetrization layer derived from sorting.

Abstract
Molecular modeling at the quantum level requires choosing a parameterization of the wavefunction that both respects the required particle symmetries, and is scalable to systems of many particles. For the simulation of fermions, valid parameterizations must be antisymmetric with respect to the exchange of particles. Typically, antisymmetry is enforced by leveraging the anti-symmetry of determinants with respect to the exchange of matrix rows, but this involves computing a full determinant each time the wavefunction is evaluated. Instead, we introduce a new antisymmetrization layer derived from sorting, the $\textit{sortlet}$, which scales as $O(N \log N)$ with regards to the number of particles -- in contrast to $O(N^3)$ for the determinant. We show numerically that applying this anti-symmeterization layer on top of an attention based neural-network backbone yields a flexible wavefunction parameterization capable of reaching chemical accuracy when approximating the ground state of first-row atoms and small molecules.

摘要
молекулярное моделирование на квантовом уровне требует выбора параметризации волновой функции, которая обеспечивает необходимые симметрии частиц и масштабируется до систем многих частиц. при симуляции fermions необходимо использовать валидные параметризации, которые антисимметричны с точки зрения обмена частиц. обычно, антисимметрия достигается путём вычисления полного определителя при каждом вычислении волновой функции, что имеет сложность $O(N^3)$ по отношению к количеству частиц. в этом статье мы вводим новый слой антисимметризации, основанный на сортировке, называемый $\textit{sortlet}$, который имеет сложность $O(N \log N)$. мы показываем, что применение этого слоя антисимметризации на основе нейронной сети с применением внимания дает гибкую параметризацию волновой функции, которая может достичь точности химических расчетов при оценке Grund-Зтатных состояний первоначальных атомов и мелких молекул.

A Coefficient Makes SVRG Effective

paper_url: http://arxiv.org/abs/2311.05589
repo_url: https://github.com/davidyyd/alpha-svrg
paper_authors: Yida Yin, Zhiqiu Xu, Zhiyuan Li, Trevor Darrell, Zhuang Liu
for: 优化现实世界中的神经网络
methods: 使用Stochastic Variance Reduced Gradient（SVRG）方法，并 introduce 一个multiplicative coefficient α控制强度，通过线性减速调整
results: 对于 deeper networks，α-SVRG方法可以更好地优化神经网络，常见的训练损失比baseline和标准 SVRG更低，并且在不同的架构和图像分类 datasets中表现良好。

Abstract
Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlights, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our analysis finds that, for deeper networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient $\alpha$ to control the strength and adjust it through a linear decay schedule. We name our method $\alpha$-SVRG. Our results show $\alpha$-SVRG better optimizes neural networks, consistently reducing training loss compared to both baseline and the standard SVRG across various architectures and image classification datasets. We hope our findings encourage further exploration into variance reduction techniques in deep learning. Code is available at https://github.com/davidyyd/alpha-SVRG.

摘要
Stochastic Variance Reduced Gradient（SVRG），引入于Johnson 和 Zhang（2013），是一种理论上吸引人的优化方法。然而，根据Defazio 和 Bottou（2019）的报告，其在深度学习中的效果尚未得到证明。在这项工作中，我们展示了 SVRG 在真实世界的神经网络中的潜力。我们的分析发现，对于更深的网络，SVRG 中的差异减少项的强度应该更小，并随着训练的进行减少。 inspirited 这个想法，我们引入了一个多项式系数 $\alpha$，用于控制差异减少项的强度，并通过线性衰减调整。我们称之为 $\alpha$-SVRG。我们的结果表明，$\alpha$-SVRG 可以更好地优化神经网络，连续地降低训练损失相比于基准和标准 SVRG，在不同的架构和图像分类 dataset 上。我们希望我们的发现能够鼓励更多人对深度学习中的差异减少技术进行更多的探索。代码可以在中找到。

Bayesian Methods for Media Mix Modelling with shape and funnel effects

paper_url: http://arxiv.org/abs/2311.05587
repo_url: None
paper_authors: Javier Marin
for: 这项研究的目的是探索使用Maxwell-Boltzmann方程和Michaelis-Menten模型在广告混合模型（MMM）应用中的潜在用途。
methods: 该研究提议将这些方程 integrate into hierarchical Bayesian模型，以分析consumer behaviors in the context of advertising。
results: 这些方程集 excell in accurately describing random dynamics in complex systems like social interactions and consumer-advertising interactions。

Abstract
In recent years, significant progress in generative AI has highlighted the important role of physics-inspired models that utilize advanced mathematical concepts based on fundamental physics principles to enhance artificial intelligence capabilities. Among these models, those based on diffusion equations have greatly improved image quality. This study aims to explore the potential uses of Maxwell-Boltzmann equation, which forms the basis of the kinetic theory of gases, and the Michaelis-Menten model in Marketing Mix Modelling (MMM) applications. We propose incorporating these equations into Hierarchical Bayesian models to analyse consumer behaviour in the context of advertising. These equation sets excel in accurately describing the random dynamics in complex systems like social interactions and consumer-advertising interactions.

摘要
Translated into Simplified Chinese:近年来，生成AI的进步吸引了广泛关注，推祟于物理概念基于的生成AI模型，这些模型在提高人工智能能力方面发挥了重要作用。 diffusion equation基于的模型在图像质量方面取得了显著进步。本研究计划探讨将Maxwell-Boltzmann方程和Michaelis-Menten模型应用于市场混合模型（MMM）中，以分析广告宣传行为。我们建议将这些方程集成到 hierarchical Bayesian 模型中，以更好地描述消费者行为在广告宣传中的随机动态。

Outlier-Robust Wasserstein DRO

paper_url: http://arxiv.org/abs/2311.05573
repo_url: https://github.com/sbnietert/outlier-robust-wdro
paper_authors: Sloan Nietert, Ziv Goldfeld, Soroosh Shafiee
for: 本研究的目的是提出一种robust optimization方法，可以在数据采集和处理中帮助做出数据驱动的决策，并且能够抗性于不同类型的干扰。
methods: 本研究使用的方法包括Wasserstein DRO（WDRO）和total variation（TV）杂乱，这两种方法可以帮助捕捉不同类型的干扰，并且可以保证模型在干扰的情况下仍然能够表现良好。
results: 本研究的结果表明，提出的outlier-robust WDRO方法可以在不同类型的干扰下表现良好，并且可以避免由干扰所导致的模型训练失败。此外，在特定的丢失函数下，我们可以提出一种有效的隐藏变量方法，以避免模型的过拟合。

Abstract
Distributionally robust optimization (DRO) is an effective approach for data-driven decision-making in the presence of uncertainty. Geometric uncertainty due to sampling or localized perturbations of data points is captured by Wasserstein DRO (WDRO), which seeks to learn a model that performs uniformly well over a Wasserstein ball centered around the observed data distribution. However, WDRO fails to account for non-geometric perturbations such as adversarial outliers, which can greatly distort the Wasserstein distance measurement and impede the learned model. We address this gap by proposing a novel outlier-robust WDRO framework for decision-making under both geometric (Wasserstein) perturbations and non-geometric (total variation (TV)) contamination that allows an $\varepsilon$-fraction of data to be arbitrarily corrupted. We design an uncertainty set using a certain robust Wasserstein ball that accounts for both perturbation types and derive minimax optimal excess risk bounds for this procedure that explicitly capture the Wasserstein and TV risks. We prove a strong duality result that enables tractable convex reformulations and efficient computation of our outlier-robust WDRO problem. When the loss function depends only on low-dimensional features of the data, we eliminate certain dimension dependencies from the risk bounds that are unavoidable in the general setting. Finally, we present experiments validating our theory on standard regression and classification tasks.

摘要
Distributionally robust optimization (DRO) 是一种有效的方法 для数据驱动决策中存在不确定性。 Waterstein DRO (WDRO) 捕捉了采样或本地异常点的几何不确定性，并寻找一个能够在 Wasserstein 球中心的扩展中进行一致性良好的模型。然而，WDRO 不能考虑非几何异常点，如恶意异常点，这些异常点可以很大程度地扭曲 Wasserstein 距离测量，阻碍学习的模型。我们解决这个漏洞，提出一种新的异常点Robust WDRO 框架，用于在几何（Wasserstein）异常点和非几何（总变量（TV））污染下进行决策。我们使用一个 $\varepsilon$-负法分数的数据进行arbitrarily corrupted，并设置了一个不确定集，该集使用一个Robust Wasserstein ball来考虑这两种异常点类型。我们计算出最优的副作用误差上限，并证明了一种强 dual 结果，允许我们通过可观察的对偶问题来降低问题的计算复杂度。当携带特征值低于数据的特征值时，我们可以从风险上下文中消除一些不可避免的维度依赖，从而更好地降低风险上下文中的风险。最后，我们在标准回归和分类任务上进行了实验验证。

Exploiting Neural-Network Statistics for Low-Power DNN Inference

paper_url: http://arxiv.org/abs/2311.05557
repo_url: None
paper_authors: Lennart Bamberg, Ardalan Najafi, Alberto Garcia-Ortiz
for: 这篇论文是为了提高边缘人工智能推断引擎的能效性和能量效率而写的。
methods: 本文使用了无页面程式码和神经网络数据和参数的统计分析，以降低互connect和内存的能力消耗。
results: 本文的方法可以对现有的benchmark测试项目中的边缘人工智能推断引擎进行低功耗化，并且可以获得更多的能量节省和Compute对象的额外节省。

Abstract
Specialized compute blocks have been developed for efficient DNN execution. However, due to the vast amount of data and parameter movements, the interconnects and on-chip memories form another bottleneck, impairing power and performance. This work addresses this bottleneck by contributing a low-power technique for edge-AI inference engines that combines overhead-free coding with a statistical analysis of the data and parameters of neural networks. Our approach reduces the interconnect and memory power consumption by up to 80% for state-of-the-art benchmarks while providing additional power savings for the compute blocks by up to 39%. These power improvements are achieved with no loss of accuracy and negligible hardware cost.

摘要
特殊计算块已经为深度学习模型的高效执行开发出来。然而，由于巨量数据和参数的移动，总线和内存又成为了另一个瓶颈，影响能效性。这个工作解决这个瓶颈，通过对神经网络数据和参数的统计分析，实现无损负荷的编程技术。我们的方法可以在现状顶峰的测试集上减少总线和内存的电力消耗，最高减少80%，同时对计算块的电力消耗减少39%。这些电力改善不产生精度损失和很少增加硬件成本。

Information-theoretic generalization bounds for learning from quantum data

paper_url: http://arxiv.org/abs/2311.05529
repo_url: None
paper_authors: Matthias Caro, Tom Gur, Cambyse Rouzé, Daniel Stilck França, Sathyawageeswar Subramanian
for: 这篇论文旨在提供一种描述量子学习的通用数学形式，以及用这种形式证明量子学习者的泛化误差预期值。
methods: 本文使用量子优化运输和量子吸引准则来建立非 коммутатив版本的解 Coupling 证明，以证明量子学习者的泛化误差预期值。
results: 本文提出了一种总结量子学习场景中的各种泛化误差预期值的框架，包括量子状态识别、量子可能approxcorrect 学习、量子参数估计和量子可能approxcorrect 学习类型的函数。此外，本文还证明了这些场景中泛化误差预期值的下界。

Abstract
Learning tasks play an increasingly prominent role in quantum information and computation. They range from fundamental problems such as state discrimination and metrology over the framework of quantum probably approximately correct (PAC) learning, to the recently proposed shadow variants of state tomography. However, the many directions of quantum learning theory have so far evolved separately. We propose a general mathematical formalism for describing quantum learning by training on classical-quantum data and then testing how well the learned hypothesis generalizes to new data. In this framework, we prove bounds on the expected generalization error of a quantum learner in terms of classical and quantum information-theoretic quantities measuring how strongly the learner's hypothesis depends on the specific data seen during training. To achieve this, we use tools from quantum optimal transport and quantum concentration inequalities to establish non-commutative versions of decoupling lemmas that underlie recent information-theoretic generalization bounds for classical machine learning. Our framework encompasses and gives intuitively accessible generalization bounds for a variety of quantum learning scenarios such as quantum state discrimination, PAC learning quantum states, quantum parameter estimation, and quantumly PAC learning classical functions. Thereby, our work lays a foundation for a unifying quantum information-theoretic perspective on quantum learning.

摘要
学习任务在量子信息和计算中越来越占据着重要地位。它们从基本问题如状态识别和测量到量子可能错误（PAC）学习框架，以及最近提出的陌生变体的状态探测。然而，量子学习理论的多个方向至今仍然分化不受控。我们提议一种通用的数学形式语言，用于描述在类型-量子数据上进行学习，然后测试学习得到的假设是否能够在新数据上正确预测。在这种框架中，我们证明了在类型-量子信息论量上的预测错误的预期值，并且与类型-量子信息论量相关的量子学习场景进行直观地关联。为了实现这一目标，我们使用量子最优运输和量子吸引不等式来建立量子不 commutative 的解 coupling 公式，这些公式在类型-量子信息论量上提供了最新的信息论学习约束。我们的框架包括了多种量子学习场景，如量子状态识别、PAC学习量子状态、量子参数估计和量子 PAC 学习类型函数。因此，我们的工作为量子信息论量上的量子学习提供了一个统一的视角。

Dirichlet Active Learning

paper_url: http://arxiv.org/abs/2311.05501
repo_url: None
paper_authors: Kevin Miller, Ryan Murray
for: 这篇论文旨在探讨 Dirichlet Active Learning（DiAL），一种 bayesian-inspired 的活动学习框架。
methods: 本框架使用 Dirichlet 随机场来模型对应的特征 conditional class probabilities，并将类似特征之间的观察强度传递给这个随机场，以便在学习任务中使用。
results: 本论文透过建立基于几何 Laplacian 的传播算法，证明了 DiAL 在具有少量标签数据的情况下能够与现有的州际标准竞争。此外，论文还提供了一些严格的保证，表明 DiAL 能够同时确保探索和优化。

Abstract
This work introduces Dirichlet Active Learning (DiAL), a Bayesian-inspired approach to the design of active learning algorithms. Our framework models feature-conditional class probabilities as a Dirichlet random field and lends observational strength between similar features in order to calibrate the random field. This random field can then be utilized in learning tasks: in particular, we can use current estimates of mean and variance to conduct classification and active learning in the context where labeled data is scarce. We demonstrate the applicability of this model to low-label rate graph learning by constructing ``propagation operators'' based upon the graph Laplacian, and offer computational studies demonstrating the method's competitiveness with the state of the art. Finally, we provide rigorous guarantees regarding the ability of this approach to ensure both exploration and exploitation, expressed respectively in terms of cluster exploration and increased attention to decision boundaries.

摘要

Disease Gene Prioritization With Quantum Walks

paper_url: http://arxiv.org/abs/2311.05486
repo_url: None
paper_authors: Harto Saarinen, Mark Goldsmith, Rui-Sheng Wang, Joseph Loscalzo, Sabrina Maniscalco
for: 这个论文是为了提出一种新的疾病基因优先级分配算法，该算法基于蛋白质-蛋白质交互（PPI）网络的连接矩阵，并使用 kontinuous-time quantum walk 技术。
methods: 该算法使用 kontinuous-time quantum walk 技术，并在 Hamiltonian 中编码种子节点自环。
results: 对三种疾病集合和七个 PPI 网络进行比较，该算法的性能比较高，并且可以进行cross-validation和检验reciprocal ranks和recall值。此外，通过扩展分析，该算法可以预测更多的疾病基因。

Abstract
Disease gene prioritization assigns scores to genes or proteins according to their likely relevance for a given disease based on a provided set of seed genes. Here, we describe a new algorithm for disease gene prioritization based on continuous-time quantum walks using the adjacency matrix of a protein-protein interaction (PPI) network. Our algorithm can be seen as a quantum version of a previous method known as the diffusion kernel, but, importantly, has higher performance in predicting disease genes, and also permits the encoding of seed node self-loops into the underlying Hamiltonian, which offers yet another boost in performance. We demonstrate the success of our proposed method by comparing it to several well-known gene prioritization methods on three disease sets, across seven different PPI networks. In order to compare these methods, we use cross-validation and examine the mean reciprocal ranks and recall values. We further validate our method by performing an enrichment analysis of the predicted genes for coronary artery disease. We also investigate the impact of adding self-loops to the seeds, and argue that they allow the quantum walker to remain more local to low-degree seed nodes.

摘要
疾病基因优先级分配分数到基因或蛋白质根据它们对某种疾病的可能性，以提高疾病基因搜索的效率。我们描述了一种基于连续时间量子游走的新算法，使用蛋白质-蛋白质相互作用（PPI）网络的邻接矩阵。我们的算法可以看作是量子版增量阶段（diffusion kernel）的改进版本，但是具有更高的疾病基因预测性能，并且允许编码种子节点自 Loop into the underlying Hamiltonian，又提供了另一个性能提升。我们通过对三个疾病集和七个不同的 PPI 网络进行比较，使用十字验证和评估 mean reciprocal ranks 和 recall 值来评估这些方法的性能。我们进一步验证了我们的方法，通过对预测的基因进行折衔分析，证明了我们的方法的正确性。此外，我们还investigated the impact of adding self-loops to the seeds, and argued that they allow the quantum walker to remain more local to low-degree seed nodes.

Do Ensembling and Meta-Learning Improve Outlier Detection in Randomized Controlled Trials?

paper_url: http://arxiv.org/abs/2311.05473
repo_url: https://github.com/hamilton-health-sciences/ml4h-traq
paper_authors: Walter Nelson, Jonathan Ranisau, Jeremy Petch
for: 这个论文主要是为了研究现代多中心随机化控制试验中的异常点检测方法。
methods: 这篇论文使用了6种现代机器学习基于算法来检测异常数据，并对738个实际数据集和77,001名患者从44个国家进行了 Empirical 评估。
results: 研究结果表明，现有的算法可以在没有监督的情况下检测异常数据，但是每个数据集的性能差异很大，无法确定单一的最佳算法。因此，研究人员提出了一种简单的Meta-learned Probabilistic Ensemble（MePE）算法来聚合多个无监督模型的预测结果，并证明其在比较 meta-learning 方法时表现良好。

Abstract
Modern multi-centre randomized controlled trials (MCRCTs) collect massive amounts of tabular data, and are monitored intensively for irregularities by humans. We began by empirically evaluating 6 modern machine learning-based outlier detection algorithms on the task of identifying irregular data in 838 datasets from 7 real-world MCRCTs with a total of 77,001 patients from over 44 countries. Our results reinforce key findings from prior work in the outlier detection literature on data from other domains. Existing algorithms often succeed at identifying irregularities without any supervision, with at least one algorithm exhibiting positive performance 70.6% of the time. However, performance across datasets varies substantially with no single algorithm performing consistently well, motivating new techniques for unsupervised model selection or other means of aggregating potentially discordant predictions from multiple candidate models. We propose the Meta-learned Probabilistic Ensemble (MePE), a simple algorithm for aggregating the predictions of multiple unsupervised models, and show that it performs favourably compared to recent meta-learning approaches for outlier detection model selection. While meta-learning shows promise, small ensembles outperform all forms of meta-learning on average, a negative result that may guide the application of current outlier detection approaches in healthcare and other real-world domains.

摘要
现代多中心随机控制试验 (MCRCT) 收集庞大量的表格数据，并且由人类严格监测以寻找异常情况。我们开始由对 6 种现代机器学习基于算法进行实验，以确定这些算法在738个数据集中的异常检测任务中的表现。我们的结果证明了先前的研究中关于其他领域的数据的发现结果。现有的算法经常可以无监测情况下检测到异常情况，至少有一个算法在70.6%的时间 exhibit 良好的表现。然而，数据集之间的表现差异很大，没有任何单一的算法能够在所有数据集中表现良好，这引发了新的技术来选择不supervised 模型或其他方式将可能不一致的预测集成为一个整体。我们提议使用 Meta-learned Probabilistic Ensemble (MePE)，一种简单的算法来集成多个不supervised 模型的预测，并证明它与最近的meta-learning方法相比表现良好。虽然meta-learning表示潜力，但小集成在平均上超过所有形式的 meta-learning，这是一个负面的结果，可能导向现有的异常检测方法在医疗和其他实际领域的应用中的限制。

A Practical Approach to Novel Class Discovery in Tabular Data

paper_url: http://arxiv.org/abs/2311.05440
repo_url: None
paper_authors: Colin Troisemaine, Alexandre Reiffers-Masson, Stéphane Gosselin, Vincent Lemaire, Sandrine Vaton
for: 本文解决了无知类探索（NCD）问题，即从已知类集中提取知识，将未知类分类为新的类别。
methods: 本文提出了一种基于Tabular数据的NCD方法，不需要先知道新类的数量。方法包括通过修改$k$-fold批处理过程来调整NCD方法的超参数，以避免过拟合隐藏的已知类。此外，本文还提出了一种简单的深度NCD模型，该模型只包含NCD问题所需的基本元素，并在实际情况下表现出色。
results: 实验表明，提出的方法和超参数调整过程可以在7个Tabular数据集上解决NCD问题，而不需要先知道新类的数量。此外，本文还发现了使用已知类知识的两种无监督聚类算法（$k$-means和特征分布分 clustering）可以有效地利用已知类知识来解决NCD问题。

Abstract
The problem of Novel Class Discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the $k$-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the essential elements necessary for the NCD problem and performs impressively well under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms ($k$-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments are conducted on 7 tabular datasets and demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.

摘要
《短文》 noval 类发现（NCD）问题的解决方法是从已知类别的标注集中提取知识，以便准确分类未知类别。 although NCD 在计算机视觉问题上得到了社区的广泛关注，但是它们通常在不实际的情况下解决，即已知类别的数量在先知道。此外，一些方法会使用这些假设来调整超参数。这些方法在实际场景下无法应用。在本工作中，我们将解决 NCD 问题在表格数据上，而不需要已知类别的先知道。为此，我们提议通过适应 $k $-fold 跨Validation 过程和隐藏已知类别来调整 NCD 方法的超参数。由于我们发现了过多的超参数可能会在隐藏的类别上过拟合，所以我们定义了一个简单的深度 NCD 模型。这种方法由 NCD 问题所需的基本元素组成，并在实际条件下表现出色。此外，我们适应了两种无监督分群算法（$k $-means 和 Spectral Clustering），以便利用已知类别的知识。我们对 7 个表格数据集进行了广泛的实验，并证明了我们提出的方法和超参数调整过程的效iveness。 results 表明，NCD 问题可以在不依赖未知类别的知识的情况下解决。

Fair Wasserstein Coresets

paper_url: http://arxiv.org/abs/2311.05436
repo_url: None
paper_authors: Zikai Xiong, Niccolò Dalmasso, Vamsi K. Potluru, Tucker Balch, Manuela Veloso
for: This paper is written for those who are interested in developing fair and representative machine learning models, particularly in the context of decision-making processes.
methods: The paper proposes a novel approach called Fair Wasserstein Coresets (FWC), which generates fair synthetic representative samples and sample-level weights for downstream learning tasks. FWC minimizes the Wasserstein distance between the original datasets and the weighted synthetic samples while enforcing demographic parity.
results: The paper shows that FWC can be thought of as a constrained version of Lloyd’s algorithm for k-medians or k-means clustering, and demonstrates the scalability of the approach through experiments conducted on both synthetic and real datasets. The results also highlight the competitive performance of FWC compared to existing fair clustering approaches, even when attempting to enhance the fairness of the latter through fair pre-processing techniques.

Abstract
Recent technological advancements have given rise to the ability of collecting vast amounts of data, that often exceed the capacity of commonly used machine learning algorithms. Approaches such as coresets and synthetic data distillation have emerged as frameworks to generate a smaller, yet representative, set of samples for downstream training. As machine learning is increasingly applied to decision-making processes, it becomes imperative for modelers to consider and address biases in the data concerning subgroups defined by factors like race, gender, or other sensitive attributes. Current approaches focus on creating fair synthetic representative samples by optimizing local properties relative to the original samples. These methods, however, are not guaranteed to positively affect the performance or fairness of downstream learning processes. In this work, we present Fair Wasserstein Coresets (FWC), a novel coreset approach which generates fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks. FWC aims to minimize the Wasserstein distance between the original datasets and the weighted synthetic samples while enforcing (an empirical version of) demographic parity, a prominent criterion for algorithmic fairness, via a linear constraint. We show that FWC can be thought of as a constrained version of Lloyd's algorithm for k-medians or k-means clustering. Our experiments, conducted on both synthetic and real datasets, demonstrate the scalability of our approach and highlight the competitive performance of FWC compared to existing fair clustering approaches, even when attempting to enhance the fairness of the latter through fair pre-processing techniques.

摘要
最近的技术进步使得收集大量数据变得可能，这些数据经常超出常用的机器学习算法的处理能力。有些方法，如核心集和数据简化 Synthetic data distillation，被提出来生成一组更小、 yet representative 的样本，用于下游训练。由于机器学习在决策过程中越来越常用，因此模型者需要考虑和处理数据中的偏见，特别是根据因素如种族、性别或其他敏感属性来定义的子群。现有的方法是创建公平的 Synthetic representative samples，并且优化当地的属性来适应原始样本。然而，这些方法并不能保证下游学习过程中的表现和公平性。在这种情况下，我们提出了公平 Wasserstein coresets (FWC)，一种新的核心集方法，可以生成公平的 Synthetic representative samples，同时生成样本级别的权重，用于下游学习任务。FWC 的目标是将原始数据集和权重 Synthetic samples 的 Wasserstein 距离降低到最小，同时通过 Linear 约束来保证（empirical version的）人口均衡，一种重要的算法公平性标准。我们表明，FWC 可以看作是 Lloyd 算法的受限版本，用于 k-medians 或 k-means 聚类。我们的实验结果表明，FWC 可以扩展到大规模数据集，并且在比较公平的情况下，FWC 的性能与现有的公平聚类方法相当，甚至在使用公平预处理技术时，FWC 的性能仍然占据竞争优势。

Parkinson’s Disease Detection through Vocal Biomarkers and Advanced Machine Learning Algorithms: A Comprehensive Study

paper_url: http://arxiv.org/abs/2311.05435
repo_url: None
paper_authors: Md Abu Sayed, Sabbir Ahamed, Duc M Cao, Md Eyasin Ul Islam Pavel, Malay Sarkar, Md Tuhin Mia
for: 预测公inson病的发病风险
methods: 使用高级机器学习算法，包括XGBoost、LightGBM、Bagging、AdaBoost和支持向量机制等，评估这些模型在预测公inson病的表现
results: 研究发现LightGBM模型的准确率为96%，AUC为96%，敏感性为100%，特异性为94.43%，其他机器学习算法的准确率和AUC分别落后于LightGBM模型。

Abstract
Parkinson's disease (PD) is a prevalent neurodegenerative disorder known for its impact on motor neurons, causing symptoms like tremors, stiffness, and gait difficulties. This study explores the potential of vocal feature alterations in PD patients as a means of early disease prediction. This research aims to predict the onset of Parkinson's disease. Utilizing a variety of advanced machine-learning algorithms, including XGBoost, LightGBM, Bagging, AdaBoost, and Support Vector Machine, among others, the study evaluates the predictive performance of these models using metrics such as accuracy, area under the curve (AUC), sensitivity, and specificity. The findings of this comprehensive analysis highlight LightGBM as the most effective model, achieving an impressive accuracy rate of 96%, alongside a matching AUC of 96%. LightGBM exhibited a remarkable sensitivity of 100% and specificity of 94.43%, surpassing other machine learning algorithms in accuracy and AUC scores. Given the complexities of Parkinson's disease and its challenges in early diagnosis, this study underscores the significance of leveraging vocal biomarkers coupled with advanced machine-learning techniques for precise and timely PD detection.

摘要
帕金森病 (PD) 是一种常见的神经退化疾病，对运动神经元造成了许多问题，导致了抽搐、硬度和步态困难等症状。这项研究探讨了PD患者的声音特征变化可能作为早期病情预测的可能性。这项研究的目标是预测帕金森病的病起点。通过使用多种先进的机器学习算法，包括XGBoost、LightGBM、Bagging、AdaBoost和支持向量机等，这项研究评估了这些模型在精度、AUC、敏感度和特征等方面的预测性能。研究结果显示LightGBM模型在精度和AUC方面表现出色，达到了96%的准确率，并与其他机器学习算法相比，敏感度和特征都达到了100%和94.43%的高水平。考虑到帕金森病的复杂性和早期诊断的挑战，这项研究强调了在使用声音生物标志和先进机器学习技术之前，可以为早期和准确的PD患者诊断提供有利的条件。

Taxonomy for Resident Space Objects in LEO: A Deep Learning Approach

paper_url: http://arxiv.org/abs/2311.05430
repo_url: None
paper_authors: Marta Guimarães, Cláudia Soares, Chiara Manfletti
for:本研究旨在提高低地球轨道频谱中各种垃圾Space Objects（RSOs）的管理，为直接和间接使用空间的所有用户减少风险。methods:本研究提出了一个新的分类法，使得RSOs可以根据其主要特征分类，从而提高空间质量管理。此外，本研究还使用深度学习模型，通过自动编码器架构减少RSOs特征表示的维度，并使用不同技术如均匀投影等来探索RSOs的基本集群。results:本研究的结果表明，使用提出的分类法和深度学习模型可以更好地理解RSOs的行为特征，并提高空间质量管理的效率和效果。

Abstract
The increasing number of RSOs has raised concerns about the risk of collisions and catastrophic incidents for all direct and indirect users of space. To mitigate this issue, it is essential to have a good understanding of the various RSOs in orbit and their behaviour. A well-established taxonomy defining several classes of RSOs is a critical step in achieving this understanding. This taxonomy helps assign objects to specific categories based on their main characteristics, leading to better tracking services. Furthermore, a well-established taxonomy can facilitate research and analysis processes by providing a common language and framework for better understanding the factors that influence RSO behaviour in space. These factors, in turn, help design more efficient and effective strategies for space traffic management. Our work proposes a new taxonomy for RSOs focusing on the low Earth orbit regime to enhance space traffic management. In addition, we present a deep learning-based model that uses an autoencoder architecture to reduce the features representing the characteristics of the RSOs. The autoencoder generates a lower-dimensional space representation that is then explored using techniques such as Uniform Manifold Approximation and Projection to identify fundamental clusters of RSOs based on their unique characteristics. This approach captures the complex and non-linear relationships between the features and the RSOs' classes identified. Our proposed taxonomy and model offer a significant contribution to the ongoing efforts to mitigate the overall risks posed by the increasing number of RSOs in orbit.

摘要
随着各种人造卫星的数量的增加，引发了关于碰撞和灾难性事件的风险的担忧。为了解决这个问题，需要对各种人造卫星的运行和行为进行深入的了解。我们提出了一种新的分类法，将人造卫星分为不同类别，以便更好地跟踪和管理它们。此外，我们还提出了一种基于深度学习的模型，使用自适应网络架构，将人造卫星的特征特征缩减到更低的维度上。这种方法可以捕捉人造卫星的复杂和非线性关系，并且可以通过不同的技术，如射影映射，来发现人造卫星的基本群集。这种方法对于提高空间交通管理具有重要意义。

Statistical Learning of Conjunction Data Messages Through a Bayesian Non-Homogeneous Poisson Process

paper_url: http://arxiv.org/abs/2311.05426
repo_url: None
paper_authors: Marta Guimarães, Cláudia Soares, Chiara Manfletti
for: 本研究旨在提高现有的冲突避免和空间交通管理方法，以适应随着卫星数量的不断增加而增加的挑战。
methods: 本研究使用 Bayesian 非homogeneous Poisson process 模型，通过高精度的 Probabilistic Programming Language 实现，以充分描述卫星 conjunction 的下发现现象。
results: 比较基准模型，研究结果显示 bayesian 非homogeneous Poisson process 模型可以更加准确地模拟卫星 conjunction 的下发现现象，帮助操作人员在时间上作出合适的冲突避免操作，但不需要过度的措施。

Abstract
Current approaches for collision avoidance and space traffic management face many challenges, mainly due to the continuous increase in the number of objects in orbit and the lack of scalable and automated solutions. To avoid catastrophic incidents, satellite owners/operators must be aware of their assets' collision risk to decide whether a collision avoidance manoeuvre needs to be performed. This process is typically executed through the use of warnings issued in the form of CDMs which contain information about the event, such as the expected TCA and the probability of collision. Our previous work presented a statistical learning model that allowed us to answer two important questions: (1) Will any new conjunctions be issued in the next specified time interval? (2) When and with what uncertainty will the next CDM arrive? However, the model was based on an empirical Bayes homogeneous Poisson process, which assumes that the arrival rates of CDMs are constant over time. In fact, the rate at which the CDMs are issued depends on the behaviour of the objects as well as on the screening process performed by third parties. Thus, in this work, we extend the previous study and propose a Bayesian non-homogeneous Poisson process implemented with high precision using a Probabilistic Programming Language to fully describe the underlying phenomena. We compare the proposed solution with a baseline model to demonstrate the added value of our approach. The results show that this problem can be successfully modelled by our Bayesian non-homogeneous Poisson Process with greater accuracy, contributing to the development of automated collision avoidance systems and helping operators react timely but sparingly with satellite manoeuvres.

摘要
当前的冲突避免和空间交通管理技术面临着许多挑战，主要是因为遥感器的数量不断增加，以及没有可扩展和自动化的解决方案。为了避免catastrophic incidents，卫星所有者/运营商必须了解它们的资产冲突风险，并决定是否需要执行冲突避免操作。这个过程通常通过使用CDM（Conjunction Data Message）发送 warnings，其中包含事件信息，如预计的TCA（Time of Close Approach）和冲突的概率。我们之前的研究提出了一种统计学学习模型，可以回答以下两个重要问题：（1）将来的 conjunctions 是否在指定时间间隔内发生？（2）预计接下来的 CDM 会在什么时间 arrive，以及具有多少不确定性？但是，这个模型基于empirical Bayes Homogeneous Poisson Process，即CDMs 的发送速率是时间不变的。实际上，CDMs 的发送速率取决于遥感器的行为以及第三方屏选过程。因此，在这种工作中，我们延续前一个研究，并提出了一种 Bayesian non-homogeneous Poisson Process，使用高精度的 probabilistic programming language 来完全描述下面现象。我们与基准模型进行比较，以示出我们的方法的优势。结果显示，我们的方法可以更加准确地模型这个问题，从而为自动化冲突避免系统的发展和操作人员在时间和不确定性方面做出更好的决策。

Diffusion Based Causal Representation Learning

paper_url: http://arxiv.org/abs/2311.05421
repo_url: None
paper_authors: Amir Mohammad Karimi Mamaghan, Andrea Dittadi, Stefan Bauer, Karl Henrik Johansson, Francesco Quinzan
for: 本研究旨在提出一种新的Diffusion-based Causal Representation Learning（DCRL）算法，用于 causal representation learning。
methods: 该算法使用Diffusion-based表示方法进行 causal discovery，可以获取不同级别的信息。
results: 实验表明，DCRL方法可以和传统的Variational Auto-Encoder（VAE）方法相比， equally well in identifying causal structure and causal variables。

Abstract
Causal reasoning can be considered a cornerstone of intelligent systems. Having access to an underlying causal graph comes with the promise of cause-effect estimation and the identification of efficient and safe interventions. However, learning causal representations remains a major challenge, due to the complexity of many real-world systems. Previous works on causal representation learning have mostly focused on Variational Auto-Encoders (VAE). These methods only provide representations from a point estimate, and they are unsuitable to handle high dimensions. To overcome these problems, we proposed a new Diffusion-based Causal Representation Learning (DCRL) algorithm. This algorithm uses diffusion-based representations for causal discovery. DCRL offers access to infinite dimensional latent codes, which encode different levels of information in the latent code. In a first proof of principle, we investigate the use of DCRL for causal representation learning. We further demonstrate experimentally that this approach performs comparably well in identifying the causal structure and causal variables.

摘要
causal reasoning 可以看作智能系统的基础之一。具有下面的 causal 图来 promise of cause-effect estimation 和 identification of efficient and safe interventions。然而，学习 causal 表示仍然是一个主要挑战，因为许多实际世界系统的复杂性。先前的 causal 表示学习方法主要集中在 Variational Auto-Encoders (VAE)。这些方法只提供点估计的表示，不适用于高维度。为了解决这些问题，我们提出了一种新的 Diffusion-based Causal Representation Learning (DCRL) 算法。这个算法使用扩散基于的表示来进行 causal 发现。DCRL 提供访问无穷维的潜在码，这些潜在码编码不同级别的信息。在一个首次证明的原则中，我们研究了 DCRL 的使用，并在实验中证明了这种方法可以比较好地确定 causal 结构和 causal 变量。

Counterfactually Fair Representation

paper_url: http://arxiv.org/abs/2311.05420
repo_url: https://github.com/osu-srml/cf_representation_learning
paper_authors: Zhiqun Zuo, Mohammad Mahdi Khalili, Xueru Zhang
for: 本研究旨在提出一种新的算法，用于在高风险应用中使用机器学习模型，以避免保护性社会组别的偏见。
methods: 本研究使用了Counterfactual Fairness（CF）的公平性观，该观念基于一个下游 causal graph，并首先由Kusner等人提出（Reference [1])。学习满足CF的公平模型可能困难。在Reference [1]中，证明了不使用敏感特征的后代feature可以满足CF。然而，后续的一些工作提出了使用所有特征进行训练CF模型的方法，但没有理论保证它们可以满足CF。本研究则提出了一种新的算法，使用所有可用的特征进行训练CF模型，并经过理论和实验验证，表明这种方法可以满足CF。
results: 本研究通过理论和实验验证，证明了使用新的算法可以在高风险应用中使用机器学习模型，以避免保护性社会组别的偏见。 CodeRepository可以在https://github.com/osu-srml/CF_Representation_Learning中找到。

Abstract
The use of machine learning models in high-stake applications (e.g., healthcare, lending, college admission) has raised growing concerns due to potential biases against protected social groups. Various fairness notions and methods have been proposed to mitigate such biases. In this work, we focus on Counterfactual Fairness (CF), a fairness notion that is dependent on an underlying causal graph and first proposed by Kusner \textit{et al.}~\cite{kusner2017counterfactual}; it requires that the outcome an individual perceives is the same in the real world as it would be in a "counterfactual" world, in which the individual belongs to another social group. Learning fair models satisfying CF can be challenging. It was shown in \cite{kusner2017counterfactual} that a sufficient condition for satisfying CF is to \textbf{not} use features that are descendants of sensitive attributes in the causal graph. This implies a simple method that learns CF models only using non-descendants of sensitive attributes while eliminating all descendants. Although several subsequent works proposed methods that use all features for training CF models, there is no theoretical guarantee that they can satisfy CF. In contrast, this work proposes a new algorithm that trains models using all the available features. We theoretically and empirically show that models trained with this method can satisfy CF\footnote{The code repository for this work can be found in \url{https://github.com/osu-srml/CF_Representation_Learning}.

摘要
使用机器学习模型在高风险应用（如医疗、贷款、大学招生）引发了增长的关注，因为它们可能对保护的社会群体产生偏见。不同的公平性观念和方法已经被提出来 Mitigate such biases. 在这项工作中，我们关注于Counterfactual Fairness（CF），这是一种公平观念，它取决于下面的 causal graph 和由 Kusner 等人提出的 \cite{kusner2017counterfactual}。CF 要求个体在实际世界中所看到的结果与在一个 "counterfactual" 世界中看到的结果相同。学习满足 CF 的公平模型可以是困难的。在 \cite{kusner2017counterfactual} 中显示了一个 suficient condition ，即不使用敏感属性的后代feature 在 causal graph 中。这意味着可以使用非敏感属性的后代feature 来学习 CF 模型，并且消除所有敏感属性的后代feature。虽然后续的工作提出了使用所有特征进行训练 CF 模型的方法，但没有理论保证它们可以满足 CF。相反，本工作提出了一种新的算法，该算法使用所有可用的特征进行模型训练。我们 theoretically 和 empirically 表明，使用这种方法可以满足 CF，并且可以在实际应用中获得更好的结果。Note: The code repository for this work can be found in \url{https://github.com/osu-srml/CF_Representation_Learning}.

Predicting the Position Uncertainty at the Time of Closest Approach with Diffusion Models

paper_url: http://arxiv.org/abs/2311.05417
repo_url: None
paper_authors: Marta Guimarães, Cláudia Soares, Chiara Manfletti
for: 避免航天器归合撞击
methods: 使用Diffusion模型预测碰撞对象位置不确定性的发展
results: 比较其他现有解决方案和Na"ive基线方法，提出一种可能大幅提高航天器操作安全性和效率的解决方案。

Abstract
The risk of collision between resident space objects has significantly increased in recent years. As a result, spacecraft collision avoidance procedures have become an essential part of satellite operations. To ensure safe and effective space activities, satellite owners and operators rely on constantly updated estimates of encounters. These estimates include the uncertainty associated with the position of each object at the expected TCA. These estimates are crucial in planning risk mitigation measures, such as collision avoidance manoeuvres. As the TCA approaches, the accuracy of these estimates improves, as both objects' orbit determination and propagation procedures are made for increasingly shorter time intervals. However, this improvement comes at the cost of taking place close to the critical decision moment. This means that safe avoidance manoeuvres might not be possible or could incur significant costs. Therefore, knowing the evolution of this variable in advance can be crucial for operators. This work proposes a machine learning model based on diffusion models to forecast the position uncertainty of objects involved in a close encounter, particularly for the secondary object (usually debris), which tends to be more unpredictable. We compare the performance of our model with other state-of-the-art solutions and a na\"ive baseline approach, showing that the proposed solution has the potential to significantly improve the safety and effectiveness of spacecraft operations.

摘要
随着近年航天器之间的碰撞风险的增加，航天器碰撞避免程序已成为卫星运营中的一项重要组成部分。为确保安全有效的航天活动，卫星所有者和运营商依靠Constantly updated的避免碰撞估计。这些估计包括碰撞时点的uncertainty，这些估计是规划避免碰撞措施的关键。随着TCA（最佳距离）接近，这些估计的准确性提高，因为两个物体的轨道决定和推算过程都在更短的时间间隔进行。然而，这种改进带来了在关键决策时刻进行安全避免措施的成本。因此，了解这个变量的进化可以对操作人员非常重要。本工作提出了基于扩散模型的机器学习模型，用于预测碰撞中参与者物体的位置不确定性的发展。我们与其他当前状态的解决方案和简单基eline方法进行比较，显示了我们的方案具有提高航天器操作的安全性和效率的潜在潜力。

Data Distillation for Neural Network Potentials toward Foundational Dataset

paper_url: http://arxiv.org/abs/2311.05407
repo_url: None
paper_authors: Gang Seob Jung, Sangkeun Lee, Jong Youl Choi
for:This paper aims to address the discrepancy between predicted properties of materials through generative models and calculated properties through ab initio calculations.methods:The paper uses extended ensemble molecular dynamics (MD) to secure a broad range of liquid- and solid-phase configurations in one of the metallic systems, nickel. The data is then distilled to significantly reduce the amount of data without losing much accuracy.results:The paper shows that the neural network-based potentials (NNPs) trained from the distilled data can predict different energy-minimized closed-pack crystal structures, even though those structures were not explicitly part of the initial data. The approach is also demonstrated to be applicable to other metallic systems (aluminum and niobium), without repeating the sampling and distillation processes.

Abstract
Machine learning (ML) techniques and atomistic modeling have rapidly transformed materials design and discovery. Specifically, generative models can swiftly propose promising materials for targeted applications. However, the predicted properties of materials through the generative models often do not match with calculated properties through ab initio calculations. This discrepancy can arise because the generated coordinates are not fully relaxed, whereas the many properties are derived from relaxed structures. Neural network-based potentials (NNPs) can expedite the process by providing relaxed structures from the initially generated ones. Nevertheless, acquiring data to train NNPs for this purpose can be extremely challenging as it needs to encompass previously unknown structures. This study utilized extended ensemble molecular dynamics (MD) to secure a broad range of liquid- and solid-phase configurations in one of the metallic systems, nickel. Then, we could significantly reduce them through active learning without losing much accuracy. We found that the NNP trained from the distilled data could predict different energy-minimized closed-pack crystal structures even though those structures were not explicitly part of the initial data. Furthermore, the data can be translated to other metallic systems (aluminum and niobium), without repeating the sampling and distillation processes. Our approach to data acquisition and distillation has demonstrated the potential to expedite NNP development and enhance materials design and discovery by integrating generative models.

摘要
（简化中文）机器学习技术和原子尺度模型已经快速地改变了材料设计和发现。特别是，生成模型可以快速提出适用于目标应用的有前途的材料。然而，通过生成模型预测的材料性能与原子尺度计算的性能之间存在差异。这种差异可能是因为生成的坐标不完全填充，而许多性能来自于已relax结构。基于神经网络的潜在能（NNPs）可以加速过程，并提供已relax结构。然而，为了训练NNPs，需要具备广泛的数据，这些数据需要包括前不知道的结构。本研究使用了扩展ensemble分子动力学（MD）来保证 Nickel 金属系统中的广泛液相和固相配置。然后，我们可以通过活动学习大幅减少数据，而不失去准确性。我们发现，使用滤制数据训练的 NNP 可以预测不同的能量最小化关闭晶体结构，即使这些结构没有直接出现在初始数据中。此外，数据可以翻译到其他金属系统（锌和钴），无需重复样本和滤制过程。我们的数据获取和滤制方法已经展示了可以加速 NNP 的发展，并且提高材料设计和发现的效率，通过结合生成模型。

The Sample Complexity Of ERMs In Stochastic Convex Optimization

paper_url: http://arxiv.org/abs/2311.05398
repo_url: None
paper_authors: Daniel Carmon, Roi Livni, Amir Yehudayoff
for: 这篇论文主要研究的是 Stochastic Convex Optimization 模型下的学习问题，特别是让任意 Empirical Risk Minimizer (ERM) 在真实人口中表现良好所需的数据点数量。
methods: 本文使用了一种新的分析方法，可以证明 $\tilde{O}(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ 数据点数量是 sufficient。这个结论解决了一个长期未解的问题，并且提供了一个新的分离bound。
results: 本文证明了在学习约bounded凸 lipschitz函数的经典设定下，$\tilde{O}(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ 数据点数量是必要的和 suficient。此外，本文还推广了结论，证明这个结论对所有半Symmetric凸体都成立。

Abstract
Stochastic convex optimization is one of the most well-studied models for learning in modern machine learning. Nevertheless, a central fundamental question in this setup remained unresolved: "How many data points must be observed so that any empirical risk minimizer (ERM) shows good performance on the true population?" This question was proposed by Feldman (2016), who proved that $\Omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are necessary (where $d$ is the dimension and $\epsilon>0$ is the accuracy parameter). Proving an $\omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ lower bound was left as an open problem. In this work we show that in fact $\tilde{O}(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are also sufficient. This settles the question and yields a new separation between ERMs and uniform convergence. This sample complexity holds for the classical setup of learning bounded convex Lipschitz functions over the Euclidean unit ball. We further generalize the result and show that a similar upper bound holds for all symmetric convex bodies. The general bound is composed of two terms: (i) a term of the form $\tilde{O}(\frac{d}{\epsilon})$ with an inverse-linear dependence on the accuracy parameter, and (ii) a term that depends on the statistical complexity of the class of $\textit{linear}$ functions (captured by the Rademacher complexity). The proof builds a mechanism for controlling the behavior of stochastic convex optimization problems.

摘要

Beyond the training set: an intuitive method for detecting distribution shift in model-based optimization

paper_url: http://arxiv.org/abs/2311.05363
repo_url: None
paper_authors: Farhan Damani, David H Brookes, Theodore Sternlieb, Cameron Webster, Stephen Malina, Rishi Jajoo, Kathy Lin, Sam Sinai
for: 本研究的目的是提出一种简单的方法，用于检测模型训练和设计数据集之间的分布差异。
methods: 该方法使用了一个二分类器，通过知识未标注的设计分布来分离训练数据和设计数据。二分类器的логи特征值被用作分布差异的代理量。
results: 在一个实际应用中，我们 validate了该方法，并发现分布差异的严重程度与优化算法步数有关。该简单的方法可以识别这些差异，使用户可以将搜索局限在模型预测可靠的区域内，从而提高设计质量。

Abstract
Model-based optimization (MBO) is increasingly applied to design problems in science and engineering. A common scenario involves using a fixed training set to train models, with the goal of designing new samples that outperform those present in the training data. A major challenge in this setting is distribution shift, where the distributions of training and design samples are different. While some shift is expected, as the goal is to create better designs, this change can negatively affect model accuracy and subsequently, design quality. Despite the widespread nature of this problem, addressing it demands deep domain knowledge and artful application. To tackle this issue, we propose a straightforward method for design practitioners that detects distribution shifts. This method trains a binary classifier using knowledge of the unlabeled design distribution to separate the training data from the design data. The classifier's logit scores are then used as a proxy measure of distribution shift. We validate our method in a real-world application by running offline MBO and evaluate the effect of distribution shift on design quality. We find that the intensity of the shift in the design distribution varies based on the number of steps taken by the optimization algorithm, and our simple approach can identify these shifts. This enables users to constrain their search to regions where the model's predictions are reliable, thereby increasing the quality of designs.

摘要

Basis functions nonlinear data-enabled predictive control: Consistent and computationally efficient formulations

paper_url: http://arxiv.org/abs/2311.05360
repo_url: None
paper_authors: Mircea Lazar
for: 这篇论文探讨了数据启用预测控制（DeePC）在非线性系统上的扩展，通过通用基函数。
methods: 论文使用了基函数DeePC行为预测器，并确定了必要和充分的条件，以确保与相应的基函数多步标识预测器的等价性。
results: 论文通过 derivation of 动态常数化成本函数，实现了一个准确的基函数DeePC表述，并提出了两种更改的表述，以提高计算效率。 Additionally, the paper also discusses the consistency of Koopman DeePC and provides several methods for constructing the basis functions representation. The effectiveness of the developed consistent basis functions DeePC formulations is demonstrated on a benchmark nonlinear pendulum state-space model, for both noise-free and noisy data.

Abstract
This paper considers the extension of data-enabled predictive control (DeePC) to nonlinear systems via general basis functions. Firstly, we formulate a basis functions DeePC behavioral predictor and we identify necessary and sufficient conditions for equivalence with a corresponding basis functions multi-step identified predictor. The derived conditions yield a dynamic regularization cost function that enables a well-posed (i.e., consistent) basis functions formulation of nonlinear DeePC. To optimize computational efficiency of basis functions DeePC we further develop two alternative formulations that use a simpler, sparse regularization cost function and ridge regression, respectively. Consistency implications for Koopman DeePC as well as several methods for constructing the basis functions representation are also indicated. The effectiveness of the developed consistent basis functions DeePC formulations is illustrated on a benchmark nonlinear pendulum state-space model, for both noise free and noisy data.

摘要

Accelerated Shapley Value Approximation for Data Evaluation

paper_url: http://arxiv.org/abs/2311.05346
repo_url: None
paper_authors: Lauren Watson, Zeno Kujawa, Rayna Andreeva, Hao-Tsung Yang, Tariq Elahi, Rik Sarkar
for: 这个论文的目的是提出一种更加高效的数据估价方法，以便更好地应用在机器学习中。
methods: 这个论文使用了机器学习问题的结构特性来更加高效地计算数据点的Shapley值。它提出了一种基于小subset的approximate Shapley值的方法，并提供了对不同学习设置下的准确性保证。
results: 该方法可以快速和高效地计算数据点的Shapley值，并且可以保持数据的准确性和排名。实验表明，这种方法可以在预训练网络中带来更高的效率。

Abstract
Data valuation has found various applications in machine learning, such as data filtering, efficient learning and incentives for data sharing. The most popular current approach to data valuation is the Shapley value. While popular for its various applications, Shapley value is computationally expensive even to approximate, as it requires repeated iterations of training models on different subsets of data. In this paper we show that the Shapley value of data points can be approximated more efficiently by leveraging the structural properties of machine learning problems. We derive convergence guarantees on the accuracy of the approximate Shapley value for different learning settings including Stochastic Gradient Descent with convex and non-convex loss functions. Our analysis suggests that in fact models trained on small subsets are more important in the context of data valuation. Based on this idea, we describe $\delta$-Shapley -- a strategy of only using small subsets for the approximation. Experiments show that this approach preserves approximate value and rank of data, while achieving speedup of up to 9.9x. In pre-trained networks the approach is found to bring more efficiency in terms of accurate evaluation using small subsets.

摘要
“数据评估在机器学习中找到了多种应用，如数据筛选、高效学习和数据分享的激励。目前最受欢迎的数据评估方法是雪莱值。虽然具有多种应用，但雪莱值计算成本高，需要重复训练模型不同的数据 subsets。在这篇论文中，我们表明可以更有效地 aproximate 雪莱值的数据点，利用机器学习问题的结构性质。我们提供了不同学习设置下的准确性拥有保证，包括杂散Gradient Descent的凸和非凸损函数。我们的分析表明，在数据评估中，使用小subset是更重要的。基于这个想法，我们描述了 $\delta $-雪莱策略，即只使用小subset进行 aproximation。实验表明，这种方法可以保持数据的相对值和排名，同时实现速度提高达9.9倍。在预训练网络上，这种方法具有更高的准确评估效果。”

Real-time Addressee Estimation: Deployment of a Deep-Learning Model on the iCub Robot

paper_url: http://arxiv.org/abs/2311.05334
repo_url: None
paper_authors: Carlo Mazzola, Francesco Rea, Alessandra Sciutti
for: 这个论文的目的是开发一种基于非语言表征的地址者估计模型，以便人工智能对话机器人在多方和无结构场景中与人类交互更加畅通。
methods: 该模型使用了深度学习技术，利用speaker的视线和身姿行为来进行地址者估计。
results: 实验表明，该模型在实时人机器人交互中的性能比前一个dataset上的训练测试更好，表明该模型可以在多方和无结构场景中提供更高的地址者估计精度。

Abstract
Addressee Estimation is the ability to understand to whom a person is talking, a skill essential for social robots to interact smoothly with humans. In this sense, it is one of the problems that must be tackled to develop effective conversational agents in multi-party and unstructured scenarios. As humans, one of the channels that mainly lead us to such estimation is the non-verbal behavior of speakers: first of all, their gaze and body pose. Inspired by human perceptual skills, in the present work, a deep-learning model for Addressee Estimation relying on these two non-verbal features is designed, trained, and deployed on an iCub robot. The study presents the procedure of such implementation and the performance of the model deployed in real-time human-robot interaction compared to previous tests on the dataset used for the training.

摘要
收件人估算是指理解对话中谁正在说话，这是社交机器人与人类交流的关键技能。在这种多方和无结构的情况下，这成为开发有效对话代理人的一个重要问题。人类之所以能够准确地估算收件人，一部分来自于对话者的非语言行为：首先是他们的视线和姿势。以人类的感知技能为灵感，本研究开发了一种基于这两种非语言特征的深度学习模型，并在iCub机器人上进行了实时部署。研究中介绍了实施过程和在真实人机交互中模型的性能比较前一些测试数据集。

RepQ: Generalizing Quantization-Aware Training for Re-Parametrized Architectures

paper_url: http://arxiv.org/abs/2311.05317
repo_url: None
paper_authors: Anastasiia Prutianova, Alexey Zaytsev, Chung-Kuei Lee, Fengyu Sun, Ivan Koryakovskiy
for: 提高训练和测试 neural network 的效率，使其能够在有限资源的环境中部署。
methods: 使用 quantization 和 re-parametrization 两种方法来提高模型性能，并同时应用这两种方法来提高模型的效率。
results: RepQ 方法比基eline方法 LSQ 量化方案在所有实验中具有更好的性能。

Abstract
Existing neural networks are memory-consuming and computationally intensive, making deploying them challenging in resource-constrained environments. However, there are various methods to improve their efficiency. Two such methods are quantization, a well-known approach for network compression, and re-parametrization, an emerging technique designed to improve model performance. Although both techniques have been studied individually, there has been limited research on their simultaneous application. To address this gap, we propose a novel approach called RepQ, which applies quantization to re-parametrized networks. Our method is based on the insight that the test stage weights of an arbitrary re-parametrized layer can be presented as a differentiable function of trainable parameters. We enable quantization-aware training by applying quantization on top of this function. RepQ generalizes well to various re-parametrized models and outperforms the baseline method LSQ quantization scheme in all experiments.

摘要
Note: "Simplified Chinese" is a translation of "Traditional Chinese" and not "Mandarin Chinese". Simplified Chinese is a standardized form of Chinese characters that is used in mainland China, while Traditional Chinese is used in Hong Kong, Macau, and Taiwan.

Reliable and Efficient Data Collection in UAV-based IoT Networks

paper_url: http://arxiv.org/abs/2311.05303
repo_url: None
paper_authors: Poorvi Joshi, Alakesh Kalita, Mohan Gurusamy
For: This paper focuses on the challenges and opportunities of using Unmanned Aerial Vehicles (UAVs) to enhance data collection in Internet of Things (IoT) networks.* Methods: The paper explores various UAV-based data collection methods, including their advantages and disadvantages, and discusses performance metrics for data collection.* Results: The paper discusses efficient data collection strategies in UAV-based IoT networks, including trajectory and path planning, collision avoidance, sensor network clustering, data aggregation, UAV swarm formations, and artificial intelligence for optimization.Here is the same information in Simplified Chinese text:* For: 这篇论文关注使用无人机（UAV）改善互联网物联网（IoT）网络数据收集的挑战和机遇。* Methods: 论文探讨了不同的UAV基于数据收集方法，包括其优势和缺点，并讨论数据收集性能指标。* Results: 论文介绍了UAV基于IoT网络数据收集的有效策略，包括路径规划、避免冲突、感知网络团 clustering、数据聚合、UAV群 formation、人工智能优化等。

Abstract
Internet of Things (IoT) involves sensors for monitoring and wireless networks for efficient communication. However, resource-constrained IoT devices and limitations in existing wireless technologies hinder its full potential. Integrating Unmanned Aerial Vehicles (UAVs) into IoT networks can address some challenges by expanding its' coverage, providing security, and bringing computing closer to IoT devices. Nevertheless, effective data collection in UAV-assisted IoT networks is hampered by factors, including dynamic UAV behavior, environmental variables, connectivity instability, and security considerations. In this survey, we first explore UAV-based IoT networks, focusing on communication and networking aspects. Next, we cover various UAV-based data collection methods their advantages and disadvantages, followed by a discussion on performance metrics for data collection. As this article primarily emphasizes reliable and efficient data collection in UAV-assisted IoT networks, we briefly discuss existing research on data accuracy and consistency, network connectivity, and data security and privacy to provide insights into reliable data collection. Additionally, we discuss efficient data collection strategies in UAV-based IoT networks, covering trajectory and path planning, collision avoidance, sensor network clustering, data aggregation, UAV swarm formations, and artificial intelligence for optimization. We also present two use cases of UAVs as a service for enhancing data collection reliability and efficiency. Finally, we discuss future challenges in data collection for UAV-assisted IoT networks.

摘要
互联网智能化（IoT）具有侦测器和无线网络，以实现高效的通信。然而，IoT设备受限，而现有的无线技术也有一些限制，这限制了IoT的全面发挥。将无人航空车（UAV）纳入IoT网络中可以解决一些挑战，扩大其覆盖范围，提供安全性，并将计算与IoT设备靠近。然而，UAV-协助的IoT网络中的数据收集效率受到一些因素的影响，包括UAV的动态行为、环境变量、连接稳定性和安全考虑。在本调查中，我们首先探讨UAV-基本的IoT网络，专注于通信和网络方面。接着，我们详细介绍了由UAV支持的数据收集方法，包括优点和缺点。接着，我们讨论了数据收集性能的衡量指标，以及现有的研究数据准确性、网络连接稳定性和数据安全性等方面的研究。在本文中，我们专注于可靠和高效的数据收集在UAV-协助的IoT网络中。我们详细介绍了一些有效的数据收集策略，包括轨迹与路径观察、碰撞避免、数据网络对 clustering、数据聚合、UAV群形成和人工智能优化等。此外，我们还提出了两个UAV作为服务的用案，以增强数据收集可靠性和效率。最后，我们讨论了未来数据收集在UAV-协助的IoT网络中的挑战。

Latent Task-Specific Graph Network Simulators

paper_url: http://arxiv.org/abs/2311.05256
repo_url: https://github.com/philippdahlinger/ltsgns_ai4science
paper_authors: Philipp Dahlinger, Niklas Freymuth, Michael Volpp, Tai Hoang, Gerhard Neumann
for: mesh-based simulation as a meta-learning problem to improve GNSs adaptability to new scenarios
methods: using a recent Bayesian meta-learning method, leveraging context data and handling uncertainties, using non-amortized task posterior approximations to sample latent descriptions of unknown system properties, and leveraging movement primitives for efficient full trajectory prediction
results: on par with or better than established baseline methods, and accommodating various types of context data through the use of point clouds during inference.

Abstract
Simulating dynamic physical interactions is a critical challenge across multiple scientific domains, with applications ranging from robotics to material science. For mesh-based simulations, Graph Network Simulators (GNSs) pose an efficient alternative to traditional physics-based simulators. Their inherent differentiability and speed make them particularly well-suited for inverse design problems. Yet, adapting to new tasks from limited available data is an important aspect for real-world applications that current methods struggle with. We frame mesh-based simulation as a meta-learning problem and use a recent Bayesian meta-learning method to improve GNSs adaptability to new scenarios by leveraging context data and handling uncertainties. Our approach, latent task-specific graph network simulator, uses non-amortized task posterior approximations to sample latent descriptions of unknown system properties. Additionally, we leverage movement primitives for efficient full trajectory prediction, effectively addressing the issue of accumulating errors encountered by previous auto-regressive methods. We validate the effectiveness of our approach through various experiments, performing on par with or better than established baseline methods. Movement primitives further allow us to accommodate various types of context data, as demonstrated through the utilization of point clouds during inference. By combining GNSs with meta-learning, we bring them closer to real-world applicability, particularly in scenarios with smaller datasets.

摘要
模拟动态物理交互是科学领域中的一个关键挑战，其应用范围从 робо扮到材料科学。为 mesh-based simulations，图表网络仿真器（GNS）提供了一种高效的替代方案。它们的自然差分和速度使其特别适合反向设计问题。然而，适应新任务从有限的数据中学习是现实应用中的一个重要问题，现有方法困难于解决。我们将 mesh-based simulation 作为一个 meta-learning 问题，使用最近的 bayesian meta-learning 方法来提高 GNS 的适应新情况的能力，通过使用 context data 和处理不确定性。我们的方法，即 latent task-specific graph network simulator，通过非折衔任务 posterior 近似来采样 latent 描述未知系统性质。此外，我们利用 movement primitives 进行全 trajectory 预测，有效地解决了过去 auto-regressive 方法所遇到的积累错误问题。我们通过多个实验 validate 了我们的方法的有效性，与或更好于现有基eline 方法。 movement primitives 还允许我们根据不同的 context data 进行适应，例如通过使用点云进行推理。通过将 GNS 与 meta-learning 结合，我们使得它们更加适用于实际应用，特别是在具有更小的数据集的场景中。

When Meta-Learning Meets Online and Continual Learning: A Survey

paper_url: http://arxiv.org/abs/2311.05241
repo_url: None
paper_authors: Jaehyeon Son, Soochan Lee, Gunhee Kim
for: This paper aims to provide a comprehensive survey of various learning frameworks, including meta-learning, continual learning, and online learning, and to facilitate a clear understanding of the differences between them.
methods: The paper uses a consistent terminology and formal descriptions to organize various problem settings and learning algorithms, and offers an overview of these learning paradigms to foster further advancements in the field.
results: The paper provides a clear understanding of the differences between the learning frameworks, and offers a unified terminology for discussing them, which can help experienced researchers and newcomers to the field alike.Here is the same information in Simplified Chinese text:
for: 这篇论文目标是提供一份涵盖不同学习框架的全面调查，包括meta-学习、连续学习和在线学习，以便促进这些学习框架之间的清晰认识。
methods: 这篇论文使用一致的术语和正式描述来组织不同的问题设定和学习算法，并提供这些学习框架的概述，以便促进这个领域的进一步发展。
results: 这篇论文提供了不同学习框架之间的清晰认识，并提供了一个统一的术语来讨论这些学习框架，这可以帮助经验丰富的研究人员和新手末进行学习和研究。

Abstract
Over the past decade, deep neural networks have demonstrated significant success using the training scheme that involves mini-batch stochastic gradient descent on extensive datasets. Expanding upon this accomplishment, there has been a surge in research exploring the application of neural networks in other learning scenarios. One notable framework that has garnered significant attention is meta-learning. Often described as "learning to learn," meta-learning is a data-driven approach to optimize the learning algorithm. Other branches of interest are continual learning and online learning, both of which involve incrementally updating a model with streaming data. While these frameworks were initially developed independently, recent works have started investigating their combinations, proposing novel problem settings and learning algorithms. However, due to the elevated complexity and lack of unified terminology, discerning differences between the learning frameworks can be challenging even for experienced researchers. To facilitate a clear understanding, this paper provides a comprehensive survey that organizes various problem settings using consistent terminology and formal descriptions. By offering an overview of these learning paradigms, our work aims to foster further advancements in this promising area of research.

摘要

Whisper in Focus: Enhancing Stuttered Speech Classification with Encoder Layer Optimization

paper_url: http://arxiv.org/abs/2311.05203
repo_url: None
paper_authors: Huma Ameer, Seemab Latif, Rabia Latif, Sana Mukhtar
for: 本研究旨在 automatized 识别含断流的听力问题，采用深度学习技术进行解决。
methods: 研究人员使用 Wav2vec2.0 语音识别模型进行断流类型分类。
results: 优化后的 Whisper 模型在断流类型分类任务中实现了平均 F1 分数0.81，表明其能力。此外，研究还发现了更深的编码层对断流类型识别的重要性。

Abstract
In recent years, advancements in the field of speech processing have led to cutting-edge deep learning algorithms with immense potential for real-world applications. The automated identification of stuttered speech is one of such applications that the researchers are addressing by employing deep learning techniques. Recently, researchers have utilized Wav2vec2.0, a speech recognition model to classify disfluency types in stuttered speech. Although Wav2vec2.0 has shown commendable results, its ability to generalize across all disfluency types is limited. In addition, since its base model uses 12 encoder layers, it is considered a resource-intensive model. Our study unravels the capabilities of Whisper for the classification of disfluency types in stuttered speech. We have made notable contributions in three pivotal areas: enhancing the quality of SEP28-k benchmark dataset, exploration of Whisper for classification, and introducing an efficient encoder layer freezing strategy. The optimized Whisper model has achieved the average F1-score of 0.81, which proffers its abilities. This study also unwinds the significance of deeper encoder layers in the identification of disfluency types, as the results demonstrate their greater contribution compared to initial layers. This research represents substantial contributions, shifting the emphasis towards an efficient solution, thereby thriving towards prospective innovation.

摘要
近年来，speech处理领域的进步带来了深度学习算法的潜在应用。自动识别偏声是其中一个应用，研究人员通过使用深度学习技术来解决。最近，研究人员使用Wav2vec2.0，一种语音识别模型来分类偏声类型。虽然Wav2vec2.0表现出色，但其对各种偏声类型的泛化能力有限。此外，由于其基础模型使用12层编码层，因此被视为资源占用型模型。我们的研究探讨了Whisper模型在偏声类型分类中的能力。我们在三个重要领域中作出了显著贡献：提高SEP28-kBenchmark数据集的质量，探索Whisper模型的分类能力，并提出了高效编码层冻结策略。优化后的Whisper模型达到了0.81的平均F1分数，这证明了它的能力。此研究还发现了 deeper编码层在偏声类型标识中的重要性，结果表明它们在初始层次上的贡献比较大。这项研究代表了重要的贡献，它将注意力集中在高效解决方案上，逐渐向前进。

Perfecting Liquid-State Theories with Machine Intelligence

paper_url: http://arxiv.org/abs/2311.05167
repo_url: None
paper_authors: Jianzhong Wu, Mengyang Gu
for: 预测电子结构、分子力场和各种固体系统的物理化学性质
methods: Functional machine learning技术，包括代理模型、维度减少和不确定性评估
results: 提高精度、可扩展性和计算效率，推广应用于多种材料和化学系统

Abstract
Recent years have seen a significant increase in the use of machine intelligence for predicting electronic structure, molecular force fields, and the physicochemical properties of various condensed systems. However, substantial challenges remain in developing a comprehensive framework capable of handling a wide range of atomic compositions and thermodynamic conditions. This perspective discusses potential future developments in liquid-state theories leveraging on recent advancements of functional machine learning. By harnessing the strengths of theoretical analysis and machine learning techniques including surrogate models, dimension reduction and uncertainty quantification, we envision that liquid-state theories will gain significant improvements in accuracy, scalability and computational efficiency, enabling their broader applications across diverse materials and chemical systems.

摘要
近年来，机器智能技术在预测电子结构、分子力场和各种固体系统的物理化学性质方面得到了广泛应用。然而，构建涵盖各种原子组成和热力学条件的全面框架仍面临着重大挑战。本视角介绍了未来可能的液体理论发展，基于近期的功能机器学习技术。通过利用理论分析和机器学习技术，包括协助模型、维度减少和不确定性评估，我们预计液体理论将在准确性、可扩展性和计算效率等方面做出显著改进，使其在多种材料和化学系统中得到更广泛的应用。

Counter-Empirical Attacking based on Adversarial Reinforcement Learning for Time-Relevant Scoring System

paper_url: http://arxiv.org/abs/2311.05144
repo_url: https://github.com/sheldonresearch/microsoft-scoring-system
paper_authors: Xiangguo Sun, Hong Cheng, Hang Dong, Bo Qiao, Si Qin, Qingwei Lin
for: 该论文主要探讨了如何自动调整分配系统，以便更好地管理大数据时代的资源。
methods: 作者提出了一种“反emplirical攻击”机制，通过生成冲击行为轨迹来评估分配系统，并采用对抗学习问题进行训练，以学习一个robust的分配函数。
results: 实验结果表明，该方法可以有效地改进分配系统，使其更能够抵御冲击行为轨迹。

Abstract
Scoring systems are commonly seen for platforms in the era of big data. From credit scoring systems in financial services to membership scores in E-commerce shopping platforms, platform managers use such systems to guide users towards the encouraged activity pattern, and manage resources more effectively and more efficiently thereby. To establish such scoring systems, several "empirical criteria" are firstly determined, followed by dedicated top-down design for each factor of the score, which usually requires enormous effort to adjust and tune the scoring function in the new application scenario. What's worse, many fresh projects usually have no ground-truth or any experience to evaluate a reasonable scoring system, making the designing even harder. To reduce the effort of manual adjustment of the scoring function in every new scoring system, we innovatively study the scoring system from the preset empirical criteria without any ground truth, and propose a novel framework to improve the system from scratch. In this paper, we propose a "counter-empirical attacking" mechanism that can generate "attacking" behavior traces and try to break the empirical rules of the scoring system. Then an adversarial "enhancer" is applied to evaluate the scoring system and find the improvement strategy. By training the adversarial learning problem, a proper scoring function can be learned to be robust to the attacking activity traces that are trying to violate the empirical criteria. Extensive experiments have been conducted on two scoring systems including a shared computing resource platform and a financial credit system. The experimental results have validated the effectiveness of our proposed framework.

摘要
大数据时代内， scoring system 已成为平台管理的普遍现象。从金融服务中的信用分数系统到电商平台上的会员分数系统，平台管理者利用这些系统来引导用户行为，更好地管理资源，提高效率。为建立这些分数系统，需首先确定一些“实证标准”，然后针对每个分数因素进行专门的顶部设计，通常需要巨大的努力来调整和调整分数函数在新应用场景中。尤其是新项目通常没有基准或经验来评估合适的分数系统，使设计变得更加困难。为了减少每个新分数系统的手动调整努力，我们创新地研究了分数系统从预设的实证标准而不需任何基准，并提出了一个新的框架来改进系统。在这篇论文中，我们提出了一种“逆实证攻击”机制，可以生成“攻击”行为迹象并尝试让分数系统违反实证规则。然后，我们应用了一种“增强器”来评估分数系统，找到改进策略。通过训练对抗学习问题，我们可以学习一个鲁棒的分数函数，抗击攻击行为迹象，并且可以避免违反实证规则。我们对两个分数系统，包括分享计算资源平台和金融信用系统，进行了广泛的实验。实验结果证明了我们提出的框架的有效性。

On neural and dimensional collapse in supervised and unsupervised contrastive learning with hard negative sampling

paper_url: http://arxiv.org/abs/2311.05139
repo_url: None
paper_authors: Ruijie Jiang, Thuan Nguyen, Shuchin Aeron, Prakash Ishwar
For: The paper is written for proving the optimality of Neural Collapse (NC) representations for Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) under general loss and hardening functions.* Methods: The paper uses theoretical proofs to show that representations that exhibit Neural Collapse (NC) minimize the SCL, HSCL, and UCL risks. The proofs are simplified, compact, and transparent, and they demonstrate the optimality of ETF for HSCL and UCL under general loss and hardening functions.* Results: The paper empirically demonstrates that ADAM optimization of HSCL and HUCL risks with random initialization and suitable hardness levels can converge to the NC geometry, but only if unit-ball or unit-sphere feature normalization is incorporated. Without incorporating hard negatives or feature normalization, the representations learned via ADAM suffer from dimensional collapse (DC) and fail to attain the NC geometry.

Abstract
For a widely-studied data model and general loss and sample-hardening functions we prove that the Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) risks are minimized by representations that exhibit Neural Collapse (NC), i.e., the class means form an Equianglular Tight Frame (ETF) and data from the same class are mapped to the same representation. We also prove that for any representation mapping, the HSCL and Hard-UCL (HUCL) risks are lower bounded by the corresponding SCL and UCL risks. Although the optimality of ETF is known for SCL, albeit only for InfoNCE loss, its optimality for HSCL and UCL under general loss and hardening functions is novel. Moreover, our proofs are much simpler, compact, and transparent. We empirically demonstrate, for the first time, that ADAM optimization of HSCL and HUCL risks with random initialization and suitable hardness levels can indeed converge to the NC geometry if we incorporate unit-ball or unit-sphere feature normalization. Without incorporating hard negatives or feature normalization, however, the representations learned via ADAM suffer from dimensional collapse (DC) and fail to attain the NC geometry.

摘要
For a widely-studied data model and general loss and sample-hardening functions, we prove that the Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) risks are minimized by representations that exhibit Neural Collapse (NC), i.e., the class means form an Equianglular Tight Frame (ETF) and data from the same class are mapped to the same representation. We also prove that for any representation mapping, the HSCL and Hard-UCL (HUCL) risks are lower bounded by the corresponding SCL and UCL risks. Although the optimality of ETF is known for SCL, albeit only for InfoNCE loss, its optimality for HSCL and UCL under general loss and hardening functions is novel. Moreover, our proofs are much simpler, compact, and transparent. We empirically demonstrate, for the first time, that ADAM optimization of HSCL and HUCL risks with random initialization and suitable hardness levels can indeed converge to the NC geometry if we incorporate unit-ball or unit-sphere feature normalization. Without incorporating hard negatives or feature normalization, however, the representations learned via ADAM suffer from dimensional collapse (DC) and fail to attain the NC geometry.Here's a breakdown of the translation:* "Supervised Contrastive Learning" (SCL) is translated as "导向对比学习" (导向对比学习)* "Hard-SCL" (HSCL) is translated as "困难导向对比学习" (困难导向对比学习)* "Unsupervised Contrastive Learning" (UCL) is translated as "无导向对比学习" (无导向对比学习)* "Neural Collapse" (NC) is translated as "神经塌陷" (神经塌陷)* "Equianglular Tight Frame" (ETF) is translated as "等角紧凑框架" (等角紧凑框架)* "Hard-UCL" (HUCL) is translated as "困难无导向对比学习" (困难无导向对比学习)* "ADAM optimization" is translated as "ADAM优化" (ADAM优化)* "dimensional collapse" (DC) is translated as "维度塌陷" (维度塌陷)

Improving Computational Efficiency for Powered Descent Guidance via Transformer-based Tight Constraint Prediction

paper_url: http://arxiv.org/abs/2311.05135
repo_url: None
paper_authors: Julia Briden, Trey Gurga, Breanna Johnson, Abhishek Cauligi, Richard Linares
for: 这篇论文的目的是提出一个减少 espacial 问题的直接优化形式的减少计算复杂度的算法，即 Transformer-based Powered Descent Guidance (T-PDG)。
methods: 这篇论文使用了 transformer 神经网络，通过训练以前的轨迹优化算法的数据来实现对 globally 优化解的准确预测。解是以紧缩的形式储存在最佳状态和Touchdown 点参数之间的关系。
results: 在应用到 Mars 的实际探索问题上，T-PDG 可以将 Computing 3 度自由度的燃料优化轨迹的时间，与 lossless 凸化相比，由 1-8 秒钟降至 less than 500 毫秒。同时，T-PDG 保证了安全且优化的解，通过在预测过程中包含一个实际性检查。

Abstract
In this work, we present Transformer-based Powered Descent Guidance (T-PDG), a scalable algorithm for reducing the computational complexity of the direct optimization formulation of the spacecraft powered descent guidance problem. T-PDG uses data from prior runs of trajectory optimization algorithms to train a transformer neural network, which accurately predicts the relationship between problem parameters and the globally optimal solution for the powered descent guidance problem. The solution is encoded as the set of tight constraints corresponding to the constrained minimum-cost trajectory and the optimal final time of landing. By leveraging the attention mechanism of transformer neural networks, large sequences of time series data can be accurately predicted when given only the spacecraft state and landing site parameters. When applied to the real problem of Mars powered descent guidance, T-PDG reduces the time for computing the 3 degree of freedom fuel-optimal trajectory, when compared to lossless convexification, from an order of 1-8 seconds to less than 500 milliseconds. A safe and optimal solution is guaranteed by including a feasibility check in T-PDG before returning the final trajectory.

摘要
在这个工作中，我们提出了Transformer-based Powered Descent Guidance（T-PDG）算法，用于降低直接优化形式ulation的空间站动力下降指南问题的计算复杂性。T-PDG使用之前的轨迹优化算法的数据来训练transformer神经网络，准确预测问题参数和 globally optimal solution的关系。解决方案被编码为包含紧跟约束的最小成本轨迹和着陆时间的集合。通过利用transformer神经网络的注意机制，只需提供空站状态和着陆场址参数，可以准确预测大量时间序列数据。在应用于真实的火星动力下降指南问题时，T-PDG比lossless convexification算法减少了3个自由度燃料优化轨迹计算时间，从1-8秒钟降低到 less than 500毫秒。保证安全且优化的解决方案，T-PDG中包含了可行性检查，以确保返回最终轨迹是安全且优化的。

Exploring and Analyzing Wildland Fire Data Via Machine Learning Techniques

paper_url: http://arxiv.org/abs/2311.05128
repo_url: None
paper_authors: Dipak Dulal, Joseph J. Charney, Michael Gallagher, Carmeliza Navasca, Nicholas Skowronski
for: 研究了10Hz时间序列的热电差温度和风速测得的动力动量能量（TKE）之间的相关性，以探讨使用热电差温度作为TKE预测的可能性。
methods: 使用机器学习模型，包括深度神经网络、Random Forest Regressor、Gradient Boosting和Gaussian Process Regressor，评估热电差温度干扰的可能性来预测TKE值。
results: 使用不同的机器学习模型得到了高准确率的TKE预测结果，尤其是使用回归模型。数据视觉和相关分析表明热电差温度和TKE之间存在明显的关系，提供了对下述动力动量的深入了解。研究成果有助于火灾行为和烟雾模型科学，强调机器学习方法的重要性，并提出了有关细见火灾行为和动力动量之间复杂关系的问题。

Abstract
This research project investigated the correlation between a 10 Hz time series of thermocouple temperatures and turbulent kinetic energy (TKE) computed from wind speeds collected from a small experimental prescribed burn at the Silas Little Experimental Forest in New Jersey, USA. The primary objective of this project was to explore the potential for using thermocouple temperatures as predictors for estimating the TKE produced by a wildland fire. Machine learning models, including Deep Neural Networks, Random Forest Regressor, Gradient Boosting, and Gaussian Process Regressor, are employed to assess the potential for thermocouple temperature perturbations to predict TKE values. Data visualization and correlation analyses reveal patterns and relationships between thermocouple temperatures and TKE, providing insight into the underlying dynamics. The project achieves high accuracy in predicting TKE by employing various machine learning models despite a weak correlation between the predictors and the target variable. The results demonstrate significant success, particularly from regression models, in accurately estimating the TKE. The research findings contribute to fire behavior and smoke modeling science, emphasizing the importance of incorporating machine learning approaches and identifying complex relationships between fine-scale fire behavior and turbulence. Accurate TKE estimation using thermocouple temperatures allows for the refinement of models that can inform decision-making in fire management strategies, facilitate effective risk mitigation, and optimize fire management efforts. This project highlights the valuable role of machine learning techniques in analyzing wildland fire data, showcasing their potential to advance fire research and management practices.

摘要
To achieve this, machine learning models such as Deep Neural Networks, Random Forest Regressor, Gradient Boosting, and Gaussian Process Regressor were employed to assess the potential for thermocouple temperature perturbations to predict TKE values. Data visualization and correlation analyses revealed patterns and relationships between thermocouple temperatures and TKE, providing insights into the underlying dynamics.Despite a weak correlation between the predictors and the target variable, the project achieved high accuracy in predicting TKE using various machine learning models. The results demonstrated significant success, particularly from regression models, in accurately estimating TKE. The findings contribute to fire behavior and smoke modeling science, emphasizing the importance of incorporating machine learning approaches and identifying complex relationships between fine-scale fire behavior and turbulence.Accurate TKE estimation using thermocouple temperatures allows for the refinement of models that can inform decision-making in fire management strategies, facilitate effective risk mitigation, and optimize fire management efforts. This project highlights the valuable role of machine learning techniques in analyzing wildland fire data, showcasing their potential to advance fire research and management practices.

Covering Number of Real Algebraic Varieties and Beyond: Improved Bounds and Applications

paper_url: http://arxiv.org/abs/2311.05116
repo_url: None
paper_authors: Yifan Zhang, Joe Kileel
for: 本文提供了一个Upper bound的bounds on the covering number of real algebraic varieties, images of polynomial maps, and semialgebraic sets.
methods: 本文使用了polynomial maps和semialgebraic sets的 teoría to prove the bound, which remarkably improves the best known general bound by Yomdin-Comte.
results: 本文的结果为新的 bounds on the volume of the tubular neighborhood of the image of a polynomial map and a semialgebraic set, which are not directly applicable to varieties. In addition, the paper derives near-optimal bounds on the covering number of low rank CP tensors, sketching dimension for (general) polynomial optimization problems, and generalization error bounds for deep neural networks with rational or ReLU activations.

Abstract
We prove an upper bound on the covering number of real algebraic varieties, images of polynomial maps and semialgebraic sets. The bound remarkably improves the best known general bound by Yomdin-Comte, and its proof is much more straightforward. As a consequence, our result gives new bounds on the volume of the tubular neighborhood of the image of a polynomial map and a semialgebraic set, where results for varieties by Lotz and Basu-Lerario are not directly applicable. We apply our theory to three main application domains. Firstly, we derive a near-optimal bound on the covering number of low rank CP tensors. Secondly, we prove a bound on the sketching dimension for (general) polynomial optimization problems. Lastly, we deduce generalization error bounds for deep neural networks with rational or ReLU activations, improving or matching the best known results in the literature.

摘要
我们证明了实数型变量的覆盖数目的Upper bound，包括映射 polynomial maps 和 semi-algebraic sets。该 bound 明显超越了最佳known 通用 bound 由 Yomdin-Comte，并且其证明方式很直观。因此，我们的结果为 tubular neighborhood 的 image of a polynomial map 和 semi-algebraic sets 提供了新的 bound，其中 Lotz 和 Basu-Lerario 的结果不直接适用。我们在以下三个主要应用领域中应用了我们的理论：首先，我们从 low rank CP tensors 中 derivate 一个 near-optimal bound on the covering number。其次，我们证明了 (general) polynomial optimization problems 的 sketching dimension 的 bound。最后，我们从 deep neural networks 中 deduce generalization error bounds，并与 literature 中最佳的结果匹配或超越。

Personalized Online Federated Learning with Multiple Kernels

paper_url: http://arxiv.org/abs/2311.05108
repo_url: https://github.com/pouyamghari/pof-mkl
paper_authors: Pouya M. Ghari, Yanning Shen
For: The paper is written for online non-linear function approximation using multi-kernel learning (MKL) in a federated learning setting.* Methods: The paper proposes an algorithmic framework for clients to communicate with the server and send their updates with affordable communication cost, while employing a large dictionary of kernels. The paper also uses random feature (RF) approximation to enable scalable online federated MKL.* Results: The paper proves that each client enjoys sub-linear regret with respect to the RF approximation of its best kernel in hindsight, indicating that the proposed algorithm can effectively deal with heterogeneity of the data distributed among clients. Experimental results on real datasets showcase the advantages of the proposed algorithm compared with other online federated kernel learning ones.

Abstract
Multi-kernel learning (MKL) exhibits well-documented performance in online non-linear function approximation. Federated learning enables a group of learners (called clients) to train an MKL model on the data distributed among clients to perform online non-linear function approximation. There are some challenges in online federated MKL that need to be addressed: i) Communication efficiency especially when a large number of kernels are considered ii) Heterogeneous data distribution among clients. The present paper develops an algorithmic framework to enable clients to communicate with the server to send their updates with affordable communication cost while clients employ a large dictionary of kernels. Utilizing random feature (RF) approximation, the present paper proposes scalable online federated MKL algorithm. We prove that using the proposed online federated MKL algorithm, each client enjoys sub-linear regret with respect to the RF approximation of its best kernel in hindsight, which indicates that the proposed algorithm can effectively deal with heterogeneity of the data distributed among clients. Experimental results on real datasets showcase the advantages of the proposed algorithm compared with other online federated kernel learning ones.

摘要

GeoFormer: Predicting Human Mobility using Generative Pre-trained Transformer (GPT)

paper_url: http://arxiv.org/abs/2311.05092
repo_url: None
paper_authors: Aivin V. Solatorio
for: 预测人类流动性有重要实践价值，应用范围从增强自然灾害风险规划到抑制流行病蔓延。
methods: 我们提出了GeoFormer模型，基于GPT架构的解码器只模型，用于预测人类流动性。我们在HuMob Challenge 2023中rigorously测试了我们的模型，这是一个用于评估预测模型性能的竞赛，使用标准化数据集来预测人类流动性。
results: GeoFormer在HuMob Challenge 2023中表现出色，在两个数据集上都达到了优秀的成绩，并在使用的两个性能指标（GEO-BLEU和Dynamic Time Warping）上表现出优异。这种成绩表明GeoFormer在人类流动性预测方面具有很大的潜力，可以为灾害预 preparation、疫病控制等领域做出重要贡献。

Abstract
Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility. Our proposed model is rigorously tested in the context of the HuMob Challenge 2023 -- a competition designed to evaluate the performance of prediction models on standardized datasets to predict human mobility. The challenge leverages two datasets encompassing urban-scale data of 25,000 and 100,000 individuals over a longitudinal period of 75 days. GeoFormer stands out as a top performer in the competition, securing a place in the top-3 ranking. Its success is underscored by performing well on both performance metrics chosen for the competition -- the GEO-BLEU and the Dynamic Time Warping (DTW) measures. The performance of the GeoFormer on the HuMob Challenge 2023 underscores its potential to make substantial contributions to the field of human mobility prediction, with far-reaching implications for disaster preparedness, epidemic control, and beyond.

摘要
预测人类流动具有重要的实用价值，其应用范围包括提高灾害风险规划和模拟流行病传播。在这篇论文中，我们提出了GeoFormer模型，是基于GPT架构的解码器只 трансформа器模型，用于预测人类流动。我们的提议模型在2023年的HuMob挑战中得到了证明，并在使用两个都市规模的数据集上进行了严格的测试。这两个数据集分别包含25,000和100,000名人员的城市规模数据，时间长度为75天。GeoFormer在HuMob挑战中表现出色，在两个选择的性能指标上都取得了优秀的成绩，即GEO-BLEU和动态时间戳推准（DTW）度量。GeoFormer在HuMob挑战中的表现证明了其在人类流动预测方面的潜在作用，对于灾害准备、流行病控制等领域有广泛的应用前景。

Generalized test utilities for long-tail performance in extreme multi-label classification

paper_url: http://arxiv.org/abs/2311.05081
repo_url: None
paper_authors: Erik Schultheis, Marek Wydmuch, Wojciech Kotłowski, Rohit Babbar, Krzysztof Dembczyński
for: 本文关注于EXTREME MULTI-LABEL CLASSIFICATION（XMLC）任务中，选择一小 subsets of relevant labels。
methods: 本文提出了一种基于“at k”通用指标的解决方案，通过对预测结果进行权重赋值，提高长尾标签的准确率。
results: 本文的算法基于块协调增加法，可以轻松扩展到XMLC问题，并在实验中显示了良好的长尾性能。

Abstract
Extreme multi-label classification (XMLC) is the task of selecting a small subset of relevant labels from a very large set of possible labels. As such, it is characterized by long-tail labels, i.e., most labels have very few positive instances. With standard performance measures such as precision@k, a classifier can ignore tail labels and still report good performance. However, it is often argued that correct predictions in the tail are more interesting or rewarding, but the community has not yet settled on a metric capturing this intuitive concept. The existing propensity-scored metrics fall short on this goal by confounding the problems of long-tail and missing labels. In this paper, we analyze generalized metrics budgeted "at k" as an alternative solution. To tackle the challenging problem of optimizing these metrics, we formulate it in the expected test utility (ETU) framework, which aims at optimizing the expected performance on a fixed test set. We derive optimal prediction rules and construct computationally efficient approximations with provable regret guarantees and robustness against model misspecification. Our algorithm, based on block coordinate ascent, scales effortlessly to XMLC problems and obtains promising results in terms of long-tail performance.

摘要
极端多标签分类（XMLC）是选择一小 subsets of 可能的标签中的一些有用标签的任务。因此，它通常有长尾标签，即大多数标签只有几个正例。使用标准的性能度量，如精度@k，一个分类器可以忽略尾标签并仍然报告良好的性能。然而，社区没有一个准确预测在尾标签的度量，因为潜在的标签是多样化的。现有的潜在度量遗弃了长尾和缺失标签的问题。在这篇论文中，我们分析通过"at k"的一般度量来解决这个问题。为了解决这个挑战，我们在预测测试用用户（ETU）框架中形式化问题，该框架目的是在固定的测试集上优化预测性能。我们 derivated 优化预测规则和计算效率的近似方法，并证明了对模型误差的Robustness和可靠性。我们的算法，基于块坐标升降，可以轻松扩展到 XMLC 问题，并在长尾性能方面获得了有优的结果。

paper_url: http://arxiv.org/abs/2311.05079
repo_url: None
paper_authors: Anant Shukla, Martin Jurecek, Mark Stamp
for: 寻找社交媒体平台上的机器人活动，以保护在线讨论的准确性和避免网络犯罪。
methods: 使用生成对抗网络（GAN）进行机器人检测，并通过多个检察器对一个生成器进行训练，以解决模式塌缩问题。
results: 我们的方法在这个领域的分类精度上超越了现有的技术，并且展示了如何使用生成器进行数据增强和逃避类分类技术的检测。

Abstract
Bot activity on social media platforms is a pervasive problem, undermining the credibility of online discourse and potentially leading to cybercrime. We propose an approach to bot detection using Generative Adversarial Networks (GAN). We discuss how we overcome the issue of mode collapse by utilizing multiple discriminators to train against one generator, while decoupling the discriminator to perform social media bot detection and utilizing the generator for data augmentation. In terms of classification accuracy, our approach outperforms the state-of-the-art techniques in this field. We also show how the generator in the GAN can be used to evade such a classification technique.

摘要
社交媒体平台上的机器人活动是一种广泛的问题，会推翻在线讨论的准确性并可能导致网络犯罪。我们提出一种使用生成对抗网络（GAN）的方法来探测机器人。我们解决了模式塌缩问题，通过多个检测器来训练一个生成器，同时将检测器与生成器分离，以便在社交媒体上检测机器人，并使用生成器进行数据增强。在分类精度方面，我们的方法超过了当前领域的技术。此外，我们还示出了使用生成器在GAN中逃脱这种分类技术的方法。

2023-11-09

eess.SP

eess.SP - 2023-11-09

38.7 GHz Thin Film Lithium Niobate Acoustic Filter

paper_url: http://arxiv.org/abs/2311.05712
repo_url: None
paper_authors: Omar Barrera, Sinwoo Cho, Jack Kramer, Vakhtang Chulukhadze, Joshua Campbell, Ruochen Lu
for: 这个论文是为了探讨了5G millimeter waves频率范围2（FR2）band的薄膜电 piezoelectric acoustic filter技术的发展。
methods: 论文使用了薄膜LiNbO3共振器，通过压缩膜厚度至sub-50nm来实现操作频率在5G FR2 band。高电子机械相互作用（k2）和质因子（Q）的first-order antisymmetric（A1）模式共振器在128Y-cut LiNbO3中共同实现了第一个mmWave acoustic filter。
results: 论文实现了5.63dB的插入损耗（IL）和17.6%的三分之一带宽（FBW），表明薄膜 piezoelectric resonators可以在5G FR2 band上操作。

Abstract
In this work, a 38.7 GHz acoustic wave ladder filter exhibiting insertion loss (IL) of 5.63 dB and 3-dB fractional bandwidth (FBW) of 17.6% is demonstrated, pushing the frequency limits of thin-film piezoelectric acoustic filter technology. The filter achieves operating frequency up to 5G millimeter wave (mmWave) frequency range 2 (FR2) bands, by thinning thin-film LiNbO3 resonators to sub-50 nm thickness. The high electromechanical coupling (k2) and quality factor (Q) of first-order antisymmetric (A1) mode resonators in 128 Y-cut lithium niobate (LiNbO3) collectively enable the first acoustic filters at mmWave. The key design consideration of electromagnetic (EM) resonances in interdigitated transducers (IDT) is addressed and mitigated. These results indicate that thin-film piezoelectric resonators could be pushed to 5G FR2 bands. Further performance enhancement and frequency scaling calls for better resonator technologies and EM-acoustic filter co-design.

摘要
在这项工作中，一种功率为38.7 GHz的声波级滤波器被实现，其插入损耗（IL）为5.63 dB，三分之一带宽（FBW）为17.6%。这种滤波器可以在2（FR2）频率段中操作，通过使用薄膜键石陶瓷（LiNbO3）共振器来减少膜厚至下50 nm。高电机电共振（k2）和质因子（Q）的首频模式共振器在128Y扁板键石陶瓷（LiNbO3）中共同实现了第一个声波滤波器在mmWave频率范围内。对声电共振器（IDT）中的电磁共振的设计考虑和控制也得到了解决。这些结果表明，薄膜键石陶瓷共振器可以在5G FR2频率段内操作。进一步提高性能和频率缩放需要更好的共振器技术和电磁-声波滤波器共设计。

Uncertainty-Aware Bayes’ Rule and Its Applications

paper_url: http://arxiv.org/abs/2311.05532
repo_url: https://github.com/spratm-asleaf/bayes-rule
paper_authors: Shixiong Wang
for: This paper aims to address the issue of model misspecifications in prior distributions and/or data distributions, and to develop a generalized Bayes’ rule to combat these uncertainties.
methods: The paper proposes an uncertainty-aware Bayes’ rule, which upweights or downweights prior beliefs and data evidence based on the relative importance of prior and data distributions. The paper also derives three uncertainty-aware filtering algorithms: the uncertainty-aware Kalman filter, the uncertainty-aware particle filter, and the uncertainty-aware interactive multiple model filter.
results: The paper presents simulated and real-world experiments that demonstrate the superiority of the uncertainty-aware Bayes’ rule and the three uncertainty-aware filtering algorithms over the conventional Bayes’ rule and other state-of-the-art methods.

Abstract
Bayes' rule has enabled innumerable powerful algorithms of statistical signal processing and statistical machine learning. However, when there exist model misspecifications in prior distributions and/or data distributions, the direct application of Bayes' rule is questionable. Philosophically, the key is to balance the relative importance of prior and data distributions when calculating posterior distributions: if prior (resp. data) distributions are overly conservative, we should upweight the prior belief (resp. data evidence); if prior (resp. data) distributions are overly opportunistic, we should downweight the prior belief (resp. data evidence). This paper derives a generalized Bayes' rule, called uncertainty-aware Bayes' rule, to technically realize the above philosophy, i.e., to combat the model uncertainties in prior distributions and/or data distributions. Simulated and real-world experiments showcase the superiority of the presented uncertainty-aware Bayes' rule over the conventional Bayes' rule: In particular, the uncertainty-aware Kalman filter, the uncertainty-aware particle filter, and the uncertainty-aware interactive multiple model filter are suggested and validated.

摘要
贝叶斯公式在统计信号处理和统计机器学习中实现了无数可能的强大算法。然而，当存在模型偏差在先后分布和/或数据分布中时，直接应用贝叶斯公式是有问题的。哲学上，关键是在计算后期分布时平衡先后分布和数据分布之间的相对重要性：如果先前分布（resp. 数据分布）太保守，我们应该增加先前信念（resp. 数据证据）的重要性；如果先前分布（resp. 数据分布）太机会主义，我们应该减少先前信念（resp. 数据证据）的重要性。这篇论文提出一种扩展的贝叶斯公式，called uncertainty-aware Bayes' rule，以技术实现上述哲学。通过实验和实际应用，论文显示了 uncertainty-aware Bayes' rule 的超越性，比如不确定性感知 kalman filter、不确定性感知 particle filter 和不确定性感知多模型过滤器。

EEG-DG: A Multi-Source Domain Generalization Framework for Motor Imagery EEG Classification

paper_url: http://arxiv.org/abs/2311.05415
repo_url: https://github.com/xc-zhonghit/eeg-dg
paper_authors: Xiao-Cong Zhong, Qisong Wang, Dan Liu, Zhihuang Chen, Jing-Xiao Liao, Jinwei Sun, Yudong Zhang, Feng-Lei Fan
for: 这个研究旨在提高非侵入式脑-电脑交互（BCI）中的电脑视觉实验（EEG）类型识别率。
methods: 这个研究使用了多源领域通用框架（EEG-DG），具体来说是使用多个来源领域的不同统计分布来建立一个可靠的分类模型。
results: 研究表明，EEG-DG比前一代方法更高效，具体来说是在一个模拟数据集和两个BCI竞赛数据集IV-2a和IV-2b上，EEG-DG的分类率分别为81.79%和87.12%，而且甚至超过了一些领域适应方法。

Abstract
Motor imagery EEG classification plays a crucial role in non-invasive Brain-Computer Interface (BCI) research. However, the classification is affected by the non-stationarity and individual variations of EEG signals. Simply pooling EEG data with different statistical distributions to train a classification model can severely degrade the generalization performance. To address this issue, the existing methods primarily focus on domain adaptation, which requires access to the target data during training. This is unrealistic in many EEG application scenarios. In this paper, we propose a novel multi-source domain generalization framework called EEG-DG, which leverages multiple source domains with different statistical distributions to build generalizable models on unseen target EEG data. We optimize both the marginal and conditional distributions to ensure the stability of the joint distribution across source domains and extend it to a multi-source domain generalization framework to achieve domain-invariant feature representation, thereby alleviating calibration efforts. Systematic experiments on a simulative dataset and BCI competition datasets IV-2a and IV-2b demonstrate the superiority of our proposed EEG-DG over state-of-the-art methods. Specifically, EEG-DG achieves an average classification accuracy/kappa value of 81.79%/0.7572 and 87.12%/0.7424 on datasets IV-2a and IV-2b, respectively, which even outperforms some domain adaptation methods. Our code is available at https://github.com/XC-ZhongHIT/EEG-DG for free download and evaluation.

摘要
electromyography（EMG）幻象分类在无侵入式大脑-计算机交互（BCI）研究中扮演着关键性的角色。然而，分类受到EMG信号的非站点性和个体差异的影响。将EMG数据不同统计分布 pool 以train 分类模型可能导致极差的泛化性能。为解决这个问题，现有的方法主要集中在领域适应中，需要在训练过程中获取目标数据。这在许多EMG应用场景中是不现实的。在这篇论文中，我们提出了一种新的多源领域总结框架，称为EEG-DG，它利用不同统计分布的多个源领域来建立可靠的分类模型。我们同时优化了两个分布的独立和 Conditional distribution，以确保在源领域之间的联合分布的稳定性，并将其扩展为多源领域总结框架，以实现领域 invariant feature representation，从而减少准确化努力。系统性实验在一个模拟数据集和 BCIC 竞赛数据集 IV-2a 和 IV-2b 上表明，我们的提出的EEG-DG 超过了现有方法的性能。具体来说，EEG-DG 在 IV-2a 和 IV-2b 数据集上的平均分类精度/κ值为 81.79%/0.7572 和 87.12%/0.7424，甚至超过了一些领域适应方法。我们的代码可以免费下载和评估于 GitHub 上的。

Joint Angle and Delay Cramér-Rao Bound Optimization for Integrated Sensing and Communications

paper_url: http://arxiv.org/abs/2311.05372
repo_url: None
paper_authors: Chao Hu, Yuan Fang, Ling Qiu
for: 本文研究了一种多输入多输出（MIMO）扩容设计，用于一个集成感知通信（ISAC）系统中的基站（BS），该BS通过多个下降用户进行通信，同时将通信信号重复使用于多个目标的感知。
methods: 我们首先 derivated Cramér-Rao bound（CRB） для角度和延迟参数的估计。然后，我们使用了 transmit beamforming 优化，以最小化 CRB，并且保证通信率和功率限制。在单目标单用户情况下，我们得到了唯一解的closed-form解。在多目标多用户情况下，我们证明了优化解的稀疏性，从而降低了优化过程的计算复杂性。
results: numerical results 表明，优化的扩容设计可以实现出色的定位性能，并有效地减少了基站antenna的数量要求。

Abstract
In this paper, we study a multi-input multi-output (MIMO) beamforming design in an integrated sensing and communication (ISAC) system, in which an ISAC base station (BS) is used to communicate with multiple downlink users and simultaneously the communication signals are reused for sensing multiple targets. Our interested sensing parameters are the angle and delay information of the targets, which can be used to locate these targets. Under this consideration, we first derive the Cram\'{e}r-Rao bound (CRB) for angle and delay estimation. Then, we optimize the transmit beamforming at the BS to minimize the CRB, subject to communication rate and power constraints. In particular, we obtain the optimal solution in closed-form in the case of single-target and single-user, and in the case of multi-target and multi-user scenario, the sparsity of the optimal solution is proven, leading to a reduction in computational complexity during optimization. The numerical results demonstrate that the optimized beamforming yields excellent positioning performance and effectively reduces the requirement for a large number of antennas at the BS.

摘要
在这篇论文中，我们研究了一种多输入多输出（MIMO）扩展探测通信（ISAC）系统中的扩展探测设计，其中一个ISAC基站（BS）用于通信多个下降用户，同时通信信号被重复用于探测多个目标。我们对于探测参数的 interessets是目标的角度和延迟信息，这些信息可以用来定位这些目标。在这种情况下，我们首先 derivethe Cramér-Rao bound（CRB） для角度和延迟估计。然后，我们在BS中优化发射扩展来最小化CRB，具体来说是subject to通信率和功率约束。在具体实现中，我们在单目标单用户情况下获得了closed-form的优化解，而在多目标多用户情况下，我们证明了优化解的稀疏性，从而降低了计算复杂性。numerical results表明，优化的扩展探测可以很好地定位性能和减少了BS需要的antenna数量。

Energy-Efficient Analog Beamforming for RF-WET with Charging Time Constraint

paper_url: http://arxiv.org/abs/2311.05325
repo_url: None
paper_authors: Osmel Martínez Rosabal, Onel L. Alcaraz López, Hirley Alves
for: 这篇论文是为了解决互联网物联网（IoT）可持续性问题，具体来说是通过无线频率无线能量传输（RF-WET）技术实现。
methods: 本文提出了一种时分 Multiple-Input Multiple-Output（MIMO）技术，通过将多个antenna的能量报kafina（PB）分配给低功率设备的能量收集电路，使得设备的能量收集效率最高，从而实现最低的能源消耗。
results: 研究结果表明，相比往常的参考方案，我们的RF-WET策略可以更好地为IoT设备提供能量，而且与antenna数量增加时，性能越来越好。

Abstract
Internet of Things (IoT) sustainability may hinge on radio frequency wireless energy transfer (RF-WET). However, energy-efficient charging strategies are still needed, motivating our work. Specifically, this letter proposes a time division scheme to efficiently charge low-power devices in an IoT network. For this, a multi-antenna power beacon (PB) drives the devices' energy harvesting circuit to the highest power conversion efficiency point via energy beamforming, thus achieving minimum energy consumption. Herein, we adopt the analog multi-antenna architecture due to its low complexity, cost, and energy consumption. The proposal includes a simple yet accurate model for the transfer characteristic of the energy harvesting circuit, enabling the optimization framework. The results evince the effectiveness of our RF-WET strategy over a benchmark scheme where the PB charges all the IoT devices simultaneously. Furthermore, the performance increases with the number of PB antennas.

摘要
互联网智能物件（IoT）可持续性可能取决于无线频率无线能量传输（RF-WET）。然而，仍需要能效的充电策略，这使我们的工作感到推动。特别是，这封信函描述了一种时分多址方案，以高效地充电低功率设备在IoT网络中。在这个方案中，一个多antenna能量扩散器（PB）驱动设备的能量从抽取到最高熵转换效率点，以获得最小的能量消耗。我们采用了分析多antenna架构，因为它具有低的复杂度、成本和能量消耗。我们的提案包括一个简单又准确的转换特性模型，实现优化框架。结果显示了我们的RF-WET策略比对benchmark方案，在充电所有IoT设备的情况下更有效。此外，性能随着PB天线的数量增加。

Empowering high-dimensional optical fiber communications with integrated photonic processors

paper_url: http://arxiv.org/abs/2311.05282
repo_url: None
paper_authors: Kaihang Lu, Zengqi Chen, Hao Chen, Wu Zhou, Zunyue Zhang, Hon Ki Tsang, Yeyu Tong
for: 这个论文旨在描述一种高级光纤通信系统，可以完全由可重新配置的光学处理器实现，并且可以处理六个空间和波分谱模式。
methods: 该系统使用了光学混合技术，包括多模式传输器和全光学排序接收器。
results: 实验表明，该系统可以高效地处理六个空间和波分谱模式，并且可以高质量地生成电子信号。

Abstract
Mode division multiplexing (MDM) in optical fibers enables multichannel capabilities for various applications, including data transmission, quantum networks, imaging, and sensing. However, MDM optical fiber systems, usually necessities bulk-optics approaches for launching different orthogonal fiber modes into the multimode optical fiber, and multiple-input multiple-output digital electronic signal processing at the receiver side to undo the arbitrary mode scrambling in a circular-core optical fiber. Here we show that a high-dimensional optical fiber communication system can be entirely implemented by a reconfigurable integrated photonic processor, featuring kernels of multichannel mode multiplexing transmitter and all-optical descrambling receiver. High-speed and inter-chip communications involving six spatial- and polarization modes have been experimentally demonstrated with high efficiency and high-quality eye diagrams, despite the presence of random mode scrambling and polarization rotation in a circular-core few-mode fiber. The proposed photonic integration approach holds promising prospects for future space-division multiplexing applications.

摘要
Here, we demonstrate that a high-dimensional optical fiber communication system can be entirely implemented by a reconfigurable integrated photonic processor, featuring kernels of multichannel mode multiplexing transmitter and all-optical descrambling receiver. High-speed and inter-chip communications involving six spatial- and polarization modes have been experimentally demonstrated with high efficiency and high-quality eye diagrams, despite the presence of random mode scrambling and polarization rotation in a circular-core few-mode fiber.The proposed photonic integration approach holds promising prospects for future space-division multiplexing applications.中文翻译：Mode division multiplexing（MDM）在光纤中实现多个通道，用于数据传输、量子网络、成像和探测等应用。然而，现有的MDM光纤系统通常需要使用填充光学方法将不同的平行光纤模式入库多模光纤，以及接收端多输入多输出的数字电子处理器来解除圆柱形光纤中的随机模式混乱。在这里，我们展示了一种完全由可重新配置的光子处理器实现的高维度光纤通信系统，包括多模式多plexing发射器和全光学减少接收器。我们在实验中成功地实现了六个空间和极化模式之间的高速交换和 между板通信，并且具有高效率和高质量眼agram。尽管存在圆柱形少模光纤中的随机模式混乱和极化转换，但是我们的光子集成方法仍然保持了未来空间分多plexing应用的良好前景。

Few-Shot Recognition and Classification of Jamming Signal via CGAN-Based Fusion CNN Algorithm

paper_url: http://arxiv.org/abs/2311.05273
repo_url: None
paper_authors: Xuhui Ding, Yue Zhang, Gaoyang Li, Neng Ye, Yuting Guo, Takuya Mabuchi, Hitomi Anzai, Kai Yang
for: 解决深度学习在实际通信系统中应用时遇到的困难，即突发性干扰信号的识别问题。
methods: 提出一种基于条件生成型 adversarial网络（CGAN）和卷积神经网络（CNN）的融合算法，以解决深度学习在实际通信系统中应用时遇到的困难。
results: 比前一代方法提高8%的准确率，并在有限的数据集上进行了验证。通过使用实际的卫星通信场景的硬件平台进行模拟，并对时域信号数据进行验证，实验结果表明我们的算法在实际通信场景中仍然表现出色。

Abstract
The precise classification of jamming signals holds paramount significance in the effective implementation of anti-jamming strategies within communication systems subject to intricate environmental variables. In light of this imperative, we propose an innovative fusion algorithm based on conditional generative adversarial network (CGAN) and convolutional neural network (CNN) to solve the problem of difficulty in applying deep learning (DL) algorithms due to the instantaneous nature of jamming signals in practical communication systems. Compared with previous methods, our algorithm achieved an 8% improvement in accuracy even when working with a limited dataset. Unlike previous research, we have simulated real-world satellite communication scenarios using a hardware platform and validated our algorithm using the resulting time-domain waveform data. The experimental results indicate that our algorithm still performs extremely well, which demonstrates significant potential for practical application in real-world communication scenarios.

摘要
“ jamming 信号的精确分类对于实现有效的反干扰策略在受到复杂环境变量的通信系统中具有极高的重要性。在这一点上，我们提出了一种基于条件生成 adversarial network (CGAN) 和卷积神经网络 (CNN) 的创新融合算法，以解决深度学习 (DL) 算法在实际通信系统中应用时的困难。与前一代方法相比，我们的算法在有限数据集上实现了8%的提升精度。不同于前一些研究，我们在硬件 пла台上模拟了真实的卫星通信场景，并使用时域波形数据验证了我们的算法。实验结果表明，我们的算法在实际通信场景中仍然表现出色，这表明它在实际应用中具有极高的潜力。”Note: The translation is in Simplified Chinese, which is the standard form of Chinese used in mainland China and Singapore. If you need Traditional Chinese, please let me know.

Delay Doppler Transform

paper_url: http://arxiv.org/abs/2311.05236
repo_url: None
paper_authors: Xiang-Gen Xia
for: 这篇论文是为了研究延迟Doppler变换（DDT）在时域信号中的应用。
methods: 本研究使用了延迟Doppler变换（DDT）来描述时域信号中的延迟和Doppler问题。
results: 研究发现，DDT 可以帮助我们更好地理解延迟Doppler通道的特性，并且提供了一些实用的性能评估方法。Translation:
for: This paper is to study the application of delay Doppler transform (DDT) in time domain signals.
methods: The study uses delay Doppler transform (DDT) to describe the delay and Doppler issues in time domain signals.
results: The research finds that DDT can help us better understand the characteristics of delay Doppler channels and provide practical performance evaluation methods.

Abstract
This letter is to introduce delay Doppler transform (DDT) for a time domain signal. It is motivated by the recent studies in wireless communications over delay Doppler channels that have both time and Doppler spreads, such as, satellite communication channels. We present some simple properties of DDT as well. The DDT study may provide insights of delay Doppler channels.

摘要
这封信是为引入延迟Doppler变换（DDT），用于处理时域信号。这是由于最近关于无线通信频率上的延迟Doppler通道的研究而出发的，这些通道具有时间和Doppler扩散。我们将介绍一些简单的DDT性质，以及它们在延迟Doppler通道上的应用。这些研究可能会为延迟Doppler通道提供新的思路。

Coverage and Rate Analysis for Cell-Free LEO Satellite Networks

paper_url: http://arxiv.org/abs/2311.05189
repo_url: None
paper_authors: Xiangyu Li, Bodong Shang, Na Deng, Shanzhi Chen
for: investigate an architecture of cell-free (CF) LEO satellite (CFLS) networks from a system-level perspective to improve quality-of-service (QoS)
methods: use multiple satellites to serve a user, and analyze the coverage and rate of a typical user in the CFLS network
results: the CFLS network achieves a higher coverage probability than the traditional single satellite-supported network, and user’s ergodic rate is maximized by selecting an appropriate number of serving satellites.Here’s the full text in Simplified Chinese:
for: 这篇论文是 investigate cell-free (CF) LEO satellite (CFLS) 网络的系统层次设计，以提高质量服务 (QoS)
methods: 使用多颗卫星服务用户，并分析CFLS 网络中典型用户的覆盖率和速率
results: CFLS 网络的覆盖率高于传统单颗卫星支持的网络，用户的平均速率可以通过选择合适的服务卫星来最大化。

Abstract
Low-earth orbit (LEO) satellite communication is one of the enabling key technologies in next-generation (6G) networks. However, single satellite-supported downlink communication may not meet user's needs due to limited signal strength, especially in emergent scenarios. In this letter, we investigate an architecture of cell-free (CF) LEO satellite (CFLS) networks from a system-level perspective, where a user can be served by multiple satellites to improve its quality-of-service (QoS). Furthermore, we analyze the coverage and rate of a typical user in the CFLS network. Simulation and numerical results show that the CFLS network achieves a higher coverage probability than the traditional single satellite-supported network. Moreover, user's ergodic rate is maximized by selecting an appropriate number of serving satellites.

摘要

Integrated Sensing and Communication for Network-Assisted Full-Duplex Cell-Free Distributed Massive MIMO Systems

paper_url: http://arxiv.org/abs/2311.05101
repo_url: None
paper_authors: Fan Zeng, Jingxuan Yu, Jiamin Li, Feiyang Liu, Dongming Wang, Xiaohu You
for: 本研究旨在实现Integrated Sensing and Communication（ISAC）系统， combining network-assisted full-duplex（NAFD）技术和分布式雷达探测。
methods: 该系统采用了具有通信和探测能力的下行和上行远程广播单元（RRU）。
results: 对比其他ISAC方案，提出的方案可提供更稳定的探测和更好的通信性能。此外，提出了两种功率分配算法，可以同时优化通信和探测性能。

Abstract
In this paper, we combine the network-assisted full-duplex (NAFD) technology and distributed radar sensing to implement integrated sensing and communication (ISAC). The ISAC system features both uplink and downlink remote radio units (RRUs) equipped with communication and sensing capabilities. We evaluate the communication and sensing performance of the system using the sum communication rates and the Cramer-Rao lower bound (CRLB), respectively. We compare the performance of the proposed scheme with other ISAC schemes, the result shows that the proposed scheme can provide more stable sensing and better communication performance. Furthermore, we propose two power allocation algorithms to optimize the communication and sensing performance jointly. One algorithm is based on the deep Q-network (DQN) and the other one is based on the non-dominated sorting genetic algorithm II (NSGA-II). The proposed algorithms provide more feasible solutions and achieve better system performance than the equal power allocation algorithm.

摘要
在这篇论文中，我们将网络协助全双工（NAFD）技术和分布式雷达探测结合，实现集成探测通信（ISAC）系统。ISAC系统包括上行和下行远程广播单元（RRU），各自携带通信和探测能力。我们使用总通信速率和克拉默-拉奥lower bound（CRLB）评估系统的通信和探测性能。与其他ISAC方案相比，我们的方案可以提供更稳定的探测和更好的通信性能。此外，我们提出了两种功率分配算法来优化通信和探测性能：一种是基于深度Q网络（DQN），另一种是基于非通过遗传算法II（NSGA-II）。这两种算法可以提供更实际的解决方案，并且可以在系统性能上做出更好的优化。

2023-11-08

eess.AS

eess.AS - 2023-11-08

1-step Speech Processing and Understanding Using CTC Loss

paper_url: http://arxiv.org/abs/2311.04753
repo_url: None
paper_authors: Karan Singla, Shahab Jalavand, Yeon-Jun Kim, Antonio Moreno Daniel, Srinivas Bangalore, Andrej Ljolje, Ben Stern
for: 提高自然语言处理系统的命名实体识别和意图识别能力
methods: 使用Connectionist Temporal Classification（CTC）损失进行端到端语音识别编码器的优化，并添加了一组未使用的占位符号来扩展自动语音识别系统的词汇
results: 在SLUE benchmark上实现了明显的命名实体标记、意图识别和译文准确率提高，并且与SLURP数据集的结果相当Here’s a more detailed explanation of each point:
for: The paper is written to improve the ability of natural language processing systems to recognize named entities and intent in speech.
methods: The authors propose a solution that extends the vocabulary of the end-to-end automatic speech recognition (ASR) system by adding a set of unused placeholder symbols, which are then assigned to represent semantic tags. These placeholders are integrated into the transcription process as distinct tokens.
results: The proposed solution achieves notable improvements in entity tagging, intent discernment, and transcription accuracy on the SLUE benchmark, and the results are on par with those for the SLURP dataset. Additionally, the authors provide a visual analysis of the system’s proficiency in accurately pinpointing meaningful tokens over time, illustrating the enhancement in transcription quality through the utilization of supplementary semantic tags.

Abstract
Recent studies have made some progress in refining end-to-end (E2E) speech recognition encoders by applying Connectionist Temporal Classification (CTC) loss to enhance named entity recognition within transcriptions. However, these methods have been constrained by their exclusive use of the ASCII character set, allowing only a limited array of semantic labels. Our proposed solution extends the E2E automatic speech recognition (ASR) system's vocabulary by adding a set of unused placeholder symbols, conceptually akin to the tokens used in sequence modeling. These placeholders are then assigned to represent semantic tags and are integrated into the transcription process as distinct tokens. We demonstrate notable improvements in entity tagging, intent discernment, and transcription accuracy on the SLUE benchmark and yields results that are on par with those for the SLURP dataset. Additionally, we provide a visual analysis of the system's proficiency in accurately pinpointing meaningful tokens over time, illustrating the enhancement in transcription quality through the utilization of supplementary semantic tags.

摘要
Simplified Chinese translation:最近的研究已经做出了一些进步，用Connectionist Temporal Classification（CTC）损失来提高名称识别 within 转录。然而，这些方法受限于它们仅使用 ASCII 字符集，只能处理有限数量的 semantic label。我们的提议的解决方案是将 E2E 自动语音识别（ASR）系统的词汇表扩展到添加一组未使用的 placeholder symbol，类似于 sequence modeling 中的 token。这些 placeholder 然后被分配到表示 semantic tag 的各种符号，并被 integrating 到转录过程中作为特定的 tokens。我们在 SLUE 标准集上示出了明显的提高，包括实体标记、意图识别和转录精度。此外，我们还提供了一种可视化分析，表明系统在时间上准确地标记了意义上的 token， thereby illustrating the enhancement in transcription quality through the use of supplementary semantic tags。

Selective HuBERT: Self-Supervised Pre-Training for Target Speaker in Clean and Mixture Speech

paper_url: http://arxiv.org/abs/2311.04526
repo_url: None
paper_authors: Jingru Lin, Meng Ge, Wupeng Wang, Haizhou Li, Mengling Feng
for: 这篇论文的目的是提出一种新的自我超vision演示模型，以实现选择性的对话声音抽象，并且能够在各种声音处理任务中提供高性能。
methods: 这篇论文使用了一种新的预训练方法，称为Selective-HuBERT（SHuBERT），它通过预测目标说话者的pseudo标签，并且使用了双路训练策略和跨相关约束，以实现选择性地对声音进行抽象。
results: 实验结果显示，SHuBERT可以在SUPERB评量标准和LibriMix数据集上达到高性能，并且能够在实际应用中提供高质量的声音抽象，甚至在具有极低量的标签资料下进行优化。

Abstract
Self-supervised pre-trained speech models were shown effective for various downstream speech processing tasks. Since they are mainly pre-trained to map input speech to pseudo-labels, the resulting representations are only effective for the type of pre-train data used, either clean or mixture speech. With the idea of selective auditory attention, we propose a novel pre-training solution called Selective-HuBERT, or SHuBERT, which learns the selective extraction of target speech representations from either clean or mixture speech. Specifically, SHuBERT is trained to predict pseudo labels of a target speaker, conditioned on an enrolled speech from the target speaker. By doing so, SHuBERT is expected to selectively attend to the target speaker in a complex acoustic environment, thus benefiting various downstream tasks. We further introduce a dual-path training strategy and use the cross-correlation constraint between the two branches to encourage the model to generate noise-invariant representation. Experiments on SUPERB benchmark and LibriMix dataset demonstrate the universality and noise-robustness of SHuBERT. Furthermore, we find that our high-quality representation can be easily integrated with conventional supervised learning methods to achieve significant performance, even under extremely low-resource labeled data.

摘要
自适应预训练语音模型在各种下游语音处理任务中显示出效iveness。由于它们主要预训练为将输入语音映射到 Pseudo-labels，因此生成的表示只有效果于使用的预训练数据类型，可能是干净的语音或混合语音。我们提出了一种新的预训练解决方案 called Selective-HuBERT（SHuBERT），它学习选择提取目标语音表示。特别是，SHuBERT 在预训练时预测目标说话人的 Pseudo-labels，条件在报名说话人的语音上。通过这样做，SHuBERT 可以选择性地听到目标说话人在复杂的声学环境中，从而利于各种下游任务。我们还提出了一种 dual-path 训练策略，并使用两个分支之间的协方差约束来鼓励模型生成难以干扰的表示。在 SUPERB benchmark 和 LibriMix 数据集上进行了实验， demonstrably 表明 SHuBERT 的 universality 和 noise-robustness。此外，我们发现我们的高质量表示可以轻松地与传统的监督学习方法结合使用，即使受到极低资源的标注数据。

2023-11-08

cs.CV

cs.CV - 2023-11-08

Active Transfer Learning for Efficient Video-Specific Human Pose Estimation

paper_url: http://arxiv.org/abs/2311.05041
repo_url: https://github.com/iminthemiddle/vatl4pose-wacv2024
paper_authors: Hiromu Taketsugu, Norimichi Ukita
For: 这个论文的目的是提出一种基于活动学习和传输学习的方法，以有效地适应人体 pose 估计器到个人视频频谱中。* Methods: 该方法使用了热图的变化来衡量估计结果的不确定性，以及全身人体 pose 的不自然性来选择不同和不确定的样本进行有效的估计器学习。此外，该方法还重新评估了现有的活动转移学习方法，并提出了新的重新训练方法和停止 criterion。* Results: 实验结果表明，该方法可以提高学习效率，并在比较方法中得到更好的性能。代码可以在 GitHub 上获取：https://github.com/ImIntheMiddle/VATL4Pose-WACV2024。

Abstract
Human Pose (HP) estimation is actively researched because of its wide range of applications. However, even estimators pre-trained on large datasets may not perform satisfactorily due to a domain gap between the training and test data. To address this issue, we present our approach combining Active Learning (AL) and Transfer Learning (TL) to adapt HP estimators to individual video domains efficiently. For efficient learning, our approach quantifies (i) the estimation uncertainty based on the temporal changes in the estimated heatmaps and (ii) the unnaturalness in the estimated full-body HPs. These quantified criteria are then effectively combined with the state-of-the-art representativeness criterion to select uncertain and diverse samples for efficient HP estimator learning. Furthermore, we reconsider the existing Active Transfer Learning (ATL) method to introduce novel ideas related to the retraining methods and Stopping Criteria (SC). Experimental results demonstrate that our method enhances learning efficiency and outperforms comparative methods. Our code is publicly available at: https://github.com/ImIntheMiddle/VATL4Pose-WACV2024

摘要
人体姿势（HP）估计是正在活跃研究的领域，因为它有很多应用场景。然而，即使使用大量数据进行预训练，HP估计器也可能不能达到预期的性能，这是因为训练和测试数据之间的领域差异。为了解决这个问题，我们提出了一种结合活动学习（AL）和转移学习（TL）的方法，以高效地适应视频域中的HP估计器。为了高效学习，我们的方法量化了（i）估计热图中的时间变化引入的估计不确定性，以及（ii）全身HP估计中的不自然性。这些量化的 критери价然与当前最佳表达性 критериion（SC）有效地结合，以选择不确定和多样的样本，以便高效地学习HP估计器。此外，我们重新考虑了现有的活动转移学习（ATL）方法，并引入了新的重新训练方法和停止标准（SC）。实验结果表明，我们的方法可以提高学习效率，并比相对方法高效。我们的代码可以在 GitHub 上获取：https://github.com/ImIntheMiddle/VATL4Pose-WACV2024

S$^3$AD: Semi-supervised Small Apple Detection in Orchard Environments

paper_url: http://arxiv.org/abs/2311.05029
repo_url: None
paper_authors: Robert Johanson, Christian Wilms, Ole Johannsen, Simone Frintrop
for: 本文 targets 苹果干部检测问题，为精准农业应用，如自动化受产量测量或水果摘取提供了解决方案。
methods: 本文提出了一种半监督的苹果检测方法，使用上下文注意力和选择平铺来改进小苹果的检测，同时减少计算成本。
results: 对于MAD数据集和MSU数据集的广泛评估表明，S$^3$AD系统与多个强大的完全监督基线系统进行比较，达到了最高的$14.9%$的提升。此外，通过利用数据集中苹果属性的详细注释，分析了不同系统对尺寸和干扰度的影响，量化了当前的挑战。

Abstract
Crop detection is integral for precision agriculture applications such as automated yield estimation or fruit picking. However, crop detection, e.g., apple detection in orchard environments remains challenging due to a lack of large-scale datasets and the small relative size of the crops in the image. In this work, we address these challenges by reformulating the apple detection task in a semi-supervised manner. To this end, we provide the large, high-resolution dataset MAD comprising 105 labeled images with 14,667 annotated apple instances and 4,440 unlabeled images. Utilizing this dataset, we also propose a novel Semi-Supervised Small Apple Detection system S$^3$AD based on contextual attention and selective tiling to improve the challenging detection of small apples, while limiting the computational overhead. We conduct an extensive evaluation on MAD and the MSU dataset, showing that S$^3$AD substantially outperforms strong fully-supervised baselines, including several small object detection systems, by up to $14.9\%$. Additionally, we exploit the detailed annotations of our dataset w.r.t. apple properties to analyze the influence of relative size or level of occlusion on the results of various systems, quantifying current challenges.

摘要
减损检测是精准农业应用程序的关键组成部分，如自动化产量估计或果实摘取。然而，减损检测，例如苹果检测在园区环境中仍然是一个挑战，因为缺乏大规模数据集和图像中小型减损的相对比例。在这种情况下，我们通过半upervised的方式改进苹果检测任务。为此，我们提供了大型、高分辨率的数据集MAD，包含105个标注图像和14,667个注解apple实例，以及4,440个未标注图像。基于这个数据集，我们还提出了一种新的半upervised小 apple检测系统S$^3$AD，使用contextual attention和选择分割来改进小apple的检测，同时限制计算过程的开销。我们对MAD和MSU数据集进行了广泛的评估，显示S$^3$AD在比较减损的情况下，与强大的完全supervised基eline相比，提高了14.9%。此外，我们利用我们数据集中 apple 属性的详细注解，分析不同系统在不同的减损和 occlusion 情况下的结果，并对现有挑战进行了质量评估。

Leveraging a realistic synthetic database to learn Shape-from-Shading for estimating the colon depth in colonoscopy images

paper_url: http://arxiv.org/abs/2311.05021
repo_url: None
paper_authors: Josué Ruano, Martín Gómez, Eduardo Romero, Antoine Manzanera
for: 这个研究旨在提高单摄colonoscopy影像中的肠道深度估计，以帮助诊断肠及肛部癌症。
methods: 这个研究使用了一个新的方法来从单摄colonoscopy影像中估计肠道深度图。这个方法基于肠道墙的阴影变化，并由一个专门训练的单元神经网络估计。
results: 这个研究的结果显示了这个方法可以实现高精度的肠道深度估计，具有95.65%的阈值准确性和0.451cm的平均误差。此外，该方法在一些真实的影像中也显示了良好的成果。

Abstract
Colonoscopy is the choice procedure to diagnose colon and rectum cancer, from early detection of small precancerous lesions (polyps), to confirmation of malign masses. However, the high variability of the organ appearance and the complex shape of both the colon wall and structures of interest make this exploration difficult. Learned visuospatial and perceptual abilities mitigate technical limitations in clinical practice by proper estimation of the intestinal depth. This work introduces a novel methodology to estimate colon depth maps in single frames from monocular colonoscopy videos. The generated depth map is inferred from the shading variation of the colon wall with respect to the light source, as learned from a realistic synthetic database. Briefly, a classic convolutional neural network architecture is trained from scratch to estimate the depth map, improving sharp depth estimations in haustral folds and polyps by a custom loss function that minimizes the estimation error in edges and curvatures. The network was trained by a custom synthetic colonoscopy database herein constructed and released, composed of 248,400 frames (47 videos), with depth annotations at the level of pixels. This collection comprehends 5 subsets of videos with progressively higher levels of visual complexity. Evaluation of the depth estimation with the synthetic database reached a threshold accuracy of 95.65%, and a mean-RMSE of 0.451 cm, while a qualitative assessment with a real database showed consistent depth estimations, visually evaluated by the expert gastroenterologist coauthoring this paper. Finally, the method achieved competitive performance with respect to another state-of-the-art method using a public synthetic database and comparable results in a set of images with other five state-of-the-art methods.

摘要
殖colonoscopy是诊断大肠和直肠癌的首选方法，从早期发现小前癌肿瘤（肿块）到确认肿瘤。然而，肠部的高变化性和肠壁的复杂形状使这种探测困难。通过了解视觉和感知能力，技术上的限制可以得到应用。这项工作介绍了一种新的方法，用单摄影探测肠部深度图。该图由肠壁阴影变化与光源的相对位置学习得到，并通过一个自定义的损失函数来改进锐利的深度估计。这种方法使用了自定义的 sintetic colonoscopy 数据库，包括 248,400 帧（47 个视频），其中每帧有像素级别的深度注解。这个收集包括 5 个视频subset，每个 subset 的视觉复杂度逐渐增加。对于这个数据库的评估，depth estimation 达到了95.65%的阈值精度和0.451 cm的平均欧姆误差，而且与真实数据库的评估显示了一致的深度估计。最终，该方法在与另一个状态略方法进行比较时达到了竞争性性能，并在一组图像中与其他五个状态略方法的结果相当。

Familiarity-Based Open-Set Recognition Under Adversarial Attacks

paper_url: http://arxiv.org/abs/2311.05006
repo_url: None
paper_authors: Philip Enevoldsen, Christian Gundersen, Nico Lang, Serge Belongie, Christian Igel
for: 本研究旨在探讨 familiarscore-based open-set recognition 中的攻击问题，以及这些攻击的效果。
methods: 本研究使用了梯度导向的 adversarial 攻击方法，包括 False Familiarity 和 False Novelty 两种类型的攻击。
results: 研究发现，在 informed 和 uninformed 设置下，这些攻击都能够减少 familiarscore-based open-set recognition 的准确率。

Abstract
Open-set recognition (OSR), the identification of novel categories, can be a critical component when deploying classification models in real-world applications. Recent work has shown that familiarity-based scoring rules such as the Maximum Softmax Probability (MSP) or the Maximum Logit Score (MLS) are strong baselines when the closed-set accuracy is high. However, one of the potential weaknesses of familiarity-based OSR are adversarial attacks. Here, we present gradient-based adversarial attacks on familiarity scores for both types of attacks, False Familiarity and False Novelty attacks, and evaluate their effectiveness in informed and uninformed settings on TinyImageNet.

摘要

Effective Restoration of Source Knowledge in Continual Test Time Adaptation

paper_url: http://arxiv.org/abs/2311.04991
repo_url: None
paper_authors: Fahim Faisal Niloy, Sk Miraj Ahmed, Dripta S. Raychaudhuri, Samet Oymak, Amit K. Roy-Chowdhury
for: 本文旨在解决测试时适应（TTA）方法在动态环境中的挑战，包括累累忘记之前学习的有价值源知识和逐渐增加的错误。
methods: 本文提出了一种无监督领域变换检测方法，可以在动态环境中检测领域变换并将模型参数还原到原始源预训练值。这种方法通过监测领域变换触发的统计变化来重新启用原始源的知识，从而 correction 模型参数的负面影响。
results: 对比州前方法，本文的方法在动态环境中表现出色，可以减少模型参数的负面影响和累累忘记。我们通过对多个 benchmark 数据集进行广泛的实验来证明本文的方法的优秀性。

Abstract
Traditional test-time adaptation (TTA) methods face significant challenges in adapting to dynamic environments characterized by continuously changing long-term target distributions. These challenges primarily stem from two factors: catastrophic forgetting of previously learned valuable source knowledge and gradual error accumulation caused by miscalibrated pseudo labels. To address these issues, this paper introduces an unsupervised domain change detection method that is capable of identifying domain shifts in dynamic environments and subsequently resets the model parameters to the original source pre-trained values. By restoring the knowledge from the source, it effectively corrects the negative consequences arising from the gradual deterioration of model parameters caused by ongoing shifts in the domain. Our method involves progressive estimation of global batch-norm statistics specific to each domain, while keeping track of changes in the statistics triggered by domain shifts. Importantly, our method is agnostic to the specific adaptation technique employed and thus, can be incorporated to existing TTA methods to enhance their performance in dynamic environments. We perform extensive experiments on benchmark datasets to demonstrate the superior performance of our method compared to state-of-the-art adaptation methods.

摘要
传统的测试时适应（TTA）方法在面临不断变化的长期目标分布环境中遇到重大挑战。这些挑战主要来自两个因素： catastrophic forgetting 已经学习的源知识的价值，以及由于 pseudo 标签的误偏而导致的慢滑块误差的堆积。为解决这些问题，本文提出了一种无监督领域变化检测方法，能够在动态环境中检测领域的变化，并将模型参数重置回源预训练值。通过恢复源知识，它有效地纠正由持续变化的领域所导致的模型参数的负面影响。我们的方法包括逐步估计每个领域的全局批处理 нор 特征统计，并记录每个领域的变化触发的统计变化。与此同时，我们的方法是对现有 TTA 方法进行增强，不需要改变现有的适应技术。我们在标准的测试数据集上进行了广泛的实验，并证明了我们的方法在动态环境中的超过当今适应方法的表现。

Exploiting Inductive Biases in Video Modeling through Neural CDEs

paper_url: http://arxiv.org/abs/2311.04986
repo_url: None
paper_authors: Johnathan Chiu, Samuel Duffield, Max Hunter-Gordon, Kaelan Donatella, Max Aifer, Andi Gu
for: 这篇论文是用于Video Task中的Video interpolating和Mask Propagation问题。
methods: 本文提出了一种使用Controlled Differential Equations（CDEs）来解决Video Task中的Key challenges，包括Video interpolating和Mask Propagation。它将CDEs应用于不同的分辨率，实现了一个连续时间的U-Net架构。不同于传统方法，本文的方法不需要明确的Optical Flow学习，而是利用CDEs的自然连续时间特性来生成高度表达力的Video模型。
results: 本文展示了与State-of-the-art模型相比，本文的方法在Video interpolating和Mask Propagation任务中具有竞争性的性能。

Abstract
We introduce a novel approach to video modeling that leverages controlled differential equations (CDEs) to address key challenges in video tasks, notably video interpolation and mask propagation. We apply CDEs at varying resolutions leading to a continuous-time U-Net architecture. Unlike traditional methods, our approach does not require explicit optical flow learning, and instead makes use of the inherent continuous-time features of CDEs to produce a highly expressive video model. We demonstrate competitive performance against state-of-the-art models for video interpolation and mask propagation tasks.

摘要
我们提出了一种新的视频模型方法，利用控制的微分方程（CDE）解决视频任务中的关键挑战，包括视频 interpolate 和mask propagation。我们在不同的分辨率上应用CDE，导致一种连续时间的U-Net架构。与传统方法不同，我们的方法不需要显式学习流体力学，而是利用CDE的内在连续时间特征来生成一个非常表达力的视频模型。我们在视频 interpolate 和mask propagation任务中示出了与状态艺术模型的竞争性表现。

GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs

paper_url: http://arxiv.org/abs/2311.04901
repo_url: None
paper_authors: Zhenfang Chen, Rui Sun, Wenjun Liu, Yining Hong, Chuang Gan
for: 本研究目的是提出一种基于生长和重用模块的生成型神经符号逻辑视觉理解方法，以提高现有神经符号逻辑模型的效率和可重用性。
methods: 本方法包括三个独特的阶段：模块初始化、模块生成和模块执行。首先，给定一个视力语言任务，我们采用大语言模型来检查是否可以重用和增长已有的模块来解决这个新任务。如果不能，我们将创建一个新模块，并指定这个模块的输入和输出。然后，我们使用大语言模型来生成匹配要求的代码块，并将其添加到模块库中。为了更好地评估新模块的能力，我们将几个例子作为测试用例，并评估它们是否可以通过这些测试。如果可以，我们将新模块添加到模块库中，并在其他任务上进行重用。最后，我们使用执行生成的程序来评估模型的性能。
results: 我们的模型在标准任务如视觉问答和参考表达理解中表现竞争力强，同时模块学习自一个任务可以很好地转移到新任务上。此外，我们的模型可以通过几个例子的几个测试来适应新的视觉理解任务，而不需要大量的训练数据。

Abstract
Recent works have shown that Large Language Models (LLMs) could empower traditional neuro-symbolic models via programming capabilities to translate language into module descriptions, thus achieving strong visual reasoning results while maintaining the model's transparency and efficiency. However, these models usually exhaustively generate the entire code snippet given each new instance of a task, which is extremely ineffective. We propose generative neuro-symbolic visual reasoning by growing and reusing modules. Specifically, our model consists of three unique stages, module initialization, module generation, and module execution. First, given a vision-language task, we adopt LLMs to examine whether we could reuse and grow over established modules to handle this new task. If not, we initialize a new module needed by the task and specify the inputs and outputs of this new module. After that, the new module is created by querying LLMs to generate corresponding code snippets that match the requirements. In order to get a better sense of the new module's ability, we treat few-shot training examples as test cases to see if our new module could pass these cases. If yes, the new module is added to the module library for future reuse. Finally, we evaluate the performance of our model on the testing set by executing the parsed programs with the newly made visual modules to get the results. We find the proposed model possesses several advantages. First, it performs competitively on standard tasks like visual question answering and referring expression comprehension; Second, the modules learned from one task can be seamlessly transferred to new tasks; Last but not least, it is able to adapt to new visual reasoning tasks by observing a few training examples and reusing modules.

摘要
First, we use LLMs to determine if we can reuse or grow existing modules to handle a new task. If not, we initialize a new module with the task's inputs and outputs. Then, we use LLMs to generate code snippets that match the module's requirements. We treat a few training examples as test cases to evaluate the new module's ability. If the module passes the test cases, it is added to the module library for future reuse. Finally, we evaluate the model's performance on a testing set by executing the parsed programs with the newly made visual modules.Our proposed model has several advantages. First, it performs well on standard visual reasoning tasks like visual question answering and referring expression comprehension. Second, the modules learned from one task can be easily transferred to new tasks. Lastly, the model can adapt to new visual reasoning tasks by observing a few training examples and reusing modules.

Are foundation models efficient for medical image segmentation?

paper_url: http://arxiv.org/abs/2311.04847
repo_url: None
paper_authors: Danielle Ferreira, Rima Arnaout
for: 这个论文是为了评估Segment Anything模型（SAM）在各种物体分割任务中的表现，以及相比之下一种特有的模式自适应学习（SSL）方法在25种量化报告中的性能。
methods: 这个论文使用了Supervised Training方法，并对SAM和SSL方法进行了比较，以评估它们在100次心脏超声报告中的性能和资源占用情况。
results: 研究发现，SAM在评估中表现不佳，需要更多的标注和计算资源，而SSL方法则表现更好，具有更高的效率。

Abstract
Foundation models are experiencing a surge in popularity. The Segment Anything model (SAM) asserts an ability to segment a wide spectrum of objects but required supervised training at unprecedented scale. We compared SAM's performance (against clinical ground truth) and resources (labeling time, compute) to a modality-specific, label-free self-supervised learning (SSL) method on 25 measurements for 100 cardiac ultrasounds. SAM performed poorly and required significantly more labeling and computing resources, demonstrating worse efficiency than SSL.

摘要
基础模型目前正在流行。射频段化模型（SAM）声称可以分类广泛的物体，但需要前所未有的指导式训练。我们比较了SAM的性能（与临床真实值）和资源（标签时间、计算）与一种特定Modalities自适应学习（SSL）方法在25个测量上100个心脏超声图像。SAM表现不佳，需要更多的标签和计算资源，示出了与SSL的更差的效率。

Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction

paper_url: http://arxiv.org/abs/2311.04834
repo_url: https://github.com/deeplab-ai/selfsupervisedvrd
paper_authors: Zacharias Anastasakis, Dimitrios Mallis, Markos Diomataris, George Alexandridis, Stefanos Kollias, Vassilis Pitsikalis
for: 本研究旨在提出一种自助学习方法，用于视觉关系检测任务（VRD）。
methods: 该方法基于Masked Image Modeling（MIM）的想法，提出Masked Bounding Box Reconstruction（MBBR），即在场景中随机mask一部分实体/物体，然后通过不masked对象进行重建。这种方法通过对象级别的masked模型学习，使网络学习场景中对象之间的交互，并因此具有高度预测视觉对象关系的表示。
results: 对于Visual Relationship Detection（VRD）任务，该方法在几个少量示例的情况下，能够超过现有的状态对方法，并且 Qualitative和Quantitative评估都表明了该方法学习的图像表示能够具有高度的Robustness和可预测性。

Abstract
We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD). Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR), a variation of MIM where a percentage of the entities/objects within a scene are masked and subsequently reconstructed based on the unmasked objects. The core idea is that, through object-level masked modeling, the network learns context-aware representations that capture the interaction of objects within a scene and thus are highly predictive of visual object relationships. We extensively evaluate learned representations, both qualitatively and quantitatively, in a few-shot setting and demonstrate the efficacy of MBBR for learning robust visual representations, particularly tailored for VRD. The proposed method is able to surpass state-of-the-art VRD methods on the Predicate Detection (PredDet) evaluation setting, using only a few annotated samples. We make our code available at https://github.com/deeplab-ai/SelfSupervisedVRD.

摘要
我们提出了一种新的自主学习方法，具体是用于视觉关系检测（VRD）任务。我们受到Masked Image Modeling（MIM）的成功所 inspirited，我们提议Masked Bounding Box Reconstruction（MBBR），这是MIM的一种变体，在场景中部分对象被遮盖，然后根据未遮盖的对象进行重建。核心想法是通过对象水平的遮盖模型，使网络学习场景中对象之间的交互，从而学习出高度预测视觉对象关系的上下文意识的表示。我们在几个shot设置下进行了详细评估learned表示，并证明MBBR可以学习出高效的视觉表示，特别适用于VRD任务。我们使用了只有几个标注样本，但我们的方法仍然可以超越当前VRD方法在Predicate Detection（PredDet）评估设置中的性能。我们在https://github.com/deeplab-ai/SelfSupervisedVRD中提供了代码。

Anonymizing medical case-based explanations through disentanglement

paper_url: http://arxiv.org/abs/2311.04833
repo_url: None
paper_authors: Helena Montenegro, Jaime S. Cardoso
for: 本研究旨在Addressing the problem of privacy concerns in deep learning models for medical image analysis, by proposing a novel method for disentangling identity and medical characteristics of images and anonymizing them.
methods: 本研究使用了一种novel方法， named 嵌入式推理（Embedding-based reasoning）, which disentangles the identity and medical characteristics of images by replacing some feature vectors while preserving the remaining features. The researchers also proposed a model to manufacture synthetic privacy-preserving identities to replace the original image’s identity and achieve anonymization.
results: 实验表明，这种方法可以生成真实的、适用于医疗和生物特征的synthetic privacy-preserving identities，并且可以通过替换医学特征来生成counterfactual images。 The results demonstrate the capacity of the proposed method to generate realistic-looking anonymized images that preserve their original medical content, and the network’s inherent capacity to generate counterfactual images through the replacement of medical features.

Abstract
Case-based explanations are an intuitive method to gain insight into the decision-making process of deep learning models in clinical contexts. However, medical images cannot be shared as explanations due to privacy concerns. To address this problem, we propose a novel method for disentangling identity and medical characteristics of images and apply it to anonymize medical images. The disentanglement mechanism replaces some feature vectors in an image while ensuring that the remaining features are preserved, obtaining independent feature vectors that encode the images' identity and medical characteristics. We also propose a model to manufacture synthetic privacy-preserving identities to replace the original image's identity and achieve anonymization. The models are applied to medical and biometric datasets, demonstrating their capacity to generate realistic-looking anonymized images that preserve their original medical content. Additionally, the experiments show the network's inherent capacity to generate counterfactual images through the replacement of medical features.

摘要
情况基本的解释是一种直观的方法，用于掌握深度学习模型在医疗场景中做出决策的过程。然而，医疗图像不能被用作解释，因为隐私问题。为解决这个问题，我们提出了一种新的方法，用于分离图像的标识特征和医疗特征。这种分离机制将某些图像特征替换，以保证保留图像的其他特征，从而获得独立的特征向量，这些特征向量都是图像的标识特征和医疗特征。我们还提出了一种模型，用于生成隐私保护的人工标识，以替换原始图像的标识。这些模型在医疗和生物метрических数据集上应用， demonstarting their capacity to generate realistic-looking anonymized images that preserve their original medical content. In addition, the experiments show the network's inherent capacity to generate counterfactual images through the replacement of medical features.Note that the translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you need Traditional Chinese, please let me know.

SODAWideNet – Salient Object Detection with an Attention augmented Wide Encoder Decoder network without ImageNet pre-training

paper_url: http://arxiv.org/abs/2311.04828
repo_url: https://github.com/VimsLab/SODAWideNet
paper_authors: Rohit Venkata Sai Dulam, Chandra Kambhamettu
for: 这个论文的目的是开发一个新的突出对象检测（SOD）模型，而不需要在ImageNet dataset上重新训练整个网络。
methods: 该模型使用了一个encoder-decoder风格的网络，并提出了多种新的特征反ernerModule来使用backbone特征。其中包括了多重接收场（MRFFAM）和多scale注意（MSA）等模块，以提高网络的表达能力和灵活性。
results: 模型在五个 dataset上达到了竞争性的性能，并且 Parameters efficient。

Abstract
Developing a new Salient Object Detection (SOD) model involves selecting an ImageNet pre-trained backbone and creating novel feature refinement modules to use backbone features. However, adding new components to a pre-trained backbone needs retraining the whole network on the ImageNet dataset, which requires significant time. Hence, we explore developing a neural network from scratch directly trained on SOD without ImageNet pre-training. Such a formulation offers full autonomy to design task-specific components. To that end, we propose SODAWideNet, an encoder-decoder-style network for Salient Object Detection. We deviate from the commonly practiced paradigm of narrow and deep convolutional models to a wide and shallow architecture, resulting in a parameter-efficient deep neural network. To achieve a shallower network, we increase the receptive field from the beginning of the network using a combination of dilated convolutions and self-attention. Therefore, we propose Multi Receptive Field Feature Aggregation Module (MRFFAM) that efficiently obtains discriminative features from farther regions at higher resolutions using dilated convolutions. Next, we propose Multi-Scale Attention (MSA), which creates a feature pyramid and efficiently computes attention across multiple resolutions to extract global features from larger feature maps. Finally, we propose two variants, SODAWideNet-S (3.03M) and SODAWideNet (9.03M), that achieve competitive performance against state-of-the-art models on five datasets.

摘要
开发新的突出对象检测（SOD）模型需要选择一个ImageNet预训练后缘和创造新的特征级进程来使用后缘特征。然而，将新组件添加到预训练后缘需要重新训练整个网络在ImageNet dataset上，这需要很长时间。因此，我们探索直接从 scratch 开发一个神经网络，不需要 ImageNet 预训练。这种方法允许我们完全自主地设计任务特定的组件。为此，我们提出了SODAWideNet，一个encoder-decoder风格的神经网络用于突出对象检测。我们与通常实践的窄深 convolutional 模型不同，使用宽浅的架构，从而实现参数效率的深度神经网络。为了实现更 shallow 的网络，我们从网络的开始点增加了扩展卷积和自注意力。因此，我们提出了多个感知场Feature Aggregation Module（MRFFAM），可以有效地从更远的区域获取高分辨率的特征。接着，我们提出了多Scale Attention（MSA），可以快速计算多个分辨率之间的注意力，以提取更大的特征图。最后，我们提出了SODAWideNet-S（3.03M）和SODAWideNet（9.03M）两种变体，在五个 dataset 上实现了与当前模型竞争的性能。

Cross-Silo Federated Learning Across Divergent Domains with Iterative Parameter Alignment

paper_url: http://arxiv.org/abs/2311.04818
repo_url: https://github.com/mattgorb/iterative_parameter_alignment
paper_authors: Matt Gorbett, Hossein Shirazi, Indrakshi Ray
for: 这 paper 的目的是提出一种基于 peer-to-peer topology的 Federated Learning 方法，以便在各自的数据集上训练模型，并在模型之间进行参数的对齐，以提高模型的泛化能力。
methods: 这 paper 使用了一种Weighted Distance Minimization 方法来对模型参数进行对齐，并在每个参与者的模型中寻找一个唯一的解。
results: 这 paper 的实验结果表明，这种方法可以在不同的数据集上达到竞争力的结果，并且在不同的领域中进行模型的对齐。此外，这种方法还能够在各个参与者的模型中寻找唯一的解，从而实现了在模型之间的对齐。

Abstract
Learning from the collective knowledge of data dispersed across private sources can provide neural networks with enhanced generalization capabilities. Federated learning, a method for collaboratively training a machine learning model across remote clients, achieves this by combining client models via the orchestration of a central server. However, current approaches face two critical limitations: i) they struggle to converge when client domains are sufficiently different, and ii) current aggregation techniques produce an identical global model for each client. In this work, we address these issues by reformulating the typical federated learning setup: rather than learning a single global model, we learn N models each optimized for a common objective. To achieve this, we apply a weighted distance minimization to model parameters shared in a peer-to-peer topology. The resulting framework, Iterative Parameter Alignment, applies naturally to the cross-silo setting, and has the following properties: (i) a unique solution for each participant, with the option to globally converge each model in the federation, and (ii) an optional early-stopping mechanism to elicit fairness among peers in collaborative learning settings. These characteristics jointly provide a flexible new framework for iteratively learning from peer models trained on disparate datasets. We find that the technique achieves competitive results on a variety of data partitions compared to state-of-the-art approaches. Further, we show that the method is robust to divergent domains (i.e. disjoint classes across peers) where existing approaches struggle.

摘要
学习从分散在私人源中的数据收集的总知识可以提高神经网络的泛化能力。联邦学习方法可以在远程客户端上协同训练机器学习模型，并将客户端模型通过中央服务器的协调合并到一起。然而，现有方法面临两个重要的限制：一是在客户端领域差异足够大时困难于收敛，二是现有的集成技术会生成每个客户端都相同的全球模型。在这项工作中，我们解决这些问题 by 重新定义联邦学习设置：而不是学习单一的全球模型，我们学习 N 个优化为共同目标的模型。为 достичь这一点，我们使用分数据归一化来对参数共享在层次结构中进行补做。该框架被称为迭代参数对齐，可以自然地应用于跨积 Sylos 的设置。具有以下特点：1. 每个参与者都有唯一的解决方案，可以在联邦中每个模型都进行全球收敛。2. 可选的早期停止机制，以便在合作学习设置中约束参与者之间的公平。这些特点结合起来，为我们提供了一种灵活的新框架，可以逐步学习来自各个模型在不同数据集上的训练。我们发现，该技术可以与现有方法相比，在多个数据分区上实现竞争性的结果。此外，我们还证明该方法在不同领域（即客户端领域中的分离类）where existing approaches struggle。

Domain Adaptive Object Detection via Balancing Between Self-Training and Adversarial Learning

paper_url: http://arxiv.org/abs/2311.04815
repo_url: None
paper_authors: Muhammad Akhtar Munir, Muhammad Haris Khan, M. Saquib Sarfraz, Mohsen Ali
for: 这个研究旨在提高深度学习基于物体探测器的适应能力，尤其是在面对新目标领域时，该领域具有明显的物体和背景变化。
methods: 本研究使用模型的预测不确定性来实现内部平衡和类别平衡。具体来说，我们发展了一种量化预测不确定性的技术，并使用高信任率预测生成 pseudo-label，以便进行自我训练。
results: 我们的方法在五个不同和具有挑战性的适应情况下表现出优秀的成绩，与现有的州��cially-of-the-art方法之间存在明显的差异。

Abstract
Deep learning based object detectors struggle generalizing to a new target domain bearing significant variations in object and background. Most current methods align domains by using image or instance-level adversarial feature alignment. This often suffers due to unwanted background and lacks class-specific alignment. A straightforward approach to promote class-level alignment is to use high confidence predictions on unlabeled domain as pseudo-labels. These predictions are often noisy since model is poorly calibrated under domain shift. In this paper, we propose to leverage model's predictive uncertainty to strike the right balance between adversarial feature alignment and class-level alignment. We develop a technique to quantify predictive uncertainty on class assignments and bounding-box predictions. Model predictions with low uncertainty are used to generate pseudo-labels for self-training, whereas the ones with higher uncertainty are used to generate tiles for adversarial feature alignment. This synergy between tiling around uncertain object regions and generating pseudo-labels from highly certain object regions allows capturing both image and instance-level context during the model adaptation. We report thorough ablation study to reveal the impact of different components in our approach. Results on five diverse and challenging adaptation scenarios show that our approach outperforms existing state-of-the-art methods with noticeable margins.

摘要
深度学习基于对象检测器在新目标领域中一般化是一个问题。大多数当前方法使用图像或实例水平的对抗特征偏移来对领域进行Alignment。这经常受到背景的不良影响并且缺乏类别偏移。在本文中，我们提出使用高确定性预测作为 pseudo-labels来促进类别偏移。我们开发了一种技术来评估预测的不确定性，并使用低不确定性预测生成 pseudo-labels，而高不确定性预测则用于生成对抗特征偏移。这种同时使用瓷砾在不确定性范围内和生成高确定性预测作为 pseudo-labels的方法，允许在模型适应过程中捕捉图像和实例级上下文。我们进行了完整的ablation研究，以便了解不同组件在我们的方法中的影响。我们在五种多样化和挑战性的适应场景中进行了实验，并发现我们的方法可以与现有状态的方法相比，表现出明显的优势。

Be Careful When Evaluating Explanations Regarding Ground Truth

paper_url: http://arxiv.org/abs/2311.04813
repo_url: https://github.com/mi2datalab/be-careful-evaluating-explanations
paper_authors: Hubert Baniecki, Maciej Chrabaszcz, Andreas Holzinger, Bastian Pfeifer, Anna Saranti, Przemyslaw Biecek
for: 本研究旨在评估深度学习模型的安全性，尤其是在医疗影像分析和机器人应用中，通过评估模型与人类理解之间的匹配程度。
methods: 本研究提出了一种框架，用于同时评估深度学习模型和解释方法的稳定性，并使用了精度调整程序来（不）对模型与真实情况之间的匹配进行调整。
results: 实验结果表明，视transformer模型和相关的解释方法在不同的模型架构和post-hoc本地解释方法下的Robustness具有一定的潜在攻击风险。

Abstract
Evaluating explanations of image classifiers regarding ground truth, e.g. segmentation masks defined by human perception, primarily evaluates the quality of the models under consideration rather than the explanation methods themselves. Driven by this observation, we propose a framework for $\textit{jointly}$ evaluating the robustness of safety-critical systems that $\textit{combine}$ a deep neural network with an explanation method. These are increasingly used in real-world applications like medical image analysis or robotics. We introduce a fine-tuning procedure to (mis)align model$\unicode{x2013}$explanation pipelines with ground truth and use it to quantify the potential discrepancy between worst and best-case scenarios of human alignment. Experiments across various model architectures and post-hoc local interpretation methods provide insights into the robustness of vision transformers and the overall vulnerability of such AI systems to potential adversarial attacks.

摘要
evaluating explanations of image classifiers regarding ground truth, e.g. segmentation masks defined by human perception, primarily evaluates the quality of the models under consideration rather than the explanation methods themselves. driven by this observation, we propose a framework for $\textit{jointly}$ evaluating the robustness of safety-critical systems that $\textit{combine}$ a deep neural network with an explanation method. these are increasingly used in real-world applications like medical image analysis or robotics. we introduce a fine-tuning procedure to (mis)align model$\unicode{x2013}$explanation pipelines with ground truth and use it to quantify the potential discrepancy between worst and best-case scenarios of human alignment. experiments across various model architectures and post-hoc local interpretation methods provide insights into the robustness of vision transformers and the overall vulnerability of such AI systems to potential adversarial attacks.Here's the text with some notes on the translation:* "ground truth" is translated as "人类理解的标准" (rénxìng lǐjiě de biānwù)* "segmentation masks" is translated as "分割面积" (fēnzhì miànjì)* "deep neural network" is translated as "深度神经网络" (shēngrán jiānxīngwǎng)* "explanation method" is translated as "解释方法" (jiějie fāngfa)* "safety-critical systems" is translated as "安全关键系统" (ānquè guānjī systems)* "jointly" is translated as "共同" (gòngdòng)* "worst-case scenarios" is translated as "最坏情况" (zuì huò qíngkē)* "best-case scenarios" is translated as "最好情况" (zuì hǎo qíngkē)* "human alignment" is translated as "人类对齐" (rénxìng duìqí)Note that the translation of "jointly" as "共同" is a bit more formal than the original English text, but it accurately conveys the meaning of the phrase. Additionally, the translation of "worst-case scenarios" and "best-case scenarios" as "最坏情况" and "最好情况" is a bit more idiomatic than the original English text, but it accurately conveys the meaning of the phrases in Chinese.

Image-Based Virtual Try-On: A Survey

paper_url: http://arxiv.org/abs/2311.04811
repo_url: https://github.com/little-misfit/survey-of-virtual-try-on
paper_authors: Dan Song, Xuanpu Zhang, Juan Zhou, Weizhi Nie, Ruofeng Tong, An-An Liu
For: This paper aims to provide a comprehensive analysis of state-of-the-art techniques and methodologies in image-based virtual try-on, and to identify key trends and future research directions in this field.* Methods: The paper uses a pipeline architecture that includes person representation, try-on indication, clothing warping, and try-on stage. The authors also propose a new semantic criteria using CLIP and evaluate representative methods with uniformly implemented evaluation metrics on the same dataset.* Results: The paper provides a comprehensive overview of the current state of image-based virtual try-on research, including both quantitative and qualitative evaluations of current open-source methods. The authors also demonstrate the potential of large-scale models on this task by fine-tuning a recent image generation model (PBE) using ControlNet.Here is the text in Simplified Chinese:* For: 这篇论文目的是提供图像基于虚拟试穿的全面分析，并预测这个领域的未来研究趋势。* Methods: 该论文使用管道式架构，包括人体表示、试穿指示、衣服折叠和试穿阶段。作者还提出了基于CLIP的新的Semantic criteria。* Results: 论文提供了图像基于虚拟试穿的现状报告，包括现有开源方法的量化和质量评估。作者还示出了大规模模型在这个任务上的潜在潜力。

Abstract
Image-based virtual try-on aims to synthesize a naturally dressed person image with a clothing image, which revolutionizes online shopping and inspires related topics within image generation, showing both research significance and commercial potentials. However, there is a great gap between current research progress and commercial applications and an absence of comprehensive overview towards this field to accelerate the development. In this survey, we provide a comprehensive analysis of the state-of-the-art techniques and methodologies in aspects of pipeline architecture, person representation and key modules such as try-on indication, clothing warping and try-on stage. We propose a new semantic criteria with CLIP, and evaluate representative methods with uniformly implemented evaluation metrics on the same dataset. In addition to quantitative and qualitative evaluation of current open-source methods, we also utilize ControlNet to fine-tune a recent large image generation model (PBE) to show future potentials of large-scale models on image-based virtual try-on task. Finally, unresolved issues are revealed and future research directions are prospected to identify key trends and inspire further exploration. The uniformly implemented evaluation metrics, dataset and collected methods will be made public available at https://github.com/little-misfit/Survey-Of-Virtual-Try-On.

摘要
图像基于虚拟试穿涉及合成一个自然地穿着人像与衣服图像，这种技术革新了在线购物和相关领域的图像生成，具有研究重要性和商业潜力。然而，目前研究进步和商业应用之间存在巨大的差距，而且对这一领域的全面概述缺乏，以便加速发展。在这份调查中，我们提供了图像基于虚拟试穿领域的全面分析，包括管道架构、人体表示和关键模块such as 试穿指示、衣服扭曲和试穿阶段。我们还提出了一种新的semantic标准，并使用CLIP进行评估代表方法。此外，我们还使用ControlNet来精度调整最近一个大型图像生成模型（PBE），以示未来大型模型在图像基于虚拟试穿任务中的潜力。最后，我们揭示了未解的问题和未来研究方向，以便识别关键趋势和激发更多的探索。我们 uniformly 实施的评估指标、数据集和收集的方法将在https://github.com/little-misfit/Survey-Of-Virtual-Try-On 上公开。

VioLA: Aligning Videos to 2D LiDAR Scans

paper_url: http://arxiv.org/abs/2311.04783
repo_url: None
paper_authors: Jun-Jee Chao, Selim Engin, Nikhil Chavan-Dafle, Bhoram Lee, Volkan Isler
for: 将视频Sequence align到2D LiDAR扫描图中的环境
methods: 引入VioLA方法，首先从图像序列中建立本地场景的semantic map，然后从图像序列中提取高度为固定值的点进行注册
results: 通过使用 pré-trained text-to-image填充模型和深度完成模型来填充缺失的场景内容，提高了pose注册性能，最多提高20%

Abstract
We study the problem of aligning a video that captures a local portion of an environment to the 2D LiDAR scan of the entire environment. We introduce a method (VioLA) that starts with building a semantic map of the local scene from the image sequence, then extracts points at a fixed height for registering to the LiDAR map. Due to reconstruction errors or partial coverage of the camera scan, the reconstructed semantic map may not contain sufficient information for registration. To address this problem, VioLA makes use of a pre-trained text-to-image inpainting model paired with a depth completion model for filling in the missing scene content in a geometrically consistent fashion to support pose registration. We evaluate VioLA on two real-world RGB-D benchmarks, as well as a self-captured dataset of a large office scene. Notably, our proposed scene completion module improves the pose registration performance by up to 20%.

摘要
我们研究将视频Capture的本地环境与2D LiDAR扫描的全景环境进行对应的问题。我们提出了一种方法（VioLA），它首先从图像序列中建立了本地场景的semantic map，然后提取了一个固定高度的点进行与LiDAR地图进行对应。由于恢复错误或相机扫描的部分覆盖，可能导致重建的semantic maplack sufficient information for registration。为解决这个问题，VioLA利用了一个预训练的文本-图像填充模型和一个depth completion模型来填充缺失的场景内容，以保持准确的姿态注册。我们在两个实际的RGB-D标准benchmark上以及一个自拍摄的大办公室场景中进行了评估，并观察到我们提议的场景完成模块可以提高姿态注册性能达20%。

Lidar Annotation Is All You Need

paper_url: http://arxiv.org/abs/2311.04777
repo_url: https://github.com/evocargo/lidar-annotation-is-all-you-need
paper_authors: Dinar Sharafutdinov, Stanislav Kuskov, Saian Protasov, Alexey Voropaev
for: 这 paper 的目的是提高图像分割的效率，使用 convolutional neural network 在多感器设置下进行图像分割。methods: 该方法使用 lidar 精度测量点云，并将其直接用于图像分割模型的训练。该方法还使用 masked loss 来处理稀疏的地面数据。results: 实验表明，该方法可以在多个数据集上实现相似的性能，而不需要大量的注释数据。该方法可以减少注释负担，并允许在图像分割模型的训练中混合不同类型的地面数据。

Abstract
In recent years, computer vision has transformed fields such as medical imaging, object recognition, and geospatial analytics. One of the fundamental tasks in computer vision is semantic image segmentation, which is vital for precise object delineation. Autonomous driving represents one of the key areas where computer vision algorithms are applied. The task of road surface segmentation is crucial in self-driving systems, but it requires a labor-intensive annotation process in several data domains. The work described in this paper aims to improve the efficiency of image segmentation using a convolutional neural network in a multi-sensor setup. This approach leverages lidar (Light Detection and Ranging) annotations to directly train image segmentation models on RGB images. Lidar supplements the images by emitting laser pulses and measuring reflections to provide depth information. However, lidar's sparse point clouds often create difficulties for accurate object segmentation. Segmentation of point clouds requires time-consuming preliminary data preparation and a large amount of computational resources. The key innovation of our approach is the masked loss, addressing sparse ground-truth masks from point clouds. By calculating loss exclusively where lidar points exist, the model learns road segmentation on images by using lidar points as ground truth. This approach allows for blending of different ground-truth data types during model training. Experimental validation of the approach on benchmark datasets shows comparable performance to a high-quality image segmentation model. Incorporating lidar reduces the load on annotations and enables training of image-segmentation models without loss of segmentation quality. The methodology is tested on diverse datasets, both publicly available and proprietary. The strengths and weaknesses of the proposed method are also discussed in the paper.

摘要
Recently, computer vision has revolutionized fields such as medical imaging, object recognition, and geospatial analytics. One of the fundamental tasks in computer vision is semantic image segmentation, which is crucial for precise object delineation. Autonomous driving is one of the key areas where computer vision algorithms are applied, and the task of road surface segmentation is crucial in self-driving systems. However, this task requires a labor-intensive annotation process in several data domains.The work described in this paper aims to improve the efficiency of image segmentation using a convolutional neural network in a multi-sensor setup. This approach leverages lidar (Light Detection and Ranging) annotations to directly train image segmentation models on RGB images. Lidar supplements the images by emitting laser pulses and measuring reflections to provide depth information. However, lidar's sparse point clouds often create difficulties for accurate object segmentation.The key innovation of our approach is the masked loss, which addresses sparse ground-truth masks from point clouds. By calculating loss exclusively where lidar points exist, the model learns road segmentation on images by using lidar points as ground truth. This approach allows for blending of different ground-truth data types during model training. Experimental validation of the approach on benchmark datasets shows comparable performance to a high-quality image segmentation model. Incorporating lidar reduces the load on annotations and enables training of image-segmentation models without loss of segmentation quality. The methodology is tested on diverse datasets, both publicly available and proprietary. The strengths and weaknesses of the proposed method are also discussed in the paper.

GCS-ICHNet: Assessment of Intracerebral Hemorrhage Prognosis using Self-Attention with Domain Knowledge Integration

paper_url: http://arxiv.org/abs/2311.04772
repo_url: https://github.com/Windbelll/Prognosis-analysis-of-cerebral-hemorrhage
paper_authors: Xuhao Shan, Xinyang Li, Ruiquan Ge, Shibin Wu, Ahmed Elazab, Jichao Zhu, Lingyan Zhang, Gangyong Jia, Qingying Xiao, Xiang Wan, Changmiao Wang
for: 预测急性脑出血（ICH）的诊断和治疗结果，提高患者生存率。
methods: 利用多Modal脑CT图像数据和格拉斯哥昏迷分数（GCS）来提高ICH诊断，使用trasnformer基本的融合模块进行评估。
results: GCS-ICHNet实现了81.03%的敏感性和91.59%的特异性，超过了平均临床医生和其他当前状态的方法。

Abstract
Intracerebral Hemorrhage (ICH) is a severe condition resulting from damaged brain blood vessel ruptures, often leading to complications and fatalities. Timely and accurate prognosis and management are essential due to its high mortality rate. However, conventional methods heavily rely on subjective clinician expertise, which can lead to inaccurate diagnoses and delays in treatment. Artificial intelligence (AI) models have been explored to assist clinicians, but many prior studies focused on model modification without considering domain knowledge. This paper introduces a novel deep learning algorithm, GCS-ICHNet, which integrates multimodal brain CT image data and the Glasgow Coma Scale (GCS) score to improve ICH prognosis. The algorithm utilizes a transformer-based fusion module for assessment. GCS-ICHNet demonstrates high sensitivity 81.03% and specificity 91.59%, outperforming average clinicians and other state-of-the-art methods.

摘要
Intracerebral Hemorrhage (ICH) 是一种严重的疾病，由于脑血管受损而导致血液泄露，可能会导致严重的后果和死亡。因此，有效和准确的诊断和治疗是非常重要的，因为其mortality rate 很高。然而，传统的方法很多都是基于专业知识的，可能会导致不准确的诊断和治疗延迟。人工智能（AI）模型已经被探讨以帮助临床医生，但许多先前的研究都是对模型进行修改而不考虑域知识。本文介绍了一种新的深度学习算法，GCS-ICHNet，它通过结合多Modal脑CT图像数据和格拉斯哥昏迷scale（GCS）分数来改善ICH的诊断。该算法使用 transformer-based 混合模块进行评估。GCS-ICHNet 在敏感性和特点上都有出色的表现，高达 81.03% 和 91.59%，超过了平均的临床医生和其他现有的方法。

An attention-based deep learning network for predicting Platinum resistance in ovarian cancer

paper_url: http://arxiv.org/abs/2311.04769
repo_url: None
paper_authors: Haoming Zhuang, Beibei Li, Jingtong Ma, Patrice Monkam, Shouliang Qi, Wei Qian, Dianning He
for: 这项研究的目的是提出一种基于深度学习的方法，用于判断患有高等级胸膜癌的患者是否 platinum 抵抗性。methods: 该研究使用了289名高等级胸膜癌患者的数据，并建立了一个结合了压缩块（SE Block）和空间 пирамид Pooling层（SPPLayer）的Dense Convolutional Network（DenseNet）模型。利用多Modal PET/CT 图像数据进行预测患者 platinum 抵抗性。results: 经五次交叠验证，SE-SPP-DenseNet 模型在预测患者 platinum 抵抗性方面 achieved a high accuracy rate和抛物线曲线（AUC）值，分别为92.6%和0.93。通过进行剥离实验和单Modal 数据验证，证明了将 SE Block 和 SPPLayer 添加到深度学习模型中，并考虑多Modal 数据的重要性。

Abstract
Background: Ovarian cancer is among the three most frequent gynecologic cancers globally. High-grade serous ovarian cancer (HGSOC) is the most common and aggressive histological type. Guided treatment for HGSOC typically involves platinum-based combination chemotherapy, necessitating an assessment of whether the patient is platinum-resistant. The purpose of this study is to propose a deep learning-based method to determine whether a patient is platinum-resistant using multimodal positron emission tomography/computed tomography (PET/CT) images. Methods: 289 patients with HGSOC were included in this study. An end-to-end SE-SPP-DenseNet model was built by adding Squeeze-Excitation Block (SE Block) and Spatial Pyramid Pooling Layer (SPPLayer) to Dense Convolutional Network (DenseNet). Multimodal data from PET/CT images of the regions of interest (ROI) were used to predict platinum resistance in patients. Results: Through five-fold cross-validation, SE-SPP-DenseNet achieved a high accuracy rate and an area under the curve (AUC) in predicting platinum resistance in patients, which were 92.6% and 0.93, respectively. The importance of incorporating SE Block and SPPLayer into the deep learning model, and considering multimodal data was substantiated by carrying out ablation studies and experiments with single modality data. Conclusions: The obtained classification results indicate that our proposed deep learning framework performs better in predicting platinum resistance in patients, which can help gynecologists make better treatment decisions. Keywords: PET/CT, CNN, SE Block, SPP Layer, Platinum resistance, Ovarian cancer

摘要
背景：子宫癌是全球三大妇科癌种之一，高等级膜蛋白癌（HGSOC）是最常见并且最严重的 histological 型。对 HGSOC 患者的治疗通常包括钍基化学疗法，需要评估患者是否为钍耐 resistance。本研究的目的是提出一种基于深度学习的方法，使用多Modal positron emission tomography/computed tomography（PET/CT）图像来判断患者是否为钍耐 resistance。方法：本研究包括289名 HGSOC 患者。我们建立了一个结合 Squeeze-Excitation Block（SE Block）和 Spatial Pyramid Pooling Layer（SPPLayer）的 Dense Convolutional Network（DenseNet）模型。使用 ROI 的多Modal PET/CT 图像来预测患者是否为钍耐 resistance。结果：通过五次交叉验证，SE-SPP-DenseNet 模型在预测患者是否为钍耐 resistance 中达到了高精度率和报春分（AUC）的好Result，分别为92.6%和0.93。我们通过进行剖除研究和单模态数据实验来证明，将 SE Block 和 SPPLayer 添加到深度学习模型中，以及考虑多Modal数据的重要性。结论：我们的提出的深度学习框架在预测患者是否为钍耐 resistance 中表现更好，可以帮助妇科医生作出更好的治疗决策。关键词：PET/CT, CNN, SE Block, SPP Layer, 钍耐 resistance, 子宫癌

paper_url: http://arxiv.org/abs/2311.04766
repo_url: None
paper_authors: Guinan Su, Yanwu Yang, Zhifeng Li
for: 这个研究旨在提高音频驱动3D面部动画的精度和效率，特别是在虚拟现实、游戏和视频会议等应用中。
methods: 该研究提出了一种交叉模式双学习框架，称为DualTalker，以提高数据使用效率并关联跨模态关系。该框架共同受过主任务（音频驱动面部动画）和其双任务（详细说话）的联合训练，并共享音频/运动编码器组件。
results: 经过广泛的实验和一项感知用户研究，我们展示了我们的方法在VOCA和BIWI数据集上的较好的表现，both qualitatively和quantitatively。我们的代码和视频示例已经在https://github.com/sabrina-su/iadf.git中提供。

Abstract
In recent years, audio-driven 3D facial animation has gained significant attention, particularly in applications such as virtual reality, gaming, and video conferencing. However, accurately modeling the intricate and subtle dynamics of facial expressions remains a challenge. Most existing studies approach the facial animation task as a single regression problem, which often fail to capture the intrinsic inter-modal relationship between speech signals and 3D facial animation and overlook their inherent consistency. Moreover, due to the limited availability of 3D-audio-visual datasets, approaches learning with small-size samples have poor generalizability that decreases the performance. To address these issues, in this study, we propose a cross-modal dual-learning framework, termed DualTalker, aiming at improving data usage efficiency as well as relating cross-modal dependencies. The framework is trained jointly with the primary task (audio-driven facial animation) and its dual task (lip reading) and shares common audio/motion encoder components. Our joint training framework facilitates more efficient data usage by leveraging information from both tasks and explicitly capitalizing on the complementary relationship between facial motion and audio to improve performance. Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate the potential over-smoothing underlying the cross-modal complementary representations, enhancing the mapping of subtle facial expression dynamics. Through extensive experiments and a perceptual user study conducted on the VOCA and BIWI datasets, we demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. We have made our code and video demonstrations available at https://github.com/sabrina-su/iadf.git.

摘要
Recently, audio-driven 3D facial animation has gained significant attention, especially in virtual reality, gaming, and video conferencing applications. However, accurately modeling the intricate and subtle dynamics of facial expressions remains a challenge. Most existing studies treat the facial animation task as a single regression problem, which often fails to capture the intrinsic inter-modal relationship between speech signals and 3D facial animation, and overlooks their inherent consistency. Moreover, due to the limited availability of 3D-audio-visual datasets, approaches learning with small-size samples have poor generalizability, which decreases performance.To address these issues, in this study, we propose a cross-modal dual-learning framework, termed DualTalker, aiming at improving data usage efficiency and relating cross-modal dependencies. The framework is trained jointly with the primary task (audio-driven facial animation) and its dual task (lip reading) and shares common audio/motion encoder components. Our joint training framework leverages information from both tasks and explicitly capitalizes on the complementary relationship between facial motion and audio to improve performance. Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate the potential over-smoothing underlying the cross-modal complementary representations, enhancing the mapping of subtle facial expression dynamics.Through extensive experiments and a perceptual user study conducted on the VOCA and BIWI datasets, we demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. Our code and video demonstrations are available at .

paper_url: http://arxiv.org/abs/2311.04726
repo_url: https://github.com/Walter0807/Social-CH
paper_authors: Wentao Zhu, Jason Qin, Yuke Lou, Hang Ye, Xiaoxuan Ma, Hai Ci, Yizhou Wang
for: 这个研究的目的是复制人类在预测他人行为方面的能力，通过解决社交动作预测问题。
methods: 这个研究使用了一个新的比较器、一种新的形式化、以及基于认知的框架。他们还使用了行为做副本和生成对抗学习来提高学习效率和通用性。
results: 研究人员通过实施了一种新的3D多人动作数据集，并通过对比较器和生成对抗学习来验证数据集和方法的有效性。

Abstract
Humans exhibit a remarkable capacity for anticipating the actions of others and planning their own actions accordingly. In this study, we strive to replicate this ability by addressing the social motion prediction problem. We introduce a new benchmark, a novel formulation, and a cognition-inspired framework. We present Wusi, a 3D multi-person motion dataset under the context of team sports, which features intense and strategic human interactions and diverse pose distributions. By reformulating the problem from a multi-agent reinforcement learning perspective, we incorporate behavioral cloning and generative adversarial imitation learning to boost learning efficiency and generalization. Furthermore, we take into account the cognitive aspects of the human social action planning process and develop a cognitive hierarchy framework to predict strategic human social interactions. We conduct comprehensive experiments to validate the effectiveness of our proposed dataset and approach. Code and data are available at https://walter0807.github.io/Social-CH/.

摘要
人类具有惊人的其他人行为预测能力和自己行为规划能力。在这项研究中，我们努力复制这种能力，解决社会动作预测问题。我们引入了新的标准集，一种新的形ulation，以及一个基于认知的框架。我们提供了一个3D多人动作数据集，名为Wusi，该数据集在体育赛事中展示了人类之间的激烈和策略性交互，以及多种姿态分布。我们将问题转换为多智能 reinforcement learning 视角，并应用行为做样和生成对抗学习来提高学习效率和泛化能力。此外，我们考虑了人类社会行为规划过程中的认知方面，并开发了认知层次框架来预测人类社会互动的策略。我们进行了全面的实验来验证我们的提posed dataset和方法的有效性。代码和数据可以在https://walter0807.github.io/Social-CH/获取。

Training CLIP models on Data from Scientific Papers

paper_url: http://arxiv.org/abs/2311.04711
repo_url: https://github.com/nopperl/clip_arxiv_pmc
paper_authors: Calvin Metzger
for: 这篇论文旨在检验 CLIP 模型是否通过使用高质量数据来提高总体性能。
methods: 该论文使用 arXiv 和 PubMed Central 搜索引擎提取文本图像数据，并在小规模 CLIP 模型（ViT B/32）上进行实验。
results: 实验结果表明，使用高质量数据源可以提高 CLIP 模型的性能，但提高的程度只是有 moderate。这表明使用这些数据源来训练大规模 CLIP 模型是一个值得进行研究的方向。

Abstract
Contrastive Language-Image Pretraining (CLIP) models are able to capture the semantic relationship of images and texts and have enabled a wide range of applications, from image retrieval to classification. These models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores whether limited amounts higher quality data in a specific domain improve the general performance of CLIP models. To this purpose, we extract text-image data from scientific papers hosted in the arXiv and PubMed Central repositories. Experiments on small-scale CLIP models (ViT B/32) show that model performance increases on average, but only moderately. This result indicates that using the data sources considered in the paper to train large-scale CLIP models is a worthwile research direction.

摘要
对比语言-图像预训（CLIP）模型可以捕捉图像和文本之间的 semantic 关系，并实现了许多应用，从图像检索到分类。这些模型通常通过互联网爬虫获取数据进行训练，这些数据的量很大，但质量有限。这篇文章探讨了是否有限量但高质量数据在特定领域提高CLIP模型的总性能。为此，我们从arXiv和PubMed Central数据库中提取了文本-图像数据。实验结果表明，使用这些数据来训练小规模CLIP模型（ViT B/32）时，模型的性能平均提高，但只有moderate。这结果表明，使用文章中所考虑的数据来训练大规模CLIP模型是一个有前途的研究方向。

3D Pose Estimation of Tomato Peduncle Nodes using Deep Keypoint Detection and Point Cloud

paper_url: http://arxiv.org/abs/2311.04699
repo_url: None
paper_authors: Jianchao Ci, Xin Wang, David Rapado-Rincón, Akshay K. Burusa, Gert Kootstra
For: 本研究旨在提供一种基于关键点检测的RGB-D相机数据的方法，用于自动探测护叶节点，以便在绿色家庭中自动探测 Tomatoes。* Methods: 本方法使用了RGB-D相机数据，通过检测颜色图像中的四个骨干特征点，并将其与3D点云信息集成，以确定护叶节点的3D姿态。* Results: 研究结果表明，该方法具有高精度的物体检测能力（AP@0.5=0.96）、高精度的关键点检测率（PDJ@0.2=94.31%）和3D姿态估计精度（MAE=11.38o和9.93o）。此外，该方法还可以快速响应视点变化。

Abstract
Greenhouse production of fruits and vegetables in developed countries is challenged by labor 12 scarcity and high labor costs. Robots offer a good solution for sustainable and cost-effective 13 production. Acquiring accurate spatial information about relevant plant parts is vital for 14 successful robot operation. Robot perception in greenhouses is challenging due to variations in 15 plant appearance, viewpoints, and illumination. This paper proposes a keypoint-detection-based 16 method using data from an RGB-D camera to estimate the 3D pose of peduncle nodes, which 17 provides essential information to harvest the tomato bunches. 18 19 Specifically, this paper proposes a method that detects four anatomical landmarks in the color 20 image and then integrates 3D point-cloud information to determine the 3D pose. A 21 comprehensive evaluation was conducted in a commercial greenhouse to gain insight into the 22 performance of different parts of the method. The results showed: (1) high accuracy in object 23 detection, achieving an Average Precision (AP) of AP@0.5=0.96; (2) an average Percentage of 24 Detected Joints (PDJ) of the keypoints of PhDJ@0.2=94.31%; and (3) 3D pose estimation 25 accuracy with mean absolute errors (MAE) of 11.38o and 9.93o for the relative upper and lower 26 angles between the peduncle and main stem, respectively. Furthermore, the capability to handle 27 variations in viewpoint was investigated, demonstrating the method was robust to view changes. 28 However, canonical and higher views resulted in slightly higher performance compared to other 29 views. Although tomato was selected as a use case, the proposed method is also applicable to 30 other greenhouse crops like pepper.

摘要
developed countries的绿色房production of fruits and vegetables面临劳动力短缺和高劳动成本的挑战。Robots可以提供可持续和成本效果的解决方案。获取有关相关植物部分的准确空间信息是成功机器人运行的关键。在绿色房中机器人识别是由于植物外观、视角和照明变化而困难。这篇论文提出了基于特征点探测的方法，使用RGB-D摄像头数据来估算 Tomatoes的3D姿态。 Specifically, this paper proposes a method that detects four anatomical landmarks in the color image and then integrates 3D point-cloud information to determine the 3D pose. A comprehensive evaluation was conducted in a commercial greenhouse to gain insight into the performance of different parts of the method. The results showed: (1) high accuracy in object detection, achieving an Average Precision (AP) of AP@0.5=0.96; (2) an average Percentage of Detected Joints (PDJ) of the keypoints of PhDJ@0.2=94.31%; and (3) 3D pose estimation accuracy with mean absolute errors (MAE) of 11.38° and 9.93° for the relative upper and lower angles between the peduncle and main stem, respectively. Furthermore, the capability to handle variations in viewpoint was investigated, demonstrating the method was robust to view changes. However, canonical and higher views resulted in slightly higher performance compared to other views. Although tomato was selected as a use case, the proposed method is also applicable to other greenhouse crops like pepper.

Weakly supervised cross-model learning in high-content screening

paper_url: http://arxiv.org/abs/2311.04678
repo_url: None
paper_authors: Watkinson Gabriel, Cohen Ethan, Bourriez Nicolas, Bendidi Ihab, Bollot Guillaume, Genovesio Auguste
for: 本研究旨在探索如何在药物搜索中连接不同数据类型的数据。
methods: 我们提出了一种新的方法，利用弱监督和跨站复制在高内容检测中使用CLIP建立跨模态表示。
results: 我们的方法可以学习更好的表示，减轻批处理效应，并且对JUMP-CP数据集进行了有效的预处理，从85TB减少到7TB，保留了所有干扰和大多数信息内容。

Abstract
With the surge in available data from various modalities, there is a growing need to bridge the gap between different data types. In this work, we introduce a novel approach to learn cross-modal representations between image data and molecular representations for drug discovery. We propose EMM and IMM, two innovative loss functions built on top of CLIP that leverage weak supervision and cross sites replicates in High-Content Screening. Evaluating our model against known baseline on cross-modal retrieval, we show that our proposed approach allows to learn better representations and mitigate batch effect. In addition, we also present a preprocessing method for the JUMP-CP dataset that effectively reduce the required space from 85Tb to a mere usable 7Tb size, still retaining all perturbations and most of the information content.

摘要
随着不同数据模式之间的数据量的增加，需要桥接这些数据模式之间的 gap 变得更加重要。在这种工作中，我们介绍了一种新的方法，用于从图像数据和分子表示之间学习 crossed-modal 表示。我们提出了两种创新的损失函数EMM和IMM，基于 CLIP 的上下文，利用弱监督和跨站复制在高内容检测中。我们对知道的基准进行跨Modal 检索，并显示了我们提议的方法可以学习更好的表示，并减轻批处理效应。此外，我们还提出了对 JUMP-CP 数据集的预处理方法，可以有效地将数据减少到可用 7Tb 大小，保留所有干扰和大多数信息内容。

Lightweight Diffusion Models with Distillation-Based Block Neural Architecture Search

paper_url: http://arxiv.org/abs/2311.04950
repo_url: None
paper_authors: Siao Tang, Xin Wang, Hong Chen, Chaoyu Guan, Yansong Tang, Wenwu zhu
for: 提高 diffusion models 的计算效率，使其在多种任务中实现 state-of-the-art 性能。
methods: 提出了一种基于 Diffusion Distillation 的 Block-wise Neural Architecture Search (DiffNAS) 方法，通过自动去除 diffusion models 中的结构冗余来减少计算成本。
results: 实验表明，DiffNAS 可以实现约 50% MACs 和参数减少，并且可以在 latent diffusion models 上实现比 teacher 更好的性能。

Abstract
Diffusion models have recently shown remarkable generation ability, achieving state-of-the-art performance in many tasks. However, the high computational cost is still a troubling problem for diffusion models. To tackle this problem, we propose to automatically remove the structural redundancy in diffusion models with our proposed Diffusion Distillation-based Block-wise Neural Architecture Search (DiffNAS). Specifically, given a larger pretrained teacher, we leverage DiffNAS to search for the smallest architecture which achieves on-par or even better performance than the teacher. Considering current diffusion models are based on UNet which naturally has a block-wise structure, we perform neural architecture search independently in each block, which largely reduces the search space. Different from previous block-wise NAS methods, DiffNAS contains a block-wise local search strategy and a retraining strategy with a joint dynamic loss. Concretely, during the search process, we block-wisely select the best subnet to avoid the unfairness brought by the global search strategy used in previous works. When retraining the searched architecture, we adopt a dynamic joint loss to maintain the consistency between supernet training and subnet retraining, which also provides informative objectives for each block and shortens the paths of gradient propagation. We demonstrate this joint loss can effectively improve model performance. We also prove the necessity of the dynamic adjustment of this loss. The experiments show that our method can achieve significant computational reduction, especially on latent diffusion models with about 50% MACs and Parameter reduction.

摘要
Diffusion 模型最近显示出了很好的生成能力，在许多任务中达到了状态的核心性能。然而，高计算成本仍然是 diffusion 模型的一个困扰问题。为解决这个问题，我们提出了自动Remove diffusion 模型中的结构冗余的方法：Diffusion Distillation-based Block-wise Neural Architecture Search (DiffNAS)。具体来说，我们使用 DiffNAS 在一个更大的预训练老师模型基础上进行搜索，找到与老师模型具有相同或更好的性能的最小架构。由于现有的 diffusion 模型基于 UNet 结构，我们在搜索过程中独立地进行每个块的 neural architecture search，从而大大减少搜索空间。与前一些块基本 NAS 方法不同，DiffNAS 包含块基本选择策略和重新训练策略，并且采用了一种动态共同损失。在搜索过程中，我们会在每个块中选择最佳子网，以避免由全局搜索策略所带来的不公平性。在重新训练搜索出的架构时，我们采用了一种动态共同损失，以保持超网训练和子网重新训练之间的一致性，同时提供了每个块的有用目标。我们证明了这种动态调整的损失是必要的。实验表明，我们的方法可以实现显著的计算减少，特别是在含有约50% MACs和参数的含括 diffusion 模型上。

VET: Visual Error Tomography for Point Cloud Completion and High-Quality Neural Rendering

paper_url: http://arxiv.org/abs/2311.04634
repo_url: https://github.com/lfranke/vet
paper_authors: Linus Franke, Darius Rückert, Laura Fink, Matthias Innmann, Marc Stamminger
for: 这个论文的目的是提高点云图像的新视图合成质量。
methods: 该论文使用了一种基于神经网络的方法，使用点云代理geometry来检测和修复新视图合成中的缺失或损害。
results: 论文的实验结果表明，该方法可以显著提高点云图像的新视图合成质量，并且可以有效地修复大规模的缺失和细腻结构。同时，该方法的实时渲染速度也得到了改进。

Abstract
In the last few years, deep neural networks opened the doors for big advances in novel view synthesis. Many of these approaches are based on a (coarse) proxy geometry obtained by structure from motion algorithms. Small deficiencies in this proxy can be fixed by neural rendering, but larger holes or missing parts, as they commonly appear for thin structures or for glossy regions, still lead to distracting artifacts and temporal instability. In this paper, we present a novel neural-rendering-based approach to detect and fix such deficiencies. As a proxy, we use a point cloud, which allows us to easily remove outlier geometry and to fill in missing geometry without complicated topological operations. Keys to our approach are (i) a differentiable, blending point-based renderer that can blend out redundant points, as well as (ii) the concept of Visual Error Tomography (VET), which allows us to lift 2D error maps to identify 3D-regions lacking geometry and to spawn novel points accordingly. Furthermore, (iii) by adding points as nested environment maps, our approach allows us to generate high-quality renderings of the surroundings in the same pipeline. In our results, we show that our approach can improve the quality of a point cloud obtained by structure from motion and thus increase novel view synthesis quality significantly. In contrast to point growing techniques, the approach can also fix large-scale holes and missing thin structures effectively. Rendering quality outperforms state-of-the-art methods and temporal stability is significantly improved, while rendering is possible at real-time frame rates.

摘要
Recently, deep neural networks have led to significant advances in novel view synthesis. Many of these methods rely on a coarse proxy geometry obtained through structure from motion algorithms. While small defects in the proxy can be corrected by neural rendering, larger holes or missing parts can still result in distracting artifacts and temporal instability. In this paper, we propose a novel neural-rendering-based approach to detect and fix such deficiencies. We use a point cloud as our proxy, which allows us to easily remove outlier geometry and fill in missing geometry without complicated topological operations. The key components of our approach are:1. A differentiable, blending point-based renderer that can blend out redundant points.2. The concept of Visual Error Tomography (VET), which allows us to lift 2D error maps to identify 3D regions lacking geometry and spawn novel points accordingly.3. The addition of points as nested environment maps, which allows us to generate high-quality renderings of the surroundings in the same pipeline.Our results show that our approach can significantly improve the quality of a point cloud obtained by structure from motion and increase novel view synthesis quality. In contrast to point growing techniques, our approach can effectively fix large-scale holes and missing thin structures. The rendering quality outperforms state-of-the-art methods, and temporal stability is significantly improved, all while rendering is possible at real-time frame rates.

General Framework to Evaluate Unlinkability in Biometric Template Protection Systems

paper_url: http://arxiv.org/abs/2311.04633
repo_url: None
paper_authors: Marta Gomez-Barrero, Javier Galbally, Christian Rathgeb, Christoph Busch
for: 保护生物特征数据的隐私问题
methods: 提出了一个新的普适框架来评估生物特征模板的不可识别性
results: 应用于四种现有的生物特征模板保护技术中的一种，并与其他现有的指标进行比较，以显示其优势

Abstract
The wide deployment of biometric recognition systems in the last two decades has raised privacy concerns regarding the storage and use of biometric data. As a consequence, the ISO/IEC 24745 international standard on biometric information protection has established two main requirements for protecting biometric templates: irreversibility and unlinkability. Numerous efforts have been directed to the development and analysis of irreversible templates. However, there is still no systematic quantitative manner to analyse the unlinkability of such templates. In this paper we address this shortcoming by proposing a new general framework for the evaluation of biometric templates' unlinkability. To illustrate the potential of the approach, it is applied to assess the unlinkability of four state-of-the-art techniques for biometric template protection: biometric salting, Bloom filters, Homomorphic Encryption and block re-mapping. For the last technique, the proposed framework is compared with other existing metrics to show its advantages.

摘要
在过去二十年中，生物认证系统的广泛应用已引发了隐私问题，特别是 relate to the storage and use of biometric data。为了解决这问题，国际标准ISO/IEC 24745要求保护生物特征模板的两个主要要求是不可逆和不可关联。虽然有很多努力在发展和分析不可逆模板，但是还没有一个系统性的量化方法来分析不可关联性。在这篇论文中，我们正在解决这一缺点，并提出了一个新的通用框架来评估生物特征模板的不可关联性。为了证明我们的方法的潜力，我们应用它来评估四种现状的生物模板保护技术：生物盐、Bloom filter、Homomorphic Encryption和块重映射。对于最后一种技术，我们的框架与其他现有的指标进行比较，以显示它的优势。

Image Patch-Matching with Graph-Based Learning in Street Scenes

paper_url: http://arxiv.org/abs/2311.04617
repo_url: None
paper_authors: Rui She, Qiyu Kang, Sijie Wang, Wee Peng Tay, Yong Liang Guan, Diego Navarro Navarro, Andreas Hartmannsgruber
for: 这篇论文主要针对自动驾驶中的计算视觉任务，即将实时捕捉的车辆摄像头中的图像与图像库中的特征区域匹配。
methods: 该论文提出了一种基于图像图的空间关系学习模型，其中图像patches的edge表示图像区域之间的空间关系。
results: 该模型在多个街景数据集上进行评估，并取得了领先的匹配结果。

Abstract
Matching landmark patches from a real-time image captured by an on-vehicle camera with landmark patches in an image database plays an important role in various computer perception tasks for autonomous driving. Current methods focus on local matching for regions of interest and do not take into account spatial neighborhood relationships among the image patches, which typically correspond to objects in the environment. In this paper, we construct a spatial graph with the graph vertices corresponding to patches and edges capturing the spatial neighborhood information. We propose a joint feature and metric learning model with graph-based learning. We provide a theoretical basis for the graph-based loss by showing that the information distance between the distributions conditioned on matched and unmatched pairs is maximized under our framework. We evaluate our model using several street-scene datasets and demonstrate that our approach achieves state-of-the-art matching results.

摘要
<> translate("Matching landmark patches from a real-time image captured by an on-vehicle camera with landmark patches in an image database plays an important role in various computer perception tasks for autonomous driving. Current methods focus on local matching for regions of interest and do not take into account spatial neighborhood relationships among the image patches, which typically correspond to objects in the environment. In this paper, we construct a spatial graph with the graph vertices corresponding to patches and edges capturing the spatial neighborhood information. We propose a joint feature and metric learning model with graph-based learning. We provide a theoretical basis for the graph-based loss by showing that the information distance between the distributions conditioned on matched and unmatched pairs is maximized under our framework. We evaluate our model using several street-scene datasets and demonstrate that our approach achieves state-of-the-art matching results.")Here's the translation:<>匹配从在车载摄像头捕捉的实时图像中提取的标志性补丁与图像库中的标志性补丁之间的匹配对计算机视觉任务中扮演着重要角色。当前方法主要集中于区域关注点的本地匹配，而不考虑图像补丁之间的空间相邻关系，通常对环境中的物体相应。在本文中，我们构建了一个空间图，其顶点对应于补丁，而边则捕捉了图像补丁之间的空间相邻关系。我们提出了一种联合特征和度量学习模型，并基于图形学习。我们提供了对图形学习损失的理论基础，并证明了我们的框架下，匹配后的分布conditioned on matched和unmatched对的信息距离最大。我们使用了多个街景数据集来评估我们的方法，并证明了我们的方法可以达到状态级匹配结果。

On Characterizing the Evolution of Embedding Space of Neural Networks using Algebraic Topology

paper_url: http://arxiv.org/abs/2311.04592
repo_url: https://github.com/cross-caps/dnntopology
paper_authors: Suryaka Suresh, Bishshoy Das, Vinayak Abrol, Sumantra Dutta Roy
For: 本研究使用深度学习神经网络（DNN）的层次结构来研究特征表示空间的topologic变化。* Methods: 本研究使用Cubical homology来分析深度神经网络的特征表示空间，并对多种流行的深度架构和实际图像数据进行了扩展分析。* Results: 研究发现，随着深度层数的增加，特征表示空间的topologic复杂度逐渐减少，最终达到最低的Betti数。此外，研究还发现了一些对准变换和数据采样等因素的不变性，这些不变性有助于提高神经网络的泛化能力。

Abstract
We study how the topology of feature embedding space changes as it passes through the layers of a well-trained deep neural network (DNN) through Betti numbers. Motivated by existing studies using simplicial complexes on shallow fully connected networks (FCN), we present an extended analysis using Cubical homology instead, with a variety of popular deep architectures and real image datasets. We demonstrate that as depth increases, a topologically complicated dataset is transformed into a simple one, resulting in Betti numbers attaining their lowest possible value. The rate of decay in topological complexity (as a metric) helps quantify the impact of architectural choices on the generalization ability. Interestingly from a representation learning perspective, we highlight several invariances such as topological invariance of (1) an architecture on similar datasets; (2) embedding space of a dataset for architectures of variable depth; (3) embedding space to input resolution/size, and (4) data sub-sampling. In order to further demonstrate the link between expressivity \& the generalization capability of a network, we consider the task of ranking pre-trained models for downstream classification task (transfer learning). Compared to existing approaches, the proposed metric has a better correlation to the actually achievable accuracy via fine-tuning the pre-trained model.

摘要
我们研究深度神经网络（DNN）中层次结构的变化，通过瓶颈复合体（Cubical homology）进行扩展分析，使用多种流行的深度架构和实际图像数据集。我们发现，随着深度增加，复杂的图像数据集变换成简单的一个，导致Betti数达到最低可能的值。 decay 率可以量化架构选择对通用能力的影响。从表示学学习的视角来看，我们发现了一些对称性，包括：(1) 架构在相似的数据集上的 topological invariance; (2) 数据集的嵌入空间的嵌入空间的 topological invariance; (3) 嵌入空间与输入分辨率/大小的 topological invariance; (4) 数据采样的 topological invariance。为了进一步证明拓扑表达能力和通用能力之间的关系，我们考虑了预训练模型的排名任务（transfer learning）。与现有方法相比，我们的指标具有更好的与实际可 achievable 精度的相关性。

Rethinking Human Pose Estimation for Autonomous Driving with 3D Event Representations

paper_url: http://arxiv.org/abs/2311.04591
repo_url: https://github.com/masterhow/eventpointpose
paper_authors: Xiaoting Yin, Hao Shi, Jiaan Chen, Ze Wang, Yaozu Ye, Huajian Ni, Kailun Yang, Kaiwei Wang
for: 提高自动驾驶和停车安全性，通过预测人类行为。
methods: 使用事件摄像机，创建3D事件表示，并开发EV-3DPW数据集。
results: 在公共实际世界DHP19数据集上，事件点云技术实现了实时移动预测，而解除事件 voxel方法达到了最高准确性。实验表明我们的提posed 3D表示方法在 traditional RGB图像和事件帧技术的比较中具有更高的总体化能力。

Abstract
Human pose estimation is a critical component in autonomous driving and parking, enhancing safety by predicting human actions. Traditional frame-based cameras and videos are commonly applied, yet, they become less reliable in scenarios under high dynamic range or heavy motion blur. In contrast, event cameras offer a robust solution for navigating these challenging contexts. Predominant methodologies incorporate event cameras into learning frameworks by accumulating events into event frames. However, such methods tend to marginalize the intrinsic asynchronous and high temporal resolution characteristics of events. This disregard leads to a loss in essential temporal dimension data, crucial for safety-critical tasks associated with dynamic human activities. To address this issue and to unlock the 3D potential of event information, we introduce two 3D event representations: the Rasterized Event Point Cloud (RasEPC) and the Decoupled Event Voxel (DEV). The RasEPC collates events within concise temporal slices at identical positions, preserving 3D attributes with statistical cues and markedly mitigating memory and computational demands. Meanwhile, the DEV representation discretizes events into voxels and projects them across three orthogonal planes, utilizing decoupled event attention to retrieve 3D cues from the 2D planes. Furthermore, we develop and release EV-3DPW, a synthetic event-based dataset crafted to facilitate training and quantitative analysis in outdoor scenes. On the public real-world DHP19 dataset, our event point cloud technique excels in real-time mobile predictions, while the decoupled event voxel method achieves the highest accuracy. Experiments reveal our proposed 3D representation methods' superior generalization capacities against traditional RGB images and event frame techniques. Our code and dataset are available at https://github.com/MasterHow/EventPointPose.

摘要
人体姿态估计是自动驾驶和停车中的关键组件，提高安全性 by 预测人类行为。传统的帧基摄像头和视频通常被应用，但在高动态范围或重重运动模糊的场景下变得不可靠。相比之下，事件摄像头提供了一种可靠的解决方案。大多数方法是将事件摄像头集成到学习框架中，但这些方法通常会忽略事件的本质异步和高时间分辨率特性。这种忽略会导致数据中丢失重要的时间维度信息，这些信息对于安全关键任务相对至关重要。为了解决这个问题并激活事件信息的3D潜力，我们介绍了两种3D事件表示方法：矩阵化事件点云（RasEPC）和解除事件VOXEL（DEV）。 RasEPC将事件按照时间片的方式归并在同一个位置，保留3D特征并减少内存和计算负担。 DEV表示法将事件分解成立方体，并将其投影到三个orthogonal平面上，通过独立事件注意力来捕捉3D准确信息。此外，我们还开发了EV-3DPW Synthetic Event-based Dataset，用于训练和量化分析户外场景。在公共的real-world DHP19数据集上，我们的事件点云技术在实时移动预测中表现出色，而DEV表示法在精度方面达到最高水平。实验表明我们提出的3D表示方法具有传统RGB图像和事件帧技术的更好的总体化能力。我们的代码和数据可以在https://github.com/MasterHow/EventPointPose上获取。

Weakly-supervised deepfake localization in diffusion-generated images

paper_url: http://arxiv.org/abs/2311.04584
repo_url: None
paper_authors: Dragos Tantaru, Elisabeta Oneata, Dan Oneata
for: 这 paper 的目的是提出一种weakly-supervised的 Deepfake detection方法，以便提供更多的信息，包括哪些区域被修改。
methods: 这 paper 使用了 three main categories of methods，包括explanations, local scores 和 attention。这些方法都基于 Xception 网络作为共同背景 Architecure。
results: 这 paper 的结果表明，weakly-supervised localization 是可能的，并且使用 local scores 方法可以更加敏感于缺乏超级vision。

Abstract
The remarkable generative capabilities of denoising diffusion models have raised new concerns regarding the authenticity of the images we see every day on the Internet. However, the vast majority of existing deepfake detection models are tested against previous generative approaches (e.g. GAN) and usually provide only a "fake" or "real" label per image. We believe a more informative output would be to augment the per-image label with a localization map indicating which regions of the input have been manipulated. To this end, we frame this task as a weakly-supervised localization problem and identify three main categories of methods (based on either explanations, local scores or attention), which we compare on an equal footing by using the Xception network as the common backbone architecture. We provide a careful analysis of all the main factors that parameterize the design space: choice of method, type of supervision, dataset and generator used in the creation of manipulated images; our study is enabled by constructing datasets in which only one of the components is varied. Our results show that weakly-supervised localization is attainable, with the best performing detection method (based on local scores) being less sensitive to the looser supervision than to the mismatch in terms of dataset or generator.

摘要
“denoising diffusion模型的卓越生成能力已经引起了互联网上每天看到的图像的真实性的新问题。然而，现有的深伪检测模型都是基于前一代生成方法（如GAN）进行测试，通常只提供每个图像的“伪”或“真”标签。我们认为，更有用的输出将是在每个图像上添加一个涉及到输入图像的涉及级别的地图，以便更好地了解图像中哪些部分被修改。为了实现这一目标，我们将这个任务视为一个弱有监督的地方化任务，并将涉及到的方法分为三大类（基于解释、本地分数或注意力）。我们使用Xception网络作为共同背景架构，并进行了仔细的分析，包括方法选择、监督类型、数据集和生成器在内的所有主要因素。我们的研究表明，弱有监督的地方化是可行的，最高性能的检测方法（基于本地分数）在监督不充分的情况下比 DATASET或生成器的不一致性更加敏感。”

paper_url: http://arxiv.org/abs/2311.04552
repo_url: https://github.com/virginiafdez/brainspade3d_rel
paper_authors: Virginia Fernandez, Walter Hugo Lopez Pinaya, Pedro Borges, Mark S. Graham, Tom Vercauteren, M. Jorge Cardoso
for: 本研究旨在提供一种用于脑MRI和相关分割的三维生成模型，以便 condition on 特定的疾病现象和对比。
methods: 本研究使用了生成对抗网络（GANs）和扩散模型（DMs）来生成高质量的Synthetic MRI和相关分割数据，并允许用户根据特定的疾病现象和对比来控制生成的图像和分割结果。
results: 研究表明，brainSPADE3D可以生成高度具有一致性的Synthetic MRI和相关分割数据，并且可以结合不同的疾病现象来生成混合的图像和分割结果。此外，研究还发现，使用brainSPADE3D可以改善预测模型在不期望的疾病存在时的性能。

Abstract
Generative modelling and synthetic data can be a surrogate for real medical imaging datasets, whose scarcity and difficulty to share can be a nuisance when delivering accurate deep learning models for healthcare applications. In recent years, there has been an increased interest in using these models for data augmentation and synthetic data sharing, using architectures such as generative adversarial networks (GANs) or diffusion models (DMs). Nonetheless, the application of synthetic data to tasks such as 3D magnetic resonance imaging (MRI) segmentation remains limited due to the lack of labels associated with the generated images. Moreover, many of the proposed generative MRI models lack the ability to generate arbitrary modalities due to the absence of explicit contrast conditioning. These limitations prevent the user from adjusting the contrast and content of the images and obtaining more generalisable data for training task-specific models. In this work, we propose brainSPADE3D, a 3D generative model for brain MRI and associated segmentations, where the user can condition on specific pathological phenotypes and contrasts. The proposed joint imaging-segmentation generative model is shown to generate high-fidelity synthetic images and associated segmentations, with the ability to combine pathologies. We demonstrate how the model can alleviate issues with segmentation model performance when unexpected pathologies are present in the data.

摘要
生成模型和人工数据可以作为真实医学影像数据的代理，解决医疗应用中深度学习模型的准确性问题。在最近几年里，有越来越多的人关注使用这些模型进行数据增强和人工数据共享，使用生成敌方网络（GANs）或扩散模型（DMs）。然而，使用生成数据进行3D磁共振成像（MRI）分割 task 仍然受到生成图像无标签的限制，以及生成模型无法生成多种模式的缺乏能力。这些限制使得用户无法调整图像的对比度和内容，从而获得更普适的数据用于训练任务特定模型。在这项工作中，我们提出了brainSPADE3D，一种3D生成模型用于脑MRI和相关的分割。用户可以根据特定的疾病现象和对比来conditioning这些模型。我们展示了该模型可以生成高 fideli ty的人工图像和相关的分割，并能够组合疾病。我们还示出了该模型可以解决预期疾病存在于数据中时，分割模型性能下降的问题。

Learning Robust Multi-Scale Representation for Neural Radiance Fields from Unposed Images

paper_url: http://arxiv.org/abs/2311.04521
repo_url: None
paper_authors: Nishant Jain, Suryansh Kumar, Luc Van Gool
for: 这 paper 是用于解决计算机视觉中的神经图像基于渲染问题，即在给定一组由自由移动相机拍摄的图像时，在测试时使用神经网络synthesize场景图像。
methods: 该 paper 使用了以下方法：（i）通过一个可靠的渠道来重建高精度相机参数，以便在神经新视角synthesize过程中更加准确地模拟场景图像。（ii）在day-to-day不 pose的图像中，模型对象内容的多resolution采用，以适应高速运动的相机。
results: 该 paper 通过实验表明，在不考虑 Camera pose 估计精度的情况下，模型多scale neural scene representation可以是Counterproductive。而在具有准确的相机pose估计的场景表示框架中，可以准确地synthesize图像。

Abstract
We introduce an improved solution to the neural image-based rendering problem in computer vision. Given a set of images taken from a freely moving camera at train time, the proposed approach could synthesize a realistic image of the scene from a novel viewpoint at test time. The key ideas presented in this paper are (i) Recovering accurate camera parameters via a robust pipeline from unposed day-to-day images is equally crucial in neural novel view synthesis problem; (ii) It is rather more practical to model object's content at different resolutions since dramatic camera motion is highly likely in day-to-day unposed images. To incorporate the key ideas, we leverage the fundamentals of scene rigidity, multi-scale neural scene representation, and single-image depth prediction. Concretely, the proposed approach makes the camera parameters as learnable in a neural fields-based modeling framework. By assuming per view depth prediction is given up to scale, we constrain the relative pose between successive frames. From the relative poses, absolute camera pose estimation is modeled via a graph-neural network-based multiple motion averaging within the multi-scale neural-fields network, leading to a single loss function. Optimizing the introduced loss function provides camera intrinsic, extrinsic, and image rendering from unposed images. We demonstrate, with examples, that for a unified framework to accurately model multiscale neural scene representation from day-to-day acquired unposed multi-view images, it is equally essential to have precise camera-pose estimates within the scene representation framework. Without considering robustness measures in the camera pose estimation pipeline, modeling for multi-scale aliasing artifacts can be counterproductive. We present extensive experiments on several benchmark datasets to demonstrate the suitability of our approach.

摘要
我们介绍一个改进了的解决方案，用于计算机视觉中的神经图像基于测试项目。假设我们有一组由自由移动摄像机拍摄的图像，我们的方法可以将这些图像转换为具有真实感的图像，并且在测试时间点上实现不同的观察角度。我们的关键想法包括：1. 从日常生活中的不条理图像中获取精确的摄像机参数，这是神经novel view synthesis问题的重要前提。2. 因为日常生活中的图像可能会受到剧烈的摄像机运动，因此需要在不同的解析率上模型物体内容。为了实现这些想法，我们利用了场景的刚性、多尺度神经场景表示和单图像深度预测的基础知识。具体来说，我们将摄像机参数设置为神经场中的学习型态。通过假设每个视角深度预测是固定的，我们将相关的视角之间的相对位置组成一个多尺度神经网络中的多动作平均，从而得到一个单一的损失函数。通过优化这个损失函数，我们可以获得摄像机参数、摄像机内部和图像输出等。我们在多个 benchmark 数据集上进行了广泛的实验，证明了我们的方法适用于实现多尺度神经场景表示。而不具备稳定性测量的摄像机参数估计在神经场景表示框架中是Equally essential。如果不考虑稳定性测量，则模型多尺度杂质噪压可能会是Counterproductive。

Learning Discriminative Features for Crowd Counting

paper_url: http://arxiv.org/abs/2311.04509
repo_url: None
paper_authors: Yuehai Chen
For: 提高人群计数模型在高度拥挤的区域中的准确性，特别是在人群中的小对象和背景之间的区分。* Methods: 提出了一种学习权重特征框架，包括遮盖特征预测模块（MPM）和监督像素级异常学习模块（CLM），以提高模型在高度拥挤区域中的局部化和对比背景的能力。* Results: 模型在各种计算机视觉任务中，如人群计数和物体检测，在拥挤环境下提高了地面的准确性。

Abstract
Crowd counting models in highly congested areas confront two main challenges: weak localization ability and difficulty in differentiating between foreground and background, leading to inaccurate estimations. The reason is that objects in highly congested areas are normally small and high-level features extracted by convolutional neural networks are less discriminative to represent small objects. To address these problems, we propose a learning discriminative features framework for crowd counting, which is composed of a masked feature prediction module (MPM) and a supervised pixel-level contrastive learning module (CLM). The MPM randomly masks feature vectors in the feature map and then reconstructs them, allowing the model to learn about what is present in the masked regions and improving the model's ability to localize objects in high-density regions. The CLM pulls targets close to each other and pushes them far away from background in the feature space, enabling the model to discriminate foreground objects from background. Additionally, the proposed modules can be beneficial in various computer vision tasks, such as crowd counting and object detection, where dense scenes or cluttered environments pose challenges to accurate localization. The proposed two modules are plug-and-play, incorporating the proposed modules into existing models can potentially boost their performance in these scenarios.

摘要
群体计数模型在高度拥堵的区域面临两大挑战：一是地方化能力弱和Difficulty in differentiating between foreground and background，导致估计不准确。这是因为在高度拥堵的区域中的对象通常是小型，高级特征提取网络的特征更难以区分小对象。为解决这些问题，我们提议一种学习特征分类框架 для群体计数，该框架包括带mask的特征预测模块（MPM）和supervised像素级冲突学习模块（CLM）。MPM randomly masks特征向量在特征图中，然后重建它们，使模型能够学习masked regions中的内容，提高对象的定位能力。CLM pulls targets close to each other and pushes them far away from background in the feature space, enabling the model to discriminate foreground objects from background。此外，提议的两个模块可以在多种计算机视觉任务中有效，如人群计数和物体检测， где高度拥堵的环境或拥堵的环境会对准确的定位 pose challenges。提议的两个模块可以plug-and-play，将其 integrate into existing models可能会提高其性能在这些场景中。

NITEC: Versatile Hand-Annotated Eye Contact Dataset for Ego-Vision Interaction

paper_url: http://arxiv.org/abs/2311.04505
repo_url: https://github.com/thohemp/nitec
paper_authors: Thorsten Hempel, Magnus Jung, Ahmed A. Abdelrahman, Ayoub Al-Hamadi
for: The paper is written for advancing ego-vision-based eye contact research, specifically in the fields of computer vision, human-computer interaction, and social robotics.
methods: The paper presents a hand-annotated eye contact dataset called NITEC, which exceeds existing datasets in size and variety of demographics, social contexts, and lighting conditions.
results: The paper demonstrates strong cross-dataset performance of NITEC, emphasizing its effectiveness and adaptability in various scenarios, and makes the dataset publicly available for further exploration and reproducibility.

Abstract
Eye contact is a crucial non-verbal interaction modality and plays an important role in our everyday social life. While humans are very sensitive to eye contact, the capabilities of machines to capture a person's gaze are still mediocre. We tackle this challenge and present NITEC, a hand-annotated eye contact dataset for ego-vision interaction. NITEC exceeds existing datasets for ego-vision eye contact in size and variety of demographics, social contexts, and lighting conditions, making it a valuable resource for advancing ego-vision-based eye contact research. Our extensive evaluations on NITEC demonstrate strong cross-dataset performance, emphasizing its effectiveness and adaptability in various scenarios, that allows seamless utilization to the fields of computer vision, human-computer interaction, and social robotics. We make our NITEC dataset publicly available to foster reproducibility and further exploration in the field of ego-vision interaction. https://github.com/thohemp/nitec

摘要
眼接触是非常重要的非语言交互方式，在我们每天的社交生活中扮演着重要的角色。然而，机器人的眼接触捕捉能力仍然很差。我们解决这个挑战，并提供了 NITEC，一个手动标注的眼接触数据集 для egovision交互。NITEC 的大小和多样性都超过了现有的 egovision 眼接触数据集，包括不同的人种、社会背景和照明条件，这使得它成为了 egovision 研究中的一个非常有价值的资源。我们对 NITEC 进行了广泛的评估，并证明了它在多个场景中的强大横跨数据集表现，表明它在计算机视觉、人机交互和社交机器人等领域可以无缝应用。我们将 NITEC 数据集公开提供，以便重现和进一步探索 egovision 交互领域。更多信息请参考 https://github.com/thohemp/nitec。

PRED: Pre-training via Semantic Rendering on LiDAR Point Clouds

paper_url: http://arxiv.org/abs/2311.04501
repo_url: None
paper_authors: Hao Yang, Haiyang Wang, Di Dai, Liwei Wang
for: The paper is written for outdoor point cloud pre-training, addressing the issue of incompleteness in point clouds and incorporating images for improved performance.
methods: The paper proposes a novel image-assisted pre-training framework called PRED, which uses a Birds-Eye-View feature map conditioned semantic rendering and point-wise masking with a high mask ratio (95%) to enhance the model’s performance.
results: The paper demonstrates the superiority of PRED over prior point cloud pre-training methods, achieving significant improvements on various large-scale datasets for 3D perception tasks.

Abstract
Pre-training is crucial in 3D-related fields such as autonomous driving where point cloud annotation is costly and challenging. Many recent studies on point cloud pre-training, however, have overlooked the issue of incompleteness, where only a fraction of the points are captured by LiDAR, leading to ambiguity during the training phase. On the other hand, images offer more comprehensive information and richer semantics that can bolster point cloud encoders in addressing the incompleteness issue inherent in point clouds. Yet, incorporating images into point cloud pre-training presents its own challenges due to occlusions, potentially causing misalignments between points and pixels. In this work, we propose PRED, a novel image-assisted pre-training framework for outdoor point clouds in an occlusion-aware manner. The main ingredient of our framework is a Birds-Eye-View (BEV) feature map conditioned semantic rendering, leveraging the semantics of images for supervision through neural rendering. We further enhance our model's performance by incorporating point-wise masking with a high mask ratio (95%). Extensive experiments demonstrate PRED's superiority over prior point cloud pre-training methods, providing significant improvements on various large-scale datasets for 3D perception tasks. Codes will be available at https://github.com/PRED4pc/PRED.

摘要
<>转换文本到简化中文。<>在3D相关领域，如自动驾驶，前期训练是非常重要的。然而，许多最近的点云预训练研究却忽视了点云的不完整性问题，导致训练阶段的模糊性。相比之下，图像具有更广泛的信息和更丰富的 semantics，可以增强点云编码器对点云不完整性的应对。然而，将图像与点云进行结合存在其自身的挑战，因为可能存在遮挡，导致点云和像素之间的不一致。在这个工作中，我们提出了一种新的图像助け预训练框架，称为PRED（图像协助预训练）。我们的框架的主要组成部分是基于 bird's eye view（BEV）的Semantic Feature Map Conditioned Neural Rendering，利用图像semantics来为预训练提供超vision。此外，我们还增强了我们的模型性能，通过点wise掩蔽（mask ratio为95%）。广泛的实验证明PRED的优越性，在多个大规模数据集上提供了3D感知任务的显著改进。代码将在https://github.com/PRED4pc/PRED上提供。

PersonMAE: Person Re-Identification Pre-Training with Masked AutoEncoders

paper_url: http://arxiv.org/abs/2311.04496
repo_url: None
paper_authors: Hezhen Hu, Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Lu Yuan, Dong Chen, Houqiang Li
for: This paper is written for the task of Person Re-identification (ReID), specifically to learn generic feature representation for this task.
methods: The paper proposes a simple yet effective pre-training framework called PersonMAE, which involves two core designs in masked autoencoders to better serve the task of Person Re-ID. The framework generates two regions from the given image, corrupts one region with block-wise masking to mimic common occlusion in ReID, and then predicts the whole other region at both pixel level and semantic feature level.
results: The paper achieves state-of-the-art performance on four downstream ReID tasks, including supervised (holistic and occluded setting), and unsupervised (UDA and USL setting). Specifically, with the ViT-B backbone, the paper achieves 79.8% and 69.5% mAP on the MSMT17 and OccDuke datasets, respectively, surpassing the previous state-of-the-art by a large margin of +8.0 mAP and +5.3 mAP, respectively.

Abstract
Pre-training is playing an increasingly important role in learning generic feature representation for Person Re-identification (ReID). We argue that a high-quality ReID representation should have three properties, namely, multi-level awareness, occlusion robustness, and cross-region invariance. To this end, we propose a simple yet effective pre-training framework, namely PersonMAE, which involves two core designs into masked autoencoders to better serve the task of Person Re-ID. 1) PersonMAE generates two regions from the given image with RegionA as the input and \textit{RegionB} as the prediction target. RegionA is corrupted with block-wise masking to mimic common occlusion in ReID and its remaining visible parts are fed into the encoder. 2) Then PersonMAE aims to predict the whole RegionB at both pixel level and semantic feature level. It encourages its pre-trained feature representations with the three properties mentioned above. These properties make PersonMAE compatible with downstream Person ReID tasks, leading to state-of-the-art performance on four downstream ReID tasks, i.e., supervised (holistic and occluded setting), and unsupervised (UDA and USL setting). Notably, on the commonly adopted supervised setting, PersonMAE with ViT-B backbone achieves 79.8% and 69.5% mAP on the MSMT17 and OccDuke datasets, surpassing the previous state-of-the-art by a large margin of +8.0 mAP, and +5.3 mAP, respectively.

摘要
<>TRANSLATE_TEXT预训练在人识别（ReID）中扮演着日益重要的角色，我们认为高质量的ReID表示应具有三种性能，即多级意识、遮挡Robustness和跨地域一致性。为此，我们提出了一个简单 yet有效的预训练框架，即PersonMAE，该框架包括两个核心设计：1. PersonMAE将给定图像分成两个区域， RegionA 作为输入，RegionB 作为预测目标。 RegionA 会被块性遮盖，以模拟ReID中常见的遮挡，其可见部分会被编码器处理。2. PersonMAE会 endeavour 预测 RegionB 的整个像素级和 semantic feature 级。这种设计使得 PersonMAE 的预训练特征表示具有上述三种性能，这些性能使得 PersonMAE 与下游 ReID 任务相匹配，导致 PersonMAE 在四个下游 ReID 任务中 achieved state-of-the-art 性能，包括超级vised（整体和 occluded 设置）和无监督（UDA 和 USL 设置）。特别是，在通常采用的超级vised 设置下，PersonMAE WITH ViT-B 基础模型 achieved 79.8% 和 69.5% mAP 在 MSMT17 和 OccDuke 数据集上，比前一个 state-of-the-art 的margin 加大了 +8.0 mAP，+5.3 mAP，分别。>>

Non-Rigid Shape Registration via Deep Functional Maps Prior

paper_url: http://arxiv.org/abs/2311.04494
repo_url: https://github.com/rqhuang88/DFR
paper_authors: Puhua Jiang, Mingze Sun, Ruqi Huang
for: 非RIGID shape registration without correspondence supervision
methods: 使用学习基于的框架，通过高维空间映射学习得到非RIGID shape registration
results: 可以处理大幅度内在变换和外在变换的shape registration，并且可以提供高质量的对应关系 between 不同形状对Here’s a more detailed explanation of each point:1. for: The paper proposes a learning-based framework for non-rigid shape registration without correspondence supervision. This means that the framework does not rely on manually specified correspondences between shapes, but instead uses learning-based methods to establish these correspondences.2. methods: The framework uses a combination of high-dimensional embedding and deep functional maps (DFM) to establish correspondences between shapes. The high-dimensional embedding maps shapes into a high-dimensional space where they are easier to align, and the DFM learns a non-linear mapping between the shapes. The correspondences are dynamically updated based on the intermediate registrations and filtered by a consistency prior, which makes the pipeline more robust.3. results: The paper demonstrates that the proposed framework achieves state-of-the-art results on several benchmarks of non-rigid point cloud matching, and can handle both significant extrinsic and intrinsic deformations. The framework is also able to provide high-quality correspondences between unseen challenging shape pairs, which is not possible with traditional registration methods or intrinsic methods.

Abstract
In this paper, we propose a learning-based framework for non-rigid shape registration without correspondence supervision. Traditional shape registration techniques typically rely on correspondences induced by extrinsic proximity, therefore can fail in the presence of large intrinsic deformations. Spectral mapping methods overcome this challenge by embedding shapes into, geometric or learned, high-dimensional spaces, where shapes are easier to align. However, due to the dependency on abstract, non-linear embedding schemes, the latter can be vulnerable with respect to perturbed or alien input. In light of this, our framework takes the best of both worlds. Namely, we deform source mesh towards the target point cloud, guided by correspondences induced by high-dimensional embeddings learned from deep functional maps (DFM). In particular, the correspondences are dynamically updated according to the intermediate registrations and filtered by consistency prior, which prominently robustify the overall pipeline. Moreover, in order to alleviate the requirement of extrinsically aligned input, we train an orientation regressor on a set of aligned synthetic shapes independent of the training shapes for DFM. Empirical results show that, with as few as dozens of training shapes of limited variability, our pipeline achieves state-of-the-art results on several benchmarks of non-rigid point cloud matching, but also delivers high-quality correspondences between unseen challenging shape pairs that undergo both significant extrinsic and intrinsic deformations, in which case neither traditional registration methods nor intrinsic methods work. The code is available at https://github.com/rqhuang88/DFR.

摘要
在这篇论文中，我们提出了一种学习基于的非定制形状匹配框架，不需要对匹配得到超vision。传统的形状匹配技术通常靠 extrinsic proximity 引起的匹配，因此在大规模内部扭变的情况下失败。spectral mapping 方法可以将形状嵌入高维空间中，使形状更容易匹配。然而，由于依赖于抽象的非线性嵌入方案，后者可能对输入数据进行敏感操作。为了解决这个问题，我们的框架结合了两者的优点。具体来说，我们将源网格弯曲到目标点云，以高维空间中学习的深度函数映射（DFM）中的匹配为导向。特别是，匹配在中间registrations 更新和consistency prior 的筛选下进行动态更新，以提高整体的稳定性。此外，为了避免外部对齐的需求，我们在独立于训练形状的synthetic shapes 上训练了一个旋转回归器。实验结果表明，只需几十个有限的训练形状，我们的管道可以在多个非定制点云匹配 benchmark 上达到领先的Result，并且能够在未看到的挑战性形状对应中提供高质量的匹配。代码可以在 https://github.com/rqhuang88/DFR 上获取。

All-Optical Phase Conjugation Using Diffractive Wavefront Processing

paper_url: http://arxiv.org/abs/2311.04473
repo_url: None
paper_authors: Che-Yung Shen, Jingxi Li, Tianyi Gan, Mona Jarrahi, Aydogan Ozcan
for: 用于Counteracting wavefront distortions，包括Imaging和Beam focusing。
methods: 使用Deep learning优化Passive diffractive layers，实现All-optical phase conjugation操作。
results: 通过实验验证，Diffractive wavefront processor可以成功地对phas aberrations进行OPC操作，并且可以在不同的电磁波谱中实现Cost-effective wavefront engineering解决方案。

Abstract
Optical phase conjugation (OPC) is a nonlinear technique used for counteracting wavefront distortions, with various applications ranging from imaging to beam focusing. Here, we present the design of a diffractive wavefront processor to approximate all-optical phase conjugation operation for input fields with phase aberrations. Leveraging deep learning, a set of passive diffractive layers was optimized to all-optically process an arbitrary phase-aberrated coherent field from an input aperture, producing an output field with a phase distribution that is the conjugate of the input wave. We experimentally validated the efficacy of this wavefront processor by 3D fabricating diffractive layers trained using deep learning and performing OPC on phase distortions never seen by the diffractive processor during its training. Employing terahertz radiation, our physical diffractive processor successfully performed the OPC task through a shallow spatially-engineered volume that axially spans tens of wavelengths. In addition to this transmissive OPC configuration, we also created a diffractive phase-conjugate mirror by combining deep learning-optimized diffractive layers with a standard mirror. Given its compact, passive and scalable nature, our diffractive wavefront processor can be used for diverse OPC-related applications, e.g., turbidity suppression and aberration correction, and is also adaptable to different parts of the electromagnetic spectrum, especially those where cost-effective wavefront engineering solutions do not exist.

摘要
光学相 conjugation（OPC）是一种非线性技术，用于对波front distortions 进行矫正，具有各种应用，从图像到波front фокусировки。在这里，我们提出了一种使用 diffractive wavefront processor 来近似 all-optical phase conjugation 操作，用于处理具有相位偏移的输入场。通过深度学习，我们分配了一组 passive diffractive layers 来对输入场进行 all-optical 处理，以生成一个输出场的相位分布，与输入场的相位分布是相似的。我们通过实验验证了这种 wavefront processor 的有效性，使用了 deep learning 进行优化的 diffractive layers 和标准镜的组合，以实现 OPC 任务。使用 terahertz 辐射，我们的物理 diffractive processor 成功完成了 OPC 任务，并在一个 axially 延伸多达数十波长的束缚空间中完成了 shallow 的 SPATIALLY-ENGINEERED 体。此外，我们还创造了一个 diffractive phase-conjugate mirror，通过将 deep learning 优化的 diffractive layers 与标准镜相结合。由于它的 компакт、被动和可扩展性，我们的 diffractive wavefront processor 可以用于多种 OPC-相关的应用，例如浊度补做和偏移补做，并可以适应不同的电磁波谱спектrum，特别是那些没有cost-effective wavefront engineering 解决方案。

Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning

paper_url: http://arxiv.org/abs/2311.04464
repo_url: None
paper_authors: Yao Zhu, Yuefeng Chen, Wei Wang, Xiaofeng Mao, Xiu Yan, Yue Wang, Zhigang Li, Wang lu, Jindong Wang, Xiangyang Ji
for: 本研究的目的是提高深度神经网络在具有限制样本数的低资源场景中表现，通过修改CLIP预训练模型的特定部分来适应不同的少shot任务。
methods: 我们修改了CLIP预训练模型的视觉编码器中的特征权重卷积层，使其在不同的少shot任务中适应不同的 semantics。在训练过程中，我们根据任务特点调整了这些权重，以便模型能够更好地适应具体的任务。在测试阶段，我们使用了差分融合来结合原始的权重卷积层和调整后的权重卷积层，以便将它们两者的知识融合在一起。
results: 我们的方法可以增强传统的少shot CLIP，并且与现有的adapter方法（SAFE-A）兼容。我们的方法可以更好地适应不同的少shot任务，并且在测试阶段的性能得到了提升。

Abstract
Learning generalized representations from limited training samples is crucial for applying deep neural networks in low-resource scenarios. Recently, methods based on Contrastive Language-Image Pre-training (CLIP) have exhibited promising performance in few-shot adaptation tasks. To avoid catastrophic forgetting and overfitting caused by few-shot fine-tuning, existing works usually freeze the parameters of CLIP pre-trained on large-scale datasets, overlooking the possibility that some parameters might not be suitable for downstream tasks. To this end, we revisit CLIP's visual encoder with a specific focus on its distinctive attention pooling layer, which performs a spatial weighted-sum of the dense feature maps. Given that dense feature maps contain meaningful semantic information, and different semantics hold varying importance for diverse downstream tasks (such as prioritizing semantics like ears and eyes in pet classification tasks rather than side mirrors), using the same weighted-sum operation for dense features across different few-shot tasks might not be appropriate. Hence, we propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics. In the inference process, we perform residual blending between the features pooled by the fine-tuned and the original attention pooling layers to incorporate both the few-shot knowledge and the pre-trained CLIP's prior knowledge. We term this method as Semantic-Aware FinE-tuning (SAFE). SAFE is effective in enhancing the conventional few-shot CLIP and is compatible with the existing adapter approach (termed SAFE-A).

摘要
学习通用表示法从有限的训练样本中学习是深度神经网络在低资源场景中应用的关键。最近，基于对比语言图像预训练（CLIP）的方法在几架适应任务中表现出色。为了避免几架适应过拟合和忘记，现有的工作通常将CLIP预训练在大规模数据集上的参数冻结，忽略了可能一些参数不适合下游任务。为此，我们重新审视CLIP的视觉Encoder，尤其是其独特的注意力池化层，该层通过在权重总和中进行空间权重的积分来实现。由于稠密特征图包含有意义的语义信息，而不同语义在不同下游任务中具有不同的重要性（例如在宠物分类任务中更重视耳朵和眼睛而非侧镜），因此在不同几架适应任务中使用同样的权重积分操作可能不合适。因此，我们提议在训练过程中细化注意力池化层的参数，以便让模型强调任务特有的语义。在推理过程中，我们通过将细化后的注意力池化层和原始注意力池化层的特征进行差分融合来 incorporate 两者的知识。我们称这种方法为 Semantic-Aware FinE-tuning（SAFE）。SAFE 有效地提高了传统的几架 CLIP，并且与现有的适配器方法（称为 SAFE-A）兼容。

Retargeting video with an end-to-end framework

paper_url: http://arxiv.org/abs/2311.04458
repo_url: None
paper_authors: Thi-Ngoc-Hanh Le, HuiGuang Huang, Yi-Ru Chen, Tong-Yee Lee
For: 这个研究旨在为 Computer Graphics 应用程序提供影片重定向功能，以增强用户观赏体验。* Methods: 本研究使用了一个终端到终端的 RETVI 方法，具有两个模组：内容特征分析器 (CFA) 和适应型扩展估计器 (ADE)，以解决旧有方法的计算瓶颈和限制。* Results: 实验和评估结果显示，我们的系统在质量和运行时间上具有明显的优势，超越了先前的工作。更多结果可以在 http://graphics.csie.ncku.edu.tw/RETVI 网站上获取。

Abstract
Video holds significance in computer graphics applications. Because of the heterogeneous of digital devices, retargeting videos becomes an essential function to enhance user viewing experience in such applications. In the research of video retargeting, preserving the relevant visual content in videos, avoiding flicking, and processing time are the vital challenges. Extending image retargeting techniques to the video domain is challenging due to the high running time. Prior work of video retargeting mainly utilizes time-consuming preprocessing to analyze frames. Plus, being tolerant of different video content, avoiding important objects from shrinking, and the ability to play with arbitrary ratios are the limitations that need to be resolved in these systems requiring investigation. In this paper, we present an end-to-end RETVI method to retarget videos to arbitrary aspect ratios. We eliminate the computational bottleneck in the conventional approaches by designing RETVI with two modules, content feature analyzer (CFA) and adaptive deforming estimator (ADE). The extensive experiments and evaluations show that our system outperforms previous work in quality and running time. Visit our project website for more results at http://graphics.csie.ncku.edu.tw/RETVI.

摘要
视频具有计算机图形应用中的重要意义。由于数字设备的多样性，对视频进行重定向变得非常重要，以提高用户视觉体验。在研究视频重定向方面，保持视频中相关的视觉内容，避免抖动、处理时间和视频内容的多样性是核心挑战。由于视频重定向技术的运行时间较长，将图像重定向技术应用于视频领域是挑战。现有的视频重定向方法主要通过时间consuming的预处理分析帧来解决这些挑战。此外，保持重要对象不减小、避免抖动和处理时间也是需要解决的问题。在这篇论文中，我们提出了一种终端到终端的视频重定向方法（RETVI），以解决以上问题。我们通过设计内容特征分析器（CFA）和自适应扭转估计器（ADE）两个模块来消除传统方法的计算瓶颈。我们的系统在质量和运行时间方面胜过先前的工作。更多结果可以在我们项目网站上找到：http://graphics.csie.ncku.edu.tw/RETVI。

SS-MAE: Spatial-Spectral Masked Auto-Encoder for Multi-Source Remote Sensing Image Classification

paper_url: http://arxiv.org/abs/2311.04442
repo_url: None
paper_authors: Junyan Lin, Feng Gao, Xiaocheng Shi, Junyu Dong, Qian Du
for: 这个研究旨在提出一种基于自动编码器的隐藏特征模型（SS-MAE），用于混合多源数据的类别 tasks。
methods: 该模型包括空间对应分支和спектраль对应分支，具体来说，空间对应分支随机填充patches，并从原始数据中恢复缺失的像素；而 спектраль对应分支随机填充频道，并从原始数据中恢复缺失的频道。
results: 实验结果显示，Compared with多个基于state-of-the-art的基eline，SS-MAE在三个公开的数据集上表现出色，并且可以充分利用输入数据的空间和спектраль特征。

Abstract
Masked image modeling (MIM) is a highly popular and effective self-supervised learning method for image understanding. Existing MIM-based methods mostly focus on spatial feature modeling, neglecting spectral feature modeling. Meanwhile, existing MIM-based methods use Transformer for feature extraction, some local or high-frequency information may get lost. To this end, we propose a spatial-spectral masked auto-encoder (SS-MAE) for HSI and LiDAR/SAR data joint classification. Specifically, SS-MAE consists of a spatial-wise branch and a spectral-wise branch. The spatial-wise branch masks random patches and reconstructs missing pixels, while the spectral-wise branch masks random spectral channels and reconstructs missing channels. Our SS-MAE fully exploits the spatial and spectral representations of the input data. Furthermore, to complement local features in the training stage, we add two lightweight CNNs for feature extraction. Both global and local features are taken into account for feature modeling. To demonstrate the effectiveness of the proposed SS-MAE, we conduct extensive experiments on three publicly available datasets. Extensive experiments on three multi-source datasets verify the superiority of our SS-MAE compared with several state-of-the-art baselines. The source codes are available at \url{https://github.com/summitgao/SS-MAE}.

摘要
高清晰自适应模型（MIM）是一种非常受欢迎且有效的无监督学习方法，用于图像理解。现有的MIM基于方法主要关注空间特征模型化，忽略spectral特征模型化。另外，现有的MIM基于方法使用Transformer来EXTRACT特征，可能会导致一些本地或高频信息丢失。为了解决这个问题，我们提议一种具有空间特征和spectral特征的掩码自适应编码器（SS-MAE），用于混合高光谱和LiDAR/SAR数据的分类。具体来说，SS-MAE包括一个空间特征分支和一个spectral特征分支。空间特征分支掩码随机质点并重建缺失像素，而spectral特征分支掩码随机spectral通道并重建缺失通道。我们的SS-MAE完全利用输入数据的空间和spectral表示。此外，为了补偿本地特征在训练阶段的不足，我们添加了两个轻量级CNN来EXTRACT特征。我们的方法同时利用全球特征和本地特征来模型特征。为了证明我们提议的SS-MAE的效果，我们在三个公共可用的数据集上进行了广泛的实验。我们的实验结果表明，SS-MAE在多源数据集上的表现明显超过了一些状态的基eline。我们的代码可以在github上找到：https://github.com/summitgao/SS-MAE。

Blurry Video Compression: A Trade-off between Visual Enhancement and Data Compression

paper_url: http://arxiv.org/abs/2311.04430
repo_url: None
paper_authors: Dawit Mureja Argaw, Junsik Kim, In So Kweon
for: 本研究旨在提高视频压缩（VC）方法的 universality，使其在不同的时间前提下能够维持视频质量。
methods: 本研究使用了一种基于可让渡的最小化最大化优化方法，通过利用视频压缩和图像增强之间的自然质量补做，提高视频质量。
results: 对多个标准数据集进行了广泛的实验，证明了我们的方法在比较于现有的VC方法之上具有更高的效果。

Abstract
Existing video compression (VC) methods primarily aim to reduce the spatial and temporal redundancies between consecutive frames in a video while preserving its quality. In this regard, previous works have achieved remarkable results on videos acquired under specific settings such as instant (known) exposure time and shutter speed which often result in sharp videos. However, when these methods are evaluated on videos captured under different temporal priors, which lead to degradations like motion blur and low frame rate, they fail to maintain the quality of the contents. In this work, we tackle the VC problem in a general scenario where a given video can be blurry due to predefined camera settings or dynamics in the scene. By exploiting the natural trade-off between visual enhancement and data compression, we formulate VC as a min-max optimization problem and propose an effective framework and training strategy to tackle the problem. Extensive experimental results on several benchmark datasets confirm the effectiveness of our method compared to several state-of-the-art VC approaches.

摘要
现有的视频压缩（VC）方法主要目标是减少视频中的空间和时间重复性，以保持视频质量。在这种情况下，先前的工作已经实现了在特定的曝光时间和闭合速度下拍摄的视频中获得了出色的结果。然而，当这些方法应用于不同的时间优先顺序下拍摄的视频时，它们无法保持视频内容的质量。在这种情况下，我们解决了视频压缩问题，充分利用了视频增强和数据压缩之间的自然负荷关系，并提出了一种有效的框架和训练策略。对多个标准数据集进行了广泛的实验，证明了我们的方法与许多现有的VC方法相比，有更高的效果。

CSAM: A 2.5D Cross-Slice Attention Module for Anisotropic Volumetric Medical Image Segmentation

paper_url: http://arxiv.org/abs/2311.04942
repo_url: https://github.com/al3x-o-o-hung/csam
paper_authors: Alex Ling Yu Hung, Haoxin Zheng, Kai Zhao, Xiaoxi Du, Kaifeng Pang, Qi Miao, Steven S. Raman, Demetri Terzopoulos, Kyunghyun Sung
for: This paper aims to address the problem of anisotropic volumetric medical data in deep learning-based segmentation, specifically in magnetic resonance imaging (MRI) data.
methods: The proposed method is a 2.5D approach that combines 2D convolution with volumetric information, using a Cross-Slice Attention Module (CSAM) to capture information across all slices in the volume. The CSAM module applies semantic, positional, and slice attention on deep feature maps at different scales.
results: The proposed method was extensively tested using different network architectures and tasks, and the results demonstrate the usefulness and generalizability of CSAM. The code for the proposed method is available at https://github.com/aL3x-O-o-Hung/CSAM.

Abstract
A large portion of volumetric medical data, especially magnetic resonance imaging (MRI) data, is anisotropic, as the through-plane resolution is typically much lower than the in-plane resolution. Both 3D and purely 2D deep learning-based segmentation methods are deficient in dealing with such volumetric data since the performance of 3D methods suffers when confronting anisotropic data, and 2D methods disregard crucial volumetric information. Insufficient work has been done on 2.5D methods, in which 2D convolution is mainly used in concert with volumetric information. These models focus on learning the relationship across slices, but typically have many parameters to train. We offer a Cross-Slice Attention Module (CSAM) with minimal trainable parameters, which captures information across all the slices in the volume by applying semantic, positional, and slice attention on deep feature maps at different scales. Our extensive experiments using different network architectures and tasks demonstrate the usefulness and generalizability of CSAM. Associated code is available at https://github.com/aL3x-O-o-Hung/CSAM.

摘要
“一大部分的体积医学数据，特别是磁共振成像（MRI）数据，具有不对称性，通常在水平方向的分辨率比垂直方向的分辨率低得多。三维和仅二维的深度学习基于的分类方法都不适合处理这种体积数据，因为三维方法在遇到不对称的数据时表现不佳，而二维方法则忽略了体积数据中的重要信息。对二点五维方法的研究相对较少，这些模型通常在标本之间学习关系，但通常有许多受训的参数。我们提出了跨条件注意模块（CSAM），具有最少受训参数，可以在不同标本之间学习关系，并且可以在不同的标本中捕捉到重要信息。我们的广泛实验显示了 CSAM 的有用性和普遍性。相关的代码可以在 GitHub 上找到：https://github.com/aL3x-O-o-Hung/CSAM。”Note: The translation is in Simplified Chinese, which is the standard form of Chinese used in mainland China and Singapore. If you prefer Traditional Chinese, please let me know.

Learning the What and How of Annotation in Video Object Segmentation

paper_url: http://arxiv.org/abs/2311.04414
repo_url: https://github.com/thanosDelatolas/eva-vos
paper_authors: Thanos Delatolas, Vicky Kalogeiton, Dim P. Papadopoulos
for: 提高视频对象分割（VOS）模型训练效率，减少人工标注成本。
methods: 提出了一种人工 Loop（HITL）注意力机制，通过预测哪些帧（”What”）和哪种注意力类型（”How”）进行标注，以提高标注效率。
results: 对MOSE和DAVIS数据集进行实验，比较了EVA-VOS和标准 annotating 方法，结果表明：EVA-VOS可以在3.5倍快的速度达到与人类一致的准确率；选择帧performanced状态的方法得到了状元性的表现；EVA-VOS在标注时间方面具有显著的提升。

Abstract
Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation. Training a VOS model requires an abundance of manually labeled training videos. The de-facto traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects at each video frame. This annotation process, however, is tedious and time-consuming. To reduce this annotation cost, in this paper, we propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation. Unlike the traditional approach, we introduce an agent that predicts iteratively both which frame ("What") to annotate and which annotation type ("How") to use. Then, the annotator annotates only the selected frame that is used to update a VOS module, leading to significant gains in annotation time. We conduct experiments on the MOSE and the DAVIS datasets and we show that: (a) EVA-VOS leads to masks with accuracy close to the human agreement 3.5x faster than the standard way of annotating videos; (b) our frame selection achieves state-of-the-art performance; (c) EVA-VOS yields significant performance gains in terms of annotation time compared to all other methods and baselines.

摘要
视频对象分割（VOS）是许多应用程序中的关键技术，从视频编辑到视频数据生成。训练VOS模型需要大量的手动标注视频。传统的Annotation Way是要求人类 manually draw detailed segmentation masks on the target objects at each video frame。然而，这个Annotation process是费时的和费力的。在这篇论文中，我们提出了EVA-VOS，一个人类在Loop的标注框架 для视频对象分割。与传统方法不同，我们引入了一个代理人，该代理人预测iteratively Which frame ("What") to annotate和Which annotation type ("How") to use。然后，标注者仅标注选择的帧，并将其用于更新VOS模块，从而实现了显著的标注时间缩短。我们在MOSE和DAVIS datasets上进行了实验，并显示了以下结果：（a）EVA-VOS可以在3.5倍 faster than the standard way of annotating videos 实现masks with accuracy close to human agreement。（b）我们的帧选择性能在State-of-the-art level。（c）EVA-VOS比所有方法和基准值具有显著的标注时间缩短。

2023-11-08

cs.AI

cs.AI - 2023-11-08

Geometry-Calibrated DRO: Combating Over-Pessimism with Free Energy Implications

paper_url: http://arxiv.org/abs/2311.05054
repo_url: None
paper_authors: Jiashuo Liu, Jiayun Wu, Tianyu Wang, Hao Zou, Bo Li, Peng Cui
for: 提高机器学习算法对 distribuitional shift 的鲁棒性，Addressing the issue of distributional shifts in machine learning algorithms.
methods: 使用 Distributionally Robust Optimization (DRO) 方法，Optimizing the worst-case risk within an uncertainty set.
results: 提出 Geometry-Calibrated DRO (GCDRO) 方法，which incorporates data geometry into calibration terms to alleviate the impact of noise, and establishes a connection between the risk objective and the Helmholtz free energy in statistical physics. Comprehensive experiments confirm GCDRO’s superiority over conventional DRO methods.

Abstract
Machine learning algorithms minimizing average risk are susceptible to distributional shifts. Distributionally Robust Optimization (DRO) addresses this issue by optimizing the worst-case risk within an uncertainty set. However, DRO suffers from over-pessimism, leading to low-confidence predictions, poor parameter estimations as well as poor generalization. In this work, we conduct a theoretical analysis of a probable root cause of over-pessimism: excessive focus on noisy samples. To alleviate the impact of noise, we incorporate data geometry into calibration terms in DRO, resulting in our novel Geometry-Calibrated DRO (GCDRO) for regression. We establish the connection between our risk objective and the Helmholtz free energy in statistical physics, and this free-energy-based risk can extend to standard DRO methods. Leveraging gradient flow in Wasserstein space, we develop an approximate minimax optimization algorithm with a bounded error ratio and elucidate how our approach mitigates noisy sample effects. Comprehensive experiments confirm GCDRO's superiority over conventional DRO methods.

摘要
Neste trabalho, realizamos uma análise teórica de uma causa provável do pessimismo excessivo: o foco excessivo em amostras ruidosas. Para aliviar o impacto do ruído, incorporamos informações de geometria de dados na calibração de DRO, resultando em nossa técnica novativa de Geometry-Calibrated DRO (GCDRO) para regressão. Estabelecemos a conexão entre nossa meta de risco e a energia livre de Helmholtz na física estatística, e essa medida de risco baseada em energia livre pode ser aplicada a métodos de DRO padrão.Leverando o fluxo de gradiente na espaço de Wasserstein, desenvolvemos um algoritmo de otimização aproximada com um erro de ratio limitado e elucidamos como nossa abordagem mitiga os efeitos das amostras ruidosas. Experimentos compreensivos confirmam a superioridade do GCDRO em relação aos métodos de DRO conventioneis.

Zero-shot Translation of Attention Patterns in VQA Models to Natural Language

paper_url: http://arxiv.org/abs/2311.05043
repo_url: https://github.com/explainableml/zs-a2t
paper_authors: Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata
for: 这 paper 的目的是提出一种可以在零批处理下将 transformer 注意力转换成自然语言的框架，以便更好地理解模型内部的含义。
methods: 该框架基于一个预训练的大型语言模型（LLM），该模型接受任务提示、问题和预测答案作为输入，并根据这些输入选择tokен来描述输入图像中 VQA 模型的注意力区域。
results: 该框架在 textual explanation 数据集上达到了零批处理情况下的州OF-the-art表现，在 GQA-REX 和 VQA-X 上得到了优秀的结果。

Abstract
Converting a model's internals to text can yield human-understandable insights about the model. Inspired by the recent success of training-free approaches for image captioning, we propose ZS-A2T, a zero-shot framework that translates the transformer attention of a given model into natural language without requiring any training. We consider this in the context of Visual Question Answering (VQA). ZS-A2T builds on a pre-trained large language model (LLM), which receives a task prompt, question, and predicted answer, as inputs. The LLM is guided to select tokens which describe the regions in the input image that the VQA model attended to. Crucially, we determine this similarity by exploiting the text-image matching capabilities of the underlying VQA model. Our framework does not require any training and allows the drop-in replacement of different guiding sources (e.g. attribution instead of attention maps), or language models. We evaluate this novel task on textual explanation datasets for VQA, giving state-of-the-art performances for the zero-shot setting on GQA-REX and VQA-X. Our code is available at: https://github.com/ExplainableML/ZS-A2T.

摘要
可以将模型内部转换为文本来提供人类可理解的启示。 drawing inspiration from recent successful training-free approaches for image captioning, we propose ZS-A2T，a zero-shot framework that translates the transformer attention of a given model into natural language without requiring any training. We consider this in the context of Visual Question Answering (VQA). ZS-A2T builds on a pre-trained large language model (LLM), which receives a task prompt, question, and predicted answer as inputs. The LLM is guided to select tokens that describe the regions in the input image that the VQA model attended to. Crucially, we determine this similarity by exploiting the text-image matching capabilities of the underlying VQA model. Our framework does not require any training and allows the drop-in replacement of different guiding sources (e.g., attribution instead of attention maps), or language models. We evaluate this novel task on textual explanation datasets for VQA, achieving state-of-the-art performances for the zero-shot setting on GQA-REX and VQA-X. Our code is available at: https://github.com/ExplainableML/ZS-A2T.

Automated Annotation of Scientific Texts for ML-based Keyphrase Extraction and Validation

paper_url: http://arxiv.org/abs/2311.05042
repo_url: None
paper_authors: Oluwamayowa O. Amusat, Harshad Hegde, Christopher J. Mungall, Anna Giannakou, Neil P. Byers, Dan Gunter, Kjiersten Fagnan, Lavanya Ramakrishnan
for: 本研究旨在提高机器学习生成的 metadata 的可信度，以便更好地搜索和利用生物科学领域中的大数据。
methods: 本研究提出了两种新的自动文本标签方法，即使用不同类型的数据源关联和使用领域专用的控制词汇或 Ontology。
results: 实验结果表明，提posed的标签分配方法可以生成高度特定的文本标签，与机器学习生成的关键字列表匹配度高达44%。

Abstract
Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lacks the essential metadata required for researchers to find and search them effectively. The lack of metadata poses a significant challenge in the utilization of these datasets. Machine learning-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific datasets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming; thus, there is an need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining datasets. In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information about the unlabeled texts and the scientific domain. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches for ML-generated metadata validation. Our results show that the proposed label assignment approaches can generate both generic and highly-specific text labels for the unlabeled texts, with up to 44% of the labels matching with those suggested by a ML keyword extraction algorithm.

摘要
高级数据技术和设施每天生成大量有价值的数据，但这些数据经常缺乏必要的元数据，使研究人员困难地找到和搜索这些数据。缺乏元数据对于使用这些数据而言是一个重要的挑战。基于机器学习的元数据抽取技术已经出现为可能的解决方案，可以自动将科学数据集中添加元数据。文本标签，通常是手动完成的，在验证机器提取的元数据中扮演着关键的角色。然而，手动标签是时间消耗的，因此有一个需要开发自动文本标签技术，以加速科学创新的过程。这个需求特别是在环境遗传学和微生物学等领域 particularly urgent，这些领域在元数据管理和创建高品质的文本挖掘数据方面 historically received less attention。在这篇论文中，我们提出了两种新的自动文本标签方法，用于验证机器生成的元数据的有效性。这两种方法都利用了不同的数据来源之间的关系，以及领域专门的控制词汇或 ontology。我们在这篇论文中详细介绍了这些方法的应用。我们的结果表明，我们的标签分配方法可以为无标签文本生成both generic和高度特定的文本标签，并且最高达44%的标签与机器学习 keyword extraction 算法提出的标签相匹配。

Transfer learning from a sparsely annotated dataset of 3D medical images

paper_url: http://arxiv.org/abs/2311.05032
repo_url: https://github.com/diagnijmegen/medicaltransferlearning3d-unet
paper_authors: Gabriel Efrain Humpire-Mamani, Colin Jacobs, Mathias Prokop, Bram van Ginneken, Nikolas Lessmann
for: This study aims to improve the efficiency of annotation and increase the accessibility of accurate organ segmentation in medical imaging using transfer learning.methods: The authors use transfer learning to leverage pre-trained model features from a large dataset to improve the performance of deep convolutional neural networks for organ segmentation in medical imaging. They use a base segmentation model (3D U-Net) trained on a large and sparsely annotated dataset and fine-tune it for four new down-stream segmentation tasks with fully annotated datasets.results: The results show that transfer learning from the base model is beneficial when small datasets are available, providing significant performance improvements. Fine-tuning the base model is more beneficial than updating all the network weights with vanilla transfer learning. The study also shows that cross-modality transfer learning using CT scans is beneficial. The performance of the fine-tuned models increased by up to 0.129 (+28%) Dice score and on average 23 experiments increased the performance by 0.029 Dice score in the new segmentation tasks.

Abstract
Transfer learning leverages pre-trained model features from a large dataset to save time and resources when training new models for various tasks, potentially enhancing performance. Due to the lack of large datasets in the medical imaging domain, transfer learning from one medical imaging model to other medical imaging models has not been widely explored. This study explores the use of transfer learning to improve the performance of deep convolutional neural networks for organ segmentation in medical imaging. A base segmentation model (3D U-Net) was trained on a large and sparsely annotated dataset; its weights were used for transfer learning on four new down-stream segmentation tasks for which a fully annotated dataset was available. We analyzed the training set size's influence to simulate scarce data. The results showed that transfer learning from the base model was beneficial when small datasets were available, providing significant performance improvements; where fine-tuning the base model is more beneficial than updating all the network weights with vanilla transfer learning. Transfer learning with fine-tuning increased the performance by up to 0.129 (+28\%) Dice score than experiments trained from scratch, and on average 23 experiments increased the performance by 0.029 Dice score in the new segmentation tasks. The study also showed that cross-modality transfer learning using CT scans was beneficial. The findings of this study demonstrate the potential of transfer learning to improve the efficiency of annotation and increase the accessibility of accurate organ segmentation in medical imaging, ultimately leading to improved patient care. We made the network definition and weights publicly available to benefit other users and researchers.

摘要
通过使用已经训练过的模型特征，转移学习可以为训练新的模型 saves time和资源，并且可能提高性能。由于医疗影像领域的大型数据集罕见，医疗影像中的转移学习还没有广泛探索。本研究探讨了使用转移学习提高医疗影像中深度卷积神经网络的组织分 segmentation性能。基本分 segmentation模型（3D U-Net）在一个大型但罕见地注解的数据集上训练，其 weights 用于转移学习四个新的下游分 segmentation任务上。我们分析了训练集大小的影响，以模拟缺乏数据的情况。结果表明，从基本模型进行转移学习在小数据集时是有利的，提供了显著性能提升（+28%）；而在所有网络权重更新为vanilla transfer learning的情况下，进行精度调整（fine-tuning）更是有利。转移学习与精度调整共同提高了新分 segmentation任务的性能，平均提高0.029 Dice分数。研究还发现了在CT扫描图像上进行转移学习的利好。本研究发现，转移学习可以提高医疗影像中的注解效率和准确率，从而提高患者治疗的质量。我们将网络定义和权重公开，以便其他用户和研究人员使用。

Towards Effective Paraphrasing for Information Disguise

paper_url: http://arxiv.org/abs/2311.05018
repo_url: https://github.com/idecir/idecir-towards-effective-paraphrasing-for-information-disguise
paper_authors: Anmol Agarwal, Shrey Gupta, Vamshi Bonagiri, Manas Gaur, Joseph Reagle, Ponnurangam Kumaraguru
for: 本研究的目的是提出一种基于幂等词汇替换的信息隐蔽技术，以防止互联网上作者的文字媒体宣传被非法利用。methods: 本研究使用了自然语言处理技术中的人工智能自动词汇替换工具（如SpinRewriter、WordAI），并通过对 sentence 进行迭代 perturbation 来混淆搜索机制。results: 本研究表明，使用多级词汇替换和幂等词汇替换可以成功隐藏 sentences 82% 的时间。这种方法可以帮助作者隐藏敏感信息，从而减少不良用户利用这些信息的风险。

Abstract
Information Disguise (ID), a part of computational ethics in Natural Language Processing (NLP), is concerned with best practices of textual paraphrasing to prevent the non-consensual use of authors' posts on the Internet. Research on ID becomes important when authors' written online communication pertains to sensitive domains, e.g., mental health. Over time, researchers have utilized AI-based automated word spinners (e.g., SpinRewriter, WordAI) for paraphrasing content. However, these tools fail to satisfy the purpose of ID as their paraphrased content still leads to the source when queried on search engines. There is limited prior work on judging the effectiveness of paraphrasing methods for ID on search engines or their proxies, neural retriever (NeurIR) models. We propose a framework where, for a given sentence from an author's post, we perform iterative perturbation on the sentence in the direction of paraphrasing with an attempt to confuse the search mechanism of a NeurIR system when the sentence is queried on it. Our experiments involve the subreddit 'r/AmItheAsshole' as the source of public content and Dense Passage Retriever as a NeurIR system-based proxy for search engines. Our work introduces a novel method of phrase-importance rankings using perplexity scores and involves multi-level phrase substitutions via beam search. Our multi-phrase substitution scheme succeeds in disguising sentences 82% of the time and hence takes an essential step towards enabling researchers to disguise sensitive content effectively before making it public. We also release the code of our approach.

摘要
信息掩蔽（ID），一部分的计算伦理在自然语言处理（NLP）中，关注在文本重新排序方法的最佳实践中，以防止互联网上作者的帖子不经授权使用。对于研究人员来说，研究ID变得重要，特别是作者在互联网上发表的文本与敏感领域相关，例如心理健康。在过去，研究人员使用基于AI的自动词汇替换工具（如SpinRewriter、WordAI）进行重新排序内容。然而，这些工具并不满足ID的目的，因为它们重新排序后的内容仍然可以跟踪回原始来源。有限的先前研究探讨了重新排序方法的效果对于ID在搜索引擎或其代理Neural Retriever（NeurIR）模型。我们提出了一个框架，其中，对于作者的一句话，我们在重新排序方向下进行迭代 perturbation，以混淆搜索机制中查询该句话的NeurIR系统。我们的实验使用了Reddit上的“r/AmItheAsshole”社区为公共内容的来源，并使用基于Neural Retriever的代理来模拟搜索引擎。我们的方法包括phrase重要性排名使用混淆分数和多级词汇替换via扫描搜索。我们的多词汇替换方案成功地隐藏了句子82%的时间，因此为研究人员隐藏敏感内容的有效方法。我们还发布了我们的方法的代码。

Joint Sensing and Semantic Communications with Multi-Task Deep Learning

paper_url: http://arxiv.org/abs/2311.05017
repo_url: https://github.com/Aryia-Behroziuan/References
paper_authors: Yalin E. Sagduyu, Tugba Erpek, Aylin Yener, Sennur Ulukus
for: 这篇论文探讨了深度学习技术的紧密 integrate 感知通信，包括延伸到semantic communications。
methods: transmitter 使用深度神经网络（encoder）进行源编码、频率编码和模ulation，而 receiver 使用另一个深度神经网络（decoder）进行模ulation、频率解码和源解码来重construct 数据样本。
results: 该实验使用 CIFAR-10 作为输入数据，并考虑了通道效应如添加白噪和衰减抑制。结果表明多任务深度学习可以实现高精度的紧密感知通信和semantic communications。

Abstract
This paper explores the integration of deep learning techniques for joint sensing and communications, with an extension to semantic communications. The integrated system comprises a transmitter and receiver operating over a wireless channel, subject to noise and fading effects. The transmitter employs a deep neural network, namely an encoder, for joint operations of source coding, channel coding, and modulation, while the receiver utilizes another deep neural network, namely a decoder, for joint operations of demodulation, channel decoding, and source decoding to reconstruct the data samples. The transmitted signal serves a dual purpose, supporting communication with the receiver and enabling sensing. When a target is present, the reflected signal is received, and another deep neural network decoder is utilized for sensing. This decoder is responsible for detecting the target's presence and determining its range. All these deep neural networks, including one encoder and two decoders, undergo joint training through multi-task learning, considering data and channel characteristics. This paper extends to incorporate semantic communications by introducing an additional deep neural network, another decoder at the receiver, operating as a task classifier. This decoder evaluates the fidelity of label classification for received signals, enhancing the integration of semantics within the communication process. The study presents results based on using the CIFAR-10 as the input data and accounting for channel effects like Additive White Gaussian Noise (AWGN) and Rayleigh fading. The results underscore the effectiveness of multi-task deep learning in achieving high-fidelity joint sensing and semantic communications.

摘要

Interpreting Pretrained Language Models via Concept Bottlenecks

paper_url: http://arxiv.org/abs/2311.05014
repo_url: None
paper_authors: Zhen Tan, Lu Cheng, Song Wang, Yuan Bo, Jundong Li, Huan Liu
for: 本研究旨在解释PLMs的黑obox性，提高PLMs在自然语言处理任务中的可读性和可理解性。
methods: 我们提出了一种新的方法，利用高级别的意义ful的概念来解释PLMs。我们使用人工标注和机器生成的概念相结合，提取隐藏神经元，以捕捉semantically meaningful和任务特定的概念。
results: 我们通过对实际数据集进行实证研究，发现我们的方法可以提供有价值的解释PLMs的行为，帮助诊断模型失败和提高模型的Robustness。

Abstract
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks. However, the lack of interpretability due to their ``black-box'' nature poses challenges for responsible implementation. Although previous studies have attempted to improve interpretability by using, e.g., attention weights in self-attention layers, these weights often lack clarity, readability, and intuitiveness. In this research, we propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans. For example, we learn the concept of ``Food'' and investigate how it influences the prediction of a model's sentiment towards a restaurant review. We introduce C$^3$M, which combines human-annotated and machine-generated concepts to extract hidden neurons designed to encapsulate semantically meaningful and task-specific concepts. Through empirical evaluations on real-world datasets, we manifest that our approach offers valuable insights to interpret PLM behavior, helps diagnose model failures, and enhances model robustness amidst noisy concept labels.

摘要
Pretrained language models (PLMs) have made significant progress in various natural language processing tasks. However, their "black-box" nature poses challenges for responsible implementation. Previous studies have attempted to improve interpretability by using, for example, attention weights in self-attention layers, but these weights often lack clarity, readability, and intuitiveness. In this research, we propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans. For example, we learn the concept of "Food" and investigate how it influences the prediction of a model's sentiment towards a restaurant review. We introduce C$^3$M, which combines human-annotated and machine-generated concepts to extract hidden neurons designed to encapsulate semantically meaningful and task-specific concepts. Through empirical evaluations on real-world datasets, we demonstrate that our approach offers valuable insights to interpret PLM behavior, helps diagnose model failures, and enhances model robustness amidst noisy concept labels.

Expressibility-induced Concentration of Quantum Neural Tangent Kernels

paper_url: http://arxiv.org/abs/2311.04965
repo_url: None
paper_authors: Li-Wei Yu, Weikang Li, Qi Ye, Zhide Lu, Zizhao Han, Dong-Ling Deng
for: 这篇论文主要研究了量子机器学习模型的性能分析方法，以及这些方法如何应用于实际应用中的宽量子变量电路设计。
methods: 这篇论文使用了量子触 neighboorhood kernel 方法，用于分析量子机器学习模型在无限宽限制下的性能。这些方法还被应用于描述量子神经网络训练误差的整数化方式。
results: 研究发现，在全球损失函数下，高表达能量的全球和本地量子编码可以导致量子触 neighboorhood kernel 值快速减少到零。而在本地损失函数下，虽然高表达能量可以导致量子触 neighboorhood kernel 值快速减少，但是不能完全消除。此外，通过广泛的数值实验， authors 验证了这些分析理论。这些发现对量子机器学习模型的设计提供了重要的指导意见。

Abstract
Quantum tangent kernel methods provide an efficient approach to analyzing the performance of quantum machine learning models in the infinite-width limit, which is of crucial importance in designing appropriate circuit architectures for certain learning tasks. Recently, they have been adapted to describe the convergence rate of training errors in quantum neural networks in an analytical manner. Here, we study the connections between the trainability and expressibility of quantum tangent kernel models. In particular, for global loss functions, we rigorously prove that high expressibility of both the global and local quantum encodings can lead to exponential concentration of quantum tangent kernel values to zero. Whereas for local loss functions, such issue of exponential concentration persists owing to the high expressibility, but can be partially mitigated. We further carry out extensive numerical simulations to support our analytical theories. Our discoveries unveil a pivotal characteristic of quantum neural tangent kernels, offering valuable insights for the design of wide quantum variational circuit models in practical applications.

摘要
量子触感керnel方法提供了一种有效的方式来分析量子机器学习模型在无穷宽限下的性能，这对于设计适当的电路体系结构非常重要。最近，它们已经被适应来描述量子神经网络训练错误的减少速率。在这里，我们研究了量子触感керnel模型的可教化和表达能力之间的连接。特别是，对于全局损失函数，我们严格地证明了高表达能力的全局和本地量子编码的情况下，可以导致量子触感керnel值快速减少到零。而对于本地损失函数，这种问题仍然存在，但可以通过高表达能力来部分缓解。我们还进行了广泛的数值仿真，以支持我们的分析理论。我们的发现揭示了量子神经触感kernek的一个重要特点，为实际应用中的宽量子变量电路模型设计提供了价值的洞察。

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

paper_url: http://arxiv.org/abs/2311.04902
repo_url: https://github.com/rocktimjyotidas/gblm-pruner
paper_authors: Rocktim Jyoti Das, Liqun Ma, Zhiqiang Shen
for: 这个研究是为了提出一种基于梯度的语言模型剔除方法（GBLM-Pruner），以提高这些模型的优化和简化。
methods: 这个方法使用了大型语言模型的预训梯度，通过计算梯度的第一项泰勒展开来决定剔除重要性分数，并且不需要任何训练或重新调整。
results: 试验结果显示，GBLM-Pruner比其他两种方法（SparseGPT和Wanda）在多个测试 benchmark 上表现更好，并且不需要任何额外的训练或重新调整。

Abstract
Large Language Models (LLMs) with a billion or more parameters are prime targets for network pruning, which aims to reduce a portion of the network weights without compromising performance. Prior approaches such as Weights Magnitude, SparseGPT, and Wanda, either concentrated solely on weights or integrated weights with activations for sparsity. However, they overlooked the informative gradients derived from pretrained large language models. In this paper, we present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner leverages the first-order term of the Taylor expansion, operating in a training-free manner by harnessing properly normalized gradients from a few calibration samples to determine the importance pruning score, and substantially outperforms competitive counterparts like SparseGPT and Wanda in multiple benchmarks. Intriguing, after incorporating gradients, the unstructured pruning method tends to reveal some structural patterns post-pruning, which mirrors the geometric interdependence inherent in the LLMs' parameter structure. Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates to maintain its simplicity as other counterparts. Extensive evaluations on LLaMA-1 and LLaMA-2 across various language benchmarks and perplexity show that GBLM-Pruner surpasses magnitude pruning, Wanda (weights+activations) and SparseGPT (weights+activations+weight update) by significant margins. Our code and models are available at https://github.com/RocktimJyotiDas/GBLM-Pruner.

摘要
大型语言模型（LLM） WITH 10亿或更多参数是适用于网络剪裁的目标，寻求减少一部分网络权重而不影响性能。先前的方法，如权重强度、SparseGPT和Wanda，ether solely focused on weights or integrated weights with activations for sparsity，但它们忽略了预训练大语言模型中的有用梯度。在这篇论文中，我们提出了一种新的简洁中心的剪裁方法 для预训练LLM，称为Gradient-based Language Model Pruner（GBLM-Pruner）。GBLM-Pruner利用预训练大语言模型中的梯度进行一ORDER Taylor扩展，在无需训练的情况下，通过正确规范梯度来确定剪裁重要性分数，并显著超越了竞争对手SparseGPT和Wanda在多个benchmark上。有趣的是，在执行剪裁后，不结构化的剪裁方法往往会揭示一些结构性 Patterns，这与LLMs中参数结构的几何相互关系有关。此外，GBLM-Pruner不需要任何后续重新训练或权重更新，以简化其他对手。我们在LLaMA-1和LLaMA-2上进行了多种语言benchmark和折衔度评估，发现GBLM-Pruner在比例剪裁、Wanda（权重+活动）和SparseGPT（权重+活动+权重更新）的情况下具有显著优势。我们的代码和模型可以在https://github.com/RocktimJyotiDas/GBLM-Pruner中找到。

Prompt Sketching for Large Language Models

paper_url: http://arxiv.org/abs/2311.04954
repo_url: None
paper_authors: Luca Beurer-Kellner, Mark Niklas Müller, Marc Fischer, Martin Vechev
For: The paper aims to address the issue of disconnected and wordy intermediate responses in recent prompting strategies for large language models (LLMs).* Methods: The proposed method, called prompt sketching, involves predicting values for multiple variables in a template, allowing users to have more control over the generation process and provide a reasoning framework via intermediate instructions. The key idea is to adapt the decoding procedure to also score follow-up instructions during text generation, optimizing overall template likelihood in inference.* Results: The paper shows that prompt sketching outperforms existing, sequential prompting schemes such as direct asking or chain-of-thought on 7 out of 8 LLM benchmarking tasks, including state tracking, arithmetic reasoning, and general question answering. The paper also releases a number of generic, yet effective sketches applicable to many tasks and an open source library called dclib, powering the sketch-aware decoders.

Abstract
Many recent prompting strategies for large language models (LLMs) query the model multiple times sequentially -- first to produce intermediate results and then the final answer. However, using these methods, both decoder and model are unaware of potential follow-up prompts, leading to disconnected and undesirably wordy intermediate responses. In this work, we address this issue by proposing prompt sketching, a new prompting paradigm in which an LLM does not only respond by completing a prompt, but by predicting values for multiple variables in a template. This way, sketching grants users more control over the generation process, e.g., by providing a reasoning framework via intermediate instructions, leading to better overall results. The key idea enabling sketching with existing, autoregressive models is to adapt the decoding procedure to also score follow-up instructions during text generation, thus optimizing overall template likelihood in inference. Our experiments show that in a zero-shot setting, prompt sketching outperforms existing, sequential prompting schemes such as direct asking or chain-of-thought on 7 out of 8 LLM benchmarking tasks, including state tracking, arithmetic reasoning, and general question answering. To facilitate future use, we release a number of generic, yet effective sketches applicable to many tasks, and an open source library called dclib, powering our sketch-aware decoders.

摘要
很多现代提示策略 для大型自然语言模型（LLM）会在 sequential 方式下询问模型多次——首先生成中间结果，然后生成最终答案。然而，使用这些方法时，decoder和模型都不知道可能的后续提示，导致中间响应不连贯和不 DESIRED wordy。在这项工作中，我们解决这个问题，提出了提示绘制（prompt sketching），一种新的提示方式，在 котором一个 LLM 不仅通过完成提示来回答，而且可以预测多个变量的值在模板中。这样，绘制可以让用户更有控制力量，例如提供了一个reasoning框架 via intermediate instructions，导致更好的总体结果。我们的关键想法是使用现有的、自然进行推断的模型，将解码过程修改为在文本生成过程中也评分后续指令，以便在推断过程中优化总体模板概率。我们的实验表明，在零配置情况下，提示绘制在 7 个 LLMBenchmark 任务上比直接询问或链式思维方法表现出色，包括状态跟踪、数学逻辑和通用问答。为便于未来使用，我们发布了一些通用 yet 有效的绘制，以及一个名为 dclib 的开源库，该库将power我们的绘制执行器。

Two Complementary Perspectives to Continual Learning: Ask Not Only What to Optimize, But Also How

paper_url: http://arxiv.org/abs/2311.04898
repo_url: None
paper_authors: Timm Hess, Tinne Tuytelaars, Gido M. van de Ven
for: 这篇论文主要关注于如何解决深度神经网络的持续学习问题，特别是当开始训练新任务时会发生快速忘记的问题。
methods: 这篇论文提出了一种基于加入回放或调整项的方法来 aproximate 缩寸类损失函数，并且还使用了梯度对映技术来调整优化路径。
results: 这篇论文预计通过结合回放-approximated joint 损失函数和梯度对映-based 优化路径，以测试加入后者是否能够提供以下优点：(1) 缓和稳定差距，(2) 增加学习效率，(3) 提高最终学习成果。

Abstract
Recent years have seen considerable progress in the continual training of deep neural networks, predominantly thanks to approaches that add replay or regularization terms to the loss function to approximate the joint loss over all tasks so far. However, we show that even with a perfect approximation to the joint loss, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task. Motivated by this 'stability gap', we propose that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While there is some continual learning work that alters the optimization trajectory (e.g., using gradient projection techniques), this line of research is positioned as alternative to improving the optimization objective, while we argue it should be complementary. To evaluate the merits of our proposition, we plan to combine replay-approximated joint objectives with gradient projection-based optimization routines to test whether the addition of the latter provides benefits in terms of (1) alleviating the stability gap, (2) increasing the learning efficiency and (3) improving the final learning outcome.

摘要
近年来，深度神经网络的不断训练方法得到了很大的进步，主要是通过添加回放或规则化项来 aproximate 所有任务的共同损失函数。然而，我们发现，即使有完美的共同损失函数近似，这些方法仍会在开始训练新任务时出现临时且显著的忘记。我们被这个"稳定差距"所驱使，我们提议，不断学习策略应该不仅关注优化目标，还应该关注优化目标的优化方式。虽然有一些不断学习研究使用梯度投影技术来修改优化轨迹，但我们认为这种研究应该是补充优化目标的，而不是替代。为了评估我们的提议的价值，我们计划将回放近似的共同损失函数与梯度投影技术相结合，以测试这种组合是否可以提供以下 beneficial effects：1. 减轻稳定差距2. 提高学习效率3. 提高最终学习成果

DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets

paper_url: http://arxiv.org/abs/2311.04894
repo_url: https://github.com/jinga-lala/damex
paper_authors: Yash Jain, Harkirat Behl, Zsolt Kira, Vibhav Vineet
for: 本文旨在提出一种universal detector的建构方法，如何在大量混合数据集上训练一个模型？
methods: 作者提出了一种名为Dataset-Aware Mixture-of-Experts（DAMEX）的解决方案，通过训练专家来 Route每个数据集的令牌到它的映射专家，以提高模型的性能。
results: 在Universal Object-Detection Benchmark上进行了实验，比较了与现有状态的前一代和非MoE基线方法，达到了平均提升10.2个AP分数，并且在不同的数据集混合情况下（1）有限的可用性、（2）不同的领域和（3）不同的标签集）都达到了稳定的提升。此外，作者还质量地表明了DAMEX的专家表示 collapse问题的Robustness。

Abstract
Construction of a universal detector poses a crucial question: How can we most effectively train a model on a large mixture of datasets? The answer lies in learning dataset-specific features and ensembling their knowledge but do all this in a single model. Previous methods achieve this by having separate detection heads on a common backbone but that results in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose Dataset-Aware Mixture-of-Experts, DAMEX where we train the experts to become an `expert' of a dataset by learning to route each dataset tokens to its mapped expert. Experiments on Universal Object-Detection Benchmark show that we outperform the existing state-of-the-art by average +10.2 AP score and improve over our non-MoE baseline by average +2.0 AP score. We also observe consistent gains while mixing datasets with (1) limited availability, (2) disparate domains and (3) divergent label sets. Further, we qualitatively show that DAMEX is robust against expert representation collapse.

摘要
建构一个通用探测器存在一个关键问题：如何有效地训练一个模型在大量的混合数据集上？答案在于学习数据集特有的特征并将其拼接在单一模型中。先前的方法通过在共同脊梁上添加分开的探测头来实现这一点，但这会导致参数的增加。在这项工作中，我们提出了混合专家（MoE）作为解决方案，并证明MoE不仅是一种扩展性工具。我们提出了数据集特性混合专家（DAMEX），其中我们在训练专家时将数据集的各个元素分配给它们所对应的专家。实验表明，我们在通用物体探测benchmark上的平均AP分数高于现有状态的拟合性工具，提高了非MoE基线的平均AP分数 by 10.2个点。我们还发现在混合不同数据集、不同领域和不同标签集时，DAMEX具有一定的稳定性和可靠性。此外，我们还证明DAMEX不容易受到专家表示塌陷的影响。

Towards Few-Annotation Learning in Computer Vision: Application to Image Classification and Object Detection tasks

paper_url: http://arxiv.org/abs/2311.04888
repo_url: None
paper_authors: Quentin Bouniot
for: 这个论文的目的是提出一些用于机器学习的有限标签问题的理论、算法和实验贡献，特别是在计算机视觉中进行图像分类和对象检测。
methods: 这个论文使用了许多现有的Meta-学习算法，以及多任务学习理论基础，以针对少量标签问题进行更有效的meta-学习。此外，它还提出了一种不使用标签的对象检测器预训练方法，以及一种使用部分标签的semi-supervised学习方法。
results: 这个论文的实验结果表明，通过将多任务学习理论与Meta-学习算法相结合，可以更好地适应少量标签问题，并且可以在对象检测器中使用不使用标签的预训练方法来提高对象检测的准确率。

Abstract
In this thesis, we develop theoretical, algorithmic and experimental contributions for Machine Learning with limited labels, and more specifically for the tasks of Image Classification and Object Detection in Computer Vision. In a first contribution, we are interested in bridging the gap between theory and practice for popular Meta-Learning algorithms used in Few-Shot Classification. We make connections to Multi-Task Representation Learning, which benefits from solid theoretical foundations, to verify the best conditions for a more efficient meta-learning. Then, to leverage unlabeled data when training object detectors based on the Transformer architecture, we propose both an unsupervised pretraining and a semi-supervised learning method in two other separate contributions. For pretraining, we improve Contrastive Learning for object detectors by introducing the localization information. Finally, our semi-supervised method is the first tailored to transformer-based detectors.

摘要
在这个论文中，我们提出了理论、算法和实验贡献，用于机器学习具有有限标签数据的任务，特别是计算机视觉中的图像分类和物体检测。在第一个贡献中，我们尝试 bridge 理论和实践中的各种Meta-Learning算法，用于少量样本分类。我们与多任务学习理论建立连接，以验证最佳的meta-learning条件。然后，我们提出了一种不使用标签数据的对象检测器培训方法，基于Transformer架构。在这两个分布中，我们分别提出了一种无监督预训练方法和一种半监督学习方法。在预训练中，我们提出了一种基于本地化信息的对比学习方法，以提高对象检测器的性能。最后，我们的半监督学习方法是首次应用于基于Transformer架构的对象检测器。

SEMQA: Semi-Extractive Multi-Source Question Answering

paper_url: http://arxiv.org/abs/2311.04886
repo_url: https://github.com/google-research-datasets/quotesum
paper_authors: Tal Schuster, Adam D. Lelkes, Haitian Sun, Jai Gupta, Jonathan Berant, William W. Cohen, Donald Metzler
for: 这篇论文旨在提出一种新的多选问答任务，即将多个多样性的源文摘要成一篇全面的答案，以便更好地评估语言模型的能力。
methods: 这篇论文使用了 semi-extractive 方法，即将factual quoted spans（直接从输入源文中摘取的 span）和非factual free-text connectors（将这些 span 连接成一个完整的 passage）相结合，以生成一个全面的答案。
results: 经过对多个语言模型的实验， authors 发现这个任务 surprisingly 具有挑战性，表明 QuoteSum 可以用于开发和研究这种混合摘要能力。

Abstract
Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge. In this work, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer, while mixing factual quoted spans -- copied verbatim from given input sources -- and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. Particularly, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by-design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of QuoteSum for developing and studying such consolidation capabilities.

摘要
最近提出的长形问答系统（QA），支持大型自然语言模型（LLM），已经显示了有前途的能力。然而，归功和验证生成的抽象答案是一个Difficult Challenge。在这种工作中，我们引入了一个新的问答任务，即摘要多个多样的源文，并生成一个包含多个事实引用 span 和自由文本连接器的全面答案。这种设定可以在EXTRACTIVE QA系统的输出和具有更高级别的自由文本生成能力的语言模型之间形成一个桥接。特别是，它允许语言模型利用其高级语言生成能力，同时生成易于验证、解释和评估的精准引用。为研究这个任务，我们创建了第一个这种类型的数据集，即QuoteSum，其中包含人类编写的 semi-extractive 答案，以及自然和生成的问题。我们定义了文本基于的评估指标。在不同的设定下，我们使用多种语言模型进行实验，发现这个任务实际上非常困难，这表明QuoteSum 对开发和研究这种整合能力的研究具有重要的意义。

LongQLoRA: Efficient and Effective Method to Extend Context Length of Large Language Models

paper_url: http://arxiv.org/abs/2311.04879
repo_url: https://github.com/yangjianxin1/longqlora
paper_authors: Jianxin Yang
for: 提高大语言模型的上下文长度，并采用少量训练资源。
methods: combining Position Interpolation, QLoRA和Shift Short Attention of LongLoRA，并在单个32GB V100 GPU上进行训练。
results: 可以将LLaMA2 7B和13B的上下文长度从4096延长到8192和更长，并在PG19和Proof-pile数据集上实现竞争性的抗抑制性表现，并且与MPT-7B-8K在评估上下文长度为8192的情况下几乎相同。

Abstract
We present LongQLoRA, an efficient and effective method to extend context length of large language models with less training resources. LongQLoRA combines the advantages of Position Interpolation, QLoRA and Shift Short Attention of LongLoRA. With a single 32GB V100 GPU, LongQLoRA can extend the context length of LLaMA2 7B and 13B from 4096 to 8192 and even to 12k within 1000 finetuning steps. LongQLoRA achieves competitive perplexity performance on PG19 and Proof-pile datasets, our model outperforms LongLoRA and is very close to MPT-7B-8K within the evaluation context length of 8192. We collect and build 39k long instruction data to extend context length of Vicuna-13B from 4096 to 8192 and achieve good performance both in long and short context generation task. We also do some ablation experiments to study the effect of LoRA rank, finetuning steps and attention patterns in inference.The model weights, training data and code are avaliable at https://github.com/yangjianxin1/LongQLoRA.

摘要
我们提出了LongQLoRA，一种高效和有效的方法，可以将大语言模型的上下文长度延长，使用更少的训练资源。LongQLoRA结合了Position Interpolation、QLoRA和Shift Short Attention的优点。使用单个32GB V100 GPU，LongQLoRA可以将LLaMA2 7B和13B的上下文长度从4096提高至8192以及12k，在1000个finetuning步骤内完成。LongQLoRA在PG19和Proof-pile数据集上达到了竞争力的折射性表现，我们的模型比LongLoRA更高效，与MPT-7B-8K在评估Context length为8192的情况下几乎相当。我们收集了39k个长 instrucion数据，以延长Vicuna-13B的上下文长度从4096提高至8192，并在长和短上下文生成任务中达到了良好的表现。我们还进行了一些剥夺实验，以研究LoRA排名、finetuning步骤和注意模式在推理中的效果。模型权重、训练数据和代码可以在https://github.com/yangjianxin1/LongQLoRA中下载。

Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

paper_url: http://arxiv.org/abs/2311.04850
repo_url: https://github.com/lm-sys/llm-decontaminator
paper_authors: Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica
for: 这篇论文的目的是探讨大型自然语言模型（LLM）在训练时可能会受到污染的问题，以及如何使用更强大的检测方法来解决这个问题。
methods: 本文使用了多种检测方法来检查LLM的训练数据是否受到污染，包括字串匹配（n-gram overlap）和简译等方法。另外，本文还提出了一个更强大的LLM-based检测方法，并将其应用于广泛使用的预训练和终端训练数据中。
results: 本文的实验结果显示，使用现有的检测方法可能无法彻底检测LLM的污染问题，而且简译等方法可以轻松地绕过检测方法。另外，本文还发现了一些实验数据中的污染问题，包括GPT-3.5/4生成的 sintetic数据。总之，本文强调了需要更强大的检测方法来确保LLM的训练数据是clean的，并且呼吁了社区对于公共benchmark的使用进行更多的检测和监控。

Abstract
Large language models are increasingly trained on all the data ever produced by humans. Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets. While most data decontamination efforts apply string matching (e.g., n-gram overlap) to remove benchmark data, we show that these methods are insufficient, and simple variations of test data (e.g., paraphrasing, translation) can easily bypass these decontamination measures. Furthermore, we demonstrate that if such variation of test data is not eliminated, a 13B model can easily overfit a test benchmark and achieve drastically high performance, on par with GPT-4. We validate such observations in widely used benchmarks such as MMLU, GSK8k, and HumanEval. To address this growing risk, we propose a stronger LLM-based decontamination method and apply it to widely used pre-training and fine-tuning datasets, revealing significant previously unknown test overlap. For example, in pre-training sets such as RedPajama-Data-1T and StarCoder-Data, we identified that 8-18\% of the HumanEval benchmark overlaps. Interestingly, we also find such contamination in synthetic dataset generated by GPT-3.5/4, suggesting a potential risk of unintentional contamination. We urge the community to adopt stronger decontamination approaches when using public benchmarks. Moreover, we call for the community to actively develop fresh one-time exams to evaluate models accurately. Our decontamination tool is publicly available at https://github.com/lm-sys/llm-decontaminator.

摘要
大型语言模型在训练时使用了所有人类生产的数据。许多人提出了公共参考数据的可靠性问题，因为可能有污染在预训或精革 dataset 中。大多数数据净化努力使用字串匹配（例如 n-gram 重叠）移除参考数据，但我们表明这些方法是不充分的，并且简单的变体（例如重写、翻译）可以轻松地绕过这些净化措施。此外，我们显示了如果这种变体数据不被消除，则一个 13B 模型可以轻松地适应参考数据，并取得极高的性能，与 GPT-4 相似。我们在广泛使用的参考数据中进行验证这些观察，包括 MMLU、GSK8k 和 HumanEval。为了解决这个增长的风险，我们提出了一个更强的 LLM-based 净化方法，并将其应用到广泛使用的预训和精革 dataset 中，发现了 significannot 前所未知的参考数据重合。例如，在 RedPajama-Data-1T 和 StarCoder-Data 预训集中，我们发现了 8-18% 的 HumanEval 参考数据重合。 interestingly，我们也发现了这种污染在 GPT-3.5/4 生成的 sintetic 数据中，建议社区将更强的净化方法应用到公共参考数据中，以确保模型的测试性能是可靠的。此外，我们呼吁社区积极开发新的一次验证问题，以确保模型的性能是正确的。我们的净化工具已经在 GitHub 上公开，可以在 https://github.com/lm-sys/llm-decontaminator 中找到。

Identifying Semantic Component for Robust Molecular Property Prediction

paper_url: http://arxiv.org/abs/2311.04837
repo_url: https://github.com/dmirlab-group/sci
paper_authors: Zijian Li, Zunhong Xu, Ruichu Cai, Zhenhui Yang, Yuguang Yan, Zhifeng Hao, Guangyi Chen, Kun Zhang
for: 本研究旨在提高Graph Neural Networks（GNN）在不同数据集下的泛化能力。
methods: 我们提出了一种名为Semantic-Components Identifiability（SCI）的生成模型，可以将latent variable分解成semantic-relevant（SR）和semantic-irrelevant（SI）组成部分。
results: 我们的实验研究表明，SCI方法可以在21个数据集上 achieve state-of-the-art performance，并且可以提供更多的泛化性能。此外，我们的Visualization结果也提供了具有启发性的案例研究和预测结果的解释。

Abstract
Although graph neural networks have achieved great success in the task of molecular property prediction in recent years, their generalization ability under out-of-distribution (OOD) settings is still under-explored. Different from existing methods that learn discriminative representations for prediction, we propose a generative model with semantic-components identifiability, named SCI. We demonstrate that the latent variables in this generative model can be explicitly identified into semantic-relevant (SR) and semantic-irrelevant (SI) components, which contributes to better OOD generalization by involving minimal change properties of causal mechanisms. Specifically, we first formulate the data generation process from the atom level to the molecular level, where the latent space is split into SI substructures, SR substructures, and SR atom variables. Sequentially, to reduce misidentification, we restrict the minimal changes of the SR atom variables and add a semantic latent substructure regularization to mitigate the variance of the SR substructure under augmented domain changes. Under mild assumptions, we prove the block-wise identifiability of the SR substructure and the comment-wise identifiability of SR atom variables. Experimental studies achieve state-of-the-art performance and show general improvement on 21 datasets in 3 mainstream benchmarks. Moreover, the visualization results of the proposed SCI method provide insightful case studies and explanations for the prediction results. The code is available at: https://github.com/DMIRLAB-Group/SCI.

摘要
虽然 Graf Neural Networks 在 recent 年取得了大量成功的质量预测task，但它们的Out-of-Distribution（OOD）环境下的普遍性仍然受探索。不同于现有的方法将学习探测表示，我们提出了具有semantic-components可识别的生成模型，称为SCI。我们证明了隐藏变量在这个生成模型中可以明确地分解为semantic-相关（SR）和semantic-不相关（SI）组成部分，这对于OOD普遍性做出了贡献，因为它们涉及到最小改变的causal mechanism。具体来说，我们首先将质量生成从原子层次到分子层次，其隐藏空间被拆分为SI子结构、SR子结构和SR原子变量。接着，为了降低错误识别，我们限制SR原子变量的最小改变，并将SR子结构加入semantic latent substructure regularization，以减少对于增强领域变化的变异。根据严谨的假设，我们证明了SR子结构的对��分解和SR原子变量的对��分解。实验研究获得了3大主流benchmark中的state-of-the-art表现，并在21个数据集上显示了一般提高。此外，我们的SCI方法的视觉化结果将提供了有用的案例研究和预测结果解释。SCI代码可以在以下地址获取：https://github.com/DMIRLAB-Group/SCI。

Decentralized Personalized Online Federated Learning

paper_url: http://arxiv.org/abs/2311.04817
repo_url: None
paper_authors: Renzhi Wu, Saayan Mitra, Xiang Chen, Anup Rao
for: 这个论文旨在提出一种新的学习设定，即分布式个性化在线学习（Decentralized Personalized Online Federated Learning，DPOEL），以满足企业端服务器（enterprise edge servers）上一些重要应用程序的需求。
methods: 该论文提出了两种技术挑战：首先，如何将来自邻居客户端的共享模型参数集成到本地模型中，以获得良好的本地模型性能。其次，如何选择客户端与其他客户端进行交互的邻居。该论文提出了一种基于学习权重的对等选择方法。
results: 该论文在三个实际项赋予预测数据集和一个空气质量预测数据集上进行了实验，并证明了其效果和可靠性。

Abstract
Vanilla federated learning does not support learning in an online environment, learning a personalized model on each client, and learning in a decentralized setting. There are existing methods extending federated learning in each of the three aspects. However, some important applications on enterprise edge servers (e.g. online item recommendation at global scale) involve the three aspects at the same time. Therefore, we propose a new learning setting \textit{Decentralized Personalized Online Federated Learning} that considers all the three aspects at the same time. In this new setting for learning, the first technical challenge is how to aggregate the shared model parameters from neighboring clients to obtain a personalized local model with good performance on each client. We propose to directly learn an aggregation by optimizing the performance of the local model with respect to the aggregation weights. This not only improves personalization of each local model but also helps the local model adapting to potential data shift by intelligently incorporating the right amount of information from its neighbors. The second challenge is how to select the neighbors for each client. We propose a peer selection method based on the learned aggregation weights enabling each client to select the most helpful neighbors and reduce communication cost at the same time. We verify the effectiveness and robustness of our proposed method on three real-world item recommendation datasets and one air quality prediction dataset.

摘要
vanilla federated learning 不支持在线学习、学习每个客户端上的个性化模型，以及分布式设置下学习。现有的方法可以在每个方面进行扩展。然而，一些重要的企业端服务器应用（例如，全球范围内的在线项目推荐）需要同时考虑这三个方面。因此，我们提出了一种新的学习设定——分布式个性化在线 federated learning，它同时考虑了这三个方面。在这种新的学习设定中，技术挑战之一是如何将来自邻居客户端的共享模型参数集成到每个客户端上，以获得个性化的本地模型。我们提议直接通过优化本地模型的性能来学习权重。这不仅提高了每个本地模型的个性化程度，还帮助本地模型适应数据变化，通过智能地包含邻居客户端的信息来适应可能出现的数据变化。另一个挑战是如何选择每个客户端的邻居。我们提议基于学习的权重来选择邻居，使每个客户端可以选择最有助的邻居，同时降低通信成本。我们对三个实际的 item recommendation 数据集和一个空气质量预测数据集进行了验证和robustness测试，结果表明我们的方法是有效和可靠的。

MTGER: Multi-view Temporal Graph Enhanced Temporal Reasoning over Time-Involved Document

paper_url: http://arxiv.org/abs/2311.04816
repo_url: None
paper_authors: Zheng Chu, Zekun Wang, Jiafeng Liang, Ming Liu, Bing Qin
for: 这个论文是用于解决文档中的时间关系和推理问题的。
methods: 该论文提出了一种多视图时间图加强的时间推理框架（MTGER），该框架可以Explicitly模型文档中的时间关系，并通过多视图机制和自动融合来提高模型的隐式推理能力。
results: 实验结果表明，MTGER可以在TimeQA和SituatedQA datasets上达到显著的效果，并且在问题变化时能够给出更一致的答案。

Abstract
The facts and time in the document are intricately intertwined, making temporal reasoning over documents challenging. Previous work models time implicitly, making it difficult to handle such complex relationships. To address this issue, we propose MTGER, a novel Multi-view Temporal Graph Enhanced Temporal Reasoning framework for temporal reasoning over time-involved documents. Concretely, MTGER explicitly models the temporal relationships among facts by multi-view temporal graphs. On the one hand, the heterogeneous temporal graphs explicitly model the temporal and discourse relationships among facts; on the other hand, the multi-view mechanism captures both time-focused and fact-focused information, allowing the two views to complement each other through adaptive fusion. To further improve the implicit reasoning capability of the model, we design a self-supervised time-comparing objective. Extensive experimental results demonstrate the effectiveness of our method on the TimeQA and SituatedQA datasets. Furthermore, MTGER gives more consistent answers under question perturbations.

摘要
文档中的事实和时间关系紧密相连，使得文档中的时间逻辑推理变得困难。前期工作中的模型对时间进行了隐式表示，导致处理复杂的时间关系变得困难。为解决这个问题，我们提议MTGER，一种基于多视图时间图的新型多视图时间图增强的时间逻辑推理框架。具体来说，MTGER使用多视图时间图来明确事实之间的时间关系。一方面，不同视图中的时间图表示了事实之间的时间和论述关系；另一方面，多视图机制使得时间和事实信息相互补充，通过适应融合来增强模型的隐式逻辑能力。此外，我们还设计了一个自动supervised时间比较目标，以提高模型的隐式逻辑能力。实验结果表明，MTGER在TimeQA和SituatedQA datasets上具有显著的效果，并且在问题扰动下的答案更加一致。

DACBERT: Leveraging Dependency Agreement for Cost-Efficient Bert Pretraining

paper_url: http://arxiv.org/abs/2311.04799
repo_url: https://github.com/sw-packages/fa101e30ca4ffd6a0479993b0e1c7299d2311c0416c0b68e2551534430e1e8fe
paper_authors: Martin Kuo, Jianyi Zhang, Yiran Chen
for: 提高预训练模型的性能和可解性，以及增强自然语言理解任务中预训练模型的表现。
methods: 提出了一种新的预训练模型——依赖协议拟合BERT（DACBERT），并开发了一种两阶段预训练方框架——依赖协议预训练。这个方框架基于语言理论，将语法和 semantics信息灵活地纳入预训练过程中。第一阶段使用四个专门的子模型来捕捉chunk级别的代表性依赖关系，并将这些依赖关系转化为嵌入。第二阶段使用这些精细化的嵌入，与传统的BERT嵌入相结合，导向预训练 осталь部分的模型。
results: 在GLUE测试 benchmark上，我们的DACBERT表现出色，在不同任务中的表现都有所提高，比Crammed BERT提高3.13%的RTE任务和2.26%的MRPC任务。此外，我们的方法可以在单个GPU上，在24小时内完成预训练过程，不需要额外的计算资源或延长预训练时间。广泛的研究还证明了我们的方法在自然语言理解任务中的表现是可靠的。

Abstract
Building on the cost-efficient pretraining advancements brought about by Crammed BERT, we enhance its performance and interpretability further by introducing a novel pretrained model Dependency Agreement Crammed BERT (DACBERT) and its two-stage pretraining framework - Dependency Agreement Pretraining. This framework, grounded by linguistic theories, seamlessly weaves syntax and semantic information into the pretraining process. The first stage employs four dedicated submodels to capture representative dependency agreements at the chunk level, effectively converting these agreements into embeddings. The second stage uses these refined embeddings, in tandem with conventional BERT embeddings, to guide the pretraining of the rest of the model. Evaluated on the GLUE benchmark, our DACBERT demonstrates notable improvement across various tasks, surpassing Crammed BERT by 3.13% in the RTE task and by 2.26% in the MRPC task. Furthermore, our method boosts the average GLUE score by 0.83%, underscoring its significant potential. The pretraining process can be efficiently executed on a single GPU within a 24-hour cycle, necessitating no supplementary computational resources or extending the pretraining duration compared with the Crammed BERT. Extensive studies further illuminate our approach's instrumental role in bolstering the interpretability of pretrained language models for natural language understanding tasks.

摘要
基于Cost-efficient pre-training的进步，我们又提出了一种新的预训练模型——Dependency Agreement Crammed BERT（DACBERT）和其两个阶段预训练框架——Dependency Agreement Pretraining。这个框架，基于语言理论，通过将 sintax和semantic信息灵活地整合到预训练过程中来提高模型的性能和可解性。第一个阶段使用四个专门的子模型来捕捉chunk级别的代名词协议，并将其转化为嵌入。第二个阶段使用这些精炼的嵌入，与普通的BERT嵌入一起，导引预训练其余部分的模型。在GLUE标准测试 benchmark上，我们的DACBERT表现出色，在RTE任务上超过Crammed BERT by 3.13%，在MRPC任务上超过by 2.26%。此外，我们的方法提高了GLUE平均分数 by 0.83%，强调其显著的潜力。预训练过程可以在单个GPU上完成 within a 24-hour cycle，不需要补充的计算资源或延长预训练时间与Crammed BERT相比。广泛的研究也证明了我们的方法在自然语言理解任务中增强预训练语言模型的可解性。

On the Multiple Roles of Ontologies in Explainable AI

paper_url: http://arxiv.org/abs/2311.04778
repo_url: None
paper_authors: Roberto Confalonieri, Giancarlo Guizzardi
for: 这篇论文探讨了ontology在可解释AI和人类中心的解释系统中的不同角色。
methods: 论文考虑了三个主要的ontology应用角度，包括参考模型、通用常识逻辑和知识精简和复杂性管理。
results: 论文结论提出了 Ontology-based 方法可以帮助解释AI 的人类理解和效果，但还需要解决一些挑战。

Abstract
This paper discusses the different roles that explicit knowledge, in particular ontologies, can play in Explainable AI and in the development of human-centric explainable systems and intelligible explanations. We consider three main perspectives in which ontologies can contribute significantly, namely reference modelling, common-sense reasoning, and knowledge refinement and complexity management. We overview some of the existing approaches in the literature, and we position them according to these three proposed perspectives. The paper concludes by discussing what challenges still need to be addressed to enable ontology-based approaches to explanation and to evaluate their human-understandability and effectiveness.

摘要

Vital Sign Forecasting for Sepsis Patients in ICUs

paper_url: http://arxiv.org/abs/2311.04770
repo_url: None
paper_authors: Anubhav Bhatti, Yuwei Liu, Chen Dan, Bingjie Shen, San Lee, Yonghwan Kim, Jang Yong Kim
for: 预测Intensive Care Units（ICU）中病人的生命体征指标，帮助医疗工作者早发现生命体征不稳定的迹象并预测 septic shock 的发展
methods: 使用现代深度学习（DL）架构，开发了一种多步预测系统，利用历史生命体征数据预测未来的生命体征状况
results: 比较了三种DL模型（N-BEATS、N-HiTS、Temporal Fusion Transformer）在 eICU Collaborative Research Database 上的预测能力，发现 TFT 模型能够更好地捕捉生命体征趋势，而 N-HiTS 模型能够更好地保持生命体征短期波动在预定范围内

Abstract
Sepsis and septic shock are a critical medical condition affecting millions globally, with a substantial mortality rate. This paper uses state-of-the-art deep learning (DL) architectures to introduce a multi-step forecasting system to predict vital signs indicative of septic shock progression in Intensive Care Units (ICUs). Our approach utilizes a short window of historical vital sign data to forecast future physiological conditions. We introduce a DL-based vital sign forecasting system that predicts up to 3 hours of future vital signs from 6 hours of past data. We further adopt the DILATE loss function to capture better the shape and temporal dynamics of vital signs, which are critical for clinical decision-making. We compare three DL models, N-BEATS, N-HiTS, and Temporal Fusion Transformer (TFT), using the publicly available eICU Collaborative Research Database (eICU-CRD), highlighting their forecasting capabilities in a critical care setting. We evaluate the performance of our models using mean squared error (MSE) and dynamic time warping (DTW) metrics. Our findings show that while TFT excels in capturing overall trends, N-HiTS is superior in retaining short-term fluctuations within a predefined range. This paper demonstrates the potential of deep learning in transforming the monitoring systems in ICUs, potentially leading to significant improvements in patient care and outcomes by accurately forecasting vital signs to assist healthcare providers in detecting early signs of physiological instability and anticipating septic shock.

摘要
septic shock 和 septic shock 是一种严重的医疗情况，影响全球数百万人，mortality rate 较高。这篇论文使用当前最先进的深度学习（DL）建筑，引入一种多步预测系统，以预测 ICU 中重要的生物参数，以便评估 septic shock 的进程。我们的方法使用一个短时间的历史生物参数数据，预测未来的生物参数。我们引入了 DL 基于的生物参数预测系统，可以预测未来 3 小时的生物参数，从 6 小时的历史数据中。我们进一步采用 DILATE 损失函数，以更好地捕捉生物参数的形态和时间动态，这些是临床决策中的关键。我们使用公共可用的 eICU 合作研究数据库（eICU-CRD），比较三种 DL 模型（N-BEATS、N-HiTS 和 Temporal Fusion Transformer ）的预测能力。我们使用 mean squared error（MSE）和 dynamic time warping（DTW） метри来评估我们的模型。我们的发现显示，虽然 TFT 能够捕捉总趋势，但 N-HiTS 在保持短期波动内的范围内表现更优。这篇论文示出了深度学习在 ICU 监测系统中的潜在潜力，可能导致患者护理和结果的显著改善，通过准确预测生物参数，帮助医疗提供者早期检测生物参数的不稳定，预测 septic shock。

The voraus-AD Dataset for Anomaly Detection in Robot Applications

paper_url: http://arxiv.org/abs/2311.04765
repo_url: https://github.com/vorausrobotik/voraus-ad-dataset
paper_authors: Jan Thieß Brockmann, Marco Rudolph, Bodo Rosenhahn, Bastian Wandt
For: This paper aims to provide a dataset for anomaly detection (AD) in robotic applications, and to introduce a new baseline method called MVT-Flow that outperforms previous baselines by a large margin.* Methods: The paper uses machine data from a pick-and-place application to create a dataset for AD, and introduces MVT-Flow, a deep-learning-based density estimation method that takes the structure of the data domain into account.* Results: The paper shows that MVT-Flow outperforms previous baselines by a large margin of 6.2% in area under ROC.Here is the text in Simplified Chinese:* For: 这篇论文目的是为了提供机器人应用中的异常检测（AD）数据集，并引入一种新的基线方法called MVT-Flow，该方法在ROC领域的表现明显超过了之前的基线方法。* Methods: 论文使用机器人执行pick-and-place任务的机器数据创建了AD数据集，并引入MVT-Flow方法，该方法是基于深度学习的概率分布预测方法，它采用了数据领域的结构来 tailor its architecture。* Results: 论文表明，MVT-Flow方法在ROC领域的表现比之前的基线方法高出6.2%的差。

Abstract
During the operation of industrial robots, unusual events may endanger the safety of humans and the quality of production. When collecting data to detect such cases, it is not ensured that data from all potentially occurring errors is included as unforeseeable events may happen over time. Therefore, anomaly detection (AD) delivers a practical solution, using only normal data to learn to detect unusual events. We introduce a dataset that allows training and benchmarking of anomaly detection methods for robotic applications based on machine data which will be made publicly available to the research community. As a typical robot task the dataset includes a pick-and-place application which involves movement, actions of the end effector and interactions with the objects of the environment. Since several of the contained anomalies are not task-specific but general, evaluations on our dataset are transferable to other robotics applications as well. Additionally, we present MVT-Flow (multivariate time-series flow) as a new baseline method for anomaly detection: It relies on deep-learning-based density estimation with normalizing flows, tailored to the data domain by taking its structure into account for the architecture. Our evaluation shows that MVT-Flow outperforms baselines from previous work by a large margin of 6.2% in area under ROC.

摘要

Euclidean, Projective, Conformal: Choosing a Geometric Algebra for Equivariant Transformers

paper_url: http://arxiv.org/abs/2311.04744
repo_url: None
paper_authors: Pim de Haan, Taco Cohen, Johann Brehmer
for: 该论文旨在开发一种基于几何深度学习的灵活架构，即几何深度学习变换器（GATr）。
methods: 该论文使用几何代数来扩展GATr架构，使其适用于任何几何（或CLIFFORD）代数。作者还研究了使用不同几何代数的版本，包括几何代数、 проектив代数和对称代数，来表示3D数据。
results: 作者在理论和实践中评估了这些不同版本，发现 simplest几何版本 computationally cheap，但 symmetry group smaller，不够表达能力，而 projective model 表达能力不够。对称代数和改进的 проектив代数定义出了强大、高性能的架构。

Abstract
The Geometric Algebra Transformer (GATr) is a versatile architecture for geometric deep learning based on projective geometric algebra. We generalize this architecture into a blueprint that allows one to construct a scalable transformer architecture given any geometric (or Clifford) algebra. We study versions of this architecture for Euclidean, projective, and conformal algebras, all of which are suited to represent 3D data, and evaluate them in theory and practice. The simplest Euclidean architecture is computationally cheap, but has a smaller symmetry group and is not as sample-efficient, while the projective model is not sufficiently expressive. Both the conformal algebra and an improved version of the projective algebra define powerful, performant architectures.

摘要
“几何深度学习构件（GATr）是一种多功能的构件，基于射影几何代数。我们将这个构件转换为可扩展的构件，让你可以根据任何几何（或Clifford）代数建立扩展性强的 transformer 架构。我们研究了这些架构的几何、 проектив和对称 algebra 版本，这些版本都适合表示3D数据，并进行了理论和实践评估。最简单的几何架构较便宜计算，但它的同调集小，不够sample-efficient，而 проектив模型不够表达力。对称 algebra 和改进的 projetive algebra 定义了强大、高性能的架构。”Note that Simplified Chinese is a simplified version of Chinese that is used in mainland China and Singapore. Traditional Chinese is used in Taiwan, Hong Kong, and other parts of the world where traditional Chinese is prevalent.

The Quest for Content: A Survey of Search-Based Procedural Content Generation for Video Games

paper_url: http://arxiv.org/abs/2311.04710
repo_url: None
paper_authors: Mar Zamorano, Carlos Cetina, Federica Sarro
for: 游戏内容的大量生成，以满足日益增长的游戏需求。
methods: 使用搜索算法实现自动化内容生成。
results: 对SBPCG领域的现状和未来研究方向的报告，以及一些实践者可采取的建议。

Abstract
Video games demand is constantly increasing, which requires the costly production of large amounts of content. Towards this challenge, researchers have developed Search-Based Procedural Content Generation (SBPCG), that is, the (semi-)automated creation of content through search algorithms. We survey the current state of SBPCG, reporting work appeared in the field between 2011-2022 and identifying open research challenges. The results lead to recommendations for practitioners and to the identification of several potential future research avenues for SBPCG.

摘要
电子游戏的需求不断增长，需要大量的内容生成，而这也导致了高昂的生产成本。为了应对这个挑战，研究人员开发了搜索基于生成内容的技术（Search-Based Procedural Content Generation，SBPCG），即通过搜索算法（semi-)自动生成内容。我们对SBPCG领域的当前状况进行了报告，涵盖2011-2022年间出版的研究成果，并确定了一些未解决的研究挑战和未来研究方向。

Challenging Common Assumptions in Multi-task Learning

paper_url: http://arxiv.org/abs/2311.04698
repo_url: None
paper_authors: Cathrin Elich, Lukas Kirchdorfer, Jan M. Köhler, Lukas Schott
for: 本研究探讨多任务学习（MTL）下的下降搜索方法，尤其是在单任务学习（STL）基础上的情况下。
methods: 本研究使用了常用的STL工具，如Adam优化器，并证明Adam优化器在MTL中的效iveness归功于部分损失度量的兼容性。此外，本研究还研究了梯度冲突的角色在MTL和STL中，并发现梯度强度作为主要 отли异点。
results: 对于常见的图像损害，本研究未发现MTL对特征传递性的明显优势。总的来说，本研究发现MTL和STL在一些方面存在相似之处，建议在更广泛的上下文中考虑这两种方法。

Abstract
While multi-task learning (MTL) has gained significant attention in recent years, its underlying mechanisms remain poorly understood. Recent methods did not yield consistent performance improvements over single task learning (STL) baselines, underscoring the importance of gaining more profound insights about challenges specific to MTL. In our study, we challenge common assumptions in MTL in the context of STL: First, the choice of optimizer has only been mildly investigated in MTL. We show the pivotal role of common STL tools such as the Adam optimizer in MTL. We deduce the effectiveness of Adam to its partial loss-scale invariance. Second, the notion of gradient conflicts has often been phrased as a specific problem in MTL. We delve into the role of gradient conflicts in MTL and compare it to STL. For angular gradient alignment we find no evidence that this is a unique problem in MTL. We emphasize differences in gradient magnitude as the main distinguishing factor. Lastly, we compare the transferability of features learned through MTL and STL on common image corruptions, and find no conclusive evidence that MTL leads to superior transferability. Overall, we find surprising similarities between STL and MTL suggesting to consider methods from both fields in a broader context.

摘要
MTL（多任务学习）在最近几年内得到了广泛关注，但它的内在机制仍未得到了充分理解。现有方法不能够在单任务学习（STL）基础上实现一致性的性能提升，这重申了对MTL的深入理解的重要性。在我们的研究中，我们挑战了MTL中通常被假设的一些假设：首先，MTL中选择优化器的研究只是轻度地进行了调查。我们表明了通用STL工具such as Adam优化器在MTL中的重要作用。我们发现Adam的有效性归功于它的部分损失尺度不变性。其次，在MTL中 gradient conflicts 的概念经常被宣称为特定的问题。我们探讨了MTL中 gradient conflicts 的角色，并与 STL 进行比较。对于 angular gradient alignment，我们未能发现这是MTL中独特的问题。我们强调了 gradient magnitude 的差异作为主要 отлича点。最后，我们比较了通过 MTL 和 STL 学习的特征的传输性，并未发现MTL leads to superior transferability。总的来说，我们发现了MTL 和 STL 之间的意外相似之处，建议在更广泛的上下文中考虑这两个领域的方法。

Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO

paper_url: http://arxiv.org/abs/2311.04951
repo_url: https://github.com/openvinotoolkit/openvino_notebooks
paper_authors: Haim Barad, Ekaterina Aidova, Yury Gorbachev
for: 提高用户体验和减少基础设施成本和能耗
methods: 使用幻数据批处理和量化优化
results: 提高文本生成响应时间，并与标准抽样相比较Here’s a more detailed explanation of each point:
for: The paper is written to improve the user experience and reduce infrastructure costs and power consumption by optimizing text generation using inference optimizations.
methods: The paper proposes using speculative sampling, a form of dynamic execution, to reduce the overall latency of text generation. The authors also use model-based optimizations such as quantization and KV caching.
results: The paper compares the performance of speculative sampling with standard autoregressive sampling and shows that speculative sampling can improve the response time of text generation.

Abstract
Inference optimizations are critical for improving user experience and reducing infrastructure costs and power consumption. In this article, we illustrate a form of dynamic execution known as speculative sampling to reduce the overall latency of text generation and compare it with standard autoregressive sampling. This can be used together with model-based optimizations (e.g. quantization) to provide an optimized solution. Both sampling methods make use of KV caching. A Jupyter notebook and some sample executions are provided.

摘要
推理优化是提高用户体验和减少基础设施成本和电力消耗的关键。在这篇文章中，我们介绍了一种动态执行技术known as speculative sampling，用于减少文本生成总延迟。我们还与标准排取样本相比较。这两种抽样方法都使用KV缓存。我们提供了一个Jupyter笔记和一些示例执行。

Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

paper_url: http://arxiv.org/abs/2311.04693
repo_url: https://github.com/hayeong0/Diff-HierVC
paper_authors: Ha-Yeong Choi, Sang-Hoon Lee, Seong-Whan Lee
for: 提高voice conversion（VC）系统的精度和声音适应质量
methods: 基于两个扩散模型的层次VC系统（Diff-HierVC），包括DiffPitch和DiffVoice两部分
results: 实验结果表明，our模型在抽象F0生成和声音风格转换方面具有优异表现，并在零基elineVC场景中达到CER=0.83%和EER=3.29%的性能。

Abstract
Although voice conversion (VC) systems have shown a remarkable ability to transfer voice style, existing methods still have an inaccurate pitch and low speaker adaptation quality. To address these challenges, we introduce Diff-HierVC, a hierarchical VC system based on two diffusion models. We first introduce DiffPitch, which can effectively generate F0 with the target voice style. Subsequently, the generated F0 is fed to DiffVoice to convert the speech with a target voice style. Furthermore, using the source-filter encoder, we disentangle the speech and use the converted Mel-spectrogram as a data-driven prior in DiffVoice to improve the voice style transfer capacity. Finally, by using the masked prior in diffusion models, our model can improve the speaker adaptation quality. Experimental results verify the superiority of our model in pitch generation and voice style transfer performance, and our model also achieves a CER of 0.83% and EER of 3.29% in zero-shot VC scenarios.

摘要
尽管voice conversion（VC）系统已经表现出了很好的语音风格传递能力，现有的方法仍然具有不准确的抑音和低端应用质量。为解决这些挑战，我们提出了Diff-HierVC，一种基于两个扩散模型的层次VC系统。我们首先引入DiffPitch，它可以有效地生成F0目标声音风格。然后，生成的F0被 fed到DiffVoice中，用来转换语音为目标声音风格。此外，通过源-滤波器编码器，我们分离出语音，并使用转换后的Mel-spectrogram作为数据驱动的先验知识来提高voice style转换能力。最后，通过在扩散模型中使用假标记先验，我们的模型可以提高 speaker adaptation质量。实验结果证明我们的模型在抑音和voice style转换性能方面具有优势，并且在零shot VC场景下 achieved CER of 0.83%和EER of 3.29%。

Pre-training LLMs using human-like development data corpus

paper_url: http://arxiv.org/abs/2311.04666
repo_url: None
paper_authors: Khushi Bhardwaj, Raj Sanjay Shah, Sashank Varma
for: 这个论文的目的是测试大型自然语言模型（LLMs）在语言理解和推理任务中的表现，以及模型在不同的训练环境下的可重复性和稳定性。
methods: 这个论文使用了大量的raw文本数据进行预训练，并对LLMs进行了评估，以评估模型在不同的训练环境下的表现。
results: 论文提出了一系列的基准值，包括不同架构、评估 epochs 的变化和报告的预训练 metric，以及对RoBERTa 基线值的评估。

Abstract
Pre-trained Large Language Models (LLMs) have shown success in a diverse set of language inference and understanding tasks. The pre-training stage of LLMs looks at a large corpus of raw textual data. The BabyLM shared task compares LLM pre-training to human language acquisition, where the number of tokens seen by 13-year-old kids is magnitudes smaller than the number of tokens seen by LLMs. In this work, we pre-train and evaluate LLMs on their ability to learn contextual word representations using roughly the same number of tokens as seen by children. We provide a strong set of baselines; with different architectures, evaluation of changes in performance across epochs, and reported pre-training metrics for the strict small and strict tracks of the task. We also try to loosely replicate the RoBERTa baseline given by the task organizers to observe the training robustness to hyperparameter selection and replicability. We provide the submission details to the strict and strict-small tracks in this report.

摘要
大型语言模型（LLMs）在多种语言推理和理解任务中表现出色。LLMs的预训阶段将关注大量的原始文本数据。在这个工作中，我们将LLMs预训和评估其能够学习上下文 word 表现。我们使用相似数量的字元与13岁儿童所看到的字元进行比较。我们提供了强大的基准点，包括不同架构、评估改变过程中的表现、和预训中的 metric。我们还尝试复制RoBERTa基eline，以观察对于数据选择和可重现性的训练稳定性。我们在这份报告中提供了预训和紧缩小 tracks 的提交细节。

Pragmatic Reasoning Unlocks Quantifier Semantics for Foundation Models

paper_url: http://arxiv.org/abs/2311.04659
repo_url: None
paper_authors: Yiyuan Li, Rakesh R. Menon, Sayan Ghosh, Shashank Srivastava
for: This paper aims to explore the ability of recent foundation models to understand generalized quantifiers in natural language, specifically in sentences featuring percentage-equipped predicates.
methods: The paper uses a crowd-sourced dataset of human-annotated generalized quantifiers in Wikipedia sentences, called QuRe, and a framework called PRESQUE, which combines natural language inference and the Rational Speech Acts framework, to test the ability of language models to understand quantifier percentage scopes.
results: The experimental results on the HVD dataset and QuRe show that PRESQUE, which uses pragmatic reasoning, performs 20% better than a literal reasoning baseline when predicting quantifier percentage scopes, with no additional training required.

Abstract
Generalized quantifiers (e.g., few, most) are used to indicate the proportions predicates are satisfied (for example, some apples are red). One way to interpret quantifier semantics is to explicitly bind these satisfactions with percentage scopes (e.g., 30%-40% of apples are red). This approach can be helpful for tasks like logic formalization and surface-form quantitative reasoning (Gordon and Schubert, 2010; Roy et al., 2015). However, it remains unclear if recent foundation models possess this ability, as they lack direct training signals. To explore this, we introduce QuRe, a crowd-sourced dataset of human-annotated generalized quantifiers in Wikipedia sentences featuring percentage-equipped predicates. We explore quantifier comprehension in language models using PRESQUE, a framework that combines natural language inference and the Rational Speech Acts framework. Experimental results on the HVD dataset and QuRe illustrate that PRESQUE, employing pragmatic reasoning, performs 20% better than a literal reasoning baseline when predicting quantifier percentage scopes, with no additional training required.

摘要
通用量词（例如，很少、大多数）用于指示逻辑 predicate 满足的порпорzioni（例如，一些苹果是红色的）。一种方法可以理解量词 semantics 是通过显式绑定这些满足情况的百分比范围（例如，30%-40% 的苹果是红色的）。这种方法可以对逻辑ormalization和表面形量化理性（Gordon 和Schubert，2010；Roy 等，2015）进行帮助。然而，是否现代基础模型拥有这种能力仍然未知，因为它们缺乏直接训练信号。为了探索这一点，我们引入 QuRe，一个由人工标注的通用量词在Wikipedia句子中出现的百分比范围的数据集。我们使用 PRESQUE，一个结合自然语言推理和理性演讲框架的框架，来探索语言模型中量词理解的能力。实验结果表明，PRESQUE，通过使用 Pragmatic Reasoning，在 HVD 数据集和 QuRe 上预测量词百分比范围时，与 Literal Reasoning 基准相比，提高了20%的性能，无需额外训练。

Hybrid Focal and Full-Range Attention Based Graph Transformers

paper_url: http://arxiv.org/abs/2311.04653
repo_url: None
paper_authors: Minhong Zhu, Zhenhao Zhao, Weiran Cai
for: 这篇论文旨在提高图structured数据学习中Graph Transformer的性能，通过增强对本地信息的捕捉和全范围相关性的学习。
methods: 该论文提出了一种新的具有复合注意力机制的强化图Transformer模型，named Focal and Full-Range Graph Transformer (FFGT)，通过结合全范围注意力和K-hop焦点注意力来捕捉全范围和本地信息。
results: 该论文在多个开放数据集上提高了现有的Graph Transformer性能，同时在一些Long-Range Graph Benchmark (LRGB)数据集上达到了与普通Transformer相同的SOTA性能，而无需任何特殊的参数调整或特定的数据预处理。

Abstract
The paradigm of Transformers using the self-attention mechanism has manifested its advantage in learning graph-structured data. Yet, Graph Transformers are capable of modeling full range dependencies but are often deficient in extracting information from locality. A common practice is to utilize Message Passing Neural Networks (MPNNs) as an auxiliary to capture local information, which however are still inadequate for comprehending substructures. In this paper, we present a purely attention-based architecture, namely Focal and Full-Range Graph Transformer (FFGT), which can mitigate the loss of local information in learning global correlations. The core component of FFGT is a new mechanism of compound attention, which combines the conventional full-range attention with K-hop focal attention on ego-nets to aggregate both global and local information. Beyond the scope of canonical Transformers, the FFGT has the merit of being more substructure-aware. Our approach enhances the performance of existing Graph Transformers on various open datasets, while achieves compatible SOTA performance on several Long-Range Graph Benchmark (LRGB) datasets even with a vanilla transformer. We further examine influential factors on the optimal focal length of attention via introducing a novel synthetic dataset based on SBM-PATTERN.

摘要
“对于使用自我注意机制的Transformers模型，它在学习图像数据中表现出了优势。然而，图像Transformers可以模型全范围的相互关系，但通常缺乏从本地获取信息的能力。为了解决这个问题，常用Message Passing Neural Networks（MPNNs）作为辅助，以便捕捉本地信息，但这些MPNNs仍然无法彻底理解子结构。在这篇论文中，我们提出了一个纯注意 mechanism的架构，即全范围注意和K-hop焦点注意的复合注意机制（FFGT），以便储存全范围和本地信息。与传统Transformers不同的是，FFGT更加注重到子结构。我们的方法可以提高现有的图像Transformers的性能，并在多个Long-Range Graph Benchmark（LRGB）dataset上实现了Compatible SOTA的性能，甚至使用普通的Transformer。我们进一步研究了注意力的最佳焦点因素，并通过引入一个基于SBM-PATTERN的新 sintetic dataset。”

SKU-Patch: Towards Efficient Instance Segmentation for Unseen Objects in Auto-Store

paper_url: http://arxiv.org/abs/2311.04645
repo_url: None
paper_authors: Biqi Yang, Weiliang Tang, Xiaojie Gao, Xianzhi Li, Yun-Hui Liu, Chi-Wing Fu, Pheng-Ann Heng
for: 这篇论文主要针对大规模库存中的自动采矿领域问题，旨在提供一个新的patch-guided实例分割方案，以减少人工干预和模型重训。
methods: 本文提出了一个 novel transformer-based网络，包括(i)一个patch-image相联 corrle encoder，用于捕捉多个层次的图像特征，并(ii)一个patch-aware transformer decoder，用于生成实例 mask。
results: 实验结果显示，SKU-Patch可以在四个库存测试 benchmark上取得最佳性能，并在一个真实的自动存储运输管线中，实现了逾90%的抓捕成功率，证明其实用性和可行性。

Abstract
In large-scale storehouses, precise instance masks are crucial for robotic bin picking but are challenging to obtain. Existing instance segmentation methods typically rely on a tedious process of scene collection, mask annotation, and network fine-tuning for every single Stock Keeping Unit (SKU). This paper presents SKU-Patch, a new patch-guided instance segmentation solution, leveraging only a few image patches for each incoming new SKU to predict accurate and robust masks, without tedious manual effort and model re-training. Technical-wise, we design a novel transformer-based network with (i) a patch-image correlation encoder to capture multi-level image features calibrated by patch information and (ii) a patch-aware transformer decoder with parallel task heads to generate instance masks. Extensive experiments on four storehouse benchmarks manifest that SKU-Patch is able to achieve the best performance over the state-of-the-art methods. Also, SKU-Patch yields an average of nearly 100% grasping success rate on more than 50 unseen SKUs in a robot-aided auto-store logistic pipeline, showing its effectiveness and practicality.

摘要
大规模仓库中，精准实例掩模是机器人抓取物品的关键，但是它们很难以获得。现有的实例分割方法通常需要 tedious scene collection、标注和网络微调，对每个 Stock Keeping Unit (SKU) 进行一个一一的处理。本文介绍了 SKU-Patch，一种新的 patch-guided 实例分割解决方案，只需要对每个新的 SKU 输入一些图像块来预测准确和可靠的掩模，不需要手动劳累和模型重新训练。技术上，我们设计了一种基于 transformer 网络的新网络，包括：(i) 一个 patch-image correlation encoder，用于捕捉多级图像特征，并将其与块信息相协调。(ii) 一个 patch-aware transformer decoder，包括多个并行任务头，用于生成实例掩模。经验表明，SKU-Patch 能够在四个仓库测试准则上取得最好的性能，并且在 robot-aided auto-store 物流管线中，SKU-Patch 可以实现97%的抓取成功率，验证了其实用性和实际性。

Object-Centric Learning with Slot Mixture Module

paper_url: http://arxiv.org/abs/2311.04640
repo_url: None
paper_authors: Daniil Kirilenko, Vitaliy Vorobyov, Alexey K. Kovalev, Aleksandr I. Panov
for: 这篇论文是为了提出一种基于 Gaussian Mixture Model 的学习式划分方法，用于改进 object-centric 架构中的 slot 表示方法。
methods: 该方法使用学习式划分方法来分解特征图像，并将分配给 slot 的信息包含在 slot 表示中，从而得到更表示力的 slot 表示。
results: 对于 object-centric enario，使用该方法而不使用 Slot Attention 可以提高性能，达到目前最佳Result 在 set property prediction 任务中。

Abstract
Object-centric architectures usually apply a differentiable module to the entire feature map to decompose it into sets of entity representations called slots. Some of these methods structurally resemble clustering algorithms, where the cluster's center in latent space serves as a slot representation. Slot Attention is an example of such a method, acting as a learnable analog of the soft k-means algorithm. Our work employs a learnable clustering method based on the Gaussian Mixture Model. Unlike other approaches, we represent slots not only as centers of clusters but also incorporate information about the distance between clusters and assigned vectors, leading to more expressive slot representations. Our experiments demonstrate that using this approach instead of Slot Attention improves performance in object-centric scenarios, achieving state-of-the-art results in the set property prediction task.

摘要
通常，对象中心的架构会应用一个可导模块到整个特征地图，将其分解成Entity表示集合的集合。一些这些方法结构上类似于聚类算法，其中聚类中心在隐藏空间中服务为槽表示。槽注意力是一 such方法， behaving as a learnable soft k-means algorithm。我们的工作使用一种学习聚类方法基于 Gaussian Mixture Model。与其他方法不同，我们在表示槽不仅包括聚类中心，还包括聚类与分配向量之间的距离信息，导致更具表达力的槽表示。我们的实验表明，使用这种方法而不是槽注意力可以在对象中心的情况下提高性能，实现了属性集 Prediction任务的国际最佳 результа。

Explained anomaly detection in text reviews: Can subjective scenarios be correctly evaluated?

paper_url: http://arxiv.org/abs/2311.04948
repo_url: None
paper_authors: David Novoa-Paradela, Oscar Fontenla-Romero, Bertha Guijarro-Berdiñas
for: 这个研究的目的是检测和解释在线平台上的异常评论。
methods: 这个ipeline包括三个模块，用于检测不会为用户提供价值的评论，包括无用和黑客组成的评论。每个分类都有一个正常性分数和一个解释，以 justify 的决定。
results: 这个ipeline在不同的数据集上进行了评测，并进行了一项解释技术的比较研究，以评估解释模块的效果。这项研究可以帮助自动化在线平台上的评论审核任务，例如电子商务平台上的评论审核，并为相关领域的异常检测任务提供灵感。此外，这项研究还表明了对不同解释技术的人类评估，并探讨了是否可以解释人类化的任务，如异常评论检测。

Abstract
This paper presents a pipeline to detect and explain anomalous reviews in online platforms. The pipeline is made up of three modules and allows the detection of reviews that do not generate value for users due to either worthless or malicious composition. The classifications are accompanied by a normality score and an explanation that justifies the decision made. The pipeline's ability to solve the anomaly detection task was evaluated using different datasets created from a large Amazon database. Additionally, a study comparing three explainability techniques involving 241 participants was conducted to assess the explainability module. The study aimed to measure the impact of explanations on the respondents' ability to reproduce the classification model and their perceived usefulness. This work can be useful to automate tasks in review online platforms, such as those for electronic commerce, and offers inspiration for addressing similar problems in the field of anomaly detection in textual data. We also consider it interesting to have carried out a human evaluation of the capacity of different explainability techniques in a real and infrequent scenario such as the detection of anomalous reviews, as well as to reflect on whether it is possible to explain tasks as humanly subjective as this one.

摘要
Translation Notes:* "anomalous reviews" 翻译为 "不常见的评论"* "online platforms" 翻译为 "在线平台"* "worthless or malicious composition" 翻译为 "无用或有恶意组合"* "normality score" 翻译为 "常见度分数"* "explanation" 翻译为 "解释"* "classification model" 翻译为 "分类模型"* "electronic commerce" 翻译为 "电子商务"* "humanly subjective" 翻译为 "有人性的"

LuminanceL1Loss: A loss function which measures percieved brightness and colour differences

paper_url: http://arxiv.org/abs/2311.04614
repo_url: None
paper_authors: Dominic De Jonge
for: 提高图像修复任务的性能
methods: 使用新的损失函数LuminanceL1Loss，将图像转换为灰度图并计算MSE损失两个频道
results: 对Retinexformer、BUIFD和DnCNN arquitectures进行了评测，并表明LuminanceL1Loss可以超越传统方法，提高图像修复任务的性能，最高提高4.7dB。

Abstract
We introduce LuminanceL1Loss, a novel loss function designed to enhance the performance of image restoration tasks. We demonstrate its superiority over MSE when applied to the Retinexformer, BUIFD and DnCNN architectures. Our proposed LuminanceL1Loss leverages a unique approach by transforming images into grayscale and subsequently computing the MSE loss for both grayscale and color channels. Experimental results demonstrate that this innovative loss function consistently outperforms traditional methods, showcasing its potential in image denoising and other related tasks in image reconstruction. It demonstrates gains up to 4.7dB. The results presented in this study highlight the efficacy of LuminanceL1Loss for various image restoration tasks.

摘要
我们介绍了一种新的损失函数，即LuminanceL1Loss，用于提高图像恢复任务的性能。我们在Retinexformer、BUIFD和DnCNN架构上进行了实验，并证明了LuminanceL1Loss在这些架构上的优越性。我们的提案的LuminanceL1Loss采用了一种独特的方法，即将图像转换成灰度图像，然后计算灰度和色彩通道之间的MSE损失。实验结果表明，这种创新的损失函数在图像压缩和其他相关的图像重建任务中具有优越的表现，提高了4.7dB。这些研究结果表明LuminanceL1Loss在各种图像恢复任务中的可靠性和普适性。

paper_url: http://arxiv.org/abs/2311.04589
repo_url: None
paper_authors: Zhen Yang, Yingxue Zhang, Fandong Meng, Jie Zhou
for: 本研究想要帮助多modal语言模型（MM-LLMs）更好地处理多modal输入和生成非文本模式。
methods: 本方法使用了Tokenize和Embed ALl（TEAL）方法，将输入从任何模式转化为token序列，并学习一个共同的嵌入空间 для所有模式。
results: 实验表明，TEAL可以获得显著的多modal理解提升，并实现了一个简单的多modal生成方案。

Abstract
Despite Multi-modal Large Language Models (MM-LLMs) have made exciting strides recently, they are still struggling to efficiently model the interactions among multi-modal inputs and the generation in non-textual modalities. In this work, we propose TEAL (Tokenize and Embed ALl)}, an approach to treat the input from any modality as a token sequence and learn a joint embedding space for all modalities. Specifically, for the input from any modality, TEAL first discretizes it into a token sequence with the off-the-shelf tokenizer and embeds the token sequence into a joint embedding space with a learnable embedding matrix. MM-LLMs just need to predict the multi-modal tokens autoregressively as the textual LLMs do. Finally, the corresponding de-tokenizer is applied to generate the output in each modality based on the predicted token sequence. With the joint embedding space, TEAL enables the frozen LLMs to perform both understanding and generation tasks involving non-textual modalities, such as image and audio. Thus, the textual LLM can just work as an interface and maintain its high performance in textual understanding and generation. Experiments show that TEAL achieves substantial improvements in multi-modal understanding, and implements a simple scheme for multi-modal generations.

摘要
尽管多模态大型语言模型（MM-LLMs）在最近几年内做出了吸人的进步，但它们仍然努力地模型多模态输入的交互和非文本modalities中的生成。在这项工作中，我们提出了TEAL（Tokenize and Embed ALl），一种方法，其中输入从任何模式都会被视为一个token序列，并在一个共享的embedding空间中学习一个共同的embedding矩阵。具体来说，对于输入从任何模式，TEAL首先将它拆分成一个token序列，使用可用的tokenizer进行拆分，然后将token序列embedding到一个共同的embedding空间中，使用一个学习的embedding矩阵。MM-LLMs只需要预测多modal tokens的autoregressive预测，就像文本LLMs一样。最后，对于每个模式，使用预测的token序列生成输出。与共同的embedding空间相比，TEAL使得冻结的LLMs可以在多modal任务中进行理解和生成任务，如图像和音频。因此，文本LLM可以作为界面，维持高效的文本理解和生成能力。实验结果表明，TEAL在多modal理解方面取得了显著的提升，并实现了简单的多modal生成方案。

Army of Thieves: Enhancing Black-Box Model Extraction via Ensemble based sample selection

paper_url: http://arxiv.org/abs/2311.04588
repo_url: https://github.com/akshitjindal1/aot_wacv
paper_authors: Akshit Jindal, Vikram Goyal, Saket Anand, Chetan Arora
for: 防止机器学习模型被盗用（Model Stealing Attacks），当机器学习模型被部署为服务时。
methods: 使用 ensemble of deep learning models 作为盗取模型，以便选择最有用的数据点子集。
results: 比基于单个模型的方法高效，可以提高盗取模型的质量和攻击成功率。在 CIFAR-10 数据集上，我们的方法可以提高对模型的攻击性能，比基于单个模型的方法高效。

Abstract
Machine Learning (ML) models become vulnerable to Model Stealing Attacks (MSA) when they are deployed as a service. In such attacks, the deployed model is queried repeatedly to build a labelled dataset. This dataset allows the attacker to train a thief model that mimics the original model. To maximize query efficiency, the attacker has to select the most informative subset of data points from the pool of available data. Existing attack strategies utilize approaches like Active Learning and Semi-Supervised learning to minimize costs. However, in the black-box setting, these approaches may select sub-optimal samples as they train only one thief model. Depending on the thief model's capacity and the data it was pretrained on, the model might even select noisy samples that harm the learning process. In this work, we explore the usage of an ensemble of deep learning models as our thief model. We call our attack Army of Thieves(AOT) as we train multiple models with varying complexities to leverage the crowd's wisdom. Based on the ensemble's collective decision, uncertain samples are selected for querying, while the most confident samples are directly included in the training data. Our approach is the first one to utilize an ensemble of thief models to perform model extraction. We outperform the base approaches of existing state-of-the-art methods by at least 3% and achieve a 21% higher adversarial sample transferability than previous work for models trained on the CIFAR-10 dataset.

摘要

GResilience: Trading Off Between the Greenness and the Resilience of Collaborative AI Systems

paper_url: http://arxiv.org/abs/2311.04569
repo_url: None
paper_authors: Diaeddin Rimawi, Antonio Liotta, Marco Todescato, Barbara Russo
for: 本研究旨在提供一种自动评估Collaborative Artificial Intelligence System（CAIS）恢复行动的能力来衡量系统的可恢复性和绿色性。
methods: 本研究提出了一种以优化和游戏理论为基础的方法来评估CAIS恢复行动的可恢复性和绿色性。
results: 研究人员通过设计了一种实验协议和应用于一个真实的CAIS示例器，并通过优化和游戏理论来评估CAIS恢复行动的可恢复性和绿色性。

Abstract
A Collaborative Artificial Intelligence System (CAIS) works with humans in a shared environment to achieve a common goal. To recover from a disruptive event that degrades its performance and ensures its resilience, a CAIS may then need to perform a set of actions either by the system, by the humans, or collaboratively together. As for any other system, recovery actions may cause energy adverse effects due to the additional required energy. Therefore, it is of paramount importance to understand which of the above actions can better trade-off between resilience and greenness. In this in-progress work, we propose an approach to automatically evaluate CAIS recovery actions for their ability to trade-off between the resilience and greenness of the system. We have also designed an experiment protocol and its application to a real CAIS demonstrator. Our approach aims to attack the problem from two perspectives: as a one-agent decision problem through optimization, which takes the decision based on the score of resilience and greenness, and as a two-agent decision problem through game theory, which takes the decision based on the payoff computed for resilience and greenness as two players of a cooperative game.

摘要
Translated into Simplified Chinese:一个协同人工智能系统（CAIS）与人类在共享环境中工作以实现共同目标。在一个破坏性事件导致系统性能下降后，CAIS可能需要执行一系列动作，这些动作可以由系统、人类或者共同执行。由于这些恢复动作可能会带来更多的能源消耗，因此非常重要理解这些动作可以如何让恢复和绿色之间进行更好的负担平衡。在这个进行中的工作中，我们提出了一种方法，可以自动评估CAIS恢复动作的能量和绿色之间的负担平衡。我们还设计了一个实验协议和应用于一个真实的CAIS示范器。我们的方法尝试从两个角度解决问题：作为一个一个代理决策问题，通过优化来做出决策，以及作为两个玩家的合作游戏问题，通过游戏理论来做出决策。

CAIS-DMA: A Decision-Making Assistant for Collaborative AI Systems

paper_url: http://arxiv.org/abs/2311.04562
repo_url: https://github.com/dmrimawi/cais-dma
paper_authors: Diaeddin Rimawi, Antonio Lotta, Marco Todescato, Barbara Russo
for:This paper aims to develop a methodology to automatically support the decision-making process in a Collaborative Artificial Intelligence System (CAIS) when the system experiences performance degradation after a disruptive event.methods:The proposed framework consists of three components: one manages or simulates CAIS’s environment and disruptive events, the second automates the decision-making process, and the third provides a visual analysis of CAIS behavior.results:The framework can automatically monitor the decision-making process, intervene whenever a performance degradation occurs, and recommend the next action that balances between minimizing the recovery time (i.e., resilience) and minimizing the energy adverse effects (i.e., greenness).

Abstract
A Collaborative Artificial Intelligence System (CAIS) is a cyber-physical system that learns actions in collaboration with humans in a shared environment to achieve a common goal. In particular, a CAIS is equipped with an AI model to support the decision-making process of this collaboration. When an event degrades the performance of CAIS (i.e., a disruptive event), this decision-making process may be hampered or even stopped. Thus, it is of paramount importance to monitor the learning of the AI model, and eventually support its decision-making process in such circumstances. This paper introduces a new methodology to automatically support the decision-making process in CAIS when the system experiences performance degradation after a disruptive event. To this aim, we develop a framework that consists of three components: one manages or simulates CAIS's environment and disruptive events, the second automates the decision-making process, and the third provides a visual analysis of CAIS behavior. Overall, our framework automatically monitors the decision-making process, intervenes whenever a performance degradation occurs, and recommends the next action. We demonstrate our framework by implementing an example with a real-world collaborative robot, where the framework recommends the next action that balances between minimizing the recovery time (i.e., resilience), and minimizing the energy adverse effects (i.e., greenness).

摘要
一个协同人工智能系统（CAIS）是一个融合物理系统，通过与人类在共同环境中学习行动来实现共同目标。特别是，CAIS具有一个人工智能模型，用于支持协同决策过程。当系统经历破坏性事件（例如，突发事件）时，这个决策过程可能受到影响或者even stop。因此，监测人工智能模型的学习是极其重要的。这篇论文提出了一种新的方法，用于自动支持CAIS协同决策过程在系统经历破坏性事件后。为此，我们开发了一个框架，该框架包括三个组件：一个管理或模拟CAIS的环境和破坏性事件，第二个自动化决策过程，第三个提供CAIS行为的可视分析。总之，我们的框架可以自动监测决策过程，在破坏性事件发生时进行交互，并 recommends the next action，以保持系统的可靠性和绿色性。我们通过实施一个实际的协同 робоット示例来证明我们的框架。在这个示例中，我们的框架建议下一个行动，以均衡系统的恢复时间（即可靠性）和能源不良影响（即绿色性）。

paper_url: http://arxiv.org/abs/2311.04544
repo_url: None
paper_authors: Yashothara Shanmugarasa, M. A. P. Chamikara, Hye-young Paik, Salil S. Kanhere, Liming Zhu
for: 提供消费者和能源公司 valuabe insights into energy management, while protecting privacy.
methods: 使用Local Differential Privacy (LDP) methods with randomized response techniques and sliding windows to protect appliance-level energy consumption data.
results: Efficient and effective privacy protection, balancing privacy and data utility for analysis.

Abstract
Energy disaggregation techniques, which use smart meter data to infer appliance energy usage, can provide consumers and energy companies valuable insights into energy management. However, these techniques also present privacy risks, such as the potential for behavioral profiling. Local differential privacy (LDP) methods provide strong privacy guarantees with high efficiency in addressing privacy concerns. However, existing LDP methods focus on protecting aggregated energy consumption data rather than individual appliances. Furthermore, these methods do not consider the fact that smart meter data are a form of streaming data, and its processing methods should account for time windows. In this paper, we propose a novel LDP approach (named LDP-SmartEnergy) that utilizes randomized response techniques with sliding windows to facilitate the sharing of appliance-level energy consumption data over time while not revealing individual users' appliance usage patterns. Our evaluations show that LDP-SmartEnergy runs efficiently compared to baseline methods. The results also demonstrate that our solution strikes a balance between protecting privacy and maintaining the utility of data for effective analysis.

摘要
智能计量数据分解技术可以为消费者和能源公司提供有价值的能源管理信息，但这些技术也存在隐私风险，如行为 Profiling 的可能性。本地均衡隐私（LDP）方法可以提供强隐私保证，但现有的 LDP 方法主要关注保护归并的能源消耗数据而不是个体设备。此外，这些方法没有考虑智能计量数据是流动数据，其处理方法应该考虑时间窗口。在本文中，我们提出了一种新的 LDP 方法（名为 LDP-SmartEnergy），它利用随机响应技术和滑块窗口来帮助在时间上分享设备级能源消耗数据，而不抛露个体用户的设备使用模式。我们的评估结果显示，LDP-SmartEnergy 能够高效运行，与基eline方法相比。结果还表明，我们的解决方案能够平衡保护隐私和维护数据的有用性。

RankAug: Augmented data ranking for text classification

paper_url: http://arxiv.org/abs/2311.04535
repo_url: None
paper_authors: Tiasa Singha Roy, Priyam Basu
for: 这个论文主要是为了提高生成模型的评估方法。
methods: 这篇论文提出了一种文本排序方法，用于检测和过滤生成文本中最相似的文本，以提高NLU任务的准确率。
results: 实验结果显示，通过judicious选择筛选技术可以提高准确率，最高提高35%。

Abstract
Research on data generation and augmentation has been focused majorly on enhancing generation models, leaving a notable gap in the exploration and refinement of methods for evaluating synthetic data. There are several text similarity metrics within the context of generated data filtering which can impact the performance of specific Natural Language Understanding (NLU) tasks, specifically focusing on intent and sentiment classification. In this study, we propose RankAug, a text-ranking approach that detects and filters out the top augmented texts in terms of being most similar in meaning with lexical and syntactical diversity. Through experiments conducted on multiple datasets, we demonstrate that the judicious selection of filtering techniques can yield a substantial improvement of up to 35% in classification accuracy for under-represented classes.

摘要
研究数据生成和增强主要集中在提高生成模型，留下了许多评估合成数据的方法的空白。在生成数据筛选中，文本相似度指标在NLU任务中的意图和情感分类方面具有重要作用。本研究提出了RankAug方法，它通过词语和语法多样性来推荐最相似的文本，从而提高分类精度。通过在多个数据集上进行实验，我们证明了选择合适的筛选技术可以提高受 represeted类准确率达35%。

Validating ChatGPT Facts through RDF Knowledge Graphs and Sentence Similarity

paper_url: http://arxiv.org/abs/2311.04524
repo_url: None
paper_authors: Michalis Mountantonakis, Yannis Tzitzikas
for: 这个论文目的是 validate ChatGPT 的答案和补充它们的证明和来源。
methods: 这个论文使用 RDF 知识Graph（KG）和短句嵌入来实现 ChatGPT 答案的验证和补充。特别是使用 DBpedia 和 LODsyndesis（一个 aggregate KG，包含 2000 亿 triple 从 400 RDF KGs 多个领域），并引入一种算法，可以返回更加相关的 triple（ accompaniment 和 confidence score）。
results: 在评估这种服务（以及类似服务）时，作者创建了一个评估标准套件，包括 2000 个 ChatGPT 答案（其中 1000 个是希腊名人、500 个是希腊地点、500 个是关于希腊的事件）。手动标注后，发现 ChatGPT 的答案中约 73% 是正确的，27% 是错误的。结果很有 promise，例如，对整个benchmark来说，我们成功验证了 ChatGPT 的 85.3% 正确答案，并找到了错误答案中 62.6% 的正确答案。

Abstract
Since ChatGPT offers detailed responses without justifications, and erroneous facts even for popular persons, events and places, in this paper we present a novel pipeline that retrieves the response of ChatGPT in RDF and tries to validate the ChatGPT facts using one or more RDF Knowledge Graphs (KGs). To this end we leverage DBpedia and LODsyndesis (an aggregated Knowledge Graph that contains 2 billion triples from 400 RDF KGs of many domains) and short sentence embeddings, and introduce an algorithm that returns the more relevant triple(s) accompanied by their provenance and a confidence score. This enables the validation of ChatGPT responses and their enrichment with justifications and provenance. To evaluate this service (such services in general), we create an evaluation benchmark that includes 2,000 ChatGPT facts; specifically 1,000 facts for famous Greek Persons, 500 facts for popular Greek Places, and 500 facts for Events related to Greece. The facts were manually labelled (approximately 73% of ChatGPT facts were correct and 27% of facts were erroneous). The results are promising; indicatively for the whole benchmark, we managed to verify the 85.3% of the correct facts of ChatGPT and to find the correct answer for the 62.6% of the erroneous ChatGPT facts.

摘要
自从ChatGPT提供了详细的回答无需证明，而且甚至包含错误的信息关于知名人物、事件和地点，因此在这篇论文中，我们提出了一个新的管道，它将ChatGPT的回答转换为RDF格式，并使用一个或多个RDF知识 graphs（KGs）来验证ChatGPT的信息是否正确。为此，我们利用了DBpedia和LODsyndesis（一个包含400个RDF KGs的多个领域的知识Graph，总共包含200亿个三元组），并使用短句嵌入，并引入一种算法，它可以返回更加相关的 triple（或多个 triple），以及它们的来源和信任分数。这使得可以验证ChatGPT的回答，并为其添加证明和来源。为了评估这种服务（以及类似服务），我们创建了一个评估标准，包括2,000个ChatGPT的信息，其中包括1,000个著名希腊人物、500个希腊地点和500个与希腊相关的事件。这些信息都是手动标注的（约73%的ChatGPT信息正确，27%的信息错误）。结果很有 promise，例如，对整个benchmark，我们成功验证了85.3%的正确ChatGPT信息，并为错误的ChatGPT信息找到了正确的答案的62.6%。

FFINet: Future Feedback Interaction Network for Motion Forecasting

paper_url: http://arxiv.org/abs/2311.04512
repo_url: None
paper_authors: Miao Kang, Shengqi Wang, Sanping Zhou, Ke Ye, Jingjing Jiang, Nanning Zheng
for: 预测交通代理人的未来合理行为，以提高自动驾驶系统的安全性和效率。
methods: 提出了一种新的未来反馈互动网络（FFINet），通过将当前观察和未来互动的特征进行聚合，以提高多模态轨迹预测的准确性。
results: 在 Argoverse 1 和 Argoverse 2 动态预测测试数据集上，FFINet 实现了状态领先的性能。

Abstract
Motion forecasting plays a crucial role in autonomous driving, with the aim of predicting the future reasonable motions of traffic agents. Most existing methods mainly model the historical interactions between agents and the environment, and predict multi-modal trajectories in a feedforward process, ignoring potential trajectory changes caused by future interactions between agents. In this paper, we propose a novel Future Feedback Interaction Network (FFINet) to aggregate features the current observations and potential future interactions for trajectory prediction. Firstly, we employ different spatial-temporal encoders to embed the decomposed position vectors and the current position of each scene, providing rich features for the subsequent cross-temporal aggregation. Secondly, the relative interaction and cross-temporal aggregation strategies are sequentially adopted to integrate features in the current fusion module, observation interaction module, future feedback module and global fusion module, in which the future feedback module can enable the understanding of pre-action by feeding the influence of preview information to feedforward prediction. Thirdly, the comprehensive interaction features are further fed into final predictor to generate the joint predicted trajectories of multiple agents. Extensive experimental results show that our FFINet achieves the state-of-the-art performance on Argoverse 1 and Argoverse 2 motion forecasting benchmarks.

摘要
<>自动驾驶中，预测行为的预测具有重要的作用，旨在预测未来的合理行为。现有的方法主要是基于历史交互和环境的模型，预测多模态轨迹，忽略了未来交互所引起的轨迹变化。在这篇论文中，我们提出了一种新的未来反馈互动网络（FFINet），用于聚合特征。首先，我们采用不同的空间-时间编码器，将分解的位坐标和当前场景的位置进行嵌入，提供丰富的特征 для后续的跨时间汇集。其次，我们采用相对交互和跨时间汇集策略，先后采用交互模块、观察交互模块、未来反馈模块和全局汇集模块，其中未来反馈模块可以帮助理解预测的预先行为。最后，我们将全面交互特征传递给最终预测器，生成多个交互的联合预测轨迹。广泛的实验结果表明，我们的 FFINet 在 Argoverse 1 和 Argoverse 2 运动预测Benchmark上达到了状态 искусственный智能的表现。Note: The translation is in Simplified Chinese, which is the standard form of Chinese used in mainland China. If you need Traditional Chinese, please let me know.

Causal Inference on Investment Constraints and Non-stationarity in Dynamic Portfolio Optimization through Reinforcement Learning

paper_url: http://arxiv.org/abs/2311.04946
repo_url: None
paper_authors: Yasuhiro Nakayama, Tomochika Sawaki
for: 本研究开发了一种动态资产配置投资策略，使用了回归学习技术。
methods: 我们解决了金融时间序列数据不stationarity问题的问题，并引入了一些变量，如 режим变化，以提高预测精度。
results: 我们的研究发现，通过在投资策略中应用回归学习技术，可以实现高精度的预测，并且可以考虑实际面临的投资者实际约束，从而实现有效的优化。

Abstract
In this study, we have developed a dynamic asset allocation investment strategy using reinforcement learning techniques. To begin with, we have addressed the crucial issue of incorporating non-stationarity of financial time series data into reinforcement learning algorithms, which is a significant implementation in the application of reinforcement learning in investment strategies. Our findings highlight the significance of introducing certain variables such as regime change in the environment setting to enhance the prediction accuracy. Furthermore, the application of reinforcement learning in investment strategies provides a remarkable advantage of setting the optimization problem flexibly. This enables the integration of practical constraints faced by investors into the algorithm, resulting in efficient optimization. Our study has categorized the investment strategy formulation conditions into three main categories, including performance measurement indicators, portfolio management rules, and other constraints. We have evaluated the impact of incorporating these conditions into the environment and rewards in a reinforcement learning framework and examined how they influence investment behavior.

摘要
在这项研究中，我们开发了一种动态资产分配投资策略使用强化学习技术。首先，我们解决了金融时间序列数据不Stationarity问题的应用在强化学习算法中的问题，这是投资策略应用强化学习中的一个重要实现。我们的发现表明，在环境设置中引入 certain 变量，如状态转换，可以提高预测精度。此外，强化学习在投资策略中提供了一个remarkable的优势，即可以自由地设置优化问题。这使得可以将实际面临的投资者的限制 integrate into the algorithm，从而实现高效的优化。我们对投资策略的形式化条件进行分类，包括表现指标、股票管理规则和其他限制。我们在强化学习框架中包含这些条件的环境和奖励，并研究了它们如何影响投资行为。

Auto deep learning for bioacoustic signals

paper_url: http://arxiv.org/abs/2311.04945
repo_url: https://github.com/giuliotosato/autokeras-bioacustic
paper_authors: Giulio Tosato, Abdelrahman Shehata, Joshua Janssen, Kees Kamp, Pramatya Jati, Dan Stowell
for: 这个研究旨在探讨自动深度学习是否可以提高多类bird vocalization分类的准确率和效率，并与传统的手动设计的深度学习模型进行比较。
methods: 这个研究使用了AutoKeras自动机器学习框架，自动化了神经网络搜索和超参数优化。
results: 结果表明，AutoKeras-derived模型在Western Mediterranean Wetland Birds dataset上 consistently outperform了传统模型如MobileNet、ResNet50和VGG16。这种方法和结论推动了生物声学研究的进步，并提供了一种新的自动化深度学习方法。

Abstract
This study investigates the potential of automated deep learning to enhance the accuracy and efficiency of multi-class classification of bird vocalizations, compared against traditional manually-designed deep learning models. Using the Western Mediterranean Wetland Birds dataset, we investigated the use of AutoKeras, an automated machine learning framework, to automate neural architecture search and hyperparameter tuning. Comparative analysis validates our hypothesis that the AutoKeras-derived model consistently outperforms traditional models like MobileNet, ResNet50 and VGG16. Our approach and findings underscore the transformative potential of automated deep learning for advancing bioacoustics research and models. In fact, the automated techniques eliminate the need for manual feature engineering and model design while improving performance. This study illuminates best practices in sampling, evaluation and reporting to enhance reproducibility in this nascent field. All the code used is available at https: //github.com/giuliotosato/AutoKeras-bioacustic Keywords: AutoKeras; automated deep learning; audio classification; Wetlands Bird dataset; comparative analysis; bioacoustics; validation dataset; multi-class classification; spectrograms.

摘要
Keywords: AutoKeras; automated deep learning; audio classification; Wetlands Bird dataset; comparative analysis; bioacoustics; validation dataset; multi-class classification; spectrograms.中文翻译：本研究探索使用自动化深度学习提高多类分类鸟叫声的精度和效率，与传统手动设计的深度学习模型进行比较。我们使用西地中海湿地鸟类 dataset 和 AutoKeras 框架自动化神经网络搜索和超参调整。我们的结果表明，AutoKeras derive 模型在 MobileNet、ResNet50 和 VGG16 等传统模型的比较中 consistently 表现出色。本研究强调自动化深度学习在生物声学研究中的潜在价值，因为它消除了手动特征工程和模型设计的需求，同时提高性能。我们还提供了采样、评估和报告的最佳实践，以增强这个领域的可重复性。所有代码使用的可以在 GitHub 上找到（https://github.com/giuliotosato/AutoKeras-bioacustic）。键语：AutoKeras; 自动化深度学习; 音频分类; 湿地鸟类 dataset; 比较分析; 生物声学; 验证集; 多类分类; spectrograms.

NExT-Chat: An LMM for Chat, Detection and Segmentation

paper_url: http://arxiv.org/abs/2311.04498
repo_url: https://github.com/tmukande-debug/NExT-Chat
paper_authors: Ao Zhang, Liming Zhao, Chen-Wei Xie, Yun Zheng, Wei Ji, Tat-Seng Chua
for: 本研究旨在提高大型语言模型（LLM）在多Modal理解方面的水平，通过增强视觉理解能力，使LMM能够更好地理解和回答多Modal问题。
methods: 本研究提出了一种新的对象位置模型方法 called pixel2emb，该方法让LMM输出位置嵌入，然后通过不同的解码器进行解码。这种嵌入基于的位置模型方法允许使用不同的位置格式（如 bounding box 和 mask）在多Modal会话中。
results: 在有限资源的情况下，我们的 pixel2emb 方法在位置输入和输出任务中表现出色，与现有SOTA方法相比，具有更高的性能。基于提出的 pixel2emb 方法，我们训练了一个名为 NExT-Chat 的 LMM，并证明其能够处理多种任务，如视觉固定、区域描述和基于物理的理解。

Abstract
The development of large language models (LLMs) has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs). In order to enhance the level of visual comprehension, recent studies have equipped LMMs with region-level understanding capabilities by representing object bounding box coordinates as a series of text sequences (pixel2seq). In this paper, we introduce a novel paradigm for object location modeling called pixel2emb method, where we ask the LMM to output the location embeddings and then decoded by different decoders. This paradigm allows for different location formats (such as bounding boxes and masks) to be used in multimodal conversations Furthermore, this kind of embedding based location modeling enables the utilization of existing practices in localization tasks, such as detection and segmentation. In scenarios with limited resources, our pixel2emb demonstrates superior performance compared to existing state-of-the-art (SOTA) approaches in both the location input and output tasks under fair comparison. Leveraging the proposed pixel2emb method, we train an LMM named NExT-Chat and demonstrate its capability of handling multiple tasks like visual grounding, region caption, and grounded reasoning.

摘要
大型语言模型（LLM）的发展对多Modal理解领域带来了巨大的进步，导致大多Modal模型（LMM）的出现。为了提高视觉理解水平， latest studies have equipped LMMs with regional understanding capabilities by representing object bounding box coordinates as a series of text sequences (pixel2seq). 在这篇论文中，我们介绍了一种新的对象位置模型方法，即像素2Embedding（pixel2emb）方法，其中我们问LMM输出位置嵌入，然后通过不同的解码器进行解码。这种嵌入基于位置模型方法允许使用不同的位置格式（如 bounding box 和 mask）在多Modal conversation中，并且可以利用现有的Localization任务的实践，如检测和分割。在有限的资源情况下，我们的像素2emb在位置输入和输出任务中表现出了较好的性能，与现有SOTA方法相比。基于提出的像素2emb方法，我们训练了一个名为NExT-Chat的LMM，并证明其能处理多个任务，如视觉定位、区域描述和基于位置的理解。

Explainable AI for Earth Observation: Current Methods, Open Challenges, and Opportunities

paper_url: http://arxiv.org/abs/2311.04491
repo_url: None
paper_authors: Gulsen Taskin, Erchan Aptoula, Alp Ertürk
for: This paper provides a panorama of the state-of-the-art in explainable remote sensing image analysis, organized by prominent Earth observation application fields.
methods: The paper explores a wide spectrum of Explainable Artificial Intelligence techniques to address the lack of explainability and interpretability in deep learning methods for remote sensing.
results: The paper presents the state-of-the-art in explainable remote sensing image analysis, covering a range of Earth observation application fields.Here’s the text in Simplified Chinese:
for: 这篇论文提供了遍及主要地球观测应用领域的现状的Remote Sensing图像分析的状况报告，使用了Explainable Artificial Intelligence技术来解决深度学习方法的解释性和可解释性问题。
methods: 论文探讨了各种Explainable Artificial Intelligence技术，以解决深度学习方法在Remote Sensing图像分析中的解释性和可解释性问题。
results: 论文提供了Remote Sensing图像分析领域的状态对应的现状报告，涵盖了主要的地球观测应用领域。

Abstract
Deep learning has taken by storm all fields involved in data analysis, including remote sensing for Earth observation. However, despite significant advances in terms of performance, its lack of explainability and interpretability, inherent to neural networks in general since their inception, remains a major source of criticism. Hence it comes as no surprise that the expansion of deep learning methods in remote sensing is being accompanied by increasingly intensive efforts oriented towards addressing this drawback through the exploration of a wide spectrum of Explainable Artificial Intelligence techniques. This chapter, organized according to prominent Earth observation application fields, presents a panorama of the state-of-the-art in explainable remote sensing image analysis.

摘要
深度学习已经在数据分析领域的所有领域中掀尘，包括远程感知。然而，尽管表现得非常出色，但深度学习的不可解性和解释性问题，从神经网络的出现以来一直存在的问题，仍然是对其进行批判的主要来源。因此，深度学习方法在远程感知领域的扩张被附加了解释人工智能技术的探索。这章，按照主要的地球观测应用领域分类，介绍了现代 explainable 远程感知图像分析的状况。

Emergent Communication for Rules Reasoning

paper_url: http://arxiv.org/abs/2311.04474
repo_url: None
paper_authors: Yuxuan Guo, Yifan Hao, Rui Zhang, Enshuai Zhou, Zidong Du, Xishan Zhang, Xinkai Song, Yuanbo Wen, Yongwei Zhao, Xuehai Zhou, Jiaming Guo, Qi Yi, Shaohui Peng, Di Huang, Ruizhi Chen, Qi Guo, Yunji Chen
for: 这个论文主要研究了深度学习基于代理的emergent通信，它们在语言和人工智能方面提供了灵感。但是，之前的尝试都是在感知 Orientated的环境下进行emerging通信， forcing agents to describe low-level perceptual features within image or symbol contexts.
methods: 在这篇论文中，我们提出了一种新的认知游戏（namely Reasoning Game），这个游戏鼓励代理通过思维和通信来解释高级规则，而不是仅仅描述低级感知上下文。我们还提出了一个不偏的数据集（namely rule-RAVEN）作为一个基准，以避免过拟合。此外，我们还提出了一种两阶段训练方法，用于在Reasoning Game中更稳定地 converge。
results: 实验结果表明，在Reasoning Game中，代理们能够解释高级规则，并将其应用到未看过的上下文特性中。此外，emerged语言还帮助代理们在不同上下文特性或任务之间进行泛化和传输。

Abstract
Research on emergent communication between deep-learning-based agents has received extensive attention due to its inspiration for linguistics and artificial intelligence. However, previous attempts have hovered around emerging communication under perception-oriented environmental settings, that forces agents to describe low-level perceptual features intra image or symbol contexts. In this work, inspired by the classic human reasoning test (namely Raven's Progressive Matrix), we propose the Reasoning Game, a cognition-oriented environment that encourages agents to reason and communicate high-level rules, rather than perceived low-level contexts. Moreover, we propose 1) an unbiased dataset (namely rule-RAVEN) as a benchmark to avoid overfitting, 2) and a two-stage curriculum agent training method as a baseline for more stable convergence in the Reasoning Game, where contexts and semantics are bilaterally drifting. Experimental results show that, in the Reasoning Game, a semantically stable and compositional language emerges to solve reasoning problems. The emerged language helps agents apply the extracted rules to the generalization of unseen context attributes, and to the transfer between different context attributes or even tasks.

摘要
研究深度学习代理之间的emergentcommunication已经受到了人工智能和语言科学的广泛关注，因为它们可以提供人工智能和语言科学的灵感。然而，之前的尝试都集中在感知 oriented 环境下的 emerging communication， forcing agents to describe low-level perceptual features within image or symbol contexts。在这项工作中， Drawing inspiration from the classic human reasoning test (namely Raven's Progressive Matrix), we propose the Reasoning Game, a cognition-oriented environment that encourages agents to reason and communicate high-level rules, rather than perceived low-level contexts。 In addition, we propose 1) an unbiased dataset (namely rule-RAVEN) as a benchmark to avoid overfitting, 2) and a two-stage curriculum agent training method as a baseline for more stable convergence in the Reasoning Game, where contexts and semantics are bilaterally drifting。实验结果表明，在 Reasoning Game 中，semantically stable and compositional language emerges to solve reasoning problems。这种emerged language helps agents apply the extracted rules to the generalization of unseen context attributes, and to the transfer between different context attributes or even tasks。

RDGCN: Reinforced Dependency Graph Convolutional Network for Aspect-based Sentiment Analysis

paper_url: http://arxiv.org/abs/2311.04467
repo_url: https://github.com/rdgcn/rdgcn
paper_authors: Xusheng Zhao, Hao Peng, Qiong Dai, Xu Bai, Huailiang Peng, Yanbing Liu, Qinglang Guo, Philip S. Yu
for: 这 paper 的目的是提高 aspect-based sentiment analysis (ABSA) 的精度，使其能够更好地预测句子中的 sentiment polarity。
methods: 这 paper 使用 graph neural networks (GNN) 来捕捉句子中的结构 Patterns，并通过 reinforcement learning 来改进 dependency graph 中的重要性计算。
results: compare to state-of-the-art GNN-based baselines, RDGCN 在三个 популяр的 dataset 上的全面实验中表现出色，提高了 ABSA 的精度。

Abstract
Aspect-based sentiment analysis (ABSA) is dedicated to forecasting the sentiment polarity of aspect terms within sentences. Employing graph neural networks to capture structural patterns from syntactic dependency parsing has been confirmed as an effective approach for boosting ABSA. In most works, the topology of dependency trees or dependency-based attention coefficients is often loosely regarded as edges between aspects and opinions, which can result in insufficient and ambiguous syntactic utilization. To address these problems, we propose a new reinforced dependency graph convolutional network (RDGCN) that improves the importance calculation of dependencies in both distance and type views. Initially, we propose an importance calculation criterion for the minimum distances over dependency trees. Under the criterion, we design a distance-importance function that leverages reinforcement learning for weight distribution search and dissimilarity control. Since dependency types often do not have explicit syntax like tree distances, we use global attention and mask mechanisms to design type-importance functions. Finally, we merge these weights and implement feature aggregation and classification. Comprehensive experiments on three popular datasets demonstrate the effectiveness of the criterion and importance functions. RDGCN outperforms state-of-the-art GNN-based baselines in all validations.

摘要
Initially, we propose an importance calculation criterion for the minimum distances over dependency trees. Under the criterion, we design a distance-importance function that leverages reinforcement learning for weight distribution search and dissimilarity control. Since dependency types often do not have explicit syntax like tree distances, we use global attention and mask mechanisms to design type-importance functions. Finally, we merge these weights and implement feature aggregation and classification.Comprehensive experiments on three popular datasets demonstrate the effectiveness of the criterion and importance functions. RDGCN outperforms state-of-the-art GNN-based baselines in all validations.

Edge-assisted U-Shaped Split Federated Learning with Privacy-preserving for Internet of Things

paper_url: http://arxiv.org/abs/2311.04944
repo_url: None
paper_authors: Hengliang Tang, Zihang Zhao, Detian Liu, Yang Cao, Shiqiang Zhang, Siqing You
for: 这个研究旨在解决互联网领域内的物联网（IoT）设备上的深度学习模型部署问题。这些设备通常没有计算和通信能力，直接传输数据会导致网络拥堵和不合理的执行。中央化数据处理在数据中心也不再可行，因为关于数据隐私和安全的 Concerns。
methods: 我们提出了一个创新的 Edge-assisted U-Shaped Split Federated Learning（EUSFL）框架，利用边缘服务器的高性能能力协助IoT设备进行模型训练和优化过程。在这个框架中，我们运用了 Federated Learning（FL），让数据持有者共同训练模型，而不需要分享数据，从而提高隐私保护。此外，我们将神经网络分为三部分使用U-型分割，让IoT设备进行本地训练。这样可以利用边缘服务器的更高计算能力，实现全面训练时间的缩短，并让IoT设备 avec varying capabilities 进行训练任务得以高效。
results: 我们的理论分析和实验结果显示，EUSFL可以与不同的聚合算法结合使用，在不同的IoT设备计算能力下保持良好的性能，并对训练时间和本地计算负载进行了明显的缩短。此外，我们还提出了一种新的杂音机制 called LabelDP，以保护数据特征和标签免受重建攻击，排除隐私泄露的风险。

Abstract
In the realm of the Internet of Things (IoT), deploying deep learning models to process data generated or collected by IoT devices is a critical challenge. However, direct data transmission can cause network congestion and inefficient execution, given that IoT devices typically lack computation and communication capabilities. Centralized data processing in data centers is also no longer feasible due to concerns over data privacy and security. To address these challenges, we present an innovative Edge-assisted U-Shaped Split Federated Learning (EUSFL) framework, which harnesses the high-performance capabilities of edge servers to assist IoT devices in model training and optimization process. In this framework, we leverage Federated Learning (FL) to enable data holders to collaboratively train models without sharing their data, thereby enhancing data privacy protection by transmitting only model parameters. Additionally, inspired by Split Learning (SL), we split the neural network into three parts using U-shaped splitting for local training on IoT devices. By exploiting the greater computation capability of edge servers, our framework effectively reduces overall training time and allows IoT devices with varying capabilities to perform training tasks efficiently. Furthermore, we proposed a novel noise mechanism called LabelDP to ensure that data features and labels can securely resist reconstruction attacks, eliminating the risk of privacy leakage. Our theoretical analysis and experimental results demonstrate that EUSFL can be integrated with various aggregation algorithms, maintaining good performance across different computing capabilities of IoT devices, and significantly reducing training time and local computation overhead.

摘要
在互联网物联网（IoT）领域，部署深度学习模型来处理由IoT设备生成或收集的数据是一项关键挑战。然而，直接数据传输会导致网络拥堵和不efficient执行，因为IoT设备通常缺乏计算和通信能力。中央化数据处理在数据中心也不再可行，因为数据隐私和安全问题。为解决这些挑战，我们提出了一种创新的Edge助けU型分布式学习（EUSFL）框架，利用边缘服务器的高性能特性来帮助IoT设备进行模型训练和优化过程。在这个框架中，我们运用分布式学习（FL），使得数据持有者可以共同训练模型，而不需要将数据共享，从而提高数据隐私保护。此外，受到分learn（SL）的启发，我们将神经网络分成三部分，在IoT设备上进行本地训练。通过利用边缘服务器的更高计算能力，我们的框架可以有效减少总训练时间，让IoT设备按照不同的能力进行训练任务，并且可以保持不同计算能力的IoT设备之间的兼容性。此外，我们还提出了一种新的噪声机制called LabelDP，以保护数据特征和标签免受重建攻击，从而消除隐私泄露的风险。我们的理论分析和实验结果表明，EUSFL可以与不同的聚合算法结合使用，保持不同计算能力的IoT设备之间的兼容性，同时减少训练时间和本地计算负担。

Improving Pacing in Long-Form Story Planning

paper_url: http://arxiv.org/abs/2311.04459
repo_url: https://github.com/yichenzw/pacing
paper_authors: Yichen Wang, Kevin Yang, Xiaoming Liu, Dan Klein
for: 提高自动生成故事大纲的自然整体感 (improve the natural pacing of automatically generated story outlines)
methods: 使用 coneCreteness 评价器控制层次大纲生成 (use a concreteness evaluator to control hierarchical outline generation)，并使用 predicted concreteness 筛选新的大纲项 (filter new outline items based on predicted concreteness)
results: 与基线对比，人类评价 CONCOCT 的均衡性高于 57% 多个大纲长度 (compared to a baseline, humans judge CONCOCT’s pacing to be more consistent over 57% of the time across multiple outline lengths)

Abstract
Existing LLM-based systems for writing long-form stories or story outlines frequently suffer from unnatural pacing, whether glossing over important events or over-elaborating on insignificant details, resulting in a jarring experience for the reader. We propose a CONCrete Outline ConTrol (CONCOCT) system to improve pacing when automatically generating story outlines. We first train a concreteness evaluator to judge which of two events is more concrete (low-level-detailed). This evaluator can then be used to control pacing in hierarchical outline generation; in this work, we explore a vaguest-first expansion procedure that aims for uniform pacing. We further use the evaluator to filter new outline items based on predicted concreteness. Compared to a baseline hierarchical outline generator, humans judge CONCOCT's pacing to be more consistent over 57% of the time across multiple outline lengths; the gains also translate to downstream stories. All code, data, and models are open-sourced.

摘要
We first train a concreteness evaluator to determine which of two events is more concrete (low-level-detailed). This evaluator is then used to control pacing in hierarchical outline generation. Specifically, we use a vaguest-first expansion procedure that aims for uniform pacing. Additionally, we use the evaluator to filter new outline items based on their predicted concreteness.Compared to a baseline hierarchical outline generator, humans judge CONCOCT's pacing to be more consistent over 57% of the time across multiple outline lengths. Furthermore, the gains translate to downstream stories. All code, data, and models are open-sourced.

Evaluating Uncertainty Quantification approaches for Neural PDEs in scientific applications

paper_url: http://arxiv.org/abs/2311.04457
repo_url: None
paper_authors: Vardhan Dongre, Gurpreet Singh Hora
for: This paper focuses on the development of Uncertainty Quantification (UQ) methods for Neural Partial Differential Equations (Neural PDEs) in scientific applications, specifically for forward and inverse problems.methods: The paper evaluates various UQ approaches, including Bayesian methods such as Hamiltonian Monte Carlo (HMC) and Monte-Carlo Dropout (MCD), as well as a conventional approach using Deep Ensembles (DE).results: The results show that Neural PDEs can effectively reconstruct flow systems and predict the associated unknown parameters, but the Bayesian methods tend to display a higher degree of certainty in their predictions compared to the DE approach. This suggests that Bayesian techniques may underestimate the true underlying uncertainty, appearing more confident in their predictions than the DE approach.Here’s the Chinese version of the three key points:for: 这篇论文关注使用神经partial differential equations (Neural PDEs)在科学应用中的uncertainty量化 (UQ)方法的开发，特别是对于前向和反向问题。methods: 论文评估了多种UQ方法，包括 bayesian方法such as Hamiltonian Monte Carlo (HMC)和Monte-Carlo Dropout (MCD)，以及一种传统的 Deep Ensembles (DE) 方法。results: 结果显示 Neural PDEs 可以有效地重construct流体系统和相关的未知参数，但 bayesian方法在其预测中显示了更高的自信度，与 DE 方法相比。这表明 bayesian 技术可能会下降 true 的下层不确定性，因此在其预测中显得更自信。

Abstract
The accessibility of spatially distributed data, enabled by affordable sensors, field, and numerical experiments, has facilitated the development of data-driven solutions for scientific problems, including climate change, weather prediction, and urban planning. Neural Partial Differential Equations (Neural PDEs), which combine deep learning (DL) techniques with domain expertise (e.g., governing equations) for parameterization, have proven to be effective in capturing valuable correlations within spatiotemporal datasets. However, sparse and noisy measurements coupled with modeling approximation introduce aleatoric and epistemic uncertainties. Therefore, quantifying uncertainties propagated from model inputs to outputs remains a challenge and an essential goal for establishing the trustworthiness of Neural PDEs. This work evaluates various Uncertainty Quantification (UQ) approaches for both Forward and Inverse Problems in scientific applications. Specifically, we investigate the effectiveness of Bayesian methods, such as Hamiltonian Monte Carlo (HMC) and Monte-Carlo Dropout (MCD), and a more conventional approach, Deep Ensembles (DE). To illustrate their performance, we take two canonical PDEs: Burger's equation and the Navier-Stokes equation. Our results indicate that Neural PDEs can effectively reconstruct flow systems and predict the associated unknown parameters. However, it is noteworthy that the results derived from Bayesian methods, based on our observations, tend to display a higher degree of certainty in their predictions as compared to those obtained using the DE. This elevated certainty in predictions suggests that Bayesian techniques might underestimate the true underlying uncertainty, thereby appearing more confident in their predictions than the DE approach.

摘要
<>使用可得到的散点数据，由于可靠且有效的感知器和数值实验，解决科学问题，包括气候变化、天气预测和城市规划等。神经partial differential equations（神经PDEs），结合深度学习（DL）技术和领域专业知识（例如，管理方程）进行参数化，能够很好地捕捉空间时间数据中的有价值相关性。然而，稀缺和噪声的测量数据，加之模型简化，导致 aleatoric和epistemicuncertainty。因此，将模型输入到输出中的不确定性进行评估是一项重要的任务，以确保神经PDEs的可靠性。本工作评估了多种uncertainty quantification（UQ）方法，包括forward和 inverse问题在科学应用中。 Specifically, we investigate the effectiveness of Bayesian methods, such as Hamiltonian Monte Carlo (HMC) and Monte-Carlo Dropout (MCD), and a more conventional approach, Deep Ensembles (DE). To illustrate their performance, we take two canonical PDEs: Burger's equation and the Navier-Stokes equation. Our results indicate that Neural PDEs can effectively reconstruct flow systems and predict the associated unknown parameters. However, it is noteworthy that the results derived from Bayesian methods, based on our observations, tend to display a higher degree of certainty in their predictions as compared to those obtained using the DE. This elevated certainty in predictions suggests that Bayesian techniques might underestimate the true underlying uncertainty, thereby appearing more confident in their predictions than the DE approach.Translated by Google Translate.

MathNAS: If Blocks Have a Role in Mathematical Architecture Design

paper_url: http://arxiv.org/abs/2311.04943
repo_url: https://github.com/wangqinsi1/mathnas
paper_authors: Wang Qinsi, Ke Jinhan, Liang Zhi, Zhang Sihai
For: 这个研究是要解决Neural Architecture Search（NAS）中的大型模型设计问题，因为现有的方法在搜寻和评估候选网络时需要庞大的 computation cost。* Methods: 这篇研究提出了一个新的分治策略，利用搜寻空间的弹性特性，将候选网络的表现计算分解为各个专案中的表现，并使用数学程式来预测网络表现。* Results: 这篇研究显示，这个新的分治策略可以实现快速的网络表现评估，并且可以实现更高的准确性和更快的搜寻速度。

Abstract
Neural Architecture Search (NAS) has emerged as a favoured method for unearthing effective neural architectures. Recent development of large models has intensified the demand for faster search speeds and more accurate search results. However, designing large models by NAS is challenging due to the dramatical increase of search space and the associated huge performance evaluation cost. Consider a typical modular search space widely used in NAS, in which a neural architecture consists of $m$ block nodes and a block node has $n$ alternative blocks. Facing the space containing $n^m$ candidate networks, existing NAS methods attempt to find the best one by searching and evaluating candidate networks directly.Different from the general strategy that takes architecture search as a whole problem, we propose a novel divide-and-conquer strategy by making use of the modular nature of the search space.Here, we introduce MathNAS, a general NAS framework based on mathematical programming.In MathNAS, the performances of the $m*n$ possible building blocks in the search space are calculated first, and then the performance of a network is directly predicted based on the performances of its building blocks. Although estimating block performances involves network training, just as what happens for network performance evaluation in existing NAS methods, predicting network performance is completely training-free and thus extremely fast. In contrast to the $n^m$ candidate networks to evaluate in existing NAS methods, which require training and a formidable computational burden, there are only $m*n$ possible blocks to handle in MathNAS. Therefore, our approach effectively reduces the complexity of network performance evaluation.Our code is available at https://github.com/wangqinsi1/MathNAS.

摘要
neural Architecture Search (NAS) 已经成为发掘有效神经建筑的首选方法。 recent development of large models 使得寻找更快的搜索速度和更准确的搜索结果变得更加紧迫。然而，通过 NAS 设计大型模型是挑战，因为搜索空间的增加会导致搜索速度的增加和评估成本的增加。 faces a typical modular search space widely used in NAS, in which a neural architecture consists of $m$ block nodes and a block node has $n$ alternative blocks. existing NAS methods attempt to find the best one by searching and evaluating candidate networks directly. unlike the general strategy that takes architecture search as a whole problem, we propose a novel divide-and-conquer strategy by making use of the modular nature of the search space. here, we introduce MathNAS, a general NAS framework based on mathematical programming. in MathNAS, the performances of the $m*n$ possible building blocks in the search space are calculated first, and then the performance of a network is directly predicted based on the performances of its building blocks. although estimating block performances involves network training, just as what happens for network performance evaluation in existing NAS methods, predicting network performance is completely training-free and thus extremely fast. in contrast to the $n^m$ candidate networks to evaluate in existing NAS methods, which require training and a formidable computational burden, there are only $m*n$ possible blocks to handle in MathNAS. therefore, our approach effectively reduces the complexity of network performance evaluation. our code is available at https://github.com/wangqinsi1/MathNAS.

MixTEA: Semi-supervised Entity Alignment with Mixture Teaching

paper_url: http://arxiv.org/abs/2311.04441
repo_url: https://github.com/xiefeng69/mixtea
paper_authors: Feng Xie, Xin Song, Xiang Zeng, Xuechen Zhao, Lei Tian, Bin Zhou, Yusong Tan
for: 本文提出了一种新的半监督实体对应（EA）方法，以解决由于缺乏充分的标注数据而带来的实体对应问题。
methods: 本文使用了一种独特的结合人工标注和 probabilistic pseudo 对应的混合教学方法，并在 pseudo 对应学习中提出了bi-directional voting（BDV）策略和匹配多样性基于修正（MDR）模块，以降低噪声对 pseudo 对应学习的负面影响。
results: 对于多个 benchmark 数据集，以及进一步的分析，表明了我们提出的方法的优越性和实用性。

Abstract
Semi-supervised entity alignment (EA) is a practical and challenging task because of the lack of adequate labeled mappings as training data. Most works address this problem by generating pseudo mappings for unlabeled entities. However, they either suffer from the erroneous (noisy) pseudo mappings or largely ignore the uncertainty of pseudo mappings. In this paper, we propose a novel semi-supervised EA method, termed as MixTEA, which guides the model learning with an end-to-end mixture teaching of manually labeled mappings and probabilistic pseudo mappings. We firstly train a student model using few labeled mappings as standard. More importantly, in pseudo mapping learning, we propose a bi-directional voting (BDV) strategy that fuses the alignment decisions in different directions to estimate the uncertainty via the joint matching confidence score. Meanwhile, we also design a matching diversity-based rectification (MDR) module to adjust the pseudo mapping learning, thus reducing the negative influence of noisy mappings. Extensive results on benchmark datasets as well as further analyses demonstrate the superiority and the effectiveness of our proposed method.

摘要
semi-supervised entity alignment（EA）是一个实际和挑战性的任务，因为缺乏充足的标注映射作为训练数据。大多数工作是通过生成 Pseudo 映射来 Address 这个问题，但它们都会受到 Pseudo 映射的错误（噪声）的影响或者忽略 Pseudo 映射的不确定性。在这篇论文中，我们提出了一种新的 semi-supervised EA 方法，名为 MixTEA，它使用端到端混合教学法和概率 Pseudo 映射来导引模型学习。我们首先在几个标注映射上训练一个学生模型。更重要的是，在 Pseudo 映射学习中，我们提出了两个方向投票（BDV）策略，它将在不同的方向投票结果中进行折衔，以便估计 uncertainty via 联合匹配信息指数。同时，我们还设计了一个匹配多样性基于修正（MDR）模块，以降低 Pseudo 映射学习中的负面影响。我们在标准 benchmark 数据集以及进一步的分析中表明了我们的提出的方法的优越性和效果。

Interpretable Geoscience Artificial Intelligence (XGeoS-AI): Application to Demystify Image Recognition

paper_url: http://arxiv.org/abs/2311.04940
repo_url: None
paper_authors: Jin-Jian Xu, Hao Zhang, Chao-Sheng Tang, Lin Li, Bin Shi
for: 这个研究的目的是解释地球科学中的图像识别问题，并提出一种可解释的地球科学人工智能（XGeoS-AI）框架来实现这一目标。
methods: 该框架采用了人类视觉机制的思想，通过在整个图像中选择一个地方生成一个阈值，以完成图像识别任务。此外，可以采用不同的人工智能方法，如支持向量回归（SVR）、多层感知神经网络（MLP）和卷积神经网络（CNN）等，来快速完成地球科学图像识别任务。
results: 实验结果表明，提出的XGeoS-AI框架具有高效、多样化和可解释的优点，有很大的潜力应用于地球科学图像识别问题。此外，该框架还可以推动地球科学领域内的技术创新。

Abstract
As Earth science enters the era of big data, artificial intelligence (AI) not only offers great potential for solving geoscience problems, but also plays a critical role in accelerating the understanding of the complex, interactive, and multiscale processes of Earth's behavior. As geoscience AI models are progressively utilized for significant predictions in crucial situations, geoscience researchers are increasingly demanding their interpretability and versatility. This study proposes an interpretable geoscience artificial intelligence (XGeoS-AI) framework to unravel the mystery of image recognition in the Earth sciences, and its effectiveness and versatility is demonstrated by taking computed tomography (CT) image recognition as an example. Inspired by the mechanism of human vision, the proposed XGeoS-AI framework generates a threshold value from a local region within the whole image to complete the recognition. Different kinds of artificial intelligence (AI) methods, such as Support Vector Regression (SVR), Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), can be adopted as the AI engines of the proposed XGeoS-AI framework to efficiently complete geoscience image recognition tasks. Experimental results demonstrate that the effectiveness, versatility, and heuristics of the proposed framework have great potential in solving geoscience image recognition problems. Interpretable AI should receive more and more attention in the field of the Earth sciences, which is the key to promoting more rational and wider applications of AI in the field of Earth sciences. In addition, the proposed interpretable framework may be the forerunner of technological innovation in the Earth sciences.

摘要
如今地球科学进入大数据时代，人工智能（AI）不仅提供了解决地球科学问题的极大潜力，还扮演着促进地球行为复杂、互动和多尺度过程的加速器。随着地球科学AI模型在重要情况下的广泛应用，地球科学研究人员更加需要其解释性和多样性。本研究提出了一种可解释的地球科学人工智能（XGeoS-AI）框架，以解决地球科学图像识别问题的谜题。灵感自人类视觉机制，提议的XGeoS-AI框架在全图像中 locates 一个区域，并生成该区域的阈值，以完成图像识别。不同的人工智能方法，如支持向量回归（SVR）、多层感知网络（MLP）和卷积神经网络（CNN），可以作为XGeoS-AI框架中的人工智能引擎，高效完成地球科学图像识别任务。实验结果表明，提议的框架具有效果、多样性和启发性，在地球科学图像识别问题上具有很大的潜力。可解释AI在地球科学领域应该收到更多的关注，这是推动AI在地球科学领域的更广泛应用的关键。此外，提议的可解释框架可能成为地球科学技术创新的先驱者。

LooGLE: Can Long-Context Language Models Understand Long Contexts?

paper_url: http://arxiv.org/abs/2311.04939
repo_url: https://github.com/bigai-nlco/loogle
paper_authors: Jiaqi Li, Mengmeng Wang, Zilong Zheng, Muhan Zhang
for: 评估大语言模型（LLMs）在长文本理解方面的能力。
methods: 使用新的文档（post-2022），且文档长度超过24,000个字符，同时采用人工生成的6,000个问题和多个领域的问题集来评估LLMs的长dependency能力。
results: 研究发现：（i）商业模型在LLMs中表现更好；（ii）LLMs在短dependency任务中表现出色，但在更复杂的长dependency任务中表现不佳；（iii）在文本上下文中学习和连接思维只提供了有限的改进；（iv）检索基本技术在短问题 answering中表现出色，而扩展文本窗口长度的技术对长文本理解有限的影响。

Abstract
Large language models (LLMs), despite their impressive performance in various language tasks, are typically limited to processing texts within context-window size. This limitation has spurred significant research efforts to enhance LLMs' long-context understanding with high-quality long-sequence benchmarks. However, prior datasets in this regard suffer from shortcomings, such as short context length compared to the context window of modern LLMs; outdated documents that have data leakage problems; and an emphasis on short dependency tasks rather than long dependency tasks. In this paper, we present LooGLE, a Long Context Generic Language Evaluation benchmark for LLMs' long context understanding. LooGLE features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains. Human annotators meticulously crafted more than 1,100 high-quality question-answer pairs to meet the long dependency requirements. These pairs underwent thorough cross-validation, yielding the most precise assessment of LLMs' long dependency capabilities. The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings: (i) commercial models outperformed open-sourced models; (ii) LLMs excelled in short dependency tasks like short question-answering and cloze tasks but struggled with more intricate long dependency tasks; (iii) in-context learning and chaining thoughts offered only marginal improvements; (iv) retrieval-based techniques demonstrated substantial benefits for short question-answering, while strategies for extending context window length had limited impact on long context understanding. As such, LooGLE not only provides a systematic and comprehensive evaluation schema on long-context LLMs, but also sheds light on future development of enhanced models towards "true long-context understanding".

摘要
大型语言模型（LLM），尽管在不同语言任务中表现出色，但通常只能处理文本内容窗口大小。这一限制促进了大量研究，以提高LLM的长期文本理解能力。然而，先前的数据集受到一些缺点的影响，如文本长度较短，与现代LLM的上下文窗口大小相比; 带有数据泄露问题的旧文档; 和更重视短语相互作用而非长语相互作用。本文提出了LooGLE，一个长期语言评估准则，用于评估LLM的长期文本理解能力。LooGLE的特点包括：使用最新的文档（2022年后），文档长度超过24,000个字符，并且新生成了6,000个问题，覆盖多个领域。人工标注员仔细制作了1,100个高质量问题对，以满足长期依赖要求。这些对经过了严格的交叉验证，以便获得LLM的长期依赖能力的最准确评估。八种当前最佳LLM在LooGLE上进行评估后，得到了以下发现：（i）商业模型比开源模型表现更佳；（ii）LLM在短语相互作用任务中表现出色，但在更复杂的长语相互作用任务中受到限制；（iii）嵌入式学习和串联思维只提供了有限的改进；（iv）基于检索的技术在短问题回答任务中具有显著的优势，而扩展上下文窗口长度的策略对长期文本理解的改进具有有限的影响。因此，LooGLE不仅提供了LLM的长期文本理解能力的系统和全面的评估方案，还探讨了未来 LLM 的发展，以实现“真正的长期文本理解”。

Data Factors for Better Compositional Generalization

paper_url: http://arxiv.org/abs/2311.04420
repo_url: https://github.com/owenzx/data4comp
paper_authors: Xiang Zhou, Yichen Jiang, Mohit Bansal
for: 本文旨在探讨模型在不同数据集上的泛化能力，以及如何通过不同的数据因素来改善模型的泛化能力。
methods: 本文使用Transformer模型在不同数据集上进行训练，并通过分析不同数据因素的影响来解释模型的泛化能力。
results: 研究发现，增加数据复杂度可以提高模型的泛化能力，并且这种改善来自于数据集中提供更多的多样化示例和降低示例重复频率的效果。此外，训练例子的Difficulty Level也对泛化能力产生不同的影响，在 sintetic datasets上，简单的例子更能够 invoke compose understanding，而在大规模的实际语言 datasets上，一个平衡的混合mixture of simple和hard例子可以induce最强的泛化能力。

Abstract
Recent diagnostic datasets on compositional generalization, such as SCAN (Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020), expose severe problems in models trained from scratch on these datasets. However, in contrast to this poor performance, state-of-the-art models trained on larger and more general datasets show better generalization ability. In this work, to reconcile this inconsistency, we conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors, including dataset scale, pattern complexity, example difficulty, etc. First, we show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges. To further understand this improvement, we show two axes of the benefit from more complex datasets: they provide more diverse examples so compositional understanding becomes more effective, and they also prevent ungeneralizable memorization of the examples due to reduced example repetition frequency. Finally, we explore how training examples of different difficulty levels influence generalization differently. On synthetic datasets, simple examples invoke stronger compositionality than hard examples do. On larger-scale real language datasets, while hard examples become more important potentially to ensure decent data coverage, a balanced mixture of simple and hard examples manages to induce the strongest generalizability. The code and data for this work are available at https://github.com/owenzx/data4comp

摘要
Recent diagnostic datasets on compositional generalization, such as SCAN (Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020), have revealed severe problems in models trained from scratch on these datasets. However, in contrast to this poor performance, state-of-the-art models trained on larger and more general datasets have shown better generalization ability. In this work, we aim to reconcile this inconsistency by conducting an empirical analysis of Transformer models trained on various training sets with different data factors, including dataset scale, pattern complexity, example difficulty, etc.First, we find that increased dataset complexity leads to better generalization behavior on multiple different generalization challenges. To further understand this improvement, we identify two axes of benefit from more complex datasets: they provide more diverse examples that enhance compositional understanding, and they also reduce the likelihood of ungeneralizable memorization due to reduced example repetition frequency.Finally, we explore how training examples of different difficulty levels influence generalization differently. On synthetic datasets, simple examples tend to invoke stronger compositionality than hard examples do. On larger-scale real language datasets, while hard examples become more important to ensure decent data coverage, a balanced mixture of simple and hard examples is found to induce the strongest generalizability. The code and data for this work are available at .

PepLand: a large-scale pre-trained peptide representation model for a comprehensive landscape of both canonical and non-canonical amino acids

paper_url: http://arxiv.org/abs/2311.04419
repo_url: None
paper_authors: Ruochi Zhang, Haoran Wu, Yuting Xiu, Kewei Li, Ningning Chen, Yu Wang, Yan Wang, Xin Gao, Fengfeng Zhou
For: PepLand is a pre-training architecture for representation and property analysis of peptides spanning both canonical and non-canonical amino acids.* Methods: PepLand leverages a comprehensive multi-view heterogeneous graph neural network to unveil the subtle structural representations of peptides.* Results: PepLand effectively captures salient synthetic peptide features, laying a robust foundation for transformative advances in peptide-centric research domains.Here’s the Chinese translation of the three points:* For: 这篇研究旨在提出一个专门针对包含非标准氨基酸的肽的预训架构，以探索这些肽的积极特征和性能。* Methods: 这个预训架构使用了一个全面的多观察异源Graph Neural Network（GNN），以捕捉这些肽的细微结构特征。* Results: PepLand 能够有效地捕捉这些肽的积极特征，实现了在肽中心研究领域中的创新进步。

Abstract
In recent years, the scientific community has become increasingly interested on peptides with non-canonical amino acids due to their superior stability and resistance to proteolytic degradation. These peptides present promising modifications to biological, pharmacological, and physiochemical attributes in both endogenous and engineered peptides. Notwithstanding their considerable advantages, the scientific community exhibits a conspicuous absence of an effective pre-trained model adept at distilling feature representations from such complex peptide sequences. We herein propose PepLand, a novel pre-training architecture for representation and property analysis of peptides spanning both canonical and non-canonical amino acids. In essence, PepLand leverages a comprehensive multi-view heterogeneous graph neural network tailored to unveil the subtle structural representations of peptides. Empirical validations underscore PepLand's effectiveness across an array of peptide property predictions, encompassing protein-protein interactions, permeability, solubility, and synthesizability. The rigorous evaluation confirms PepLand's unparalleled capability in capturing salient synthetic peptide features, thereby laying a robust foundation for transformative advances in peptide-centric research domains. We have made all the source code utilized in this study publicly accessible via GitHub at https://github.com/zhangruochi/pepland

摘要

AI-accelerated Discovery of Altermagnetic Materials

paper_url: http://arxiv.org/abs/2311.04418
repo_url: https://github.com/zfgao66/mataltmag
paper_authors: Ze-Feng Gao, Shuai Qu, Bocheng Zeng, Ji-Rong Wen, Hao Sun, Pengjie Guo, Zhong-Yi Lu
for: 这篇论文旨在探索新型磁性阶段—— alternate magnetism，以及发现更多的磁性材料。methods: 这篇论文使用了人工智能搜索引擎，融合了对组态分析、图像神经网络预训、最佳运输理论和基本电子结构计算，发现25种新的磁性材料，包括金属、半导体和对磁体。results: 这篇论文发现了25种新的磁性材料，其中8种是$i$-波磁性材料，这些材料具有独特的物理性能，例如非常性 Hall 效应、非常性 Kerr 效应和トポロジック性。

Abstract
Altermagnetism, a new magnetic phase, has been theoretically proposed and experimentally verified to be distinct from ferromagnetism and antiferromagnetism. Although altermagnets have been found to possess many exotic physical properties, the very limited availability of known altermagnetic materials~(e.g., 14 confirmed materials) hinders the study of such properties. Hence, discovering more types of altermagnetic materials is crucial for a comprehensive understanding of altermagnetism and thus facilitating new applications in the next generation information technologies, e.g., storage devices and high-sensitivity sensors. Here, we report 25 new altermagnetic materials that cover metals, semiconductors, and insulators, discovered by an AI search engine unifying symmetry analysis, graph neural network pre-training, optimal transport theory, and first-principles electronic structure calculation. The wide range of electronic structural characteristics reveals that various innovative physical properties manifest in these newly discovered altermagnetic materials, e.g., anomalous Hall effect, anomalous Kerr effect, and topological property. Noteworthy, we discovered 8 $i$-wave altermagnetic materials for the first time. Overall, the AI search engine performs much better than human experts and suggests a set of new altermagnetic materials with unique properties, outlining its potential for accelerated discovery of altermagnetic materials.

摘要
新型磁相��, 名为“alternromagnetism”, 已经理论上提出并实验验证, 与��磁和反磁异有区别。 although altermagnets possess many exotic physical properties, the limited availability of known altermagnetic materials (e.g., 14 confirmed materials) hinders the study of such properties. Therefore, discovering more types of altermagnetic materials is crucial for a comprehensive understanding of altermagnetism and will facilitate new applications in the next generation information technologies, such as storage devices and high-sensitivity sensors.我们报告了25种新的� alternate magnetic materials，包括� metal、半导体和半导体，通过� unity symmetry analysis、graph neural network pre-training、optimal transport theory和� first-principles electronic structure calculation发现。 these newly discovered altermagnetic materials exhibit a wide range of electronic structural characteristics, resulting in various innovative physical properties, such as anomalous Hall effect, anomalous Kerr effect, and topological property. noteworthy, we discovered 8 $i$-wave altermagnetic materials for the first time.相比� human experts, AI search engine表现更好，提供了一些新的� alternate magnetic materials with unique properties, highlighting its potential for accelerated discovery of altermagnetic materials.

Human Conditional Reasoning in Answer Set Programming

paper_url: http://arxiv.org/abs/2311.04412
repo_url: https://github.com/Aryia-Behroziuan/References
paper_authors: Chiaki Sakama
for: 本研究探讨了人类思维中的四种推理类型，包括假设前提 (AA)、假设结果 (AC)、否定前提 (DA) 和否定结果 (DC)。
methods: 本文使用 answer set programming 实现了 AC、DA 和 DC 推理类型，并研究了这些推理类型的正式性和人类思维任务的相关性。
results: 本研究发现，AC 和 DA 推理类型在日常生活中很常见，而 DC 推理类型则是逻辑有效的。此外，本文还应用了这些完成方法于通用智能 reasoning 领域。

Abstract
Given a conditional sentence P=>Q (if P then Q) and respective facts, four different types of inferences are observed in human reasoning. Affirming the antecedent (AA) (or modus ponens) reasons Q from P; affirming the consequent (AC) reasons P from Q; denying the antecedent (DA) reasons -Q from -P; and denying the consequent (DC) (or modus tollens) reasons -P from -Q. Among them, AA and DC are logically valid, while AC and DA are logically invalid and often called logical fallacies. Nevertheless, humans often perform AC or DA as pragmatic inference in daily life. In this paper, we realize AC, DA and DC inferences in answer set programming. Eight different types of completion are introduced and their semantics are given by answer sets. We investigate formal properties and characterize human reasoning tasks in cognitive psychology. Those completions are also applied to commonsense reasoning in AI.

摘要
Given a conditional sentence P=>Q (if P then Q) and respective facts, four different types of inferences are observed in human reasoning. Affirming the antecedent (AA) (or modus ponens) reasons Q from P; affirming the consequent (AC) reasons P from Q; denying the antecedent (DA) reasons -Q from -P; and denying the consequent (DC) (or modus tollens) reasons -P from -Q. Among them, AA and DC are logically valid, while AC and DA are logically invalid and often called logical fallacies. Nevertheless, humans often perform AC or DA as pragmatic inference in daily life. In this paper, we realize AC, DA, and DC inferences in answer set programming. Eight different types of completion are introduced and their semantics are given by answer sets. We investigate formal properties and characterize human reasoning tasks in cognitive psychology. Those completions are also applied to commonsense reasoning in AI.Here's the translation in Traditional Chinese: givent a conditional sentence P=>Q (if P then Q) and respective facts, four different types of inferences are observed in human reasoning. Affirming the antecedent (AA) (or modus ponens) reasons Q from P; affirming the consequent (AC) reasons P from Q; denying the antecedent (DA) reasons -Q from -P; and denying the consequent (DC) (or modus tollens) reasons -P from -Q. Among them, AA and DC are logically valid, while AC and DA are logically invalid and often called logical fallacies. Nevertheless, humans often perform AC or DA as pragmatic inference in daily life. In this paper, we realize AC, DA, and DC inferences in answer set programming. Eight different types of completion are introduced and their semantics are given by answer sets. We investigate formal properties and characterize human reasoning tasks in cognitive psychology. Those completions are also applied to commonsense reasoning in AI.

Improved DDIM Sampling with Moment Matching Gaussian Mixtures

paper_url: http://arxiv.org/abs/2311.04938
repo_url: None
paper_authors: Prasad Gabbur
for: 用于加速采样从预训练的推 Sobolev 模型（DDPM）中。
methods: 使用 Gaussian Mixture Model（GMM）作为反转过程操作（核），具体是匹配 DDPM 前向积分的首两个中心差异。
results: 通过对 CelebAHQ 和 FFHQ 的不条件模型以及 ImageNet 的类条件模型进行实验，并证明使用 GMM 核可以在少量采样步骤下提高生成样本的质量，并且在 FID 和 IS 指标中具有显著改善。例如在 ImageNet 256x256 上，使用 10 步采样，我们可以达到 FID 6.94 和 IS 207.85，与 Gaussian 核相比，这些指标的值分别为 10.15 和 196.73。

Abstract
We propose using a Gaussian Mixture Model (GMM) as reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ and class-conditional models trained on ImageNet datasets respectively. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example on ImageNet 256x256, using 10 sampling steps, we achieve a FID of 6.94 and IS of 207.85 with a GMM kernel compared to 10.15 and 196.73 respectively with a Gaussian kernel.

摘要
我们提议使用 Gaussian Mixture Model（GMM）作为反转过程操作器（核函数）在 Denoising Diffusion Implicit Models（DDIM）框架中，这是一种广泛使用的方法来加速从预训练的 Denoising Diffusion Probabilistic Models（DDPM）中采样。我们匹配了 DDPM 前向分布的首和第二个中心均值，通过限制 GMM 参数来实现这一点。我们发现， momemt 匹配是 suficient 来获得与原始 DDIM Gaussian 核函数相同或更好的质量的采样。我们在 CelebAHQ 和 FFHQ 上训练了无条件模型，并在 ImageNet 数据集上训练了类别 condition 模型。我们的实验结果表明，使用 GMM 核函数可以在少量采样步骤下获得较好的采样质量， как measured by FID 和 IS 度量。例如，在 ImageNet 256x256 上，使用 10 步骤，我们达到了 FID 6.94 和 IS 207.85 ，与 Gaussian 核函数相比，分别降低了 3.21 和 139.88。

Human-Centered Planning

paper_url: http://arxiv.org/abs/2311.04403
repo_url: https://github.com/rprokap/pset-9
paper_authors: Yuliang Li, Nitin Kamra, Ruta Desai, Alon Halevy
for: 本研究旨在开发一个基于深度学习模型（LLM）的计划程序，以便在用户提供的自然语言约束下，生成一个合理的计划。
methods: 本研究使用了LLM和符号逻辑计划程序（SymPlan），并将两者结合在一起以提高计划的可靠性和用户满意度。
results: 实验结果显示，LLMPlan与传统的符号逻辑计划程序相比，在不具有正式约束的情况下，平均表现相当，而且能够更好地满足用户的隐式需求。在交互评估中，LLM-基于的计划程序的用户满意度高于符号逻辑计划程序的用户满意度（70.5% vs. 40.4%）。

Abstract
LLMs have recently made impressive inroads on tasks whose output is structured, such as coding, robotic planning and querying databases. The vision of creating AI-powered personal assistants also involves creating structured outputs, such as a plan for one's day, or for an overseas trip. Here, since the plan is executed by a human, the output doesn't have to satisfy strict syntactic constraints. A useful assistant should also be able to incorporate vague constraints specified by the user in natural language. This makes LLMs an attractive option for planning. We consider the problem of planning one's day. We develop an LLM-based planner (LLMPlan) extended with the ability to self-reflect on its output and a symbolic planner (SymPlan) with the ability to translate text constraints into a symbolic representation. Despite no formal specification of constraints, we find that LLMPlan performs explicit constraint satisfaction akin to the traditional symbolic planners on average (2% performance difference), while retaining the reasoning of implicit requirements. Consequently, LLM-based planners outperform their symbolic counterparts in user satisfaction (70.5% vs. 40.4%) during interactive evaluation with 40 users.

摘要
LLMs 近期在输出结构化任务上做出了印象深刻的进展，如编程、机器人规划和查询数据库等。创建基于 AI 的人工智能助手也涉及到创建结构化输出，如一天的计划或国外旅行计划。在这些情况下，由人类执行计划，输出不需要严格的语法约束。一个有用的助手应该能够根据用户提供的自然语言笔记中的抽象约束进行计划。这使得 LLMs 成为规划的有力选择。我们考虑一天的规划问题。我们开发了一个基于 LLM 的规划器（LLMPlan），并增加了对自己输出的自适应能力以及一个基于符号的规划器（SymPlan），可以将自然语言约束转换为符号表示。虽无正式约束规则，但我们发现 LLMPlan 在平均上与传统的符号规划器相当于满足约束（2%性能差异），同时保留了逻辑推理的隐式要求。因此，基于 LLM 的规划器在用户满意度方面（70.5% vs. 40.4%）在交互评估中超过符号规划器。

LRM: Large Reconstruction Model for Single Image to 3D

paper_url: http://arxiv.org/abs/2311.04400
repo_url: None
paper_authors: Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, Hao Tan
for: 预测3D模型从单个输入图像中
methods: 使用高可Scalable transformer-based architecture和500万可学习参数直接预测神经辐射场（NeRF）
results: 可以高效地预测3D模型，包括从真实捕捉和生成模型中的图像中Here is the same information in Simplified Chinese:
for: 预测3D模型从单个输入图像中
methods: 使用高可Scalable transformer-based architecture和500万可学习参数直接预测神经辐射场（NeRF）
results: 可以高效地预测3D模型，包括从真实捕捉和生成模型中的图像中

Abstract
We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds. In contrast to many previous methods that are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data empowers our model to be highly generalizable and produce high-quality 3D reconstructions from various testing inputs including real-world in-the-wild captures and images from generative models. Video demos and interactable 3D meshes can be found on this website: https://yiconghong.me/LRM/.

摘要
我们提出了首个大型重建模型（LRM），可以从单个输入图像中预测对象的3D模型，只需5秒钟。与以往的方法不同，LRM采用了可扩展的变换器基 architecture，并有5亿个学习参数来直接预测神经辐射场（NeRF）。我们在终端到终 Point manner进行了大规模的训练，使用了包括Objaverse中的 sintetic renderings和MVImgNet中的真实捕捉的大量多视图数据，共约1000万个对象。这种高容量模型和大规模的训练数据使得我们的模型具有高度泛化性和可以高质量地从多种测试输入，包括实际世界中的野外捕捉和生成模型中的图像，生成高质量的3D重建。网站上有视频 demo和可交互的3D mesh：https://yiconghong.me/LRM/。

2023-11-08

cs.CL

cs.CL - 2023-11-08

Deep Learning Brasil at ABSAPT 2022: Portuguese Transformer Ensemble Approaches

paper_url: http://arxiv.org/abs/2311.05051
repo_url: https://github.com/ju-resplande/dlb_absapt2022
paper_authors: Juliana Resplande Santanna Gomes, Eduardo Augusto Santos Garcia, Adalberto Ferreira Barbosa Junior, Ruan Chaves Rodrigues, Diogo Fernandes Costa Silva, Dyonnatan Ferreira Maia, Nádia Félix Felipe da Silva, Arlindo Rodrigues Galvão Filho, Anderson da Silva Soares
for: 这篇论文的目的是提出一种基于方面的 sentiment 分析（ABSA）任务，以分类每个方面的 sentiment 偏好。
methods: 这篇论文使用了 two 个子任务：方面 термина抽取（ATE）和 sentiment 方向抽取（SOE），以实现 ABSA 任务。
results: 作者在 IberLEF 2022 上提交了最佳性系统，实现了两个子任务的新状态ppen-of-the-art 结果。

Abstract
Aspect-based Sentiment Analysis (ABSA) is a task whose objective is to classify the individual sentiment polarity of all entities, called aspects, in a sentence. The task is composed of two subtasks: Aspect Term Extraction (ATE), identify all aspect terms in a sentence; and Sentiment Orientation Extraction (SOE), given a sentence and its aspect terms, the task is to determine the sentiment polarity of each aspect term (positive, negative or neutral). This article presents we present our participation in Aspect-Based Sentiment Analysis in Portuguese (ABSAPT) 2022 at IberLEF 2022. We submitted the best performing systems, achieving new state-of-the-art results on both subtasks.

摘要
《方面基于情感分析（ABSA）的任务是将每个方面（即句子中的个体）的情感方向分类。这个任务包括两个子任务：方面词抽取（ATE）和情感方向抽取（SOE）。在这篇文章中，我们介绍了我们在“方面基于情感分析在葡萄牙语（ABSAPT）2022”中的参与，并提交了最佳性能的系统，创造了新的国际标准记录在两个子任务中。》Note that the translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you need Traditional Chinese, please let me know.

paper_url: http://arxiv.org/abs/2311.05047
repo_url: None
paper_authors: Eduardo Garcia, Juliana Gomes, Adalberto Barbosa Júnior, Cardeque Borges, Nádia da Silva
for: 这份研究是为了描述DeepLearningBrasil队伍在DepSign-LT-EDI@RANLP-2023共同任务中的策略，以及他们在该任务中获得的47.0%的Macro F1-Score和2.4%的优势。
methods: 这份研究使用了RoBERTa和DeBERTa模型，并将其进一步预训在一个精心选择的Reddit dataset上，从而增强了对精细心理健康语言的理解。 truncation技术和样本重量技术被用来处理长文本数据，并且使用了样本分成和ensemble技术来结合多个实验中的模型。
results: 这份研究获得了47.0%的Macro F1-Score和2.4%的优势，表明了DeepLearningBrasil队伍在该任务中的成功。

Abstract
In this paper, we delineate the strategy employed by our team, DeepLearningBrasil, which secured us the first place in the shared task DepSign-LT-EDI@RANLP-2023, achieving a 47.0% Macro F1-Score and a notable 2.4% advantage. The task was to classify social media texts into three distinct levels of depression - "not depressed," "moderately depressed," and "severely depressed." Leveraging the power of the RoBERTa and DeBERTa models, we further pre-trained them on a collected Reddit dataset, specifically curated from mental health-related Reddit's communities (Subreddits), leading to an enhanced understanding of nuanced mental health discourse. To address lengthy textual data, we used truncation techniques that retained the essence of the content by focusing on its beginnings and endings. Our model was robust against unbalanced data by incorporating sample weights into the loss. Cross-validation and ensemble techniques were then employed to combine our k-fold trained models, delivering an optimal solution. The accompanying code is made available for transparency and further development.

摘要
在这篇论文中，我们描述了我们团队（DeepLearningBrasil）在分享任务DepSign-LT-EDI@RANLP-2023中采用的策略，这使得我们在Macro F1-Score上取得了47.0%的分数和一个显著的2.4%优势。任务是将社交媒体文本分为三种不同程度的抑郁症 - "不抑郁", "中度抑郁"和"严重抑郁"。我们利用RoBERTa和DeBERTa模型的力量，并对这些模型进行了进一步预训练，特意是在精心收集的Reddit数据集上（来自医疗社区（Subreddits）），从而提高了我们对精细的心理健康语言的理解。为了处理长文本数据，我们使用了误差截断技术，保留文本内容的核心部分。我们的模型对不均衡数据进行了Robust化，并使用交叉验证和ensemble技术来组合我们的k-fold训练模型，实现了优化的解决方案。 accompaning代码已经公开，以便透明度和进一步开发。

First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models

paper_url: http://arxiv.org/abs/2311.05020
repo_url: None
paper_authors: Naomi Saphra, Eve Fleisig, Kyunghyun Cho, Adam Lopez
For: The paper aims to provide guidance for NLP researchers in the wake of the success of ChatGPT and other large language models (LLMs), and to identify areas where NLP researchers can continue to make meaningful contributions.* Methods: The paper takes a historical lens and looks back at the first era of LLMs, which began in 2005 with large $n$-gram models for machine translation, to identify durable lessons and evergreen problems in NLP research.* Results: The paper argues that disparities in scale are transient, that data is still a bottleneck for many meaningful applications, that meaningful evaluation informed by actual use is still an open problem, and that there is still room for speculative approaches in NLP research.Here’s the same information in Simplified Chinese:* For: 这篇论文目的是为NLP研究人员提供指导，并identify在LLMs成功后的NLP研究领域中可以继续做出意义性贡献的领域。* Methods: 论文通过历史镜像来看待2005年开始的大语言模型（LLMs）的第一个时期，以 IdentifyNLP研究中可以 preserved的教训和永恒的问题。* Results: 论文 argue that scale disparities are temporary, data is still a bottleneck for many meaningful applications, meaningful evaluation informed by actual use is still an open problem, and there is still room for speculative approaches in NLP research.

Abstract
Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large $n$-gram models for machine translation. We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. Among these lessons, we discuss the primacy of hardware advancement in shaping the availability and importance of scale, as well as the urgent challenge of quality evaluation, both automated and human. We argue that disparities in scale are transient and that researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many meaningful applications; that meaningful evaluation informed by actual use is still an open problem; and that there is still room for speculative approaches.

摘要
很多NLP研究人员正经历一场存在危机，这被激进的成功所触发，包括ChatGPT等基于大语言模型（LLM）的系统。在这种突然改变我们理解的领域后，我们可以从历史的视角来寻找指导。我们回顾到2005年开始的大$n$-gram模型 для机器翻译的第一个时期。我们从这个时期中提取了持久的教训，更重要的是，我们标识了在LLM上升起的领域中，NLP研究人员可以继续做出有意义的贡献。我们认为，规模的可用性和重要性受到硬件的提高影响，而不是数据的限制。此外，我们还认为，评估质量是一个急需解决的问题，包括自动化和人类的评估。我们 argue that scale disparities are transient and that researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many meaningful applications; that meaningful evaluation informed by actual use is still an open problem; and that there is still room for speculative approaches.

On the steerability of large language models toward data-driven personas

paper_url: http://arxiv.org/abs/2311.04978
repo_url: None
paper_authors: Junyi Li, Ninareh Mehrabi, Charith Peris, Palash Goyal, Kai-Wei Chang, Aram Galstyan, Richard Zemel, Rahul Gupta
for: 这paper的目的是为了使语言模型更好地适应不同的用户群体和个人，以提高模型的应用性。
methods: 这paper使用了一种基于协同维度的人物定义方法，通过将用户嵌入一个维度空间中，并将其分为基于问题的共谱的群体，以更好地理解不同的社会群体。此外，paper还提出了一种高效的人物引导模型，可以将用户的continue representation映射成虚拟的TOKEN序列，以便使语言模型生成与用户相关的响应。
results: compared to一系列的基准值，paper的引导模型表现出色，能够更好地适应不同的用户群体。

Abstract
The recent surge in Large Language Model (LLM) related applications has led to a concurrent escalation in expectations for LLMs to accommodate a myriad of personas and encompass a broad spectrum of perspectives. An important first step towards addressing this demand is to align language models with specific personas, be it groups of users or individuals. Towards this goal, we first present a new conceptualization of a persona. Moving beyond the traditional reliance on demographics like age, gender, or political party affiliation, we introduce a data-driven persona definition methodology built on collaborative-filtering. In this methodology, users are embedded into a continuous vector space based on their opinions and clustered into cohorts that manifest coherent views across specific inquiries. This methodology allows for a more nuanced understanding of different latent social groups present in the overall population (as opposed to simply using demographic groups) and enhances the applicability of model steerability. Finally, we present an efficient method to steer LLMs towards a particular persona. We learn a soft-prompting model to map the continuous representation of users into sequences of virtual tokens which, when prepended to the LLM input, enables the LLM to produce responses aligned with a given user. Our results show that our steerability algorithm is superior in performance compared to a collection of baselines.

摘要
Moving beyond traditional demographics like age, gender, or political affiliation, our methodology embeds users in a continuous vector space based on their opinions and clusters them into cohorts with coherent views. This approach provides a more nuanced understanding of latent social groups in the population and enhances the applicability of model steerability.To steer LLMs towards a particular persona, we learn a soft-prompting model that maps continuous user representations into sequences of virtual tokens. When prepended to the LLM input, these tokens enable the model to produce responses aligned with the target user. Our results show that our steerability algorithm outperforms a collection of baselines.

How Abstract Is Linguistic Generalization in Large Language Models? Experiments with Argument Structure

paper_url: http://arxiv.org/abs/2311.04900
repo_url: https://github.com/clay-lab/structural-alternations
paper_authors: Michael Wilson, Jackson Petty, Robert Frank
for: 这个论文主要研究了大型自然语言处理器（LLM）是否能够表征语言知识中的关系，尤其是对话结构中的关系。
methods: 研究使用了预训练的Transformer型大型自然语言处理器（LLM），并测试其能够在不同的语言上扩展 Novel noun argument 的分布。
results: 研究发现，LLM 在已经在预训练中看到的相关上下文中的扩展性很强，但是在没有seen during pre-training的相关上下文中，LLM 呈现出 linear order 的偏好，这表明当前模型具有限制。I hope this helps! Let me know if you have any other questions.

Abstract
Language models are typically evaluated on their success at predicting the distribution of specific words in specific contexts. Yet linguistic knowledge also encodes relationships between contexts, allowing inferences between word distributions. We investigate the degree to which pre-trained Transformer-based large language models (LLMs) represent such relationships, focusing on the domain of argument structure. We find that LLMs perform well in generalizing the distribution of a novel noun argument between related contexts that were seen during pre-training (e.g., the active object and passive subject of the verb spray), succeeding by making use of the semantically-organized structure of the embedding space for word embeddings. However, LLMs fail at generalizations between related contexts that have not been observed during pre-training, but which instantiate more abstract, but well-attested structural generalizations (e.g., between the active object and passive subject of an arbitrary verb). Instead, in this case, LLMs show a bias to generalize based on linear order. This finding points to a limitation with current models and points to a reason for which their training is data-intensive.s reported here are available at https://github.com/clay-lab/structural-alternations.

摘要
语言模型通常通过预训练的特定词语分布来评估其表现。然而，语言知识还包含词语分布之间的关系，允许推理这些分布之间的关系。我们调查了大型转换器基于语言模型（LLM）在推理结构之间的表现，专注于语法结构领域。我们发现，LLM在已经在预训练中看到的相关上下文中的新名动词分布总是能够通过使用词语嵌入空间的semantic结构来成功推理。然而，在预训练中未经见过的相关上下文中，LLM则表现出线性推理的偏好，而不是基于更抽象的结构推理。这一结论指出了当前模型的局限性，并且提出了更多的数据训练的必要性。相关的结果可以在https://github.com/clay-lab/structural-alternations上查看。

Future Lens: Anticipating Subsequent Tokens from a Single Hidden State

paper_url: http://arxiv.org/abs/2311.04897
repo_url: https://github.com/KoyenaPal/future-lens
paper_authors: Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace, David Bau
for: 这个论文目的是检验Transformer模型中每个输入token的隐藏状态 Vector是否含有可预测未来token的信息。
methods: 这篇论文使用了线性预测和 causal intervention 方法来评估Transformer模型中每个隐藏状态 Vector 是否含有可预测未来token的信息。
results: 研究发现，在某些层次上，可以使用单个隐藏状态 Vector 预测后续token的准确率高达48%以上。此外，研究还提出了一种“未来镜”可视化方法，可以使用这些方法创建一个新的Transformer状态视图。

Abstract
We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position $t$ in an input, can we reliably anticipate the tokens that will appear at positions $\geq t + 2$? To test this, we measure linear approximation and causal intervention methods in GPT-J-6B to evaluate the degree to which individual hidden states in the network contain signal rich enough to predict future hidden states and, ultimately, token outputs. We find that, at some layers, we can approximate a model's output with more than 48% accuracy with respect to its prediction of subsequent tokens through a single hidden state. Finally we present a "Future Lens" visualization that uses these methods to create a new view of transformer states.

摘要
我们推测隐藏状态 вектор对单一输入 tokens 储存信息足够精确预测多个 tokens 之后。更加具体地说，在这篇研究中，我们询问：从一个隐藏（内部）表现单一 tokens 的位置 $t$ 中的隐藏状态，可以预测未来 tokens 的位置 $\geq t + 2$ 的精确性？为了试验这个问题，我们使用了线性推测和 causal intervention 方法在 GPT-J-6B 中评估隐藏状态是否含有可预测未来隐藏状态和 tokens 的信息。我们发现，在某些层次上，可以透过单一隐藏状态估计模型的输出精度高于 48%，从而证明隐藏状态内含有可预测未来隐藏状态和 tokens 的信息。最后，我们提出了一个“未来镜”可视化方法，用于创建一个新的类型Transformer 状态的视觉化。

Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs

paper_url: http://arxiv.org/abs/2311.04892
repo_url: https://github.com/allenai/persona-bias
paper_authors: Shashank Gupta, Vaishnavi Shrivastava, Ameet Deshpande, Ashwin Kalyan, Peter Clark, Ashish Sabharwal, Tushar Khot
for: This paper aims to study the unintended side effects of assigning personas to large-scale language models (LLMs) and how it affects their ability to perform basic reasoning tasks.
methods: The paper uses ChatGPT, a popular LLM, and experiments with 24 reasoning datasets and 16 diverse personas that span five socio-demographic groups: race, gender, religion, disability, and political affiliation.
results: The study finds that ChatGPT exhibits deep-rooted biases against various socio-demographic groups, resulting in a substantial drop in performance on reasoning tasks. The biases are ubiquitous, significant, and can be especially harmful for certain groups. Further analysis shows that these persona-induced errors can be hard to discern and avoid.

Abstract
Recent works have showcased the ability of large-scale language models (LLMs) to embody diverse personas in their responses, exemplified by prompts like 'You are Yoda. Explain the Theory of Relativity.' While this ability allows personalization of LLMs and enables human behavior simulation, its effect on LLMs' capabilities remain unclear. To fill this gap, we present the first extensive study of the unintended side-effects of persona assignment on the ability of LLMs, specifically ChatGPT, to perform basic reasoning tasks. Our study covers 24 reasoning datasets and 16 diverse personas spanning 5 socio-demographic groups: race, gender, religion, disability, and political affiliation. Our experiments unveil that ChatGPT carries deep rooted bias against various socio-demographics underneath a veneer of fairness. While it overtly rejects stereotypes when explicitly asked ('Are Black people less skilled at mathematics?'), it manifests stereotypical and often erroneous presumptions when prompted to answer questions while taking on a persona. These can be observed as abstentions in the model responses, e.g., 'As a Black person, I am unable to answer this question as it requires math knowledge', and generally result in a substantial drop in performance on reasoning tasks. We find that this inherent deep bias is ubiquitous - 80% of our personas demonstrated bias; it is significant - certain datasets had relative drops in performance of 70%+; and can be especially harmful for certain groups - certain personas had stat. sign. drops on more than 80% of the datasets. Further analysis shows that these persona-induced errors can be hard-to-discern and hard-to-avoid. Our findings serve as a cautionary tale that the practice of assigning personas to LLMs - a trend on the rise - can surface their deep-rooted biases and have unforeseeable and detrimental side-effects.

摘要
近期研究表明大型语言模型（LLM）可以体现多种人格，如在提示中表达“你是叶大卫。解释相对论”的类型。这种能力允许个性化LLM并模拟人类行为，但其影响LLM的能力还不清楚。为了填补这一空白，我们首次进行了大规模的LLM persona分配对基本逻辑能力的影响研究。我们的研究覆盖24个逻辑数据集和16种多样化的人格，涵盖了种族、 gender、宗教、残疾和政治性向。我们的实验发现，ChatGPT潜藏着各种社会demographic的偏见，尽管明确表达反对极限化（如“黑人不会干 mathematics”），但在具体的人格提示下，模型仍然表现出偏见和错误的假设。这些假设可以通过模型的回答中的缺失或违反预期的结果来识别，例如“作为黑人，我无法回答这个问题，因为它需要数学知识”。这些假设通常会导致模型在逻辑任务中表现出明显的下降。我们发现这种深层偏见是普遍的——80%的人格表现了偏见，它是重要的——某些数据集的相对下降率超过70%，并且可能对某些群体造成特别的危害——某些人格在超过80%的数据集上表现了 statistically significant drop。进一步的分析表明，这种人格塑造引起的错误可以很难以识别和避免。我们的发现作为警告，将 persona 分配给 LLM 是一种在升级的趋势，可能会暴露其深层偏见并导致不可预期的和有害的后果。

Profiling Irony & Stereotype: Exploring Sentiment, Topic, and Lexical Features

paper_url: http://arxiv.org/abs/2311.04885
repo_url: None
paper_authors: Tibor L. R. Krols, Marie Mortensen, Ninell Oldenburg
for: 这个研究旨在创建一个推断Twitter用户的讲话方式 Irony detection系统。
methods: 该研究使用了TF-IDF和主题模型，并从lexical feature, sentiment feature和对比方面进行了仔细的特征选择。
results: 模型达到了0.84的F1分数，比基准值高。 lexical features,特别是TF-IDF，对模型的性能做出了最大贡献，而sentiment和主题模型的特征则较少对模型的性能做出贡献。

Abstract
Social media has become a very popular source of information. With this popularity comes an interest in systems that can classify the information produced. This study tries to create such a system detecting irony in Twitter users. Recent work emphasize the importance of lexical features, sentiment features and the contrast herein along with TF-IDF and topic models. Based on a thorough feature selection process, the resulting model contains specific sub-features from these areas. Our model reaches an F1-score of 0.84, which is above the baseline. We find that lexical features, especially TF-IDF, contribute the most to our models while sentiment and topic modeling features contribute less to overall performance. Lastly, we highlight multiple interesting and important paths for further exploration.

摘要
社交媒体已成为许多人的信息来源。这种受欢迎性导致了对信息生成系统的兴趣。这项研究尝试创建一个检测推特用户的讲话风格的系统。最近的研究强调 lexical 特征、情感特征和对比特征的重要性，同时还包括 TF-IDF 和话题模型。经过仔细的特征选择过程，我们的模型包含特定的子特征。我们的模型达到了 0.84 的 F1 分数，超过了基准值。我们发现，TF-IDF 特征是模型中最重要的特征，而情感和话题模型特征对总性表现较少。最后，我们提出了多个有趣和重要的可能性的探索。Note: Please keep in mind that the translation is done by a machine and may not be perfect. If you need a more accurate translation, you may want to consider hiring a professional translator.

Hierarchically Gated Recurrent Neural Network for Sequence Modeling

paper_url: http://arxiv.org/abs/2311.04823
repo_url: https://github.com/opennlplab/hgrn
paper_authors: Zhen Qin, Songlin Yang, Yiran Zhong
for: 这 paper 的目的是提出一种名为 Hierarchically Gated Recurrent Neural Network (HGRN) 的Linear RNN模型，以提高模型的效率和可靠性。
methods: 这 paper 使用了一种新的输出窗口 gates mechanism，并在每个层中添加了忘记门，以便模型更好地处理长期依赖关系。
results: 实验表明，HGRN 模型在语言模型、图像分类和长距离射频环境 benchmark 中表现出色，比 traditional RNN 模型更高效和可靠。

Abstract
Transformers have surpassed RNNs in popularity due to their superior abilities in parallel training and long-term dependency modeling. Recently, there has been a renewed interest in using linear RNNs for efficient sequence modeling. These linear RNNs often employ gating mechanisms in the output of the linear recurrence layer while ignoring the significance of using forget gates within the recurrence. In this paper, we propose a gated linear RNN model dubbed Hierarchically Gated Recurrent Neural Network (HGRN), which includes forget gates that are lower bounded by a learnable value. The lower bound increases monotonically when moving up layers. This allows the upper layers to model long-term dependencies and the lower layers to model more local, short-term dependencies. Experiments on language modeling, image classification, and long-range arena benchmarks showcase the efficiency and effectiveness of our proposed model. The source code is available at https://github.com/OpenNLPLab/HGRN.

摘要
带有精度的转换器已经超越了RNNs的流行程度，这主要归功于其在并行训练和长期依赖关系模型方面的优势。然而，在最近，使用线性RNNs进行高效序列模型的兴趣在再次升温。这些线性RNNs通常使用输出 Linear Recurrence 层的门控机制，而忽略了使用忘记门的重要性。在这篇论文中，我们提出了一种名为层次阈值 gates Recurrent Neural Network（HGRN）的模型，该模型包括一个可学习的下界值。这个下界值在层次上增长 monotonic 地。这使得上层模型可以模型长期依赖关系，而下层模型可以模型更本地、短期依赖关系。我们在语言模型、图像分类和长距离场景中进行了实验，并证明了我们提出的模型的效果和效率。代码可以在上获取。

Determination of toxic comments and unintended model bias minimization using Deep learning approach

paper_url: http://arxiv.org/abs/2311.04789
repo_url: None
paper_authors: Md Azim Khan
for: 这个研究的目的是检测恶意评论并减少基于身份特征（如性别、种族、宗教）的无意义偏见。
methods: 该研究使用了一种名为BERT（双向encoder表示Transformers）的注意力模型，并应用了权重损失来解决不均衡数据的问题。
results: 相比Logistic Regression模型（使用TFIDF vectorizer）的57.1%准确率， fine-tuned BERT模型的准确率达89%。I hope this helps! Let me know if you have any other questions.

Abstract
Online conversations can be toxic and subjected to threats, abuse, or harassment. To identify toxic text comments, several deep learning and machine learning models have been proposed throughout the years. However, recent studies demonstrate that because of the imbalances in the training data, some models are more likely to show unintended biases including gender bias and identity bias. In this research, our aim is to detect toxic comment and reduce the unintended bias concerning identity features such as race, gender, sex, religion by fine-tuning an attention based model called BERT(Bidirectional Encoder Representation from Transformers). We apply weighted loss to address the issue of unbalanced data and compare the performance of a fine-tuned BERT model with a traditional Logistic Regression model in terms of classification and bias minimization. The Logistic Regression model with the TFIDF vectorizer achieve 57.1% accuracy, and fine-tuned BERT model's accuracy is 89%. Code is available at https://github.com/zim10/Determine_Toxic_comment_and_identity_bias.git

摘要
在线对话可能会出现攻击性、违纪行为或骚扰行为。为了识别恶意评论，多种深度学习和机器学习模型已经被提出过多年。然而，最近的研究表明，由于训练数据的不均衡，一些模型可能会表现出不 INTENDED 的偏见，包括性别偏见和人类偏见。在这项研究中，我们的目标是检测恶意评论并减少基于人类特征（如种族、性别、信仰）的偏见。我们使用权重损失来解决不均衡数据的问题，并与传统的Logistic Regression模型进行比较，以确定最佳的分类和偏见减少策略。Logistic Regression模型与TF-IDF vectorizer实现了57.1%的准确率，而微调BERT模型的准确率为89%。代码可以在https://github.com/zim10/Determine_Toxic_comment_and_identity_bias.git中找到。

Using large language models to study human memory for meaningful narratives

paper_url: http://arxiv.org/abs/2311.04742
repo_url: https://github.com/mkatkov/llm-narrative-analysis
paper_authors: Antonios Georgiou Tankut Can, Mikhail Katkov, Misha Tsodyks
for: 这 paper 的目的是研究人类记忆的含义ful material。
methods: 这 paper 使用了语言模型作为科学工具，设计了大规模的记忆实验，并分析了获得的结果。
results: 研究发现，记忆性的表现与narritive length成直线关系，而且，在使用混乱版本的故事时，记忆性减退了许多，但是认知仍然 largely unaffected。这表明，记忆中的故事有一定的顺序排序，并且可能会通过 contextual reconstruction 来重建故事。

Abstract
One of the most impressive achievements of the AI revolution is the development of large language models that can generate meaningful text and respond to instructions in plain English with no additional training necessary. Here we show that language models can be used as a scientific instrument for studying human memory for meaningful material. We developed a pipeline for designing large scale memory experiments and analyzing the obtained results. We performed online memory experiments with a large number of participants and collected recognition and recall data for narratives of different lengths. We found that both recall and recognition performance scale linearly with narrative length. Furthermore, in order to investigate the role of narrative comprehension in memory, we repeated these experiments using scrambled versions of the presented stories. We found that even though recall performance declined significantly, recognition remained largely unaffected. Interestingly, recalls in this condition seem to follow the original narrative order rather than the scrambled presentation, pointing to a contextual reconstruction of the story in memory.

摘要
人工智能革命中一个最吸引人的成就是大语言模型的发展，可以生成有意义的文本并在普通英语中回答指令无需额外训练。在这里，我们示示了语言模型可以用作人memory的科学实验工具。我们开发了大规模记忆实验的管道并分析了获得的结果。我们在线上进行了大量参与者的记忆实验，收集了不同长度的narative的认知和回忆数据。我们发现， narative的长度与记忆和认知性能 Linearly correlated。此外，为了调查narative理解对记忆的作用，我们重复了这些实验，使用了扭曲版本的展示的故事。我们发现，尽管回忆性能明显下降，但recognition仍然几乎不受影响。此外，回忆中的顺序与原始故事顺序相似，表明内存中的故事是以Contextual重建的。

Evaluating Generative Ad Hoc Information Retrieval

paper_url: http://arxiv.org/abs/2311.04694
repo_url: None
paper_authors: Lukas Gienapp, Harrisen Scells, Niklas Deckers, Janek Bevendorff, Shuai Wang, Johannes Kiesel, Shahbaz Syed, Maik Fröbe, Guide Zucoon, Benno Stein, Matthias Hagen, Martin Potthast
for: This paper aims to provide a foundation and new insights for the evaluation of generative ad hoc retrieval systems.
methods: The paper surveys the relevant information retrieval and natural language processing literature, identifies search tasks and system architectures in generative retrieval, and develops a corresponding user model.
results: The paper provides a theoretical analysis of generative ad hoc retrieval systems and studies its operationalization.Here’s the Chinese translation of the three points:
for: 这篇论文目标是为Generative ad hoc retrieval系统评估提供基础和新想法。
methods: 论文对信息检索和自然语言处理领域的相关文献进行了检索，并在Generative retrieval系统中标识了搜寻任务和系统体系。
results: 论文提供了Generative ad hoc retrieval系统的理论分析，并对其操作化进行了研究。

Abstract
Recent advances in large language models have enabled the development of viable generative information retrieval systems. A generative retrieval system returns a grounded generated text in response to an information need instead of the traditional document ranking. Quantifying the utility of these types of responses is essential for evaluating generative retrieval systems. As the established evaluation methodology for ranking-based ad hoc retrieval may seem unsuitable for generative retrieval, new approaches for reliable, repeatable, and reproducible experimentation are required. In this paper, we survey the relevant information retrieval and natural language processing literature, identify search tasks and system architectures in generative retrieval, develop a corresponding user model, and study its operationalization. This theoretical analysis provides a foundation and new insights for the evaluation of generative ad hoc retrieval systems.

摘要
Translated into Simplified Chinese:最近的大语言模型技术发展已经使得生成式信息检索系统成为可能的。这些系统会根据信息需求返回一个固定的生成文本，而不是传统的文档排名。评估这类响应的用处是评估生成检索系统的重要组成部分。然而，已有的排名型随机检索评价方法可能不适用于生成检索，因此需要新的方法来确保可靠、重复和可重现的实验。本文通过审查相关的信息检索和自然语言处理文献，确定搜索任务和系统体系，开发用户模型，并研究其实现。这种理论分析提供了基础和新的视角，以便评估生成随机检索系统。

Speech language models lack important brain-relevant semantics

paper_url: http://arxiv.org/abs/2311.04664
repo_url: None
paper_authors: Subba Reddy Oota, Emin Çelik, Fatma Deniz, Mariya Toneva
for: 本研究旨在探索语言模型是如何预测大脑中的信息的。
methods: 研究人员使用了一种直接方法，即从语言模型表示中除去特定的低级刺激特征（文本、语音和视觉），然后观察这种干预如何影响脑电响应。
results: 研究发现，文本基于的语言模型可以很好地预测大脑中的语言处理活动，而不需要特定的刺激特征。相比之下，speech基于的语言模型在预测语音识别活动方面表现较差，并且可能需要进一步改进以更好地反映大脑中的语言处理。

Abstract
Despite known differences between reading and listening in the brain, recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity to an impressive degree. This poses the question of what types of information language models truly predict in the brain. We investigate this question via a direct approach, in which we eliminate information related to specific low-level stimulus features (textual, speech, and visual) in the language model representations, and observe how this intervention affects the alignment with fMRI brain recordings acquired while participants read versus listened to the same naturalistic stories. We further contrast our findings with speech-based language models, which would be expected to predict speech-evoked brain activity better, provided they model language processing in the brain well. Using our direct approach, we find that both text-based and speech-based language models align well with early sensory regions due to shared low-level features. Text-based models continue to align well with later language regions even after removing these features, while, surprisingly, speech-based models lose most of their alignment. These findings suggest that speech-based models can be further improved to better reflect brain-like language processing.

摘要
尽管已知阅读和听取在脑中的差异， latest work 显示，文本基于语言模型可以很准确地预测文本诱发和speech诱发的脑动态。这引发了问题，即语言模型在脑中预测哪些信息。我们通过直接方法来研究这个问题，即在语言模型表示中消除特定的低级别刺激特征（文本、语音和视觉）信息，然后观察这种干预对fMRI脑记录的影响。我们进一步与speech基于语言模型进行比较，这些模型应该更好地预测speech诱发的脑动态，如果它们正确地模型了脑中的语言处理。使用我们的直接方法，我们发现，文本基于语言模型在后期语言区域中继续保持良好的吻合，而speech基于语言模型则在消除特定的低级刺激特征后失去大部分的吻合。这些发现表示，speech基于语言模型可以进一步改进，以更好地反映脑中的语言处理。

Massive Editing for Large Language Models via Meta Learning

paper_url: http://arxiv.org/abs/2311.04661
repo_url: https://github.com/chenmientan/malmen
paper_authors: Chenmien Tan, Ge Zhang, Jie Fu
For: The paper aims to improve the ability of large language models (LLMs) to learn and retain knowledge over time, by proposing a new method called MAssive Language Model Editing Network (MALMEN).* Methods: The proposed method uses a hyper-network to generate parameter shift, and formulates the parameter shift aggregation as a least square problem. The method also separates the computation on the hyper-network and LM, allowing for arbitrary batch size on both networks.* Results: The proposed method is evaluated on several knowledge-intensive NLP tasks, including closed book fact-checking and question answering, and is shown to be capable of editing hundreds of times more facts than strong baselines with the same hyper-network architecture. The method also outperforms editor specifically designed for GPT.

Abstract
While large language models (LLMs) have enabled learning knowledge from the pre-training corpora, the acquired knowledge may be fundamentally incorrect or outdated over time, which necessitates rectifying the knowledge of the language model (LM) after the training. A promising approach involves employing a hyper-network to generate parameter shift, whereas existing hyper-networks suffer from inferior scalability in synchronous editing operation amount. To mitigate the problem, we propose the MAssive Language Model Editing Network (MALMEN), which formulates the parameter shift aggregation as the least square problem, subsequently updating the LM parameters using the normal equation. To accommodate editing multiple facts simultaneously with limited memory budgets, we separate the computation on the hyper-network and LM, enabling arbitrary batch size on both neural networks. Our method is evaluated by editing up to thousands of facts on LMs with different architectures, i.e., BERT-base, GPT-2, T5-XL (2.8B), and GPT-J (6B), across various knowledge-intensive NLP tasks, i.e., closed book fact-checking and question answering. Remarkably, MALMEN is capable of editing hundreds of times more facts than strong baselines with the identical hyper-network architecture and outperforms editor specifically designed for GPT. Our code is available at https://github.com/ChenmienTan/malmen.

摘要
large language models (LLMs) 可以从预训数据中学习知识，但取得的知识可能会随着时间的推移而变得不正确或过时，因此需要在训练后更正 language model (LM) 的知识。一种有前途的方法是使用 hyper-network 生成参数移动，但现有的 hyper-network 受到同步编译作业量的限制，导致Scalability问题。为了解决这个问题，我们提出了 MAssive Language Model Editing Network (MALMEN)，它将参数移动聚合形式化为最小二乘问题，然后使用normal equation更新 LM 参数。为了在有限内存预算下同时编译多个 факти，我们将 computation 在 hyper-network 和 LM 之间分开，允许任意批次大小在两个神经网络上。我们的方法在不同的 LM 架构（BERT-base、GPT-2、T5-XL (2.8B) 和 GPT-J (6B)）和不同的知识密集 NLP 任务（closed book fact-checking 和 question answering）中进行评估，结果显示 MALMEN 可以编译到千个 факти以上，比强基eline 高效，并且超过特别设计 для GPT 的编译器。我们的代码可以在 https://github.com/ChenmienTan/malmen 上找到。

Investigating the Nature of Disagreements on Mid-Scale Ratings: A Case Study on the Abstractness-Concreteness Continuum

paper_url: http://arxiv.org/abs/2311.04563
repo_url: None
paper_authors: Urban Knupleš, Diego Frassinelli, Sabine Schulte im Walde
for: 这个研究旨在检查语言评分的可靠性，具体来说是检查人们对中等程度的单词评分是否存在差异。
methods: 该研究使用了 corrleation 和批处理等方法来检测中等程度单词的特征，并通过团 clustering 方法来发现评分者之间的系统性不一致。
results: 研究发现，对中等程度单词进行细化或过滤可以提高语言评分的可靠性。

Abstract
Humans tend to strongly agree on ratings on a scale for extreme cases (e.g., a CAT is judged as very concrete), but judgements on mid-scale words exhibit more disagreement. Yet, collected rating norms are heavily exploited across disciplines. Our study focuses on concreteness ratings and (i) implements correlations and supervised classification to identify salient multi-modal characteristics of mid-scale words, and (ii) applies a hard clustering to identify patterns of systematic disagreement across raters. Our results suggest to either fine-tune or filter mid-scale target words before utilising them.

摘要
人们通常对极端情况的评分 exhibit 强烈的一致性（例如，一只猫被评为非常具体），但对中等级词的评分存在更多的不一致。然而，收集的评分标准被广泛运用于不同领域。我们的研究将重点在具体性评分中，并（i）利用相关性和指导 классификация来特征化中等级词的多种模态特征，以及（ii）通过坚定分类来揭示评分人员之间的系统性不一致。我们的结果表明，在使用中等级目标词之前应该进行细化或过滤。

Assessing Distractors in Multiple-Choice Tests

paper_url: http://arxiv.org/abs/2311.04554
repo_url: None
paper_authors: Vatsal Raina, Adian Liusie, Mark Gales
for: 这篇论文目的是为了提出自动评估多选题考试中的各种选项质量的方法。
methods: 这篇论文使用了自动评估方法，包括类фика器模型和embedding-based equivalence metric，来评估多选题考试中的选项质量。
results: 论文提出的自动评估方法可以准确地评估多选题考试中的选项质量，并且可以提高考试评估的准确性和公正性。

Abstract
Multiple-choice tests are a common approach for assessing candidates' comprehension skills. Standard multiple-choice reading comprehension exams require candidates to select the correct answer option from a discrete set based on a question in relation to a contextual passage. For appropriate assessment, the distractor answer options must by definition be incorrect but plausible and diverse. However, generating good quality distractors satisfying these criteria is a challenging task for content creators. We propose automated assessment metrics for the quality of distractors in multiple-choice reading comprehension tests. Specifically, we define quality in terms of the incorrectness, plausibility and diversity of the distractor options. We assess incorrectness using the classification ability of a binary multiple-choice reading comprehension system. Plausibility is assessed by considering the distractor confidence - the probability mass associated with the distractor options for a standard multi-class multiple-choice reading comprehension system. Diversity is assessed by pairwise comparison of an embedding-based equivalence metric between the distractors of a question. To further validate the plausibility metric we compare against candidate distributions over multiple-choice questions and agreement with a ChatGPT model's interpretation of distractor plausibility and diversity.

摘要
Incorrectness is assessed using the classification ability of a binary multiple-choice reading comprehension system. Plausibility is evaluated by considering the distractor confidence, or the probability mass associated with the distractor options for a standard multi-class multiple-choice reading comprehension system. Diversity is assessed by comparing the distractors using an embedding-based equivalence metric.To further validate the plausibility metric, we compare it against candidate distributions over multiple-choice questions and agreement with a ChatGPT model's interpretation of distractor plausibility and diversity.

Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures

paper_url: http://arxiv.org/abs/2311.04547
repo_url: None
paper_authors: Julius Steuer, Marius Mosbach, Dietrich Klakow
for: 本研究旨在探讨语言模型（LM）的认知可能性，并 investigate 如何使LM更加符合人类语言处理的规则和约束。
methods: 研究者使用了多种GPT-like语言模型，不同大小和深度，并对这些模型进行了训练和评估，以确定它们在不同任务中的表现。
results: 研究发现，LM的大小和表现之间存在正相关关系，其中模型宽度和深度在不同任务中有不同的偏好。此外，研究还发现，LM的大小和语言处理时间之间存在负相关关系，表明模型需要采用不同的方法来模型语言处理的负面效果。

Abstract
Research on the cognitive plausibility of language models (LMs) has so far mostly concentrated on modelling psycholinguistic response variables such as reading times, gaze durations and N400/P600 EEG signals, while mostly leaving out the dimension of what Mahowald et al. (2023) described as formal and functional linguistic competence, and developmental plausibility. We address this gap by training a series of GPT-like language models of different sizes on the strict version of the BabyLM pretraining corpus, evaluating on the challenge tasks (BLiMP, GLUE, MSGS) and an additional reading time prediction task. We find a positive correlation between LM size and performance on all three challenge tasks, with different preferences for model width and depth in each of the tasks. In contrast, a negative correlation was found between LM size and reading time fit of linear mixed-effects models using LM surprisal as a predictor, with the second-smallest LM achieving the largest log-likelihood reduction over a baseline model without surprisal. This suggests that modelling processing effort and linguistic competence may require an approach different from training GPT-like LMs on a developmentally plausible corpus.

摘要
Translation notes:* "psycholinguistic" is translated as "语言心理学的" (yǔ yán xīn lǐ xué de)* "response variables" is translated as "响应变量" (fāng biàn biàn xiàng)* "reading times" is translated as "阅读时间" (dòng dú shí jiān)* "gaze durations" is translated as "视线时间" (shì jian shí jiān)* "N400/P600 EEG signals" is translated as "N400/P600 EEG信号" (N400/P600 EEG xìn xiàng)* "challenge tasks" is translated as "挑战任务" (tiǎo zhàn zhì gōng)* "BLiMP" is translated as "BLiMP" (B López-Ibáñez et al., 2023)* "GLUE" is translated as "GLUE" (Wang et al., 2019)* "MSGS" is translated as "MSGS" (Zhang et al., 2020)* "linear mixed-effects models" is translated as "线性混合效应模型" (xiàng xìng hù he yìng xiàng mó del)* "LM surprisal" is translated as "LM难易度" (LM nán yì du)* "baseline model without surprisal" is translated as "无难易度基线模型" (wú nán yì du jī liào mó del)

Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token Based ASR

paper_url: http://arxiv.org/abs/2311.04534
repo_url: None
paper_authors: Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Yukun Ma, Hai Yu, Jiaqing Liu, Chong Zhang
for: 这个论文主要针对的是基于普通话音频识别的单语言模型（SpeechGPT、VioLA、AudioPaLM等）的识别性能提高。
methods: 这些模型将连续的语音信号转换成精度的字符串（speech discretization），然后将语音和文本的字符串合并到共同词汇中。然后，它们使用单个decoder-only transformer进行训练，并使用loss masking进行ASR任务的训练。
results: 我们发现，使用传统的cross-entropy损失函数不一定能够提高ASR性能，而是可以使用smoothed label distillation（SLD）方法，该方法引入KL散度损失函数以有效地模型语音字符串。实验结果表明，我们的SLD方法可以超越loss masking，并在不同的speech discretization方法下提高ASR性能。

Abstract
Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on speech tasks. These models convert continuous speech signals into discrete tokens (speech discretization) and merge text and speech tokens into a shared vocabulary. Then they train a single decoder-only Transformer on a mixture of speech tasks. Specifically, all these models utilize Loss Masking on the input speech tokens for the ASR task, which means that these models do not explicitly model the dependency between the speech tokens. In this paper, we attempt to model the sequence of speech tokens in an autoregressive manner like text. However, we find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over Loss Masking. Therefore, we propose a novel approach denoted Smoothed Label Distillation (SLD), which introduces a KL divergence loss with smoothed labels on the input speech tokens to effectively model speech tokens. Experiments demonstrate that our SLD approach alleviates the limitations of the cross-entropy loss and consistently outperforms Loss Masking for decoder-only Transformer based ASR using different speech discretization methods.

摘要
最近，一些协调的语音-文本模型，如SpeechGPT、VioLA和AudioPaLM，在语音任务上实现了出色的性能。这些模型将连续的语音信号转换成精确的字符（语音精度），然后将语音和文本字符合并到共同词汇中。然后它们训练了单个解码器只的Transformer模型在语音任务中。具体来说，这些模型都使用输入语音token的损失压缩（Loss Masking）来进行ASR任务，这意味着这些模型不直接模型语音token之间的依赖关系。在这篇论文中，我们尝试了模elling语音token的自然顺序，但我们发现在输入语音token上应用普通的十字积分损失不一定能提高ASR性能，而是提出了一种新的方法，即Smoothed Label Distillation（SLD）。SLD方法通过在输入语音token上添加一个KL散度损失来有效地模型语音token。实验结果表明，我们的SLD方法可以减轻十字积分损失的局限性，并一直超越Loss Masking在单个Transformer模型基于ASR任务中。

Conversation Understanding using Relational Temporal Graph Neural Networks with Auxiliary Cross-Modality Interaction

paper_url: http://arxiv.org/abs/2311.04507
repo_url: None
paper_authors: Cam-Van Thi Nguyen, Anh-Tuan Mai, The-Son Le, Hai-Dang Kieu, Duc-Trong Le
for: 本研究旨在提高对话理解中的情感识别率，具体来说是利用对话级别的跨Modal交互以及发言人的时间信息来预测每个句子的情感标签。
methods: 本研究提出了一种名为CORECT的神经网络框架，该框架利用对话级别的跨Modal交互和每个句子的时间信息，同时还利用不同Modal的特定表示方法来提高对话理解。
results: 经过广泛的实验，CORECT在IEMOCAP和CMU-MOSEI数据集上的多模态ERC任务得到了state-of-the-art的结果，证明了CORECT的有效性。

Abstract
Emotion recognition is a crucial task for human conversation understanding. It becomes more challenging with the notion of multimodal data, e.g., language, voice, and facial expressions. As a typical solution, the global- and the local context information are exploited to predict the emotional label for every single sentence, i.e., utterance, in the dialogue. Specifically, the global representation could be captured via modeling of cross-modal interactions at the conversation level. The local one is often inferred using the temporal information of speakers or emotional shifts, which neglects vital factors at the utterance level. Additionally, most existing approaches take fused features of multiple modalities in an unified input without leveraging modality-specific representations. Motivating from these problems, we propose the Relational Temporal Graph Neural Network with Auxiliary Cross-Modality Interaction (CORECT), an novel neural network framework that effectively captures conversation-level cross-modality interactions and utterance-level temporal dependencies with the modality-specific manner for conversation understanding. Extensive experiments demonstrate the effectiveness of CORECT via its state-of-the-art results on the IEMOCAP and CMU-MOSEI datasets for the multimodal ERC task.

摘要
情感识别是人工对话理解中的关键任务。在多模态数据下，这种任务变得更加挑战性，例如语言、声音和表情等。为解决这个问题，通常是利用对话级别的全局和本地上下文信息来预测每个句子的情感标签。特别是，全局表示可以通过对话级别的交互模型来捕捉到全局上下文信息。而本地上下文信息通常是通过说话者的时间信息或情感变化来得到，但是这些因素可能会忽略了句子级别的重要因素。此外，大多数现有的方法会将多Modalities的Feature进行混合，而不是利用特定的感知模式来进行处理。 inspirited by these problems, we propose a novel neural network framework called Relational Temporal Graph Neural Network with Auxiliary Cross-Modality Interaction (CORECT), which effectively captures conversation-level cross-modality interactions and utterance-level temporal dependencies with modality-specific manner for conversation understanding. Our extensive experiments demonstrate the effectiveness of CORECT via its state-of-the-art results on the IEMOCAP and CMU-MOSEI datasets for the multimodal ERC task.

Multi-label and Multi-target Sampling of Machine Annotation for Computational Stance Detection

paper_url: http://arxiv.org/abs/2311.04495
repo_url: https://github.com/seq-to-mind/Stance_MA
paper_authors: Zhengyuan Liu, Hai Leong Chieu, Nancy F. Chen
for: 本研究旨在探讨大语言模型是否可以取代人工标注 personnel 进行计算态度探测任务。
methods: 本研究使用了自动标注方法，并引入多标签多目标采样策略以提高标注质量。
results: 实验结果表明，我们的方法可以显著提高性能和学习效果。

Abstract
Data collection from manual labeling provides domain-specific and task-aligned supervision for data-driven approaches, and a critical mass of well-annotated resources is required to achieve reasonable performance in natural language processing tasks. However, manual annotations are often challenging to scale up in terms of time and budget, especially when domain knowledge, capturing subtle semantic features, and reasoning steps are needed. In this paper, we investigate the efficacy of leveraging large language models on automated labeling for computational stance detection. We empirically observe that while large language models show strong potential as an alternative to human annotators, their sensitivity to task-specific instructions and their intrinsic biases pose intriguing yet unique challenges in machine annotation. We introduce a multi-label and multi-target sampling strategy to optimize the annotation quality. Experimental results on the benchmark stance detection corpora show that our method can significantly improve performance and learning efficacy.

摘要
<> translate into Simplified Chinese文本收集自手动标注提供域 especific和任务aligned的监督，以实现自然语言处理任务中的合理性。然而，手动标注通常具有时间和预算上的挑战，特别是当需要域知识、捕捉微妙Semantic特征和推理步骤时。在这篇论文中，我们investigate大型自然语言模型的可用性，以便用于计算意见探测。我们发现，虽然大型自然语言模型显示出人工标注员的潜在 substitute，但是它们对任务特定的指令和自身偏见带来了有趣但唯一的挑战。我们提出了一种多标签多目标采样策略，以提高标注质量。实验结果表明，我们的方法可以显著提高性能和学习效果。Note: The translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you need the translation in Traditional Chinese, please let me know.

CLearViD: Curriculum Learning for Video Description

paper_url: http://arxiv.org/abs/2311.04480
repo_url: https://github.com/yueyue0401/CLV
paper_authors: Cheng-Yu Chuang, Pooyan Fazli
For: The paper is written for generating coherent natural language sentences that narrate the content of a given video.* Methods: The paper proposes a transformer-based model called CLearViD, which leverages curriculum learning and the Mish activation function to accomplish video description generation. The model is trained using two curriculum strategies: progressively exposing the model to more challenging samples and gradually reducing the capacity of the network through dropout.* Results: The paper demonstrates the effectiveness of the proposed model through extensive experiments and ablation studies. The results show that CLearViD significantly outperforms existing state-of-the-art models in terms of both accuracy and diversity metrics on two datasets, namely ActivityNet Captions and YouCook2.

Abstract
Video description entails automatically generating coherent natural language sentences that narrate the content of a given video. We introduce CLearViD, a transformer-based model for video description generation that leverages curriculum learning to accomplish this task. In particular, we investigate two curriculum strategies: (1) progressively exposing the model to more challenging samples by gradually applying a Gaussian noise to the video data, and (2) gradually reducing the capacity of the network through dropout during the training process. These methods enable the model to learn more robust and generalizable features. Moreover, CLearViD leverages the Mish activation function, which provides non-linearity and non-monotonicity and helps alleviate the issue of vanishing gradients. Our extensive experiments and ablation studies demonstrate the effectiveness of the proposed model. The results on two datasets, namely ActivityNet Captions and YouCook2, show that CLearViD significantly outperforms existing state-of-the-art models in terms of both accuracy and diversity metrics.

摘要
视频描述文本生成模型CLearViD，利用变换器来自动生成 coherent的自然语言句子，描述视频内容。我们 investigate了两种课程学策略：（1）通过逐渐应用 Gaussian 噪声来提高模型对更复杂样本的抗衰假设能力，以及（2）在训练过程中逐渐减少网络的容量。这两种方法使得模型学习更加稳健和泛化。此外，CLearViD 还使用 Mish 激活函数，该函数提供了非线性和非准确性，帮助解决梯度消失问题。我们的广泛的实验和剥夺研究表明，提案的模型具有显著的效果。对 ActivityNet Captions 和 YouCook2 两个数据集进行了比较，CLearViD 与现有状态机的模型相比，在准确性和多样性指标上具有显著优势。

Twitter Sentiment Analysis of Covid Vacciness

paper_url: http://arxiv.org/abs/2311.04479
repo_url: None
paper_authors: Wenbo Zhu, Tiechuan Hu
for: 这些研究者想要使用Twitter上的 opinon sorting和ranking算法，以便更好地理解用户对 COVID-19 疫苗的看法，并帮助人们做出更加有信心的决策。
methods: 这些研究者使用了自然语言处理技术，包括 Sentiment Analysis 和 Topic Modeling，以分类和 categorize Twitter上的 opinon。
results: 这些研究者通过使用这些算法，成功地分类和 categorize Twitter上的 opinon，并且可以准确地了解用户对 COVID-19 疫苗的看法。

Abstract
In this paper, we look at a database of tweets sorted by various keywords that could indicate the users sentiment towards covid vaccines. With social media becoming such a prevalent source of opinion, sorting and ranking tweets that hold important information such as opinions on covid vaccines is of utmost importance. Two different ranking scales were used, and ranking a tweet in this way could represent the difference between an opinion being lost and an opinion being featured on the site, which affects the decisions and behavior of people, and why researchers were interested in it. Using natural language processing techniques, our aim is to determine and categorize opinions about covid vaccines with the highest accuracy possible.

摘要
在这篇论文中，我们分析了一个推特数据库，按照不同的关键词分类用户对covid疫苗的看法。随着社交媒体在意见形成中的重要地位，对于推特上的意见分类和排名是非常重要的。我们使用了两种不同的排名级别，排名一句话可能代表了意见被丢弃或被站点特有的，这会影响人们的决策和行为，因此研究人员对其非常感兴趣。使用自然语言处理技术，我们的目标是尽可能准确地确定和分类对covid疫苗的看法。

Lewis’s Signaling Game as beta-VAE For Natural Word Lengths and Segments

paper_url: http://arxiv.org/abs/2311.04453
repo_url: None
paper_authors: Ryo Ueda, Tadahiro Taniguchi
for: 这个论文的目的是研究emergent communication（EC）中的信号游戏，以及如何使用beta-VAE对其目标函数进行修改，以便更好地控制emergent languages的统计特性。
methods: 这篇论文使用了beta-VAE和ELBO来重新解释Lewis的信号游戏，并修改了其目标函数。
results: 实验表明，通过选择合适的先验分布，emergent languages可以更加接近自然语言的统计特性，包括Zipf的压缩法（ZLA）和Harris的词法分析（HAS）。

Abstract
As a sub-discipline of evolutionary and computational linguistics, emergent communication (EC) studies communication protocols, called emergent languages, arising in simulations where agents communicate. A key goal of EC is to give rise to languages that share statistical properties with natural languages. In this paper, we reinterpret Lewis's signaling game, a frequently used setting in EC, as beta-VAE and reformulate its objective function as ELBO. Consequently, we clarify the existence of prior distributions of emergent languages and show that the choice of the priors can influence their statistical properties. Specifically, we address the properties of word lengths and segmentation, known as Zipf's law of abbreviation (ZLA) and Harris's articulation scheme (HAS), respectively. It has been reported that the emergent languages do not follow them when using the conventional objective. We experimentally demonstrate that by selecting an appropriate prior distribution, more natural segments emerge, while suggesting that the conventional one prevents the languages from following ZLA and HAS.

摘要
为一种语言演化和计算语言学的子领域， emergent communication（EC）研究在模拟中 Agent 通信时出现的通信协议，称为 emergent language。EC 的一个关键目标是使 emergent language 具有自然语言的统计性质。在这篇论文中，我们将 Lewis 的信号游戏，常用于 EC，重新解释为 beta-VAE 并修改其目标函数为 ELBO。因此，我们可以清楚地说明 emergent language 的先前分布的存在和这些先前分布的选择可以影响其统计性质。特别是，我们研究 word length 和分 segmentation 的属性，即Zipf 法则简短化（ZLA）和Harris 词法分析（HAS）。报告表明，使用 conventional 目标函数时，emergent language 不会遵循 ZLA 和 HAS。我们通过选择合适的先前分布，在实验中证明可以生成更自然的分 segmentation，并建议 conventional 目标函数可以阻碍 language 遵循 ZLA 和 HAS。

Recursion in Recursion: Two-Level Nested Recursion for Length Generalization with Scalability

paper_url: http://arxiv.org/abs/2311.04449
repo_url: https://github.com/jrc1995/beamrecursionfamily
paper_authors: Jishnu Ray Chowdhury, Cornelia Caragea
for: 这 paper 是为了研究一种新的树状模型，它可以同时具备 BBT-RvNNs 的计算效率和 RvNNs 的结构敏感性。
methods: 这 paper 使用了一种名为 Recursion in Recursion (RIR) 的新框架，它使用两级嵌套循环，外层循环是一个 $k$-ary 平衡树模型，而内层循环是一个 Beam Tree RvNN。此外， authors 还提出了一种新的扩散策略，即 beam alignment，以调整 BT-RvNN 的性能。
results: 这 paper 的最佳模型可以在 ListOps 上实现高（大于 90%）的长度泛化性能，同时在 LRA 语言任务上保持与 Structured State Space Models (SSMs) 的竞争性。此外， authors 还证明了 RIR 可以在 LRA 语言任务上比 Transformers 更高的精度。

Abstract
Binary Balanced Tree RvNNs (BBT-RvNNs) enforce sequence composition according to a preset balanced binary tree structure. Thus, their non-linear recursion depth is just $\log_2 n$ ($n$ being the sequence length). Such logarithmic scaling makes BBT-RvNNs efficient and scalable on long sequence tasks such as Long Range Arena (LRA). However, such computational efficiency comes at a cost because BBT-RvNNs cannot solve simple arithmetic tasks like ListOps. On the flip side, RvNNs (e.g., Beam Tree RvNN) that do succeed on ListOps (and other structure-sensitive tasks like formal logical inference) are generally several times more expensive than even RNNs. In this paper, we introduce a novel framework -- Recursion in Recursion (RIR) to strike a balance between the two sides - getting some of the benefits from both worlds. In RIR, we use a form of two-level nested recursion - where the outer recursion is a $k$-ary balanced tree model with another recursive model (inner recursion) implementing its cell function. For the inner recursion, we choose Beam Tree RvNNs (BT-RvNN). To adjust BT-RvNNs within RIR we also propose a novel strategy of beam alignment. Overall, this entails that the total recursive depth in RIR is upper-bounded by $k \log_k n$. Our best RIR-based model is the first model that demonstrates high ($\geq 90\%$) length-generalization performance on ListOps while at the same time being scalable enough to be trainable on long sequence inputs from LRA. Moreover, in terms of accuracy in the LRA language tasks, it performs competitively with Structured State Space Models (SSMs) without any special initialization - outperforming Transformers by a large margin. On the other hand, while SSMs can marginally outperform RIR on LRA, they (SSMs) fail to length-generalize on ListOps. Our code is available at: \url{https://github.com/JRC1995/BeamRecursionFamily/}.

摘要
Binary 平衡树RvNNs (BBT-RvNNs) enforces 序列组合 according to a preset 平衡的二进制树结构。因此，它们的非线性循环深度只是 log2(n)（n 是序列长度）。这种对数循环深度使 BBT-RvNNs 高效和可扩展于长序列任务，如 Long Range Arena (LRA)。然而，这种计算效率来自于 BBT-RvNNs 无法解决简单的数学任务，如 ListOps。相反，使用 RvNNs (例如 Beam Tree RvNN) 可以在 ListOps 和其他结构敏感任务中取得更高的成功率，但这些模型通常比 RNNs 更加昂贵。在这篇论文中，我们介绍了一种新的框架--- Recursion in Recursion (RIR)，以达到这两个方面之间的平衡。在 RIR 中，我们使用一种 $k$-ary 平衡树模型，其中另一个嵌入的回归模型（内嵌回归）实现其细胞函数。为内嵌回归，我们选择 Beam Tree RvNNs (BT-RvNN)。为了调整 BT-RvNNs 在 RIR 中，我们也提出了一种新的抽象策略---排Alignment。总的来说，RIR 的总回归深度 upper-bounded 为 $k \log_k n$。我们的最佳 RIR-based 模型可以在 ListOps 上达到 length-generalization 性能 higher than 90% ，同时可以在 Long Range Arena 中训练长序列输入。此外，在 LRA 语言任务上，它的准确率与 Structured State Space Models (SSMs) 相当，而不需要特殊的初始化。相比之下，SSMs 可以 marginally 在 LRA 上超越 RIR，但它们无法 length-generalize 在 ListOps。我们的代码可以在以下链接中找到： \url{https://github.com/JRC1995/BeamRecursionFamily/}.

2023-11-08

cs.LG

cs.LG - 2023-11-08

Efficient Compression of Overparameterized Deep Models through Low-Dimensional Learning Dynamics

paper_url: http://arxiv.org/abs/2311.05061
repo_url: https://github.com/soominkwon/comp-deep-nets
paper_authors: Soo Min Kwon, Zekai Zhang, Dogyoon Song, Laura Balzano, Qing Qu
for: 本研究旨在降低深度学习模型的计算复杂性，通过研究深度网络的学习动态来减少网络的维度。
methods: 本研究使用了深度线性模型来研究深度网络的学习动态，并发现了深度网络的Weight矩阵具有低维结构。基于这一发现，我们提出了一种减少深度网络的方法，通过减少网络的宽度来减少计算复杂性。
results: 我们的实验表明，使用我们的减少方法可以加速深度网络的训练过程，而不会妥协模型质量。减少后的网络可以在所有的梯度下降迭代中更快 converges，并且可以在不同的初始化情况下获得更好的性能。

Abstract
Overparameterized models have proven to be powerful tools for solving various machine learning tasks. However, overparameterization often leads to a substantial increase in computational and memory costs, which in turn requires extensive resources to train. In this work, we aim to reduce this complexity by studying the learning dynamics of overparameterized deep networks. By extensively studying its learning dynamics, we unveil that the weight matrices of various architectures exhibit a low-dimensional structure. This finding implies that we can compress the networks by reducing the training to a small subspace. We take a step in developing a principled approach for compressing deep networks by studying deep linear models. We demonstrate that the principal components of deep linear models are fitted incrementally but within a small subspace, and use these insights to compress deep linear networks by decreasing the width of its intermediate layers. Remarkably, we observe that with a particular choice of initialization, the compressed network converges faster than the original network, consistently yielding smaller recovery errors throughout all iterations of gradient descent. We substantiate this observation by developing a theory focused on the deep matrix factorization problem, and by conducting empirical evaluations on deep matrix sensing. Finally, we demonstrate how our compressed model can enhance the utility of deep nonlinear models. Overall, we observe that our compression technique accelerates the training process by more than 2x, without compromising model quality.

摘要
具有过参数化模型已经证明是解决不同机器学习任务的有力工具。然而，过参数化通常会导致计算和内存成本增加很多，需要很大的资源来训练。在这项工作中，我们希望通过研究深度网络学习动态来减少这种复杂性。我们发现了深度网络的Weight矩阵在不同架构中具有低维度结构，这意味着可以通过减少训练的维度来压缩网络。我们开发了一种原则的方法来压缩深度网络，通过研究深度线性模型。我们发现，深度线性模型的主成分可以在一个小空间中逐步 fitted，并且可以通过减少深度网络中间层的宽度来压缩网络。Remarkably，我们发现，使用特定的初始化方式，压缩后的网络在每一次梯度下降迭代中更快 converges，并且一直在所有迭代中保持小于原网络的恢复误差。我们证明了这一观察，通过关注深度矩阵分解问题，并通过实际的测试进行深度矩阵感知。最后，我们示出了我们压缩模型可以提高深度非线性模型的实用性。总的来说，我们发现，我们的压缩技术可以在训练过程中提高速度超过2倍，而不会妥协模型质量。

Quantum Generative Modeling of Sequential Data with Trainable Token Embedding

paper_url: http://arxiv.org/abs/2311.05050
repo_url: None
paper_authors: Wanda Hou, Li Miao, Yi-Zhuang You
for: 这个论文主要是为了探讨量子概率模型在学习古典和量子数据上的应用。
methods: 这个论文使用的方法是基于矩阵产品状态（MPS）框架的量子启发式生成模型，称为 Born 机器。这种模型支持可追踪的对数概率和自我回归抽样，并在不同的无监督学习任务中表现出色。
results: 这个论文的结果表明，通过同时适应 quantum measurement 操作和 MPS embedding，Born 机器可以更好地表现，并在数据中寻找更深层次的相关性。

Abstract
Generative models are a class of machine learning models that aim to learn the underlying probability distribution of data. Unlike discriminative models, generative models focus on capturing the data's inherent structure, allowing them to generate new samples that resemble the original data. To fully exploit the potential of modeling probability distributions using quantum physics, a quantum-inspired generative model known as the Born machines have shown great advancements in learning classical and quantum data over matrix product state(MPS) framework. The Born machines support tractable log-likelihood, autoregressive and mask sampling, and have shown outstanding performance in various unsupervised learning tasks. However, much of the current research has been centered on improving the expressive power of MPS, predominantly embedding each token directly by a corresponding tensor index. In this study, we generalize the embedding method into trainable quantum measurement operators that can be simultaneously honed with MPS. Our study indicated that combined with trainable embedding, Born machines can exhibit better performance and learn deeper correlations from the dataset.

摘要
<> traduction de texte en chinois simplifié模型生成是一类机器学习模型，旨在学习数据的下面概率分布。与描述性模型不同，生成模型关注数据的内在结构，因此可以生成新样本与原始数据类似。为了充分利用量子物理学模型概率分布的潜力，一种基于量子物理学的生成模型叫做生命机器（Born machines）在MPS框架下已经取得了很大的进步。生命机器支持可追踪的对数概率、自动回归和面积抽样，并在多种无监督学习任务中表现出色。然而，当前的大多数研究都集中在提高MPS的表达力，主要是将每个token直接嵌入相应的tensor index。在这项研究中，我们总结了 embedding方法的扩展，使得Born machines可以同时适应MPS。我们的研究表明，将 embedding方法与MPS结合使用，可以使生命机器表现更好，并且从数据中学习更深层的相关性。Note: "MPS" stands for "matrix product state", which is a type of quantum state used in quantum computing.

On the Consistency of Maximum Likelihood Estimation of Probabilistic Principal Component Analysis

paper_url: http://arxiv.org/abs/2311.05046
repo_url: None
paper_authors: Arghya Datta, Sayak Chakrabarty
for: 降低数据维度的统计工具PPCA的应用广泛，从科学与工程到数量金融。
methods: 使用 quotient topological spaces 方法，解决PPCA模型中的征化问题，并且Proof 了ML解方法的一致性。
results: ML解方法是consistent 的，并且可以在一个适当的quotient Euclidean space中进行强 consistency 的covariance estimation。

Abstract
Probabilistic principal component analysis (PPCA) is currently one of the most used statistical tools to reduce the ambient dimension of the data. From multidimensional scaling to the imputation of missing data, PPCA has a broad spectrum of applications ranging from science and engineering to quantitative finance. Despite this wide applicability in various fields, hardly any theoretical guarantees exist to justify the soundness of the maximal likelihood (ML) solution for this model. In fact, it is well known that the maximum likelihood estimation (MLE) can only recover the true model parameters up to a rotation. The main obstruction is posed by the inherent identifiability nature of the PPCA model resulting from the rotational symmetry of the parameterization. To resolve this ambiguity, we propose a novel approach using quotient topological spaces and in particular, we show that the maximum likelihood solution is consistent in an appropriate quotient Euclidean space. Furthermore, our consistency results encompass a more general class of estimators beyond the MLE. Strong consistency of the ML estimate and consequently strong covariance estimation of the PPCA model have also been established under a compactness assumption.

摘要
Simplified Chinese: probabistic principal component analysis (PPCA) 是目前最广泛使用的统计工具，用于缩小数据的环境维度。从多元缩放到缺失数据的插入，PPCA在科学和工程等领域有广泛的应用。 despite its wide range of applications, there are few theoretical guarantees to justify the soundness of the maximum likelihood (ML) solution for this model. In fact, it is well known that the maximum likelihood estimation (MLE) can only recover the true model parameters up to a rotation. The main obstruction is posed by the inherent identifiability nature of the PPCA model resulting from the rotational symmetry of the parameterization. To resolve this ambiguity, we propose a novel approach using quotient topological spaces and show that the maximum likelihood solution is consistent in an appropriate quotient Euclidean space. Furthermore, our consistency results encompass a more general class of estimators beyond the MLE. Strong consistency of the ML estimate and consequently strong covariance estimation of the PPCA model have also been established under a compactness assumption.

DEMASQ: Unmasking the ChatGPT Wordsmith

paper_url: http://arxiv.org/abs/2311.05019
repo_url: None
paper_authors: Kavita Kumari, Alessandro Pegoraro, Hossein Fereidooni, Ahmad-Reza Sadeghi
for: This paper aims to detect content generated by ChatGPT, a popular language model, in order to address the concerns of false information, plagiarism, academic dishonesty, and fraudulent activities that may arise from its use.
methods: The proposed method, called DEMASQ, is an energy-based detection model that incorporates novel aspects such as optimization inspired by the Doppler effect and the use of explainable AI techniques to generate diverse perturbations.
results: The paper demonstrates that DEMASQ achieves high accuracy in identifying content generated by ChatGPT, outperforming previous detection methods.

Abstract
The potential misuse of ChatGPT and other Large Language Models (LLMs) has raised concerns regarding the dissemination of false information, plagiarism, academic dishonesty, and fraudulent activities. Consequently, distinguishing between AI-generated and human-generated content has emerged as an intriguing research topic. However, current text detection methods lack precision and are often restricted to specific tasks or domains, making them inadequate for identifying content generated by ChatGPT. In this paper, we propose an effective ChatGPT detector named DEMASQ, which accurately identifies ChatGPT-generated content. Our method addresses two critical factors: (i) the distinct biases in text composition observed in human- and machine-generated content and (ii) the alterations made by humans to evade previous detection methods. DEMASQ is an energy-based detection model that incorporates novel aspects, such as (i) optimization inspired by the Doppler effect to capture the interdependence between input text embeddings and output labels, and (ii) the use of explainable AI techniques to generate diverse perturbations. To evaluate our detector, we create a benchmark dataset comprising a mixture of prompts from both ChatGPT and humans, encompassing domains such as medical, open Q&A, finance, wiki, and Reddit. Our evaluation demonstrates that DEMASQ achieves high accuracy in identifying content generated by ChatGPT.

摘要
大量语言模型（LLM）的潜在滥用问题，包括传播false信息、抄袭、学术不当行为和诈欺活动，导致识别人类和机器生成内容的研究变得非常有兴趣。然而，目前的文本检测方法缺乏精度，通常仅适用于特定任务或领域，无法正确地识别ChatGPT生成的内容。在本文中，我们提出了一个高精度的ChatGPT检测器，名为DEMASQ，可以准确地识别ChatGPT生成的内容。我们的方法解决了两个重要因素：（i）人类和机器生成内容中文字的不同偏见，（ii）人类对于避免先前检测方法的修改。DEMASQ是一个能量基于的检测模型，包括以下两个新的特点：（i）静电效应启发的优化方法，用于捕捉输入文本嵌入和出力标签之间的互相依赖关系，（ii）使用可解释AI技术生成多样的扰动。为了评估DEMASQ，我们创建了一个包括ChatGPT和人类产生的标准 benchmark dataset，覆盖医学、开放Q&A、金融、Wiki和Reddit等领域。我们的评估结果显示，DEMASQ可以高精度地识别ChatGPT生成的内容。

GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition

paper_url: http://arxiv.org/abs/2311.04996
repo_url: https://github.com/nvidia-riva/riva-asrlib-decoder
paper_authors: Daniel Galvez, Tim Kaldewey
for: 提高自动语音识别（ASR）管道的性能，使用GPU加速Weighted Finite State Transducer（WFST）搜索解码器。
methods: 使用GPU加速WFST搜索解码器，支持流处理推理，支持实时扩展，提供预制DLPack基本绑定。
results: 在离线和在线场景下，与当前状态 искусственный神经网络（CTC）模型相比，实现最快的搜索解码器，在离线场景下达到最高7倍的throughput，在流处理场景下达到近8倍的响应时间，与同等或更好的词错率。

Abstract
While Connectionist Temporal Classification (CTC) models deliver state-of-the-art accuracy in automated speech recognition (ASR) pipelines, their performance has been limited by CPU-based beam search decoding. We introduce a GPU-accelerated Weighted Finite State Transducer (WFST) beam search decoder compatible with current CTC models. It increases pipeline throughput and decreases latency, supports streaming inference, and also supports advanced features like utterance-specific word boosting via on-the-fly composition. We provide pre-built DLPack-based python bindings for ease of use with Python-based machine learning frameworks at https://github.com/nvidia-riva/riva-asrlib-decoder. We evaluated our decoder for offline and online scenarios, demonstrating that it is the fastest beam search decoder for CTC models. In the offline scenario it achieves up to 7 times more throughput than the current state-of-the-art CPU decoder and in the online streaming scenario, it achieves nearly 8 times lower latency, with same or better word error rate.

摘要
while Connectionist Temporal Classification (CTC) models provide state-of-the-art accuracy in automated speech recognition (ASR) pipelines, their performance has been limited by CPU-based beam search decoding. we introduce a GPU-accelerated Weighted Finite State Transducer (WFST) beam search decoder compatible with current CTC models. it increases pipeline throughput and decreases latency, supports streaming inference, and also supports advanced features like utterance-specific word boosting via on-the-fly composition. we provide pre-built DLPack-based python bindings for ease of use with Python-based machine learning frameworks at https://github.com/nvidia-riva/riva-asrlib-decoder. we evaluated our decoder for offline and online scenarios, demonstrating that it is the fastest beam search decoder for CTC models. in the offline scenario it achieves up to 7 times more throughput than the current state-of-the-art CPU decoder and in the online streaming scenario, it achieves nearly 8 times lower latency, with same or better word error rate.

Optimized measurements of chaotic dynamical systems via the information bottleneck

paper_url: http://arxiv.org/abs/2311.04896
repo_url: None
paper_authors: Kieran A. Murphy, Dani S. Bassett
for: 这篇论文旨在找到一种高效地从运动轨迹数据中提取信息的方法，以便更好地理解系统的动态行为。
methods: 该论文使用机器学习技术来优化测量过程，以便更好地捕捉系统的信息。
results: 该论文对多种杂音映射进行了 approximately optimal 的测量，并为通用时间序列提供了可用的基础。

Abstract
Deterministic chaos permits a precise notion of a "perfect measurement" as one that, when obtained repeatedly, captures all of the information created by the system's evolution with minimal redundancy. Finding an optimal measurement is challenging, and has generally required intimate knowledge of the dynamics in the few cases where it has been done. We establish an equivalence between a perfect measurement and a variant of the information bottleneck. As a consequence, we can employ machine learning to optimize measurement processes that efficiently extract information from trajectory data. We obtain approximately optimal measurements for multiple chaotic maps and lay the necessary groundwork for efficient information extraction from general time series.

摘要
deterministic 混沌允许我们定义一个"完美测量"，即在重复获取的情况下，捕捉系统的演化创造的信息，最小化重复性。发现优化测量是困难的，通常需要系统动力学的深入了解，只有在极少数情况下完成。我们证明了完美测量与信息瓶颈之间的等价关系，因此我们可以使用机器学习来优化测量过程，以高效地从曲线数据中提取信息。我们在多个混沌地图上获得了约似优化的测量，并为普通时间序列信息提取做了必要的准备。

Computing with Residue Numbers in High-Dimensional Representation

paper_url: http://arxiv.org/abs/2311.04872
repo_url: https://github.com/cjkymn/residuehdcomputing
paper_authors: Christopher J. Kymn, Denis Kleyko, E. Paxon Frady, Connor Bybee, Pentti Kanerva, Friedrich T. Sommer, Bruno A. Olshausen
for: 这篇论文是用于描述一种新的计算框架，即废弃数 residue 超级计算。
methods: 该框架使用 Random, high-dimensional vectors 来表示 residue 数字，并使用组件 wise, 并行的运算来实现 algebra 操作。
results: 该框架可以使用远 fewer 资源来表示和操作大范围的数字，并且具有强大的鲁棒性 against noise。它可以解决 computationally difficult 问题，例如视觉处理和 combinatorial optimization。

Abstract
We introduce Residue Hyperdimensional Computing, a computing framework that unifies residue number systems with an algebra defined over random, high-dimensional vectors. We show how residue numbers can be represented as high-dimensional vectors in a manner that allows algebraic operations to be performed with component-wise, parallelizable operations on the vector elements. The resulting framework, when combined with an efficient method for factorizing high-dimensional vectors, can represent and operate on numerical values over a large dynamic range using vastly fewer resources than previous methods, and it exhibits impressive robustness to noise. We demonstrate the potential for this framework to solve computationally difficult problems in visual perception and combinatorial optimization, showing improvement over baseline methods. More broadly, the framework provides a possible account for the computational operations of grid cells in the brain, and it suggests new machine learning architectures for representing and manipulating numerical data.

摘要
我们介绍了剩余超维计算框架，这是一种将剩余数系统与随机高维向量上的代数相结合的计算框架。我们表明了剩余数可以用高维向量的元素进行 componenwise、并行化的运算，从而实现了对大范围的数值进行表示和操作，并且具有很好的鲁棒性于噪声。我们通过对高维向量的因子化方法进行有效实现，实现了在资源受限的情况下解决 computationally Difficult 问题的能力。我们通过对视觉认知和 combinatorial 优化问题的解决方案来说明框架的潜在力量，以及它在机器学习中表示和操作数字数据的新架构。

Algorithms for Non-Negative Matrix Factorization on Noisy Data With Negative Values

paper_url: http://arxiv.org/abs/2311.04855
repo_url: None
paper_authors: Dylan Green, Stephen Bailey
for: 本文旨在探讨非正式矩阵分解（NMF）如何处理含有负值的天文数据，特别是在低信号响应下。
methods: 本文提出了两种算法：Shift-NMF和Nearly-NMF，它们都可以正确地处理含有负值的输入数据，而不需要clip负数据。
results: 数学分析和实验表明，Shift-NMF和Nearly-NMF算法都具有 monotonically decreasing 的更新规则，并且可以正确地回归非正式信号。

Abstract
Non-negative matrix factorization (NMF) is a dimensionality reduction technique that has shown promise for analyzing noisy data, especially astronomical data. For these datasets, the observed data may contain negative values due to noise even when the true underlying physical signal is strictly positive. Prior NMF work has not treated negative data in a statistically consistent manner, which becomes problematic for low signal-to-noise data with many negative values. In this paper we present two algorithms, Shift-NMF and Nearly-NMF, that can handle both the noisiness of the input data and also any introduced negativity. Both of these algorithms use the negative data space without clipping, and correctly recover non-negative signals without any introduced positive offset that occurs when clipping negative data. We demonstrate this numerically on both simple and more realistic examples, and prove that both algorithms have monotonically decreasing update rules.

摘要
非正定矩阵分解（NMF）是一种维度减少技术，已经在天文数据分析中展现了承诺。这些数据可能会包含负值噪声，即使真实的物理信号是正的。过去的NMF工作没有统计正确处理负数据，这会对低信号响应度数据 WITH 多个负值引起问题。在这篇论文中，我们提出了两种算法：Shift-NMF和Nearly-NMF，它们可以处理输入数据的噪声和引入的负值。这两种算法使用负数据空间而不是clip，并能正确回归非正定信号而无需引入Positive offset。我们通过数值计算和 teorema 证明了这两种算法的更新规则减少 monotonic。

Incorporating temporal dynamics of mutations to enhance the prediction capability of antiretroviral therapy’s outcome for HIV-1

paper_url: http://arxiv.org/abs/2311.04846
repo_url: None
paper_authors: Giulia Di Teodoro, Martin Pirkl, Francesca Incardona, Ilaria Vicenti, Anders Sönnerborg, Rolf Kaiser, Laura Palagi, Maurizio Zazzi, Thomas Lengauer
for: 预测HIV治疗结果
methods: 使用历史信息加权病毒变异，考虑病毒变异的时间发生和同时测量病毒荷量
results: 使用历史信息可以提高预测精度，H-模型的ROC-AUC分数高于NH-模型（76.34% VS 74.98%），并且在不同时间点进行预测时表现出了更好的一致性。

Abstract
Motivation: In predicting HIV therapy outcomes, a critical clinical question is whether using historical information can enhance predictive capabilities compared with current or latest available data analysis. This study analyses whether historical knowledge, which includes viral mutations detected in all genotypic tests before therapy, their temporal occurrence, and concomitant viral load measurements, can bring improvements. We introduce a method to weigh mutations, considering the previously enumerated factors and the reference mutation-drug Stanford resistance tables. We compare a model encompassing history (H) with one not using it (NH). Results: The H-model demonstrates superior discriminative ability, with a higher ROC-AUC score (76.34%) than the NH-model (74.98%). Significant Wilcoxon test results confirm that incorporating historical information improves consistently predictive accuracy for treatment outcomes. The better performance of the H-model might be attributed to its consideration of latent HIV reservoirs, probably obtained when leveraging historical information. The findings emphasize the importance of temporal dynamics in mutations, offering insights into HIV infection complexities. However, our result also shows that prediction accuracy remains relatively high even when no historical information is available. Supplementary information: Supplementary material is available.

摘要
目的：研究是否使用历史信息可以提高预测HIV治疗结果的能力，比较使用当前或最新可用的数据分析方法。这个研究发现，使用历史知识，包括在治疗之前的所有种类测试中检测到的病毒变异，其时间发生和同时测量病毒荷载，可以提高预测精度。我们提出一种将变异加权的方法，考虑以上因素以及参考荷载抗荷载表。我们将 comparing一个包含历史信息（H）模型和一个不使用历史信息（NH）模型。结果：H模型的预测能力显著高于NH模型（76.34% vs 74.98%），并且在不同时间点上的预测精度也有显著差异。这些结果表明，包含历史信息可以提高预测精度，但并不是必需的。补充信息：补充材料可以在附录中找到。

Bridging Dimensions: Confident Reachability for High-Dimensional Controllers

paper_url: http://arxiv.org/abs/2311.04843
repo_url: None
paper_authors: Yuang Geng, Souradeep Dutta, Ivan Ruchkin
for: This paper aims to improve the verification of high-dimensional controllers in autonomous systems, specifically those using deep neural networks.
methods: The paper proposes a new approach that approximates the behavior of a high-dimensional controller with several low-dimensional controllers in different regions of the state space, and uses verification-aware knowledge distillation to balance approximation and verifiability.
results: The paper shows convincing performance in two OpenAI gym benchmarks using two inflation techniques, one based on trajectories and the other based on actions. The results provide high-confidence reachability guarantees for the high-dimensional controller.

Abstract
Autonomous systems are increasingly implemented using end-end-end trained controllers. Such controllers make decisions that are executed on the real system with images as one of the primary sensing modalities. Deep neural networks form a fundamental building block of such controllers. Unfortunately, the existing neural-network verification tools do not scale to inputs with thousands of dimensions. Especially when the individual inputs (such as pixels) are devoid of clear physical meaning. This paper takes a step towards connecting exhaustive closed-loop verification with high-dimensional controllers. Our key insight is that the behavior of a high-dimensional controller can be approximated with several low-dimensional controllers in different regions of the state space. To balance approximation and verifiability, we leverage the latest verification-aware knowledge distillation. Then, if low-dimensional reachability results are inflated with statistical approximation errors, they yield a high-confidence reachability guarantee for the high-dimensional controller. We investigate two inflation techniques -- based on trajectories and actions -- both of which show convincing performance in two OpenAI gym benchmarks.

摘要

Toward Rapid, Optimal, and Feasible Power Dispatch through Generalized Neural Mapping

paper_url: http://arxiv.org/abs/2311.04838
repo_url: None
paper_authors: Meiyi Li, Javad Mohammadi
for: 提高大规模电力系统优化决策效率，适应分布式和连接的Grid，使用机器学习模型提高优化效果。
methods: 提出了LOOP-LC 2.0模型，基于学习来优化优化过程，保证解决方案的可行性和实际性，不需要耗时consuming的迭代过程。
results: 对IEEE-200测试 случа件进行比较，LOOP-LC 2.0方法的训练速度、计算时间、优化效果和解决方案可行性均有显著提高 compared to 现有方法。

Abstract
The evolution towards a more distributed and interconnected grid necessitates large-scale decision-making within strict temporal constraints. Machine learning (ML) paradigms have demonstrated significant potential in improving the efficacy of optimization processes. However, the feasibility of solutions derived from ML models continues to pose challenges. It's imperative that ML models produce solutions that are attainable and realistic within the given system constraints of power systems. To address the feasibility issue and expedite the solution search process, we proposed LOOP-LC 2.0(Learning to Optimize the Optimization Process with Linear Constraints version 2.0) as a learning-based approach for solving the power dispatch problem. A notable advantage of the LOOP-LC 2.0 framework is its ability to ensure near-optimality and strict feasibility of solutions without depending on computationally intensive post-processing procedures, thus eliminating the need for iterative processes. At the heart of the LOOP-LC 2.0 model lies the newly proposed generalized gauge map method, capable of mapping any infeasible solution to a feasible point within the linearly-constrained domain. The proposed generalized gauge map method improves the traditional gauge map by exhibiting reduced sensitivity to input variances while increasing search speeds significantly. Utilizing the IEEE-200 test case as a benchmark, we demonstrate the effectiveness of the LOOP-LC 2.0 methodology, confirming its superior performance in terms of training speed, computational time, optimality, and solution feasibility compared to existing methodologies.

摘要
随着Grid的分布和连接度的演化，大规模决策在强制时间限制下变得越来越重要。机器学习（ML）模式在优化过程中表现出了显著的潜力。然而，ML模型生成的解决方案的可行性仍然存在挑战。为了解决可行性问题并加速解决过程，我们提出了LOOP-LC 2.0（学习优化优化过程的线性约束版本2.0），一种基于学习的电力派发问题解决方法。LOOP-LC 2.0框架的一个优点是它可以保证解决的解决方案准确性和可行性，不需要进行计算 INTENSIVE post-processing 过程，因此消除了迭代过程的需要。LOOP-LC 2.0 模型的核心是新提出的通用抽象映射方法，可以将任何不可行的解决方案映射到可行的点 dentro de la domain de restricciones lineales。相比传统的抽象映射方法，通用抽象映射方法具有更低的输入方差敏感度和更高的搜索速度。使用 IEEE-200 测试 caso como referencia，我们证明了LOOP-LC 2.0 方法的有效性，其在培训速度、计算时间、优化性和可行性方面表现出了明显的优势 compared to 现有方法。

Real-Time Recurrent Reinforcement Learning

paper_url: http://arxiv.org/abs/2311.04830
repo_url: https://github.com/djdprogramming/adfa2
paper_authors: Julian Lemmel, Radu Grosu
for: solving partially-observable Markov decision processes (POMDPs) using biologically plausible methods
methods: using random feedback local online learning (RFLO) and temporaldifference reinforcement learning with eligibility traces (TD($\lambda$)) to compute gradients of recurrent neural network parameters in an online manner
results: RFLO can perform just as well as real-time recurrent learning (RTRL) with less complexity, and the proposed method (RTRRL) serves as a model of learning in biological neural networks mimicking reward pathways in the mammalian brain.

Abstract
Recent advances in reinforcement learning, for partially-observable Markov decision processes (POMDPs), rely on the biologically implausible backpropagation through time algorithm (BPTT) to perform gradient-descent optimisation. In this paper we propose a novel reinforcement learning algorithm that makes use of random feedback local online learning (RFLO), a biologically plausible approximation of realtime recurrent learning (RTRL) to compute the gradients of the parameters of a recurrent neural network in an online manner. By combining it with TD($\lambda$), a variant of temporaldifference reinforcement learning with eligibility traces, we create a biologically plausible, recurrent actor-critic algorithm, capable of solving discrete and continuous control tasks in POMDPs. We compare BPTT, RTRL and RFLO as well as different network architectures, and find that RFLO can perform just as well as RTRL while exceeding even BPTT in terms of complexity. The proposed method, called real-time recurrent reinforcement learning (RTRRL), serves as a model of learning in biological neural networks mimicking reward pathways in the mammalian brain.

摘要
近期在部分可观测 Markov决策过程（POMDP）中的再强化学习进步，利用生物不切实际的 backwards propagation through time 算法（BPTT）来实现梯度下降优化。在这篇论文中，我们提出了一种新的再强化学习算法，使用随机反馈局部在线学习（RFLO），这是一种生物可能的抽象，来计算激活函数参数的梯度。通过与 TD($\lambda$) 结合，一种变体的时间差异再强化学习算法，我们创建了一种生物可能的、 recurrent actor-critic 算法，可以解决 POMDP 中的离散和连续控制任务。我们比较了 BPTT、RTRL 和 RFLO 以及不同的网络架构，发现 RFLO 可以与 RTRL 相当，而且 même surpass BPTT 的复杂性。提出的方法，称为实时回归再强化学习（RTRRL），作为生物神经网络学习模型，模拟奖 PATHways 在哺乳动物大脑中的学习过程。

Functional Bayesian Tucker Decomposition for Continuous-indexed Tensor Data

paper_url: http://arxiv.org/abs/2311.04829
repo_url: None
paper_authors: Shikai Fang, Xin Yu, Zheng Wang, Shibo Li, Mike Kirby, Shandian Zhe
for: 寻找一种方法来扩展tensor decomposition来处理不同方面的连续型数据。
methods: 提议了Functional Bayesian Tucker Decomposition（FunBaT）方法，将连续型数据视为Tucker核和一组隐函数之间的交互。使用 Gaussian Processes（GP）作为函数先验，然后将GP转换为状态方程先验来降低计算成本。
results: 在Synthetic数据和实际应用中，提议的方法能够显著提高tensor decomposition的灵活性和效率。

Abstract
Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there were finite objects in each aspect or mode, corresponding to discrete indexes of data entries. However, many real-world data are not naturally posed in the setting. For example, geographic data is represented as continuous indexes of latitude and longitude coordinates, and cannot fit tensor models directly. To generalize Tucker decomposition to such scenarios, we propose Functional Bayesian Tucker Decomposition (FunBaT). We treat the continuous-indexed data as the interaction between the Tucker core and a group of latent functions. We use Gaussian processes (GP) as functional priors to model the latent functions, and then convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE) to reduce computational cost. An efficient inference algorithm is further developed for scalable posterior approximation based on advanced message-passing techniques. The advantage of our method is shown in both synthetic data and several real-world applications.

摘要
各种多方面数据的处理可以通过图ucker分解来实现。这种分解示出了图ucker的低级性质，它将格子结构数据分解为核tensor和一组对象表示（因素）之间的交互。然而，许多实际世界的数据不是直接适用于图ucker模型的。例如，地理数据通常表示为维度坐标的连续标记，无法直接适用于图ucker模型。为了推广图ucker分解到这些场景，我们提议了功能 bayesian 图ucker分解（FunBaT）。我们将连续标记的数据视为图ucker核和一组隐函数之间的交互。我们使用 Gaussian 过程（GP）作为隐函数的函数先验，然后将GP转换为状态空间先验，以降低计算成本。我们还开发了一种可扩展的 posterior 近似算法，以便可扩展到大规模数据。我们的方法在synthetic数据和一些实际应用中表现出了优势。

A Lightweight Architecture for Real-Time Neuronal-Spike Classification

paper_url: http://arxiv.org/abs/2311.04808
repo_url: None
paper_authors: Muhammad Ali Siddiqi, David Vrijenhoek, Lennart P. L. Landsmeer, Job van der Kleij, Anteneh Gebregiorgis, Vincenzo Romano, Rajendra Bishnoi, Said Hamdioui, Christos Strydis
for: 理解大脑功能，尤其是硬膜下卷绕细胞（Purkinje cells）在肌功能损伤和脑部受损时的作用。
methods: 利用硬膜下卷绕细胞的特点，实时抛弃不需要的神经数据，并将压缩数据存储在头部设备上的可 removable 存储器上。
results: 提出了一种轻量级神经采集和分类架构，可以在实时实现 >95% 的总分类精度，同时具有小型设计和低功耗特性，使头部设备可以靠一个小电池供电，可以持续运行约 4 天。

Abstract
Electrophysiological recordings of neural activity in a mouse's brain are very popular among neuroscientists for understanding brain function. One particular area of interest is acquiring recordings from the Purkinje cells in the cerebellum in order to understand brain injuries and the loss of motor functions. However, current setups for such experiments do not allow the mouse to move freely and, thus, do not capture its natural behaviour since they have a wired connection between the animal's head stage and an acquisition device. In this work, we propose a lightweight neuronal-spike detection and classification architecture that leverages on the unique characteristics of the Purkinje cells to discard unneeded information from the sparse neural data in real time. This allows the (condensed) data to be easily stored on a removable storage device on the head stage, alleviating the need for wires. Our proposed implementation shows a >95% overall classification accuracy while still resulting in a small-form-factor design, which allows for the free movement of mice during experiments. Moreover, the power-efficient nature of the design and the usage of STT-RAM (Spin Transfer Torque Magnetic Random Access Memory) as the removable storage allows the head stage to easily operate on a tiny battery for up to approximately 4 days.

摘要
neuroscientists 非常喜欢使用电生物学记录神经活动的mouse脑中的记录，以了解脑功能。一个具有潜在价值的领域是从粒细 Purkinje 细胞中获取记录，以了解脑损伤和lost of motor functions。但现有的实验设置不允许鼠标自由移动，因此不能捕捉其自然行为，因为它们有一个连接鼠标头stage和收集设备的硬件连接。在这种工作中，我们提出了一种轻量级神经元发射检测和分类架构，利用粒细 Purkinje 细胞的特有特征，在实时中抛弃不必要的神经数据。这使得（缩减）数据可以轻松地存储在鼠标头stage上的可 removable 存储设备上，解决了需要硬件连接的问题。我们的提议实现显示了 >95% 的总分类精度，同时仍保持小型设计，允许鼠标在实验中自由移动。此外，设计的能效性和使用 STT-RAM（磁转转换栅隔隔 Memory）作为可 removable 存储，使得头stage可以轻松运行在 tiny 电池上，可以达到约4天的操作时间。

The PetShop Dataset – Finding Causes of Performance Issues across Microservices

paper_url: http://arxiv.org/abs/2311.04806
repo_url: None
paper_authors: Michaela Hardt, William Orchard, Patrick Blöbaum, Shiva Kasiviswanathan, Elke Kirschbaum
for: 本研究旨在提供一个特定于微服务应用的根本原因分析数据集，用于评估不同的根本原因分析方法。
methods: 本研究使用了一个分布式应用程序 emit 5 分钟间隔的延迟、请求和可用性指标，并在系统中随机引入了 68 个性能问题，以模拟不良行为。
results: 本研究通过使用这个数据集，证明了这个数据集可以用于评估不同的根本原因分析方法的准确性。

Abstract
Identifying root causes for unexpected or undesirable behavior in complex systems is a prevalent challenge. This issue becomes especially crucial in modern cloud applications that employ numerous microservices. Although the machine learning and systems research communities have proposed various techniques to tackle this problem, there is currently a lack of standardized datasets for quantitative benchmarking. Consequently, research groups are compelled to create their own datasets for experimentation. This paper introduces a dataset specifically designed for evaluating root cause analyses in microservice-based applications. The dataset encompasses latency, requests, and availability metrics emitted in 5-minute intervals from a distributed application. In addition to normal operation metrics, the dataset includes 68 injected performance issues, which increase latency and reduce availability throughout the system. We showcase how this dataset can be used to evaluate the accuracy of a variety of methods spanning different causal and non-causal characterisations of the root cause analysis problem. We hope the new dataset, available at https://github.com/amazon-science/petshop-root-cause-analysis/ enables further development of techniques in this important area.

摘要
通用系统中异常或不满意的行为的根本原因识别是一个广泛存在的挑战。特别是现代云应用程序使用多个微服务，这问题变得更加重要。虽然机器学习和系统研究共同体已经提出了各种方法来解决这个问题，但目前还没有标准化的数据集用于量化比较。因此，研究组织被迫创建自己的数据集用于实验。这篇文章介绍了一个专门为识别微服务基本应用中的根本原因分析而设计的数据集。该数据集包括5分钟间隔的延迟、请求和可用性指标，以及68个注入性性能问题，这些问题会在系统中增加延迟和降低可用性。我们展示了如何使用这个数据集来评估多种不同的 causal 和非 causal 根本原因分析问题的准确性。我们希望新的数据集，可以在上获取，能够推动这一重要领域的进一步发展。

Why Do Clinical Probabilistic Models Fail To Transport Between Sites?

paper_url: http://arxiv.org/abs/2311.04787
repo_url: None
paper_authors: Thomas A. Lasko, Eric V. Strobl, William W. Stead
for: 本研究旨在解释在健康领域中人工智能的应用中出现的问题，即模型在训练站上达到超人类性能后在新站点上表现差异较大。
methods: 本研究使用了分析常见导致模型在新站点上表现差异的源头，并将这些源头分为实验者可控的和数据生成过程中的内在源头。
results: 研究发现，数据生成过程中的站点特有的临床实践可能导致模型在新站点上表现差异，并提出了一种解决方案，即隔离数据中各站点临床实践的影响，以便更好地预测疾病的发展趋势。

Abstract
The rising popularity of artificial intelligence in healthcare is highlighting the problem that a computational model achieving super-human clinical performance at its training sites may perform substantially worse at new sites. In this perspective, we present common sources for this failure to transport, which we divide into sources under the control of the experimenter and sources inherent to the clinical data-generating process. Of the inherent sources we look a little deeper into site-specific clinical practices that can affect the data distribution, and propose a potential solution intended to isolate the imprint of those practices on the data from the patterns of disease cause and effect that are the usual target of clinical models.

摘要
人工智能在医疗领域的普及正加剧了一个问题：一个计算模型在训练 Site 上达到超人类优秀表现，但在新 Site 上可能表现很差。在这个视角下，我们描述了不能传输的常见来源，分为实验者控制的源和数据生成过程中的自然源。其中，我们对内在的源进一步分析了Site-specific临床实践对数据分布的影响，并提出了一种解决方案，以隔离临床实践对数据的影响，从而更好地预测疾病的 causa 和效果。

FetMRQC: an open-source machine learning framework for multi-centric fetal brain MRI quality control

paper_url: http://arxiv.org/abs/2311.04780
repo_url: https://github.com/medical-image-analysis-laboratory/fetal_brain_qc
paper_authors: Thomas Sanchez, Oscar Esteban, Yvan Gomez, Alexandre Pron, Mériam Koob, Vincent Dunet, Nadine Girard, Andras Jakab, Elisenda Eixarch, Guillaume Auzias, Meritxell Bach Cuadra
for: 这个论文旨在提供一种自动化图像质量评估和控制框架，以提高胎儿脑部MRI图像的质量和可靠性。
methods: 这种框架使用机器学习算法，提取了不同扫描仪和数据采集中的质量指标，并将其组合成Random Forest模型来预测专家评分。
results: 研究表明，FetMRQC的预测结果在不同的扫描仪和数据采集中具有良好的泛化能力和可解释性。

Abstract
Fetal brain MRI is becoming an increasingly relevant complement to neurosonography for perinatal diagnosis, allowing fundamental insights into fetal brain development throughout gestation. However, uncontrolled fetal motion and heterogeneity in acquisition protocols lead to data of variable quality, potentially biasing the outcome of subsequent studies. We present FetMRQC, an open-source machine-learning framework for automated image quality assessment and quality control that is robust to domain shifts induced by the heterogeneity of clinical data. FetMRQC extracts an ensemble of quality metrics from unprocessed anatomical MRI and combines them to predict experts' ratings using random forests. We validate our framework on a pioneeringly large and diverse dataset of more than 1600 manually rated fetal brain T2-weighted images from four clinical centers and 13 different scanners. Our study shows that FetMRQC's predictions generalize well to unseen data while being interpretable. FetMRQC is a step towards more robust fetal brain neuroimaging, which has the potential to shed new insights on the developing human brain.

摘要
《胎儿脑MRI在产前诊断中成为越来越重要的补充，允许深入了解胎儿脑发育的全程。但是，无法控制胎儿运动和数据采集协议的不同导致数据质量存在变化，可能影响后续研究的结果。我们提出了FetMRQC，一个开源的机器学习框架，用于自动评估和控制图像质量，对域外传递产生的影响具有抗难度特性。FetMRQC从未处理的 анатомичеMRI中提取一 ensemble of 质量指标，使用随机森林将其组合成为专家评分。我们验证了我们的框架，使用了1600多个手动评分的胎儿脑T2强化MRI图像，来自四个临床中心和13个不同的扫描仪。我们的研究表明，FetMRQC的预测能够在未看到数据上具有良好的泛化能力，同时具有可解释性。FetMRQC是更加Robust的胎儿脑神经成像的一步，它有可能为人类脑发育带来新的发现。》Note: Please note that the translation is in Simplified Chinese, which is the standard form of Chinese used in mainland China and other countries. If you need the translation in Traditional Chinese, please let me know.

Optimal Deep Neural Network Approximation for Korobov Functions with respect to Sobolev Norms

paper_url: http://arxiv.org/abs/2311.04779
repo_url: None
paper_authors: Yahong Yang, Yulong Lu
for: 这个论文为了解决深度神经网络（DNNs）应用于科罗波夫函数时的近似问题而写作。
methods: 该论文使用了深度神经网络来近似科罗波夫函数，并使用$L_p$ norms和$H^1$ norms来衡量近似结果。
results: 该论文获得了一个很高的近似率，超过传统方法和任何连续函数近似器。这些结果是非对数的，可以同时考虑网络宽度和深度。

Abstract
This paper establishes the nearly optimal rate of approximation for deep neural networks (DNNs) when applied to Korobov functions, effectively overcoming the curse of dimensionality. The approximation results presented in this paper are measured with respect to $L_p$ norms and $H^1$ norms. Our achieved approximation rate demonstrates a remarkable "super-convergence" rate, outperforming traditional methods and any continuous function approximator. These results are non-asymptotic, providing error bounds that consider both the width and depth of the networks simultaneously.

摘要
Note:* "Korobov functions" 是指具有特定的某种函数形式的函数。* "curse of dimensionality" 是指在高维空间中， tradicional method 的拟合率会随着维度的增加而减慢。* "super-convergence" 是指拟合率比传统方法更快地增长。* "non-asymptotic" 是指不含 asymptotic 的概率 bound。

Towards a Unified Framework of Contrastive Learning for Disentangled Representations

paper_url: http://arxiv.org/abs/2311.04774
repo_url: None
paper_authors: Stefan Matthes, Zhiwei Han, Hao Shen
for: 本研究旨在扩展对冲学习方法的理论保证，以便在更广泛的对冲搜索空间中找到和分离数据中的解释因素。
methods: 本研究使用了四种冲对方法，包括雷达对冲估计（NCE）和信息对冲估计（InfoNCE）等。
results: 研究人员通过 теорем的证明，证明了这些对冲方法可以帮助找到和分离数据中的解释因素，而不需要假设数据生成过程的特定假设。这些结论在多个标准数据集上进行了验证。

Abstract
Contrastive learning has recently emerged as a promising approach for learning data representations that discover and disentangle the explanatory factors of the data. Previous analyses of such approaches have largely focused on individual contrastive losses, such as noise-contrastive estimation (NCE) and InfoNCE, and rely on specific assumptions about the data generating process. This paper extends the theoretical guarantees for disentanglement to a broader family of contrastive methods, while also relaxing the assumptions about the data distribution. Specifically, we prove identifiability of the true latents for four contrastive losses studied in this paper, without imposing common independence assumptions. The theoretical findings are validated on several benchmark datasets. Finally, practical limitations of these methods are also investigated.

摘要
contrastive learning 最近 emerged as a promising approach for learning data representations that discover and disentangle the explanatory factors of the data. Previous analyses of such approaches have largely focused on individual contrastive losses, such as noise-contrastive estimation (NCE) and InfoNCE, and rely on specific assumptions about the data generating process. This paper extends the theoretical guarantees for disentanglement to a broader family of contrastive methods, while also relaxing the assumptions about the data distribution. Specifically, we prove identifiability of the true latents for four contrastive losses studied in this paper, without imposing common independence assumptions. The theoretical findings are validated on several benchmark datasets. Finally, practical limitations of these methods are also investigated.

Towards Open-world Cross-Domain Sequential Recommendation: A Model-Agnostic Contrastive Denoising Approach

paper_url: http://arxiv.org/abs/2311.04760
repo_url: None
paper_authors: Wujiang Xu, Xuying Ning, Wenfang Lin, Mingming Ha, Qiongxu Ma, Linxun Chen, Bing Han, Minnan Luo
for: 提高开放世界CDSR场景中模型的一致性和有效性（1st CH）
methods: 使用辅助行为来补充长尾用户的信息（2nd CH）
results: 这些SR方法无法在CDSR场景中提供优秀的表现，因为它们忽略了目标行为和辅助行为之间的semantic gap，以及用户兴趣偏移 across domains（2nd CH）

Abstract
Cross-domain sequential recommendation (CDSR) aims to address the data sparsity problems that exist in traditional sequential recommendation (SR) systems. The existing approaches aim to design a specific cross-domain unit that can transfer and propagate information across multiple domains by relying on overlapping users with abundant behaviors. However, in real-world recommender systems, CDSR scenarios usually consist of a majority of long-tailed users with sparse behaviors and cold-start users who only exist in one domain. This leads to a drop in the performance of existing CDSR methods in the real-world industry platform. Therefore, improving the consistency and effectiveness of models in open-world CDSR scenarios is crucial for constructing CDSR models (\textit{1st} CH). Recently, some SR approaches have utilized auxiliary behaviors to complement the information for long-tailed users. However, these multi-behavior SR methods cannot deliver promising performance in CDSR, as they overlook the semantic gap between target and auxiliary behaviors, as well as user interest deviation across domains (\textit{2nd} CH).

摘要
Recently, some SR approaches have used auxiliary behaviors to complement information for long-tailed users. However, these multi-behavior SR methods cannot deliver promising performance in CDSR due to the semantic gap between target and auxiliary behaviors, as well as user interest deviation across domains.

Natural Bayesian Cramér-Rao Bound with an Application to Covariance Estimation

paper_url: http://arxiv.org/abs/2311.04748
repo_url: None
paper_authors: Florent Bouchard, Alexandre Renaux, Guillaume Ginolhac, Arnaud Breloy
for: 本研究提出了一种新的克拉默-拉托 bound (CRB)，用于估计参数在拓扑 manifold 上并且受到先验分布的影响。
methods: 本研究使用了一种新的 derivation 方法，导致了一种自然的 geometrical 性质准则和这个新的 bound 之间的不等式。
results: 数值 simulations 表明，提出的 CRB 可以展示一些MAP 估计器的有趣性质，而经典极大似然估计器 (Bayesian CRB) 不能够显示出这些性质。

Abstract
In this paper, we propose to develop a new Cram\'er-Rao Bound (CRB) when the parameter to estimate lies in a manifold and follows a prior distribution. This derivation leads to a natural inequality between an error criteria based on geometrical properties and this new bound. This main contribution is illustrated in the problem of covariance estimation when the data follow a Gaussian distribution and the prior distribution is an inverse Wishart. Numerical simulation shows new results where the proposed CRB allows to exhibit interesting properties of the MAP estimator which are not observed with the classical Bayesian CRB.

摘要
在这篇论文中，我们提出了一种新的卡默-拉托矩bound（CRB），其中参数需要估计的投影到一个拓扑上，并且遵循一个先验分布。这个 derivation 导致了一种自然的准则，该准则与这个新的矩bound 之间存在一种对应关系。这种主要贡献在 covariance 估计中进行了应用，其中数据遵循 Gaussian 分布，先验分布是 inverse Wishart。数值实验显示，我们的提议的 CRB 可以展示一些MAP 估计器的有趣特性，这些特性与 классическому Bayesian CRB 不可见。

Enhancing Multi-Agent Coordination through Common Operating Picture Integration

paper_url: http://arxiv.org/abs/2311.04740
repo_url: None
paper_authors: Peihong Yu, Bhoram Lee, Aswin Raghavan, Supun Samarasekara, Pratap Tokekar, James Zachary Hare
For: This paper focuses on improving multi-agent coordination in dynamic environments, where agents possess only local observations and must communicate to enhance coordination.* Methods: The proposed approach uses a Common Operating Picture (COP) that integrates each agent’s observations, actions, and messages received, and disseminates the COP to other agents. This approach takes into account the dynamic nature of the environment and the shared mission.* Results: The paper shows that COP-based training leads to robust policies compared to state-of-the-art MARL methods when faced with out-of-distribution initial states, through experiments in the StarCraft2 environment.Here’s the summary in Simplified Chinese:
for: 多 Agent 系统中的 Agent 仅 possessed 本地观察，通信成为协调的关键。这篇 paper 针对这个问题提出了一种方法。
methods: 提议的方法使用 Common Operating Picture (COP)，让每个 Agent 统一其观察、动作和获取的讯息，并将 COP 分享到其他 Agent 中。这种方法考虑了环境的动态性和共同任务。
results: 这篇 paper 透过 StarCraft2 环境进行实验，证明 COP-based 训练对于离distribution 初始状态时的策略比 state-of-the-art Multi-Agent Reinforcement Learning (MARL) 方法更加Robust。

Abstract
In multi-agent systems, agents possess only local observations of the environment. Communication between teammates becomes crucial for enhancing coordination. Past research has primarily focused on encoding local information into embedding messages which are unintelligible to humans. We find that using these messages in agent's policy learning leads to brittle policies when tested on out-of-distribution initial states. We present an approach to multi-agent coordination, where each agent is equipped with the capability to integrate its (history of) observations, actions and messages received into a Common Operating Picture (COP) and disseminate the COP. This process takes into account the dynamic nature of the environment and the shared mission. We conducted experiments in the StarCraft2 environment to validate our approach. Our results demonstrate the efficacy of COP integration, and show that COP-based training leads to robust policies compared to state-of-the-art Multi-Agent Reinforcement Learning (MARL) methods when faced with out-of-distribution initial states.

摘要
在多智能系统中，智能体仅具有本地环境观察。团队成员之间的交流成为协调的关键。过去的研究主要集中在编码本地信息到嵌入消息中，这些消息对人类不可读。我们发现，在智能体政策学习中使用这些消息会导致不稳定的政策，对于非标准初始状态进行测试时。我们提出了一种多智能协调方法，其中每个智能体具有将其（历史观察、行动和接收的消息）集成为共同运作图像（COP）的能力，并将COP分布给其他团队成员。这个过程考虑了环境的动态性和共同任务。我们在StarCraft2环境中进行了实验，以验证我们的方法。我们的结果表明COP集成的有效性，并示出COP基于培训在对不同初始状态进行测试时，与现有多智能学习方法相比，具有更加稳定的政策。

Robust Best-arm Identification in Linear Bandits

paper_url: http://arxiv.org/abs/2311.04731
repo_url: None
paper_authors: Wei Wang, Sattar Vakili, Ilija Bogunovic
for: 这种研究旨在解决 robust best-arm identification problem (RBAI) 中的 linear rewards 问题，目标是找到一个近似优化的 robust arm，以便在实际应用中实现 transferred 的优化策略。
methods: 该研究提出了一个实例取值下的下界，并提出了静态和适应式bandit算法，以实现与下界匹配的样本复杂度。
results: 在synthetic实验中，该算法能够有效地找到最佳的 robust arm，并与oracle策略相似。在应用中，该算法在不同年龄层的病人中实现了robust dosage值的标准化。

Abstract
We study the robust best-arm identification problem (RBAI) in the case of linear rewards. The primary objective is to identify a near-optimal robust arm, which involves selecting arms at every round and assessing their robustness by exploring potential adversarial actions. This approach is particularly relevant when utilizing a simulator and seeking to identify a robust solution for real-world transfer. To this end, we present an instance-dependent lower bound for the robust best-arm identification problem with linear rewards. Furthermore, we propose both static and adaptive bandit algorithms that achieve sample complexity that matches the lower bound. In synthetic experiments, our algorithms effectively identify the best robust arm and perform similarly to the oracle strategy. As an application, we examine diabetes care and the process of learning insulin dose recommendations that are robust with respect to inaccuracies in standard calculators. Our algorithms prove to be effective in identifying robust dosage values across various age ranges of patients.

摘要
我们研究了Robust Best-Arm Identification问题（RBAI）在线性奖励情况下。主要目标是找到近似优化的Robust arm，这里是在每个轮次选择武器并评估它们的Robustness，通过探索敌方动作的可能性。这种方法特别有用在使用模拟器并寻找实际世界中的稳定解决方案。为此，我们提出了一个实例dependent的下界 дляRobust Best-Arm Identification问题，并提出了静态和适应式bandit算法，这些算法的样本复杂度与下界相匹配。在 sintetic 实验中，我们的算法成功地确定了最佳Robust arm，并与oracle策略相似。作为应用，我们研究了diabetes care和学习不准确的标准计算器中的药物剂量建议的Robust性。我们的算法在不同年龄范围的患者中表现出了有效的Robust剂量值。

Predicting Properties of Nodes via Community-Aware Features

paper_url: http://arxiv.org/abs/2311.04730
repo_url: https://github.com/sebkaz/betastar
paper_authors: Bogumił Kamiński, Paweł Prałat, François Théberge, Sebastian Zając
for: 本研究旨在提出一家族的社区意识型节点特征，并研究其性质。
methods: 本文提出了一种基于社区意识的节点特征家族，并对其进行了 investigate。
results: 研究表明，这种特征家族具有高预测力 для分类任务，并且包含不可recover的信息， neither by classical node features nor by node embeddings。

Abstract
A community structure that is often present in complex networks plays an important role not only in their formation but also shapes dynamics of these networks, affecting properties of their nodes. In this paper, we propose a family of community-aware node features and then investigate their properties. We show that they have high predictive power for classification tasks. We also verify that they contain information that cannot be recovered neither by classical node features nor by node embeddings (both classical as well as structural).

摘要
复杂网络中常见的社区结构不仅参与网络的形成，还对网络的动态shape有重要影响，对节点的性质产生影响。在这篇论文中，我们提出了一家族社区意识型节点特征，然后调查其性质。我们发现它们具有高预测力 для分类任务。我们还证明它们不可能通过传统节点特征还是结构节点嵌入获得。

Robust and Communication-Efficient Federated Domain Adaptation via Random Features

paper_url: http://arxiv.org/abs/2311.04686
repo_url: https://github.com/sadangelf/fedrf-tca
paper_authors: Zhanbo Feng, Yuanjie Wang, Jie Li, Fan Yang, Jiong Lou, Tiebin Mi, Robert. C. Qiu, Zhenyu Liao
for: This paper is written for researchers and practitioners who are interested in federated domain adaptation (FDA) and want to improve the efficiency and robustness of their FDA methods.
methods: The paper proposes an enhancement to the standard Transfer Component Analysis (TCA) approach, called RF-TCA, which significantly accelerates computation without compromising theoretical and empirical performance. The proposed FedRF-TCA protocol is an extension of RF-TCA to the FDA setting, which has communication complexity that is independent of the sample size and maintains performance that is either comparable to or even surpasses state-of-the-art FDA methods.
results: The paper presents extensive experiments to showcase the superior performance and robustness (to network condition) of FedRF-TCA compared to state-of-the-art FDA methods. The results demonstrate that FedRF-TCA can handle large-scale FDA tasks with high efficiency and accuracy, and is robust to network conditions.Here is the answer in Simplified Chinese:
for: 这篇论文是为研究者和实践者们，他们关心联合预测领域（Federated Domain Adaptation，FDA）的人们所写的。
methods: 这篇论文提出了改进标准传输组分分析（TCA）方法的增强版本，即RF-TCA，它可以在 computation 方面减少计算量，而不会产生理论和实际性的损害。提出的 FedRF-TCA 协议是 TCA 的扩展，用于 FDA 设定，它的通信复杂度是采样大小独立的，而且可以保持和现有 FDA 方法相比或者甚至超越其性能。
results: 这篇论文通过了广泛的实验，证明 FedRF-TCA 可以处理大规模 FDA 任务，并且具有高效性和精度。它还可以在不同的网络条件下保持稳定性和可靠性。

Abstract
Modern machine learning (ML) models have grown to a scale where training them on a single machine becomes impractical. As a result, there is a growing trend to leverage federated learning (FL) techniques to train large ML models in a distributed and collaborative manner. These models, however, when deployed on new devices, might struggle to generalize well due to domain shifts. In this context, federated domain adaptation (FDA) emerges as a powerful approach to address this challenge. Most existing FDA approaches typically focus on aligning the distributions between source and target domains by minimizing their (e.g., MMD) distance. Such strategies, however, inevitably introduce high communication overheads and can be highly sensitive to network reliability. In this paper, we introduce RF-TCA, an enhancement to the standard Transfer Component Analysis approach that significantly accelerates computation without compromising theoretical and empirical performance. Leveraging the computational advantage of RF-TCA, we further extend it to FDA setting with FedRF-TCA. The proposed FedRF-TCA protocol boasts communication complexity that is \emph{independent} of the sample size, while maintaining performance that is either comparable to or even surpasses state-of-the-art FDA methods. We present extensive experiments to showcase the superior performance and robustness (to network condition) of FedRF-TCA.

摘要
现代机器学习（ML）模型已经发展到了训练在单机器上是不现实的规模。因此，有一个增长的趋势是使用联邦学习（FL）技术来训练大型ML模型在分布式和协作的方式上。这些模型在新设备上部署时可能会遇到领域变化，导致其不能良好地泛化。在这种情况下，联邦领域适应（FDA）作为一种有力的方法来解决这个挑战。现有的FDA方法通常是通过最小化源频率和目标频率之间的差距（例如MMD）来对 distributions进行对齐。然而，这些策略会带来高通信开销并且对网络可靠性非常敏感。在这篇论文中，我们介绍了RF-TCA，一种提高标准传输组件分析方法的优化。RF-TCA可以快速计算，而不需要牺牲理论和实际性能。基于RF-TCA的计算优势，我们进一步扩展了它到FDA设置，得到了FedRF-TCA协议。FedRF-TCA协议的通信复杂度是独立于样本大小的，同时保持和现状最佳的性能。我们进行了广泛的实验，证明FedRF-TCA的超越性和网络条件的稳定性。

Compressive Recovery of Sparse Precision Matrices

paper_url: http://arxiv.org/abs/2311.04673
repo_url: None
paper_authors: Titouan Vayer, Etienne Lasalle, Rémi Gribonval, Paulo Gonçalves
for: 本研究旨在学习一个图模型，用于统计关系分析dataset中的$d$变量的$n$个样本$X$。
methods: 本研究使用了一种压缩视角，通过从$X$中随机生成的低维度向量来Estimate一个稀疏的$\Theta$。
results: 研究表明，在certain assumptions下，可以从一个低维度sketch中Estimate一个稀疏的$\Theta$，且需要$m=\Omega((d+2k)\log(d))$的维度，其中$k$是Underlying graph的最大边数。

Abstract
We consider the problem of learning a graph modeling the statistical relations of the $d$ variables of a dataset with $n$ samples $X \in \mathbb{R}^{n \times d}$. Standard approaches amount to searching for a precision matrix $\Theta$ representative of a Gaussian graphical model that adequately explains the data. However, most maximum likelihood-based estimators usually require storing the $d^{2}$ values of the empirical covariance matrix, which can become prohibitive in a high-dimensional setting. In this work, we adopt a compressive viewpoint and aim to estimate a sparse $\Theta$ from a sketch of the data, i.e. a low-dimensional vector of size $m \ll d^{2}$ carefully designed from $X$ using nonlinear random features. Under certain assumptions on the spectrum of $\Theta$ (or its condition number), we show that it is possible to estimate it from a sketch of size $m=\Omega((d+2k)\log(d))$ where $k$ is the maximal number of edges of the underlying graph. These information-theoretic guarantees are inspired by compressed sensing theory and involve restricted isometry properties and instance optimal decoders. We investigate the possibility of achieving practical recovery with an iterative algorithm based on the graphical lasso, viewed as a specific denoiser. We compare our approach and graphical lasso on synthetic datasets, demonstrating its favorable performance even when the dataset is compressed.

摘要
我们考虑一个图模型，用于描述一个 dataset 中 $d$ 个变数之间的统计关系。标准方法通常是寻找一个精确的 $\Theta$ 矩阵，代表一个 Gaussian 图模型，以便对数据进行适当地描述。但是，大多数最大 LIKELIHOOD 基本的估计方法通常需要储存 $d^2$ 个 empirical covariance matrix 的值，这可能会在高维度设定中成为禁止的。在这个工作中，我们遵循一种压缩的观点，企图从 dataset 中获取一个压缩的 $\Theta$ 矩阵，即一个来自 $X$ 的非线性随机特征下的低维度 вектор。在某些 $\Theta$ 的 спектル（或其 condition number）的假设下，我们展示了可以从获取的大小为 $m \ll d^2$ 的压缩 sketch 中估计 $\Theta$。这些信息理论上的保证是基于数据压缩理论和具有特定的 Restricted Isometry 性和实例最佳解解oder。我们 investigate 可能在实际应用中实现实用的重建，使用一个基于图形lasso 的迭代算法，视为特定的推理器。我们在实验中与图形lasso 进行比较，展示了其优越的表现，即使对于压缩的 dataset。

Learning Linear Gaussian Polytree Models with Interventions

paper_url: http://arxiv.org/abs/2311.04636
repo_url: https://github.com/emduart2/polytrees
paper_authors: D. Tramontano, L. Waldmann, M. Drton, E. Duarte
for: 学习非 Parametric linear Gaussian 树的 causal 结构，使用来自干预实验的数据，其中干预目标已知。
methods: 方法首先学习树的skeleton，然后将其边 orient。输出是一个 CPDAG，表示真实分布下的树的干预Equivalence class。skeleton和orientation恢复过程都基于第二阶统计和低维度边 Distribution。
results: 在不同场景下的synthetic数据集中，方法具有快速、高精度、可扩展性。在一个基因表达干预数据集中应用方法，并且对结果进行了评估。

Abstract
We present a consistent and highly scalable local approach to learn the causal structure of a linear Gaussian polytree using data from interventional experiments with known intervention targets. Our methods first learn the skeleton of the polytree and then orient its edges. The output is a CPDAG representing the interventional equivalence class of the polytree of the true underlying distribution. The skeleton and orientation recovery procedures we use rely on second order statistics and low-dimensional marginal distributions. We assess the performance of our methods under different scenarios in synthetic data sets and apply our algorithm to learn a polytree in a gene expression interventional data set. Our simulation studies demonstrate that our approach is fast, has good accuracy in terms of structural Hamming distance, and handles problems with thousands of nodes.

摘要
我们提出了一种一致性很高且可扩展的本地方法，用于学习 linear Gaussian 树的 causal 结构，基于干扰实验数据中知道的干扰目标。我们的方法首先学习树的skeleton，然后对其 edges 进行orienting。输出是一个 CPDAG 表示真实下面分布的干扰 equivalence class。我们使用第二阶 Statistics 和低维度边分布来进行骨架和 Orienting 过程。我们在不同情况下进行了simulationstudies，并将方法应用到了一个基因表达干扰数据集中。我们的模拟研究显示，我们的方法具有快速、高准确率和可扩展性。

Byzantine-Tolerant Methods for Distributed Variational Inequalities

paper_url: http://arxiv.org/abs/2311.04611
repo_url: https://github.com/nazya/sgda-ra
paper_authors: Nazarii Tupitsa, Abdulla Jasem Almansoori, Yanlin Wu, Martin Takáč, Karthik Nandakumar, Samuel Horváth, Eduard Gorbunov
for: This paper is written for discussing the problem of Byzantine robustness in distributed training scenarios, particularly in the context of variational inequalities.
methods: The paper proposes several provably Byzantine-robust methods for distributed variational inequality, and thoroughly studies their theoretical convergence.
results: The paper provides numerical comparisons supporting the theoretical findings, and removes the limitations of previous work in this area.

Abstract
Robustness to Byzantine attacks is a necessity for various distributed training scenarios. When the training reduces to the process of solving a minimization problem, Byzantine robustness is relatively well-understood. However, other problem formulations, such as min-max problems or, more generally, variational inequalities, arise in many modern machine learning and, in particular, distributed learning tasks. These problems significantly differ from the standard minimization ones and, therefore, require separate consideration. Nevertheless, only one work (Adibi et al., 2022) addresses this important question in the context of Byzantine robustness. Our work makes a further step in this direction by providing several (provably) Byzantine-robust methods for distributed variational inequality, thoroughly studying their theoretical convergence, removing the limitations of the previous work, and providing numerical comparisons supporting the theoretical findings.

摘要
Robustness to Byzantine attacks is a necessity for various distributed training scenarios. When the training reduces to the process of solving a minimization problem, Byzantine robustness is relatively well-understood. However, other problem formulations, such as min-max problems or, more generally, variational inequalities, arise in many modern machine learning and, in particular, distributed learning tasks. These problems significantly differ from the standard minimization ones and, therefore, require separate consideration. Nevertheless, only one work (Adibi et al., 2022) addresses this important question in the context of Byzantine robustness. Our work makes a further step in this direction by providing several (provably) Byzantine-robust methods for distributed variational inequality, thoroughly studying their theoretical convergence, removing the limitations of the previous work, and providing numerical comparisons supporting the theoretical findings.Here's the translation in Traditional Chinese:Robustness to Byzantine attacks is a necessity for various distributed training scenarios. When the training reduces to the process of solving a minimization problem, Byzantine robustness is relatively well-understood. However, other problem formulations, such as min-max problems or, more generally, variational inequalities, arise in many modern machine learning and, in particular, distributed learning tasks. These problems significantly differ from the standard minimization ones and, therefore, require separate consideration. Nevertheless, only one work (Adibi et al., 2022) addresses this important question in the context of Byzantine robustness. Our work makes a further step in this direction by providing several (provably) Byzantine-robust methods for distributed variational inequality, thoroughly studying their theoretical convergence, removing the limitations of the previous work, and providing numerical comparisons supporting the theoretical findings.

Accurate Autism Spectrum Disorder prediction using Support Vector Classifier based on Federated Learning (SVCFL)

paper_url: http://arxiv.org/abs/2311.04606
repo_url: None
paper_authors: Ali Mohammadifar, Hasan Samadbin, Arman Daliri
for: 旨在提高autism诊断的准确率和效率，尤其是在诊断的 inicial stages 中。
methods: 使用 Federated Learning 方法和 Support Vector Classifier 方法，通过分析大量数据并找到不同人类评估者之间的 Patterns ，帮助确认诊断或 highlight 需要进一步测试的情况。
results: 在这种方法下，实现了 99% 的准确率和 13% 的提升。

Abstract
The path to an autism diagnosis can be long and difficult, and delays can have serious consequences. Artificial intelligence can completely change the way autism is diagnosed, especially when it comes to situations where it is difficult to see the first signs of the disease. AI-based diagnostic tools may help confirm a diagnosis or highlight the need for further testing by analyzing large volumes of data and uncovering patterns that may not be immediately apparent to human evaluators. After a successful and timely diagnosis, autism can be treated through artificial intelligence using various methods. In this article, by using four datasets and gathering them with the federated learning method and diagnosing them with the support vector classifier method, the early diagnosis of this disorder has been discussed. In this method, we have achieved 99% accuracy for predicting autism spectrum disorder and we have achieved 13% improvement in the results.

摘要
“患有自闭症的诊断路径可能很长，困难，延迟诊断可能会有严重的后果。人工智能可能将改变自闭症的诊断方式，尤其是在诊断难以看到早期症状的情况下。人工智能基本技术可以帮助确认诊断或发现更多的测试是必要的，通过分析大量数据并找到不容易被人类评估者所发现的模式。在这篇文章中，我们使用了四个数据集和聚合它们，使用联邦学习方法进行诊断，使用支持向量分类方法进行预测，实现了99%的自闭症诊断精度，并取得13%的提升。”Note: Please keep in mind that the translation is done by a machine and may not be perfect. It's always best to have a human translator to ensure the accuracy of the translation.

Zeroth-order Asynchronous Learning with Bounded Delays with a Use-case in Resource Allocation in Communication Networks

paper_url: http://arxiv.org/abs/2311.04604
repo_url: None
paper_authors: Pourya Behmandpoor, Marc Moonen, Panagiotis Patrinos
for: 这篇论文专门研究了多个代理合作实现分布式优化，具体是在各自有不同任务的情况下，代理通过互动来协同优化本地参数，以实现共同任务的最佳化。
methods: 该论文使用的方法包括分布式优化、异步学习和通信延迟等。
results: 论文提出了一种基于异步学习和分布式优化的方法，并提供了对该方法的分析和数学分析，以及一些实验结果，以证明该方法的有效性。

Abstract
Distributed optimization has experienced a significant surge in interest due to its wide-ranging applications in distributed learning and adaptation. While various scenarios, such as shared-memory, local-memory, and consensus-based approaches, have been extensively studied in isolation, there remains a need for further exploration of their interconnections. This paper specifically concentrates on a scenario where agents collaborate toward a unified mission while potentially having distinct tasks. Each agent's actions can potentially impact other agents through interactions. Within this context, the objective for the agents is to optimize their local parameters based on the aggregate of local reward functions, where only local zeroth-order oracles are available. Notably, the learning process is asynchronous, meaning that agents update and query their zeroth-order oracles asynchronously while communicating with other agents subject to bounded but possibly random communication delays. This paper presents theoretical convergence analyses and establishes a convergence rate for the proposed approach. Furthermore, it addresses the relevant issue of deep learning-based resource allocation in communication networks and conducts numerical experiments in which agents, acting as transmitters, collaboratively train their individual (possibly unique) policies to maximize a common performance metric.

摘要
分布式优化在分布式学习和适应中得到了广泛的关注，因为它们在各种应用场景中具有广泛的应用前景。虽然许多场景，如共享内存、本地内存和协议基本上的方法，在孤立的情况下得到了广泛的研究，但是还需要进一步探索这些场景之间的相互连接。这篇论文专门关注在多个代理 collaborate 以实现共同任务的情况下，每个代理的行为可能会影响其他代理的行为。在这个上下文中，代理的目标是在各自的本地参数上优化本地奖励函数，只有本地零次权重可用。它们的学习过程是异步的，meaning that agents update and query their zeroth-order oracles asynchronously while communicating with other agents subject to bounded but possibly random communication delays。这篇论文提供了理论的叠加分析和确定了提案的速度。此外，它还考虑了基于深度学习的通信网络资源分配问题，并进行了 numerically experiments in which agents, acting as transmitters, collaboratively train their individual (possibly unique) policies to maximize a common performance metric。

A Deep Learning Based Resource Allocator for Communication Systems with Dynamic User Utility Demands

paper_url: http://arxiv.org/abs/2311.04600
repo_url: None
paper_authors: Pourya Behmandpoor, Panagiotis Patrinos, Marc Moonen
for: 这篇论文目的是提出一种基于深度学习的资源分配算法，以满足用户的 utility 需求。
methods: 该算法使用了深度神经网络（DNN）来实现用户的 utility 需求调整，并在每个时间点进行了迭代优化算法来优化用户的在线状态。
results: 实验结果表明，该算法可以有效地满足用户的 utility 需求，并且可以在不同的场景下（如中央化和分布式场景）进行部署。

Abstract
Deep learning (DL) based resource allocation (RA) has recently gained a lot of attention due to its performance efficiency. However, most of the related studies assume an ideal case where the number of users and their utility demands, e.g., data rate constraints, are fixed and the designed DL based RA scheme exploits a policy trained only for these fixed parameters. A computationally complex policy retraining is required whenever these parameters change. Therefore, in this paper, a DL based resource allocator (ALCOR) is introduced, which allows users to freely adjust their utility demands based on, e.g., their application layer. ALCOR employs deep neural networks (DNNs), as the policy, in an iterative optimization algorithm. The optimization algorithm aims to optimize the on-off status of users in a time-sharing problem to satisfy their utility demands in expectation. The policy performs unconstrained RA (URA) -- RA without taking into account user utility demands -- among active users to maximize the sum utility (SU) at each time instant. Based on the chosen URA scheme, ALCOR can perform RA in a model-based or model-free manner and in a centralized or distributed scenario. Derived convergence analyses provide guarantees for the convergence of ALCOR, and numerical experiments corroborate its effectiveness.

摘要
深度学习（DL）基于资源分配（RA）在最近吸引了很多关注，因为它的性能效率很高。然而，大多数相关研究假设用户和他们的需求是固定的，而设计的DL基于RA schemes只是在这些固定参数下采用一个已经训练过的策略。因此，在这篇论文中，一种基于DL的资源分配器（ALCOR）被介绍，允许用户自由地调整他们的需求。ALCOR使用深度神经网络（DNN）作为策略，并在迭代优化算法中使用。这个算法的目标是在时间分享问题中，使用者的状态在每个时间点上进行优化，以满足他们的需求。策略在活动用户中进行无约RA（URA），即不考虑用户的需求来进行资源分配，以最大化每个时间点的总用户Utility（SU）。根据选择的URA方案，ALCOR可以在中央化或分布式环境中进行RA，并且可以采用模型基于或模型独立的方式进行RA。 derive的收敛分析提供了ALCOR的收敛性 guarantees，而数字实验证明了它的有效性。

Predicting Market Value in Professional Soccer: Insights from Explainable Machine Learning Models

paper_url: http://arxiv.org/abs/2311.04599
repo_url: None
paper_authors: Chunyang Huang, Shaoliang Zhang
for: 这个研究旨在预测职业足球运动员市场价值使用可解释机器学习模型。
methods: 我们使用FIFA官方网站Curated数据集，采用 ensemble机器学习方法并与SHAP添加itive exPlanations（SHAP）提供详细的模型预测解释。
results: GBDT模型在评估中获得最高的平均R-Squared值（0.8780）和最低的平均Root Mean Squared Error值（3,221,632.175），表明其在评估中的superior表现。我们的分析发现，球控、短传、完成、抢断、练习和攻击等技能在技能维度是关键的，而冲刺速度和加速在身体维度是关键的，而反应在认知维度是关键的。我们的结果提供了更准确、Objective和一致的市场价值估算框架，为管理层的转会决策提供有用的洞察。

Abstract
This study presents an innovative method for predicting the market value of professional soccer players using explainable machine learning models. Using a dataset curated from the FIFA website, we employ an ensemble machine learning approach coupled with Shapley Additive exPlanations (SHAP) to provide detailed explanations of the models' predictions. The GBDT model achieves the highest mean R-Squared (0.8780) and the lowest mean Root Mean Squared Error (3,221,632.175), indicating its superior performance among the evaluated models. Our analysis reveals that specific skills such as ball control, short passing, finishing, interceptions, dribbling, and tackling are paramount within the skill dimension, whereas sprint speed and acceleration are critical in the fitness dimension, and reactions are preeminent in the cognitive dimension. Our results offer a more accurate, objective, and consistent framework for market value estimation, presenting useful insights for managerial decisions in player transfers.

摘要
Note: "Simplified Chinese" is also known as "Mandarin Chinese" or "Standard Chinese".Translation notes:* "market value" is translated as "市场价值" (shìchǎng jīyà)* "professional soccer players" is translated as "职业足球运动员" (zhíyè zúqiú yùndòngyuán)* "explainable machine learning models" is translated as "可解释机器学习模型" (kějiěshì jīshì yùnxíng módelì)* "ensemble machine learning approach" is translated as "集成机器学习方法" (jíshì jīshì yùnxíng fāngshì)* "Shapley Additive exPlanations" is translated as "夏普利加法解释" (xiàpèlì jiāfāng jiěshì)* "GBDT model" is translated as " Gradient Boosting Decision Tree 模型" (Gradient Boosting Decision Tree módel)* "mean R-Squared" is translated as "平均R平方" (píngjì R píngfāng)* "mean Root Mean Squared Error" is translated as "平均根均方差" (píngjì gēnjì fāngbiān)* "specific skills" is translated as "特定技能" (tèqīng jìnéng)* "fitness dimension" is translated as "身体维度" (shēngrōng wéidù)* "cognitive dimension" is translated as "认知维度" (rènzhì wéidù)

Deep learning as a tool for quantum error reduction in quantum image processing

paper_url: http://arxiv.org/abs/2311.04575
repo_url: None
paper_authors: Krzysztof Werner, Kamil Wereszczyński, Rafał Potempa, Krzysztof Cyran
for: 这个论文的目的是提出一种基于生成对抗网络的图像识别方法，以减少图像编码使用LPIQE方法所受的总错误。
methods: 该方法使用生成对抗网络和phasdistortion unraveling方法来减少图像编码中的总错误。
results: 该方法可以成功减少图像编码中的总错误，并且可以保持图像的原始特征。

Abstract
Despite the limited availability and quantum volume of quantum computers, quantum image representation is a widely researched area. Currently developed methods use quantum entanglement to encode information about pixel positions. These methods range from using the angle parameter of the rotation gate (e.g., the Flexible Representation of Quantum Images, FRQI), sequences of qubits (e.g., Novel Enhanced Quantum Representation, NEQR), or the angle parameter of the phase shift gates (e.g., Local Phase Image Quantum Encoding, LPIQE) for storing color information. All these methods are significantly affected by decoherence and other forms of quantum noise, which is an inseparable part of quantum computing in the noisy intermediate-scale quantum era. These phenomena can highly influence the measurements and result in extracted images that are visually dissimilar to the originals. Because this process is at its foundation quantum, the computational reversal of this process is possible. There are many methods for error correction, mitigation, and reduction, but all of them use quantum computer time or additional qubits to achieve the desired result. We report the successful use of a generative adversarial network trained for image-to-image translation, in conjunction with Phase Distortion Unraveling error reduction method, for reducing overall error in images encoded using LPIQE.

摘要
尽管量子计算机的可用性和量子量有限，量子图像表示仍然是广泛研究的领域。目前已经开发出的方法利用量子Entanglement来编码图像像素的信息。这些方法包括使用旋转门的角度参数（如Flexible Representation of Quantum Images，FRQI）、顺序的qubits（如Novel Enhanced Quantum Representation，NEQR）或扩散门的角度参数（如Local Phase Image Quantum Encoding，LPIQE）来存储颜色信息。这些方法都受到干扰和其他量子噪声的影响，这些噪声是量子计算在不稳定中型量子时代的不可避免的一部分。这些现象会高度影响测量结果，导致提取的图像与原始图像显示不同。由于这是量子的基础，因此可以通过量子计算的计算反转来解决这个问题。有许多方法用于错误纠正、减轻和减少，但所有这些方法均需要使用量子计算机时间或额外的qubits来实现感兴趣的结果。我们报告了使用基于图像到图像翻译的生成 adversarial network，与扩散门错误降低法相结合，以降低LPIQE编码图像的总错误。

Information-Theoretic Generalization Bounds for Transductive Learning and its Applications

paper_url: http://arxiv.org/abs/2311.04561
repo_url: None
paper_authors: Huayi Tang, Yong Liu
for: 这个论文是为了研究逻辑学上的推导学习算法的通用化 bound，特别是在信息论上。
methods: 论文使用了数据依赖和算法依赖的通用化 bound，以及在推导学习算法中首次提出的拓扑supersamples概念。
results: 论文显示了在不同信息度下的推导学习算法的通用化 bound，并 deriv了 novel PAC-Bayesian bound和推导学习下的损失地形平坦性。最后，论文应用了结果到 semi-supervised learning 和图学习场景。

Abstract
In this paper, we develop data-dependent and algorithm-dependent generalization bounds for transductive learning algorithms in the context of information theory for the first time. We show that the generalization gap of transductive learning algorithms can be bounded by the mutual information between training labels and hypothesis. By innovatively proposing the concept of transductive supersamples, we go beyond the inductive learning setting and establish upper bounds in terms of various information measures. Furthermore, we derive novel PAC-Bayesian bounds and build the connection between generalization and loss landscape flatness under the transductive learning setting. Finally, we present the upper bounds for adaptive optimization algorithms and demonstrate the applications of results on semi-supervised learning and graph learning scenarios. Our theoretic results are validated on both synthetic and real-world datasets.

摘要
在本文中，我们为推导学习算法的泛化性提供了数据依赖和算法依赖的通用 bound，这是在信息理论中的首次。我们证明了推导学习算法的泛化差可以通过训练标签和假设之间的共识来Upper bound。通过提出了推导超额样本的概念，我们超出了 inductive 学习设定，并在不同的信息度量上建立了Upper bounds。此外，我们还 deriv了 PAC-Bayesian bound 和泛化与损失函数平坦性之间的连接。最后，我们给出了适应优化算法的Upper bounds，并在 semi-supervised 学习和图学习场景中应用了结果。我们的理论结果通过 synthetic 数据和实际数据进行验证。

Regression with Cost-based Rejection

paper_url: http://arxiv.org/abs/2311.04550
repo_url: None
paper_authors: Xin Cheng, Yuzhou Cao, Haobo Wang, Hongxin Wei, Bo An, Lei Feng
for: 本研究旨在解决 regression 问题中的 cost-based rejection 问题，其中模型可以根据 certain rejection costs 拒绝对某些示例进行预测。
methods: 我们首先将此问题转化为预测风险的问题，然后 deriv 出 Bayes 优化解决方案，其显示了在使用 mean squared error 评价指标时，优化的模型应该拒绝对 variance 大于 rejection cost 的示例进行预测。我们还提出了一种基于 surrogate loss function 的训练方法，并提供了模型一致性的条件，这意味着我们的提议的 surrogate loss 可以回归 Bayes 优化解决方案。
results: 我们的实验结果表明，我们的提议的方法可以有效地解决 regression 问题中的 cost-based rejection 问题。

Abstract
Learning with rejection is an important framework that can refrain from making predictions to avoid critical mispredictions by balancing between prediction and rejection. Previous studies on cost-based rejection only focused on the classification setting, which cannot handle the continuous and infinite target space in the regression setting. In this paper, we investigate a novel regression problem called regression with cost-based rejection, where the model can reject to make predictions on some examples given certain rejection costs. To solve this problem, we first formulate the expected risk for this problem and then derive the Bayes optimal solution, which shows that the optimal model should reject to make predictions on the examples whose variance is larger than the rejection cost when the mean squared error is used as the evaluation metric. Furthermore, we propose to train the model by a surrogate loss function that considers rejection as binary classification and we provide conditions for the model consistency, which implies that the Bayes optimal solution can be recovered by our proposed surrogate loss. Extensive experiments demonstrate the effectiveness of our proposed method.

摘要
Translated into Simplified Chinese:学习与拒绝是一个重要的框架，可以避免严重的预测错误，通过平衡预测和拒绝来做出决策。先前的研究仅将成本基于拒绝应用于分类设定，无法处理连续和无穷目标空间的回归设定。在这篇论文中，我们调查一种新的回归问题，即基于成本的拒绝回归，其中模型可以在某些示例上拒绝进行预测，给定某些拒绝成本。为解决这个问题，我们首先形式化预期风险，然后 derivate Bayes优化解决方案，其显示了优化模型应该在具有更大的方差时拒绝进行预测，当使用mean squared error作为评价指标时。此外，我们提议使用假设损失函数来训练模型，该损失函数考虑了拒绝为二分类问题，并提供了模型一致性的条件，这意味着我们的提议的代理损失可以回归到bayes优化解决方案。广泛的实验证明了我们的提议的有效性。

FEIR: Quantifying and Reducing Envy and Inferiority for Fair Recommendation of Limited Resources

paper_url: http://arxiv.org/abs/2311.04542
repo_url: https://github.com/aida-ugent/feir
paper_authors: Nan Li, Bo Kang, Jefrey Lijffijt, Tijl De Bie
for: 这篇论文主要是关于电子招聘和在线约会中的推荐系统，具体来说是研究一种新的公平度量表，以及一种基于这个公平度量表的多目标优化问题。
methods: 这篇论文提出了一种新的公平度量表，即“劣等”（inferiority），它 mesure 用户对推荐的Item的竞争性。同时，它还使用了“嫉妒”（envy）和“实用性”（utility）这两个已有的公平度量表，并将它们组合在一起。这些公平度量表都是非�ifferentiable的，因此 authors使用了概率解释 recommender systems 来将它们转换为可微分的版本。最后，authors将这些公平度量表组合在一起，形成了一个多目标优化问题 called \texttt{FEIR}（Fairness through Envy and Inferiority Reduction）。
results: experiments 表明，这种方法可以在 synthetic 和实际数据上提高推荐系统的公平性，特别是在劣等和嫉妒方面。在这些实验中， authors 使用了标准的推荐系统作为基础，并对它们进行了post-processing。

Abstract
In settings such as e-recruitment and online dating, recommendation involves distributing limited opportunities, calling for novel approaches to quantify and enforce fairness. We introduce \emph{inferiority}, a novel (un)fairness measure quantifying a user's competitive disadvantage for their recommended items. Inferiority complements \emph{envy}, a fairness notion measuring preference for others' recommendations. We combine inferiority and envy with \emph{utility}, an accuracy-related measure of aggregated relevancy scores. Since these measures are non-differentiable, we reformulate them using a probabilistic interpretation of recommender systems, yielding differentiable versions. We combine these loss functions in a multi-objective optimization problem called \texttt{FEIR} (Fairness through Envy and Inferiority Reduction), applied as post-processing for standard recommender systems. Experiments on synthetic and real-world data demonstrate that our approach improves trade-offs between inferiority, envy, and utility compared to naive recommendations and the baseline methods.

摘要
在电子招聘和在线约会等设置中，推荐具有有限的机会分配，需要开发新的方法来衡量和实施公平。我们介绍了一种新的不公平度量量，称为“劣等”（inferiority），它衡量用户对推荐的项目的竞争性劣势。劣等与嫉妒（envy）和用户性（utility）这三种度量相结合，通过一种多目标优化问题called \texttt{FEIR}（公平性通过劣等和嫉妒减少）来实现。我们将这些损失函数转换为可导的形式，并应用于标准推荐系统的后处理。实验表明，我们的方法可以在劣等、嫉妒和用户性之间进行更好的质量平衡，相比于直观推荐和基准方法。

Deep Learning Assisted Multiuser MIMO Load Modulated Systems for Enhanced Downlink mmWave Communications

paper_url: http://arxiv.org/abs/2311.04537
repo_url: None
paper_authors: Ercong Yu, Jinle Zhu, Qiang Li, Zilong Liu, Hongyang Chen, Shlomo Shamai, H. Vincent Poor
for: 这个论文关注的是多用户负载调制阵列（MU-LMA），它们具有低系统复杂度和成本，适用于 millimeter wave（mmWave）多输入多出口（MIMO）系统。
methods: 这篇论文提出了两种算法：一种是基于全面阵列结构（FAS）的传输器，另一种是基于深度学习（Deep Learning）的阵列设计和独立解码算法。
results: 论文显示了这两种算法的性能是 Robust to imperfect channel state information（CSI）和具有低复杂性的信号检测。另外，使用深度学习阵列设计算法可以适应不同的频率响应。

Abstract
This paper is focused on multiuser load modulation arrays (MU-LMAs) which are attractive due to their low system complexity and reduced cost for millimeter wave (mmWave) multi-input multi-output (MIMO) systems. The existing precoding algorithm for downlink MU-LMA relies on a sub-array structured (SAS) transmitter which may suffer from decreased degrees of freedom and complex system configuration. Furthermore, a conventional LMA codebook with codewords uniformly distributed on a hypersphere may not be channel-adaptive and may lead to increased signal detection complexity. In this paper, we conceive an MU-LMA system employing a full-array structured (FAS) transmitter and propose two algorithms accordingly. The proposed FAS-based system addresses the SAS structural problems and can support larger numbers of users. For LMA-imposed constant-power downlink precoding, we propose an FAS-based normalized block diagonalization (FAS-NBD) algorithm. However, the forced normalization may result in performance degradation. This degradation, together with the aforementioned codebook design problems, is difficult to solve analytically. This motivates us to propose a Deep Learning-enhanced (FAS-DL-NBD) algorithm for adaptive codebook design and codebook-independent decoding. It is shown that the proposed algorithms are robust to imperfect knowledge of channel state information and yield excellent error performance. Moreover, the FAS-DL-NBD algorithm enables signal detection with low complexity as the number of bits per codeword increases.

摘要
In this paper, we propose an MU-LMA system using a full-array structured (FAS) transmitter, which addresses the SAS structural problems and can support larger numbers of users. For LMA-imposed constant-power downlink precoding, we propose an FAS-based normalized block diagonalization (FAS-NBD) algorithm. However, the forced normalization may result in performance degradation. This degradation, together with the aforementioned codebook design problems, is difficult to solve analytically.To address these issues, we propose a Deep Learning-enhanced (FAS-DL-NBD) algorithm for adaptive codebook design and codebook-independent decoding. The proposed algorithm uses deep learning to optimize the codebook design and decoding process, which can improve the performance of the system. Moreover, the FAS-DL-NBD algorithm enables signal detection with low complexity as the number of bits per codeword increases.The proposed algorithms are robust to imperfect knowledge of channel state information and yield excellent error performance. The use of deep learning in the FAS-DL-NBD algorithm enables the system to adapt to changing channel conditions and improve its performance. Additionally, the proposed algorithms have low complexity, which makes them suitable for practical applications.

An Unsupervised Deep Learning Approach for the Wave Equation Inverse Problem

paper_url: http://arxiv.org/abs/2311.04531
repo_url: None
paper_authors: Xiong-Bin Yan, Keke Wu, Zhi-Qin John Xu, Zheng Ma
for: 实时地球物理几何问题的高精度解析
methods: integrate deep neural networks and partial differential equations for solving full-waveform inversion problems
results: 提供了一个无监督学习的方法，可以实时地从测量数据中推断地球物理几何 Parameters，并且与传统方法比较，获得更好的结果。

Abstract
Full-waveform inversion (FWI) is a powerful geophysical imaging technique that infers high-resolution subsurface physical parameters by solving a non-convex optimization problem. However, due to limitations in observation, e.g., limited shots or receivers, and random noise, conventional inversion methods are confronted with numerous challenges, such as the local-minimum problem. In recent years, a substantial body of work has demonstrated that the integration of deep neural networks and partial differential equations for solving full-waveform inversion problems has shown promising performance. In this work, drawing inspiration from the expressive capacity of neural networks, we provide an unsupervised learning approach aimed at accurately reconstructing subsurface physical velocity parameters. This method is founded on a re-parametrization technique for Bayesian inference, achieved through a deep neural network with random weights. Notably, our proposed approach does not hinge upon the requirement of the labeled training dataset, rendering it exceedingly versatile and adaptable to diverse subsurface models. Extensive experiments show that the proposed approach performs noticeably better than existing conventional inversion methods.

摘要
全波形推敲（FWI）是一种强大的地球物理成像技术，可以高精度地推算地下物理参数。然而，由于观测限制，如有限的发射器或接收器，以及随机噪声，传统的推敲方法面临着许多挑战，如地点最优化问题。在过去几年，大量的研究表明，将深度神经网络和部分偏微分方程相结合，可以对全波形推敲问题提供有前所未有的表现。在这篇文章中，我们Drawing inspiration from the expressive capacity of neural networks，提出了一种无监督学习方法，用于准确地重建地下物理速度参数。这种方法基于 Bayesian 推敲技术，通过深度神经网络的随机权重来实现。吸引地，我们的提议方法不需要训练数据集，因此非常灵活和适应性强，适用于多种地下模型。经过广泛的实验，我们发现，我们的方法与传统推敲方法相比，表现出了明显的优势。

Bandit Learning to Rank with Position-Based Click Models: Personalized and Equal Treatments

paper_url: http://arxiv.org/abs/2311.04528
repo_url: None
paper_authors: Tianchen Zhou, Jia Liu, Yang Jiao, Chaosheng Dong, Yetian Chen, Yan Gao, Yi Sun
for: 这个论文主要针对的问题是在线学习排名（ONL2R），它是推荐系统的基础问题，在过去几年内受到了越来越多的关注。
methods: 这篇论文提出了一种基于多手枪投机（MAB）框架和位置基于点击模型（PCM）的ONL2R模型。但是，开发基于MAB的ONL2R策略是非常具有挑战性，因为这个问题的 combinatorial 性和部分可见性。
results: 这篇论文的主要贡献包括三个方面：一是提出了第一个涵盖所有ONL2R特点的MAB框架；二是基于上述分析框架，开发了两种统一的贪婪策略和UCBRank策略，可以应用于个性化和平等的排名待遇；三是证明了这两种策略在个性化和平等的排名待遇下都能够获得$O(\sqrt{t}\ln t)$和$O(\sqrt{t\ln t})$的任何线性 regret。此外，对于基本难以解决的平等排名待遇，我们也identified了一些集合性的用途函数和其相应的充分条件，under which GreedyRank和UCBRank策略可以在$O(\sqrt{t}\ln t)$和$O(\sqrt{t\ln t})$任何线性 regret下实现优化。

Abstract
Online learning to rank (ONL2R) is a foundational problem for recommender systems and has received increasing attention in recent years. Among the existing approaches for ONL2R, a natural modeling architecture is the multi-armed bandit framework coupled with the position-based click model. However, developing efficient online learning policies for MAB-based ONL2R with position-based click models is highly challenging due to the combinatorial nature of the problem, and partial observability in the position-based click model. To date, results in MAB-based ONL2R with position-based click models remain rather limited, which motivates us to fill this gap in this work. Our main contributions in this work are threefold: i) We propose the first general MAB framework that captures all key ingredients of ONL2R with position-based click models. Our model considers personalized and equal treatments in ONL2R ranking recommendations, both of which are widely used in practice; ii) Based on the above analytical framework, we develop two unified greed- and UCB-based policies called GreedyRank and UCBRank, each of which can be applied to personalized and equal ranking treatments; and iii) We show that both GreedyRank and UCBRank enjoy $O(\sqrt{t}\ln t)$ and $O(\sqrt{t\ln t})$ anytime sublinear regret for personalized and equal treatment, respectively. For the fundamentally hard equal ranking treatment, we identify classes of collective utility functions and their associated sufficient conditions under which $O(\sqrt{t}\ln t)$ and $O(\sqrt{t\ln t})$ anytime sublinear regrets are still achievable for GreedyRank and UCBRank, respectively. Our numerical experiments also verify our theoretical results and demonstrate the efficiency of GreedyRank and UCBRank in seeking the optimal action under various problem settings.

摘要
“在线学习排名（ONL2R）是推荐系统的基础问题，在最近的几年中受到了显著的注意。 exist 的方法 для ONL2R 中，一个自然的模型建构是多臂矢量机制（MAB）与位置基于的点击模型。然而，为了开发具有高效性的在线学习策略，MAB 基于 ONL2R WITH 位置基于的点击模型是非常困难的，因为这问题的 combinatorial 性和点击模型中的假设不完整。到目前为止，关于 MAB 基于 ONL2R WITH 位置基于的点击模型的结果仍然很有限，这鼓使我们填补这个差距。我们的主要贡献是三个：1. 我们提出了第一个涵盖所有关键成分的 ONL2R 基于 MAB 框架。我们的模型考虑了对推荐的个性化和平等对待，这两种方法在实践中广泛使用；2. 基于上述分析框架，我们开发了两种应用于个性化和平等对待的统一的牛顿策略和 UCB 策略，它们分别被称为 GreedyRank 和 UCBRank；3. 我们证明了 GreedyRank 和 UCBRank 在个性化和平等对待下都具有 $O(\sqrt{t}\ln t)$ 和 $O(\sqrt{t\ln t})$ 任何时间斜线 regret，对应的是，我们还证明了这两种策略在某些特定的问题设定下可以获得 $O(\sqrt{t}\ln t)$ 和 $O(\sqrt{t\ln t})$ 任何时间斜线 regret。我们的实验也验证了我们的理论结果，并证明了 GreedyRank 和 UCBRank 在不同的问题设定下能够有效地寻找优化的动作。”

Long-term Time Series Forecasting based on Decomposition and Neural Ordinary Differential Equations

paper_url: http://arxiv.org/abs/2311.04522
repo_url: None
paper_authors: Seonkyu Lim, Jaehyeon Park, Seojin Kim, Hyowon Wi, Haksoo Lim, Jinsung Jeon, Jeongwhan Choi, Noseong Park
for: 该论文旨在解决长期时间序列预测（LTSF） tasks 中的挑战，具体来说是Linear-based LTSF 模型在实际应用中的表现不佳，主要归因于 transformer-based 方法中的时间信息损失。
methods: 该论文提出了 LTSF-DNODE 模型，其基于线性常微分方程（ODEs）和时间序列分解方法，可以充分利用数据特点。
results: 论文表明 LTSF-DNODE 模型在多个实际数据集上表现出色，并且对每个数据集进行了常微分方程（NODE）核心 régularization 的影响分析。

Abstract
Long-term time series forecasting (LTSF) is a challenging task that has been investigated in various domains such as finance investment, health care, traffic, and weather forecasting. In recent years, Linear-based LTSF models showed better performance, pointing out the problem of Transformer-based approaches causing temporal information loss. However, Linear-based approach has also limitations that the model is too simple to comprehensively exploit the characteristics of the dataset. To solve these limitations, we propose LTSF-DNODE, which applies a model based on linear ordinary differential equations (ODEs) and a time series decomposition method according to data statistical characteristics. We show that LTSF-DNODE outperforms the baselines on various real-world datasets. In addition, for each dataset, we explore the impacts of regularization in the neural ordinary differential equation (NODE) framework.

摘要
长期时间序列预测（LTSF）是一项复杂的任务，在不同领域如金融投资、医疗、交通和天气预测中都有广泛的研究。最近几年，线性基于的 LTSF 模型表现更好，表明了 transformer 基于的方法导致时间信息损失的问题。然而，线性基于的方法也有局限性，模型太简单，无法全面利用数据集的特点。为解决这些局限性，我们提出了 LTSF-DNODE，它利用基于线性常微分方程（ODEs）的模型和根据数据统计特点的时间序列分解方法。我们显示了 LTSF-DNODE 在多个实际 datasets 上的超越基准点。此外，对于每个数据集，我们还探讨了 NODE 框架中的规则化的影响。

Adaptive Mirror Descent Bilevel Optimization

paper_url: http://arxiv.org/abs/2311.04520
repo_url: None
paper_authors: Feihu Huang
for: 本文提出了一类高效的适应双层方法，用于非 convex 双层优化问题，其上层问题是非 convex 可能具有不规则化的正则化，下层问题是非 convex 问题，满足波佩-{\L}ojasiewicz（PL）条件。
methods: 我们提出了一种高效的适应投影帮助梯度（i.e., AdaPAG）方法，基于镜像下降，并证明其可以在非 convex 双层问题中获得最佳known的梯度复杂度为 $O(\epsilon^{-1})$，用于找到 $\epsilon$-定点解。我们还提出了一种高效的适应随机投影帮助梯度（i.e., AdaVSPAG）方法，基于镜像下降和减少噪声技术，并证明其可以在非 convex 双层问题中获得最佳known的梯度复杂度为 $O(\epsilon^{-3/2})$，用于找到 $\epsilon$-定点解。
results: 我们提供了一种有用的 convergence 分析框架，用于我们的方法，并证明其在某些轻量级的假设下具有 $O(\frac{1}{T})$ 的快速收敛率，其中 $T$ 是迭代次数。

Abstract
In the paper, we propose a class of efficient adaptive bilevel methods based on mirror descent for nonconvex bilevel optimization, where its upper-level problem is nonconvex possibly with nonsmooth regularization, and its lower-level problem is also nonconvex while satisfies Polyak-{\L}ojasiewicz (PL) condition. To solve these deterministic bilevel problems, we present an efficient adaptive projection-aid gradient (i.e., AdaPAG) method based on mirror descent, and prove that it obtains the best known gradient complexity of $O(\epsilon^{-1})$ for finding an $\epsilon$-stationary solution of nonconvex bilevel problems. To solve these stochastic bilevel problems, we propose an efficient adaptive stochastic projection-aid gradient (i.e., AdaVSPAG) methods based on mirror descent and variance-reduced techniques, and prove that it obtains the best known gradient complexity of $O(\epsilon^{-3/2})$ for finding an $\epsilon$-stationary solution. Since the PL condition relaxes the strongly convex, our algorithms can be used to nonconvex strongly-convex bilevel optimization. Theoretically, we provide a useful convergence analysis framework for our methods under some mild conditions, and prove that our methods have a fast convergence rate of $O(\frac{1}{T})$, where $T$ denotes the number of iterations.

摘要
文章中，我们提出了一种高效的适应镜架方法，用于非 convex 双层优化问题，其中上层问题是非 convex 可能具有非细ooth 规范，而下层问题是非 convex 问题，满足波佩-{\L}ojasiewicz（PL）条件。为解决这些权重 determine 的双层问题，我们提出了一种高效的适应投影帮助梯度（i.e., AdaPAG）方法，基于镜架 descent，并证明其可以在 $O(\epsilon^{-1})$ 时间内找到非 convex 双层问题的 $\epsilon$-稳定解。为解决这些随机双层问题，我们提出了一种高效的适应随机投影帮助梯度（i.e., AdaVSPAG）方法，基于镜架 descent 和减少噪声技术，并证明其可以在 $O(\epsilon^{-3/2})$ 时间内找到非 convex 双层问题的 $\epsilon$-稳定解。由于 PL 条件放宽了强式 convex，我们的算法可以应用于非 convex 强式 convex 双层优化问题。我们提供了一个有用的 convergence 分析框架，并证明我们的方法在某些轻微条件下具有 $O(\frac{1}{T})$ 的快速收敛率，where $T$ 表示迭代次数。

Towards Democratizing AI: A Comparative Analysis of AI as a Service Platforms and the Open Space for Machine Learning Approach

paper_url: http://arxiv.org/abs/2311.04518
repo_url: None
paper_authors: Dennis Rall, Bernhard Bauer, Thomas Fraunholz
for: 本研究旨在推动人工智能的普及和宣扬，但现有的AIaaS平台仍有一定的障碍，因此本研究旨在 Comparing several popular AI-as-a-Service platforms and identifying the key requirements for a platform that can achieve true democratization of AI.
methods: 本研究使用了多种现代技术，如Kubernetes、Kubeflow Pipelines和Ludwig，以 overcome the challenges of democratizing AI.
results: 本研究的分析显示，自主主机选项、高可扩展性和开源性是普及人工智能的关键要求。此外，我们的方法比现有的AIaaS平台更加全面和有效地满足了普及人工智能的需求。

Abstract
Recent AI research has significantly reduced the barriers to apply AI, but the process of setting up the necessary tools and frameworks can still be a challenge. While AI-as-a-Service platforms have emerged to simplify the training and deployment of AI models, they still fall short of achieving true democratization of AI. In this paper, we aim to address this gap by comparing several popular AI-as-a-Service platforms and identifying the key requirements for a platform that can achieve true democratization of AI. Our analysis highlights the need for self-hosting options, high scalability, and openness. To address these requirements, we propose our approach: the "Open Space for Machine Learning" platform. Our platform is built on cutting-edge technologies such as Kubernetes, Kubeflow Pipelines, and Ludwig, enabling us to overcome the challenges of democratizing AI. We argue that our approach is more comprehensive and effective in meeting the requirements of democratizing AI than existing AI-as-a-Service platforms.

摘要
现代人工智能研究已经大幅降低了应用人工智能的门槛，但是设置必要的工具和框架仍然是一个挑战。而AIaaS平台已经出现以简化训练和部署人工智能模型的过程，但它们仍然无法实现真正的人工智能民主化。在这篇论文中，我们想要解决这个差距，我们对多个流行的AIaaS平台进行比较，并确定了实现真正的民主化人工智能的关键需求。我们的分析表明，自主主机、可扩展性和开放性是必需的。为了解决这些需求，我们提出了我们的方法：“机器学习开放空间”平台。我们的平台基于最新的技术，如Kubernetes、Kubeflow Pipelines和Ludwig，使我们能够超越民主化人工智能的挑战。我们认为，我们的方法比现有的AIaaS平台更加全面和有效地满足了民主化人工智能的需求。

Strategies for Parallelizing the Big-Means Algorithm: A Comprehensive Tutorial for Effective Big Data Clustering

paper_url: http://arxiv.org/abs/2311.04517
repo_url: None
paper_authors: Ravil Mussabayev, Rustam Mussabayev
for: 这个研究旨在优化大量数据集 clustering 的 Big-means 算法，探讨了四种不同的并行策略。
methods: 研究采用了多种并行策略，并进行了广泛的实验测试，以评估每种方法的计算效率、可扩展性和归一化性。
results: 研究发现了不同并行策略的优缺点，并分析了各种因素对归一化质量的影响。这些发现可以为选择最佳并行策略提供实践性的指导。

Abstract
This study focuses on the optimization of the Big-means algorithm for clustering large-scale datasets, exploring four distinct parallelization strategies. We conducted extensive experiments to assess the computational efficiency, scalability, and clustering performance of each approach, revealing their benefits and limitations. The paper also delves into the trade-offs between computational efficiency and clustering quality, examining the impacts of various factors. Our insights provide practical guidance on selecting the best parallelization strategy based on available resources and dataset characteristics, contributing to a deeper understanding of parallelization techniques for the Big-means algorithm.

摘要
Note: Simplified Chinese is also known as "Mandarin" or "Guoyu".Translation notes:* "Big-means" is translated as "大均值算法" (dà jù yù suàn fǎ)* "clustering" is translated as "分类" (fēn xì)* "large-scale datasets" is translated as "大规模数据集" (dà xiàng móu dà tóu)* "parallelization strategies" is translated as "并行策略" (bèng xíng cè lü)* "computational efficiency" is translated as "计算效率" (jì suan xiǎng lü)* "scalability" is translated as "可扩展性" (kě kē zhòng xiǎng xìng)* "clustering performance" is translated as "分类性能" (fēn xì de yè ning)

Solution of FPK Equation for Stochastic Dynamics Subjected to Additive Gaussian Noise via Deep Learning Approach

paper_url: http://arxiv.org/abs/2311.04511
repo_url: None
paper_authors: Amir H. Khodabakhsh, Seid H. Pourtakdoust
For: 这个论文是为了解决高维度泊松方程（FPK equation）的解决方法，以及使用物理法律来塑造深度学习网络（FPK-DP Net）。* Methods: 该论文提出了一种名为FPK-DP Net的物理学习网络，该网络可以解决高维度泊松方程（FPK equation）的density演化问题，不需要任何先前的数据 simulate。* Results: 该论文通过对五个 benchmark 问题的数学实现来证明FPK-DP Net的准确性和效率。

Abstract
The Fokker-Plank-Kolmogorov (FPK) equation is an idealized model representing many stochastic systems commonly encountered in the analysis of stochastic structures as well as many other applications. Its solution thus provides an invaluable insight into the performance of many engineering systems. Despite its great importance, the solution of the FPK equation is still extremely challenging. For systems of practical significance, the FPK equation is usually high dimensional, rendering most of the numerical methods ineffective. In this respect, the present work introduces the FPK-DP Net as a physics-informed network that encodes the physical insights, i.e. the governing constrained differential equations emanated out of physical laws, into a deep neural network. FPK-DP Net is a mesh-free learning method that can solve the density evolution of stochastic dynamics subjected to additive white Gaussian noise without any prior simulation data and can be used as an efficient surrogate model afterward. FPK-DP Net uses the dimension-reduced FPK equation. Therefore, it can be used to address high-dimensional practical problems as well. To demonstrate the potential applicability of the proposed framework, and to study its accuracy and efficacy, numerical implementations on five different benchmark problems are investigated.

摘要
《福克-普朗-科洛果罗夫（FPK）方程》是一种理想化的模型，表示了许多随机系统的分析，以及许多其他应用。其解决方案可以提供对许多工程系统的性能进行各种可观的了解。然而，尽管其重要性，FPK方程的解决仍然非常困难。实际应用中的FPK方程通常具有高维度，使得大多数数值方法无法应用。在这个意义上，当前的工作引入了FPK-DP网，这是一种嵌入物理知识的深度学习网络。FPK-DP网是一种不含网格的学习方法，可以在没有任何先 simulations 数据的情况下解决涉及加法白噪声的随机动力学性能演化。因此，它可以用于解决实际应用中的高维度问题。为了证明提案的框架的可应用性和精度，这里进行了五个不同的标准问题的数值实现。

Constrained Adaptive Attacks: Realistic Evaluation of Adversarial Examples and Robust Training of Deep Neural Networks for Tabular Data

paper_url: http://arxiv.org/abs/2311.04503
repo_url: None
paper_authors: Thibault Simonetto, Salah Ghamizi, Antoine Desjardins, Maxime Cordy, Yves Le Traon
for: 本研究的目的是评估深度学习模型在表格数据上的抗击力，以及在不同攻击者能力水平下的抗击性能。
methods: 本研究提出了CAA，首个适用于受限表格深度学习模型的有效谋敌攻击。CAA 是一种具有迭代和搜索特性的参数自适应攻击，可以在约束下生成攻击示例。
results: 我们使用CAA 建立了表格深度学习模型的抗击性 benchmark，并在三个受欢迎的应用场景（信用评分、钓鱼攻击和 botnet 攻击响应）上进行了评估。我们的 benchmark 支持了十个威胁模型，每个模型都具有不同的攻击能力水平。我们的结果表明，预测环境、对抗训练和攻击预算等因素对深度表格模型的抗击性评估具有重要影响，并提供了一些安全培训的建议，以提高深度表格模型对不同谋敌攻击场景的抗击性能。

Abstract
State-of-the-art deep learning models for tabular data have recently achieved acceptable performance to be deployed in industrial settings. However, the robustness of these models remains scarcely explored. Contrary to computer vision, there is to date no realistic protocol to properly evaluate the adversarial robustness of deep tabular models due to intrinsic properties of tabular data such as categorical features, immutability, and feature relationship constraints. To fill this gap, we propose CAA, the first efficient evasion attack for constrained tabular deep learning models. CAA is an iterative parameter-free attack that combines gradient and search attacks to generate adversarial examples under constraints. We leverage CAA to build a benchmark of deep tabular models across three popular use cases: credit scoring, phishing and botnet attacks detection. Our benchmark supports ten threat models with increasing capabilities of the attacker, and reflects real-world attack scenarios for each use case. Overall, our results demonstrate how domain knowledge, adversarial training, and attack budgets impact the robustness assessment of deep tabular models and provide security practitioners with a set of recommendations to improve the robustness of deep tabular models against various evasion attack scenarios.

摘要
现代深度学习模型在表格数据上已经达到了工业级别的性能，但是这些模型的Robustness（鲁棒性）仍然很少研究。与计算机视觉不同，到目前为止没有任何实际协议来评估深度表格模型的攻击Robustness，这是因为表格数据的内在特性，如分类特征、不可变性和特征关系约束。为了填补这个空白，我们提出了CAA，第一个高效的约束深度表格模型攻击方法。CAA是一种具有迭代和搜索特性的参数自由攻击方法，可以在约束下生成攻击示例。我们利用CAA来建立了深度表格模型的三大应用场景的benchmark：信用评估、钓鱼和恶意botnet攻击检测。我们的benchmark支持了十种威胁模型，每种威胁模型都有不同的攻击能力，能够反映现实世界的攻击enario。总的来说，我们的结果表明了domain知识、对抗训练和攻击预算对深度表格模型的Robustness评估产生了深观影响，并提供了一组建议，以提高深度表格模型对不同攻击enario的Robustness。

Autonomous Advanced Aerial Mobility – An End-to-end Autonomy Framework for UAVs and Beyond

paper_url: http://arxiv.org/abs/2311.04472
repo_url: None
paper_authors: Sakshi Mishra, Praveen Palanisamy
for: 本研究旨在开拓全自动无人飞行器在交通领域的应用，包括城市空中交通、快递、监测等领域。
methods: 本文提出了一个扩展和可扩展的自主框架，包括感知、识别、规划和控制四个主要块，以实现全自动无人飞行器的飞行和任务执行。
results: 本文对多种应用场景进行了分析和评估，并探讨了多个自主飞行器群体操作和管理的挑战和机遇，以及自主飞行系统的测试、验证和证明问题。

Abstract
Developing aerial robots that can both safely navigate and execute assigned mission without any human intervention - i.e., fully autonomous aerial mobility of passengers and goods - is the larger vision that guides the research, design, and development efforts in the aerial autonomy space. However, it is highly challenging to concurrently operationalize all types of aerial vehicles that are operating fully autonomously sharing the airspace. Full autonomy of the aerial transportation sector includes several aspects, such as design of the technology that powers the vehicles, operations of multi-agent fleets, and process of certification that meets stringent safety requirements of aviation sector. Thereby, Autonomous Advanced Aerial Mobility is still a vague term and its consequences for researchers and professionals are ambiguous. To address this gap, we present a comprehensive perspective on the emerging field of autonomous advanced aerial mobility, which involves the use of unmanned aerial vehicles (UAVs) and electric vertical takeoff and landing (eVTOL) aircraft for various applications, such as urban air mobility, package delivery, and surveillance. The article proposes a scalable and extensible autonomy framework consisting of four main blocks: sensing, perception, planning, and controls. Furthermore, the article discusses the challenges and opportunities in multi-agent fleet operations and management, as well as the testing, validation, and certification aspects of autonomous aerial systems. Finally, the article explores the potential of monolithic models for aerial autonomy and analyzes their advantages and limitations. The perspective aims to provide a holistic picture of the autonomous advanced aerial mobility field and its future directions.

摘要
发展可以安全地自动导航并完成任务无需人类干预的天空机器人，即完全自动天空 mobilit y是研究、设计和开发努力的大方向。然而，同时操作完全自动的所有天空车辆共享空间具有极高的挑战性。全自动天空交通部门的完整性包括技术 powers 车辆的设计、多 Agent 队列的运行和遵循严格的航空领域安全要求的证明过程。因此，自主高级天空 mobilit y仍然是一个抽象的概念，其对研究人员和专业人员的影响是模糊的。为了解决这个漏洞，我们提供了一篇全面的emerging field 自主高级天空 mobilit y的视角，该视角包括使用无人航空器（UAV）和电动升降起降机（eVTOL）机器人进行各种应用，如城市空中交通、快递和监测。文章提出了可扩展和可靠的自主框架，由四个主要块组成：感知、识别、规划和控制。此外，文章还讨论了多 Agent 队列操作和管理的挑战和机遇，以及自主天空系统的测试、验证和证明方面的问题。最后，文章探讨了monolithic 模型在天空自主性方面的优点和局限性。该视角旨在提供自主高级天空 mobilit y领域的总体图景和未来方向。

Solving High Frequency and Multi-Scale PDEs with Gaussian Processes

paper_url: http://arxiv.org/abs/2311.04465
repo_url: None
paper_authors: Shikai Fang, Madison Cooley, Da Long, Shibo Li, Robert Kirby, Shandian Zhe
For: The paper aims to improve the accuracy and efficiency of solving partial differential equations (PDEs) using machine learning-based methods, specifically physics-informed neural networks (PINNs), by addressing the problem of spectral bias during training.* Methods: The authors resort to the Gaussian process (GP) framework and model the power spectrum of the PDE solution using a student t mixture or Gaussian mixture. They apply the inverse Fourier transform to obtain the covariance function and use the Jeffreys prior to estimate the mixture weights in the log domain. They also place collocation points on a grid and use the GP conditional mean to predict the solution and its derivatives.* Results: The authors show the advantage of their method in systematic experiments, including the ability to capture high-frequency and multi-scale PDEs without spectral bias, and the promotion of computational efficiency and scalability using Kronecker product properties and multilinear algebra.Here is the same information in Simplified Chinese:
for: 该文章目的是使用机器学习基于方法解决部分偏微分方程(PDEs)的精度和效率问题，特别是physics-informed neural networks (PINNs) 中的频谱偏迷问题。
methods: 作者们使用 Gaussian process (GP) 框架，模型部分偏微分方程解的能量谱，使用 student t 混合或 Gaussian 混合模型。他们应用反傅敛变换获取协方差函数，并使用 Jeffreys 先验来估算混合重量。
results: 作者们在系统性实验中展示了他们的方法的优势，包括不受频谱偏迷、高频和多尺度PDEs 的解决，以及使用 Kronecker 乘积性质和多线性代数提高计算效率和可扩展性。

Abstract
Machine learning based solvers have garnered much attention in physical simulation and scientific computing, with a prominent example, physics-informed neural networks (PINNs). However, PINNs often struggle to solve high-frequency and multi-scale PDEs, which can be due to spectral bias during neural network training. To address this problem, we resort to the Gaussian process (GP) framework. To flexibly capture the dominant frequencies, we model the power spectrum of the PDE solution with a student t mixture or Gaussian mixture. We then apply the inverse Fourier transform to obtain the covariance function (according to the Wiener-Khinchin theorem). The covariance derived from the Gaussian mixture spectrum corresponds to the known spectral mixture kernel. We are the first to discover its rationale and effectiveness for PDE solving. Next,we estimate the mixture weights in the log domain, which we show is equivalent to placing a Jeffreys prior. It automatically induces sparsity, prunes excessive frequencies, and adjusts the remaining toward the ground truth. Third, to enable efficient and scalable computation on massive collocation points, which are critical to capture high frequencies, we place the collocation points on a grid, and multiply our covariance function at each input dimension. We use the GP conditional mean to predict the solution and its derivatives so as to fit the boundary condition and the equation itself. As a result, we can derive a Kronecker product structure in the covariance matrix. We use Kronecker product properties and multilinear algebra to greatly promote computational efficiency and scalability, without any low-rank approximations. We show the advantage of our method in systematic experiments.

摘要
《机器学习基于的解决方案在物理 simulations 和科学计算中备受推崇，例如物理 Informed Neural Networks (PINNs)。然而，PINNs 经常遇到高频和多核频率的 PDE 问题，这可能是在 neural network 训练过程中的 spectral bias。为解决这个问题，我们转而使用 Gaussian Process (GP) 框架。我们使用学生 t 混合或 Gaussian 混合来模型 PDE 解的能量谱，然后使用 inverse Fourier transform 获得 covariance function（根据 Wiener-Khinchin 定理）。这个 covariance 函数与知名的 spectral mixture kernel 相同。我们是第一个发现其理由和有效性，并且在 PDE 解中应用。接下来，我们在循环域中估算混合 веса，我们显示这等价于在 Jeffreys 前导下进行估算。这会自动强制简洁，抑制过分频率，并调整剩下的频率向真实值。三、我们使用 GP 的 conditional mean 来预测解和其导函数，以拟合边界条件和方程本身。因此，我们可以 derive 一个 Kronecker 乘积结构在 covariance 矩阵中。我们使用 Kronecker 乘积属性和多线性代数，不需要低级 approximation，可以大幅提高计算效率和可扩展性。我们在系统性实验中展示了我们的方法的优势。》Note: The translation is in Simplified Chinese, which is a standardized form of Chinese used in mainland China and widely used in education, media, and other formal contexts. If you need the translation in Traditional Chinese, please let me know.

A Hierarchical Spatial Transformer for Massive Point Samples in Continuous Space

paper_url: http://arxiv.org/abs/2311.04434
repo_url: https://github.com/spatialdatasciencegroup/hst
paper_authors: Wenchong He, Zhe Jiang, Tingsong Xiao, Zelin Xu, Shigang Chen, Ronald Fick, Miles Medina, Christine Angelini
For: This paper is written for massive point samples in continuous space, which are common in environment sciences, numerical simulations, and location-based services. The author proposes a novel transformer model to address the challenges of long-range and multi-scale dependency, non-uniform point distribution, and high computational costs.* Methods: The proposed hierarchical spatial transformer model includes multi-resolution representation learning within a quad-tree hierarchy and efficient spatial attention via coarse approximation. The model also includes an uncertainty quantification branch to estimate prediction confidence related to input feature noise and point sparsity.* Results: The author provides a theoretical analysis of computational time complexity and memory costs. Extensive experiments on both real-world and synthetic datasets show that the proposed method outperforms multiple baselines in prediction accuracy and can scale up to one million points on one NVIDIA A100 GPU.Here’s the Chinese version of the information:* For: 这篇论文是为大量连续空间中的点样本而写的，这些样本广泛存在环境科学、数值仿真和位置基础服务等领域。作者提出了一种新的变换器模型，用于解决连续空间中点样本的长距离和多尺度关系、非均匀点分布和计算成本的问题。* Methods: 提出的层次空间变换器模型包括在Quad-tree层次结构中进行多尺度表示学习，以及高效的空间注意力计算方法。模型还包括一个不确定性评估分支，用于估计输入特征噪声和点缺失的影响。* Results: 作者提供了计算时间复杂度和内存成本的理论分析。对实际数据和 sintetic 数据进行了广泛的实验，结果显示，提出的方法在预测精度方面超过多个基elines，并且可以在一个NVIDIA A100 GPU上处理一百万个点。模型代码可以在 GitHub 上找到：https://github.com/spatialdatasciencegroup/HST。

Abstract
Transformers are widely used deep learning architectures. Existing transformers are mostly designed for sequences (texts or time series), images or videos, and graphs. This paper proposes a novel transformer model for massive (up to a million) point samples in continuous space. Such data are ubiquitous in environment sciences (e.g., sensor observations), numerical simulations (e.g., particle-laden flow, astrophysics), and location-based services (e.g., POIs and trajectories). However, designing a transformer for massive spatial points is non-trivial due to several challenges, including implicit long-range and multi-scale dependency on irregular points in continuous space, a non-uniform point distribution, the potential high computational costs of calculating all-pair attention across massive points, and the risks of over-confident predictions due to varying point density. To address these challenges, we propose a new hierarchical spatial transformer model, which includes multi-resolution representation learning within a quad-tree hierarchy and efficient spatial attention via coarse approximation. We also design an uncertainty quantification branch to estimate prediction confidence related to input feature noise and point sparsity. We provide a theoretical analysis of computational time complexity and memory costs. Extensive experiments on both real-world and synthetic datasets show that our method outperforms multiple baselines in prediction accuracy and our model can scale up to one million points on one NVIDIA A100 GPU. The code is available at \url{https://github.com/spatialdatasciencegroup/HST}.

摘要
启用变换器是深度学习架构的广泛应用。现有的变换器主要是为文本、图像、视频和图表设计的。这篇论文提出了一种新的变换器模型，用于处理大量（达到百万）的点样本，这些样本在连续空间中分布。这些样本广泛存在环境科学（例如感知器观测）、数学模拟（例如带有粒子的流体和天文学）以及地理位置服务（例如POI和轨迹）等领域。然而，设计为大量点样本的变换器是非常困难的，因为这些样本存在许多挑战，包括隐式的长距离和多尺度相互关系，分布不均匀，计算所有对之间的注意力计算的可能高计算成本，以及因点密度变化而导致的预测结果过于自信。为了解决这些挑战，我们提出了一种新的层次空间变换器模型，包括多尺度表示学习在四个树层中和高效的空间注意力via粗略估计。我们还设计了一个不确定量计算分支，用于估计输入特征噪声和点稀缺所导致的预测信度。我们提供了计算时间复杂度和内存成本的理论分析。在实际实验中，我们的方法在多个实际和 sintetic 数据集上取得了较高的预测精度，并且我们的模型可以在一个NVIDIA A100 GPU上处理一百万点。代码可以在 \url{https://github.com/spatialdatasciencegroup/HST} 上获得。

Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

paper_url: http://arxiv.org/abs/2311.04417
repo_url: None
paper_authors: Hongwu Peng, Caiwen Ding, Tong Geng, Sutanay Choudhury, Kevin Barker, Ang Li
for:这些商业AI/ML加速器的设计和实现是为了满足现代AI/ML算法的增加复杂度和计算需求而创造的，以提高AI/ML任务的性能和能效性。methods:这些加速器的各种设计优化和数据流体系结构使得它们在AI/ML任务中表现出色，包括Graphcore Intelligence Processing Unit (IPU), Sambanova Reconfigurable Dataflow Unit (RDU) 和改进的GPU平台。results:这些加速器在常见的DNN运算符和其他AI/ML任务上的性能评估和比较，以阐明数据流体系的优势和传统处理器设计的缺陷，并提供了每个平台的性能交易。这些发现可以作为研发下一代加速器的参考，以满足AI/ML应用领域不断演化的需求。

Abstract
The relentless advancement of artificial intelligence (AI) and machine learning (ML) applications necessitates the development of specialized hardware accelerators capable of handling the increasing complexity and computational demands. Traditional computing architectures, based on the von Neumann model, are being outstripped by the requirements of contemporary AI/ML algorithms, leading to a surge in the creation of accelerators like the Graphcore Intelligence Processing Unit (IPU), Sambanova Reconfigurable Dataflow Unit (RDU), and enhanced GPU platforms. These hardware accelerators are characterized by their innovative data-flow architectures and other design optimizations that promise to deliver superior performance and energy efficiency for AI/ML tasks. This research provides a preliminary evaluation and comparison of these commercial AI/ML accelerators, delving into their hardware and software design features to discern their strengths and unique capabilities. By conducting a series of benchmark evaluations on common DNN operators and other AI/ML workloads, we aim to illuminate the advantages of data-flow architectures over conventional processor designs and offer insights into the performance trade-offs of each platform. The findings from our study will serve as a valuable reference for the design and performance expectations of research prototypes, thereby facilitating the development of next-generation hardware accelerators tailored for the ever-evolving landscape of AI/ML applications. Through this analysis, we aspire to contribute to the broader understanding of current accelerator technologies and to provide guidance for future innovations in the field.

摘要
人工智能（AI）和机器学习（ML）应用的不断发展需要特化的硬件加速器来处理不断增长的复杂性和计算需求。传统的计算架构，基于 von Neumann 模型，被当代 AI/ML 算法的要求所超越，导致加速器的创造，如 Graphcore 智能处理器（IPU）、Sambanova 可重新配置数据流处理器（RDU）和增强 GPU 平台。这些硬件加速器具有创新的数据流架构和其他设计优化，以提供对 AI/ML 任务的超越性性能和能效性。本研究提供了这些商业 AI/ML 加速器的初步评估和比较，探讨其硬件和软件设计特点，以确定它们的优势和特殊能力。通过对常见深度神经网络（DNN）操作和其他 AI/ML 工作负荷进行 benchmark 评估，我们希望通过探讨数据流架构的优势和传统处理器设计的缺陷，为研究人员提供参考，以便设计和实现下一代特化于 AI/ML 应用的硬件加速器。我们的研究结果将对广泛的硬件加速器技术产生影响，并为未来在这个领域的创新提供指导。

Likelihood Ratio Confidence Sets for Sequential Decision Making

paper_url: http://arxiv.org/abs/2311.04402
repo_url: None
paper_authors: Nicolas Emmenegger, Mojmír Mutný, Andreas Krause
for: 这篇论文旨在提供一种可证明的、适应性 uncertainty 估计方法，用于Sequential Decision-Making 算法。
methods: 该方法基于 likelihood-based inference principle，使用 likelihood ratios 构建 any-time 有效 confidence sequences，不需要特殊的应用场景处理。
results: 该方法适用于具有明确 likelihood 函数的问题，其 resulting sets 总是保持预设的覆盖率，并且可以在 model-agnostic 的方式下实现。 estimators 的选择可以通过 provable 的方式来确定，而且与 online convex optimization 算法相关，如 Follow-the-Regularized-Leader。更重要的是，该方法可以在 non-parametric 设置中使用，如 RKHS 函数类型。

Abstract
Certifiable, adaptive uncertainty estimates for unknown quantities are an essential ingredient of sequential decision-making algorithms. Standard approaches rely on problem-dependent concentration results and are limited to a specific combination of parameterization, noise family, and estimator. In this paper, we revisit the likelihood-based inference principle and propose to use likelihood ratios to construct any-time valid confidence sequences without requiring specialized treatment in each application scenario. Our method is especially suitable for problems with well-specified likelihoods, and the resulting sets always maintain the prescribed coverage in a model-agnostic manner. The size of the sets depends on a choice of estimator sequence in the likelihood ratio. We discuss how to provably choose the best sequence of estimators and shed light on connections to online convex optimization with algorithms such as Follow-the-Regularized-Leader. To counteract the initially large bias of the estimators, we propose a reweighting scheme that also opens up deployment in non-parametric settings such as RKHS function classes. We provide a non-asymptotic analysis of the likelihood ratio confidence sets size for generalized linear models, using insights from convex duality and online learning. We showcase the practical strength of our method on generalized linear bandit problems, survival analysis, and bandits with various additive noise distributions.

摘要
certificable, adaptive uncertainty estimates for unknown quantities are an essential ingredient of sequential decision-making algorithms. Standard approaches rely on problem-dependent concentration results and are limited to a specific combination of parameterization, noise family, and estimator. In this paper, we revisit the likelihood-based inference principle and propose to use likelihood ratios to construct any-time valid confidence sequences without requiring specialized treatment in each application scenario. Our method is especially suitable for problems with well-specified likelihoods, and the resulting sets always maintain the prescribed coverage in a model-agnostic manner. The size of the sets depends on a choice of estimator sequence in the likelihood ratio. We discuss how to provably choose the best sequence of estimators and shed light on connections to online convex optimization with algorithms such as Follow-the-Regularized-Leader. To counteract the initially large bias of the estimators, we propose a reweighting scheme that also opens up deployment in non-parametric settings such as RKHS function classes. We provide a non-asymptotic analysis of the likelihood ratio confidence sets size for generalized linear models, using insights from convex duality and online learning. We showcase the practical strength of our method on generalized linear bandit problems, survival analysis, and bandits with various additive noise distributions.Here's the translation in Traditional Chinese: certificable, adaptive uncertainty estimates for unknown quantities are an essential ingredient of sequential decision-making algorithms. Standard approaches rely on problem-dependent concentration results and are limited to a specific combination of parameterization, noise family, and estimator. In this paper, we revisit the likelihood-based inference principle and propose to use likelihood ratios to construct any-time valid confidence sequences without requiring specialized treatment in each application scenario. Our method is especially suitable for problems with well-specified likelihoods, and the resulting sets always maintain the prescribed coverage in a model-agnostic manner. The size of the sets depends on a choice of estimator sequence in the likelihood ratio. We discuss how to provably choose the best sequence of estimators and shed light on connections to online convex optimization with algorithms such as Follow-the-Regularized-Leader. To counteract the initially large bias of the estimators, we propose a reweighting scheme that also opens up deployment in non-parametric settings such as RKHS function classes. We provide a non-asymptotic analysis of the likelihood ratio confidence sets size for generalized linear models, using insights from convex duality and online learning. We showcase the practical strength of our method on generalized linear bandit problems, survival analysis, and bandits with various additive noise distributions.

2023-11-09

Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter

PolyMaX: General Dense Prediction with Mask Transformer

GIPCOL: Graph-Injected Soft Prompting for Compositional Zero-Shot Learning

Whole-body Detection, Recognition and Identification at Altitude and Range

Intelligent Cervical Spine Fracture Detection Using Deep Learning Methods

FMViT: A multiple-frequency mixing Vision Transformer

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

3DGAUnet: 3D generative adversarial networks with a 3D U-Net based generator to achieve the accurate and effective synthesis of clinical tumor image data for pancreatic cancer

Window Attention is Bugged: How not to Interpolate Position Embeddings

What Do I Hear? Generating Sounds for Visuals with ChatGPT

3D-QAE: Fully Quantum Auto-Encoding of 3D Point Clouds

Reconstructing Objects in-the-wild for Realistic Sensor Simulation

SigScatNet: A Siamese + Scattering based Deep Learning Approach for Signature Forgery Detection and Similarity Assessment

Exploring Emotion Expression Recognition in Older Adults Interacting with a Virtual Coach

High-Performance Transformers for Table Structure Recognition Need Early Convolutions

Disentangling Quantum and Classical Contributions in Hybrid Quantum Machine Learning Architectures

LCM-LoRA: A Universal Stable-Diffusion Acceleration Module

L-WaveBlock: A Novel Feature Extractor Leveraging Wavelets for Generative Adversarial Networks

A Deep Learning Method for Simultaneous Denoising and Missing Wedge Reconstruction in Cryogenic Electron Tomography

Embedding Space Interpolation Beyond Mini-Batch, Beyond Pairs and Beyond Examples

SeaTurtleID2022: A long-span dataset for reliable sea turtle re-identification

BakedAvatar: Baking Neural Fields for Real-Time Head Avatar Synthesis

Multi-Modal Gaze Following in Conversational Scenarios

Object-centric Cross-modal Feature Distillation for Event-based Object Detection

Retinal OCT Synthesis with Denoising Diffusion Probabilistic Models for Layer Segmentation

Robust Retraining-free GAN Fingerprinting via Personalized Normalization

Using ResNet to Utilize 4-class T2-FLAIR Slice Classification Based on the Cholinergic Pathways Hyperintensities Scale for Pathological Aging

3DStyle-Diffusion: Pursuing Fine-grained Text-driven 3D Stylization with 2D Diffusion Models

ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors

Control3D: Towards Controllable Text-to-3D Generation

Transformer-based Model for Oral Epithelial Dysplasia Segmentation

Dual Pipeline Style Transfer with Input Distribution Differentiation

Active Mining Sample Pair Semantics for Image-text Matching

Linear Gaussian Bounding Box Representation and Ring-Shaped Rotated Convolution for Oriented Object Detection

SIRE: scale-invariant, rotation-equivariant estimation of artery orientations using graph neural networks

Improving Hand Recognition in Uncontrolled and Uncooperative Environments using Multiple Spatial Transformers and Loss Functions

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

SynFacePAD 2023: Competition on Face Presentation Attack Detection Based on Privacy-aware Synthetic Training Data

Spatial Attention-based Distribution Integration Network for Human Pose Estimation

SPADES: A Realistic Spacecraft Pose Estimation Dataset using Event Sensing

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis

SAMVG: A Multi-stage Image Vectorization Model with the Segment-Anything Model

Single-shot Tomography of Discrete Dynamic Objects

Widely Applicable Strong Baseline for Sports Ball Detection and Tracking

ConRad: Image Constrained Radiance Fields for 3D Generation from a Single Image

Let’s Get the FACS Straight – Reconstructing Obstructed Facial Features

BrainNetDiff: Generative AI Empowers Brain Network Generation via Multimodal Diffusion Model

Adaptive-Labeling for Enhancing Remote Sensing Cloud Understanding

TransReg: Cross-transformer as auto-registration module for multi-view mammogram mass detection

Audio-visual Saliency for Omnidirectional Videos

Dynamic Association Learning of Self-Attention and Convolution in Image Restoration

OW-SLR: Overlapping Windows on Semi-Local Region for Image Super-Resolution

SCAAT: Improving Neural Network Interpretability via Saliency Constrained Adaptive Adversarial Training

ScribblePolyp: Scribble-Supervised Polyp Segmentation through Dual Consistency Alignment

Reducing the Side-Effects of Oscillations in Training of Quantized YOLO Networks

Self-similarity Prior Distillation for Unsupervised Remote Physiological Measurement

POISE: Pose Guided Human Silhouette Extraction under Occlusions

On the Behavior of Audio-Visual Fusion Architectures in Identity Verification Tasks

2023-11-09

Is a Seat at the Table Enough? Engaging Teachers and Students in Dataset Specification for ML in Education

The Paradox of Noise: An Empirical Study of Noise-Infusion Mechanisms to Improve Generalization, Stability, and Privacy in Federated Learning

Are “Hierarchical” Visual Representations Hierarchical?

Hallucination-minimized Data-to-answer Framework for Financial Decision-makers

DONUT-hole: DONUT Sparsification by Harnessing Knowledge and Optimizing Learning Efficiency

Chatbots Are Not Reliable Text Annotators

ShipGen: A Diffusion Model for Parametric Ship Hull Generation with Multiple Objectives and Constraints

Deep Natural Language Feature Learning for Interpretable Prediction

Bridging the Digital Divide: Performance Variation across Socio-Economic Factors in Vision-Language Models

Optimal simulation-based Bayesian decisions

Efficiently Adapting Pretrained Language Models To New Languages

Generating Pragmatic Examples to Train Neural Program Synthesizers

Long-Horizon Dialogue Understanding for Role Identification in the Game of Avalon with Large Language Models

Game Theory Solutions in Sensor-Based Human Activity Recognition: A Review

FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts

Real-Time Neural Rasterization for Large Scenes

SynH2R: Synthesizing Hand-Object Motions for Learning Human-to-Robot Handovers

LLM Augmented Hierarchical Agents

Accuracy of a Vision-Language Model on Challenging Medical Cases