2023-11-30

eess.AS

eess.AS - 2023-11-30

Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

paper_url: http://arxiv.org/abs/2312.00249
repo_url: https://github.com/jinhualiang/apt
paper_authors: Jinhua Liang, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos
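for: This work aims to extend large language models (LLMs) and visual language models (VLMs) to the audio domain, adding audio understanding without sacrificing language understanding.
methods: It proposes a new adapter, Acoustic Prompt Tuning (APT), which uses an instruction-aware audio aligner to generate soft prompts that are fed to the language model as inputs. A multi-task learning strategy is also proposed to mitigate the scarcity of audio data.
results: Experiments show that APT-enhanced LLMs (APT-LLMs) achieve competitive results across various tasks compared with expert models (networks trained on the target datasets). APT can also extend frozen VLMs to the audio domain, with promising results on audio-visual question answering.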

Abstract
The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown their promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter extending LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as language model inputs. To mitigate the data scarcity in the audio domain, a multi-task learning strategy is proposed by formulating diverse audio tasks in a sequence-to-sequence manner. Moreover, we improve the framework of audio language model by using interleaved audio-text embeddings as the input sequence. This improved framework imposes zero constraints on the input format and thus is capable of tackling more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, we propose natural language audio reasoning (NLAR), a new task that analyses across two audio clips by comparison and summarization. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to the expert models (i.e., the networks trained on the targeted datasets) across various tasks. We finally demonstrate APT's ability to extend frozen VLMs to the audio domain without finetuning, achieving promising results in the audio-visual question answering task. Our code and model weights are released at https://github.com/JinhuaLiang/APT.
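The idea of an input sequence built from interleaved audio-text embeddings can be illustrated with a small sketch. This is not the authors' implementation: the function name `interleave_embeddings`, the `("audio" | "text", vectors)` segment layout, and the toy 2-dimensional vectors are all assumptions made for illustration. It only shows that segments of either modality can appear in any order and any number.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Segment = Tuple[str, List[Vector]]  # ("audio" | "text", embeddings); hypothetical layout


def interleave_embeddings(segments: Sequence[Segment]) -> Tuple[List[Vector], List[str]]:
    """Concatenate audio and text embedding segments in their given order.

    Returns the flat embedding sequence plus a parallel list recording
    which modality each position came from.
    """
    sequence: List[Vector] = []
    modalities: List[str] = []
    for kind, vectors in segments:
        if kind not in ("audio", "text"):
            raise ValueError(f"unknown segment kind: {kind!r}")
        for vector in vectors:
            sequence.append(vector)
            modalities.append(kind)
    return sequence, modalities


# Toy prompt with two audio clips, as a two-clip comparison task would need.
example = [
    ("text", [[0.1, 0.2]]),
    ("audio", [[1.0, 0.0], [0.9, 0.1]]),
    ("text", [[0.3, 0.4]]),
    ("audio", [[0.0, 1.0]]),
]
sequence, modalities = interleave_embeddings(example)
```

Because no fixed "audio first, text second" template is assumed, the same routine covers few-shot prompts (several audio-label pairs followed by a query clip) and two-clip comparison prompts.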

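Summary
The auditory system plays an important role in the overall human perceptual experience. Although existing large language models (LLMs) and visual language models (VLMs) perform well on many vision and language understanding tasks, only a few of them can be extended to the audio domain without compromising their domain-specific capacity. The paper introduces Acoustic Prompt Tuning (APT), a new adapter that extends LLMs and VLMs to the audio domain through soft prompting alone. APT uses an instruction-aware audio aligner to generate soft prompts, conditioned on both the input text and the sounds, which serve as language model inputs. To mitigate data scarcity in the audio domain, a multi-task learning strategy formulates diverse audio tasks in a sequence-to-sequence manner. The audio language model framework is also improved by using interleaved audio-text embeddings as the input sequence. This framework places no constraints on the input format and can therefore tackle more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, the paper proposes natural language audio reasoning (NLAR), a task that analyses two audio clips by comparison and summarization. Experiments show that APT-LLMs achieve competitive results across various tasks compared with expert models (networks trained on the target datasets). Finally, APT is shown to extend frozen VLMs to the audio domain without finetuning, with promising results on audio-visual question answering. Code and model weights are released at https://github.com/JinhuaLiang/APT.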

Learning domain-invariant classifiers for infant cry sounds

paper_url: http://arxiv.org/abs/2312.00231
repo_url: None
paper_authors: Charles C. Onu, Hemanth K. Sheetha, Arsenii Gorin, Doina Precup
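for: This study addresses domain shift in real-world data, specifically in a clinical database of infant cry sounds.
methods: It adapts unsupervised domain adaptation methods from computer vision to learn representations that are invariant to the domain, and proposes a new method, target noise injection (TNI), that requires neither labels nor training data from the target domain.
results: The best-performing model improves target accuracy by 7.2% without negatively affecting the source domain.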

Abstract
The issue of domain shift remains a problematic phenomenon in most real-world datasets and clinical audio is no exception. In this work, we study the nature of domain shift in a clinical database of infant cry sounds acquired across different geographies. We find that though the pitches of infant cries are similarly distributed regardless of the place of birth, other characteristics introduce peculiar biases into the data. We explore methodologies for mitigating the impact of domain shift in a model for identifying neurological injury from cry sounds. We adapt unsupervised domain adaptation methods from computer vision which learn an audio representation that is domain-invariant to hospitals and is task discriminative. We also propose a new approach, target noise injection (TNI), for unsupervised domain adaptation which requires neither labels nor training data from the target domain. Our best-performing model significantly improves target accuracy by 7.2%, without negatively affecting the source domain.

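Summary
Domain shift is a common problem in real-world datasets, and clinical audio is no exception. This work studies the nature of domain shift in a clinical database of infant cry sounds acquired across different geographies. The pitches of infant cries are similarly distributed regardless of the place of birth, but other characteristics introduce biases into the data. The authors explore ways to mitigate the impact of domain shift in a model for identifying neurological injury from cry sounds. They adapt unsupervised domain adaptation methods from computer vision to clinical audio, and propose a new method, target noise injection (TNI), that requires neither labels nor training data from the target domain. The best-performing model improves target accuracy by 7.2% without negatively affecting the source domain.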

An Aliasing-Free Hybrid Digital-Analog Polyphonic Synthesizer

paper_url: http://arxiv.org/abs/2311.18774
repo_url: None
paper_authors: Jonas Roth, Domenic Keller, Oscar Castañeda, Christoph Studer
for: This paper presents a hybrid digital-analog eight-voice polyphonic synthesizer prototype called the +-synth, which combines the best of both worlds to provide superior sound quality and mitigate the drawbacks of analog circuitry.
methods: The +-synth uses a novel digital very-large scale integration (VLSI) design called the big Fourier oscillator (BFO), which utilizes additive synthesis to generate a wide variety of aliasing-free waveforms. Each BFO produces two voices, using four oscillators per voice, and each oscillator can generate up to 1024 freely configurable partials.
results: Measurement results of the +-synth prototype demonstrate high fidelity and low latency, indicating that the hybrid digital-analog design achieves the desired goals of combining the best of both worlds.

Abstract
Analog subtractive synthesizers are generally considered to provide superior sound quality compared to digital emulations. However, analog circuitry requires calibration and suffers from aging, temperature instability, and limited flexibility in generating a wide variety of waveforms. Digital synthesis can mitigate many of these drawbacks, but generating arbitrary aliasing-free waveforms remains challenging. In this paper, we present the +-synth, a hybrid digital-analog eight-voice polyphonic synthesizer prototype that combines the best of both worlds. At the heart of the synthesizer is the big Fourier oscillator (BFO), a novel digital very-large scale integration (VLSI) design that utilizes additive synthesis to generate a wide variety of aliasing-free waveforms. Each BFO produces two voices, using four oscillators per voice. A single oscillator can generate up to 1024 freely configurable partials (harmonic or inharmonic), which are calculated using coordinate rotation digital computers (CORDICs). The BFOs were fabricated as 65nm CMOS custom application-specific integrated circuits (ASICs), which are integrated in the +-synth to simultaneously generate up to 32768 partials. Four 24-bit 96kHz stereo DACs then convert the eight voices into the analog domain, followed by digitally controlled analog low-pass filtering and amplification. Measurement results of the +-synth prototype demonstrate high fidelity and low latency.
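The partial counts in the abstract, and the reason additive synthesis can be aliasing-free, can be checked with a short sketch. It is an illustration in plain Python rather than a model of the BFO hardware: the function names are hypothetical, only harmonic partials are considered, and the sawtooth is the textbook Fourier series truncated below the Nyquist frequency of the 96 kHz DACs.

```python
import math

VOICES = 8
OSCILLATORS_PER_VOICE = 4
PARTIALS_PER_OSCILLATOR = 1024
# 8 voices x 4 oscillators x 1024 partials = 32768 simultaneous partials.
TOTAL_PARTIALS = VOICES * OSCILLATORS_PER_VOICE * PARTIALS_PER_OSCILLATOR


def max_alias_free_partials(f0_hz: float, sample_rate_hz: float, limit: int = 1024) -> int:
    """Count harmonics k * f0 that lie strictly below Nyquist, capped at `limit`."""
    nyquist = sample_rate_hz / 2.0
    count = int(math.ceil(nyquist / f0_hz)) - 1  # largest k with k * f0 < nyquist
    return max(0, min(limit, count))


def sawtooth_sample(f0_hz: float, t_s: float, sample_rate_hz: float = 96_000.0) -> float:
    """Band-limited sawtooth at time t_s, built by summing sine partials.

    Uses x(t) = (2 / pi) * sum_k (-1)^(k+1) * sin(2 pi k f0 t) / k and stops
    before any partial reaches Nyquist, so nothing can fold back (alias).
    """
    n = max_alias_free_partials(f0_hz, sample_rate_hz)
    total = 0.0
    for k in range(1, n + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        total += sign * math.sin(2.0 * math.pi * k * f0_hz * t_s) / k
    return (2.0 / math.pi) * total
```

For a 440 Hz note at 96 kHz only 109 harmonics fit below Nyquist, so the 1024-partial budget per oscillator matters mostly for low notes and for inharmonic spectra.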

Summary
Analog subtractive synthesizers are generally considered to offer higher sound quality than digital emulations. However, analog circuits require calibration, suffer from temperature instability, and are limited in the range of waveforms they can generate. Digital synthesis can reduce these drawbacks, but generating arbitrary aliasing-free waveforms remains a challenge. This paper presents the +-synth, a hybrid digital-analog eight-voice polyphonic synthesizer prototype that combines the best of both worlds. At its heart is the big Fourier oscillator (BFO), a new digital very-large scale integration (VLSI) design that uses additive synthesis to generate a wide variety of aliasing-free waveforms. Each BFO produces two voices with four oscillators per voice, and each oscillator can generate up to 1024 freely configurable partials (harmonic or inharmonic), computed using coordinate rotation digital computers (CORDICs). The BFOs were fabricated as 65nm CMOS custom application-specific integrated circuits (ASICs) and are integrated in the +-synth to generate up to 32768 partials. Four 24-bit 96kHz stereo DACs then convert the eight voices into the analog domain, followed by digitally controlled analog filtering and amplification. Measurements of the +-synth prototype demonstrate high fidelity and low latency.

2023-11-30

cs.CV

cs.CV - 2023-11-30

PyNeRF: Pyramidal Neural Radiance Fields

paper_url: http://arxiv.org/abs/2312.00252
repo_url: https://github.com/hturki/pynerf
paper_authors: Haithem Turki, Michael Zollhöfer, Christian Richardt, Deva Ramanan
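for: Improving the rendering quality of grid-accelerated NeRFs while keeping them fast.
methods: Train model heads at different spatial grid resolutions and, at render time, use coarser grids to render samples that cover larger volumes.
results: Reduces error rates by 20% compared to Mip-NeRF while training over 60x faster, with minimal performance overhead because each model head is quick to evaluate.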

Abstract
Neural Radiance Fields (NeRFs) can be dramatically accelerated by spatial grid representations. However, they do not explicitly reason about scale and so introduce aliasing artifacts when reconstructing scenes captured at different camera distances. Mip-NeRF and its extensions propose scale-aware renderers that project volumetric frustums rather than point samples but such approaches rely on positional encodings that are not readily compatible with grid methods. We propose a simple modification to grid-based models by training model heads at different spatial grid resolutions. At render time, we simply use coarser grids to render samples that cover larger volumes. Our method can be easily applied to existing accelerated NeRF methods and significantly improves rendering quality (reducing error rates by 20-90% across synthetic and unbounded real-world scenes) while incurring minimal performance overhead (as each model head is quick to evaluate). Compared to Mip-NeRF, we reduce error rates by 20% while training over 60x faster.
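The render-time rule "use coarser grids for samples that cover larger volumes" can be sketched as a level lookup. The abstract does not give the selection formula, so everything here is an assumption for illustration: the function name `select_grid_level`, a pyramid whose cell size doubles at each level, and a hard choice of a single level instead of any blending between neighbouring levels.

```python
import math


def select_grid_level(sample_width: float, finest_cell_size: float, num_levels: int) -> int:
    """Pick the pyramid level whose cell size best matches a sample's footprint.

    Level 0 is the finest grid; level i has cells of size finest_cell_size * 2**i
    (an assumed layout). Samples smaller than the finest cell use level 0, and
    samples larger than the coarsest cell use the last level.
    """
    if sample_width <= 0 or finest_cell_size <= 0 or num_levels < 1:
        raise ValueError("widths must be positive and num_levels at least 1")
    level = int(math.floor(math.log2(sample_width / finest_cell_size)))
    return min(num_levels - 1, max(0, level))
```

A sample far from the camera spans a wide section of the viewing frustum, so it maps to a high (coarse) level, which is what suppresses aliasing when the same scene is seen at different distances.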

Advancements and Trends in Ultra-High-Resolution Image Processing: An Overview

paper_url: http://arxiv.org/abs/2312.00250
repo_url: None
paper_authors: Zhuoran Zheng, Boxue Xiao
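for: Surveying the enhancement of Ultra-High-Definition (UHD) images to improve their visual quality.
methods: Reviews existing UHD image enhancement techniques and application scenarios, covering the degradations caused by environmental noise and equipment jitter and the algorithms that address them.
results: An overview of the current state of UHD image enhancement by application field and by technology, with a brief look at trends.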

Abstract
Currently, to further improve visual enjoyment, Ultra-High-Definition (UHD) images are catching wide attention. Here, UHD images are usually referred to as having a resolution greater than or equal to $3840 \times 2160$. However, since the imaging equipment is subject to environmental noise or equipment jitter, UHD images are prone to contrast degradation, blurring, low dynamic range, etc. To address these issues, a large number of algorithms for UHD image enhancement have been proposed. In this paper, we introduce the current state of UHD image enhancement from two perspectives, one is the application field and the other is the technology. In addition, we briefly explore its trends.

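Summary
Ultra-High-Definition (UHD) images are attracting wide attention as a way to further improve visual enjoyment. Here, UHD images are usually defined as images with a resolution of 3840×2160 or higher. However, because imaging equipment is subject to environmental noise or equipment jitter, UHD images are prone to contrast degradation, blurring, low dynamic range, and similar problems. Many UHD image enhancement algorithms have been proposed to address these issues. This paper reviews the current state of UHD image enhancement from two perspectives, application field and technology, and briefly explores its trends.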

AV-RIR: Audio-Visual Room Impulse Response Estimation

paper_url: http://arxiv.org/abs/2312.00834
repo_url: None
paper_authors: Anton Ratnarajah, Sreyan Ghosh, Sonal Kumar, Purva Chiniya, Dinesh Manocha
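for: This work aims to accurately estimate the room impulse response (RIR) for use in speech processing and AR/VR applications.
methods: It proposes AV-RIR, a new multi-modal multi-task learning approach that estimates the RIR from a reverberant speech signal and visual cues of the environment. AV-RIR builds on a new neural codec-based architecture that captures the environment's geometry and material properties, and solves speech dereverberation as an auxiliary task through multi-task learning.
results: AV-RIR outperforms previous audio-only and visual-only approaches, with 36% - 63% improvement across various acoustic metrics in RIR estimation, and achieves higher preference scores in human evaluation. As a side benefit, the dereverbed speech from AV-RIR is competitive on various spoken language processing tasks.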

Abstract
Accurate estimation of Room Impulse Response (RIR), which captures an environment's acoustic properties, is important for speech processing and AR/VR applications. We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and the visual cues of its corresponding environment. AV-RIR builds on a novel neural codec-based architecture that effectively captures environment geometry and materials properties and solves speech dereverberation as an auxiliary task by using multi-task learning. We also propose Geo-Mat features that augment material information into visual cues and CRIP that improves late reverberation components in the estimated RIR via image-to-RIR retrieval by 86%. Empirical results show that AV-RIR quantitatively outperforms previous audio-only and visual-only approaches by achieving 36% - 63% improvement across various acoustic metrics in RIR estimation. Additionally, it also achieves higher preference scores in human evaluation. As an auxiliary benefit, dereverbed speech from AV-RIR shows competitive performance with the state-of-the-art in various spoken language processing tasks and outperforms reverberation time error score in the real-world AVSpeech dataset. Qualitative examples of both synthesized reverberant speech and enhanced speech can be found at https://www.youtube.com/watch?v=tTsKhviukAE.
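As background for why an RIR "captures an environment's acoustic properties": under the standard linear time-invariant room model, reverberant speech is the clean (dry) speech convolved with the RIR. The sketch below shows only that relationship with toy numbers; it is not part of AV-RIR, and the function name `apply_rir` is hypothetical.

```python
from typing import List, Sequence


def apply_rir(dry: Sequence[float], rir: Sequence[float]) -> List[float]:
    """Return the reverberant signal: the discrete convolution of dry with rir."""
    if not dry or not rir:
        return []
    wet = [0.0] * (len(dry) + len(rir) - 1)
    for i, sample in enumerate(dry):
        for j, tap in enumerate(rir):
            # Each input sample excites a scaled, delayed copy of the RIR.
            wet[i + j] += sample * tap
    return wet


# A direct path followed by one echo at half amplitude, one sample later.
toy_rir = [1.0, 0.5]
wet = apply_rir([1.0, 2.0], toy_rir)
```

Estimating the RIR from reverberant speech runs this relationship backwards, and dereverberation recovers the dry signal, which is why the two tasks pair naturally in a multi-task setup.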

Summary
Accurate estimation of the Room Impulse Response (RIR), which captures the acoustic properties of an environment, is crucial for speech processing and augmented/virtual reality (AR/VR) applications. The authors propose AV-RIR, a novel multi-modal multi-task learning approach that estimates the RIR from a reverberant speech signal and the visual cues of the corresponding environment. AV-RIR builds on a neural codec-based architecture that captures the geometry and material properties of the environment and solves dereverberation as an auxiliary task through multi-task learning. They also propose Geo-Mat features, which add material information to the visual cues, and CRIP, which improves the late reverberation components of the estimated RIR through image-to-RIR retrieval. Empirical results show that AV-RIR substantially improves RIR estimation over audio-only and visual-only approaches, by 36% to 63% across various acoustic metrics, and that it achieves higher preference scores in human evaluation. As an additional benefit, dereverbed speech from AV-RIR is competitive with the state of the art in various spoken language processing tasks and performs better on reverberation time error on the real-world AVSpeech dataset. Qualitative examples of synthesized reverberant speech and enhanced speech can be found at https://www.youtube.com/watch?v=tTsKhviukAE.

Lasagna: Layered Score Distillation for Disentangled Object Relighting

paper_url: http://arxiv.org/abs/2312.00833
repo_url: https://github.com/dbash/lasagna
paper_authors: Dina Bashkirova, Arijit Ray, Rupayan Mallick, Sarah Adel Bargal, Jianming Zhang, Ranjay Krishna, Kate Saenko
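for: This work aims to provide intuitive text-guided relighting control, making it easier for artists, photographers, and other visual content creators to control the lighting of an image.
methods: The method uses score distillation sampling to learn a lighting prior from a diffusion model finetuned on synthetic relighting data. A newly curated synthetic dataset, ReLiT, is used to train Lasagna.
results: Lasagna relights real-world images effectively and outperforms existing text-guided image editing methods. On natural images and digital art pieces, humans prefer it over other methods in over 91% of cases. The learning objective is also extended to colorization.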

Abstract
Professional artists, photographers, and other visual content creators use object relighting to establish their photo's desired effect. Unfortunately, manual tools that allow relighting have a steep learning curve and are difficult to master. Although generative editing methods now enable some forms of image editing, relighting is still beyond today's capabilities; existing methods struggle to keep other aspects of the image -- colors, shapes, and textures -- consistent after the edit. We propose Lasagna, a method that enables intuitive text-guided relighting control. Lasagna learns a lighting prior by using score distillation sampling to distill the prior of a diffusion model, which has been finetuned on synthetic relighting data. To train Lasagna, we curate a new synthetic dataset ReLiT, which contains 3D object assets re-lit from multiple light source locations. Despite training on synthetic images, quantitative results show that Lasagna relights real-world images while preserving other aspects of the input image, outperforming state-of-the-art text-guided image editing methods. Lasagna enables realistic and controlled results on natural images and digital art pieces and is preferred by humans over other methods in over 91% of cases. Finally, we demonstrate the versatility of our learning objective by extending it to allow colorization, another form of image editing.

Summary
Manual relighting tools are difficult to master, and existing generative editing methods struggle to keep colors, shapes, and textures consistent after the edit. To address this, the authors propose Lasagna, a method that enables intuitive text-guided relighting control. Lasagna learns a lighting prior by using score distillation sampling to distill the prior of a diffusion model, which has been finetuned on synthetic relighting data. To train Lasagna, they curate a new synthetic dataset, ReLiT, which contains 3D object assets re-lit from multiple light source locations. Despite training on synthetic images, Lasagna is able to relight real-world images while preserving other aspects of the input image, outperforming state-of-the-art text-guided image editing methods. Lasagna enables realistic and controlled results on natural images and digital art pieces, and is preferred by humans over other methods in over 91% of cases. The authors also demonstrate the versatility of the learning objective by extending it to colorization, another form of image editing.

Brainformer: Modeling MRI Brain Functions to Machine Vision

paper_url: http://arxiv.org/abs/2312.00236
repo_url: None
paper_authors: Xuan-Bac Nguyen, Xin Li, Samee U. Khan, Khoa Luu
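for: The paper explores how the human brain processes visual information, to help bridge the gap between human visual perception and computer vision models.
methods: It uses a Transformer-based framework to analyze patterns of brain activity in the human perception system, and uses fMRI signals, as a representation of human brain activity, to supervise a machine vision model.
results: Experiments show that leveraging fMRI information improves the machine vision model, which reaches results comparable to current state-of-the-art methods.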

Abstract
"Perception is reality". Human perception plays a vital role in forming beliefs and understanding reality. Exploring how the human brain works in the visual system facilitates bridging the gap between human visual perception and computer vision models. However, neuroscientists study the brain via Neuroimaging, i.e., Functional Magnetic Resonance Imaging (fMRI), to discover the brain's functions. These approaches face interpretation challenges where fMRI data can be complex and require expertise. Therefore, neuroscientists make inferences about cognitive processes based on patterns of brain activities, which can lead to potential misinterpretation or limited functional understanding. In this work, we first present a simple yet effective Brainformer approach, a novel Transformer-based framework, to analyze the patterns of fMRI in the human perception system from the machine learning perspective. Secondly, we introduce a novel mechanism incorporating fMRI, which represents the human brain activities, as the supervision for the machine vision model. This work also introduces a novel perspective on transferring knowledge from human perception to neural networks. Through our experiments, we demonstrated that by leveraging fMRI information, the machine vision model can achieve potential results compared to the current State-of-the-art methods in various image recognition tasks.

Convolutional Neural Networks for Segmentation of Malignant Pleural Mesothelioma: Analysis of Probability Map Thresholds (CALGB 30901, Alliance)

paper_url: http://arxiv.org/abs/2312.00223
repo_url: None
paper_authors: Mena Shenouda, Eyjólfur Gudmundsson, Feng Li, Christopher M. Straus, Hedy L. Kindler, Arkadiusz Z. Dudek, Thomas Stinchcombe, Xiaofei Wang, Adam Starkey, Samuel G. Armato III
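for: This study evaluates how the probability map threshold affects malignant pleural mesothelioma (MPM) tumor segmentations produced automatically by a deep learning model.
methods: A VGG16/U-Net convolutional neural network (CNN) performs the automated segmentation, and a radiologist modifies the resulting contours to provide the reference standard.
results: Reducing the probability threshold from 0.5 to 0.1 decreased the average absolute percent volume difference, but no single output threshold was optimal for both tumor volume and DSC.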

Abstract
Malignant pleural mesothelioma (MPM) is the most common form of mesothelioma. To assess response to treatment, tumor measurements are acquired and evaluated based on a patient's longitudinal computed tomography (CT) scans. Tumor volume, however, is the more accurate metric for assessing tumor burden and response. Automated segmentation methods using deep learning can be employed to acquire volume, which otherwise is a tedious task performed manually. The deep learning-based tumor volume and contours can then be compared with a standard reference to assess the robustness of the automated segmentations. The purpose of this study was to evaluate the impact of probability map threshold on MPM tumor delineations generated using a convolutional neural network (CNN). Eighty-eight CT scans from 21 MPM patients were segmented by a VGG16/U-Net CNN. A radiologist modified the contours generated at a 0.5 probability threshold. Percent difference of tumor volume and overlap using the Dice Similarity Coefficient (DSC) were compared between the standard reference provided by the radiologist and CNN outputs for thresholds ranging from 0.001 to 0.9. CNN annotations consistently yielded smaller tumor volumes than radiologist contours. Reducing the probability threshold from 0.5 to 0.1 decreased the absolute percent volume difference, on average, from 43.96% to 24.18%. Median and mean DSC ranged from 0.58 to 0.60, with a peak at a threshold of 0.5; no distinct threshold was found for percent volume difference. No single output threshold in the CNN probability maps was optimal for both tumor volume and DSC. This work underscores the need to assess tumor volume and spatial overlap when evaluating CNN performance. While automated segmentations may yield comparable tumor volumes to that of the reference standard, the spatial region delineated by the CNN at a specific threshold is equally important.
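The two metrics compared across thresholds can be written out directly. The sketch below uses a flattened list of voxel probabilities and toy numbers that are not from the study; the function names are hypothetical, and voxel counts stand in for volume under the assumption of uniform voxel size.

```python
from typing import List, Sequence


def binarize(prob_map: Sequence[float], threshold: float) -> List[bool]:
    """Turn a CNN probability map into a mask: tumor where p >= threshold."""
    return [p >= threshold for p in prob_map]


def dice(pred: Sequence[bool], ref: Sequence[bool]) -> float:
    """Dice Similarity Coefficient: 2 * |A and B| / (|A| + |B|)."""
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(pred) + sum(ref)
    return 1.0 if total == 0 else 2.0 * overlap / total


def percent_volume_difference(pred: Sequence[bool], ref: Sequence[bool]) -> float:
    """Signed percent difference in voxel count relative to the reference."""
    ref_volume = sum(ref)
    if ref_volume == 0:
        raise ValueError("reference mask is empty")
    return 100.0 * (sum(pred) - ref_volume) / ref_volume


# Toy example: six voxels, four of which the reference marks as tumor.
probabilities = [0.9, 0.6, 0.3, 0.15, 0.05, 0.0]
reference = [True, True, True, True, False, False]
results = {
    t: (dice(binarize(probabilities, t), reference),
        percent_volume_difference(binarize(probabilities, t), reference))
    for t in (0.5, 0.1)
}
```

In this toy case lowering the threshold helps both metrics. On real scans the extra voxels admitted by a lower threshold need not overlap the reference, which is how volume agreement and DSC can favour different thresholds.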

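Summary
Malignant pleural mesothelioma (MPM) is the most common form of mesothelioma. To assess response to treatment, tumor measurements are acquired and evaluated from a patient's longitudinal computed tomography (CT) scans. Tumor volume, however, is a more accurate metric for assessing tumor burden and response. Automated deep learning segmentation can obtain the volume from CT scans, a task that is otherwise tedious when done manually. This study evaluates the impact of the probability map threshold on MPM tumor delineations. A VGG16/U-Net CNN segmented 88 CT scans, and a radiologist modified the contours generated by the CNN. For thresholds ranging from 0.001 to 0.9, percent difference in tumor volume and overlap measured by the Dice Similarity Coefficient (DSC) were compared between the reference standard and the CNN outputs. CNN annotations consistently yielded smaller tumor volumes than the radiologist contours. Reducing the threshold from 0.5 to 0.1 decreased the average absolute percent volume difference to 24.18%, while median and mean DSC ranged from 0.58 to 0.60, peaking at a threshold of 0.5. No single output threshold was optimal for both tumor volume and spatial overlap. These results show that both tumor volume and spatial overlap need to be assessed when evaluating CNN performance. Automated segmentations may yield tumor volumes comparable to the reference standard, but the spatial region delineated by the CNN at a specific threshold is equally important.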

SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting

paper_url: http://arxiv.org/abs/2312.00206
repo_url: None
paper_authors: Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, Achuta Kadambi
for: Synthesizing high-quality novel views from sparse training views, especially for 360-degree scenes.
methods: Integrate depth priors with generative and explicit constraints to reduce background collapse, remove floaters, and enhance consistency from unseen viewpoints.
results: Outperforms base 3DGS by up to 30.5% and NeRF-based methods by up to 15.6% in LPIPS on the MipNeRF-360 dataset with substantially less training and inference cost.

Abstract
The problem of novel view synthesis has grown significantly in popularity recently with the introduction of Neural Radiance Fields (NeRFs) and other implicit scene representation methods. A recent advance, 3D Gaussian Splatting (3DGS), leverages an explicit representation to achieve real-time rendering with high-quality results. However, 3DGS still requires an abundance of training views to generate a coherent scene representation. In few shot settings, similar to NeRF, 3DGS tends to overfit to training views, causing background collapse and excessive floaters, especially as the number of training views are reduced. We propose a method to enable training coherent 3DGS-based radiance fields of 360 scenes from sparse training views. We find that using naive depth priors is not sufficient and integrate depth priors with generative and explicit constraints to reduce background collapse, remove floaters, and enhance consistency from unseen viewpoints. Experiments show that our method outperforms base 3DGS by up to 30.5% and NeRF-based methods by up to 15.6% in LPIPS on the MipNeRF-360 dataset with substantially less training and inference cost.

Summary
Recently, the problem of novel view synthesis has gained significant attention with the emergence of Neural Radiance Fields (NeRFs) and other implicit scene representation methods. A recent advancement, 3D Gaussian Splatting (3DGS), utilizes an explicit representation to achieve real-time rendering with high-quality results. However, 3DGS still requires a large amount of training views to generate a coherent scene representation. In few shot settings, similar to NeRF, 3DGS tends to overfit to training views, resulting in background collapse and excessive floaters, especially when the number of training views is reduced. We propose a method to train coherent 3DGS-based radiance fields of 360 scenes from sparse training views. We find that using naive depth priors is not sufficient and integrate depth priors with generative and explicit constraints to reduce background collapse, remove floaters, and enhance consistency from unseen viewpoints. Experimental results show that our method outperforms base 3DGS by up to 30.5% and NeRF-based methods by up to 15.6% in LPIPS on the MipNeRF-360 dataset with significantly less training and inference cost.

DNS SLAM: Dense Neural Semantic-Informed SLAM

paper_url: http://arxiv.org/abs/2312.00204
repo_url: None
paper_authors: Kunyi Li, Michael Niemeyer, Nassir Navab, Federico Tombari
for: This paper is written for the task of Simultaneous Localization and Mapping (SLAM) in real-world environments, specifically addressing the challenge of oversmoothed reconstructions in neural implicit representations.
methods: The paper proposes a novel neural RGB-D semantic SLAM approach featuring a hybrid representation, which integrates multi-view geometry constraints with image-based feature extraction to improve appearance details and output color, density, and semantic class information. The method also introduces a lightweight coarse scene representation trained in a self-supervised manner in latent space for real-time tracking.
results: The paper achieves state-of-the-art performance on both synthetic data and real-world data tracking while maintaining a commendable operational speed on off-the-shelf hardware. The method outputs class-wise decomposed reconstructions with better texture capturing appearance and geometric details.

Abstract
In recent years, coordinate-based neural implicit representations have shown promising results for the task of Simultaneous Localization and Mapping (SLAM). While achieving impressive performance on small synthetic scenes, these methods often suffer from oversmoothed reconstructions, especially for complex real-world scenes. In this work, we introduce DNS SLAM, a novel neural RGB-D semantic SLAM approach featuring a hybrid representation. Relying only on 2D semantic priors, we propose the first semantic neural SLAM method that trains class-wise scene representations while providing stable camera tracking at the same time. Our method integrates multi-view geometry constraints with image-based feature extraction to improve appearance details and to output color, density, and semantic class information, enabling many downstream applications. To further enable real-time tracking, we introduce a lightweight coarse scene representation which is trained in a self-supervised manner in latent space. Our experimental results achieve state-of-the-art performance on both synthetic data and real-world data tracking while maintaining a commendable operational speed on off-the-shelf hardware. Further, our method outputs class-wise decomposed reconstructions with better texture capturing appearance and geometric details.

Summary
Recently, coordinate-based neural implicit representations have shown promising results for Simultaneous Localization and Mapping (SLAM) tasks. However, these methods often suffer from oversmoothed reconstructions, especially for complex real-world scenes. In this work, we introduce DNS SLAM, a novel neural RGB-D semantic SLAM approach with a hybrid representation. Our method relies only on 2D semantic priors and trains class-wise scene representations while providing stable camera tracking. We integrate multi-view geometry constraints with image-based feature extraction to improve appearance details and output color, density, and semantic class information, enabling many downstream applications. To further enable real-time tracking, we introduce a lightweight coarse scene representation that is trained in a self-supervised manner in latent space. Our experimental results achieve state-of-the-art performance on both synthetic data and real-world data tracking while maintaining a commendable operational speed on off-the-shelf hardware. Additionally, our method outputs class-wise decomposed reconstructions with better texture capturing appearance and geometric details.

Raising the Bar of AI-generated Image Detection with CLIP

paper_url: http://arxiv.org/abs/2312.00195
repo_url: None
paper_authors: Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, Luisa Verdoliva
for: The paper explores the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images.
methods: The authors develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios.
results: The CLIP-based detector shows surprising generalization ability and high robustness across several different architectures, including recent commercial tools such as Dalle-3, Midjourney v5, and Firefly. It matches the state of the art on in-distribution data and improves on it for out-of-distribution data (+6% AUC) and for robustness to impaired/laundered data (+13%).

Abstract
The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images. We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios. We find that, contrary to previous belief, it is neither necessary nor convenient to use a large domain-specific dataset for training. On the contrary, by using only a handful of example images from a single generative model, a CLIP-based detector exhibits a surprising generalization ability and high robustness across several different architectures, including recent commercial tools such as Dalle-3, Midjourney v5, and Firefly. We match the SoTA on in-distribution data, and improve largely above it in terms of generalization to out-of-distribution data (+6% in terms of AUC) and robustness to impaired/laundered data (+13%). Our project is available at https://grip-unina.github.io/ClipBased-SyntheticImageDetection/

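Summary
This work explores the potential of pre-trained vision-language models (VLMs) for detecting AI-generated images. We develop a lightweight detection strategy based on CLIP features and study its performance in a variety of challenging scenarios. Contrary to previous belief, a large domain-specific dataset is not needed for training. Instead, using only a few example images from a single generative model, the CLIP-based detector shows surprising generalization ability and high robustness across architectures, including recent commercial tools such as Dalle-3, Midjourney v5, and Firefly. We match the SoTA on in-distribution data and improve on it for generalization to out-of-distribution data (+6% AUC) and robustness to impaired/laundered data (+13%). Our project is available at https://grip-unina.github.io/ClipBased-SyntheticImageDetection/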
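
The idea of fitting a detector from a handful of examples on top of frozen image features can be sketched as follows. This is a minimal illustration, not the authors' detector: the nearest-centroid rule, the function names, and the toy 3-D vectors (standing in for CLIP embeddings) are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fit_detector(real_feats, fake_feats):
    """Store one centroid per class from a handful of example features."""
    return {"real": centroid(real_feats), "fake": centroid(fake_feats)}

def predict(detector, feat):
    """Return the label whose centroid is most similar to the feature."""
    return max(detector, key=lambda label: cosine(detector[label], feat))

# Toy 3-D vectors standing in for image embeddings (hypothetical values).
real_examples = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
fake_examples = [[0.1, 1.0, 0.2], [0.0, 0.9, 0.1]]
detector = fit_detector(real_examples, fake_examples)
label = predict(detector, [0.05, 0.8, 0.15])
```

In a real setting the feature vectors would come from a frozen image encoder, and the centroid rule could be swapped for any lightweight classifier.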

Benchmarking and Enhancing Disentanglement in Concept-Residual Models

paper_url: http://arxiv.org/abs/2312.00192
repo_url: None
paper_authors: Renos Zabounidis, Ini Oguntola, Konghao Zhao, Joseph Campbell, Simon Stepputtis, Katia Sycara
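for: Improving model interpretability and performance when the concept set is incomplete.
methods: Proposes three new disentanglement approaches that separate concepts from residuals to reduce information leakage, and examines the balance between interpretability and task performance.
results: Extensive empirical analysis on the CUB, OAI, and CIFAR 100 datasets assesses each disentanglement method and offers new insight into how intervening on concepts affects task performance.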

Abstract
Concept bottleneck models (CBMs) are interpretable models that first predict a set of semantically meaningful features, i.e., concepts, from observations that are subsequently used to condition a downstream task. However, the model's performance strongly depends on the engineered features and can severely suffer from incomplete sets of concepts. Prior works have proposed a side channel -- a residual -- that allows for unconstrained information flow to the downstream task, thus improving model performance but simultaneously introducing information leakage, which is undesirable for interpretability. This work proposes three novel approaches to mitigate information leakage by disentangling concepts and residuals, investigating the critical balance between model performance and interpretability. Through extensive empirical analysis on the CUB, OAI, and CIFAR 100 datasets, we assess the performance of each disentanglement method and provide insights into when they work best. Further, we show how each method impacts the ability to intervene over the concepts and their subsequent impact on task performance.
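
One generic way to quantify leakage between a concept layer and a residual side channel is a cross-covariance penalty, sketched below. The abstract does not specify the three proposed approaches, so this is only an illustration of the disentanglement idea; the function name and the toy activations are assumptions.

```python
def mean(xs):
    """Arithmetic mean of a sequence of numbers."""
    return sum(xs) / len(xs)

def cross_covariance_penalty(concepts, residuals):
    """Sum of squared covariances between every concept/residual pair.

    concepts:  list of samples, each a list of concept activations
    residuals: list of samples, each a list of residual activations
    """
    n = len(concepts)
    c_cols = list(zip(*concepts))   # one tuple per concept dimension
    r_cols = list(zip(*residuals))  # one tuple per residual dimension
    penalty = 0.0
    for c in c_cols:
        c_mean = mean(c)
        for r in r_cols:
            r_mean = mean(r)
            cov = sum((ci - c_mean) * (ri - r_mean) for ci, ri in zip(c, r)) / n
            penalty += cov ** 2
    return penalty

# Toy activations (hypothetical): one concept, one residual unit, four samples.
concepts = [[0.0], [1.0], [0.0], [1.0]]
leaky = [[0.0], [1.0], [0.0], [1.0]]        # residual copies the concept
independent = [[0.0], [0.0], [1.0], [1.0]]  # residual uncorrelated with it
```

Adding such a term to the training loss pushes the residual to carry information that is linearly uncorrelated with the concepts; it does not remove nonlinear dependence.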

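Summary
Concept bottleneck models (CBMs) are interpretable models that first predict a set of semantically meaningful features, i.e., concepts, from observations and then use them to condition a downstream task. However, performance depends strongly on the engineered features and can suffer from incomplete concept sets. Prior work proposed a side channel, a residual, that lets unconstrained information flow to the downstream task, improving performance but also introducing undesirable information leakage that harms interpretability. This work proposes three new approaches that mitigate information leakage by disentangling concepts and residuals, and investigates the balance between model performance and interpretability. Through extensive empirical analysis on the CUB, OAI, and CIFAR 100 datasets, we assess each disentanglement method and provide insight into when each works best. We also show how each method affects the ability to intervene on concepts and the resulting task performance.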

Galaxy Classification: A machine learning approach for classifying shapes using numerical data

paper_url: http://arxiv.org/abs/2312.00184
repo_url: None
paper_authors: Anusha Guruprasad
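for: This study uses a machine learning model to classify galaxies, with the goal of improving our understanding of galaxy formation and evolution.
methods: A convolutional neural network architecture extracts features from galaxy images and classifies each galaxy as a spiral or an elliptical.
results: On a subset of the Galaxy Zoo data, the model achieves high accuracy when compared against human classifiers, suggesting that it could help improve our understanding of galaxy formation and evolution.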

Abstract
The classification of galaxies as spirals or ellipticals is a crucial task in understanding their formation and evolution. With the arrival of large-scale astronomical surveys, such as the Sloan Digital Sky Survey (SDSS), astronomers now have access to images of a vast number of galaxies. However, the visual inspection of these images is an impossible task for humans due to the sheer number of galaxies to be analyzed. To solve this problem, the Galaxy Zoo project was created to engage thousands of citizen scientists to classify the galaxies based on their visual features. In this paper, we present a machine learning model for galaxy classification using numerical data from the Galaxy Zoo[5] project. Our model utilizes a convolutional neural network architecture to extract features from galaxy images and classify them into spirals or ellipticals. We demonstrate the effectiveness of our model by comparing its performance with that of human classifiers using a subset of the Galaxy Zoo dataset. Our results show that our model achieves high accuracy in classifying galaxies and has the potential to significantly enhance our understanding of the formation and evolution of galaxies.

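Summary
Classifying galaxies as spirals or ellipticals is a key task for understanding their formation and evolution. With the arrival of large-scale astronomical surveys such as the Sloan Digital Sky Survey (SDSS), astronomers now have images of a vast number of galaxies. Visually inspecting these images by hand is impossible because there are too many galaxies to analyze. To solve this problem, the Galaxy Zoo project was created to engage thousands of citizen scientists in classifying galaxies by their visual features. In this paper, we present a machine learning model for galaxy classification based on numerical data. Our model uses a convolutional neural network architecture to extract features from galaxy images and classify them as spirals or ellipticals. By comparing on a subset of the Galaxy Zoo dataset, we show that the model classifies galaxies with high accuracy and has further potential.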

Fool the Hydra: Adversarial Attacks against Multi-view Object Detection Systems

paper_url: http://arxiv.org/abs/2312.00173
repo_url: None
paper_authors: Bilel Tarchoun, Quazi Mishkatul Alam, Nael Abu-Ghazaleh, Ihsen Alouani
for: This paper investigates the robustness of multiview object detection systems against adversarial patches in real-world scenarios.
methods: The authors use a multiview object detection framework and conduct experiments on the Wildtrack benchmark with off-the-shelf adversarial patches. They also propose two new attack methods to challenge the robustness of the system.
results: The multiview system shows promising robustness against off-the-shelf adversarial patches, but the proposed attacks significantly degrade its detection performance. Specifically, the first attack reaches an attack success rate of 73%, while the second attack reduces the performance of the target detector by 62%.

Abstract
Adversarial patches exemplify the tangible manifestation of the threat posed by adversarial attacks on Machine Learning (ML) models in real-world scenarios. Robustness against these attacks is of the utmost importance when designing computer vision applications, especially for safety-critical domains such as CCTV systems. In most practical situations, monitoring open spaces requires multi-view systems to overcome acquisition challenges such as occlusion handling. Multiview object systems are able to combine data from multiple views, and reach reliable detection results even in difficult environments. Despite its importance in real-world vision applications, the vulnerability of multiview systems to adversarial patches is not sufficiently investigated. In this paper, we raise the following question: Does the increased performance and information sharing across views offer, as a by-product, robustness to adversarial patches? We first conduct a preliminary analysis showing promising robustness against off-the-shelf adversarial patches, even in an extreme setting where we consider patches applied to all views by all persons in the Wildtrack benchmark. However, we challenge this observation by proposing two new attacks: (i) In the first attack, targeting a multiview CNN, we maximize the global loss by proposing gradient projection to the different views and aggregating the obtained local gradients. (ii) In the second attack, we focus on a Transformer-based multiview framework. In addition to the focal loss, we also maximize the transformer-specific loss by dissipating its attention blocks. Our results show a large degradation in the detection performance of victim multiview systems, with our first patch attack reaching an attack success rate of 73%, while our second proposed attack reduced the performance of its target detector by 62%.


Universal Backdoor Attacks

paper_url: http://arxiv.org/abs/2312.00157
repo_url: https://github.com/ain-soph/trojanzoo
paper_authors: Benjamin Schneider, Nils Lukas, Florian Kerschbaum
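for: This paper addresses data poisoning attacks on deep image classifiers, and how controlling the poison samples can make a model misclassify inputs into any target class.
methods: The paper uses a trigger-based data poisoning technique that generates triggers with salient characteristics to control the model's output, and exploits inter-class poison transferability to gain control over any target class.
results: Experiments show that the technique can control the model's misclassifications, and can do so for a large number of classes while poisoning only a small fraction of the training dataset.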

Abstract
Web-scraped datasets are vulnerable to data poisoning, which can be used for backdooring deep image classifiers during training. Since training on large datasets is expensive, a model is trained once and re-used many times. Unlike adversarial examples, backdoor attacks often target specific classes rather than any class learned by the model. One might expect that targeting many classes through a naive composition of attacks vastly increases the number of poison samples. We show this is not necessarily true and more efficient, universal data poisoning attacks exist that allow controlling misclassifications from any source class into any target class with a small increase in poison samples. Our idea is to generate triggers with salient characteristics that the model can learn. The triggers we craft exploit a phenomenon we call inter-class poison transferability, where learning a trigger from one class makes the model more vulnerable to learning triggers for other classes. We demonstrate the effectiveness and robustness of our universal backdoor attacks by controlling models with up to 6,000 classes while poisoning only 0.15% of the training dataset.


A Unified Framework for Connecting Noise Modeling to Boost Noise Detection

paper_url: http://arxiv.org/abs/2312.00827
repo_url: https://github.com/sunnysiqi/combo
paper_authors: Siqi Wang, Chau Pham, Bryan A. Plummer
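for: Noisy labels can degrade model performance, which makes learning with noisy labels an important topic. This paper explores whether two common approaches, noise modeling and noise detection, can work together.
methods: The paper proposes a structure that integrates the two approaches through three key blocks: noise modeling, source knowledge identification, and enhanced noise detection. Their collaboration helps to discriminate hard negatives and to preserve genuinely clean labels that look suspiciously noisy.
results: Experiments on four datasets, covering three types of noise and different combinations of the blocks, show that the collaborative structure improves classification accuracy. The components make distinct contributions across noise scenarios.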

Abstract
Noisy labels can impair model performance, making the study of learning with noisy labels an important topic. Two conventional approaches are noise modeling and noise detection. However, these two methods are typically studied independently, and there has been limited work on their collaboration. In this work, we explore the integration of these two approaches, proposing an interconnected structure with three crucial blocks: noise modeling, source knowledge identification, and enhanced noise detection using noise source-knowledge-integration methods. This collaboration structure offers advantages such as discriminating hard negatives and preserving genuinely clean labels that might be suspiciously noisy. Our experiments on four datasets, featuring three types of noise and different combinations of each block, demonstrate the efficacy of these components' collaboration. Our collaborative structure methods achieve up to a 10% increase in top-1 classification accuracy in synthesized noise datasets and 3-5% in real-world noisy datasets. The results also suggest that these components make distinct contributions to overall performance across various noise scenarios. These findings provide valuable insights for designing noisy label learning methods customized for specific noise scenarios in the future. Our code is accessible to the public.

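Summary
Noisy labels can degrade model performance, so learning with noisy labels is an important topic. Two common approaches are noise modeling and noise detection. However, the two are usually studied independently, and few works have tried to integrate them. In this work, we explore their integration and propose an interconnected structure with three key blocks: noise modeling, source knowledge identification, and enhanced noise detection using noise source-knowledge-integration methods. This collaborative structure has advantages such as discriminating hard negatives and preserving genuinely clean labels that might look noisy. We run experiments on four datasets, covering three types of noise and different combinations of the blocks. The results show that our collaborative methods improve top-1 classification accuracy by up to 10% on synthesized noise datasets and by 3-5% on real-world noisy datasets. The results also show that the components contribute differently in different noise scenarios. These findings offer useful guidance for designing noisy label learning methods tailored to specific noise scenarios. Our code is publicly accessible.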
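
As background on the noise modeling side, a common generic formulation uses a label transition matrix. The sketch below builds one for symmetric noise and pushes clean class probabilities through it; it illustrates the general technique only and is not the paper's noise modeling block. The function names and the example numbers are assumptions.

```python
def symmetric_transition(num_classes, noise_rate):
    """T[i][j] = P(observed label j | true label i) under symmetric noise."""
    off = noise_rate / (num_classes - 1)
    return [[1.0 - noise_rate if i == j else off for j in range(num_classes)]
            for i in range(num_classes)]

def noisy_posterior(clean_probs, transition):
    """Push clean class probabilities through T to predict noisy labels."""
    k = len(clean_probs)
    return [sum(clean_probs[i] * transition[i][j] for i in range(k))
            for j in range(k)]

# Hypothetical example: 3 classes, 20% of labels flipped uniformly.
T = symmetric_transition(num_classes=3, noise_rate=0.2)
noisy = noisy_posterior([0.7, 0.2, 0.1], T)
```

Training against the noisy posterior rather than the clean one is often called forward correction; a noise detection step would instead try to flag individual samples whose labels look wrong.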

GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

paper_url: http://arxiv.org/abs/2312.00093
repo_url: None
paper_authors: Gege Gao, Weiyang Liu, Anpei Chen, Andreas Geiger, Bernhard Schölkopf
for: Generating compositional 3D scenes with better disentanglement of the objects described in the text.
methods: Uses scene graphs, exploiting node and edge information to make better use of pretrained text-to-image diffusion models, and represents objects with signed distance fields to model the relationships between them.
results: Qualitative and quantitative experiments validate GraphDreamer's ability to generate high-fidelity compositional 3D scenes while reducing interference between objects.

Abstract
As pretrained text-to-image diffusion models become increasingly powerful, recent efforts have been made to distill knowledge from these text-to-image pretrained models for optimizing a text-guided 3D model. Most of the existing methods generate a holistic 3D model from a plain text input. This can be problematic when the text describes a complex scene with multiple objects, because the vectorized text embeddings are inherently unable to capture a complex description with multiple entities and relationships. Holistic 3D modeling of the entire scene further prevents accurate grounding of text entities and concepts. To address this limitation, we propose GraphDreamer, a novel framework to generate compositional 3D scenes from scene graphs, where objects are represented as nodes and their interactions as edges. By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model and is able to fully disentangle different objects without image-level supervision. To facilitate modeling of object-wise relationships, we use signed distance fields as representation and impose a constraint to avoid inter-penetration of objects. To avoid manual scene graph creation, we design a text prompt for ChatGPT to generate scene graphs based on text inputs. We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer in generating high-fidelity compositional 3D scenes with disentangled object entities.
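
The constraint against inter-penetration of objects represented as signed distance fields can be illustrated with a small sketch. This is a hedged toy version, assuming analytic sphere SDFs and a simple average-overlap penalty; the paper's actual constraint and all names below are not taken from the source.

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(point, center) - radius

def penetration_penalty(points, sdf_a, sdf_b):
    """Average depth by which sample points lie inside both objects."""
    total = 0.0
    for p in points:
        # The max of two SDFs is negative only where both objects contain p.
        overlap = max(sdf_a(p), sdf_b(p))
        total += max(0.0, -overlap)
    return total / len(points)

# Sample points on a coarse line through two hypothetical object centres.
samples = [(x / 10.0, 0.0, 0.0) for x in range(-10, 31)]
a = lambda p: sphere_sdf(p, (0.0, 0.0, 0.0), 1.0)
b_far = lambda p: sphere_sdf(p, (3.0, 0.0, 0.0), 1.0)   # disjoint from a
b_near = lambda p: sphere_sdf(p, (1.0, 0.0, 0.0), 1.0)  # overlaps a
```

The penalty is zero for the disjoint pair and positive for the overlapping pair, so minimizing it during optimization would push objects apart.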


TrafficMOT: A Challenging Dataset for Multi-Object Tracking in Complex Traffic Scenarios

paper_url: http://arxiv.org/abs/2311.18839
repo_url: None
paper_authors: Lihao Liu, Yanqi Cheng, Zhongying Deng, Shujun Wang, Dongdong Chen, Xiaowei Hu, Pietro Liò, Carola-Bibiane Schönlieb, Angelica Aviles-Rivero
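for: Improving traffic monitoring accuracy and road safety measures by tracking multiple objects in traffic videos with advanced machine learning algorithms.
methods: Introduces the TrafficMOT dataset, which covers diverse traffic situations and complex scenarios, to drive progress in multi-object tracking.
results: Experiments in three settings (fully-supervised, semi-supervised, and the Tracking Anything Model) show that TrafficMOT is complex and challenging, and that it can drive progress in multi-object tracking.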

Abstract
Multi-object tracking in traffic videos is a crucial research area, offering immense potential for enhancing traffic monitoring accuracy and promoting road safety measures through the utilisation of advanced machine learning algorithms. However, existing datasets for multi-object tracking in traffic videos often feature limited instances or focus on single classes, which cannot well simulate the challenges encountered in complex traffic scenarios. To address this gap, we introduce TrafficMOT, an extensive dataset designed to encompass diverse traffic situations with complex scenarios. To validate the complexity and challenges presented by TrafficMOT, we conducted comprehensive empirical studies using three different settings: fully-supervised, semi-supervised, and a recent powerful zero-shot foundation model Tracking Anything Model (TAM). The experimental results highlight the inherent complexity of this dataset, emphasising its value in driving advancements in the field of traffic monitoring and multi-object tracking.

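Summary
Multi-object tracking in traffic videos is a key research area with great potential for improving traffic monitoring accuracy and promoting road safety measures through advanced machine learning algorithms. However, existing datasets for multi-object tracking in traffic videos usually contain a limited number of instances or a single class, and cannot realistically simulate the complexity of traffic scenes. To address this gap, we introduce TrafficMOT, an extensive dataset covering diverse traffic situations and complex scenarios. To demonstrate its complexity and difficulty, we conduct comprehensive empirical studies in three settings: fully-supervised, semi-supervised, and the recent zero-shot foundation model Tracking Anything Model (TAM). The experimental results reveal the inherent complexity of TrafficMOT and confirm its value for traffic monitoring and multi-object tracking.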

Just Add $π$! Pose Induced Video Transformers for Understanding Activities of Daily Living

paper_url: http://arxiv.org/abs/2311.18840
repo_url: https://github.com/dominickrei/pi-vit
paper_authors: Dominick Reilly, Srijan Das
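for: Targets Activities of Daily Living (ADL), broadening the settings in which video transformers can be applied.
methods: Augments RGB representations with human pose information to strengthen the representations learned by video transformers.
results: Achieves state-of-the-art performance on three prominent ADL datasets without requiring poses or extra computation at inference.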

Abstract
Video transformers have become the de facto standard for human action recognition, yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL), where RGB alone is not sufficient to distinguish between visually similar actions, or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL, we hypothesize that the augmentation of RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential. Consequently, we introduce the first Pose Induced Video Transformer: PI-ViT (or $\pi$-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of $\pi$-ViT are two plug-in modules, 2D Skeleton Induction Module and 3D Skeleton Induction Module, that are responsible for inducing 2D and 3D pose information into the RGB representations. These modules operate by performing pose-aware auxiliary tasks, a design choice that allows $\pi$-ViT to discard the modules during inference. Notably, $\pi$-ViT achieves the state-of-the-art performance on three prominent ADL datasets, encompassing both real-world and large-scale RGB-D datasets, without requiring poses or additional computational overhead at inference.

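Summary
Video transformers have become the de facto standard for human action recognition, but their reliance on the RGB modality alone still limits their use in some domains. One such domain is Activities of Daily Living (ADL), where RGB alone cannot distinguish visually similar actions or actions observed from multiple viewpoints. To promote the use of video transformers for ADL, we argue that augmenting RGB with human pose information is essential. We therefore introduce the first Pose Induced Video Transformer, PI-ViT (or π-ViT), a new approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. Its key elements are two plug-in modules, the 2D Skeleton Induction Module and the 3D Skeleton Induction Module, which induce 2D and 3D pose information into the RGB representations. The modules do this by performing pose-aware auxiliary tasks, so they can be discarded at inference. Notably, PI-ViT achieves state-of-the-art performance on three prominent ADL datasets, covering both real-world and large-scale RGB-D datasets, without requiring poses or additional computation at inference.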

PoseGPT: Chatting about 3D Human Pose

paper_url: http://arxiv.org/abs/2311.18836
repo_url: None
paper_authors: Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Michael J. Black
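for: Understanding and reasoning about 3D human poses from images or textual descriptions.
methods: Uses Large Language Models (LLMs) to understand and reason about 3D human poses.
results: PoseGPT outperforms existing multimodal LLMs and task-specific methods on the newly proposed tasks, and can understand and generate 3D human poses based on complex reasoning.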

Abstract
We introduce PoseGPT, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation methods, whether image-based or text-based, often lack holistic scene comprehension and nuanced reasoning, leading to a disconnect between visual data and its real-world implications. PoseGPT addresses these limitations by embedding SMPL poses as a distinct signal token within a multi-modal LLM, enabling direct generation of 3D body poses from both textual and visual inputs. This approach not only simplifies pose prediction but also empowers LLMs to apply their world knowledge in reasoning about human poses, fostering two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve reasoning about humans to generate 3D poses from subtle text queries, possibly accompanied by images. We establish benchmarks for these tasks, moving beyond traditional 3D pose generation and estimation methods. Our results show that PoseGPT outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks. Furthermore, PoseGPT's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis.

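Summary
We introduce PoseGPT, a framework that uses Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is inspired by the human ability to understand postures from a single image or a brief description, a process that involves image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation methods, whether image-based or text-based, usually lack holistic scene comprehension and nuanced reasoning, which separates visual data from its real-world meaning. PoseGPT addresses these limitations by embedding SMPL poses as a distinct signal token within a multi-modal LLM, allowing 3D body poses to be generated directly from textual and visual inputs. This approach not only simplifies pose prediction but also lets LLMs apply their world knowledge when reasoning about human poses, giving rise to two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve reasoning about humans to generate 3D poses from subtle text queries, possibly accompanied by images. We establish benchmarks for these tasks, going beyond traditional 3D pose generation and estimation methods. Our results show that PoseGPT outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks. In addition, its ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis.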

paper_url: http://arxiv.org/abs/2311.18835
repo_url: https://github.com/rongyaofang/instructseq
paper_authors: Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian, Jifeng Dai, Hongsheng Li
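for: This study provides a multi-modal modeling framework driven by natural language instructions, as a step toward more capable and general artificial intelligence.
methods: The model uses a multimodal transformer architecture that covers visual, language, and sequential modeling. A visual encoder extracts image features and a text encoder encodes the instructions. An autoregressive transformer fuses the representations and generates sequential task outputs. By training with LLM-generated natural language instructions, InstructSeq acquires a strong comprehension of free-form instructions, providing an intuitive interface for specifying visual tasks.
results: InstructSeq performs well on semantic segmentation, referring expression segmentation/comprehension, and image captioning without any task-specific tuning. The flexible control and multi-task unification give the model more human-like versatility and generalizability.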

Abstract
Empowering models to dynamically accomplish tasks specified through natural language instructions represents a promising path toward more capable and general artificial intelligence. In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data. InstructSeq employs a multimodal transformer architecture encompassing visual, language, and sequential modeling. We utilize a visual encoder to extract image features and a text encoder to encode instructions. An autoregressive transformer fuses the representations and generates sequential task outputs. By training with LLM-generated natural language instructions, InstructSeq acquires a strong comprehension of free-form instructions for specifying visual tasks. This provides an intuitive interface for directing capabilities using flexible natural instructions. Without any task-specific tuning, InstructSeq achieves compelling performance on semantic segmentation, referring expression segmentation/comprehension, and image captioning. The flexible control and multi-task unification empower the model with more human-like versatility and generalizability for computer vision. The code will be released soon at https://github.com/rongyaofang/InstructSeq.

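Summary
Enabling models to accomplish tasks specified through natural language instructions is a promising path toward more capable and general artificial intelligence. In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through natural language instructions and the handling of visual data. InstructSeq uses a multimodal transformer architecture covering visual, language, and sequential modeling. We use a visual encoder to extract image features and a text encoder to encode instructions. An autoregressive transformer fuses the representations and generates sequential task outputs. By training with natural language instructions generated by a large language model, InstructSeq acquires a strong comprehension of free-form instructions, which provides an intuitive interface for specifying visual tasks. This flexible control and multi-task unification give the model more human-like versatility and generalizability. The code will be released at https://github.com/rongyaofang/InstructSeq.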

S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion

paper_url: http://arxiv.org/abs/2312.00116
repo_url: None
paper_authors: Or Greenberg, Eran Kishon, Dani Lischinski
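for: This paper aims at high-precision image-to-image translation (I2IT) that maintains a fundamental connection in image content.
methods: The paper proposes a new framework, S2ST, that performs global I2IT on complex photorealistic images. S2ST operates within the seed space of a Latent Diffusion Model and leverages the image priors learned by that model to translate images.
results: Experiments show that S2ST surpasses earlier GAN-based I2IT methods and diffusion-based methods, achieving high-fidelity translation across several domains. S2ST also requires no training of domain-specific translation networks, which makes it easier to apply in practice.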

Abstract
Image-to-image translation (I2IT) refers to the process of transforming images from a source domain to a target domain while maintaining a fundamental connection in terms of image content. In the past few years, remarkable advancements in I2IT were achieved by Generative Adversarial Networks (GANs), which nevertheless struggle with translations requiring high precision. Recently, Diffusion Models have established themselves as the engine of choice for image generation. In this paper we introduce S2ST, a novel framework designed to accomplish global I2IT in complex photorealistic images, such as day-to-night or clear-to-rain translations of automotive scenes. S2ST operates within the seed space of a Latent Diffusion Model, thereby leveraging the powerful image priors learned by the latter. We show that S2ST surpasses state-of-the-art GAN-based I2IT methods, as well as diffusion-based approaches, for complex automotive scenes, improving fidelity while respecting the target domain's appearance across a variety of domains. Notably, S2ST obviates the necessity for training domain-specific translation networks.

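Summary
Image-to-image translation (I2IT) refers to transforming images from a source domain to a target domain while maintaining a fundamental connection in image content. In the past few years, I2IT has advanced remarkably, especially through Generative Adversarial Networks (GANs), but these methods struggle when high-precision translation is required. Recently, diffusion models have come to play a major role in image generation. In this paper, we introduce a new framework, S2ST, for global image-to-image translation. S2ST operates within the seed space of a Latent Diffusion Model and can therefore leverage the powerful image priors learned by that model. We show that S2ST surpasses existing GAN-based translation methods and diffusion-based approaches on complex automotive scenes, improving fidelity while respecting the target domain's appearance across a variety of domains. Notably, S2ST does not require training domain-specific translation networks.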

ART$\boldsymbol{\cdot}$V: Auto-Regressive Text-to-Video Generation with Diffusion Models

paper_url: http://arxiv.org/abs/2311.18834
repo_url: None
paper_authors: Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, Zhiwei Xiong
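for: Generating videos with natural motions, rich details, and high aesthetic quality.
methods: Uses diffusion models to generate one frame at a time auto-regressively, which avoids learning complex long-range motions, and uses a masked diffusion model to avoid the inconsistencies caused by errors in the model's own predictions.
results: The model can generate long videos, including videos composed from multiple text prompts, while keeping high generation quality and natural motion, after training for only two weeks on four GPUs.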

Abstract
We present ART$\boldsymbol{\cdot}$V, an efficient framework for auto-regressive video generation with diffusion models. Unlike existing methods that generate entire videos in one-shot, ART$\boldsymbol{\cdot}$V generates a single frame at a time, conditioned on the previous ones. The framework offers three distinct advantages. First, it only learns simple continual motions between adjacent frames, therefore avoiding modeling complex long-range motions that require huge training data. Second, it preserves the high-fidelity generation ability of the pre-trained image diffusion models by making only minimal network modifications. Third, it can generate arbitrarily long videos conditioned on a variety of prompts such as text, image or their combinations, making it highly versatile and flexible. To combat the common drifting issue in AR models, we propose masked diffusion model which implicitly learns which information can be drawn from reference images rather than network predictions, in order to reduce the risk of generating inconsistent appearances that cause drifting. Moreover, we further enhance generation coherence by conditioning it on the initial frame, which typically contains minimal noise. This is particularly useful for long video generation. When trained for only two weeks on four GPUs, ART$\boldsymbol{\cdot}$V already can generate videos with natural motions, rich details and a high level of aesthetic quality. Besides, it enables various appealing applications, e.g., composing a long video from multiple text prompts.

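Summary
We present ART$\boldsymbol{\cdot}$V, an efficient framework for auto-regressive video generation with diffusion models. Unlike existing methods, ART$\boldsymbol{\cdot}$V generates a single frame at a time, conditioned on the previous frames. The framework has three advantages. First, it only needs to learn simple continual motions, which avoids modeling complex long-range motions that require huge amounts of training data. Second, it preserves the high-fidelity generation ability of pre-trained image diffusion models by making only minimal network modifications. Third, it can generate arbitrarily long videos conditioned on text, images, or their combinations, which makes it highly flexible and widely applicable. To address the drifting issue common in auto-regressive models, we propose a masked diffusion model that implicitly learns which information to draw from reference images rather than from network predictions, reducing the impact of prediction errors. We further improve generation coherence by conditioning on the initial frame, which is particularly useful for long video generation. After only two weeks of training, ART$\boldsymbol{\cdot}$V can generate videos with natural motions, rich details, and a high level of aesthetic quality. It also enables many appealing applications, such as composing a long video from multiple text prompts.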
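
The frame-by-frame generation scheme, with each frame conditioned on recent frames and on the initial frame, can be sketched as a simple loop. This is only an outline under stated assumptions: `next_frame_fn` is a hypothetical placeholder for a conditional image diffusion sampler, `context` is an illustrative parameter, and the toy sampler is not a diffusion model.

```python
def generate_video(first_frame, num_frames, next_frame_fn, context=2):
    """Generate frames one at a time, each conditioned on earlier frames.

    next_frame_fn(previous_frames, anchor_frame) -> new frame
    is a placeholder for a conditional image diffusion sampler.
    """
    frames = [first_frame]
    while len(frames) < num_frames:
        previous = frames[-context:]  # most recent reference frames
        frames.append(next_frame_fn(previous, first_frame))
    return frames

def toy_sampler(previous, anchor):
    """Stand-in sampler: small step from the last frame, pulled to the anchor."""
    last = previous[-1]
    return [0.9 * (v + 1.0) + 0.1 * a for v, a in zip(last, anchor)]

# Frames are flat lists of floats here purely for illustration.
video = generate_video([0.0, 0.0], num_frames=4, next_frame_fn=toy_sampler)
```

Because the loop has no fixed length, the same routine can in principle produce arbitrarily long sequences; passing the initial frame at every step mirrors the anchoring described in the abstract.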

Exploiting Diffusion Prior for Generalizable Pixel-Level Semantic Prediction

paper_url: http://arxiv.org/abs/2311.18832
repo_url: https://github.com/shinying/dmp
paper_authors: Hsin-Ying Lee, Hung-Yu Tseng, Hsin-Ying Lee, Ming-Hsuan Yang
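for: This paper addresses the difficulty existing semantic predictors have with content generated by recent text-to-image (T2I) diffusion models.
methods: The paper uses pre-trained T2I models as a prior and reformulates the diffusion process through a sequence of interpolations, establishing a deterministic mapping between input RGB images and output prediction distributions.
results: Extensive experiments demonstrate the efficacy of the proposed method. Despite limited training data, the approach yields faithful predictions for arbitrary images, surpassing existing state-of-the-art algorithms.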

Abstract
Contents generated by recent advanced Text-to-Image (T2I) diffusion models are sometimes too imaginative for existing off-the-shelf property semantic predictors to estimate due to the immitigable domain gap. We introduce DMP, a pipeline utilizing pre-trained T2I models as a prior for pixel-level semantic prediction tasks. To address the misalignment between deterministic prediction tasks and stochastic T2I models, we reformulate the diffusion process through a sequence of interpolations, establishing a deterministic mapping between input RGB images and output prediction distributions. To preserve generalizability, we use low-rank adaptation to fine-tune pre-trained models. Extensive experiments across five tasks, including 3D property estimation, semantic segmentation, and intrinsic image decomposition, showcase the efficacy of the proposed method. Despite limited-domain training data, the approach yields faithful estimations for arbitrary images, surpassing existing state-of-the-art algorithms.

Summary
Content generated by recent Text-to-Image (T2I) diffusion models can fall outside what existing off-the-shelf semantic predictors can estimate, because of a domain gap that is hard to mitigate. We introduce DMP, a pipeline that uses pre-trained T2I models as a prior for pixel-level semantic prediction tasks. To resolve the mismatch between deterministic prediction tasks and stochastic T2I models, we reformulate the diffusion process as a sequence of interpolations, establishing a deterministic mapping between input RGB images and output prediction distributions. To preserve generalizability, we fine-tune the pre-trained models with low-rank adaptation. Extensive experiments across five tasks, including 3D property estimation, semantic segmentation, and intrinsic image decomposition, demonstrate the method's effectiveness. Even with limited-domain training data, it produces faithful estimates for arbitrary images and surpasses existing state-of-the-art algorithms.

MotionEditor: Editing Video Motion via Content-Aware Diffusion

paper_url: http://arxiv.org/abs/2311.18830
repo_url: https://github.com/Francis-Rings/MotionEditor
paper_authors: Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, Yu-Gang Jiang
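for: This work addresses a weakness of existing diffusion-based video editing models: manipulating a video's motion while keeping the original protagonist's appearance and background unchanged.
methods: The paper proposes MotionEditor, a diffusion model that adds a content-aware motion adapter to ControlNet to capture temporal motion correspondence. ControlNet can generate directly from skeleton poses, but it runs into contradictory signals when the source motion is modified in the inverted noise. The adapter passes source content to ControlNet so that control signals are adapted seamlessly. The model also uses a two-branch architecture (a reconstruction branch and an editing branch) with a high-fidelity attention injection mechanism between the branches, so that the editing branch retains the original background and protagonist appearance.
results: Experiments show that MotionEditor has promising motion editing ability, both qualitatively and quantitatively.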

Abstract
Existing diffusion-based video editing models have made gorgeous advances for editing attributes of a source video over time but struggle to manipulate the motion information while preserving the original protagonist's appearance and background. To address this, we propose MotionEditor, a diffusion model for video motion editing. MotionEditor incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence. While ControlNet enables direct generation based on skeleton poses, it encounters challenges when modifying the source motion in the inverted noise due to contradictory signals between the noise (source) and the condition (reference). Our adapter complements ControlNet by involving source content to transfer adapted control signals seamlessly. Further, we build up a two-branch architecture (a reconstruction branch and an editing branch) with a high-fidelity attention injection mechanism facilitating branch interaction. This mechanism enables the editing branch to query the key and value from the reconstruction branch in a decoupled manner, making the editing branch retain the original background and protagonist appearance. We also propose a skeleton alignment algorithm to address the discrepancies in pose size and position. Experiments demonstrate the promising motion editing ability of MotionEditor, both qualitatively and quantitatively.
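The attention injection described in the abstract, where the editing branch queries keys and values from the reconstruction branch, can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: single-head attention, the function names, the tensor shapes, and the choice to concatenate the two branches' keys and values are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention for (tokens, dim) arrays.
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax(q @ k.T * scale, axis=-1)
    return weights @ v

def injected_attention(q_edit, k_edit, v_edit, k_recon, v_recon):
    # Assumption: the editing branch attends over its own keys/values plus
    # those injected from the reconstruction branch, so background and
    # appearance features from the source remain reachable.
    k = np.concatenate([k_edit, k_recon], axis=0)
    v = np.concatenate([v_edit, v_recon], axis=0)
    return attention(q_edit, k, v)

rng = np.random.default_rng(0)
n_tokens, dim = 4, 8  # hypothetical sizes
q_e, k_e, v_e = (rng.normal(size=(n_tokens, dim)) for _ in range(3))
k_r, v_r = (rng.normal(size=(n_tokens, dim)) for _ in range(2))
out = injected_attention(q_e, k_e, v_e, k_r, v_r)
```

The output keeps the editing branch's token count, because only the keys and values are extended; the queries still come from the editing branch alone.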

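Summary
Existing diffusion-based video editing models have made impressive progress in editing the attributes of a source video over time, but they struggle to manipulate motion while preserving the original protagonist's appearance and background. To address this, we propose MotionEditor, a diffusion model for video motion editing. MotionEditor incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence. ControlNet can generate content directly from skeleton poses, but it has trouble modifying the source motion in the inverted noise because the noise (source) and the condition (reference) carry contradictory signals. Our adapter brings in source content so that the adapted control signals transfer seamlessly. We also build a two-branch architecture (a reconstruction branch and an editing branch) with a high-fidelity attention injection mechanism between the branches, which lets the editing branch query the key and value from the reconstruction branch in a decoupled manner and so retain the original background and protagonist appearance. Finally, we propose a skeleton alignment algorithm to handle discrepancies in pose size and position. Experiments show that MotionEditor has promising motion editing ability, both qualitatively and quantitatively.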

Un-EvMoSeg: Unsupervised Event-based Independent Motion Segmentation

paper_url: http://arxiv.org/abs/2312.00114
repo_url: None
paper_authors: Ziyun Wang, Jinyuan Guo, Kostas Daniilidis
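for: This work aims to improve event-camera performance on independently moving object (IMO) segmentation and proposes an unsupervised method based on geometric constraints.
methods: The method takes event-camera data and generates IMO pseudo-labels using geometric constraints.
results: On the EVIMO dataset the method performs competitively with supervised methods, and it can handle an arbitrary number of objects that are not predetermined.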

Abstract
Event cameras are a novel type of biologically inspired vision sensor known for their high temporal resolution, high dynamic range, and low power consumption. Because of these properties, they are well-suited for processing fast motions that require rapid reactions. Although event cameras have recently shown competitive performance in unsupervised optical flow estimation, performance in detecting independently moving objects (IMOs) is lacking behind, although event-based methods would be suited for this task based on their low latency and HDR properties. Previous approaches to event-based IMO segmentation have been heavily dependent on labeled data. However, biological vision systems have developed the ability to avoid moving objects through daily tasks without being given explicit labels. In this work, we propose the first event framework that generates IMO pseudo-labels using geometric constraints. Due to its unsupervised nature, our method can handle an arbitrary number of not predetermined objects and is easily scalable to datasets where expensive IMO labels are not readily available. We evaluate our approach on the EVIMO dataset and show that it performs competitively with supervised methods, both quantitatively and qualitatively.

Summary
Previous methods for event-based IMO segmentation have relied heavily on labeled data. Biological vision systems, however, have developed the ability to avoid moving objects in daily tasks without being given explicit labels. In this work, we propose the first event framework that generates IMO pseudo-labels using geometric constraints. Because the method is unsupervised, it can handle an arbitrary number of objects that are not predetermined, and it scales easily to datasets where expensive IMO labels are not readily available. We evaluate the approach on the EVIMO dataset and show that it performs competitively with supervised methods, both quantitatively and qualitatively.

MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation

paper_url: http://arxiv.org/abs/2311.18829
repo_url: None
paper_authors: Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, Jingxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan, Xiaoyan Sun, Chong Luo, Baining Guo
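for: This work proposes a simple yet effective text-to-video framework for generating high-quality, coherent videos.
methods: The framework uses a divide-and-conquer strategy that splits text-to-video generation into two stages: text-to-image generation and image&text-to-video generation. The strategy has two advantages. First, it takes full advantage of recent text-to-image models such as Stable Diffusion, Midjourney, and DALLE to produce high-quality, richly detailed images. Second, given the generated image, the model can spend less capacity on fine-grained appearance details and concentrate on learning motion dynamics.
results: Experiments show that the framework generates high-quality videos with precise motion while preserving the appearance of the given image. In the zero-shot setting, MicroCinema achieves a SOTA FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT.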

Abstract
We present MicroCinema, a straightforward yet effective framework for high-quality and coherent text-to-video generation. Unlike existing approaches that align text prompts with video directly, MicroCinema introduces a Divide-and-Conquer strategy which divides the text-to-video into a two-stage process: text-to-image generation and image\&text-to-video generation. This strategy offers two significant advantages. a) It allows us to take full advantage of the recent advances in text-to-image models, such as Stable Diffusion, Midjourney, and DALLE, to generate photorealistic and highly detailed images. b) Leveraging the generated image, the model can allocate less focus to fine-grained appearance details, prioritizing the efficient learning of motion dynamics. To implement this strategy effectively, we introduce two core designs. First, we propose the Appearance Injection Network, enhancing the preservation of the appearance of the given image. Second, we introduce the Appearance Noise Prior, a novel mechanism aimed at maintaining the capabilities of pre-trained 2D diffusion models. These design elements empower MicroCinema to generate high-quality videos with precise motion, guided by the provided text prompts. Extensive experiments demonstrate the superiority of the proposed framework. Concretely, MicroCinema achieves SOTA zero-shot FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT. See https://wangyanhui666.github.io/MicroCinema.github.io/ for video samples.

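Summary
We present MicroCinema, a simple yet effective framework for text-to-video generation. Unlike existing approaches, MicroCinema uses a divide-and-conquer strategy that turns text-to-video into a two-stage process: text-to-image generation and image&text-to-video generation. The strategy brings two significant advantages. a) It lets us take full advantage of recent text-to-image models, such as Stable Diffusion, Midjourney, and DALLE, to generate photorealistic and highly detailed images. b) With the generated image available, the model can allocate less focus to fine-grained appearance details and prioritize learning motion dynamics. To implement the strategy, we introduce two core designs. First, we propose the Appearance Injection Network, which improves preservation of the given image's appearance. Second, we introduce the Appearance Noise Prior, a mechanism aimed at maintaining the capabilities of pre-trained 2D diffusion models. These design elements let MicroCinema generate high-quality videos with precise motion, guided by the text prompts. Our experiments demonstrate the superiority of the proposed framework. Concretely, MicroCinema achieves SOTA zero-shot FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT. See https://wangyanhui666.github.io/MicroCinema.github.io/ for video samples.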

Event-based Continuous Color Video Decompression from Single Frames

paper_url: http://arxiv.org/abs/2312.00113
repo_url: None
paper_authors: Ziyun Wang, Friedhelm Hamann, Kenneth Chaney, Wen Jiang, Guillermo Gallego, Kostas Daniilidis
for: generate a continuous video from a single static RGB image
methods: event-based continuous color video decompression, combining long-range motion modeling and feature-plane-based synthesis neural integration
results: significantly outperforms event- and image-based baselines in the proposed task

Abstract
We present ContinuityCam, a novel approach to generate a continuous video from a single static RGB image, using an event camera. Conventional cameras struggle with high-speed motion capture due to bandwidth and dynamic range limitations. Event cameras are ideal sensors to solve this problem because they encode compressed change information at high temporal resolution. In this work, we propose a novel task called event-based continuous color video decompression, pairing single static color frames and events to reconstruct temporally continuous videos. Our approach combines continuous long-range motion modeling with a feature-plane-based synthesis neural integration model, enabling frame prediction at arbitrary times within the events. Our method does not rely on additional frames except for the initial image, increasing, thus, the robustness to sudden light changes, minimizing the prediction latency, and decreasing the bandwidth requirement. We introduce a novel single objective beamsplitter setup that acquires aligned images and events and a novel and challenging Event Extreme Decompression Dataset (E2D2) that tests the method in various lighting and motion profiles. We thoroughly evaluate our method through benchmarking reconstruction as well as various downstream tasks. Our approach significantly outperforms the event- and image- based baselines in the proposed task.

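Summary
We present ContinuityCam, a novel approach that generates a continuous video from a single static RGB image using an event camera. Conventional cameras struggle with high-speed motion capture because of bandwidth and dynamic range limitations. Event cameras are ideal sensors for this problem because they encode compressed change information at high temporal resolution. We propose a new task, event-based continuous color video decompression, which pairs single static color frames with events to reconstruct temporally continuous videos. Our approach combines continuous long-range motion modeling with a feature-plane-based synthesis neural integration model, enabling frame prediction at arbitrary times within the events. The method needs no frames beyond the initial image, which increases robustness to sudden light changes, minimizes prediction latency, and reduces the bandwidth requirement. We introduce a novel single-objective beamsplitter setup that acquires aligned images and events, and a novel and challenging Event Extreme Decompression Dataset (E2D2) that tests the method under various lighting and motion profiles. We thoroughly evaluate the method by benchmarking reconstruction as well as various downstream tasks. Our approach significantly outperforms the event- and image-based baselines on the proposed task.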

One-step Diffusion with Distribution Matching Distillation

paper_url: http://arxiv.org/abs/2311.18828
repo_url: None
paper_authors: Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, Taesung Park
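for: Turn diffusion models into one-step image generators, cutting computation with minimal impact on image quality.
methods: The paper proposes Distribution Matching Distillation (DMD), which forces the one-step generator to match the diffusion model at the distribution level. Two diffusion models are used, one for the target distribution and one for the generated distribution; the difference between their scores gives the gradient of an approximate KL divergence, which is minimized to match the distributions. A simple regression loss additionally matches the large-scale structure of the multi-step diffusion outputs.
results: The method reaches 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot COCO-30k, comparable to Stable Diffusion but orders of magnitude faster. The model generates images at 20 FPS on modern hardware.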

Abstract
Diffusion models generate high-quality images but require dozens of forward passes. We introduce Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator with minimal impact on image quality. We enforce the one-step image generator match the diffusion model at distribution level, by minimizing an approximate KL divergence whose gradient can be expressed as the difference between 2 score functions, one of the target distribution and the other of the synthetic distribution being produced by our one-step generator. The score functions are parameterized as two diffusion models trained separately on each distribution. Combined with a simple regression loss matching the large-scale structure of the multi-step diffusion outputs, our method outperforms all published few-step diffusion approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot COCO-30k, comparable to Stable Diffusion but orders of magnitude faster. Utilizing FP16 inference, our model generates images at 20 FPS on modern hardware.
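The claim that the KL gradient is a difference of two score functions can be checked on a toy problem where both scores are known in closed form. The sketch below is an assumption-laden illustration, not the paper's training loop: the target is a 1-D Gaussian N(mu_real, 1), the one-step generator is G(z) = z + theta, and the analytic scores stand in for the two diffusion models that DMD trains. All names are hypothetical.

```python
import numpy as np

def score_gaussian(x, mean, std=1.0):
    # Score function (gradient of the log-density) of N(mean, std^2) at x.
    return -(x - mean) / std**2

def dmd_gradient(theta, mu_real, z):
    # One-step generator G(z) = z + theta, so dG/dtheta = 1.
    x = z + theta
    s_real = score_gaussian(x, mu_real)  # score of the target distribution
    s_fake = score_gaussian(x, theta)    # score of the generator's distribution
    # Gradient of KL(fake || real) w.r.t. theta: E[(s_fake - s_real) * dG/dtheta].
    return np.mean(s_fake - s_real)

rng = np.random.default_rng(0)
theta, mu_real, lr = 0.0, 3.0, 0.1  # hypothetical toy values
for _ in range(200):
    z = rng.normal(size=256)
    theta -= lr * dmd_gradient(theta, mu_real, z)
```

In this toy case the score difference reduces to theta - mu_real, the exact gradient of the KL divergence between the two unit-variance Gaussians, so gradient descent moves theta toward mu_real. In the real setting neither score is available analytically, which is why they are parameterized by diffusion models.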

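Summary
Diffusion models generate high-quality images but require dozens of forward passes. We introduce Distribution Matching Distillation (DMD), a procedure that transforms a diffusion model into a one-step image generator with minimal impact on image quality. We force the one-step generator to match the diffusion model at the distribution level by minimizing an approximate KL divergence whose gradient can be expressed as the difference between two score functions, one for the target distribution and one for the synthetic distribution produced by the one-step generator. The score functions are parameterized as two diffusion models, each trained separately on its own distribution. Combined with a simple regression loss that matches the large-scale structure of the multi-step diffusion outputs, the method reaches 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot COCO-30k, comparable to Stable Diffusion but orders of magnitude faster. With FP16 inference, the model generates images at 20 FPS on modern hardware.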

DynMF: Neural Motion Factorization for Real-time Dynamic View Synthesis with 3D Gaussian Splatting

paper_url: http://arxiv.org/abs/2312.00112
repo_url: https://github.com/agelosk/dynmf
paper_authors: Agelos Kratimenos, Jiahui Lei, Kostas Daniilidis
for: This paper aims to address the challenges of modeling dynamic scenes and motions, and to provide a compact and efficient representation for dynamic scene rendering.
methods: The paper proposes a neural representation called DynMF, which decomposes a dynamic scene into a few neural trajectories. The representation is a carefully designed neural framework built on a tiny set of learned basis trajectories queried only in time, and it adequately constrains the motion field of the scene to enable effective and fast optimization.
results: The paper demonstrates that the proposed method can reach state-of-the-art render quality within just 5 minutes of training, and it can synthesize novel views of dynamic scenes with superior photorealistic quality in less than half an hour. The representation is interpretable, efficient, and expressive enough to offer real-time view synthesis of complex dynamic scene motions in monocular and multi-view scenarios.

Abstract
Accurately and efficiently modeling dynamic scenes and motions is considered so challenging a task due to temporal dynamics and motion complexity. To address these challenges, we propose DynMF, a compact and efficient representation that decomposes a dynamic scene into a few neural trajectories. We argue that the per-point motions of a dynamic scene can be decomposed into a small set of explicit or learned trajectories. Our carefully designed neural framework consisting of a tiny set of learned basis queried only in time allows for rendering speed similar to 3D Gaussian Splatting, surpassing 120 FPS, while at the same time, requiring only double the storage compared to static scenes. Our neural representation adequately constrains the inherently underconstrained motion field of a dynamic scene leading to effective and fast optimization. This is done by biding each point to motion coefficients that enforce the per-point sharing of basis trajectories. By carefully applying a sparsity loss to the motion coefficients, we are able to disentangle the motions that comprise the scene, independently control them, and generate novel motion combinations that have never been seen before. We can reach state-of-the-art render quality within just 5 minutes of training and in less than half an hour, we can synthesize novel views of dynamic scenes with superior photorealistic quality. Our representation is interpretable, efficient, and expressive enough to offer real-time view synthesis of complex dynamic scene motions, in monocular and multi-view scenarios.
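The factorization described in the abstract, where each point's motion is a combination of a few shared basis trajectories weighted by per-point coefficients, can be sketched as a small matrix product. This is a minimal sketch under stated assumptions: the basis trajectories are hand-written functions of time here (the paper learns them), and the L1 penalty stands in for the sparsity loss, whose exact form the abstract does not give. All names are hypothetical.

```python
import numpy as np

def basis_at(t):
    # Two hypothetical basis trajectories evaluated at time t, shape (B, 3).
    # In the paper these are learned and queried only in time.
    return np.array([[t, 0.0, 0.0],
                     [0.0, np.sin(t), 0.0]])

def point_positions(x0, coeffs, basis_t):
    # x0: (N, 3) canonical positions, coeffs: (N, B) per-point motion
    # coefficients, basis_t: (B, 3) basis displacements at one time step.
    return x0 + coeffs @ basis_t

def l1_sparsity(coeffs):
    # Assumed stand-in for the sparsity loss: mean L1 norm per point.
    return np.abs(coeffs).sum(axis=1).mean()

x0 = np.zeros((3, 3))
coeffs = np.array([[1.0, 0.0],   # point 0 follows basis 0 only
                   [0.0, 1.0],   # point 1 follows basis 1 only
                   [0.5, 0.5]])  # point 2 mixes both
pos = point_positions(x0, coeffs, basis_at(2.0))
```

Because the basis is shared, storage grows with the number of coefficients per point rather than with the number of time steps, and zeroing a column of `coeffs` switches off one motion component for every point at once.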

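Summary
Modeling dynamic scenes and motions accurately and efficiently is very challenging because of temporal dynamics and motion complexity. To address these challenges, we propose DynMF, a compact and efficient representation that decomposes a dynamic scene into a few neural trajectories. We argue that the per-point motions of a dynamic scene can be decomposed into a small set of explicit or learned trajectories. Our neural framework, a tiny set of learned basis trajectories queried only in time, renders at a speed similar to 3D Gaussian Splatting, surpassing 120 FPS, while requiring only double the storage of a static scene. The representation adequately constrains the inherently underconstrained motion field of a dynamic scene, which leads to effective and fast optimization. This is done by binding each point to motion coefficients that enforce per-point sharing of the basis trajectories. By applying a sparsity loss to the motion coefficients, we can disentangle the motions that make up the scene, control them independently, and generate motion combinations never seen before. We reach state-of-the-art render quality within just 5 minutes of training, and in less than half an hour we can synthesize novel views of dynamic scenes with superior photorealistic quality. The representation is interpretable, efficient, and expressive enough for real-time view synthesis of complex dynamic scene motions in monocular and multi-view scenarios.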

CAST: Cross-Attention in Space and Time for Video Action Recognition

paper_url: http://arxiv.org/abs/2311.18825
repo_url: https://github.com/khu-vll/cast
paper_authors: Dongho Lee, Jongseo Lee, Jinwoo Choi
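for: The paper proposes a new action recognition method to address the lack of balanced spatio-temporal understanding in existing action recognition models.
methods: The method is a two-stream architecture called Cross-Attention in Space and Time (CAST) that achieves balanced spatio-temporal understanding from RGB input alone. A bottleneck cross-attention mechanism lets the spatial and temporal expert models exchange information and make synergistic predictions, which improves performance.
results: Extensive experiments on public benchmarks with different characteristics (EPIC-KITCHENS-100, Something-Something-V2, Kinetics-400) show that the method performs consistently well across datasets, whereas the performance of existing methods fluctuates with dataset characteristics.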

Abstract
Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.


DEVIAS: Learning Disentangled Video Representations of Action and Scene for Holistic Video Understanding

paper_url: http://arxiv.org/abs/2312.00826
repo_url: None
paper_authors: Kyungho Bae, Geo Ahn, Youngrae Kim, Jinwoo Choi
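for: Improve holistic video understanding, particularly video classification and recognition across different scene contexts.
methods: Use slot attention to learn disentangled action and scene representations, with auxiliary tasks that further guide the slot attention.
results: Across datasets, the proposed method achieves better video understanding than the baselines, particularly on classification and recognition in varied scene contexts.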

Abstract
When watching a video, humans can naturally extract human actions from the surrounding scene context, even when action-scene combinations are unusual. However, unlike humans, video action recognition models often learn scene-biased action representations from the spurious correlation in training data, leading to poor performance in out-of-context scenarios. While scene-debiased models achieve improved performance in out-of-context scenarios, they often overlook valuable scene information in the data. Addressing this challenge, we propose Disentangled VIdeo representations of Action and Scene (DEVIAS), which aims to achieve holistic video understanding. Disentangled action and scene representations with our method could provide flexibility to adjust the emphasis on action or scene information depending on downstream task and dataset characteristics. Disentangled action and scene representations could be beneficial for both in-context and out-of-context video understanding. To this end, we employ slot attention to learn disentangled action and scene representations with a single model, along with auxiliary tasks that further guide slot attention. We validate the proposed method on both in-context datasets: UCF-101 and Kinetics-400, and out-of-context datasets: SCUBA and HAT. Our proposed method shows favorable performance across different datasets compared to the baselines, demonstrating its effectiveness in diverse video understanding scenarios.

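Summary
When watching a video, humans can naturally pick out human actions from the surrounding scene context, even when the action-scene combination is unusual. Unlike humans, video action recognition models often learn scene-biased action representations from spurious correlations in the training data, which leads to poor performance in out-of-context scenarios. Scene-debiased models improve out-of-context performance, but they often overlook valuable scene information in the data. To address this challenge, we propose Disentangled VIdeo representations of Action and Scene (DEVIAS), which aims at holistic video understanding. We use slot attention to learn disentangled action and scene representations, with auxiliary tasks that guide the slot attention. We validate the method on UCF-101, Kinetics-400, SCUBA, and HAT, and show that it performs well across diverse video understanding scenarios.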

Initializing Models with Larger Ones

paper_url: http://arxiv.org/abs/2311.18823
repo_url: https://github.com/oscarxzq/weight-selection
paper_authors: Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu
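for: The paper proposes weight selection, a method that initializes a smaller model with a subset of weights selected from a pretrained larger model, transferring knowledge from the larger model to the smaller one.
methods: Weight selection copies a subset of the large pretrained model's weights into the small model, and it can be combined with knowledge distillation.
results: Experiments show that weight selection significantly improves small-model performance and reduces training time. The paper also shows that it can be used together with knowledge distillation.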

Abstract
Weight initialization plays an important role in neural network training. Widely used initialization methods are proposed and evaluated for networks that are trained from scratch. However, the growing number of pretrained models now offers new opportunities for tackling this classical problem of weight initialization. In this work, we introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model. This enables the transfer of knowledge from pretrained weights to smaller models. Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time. Notably, it can also be used together with knowledge distillation. Weight selection offers a new approach to leverage the power of pretrained models in resource-constrained settings, and we hope it can be a useful tool for training small models in the large-model era. Code is available at https://github.com/OscarXZQ/weight-selection.
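The core operation, initializing a smaller layer from a subset of a larger pretrained layer's weights, can be sketched in a few lines. Which subset is selected is an assumption here: this sketch takes the leading slice along every dimension, and the function name `select_weights` and the toy tensors are hypothetical, not taken from the released code.

```python
import numpy as np

def select_weights(large, small_shape):
    # Assumption: keep the first `s` entries along each dimension of the
    # larger tensor. Other selection rules would fit the same interface.
    assert len(small_shape) == large.ndim
    assert all(s <= l for s, l in zip(small_shape, large.shape))
    slices = tuple(slice(0, s) for s in small_shape)
    return large[slices].copy()  # copy so the small model owns its weights

# A toy "pretrained" 4x6 layer and a 2x3 layer to initialize from it.
large_layer = np.arange(24, dtype=np.float32).reshape(4, 6)
small_layer = select_weights(large_layer, (2, 3))
```

The small model is then trained as usual from this starting point instead of from a random initialization.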

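Summary
Weight initialization plays an important role in neural network training. Widely used initialization methods were proposed and evaluated for networks trained from scratch. However, the growing number of pretrained models now offers new opportunities for tackling this classical problem. In this work we introduce weight selection, a method that initializes a smaller model by selecting a subset of weights from a pretrained larger model, transferring knowledge from the pretrained weights to the smaller model. Our experiments show that weight selection can significantly improve the performance of small models and reduce their training time. It can also be used together with knowledge distillation. Weight selection offers a new way to use pretrained models in resource-constrained settings, and we hope it can be a useful tool for training small models in the large-model era. Code is available at https://github.com/OscarXZQ/weight-selection.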

ElasticDiffusion: Training-free Arbitrary Size Image Generation

paper_url: http://arxiv.org/abs/2311.18822
repo_url: https://github.com/moayedhajiali/elasticdiffusion-official
paper_authors: Moayed Haji-Ali, Guha Balakrishnan, Vicente Ordonez
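for: The paper addresses the limitation that current image generation models support only a few sizes and aspect ratios.
methods: The paper proposes ElasticDiffusion, a training-free decoding method that lets pretrained text-to-image diffusion models generate images of various sizes. ElasticDiffusion attempts to decouple the generation trajectory of a pretrained model into local and global signals. The local signal controls low-level pixel information and can be estimated on local patches, while the global signal maintains overall structural consistency and is estimated with a reference image.
results: Experiments and qualitative results on CelebA-HQ (faces) and LAION-COCO (objects/indoor/outdoor scenes) show better image coherence across aspect ratios than MultiDiffusion and the standard decoding strategy of Stable Diffusion. Code: https://github.com/MoayedHajiAli/ElasticDiffusion-official.git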

Abstract
Diffusion models have revolutionized image generation in recent years, yet they are still limited to a few sizes and aspect ratios. We propose ElasticDiffusion, a novel training-free decoding method that enables pretrained text-to-image diffusion models to generate images with various sizes. ElasticDiffusion attempts to decouple the generation trajectory of a pretrained model into local and global signals. The local signal controls low-level pixel information and can be estimated on local patches, while the global signal is used to maintain overall structural consistency and is estimated with a reference image. We test our method on CelebA-HQ (faces) and LAION-COCO (objects/indoor/outdoor scenes). Our experiments and qualitative results show superior image coherence quality across aspect ratios compared to MultiDiffusion and the standard decoding strategy of Stable Diffusion. Code: https://github.com/MoayedHajiAli/ElasticDiffusion-official.git

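Summary
Diffusion models have revolutionized image generation in recent years, yet they are still limited to a few sizes and aspect ratios. We propose ElasticDiffusion, a novel training-free decoding method that enables pretrained text-to-image diffusion models to generate images of various sizes. ElasticDiffusion attempts to decouple the generation trajectory of a pretrained model into local and global signals. The local signal controls low-level pixel information and can be estimated on local patches, while the global signal maintains overall structural consistency and is estimated with a reference image. We test the method on CelebA-HQ (faces) and LAION-COCO (objects/indoor/outdoor scenes). Experiments and qualitative results show better image coherence across aspect ratios than MultiDiffusion and the standard decoding strategy of Stable Diffusion. Code: https://github.com/MoayedHajiAli/ElasticDiffusion-official.git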

LucidDreaming: Controllable Object-Centric 3D Generation

paper_url: http://arxiv.org/abs/2312.00588
repo_url: None
paper_authors: Zhaoning Wang, Ming Li, Chen Chen
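for: This work presents LucidDreaming, an effective pipeline for fine-grained control over 3D generation.
methods: The method needs only minimal input in the form of 3D bounding boxes, which a large language model can deduce from a text prompt. It proposes clipped ray sampling to render and optimize user-specified objects separately, and introduces an object-centric density blob bias to encourage separation of the generated objects.
results: The method performs well both when generating 3D content from scratch and when editing within pre-trained NeRF scenes, achieving better alignment of 3D content than baseline approaches. The authors also provide a dataset of prompts with 3D bounding boxes for benchmarking 3D spatial controllability.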

Abstract
With the recent development of generative models, Text-to-3D generations have also seen significant growth. Nonetheless, achieving precise control over 3D generation continues to be an arduous task, as using text to control often leads to missing objects and imprecise locations. Contemporary strategies for enhancing controllability in 3D generation often entail the introduction of additional parameters, such as customized diffusion models. This often induces hardness in adapting to different diffusion models or creating distinct objects. In this paper, we present LucidDreaming as an effective pipeline capable of fine-grained control over 3D generation. It requires only minimal input of 3D bounding boxes, which can be deduced from a simple text prompt using a Large Language Model. Specifically, we propose clipped ray sampling to separately render and optimize objects with user specifications. We also introduce object-centric density blob bias, fostering the separation of generated objects. With individual rendering and optimizing of objects, our method excels not only in controlled content generation from scratch but also within the pre-trained NeRF scenes. In such scenarios, existing generative approaches often disrupt the integrity of the original scene, and current editing methods struggle to synthesize new content in empty spaces. We show that our method exhibits remarkable adaptability across a spectrum of mainstream Score Distillation Sampling-based 3D generation frameworks, and achieves superior alignment of 3D content when compared to baseline approaches. We also provide a dataset of prompts with 3D bounding boxes, benchmarking 3D spatial controllability.
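One plausible reading of clipped ray sampling is that each camera ray is sampled only inside an object's 3D bounding box, so that object can be rendered and optimized on its own. The sketch below illustrates that reading with the standard slab test for ray/box intersection. It is an assumption about the idea, not the authors' procedure, and the function names and the uniform spacing of samples are hypothetical.

```python
import numpy as np

def clip_ray_to_box(origin, direction, box_min, box_max, eps=1e-12):
    # Slab test: return (t_near, t_far) for the part of the ray inside the
    # axis-aligned box, or None if the ray misses it. `direction` must be
    # a non-zero vector.
    t_near, t_far = 0.0, np.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < eps:
            if o < lo or o > hi:
                return None  # parallel to this slab and outside it
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    return None if t_near > t_far else (t_near, t_far)

def sample_in_box(origin, direction, box_min, box_max, n_samples):
    # Place samples uniformly along the clipped segment only.
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    hit = clip_ray_to_box(origin, direction, box_min, box_max)
    if hit is None:
        return np.empty((0, 3))
    t = np.linspace(hit[0], hit[1], n_samples)
    return origin + t[:, None] * direction

box_min, box_max = np.array([-1.0, -1.0, -1.0]), np.array([1.0, 1.0, 1.0])
pts = sample_in_box([0.0, 0.0, -2.0], [0.0, 0.0, 1.0], box_min, box_max, 5)
missed = sample_in_box([5.0, 0.0, -2.0], [0.0, 0.0, 1.0], box_min, box_max, 5)
```

Rays that miss the box contribute no samples, so an object confined to its box never receives density or gradients from outside it.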

Summary
Recent developments in generative models have led to significant growth in text-to-3D generation. However, precise control over 3D generation remains challenging, because controlling generation with text often results in missing objects and imprecise locations. Many current strategies address this by introducing additional parameters, such as customized diffusion models, which makes it difficult to adapt to different diffusion models or to create distinct objects. In this paper, we propose LucidDreaming, an effective pipeline that provides fine-grained control over 3D generation with minimal input. The method requires only 3D bounding boxes, which a Large Language Model can deduce from a simple text prompt. We use clipped ray sampling to render and optimize objects separately according to user specifications, and we introduce an object-centric density blob bias to encourage separation of the generated objects. The method excels not only at generating content from scratch but also within pre-trained NeRF scenes, where other generative approaches often disrupt the integrity of the original scene and current editing methods struggle to synthesize new content in empty spaces. We demonstrate the method's adaptability across a range of mainstream Score Distillation Sampling-based 3D generation frameworks and show better alignment of 3D content than baseline approaches. We also provide a dataset of prompts with 3D bounding boxes for benchmarking 3D spatial controllability.

IMMA: Immunizing text-to-image Models against Malicious Adaptation

paper_url: http://arxiv.org/abs/2311.18815
repo_url: https://github.com/zhengyjzoe/imma
paper_authors: Yijia Zheng, Raymond A. Yeh
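for: Protect text-to-image models against malicious adaptation.
methods: The paper proposes a new protection paradigm: learn model parameters that are difficult for adaptation methods to fine-tune on malicious content.
results: Experiments with three adaptation methods (LoRA, Textual-Inversion, DreamBooth) show that IMMA is effective against malicious adaptation.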

Abstract
Advancements in text-to-image models and fine-tuning methods have led to the increasing risk of malicious adaptation, i.e., fine-tuning to generate harmful unauthorized content. Recent works, e.g., Glaze or MIST, have developed data-poisoning techniques which protect the data against adaptation methods. In this work, we consider an alternative paradigm for protection. We propose to ``immunize'' the model by learning model parameters that are difficult for the adaptation methods when fine-tuning malicious content; in short IMMA. Empirical results show IMMA's effectiveness against malicious adaptations, including mimicking the artistic style and learning of inappropriate/unauthorized content, over three adaptation methods: LoRA, Textual-Inversion, and DreamBooth.

Summary
Recent advances in text-to-image models and fine-tuning methods have led to an increasing risk of malicious adaptation, i.e., fine-tuning to generate harmful and unauthorized content. Some recent works, such as Glaze or MIST, have proposed data-poisoning techniques that protect the data against adaptation methods. We propose an alternative paradigm for protection: "immunize" the model by learning model parameters that are difficult for adaptation methods to fine-tune on malicious content. We call this approach IMMA (Immunizing text-to-image Models against Malicious Adaptation). Our empirical results show that IMMA is effective against malicious adaptations, including mimicking artistic styles and learning inappropriate or unauthorized content, across three adaptation methods: LoRA, Textual-Inversion, and DreamBooth.

Is Underwater Image Enhancement All Object Detectors Need?

paper_url: http://arxiv.org/abs/2311.18814
repo_url: https://github.com/bigwangyudong/lqit
paper_authors: Yudong Wang, Jichang Guo, Wanru He, Huan Gao, Huihui Yue, Zenan Zhang, Chongyi Li
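for: This study asks two questions: "Does underwater image enhancement really improve underwater object detection?" and "How does underwater image enhancement contribute to underwater object detection?"
methods: The study pre-processes underwater object detection data with 18 state-of-the-art underwater image enhancement algorithms, covering traditional, CNN-based, and GAN-based algorithms, and then retrains 7 deep learning-based object detectors on the results of each algorithm.
results: The study analyzes 133 models in total (126 retrained on enhanced images and 7 retrained on raw underwater images) to assess how enhancement affects detection. The abstract does not report headline numbers; the pre-trained models and results are published on the project page.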

Abstract
Underwater object detection is a crucial and challenging problem in marine engineering and aquatic robotics. The difficulty is partly because of the degradation of underwater images caused by selective light absorption and scattering. Intuitively, enhancing underwater images can benefit high-level applications like underwater object detection. However, it is still unclear whether all object detectors need underwater image enhancement as pre-processing. We therefore pose the questions "Does underwater image enhancement really improve underwater object detection?" and "How does underwater image enhancement contribute to underwater object detection?". With these two questions, we conduct extensive studies. Specifically, we use 18 state-of-the-art underwater image enhancement algorithms, covering traditional, CNN-based, and GAN-based algorithms, to pre-process underwater object detection data. Then, we retrain 7 popular deep learning-based object detectors using the corresponding results enhanced by different algorithms, obtaining 126 underwater object detection models. Coupled with 7 object detection models retrained using raw underwater images, we employ these 133 models to comprehensively analyze the effect of underwater image enhancement on underwater object detection. We expect this study can provide sufficient exploration to answer the aforementioned questions and draw more of the community's attention to the joint problem of underwater image enhancement and underwater object detection. The pre-trained models and results are publicly available and will be regularly updated. Project page: https://github.com/BIGWangYuDong/lqit/tree/main/configs/detection/uw_enhancement_affect_detection.
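
The model count in the abstract follows directly from the experimental grid: 18 enhancers times 7 detectors, plus 7 detectors trained on raw images. A minimal sketch of that enumeration, assuming placeholder labels (`enhancer_01`, `detector_1`, and `raw` are hypothetical names, not the algorithms or detectors used in the paper):

```python
from itertools import product


def build_runs(num_enhancers=18, num_detectors=7):
    """Enumerate (pre-processing, detector) training runs for the study grid."""
    # Hypothetical placeholder labels; the paper's actual algorithms are not listed here.
    enhancers = [f"enhancer_{i:02d}" for i in range(1, num_enhancers + 1)]
    detectors = [f"detector_{i}" for i in range(1, num_detectors + 1)]

    # One retrained detector per (enhancement algorithm, detector) pair.
    runs = list(product(enhancers, detectors))
    # Baselines: each detector retrained on the raw, unenhanced images.
    runs += [("raw", detector) for detector in detectors]
    return runs


if __name__ == "__main__":
    runs = build_runs()
    enhanced = [run for run in runs if run[0] != "raw"]
    print(len(enhanced), len(runs))
```

Keeping the raw-image baselines in the same list as the enhanced runs makes it easy to compare each detector against its own baseline.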

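Summary
Underwater object detection is an important and challenging problem in marine engineering and underwater robotics. Part of the difficulty comes from the degradation of underwater images caused by selective light absorption and scattering. Intuitively, enhancing underwater images should help high-level applications such as underwater object detection, yet it remains unclear whether every object detector needs enhancement as a pre-processing step. The authors therefore ask two questions: "Does underwater image enhancement really improve underwater object detection?" and "How does underwater image enhancement contribute to underwater object detection?" To study them, they pre-process underwater detection data with 18 state-of-the-art enhancement algorithms (traditional, CNN-based, and GAN-based) and retrain 7 deep learning-based detectors on each enhanced version, obtaining 126 models. Together with 7 detectors retrained on the raw underwater images, these 133 models support a comprehensive analysis of how enhancement affects detection. The authors hope the study answers the questions above and draws more attention to the joint problem of underwater image enhancement and underwater object detection. The pre-trained models and results are publicly available and will be updated regularly. Project page: https://github.com/BIGWangYuDong/lqit/tree/main/configs/detection/uw_enhancement_affect_detection.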

MD-Splatting: Learning Metric Deformation from 4D Gaussians in Highly Deformable Scenes

paper_url: http://arxiv.org/abs/2312.00583
repo_url: None
paper_authors: Bardienus P. Duisterhof, Zhao Mandi, Yunchao Yao, Jia-Wei Liu, Mike Zheng Shou, Shuran Song, Jeffrey Ichnowski
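for: This paper targets accurate 3D tracking and novel view synthesis in highly deformable scenes, with applications in robotics, augmented reality, and generative AI.
methods: The paper proposes MD-Splatting, which builds on Gaussian splatting to perform simultaneous 3D tracking and novel view synthesis from video captures of a dynamic scene taken from various camera poses. MD-Splatting learns a deformation function that projects a set of Gaussians with non-metric (canonical) properties into metric space.
results: MD-Splatting achieves high-quality 3D tracking and novel view synthesis in highly deformable scenes and improves 3D tracking by an average of 23.9% over the state of the art. With sufficient texture, as in scene 6, it reaches a median tracking error of 3.39 mm on a 1 x 1 meter cloth.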

Abstract
Accurate 3D tracking in highly deformable scenes with occlusions and shadows can facilitate new applications in robotics, augmented reality, and generative AI. However, tracking under these conditions is extremely challenging due to the ambiguity that arises with large deformations, shadows, and occlusions. We introduce MD-Splatting, an approach for simultaneous 3D tracking and novel view synthesis, using video captures of a dynamic scene from various camera poses. MD-Splatting builds on recent advances in Gaussian splatting, a method that learns the properties of a large number of Gaussians for state-of-the-art and fast novel view synthesis. MD-Splatting learns a deformation function to project a set of Gaussians with non-metric, thus canonical, properties into metric space. The deformation function uses a neural-voxel encoding and a multilayer perceptron (MLP) to infer Gaussian position, rotation, and a shadow scalar. We enforce physics-inspired regularization terms based on local rigidity, conservation of momentum, and isometry, which leads to trajectories with smaller trajectory errors. MD-Splatting achieves high-quality 3D tracking on highly deformable scenes with shadows and occlusions. Compared to state-of-the-art, we improve 3D tracking by an average of 23.9 %, while simultaneously achieving high-quality novel view synthesis. With sufficient texture such as in scene 6, MD-Splatting achieves a median tracking error of 3.39 mm on a cloth of 1 x 1 meters in size. Project website: https://md-splatting.github.io/.

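Summary
Accurate 3D tracking in highly deformable scenes with occlusions and shadows can enable new applications in robotics, augmented reality, and generative AI. Tracking under these conditions is very difficult because large deformations, shadows, and occlusions introduce ambiguity. MD-Splatting performs simultaneous 3D tracking and novel view synthesis from video captures taken from multiple camera poses. It builds on recent Gaussian splatting techniques, which deliver state-of-the-art, fast novel view synthesis. MD-Splatting learns a deformation function that maps a set of Gaussians with non-metric properties into metric space. The deformation function uses a neural-voxel encoding and a multilayer perceptron (MLP) to infer each Gaussian's position, rotation, and a shadow scalar. Physics-inspired regularization terms based on local rigidity, conservation of momentum, and isometry reduce trajectory errors. Compared with the state of the art, the method improves 3D tracking accuracy while also achieving high-quality novel view synthesis. With sufficient texture, as in scene 6, MD-Splatting reaches a median tracking error of 3.39 mm on a 1 x 1 meter cloth. Project website: https://md-splatting.github.io/.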

Convergence of Nonconvex PnP-ADMM with MMSE Denoisers

paper_url: http://arxiv.org/abs/2311.18810
repo_url: https://github.com/wustl-cig/camsap2023
paper_authors: Chicago Park, Shirin Shoushtari, Weijie Gan, Ulugbek S. Kamilov
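for: This paper explains the observed stability of the Plug-and-Play Alternating Direction Method of Multipliers (PnP-ADMM) when it is used with convolutional neural network (CNN) priors, including expansive ones.
methods: The analysis interprets the CNN prior in PnP-ADMM as a minimum mean-squared error (MMSE) denoiser and relies on the connection between MMSE denoisers and proximal operators.
results: The paper gives a theoretical explanation for the stability of PnP-ADMM and numerically evaluates the performance gap between a nonexpansive DnCNN denoiser and an expansive DRUNet denoiser.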

Abstract
Plug-and-Play Alternating Direction Method of Multipliers (PnP-ADMM) is a widely-used algorithm for solving inverse problems by integrating physical measurement models and convolutional neural network (CNN) priors. PnP-ADMM has been theoretically proven to converge for convex data-fidelity terms and nonexpansive CNNs. It has however been observed that PnP-ADMM often empirically converges even for expansive CNNs. This paper presents a theoretical explanation for the observed stability of PnP-ADMM based on the interpretation of the CNN prior as a minimum mean-squared error (MMSE) denoiser. Our explanation parallels a similar argument recently made for the iterative shrinkage/thresholding algorithm variant of PnP (PnP-ISTA) and relies on the connection between MMSE denoisers and proximal operators. We also numerically evaluate the performance gap between PnP-ADMM using a nonexpansive DnCNN denoiser and expansive DRUNet denoiser, thus motivating the use of expansive CNNs.
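
The PnP-ADMM iteration described above alternates a data-fidelity proximal step, a denoising step that stands in for the prior's proximal operator, and a dual update. The following is a minimal sketch under stated assumptions, not the paper's implementation: the data-fidelity term is assumed to be the simple denoising case f(x) = ½‖x − y‖², and `smooth` is a hypothetical fixed smoothing filter used in place of a CNN (it is neither DnCNN nor DRUNet, and it is not an MMSE denoiser).

```python
def smooth(v):
    """Stand-in denoiser: [0.25, 0.5, 0.25] filter with edge replication."""
    n = len(v)
    return [
        0.25 * v[max(i - 1, 0)] + 0.5 * v[i] + 0.25 * v[min(i + 1, n - 1)]
        for i in range(n)
    ]


def pnp_admm(y, denoiser, rho=1.0, num_iters=20):
    """Scaled-form PnP-ADMM for the assumed fidelity f(x) = 0.5 * ||x - y||^2."""
    x = list(y)
    z = list(y)           # auxiliary variable handled by the denoiser
    u = [0.0] * len(y)    # scaled dual variable
    for _ in range(num_iters):
        # x-update: closed-form proximal step of f evaluated at (z - u).
        x = [(yi + rho * (zi - ui)) / (1.0 + rho) for yi, zi, ui in zip(y, z, u)]
        # z-update: the denoiser replaces the proximal operator of the prior.
        z = denoiser([xi + ui for xi, ui in zip(x, u)])
        # Dual update accumulates the disagreement between x and z.
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return x


if __name__ == "__main__":
    noisy = [0.0, 1.0, 0.0, 1.0]
    print(pnp_admm(noisy, smooth, rho=1.0, num_iters=10))
```

Passing the denoiser as an argument mirrors the plug-and-play idea: the loop stays the same while the prior is swapped.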

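Summary
Plug-and-Play Alternating Direction Method of Multipliers (PnP-ADMM) is a widely used algorithm for solving inverse problems that combines physical measurement models with convolutional neural network (CNN) priors. PnP-ADMM has been proven to converge for convex data-fidelity terms and nonexpansive CNNs, yet in practice it often converges even with expansive CNNs. This paper gives a theoretical explanation for that observed stability by interpreting the CNN prior as a minimum mean-squared error (MMSE) denoiser. The explanation parallels a similar argument made for the iterative shrinkage/thresholding variant of PnP (PnP-ISTA) and relies on the connection between MMSE denoisers and proximal operators. The authors also numerically evaluate PnP-ADMM with a nonexpansive DnCNN denoiser and an expansive DRUNet denoiser, which motivates the use of expansive CNNs.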

FoundPose: Unseen Object Pose Estimation with Foundation Features

paper_url: http://arxiv.org/abs/2311.18809
repo_url: None
paper_authors: Evin Pınar Örnek, Yann Labbé, Bugra Tekin, Lingni Ma, Cem Keskin, Christian Forster, Tomas Hodan
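for: This paper proposes a method for 6D pose estimation of unseen rigid objects from a single RGB image.
methods: The method builds on the DINOv2 vision foundation model and needs no object-specific training. It matches DINOv2 patch features extracted from the query image against features from rendered object templates to establish 2D-3D correspondences, generates pose hypotheses from them, and optimizes the hypotheses by featuremetric refinement.
results: The method handles diverse objects, including symmetric and textureless ones, and noticeably outperforms existing RGB methods for coarse pose estimation on the standard BOP benchmark. With featuremetric and additional MegaPose refinement, it outperforms all RGB competitors.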

Abstract
We propose FoundPose, a method for 6D pose estimation of unseen rigid objects from a single RGB image. The method assumes that 3D models of the objects are available but does not require any object-specific training. This is achieved by building upon DINOv2, a recent vision foundation model with impressive generalization capabilities. An online pose estimation stage is supported by a minimal object representation that is built during a short onboarding stage from DINOv2 patch features extracted from rendered object templates. Given a query image with an object segmentation mask, FoundPose first rapidly retrieves a handful of similarly looking templates by a DINOv2-based bag-of-words approach. Pose hypotheses are then generated from 2D-3D correspondences established by matching DINOv2 patch features between the query image and a retrieved template, and finally optimized by featuremetric refinement. The method can handle diverse objects, including challenging ones with symmetries and without any texture, and noticeably outperforms existing RGB methods for coarse pose estimation in both accuracy and speed on the standard BOP benchmark. With the featuremetric and additional MegaPose refinement, which are demonstrated complementary, the method outperforms all RGB competitors. Source code is at: evinpinar.github.io/foundpose.

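Summary
FoundPose is a method for 6D pose estimation of unseen rigid objects from a single RGB image. It assumes that 3D models of the objects are available but requires no object-specific training. It builds on DINOv2, a recent vision foundation model with strong generalization capabilities. The online pose estimation stage is supported by a minimal object representation, built during a short onboarding stage from DINOv2 patch features extracted from rendered object templates. Given a query image with an object segmentation mask, FoundPose first quickly retrieves a handful of similar-looking templates with a DINOv2-based bag-of-words approach. Pose hypotheses are then generated from 2D-3D correspondences, established by matching DINOv2 patch features between the query image and a retrieved template, and finally optimized by featuremetric refinement. The method handles diverse objects, including ones with symmetries and without texture, and clearly outperforms existing RGB methods for coarse pose estimation. With the featuremetric and additional MegaPose refinement, which are shown to be complementary, FoundPose outperforms all RGB competitors. Source code: evinpinar.github.io/foundpose.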

CLIP-QDA: An Explainable Concept Bottleneck Model

paper_url: http://arxiv.org/abs/2312.00110
repo_url: None
paper_authors: Rémi Kazmierczak, Eloïse Berthier, Goran Frehse, Gianni Franchi
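for: This study proposes an explainable image classification algorithm, built on a multi-modal foundation model, that is both fast and explainable.
methods: Inspired by CLIP-based Concept Bottleneck Models (CBMs), the method builds a latent space in which each neuron is linked to a specific word, and models that space with a Mixture of Gaussians (MoG) formalism to make it more interpretable. It also introduces CLIP-QDA, a classifier that infers labels from the concepts using only statistical values.
results: When the MoG assumption holds, CLIP-QDA achieves accuracy similar to state-of-the-art CBMs. Its explanations compete with existing XAI methods while being faster to compute.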

Abstract
In this paper, we introduce an explainable algorithm designed from a multi-modal foundation model that performs fast and explainable image classification. Drawing inspiration from CLIP-based Concept Bottleneck Models (CBMs), our method creates a latent space where each neuron is linked to a specific word. Observing that this latent space can be modeled with simple distributions, we use a Mixture of Gaussians (MoG) formalism to enhance the interpretability of this latent space. Then, we introduce CLIP-QDA, a classifier that only uses statistical values to infer labels from the concepts. In addition, this formalism allows for both local and global explanations. These explanations come from the inner design of our architecture. Our work is part of a new family of greybox models, combining the performance of opaque foundation models and the interpretability of transparent models. Our empirical findings show that in instances where the MoG assumption holds, CLIP-QDA achieves accuracy similar to state-of-the-art CBMs. Our explanations compete with existing XAI methods while being faster to compute.
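
To illustrate how a classifier can infer labels from concept scores "using only statistical values", here is a minimal sketch of quadratic discriminant analysis restricted to diagonal covariances. This is a simplification chosen for brevity, not the paper's CLIP-QDA: the function names are hypothetical, the concept scores are made-up numbers rather than CLIP outputs, and full QDA would estimate a full covariance matrix per class.

```python
import math


def fit_diagonal_qda(features, labels, eps=1e-6):
    """Estimate a prior, mean, and per-dimension variance for each class."""
    params = {}
    for cls in sorted(set(labels)):
        rows = [x for x, y in zip(features, labels) if y == cls]
        dim = len(rows[0])
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
        # eps keeps variances strictly positive when a dimension is constant.
        var = [
            sum((r[j] - mean[j]) ** 2 for r in rows) / len(rows) + eps
            for j in range(dim)
        ]
        params[cls] = (len(rows) / len(features), mean, var)
    return params


def predict_diagonal_qda(params, x):
    """Return the class with the highest Gaussian log-discriminant for x."""
    def score(cls):
        prior, mean, var = params[cls]
        return math.log(prior) - 0.5 * sum(
            math.log(v) + (xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)
        )
    return max(params, key=score)


if __name__ == "__main__":
    # Hypothetical two-concept scores, e.g. similarity to "whiskers" and "snout".
    scores = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
              [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
    labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
    model = fit_diagonal_qda(scores, labels)
    print(predict_diagonal_qda(model, [0.82, 0.12]))
```

Because the fitted parameters are just per-class means and variances over named concepts, each prediction can be traced back to which concept moved the score.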

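Summary
This paper introduces an explainable image classification algorithm built on a multi-modal foundation model. Inspired by CLIP-based Concept Bottleneck Models (CBMs), the method creates a latent space in which each neuron is linked to a specific word. Because this latent space can be modeled with simple distributions, the authors use a Mixture of Gaussians (MoG) formalism to make it more interpretable. They then introduce CLIP-QDA, a classifier that uses only statistical values to infer labels. The formalism also yields both local and global explanations, which come from the inner design of the architecture. The work belongs to a new family of greybox models that combine the performance of opaque foundation models with the interpretability of transparent models. Experiments show that when the MoG assumption holds, CLIP-QDA achieves accuracy similar to state-of-the-art CBMs. The explanations compete with existing XAI methods and are faster to compute.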

Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

paper_url: http://arxiv.org/abs/2311.18773
repo_url: None
paper_authors: Rohan Myer Krishnan, Zitian Tang, Zhiqiu Yu, Chen Sun
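for: This paper introduces a benchmark for procedural video understanding, motivated by robots learning skills from human demonstrations, that tests generalization to novel domains and multimodal inputs.
methods: The benchmark, Spacewalk-18, defines two tasks, step recognition and intra-video retrieval, over temporally segmented and labeled International Space Station spacewalk recordings.
results: State-of-the-art methods perform poorly on the benchmark, which indicates that new approaches are needed for these tasks.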

Abstract
Learning from videos is an emerging research area that enables robots to acquire skills from human demonstrations, such as procedural videos. To do this, video-language models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize the understandings to novel domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) intra-video retrieval over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to make use of: (1) out-of-domain visual information; (2) a high temporal context window; and (3) multimodal (text + video) domains. This departs from existing benchmarks for procedural video understanding, which typically deal with short context lengths and can be solved with a single modality. Spacewalk-18, with its inherent multimodal and long-form complexity, exposes the high difficulty of task recognition and segmentation. We find that state-of-the-art methods perform poorly on our benchmark, demonstrating that the goal of generalizable procedural video understanding models is far out and underscoring the need to develop new approaches to these tasks. Data, model, and code will be publicly released.

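Summary
Learning from videos is an emerging research area that lets robots acquire skills from human demonstrations such as procedural videos. To do so, video-language models must obtain structured understanding, such as the temporal segmentation of a demonstration into sequences of actions and skills, and generalize that understanding to novel domains. Toward this goal, the authors introduce the Spacewalk-18 benchmark, which contains two tasks: (1) step recognition and (2) intra-video retrieval. Together, the two tasks measure a model's ability to use (1) out-of-domain visual information, (2) a high temporal context window, and (3) multimodal (text + video) domains. This differs from existing procedural video understanding benchmarks, which typically involve short context lengths and can be solved with a single modality. With its inherent multimodal and long-form complexity, Spacewalk-18 exposes how difficult task recognition and segmentation are. State-of-the-art methods perform poorly on the benchmark, which shows that generalizable procedural video understanding is still far off and that new approaches are needed. Data, model, and code will be publicly released.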

Semi-supervised Semantic Segmentation via Boosting Uncertainty on Unlabeled Data

paper_url: http://arxiv.org/abs/2311.18758
repo_url: None
paper_authors: Daoan Zhang, Yunhao Luo, Jianguo Zhang
for: The paper aims to improve the performance of semi-supervised semantic segmentation models by addressing the distribution gap between labeled and unlabeled datasets.
methods: The paper proposes two strategies and designs an uncertainty booster algorithm to appropriately boost uncertainty on unlabeled data, which helps minimize the distribution gap and benefits the generalization of the model.
results: The proposed algorithm and strategies are experimentally proven to be effective in promoting performance in semi-supervised semantic segmentation, achieving state-of-the-art results on popular benchmarks such as Cityscapes and PASCAL VOC 2012 with different train settings.

Abstract
We bring a new perspective to semi-supervised semantic segmentation by providing an analysis on the labeled and unlabeled distributions in training datasets. We first figure out that the distribution gap between labeled and unlabeled datasets cannot be ignored, even though the two datasets are sampled from the same distribution. To address this issue, we theoretically analyze and experimentally prove that appropriately boosting uncertainty on unlabeled data can help minimize the distribution gap, which benefits the generalization of the model. We propose two strategies and design an uncertainty booster algorithm, specially for semi-supervised semantic segmentation. Extensive experiments are carried out based on these theories, and the results confirm the efficacy of the algorithm and strategies. Our plug-and-play uncertainty booster is tiny, efficient, and robust to hyperparameters but can significantly promote performance. Our approach achieves state-of-the-art performance in our experiments compared to the current semi-supervised semantic segmentation methods on the popular benchmarks: Cityscapes and PASCAL VOC 2012 with different train settings.

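Summary
This work brings a new perspective to semi-supervised semantic segmentation by analyzing the labeled and unlabeled distributions in training datasets. The authors first find that the distribution gap between labeled and unlabeled datasets cannot be ignored, even when both are sampled from the same distribution. To address this, they show theoretically and experimentally that appropriately boosting uncertainty on unlabeled data helps minimize the distribution gap, which benefits generalization. They propose two strategies and design an uncertainty booster algorithm specifically for semi-supervised semantic segmentation. Extensive experiments confirm the efficacy of the algorithm and strategies. The plug-and-play uncertainty booster is tiny, efficient, and robust to hyperparameters, yet significantly improves performance. Across different training settings on the popular Cityscapes and PASCAL VOC 2012 benchmarks, the approach achieves state-of-the-art performance compared with current semi-supervised semantic segmentation methods.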

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

paper_url: http://arxiv.org/abs/2312.00109
repo_url: https://github.com/city-super/Scaffold-GS
paper_authors: Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, Bo Dai
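for: Improve the quality and speed of photo-realistic 3D scene rendering for academic and industrial applications.
methods: Use anchor points to distribute local 3D Gaussians and predict their attributes on the fly from the viewing direction and distance within the view frustum.
results: Fewer redundant Gaussians and better scene coverage, with high-quality rendering and no loss of rendering speed.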

Abstract
Neural rendering methods have significantly advanced photo-realistic 3D scene rendering in various academic and industrial applications. The recent 3D Gaussian Splatting method has achieved state-of-the-art rendering quality and speed by combining the benefits of both primitive-based representations and volumetric representations. However, it often leads to heavily redundant Gaussians that try to fit every training view, neglecting the underlying scene geometry. Consequently, the resulting model becomes less robust to significant view changes, texture-less areas, and lighting effects. We introduce Scaffold-GS, which uses anchor points to distribute local 3D Gaussians, and predicts their attributes on-the-fly based on viewing direction and distance within the view frustum. Anchor growing and pruning strategies are developed based on the importance of neural Gaussians to reliably improve the scene coverage. We show that our method effectively reduces redundant Gaussians while delivering high-quality rendering. We also demonstrate an enhanced capability to accommodate scenes with varying levels-of-detail and view-dependent observations, without sacrificing the rendering speed.


Merlin: Empowering Multimodal LLMs with Foresight Minds

paper_url: http://arxiv.org/abs/2312.00589
repo_url: https://github.com/Ahnsun/merlin
paper_authors: En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao
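for: This paper explores how to give existing Multimodal Large Language Models (MLLMs) the ability to anticipate the future, so that they better learn the fundamental principles of how things operate and the intentions behind observed subjects.
methods: The paper proposes two methods, Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT). FPT jointly trains various trajectory-centered tasks so that MLLMs learn to attend to and predict entire trajectories from an initial observation; FIT requires MLLMs to first predict the trajectories of related objects and then reason about potential future events based on them.
results: With FPT and FIT, the authors build Merlin, a unified MLLM that supports multi-image input, and report strong performance on future reasoning and visual comprehension tasks.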

Abstract
Humans possess the remarkable ability to foresee the future to a certain extent based on present observations, a skill we term foresight minds. However, this capability remains largely underexplored within existing Multimodal Large Language Models (MLLMs), hindering their capacity to learn the fundamental principles of how things operate and the intentions behind the observed subjects. To address this issue, we introduce the integration of future modeling into the existing learning frameworks of MLLMs. By utilizing the subject trajectory, a highly structured representation of a consecutive frame sequence, as a learning objective, we aim to bridge the gap between the past and the future. We propose two innovative methods to empower MLLMs with foresight minds, Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT), which are inspired by the modern learning paradigm of LLMs. Specifically, FPT jointly trains various tasks centered on trajectories, enabling MLLMs to learn how to attend to and predict entire trajectories from a given initial observation. Then, FIT requires MLLMs to first predict trajectories of related objects and then reason about potential future events based on them. Aided by FPT and FIT, we build a novel and unified MLLM named Merlin that supports multi-image input and analysis of the potential actions of multiple objects for future reasoning. Experimental results show that Merlin has powerful foresight minds, with impressive performance on both future reasoning and visual comprehension tasks.

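Summary
Humans can foresee the future to some extent from present observations, an ability the authors call foresight minds. This capability remains underexplored in existing Multimodal Large Language Models (MLLMs), which limits their capacity to learn the fundamental principles of how things operate and the intentions behind observed subjects. To address this, the authors integrate future modeling into the existing learning frameworks of MLLMs. They use the subject trajectory, a highly structured representation of a consecutive frame sequence, as a learning objective to bridge the gap between the past and the future. They propose two methods inspired by the modern learning paradigm of LLMs: Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT). FPT jointly trains various trajectory-centered tasks so that MLLMs learn to attend to and predict entire trajectories from a given initial observation. FIT then requires MLLMs to first predict the trajectories of related objects and then reason about potential future events based on them. With FPT and FIT, the authors build Merlin, a novel and unified MLLM that supports multi-image input and analysis of the potential actions of multiple objects for future reasoning. Experimental results show that Merlin has strong foresight minds, with impressive performance on both future reasoning and visual comprehension tasks.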

Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data

paper_url: http://arxiv.org/abs/2311.18729
repo_url: https://github.com/YuDeng/Portrait-4D
paper_authors: Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, Baoyuan Wang
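for: One-shot 4D head avatar synthesis.
methods: Learn from large-scale synthetic data: first learn a part-wise 4D generative model via adversarial learning, then use a transformer-based animatable triplane reconstructor to learn 4D head reconstruction.
results: Experiments show that the method outperforms the prior art.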

Abstract
Existing one-shot 4D head synthesis methods usually learn from monocular videos with the aid of 3DMM reconstruction, yet the latter is equally challenging, which restricts them from reasonable 4D head synthesis. We present a method to learn one-shot 4D head synthesis via large-scale synthetic data. The key is to first learn a part-wise 4D generative model from monocular images via adversarial learning, to synthesize multi-view images of diverse identities and full motions as training data; then leverage a transformer-based animatable triplane reconstructor to learn 4D head reconstruction using the synthetic data. A novel learning strategy is enforced to enhance the generalizability to real images by disentangling the learning process of 3D reconstruction and reenactment. Experiments demonstrate our superiority over the prior art.

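Summary
Existing one-shot 4D head synthesis methods usually learn from monocular videos with the help of 3DMM reconstruction, but the reconstruction is itself difficult, which keeps them from reasonable 4D head synthesis. The authors present a method that learns one-shot 4D head synthesis from large-scale synthetic data. The key is to first learn a part-wise 4D generative model from monocular images via adversarial learning and use it to synthesize multi-view images of diverse identities and full motions as training data, and then to use a transformer-based animatable triplane reconstructor to learn 4D head reconstruction from the synthetic data. A novel learning strategy that disentangles the learning of 3D reconstruction and reenactment improves generalization to real images. Experiments show that the method outperforms the prior art.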

Improving the Robustness of Quantized Deep Neural Networks to White-Box Attacks using Stochastic Quantization and Information-Theoretic Ensemble Training

paper_url: http://arxiv.org/abs/2312.00105
repo_url: None
paper_authors: Saurabh Farkya, Aswin Raghavan, Avi Ziskind
for: The paper aims to improve the robustness of quantized deep neural networks (DNNs) to white-box adversarial attacks.
methods: The paper introduces a differentiable Stochastic Quantizer (SQ) to tackle the limitation of deterministic quantization to fixed “bins”. The authors also explore the idea of using different quantizations to collectively improve robustness and learn diverse representations of the input image.
results: The paper demonstrates substantial improvement in robustness against $L_\infty$ attacks, with > 50% accuracy to PGD(5/255) on CIFAR10 without adversarial training, compared to vanilla DNNs and existing ensembles of quantized DNNs. The authors also extend the method to detect attacks and generate robustness profiles in the adversarial information plane (AIP).

Abstract
Most real-world applications that employ deep neural networks (DNNs) quantize them to low precision to reduce the compute needs. We present a method to improve the robustness of quantized DNNs to white-box adversarial attacks. We first tackle the limitation of deterministic quantization to fixed "bins" by introducing a differentiable Stochastic Quantizer (SQ). We explore the hypothesis that different quantizations may collectively be more robust than each quantized DNN. We formulate a training objective to encourage different quantized DNNs to learn different representations of the input image. The training objective captures diversity and accuracy via mutual information between ensemble members. Through experimentation, we demonstrate substantial improvement in robustness against $L_\infty$ attacks even if the attacker is allowed to backpropagate through SQ (e.g., > 50% accuracy to PGD(5/255) on CIFAR10 without adversarial training), compared to vanilla DNNs as well as existing ensembles of quantized DNNs. We extend the method to detect attacks and generate robustness profiles in the adversarial information plane (AIP), towards a unified analysis of different threat models by correlating the MI and accuracy.
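
To make the contrast with deterministic quantization to fixed bins concrete, here is a minimal sketch of stochastic rounding, where a value lands on one of its two neighboring levels at random. This is an illustrative stand-in only: the abstract does not specify how the paper's Stochastic Quantizer works, this sketch is not differentiable, and `stochastic_quantize` and its parameters are hypothetical names.

```python
import random


def stochastic_quantize(x, num_levels=4, rng=random):
    """Round x in [0, 1] to one of num_levels evenly spaced values.

    The value rounds up with probability equal to its fractional position
    between the two neighboring levels, so the result equals x in expectation.
    """
    x = min(max(x, 0.0), 1.0)            # clamp to the supported range
    scaled = x * (num_levels - 1)
    lower = int(scaled)                  # floor, since scaled is non-negative
    frac = scaled - lower
    level = lower + (1 if rng.random() < frac else 0)
    return level / (num_levels - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_quantize(0.5, num_levels=4, rng=rng) for _ in range(2000)]
    print(sorted(set(samples)), sum(samples) / len(samples))
```

A deterministic quantizer would always map 0.5 to the same level; here repeated calls spread over both neighbors, which is the property that makes the bin assignment harder for an attacker to predict.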

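Summary
Most real-world applications that use deep neural networks (DNNs) quantize them to low precision to reduce compute needs. The authors present a method to improve the robustness of quantized DNNs to white-box adversarial attacks. They first address the limitation of deterministic quantization to fixed "bins" by introducing a differentiable Stochastic Quantizer (SQ). They then formulate a training objective that encourages the quantized DNNs in an ensemble to learn different representations of the input image; the objective captures diversity and accuracy through the mutual information between ensemble members. Experiments show a substantial improvement in robustness against $L_\infty$ attacks (e.g., > 50% accuracy to PGD(5/255) on CIFAR10 without adversarial training) compared with vanilla DNNs and existing ensembles of quantized DNNs. The method is also extended to detect attacks and to generate robustness profiles in the adversarial information plane (AIP), toward a unified analysis of different threat models by correlating the MI and accuracy.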

Meta-Prior: Meta learning for Adaptive Inverse Problem Solvers

paper_url: http://arxiv.org/abs/2311.18710
repo_url: None
paper_authors: Matthieu Terris, Thomas Moreau
for: The paper addresses imaging inverse problems with deep neural networks, specifically in the absence of ground truth data.
methods: The proposed method is based on meta-learning, which trains a meta-model on a diverse set of imaging tasks and allows for efficient fine-tuning for specific tasks with few fine-tuning steps. The method uses a bilevel formulation with an outer supervised loss and an inner loss that can be either supervised or unsupervised, relying only on the measurement operator.
results: The proposed method recovers the Bayes optimal estimator in simple settings and demonstrates improved performance on various imaging tasks, including image processing and magnetic resonance imaging.

Abstract
Deep neural networks have become a foundational tool for addressing imaging inverse problems. They are typically trained for a specific task, with a supervised loss to learn a mapping from the observations to the image to recover. However, real-world imaging challenges often lack ground truth data, rendering traditional supervised approaches ineffective. Moreover, for each new imaging task, a new model needs to be trained from scratch, wasting time and resources. To overcome these limitations, we introduce a novel approach based on meta-learning. Our method trains a meta-model on a diverse set of imaging tasks that allows the model to be efficiently fine-tuned for specific tasks with few fine-tuning steps. We show that the proposed method extends to the unsupervised setting, where no ground truth data is available. In its bilevel formulation, the outer level uses a supervised loss, that evaluates how well the fine-tuned model performs, while the inner loss can be either supervised or unsupervised, relying only on the measurement operator. This allows the meta-model to leverage a few ground truth samples for each task while being able to generalize to new imaging tasks. We show that in simple settings, this approach recovers the Bayes optimal estimator, illustrating the soundness of our approach. We also demonstrate our method's effectiveness on various tasks, including image processing and magnetic resonance imaging.

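Summary
Deep neural networks have become a foundational tool for imaging inverse problems. They are typically trained for a specific task, with a supervised loss that learns a mapping from the observations to the image to recover. Real-world imaging problems, however, often lack ground truth data, which makes traditional supervised approaches ineffective. In addition, each new imaging task requires training a new model from scratch, which wastes time and resources. To overcome these limitations, the authors introduce an approach based on meta-learning. It trains a meta-model on a diverse set of imaging tasks so that the model can be fine-tuned for a specific task in a few steps. The method extends to the unsupervised setting, where no ground truth data is available: in the bilevel formulation, the outer level uses a supervised loss, while the inner loss can be supervised or unsupervised and relies only on the measurement operator. This lets the meta-model use a few ground truth samples for each task while still generalizing to new imaging tasks. In simple settings the approach recovers the Bayes optimal estimator, which illustrates its soundness. The authors also demonstrate its effectiveness on various tasks, including image processing and magnetic resonance imaging.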

Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction

paper_url: http://arxiv.org/abs/2311.18695
repo_url: None
paper_authors: Cheng Sun, Wei-En Tai, Yu-Lin Shih, Kuan-Wei Chen, Yong-Jing Syu, Kent Selwyn The, Yu-Chiang Frank Wang, Hwann-Tzong Chen
for: This paper addresses single-view 360-degree room layout reconstruction.
methods: The paper uses a differentiable, occlusion-aware rendering step that converts 2D layout segmentation into 1D layout depth regression.
results: The model performs strongly in benchmarking and significantly outperforms prior art.

Abstract
State-of-the-art single-view 360-degree room layout reconstruction methods formulate the problem as a high-level 1D (per-column) regression task. On the other hand, traditional low-level 2D layout segmentation is simpler to learn and can represent occluded regions, but it requires complex post-processing for the target layout polygon and sacrifices accuracy. We present Seg2Reg to render 1D layout depth regression from the 2D segmentation map in a differentiable and occlusion-aware way, marrying the merits of both sides. Specifically, our model predicts floor-plan density for the input equirectangular 360-degree image. Formulating the 2D layout representation as a density field enables us to employ "flattened" volume rendering to form 1D layout depth regression. In addition, we propose a novel 3D warping augmentation on layout to improve generalization. Finally, we re-implement recent room layout reconstruction methods into our codebase for benchmarking and explore modern backbones and training techniques to serve as the strong baseline. Our model significantly outperforms prior art. The code will be made available upon publication.
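
The step from a density field to a depth value can be illustrated with the standard volume-rendering quadrature applied along a single ray. The sketch below is an assumption-laden illustration, not the paper's renderer: it treats one image column as one ray through a floor-plan density, samples at evenly spaced midpoints, and normalizes by the accumulated weight; `render_depth` is a hypothetical name.

```python
import math


def render_depth(densities, step):
    """Expected depth along one ray from non-negative per-sample densities.

    Sample i is assumed to sit at distance (i + 0.5) * step from the camera.
    """
    transmittance = 1.0   # fraction of the ray not yet absorbed
    depth = 0.0
    total_weight = 0.0
    for i, sigma in enumerate(densities):
        alpha = 1.0 - math.exp(-sigma * step)   # opacity of this segment
        weight = transmittance * alpha          # contribution of this sample
        depth += weight * (i + 0.5) * step
        total_weight += weight
        transmittance *= 1.0 - alpha            # occluded samples get less weight
    # A ray that hits nothing has no defined depth.
    return depth / total_weight if total_weight > 0 else float("inf")


if __name__ == "__main__":
    # Free space for three samples, then a dense wall.
    print(render_depth([0.0, 0.0, 0.0, 1000.0, 1000.0], step=0.5))
```

Because every operation is smooth in the densities, a loss on the rendered depth can propagate gradients back to the density prediction, and samples behind a dense wall contribute almost nothing, which is where the occlusion awareness comes from.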

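Summary
State-of-the-art single-view 360-degree room layout reconstruction methods formulate the problem as a high-level 1D (per-column) regression task. Traditional low-level 2D layout segmentation is simpler to learn and can represent occluded regions, but it needs complex post-processing to obtain the target layout polygon and sacrifices accuracy. The authors present Seg2Reg, which renders 1D layout depth regression from the 2D segmentation map in a differentiable and occlusion-aware way. Specifically, the model predicts floor-plan density for the input equirectangular 360-degree image, which makes it possible to use "flattened" volume rendering to form the 1D layout depth regression. They also propose a novel 3D warping augmentation on the layout to improve generalization. Finally, they re-implement recent room layout reconstruction methods in their codebase for benchmarking and explore modern backbones and training techniques to build a strong baseline. The model significantly outperforms prior art. The code will be made available upon publication.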

Cascaded Interaction with Eroded Deep Supervision for Salient Object Detection

paper_url: http://arxiv.org/abs/2311.18675
repo_url: None
paper_authors: Hewen Xiao, Jie Mei, Guangfu Ma, Weiren Wu
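for: Improve salient object detection by reducing the information distortion that interpolation introduces during up-sampling and down-sampling.
methods: The paper works on two sides of the network: on the feature side, a cascaded interaction network with a guidance module named global-local aligned attention (GAA); on the label side, a deep supervision strategy based on edge erosion.
results: Extensive experiments on five popular datasets show the superiority of the method.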

Abstract
Deep convolutional neural networks have been widely applied in salient object detection and have achieved remarkable results in this field. However, existing models suffer from information distortion caused by interpolation during up-sampling and down-sampling. In response to this drawback, this article starts from two directions in the network: feature and label. On the one hand, a novel cascaded interaction network with a guidance module named global-local aligned attention (GAA) is designed to reduce the negative impact of interpolation on the feature side. On the other hand, a deep supervision strategy based on edge erosion is proposed to reduce the negative guidance of label interpolation on lateral output. Extensive experiments on five popular datasets demonstrate the superiority of our method.
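
The label-side idea rests on morphological erosion, which shrinks a binary mask by removing its boundary pixels. The sketch below shows plain 3x3 binary erosion on a small mask as an illustration of that operation; how the paper applies erosion to the supervision labels at each scale is not described in the abstract, so `erode` is a hypothetical helper and not the authors' procedure.

```python
def erode(mask):
    """3x3 binary erosion; pixels outside the image are treated as background."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            keep = 1
            # A pixel survives only if its whole 3x3 neighborhood is foreground.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < height and 0 <= cc < width
                    if not inside or mask[rr][cc] == 0:
                        keep = 0
            out[r][c] = keep
    return out


if __name__ == "__main__":
    mask = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    for row in erode(mask):
        print(row)
```

Eroding a label removes exactly the edge pixels whose values are least reliable after interpolation, which is one plausible reading of why erosion helps deep supervision.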

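Summary
Deep convolutional neural networks are widely used in salient object detection and have achieved remarkable results. Existing models, however, suffer from information distortion caused by interpolation. To address this drawback, the paper improves the network in two directions: features and labels. On the feature side, it designs a new cascaded interaction network with a guidance module named global-local aligned attention (GAA) to reduce the negative impact of interpolation. On the label side, it proposes a deep supervision strategy based on edge erosion to reduce the negative guidance that label interpolation gives to the lateral outputs. Extensive experiments on five popular datasets demonstrate the superiority of the method.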

Action Recognition in Video Recordings from Gynecologic Laparoscopy

paper_url: http://arxiv.org/abs/2311.18666
repo_url: None
paper_authors: Sahar Nasirihaghighi, Negin Ghamsarian, Daniela Stefanics, Klaus Schoeffmann, Heinrich Husslein
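for: Automatic recognition of actions in laparoscopic surgery videos.
methods: A CNN-RNN architecture and a customized training-inference framework designed to handle the challenges of action recognition in laparoscopic surgery.
results: Extensive experiments on laparoscopic action recognition confirm the superiority of the method.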

Abstract
Action recognition is a prerequisite for many applications in laparoscopic video analysis including but not limited to surgical training, operation room planning, follow-up surgery preparation, post-operative surgical assessment, and surgical outcome estimation. However, automatic action recognition in laparoscopic surgeries involves numerous challenges such as (I) cross-action and intra-action duration variation, (II) relevant content distortion due to smoke, blood accumulation, fast camera motions, organ movements, object occlusion, and (III) surgical scene variations due to different illuminations and viewpoints. Besides, action annotations in laparoscopy surgeries are limited and expensive due to requiring expert knowledge. In this study, we design and evaluate a CNN-RNN architecture as well as a customized training-inference framework to deal with the mentioned challenges in laparoscopic surgery action recognition. Using stacked recurrent layers, our proposed network takes advantage of inter-frame dependencies to negate the negative effect of content distortion and variation in action recognition. Furthermore, our proposed frame sampling strategy effectively manages the duration variations in surgical actions to enable action recognition with high temporal resolution. Our extensive experiments confirm the superiority of our proposed method in action recognition compared to static CNNs.

Summary
Action recognition is a prerequisite for many applications in laparoscopic video analysis, including but not limited to surgical training, operating room planning, follow-up surgery preparation, post-operative surgical assessment, and surgical outcome estimation. However, automatic action recognition in laparoscopic surgery faces numerous challenges: (I) variation in duration across and within actions, (II) distortion of relevant content caused by smoke, blood accumulation, fast camera motion, organ movement, and object occlusion, and (III) variation in the surgical scene caused by different illumination and viewpoints. In addition, action annotations in laparoscopic surgery are scarce and expensive because they require expert knowledge. In this study, we design and evaluate a CNN-RNN architecture and a customized training-inference framework to address these challenges. Using stacked recurrent layers, the network exploits inter-frame dependencies to counteract the negative effect of content distortion and variation on action recognition. Furthermore, the proposed frame sampling strategy handles the duration variation of surgical actions, enabling action recognition with high temporal resolution. Our extensive experiments show that the proposed method outperforms static CNNs.

Pose Estimation and Tracking for ASIST

paper_url: http://arxiv.org/abs/2311.18665
repo_url: None
paper_authors: Ari Goodman, Gurpreet Singh, Ryan O’Shea, Peter Teague, James Hing
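for: Improving the situational awareness of ASIST system operators and reducing their uncertainty about the aircraft's position, thereby increasing the allowable landing area.
methods: Modern computer vision algorithms, namely Faster R-CNN and HRNet, estimate the helicopter's pose, and traditional encoder-decoder methods estimate its orientation.
results: The researchers built a prototype system that tracks the helicopter relative to the RSD, and showed that the encoder-decoder orientation estimate can be used to confirm the HRNet output.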

Abstract
Aircraft Ship Integrated Secure and Traverse (ASIST) is a system designed to arrest helicopters safely and efficiently on ships. Originally, a precision Helicopter Position Sensing Equipment (HPSE) tracked and monitored the position of the helicopter relative to the Rapid Securing Device (RSD). However, using the HPSE component was determined to be infeasible in the transition of the ASIST system due to the hardware installation requirements. As a result, sailors track the position of the helicopters with their eyes with no sensor or artificially intelligent decision aid. Manually tracking the helicopter takes additional time and makes recoveries more difficult, especially at high sea states. Performing recoveries without the decision aid leads to higher uncertainty and cognitive load. PETA (Pose Estimation and Tracking for ASIST) is a research effort to create a helicopter tracking system prototype without hardware installation requirements for ASIST system operators. Its overall goal is to improve situational awareness and reduce operator uncertainty with respect to the aircraft's position relative to the RSD, and consequently increase the allowable landing area. The authors produced a prototype system capable of tracking helicopters with respect to the RSD. The software included a helicopter pose estimation component, camera pose estimation component, and a user interface component. PETA demonstrated the potential for state-of-the-art computer vision algorithms Faster R-CNN and HRNet (High-Resolution Network) to be used to estimate the pose of helicopters in real-time, returning ASIST to its originally intended capability. PETA also demonstrated that traditional methods of encoder-decoders could be used to estimate the orientation of the helicopter and could be used to confirm the output from HRNet.

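Summary
Aircraft Ship Integrated Secure and Traverse (ASIST) is a system for safely and efficiently recovering helicopters on ships. Originally, precision Helicopter Position Sensing Equipment (HPSE) tracked and monitored the position of the helicopter relative to the Rapid Securing Device (RSD). However, using the HPSE component proved infeasible during the transition of the ASIST system because of its hardware installation requirements. As a result, sailors track the helicopter's position by eye, without any sensor or artificial intelligence decision aid. Manual tracking takes more time and makes recoveries harder, especially at high sea states, and working without a decision aid leads to higher uncertainty and cognitive load. PETA (Pose Estimation and Tracking for ASIST) is a research effort to create a helicopter tracking system prototype that needs no hardware installation, in order to improve the situational awareness of ASIST operators and reduce their uncertainty about the helicopter's position relative to the RSD. The authors built a system capable of tracking the helicopter, comprising a helicopter pose estimation component, a camera pose estimation component, and a user interface component. PETA showed that the state-of-the-art computer vision algorithms Faster R-CNN and HRNet (High-Resolution Network) can estimate the helicopter's pose in real time, returning ASIST to its originally intended capability. PETA also showed that traditional encoder-decoder methods can estimate the helicopter's orientation and can be used to verify the HRNet output.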

Learning Part Segmentation from Synthetic Animals

paper_url: http://arxiv.org/abs/2311.18661
repo_url: None
paper_authors: Jiawei Peng, Ju He, Prakhar Kaushik, Zihao Xiao, Jiteng Mu, Alan Yuille
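for: Learning animal part segmentation from synthetic data, using Skinned Multi-Animal Linear (SMAL) models to scale up existing synthetic data generated from computer-aided design (CAD) animal models.
methods: The paper introduces the Synthetic Animal Parts (SAP) dataset, benchmarks Syn-to-Real animal part segmentation with existing semantic segmentation domain adaptation methods, and proposes Class-Balanced Fourier Data Mixing (CB-FDM) to address the innate difference between the two tasks.
results: CB-FDM improves performance on the SynRealPart benchmark, and the parts learned from synthetic animals transfer to all quadruped categories.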

Abstract
Semantic part segmentation provides an intricate and interpretable understanding of an object, thereby benefiting numerous downstream tasks. However, the need for exhaustive annotations impedes its usage across diverse object types. This paper focuses on learning part segmentation from synthetic animals, leveraging the Skinned Multi-Animal Linear (SMAL) models to scale up existing synthetic data generated by computer-aided design (CAD) animal models. Compared to CAD models, SMAL models generate data with a wider range of poses observed in real-world scenarios. As a result, our first contribution is to construct a synthetic animal dataset of tigers and horses with more pose diversity, termed Synthetic Animal Parts (SAP). We then benchmark Syn-to-Real animal part segmentation from SAP to PartImageNet, namely SynRealPart, with existing semantic segmentation domain adaptation methods and further improve them as our second contribution. Concretely, we examine three Syn-to-Real adaptation methods but observe relative performance drop due to the innate difference between the two tasks. To address this, we propose a simple yet effective method called Class-Balanced Fourier Data Mixing (CB-FDM). Fourier Data Mixing aligns the spectral amplitudes of synthetic images with real images, thereby making the mixed images have more similar frequency content to real images. We further use Class-Balanced Pseudo-Label Re-Weighting to alleviate the imbalanced class distribution. We demonstrate the efficacy of CB-FDM on SynRealPart over previous methods with significant performance improvements. Remarkably, our third contribution is to reveal that the learned parts from synthetic tiger and horse are transferable across all quadrupeds in PartImageNet, further underscoring the utility and potential applications of animal part segmentation.
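
The amplitude-alignment step described for Fourier Data Mixing can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the mixing rule, so the linear blend controlled by `lam`, the function names, and the naive pure-Python DFT (suitable only for tiny grayscale arrays) are all hypothetical choices. The sketch keeps the phase of the synthetic image and moves its amplitude spectrum toward that of a real image.

```python
import cmath


def dft2(img, inverse=False):
    """Naive 2D discrete Fourier transform of a small 2D list (O(N^4), illustration only)."""
    h, w = len(img), len(img[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    total += img[y][x] * cmath.exp(sign * 2j * cmath.pi * (u * y / h + v * x / w))
            out[u][v] = total / (h * w) if inverse else total
    return out


def fourier_mix(synthetic, real, lam=1.0):
    """Blend the amplitude spectrum of `synthetic` toward `real`, keeping the synthetic phase.

    lam=0 returns the synthetic image unchanged; lam=1 uses the real amplitudes.
    """
    fs, fr = dft2(synthetic), dft2(real)
    mixed = [
        [cmath.rect((1.0 - lam) * abs(a) + lam * abs(b), cmath.phase(a)) for a, b in zip(row_s, row_r)]
        for row_s, row_r in zip(fs, fr)
    ]
    # Take the real part of the inverse transform as the mixed image.
    return [[z.real for z in row] for row in dft2(mixed, inverse=True)]
```

Because image layout is carried mostly by the phase, the mixed image keeps the synthetic content (and therefore its part labels) while its frequency statistics move toward those of the real image. In practice an FFT library would replace the naive transform.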

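Summary
Semantic part segmentation provides a fine-grained and interpretable understanding of an object and thereby benefits many downstream tasks. However, the need for exhaustive annotations limits its use across diverse object types. This paper learns part segmentation from synthetic animals, using Skinned Multi-Animal Linear (SMAL) models to scale up existing synthetic data generated from computer-aided design (CAD) animal models. Compared with CAD models, SMAL models generate data covering a wider range of the poses observed in real-world scenarios. Our first contribution is therefore a synthetic animal dataset of tigers and horses with greater pose diversity, called Synthetic Animal Parts (SAP). We then benchmark Syn-to-Real animal part segmentation from SAP to PartImageNet, namely SynRealPart, with existing semantic segmentation domain adaptation methods. We find that existing methods perform poorly in Syn-to-Real adaptation because of the innate difference between the two tasks. To address this, we propose a simple yet effective method, Class-Balanced Fourier Data Mixing (CB-FDM). Fourier Data Mixing aligns the spectral amplitudes of synthetic images with those of real images, so that the mixed images have frequency content more similar to real images. We further use Class-Balanced Pseudo-Label Re-Weighting to alleviate the imbalanced class distribution. We demonstrate the efficacy of CB-FDM on SynRealPart, with significant performance improvements over previous methods. Finally, our third contribution is to show that the parts learned from synthetic tigers and horses transfer to all quadrupeds in PartImageNet, further underscoring the utility and potential applications of animal part segmentation.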

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

paper_url: http://arxiv.org/abs/2311.18651
repo_url: https://github.com/open3da/ll3da
paper_authors: Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, Tao Chen
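for: Building a model that can understand, reason, and plan in complex 3D environments for human-machine interaction, taking point clouds as direct input and resolving ambiguities in cluttered 3D scenes.
methods: The paper proposes LL3DA (Large Language 3D Assistant), which takes a point cloud as direct input and responds to both textual instructions and visual prompts.
results: Experiments show that LL3DA achieves remarkable results and surpasses various 3D vision-language models on 3D dense captioning and 3D question answering.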

Abstract
Recent advances in Large Multimodal Models (LMM) have made it possible for various applications in human-machine interactions. However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially considering the demand for understanding permutation-invariant point cloud 3D representations of the 3D scene. Existing works seek help from multi-view images, and project 2D features to 3D space as 3D scene representations. This, however, leads to huge computational overhead and performance degradation. In this paper, we present LL3DA, a Large Language 3D Assistant that takes point cloud as direct input and responds to both textual-instructions and visual-prompts. This helps LMMs better comprehend human interactions and further helps to remove the ambiguities in cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results, and surpasses various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering.

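Summary
Recent advances in Large Multimodal Models (LMMs) have enabled a variety of applications in human-machine interaction. However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains challenging, especially given the need to understand permutation-invariant point cloud representations of the 3D scene. Existing works rely on multi-view images and project 2D features into 3D space as the 3D scene representation, which leads to huge computational overhead and performance degradation. In this paper, we present LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and responds to both textual instructions and visual prompts. This helps LMMs better comprehend human interactions and helps remove ambiguities in cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results and surpasses various 3D vision-language models on both 3D dense captioning and 3D question answering.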

Simple Semantic-Aided Few-Shot Learning

paper_url: http://arxiv.org/abs/2311.18649
repo_url: None
paper_authors: Hai Zhang, Junzhe Xu, Shanlin Jiang, Zhenan He
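for: Improving performance on few-shot learning, the computer vision task of learning from a limited amount of data.
methods: An automatic procedure for generating high-quality semantics, plus a simple two-layer network (Semantic Alignment Network) that transforms semantics and visual features into robust class prototypes.
results: The framework outperforms previous methods on five benchmarks, demonstrating that a simple network can beat intricate multi-modal modules on few-shot classification tasks.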

Abstract
Learning from a limited amount of data, namely Few-Shot Learning, stands out as a challenging computer vision task. Several works exploit semantics and design complicated semantic fusion mechanisms to compensate for rare representative features within restricted data. However, relying on naive semantics such as class names introduces biases due to their brevity, while acquiring extensive semantics from external knowledge takes a huge time and effort. This limitation severely constrains the potential of semantics in few-shot learning. In this paper, we design an automatic way called Semantic Evolution to generate high-quality semantics. The incorporation of high-quality semantics alleviates the need for complex network structures and learning algorithms used in previous works. Hence, we employ a simple two-layer network termed Semantic Alignment Network to transform semantics and visual features into robust class prototypes with rich discriminative features for few-shot classification. The experimental results show our framework outperforms all previous methods on five benchmarks, demonstrating a simple network with high-quality semantics can beat intricate multi-modal modules on few-shot classification tasks.

Summary
In this paper, we propose an automatic way called Semantic Evolution to generate high-quality semantics. By incorporating these high-quality semantics, we can alleviate the need for complex network structures and learning algorithms used in previous works. We employ a simple two-layer network, called the Semantic Alignment Network, to transform semantics and visual features into robust class prototypes with rich discriminative features for few-shot classification. Experimental results show that our framework outperforms all previous methods on five benchmarks, demonstrating that a simple network with high-quality semantics can beat intricate multi-modal modules on few-shot classification tasks.

DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars

paper_url: http://arxiv.org/abs/2311.18635
repo_url: None
paper_authors: Tobias Kirschstein, Simon Giebenhain, Matthias Nießner
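for: Synthesizing high-fidelity 3D head avatars with intuitive control over pose and expression.
methods: A diffusion-based neural renderer that leverages generic 2D priors to produce compelling face images.
results: Generates avatars under novel poses and expressions that are more compelling and realistic than those of existing methods.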

Abstract
DiffusionAvatars synthesizes a high-fidelity 3D head avatar of a person, offering intuitive control over both pose and expression. We propose a diffusion-based neural renderer that leverages generic 2D priors to produce compelling images of faces. For coarse guidance of the expression and head pose, we render a neural parametric head model (NPHM) from the target viewpoint, which acts as a proxy geometry of the person. Additionally, to enhance the modeling of intricate facial expressions, we condition DiffusionAvatars directly on the expression codes obtained from NPHM via cross-attention. Finally, to synthesize consistent surface details across different viewpoints and expressions, we rig learnable spatial features to the head's surface via TriPlane lookup in NPHM's canonical space. We train DiffusionAvatars on RGB videos and corresponding tracked NPHM meshes of a person and test the obtained avatars in both self-reenactment and animation scenarios. Our experiments demonstrate that DiffusionAvatars generates temporally consistent and visually appealing videos for novel poses and expressions of a person, outperforming existing approaches.

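Summary
DiffusionAvatars synthesizes a high-fidelity 3D head avatar of a person, offering intuitive control over both pose and expression. We propose a diffusion-based neural renderer that leverages generic 2D priors to produce compelling face images. For coarse guidance of expression and head pose, we render a neural parametric head model (NPHM) from the target viewpoint, which serves as proxy geometry for the person. In addition, to better model intricate facial expressions, we condition DiffusionAvatars directly on the expression codes obtained from NPHM via cross-attention. Finally, to keep surface details consistent across viewpoints and expressions, we rig learnable spatial features to the head's surface via TriPlane lookup in NPHM's canonical space. We train DiffusionAvatars on RGB videos and the corresponding tracked NPHM meshes of a person, and test the resulting avatars in both self-reenactment and animation scenarios. Our experiments show that DiffusionAvatars generates visually appealing results for novel poses and expressions of a person, outperforming existing approaches.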

A Lightweight Clustering Framework for Unsupervised Semantic Segmentation

paper_url: http://arxiv.org/abs/2311.18628
repo_url: None
paper_authors: Yau Shing Jonathan Cheung, Xi Chen, Lihe Yang, Hengshuang Zhao
for: Unsupervised semantic segmentation of images, specifically using a lightweight clustering framework that does not require annotated data.
methods: Uses attention features from the self-supervised vision transformer to cluster image patches into distinct groupings, and then extracts patch-level binary pseudo-masks through multilevel clustering consistency.
results: Achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.

Abstract
Unsupervised semantic segmentation aims to label each pixel of an image to a corresponding class without the use of annotated data. It is a widely researched area as obtaining labeled datasets is expensive. While previous works in the field demonstrated a gradual improvement in segmentation performance, most of them required neural network training. This made segmentation equally expensive, especially when dealing with large-scale datasets. We thereby propose a lightweight clustering framework for unsupervised semantic segmentation. Attention features of the self-supervised vision transformer exhibit strong foreground-background differentiability. By clustering these features into a small number of clusters, we could separate foreground and background image patches into distinct groupings. In our clustering framework, we first obtain attention features from the self-supervised vision transformer. Then we extract Dataset-level, Category-level and Image-level masks by clustering features within the same dataset, category and image. We further ensure multilevel clustering consistency across the three levels and this allows us to extract patch-level binary pseudo-masks. Finally, the pseudo-mask is upsampled, refined and class assignment is performed according to the CLS token of object regions. Our framework demonstrates great promise in unsupervised semantic segmentation and achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.
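
The step of clustering patch features into a small number of groups can be sketched with plain k-means. The abstract does not name the clustering algorithm, so k-means, the farthest-point initialization, and the helper names below are assumptions; the short 2-D vectors used in the usage note stand in for the attention features of a self-supervised vision transformer.

```python
def dist2(p, c):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: dist2(p, centers[j])) for p in points]


def kmeans(points, k=2, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Pick the point farthest from all centers chosen so far.
        farthest = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(farthest))
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centers), centers
```

Calling `kmeans(features, k=2)` on per-patch features returns one label per patch; reshaping the labels back to the patch grid gives a binary foreground/background mask of the kind the abstract describes, before any consistency check across levels.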

Summary
Unsupervised semantic segmentation aims to assign each pixel of an image to a corresponding class without using annotated data. This area has been widely researched as obtaining labeled datasets is expensive. Previous works in the field have shown a gradual improvement in segmentation performance, but most of them require neural network training, which can be costly, especially when dealing with large-scale datasets. We propose a lightweight clustering framework for unsupervised semantic segmentation. The attention features of the self-supervised vision transformer show strong foreground-background differentiability, and by clustering these features into a small number of clusters, we can separate foreground and background image patches into distinct groupings. In our clustering framework, we first obtain attention features from the self-supervised vision transformer. Then, we extract dataset-level, category-level, and image-level masks by clustering features within the same dataset, category, and image. We ensure multilevel clustering consistency across the three levels, which allows us to extract patch-level binary pseudo-masks. Finally, the pseudo-mask is upsampled and refined, and class assignment is performed according to the CLS token of object regions. Our framework shows great promise in unsupervised semantic segmentation and achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.

JPPF: Multi-task Fusion for Consistent Panoptic-Part Segmentation

paper_url: http://arxiv.org/abs/2311.18618
repo_url: None
paper_authors: Shishir Muralidhara, Sravan Kumar Jagadeesh, René Schuster, Didier Stricker
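for: Scene understanding at multiple levels of granularity, with simultaneous prediction of semantic areas, object instances, and semantic parts.
methods: Joint Panoptic Part Fusion (JPPF), which effectively combines the three individual segmentations into a panoptic-part segmentation.
results: Extensive experiments on the Cityscapes Panoptic Parts (CPP) and Pascal Panoptic Parts (PPP) datasets verify the importance of the fair fusion, especially for areas that can be further segmented into parts. Without fine-tuning, the design also generalizes well to 5 additional datasets.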

Abstract
Part-aware panoptic segmentation is a problem of computer vision that aims to provide a semantic understanding of the scene at multiple levels of granularity. More precisely, semantic areas, object instances, and semantic parts are predicted simultaneously. In this paper, we present our Joint Panoptic Part Fusion (JPPF) that combines the three individual segmentations effectively to obtain a panoptic-part segmentation. Two aspects are of utmost importance for this: First, a unified model for the three problems is desired that allows for mutually improved and consistent representation learning. Second, balancing the combination so that it gives equal importance to all individual results during fusion. Our proposed JPPF is parameter-free and dynamically balances its input. The method is evaluated and compared on the Cityscapes Panoptic Parts (CPP) and Pascal Panoptic Parts (PPP) datasets in terms of PartPQ and Part-Whole Quality (PWQ). In extensive experiments, we verify the importance of our fair fusion, highlight its most significant impact for areas that can be further segmented into parts, and demonstrate the generalization capabilities of our design without fine-tuning on 5 additional datasets.

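Summary
Part-aware panoptic segmentation is a computer vision problem that aims to provide a semantic understanding of the scene at multiple levels of granularity. More precisely, semantic areas, object instances, and semantic parts are predicted simultaneously. In this paper, we present Joint Panoptic Part Fusion (JPPF), which effectively combines the three individual segmentations to obtain a panoptic-part segmentation. Two aspects are of utmost importance: first, a unified model for the three problems that allows mutually improved and consistent representation learning; second, a balanced combination that gives equal importance to all individual results during fusion. The proposed JPPF is parameter-free and dynamically balances its inputs. We evaluate and compare the method on the Cityscapes Panoptic Parts (CPP) and Pascal Panoptic Parts (PPP) datasets in terms of PartPQ and Part-Whole Quality (PWQ). In extensive experiments, we verify the importance of our fair fusion, highlight its impact on areas that can be further segmented into parts, and demonstrate that the design generalizes to 5 additional datasets without fine-tuning.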

Anatomy and Physiology of Artificial Intelligence in PET Imaging

paper_url: http://arxiv.org/abs/2311.18614
repo_url: None
paper_authors: Tyler J. Bradshaw, Alan B. McMillan
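for: An illustrated guide to AI for readers in nuclear medicine, covering the core principles of modern AI with a focus on the aspects likely to be encountered in PET imaging.
methods: The article covers convolutional neural networks, algorithm training, and the components of the U-Net used for segmentation and image synthesis.
results: Through its illustrated walkthrough, the article helps readers understand the core principles of modern AI and gives examples of AI applications that may be encountered in PET imaging.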

Abstract
The influence of artificial intelligence (AI) within the field of nuclear medicine has been rapidly growing. Many researchers and clinicians are seeking to apply AI within PET, and clinicians will soon find themselves engaging with AI-based applications all along the chain of molecular imaging, from image reconstruction to enhanced reporting. This expanding presence of AI in PET imaging will result in greater demand for educational resources for those unfamiliar with AI. The objective of this article is to provide an illustrated guide to the core principles of modern AI, with specific focus on aspects that are most likely to be encountered in PET imaging. We describe convolutional neural networks, algorithm training, and explain the components of the commonly used U-Net for segmentation and image synthesis.

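Summary
The influence of artificial intelligence (AI) in nuclear medicine is growing rapidly. Many researchers and clinicians are seeking to apply AI to PET, and clinicians will soon engage with AI-based applications all along the molecular imaging chain, from image reconstruction to enhanced reporting. This expanding presence of AI in PET imaging will increase the demand for educational resources for those unfamiliar with AI. The objective of this article is to provide an illustrated guide to the core principles of modern AI, focusing on the aspects most likely to be encountered in PET imaging. We describe convolutional neural networks and algorithm training, and explain the components of the commonly used U-Net for segmentation and image synthesis.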

Cancer-Net PCa-Gen: Synthesis of Realistic Prostate Diffusion Weighted Imaging Data via Anatomic-Conditional Controlled Latent Diffusion

paper_url: http://arxiv.org/abs/2311.18612
repo_url: None
paper_authors: Aditya Sridhar, Chi-en Amy Tai, Hayden Gunraj, Yuhao Chen, Alexander Wong
for: The paper aims to generate realistic prostate diffusion-weighted imaging (DWI) data to aid in the diagnosis, prognosis, and treatment planning of prostate cancer.
methods: The authors propose an anatomic-conditional controlled latent diffusion strategy, called Cancer-Net PCa-Gen, to generate diverse prostate images with controllable tumor locations and improved anatomical and textural fidelity.
results: The proposed method enhances the synthesis of diverse prostate images, which can be used to augment real patient data and train neural networks on a more diverse and comprehensive data distribution. The Cancer-Net PCa-Gen framework and sample images have been made publicly available for further research and development.

Abstract
In Canada, prostate cancer is the most common form of cancer in men and accounted for 20% of new cancer cases for this demographic in 2022. Due to recent successes in leveraging machine learning for clinical decision support, there has been significant interest in the development of deep neural networks for prostate cancer diagnosis, prognosis, and treatment planning using diffusion weighted imaging (DWI) data. A major challenge hindering widespread adoption in clinical use is poor generalization of such networks due to scarcity of large-scale, diverse, balanced prostate imaging datasets for training such networks. In this study, we explore the efficacy of latent diffusion for generating realistic prostate DWI data through the introduction of an anatomic-conditional controlled latent diffusion strategy. To the best of the authors' knowledge, this is the first study to leverage conditioning for synthesis of prostate cancer imaging. Experimental results show that the proposed strategy, which we call Cancer-Net PCa-Gen, enhances synthesis of diverse prostate images through controllable tumour locations and better anatomical and textural fidelity. These crucial features make it well-suited for augmenting real patient data, enabling neural networks to be trained on a more diverse and comprehensive data distribution. The Cancer-Net PCa-Gen framework and sample images have been made publicly available at https://www.kaggle.com/datasets/deetsadi/cancer-net-pca-gen-dataset as a part of a global open-source initiative dedicated to accelerating advancement in machine learning to aid clinicians in the fight against cancer.

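Summary
In Canada, prostate cancer is the most common form of cancer in men and accounted for 20% of new cancer cases in men in 2022. Given recent successes in applying machine learning to clinical decision support, there is significant interest in developing deep neural networks for prostate cancer diagnosis, prognosis, and treatment planning using diffusion weighted imaging (DWI) data. However, the scarcity of large-scale, diverse, balanced prostate imaging datasets leads to poor generalization of such networks and hinders their widespread clinical adoption. In this study, we explore latent diffusion for generating realistic prostate DWI data by introducing an anatomic-conditional controlled latent diffusion strategy. To the best of the authors' knowledge, this is the first study to use conditioning for the synthesis of prostate cancer imaging. Experimental results show that the proposed Cancer-Net PCa-Gen strategy enhances the synthesis of diverse prostate images through controllable tumour locations and better anatomical and textural fidelity. These features make it well suited to augmenting real patient data, so that neural networks can be trained on a more diverse and comprehensive data distribution. The Cancer-Net PCa-Gen framework and sample images have been made publicly available as part of a global open-source initiative dedicated to accelerating advances in machine learning to aid clinicians in the fight against cancer.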

DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image

paper_url: http://arxiv.org/abs/2311.18610
repo_url: None
paper_authors: Daoyi Gao, Dávid Rozenberszki, Stefan Leutenegger, Angela Dai
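for: Perceiving 3D structure from an RGB image to enable an effective, efficient 3D object-based representation of scenes.
methods: DiffCAD, the first weakly-supervised probabilistic approach to CAD model retrieval and alignment from an RGB image. The task is formulated as conditional generation, using diffusion to learn implicit probabilistic models that capture the shape, pose, and scale of CAD objects in an image. This allows multiple plausible CAD reconstructions to be generated, with only a few hypotheses needed to characterize ambiguities in depth/scale and inexact shape matches.
results: The method adapts zero-shot to various real target domains and surpasses the supervised state of the art, which relies on expensive annotations. The multi-hypothesis approach improves on it by 5.9% on the Scan2CAD dataset.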

Abstract
Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive annotations of CAD models associated with real images, and encounter challenges due to the inherent ambiguities in the task -- both in depth-scale ambiguity in monocular perception, as well as inexact matches of CAD database models to real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We formulate this as a conditional generative task, leveraging diffusion to learn implicit probabilistic models capturing the shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8 hypotheses.

Summary
Perceiving 3D structure from RGB images based on CAD model primitives enables an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive annotations of CAD models associated with real images, and they struggle with the inherent ambiguities of the task: depth-scale ambiguity in monocular perception and inexact matches between CAD database models and real observations. We therefore propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We formulate the task as conditional generation, leveraging diffusion to learn implicit probabilistic models that capture the shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, with only a few hypotheses needed to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data and uses monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach surpasses the supervised state of the art on the Scan2CAD dataset by 5.9% with 8 hypotheses.

Learning Triangular Distribution in Visual World

paper_url: http://arxiv.org/abs/2311.18605
repo_url: None
paper_authors: Ping Chen, Xingpeng Zhang, Chengtao Zhou, Dichao Fan, Peng Tu, Le Zhang, Yanlin Qian
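for: This work addresses label distribution learning, in particular how the discrepancy between features should map to the discrepancy between labels.
methods: The paper proposes the Triangular Distribution Transform (TDT), a general and simple framework that builds an injective function between feature and label, so that any symmetric feature discrepancy linearly reflects the difference between labels.
results: On Facial Age Recognition, Illumination Chromaticity Estimation, and Aesthetics assessment, TDT achieves results on par with or better than prior art.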

Abstract
Convolution neural network is successful in pervasive vision tasks, including label distribution learning, which usually takes the form of learning an injection from the non-linear visual features to the well-defined labels. However, how the discrepancy between features is mapped to the label discrepancy is ambient, and its correctness is not guaranteed. To address these problems, we study the mathematical connection between feature and its label, presenting a general and simple framework for label distribution learning. We propose a so-called Triangular Distribution Transform (TDT) to build an injective function between feature and label, guaranteeing that any symmetric feature discrepancy linearly reflects the difference between labels. The proposed TDT can be used as a plug-in in mainstream backbone networks to address different label distribution learning tasks. Experiments on Facial Age Recognition, Illumination Chromaticity Estimation, and Aesthetics assessment show that TDT achieves on-par or better results than the prior arts.

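Summary
Convolutional neural networks have succeeded in pervasive vision tasks, including label distribution learning, which usually takes the form of learning an injective mapping from non-linear visual features to well-defined labels. However, how the discrepancy between features maps to the discrepancy between labels is unclear, and its correctness is not guaranteed. To address these problems, we study the mathematical connection between a feature and its label and present a general and simple framework for label distribution learning. We propose the Triangular Distribution Transform (TDT), which builds an injective function between feature and label so that any symmetric feature discrepancy linearly reflects the difference between labels. TDT can serve as a plug-in for mainstream backbone networks to address different label distribution learning tasks. Experiments on facial age recognition, illumination chromaticity estimation, and aesthetics assessment yield results on par with or better than prior art.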

Identifying tourist destinations from movie scenes using Deep Learning

paper_url: http://arxiv.org/abs/2312.00098
repo_url: None
paper_authors: Mahendran Narayanan
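for: This paper examines the influence of movies on tourism and proposes a deep-learning method for identifying tourist destinations that appear in films.
methods: A deep learning model is trained on a dataset of major tourist destinations worldwide so that it can recognize the locations shown in movie scenes.
results: The goal is to let viewers turn landmarks seen in movie scenes into travel suggestions, which could in turn benefit the tourism industry. The abstract reports no quantitative results.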

Abstract
Movies wield significant influence in our lives, playing a pivotal role in the tourism industry of any country. The inclusion of picturesque landscapes, waterfalls, and mountains as backdrops in films serves to enhance the allure of specific scenarios. Recognizing the impact of movies on tourism, this paper introduces a method for identifying tourist destinations featured in films. We propose the development of a deep learning model capable of recognizing these locations during movie viewing. The model is trained on a dataset comprising major tourism destinations worldwide. Through this research, the goal is to enable viewers to identify the real-world locations depicted in movie scenes, offering a novel way to connect cinema with global travel experiences.

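Summary
Movies have a strong influence on our lives and play an important role in the tourism industry of any country. Beautiful landscapes, waterfalls, and mountains used as film backdrops increase the appeal of particular scenes. Recognizing the impact of movies on tourism, we propose a deep learning model that identifies tourist destinations while a movie is being watched. The model is trained on a dataset covering major tourist destinations worldwide. Through this research, we hope to let viewers identify the real places shown in movie scenes, offering a new link between cinema and travel experiences.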

Overcoming Label Noise for Source-free Unsupervised Video Domain Adaptation

paper_url: http://arxiv.org/abs/2311.18572
repo_url: None
paper_authors: Avijit Dasgupta, C. V. Jawahar, Karteek Alahari
for: addressing the challenge of source-dependent video domain adaptation by developing a self-training based source-free approach.
methods: using the source pre-trained model to generate pseudo-labels for the target domain samples, treating the problem as learning from noisy labels, and leveraging a teacher-student framework to improve adaptation performance.
results: achieving state-of-the-art results on various open datasets, outperforming existing approaches.

Abstract
Despite the progress seen in classification methods, current approaches for handling videos with distribution shifts in source and target domains remain source-dependent as they require access to the source data during the adaptation stage. In this paper, we present a self-training based source-free video domain adaptation approach to address this challenge by bridging the gap between the source and the target domains. We use the source pre-trained model to generate pseudo-labels for the target domain samples, which are inevitably noisy. Thus, we treat the problem of source-free video domain adaptation as learning from noisy labels and argue that the samples with correct pseudo-labels can help us in adaptation. To this end, we leverage the cross-entropy loss as an indicator of the correctness of the pseudo-labels and use the resulting small-loss samples from the target domain for fine-tuning the model. We further enhance the adaptation performance by implementing a teacher-student framework, in which the teacher, which is updated gradually, produces reliable pseudo-labels. Meanwhile, the student undergoes fine-tuning on the target domain videos using these generated pseudo-labels to improve its performance. Extensive experimental evaluations show that our methods, termed as CleanAdapt, CleanAdapt + TS, achieve state-of-the-art results, outperforming the existing approaches on various open datasets. Our source code is publicly available at https://avijit9.github.io/CleanAdapt.

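Summary
Despite progress in classification methods, current approaches for videos with distribution shift between source and target domains remain source-dependent, because they need access to the source data during adaptation. In this paper, we propose a self-training based source-free video domain adaptation method to address this challenge. We use the source pre-trained model to generate pseudo-labels for the target domain samples, and these labels are inevitably noisy. We therefore treat the problem as learning from noisy labels and argue that samples with correct pseudo-labels can help adaptation. To this end, we use the cross-entropy loss as an indicator of pseudo-label correctness and fine-tune the model on the small-loss samples from the target domain. We further adopt a teacher-student framework in which the gradually updated teacher produces reliable pseudo-labels, while the student is fine-tuned with these pseudo-labels to improve its performance. Our methods, termed CleanAdapt and CleanAdapt + TS, achieve state-of-the-art results on several open datasets, outperforming existing approaches. Our source code is available at https://avijit9.github.io/CleanAdapt.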
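The small-loss selection and the gradually updated teacher described in the abstract can be illustrated with a minimal sketch. The function names, the `keep_ratio` parameter, and the exponential-moving-average teacher update are assumptions made for illustration; the abstract does not specify how the threshold or the teacher update is implemented.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of one sample given its predicted class probabilities."""
    return -math.log(max(probs[label], 1e-12))

def select_small_loss(pred_probs, keep_ratio=0.5):
    """Return (index, pseudo_label) pairs for the samples with the smallest loss.

    pred_probs holds one class-probability list per target sample, as produced
    by the source pre-trained model. The pseudo-label is the argmax class.
    keep_ratio is a hypothetical hyperparameter, not taken from the paper.
    """
    scored = []
    for idx, probs in enumerate(pred_probs):
        pseudo = max(range(len(probs)), key=probs.__getitem__)
        scored.append((cross_entropy(probs, pseudo), idx, pseudo))
    scored.sort()  # smallest loss first
    n_keep = max(1, int(len(scored) * keep_ratio))
    return [(idx, pseudo) for _, idx, pseudo in scored[:n_keep]]

def ema_update(teacher, student, momentum=0.99):
    """Assumed teacher update: move teacher weights slowly toward the student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.34, 0.33, 0.33],
]
selected = select_small_loss(probs, keep_ratio=0.5)
print(selected)
```

In this toy batch the two confident predictions are kept for fine-tuning and the two ambiguous ones are discarded.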

Seam-guided local alignment and stitching for large parallax images

paper_url: http://arxiv.org/abs/2311.18564
repo_url: https://github.com/tlliao/Seam-guided-local-alignment
paper_authors: Tianli Liao, Chenyang Zhao, Lei Li, Heling Cao
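for: This paper studies local alignment and seam quality evaluation for stitching images with large parallax.
methods: The method first uses existing image alignment and seam-cutting methods to compute an initial seam and evaluates the quality of the pixels along it. For low-quality pixels, it separates their enclosing patches in the aligned images and aligns them locally by extracting modified dense correspondences. Finally, it composites the aligned patches via seam-cutting and merges them into the initial aligned result to produce the final mosaic.
results: Compared with state-of-the-art seam-cutting methods, the results are more plausible and contain fewer artifacts.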

Abstract
Seam-cutting methods have been proven effective in the composition step of image stitching, especially for images with parallax. However, the effectiveness of seam-cutting usually depends on that images can be roughly aligned such that there exists a local region where a plausible seam can be found. For images with large parallax, current alignment methods often fall short of expectations. In this paper, we propose a local alignment and stitching method guided by seam quality evaluation. First, we use existing image alignment and seam-cutting methods to calculate an initial seam and evaluate the quality of pixels along the seam. Then, for pixels with low qualities, we separate their enclosing patches in the aligned images and locally align them by extracting modified dense correspondences via SIFT flow. Finally, we composite the aligned patches via seam-cutting and merge them into the original aligned result to generate the final mosaic. Experiments show that compared with the state-of-the-art seam-cutting methods, our result is more plausible and with fewer artifacts. The code will be available at https://github.com/tlliao/Seam-guided-local-alignment.

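Summary
Seam-cutting methods have proven effective in the composition step of image stitching, especially for images with parallax. However, their effectiveness usually depends on the images being roughly aligned, so that a local region exists where a plausible seam can be found. For images with large parallax, current alignment methods often fall short. In this paper, we propose a local alignment and stitching method guided by seam quality evaluation. First, we use existing image alignment and seam-cutting methods to compute an initial seam and evaluate the quality of the pixels along it. Then, for low-quality pixels, we separate their enclosing patches in the aligned images and align them locally using modified dense correspondences. Finally, we composite the aligned patches via seam-cutting and merge them into the original aligned result to generate the final mosaic. Experiments show that, compared with state-of-the-art seam-cutting methods, our results are more plausible and have fewer artifacts. The code will be available at https://github.com/tlliao/Seam-guided-local-alignment.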

Periodic Vibration Gaussian: Dynamic Urban Scene Reconstruction and Real-time Rendering

paper_url: http://arxiv.org/abs/2311.18561
repo_url: https://github.com/fudan-zvg/PVG
paper_authors: Yurui Chen, Chun Gu, Junzhe Jiang, Xiatian Zhu, Li Zhang
for: This paper addresses the challenges of modeling dynamic, large-scale urban scenes, which are characterized by highly intricate geometric structures and unconstrained dynamics in both space and time.
methods: The proposed method, called Periodic Vibration Gaussian (PVG), builds upon the efficient 3D Gaussian splatting technique and introduces periodic vibration-based temporal dynamics to capture the synergistic interactions between objects and elements in dynamic urban scenes. Additionally, the method includes a novel flow-based temporal smoothing mechanism and a position-aware adaptive control strategy to enhance temporally coherent representation learning with sparse training data.
results: Extensive experiments on Waymo Open Dataset and KITTI benchmarks demonstrate that PVG surpasses state-of-the-art alternatives in both reconstruction and novel view synthesis for both dynamic and static scenes, without relying on manually labeled object bounding boxes or expensive optical flow estimation. Moreover, PVG achieves 50/6000-fold acceleration in training/rendering over the best alternative.

Abstract
Modeling dynamic, large-scale urban scenes is challenging due to their highly intricate geometric structures and unconstrained dynamics in both space and time. Prior methods often employ high-level architectural priors, separating static and dynamic elements, resulting in suboptimal capture of their synergistic interactions. To address this challenge, we present a unified representation model, called Periodic Vibration Gaussian (PVG). PVG builds upon the efficient 3D Gaussian splatting technique, originally designed for static scene representation, by introducing periodic vibration-based temporal dynamics. This innovation enables PVG to elegantly and uniformly represent the characteristics of various objects and elements in dynamic urban scenes. To enhance temporally coherent representation learning with sparse training data, we introduce a novel flow-based temporal smoothing mechanism and a position-aware adaptive control strategy. Extensive experiments on Waymo Open Dataset and KITTI benchmarks demonstrate that PVG surpasses state-of-the-art alternatives in both reconstruction and novel view synthesis for both dynamic and static scenes. Notably, PVG achieves this without relying on manually labeled object bounding boxes or expensive optical flow estimation. Moreover, PVG exhibits 50/6000-fold acceleration in training/rendering over the best alternative.

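Summary
Modeling dynamic, large-scale urban scenes is challenging because of their highly intricate geometric structures and unconstrained dynamics in both space and time. Prior methods often employ high-level architectural priors that separate static and dynamic elements, which leads to suboptimal capture of their interactions. To address this challenge, we propose a unified representation model called Periodic Vibration Gaussian (PVG). PVG builds on the efficient 3D Gaussian splatting technique, originally designed for static scene representation, and extends it with periodic vibration-based temporal dynamics. This lets PVG represent the characteristics of various objects and elements in dynamic urban scenes elegantly and uniformly. To improve temporally coherent representation learning with sparse training data, we introduce a flow-based temporal smoothing mechanism and a position-aware adaptive control strategy. Extensive experiments on the Waymo Open Dataset and KITTI benchmarks show that PVG surpasses state-of-the-art alternatives in reconstruction and novel view synthesis, without relying on manually labeled object bounding boxes or expensive optical flow estimation. Moreover, PVG achieves 50/6000-fold acceleration in training/rendering over the best alternative.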
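To make "periodic vibration-based temporal dynamics" concrete, the sketch below lets the mean of a single Gaussian oscillate sinusoidally around a base position and lets its opacity decay away from a peak time. The abstract does not give the exact parameterization, so the formulas and the names `tau` (peak time), `period`, `velocity`, and `beta` (lifespan) are assumptions used only for illustration.

```python
import math

def vibrating_mean(mu, velocity, t, tau, period):
    """Assumed form: mu + (period / 2*pi) * sin(2*pi*(t - tau) / period) * velocity."""
    scale = period / (2.0 * math.pi) * math.sin(2.0 * math.pi * (t - tau) / period)
    return [m + scale * v for m, v in zip(mu, velocity)]

def decayed_opacity(opacity, t, tau, beta):
    """Assumed form: opacity peaks at t == tau and decays like a Gaussian in time."""
    return opacity * math.exp(-0.5 * ((t - tau) / beta) ** 2)

# A static point has zero velocity, so its mean never moves.
static_point = vibrating_mean([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], t=0.7, tau=0.3, period=1.0)

# A moving point is displaced along its velocity direction around time tau.
moving_point = vibrating_mean([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], t=0.25, tau=0.0, period=1.0)
print(static_point, moving_point)
```

Under these assumptions, static and dynamic elements share one representation: a static element is simply a Gaussian with zero velocity and a long lifespan.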

FediOS: Decoupling Orthogonal Subspaces for Personalization in Feature-skew Federated Learning

paper_url: http://arxiv.org/abs/2311.18559
repo_url: None
paper_authors: Lingzhi Gao, Zexi Li, Yang Lu, Chao Wu
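for: Improving the capability of personalized local models in feature-skew federated learning.
methods: The paper proposes a new architecture decoupling design that uses two feature extractors (one generic and one personalized) and one shared prediction head to decouple heterogeneous features.
results: Extensive experiments on four vision datasets show state-of-the-art performance under feature skew heterogeneity.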

Abstract
Personalized federated learning (pFL) enables collaborative training among multiple clients to enhance the capability of customized local models. In pFL, clients may have heterogeneous (also known as non-IID) data, which poses a key challenge in how to decouple the data knowledge into generic knowledge for global sharing and personalized knowledge for preserving local personalization. A typical way of pFL focuses on label distribution skew, and they adopt a decoupling scheme where the model is split into a common feature extractor and two prediction heads (generic and personalized). However, such a decoupling scheme cannot solve the essential problem of feature skew heterogeneity, because a common feature extractor cannot decouple the generic and personalized features. Therefore, in this paper, we rethink the architecture decoupling design for feature-skew pFL and propose an effective pFL method called FediOS. In FediOS, we reformulate the decoupling into two feature extractors (generic and personalized) and one shared prediction head. Orthogonal projections are used for clients to map the generic features into one common subspace and scatter the personalized features into different subspaces to achieve decoupling for them. In addition, a shared prediction head is trained to balance the importance of generic and personalized features during inference. Extensive experiments on four vision datasets demonstrate our method reaches state-of-the-art pFL performances under feature skew heterogeneity.

Summary
To address the problem of feature skew heterogeneity, we propose FediOS, a pFL method that rethinks the architecture decoupling design for feature-skew pFL. FediOS uses two feature extractors (generic and personalized) and one shared prediction head to decouple the features. Orthogonal projections are used for clients to map generic features into one common subspace and scatter personalized features into different subspaces, achieving decoupling. Additionally, the shared prediction head is trained to balance the importance of generic and personalized features during inference. Our extensive experiments on four vision datasets demonstrate that FediOS reaches state-of-the-art pFL performance under feature skew heterogeneity.
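The role of the orthogonal projections can be shown with a small linear-algebra sketch. It builds an orthonormal basis, reserves one block of basis vectors as the common generic subspace, and gives each client its own block for personalized features. The basis construction (Gram-Schmidt on random vectors), the dimensions, and the client names are assumptions for illustration; the abstract does not say how FediOS constructs or learns its projections.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(vectors):
    """Turn linearly independent vectors into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        basis.append([wi / norm for wi in w])
    return basis

def project(feature, subspace):
    """Orthogonal projection of a feature onto the span of the subspace basis."""
    out = [0.0] * len(feature)
    for b in subspace:
        c = dot(feature, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out

rng = random.Random(0)
dim = 6
basis = gram_schmidt([[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)])

generic_subspace = basis[0:2]                      # shared by all clients
personal_subspaces = {"client_a": basis[2:4],      # hypothetical clients
                      "client_b": basis[4:6]}

generic_feature = [rng.gauss(0, 1) for _ in range(dim)]   # stand-in for the generic extractor output
personal_feature = [rng.gauss(0, 1) for _ in range(dim)]  # stand-in for the personalized extractor output

generic_out = project(generic_feature, generic_subspace)
personal_out_a = project(personal_feature, personal_subspaces["client_a"])
personal_out_b = project(personal_feature, personal_subspaces["client_b"])
print(dot(generic_out, personal_out_a), dot(personal_out_a, personal_out_b))
```

Both printed inner products are zero up to floating-point error, which is the sense in which the generic and personalized features are decoupled.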

paper_url: http://arxiv.org/abs/2311.18553
repo_url: None
paper_authors: Daniel Grimm, Maximilian Zipfl, Felix Hertlein, Alexander Naumann, Jürgen Lüttin, Steffen Thoma, Stefan Schmid, Lavdim Halilaj, Achim Rettinger, J. Marius Zöllner
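for: Predicting the future trajectories of surrounding traffic participants for autonomous driving.
methods: A vector-based approach that models interactions between traffic agents with a semantic scene graph, extracts agent-centric image-based map features, and generates anchor paths that restrict multi-modal prediction to permitted trajectories only.
results: Each of the enhancements shows advantages over the baseline model HoliGraph.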

Abstract
Precisely predicting the future trajectories of surrounding traffic participants is a crucial but challenging problem in autonomous driving, due to complex interactions between traffic agents, map context and traffic rules. Vector-based approaches have recently shown to achieve among the best performances on trajectory prediction benchmarks. These methods model simple interactions between traffic agents but don't distinguish between relation-type and attributes like their distance along the road. Furthermore, they represent lanes only by sequences of vectors representing center lines and ignore context information like lane dividers and other road elements. We present a novel approach for vector-based trajectory prediction that addresses these shortcomings by leveraging three crucial sources of information: First, we model interactions between traffic agents by a semantic scene graph, that accounts for the nature and important features of their relation. Second, we extract agent-centric image-based map features to model the local map context. Finally, we generate anchor paths to enforce the policy in multi-modal prediction to permitted trajectories only. Each of these three enhancements shows advantages over the baseline model HoliGraph.

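Summary
Predicting the future trajectories of surrounding traffic participants is an important yet challenging problem in autonomous driving, because of the complex interactions between traffic agents, map context, and traffic rules. Vector-based methods have recently shown very strong performance on trajectory prediction benchmarks. These methods model simple interactions between traffic agents, but they cannot distinguish relation types and attributes such as the agents' distance along the road. In addition, they represent lanes only as vector sequences of center lines and ignore context such as lane dividers and other road elements. We propose a new vector-based trajectory prediction method with three enhancements. First, we model the relations between traffic agents with a semantic scene graph that captures the nature and important features of each relation. Second, we extract agent-centric image-based map features to model the local map context. Third, we generate anchor paths to restrict multi-modal prediction to permitted trajectories only. Each enhancement shows an improvement over the baseline model HoliGraph.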

SparseDC: Depth Completion from sparse and non-uniform inputs

paper_url: http://arxiv.org/abs/2312.00097
repo_url: https://github.com/whu-usi3dv/sparsedc
paper_authors: Chen Long, Wenxiao Zhang, Zhe Chen, Haiping Wang, Yuan Liu, Zhen Cao, Zhen Dong, Bisheng Yang
for: This paper targets depth completion for the poor-quality depth maps encountered in real-world use.
methods: The paper proposes a two-branch feature embedder with an uncertainty-based fusion module to handle sparse and non-uniform depth inputs.
results: The proposed method, SparseDC, is robust to sparse and non-uniform depth inputs and outperforms previous methods in real-world scenarios.

Abstract
We propose SparseDC, a model for Depth Completion of Sparse and non-uniform depth inputs. Unlike previous methods focusing on completing fixed distributions on benchmark datasets (e.g., NYU with 500 points, KITTI with 64 lines), SparseDC is specifically designed to handle depth maps with poor quality in real usage. The key contributions of SparseDC are two-fold. First, we design a simple strategy, called SFFM, to improve the robustness under sparse input by explicitly filling the unstable depth features with stable image features. Second, we propose a two-branch feature embedder to predict both the precise local geometry of regions with available depth values and accurate structures in regions with no depth. The key of the embedder is an uncertainty-based fusion module called UFFM to balance the local and long-term information extracted by CNNs and ViTs. Extensive indoor and outdoor experiments demonstrate the robustness of our framework when facing sparse and non-uniform input depths. The pre-trained model and code are available at https://github.com/WHU-USI3DV/SparseDC.

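Summary
We propose SparseDC, a model for depth completion from sparse depth inputs. Unlike previous methods, SparseDC is specifically designed to handle the low-quality depth maps encountered in real usage. SparseDC makes two main contributions. First, we propose a simple strategy, called SFFM, that improves robustness under sparse input by filling unstable depth features with stable image features. Second, we propose a two-branch feature embedder that predicts both the precise local geometry of regions with depth values and accurate structures in regions without depth. The key of the embedder is an uncertainty-based fusion module, called UFFM, that balances the local and long-term information extracted by CNNs and ViTs. Extensive indoor and outdoor experiments demonstrate the robustness of our framework under sparse and non-uniform input depths. The pre-trained model and code are available at https://github.com/WHU-USI3DV/SparseDC.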
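One common way to realize an uncertainty-based fusion of two feature branches is to weight each branch by how confident it is. The sketch below does this per element with weights proportional to `exp(-uncertainty)`. This weighting rule and the function name are assumptions for illustration; the abstract only states that UFFM balances the CNN and ViT information using uncertainty, not how.

```python
import math

def fuse_by_uncertainty(local_feat, global_feat, local_unc, global_unc):
    """Blend two feature vectors element-wise, trusting the less uncertain branch more.

    local_feat / global_feat stand in for CNN and ViT features.
    local_unc / global_unc are assumed per-element uncertainty estimates (>= 0).
    """
    fused = []
    for lf, gf, lu, gu in zip(local_feat, global_feat, local_unc, global_unc):
        w_local = math.exp(-lu)
        w_global = math.exp(-gu)
        fused.append((w_local * lf + w_global * gf) / (w_local + w_global))
    return fused

# Where depth is available the local branch is confident; elsewhere the global branch dominates.
balanced = fuse_by_uncertainty([1.0], [3.0], [0.5], [0.5])
global_dominant = fuse_by_uncertainty([1.0], [3.0], [5.0], [0.0])
print(balanced, global_dominant)
```

With equal uncertainties the result is the plain average, and as the local uncertainty grows the fused value approaches the global feature.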

OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

paper_url: http://arxiv.org/abs/2312.00096
repo_url: https://github.com/tomchen-ctj/OST
paper_authors: Tongjia Chen, Hongshan Yu, Zhengeng Yang, Zechuan Li, Wei Sun, Chen Chen
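for: This work aims to improve the generalizability of video recognition by refining text knowledge, addressing the limitations caused by the cost of large-scale video data and the lack of descriptive text.
methods: The paper prompts a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors (STD), bridging the textual gap. It also proposes an Optimal Descriptor Solver that matches the best descriptors to each video instance.
results: The method is evaluated extensively in zero-shot, few-shot, and fully supervised video recognition. The best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.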

Abstract
Due to the resource-intensive nature of training vision-language models on expansive video data, a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines propose to tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy for web-scaled descriptive narratives and concise action category names, leading to less distinct semantic space and potential performance limitations. In this work, we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the limitations of the less distinct semantic space of category names, we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors thus bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover, to assign the best descriptors with different video instances, we propose Optimal Descriptor Solver, forming the video recognition problem as solving the optimal matching flow across frame-level representations and descriptors. Comprehensive evaluations in zero-shot, few-shot, and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.

Summary
In this work, we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the limitations of the less distinct semantic space of category names, we use a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors, thereby bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover, to assign the best descriptors to different video instances, we propose the Optimal Descriptor Solver, which forms the video recognition problem as solving the optimal matching flow across frame-level representations and descriptors. Comprehensive evaluations in zero-shot, few-shot, and fully supervised video recognition demonstrate the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.
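An "optimal matching flow" between frame-level representations and descriptors can be approximated with entropy-regularized optimal transport. The sketch below uses Sinkhorn iterations with uniform marginals over frames and descriptors. Sinkhorn, the uniform marginals, the `epsilon` value, and the toy similarity matrix are all assumptions for illustration; the abstract does not state which solver the Optimal Descriptor Solver uses.

```python
import math

def sinkhorn(similarity, epsilon=0.1, n_iters=200):
    """Return a transport plan between frames (rows) and descriptors (columns)."""
    n, m = len(similarity), len(similarity[0])
    kernel = [[math.exp(s / epsilon) for s in row] for row in similarity]
    u = [1.0] * n
    v = [1.0] * m
    row_mass, col_mass = 1.0 / n, 1.0 / m
    for _ in range(n_iters):
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def matching_score(similarity, plan):
    """Similarity aggregated under the transport plan; usable as a class score."""
    return sum(p * s for plan_row, sim_row in zip(plan, similarity)
               for p, s in zip(plan_row, sim_row))

# Toy cosine similarities: 3 frames (rows) against 2 descriptors (columns).
similarity = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
]
plan = sinkhorn(similarity)
score = matching_score(similarity, plan)
print(plan, score)
```

Computing such a score against the descriptors of every candidate class and taking the maximum would give a prediction; that classification step is likewise an assumption here.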

Match me if you can: Semantic Correspondence Learning with Unpaired Images

paper_url: http://arxiv.org/abs/2311.18540
repo_url: None
paper_authors: Jiwon Kim, Byeongho Heo, Sangdoo Yun, Seungryong Kim, Dongyoon Han
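for: This paper proposes a simple yet effective method for improving semantic correspondence. It requires neither extra labeled keypoints nor trainable modules and can be trained with unlabeled pairs.
methods: The paper uses a teacher-student framework that supplies reliable pseudo correspondences to the student network via machine supervision. It also proposes iterative training, in which the student becomes the teacher, generates refined labels, and a new student is trained.
results: The models outperform the baselines, including state-of-the-art methods, on semantic correspondence benchmarks.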

Abstract
Recent approaches for semantic correspondence have focused on obtaining high-quality correspondences using a complicated network, refining the ambiguous or noisy matching points. Despite their performance improvements, they remain constrained by the limited training pairs due to costly point-level annotations. This paper proposes a simple yet effective method that performs training with unlabeled pairs to complement both limited image pairs and sparse point pairs, requiring neither extra labeled keypoints nor trainable modules. We fundamentally extend the data quantity and variety by augmenting new unannotated pairs not primitively provided as training pairs in benchmarks. Using a simple teacher-student framework, we offer reliable pseudo correspondences to the student network via machine supervision. Finally, the performance of our network is steadily improved by the proposed iterative training, putting back the student as a teacher to generate refined labels and train a new student repeatedly. Our models outperform the milestone baselines, including state-of-the-art methods on semantic correspondence benchmarks.

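Summary
Existing semantic correspondence methods focus on obtaining high-quality matches with a complicated network and on refining ambiguous or noisy matching points. Although they improve performance, they remain constrained by the limited number of training pairs, because point-level annotations are costly. This paper proposes a simple yet effective method that trains with unlabeled pairs to complement both the limited image pairs and the sparse point pairs, requiring neither extra labeled keypoints nor trainable modules. We extend the quantity and variety of the data by adding new unannotated pairs that are not provided as training pairs in the benchmarks. Using a simple teacher-student framework, we supply reliable pseudo correspondences to the student network via machine supervision. Finally, the proposed iterative training turns the student into a teacher that generates refined labels for training a new student, repeatedly. Our models outperform the baselines, including state-of-the-art methods, on semantic correspondence benchmarks.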
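The idea of a teacher supplying "reliable" pseudo correspondences for an unlabeled image pair can be sketched with a mutual-nearest-neighbour check on feature similarities. The reliability filter (mutual nearest neighbour plus a score threshold), the function name, and `min_score` are assumptions for illustration; the abstract does not describe how the paper decides which pseudo correspondences are reliable.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pseudo_correspondences(src_feats, tgt_feats, min_score=0.5):
    """Return (source_index, target_index) pairs the teacher considers reliable.

    src_feats / tgt_feats stand in for teacher keypoint features of an unlabeled pair.
    A match is kept only if both points pick each other and the score is high enough.
    """
    sim = [[dot(s, t) for t in tgt_feats] for s in src_feats]
    matches = []
    for i, row in enumerate(sim):
        j = max(range(len(row)), key=row.__getitem__)          # best target for source i
        back = max(range(len(sim)), key=lambda k: sim[k][j])   # best source for target j
        if back == i and row[j] >= min_score:
            matches.append((i, j))
    return matches

src = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
tgt = [[0.0, 1.0], [1.0, 0.0]]
matches = pseudo_correspondences(src, tgt)
print(matches)
```

In an iterative scheme, the trained student would replace the teacher and this step would be repeated to produce refined labels for the next student.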

MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation

paper_url: http://arxiv.org/abs/2311.18537
repo_url: https://github.com/tacju/maxtron
paper_authors: Ju He, Qihang Yu, Inkyu Shin, Xueqing Deng, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
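for: This paper targets video panoptic segmentation, that is, consistently segmenting objects and background in a video and tracking objects over time.
methods: The paper proposes MaXTron, which combines Mask XFormer with trajectory attention. MaXTron uses trajectory attention to improve temporal consistency and applies it efficiently in its within-clip and cross-clip tracking modules.
results: Experiments show that MaXTron reaches state-of-the-art performance on video segmentation benchmarks without bells and whistles.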

Abstract
Video panoptic segmentation requires consistently segmenting (for both `thing' and `stuff' classes) and tracking objects in a video over time. In this work, we present MaXTron, a general framework that exploits Mask XFormer with Trajectory Attention to tackle the task. MaXTron enriches an off-the-shelf mask transformer by leveraging trajectory attention. The deployed mask transformer takes as input a short clip consisting of only a few frames and predicts the clip-level segmentation. To enhance the temporal consistency, MaXTron employs within-clip and cross-clip tracking modules, efficiently utilizing trajectory attention. Originally designed for video classification, trajectory attention learns to model the temporal correspondences between neighboring frames and aggregates information along the estimated motion paths. However, it is nontrivial to directly extend trajectory attention to the per-pixel dense prediction tasks due to its quadratic dependency on input size. To alleviate the issue, we propose to adapt the trajectory attention for both the dense pixel features and object queries, aiming to improve the short-term and long-term tracking results, respectively. Particularly, in our within-clip tracking module, we propose axial-trajectory attention that effectively computes the trajectory attention for tracking dense pixels sequentially along the height- and width-axes. The axial decomposition significantly reduces the computational complexity for dense pixel features. In our cross-clip tracking module, since the object queries in mask transformer are learned to encode the object information, we are able to capture the long-term temporal connections by applying trajectory attention to object queries, which learns to track each object across different clips. Without bells and whistles, MaXTron demonstrates state-of-the-art performances on video segmentation benchmarks.

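Summary
Video panoptic segmentation requires consistently segmenting (for both 'thing' and 'stuff' classes) and tracking objects in a video over time. In this work, we present MaXTron, a general framework that uses Mask XFormer with trajectory attention to tackle the task. MaXTron enriches an off-the-shelf mask transformer with trajectory attention; the mask transformer takes a short clip as input and predicts the clip-level segmentation. To improve temporal consistency, MaXTron employs within-clip and cross-clip tracking modules that use trajectory attention efficiently. Originally designed for video classification, trajectory attention learns to model the temporal correspondences between neighboring frames and aggregates information along the estimated motion paths. However, applying trajectory attention directly to per-pixel dense prediction is difficult because of its quadratic dependency on input size. To alleviate this, we adapt trajectory attention to both the dense pixel features and the object queries, to improve short-term and long-term tracking respectively. In the within-clip tracking module, we propose axial-trajectory attention, which computes trajectory attention for dense pixels sequentially along the height and width axes. The axial decomposition significantly reduces the computational complexity for dense pixel features. In the cross-clip tracking module, because the object queries of the mask transformer learn to encode object information, we capture long-term temporal connections by applying trajectory attention to the object queries, which learns to track each object across clips. Without bells and whistles, MaXTron achieves state-of-the-art performance on video segmentation benchmarks.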
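The benefit of the axial decomposition can be checked with back-of-the-envelope arithmetic. The sketch below counts query-key pairs for attention over all pixels of a clip versus two axial passes, under the assumption that each axial pass attends over all frames along a single spatial axis. The counting model and the clip size are illustrative assumptions, not figures from the paper.

```python
def full_attention_pairs(frames, height, width):
    """Query-key pairs when every pixel in the clip attends to every other pixel."""
    tokens = frames * height * width
    return tokens ** 2

def axial_attention_pairs(frames, height, width):
    """Pairs for a height-axis pass followed by a width-axis pass (assumed layout)."""
    height_pass = width * (frames * height) ** 2   # one sequence per image column
    width_pass = height * (frames * width) ** 2    # one sequence per image row
    return height_pass + width_pass

frames, height, width = 2, 64, 64   # hypothetical clip of two 64x64 feature maps
full = full_attention_pairs(frames, height, width)
axial = axial_attention_pairs(frames, height, width)
print(full, axial, full // axial)
```

For this clip size the axial variant needs 32 times fewer pairwise interactions, and the gap widens as the feature map grows.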

Revisiting Proposal-based Object Detection

paper_url: http://arxiv.org/abs/2311.18512
repo_url: None
paper_authors: Aritra Bhowmik, Martin R. Oswald, Pascal Mettes, Cees G. M. Snoek
for: This paper revisits the pipeline for detecting objects in images with proposals, with the goal of improving the accuracy and efficiency of object detection methods.
methods: The paper proposes a simple yet effective alternative to the common approach of directly maximizing the overlap between proposal and ground truth boxes. Instead, the paper regresses to the area of intersection between proposal and ground truth, allowing each proposal to specify which part contains the object without needing to regress beyond its visual scope.
results: The paper shows that the proposed intersection-based regression and grouping approach directly improves canonical object detection and instance segmentation architectures, highlighting the utility of this alternative approach.

Abstract
This paper revisits the pipeline for detecting objects in images with proposals. For any object detector, the obtained box proposals or queries need to be classified and regressed towards ground truth boxes. The common solution for the final predictions is to directly maximize the overlap between each proposal and the ground truth box, followed by a winner-takes-all ranking or non-maximum suppression. In this work, we propose a simple yet effective alternative. For proposal regression, we solve a simpler problem where we regress to the area of intersection between proposal and ground truth. In this way, each proposal only specifies which part contains the object, avoiding a blind inpainting problem where proposals need to be regressed beyond their visual scope. In turn, we replace the winner-takes-all strategy and obtain the final prediction by taking the union over the regressed intersections of a proposal group surrounding an object. Our revisited approach comes with minimal changes to the detection pipeline and can be plugged into any existing method. We show that our approach directly improves canonical object detection and instance segmentation architectures, highlighting the utility of intersection-based regression and grouping.

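Summary
This paper revisits the pipeline for detecting objects in images. Any object detector needs to classify the obtained box proposals or queries and regress them toward the ground truth boxes. The common solution is to directly maximize the overlap between each proposal and the ground truth box, followed by a winner-takes-all ranking or non-maximum suppression. In this work, we propose a simple yet effective alternative. For proposal regression, we solve a simpler problem: regressing to the area of intersection between the proposal and the ground truth. Each proposal then only specifies which part of it contains the object, which avoids a blind inpainting problem beyond its visual scope. We then replace the winner-takes-all strategy and obtain the final prediction by taking the union over the regressed intersections of the proposal group surrounding an object. Our revised approach requires only minimal changes to the detection pipeline and can be integrated with existing methods. We show that it directly improves canonical object detection and instance segmentation architectures, highlighting the utility of intersection-based regression and grouping.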
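The two steps described above, regressing each proposal to its intersection with the ground truth and then taking the union over a group of proposals, reduce to simple box arithmetic. The sketch below uses ground-truth intersections directly as stand-ins for the regressed outputs and represents the union by its enclosing box. Both simplifications, and the example coordinates, are assumptions for illustration.

```python
def intersection(box_a, box_b):
    """Intersection of two (x1, y1, x2, y2) boxes, or None if they do not overlap."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)

def enclosing_union(boxes):
    """Smallest box that covers the union of all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

ground_truth = (10, 10, 50, 50)
proposals = [(0, 0, 30, 30), (25, 5, 60, 40), (20, 30, 45, 70)]

# Regression target per proposal: only the part of the proposal that contains the object.
targets = [intersection(p, ground_truth) for p in proposals]

# Final prediction: union over the (here, ideal) regressed intersections of the group.
prediction = enclosing_union([t for t in targets if t is not None])
print(targets, prediction)
```

No single proposal covers the object in this example, yet the union of the three intersections recovers the full ground-truth box.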

DifAugGAN: A Practical Diffusion-style Data Augmentation for GAN-based Single Image Super-resolution

paper_url: http://arxiv.org/abs/2311.18508
repo_url: None
paper_authors: Axi Niu, Kang Zhang, Joshua Tian Jin Tee, Trung X. Pham, Jinqiu Sun, Chang D. Yoo, In So Kweon, Yanning Zhang
for: To improve the image quality of GAN-based SR methods
methods: Using a diffusion-style data augmentation scheme to improve the calibration of the discriminator
results: Obtained higher SR performance compared to existing methods, and can be applied to existing GAN-based SR methods

Abstract
It is well known the adversarial optimization of GAN-based image super-resolution (SR) methods makes the preceding SR model generate unpleasant and undesirable artifacts, leading to large distortion. We attribute the cause of such distortions to the poor calibration of the discriminator, which hampers its ability to provide meaningful feedback to the generator for learning high-quality images. To address this problem, we propose a simple but non-travel diffusion-style data augmentation scheme for current GAN-based SR methods, known as DifAugGAN. It involves adapting the diffusion process in generative diffusion models for improving the calibration of the discriminator during training motivated by the successes of data augmentation schemes in the field to achieve good calibration. Our DifAugGAN can be a Plug-and-Play strategy for current GAN-based SISR methods to improve the calibration of the discriminator and thus improve SR performance. Extensive experimental evaluations demonstrate the superiority of DifAugGAN over state-of-the-art GAN-based SISR methods across both synthetic and real-world datasets, showcasing notable advancements in both qualitative and quantitative results.

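Summary
It is well known that the adversarial optimization of GAN-based image super-resolution (SR) methods makes the SR model generate unpleasant and undesirable artifacts, leading to large distortion. We attribute these distortions to the poor calibration of the discriminator, which prevents it from giving the generator meaningful feedback for learning high-quality images. To address this problem, we propose a simple yet non-trivial diffusion-style data augmentation scheme called DifAugGAN. It adapts the diffusion process of generative diffusion models to improve the calibration of the discriminator during training. DifAugGAN can serve as a plug-and-play strategy for current GAN-based SISR methods, improving discriminator calibration and therefore SR performance. Experiments on both synthetic and real-world datasets show that DifAugGAN outperforms state-of-the-art GAN-based SISR methods, with notable gains in both qualitative and quantitative results.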
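A diffusion-style augmentation can be pictured as applying the forward noising step of a diffusion model to images before they reach the discriminator. The sketch below implements the standard closed-form step `x_t = sqrt(a) * x_0 + sqrt(1 - a) * noise` with a linear beta schedule. Applying it to discriminator inputs, the schedule values, and the function names are assumptions for illustration; the abstract does not describe the exact scheme used by DifAugGAN.

```python
import math
import random

def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s < t under a linear beta schedule."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / max(1, num_steps - 1)
        prod *= 1.0 - beta
    return prod

def diffuse(pixels, t, num_steps, rng):
    """Forward diffusion of a flat list of pixel values to timestep t."""
    a = alpha_bar(t, num_steps)
    return [math.sqrt(a) * p + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for p in pixels]

rng = random.Random(0)
image = [0.2, 0.5, 0.9, 0.1]                     # stand-in for a real or super-resolved image
unchanged = diffuse(image, 0, 1000, rng)         # t = 0 leaves the image untouched
lightly_noised = diffuse(image, 50, 1000, rng)   # small t adds mild noise
print(unchanged, lightly_noised)
```

In a training loop the same timestep would be applied to both real and generated images, so the discriminator compares like with like.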

Accurate Segmentation of Optic Disc And Cup from Multiple Pseudo-labels by Noise-Aware Learning

paper_url: http://arxiv.org/abs/2311.18496
repo_url: https://github.com/wwwtttjjj/mpnn
paper_authors: Tengjin Weng, Yang Shen, Zhidong Zhao, Zhiming Cheng, Shuai Wang
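for: This paper proposes a new solution for optic disc and cup segmentation, a key step in automated glaucoma screening and diagnosis.
methods: The paper proposes a label-denoising method, the Multiple Pseudo-labels Noise-aware Network (MPNN). Several differently initialized networks trained on the true labels generate multiple pseudo-labels, and the pixel-level consensus among these pseudo-labels is used to separate clean pixels from noisy ones.
results: Experiments on the RIGA dataset show that MPNN outperforms other label-denoising methods on optic disc and cup segmentation and has significant denoising ability.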

Abstract
Optic disc and cup segmentation play a crucial role in automating the screening and diagnosis of optic glaucoma. While data-driven convolutional neural networks (CNNs) show promise in this area, the inherent ambiguity of segmenting object and background boundaries in the task of optic disc and cup segmentation leads to noisy annotations that impact model performance. To address this, we propose an innovative label-denoising method of Multiple Pseudo-labels Noise-aware Network (MPNN) for accurate optic disc and cup segmentation. Specifically, the Multiple Pseudo-labels Generation and Guided Denoising (MPGGD) module generates pseudo-labels by multiple different initialization networks trained on true labels, and the pixel-level consensus information extracted from these pseudo-labels guides to differentiate clean pixels from noisy pixels. The training framework of the MPNN is constructed by a teacher-student architecture to learn segmentation from clean pixels and noisy pixels. Particularly, such a framework adeptly leverages (i) reliable and fundamental insights from clean pixels and (ii) the supplementary knowledge within noisy pixels via multiple perturbation-based unsupervised consistency. Compared to other label-denoising methods, comprehensive experimental results on the RIGA dataset demonstrate our method's excellent performance and significant denoising ability.
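
The pixel-level consensus idea can be sketched in a few lines. The rule used here (a pixel counts as clean only when every pseudo-label agrees with the others and with the given annotation) is an assumption chosen for illustration; the abstract does not state the exact criterion, and `split_clean_noisy` is a hypothetical name.

```python
def split_clean_noisy(pseudo_labels, annotation):
    """Flag each pixel as clean (True) or noisy (False).

    pseudo_labels: list of K binary masks, each a flat list of 0/1 values.
    annotation:    the (possibly noisy) ground-truth mask, same length.
    Assumed rule: clean means all K pseudo-labels agree with each other
    and with the annotation at that pixel.
    """
    clean = []
    for i, label in enumerate(annotation):
        votes = [mask[i] for mask in pseudo_labels]
        unanimous = all(v == votes[0] for v in votes)
        clean.append(unanimous and votes[0] == label)
    return clean
```

A teacher-student setup could then apply a supervised loss on the clean pixels and a consistency loss on the noisy ones.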

Improving Adversarial Transferability via Model Alignment

paper_url: http://arxiv.org/abs/2311.18495
repo_url: None
paper_authors: Avery Ma, Amir-massoud Farahmand, Yangchen Pan, Philip Torr, Jindong Gu
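for: To improve a given source model's ability to generate transferable adversarial perturbations.
methods: A model alignment technique fine-tunes the source model's parameters to minimize an alignment loss, which measures the divergence between its predictions and those of an independently trained witness model.
results: On ImageNet, across a variety of model architectures, perturbations generated from aligned source models transfer significantly better than those from the original source model.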

Abstract
Neural networks are susceptible to adversarial perturbations that are transferable across different models. In this paper, we introduce a novel model alignment technique aimed at improving a given source model's ability in generating transferable adversarial perturbations. During the alignment process, the parameters of the source model are fine-tuned to minimize an alignment loss. This loss measures the divergence in the predictions between the source model and another, independently trained model, referred to as the witness model. To understand the effect of model alignment, we conduct a geometric analysis of the resulting changes in the loss landscape. Extensive experiments on the ImageNet dataset, using a variety of model architectures, demonstrate that perturbations generated from aligned source models exhibit significantly higher transferability than those from the original source model.
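
One plausible form of the alignment loss is a KL divergence between the witness model's and the source model's softmax predictions, averaged over a batch. The abstract only says the loss "measures the divergence in the predictions", so the choice of KL, its direction, and the function names below are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(source_logits, witness_logits):
    """Mean KL(witness || source) over a batch of per-example logit vectors."""
    losses = [
        kl_divergence(softmax(w), softmax(s))
        for s, w in zip(source_logits, witness_logits)
    ]
    return sum(losses) / len(losses)
```

During alignment only the source model would be updated to reduce this value; the witness model stays fixed.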

PRS: Sharp Feature Priors for Resolution-Free Surface Remeshing

paper_url: http://arxiv.org/abs/2311.18494
repo_url: https://github.com/artonson/prs
paper_authors: Natalia Soboleva, Olga Gorbunova, Maria Ivanova, Evgeny Burnaev, Matthias Nießner, Denis Zorin, Alexey Artemov
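for: Surface reconstruction that preserves geometric features, a challenging computer vision task.
methods: A data-driven approach to feature detection and remeshing that takes only a coarse, aliased mesh as input.
results: On high-resolution shape reconstructions from the ABC dataset, the method improves over the state of the art by 26% in normals F-score and 42% in perceptual $\text{RMSE}_{\text{v}}$.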

Abstract
Surface reconstruction with preservation of geometric features is a challenging computer vision task. Despite significant progress in implicit shape reconstruction, state-of-the-art mesh extraction methods often produce aliased, perceptually distorted surfaces and lack scalability to high-resolution 3D shapes. We present a data-driven approach for automatic feature detection and remeshing that requires only a coarse, aliased mesh as input and scales to arbitrary resolution reconstructions. We define and learn a collection of surface-based fields to (1) capture sharp geometric features in the shape with an implicit vertexwise model and (2) approximate improvements in normals alignment obtained by applying edge-flips with an edgewise model. To support scaling to arbitrary complexity shapes, we learn our fields using local triangulated patches, fusing estimates on complete surface meshes. Our feature remeshing algorithm integrates the learned fields as sharp feature priors and optimizes vertex placement and mesh connectivity for maximum expected surface improvement. On a challenging collection of high-resolution shape reconstructions in the ABC dataset, our algorithm improves over state-of-the-art by 26% normals F-score and 42% perceptual $\text{RMSE}_{\text{v}}$.

Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

paper_url: http://arxiv.org/abs/2311.18482
repo_url: None
paper_authors: Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, Shao-Hua Guan
for: scene understanding tasks such as object localization and segmentation
methods: novel embedding procedure and dedicated quantization scheme
results: best visual quality and language querying accuracy, while maintaining real-time rendering frame rates

Abstract
Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy heavily depends on neural networks that are resource-intensive in training and rendering. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis, directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we propose a dedicated quantization scheme that drastically alleviates the memory requirement, and a novel embedding procedure that achieves smoother yet high accuracy query, countering the multi-view feature inconsistencies and the high-frequency inductive bias in point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy across current language-embedded representations, while maintaining real-time rendering frame rates on a single desktop GPU.
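
To see why quantization can cut memory, consider the simplest possible scheme: each Gaussian stores a small integer index into a shared codebook instead of a high-dimensional feature vector. The sketch below uses plain nearest-neighbour vector quantization, which is an assumption; the paper's dedicated quantization scheme is not described in the abstract and may differ.

```python
def nearest_code(feature, codebook):
    """Index of the codebook entry closest to `feature` in squared L2 distance."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(feature, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(features, codebook):
    """Replace every feature vector by one integer index (what each Gaussian stores)."""
    return [nearest_code(f, codebook) for f in features]


def dequantize(indices, codebook):
    """Look the feature vectors back up at query time."""
    return [codebook[i] for i in indices]
```

With N Gaussians, D-dimensional features and a codebook of K entries, storage drops from N*D floats to N indices plus K*D floats.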

Mixture of Gaussian-distributed Prototypes with Generative Modelling for Interpretable Image Classification

paper_url: http://arxiv.org/abs/2312.00092
repo_url: None
paper_authors: Chong Wang, Yuanhong Chen, Fengbei Liu, Davis James McCarthy, Helen Frazer, Gustavo Carneiro
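for: To improve the interpretability of classification by generatively learning prototype distributions and connecting classification predictions to them, giving intuitive insight into decisions.
methods: The generative method Mixture of Gaussian-distributed Prototypes (MGProto) represents the prototypes of each class with a Gaussian mixture model (GMM), so every learned prototype carries a measure of variability, which naturally reduces sparsity. A prototype diversity objective is integrated into the GMM optimisation to reduce redundancy.
results: On CUB-200-2011, Stanford Cars, Stanford Dogs, and Oxford-IIIT Pets, MGProto achieves state-of-the-art classification and OoD detection performance with encouraging interpretability results.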

Abstract
Prototypical-part interpretable methods, e.g., ProtoPNet, enhance interpretability by connecting classification predictions to class-specific training prototypes, thereby offering an intuitive insight into their decision-making. Current methods rely on a discriminative classifier trained with point-based learning techniques that provide specific values for prototypes. Such prototypes have relatively low representation power due to their sparsity and potential redundancy, with each prototype containing no variability measure. In this paper, we present a new generative learning of prototype distributions, named Mixture of Gaussian-distributed Prototypes (MGProto), which are represented by Gaussian mixture models (GMM). Such an approach enables the learning of more powerful prototype representations since each learned prototype will own a measure of variability, which naturally reduces the sparsity given the spread of the distribution around each prototype, and we also integrate a prototype diversity objective function into the GMM optimisation to reduce redundancy. Incidentally, the generative nature of MGProto offers a new and effective way for detecting out-of-distribution samples. To improve the compactness of MGProto, we further propose to prune Gaussian-distributed prototypes with a low prior. Experiments on CUB-200-2011, Stanford Cars, Stanford Dogs, and Oxford-IIIT Pets datasets show that MGProto achieves state-of-the-art classification and OoD detection performances with encouraging interpretability results.

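Summary
Prototypical-part interpretable methods such as ProtoPNet improve interpretability by connecting classification predictions to class-specific training prototypes, giving intuitive insight into their decisions. Current methods train point-based prototypes with a discriminative classifier; such prototypes have relatively low representation power because they are sparse, potentially redundant, and carry no measure of variability. We propose a generative method, Mixture of Gaussian-distributed Prototypes (MGProto), that learns more powerful prototype representations. Each learned prototype has a measure of variability, which naturally reduces sparsity, and a prototype diversity objective is integrated to reduce redundancy. The generative nature of MGProto also provides a new and effective way to detect out-of-distribution samples. To make MGProto more compact, we further prune Gaussian-distributed prototypes with a low prior. Experiments show that MGProto achieves state-of-the-art classification and OoD detection performance on CUB-200-2011, Stanford Cars, Stanford Dogs, and Oxford-IIIT Pets, with encouraging interpretability results.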
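
A minimal sketch of classification with Gaussian-distributed prototypes follows, assuming diagonal covariances, uniform class priors, and the best class log-likelihood as the out-of-distribution score. These choices and the function names are illustrative assumptions, not details taken from the paper.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values):
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def class_log_likelihood(x, prototypes):
    """log p(x | class) for a mixture given as a list of (prior, mean, var)."""
    return logsumexp(
        [math.log(prior) + log_gaussian(x, mean, var) for prior, mean, var in prototypes]
    )


def classify(x, class_prototypes):
    """Return (label, score); a very low score hints at an out-of-distribution input."""
    scores = {c: class_log_likelihood(x, p) for c, p in class_prototypes.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]


def prune(prototypes, min_prior):
    """Drop components with prior < min_prior and renormalise (at least one must remain)."""
    kept = [p for p in prototypes if p[0] >= min_prior]
    total = sum(p[0] for p in kept)
    return [(prior / total, mean, var) for prior, mean, var in kept]
```

Because each prototype is a density rather than a point, the same quantities serve classification, OoD scoring, and pruning.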

HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

paper_url: http://arxiv.org/abs/2311.18448
repo_url: https://github.com/zc-alexfan/hold
paper_authors: Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J. Black, Otmar Hilliges
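for: To reconstruct interacting hands and objects in 3D, in support of understanding and modelling human behaviour.
methods: Reconstructs an articulated hand and an object jointly from a monocular interaction video, using a compositional articulated implicit model that disentangles the 3D hand from the 3D object.
results: In both in-the-lab and in-the-wild settings, the method needs no 3D hand-object annotations yet outperforms fully-supervised baselines, and it is qualitatively robust on in-the-wild videos. Code: https://github.com/zc-alexfan/hold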

Abstract
Since humans interact with diverse objects every day, the holistic 3D capture of these interactions is important to understand and model human behaviour. However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or heavily rely on limited 3D hand-object data, restricting their ability to scale and generalize to more unconstrained interaction settings. To this end, we introduce HOLD -- the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video. We develop a compositional articulated implicit model that can reconstruct disentangled 3D hand and object from 2D images. We also further incorporate hand-object constraints to improve hand-object poses and consequently the reconstruction quality. Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we qualitatively show its robustness in reconstructing from in-the-wild videos. Code: https://github.com/zc-alexfan/hold

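Summary
Humans interact with diverse objects every day, so holistic 3D capture of these interactions is important for understanding and modelling human behaviour. However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or rely heavily on limited 3D hand-object data, which restricts their ability to scale and generalize. We therefore introduce HOLD, the first category-agnostic method that jointly reconstructs an articulated hand and an object from a monocular interaction video, using a compositional articulated implicit model that recovers the disentangled 3D hand and object. We further incorporate hand-object constraints to improve hand-object poses and reconstruction quality. Our method does not rely on 3D hand-object annotations, yet it outperforms fully-supervised baselines in both in-the-lab and in-the-wild settings. We also show qualitatively that it is robust on in-the-wild videos. Code: https://github.com/zc-alexfan/hold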

VTimeLLM: Empower LLM to Grasp Video Moments

paper_url: http://arxiv.org/abs/2311.18445
repo_url: https://github.com/huangb23/vtimellm
paper_authors: Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu
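for: To address the inability of existing Video LLMs to capture the precise time boundaries of specific events in a video.
methods: The paper proposes VTimeLLM, a Video LLM trained with a boundary-aware three-stage strategy: image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding and align with human intents.
results: Extensive experiments show that VTimeLLM significantly outperforms existing Video LLMs on fine-grained time-related comprehension tasks such as Temporal Video Grounding and Dense Video Captioning. Thanks to its fine-grained temporal understanding, it also beats existing Video LLMs on a video dialogue benchmark, indicating superior cross-modal understanding and reasoning.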

Abstract
Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details. However, existing Video LLMs can only provide a coarse description of the entire video, failing to capture the precise start and end time boundary of specific events. In this paper, we solve this issue via proposing VTimeLLM, a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundary. Specifically, our VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding ability as well as align with human intents. Extensive experiments demonstrate that in fine-grained time-related comprehension tasks for videos such as Temporal Video Grounding and Dense Video Captioning, VTimeLLM significantly outperforms existing Video LLMs. Besides, benefits from the fine-grained temporal understanding of the videos further enable VTimeLLM to beat existing Video LLMs in video dialogue benchmark, showing its superior cross-modal understanding and reasoning abilities.

Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis

paper_url: http://arxiv.org/abs/2311.18435
repo_url: None
paper_authors: Zipeng Qi, Guoxi Huang, Zebin Huang, Qin Guo, Jinwen Chen, Junyu Han, Jian Wang, Gang Zhang, Lufei Liu, Errui Ding, Jingdong Wang
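for: To improve spatial controllability in text-conditioned diffusion models, making image synthesis more accurate and efficient.
methods: Two innovations are proposed: Vision Guidance and the Layered Rendering Diffusion (LRDiff) framework. Vision Guidance is a spatial layout condition that steers image sampling to follow the required layout. LRDiff is a multi-layer rendering process that prevents unintended conceptual blending and mismatches and yields more coherent, contextually accurate images.
results: Experiments show that the method gives better results than existing techniques, and it is applied to three practical tasks: bounding box-to-image, semantic mask-to-image, and image editing.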

Abstract
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries. We present two key innovations: Vision Guidance and the Layered Rendering Diffusion (LRDiff) framework. Vision Guidance, a spatial layout condition, acts as a clue in the perturbed distribution, greatly narrowing down the search space, to focus on the image sampling process adhering to the spatial layout condition. The LRDiff framework constructs an image-rendering process with multiple layers, each of which applies the vision guidance to instructively estimate the denoising direction for a single object. Such a layered rendering strategy effectively prevents issues like unintended conceptual blending or mismatches, while allowing for more coherent and contextually accurate image synthesis. The proposed method provides a more efficient and accurate means of synthesising images that align with specific spatial and contextual requirements. We demonstrate through our experiments that our method provides better results than existing techniques both quantitatively and qualitatively. We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing.

E2PNet: Event to Point Cloud Registration with Spatio-Temporal Representation Learning

paper_url: http://arxiv.org/abs/2311.18433
repo_url: https://github.com/xmu-qcj/e2pnet
paper_authors: Xiuhong Lin, Changjie Qiu, Zhipeng Cai, Siqi Shen, Yu Zang, Weiquan Liu, Xuesheng Bian, Matthias Müller, Cheng Wang
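for: To propose a learning-based method for event-to-point cloud registration, i.e. 2D-3D registration for event cameras.
methods: The method uses a new feature representation network, Event-Points-to-Tensor (EP2T), which encodes event data into a 2D grid-shaped feature tensor so that existing RGB-based frameworks can be used for registration. EP2T uses new sampling and information aggregation modules to handle the inhomogeneity of the spatial and temporal dimensions of the event input.
results: Experiments show that E2PNet outperforms hand-crafted and other learning-based methods and is more robust to extreme illumination or fast motion than RGB-based registration. EP2T can also be used for other vision tasks such as flow estimation, event-to-image reconstruction, and object recognition.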

Abstract
Event cameras have emerged as a promising vision sensor in recent years due to their unparalleled temporal resolution and dynamic range. While registration of 2D RGB images to 3D point clouds is a long-standing problem in computer vision, no prior work studies 2D-3D registration for event cameras. To this end, we propose E2PNet, the first learning-based method for event-to-point cloud registration. The core of E2PNet is a novel feature representation network called Event-Points-to-Tensor (EP2T), which encodes event data into a 2D grid-shaped feature tensor. This grid-shaped feature enables matured RGB-based frameworks to be easily used for event-to-point cloud registration, without changing hyper-parameters and the training procedure. EP2T treats the event input as spatio-temporal point clouds. Unlike standard 3D learning architectures that treat all dimensions of point clouds equally, the novel sampling and information aggregation modules in EP2T are designed to handle the inhomogeneity of the spatial and temporal dimensions. Experiments on the MVSEC and VECtor datasets demonstrate the superiority of E2PNet over hand-crafted and other learning-based methods. Compared to RGB-based registration, E2PNet is more robust to extreme illumination or fast motion due to the use of event data. Beyond 2D-3D registration, we also show the potential of EP2T for other vision tasks such as flow estimation, event-to-image reconstruction and object recognition. The source code can be found at: https://github.com/Xmu-qcj/E2PNet.

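Summary
Event cameras have emerged as a promising vision sensor in recent years thanks to their unparalleled temporal resolution and dynamic range. Registering 2D RGB images to 3D point clouds is a long-standing problem in computer vision, but no prior work studies 2D-3D registration for event cameras. We propose E2PNet, the first learning-based method for event-to-point cloud registration. Its core is a new feature representation network, Event-Points-to-Tensor (EP2T), which encodes event data into a 2D grid-shaped feature tensor. This lets mature RGB-based frameworks be used for event-to-point cloud registration without changing hyper-parameters or the training procedure. Unlike standard 3D learning architectures, EP2T's sampling and information aggregation modules are designed to handle the inhomogeneity of the spatial and temporal dimensions of event data. Experiments on the MVSEC and VECtor datasets show that E2PNet outperforms hand-crafted and other learning-based methods. Compared with RGB-based registration, E2PNet is more robust to extreme illumination or fast motion because it uses event data. We also show the potential of EP2T for other vision tasks such as flow estimation, event-to-image reconstruction, and object recognition. The source code is available at https://github.com/Xmu-qcj/E2PNet.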
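
For readers unfamiliar with event data, the sketch below shows the kind of grid-shaped tensor an event stream can be turned into. It is deliberately not EP2T: it is a simple hand-crafted accumulation of polarities into time bins, used here only to illustrate the input (a list of `(x, y, t, polarity)` tuples) and the output layout. The function name and the binning rule are assumptions.

```python
def events_to_grid(events, height, width, num_bins, t_start, t_end):
    """Accumulate event polarities into a grid indexed as grid[bin][y][x].

    events: iterable of (x, y, t, polarity) with polarity in {-1, +1}.
    Assumes t_end > t_start; events outside the window or the image are skipped.
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    span = t_end - t_start
    for x, y, t, polarity in events:
        if not (0 <= x < width and 0 <= y < height and t_start <= t <= t_end):
            continue
        # Map the timestamp to a bin; the last bin also takes t == t_end.
        time_bin = min(int((t - t_start) / span * num_bins), num_bins - 1)
        grid[time_bin][y][x] += polarity
    return grid
```

A learned encoder such as EP2T replaces this fixed rule with trainable sampling and aggregation, but produces a tensor that image-based registration networks can consume in the same way.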

TeG-DG: Textually Guided Domain Generalization for Face Anti-Spoofing

paper_url: http://arxiv.org/abs/2311.18420
repo_url: None
paper_authors: Lianrui Mu, Jianhong Bai, Xiaoxuan He, Jiangnan Ye, Xiaoyu Liang, Yuchen Yang, Jiedong Zhuang, Haoji Hu
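for: To improve the domain generalization performance of Face Anti-Spoofing (FAS) techniques.
methods: The Textually Guided Domain Generalization (TeG-DG) framework uses text information for cross-domain alignment, with a Hierarchical Attention Fusion (HAF) module and a Textual-Enhanced Visual Discriminator (TEVD).
results: TeG-DG significantly outperforms previous approaches, especially with extremely limited source domain data (~14% and ~12% improvements in HTER and AUC respectively), showing strong few-shot performance.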

Abstract
Enhancing the domain generalization performance of Face Anti-Spoofing (FAS) techniques has emerged as a research focus. Existing methods are dedicated to extracting domain-invariant features from various training domains. Despite the promising performance, the extracted features inevitably contain residual style feature bias (e.g., illumination, capture device), resulting in inferior generalization performance. In this paper, we propose an alternative and effective solution, the Textually Guided Domain Generalization (TeG-DG) framework, which can effectively leverage text information for cross-domain alignment. Our core insight is that text, as a more abstract and universal form of expression, can capture the commonalities and essential characteristics across various attacks, bridging the gap between different image domains. Contrary to existing vision-language models, the proposed framework is elaborately designed to enhance the domain generalization ability of the FAS task. Concretely, we first design a Hierarchical Attention Fusion (HAF) module to enable adaptive aggregation of visual features at different levels; Then, a Textual-Enhanced Visual Discriminator (TEVD) is proposed for not only better alignment between the two modalities but also to regularize the classifier with unbiased text features. TeG-DG significantly outperforms previous approaches, especially in situations with extremely limited source domain data (~14% and ~12% improvements on HTER and AUC respectively), showcasing impressive few-shot performance.

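Summary
Enhancing the domain generalization performance of Face Anti-Spoofing (FAS) techniques has become a research focus. Existing methods concentrate on extracting domain-invariant features; despite promising performance, the extracted features still contain residual style bias (e.g., illumination, capture device), which hurts generalization. In this paper we propose an alternative and effective solution, the Textually Guided Domain Generalization (TeG-DG) framework. Our core insight is that text, as a more abstract and universal form of expression, can capture the commonalities and essential characteristics across different attacks and so bridge the gap between image domains. Unlike existing vision-language models, the framework is designed specifically to improve the domain generalization ability of the FAS task. Concretely, we first design a Hierarchical Attention Fusion (HAF) module for adaptive aggregation of visual features at different levels; we then propose a Textual-Enhanced Visual Discriminator (TEVD), which both improves alignment between the two modalities and regularizes the classifier with unbiased text features. TeG-DG significantly outperforms previous approaches, especially with extremely limited source domain data (about 14% and 12% improvements in HTER and AUC respectively), showing impressive few-shot performance.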

CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model

paper_url: http://arxiv.org/abs/2311.18405
repo_url: https://github.com/zengjianhao/cat-dm
paper_authors: Jianhao Zeng, Dan Song, Weizhi Nie, Hongshuo Tian, Tongtong Wang, Anan Liu
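for: To provide a controllable, accelerated diffusion-based virtual try-on method that improves both try-on quality and speed.
methods: ControlNet introduces additional control conditions, and feature extraction from garment images is improved. For acceleration, the reverse denoising process starts from an implicit distribution generated by a pre-trained GAN-based model.
results: Compared with previous diffusion-based try-on methods, CAT-DM retains the pattern and texture details of the garment while reducing the number of sampling steps without compromising generation quality.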

Abstract
Image-based virtual try-on enables users to virtually try on different garments by altering original clothes in their photographs. Generative Adversarial Networks (GANs) dominate the research field in image-based virtual try-on, but have not resolved problems such as unnatural deformation of garments and the blurry generation quality. Recently, diffusion models have emerged with surprising performance across various image generation tasks. While the generative quality of diffusion models is impressive, achieving controllability poses a significant challenge when applying it to virtual try-on tasks and multiple denoising iterations limit its potential for real-time applications. In this paper, we propose Controllable Accelerated virtual Try-on with Diffusion Model called CAT-DM. To enhance the controllability, a basic diffusion-based virtual try-on network is designed, which utilizes ControlNet to introduce additional control conditions and improves the feature extraction of garment images. In terms of acceleration, CAT-DM initiates a reverse denoising process with an implicit distribution generated by a pre-trained GAN-based model. Compared with previous try-on methods based on diffusion models, CAT-DM not only retains the pattern and texture details of the in-shop garment but also reduces the sampling steps without compromising generation quality. Extensive experiments demonstrate the superiority of CAT-DM against both GAN-based and diffusion-based methods in producing more realistic images and accurately reproducing garment patterns. Our code and models will be publicly released.

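Summary
Image-based virtual try-on lets users try on different garments by altering the original clothes in their photographs. Generative Adversarial Networks (GANs) dominate this research field, but they have not resolved problems such as unnatural garment deformation and blurry generation quality. Diffusion models have recently shown surprising performance across image generation tasks. However, controllability is a significant challenge when applying them to virtual try-on, and the many denoising iterations limit their potential for real-time use. In this paper we propose CAT-DM, a Controllable Accelerated virtual Try-on with Diffusion Model. To enhance controllability, we design a basic diffusion-based virtual try-on network that uses ControlNet to introduce additional control conditions and improves the feature extraction of garment images. For acceleration, CAT-DM starts the reverse denoising process from an implicit distribution generated by a pre-trained GAN-based model. Compared with previous diffusion-based try-on methods, CAT-DM retains the pattern and texture details of the in-shop garment and reduces the sampling steps without compromising generation quality. Experiments show that CAT-DM produces more realistic images and reproduces garment patterns more accurately than both GAN-based and diffusion-based methods. Our code and models will be publicly released.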
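
One way to read "starting the reverse process from a GAN-generated distribution" is truncated sampling: noise the GAN output to an intermediate step and denoise from there, rather than from pure noise at the final step. The sketch below follows that reading, which is an assumption; `denoise_step` is a placeholder for a trained diffusion model and all names are hypothetical.

```python
import math
import random


def noise_to_step(image, alpha_bar, rng):
    """Forward-diffuse a clean image (flat list) to the noise level alpha_bar."""
    keep, noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [keep * x + noise * rng.gauss(0.0, 1.0) for x in image]


def truncated_sampling(gan_image, alpha_bars, start_step, denoise_step, rng):
    """Run the reverse process from start_step down to 0 instead of from the last step."""
    x = noise_to_step(gan_image, alpha_bars[start_step], rng)
    for t in range(start_step, -1, -1):
        x = denoise_step(x, t)  # placeholder for one step of the trained model
    return x
```

With a schedule of T steps, this uses `start_step + 1` model calls instead of T, which is where the speed-up would come from under this reading.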

MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition

paper_url: http://arxiv.org/abs/2311.18402
repo_url: None
paper_authors: Dan Song, Xinwei Fu, Weizhi Nie, Wenhui Li, Anan Liu
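for: To improve the confidence of language-image pre-trained models on 3D shape recognition.
methods: View selection and hierarchical prompts are used to improve how well language-image pre-trained models generalize to 3D shapes.
results: Without additional training, the method achieves zero-shot 3D classification accuracies of 84.44%, 91.51%, and 66.17% on ModelNet40, ModelNet10, and ShapeNet Core55 respectively.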

Abstract
Large-scale pre-trained models have demonstrated impressive performance in vision and language tasks within open-world scenarios. Due to the lack of comparable pre-trained models for 3D shapes, recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition. However, due to the modality gap, pretrained language-image models are not confident enough in the generalization to 3D shape recognition. Consequently, this paper aims to improve the confidence with view selection and hierarchical prompts. Leveraging the CLIP model as an example, we employ view selection on the vision side by identifying views with high prediction confidence from multiple rendered views of a 3D shape. On the textual side, the strategy of hierarchical prompts is proposed for the first time. The first layer prompts several classification candidates with traditional class-level descriptions, while the second layer refines the prediction based on function-level descriptions or further distinctions between the candidates. Remarkably, without the need for additional training, our proposed method achieves impressive zero-shot 3D classification accuracies of 84.44%, 91.51%, and 66.17% on ModelNet40, ModelNet10, and ShapeNet Core55, respectively. Furthermore, we will make the code publicly available to facilitate reproducibility and further research in this area.

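Summary
Large-scale pre-trained models perform impressively on vision and language tasks. Because comparable pre-trained models are lacking for 3D shapes, recent methods use language-image pre-training to achieve zero-shot 3D shape recognition. However, because of the modality gap, pre-trained language-image models are not confident enough when generalizing to 3D shape recognition. This paper therefore aims to improve confidence with view selection and hierarchical prompts. Taking CLIP as an example, on the vision side we select the views with high prediction confidence from multiple rendered views of a 3D shape. On the textual side we propose hierarchical prompts: the first layer proposes several classification candidates using traditional class-level descriptions, and the second layer refines the prediction using function-level descriptions or further distinctions between the candidates. Without additional training, the method achieves zero-shot 3D classification accuracies of 84.44%, 91.51%, and 66.17% on ModelNet40, ModelNet10, and ShapeNet Core55 respectively. We will release the code publicly to support reproducibility and further research in this area.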
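
The two ingredients can be sketched on precomputed scores. Here per-view confidence is measured by the entropy of the class distribution (lower entropy means more confident), and the second prompt layer simply re-ranks the top candidates with a second set of scores. Both choices, and all names, are assumptions for illustration; the real method works with CLIP image and text embeddings.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_views(view_logits, keep):
    """Indices of the `keep` rendered views with the lowest prediction entropy."""
    ranked = sorted(
        range(len(view_logits)), key=lambda i: entropy(softmax(view_logits[i]))
    )
    return ranked[:keep]


def fuse(view_logits, selected):
    """Average the class probabilities of the selected views."""
    num_classes = len(view_logits[0])
    probs = [softmax(view_logits[i]) for i in selected]
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]


def hierarchical_predict(coarse_probs, fine_scores, top_k):
    """Layer 1 keeps the top_k classes by class-level score; layer 2 picks among them."""
    candidates = sorted(
        range(len(coarse_probs)), key=lambda c: coarse_probs[c], reverse=True
    )[:top_k]
    return max(candidates, key=lambda c: fine_scores[c])
```

Note that a class outside the first-layer candidates can never win, even if its second-layer score is the highest, so `top_k` trades recall against the precision of the refinement.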

RainAI – Precipitation Nowcasting from Satellite Data

paper_url: http://arxiv.org/abs/2311.18398
repo_url: https://github.com/rafapablos/w4c23-rainai
paper_authors: Rafael Pablos Sarabia, Joachim Nyborg, Morten Birk, Ira Assent
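for: To forecast high-resolution precipitation with an 8-hour lead time from lower-resolution satellite radiance images.
methods: A simple yet effective spatiotemporal feature learning method based on a 2D U-Net, which outperforms the official 3D U-Net baseline in both performance and efficiency. The paper also stresses dataset preparation and importance sampling and shows that these techniques have a significant impact on performance.
results: A cross-entropy loss improves over the standard mean squared error loss and enables probabilistic outputs. The paper also explores Conditioning Lead Time for predictions at different lead times and evaluates standard and learned upsampling for high-resolution forecasts.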

Abstract
This paper presents a solution to the Weather4Cast 2023 competition, where the goal is to forecast high-resolution precipitation with an 8-hour lead time using lower-resolution satellite radiance images. We propose a simple, yet effective method for spatiotemporal feature learning using a 2D U-Net model, that outperforms the official 3D U-Net baseline in both performance and efficiency. We place emphasis on refining the dataset, through importance sampling and dataset preparation, and show that such techniques have a significant impact on performance. We further study an alternative cross-entropy loss function that improves performance over the standard mean squared error loss, while also enabling models to produce probabilistic outputs. Additional techniques are explored regarding the generation of predictions at different lead times, specifically through Conditioning Lead Time. Lastly, to generate high-resolution forecasts, we evaluate standard and learned upsampling methods. The code and trained parameters are available at https://github.com/rafapablos/w4c23-rainai.
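
A cross-entropy loss yields probabilistic outputs if rain rate is treated as a classification over intensity bins. The sketch below assumes that formulation; the bin edges are invented for the example and the function names are hypothetical.

```python
import bisect
import math


def to_bin(rain_rate, edges):
    """Class index of a rain rate; bin k covers edges[k-1] <= rate < edges[k]."""
    return bisect.bisect_right(edges, rain_rate)


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_bin):
    """Negative log-probability of the target bin for one pixel."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target_bin]


def prob_at_least(logits, edges, threshold):
    """P(rain rate >= threshold); exact only when threshold is one of the edges."""
    first_bin = bisect.bisect_right(edges, threshold)
    return sum(softmax(logits)[first_bin:])
```

With `edges = [0.2, 1.0, 5.0]` (hypothetical, in mm/h) the model predicts four classes per pixel, and `prob_at_least(logits, edges, 1.0)` gives the predicted probability of at least 1 mm/h.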

On Exact Inversion of DPM-Solvers

paper_url: http://arxiv.org/abs/2311.18387
repo_url: https://github.com/smhongok/inv-dpm
paper_authors: Seongmin Hong, Kyeonghyun Lee, Suh Yoon Jeon, Hyewon Bae, Se Young Chun
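for: To study exact inversion for DPM-solvers, i.e. recovering the initial noise from a given image, with benefits for image editing and watermark detection.
methods: Each explicit denoising step is inverted with an implicit method such as gradient descent or the forward step method.
results: Experiments show that the proposed exact inversion methods significantly reduce the error of both image and noise reconstruction, greatly improve the ability to distinguish invisible watermarks, and consistently prevent unintended background changes during image editing.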

Abstract
Diffusion probabilistic models (DPMs) are a key component in modern generative models. DPM-solvers have achieved reduced latency and enhanced quality significantly, but have posed challenges to find the exact inverse (i.e., finding the initial noise from the given image). Here we investigate the exact inversions for DPM-solvers and propose algorithms to perform them when samples are generated by the first-order as well as higher-order DPM-solvers. For each explicit denoising step in DPM-solvers, we formulated the inversions using implicit methods such as gradient descent or forward step method to ensure the robustness to large classifier-free guidance unlike the prior approach using fixed-point iteration. Experimental results demonstrated that our proposed exact inversion methods significantly reduced the error of both image and noise reconstructions, greatly enhanced the ability to distinguish invisible watermarks and well prevented unintended background changes consistently during image editing. Project page: https://smhongok.github.io/inv-dpm.html.

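Summary
Diffusion probabilistic models (DPMs) are a key component of modern generative models. DPM-solvers reduce latency and improve quality, but they make it challenging to find the exact inverse, i.e. to recover the initial noise from a given image. Here we investigate exact inversion for DPM-solvers and propose algorithms that perform it. Each explicit denoising step in a DPM-solver is inverted with an implicit method such as gradient descent or the forward step method, which, unlike the prior approach based on fixed-point iteration, remains robust under large classifier-free guidance. Experiments show that the proposed exact inversion methods reduce the error of both image and noise reconstruction, greatly improve the ability to distinguish invisible watermarks, and prevent unintended background changes during image editing. Project page: https://smhongok.github.io/inv-dpm.html.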
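
Inverting an explicit step means solving f(x_t) = x_{t-1} for x_t. The toy sketch below does this with a forward-step iteration, x <- x - rho * (f(x) - y), on a scalar stand-in for a denoising step. The `tanh` "noise predictor", the coefficients, and the step size are assumptions chosen so the iteration provably contracts; a real DPM-solver step involves a neural network and image-sized tensors.

```python
import math


def denoise_step(x, a, b):
    """Toy explicit step x_{t-1} = a * x_t + b * eps(x_t), with eps = tanh as a stand-in."""
    return a * x + b * math.tanh(x)


def invert_forward_step(y, a, b, rho=0.5, iters=200):
    """Recover x_t from y = denoise_step(x_t) by iterating x <- x - rho * (f(x) - y)."""
    x = y  # start from the step's output
    for _ in range(iters):
        x = x - rho * (denoise_step(x, a, b) - y)
    return x
```

For `a = 0.9`, `b = 1.5` the derivative of the toy step lies in (0.9, 2.4], so each update shrinks the error by a factor of at most 0.55. A naive fixed-point update `x <- (y - b * tanh(x)) / a` is not guaranteed to contract for the same coefficients, which loosely mirrors the robustness issue the abstract attributes to large guidance.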

A Survey on Deep Learning for Polyp Segmentation: Techniques, Challenges and Future Trends

paper_url: http://arxiv.org/abs/2311.18373
repo_url: None
paper_authors: Jiaxin Mei, Tao Zhou, Kaiwen Huang, Yizhe Zhang, Yi Zhou, Ye Wu, Huazhu Fu
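for: This paper surveys polyp segmentation, which helps clinicians accurately locate and segment polyp regions for the early detection and assessment of polyps in the prevention and treatment of colorectal cancer (CRC).
methods: It reviews segmentation algorithms ranging from traditional methods built on manually extracted features to deep learning networks, and details the related benchmark datasets.
results: It comprehensively evaluates and compares recent deep learning models, with results broken down by polyp size, and discusses open challenges and future trends.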

Abstract
Early detection and assessment of polyps play a crucial role in the prevention and treatment of colorectal cancer (CRC). Polyp segmentation provides an effective solution to assist clinicians in accurately locating and segmenting polyp regions. In the past, people often relied on manually extracted lower-level features such as color, texture, and shape, which often had issues capturing global context and lacked robustness to complex scenarios. With the advent of deep learning, more and more outstanding medical image segmentation algorithms based on deep learning networks have emerged, making significant progress in this field. This paper provides a comprehensive review of polyp segmentation algorithms. We first review some traditional algorithms based on manually extracted features and deep segmentation algorithms, then detail benchmark datasets related to the topic. Specifically, we carry out a comprehensive evaluation of recent deep learning models and results based on polyp sizes, considering the pain points of research topics and differences in network structures. Finally, we discuss the challenges of polyp segmentation and future trends in this field. The models, benchmark datasets, and source code links we collected are all published at https://github.com/taozh2017/Awesome-Polyp-Segmentation.

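Summary
Early detection and assessment of polyps helps prevent and treat colorectal cancer (CRC). Polyp segmentation offers an effective way to help clinicians accurately locate and segment polyp regions. In the past, people often relied on manually extracted lower-level features such as color, texture, and shape, which usually fail to capture global context and lack robustness in complex scenarios. With the advent of deep learning, more and more outstanding medical image segmentation algorithms based on deep networks have made significant progress in this field. This paper provides a comprehensive review of polyp segmentation algorithms. We first review traditional algorithms based on manually extracted features and deep segmentation algorithms, then detail the related benchmark datasets, and comprehensively evaluate recent deep learning models and their results by polyp size. Finally, we discuss the challenges of polyp segmentation and future trends. The models, benchmark datasets, and source code links we collected are published at https://github.com/taozh2017/Awesome-Polyp-Segmentation.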

Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation

paper_url: http://arxiv.org/abs/2311.18363
repo_url: None
paper_authors: Ziyang Chen, Yiwen Ye, Mengkang Lu, Yongsheng Pan, Yong Xia
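for: This paper addresses distribution shift in medical images acquired at different medical centres, which is a significant obstacle to deploying pre-trained segmentation models in real-world applications.
methods: The pre-trained model is frozen to avoid error accumulation and catastrophic forgetting, and Visual Prompt-based Test-Time Adaptation (VPTTA) trains a specific prompt for each test image to align the statistics in the batch normalization layers.
results: Experiments show that VPTTA outperforms other state-of-the-art methods on two medical image segmentation benchmark tasks.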

Abstract
Distribution shift widely exists in medical images acquired from different medical centres and poses a significant obstacle to deploying the pre-trained semantic segmentation model in real-world applications. Test-time adaptation has proven its effectiveness in tackling the cross-domain distribution shift during inference. However, most existing methods achieve adaptation by updating the pre-trained models, rendering them susceptible to error accumulation and catastrophic forgetting when encountering a series of distribution shifts (i.e., under the continual test-time adaptation setup). To overcome these challenges caused by updating the models, in this paper, we freeze the pre-trained model and propose the Visual Prompt-based Test-Time Adaptation (VPTTA) method to train a specific prompt for each test image to align the statistics in the batch normalization layers. Specifically, we present the low-frequency prompt, which is lightweight with only a few parameters and can be effectively trained in a single iteration. To enhance prompt initialization, we equip VPTTA with a memory bank to benefit the current prompt from previous ones. Additionally, we design a warm-up mechanism, which mixes source and target statistics to construct warm-up statistics, thereby facilitating the training process. Extensive experiments demonstrate the superiority of our VPTTA over other state-of-the-art methods on two medical image segmentation benchmark tasks. The code and weights of pre-trained source models are available at https://github.com/Chen-Ziyang/VPTTA.

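Summary
Medical images acquired from different medical centres exhibit widespread distribution shift, which is a major obstacle to deploying pre-trained models in real-world applications. Test-time adaptation can effectively handle this cross-domain distribution shift during inference. However, most existing methods adapt by updating the pre-trained model, which makes them susceptible to error accumulation and catastrophic forgetting under a series of distribution shifts (i.e., in the continual test-time adaptation setup). To overcome these problems, we freeze the pre-trained model and propose Visual Prompt-based Test-Time Adaptation (VPTTA), which trains a specific prompt for each test image to align the statistics in the batch normalization layers. Specifically, we present the low-frequency prompt, which is lightweight with only a few parameters and can be effectively trained in a single iteration. To enhance prompt initialization, we equip VPTTA with a memory bank so that the current prompt benefits from previous ones. Additionally, we design a warm-up mechanism that mixes source and target statistics to construct warm-up statistics, thereby facilitating the training process. Extensive experiments demonstrate the superiority of VPTTA over other state-of-the-art methods on two medical image segmentation benchmark tasks. The code and weights of pre-trained source models are available at https://github.com/Chen-Ziyang/VPTTA.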

Automating lookahead planning using site appearance and space utilization

paper_url: http://arxiv.org/abs/2311.18361
repo_url: None
paper_authors: Eyob Mengiste, Borja Garcia de Soto, Timo Hartmann
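for: This study aims to automate the development of lookahead plans in construction projects.
methods: The method uses construction material conditions (appearances) and site space utilization to predict task completion rates, training a GRU-based RNN on a segment of a project timeline and proposing data-aware lookahead plans.
results: Evaluation on a sample project with finishing works shows that the method can assist with developing automated lookahead plans. It links construction planning with actual site events, extends traditional scheduling techniques, and integrates a broader spectrum of site spatial constraints into lookahead planning.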

Abstract
This study proposes a method to automate the development of lookahead planning. The proposed method uses construction material conditions (i.e., appearances) and site space utilization to predict task completion rates. A Gated Recurrent Unit (GRU) based Recurrent Neural Network (RNN) model was trained using a segment of a construction project timeline to estimate completion rates of tasks and propose data-aware lookahead plans. The proposed method was evaluated in a sample construction project involving finishing works such as plastering, painting, and installing electrical fixtures. The results show that the proposed method can assist with developing automated lookahead plans. In doing so, this study links construction planning with actual events at the construction site. It extends the traditional scheduling techniques and integrates a broader spectrum of site spatial constraints into lookahead planning.

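Summary
This study proposes a method to automate lookahead planning. The method uses construction material conditions (i.e., appearances) and site space utilization to predict task completion rates. A Gated Recurrent Unit (GRU) based Recurrent Neural Network (RNN) model is trained on a segment of a construction project timeline to estimate task completion rates and propose data-aware lookahead plans. The method was evaluated on a sample construction project involving finishing works such as plastering, painting, and installing electrical fixtures. The results show that it can assist with developing automated lookahead plans and link construction planning with actual events on site. It extends traditional scheduling techniques and integrates a broader spectrum of site spatial constraints into lookahead planning.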

TIDE: Test Time Few Shot Object Detection

paper_url: http://arxiv.org/abs/2311.18358
repo_url: https://github.com/deku-0621/tide
paper_authors: Weikai Li, Hongfeng Wei, Yanlai Wu, Jie Yang, Yudi Ruan, Yuan Li, Ying Tang
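for: This work targets the real-time configuration requirements and black-box settings of Industry 5.0, and formalizes a new few-shot object detection task, Test TIme Few Shot DEtection (TIDE), to overcome the limits of existing fine-tuning-based methods in these settings.
methods: It introduces an asymmetric architecture that learns a support-instance-guided dynamic category classifier, together with a cross-attention module and a multi-scale resizer that improve model performance.
results: Experiments on multiple few-shot object detection platforms show that TIDE significantly outperforms existing contemporary methods. Code is available at https://github.com/deku-0621/TIDE.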

Abstract
Few-shot object detection (FSOD) aims to extract semantic knowledge from limited object instances of novel categories within a target domain. Recent advances in FSOD focus on fine-tuning the base model based on a few objects via meta-learning or data augmentation. Despite their success, the majority of them are grounded with parametric readjustment to generalize on novel objects, which face considerable challenges in Industry 5.0, such as (i) a certain amount of fine-tuning time is required, and (ii) the parameters of the constructed model being unavailable due to the privilege protection, making the fine-tuning fail. Such constraints naturally limit its application in scenarios with real-time configuration requirements or within black-box settings. To tackle the challenges mentioned above, we formalize a novel FSOD task, referred to as Test TIme Few Shot DEtection (TIDE), where the model is un-tuned in the configuration procedure. To that end, we introduce an asymmetric architecture for learning a support-instance-guided dynamic category classifier. Further, a cross-attention module and a multi-scale resizer are provided to enhance the model performance. Experimental results on multiple few-shot object detection platforms reveal that the proposed TIDE significantly outperforms existing contemporary methods. The implementation codes are available at https://github.com/deku-0621/TIDE

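Summary
Industry 5.0 poses considerable challenges for existing few-shot object detection techniques, such as the fine-tuning time they require and the unavailability of model parameters. To address these, we formalize a new few-shot object detection task, Test TIme Few Shot DEtection (TIDE), in which the model is not tuned during configuration. To that end, we introduce an asymmetric architecture for learning a support-instance-guided dynamic category classifier. We further provide a cross-attention module and a multi-scale resizer to improve model performance. Experiments on multiple few-shot object detection platforms show that TIDE significantly outperforms existing contemporary methods. The implementation is available at https://github.com/deku-0621/TIDE.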

DSeg: Direct Line Segments Detection

paper_url: http://arxiv.org/abs/2311.18344
repo_url: None
paper_authors: Berger Cyrille, Lacroix Simon
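for: Detecting line segments in images.
methods: A linear Kalman filter incrementally detects segments on the gradient image, estimating the supporting line parameters and their associated variances.
results: The algorithm is fast and robust to image noise and illumination variations, detects longer line segments than data-driven approaches, and needs no tedious parameter tuning. A pyramidal extension further improves result quality.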

Abstract
This paper presents a model-driven approach to detect image line segments. The approach incrementally detects segments on the gradient image using a linear Kalman filter that estimates the supporting line parameters and their associated variances. The algorithm is fast and robust with respect to image noise and illumination variations. It allows the detection of longer line segments than data-driven approaches and does not require any tedious parameter tuning. An extension of the algorithm that exploits a pyramidal approach to enhance the quality of results is proposed. Results with varying scene illumination and comparisons to classic existing approaches are presented.
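
As a rough illustration of how a linear Kalman filter can estimate line parameters and their variances incrementally, the sketch below updates a slope/intercept estimate one point at a time. The slope/intercept parameterization, the measurement variance, and the prior covariance are assumptions made for this example; the abstract does not say which line parameterization the paper uses.

```python
def kalman_line_update(state, cov, x, y, meas_var=1.0):
    """One Kalman update for the line y = a*x + b, with state (a, b)."""
    a, b = state
    (p00, p01), (p10, p11) = cov
    resid = y - (a * x + b)              # innovation, with H = [x, 1]
    ph0 = p00 * x + p01                  # P H^T, first component
    ph1 = p10 * x + p11                  # P H^T, second component
    s = x * ph0 + ph1 + meas_var         # innovation variance H P H^T + R
    k0, k1 = ph0 / s, ph1 / s            # Kalman gain
    hp0 = x * p00 + p10                  # H P, first component
    hp1 = x * p01 + p11                  # H P, second component
    new_cov = ((p00 - k0 * hp0, p01 - k0 * hp1),
               (p10 - k1 * hp0, p11 - k1 * hp1))
    return (a + k0 * resid, b + k1 * resid), new_cov

state = (0.0, 0.0)                       # uninformative initial guess
cov = ((1e6, 0.0), (0.0, 1e6))           # large prior variance
for x in range(10):                      # noise-free points on y = 2x + 1
    state, cov = kalman_line_update(state, cov, float(x), 2.0 * x + 1.0)
slope, intercept = state
```

The diagonal of `cov` gives the variances of the two parameters, which shrink as points are added. A detector could use the innovation and its variance to decide whether the next pixel still belongs to the segment, although that gating rule is not shown here.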

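Summary
This paper presents a model-driven approach to detecting image line segments. The approach uses a linear Kalman filter to estimate the supporting line parameters and their associated variances. It is fast and robust to image noise and illumination variations, and requires no tedious parameter tuning. The algorithm detects longer line segments than data-driven approaches, and an extension that exploits a pyramidal approach to improve result quality is proposed. Results under varying scene illumination and comparisons with classic approaches are presented.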

Multilevel Saliency-Guided Self-Supervised Learning for Image Anomaly Detection

paper_url: http://arxiv.org/abs/2311.18332
repo_url: None
paper_authors: Jianjian Qin, Chunzhi Gu, Jun Yu, Chao Zhang
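for: This paper proposes CutSwap, a saliency-guided augmentation method inspired by visual attention learning, to improve self-supervised image anomaly detection.
methods: LayerCAM extracts multilevel image features as saliency maps, which are clustered to obtain multiple centroids. On each map, a pixel pair is selected from the cluster with the highest centroid saliency to form a patch pair, and swapping the two patches produces the negative sample.
results: Extensive experiments and ablations on two mainstream anomaly detection benchmark datasets show state-of-the-art performance.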

Abstract
Anomaly detection (AD) is a fundamental task in computer vision. It aims to identify incorrect image data patterns which deviate from the normal ones. Conventional methods generally address AD by preparing augmented negative samples to enforce self-supervised learning. However, these techniques typically do not consider semantics during augmentation, leading to the generation of unrealistic or invalid negative samples. Consequently, the feature extraction network can be hindered from embedding critical features. In this study, inspired by visual attention learning approaches, we propose CutSwap, which leverages saliency guidance to incorporate semantic cues for augmentation. Specifically, we first employ LayerCAM to extract multilevel image features as saliency maps and then perform clustering to obtain multiple centroids. To fully exploit saliency guidance, on each map, we select a pixel pair from the cluster with the highest centroid saliency to form a patch pair. Such a patch pair includes highly similar context information with dense semantic correlations. The resulting negative sample is created by swapping the locations of the patch pair. Compared to prior augmentation methods, CutSwap generates more subtle yet realistic negative samples to facilitate quality feature learning. Extensive experimental and ablative evaluations demonstrate that our method achieves state-of-the-art AD performance on two mainstream AD benchmark datasets.
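
The final augmentation step, swapping the locations of a patch pair, can be sketched as follows. This is a minimal illustration that assumes square, non-overlapping patches and a plain nested-list image; the saliency-guided choice of the two patch locations (LayerCAM maps plus clustering) is not reproduced, and the function name `swap_patches` is hypothetical.

```python
def swap_patches(image, top_left_a, top_left_b, size):
    """Return a copy of `image` with two size x size patches exchanged."""
    out = [row[:] for row in image]      # leave the input untouched
    (ra, ca), (rb, cb) = top_left_a, top_left_b
    for dr in range(size):
        for dc in range(size):
            out[ra + dr][ca + dc] = image[rb + dr][cb + dc]
            out[rb + dr][cb + dc] = image[ra + dr][ca + dc]
    return out

# 4x4 toy "image" whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
negative = swap_patches(image, (0, 0), (2, 2), size=2)
```

Because the swap only moves pixels around, the negative sample keeps the original pixel statistics, which is one way to read the abstract's claim that the samples are subtle yet realistic.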

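Summary
Anomaly detection (AD) is a fundamental task in computer vision. It aims to identify incorrect image data patterns that deviate from normal ones. Conventional methods generally address AD by preparing augmented negative samples to enforce self-supervised learning, but these techniques typically do not consider semantics during augmentation, so the generated negative samples are unrealistic or invalid. This can hinder the feature extraction network from embedding critical features. In this study, inspired by visual attention learning, we propose CutSwap, which uses saliency guidance to incorporate semantic cues into augmentation. Specifically, we first use LayerCAM to extract multilevel image features as saliency maps and then cluster them to obtain multiple centroids. To fully exploit saliency guidance, on each map we select a pixel pair from the cluster with the highest centroid saliency to form a patch pair, which contains highly similar context information with dense semantic correlations. Swapping the locations of the patch pair yields subtler yet realistic negative samples that facilitate feature learning. Our method achieves state-of-the-art AD performance on two mainstream AD benchmark datasets.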

Anisotropic Neural Representation Learning for High-Quality Neural Rendering

paper_url: http://arxiv.org/abs/2311.18311
repo_url: None
paper_authors: Y. Wang, J. Xu, Y. Zeng, Y. Gong
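for: Improving the quality of NeRF view synthesis and reconstruction.
methods: Learnable view-dependent (anisotropic) features, guided by spherical harmonics and parameterized by multilayer perceptrons, improve scene representation and reconstruction.
results: The representation improves rendering quality across various NeRF frameworks and achieves state-of-the-art rendering performance.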

Abstract
Neural radiance fields (NeRFs) have achieved impressive view synthesis results by learning an implicit volumetric representation from multi-view images. To project the implicit representation into an image, NeRF employs volume rendering that approximates the continuous integrals of rays as an accumulation of the colors and densities of the sampled points. Although this approximation enables efficient rendering, it ignores the direction information in point intervals, resulting in ambiguous features and limited reconstruction quality. In this paper, we propose an anisotropic neural representation learning method that utilizes learnable view-dependent features to improve scene representation and reconstruction. We model the volumetric function as spherical harmonic (SH)-guided anisotropic features, parameterized by multilayer perceptrons, facilitating ambiguity elimination while preserving the rendering efficiency. To achieve robust scene reconstruction without anisotropy overfitting, we regularize the energy of the anisotropic features during training. Our method is flexible and can be plugged into NeRF-based frameworks. Extensive experiments show that the proposed representation can boost the rendering quality of various NeRFs and achieve state-of-the-art rendering performance on both synthetic and real-world scenes.
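
The accumulation that the abstract refers to is the standard NeRF quadrature, in which each sample contributes its color weighted by its opacity and by the transmittance of everything in front of it. The sketch below implements that baseline rule for a single ray with scalar colors; it does not include the paper's anisotropic features, and the sample values are made up for illustration.

```python
import math

def render_ray(sigmas, colors, deltas):
    """Accumulate sampled densities and colors along one ray."""
    transmittance = 1.0                  # fraction of light not yet absorbed
    weights = []
    pixel = 0.0
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this interval
        weight = transmittance * alpha
        weights.append(weight)
        pixel += weight * color
        transmittance *= 1.0 - alpha
    return pixel, weights

# Two samples with unit density and unit spacing (illustrative values).
pixel, weights = render_ray([1.0, 1.0], [0.2, 0.8], [1.0, 1.0])
```

Each weight depends only on the density and the interval length, not on how the ray crosses the interval, which is the direction information the abstract says the approximation discards.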

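Summary
NeRFs achieve impressive view synthesis results by learning an implicit volumetric representation from multi-view images. To project the implicit representation into an image, NeRF uses volume rendering, which approximates the continuous integral along each ray as an accumulation of the colors and densities of sampled points. This approximation ignores the direction information within point intervals, which leads to ambiguous features and limited reconstruction quality. In this paper, we propose an anisotropic neural representation learning method that uses learnable view-dependent features to improve scene representation and reconstruction. We model the volumetric function as spherical harmonic (SH)-guided anisotropic features, parameterized by multilayer perceptrons, which eliminates ambiguity while preserving rendering efficiency. To achieve robust scene reconstruction without anisotropy overfitting, we regularize the energy of the anisotropic features during training. Our method can be plugged into NeRF-based frameworks and achieves state-of-the-art rendering performance on both synthetic and real-world scenes.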

Categorical Traffic Transformer: Interpretable and Diverse Behavior Prediction with Tokenized Latent

paper_url: http://arxiv.org/abs/2311.18307
repo_url: None
paper_authors: Yuxiao Chen, Sander Tonkens, Marco Pavone
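for: This paper presents a traffic model for autonomous vehicles (AV) that serves both planning and closed-loop simulation.
methods: The Categorical Traffic Transformer (CTT) outputs both continuous trajectory predictions and tokenized categorical predictions (lane modes, homotopies, etc.). Its most notable feature is a fully interpretable latent space, which allows the latent variable to be supervised directly from the ground truth during training and avoids mode collapse completely.
results: CTT generates diverse behaviors conditioned on different latent modes with semantic meanings while beating the state of the art on prediction accuracy. Because it can take tokens as input and output, it can be integrated with LLMs for common-sense reasoning and zero-shot generalization.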

Abstract
Adept traffic models are critical to both planning and closed-loop simulation for autonomous vehicles (AV), and key design objectives include accuracy, diverse multimodal behaviors, interpretability, and downstream compatibility. Recently, with the advent of large language models (LLMs), an additional desirable feature for traffic models is LLM compatibility. We present Categorical Traffic Transformer (CTT), a traffic model that outputs both continuous trajectory predictions and tokenized categorical predictions (lane modes, homotopies, etc.). The most outstanding feature of CTT is its fully interpretable latent space, which enables direct supervision of the latent variable from the ground truth during training and avoids mode collapse completely. As a result, CTT can generate diverse behaviors conditioned on different latent modes with semantic meanings while beating SOTA on prediction accuracy. In addition, CTT's ability to input and output tokens enables integration with LLMs for common-sense reasoning and zero-shot generalization.

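Summary
Adept traffic models are critical to both planning and closed-loop simulation for autonomous vehicles (AV); key design objectives include accuracy, diverse multimodal behaviors, interpretability, and downstream compatibility. With the advent of large language models (LLMs), LLM compatibility has become an additional desirable feature. We present the Categorical Traffic Transformer (CTT), a traffic model that outputs both continuous trajectory predictions and tokenized categorical predictions (lane modes, homotopies, etc.). CTT's most distinguishing feature is its fully interpretable latent space, which allows direct supervision of the latent variable during training and avoids mode collapse completely. As a result, CTT can generate diverse behaviors conditioned on different latent modes while keeping prediction accuracy high. In addition, CTT's ability to input and output tokens enables integration with LLMs for common-sense reasoning and zero-shot generalization.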

X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation

paper_url: http://arxiv.org/abs/2312.00085
repo_url: https://github.com/xmu-xiaoma666/X-Dreamer
paper_authors: Yiwei Ma, Yijun Fan, Jiayi Ji, Haowei Wang, Xiaoshuai Sun, Guannan Jiang, Annan Shu, Rongrong Ji
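for: High-quality text-to-3D content creation.
methods: Camera-Guided Low-Rank Adaptation (CG-LoRA) injects camera information into the pretrained diffusion model, and the Attention-Mask Alignment (AMA) loss steers the model's attention toward the foreground object.
results: Extensive evaluations show that X-Dreamer is effective compared with existing text-to-3D methods.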

Abstract
In recent times, automatic text-to-3D content creation has made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize the 3D representation to ensure that the rendered image aligns well with the given text, as evaluated by the pretrained 2D diffusion model. Nevertheless, a substantial domain gap exists between 2D images and 3D assets, primarily attributed to variations in camera-related attributes and the exclusive presence of foreground objects. Consequently, employing 2D diffusion models directly for optimizing 3D representations may lead to suboptimal outcomes. To address this issue, we present X-Dreamer, a novel approach for high-quality text-to-3D content creation that effectively bridges the gap between text-to-2D and text-to-3D synthesis. The key components of X-Dreamer are two innovative designs: Camera-Guided Low-Rank Adaptation (CG-LoRA) and Attention-Mask Alignment (AMA) Loss. CG-LoRA dynamically incorporates camera information into the pretrained diffusion models by employing camera-dependent generation for trainable parameters. This integration enhances the alignment between the generated 3D assets and the camera's perspective. AMA loss guides the attention map of the pretrained diffusion model using the binary mask of the 3D object, prioritizing the creation of the foreground object. This module ensures that the model focuses on generating accurate and detailed foreground objects. Extensive evaluations demonstrate the effectiveness of our proposed method compared to existing text-to-3D approaches. Our project webpage: https://xmuxiaoma666.github.io/Projects/X-Dreamer .

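Summary
Automatic text-to-3D content creation has recently made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize the 3D representation so that the rendered image aligns well with the given text, as evaluated by the pretrained 2D diffusion model. However, a substantial domain gap exists between 2D images and 3D assets, mainly due to variations in camera-related attributes and the exclusive presence of foreground objects. Using 2D diffusion models directly to optimize 3D representations may therefore lead to suboptimal results. To address this, we present X-Dreamer, a method for high-quality text-to-3D content creation that effectively bridges the gap between text-to-2D and text-to-3D synthesis. Its key components are two designs: Camera-Guided Low-Rank Adaptation (CG-LoRA) and the Attention-Mask Alignment (AMA) loss. CG-LoRA dynamically incorporates camera information into the pretrained diffusion model by generating the trainable parameters in a camera-dependent way, which improves the alignment between the generated 3D assets and the camera's perspective. The AMA loss guides the attention map of the pretrained diffusion model with the binary mask of the 3D object, so that the model prioritizes generating an accurate and detailed foreground object. Project page: https://xmuxiaoma666.github.io/Projects/X-Dreamer.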

Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?

paper_url: http://arxiv.org/abs/2312.00084
repo_url: None
paper_authors: Zhengyue Zhao, Jinhao Duan, Kaidi Xu, Chenan Wang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu
for: This paper aims to evaluate the effectiveness of using perturbations to protect images in a practical threat model, and to introduce a purification method that removes protective perturbations while preserving the original image structure.
methods: The paper studies a Stable Diffusion model fine-tuned with personalized concepts, where imperceptible adversarial perturbations are added to images to prevent unauthorized exploitation and infringement. The authors also introduce a purification method that removes these perturbations while preserving the original image structure.
results: The results suggest that perturbation-based protection methods may not be sufficient to safeguard image privacy and copyright effectively, and that the purification method can remove protective perturbations while preserving the original image structure. The paper also demonstrates that Stable Diffusion can effectively learn from purified images over all protective methods.

Abstract
Stable Diffusion has established itself as a foundation model in generative AI artistic applications, receiving widespread research and application. Some recent fine-tuning methods have made it feasible for individuals to implant personalized concepts onto the basic Stable Diffusion model with minimal computational costs on small datasets. However, these innovations have also given rise to issues like facial privacy forgery and artistic copyright infringement. In recent studies, researchers have explored the addition of imperceptible adversarial perturbations to images to prevent potential unauthorized exploitation and infringements when personal data is used for fine-tuning Stable Diffusion. Although these studies have demonstrated the ability to protect images, it is essential to consider that these methods may not be entirely applicable in real-world scenarios. In this paper, we systematically evaluate the use of perturbations to protect images within a practical threat model. The results suggest that these approaches may not be sufficient to safeguard image privacy and copyright effectively. Furthermore, we introduce a purification method capable of removing protected perturbations while preserving the original image structure to the greatest extent possible. Experiments reveal that Stable Diffusion can effectively learn from purified images over all protective methods.

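Summary
Stable Diffusion has become a foundation model for generative AI art and is widely researched and applied. Some recent fine-tuning methods let individuals implant personalized concepts into the base Stable Diffusion model at minimal computational cost on small datasets. However, these innovations have also given rise to issues such as facial privacy forgery and artistic copyright infringement. Recent studies have explored adding imperceptible adversarial perturbations to images to prevent unauthorized exploitation and infringement when personal data is used for fine-tuning. Although these studies demonstrate the ability to protect images, the methods may not be entirely applicable in real-world scenarios. In this paper, we systematically evaluate the use of perturbations to protect images within a practical threat model. The results suggest that these approaches may not be sufficient to safeguard image privacy and copyright effectively. Furthermore, we introduce a purification method that removes the protective perturbations while preserving the original image structure as far as possible. Experiments show that Stable Diffusion can effectively learn from purified images across all protective methods.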

BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos

paper_url: http://arxiv.org/abs/2312.00083
repo_url: https://github.com/Pilhyeon/BAM-DETR
paper_authors: Pilhyeon Lee, Hyeran Byun
for: This paper addresses center misalignment in temporal sentence grounding to improve the accuracy of moment predictions.
methods: It introduces a boundary-oriented moment formulation and a dual-pathway decoding process that refines the anchor and the boundaries using global and boundary-focused attention, respectively.
results: Extensive experiments show the advantages of the method, which sets new state-of-the-art results on three benchmarks.

Abstract
Temporal sentence grounding aims to localize moments relevant to a language description. Recently, DETR-like approaches have shown notable progress by decoding the center and length of a target moment from learnable queries. However, they suffer from the issue of center misalignment raised by the inherent ambiguity of moment centers, leading to inaccurate predictions. To remedy this problem, we introduce a novel boundary-oriented moment formulation. In our paradigm, the model no longer needs to find the precise center but instead suffices to predict any anchor point within the interval, from which the onset and offset are directly estimated. Based on this idea, we design a Boundary-Aligned Moment Detection Transformer (BAM-DETR), equipped with a dual-pathway decoding process. Specifically, it refines the anchor and boundaries within parallel pathways using global and boundary-focused attention, respectively. This separate design allows the model to focus on desirable regions, enabling precise refinement of moment predictions. Further, we propose a quality-based ranking method, ensuring that proposals with high localization qualities are prioritized over incomplete ones. Extensive experiments verify the advantages of our methods, where our model records new state-of-the-art results on three benchmarks. Code is at https://github.com/Pilhyeon/BAM-DETR.
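
The difference between the two moment formulations can be shown with a few lines of arithmetic. In this sketch, a center-based prediction must hit the exact midpoint, while an anchor-based prediction can start from any point inside the interval and recover the same onset and offset. The function names and the numbers are hypothetical; the model that predicts these quantities is not shown.

```python
def moment_from_center(center, length):
    """Center-based formulation: (center, length) -> (onset, offset)."""
    return center - length / 2.0, center + length / 2.0

def moment_from_anchor(anchor, dist_to_onset, dist_to_offset):
    """Boundary-oriented formulation: anchor plus two boundary distances."""
    return anchor - dist_to_onset, anchor + dist_to_offset

# Target moment spans 3.0 s to 9.0 s (illustrative values).
from_center = moment_from_center(6.0, 6.0)
from_early_anchor = moment_from_anchor(4.0, 1.0, 5.0)
from_late_anchor = moment_from_anchor(7.0, 4.0, 2.0)
```

All three calls describe the same moment, which illustrates why an anchor anywhere in the interval suffices once the onset and offset are estimated directly.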

Summary
Temporal sentence grounding aims to localize the moments in a video that are relevant to a language description. Recently, DETR-like approaches have made notable progress by decoding the center and length of a target moment from learnable queries. However, they suffer from center misalignment caused by the inherent ambiguity of moment centers, which leads to inaccurate predictions. To remedy this, we introduce a boundary-oriented moment formulation: the model no longer needs to find the precise center and only has to predict any anchor point within the interval, from which the onset and offset are estimated directly. Based on this idea, we design the Boundary-Aligned Moment Detection Transformer (BAM-DETR), equipped with a dual-pathway decoding process that refines the anchor and the boundaries in parallel pathways using global and boundary-focused attention, respectively. This separation lets the model focus on the desirable regions and refine moment predictions precisely. We further propose a quality-based ranking method so that proposals with high localization quality are prioritized over incomplete ones. Extensive experiments verify the advantages of our method, which sets new state-of-the-art results on three benchmarks. Code is at https://github.com/Pilhyeon/BAM-DETR.

OmniMotionGPT: Animal Motion Generation with Limited Data

paper_url: http://arxiv.org/abs/2311.18303
repo_url: None
paper_authors: Zhangsihao Yang, Mingyuan Zhou, Mengyi Shan, Bingbing Wen, Ziwei Xuan, Mitch Hill, Junjie Bai, Guo-Jun Qi, Yalin Wang
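for: This work aims to generate diverse and realistic animal motion sequences from textual descriptions.
methods: A model architecture imitating the Generative Pretraining Transformer (GPT) transfers prior knowledge learned from human motion data to the animal domain. Motion autoencoders for animal and human motions are trained jointly and optimized through similarity scores among the human motion encoding, the animal motion encoding, and the text CLIP embedding.
results: The method generates animal motions with high diversity and fidelity, outperforming human motion generation baselines trained on animal data. The paper also introduces AnimalML3D, the first text-animal motion dataset, with 1240 animation sequences spanning 36 different animal identities.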

Abstract
Our paper aims to generate diverse and realistic animal motion sequences from textual descriptions, without a large-scale animal text-motion dataset. While the task of text-driven human motion synthesis is already extensively studied and benchmarked, it remains challenging to transfer this success to other skeleton structures with limited data. In this work, we design a model architecture that imitates Generative Pretraining Transformer (GPT), utilizing prior knowledge learned from human data to the animal domain. We jointly train motion autoencoders for both animal and human motions and at the same time optimize through the similarity scores among human motion encoding, animal motion encoding, and text CLIP embedding. Presenting the first solution to this problem, we are able to generate animal motions with high diversity and fidelity, quantitatively and qualitatively outperforming the results of training human motion generation baselines on animal data. Additionally, we introduce AnimalML3D, the first text-animal motion dataset with 1240 animation sequences spanning 36 different animal identities. We hope this dataset would mediate the data scarcity problem in text-driven animal motion generation, providing a new playground for the research community.

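Summary
Our paper aims to generate diverse and realistic animal motion sequences from textual descriptions, without a large-scale animal text-motion dataset. While text-driven human motion synthesis has been extensively studied and benchmarked, transferring this success to other skeleton structures with limited data remains challenging. We design a model architecture that imitates the Generative Pretraining Transformer (GPT) and transfers prior knowledge learned from human data to the animal domain. We jointly train motion autoencoders for animal and human motions, and at the same time optimize through similarity scores among the human motion encoding, the animal motion encoding, and the text CLIP embedding. As the first solution to this problem, our method generates animal motions with high diversity and fidelity, quantitatively and qualitatively outperforming human motion generation baselines trained on animal data. We also introduce AnimalML3D, the first text-animal motion dataset, with 1240 animation sequences spanning 36 different animal identities. We hope this dataset will ease the data scarcity problem in text-driven animal motion generation and provide a new playground for the research community.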

Reconstructing the normal and shape at specularities in endoscopy

paper_url: http://arxiv.org/abs/2311.18299
repo_url: None
paper_authors: Karim Makki, Adrien Bartoli
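for: This paper uses specularities in endoscopic images as cues for 3D perception, reconstructing the tissue's normal direction and shape.
methods: At each specularity, the method reconstructs the observed tissue's normal direction (orientation) and shape (curvature) from a single image.
results: Results are shown on simulated and real interventional images.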

Abstract
Specularities are numerous in endoscopic images. They occur as many white small elliptic spots, which are generally ruled out as nuisance in image analysis and computer vision methods. Instead, we propose to use specularities as cues for 3D perception. Specifically, we propose a new method to reconstruct, at each specularity, the observed tissue's normal direction (i.e., its orientation) and shape (i.e., its curvature) from a single image. We show results on simulated and real interventional images.

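Summary
Specularities are numerous in endoscopic images. They appear as many small white elliptic spots and are generally treated as a nuisance in image analysis and computer vision methods. Instead, we propose using specularities as cues for 3D perception. Specifically, we propose a new method that reconstructs, at each specularity, the observed tissue's normal direction (i.e., its orientation) and shape (i.e., its curvature) from a single image. We show results on simulated and real interventional images.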

TLDR: Text Based Last-layer Retraining for Debiasing Image Classifiers

paper_url: http://arxiv.org/abs/2311.18291
repo_url: https://github.com/tmlabonte/last-layer-retraining
paper_authors: Juhyeon Park, Seokhyeon Jeong, Taesup Moon
for: This paper aims to mitigate the spurious correlation of classifiers by using text datasets built with large language models for a general image classifier.
methods: The proposed method, called TLDR, uses generated texts to train the final layer in the embedding space of the arbitrary image classifier, and filters out noisy words to reduce the effort of inspecting each word.
results: TLDR achieves performance comparable to that of LLR methods that also use group-balanced image datasets for retraining, and outperforms other baselines that train the last linear layer without a group-annotated dataset.

Abstract
A classifier may depend on incidental features stemming from a strong correlation between the feature and the classification target in the training dataset. Recently, Last Layer Retraining (LLR) with group-balanced datasets is known to be efficient in mitigating the spurious correlation of classifiers. However, the acquisition of group-balanced datasets is costly, which hinders the applicability of the LLR method. In this work, we propose to perform LLR based on text datasets built with large language models for a general image classifier. We demonstrate that text can be a proxy for its corresponding image beyond the image-text joint embedding space, such as CLIP. Based on this, we use generated texts to train the final layer in the embedding space of the arbitrary image classifier. In addition, we propose a method of filtering the generated words to get rid of noisy, imprecise words, which reduces the effort of inspecting each word. We dub these procedures TLDR (Text-based Last layer retraining for Debiasing image classifieRs) and show that our method achieves performance comparable to that of the LLR methods that also utilize group-balanced image datasets for retraining. Furthermore, TLDR outperforms other baselines that involve training the last linear layer without a group-annotated dataset.

CosAvatar: Consistent and Animatable Portrait Video Tuning with Text Prompt

paper_url: http://arxiv.org/abs/2311.18288
repo_url: None
paper_authors: Haiyao Xiao, Chenglai Zhong, Xuan Gao, Yudong Guo, Juyong Zhang
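for: To improve the quality and usability of text-guided digital portrait editing, producing high-quality portrait edits from text instructions.
methods: Proposes a dynamic NeRF-based 3D portrait representation and alternates between editing the video-frame dataset and updating the underlying 3D portrait until temporal and 3D consistency is reached. Semantic portrait priors are also integrated to make the edits more precise.
results: Extensive experiments show that the method accurately edits portraits according to text instructions and also supports expressive animation driven by a source video.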

Abstract
Recently, text-guided digital portrait editing has attracted increasing attention. However, existing methods still struggle to maintain consistency across time, expression, and view or require specific data prerequisites. To solve these challenging problems, we propose CosAvatar, a high-quality and user-friendly framework for portrait tuning. With only monocular video and text instructions as input, we can produce animatable portraits with both temporal and 3D consistency. Different from methods that directly edit in the 2D domain, we employ a dynamic NeRF-based 3D portrait representation to model both the head and torso. We alternate between editing the video frames' dataset and updating the underlying 3D portrait until the edited frames reach 3D consistency. Additionally, we integrate the semantic portrait priors to enhance the edited results, allowing precise modifications in specified semantic areas. Extensive results demonstrate that our proposed method can not only accurately edit portrait styles or local attributes based on text instructions but also support expressive animation driven by a source video.

Dispersed Structured Light for Hyperspectral 3D Imaging

paper_url: http://arxiv.org/abs/2311.18287
repo_url: None
paper_authors: Suhyun Shin, Seokjun Choi, Felix Heide, Seung-Hwan Baek
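for: To achieve accurate hyperspectral 3D imaging, capturing both depth and spectral information.
methods: Proposes a modified projector-camera system with a sub-millimeter-thick diffraction grating film that disperses structured light by wavelength, together with a dispersive projection image formation model and a per-pixel hyperspectral 3D reconstruction method.
results: Validated on a compact prototype, the method achieves a spectral accuracy of 18.8 nm FWHM and a depth error of 1 mm, outperforming prior work on practical hyperspectral 3D imaging.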

Abstract
Hyperspectral 3D imaging aims to acquire both depth and spectral information of a scene. However, existing methods are either prohibitively expensive and bulky or compromise on spectral and depth accuracy. In this work, we present Dispersed Structured Light (DSL), a cost-effective and compact method for accurate hyperspectral 3D imaging. DSL modifies a traditional projector-camera system by placing a sub-millimeter thick diffraction grating film in front of the projector. The grating disperses structured light based on light wavelength. To utilize the dispersed structured light, we devise a model for dispersive projection image formation and a per-pixel hyperspectral 3D reconstruction method. We validate DSL by instantiating a compact experimental prototype. DSL achieves spectral accuracy of 18.8 nm full-width half-maximum (FWHM) and depth error of 1 mm. We demonstrate that DSL outperforms prior work on practical hyperspectral 3D imaging. DSL promises accurate and practical hyperspectral 3D imaging for diverse application domains, including computer vision and graphics, cultural heritage, geology, and biology.
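
To make the FWHM figure concrete, the sketch below converts a full-width half-maximum into a standard deviation. It assumes a Gaussian spectral response, which the abstract does not state; the function names and the 550 nm center wavelength are illustrative only.

```python
import math

def fwhm_to_sigma(fwhm: float) -> float:
    """Standard deviation of a Gaussian with the given FWHM.

    For a Gaussian, FWHM = 2 * sqrt(2 * ln 2) * sigma.
    """
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

def gaussian(x: float, mu: float, sigma: float) -> float:
    """Unnormalized Gaussian with peak value 1 at x == mu."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Assumed Gaussian response centered at a hypothetical 550 nm.
sigma_nm = fwhm_to_sigma(18.8)
# Half of the FWHM away from the center, the response drops to half its peak.
half_height = gaussian(550.0 + 18.8 / 2.0, 550.0, sigma_nm)
```

Under this assumption, an 18.8 nm FWHM corresponds to a standard deviation of roughly 8 nm.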

SimulFlow: Simultaneously Extracting Feature and Identifying Target for Unsupervised Video Object Segmentation

paper_url: http://arxiv.org/abs/2311.18286
repo_url: None
paper_authors: Lingyi Hong, Wei Zhang, Shuyong Gao, Hong Lu, WenQiang Zhang
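for: To propose an efficient unsupervised video object segmentation method that detects the primary objects in a video sequence without human intervention.
methods: Uses a new SimulFlow model that performs feature extraction and target identification simultaneously, reducing computational cost and improving accuracy. Its SimulFlow Attention mechanism bridges image and motion information, so no extra hand-designed fusion module is needed.
results: SimulFlow achieves state-of-the-art results on several benchmarks, with advantages in both computational cost and parameter count. Specifically, it reaches 87.4% J & F on DAVIS-16 with the highest speed (63.7 FPS on a 3090) and the lowest parameter count (13.7 M).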

Abstract
Unsupervised video object segmentation (UVOS) aims at detecting the primary objects in a given video sequence without any human intervention. Most existing methods rely on two-stream architectures that separately encode the appearance and motion information before fusing them to identify the target and generate object masks. However, this pipeline is computationally expensive and can lead to suboptimal performance due to the difficulty of fusing the two modalities properly. In this paper, we propose a novel UVOS model called SimulFlow that simultaneously performs feature extraction and target identification, enabling efficient and effective unsupervised video object segmentation. Concretely, we design a novel SimulFlow Attention mechanism to bridge the image and motion by utilizing the flexibility of attention operation, where coarse masks predicted from fused features at each stage are used to constrain the attention operation within the mask area and exclude the impact of noise. Because of the bidirectional information flow between visual and optical flow features in SimulFlow Attention, no extra hand-designed fusing module is required and we only adopt a light decoder to obtain the final prediction. We evaluate our method on several benchmark datasets and achieve state-of-the-art results. Our proposed approach not only outperforms existing methods but also addresses the computational complexity and fusion difficulties caused by two-stream architectures. Our models achieve 87.4% J & F on DAVIS-16 with the highest speed (63.7 FPS on a 3090) and the lowest parameters (13.7 M). Our SimulFlow also obtains competitive results on video salient object detection datasets.
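
The abstract reports J & F without defining it. On DAVIS, J conventionally denotes region similarity (the Jaccard index, or intersection over union, between predicted and ground-truth masks) and F denotes contour accuracy. The sketch below covers only the J part on small binary masks; the function name and the toy masks are illustrative, and this is not the benchmark's official evaluation code.

```python
from typing import Sequence

def jaccard(pred: Sequence[Sequence[int]], gt: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else intersection / union

pred_mask = [[1, 1, 0],
             [0, 1, 0]]
gt_mask = [[1, 0, 0],
           [0, 1, 1]]
score = jaccard(pred_mask, gt_mask)  # 2 shared pixels out of 4 covered pixels
```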

Utilizing Radiomic Feature Analysis For Automated MRI Keypoint Detection: Enhancing Graph Applications

paper_url: http://arxiv.org/abs/2311.18281
repo_url: None
paper_authors: Sahar Almahfouz Nasser, Shashwat Pathak, Keshav Singhal, Mohit Meena, Nihar Gupte, Ananya Chinmaya, Prateek Garg, Amit Sethi
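for: To explore applications of graph neural networks (GNNs) in image processing, particularly the conversion of image data into inputs for GNN-based models.
methods: Uses a new keypoint detection approach based on radiomic features in medical images; the detected keypoints then serve as ground truth for a SuperRetina-based keypoint detector (LK-SuperRetina).
results: GNNs improve both the number of good matches and the confidence scores in image matching, and the detected keypoints improve keypoint-guided registration of medical images.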

Abstract
Graph neural networks (GNNs) present a promising alternative to CNNs and transformers in certain image processing applications due to their parameter-efficiency in modeling spatial relationships. Currently, a major area of research involves converting non-graph input data for GNN-based models, notably in scenarios where the data originates from images. One approach involves converting images into nodes by identifying significant keypoints within them. Super-Retina, a semi-supervised technique, has been utilized for detecting keypoints in retinal images. However, its limitations lie in the dependency on a small initial set of ground truth keypoints, which is progressively expanded to detect more keypoints. Having encountered difficulties in detecting consistent initial keypoints in brain images using SIFT and LoFTR, we proposed a new approach: radiomic feature-based keypoint detection. Demonstrating the anatomical significance of the detected keypoints was achieved by showcasing their efficacy in improving registration processes guided by these keypoints. Subsequently, these keypoints were employed as the ground truth for the keypoint detection method (LK-SuperRetina). Furthermore, the study showcases the application of GNNs in image matching, highlighting their superior performance in terms of both the number of good matches and confidence scores. This research sets the stage for expanding GNN applications into various other applications, including but not limited to image classification, segmentation, and registration.
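
The abstract describes turning detected keypoints into graph nodes but does not say how edges are formed. As one common option, the sketch below links each keypoint to its k nearest neighbors. The k-nearest-neighbor rule, the function name, and the toy coordinates are assumptions for illustration, not the authors' construction.

```python
import math
from typing import List, Sequence, Tuple

def knn_graph(points: Sequence[Tuple[float, float]], k: int) -> List[Tuple[int, int]]:
    """Undirected edge list linking each point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance to point i.
        neighbors = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for _, j in neighbors[:k]:
            edges.add((min(i, j), max(i, j)))  # store each edge once
    return sorted(edges)

# Hypothetical keypoint coordinates forming two well-separated pairs.
keypoints = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
graph_edges = knn_graph(keypoints, k=1)
```

The resulting edge list, together with per-keypoint feature vectors, is the kind of input a GNN library would typically expect.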

HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with Context Augmentation and Visual Assistance

paper_url: http://arxiv.org/abs/2311.18273
repo_url: https://github.com/thomas-yin/semeval-2023-task1
paper_authors: Zhuohao Yin, Xin Huang
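for: To propose a multi-modal retrieval framework for the Visual Word Sense Disambiguation (VWSD) task.
methods: Combines pretrained vision-language models with open knowledge bases and datasets in a multi-modal retrieval framework. The system has four main components: (1) Gloss matching: a pretrained bi-encoder matches the context with the proper senses of the target word; (2) Prompting: matched glosses and other textual information, such as synonyms, are merged into a prompt template; (3) Image retrieval: prompts are used as queries to retrieve semantically matching images from large open datasets; (4) Modality fusion: information from the different modalities is fused for prediction.
results: Although the system did not produce the most competitive results at SemEval-2023 Task 1, it still beat nearly half of the teams. The experiments also reveal insights for WSD and multi-modal learning. The code is available on GitHub.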

Abstract
Visual Word Sense Disambiguation (VWSD) is a multi-modal task that aims to select, among a batch of candidate images, the one that best entails the target word's meaning within a limited context. In this paper, we propose a multi-modal retrieval framework that maximally leverages pretrained Vision-Language models, as well as open knowledge bases and datasets. Our system consists of the following key components: (1) Gloss matching: a pretrained bi-encoder model is used to match contexts with proper senses of the target words; (2) Prompting: matched glosses and other textual information, such as synonyms, are incorporated using a prompting template; (3) Image retrieval: semantically matching images are retrieved from large open datasets using prompts as queries; (4) Modality fusion: contextual information from different modalities are fused and used for prediction. Although our system does not produce the most competitive results at SemEval-2023 Task 1, we are still able to beat nearly half of the teams. More importantly, our experiments reveal acute insights for the field of Word Sense Disambiguation (WSD) and multi-modal learning. Our code is available on GitHub.
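
The prompting step (component 2) can be pictured with a small sketch that merges a context phrase, its matched gloss, and optional synonyms into one query string. The template wording, the function name, and the example inputs are hypothetical; the abstract does not give the actual template.

```python
from typing import Sequence

def build_prompt(context: str, gloss: str, synonyms: Sequence[str] = ()) -> str:
    """Merge context, matched gloss, and synonyms into a retrieval query."""
    prompt = f"{context}: {gloss}"
    if synonyms:
        prompt += " (also called " + ", ".join(synonyms) + ")"
    return prompt

# Illustrative inputs only; the gloss would come from the gloss-matching step.
query = build_prompt(
    context="bank erosion",
    gloss="sloping land beside a body of water",
    synonyms=["riverbank"],
)
```

A query built this way would then be passed to the image retrieval step in place of the bare context phrase.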

Beyond Entropy: Style Transfer Guided Single Image Continual Test-Time Adaptation

paper_url: http://arxiv.org/abs/2311.18270
repo_url: None
paper_authors: Younggeol Cho, Youngrae Kim, Dongman Lee
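for: To let models adapt to continually changing real-world environments under limited computational resources.
methods: Proposes BESTTA, a single-image continual test-time adaptation method guided by style transfer, using a simple yet powerful normalization method (BeIN) and style-guided losses.
results: BESTTA adapts effectively to continually changing target environments on semantic segmentation and image classification, using only a single image and two trained parameters, and outperforms existing state-of-the-art methods.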

Abstract
Continual test-time adaptation (cTTA) methods are designed to facilitate the continual adaptation of models to dynamically changing real-world environments where computational resources are limited. Due to this inherent limitation, existing approaches fail to simultaneously achieve accuracy and efficiency. In detail, when using a single image, the instability caused by batch normalization layers and entropy loss significantly destabilizes many existing methods in real-world cTTA scenarios. To overcome these challenges, we present BESTTA, a novel single image continual test-time adaptation method guided by style transfer, which enables stable and efficient adaptation to the target environment by transferring the style of the input image to the source style. To implement the proposed method, we devise BeIN, a simple yet powerful normalization method, along with the style-guided losses. We demonstrate that BESTTA effectively adapts to the continually changing target environment, leveraging only a single image on both semantic segmentation and image classification tasks. Remarkably, despite training only two parameters in a BeIN layer consuming the least memory, BESTTA outperforms existing state-of-the-art methods in terms of performance.

Prompt-Based Exemplar Super-Compression and Regeneration for Class-Incremental Learning

paper_url: http://arxiv.org/abs/2311.18266
repo_url: https://github.com/kerrydrx/escort
paper_authors: Ruxiao Duan, Yaoyao Liu, Jieneng Chen, Adam Kortylewski, Alan Yuille
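for: To propose a new exemplar compression and regeneration method that addresses memory restrictions and catastrophic forgetting in class-incremental learning (CIL).
methods: Compresses images into visual and textual prompts (e.g., edge maps and class tags), stores the prompts instead of the images, and uses a pre-trained ControlNet to regenerate diverse high-resolution exemplars from them.
results: The method improves performance on multiple CIL benchmarks, e.g., 5.0 percentage points above the previous state of the art on the 10-phase Caltech-256 dataset.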

Abstract
Replay-based methods in class-incremental learning (CIL) have attained remarkable success, as replaying the exemplars of old classes can significantly mitigate catastrophic forgetting. Despite their effectiveness, the inherent memory restrictions of CIL result in saving a limited number of exemplars with poor diversity, leading to data imbalance and overfitting issues. In this paper, we introduce a novel exemplar super-compression and regeneration method, ESCORT, which substantially increases the quantity and enhances the diversity of exemplars. Rather than storing past images, we compress images into visual and textual prompts, e.g., edge maps and class tags, and save the prompts instead, reducing the memory usage of each exemplar to 1/24 of the original size. In subsequent learning phases, diverse high-resolution exemplars are generated from the prompts by a pre-trained diffusion model, e.g., ControlNet. To minimize the domain gap between generated exemplars and real images, we propose partial compression and diffusion-based data augmentation, allowing us to utilize an off-the-shelf diffusion model without fine-tuning it on the target dataset. Therefore, the same diffusion model can be downloaded whenever it is needed, incurring no memory consumption. Comprehensive experiments demonstrate that our method significantly improves model performance across multiple CIL benchmarks, e.g., 5.0 percentage points higher than the previous state-of-the-art on 10-phase Caltech-256 dataset.
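
The 1/24 figure translates directly into exemplar capacity. The sketch below works through the arithmetic for a fixed memory budget; the byte sizes are hypothetical, and any overhead beyond the stated ratio is ignored.

```python
from fractions import Fraction
from typing import Tuple

def exemplar_capacity(
    budget_bytes: int,
    image_bytes: int,
    ratio: Fraction = Fraction(1, 24),
) -> Tuple[int, int]:
    """Return (raw images, compressed prompts) that fit in the budget."""
    raw = budget_bytes // image_bytes
    prompt_bytes = image_bytes * ratio  # each prompt costs 1/24 of an image
    compressed = int(budget_bytes // prompt_bytes)
    return raw, compressed

# Hypothetical budget: room for 20 raw images of 150,000 bytes each.
raw_count, prompt_count = exemplar_capacity(20 * 150_000, 150_000)
```

With these assumed sizes, the same budget that holds 20 raw images holds 480 prompts.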

MCI Detection using fMRI time series embeddings of Recurrence plots

paper_url: http://arxiv.org/abs/2311.18265
repo_url: https://github.com/blackpearl006/ISBI-2024
paper_authors: Ninad Aithal, Chakka Sai Pradeep, Neelam Sinha
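for: To study the underlying dynamics of the brain at regions of interest (ROIs) using resting-state fMRI time series, in order to understand structure or the lack thereof.
methods: Time series are visualized as recurrence plots, which autoencoders then condense into low-dimensional feature embeddings.
results: On fMRI data from 100 subjects, the proposed method reaches a peak classification accuracy of 93% and a mean accuracy of 89.3%, illustrating the promise of the approach.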

Abstract
The human brain can be conceptualized as a dynamical system. Utilizing resting state fMRI time series imaging, we can study the underlying dynamics at ear-marked Regions of Interest (ROIs) to understand structure or lack thereof. This differential behavior could be key to understanding the neurodegeneration and also to classify between healthy and Mild Cognitive Impairment (MCI) subjects. In this study, we consider 6 brain networks spanning over 160 ROIs derived from Dosenbach template, where each network consists of 25-30 ROIs. Recurrence plot, extensively used to understand evolution of time series, is employed. Representative time series at each ROI is converted to its corresponding recurrence plot visualization, which is subsequently condensed to low-dimensional feature embeddings through Autoencoders. The performance of the proposed method is shown on fMRI volumes of 100 subjects (balanced data), taken from publicly available ADNI dataset. Results obtained show peak classification accuracy of 93% among the 6 brain networks, mean accuracy of 89.3% thereby illustrating promise in the proposed approach.
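
A recurrence plot marks the pairs of time points at which a series revisits a similar value. The sketch below builds the simplest thresholded form for a scalar series, without time-delay embedding. The threshold `eps`, the function name, and the toy series are assumptions, since the abstract does not specify how the plots are computed.

```python
from typing import List, Sequence

def recurrence_matrix(series: Sequence[float], eps: float) -> List[List[int]]:
    """R[i][j] = 1 when samples i and j are within eps of each other."""
    n = len(series)
    return [
        [1 if abs(series[i] - series[j]) <= eps else 0 for j in range(n)]
        for i in range(n)
    ]

# Toy series alternating between two levels.
toy_series = [0.0, 1.0, 0.1, 1.05]
rp = recurrence_matrix(toy_series, eps=0.2)
```

Rendered as an image, such a matrix is the kind of input that could be passed to an autoencoder; it is always symmetric with ones on the diagonal.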

A Compact Implicit Neural Representation for Efficient Storage of Massive 4D Functional Magnetic Resonance Imaging

paper_url: http://arxiv.org/abs/2312.00082
repo_url: None
paper_authors: Ruoran Li, Runzhao Yang, Wenxin Xiang, Yuxiao Cheng, Tingxiong Xiao, Jinli Suo
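for: To propose a new compression method tailored to functional magnetic resonance imaging (fMRI) data, improving compression ratio and quality.
methods: Based on Implicit Neural Representation (INR), the method combines spatial correlation modeling for intra-region dynamics, decomposition of reusable neuronal activation patterns, and proper initialization with nonlinear fusion to describe inter-region similarity.
results: Experiments show that the method compresses fMRI data effectively, surpassing state-of-the-art algorithms on conventional image quality metrics and fMRI downstream tasks. It facilitates sharing massive fMRI data at low bandwidth and high fidelity.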

Abstract
Functional Magnetic Resonance Imaging (fMRI) data is a kind of widely used four-dimensional biomedical data, demanding effective compression but presenting unique challenges for compression due to its intricate temporal dynamics, low signal-to-noise ratio, and complicated underlying redundancies. This paper reports a novel compression paradigm specifically tailored for fMRI data based on Implicit Neural Representation (INR). The proposed approach focuses on removing the various redundancies among the time series, including (i) conducting spatial correlation modeling for intra-region dynamics, (ii) decomposing reusable neuronal activation patterns, and (iii) using proper initialization together with nonlinear fusion to describe the inter-region similarity. The above scheme properly incorporates the unique features of fMRI data, and experimental results on publicly available datasets demonstrate the effectiveness of the proposed method, surpassing state-of-the-art algorithms in both conventional image quality evaluation metrics and fMRI downstream tasks. This work paves the way for sharing massive fMRI data at low bandwidth and high fidelity.

Diffusion Models Without Attention

paper_url: http://arxiv.org/abs/2311.18257
repo_url: None
paper_authors: Jing Nathan Yan, Jiatao Gu, Alexander M. Rush
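for: To advance high-fidelity image generation.
methods: Replaces attention mechanisms with a more scalable state space model backbone.
results: On ImageNet and LSUN, DiffuSSMs are on par with or outperform existing diffusion models with attention modules in FID and Inception Score, while significantly reducing total FLOP usage.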

Abstract
In recent advancements in high-fidelity image generation, Denoising Diffusion Probabilistic Models (DDPMs) have emerged as a key player. However, their application at high resolutions presents significant computational challenges. Current methods, such as patchifying, expedite processes in UNet and Transformer architectures but at the expense of representational capacity. Addressing this, we introduce the Diffusion State Space Model (DiffuSSM), an architecture that supplants attention mechanisms with a more scalable state space model backbone. This approach effectively handles higher resolutions without resorting to global compression, thus preserving detailed image representation throughout the diffusion process. Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward. Comprehensive evaluations on both ImageNet and LSUN datasets at two resolutions demonstrate that DiffuSSMs are on par or even outperform existing diffusion models with attention modules in FID and Inception Score metrics while significantly reducing total FLOP usage.
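
To illustrate why a state space backbone scales well with sequence length, the sketch below runs a generic discrete linear recurrence with a scalar state. It visits each input once, so its cost grows linearly with sequence length, whereas attention compares every pair of positions. This is a textbook recurrence with arbitrary illustrative coefficients `a`, `b`, and `c`, not the DiffuSSM architecture.

```python
from typing import List, Sequence

def ssm_scan(xs: Sequence[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Run h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t."""
    h = 0.0
    ys = []
    for x in xs:  # one state update per input element
        h = a * h + b * x
        ys.append(c * h)
    return ys

# An impulse input shows the state decaying geometrically.
impulse_response = ssm_scan([1.0, 0.0, 0.0], a=0.5)
```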

paper_url: http://arxiv.org/abs/2311.18245
repo_url: None
paper_authors: Long Chen, Liben Chen, Binfeng Xu, Wenxin Zhang, Narges Razavian
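for: To predict the stage of Alzheimer's disease, i.e., Cognitively Normal (CN), Mild Cognitive Impairment (MCI), or Alzheimer's Disease (AD), from two different types of brain MRI scans.
methods: An AlexNet-based deep learning model that learns from the complementary information in T1 and FLAIR MRI scans for automatic detection.
results: Leveraging multi-modal MRI scans is reported to improve the accuracy of automatic detection.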

Abstract
The aging population of the U.S. drives the prevalence of Alzheimer's disease. Brookmeyer et al. forecast that approximately 15 million Americans will have either clinical AD or mild cognitive impairment by 2060. In response to this urgent call, methods for early detection of Alzheimer's disease have been developed for prevention and pre-treatment. Notably, literature on the application of deep learning in the automatic detection of the disease has been proliferating. This study builds upon previous literature and maintains a focus on leveraging multi-modal information to enhance automatic detection. We aim to predict the stage of the disease - Cognitively Normal (CN), Mild Cognitive Impairment (MCI), and Alzheimer's Disease (AD), based on two different types of brain MRI scans. We design an AlexNet-based deep learning model that learns the synergy of complementary information from both T1 and FLAIR MRI scans.

DKiS: Decay weight invertible image steganography with private key

paper_url: http://arxiv.org/abs/2311.18243
repo_url: https://github.com/yanghangai/dkis
paper_authors: Hang Yang, Yitian Xu, Xuhua Liu
for: A private key-based image steganography technique.
methods: A novel approach that requires a private key for access, supported by experimental evidence of its effectiveness.
results: Demonstrates real-world applicability and uses the decay weight method to address the transfer of non-essential information in invertible image steganography.

Abstract
Image steganography, the practice of concealing information within another image, traditionally faces security challenges when its methods become publicly known. To counteract this, we introduce a novel private key-based image steganography technique. This approach ensures the security of hidden information, requiring a corresponding private key for access, irrespective of the public knowledge of the steganography method. We present experimental evidence demonstrating our method's effectiveness, showcasing its real-world applicability. Additionally, we identified a critical challenge in the invertible image steganography process: the transfer of non-essential, or `garbage', information from the secret to the host pipeline. To address this, we introduced the decay weight to control the information transfer, filtering out irrelevant data and enhancing the performance of image steganography. Our code is publicly accessible at https://github.com/yanghangAI/DKiS, and a practical demonstration is available at http://yanghang.site/hidekey.

Label-efficient Training of Small Task-specific Models by Leveraging Vision Foundation Models

paper_url: http://arxiv.org/abs/2311.18237
repo_url: None
paper_authors: Raviteja Vemulapalli, Hadi Pouransari, Fartash Faghri, Sachin Mehta, Mehrdad Farajtabar, Mohammad Rastegari, Oncel Tuzel
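for: How can a pretrained large Vision Foundation Model (VFM) be used to train a small task-specific model for a downstream task with limited labeled data?
methods: Proposes a simple and highly effective task-oriented knowledge transfer approach that leverages pretrained VFMs to train small task-specific models.
results: The approach outperforms task-agnostic VFM distillation, web-scale CLIP pretraining, and supervised ImageNet pretraining by 1-10.5%, 2-22%, and 2-14%, respectively. The dataset used for knowledge transfer significantly affects target task performance, and an image retrieval-based approach is proposed for curating effective transfer sets.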

Abstract
Large Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high memory and compute requirements, these models cannot be deployed in resource constrained settings. This raises an important question: How can we utilize the knowledge from a large VFM to train a small task-specific model for a new target task with limited labeled training data? In this work, we answer this question by proposing a simple and highly effective task-oriented knowledge transfer approach to leverage pretrained VFMs for effective training of small task-specific models. Our experimental results on four target tasks under limited labeled data settings show that the proposed knowledge transfer approach outperforms task-agnostic VFM distillation, web-scale CLIP pretraining and supervised ImageNet pretraining by 1-10.5%, 2-22% and 2-14%, respectively. We also show that the dataset used for transferring knowledge has a significant effect on the final target task performance, and propose an image retrieval-based approach for curating effective transfer sets.
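
The abstract mentions an image retrieval-based approach for curating transfer sets without giving details. One plausible reading, sketched below, ranks a pool of unlabeled images by cosine similarity to each target-task image in some embedding space and keeps the top k per query. The similarity measure, the function names, and the toy two-dimensional embeddings are assumptions.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def curate_transfer_set(
    queries: Sequence[Sequence[float]],
    pool: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Indices of pool items among the top-k matches of any query."""
    selected = set()
    for q in queries:
        ranked = sorted(
            range(len(pool)), key=lambda i: cosine(q, pool[i]), reverse=True
        )
        selected.update(ranked[:k])
    return sorted(selected)

# Toy embeddings: one target-task query and a pool of four candidates.
target_queries = [[1.0, 0.0]]
candidate_pool = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.2], [-1.0, 0.0]]
transfer_indices = curate_transfer_set(target_queries, candidate_pool, k=2)
```

In practice the embeddings would come from an image encoder; here they are hand-written so the ranking is easy to follow.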

TCP:Textual-based Class-aware Prompt tuning for Visual-Language Model

paper_url: http://arxiv.org/abs/2311.18231
repo_url: https://github.com/htyao89/textual-based_class-aware_prompt_tuning
paper_authors: Hantao Yao, Rui Zhang, Changsheng Xu
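for: To adapt pre-trained visual-language models (VLMs) to various downstream tasks.
methods: Builds on CoOp-based methods, which learn domain-shared or image-conditional textual tokens to generate task-specific textual classifiers.
results: Proposes Textual-based Class-aware Prompt tuning (TCP), which enhances class discriminability. TCP uses Textual Knowledge Embedding (TKE) to map class-level textual knowledge into class-aware textual tokens, producing a dynamic class-aware classifier.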

Abstract
Prompt tuning represents a valuable technique for adapting pre-trained visual-language models (VLM) to various downstream tasks. Recent advancements in CoOp-based methods propose a set of learnable domain-shared or image-conditional textual tokens to facilitate the generation of task-specific textual classifiers. However, those textual tokens have a limited generalization ability regarding unseen domains, as they cannot dynamically adjust to the distribution of testing classes. To tackle this issue, we present a novel Textual-based Class-aware Prompt tuning(TCP) that explicitly incorporates prior knowledge about classes to enhance their discriminability. The critical concept of TCP involves leveraging Textual Knowledge Embedding (TKE) to map the high generalizability of class-level textual knowledge into class-aware textual tokens. By seamlessly integrating these class-aware prompts into the Text Encoder, a dynamic class-aware classifier is generated to enhance discriminability for unseen domains. During inference, TKE dynamically generates class-aware prompts related to the unseen classes. Comprehensive evaluations demonstrate that TKE serves as a plug-and-play module effortlessly combinable with existing methods. Furthermore, TCP consistently achieves superior performance while demanding less training time.

FS-BAND: A Frequency-Sensitive Banding Detector

paper_url: http://arxiv.org/abs/2311.18216
repo_url: None
paper_authors: Zijian Chen, Wei Sun, Zicheng Zhang, Ru Huang, Fangfang Lu, Xiongkuo Min, Guangtao Zhai, Wenjun Zhang
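for: To study banding artifacts that arise in compression and similar scenarios and degrade users' quality of experience (QoE).
methods: Proposes a frequency-based no-reference banding detection model, the Frequency-Sensitive BANding Detector (FS-BAND), which generates a pixel-wise banding map with a perception-correlated quality score.
results: FS-BAND achieves higher accuracy than existing image quality assessment (IQA) approaches on the banding classification task.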

Abstract
Banding artifact, also known as staircase-like contour, is a common quality annoyance that happens in compression, transmission, etc. scenarios, which largely affects the user's quality of experience (QoE). The banding distortion typically appears as relatively small pixel-wise variations in smooth backgrounds, which is difficult to analyze in the spatial domain but easily reflected in the frequency domain. In this paper, we thereby study the banding artifact from the frequency aspect and propose a no-reference banding detection model to capture and evaluate banding artifacts, called the Frequency-Sensitive BANding Detector (FS-BAND). The proposed detector is able to generate a pixel-wise banding map with a perception correlated quality score. Experimental results show that the proposed FS-BAND method outperforms state-of-the-art image quality assessment (IQA) approaches with higher accuracy in the banding classification task.

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

paper_url: http://arxiv.org/abs/2312.00081
repo_url: https://github.com/wjpoom/spec
paper_authors: Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu
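for: Evaluating how well vision-language models (VLMs) handle fine-grained vision-language understanding.
methods: The paper proposes a progressive data pipeline that synthesizes images varying in one specific attribute, and designs a benchmark, SPEC, to diagnose VLM comprehension.
results: Four leading VLMs perform poorly on SPEC, close to random guessing. The proposed simple yet effective optimization approach yields significant improvements on SPEC and consistent gains on two other fine-grained benchmarks.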

Abstract
Vision language models (VLMs) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs in finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guess, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising the zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach.

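Summary
Vision-language models (VLMs) perform well on many downstream tasks, but understanding fine-grained visual-linguistic concepts, such as attributes and relationships between objects, remains a significant challenge. Several benchmarks aim to evaluate VLMs at finer granularity, but they focus mainly on the linguistic side and neglect the visual dimension. We argue that VLMs should be evaluated from both the textual and the visual perspective. We propose a progressive pipeline that synthesizes images varying in one specific attribute while keeping all other aspects consistent. Using this data engine, we design a benchmark, SPEC, to diagnose comprehension of object size, position, existence, and count. We then thoroughly evaluate four leading VLMs on SPEC. Their performance is close to random guessing, revealing significant limitations. In response, we propose a simple yet effective approach to optimize VLMs, achieving significant improvements on SPEC without compromising zero-shot performance. Results on two additional fine-grained benchmarks show consistent improvements, further validating the transferability of the approach.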

Perception of Misalignment States for Sky Survey Telescopes with the Digital Twin and the Deep Neural Networks

paper_url: http://arxiv.org/abs/2311.18214
repo_url: None
paper_authors: Miao Zhang, Peng Jia, Zhengyang Li, Wennan Xiang, Jiameng Lv, Rui Sun
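for: The study aims to provide prior information for active optics systems and optical system alignment.
methods: The study proposes a deep neural network that extracts misalignment states from continuously varying point spread functions in different fields of view. A digital twin supplies training data, and a state graph stores misalignment data and is used to explore how misalignment states relate to point spread functions.
results: Once trained, the network estimates misalignment states from observation data, regardless of the effects of atmospheric turbulence, noise, and limited spatial sampling rates in the detector.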

Abstract
Sky survey telescopes play a critical role in modern astronomy, but misalignment of their optical elements can introduce significant variations in point spread functions, leading to reduced data quality. To address this, we need a method to obtain misalignment states, aiding in the reconstruction of accurate point spread functions for data processing methods or facilitating adjustments of optical components for improved image quality. Since sky survey telescopes consist of many optical elements, they result in a vast array of potential misalignment states, some of which are intricately coupled, posing detection challenges. However, by continuously adjusting the misalignment states of optical elements, we can disentangle coupled states. Based on this principle, we propose a deep neural network to extract misalignment states from continuously varying point spread functions in different field of views. To ensure sufficient and diverse training data, we recommend employing a digital twin to obtain data for neural network training. Additionally, we introduce the state graph to store misalignment data and explore complex relationships between misalignment states and corresponding point spread functions, guiding the generation of training data from experiments. Once trained, the neural network estimates misalignment states from observation data, regardless of the impacts caused by atmospheric turbulence, noise, and limited spatial sampling rates in the detector. The method proposed in this paper could be used to provide prior information for the active optics system and the optical system alignment.

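Summary
Sky survey telescopes play a key role in modern astronomy, but misalignment of their optical elements can change the point spread functions and reduce data quality. To address this, we need a method for obtaining misalignment states, either to reconstruct accurate point spread functions for data processing or to guide adjustment of optical components for better image quality. Because sky survey telescopes consist of many optical elements, the space of potential misalignment states is vast, and some states are coupled, which makes detection harder. By continuously adjusting the misalignment states of the optical elements, however, the coupled states can be disentangled. Based on this principle, we propose a deep neural network that extracts misalignment states from continuously varying point spread functions in different fields of view. To ensure sufficient and diverse training data, we recommend using a digital twin to obtain training data. We also introduce a state graph to store misalignment data and explore the complex relationships between misalignment states and the corresponding point spread functions, guiding the generation of training data from experiments. Once trained, the network estimates misalignment states from observation data regardless of atmospheric turbulence, noise, and the limited spatial sampling rate of the detector. The method can provide prior information for active optics systems and optical system alignment.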

SMaRt: Improving GANs with Score Matching Regularity

paper_url: http://arxiv.org/abs/2311.18208
repo_url: None
paper_authors: Mengfei Xia, Yujun Shen, Ceyuan Yang, Ran Yi, Wenping Wang, Yong-jin Liu
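for: The paper aims to improve the ability of generative adversarial networks (GANs) to learn from highly diverse data.
methods: The paper proposes score matching regularity (SMaRt), which improves GAN optimization with score matching to address the difficulty GANs have with highly diverse data.
results: Experiments show that SMaRt improves the synthesis performance of GANs on various datasets. For example, on the ImageNet 64x64 dataset, FID improves from 8.87 to 7.11, on par with a one-step consistency model.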

Abstract
Generative adversarial networks (GANs) usually struggle in learning from highly diverse data, whose underlying manifold is complex. In this work, we revisit the mathematical foundations of GANs, and theoretically reveal that the native adversarial loss for GAN training is insufficient to fix the problem of subsets with positive Lebesgue measure of the generated data manifold lying out of the real data manifold. Instead, we find that score matching serves as a valid solution to this issue thanks to its capability of persistently pushing the generated data points towards the real data manifold. We thereby propose to improve the optimization of GANs with score matching regularity (SMaRt). Regarding the empirical evidences, we first design a toy example to show that training GANs by the aid of a ground-truth score function can help reproduce the real data distribution more accurately, and then confirm that our approach can consistently boost the synthesis performance of various state-of-the-art GANs on real-world datasets with pre-trained diffusion models acting as the approximate score function. For instance, when training Aurora on the ImageNet 64x64 dataset, we manage to improve FID from 8.87 to 7.11, on par with the performance of one-step consistency model. The source code will be made public.
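
The claim that score matching persistently pushes generated points toward the real data can be illustrated with a toy sketch. This is not the authors' code: it assumes the real data follows a 1-D Gaussian N(mu, sigma^2), whose score (the gradient of the log-density) has the closed form (mu - x) / sigma^2. The function names, step size, and step count are hypothetical choices made for illustration.

```python
def gaussian_score(x, mu=0.0, sigma=1.0):
    """Score (gradient of the log-density) of N(mu, sigma^2) at x."""
    return (mu - x) / sigma**2


def push_toward_data(x, steps=50, step_size=0.1, mu=0.0, sigma=1.0):
    """Repeatedly move x a small step along the score direction."""
    for _ in range(steps):
        x = x + step_size * gaussian_score(x, mu, sigma)
    return x


start = 5.0                      # a "generated" point far from the data mean
end = push_toward_data(start)    # the same point after following the score
print(f"distance to the data mean: {abs(start):.3f} -> {abs(end):.3f}")
```

In the paper the score is approximated by a pre-trained diffusion model and used as a regularizer during GAN training; the sketch only shows why following the score moves a sample toward high-density regions.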

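Summary
Generative adversarial networks (GANs) usually struggle with highly diverse data, whose underlying manifold is complex. In this work, we revisit the mathematical foundations of GANs and show theoretically that the native adversarial loss is insufficient for GAN training, because subsets of the generated data manifold with positive Lebesgue measure can lie outside the real data manifold. In contrast, we find that score matching is a valid solution, because it persistently pushes generated data points toward the real data manifold. We therefore propose to improve the optimization of GANs with score matching regularity (SMaRt). Empirically, we first design a toy example showing that training GANs with a ground-truth score function helps reproduce the real data distribution more accurately, and then confirm that the approach consistently boosts the synthesis performance of various state-of-the-art GANs on real-world datasets, with pre-trained diffusion models acting as the approximate score function. For example, when training Aurora on the ImageNet 64x64 dataset, FID improves from 8.87 to 7.11, on par with a one-step consistency model. The source code will be made public.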

Hy-Tracker: A Novel Framework for Enhancing Efficiency and Accuracy of Object Tracking in Hyperspectral Videos

paper_url: http://arxiv.org/abs/2311.18199
repo_url: None
paper_authors: Mohammad Aminul Islam, Wangzhi Xing, Jun Zhou, Yongsheng Gao, Kuldip K. Paliwal
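for: The study brings YOLOv7 to object tracking in hyperspectral videos, exploiting the rich object information in hyperspectral images.
methods: The study proposes a new framework, Hy-Tracker, which combines YOLOv7 with a refined tracking module to improve object-tracking performance.
results: Experiments on hyperspectral benchmark datasets show that Hy-Tracker tracks objects accurately across frames, including under scale variation and occlusion.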

Abstract
Hyperspectral object tracking has recently emerged as a topic of great interest in the remote sensing community. The hyperspectral image, with its many bands, provides a rich source of material information of an object that can be effectively used for object tracking. While most hyperspectral trackers are based on detection-based techniques, no one has yet attempted to employ YOLO for detecting and tracking the object. This is due to the presence of multiple spectral bands, the scarcity of annotated hyperspectral videos, and YOLO's performance limitation in managing occlusions, and distinguishing object in cluttered backgrounds. Therefore, in this paper, we propose a novel framework called Hy-Tracker, which aims to bridge the gap between hyperspectral data and state-of-the-art object detection methods to leverage the strengths of YOLOv7 for object tracking in hyperspectral videos. Hy-Tracker not only introduces YOLOv7 but also innovatively incorporates a refined tracking module on top of YOLOv7. The tracker refines the initial detections produced by YOLOv7, leading to improved object-tracking performance. Furthermore, we incorporate Kalman-Filter into the tracker, which addresses the challenges posed by scale variation and occlusion. The experimental results on hyperspectral benchmark datasets demonstrate the effectiveness of Hy-Tracker in accurately tracking objects across frames.
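
To make the role of the Kalman filter concrete, the following is a minimal sketch of a generic scalar Kalman filter tracking one coordinate of an object, with a predict-only step when the detection is missing (for example during occlusion). It is not Hy-Tracker's tracker: the abstract does not specify the state model, so the random-walk model, the function name `kalman_step`, and the noise values `q` and `r` are assumptions made for illustration.

```python
def kalman_step(x, p, z, q=0.01, r=1.0):
    """One predict/update step of a scalar random-walk Kalman filter.

    x, p: current state estimate and its variance.
    z:    new measurement, or None when the object is not detected.
    q, r: process and measurement noise variances (illustrative values).
    """
    # Predict: the state is carried over and the uncertainty grows by q.
    p = p + q
    if z is None:
        # No detection in this frame (e.g. occlusion): keep the prediction.
        return x, p
    # Update: blend prediction and measurement using the Kalman gain.
    k = p / (p + r)
    x = x + k * (z - x)
    p = (1.0 - k) * p
    return x, p


# Hypothetical x-coordinates of a detected box centre; None marks an occluded frame.
measurements = [10.2, 9.8, None, 10.1, 10.0]
x, p = 0.0, 100.0   # vague initial guess with large variance
for z in measurements:
    x, p = kalman_step(x, p, z)
print(f"estimate: {x:.2f}, variance: {p:.3f}")
```

During the occluded frame the estimate is held and only its variance grows, so the next detection is weighted more heavily when it arrives.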

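Summary
Hyperspectral object tracking has recently attracted wide interest in the remote sensing community. A hyperspectral image, with its many spectral bands, provides rich material information about an object that can be used effectively for tracking. Most hyperspectral trackers are detection-based, yet no one has so far tried to use YOLO to detect and track objects in hyperspectral data. This is due to the multiple spectral bands, the scarcity of annotated hyperspectral videos, and YOLO's limited performance under occlusion and in cluttered backgrounds. In this paper, we therefore propose a new framework called Hy-Tracker, which leverages the strengths of YOLOv7 for object tracking in hyperspectral videos. Hy-Tracker not only introduces YOLOv7 but also adds a refined tracking module on top of it. This module refines the initial detections produced by YOLOv7, improving tracking performance. We also incorporate a Kalman filter into the tracker to address the challenges posed by scale variation and occlusion. Experimental results on hyperspectral benchmark datasets show that Hy-Tracker tracks objects accurately across frames.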

S-T CRF: Spatial-Temporal Conditional Random Field for Human Trajectory Prediction

paper_url: http://arxiv.org/abs/2311.18198
repo_url: None
paper_authors: Pengqian Han, Jiamou Liu, Jialing He, Zeyu Zhang, Song Yang, Yanni Tang, Partha Roop
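for: The study aims to improve the accuracy of pedestrian trajectory prediction, so that autonomous vehicles and robots can plan their motion more reliably.
methods: The study proposes a new model, S-T CRF (Spatial-Temporal Conditional Random Field), which uses intention information together with spatial-temporal information to improve trajectory prediction. The model uses a Conditional Random Field (CRF) to generate a representation of future intentions, which greatly improves the prediction of subsequent trajectories. The study also devises a space CRF loss and a time CRF loss to enhance interaction constraints and temporal dynamics, respectively.
results: Experiments on the ETH/UCY and SDD datasets show that the proposed S-T CRF model surpasses existing baseline methods.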

Abstract
Trajectory prediction is of significant importance in computer vision. Accurate pedestrian trajectory prediction benefits autonomous vehicles and robots in planning their motion. Pedestrians' trajectories are greatly influenced by their intentions. Prior studies, which introduced various deep learning methods, attend only to the spatial and temporal information of trajectories and overlook explicit intention information. In this study, we introduce a novel model, termed the S-T CRF: Spatial-Temporal Conditional Random Field, which judiciously incorporates intention information besides spatial and temporal information of trajectory. This model uses a Conditional Random Field (CRF) to generate a representation of future intentions, greatly improving the prediction of subsequent trajectories when combined with spatial-temporal representation. Furthermore, the study innovatively devises a space CRF loss and a time CRF loss, meticulously designed to enhance interaction constraints and temporal dynamics, respectively. Extensive experimental evaluations on the ETH/UCY and SDD datasets demonstrate that the proposed method surpasses existing baseline approaches.

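Summary
Trajectory prediction is important in computer vision. Accurate pedestrian trajectory prediction helps autonomous vehicles and robots plan their motion. Pedestrian trajectories are strongly influenced by intentions. Earlier studies predict trajectories with various deep learning methods but overlook explicit intention information. This study proposes a new model, S-T CRF (Spatial-Temporal Conditional Random Field), which incorporates intention information into trajectory prediction alongside spatial and temporal information. The model uses a conditional random field (CRF) to generate a representation of future intentions, which greatly improves the prediction of subsequent trajectories. The study also designs a space CRF loss and a time CRF loss to enhance interaction constraints and temporal dynamics, respectively. Extensive experiments on the ETH/UCY and SDD datasets show that the proposed method surpasses existing baseline methods.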

Persistent Test-time Adaptation in Episodic Testing Scenarios

paper_url: http://arxiv.org/abs/2311.18193
repo_url: None
paper_authors: Trung-Hieu Hoang, Duc Minh Vo, Minh N. Do
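for: The study examines whether test-time adaptation (TTA) models accumulate errors after being repeatedly exposed to previously seen testing environments.
methods: We use a testing setting called episodic TTA, and simulate the TTA process on a simple yet representative $\epsilon$-perturbed Gaussian Mixture Model Classifier to derive, theoretically, the factors that drive error accumulation in TTA methods.
results: The study shows that TTA methods gradually degenerate when testing environments recur. To address this, we propose persistent TTA (PeTTA), which senses how far the model has diverged toward collapse and adjusts the adaptation strategy to keep the model stable. Experiments show that PeTTA adapts stably under recurring testing environments.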

Abstract
Current test-time adaptation (TTA) approaches aim to adapt to environments that change continuously. Yet, when the environments not only change but also recur in a correlated manner over time, such as in the case of day-night surveillance cameras, it is unclear whether the adaptability of these methods is sustained after a long run. This study aims to examine the error accumulation of TTA models when they are repeatedly exposed to previous testing environments, proposing a novel testing setting called episodic TTA. To study this phenomenon, we design a simulation of TTA process on a simple yet representative $\epsilon$-perturbed Gaussian Mixture Model Classifier and derive the theoretical findings revealing the dataset- and algorithm-dependent factors that contribute to the gradual degeneration of TTA methods through time. Our investigation has led us to propose a method, named persistent TTA (PeTTA). PeTTA senses the model divergence towards a collapsing and adjusts the adaptation strategy of TTA, striking a balance between two primary objectives: adaptation and preventing model collapse. The stability of PeTTA in the face of episodic TTA scenarios has been demonstrated through a set of comprehensive experiments on various benchmarks.

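Summary
Current test-time adaptation (TTA) methods aim to adapt to continuously changing environments. However, when environments not only change but also recur in a correlated manner over time, as with day-night surveillance cameras, it is unclear whether the adaptability of these methods is sustained in the long run. This study examines the error accumulation of TTA models when they are repeatedly exposed to previous testing environments, and proposes a new testing setting called episodic TTA. To study the phenomenon, we design a simulation of the TTA process on a simple yet representative $\epsilon$-perturbed Gaussian Mixture Model Classifier and derive theoretical findings that reveal the dataset- and algorithm-dependent factors behind the gradual degeneration of TTA methods. This investigation leads us to propose persistent TTA (PeTTA). PeTTA senses the model's divergence toward collapse and adjusts the TTA adaptation strategy, balancing adaptation against the prevention of model collapse. The stability of PeTTA in episodic TTA scenarios is demonstrated through comprehensive experiments on various benchmarks.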

Quantification of cardiac capillarization in single-immunostained myocardial slices using weakly supervised instance segmentation

paper_url: http://arxiv.org/abs/2311.18173
repo_url: None
paper_authors: Zhao Zhang, Xiwen Chen, William Richardson, Bruce Z. Gao, Abolfazl Razi, Tong Ye
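for: The study aims to provide an automated tool for quantifying cardiac capillarization in immunostained myocardial slices, removing the need for manual image analysis.
methods: The tool, AutoQC, uses a weakly supervised instance segmentation algorithm that leverages a pre-trained segmentation model via prompt engineering to identify and segment cardiomyocytes and capillaries automatically.
results: AutoQC outperforms YOLOv8-Seg, a state-of-the-art instance segmentation model, in both instance segmentation and capillarization assessment. Training requires only a small dataset with bounding box annotations rather than pixel-wise annotations, which reduces the workload of network training.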

Abstract
Decreased myocardial capillary density has been reported as an important histopathological feature associated with various heart disorders. Quantitative assessment of cardiac capillarization typically involves double immunostaining of cardiomyocytes (CMs) and capillaries in myocardial slices. In contrast, single immunostaining of basement membrane components is a straightforward approach to simultaneously label CMs and capillaries, presenting fewer challenges in background staining. However, subsequent image analysis always requires manual work in identifying and segmenting CMs and capillaries. Here, we developed an image analysis tool, AutoQC, to automatically identify and segment CMs and capillaries in immunofluorescence images of collagen type IV, a predominant basement membrane protein within the myocardium. In addition, commonly used capillarization-related measurements can be derived from segmentation masks. AutoQC features a weakly supervised instance segmentation algorithm by leveraging the power of a pre-trained segmentation model via prompt engineering. AutoQC outperformed YOLOv8-Seg, a state-of-the-art instance segmentation model, in both instance segmentation and capillarization assessment. Furthermore, the training of AutoQC required only a small dataset with bounding box annotations instead of pixel-wise annotations, leading to a reduced workload during network training. AutoQC provides an automated solution for quantifying cardiac capillarization in basement-membrane-immunostained myocardial slices, eliminating the need for manual image analysis once it is trained.

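Summary
Decreased myocardial capillary density has been reported as an important histopathological feature of various heart disorders. Quantifying cardiac capillarization usually involves double immunostaining of cardiomyocytes (CMs) and capillaries in myocardial slices. Single immunostaining of basement membrane components, by contrast, is a simple way to label CMs and capillaries at the same time, with fewer background staining problems. However, the subsequent image analysis always requires manual work to identify and segment CMs and capillaries. Here, we developed an image analysis tool, AutoQC, that automatically identifies and segments CMs and capillaries in immunofluorescence images. Commonly used capillarization-related measurements can also be derived from the segmentation masks. AutoQC uses a weakly supervised instance segmentation algorithm that leverages a pre-trained segmentation model via prompt engineering. AutoQC outperforms YOLOv8-Seg, a state-of-the-art instance segmentation model, in both instance segmentation and capillarization assessment. In addition, training AutoQC requires only a small dataset with bounding box annotations instead of pixel-wise annotations, which reduces the workload of network training. Once trained, AutoQC provides an automated way to quantify cardiac capillarization and removes the need for manual image analysis.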

Few-shot Image Generation via Style Adaptation and Content Preservation

paper_url: http://arxiv.org/abs/2311.18169
repo_url: None
paper_authors: Xiaosheng He, Fan Yang, Fayao Liu, Guosheng Lin
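for: The paper addresses the challenge of training a generative model (GAN) with limited data (e.g., 10).
methods: Fine-tuning a pre-trained GAN adapts the style but easily overfits and fails to preserve content. The paper proposes a paired image reconstruction approach for content preservation, introducing an image translation module into GAN transfer.
results: Experiments show that the method consistently surpasses state-of-the-art methods in the few-shot setting.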

Abstract
Training a generative model with limited data (e.g., 10) is a very challenging task. Many works propose to fine-tune a pre-trained GAN model. However, this can easily result in overfitting. In other words, they manage to adapt the style but fail to preserve the content, where style denotes the specific properties that define a domain while content denotes the domain-irrelevant information that represents diversity. Recent works try to maintain a pre-defined correspondence to preserve the content; however, the diversity is still not enough and it may affect style adaptation. In this work, we propose a paired image reconstruction approach for content preservation. We propose to introduce an image translation module to GAN transferring, where the module teaches the generator to separate style and content, and the generator provides training data to the translation module in return. Qualitative and quantitative experiments show that our method consistently surpasses the state-of-the-art methods in the few-shot setting.

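Summary
Training a generative model with limited data (e.g., 10) is a very challenging task. Many works propose fine-tuning a pre-trained GAN model, but this easily leads to overfitting. That is, they adapt the style but fail to preserve the content, where "style" refers to the properties specific to a domain and "content" refers to the domain-irrelevant information that represents diversity. Recent works try to maintain a pre-defined correspondence to preserve content, but the diversity is still insufficient and style adaptation may suffer. In this work, we propose a paired image reconstruction approach for content preservation. We introduce an image translation module into GAN transfer: the module teaches the generator to separate style and content, and the generator in return provides training data to the translation module. Experiments show that the method consistently surpasses current methods in the few-shot setting.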

Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

paper_url: http://arxiv.org/abs/2311.18168
repo_url: None
paper_authors: Karren D. Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, Oncel Tuzel
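for: The study aims to generate 3D facial animation from speech signals. Existing methods are mainly deterministic: they learn a one-to-one mapping from speech to 3D face meshes and can achieve high-quality lip articulation, but they cannot capture the full and diverse distribution of 3D facial motion in the real world.
methods: We propose a probabilistic model that generates diverse 3D facial motion while remaining faithful to the speech conditioning signal. We first propose large-scale benchmark datasets and evaluation metrics, and then demonstrate a model that outperforms other methods on them.
results: The model achieves the best performance on the proposed benchmarks and can generate diverse 3D facial motion that matches unseen speaker styles. We also show that such probabilistic models can improve the performance of downstream audio-visual models.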

Abstract
We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D facial motions that accompany speech in the real world. Importantly, the relationship between speech and facial motion is one-to-many, containing both inter-speaker and intra-speaker variations and necessitating a probabilistic approach. In this paper, we identify and address key challenges that have so far limited the development of probabilistic models: lack of datasets and metrics that are suitable for training and evaluating them, as well as the difficulty of designing a model that generates diverse results while remaining faithful to a strong conditioning signal as speech. We first propose large-scale benchmark datasets and metrics suitable for probabilistic modeling. Then, we demonstrate a probabilistic model that achieves both diversity and fidelity to speech, outperforming other methods across the proposed benchmarks. Finally, we showcase useful applications of probabilistic models trained on these large-scale datasets: we can generate diverse speech-driven 3D facial motion that matches unseen speaker styles extracted from reference clips; and our synthetic meshes can be used to improve the performance of downstream audio-visual models.

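Summary
We consider the task of animating 3D facial geometry from speech signals. Existing methods are mainly deterministic and focus on learning a one-to-one mapping from speech to 3D face meshes on small datasets with few speakers. They achieve high-quality lip articulation for speakers in the training set but cannot capture the full and diverse distribution of 3D facial motion in the real world. Importantly, the relationship between speech and facial motion is one-to-many, with both inter-speaker and intra-speaker variation, which calls for a probabilistic approach. In this paper, we identify and address the key challenges: the lack of datasets and metrics suitable for training and evaluating probabilistic models, and the difficulty of designing a model that generates diverse results while remaining faithful to a strong conditioning signal such as speech. We first propose large-scale benchmark datasets and metrics, and then demonstrate a probabilistic model that achieves both diversity and fidelity to speech on these benchmarks. Finally, we showcase useful applications of probabilistic models trained on these large-scale datasets: we can generate diverse speech-driven 3D facial motion that matches unseen speaker styles, and the synthetic meshes can improve the performance of downstream audio-visual models.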

A-Scan2BIM: Assistive Scan to Building Information Modeling

paper_url: http://arxiv.org/abs/2311.18166
repo_url: https://github.com/weiliansong/A-Scan2BIM
paper_authors: Weilian Song, Jieliang Luo, Dale Zhao, Yan Fu, Chin-Yi Cheng, Yasutaka Furukawa
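for: The paper aims to help architects cut the many hours of manual work needed to convert scan data into Building Information Modeling (BIM) models, assisting architects rather than replacing them.
methods: The paper proposes an assistive system that takes the raw sensor data and the edit history (including the current BIM model) and auto-regressively predicts a sequence of model editing operations as APIs of Autodesk Revit.
results: The paper introduces a building-scale Scan2BIM dataset covering 16 scenes and over 35,000 m^2, and a new metric that measures how natural the order of reconstructed operations is. A simple modification to the reconstruction module improves performance, and the method is far superior to two baselines on the order metric.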

Abstract
This paper proposes an assistive system for architects that converts a large-scale point cloud into a standardized digital representation of a building for Building Information Modeling (BIM) applications. The process is known as Scan-to-BIM, which requires many hours of manual work even for a single building floor by a professional architect. Given its challenging nature, the paper focuses on helping architects on the Scan-to-BIM process, instead of replacing them. Concretely, we propose an assistive Scan-to-BIM system that takes the raw sensor data and edit history (including the current BIM model), then auto-regressively predicts a sequence of model editing operations as APIs of a professional BIM software (i.e., Autodesk Revit). The paper also presents the first building-scale Scan2BIM dataset that contains a sequence of model editing operations as the APIs of Autodesk Revit. The dataset contains 89 hours of Scan2BIM modeling processes by professional architects over 16 scenes, spanning over 35,000 m^2. We report our system's reconstruction quality with standard metrics, and we introduce a novel metric that measures how natural the order of reconstructed operations is. A simple modification to the reconstruction module helps improve performance, and our method is far superior to two other baselines in the order metric. We will release data, code, and models at a-scan2bim.github.io.

Summary
The paper proposes an assistive Scan-to-BIM system for architects that takes the raw sensor data and the edit history and auto-regressively predicts a sequence of model editing operations. It also presents the first building-scale Scan2BIM dataset, containing a sequence of model editing operations as Autodesk Revit APIs. The dataset includes 89 hours of Scan2BIM modeling processes by professional architects over 16 scenes, covering over 35,000 square meters. The system's reconstruction quality is evaluated using standard metrics, and a novel metric is introduced to measure the naturalness of the order of reconstructed operations. A simple modification to the reconstruction module improves performance, and the proposed method is significantly better than two baselines in the order metric. The data, code, and models will be released at a-scan2bim.github.io.

Compact3D: Compressing Gaussian Splat Radiance Field Models with Vector Quantization

paper_url: http://arxiv.org/abs/2311.18159
repo_url: https://github.com/ucdvision/compact3d
paper_authors: KL Navaneet, Kossar Pourahmadi Meibodi, Soroush Abbasi Koohpayegani, Hamed Pirsiavash
for: 3D Gaussian Splatting is a new method for modeling and rendering 3D radiance fields that achieves faster learning and rendering time compared to SOTA NeRF methods, but with a larger storage demand.
methods: The paper introduces a simple vector quantization method based on the k-means algorithm to quantize the Gaussian parameters, and stores the small codebook along with the index of the code for each Gaussian. The indices are further compressed using a method similar to run-length encoding.
results: The paper shows that the proposed method can reduce the storage cost for the original 3D Gaussian Splatting method by a factor of almost $20\times$ with a very small drop in the quality of rendered images, on standard benchmarks as well as a new benchmark that is an order of magnitude larger than the standard benchmarks.

Abstract
3D Gaussian Splatting is a new method for modeling and rendering 3D radiance fields that achieves much faster learning and rendering time compared to SOTA NeRF methods. However, it comes with a drawback in the much larger storage demand compared to NeRF methods since it needs to store the parameters for several 3D Gaussians. We notice that many Gaussians may share similar parameters, so we introduce a simple vector quantization method based on the k-means algorithm to quantize the Gaussian parameters. Then, we store the small codebook along with the index of the code for each Gaussian. Moreover, we compress the indices further by sorting them and using a method similar to run-length encoding. We do extensive experiments on standard benchmarks as well as a new benchmark which is an order of magnitude larger than the standard benchmarks. We show that our simple yet effective method can reduce the storage cost for the original 3D Gaussian Splatting method by a factor of almost $20\times$ with a very small drop in the quality of rendered images.
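
The pipeline described above (k-means quantization, a small codebook plus one index per Gaussian, then sorting and run-length encoding the indices) can be sketched in a few lines. This is a toy illustration in pure Python, not the released Compact3D code: the 2-D "parameters", the seeding of centroids with the first k points, and the helper names are all assumptions. The sketch also assumes the Gaussians may be reordered together with their indices, which is what makes sorting harmless.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))


def kmeans(points, k, iters=10):
    """Tiny k-means; centroids are seeded with the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: map every point to its nearest centroid.
        assign = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assign


def run_length_encode(sorted_indices):
    """Encode a sorted list of indices as [index, count] pairs."""
    runs = []
    for i in sorted_indices:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return runs


# Hypothetical 2-D parameter vectors for six Gaussians.
params = [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (5.1, 5.0), (0.05, 0.05), (4.9, 5.0)]
codebook, indices = kmeans(params, k=2)
runs = run_length_encode(sorted(indices))
print(codebook, indices, runs)
```

Only the codebook and the run-length pairs need to be stored; each Gaussian is reconstructed by looking up its index in the codebook.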

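Summary
3D Gaussian Splatting is a new method for modeling and rendering 3D radiance fields that learns and renders faster than SOTA NeRF methods. However, it needs more storage because it has to store the parameters of many 3D Gaussians. We observe that many Gaussians may share similar parameters, so we propose a simple vector quantization method based on the k-means algorithm to quantize the Gaussian parameters. We then store the small codebook together with the index of the code for each Gaussian. We further compress the indices by sorting them and applying a method similar to run-length encoding. We run extensive experiments on standard benchmarks and on a new benchmark that is an order of magnitude larger than the standard ones. We show that this simple yet effective method reduces the storage cost of the original 3D Gaussian Splatting method by a factor of almost 20 with a very small drop in rendered image quality.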

HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation

paper_url: http://arxiv.org/abs/2311.18158
repo_url: None
paper_authors: Yifan Zhang, Bryan Hooi
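for: The paper aims to enable efficient one-step text-to-image diffusion, improving generation speed while keeping image quality high.
methods: The paper trains one-step, low-rank adaptors that enhance the under-represented high-frequency abilities of diffusion models.
results: Compared with progressive distillation, HiPA achieves better one-step text-to-image performance (FID-5k on MS-COCO 2017 improves from 37.3 to 23.8), a 28.6x training speed-up (from 108.8 to 3.8 A100 GPU days), and requires only 0.04% of the training parameters (from 7,740 million to 3.3 million). HiPA also performs well on text-guided image editing, inpainting, and super-resolution.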

Abstract
Diffusion models have revolutionized text-to-image generation, but their real-world applications are hampered by the extensive time needed for hundreds of diffusion steps. Although progressive distillation has been proposed to speed up diffusion sampling to 2-8 steps, it still falls short in one-step generation, and necessitates training multiple student models, which is highly parameter-extensive and time-consuming. To overcome these limitations, we introduce High-frequency-Promoting Adaptation (HiPA), a parameter-efficient approach to enable one-step text-to-image diffusion. Grounded in the insight that high-frequency information is essential but highly lacking in one-step diffusion, HiPA focuses on training one-step, low-rank adaptors to specifically enhance the under-represented high-frequency abilities of advanced diffusion models. The learned adaptors empower these diffusion models to generate high-quality images in just a single step. Compared with progressive distillation, HiPA achieves much better performance in one-step text-to-image generation (37.3 $\rightarrow$ 23.8 in FID-5k on MS-COCO 2017) and 28.6x training speed-up (108.8 $\rightarrow$ 3.8 A100 GPU days), requiring only 0.04% training parameters (7,740 million $\rightarrow$ 3.3 million). We also demonstrate HiPA's effectiveness in text-guided image editing, inpainting and super-resolution tasks, where our adapted models consistently deliver high-quality outputs in just one diffusion step. The source code will be released.
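
The efficiency figures quoted in the abstract can be checked with a few lines of arithmetic. The snippet below only recomputes the ratios from the numbers stated above; the variable names are illustrative.

```python
# Figures quoted in the abstract (progressive distillation vs. HiPA).
distill_gpu_days, hipa_gpu_days = 108.8, 3.8      # A100 GPU days
distill_params_m, hipa_params_m = 7740.0, 3.3     # training parameters, in millions

speedup = distill_gpu_days / hipa_gpu_days
param_fraction_pct = 100.0 * hipa_params_m / distill_params_m

print(f"training speed-up: {speedup:.1f}x")
print(f"training parameters: {param_fraction_pct:.2f}% of the baseline")
```

Both ratios round to the values reported in the abstract, 28.6x and 0.04%.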

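Summary
Diffusion models have revolutionized text-to-image generation, but their real-world use is hampered by the time needed for hundreds of diffusion steps. Progressive distillation has been proposed to speed up sampling to 2-8 steps, but it still falls short for one-step generation and requires training multiple student models, which is parameter-heavy and time-consuming. To overcome these limitations, we introduce High-frequency-Promoting Adaptation (HiPA), a parameter-efficient approach to one-step text-to-image diffusion. Based on the insight that high-frequency information is essential but highly lacking in one-step diffusion, HiPA trains one-step, low-rank adaptors to enhance the under-represented high-frequency abilities of advanced diffusion models. The learned adaptors let these models generate high-quality images in a single step. Compared with progressive distillation, HiPA achieves much better one-step text-to-image performance (FID-5k on MS-COCO 2017 from 37.3 to 23.8), a 28.6x training speed-up (108.8 to 3.8 A100 GPU days), and needs only 0.04% of the training parameters (7,740 million to 3.3 million). We also demonstrate HiPA's effectiveness in text-guided image editing, inpainting, and super-resolution, where the adapted models consistently deliver high-quality outputs in one diffusion step. The source code will be released.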

2023-11-30

cs.AI

cs.AI - 2023-11-30

Negotiated Representations to Prevent Forgetting in Machine Learning Applications

paper_url: http://arxiv.org/abs/2312.00237
repo_url: https://github.com/nurikorhan/negotiated-representations-for-continual-learning
paper_authors: Nuri Korhan, Ceren Öner
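for: The study addresses catastrophic forgetting in machine learning, specifically in neural networks.
methods: The method incorporates negotiated representations into the learning process to preserve the network's knowledge across multiple tasks.
results: Experiments were run on several benchmark datasets, including Split MNIST, Split CIFAR10, Split Fashion MNIST, and Split CIFAR100. The results indicate that the method mitigates catastrophic forgetting when a network learns multiple tasks, while balancing knowledge across tasks.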

Abstract
Catastrophic forgetting is a significant challenge in the field of machine learning, particularly in neural networks. When a neural network learns to perform well on a new task, it often forgets its previously acquired knowledge or experiences. This phenomenon occurs because the network adjusts its weights and connections to minimize the loss on the new task, which can inadvertently overwrite or disrupt the representations that were crucial for the previous tasks. As a result, the performance of the network on earlier tasks deteriorates, limiting its ability to learn and adapt to a sequence of tasks. In this paper, we propose a novel method for preventing catastrophic forgetting in machine learning applications, specifically focusing on neural networks. Our approach aims to preserve the knowledge of the network across multiple tasks while still allowing it to learn new information effectively. We demonstrate the effectiveness of our method by conducting experiments on various benchmark datasets, including Split MNIST, Split CIFAR10, Split Fashion MNIST, and Split CIFAR100. These datasets are created by dividing the original datasets into separate, non-overlapping tasks, simulating a continual learning scenario where the model needs to learn multiple tasks sequentially without forgetting the previous ones. Our proposed method tackles the catastrophic forgetting problem by incorporating negotiated representations into the learning process, which allows the model to maintain a balance between retaining past experiences and adapting to new tasks. By evaluating our method on these challenging datasets, we aim to showcase its potential for addressing catastrophic forgetting and improving the performance of neural networks in continual learning settings.
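
The way "Split" benchmarks are built, by dividing a dataset into separate, non-overlapping tasks, can be sketched as follows. This is an illustrative helper, not the authors' data loader: the function name `split_into_tasks` and the choice of two classes per task are assumptions, since the abstract does not state how many classes each task contains.

```python
def split_into_tasks(labels, classes_per_task=2):
    """Group sample indices into tasks whose class sets do not overlap."""
    classes = sorted(set(labels))
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        task_classes = set(classes[start:start + classes_per_task])
        # A sample belongs to the task that owns its class.
        tasks.append([i for i, y in enumerate(labels) if y in task_classes])
    return tasks


# Hypothetical labels for 13 samples drawn from 10 classes.
labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 3, 9]
tasks = split_into_tasks(labels)
for t, sample_ids in enumerate(tasks):
    print(f"task {t}: samples {sample_ids}")
```

A continual learner would then be trained on the tasks one after another and re-evaluated on the earlier ones to measure forgetting.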

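Summary
Catastrophic forgetting is an important challenge in machine learning. When a neural network learns a new task, it often forgets previously learned knowledge and experience. This happens because the network adjusts its weights and connections to minimize the loss on the new task, which can inadvertently overwrite or disrupt earlier representations. As a result, performance on earlier tasks deteriorates, limiting the network's ability to learn and adapt to a sequence of tasks. In this paper, we propose a method for preventing catastrophic forgetting, specifically in neural networks. The method aims to preserve the network's knowledge across multiple tasks while still allowing it to learn new information effectively. We demonstrate its effectiveness through experiments on various benchmark datasets, including Split MNIST, Split CIFAR10, Split Fashion MNIST, and Split CIFAR100. These datasets are created by dividing the original datasets into non-overlapping tasks, simulating a continual learning scenario in which the model must learn multiple tasks one after another without forgetting the previous ones. The proposed method incorporates negotiated representations into the learning process, so that the model can balance retaining past experience with adapting to new tasks. By evaluating the method on these challenging datasets, we aim to show its potential for preventing catastrophic forgetting and improving the performance of neural networks in continual learning.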

Uncertainty in Graph Contrastive Learning with Bayesian Neural Networks

paper_url: http://arxiv.org/abs/2312.00232
repo_url: None
paper_authors: Alexander Möllers, Alexander Immer, Elvin Isufi, Vincent Fortuin
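for: The paper aims to improve uncertainty estimation in semi-supervised node classification with graph contrastive learning, where labeled data are scarce but large unlabeled datasets are available.
methods: The paper uses a variational Bayesian neural network approach to improve both uncertainty estimates and downstream performance. It also proposes a new uncertainty measure for contrastive learning, based on the disagreement in likelihood due to different positive samples.
results: Experiments show that the approach improves both the uncertainty estimates and the downstream performance on semi-supervised node-classification tasks.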

Abstract
Graph contrastive learning has shown great promise when labeled data is scarce, but large unlabeled datasets are available. However, it often does not take uncertainty estimation into account. We show that a variational Bayesian neural network approach can be used to improve not only the uncertainty estimates but also the downstream performance on semi-supervised node-classification tasks. Moreover, we propose a new measure of uncertainty for contrastive learning, that is based on the disagreement in likelihood due to different positive samples.

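Summary
Graph contrastive learning performs well when labeled data is scarce but large unlabeled datasets are available. However, it usually does not take uncertainty estimation into account. We show that a variational Bayesian neural network approach improves not only the uncertainty estimates but also the downstream performance on semi-supervised node-classification tasks. We also propose a new uncertainty measure based on the disagreement in likelihood across different positive samples.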

Unsupervised textile defect detection using convolutional neural networks

paper_url: http://arxiv.org/abs/2312.00224
repo_url: None
paper_authors: Imane Koulali, M. Taner Eskil
for: This paper proposes a novel motif-based approach for unsupervised textile anomaly detection that combines the benefits of traditional convolutional neural networks with those of an unsupervised learning paradigm.
methods: The approach consists of five main steps: preprocessing, automatic pattern period extraction, patch extraction, feature selection, and anomaly detection. It uses a new dynamic and heuristic method for feature selection, which avoids the drawbacks of initializing the number of filters and their weights, and those of the backpropagation mechanism. The network is designed and trained in a dynamic, input-domain-based manner, with no ad-hoc configuration required.
results: The approach yields reliable and competitive results (on recall, precision, accuracy, and F1-measure) compared to state-of-the-art unsupervised approaches, in less time, with efficient training in a single epoch and a lower computational cost. The algorithm is demonstrated on the Patterned Fabrics benchmark dataset.

Abstract
In this study, we propose a novel motif-based approach for unsupervised textile anomaly detection that combines the benefits of traditional convolutional neural networks with those of an unsupervised learning paradigm. It consists of five main steps: preprocessing, automatic pattern period extraction, patch extraction, feature selection and anomaly detection. This proposed approach uses a new dynamic and heuristic method for feature selection which avoids the drawbacks of initializing the number of filters (neurons) and their weights, and those of the backpropagation mechanism such as vanishing gradients, which are common practice in state-of-the-art methods. The design and training of the network are performed in a dynamic and input domain-based manner and, thus, no ad-hoc configurations are required. Before building the model, only the number of layers and the stride are defined. We do not initialize the weights randomly, nor do we define the filter size or number of filters as conventionally done in CNN-based approaches. This reduces the effort and time spent on hyperparameter initialization and fine-tuning. Only one defect-free sample is required for training and no further labeled data is needed. The trained network is then used to detect anomalies on defective fabric samples. We demonstrate the effectiveness of our approach on the Patterned Fabrics benchmark dataset. Our algorithm yields reliable and competitive results (on recall, precision, accuracy and F1-measure) compared to state-of-the-art unsupervised approaches, in less time, with efficient training in a single epoch and a lower computational cost.

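Summary
In this study we propose a new motif-based approach for unsupervised textile anomaly detection that combines the strengths of traditional convolutional neural networks with those of an unsupervised learning paradigm. It comprises five main steps: preprocessing, automatic pattern period extraction, patch extraction, feature selection and anomaly detection. The approach uses a new dynamic, heuristic feature selection method that avoids the drawbacks of initializing the number of filters (neurons) and their weights, as well as backpropagation issues such as vanishing gradients. The network is designed and trained in a dynamic, input-domain-based manner, so no ad-hoc configuration is needed. Before building the model, only the number of layers and the stride are defined. Weights are not randomly initialized, and neither the filter size nor the number of filters is predefined, which reduces the time and effort spent on hyperparameter initialization and tuning. Training requires only one defect-free sample and no further labeled data. The trained network is then used to detect anomalies in defective fabric samples. On the Patterned Fabrics benchmark dataset, the algorithm achieves reliable and competitive results on recall, precision, accuracy and F1-measure compared with state-of-the-art unsupervised approaches, in less time and at lower computational cost.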

Learning active tactile perception through belief-space control

paper_url: http://arxiv.org/abs/2312.00215
repo_url: None
paper_authors: Jean-François Tremblay, David Meger, Francois Hogan, Gregory Dudek
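for: This paper addresses how a robot interacting with unknown objects in an open world can autonomously learn to sense their physical properties.
methods: The paper develops a generative world model that is used to estimate an object's physical parameters with a differentiable Bayesian filtering algorithm and to derive an exploration policy with an information-gathering model predictive controller.
results: On three simulated tasks, the method discovers policies that efficiently gather information about the desired object property, and its feasibility is validated on a real robot system.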

Abstract
Robots operating in an open world will encounter novel objects with unknown physical properties, such as mass, friction, or size. These robots will need to sense these properties through interaction prior to performing downstream tasks with the objects. We propose a method that autonomously learns tactile exploration policies by developing a generative world model that is leveraged to 1) estimate the object's physical parameters using a differentiable Bayesian filtering algorithm and 2) develop an exploration policy using an information-gathering model predictive controller. We evaluate our method on three simulated tasks where the goal is to estimate a desired object property (mass, height or toppling height) through physical interaction. We find that our method is able to discover policies that efficiently gather information about the desired property in an intuitive manner. Finally, we validate our method on a real robot system for the height estimation task, where our method is able to successfully learn and execute an information-gathering policy from scratch.

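Summary
Robots operating in an open world will encounter novel objects with unknown physical properties such as mass, friction or size. These robots need to sense those properties through interaction before performing downstream tasks. We propose a method that autonomously learns tactile exploration policies by developing a generative world model, which is used with a differentiable Bayesian filtering algorithm and an information-gathering model predictive controller. We evaluate the method on three simulated tasks where the goal is to estimate a desired object property (mass, height or toppling height) through physical interaction. We find that the method discovers policies that efficiently gather information about the desired property. Finally, we validate the method on a real robot system, where it successfully learns and executes an information-gathering policy from scratch.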

DREAM: Diffusion Rectification and Estimation-Adaptive Models

paper_url: http://arxiv.org/abs/2312.00210
repo_url: https://github.com/jinxinzhou/DREAM
paper_authors: Jinxin Zhou, Tianyu Ding, Tianyi Chen, Jiachen Jiang, Ilya Zharkov, Zhihui Zhu, Luming Liang
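for: Improving the alignment between training and sampling in diffusion models, so that training becomes faster and more efficient.
methods: The paper proposes a new training framework called DREAM with two components, diffusion rectification and estimation adaptation, which reduces the number of sampling steps needed while maintaining high image quality.
results: On image super-resolution (SR), DREAM trains faster than standard diffusion-based SR methods and reduces the number of sampling steps needed to reach comparable or better image quality.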

Abstract
We present DREAM, a novel training framework representing Diffusion Rectification and Estimation-Adaptive Models, requiring minimal code changes (just three lines) yet significantly enhancing the alignment of training with sampling in diffusion models. DREAM features two components: diffusion rectification, which adjusts training to reflect the sampling process, and estimation adaptation, which balances perception against distortion. When applied to image super-resolution (SR), DREAM adeptly navigates the tradeoff between minimizing distortion and preserving high image quality. Experiments demonstrate DREAM's superiority over standard diffusion-based SR methods, showing a $2$ to $3\times $ faster training convergence and a $10$ to $20\times$ reduction in necessary sampling steps to achieve comparable or superior results. We hope DREAM will inspire a rethinking of diffusion model training paradigms.

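Summary
We present DREAM, a new training framework standing for Diffusion Rectification and Estimation-Adaptive Models, which requires minimal code changes (just three lines) yet significantly improves the alignment of training with sampling. DREAM has two components: diffusion rectification, which adjusts training to reflect the sampling process, and estimation adaptation, which balances perception against distortion. When applied to image super-resolution (SR), DREAM navigates the tradeoff between minimizing distortion and preserving high image quality. Experiments show that, compared with standard diffusion-based SR methods, DREAM converges $2$ to $3\times$ faster in training and reduces the necessary sampling steps by $10$ to $20\times$ while achieving comparable or better results. We hope DREAM will inspire a rethinking of diffusion model training paradigms.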

On the Interplay Between Stepsize Tuning and Progressive Sharpening

paper_url: http://arxiv.org/abs/2312.00209
repo_url: None
paper_authors: Vincent Roulet, Atish Agarwala, Fabian Pedregosa
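for: This study examines how stepsize tuning interacts with the sharpness of the objective during the optimization of deep learning models.
methods: The study empirically examines stepsize tuners, namely the Armijo linesearch and Polyak stepsizes, which adapt the stepsize along the iterations to local quantities such as, implicitly, the sharpness itself.
results: The classical Armijo linesearch may perform poorly in the full- or large-batch regime because it tends to keep increasing the sharpness, whereas Polyak stepsizes generally operate at or slightly beyond the edge of stability and outperform their Armijo and constant-stepsize counterparts. The study also finds that unlocking stepsize tuners requires understanding the joint dynamics of stepsize and sharpness.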

Abstract
Recent empirical work has revealed an intriguing property of deep learning models by which the sharpness (largest eigenvalue of the Hessian) increases throughout optimization until it stabilizes around a critical value at which the optimizer operates at the edge of stability, given a fixed stepsize (Cohen et al, 2022). We investigate empirically how the sharpness evolves when using stepsize-tuners, the Armijo linesearch and Polyak stepsizes, that adapt the stepsize along the iterations to local quantities such as, implicitly, the sharpness itself. We find that the surprisingly poor performance of a classical Armijo linesearch may be well explained by its tendency to ever-increase the sharpness of the objective in the full or large batch regimes. On the other hand, we observe that Polyak stepsizes operate generally at the edge of stability or even slightly beyond, while outperforming its Armijo and constant stepsizes counterparts. We conclude with an analysis that suggests unlocking stepsize tuners requires an understanding of the joint dynamics of the step size and the sharpness.
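
To make the two stepsize tuners concrete, here is a minimal sketch on a toy quadratic, where the sharpness is simply the largest curvature and constant-stepsize gradient descent is stable only below 2/sharpness. The textbook forms of the Polyak stepsize and backtracking Armijo linesearch are assumed; the quadratic, the constants `eta0`, `c`, `beta`, and all function names are illustrative choices, not the paper's experimental setup.

```python
def loss(x, a):
    """Quadratic f(x) = 0.5 * sum(a_i * x_i^2); its minimum value is 0."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad(x, a):
    return [ai * xi for ai, xi in zip(a, x)]


def polyak_stepsize(x, a, f_star=0.0):
    """Polyak stepsize: (f(x) - f*) / ||grad f(x)||^2."""
    g2 = sum(gi * gi for gi in grad(x, a))
    return (loss(x, a) - f_star) / g2 if g2 > 0 else 0.0


def armijo_stepsize(x, a, eta0=1.0, c=1e-4, beta=0.5):
    """Backtracking linesearch: shrink eta until sufficient decrease holds."""
    g = grad(x, a)
    g2 = sum(gi * gi for gi in g)
    fx = loss(x, a)
    eta = eta0
    while loss([xi - eta * gi for xi, gi in zip(x, g)], a) > fx - c * eta * g2:
        eta *= beta
    return eta


def descend(stepsize_fn, x, a, steps=30):
    """Run gradient descent, recording the stepsize chosen at each iteration."""
    etas = []
    for _ in range(steps):
        eta = stepsize_fn(x, a)
        x = [xi - eta * gi for xi, gi in zip(x, grad(x, a))]
        etas.append(eta)
    return x, etas


curvatures = [1.0, 10.0]             # Hessian eigenvalues of the toy quadratic
sharpness = max(curvatures)          # largest eigenvalue of the Hessian
stability_threshold = 2.0 / sharpness
x0 = [1.0, 1.0]
x_polyak, etas_polyak = descend(polyak_stepsize, x0, curvatures)
x_armijo, etas_armijo = descend(armijo_stepsize, x0, curvatures)
print("threshold:", stability_threshold)
print("largest Polyak stepsize:", max(etas_polyak))
print("largest Armijo stepsize:", max(etas_armijo))
```

On a fixed quadratic the sharpness cannot change, so this only shows how each rule picks its stepsize relative to the threshold; the progressive sharpening studied in the paper arises in neural network training.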


An integrated framework for developing and evaluating an automated lecture style assessment system

paper_url: http://arxiv.org/abs/2312.00201
repo_url: None
paper_authors: Eleni Dimitriadou, Andreas Lanitis
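for: This paper aims to develop an automated lecture style assessment system that gives teachers instant feedback on their lecturing style, in order to improve the quality of the student learning experience.
methods: The system uses specific measurable biometric characteristics, such as facial expressions, body activity, speech rate and intonation, hand movement and facial pose, extracted from a video showing the lecturer from the audience's point of view. These features are combined to give teachers a lecture style quality score, both at frame rate and as quality metrics for the whole lecture.
results: Participants found the application novel and useful for providing automated feedback on lecture style. In addition, the system's performance was compared with that of humans on the lecture style evaluation task; the system not only matches human observers but in some cases outperforms them.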

Abstract
The aim of the work presented in this paper is to develop and evaluate an integrated system that provides automated lecture style evaluation, allowing teachers to get instant feedback related to the goodness of their lecturing style. The proposed system aims to promote improvement of lecture quality, that could upgrade the overall student learning experience. The proposed application utilizes specific measurable biometric characteristics, such as facial expressions, body activity, speech rate and intonation, hand movement, and facial pose, extracted from a video showing the lecturer from the audience point of view. Measurable biometric features extracted during a lecture are combined to provide teachers with a score reflecting lecture style quality both at frame rate and by providing lecture quality metrics for the whole lecture. The acceptance of the proposed lecture style evaluation system was evaluated by chief education officers, teachers and students regarding the functionality, usefulness of the application, and possible improvements. The results indicate that participants found the application novel and useful in providing automated feedback regarding lecture quality. Furthermore, the performance evaluation of the proposed system was compared with the performance of humans in the task of lecture style evaluation. Results indicate that the proposed system not only achieves similar performance to human observers, but in some cases, it outperforms them.

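Summary
The aim of this work is to develop an integrated lecture evaluation system that gives teachers automated lecture style evaluation in order to improve the quality of the student learning experience. The system uses specific measurable biometric characteristics of the lecturer in a video, such as facial expressions, body activity, speech rate and intonation, hand movement and facial pose, and analyses the video to provide teachers with a lecture quality assessment. The system scores both individual frames and the lecture as a whole and provides lecture quality metrics. Chief education officers, teachers and students evaluated the system's functionality and usefulness. The results indicate that participants found the system novel and useful for providing automated feedback on lecture quality. The system's performance was also compared with that of human observers; it not only matches human evaluators but sometimes outperforms them.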

Applying Large Language Models and Chain-of-Thought for Automatic Scoring

paper_url: http://arxiv.org/abs/2312.03748
repo_url: None
paper_authors: Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, Xiaoming Zhai
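for: This study investigates the use of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) for automatically scoring student-written responses. It aims to overcome the accessibility, technical complexity and explainability issues that have limited the use of automatic scoring tools.
methods: The study uses six assessment tasks (three binomial and three trinomial) with 1,650 student responses for testing. It employs six prompt engineering strategies, combining zero-shot or few-shot learning with CoT, either alone or together with item stems and scoring rubrics.
results: Few-shot learning (accuracy = .67) outperformed zero-shot learning (accuracy = .60), a 12.6% increase. CoT without item stems and scoring rubrics did not significantly affect scoring accuracy (accuracy = .60). However, CoT prompting combined with contextual item stems and rubrics yielded a 13.44% increase for zero-shot and a 3.7% increase for few-shot. With the PPEAS approach, accuracy was more balanced across proficiency categories, highlighting the importance of domain-specific reasoning in enhancing the effectiveness of LLMs in scoring tasks. GPT-4 also outperformed GPT-3.5 across scoring tasks, with an 8.64% difference. The single-call strategy with GPT-4, particularly with greedy sampling, outperformed the other approaches, including ensemble voting strategies. The study shows the potential of LLMs for automatic scoring and emphasizes that CoT improves accuracy, particularly when used with item stems and scoring rubrics.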

Abstract
This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of automatic assessment tools among researchers and educators. We used a testing dataset comprising six assessment tasks (three binomial and three trinomial) with 1,650 student responses. We employed six prompt engineering strategies, combining zero-shot or few-shot learning with CoT, either alone or alongside item stem and scoring rubrics. Results indicated that few-shot (acc = .67) outperformed zero-shot learning (acc = .60), with a 12.6% increase. CoT, when used without item stem and scoring rubrics, did not significantly affect scoring accuracy (acc = .60). However, CoT prompting paired with contextual item stems and rubrics proved to be a significant contributor to scoring accuracy (13.44% increase for zero-shot; 3.7% increase for few-shot). Using a novel approach PPEAS, we found a more balanced accuracy across different proficiency categories, highlighting the importance of domain-specific reasoning in enhancing the effectiveness of LLMs in scoring tasks. Additionally, we also found that GPT-4 demonstrated superior performance over GPT-3.5 in various scoring tasks, showing an 8.64% difference. The study revealed that the single-call strategy with GPT-4, particularly using greedy sampling, outperformed other approaches, including ensemble voting strategies. This study demonstrates the potential of LLMs in facilitating automatic scoring, emphasizing that CoT enhances accuracy, particularly when used with item stem and scoring rubrics.


HeTriNet: Heterogeneous Graph Triplet Attention Network for Drug-Target-Disease Interaction

paper_url: http://arxiv.org/abs/2312.00189
repo_url: None
paper_authors: Farhan Tanvir, Khaled Mohammed Saifuddin, Tanvir Hossain, Arunkumar Bagavathi, Esra Akbas
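for: This study aims to better model the complex relationships among drugs, protein targets and diseases, in order to better understand drugs' mechanisms of action (MoAs) and support personalized treatment.
methods: The study develops a novel Heterogeneous Graph Triplet Attention Network (\texttt{HeTriNet}), which models the interconnectedness of drugs, protein targets and diseases in a heterogeneous graph.
results: Experiments on real-world datasets show that \texttt{HeTriNet} outperforms several baseline models, indicating that it captures drug-target-disease relationships more effectively.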

Abstract
Modeling the interactions between drugs, targets, and diseases is paramount in drug discovery and has significant implications for precision medicine and personalized treatments. Current approaches frequently consider drug-target or drug-disease interactions individually, ignoring the interdependencies among all three entities. Within human metabolic systems, drugs interact with protein targets in cells, influencing target activities and subsequently impacting biological pathways to promote healthy functions and treat diseases. Moving beyond binary relationships and exploring tighter triple relationships is essential to understanding drugs' mechanism of action (MoAs). Moreover, identifying the heterogeneity of drugs, targets, and diseases, along with their distinct characteristics, is critical to model these complex interactions appropriately. To address these challenges, we effectively model the interconnectedness of all entities in a heterogeneous graph and develop a novel Heterogeneous Graph Triplet Attention Network (\texttt{HeTriNet}). \texttt{HeTriNet} introduces a novel triplet attention mechanism within this heterogeneous graph structure. Beyond pairwise attention as the importance of an entity for the other one, we define triplet attention to model the importance of pairs for entities in the drug-target-disease triplet prediction problem. Experimental results on real-world datasets show that \texttt{HeTriNet} outperforms several baselines, demonstrating its remarkable proficiency in uncovering novel drug-target-disease relationships.

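Summary
Modeling the interactions between drugs, targets and diseases is very important for drug discovery and personalized treatment, and has far-reaching implications for precision medicine. Current approaches usually consider drug-target or drug-disease interactions separately, ignoring the interdependencies among the three. In human metabolic systems, drugs interact with protein targets in cells, influencing target activity and affecting biological pathways to promote healthy functions and treat diseases. Moving from binary relationships to tighter triple relationships is key to understanding drugs' mechanisms of action. In addition, identifying the heterogeneity and distinct characteristics of drugs, targets and diseases is key to modeling these complex relationships. To address these challenges, we model all entities in a heterogeneous graph and develop a novel Heterogeneous Graph Triplet Attention Network (\texttt{HeTriNet}). \texttt{HeTriNet} introduces a new triplet attention mechanism for the drug-target-disease triplet prediction problem. Beyond conventional pairwise attention, we define triplet attention to model the importance of pairs for entities. Experimental results show that \texttt{HeTriNet} outperforms the baselines on real-world data and is remarkably capable of uncovering new drug-target-disease relationships.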

Planning Reliability Assurance Tests for Autonomous Vehicles

paper_url: http://arxiv.org/abs/2312.00186
repo_url: None
paper_authors: Simin Zheng, Lu Lu, Yili Hong, Jian Liu
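for: This paper aims to develop reliability assurance test plans for autonomous vehicles (AVs).
methods: The paper uses statistical methods to plan AV reliability assurance tests and weighs the plans against multiple criteria.
results: Using AV testing data from the California Department of Motor Vehicles, the authors illustrate assurance test plans based on multiple criteria and offer recommendations for practice.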

Abstract
Artificial intelligence (AI) technology has become increasingly prevalent and transforms our everyday life. One important application of AI technology is the development of autonomous vehicles (AV). However, the reliability of an AV needs to be carefully demonstrated via an assurance test so that the product can be used with confidence in the field. To plan for an assurance test, one needs to determine how many AVs need to be tested for how many miles and the standard for passing the test. Existing research has made great efforts in developing reliability demonstration tests in the other fields of applications for product development and assessment. However, statistical methods have not been utilized in AV test planning. This paper aims to fill in this gap by developing statistical methods for planning AV reliability assurance tests based on recurrent events data. We explore the relationship between multiple criteria of interest in the context of planning AV reliability assurance tests. Specifically, we develop two test planning strategies based on homogeneous and non-homogeneous Poisson processes while balancing multiple objectives with the Pareto front approach. We also offer recommendations for practical use. The disengagement events data from the California Department of Motor Vehicles AV testing program is used to illustrate the proposed assurance test planning methods.
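
The planning question above (how many miles, and what passing standard) can be illustrated with the classical demonstration test under a homogeneous Poisson process. The sketch below assumes events occur at a constant rate per mile and that the test passes when at most `max_events` events are observed; it then finds the smallest total mileage for which a vehicle whose true rate equals the target would pass with probability no more than `consumer_risk`. The target rate, risk level and function names are illustrative assumptions, and this single-criterion calculation is not the paper's Pareto-front method.

```python
import math


def poisson_cdf(max_events, mean):
    """P(N <= max_events) for N ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean ** k / math.factorial(k)
               for k in range(max_events + 1))


def required_miles(target_rate, max_events, consumer_risk):
    """Smallest mileage m with P(N <= max_events | rate * m) <= consumer_risk."""
    low, high = 0.0, 1.0 / target_rate
    # Grow the upper bound until the pass probability drops below the risk.
    while poisson_cdf(max_events, target_rate * high) > consumer_risk:
        high *= 2.0
    # The pass probability decreases with mileage, so bisection applies.
    for _ in range(100):
        mid = 0.5 * (low + high)
        if poisson_cdf(max_events, target_rate * mid) > consumer_risk:
            low = mid
        else:
            high = mid
    return high


# Hypothetical target: at most 1 event per 10,000 miles, 5% consumer's risk.
zero_failure_miles = required_miles(1e-4, max_events=0, consumer_risk=0.05)
one_failure_miles = required_miles(1e-4, max_events=1, consumer_risk=0.05)
print(round(zero_failure_miles), round(one_failure_miles))
```

For the zero-event plan the answer has the closed form -ln(consumer_risk) / target_rate; allowing more events requires more miles at the same risk, which is one of the trade-offs a multi-criteria plan has to balance.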

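Summary
Artificial intelligence technology is increasingly prevalent in everyday life, and one important application is the development of autonomous vehicles (AVs). Before an AV can be used in the field, however, its reliability must be demonstrated through assurance testing. Planning such a test requires deciding how many AVs to test, for how many miles, and what the passing standard is. Existing research has made great efforts to develop reliability demonstration tests for product development and assessment in other fields, but statistical methods have not yet been used in AV test planning. This paper tries to fill that gap by developing statistical methods, based on recurrent events data, for planning AV reliability assurance tests. We study the relationship between multiple criteria of interest and propose two test planning strategies based on homogeneous and non-homogeneous Poisson processes, balancing multiple objectives with the Pareto front approach. We also offer recommendations for practical use. Disengagement events data from the California Department of Motor Vehicles AV testing program are used to illustrate the proposed assurance test planning methods.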

RNA-KG: An ontology-based knowledge graph for representing interactions involving RNA molecules

paper_url: http://arxiv.org/abs/2312.00183
repo_url: https://github.com/anacletolab/rna-kg
paper_authors: Emanuele Cavalleri, Alberto Cabri, Mauricio Soto-Gomez, Sara Bonfitto, Paolo Perlasca, Jessica Gliozzo, Tiffany J. Callahan, Justin Reese, Peter N Robinson, Elena Casiraghi, Giorgio Valentini, Marco Mesiti
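for: This paper aims to build a knowledge graph (RNA-KG) that gathers biological knowledge about RNAs, to better support the study of their functional relationships with genes, proteins and chemicals.
methods: The paper identifies, pre-processes and characterizes the data sources, builds a meta-graph, and uses an instance-based semantically abstracted knowledge model to generate RNA-KG.
results: RNA-KG provides a centralized, uniform and semantically consistent representation of the "RNA world" that can be used to study functional relationships among genes, proteins and chemicals and to support the development of new drugs.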

Abstract
The "RNA world" represents a novel frontier for the study of fundamental biological processes and human diseases and is paving the way for the development of new drugs tailored to the patient's biomolecular characteristics. Although scientific data about coding and non-coding RNA molecules are continuously produced and available from public repositories, they are scattered across different databases and a centralized, uniform, and semantically consistent representation of the "RNA world" is still lacking. We propose RNA-KG, a knowledge graph encompassing biological knowledge about RNAs gathered from more than 50 public databases, integrating functional relationships with genes, proteins, and chemicals and ontologically grounded biomedical concepts. To develop RNA-KG, we first identified, pre-processed, and characterized each data source; next, we built a meta-graph that provides an ontological description of the KG by representing all the bio-molecular entities and medical concepts of interest in this domain, as well as the types of interactions connecting them. Finally, we leveraged an instance-based semantically abstracted knowledge model to specify the ontological alignment according to which RNA-KG was generated. RNA-KG can be downloaded in different formats and also queried by a SPARQL endpoint. A thorough topological analysis of the resulting heterogeneous graph provides further insights into the characteristics of the "RNA world". RNA-KG can be both directly explored and visualized, and/or analyzed by applying computational methods to infer bio-medical knowledge from its heterogeneous nodes and edges. The resource can be easily updated with new experimental data, and specific views of the overall KG can be extracted according to the bio-medical problem to be studied.

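Summary
The "RNA world" represents a new frontier for studying fundamental biological processes and human diseases and for developing drugs tailored to a patient's biomolecular characteristics. Although scientific data about coding and non-coding RNA molecules are available from public repositories, they are scattered across many databases, and a centralized, uniform and semantically consistent representation of the "RNA world" is still lacking. We propose RNA-KG, a knowledge graph that gathers biological knowledge about RNAs from more than 50 public databases and integrates functional relationships with genes, proteins and chemicals and ontologically grounded biomedical concepts. To build RNA-KG, we first identified, pre-processed and characterized each data source; we then built a meta-graph that provides an ontological description of the bio-molecular entities and medical concepts in this domain and of the types of interactions connecting them. Finally, we used an instance-based semantically abstracted knowledge model to specify the ontological alignment according to which RNA-KG was generated. RNA-KG can be downloaded in different formats and queried through a SPARQL endpoint. A thorough topological analysis of the resulting heterogeneous graph gives further insight into the characteristics of the "RNA world". RNA-KG can be explored and visualized directly, and/or analyzed with computational methods to infer biomedical knowledge from its heterogeneous nodes and edges. The resource can easily be updated with new experimental data, and specific views of the overall KG can be extracted according to the biomedical problem under study.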

Compression of end-to-end non-autoregressive image-to-speech system for low-resourced devices

paper_url: http://arxiv.org/abs/2312.00174
repo_url: None
paper_authors: Gokul Srinivasagan, Michael Deisher, Munir Georges
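for: Helping people with visual impairments access touchscreen devices such as mobile phones and laptops.
methods: Image-to-speech (ITS) systems can help, but their large model size makes them hard to deploy on low-resource devices.
results: The paper proposes an efficient end-to-end neural architecture that generates audio from tiny segments of display content on low-resource devices. It uses a vision-transformer-based image encoder and knowledge distillation to compress the model from 6.1 million to 2.46 million parameters. Human and automatic evaluations show only a very small drop in performance and a 22% speed-up in inference time.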

Abstract
People with visual impairments have difficulty accessing touchscreen-enabled personal computing devices like mobile phones and laptops. Image-to-speech (ITS) systems can assist them in mitigating this problem, but their huge model size makes them extremely hard to deploy on low-resourced embedded devices. In this paper, we aim to overcome this challenge by developing an efficient end-to-end neural architecture for generating audio from tiny segments of display content on low-resource devices. We introduced a vision transformers-based image encoder and utilized knowledge distillation to compress the model from 6.1 million to 2.46 million parameters. Human and automatic evaluation results show that our approach leads to a very minimal drop in performance and can speed up the inference time by 22%.
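
The abstract does not say which distillation objective is used, so the following is only a sketch of the classic soft-target formulation, in which a small student is trained to match the temperature-softened output distribution of a larger teacher. The logits, the temperature and the function names are illustrative assumptions; the last lines simply work out the parameter reduction implied by the reported model sizes.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs, scaled by temperature^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits for one example.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, -0.5]
print("distillation loss:", distillation_loss(student, teacher))

# Parameter reduction implied by the sizes reported in the abstract.
teacher_params, student_params = 6.1e6, 2.46e6
reduction = 1.0 - student_params / teacher_params
print("parameter reduction:", reduction)
```

In practice this term is usually combined with the ordinary task loss on ground-truth targets; for a speech-generating model the matched quantities may be intermediate features or spectrogram frames rather than class logits.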


Towards Accurate Differential Diagnosis with Large Language Models

paper_url: http://arxiv.org/abs/2312.00164
repo_url: None
paper_authors: Daniel McDuff, Mike Schaekermann, Tao Tu, Anil Palepu, Amy Wang, Jake Garrison, Karan Singhal, Yash Sharma, Shekoofeh Azizi, Kavita Kulkarni, Le Hou, Yong Cheng, Yun Liu, S Sara Mahdavi, Sushant Prakash, Anupam Pathak, Christopher Semturs, Shwetak Patel, Dale R Webster, Ewa Dominowska, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S Corrado, Yossi Matias, Jake Sunshine, Alan Karthikesalingam, Vivek Natarajan
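for: This study evaluates a diagnostic assistant based on a large language model (LLM), intended to help clinicians reach more accurate diagnoses.
methods: The study uses an LLM optimized for diagnostic reasoning and evaluates it on 302 real-world medical cases. Twenty clinicians assessed the cases under different assistive conditions: search engines and standard medical resources, or LLM assistance in addition to these tools.
results: LLM assistance improved clinicians' diagnostic accuracy on challenging cases. Comparing assistive conditions, the DDx quality score was higher for clinicians assisted by the LLM than for clinicians without its assistance (51.7% vs 36.1%, McNemar's test: 45.7, p < 0.01). Clinicians assisted by the LLM also produced more comprehensive differential lists. These results suggest that the LLM for DDx has the potential to improve clinicians' diagnostic reasoning and accuracy in practice.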

Abstract
An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its ability to generate a DDx alone or as an aid to clinicians. 20 clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, [p = 0.04]). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%) (McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.
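
The comparisons above are reported with McNemar's test, which applies to paired binary outcomes (the same cases judged under two conditions) and depends only on the discordant pairs. Below is a minimal sketch of the standard chi-square form with one degree of freedom. The counts `b` and `c` are hypothetical, since the abstract reports only the test statistics, and the function names are illustrative.

```python
import math


def mcnemar_statistic(b, c, continuity=False):
    """McNemar chi-square from discordant counts.

    b: pairs correct under condition A only; c: pairs correct under B only.
    """
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)
    return diff ** 2 / (b + c)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical discordant counts, for illustration only.
stat = mcnemar_statistic(b=30, c=10)
print("statistic:", stat, "p-value:", chi2_sf_1df(stat))

# The statistic 4.75 quoted in the abstract corresponds to p of about 0.03.
print("p for 4.75:", round(chi2_sf_1df(4.75), 2))
```

The p-value recovered for a statistic of 4.75 agrees with the p = 0.03 quoted in the abstract, which suggests the reported values are chi-square statistics of this form.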

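Summary
An accurate differential diagnosis (DDx) is a cornerstone of medical care, usually reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by large language models (LLMs) offer new opportunities to assist and automate parts of this process. In this study we introduce an LLM optimized for diagnostic reasoning and evaluate its ability to generate a DDx for challenging real-world medical cases. Twenty clinicians evaluated 302 real-world medical cases drawn from New England Journal of Medicine (NEJM) case reports. Each case was read by two clinicians, who were randomized to one of two assistive conditions: search engines and standard medical resources, or our LLM in addition to these tools. All clinicians first provided a baseline, unassisted DDx. Our LLM for DDx exceeded unassisted clinicians in standalone performance (top-10 accuracy 59.1% vs 33.6%, p = 0.04). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) than for clinicians without its assistance (36.1%) (McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has the potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, and merits further real-world evaluation of its ability to empower physicians and widen patients' access to specialist-level expertise.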

paper_url: http://arxiv.org/abs/2312.00151
repo_url: None
paper_authors: Meera Hahn, Amit Raj, James M. Rehg
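for: This study examines the Vision-and-Language Navigation (VLN) task, in which an embodied agent must follow natural language instructions to reach a goal location or object (e.g. "walk down the hallway and turn left at the piano").
methods: The authors use a series of simple masking experiments to inspect how much the navigation model relies on different parts of the instruction.
results: The study finds that certain top-performing models base their decisions only on the noun tokens of the instruction, which is a concerning limitation. The authors propose two training methods to alleviate it.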

Abstract
The challenging task of Vision-and-Language Navigation (VLN) requires embodied agents to follow natural language instructions to reach a goal location or object (e.g. 'walk down the hallway and turn left at the piano'). For agents to complete this task successfully, they must be able to ground objects referenced in the instruction (e.g. 'piano') into the visual scene as well as ground directional phrases (e.g. 'turn left') into actions. In this work we ask the following question -- to what degree are spatial and directional language cues informing the navigation model's decisions? We propose a series of simple masking experiments to inspect the model's reliance on different parts of the instruction. Surprisingly, we uncover that certain top performing models rely only on the noun tokens of the instructions. We propose two training methods to alleviate this concerning limitation.
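
The abstract does not spell out how the masking experiments are implemented. A minimal sketch of the general idea, under the assumption that tokens of one category are replaced by a mask symbol before the instruction is given to the navigation model, could look as follows. The hand-written word sets stand in for a part-of-speech tagger, and `mask_tokens`, `NOUNS`, `DIRECTIONS` and the `[MASK]` symbol are all hypothetical names.

```python
def mask_tokens(instruction, targets, mask="[MASK]"):
    """Replace every token whose lowercase, punctuation-stripped form is in targets."""
    masked = []
    for token in instruction.split():
        core = token.strip(".,!?'`\"").lower()
        masked.append(mask if core in targets else token)
    return " ".join(masked)


# Hand-written word sets for the example instruction (hypothetical).
NOUNS = {"hallway", "piano"}
DIRECTIONS = {"left", "right", "up", "down"}

instruction = "walk down the hallway and turn left at the piano"
print(mask_tokens(instruction, NOUNS))
print(mask_tokens(instruction, DIRECTIONS))
```

Comparing navigation success on the original, noun-masked and direction-masked instructions would then indicate which part of the instruction a model actually depends on.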

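Summary
The Vision-and-Language Navigation (VLN) task requires embodied agents to follow natural language instructions to reach a specific place or object (e.g. "walk down the hallway and turn left at the piano"). To complete this task successfully, agents must be able to connect the objects mentioned in the instruction (e.g. "piano") to the visual scene and translate directional phrases (e.g. "turn left") into actions. In this work we ask to what degree the navigation model is informed by spatial and directional language cues. We propose a series of simple masking experiments to inspect the model's reliance on different parts of the instruction. Surprisingly, certain top-performing models rely only on the noun tokens of the instructions. We propose two training methods to address this concerning limitation.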

The Stochastic Dynamic Post-Disaster Inventory Allocation Problem with Trucks and UAVs

paper_url: http://arxiv.org/abs/2312.00140
repo_url: None
paper_authors: Robert van Steenbergen, Wouter van Heeswijk, Martijn Mes
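for: This paper studies the allocation of scarce relief supplies in disaster areas by humanitarian logistics operations, with particular attention to the social impact of deliveries.
methods: The paper proposes two anticipatory solution methods based on approximate dynamic programming, namely decomposed linear value function approximation and neural network value function approximation, to handle the uncertainties of disaster areas effectively.
results: Experiments show that considering deprivation costs improves the allocation of scarce supplies, and that using UAVs reduces transportation and deprivation costs while maintaining similar levels of demand coverage.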

Abstract
Humanitarian logistics operations face increasing difficulties due to rising demands for aid in disaster areas. This paper investigates the dynamic allocation of scarce relief supplies across multiple affected districts over time. It introduces a novel stochastic dynamic post-disaster inventory allocation problem with trucks and unmanned aerial vehicles delivering relief goods under uncertain supply and demand. The relevance of this humanitarian logistics problem lies in the importance of considering the inter-temporal social impact of deliveries. We achieve this by incorporating deprivation costs when allocating scarce supplies. Furthermore, we consider the inherent uncertainties of disaster areas and the potential use of cargo UAVs to enhance operational efficiency. This study proposes two anticipatory solution methods based on approximate dynamic programming, specifically decomposed linear value function approximation and neural network value function approximation to effectively manage uncertainties in the dynamic allocation process. We compare DL-VFA and NN-VFA with various state-of-the-art methods (exact re-optimization, PPO) and results show a 6-8% improvement compared to the best benchmarks. NN-VFA provides the best performance and captures nonlinearities in the problem, whereas DL-VFA shows excellent scalability against a minor performance loss. The experiments reveal that consideration of deprivation costs results in improved allocation of scarce supplies both across affected districts and over time. Finally, results show that deploying UAVs can play a crucial role in the allocation of relief goods, especially in the first stages after a disaster. The use of UAVs reduces transportation- and deprivation costs together by 16-20% and reduces maximum deprivation times by 19-40%, while maintaining similar levels of demand coverage, showcasing efficient and effective operations.

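Summary
Humanitarian logistics operations face growing difficulties as demand for aid in disaster areas rises. This paper studies the dynamic allocation of scarce relief supplies across multiple affected districts over time. It introduces a new stochastic dynamic post-disaster inventory allocation problem in which trucks and UAVs deliver relief goods under uncertain supply and demand. The problem matters because deliveries have an inter-temporal social impact, which the paper captures by including deprivation costs when allocating scarce supplies. It also accounts for the inherent uncertainty of disaster areas and the potential of cargo UAVs to improve operational efficiency. Two anticipatory solution methods based on approximate dynamic programming are proposed: decomposed linear value function approximation (DL-VFA) and neural network value function approximation (NN-VFA). Compared with state-of-the-art methods (exact re-optimization, PPO), they improve on the best benchmarks by 6-8%. NN-VFA performs best and captures nonlinearities in the problem, while DL-VFA scales very well at a minor loss in performance. The experiments show that considering deprivation costs improves the allocation of scarce supplies across districts and over time, and that UAVs reduce transportation and deprivation costs while maintaining similar demand coverage. UAVs are especially valuable in the first stages after a disaster.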

Dataset Distillation in Large Data Era

paper_url: http://arxiv.org/abs/2311.18838
repo_url: https://github.com/VILA-Lab/SRe2L
paper_authors: Zeyuan Yin, Zhiqiang Shen
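for: This paper studies how to use dataset distillation to train models efficiently while preserving performance on the original data distribution.
methods: It introduces a simple yet effective Curriculum Data Augmentation (CDA) technique during data synthesis to distill the large-scale ImageNet-1K and ImageNet-21K datasets.
results: The method reaches 63.2% Top-1 accuracy on ImageNet-1K under IPC 50 and 36.1% on ImageNet-21K under IPC 20, and narrows the gap to full-data training to less than 15% absolute. It is also the first successful dataset distillation on the larger ImageNet-21K at the standard 224x224 resolution.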

Abstract
Dataset distillation aims to generate a smaller but representative subset from a large dataset, which allows a model to be trained efficiently, meanwhile evaluating on the original testing data distribution to achieve decent performance. Many prior works have aimed to align with diverse aspects of the original datasets, such as matching the training weight trajectories, gradient, feature/BatchNorm distributions, etc. In this work, we show how to distill various large-scale datasets such as full ImageNet-1K/21K under a conventional input resolution of 224$\times$224 to achieve the best accuracy over all previous approaches, including SRe$^2$L, TESLA and MTT. To achieve this, we introduce a simple yet effective ${\bf C}$urriculum ${\bf D}$ata ${\bf A}$ugmentation ($\texttt{CDA}$) during data synthesis that obtains the accuracy on large-scale ImageNet-1K and 21K with 63.2% under IPC (Images Per Class) 50 and 36.1% under IPC 20, respectively. Finally, we show that, by integrating all our enhancements together, the proposed model beats the current state-of-the-art by more than 4% Top-1 accuracy on ImageNet-1K/21K and for the first time, reduces the gap to its full-data training counterpart to less than absolute 15%. Moreover, this work represents the inaugural success in dataset distillation on larger-scale ImageNet-21K under the standard 224$\times$224 resolution. Our code and distilled ImageNet-21K dataset of 20 IPC, 2K recovery budget are available at https://github.com/VILA-Lab/SRe2L/tree/main/CDA.

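Summary
Dataset distillation aims to generate a smaller but representative subset of a large dataset so that a model can be trained efficiently while still performing well on the original test distribution. Many prior works align with different aspects of the original data, such as matching training weight trajectories, gradients, or feature/BatchNorm distributions. This work shows how to distill large-scale datasets such as full ImageNet-1K/21K at the conventional 224x224 input resolution and reach the best accuracy among previous approaches, including SRe$^2$L, TESLA, and MTT. To this end, it introduces a simple yet effective ${\bf C}$urriculum ${\bf D}$ata ${\bf A}$ugmentation ($\texttt{CDA}$) during data synthesis, reaching 63.2% on ImageNet-1K under IPC 50 and 36.1% on ImageNet-21K under IPC 20. With all enhancements combined, the proposed model beats the current state of the art by more than 4% Top-1 accuracy on ImageNet-1K/21K and, for the first time, reduces the gap to full-data training to less than 15% absolute. The work is also the first successful dataset distillation on the larger ImageNet-21K at the standard 224x224 resolution. The code and distilled ImageNet-21K dataset are available at https://github.com/VILA-Lab/SRe2L/tree/main/CDA.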

paper_url: http://arxiv.org/abs/2311.18837
repo_url: None
paper_authors: Zhen Xing, Qi Dai, Zihao Zhang, Hui Zhang, Han Hu, Zuxuan Wu, Yu-Gang Jiang
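for: This work proposes a foundation model applicable to a wide range of video tasks, covering both understanding tasks (such as language-guided video object segmentation) and generative tasks (video editing and enhancement).
methods: The authors propose Video Instruction Diffusion (VIDiff), a foundation model that edits and enhances videos within seconds based on user instructions. They also design an iterative auto-regressive method to keep the editing and enhancement of long videos consistent.
results: The model produces convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively. More examples are available at https://ChenHsing.github.io/VIDiff.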

Abstract
Diffusion models have achieved significant success in image and video generation. This motivates a growing interest in video editing tasks, where videos are edited according to provided text descriptions. However, most existing approaches only focus on video editing for short clips and rely on time-consuming tuning or inference. We are the first to propose Video Instruction Diffusion (VIDiff), a unified foundation model designed for a wide range of video tasks. These tasks encompass both understanding tasks (such as language-guided video object segmentation) and generative tasks (video editing and enhancement). Our model can edit and translate the desired results within seconds based on user instructions. Moreover, we design an iterative auto-regressive method to ensure consistency in editing and enhancing long videos. We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively. More examples can be found at our website https://ChenHsing.github.io/VIDiff.

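Summary
Diffusion models have achieved significant success in image and video generation, which has driven growing interest in video editing tasks where videos are edited according to text descriptions. However, most existing approaches focus only on editing short clips and rely on time-consuming tuning or inference. The authors are the first to propose Video Instruction Diffusion (VIDiff), a unified foundation model for a wide range of video tasks, including language-guided video object segmentation, video editing, and video enhancement. The model edits and translates videos within seconds based on user instructions. An iterative auto-regressive method keeps the editing and enhancement of long videos consistent. Convincing generative results are provided for diverse input videos and written instructions, both qualitatively and quantitatively. More examples can be found at https://ChenHsing.github.io/VIDiff.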

Motion-Conditioned Image Animation for Video Editing

paper_url: http://arxiv.org/abs/2311.18827
repo_url: None
paper_authors: Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi
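for: This paper addresses the problem of video editing.
methods: It decomposes video editing into two steps: image editing followed by motion-conditioned image animation.
results: The paper introduces a new video editing benchmark and runs a comprehensive human evaluation of the latest video editing methods alongside MoCA. MoCA achieves a higher human-preference win rate, outperforming Dreamix (63%), MasaCtrl (75%), and Tune-A-Video (72%), with especially significant improvements on motion edits.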

Abstract
We introduce MoCA, a Motion-Conditioned Image Animation approach for video editing. It leverages a simple decomposition of the video editing problem into image editing followed by motion-conditioned image animation. Furthermore, given the lack of robust evaluation datasets for video editing, we introduce a new benchmark that measures edit capability across a wide variety of tasks, such as object replacement, background changes, style changes, and motion edits. We present a comprehensive human evaluation of the latest video editing methods along with MoCA, on our proposed benchmark. MoCA establishes a new state-of-the-art, demonstrating greater human preference win-rate, and outperforming notable recent approaches including Dreamix (63%), MasaCtrl (75%), and Tune-A-Video (72%), with especially significant improvements for motion edits.

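Summary
MoCA is a Motion-Conditioned Image Animation approach for video editing. It relies on a simple decomposition of video editing into image editing followed by motion-conditioned image animation. Because robust evaluation datasets for video editing are lacking, the authors introduce a new benchmark that measures edit capability across a variety of tasks, including object replacement, background changes, style changes, and motion edits. They present a comprehensive human evaluation of the latest video editing methods together with MoCA on this benchmark. MoCA establishes a new state of the art, achieving a higher human-preference win rate than recent approaches, with especially clear gains on motion edits.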

Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking

paper_url: http://arxiv.org/abs/2311.18817
repo_url: https://github.com/vfleaking/grokking-dichotomy
paper_authors: Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon S. Du, Jason D. Lee, Wei Hu
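for: This paper studies the "grokking" phenomenon on arithmetic tasks, in which a neural network first "memorizes" the training set, reaching perfect training accuracy but near-random test accuracy, and then, after sufficiently longer training, suddenly transitions to perfect test accuracy.
methods: Homogeneous neural nets are trained with large initialization and small weight decay on both classification and regression tasks, and the way grokking is induced is studied through theoretical analysis and experiments.
results: Training is first trapped for a long time at a solution corresponding to a kernel predictor; a very sharp transition to min-norm/max-margin predictors then occurs, producing a dramatic change in test accuracy.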

Abstract
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy. This paper studies the grokking phenomenon in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases. Specifically, when training homogeneous neural nets with large initialization and small weight decay on both classification and regression tasks, we prove that the training process gets trapped at a solution corresponding to a kernel predictor for a long time, and then a very sharp transition to min-norm/max-margin predictors occurs, leading to a dramatic change in test accuracy.

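Summary
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, giving perfect training accuracy but near-random test accuracy, and after sufficiently longer training it suddenly transitions to perfect test accuracy. This paper studies grokking in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases. Specifically, when homogeneous neural nets are trained with large initialization and small weight decay, the authors prove that training is trapped for a long time at a solution corresponding to a kernel predictor, after which a very sharp transition to min-norm/max-margin predictors occurs, leading to a dramatic change in test accuracy.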

Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text

paper_url: http://arxiv.org/abs/2311.18805
repo_url: None
paper_authors: Qi Cao, Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa
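for: This study investigates the resilience of large language models (LLMs), particularly GPT-4, to extensive character-level permutations of their input.
methods: The authors propose the Scrambled Bench, a suite that measures how well LLMs handle scrambled input, both in recovering scrambled sentences and in answering questions given scrambled context.
results: The most powerful LLMs show a capability akin to typoglycemia, understanding words whose inner letters are scrambled as long as the first and last letters stay in place. GPT-4 goes further: it can almost perfectly reconstruct the original sentences from scrambled ones, decreasing the edit distance by 95%.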

Abstract
While Large Language Models (LLMs) have achieved remarkable performance in many tasks, much about their inner workings remains unclear. In this study, we present novel experimental insights into the resilience of LLMs, particularly GPT-4, when subjected to extensive character-level permutations. To investigate this, we first propose the Scrambled Bench, a suite designed to measure the capacity of LLMs to handle scrambled input, in terms of both recovering scrambled sentences and answering questions given scrambled context. The experimental results indicate that most powerful LLMs demonstrate the capability akin to typoglycemia, a phenomenon where humans can understand the meaning of words even when the letters within those words are scrambled, as long as the first and last letters remain in place. More surprisingly, we found that only GPT-4 nearly flawlessly processes inputs with unnatural errors, even under the extreme condition, a task that poses significant challenges for other LLMs and often even for humans. Specifically, GPT-4 can almost perfectly reconstruct the original sentences from scrambled ones, decreasing the edit distance by 95%, even when all letters within each word are entirely scrambled. It is counter-intuitive that LLMs can exhibit such resilience despite severe disruption to input tokenization caused by scrambled text.

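Summary
Large language models (LLMs) perform remarkably well on many tasks, but much about their inner workings remains unclear. This study presents new experimental insights into the resilience of LLMs, particularly GPT-4, to extensive character-level permutations. The authors propose the Scrambled Bench, a suite that measures how well LLMs handle scrambled input, both in recovering scrambled sentences and in answering questions given scrambled context. The results indicate that the most powerful LLMs show a capability akin to typoglycemia, the phenomenon in which humans can understand words whose letters are scrambled as long as the first and last letters remain in place. More surprisingly, only GPT-4 processes inputs with unnatural errors almost flawlessly, even under extreme conditions, a task that is hard for other LLMs and often even for humans. Specifically, GPT-4 can almost perfectly reconstruct the original sentences from scrambled ones, decreasing the edit distance by 95%, even when all letters within each word are entirely scrambled. This resilience is counter-intuitive given how severely scrambled text disrupts input tokenization.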
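
The scrambling setup and the edit-distance measure described in the abstract above can be illustrated with a small sketch. This is not the Scrambled Bench code: the function names `scramble_word`, `scramble_sentence`, `edit_distance`, and `recovery_rate` are hypothetical, whitespace word splitting is a simplification, and the recovery-rate formula (relative reduction in edit distance) is an assumption based on the abstract's phrase "decreasing the edit distance by 95%".

```python
import random


def scramble_word(word: str, rng: random.Random, keep_ends: bool = True) -> str:
    """Shuffle the letters of a word; optionally keep the first and last letters fixed."""
    if keep_ends:
        if len(word) <= 3:
            return word  # nothing to shuffle between the two ends
        middle = list(word[1:-1])
        rng.shuffle(middle)
        return word[0] + "".join(middle) + word[-1]
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)


def scramble_sentence(sentence: str, seed: int = 0, keep_ends: bool = True) -> str:
    """Scramble every whitespace-separated word (punctuation is treated as letters)."""
    rng = random.Random(seed)
    return " ".join(scramble_word(w, rng, keep_ends) for w in sentence.split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def recovery_rate(original: str, scrambled: str, recovered: str) -> float:
    """Assumed metric: relative reduction in edit distance to the original sentence."""
    before = edit_distance(original, scrambled)
    after = edit_distance(original, recovered)
    if before == 0:
        return 1.0 if after == 0 else 0.0
    return (before - after) / before
```

Calling `scramble_sentence(text, keep_ends=False)` corresponds to the extreme condition in which all letters within each word are shuffled; a model's reconstruction would then be scored with `recovery_rate(text, scrambled, model_output)`.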

Distributed Global Structure-from-Motion with a Deep Front-End

paper_url: http://arxiv.org/abs/2311.18801
repo_url: https://github.com/borglab/gtsfm
paper_authors: Ayush Baid, John Lambert, Travis Driver, Akshay Krishnan, Hayk Stepanyan, Frank Dellaert
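for: This paper examines whether global Structure-from-Motion (SfM) can perform on par with state-of-the-art incremental SfM by leveraging recent developments in feature extraction and matching.
methods: The authors design a modular SfM framework that makes it easy to combine developments in different stages of the SfM pipeline. They compare deep-learning-based front-ends with classical SIFT features, and global SfM with incremental SfM.
results: Deep-learning-based two-view correspondence estimation improves point density for scenes reconstructed with global SfM, but none of the deep front-ends outperforms SIFT when compared with incremental SfM results on a range of datasets. This suggests that SIFT remains a highly effective choice for feature extraction and matching.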

Abstract
While initial approaches to Structure-from-Motion (SfM) revolved around both global and incremental methods, most recent applications rely on incremental systems to estimate camera poses due to their superior robustness. Though there has been tremendous progress in SfM `front-ends' powered by deep models learned from data, the state-of-the-art (incremental) SfM pipelines still rely on classical SIFT features, developed in 2004. In this work, we investigate whether leveraging the developments in feature extraction and matching helps global SfM perform on par with the SOTA incremental SfM approach (COLMAP). To do so, we design a modular SfM framework that allows us to easily combine developments in different stages of the SfM pipeline. Our experiments show that while developments in deep-learning based two-view correspondence estimation do translate to improvements in point density for scenes reconstructed with global SfM, none of them outperform SIFT when comparing with incremental SfM results on a range of datasets. Our SfM system is designed from the ground up to leverage distributed computation, enabling us to parallelize computation on multiple machines and scale to large scenes.

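Summary
Early approaches to Structure-from-Motion (SfM) included both global and incremental methods, but most recent applications rely on incremental systems to estimate camera poses because of their superior robustness. Although SfM front-ends powered by deep models learned from data have progressed enormously, state-of-the-art incremental SfM pipelines still rely on classical SIFT features, developed in 2004. This work investigates whether developments in feature extraction and matching help global SfM perform on par with the state-of-the-art incremental approach (COLMAP). To do so, the authors design a modular SfM framework that makes it easy to combine developments in different stages of the SfM pipeline. Their experiments show that deep-learning-based two-view correspondence estimation improves point density for scenes reconstructed with global SfM, but none of these front-ends outperforms SIFT when compared with incremental SfM results. The SfM system is designed from the ground up to leverage distributed computation, so that computation can be parallelized across multiple machines and scaled to large scenes.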

Automated interpretation of congenital heart disease from multi-view echocardiograms

paper_url: http://arxiv.org/abs/2311.18788
repo_url: None
paper_authors: Jing Wang, Xiaofeng Liu, Fangyun Wang, Lin Zheng, Fengqiao Gao, Hanwen Zhang, Xin Zhang, Wanqing Xie, Binbin Wang
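for: This study aims to automatically analyze multi-view echocardiograms to help diagnose congenital heart disease (CHD).
methods: Depthwise separable convolution-based multi-channel networks are used to greatly reduce the number of network parameters, and the class imbalance problem is addressed by augmenting the positive training samples.
results: The 2D key-frame model reaches 95.4% accuracy for binary diagnosis, and the video-based model reaches 92.1% accuracy in 3-class classification. The video-based model works without key-frame selection or view annotation at test time.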

Abstract
Congenital heart disease (CHD) is the most common birth defect and the leading cause of neonate death in China. Clinical diagnosis can be based on the selected 2D key-frames from five views. Limited by the availability of multi-view data, most methods have to rely on the insufficient single view analysis. This study proposes to automatically analyze the multi-view echocardiograms with a practical end-to-end framework. We collect the five-view echocardiograms video records of 1308 subjects (including normal controls, ventricular septal defect (VSD) patients and atrial septal defect (ASD) patients) with both disease labels and standard-view key-frame labels. Depthwise separable convolution-based multi-channel networks are adopted to largely reduce the network parameters. We also approach the imbalanced class problem by augmenting the positive training samples. Our 2D key-frame model can diagnose CHD or negative samples with an accuracy of 95.4\%, and in negative, VSD or ASD classification with an accuracy of 92.3\%. To further alleviate the work of key-frame selection in real-world implementation, we propose an adaptive soft attention scheme to directly explore the raw video data. Four kinds of neural aggregation methods are systematically investigated to fuse the information of an arbitrary number of frames in a video. Moreover, with a view detection module, the system can work without the view records. Our video-based model can diagnose with an accuracy of 93.9\% (binary classification), and 92.1\% (3-class classification) in a collected 2D video testing set, which does not need key-frame selection and view annotation in testing. The detailed ablation study and the interpretability analysis are provided.

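Summary
Congenital heart disease (CHD) is the most common birth defect and the leading cause of neonate death in China. Clinical diagnosis can be based on selected 2D key-frames from five views. Because multi-view data is limited, most methods have to rely on single-view analysis. This study proposes to analyze multi-view echocardiograms automatically with a practical end-to-end framework. The authors collect five-view echocardiogram video records of 1308 subjects (normal controls, ventricular septal defect (VSD) patients, and atrial septal defect (ASD) patients) with both disease labels and standard-view key-frame labels. Depthwise separable convolution-based multi-channel networks are adopted to greatly reduce the number of network parameters, and the class imbalance problem is addressed by augmenting the positive training samples. The 2D key-frame model diagnoses CHD versus negative samples with 95.4% accuracy and classifies negative, VSD, or ASD with 92.3% accuracy. To reduce the work of key-frame selection in real-world use, an adaptive soft attention scheme is proposed to explore the raw video data directly. Four kinds of neural aggregation methods are systematically investigated to fuse the information of an arbitrary number of frames in a video. In addition, with a view detection module, the system can work without view records. The video-based model reaches 93.9% accuracy (binary classification) and 92.1% (3-class classification) on a collected 2D video test set, with no key-frame selection or view annotation needed at test time. A detailed ablation study and an interpretability analysis are provided.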
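
The parameter saving from depthwise separable convolutions mentioned in the abstract above can be made concrete with a short calculation. This is a generic sketch, not the paper's network: the channel counts and kernel size are illustrative assumptions, bias terms are ignored, and the function names are hypothetical.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution: every output channel sees every input channel."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def parameter_ratio(c_in: int, c_out: int, k: int) -> float:
    """Ratio of separable to standard weights; algebraically 1/c_out + 1/k**2."""
    return depthwise_separable_params(c_in, c_out, k) / standard_conv_params(c_in, c_out, k)


if __name__ == "__main__":
    # Assumed layer shape for illustration only: 64 -> 128 channels, 3 x 3 kernel.
    print(standard_conv_params(64, 128, 3))        # 73728
    print(depthwise_separable_params(64, 128, 3))  # 8768
    print(round(parameter_ratio(64, 128, 3), 4))   # 0.1189
```

For this assumed layer the separable form needs roughly 12% of the weights of the standard convolution, which is the kind of reduction the abstract refers to.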

paper_url: http://arxiv.org/abs/2312.03747
repo_url: None
paper_authors: Giorgos Lysandrou, Roma English Owen, Vanja Popovic, Grant Le Brun, Beatrice Alex, Elizabeth A. L. Fairley
for: This paper aims to improve the collection and understanding of patient experiences in the real world to improve care standards and personalize drug treatment.
methods: The paper uses linguistic analysis to identify similarities between patient experience datasets across different therapeutic domains and data sources, and trains classifiers (CNN and transformer) to accurately identify patient experience posts from social media.
results: The paper finds that the transformer classifier performs the best in classifying patient experience posts, achieving F1-scores ranging between 0.863 and 0.995 across all therapeutic domains and data sources.

Abstract
It is essential that healthcare professionals and members of the healthcare community can access and easily understand patient experiences in the real world, so that care standards can be improved and driven towards personalised drug treatment. Social media platforms and message boards are deemed suitable sources of patient experience information, as patients have been observed to discuss and exchange knowledge, look for and provide support online. This paper tests the hypothesis that not all online patient experience information can be treated and collected in the same way, as a result of the inherent differences in the way individuals talk about their journeys, in different therapeutic domains and or data sources. We used linguistic analysis to understand and identify similarities between datasets, across patient language, between data sources (Reddit, SocialGist) and therapeutic domains (cardiovascular, oncology, immunology, neurology). We detected common vocabulary used by patients in the same therapeutic domain across data sources, except for immunology patients, who use unique vocabulary between the two data sources, and compared to all other datasets. We combined linguistically similar datasets to train classifiers (CNN, transformer) to accurately identify patient experience posts from social media, a task we refer to as patient voice classification. The cardiovascular and neurology transformer classifiers perform the best in their respective comparisons for the Reddit data source, achieving F1-scores of 0.865 and 1.0 respectively. The overall best performing classifier is the transformer classifier trained on all data collected for this experiment, achieving F1-scores ranging between 0.863 and 0.995 across all therapeutic domain and data source specific test datasets.

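Summary
Healthcare professionals and members of the healthcare community need to access and easily understand real-world patient experiences so that care standards can be improved and driven towards personalised drug treatment. Social media platforms and message boards are considered suitable sources of patient experience information, because patients discuss and support each other online. This paper tests the hypothesis that not all online patient experience information can be treated and collected in the same way, because individuals talk about their journeys differently across therapeutic domains and data sources. Linguistic analysis is used to understand and identify similarities between datasets, and reveals vocabulary shared by patients in the same therapeutic domain across data sources. Linguistically similar datasets are combined to train classifiers (CNN and transformer) that accurately identify patient experience posts from social media. The cardiovascular and neurology transformer classifiers perform best for the Reddit data source, with F1-scores of 0.865 and 1.0 respectively. The overall best classifier is the transformer trained on all collected data, with F1-scores between 0.863 and 0.995 across all therapeutic domain and data source specific test datasets.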
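
The F1-scores reported above combine precision and recall for the positive class (here, a post being a patient experience post). The following is a minimal sketch of the standard binary F1 computation; the labels are made-up illustration data, not results from the paper, and `f1_score` is a hypothetical helper name.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up labels: 1 = patient experience post, 0 = other post.
    y_true = [1, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 1, 0]
    print(f1_score(y_true, y_pred))
```

An F1-score of 1.0, as reported for the neurology transformer classifier on Reddit data, means no false positives and no false negatives on that test set.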

paper_url: http://arxiv.org/abs/2312.00825
repo_url: None
paper_authors: Phillip Howard, Avinash Madasu, Tiep Le, Gustavo Lujan Moreno, Anahita Bhiwandiwalla, Vasudev Lal
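for: This study investigates the social biases present in existing vision-language models (VLMs), including biases tied to intersections of social attributes.
methods: Text-to-image diffusion models are used to generate counterfactual image-text pairs at scale for probing intersectional social biases. The approach uses Stable Diffusion with cross attention control to produce sets of image-text pairs that are highly similar in their depiction of a subject while differing only in the depicted social attributes.
results: Experiments show that the generated SocialCounterfactuals dataset is useful for probing and mitigating intersectional social biases in state-of-the-art VLMs.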

Abstract
While vision-language models (VLMs) have achieved remarkable performance improvements recently, there is growing evidence that these models also possess harmful biases with respect to social attributes such as gender and race. Prior studies have primarily focused on probing such bias attributes individually while ignoring biases associated with intersections between social attributes. This could be due to the difficulty of collecting an exhaustive set of image-text pairs for various combinations of social attributes. To address this challenge, we employ text-to-image diffusion models to produce counterfactual examples for probing intersectional social biases at scale. Our approach utilizes Stable Diffusion with cross attention control to produce sets of counterfactual image-text pairs that are highly similar in their depiction of a subject (e.g., a given occupation) while differing only in their depiction of intersectional social attributes (e.g., race & gender). Through our over-generate-then-filter methodology, we produce SocialCounterfactuals, a high-quality dataset containing over 171k image-text pairs for probing intersectional biases related to gender, race, and physical characteristics. We conduct extensive experiments to demonstrate the usefulness of our generated dataset for probing and mitigating intersectional social biases in state-of-the-art VLMs.

Summary
Recently, vision-language models (VLMs) have achieved significant performance improvements, but there is also growing evidence that these models possess harmful biases with respect to social attributes such as gender and race. Prior studies have primarily focused on probing such bias attributes individually, ignoring biases associated with intersections between social attributes. This may be due to the difficulty of collecting an exhaustive set of image-text pairs for various combinations of social attributes. To address this challenge, the authors employ text-to-image diffusion models to produce counterfactual examples for probing intersectional social biases at scale. The approach uses Stable Diffusion with cross attention control to produce sets of counterfactual image-text pairs that are highly similar in their depiction of a subject (e.g., a given occupation) while differing only in their depiction of intersectional social attributes (e.g., race & gender). Through an over-generate-then-filter methodology, they produce SocialCounterfactuals, a high-quality dataset containing over 171k image-text pairs for probing intersectional biases related to gender, race, and physical characteristics. Extensive experiments demonstrate the usefulness of the generated dataset for probing and mitigating intersectional social biases in state-of-the-art VLMs.

CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

paper_url: http://arxiv.org/abs/2311.18775
repo_url: None
paper_authors: Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal
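for: This paper develops CoDi-2, a versatile multimodal large language model that can follow complex interleaved language-vision-audio instructions, learn from in-context examples, and generate multimodal outputs.
methods: Modalities are aligned with language for both encoding and generation, so that the language model can understand complex modality-interleaved instructions and in-context examples.
results: CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing, and can carry out complex multimodal tasks through multi-round interactive conversation.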

Abstract
We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to not only understand complex modality-interleaved instructions and in-context examples, but also autoregressively generate grounded and coherent multimodal outputs in the continuous feature space. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot capabilities for multimodal generation, such as in-context learning, reasoning, and compositionality of any-to-any modality generation through multi-round interactive conversation. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing. CoDi-2 signifies a substantial breakthrough in developing a comprehensive multimodal foundation model adept at interpreting in-context language-vision-audio interleaved instructions and producing multimodal outputs.

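Summary
CoDi-2 is a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, perform in-context learning (ICL), reason, chat, and edit, in an any-to-any input-output modality paradigm. By aligning modalities with language for both encoding and generation, CoDi-2 enables Large Language Models (LLMs) not only to understand complex modality-interleaved instructions and in-context examples, but also to autoregressively generate grounded and coherent multimodal outputs in the continuous feature space. To train CoDi-2, the authors build a large-scale generation dataset of in-context multimodal instructions across text, vision, and audio. CoDi-2 shows a wide range of zero-shot capabilities for multimodal generation, such as in-context learning, reasoning, and compositional any-to-any generation. It surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing. CoDi-2 represents a substantial step towards a comprehensive multimodal foundation model that interprets in-context language-vision-audio interleaved instructions and produces multimodal outputs.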

Evaluating the Impact of Flaky Simulators on Testing Autonomous Driving Systems

paper_url: http://arxiv.org/abs/2311.18768
repo_url: https://github.com/anonoymous9423013/anonymous_paper
paper_authors: Mohammad Hossein Amini, Shervin Naseri, Shiva Nejati
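for: This paper studies test flakiness in simulation-based testing of Autonomous Driving Systems (ADS) and whether machine learning (ML) can identify flaky tests.
methods: An empirical study on two widely used open-source ADS simulators and five diverse ADS test setups examines how flakiness affects automated testing and whether ML can effectively identify flaky tests.
results: Test flakiness in ADS is common and can significantly affect test results. ML classifiers identify flaky tests effectively using only a single test run, whereas the non-ML baseline requires executing tests at least twice; the classifiers improve on the baseline's F1-score by $31$%, $21$%, and $13$%.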

Abstract
Simulators are widely used to test Autonomous Driving Systems (ADS), but their potential flakiness can lead to inconsistent test results. We investigate test flakiness in simulation-based testing of ADS by addressing two key questions: (1) How do flaky ADS simulations impact automated testing that relies on randomized algorithms? and (2) Can machine learning (ML) effectively identify flaky ADS tests while decreasing the required number of test reruns? Our empirical results, obtained from two widely-used open-source ADS simulators and five diverse ADS test setups, show that test flakiness in ADS is a common occurrence and can significantly impact the test results obtained by randomized algorithms. Further, our ML classifiers effectively identify flaky ADS tests using only a single test run, achieving F1-scores of $85$%, $82$% and $96$% for three different ADS test setups. Our classifiers significantly outperform our non-ML baseline, which requires executing tests at least twice, by $31$%, $21$%, and $13$% in F1-score performance, respectively. We conclude with a discussion on the scope, implications and limitations of our study. We provide our complete replication package in a Github repository.

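Summary
Simulators are widely used to test Autonomous Driving Systems (ADS), but their potential flakiness can lead to inconsistent test results. The paper investigates test flakiness in simulation-based ADS testing through two questions: (1) How do flaky ADS simulations affect automated testing that relies on randomized algorithms? and (2) Can machine learning (ML) effectively identify flaky ADS tests while reducing the number of test reruns required? The empirical results, obtained from two widely used open-source ADS simulators and five diverse ADS test setups, show that test flakiness in ADS is common and can significantly affect test results. The ML classifiers identify flaky ADS tests using only a single test run, achieving F1-scores of 85%, 82%, and 96% for three different ADS test setups. They outperform the non-ML baseline, which requires executing tests at least twice, by 31%, 21%, and 13% in F1-score, respectively. The paper closes with a discussion of the scope, implications, and limitations of the study. A complete replication package is provided in a GitHub repository.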
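
The abstract notes that the non-ML baseline needs at least two executions of a test to detect flakiness. A minimal sketch of such a rerun-based check follows. It is an assumption about what a rerun baseline could look like, not the paper's implementation: `is_flaky_by_rerun` and `replay` are hypothetical names, and a real ADS test might compare numeric fitness values rather than pass/fail verdicts.

```python
from typing import Callable, Iterable


def is_flaky_by_rerun(run_test: Callable[[], bool], runs: int = 2) -> bool:
    """Flag a test as flaky if repeated runs of the same scenario disagree."""
    if runs < 2:
        raise ValueError("at least two runs are needed to observe a disagreement")
    verdicts = {run_test() for _ in range(runs)}
    return len(verdicts) > 1


def replay(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Stand-in for a simulator run: returns pre-recorded pass/fail verdicts in order."""
    iterator = iter(outcomes)
    return lambda: next(iterator)


if __name__ == "__main__":
    print(is_flaky_by_rerun(replay([True, False])))  # verdicts differ
    print(is_flaky_by_rerun(replay([True, True])))   # verdicts agree
```

The cost of this approach is one extra simulation per test, which is what a single-run ML classifier would avoid.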

MLLMs-Augmented Visual-Language Representation Learning

paper_url: http://arxiv.org/abs/2311.18765
repo_url: https://github.com/lyq312318224/mllms-augmented
paper_authors: Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You
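for: Improving visual-language representation learning with multi-modal large language models (MLLMs).
methods: MLLMs are used to extend multiple captions for each image, and "text shearing" keeps the extended captions the same length as the original captions.
results: In image-text retrieval, the method obtains 5.6 ~ 35.0% and 16.8 ~ 46.1% improvement on R@1 under the fine-tuning and zero-shot settings, respectively. The zero-shot results are comparable to fine-tuning on target datasets, which encourages further exploration of MLLMs.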

Abstract
Visual-language pre-training (VLP) has achieved remarkable success in multi-modal tasks, largely attributed to the availability of large-scale image-text datasets. In this work, we demonstrate that multi-modal large language models (MLLMs) can enhance visual-language representation learning by improving data quality. Our approach is simple, utilizing MLLMs to extend multiple captions for each image. To prevent the bias introduced by MLLMs' hallucinations and intrinsic caption styles, we propose "text shearing" to maintain the same length for extended captions as that of the original captions. In image-text retrieval, our method consistently obtains 5.6 ~ 35.0% and 16.8 ~ 46.1% improvement on R@1 under the fine-tuning and zero-shot settings, respectively. Notably, we obtain zero-shot results that are comparable to fine-tuning on target datasets, which encourages more exploration of the versatile use of MLLMs.

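Summary
Visual-language pre-training (VLP) has achieved remarkable success in multi-modal tasks, largely thanks to large-scale image-text datasets. This work shows that multi-modal large language models (MLLMs) can enhance visual-language representation learning. The approach is simple: MLLMs are used to extend multiple captions for each image. To counter the bias introduced by MLLMs' hallucinations and intrinsic caption styles, the authors propose "text shearing", which keeps the extended captions the same length as the original captions. In image-text retrieval, the method consistently obtains 5.6 ~ 35.0% and 16.8 ~ 46.1% improvement on R@1 under the fine-tuning and zero-shot settings, respectively. Notably, the zero-shot results are comparable to fine-tuning on target datasets, which encourages more exploration of the versatile use of MLLMs.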
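
The "text shearing" step described above, which keeps each extended caption at the length of the original caption, can be sketched as a simple truncation. This is a minimal illustration under stated assumptions: length is counted in whitespace-separated words (the paper may count tokenizer tokens instead), and `shear_caption` is a hypothetical name, not a function from the authors' repository.

```python
def shear_caption(extended: str, original: str) -> str:
    """Truncate an MLLM-extended caption to the word count of the original caption."""
    target_length = len(original.split())
    return " ".join(extended.split()[:target_length])


if __name__ == "__main__":
    original = "a dog on grass"
    extended = "A brown dog is running across a grassy field in bright sunlight"
    print(shear_caption(extended, original))  # A brown dog is
```

Cutting the tail of a long generated caption is one way to limit the influence of MLLM-specific caption style and late-sentence hallucinations, which is the motivation the abstract gives for the technique.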

Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters

paper_url: http://arxiv.org/abs/2311.18763
repo_url: None
paper_authors: James Seale Smith, Yen-Chang Hsu, Zsolt Kira, Yilin Shen, Hongxia Jin
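for: This paper asks whether continual text-to-image diffusion methods can scale to longer concept sequences without forgetting previously learned concepts.
methods: The authors propose STack-And-Mask INcremental Adapters (STAMINA), which combines low-rank attention-masked adapters with customized MLP tokens to improve the robustness and scalability of LoRA in sequential concept learning.
results: STAMINA outperforms the prior state of the art on a 50-concept benchmark with no stored replay data. Extended to continual learning for image classification, it also reaches state-of-the-art performance on a standard benchmark.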

Abstract
Recent work has demonstrated a remarkable ability to customize text-to-image diffusion models to multiple, fine-grained concepts in a sequential (i.e., continual) manner while only providing a few example images for each concept. This setting is known as continual diffusion. Here, we ask the question: Can we scale these methods to longer concept sequences without forgetting? Although prior work mitigates the forgetting of previously learned concepts, we show that its capacity to learn new tasks reaches saturation over longer sequences. We address this challenge by introducing a novel method, STack-And-Mask INcremental Adapters (STAMINA), which is composed of low-ranked attention-masked adapters and customized MLP tokens. STAMINA is designed to enhance the robust fine-tuning properties of LoRA for sequential concept learning via learnable hard-attention masks parameterized with low rank MLPs, enabling precise, scalable learning via sparse adaptation. Notably, all introduced trainable parameters can be folded back into the model after training, inducing no additional inference parameter costs. We show that STAMINA outperforms the prior SOTA for the setting of text-to-image continual customization on a 50-concept benchmark composed of landmarks and human faces, with no stored replay data. Additionally, we extended our method to the setting of continual learning for image classification, demonstrating that our gains also translate to state-of-the-art performance in this standard benchmark.

Summary
Recent research has shown the ability to customize text-to-image diffusion models to multiple, fine-grained concepts in a sequential manner, while only providing a few example images for each concept. This setting is called continual diffusion. However, the capacity to learn new tasks reaches saturation over longer sequences. To address this challenge, we propose a novel method called STack-And-Mask INcremental Adapters (STAMINA), which includes low-ranked attention-masked adapters and customized MLP tokens. STAMINA enhances the robust fine-tuning properties of LoRA for sequential concept learning via learnable hard-attention masks parameterized with low rank MLPs, allowing for precise and scalable learning via sparse adaptation. All introduced trainable parameters can be folded back into the model after training, resulting in no additional inference parameter costs. Our method outperforms the prior state-of-the-art for the setting of text-to-image continual customization on a 50-concept benchmark composed of landmarks and human faces, with no stored replay data. Additionally, we extended our method to the setting of continual learning for image classification, demonstrating that our gains also translate to state-of-the-art performance in this standard benchmark.
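
The statement that all trainable parameters can be folded back into the model after training follows from how low-rank adapters work: a LoRA-style update `B @ A` can be added to the frozen weight `W` once, so inference uses a single matrix. The sketch below shows this in plain Python. How STAMINA applies its hard-attention mask is not specified in the abstract; the element-wise binary `mask` on the low-rank update is an assumption made for illustration, and all names are hypothetical.

```python
from typing import List, Optional

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def fold_adapter(w: Matrix, b: Matrix, a: Matrix, mask: Optional[Matrix] = None) -> Matrix:
    """Return W + mask * (B @ A); with mask=None this is the usual LoRA merge."""
    delta = matmul(b, a)  # low-rank update with the same shape as W
    rows, cols = len(w), len(w[0])
    if mask is None:
        mask = [[1.0] * cols for _ in range(rows)]
    # Assumed form of the sparse adaptation: a binary mask gates each entry of the update.
    return [[w[i][j] + mask[i][j] * delta[i][j] for j in range(cols)] for i in range(rows)]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight (2 x 2)
    B = [[1.0], [2.0]]            # rank-1 factor (2 x 1)
    A = [[3.0, 4.0]]              # rank-1 factor (1 x 2)
    print(fold_adapter(W, B, A))
    print(fold_adapter(W, B, A, mask=[[1.0, 0.0], [0.0, 1.0]]))
```

Because the merged matrix has the same shape as `W`, the adapter adds no parameters at inference time, which matches the abstract's claim of no additional inference parameter costs.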

TaskBench: Benchmarking Large Language Models for Task Automation

paper_url: http://arxiv.org/abs/2311.18760
repo_url: https://github.com/microsoft/JARVIS
paper_authors: Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, Yueting Zhuang
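for: Evaluating the capability of large language models (LLMs) in task automation.
methods: Introduces TaskBench to evaluate LLMs in task automation, which spans three critical stages: task decomposition, tool invocation, and parameter prediction to fulfill user intent. High-quality evaluation datasets are generated using a Tool Graph and a back-instruct method.
results: Experiments show that TaskBench effectively reflects LLM capability in task automation. Thanks to the mixture of automated data construction and human verification, TaskBench achieves high consistency with human evaluation and can serve as a comprehensive and faithful benchmark for LLM-based autonomous agents.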

Abstract
Recently, the incredible progress of large language models (LLMs) has ignited the spark of task automation, which decomposes the complex tasks described by user instructions into sub-tasks, invokes external tools to execute them, and plays a central role in autonomous agents. However, there is no systematic and standardized benchmark to foster the development of LLMs in task automation. To this end, we introduce TaskBench to evaluate the capability of LLMs in task automation. Specifically, task automation can be formulated into three critical stages: task decomposition, tool invocation, and parameter prediction to fulfill user intent. This complexity makes data collection and evaluation more challenging compared to common NLP tasks. To generate high-quality evaluation datasets, we introduce the concept of Tool Graph to represent the decomposed tasks in user intent, and adopt a back-instruct method to simulate user instruction and annotations. Furthermore, we propose TaskEval to evaluate the capability of LLMs from different aspects, including task decomposition, tool invocation, and parameter prediction. Experimental results demonstrate that TaskBench can effectively reflect the capability of LLMs in task automation. Benefiting from the mixture of automated data construction and human verification, TaskBench achieves a high consistency compared to the human evaluation, which can be utilized as a comprehensive and faithful benchmark for LLM-based autonomous agents.

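Summary
Recent rapid progress in large language models (LLMs) has sparked interest in task automation, which decomposes complex tasks described by user instructions into sub-tasks and invokes external tools to execute them, and which plays a central role in autonomous agents. However, no systematic, standardized benchmark exists for assessing LLMs in task automation. To this end, we introduce TaskBench. Task automation can be formulated as three critical stages: task decomposition, tool invocation, and parameter prediction to fulfill user intent. This complexity makes data collection and evaluation harder than for common NLP tasks. To generate high-quality evaluation data, we introduce the Tool Graph to represent the decomposed tasks in user intent and adopt a back-instruct method to simulate user instructions and annotations. We further propose TaskEval to evaluate LLMs across task decomposition, tool invocation, and parameter prediction. Experiments show that TaskBench effectively reflects LLM capability in task automation. Owing to the mixture of automated data construction and human verification, TaskBench achieves high consistency with human evaluation and can serve as a comprehensive and faithful benchmark for LLM-based autonomous agents.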
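
The three stages named above (task decomposition, tool invocation, parameter prediction) can be illustrated with a small sketch. Everything here is hypothetical: the `ToolCall` structure, the tool names, and the set-level F1 scoring are assumptions chosen for illustration and are not taken from TaskBench or TaskEval.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One node of a hypothetical tool graph: a tool, its arguments, its upstream tools."""
    tool: str
    args: dict = field(default_factory=dict)
    depends_on: list = field(default_factory=list)


def f1(predicted: set, reference: set) -> float:
    """Set-level F1 between predicted and reference items (0.0 when there is no overlap)."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def nodes(plan):
    return {call.tool for call in plan}


def edges(plan):
    return {(dep, call.tool) for call in plan for dep in call.depends_on}


def params(plan):
    return {(call.tool, key, str(value)) for call in plan for key, value in call.args.items()}


def score_plan(predicted, reference):
    """Score tools (nodes), dependencies (edges), and arguments separately."""
    return {
        "tool_f1": f1(nodes(predicted), nodes(reference)),
        "edge_f1": f1(edges(predicted), edges(reference)),
        "param_f1": f1(params(predicted), params(reference)),
    }


# Hypothetical reference plan and model prediction for one user instruction.
reference = [
    ToolCall("speech_to_text", {"audio": "clip.wav"}),
    ToolCall("translate", {"target_lang": "fr"}, depends_on=["speech_to_text"]),
]
predicted = [
    ToolCall("speech_to_text", {"audio": "clip.wav"}),
    ToolCall("summarize", {}, depends_on=["speech_to_text"]),
]
scores = score_plan(predicted, reference)
```

In this toy case the prediction picks one of the two tools correctly, gets no dependency edge right, and recovers one of the two reference arguments, so the three scores differ; scoring the stages separately is what makes such differences visible.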

Language Model Agents Suffer from Compositional Generalization in Web Automation

paper_url: http://arxiv.org/abs/2311.18751
repo_url: https://github.com/google-research/google-research
paper_authors: Hiroki Furuta, Yutaka Matsuo, Aleksandra Faust, Izzeddin Gur
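for: This paper examines how language model agents (LMAs) perform on multi-step decision-making tasks and how well they generalize to real-world applications that combine tasks.
methods: The paper introduces a new benchmark, CompWoB, to evaluate LMAs, and studies both prompted and transferred LMAs across different task compositions.
results: Prompted LMAs drop to a 24.9% success rate on compositional tasks, while transferred LMAs (finetuned only on base tasks) fare better but still show a generalization gap (54.8% success rate). By balancing the data distribution across tasks, the paper trains a new model, HTML-T5++, that surpasses human-level performance (95.2%) on MiniWoB and achieves the best zero-shot performance on CompWoB (61.5%).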

Abstract
Language model agents (LMA) recently emerged as a promising paradigm on multi-step decision making tasks, often outperforming humans and other reinforcement learning agents. Despite the promise, their performance on real-world applications that often involve combinations of tasks is still underexplored. In this work, we introduce a new benchmark, called CompWoB -- 50 new compositional web automation tasks reflecting more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve 94.0% average success rate on base tasks, their performance degrades to 24.9% success rate on compositional tasks. On the other hand, transferred LMAs (finetuned only on base tasks) show less generalization gap, dropping from 85.4% to 54.8%. By balancing data distribution across tasks, we train a new model, HTML-T5++, that surpasses human-level performance (95.2%) on MiniWoB, and achieves the best zero-shot performance on CompWoB (61.5%). While these highlight the promise of small-scale finetuned and transferred models for compositional generalization, their performance further degrades under different instruction compositions changing combinational order. In contrast to the recent remarkable success of LMA, our benchmark and detailed analysis emphasize the necessity of building LMAs that are robust and generalizable to task compositionality for real-world deployment.

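Summary
Language model agents (LMAs) have recently emerged as a promising paradigm for multi-step decision-making tasks, often outperforming humans and other reinforcement learning agents. Despite this promise, their performance on real-world applications that combine tasks remains underexplored. In this work, we introduce a new benchmark, CompWoB, comprising 50 new compositional web automation tasks that reflect more realistic assumptions. We find that existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve a 94.0% average success rate on base tasks but drop to 24.9% on compositional tasks. In contrast, transferred LMAs (finetuned only on base tasks) show a smaller generalization gap, dropping from 85.4% to 54.8%. By balancing the data distribution across tasks, we train a new model, HTML-T5++, that surpasses human-level performance (95.2%) on MiniWoB and achieves the best zero-shot performance on CompWoB (61.5%). These results highlight the promise of small-scale finetuned and transferred models for compositional generalization, but their performance degrades further when instruction compositions change the combination order. Our benchmark and detailed analysis emphasize the need to build LMAs that are robust and generalizable to task compositionality for real-world deployment.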

TransCORALNet: A Two-Stream Transformer CORAL Networks for Supply Chain Credit Assessment Cold Start

paper_url: http://arxiv.org/abs/2311.18749
repo_url: https://github.com/jiejieniu/transcoralnet
paper_authors: Jie Shi, Arno P. J. M. Siebes, Siamak Mehrkanoon
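for: This paper proposes an interpretable two-stream transformer network (TransCORALNet) to address domain shift and cold start in supply chain credit assessment.
methods: The model uses a two-stream domain adaptation architecture with a Correlation Alignment (CORAL) loss to provide accurate credit assessment predictions.
results: Experiments on a real-world dataset show that TransCORALNet is more accurate than a number of state-of-the-art baselines. The code is available on GitHub.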

Abstract
This paper proposes an interpretable two-stream transformer CORAL network (TransCORALNet) for supply chain credit assessment under the segment industry and cold start problem. The model aims to provide accurate credit assessment prediction for new supply chain borrowers with limited historical data. Here, the two-stream domain adaptation architecture with correlation alignment (CORAL) loss is used as a core model and is equipped with a transformer, which provides insights about the learned features and allows efficient parallelization during training. Thanks to the domain adaptation capability of the proposed model, the domain shift between the source and target domain is minimized. Therefore, the model exhibits good generalization where the source and target do not follow the same distribution, and only a limited number of labeled target instances exist. Furthermore, we employ Local Interpretable Model-agnostic Explanations (LIME) to provide more insight into the model prediction and identify the key features contributing to supply chain credit assessment decisions. The proposed model addresses four significant supply chain credit assessment challenges: domain shift, cold start, imbalanced-class and interpretability. Experimental results on a real-world data set demonstrate the superiority of TransCORALNet over a number of state-of-the-art baselines in terms of accuracy. The code is available on GitHub https://github.com/JieJieNiu/TransCORALN .

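Summary
This paper proposes an interpretable two-stream transformer network (TransCORALNet) to address domain shift and cold start in supply chain credit assessment. The model adopts a two-stream domain adaptation architecture with a correlation alignment (CORAL) loss to improve predictive performance. It also uses Local Interpretable Model-agnostic Explanations (LIME) to explain predictions and identify the key features behind supply chain credit assessment decisions. Experiments on a real-world dataset show that TransCORALNet achieves higher accuracy than a number of state-of-the-art baselines. The code is available on GitHub (https://github.com/JieJieNiu/TransCORALN).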
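
To make the CORAL term concrete, here is a minimal pure-Python sketch of the correlation alignment loss in its commonly cited form: the squared Frobenius distance between source and target feature covariance matrices, scaled by 1/(4d^2) for d features. The function names and toy data are illustrative assumptions; the paper's own implementation, and how it combines this term with the classification loss, may differ.

```python
def covariance(rows):
    """Unbiased covariance matrix of a list of equal-length feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 * d^2)."""
    d = len(source[0])
    cov_s, cov_t = covariance(source), covariance(target)
    sq_frobenius = sum(
        (cov_s[i][j] - cov_t[i][j]) ** 2 for i in range(d) for j in range(d)
    )
    return sq_frobenius / (4 * d * d)


# Toy features: the target batch has the same structure but twice the spread.
source = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
target = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
loss = coral_loss(source, target)
```

The loss is zero when the two batches share second-order statistics and grows as their covariances diverge, which is why minimizing it pulls source and target feature distributions together.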

AlignBench: Benchmarking Chinese Alignment of Large Language Models

paper_url: http://arxiv.org/abs/2311.18743
repo_url: https://github.com/thudm/alignbench
paper_authors: Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi Wen, Jiale Cheng, Pei Ke, Yifan Xu, Weng Lam Tam, Xiaohan Zhang, Lichao Sun, Hongning Wang, Jing Zhang, Minlie Huang, Yuxiao Dong, Jie Tang
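for: This paper provides a comprehensive multi-dimensional benchmark for evaluating the alignment of Chinese large language models (LLMs).
methods: The benchmark is built with a human-in-the-loop data curation pipeline and uses a rule-calibrated, multi-dimensional LLM-as-Judge that generates explanations and final ratings, ensuring high reliability and interpretability.
results: Experiments show that evaluating with CritiqueLLM recovers 95% of GPT-4's evaluation ability. The authors also plan to provide public APIs for evaluating AlignBench with CritiqueLLM.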

Abstract
Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effective evaluation of alignment for emerging Chinese LLMs is still significantly lacking, calling for real-scenario grounded, open-ended, challenging and automatic evaluations tailored for alignment. To fill in this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. Equipped with a human-in-the-loop data curation pipeline, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge with Chain-of-Thought to generate explanations and final ratings as evaluations, ensuring high reliability and interpretability. Furthermore, we report AlignBench evaluated by CritiqueLLM, a dedicated Chinese evaluator LLM that recovers 95% of GPT-4's evaluation ability. We will provide public APIs for evaluating AlignBench with CritiqueLLM to facilitate the evaluation of LLMs' Chinese alignment. All evaluation codes, data, and LLM generations are available at https://github.com/THUDM/AlignBench.

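Summary
Alignment has become a critical step in turning large language models (LLMs) into helpful assistants. However, the evaluation of alignment for emerging Chinese LLMs still has a significant gap, calling for evaluations that are grounded in real scenarios, open-ended, challenging, and automatic. To fill this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating the alignment of Chinese LLMs. The benchmark is built with a human-in-the-loop data curation pipeline and uses a rule-calibrated, multi-dimensional LLM-as-Judge with Chain-of-Thought to generate explanations and final ratings, ensuring high reliability and interpretability. We also report AlignBench results evaluated by CritiqueLLM, a dedicated Chinese evaluator LLM that recovers 95% of GPT-4's evaluation ability. We will provide public APIs for evaluating AlignBench with CritiqueLLM. All evaluation code, data, and LLM generations are available at https://github.com/THUDM/AlignBench.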

VREM-FL: Mobility-Aware Computation-Scheduling Co-Design for Vehicular Federated Learning

paper_url: http://arxiv.org/abs/2311.18741
repo_url: None
paper_authors: Luca Ballotta, Nicolò Dal Fabbro, Giovanni Perin, Luca Schenato, Michele Rossi, Giuseppe Piro
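for: This paper proposes VREM-FL, a computation-scheduling co-design for vehicular federated learning, motivated by the data that assisted and autonomous vehicles collect from their onboard sensors.
methods: The method combines federated learning with vehicle mobility and estimated 5G radio environment maps. It jointly optimizes the global model learned at the server and the allocation of communication resources by scheduling local computation and model-update transmission at the vehicles.
results: Experiments show that using radio environment maps improves federated training. Compared with literature benchmarks, VREM-FL reduces learning time by 28% for a linear regression model and doubles the number of model updates within the same time window for a deep neural network on semantic image segmentation.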

Abstract
Assisted and autonomous driving are rapidly gaining momentum, and will soon become a reality. Among their key enablers, artificial intelligence and machine learning are expected to play a prominent role, also thanks to the massive amount of data that smart vehicles will collect from their onboard sensors. In this domain, federated learning is one of the most effective and promising techniques for training global machine learning models, while preserving data privacy at the vehicles and optimizing communications resource usage. In this work, we propose VREM-FL, a computation-scheduling co-design for vehicular federated learning that leverages mobility of vehicles in conjunction with estimated 5G radio environment maps. VREM-FL jointly optimizes the global model learned at the server while wisely allocating communication resources. This is achieved by orchestrating local computations at the vehicles in conjunction with the transmission of their local model updates in an adaptive and predictive fashion, by exploiting radio channel maps. The proposed algorithm can be tuned to trade model training time for radio resource usage. Experimental results demonstrate the efficacy of utilizing radio maps. VREM-FL outperforms literature benchmarks for both a linear regression model (learning time reduced by 28%) and a deep neural network for a semantic image segmentation task (doubling the number of model updates within the same time window).

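Summary
Assisted and autonomous driving are rapidly gaining momentum and will soon become a reality. Artificial intelligence and machine learning are among their key enablers, helped by the massive amount of data that smart vehicles collect from their onboard sensors. In this domain, federated learning is one of the most effective and promising techniques for training global machine learning models while preserving data privacy at the vehicles and optimizing the use of communication resources. In this work, we propose VREM-FL, a computation-scheduling co-design for vehicular federated learning that leverages vehicle mobility together with estimated 5G radio environment maps. VREM-FL jointly optimizes the global model learned at the server and the allocation of communication resources by orchestrating local computation at the vehicles and the transmission of their local model updates in an adaptive and predictive fashion, exploiting radio channel maps. The algorithm can be tuned to trade model training time for radio resource usage. Experiments demonstrate the benefit of using radio maps. VREM-FL outperforms literature benchmarks for both a linear regression model (learning time reduced by 28%) and a deep neural network for semantic image segmentation (doubling the number of model updates within the same time window).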

Controlgym: Large-Scale Safety-Critical Control Environments for Benchmarking Reinforcement Learning Algorithms

paper_url: http://arxiv.org/abs/2311.18736
repo_url: https://github.com/xiangyuan-zhang/controlgym
paper_authors: Xiangyuan Zhang, Weichao Mao, Saviz Mowlavi, Mouhacine Benosman, Tamer Başar
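for: This paper presents controlgym, a library of 36 safety-critical industrial control settings and 10 partial differential equation (PDE)-based control problems.
methods: The library is integrated within the OpenAI Gym/Gymnasium (Gym) framework, so standard reinforcement learning (RL) algorithms such as stable-baselines3 can be applied directly.
results: The control environments have continuous, unbounded action and observation spaces that match real-world control applications, and the PDE environments let users extend the state dimensionality of the system toward infinity while preserving the intrinsic dynamics. These properties make it possible to study the scalability and stability of RL algorithms for control.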

Abstract
We introduce controlgym, a library of thirty-six safety-critical industrial control settings, and ten infinite-dimensional partial differential equation (PDE)-based control problems. Integrated within the OpenAI Gym/Gymnasium (Gym) framework, controlgym allows direct applications of standard reinforcement learning (RL) algorithms like stable-baselines3. Our control environments complement those in Gym with continuous, unbounded action and observation spaces, motivated by real-world control applications. Moreover, the PDE control environments uniquely allow the users to extend the state dimensionality of the system to infinity while preserving the intrinsic dynamics. This feature is crucial for evaluating the scalability of RL algorithms for control. This project serves the learning for dynamics & control (L4DC) community, aiming to explore key questions: the convergence of RL algorithms in learning control policies; the stability and robustness issues of learning-based controllers; and the scalability of RL algorithms to high- and potentially infinite-dimensional systems. We open-source the controlgym project at https://github.com/xiangyuan-zhang/controlgym.

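Summary
We introduce controlgym, a library of thirty-six safety-critical industrial control settings and ten infinite-dimensional partial differential equation (PDE)-based control problems. The library is integrated within the OpenAI Gym/Gymnasium (Gym) framework, so standard reinforcement learning (RL) algorithms such as stable-baselines3 can be applied directly. Our control environments complement those in Gym with continuous, unbounded action and observation spaces, motivated by real-world control applications. In addition, the PDE control environments let users extend the state dimensionality of the system to infinity while preserving the intrinsic dynamics, a feature that is crucial for evaluating the scalability of RL algorithms. The project serves the learning for dynamics & control (L4DC) community and aims to explore key questions: the convergence of RL algorithms in learning control policies; the stability and robustness of learning-based controllers; and the scalability of RL algorithms to high- and potentially infinite-dimensional systems. The controlgym project is open-sourced at https://github.com/xiangyuan-zhang/controlgym.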

Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization

paper_url: http://arxiv.org/abs/2311.18703
repo_url: None
paper_authors: Daniel Jarne Ornia, Giannis Delimpaltadakis, Jens Kober, Javier Alonso-Mora
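for: Making the behavior of RL agents more predictable.
methods: Uses the state sequence entropy rate as a predictability measure, and introduces an action-dependent surrogate entropy so that PG methods can handle the policy-dependent entropy reward.
results: On RL tasks inspired by human-robot use-cases, the method yields agents with more predictable behavior while achieving near-optimal rewards.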

Abstract
In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors, and are often pushed (through e.g. policy entropy regularization) to randomize their actions in favor of exploration. From a human perspective, this makes RL agents hard to interpret and predict, and from a safety perspective, even harder to formally verify. We propose a novel method to induce predictable behavior in RL agents, referred to as Predictability-Aware RL (PA-RL), which employs the state sequence entropy rate as a predictability measure. We show how the entropy rate can be formulated as an average reward objective, and since its entropy reward function is policy-dependent, we introduce an action-dependent surrogate entropy enabling the use of PG methods. We prove that deterministic policies minimizing the average surrogate reward exist and also minimize the actual entropy rate, and show how, given a learned dynamical model, we are able to approximate the value function associated to the true entropy rate. Finally, we demonstrate the effectiveness of the approach in RL tasks inspired by human-robot use-cases, and show how it produces agents with more predictable behavior while achieving near-optimal rewards.

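Summary
In reinforcement learning (RL), agents have no incentive to exhibit predictable behavior and are often pushed (for example, through policy entropy regularization) to randomize their actions for the sake of exploration. From a human perspective, this makes RL agents hard to interpret and predict; from a safety perspective, it makes them even harder to formally verify. We propose a novel method, Predictability-Aware RL (PA-RL), which uses the state sequence entropy rate as a predictability measure. We show that the entropy rate can be formulated as an average reward objective and, because its entropy reward function is policy-dependent, we introduce an action-dependent surrogate entropy. We prove that deterministic policies minimizing the average surrogate reward exist and also minimize the actual entropy rate. Finally, we evaluate the approach on RL tasks inspired by human-robot use-cases and show that it produces agents with more predictable behavior while achieving near-optimal rewards.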
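
As an illustration of the predictability measure, the entropy rate of a finite Markov chain with transition matrix P and stationary distribution mu is H = -sum_i mu_i sum_j P_ij log P_ij. The sketch below computes it for small chains; it is not the PA-RL algorithm, and it assumes the chain is ergodic so that plain power iteration from the uniform distribution converges to mu.

```python
import math


def stationary_distribution(P, iters=1000):
    """Approximate the stationary distribution by power iteration (assumes an ergodic chain)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def entropy_rate(P):
    """Entropy rate in nats: -sum_i mu_i * sum_j P_ij * log(P_ij), skipping zero entries."""
    mu = stationary_distribution(P)
    return -sum(
        mu[i] * p * math.log(p) for i, row in enumerate(P) for p in row if p > 0
    )


# A maximally random two-state chain versus a "sticky", more predictable one.
uniform_chain = [[0.5, 0.5], [0.5, 0.5]]
sticky_chain = [[0.9, 0.1], [0.1, 0.9]]
h_uniform = entropy_rate(uniform_chain)
h_sticky = entropy_rate(sticky_chain)
```

The sticky chain has the lower entropy rate, matching the intuition that a policy inducing such state dynamics is easier for an observer to predict.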

CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation

paper_url: http://arxiv.org/abs/2311.18702
repo_url: https://github.com/thu-coai/critiquellm
paper_authors: Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang
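for: This study investigates the key factors of LLM-based evaluation models, such as scaling properties, and assesses whether they can replace GPT-4's evaluation in practical scenarios.
methods: The study proposes a new critique generation model, CritiqueLLM, which includes a dialogue-based prompting method for obtaining high-quality referenced / reference-free evaluation data.
results: Experiments show that the model matches GPT-4's evaluation performance, especially in system-level correlations, and outperforms GPT-4 in 3 out of 8 tasks in a challenging reference-free setting. Detailed analysis shows promising scaling properties in the quality of generated critiques, and the generated critiques can serve as feedback to directly improve the generation quality of LLMs.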

Abstract
Since the natural language processing (NLP) community started to make large language models (LLMs), such as GPT-4, act as a critic to evaluate the quality of generated texts, most of them only train a critique generation model of a specific scale on specific datasets. We argue that a comprehensive investigation on the key factor of LLM-based evaluation models, such as scaling properties, is lacking, so that it is still inconclusive whether these models have potential to replace GPT-4's evaluation in practical scenarios. In this paper, we propose a new critique generation model called CritiqueLLM, which includes a dialogue-based prompting method for high-quality referenced / reference-free evaluation data. Experimental results show that our model can achieve comparable evaluation performance to GPT-4 especially in system-level correlations, and even outperform GPT-4 in 3 out of 8 tasks in a challenging reference-free setting. We conduct detailed analysis to show promising scaling properties of our model in the quality of generated critiques. We also demonstrate that our generated critiques can act as scalable feedback to directly improve the generation quality of LLMs.

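Summary
Since the natural language processing (NLP) community began using large language models (LLMs) as critics to evaluate the quality of generated text, most work has trained a critique generation model of one specific scale on specific datasets. We argue that a comprehensive investigation of the key factors of LLM-based evaluation models, such as scaling properties, is lacking, so it remains inconclusive whether these models can replace GPT-4's evaluation in practical scenarios. In this paper, we propose a new critique generation model, CritiqueLLM, which includes a dialogue-based prompting method for obtaining high-quality referenced / reference-free evaluation data. Experiments show that our model achieves evaluation performance comparable to GPT-4, especially in system-level correlations, and even outperforms GPT-4 in 3 out of 8 tasks in a challenging reference-free setting. We conduct detailed analysis to show promising scaling properties in the quality of generated critiques. We also show that the generated critiques can act as feedback to directly improve the generation quality of LLMs.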

Evaluating Large Language Model Creativity from a Literary Perspective

paper_url: http://arxiv.org/abs/2312.03746
repo_url: None
paper_authors: Murray Shanahan, Catherine Clarke
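for: This study assesses, through a single in-depth case study, whether large language models (LLMs) can serve as assistive tools in creative writing.
methods: The authors use interactive, multi-voice prompting strategies that interleave background descriptions (scene setting, plot elements), instructions that guide composition, samples of text in the target style, and critical discussion of those samples.
results: The findings support the view that the sophistication of what an LLM produces mirrors the sophistication of the prompting.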

Abstract
This paper assesses the potential for large language models (LLMs) to serve as assistive tools in the creative writing process, by means of a single, in-depth case study. In the course of the study, we develop interactive and multi-voice prompting strategies that interleave background descriptions (scene setting, plot elements), instructions that guide composition, samples of text in the target style, and critical discussion of the given samples. We qualitatively evaluate the results from a literary critical perspective, as well as from the standpoint of computational creativity (a sub-field of artificial intelligence). Our findings lend support to the view that the sophistication of the results that can be achieved with an LLM mirrors the sophistication of the prompting.


paper_url: http://arxiv.org/abs/2311.18676
repo_url: None
paper_authors: Aryaman Rao, Parth Singh, Dinesh Kumar Vishwakarma, Mukesh Prasad
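for: This study proposes a Discretized Quantum-based Salp Swarm Algorithm (DQSSA) for optimizing influence diffusion in social networks.
methods: The algorithm discretizes meta-heuristic algorithms and adds quantum-inspired enhancements to address premature convergence and low efficacy.
results: Experiments on four real-world datasets show that DQSSA outperforms established state-of-the-art algorithms.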

Abstract
Influence Maximization is the task of selecting optimal nodes maximising the influence spread in social networks. This study proposes a Discretized Quantum-based Salp Swarm Algorithm (DQSSA) for optimizing influence diffusion in social networks. By discretizing meta-heuristic algorithms and infusing them with quantum-inspired enhancements, we address issues like premature convergence and low efficacy. The proposed method, guided by quantum principles, offers a promising solution for Influence Maximisation. Experiments on four real-world datasets reveal DQSSA's superior performance as compared to established cutting-edge algorithms.

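Summary
Influence Maximization is the task of selecting the optimal nodes that maximize the spread of influence in social networks. This study proposes a Discretized Quantum-based Salp Swarm Algorithm (DQSSA) for optimizing influence diffusion in social networks. By discretizing meta-heuristic algorithms and adding quantum-inspired enhancements, we address premature convergence and low efficacy. The proposed method, guided by quantum principles, offers a promising solution for Influence Maximization. Experiments on four real-world datasets show that DQSSA clearly outperforms established state-of-the-art algorithms.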

Multi-task learning with cross-task consistency for improved depth estimation in colonoscopy

paper_url: http://arxiv.org/abs/2311.18664
repo_url: None
paper_authors: Pedro Esteban Chavarrias Solano, Andrew Bulpitt, Venkataraman Subramanian, Sharib Ali
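for: Colonoscopy screening, which assesses abnormalities in the colon and rectum such as ulcers and cancerous polyps.
methods: A multi-task learning (MTL) approach with a shared encoder and two decoders (a surface normal decoder and a depth estimator decoder), plus attention mechanisms to enhance global context awareness.
results: Compared with the BTS baseline, the proposed method improves relative error by 14.17% and $\delta_{1}$ accuracy by 10.4%. All experiments are conducted on the recently released C3VD dataset.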

Abstract
Colonoscopy screening is the gold standard procedure for assessing abnormalities in the colon and rectum, such as ulcers and cancerous polyps. Measuring the abnormal mucosal area and its 3D reconstruction can help quantify the surveyed area and objectively evaluate disease burden. However, due to the complex topology of these organs and variable physical conditions (for example, lighting, large homogeneous texture, and image modality), estimating distance from the camera (aka depth) is highly challenging. Moreover, most colonoscopic video acquisition is monocular, making the depth estimation a non-trivial problem. While methods in computer vision for depth estimation have been proposed and advanced on natural scene datasets, the efficacy of these techniques has not been widely quantified on colonoscopy datasets. As the colonic mucosa has several low-texture regions that are not well pronounced, learning representations from an auxiliary task can improve salient feature extraction, allowing estimation of accurate camera depths. In this work, we propose to develop a novel multi-task learning (MTL) approach with a shared encoder and two decoders, namely a surface normal decoder and a depth estimator decoder. Our depth estimator incorporates attention mechanisms to enhance global context awareness. We leverage the surface normal prediction to improve geometric feature extraction. Also, we apply a cross-task consistency loss among the two geometrically related tasks, surface normal and camera depth. We demonstrate an improvement of 14.17% on relative error and 10.4% improvement on $\delta_{1}$ accuracy over the most accurate baseline state-of-the-art BTS approach. All experiments are conducted on a recently released C3VD dataset; thus, we provide a first benchmark of state-of-the-art methods.

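Summary
Colonoscopy screening is the standard procedure for assessing abnormalities of the colon and rectum, such as ulcers and cancerous polyps. Measuring the abnormal mucosal area and reconstructing it in 3D can help evaluate disease burden. However, because of the complex topology of these organs and variable physical conditions (such as lighting and large homogeneous texture), accurately estimating depth is difficult. Most colonoscopic video is also monocular, which makes depth estimation harder still. Depth estimation methods from computer vision have advanced on natural scene datasets, but their efficacy on colonoscopy data has not been widely evaluated. Because the colonic mucosa has many low-texture regions, learning representations from an auxiliary task can improve salient feature extraction and enable accurate camera depth estimates. In this work, we propose a multi-task learning (MTL) approach with a shared encoder and two decoders: a surface normal decoder and a depth estimator decoder. The depth estimator incorporates attention mechanisms to enhance global context awareness, and the surface normal prediction is used to improve geometric feature extraction. We also apply a cross-task consistency loss between the two geometrically related tasks. The method improves relative error by 14.17% and $\delta_{1}$ accuracy by 10.4% over the strongest baseline. All experiments are conducted on the C3VD dataset, providing a first benchmark of state-of-the-art methods.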
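
For readers unfamiliar with the two reported metrics, the sketch below shows how they are commonly defined in monocular depth estimation: absolute relative error is the mean of |pred - gt| / gt, and $\delta_{1}$ accuracy is the fraction of pixels where max(pred/gt, gt/pred) is below 1.25 (higher is better). These are the standard definitions and are assumed here; the paper's exact evaluation protocol (masking, depth range, scaling) is not given in the abstract, and the depth values are made up.

```python
def abs_rel_error(pred, gt):
    """Mean absolute relative error between predicted and ground-truth depths."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of depths whose ratio to ground truth (either way) is under the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


# Made-up per-pixel depths (flattened), all strictly positive.
gt = [1.0, 2.0, 4.0, 5.0]
pred = [1.0, 2.2, 6.0, 4.5]

rel = abs_rel_error(pred, gt)
d1 = delta_accuracy(pred, gt)
```

Here three of the four predictions fall within the 1.25 ratio band, while the third (6.0 against 4.0) both misses the band and dominates the relative error.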

Choosing the parameter of the Fermat distance: navigating geometry and noise

paper_url: http://arxiv.org/abs/2311.18663
repo_url: None
paper_authors: Frédéric Chazal, Laure Ferraris, Pablo Groisman, Matthieu Jonckheere, Frédéric Pascal, Facundo Sapienza
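for: This study concerns the Fermat distance, a tool for machine learning tasks when no natural distance is available or when results obtained with Euclidean distances can be improved.
methods: The Fermat distance depends on a parameter $\alpha$ that strongly affects the performance of downstream tasks; the study examines how to select it, both theoretically and through simulations.
results: A suitable $\alpha$ is large enough to capture the geometric and statistical structure of the dataset while small enough to avoid the effects of noise.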

Abstract
The Fermat distance has been recently established as a useful tool for machine learning tasks when a natural distance is not directly available to the practitioner or to improve the results given by Euclidean distances by exploiting the geometrical and statistical properties of the dataset. This distance depends on a parameter $\alpha$ that greatly impacts the performance of subsequent tasks. Ideally, the value of $\alpha$ should be large enough to navigate the geometric intricacies inherent to the problem. At the same time, it should remain restrained enough to sidestep any deleterious ramifications stemming from noise during the process of distance estimation. We study both theoretically and through simulations how to select this parameter.

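Summary
The Fermat distance has recently been recognized as a useful tool for machine learning tasks when a natural distance is not directly available to the practitioner, or for improving on the results given by Euclidean distances. The distance depends on a parameter $\alpha$ that strongly affects the performance of downstream tasks. Ideally, $\alpha$ should be large enough to capture the geometry of the dataset, yet small enough to avoid the adverse effects of noise during distance estimation. We study how to choose this parameter both theoretically and through simulations.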
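
To show where $\alpha$ enters, here is a minimal sketch of the sample Fermat distance as it is usually defined: the shortest-path cost between two sample points over the complete graph on the sample, with each edge weighted by the Euclidean distance raised to the power $\alpha$. The sketch uses Floyd-Warshall for clarity rather than speed and omits any normalization constant; both choices, and the toy points, are assumptions for illustration.

```python
import math


def fermat_distances(points, alpha):
    """All-pairs sample Fermat distances: shortest paths with edge cost |x - y| ** alpha."""
    n = len(points)
    dist = [[math.dist(p, q) ** alpha for q in points] for p in points]
    # Floyd-Warshall relaxation over every possible intermediate sample point.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Three collinear sample points with an intermediate stop between the endpoints.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
d_alpha1 = fermat_distances(points, alpha=1.0)
d_alpha2 = fermat_distances(points, alpha=2.0)
```

With alpha = 1 the distance between the endpoints is the Euclidean 3.0. With alpha = 2 the direct hop costs 9.0, so the path through the middle point (1.0 + 4.0 = 5.0) wins: larger values of alpha favor paths made of many short hops through densely sampled regions, which is also what makes the distance sensitive to noisy points.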

Solving the Team Orienteering Problem with Transformers

paper_url: http://arxiv.org/abs/2311.18662
repo_url: https://github.com/danifuertes/top_transformer
paper_authors: Daniel Fuertes, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García
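for: This study presents a multi-agent route planning system that solves the Team Orienteering Problem quickly and accurately.
methods: The system is based on a centralized Transformer neural network that learns to encode the scenario (modeled as a graph) and the context of the agents in order to produce fast and accurate solutions.
results: Several experiments show that the proposed system outperforms most state-of-the-art works in computation speed. The code is available at http://gti.ssr.upm.es/data.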

Abstract
Route planning for a fleet of vehicles is an important task in applications such as package delivery, surveillance, or transportation. This problem is usually modeled as a Combinatorial Optimization problem known as the Team Orienteering Problem. The most popular Team Orienteering Problem solvers are mainly based on either linear programming, which provides accurate solutions at the cost of a large computation time that grows with the size of the problem, or heuristic methods, which usually find suboptimal solutions in a shorter amount of time. In this paper, a multi-agent route planning system capable of solving the Team Orienteering Problem in a very fast and accurate manner is presented. The proposed system is based on a centralized Transformer neural network that can learn to encode the scenario (modeled as a graph) and the context of the agents to provide fast and accurate solutions. Several experiments have been performed to demonstrate that the presented system can outperform most of the state-of-the-art works in terms of computation speed. In addition, the code is publicly available at http://gti.ssr.upm.es/data.

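Summary
Route planning for a fleet of vehicles is an important task in applications such as package delivery, surveillance, or transportation. The problem is usually modeled as the Team Orienteering Problem and is typically solved with linear programming or heuristic methods. This paper presents a multi-agent route planning system that solves the Team Orienteering Problem quickly and accurately. The system is based on a centralized Transformer neural network that learns to encode the scenario (modeled as a graph) and the context of the agents to provide fast and accurate solutions. Experiments show that the proposed system outperforms most existing works in computation speed. The code is publicly available at http://gti.ssr.upm.es/data.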

Detailed Human-Centric Text Description-Driven Large Scene Synthesis

paper_url: http://arxiv.org/abs/2311.18654
repo_url: None
paper_authors: Gwanghyun Kim, Dong Un Kang, Hoigi Seo, Hayeon Kim, Se Young Chun
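for: This paper proposes a text-driven method for synthesizing large scene images with improved controllability and naturalness.
methods: The method has three main components: 1) hierarchical keypoint-box layout generation using a large language model (LLM), 2) a view-wise conditioned joint diffusion process driven by the text description, and 3) pixel perturbation-based pyramidal interpolation that progressively refines the large scene.
results: Compared with prior text-to-large-scene synthesis methods, the approach shows strong faithfulness to detailed descriptions, superior controllability, and excellent naturalness in a global context.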

Abstract
Text-driven large scene image synthesis has made significant progress with diffusion models, but controlling it is challenging. While using additional spatial controls with corresponding texts has improved the controllability of large scene synthesis, it is still challenging to faithfully reflect detailed text descriptions without user-provided controls. Here, we propose DetText2Scene, a novel text-driven large-scale image synthesis with high faithfulness, controllability, and naturalness in a global context for the detailed human-centric text description. Our DetText2Scene consists of 1) hierarchical keypoint-box layout generation from the detailed description by leveraging large language model (LLM), 2) view-wise conditioned joint diffusion process to synthesize a large scene from the given detailed text with LLM-generated grounded keypoint-box layout and 3) pixel perturbation-based pyramidal interpolation to progressively refine the large scene for global coherence. Our DetText2Scene significantly outperforms prior arts in text-to-large scene synthesis qualitatively and quantitatively, demonstrating strong faithfulness with detailed descriptions, superior controllability, and excellent naturalness in a global context.

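Summary
Text-driven large scene image synthesis has made significant progress, but controlling it remains challenging. Adding spatial controls with corresponding texts improves controllability, yet it is still hard to faithfully reflect detailed text descriptions. Here, we propose DetText2Scene, a new text-driven large-scale image synthesis method with high faithfulness, controllability, and naturalness. DetText2Scene consists of three parts: 1) hierarchical keypoint-box layout generation from the detailed description using a large language model (LLM); 2) a view-wise conditioned joint diffusion process that synthesizes the scene from the detailed text together with the LLM-generated keypoint-box layout; and 3) pixel perturbation-based pyramidal interpolation that progressively refines the scene for global coherence. Compared with prior text-to-large-scene methods, DetText2Scene improves markedly both qualitatively and quantitatively, showing strong faithfulness, high controllability, and naturalness.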

FedEmb: A Vertical and Hybrid Federated Learning Algorithm using Network And Feature Embedding Aggregation

paper_url: http://arxiv.org/abs/2312.00102
repo_url: None
paper_authors: Fanfei Meng, Lele Zhang, Yu Chen, Yuxin Wang
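for: This study proposes FedEmb, a generalized algorithm for vertical and hybrid DNN-based federated learning, aiming to improve inference accuracy, privacy-preserving properties, and communication efficiency.
methods: As the title indicates, FedEmb aggregates network and feature embeddings so that models can be trained over vertically and hybrid-partitioned data while client data stays private.
results: Experiments show that FedEmb effectively handles decentralized problems with split feature and subject spaces, improving inference accuracy by 0.3% to 4.2% and reducing time complexity by 88.9% relative to the vertical baseline method.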

Abstract
Federated learning (FL) is an emerging paradigm for decentralized training of machine learning models on distributed clients, without revealing the data to the central server. The learning scheme may be horizontal, vertical or hybrid (both vertical and horizontal). Most existing research work with deep neural network (DNN) modelling is focused on horizontal data distributions, while vertical and hybrid schemes are much less studied. In this paper, we propose a generalized algorithm, FedEmb, for modelling vertical and hybrid DNN-based learning. Our algorithm is characterised by higher inference accuracy, stronger privacy-preserving properties, and lower client-server communication bandwidth demands as compared with existing work. The experimental results show that FedEmb is an effective method for tackling both split feature & subject space decentralized problems: it shows a 0.3% to 4.2% inference accuracy improvement with limited privacy revealing for datasets stored in local clients, and reduces time complexity by 88.9% over the vertical baseline method.

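Summary
Federated learning (FL) is an emerging paradigm for collaboratively training machine learning models on distributed clients without revealing data to the central server. The learning scheme can be horizontal, vertical, or hybrid (both). Most existing work on deep neural network (DNN) modelling targets horizontal data distributions, while vertical and hybrid schemes have received far less attention. In this paper, we propose a generalized algorithm, FedEmb, for vertical and hybrid DNN-based learning. The algorithm offers higher inference accuracy, stronger privacy preservation, and lower client-server communication bandwidth demands. Experiments show that FedEmb is an effective way to tackle decentralized problems with split feature & subject spaces: for datasets stored on local clients, it improves inference accuracy by 0.3% to 4.2% with limited privacy leakage, and it reduces time complexity by 88.9% compared with the vertical baseline method.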

Towards Unsupervised Representation Learning: Learning, Evaluating and Transferring Visual Representations

paper_url: http://arxiv.org/abs/2312.00101
repo_url: https://github.com/bonifazstuhr/feamgan
paper_authors: Bonifaz Stuhr
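for: This dissertation studies unsupervised representation learning, in which representations are learned from data without annotation-based signals.
methods: It designs backpropagation-free Convolutional Self-Organizing Neural Networks (CSNNs) that use self-organization- and Hebbian-based learning rules to learn convolutional kernels and masks, yielding deeper backpropagation-free models.
results: It proposes new methods for unsupervised visual representation learning and reports favorable experimental results on tasks such as lane detection.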

Abstract
Unsupervised representation learning aims at finding methods that learn representations from data without annotation-based signals. Abstaining from annotations not only leads to economic benefits but may - and to some extent already does - result in advantages regarding the representation's structure, robustness, and generalizability to different tasks. In the long run, unsupervised methods are expected to surpass their supervised counterparts due to the reduction of human intervention and the inherently more general setup that does not bias the optimization towards an objective originating from specific annotation-based signals. While major advantages of unsupervised representation learning have been recently observed in natural language processing, supervised methods still dominate in vision domains for most tasks. In this dissertation, we contribute to the field of unsupervised (visual) representation learning from three perspectives: (i) Learning representations: We design unsupervised, backpropagation-free Convolutional Self-Organizing Neural Networks (CSNNs) that utilize self-organization- and Hebbian-based learning rules to learn convolutional kernels and masks to achieve deeper backpropagation-free models. (ii) Evaluating representations: We build upon the widely used (non-)linear evaluation protocol to define pretext- and target-objective-independent metrics for measuring and investigating the objective function mismatch between various unsupervised pretext tasks and target tasks. (iii) Transferring representations: We contribute CARLANE, the first 3-way sim-to-real domain adaptation benchmark for 2D lane detection, and a method based on prototypical self-supervised learning. Finally, we contribute a content-consistent unpaired image-to-image translation method that utilizes masks, global and local discriminators, and similarity sampling to mitigate content inconsistencies.

Summary
Unsupervised representation learning aims to find methods that learn representations from data without relying on annotation-based signals. By not using annotations, not only can we save costs, but we may also gain advantages in terms of the structure, robustness, and generalizability of the representations to different tasks. In the long run, unsupervised methods are expected to surpass supervised methods due to the reduction of human intervention and the inherently more general setup that does not bias the optimization towards a specific objective originating from annotation-based signals. While major advantages of unsupervised representation learning have been recently observed in natural language processing, supervised methods still dominate in vision domains for most tasks. In this dissertation, we contribute to the field of unsupervised (visual) representation learning from three perspectives: (i) Learning representations: We design unsupervised, backpropagation-free Convolutional Self-Organizing Neural Networks (CSNNs) that use self-organization- and Hebbian-based learning rules to learn convolutional kernels and masks to achieve deeper backpropagation-free models. (ii) Evaluating representations: We build upon the widely used (non-)linear evaluation protocol to define pretext- and target-objective-independent metrics for measuring and investigating the objective function mismatch between various unsupervised pretext tasks and target tasks. (iii) Transferring representations: We contribute CARLANE, the first 3-way sim-to-real domain adaptation benchmark for 2D lane detection, and a method based on prototypical self-supervised learning. Finally, we contribute a content-consistent unpaired image-to-image translation method that utilizes masks, global and local discriminators, and similarity sampling to mitigate content inconsistencies.

Stochastic Vision Transformers with Wasserstein Distance-Aware Attention

paper_url: http://arxiv.org/abs/2311.18645
repo_url: None
paper_authors: Franciskus Xaverius Erick, Mina Rezaei, Johanna Paula Müller, Bernhard Kainz
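for: This work proposes a new stochastic vision transformer that brings uncertainty and distance awareness into self-supervised learning (SSL).
methods: Image patches are encoded as elliptical Gaussian distributional embeddings, and attention matrices are computed with Wasserstein distance-based attention. A Wasserstein distance-based regularization term is added in both pre-training and fine-tuning to incorporate distance awareness into the latent representations.
results: Across tasks including in-distribution generalization, out-of-distribution detection, dataset corruption, semi-supervised settings, and transfer learning, the method achieves superior accuracy and calibration, surpassing the self-supervised baseline.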

Abstract
Self-supervised learning is one of the most promising approaches to acquiring knowledge from limited labeled data. Despite the substantial advancements made in recent years, self-supervised models have posed a challenge to practitioners, as they do not readily provide insight into the model's confidence and uncertainty. Tackling this issue is no simple feat, primarily due to the complexity involved in implementing techniques that can make use of the latent representations learned during pre-training without relying on explicit labels. Motivated by this, we introduce a new stochastic vision transformer that integrates uncertainty and distance awareness into self-supervised learning (SSL) pipelines. Instead of the conventional deterministic vector embedding, our novel stochastic vision transformer encodes image patches into elliptical Gaussian distributional embeddings. Notably, the attention matrices of these stochastic representational embeddings are computed using Wasserstein distance-based attention, effectively capitalizing on the distributional nature of these embeddings. Additionally, we propose a regularization term based on Wasserstein distance for both pre-training and fine-tuning processes, thereby incorporating distance awareness into latent representations. We perform extensive experiments across different tasks such as in-distribution generalization, out-of-distribution detection, dataset corruption, semi-supervised settings, and transfer learning to other datasets and tasks. Our proposed method achieves superior accuracy and calibration, surpassing the self-supervised baseline in a wide range of experiments on a variety of datasets.
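
The Wasserstein distance-based attention described above can be illustrated with a small sketch. It assumes diagonal-covariance Gaussians, for which the squared 2-Wasserstein distance has the closed form ||mu_i - mu_j||^2 + ||sigma_i - sigma_j||^2, and it turns negative distances into attention weights with a softmax. The function names, the temperature parameter, and the diagonal-covariance simplification are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def w2_squared(mu, sigma):
    """Pairwise squared 2-Wasserstein distance between diagonal Gaussians.

    mu, sigma: arrays of shape (n_tokens, dim); sigma holds standard deviations.
    """
    d_mu = mu[:, None, :] - mu[None, :, :]
    d_sigma = sigma[:, None, :] - sigma[None, :, :]
    return (d_mu ** 2).sum(-1) + (d_sigma ** 2).sum(-1)

def wasserstein_attention(mu, sigma, temperature=1.0):
    """Row-stochastic attention weights: closer distributions attend more."""
    logits = -w2_squared(mu, sigma) / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

# Hypothetical patch embeddings: 4 patches, 8 dimensions each.
rng = np.random.default_rng(0)
mu = rng.normal(size=(4, 8))
sigma = np.abs(rng.normal(size=(4, 8))) + 0.1
attn = wasserstein_attention(mu, sigma)
```

Each patch is at distance zero from itself, so in this sketch the largest weight in every row falls on the diagonal.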

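Summary
Self-supervised learning is one of the most promising approaches to acquiring knowledge from limited labeled data. Despite substantial progress in recent years, self-supervised models remain challenging for practitioners because they do not readily expose the model's confidence and uncertainty. Addressing this is difficult, mainly because it is complex to implement techniques that use the latent representations learned during pre-training without relying on explicit labels. We therefore introduce a new stochastic vision transformer that integrates uncertainty and distance awareness into self-supervised learning (SSL) pipelines. Instead of conventional deterministic vector embeddings, it encodes image patches as elliptical Gaussian distributional embeddings. Attention matrices are computed with Wasserstein distance-based attention, and a Wasserstein distance-based regularization term is added during pre-training and fine-tuning, incorporating distance awareness into the latent representations. We run extensive experiments on in-distribution generalization, out-of-distribution detection, dataset corruption, semi-supervised settings, and transfer learning. The proposed method achieves higher accuracy and better calibration, surpassing the self-supervised baseline across a wide range of experiments on a variety of datasets.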

Exploring the hierarchical structure of human plans via program generation

paper_url: http://arxiv.org/abs/2311.18644
repo_url: https://github.com/cgc/lightbot-grammar-induction
paper_authors: Carlos G. Correa, Sophia Sanborn, Mark K. Ho, Frederick Callaway, Nathaniel D. Daw, Thomas L. Griffiths
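for: This paper explores how people form hierarchically structured plans, using an experimental paradigm that makes hierarchical representations observable.
methods: Participants create programs that produce sequences of actions in a language with explicit hierarchical structure.
results: People are sensitive to both utility maximization and minimum description length (MDL), but they prefer programs with reuse beyond what MDL predicts. The authors extend the MDL account into a generative model over programs that explains this preference and gives the best prediction of human behavior.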

Abstract
Human behavior is inherently hierarchical, resulting from the decomposition of a task into subtasks or an abstract action into concrete actions. However, behavior is typically measured as a sequence of actions, which makes it difficult to infer its hierarchical structure. In this paper, we explore how people form hierarchically-structured plans, using an experimental paradigm that makes hierarchical representations observable: participants create programs that produce sequences of actions in a language with explicit hierarchical structure. This task lets us test two well-established principles of human behavior: utility maximization (i.e. using fewer actions) and minimum description length (MDL; i.e. having a shorter program). We find that humans are sensitive to both metrics, but that both accounts fail to predict a qualitative feature of human-created programs, namely that people prefer programs with reuse over and above the predictions of MDL. We formalize this preference for reuse by extending the MDL account into a generative model over programs, modeling hierarchy choice as the induction of a grammar over actions. Our account can explain the preference for reuse and provides the best prediction of human behavior, going beyond simple accounts of compressibility to highlight a principle that guides hierarchical planning.
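
A toy sketch may help make the two metrics concrete. It assumes a program is a dictionary of named routines whose bodies mix primitive actions and calls to other routines, uses the number of produced actions as the utility measure, and uses the total number of written tokens as a crude proxy for description length. The action names and this token-count proxy are hypothetical and are not the paper's language or model.

```python
def expand(program, name="main"):
    """Flatten a hierarchical program into its primitive action sequence."""
    actions = []
    for token in program[name]:
        if token in program:          # token is a call to a subroutine
            actions.extend(expand(program, token))
        else:                         # token is a primitive action
            actions.append(token)
    return actions

def description_length(program):
    """Crude MDL proxy: total number of tokens written across all routines."""
    return sum(len(body) for body in program.values())

# Two hypothetical programs that produce the same behavior.
flat = {"main": ["walk", "jump", "light"] * 3}
hierarchical = {
    "main": ["step", "step", "step"],
    "step": ["walk", "jump", "light"],
}
```

Both programs expand to the same nine actions, but the hierarchical one is written with fewer tokens because the `step` routine is reused.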

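Summary
Human behavior is inherently hierarchical, arising from the decomposition of a task into subtasks or of an abstract action into concrete actions. However, behavior is usually measured as a sequence of actions, which makes its hierarchical structure hard to infer. In this paper we explore how people form hierarchically structured plans, using an experimental paradigm that makes hierarchical representations observable: participants create programs that produce sequences of actions in a language with explicit hierarchical structure. The task lets us test two well-established principles of human behavior: utility maximization (i.e. using fewer actions) and minimum description length (MDL; i.e. having a shorter program). We find that people are sensitive to both metrics, but that neither account predicts one feature of human-created programs, namely a preference for reuse beyond what MDL predicts. We formalize this preference by extending the MDL account into a generative model over programs, modeling hierarchy choice as the induction of a grammar over actions. Our account explains the preference for reuse and gives the best prediction of human behavior, going beyond simple accounts of compressibility to highlight a principle that guides hierarchical planning.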

Data-driven prediction of tool wear using Bayesian-regularized artificial neural networks

paper_url: http://arxiv.org/abs/2311.18620
repo_url: None
paper_authors: Tam T. Truong, Jay Airao, Panagiotis Karras, Faramarz Hojati, Bahman Azarhoushang, Ramin Aghababaei
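for: Predicting tool wear in order to reduce manufacturing costs and improve product quality.
methods: Bayesian Regularized Artificial Neural Networks (BRANNs) are used to precisely predict milling tool wear. BRANNs combine artificial neural networks (ANNs), which learn complex patterns, with Bayesian regularization, which handles uncertainty and prevents overfitting, resulting in a more generalized model.
results: An extensive experimental study covers four data sets: the NASA Ames milling dataset, the 2010 PHM Data Challenge dataset, the NUAA Ideahouse tool wear dataset, and an in-house Ti6Al4V end-milling dataset. The study examines the impact of input features, training data size, hidden units, training algorithms, and transfer functions, and shows that the proposed BRANN model outperforms existing state-of-the-art models in accuracy and reliability.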

Abstract
The prediction of tool wear helps minimize costs and enhance product quality in manufacturing. While existing data-driven models using machine learning and deep learning have contributed to the accurate prediction of tool wear, they often lack generality and require substantial training data for high accuracy. In this paper, we propose a new data-driven model that uses Bayesian Regularized Artificial Neural Networks (BRANNs) to precisely predict milling tool wear. BRANNs combine the strengths and leverage the benefits of artificial neural networks (ANNs) and Bayesian regularization, whereby ANNs learn complex patterns and Bayesian regularization handles uncertainty and prevents overfitting, resulting in a more generalized model. We treat both process parameters and monitoring sensor signals as BRANN input parameters. We conducted an extensive experimental study featuring four different experimental data sets, including the NASA Ames milling dataset, the 2010 PHM Data Challenge dataset, the NUAA Ideahouse tool wear dataset, and an in-house performed end-milling of the Ti6Al4V dataset. We inspect the impact of input features, training data size, hidden units, training algorithms, and transfer functions on the performance of the proposed BRANN model and demonstrate that it outperforms existing state-of-the-art models in terms of accuracy and reliability.
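
As a rough illustration of how Bayesian regularization discourages overfitting, the sketch below evaluates the standard regularized objective F(w) = beta * E_D + alpha * E_W, where E_D is the sum of squared prediction errors and E_W is the sum of squared weights. This is the textbook formulation and is assumed here, since the abstract does not spell out the paper's exact objective. In full Bayesian regularization alpha and beta are re-estimated during training, whereas this sketch keeps them fixed.

```python
import numpy as np

def regularized_objective(weights, predictions, targets, alpha, beta):
    """F(w) = beta * E_D + alpha * E_W with sum-of-squares terms."""
    e_d = np.sum((targets - predictions) ** 2)   # data misfit
    e_w = np.sum(weights ** 2)                   # weight penalty
    return beta * e_d + alpha * e_w

# Hypothetical values chosen only to exercise the function.
weights = np.array([1.0, 2.0])
predictions = np.array([1.0, 1.0])
targets = np.array([1.0, 3.0])
objective = regularized_objective(weights, predictions, targets, alpha=0.1, beta=1.0)
```

A larger alpha relative to beta pushes the optimum toward smaller weights, which is the mechanism that trades data fit for smoothness.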

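Summary
Predicting tool wear helps reduce costs and improve product quality in manufacturing. Existing data-driven models based on machine learning and deep learning have contributed to accurate tool wear prediction, but they often lack generality and need large amounts of training data to reach high accuracy. In this paper we propose a new data-driven model that uses Bayesian Regularized Artificial Neural Networks (BRANNs) to precisely predict milling tool wear. BRANNs combine artificial neural networks (ANNs) with Bayesian regularization: the ANNs learn complex patterns, while Bayesian regularization handles uncertainty and prevents overfitting, which yields a more generalized model. Both process parameters and monitoring sensor signals serve as BRANN inputs. We conducted an extensive experimental study on the NASA Ames milling dataset, the 2010 PHM Data Challenge dataset, the NUAA Ideahouse tool wear dataset, and an in-house Ti6Al4V end-milling dataset. We examine the impact of input features, training data size, hidden units, training algorithms, and transfer functions on the BRANN model and show that it outperforms existing state-of-the-art models in accuracy and reliability.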

Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing

paper_url: http://arxiv.org/abs/2311.18608
repo_url: https://github.com/HyelinNAM/ContrastiveDenoisingScore
paper_authors: Hyelin Nam, Gihyun Kwon, Geon Yeong Park, Jong Chul Ye
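for: This paper proposes an image editing method built on text-to-image diffusion models that preserves the structure of the original image while keeping content controllable.
methods: The method, Contrastive Denoising Score (CDS), modifies Delta Denoising Score (DDS), which is based on the Score Distillation Sampling (SDS) framework, by adding a CUT loss computed from intermediate self-attention features of the latent diffusion model.
results: The method enables zero-shot image-to-image translation and neural radiance field (NeRF) editing while balancing preservation of structural details against content transformation. Qualitative results and comparisons demonstrate its effectiveness.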

Abstract
With the remarkable advent of text-to-image diffusion models, image editing methods have become more diverse and continue to evolve. A promising recent approach in this realm is Delta Denoising Score (DDS) - an image editing technique based on the Score Distillation Sampling (SDS) framework that leverages the rich generative prior of text-to-image diffusion models. However, relying solely on the difference between scoring functions is insufficient for preserving specific structural elements from the original image, a crucial aspect of image editing. Inspired by the similarity and importance differences between DDS and the contrastive learning for unpaired image-to-image translation (CUT), here we present an embarrassingly simple yet very powerful modification of DDS, called Contrastive Denoising Score (CDS), for latent diffusion models (LDM). Specifically, to enforce structural correspondence between the input and output while maintaining the controllability of contents, we introduce a straightforward approach to regulate structural consistency using CUT loss within the DDS framework. To calculate this loss, instead of employing auxiliary networks, we utilize the intermediate features of LDM, in particular, those from the self-attention layers, which possess rich spatial information. Our approach enables zero-shot image-to-image translation and neural radiance field (NeRF) editing, achieving a well-balanced interplay between maintaining the structural details and transforming content. Qualitative results and comparisons demonstrate the effectiveness of our proposed method. Project page with code is available at https://hyelinnam.github.io/CDS/.
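
The structural-consistency term can be pictured as a patch-wise contrastive (InfoNCE) loss of the kind used in CUT: the feature of each output patch should match the source feature at the same location and differ from source features elsewhere. The sketch below works on plain arrays; in the paper the features come from LDM self-attention layers, which is not reproduced here. The function name, the temperature value, and the random features are assumptions for illustration.

```python
import numpy as np

def patch_nce_loss(src_feats, out_feats, tau=0.07):
    """InfoNCE over patches with positives at matching locations.

    src_feats, out_feats: arrays of shape (n_patches, dim).
    """
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    out = out_feats / np.linalg.norm(out_feats, axis=1, keepdims=True)
    logits = out @ src.T / tau                     # (n_patches, n_patches)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives on the diagonal

# Hypothetical features: 16 patches with 32-dimensional descriptors.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 32))
aligned = patch_nce_loss(feats, feats)          # structure preserved
shuffled = patch_nce_loss(feats, feats[::-1])   # patch locations scrambled
```

The loss is small when output patches line up with their source locations and grows when the spatial layout is scrambled, which is the behavior a structure-preserving regularizer needs.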

Summary
Traditional image editing methods have become more diverse and continue to evolve with the remarkable advent of text-to-image diffusion models. A promising recent approach is Delta Denoising Score (DDS), an image editing technique based on the Score Distillation Sampling (SDS) framework that leverages the rich generative prior of text-to-image diffusion models. However, relying solely on the difference between scoring functions is insufficient for preserving specific structural elements from the original image, a crucial aspect of image editing. Inspired by the similarity and importance differences between DDS and contrastive learning for unpaired image-to-image translation (CUT), we present an embarrassingly simple yet very powerful modification of DDS, called Contrastive Denoising Score (CDS), for latent diffusion models (LDM). Specifically, to enforce structural correspondence between the input and output while maintaining the controllability of contents, we introduce a straightforward approach to regulate structural consistency using CUT loss within the DDS framework. To calculate this loss, instead of employing auxiliary networks, we utilize the intermediate features of LDM, in particular, those from the self-attention layers, which possess rich spatial information. Our approach enables zero-shot image-to-image translation and neural radiance field (NeRF) editing, achieving a well-balanced interplay between maintaining the structural details and transforming content. Qualitative results and comparisons demonstrate the effectiveness of our proposed method. For more information and to access the code, please visit the project page.

Joint Detection Algorithm for Multiple Cognitive Users in Spectrum Sensing

paper_url: http://arxiv.org/abs/2311.18599
repo_url: None
paper_authors: Fanfei Meng, Yuxin Wang, Lele Zhang, Yingxin Zhao
for: This paper focuses on the development of a method for multi-user spectrum sensing based on soft decisions, which can effectively detect unoccupied spectrum resources during idle periods and improve the utilization of scarce information resources.
methods: The paper introduces three common logical circuit decision criteria in hard decisions and analyzes their decision rigor. It also proposes a method for multi-user collaborative sensing based on soft decisions, which significantly reduces false alarm probability and enhances detection probability.
results: The simulated results of multi-user collaborative sensing indicate that the approach effectively detects spectrum resources unoccupied during idle periods, leveraging the concept of time-division multiplexing and rationalizing the redistribution of information resources. The entire computation process relies on the calculation principles of power spectral density in communication theory, involving threshold decision detection for noise power and the sum of noise and signal power.

Abstract
Spectrum sensing technology is a crucial aspect of modern communication technology, serving as one of the essential techniques for efficiently utilizing scarce information resources in tight frequency bands. This paper first introduces three common logical circuit decision criteria in hard decisions and analyzes their decision rigor. Building upon hard decisions, the paper further introduces a method for multi-user spectrum sensing based on soft decisions. Then the paper simulates the false alarm probability and detection probability curves corresponding to the three criteria. The simulated results of multi-user collaborative sensing indicate that the simulation process significantly reduces false alarm probability and enhances detection probability. This approach effectively detects spectrum resources unoccupied during idle periods, leveraging the concept of time-division multiplexing and rationalizing the redistribution of information resources. The entire computation process relies on the calculation principles of power spectral density in communication theory, involving threshold decision detection for noise power and the sum of noise and signal power. It provides a secondary decision detection, reflecting the perceptual decision performance of logical detection methods with relative accuracy.
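
To make the hard-decision criteria concrete, the sketch below computes the fused detection and false alarm probabilities when n users each make an independent binary decision. The abstract does not name the three criteria; the OR, AND, and majority rules used here are the usual choices in cooperative spectrum sensing and are an assumption, as are the function names and the example probabilities.

```python
from math import comb

def k_out_of_n(p, n, k):
    """Probability that at least k of n independent users report 'occupied',
    when each does so with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def fuse(p, n):
    """Fused probability under the OR, AND and majority rules."""
    return {
        "or": k_out_of_n(p, n, 1),
        "and": k_out_of_n(p, n, n),
        "majority": k_out_of_n(p, n, n // 2 + 1),
    }

# Hypothetical per-user probabilities for five cooperating users.
detection = fuse(0.9, 5)     # per-user detection probability 0.9
false_alarm = fuse(0.1, 5)   # per-user false alarm probability 0.1
```

The OR rule maximizes detection at the cost of more false alarms, the AND rule does the opposite, and the majority rule sits between the two, which is the trade-off that motivates comparing the criteria.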


Generalisable Agents for Neural Network Optimisation

paper_url: http://arxiv.org/abs/2311.18598
repo_url: None
paper_authors: Kale-ab Tessera, Callum Rhys Tilbury, Sasha Abramowitz, Ruan de Kock, Omayma Mahjoub, Benjamin Rosman, Sara Hooker, Arnu Pretorius
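for: Optimising deep neural networks is challenging because of complex training dynamics, high computational requirements, and long training times; this work aims to improve that process.
methods: The authors propose Generalisable Agents for Neural Network Optimisation (GANNO), a multi-agent reinforcement learning (MARL) approach that improves neural network optimisation by dynamically and responsively scheduling hyperparameters during training. GANNO uses one agent per layer, which observes localised network dynamics and takes actions to adjust them at a layerwise level so as to improve global performance.
results: GANNO yields useful and responsive learning-rate schedules that are competitive with handcrafted heuristics, performs robustly across a wide variety of unseen initial conditions, and generalises to harder problems than it was trained on. The work also outlines the opportunities of this paradigm and the key challenges that remain.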

Abstract
Optimising deep neural networks is a challenging task due to complex training dynamics, high computational requirements, and long training times. To address this difficulty, we propose the framework of Generalisable Agents for Neural Network Optimisation (GANNO) -- a multi-agent reinforcement learning (MARL) approach that learns to improve neural network optimisation by dynamically and responsively scheduling hyperparameters during training. GANNO utilises an agent per layer that observes localised network dynamics and accordingly takes actions to adjust these dynamics at a layerwise level to collectively improve global performance. In this paper, we use GANNO to control the layerwise learning rate and show that the framework can yield useful and responsive schedules that are competitive with handcrafted heuristics. Furthermore, GANNO is shown to perform robustly across a wide variety of unseen initial conditions, and can successfully generalise to harder problems than it was trained on. Our work presents an overview of the opportunities that this paradigm offers for training neural networks, along with key challenges that remain to be overcome.
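
The idea of one controller per layer can be sketched on a toy problem. In the sketch below each "layer" is a parameter vector minimizing its own quadratic, and a hand-written rule stands in for the learned agent: it raises the layer's learning rate while that layer's gradient norm keeps shrinking and cuts it otherwise. The rule, the constants, and the toy objective are all assumptions for illustration; GANNO itself learns the policy with multi-agent reinforcement learning.

```python
import numpy as np

def agent_action(prev_grad_norm, grad_norm, lr, up=1.05, down=0.7):
    """Stand-in for a learned per-layer policy acting on local observations."""
    return lr * up if grad_norm < prev_grad_norm else lr * down

# Toy "network": two layers with very different curvature.
layers = {"layer1": np.array([1.0, -2.0]), "layer2": np.array([3.0])}
curvature = {"layer1": 1.0, "layer2": 20.0}
lrs = {name: 0.05 for name in layers}
prev_norms = {name: np.inf for name in layers}

for _ in range(200):
    for name, w in layers.items():
        grad = curvature[name] * w                 # gradient of 0.5 * c * ||w||^2
        norm = np.linalg.norm(grad)
        lrs[name] = agent_action(prev_norms[name], norm, lrs[name])
        prev_norms[name] = norm
        layers[name] = w - lrs[name] * grad        # layerwise SGD step
```

Because each layer gets its own schedule, the flat layer can take large steps while the sharply curved layer is kept at a much smaller learning rate.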

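Summary
Optimising deep neural networks is a challenging task because of complex training dynamics, high computational requirements, and long training times. To address this, we propose the Generalisable Agents for Neural Network Optimisation (GANNO) framework, a multi-agent reinforcement learning (MARL) approach that improves neural network optimisation by dynamically and responsively scheduling hyperparameters during training. GANNO uses one agent per layer to observe localised network dynamics and adjust them at a layerwise level, collectively improving global performance. In this paper we use GANNO to control the layerwise learning rate and show that the framework produces useful and responsive schedules that are competitive with handcrafted heuristics. GANNO also performs robustly across unseen initial conditions and generalises to harder problems than it was trained on. Our work gives an overview of the opportunities this paradigm offers for training neural networks and of the key challenges that remain.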

Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models

paper_url: http://arxiv.org/abs/2311.18592
repo_url: https://github.com/event-ahu/safe_largevlm
paper_authors: Dong Li, Jiandong Jin, Yuhao Zhang, Yanlin Zhong, Yaoyang Wu, Lan Chen, Xiao Wang, Bin Luo
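for: This study proposes a pattern recognition method that fuses RGB frames and event streams, addressing the semantic gaps and small-scale backbone networks of current methods.
methods: A novel framework consolidates semantic labels, RGB frames, and event streams. A pre-trained large-scale vision model (CLIP vision encoder) extracts RGB and event features; the semantic labels are converted into language descriptions through prompt engineering and encoded with a pre-trained large-scale language model (CLIP text encoder). Multimodal Transformer networks then integrate the RGB/Event and semantic features, using self-attention, cross-attention, and feed-forward layers for enhancement and fusion.
results: Comprehensive experiments on the HARDVS and PokerEvent datasets demonstrate the effectiveness of the proposed SAFE model.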

Abstract
Pattern recognition through the fusion of RGB frames and Event streams has emerged as a novel research area in recent years. Current methods typically employ backbone networks to individually extract the features of RGB frames and event streams, and subsequently fuse these features for pattern recognition. However, we posit that these methods may suffer from key issues like semantic gaps and small-scale backbone networks. In this study, we introduce a novel pattern recognition framework that consolidates the semantic labels, RGB frames, and event streams, leveraging pre-trained large-scale vision-language models. Specifically, given the input RGB frames, event streams, and all the predefined semantic labels, we employ a pre-trained large-scale vision model (CLIP vision encoder) to extract the RGB and event features. To handle the semantic labels, we initially convert them into language descriptions through prompt engineering, and then obtain the semantic features using the pre-trained large-scale language model (CLIP text encoder). Subsequently, we integrate the RGB/Event features and semantic features using multimodal Transformer networks. The resulting frame and event tokens are further amplified using self-attention layers. Concurrently, we propose to enhance the interactions between text tokens and RGB/Event tokens via cross-attention. Finally, we consolidate all three modalities using self-attention and feed-forward layers for recognition. Comprehensive experiments on the HARDVS and PokerEvent datasets fully substantiate the efficacy of our proposed SAFE model. The source code will be made available at https://github.com/Event-AHU/SAFE_LargeVLM.
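
The cross-attention step between text tokens and RGB/Event tokens can be sketched as single-head scaled dot-product attention in which text tokens act as queries and visual tokens as keys and values. The sketch omits the learned projection matrices, multiple heads, and the CLIP encoders; the token counts, dimensions, and function names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention without projections.

    queries: (n_q, dim), e.g. text tokens.
    context: (n_c, dim), e.g. RGB/event tokens used as keys and values.
    """
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ context, weights

# Hypothetical tokens: 3 label descriptions attending over 10 visual tokens.
rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(3, 16))
visual_tokens = rng.normal(size=(10, 16))
fused, weights = cross_attention(text_tokens, visual_tokens)
```

Each fused text token is a weighted average of visual tokens, so the label descriptions end up conditioned on the frames and events they are compared against.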


Continuous 16-bit Training: Accelerating 32-bit Pre-Trained Neural Networks

paper_url: http://arxiv.org/abs/2311.18587
repo_url: None
paper_authors: Juyoung Yun
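for: This study aims to make further training of deep learning models more efficient and sustainable without sacrificing accuracy.
methods: It proposes continuation training, in which models pre-trained with 32-bit precision continue training with 16-bit precision.
results: Experiments show that the method maintains the accuracy of the original 32-bit training while speeding up training and reducing memory requirements and computational burden. It is especially useful for models that require periodic updates and refinements.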

Abstract
In the field of deep learning, the prevalence of models initially trained with 32-bit precision is a testament to its robustness and accuracy. However, the continuous evolution of these models often demands further training, which can be resource-intensive. This study introduces a novel approach where we continue the training of these pre-existing 32-bit models using 16-bit precision. This technique not only caters to the need for efficiency in computational resources but also significantly improves the speed of additional training phases. By adopting 16-bit precision for ongoing training, we are able to substantially decrease memory requirements and computational burden, thereby accelerating the training process in a resource-limited setting. Our experiments show that this method maintains the high standards of accuracy set by the original 32-bit training while providing a much-needed boost in training speed. This approach is especially pertinent in today's context, where most models are initially trained in 32-bit and require periodic updates and refinements. The findings from our research suggest that this strategy of 16-bit continuation training can be a key solution for sustainable and efficient deep learning, offering a practical way to enhance pre-trained models rapidly and in a resource-conscious manner.

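Summary
In deep learning, the prevalence of models initially trained with 32-bit precision is a testament to its robustness and accuracy. The continuous evolution of these models, however, often demands further training, which can be resource-intensive. This study introduces a new approach that continues the training of these pre-existing 32-bit models using 16-bit precision. The technique meets the need for efficient use of computational resources and significantly speeds up additional training phases. Adopting 16-bit precision for continued training reduces memory requirements and computational burden, which accelerates training. Our experiments show that the method maintains the accuracy standard set by the original 32-bit training while providing a considerable boost in training speed. The approach is especially relevant today, when most models are initially trained in 32-bit and need periodic updates and refinements. Our findings suggest that 16-bit continuation training can be a sustainable and efficient solution for deep learning, offering a practical way to enhance pre-trained models quickly.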

Communication-Efficient Heterogeneous Federated Learning with Generalized Heavy-Ball Momentum

paper_url: http://arxiv.org/abs/2311.18578
repo_url: None
paper_authors: Riccardo Zaccone, Carlo Masone, Marco Ciccone
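for: This work addresses the system and statistical challenges of Federated Learning (FL): lowering communication bandwidth and frequency, and staying robust to non-iid data.
methods: It proposes a novel generalization of heavy-ball momentum and presents FedHBM, which addresses statistical heterogeneity without introducing any communication overhead.
results: Experiments on common FL vision and NLP datasets show that FedHBM yields better model quality and higher convergence speed than the state of the art, especially in pathological non-iid scenarios. Although designed for cross-silo settings, it is applicable in moderate-to-high cross-device scenarios, and good model initializations (e.g. pre-training) can be exploited for prompt acceleration.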

Abstract
Federated Learning (FL) is the state-of-the-art approach for learning from decentralized data in privacy-constrained scenarios. As the current literature reports, the main problems associated with FL refer to system and statistical challenges: the former ones demand for efficient learning from edge devices, including lowering communication bandwidth and frequency, while the latter require algorithms robust to non-iidness. State-of-art approaches either guarantee convergence at increased communication cost or are not sufficiently robust to handle extreme heterogeneous local distributions. In this work we propose a novel generalization of the heavy-ball momentum, and present FedHBM to effectively address statistical heterogeneity in FL without introducing any communication overhead. We conduct extensive experimentation on common FL vision and NLP datasets, showing that our FedHBM algorithm empirically yields better model quality and higher convergence speed w.r.t. the state-of-art, especially in pathological non-iid scenarios. While being designed for cross-silo settings, we show how FedHBM is applicable in moderate-to-high cross-device scenarios, and how good model initializations (e.g. pre-training) can be exploited for prompt acceleration. Extended experimentation on large-scale real-world federated datasets further corroborates the effectiveness of our approach for real-world FL applications.
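
For orientation, the classical heavy-ball update that FedHBM generalizes is w_{t+1} = w_t - lr * grad(w_t) + beta * (w_t - w_{t-1}). The sketch below implements only this classical single-machine form on a toy quadratic; the generalized momentum term and its use across clients and rounds are not described in the abstract and are not reproduced. The function name, the objective, and the hyperparameters are assumptions for illustration.

```python
import numpy as np

def heavy_ball(grad_fn, w0, lr=0.1, beta=0.9, steps=100):
    """Classical heavy-ball iteration:
    w_{t+1} = w_t - lr * grad(w_t) + beta * (w_t - w_{t-1})."""
    w_prev = np.array(w0, dtype=float)
    w = w_prev.copy()
    for _ in range(steps):
        w_next = w - lr * grad_fn(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w

# Toy ill-conditioned quadratic f(w) = 0.5 * sum(a * w**2).
a = np.array([1.0, 10.0])
grad = lambda w: a * w

w_hb = heavy_ball(grad, [1.0, 1.0], lr=0.1, beta=0.5)   # with momentum
w_gd = heavy_ball(grad, [1.0, 1.0], lr=0.1, beta=0.0)   # plain gradient descent
```

With the same step size and number of steps, the momentum run ends much closer to the minimizer at the origin than plain gradient descent, because momentum speeds up progress along the low-curvature direction.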

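Summary
Federated Learning (FL) is the state-of-the-art approach for learning from decentralized data in privacy-constrained scenarios. According to the literature, the main problems in FL are system and statistical challenges: the former call for efficient learning from edge devices, including lower communication bandwidth and frequency, while the latter call for algorithms that are robust to non-iid data. Current approaches either guarantee convergence at increased communication cost or are not robust enough to handle extremely heterogeneous local distributions. In this work we propose a novel generalization of heavy-ball momentum and present FedHBM, which effectively addresses statistical heterogeneity in FL without adding any communication overhead. Extensive experiments on common FL vision and NLP datasets show that FedHBM yields better model quality and faster convergence than the state of the art, especially in pathological non-iid scenarios. Although FedHBM is designed for cross-silo settings, we show that it is applicable in moderate-to-high cross-device scenarios and that good model initializations (e.g. pre-training) can be exploited for prompt acceleration. Further experiments on large-scale real-world federated datasets corroborate the effectiveness of the approach for real-world FL applications.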

Fingerprint Matching with Localized Deep Representation

paper_url: http://arxiv.org/abs/2311.18576
repo_url: None
paper_authors: Yongjie Duan, Zhiyu Pan, Jianjiang Feng, Jie Zhou
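for: To improve the accuracy and reliability of fixed-length fingerprint representations, enabling high-accuracy identification in large-scale fingerprint databases.
methods: Proposes a localized deep representation of fingerprint (LDRF), which focuses on the discriminative characteristics of local regions to provide a more robust and accurate fixed-length representation. LDRF can be adapted to any valid area, which makes it highly flexible.
results: Experiments on 21 datasets containing over 140,000 fingerprints show that LDRF is more accurate and reliable than other fixed-length representations and is robust to sensing technologies and impression types. In addition, the proposed matching score normalization technique effectively reduces the false match rate in large-scale identification, improving the accuracy and reliability of fingerprint matching.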

Abstract
Compared to minutia-based fingerprint representations, fixed-length representations are attractive due to simple and efficient matching. However, fixed-length fingerprint representations are limited in accuracy when matching fingerprints with different visible areas, which can occur due to different finger poses or acquisition methods. To address this issue, we propose a localized deep representation of fingerprint, named LDRF. By focusing on the discriminative characteristics within local regions, LDRF provides a more robust and accurate fixed-length representation for fingerprints with variable visible areas. LDRF can be adapted to retain information within any valid area, making it highly flexible. The matching scores produced by LDRF also exhibit intuitive statistical characteristics, which led us to propose a matching score normalization technique to mitigate the uncertainty in the cases of very small overlapping area. With this new technique, we can maintain a high level of accuracy and reliability in our fingerprint matching, even as the size of the database grows rapidly. Our experimental results on 21 datasets containing over 140K fingerprints of various finger poses and impression types show that LDRF outperforms other fixed-length representations and is robust to sensing technologies and impression types. Besides, the proposed matching score normalization effectively reduces the false match rate (FMR) in large-scale identification experiments comprising over 5.11 million fingerprints. Specifically, this technique results in a reduction of two orders of magnitude compared to matching without matching score normalization and five orders of magnitude compared to prior works.

Summary

Search Still Matters: Information Retrieval in the Era of Generative AI

paper_url: http://arxiv.org/abs/2311.18550
repo_url: None
paper_authors: William R. Hersh
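for: Explores how generative AI based on large language models (LLMs) fits into the search process, with a focus on academic use.
methods: A perspective piece that examines generative AI in the context of the motivations, considerations, and outcomes of the information retrieval (IR) process.
results: Information needs range from simple to complex. Users of search systems, particularly academics, care about the authoritativeness, timeliness, and contextualization of search. LLMs may aid the IR process, but search systems and research into their improvement remain essential.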

Abstract
Objective: Information retrieval (IR, also known as search) systems are ubiquitous in modern times. How does the emergence of generative artificial intelligence (AI), based on large language models (LLMs), fit into the IR process? Process: This perspective explores the use of generative AI in the context of the motivations, considerations, and outcomes of the IR process with a focus on the academic use of such systems. Conclusions: There are many information needs, from simple to complex, that motivate use of IR. Users of such systems, particularly academics, have concerns for authoritativeness, timeliness, and contextualization of search. While LLMs may provide functionality that aids the IR process, the continued need for search systems, and research into their improvement, remains essential.

Summary
Objective: Information retrieval (IR, also known as search) systems are ubiquitous in modern times. How does generative artificial intelligence (AI), based on large language models (LLMs), fit into the IR process? Process: This perspective explores the motivations, considerations, and outcomes of the IR process, with a focus on the academic use of such systems. Conclusions: Many information needs, from simple to complex, motivate the use of IR. Users, particularly academics, are concerned with the authoritativeness, timeliness, and contextualization of search results. Although LLMs may provide functionality that aids the IR process, search systems and research into their improvement remain essential.

Real-Time Vibration-Based Bearing Fault Diagnosis Under Time-Varying Speed Conditions

paper_url: http://arxiv.org/abs/2311.18547
repo_url: None
paper_authors: Tuomas Jalonen, Mohammad Al-Sa’d, Serkan Kiranyaz, Moncef Gabbouj
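for: Proposes an efficient real-time convolutional neural network (CNN) for diagnosing rolling-element bearing faults under various noise levels and time-varying rotational speeds.
methods: Alongside the CNN, the study introduces a novel Fisher-based spectral separability analysis (SSA) method to assess the effectiveness of the proposed model.
results: Experiments on healthy bearings and on bearings with inner race, outer race, and roller ball faults show that the proposed model achieves superior accuracy, is robust to noise, and runs in real time.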

Abstract
Detection of rolling-element bearing faults is crucial for implementing proactive maintenance strategies and for minimizing the economic and operational consequences of unexpected failures. However, many existing techniques are developed and tested under strictly controlled conditions, limiting their adaptability to the diverse and dynamic settings encountered in practical applications. This paper presents an efficient real-time convolutional neural network (CNN) for diagnosing multiple bearing faults under various noise levels and time-varying rotational speeds. Additionally, we propose a novel Fisher-based spectral separability analysis (SSA) method to elucidate the effectiveness of the designed CNN model. We conducted experiments on both healthy bearings and bearings afflicted with inner race, outer race, and roller ball faults. The experimental results show the superiority of our model over the current state-of-the-art approach in three folds: it achieves substantial accuracy gains of up to 15.8%, it is robust to noise with high performance across various signal-to-noise ratios, and it runs in real-time with processing durations five times less than acquisition. Additionally, by using the proposed SSA technique, we offer insights into the model's performance and underscore its effectiveness in tackling real-world challenges.

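Summary
Detecting rolling-element bearing faults is crucial for proactive maintenance strategies and for minimizing the economic and operational consequences of unexpected failures. However, many existing techniques are developed and tested under strictly controlled conditions, which limits their adaptability in practical applications. This paper presents an efficient real-time convolutional neural network (CNN) for diagnosing multiple bearing faults under various noise levels and time-varying rotational speeds. In addition, we propose a Fisher-based spectral separability analysis (SSA) method to explain the effectiveness of the designed CNN model. We ran experiments on healthy bearings and on bearings with inner race, outer race, and roller ball faults. The results show that our model outperforms the current state-of-the-art approach in three respects: it achieves accuracy gains of up to 15.8%, it is robust to noise across various signal-to-noise ratios, and it runs in real time, with processing durations five times shorter than acquisition. Finally, the proposed SSA technique offers insight into the model's performance and underscores its effectiveness on real-world challenges.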
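
The abstract does not spell out how the Fisher-based SSA is computed. As a rough illustration of the underlying idea only, a two-class Fisher criterion for a single scalar feature (for example, the energy in one frequency band) could be sketched as follows. The function name `fisher_ratio` and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pvariance


def fisher_ratio(class_a, class_b):
    """Two-class Fisher criterion for one scalar feature (illustrative).

    Squared distance between class means divided by the sum of the
    class (population) variances; larger means better separated.
    """
    scatter = pvariance(class_a) + pvariance(class_b)
    if scatter == 0:
        raise ValueError("both classes have zero variance")
    return (mean(class_a) - mean(class_b)) ** 2 / scatter


# Hypothetical band-energy values for two bearing conditions.
healthy = [1.0, 2.0, 3.0]
faulty = [7.0, 8.0, 9.0]
score = fisher_ratio(healthy, faulty)
```

A feature whose class means are far apart relative to the within-class spread receives a high score, which is the sense in which a spectrum can be called separable.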

Dataset Distillation via the Wasserstein Metric

paper_url: http://arxiv.org/abs/2311.18531
repo_url: None
paper_authors: Haoyang Liu, Tiancheng Xing, Luwei Li, Vibhu Dalal, Jingrui He, Haohan Wang
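for: To improve dataset distillation (DD), which condenses large datasets into smaller synthetic versions while limiting the loss in model performance.
methods: Proposes a new method based on the Wasserstein distance to capture the essential representation of extensive datasets. The method uses the Wasserstein distance to quantify distribution differences and embeds the synthetic data in the feature space of pretrained classification models for distribution matching.
results: Extensive testing on several high-resolution datasets confirms the method's effectiveness and adaptability, and suggests that Wasserstein metrics hold promise for DD.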

Abstract
Dataset distillation (DD) offers a compelling approach in computer vision, with the goal of condensing extensive datasets into smaller synthetic versions without sacrificing much of the model performance. In this paper, we continue to study the methods for DD, by addressing its conceptually core objective: how to capture the essential representation of extensive datasets in smaller, synthetic forms. We propose a novel approach utilizing the Wasserstein distance, a metric rooted in optimal transport theory, to enhance distribution matching in DD. Our method leverages the Wasserstein barycenter, offering a geometrically meaningful way to quantify distribution differences and effectively capture the centroid of a set of distributions. Our approach retains the computational benefits of distribution matching-based methods while achieving new state-of-the-art performance on several benchmarks. To provide useful prior for learning the images, we embed the synthetic data into the feature space of pretrained classification models to conduct distribution matching. Extensive testing on various high-resolution datasets confirms the effectiveness and adaptability of our method, indicating the promising yet unexplored capabilities of Wasserstein metrics in dataset distillation.

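Summary
To provide a useful prior for learning the images, we embed the synthetic data into the feature space of pretrained classification models to conduct distribution matching. Extensive testing on various high-resolution datasets confirms the effectiveness and adaptability of our method, indicating the promising yet unexplored capabilities of Wasserstein metrics in dataset distillation.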
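
To make the Wasserstein distance and barycenter concrete, here is a minimal one-dimensional sketch. It assumes equal-size empirical samples with uniform weights, a special case in which both quantities reduce to operations on sorted values. It is not the paper's method, which matches distributions in the feature space of pretrained models; the function names are illustrative.

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size 1-D empirical distributions.

    In one dimension the optimal transport plan pairs sorted samples,
    so the distance is the mean absolute difference of order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def barycenter_1d(sample_sets):
    """Equal-weight W2 barycenter of equal-size 1-D empirical distributions.

    In one dimension the barycenter's quantile function is the average of
    the input quantile functions, i.e. the average of the order statistics.
    """
    sorted_sets = [sorted(s) for s in sample_sets]
    n = len(sorted_sets[0])
    if any(len(s) != n for s in sorted_sets):
        raise ValueError("this sketch assumes equal sample sizes")
    return [sum(s[i] for s in sorted_sets) / len(sorted_sets) for i in range(n)]


distance = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
center = barycenter_1d([[0.0, 2.0], [4.0, 6.0]])
```

Unlike a pointwise average of densities, the barycenter here lies between the input distributions geometrically, which is the "centroid of a set of distributions" property the abstract refers to.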

Fast ODE-based Sampling for Diffusion Models in Around 5 Steps

paper_url: http://arxiv.org/abs/2312.00094
repo_url: None
paper_authors: Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen
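for: Proposes a fast sampling algorithm for diffusion models that produces high-quality images with very few function evaluations.
methods: Based on the geometric observation that each sampling trajectory lies almost in a two-dimensional subspace, the proposed AMED-Solver directly learns the mean direction to eliminate truncation errors. It can also be used as a plugin to improve existing ODE-based samplers.
results: Experiments show that the method achieves high-quality image generation with only 5 function evaluations (NFE) and improves on existing ODE-based diffusion samplers.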

Abstract
Sampling from diffusion models can be treated as solving the corresponding ordinary differential equations (ODEs), with the aim of obtaining an accurate solution with as few number of function evaluations (NFE) as possible. Recently, various fast samplers utilizing higher-order ODE solvers have emerged and achieved better performance than the initial first-order one. However, these numerical methods inherently result in certain approximation errors, which significantly degrades sample quality with extremely small NFE (e.g., around 5). In contrast, based on the geometric observation that each sampling trajectory almost lies in a two-dimensional subspace embedded in the ambient space, we propose Approximate MEan-Direction Solver (AMED-Solver) that eliminates truncation errors by directly learning the mean direction for fast diffusion sampling. Besides, our method can be easily used as a plugin to further improve existing ODE-based samplers. Extensive experiments on image synthesis with the resolution ranging from 32 to 256 demonstrate the effectiveness of our method. With only 5 NFE, we achieve 7.14 FID on CIFAR-10, 13.75 FID on ImageNet 64$\times$64, and 12.79 FID on LSUN Bedroom. Our code is available at https://github.com/zhyzhouu/amed-solver.

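Summary
Sampling from diffusion models can be treated as solving the corresponding ordinary differential equations (ODEs), with the aim of obtaining an accurate solution with as few function evaluations (NFE) as possible. Recently, many fast samplers using higher-order ODE solvers have emerged and outperform the initial first-order one. However, these numerical methods inherently introduce approximation errors, which significantly degrade sample quality when the NFE is extremely small (e.g., around 5). In contrast, based on the geometric observation that each sampling trajectory lies almost in a two-dimensional subspace embedded in the ambient space, we propose the Approximate MEan-Direction Solver (AMED-Solver), which directly learns the mean direction for fast diffusion sampling and thereby eliminates truncation errors. Our method can also easily be used as a plugin for existing ODE-based samplers. Extensive experiments on image synthesis at resolutions from 32 to 256 show that, with only 5 NFE, we achieve 7.14 FID on CIFAR-10, 13.75 FID on ImageNet 64x64, and 12.79 FID on LSUN Bedroom. Our code is available at https://github.com/zhyzhouu/amed-solver.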
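
The trade-off between NFE and truncation error can be seen on a toy problem. The sketch below applies a plain first-order Euler solver to dx/dt = -x, counting one function evaluation per step. It is not AMED-Solver and not a diffusion ODE; the toy equation and the function name `euler_solve` are assumptions chosen only to show why very small NFE hurts a generic numerical solver.

```python
import math


def euler_solve(f, x0, t0, t1, n_steps):
    """First-order Euler integration of dx/dt = f(x, t).

    Returns the final state and the number of function evaluations (NFE),
    which equals the number of steps for this solver.
    """
    x, t = x0, t0
    h = (t1 - t0) / n_steps
    nfe = 0
    for _ in range(n_steps):
        x = x + h * f(x, t)  # one evaluation of f per step
        nfe += 1
        t += h
    return x, nfe


decay = lambda x, t: -x          # toy ODE with exact solution x(t) = exp(-t)
exact = math.exp(-1.0)

x5, nfe5 = euler_solve(decay, 1.0, 0.0, 1.0, n_steps=5)
x50, nfe50 = euler_solve(decay, 1.0, 0.0, 1.0, n_steps=50)
err5, err50 = abs(x5 - exact), abs(x50 - exact)
```

With 5 steps the error is roughly ten times larger than with 50 steps, which is the regime where learning a better step direction, rather than taking more steps, becomes attractive.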

Calibration-free online test-time adaptation for electroencephalography motor imagery decoding

paper_url: http://arxiv.org/abs/2311.18520
repo_url: None
paper_authors: Martin Wimpff, Mario Döbler, Bin Yang
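for: Explores online test-time adaptation for Brain-Computer Interfaces (BCIs) to improve their decoding capability.
methods: Uses deep learning and adapts the BCI decoding model in an unsupervised fashion at inference time.
results: The adaptation techniques improve BCI decoding performance without requiring access to the source data, which preserves privacy.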

Abstract
Providing a promising pathway to link the human brain with external devices, Brain-Computer Interfaces (BCIs) have seen notable advancements in decoding capabilities, primarily driven by increasingly sophisticated techniques, especially deep learning. However, achieving high accuracy in real-world scenarios remains a challenge due to the distribution shift between sessions and subjects. In this paper we will explore the concept of online test-time adaptation (OTTA) to continuously adapt the model in an unsupervised fashion during inference time. Our approach guarantees the preservation of privacy by eliminating the requirement to access the source data during the adaptation process. Additionally, OTTA achieves calibration-free operation by not requiring any session- or subject-specific data. We will investigate the task of electroencephalography (EEG) motor imagery decoding using a lightweight architecture together with different OTTA techniques like alignment, adaptive batch normalization, and entropy minimization. We examine two datasets and three distinct data settings for a comprehensive analysis. Our adaptation methods produce state-of-the-art results, potentially instigating a shift in transfer learning for BCI decoding towards online adaptation.

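Summary
Brain-Computer Interfaces (BCIs) provide a promising pathway to link the human brain with external devices. However, achieving high accuracy in real-world scenarios remains a challenge, mainly because of the distribution shift between sessions and subjects. In this paper, we explore online test-time adaptation (OTTA), which continuously adapts the model in an unsupervised fashion at inference time and preserves privacy because it does not need access to the source data. OTTA is also calibration-free, since it requires no session- or subject-specific data. We study electroencephalography (EEG) motor imagery decoding using a lightweight architecture together with different OTTA techniques, such as alignment, adaptive batch normalization, and entropy minimization. We examine two datasets and three distinct data settings for a comprehensive analysis. Our adaptation methods produce state-of-the-art results and may shift transfer learning for BCI decoding towards online adaptation.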
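
Entropy minimization, one of the OTTA techniques listed above, needs no labels because its objective depends only on the model's own predictions. The sketch below computes that objective, the Shannon entropy of the softmax output, for a single sample. It shows only the quantity being minimized, not the adaptation loop or the paper's architecture; the function names and logit values are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax prediction.

    Low entropy means a confident prediction; test-time entropy
    minimization updates model parameters to reduce this value.
    """
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


uncertain = prediction_entropy([0.0, 0.0, 0.0, 0.0])   # uniform over 4 classes
confident = prediction_entropy([10.0, 0.0, 0.0, 0.0])  # one dominant class
```

A uniform prediction over four classes gives the maximum entropy, ln 4, while a confident prediction gives a value close to zero.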

Color-Emotion Associations in Art: Fuzzy Approach

paper_url: http://arxiv.org/abs/2311.18518
repo_url: None
paper_authors: Muragul Muratbekova, Pakizar Shamoi
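for: Studies the emotions evoked by artworks and the role color plays in them.
methods: Uses fuzzy sets to classify emotions, working with paintings tagged with emotions from the WikiArt dataset.
results: The study finds strong associations between emotions and colors: for example, gratitude is strongly associated with green, brown, and orange, and anger with brown. These findings can be used in art retrieval systems, marketing, design, and other practical applications.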

Abstract
Art objects can evoke certain emotions. Color is a fundamental element of visual art and plays a significant role in how art is perceived. This paper introduces a novel approach to classifying emotions in art using Fuzzy Sets. We employ a fuzzy approach because it aligns well with human judgments' imprecise and subjective nature. Extensive fuzzy colors (n=120) and a broad emotional spectrum (n=10) allow for a more human-consistent and context-aware exploration of emotions inherent in paintings. First, we introduce the fuzzy color representation model. Then, at the fuzzification stage, we process the Wiki Art Dataset of paintings tagged with emotions, extracting fuzzy dominant colors linked to specific emotions. This results in fuzzy color distributions for ten emotions. Finally, we convert them back to a crisp domain, obtaining a knowledge base of color-emotion associations in primary colors. Our findings reveal strong associations between specific emotions and colors; for instance, gratitude strongly correlates with green, brown, and orange. Other noteworthy associations include brown and anger, orange with shame, yellow with happiness, and gray with fear. Using these associations and Jaccard similarity, we can find the emotions in the arbitrary untagged image. We conducted a 2AFC experiment involving human subjects to evaluate the proposed method. The average hit rate of 0.77 indicates a significant correlation between the method's predictions and human perception. The proposed method is simple to adapt to art painting retrieval systems. The study contributes to the theoretical understanding of color-emotion associations in art, offering valuable insights for various practical applications besides art, like marketing, design, and psychology.

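Summary
Artworks can evoke certain emotions. Color is a fundamental element of visual art and plays an important role in how art is perceived. This paper proposes an approach to classifying emotions in art using fuzzy sets, chosen because they match the imprecise and subjective nature of human judgments. With 120 fuzzy colors and 10 emotions, we can explore emotions in a more human-consistent and context-aware way. First, we introduce the fuzzy color representation model. Then, at the fuzzification stage, we process paintings tagged with emotions from the Wiki Art Dataset and extract the fuzzy dominant colors linked to specific emotions, which yields a fuzzy color distribution for each emotion. Finally, we convert these back to a crisp domain, obtaining a knowledge base of color-emotion associations. Our findings reveal strong associations between specific emotions and colors; for instance, gratitude is strongly associated with green, brown, and orange. Other noteworthy associations include brown with anger, orange with shame, yellow with happiness, and gray with fear. Using these associations and Jaccard similarity, we can find the emotions in untagged paintings. We conducted a 2AFC experiment with human subjects to evaluate the method. The average hit rate of 0.77 indicates a significant correlation between the method's predictions and human perception. The method is simple to adapt to art painting retrieval systems. The study contributes to the theoretical understanding of color-emotion associations in art and is valuable for practical applications in marketing, design, and psychology.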
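
The final matching step can be illustrated with Jaccard similarity on crisp color sets. The associations below are the ones named in the abstract; treating them as plain sets, and the function names `jaccard` and `closest_emotion`, are simplifying assumptions, since the paper works with fuzzy color distributions before converting to a crisp domain.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)


# Color-emotion associations quoted in the abstract.
ASSOCIATIONS = {
    "gratitude": {"green", "brown", "orange"},
    "anger": {"brown"},
    "shame": {"orange"},
    "happiness": {"yellow"},
    "fear": {"gray"},
}


def closest_emotion(image_colors, associations=ASSOCIATIONS):
    """Return the emotion whose color set is most similar to the image's colors."""
    return max(associations, key=lambda e: jaccard(image_colors, associations[e]))


label = closest_emotion({"green", "brown", "orange"})
```

An image dominated by green, brown, and orange matches the gratitude set exactly, while it overlaps the single-color anger and shame sets only partially.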

Adaptive Multi-Modality Prompt Learning

paper_url: http://arxiv.org/abs/2312.00823
repo_url: None
paper_authors: Zongqian Wu, Yujing Liu, Mengmeng Zhan, Jialie Shen, Ping Hu, Xiaofeng Zhu
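for: Proposes an adaptive multi-modality prompt learning method to address two limitations of existing prompt learning methods on images: ignoring the adverse impact of meaningless patches, and not considering in-sample and out-of-sample generalization simultaneously.
methods: Combines previous text prompt learning with a new image prompt learning. The image prompt learning first masks meaningless patches and then pads them with learnable parameters and information from the text. In addition, each prompt provides auxiliary information to the other, further strengthening both kinds of generalization.
results: Experiments on real datasets show that the method outperforms SOTA methods on different downstream tasks.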

Abstract
Although current prompt learning methods have successfully been designed to effectively reuse the large pre-trained models without fine-tuning their large number of parameters, they still have limitations to be addressed, i.e., without considering the adverse impact of meaningless patches in every image and without simultaneously considering in-sample generalization and out-of-sample generalization. In this paper, we propose an adaptive multi-modality prompt learning to address the above issues. To do this, we employ previous text prompt learning and propose a new image prompt learning. The image prompt learning achieves in-sample and out-of-sample generalization, by first masking meaningless patches and then padding them with the learnable parameters and the information from texts. Moreover, each of the prompts provides auxiliary information to each other, further strengthening these two kinds of generalization. Experimental results on real datasets demonstrate that our method outperforms SOTA methods, in terms of different downstream tasks.

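Summary
Existing prompt learning methods successfully reuse large pre-trained models without fine-tuning their many parameters, but some limitations remain: they ignore the adverse impact of meaningless patches in every image, and they do not consider in-sample and out-of-sample generalization simultaneously. In this paper, we propose an adaptive multi-modality prompt learning method to address these issues. We employ previous text prompt learning and propose a new image prompt learning. The image prompt learning achieves in-sample and out-of-sample generalization by first masking meaningless patches and then padding them with learnable parameters and information from the text. Moreover, each prompt provides auxiliary information to the other, further strengthening both kinds of generalization. Experimental results on real datasets show that our method outperforms SOTA methods on different downstream tasks.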

ZeST-NeRF: Using temporal aggregation for Zero-Shot Temporal NeRFs

paper_url: http://arxiv.org/abs/2311.18491
repo_url: None
paper_authors: Violeta Menéndez González, Andrew Gilbert, Graeme Phillipson, Stephen Jolly, Simon Hadfield
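for: Proposes a new approach, motivated by video editing, for producing temporal NeRFs for new scenes.
methods: Uses ZeST-NeRF, which can produce temporal NeRFs for new scenes without retraining.
results: ZeST-NeRF accurately reconstructs novel views of new scenes. Compared with previous methods, it improves quantitatively by 15% and produces significantly better visual results.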

Abstract
In the field of media production, video editing techniques play a pivotal role. Recent approaches have had great success at performing novel view image synthesis of static scenes. But adding temporal information adds an extra layer of complexity. Previous models have focused on implicitly representing static and dynamic scenes using NeRF. These models achieve impressive results but are costly at training and inference time. They overfit an MLP to describe the scene implicitly as a function of position. This paper proposes ZeST-NeRF, a new approach that can produce temporal NeRFs for new scenes without retraining. We can accurately reconstruct novel views using multi-view synthesis techniques and scene flow-field estimation, trained only with unrelated scenes. We demonstrate how existing state-of-the-art approaches from a range of fields cannot adequately solve this new task and demonstrate the efficacy of our solution. The resulting network improves quantitatively by 15% and produces significantly better visual results.

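Summary
In media production, video editing techniques play a pivotal role. Recent approaches have been very successful at novel view image synthesis of static scenes, but adding temporal information adds an extra layer of complexity. Previous models represent static and dynamic scenes implicitly using NeRF; they achieve impressive results but are costly at training and inference time. This paper proposes ZeST-NeRF, a new approach that can produce temporal NeRFs for new scenes without retraining. We can accurately reconstruct novel views using multi-view synthesis techniques and scene flow-field estimation, trained only with unrelated scenes. We show that existing state-of-the-art approaches cannot adequately solve this new task and demonstrate the efficacy of our solution. The resulting network improves quantitatively by 15% and produces better visual results.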

New Perspectives on the Evaluation of Link Prediction Algorithms for Dynamic Graphs

paper_url: http://arxiv.org/abs/2311.18486
repo_url: https://github.com/aida-ugent/dlp_viz
paper_authors: Raphaël Romero, Tijl De Bie, Jefrey Lijffijt
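for: Catalogs the possibilities for negative sampling in dynamic link prediction and introduces novel visualization methods to evaluate the effect of negative sampling on predictive performance.
methods: Considers multiple negative sampling strategies, together with new visualization tools for assessing prediction performance and the dynamics of temporal networks.
results: The choice of sampling method affects measured predictive performance, the error is typically not evenly distributed across data segments, and the visualization tools reveal how prediction performance varies over time.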

Abstract
There is a fast-growing body of research on predicting future links in dynamic networks, with many new algorithms. Some benchmark data exists, and performance evaluations commonly rely on comparing the scores of observed network events (positives) with those of randomly generated ones (negatives). These evaluation measures depend on both the predictive ability of the model and, crucially, the type of negative samples used. Besides, as generally the case with temporal data, prediction quality may vary over time. This creates a complex evaluation space. In this work, we catalog the possibilities for negative sampling and introduce novel visualization methods that can yield insight into prediction performance and the dynamics of temporal networks. We leverage these visualization tools to investigate the effect of negative sampling on the predictive performance, at the node and edge level. We validate empirically, on datasets extracted from recent benchmarks that the error is typically not evenly distributed across different data segments. Finally, we argue that such visualization tools can serve as powerful guides to evaluate dynamic link prediction methods at different levels.

Summary
There is a fast-growing body of research on predicting future links in dynamic networks, with many new algorithms. Some benchmark data exists, and performance evaluations commonly compare the scores of observed network events (positives) with those of randomly generated ones (negatives). These evaluation measures depend on both the predictive ability of the model and the type of negative samples used. In addition, as is generally the case with temporal data, prediction quality may vary over time. This creates a complex evaluation space. In this work, we catalog the possibilities for negative sampling and introduce new visualization tools that give insight into prediction performance and the dynamics of temporal networks. We use these tools to investigate the effect of negative sampling on predictive performance at the node and edge level. We verify empirically, on datasets extracted from recent benchmarks, that the error is typically not evenly distributed across different data segments. Finally, we argue that such visualization tools can serve as powerful guides for evaluating dynamic link prediction methods.
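
The simplest negative sampling scheme mentioned above, randomly generated negatives, can be sketched as follows. The sketch enumerates all ordered node pairs that are not observed edges and draws from them uniformly, which is only practical for small graphs. The function name, the directed-edge assumption, and the toy graph are illustrative and do not come from the paper or its repository.

```python
import random
from itertools import permutations


def sample_random_negatives(nodes, positive_edges, k, seed=0):
    """Draw up to k node pairs uniformly from the pairs that are NOT observed edges.

    Edges are treated as directed (u, v) pairs; self-loops are never proposed.
    """
    positives = set(positive_edges)
    # All ordered pairs of distinct nodes that were not observed as positives.
    candidates = [pair for pair in permutations(nodes, 2) if pair not in positives]
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))


nodes = ["a", "b", "c", "d"]
positives = [("a", "b"), ("b", "c")]
negatives = sample_random_negatives(nodes, positives, k=3)
```

Because evaluation scores compare positives against these draws, changing how the candidate pool is built changes the reported metric, which is the dependence the abstract highlights.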

ESG Accountability Made Easy: DocQA at Your Service

paper_url: http://arxiv.org/abs/2311.18481
repo_url: None
paper_authors: Lokesh Mishra, Cesar Berrospi, Kasper Dinkla, Diego Antognini, Francesco Fusco, Benedikt Bothur, Maksym Lysak, Nikolaos Livathinos, Ahmed Nassar, Panagiotis Vagenas, Lucas Morin, Christoph Auer, Michele Dolfi, Peter Staar
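for: Develops a question-answering conversational assistant, built on several AI technologies, to make information extraction from documents more efficient.
methods: Uses computer vision to convert documents into a machine-readable format, natural language processing to find relevant data, and large language models to formulate an eloquent response.
results: The system helps users extract information from a large collection of Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The platform can be accessed at https://ds4sd.github.io.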

Abstract
We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines consisting of document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io.

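Summary
We present Deep Search DocQA, an application that enables information extraction from documents through a question-answering conversational assistant. The system combines technologies from different AI disciplines: document conversion to a machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at https://ds4sd.github.io.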

Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework

paper_url: http://arxiv.org/abs/2311.18460
repo_url: None
paper_authors: Maresa Schröder, Dennis Frauen, Stefan Feuerriegel
for: This paper focuses on ensuring fairness in machine learning predictions, specifically in situations where there is unobserved confounding.
methods: The paper proposes a novel neural framework for learning fair predictions, which includes deriving bounds for causal fairness metrics under different sources of unobserved confounding.
results: The paper demonstrates the effectiveness of its framework in a series of experiments, including a real-world case study about predicting prison sentences. The paper also offers worst-case guarantees of the extent to which causal fairness can be violated due to unobserved confounding.

Abstract
Fairness for machine learning predictions is widely required in practice for legal, ethical, and societal reasons. Existing work typically focuses on settings without unobserved confounding, even though unobserved confounding can lead to severe violations of causal fairness and, thus, unfair predictions. In this work, we analyze the sensitivity of causal fairness to unobserved confounding. Our contributions are three-fold. First, we derive bounds for causal fairness metrics under different sources of unobserved confounding. This enables practitioners to examine the sensitivity of their machine learning models to unobserved confounding in fairness-critical applications. Second, we propose a novel neural framework for learning fair predictions, which allows us to offer worst-case guarantees of the extent to which causal fairness can be violated due to unobserved confounding. Third, we demonstrate the effectiveness of our framework in a series of experiments, including a real-world case study about predicting prison sentences. To the best of our knowledge, ours is the first work to study causal fairness under unobserved confounding. To this end, our work is of direct practical value as a refutation strategy to ensure the fairness of predictions in high-stakes applications.

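Summary
Fairness in machine learning predictions is widely required for legal, ethical, and societal reasons. Existing work typically assumes there is no unobserved confounding, even though unobserved confounding can lead to severe violations of causal fairness and thus to unfair predictions. In this work, we analyze the sensitivity of causal fairness to unobserved confounding. Our contributions are three-fold. First, we derive bounds for causal fairness metrics under unobserved confounding, which lets practitioners examine how sensitive their machine learning models are to it. Second, we propose a novel neural framework for learning fair predictions, which offers worst-case guarantees on the extent to which causal fairness can be violated due to unobserved confounding. Third, we demonstrate the effectiveness of our framework in a series of experiments, including a real-world case study on predicting prison sentences. To the best of our knowledge, ours is the first work to study causal fairness under unobserved confounding. Our work is therefore of direct practical value as a refutation strategy for ensuring the fairness of predictions in high-stakes applications.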

Multiple Disciplinary Data Work Practices in Artificial Intelligence Research: a Healthcare Case Study in the UK

paper_url: http://arxiv.org/abs/2311.18424
repo_url: None
paper_authors: Rafael Henkin, Elizabeth Remfry, Duncan J. Reynolds, Megan Clinch, Michael R. Barnes
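for: Explores how participants share knowledge and navigate tensions across disciplines when developing AI tools for healthcare.
methods: An inductive thematic analysis of semi-structured interviews with 13 participants, examining work practices in a large research consortium.
results: Multiple disciplinarity heavily impacts work practices: participants need to learn the languages and tools of other disciplines in order to communicate and share knowledge with people from different backgrounds. Large health datasets also restrict work practices. Meetings emerge as a key platform for sharing knowledge, and the study discusses design implications for data science and collaborative tools.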

Abstract
Developing artificial intelligence (AI) tools for healthcare is a multiple disciplinary effort, bringing data scientists, clinicians, patients and other disciplines together. In this paper, we explore the AI development workflow and how participants navigate the challenges and tensions of sharing and generating knowledge across disciplines. Through an inductive thematic analysis of 13 semi-structured interviews with participants in a large research consortia, our findings suggest that multiple disciplinarity heavily impacts work practices. Participants faced challenges to learn the languages of other disciplines and needed to adapt the tools used for sharing and communicating with their audience, particularly those from a clinical or patient perspective. Large health datasets also posed certain restrictions on work practices. We identified meetings as a key platform for facilitating exchanges between disciplines and allowing for the blending and creation of knowledge. Finally, we discuss design implications for data science and collaborative tools, and recommendations for future research.

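Summary
Developing artificial intelligence (AI) tools for healthcare is a multiple disciplinary effort that brings together data scientists, clinicians, patients, and specialists from other disciplines. This paper explores the AI development workflow and how participants navigate the challenges and tensions of sharing and generating knowledge across disciplines. Through an inductive thematic analysis of semi-structured interviews with 13 participants, we find that multiple disciplinarity affects work practices. Participants needed to learn the languages of other disciplines and to adapt the tools they used to communicate with their audience, particularly those with a clinical or patient perspective. Large health datasets also restricted work practices. We identified meetings as a key platform for exchanging knowledge between disciplines and for blending and creating knowledge. Finally, we discuss design implications for data science and collaborative tools, and directions for future research.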

Corrupting Convolution-based Unlearnable Datasets with Pixel-based Image Transformations

paper_url: http://arxiv.org/abs/2311.18403
repo_url: None
paper_authors: Xianlong Wang, Shengshan Hu, Minghui Li, Zhifei Yu, Ziqi Zhou, Leo Yu Zhang, Hai Jin
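for: To defend against unlearnable datasets (UDs), which severely degrade the generalization performance of models trained on them.
methods: Expresses the convolution-based unlearnable sample, in a simplified scenario, as a matrix multiplied by a clean sample, and studies the working mechanism of convolution-based UDs. A random matrix is then used to strengthen the defense.
results: Validation experiments support the hypothesis, and the proposed method successfully defends against convolution-based UDs, including newly designed ones.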

Abstract
Unlearnable datasets lead to a drastic drop in the generalization performance of models trained on them by introducing elaborate and imperceptible perturbations into clean training sets. Many existing defenses, e.g., JPEG compression and adversarial training, effectively counter UDs based on norm-constrained additive noise. However, a fire-new type of convolution-based UDs have been proposed and render existing defenses all ineffective, presenting a greater challenge to defenders. To address this, we express the convolution-based unlearnable sample as the result of multiplying a matrix by a clean sample in a simplified scenario, and formalize the intra-class matrix inconsistency as $\Theta_{imi}$, inter-class matrix consistency as $\Theta_{imc}$ to investigate the working mechanism of the convolution-based UDs. We conjecture that increasing both of these metrics will mitigate the unlearnability effect. Through validation experiments that commendably support our hypothesis, we further design a random matrix to boost both $\Theta_{imi}$ and $\Theta_{imc}$, achieving a notable degree of defense effect. Hence, by building upon and extending these facts, we first propose a brand-new image COrruption that employs randomly multiplicative transformation via INterpolation operation to successfully defend against convolution-based UDs. Our approach leverages global pixel random interpolations, effectively suppressing the impact of multiplicative noise in convolution-based UDs. Additionally, we have also designed two new forms of convolution-based UDs, and find that our defense is the most effective against them.

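Summary
Unlearnable datasets (UDs) cause a drastic drop in the generalization performance of models trained on them by introducing elaborate and imperceptible perturbations into clean training sets. Many existing defenses, such as JPEG compression and adversarial training, effectively counter UDs based on norm-constrained additive noise, but a new type of convolution-based UD has been proposed that renders existing defenses ineffective. To address this, we express the convolution-based unlearnable sample as the result of multiplying a matrix by a clean sample, and formalize the intra-class matrix inconsistency as $\Theta_{imi}$ and the inter-class matrix consistency as $\Theta_{imc}$ to investigate the working mechanism of convolution-based UDs. We conjecture that increasing both metrics will mitigate the unlearnability effect. With validation experiments supporting our hypothesis, we further design a random matrix to boost both $\Theta_{imi}$ and $\Theta_{imc}$, achieving a notable degree of defense effect. Building on this, we propose a brand-new image COrruption that employs randomly multiplicative transformation via INterpolation operation to defend against convolution-based UDs. Our approach uses global pixel random interpolations, effectively suppressing the impact of multiplicative noise in convolution-based UDs. In addition, we design two new forms of convolution-based UDs and find that our defense is the most effective against them.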

MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation

paper_url: http://arxiv.org/abs/2311.18331
repo_url: None
paper_authors: Sumanth Udupa, Prajwal Gurunath, Aniruddh Sikdar, Suresh Sundaram
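for: Deep networks perform well on semantic scene understanding in source domains, but because style diversity is absent during training, improving performance on unseen target domains using only single-source-domain data remains a challenge.
methods: Proposes a new technique called MultiResolution Feature Perturbation (MRFP), which randomizes domain-specific fine-grained features and perturbs the style of coarse features.
results: Experiments on various urban-scene segmentation datasets show that MRFP helps state-of-the-art deep neural networks learn robust domain-invariant features, improving semantic segmentation performance.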

Abstract
Deep neural networks have shown exemplary performance on semantic scene understanding tasks on source domains, but due to the absence of style diversity during training, enhancing performance on unseen target domains using only single source domain data remains a challenging task. Generation of simulated data is a feasible alternative to retrieving large style-diverse real-world datasets as it is a cumbersome and budget-intensive process. However, the large domain-specific inconsistencies between simulated and real-world data pose a significant generalization challenge in semantic segmentation. In this work, to alleviate this problem, we propose a novel MultiResolution Feature Perturbation (MRFP) technique to randomize domain-specific fine-grained features and perturb style of coarse features. Our experimental results on various urban-scene segmentation datasets clearly indicate that, along with the perturbation of style-information, perturbation of fine-feature components is paramount to learn domain invariant robust feature maps for semantic segmentation models. MRFP is a simple and computationally efficient, transferable module with no additional learnable parameters or objective functions, that helps state-of-the-art deep neural networks to learn robust domain invariant features for simulation-to-real semantic segmentation.

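Summary
Deep neural networks perform well on source domains, but because training lacks style diversity, improving performance on target domains using only single-source-domain data remains a challenge. Generating simulated data is a feasible alternative, but the domain-specific inconsistencies between simulated and real-world data pose a significant generalization challenge. To address this, we propose the MultiResolution Feature Perturbation (MRFP) technique, which randomizes domain-specific fine-grained features and perturbs the style of coarse features. Experiments on several urban-scene segmentation datasets show that, in addition to perturbing style information, perturbing fine-feature components is necessary for learning robust, domain-invariant feature maps. MRFP is a simple, computationally efficient, transferable module that adds no learnable parameters or objective functions and helps current deep neural networks learn robust features for simulation-to-real semantic segmentation.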

Advances in 3D Neural Stylization: A Survey

paper_url: http://arxiv.org/abs/2311.18328
repo_url: https://github.com/chenyingshu/advances_3d_neural_stylization
paper_authors: Yingshu Chen, Guocheng Shao, Ka Chun Shum, Binh-Son Hua, Sai-Kit Yeung
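for: Surveys AI-based digital art generation, in particular neural stylization for 3D data, within the broader family of visual style transfer methods used to edit images, videos, and 3D data to make them more artistic and diverse.
methods: Provides a taxonomy of neural stylization built on key design choices: scene representation, guidance data, optimization strategies, and output styles. It also provides a mini-benchmark of artistic stylization methods.
results: Based on the insights gained from the survey, the paper discusses open challenges, future research directions, and potential applications and impacts of neural stylization.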

Abstract
Modern artificial intelligence provides a novel way of producing digital art in styles. The expressive power of neural networks enables the realm of visual style transfer methods, which can be used to edit images, videos, and 3D data to make them more artistic and diverse. This paper reports on recent advances in neural stylization for 3D data. We provide a taxonomy for neural stylization by considering several important design choices, including scene representation, guidance data, optimization strategies, and output styles. Building on such taxonomy, our survey first revisits the background of neural stylization on 2D images, and then provides in-depth discussions on recent neural stylization methods for 3D data, where we also provide a mini-benchmark on artistic stylization methods. Based on the insights gained from the survey, we then discuss open challenges, future research, and potential applications and impacts of neural stylization.

Summary
Modern artificial intelligence offers a new way of producing digital art through neural style transfer. These techniques can be used to edit images, videos, and 3D data to make them more artistic and diverse. This paper reports on recent advances in neural stylization for 3D data and presents a taxonomy built on several design choices, including scene representation, guidance data, optimization strategies, and output styles. Building on this taxonomy, we first revisit the background of neural stylization on 2D images, then discuss recent neural stylization methods for 3D data in depth, and provide a mini-benchmark of artistic stylization methods. Based on these insights, we discuss open challenges, future research, and potential applications and impacts of neural stylization.

Generative Artificial Intelligence in Learning Analytics: Contextualising Opportunities and Challenges through the Learning Analytics Cycle

paper_url: http://arxiv.org/abs/2312.00087
repo_url: None
paper_authors: Lixiang Yan, Roberto Martinez-Maldonado, Dragan Gašević
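for: Examines the opportunities and challenges that generative AI (GenAI) poses for education, and in particular for the learning analytics (LA) cycle.
methods: Gives a concise overview of the current GenAI landscape (large language models and diffusion models such as ChatGPT and Midjourney) and contextualises its potential roles within Clow's generic framework of the LA cycle.
results: Posits that GenAI can play pivotal roles in LA: analysing unstructured data, generating synthetic learner data, enriching multimodal learner interactions, advancing interactive and explanatory analytics, and facilitating personalisation and adaptive interventions.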

Abstract
Generative artificial intelligence (GenAI), exemplified by ChatGPT, Midjourney, and other state-of-the-art large language models and diffusion models, holds significant potential for transforming education and enhancing human productivity. While the prevalence of GenAI in education has motivated numerous research initiatives, integrating these technologies within the learning analytics (LA) cycle and their implications for practical interventions remain underexplored. This paper delves into the prospective opportunities and challenges GenAI poses for advancing LA. We present a concise overview of the current GenAI landscape and contextualise its potential roles within Clow's generic framework of the LA cycle. We posit that GenAI can play pivotal roles in analysing unstructured data, generating synthetic learner data, enriching multimodal learner interactions, advancing interactive and explanatory analytics, and facilitating personalisation and adaptive interventions. As the lines blur between learners and GenAI tools, a renewed understanding of learners is needed. Future research can delve deep into frameworks and methodologies that advocate for human-AI collaboration. The LA community can play a pivotal role in capturing data about human and AI contributions and exploring how they can collaborate most effectively. As LA advances, it is essential to consider the pedagogical implications and broader socioeconomic impact of GenAI for ensuring an inclusive future.

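Summary
Generative artificial intelligence (GenAI), exemplified by ChatGPT, Midjourney, and other state-of-the-art large language models and diffusion models, holds significant potential for transforming education and raising human productivity. Although the prevalence of GenAI in education has motivated many research initiatives, integrating these technologies into the learning analytics (LA) cycle, and what that means for practical interventions, remains underexplored. This paper examines the opportunities and challenges GenAI poses for LA. We give a concise overview of the current GenAI landscape and place its potential roles within Clow's generic framework of the LA cycle. We argue that GenAI can play important roles in analysing unstructured data, generating synthetic learner data, enriching multimodal learner interactions, advancing interactive and explanatory analytics, and enabling personalisation and adaptive interventions. As the line between learners and GenAI tools blurs, a renewed understanding of learners is needed. Future research can investigate frameworks and methodologies for human-AI collaboration, and the LA community can help by capturing data about human and AI contributions and exploring how they collaborate most effectively. As LA advances, the pedagogical implications and broader socioeconomic impact of GenAI must be considered to ensure an inclusive future.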

TrustMark: Universal Watermarking for Arbitrary Resolution Images

paper_url: http://arxiv.org/abs/2311.18297
repo_url: None
paper_authors: Tu Bui, Shruti Agarwal, John Collomosse
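for: Imperceptible digital watermarking for copyright protection, misinformation prevention, and responsible generative AI.
methods: Proposes TrustMark, a GAN-based imperceptible watermarking method with a novel architecture and spatio-spectral losses that balance watermarked image quality against watermark recovery accuracy. Also introduces TrustMark-RM, a watermark remover useful for re-watermarking.
results: Achieves state-of-the-art performance on 3 benchmarks comprising arbitrary-resolution images.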

Abstract
Imperceptible digital watermarking is important in copyright protection, misinformation prevention, and responsible generative AI. We propose TrustMark, a GAN-based watermarking method with novel design in architecture and spatio-spectral losses to balance the trade-off between watermarked image quality and watermark recovery accuracy. Our model is trained with robustness in mind, withstanding various in- and out-place perturbations on the encoded image. Additionally, we introduce TrustMark-RM, a watermark remover method useful for re-watermarking. Our methods achieve state-of-the-art performance on 3 benchmarks comprising arbitrary resolution images.

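Summary
Imperceptible digital watermarking is important for copyright protection, misinformation prevention, and responsible generative AI. We propose TrustMark, a GAN-based watermarking method with a novel architecture and spatio-spectral losses that balance the trade-off between watermarked image quality and watermark recovery accuracy. The model is trained for robustness and withstands various in- and out-place perturbations of the encoded image. We also introduce TrustMark-RM, a watermark remover useful for re-watermarking. Our methods achieve state-of-the-art performance on 3 benchmarks comprising arbitrary-resolution images.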

Perceptual Group Tokenizer: Building Perception with Iterative Grouping

paper_url: http://arxiv.org/abs/2311.18296
repo_url: None
paper_authors: Zhiwei Deng, Ting Chen, Yang Li
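for: Proposes a neural visual recognition backbone built on perceptual grouping, for self-supervised learning of image representations.
methods: The model relies entirely on grouping operations to extract visual features and perform self-supervised representation learning; a series of grouping operations iteratively hypothesizes the context for pixels or superpixels to refine feature representations.
results: Achieves 80.3% on the ImageNet-1K self-supervised learning benchmark with linear probe evaluation, which the authors describe as new progress under this paradigm.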

Abstract
Human visual recognition system shows astonishing capability of compressing visual information into a set of tokens containing rich representations without label supervision. One critical driving principle behind it is perceptual grouping. Despite being widely used in computer vision in the early 2010s, it remains a mystery whether perceptual grouping can be leveraged to derive a neural visual recognition backbone that generates as powerful representations. In this paper, we propose the Perceptual Group Tokenizer, a model that entirely relies on grouping operations to extract visual features and perform self-supervised representation learning, where a series of grouping operations are used to iteratively hypothesize the context for pixels or superpixels to refine feature representations. We show that the proposed model can achieve competitive performance compared to state-of-the-art vision architectures, and inherits desirable properties including adaptive computation without re-training, and interpretability. Specifically, Perceptual Group Tokenizer achieves 80.3% on ImageNet-1K self-supervised learning benchmark with linear probe evaluation, marking a new progress under this paradigm.

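Summary
The human visual recognition system shows an astonishing ability to compress visual information into a set of tokens with rich representations, without label supervision. One key driving principle is perceptual grouping. Although grouping was widely used in computer vision in the early 2010s, whether it can be leveraged to derive a neural visual recognition backbone that produces equally powerful representations remains an open question. In this paper we propose the Perceptual Group Tokenizer, a model that relies entirely on grouping operations to extract visual features and perform self-supervised representation learning. We show that the proposed model achieves competitive performance compared with state-of-the-art vision architectures and inherits desirable properties, including adaptive computation without re-training and interpretability. Specifically, the Perceptual Group Tokenizer reaches 80.3% on the ImageNet-1K self-supervised learning benchmark with linear probe evaluation, marking new progress under this paradigm.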

Non-Cross Diffusion for Semantic Consistency

paper_url: http://arxiv.org/abs/2312.00820
repo_url: None
paper_authors: Ziyang Zheng, Ruiyuan Gao, Qiang Xu
for: Addressing the challenge of semantic inconsistencies in diffusion models, particularly in applications such as image editing and interpolation.
methods: Introducing Non-Cross Diffusion, a novel approach to generative modeling that incorporates an ascending dimension of input to connect points sampled from two distributions with uncrossed paths, ensuring enhanced semantic consistency throughout the inference process.
results: Demonstrating the effectiveness of Non-Cross Diffusion through empirical results, including a substantial reduction in semantic inconsistencies and a notable enhancement in the overall performance of diffusion models.

Abstract
In diffusion models, deviations from a straight generative flow are a common issue, resulting in semantic inconsistencies and suboptimal generations. To address this challenge, we introduce `Non-Cross Diffusion', an innovative approach in generative modeling for learning ordinary differential equation (ODE) models. Our methodology strategically incorporates an ascending dimension of input to effectively connect points sampled from two distributions with uncrossed paths. This design is pivotal in ensuring enhanced semantic consistency throughout the inference process, which is especially critical for applications reliant on consistent generative flows, including various distillation methods and deterministic sampling, which are fundamental in image editing and interpolation tasks. Our empirical results demonstrate the effectiveness of Non-Cross Diffusion, showing a substantial reduction in semantic inconsistencies at different inference steps and a notable enhancement in the overall performance of diffusion models.

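Summary
In diffusion models, deviation from a straight generative flow is a common problem that leads to semantic inconsistencies and suboptimal generations. To address this, we introduce Non-Cross Diffusion, a generative modeling approach for learning ordinary differential equation (ODE) models. Our method incorporates an ascending dimension of input to connect points sampled from two distributions with uncrossed paths. This design is key to ensuring semantic consistency throughout inference, especially for applications that rely on consistent generative flows, including various distillation methods and deterministic sampling, which are fundamental to image editing and interpolation tasks. Our empirical results show a substantial reduction in semantic inconsistencies at different inference steps and a notable improvement in the overall performance of diffusion models.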

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

paper_url: http://arxiv.org/abs/2311.18259
repo_url: None
paper_authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
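for: Presents a multimodal, multiview video dataset and benchmark challenge to advance video understanding of skilled human activity.
methods: Builds a large-scale multimodal dataset from simultaneously captured egocentric and exocentric video, accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions.
results: Collects 1,422 hours of video from more than 800 participants in 13 cities across 131 different natural scene contexts, and provides annotations for a suite of benchmark tasks to drive research on skilled-activity video understanding.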

Abstract
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.

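Summary
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers on simultaneously captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding 1,422 hours of video in total. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including a novel "expert commentary" provided by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.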

Sketch Input Method Editor: A Comprehensive Dataset and Methodology for Systematic Input Recognition

paper_url: http://arxiv.org/abs/2311.18254
repo_url: None
paper_authors: Guangming Zhu, Siyuan Wang, Qing Cheng, Kelong Wu, Hao Li, Liang Zhang
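for: Develops a Sketch Input Method Editor (SketchIME) for a professional C4I system, in which free-hand sketches are used to recommend standardized symbols when creating comprehensive situation maps.
methods: Presents a dataset of 374 specialized sketch types and proposes a simultaneous recognition and segmentation architecture with multilevel supervision, combined with few-shot domain adaptation and class-incremental learning to improve adaptability and interpretability.
results: Experiments on the proposed dataset and the SPG dataset show superior performance of the proposed architecture.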

Abstract
With the recent surge in the use of touchscreen devices, free-hand sketching has emerged as a promising modality for human-computer interaction. While previous research has focused on tasks such as recognition, retrieval, and generation of familiar everyday objects, this study aims to create a Sketch Input Method Editor (SketchIME) specifically designed for a professional C4I system. Within this system, sketches are utilized as low-fidelity prototypes for recommending standardized symbols in the creation of comprehensive situation maps. This paper also presents a systematic dataset comprising 374 specialized sketch types, and proposes a simultaneous recognition and segmentation architecture with multilevel supervision between recognition and segmentation to improve performance and enhance interpretability. By incorporating few-shot domain adaptation and class-incremental learning, the network's ability to adapt to new users and extend to new task-specific classes is significantly enhanced. Results from experiments conducted on both the proposed dataset and the SPG dataset illustrate the superior performance of the proposed architecture. Our dataset and code are publicly available at https://github.com/Anony517/SketchIME.

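Summary
With the spread of touchscreen devices, free-hand sketching has emerged as a promising modality for human-computer interaction. Previous research has focused on recognition, retrieval, and generation of familiar everyday objects, whereas this study aims to create a Sketch Input Method Editor (SketchIME) designed for a professional C4I system. In this system, sketches serve as low-fidelity prototypes for recommending standardized symbols when creating comprehensive situation maps. The paper also presents a systematic dataset of 374 specialized sketch types and proposes a simultaneous recognition and segmentation architecture with multilevel supervision to improve performance and interpretability. By incorporating few-shot domain adaptation and class-incremental learning, the network's ability to adapt to new users and extend to new task-specific classes is significantly improved. Experiments on the proposed dataset and the SPG dataset show the superior performance of the proposed architecture. The dataset and code are publicly available at https://github.com/Anony517/SketchIME.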

Navigating Privacy and Copyright Challenges Across the Data Lifecycle of Generative AI

paper_url: http://arxiv.org/abs/2311.18252
repo_url: None
paper_authors: Dawen Zhang, Boming Xia, Yue Liu, Xiwei Xu, Thong Hoang, Zhenchang Xing, Mark Staples, Qinghua Lu, Liming Zhu
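for: Examines the challenges of privacy and copyright protection across the data lifecycle of generative AI.
methods: Reviews traditional approaches such as differential privacy, machine unlearning, and data poisoning, which offer only fragmented solutions, and advocates integrated approaches that combine technical innovation with ethical foresight.
results: Argues for holistic solutions informed by the data lifecycle perspective, with the aim of catalyzing broader discussion and concerted efforts towards data privacy and copyright integrity in generative AI.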

Abstract
The advent of Generative AI has marked a significant milestone in artificial intelligence, demonstrating remarkable capabilities in generating realistic images, texts, and data patterns. However, these advancements come with heightened concerns over data privacy and copyright infringement, primarily due to the reliance on vast datasets for model training. Traditional approaches like differential privacy, machine unlearning, and data poisoning only offer fragmented solutions to these complex issues. Our paper delves into the multifaceted challenges of privacy and copyright protection within the data lifecycle. We advocate for integrated approaches that combine technical innovation with ethical foresight, holistically addressing these concerns by investigating and devising solutions that are informed by the lifecycle perspective. This work aims to catalyze a broader discussion and inspire concerted efforts towards data privacy and copyright integrity in Generative AI.

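Summary
The advent of Generative AI marks a significant milestone in artificial intelligence, with remarkable capabilities in generating realistic images, text, and data patterns. These advances, however, raise heightened concerns over data privacy and copyright infringement, mainly because model training relies on vast datasets. Traditional approaches such as differential privacy, machine unlearning, and data poisoning offer only fragmented solutions to these complex issues. Our paper examines the multifaceted challenges of privacy and copyright protection within the data lifecycle and advocates integrated approaches that combine technical innovation with ethical foresight, addressing these concerns holistically from the lifecycle perspective. This work aims to catalyze broader discussion and inspire concerted efforts towards data privacy and copyright integrity in Generative AI.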

Large Language Models for Travel Behavior Prediction

paper_url: http://arxiv.org/abs/2312.00819
repo_url: https://github.com/Aryia-Behroziuan/neurons
paper_authors: Baichuan Mo, Hanyong Xu, Dingyi Zhuang, Ruoyun Ma, Xiaotong Guo, Jinhua Zhao
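for: Predicting travel behavior, a fundamental task in transportation demand management.
methods: Uses large language models (LLMs) with prompt engineering and no data-based parameter learning; the prompts include a task description, travel characteristics, individual attributes, and guides of thinking with domain knowledge.
results: On a travel mode choice case study, LLM-based predictions reach accuracy and F1-score competitive with conventional supervised methods, and the LLMs can output explanations. Some explanations, however, violate logic or contain hallucinations.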

Abstract
Travel behavior prediction is a fundamental task in transportation demand management. The conventional methods for travel behavior prediction rely on numerical data to construct mathematical models and calibrate model parameters to represent human preferences. Recent advancement in large language models (LLMs) has shown great reasoning abilities to solve complex problems. In this study, we propose to use LLMs to predict travel behavior with prompt engineering without data-based parameter learning. Specifically, we carefully design our prompts that include 1) task description, 2) travel characteristics, 3) individual attributes, and 4) guides of thinking with domain knowledge, and ask the LLMs to predict an individual's travel behavior and explain the results. We select the travel mode choice task as a case study. Results show that, though no training samples are provided, LLM-based predictions have competitive accuracy and F1-score as canonical supervised learning methods such as multinomial logit, random forest, and neural networks. LLMs can also output reasons that support their prediction. However, though in most of the cases, the output explanations are reasonable, we still observe cases that violate logic or with hallucinations.

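Summary
Travel behavior prediction is a fundamental task in transportation demand management. Conventional methods rely on numerical data to construct mathematical models and calibrate model parameters that represent human preferences. Recent advances in large language models (LLMs) have shown strong reasoning abilities for solving complex problems. In this study, we propose using LLMs to predict travel behavior through prompt engineering, without data-based parameter learning. Specifically, we carefully design prompts that include 1) a task description, 2) travel characteristics, 3) individual attributes, and 4) guides of thinking with domain knowledge, and ask the LLMs to predict an individual's travel behavior and explain the result. We select travel mode choice as a case study. Results show that, although no training samples are provided, LLM-based predictions have accuracy and F1-score competitive with canonical supervised learning methods such as multinomial logit, random forest, and neural networks. LLMs can also output reasons that support their predictions. However, although the explanations are reasonable in most cases, we still observe cases that violate logic or contain hallucinations.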
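
The four-part prompt described above can be sketched in Python as follows. This is a minimal illustration, not the authors' prompt: the `Trip` and `Traveler` fields, the `MODES` choice set, the wording of every prompt section, and the `parse_mode` helper are all assumptions introduced here, and no LLM is actually called.

```python
from dataclasses import dataclass
from typing import Optional

MODES = ["walk", "drive", "transit"]  # assumed choice set, not from the paper


@dataclass
class Trip:
    # Hypothetical travel characteristics.
    walk_min: float
    drive_min: float
    transit_min: float
    transit_fare: float


@dataclass
class Traveler:
    # Hypothetical individual attributes.
    age: int
    income_band: str
    owns_car: bool


def build_prompt(trip: Trip, person: Traveler) -> str:
    """Join the four prompt parts named in the abstract into one string."""
    task = (
        "Task: predict which travel mode this person will choose "
        f"from {', '.join(MODES)}, then explain the choice."
    )
    travel = (
        "Travel characteristics: "
        f"walk {trip.walk_min} min; drive {trip.drive_min} min; "
        f"transit {trip.transit_min} min, fare {trip.transit_fare}."
    )
    individual = (
        "Individual attributes: "
        f"age {person.age}; income {person.income_band}; "
        f"car owner: {'yes' if person.owns_car else 'no'}."
    )
    # Illustrative domain-knowledge hint; the paper's actual guidance is not given here.
    guide = (
        "Guide: people tend to prefer faster and cheaper modes; "
        "driving is only possible for car owners."
    )
    return "\n\n".join([task, travel, individual, guide])


def parse_mode(reply: str) -> Optional[str]:
    """Return the known mode that appears earliest in an LLM reply, if any."""
    lowered = reply.lower()
    hits = [(lowered.find(mode), mode) for mode in MODES if mode in lowered]
    return min(hits)[1] if hits else None


prompt = build_prompt(Trip(35, 12, 20, 2.5), Traveler(29, "middle", False))
```

The resulting string would be sent to whichever LLM is under study, and `parse_mode` turns the free-text reply into a label that can be scored with accuracy and F1 against observed choices.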

LLVMs4Protest: Harnessing the Power of Large Language and Vision Models for Deciphering Protests in the News

paper_url: http://arxiv.org/abs/2311.18241
repo_url: https://github.com/joshzyj/llvms4protest
paper_authors: Yongjun Zhang
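for: Documents how large language and vision models can be used to infer protest events in news articles.
methods: Fine-tunes two large pretrained transformer models, Longformer and Swin-Transformer V2, on the text and image data of news articles.
results: Longformer is fine-tuned on New York Times articles matched to the DoCA corpus, and Swin-Transformer V2 on UCLA-protest imagery data. Both fine-tuned models will be released on GitHub.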

Abstract
Large language and vision models have transformed how social movement scholars identify protest and extract key protest attributes from multi-modal data such as texts, images, and videos. This article documents how we fine-tuned two large pretrained transformer models, Longformer and Swin-Transformer V2, to infer potential protests in news articles using textual and imagery data. First, the Longformer model was fine-tuned using the Dynamics of Collective Action (DoCA) Corpus. We matched the New York Times articles with the DoCA database to obtain a training dataset for downstream tasks. Second, the Swin-Transformer V2 model was trained on UCLA-protest imagery data. The UCLA-protest project contains labeled imagery data with information such as protest, violence, and sign. Both fine-tuned models will be available via https://github.com/Joshzyj/llvms4protest. We release this short technical report for social movement scholars who are interested in using LLVMs to infer protests in textual and imagery data.

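Summary
Large language and vision models have changed how social movement scholars identify protest and extract key protest attributes from multi-modal data. This article documents how we fine-tuned two large pretrained transformer models, Longformer and Swin-Transformer V2, to infer potential protests from textual and imagery data. First, we fine-tuned the Longformer model using the DoCA corpus, matching New York Times articles with the DoCA database to obtain a training set for downstream tasks. Second, we trained the Swin-Transformer V2 model on UCLA-protest imagery data. The UCLA-protest project contains labeled imagery data with information such as protest, violence, and sign. Both fine-tuned models will be available at https://github.com/Joshzyj/llvms4protest. We release this short technical report for social movement scholars who want to use LLVMs to infer protests in textual and imagery data.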

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

paper_url: http://arxiv.org/abs/2311.18232
repo_url: https://github.com/abdulhaim/lmrl-gym
paper_authors: Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, Sergey Levine
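for: Supports the development of stable and reliable reinforcement learning (RL) algorithms for training goal-directed language agents based on LLMs.
methods: Introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework containing a basic toolkit for offline value-based and policy-based RL methods.
results: The benchmark consists of 8 language tasks that require multiple rounds of language interaction and cover open-ended dialogue and text games.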

Abstract
Large language models (LLMs) provide excellent text-generation capabilities, but standard prompting and generation methods generally do not lead to intentional or goal-directed agents and might necessitate considerable prompt tuning. This becomes particularly apparent in multi-turn conversations: even the best current LLMs rarely ask clarifying questions, engage in explicit information gathering, or take actions now that lead to better decisions after multiple turns. Reinforcement learning has the potential to leverage the powerful modeling capabilities of LLMs, as well as their internal representation of textual interactions, to create capable goal-directed language agents. This can enable intentional and temporally extended interactions, such as with humans, through coordinated persuasion and carefully crafted questions, or in goal-directed play through text games to bring about desired final outcomes. However, enabling this requires the community to develop stable and reliable reinforcement learning algorithms that can effectively train LLMs. Developing such algorithms requires tasks that can gauge progress on algorithm design, provide accessible and reproducible evaluations for multi-turn interactions, and cover a range of task properties and challenges in improving reinforcement learning algorithms. Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework containing a basic toolkit for getting started on multi-turn RL with offline value-based and policy-based RL methods. Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.

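Summary
Large language models (LLMs) have excellent text-generation capabilities, but standard prompting and generation methods generally do not produce intentional or goal-directed agents and may require considerable prompt tuning. This is especially evident in multi-turn conversations: even the best current LLMs rarely ask clarifying questions, engage in explicit information gathering, or take actions now that lead to better decisions after multiple turns. Reinforcement learning can leverage the strong modeling capabilities of LLMs, and their internal representation of textual interactions, to create capable goal-directed language agents. This can enable intentional and temporally extended interactions, for example with humans through coordinated persuasion and carefully crafted questions, or in text games played towards a desired final outcome. Achieving this requires the community to develop stable and reliable reinforcement learning algorithms that can train LLMs effectively. Developing such algorithms in turn requires tasks that gauge progress on algorithm design, provide accessible and reproducible evaluations for multi-turn interactions, and cover a range of task properties and challenges. Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework containing a basic toolkit for getting started on multi-turn RL with offline value-based and policy-based RL methods. The benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.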

Reasoning with the Theory of Mind for Pragmatic Semantic Communication

paper_url: http://arxiv.org/abs/2311.18224
repo_url: None
paper_authors: Christo Kurisummoottil Thomas, Emilio Calvanese Strinati, Walid Saad
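for: Proposes a pragmatic semantic communication framework for effective goal-oriented information sharing between two intelligent agents.
methods: The framework builds on the theory of mind (ToM), an emerging concept in machine learning, and uses a dynamic two-level (wireless and semantic) feedback mechanism to continuously fine-tune neural network components at the transmitter.
results: Numerical evaluations show that the framework communicates efficiently with fewer bits while maintaining the same semantics, outperforming conventional systems that do not exploit ToM-based reasoning.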

Abstract
In this paper, a pragmatic semantic communication framework that enables effective goal-oriented information sharing between two-intelligent agents is proposed. In particular, semantics is defined as the causal state that encapsulates the fundamental causal relationships and dependencies among different features extracted from data. The proposed framework leverages the emerging concept in machine learning (ML) called theory of mind (ToM). It employs a dynamic two-level (wireless and semantic) feedback mechanism to continuously fine-tune neural network components at the transmitter. Thanks to the ToM, the transmitter mimics the actual mental state of the receiver's reasoning neural network operating semantic interpretation. Then, the estimated mental state at the receiver is dynamically updated thanks to the proposed dynamic two-level feedback mechanism. At the lower level, conventional channel quality metrics are used to optimize the channel encoding process based on the wireless communication channel's quality, ensuring an efficient mapping of semantic representations to a finite constellation. Additionally, a semantic feedback level is introduced, providing information on the receiver's perceived semantic effectiveness with minimal overhead. Numerical evaluations demonstrate the framework's ability to achieve efficient communication with a reduced amount of bits while maintaining the same semantics, outperforming conventional systems that do not exploit the ToM-based reasoning.

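Summary
This paper proposes a pragmatic semantic communication framework that enables effective goal-oriented information sharing between two intelligent agents. Semantics is defined as the causal state that captures the fundamental causal relationships and dependencies among features extracted from data. Using the theory of mind (ToM), the transmitter mimics the mental state of the receiver's reasoning neural network, and this estimate is updated dynamically through a two-level feedback mechanism. At the lower level, conventional channel quality metrics are used to optimize the channel encoding process based on the wireless communication channel's quality, ensuring an efficient mapping of semantic representations to a finite constellation. Additionally, a semantic feedback level is introduced to provide information on the receiver's perceived semantic effectiveness with minimal overhead. Numerical evaluations demonstrate that the proposed framework can achieve efficient communication with a reduced amount of bits while maintaining the same semantics, outperforming conventional systems that do not exploit the ToM-based reasoning. The proposed framework has the potential to enable more effective and efficient communication between intelligent agents in various applications.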

Beyond Two-Tower Matching: Learning Sparse Retrievable Cross-Interactions for Recommendation

paper_url: http://arxiv.org/abs/2311.18213
repo_url: None
paper_authors: Liangcai Su, Fan Yan, Jieming Zhu, Xi Xiao, Haoyi Duan, Zhou Zhao, Zhenhua Dong, Ruiming Tang
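for: Improving both the accuracy and the retrieval efficiency of candidate matching in recommender systems.
methods: Proposes a new matching paradigm, SparCode, that supports sophisticated feature interactions as well as efficient retrieval. SparCode introduces an all-to-all interaction module to model fine-grained query-item interactions, and a discrete code-based sparse inverted index that is trained jointly with the model.
results: Extensive experiments on open benchmark datasets show that SparCode significantly improves the accuracy of candidate item matching while retaining the same level of retrieval efficiency as two-tower models.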

Abstract
Two-tower models are a prevalent matching framework for recommendation, which have been widely deployed in industrial applications. The success of two-tower matching is attributed to its efficiency in retrieval among a large number of items, since the item tower can be precomputed and used for fast Approximate Nearest Neighbor (ANN) search. However, it suffers from two main challenges, including limited feature interaction capability and reduced accuracy in online serving. Existing approaches attempt to design novel late interactions instead of dot products, but they still fail to support complex feature interactions or lose retrieval efficiency. To address these challenges, we propose a new matching paradigm named SparCode, which supports not only sophisticated feature interactions but also efficient retrieval. Specifically, SparCode introduces an all-to-all interaction module to model fine-grained query-item interactions. Besides, we design a discrete code-based sparse inverted index jointly trained with the model to achieve effective and efficient model inference. Extensive experiments have been conducted on open benchmark datasets to demonstrate the superiority of our framework. The results show that SparCode significantly improves the accuracy of candidate item matching while retaining the same level of retrieval efficiency with two-tower models. Our source code will be available at MindSpore/models.

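Summary
Two-tower models are a prevalent matching framework that has been widely deployed in industrial applications. Their success comes from fast retrieval over a large number of items, because the item tower can be precomputed and used for fast approximate nearest neighbor (ANN) search. They face two main challenges, however: limited feature interaction capability and reduced accuracy in online serving. Existing approaches try to design new late interactions in place of dot products, but they still cannot support complex feature interactions, or they lose retrieval efficiency. To address these challenges, we propose a new matching paradigm, SparCode, which supports sophisticated feature interactions while keeping retrieval efficient. Specifically, SparCode introduces an all-to-all interaction module to model fine-grained query-item interactions. We also design a discrete code-based sparse inverted index that is trained jointly with the model to achieve effective and efficient inference. Extensive experiments on open benchmark datasets show that SparCode improves the accuracy of candidate item matching at the same level of retrieval efficiency as two-tower models. Our source code will be available at MindSpore/models.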
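
To make the two-tower baseline concrete, here is a minimal NumPy sketch of dot-product retrieval over precomputed item embeddings. It illustrates only the baseline that SparCode is compared against, not SparCode itself. The embeddings are random stand-ins for tower outputs, exact top-k search replaces the ANN index a production system would use, and `retrieve_top_k` is a name chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 1000, 16

# Item tower output: computed once offline and stored (random stand-in here).
item_emb = rng.normal(size=(n_items, dim))

# Query tower output: computed online for each request (random stand-in here).
query_emb = rng.normal(size=dim)


def retrieve_top_k(query_emb, item_emb, k):
    """Return indices of the k items with the highest dot-product score."""
    scores = item_emb @ query_emb                # one matrix-vector product
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k candidates
    return top[np.argsort(-scores[top])]         # order them by score, best first


top5 = retrieve_top_k(query_emb, item_emb, 5)
```

Because the query and the item meet only in the final dot product, item vectors can be indexed ahead of time. That is the efficiency the abstract credits to two-tower models, and also the source of the limited feature interaction it criticizes.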

Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation

paper_url: http://arxiv.org/abs/2311.18207
repo_url: https://github.com/hakuhodo-technologies/scope-rl
paper_authors: Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito
for: Assessing off-policy evaluation (OPE) estimators, which evaluate counterfactual policies using only offline logged data and are often used to pick the top-k policies for online A/B tests, in terms of the risk-return tradeoff of the resulting policy selection.
methods: Drawing inspiration from portfolio evaluation in finance, develops a new metric, SharpeRatio@k, that measures the risk-return tradeoff of policy portfolios formed by an OPE estimator under varying online evaluation budgets (k).
results: Validates SharpeRatio@k in two example scenarios, showing that it effectively distinguishes low-risk from high-risk estimators and accurately identifies the most efficient estimator.

Abstract
Off-Policy Evaluation (OPE) aims to assess the effectiveness of counterfactual policies using only offline logged data and is often used to identify the top-k promising policies for deployment in online A/B tests. Existing evaluation metrics for OPE estimators primarily focus on the "accuracy" of OPE or that of downstream policy selection, neglecting risk-return tradeoff in the subsequent online policy deployment. To address this issue, we draw inspiration from portfolio evaluation in finance and develop a new metric, called SharpeRatio@k, which measures the risk-return tradeoff of policy portfolios formed by an OPE estimator under varying online evaluation budgets (k). We validate our metric in two example scenarios, demonstrating its ability to effectively distinguish between low-risk and high-risk estimators and to accurately identify the most efficient estimator. This efficient estimator is characterized by its capability to form the most advantageous policy portfolios, maximizing returns while minimizing risks during online deployment, a nuance that existing metrics typically overlook. To facilitate a quick, accurate, and consistent evaluation of OPE via SharpeRatio@k, we have also integrated this metric into an open-source software, SCOPE-RL. Employing SharpeRatio@k and SCOPE-RL, we conduct comprehensive benchmarking experiments on various estimators and RL tasks, focusing on their risk-return tradeoff. These experiments offer several interesting directions and suggestions for future OPE research.

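Summary
Off-policy evaluation (OPE) aims to assess counterfactual policies using only offline logged data and is often used to identify the top-k policies to deploy in online A/B tests. Existing evaluation metrics for OPE estimators focus mainly on the accuracy of OPE or of downstream policy selection and neglect the risk-return tradeoff in the subsequent online deployment. To address this, we draw on portfolio evaluation in finance and develop a new metric, SharpeRatio@k, which measures the risk-return tradeoff of the policy portfolios formed by an OPE estimator under varying online evaluation budgets (k). We validate the metric in two example scenarios, showing that it distinguishes low-risk from high-risk estimators and accurately identifies the most efficient estimator, that is, the one that forms the most advantageous policy portfolios by maximizing return while minimizing risk during online deployment, a nuance existing metrics typically overlook. To support quick, accurate, and consistent evaluation of OPE, we have also integrated SharpeRatio@k into the open-source software SCOPE-RL. Using SharpeRatio@k and SCOPE-RL, we run comprehensive benchmarking experiments across estimators and RL tasks with a focus on their risk-return tradeoff; these experiments suggest several directions for future OPE research.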

SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation

paper_url: http://arxiv.org/abs/2311.18206
repo_url: https://github.com/hakuhodo-technologies/scope-rl
paper_authors: Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito
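for: The paper introduces SCOPE-RL, an open-source Python package for offline reinforcement learning (offline RL), off-policy evaluation (OPE), and selection (OPS). Unlike most existing libraries, SCOPE-RL integrates policy learning and evaluation, so researchers can implement both offline RL and OPE workflows with ease.
methods: SCOPE-RL's OPE modules offer a range of OPE estimators and robust evaluation-of-OPE protocols. This makes OPE more in-depth and reliable; for example, it can estimate the entire reward distribution under a policy rather than only its point-wise expected value.
results: SCOPE-RL yields more in-depth and reliable OPE results, including an assessment of the risk-return tradeoff in OPE, which goes beyond the accuracy-only evaluations in the existing OPE literature.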

Abstract
This paper introduces SCOPE-RL, a comprehensive open-source Python software designed for offline reinforcement learning (offline RL), off-policy evaluation (OPE), and selection (OPS). Unlike most existing libraries that focus solely on either policy learning or evaluation, SCOPE-RL seamlessly integrates these two key aspects, facilitating flexible and complete implementations of both offline RL and OPE processes. SCOPE-RL puts particular emphasis on its OPE modules, offering a range of OPE estimators and robust evaluation-of-OPE protocols. This approach enables more in-depth and reliable OPE compared to other packages. For instance, SCOPE-RL enhances OPE by estimating the entire reward distribution under a policy rather than its mere point-wise expected value. Additionally, SCOPE-RL provides a more thorough evaluation-of-OPE by presenting the risk-return tradeoff in OPE results, extending beyond mere accuracy evaluations in existing OPE literature. SCOPE-RL is designed with user accessibility in mind. Its user-friendly APIs, comprehensive documentation, and a variety of easy-to-follow examples assist researchers and practitioners in efficiently implementing and experimenting with various offline RL methods and OPE estimators, tailored to their specific problem contexts. The documentation of SCOPE-RL is available at https://scope-rl.readthedocs.io/en/latest/.
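To make "estimating the entire reward distribution" concrete, here is a minimal sketch of one common approach: re-weighting the empirical CDF of logged returns with self-normalized importance weights. This is an illustrative assumption, not SCOPE-RL's API or a claim about which estimators it ships; the function names and the per-trajectory weights (evaluation-policy probability over behavior-policy probability) are hypothetical.

```python
def estimate_cdf(returns, weights, threshold):
    """Self-normalized importance-weighted estimate of P(return <= threshold)."""
    total = sum(weights)
    # Add up the weight of every logged trajectory whose return is at most the threshold.
    mass = sum(w for r, w in zip(returns, weights) if r <= threshold)
    return mass / total

def estimate_mean(returns, weights):
    """Point-wise expected value under the same weighting, for comparison."""
    return sum(r * w for r, w in zip(returns, weights)) / sum(weights)
```

Evaluating `estimate_cdf` over a grid of thresholds gives a full distribution estimate, from which risk measures such as quantiles can be read off, whereas `estimate_mean` yields only the single expected value.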

Summary
SCOPE-RL is designed with user accessibility in mind, providing user-friendly APIs, comprehensive documentation, and easy-to-follow examples to assist researchers and practitioners in efficiently implementing and experimenting with various offline RL methods and OPE estimators tailored to their specific problem contexts. The documentation of SCOPE-RL is available at https://scope-rl.readthedocs.io/en/latest/.

HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

paper_url: http://arxiv.org/abs/2312.00079
repo_url: None
paper_authors: Zhonghao Wang, Wei Wei, Yang Zhao, Zhisheng Xiao, Mark Hasegawa-Johnson, Humphrey Shi, Tingbo Hou
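for: This work explores advances in high-fidelity personalized image generation built on pre-trained text-to-image diffusion models.
methods: The authors propose HiFi Tuner, a new algorithm that uses a parameter-efficient fine-tuning framework comprising a denoising process and a pivotal inversion process. Key enhancements include mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations.
results: Experiments show that HiFi Tuner improves the appearance preservation of subjects and enables subject substitution in an image through textual manipulation. On the DreamBooth dataset with the Stable Diffusion model, fine-tuning solely on textual embeddings improves the CLIP-T score by 3.6 points and the DINO score by 9.6 points over Textual Inversion, while fine-tuning all parameters improves the CLIP-T score by 1.2 points and the DINO score by 1.2 points over DreamBooth, establishing a new state of the art.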

Abstract
This paper explores advancements in high-fidelity personalized image generation through the utilization of pre-trained text-to-image diffusion models. While previous approaches have made significant strides in generating versatile scenes based on text descriptions and a few input images, challenges persist in maintaining the subject fidelity within the generated images. In this work, we introduce an innovative algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation. Our proposed method employs a parameter-efficient fine-tuning framework, comprising a denoising process and a pivotal inversion process. Key enhancements include the utilization of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations to elevate the sample fidelity. Additionally, we propose a reference-guided generation approach that leverages the pivotal inversion of a reference image to mitigate unwanted subject variations and artifacts. We further extend our method to a novel image editing task: substituting the subject in an image through textual manipulations. Experimental evaluations conducted on the DreamBooth dataset using the Stable Diffusion model showcase promising results. Fine-tuning solely on textual embeddings improves CLIP-T score by 3.6 points and improves DINO score by 9.6 points over Textual Inversion. When fine-tuning all parameters, HiFi Tuner improves CLIP-T score by 1.2 points and improves DINO score by 1.2 points over DreamBooth, establishing a new state of the art.

Summary
Our proposed method consists of a parameter-efficient fine-tuning framework that includes a denoising process and a pivotal inversion process. Key enhancements include the use of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations to elevate sample fidelity. Additionally, we propose a reference-guided generation approach that leverages the pivotal inversion of a reference image to mitigate unwanted subject variations and artifacts. We extend our method to a novel image editing task: substituting the subject in an image through textual manipulations. Experimental evaluations conducted on the DreamBooth dataset using the Stable Diffusion model show promising results. Fine-tuning solely on textual embeddings improves CLIP-T score by 3.6 points and improves DINO score by 9.6 points over Textual Inversion. When fine-tuning all parameters, HiFi Tuner improves CLIP-T score by 1.2 points and improves DINO score by 1.2 points over DreamBooth, setting a new state of the art.

Toward the Tradeoffs between Privacy, Fairness and Utility in Federated Learning

paper_url: http://arxiv.org/abs/2311.18190
repo_url: None
paper_authors: Kangkang Sun, Xiaojin Zhang, Xi Lin, Gaolei Li, Jing Wang, Jianhua Li
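for: This work studies the interplay between fairness and privacy in federated learning (FL), a distributed learning paradigm that protects user privacy and reduces the risk of data leakage.
methods: On the client side, fairness metrics such as Demographic Parity (DemP), Equalized Odds (EOs), and Disparate Impact (DI) are used to construct a local fair model. To protect the privacy of the client model, the authors propose a privacy-protecting fair FL method.
results: Experiments show that the accuracy of the fair model increases when privacy is added, because privacy breaks the constraints imposed by the fairness metrics. The authors conclude that there is a tradeoff between privacy, fairness, and utility.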

Abstract
Federated Learning (FL) is a novel privacy-protection distributed machine learning paradigm that guarantees user privacy and prevents the risk of data leakage due to the advantage of the client's local training. Researchers have struggled to design fair FL systems that ensure fairness of results. However, the interplay between fairness and privacy has been less studied. Increasing the fairness of FL systems can have an impact on user privacy, while an increase in user privacy can affect fairness. In this work, on the client side, we use fairness metrics, such as Demographic Parity (DemP), Equalized Odds (EOs), and Disparate Impact (DI), to construct the local fair model. To protect the privacy of the client model, we propose a privacy-protection fairness FL method. The results show that the accuracy of the fair model with privacy increases because privacy breaks the constraints of the fairness metrics. In our experiments, we conclude the relationship between privacy, fairness and utility, and there is a tradeoff between these.
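The three fairness metrics named above can be computed from binary predictions, labels, and a binary sensitive attribute. The sketch below uses their common textbook definitions; the function names, the choice of group `1` as the unprivileged group in the disparate impact ratio, and the assumption that every group and label combination is non-empty are illustrative and not taken from the paper.

```python
def positive_rate(preds, mask):
    """Fraction of positive predictions among the examples selected by mask."""
    selected = [p for p, m in zip(preds, mask) if m]
    return sum(selected) / len(selected)  # assumes at least one selected example

def demographic_parity_gap(preds, groups):
    """|P(pred=1 | group=0) - P(pred=1 | group=1)|; 0 means parity."""
    r0 = positive_rate(preds, [g == 0 for g in groups])
    r1 = positive_rate(preds, [g == 1 for g in groups])
    return abs(r0 - r1)

def disparate_impact(preds, groups, unprivileged=1, privileged=0):
    """Ratio of positive rates, unprivileged over privileged; 1 means parity."""
    r_unpriv = positive_rate(preds, [g == unprivileged for g in groups])
    r_priv = positive_rate(preds, [g == privileged for g in groups])
    return r_unpriv / r_priv

def equalized_odds_gap(preds, labels, groups):
    """Largest between-group gap in positive rate, conditioned on the true label."""
    gaps = []
    for y in (0, 1):
        r0 = positive_rate(preds, [g == 0 and l == y for g, l in zip(groups, labels)])
        r1 = positive_rate(preds, [g == 1 and l == y for g, l in zip(groups, labels)])
        gaps.append(abs(r0 - r1))
    return max(gaps)
```

In a federated setting, each client would evaluate such quantities on its own local data when fitting its local fair model, for example as a constraint or penalty term.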

Summary

2023-11-30

cs.CL

cs.CL - 2023-11-30

paper_url: http://arxiv.org/abs/2312.00220
repo_url: None
paper_authors: Linzi Xing, Quan Tran, Fabian Caba, Franck Dernoncourt, Seunghyun Yoon, Zhaowen Wang, Trung Bui, Giuseppe Carenini
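for: The paper aims to improve video topic segmentation, which supports other video understanding tasks.
methods: It proposes a multi-modal video topic segmenter that uses both video transcripts and frames, supported by a cross-modal attention mechanism. It also proposes a dual-contrastive learning framework that follows the unsupervised domain adaptation paradigm to improve the model's adaptability to longer, more semantically complex videos.
results: Experiments on short and long video corpora show that the proposed approach surpasses baseline methods in both accuracy and transferability.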

Abstract
Video topic segmentation unveils the coarse-grained semantic structure underlying videos and is essential for other video understanding tasks. Given the recent surge in multi-modal content, relying solely on a single modality is arguably insufficient. On the other hand, prior solutions for similar tasks like video scene/shot segmentation cater to short videos with clear visual shifts but falter for long videos with subtle changes, such as livestreams. In this paper, we introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames, bolstered by a cross-modal attention mechanism. Furthermore, we propose a dual-contrastive learning framework adhering to the unsupervised domain adaptation paradigm, enhancing our model's adaptability to longer, more semantically complex videos. Experiments on short and long video corpora demonstrate that our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability, in both intra- and cross-domain settings.

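Summary
Video topic segmentation reveals the coarse-grained semantic structure of videos and is a key prerequisite for other video understanding tasks. Relying on a single modality alone is arguably insufficient. In this paper, we introduce a multi-modal video topic segmenter that uses both video transcripts and frames, supported by a cross-modal attention mechanism. We also propose a dual-contrastive learning framework that follows the unsupervised domain adaptation paradigm, improving the model's adaptability to longer, more complex videos. Experiments on short and long video corpora show that the proposed solution significantly surpasses baseline methods in both accuracy and transferability, in intra-domain and cross-domain settings alike.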

Relevance-guided Neural Machine Translation

paper_url: http://arxiv.org/abs/2312.00214
repo_url: None
paper_authors: Isidora Chara Tourni, Derry Wijaya
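for: Improving NMT results in low-resource conditions.
methods: An explainability-based training approach, applied to both unsupervised and supervised model training, for translating French, Gujarati, and Kazakh to and from English.
results: In low-resource conditions the method outperforms simple training baselines, particularly for French and Gujarati. The improvement is marginal, but it encourages further exploration of the approach, its parameters, and its extension to other languages.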

Abstract
With the advent of the Transformer architecture, Neural Machine Translation (NMT) results have shown great improvement lately. However, results in low-resource conditions still lag behind in both bilingual and multilingual setups, due to the limited amount of available monolingual and/or parallel data; hence, the need for methods addressing data scarcity in an efficient, and explainable way, is eminent. We propose an explainability-based training approach for NMT, applied in Unsupervised and Supervised model training, for translation of three languages of varying resources, French, Gujarati, Kazakh, to and from English. Our results show our method can be promising, particularly when training in low-resource conditions, outperforming simple training baselines; though the improvement is marginal, it sets the ground for further exploration of the approach and the parameters, and its extension to other languages.

Summary

Robust Concept Erasure via Kernelized Rate-Distortion Maximization

paper_url: http://arxiv.org/abs/2312.00194
repo_url: https://github.com/brcsomnath/kram
paper_authors: Somnath Basu Roy Chowdhury, Nicholas Monath, Avinava Dubey, Amr Ahmed, Snigdha Chaturvedi
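for: The paper proposes a new distance metric learning-based objective, the Kernelized Rate-Distortion Maximizer (KRaM), for performing concept erasure.
methods: KRaM uses a modified rate-distortion function to fit a transformation of representations that matches a specified distance measure (defined by the labeled concept to erase). The objective makes instances with similar concept labels dissimilar in the learned representation space while retaining other information.
results: Experiments show that KRaM effectively erases several types of concepts, including categorical, continuous, and vector-valued variables, from data representations across diverse domains. The authors also provide a theoretical analysis of the KRaM objective and an alignment score for assessing the quality of the learned representations.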

Abstract
Distributed representations provide a vector space that captures meaningful relationships between data instances. The distributed nature of these representations, however, entangles together multiple attributes or concepts of data instances (e.g., the topic or sentiment of a text, characteristics of the author (age, gender, etc), etc). Recent work has proposed the task of concept erasure, in which rather than making a concept predictable, the goal is to remove an attribute from distributed representations while retaining other information from the original representation space as much as possible. In this paper, we propose a new distance metric learning-based objective, the Kernelized Rate-Distortion Maximizer (KRaM), for performing concept erasure. KRaM fits a transformation of representations to match a specified distance measure (defined by a labeled concept to erase) using a modified rate-distortion function. Specifically, KRaM's objective function aims to make instances with similar concept labels dissimilar in the learned representation space while retaining other information. We find that optimizing KRaM effectively erases various types of concepts: categorical, continuous, and vector-valued variables from data representations across diverse domains. We also provide a theoretical analysis of several properties of KRaM's objective. To assess the quality of the learned representations, we propose an alignment score to evaluate their similarity with the original representation space. Additionally, we conduct experiments to showcase KRaM's efficacy in various settings, from erasing binary gender variables in word embeddings to vector-valued variables in GPT-3 representations.

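Summary
Distributed representations provide a vector space that captures meaningful relationships between data instances. Their distributed nature, however, entangles multiple attributes or concepts of each instance (for example, the topic or sentiment of a text, or the age and gender of its author). Recent work has proposed the task of concept erasure, in which the goal is not to make a concept predictable but to remove it from distributed representations while retaining as much of the other information in the original representation space as possible. In this paper, we propose a new distance metric learning-based objective, the Kernelized Rate-Distortion Maximizer (KRaM), for performing concept erasure. KRaM uses a modified rate-distortion function to fit a specified distance measure. Specifically, its objective makes instances with similar concept labels dissimilar in the learned representation space while retaining other information. We find that optimizing KRaM effectively erases various types of concepts, including categorical, continuous, and vector-valued variables, from data representations. We also provide a theoretical analysis of the KRaM objective. To assess the quality of the learned representations, we propose an alignment score that measures their similarity to the original representation space. Finally, we run experiments that demonstrate KRaM's efficacy in a variety of settings.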

Navigating News Narratives: A Media Bias Analysis Dataset

paper_url: http://arxiv.org/abs/2312.00168
repo_url: None
paper_authors: Shaina Raza
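for: The study provides a media bias analysis dataset for investigating the influence of media bias and methods for detecting it.
methods: The study applies natural language processing and machine learning techniques to classify and analyze news articles in order to detect the presence and characteristics of media bias.
results: The study delivers a rich media bias analysis dataset covering examples and characteristics of many kinds of bias, which can support further research and the development of bias detection and analysis methods.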

Abstract
The proliferation of biased news narratives across various media platforms has become a prominent challenge, influencing public opinion on critical topics like politics, health, and climate change. This paper introduces the "Navigating News Narratives: A Media Bias Analysis Dataset", a comprehensive dataset to address the urgent need for tools to detect and analyze media bias. This dataset encompasses a broad spectrum of biases, making it a unique and valuable asset in the field of media studies and artificial intelligence. The dataset is available at https://huggingface.co/datasets/newsmediabias/news-bias-full-data.

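Summary
The proliferation of biased news narratives across media platforms has become a major challenge, influencing public opinion on critical topics such as politics, health, and climate change. This paper introduces the "Navigating News Narratives: A Media Bias Analysis Dataset", a comprehensive dataset for detecting and analyzing media bias. It covers a broad spectrum of biases, making it a unique and valuable asset for media studies and artificial intelligence. The dataset is available at https://huggingface.co/datasets/newsmediabias/news-bias-full-data.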

A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

paper_url: http://arxiv.org/abs/2312.00115
repo_url: None
paper_authors: Matthew Gwilliam, Michael Cogswell, Meng Ye, Karan Sikka, Abhinav Shrivastava, Ajay Divakaran
for: The paper evaluates the capabilities of long video retrieval systems and proposes a pipeline for generating diverse synthetic captions for long videos.
methods: The paper uses state-of-the-art large language models to generate synthetic captions for long videos, and it uses a contrastive loss to learn a hierarchical embedding loss for fine-tuning the video language models.
results: The proposed method improves performance on the downstream paragraph-to-video retrieval task and on various long video retrieval metrics computed with synthetic data. Specifically, it achieves a 1.1% increase in R@1 on ActivityNet and a 3.6% increase in R@1 for short descriptions on ActivityNet.

Abstract
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could be described in moment-by-moment detail, or in a single phrase summary, or anything in between. To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We then benchmark a representative set of video language models on these synthetic captions using a few long video datasets, showing that they struggle with the transformed data, especially the shortest captions. We also propose a lightweight fine-tuning method, where we use a contrastive loss to learn a hierarchical embedding loss based on the differing levels of information among the various captions. Our method improves performance both on the downstream paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet), as well as for the various long video retrieval metrics we compute using our synthetic data (+3.6% R@1 for short descriptions on ActivityNet). For data access and other details, please refer to our project website at https://mgwillia.github.io/10k-words.
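The R@1 figures quoted above are recall-at-k scores. As a minimal sketch, assuming a caption-to-video similarity matrix in which caption `i` matches video `i` (the function name and this pairing convention are illustrative, not taken from the paper), recall@k can be computed as follows.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item appears among the top-k results.

    similarity[i][j] is the score between query i and candidate j;
    the correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Candidate indices sorted by descending similarity to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)
```

Under this convention, a "+1.1% R@1" improvement means that about 1.1 percentage points more captions retrieve their own video as the top result.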

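Summary
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where each long video is described by a single long paragraph. This neglects the richness and variety of valid descriptions of a video, which can range from moment-by-moment detail to a single-phrase summary or anything in between. To evaluate long video retrieval systems more thoroughly, we propose a pipeline that uses state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos, and we validate its fidelity through human inspection. We benchmark a representative set of video language models on these synthetic captions and show that they struggle, especially with the shortest captions. We also propose a lightweight fine-tuning method that uses a contrastive loss to learn a hierarchical embedding loss based on the differing levels of information among the captions. Our method improves performance on the downstream paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet) and on the long video retrieval metrics computed with our synthetic data (+3.6% R@1 for short descriptions on ActivityNet). For data access and other details, please refer to our project website at https://mgwillia.github.io/10k-words.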

What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations

paper_url: http://arxiv.org/abs/2311.18812
repo_url: https://github.com/castorini/biasprobe
paper_authors: Raphael Tang, Xinyu Zhang, Jimmy Lin, Ferhan Ture
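for: This study investigates whether large language models (LLMs) exhibit sociodemographic biases even when they decline to respond.
methods: The authors probe contextualized embeddings to examine whether bias is encoded in the models' latent representations. They propose a logistic Bradley-Terry probe that predicts an LLM's word pair preferences from the words' hidden vectors.
results: The probe is validated on three pair preference tasks and thirteen LLMs, outperforming WEAT by a relative 27% in error rate. Word pair preferences are best represented in the middle layers. Substantial bias is observed for all target classes, including nationality, politics, religion, and gender, which suggests that instruction fine-tuning does not necessarily remove bias from contextualized embeddings.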

Abstract
Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond? To bypass their refusal to "speak," we study this research question by probing contextualized embeddings and exploring whether this bias is encoded in its latent representations. We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors. We first validate our probe on three pair preference tasks and thirteen LLMs, where we outperform the word embedding association test (WEAT), a standard approach in testing for implicit association, by a relative 27% in error rate. We also find that word pair preferences are best represented in the middle layers. Next, we transfer probes trained on harmless tasks (e.g., pick the larger number) to controversial ones (compare ethnicities) to examine biases in nationality, politics, religion, and gender. We observe substantial bias for all target classes: for instance, the Mistral model implicitly prefers Europe to Africa, Christianity to Judaism, and left-wing to right-wing politics, despite declining to answer. This suggests that instruction fine-tuning does not necessarily debias contextualized embeddings. Our codebase is at https://github.com/castorini/biasprobe.
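A logistic Bradley-Terry probe can be sketched as a linear scorer over hidden vectors, where the probability that item a is preferred over item b is the sigmoid of the score difference. The sketch below trains such a probe with plain gradient descent; the function names, hyperparameters, and training loop are assumptions for illustration and do not reproduce the biasprobe codebase.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_bradley_terry_probe(pairs, dim, lr=0.5, epochs=200):
    """Fit w so that sigmoid(w.h_win - w.h_lose) is high for observed preferences.

    pairs: list of (h_win, h_lose) hidden-vector pairs, preferred item first.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for h_win, h_lose in pairs:
            diff = [a - b for a, b in zip(h_win, h_lose)]
            p = sigmoid(dot(w, diff))
            # Gradient of the negative log-likelihood -log(p) with respect to w.
            for i in range(dim):
                grad[i] -= (1.0 - p) * diff[i]
        w = [wi - lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w

def preference_probability(w, h_a, h_b):
    """Probe's probability that the item with vector h_a is preferred over h_b."""
    return sigmoid(dot(w, h_a) - dot(w, h_b))
```

In the transfer setting described above, `w` would be fitted on a harmless task (for example, vectors of number words labeled by which number is larger) and then applied unchanged to the hidden vectors of a sensitive word pair.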

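Summary
Do large language models (LLMs) exhibit sociodemographic biases even when they decline to respond? To answer this question, we study whether bias is encoded in the latent representations of contextualized embeddings. We propose a logistic Bradley-Terry probe that predicts an LLM's word pair preferences. We first validate the probe on three word pair tasks and thirteen LLMs, where it improves on the standard approach (WEAT) by a relative 27% in error rate. We also find that word pair preferences are most clearly represented in the middle layers. Next, we train probes on harmless tasks (for example, picking the larger number) and apply them to controversial ones (for example, comparing ethnicities) to detect bias. We observe substantial bias for all target classes: for instance, the Mistral model implicitly prefers Europe to Africa, Christianity to Judaism, and left-wing to right-wing politics. This suggests that instruction fine-tuning does not necessarily remove bias from contextualized embeddings. Our codebase is at https://github.com/castorini/biasprobe.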

BioCLIP: A Vision Foundation Model for the Tree of Life

paper_url: http://arxiv.org/abs/2311.18803
repo_url: None
paper_authors: Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su
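for: This work aims to provide a general-purpose vision model for extracting biological information from images of organisms.
methods: The authors curate TreeOfLife-10M, a large and diverse dataset of biology images linked to structured biological knowledge, and use it to develop BioCLIP, a foundation model for the tree of life.
results: BioCLIP generalizes well across biology questions. It outperforms existing baselines by 17% to 20% absolute on diverse fine-grained biology classification tasks. Intrinsic evaluation shows that BioCLIP has learned a hierarchical representation conforming to the tree of life, which sheds light on its strong generalizability.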

Abstract
Images of the natural world, collected by a variety of cameras, from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion of computational methods and tools, particularly computer vision, for extracting biologically relevant information from images for science and conservation. Yet most of these are bespoke approaches designed for a specific task and are not easily adaptable or extendable to new questions, contexts, and datasets. A vision model for general organismal biology questions on images is of timely need. To approach this, we curate and release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images. We then develop BioCLIP, a foundation model for the tree of life, leveraging the unique properties of biology captured by TreeOfLife-10M, namely the abundance and variety of images of plants, animals, and fungi, together with the availability of rich structured biological knowledge. We rigorously benchmark our approach on diverse fine-grained biology classification tasks, and find that BioCLIP consistently and substantially outperforms existing baselines (by 17% to 20% absolute). Intrinsic evaluation reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability. Our code, models and data will be made available at https://github.com/Imageomics/bioclip.

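Summary
Images of the natural world, captured by cameras ranging from drones to personal phones, are an increasingly abundant source of biological information. Computational methods, particularly computer vision, are being widely applied to extract biological information from images for science and conservation. Yet most of these methods are designed for a specific task and do not adapt easily to new questions, contexts, and datasets. To address this, we curate and release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images. We then develop BioCLIP, a foundation model for the tree of life that leverages the abundance and variety of the images in TreeOfLife-10M together with the availability of structured biological knowledge. We rigorously benchmark our approach on diverse fine-grained biology classification tasks and find that BioCLIP outperforms existing baselines by 17% to 20% absolute. Intrinsic evaluation shows that BioCLIP has learned a hierarchical representation conforming to the tree of life, which sheds light on its strong generalizability. Our code, models, and data will be made available at https://github.com/Imageomics/bioclip.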

paper_url: http://arxiv.org/abs/2311.18799
repo_url: https://github.com/artemisp/lavis-xinstructblip
paper_authors: Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
for: The paper proposes a simple and effective cross-modality framework for integrating various modalities without extensive modality-specific customization.
methods: The framework is built atop frozen large language models (LLMs) and follows the vision-language pre-training and instruction tuning approach. High-quality instruction tuning data is collected in an automatic and scalable manner to support instruction-modality fine-tuning.
results: The proposed approach performs comparably with leading-edge counterparts without extensive modality-specific pre-training or customization, and it shows cross-modal reasoning abilities across two or more input modalities.

Abstract
Vision-language pre-training and instruction tuning have demonstrated general-purpose capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art large language models (LLMs). In this paper, we introduce a simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities without extensive modality-specific customization. To facilitate instruction-modality fine-tuning, we collect high-quality instruction tuning data in an automatic and scalable manner, composed of 24K QA samples for audio and 250K QA samples for 3D. Leveraging instruction-aware representations, our model performs comparably with leading-edge counterparts without the need of extensive modality-specific pre-training or customization. Furthermore, our approach demonstrates cross-modal reasoning abilities across two or more input modalities, despite each modality projection being trained individually. To study the model's cross-modal abilities, we contribute a novel Discriminative Cross-modal Reasoning (DisCRn) evaluation task, comprising 9K audio-video QA samples and 28K image-3D QA samples that require the model to reason discriminatively across disparate input modalities.

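Summary
Vision-language pre-training and instruction tuning have shown general-purpose capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art large language models (LLMs). In this paper, we introduce a simple yet effective cross-modality framework built atop frozen LLMs that integrates various modalities without extensive modality-specific customization. To facilitate instruction-modality fine-tuning, we collect high-quality instruction tuning data in an automatic and scalable manner, comprising 24K QA samples for audio and 250K QA samples for 3D. Using instruction-aware representations, our model performs comparably with leading-edge counterparts without extensive modality-specific pre-training or customization. Our approach also demonstrates cross-modal reasoning across two or more input modalities, even though each modality projection is trained individually. To study these cross-modal abilities, we contribute a new Discriminative Cross-modal Reasoning (DisCRn) evaluation task, comprising 9K audio-video QA samples and 28K image-3D QA samples that require the model to reason discriminatively across disparate input modalities.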

Mavericks at BLP-2023 Task 1: Ensemble-based Approach Using Language Models for Violence Inciting Text Detection

paper_url: http://arxiv.org/abs/2311.18778
repo_url: None
paper_authors: Saurabh Page, Sudeep Mangalvedhekar, Kshitij Deshpande, Tanmay Chavan, Sheetal Sonawane
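for: This work addresses the detection of hateful and violence-inciting text on social media in order to curb its spread.
methods: The authors experiment with several BERT-based models and use an ensemble of these models as their final submission.
results: The submission ranked 10th on the final leaderboard of the Violence Inciting Text Detection shared task at the First Workshop on Bangla Language Processing, with a macro F1 score of 0.737.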

Abstract
This paper presents our work for the Violence Inciting Text Detection shared task in the First Workshop on Bangla Language Processing. Social media has accelerated the propagation of hate and violence-inciting speech in society. It is essential to develop efficient mechanisms to detect and curb the propagation of such texts. The problem of detecting violence-inciting texts is further exacerbated in low-resource settings due to sparse research and less data. The data provided in the shared task consists of texts in the Bangla language, where each example is classified into one of the three categories defined based on the types of violence-inciting texts. We try and evaluate several BERT-based models, and then use an ensemble of the models as our final submission. Our submission is ranked 10th in the final leaderboard of the shared task with a macro F1 score of 0.737.
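The abstract does not say how the ensemble combines its models. A minimal sketch, assuming hard majority voting over per-model label predictions and the standard definition of macro F1 (the function names and the tie-breaking behavior are illustrative assumptions), could look like this.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote.

    Ties go to the label that appears first among the models' votes.
    """
    ensembled = []
    for votes in zip(*predictions_per_model):
        ensembled.append(Counter(votes).most_common(1)[0][0])
    return ensembled

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted examples contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because macro F1 averages over classes without weighting by frequency, it rewards models that also handle the rarer of the three categories well.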

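Summary
This paper presents our work for the Violence Inciting Text Detection shared task at the First Workshop on Bangla Language Processing. Social media has accelerated the spread of hateful and violence-inciting speech in society, so it is essential to build effective mechanisms to detect and curb such texts. The problem is harder in low-resource settings because of sparse research and limited data. The shared task data consists of Bangla texts, each classified into one of three categories based on the type of violence-inciting text. We try several BERT-based models and use an ensemble of them as our final submission. Our submission is ranked 10th on the final leaderboard of the shared task, with a macro F1 score of 0.737.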

Can training neural language models on a curriculum with developmentally plausible data improve alignment with human reading behavior?

paper_url: http://arxiv.org/abs/2311.18761
repo_url: None
paper_authors: Aryaman Chobey, Oliver Smith, Anzi Wang, Grusha Prasad
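for: The paper studies whether training neural language models on developmentally plausible data reduces the misalignment between model-predicted and observed human reading behavior.
methods: Teacher language models are trained on the BabyLM "strict-small" dataset, and sentence-level surprisal estimates from these teacher models are used to create a curriculum.
results: The curriculum makes it somewhat easier for models to acquire linguistic knowledge from the training data, but this does not lead to more accurate predictions of human language processing behavior.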

Abstract
The use of neural language models to model human behavior has met with mixed success. While some work has found that the surprisal estimates from these models can be used to predict a wide range of human neural and behavioral responses, other work studying more complex syntactic phenomena has found that these surprisal estimates generate incorrect behavioral predictions. This paper explores the extent to which the misalignment between empirical and model-predicted behavior can be minimized by training models on more developmentally plausible data, such as in the BabyLM Challenge. We trained teacher language models on the BabyLM "strict-small" dataset and used sentence level surprisal estimates from these teacher models to create a curriculum. We found tentative evidence that our curriculum made it easier for models to acquire linguistic knowledge from the training data: on the subset of tasks in the BabyLM challenge suite evaluating models' grammatical knowledge of English, models first trained on the BabyLM data curriculum and then on a few randomly ordered training epochs performed slightly better than models trained on randomly ordered epochs alone. This improved linguistic knowledge acquisition did not result in better alignment with human reading behavior, however: models trained on the BabyLM dataset (with or without a curriculum) generated predictions that were as misaligned with human behavior as models trained on larger less curated datasets. This suggests that training on developmentally plausible datasets alone is likely insufficient to generate language models capable of accurately predicting human language processing.
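To illustrate how sentence-level surprisal can drive a curriculum, the sketch below scores each sentence with a teacher model and orders the training data by that score. The teacher here is a toy add-one-smoothed unigram model standing in for the paper's teacher language models, and the easy-to-hard ordering and the use of summed (rather than averaged) surprisal are assumptions; the abstract does not specify either.

```python
import math
from collections import Counter

def train_unigram_teacher(corpus):
    """Return a function giving log2 probability of a token (toy teacher model)."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def log2_prob(token):
        # Add-one smoothing keeps unseen tokens at a non-zero probability.
        return math.log2((counts[token] + 1) / (total + vocab))

    return log2_prob

def sentence_surprisal(sentence, log2_prob):
    """Sentence-level surprisal: sum of per-token surprisals, in bits."""
    return -sum(log2_prob(tok) for tok in sentence.split())

def build_curriculum(corpus, log2_prob):
    """Order sentences from least to most surprising under the teacher."""
    return sorted(corpus, key=lambda s: sentence_surprisal(s, log2_prob))
```

A student model would then be trained on the sentences in the returned order before moving on to randomly ordered epochs, mirroring the two-stage setup the abstract describes.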

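Summary
The use of neural language models to model human behavior has met with mixed success. Some work has found that surprisal estimates from these models predict a wide range of human neural and behavioral responses, while other work on more complex syntactic phenomena has found that these estimates generate incorrect behavioral predictions. This paper explores how far the misalignment between model-predicted and human behavior can be reduced by training on more developmentally plausible data. We train teacher language models on the BabyLM "strict-small" dataset and use their sentence-level surprisal estimates to create a curriculum. We find tentative evidence that the curriculum helps models acquire linguistic knowledge from the training data: on the subset of tasks evaluating grammatical knowledge of English, models trained first on the BabyLM curriculum and then on a few randomly ordered epochs perform slightly better. However, this improved linguistic knowledge does not yield better alignment with human reading behavior: models trained on the BabyLM dataset, with or without a curriculum, generate predictions that remain misaligned with human behavior. This suggests that training on developmentally plausible data alone is not enough to produce language models that accurately predict human language processing.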

Mavericks at NADI 2023 Shared Task: Unravelling Regional Nuances through Dialect Identification using Transformer-based Approach

paper_url: http://arxiv.org/abs/2311.18739
repo_url: None
paper_authors: Vedant Deshpande, Yash Patwardhan, Kshitij Deshpande, Sudeep Mangalvedhekar, Ravindra Murumkar
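for: The paper presents an approach for the Nuanced Arabic Dialect Identification (NADI) Shared Task 2023.
methods: The approach relies on transformer-based models pre-trained on Arabic, which are fine-tuned on the provided dataset. Ensembling is applied to improve the system's performance.
results: On the test dataset, the system achieves an F1-score of 76.65 (11th on the leaderboard).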

Abstract
In this paper, we present our approach for the "Nuanced Arabic Dialect Identification (NADI) Shared Task 2023". We highlight our methodology for subtask 1 which deals with country-level dialect identification. Recognizing dialects plays an instrumental role in enhancing the performance of various downstream NLP tasks such as speech recognition and translation. The task uses the Twitter dataset (TWT-2023) that encompasses 18 dialects for the multi-class classification problem. Numerous transformer-based models, pre-trained on Arabic language, are employed for identifying country-level dialects. We fine-tune these state-of-the-art models on the provided dataset. The ensembling method is leveraged to yield improved performance of the system. We achieved an F1-score of 76.65 (11th rank on the leaderboard) on the test dataset.

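Summary
In this paper, we present our approach for the "Nuanced Arabic Dialect Identification (NADI) Shared Task 2023", with a focus on our method for country-level dialect identification. Recognizing dialects plays an important role in downstream NLP tasks such as speech recognition and translation. The task uses the Twitter dataset (TWT-2023), which covers 18 dialects and is framed as a multi-class classification problem. We use several transformer-based models pre-trained on Arabic and fine-tune them on the provided dataset. We also use ensembling to improve the system's performance. On the test dataset, we achieve an F1-score of 76.65 (11th on the leaderboard).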

Mavericks at ArAIEval Shared Task: Towards a Safer Digital Space – Transformer Ensemble Models Tackling Deception and Persuasion

paper_url: http://arxiv.org/abs/2311.18730
repo_url: None
paper_authors: Sudeep Mangalvedhekar, Kshitij Deshpande, Yash Patwardhan, Vedant Deshpande, Ravindra Murumkar
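for: The paper presents an approach for the "Arabic AI Tasks Evaluation (ArAiEval) Shared Task 2023".
methods: The authors present their approaches for task 1-A and task 2-A, which focus on persuasion technique detection and disinformation detection respectively. They experiment with several transformer-based models and fine-tune them on the provided dataset, and they use ensembling to improve the systems' performance.
results: The systems achieve a micro F1-score of 0.742 on task 1-A (8th on the leaderboard) and 0.901 on task 2-A (7th on the leaderboard).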

Abstract
In this paper, we highlight our approach for the "Arabic AI Tasks Evaluation (ArAiEval) Shared Task 2023". We present our approaches for task 1-A and task 2-A of the shared task which focus on persuasion technique detection and disinformation detection respectively. Detection of persuasion techniques and disinformation has become imperative to avoid distortion of authentic information. The tasks use multigenre snippets of tweets and news articles for the given binary classification problem. We experiment with several transformer-based models that were pre-trained on the Arabic language. We fine-tune these state-of-the-art models on the provided dataset. Ensembling is employed to enhance the performance of the systems. We achieved a micro F1-score of 0.742 on task 1-A (8th rank on the leaderboard) and 0.901 on task 2-A (7th rank on the leaderboard) respectively.

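Summary
This paper presents our approach to the Arabic AI Tasks Evaluation (ArAiEval) Shared Task 2023. We describe our solutions for task 1-A and task 2-A, which address persuasion technique detection and disinformation detection, respectively. Detecting persuasion techniques and disinformation has become necessary to keep authentic information from being distorted. Both tasks are binary classification problems over multigenre snippets of tweets and news articles. We experiment with transformer models pre-trained on Arabic, fine-tune them on the provided data, and use ensembling to improve system performance. We achieve a micro F1-score of 0.742 on task 1-A (8th on the leaderboard) and 0.901 on task 2-A (7th on the leaderboard).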
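
For readers unfamiliar with the reported metric, here is a minimal sketch of micro-averaged F1: true positives, false positives, and false negatives are pooled over all labels before precision and recall are computed. The gold and predicted labels are made up, and the shared task's official scorer may differ in detail.

```python
def micro_f1(gold, pred, labels):
    """Micro-averaged F1: pool TP/FP/FN over all labels, then compute F1 once."""
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1
            elif p == label and g != label:
                fp += 1
            elif p != label and g == label:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical binary labels (1 = contains a persuasion technique).
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]
score = micro_f1(gold, pred, labels=[0, 1])
print(score)
```

When every example carries exactly one label and all labels are pooled, as here, micro F1 coincides with plain accuracy (6 of 8 correct in this toy case).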

Automatic Functional Differentiation in JAX

paper_url: http://arxiv.org/abs/2311.18727
repo_url: https://github.com/sail-sg/autofd
paper_authors: Min Lin
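for: This work extends JAX with the ability to automatically differentiate higher-order functions (functionals and operators).
methods: The paper uses JAX's existing primitive system to implement higher-order functions and provides a set of primitive operators from which several key types of functionals can be built. Each primitive operator comes with linearization and transposition rules that align with JAX's internal protocols.
results: Functional derivatives are computed automatically, and the resulting functional gradients are functions that can be called directly in Python. Applications that require functional derivatives demonstrate the tool's efficacy and simplicity. The source code is released at https://github.com/sail-sg/autofd.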

Abstract
We extend JAX with the capability to automatically differentiate higher-order functions (functionals and operators). By representing functions as a generalization of arrays, we seamlessly use JAX's existing primitive system to implement higher-order functions. We present a set of primitive operators that serve as foundational building blocks for constructing several key types of functionals. For every introduced primitive operator, we derive and implement both linearization and transposition rules, aligning with JAX's internal protocols for forward and reverse mode automatic differentiation. This enhancement allows for functional differentiation in the same syntax traditionally used for functions. The resulting functional gradients are themselves functions ready to be invoked in Python. We showcase this tool's efficacy and simplicity through applications where functional derivatives are indispensable. The source code of this work is released at https://github.com/sail-sg/autofd .

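Summary
We extend JAX so that it can automatically differentiate higher-order functions (functionals and operators). By representing functions as a generalization of arrays, we can reuse JAX's existing primitive system to implement higher-order functions. We introduce a set of primitive operators that serve as building blocks for several key types of functionals. For each primitive operator we derive and implement both linearization and transposition rules, matching JAX's internal protocols. As a result, functionals can be differentiated with the same syntax traditionally used for functions, and the resulting functional gradients are functions that can be called directly in Python. We demonstrate the tool's efficacy and simplicity through practical applications. The source code is available at the repository listed above.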
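
To make the idea of a functional derivative concrete, the sketch below approximates one numerically in plain Python. It does not use JAX or the autofd API. It assumes the toy functional F[f] = ∫₀¹ f(x)² dx, whose functional derivative is δF/δf(x) = 2 f(x), discretizes f on a grid, and recovers the derivative by central finite differences. All names are illustrative.

```python
import math

def functional(values, dx):
    """Riemann-sum approximation of F[f] = integral of f(x)^2 over the grid."""
    return sum(v * v for v in values) * dx

def functional_derivative(values, dx, eps=1e-4):
    """Approximate dF/df(x_i): finite-difference partial derivative divided by dx."""
    grads = []
    for i in range(len(values)):
        up = list(values)
        up[i] += eps
        down = list(values)
        down[i] -= eps
        partial = (functional(up, dx) - functional(down, dx)) / (2 * eps)
        grads.append(partial / dx)
    return grads

n = 50
dx = 1.0 / n
grid = [(i + 0.5) * dx for i in range(n)]   # midpoints of [0, 1]
f_values = [math.sin(x) for x in grid]      # the function f(x) = sin(x)
derivative = functional_derivative(f_values, dx)
print(derivative[0], 2 * f_values[0])       # both approximate 2 * sin(x_0)
```

The division by `dx` is what turns an ordinary partial derivative with respect to a grid value into a density over x. The paper's approach avoids this discretization entirely by treating functions themselves as differentiable objects.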

CoRec: An Easy Approach for Coordination Recognition

paper_url: http://arxiv.org/abs/2311.18712
repo_url: https://github.com/qingwang-isu/corec
paper_authors: Qing Wang, Haojie Jia, Wenfei Song, Qi Li
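for: This work examines and addresses the challenges of the coordination recognition task.
methods: The authors propose a pipeline model, COordination RECognizer (CoRec), with two components: a coordinator identifier and a conjunct boundary detector.
results: Experiments on datasets from various domains show that CoRec is both effective and efficient. CoRec also benefits downstream tasks, improving the yield of state-of-the-art Open IE models.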

Abstract
In this paper, we observe and address the challenges of the coordination recognition task. Most existing methods rely on syntactic parsers to identify the coordinators in a sentence and detect the coordination boundaries. However, state-of-the-art syntactic parsers are slow and suffer from errors, especially for long and complicated sentences. To better solve the problems, we propose a pipeline model COordination RECognizer (CoRec). It consists of two components: coordinator identifier and conjunct boundary detector. The experimental results on datasets from various domains demonstrate the effectiveness and efficiency of the proposed method. Further experiments show that CoRec positively impacts downstream tasks, improving the yield of state-of-the-art Open IE models.

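Summary
In this paper we examine and address the challenges of the coordination recognition task. Most existing methods rely on syntactic parsers to identify coordinators and detect coordination boundaries. However, current syntactic parsers are slow and error-prone, especially on long and complicated sentences. To address this, we propose a pipeline model, COordination RECognizer (CoRec), with two components: a coordinator identifier and a conjunct boundary detector. Experiments on datasets from various domains demonstrate the method's effectiveness and efficiency. Further experiments show that CoRec has a positive impact on downstream tasks, improving the yield of state-of-the-art Open IE models.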

Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling

paper_url: http://arxiv.org/abs/2311.18711
repo_url: None
paper_authors: Matúš Pikuliak, Andrea Hrckova, Stefan Oresko, Marián Šimko
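for: This paper measures gender-stereotypical reasoning in masked language models and English-to-X machine translation systems.
methods: It introduces GEST, a new dataset with samples compatible with 9 Slavic languages and English, covering 16 gender stereotypes about men and women (e.g., women are beautiful, men are leaders). The stereotype definitions were informed by gender experts.
results: Evaluating 11 masked LMs and 4 machine translation systems with GEST reveals significant and consistent stereotypical reasoning in almost all evaluated models and languages.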

Abstract
We present GEST -- a new dataset for measuring gender-stereotypical reasoning in masked LMs and English-to-X machine translation systems. GEST contains samples that are compatible with 9 Slavic languages and English for 16 gender stereotypes about men and women (e.g., Women are beautiful, Men are leaders). The definition of said stereotypes was informed by gender experts. We used GEST to evaluate 11 masked LMs and 4 machine translation systems. We discovered significant and consistent amounts of stereotypical reasoning in almost all the evaluated models and languages.

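Summary
We introduce GEST, a new dataset for measuring gender-stereotypical reasoning in masked LMs and English-to-X machine translation systems. GEST contains samples compatible with 9 Slavic languages and English that express 16 gender stereotypes about men and women (e.g., women are beautiful, men are leaders). The definitions of these stereotypes were informed by gender experts. We use GEST to evaluate 11 masked LMs and 4 machine translation systems, and find significant and consistent stereotypical reasoning in almost all evaluated models and languages.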

RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance

paper_url: http://arxiv.org/abs/2311.18681
repo_url: https://github.com/chantalmp/radialog
paper_authors: Chantal Pellegrini, Ege Özsoy, Benjamin Busam, Nassir Navab, Matthias Keicher
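for: This work aims to build a human-in-the-loop radiology assistant that supports a collaborative diagnostic process, saving time and improving report quality.
methods: RaDialog integrates visual image features and structured pathology findings with a large language model (LLM) and adapts it to the radiology domain using parameter-efficient fine-tuning.
results: The method achieves state-of-the-art clinical correctness in report generation and performs well on interactive tasks such as correcting reports and answering questions.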

Abstract
Conversational AI tools that can generate and discuss clinically correct radiology reports for a given medical image have the potential to transform radiology. Such a human-in-the-loop radiology assistant could facilitate a collaborative diagnostic process, thus saving time and improving the quality of reports. Towards this goal, we introduce RaDialog, the first thoroughly evaluated and publicly available large vision-language model for radiology report generation and interactive dialog. RaDialog effectively integrates visual image features and structured pathology findings with a large language model (LLM) while simultaneously adapting it to a specialized domain using parameter-efficient fine-tuning. To keep the conversational abilities of the underlying LLM, we propose a comprehensive, semi-automatically labeled, image-grounded instruct dataset for chest X-ray radiology tasks. By training with this dataset, our method achieves state-of-the-art clinical correctness in report generation and shows impressive abilities in interactive tasks such as correcting reports and answering questions, serving as a foundational step toward clinical dialog systems. Our code is available on github: https://github.com/ChantalMP/RaDialog.

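Summary
Conversational AI tools that can generate and discuss clinically correct radiology reports could transform radiology. Such a human-in-the-loop radiology assistant could support a collaborative diagnostic process, saving time and improving report quality. Toward this goal, we introduce RaDialog, the first thoroughly evaluated and publicly available large vision-language model for radiology report generation and interactive dialog. RaDialog integrates visual image features and structured pathology findings with a large language model (LLM) and adapts it to the specialized domain through parameter-efficient fine-tuning. To preserve the conversational abilities of the underlying LLM, we propose a comprehensive, semi-automatically labeled, image-grounded instruct dataset for chest X-ray radiology tasks. Trained on this dataset, our method achieves state-of-the-art clinical correctness in report generation and shows strong abilities in interactive tasks such as correcting reports and answering questions, serving as a foundational step toward clinical dialog systems. Our code is available on GitHub: https://github.com/ChantalMP/RaDialog.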

ArcMMLU: A Library and Information Science Benchmark for Large Language Models

paper_url: http://arxiv.org/abs/2311.18658
repo_url: https://github.com/stzhang-patrick/arcmmlu
paper_authors: Shitou Zhang, Zuchao Li, Xingshen Liu, Liming Yang, Ping Wang
for: This paper aims to develop a specialized benchmark for evaluating the capabilities of large language models (LLMs) in the Library & Information Science (LIS) domain in Chinese.
methods: The paper introduces ArcMMLU, a benchmark tailored for the LIS domain that includes four key sub-domains: Archival Science, Data Science, Library Science, and Information Science. The benchmark is based on the format of MMLU/CMMLU and includes over 6,000 high-quality questions to reflect the diverse nature of the LIS domain.
results: The comprehensive evaluation shows that while most mainstream LLMs achieve an average accuracy rate above 50% on ArcMMLU, there remains a notable performance gap, suggesting substantial headroom for refinement in LLM capabilities within the LIS domain. The paper also explores the effectiveness of few-shot examples on model performance and highlights challenging questions where models consistently underperform.

Abstract
In light of the rapidly evolving capabilities of large language models (LLMs), it becomes imperative to develop rigorous domain-specific evaluation benchmarks to accurately assess their capabilities. In response to this need, this paper introduces ArcMMLU, a specialized benchmark tailored for the Library & Information Science (LIS) domain in Chinese. This benchmark aims to measure the knowledge and reasoning capability of LLMs within four key sub-domains: Archival Science, Data Science, Library Science, and Information Science. Following the format of MMLU/CMMLU, we collected over 6,000 high-quality questions for the compilation of ArcMMLU. This extensive compilation can reflect the diverse nature of the LIS domain and offer a robust foundation for LLM evaluation. Our comprehensive evaluation reveals that while most mainstream LLMs achieve an average accuracy rate above 50% on ArcMMLU, there remains a notable performance gap, suggesting substantial headroom for refinement in LLM capabilities within the LIS domain. Further analysis explores the effectiveness of few-shot examples on model performance and highlights challenging questions where models consistently underperform, providing valuable insights for targeted improvements. ArcMMLU fills a critical gap in LLM evaluations within the Chinese LIS domain and paves the way for future development of LLMs tailored to this specialized area.

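Summary
As the capabilities of large language models (LLMs) evolve rapidly, rigorous domain-specific benchmarks are needed to assess them accurately. In response, this paper introduces ArcMMLU, a specialized benchmark for the Library & Information Science (LIS) domain in Chinese. ArcMMLU measures the knowledge and reasoning capability of LLMs in four key sub-domains: Archival Science, Data Science, Library Science, and Information Science. Following the MMLU/CMMLU format, we collected over 6,000 high-quality questions to compile ArcMMLU. This extensive compilation reflects the diversity of the LIS domain and provides a robust foundation for LLM evaluation. Our comprehensive evaluation shows that although most mainstream LLMs achieve an average accuracy above 50% on ArcMMLU, a notable performance gap remains, leaving substantial room for improvement in the LIS domain. Further analysis examines the effect of few-shot examples on model performance and highlights challenging questions on which models consistently underperform, offering useful guidance for targeted improvements. ArcMMLU fills a gap in LLM evaluation for the Chinese LIS domain and paves the way for LLMs tailored to this specialized area.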

Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines

paper_url: http://arxiv.org/abs/2312.00100
repo_url: https://github.com/mythologos/augustinian-sermon-parallelisms
paper_authors: Stephen Bothwell, Justin DeBenedetto, Theresa Crnkovich, Hildegund Müller, David Chiang
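for: This paper studies parallelism, a common stylistic device in spoken and written rhetoric.
methods: The authors introduce a new task, rhetorical parallelism detection. They give a formal definition and provide one new Latin dataset and one adapted Chinese dataset.
results: The authors establish a family of evaluation metrics and build baseline systems and novel sequence labeling schemes to capture parallelism. On the strictest metric, they reach F1 scores of 0.40 on the Latin dataset and 0.43 on the Chinese dataset.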

Abstract
Rhetoric, both spoken and written, involves not only content but also style. One common stylistic tool is parallelism: the juxtaposition of phrases which have the same sequence of linguistic (e.g., phonological, syntactic, semantic) features. Despite the ubiquity of parallelism, the field of natural language processing has seldom investigated it, missing a chance to better understand the nature of the structure, meaning, and intent that humans convey. To address this, we introduce the task of rhetorical parallelism detection. We construct a formal definition of it; we provide one new Latin dataset and one adapted Chinese dataset for it; we establish a family of metrics to evaluate performance on it; and, lastly, we create baseline systems and novel sequence labeling schemes to capture it. On our strictest metric, we attain F1 scores of 0.40 and 0.43 on our Latin and Chinese datasets, respectively.

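Summary
Rhetoric in speech and writing involves style as well as content. One common stylistic device is parallelism: the juxtaposition of phrases that share the same sequence of linguistic (e.g., phonological, syntactic, semantic) features. Although parallelism is ubiquitous, natural language processing has seldom investigated it, missing a chance to better understand the structure, meaning, and intent that humans convey. To address this, we introduce the task of rhetorical parallelism detection. We construct a formal definition, provide one new Latin dataset and one adapted Chinese dataset, and establish a family of metrics to evaluate performance. Finally, we create baseline systems and novel sequence labeling schemes to capture parallelism. On our strictest metric, we attain F1 scores of 0.40 and 0.43 on the Latin and Chinese datasets, respectively.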

ArthModel: Enhance Arithmetic Skills to Large Language Model

paper_url: http://arxiv.org/abs/2311.18609
repo_url: None
paper_authors: Yingdi Guo
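for: This work aims to improve the arithmetic ability of language models and to explore new ways of thinking about, training, and using a language model.
methods: The LLM is trained to generate a postfix expression for the arithmetic problem and is combined with a small pretrained model. The small model converts the token embeddings into real dense numbers and calls native functions of a deep learning platform to obtain the correct answer.
results: The final answer is produced by injecting the small model's output back into the LLM through prompt injection. The abstract gives no accuracy figures. Code and models will be released at https://github.com/eteced/arithmetic_finetuning_v1.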

Abstract
With the great success of ChatGPT, the research of large language models has become increasingly popular. However, the models have several limitations, such as toxicity and poor performance of arithmetic solving. Meanwhile, LLMs may have some potential abilities that have yet to be exploited. In this paper, we choose a different way to enhance the arithmetic ability of LLMs. We propose to train an LLM to generate a postfix expression related to the arithmetic problem and incorporate it with small pretrained models. Moreover, this small model transfers the token embeddings into real dense numbers and invokes native functions of a deep learning platform to get the correct answer. To generate the final result, we propose prompt injection for adding the result outputs by the small model to the LLM. This work provides different ways of thinking, training and using a language model. The codes and models will be released at https://github.com/eteced/arithmetic_finetuning_v1.

Summary
Following the success of ChatGPT, research on large language models has become increasingly popular. These models nonetheless have limitations, such as toxicity and poor performance on arithmetic. At the same time, LLMs may have abilities that have yet to be exploited. In this paper we take a different route to strengthening the arithmetic ability of LLMs. We train the LLM to generate a postfix expression for the arithmetic problem and combine it with a small pretrained model. The small model converts the token embeddings into real dense numbers and calls native functions of a deep learning platform to obtain the correct answer. To produce the final result, we use prompt injection to add the small model's output to the LLM. This work offers different ways of thinking about, training, and using a language model. The code and models will be released at https://github.com/eteced/arithmetic_finetuning_v1.
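
A postfix (reverse Polish) expression is attractive as an intermediate target because it needs no parentheses and can be evaluated deterministically with a single stack. The sketch below is a plain stack evaluator meant only to illustrate that property. It is not the paper's small neural model, which works on token embeddings, and the function name is hypothetical.

```python
import operator

# Supported binary operators; an assumption made for this illustration.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def eval_postfix(tokens):
    """Evaluate a postfix expression given as a list of string tokens."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()   # the second operand is on top of the stack
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(float(tok))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]

# "(3 + 4) * 2" written in postfix form.
result = eval_postfix("3 4 + 2 *".split())
print(result)
```

Because the exact computation is delegated to ordinary arithmetic, the language model only has to get the structure of the expression right, not the digits of the answer.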

FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity

paper_url: http://arxiv.org/abs/2311.18580
repo_url: https://github.com/cuishiyao96/fft
paper_authors: Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang, Tingwen Liu
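for: Evaluating the harmlessness of large language models (LLMs), in particular harms stemming from factoid, unfair, and toxic content.
methods: The paper proposes FFT, a benchmark of 2,116 carefully designed instances for evaluating LLM harmlessness in terms of factuality, fairness, and toxicity.
results: An evaluation of 9 representative LLMs shows that their harmlessness is still unsatisfactory, and the analysis yields findings that can inform future research on harmless LLMs.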

Abstract
The spread of generative artificial intelligence has heightened concerns about the potential harms posed by AI-generated texts, primarily stemming from factoid, unfair, and toxic content. Previous researchers have invested much effort in assessing the harmlessness of generative language models. However, existing benchmarks are struggling in the era of large language models (LLMs), due to the stronger language generation and instruction-following capabilities, as well as wider applications. In this paper, we propose FFT, a new benchmark with 2116 elaborately designed instances, for LLM harmlessness evaluation with factuality, fairness, and toxicity. To investigate the potential harms of LLMs, we evaluate 9 representative LLMs covering various parameter scales, training stages, and creators. Experiments show that the harmlessness of LLMs is still unsatisfactory, and extensive analysis derives some insightful findings that could inspire future research on harmless LLMs.

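Summary
The spread of generative artificial intelligence has heightened concern about the potential harms of AI-generated text. Previous researchers have invested considerable effort in assessing the harmlessness of generative language models. However, existing benchmarks struggle in the era of large language models (LLMs), whose language generation and instruction-following capabilities are stronger and whose applications are wider. This paper proposes FFT, a new benchmark for evaluating LLM harmlessness in terms of factuality, fairness, and toxicity. To investigate the potential harms of LLMs, we evaluate 9 representative LLMs covering various parameter scales, training stages, and creators. Experiments show that the harmlessness of LLMs is still unsatisfactory, and detailed analysis yields findings that could inspire future research on harmless LLMs.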

Grammatical Gender’s Influence on Distributional Semantics: A Causal Perspective

paper_url: http://arxiv.org/abs/2311.18567
repo_url: None
paper_authors: Karolina Stańczak, Kevin Du, Adina Williams, Isabelle Augenstein, Ryan Cotterell
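for: This study asks how much meaning influences grammatical gender assignment across languages, an open question in modern linguistics and cognitive science.
methods: The authors propose a novel causal graphical model that jointly represents the interactions between a noun's grammatical gender, its meaning, and adjective choice.
results: When the meaning of the noun is controlled for, grammatical gender has a near-zero effect on adjective choice, which calls the neo-Whorfian hypothesis into question.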

Abstract
How much meaning influences gender assignment across languages is an active area of research in modern linguistics and cognitive science. We can view current approaches as aiming to determine where gender assignment falls on a spectrum, from being fully arbitrarily determined to being largely semantically determined. For the latter case, there is a formulation of the neo-Whorfian hypothesis, which claims that even inanimate noun gender influences how people conceive of and talk about objects (using the choice of adjective used to modify inanimate nouns as a proxy for meaning). We offer a novel, causal graphical model that jointly represents the interactions between a noun's grammatical gender, its meaning, and adjective choice. In accordance with past results, we find a relationship between the gender of nouns and the adjectives which modify them. However, when we control for the meaning of the noun, we find that grammatical gender has a near-zero effect on adjective choice, thereby calling the neo-Whorfian hypothesis into question.

Summary
How much meaning influences grammatical gender assignment across languages is an active area of research in modern linguistics and cognitive science. Current approaches can be seen as trying to place gender assignment on a spectrum from fully arbitrary to largely semantically determined. For the latter case, one formulation of the neo-Whorfian hypothesis claims that even the gender of inanimate nouns influences how people conceive of and talk about objects, with the adjectives chosen to modify inanimate nouns serving as a proxy for meaning. We offer a novel causal graphical model that jointly represents the interactions between a noun's grammatical gender, its meaning, and adjective choice. Consistent with past results, we find a relationship between the gender of nouns and the adjectives that modify them. When we control for the meaning of the noun, however, grammatical gender has a near-zero effect on adjective choice, which calls the neo-Whorfian hypothesis into question.

Use of explicit replies as coordination mechanisms in online student debate

paper_url: http://arxiv.org/abs/2311.18466
repo_url: None
paper_authors: Bruno D. Ferreira-Saraiva, Joao P. Matos-Carvalho, Manuel Pita
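for: This study examines how people in conversation entrain their linguistic behaviour through spontaneous alignment mechanisms, in both face-to-face and computer-mediated communication, with a focus on explicit replies.
methods: The study explores coordination mechanisms in explicit replies, identifying the roles utterances play by finding community structure in the conversation's vocabulary with a non-parametric, hierarchical topic model. It builds on prior work that uses a probabilistic framework and on computational, information-theoretic methods.
results: Conversations follow three broad patterns: some stay at the level of general introductory chatter, some develop a specific sub-topic in depth, and others jump between general chatter, off-topic remarks, and agreement or disagreement without further elaboration.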

Abstract
People in conversation entrain their linguistic behaviours through spontaneous alignment mechanisms [7] - both in face-to-face and computer-mediated communication (CMC) [8]. In CMC, one of the mechanisms through which linguistic entrainment happens is through explicit replies. Indeed, the use of explicit replies influences the structure of conversations, favouring the formation of reply-trees typically delineated by topic shifts [5]. The interpersonal coordination mechanisms realized by how actors address each other have been studied using a probabilistic framework proposed by David Gibson [2,3]. Other recent approaches use computational methods and information theory to quantify changes in text. We explore coordination mechanisms concerned with some of the roles utterances play in dialogues - specifically in explicit replies. We identify these roles by finding community structure in the conversation's vocabulary using a non-parametric, hierarchical topic model. Some conversations may always stay on the ground, remaining at the level of general introductory chatter. Some others may develop a specific sub-topic in significant depth and detail. Even others may jump between general chatter, out-of-topic remarks and people agreeing or disagreeing without further elaboration.

IAG: Induction-Augmented Generation Framework for Answering Reasoning Questions

paper_url: http://arxiv.org/abs/2311.18397
repo_url: None
paper_authors: Zhebin Zhang, Xinyu Zhang, Yuanhang Ren, Saijiang Shi, Meng Han, Yongkang Wu, Ruofei Lai, Zhao Cao
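for: This work aims to improve performance on open-domain question answering, in particular implicit reasoning questions, by combining inductive knowledge with the parametric memory of language models.
methods: The proposed Induction-Augmented Generation (IAG) framework uses inductive knowledge, derived from LLMs via a prompting method based on inductive reasoning patterns, together with retrieved documents.
results: IAG outperforms RAG baselines and ChatGPT on two open-domain QA tasks, and the best models ranked first on the official leaderboards of CSQA2.0 and StrategyQA.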

Abstract
Retrieval-Augmented Generation (RAG), by incorporating external knowledge with parametric memory of language models, has become the state-of-the-art architecture for open-domain QA tasks. However, common knowledge bases are inherently constrained by limited coverage and noisy information, making retrieval-based approaches inadequate to answer implicit reasoning questions. In this paper, we propose an Induction-Augmented Generation (IAG) framework that utilizes inductive knowledge along with the retrieved documents for implicit reasoning. We leverage large language models (LLMs) for deriving such knowledge via a novel prompting method based on inductive reasoning patterns. On top of this, we implement two versions of IAG named IAG-GPT and IAG-Student, respectively. IAG-GPT directly utilizes the knowledge generated by GPT-3 for answer prediction, while IAG-Student gets rid of dependencies on GPT service at inference time by incorporating a student inductor model. The inductor is firstly trained via knowledge distillation and further optimized by back-propagating the generator feedback via differentiable beam scores. Experimental results show that IAG outperforms RAG baselines as well as ChatGPT on two Open-Domain QA tasks. Notably, our best models have won the first place in the official leaderboards of CSQA2.0 (since Nov 1, 2022) and StrategyQA (since Jan 8, 2023).

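Summary
Retrieval-Augmented Generation (RAG), which combines external knowledge with the parametric memory of language models, has become the state-of-the-art architecture for open-domain QA. However, common knowledge bases are limited in coverage and contain noisy information, so retrieval-based approaches fall short on implicit reasoning questions. This paper proposes an Induction-Augmented Generation (IAG) framework that uses inductive knowledge together with retrieved documents for implicit reasoning. We use large language models (LLMs) to derive this knowledge through a novel prompting method based on inductive reasoning patterns. On top of this, we implement two versions of IAG: IAG-GPT and IAG-Student. IAG-GPT uses knowledge generated by GPT-3 directly for answer prediction, while IAG-Student removes the dependency on the GPT service at inference time by incorporating a student inductor model. The inductor is first trained via knowledge distillation and then further optimized with feedback from the generator. Experiments show that IAG outperforms RAG baselines and ChatGPT on two open-domain QA tasks. Notably, our best models have held first place on the official leaderboards of CSQA2.0 (since Nov 1, 2022) and StrategyQA (since Jan 8, 2023).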

Hubness Reduction Improves Sentence-BERT Semantic Spaces

paper_url: http://arxiv.org/abs/2311.18364
repo_url: https://github.com/bemigini/hubness-reduction-improves-sbert-semantic-spaces
paper_authors: Beatrix M. G. Nielsen, Lars Kai Hansen
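for: This study examines the semantic representations (sentence embeddings) produced by Sentence-BERT, which are used in areas such as information retrieval and document grouping.
methods: The authors investigate the structure of the semantic spaces formed by Sentence-BERT's high-dimensional trained dense vectors.
results: The embeddings suffer from hubness, a well-known problem in high dimensions that produces asymmetric neighbourhood relations: a few texts (hubs) are neighbours of many other texts, while most texts (anti-hubs) are neighbours of few or none. Measured by hubness scores and the error rate of a neighbourhood-based classifier, hubness reduction improves semantic quality.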

Abstract
Semantic representations of text, i.e. representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighborhood relations, such that some texts (the hubs) are neighbours of many other texts while most texts (so-called anti-hubs), are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and error rate of a neighbourhood based classifier. We find that when hubness is high, we can reduce error rate and hubness using hubness reduction methods. We identify a combination of two methods as resulting in the best reduction. For example, on one of the tested pretrained models, this combined method can reduce hubness by about 75% and error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text.

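Summary
Semantic representations of text, i.e. representations of natural language that capture meaning through geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We study the structure of the semantic spaces produced by Sentence-BERT embeddings and find that they suffer from hubness, a well-known problem in high dimensions. Hubness produces asymmetric neighbourhood relations: some texts (hubs) are neighbours of many other texts, while most texts (anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and the error rate of a neighbourhood-based classifier. When hubness is high, hubness reduction methods lower both the error rate and hubness. A combination of two methods gives the best reduction, cutting hubness by about 75% and error rate by about 9% on one of the tested pretrained models. We therefore argue that mitigating hubness in the embedding space yields better semantic representations of text.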
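
One widely used way to quantify hubness is the skewness of the k-occurrence distribution, where the k-occurrence of a point is the number of other points that list it among their k nearest neighbours. The sketch below computes that quantity for a toy 2-D point set. Whether the paper uses exactly this score is an assumption, and the points and function names are hypothetical.

```python
import math

def k_occurrence(points, k):
    """Count, for each point, how often it appears in other points' k-NN lists."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            counts[j] += 1
    return counts

def skewness(values):
    """Sample skewness (third standardized moment); positive means a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

# Toy data: a central point surrounded by four others.
points = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
counts = k_occurrence(points, k=1)
print(counts, skewness(counts))
```

Here the central point is the nearest neighbour of every other point and so acts as a hub, while three of the outer points are anti-hubs that no one lists as a neighbour. The positive skewness reflects that asymmetry.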

Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension

paper_url: http://arxiv.org/abs/2311.18353
repo_url: None
paper_authors: Akira Kawabata, Saku Sugawara
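for: Testing the logical reading comprehension ability of language models.
methods: A multiple-choice dataset paired with crowdsourced rationale texts.
results: Large language models (e.g., InstructGPT) struggle with the subquestions, especially those written for the incorrect answer options.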

Abstract
To precisely evaluate a language model's capability for logical reading comprehension, we present a dataset for testing the understanding of the rationale behind critical reasoning. For questions taken from an existing multiple-choice logical reading comprehension dataset, we crowdsource rationale texts that explain why we should select or eliminate answer options, resulting in 3,003 multiple-choice subquestions that are associated with 943 main questions. Experiments on our dataset show that recent large language models (e.g., InstructGPT) struggle to answer the subquestions even if they are able to answer the main questions correctly. We find that the models perform particularly poorly in answering subquestions written for the incorrect options of the main questions, implying that the models have a limited capability for explaining why incorrect alternatives should be eliminated. These results suggest that our dataset encourages further investigation into the critical reasoning ability of language models while focusing on the elimination process of relevant alternatives.

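Summary
To evaluate a language model's capability for logical reading comprehension precisely, we present a dataset that tests understanding of the rationale behind critical reasoning. For questions taken from an existing multiple-choice logical reading comprehension dataset, we crowdsource rationale texts that explain why answer options should be selected or eliminated, yielding 3,003 multiple-choice subquestions associated with 943 main questions. Experiments show that recent large language models (e.g., InstructGPT) struggle to answer the subquestions even when they answer the main questions correctly. The models perform particularly poorly on subquestions written for the incorrect options, which implies a limited capability for explaining why incorrect alternatives should be eliminated. These results suggest that our dataset encourages further investigation into the critical reasoning ability of language models, with a focus on the process of eliminating relevant alternatives.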

Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation

paper_url: http://arxiv.org/abs/2311.18260
repo_url: None
paper_authors: Ryutaro Tanno, David G. T. Barrett, Andrew Sellergren, Sumedh Ghaisas, Sumanth Dathathri, Abigail See, Johannes Welbl, Karan Singhal, Shekoofeh Azizi, Tao Tu, Mike Schaekermann, Rhys May, Roy Lee, SiWai Man, Zahra Ahmed, Sara Mahdavi, Danielle Belgrave, Vivek Natarajan, Shravya Shetty, Pushmeet Kohli, Po-Sen Huang, Alan Karthikesalingam, Ira Ktena
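for: This study builds and evaluates a report generation system based on a vision-language model, addressing the challenge of assessing the clinical quality of AI-generated radiology reports.
methods: A well-known vision-language foundation model is fine-tuned on radiology data to build Flamingo-CXR, a state-of-the-art report generation system for chest radiographs.
results: At least one certified radiologist (out of two per case) preferred the AI-generated report to the ground-truth report in over 60% of cases. Errors in AI-generated reports mostly concerned location and finding, whereas errors in human-written reports mostly concerned severity and finding. This difference suggests that human experts and the AI system are complementary and motivated an assistive scenario in which the AI system generates a first draft that a clinician then revises. In this first demonstration of clinician-AI collaboration for report writing, the resulting reports were rated equivalent or preferred to expert-only reports by at least one radiologist in 80% of in-patient cases and 60% of intensive care cases.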

Abstract
Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors and delays in report delivery. While recent progress in automated report generation with vision-language models offers clear potential in ameliorating the situation, the path to real-world adoption has been stymied by the challenge of evaluating the clinical quality of AI-generated reports. In this study, we build a state-of-the-art report generation system for chest radiographs, Flamingo-CXR, by fine-tuning a well-known vision-language foundation model on radiology data. To evaluate the quality of the AI-generated reports, a group of 16 certified radiologists provide detailed evaluations of AI-generated and human-written reports for chest X-rays from an intensive care setting in the United States and an inpatient setting in India. At least one radiologist (out of two per case) preferred the AI report to the ground truth report in over 60% of cases for both datasets. Amongst the subset of AI-generated reports that contain errors, the most frequently cited reasons were related to the location and finding, whereas for human-written reports, most mistakes were related to severity and finding. This disparity suggested potential complementarity between our AI system and human experts, prompting us to develop an assistive scenario in which Flamingo-CXR generates a first-draft report, which is subsequently revised by a clinician. This is the first demonstration of clinician-AI collaboration for report writing, and the resultant reports are assessed to be equivalent or preferred by at least one radiologist to reports written by experts alone in 80% of in-patient cases and 60% of intensive care cases.

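Summary
Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors and delays in report delivery. Recent progress in automated report generation with vision-language models shows clear potential, but real-world adoption has been held back by the difficulty of evaluating the clinical quality of AI-generated reports. In this study we build a state-of-the-art report generation system, Flamingo-CXR, by fine-tuning a vision-language foundation model on radiology data. To evaluate report quality, a group of 16 certified radiologists provide detailed evaluations of AI-generated and human-written chest X-ray reports. In both datasets, at least one radiologist (out of two per case) preferred the AI report to the ground-truth report in over 60% of cases. Among AI-generated reports that contain errors, the most common issues concern location and finding, whereas in human-written reports most mistakes concern severity and finding. This disparity suggests potential complementarity between the AI system and human experts, so we develop an assistive scenario in which Flamingo-CXR generates a first-draft report that a clinician then revises. This is the first demonstration of clinician-AI collaboration for report writing. The resulting reports are assessed as equivalent or preferred to expert-only reports by at least one radiologist in 80% of in-patient cases and 60% of intensive care cases.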
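
The headline numbers use an "at least one of two radiologists" criterion, which is more lenient than requiring both raters to agree. The small sketch below, with made-up ratings, shows how the two rates can differ on the same data. The function names and values are hypothetical and do not come from the study.

```python
def at_least_one_prefers(ratings):
    """Fraction of cases where at least one of the two raters prefers the AI report."""
    return sum(1 for a, b in ratings if a or b) / len(ratings)

def both_prefer(ratings):
    """Fraction of cases where both raters prefer the AI report."""
    return sum(1 for a, b in ratings if a and b) / len(ratings)

# Hypothetical per-case ratings: (rater A prefers AI, rater B prefers AI).
ratings = [
    (True, False),
    (False, False),
    (True, True),
    (False, True),
    (False, False),
]
lenient = at_least_one_prefers(ratings)
strict = both_prefer(ratings)
print(lenient, strict)
```

In this toy set the lenient rate is three times the strict one, so it helps to keep the criterion in mind when reading the 60% and 80% figures.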

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

paper_url: http://arxiv.org/abs/2311.18248
repo_url: https://github.com/x-plug/mplug-docowl
paper_authors: Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang
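for: This paper aims to strengthen the scientific diagram analysis ability of multimodal LLMs and so widen their application scenarios, especially academic paper writing.
methods: By parsing the LaTeX source files of high-quality papers, the authors carefully build M-Paper, a multimodal diagram understanding dataset. They construct professional diagram analysis samples for training and evaluation by aligning diagrams with related paragraphs.
results: Experiments show that a state-of-the-art multimodal LLM trained on this dataset understands multiple scientific diagrams (figures and tables) better, with gains in diagram captioning, diagram analysis, and outline recommendation.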

Abstract
Recently, the strong text creation ability of Large Language Models (LLMs) has given rise to many tools for assisting paper reading or even writing. However, the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit their application scenarios, especially for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs. By parsing LaTeX source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper. By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or LaTeX code. Besides, to better align the copilot with the user's intention, we introduce the `outline' as the control signal, which could be directly given by the user or revised based on auto-generated ones. Comprehensive experiments with a state-of-the-art Multimodal LLM demonstrate that training on our dataset shows stronger scientific diagram understanding performance, including diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl.

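Summary
The strong text generation ability of large language models (LLMs) has recently produced many tools that assist with reading or writing papers. However, LLMs and Multimodal LLMs remain weak at diagram analysis, which limits their use, especially in scientific paper writing. This work focuses on strengthening the diagram analysis ability of Multimodal LLMs. By parsing the LaTeX sources of high-quality papers, the authors build M-Paper, a multimodal diagram understanding dataset, and by aligning diagrams with related paragraphs they construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables given as images or LaTeX code. To better match the user's intention, an `outline` is introduced as a control signal, which can be given directly by the user or revised from auto-generated ones. Comprehensive experiments with a state-of-the-art Multimodal LLM show that training on the dataset yields stronger scientific diagram understanding, including diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are publicly available.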
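
The dataset construction step described above (parsing LaTeX sources, then aligning each diagram with the paragraphs that discuss it) can be illustrated with a small sketch. The regular expressions and the function name `align_figures` are hypothetical and are not taken from the M-Paper pipeline; real LaTeX generally needs a proper parser rather than regexes.

```python
import re

# Hypothetical patterns for a simplified LaTeX subset (no nested braces).
FIGURE_RE = re.compile(r"\\begin\{figure\}(.*?)\\end\{figure\}", re.DOTALL)
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
CAPTION_RE = re.compile(r"\\caption\{([^}]*)\}")
REF_RE = re.compile(r"\\ref\{([^}]*)\}")


def align_figures(tex: str) -> dict:
    """Map each figure label to its caption and the paragraphs that reference it."""
    figures = {}
    for body in FIGURE_RE.findall(tex):
        label = LABEL_RE.search(body)
        caption = CAPTION_RE.search(body)
        if label:
            figures[label.group(1)] = {
                "caption": caption.group(1) if caption else "",
                "paragraphs": [],
            }
    # Drop the figure environments, then split the rest on blank lines.
    text = FIGURE_RE.sub("", tex)
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        for ref in REF_RE.findall(para):
            if ref in figures and para not in figures[ref]["paragraphs"]:
                figures[ref]["paragraphs"].append(para)
    return figures


sample = r"""
Accuracy improves with scale, as shown in Figure \ref{fig:acc}.

\begin{figure}
\includegraphics{acc.pdf}
\caption{Accuracy versus model size.}
\label{fig:acc}
\end{figure}

This paragraph does not mention any diagram.
"""

aligned = align_figures(sample)
print(aligned["fig:acc"]["caption"])
print(aligned["fig:acc"]["paragraphs"])
```

Each resulting (diagram, caption, paragraphs) triple is the kind of sample that could feed captioning or analysis training; tables would need an analogous pattern.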

Automatic Construction of a Korean Toxic Instruction Dataset for Ethical Tuning of Large Language Models

paper_url: http://arxiv.org/abs/2311.18215
repo_url: None
paper_authors: Sungjoo Byun, Dongjun Jang, Hyemi Jo, Hyopil Shin
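for: Developing training approaches for large language models (LLMs) that reduce the generation of unethical language and handle toxic user queries appropriately.
methods: The authors automatically construct KoTox, a Korean toxic instruction dataset comprising 39K unethical instruction-output pairs, to improve LLMs' ethical awareness and their responses to various toxic inputs.
results: The KoTox collection refines LLM training and provides a foundation for improving ethical awareness and the handling of toxic inputs, promoting safer and more responsible interactions in natural language processing (NLP) applications.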

Abstract
Caution: this paper may include material that could be offensive or distressing. The advent of Large Language Models (LLMs) necessitates the development of training approaches that mitigate the generation of unethical language and aptly manage toxic user queries. Given the challenges related to human labor and the scarcity of data, we present KoTox, comprising 39K unethical instruction-output pairs. This collection of automatically generated toxic instructions refines the training of LLMs and establishes a foundational framework for improving LLMs' ethical awareness and response to various toxic inputs, promoting more secure and responsible interactions in Natural Language Processing (NLP) applications.

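Summary
Caution: this paper may include material that could be offensive or distressing. The advent of large language models (LLMs) calls for training approaches that mitigate the generation of unethical language and handle toxic user queries appropriately. Given the cost of human labor and the scarcity of data, the authors present KoTox, comprising 39K unethical instruction-output pairs. This collection of automatically generated toxic instructions refines LLM training and improves LLMs' ethical awareness and responses to various toxic inputs, promoting safer and more responsible NLP applications.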

INarIG: Iterative Non-autoregressive Instruct Generation Model For Word-Level Auto Completion

paper_url: http://arxiv.org/abs/2311.18200
repo_url: None
paper_authors: Hengchao Shang, Zongyao Li, Daimeng Wei, Jiaxin Guo, Minghan Wang, Xiaoyu Chen, Lizhi Lei, Hao Yang
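for: Improving human translation efficiency through computer-aided translation, particularly in scenarios where machine translation cannot meet quality requirements.
methods: The INarIG model builds the human-typed character sequence into an Instruction Unit and uses iterative non-autoregressive decoding with subwords for word-level auto completion, so that all input information given in the task is used.
results: The model achieves state-of-the-art results on the WMT22 and benchmark datasets, with a maximum increase of over 10% in prediction accuracy.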

Abstract
Computer-aided translation (CAT) aims to enhance human translation efficiency and is still important in scenarios where machine translation cannot meet quality requirements. One fundamental task within this field is Word-Level Auto Completion (WLAC). WLAC predicts a target word given a source sentence, translation context, and a human-typed character sequence. Previous works either employ word classification models to exploit contextual information from both sides of the target word or directly disregard the dependencies from the right-side context. Furthermore, the key information, i.e. human-typed sequences, is only used as prefix constraints in the decoding module. In this paper, we propose the INarIG (Iterative Non-autoregressive Instruct Generation) model, which constructs the human-typed sequence into an Instruction Unit and employs iterative decoding with subwords to fully utilize input information given in the task. Our model is more competent in dealing with low-frequency words (the core scenario of this task), and achieves state-of-the-art results on the WMT22 and benchmark datasets, with a maximum increase of over 10% prediction accuracy.

Summary
In this paper, we propose the INarIG (Iterative Non-autoregressive Instruct Generation) model, which constructs the human-typed sequence into an Instruction Unit and employs iterative decoding with subwords to fully utilize the input information given in the task. Our model is more competent in dealing with low-frequency words (the core scenario of this task) and achieves state-of-the-art results on the WMT22 and benchmark datasets, with a maximum increase of over 10% prediction accuracy.
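
For context, the abstract notes that earlier systems used the typed characters only as a prefix constraint at decoding time. A minimal sketch of that baseline idea follows; it is not INarIG itself, and the vocabulary, scores, and function name `complete_word` are made up for illustration.

```python
from typing import Dict, Optional


def complete_word(typed: str, scores: Dict[str, float]) -> Optional[str]:
    """Return the highest-scoring candidate that starts with the typed characters."""
    candidates = {w: s for w, s in scores.items() if w.startswith(typed)}
    if not candidates:
        return None  # nothing in the vocabulary matches the typed prefix
    return max(candidates, key=candidates.get)


# Hypothetical model scores for the target position.
word_scores = {"translation": 0.50, "transfer": 0.30, "train": 0.15, "tree": 0.05}

print(complete_word("tra", word_scores))
print(complete_word("trai", word_scores))
print(complete_word("x", word_scores))
```

A whole-word vocabulary like this one cannot complete words it has never seen, which is one reason subword-level decoding is attractive for the low-frequency words the abstract highlights.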

COVID-19 Vaccine Misinformation in Middle Income Countries

paper_url: http://arxiv.org/abs/2311.18195
repo_url: https://github.com/zzoliman/covid-vaccine-misinfo-mic
paper_authors: Jongin Kim, Byeo Rhee Back, Aditya Agrawal, Jiaxi Wu, Veronika J. Wirtz, Traci Hong, Derry Wijaya
for: The paper introduces a multilingual dataset of COVID-19 vaccine misinformation and develops and evaluates models for detecting COVID-19 vaccine misinformation in three middle-income countries (Brazil, Indonesia, and Nigeria).
methods: The paper uses two approaches to develop COVID-19 vaccine misinformation detection models: domain-specific pre-training and text augmentation using a large language model.
results: The paper's best misinformation detection models demonstrate improvements ranging from 2.7 to 15.9 percentage points in macro F1-score compared to the baseline models, and there are significant positive associations between the misinformation rates across the three countries. Additionally, the paper applies the misinformation detection models in a large-scale study of 19 million unlabeled tweets from the three countries between 2020 and 2022, showcasing the practical application of the dataset and models for detecting and analyzing vaccine misinformation in multiple countries and languages.

Abstract
This paper introduces a multilingual dataset of COVID-19 vaccine misinformation, consisting of annotated tweets from three middle-income countries: Brazil, Indonesia, and Nigeria. The expertly curated dataset includes annotations for 5,952 tweets, assessing their relevance to COVID-19 vaccines, presence of misinformation, and the themes of the misinformation. To address challenges posed by domain specificity, the low-resource setting, and data imbalance, we adopt two approaches for developing COVID-19 vaccine misinformation detection models: domain-specific pre-training and text augmentation using a large language model. Our best misinformation detection models demonstrate improvements ranging from 2.7 to 15.9 percentage points in macro F1-score compared to the baseline models. Additionally, we apply our misinformation detection models in a large-scale study of 19 million unlabeled tweets from the three countries between 2020 and 2022, showcasing the practical application of our dataset and models for detecting and analyzing vaccine misinformation in multiple countries and languages. Our analysis indicates that percentage changes in the number of new COVID-19 cases are positively associated with COVID-19 vaccine misinformation rates in a staggered manner for Brazil and Indonesia, and there are significant positive associations between the misinformation rates across the three countries.
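
Since the reported gains are stated in percentage points of macro F1, a short worked example of that metric may help. The per-class counts below are invented for illustration and are unrelated to the paper's data.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one class from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: dict) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Hypothetical confusion counts: (tp, fp, fn) per class.
counts = {
    "misinformation": (10, 10, 10),  # F1 = 20 / 40 = 0.5
    "other": (80, 10, 10),           # F1 = 160 / 180
}

score = macro_f1(counts)
print(f"macro F1 = {score:.4f}")
```

Because the average is unweighted, macro F1 is sensitive to performance on the minority class, which suits the imbalanced setting the abstract describes. An improvement of 2.7 percentage points means, for example, moving from 0.600 to 0.627.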

Positional Information Matters for Invariant In-Context Learning: A Case Study of Simple Function Classes

paper_url: http://arxiv.org/abs/2311.18194
repo_url: None
paper_authors: Yongqiang Chen, Binghui Xie, Kaiwen Zhou, Bo Han, Yatao Bian, James Cheng
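for: This paper studies the in-context learning (ICL) ability of large language models (LLMs), i.e., answering a new query input by conditioning on a few in-context demonstrations (input-output examples) without updating parameters, and seeks the limitations and principles of successful ICL.
methods: The paper uses ICL linear regression, characterizes several out-of-distribution (OOD) cases, and compares transformers with DeepSet.
results: DeepSet outperforms transformers across a variety of distribution shifts, while the positional encodings in LLMs break the permutation invariance of ICL; preserving this ICL invariance is therefore a fundamental requirement. Transformers that preserve ICL invariance (by using identical positional encodings) achieve state-of-the-art performance across various ICL distribution shifts.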

Abstract
In-context learning (ICL) refers to the ability of a model to condition on a few in-context demonstrations (input-output examples of the underlying task) to generate the answer for a new query input, without updating parameters. Despite the impressive ICL ability of LLMs, it has also been found that ICL in LLMs is sensitive to input demonstrations and limited to short context lengths. To understand the limitations and principles for successful ICL, we conduct an investigation with ICL linear regression of transformers. We characterize several Out-of-Distribution (OOD) cases for ICL inspired by realistic LLM ICL failures and compare transformers with DeepSet, a simple yet powerful architecture for ICL. Surprisingly, DeepSet outperforms transformers across a variety of distribution shifts, implying that preserving permutation invariance symmetry to input demonstrations is crucial for OOD ICL. The phenomenon specifies a fundamental requirement by ICL, which we termed as ICL invariance. Nevertheless, the positional encodings in LLMs will break ICL invariance. To this end, we further evaluate transformers with identical positional encodings and find preserving ICL invariance in transformers achieves state-of-the-art performance across various ICL distribution shifts

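Summary
In-context learning (ICL) is the ability of a model to generate the answer for a new query input by conditioning on a few in-context demonstrations (input-output examples), without updating parameters. Although LLMs show impressive ICL ability, ICL in LLMs is sensitive to the input demonstrations and limited to short context lengths. To understand the limitations and principles of successful ICL, the authors study ICL linear regression with transformers and characterize several out-of-distribution (OOD) cases inspired by realistic ICL failures. Surprisingly, DeepSet, a simple yet powerful architecture for ICL, outperforms transformers across a variety of distribution shifts, implying that preserving permutation invariance with respect to the input demonstrations is crucial for OOD ICL. This specifies a fundamental requirement of ICL, termed ICL invariance. The positional encodings in LLMs, however, break ICL invariance. The authors therefore evaluate transformers with identical positional encodings and find that preserving ICL invariance in transformers achieves state-of-the-art performance across various ICL distribution shifts.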
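
The notion of ICL invariance can be made concrete with a toy comparison. The sketch below assumes a DeepSet-style predictor with sum pooling and a second predictor that mixes in a position-dependent offset; the weights, the offset scale, and both function names are hypothetical, and the query input is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(2, 8))  # per-demonstration encoder weights (hypothetical)
w_rho = rng.normal(size=8)       # readout weights (hypothetical)


def deepset_predict(xs: np.ndarray, ys: np.ndarray) -> float:
    """Encode each (x, y) demonstration independently, then sum-pool."""
    pairs = np.stack([xs, ys], axis=1)   # shape (n, 2)
    h = np.tanh(pairs @ W_phi)           # shape (n, 8)
    return float(h.sum(axis=0) @ w_rho)  # the sum ignores demonstration order


def positional_predict(xs: np.ndarray, ys: np.ndarray) -> float:
    """Same model, but each demonstration is shifted by its position index."""
    pairs = np.stack([xs, ys], axis=1)
    pos = np.arange(len(xs))[:, None]    # 0, 1, 2, ... acts like a positional signal
    h = np.tanh((pairs + 0.5 * pos) @ W_phi)
    return float(h.sum(axis=0) @ w_rho)


xs = np.array([0.1, 0.2, 0.3])
ys = 2.0 * xs
perm = [2, 1, 0]  # present the same demonstrations in reverse order

print(deepset_predict(xs, ys), deepset_predict(xs[perm], ys[perm]))
print(positional_predict(xs, ys), positional_predict(xs[perm], ys[perm]))
```

The first pair of numbers agrees up to rounding, while the second pair generally differs: the position-dependent term makes the prediction depend on the order of the demonstrations, which is the symmetry break the abstract attributes to positional encodings.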

2023-11-30

cs.LG

cs.LG - 2023-11-30

Curvature Explains Loss of Plasticity

paper_url: http://arxiv.org/abs/2312.00246
repo_url: None
paper_authors: Alex Lewandowski, Haruto Tanaka, Dale Schuurmans, Marlos C. Machado
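for: This paper aims to explain the mechanism behind loss of plasticity in neural networks and offers a curvature-based explanation.
methods: The authors support their claim with a systematic empirical investigation of plasticity loss across several continual supervised learning problems.
results: Curvature loss coincides with, and sometimes precedes, loss of plasticity, whereas previous explanations cannot account for all settings. Regularizers that mitigate loss of plasticity also preserve curvature, which motivates a simple distributional regularizer that is effective across the settings considered.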

Abstract
Loss of plasticity is a phenomenon in which neural networks lose their ability to learn from new experience. Despite being empirically observed in several problem settings, little is understood about the mechanisms that lead to loss of plasticity. In this paper, we offer a consistent explanation for plasticity loss, based on an assertion that neural networks lose directions of curvature during training and that plasticity loss can be attributed to this reduction in curvature. To support such a claim, we provide a systematic empirical investigation of plasticity loss across several continual supervised learning problems. Our findings illustrate that curvature loss coincides with and sometimes precedes plasticity loss, while also showing that previous explanations are insufficient to explain loss of plasticity in all settings. Lastly, we show that regularizers which mitigate loss of plasticity also preserve curvature, motivating a simple distributional regularizer that proves to be effective across the problem settings considered.

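Summary
Loss of plasticity is a phenomenon in which neural networks lose their ability to learn from new experience. Although it has been observed empirically in several problem settings, the mechanisms behind it are poorly understood. This paper offers a consistent explanation: neural networks lose directions of curvature during training, and plasticity loss can be attributed to this reduction in curvature. To support the claim, the authors carry out a systematic empirical investigation of plasticity loss across several continual supervised learning problems. The findings show that curvature loss coincides with, and sometimes precedes, plasticity loss, and that previous explanations are insufficient to explain loss of plasticity in all settings. Finally, regularizers that mitigate loss of plasticity are shown to also preserve curvature, which motivates a simple distributional regularizer that proves effective across the problem settings considered.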
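
One way to picture "directions of curvature" is to count the non-negligible eigenvalues of the loss Hessian. The sketch below does this for a linear least-squares loss, where the Hessian is simply `X.T @ X / n`; this toy setting and the function name `curvature_directions` are assumptions for illustration, and the paper's own measurement on neural networks may differ.

```python
import numpy as np


def curvature_directions(X: np.ndarray, tol: float = 1e-8) -> int:
    """Count Hessian eigenvalues above a relative threshold.

    For the loss L(w) = ||X w - y||^2 / (2 n), the Hessian is X^T X / n.
    """
    hessian = X.T @ X / len(X)
    eigvals = np.linalg.eigvalsh(hessian)  # symmetric matrix, real eigenvalues
    return int((eigvals > tol * eigvals.max()).sum())


rng = np.random.default_rng(0)
X_full = rng.normal(size=(100, 5))

# Degenerate features: one has gone silent, one duplicates another.
X_degenerate = X_full.copy()
X_degenerate[:, 3] = 0.0
X_degenerate[:, 4] = X_degenerate[:, 0]

print(curvature_directions(X_full))
print(curvature_directions(X_degenerate))
```

Along a direction with zero curvature and zero gradient, gradient descent cannot make progress, which gives some intuition for why fewer curvature directions could go hand in hand with reduced ability to learn.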

Self-similarity of Communities of the ABCD Model

paper_url: http://arxiv.org/abs/2312.00238
repo_url: None
paper_authors: Jordan Barrett, Bogumil Kaminski, Pawel Pralat, Francois Theberge
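for: This paper studies the Artificial Benchmark for Community Detection (ABCD) graph, a random graph model used to benchmark community detection.
methods: The ABCD model is a random graph model with community structure and power-law distributions for both degrees and community sizes; it is faster than the LFR model and can be investigated analytically.
results: The ABCD model exhibits self-similar behaviour: the degree distribution of ground-truth communities is asymptotically the same as that of the whole graph (appropriately normalized by size). This makes it possible to estimate the number of edges induced by each community as well as the number of self-loops and multi-edges generated during the process. These quantities matter because rewiring self-loops and multi-edges to keep the graph simple is expensive, and every rewiring makes the underlying configuration models deviate slightly from uniform simple graphs.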

Abstract
The Artificial Benchmark for Community Detection (ABCD) graph is a random graph model with community structure and power-law distribution for both degrees and community sizes. The model generates graphs similar to the well-known LFR model but it is faster and can be investigated analytically. In this paper, we show that the ABCD model exhibits some interesting self-similar behaviour, namely, the degree distribution of ground-truth communities is asymptotically the same as the degree distribution of the whole graph (appropriately normalized based on their sizes). As a result, we can not only estimate the number of edges induced by each community but also the number of self-loops and multi-edges generated during the process. Understanding these quantities is important as (a) rewiring self-loops and multi-edges to keep the graph simple is an expensive part of the algorithm, and (b) every rewiring causes the underlying configuration models to deviate slightly from uniform simple graphs on their corresponding degree sequences.

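Summary
The Artificial Benchmark for Community Detection (ABCD) graph is a random graph model with community structure and power-law distributions for both degrees and community sizes. It generates graphs similar to those of the well-known LFR model, but it is faster and can be investigated analytically. This paper shows that the ABCD model exhibits self-similar behaviour: the degree distribution of ground-truth communities is asymptotically the same as the degree distribution of the whole graph (appropriately normalized by community size). As a result, one can estimate not only the number of edges induced by each community but also the number of self-loops and multi-edges generated during the process. These quantities matter because (a) rewiring self-loops and multi-edges to keep the graph simple is an expensive part of the algorithm, and (b) every rewiring makes the underlying configuration models deviate slightly from uniform simple graphs on their corresponding degree sequences.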
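
The self-loops and multi-edges mentioned above arise naturally in a configuration model, where half-edges ("stubs") are paired at random. The following sketch pairs stubs for a given degree sequence and counts the defects that would need rewiring; it is a generic configuration model, not the ABCD generator, and the function names are hypothetical.

```python
import random
from collections import Counter


def configuration_model(degrees, seed=0):
    """Pair up stubs uniformly at random; may create self-loops and multi-edges."""
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    if len(stubs) % 2:
        raise ValueError("the degree sequence must have an even sum")
    rng = random.Random(seed)
    rng.shuffle(stubs)
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]


def count_defects(edges):
    """Return (number of self-loops, number of surplus parallel edges)."""
    self_loops = sum(1 for u, v in edges if u == v)
    multiplicity = Counter(tuple(sorted(e)) for e in edges if e[0] != e[1])
    surplus = sum(c - 1 for c in multiplicity.values() if c > 1)
    return self_loops, surplus


degrees = [3, 3, 2, 2, 2]
edges = configuration_model(degrees)
print(edges)
print(count_defects(edges))
```

Every defect counted here corresponds to a rewiring step in a generator that must output a simple graph, which is why estimates of these counts help predict both the running cost and how far the result drifts from a uniform simple graph.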

Deep Equilibrium Based Neural Operators for Steady-State PDEs

paper_url: http://arxiv.org/abs/2312.00234
repo_url: https://github.com/risteskilab/deq-neural-operators
paper_authors: Tanya Marwah, Ashwini Pokle, J. Zico Kolter, Zachary C. Lipton, Jianfeng Lu, Andrej Risteski
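for: This paper studies data-driven machine learning approaches for solving partial differential equations (PDEs), focusing on the benefits of weight-tied architectures for steady-state PDEs.
methods: The paper uses weight-tied neural network architectures and proposes FNO-DEQ, a deep equilibrium variant of the FNO architecture that solves for the steady-state solution as the fixed point of an implicit operator layer.
results: FNO-DEQ-based architectures outperform FNO-based baselines that have 4 times as many parameters in predicting solutions of steady-state PDEs. FNO-DEQ is also more robust than the FNO-based baselines when trained on data with noisier observations. Finally, a universal approximation result shows that FNO-DEQ can approximate the solution of any steady-state PDE that can be written as a fixed point equation.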

Abstract
Data-driven machine learning approaches are being increasingly used to solve partial differential equations (PDEs). They have shown particularly striking successes when training an operator, which takes as input a PDE in some family, and outputs its solution. However, the architectural design space, especially given structural knowledge of the PDE family of interest, is still poorly understood. We seek to remedy this gap by studying the benefits of weight-tied neural network architectures for steady-state PDEs. To achieve this, we first demonstrate that the solution of most steady-state PDEs can be expressed as a fixed point of a non-linear operator. Motivated by this observation, we propose FNO-DEQ, a deep equilibrium variant of the FNO architecture that directly solves for the solution of a steady-state PDE as the infinite-depth fixed point of an implicit operator layer using a black-box root solver and differentiates analytically through this fixed point resulting in $\mathcal{O}(1)$ training memory. Our experiments indicate that FNO-DEQ-based architectures outperform FNO-based baselines with $4\times$ the number of parameters in predicting the solution to steady-state PDEs such as Darcy Flow and steady-state incompressible Navier-Stokes. Finally, we show FNO-DEQ is more robust when trained with datasets with more noisy observations than the FNO-based baselines, demonstrating the benefits of using appropriate inductive biases in architectural design for different neural network based PDE solvers. Further, we show a universal approximation result that demonstrates that FNO-DEQ can approximate the solution to any steady-state PDE that can be written as a fixed point equation.

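Summary
Data-driven machine learning approaches are increasingly used to solve partial differential equations (PDEs). They have been particularly successful when training an operator that takes a PDE from some family as input and outputs its solution. However, the architectural design space, especially when structural knowledge of the PDE family is available, is still poorly understood. The authors address this gap by studying the benefits of weight-tied neural network architectures for steady-state PDEs. They first show that the solution of most steady-state PDEs can be expressed as a fixed point of a non-linear operator. Motivated by this, they propose FNO-DEQ, a deep equilibrium variant of the FNO architecture that directly solves for the steady-state solution as the infinite-depth fixed point of an implicit operator layer, using a black-box root solver and differentiating analytically through the fixed point, which results in $\mathcal{O}(1)$ training memory. Experiments indicate that FNO-DEQ-based architectures outperform FNO-based baselines with $4\times$ the number of parameters on steady-state PDEs such as Darcy Flow and steady-state incompressible Navier-Stokes. FNO-DEQ is also more robust than the FNO-based baselines when trained on datasets with noisier observations, which demonstrates the benefit of appropriate inductive biases in architectural design. Finally, a universal approximation result shows that FNO-DEQ can approximate the solution of any steady-state PDE that can be written as a fixed point equation.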
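
The core idea of a deep equilibrium layer, solving for a fixed point instead of stacking layers, can be sketched in a few lines. The example below uses a small `tanh` layer and plain fixed-point iteration; the actual FNO-DEQ uses Fourier operator layers and a black-box root solver, so the layer, the weights, and the solver here are simplifying assumptions.

```python
import numpy as np


def solve_fixed_point(f, z0, tol=1e-10, max_iter=1000):
    """Iterate z <- f(z) until successive iterates differ by less than tol."""
    z = z0
    for _ in range(max_iter):
        z_next = f(z)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    raise RuntimeError("fixed-point iteration did not converge")


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W *= 0.5 / np.linalg.norm(W, 2)  # spectral norm 0.5 makes the layer a contraction
x = rng.normal(size=4)           # stands in for the encoded PDE input


def layer(z):
    """One weight-tied layer; applying it repeatedly mimics infinite depth."""
    return np.tanh(W @ z + x)


z_star = solve_fixed_point(layer, np.zeros(4))
print(z_star)
print(np.linalg.norm(layer(z_star) - z_star))  # residual of the fixed-point equation
```

Only `z_star` needs to be kept for the backward pass when gradients are obtained through the implicit function theorem, which is the source of the constant training memory mentioned in the abstract; that backward pass is not implemented here.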

EpiTESTER: Testing Autonomous Vehicles with Epigenetic Algorithm and Attention Mechanism

paper_url: http://arxiv.org/abs/2312.00207
repo_url: https://github.com/simula-complex/epitester
paper_authors: Chengjie Lu, Shaukat Ali, Tao Yue
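for: This study aims to test autonomous vehicles (AVs) under various environmental scenarios and to find efficiently the critical scenarios that lead vehicles into unsafe situations.
methods: The study proposes a new testing method, EpiTESTER, inspired by epigenetics, which enables species to adapt to sudden environmental changes. EpiTESTER adopts gene silencing as its epigenetic mechanism, regulating gene expression to prevent certain genes from being expressed.
results: Compared with a classical genetic algorithm (GA) and with EpiTESTER using equal probability for each gene, EpiTESTER performs well at identifying critical scenarios, suggesting that applying epigenetic mechanisms is a good option for practical problems.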

Abstract
Testing autonomous vehicles (AVs) under various environmental scenarios that lead the vehicles to unsafe situations is known to be challenging. Given the infinite possible environmental scenarios, it is essential to find critical scenarios efficiently. To this end, we propose a novel testing method, named EpiTESTER, by taking inspiration from epigenetics, which enables species to adapt to sudden environmental changes. In particular, EpiTESTER adopts gene silencing as its epigenetic mechanism, which regulates gene expression to prevent the expression of a certain gene, and the probability of gene expression is dynamically computed as the environment changes. Given different data modalities (e.g., images, lidar point clouds) in the context of AV, EpiTESTER benefits from a multi-model fusion transformer to extract high-level feature representations from environmental factors and then calculates probabilities based on these features with the attention mechanism. To assess the cost-effectiveness of EpiTESTER, we compare it with a classical genetic algorithm (GA) (i.e., without any epigenetic mechanism implemented) and EpiTESTER with equal probability for each gene. We evaluate EpiTESTER with four initial environments from CARLA, an open-source simulator for autonomous driving research, and an end-to-end AV controller, Interfuser. Our results show that EpiTESTER achieved a promising performance in identifying critical scenarios compared to the baselines, showing that applying epigenetic mechanisms is a good option for solving practical problems.

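Summary
Testing autonomous vehicles (AVs) under various environmental scenarios that lead to unsafe situations is challenging. Because the space of possible scenarios is infinite, critical scenarios must be found efficiently. The authors propose a new testing method, EpiTESTER, inspired by epigenetics, which enables species to adapt to sudden environmental changes. EpiTESTER adopts gene silencing as its epigenetic mechanism: it regulates gene expression to prevent certain genes from being expressed, and the probability of gene expression is computed dynamically as the environment changes. Given the different data modalities in the AV context (e.g., images, lidar point clouds), EpiTESTER uses a multi-model fusion transformer to extract high-level feature representations from environmental factors and then computes probabilities from these features with the attention mechanism. To assess cost-effectiveness, EpiTESTER is compared with a classical genetic algorithm (GA), which has no epigenetic mechanism, and with a variant of EpiTESTER that uses equal probability for each gene. The evaluation uses four initial environments from CARLA, an open-source simulator for autonomous driving research, and an end-to-end AV controller, Interfuser. The results show that EpiTESTER identifies critical scenarios better than the baselines, indicating that applying epigenetic mechanisms is a good option for solving practical problems.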
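
One plausible reading of gene silencing inside a genetic algorithm is that each gene carries a probability of being silenced, and a silenced gene is skipped by the variation operator. The sketch below implements that reading for mutation only; how EpiTESTER actually applies silencing, and how it derives the probabilities from attention scores, is not specified in the abstract, so the function name and all parameters are assumptions.

```python
import random


def mutate_with_silencing(genes, silence_prob, rng, sigma=0.1):
    """Gaussian mutation in which each gene may be silenced (left unchanged)."""
    child = []
    for gene, p in zip(genes, silence_prob):
        if rng.random() < p:
            child.append(gene)  # silenced: the gene is not expressed in this step
        else:
            child.append(gene + rng.gauss(0.0, sigma))
    return child


rng = random.Random(0)
scenario = [0.2, 0.8, 0.5]       # e.g. encoded weather, lighting, traffic density
silence_prob = [0.9, 0.1, 0.5]   # hypothetical per-gene silencing probabilities

print(mutate_with_silencing(scenario, silence_prob, rng))
```

Setting every probability to the same value recovers the "equal probability for each gene" baseline from the abstract, while a classical GA corresponds to probability zero everywhere.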

Optimal Attack and Defense for Reinforcement Learning

paper_url: http://arxiv.org/abs/2312.00198
repo_url: None
paper_authors: Jeremy McMahan, Young Wu, Xiaojin Zhu, Qiaomin Xie
for: The paper is written to study the robustness of Reinforcement Learning (RL) in real systems, specifically against online manipulation attacks.
methods: The paper uses a Markov Decision Process (MDP) to model the attacker’s problem and a stochastic Stackelberg game to compute the optimal defense policy for the victim.
results: The paper shows that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity, and the victim can compute an optimal defense policy as the solution to a partially-observable turn-based stochastic game (POTBSG). The solutions are truly robust, although the defense problem is NP-hard.

Abstract
To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP since it is not the true environment but a higher level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, thus such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.

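Summary
For Reinforcement Learning (RL) to be useful in real systems, it must be robust to noise and adversarial attacks. In adversarial RL, an external attacker can manipulate the victim agent's interaction with the environment. The authors study the full class of online manipulation attacks: (i) state attacks, (ii) observation attacks (a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. They show that the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP), called a meta-MDP because it is a higher-level environment induced by the attacked interaction rather than the true environment. The attacker can derive optimal attacks by planning in polynomial time or by learning with polynomial sample complexity using standard RL techniques. The optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which simplifies further into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim benefits from deviating from their respective optimal policies, so such solutions are truly robust. Although the defense problem is NP-hard, optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.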
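
Because the attacker's problem reduces to an ordinary MDP, a standard planning routine applies. The sketch below runs value iteration on a two-state toy MDP standing in for a meta-MDP; the states, rewards, and data layout are invented for illustration and do not come from the paper, and value iteration is just one familiar planning method rather than the specific algorithm the authors analyse.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """Compute optimal state values.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
    immediate reward for taking action a in state s.
    """
    values = [0.0] * len(P)
    while True:
        new_values = [
            max(
                R[s][a] + gamma * sum(p * values[t] for p, t in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(len(P))
        ]
        if max(abs(a - b) for a, b in zip(values, new_values)) < tol:
            return new_values
        values = new_values


# Toy meta-MDP from the attacker's point of view (hypothetical):
# state 0 = victim on track, state 1 = victim derailed (absorbing).
# Action 0 = do nothing, action 1 = perturb the victim's interaction.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 1)]],
]
R = [
    [0.0, 1.0],  # the attacker is rewarded once for derailing the victim
    [0.0, 0.0],
]

print(value_iteration(P, R))
```

In a real instance, the transition and reward tables would be induced by the victim's policy and the true environment, which is what makes the meta-MDP a "higher level" environment.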

Enhancing Ligand Pose Sampling for Molecular Docking

paper_url: http://arxiv.org/abs/2312.00191
repo_url: https://github.com/drorlab/glow_ives
paper_authors: Patricia Suriana, Ron O. Dror
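for: This study aims to improve ligand pose sampling for molecular docking, which in turn supports better scoring functions for binding pose prediction and virtual screening.
methods: The study introduces two improved pose sampling protocols, GLOW and IVES, to raise the likelihood of sampling accurate binding poses.
results: Benchmarking shows that GLOW and IVES increase the likelihood of sampling accurate poses, especially for binding pockets whose shape changes substantially when different ligands bind. The improvement holds for both experimentally determined and AlphaFold-generated protein structures. The authors also provide datasets of candidate ligand poses for around 5,000 protein-ligand cross-docking pairs, for training and testing scoring functions.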

Abstract
Deep learning promises to dramatically improve scoring functions for molecular docking, leading to substantial advances in binding pose prediction and virtual screening. To train scoring functions (and to perform molecular docking), one must generate a set of candidate ligand binding poses. Unfortunately, the sampling protocols currently used to generate candidate poses frequently fail to produce any poses close to the correct, experimentally determined pose, unless information about the correct pose is provided. This limits the accuracy of learned scoring functions and molecular docking. Here, we describe two improved protocols for pose sampling: GLOW (auGmented sampLing with sOftened vdW potential) and a novel technique named IVES (IteratiVe Ensemble Sampling). Our benchmarking results demonstrate the effectiveness of our methods in improving the likelihood of sampling accurate poses, especially for binding pockets whose shape changes substantially when different ligands bind. This improvement is observed across both experimentally determined and AlphaFold-generated protein structures. Additionally, we present datasets of candidate ligand poses generated using our methods for each of around 5,000 protein-ligand cross-docking pairs, for training and testing scoring functions. To benefit the research community, we provide these cross-docking datasets and an open-source Python implementation of GLOW and IVES at https://github.com/drorlab/GLOW_IVES .

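Summary
Deep learning promises to dramatically improve scoring functions for molecular docking, leading to substantial advances in binding pose prediction and virtual screening. Training scoring functions, and docking itself, requires generating a set of candidate ligand binding poses. However, the sampling protocols in current use frequently fail to produce any pose close to the correct, experimentally determined pose unless information about the correct pose is provided, which limits the accuracy of learned scoring functions and of molecular docking. The authors describe two improved pose sampling protocols: GLOW (auGmented sampLing with sOftened vdW potential) and a new technique named IVES (IteratiVe Ensemble Sampling). Benchmarking shows that these methods raise the likelihood of sampling accurate poses, especially for binding pockets whose shape changes substantially when different ligands bind. The improvement is observed for both experimentally determined and AlphaFold-generated protein structures. The authors also provide datasets of candidate ligand poses for each of around 5,000 protein-ligand cross-docking pairs, for training and testing scoring functions. The cross-docking datasets and an open-source Python implementation of GLOW and IVES are available at https://github.com/drorlab/GLOW_IVES.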
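
To give a feel for what a "softened vdW potential" does, the sketch below compares the standard 12-6 Lennard-Jones potential with a generic soft-core variant. The exact functional form used by GLOW is not given in the abstract; the soft-core formula, the `alpha` parameter, and both function names are assumptions chosen for illustration.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Standard 12-6 Lennard-Jones potential; diverges as r approaches 0."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)


def soft_core_lj(r, epsilon=1.0, sigma=1.0, alpha=0.5):
    """Generic soft-core variant: finite at r = 0, close to LJ at large r."""
    x = alpha + (r / sigma) ** 6
    return 4.0 * epsilon * (1.0 / x ** 2 - 1.0 / x)


for r in (0.8, 1.0, 1.5, 3.0):
    print(f"r = {r:.1f}  LJ = {lennard_jones(r):9.4f}  soft = {soft_core_lj(r):9.4f}")

print("soft-core energy at r = 0:", soft_core_lj(0.0))
```

With the repulsive wall capped, a sampler can tolerate mild steric clashes, which plausibly helps it reach poses in pockets whose shape would have to change to accommodate the ligand.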

Non-uniform Online Learning: Towards Understanding Induction

paper_url: http://arxiv.org/abs/2312.00170
repo_url: None
paper_authors: Zhou Lu
for: This paper explores the relationship between online learning and inductive inference, with a focus on the former's limitations in applying to the latter.
methods: The authors introduce the concept of non-uniform online learning and provide a complete characterization of learnability with finite error in the realizable setting. They also propose a necessary condition for consistency and extend their results to the more realistic agnostic setting.
results: The paper shows that any countable union of Littlestone classes can be learnt with regret $\tilde{O}(\sqrt{T})$ in the agnostic setting, and provides a new perspective on the power of induction from an online learning viewpoint.

Abstract
Can a physicist make only finite errors in the endless pursuit of the law of nature? This millennium-old question of inductive inference is a fundamental, yet mysterious problem in philosophy, lacking rigorous justifications. While classic online learning theory and inductive inference share a similar sequential decision-making spirit, the former's reliance on an adaptive adversary and worst-case error bounds limits its applicability to the latter. In this work, we introduce the concept of non-uniform online learning, which we argue aligns more closely with the principles of inductive reasoning. This setting assumes a predetermined ground-truth hypothesis and considers non-uniform, hypothesis-wise error bounds. In the realizable setting, we provide a complete characterization of learnability with finite error: a hypothesis class is non-uniform learnable if and only if it's a countable union of Littlestone classes, no matter the observations are adaptively chosen or iid sampled. Additionally, we propose a necessary condition for the weaker criterion of consistency which we conjecture to be tight. To further promote our theory, we extend our result to the more realistic agnostic setting, showing that any countable union of Littlestone classes can be learnt with regret $\tilde{O}(\sqrt{T})$. We hope this work could offer a new perspective of interpreting the power of induction from an online learning viewpoint.

Summary
Can a physicist make only finitely many errors in the endless pursuit of the laws of nature? This age-old question of inductive inference is a fundamental yet mysterious problem in philosophy that still lacks rigorous justification. Although classic online learning theory and inductive inference share a similar sequential decision-making spirit, the former's reliance on an adaptive adversary and worst-case error bounds limits its applicability to the latter. This work introduces non-uniform online learning, which the author argues aligns more closely with the principles of inductive reasoning. The setting assumes a predetermined ground-truth hypothesis and considers non-uniform, hypothesis-wise error bounds. In the realizable setting, the paper gives a complete characterization of learnability with finite error: a hypothesis class is non-uniformly learnable if and only if it is a countable union of Littlestone classes, regardless of whether the observations are adaptively chosen or iid sampled. The paper also proposes a necessary condition for the weaker criterion of consistency, conjectured to be tight. The results are extended to the more realistic agnostic setting, showing that any countable union of Littlestone classes can be learned with regret $\tilde{O}(\sqrt{T})$. The author hopes this work offers a new perspective on the power of induction from an online learning viewpoint.
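
A hypothesis-wise (non-uniform) mistake bound can be illustrated with learning by enumeration. The sketch below assumes the countable class of threshold functions `h_k(x) = 1 if x >= k` for `k = 0, 1, 2, ...`, and a learner that always predicts with the first hypothesis consistent with the history; this toy class and the function names are illustrative choices, not the paper's construction.

```python
from itertools import count


def hypothesis(k):
    """Threshold hypothesis h_k(x) = 1 if x >= k else 0."""
    return lambda x: int(x >= k)


def run_learner(true_k, stream):
    """Predict with the first consistent hypothesis; return the number of mistakes."""
    history = []
    mistakes = 0
    for x in stream:
        # The search terminates because h_{true_k} is always consistent (realizable case).
        k = next(
            k for k in count()
            if all(hypothesis(k)(xi) == yi for xi, yi in history)
        )
        prediction = hypothesis(k)(x)
        label = hypothesis(true_k)(x)
        mistakes += int(prediction != label)
        history.append((x, label))
    return mistakes


stream = [5, 3, 4, 10, 0, 7, 4, 3]
print(run_learner(true_k=4, stream=stream))
```

Each mistake rules out the current hypothesis, so the learner's index only moves forward and the number of mistakes is at most `true_k`. The bound is finite for every target but not uniform over the class, which is the flavour of guarantee the abstract describes.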

The Multiverse of Dynamic Mode Decomposition Algorithms

paper_url: http://arxiv.org/abs/2312.00137
repo_url: https://github.com/mcolbrook/dmd-multiverse
paper_authors: Matthew J. Colbrook
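for: This review presents the data-driven analysis technique Dynamic Mode Decomposition (DMD) and emphasizes the role of Koopman operators in transforming complex nonlinear dynamics into a linear framework.
methods: The review focuses on the relationship between DMD and the spectral properties of Koopman operators, and surveys three categories of DMD methods: linear regression-based methods, Galerkin approximations, and structure-preserving techniques.
results: The review gives an overview of significant DMD algorithms and their applications, and includes a MATLAB package with examples to deepen practical understanding of these methods.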

Abstract
Dynamic Mode Decomposition (DMD) is a popular data-driven analysis technique used to decompose complex, nonlinear systems into a set of modes, revealing underlying patterns and dynamics through spectral analysis. This review presents a comprehensive and pedagogical examination of DMD, emphasizing the role of Koopman operators in transforming complex nonlinear dynamics into a linear framework. A distinctive feature of this review is its focus on the relationship between DMD and the spectral properties of Koopman operators, with particular emphasis on the theory and practice of DMD algorithms for spectral computations. We explore the diverse "multiverse" of DMD methods, categorized into three main areas: linear regression-based methods, Galerkin approximations, and structure-preserving techniques. Each category is studied for its unique contributions and challenges, providing a detailed overview of significant algorithms and their applications as outlined in Table 1. We include a MATLAB package with examples and applications to enhance the practical understanding of these methods. This review serves as both a practical guide and a theoretical reference for various DMD methods, accessible to both experts and newcomers, and enabling readers to delve into their areas of interest in the expansive field of DMD.
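
As a concrete reference point for the linear regression-based family, the following is a minimal sketch of the widely used exact DMD algorithm in NumPy (the review itself ships MATLAB code, which is not reproduced here). The function name `exact_dmd` and the toy linear system are assumptions for illustration.

```python
import numpy as np


def exact_dmd(X, Y, rank):
    """Exact DMD: approximate eigenpairs of the best-fit linear map with Y ~ A X.

    X and Y hold snapshots as columns, with Y shifted one time step after X.
    """
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    V = Vh.conj().T
    # Project A onto the leading singular subspace: A_tilde = U^H Y V S^{-1}.
    A_tilde = (U.conj().T @ Y @ V) / s
    eigvals, W = np.linalg.eig(A_tilde)
    modes = ((Y @ V) / s) @ W  # exact DMD modes
    return eigvals, modes


# Toy data from a known linear system, so the eigenvalues can be checked.
A = np.array([[0.9, -0.2], [0.2, 0.9]])
snapshots = [np.array([1.0, 0.0])]
for _ in range(20):
    snapshots.append(A @ snapshots[-1])
data = np.stack(snapshots, axis=1)  # shape (2, 21)

eigvals, modes = exact_dmd(data[:, :-1], data[:, 1:], rank=2)
print(np.sort_complex(eigvals))
```

For this system the recovered eigenvalues should match those of `A`, namely 0.9 ± 0.2i. For nonlinear dynamics the same computation is applied to observables of the state, and the eigenvalues are then read as approximations to part of the Koopman spectrum.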

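Summary
Dynamic Mode Decomposition (DMD) is a widely used data-driven analysis technique that decomposes complex, nonlinear systems into a set of modes, revealing underlying patterns and dynamics through spectral analysis. This review offers a comprehensive and pedagogical examination of DMD, emphasizing the role of Koopman operators in transforming complex nonlinear dynamics into a linear framework. A distinctive feature of the review is its focus on the relationship between DMD and the spectral properties of Koopman operators, in particular the theory and practice of DMD algorithms for spectral computations. It explores the diverse "multiverse" of DMD methods in three main categories: linear regression-based methods, Galerkin approximations, and structure-preserving techniques. Each category is studied for its unique contributions and challenges, with a detailed overview of significant algorithms and their applications. A MATLAB package with examples and applications is included to aid practical understanding. The review serves as both a practical guide and a theoretical reference for DMD methods, and is accessible to experts and newcomers alike.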

Low latency optical-based mode tracking with machine learning deployed on FPGAs on a tokamak

paper_url: http://arxiv.org/abs/2312.00128
repo_url: None
paper_authors: Yumou Wei, Ryan F. Forelli, Chris Hansen, Jeffrey P. Levesque, Nhan Tran, Joshua C. Agar, Giuseppe Di Guglielmo, Michael E. Mauel, Gerald A. Navratil
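for: This paper aims to apply machine learning in real time to track and control magnetohydrodynamic (MHD) instabilities in tokamak fusion devices.
methods: The study processes high-speed camera data on Field Programmable Gate Array (FPGA) hardware and uses a convolutional neural network (CNN) model to predict the amplitude and phase of the $n$=1 MHD mode.
results: On the High Beta Tokamak-Extended Pulse (HBT-EP) experiment, the study demonstrates an FPGA-based high-speed camera data acquisition and processing system with a trigger-to-output latency of 17.6$\mu$s and a throughput of up to 120kfps, enabling real-time machine-learning-based tokamak diagnostics and control.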

Abstract
Active feedback control in magnetic confinement fusion devices is desirable to mitigate plasma instabilities and enable robust operation. Optical high-speed cameras provide a powerful, non-invasive diagnostic and can be suitable for these applications. In this study, we process fast camera data, at rates exceeding 100kfps, on $\textit{in situ}$ Field Programmable Gate Array (FPGA) hardware to track magnetohydrodynamic (MHD) mode evolution and generate control signals in real-time. Our system utilizes a convolutional neural network (CNN) model which predicts the $n$=1 MHD mode amplitude and phase using camera images with better accuracy than other tested non-deep-learning-based methods. By implementing this model directly within the standard FPGA readout hardware of the high-speed camera diagnostic, our mode tracking system achieves a total trigger-to-output latency of 17.6$\mu$s and a throughput of up to 120kfps. This study at the High Beta Tokamak-Extended Pulse (HBT-EP) experiment demonstrates an FPGA-based high-speed camera data acquisition and processing system, enabling application in real-time machine-learning-based tokamak diagnostic and control as well as potential applications in other scientific domains.

Summary
Active feedback control in magnetic confinement fusion devices is desirable to mitigate plasma instabilities and enable robust operation. Optical high-speed cameras provide a powerful, non-invasive diagnostic and can be suitable for these applications. In this study, fast camera data is processed at rates exceeding 100kfps on in situ Field Programmable Gate Array (FPGA) hardware to track magnetohydrodynamic (MHD) mode evolution and generate control signals in real time. The system uses a convolutional neural network (CNN) model that predicts the n=1 MHD mode amplitude and phase from camera images with better accuracy than other tested non-deep-learning-based methods. By implementing this model directly within the standard FPGA readout hardware of the high-speed camera diagnostic, the mode tracking system achieves a total trigger-to-output latency of 17.6μs and a throughput of up to 120kfps. This study at the High Beta Tokamak-Extended Pulse (HBT-EP) experiment demonstrates an FPGA-based high-speed camera data acquisition and processing system, enabling real-time machine-learning-based tokamak diagnostics and control as well as potential applications in other scientific domains.
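
The two headline numbers, latency and throughput, measure different things, and a little arithmetic shows how they relate. The calculation below uses only the figures quoted above; the interpretation in the comments is an inference, not a statement from the paper.

```python
throughput_fps = 120_000  # frames per second, from the abstract
latency_s = 17.6e-6       # trigger-to-output latency, from the abstract

frame_period_s = 1.0 / throughput_fps            # time between consecutive frames
frames_in_flight = latency_s / frame_period_s    # frames overlapping in the pipeline

print(f"frame period: {frame_period_s * 1e6:.2f} microseconds")
print(f"latency / frame period: {frames_in_flight:.3f}")
```

At full throughput a new frame arrives roughly every 8.33 microseconds, so a 17.6 microsecond latency spans a little more than two frame periods. This suggests the processing is pipelined, with a new frame entering before the previous result has left.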

Flow Matching Beyond Kinematics: Generating Jets with Particle-ID and Trajectory Displacement Information

paper_url: http://arxiv.org/abs/2312.00123
repo_url: https://github.com/uhh-pd-ml/beyond_kinematics
paper_authors: Joschka Birk, Erik Buhmann, Cedric Ewen, Gregor Kasieczka, David Shih
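for: This paper generates jets from the JetClass dataset using a continuous normalizing flow (CNF) model.
methods: The model is trained with the flow matching technique and is permutation-equivariant; conditioned on the jet type, it can generate the different jet types.
results: The model accurately generates all features in the JetClass dataset, including particle-ID and track impact parameter.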

Abstract
We introduce the first generative model trained on the JetClass dataset. Our model generates jets at the constituent level, and it is a permutation-equivariant continuous normalizing flow (CNF) trained with the flow matching technique. It is conditioned on the jet type, so that a single model can be used to generate the ten different jet types of JetClass. For the first time, we also introduce a generative model that goes beyond the kinematic features of jet constituents. The JetClass dataset includes more features, such as particle-ID and track impact parameter, and we demonstrate that our CNF can accurately model all of these additional features as well. Our generative model for JetClass expands on the versatility of existing jet generation techniques, enhancing their potential utility in high-energy physics research, and offering a more comprehensive understanding of the generated jets.

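Summary
We introduce the first generative model trained on the JetClass dataset. Our model generates jets at the constituent level and is a continuous normalizing flow (CNF) trained with the flow matching technique. It is conditioned on the jet type, so a single model can generate the ten different jet types in JetClass. For the first time, we also introduce a generative model that goes beyond the kinematic features of jet constituents. The JetClass dataset includes additional features, such as particle-ID and track impact parameter, and we show that our CNF can model these additional features accurately. Our generative model for JetClass extends existing jet generation techniques, broadens their potential applications, and offers a more comprehensive understanding of the generated jets.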
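
The flow matching objective mentioned above can be sketched in a few lines. This is a minimal illustration, assuming a straight-line path between a noise sample `x0` and a data sample `x1`; the path actually used in the paper, the permutation-equivariant network, and the jet-type conditioning are not reproduced here, and `velocity_model` is a hypothetical stand-in for the trained network.

```python
def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus the target velocity.

    Assumption: linear interpolation path, so the target velocity is x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return x_t, target_v


def flow_matching_loss(velocity_model, x0, x1, t):
    """Mean squared error between the predicted and the target velocity."""
    x_t, target_v = flow_matching_pair(x0, x1, t)
    pred_v = velocity_model(x_t, t)  # hypothetical model: (point, time) -> velocity
    return sum((p - q) ** 2 for p, q in zip(pred_v, target_v)) / len(target_v)
```

During training, `t` would be drawn uniformly from [0, 1] and the loss averaged over many (`x0`, `x1`) pairs; sampling then integrates the learned velocity field from noise to data.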

Scalable Bayesian uncertainty quantification with data-driven priors for radio interferometric imaging

paper_url: http://arxiv.org/abs/2312.00125
repo_url: https://github.com/astro-informatics/quantifai
paper_authors: Tobías I. Liaudat, Matthijs Mars, Matthew A. Price, Marcelo Pereyra, Marta M. Betcke, Jason D. McEwen
for: This paper aims to address the challenge of uncertainty quantification (UQ) in radio-interferometric imaging with next-generation radio telescopes like the Square Kilometre Array.
methods: The proposed method, called QuantifAI, uses a data-driven (learned) prior for high-dimensional settings, combined with a physically motivated likelihood function. The method leverages probability concentration phenomena of high-dimensional log-concave posteriors to obtain information about the posterior, and uses convex optimization methods to compute the maximum a posteriori (MAP) estimation.
results: The method is demonstrated in a simulated setting, showing improved image quality and more meaningful uncertainties compared to a benchmark method based on a sparsity-promoting prior. The method is also shown to be fast and scalable, making it a promising approach for UQ in radio-interferometric imaging.

Abstract
Next-generation radio interferometers like the Square Kilometer Array have the potential to unlock scientific discoveries thanks to their unprecedented angular resolution and sensitivity. One key to unlocking their potential resides in handling the deluge and complexity of incoming data. This challenge requires building radio interferometric imaging methods that can cope with the massive data sizes and provide high-quality image reconstructions with uncertainty quantification (UQ). This work proposes a method coined QuantifAI to address UQ in radio-interferometric imaging with data-driven (learned) priors for high-dimensional settings. Our model, rooted in the Bayesian framework, uses a physically motivated model for the likelihood. The model exploits a data-driven convex prior, which can encode complex information learned implicitly from simulations and guarantee the log-concavity of the posterior. We leverage probability concentration phenomena of high-dimensional log-concave posteriors that let us obtain information about the posterior, avoiding MCMC sampling techniques. We rely on convex optimisation methods to compute the MAP estimation, which is known to be faster and better scale with dimension than MCMC sampling strategies. Our method allows us to compute local credible intervals, i.e., Bayesian error bars, and perform hypothesis testing of structure on the reconstructed image. In addition, we propose a novel blazing-fast method to compute pixel-wise uncertainties at different scales. We demonstrate our method by reconstructing radio-interferometric images in a simulated setting and carrying out fast and scalable UQ, which we validate with MCMC sampling. Our method shows an improved image quality and more meaningful uncertainties than the benchmark method based on a sparsity-promoting prior. QuantifAI's source code: https://github.com/astro-informatics/QuantifAI.

Summary
Our proposed method, called QuantifAI, addresses uncertainty quantification (UQ) in radio-interferometric imaging by using data-driven (learned) priors for high-dimensional settings in the Bayesian framework. Our model leverages a physically motivated likelihood function and a data-driven convex prior that can capture complex information learned implicitly from simulations. This prior ensures the log-concavity of the posterior, which enables us to use probability concentration phenomena to obtain information about the posterior without relying on Markov chain Monte Carlo (MCMC) sampling techniques. Our method uses convex optimization methods to compute the maximum a posteriori (MAP) estimation, which is faster and scales better with dimension than MCMC sampling strategies. Additionally, we propose a novel method to compute pixel-wise uncertainties at different scales. We demonstrate the effectiveness of QuantifAI by reconstructing radio-interferometric images in a simulated setting and performing fast and scalable UQ, which we validate with MCMC sampling. Our results show that QuantifAI provides improved image quality and more meaningful uncertainties than a benchmark method based on a sparsity-promoting prior. QuantifAI's source code is available at https://github.com/astro-informatics/QuantifAI.

Geometry-Aware Normalizing Wasserstein Flows for Optimal Causal Inference

paper_url: http://arxiv.org/abs/2311.18826
repo_url: None
paper_authors: Kaiwen Hou
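for: To enrich the framework of continuous normalizing flows (CNFs) within causal inference, primarily to improve the geometric properties of the parametric submodels used in targeted maximum likelihood estimation (TMLE).
methods: An innovative application of CNFs constructs a refined series of parametric submodels that enable a directed interpolation between $p_0$ and $p_1$. The approach optimizes the semiparametric efficiency bound in causal inference by aligning CNFs with Wasserstein gradient flows.
results: The method aims not only to minimize the mean squared error of the estimation but also gives the estimators geometric sophistication, which improves robustness against misspecification. This robustness is crucial because it alleviates the dependence on the standard $n^{\frac{1}{4}}$ rate for a doubly-robust perturbation direction in TMLE. By combining robust optimization principles with differential geometry, the geometry-aware CNFs represent a significant advance in causal inference.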

Abstract
This manuscript enriches the framework of continuous normalizing flows (CNFs) within causal inference, primarily to augment the geometric properties of parametric submodels used in targeted maximum likelihood estimation (TMLE). By introducing an innovative application of CNFs, we construct a refined series of parametric submodels that enable a directed interpolation between the prior distribution $p_0$ and the empirical distribution $p_1$. This proposed methodology serves to optimize the semiparametric efficiency bound in causal inference by orchestrating CNFs to align with Wasserstein gradient flows. Our approach not only endeavors to minimize the mean squared error in the estimation but also imbues the estimators with geometric sophistication, thereby enhancing robustness against misspecification. This robustness is crucial, as it alleviates the dependence on the standard $n^{\frac{1}{4}}$ rate for a doubly-robust perturbation direction in TMLE. By incorporating robust optimization principles and differential geometry into the estimators, the developed geometry-aware CNFs represent a significant advancement in the pursuit of doubly robust causal inference.

An Adaptive Framework for Generalizing Network Traffic Prediction towards Uncertain Environments

paper_url: http://arxiv.org/abs/2311.18824
repo_url: None
paper_authors: Alexander Downey, Evren Tuna, Alkan Soysal
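for: This paper proposes a new framework for dynamically assigning mobile network traffic prediction models in previously unseen wireless environments.
methods: The framework uses time-series analysis, combining unsupervised clustering with supervised learning, to predict traffic volume.
results: Without needing prior knowledge of a cell, the framework improves on current studies by over 50%. It can also be applied to other machine learning applications in uncertain environments.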

Abstract
We have developed a new framework using time-series analysis for dynamically assigning mobile network traffic prediction models in previously unseen wireless environments. Our framework selectively employs learned behaviors, outperforming any single model with over a 50% improvement relative to current studies. More importantly, it surpasses traditional approaches without needing prior knowledge of a cell. While this paper focuses on network traffic prediction using our adaptive forecasting framework, this framework can also be applied to other machine learning applications in uncertain environments. The framework begins with unsupervised clustering of time-series data to identify unique trends and seasonal patterns. Subsequently, we apply supervised learning for traffic volume prediction within each cluster. This specialization towards specific traffic behaviors occurs without penalties from spatial and temporal variations. Finally, the framework adaptively assigns trained models to new, previously unseen cells. By analyzing real-time measurements of a cell, our framework intelligently selects the most suitable cluster for that cell at any given time, with cluster assignment dynamically adjusting to spatio-temporal fluctuations.

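Summary
We have developed a new framework that uses time-series analysis to dynamically assign mobile network traffic prediction models in previously unseen wireless environments. Our framework selectively employs learned behaviors, outperforming any single model with an improvement of over 50% relative to current studies. More importantly, it surpasses traditional approaches without needing prior knowledge of a cell. Although this paper focuses on network traffic prediction with our adaptive forecasting framework, the framework can also be applied to other machine learning applications in uncertain environments. The framework begins with unsupervised clustering of time-series data to identify unique trends and seasonal patterns. We then apply supervised learning to predict traffic volume within each cluster. This specialization towards specific traffic behaviors occurs without penalties from spatial and temporal variations. Finally, the framework adaptively assigns trained models to new, previously unseen cells. By analyzing real-time measurements of a cell, it selects the most suitable cluster for that cell at any given time, with cluster assignment adjusting dynamically to spatio-temporal fluctuations.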
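
The final assignment step described above can be sketched as a nearest-centroid lookup. This is a hedged illustration only: the Euclidean distance, the fixed-length measurement window, and the per-cluster callables in `models` are assumptions made for the example, not details taken from the paper.

```python
import math


def nearest_cluster(window, centroids):
    """Index of the centroid closest (Euclidean distance) to the recent window."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centroids)), key=lambda k: dist(window, centroids[k]))


def predict_next(window, centroids, models):
    """Pick the cluster for an unseen cell, then apply that cluster's model.

    `models[k]` is a hypothetical per-cluster predictor: window -> next value.
    """
    k = nearest_cluster(window, centroids)
    return k, models[k](window)
```

Calling `predict_next` again as new measurements arrive lets the selected cluster change over time, which mirrors the dynamic reassignment described in the abstract.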

Pre-registration for Predictive Modeling

paper_url: http://arxiv.org/abs/2311.18807
repo_url: https://github.com/bostonadam525/Exploring-Ebay-Car-Sales-Data
paper_authors: Jake M. Hofman, Angelos Chatzimparmpas, Amit Sharma, Duncan J. Watts, Jessica Hullman
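for: To improve the reproducibility and generalizability of predictive modeling.
methods: The paper adapts pre-registration practices to predictive modeling and introduces a lightweight pre-registration template.
results: A qualitative study with machine learning researchers gives insight into how effective pre-registration is at preventing biased estimates and promoting more reliable research outcomes.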

Abstract
Amid rising concerns of reproducibility and generalizability in predictive modeling, we explore the possibility and potential benefits of introducing pre-registration to the field. Despite notable advancements in predictive modeling, spanning core machine learning tasks to various scientific applications, challenges such as overlooked contextual factors, data-dependent decision-making, and unintentional re-use of test data have raised questions about the integrity of results. To address these issues, we propose adapting pre-registration practices from explanatory modeling to predictive modeling. We discuss current best practices in predictive modeling and their limitations, introduce a lightweight pre-registration template, and present a qualitative study with machine learning researchers to gain insight into the effectiveness of pre-registration in preventing biased estimates and promoting more reliable research outcomes. We conclude by exploring the scope of problems that pre-registration can address in predictive modeling and acknowledging its limitations within this context.

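Summary
Amid rising concerns about reproducibility and generalizability in predictive modeling, we explore the possibility and potential benefits of introducing pre-registration. Although predictive modeling has advanced notably in core machine learning tasks and in various scientific applications, problems such as overlooked contextual factors, data-dependent decision-making, and unintentional re-use of test data have raised questions about the integrity of results. To address these issues, we propose adapting pre-registration practices from explanatory modeling. We discuss current best practices and their limitations, introduce a lightweight pre-registration template, and present a qualitative study with machine learning researchers to understand how effective pre-registration is at preventing biased estimates and promoting more reliable research outcomes. We conclude by exploring the scope of problems that pre-registration can address in predictive modeling and by acknowledging its limitations in this context.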

Efficient Baseline for Quantitative Precipitation Forecasting in Weather4cast 2023

paper_url: http://arxiv.org/abs/2311.18806
repo_url: None
paper_authors: Akshay Punjabi, Pablo Izquierdo Ayala
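for: Accurate precipitation forecasting to support decision-making across industries.
methods: A minimalist U-Net model with low computational demands, used as a baseline.
results: The paper provides a minimalist U-Net baseline for precipitation forecasting that reduces computational resource usage and can serve as a reference for future weather forecasting initiatives.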

Abstract
Accurate precipitation forecasting is indispensable for informed decision-making across various industries. However, the computational demands of current models raise environmental concerns. We address the critical need for accurate precipitation forecasting while considering the environmental impact of computational resources and propose a minimalist U-Net architecture to be used as a baseline for future weather forecasting initiatives.

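Summary
Accurate precipitation forecasting is indispensable for informed decision-making across industries. However, the computational demands of current models raise environmental concerns. We address the critical need for accurate precipitation forecasting while considering the environmental impact of computational resources, and we propose a minimalist U-Net architecture as a baseline for future weather forecasting initiatives.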

Communication-Efficient Federated Optimization over Semi-Decentralized Networks

paper_url: http://arxiv.org/abs/2311.18787
repo_url: None
paper_authors: He Wang, Yuejie Chi
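for: This work aims to improve communication efficiency in large-scale federated and decentralized learning, where communication is one of the most challenging bottlenecks.
methods: The authors study a semi-decentralized communication protocol in which agents perform both agent-to-agent and agent-to-server communication in a probabilistic manner. They propose PISCO, a communication-efficient algorithm that uses gradient tracking and allows multiple local updates to save communication.
results: The authors establish the convergence rate of PISCO for nonconvex problems and show a linear speedup in the number of agents and local updates. Numerical results highlight its resilience to data heterogeneity and to various network topologies.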

Abstract
In large-scale federated and decentralized learning, communication efficiency is one of the most challenging bottlenecks. While gossip communication -- where agents can exchange information with their connected neighbors -- is more cost-effective than communicating with the remote server, it often requires a greater number of communication rounds, especially for large and sparse networks. To tackle the trade-off, we examine the communication efficiency under a semi-decentralized communication protocol, in which agents can perform both agent-to-agent and agent-to-server communication in a probabilistic manner. We design a tailored communication-efficient algorithm over semi-decentralized networks, referred to as PISCO, which inherits the robustness to data heterogeneity thanks to gradient tracking and allows multiple local updates for saving communication. We establish the convergence rate of PISCO for nonconvex problems and show that PISCO enjoys a linear speedup in terms of the number of agents and local updates. Our numerical results highlight the superior communication efficiency of PISCO and its resilience to data heterogeneity and various network topologies.

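Summary
In large-scale federated and decentralized learning, communication efficiency is one of the most challenging bottlenecks. Gossip communication, in which agents exchange information with their connected neighbors, is more cost-effective than communicating with a remote server, but it usually requires more communication rounds, especially in large and sparse networks. To tackle this trade-off, we study communication efficiency under a semi-decentralized protocol in which agents perform both agent-to-agent and agent-to-server communication in a probabilistic manner. We design a tailored communication-efficient algorithm, called PISCO, which inherits robustness to data heterogeneity through gradient tracking and allows multiple local updates to reduce communication cost. We establish the convergence rate of PISCO for nonconvex problems and show that it enjoys a linear speedup in the number of agents and local updates. Our numerical results show that PISCO is highly communication-efficient and resilient to data heterogeneity and to different network topologies.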
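
A single communication round of a semi-decentralized protocol can be sketched as follows. This is a simplified illustration, not the PISCO algorithm itself: gradient tracking and local updates are omitted, and the names `p_server` and `mixing` are assumptions introduced for the example.

```python
import random


def communicate(values, mixing, p_server, rng):
    """One probabilistic communication round over scalar agent values.

    With probability p_server every agent receives the global average
    (agent-to-server); otherwise each agent takes a weighted average of its
    neighbors' values using the rows of `mixing` (agent-to-agent gossip).
    """
    n = len(values)
    if rng.random() < p_server:
        avg = sum(values) / n
        return [avg] * n
    return [sum(mixing[i][j] * values[j] for j in range(n)) for i in range(n)]
```

If `mixing` is doubly stochastic, both branches preserve the average of the agents' values; the server branch reaches consensus in one round, while gossip only moves the agents closer together.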

Multimodal Learning for Crystalline Materials

paper_url: http://arxiv.org/abs/2312.00111
repo_url: None
paper_authors: Viggo Moro, Charlotte Loh, Rumen Dangovski, Ali Ghorashi, Andrew Ma, Zhuo Chen, Peter Y. Lu, Thomas Christensen, Marin Soljačić
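for: This work aims to use artificial intelligence (AI) to improve material property prediction and the discovery of novel materials.
methods: The paper proposes Multimodal Learning for Crystalline Materials (MLCM), a new method for training a foundation model via multimodal alignment, which connects high-dimensional material properties (i.e. modalities) in a shared latent space to produce highly useful material representations.
results: MLCM achieves state-of-the-art performance for material property prediction on the Materials Project database, enables a highly accurate method for inverse design that screens for stable materials with desired properties, and allows the extraction of interpretable emergent features that may provide insight to material scientists.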

Abstract
Artificial intelligence (AI) has revolutionized the field of materials science by improving the prediction of properties and accelerating the discovery of novel materials. In recent years, publicly available material data repositories containing data for various material properties have grown rapidly. In this work, we introduce Multimodal Learning for Crystalline Materials (MLCM), a new method for training a foundation model for crystalline materials via multimodal alignment, where high-dimensional material properties (i.e. modalities) are connected in a shared latent space to produce highly useful material representations. We show the utility of MLCM on multiple axes: (i) MLCM achieves state-of-the-art performance for material property prediction on the challenging Materials Project database; (ii) MLCM enables a novel, highly accurate method for inverse design, allowing one to screen for stable material with desired properties; and (iii) MLCM allows the extraction of interpretable emergent features that may provide insight to material scientists. Further, we explore several novel methods for aligning an arbitrary number of modalities, improving upon prior art in multimodal learning that focuses on bimodal alignment. Our work brings innovations from the ongoing AI revolution into the domain of materials science and identifies materials as a testbed for the next generation of AI.

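Summary
Artificial intelligence (AI) has revolutionized materials science by improving property prediction and accelerating the discovery of novel materials. In recent years, publicly available material data repositories have grown rapidly. In this work, we introduce Multimodal Learning for Crystalline Materials (MLCM), a new method that uses multimodal alignment to produce highly useful material representations. We show the utility of MLCM on multiple axes: (i) MLCM achieves state-of-the-art performance for material property prediction on the challenging Materials Project database; (ii) MLCM enables a novel, highly accurate method for inverse design, allowing one to screen for stable materials with desired properties; and (iii) MLCM allows the extraction of interpretable emergent features that may provide insight to material scientists. We also explore several novel methods for aligning an arbitrary number of modalities, improving on prior multimodal learning work that focuses on bimodal alignment. Our work brings the AI revolution into materials science and identifies materials as a testbed for the next generation of AI.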

MultiResFormer: Transformer with Adaptive Multi-Resolution Modeling for General Time Series Forecasting

paper_url: http://arxiv.org/abs/2311.18780
repo_url: None
paper_authors: Linfeng Du, Ji Xin, Alex Labach, Saba Zuberi, Maksims Volkovs, Rahul G. Krishnan
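for: To improve time series forecasting, especially on long-term forecasting tasks.
methods: A Transformer model that adaptively chooses optimal patch lengths to model temporal variations.
results: MultiResFormer outperforms patch-based Transformer baselines on long-term forecasting tasks and consistently outperforms CNN baselines by a large margin, while using far fewer parameters.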

Abstract
Transformer-based models have greatly pushed the boundaries of time series forecasting recently. Existing methods typically encode time series data into $\textit{patches}$ using one or a fixed set of patch lengths. This, however, could result in a lack of ability to capture the variety of intricate temporal dependencies present in real-world multi-periodic time series. In this paper, we propose MultiResFormer, which dynamically models temporal variations by adaptively choosing optimal patch lengths. Concretely, at the beginning of each layer, time series data is encoded into several parallel branches, each using a detected periodicity, before going through the transformer encoder block. We conduct extensive evaluations on long- and short-term forecasting datasets comparing MultiResFormer with state-of-the-art baselines. MultiResFormer outperforms patch-based Transformer baselines on long-term forecasting tasks and also consistently outperforms CNN baselines by a large margin, while using much fewer parameters than these baselines.

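Summary
Transformer-based models have recently made great progress in time series forecasting. Existing methods typically encode time series data into patches using one patch length or a fixed set of patch lengths. This can limit their ability to capture the intricate temporal dependencies present in real-world multi-periodic time series. In this paper, we propose MultiResFormer, which models temporal variations dynamically by adaptively choosing optimal patch lengths. Concretely, at the beginning of each layer, the time series data is encoded into several parallel branches, each using a detected periodicity, and then passed through the transformer encoder block. We conduct extensive evaluations on long- and short-term forecasting datasets, comparing MultiResFormer with state-of-the-art baselines. MultiResFormer outperforms patch-based Transformer baselines on long-term forecasting tasks and consistently outperforms CNN baselines, while using far fewer parameters.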
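
The idea of choosing patch lengths from detected periodicities can be sketched as below. This is an illustrative assumption, not the paper's implementation: the abstract only says each branch uses "a detected periodicity", so the discrete Fourier transform, the top-k amplitude rule, and the zero padding are choices made for this example.

```python
import cmath
import math


def detect_periods(series, k):
    """Periods of the k strongest non-zero frequencies (naive O(n^2) DFT)."""
    n = len(series)
    amps = []
    for f in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * f * t / n)
                    for t, x in enumerate(series))
        amps.append((abs(coeff), f))
    amps.sort(reverse=True)          # strongest frequency first
    return [n // f for _, f in amps[:k]]


def to_patches(series, period):
    """Zero-pad the series to a multiple of `period`, then split it into patches."""
    pad = (-len(series)) % period
    padded = list(series) + [0.0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]
```

Each detected period would define one parallel branch, with `to_patches(series, period)` supplying that branch's patch sequence to the encoder block.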

Online Change Points Detection for Linear Dynamical Systems with Finite Sample Guarantees

paper_url: http://arxiv.org/abs/2311.18769
repo_url: None
paper_authors: Lei Xin, George Chiu, Shreyas Sundaram
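for: This paper addresses online change point detection, that is, detecting abrupt changes in the properties of a time series as soon as possible after they occur.
methods: The method targets linear dynamical systems with unknown dynamics and can detect multiple change points. It provides a data-dependent threshold for the test that achieves a pre-specified upper bound on the probability of a false alarm.
results: The paper provides a finite-sample bound on the probability of detecting a change point. The bound shows how the algorithm's parameters affect detection probability and delay, and gives guidance on the minimum time between changes needed to guarantee detection.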

Abstract
The problem of online change point detection is to detect abrupt changes in properties of time series, ideally as soon as possible after those changes occur. Existing work on online change point detection either assumes i.i.d data, focuses on asymptotic analysis, does not present theoretical guarantees on the trade-off between detection accuracy and detection delay, or is only suitable for detecting single change points. In this work, we study the online change point detection problem for linear dynamical systems with unknown dynamics, where the data exhibits temporal correlations and the system could have multiple change points. We develop a data-dependent threshold that can be used in our test that allows one to achieve a pre-specified upper bound on the probability of making a false alarm. We further provide a finite-sample-based bound for the probability of detecting a change point. Our bound demonstrates how parameters used in our algorithm affect the detection probability and delay, and provides guidance on the minimum required time between changes to guarantee detection.

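Summary
The problem of online change point detection is to detect abrupt changes in a time series, ideally as soon as possible after they occur. Most existing methods assume independent and identically distributed (i.i.d.) data or focus on asymptotic analysis; they do not provide theoretical guarantees on the trade-off between detection accuracy and detection delay, or they are unsuitable for detecting multiple change points. In this work, we study online change point detection for linear dynamical systems, where the data exhibits temporal correlations and the system may have multiple change points. We develop a data-dependent threshold for our test that achieves a pre-specified upper bound on the probability of a false alarm. We also provide a finite-sample bound on the probability of detecting a change point. The bound shows how the parameters of our algorithm affect detection probability and delay, and it gives the minimum time between changes required to guarantee detection.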

A data-science pipeline to enable the Interpretability of Many-Objective Feature Selection

paper_url: http://arxiv.org/abs/2311.18746
repo_url: https://github.com/f-u-njoku/many-objective-fs-nsgaiii
paper_authors: Uchechukwu F. Njoku, Alberto Abelló, Besim Bilalli, Gianluca Bontempi
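for: This work supports data scientists in interpreting and comparing the outcome of many-objective feature selection (MOFS) so that they can choose the best feature subset.
methods: The paper proposes a new methodology that combines post-processing and visualisation of the solution set to support the data scientist's final choice.
results: Experiments show the added value of the methodology in selecting the final feature subset; it provides high-level information at three levels: objectives, solutions, and individual features.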

Abstract
Many-Objective Feature Selection (MOFS) approaches use four or more objectives to determine the relevance of a subset of features in a supervised learning task. As a consequence, MOFS typically returns a large set of non-dominated solutions, which have to be assessed by the data scientist in order to proceed with the final choice. Given the multi-variate nature of the assessment, which may include criteria (e.g. fairness) not related to predictive accuracy, this step is often not straightforward and suffers from the lack of existing tools. For instance, it is common to make use of a tabular presentation of the solutions, which provide little information about the trade-offs and the relations between criteria over the set of solutions. This paper proposes an original methodology to support data scientists in the interpretation and comparison of the MOFS outcome by combining post-processing and visualisation of the set of solutions. The methodology supports the data scientist in the selection of an optimal feature subset by providing her with high-level information at three different levels: objectives, solutions, and individual features. The methodology is experimentally assessed on two feature selection tasks adopting a GA-based MOFS with six objectives (number of selected features, balanced accuracy, F1-Score, variance inflation factor, statistical parity, and equalised odds). The results show the added value of the methodology in the selection of the final subset of features.

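Summary
Many-objective feature selection (MOFS) approaches use four or more objectives to determine the relevance of a subset of features. As a result, MOFS typically returns a large set of non-dominated solutions, which the data scientist must assess before making the final choice. Because the assessment may include criteria unrelated to predictive accuracy (for example, fairness), this step is often not straightforward and lacks tool support. For example, solutions are commonly presented in a table, which gives little information about the trade-offs and the relations between criteria across the solution set. This paper proposes an original methodology to support data scientists in interpreting and comparing the MOFS outcome. The methodology combines post-processing and visualisation of the solution set to give the data scientist high-level information on objectives, solutions, and individual features. It is assessed experimentally on two feature selection tasks using a GA-based MOFS with six objectives (number of selected features, balanced accuracy, F1-Score, variance inflation factor, statistical parity, and equalised odds). The results show the added value of the methodology in selecting the final feature subset.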
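
The notion of a non-dominated solution set can be made concrete with a small filter. This is a generic sketch of Pareto dominance, not the paper's pipeline; it assumes every objective is to be minimized, so a score such as accuracy would first be converted (for example to 1 - accuracy).

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(solutions):
    """Keep only the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]
```

With six objectives the surviving set is often large, which is why the paper argues that a plain table of these solutions gives little insight into their trade-offs.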

$\mathbb{Z}_2\times \mathbb{Z}_2$ Equivariant Quantum Neural Networks: Benchmarking against Classical Neural Networks

paper_url: http://arxiv.org/abs/2311.18744
repo_url: https://github.com/zhongtiand/eqnn
paper_authors: Zhongtian Dong, Marçal Comajoan Cara, Gopal Ramesh Dahale, Roy T. Forestano, Sergei Gleyzer, Daniel Justice, Kyoungchul Kong, Tom Magorsch, Konstantin T. Matchev, Katia Matcheva, Eyup B. Unlu
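for: This paper presents a comprehensive comparative analysis of the performance of Equivariant Quantum Neural Networks (EQNN) and Quantum Neural Networks (QNN) against their classical counterparts: Equivariant Neural Networks (ENN) and Deep Neural Networks (DNN).
methods: Each network is evaluated on two toy examples of a binary classification task, focusing on model complexity (measured by the number of parameters) and the size of the training data set.
results: The $\mathbb{Z}_2\times \mathbb{Z}_2$ EQNN and the QNN provide superior performance for smaller parameter sets and modest training data samples.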

Abstract
This paper presents a comprehensive comparative analysis of the performance of Equivariant Quantum Neural Networks (EQNN) and Quantum Neural Networks (QNN), juxtaposed against their classical counterparts: Equivariant Neural Networks (ENN) and Deep Neural Networks (DNN). We evaluate the performance of each network with two toy examples for a binary classification task, focusing on model complexity (measured by the number of parameters) and the size of the training data set. Our results show that the $\mathbb{Z}_2\times \mathbb{Z}_2$ EQNN and the QNN provide superior performance for smaller parameter sets and modest training data samples.

Dimension Mixer: A Generalized Method for Structured Sparsity in Deep Neural Networks

paper_url: http://arxiv.org/abs/2311.18735
repo_url: None
paper_authors: Suman Sapkota, Binod Bhattarai
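for: To study the similarities and differences between several neural network architectures, interpreted through the general concept of dimension mixing.
methods: The paper studies group-wise sparse, non-linear, multi-layered and learnable mixing schemes, including the Butterfly structure with an MLP as the mixing function.
results: Experiments show that the proposed Non-Linear Butterfly Mixers are efficient and scale well when the host architectures are used as the mixing function. The paper also proposes a Patch-Only MLP-Mixer for processing spatial 2D signals.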

Abstract
The recent success of multiple neural architectures like CNNs, Transformers, and MLP-Mixers motivated us to look for similarities and differences between them. We found that these architectures can be interpreted through the lens of a general concept of dimension mixing. Research on coupling flows and the butterfly transform shows that partial and hierarchical signal mixing schemes are sufficient for efficient and expressive function approximation. In this work, we study group-wise sparse, non-linear, multi-layered and learnable mixing schemes of inputs and find that they are complementary to many standard neural architectures. Following our observations and drawing inspiration from the Fast Fourier Transform, we generalize Butterfly Structure to use non-linear mixer function allowing for MLP as mixing function called Butterfly MLP. We were also able to mix along sequence dimension for Transformer-based architectures called Butterfly Attention. Experiments on CIFAR and LRA datasets demonstrate that the proposed Non-Linear Butterfly Mixers are efficient and scale well when the host architectures are used as mixing function. Additionally, we propose Patch-Only MLP-Mixer for processing spatial 2D signals demonstrating a different dimension mixing strategy.

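Summary
The recent success of several neural architectures, such as CNNs, Transformers, and MLP-Mixers, motivated us to look for similarities and differences between them. We found that these architectures can be interpreted through the general concept of dimension mixing. Research on coupling flows and the butterfly transform shows that partial and hierarchical signal mixing schemes are sufficient for efficient and expressive function approximation. In this work, we study group-wise sparse, non-linear, multi-layered and learnable mixing schemes and find that they are complementary to many standard neural architectures. Following our observations and drawing inspiration from the Fast Fourier Transform, we generalize the Butterfly Structure to use a non-linear mixer function, which allows an MLP to serve as the mixing function; we call this Butterfly MLP. We can also mix along the sequence dimension for Transformer-based architectures, which we call Butterfly Attention. Experiments show that the proposed Non-Linear Butterfly Mixers are efficient and scale well when used with the host architectures. We also propose a Patch-Only MLP-Mixer for processing 2D signals, which demonstrates a different dimension mixing strategy.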
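
The butterfly mixing pattern borrowed from the Fast Fourier Transform can be sketched as follows. This is a minimal illustration under stated assumptions: blocks of size two, pairing by index XOR stride, and a fixed `mix_fn` standing in for the learnable non-linear MLP described in the abstract.

```python
def butterfly_mix(x, mix_fn):
    """Apply log2(n) butterfly stages; each stage mixes index i with i XOR stride.

    `mix_fn(a, b)` returns the two mixed values. After all stages, every output
    depends on every input even though each stage only mixes pairs.
    """
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(x)
    stride = 1
    while stride < n:
        nxt = list(out)
        for i in range(n):
            j = i ^ stride           # partner index at this stage
            if i < j:                # handle each pair once
                nxt[i], nxt[j] = mix_fn(out[i], out[j])
        out = nxt
        stride *= 2
    return out
```

With `mix_fn = lambda a, b: (a + b, a - b)` this reduces to the Walsh-Hadamard transform, which shows that n log2(n) pairwise mixes are enough to mix all n inputs.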

Indoor Millimeter Wave Localization using Multiple Self-Supervised Tiny Neural Networks

paper_url: http://arxiv.org/abs/2311.18732
repo_url: None
paper_authors: Anish Shastri, Andres Garcia-Saavedra, Paolo Casari
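for: This work aims to localize a mobile millimeter-wave client in a large indoor environment.
methods: Localization uses multilayer perceptron neural networks (NNs). Instead of training and deploying a single deep model, the approach chooses among multiple tiny NNs trained in a self-supervised manner.
results: In simulations, the approach outperforms both geometric localization schemes and the use of a single NN.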

Abstract
We consider the localization of a mobile millimeter-wave client in a large indoor environment using multilayer perceptron neural networks (NNs). Instead of training and deploying a single deep model, we proceed by choosing among multiple tiny NNs trained in a self-supervised manner. The main challenge then becomes to determine and switch to the best NN among the available ones, as an incorrect NN will fail to localize the client. In order to upkeep the localization accuracy, we propose two switching schemes: one based on a Kalman filter, and one based on the statistical distribution of the training data. We analyze the proposed schemes via simulations, showing that our approach outperforms both geometric localization schemes and the use of a single NN.

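Summary
We consider the localization of a mobile millimeter-wave client in a large indoor environment using multilayer perceptron neural networks (NNs). Instead of training and deploying a single deep model, we choose among multiple tiny NNs trained in a self-supervised manner. The main challenge is then to determine and switch to the best NN, because an incorrect NN will produce localization errors. To maintain localization accuracy, we propose two switching schemes: one based on a Kalman filter and one based on the statistical distribution of the training data. We analyze our approach through simulations and show that it outperforms both geometric localization schemes and the use of a single NN.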

AI in Pharma for Personalized Sequential Decision-Making: Methods, Applications and Opportunities

paper_url: http://arxiv.org/abs/2311.18725
repo_url: None
paper_authors: Yuhan Li, Hongtao Zhang, Keaven Anderson, Songzi Li, Ruoqing Zhu
for: The paper is written to provide a review of the applications of artificial intelligence (AI) in the pharmaceutical industry, specifically in drug discovery and development, as well as regulatory submissions.
methods: The paper uses case studies to illustrate the key applications of AI in drug development, including protein structure prediction, success probability estimation, subgroup identification, and AI-assisted clinical trial monitoring.
results: The paper highlights the increasing trend of incorporating AI components in regulatory submissions, with oncology, psychiatry, gastroenterology, and neurology being the most prevalent therapeutic areas leveraging AI. The paper also discusses the paradigm shift towards personalized or precision medicine, which has had a transformative impact on the pharmaceutical industry.

Abstract
In the pharmaceutical industry, the use of artificial intelligence (AI) has seen consistent growth over the past decade. This rise is attributed to major advancements in statistical machine learning methodologies, computational capabilities and the increased availability of large datasets. AI techniques are applied throughout different stages of drug development, ranging from drug discovery to post-marketing benefit-risk assessment. Kolluri et al. provided a review of several case studies that span these stages, featuring key applications such as protein structure prediction, success probability estimation, subgroup identification, and AI-assisted clinical trial monitoring. From a regulatory standpoint, there was a notable uptick in submissions incorporating AI components in 2021. The most prevalent therapeutic areas leveraging AI were oncology (27%), psychiatry (15%), gastroenterology (12%), and neurology (11%). The paradigm of personalized or precision medicine has gained significant traction in recent research, partly due to advancements in AI techniques \cite{hamburg2010path}. This shift has had a transformative impact on the pharmaceutical industry. Departing from the traditional "one-size-fits-all" model, personalized medicine incorporates various individual factors, such as environmental conditions, lifestyle choices, and health histories, to formulate customized treatment plans. By utilizing sophisticated machine learning algorithms, clinicians and researchers are better equipped to make informed decisions in areas such as disease prevention, diagnosis, and treatment selection, thereby optimizing health outcomes for each individual.

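Summary
In the pharmaceutical industry, the use of artificial intelligence (AI) has grown steadily over the past decade. This growth is attributed to major advances in statistical machine learning methodologies, computational capabilities, and the availability of large datasets. AI techniques are applied at different stages of drug development, from drug discovery to post-marketing benefit-risk assessment. Kolluri et al. reviewed several case studies covering protein structure prediction, success probability estimation, subgroup identification, and AI-assisted clinical trial monitoring. From a regulatory standpoint, there was a notable increase in submissions incorporating AI components in 2021. The most prevalent therapeutic areas were oncology (27%), psychiatry (15%), gastroenterology (12%), and neurology (11%). The paradigm of personalized or precision medicine has gained significant attention in recent research, which is closely tied to advances in AI techniques. This shift has had a profound impact on the pharmaceutical industry. Departing from the traditional "one-size-fits-all" model, personalized medicine considers individual factors such as environmental conditions, lifestyle, and health history to form customized treatment plans. By using sophisticated machine learning algorithms, clinicians and researchers can make better-informed decisions and thereby optimize health outcomes for each individual.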

Steering Deep Feature Learning with Backward Aligned Feature Updates

paper_url: http://arxiv.org/abs/2311.18718
repo_url: https://github.com/lchizat/2023-bafu
paper_authors: Lénaïc Chizat, Praneeth Netrapalli
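for: This paper proposes a way to predict, measure, and control feature learning behavior in deep learning.
methods: It uses the alignment between feature updates and the backward pass as the key notion for predicting, measuring, and controlling feature learning.
results: When alignment holds, the magnitude of the feature updates after one SGD step is related to the magnitudes of the forward and backward passes by a simple, general formula. This yields techniques for automatically adjusting hyperparameters (initialization scales and learning rates) to attain a desired feature learning behavior.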

Abstract
Deep learning succeeds by doing hierarchical feature learning, yet tuning Hyper-Parameters (HP) such as initialization scales, learning rates etc., only gives indirect control over this behavior. In this paper, we propose the alignment between the feature updates and the backward pass as a key notion to predict, measure and control feature learning. On the one hand, we show that when alignment holds, the magnitude of feature updates after one SGD step is related to the magnitude of the forward and backward passes by a simple and general formula. This leads to techniques to automatically adjust HPs (initialization scales and learning rates) at initialization and throughout training to attain a desired feature learning behavior. On the other hand, we show that, at random initialization, this alignment is determined by the spectrum of a certain kernel, and that well-conditioned layer-to-layer Jacobians (aka dynamical isometry) imply alignment. Finally, we investigate ReLU MLPs and ResNets in the large width-then-depth limit. Combining hints from random matrix theory and numerical experiments, we show that (i) in MLPs with iid initializations, alignment degenerates with depth, making it impossible to start training, and that (ii) in ResNets, the branch scale $1/\sqrt{\text{depth}}$ is the only one maintaining non-trivial alignment at infinite depth.

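Summary
Deep learning succeeds through hierarchical feature learning, yet tuning hyperparameters (HPs) such as initialization scales and learning rates gives only indirect control over this behavior. The paper proposes the alignment between feature updates and the backward pass as the key notion for predicting, measuring, and controlling feature learning. When alignment holds, the magnitude of the feature updates after one SGD step is tied to the magnitudes of the forward and backward passes by a simple, general formula, which makes it possible to adjust HPs automatically at initialization and throughout training. At random initialization, alignment is determined by the spectrum of a certain kernel, and well-conditioned layer-to-layer Jacobians (dynamical isometry) imply alignment. For ReLU MLPs and ResNets in the large width-then-depth limit, random matrix theory and numerical experiments indicate that (i) in MLPs with iid initialization, alignment degenerates with depth, making it impossible to start training, and (ii) in ResNets, the branch scale $1/\sqrt{\text{depth}}$ is the only one that maintains non-trivial alignment at infinite depth.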
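
The role of the branch scale can be made concrete with a toy experiment. The sketch below does not measure the paper's alignment quantity; it only illustrates, under assumed He-style Gaussian weights and a hypothetical helper `residual_forward_ratio`, how the forward-pass norm of a ReLU residual stack behaves with branch scale $1/\sqrt{\text{depth}}$ compared with branch scale 1.

```python
import math
import random


def residual_forward_ratio(depth, width, branch_scale, seed=0):
    """Return ||x_L|| / ||x_0|| for a toy ReLU residual network.

    Each block computes x <- x + branch_scale * W relu(x), with W drawn
    iid from N(0, 2 / width) (He-style scaling, an assumption here).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    norm0 = math.sqrt(sum(v * v for v in x))
    std = math.sqrt(2.0 / width)
    for _ in range(depth):
        h = [max(v, 0.0) for v in x]
        # One fresh random weight row per output coordinate.
        update = [
            sum(rng.gauss(0.0, std) * hj for hj in h)
            for _ in range(width)
        ]
        x = [xi + branch_scale * ui for xi, ui in zip(x, update)]
    return math.sqrt(sum(v * v for v in x)) / norm0


depth, width = 32, 32
scaled = residual_forward_ratio(depth, width, 1.0 / math.sqrt(depth))
unscaled = residual_forward_ratio(depth, width, 1.0)
print(f"branch scale 1/sqrt(depth): norm ratio {scaled:.2f}")
print(f"branch scale 1:             norm ratio {unscaled:.2e}")
```

With the $1/\sqrt{\text{depth}}$ scale the norm ratio stays of order one, while the unscaled stack grows exponentially with depth. The depth and width values are arbitrary illustration choices.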

DeepEn2023: Energy Datasets for Edge Artificial Intelligence

paper_url: http://arxiv.org/abs/2312.00103
repo_url: None
paper_authors: Xiaolong Tu, Anik Mallik, Haoxin Wang, Jiang Xie
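for: This paper proposes large-scale energy datasets for measuring and optimizing the energy efficiency of edge AI systems.
methods: The datasets cover a wide range of kernels, state-of-the-art deep neural network models, and popular edge AI applications.
results: The resulting datasets, named DeepEn2023, can be used to measure, analyze, and optimize the energy efficiency of edge AI systems.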

Abstract
Climate change poses one of the most significant challenges to humanity. As a result of these climatic changes, the frequency of weather, climate, and water-related disasters has multiplied fivefold over the past 50 years, resulting in over 2 million deaths and losses exceeding $3.64 trillion USD. Leveraging AI-powered technologies for sustainable development and combating climate change is a promising avenue. Numerous significant publications are dedicated to using AI to improve renewable energy forecasting, enhance waste management, and monitor environmental changes in real time. However, very few research studies focus on making AI itself environmentally sustainable. This oversight regarding the sustainability of AI within the field might be attributed to a mindset gap and the absence of comprehensive energy datasets. In addition, with the ubiquity of edge AI systems and applications, especially on-device learning, there is a pressing need to measure, analyze, and optimize their environmental sustainability, such as energy efficiency. To this end, in this paper, we propose large-scale energy datasets for edge AI, named DeepEn2023, covering a wide range of kernels, state-of-the-art deep neural network models, and popular edge AI applications. We anticipate that DeepEn2023 will improve transparency in sustainability in on-device deep learning across a range of edge AI systems and applications. For more information, including access to the dataset and code, please visit https://amai-gsu.github.io/DeepEn2023.

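Summary
Climate change is one of the greatest challenges facing humanity. Over the past 50 years, the frequency of weather, climate, and water-related disasters has increased fivefold, causing more than 2 million deaths and economic losses exceeding $3.64 trillion USD. Using AI for sustainable development and for combating climate change is a promising direction, and many publications apply AI to renewable energy forecasting, waste management, and real-time environmental monitoring. Very few studies, however, focus on making AI itself environmentally sustainable, which may stem from a mindset gap and the absence of comprehensive energy datasets. With edge AI systems and applications now ubiquitous, especially on-device learning, there is a pressing need to measure, analyze, and optimize their environmental sustainability, including energy efficiency. To this end, the paper proposes DeepEn2023, large-scale energy datasets for edge AI that cover a wide range of kernels, state-of-the-art deep neural network models, and popular edge AI applications. The authors expect DeepEn2023 to improve transparency about sustainability in on-device deep learning. The dataset and code are available at https://amai-gsu.github.io/DeepEn2023.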

Balancing Summarization and Change Detection in Graph Streams

paper_url: http://arxiv.org/abs/2311.18694
repo_url: https://github.com/s-fuku/bsc
paper_authors: Shintaro Fukushima, Kenji Yamanishi
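for: This study addresses the problem of balancing graph summarization and graph change detection.
methods: The problem is approached from the perspective of change detection: statistically significant changes are detected from a stream of summary graphs.
results: The authors propose a novel quantitative methodology that balances the trade-off between compression rate and detection accuracy, achieving reliable graph summarization and change detection together. Its effectiveness is demonstrated empirically on synthetic and real datasets.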

Abstract
This study addresses the issue of balancing graph summarization and graph change detection. Graph summarization compresses large-scale graphs into a smaller scale. However, the question remains: To what extent should the original graph be compressed? This problem is solved from the perspective of graph change detection, aiming to detect statistically significant changes using a stream of summary graphs. If the compression rate is extremely high, important changes can be ignored, whereas if the compression rate is extremely low, false alarms may increase with more memory. This implies that there is a trade-off between compression rate in graph summarization and accuracy in change detection. We propose a novel quantitative methodology to balance this trade-off to simultaneously realize reliable graph summarization and change detection. We introduce a probabilistic structure of hierarchical latent variable model into a graph, thereby designing a parameterized summary graph on the basis of the minimum description length principle. The parameter specifying the summary graph is then optimized so that the accuracy of change detection is guaranteed to suppress Type I error probability (probability of raising false alarms) to be less than a given confidence level. First, we provide a theoretical framework for connecting graph summarization with change detection. Then, we empirically demonstrate its effectiveness on synthetic and real datasets.

Summary
We introduce a probabilistic structure of hierarchical latent variable models into a graph, which allows us to design a parameterized summary graph based on the minimum description length principle. The parameter specifying the summary graph is then optimized to ensure that the accuracy of change detection is guaranteed while suppressing the probability of raising false alarms (Type I error probability) to less than a given confidence level. First, we provide a theoretical framework for connecting graph summarization with change detection. Then, we empirically demonstrate the effectiveness of our approach on synthetic and real datasets.
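
To make the notion of a summary graph concrete, here is a minimal sketch of the basic compression step: nodes are grouped into supernodes and edges are counted between groups. The grouping is a hypothetical input; the paper's own summary graph is parameterized through a hierarchical latent variable model and the minimum description length principle, which this sketch does not attempt to reproduce.

```python
from collections import Counter


def summarize_graph(edges, group_of):
    """Collapse an undirected graph into a summary graph of supernodes.

    edges: iterable of (u, v) node pairs.
    group_of: dict mapping each node to its supernode label
              (the grouping is a hypothetical input here).
    Returns a Counter mapping sorted (group_u, group_v) pairs to edge counts.
    """
    summary = Counter()
    for u, v in edges:
        key = tuple(sorted((group_of[u], group_of[v])))
        summary[key] += 1
    return summary


edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 2)]
coarse = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
summary = summarize_graph(edges, coarse)
print(dict(summary))
# {('A', 'A'): 3, ('A', 'B'): 1, ('B', 'B'): 2}
```

A coarser grouping yields a smaller summary but hides more edge-level changes, which is the trade-off the abstract describes.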

Handling Cost and Constraints with Off-Policy Deep Reinforcement Learning

paper_url: http://arxiv.org/abs/2311.18684
repo_url: None
paper_authors: Jared Markowitz, Jesse Silverberg, Gary Collins
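for: This paper revisits off-policy deep reinforcement learning in environments with "mixed-sign" rewards (independent incentive and cost terms), with an eye toward safety.
methods: It examines policy improvement steps that maximize a learned state-action ($Q$) value function over selected batches of data, typically paired with regularization against overestimation, and proposes off-policy actor-critic methods that do not explicitly maximize $Q$ in the policy update.
results: Reusing data throughout training gives off-policy methods better sample efficiency than on-policy ones. The proposed approach outperforms state-of-the-art methods in mixed-sign reward environments and is more reliable on frequently studied control problems.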

Abstract
By reusing data throughout training, off-policy deep reinforcement learning algorithms offer improved sample efficiency relative to on-policy approaches. For continuous action spaces, the most popular methods for off-policy learning include policy improvement steps where a learned state-action ($Q$) value function is maximized over selected batches of data. These updates are often paired with regularization to combat associated overestimation of $Q$ values. With an eye toward safety, we revisit this strategy in environments with "mixed-sign" reward functions; that is, with reward functions that include independent positive (incentive) and negative (cost) terms. This setting is common in real-world applications, and may be addressed with or without constraints on the cost terms. We find the combination of function approximation and a term that maximizes $Q$ in the policy update to be problematic in such environments, because systematic errors in value estimation impact the contributions from the competing terms asymmetrically. This results in overemphasis of either incentives or costs and may severely limit learning. We explore two remedies to this issue. First, consistent with prior work, we find that periodic resetting of $Q$ and policy networks can be used to reduce value estimation error and improve learning in this setting. Second, we formulate novel off-policy actor-critic methods for both unconstrained and constrained learning that do not explicitly maximize $Q$ in the policy update. We find that this second approach, when applied to continuous action spaces with mixed-sign rewards, consistently and significantly outperforms state-of-the-art methods augmented by resetting. We further find that our approach produces agents that are both competitive with popular methods overall and more reliably competent on frequently-studied control problems that do not have mixed-sign rewards.

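Summary
By reusing data throughout training, off-policy deep reinforcement learning algorithms are more sample-efficient than on-policy approaches. For continuous action spaces, the most popular off-policy methods include policy improvement steps that maximize a learned state-action ($Q$) value function over selected batches of data, usually paired with regularization to counter the associated overestimation of $Q$ values. With safety in mind, the paper revisits this strategy in environments with mixed-sign reward functions, a setting that is common in practice and may be handled with or without constraints on the cost terms. There, combining function approximation with a term that maximizes $Q$ in the policy update is problematic: systematic value estimation errors affect the competing terms asymmetrically, which overemphasizes either incentives or costs and can severely limit learning. Two remedies are explored. First, consistent with prior work, periodically resetting the $Q$ and policy networks reduces value estimation error and improves learning. Second, the authors formulate novel off-policy actor-critic methods, for both unconstrained and constrained learning, that do not explicitly maximize $Q$ in the policy update. On continuous action spaces with mixed-sign rewards, this second approach consistently and significantly outperforms state-of-the-art methods augmented by resetting, and it is more reliably competent on frequently studied control problems.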
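
The overestimation that regularization is meant to combat can be shown with a standard toy experiment, sketched below. It is not the paper's analysis of asymmetric errors in mixed-sign rewards; it only demonstrates that taking a maximum over noisy value estimates is biased upward even when every true value is zero. The helper name and noise settings are illustrative assumptions.

```python
import random
import statistics


def mean_max_estimate(num_actions, noise_std, trials, seed=0):
    """Average of max_a Q_hat(a) when every true Q(a) is exactly 0."""
    rng = random.Random(seed)
    maxima = [
        max(rng.gauss(0.0, noise_std) for _ in range(num_actions))
        for _ in range(trials)
    ]
    return statistics.fmean(maxima)


bias = mean_max_estimate(num_actions=10, noise_std=1.0, trials=5000)
print(f"true max Q = 0.0, average estimated max = {bias:.2f}")
```

Because the true maximum is zero, the printed average is pure estimation bias introduced by the maximization step.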

A Comparison Between Invariant and Equivariant Classical and Quantum Graph Neural Networks

paper_url: http://arxiv.org/abs/2311.18672
repo_url: https://github.com/royforestano/2023_gsoc_ml4sci_qmlhep_gnn
paper_authors: Roy T. Forestano, Marçal Comajoan Cara, Gopal Ramesh Dahale, Zhongtian Dong, Sergei Gleyzer, Daniel Justice, Kyoungchul Kong, Tom Magorsch, Konstantin T. Matchev, Katia Matcheva, Eyup B. Unlu
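for: This paper compares classical graph neural networks (GNNs) and equivariant graph neural networks (EGNNs) with their quantum counterparts: quantum graph neural networks (QGNNs) and equivariant quantum graph neural networks (EQGNNs).
methods: The four architectures are benchmarked on high-energy physics data, on a binary classification task that identifies the parton-level particle initiating a jet.
results: Based on AUC scores, the quantum networks outperform the classical ones, although a practical computational advantage may have to wait for further development of quantum technology and its associated APIs.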

Abstract
Machine learning algorithms are heavily relied on to understand the vast amounts of data from high-energy particle collisions at the CERN Large Hadron Collider (LHC). The data from such collision events can naturally be represented with graph structures. Therefore, deep geometric methods, such as graph neural networks (GNNs), have been leveraged for various data analysis tasks in high-energy physics. One typical task is jet tagging, where jets are viewed as point clouds with distinct features and edge connections between their constituent particles. The increasing size and complexity of the LHC particle datasets, as well as the computational models used for their analysis, greatly motivate the development of alternative fast and efficient computational paradigms such as quantum computation. In addition, to enhance the validity and robustness of deep networks, one can leverage the fundamental symmetries present in the data through the use of invariant inputs and equivariant layers. In this paper, we perform a fair and comprehensive comparison between classical graph neural networks (GNNs) and equivariant graph neural networks (EGNNs) and their quantum counterparts: quantum graph neural networks (QGNNs) and equivariant quantum graph neural networks (EQGNN). The four architectures were benchmarked on a binary classification task to classify the parton-level particle initiating the jet. Based on their AUC scores, the quantum networks were shown to outperform the classical networks. However, seeing the computational advantage of the quantum networks in practice may have to wait for the further development of quantum technology and its associated APIs.

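Summary
Machine learning algorithms are heavily relied on to understand the vast amounts of data from high-energy particle collisions at the CERN Large Hadron Collider (LHC). Such collision data can naturally be represented as graphs, so deep geometric methods such as graph neural networks (GNNs) are widely used in high-energy physics data analysis. A typical task is jet tagging, where jets are viewed as point clouds with distinct features and edge connections between their constituent particles. The growing size and complexity of LHC datasets and of the models used to analyze them motivate faster and more efficient computational paradigms such as quantum computation. In addition, the validity and robustness of deep networks can be improved by exploiting the fundamental symmetries in the data through invariant inputs and equivariant layers. The paper carries out a fair and comprehensive comparison of classical GNNs and equivariant GNNs (EGNNs) with their quantum counterparts, QGNNs and EQGNNs. The four architectures are benchmarked on a binary classification task that identifies the parton-level particle initiating the jet. Based on their AUC scores, the quantum networks outperform the classical networks, but seeing this advantage in practice may have to wait for further development of quantum technology and its associated APIs.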
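
Since the comparison rests on AUC scores, a small reference implementation may help readers reproduce the metric. This is a sketch of the standard rank-based (Mann-Whitney) formulation, not the authors' evaluation code; the labels and scores are made-up illustration data.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney U) formulation.

    Counts the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.5]
auc = auc_score(labels, scores)
print(f"AUC = {auc:.3f}")  # 7 of 9 pairs are ordered correctly
```

The quadratic pair loop is fine for small examples; for large test sets a sort-based implementation would be the usual choice.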

Targeted Reduction of Causal Models

paper_url: http://arxiv.org/abs/2311.18639
repo_url: None
paper_authors: Armin Kekić, Bernhard Schölkopf, Michel Besserve
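for: This work aims to help scientists explain a specific target phenomenon in complex models through Targeted Causal Reduction (TCR).
methods: It uses causal machine learning: an information-theoretic objective for learning TCR from interventional data or simulations, together with algorithms that optimize this objective efficiently.
results: TCR generates interpretable high-level explanations from complex models, as demonstrated on toy and mechanical systems.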

Abstract
Why does a phenomenon occur? Addressing this question is central to most scientific inquiries based on empirical observations, and often heavily relies on simulations of scientific models. As models become more intricate, deciphering the causes behind these phenomena in high-dimensional spaces of interconnected variables becomes increasingly challenging. Causal machine learning may assist scientists in the discovery of relevant and interpretable patterns of causation in simulations. We introduce Targeted Causal Reduction (TCR), a method for turning complex models into a concise set of causal factors that explain a specific target phenomenon. We derive an information theoretic objective to learn TCR from interventional data or simulations and propose algorithms to optimize this objective efficiently. TCR's ability to generate interpretable high-level explanations from complex models is demonstrated on toy and mechanical systems, illustrating its potential to assist scientists in the study of complex phenomena in a broad range of disciplines.

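Summary
Why does a phenomenon occur? This question is central to most scientific inquiry based on empirical observation, and answering it often relies heavily on simulations of scientific models. As models grow more intricate, identifying the causes of a phenomenon in high-dimensional spaces of interconnected variables becomes increasingly difficult. Causal machine learning may help scientists discover relevant and interpretable patterns of causation in simulations. The paper introduces Targeted Causal Reduction (TCR), a method that turns a complex model into a concise set of causal factors explaining a specific target phenomenon. It derives an information-theoretic objective for learning TCR from interventional data or simulations and proposes algorithms that optimize this objective efficiently. TCR's ability to produce interpretable high-level explanations is demonstrated on toy and mechanical systems, which suggests it can assist the study of complex phenomena across a broad range of disciplines.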

Online Influence Maximization: Concept and Algorithm

paper_url: http://arxiv.org/abs/2312.00099
repo_url: https://github.com/sayantann11/all-classification-templetes-for-ML
paper_authors: Jianxiong Guo
for: The paper provides an overview of the Online Influence Maximization (IM) problem, covering both theoretical aspects and practical applications.
methods: The paper discusses Offline IM algorithms, including traditional approximation or heuristic algorithms and ML-based algorithms, and introduces a standard definition of the Online IM problem and a basic Combinatorial Multi-Armed Bandit (CMAB) framework, CMAB-T.
results: The paper covers almost all Online IM algorithms to date, focusing on their characteristics and theoretical guarantees for different feedback types, and explains their working principles and how regret bounds are obtained. It also collects innovative ideas about problem definition and algorithm design, and outlines prospective research directions from four distinct perspectives.

Abstract
In this survey, we offer an extensive overview of the Online Influence Maximization (IM) problem by covering both theoretical aspects and practical applications. For the integrity of the article and because the online algorithm takes an offline oracle as a subroutine, we first make a clear definition of the Offline IM problem and summarize those commonly used Offline IM algorithms, which include traditional approximation or heuristic algorithms and ML-based algorithms. Then, we give a standard definition of the Online IM problem and a basic Combinatorial Multi-Armed Bandit (CMAB) framework, CMAB-T. Here, we summarize three types of feedback in the CMAB model and discuss in detail how to study the Online IM problem based on the CMAB-T model. This paves the way for solving the Online IM problem by using online learning methods. Furthermore, we have covered almost all Online IM algorithms up to now, focusing on characteristics and theoretical guarantees of online algorithms for different feedback types. Here, we elaborately explain their working principle and how to obtain regret bounds. Besides, we also collect plenty of innovative ideas about problem definition and algorithm designs and pioneering works for variants of the Online IM problem and their corresponding algorithms. Finally, we encapsulate current challenges and outline prospective research directions from four distinct perspectives.

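Summary
This survey gives an extensive overview of the Online Influence Maximization (IM) problem, covering both theory and practical applications. Because online algorithms use an offline oracle as a subroutine, it first defines the Offline IM problem and summarizes commonly used Offline IM algorithms, including traditional approximation or heuristic algorithms and ML-based algorithms. It then gives a standard definition of the Online IM problem and a basic Combinatorial Multi-Armed Bandit (CMAB) framework, CMAB-T, summarizes three types of feedback in the CMAB model, and discusses in detail how to study Online IM on top of CMAB-T, which paves the way for solving it with online learning methods. The survey covers almost all Online IM algorithms to date, focusing on their characteristics and theoretical guarantees for different feedback types, and explains their working principles and how regret bounds are obtained. It also collects innovative ideas on problem definition and algorithm design, along with pioneering work on variants of the Online IM problem. Finally, it summarizes current challenges and outlines future research directions from four distinct perspectives.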
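
As a concrete picture of what an offline oracle has to evaluate, the sketch below estimates the influence spread of a seed set by Monte Carlo simulation. It assumes the independent cascade diffusion model, which is common in the IM literature but is not named in the abstract; the graph, probabilities, and function names are hypothetical.

```python
import random


def simulate_cascade(graph, seeds, rng):
    """One run of the independent cascade model.

    graph: dict mapping node -> list of (neighbor, activation_probability).
    seeds: initially active nodes.
    Returns the set of nodes active when the cascade stops.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor, prob in graph.get(node, []):
                # Each newly active node gets one chance per outgoing edge.
                if neighbor not in active and rng.random() < prob:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return active


def estimate_spread(graph, seeds, runs=1000, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    total = sum(len(simulate_cascade(graph, seeds, rng)) for _ in range(runs))
    return total / runs


graph = {
    "a": [("b", 1.0), ("c", 0.5)],
    "b": [("d", 1.0)],
    "c": [("d", 0.0)],
    "d": [],
}
spread = estimate_spread(graph, seeds=["a"])
print(f"estimated spread from seed 'a': {spread:.2f}")
```

In this toy graph the expected spread is 3.5: nodes a, b, and d are always activated, and c is activated half the time. In the online setting the edge probabilities are unknown and must be learned from feedback.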

Optimizing ZX-Diagrams with Deep Reinforcement Learning

paper_url: http://arxiv.org/abs/2311.18588
repo_url: https://github.com/maxnaeg/zxreinforce
paper_authors: Maximilian Nägele, Florian Marquardt
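for: This work combines ZX-diagrams with reinforcement learning to optimize the structure of ZX-diagrams.
methods: A reinforcement learning agent is trained to find good sequences of local transformation rules; its policy is encoded with graph neural networks.
results: The trained agent significantly outperforms other optimization techniques such as a greedy strategy or simulated annealing, and it generalizes to diagrams much larger than those seen during training.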

Abstract
ZX-diagrams are a powerful graphical language for the description of quantum processes with applications in fundamental quantum mechanics, quantum circuit optimization, tensor network simulation, and many more. The utility of ZX-diagrams relies on a set of local transformation rules that can be applied to them without changing the underlying quantum process they describe. These rules can be exploited to optimize the structure of ZX-diagrams for a range of applications. However, finding an optimal sequence of transformation rules is generally an open problem. In this work, we bring together ZX-diagrams with reinforcement learning, a machine learning technique designed to discover an optimal sequence of actions in a decision-making problem and show that a trained reinforcement learning agent can significantly outperform other optimization techniques like a greedy strategy or simulated annealing. The use of graph neural networks to encode the policy of the agent enables generalization to diagrams much bigger than seen during the training phase.

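Summary
ZX-diagrams are a powerful graphical language for describing quantum processes, with applications in fundamental quantum mechanics, quantum circuit optimization, tensor network simulation, and more. Their utility rests on a set of local transformation rules that can be applied without changing the underlying quantum process. These rules can be exploited to optimize the structure of ZX-diagrams for a range of applications, but finding an optimal sequence of transformations is generally an open problem. This work brings ZX-diagrams together with reinforcement learning, a machine learning technique designed to discover optimal sequences of actions in decision-making problems, and shows that a trained agent can significantly outperform other optimization techniques such as a greedy strategy or simulated annealing. Encoding the agent's policy with graph neural networks enables generalization to diagrams much larger than those seen during training.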
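
One example of a local transformation rule is spider fusion: two connected spiders of the same color merge into one, and their phases add modulo $2\pi$. The sketch below applies that rule to a tiny diagram. The data layout (a dict of spiders, a list of wires) is an assumption made for illustration and is not the representation used in the authors' repository; only plain wires are handled, not Hadamard edges.

```python
from fractions import Fraction


def fuse_spiders(spiders, wires, keep, drop):
    """Apply the spider-fusion rule to two same-colored, connected spiders.

    spiders: dict node -> (color, phase), phase as a Fraction in units of pi.
    wires: list of (u, v) pairs; parallel wires are kept as repeated entries.
    Returns new (spiders, wires) with `drop` merged into `keep`.
    """
    color_keep, phase_keep = spiders[keep]
    color_drop, phase_drop = spiders[drop]
    if color_keep != color_drop:
        raise ValueError("fusion needs spiders of the same color")
    if not any(set(w) == {keep, drop} for w in wires):
        raise ValueError("fusion needs a wire between the two spiders")

    new_spiders = {n: s for n, s in spiders.items() if n != drop}
    # Phases add modulo 2*pi, i.e. modulo 2 in units of pi.
    new_spiders[keep] = (color_keep, (phase_keep + phase_drop) % 2)

    new_wires = []
    for u, v in wires:
        if {u, v} == {keep, drop}:
            continue  # wires between the fused pair disappear
        new_wires.append((keep if u == drop else u, keep if v == drop else v))
    return new_spiders, new_wires


spiders = {1: ("Z", Fraction(1, 2)), 2: ("Z", Fraction(3, 2)), 3: ("X", Fraction(1))}
wires = [(1, 2), (2, 3)]
fused, rewired = fuse_spiders(spiders, wires, keep=1, drop=2)
print(fused, rewired)
```

Here the two Z spiders with phases $\pi/2$ and $3\pi/2$ fuse into a single phase-free Z spider that inherits the wire to the X spider. An optimizer, whether greedy or learned, chooses which such rule to apply where.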

Class Distribution Shifts in Zero-Shot Learning: Learning Robust Representations

paper_url: http://arxiv.org/abs/2311.18575
repo_url: None
paper_authors: Yuli Slavutsky, Yuval Benjamini
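for: This paper studies distribution shifts between training and deployment data, in particular how class distribution shifts affect zero-shot classifiers.
methods: The authors propose an algorithm for learning representations that are robust to class distribution shifts, combining hierarchical data sampling with out-of-distribution generalization techniques.
results: In simulations and on real-world datasets, the approach improves the generalization of zero-shot classifiers to diverse class distributions.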

Abstract
Distribution shifts between training and deployment data often affect the performance of machine learning models. In this paper, we explore a setting where a hidden variable induces a shift in the distribution of classes. These distribution shifts are particularly challenging for zero-shot classifiers, as they rely on representations learned from training classes, but are deployed on new, unseen ones. We introduce an algorithm to learn data representations that are robust to such class distribution shifts in zero-shot verification tasks. We show that our approach, which combines hierarchical data sampling with out-of-distribution generalization techniques, improves generalization to diverse class distributions in both simulations and real-world datasets.

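Summary
Distribution shifts between training and deployment data often affect the performance of machine learning models. This paper studies a setting in which a hidden variable induces a shift in the distribution of classes. Such shifts are especially challenging for zero-shot classifiers, which rely on representations learned from training classes but are deployed on new, unseen ones. The authors introduce an algorithm that learns data representations robust to these class distribution shifts in zero-shot verification tasks. The approach combines hierarchical data sampling with out-of-distribution generalization techniques and improves generalization to diverse class distributions in both simulations and real-world datasets.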

paper_url: http://arxiv.org/abs/2311.18574
repo_url: None
paper_authors: Jiaxian Yan, Zaixi Zhang, Kai Zhang, Qi Liu
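for: Predicting the binding conformations of small molecules to protein targets, a computational task fundamental to the design of novel drugs.
methods: The DeltaDock framework, which pairs a ligand-dependent binding site prediction model with GPU-accelerated sampling algorithms, followed by a multi-scale iterative refinement module.
results: DeltaDock outperforms baseline methods in both blind docking and site-specific docking settings and shows strong generalization and reliability across scenarios.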

Abstract
Molecular docking is a key computational tool utilized to predict the binding conformations of small molecules to protein targets, which is fundamental in the design of novel drugs. Despite recent advancements in geometric deep learning-based approaches leading to improvements in blind docking efficiency, these methods have encountered notable challenges, such as limited generalization performance on unseen proteins, the inability to concurrently address the settings of blind docking and site-specific docking, and the frequent occurrence of physical implausibilities such as inter-molecular steric clash. In this study, we introduce DeltaDock, a robust and versatile framework designed for efficient molecular docking to overcome these challenges. DeltaDock operates in a two-step process: rapid initial complex structures sampling followed by multi-scale iterative refinement of the initial structures. In the initial stage, to sample accurate structures with high efficiency, we develop a ligand-dependent binding site prediction model founded on large protein models and graph neural networks. This model is then paired with GPU-accelerated sampling algorithms. The sampled structures are updated using a multi-scale iterative refinement module that captures both protein-ligand atom-atom interactions and residue-atom interactions in the following stage. Distinct from previous geometric deep learning methods that are conditioned on the blind docking setting, DeltaDock demonstrates superior performance in both blind docking and site-specific docking settings. Comprehensive experimental results reveal that DeltaDock consistently surpasses baseline methods in terms of docking accuracy. Furthermore, it displays remarkable generalization capabilities and proficiency for predicting physically valid structures, thereby attesting to its robustness and reliability in various scenarios.

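Summary
Molecular docking is a key computational tool for predicting the binding conformations of small molecules to protein targets, which is fundamental to the design of novel drugs. Recent geometric deep learning approaches have improved blind docking efficiency, but they face notable challenges: limited generalization to unseen proteins, an inability to handle blind docking and site-specific docking in one method, and frequent physical implausibilities such as inter-molecular steric clashes. The study introduces DeltaDock, a robust and versatile framework for efficient molecular docking. DeltaDock works in two steps. It first samples initial complex structures rapidly, using a ligand-dependent binding site prediction model built on large protein models and graph neural networks, paired with GPU-accelerated sampling algorithms. It then updates the sampled structures with a multi-scale iterative refinement module that captures both protein-ligand atom-atom and residue-atom interactions. Unlike earlier geometric deep learning methods conditioned on the blind docking setting, DeltaDock performs well in both blind and site-specific docking. Experiments show that it consistently surpasses baseline methods in docking accuracy and generalizes well, including in predicting physically valid structures.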
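
The physical implausibility mentioned above, inter-molecular steric clash, can be checked with a simple distance test. The sketch below is an illustrative assumption rather than DeltaDock's validity check: it flags protein-ligand atom pairs that sit closer than a fixed cutoff, whereas practical checks usually depend on the elements involved.

```python
import math


def find_steric_clashes(protein_atoms, ligand_atoms, min_distance=2.0):
    """Return (protein_index, ligand_index, distance) for every atom pair
    closer than min_distance (in angstroms).

    The 2.0 A cutoff is an illustrative assumption; real checks usually
    depend on the elements involved.
    """
    clashes = []
    for i, p in enumerate(protein_atoms):
        for j, q in enumerate(ligand_atoms):
            d = math.dist(p, q)
            if d < min_distance:
                clashes.append((i, j, d))
    return clashes


protein = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
ligand = [(1.0, 0.0, 0.0), (5.0, 4.0, 0.0)]
clashes = find_steric_clashes(protein, ligand)
print(clashes)  # [(0, 0, 1.0)]
```

Only the first protein atom and the first ligand atom are within the cutoff, so a pose like this would be counted as physically implausible under the assumed rule.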

Learning Radio Environments by Differentiable Ray Tracing

paper_url: http://arxiv.org/abs/2311.18558
repo_url: None
paper_authors: Jakob Hoydis, Fayçal Aït Aoudia, Sebastian Cammerer, Florian Euchner, Merlin Nimier-David, Stephan ten Brink, Alexander Keller
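for: This paper advances ray tracing for 6G research, where it is used to generate spatially consistent, environment-specific channel impulse responses (CIRs).
methods: It proposes a gradient-based calibration method in which material characteristics are determined from channel measurements. Differentiable parametrizations of material properties, scattering, and antenna patterns are combined with a differentiable ray tracer to compute derivatives of the CIRs with respect to these parameters.
results: The method is validated on both synthetic data and real-world indoor channel measurements.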

Abstract
Ray tracing (RT) is instrumental in 6G research in order to generate spatially-consistent and environment-specific channel impulse responses (CIRs). While acquiring accurate scene geometries is now relatively straightforward, determining material characteristics requires precise calibration using channel measurements. We therefore introduce a novel gradient-based calibration method, complemented by differentiable parametrizations of material properties, scattering and antenna patterns. Our method seamlessly integrates with differentiable ray tracers that enable the computation of derivatives of CIRs with respect to these parameters. Essentially, we approach field computation as a large computational graph wherein parameters are trainable akin to weights of a neural network (NN). We have validated our method using both synthetic data and real-world indoor channel measurements, employing a distributed multiple-input multiple-output (MIMO) channel sounder.

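Summary
Ray tracing (RT) is instrumental in 6G research for generating spatially consistent, environment-specific channel impulse responses (CIRs). Acquiring accurate scene geometry is now relatively straightforward, but determining material characteristics requires precise calibration against channel measurements. The paper therefore introduces a novel gradient-based calibration method, complemented by differentiable parametrizations of material properties, scattering, and antenna patterns. The method integrates with differentiable ray tracers, which compute derivatives of CIRs with respect to these parameters. In essence, field computation is treated as a large computational graph whose parameters are trainable like the weights of a neural network (NN). The method is validated on both synthetic data and real-world indoor channel measurements collected with a distributed multiple-input multiple-output (MIMO) channel sounder.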
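
The idea of treating a material parameter as a trainable weight can be sketched with a one-parameter toy problem. The forward model below is a deliberate simplification assumed for illustration (a single reflection coefficient multiplying known geometry gains), not a ray tracer, and the gradient is written by hand where a differentiable ray tracer would supply it automatically.

```python
def calibrate_reflection(geometry_gains, measured, lr=0.1, steps=200):
    """Fit a single reflection coefficient r by gradient descent.

    Toy forward model (an assumption, not a ray tracer):
        predicted_i = r * geometry_gains[i]
    Loss: mean squared error against the measurements.
    """
    r = 0.0
    n = len(measured)
    for _ in range(steps):
        # d/dr mean((r*g - m)^2) = mean(2 * (r*g - m) * g)
        grad = sum(
            2.0 * (r * g - m) * g for g, m in zip(geometry_gains, measured)
        ) / n
        r -= lr * grad
    return r


true_r = 0.6
geometry = [1.0, 0.8, 0.5, 0.3]
measurements = [true_r * g for g in geometry]
estimate = calibrate_reflection(geometry, measurements)
print(f"estimated reflection coefficient: {estimate:.4f}")
```

With noise-free synthetic measurements the estimate converges to the value used to generate them. Real calibration involves many coupled parameters and measurement noise, which is where automatic differentiation becomes important.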

Can semi-supervised learning use all the data effectively? A lower bound perspective

paper_url: http://arxiv.org/abs/2311.18557
repo_url: None
paper_authors: Alexandru Ţifrea, Gizem Yüce, Amartya Sanyal, Fanny Yang
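for: This paper asks whether semi-supervised learning (SSL) algorithms can simultaneously improve upon both unsupervised learning (UL) and supervised learning (SL) algorithms.
methods: It derives a tight lower bound for 2-Gaussian mixture models, which implies that no SSL algorithm can improve on the minimax-optimal statistical error rates of SL or UL for these distributions.
results: Experiments on real-world data nevertheless show SSL outperforming UL and SL methods, suggesting that proving performance gains for SSL is possible but requires careful tracking of constants.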

Abstract
Prior works have shown that semi-supervised learning algorithms can leverage unlabeled data to improve over the labeled sample complexity of supervised learning (SL) algorithms. However, existing theoretical analyses focus on regimes where the unlabeled data is sufficient to learn a good decision boundary using unsupervised learning (UL) alone. This begs the question: Can SSL algorithms simultaneously improve upon both UL and SL? To this end, we derive a tight lower bound for 2-Gaussian mixture models that explicitly depends on the labeled and the unlabeled dataset size as well as the signal-to-noise ratio of the mixture distribution. Surprisingly, our result implies that no SSL algorithm can improve upon the minimax-optimal statistical error rates of SL or UL algorithms for these distributions. Nevertheless, we show empirically on real-world data that SSL algorithms can still outperform UL and SL methods. Therefore, our work suggests that, while proving performance gains for SSL algorithms is possible, it requires careful tracking of constants.

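Summary
Prior work has shown that semi-supervised learning (SSL) algorithms can use unlabeled data to improve on the labeled sample complexity of supervised learning (SL) algorithms. Existing theoretical analyses, however, focus on regimes where the unlabeled data alone suffices to learn a good decision boundary with unsupervised learning (UL). This raises the question of whether SSL can improve on both UL and SL at once. The paper derives a tight lower bound for 2-Gaussian mixture models that depends explicitly on the labeled and unlabeled dataset sizes and on the signal-to-noise ratio of the mixture distribution. Surprisingly, the result implies that no SSL algorithm can improve on the minimax-optimal statistical error rates of SL or UL algorithms for these distributions. Even so, experiments on real-world data show that SSL algorithms can still outperform UL and SL methods. The work therefore suggests that proving performance gains for SSL is possible but requires careful tracking of constants.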

Textual-Knowledge-Guided Numerical Feature Discovery Method for Power Demand Forecasting

paper_url: http://arxiv.org/abs/2312.00095
repo_url: None
paper_authors: Zifan Ning, Min Jin
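for: Short-term power demand forecasting for new power systems and integrated energy systems.
methods: A textual-knowledge-guided numerical feature discovery (TKNFD) method: it accumulates textual knowledge, expands it into candidate feature types, collects numerical data for those features, and builds four-dimensional multivariate source-tracking databases (4DM-STDs).
results: Experiments in two different regions show that forecasting with TKNFD-discovered features outperforms state-of-the-art feature schemes by 16.84% to 36.36% MAPE. The method also reveals many previously unknown features, notably several dominant ones in the energy and astronomical dimensions.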

Abstract
Power demand forecasting is a crucial and challenging task for new power systems and integrated energy systems. However, because public feature databases and a theoretical mechanism of power demand changes are unavailable, the known features of power demand fluctuation are very limited. Recently, multimodal learning approaches have shown great vitality in machine learning and AIGC. In this paper, we combine data from two modalities and propose a textual-knowledge-guided numerical feature discovery (TKNFD) method for short-term power demand forecasting. TKNFD extensively accumulates qualitative textual knowledge, expands it into a candidate feature-type set, collects numerical data of these features, and eventually builds four-dimensional multivariate source-tracking databases (4DM-STDs). Next, TKNFD presents a two-level quantitative feature identification strategy independent of forecasting models, finds 43-48 features, and systematically analyses feature contribution and dependency correlation. Benchmark experiments in two different regions around the world demonstrate that the forecasting accuracy of TKNFD-discovered features reliably outperforms that of SoTA feature schemes by 16.84% to 36.36% MAPE. In particular, TKNFD reveals many unknown features, especially several dominant features in the unknown energy and astronomical dimensions, which extend the knowledge on the origin of strong randomness and non-linearity in power demand fluctuation. Besides, 4DM-STDs can serve as public baseline databases.

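Summary
Power demand forecasting is a crucial and challenging task for new power systems and integrated energy systems. Because public feature databases and a theoretical mechanism of power demand changes are unavailable, the known features of power demand fluctuation are limited. Multimodal learning has recently shown great vitality in machine learning and AIGC. The paper combines two data modalities and proposes a textual-knowledge-guided numerical feature discovery (TKNFD) method for short-term power demand forecasting. TKNFD accumulates qualitative textual knowledge, expands it into a candidate feature-type set, collects numerical data for these features, and builds four-dimensional multivariate source-tracking databases (4DM-STDs). It then applies a two-level quantitative feature identification strategy that is independent of the forecasting model, finds 43-48 features, and systematically analyzes feature contribution and dependency correlation. Benchmark experiments in two regions show that TKNFD-discovered features outperform SoTA feature schemes by 16.84% to 36.36% MAPE. TKNFD also reveals many unknown features, especially several dominant ones in the energy and astronomical dimensions, which extends knowledge about the origin of strong randomness and non-linearity in power demand fluctuation. The 4DM-STDs can additionally serve as public baseline databases.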
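
The reported gains are expressed in MAPE, so a short reference computation may be useful. The sketch below implements the standard mean absolute percentage error and a relative-improvement figure; the demand values are made-up illustration data, and the abstract does not state whether its 16.84% to 36.36% figures are relative or absolute differences.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Hypothetical demand values, for illustration only.
actual = [100.0, 200.0, 400.0]
baseline = [110.0, 180.0, 440.0]
improved = [105.0, 190.0, 420.0]

baseline_mape = mape(actual, baseline)
improved_mape = mape(actual, improved)
relative_gain = 100.0 * (baseline_mape - improved_mape) / baseline_mape
print(f"baseline {baseline_mape:.1f}%, improved {improved_mape:.1f}%, "
      f"relative gain {relative_gain:.1f}%")
```

In this toy case every baseline forecast is off by 10% and every improved forecast by 5%, so MAPE halves.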

HOT: Higher-Order Dynamic Graph Representation Learning with Efficient Transformers

paper_url: http://arxiv.org/abs/2311.18526
repo_url: None
paper_authors: Maciej Besta, Afonso Claudino Catarino, Lukas Gianinazzi, Nils Blach, Piotr Nyczyk, Hubert Niewiadomski, Torsten Hoefler
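for: This work addresses dynamic link prediction in dynamic graph representation learning (GRL): using a history of graph updates to predict whether a given pair of vertices will become connected.
methods: It builds on Transformers that model individual graph updates as single tokens, and encodes higher-order (HO) structures, such as k-hop neighbors and more general subgraphs, into the attention matrix.
results: On the MOOC dataset, HOT achieves 9%, 7%, and 15% higher accuracy than DyGFormer, TGN, and GraphMixer, respectively, and the design extends to other dynamic GRL workloads.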

Abstract
Many graph representation learning (GRL) problems are dynamic, with millions of edges added or removed per second. A fundamental workload in this setting is dynamic link prediction: using a history of graph updates to predict whether a given pair of vertices will become connected. Recent schemes for link prediction in such dynamic settings employ Transformers, modeling individual graph updates as single tokens. In this work, we propose HOT: a model that enhances this line of work by harnessing higher-order (HO) graph structures; specifically, k-hop neighbors and more general subgraphs containing a given pair of vertices. Harnessing such HO structures by encoding them into the attention matrix of the underlying Transformer results in higher accuracy of link prediction outcomes, but at the expense of increased memory pressure. To alleviate this, we resort to a recent class of schemes that impose hierarchy on the attention matrix, significantly reducing memory footprint. The final design offers a sweet spot between high accuracy and low memory utilization. HOT outperforms other dynamic GRL schemes, for example achieving 9%, 7%, and 15% higher accuracy than DyGFormer, TGN, and GraphMixer, respectively, for the MOOC dataset. Our design can be seamlessly extended towards other dynamic GRL workloads.

Summary
Many graph representation learning (GRL) problems are dynamic, with millions of edges added or removed per second. A fundamental workload in this setting is dynamic link prediction: using the history of graph updates to predict whether a given pair of vertices will become connected. Recent schemes use Transformers and model individual graph updates as single tokens. HOT extends this line of work with higher-order (HO) graph structures, specifically k-hop neighbors and more general subgraphs containing the given pair of vertices. Encoding these structures into the Transformer's attention matrix improves link prediction accuracy but increases memory pressure, so HOT adopts recent schemes that impose hierarchy on the attention matrix and significantly reduce the memory footprint. The final design balances high accuracy against low memory use. On the MOOC dataset, HOT achieves 9%, 7%, and 15% higher accuracy than DyGFormer, TGN, and GraphMixer, respectively, and the design can be extended to other dynamic GRL workloads.

Detecting Anomalous Network Communication Patterns Using Graph Convolutional Networks

paper_url: http://arxiv.org/abs/2311.18525
repo_url: None
paper_authors: Yizhak Vaisman, Gilad Katz, Yuval Elovici, Asaf Shabtai
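for: This study presents an anomaly detection method based on a graph convolutional network (GCN) and a variational autoencoder (VAE), intended to protect an organization's endpoints from sophisticated attacks.
methods: The GCN-based VAE model takes two matrices as input: (i) the normalized adjacency matrix, which represents the connections among machines, and (ii) the feature matrix, which profiles each machine (demographic, statistical, process-related, and Node2vec structural features). After training on the collected data, the model is applied to the same data and an anomaly score is computed for each machine.
results: The method was evaluated on real, large-scale communication data from ATMs and AD servers in two setups, unsupervised and supervised. The results show that GCNetOmaly effectively detects anomalous machine behavior on unsupervised data.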

Abstract
To protect an organization's endpoints from sophisticated cyberattacks, advanced detection methods are required. In this research, we present GCNetOmaly: a graph convolutional network (GCN)-based variational autoencoder (VAE) anomaly detector trained on data that include connection events among internal and external machines. As input, the proposed GCN-based VAE model receives two matrices: (i) the normalized adjacency matrix, which represents the connections among the machines, and (ii) the feature matrix, which includes various features (demographic, statistical, process-related, and Node2vec structural features) that are used to profile the individual nodes/machines. After training the model on data collected for a predefined time window, the model is applied to the same data; the reconstruction score obtained by the model for a given machine then serves as the machine's anomaly score. GCNetOmaly was evaluated on real, large-scale data logged by Carbon Black EDR from a large financial organization's automated teller machines (ATMs) as well as communication with Active Directory (AD) servers in two setups: unsupervised and supervised. The results of our evaluation demonstrate GCNetOmaly's effectiveness in detecting anomalous behavior of machines on unsupervised data.
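
To make the first input concrete, the sketch below builds a normalized adjacency matrix from a small connection graph. The abstract only states that the adjacency matrix is normalized; the symmetric form D^{-1/2}(A + I)D^{-1/2} with self-loops is the convention commonly used with GCNs and is an assumption here, as is the helper name `normalized_adjacency`.

```python
import math


def normalized_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of a square 0/1 matrix.

    The self-loop convention follows common GCN practice and is an assumption;
    the abstract only says that a normalized adjacency matrix is used.
    """
    n = len(adj)
    # Add self-loops so every node keeps its own features during propagation.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [
        [inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]


# Three machines: 0 talks to 1, and 1 talks to 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
A_norm = normalized_adjacency(A)
```

With self-loops the degrees are 2, 3, and 2, so the entry linking machines 0 and 1 becomes 1/sqrt(6) and unconnected pairs stay at zero.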

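Summary
Protecting an organization's endpoints from sophisticated cyberattacks requires advanced detection methods. GCNetOmaly is a graph convolutional network (GCN)-based variational autoencoder (VAE) anomaly detector trained on connection events among internal and external machines. Its input consists of two matrices: (i) the normalized adjacency matrix, which represents the connections among machines, and (ii) the feature matrix, which profiles individual machines using demographic, statistical, process-related, and Node2vec structural features. After training, the model is applied to the same data, and each machine's reconstruction score serves as its anomaly score. GCNetOmaly was evaluated in unsupervised and supervised setups on large-scale real data logged by Carbon Black EDR from a large financial organization's ATMs and their communication with Active Directory (AD) servers. The evaluation shows that it is effective at detecting anomalous machine behavior on unsupervised data.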

Combining deep generative models with extreme value theory for synthetic hazard simulation: a multivariate and spatially coherent approach

paper_url: http://arxiv.org/abs/2311.18521
repo_url: None
paper_authors: Alison Peard, Jim Hall
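for: This paper aims to understand the distribution of climate risk and to inform adaptation policies.
methods: The paper uses a generative adversarial network (GAN) to model the dependence structure between climate variables and combines it with traditional extreme value theory to control extrapolation of the tails.
results: Once trained, the model can efficiently generate thousands of realistic compound hazard events, which can be used for climate risk assessment and disaster preparedness. The method is transferable to other multivariate and spatial climate datasets.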

Abstract
Climate hazards can cause major disasters when they occur simultaneously as compound hazards. To understand the distribution of climate risk and inform adaptation policies, scientists need to simulate a large number of physically realistic and spatially coherent events. Current methods are limited by computational constraints and the probabilistic spatial distribution of compound events is not given sufficient attention. The bottleneck in current approaches lies in modelling the dependence structure between variables, as inference on parametric models suffers from the curse of dimensionality. Generative adversarial networks (GANs) are well-suited to such a problem due to their ability to implicitly learn the distribution of data in high-dimensional settings. We employ a GAN to model the dependence structure for daily maximum wind speed, significant wave height, and total precipitation over the Bay of Bengal, combining this with traditional extreme value theory for controlled extrapolation of the tails. Once trained, the model can be used to efficiently generate thousands of realistic compound hazard events, which can inform climate risk assessments for climate adaptation and disaster preparedness. The method developed is flexible and transferable to other multivariate and spatial climate datasets.

Summary
Climate hazards can cause major disasters when they occur simultaneously as compound hazards. To understand the distribution of climate risk and inform adaptation policies, scientists need to simulate a large number of physically realistic and spatially coherent events. Current methods are limited by computational constraints and pay too little attention to the probabilistic spatial distribution of compound events; the bottleneck lies in modelling the dependence structure between variables, because inference on parametric models suffers from the curse of dimensionality. The paper uses a GAN to model the dependence structure for daily maximum wind speed, significant wave height, and total precipitation over the Bay of Bengal, and combines it with traditional extreme value theory for controlled extrapolation of the tails. Once trained, the model efficiently generates thousands of realistic compound hazard events, and the method is transferable to other multivariate and spatial climate datasets.

Global Convergence of Online Identification for Mixed Linear Regression

paper_url: http://arxiv.org/abs/2311.18506
repo_url: None
paper_authors: Yujing Liu, Zhixin Liu, Lei Guo
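for: This paper addresses online identification and data clustering for mixed linear regression (MLR) models.
methods: The paper proposes two online identification algorithms based on the expectation-maximization (EM) principle, one for each of two basic classes of MLRs.
results: Both algorithms are shown to converge globally without assuming that the data are independent and identically distributed. The analysis also shows that the within-cluster error and the probability of assigning new data to the correct cluster are asymptotically the same as when the parameters are known.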

Abstract
Mixed linear regression (MLR) is a powerful model for characterizing nonlinear relationships by utilizing a mixture of linear regression sub-models. The identification of MLR is a fundamental problem, where most of the existing results focus on offline algorithms, rely on independent and identically distributed (i.i.d.) data assumptions, and provide local convergence results only. This paper investigates the online identification and data clustering problems for two basic classes of MLRs, by introducing two corresponding new online identification algorithms based on the expectation-maximization (EM) principle. It is shown that both algorithms will converge globally without resorting to the traditional i.i.d. data assumptions. The main challenge in our investigation lies in the fact that the gradient of the maximum likelihood function does not have a unique zero, and a key step in our analysis is to establish the stability of the corresponding differential equation in order to apply the celebrated Ljung's ODE method. It is also shown that the within-cluster error and the probability that the new data is categorized into the correct cluster are asymptotically the same as those in the case of known parameters. Finally, numerical simulations are provided to verify the effectiveness of our online algorithms.

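Summary
Mixed linear regression (MLR) characterizes nonlinear relationships with a mixture of linear regression sub-models. Identifying an MLR is a fundamental problem, and most existing results focus on offline algorithms, rely on i.i.d. data assumptions, and give only local convergence results. This paper studies online identification and data clustering for two basic classes of MLRs and introduces two new online identification algorithms based on the expectation-maximization (EM) principle. Both algorithms are shown to converge globally without the traditional i.i.d. data assumptions. The main difficulty is that the gradient of the maximum likelihood function does not have a unique zero; a key step of the analysis is to establish the stability of the corresponding differential equation so that Ljung's ODE method can be applied. The within-cluster error and the probability of assigning new data to the correct cluster are asymptotically the same as with known parameters, and numerical simulations verify the effectiveness of the online algorithms.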

Data-Agnostic Model Poisoning against Federated Learning: A Graph Autoencoder Approach

paper_url: http://arxiv.org/abs/2311.18498
repo_url: None
paper_authors: Kai Li, Jingjing Zheng, Xin Yuan, Wei Ni, Ozgur B. Akan, H. Vincent Poor
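for: This paper proposes a data-agnostic model poisoning attack on Federated Learning (FL) by designing a new adversarial graph autoencoder (GAE)-based framework.
methods: The attack requires no knowledge of the FL training data and achieves both effectiveness and undetectability. By listening to the benign local models and the global model, the attacker extracts the graph structural correlations among the local models and the training data features, then adversarially modifies these graph structures to generate malicious local models.
results: Experiments show that FL accuracy drops gradually under the attack and that existing defense mechanisms fail to detect it. The attack can spread an infection across all benign devices, making it a serious threat to FL.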

Abstract
This paper proposes a novel, data-agnostic, model poisoning attack on Federated Learning (FL), by designing a new adversarial graph autoencoder (GAE)-based framework. The attack requires no knowledge of FL training data and achieves both effectiveness and undetectability. By listening to the benign local models and the global model, the attacker extracts the graph structural correlations among the benign local models and the training data features substantiating the models. The attacker then adversarially regenerates the graph structural correlations while maximizing the FL training loss, and subsequently generates malicious local models using the adversarial graph structure and the training data features of the benign ones. A new algorithm is designed to iteratively train the malicious local models using GAE and sub-gradient descent. The convergence of FL under attack is rigorously proved, with a considerably large optimality gap. Experiments show that the FL accuracy drops gradually under the proposed attack and existing defense mechanisms fail to detect it. The attack can give rise to an infection across all benign devices, making it a serious threat to FL.

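Summary
This paper proposes a novel, data-agnostic model poisoning attack on Federated Learning (FL), built on a new adversarial graph autoencoder (GAE)-based framework. The attack requires no knowledge of the FL training data and achieves both effectiveness and undetectability. By listening to the benign local models and the global model, the attacker extracts the graph structural correlations among the benign local models and the training data features. The attacker then adversarially regenerates these correlations while maximizing the FL training loss, and uses the adversarial graph structure together with the training data features of the benign models to generate malicious local models. A new algorithm trains the malicious local models iteratively using GAE and sub-gradient descent. The convergence of FL under attack is rigorously proved, with a considerably large optimality gap. Experiments show that FL accuracy drops gradually under the proposed attack and that existing defense mechanisms fail to detect it. The attack can spread across all benign devices, making it a serious threat to FL.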

How Much Is Hidden in the NAS Benchmarks? Few-Shot Adaptation of a NAS Predictor

paper_url: http://arxiv.org/abs/2311.18451
repo_url: None
paper_authors: Hrushikesh Loya, Łukasz Dudziak, Abhinav Mehrotra, Royson Lee, Javier Fernandez-Marques, Nicholas D. Lane, Hongkai Wen
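for: This paper aims to make neural architecture search (NAS) applicable across different tasks and search spaces at lower cost, by extracting general NAS knowledge that transfers between them.
methods: The paper uses meta-learning to extract knowledge from publicly available NAS benchmarks and studies in detail the relationship between task-level correlation and predictor transferability.
results: The experiments use 6 NAS benchmarks covering 16 NAS settings in total. The meta-learning approach shows superior or matching performance in cross-validation experiments and also extrapolates successfully to a new search space and new tasks.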

Abstract
Neural architecture search has proven to be a powerful approach to designing and refining neural networks, often boosting their performance and efficiency over manually-designed variations, but comes with computational overhead. While there has been a considerable amount of research focused on lowering the cost of NAS for mainstream tasks, such as image classification, a lot of those improvements stem from the fact that those tasks are well-studied in the broader context. Consequently, applicability of NAS to emerging and under-represented domains is still associated with a relatively high cost and/or uncertainty about the achievable gains. To address this issue, we turn our focus towards the recent growth of publicly available NAS benchmarks in an attempt to extract general NAS knowledge, transferable across different tasks and search spaces. We borrow from the rich field of meta-learning for few-shot adaptation and carefully study applicability of those methods to NAS, with a special focus on the relationship between task-level correlation (domain shift) and predictor transferability; which we deem critical for improving NAS on diverse tasks. In our experiments, we use 6 NAS benchmarks in conjunction, spanning in total 16 NAS settings -- our meta-learning approach not only shows superior (or matching) performance in the cross-validation experiments but also successful extrapolation to a new search space and tasks.

The Sliding Regret in Stochastic Bandits: Discriminating Index and Randomized Policies

paper_url: http://arxiv.org/abs/2311.18437
repo_url: None
paper_authors: Victor Boone
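for: This paper studies the one-shot (single-run) behavior of no-regret algorithms for stochastic bandits.
methods: The paper introduces a new measure, the sliding regret, and proves that randomized methods (e.g. Thompson Sampling and MED) have optimal sliding regret, while index policies (e.g. UCB, UCB-V, KL-UCB, MOSS, IMED) have the worst possible sliding regret under regularity conditions on their index.
results: Over a single run, the pseudo-regret of these algorithms tends to be either smooth or bumpy. The paper also analyzes the average bumpiness of index policies through the regret of exploration, which is shown to be suboptimal as well.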

Abstract
This paper studies the one-shot behavior of no-regret algorithms for stochastic bandits. Although many algorithms are known to be asymptotically optimal with respect to the expected regret, over a single run, their pseudo-regret seems to follow one of two tendencies: it is either smooth or bumpy. To measure this tendency, we introduce a new notion: the sliding regret, that measures the worst pseudo-regret over a time-window of fixed length sliding to infinity. We show that randomized methods (e.g. Thompson Sampling and MED) have optimal sliding regret, while index policies, although possibly asymptotically optimal for the expected regret, have the worst possible sliding regret under regularity conditions on their index (e.g. UCB, UCB-V, KL-UCB, MOSS, IMED etc.). We further analyze the average bumpiness of the pseudo-regret of index policies via the regret of exploration, that we show to be suboptimal as well.
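
The notion of sliding regret can be illustrated on a recorded run. The sketch below is a finite-horizon proxy, assuming the cumulative pseudo-regret of one run is available as a list; the paper's definition takes the window to infinity, and the function name and the toy trajectories are hypothetical.

```python
def sliding_regret(pseudo_regret, window):
    """Largest increase of cumulative pseudo-regret over any `window` consecutive steps.

    `pseudo_regret[t]` is the cumulative pseudo-regret after t steps (index 0 is 0.0).
    This is a finite-horizon proxy for the asymptotic definition in the abstract.
    """
    if window <= 0 or window >= len(pseudo_regret):
        raise ValueError("window must be in [1, len(pseudo_regret) - 1]")
    return max(
        pseudo_regret[t + window] - pseudo_regret[t]
        for t in range(len(pseudo_regret) - window)
    )


# Two hypothetical runs with the same final regret (4.0) but different shapes.
smooth = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
bumpy = [0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0]

smooth_score = sliding_regret(smooth, 2)
bumpy_score = sliding_regret(bumpy, 2)
```

Both toy runs end with the same cumulative regret, yet the bumpy run has the larger windowed value, which is the distinction the sliding regret is designed to expose.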

Exploring the Temperature-Dependent Phase Transition in Modern Hopfield Networks

paper_url: http://arxiv.org/abs/2311.18434
repo_url: None
paper_authors: Felix Koulischer, Cédric Goemaere, Tom van der Meersch, Johannes Deleu, Thomas Demeester
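for: This paper studies the effect of the inverse temperature hyperparameter $\beta$ in Modern Hopfield Networks (MHNs).
methods: The paper uses a simplified MHN and studies the effect of $\beta$ by tracking the distribution of energy minima.
results: At a critical temperature $\beta_{\text{c}}$, the MHN undergoes a phase transition from a single global attractor towards highly pattern-specific minima. The dynamics depend not only on $\beta$ but also on the distribution and size of the stored patterns.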

Abstract
The recent discovery of a connection between Transformers and Modern Hopfield Networks (MHNs) has reignited the study of neural networks from a physical energy-based perspective. This paper focuses on the pivotal effect of the inverse temperature hyperparameter $\beta$ on the distribution of energy minima of the MHN. To achieve this, the distribution of energy minima is tracked in a simplified MHN in which equidistant normalised patterns are stored. This network demonstrates a phase transition at a critical temperature $\beta_{\text{c}}$, from a single global attractor towards highly pattern specific minima as $\beta$ is increased. Importantly, the dynamics are not solely governed by the hyperparameter $\beta$ but are instead determined by an effective inverse temperature $\beta_{\text{eff}}$ which also depends on the distribution and size of the stored patterns. Recognizing the role of hyperparameters in the MHN could, in the future, aid researchers in the domain of Transformers to optimise their initial choices, potentially reducing the necessity for time and energy expensive hyperparameter fine-tuning.
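
The role of $\beta$ can be seen in a few lines of code. The sketch below assumes the standard MHN retrieval step, in which the new state is a softmax-weighted average of the stored patterns with scores scaled by $\beta$; the abstract does not spell out the update rule, so this form, the helper names, and the two toy patterns are assumptions rather than the paper's exact setup.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mhn_update(patterns, query, beta):
    """One retrieval step: new_query = sum_i softmax(beta * <x_i, query>)_i * x_i."""
    scores = [beta * sum(p * q for p, q in zip(pattern, query)) for pattern in patterns]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * pattern[d] for w, pattern in zip(weights, patterns)) for d in range(dim)]


patterns = [[1.0, 0.0], [0.0, 1.0]]  # two normalised stored patterns
query = [0.9, 0.1]

low_beta = mhn_update(patterns, query, beta=0.1)    # close to the pattern mean
high_beta = mhn_update(patterns, query, beta=50.0)  # close to the first pattern
```

For small $\beta$ the step pulls the state towards the average of all patterns (a single global attractor), while for large $\beta$ it snaps to the stored pattern most similar to the query, which mirrors the two regimes described in the abstract.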

On the convergence of adaptive first order methods: proximal gradient and alternating minimization algorithms

paper_url: http://arxiv.org/abs/2311.18431
repo_url: https://github.com/pylat/adaptive-proximal-algorithms-extended-experiments
paper_authors: Puya Latafat, Andreas Themelis, Panagiotis Patrinos
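for: Building on recent work on linesearch-free adaptive proximal gradient methods, this paper proposes AdaPG$^{\pi,r}$, a framework that provides larger stepsize policies and improved lower bounds.
methods: The paper discusses different choices of the parameters $\pi$ and $r$ and demonstrates their effect through numerical experiments. It also establishes convergence in a more general setting.
results: Numerical experiments and theoretical analysis demonstrate the efficacy of AdaPG$^{\pi,r}$, and its applicability is extended beyond standard strongly convex settings.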

Abstract
Building upon recent works on linesearch-free adaptive proximal gradient methods, this paper proposes AdaPG$^{\pi,r}$, a framework that unifies and extends existing results by providing larger stepsize policies and improved lower bounds. Different choices of the parameters $\pi$ and $r$ are discussed and the efficacy of the resulting methods is demonstrated through numerical simulations. In an attempt to better understand the underlying theory, its convergence is established in a more general setting that allows for time-varying parameters. Finally, an adaptive alternating minimization algorithm is presented by exploring the dual setting. This algorithm not only incorporates additional adaptivity, but also expands its applicability beyond standard strongly convex settings.

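Summary
Building on recent work on linesearch-free adaptive proximal gradient methods, this paper proposes the AdaPG$^{\pi,r}$ framework, which unifies and extends existing results by providing larger stepsize policies and improved lower bounds. Different choices of the parameters $\pi$ and $r$ are discussed, and numerical experiments demonstrate the efficacy of the resulting methods. To better understand the underlying theory, convergence is established in a more general setting that allows time-varying parameters. Finally, the paper presents an adaptive alternating minimization algorithm that not only adds adaptivity but also applies beyond standard strongly convex settings.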

Convergence Analysis of Fractional Gradient Descent

paper_url: http://arxiv.org/abs/2311.18426
repo_url: None
paper_authors: Ashwani Aggarwal
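for: This study analyzes the convergence of fractional gradient descent in several settings.
methods: The paper establishes novel bounds that bridge fractional and integer derivatives, then applies them to different settings to prove $O(1/T)$ convergence for smooth and convex functions and linear convergence for smooth and strongly convex functions.
results: The paper proves $O(1/T)$ convergence of fractional gradient descent for smooth and non-convex functions, and presents empirical results indicating that fractional gradient descent can be faster than standard gradient descent.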

Abstract
Fractional derivatives are a well-studied generalization of integer order derivatives. Naturally, for optimization, it is of interest to understand the convergence properties of gradient descent using fractional derivatives. Convergence analysis of fractional gradient descent is currently limited both in the methods analyzed and the settings analyzed. This paper aims to fill in these gaps by analyzing variations of fractional gradient descent in smooth and convex, smooth and strongly convex, and smooth and non-convex settings. First, novel bounds will be established bridging fractional and integer derivatives. Then, these bounds will be applied to the aforementioned settings to prove $O(1/T)$ convergence for smooth and convex functions and linear convergence for smooth and strongly convex functions. Additionally, we prove $O(1/T)$ convergence for smooth and non-convex functions using an extended notion of smoothness that is more natural for fractional derivatives. Finally, empirical results will be presented on the potential speed up of fractional gradient descent over standard gradient descent as well as the challenges of predicting which will be faster in general.

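Summary
Fractional derivatives are a well-studied generalization of integer-order derivatives, so for optimization it is natural to ask how gradient descent behaves when it uses them. Existing convergence analysis of fractional gradient descent is limited both in the methods and in the settings covered. This paper fills these gaps by analyzing variations of fractional gradient descent in smooth and convex, smooth and strongly convex, and smooth and non-convex settings. It first establishes novel bounds bridging fractional and integer derivatives, then applies them to prove $O(1/T)$ convergence for smooth and convex functions and linear convergence for smooth and strongly convex functions. It also proves $O(1/T)$ convergence for smooth and non-convex functions using an extended notion of smoothness that is more natural for fractional derivatives. Finally, empirical results show the potential speed-up of fractional gradient descent over standard gradient descent, as well as the difficulty of predicting which of the two will be faster in general.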

Data-efficient Deep Reinforcement Learning for Vehicle Trajectory Control

paper_url: http://arxiv.org/abs/2311.18393
repo_url: None
paper_authors: Bernd Frauenknecht, Tobias Ehlgen, Sebastian Trimpe
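for: This paper targets advanced vehicle control as a building block of autonomous driving systems, using reinforcement learning (RL) to achieve control performance superior to classical approaches while keeping computational demands low in deployment.
methods: The paper applies three recent deep RL methods: Randomized Ensemble Double Q-learning (REDQ), Probabilistic Ensembles with Trajectory Sampling and Model Predictive Path Integral Optimizer (PETS-MPPI), and Model-Based Policy Optimization (MBPO). None of them had previously been explored for vehicle trajectory control.
results: The experiments show that the three methods learn control strategies on a par with or better than soft actor-critic (SAC) while substantially reducing the number of environment interactions.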

Abstract
Advanced vehicle control is a fundamental building block in the development of autonomous driving systems. Reinforcement learning (RL) promises to achieve control performance superior to classical approaches while keeping computational demands low during deployment. However, standard RL approaches like soft actor-critic (SAC) require extensive amounts of training data to be collected and are thus impractical for real-world application. To address this issue, we apply recently developed data-efficient deep RL methods to vehicle trajectory control. Our investigation focuses on three methods, so far unexplored for vehicle control: randomized ensemble double Q-learning (REDQ), probabilistic ensembles with trajectory sampling and model predictive path integral optimizer (PETS-MPPI), and model-based policy optimization (MBPO). We find that in the case of trajectory control, the standard model-based RL formulation used in approaches like PETS-MPPI and MBPO is not suitable. We, therefore, propose a new formulation that splits dynamics prediction and vehicle localization. Our benchmark study on the CARLA simulator reveals that the three identified data-efficient deep RL approaches learn control strategies on a par with or better than SAC, yet reduce the required number of environment interactions by more than one order of magnitude.

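Summary
Advanced vehicle control is a fundamental building block of autonomous driving systems. Reinforcement learning (RL) can achieve control performance superior to classical approaches while keeping computational cost low in deployment, but standard RL methods such as soft actor-critic (SAC) need large amounts of training data and are therefore impractical in real-world applications. To address this, the paper applies three data-efficient deep RL methods to vehicle trajectory control: randomized ensemble double Q-learning (REDQ), probabilistic ensembles with trajectory sampling and model predictive path integral optimizer (PETS-MPPI), and model-based policy optimization (MBPO). The standard model-based RL formulation used in PETS-MPPI and MBPO turns out to be unsuitable for trajectory control, so the paper proposes a new formulation that separates dynamics prediction from vehicle localization. A benchmark study on the CARLA simulator shows that the three methods learn control strategies on a par with or better than SAC while reducing the number of environment interactions by more than one order of magnitude.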

Transfer Learning across Different Chemical Domains: Virtual Screening of Organic Materials with Deep Learning Models Pretrained on Small Molecule and Chemical Reaction Data

paper_url: http://arxiv.org/abs/2311.18377
repo_url: None
paper_authors: Chengwei Zhang, Yushuang Zhai, Ziyang Gong, Yuan-Bin She, Yun-Fang Yang, An Su
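for: This study proposes an efficient virtual screening approach for predicting the properties of organic materials.
methods: The approach uses the BERT model, pretrained on drug-like small molecule and chemical reaction databases and then fine-tuned on five virtual screening tasks.
results: The USPTO-SMILES pretrained BERT model achieved R2 > 0.90 on two tasks and R2 > 0.82 on one task, generally outperforming the same models pretrained on small molecule or organic materials databases as well as three traditional machine learning models.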

Abstract
Machine learning prediction of organic materials properties is an efficient virtual screening method ahead of more expensive screening methods. However, this approach has suffered from insufficient labeled data on organic materials to train state-of-the-art machine learning models. In this study, we demonstrate that drug-like small molecule and chemical reaction databases can be used to pretrain the BERT model for the virtual screening of organic materials. Among the BERT models fine-tuned by five virtual screening tasks on organic materials, the USPTO-SMILES pretrained BERT model had R2 > 0.90 for two tasks and R2 > 0.82 for one, which was generally superior to the same models pretrained by the small molecule or organic materials databases, as well as to the other three traditional machine learning models trained directly on the virtual screening task data. The superior performance of the USPTO-SMILES pretrained BERT model is due to the greater variety of organic building blocks in the USPTO database and the broader coverage of the chemical space. The even better performance of the BERT model pretrained externally from a chemical reaction database with additional sources of chemical reactions strengthens our proof of concept that transfer learning across different chemical domains is practical for the virtual screening of organic materials.

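Summary
Machine learning prediction of organic materials properties is an efficient virtual screening method, but it is limited by the scarcity of labeled data on organic materials. This study shows that drug-like small molecule and chemical reaction databases can be used to pretrain the BERT model for the virtual screening of organic materials. Among the BERT models fine-tuned on five virtual screening tasks, the USPTO-SMILES pretrained model reached R2 > 0.90 on two tasks and R2 > 0.82 on one, generally outperforming the same models pretrained on small molecule or organic materials databases and three traditional machine learning models trained directly on the task data. Its advantage is attributed to the greater variety of organic building blocks in the USPTO database and its broader coverage of chemical space. A BERT model pretrained on a chemical reaction database with additional sources of reactions performs even better, which strengthens the case that transfer learning across chemical domains is practical for the virtual screening of organic materials.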

Age Effects on Decision-Making, Drift Diffusion Model

paper_url: http://arxiv.org/abs/2311.18376
repo_url: None
paper_authors: Zahra Kavian, Kimia Hajisadeghi, Yashar Rezazadeh, Mehrbod Faraji, Reza Ebrahimpour
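for: This study examines how training improves the decision-making performance of different age groups on a random dot motion (RDM) task.
methods: The study uses a three-phase training protocol and analyzes participants' responses with a hierarchical drift-diffusion model.
results: After training, participants accumulated sensory information faster and the model drift rate increased, while the decision boundary decreased as they became more confident and their decision threshold dropped. The old group had a higher boundary and a lower drift rate both before and after training, and the difference between the two groups' parameters shrank after training.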

Abstract
Training can improve human decision-making performance. After several training sessions, a person can quickly and accurately complete a task. However, decision-making is always a trade-off between accuracy and response time. Factors such as age and drug abuse can affect the decision-making process. This study examines how training can improve the performance of different age groups in completing a random dot motion (RDM) task. The participants are divided into two groups: old and young. They undergo a three-phase training and then repeat the same RDM task. The hierarchical drift-diffusion model analyzes the subjects' responses and determines how the model's parameters change after training for both age groups. The results show that after training, the participants were able to accumulate sensory information faster, and the model drift rate increased. However, their decision boundary decreased as they became more confident and had a lower decision-making threshold. Additionally, the old group had a higher boundary and lower drift rate in both pre and post-training, and there was less difference between the two group parameters after training.
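
The two parameters discussed above, drift rate and decision boundary, can be explored with a forward simulation of a basic drift-diffusion process. The sketch below is not the hierarchical model fitted in the study: it assumes symmetric boundaries, an unbiased starting point, no non-decision time, and an Euler-Maruyama discretization, and every name and default value in it is illustrative.

```python
import random


def simulate_ddm_trial(drift, boundary, noise=1.0, dt=0.001, max_time=10.0, rng=None):
    """Simulate one drift-diffusion trial with symmetric boundaries at +/- boundary.

    Returns (choice, reaction_time): choice is 1 for the upper boundary,
    0 for the lower boundary, and None if no boundary is reached by max_time.
    """
    rng = rng or random.Random()
    evidence, elapsed = 0.0, 0.0
    step_sd = noise * dt ** 0.5  # Euler-Maruyama scaling of the noise term
    while elapsed < max_time:
        evidence += drift * dt + step_sd * rng.gauss(0.0, 1.0)
        elapsed += dt
        if evidence >= boundary:
            return 1, elapsed
        if evidence <= -boundary:
            return 0, elapsed
    return None, elapsed


# Noise-free check: with drift 2.0 and boundary 1.0 the upper bound is hit near t = 0.5.
choice, rt = simulate_ddm_trial(drift=2.0, boundary=1.0, noise=0.0)

# A seeded noisy trial for reproducibility.
noisy_choice, noisy_rt = simulate_ddm_trial(1.0, 1.0, rng=random.Random(0))
```

In this simplified picture, a higher drift rate shortens the time to reach a boundary and a lower boundary trades accuracy for speed, which matches the direction of the post-training changes reported in the abstract.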

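Summary
Training can improve human decision-making performance: after several training sessions, a person can complete a task quickly and accurately. Decision-making, however, is always a trade-off between accuracy and response time, and factors such as age and drug abuse can affect it. This study examines how training improves the performance of different age groups on a random dot motion (RDM) task. Participants were divided into an old and a young group, completed a three-phase training, and then repeated the same RDM task. A hierarchical drift-diffusion model was used to analyze the responses and to determine how the model parameters changed after training. After training, participants accumulated sensory information faster and the drift rate increased, while the decision boundary decreased as they became more confident and lowered their decision threshold. The old group had a higher boundary and a lower drift rate both before and after training, and the difference between the two groups' parameters shrank after training.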

Towards Comparable Active Learning

paper_url: http://arxiv.org/abs/2311.18356
repo_url: https://github.com/wernerth94/comparable-active-learning
paper_authors: Thorben Werner, Johannes Burchert, Lars Schmidt-Thieme
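for: This paper proposes an Active Learning (AL) framework for comparing algorithms across different tasks and domains, while addressing overlooked problems of experiment reproducibility and variance in the results.
methods: The paper provides an AL framework together with a new evaluation method for assessing algorithms across tasks and domains.
results: The experiments show that the performance of existing AL algorithms differs widely between domains and between tasks.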

Abstract
Active Learning has received significant attention in the field of machine learning for its potential in selecting the most informative samples for labeling, thereby reducing data annotation costs. However, we show that the reported lifts in recent literature generalize poorly to other domains leading to an inconclusive landscape in Active Learning research. Furthermore, we highlight overlooked problems for reproducing AL experiments that can lead to unfair comparisons and increased variance in the results. This paper addresses these issues by providing an Active Learning framework for a fair comparison of algorithms across different tasks and domains, as well as a fast and performant oracle algorithm for evaluation. To the best of our knowledge, we propose the first AL benchmark that tests algorithms in 3 major domains: Tabular, Image, and Text. We report empirical results for 6 widely used algorithms on 7 real-world and 2 synthetic datasets and aggregate them into a domain-specific ranking of AL algorithms.

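Summary
Active Learning (AL) promises to cut annotation costs by selecting the most informative samples for labeling, but the lifts reported in recent literature generalize poorly to other domains, and overlooked reproducibility problems can lead to unfair comparisons and increased variance. This paper addresses these issues with an AL framework for comparing algorithms fairly across tasks and domains, together with a fast and performant oracle algorithm for evaluation. To the best of the authors' knowledge, it is the first AL benchmark that tests algorithms in three major domains: tabular, image, and text. Empirical results for six widely used algorithms on seven real-world and two synthetic datasets are aggregated into a domain-specific ranking of AL algorithms.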

Tree-based Forecasting of Day-ahead Solar Power Generation from Granular Meteorological Features

paper_url: http://arxiv.org/abs/2312.00090
repo_url: None
paper_authors: Nick Berlanger, Noah van Ophoven, Tim Verdonck, Ines Wilms
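for: Forecasting day-ahead photovoltaic (PV) power generation, to support a high PV penetration rate in the electricity grid and stable grid operation.
methods: The paper uses state-of-the-art tree-based machine learning methods, accounts for the effects of different meteorological and astronomical features on PV power production, and produces forecasts at both coarse and granular spatial locations.
results: The forecasts are produced at an hourly resolution for Belgium. The insights can help utilities, decision-makers, and other stakeholders optimize grid operations and economic dispatch and integrate distributed PV power.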

Abstract
Accurate forecasts for day-ahead photovoltaic (PV) power generation are crucial to support a high PV penetration rate in the local electricity grid and to assure stability in the grid. We use state-of-the-art tree-based machine learning methods to produce such forecasts and, unlike previous studies, we hereby account for (i) the effects various meteorological as well as astronomical features have on PV power production, and this (ii) at coarse as well as granular spatial locations. To this end, we use data from Belgium and forecast day-ahead PV power production at an hourly resolution. The insights from our study can assist utilities, decision-makers, and other stakeholders in optimizing grid operations, economic dispatch, and in facilitating the integration of distributed PV power into the electricity grid.

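Summary
Accurate day-ahead forecasts of photovoltaic (PV) power generation are essential for supporting a high PV penetration rate in the local electricity grid and for keeping the grid stable. The paper uses state-of-the-art tree-based machine learning methods to produce such forecasts and, unlike previous studies, accounts for (i) the effects of various meteorological and astronomical features on PV power production, (ii) at both coarse and granular spatial locations. It uses data from Belgium and forecasts day-ahead PV power production at an hourly resolution. The findings can help utilities, decision-makers, and other stakeholders optimize grid operations and economic dispatch and integrate distributed PV power into the electricity grid.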

Reconstructing Historical Climate Fields With Deep Learning

paper_url: http://arxiv.org/abs/2311.18348
repo_url: None
paper_authors: Nils Bochow, Anna Poltronieri, Martin Rypdal, Niklas Boers
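for: Filling in missing data in historical climate records, especially from before the introduction of satellite missions.
methods: A deep learning approach based on Fourier convolutions, trained on numerical climate model output, is used to reconstruct historical climate records.
results: The approach realistically reconstructs large and irregular areas of missing data and reconstructs known historical events, such as strong El Niño and La Niña, from very little given information. The model generalizes to higher resolutions than the ones it was trained on and can be used on a variety of climate fields.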

Abstract
Historical records of climate fields are often sparse due to missing measurements, especially before the introduction of large-scale satellite missions. Several statistical and model-based methods have been introduced to fill gaps and reconstruct historical records. Here, we employ a recently introduced deep-learning approach based on Fourier convolutions, trained on numerical climate model output, to reconstruct historical climate fields. Using this approach we are able to realistically reconstruct large and irregular areas of missing data, as well as reconstruct known historical events such as strong El Niño and La Niña with very little given information. Our method outperforms the widely used statistical kriging method as well as other recent machine learning approaches. The model generalizes to higher resolutions than the ones it was trained on and can be used on a variety of climate fields. Moreover, it allows inpainting of masks never seen before during the model training.

Learning Robust Precipitation Forecaster by Temporal Frame Interpolation

paper_url: http://arxiv.org/abs/2311.18341
repo_url: https://github.com/secilia-cxy/unettfi
paper_authors: Lu Han, Xu-Yang Chen, Han-Jia Ye, De-Chuan Zhan
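for: This study aims to improve the accuracy of weather forecasting models, particularly when they face the spatial-temporal shifts encountered in real-world applications.
methods: The study uses Temporal Frame Interpolation (TFI), which interpolates adjacent frames of satellite imagery and ground radar data to make the model more robust to spatial-temporal variation. It also uses a Multi-Level Dice (ML-Dice) loss function that exploits the ordinal nature of rainfall intensities to improve performance.
results: The model took first place on the transfer learning leaderboard of Weather4cast'23, which supports the effectiveness of the proposed methods.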

Abstract
Recent advances in deep learning have significantly elevated weather prediction models. However, these models often falter in real-world scenarios due to their sensitivity to spatial-temporal shifts. This issue is particularly acute in weather forecasting, where models are prone to overfit to local and temporal variations, especially when tasked with fine-grained predictions. In this paper, we address these challenges by developing a robust precipitation forecasting model that demonstrates resilience against such spatial-temporal discrepancies. We introduce Temporal Frame Interpolation (TFI), a novel technique that enhances the training dataset by generating synthetic samples through interpolating adjacent frames from satellite imagery and ground radar data, thus improving the model's robustness against frame noise. Moreover, we incorporate a unique Multi-Level Dice (ML-Dice) loss function, leveraging the ordinal nature of rainfall intensities to improve the model's performance. Our approach has led to significant improvements in forecasting precision, culminating in our model securing 1st place in the transfer learning leaderboard of the Weather4cast'23 competition. This achievement not only underscores the effectiveness of our methodologies but also establishes a new standard for deep learning applications in weather forecasting. Our code and weights are publicly available at https://github.com/Secilia-Cxy/UNetTFI.
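
The idea behind Temporal Frame Interpolation can be sketched as blending neighbouring frames of a sequence to create additional training samples. The abstract does not give the interpolation formula, so the linear blend below, the fixed blending weight, and the function names are assumptions for illustration rather than the authors' procedure; frames are plain nested lists to keep the example self-contained.

```python
def interpolate_frames(frame_a, frame_b, weight):
    """Blend two equally sized 2-D frames: (1 - weight) * frame_a + weight * frame_b."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [(1.0 - weight) * a + weight * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


def augment_sequence(frames, weight):
    """Build a synthetic sequence by blending each frame with its successor."""
    return [interpolate_frames(f0, f1, weight) for f0, f1 in zip(frames, frames[1:])]


frames = [[[0.0, 2.0]], [[2.0, 4.0]], [[4.0, 8.0]]]  # three tiny 1x2 frames
synthetic = augment_sequence(frames, 0.5)
```

A sequence of n frames yields n - 1 blended frames here; in practice the weight would likely be sampled per training example and the corresponding target frames would need the same treatment.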

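Summary
Recent advances in deep learning have markedly improved weather prediction models. In real-world scenarios, however, these models are often affected by spatial-temporal shifts, which limits their accuracy. In weather forecasting in particular, models tend to overfit to local and temporal variations, especially for fine-grained predictions. This paper addresses these challenges by developing a robust precipitation forecasting model that is resilient to spatial-temporal discrepancies. It introduces Temporal Frame Interpolation (TFI), which interpolates adjacent frames of satellite imagery and ground radar data to improve robustness, and adopts a Multi-Level Dice (ML-Dice) loss function that exploits the ordinal nature of rainfall intensities. The approach significantly improves forecasting precision: the model took 1st place on the transfer learning leaderboard of Weather4cast'23, which both supports the effectiveness of the method and sets a new standard for deep learning in weather forecasting. The code and weights are publicly available at https://github.com/Secilia-Cxy/UNetTFI.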

Anomaly Detection via Learning-Based Sequential Controlled Sensing

paper_url: http://arxiv.org/abs/2312.00088
repo_url: None
paper_authors: Geethu Joseph, Chen Zhong, M. Cenk Gursoy, Senem Velipasalar, Pramod K. Varshney
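for: Detecting anomalies among a given set of binary processes.
methods: Uses learning-based controlled sensing. Each process is parameterized by a binary random variable indicating whether it is anomalous. Anomalies are identified by observing a subset of the processes at each time instant, and probing each process has an associated cost.
results: The problem is solved with two approaches, deep reinforcement learning and deep active inference. Numerical experiments show that the algorithms adapt to any unknown statistical dependence pattern of the processes.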

Abstract
In this paper, we address the problem of detecting anomalies among a given set of binary processes via learning-based controlled sensing. Each process is parameterized by a binary random variable indicating whether the process is anomalous. To identify the anomalies, the decision-making agent is allowed to observe a subset of the processes at each time instant. Also, probing each process has an associated cost. Our objective is to design a sequential selection policy that dynamically determines which processes to observe at each time with the goal to minimize the delay in making the decision and the total sensing cost. We cast this problem as a sequential hypothesis testing problem within the framework of Markov decision processes. This formulation utilizes both a Bayesian log-likelihood ratio-based reward and an entropy-based reward. The problem is then solved using two approaches: 1) a deep reinforcement learning-based approach where we design both deep Q-learning and policy gradient actor-critic algorithms; and 2) a deep active inference-based approach. Using numerical experiments, we demonstrate the efficacy of our algorithms and show that our algorithms adapt to any unknown statistical dependence pattern of the processes.
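The entropy-based reward mentioned in the abstract can be illustrated with a small sketch for a single process. It assumes an independent process observed through a sensor that reports the wrong state with a known probability `flip_prob`; the paper handles unknown dependence across processes, which this simplified example does not model, and all function names are hypothetical.

```python
import math


def posterior_update(prior, observation, flip_prob):
    """Bayes update of P(process is anomalous) after one noisy binary reading.

    observation is 1 if the sensor reports "anomalous", 0 otherwise.
    flip_prob is the assumed probability that the sensor reports the wrong state.
    """
    like_anomalous = 1 - flip_prob if observation == 1 else flip_prob
    like_normal = flip_prob if observation == 1 else 1 - flip_prob
    numerator = like_anomalous * prior
    return numerator / (numerator + like_normal * (1 - prior))


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) belief."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def entropy_reward(prior, posterior):
    """Reward equal to the uncertainty removed by the observation."""
    return binary_entropy(prior) - binary_entropy(posterior)
```

A policy that maximizes this reward while paying a per-probe cost trades off decision delay against sensing cost, which is the tension the abstract describes.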

Summary

1. A deep reinforcement learning-based approach, which includes designing both deep Q-learning and policy gradient actor-critic algorithms.
2. A deep active inference-based approach.

Using numerical experiments, we demonstrate the effectiveness of our algorithms and show that our algorithms adapt to any unknown statistical dependence pattern of the processes.

Learning for Semantic Knowledge Base-Guided Online Feature Transmission in Dynamic Channels

paper_url: http://arxiv.org/abs/2311.18316
repo_url: None
paper_authors: Xiangyu Gao, Yaping Sun, Dongyu Wei, Xiaodong Xu, Hao Chen, Hao Yin, Shuguang Cui
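for: This paper aims to make AI inference in edge computing more efficient, to meet the needs of intelligent applications such as autonomous vehicles and VR/AR.
methods: The authors propose an online optimization framework to address the challenges that dynamic channel conditions and device mobility pose for an end-to-end communication system. The approach leverages a semantic knowledge base to drive multi-level feature transmission, accounting for temporal factors and dynamic elements throughout the transmission process.
results: The approach outperforms traditional greedy methods under various system setups. The authors design a deep reinforcement learning algorithm with a carefully designed reward function to solve the optimization problem with real-time decision-making.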

Abstract
With the proliferation of edge computing, efficient AI inference on edge devices has become essential for intelligent applications such as autonomous vehicles and VR/AR. In this context, we address the problem of efficient remote object recognition by optimizing feature transmission between mobile devices and edge servers. We propose an online optimization framework to address the challenge of dynamic channel conditions and device mobility in an end-to-end communication system. Our approach builds upon existing methods by leveraging a semantic knowledge base to drive multi-level feature transmission, accounting for temporal factors and dynamic elements throughout the transmission process. To solve the online optimization problem, we design a novel soft actor-critic-based deep reinforcement learning system with a carefully designed reward function for real-time decision-making, overcoming the optimization difficulty of the NP-hard problem and achieving the minimization of semantic loss while respecting latency constraints. Numerical results showcase the superiority of our approach compared to traditional greedy methods under various system setups.

Summary
To solve the online optimization problem, we propose a novel soft actor-critic-based deep reinforcement learning system with a carefully designed reward function for real-time decision-making. This approach overcomes the optimization difficulty of the NP-hard problem and achieves the minimization of semantic loss while respecting latency constraints. Numerical results demonstrate the superiority of our approach compared to traditional greedy methods under various system setups. Our approach transmits features efficiently while minimizing semantic loss and respecting latency constraints, which makes it well suited to edge computing applications.

Automatic Implementation of Neural Networks through Reaction Networks – Part I: Circuit Design and Convergence Analysis

paper_url: http://arxiv.org/abs/2311.18313
repo_url: None
paper_authors: Yuzhen Fan, Xiaoyu Zhang, Chuanhou Gao, Denis Dochain
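for: This study aims to realize a programmable biochemical reaction network (BCRN) system that implements a fully connected neural network (FCNN) and can operate automatically in vivo.
methods: The researchers design specific modules built from biochemical reactions to implement the FCNN's feedforward propagation computation, backpropagation component, and all bridging processes. This fills a design gap in the biochemical assignment module and judgment termination module, and provides a novel, precise, and robust biochemical reaction realization.
results: Through an equilibrium-approaching analysis, the researchers show that the designed BCRN system achieves FCNN functionality with exponential convergence to the target computational results. The construction is also evaluated on two typical logic classification problems.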

Abstract
Information processing relying on biochemical interactions in the cellular environment is essential for biological organisms. The implementation of molecular computational systems holds significant interest and potential in the fields of synthetic biology and molecular computation. This two-part article aims to introduce a programmable biochemical reaction network (BCRN) system endowed with mass action kinetics that realizes the fully connected neural network (FCNN) and has the potential to act automatically in vivo. In part I, the feedforward propagation computation, the backpropagation component, and all bridging processes of FCNN are ingeniously designed as specific BCRN modules based on their dynamics. This approach addresses a design gap in the biochemical assignment module and judgment termination module and provides a novel precise and robust realization of bi-molecular reactions for the learning process. Through equilibrium approaching, we demonstrate that the designed BCRN system achieves FCNN functionality with exponential convergence to target computational results, thereby enhancing the theoretical support for such work. Finally, the performance of this construction is further evaluated on two typical logic classification problems.

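Summary
Information processing based on biochemical reactions in the cellular environment is essential for living organisms. Implementing molecular computational systems holds significant interest and potential in synthetic biology and molecular computation. This article has two parts. Part I designs the FCNN's feedforward propagation computation, backpropagation component, and all bridging processes as BCRN modules, and uses an equilibrium-approaching analysis to show that the designed system achieves FCNN functionality. Finally, the performance of the construction is evaluated on two typical logic classification problems.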

PAUNet: Precipitation Attention-based U-Net for rain prediction from satellite radiance data

paper_url: http://arxiv.org/abs/2311.18306
repo_url: None
paper_authors: P. Jyoteeshkumar Reddy, Harish Baki, Sandeep Chinta, Richard Matear, John Taylor
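for: This paper addresses predicting precipitation from satellite radiance data.
methods: The paper uses the deep learning architecture Precipitation Attention-based U-Net (PAUNet), which combines encoder convolutional layers, center cropping, and attention mechanisms to capture the large-scale contextual information of multi-band satellite images.
results: Trained with the e-FPL loss on a large dataset from European regions, the model achieves a higher Critical Success Index (CSI) than the baseline model across precipitation categories, demonstrating notable accuracy and improvement in precipitation forecasting.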

Abstract
This paper introduces Precipitation Attention-based U-Net (PAUNet), a deep learning architecture for predicting precipitation from satellite radiance data, addressing the challenges of the Weather4cast 2023 competition. PAUNet is a variant of U-Net and Res-Net, designed to effectively capture the large-scale contextual information of multi-band satellite images in visible, water vapor, and infrared bands through encoder convolutional layers with center cropping and attention mechanisms. We built upon the Focal Precipitation Loss including an exponential component (e-FPL), which further enhanced the importance across different precipitation categories, particularly medium and heavy rain. Trained on a substantial dataset from various European regions, PAUNet demonstrates notable accuracy with a higher Critical Success Index (CSI) score than the baseline model in predicting rainfall over multiple time slots. PAUNet's architecture and training methodology showcase improvements in precipitation forecasting, crucial for sectors like emergency services and retail and supply chain management.
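Since the abstract reports results as a Critical Success Index, a short sketch of that metric may help. It uses the standard definition, hits / (hits + misses + false alarms), over events defined by a rain-rate threshold; the function name and the flat-list input format are assumptions for illustration, not the competition's scoring code.

```python
def critical_success_index(predicted, observed, threshold):
    """CSI = hits / (hits + misses + false alarms) for events >= threshold."""
    hits = misses = false_alarms = 0
    for p, o in zip(predicted, observed):
        predicted_event = p >= threshold
        observed_event = o >= threshold
        if predicted_event and observed_event:
            hits += 1
        elif observed_event:
            misses += 1
        elif predicted_event:
            false_alarms += 1
    total = hits + misses + false_alarms
    # With no predicted or observed events the score is undefined.
    return hits / total if total else float("nan")
```

Correct negatives do not enter the score, which is why CSI is commonly preferred for rare events such as heavy rain.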

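Summary
This paper introduces the Precipitation Attention-based U-Net (PAUNet), a deep learning architecture for predicting precipitation from satellite radiance data that addresses the challenges of the Weather4cast 2023 competition. PAUNet is a variant of U-Net and Res-Net that effectively captures the large-scale contextual information of multi-band satellite images through encoder convolutional layers with center cropping and attention mechanisms. The authors build on the Focal Precipitation Loss with an exponential component (e-FPL), which further emphasizes the different precipitation categories, particularly medium and heavy rain. Trained on a large dataset from several European regions, PAUNet shows notable accuracy when predicting rainfall over multiple time slots, with a higher CSI score than the baseline model. PAUNet's architecture and training methodology demonstrate improvements in precipitation forecasting, which matters for sectors such as emergency services and retail and supply chain management.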

Semiparametric Efficient Inference in Adaptive Experiments

paper_url: http://arxiv.org/abs/2311.18274
repo_url: None
paper_authors: Thomas Cook, Alan Mishler, Aaditya Ramdas
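for: This paper proposes a method for efficient inference of the average treatment effect (ATE) in sequential experiments.
methods: The paper uses the Adaptive Augmented Inverse-Probability Weighted estimator, which is semiparametric efficient, under weaker assumptions than those previously made in the literature. It also provides a central limit theorem that enables efficient inference at fixed sample sizes.
results: The empirical results show that the method yields narrower confidence sequences and supports anytime-valid inference under data-dependent stopping times (sample sizes). In addition, propensity score truncation reduces the finite sample variance of the estimator without affecting the asymptotic variance.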

Abstract
We consider the problem of efficient inference of the Average Treatment Effect in a sequential experiment where the policy governing the assignment of subjects to treatment or control can change over time. We first provide a central limit theorem for the Adaptive Augmented Inverse-Probability Weighted estimator, which is semiparametric efficient, under weaker assumptions than those previously made in the literature. This central limit theorem enables efficient inference at fixed sample sizes. We then consider a sequential inference setting, deriving both asymptotic and nonasymptotic confidence sequences that are considerably tighter than previous methods. These anytime-valid methods enable inference under data-dependent stopping times (sample sizes). Additionally, we use propensity score truncation techniques from the recent off-policy estimation literature to reduce the finite sample variance of our estimator without affecting the asymptotic variance. Empirical results demonstrate that our methods yield narrower confidence sequences than those previously developed in the literature while maintaining time-uniform error control.
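For readers unfamiliar with the estimator, the following is a minimal sketch of the augmented inverse-probability weighted (AIPW) average over a sequence of subjects. It assumes the outcome-model predictions `mu1` and `mu0` and the assignment probabilities are supplied per subject; in the adaptive setting these would be computed from past data only, which this example does not enforce. The names `truncate` and `aipw_ate` are hypothetical.

```python
def truncate(propensity, floor):
    """Clip a propensity score into [floor, 1 - floor] to limit variance."""
    return min(max(propensity, floor), 1 - floor)


def aipw_ate(outcomes, treatments, propensities, mu1, mu0):
    """Average of per-subject AIPW scores.

    outcomes[t]     observed outcome y_t
    treatments[t]   1 if treated, 0 if control
    propensities[t] probability of treatment used by the policy at time t
    mu1[t], mu0[t]  predicted outcomes under treatment / control
    """
    total = 0.0
    for y, a, e, m1, m0 in zip(outcomes, treatments, propensities, mu1, mu0):
        total += (m1 - m0) + a * (y - m1) / e - (1 - a) * (y - m0) / (1 - e)
    return total / len(outcomes)
```

Setting both outcome models to zero reduces the score to plain inverse-probability weighting, which makes the role of the augmentation terms easy to see.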

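Summary
We consider efficient inference of the Average Treatment Effect in a sequential experiment where the policy that assigns subjects to treatment or control can change over time. We first provide a central limit theorem for the Adaptive Augmented Inverse-Probability Weighted estimator, which is semiparametric efficient, under weaker assumptions than those in the previous literature. This central limit theorem enables efficient inference at fixed sample sizes. We then consider a sequential inference setting and derive both asymptotic and nonasymptotic confidence sequences that are considerably tighter than previous methods. These anytime-valid methods enable inference under data-dependent stopping times (sample sizes). In addition, we use propensity score truncation techniques from the recent off-policy estimation literature to reduce the finite sample variance of our estimator without affecting the asymptotic variance. Empirical results show that our methods yield narrower confidence sequences while maintaining time-uniform error control.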

Learning Exactly Linearizable Deep Dynamics Models

paper_url: http://arxiv.org/abs/2311.18261
repo_url: None
paper_authors: Ryuta Moriyasu, Masayuki Kusunoki, Kenji Kashima
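for: This paper targets practical engineering applications of control based on machine-learning models.
methods: The paper proposes a learning method for exactly linearizable dynamical models, to which various control theories can easily be applied to ensure stability and reliability, while retaining a high degree of freedom of expression.
results: Applied to the real-time control of an automotive engine, the model shows good predictive performance and stable control under constraints.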

Abstract
Research on control using models based on machine-learning methods has now shifted to the practical engineering stage. Achieving high performance and theoretically guaranteeing the safety of the system is critical for such applications. In this paper, we propose a learning method for exactly linearizable dynamical models that can easily apply various control theories to ensure stability, reliability, etc., and to provide a high degree of freedom of expression. As an example, we present a design that combines simple linear control and control barrier functions. The proposed model is employed for the real-time control of an automotive engine, and the results demonstrate good predictive performance and stable control under constraints.

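Summary
Research on control using machine-learning-based models has entered the practical engineering stage. Achieving high performance while theoretically guaranteeing the safety of the system is critical for such applications. This paper proposes a learning method for exactly linearizable dynamical models that makes it easy to apply various control theories to ensure properties such as stability and reliability, while keeping a high degree of freedom of expression. As an example, the authors present a design that combines simple linear control with control barrier functions. The model is used for the real-time control of an automotive engine, where it shows good predictive performance and stable control under constraints.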

Combined Scheduling, Memory Allocation and Tensor Replacement for Minimizing Off-Chip Data Accesses of DNN Accelerators

paper_url: http://arxiv.org/abs/2311.18246
repo_url: None
paper_authors: Yi Li, Aarti Gupta, Sharad Malik
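for: This paper optimizes the execution of deep neural networks (DNNs) on specialized hardware accelerators to improve power and performance.
methods: The paper proposes an optimization framework, COSMA, for mapping DNNs to an accelerator so as to minimize additional data accesses. COSMA uses an Integer Linear Programming (ILP) formulation to generate the optimal mapping, covering the operator schedule, memory allocation, and tensor replacement for a given scratchpad size.
results: Using an off-the-shelf ILP solver, COSMA obtains the optimal mapping in seconds for a wide range of state-of-the-art DNNs and reduces non-compulsory data accesses by 84% on average relative to existing methods. A divide-and-conquer heuristic scales the approach to certain complex DNNs generated by Neural Architecture Search and reduces data accesses by 85% on average compared with other works.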

Abstract
Specialized hardware accelerators have been extensively used for Deep Neural Networks (DNNs) to provide power/performance benefits. These accelerators contain specialized hardware that supports DNN operators, and scratchpad memory for storing the tensor operands. Often, the size of the scratchpad is insufficient to store all the tensors needed for the computation, and additional data accesses are needed to move tensors back and forth from host memory during the computation with significant power/performance overhead. The volume of these additional data accesses depends on the operator schedule, and memory allocation (specific locations selected for the tensors in the scratchpad). We propose an optimization framework, named COSMA, for mapping DNNs to an accelerator that finds the optimal operator schedule, memory allocation and tensor replacement that minimizes the additional data accesses. COSMA provides an Integer Linear Programming (ILP) formulation to generate the optimal solution for mapping a DNN to the accelerator for a given scratchpad size. We demonstrate that, using an off-the-shelf ILP solver, COSMA obtains the optimal solution in seconds for a wide-range of state-of-the-art DNNs for different applications. Further, it out-performs existing methods by reducing on average 84% of the non-compulsory data accesses. We further propose a divide-and-conquer heuristic to scale up to certain complex DNNs generated by Neural Architecture Search, and this heuristic solution reduces on average 85% data accesses compared with other works.

Summary

Poisoning Attacks Against Contrastive Recommender Systems

paper_url: http://arxiv.org/abs/2311.18244
repo_url: None
paper_authors: Zongwei Wang, Junliang Yu, Min Gao, Hongzhi Yin, Bin Cui, Shazia Sadiq
for: This paper focuses on the vulnerability of contrastive learning (CL) based recommendation systems to poisoning attacks, and aims to facilitate the development of more robust CL-based systems.
methods: The paper uses theoretical and empirical analysis to identify the vulnerability of CL-based systems and proposes a dual-objective attack framework to amplify the dispersion effect of the CL loss and directly elevate the visibility of target items.
results: The paper validates the destructiveness of the proposed attack model through extensive experimentation on four datasets, demonstrating the vulnerability of CL-based systems to poisoning attacks.

Abstract
Contrastive learning (CL) has recently gained significant popularity in the field of recommendation. Its ability to learn without heavy reliance on labeled data is a natural antidote to the data sparsity issue. Previous research has found that CL can not only enhance recommendation accuracy but also inadvertently exhibit remarkable robustness against noise. However, this paper identifies a vulnerability of CL-based recommender systems: Compared with their non-CL counterparts, they are even more susceptible to poisoning attacks that aim to promote target items. Our analysis points to the uniform dispersion of representations led by the CL loss as the very factor that accounts for this vulnerability. We further theoretically and empirically demonstrate that the optimization of CL loss can lead to smooth spectral values of representations. Based on these insights, we attempt to reveal the potential poisoning attacks against CL-based recommender systems. The proposed attack encompasses a dual-objective framework: One that induces a smoother spectral value distribution to amplify the CL loss's inherent dispersion effect, named dispersion promotion; and the other that directly elevates the visibility of target items, named rank promotion. We validate the destructiveness of our attack model through extensive experimentation on four datasets. By shedding light on these vulnerabilities, we aim to facilitate the development of more robust CL-based recommender systems.

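Summary
Contrastive learning (CL) has attracted wide attention in recommendation. Because it can learn without large amounts of labeled data, it is a natural remedy for data sparsity. Previous research found that CL not only improves recommendation accuracy but also shows remarkable robustness to noise. This paper, however, identifies a vulnerability: compared with their non-CL counterparts, CL-based recommender systems are more susceptible to poisoning attacks that promote target items. The analysis attributes this vulnerability to the uniform dispersion of representations induced by the CL loss. The authors further show, theoretically and empirically, that optimizing the CL loss leads to smooth spectral values of the representations. Based on these insights, they set out to reveal potential poisoning attacks against CL-based recommender systems. The proposed attack has two objectives: dispersion promotion, which induces a smoother spectral value distribution to amplify the CL loss's inherent dispersion effect, and rank promotion, which directly raises the visibility of target items. Extensive experiments on four datasets validate the destructiveness of the attack. By exposing these vulnerabilities, the authors hope to facilitate more robust CL-based recommender systems.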

Channel-Feedback-Free Transmission for Downlink FD-RAN: A Radio Map based Complex-valued Precoding Network Approach

paper_url: http://arxiv.org/abs/2312.02184
repo_url: None
paper_authors: Jiwei Zhao, Jiacheng Chen, Zeyu Sun, Yuhang Shi, Haibo Zhou, Xuemin, Shen
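for: This paper addresses the transmission mechanism in fully-decoupled RAN (FD-RAN), an architecture aimed at more flexible spectrum resource utilization and lower network costs.
methods: The paper proposes a novel transmission scheme that does not rely on physical layer channel feedback. Specifically, it designs a radio map based complex-valued precoding network (RMCPNet) model that outputs the base station precoding based on user location.
results: Evaluated on the public DeepMIMO dataset, RMCPNet achieves 16% and 76% performance improvements over a conventional real-valued neural network and a statistical codebook approach, respectively.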

Abstract
As the demand for high-quality services proliferates, an innovative network architecture, the fully-decoupled RAN (FD-RAN), has emerged for more flexible spectrum resource utilization and lower network costs. However, with the decoupling of uplink base stations and downlink base stations in FD-RAN, the traditional transmission mechanism, which relies on real-time channel feedback, is not suitable as the receiver is not able to feed back accurate and timely channel state information to the transmitter. This paper proposes a novel transmission scheme without relying on physical layer channel feedback. Specifically, we design a radio map based complex-valued precoding network (RMCPNet) model, which outputs the base station precoding based on user location. RMCPNet comprises multiple subnets, with each subnet responsible for extracting unique modal features from diverse input modalities. Furthermore, the multi-modal embeddings derived from these distinct subnets are integrated within the information fusion layer, culminating in a unified representation. We also develop a specific RMCPNet training algorithm that employs the negative spectral efficiency as the loss function. We evaluate the performance of the proposed scheme on the public DeepMIMO dataset and show that RMCPNet can achieve 16% and 76% performance improvements over the conventional real-valued neural network and statistical codebook approach, respectively.
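The idea of using negative spectral efficiency as a loss can be sketched for the simplest case. The example below assumes a single user with a single receive antenna, a unit-power precoder, and the textbook rate log2(1 + |h^H w|^2 / noise); the abstract does not give the exact formulation used to train RMCPNet, so the functions here are hypothetical stand-ins.

```python
import math


def spectral_efficiency(channel, precoder, noise_power):
    """Rate in bit/s/Hz for one single-antenna user, with a unit-power precoder."""
    norm = math.sqrt(sum(abs(w) ** 2 for w in precoder))
    # Effective channel gain |h^H w|^2 after normalizing the precoder.
    projection = sum(h.conjugate() * (w / norm) for h, w in zip(channel, precoder))
    return math.log2(1 + abs(projection) ** 2 / noise_power)


def negative_se_loss(channel, precoder, noise_power):
    """Loss to minimize: the negated spectral efficiency."""
    return -spectral_efficiency(channel, precoder, noise_power)
```

Minimizing this loss pushes the predicted precoder toward the direction of the true channel, without the network ever receiving the channel as an input.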

Summary
The proposed transmission scheme is built on RMCPNet, a radio map based complex-valued precoding network that outputs the base station precoding based on user location. RMCPNet consists of multiple subnets that extract unique modal features from diverse input modalities, and these features are then integrated in the information fusion layer to create a unified representation. To train RMCPNet, a specific training algorithm is developed that uses negative spectral efficiency as the loss function. The performance of the proposed scheme is evaluated on the public DeepMIMO dataset and shows that RMCPNet can achieve 16% and 76% performance improvements over conventional real-valued neural networks and statistical codebook approaches, respectively.

PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design

paper_url: http://arxiv.org/abs/2312.00080
repo_url: https://github.com/wang-cr/pdb-struct
paper_authors: Chuanrui Wang, Bozitao Zhong, Zuobai Zhang, Narendra Chaudhary, Sanchit Misra, Jian Tang
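for: A standard benchmark for evaluating protein design methods.
methods: Uses high-accuracy protein structure prediction models as a proxy for wet-lab experiments, and assesses whether models can assign high likelihoods to experimentally stable proteins.
results: Evaluation on the PDB-Struct benchmark finds that ByProt, ProteinMPNN, and ESM-IF perform exceptionally well, while ESM-Design and AF-Design fall short on the refoldability metric. Code is available at https://github.com/WANG-CR/PDB-Struct.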

Abstract
Structure-based protein design has attracted increasing interest, with numerous methods being introduced in recent years. However, a universally accepted method for evaluation has not been established, since the wet-lab validation can be overly time-consuming for the development of new algorithms, and the in silico validation with recovery and perplexity metrics is efficient but may not precisely reflect true foldability. To address this gap, we introduce two novel metrics: refoldability-based metric, which leverages high-accuracy protein structure prediction models as a proxy for wet lab experiments, and stability-based metric, which assesses whether models can assign high likelihoods to experimentally stable proteins. We curate datasets from high-quality CATH protein data, high-throughput de novo designed proteins, and mega-scale experimental mutagenesis experiments, and in doing so, present the PDB-Struct benchmark that evaluates both recent and previously uncompared protein design methods. Experimental results indicate that ByProt, ProteinMPNN, and ESM-IF perform exceptionally well on our benchmark, while ESM-Design and AF-Design fall short on the refoldability metric. We also show that while some methods exhibit high sequence recovery, they do not perform as well on our new benchmark. Our proposed benchmark paves the way for a fair and comprehensive evaluation of protein design methods in the future. Code is available at https://github.com/WANG-CR/PDB-Struct.
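The sequence recovery metric that the abstract contrasts with its new metrics is simple enough to sketch. The example assumes recovery is the fraction of positions where the designed residue matches the native one, for two aligned sequences of equal length; the function name is hypothetical and this is not the benchmark's own code.

```python
def sequence_recovery(designed, native):
    """Fraction of aligned positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)
```

For example, `sequence_recovery("ACDE", "ACDF")` gives 0.75. A design can score well here and still fail to fold, which is the gap the refoldability and stability metrics are meant to expose.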

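Summary
Structure-based protein design has attracted growing interest, and many new methods have appeared in recent years. However, no universally accepted evaluation method has been established: wet-lab validation can take too long for developing new algorithms, and in silico validation may not accurately reflect true foldability. To address this gap, the authors introduce two new metrics: a refoldability-based metric, which uses high-accuracy protein structure prediction models as a proxy for wet-lab experiments, and a stability-based metric, which assesses whether models can assign high likelihoods to experimentally stable proteins. They curate datasets from high-quality CATH protein data, high-throughput de novo designed proteins, and mega-scale experimental mutagenesis experiments, and present the PDB-Struct benchmark, which evaluates both recent and previously uncompared protein design methods. The results show that ByProt, ProteinMPNN, and ESM-IF perform exceptionally well on the benchmark, while ESM-Design and AF-Design fall short on the refoldability metric. Some methods also show high sequence recovery yet perform less well on the new benchmark. The proposed benchmark paves the way for fair and comprehensive evaluation of protein design methods. Code is available at the repository linked above.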

Leveraging cache to enable SLU on tiny devices

paper_url: http://arxiv.org/abs/2311.18188
repo_url: None
paper_authors: Afsara Benazir, Zhiming Xu, Felix Xiaozhu Lin
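for: This study explores spoken language understanding (SLU) on microcontroller-class embedded devices, combining on-device execution with cloud offloading. It exploits temporal locality in a device's speech inputs: new inputs are matched against cached results, and only unmatched inputs are offloaded to the cloud for full inference.
methods: The authors propose a speech cache, XYZ, that matches speech inputs against cached entries. It matches inputs at two levels of representation, first as clustered sequences of raw sound units and then as sequences of phonemes, and the two levels offer complementary cost/accuracy tradeoffs.
results: The implementation runs on an off-the-shelf STM32 microcontroller with a memory footprint of 2MB. On challenging speech benchmarks, the system resolves 45%-90% of inputs on device and reduces average latency compared with offloading to popular cloud speech services. The benefit holds in noisy environments, with a cold cache, and when one device is shared by multiple users.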

Abstract
This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, integrating on-device execution with cloud offloading in a novel fashion. We exploit temporal locality in a device's speech inputs and accordingly reuse recent SLU inferences. Our idea is simple: let the device match new inputs against cached results, and only offload unmatched inputs to the cloud for full inference. Realization of this idea, however, is non-trivial: the device needs to compare acoustic features in a robust, low-cost way. To this end, we present XYZ, a speech cache for tiny devices. It matches speech inputs at two levels of representations: first by clustered sequences of raw sound units, then as sequences of phonemes. Working in tandem, the two representations offer complementary cost/accuracy tradeoffs. To further boost accuracy, our cache is learning: with the mismatched and then offloaded inputs, it continuously finetunes the device's feature extractors (with the assistance of the cloud). We implement XYZ on an off-the-shelf STM32 microcontroller. The resultant implementation has a small memory footprint of 2MB. Evaluated on challenging speech benchmarks, our system resolves 45%--90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech services. Our benefit is pronounced even in adversarial settings -- noisy environments, cold cache, or one device shared by a number of users.
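The match-or-offload logic can be illustrated at the phoneme level. The sketch below assumes cached entries are keyed by phoneme tuples and uses Levenshtein edit distance with a fixed tolerance as the matching rule; the abstract does not specify how XYZ compares sequences, so the distance measure, the threshold, and the names `edit_distance` and `lookup` are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def lookup(cache, phonemes, max_distance):
    """Return the cached result closest to `phonemes`, or None to signal offload."""
    best_result, best_distance = None, max_distance + 1
    for cached_phonemes, result in cache.items():
        distance = edit_distance(cached_phonemes, phonemes)
        if distance < best_distance:
            best_result, best_distance = result, distance
    return best_result
```

A `None` return corresponds to a cache miss: the device would send the audio to the cloud and could insert the returned inference into the cache for later reuse.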

Summary

An Effective Universal Polynomial Basis for Spectral Graph Neural Networks

paper_url: http://arxiv.org/abs/2311.18177
repo_url: None
paper_authors: Keke Huang, Pietro Liò
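for: This paper addresses spectral graph neural networks (GNNs) on heterophily graphs, where optimal graph filters rely on costly Laplacian eigendecomposition.
methods: The paper develops an adaptive heterophily basis that incorporates graph heterophily degrees and integrates it with the homophily basis to create a universal polynomial basis, UniBasis.
results: Experiments show that UniFilter achieves significant gains on both real-world and synthetic datasets, demonstrating the effectiveness and generality of UniBasis and its promise for graph analysis.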

Abstract
Spectral Graph Neural Networks (GNNs), also referred to as graph filters have gained increasing prevalence for heterophily graphs. Optimal graph filters rely on Laplacian eigendecomposition for Fourier transform. In an attempt to avert the prohibitive computations, numerous polynomial filters by leveraging distinct polynomials have been proposed to approximate the desired graph filters. However, polynomials in the majority of polynomial filters are predefined and remain fixed across all graphs, failing to accommodate the diverse heterophily degrees across different graphs. To tackle this issue, we first investigate the correlation between polynomial bases of desired graph filters and the degrees of graph heterophily via a thorough theoretical analysis. Afterward, we develop an adaptive heterophily basis by incorporating graph heterophily degrees. Subsequently, we integrate this heterophily basis with the homophily basis, creating a universal polynomial basis UniBasis. In consequence, we devise a general polynomial filter UniFilter. Comprehensive experiments on both real-world and synthetic datasets with varying heterophily degrees significantly support the superiority of UniFilter, demonstrating the effectiveness and generality of UniBasis, as well as its promising capability as a new method for graph analysis.
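To show what a polynomial graph filter computes, here is a minimal sketch that avoids eigendecomposition by repeated propagation. It uses the plain monomial basis, summing `weights[k] * P^k x` over k; UniBasis replaces this fixed basis with one adapted to the graph's heterophily degree, which is not reproduced here. The function name and the dense list-of-lists matrix format are assumptions.

```python
def polynomial_filter(propagation, signal, weights):
    """Apply sum_k weights[k] * P^k to a graph signal using only mat-vec products.

    propagation: dense matrix P (list of rows), e.g. a normalized adjacency
    signal:      one value per node
    weights:     polynomial coefficients, weights[0] multiplies the raw signal
    """
    output = [0.0] * len(signal)
    basis = list(signal)  # P^0 x
    for w in weights:
        output = [o + w * b for o, b in zip(output, basis)]
        # Advance the basis vector: P^(k+1) x = P (P^k x).
        basis = [sum(p * b for p, b in zip(row, basis)) for row in propagation]
    return output
```

A degree-K filter costs K matrix-vector products, which is what makes polynomial filters practical on graphs where a full eigendecomposition is prohibitive.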

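Summary
Spectral graph neural networks (GNNs), also called graph filters, are increasingly used on heterophily graphs. Optimal graph filters, however, require Laplacian eigendecomposition, which is computationally prohibitive. To avoid this cost, many polynomial filters have been proposed that use different polynomials to approximate the desired graph filters. In most of these filters the polynomials are predefined and stay fixed across all graphs, so they cannot accommodate the differing heterophily degrees of different graphs. To address this, the authors first investigate the relation between the polynomial bases of the desired graph filters and the degree of graph heterophily through a thorough theoretical analysis. They then develop an adaptive heterophily basis by incorporating graph heterophily degrees, and integrate it with the homophily basis to obtain a universal polynomial basis, UniBasis. From this they derive a general polynomial filter, UniFilter. Comprehensive experiments on real-world and synthetic datasets with varying heterophily degrees show a clear advantage for UniFilter, supporting the effectiveness and generality of UniBasis and its promise as a method for graph analysis.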

Packrat: Automatic Reconfiguration for Latency Minimization in CPU-based DNN Serving

paper_url: http://arxiv.org/abs/2311.18174
repo_url: None
paper_authors: Ankit Bhardwaj, Amar Phanishayee, Deepak Narayanan, Mihail Tarta, Ryan Stutsman
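for: Improving the performance of serving Deep Neural Network (DNN) models on CPU-based servers.
methods: Intra-operator parallelism across multiple threads reduces inference latency but gives diminishing returns. The Packrat serving system algorithmically picks the number of instances, the number of threads per instance, and the batch size per instance that minimize inference latency.
results: Averaged across a range of batch sizes, Packrat improves inference latency by 1.43× to 1.83× on a range of commonly used DNNs.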

Abstract
In this paper, we investigate how to push the performance limits of serving Deep Neural Network (DNN) models on CPU-based servers. Specifically, we observe that while intra-operator parallelism across multiple threads is an effective way to reduce inference latency, it provides diminishing returns. Our primary insight is that instead of running a single instance of a model with all available threads on a server, running multiple instances each with smaller batch sizes and fewer threads for intra-op parallelism can provide lower inference latency. However, the right configuration is hard to determine manually since it is workload- (DNN model and batch size used by the serving system) and deployment-dependent (number of CPU cores on server). We present Packrat, a new serving system for online inference that given a model and batch size ($B$) algorithmically picks the optimal number of instances ($i$), the number of threads each should be allocated ($t$), and the batch sizes each should operate on ($b$) that minimizes latency. Packrat is built as an extension to TorchServe and supports online reconfigurations to avoid serving downtime. Averaged across a range of batch sizes, Packrat improves inference latency by 1.43$\times$ to 1.83$\times$ on a range of commonly used DNNs.
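The configuration search described above can be sketched in a heavily simplified form. The example assumes all instances are identical (same threads and batch share) and that a profiled latency function `latency(threads, batch)` is available; Packrat's actual algorithm is not given in the abstract and may allow non-uniform instances, so `pick_uniform_config` is a hypothetical illustration only.

```python
import math


def pick_uniform_config(total_threads, batch_size, latency):
    """Choose (instances, threads_per_instance, batch_per_instance) with lowest latency.

    Assumption: every instance gets the same number of threads and an equal
    share of the batch, and instances run concurrently, so the end-to-end
    latency equals the latency of one instance.
    """
    best_config, best_latency = None, math.inf
    for instances in range(1, min(total_threads, batch_size) + 1):
        threads = total_threads // instances
        per_instance_batch = math.ceil(batch_size / instances)
        cost = latency(threads, per_instance_batch)
        if cost < best_latency:
            best_config = (instances, threads, per_instance_batch)
            best_latency = cost
    return best_config
```

In practice the latency function would come from profiling the model on the target server, since the right answer depends on both the workload and the number of cores.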

Summary
Finding the right serving configuration by hand is hard because it depends on both the workload and the deployment. To address this challenge, we present Packrat, a new serving system for online inference that automatically selects the optimal number of instances, the number of threads each should be allocated, and the batch sizes each should operate on to minimize latency. Packrat is built as an extension to TorchServe and supports online reconfigurations to avoid serving downtime. We evaluate Packrat on a range of commonly used DNNs and find that it improves inference latency by 1.43 to 1.83 times, on average, across a range of batch sizes.

Towards A Foundation Model For Trajectory Intelligence

paper_url: http://arxiv.org/abs/2312.00076
repo_url: None
paper_authors: Alameen Najjar
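for: This paper trains a large trajectory model using real-world user check-in data.
methods: The approach follows a pre-train and fine-tune paradigm: a base model is pre-trained via masked trajectory modeling and then fine-tuned for various downstream tasks. To handle noisy data and large spatial vocabularies, the paper proposes a novel spatial tokenization block.
results: The experiments use more than 2 billion check-ins. Fine-tuning on 3 downstream tasks shows that the base model has effectively learned valuable underlying patterns in the raw data, enabling its use in meaningful trajectory intelligence tasks.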

Abstract
We present the results of training a large trajectory model using real-world user check-in data. Our approach follows a pre-train and fine-tune paradigm, where a base model is pre-trained via masked trajectory modeling and then adapted through fine-tuning for various downstream tasks. To address challenges posed by noisy data and large spatial vocabularies, we propose a novel spatial tokenization block. Our empirical analysis utilizes a comprehensive dataset of over 2 billion check-ins generated by more than 6 million users. Through fine-tuning on 3 downstream tasks we demonstrate that our base model has effectively learned valuable underlying patterns in raw data, enabling its application in meaningful trajectory intelligence tasks. Despite some limitations, we believe this work represents an important step forward in the realization of a foundation model for trajectory intelligence.

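Summary
We present the results of training a large trajectory model using real-world user check-in data. The approach follows a pre-train and fine-tune paradigm, in which a base model is pre-trained via masked trajectory modeling and then adapted to various downstream tasks through fine-tuning. To address the challenges posed by noisy data and large spatial vocabularies, we propose a novel spatial tokenization block. The experiments use more than 2 billion check-ins, with fine-tuning on 3 downstream tasks. The results show that the base model has effectively learned valuable underlying patterns in the raw data and can therefore be applied to meaningful trajectory intelligence tasks. Despite some limitations, we believe this work is an important step forward for trajectory intelligence.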

2023-11-30

eess.IV

eess.IV - 2023-11-30

A Novel Variational Approach for Multiphoton Microscopy Image Restoration: from PSF Estimation to 3D Deconvolution

paper_url: http://arxiv.org/abs/2311.18386
repo_url: None
paper_authors: Julien Ajdenbaum, Emilie Chouzenoux, Claire Lefort, Ségolène Martin, Jean-Christophe Pesquet
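for: Improving the quality of multiphoton microscopy (MPM) images.
methods: PSF estimation and image restoration are formulated as optimization problems, with the PSF estimated through a non-convex minimization.
results: The paper proposes a restoration pipeline that effectively improves MPM image quality and accounts for the estimated noise.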

Abstract
In multi-photon microscopy (MPM), a recent in-vivo fluorescence microscopy system, the task of image restoration can be decomposed into two interlinked inverse problems: firstly, the characterization of the Point Spread Function (PSF) and subsequently, the deconvolution (i.e., deblurring) to remove the PSF effect, and reduce noise. The acquired MPM image quality is critically affected by PSF blurring and intense noise. The PSF in MPM is highly spread in 3D and is not well characterized, presenting high variability with respect to the observed objects. This makes the restoration of MPM images challenging. Common PSF estimation methods in fluorescence microscopy, including MPM, involve capturing images of sub-resolution beads, followed by quantifying the resulting ellipsoidal 3D spot. In this work, we revisit this approach, coping with its inherent limitations in terms of accuracy and practicality. We estimate the PSF from the observation of relatively large beads (approximately 1$\mu$m in diameter). This goes through the formulation and resolution of an original non-convex minimization problem, for which we propose a proximal alternating method along with convergence guarantees. Following the PSF estimation step, we then introduce an innovative strategy to deal with the high level multiplicative noise degrading the acquisitions. We rely on a heteroscedastic noise model for which we estimate the parameters. We then solve a constrained optimization problem to restore the image, accounting for the estimated PSF and noise, while allowing a minimal hyper-parameter tuning. Theoretical guarantees are given for the restoration algorithm. These algorithmic contributions lead to an end-to-end pipeline for 3D image restoration in MPM, that we share as a publicly available Python software. We demonstrate its effectiveness through several experiments on both simulated and real data.

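Summary
Multi-photon microscopy (MPM) is a recent in-vivo fluorescence microscopy system whose image restoration problem splits into two linked inverse problems: first characterizing the Point Spread Function (PSF), then deconvolving with that PSF to remove blur and reduce noise. MPM image quality suffers from PSF blurring and intense noise. The PSF in MPM is highly spread in 3D, poorly characterized, and varies strongly with the observed object, which makes restoration difficult. PSF estimation in fluorescence microscopy usually images sub-resolution beads and quantifies the resulting 3D ellipsoidal spot. We revisit this approach to address its limits in accuracy and practicality, estimating the PSF from relatively large beads (about 1 μm in diameter). The estimation is posed as a non-convex minimization problem, for which we propose a proximal alternating method with convergence guarantees. After the PSF estimation step, we introduce a strategy for the strong multiplicative noise: we adopt a heteroscedastic noise model and estimate its parameters. We then solve a constrained optimization problem that restores the image using the estimated PSF and noise while requiring minimal hyper-parameter tuning. Theoretical guarantees are given for the restoration algorithm. These contributions yield an end-to-end image restoration pipeline, shared as publicly available Python software, whose effectiveness we demonstrate on both simulated and real data.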
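The abstract mentions estimating the parameters of a heteroscedastic noise model but does not give the model's form. The sketch below is only an illustration under a loud assumption: that the noise variance is affine in the signal, `var = a * mean + b`. The names `simulate_patch`, `fit_noise_model`, `a` and `b` are hypothetical and are not taken from the paper or its software. It fits a line through (sample mean, sample variance) pairs measured on flat patches.

```python
import random

def simulate_patch(mean, a, b, n, rng):
    """Draw n pixels around `mean` with signal-dependent variance a*mean + b (assumed model)."""
    sigma = (a * mean + b) ** 0.5
    return [mean + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]

def mean_and_variance(values):
    """Sample mean and unbiased sample variance of a list of floats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var

def fit_noise_model(patches):
    """Ordinary least-squares line through the (mean, variance) pair of each patch."""
    stats = [mean_and_variance(p) for p in patches]
    xs = [m for m, _ in stats]
    ys = [v for _, v in stats]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a_hat = sxy / sxx               # slope: signal-dependent part
    b_hat = y_bar - a_hat * x_bar   # intercept: signal-independent part
    return a_hat, b_hat

rng = random.Random(0)
true_a, true_b = 0.5, 0.2                      # hypothetical ground truth
levels = [1.0 + 0.5 * k for k in range(19)]    # flat intensities from 1.0 to 10.0
patches = [simulate_patch(m, true_a, true_b, 20000, rng) for m in levels]
a_hat, b_hat = fit_noise_model(patches)
print(f"a_hat={a_hat:.3f}, b_hat={b_hat:.3f}")
```

In real acquisitions flat patches are rarely available, so an estimator of this kind would have to work on local statistics of the image itself; the paper's actual estimator may differ entirely.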

Material decomposition for dual-energy propagation-based phase-contrast CT

paper_url: http://arxiv.org/abs/2311.18186
repo_url: None
paper_authors: Suyu Liao, Huitao Zhang, Peng Zhang, Yining Zhu
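for: This paper studies material decomposition in computed tomography (CT), that is, using the energy dependence of physical properties to differentiate the materials in a sample.
methods: The paper proposes an iterative method that merges phase retrieval, reconstruction, and material decomposition into a single step, obtaining the decomposition directly from the intensity data instead of reconstructing first and decomposing afterwards. Fresnel diffraction models the forward propagation, so the CT images are checked against the intensity data throughout the iterations.
results: Experiments show that the proposed method clearly outperforms conventional two-step methods in material decomposition accuracy and noise reduction.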

Abstract
Material decomposition refers to using the energy dependence of material physical properties to differentiate materials in a sample, which is a very important application in computed tomography (CT). In propagation-based X-ray phase-contrast CT, phase retrieval and reconstruction are always performed independently. Moreover, as in conventional CT, the material decomposition methods in this technique can be classified into two types, pre-reconstruction and post-reconstruction (two-step). The CT images in those methods often suffer from noise and artifacts because there is no feedback or correction from the intensity data. This work investigates an iterative method that obtains the material decomposition directly from the intensity data at different energies, which means that phase retrieval, reconstruction, and material decomposition are performed in one step. Fresnel diffraction is applied to the forward propagation, and the CT images interact with the intensity data throughout the iterative process. Experimental results demonstrate that, compared with two-step methods, the proposed method is superior in material decomposition accuracy and noise reduction.

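Summary
Material decomposition means using the energy dependence of material physical properties to differentiate the materials in a sample, a very important application in computed tomography (CT). In propagation-based X-ray phase-contrast CT, phase retrieval and reconstruction are always independent. As in conventional CT, the decomposition methods can be classified as pre-reconstruction or post-reconstruction (two-step). CT images produced by these methods often suffer from noise and artifacts because there is no feedback or correction from the intensity data. This work investigates an iterative method that obtains the material decomposition directly from the intensity data, so that phase retrieval, reconstruction, and material decomposition are completed in one step. During the iterations, Fresnel diffraction is applied to the forward propagation and the CT images interact with the intensity data. Experimental results show that, compared with two-step methods, the proposed method achieves more accurate material decomposition and better noise reduction.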
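For orientation, the conventional post-reconstruction (two-step) baseline that the abstract contrasts with can be sketched per voxel as a 2x2 linear solve. This is not the paper's one-step iterative method. The function name `decompose_two_materials` and all coefficient values are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
def decompose_two_materials(mu_low, mu_high, basis):
    """Solve for the fractions (f1, f2) of two basis materials in one voxel.

    mu_low, mu_high: reconstructed coefficients at the low and high energy.
    basis: ((m1_low, m2_low), (m1_high, m2_high)), the coefficients of the
           pure basis materials at each energy (hypothetical values here).
    Model: m1_low*f1  + m2_low*f2  = mu_low
           m1_high*f1 + m2_high*f2 = mu_high
    """
    (a, b), (c, d) = basis
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("basis materials are indistinguishable at these energies")
    # Cramer's rule for the 2x2 system.
    f1 = (d * mu_low - b * mu_high) / det
    f2 = (a * mu_high - c * mu_low) / det
    return f1, f2

# Hypothetical basis coefficients and a voxel made of 70% material 1, 30% material 2.
basis = ((0.30, 0.80), (0.20, 0.40))
mu_low = 0.30 * 0.7 + 0.80 * 0.3
mu_high = 0.20 * 0.7 + 0.40 * 0.3
f1, f2 = decompose_two_materials(mu_low, mu_high, basis)
print(f1, f2)
```

Because this solve is applied after reconstruction, any noise in the two reconstructed images propagates straight into the fractions, which is the lack of feedback from the intensity data that the abstract points out.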

2023-11-30

eess.SP

eess.SP - 2023-11-30

Over-the-Air Emulation of Electronically Adjustable Rician MIMO Channels in a Programmable-Metasurface-Stirred Reverberation Chamber

paper_url: http://arxiv.org/abs/2312.00199
repo_url: None
paper_authors: Ismail Ahmed, Matthieu Davy, Hugo Prod’homme, Philippe Besnier, Philipp del Hougne
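for: This paper studies the feasibility of evaluating multiple-input multiple-output (MIMO) wireless equipment under adjustable Rician fading channels.
methods: The paper uses a programmable-metasurface-stirred (PM-stirred) reverberation chamber (RC) to emulate Rician fading channels.
results: A range of Rician fading channels can be emulated in the PM-stirred RC, but with limits: the accessible K-factors have both an upper bound and a lower bound.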

Abstract
We experimentally investigate the feasibility of evaluating multiple-input multiple-output (MIMO) radio equipment under adjustable Rician fading channel conditions in a programmable-metasurface-stirred (PM-stirred) reverberation chamber (RC). Whereas within the "smart radio environment" paradigm PMs offer partial control over the channels to the wireless system, in our use case the PM emulates the uncontrollable fading. We implement a desired Rician K-factor by sweeping a suitably sized subset of all meta-atoms through random configurations. We discover in our setup an upper bound on the accessible K-factors for which the statistics of the channel coefficient distributions closely follow the sought-after Rician distribution. We also discover a lower bound on the accessible K-factors in our setup: there are unstirred paths that never encounter the PM, and paths that encounter the PM are not fully stirred because the average of the meta-atoms' accessible polarizability values is not zero (i.e., the meta-atoms have a non-zero "structural" cross-section). We corroborate these findings with experiments in an anechoic chamber, physics-compliant PhysFad simulations with Lorentzian vs "ideal" meta-atoms, and theoretical analysis. Our work clarifies the scope of applicability of PM-stirred RCs for MIMO Rician channel emulation, as well as electromagnetic compatibility test.

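Summary
We experimentally explore the feasibility of evaluating multiple-input multiple-output (MIMO) radio equipment under adjustable Rician fading channel conditions in a PM-stirred reverberation chamber. Under the "smart radio environment" paradigm, PMs give the wireless system partial control over the channels; in our use case, the PM instead emulates the uncontrollable fading. We implement a desired Rician K-factor by sweeping a suitably sized subset of all meta-atoms through random configurations. In our setup we find an upper bound on the accessible K-factors for which the channel coefficient statistics closely follow the target Rician distribution. We also find a lower bound: some unstirred paths never encounter the PM, and the paths that do are only partially stirred because the average of the meta-atoms' accessible polarizability values is not zero (the meta-atoms have a non-zero "structural" cross-section). We corroborate these findings with anechoic chamber experiments, PhysFad simulations, and theoretical analysis. Our work clarifies the scope of applicability of PM-stirred RCs for MIMO Rician channel emulation and electromagnetic compatibility testing.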
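As a reminder of what the K-factor measures, the sketch below estimates it from complex channel samples as the ratio of unstirred power (squared magnitude of the sample mean) to stirred power (sample variance). This is the standard moment-based definition, not necessarily the estimator used in the paper. The helpers `simulate_channel` and `rician_k_factor` are hypothetical names.

```python
import random

def simulate_channel(k_factor, n, rng):
    """Unit-power Rician samples: fixed (unstirred) part plus complex Gaussian (stirred) part."""
    los = (k_factor / (k_factor + 1)) ** 0.5          # unstirred amplitude
    sigma = (1.0 / (2 * (k_factor + 1))) ** 0.5       # std per real/imag component
    return [complex(los + sigma * rng.gauss(0, 1), sigma * rng.gauss(0, 1))
            for _ in range(n)]

def rician_k_factor(samples):
    """K = |mean(h)|^2 / var(h), estimated over the stirring configurations."""
    n = len(samples)
    mean = sum(samples) / n
    stirred_power = sum(abs(h - mean) ** 2 for h in samples) / n
    return abs(mean) ** 2 / stirred_power

rng = random.Random(1)
samples = simulate_channel(k_factor=4.0, n=50000, rng=rng)
k_hat = rician_k_factor(samples)
print(f"estimated K = {k_hat:.2f}")
```

In the chamber setting, each sample would correspond to one random configuration of the swept meta-atoms; stirring fewer meta-atoms leaves more unstirred power and therefore a larger K.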

Adversarial Attacks and Defenses for Wireless Signal Classifiers using CDI-aware GANs

paper_url: http://arxiv.org/abs/2311.18820
repo_url: None
paper_authors: Sujata Sinha, Alkan Soysal
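for: This work develops a Channel Distribution Information (CDI)-aware Generative Adversarial Network (GAN) to address the unique challenges of adversarial attacks in wireless communication systems.
methods: The generator of the CDI-aware GAN maps random input noise to the feature space, producing perturbations meant to deceive a target modulation classifier. Its discriminators play a dual role: one keeps the perturbations Gaussian-distributed so that they resemble Gaussian noise; the other ensures that the perturbations account for realistic channel effects and resemble no-channel perturbations.
results: As an attacker, the CDI-aware GAN generates robust perturbations that deceive the target classifier and outperform known methods. As a defender, it significantly improves the target classifier's resilience to adversarial attacks.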

Abstract
We introduce a Channel Distribution Information (CDI)-aware Generative Adversarial Network (GAN), designed to address the unique challenges of adversarial attacks in wireless communication systems. The generator in this CDI-aware GAN maps random input noise to the feature space, generating perturbations intended to deceive a target modulation classifier. Its discriminators play a dual role: one enforces that the perturbations follow a Gaussian distribution, making them indistinguishable from Gaussian noise, while the other ensures these perturbations account for realistic channel effects and resemble no-channel perturbations. Our proposed CDI-aware GAN can be used as an attacker and a defender. In attack scenarios, the CDI-aware GAN demonstrates its prowess by generating robust adversarial perturbations that effectively deceive the target classifier, outperforming known methods. Furthermore, CDI-aware GAN as a defender significantly improves the target classifier's resilience against adversarial attacks.

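Summary
We introduce a Channel Distribution Information (CDI)-aware Generative Adversarial Network (GAN) that addresses the unique challenges of adversarial attacks in wireless communication systems. The generator maps random input noise to the feature space, producing perturbations intended to deceive a target modulation classifier. The discriminators play a dual role: one keeps the perturbations Gaussian-distributed, so they cannot be told apart from random noise, and the other ensures that the perturbations account for realistic channel effects and cannot be told apart from no-channel perturbations. The proposed CDI-aware GAN can act as an attacker or a defender. As an attacker, it generates robust adversarial perturbations that effectively deceive the target classifier and outperform known methods. As a defender, it substantially improves the target classifier's resilience to attacks.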

Performance Analysis of Integrated Sensing and Communications Under Gain-Phase Imperfections

paper_url: http://arxiv.org/abs/2311.18762
repo_url: None
paper_authors: Shuaishuai Han, Mohammad Ahmad Al-Jarrah, Emad Alsusa
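for: This study evaluates the performance of uplink integrated sensing and communication systems, in which unmanned aerial vehicles transmit data to a base station, in the presence of gain and phase imperfections.
methods: A maximum likelihood algorithm is used for localisation and is compared with ESPRIT (estimation of signal parameters via rotational invariance techniques) and MUSIC.
results: Gain and phase imperfections significantly affect both localisation and communication, but the proposed maximum likelihood algorithm is less sensitive to them than the other algorithms.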

Abstract
This paper evaluates the performance of uplink integrated sensing and communication systems in the presence of gain and phase imperfections. Specifically, we consider multiple unmanned aerial vehicles (UAVs) transmitting data to a multiple-input-multiple-output base-station (BS) that is responsible for estimating the transmitted information in addition to localising the transmitting UAVs. The signal processing at the BS is divided into two consecutive stages: localisation and communication. A maximum likelihood (ML) algorithm is introduced for the localisation stage to jointly estimate the azimuth-elevation angles and Doppler frequency of the UAVs under gain-phase defects, which are then compared to the estimation of signal parameters via rotational invariance techniques (ESPRIT) and multiple signal classification (MUSIC). Furthermore, the Cramer-Rao lower bound (CRLB) is derived to evaluate the asymptotic performance and quantify the influence of the gain-phase imperfections which are modelled using Rician and von Mises distributions, respectively. Thereafter, in the communication stage, the location parameters estimated in the first stage are employed to estimate the communication channels which are fed into a maximum ratio combiner to preprocess the received communication signal. An accurate closed-form approximation of the achievable average sum data rate (SDR) for all UAVs is derived. The obtained results show that gain-phase imperfections have a significant influence on both localisation and communication, however, the proposed ML is less sensitive when compared to other algorithms. The derived analysis is concurred with simulations.

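Summary
This paper evaluates the performance of uplink integrated sensing and communication systems with unmanned aerial vehicles (UAVs) in the presence of gain and phase imperfections. We consider multiple UAVs transmitting data to a multiple-input multiple-output base station (BS), which estimates the transmitted information and also localises the UAVs. Signal processing at the BS is split into two stages: localisation and communication. For the localisation stage we introduce a maximum likelihood (ML) algorithm that jointly estimates the azimuth-elevation angles and Doppler frequency of the UAVs, and we compare it with ESPRIT and MUSIC. We also derive the Cramer-Rao lower bound (CRLB) to evaluate the asymptotic performance and quantify the influence of the gain-phase imperfections. In the communication stage, the location parameters estimated in the first stage are used to estimate the communication channels, which feed a maximum ratio combiner that preprocesses the received signal. We derive an accurate closed-form approximation of the achievable average sum data rate (SDR) for all UAVs. The results show that gain-phase imperfections have a large influence, but the proposed ML algorithm is less sensitive to them than the other algorithms. The analysis agrees with simulations.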

OISA: Architecting an Optical In-Sensor Accelerator for Efficient Visual Computing

paper_url: http://arxiv.org/abs/2311.18655
repo_url: None
paper_authors: Mehrdad Morsali, Sepehr Tabrizchi, Deniz Najafi, Mohsen Imani, Mahdi Nikdast, Arman Roohi, Shaahin Angizi
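for: This work targets vision applications at the edge with a high-performance, energy-efficient Optical In-Sensor Accelerator architecture (OISA).
methods: The architecture implements a coarse-grained convolution operation for low-bit-width neural networks in an innovative minimum-conversion fashion.
results: OISA reaches acceptable accuracy on various image datasets and reduces power consumption by average factors of 7.9 and 18.4 compared with electronic in-/near-sensor and ASIC accelerators, respectively.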

Abstract
Targeting vision applications at the edge, in this work, we systematically explore and propose a high-performance and energy-efficient Optical In-Sensor Accelerator architecture called OISA for the first time. Taking advantage of the promising efficiency of photonic devices, the OISA intrinsically implements a coarse-grained convolution operation on the input frames in an innovative minimum-conversion fashion in low-bit-width neural networks. Such a design remarkably reduces the power consumption of data conversion, transmission, and processing in the conventional cloud-centric architecture as well as recently-presented edge accelerators. Our device-to-architecture simulation results on various image data-sets demonstrate acceptable accuracy while OISA achieves 6.68 TOp/s/W efficiency. OISA reduces power consumption by a factor of 7.9 and 18.4 on average compared with existing electronic in-/near-sensor and ASIC accelerators.

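Summary
Targeting vision applications at the edge, this work systematically explores and proposes a high-performance, energy-efficient optical in-sensor accelerator architecture called OISA. Taking advantage of the promising efficiency of photonic devices, OISA intrinsically implements a coarse-grained convolution operation on the input frames in a minimum-conversion fashion. This design markedly reduces the power spent on data conversion, transmission, and processing compared with the conventional cloud-centric architecture and recently presented edge accelerators. Our device-to-architecture simulation results show that OISA reaches acceptable accuracy while achieving an efficiency of 6.68 TOp/s/W. Compared with electronic in-/near-sensor and ASIC accelerators, OISA reduces power consumption by average factors of 7.9 and 18.4, respectively.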

Robust-to-Noise Algorithms for Distributed Resource Allocation and Scheduling

paper_url: http://arxiv.org/abs/2311.18646
repo_url: None
paper_authors: Mohammadreza Doostmohammadian, Alireza Aghasi
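for: This paper targets distributed applications such as wireless networks, cloud computing platforms, and autonomous multi-agent systems, proposing robust resource allocation and scheduling algorithms that tolerate noise and disturbances.
methods: The paper introduces a novel sign-based dynamics that lets the agents of a distributed network handle noise adaptively. It also draws on convex optimization theory, control theory, and network science to build a principled approach to designing such algorithms.
results: The resulting algorithms cope with noise and disturbances under different network conditions while maintaining resource-demand balance and constraint feasibility. Uniform connectivity and versatile networking conditions are also addressed.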

Abstract
Efficient resource allocation and scheduling algorithms are essential for various distributed applications, ranging from wireless networks and cloud computing platforms to autonomous multi-agent systems and swarm robotic networks. However, real-world environments are often plagued by uncertainties and noise, leading to sub-optimal performance and increased vulnerability of traditional algorithms. This paper addresses the challenge of robust resource allocation and scheduling in the presence of noise and disturbances. The proposed study introduces a novel sign-based dynamics for developing robust-to-noise algorithms distributed over a multi-agent network that can adaptively handle external disturbances. Leveraging concepts from convex optimization theory, control theory, and network science the framework establishes a principled approach to design algorithms that can maintain key properties such as resource-demand balance and constraint feasibility. Meanwhile, notions of uniform-connectivity and versatile networking conditions are also addressed.

Summary
Multi-agent systems and distributed applications, from wireless networks and cloud computing platforms to autonomous multi-agent systems and swarm robotic networks, need efficient resource allocation and scheduling algorithms. Real-world environments are affected by uncertainties and noise, which degrade the performance of traditional algorithms and make them more vulnerable. This paper addresses resource allocation and scheduling in the presence of noise and disturbances. It proposes robust algorithms based on sign-based dynamics over a multi-agent network that can adapt to external disturbances. Using convex optimization theory, control theory, and network science, it establishes a principled design approach that maintains resource-demand balance and constraint feasibility. Uniform connectivity and versatile networking conditions are also considered.
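The abstract does not spell out the sign-based dynamics, so the following is only a toy sketch under stated assumptions: each agent holds a share `x[i]` of a fixed total resource, has a quadratic cost `0.5 * curvature[i] * x[i]**2`, and on every edge a fixed-size amount flows from the agent with the larger marginal cost to the one with the smaller. All names and numbers are hypothetical. Because every transfer is antisymmetric across an edge, the total (resource-demand balance) is preserved at every step regardless of how the signs turn out.

```python
def sign(v):
    """Return -1, 0 or +1."""
    return (v > 0) - (v < 0)

def gradient_spread(x, curvature):
    """Gap between the largest and smallest marginal cost; zero at the optimum."""
    grads = [a * xi for a, xi in zip(curvature, x)]
    return max(grads) - min(grads)

def sign_based_allocation(x, curvature, edges, step, iterations):
    """Edge-wise sign dynamics: only the sign of the marginal-cost difference is used."""
    x = list(x)
    for _ in range(iterations):
        grads = [a * xi for a, xi in zip(curvature, x)]
        # Compute all flows from the same snapshot, then apply them.
        flows = [step * sign(grads[i] - grads[j]) for i, j in edges]
        for (i, j), flow in zip(edges, flows):
            x[i] -= flow   # agent with higher marginal cost gives up resource
            x[j] += flow   # its neighbour receives exactly the same amount
    return x

curvature = [1.0, 2.0, 3.0, 4.0]          # hypothetical cost curvatures
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # ring network of four agents
x_start = [10.0, 0.0, 0.0, 0.0]           # all resource initially at agent 0
x_final = sign_based_allocation(x_start, curvature, edges, step=0.01, iterations=2000)
print([round(v, 2) for v in x_final], round(sum(x_final), 6))
```

Since only signs are exchanged, bounded perturbations of the gradients change the flow on an edge only when they flip the sign of a difference, which is one intuition for why such dynamics can tolerate noise; the paper's precise dynamics and guarantees may differ from this toy version.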

A New Old Idea: Beam-Steering Reflectarrays for Efficient Sub-THz Multiuser MIMO

paper_url: http://arxiv.org/abs/2311.18593
repo_url: None
paper_authors: Krishan Kumar Tiwari, Giuseppe Caire
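for: This paper addresses multiuser MIMO with a multiuser, multibeam Reflective Intelligent Surface (RIS) architecture, especially for very high frequency bands (e.g., high mmWave and sub-THz), where channels are typically sparse in the beamspace and LOS is the dominant component.
methods: The paper uses a novel, power- and hardware-efficient multiuser multibeam RIS architecture built from an active multiantenna feeder (AMAF) and a much larger set of passive controllable reflecting elements. A pragmatic design approach yields a steerable, high-gain beam with very low sidelobes.
results: The analysis covers the mutual interference between modules and the frequency selectivity of the AMAF-RIS structure, and shows that "beam focusing" is also possible in the far field, creating spotbeams with limited footprint in both angle and range. The results show that: 1) simple RF beamforming (BF), without computationally expensive baseband multiuser precoding, practically eliminates multiuser interference; 2) the impact of RIS quantized phase-shifters is negligible; 3) the proposed architecture is more power efficient and simpler to implement in hardware than conventional active arrays.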

Abstract
We present a novel, power- & hardware-efficient, multiuser, multibeam RIS (Reflective Intelligent Surface) architecture for multiuser MIMO, especially for very high frequency bands (e.g., high mmWave and sub-THz), where channels are typically sparse in the beamspace and LOS is the dominant component. The key module is formed by an active multiantenna feeder (AMAF) with a small number of active antennas, placed in the near field of a RIS with a much larger number of passive controllable reflecting elements. We propose a pragmatic approach to obtain a steerable beam with high gain and very low sidelobes. Then K independently controlled beams can be achieved by closely stacking K such AMAF-RIS modules. Our analysis includes the mutual interference between the modules and the fact that, due to the delay difference of propagation through the AMAF-RIS structure, the resulting channel matrix is frequency selective even for pure LOS propagation. We consider a 3D geometry and show that "beam focusing" is in fact possible (and much more effective in terms of coverage) also in the far-field, by creating spotbeams with limited footprint both in angle and in range. Our results show that: 1) simple RF beamforming (BF) without computationally expensive baseband multiuser precoding is sufficient to practically eliminate multiuser interference when the users are chosen with sufficient angular/range separation, thanks to the extremely low sidelobe beams; 2) the impact of beam pointing errors with standard deviation as large as 2.5 deg and RIS quantized phase-shifters with quantization bits > 2 is essentially negligible; 3) The proposed architecture is more power efficient & much simpler from a hardware implementation viewpoint than standard active arrays with the same BF performance. We show also that the array gain of the proposed AMAF-RIS structure is linear with the RIS aperture.

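Summary
We propose a novel, power- and hardware-efficient multiuser multibeam reflective intelligent surface (RIS) architecture, especially suited to very high frequency bands (e.g., high mmWave and sub-THz), where channels are typically sparse in the beamspace and line of sight (LOS) is the dominant component. The key module consists of an active multiantenna feeder (AMAF) with a small number of active antennas, placed in the near field of a RIS with a much larger number of passive controllable reflecting elements. We propose a pragmatic approach to obtain a steerable beam with high gain and very low sidelobes. K independently controlled beams are then obtained by closely stacking K such AMAF-RIS modules. Our analysis includes the interference between modules and the frequency-selective channel matrix caused by the delay differences in the AMAF-RIS structure. We consider a 3D geometry and show that "beam focusing" is also possible (and more effective for coverage) in the far field, by creating spotbeams with limited footprint in both angle and range. Our results show that: 1) simple RF beamforming (BF), without computationally expensive baseband multiuser precoding, practically eliminates multiuser interference thanks to the very low sidelobes; 2) the impact of beam pointing errors with standard deviation up to 2.5 degrees, and of RIS phase-shifters quantized with more than 2 bits, is essentially negligible; 3) the proposed architecture is more power efficient and simpler to implement in hardware than standard active arrays. We also show that the array gain of the AMAF-RIS structure is linear in the RIS aperture.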
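The claim that phase-shifter quantization with more than 2 bits is nearly harmless can be illustrated on a much simpler stand-in: a uniform linear array steered by per-element phases, with those phases rounded to a 3-bit grid. This is an assumption-laden toy model, not the AMAF-RIS geometry of the paper, and all function names are hypothetical. With rounding, each phase error is at most half a quantization step, so the coherent sum loses at most a factor `cos(pi / 2**bits)` in amplitude.

```python
import cmath
import math

def steering_phases(n, spacing_wavelengths, angle_deg):
    """Ideal per-element phases that align all contributions toward angle_deg."""
    psi = 2 * math.pi * spacing_wavelengths * math.sin(math.radians(angle_deg))
    return [-psi * i for i in range(n)]

def quantize(phase, bits):
    """Round a phase to the nearest multiple of 2*pi / 2**bits."""
    step = 2 * math.pi / (2 ** bits)
    return round(phase / step) * step

def array_gain_db(phases, spacing_wavelengths, angle_deg):
    """Gain toward angle_deg relative to a single element, in dB."""
    n = len(phases)
    psi = 2 * math.pi * spacing_wavelengths * math.sin(math.radians(angle_deg))
    field = sum(cmath.exp(1j * (psi * i + phases[i])) for i in range(n))
    return 10 * math.log10(abs(field) ** 2 / n)

n, spacing, angle = 64, 0.5, 20.0
ideal = steering_phases(n, spacing, angle)
coarse = [quantize(p, bits=3) for p in ideal]
ideal_db = array_gain_db(ideal, spacing, angle)
quantized_db = array_gain_db(coarse, spacing, angle)
print(f"ideal {ideal_db:.2f} dB, 3-bit {quantized_db:.2f} dB")
```

For 3 bits the worst-case bound is `-20*log10(cos(pi/8))`, about 0.69 dB; the typical loss is smaller because the rounding errors are spread over the interval rather than all sitting at its edge.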

RIS-Assisted Generalized Receive Quadrature Spatial Modulation

paper_url: http://arxiv.org/abs/2311.18542
repo_url: None
paper_authors: Mohamad H. Dinan, Mark F. Flanagan
for: Improving the spectral efficiency of RIS-aided quadrature spatial modulation (QSM) systems by using the concept of generalized spatial modulation (GSM), so that multiple receive antennas are activated independently.
methods: A max-min optimization problem over the phase shifts of the RIS elements is formulated; Lagrange duality reduces the non-convex problem to a convex one whose number of variables equals the number of activated receive antennas.
results: The proposed scheme outperforms benchmark schemes in terms of error rate performance, especially in systems with a larger number of receive antennas. In the special case where each receive antenna corresponds to a user and is activated, the RIS-GRQSM system becomes a multicast communication system, and the proposed solution offers low complexity and practical feasibility of implementation.

Abstract
In this paper, reconfigurable intelligent surface (RIS)-assisted generalized receive quadrature spatial modulation (RIS-GRQSM) is proposed to improve the spectral efficiency of RIS-aided quadrature spatial modulation (QSM) systems by utilizing the concept of generalized spatial modulation (GSM). That is, multiple antennas are activated at the receiver independently for both the real and imaginary parts. We propose a max-min optimization problem to adjust the phase shifts of all RIS elements to maximize the relevant signal-to-noise ratios (SNRs) at all activated receive antennas. Using Lagrange duality, the non-convex optimization problem involving the phase shifts of all RIS elements reduces to a convex optimization involving a number of variables equal to the number of activated receive antennas. A successive greedy detector (GD) can be used at the receiver to detect the active antennas, which simplifies the detection process. The numerical results show that the proposed scheme outperforms the benchmark schemes in terms of error rate performance, especially in systems with a larger number of receive antennas. In the special case where each receive antenna corresponds to a user and is activated, the RIS-GRQSM system becomes a multicast communication system. In this context, in contrast to existing phase shift optimization algorithms which exhibit an impractical level of complexity, our proposed solution offers the advantage of low complexity and practical feasibility of implementation.

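Summary
This paper proposes reconfigurable intelligent surface (RIS)-assisted generalized receive quadrature spatial modulation (RIS-GRQSM), which uses the concept of generalized spatial modulation (GSM) to improve the spectral efficiency of RIS-aided quadrature spatial modulation (QSM) systems. Specifically, multiple antennas are activated independently at the receiver for the real and imaginary parts. We formulate a max-min optimization problem that adjusts the phase shifts of all RIS elements to maximize the relevant signal-to-noise ratios (SNRs). Using Lagrange duality, the non-convex problem reduces to a convex problem whose number of variables equals the number of activated receive antennas. A successive greedy detector (GD) can be used at the receiver to detect the active antennas, which simplifies detection. Numerical results show that the proposed scheme outperforms the benchmark schemes in error rate performance, especially in systems with more receive antennas. In the special case where each receive antenna corresponds to a user and is activated, the RIS-GRQSM system becomes a multicast communication system. In that setting, unlike existing phase shift optimization algorithms, the proposed solution offers low complexity and practical feasibility of implementation.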

Enhancing EEG Dataset Resources for Schizophrenia Diagnosis: Inaugural West-African (Nigerian) Endeavor

paper_url: http://arxiv.org/abs/2311.18484
repo_url: None
paper_authors: E. O. Olateju, K. P. Ayodele, S. K. Mosaku
for: The paper aims to address the scarcity of high-quality EEG datasets for schizophrenia diagnostic tools development and studies, specifically from populations of developing and underdeveloped regions of the world.
methods: The paper presents an EEG dataset of international 10/20 system recordings from West African subjects of Nigerian origin, covering resting conditions, mental arithmetic task execution, and passive reaction to auditory stimuli. The dataset includes 36 cases and 21 healthy controls, identified using the Mini International Schizophrenia Interview (MINI) and assessed by the Positive and Negative Symptoms Scale (PANSS) and the World Health Organization Disability Assessment Schedule (WHODAS).
results: The dataset can be used by the neuroscience and computational psychiatry research community studying the diagnosis and prognosis of schizophrenia using the electroencephalogram signal modality.

Abstract
This work has been carried out to address the dearth of high-quality EEG datasets used for schizophrenia diagnostic tools development and studies from populations of developing and underdeveloped regions of the world. To this aim, the presented dataset contains international 10/20 system EEG recordings from West African subjects of Nigerian origin at rest, during mental arithmetic task execution, and while passively reacting to auditory stimuli. The subjects comprise 36 cases and 21 healthy controls, identified by the Mini International Schizophrenia Interview (MINI) and also assessed by the Positive and Negative Symptoms Scale (PANSS) and the World Health Organization Disability Assessment Schedule (WHODAS). All cases are admitted schizophrenia patients of the Mental Health Ward, Medical Outpatient Department of the Obafemi Awolowo University Teaching Hospital Complex (OAUTHC, Ile-Ife) and its subsidiary Wesley Guild Hospital Unit (OAUTHC, Ilesa). Controls are drawn from students who volunteered to participate in the study at the Mental Health Ward of OAUTHC and the Wesley Guild Hospital Unit. The recordings are available at Datasets. This dataset can be used by the neuroscience and computational psychiatry research community studying the diagnosis and prognosis of schizophrenia using the electroencephalogram signal modality.

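Summary
This work addresses the shortage of high-quality EEG datasets, from developing and underdeveloped regions, for building and studying schizophrenia diagnostic tools. The dataset contains international 10/20 system EEG recordings from West African subjects of Nigerian origin, recorded at rest, during mental arithmetic task execution, and while reacting to auditory stimuli. Participants are divided into cases and healthy controls, with 36 cases and 21 healthy controls, identified with the MINI interview and assessed with PANSS and WHODAS. All cases are schizophrenia patients diagnosed at the hospital; controls are student volunteers. The recordings are available to neuroscience and computational psychiatry researchers studying the diagnosis and prognosis of schizophrenia from the electroencephalogram signal.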

Computing an Entire Solution Path of a Nonconvexly Regularized Convex Sparse Model

paper_url: http://arxiv.org/abs/2311.18438
repo_url: None
paper_authors: Yi Zhang, Isao Yamada
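for: This paper studies nonconvex sparse regularization, specifically how to solve the generalized minimax concave (GMC) penalty model.
methods: The paper extends the least angle regression (LARS) algorithm to handle the nonconvex sparse regularization in the scaled GMC (sGMC) model.
results: The paper proposes a LARS-based algorithm that computes a solution path of the sGMC model and, under suitable conditions, proves its correctness and finite termination.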

Abstract
The generalized minimax concave (GMC) penalty is a nonconvex sparse regularizer which can preserve the overall-convexity of the sparse least squares problem. In this paper, we study the solution path of a special but important instance of the GMC model termed the scaled GMC (sGMC) model. We show that despite the nonconvexity of the regularizer, there exists a solution path of the sGMC model which is piecewise linear as a function of the tuning parameter, and we propose an efficient algorithm for computing a solution path of this type. Our algorithm is an extension of the well-known least angle regression (LARS) algorithm for LASSO, hence we term the proposed algorithm LARS-sGMC. Under suitable conditions, we provide a proof of the correctness and finite termination of the proposed LARS-sGMC algorithm. This article also serves as an appendix for the short paper titled ``COMPUTING AN ENTIRE SOLUTION PATH OF A NONCONVEXLY REGULARIZED CONVEX SPARSE MODEL", and addresses proofs and technical derivations that were omitted in the original paper due to space limitation.

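Summary
The generalized minimax concave (GMC) penalty is a nonconvex sparse regularizer that preserves the overall convexity of the sparse least squares problem. This paper studies a special but important instance of the GMC model, the scaled GMC (sGMC) model. We show that, although the regularizer is nonconvex, there exists a solution path that is piecewise linear as a function of the tuning parameter, and we propose an efficient algorithm to compute a path of this type. The algorithm extends LARS, so we call it LARS-sGMC. Under suitable conditions, we prove its correctness and finite termination. This article also serves as an appendix to the short paper "COMPUTING AN ENTIRE SOLUTION PATH OF A NONCONVEXLY REGULARIZED CONVEX SPARSE MODEL" and covers the proofs and technical derivations omitted from the original paper for lack of space.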
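To make "piecewise linear as a function of the tuning parameter" concrete, here is the simplest case where a solution path has that shape: the plain LASSO with an orthonormal design, whose solution is soft-thresholding of the correlations `X^T y`. This is the classical LASSO picture that LARS exploits, not the sGMC model or the LARS-sGMC algorithm; the function names are hypothetical.

```python
def soft_threshold(z, lam):
    """Closed-form LASSO coefficient for an orthonormal design."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_path_orthonormal(correlations, lambdas):
    """Solution of min 0.5*||y - X b||^2 + lam*||b||_1 with X^T X = I, for each lam."""
    return [[soft_threshold(z, lam) for z in correlations] for lam in lambdas]

def breakpoints(correlations):
    """Values of lam where a coefficient enters the active set (kinks of the path)."""
    return sorted({abs(z) for z in correlations}, reverse=True)

correlations = [3.0, -1.5, 0.5]   # hypothetical X^T y
kinks = breakpoints(correlations)
path = lasso_path_orthonormal(correlations, [2.0, 2.25, 2.5, 1.0])
print(kinks)
print(path)
```

Between two consecutive kinks the active set is fixed and every coefficient is an affine function of `lam`, so storing the solutions at the kinks is enough to recover the entire path by linear interpolation. A LARS-type algorithm follows this idea for general designs by moving from kink to kink.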

Beamforming Design for Active RIS-Aided Over-the-Air Computation

paper_url: http://arxiv.org/abs/2311.18418
repo_url: None
paper_authors: Deyou Zhang, Ming Xiao, Mikael Skoglund, H. Vincent Poor
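for: This paper aims to improve wireless data aggregation by introducing an active reconfigurable intelligent surface (RIS) into over-the-air computation, mitigating the performance bottleneck caused by users with poor channel conditions.
methods: The paper proposes a design for the active RIS that jointly optimizes the transceiver design and the RIS configuration to minimize the mean squared error (MSE) between the target and estimated function values. The resulting tri-convex optimization problem is handled with alternating optimization (AO), which splits it into three solvable convex subproblems.
results: The analysis of two specific cases shows the superiority of the active RIS in terms of MSE, and the design is adapted to account for the self-interference of the active RIS. A two-layer AO framework handles the resulting highly non-convex problem. Simulation results show that the active RIS improves over-the-air computation performance.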

Abstract
Over-the-air computation (AirComp) is emerging as a promising technology for wireless data aggregation. However, its performance is hampered by users with poor channel conditions. To mitigate such a performance bottleneck, this paper introduces an active reconfigurable intelligent surface (RIS) into the AirComp system. Specifically, we begin by exploring the ideal RIS model and propose a joint optimization of the transceiver design and RIS configuration to minimize the mean squared error (MSE) between the target and estimated function values. To manage the resultant tri-convex optimization problem, we employ the alternating optimization (AO) technique to decompose it into three convex subproblems, each solvable optimally. Subsequently, we investigate two specific cases and analyze their respective asymptotic performance to reveal the superiority of the active RIS in mitigating the MSE relative to its passive counterpart. Lastly, we adapt our transceiver and RIS configuration design to account for the self-interference of the active RIS. To handle the resultant highly non-convex problem, we further devise a two-layer AO framework. Simulation results demonstrate the superiority of the active RIS in enhancing AirComp performance compared to its passive counterpart.

Summary
Over-the-air computation (AirComp) is emerging as a promising technology for wireless data aggregation. Its performance, however, is limited by users with poor channel conditions. To remove this bottleneck, this paper introduces an active reconfigurable intelligent surface (RIS) into the AirComp system. We begin by exploring the ideal RIS model and propose a joint optimization of the transceiver design and RIS configuration that minimizes the mean squared error (MSE) between the target and estimated function values. To handle the resulting tri-convex optimization problem, we use alternating optimization (AO) to decompose it into three convex subproblems, each solvable optimally. We then investigate two specific cases and analyze their asymptotic performance to show the superiority of the active RIS over its passive counterpart in terms of MSE. Lastly, we adapt the transceiver and RIS configuration design to account for the self-interference of the active RIS, and devise a two-layer AO framework for the resulting highly non-convex problem. Simulation results demonstrate that the active RIS enhances AirComp performance compared with its passive counterpart.
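Why a single user with a poor channel becomes the bottleneck can be seen in a textbook toy model, sketched below under explicit assumptions: a single-antenna receiver, unit-variance user symbols, perfect channel knowledge, and channel-inversion transmit scaling under a per-user power limit. No RIS is modelled; `aircomp_mse` and the channel values are hypothetical. Every user must arrive with the same amplitude `sqrt(eta)`, so `eta` is capped by the weakest user, and the MSE of the estimated sum is `noise_var / eta`.

```python
def aircomp_mse(channels, max_power, noise_var):
    """MSE of the over-the-air sum with channel-inversion transmit scaling.

    User k transmits (sqrt(eta) / h_k) * s_k, which needs power eta / |h_k|^2.
    The power limit therefore forces eta <= max_power * |h_k|^2 for every k.
    The receiver divides by sqrt(eta), leaving noise of variance noise_var / eta.
    """
    eta = max_power * min(abs(h) ** 2 for h in channels)
    return noise_var / eta

channels = [1.0, 0.8, 0.1]           # hypothetical gains; the last user is weak
mse_weak = aircomp_mse(channels, max_power=1.0, noise_var=0.1)

# Hypothetical improvement of only the weakest link (e.g. by an added reflection path).
boosted = [1.0, 0.8, 0.5]
mse_boosted = aircomp_mse(boosted, max_power=1.0, noise_var=0.1)
print(mse_weak, mse_boosted)
```

In this toy model the MSE depends only on the minimum channel gain, so strengthening the weakest link lowers the error for the whole aggregation, which is the motivation the abstract gives for adding an RIS. The paper's joint transceiver and RIS optimization is considerably more general than this.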

URLLC-Awared Resource Allocation for Heterogeneous Vehicular Edge Computing

paper_url: http://arxiv.org/abs/2311.18352
repo_url: https://github.com/qiongwu86/URLLC-Awared-Resource-Allocation-for-Heterogeneous-Vehicular-Edge-Computing
paper_authors: Qiong Wu, Wenhua Wang, Pingyi Fan, Qiang Fan, Jiangzhou Wang, Khaled B. Letaief
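for: This paper aims to improve vehicular edge computing (VEC) support for real-time vehicular applications by using multiple communication technologies to increase communication capacity.
methods: The paper considers a heterogeneous VEC system that combines DSRC, mmWave, and C-V2I, and uses a Lyapunov-guided deep reinforcement learning algorithm to optimize resource allocation under the URLLC requirement.
results: Experiments show that the proposed resource allocation policy effectively minimizes the system utility while satisfying the URLLC requirement.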

Abstract
Vehicular edge computing (VEC) is a promising technology to support real-time vehicular applications, where vehicles offload intensive computation tasks to the nearby VEC server for processing. However, the traditional VEC that relies on single communication technology cannot well meet the communication requirement for task offloading, thus the heterogeneous VEC integrating the advantages of dedicated short-range communications (DSRC), millimeter-wave (mmWave) and cellular-based vehicle to infrastructure (C-V2I) is introduced to enhance the communication capacity. The communication resource allocation and computation resource allocation may significantly impact on the ultra-reliable low-latency communication (URLLC) performance and the VEC system utility, in this case, how to do the resource allocations is becoming necessary. In this paper, we consider a heterogeneous VEC with multiple communication technologies and various types of tasks, and propose an effective resource allocation policy to minimize the system utility while satisfying the URLLC requirement. We first formulate an optimization problem to minimize the system utility under the URLLC constraint which modeled by the moment generating function (MGF)-based stochastic network calculus (SNC), then we present a Lyapunov-guided deep reinforcement learning (DRL) method to convert and solve the optimization problem. Extensive simulation experiments illustrate that the proposed resource allocation approach is effective.

Near-Field Beamfocusing with Polarized Antennas

paper_url: http://arxiv.org/abs/2311.18334
repo_url: None
paper_authors: Adrian Agustin, Xavier Mestre
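for: Supporting massive spatial multiplexing in future 6G wireless networks.
methods: Extremely large antenna arrays (ELAAs) combined with multiple orthogonal polarizations.
results: In the near field, up to 3 spatial degrees of freedom are available to improve system performance, whereas in the far field only 2 are. The paper gives an analytical approximation of the achievable rate and derives approximations of the optimal antenna spacing and array size.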

Abstract
One of the most relevant challenges in future 6G wireless networks is how to support a massive spatial multiplexing of a large number of user terminals. Recently, extremely large antenna arrays (ELAAs), also referred to as extra-large MIMO (XL-MIMO), have emerged as a potential enabler of this type of spatially multiplexed transmission. These massive configurations substantially increase the number of available spatial degrees of freedom (transmission modes) while also making it possible to focus the transmitted energy into a very small region, thanks to the properties of near-field propagation and the large number of transmitters. This work explores whether multiplexing of multiple orthogonal polarizations can enhance the system performance in the near-field. We concentrate on a simple scenario consisting of a Uniform Linear Array (ULA) and a single antenna element user equipment (UE). We demonstrate that the number of spatial degrees of freedom can be as large as 3 in the near-field of a Line of Sight (LoS) channel when both transmitter and receiver employ three orthogonal linear polarizations. In the far-field, however, the maximum number of spatial degrees of freedom tends to be only 2, due to the fact that the equivalent MIMO channel becomes rank deficient. We provide an analytical approximation to the achievable rate, which allows us to derive approximations to the optimal antenna spacing and array size that maximize the achievable rate.

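Summary
One of the most important challenges in future 6G wireless networks is how to support massive spatial multiplexing of a large number of user terminals. Extremely large antenna arrays (ELAAs), also called extra-large MIMO (XL-MIMO), have recently emerged as a potential enabler. These massive configurations increase the number of available spatial degrees of freedom and also concentrate the transmitted energy into a very small region, thanks to near-field propagation and the large number of transmitters. This work explores whether multiplexing multiple orthogonal polarizations can improve system performance in the near field. We focus on a simple scenario with a uniform linear array (ULA) and a single-antenna-element user equipment (UE). We show that in the near field of a line-of-sight (LoS) channel, up to 3 spatial degrees of freedom are available when both transmitter and receiver use three orthogonal linear polarizations. In the far field, however, the maximum is usually only 2, because the equivalent MIMO channel becomes rank deficient. We provide an analytical approximation of the achievable rate, from which we derive approximations of the antenna spacing and array size that maximize it.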
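The rank argument can be illustrated with a deliberately simplified model, which is an assumption and not the paper's channel model: each array element contributes a tri-polarized LoS block proportional to the projector `I - r r^T` onto the plane transverse to its unit direction `r` toward the UE, with power gain `1/d^2`. The Gram matrix of the stacked channel is then the weighted sum of these projectors. With the array on the x axis and the UE on the z axis, that matrix decouples into a y entry and a 2x2 xz block, so its eigenvalues have a closed form. The function name is hypothetical.

```python
import math

def third_to_first_singular_ratio(distance_m, n_elements=64, spacing_m=0.005):
    """Ratio of the weakest to the strongest singular value of the 3-polarization channel.

    Array: n_elements on the x axis centred at the origin; UE at (0, 0, distance_m).
    A ratio near zero means the channel effectively supports only 2 polarization modes.
    """
    gyy = gxx = gxz = gzz = 0.0
    for n in range(n_elements):
        x = (n - (n_elements - 1) / 2) * spacing_m
        dx, dz = -x, distance_m                # vector from element to UE
        d = math.hypot(dx, dz)
        rx, rz = dx / d, dz / d                # unit propagation direction
        w = 1.0 / d ** 2                       # free-space power gain (up to a constant)
        gyy += w                               # y is always transverse in this geometry
        gxx += w * (1.0 - rx * rx)
        gxz += w * (-rx * rz)
        gzz += w * (1.0 - rz * rz)
    half_trace = 0.5 * (gxx + gzz)
    radius = math.hypot(0.5 * (gxx - gzz), gxz)
    eigs = sorted([gyy, half_trace + radius, half_trace - radius], reverse=True)
    return math.sqrt(max(eigs[2], 0.0) / eigs[0])

near_ratio = third_to_first_singular_ratio(1.0)      # well inside the near field
far_ratio = third_to_first_singular_ratio(1000.0)    # far field for this aperture
print(f"near: {near_ratio:.4f}, far: {far_ratio:.6f}")
```

A spacing of 5 mm corresponds to half a wavelength at 30 GHz. Close to the array the elements see the UE under noticeably different directions, so the transverse planes differ and together span all three polarizations; far away the directions coincide and the third mode fades.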

Feasibility Analysis of In-Band Coexistence in Dense LEO Satellite Communication Systems

paper_url: http://arxiv.org/abs/2311.18250
repo_url: None
paper_authors: Eunsun Kim, Ian P. Roberts, Jeffrey G. Andrews
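for: This study provides a rigorous methodology for assessing the feasibility of spectrum sharing between large low-earth orbit (LEO) satellite constellations.
methods: The authors focus on the existing Starlink system and the soon-to-be-launched Kuiper system and carefully model the downlink interference between the two. They investigate how Kuiper can strategically select which satellites serve its ground users while protecting Starlink ground users from excessive interference.
results: Depending on which Starlink and Kuiper satellites are serving users, interference can be either very high or extremely low. Kuiper can nevertheless protect Starlink ground users with high probability through strategic satellite selection while delivering near-maximal downlink SINR to its own ground users. This points to a feasible route to coexistence of two dense LEO systems, even when one system has limited knowledge of the other's serving satellites.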

Abstract
This work provides a rigorous methodology for assessing the feasibility of spectrum sharing between large low-earth orbit (LEO) satellite constellations. For concreteness, we focus on the existing Starlink system and the soon-to-be-launched Kuiper system, which is prohibited from inflicting excessive interference onto the incumbent Starlink ground users. We carefully model and study the potential downlink interference between the two systems and investigate how strategic satellite selection may be used by Kuiper to serve its ground users while also protecting Starlink ground users. We then extend this notion of satellite selection to the case where Kuiper has limited knowledge of Starlink's serving satellite. Our findings reveal that there is always the potential for very high and extremely low interference, depending on which Starlink and Kuiper satellites are being used to serve their users. Consequently, we show that Kuiper can protect Starlink ground users with high probability, by strategically selecting which of its satellites are used to serve its ground users. Simultaneously, Kuiper is capable of delivering near-maximal downlink SINR to its own ground users. This highlights a feasible route to the coexistence of two dense LEO satellite systems, even in scenarios where one system has limited knowledge of the other's serving satellites.
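
The idea of strategic satellite selection can be illustrated with a toy rule: among the candidate satellites, keep those whose interference at the incumbent's ground user stays below a protection threshold, then pick the one with the best SINR for the secondary user. This is a minimal sketch, not the paper's procedure; the field names, the candidate values and the threshold are hypothetical.

```python
def select_serving_satellite(candidates, inr_threshold_db):
    """Pick the candidate with the highest SINR whose INR at the protected
    (incumbent) ground user does not exceed the threshold.

    candidates: list of dicts with hypothetical keys
        'name'    - satellite identifier
        'sinr_db' - downlink SINR at the secondary system's own user
        'inr_db'  - interference-to-noise ratio inflicted on the incumbent user
    Returns the chosen dict, or None if no candidate is admissible.
    """
    admissible = [c for c in candidates if c["inr_db"] <= inr_threshold_db]
    if not admissible:
        return None
    return max(admissible, key=lambda c: c["sinr_db"])

# Made-up numbers purely for illustration.
overhead = [
    {"name": "sat-A", "sinr_db": 9.1, "inr_db": 3.0},    # best SINR, too much interference
    {"name": "sat-B", "sinr_db": 8.7, "inr_db": -15.0},  # nearly as good, well protected
    {"name": "sat-C", "sinr_db": 5.2, "inr_db": -30.0},
]
choice = select_serving_satellite(overhead, inr_threshold_db=-12.0)
print(choice["name"])
```

When the secondary system does not know which incumbent satellite is serving, the `inr_db` entry could be replaced by a worst case over the possible incumbent satellites; that extension is left out here.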

Multi-Rate Variable-Length CSI Compression for FDD Massive MIMO

paper_url: http://arxiv.org/abs/2311.18172
repo_url: None
paper_authors: Bumsu Park, Heedong Do, Namyoon Lee
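for: Improving the efficiency of channel state information (CSI) feedback in frequency-division-duplexing (FDD) systems with many antennas.
methods: A flexible CSI compression method based on a variational autoencoder (VAE) with an entropy bottleneck structure, supporting multi-rate and variable-length operation.
results: A numerical study shows that the proposed method outperforms existing CSI compression techniques in terms of normalized mean squared error (NMSE).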

Abstract
For frequency-division-duplexing (FDD) systems, channel state information (CSI) should be fed back from the user terminal to the base station. This feedback overhead becomes problematic as the number of antennas grows. To alleviate this issue, we propose a flexible CSI compression method using variational autoencoder (VAE) with an entropy bottleneck structure, which can support multi-rate and variable-length operation. Numerical study confirms that the proposed method outperforms the existing CSI compression techniques in terms of normalized mean squared error.
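
For reference, the evaluation metric named in the abstract is commonly defined as NMSE = E[ ||H - Ĥ||² / ||H||² ], usually reported in dB. The helper below is a minimal sketch of that definition; the function name and the list-of-lists channel layout are assumptions made for illustration.

```python
import math

def nmse_db(true_channels, reconstructed_channels):
    """Average normalized squared error over a batch, returned in dB.

    Each channel is a flat sequence of complex coefficients; the two batches
    must have the same length and matching channel sizes.
    """
    total = 0.0
    for h, h_hat in zip(true_channels, reconstructed_channels):
        error_power = sum(abs(a - b) ** 2 for a, b in zip(h, h_hat))
        signal_power = sum(abs(a) ** 2 for a in h)
        total += error_power / signal_power
    return 10.0 * math.log10(total / len(true_channels))

# One two-coefficient channel with a small error on the first coefficient.
value = nmse_db([[1 + 0j, 1j]], [[0.9 + 0j, 1j]])
print(round(value, 2))
```

Lower (more negative) values indicate a more faithful reconstruction at the base station for a given feedback rate.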

Throughput Maximization for Intelligent Refracting Surface Assisted mmWave High-Speed Train Communications

paper_url: http://arxiv.org/abs/2311.18167
repo_url: None
paper_authors: Jing Li, Yong Niu, Hao Wu, Bo Ai, Ruisi He, Ning Wang, Sheng Chen
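for: Relieving the transmission pressure on high-speed train (HST) networks, for which millimeter-wave (mmWave) communication is considered an effective technique.
methods: An intelligent refracting surface (IRS) is deployed on the train window to dynamically reconfigure the propagation environment, and a hybrid time division multiple access / non-orthogonal multiple access scheme is used for interference mitigation.
results: Using alternating optimization with IRS phase-shift design and power allocation, throughput in mmWave HST networks is improved subject to constraints on base station beamforming, IRS discrete phase shifts and transmit power.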

Abstract
With the increasing demands from passengers for data-intensive services, millimeter-wave (mmWave) communication is considered an effective technique to release the transmission pressure on high speed train (HST) networks. However, mmWave signals encounter severe losses when passing through the carriage, which decreases the quality of services on board. In this paper, we investigate an intelligent refracting surface (IRS)-assisted HST communication system. Herein, an IRS is deployed on the train window to dynamically reconfigure the propagation environment, and a hybrid time division multiple access-nonorthogonal multiple access scheme is leveraged for interference mitigation. We aim to maximize the overall throughput while taking into account the constraints imposed by base station beamforming, IRS discrete phase shifts and transmit power. To obtain a practical solution, we employ an alternating optimization method and propose a two-stage algorithm. In the first stage, the successive convex approximation method and branch and bound algorithm are leveraged for IRS phase shift design. In the second stage, the Lagrangian multiplier method is utilized for power allocation. Simulation results demonstrate the benefits of IRS adoption and power allocation for throughput improvement in mmWave HST networks.
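
The discrete phase-shift constraint mentioned above can be made concrete with a much simpler baseline than the paper's successive convex approximation and branch-and-bound design: compute the ideal continuous phase per IRS element and round it to the nearest level of a b-bit phase set. The sketch below only illustrates that constraint; the cascaded channel coefficients and function names are made up.

```python
import cmath
import math

def quantize_phase(theta, bits):
    """Round a phase (radians) to the nearest of 2**bits uniformly spaced levels in [0, 2*pi)."""
    levels = 2 ** bits
    step = 2.0 * math.pi / levels
    return (round(theta / step) % levels) * step

def combined_gain(cascaded, phases):
    """Magnitude of the coherent sum over IRS elements for given phase shifts."""
    return abs(sum(c * cmath.exp(1j * p) for c, p in zip(cascaded, phases)))

# Hypothetical cascaded (base station -> IRS -> user) coefficients per element.
cascaded = [0.8 * cmath.exp(1j * 0.3), 0.5 * cmath.exp(-1j * 2.0), 0.9 * cmath.exp(1j * 1.2)]

ideal = [-cmath.phase(c) for c in cascaded]           # aligns every term on the real axis
one_bit = [quantize_phase(p, bits=1) for p in ideal]  # each element is either 0 or pi
three_bit = [quantize_phase(p, bits=3) for p in ideal]

print(combined_gain(cascaded, ideal), combined_gain(cascaded, one_bit), combined_gain(cascaded, three_bit))
```

Rounding is generally suboptimal once several users and beamforming are optimized jointly, which is presumably why the paper resorts to a dedicated discrete optimization step.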

2023-11-29

cs.SD

cs.SD - 2023-11-29

FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition

paper_url: http://arxiv.org/abs/2311.17790
repo_url: None
paper_authors: Dongning Yang, Wei Wang, Yanmin Qian
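for: Monaural speech enhancement (SE) improves perceptual quality but has not delivered the expected gains when placed in front of automatic speech recognition (ASR); this work aims to make ASR robust to the distortions SE introduces.
methods: FAT-HuBERT, a new approach that uses distortion-invariant self-supervised learning (SSL) to improve the robustness of ASR.
results: Evaluated on simulated noisy speech and real-world noisy speech, FAT-HuBERT achieves a significant reduction in word error rate (WER).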

Abstract
Advancements in monaural speech enhancement (SE) techniques have greatly improved the perceptual quality of speech. However, integrating these techniques into automatic speech recognition (ASR) systems has not yielded the expected performance gains, primarily due to the introduction of distortions during the SE process. In this paper, we propose a novel approach called FAT-HuBERT, which leverages distortion-invariant self-supervised learning (SSL) to enhance the robustness of ASR. To address the distortions introduced by the SE frontends, we introduce layer-wise fusion modules that incorporate features extracted from both observed noisy signals and enhanced signals. During training, the SE frontend is randomly selected from a pool of models. We evaluate the performance of FAT-HuBERT on simulated noisy speech generated from LibriSpeech as well as real-world noisy speech from the CHiME-4 1-channel dataset. The experimental results demonstrate a significant relative reduction in word error rate (WER).

2023-11-29

eess.AS

eess.AS - 2023-11-29

Adapting OpenAI’s Whisper for Speech Recognition on Code-Switch Mandarin-English SEAME and ASRU2019 Datasets

paper_url: http://arxiv.org/abs/2311.17382
repo_url: None
paper_authors: Yuhang Yang, Yizhou Peng, Xionghu Zhong, Hao Huang, Eng Siong Chng
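for: This study reports experimental results on adapting OpenAI's Whisper model to code-switch Mandarin-English automatic speech recognition (ASR).
methods: Two experiments were conducted: (a) varying the amount of adaptation data from 1 to 100/200 hours to demonstrate the effectiveness of adaptation, and (b) examining different language ID setups in the Whisper prompt.
results: As little as 1 to 10 hours of adaptation data may suffice to reach saturation in performance gain on SEAME, whereas the ASRU task continues to improve with more adaptation data (>100 hours). Different prompting strategies initially produce different outcomes, but adapting Whisper with code-switch data uniformly improves its performance.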

Abstract
This paper details the experimental results of adapting OpenAI's Whisper model for Code-Switch Mandarin-English Speech Recognition (ASR) on the SEAME and ASRU2019 corpora. We conducted 2 experiments: a) using adaptation data from 1 to 100/200 hours to demonstrate effectiveness of adaptation, b) examining different language ID setups on the Whisper prompt. The Mixed Error Rate results show that the amount of adaptation data may be as low as $1\sim10$ hours to achieve saturation in performance gain (SEAME) while the ASRU task continued to show performance gains with more adaptation data ($>$100 hours). For the language prompt, the results show that although various prompting strategies initially produce different outcomes, adapting the Whisper model with code-switch data uniformly improves its performance. These results may also be relevant to the community when applying Whisper for related tasks of adapting to new target domains.
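
Mixed Error Rate is commonly computed for Mandarin-English code-switch speech by scoring Mandarin at the character level and English at the word level, then taking the edit distance over the mixed token sequence. The sketch below follows that convention; the tokenizer regex and function names are assumptions, and the paper's own text normalization may differ.

```python
import re

# One token per CJK character, one token per run of Latin letters/apostrophes.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[a-z']+")

def tokenize(text):
    """Split mixed text into Mandarin characters and lower-cased English words."""
    return _TOKEN.findall(text.lower())

def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def mixed_error_rate(reference, hypothesis):
    """Edit distance over mixed tokens divided by the number of reference tokens."""
    ref = tokenize(reference)
    return edit_distance(ref, tokenize(hypothesis)) / len(ref)

# The hypothesis drops one Mandarin character out of five reference tokens.
print(mixed_error_rate("我想去 shopping mall", "我想 shopping mall"))
```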


2023-11-29

cs.CV

cs.CV - 2023-11-29

Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models

paper_url: http://arxiv.org/abs/2311.17919
repo_url: https://github.com/dangeng/visual_anagrams
paper_authors: Daniel Geng, Inbum Park, Andrew Owens
for: This paper addresses the problem of synthesizing multi-view optical illusions: images that change appearance upon a transformation such as a flip or rotation.
methods: The proposed zero-shot method uses off-the-shelf text-to-image diffusion models. During the reverse diffusion process, it estimates the noise from different views of a noisy image, combines these estimates, and denoises the image.
results: Qualitative and quantitative results demonstrate the method's effectiveness and flexibility. The approach also extends to illusions with more than two views; additional results are on the project webpage.

Abstract
We address the problem of synthesizing multi-view optical illusions: images that change appearance upon a transformation, such as a flip or rotation. We propose a simple, zero-shot method for obtaining these illusions from off-the-shelf text-to-image diffusion models. During the reverse diffusion process, we estimate the noise from different views of a noisy image. We then combine these noise estimates together and denoise the image. A theoretical analysis suggests that this method works precisely for views that can be written as orthogonal transformations, of which permutations are a subset. This leads to the idea of a visual anagram--an image that changes appearance under some rearrangement of pixels. This includes rotations and flips, but also more exotic pixel permutations such as a jigsaw rearrangement. Our approach also naturally extends to illusions with more than two views. We provide both qualitative and quantitative results demonstrating the effectiveness and flexibility of our method. Please see our project webpage for additional visualizations and results: https://dangeng.github.io/visual_anagrams/
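
The noise-combination step described above can be sketched as follows: transform the noisy image into each view, query the noise predictor with that view's prompt, map each estimate back with the inverse transform, and average. This is a minimal sketch under stated assumptions: `eps_model` is a stand-in for a diffusion model's noise predictor, and the view list, prompts and function names are hypothetical.

```python
import numpy as np

def combine_noise_estimates(x_t, eps_model, views, prompts):
    """Average per-view noise estimates after mapping them back to the base view.

    views   - list of (transform, inverse_transform) pairs acting on image arrays
    prompts - one text prompt per view
    """
    estimates = []
    for (view, inverse), prompt in zip(views, prompts):
        eps_in_view = eps_model(view(x_t), prompt)   # noise as seen from this view
        estimates.append(inverse(eps_in_view))       # bring it back to the base frame
    return np.mean(estimates, axis=0)

# Two orthogonal (pixel-permutation) views: identity and a 90-degree rotation.
identity = (lambda x: x, lambda x: x)
rotate90 = (lambda x: np.rot90(x, 1), lambda x: np.rot90(x, -1))

def dummy_eps_model(x, prompt):
    """Placeholder predictor that ignores the prompt and echoes its input."""
    return x

rng = np.random.default_rng(0)
x_t = rng.standard_normal((8, 8))
eps = combine_noise_estimates(x_t, dummy_eps_model,
                              [identity, rotate90],
                              ["an oil painting of a duck", "an oil painting of a rabbit"])
```

Rotations and flips only permute pixels, so they preserve the norm and statistics of Gaussian noise; this is the property behind the abstract's restriction to orthogonal transformations.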

Do text-free diffusion models learn discriminative visual representations?

paper_url: http://arxiv.org/abs/2311.17921
repo_url: None
paper_authors: Soumik Mukhopadhyay, Matthew Gwilliam, Yosuke Yamaguchi, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Tianyi Zhou, Abhinav Shrivastava
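for: This paper explores whether a single unsupervised representation learner, built on diffusion models, can address both generative and discriminative tasks.
methods: It builds on diffusion models, a state-of-the-art method for generative tasks, and proposes a novel attention mechanism for pooling feature maps together with a transformer-based feature fusion to strengthen the learned representations.
results: With the proposed attention and fusion mechanisms, the approach is competitive across discriminative tasks such as image classification, fine-grained classification, object detection and segmentation.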

Abstract
While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which addresses both families of tasks simultaneously. We identify diffusion models, a state-of-the-art method for generative tasks, as a prime candidate. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. We find that the intermediate feature maps of the U-Net are diverse, discriminative feature representations. We propose a novel attention mechanism for pooling feature maps and further leverage this mechanism as DifFormer, a transformer feature fusion of features from different diffusion U-Net blocks and noise steps. We also develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs, and, with our fusion and feedback mechanisms, can compete with state-of-the-art unsupervised image representation learning methods for discriminative tasks - image classification with full and semi-supervision, transfer for fine-grained classification, object detection and segmentation, and semantic segmentation. Our project website (https://mgwillia.github.io/diffssl/) and code (https://github.com/soumik-kanad/diffssl) are available publicly.

A Simple Recipe for Language-guided Domain Generalized Segmentation

paper_url: http://arxiv.org/abs/2311.17922
repo_url: None
paper_authors: Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette
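for: This study aims to improve the generalization of neural networks to new domains not seen during training, so that they can be deployed in real-world applications.
methods: Language is used as the source of randomization, with three key ingredients: preserving CLIP's intrinsic robustness, language-driven local style augmentation, and randomly mixing source and augmented styles during training.
results: Extensive experiments report state-of-the-art results on various generalization benchmarks. The code will be made available.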

Abstract
Generalization to new domains not seen during training is one of the long-standing goals and challenges in deploying neural networks in real-world applications. Existing generalization techniques necessitate substantial data augmentation, potentially sourced from external datasets, and aim at learning invariant representations by imposing various alignment constraints. Large-scale pretraining has recently shown promising generalization capabilities, along with the potential of bridging different modalities. For instance, the recent advent of vision-language models like CLIP has opened the doorway for vision models to exploit the textual modality. In this paper, we introduce a simple framework for generalizing semantic segmentation networks by employing language as the source of randomization. Our recipe comprises three key ingredients: i) the preservation of the intrinsic CLIP robustness through minimal fine-tuning, ii) language-driven local style augmentation, and iii) randomization by locally mixing the source and augmented styles during training. Extensive experiments report state-of-the-art results on various generalization benchmarks. The code will be made available.
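
The third ingredient, locally mixing source and augmented styles, can be sketched with per-patch feature statistics in the spirit of AdaIN. This is an assumption about the mechanics, not the authors' implementation: the patch grid, the per-channel mixing weight `alpha` and the way the augmented statistics are supplied are all hypothetical.

```python
import numpy as np

def mix_patch_styles(feat, aug_mean, aug_std, grid=2, alpha=None, rng=None, eps=1e-5):
    """Re-style each patch of a (C, H, W) feature map with mixed statistics.

    aug_mean, aug_std - (grid, grid, C) target style statistics per patch,
                        e.g. obtained from language-driven augmentation
    alpha             - mixing weight in [0, 1]; drawn per channel if None
    H and W are assumed to be divisible by grid.
    """
    rng = np.random.default_rng() if rng is None else rng
    channels, height, width = feat.shape
    ph, pw = height // grid, width // grid
    out = feat.astype(float)                      # astype returns a new array
    for i in range(grid):
        for j in range(grid):
            rows, cols = slice(i * ph, (i + 1) * ph), slice(j * pw, (j + 1) * pw)
            patch = feat[:, rows, cols]
            mu = patch.mean(axis=(1, 2), keepdims=True)
            sigma = patch.std(axis=(1, 2), keepdims=True) + eps
            a = rng.uniform(size=(channels, 1, 1)) if alpha is None else alpha
            mu_mix = a * mu + (1 - a) * aug_mean[i, j][:, None, None]
            sigma_mix = a * sigma + (1 - a) * aug_std[i, j][:, None, None]
            # Normalize with source statistics, then re-scale with the mixed ones.
            out[:, rows, cols] = (patch - mu) / sigma * sigma_mix + mu_mix
    return out

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 8))
target_mean = rng.standard_normal((2, 2, 4))
target_std = rng.uniform(0.5, 1.5, size=(2, 2, 4))
mixed = mix_patch_styles(features, target_mean, target_std, rng=rng)
```

With `alpha=1.0` the features are returned unchanged, and with `alpha=0.0` every patch takes on the augmented statistics entirely; random values in between give the randomization effect.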

Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving

paper_url: http://arxiv.org/abs/2311.17918
repo_url: https://github.com/bravegroup/drive-wm
paper_authors: Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, Zhaoxiang Zhang
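for: This paper aims to improve the safety and efficiency of autonomous driving by predicting future events and evaluating foreseeable risks.
methods: It proposes Drive-WM, the first driving world model compatible with existing end-to-end planning models. Through joint spatial-temporal modeling facilitated by view factorization, the model generates high-fidelity multiview videos of driving scenes.
results: Drive-WM generates high-quality, consistent, and controllable multiview videos and determines the optimal trajectory according to image-based rewards. Evaluation on real-world driving datasets indicates that the method opens up possibilities for real-world simulation and safe planning.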

Abstract
In autonomous driving, predicting future events in advance and evaluating the foreseeable risks empowers autonomous vehicles to better plan their actions, enhancing safety and efficiency on the road. To this end, we propose Drive-WM, the first driving world model compatible with existing end-to-end planning models. Through a joint spatial-temporal modeling facilitated by view factorization, our model generates high-fidelity multiview videos in driving scenes. Building on its powerful generation ability, we showcase the potential of applying the world model for safe driving planning for the first time. Particularly, our Drive-WM enables driving into multiple futures based on distinct driving maneuvers, and determines the optimal trajectory according to the image-based rewards. Evaluation on real-world driving datasets verifies that our method could generate high-quality, consistent, and controllable multiview videos, opening up possibilities for real-world simulations and safe planning.

AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text

paper_url: http://arxiv.org/abs/2311.17917
repo_url: https://github.com/magic-research/avatarstudio
paper_authors: Jianfeng Zhang, Xuanmeng Zhang, Huichao Zhang, Jun Hao Liew, Chenxu Zhang, Yi Yang, Jiashi Feng
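for: Creating high-fidelity, animatable 3D avatars from textual descriptions.
methods: AvatarStudio, a coarse-to-fine generative model that starts from a NeRF-based representation and then incorporates SMPL-guided articulation into an explicit mesh representation to support avatar animation and high-resolution rendering.
results: AvatarStudio creates high-quality, animatable avatars from text and performs well in applications such as multimodal avatar animation and style-guided avatar creation. For more results, see the project page: http://jeff95.me/projects/avatarstudio.html.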

Abstract
We study the problem of creating high-fidelity and animatable 3D avatars from only textual descriptions. Existing text-to-avatar methods are either limited to static avatars which cannot be animated or struggle to generate animatable avatars with promising quality and precise pose control. To address these limitations, we propose AvatarStudio, a coarse-to-fine generative model that generates explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio begins with a low-resolution NeRF-based representation for coarse generation, followed by incorporating SMPL-guided articulation into the explicit mesh representation to support avatar animation and high resolution rendering. To ensure view consistency and pose controllability of the resulting avatars, we introduce a 2D diffusion model conditioned on DensePose for Score Distillation Sampling supervision. By effectively leveraging the synergy between the articulated mesh representation and the DensePose-conditional diffusion model, AvatarStudio can create high-quality avatars from text that are ready for animation, significantly outperforming previous methods. Moreover, it is competent for many applications, e.g., multimodal avatar animations and style-guided avatar creation. For more results, please refer to our project page: http://jeff95.me/projects/avatarstudio.html

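OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

paper_url: http://arxiv.org/abs/2311.17911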
repo_url: https://github.com/shikiw/opera
paper_authors: Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu
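for: This paper aims to alleviate hallucination in multi-modal large language models (MLLMs), improving their accuracy in real-world use.
methods: It proposes a new MLLM decoding method based on an over-trust penalty and a retrospection-allocation strategy, requiring no additional data, knowledge, or training.
results: Extensive experiments show significant hallucination mitigation across different MLLMs and metrics, demonstrating the method's effectiveness and generality.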

Abstract
Hallucination, posed as a pervasive challenge of multi-modal large language models (MLLMs), has significantly impeded their real-world usage that demands precise judgment. Existing methods mitigate this issue with either training with specific designed data or inferencing with external knowledge from other sources, incurring inevitable additional costs. In this paper, we present OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy, serving as a nearly free lunch to alleviate the hallucination issue without additional data, knowledge, or training. Our approach begins with an interesting observation that most hallucinations are closely tied to the knowledge aggregation patterns manifested in the self-attention matrix, i.e., MLLMs tend to generate new tokens by focusing on a few summary tokens, but not all the previous tokens. Such partial over-trust inclination results in the neglecting of image tokens and describes the image content with hallucination. Statistically, we observe an 80%$\sim$95% co-occurrence rate between hallucination contents and such knowledge aggregation patterns. Based on the observation, OPERA introduces a penalty term on the model logits during the beam-search decoding to mitigate the over-trust issue, along with a rollback strategy that retrospects the presence of summary tokens in the previously generated tokens, and re-allocate the token selection if necessary. With extensive experiments, OPERA shows significant hallucination-mitigating performance on different MLLMs and metrics, proving its effectiveness and generality. Our code is available at: https://github.com/shikiw/OPERA.
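
A rough way to picture the penalty term: inspect the self-attention weights over a local window of recent tokens, score each column by how consistently later tokens attend to it, and subtract the strongest column score from a beam candidate's log-probability. The sketch below is loosely based on the abstract's description only; the column-product score, the `scale` and `alpha` parameters and the function names are assumptions, and the rollback (retrospection-allocation) step is not shown.

```python
import numpy as np

def over_trust_penalty(attn_window, scale=1.0):
    """Score the strongest 'summary token' column in a causal attention window.

    attn_window - (k, k) attention weights among the last k tokens; row i holds
                  the attention of token i over tokens 0..i
    """
    k = attn_window.shape[0]
    lower = np.tril(attn_window) * scale
    # For column j, multiply the attention it receives from tokens j..k-1.
    column_scores = [np.prod(lower[j:, j]) for j in range(k)]
    return float(max(column_scores))

def penalized_scores(log_probs, candidate_windows, alpha=1.0):
    """Subtract each beam candidate's penalty from its log-probability."""
    return [lp - alpha * over_trust_penalty(w) for lp, w in zip(log_probs, candidate_windows)]

# Attention spread evenly over the visible context.
spread = np.array([[1.0, 0.0, 0.0],
                   [0.5, 0.5, 0.0],
                   [1 / 3, 1 / 3, 1 / 3]])
# Attention collapsing onto the first token (an aggregation pattern).
collapsed = np.array([[1.0, 0.0, 0.0],
                      [0.9, 0.1, 0.0],
                      [0.9, 0.05, 0.05]])

scores = penalized_scores([-1.0, -1.0], [spread, collapsed])
```

Under this toy score, two candidates with equal log-probability are separated by their attention pattern, with the collapsed one ranked lower.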

HUGS: Human Gaussian Splats

paper_url: http://arxiv.org/abs/2311.17910
repo_url: None
paper_authors: Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, Anurag Ranjan
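for: This work aims at animatable human synthesis from a monocular video: separating the human from the static scene and producing a fully animatable human avatar together with the scene.
methods: The human and the scene are represented with 3D Gaussian Splatting (3DGS). The human Gaussians are initialized from the SMPL body model and are allowed to deviate from it to capture details SMPL does not model. Linear blend skinning (LBS) weights are jointly optimized to coordinate the movements of individual Gaussians during animation.
results: The method achieves state-of-the-art rendering quality at 60 FPS while being ~100x faster to train than previous work. The code will be announced at https://github.com/apple/ml-hugs.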

Abstract
Recent advances in neural rendering have improved both training and rendering times by orders of magnitude. While these methods demonstrate state-of-the-art quality and speed, they are designed for photogrammetry of static scenes and do not generalize well to freely moving humans in the environment. In this work, we introduce Human Gaussian Splats (HUGS) that represents an animatable human together with the scene using 3D Gaussian Splatting (3DGS). Our method takes only a monocular video with a small number of (50-100) frames, and it automatically learns to disentangle the static scene and a fully animatable human avatar within 30 minutes. We utilize the SMPL body model to initialize the human Gaussians. To capture details that are not modeled by SMPL (e.g. cloth, hairs), we allow the 3D Gaussians to deviate from the human body model. Utilizing 3D Gaussians for animated humans brings new challenges, including the artifacts created when articulating the Gaussians. We propose to jointly optimize the linear blend skinning weights to coordinate the movements of individual Gaussians during animation. Our approach enables novel-pose synthesis of human and novel view synthesis of both the human and the scene. We achieve state-of-the-art rendering quality with a rendering speed of 60 FPS while being ~100x faster to train over previous work. Our code will be announced here: https://github.com/apple/ml-hugs
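
Linear blend skinning, whose weights the method optimizes, moves each point by a weighted combination of per-joint rigid transforms: x' = Σ_k w_k (R_k x + t_k). The sketch below applies that standard formula to Gaussian centres only; it is a generic illustration rather than the HUGS code, and it ignores how Gaussian rotations and covariances would be transformed.

```python
import numpy as np

def skin_points(points, weights, rotations, translations):
    """Apply linear blend skinning to a set of 3D points.

    points       - (N, 3) rest-pose positions (e.g. Gaussian centres)
    weights      - (N, K) skinning weights, each row summing to 1
    rotations    - (K, 3, 3) per-joint rotation matrices
    translations - (K, 3) per-joint translations
    """
    # Position of every point under every joint transform: shape (N, K, 3).
    per_joint = np.einsum("kij,nj->nki", rotations, points) + translations[None]
    # Blend the K candidate positions with the skinning weights.
    return np.einsum("nk,nki->ni", weights, per_joint)

rest = np.array([[0.0, 1.0, 0.0],
                 [0.5, 0.5, 0.0]])
rotations = np.stack([np.eye(3), np.eye(3)])
translations = np.array([[0.0, 0.0, 0.0],
                         [2.0, 0.0, 0.0]])   # second joint shifts along x
weights = np.array([[1.0, 0.0],              # first point follows joint 0 only
                    [0.5, 0.5]])             # second point is shared equally
posed = skin_points(rest, weights, rotations, translations)
```

Because neighbouring Gaussians with very different weights move differently, poorly chosen weights produce visible artifacts at articulation, which is the motivation the abstract gives for optimizing them jointly.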

Language-conditioned Detection Transformer

paper_url: http://arxiv.org/abs/2311.17902
repo_url: https://github.com/janghyuncho/decola
paper_authors: Jang Hyun Cho, Philipp Krähenbühl
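for: This study develops a new open-vocabulary detection framework for detecting objects in images.
methods: The framework uses both image-level labels and detailed detection annotations and proceeds in three steps. First, a language-conditioned object detector is trained on fully-supervised detection data. This detector is then used to pseudo-label images, and the pseudo-labels are finally used to train an unconditioned open-vocabulary detector.
results: The resulting detector, DECOLA, shows strong zero-shot performance on the open-vocabulary LVIS benchmark and on direct zero-shot transfer to LVIS, COCO, Object365, and OpenImages. It outperforms prior art by 17.1 AP-rare and 9.4 mAP on the zero-shot LVIS benchmark, and achieves state-of-the-art results across model sizes, architectures, and datasets using only open-source data and academic-scale compute.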

Abstract
We present a new open-vocabulary detection framework. Our framework uses both image-level labels and detailed detection annotations when available. Our framework proceeds in three steps. We first train a language-conditioned object detector on fully-supervised detection data. This detector gets to see the presence or absence of ground truth classes during training, and conditions prediction on the set of present classes. We use this detector to pseudo-label images with image-level labels. Our detector provides much more accurate pseudo-labels than prior approaches with its conditioning mechanism. Finally, we train an unconditioned open-vocabulary detector on the pseudo-annotated images. The resulting detector, named DECOLA, shows strong zero-shot performance in open-vocabulary LVIS benchmark as well as direct zero-shot transfer benchmarks on LVIS, COCO, Object365, and OpenImages. DECOLA outperforms the prior arts by 17.1 AP-rare and 9.4 mAP on zero-shot LVIS benchmark. DECOLA achieves state-of-the-art results in various model sizes, architectures, and datasets by only training on open-sourced data and academic-scale computing. Code is available at https://github.com/janghyuncho/DECOLA.

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

paper_url: http://arxiv.org/abs/2311.17893
repo_url: None
paper_authors: Shuangrui Ding, Rui Qian, Haohang Xu, Dahua Lin, Hongkai Xiong
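for: This study proposes a simple yet effective approach for self-supervised video object segmentation (VOS).
methods: The key insight is that the inherent structural dependencies in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Hierarchical clustering is then used to generate object segmentation masks.
results: The method achieves state-of-the-art performance on multiple unsupervised VOS benchmarks, and particularly excels on complex real-world multi-object video segmentation tasks such as DAVIS-17-Unsupervised and YouTube-VIS-19.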

Abstract
In this paper, we propose a simple yet effective approach for self-supervised video object segmentation (VOS). Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Furthermore, simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. Previous self-supervised VOS techniques majorly resort to auxiliary modalities or utilize iterative slot attention to assist in object discovery, which restricts their general applicability and imposes higher computational requirements. To deal with these challenges, we develop a simplified architecture that capitalizes on the emerging objectness from DINO-pretrained Transformers, bypassing the need for additional modalities or slot attention. Specifically, we first introduce a single spatio-temporal Transformer block to process the frame-wise DINO features and establish spatio-temporal dependencies in the form of self-attention. Subsequently, utilizing these attention maps, we implement hierarchical clustering to generate object segmentation masks. To train the spatio-temporal block in a fully self-supervised manner, we employ semantic and dynamic motion consistency coupled with entropy normalization. Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and particularly excels in complex real-world multi-object video segmentation tasks such as DAVIS-17-Unsupervised and YouTube-VIS-19. The code and model checkpoints will be released at https://github.com/shvdiwnkozbw/SSL-UVOS.

Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation

paper_url: http://arxiv.org/abs/2311.17891
repo_url: https://github.com/orhir/PoseAnything
paper_authors: Or Hirschorn, Shai Avidan
for: A 2D pose estimation model that can handle novel object categories.
methods: A Graph Transformer Decoder that captures and exploits the geometrical relations between keypoints.
results: On the MP-100 benchmark, the method outperforms the prior state of the art by substantial margins, achieving improvements of 2.16% and 1.82% under 1-shot and 5-shot settings, respectively.

Abstract
Traditional 2D pose estimation models are limited by their category-specific design, making them suitable only for predefined object categories. This restriction becomes particularly challenging when dealing with novel objects due to the lack of relevant training data. To address this limitation, category-agnostic pose estimation (CAPE) was introduced. CAPE aims to enable keypoint localization for arbitrary object categories using a single model, requiring minimal support images with annotated keypoints. This approach not only enables object pose generation based on arbitrary keypoint definitions but also significantly reduces the associated costs, paving the way for versatile and adaptable pose estimation applications. We present a novel approach to CAPE that leverages the inherent geometrical relations between keypoints through a newly designed Graph Transformer Decoder. By capturing and incorporating this crucial structural information, our method enhances the accuracy of keypoint localization, marking a significant departure from conventional CAPE techniques that treat keypoints as isolated entities. We validate our approach on the MP-100 benchmark, a comprehensive dataset comprising over 20,000 images spanning more than 100 categories. Our method outperforms the prior state-of-the-art by substantial margins, achieving remarkable improvements of 2.16% and 1.82% under 1-shot and 5-shot settings, respectively. Furthermore, our method's end-to-end training demonstrates both scalability and efficiency compared to previous CAPE approaches.

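Summary
Traditional 2D pose estimation models are limited by their category-specific design, which makes them suitable only for predefined object categories. This restriction is especially hard to work around for novel objects, for which relevant training data is lacking. Category-agnostic pose estimation (CAPE) was introduced to address it: CAPE aims to localize keypoints for arbitrary object categories with a single model, given only a few support images with annotated keypoints. This allows pose estimation from arbitrary keypoint definitions and reduces the associated costs, enabling flexible and adaptable pose estimation applications. We propose a new CAPE approach that exploits the inherent geometric relations between keypoints through a newly designed Graph Transformer Decoder. Capturing and using this structural information improves keypoint localization accuracy, in contrast to conventional CAPE techniques that treat keypoints as isolated entities. We validate the approach on the MP-100 benchmark, which contains over 20,000 images spanning more than 100 categories. The method outperforms the prior state of the art by 2.16% and 1.82% under the 1-shot and 5-shot settings, respectively. Its end-to-end training is also shown to be scalable and efficient compared with previous CAPE approaches.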

TSDF-Sampling: Efficient Sampling for Neural Surface Field using Truncated Signed Distance Field

paper_url: http://arxiv.org/abs/2311.17878
repo_url: None
paper_authors: Chaerin Min, Sehyun Cha, Changhee Won, Jongwoo Lim
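for: This paper aims to speed up inference for multi-view neural surface reconstruction so that it can be used in real-time applications.
methods: The paper reduces the number of samples by incorporating the scene's Truncated Signed Distance Field (TSDF), while maintaining rendering quality. Prior importance-sampling methods depend on initial uniform samples over the entire space and therefore degrade when fewer samples are used. In contrast, this method uses the TSDF volume generated from the training views alone and shows that it provides a reasonable bound on sampling for novel views.
results: The method achieves high rendering quality without a large number of samples. Experiments show an 11-fold increase in inference speed without compromising performance. Result videos are available on the project page: https://tsdf-sampling.github.io/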

Abstract
Multi-view neural surface reconstruction has exhibited impressive results. However, a notable limitation is the prohibitively slow inference time when compared to traditional techniques, primarily attributed to the dense sampling required to maintain the rendering quality. This paper introduces a novel approach that substantially reduces the number of samples by incorporating the Truncated Signed Distance Field (TSDF) of the scene. While prior works have proposed importance sampling, their dependence on initial uniform samples over the entire space makes them unable to avoid performance degradation when trying to use fewer samples. In contrast, our method leverages the TSDF volume generated only by the trained views, and it proves to provide a reasonable bound on the sampling from upcoming novel views. As a result, we achieve high rendering quality by fully exploiting the continuous neural SDF estimation within the bounds given by the TSDF volume. Notably, our method is the first approach that can be robustly plugged into a diverse array of neural surface field models, as long as they use the volume rendering technique. Our empirical results show an 11-fold increase in inference speed without compromising performance. The result videos are available at our project page: https://tsdf-sampling.github.io/
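
The idea of bounding ray samples with a TSDF can be illustrated with a toy 1-D sketch. This is not the authors' implementation: the function name `tsdf_bounded_samples`, the coarse lookup along the ray, and the even spacing inside the bound are all assumptions made for illustration.

```python
def tsdf_bounded_samples(coarse_t, coarse_tsdf, truncation, n_samples):
    """Place n_samples depths only where a coarse TSDF lookup is un-truncated.

    coarse_t:    depths along the ray at which the TSDF volume was queried
    coarse_tsdf: TSDF values at those depths; far from any surface the value
                 saturates at +truncation or -truncation
    Returns evenly spaced depths between the first and last near-surface
    query, or an empty list if the ray never comes near a surface.
    (Hypothetical helper: a single interval is used even if the ray
    crosses several surfaces.)
    """
    near_surface = [t for t, d in zip(coarse_t, coarse_tsdf) if abs(d) < truncation]
    if not near_surface or n_samples < 1:
        return []
    t_near, t_far = min(near_surface), max(near_surface)
    if n_samples == 1:
        return [0.5 * (t_near + t_far)]
    step = (t_far - t_near) / (n_samples - 1)
    return [t_near + i * step for i in range(n_samples)]


# Five coarse queries along a ray; only depths 2.0 and 3.0 are near the surface.
samples = tsdf_bounded_samples(
    [0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 0.5, -0.5, -1.0], truncation=1.0, n_samples=3
)
print(samples)
```

In this toy setting the three samples all land between depths 2.0 and 3.0 instead of being spread over the whole ray, which is the intuition behind spending fewer network evaluations per ray.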

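Summary
Multi-view neural surface reconstruction has shown impressive results, but its inference is prohibitively slow compared with traditional techniques, mainly because of the dense sampling required to maintain rendering quality. This paper proposes a new method that substantially reduces the number of samples by incorporating the Truncated Signed Distance Field (TSDF) of the scene. Prior works have proposed importance sampling, but their dependence on initial uniform samples over the entire space means they cannot avoid performance degradation when fewer samples are used. In contrast, our method uses the TSDF volume generated only from the training views and shows that it provides a reasonable bound on sampling for upcoming novel views. We can therefore fully exploit the continuous neural SDF estimate within the bounds given by the TSDF volume and achieve high rendering quality. The method can be plugged into a wide range of neural surface field models, as long as they use volume rendering. Our experiments show an 11-fold increase in inference speed without compromising performance. Result videos are available on our project page: https://tsdf-sampling.github.io/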

Enhancing Post-Hoc Explanation Benchmark Reliability for Image Classification

paper_url: http://arxiv.org/abs/2311.17876
repo_url: None
paper_authors: Tristan Gomez, Harold Mouchère
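for: This study aims to make benchmarks of post-hoc explanation methods for image classification more reliable, using a psychometrics-inspired approach to quantify benchmark reliability.
methods: The study modifies model training by feeding perturbed samples and employing focal loss, in order to improve model robustness and calibration.
results: With these modifications, benchmark reliability improves significantly across metrics, datasets, and post-hoc methods. The work provides a foundation for more reliable evaluation of post-hoc explanation methods.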

Abstract
Deep neural networks, while powerful for image classification, often operate as "black boxes," complicating the understanding of their decision-making processes. Various explanation methods, particularly those generating saliency maps, aim to address this challenge. However, the inconsistency issues of faithfulness metrics hinder reliable benchmarking of explanation methods. This paper employs an approach inspired by psychometrics, utilizing Krippendorff's alpha to quantify the benchmark reliability of post-hoc methods in image classification. The study proposes model training modifications, including feeding perturbed samples and employing focal loss, to enhance robustness and calibration. Empirical evaluations demonstrate significant improvements in benchmark reliability across metrics, datasets, and post-hoc methods. This pioneering work establishes a foundation for more reliable evaluation practices in the realm of post-hoc explanation methods, emphasizing the importance of model robustness in the assessment process.
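
For readers unfamiliar with the reliability statistic mentioned above, the following is a minimal sketch of Krippendorff's alpha with the interval (squared-difference) metric. How the paper maps explanation-method scores onto "units" and "raters" is not specified here, so the input layout and the function name `krippendorff_alpha_interval` are assumptions for illustration only.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha = 1 - D_o / D_e with the interval metric.

    units: list of lists; each inner list holds the values assigned to one
           unit by the raters that scored it (missing ratings are omitted).
    Units with fewer than two values are not pairable and are ignored.
    """
    pairable = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in pairable)
    if n < 2:
        raise ValueError("need at least one unit with two or more values")

    # Observed disagreement: squared differences within each unit.
    d_o = 0.0
    for u in pairable:
        within = sum(
            (a - b) ** 2 for i, a in enumerate(u) for j, b in enumerate(u) if i != j
        )
        d_o += within / (len(u) - 1)
    d_o /= n

    # Expected disagreement: squared differences across all pairable values.
    values = [v for u in pairable for v in u]
    d_e = sum(
        (a - b) ** 2
        for i, a in enumerate(values)
        for j, b in enumerate(values)
        if i != j
    ) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all values are identical; alpha is undefined")
    return 1.0 - d_o / d_e


perfect = krippendorff_alpha_interval([[1, 1], [2, 2]])
systematic_disagreement = krippendorff_alpha_interval([[1, 2], [1, 2]])
print(perfect, systematic_disagreement)
```

An alpha of 1 indicates perfect agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.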

Summary
This paper proposes a new approach inspired by psychometrics, which uses Krippendorff's alpha to quantify the reliability of post-hoc explanation methods in image classification. The study also introduces training modifications, such as feeding perturbed samples and employing focal loss, to enhance the robustness and calibration of the models. Empirical evaluations show significant improvements in benchmark reliability across different metrics, datasets, and post-hoc methods. This work establishes a foundation for more reliable evaluation practices in the field of post-hoc explanation methods, highlighting the importance of model robustness in the assessment process.

FisherRF: Active View Selection and Uncertainty Quantification for Radiance Fields using Fisher Information

paper_url: http://arxiv.org/abs/2311.17874
repo_url: https://github.com/JiangWenPL/FisherRF
paper_authors: Wen Jiang, Boshu Lei, Kostas Daniilidis
for: This paper addresses active view selection and uncertainty quantification for Radiance Fields.
methods: The paper uses Fisher Information to efficiently quantify the observed information within Radiance Fields without ground truth data, overcoming existing limitations on model architecture and effectiveness.
results: The method achieves state-of-the-art results in both view selection and uncertainty quantification. With the 3D Gaussian Splatting backend, it can perform view selection at 70 fps.

Abstract
This study addresses the challenging problem of active view selection and uncertainty quantification within the domain of Radiance Fields. Neural Radiance Fields (NeRF) have greatly advanced image rendering and reconstruction, but the limited availability of 2D images poses uncertainties stemming from occlusions, depth ambiguities, and imaging errors. Efficiently selecting informative views becomes crucial, and quantifying NeRF model uncertainty presents intricate challenges. Existing approaches either depend on model architecture or are based on assumptions regarding density distributions that are not generally applicable. By leveraging Fisher Information, we efficiently quantify observed information within Radiance Fields without ground truth data. This can be used for the next best view selection and pixel-wise uncertainty quantification. Our method overcomes existing limitations on model architecture and effectiveness, achieving state-of-the-art results in both view selection and uncertainty quantification, demonstrating its potential to advance the field of Radiance Fields. Our method with the 3D Gaussian Splatting backend could perform view selections at 70 fps.

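Summary
This study addresses active view selection and uncertainty quantification in Neural Radiance Fields (NeRF). The limited availability of 2D images introduces uncertainties from occlusions, depth ambiguities, and imaging errors, so selecting informative views becomes crucial. Existing approaches either depend on the model architecture or rely on assumptions about density distributions that are not generally applicable. We use Fisher Information to efficiently quantify the observed information within Radiance Fields without ground truth data. This can be used for next-best-view selection and pixel-wise uncertainty quantification. Our method overcomes existing limitations on model architecture and effectiveness and achieves state-of-the-art results, demonstrating its potential to advance the field of Radiance Fields. With the 3D Gaussian Splatting backend, our method can perform view selection at 70 fps.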

Gaussian Shell Maps for Efficient 3D Human Generation

paper_url: http://arxiv.org/abs/2311.17857
repo_url: https://github.com/computational-imaging/GSM
paper_authors: Rameen Abdal, Wang Yifan, Zifan Shi, Yinghao Xu, Ryan Po, Zhengfei Kuang, Qifeng Chen, Dit-Yan Yeung, Gordon Wetzstein
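for: This paper aims to make the generation of 3D digital humans more efficient, to meet the needs of fields such as virtual reality, social media, and film production.
methods: The paper proposes a framework based on Gaussian Shell Maps (GSMs) that connects current SOTA generator network architectures with emerging 3D Gaussian rendering primitives through an articulable multi-shell scaffold. In this setting, a CNN generates a 3D texture stack whose features are mapped to the shells. Instead of rasterizing the shells directly, 3D Gaussians are sampled on the shells, with their attributes encoded in the texture features. These Gaussians can be rendered efficiently and differentiably.
results: Experiments show that GSMs generate high-quality 3D digital humans and can be trained on single-view datasets such as SHHQ and DeepFashion. GSMs also achieve high-quality multi-view-consistent rendering at a native resolution of 512×512 pixels, without view-inconsistent upsamplers.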

Abstract
Efficient generation of 3D digital humans is important in several industries, including virtual reality, social media, and cinematic production. 3D generative adversarial networks (GANs) have demonstrated state-of-the-art (SOTA) quality and diversity for generated assets. Current 3D GAN architectures, however, typically rely on volume representations, which are slow to render, thereby hampering the GAN training and requiring multi-view-inconsistent 2D upsamplers. Here, we introduce Gaussian Shell Maps (GSMs) as a framework that connects SOTA generator network architectures with emerging 3D Gaussian rendering primitives using an articulable multi shell--based scaffold. In this setting, a CNN generates a 3D texture stack with features that are mapped to the shells. The latter represent inflated and deflated versions of a template surface of a digital human in a canonical body pose. Instead of rasterizing the shells directly, we sample 3D Gaussians on the shells whose attributes are encoded in the texture features. These Gaussians are efficiently and differentiably rendered. The ability to articulate the shells is important during GAN training and, at inference time, to deform a body into arbitrary user-defined poses. Our efficient rendering scheme bypasses the need for view-inconsistent upsamplers and achieves high-quality multi-view consistent renderings at a native resolution of $512 \times 512$ pixels. We demonstrate that GSMs successfully generate 3D humans when trained on single-view datasets, including SHHQ and DeepFashion.

Summary
The efficient generation of 3D digital humans is crucial in various industries, such as virtual reality, social media, and cinematic production. 3D generative adversarial networks (GANs) have achieved state-of-the-art (SOTA) quality and diversity for generated assets. However, current 3D GAN architectures typically rely on volume representations, which are slow to render, hindering GAN training and requiring multi-view-inconsistent 2D upsamplers. To address this challenge, we propose Gaussian Shell Maps (GSMs) as a framework that connects SOTA generator network architectures with emerging 3D Gaussian rendering primitives using an articulable multi-shell-based scaffold. In this setting, a convolutional neural network (CNN) generates a 3D texture stack with features that are mapped to the shells. The shells represent inflated and deflated versions of a template surface of a digital human in a canonical body pose. Instead of rasterizing the shells directly, we sample 3D Gaussians on the shells, whose attributes are encoded in the texture features. These Gaussians are efficiently and differentiably rendered. The ability to articulate the shells is crucial during GAN training and, at inference time, to deform a body into arbitrary user-defined poses. Our efficient rendering scheme bypasses the need for view-inconsistent upsamplers and achieves high-quality multi-view consistent renderings at a native resolution of $512 \times 512$ pixels. We demonstrate that GSMs successfully generate 3D humans when trained on single-view datasets, including SHHQ and DeepFashion.

Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects

paper_url: http://arxiv.org/abs/2311.17851
repo_url: None
paper_authors: Rishabh Kabra, Loic Matthey, Alexander Lerchner, Niloy J. Mitra
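for: This study aims to use pretrained vision language models (VLMs) for a range of annotation tasks on unlabeled 3D objects, from describing object semantics to physical properties.
methods: The study uses probabilistic aggregation: it marginalizes over the factors varied across VLM queries, using the VLM's scores for sampled responses. This aggregation can outperform summarization by a language model (e.g., GPT4), for instance by avoiding hallucinations when responses contain contrasting details.
results: The aggregated annotations improve downstream VLM predictions, such as an object's material when its type is given in the prompt. The method also allows measuring the contribution of visual reasoning over language-only reasoning, and it approaches the quality of human-verified annotations without additional training or in-context learning.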

Abstract
Unlabeled 3D objects present an opportunity to leverage pretrained vision language models (VLMs) on a range of annotation tasks -- from describing object semantics to physical properties. An accurate response must take into account the full appearance of the object in 3D, various ways of phrasing the question/prompt, and changes in other factors that affect the response. We present a method to marginalize over any factors varied across VLM queries, utilizing the VLM's scores for sampled responses. We first show that this probabilistic aggregation can outperform a language model (e.g., GPT4) for summarization, for instance avoiding hallucinations when there are contrasting details between responses. Secondly, we show that aggregated annotations are useful for prompt-chaining; they help improve downstream VLM predictions (e.g., of object material when the object's type is specified as an auxiliary input in the prompt). Such auxiliary inputs allow ablating and measuring the contribution of visual reasoning over language-only reasoning. Using these evaluations, we show how VLMs can approach, without additional training or in-context learning, the quality of human-verified type and material annotations on the large-scale Objaverse dataset.
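
The marginalization step described above can be sketched as averaging per-query response distributions. This is a simplified reading, not the paper's exact procedure: the softmax normalization of scores, the uniform weighting of queries, and the names `softmax` and `aggregate` are assumptions, and the scores below are made-up toy values.

```python
import math


def softmax(scores):
    """Turn a {response: log-score} dict into a probability distribution."""
    peak = max(scores.values())
    exps = {resp: math.exp(s - peak) for resp, s in scores.items()}
    total = sum(exps.values())
    return {resp: e / total for resp, e in exps.items()}


def aggregate(per_query_scores):
    """Average response distributions over queries (uniform weight per query).

    per_query_scores: one {response: log-score} dict per VLM query, where
    queries differ in some varied factor (e.g. rendered view or prompt wording).
    A response missing from a query contributes probability 0 for that query.
    """
    totals = {}
    for scores in per_query_scores:
        for resp, prob in softmax(scores).items():
            totals[resp] = totals.get(resp, 0.0) + prob
    n = len(per_query_scores)
    return {resp: t / n for resp, t in totals.items()}


# Toy example: two views of one object with hypothetical material log-scores.
queries = [
    {"wood": 0.0, "metal": 0.0},
    {"wood": math.log(3.0), "metal": 0.0},
]
marginal = aggregate(queries)
best = max(marginal, key=marginal.get)
print(marginal, best)
```

Averaging distributions rather than asking a language model to summarize free-text answers keeps every candidate's probability mass explicit, which is one way contrasting details between responses can be reconciled without inventing new content.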

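Summary
Unlabeled 3D objects offer an opportunity to use pretrained vision language models (VLMs) for a range of annotation tasks, from describing object semantics to physical properties. An accurate response must take into account the full appearance of the object in 3D, the various ways of phrasing the question or prompt, and changes in other factors that affect the response. We present a method that marginalizes over any factors varied across VLM queries, using the VLM's scores for sampled responses. We first show that this probabilistic aggregation can outperform a language model (e.g., GPT4) at summarization, for instance by avoiding hallucinations when responses contain contrasting details. Second, we show that aggregated annotations are useful for prompt-chaining: they help improve downstream VLM predictions (e.g., of object material when the object's type is specified as an auxiliary input in the prompt). Such auxiliary inputs allow ablating and measuring the contribution of visual reasoning over language-only reasoning. Using these evaluations, we show how VLMs can approach, without additional training or in-context learning, the quality of human-verified type and material annotations on the large-scale Objaverse dataset.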

Towards Real-World Focus Stacking with Deep Learning

paper_url: http://arxiv.org/abs/2311.17846
repo_url: https://github.com/araujoalexandre/focusstackingdataset
paper_authors: Alexandre Araujo, Jean Ponce, Julien Mairal
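for: This paper proposes a deep learning method for focus stacking (multi-focus image fusion) that can handle long image bursts.
methods: The method is a deep learning algorithm trained on a new dataset of 94 high-resolution raw image bursts captured with focus bracketing.
results: The method handles bursts long enough for real-world use, is on par with existing commercial solutions, and is more tolerant to noise.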

Abstract
Focus stacking is widely used in micro, macro, and landscape photography to reconstruct all-in-focus images from multiple frames obtained with focus bracketing, that is, with shallow depth of field and different focus planes. Existing deep learning approaches to the underlying multi-focus image fusion problem have limited applicability to real-world imagery since they are designed for very short image sequences (two to four images), and are typically trained on small, low-resolution datasets either acquired by light-field cameras or generated synthetically. We introduce a new dataset consisting of 94 high-resolution bursts of raw images with focus bracketing, with pseudo ground truth computed from the data using state-of-the-art commercial software. This dataset is used to train the first deep learning algorithm for focus stacking capable of handling bursts of sufficient length for real-world applications. Qualitative experiments demonstrate that it is on par with existing commercial solutions in the long-burst, realistic regime while being significantly more tolerant to noise. The code and dataset are available at https://github.com/araujoalexandre/FocusStackingDataset.

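Summary
Focus stacking is widely used in micro, macro, and landscape photography to reconstruct all-in-focus images from multiple frames captured with focus bracketing. Existing deep learning approaches to the underlying multi-focus image fusion problem have limited applicability to real-world imagery, because they are designed for only two to four images and are typically trained on small, low-resolution datasets that are either acquired by light-field cameras or generated synthetically. We introduce a new dataset of 94 high-resolution bursts of raw images, with pseudo ground truth computed using commercial software. This dataset is used to train the first deep learning algorithm that can handle long bursts. In the long-burst, realistic regime it performs on par with existing commercial solutions while being more tolerant to noise. The code and dataset are available at https://github.com/araujoalexandre/FocusStackingDataset.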

SPiC-E: Structural Priors in 3D Diffusion Models using Cross Entity Attention

paper_url: http://arxiv.org/abs/2311.17834
repo_url: None
paper_authors: Etai Sella, Gal Fiebelman, Noam Atia, Hadar Averbuch-Elor
for: This paper is written for democratizing 3D content creation by improving the efficiency and versatility of 3D diffusion models.
methods: The paper introduces a neural network called SPiC-E that adds structural guidance to 3D diffusion models, allowing for task-specific structural priors to be learned from auxiliary guidance shapes.
results: The paper shows that SPiC-E achieves state-of-the-art (SOTA) performance on various applications such as 3D stylization, semantic shape editing, and text-conditional abstraction-to-3D, while being faster than alternative methods.

Abstract
We are witnessing rapid progress in automatically generating and manipulating 3D assets due to the availability of pretrained text-image diffusion models. However, time-consuming optimization procedures are required for synthesizing each sample, hindering their potential for democratizing 3D content creation. Conversely, 3D diffusion models now train on million-scale 3D datasets, yielding high-quality text-conditional 3D samples within seconds. In this work, we present SPiC-E - a neural network that adds structural guidance to 3D diffusion models, extending their usage beyond text-conditional generation. At its core, our framework introduces a cross-entity attention mechanism that allows for multiple entities (in particular, paired input and guidance 3D shapes) to interact via their internal representations within the denoising network. We utilize this mechanism for learning task-specific structural priors in 3D diffusion models from auxiliary guidance shapes. We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing and text-conditional abstraction-to-3D, which transforms primitive-based abstractions into highly-expressive shapes. Extensive experiments demonstrate that SPiC-E achieves SOTA performance over these tasks while often being considerably faster than alternative methods. Importantly, this is accomplished without tailoring our approach for any specific task.

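Summary
We are witnessing rapid progress in automatically generating and manipulating 3D assets, thanks to the availability of pretrained text-image diffusion models. However, each sample requires a time-consuming optimization procedure, which limits their potential for 3D content creation. In contrast, 3D diffusion models are now trained on million-scale 3D datasets and produce high-quality text-conditional 3D samples within seconds. In this work, we present SPiC-E, a neural network that adds structural guidance to 3D diffusion models and extends their use beyond text-conditional generation. At its core is a cross-entity attention mechanism that lets multiple entities (in particular, paired input and guidance 3D shapes) interact through their internal representations within the denoising network. We use this mechanism to learn task-specific structural priors in 3D diffusion models from auxiliary guidance shapes. Our approach supports a variety of applications, including 3D stylization, semantic shape editing, and text-conditional abstraction-to-3D. Experiments show that SPiC-E achieves SOTA performance on these tasks while often being faster than alternative methods. Importantly, our approach does not need to be tailored to any specific task.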

paper_url: http://arxiv.org/abs/2311.17812
repo_url: None
paper_authors: Ting Liu, Yue Hu, Wansen Wu, Youkai Wang, Kai Xu, Quanjun Yin
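for: Improve the ability of autonomous embodied agents to follow language instructions and navigate unseen environments, in particular when using pretrained vision-and-language models.
methods: The paper proposes a novel, model-agnostic domain-aware prompt learning (DAP) framework. It uses a low-cost prompt tuning paradigm so that the visual encoder of a pretrained model learns object-level and scene-level cross-modal alignment.
results: Experiments on R2R and REVERIE show that DAP outperforms existing state-of-the-art methods.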

Abstract
Following language instructions to navigate in unseen environments is a challenging task for autonomous embodied agents. With strong representation capabilities, pretrained vision-and-language models are widely used in VLN. However, most of them are trained on web-crawled general-purpose datasets, which incurs a considerable domain gap when used for VLN tasks. To address the problem, we propose a novel and model-agnostic domain-aware prompt learning (DAP) framework. For equipping the pretrained models with specific object-level and scene-level cross-modal alignment in VLN tasks, DAP applies a low-cost prompt tuning paradigm to learn soft visual prompts for extracting in-domain image semantics. Specifically, we first generate a set of in-domain image-text pairs with the help of the CLIP model. Then we introduce soft visual prompts in the input space of the visual encoder in a pretrained model. DAP injects in-domain visual knowledge into the visual encoder of the pretrained model in an efficient way. Experimental results on both R2R and REVERIE show the superiority of DAP compared to existing state-of-the-art methods.

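Summary
Following language instructions to navigate in unseen environments is a challenging task for autonomous embodied agents. Thanks to their strong representation capabilities, pretrained vision-and-language models are widely used in VLN. However, most of them are trained on web-crawled general-purpose datasets, which leads to a considerable domain gap. To address this problem, we propose a novel, model-agnostic domain-aware prompt learning (DAP) framework. To equip pretrained models with object-level and scene-level cross-modal alignment for VLN tasks, DAP uses a low-cost prompt tuning paradigm to learn soft visual prompts. Specifically, we first generate a set of in-domain image-text pairs with the CLIP model. We then introduce soft visual prompts into the visual encoder of the pretrained model. In this way, DAP injects in-domain visual knowledge into the visual encoder of the pretrained model. Experiments on R2R and REVERIE show that DAP is superior to existing state-of-the-art methods.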

Coloring the Past: Neural Historical Buildings Reconstruction from Archival Photography

paper_url: http://arxiv.org/abs/2311.17810
repo_url: None
paper_authors: David Komorowicz, Lu Sang, Ferdinand Maiwald, Daniel Cremers
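for: Historical buildings are a treasure and milestone of human cultural heritage, and reconstructing their 3D models has significant value.
methods: The paper proposes a neural-rendering-based method for reconstructing 3D models of historical buildings. It uses dense point clouds as a geometric prior and introduces a color appearance embedding loss to recover color information.
results: The method recovers the 3D shape of historical buildings and can recover color from a limited number of color images. The work aims to encourage interest in preserving historical buildings, and it also provides a new historical dataset of the Hungarian National Theater as a new benchmark for reconstruction methods.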

Abstract
Historical buildings are a treasure and milestone of human cultural heritage. Reconstructing the 3D models of these buildings holds significant value. The rapid development of neural rendering methods makes it possible to recover the 3D shape only based on archival photographs. However, this task presents considerable challenges due to the limitations of such datasets. Historical photographs are often limited in number and the scenes in these photos might have altered over time. The radiometric quality of these images is also often sub-optimal. To address these challenges, we introduce an approach to reconstruct the geometry of historical buildings, employing volumetric rendering techniques. We leverage dense point clouds as a geometric prior and introduce a color appearance embedding loss to recover the color of the building given limited available color images. We aim for our work to spark increased interest and focus on preserving historical buildings. Thus, we also introduce a new historical dataset of the Hungarian National Theater, providing a new benchmark for the reconstruction method.

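Summary
Historical buildings are a treasure and milestone of human cultural heritage, and reconstructing their 3D models has significant value. With the rapid development of neural rendering, the 3D shape of a building can now be recovered from archival photographs. The task is challenging, however, mainly because archival photographs are limited in number and the scenes in them may have changed over time. The radiometric quality of historical photographs is also often poor. To address these challenges, we propose a method that uses dense point clouds as a geometric prior and introduces a color appearance embedding loss to recover the color of the building, even when only a limited number of color images are available. We hope our work sparks more interest in preserving historical buildings. We therefore also release a new historical dataset of the Hungarian National Theater as a new benchmark for reconstruction methods.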

Aggregation Model Hyperparameters Matter in Digital Pathology

paper_url: http://arxiv.org/abs/2311.17804
repo_url: None
paper_authors: Gustav Bredell, Marcel Fischer, Przemyslaw Szostak, Samaneh Abbasi-Sureshjani, Alvaro Gomariz
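for: This study examines how feature extractor models are evaluated in digital pathology, where gigapixel whole-slide images (WSI) are analyzed to support disease detection and pathologist efficiency.
methods: WSIs are divided into patches, a feature extractor model is applied to each patch to obtain feature vectors, and an aggregation model predicts the WSI label from them. The study evaluates feature extractors across many aggregation model hyperparameter configurations.
results: Traditional evaluation can be biased by fixed aggregation model hyperparameters, which skews comparisons between feature extractor models. After accounting for this co-dependence, many current feature extractor models perform similarly. This comprehensive approach gives a more nuanced understanding of the relationship between feature extractors and aggregation models, and a fairer and more accurate assessment of feature extractors in digital pathology.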

Abstract
Digital pathology has significantly advanced disease detection and pathologist efficiency through the analysis of gigapixel whole-slide images (WSI). In this process, WSIs are first divided into patches, for which a feature extractor model is applied to obtain feature vectors, which are subsequently processed by an aggregation model to predict the respective WSI label. With the rapid evolution of representation learning, numerous new feature extractor models, often termed foundational models, have emerged. Traditional evaluation methods, however, rely on fixed aggregation model hyperparameters, a framework we identify as potentially biasing the results. Our study uncovers a co-dependence between feature extractor models and aggregation model hyperparameters, indicating that performance comparability can be skewed based on the chosen hyperparameters. By accounting for this co-dependency, we find that the performance of many current feature extractor models is notably similar. We support this insight by evaluating seven feature extractor models across three different datasets with 162 different aggregation model configurations. This comprehensive approach provides a more nuanced understanding of the relationship between feature extractors and aggregation models, leading to a fairer and more accurate assessment of feature extractor models in digital pathology.
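
The effect described above, where a fixed aggregation configuration can flip the ranking of feature extractors, can be shown with a toy comparison. The extractor names, configuration labels, and scores below are invented for illustration and are not results from the paper; `rank_extractors` is a hypothetical helper.

```python
def rank_extractors(scores, config=None):
    """Rank feature extractors by score, best first.

    scores: {extractor_name: {aggregation_config: metric}}
    config: if given, every extractor is compared under that single
            aggregation configuration; if None, each extractor is
            represented by its best configuration.
    """
    def key(name):
        per_config = scores[name]
        return per_config[config] if config is not None else max(per_config.values())

    return sorted(scores, key=key, reverse=True)


# Made-up numbers: each extractor prefers a different aggregation setting.
toy_scores = {
    "extractor_A": {"lr=1e-3": 0.90, "lr=1e-4": 0.84},
    "extractor_B": {"lr=1e-3": 0.85, "lr=1e-4": 0.91},
}

fixed_high_lr = rank_extractors(toy_scores, config="lr=1e-3")
fixed_low_lr = rank_extractors(toy_scores, config="lr=1e-4")
tuned = rank_extractors(toy_scores)
print(fixed_high_lr, fixed_low_lr, tuned)
```

In this toy table the two fixed settings give opposite rankings, while the tuned comparison shows the extractors within 0.01 of each other. In practice the best configuration would be chosen on a validation split rather than on the test scores themselves.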

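Summary
Digital pathology has significantly advanced disease detection and pathologist efficiency through the analysis of gigapixel whole-slide images (WSI). In this process, WSIs are first divided into patches, a feature extractor model is applied to obtain feature vectors, and an aggregation model then predicts the WSI label. With the rapid development of representation learning, many new feature extractor models, often called foundational models, have emerged. Traditional evaluation methods usually rely on fixed aggregation model hyperparameters, which we identify as a potential source of bias. Our study finds a co-dependence between feature extractor models and aggregation model hyperparameters, meaning that performance comparisons can be skewed by the chosen hyperparameters. Accounting for this co-dependence, we find that many current feature extractor models in fact perform similarly. We support this finding by evaluating seven feature extractor models on three datasets with 162 aggregation model configurations. This comprehensive approach gives a more nuanced understanding of the relationship between feature extractors and aggregation models, leading to a fairer and more accurate assessment in digital pathology.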

U-Net v2: Rethinking the Skip Connections of U-Net for Medical Image Segmentation

paper_url: http://arxiv.org/abs/2311.17791
repo_url: https://github.com/yaoppeng/u-net_v2
paper_authors: Yaopeng Peng, Milan Sonka, Danny Z. Chen
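for: This paper aims to improve medical image segmentation by introducing a new U-Net variant that incorporates semantic information and finer details.
methods: The method extracts multi-level features with a deep neural network encoder. It then infuses semantic information from higher-level features and finer details from lower-level features into each level through the Hadamard product. These novel skip connections give features at all levels enriched semantic characteristics and intricate details.
results: On several public medical image segmentation datasets, the method's segmentation accuracy exceeds that of current methods while preserving memory and computational efficiency.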

Abstract
In this paper, we introduce U-Net v2, a new robust and efficient U-Net variant for medical image segmentation. It aims to augment the infusion of semantic information into low-level features while simultaneously refining high-level features with finer details. For an input image, we begin by extracting multi-level features with a deep neural network encoder. Next, we enhance the feature map of each level by infusing semantic information from higher-level features and integrating finer details from lower-level features through Hadamard product. Our novel skip connections empower features of all the levels with enriched semantic characteristics and intricate details. The improved features are subsequently transmitted to the decoder for further processing and segmentation. Our method can be seamlessly integrated into any Encoder-Decoder network. We evaluate our method on several public medical image segmentation datasets for skin lesion segmentation and polyp segmentation, and the experimental results demonstrate the segmentation accuracy of our new method over state-of-the-art methods, while preserving memory and computational efficiency. Code is available at: https://github.com/yaoppeng/U-Net_v2
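
The Hadamard-product fusion of multi-level features can be illustrated on single-channel toy feature maps. This sketch leaves out the convolutions and attention that a real network would apply before fusion; the nearest-neighbor resizing and the names `resize_nearest` and `hadamard_fuse` are assumptions, not the released code.

```python
def resize_nearest(fmap, height, width):
    """Nearest-neighbor resize of a 2D list to (height, width)."""
    src_h, src_w = len(fmap), len(fmap[0])
    return [
        [fmap[i * src_h // height][j * src_w // width] for j in range(width)]
        for i in range(height)
    ]


def hadamard_fuse(levels, target):
    """Fuse all encoder levels into the resolution of levels[target].

    Every level is resized to the target level's shape and the maps are
    multiplied element-wise (Hadamard product), so each output position
    combines fine detail from shallower levels with semantics from deeper ones.
    """
    height, width = len(levels[target]), len(levels[target][0])
    fused = [[1.0] * width for _ in range(height)]
    for fmap in levels:
        resized = resize_nearest(fmap, height, width)
        fused = [
            [f * r for f, r in zip(fused_row, resized_row)]
            for fused_row, resized_row in zip(fused, resized)
        ]
    return fused


# Level 0: high-resolution detail; level 1: low-resolution semantic response.
levels = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0]],
]
fused_fine = hadamard_fuse(levels, target=0)
fused_coarse = hadamard_fuse(levels, target=1)
print(fused_fine, fused_coarse)
```

Because the product is taken per position, a strong semantic response at a coarse level scales the fine-level detail at the corresponding locations, and the same routine yields one fused map per decoder level.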

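Summary
In this paper, we introduce U-Net v2, a new robust and efficient U-Net variant for medical image segmentation. It aims to strengthen the infusion of semantic information into low-level features while refining high-level features with finer details. For an input image, we first extract multi-level features with a deep neural network encoder. We then enhance the feature map at each level with semantic information and finer details through the Hadamard product. Our new skip connections give features at all levels enriched semantic characteristics and details. The improved features are passed to the decoder for further processing and segmentation. Our method can be easily integrated into any Encoder-Decoder network. Experiments on several public medical image segmentation datasets show that the new method's segmentation accuracy exceeds that of current methods while preserving memory and computational efficiency. Code is available at https://github.com/yaoppeng/U-Net_v2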

One-Shot Open Affordance Learning with Foundation Models

paper_url: http://arxiv.org/abs/2311.17776
repo_url: None
paper_authors: Gen Li, Deqing Sun, Laura Sevilla-Lara, Varun Jampani
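for: This paper proposes One-shot Open Affordance Learning (OOAL), in which a model trained with one example per base object category is expected to identify novel objects and affordances.
methods: The authors conduct a comprehensive analysis of existing foundation models to assess their inherent understanding of affordances. They then propose a simple and effective vision-language framework that improves the alignment between visual features and affordance text embeddings.
results: Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models and generalizes reasonably to unseen objects and affordances.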

Abstract
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category, but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes, they often struggle to understand finer levels of granularity such as affordances. To handle this issue, we conduct a comprehensive analysis of existing foundation models, to explore their inherent understanding of affordances and assess the potential for data-limited affordance learning. We then propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data, and exhibits reasonable generalization capability on unseen objects and affordances.

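Summary
We introduce One-shot Open Affordance Learning (OOAL), in which a model is trained with just one example per base object category but is expected to identify novel objects and affordances. Vision-language models excel at recognizing novel objects and scenes, but they often struggle with finer levels of granularity such as affordances. To address this, we conduct a comprehensive analysis of existing foundation models to explore their inherent understanding of affordances and to assess the potential for data-limited affordance learning. We then propose a vision-language framework with simple and effective designs that improve the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models and generalizes reasonably to unseen objects and affordances.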

PillarNeSt: Embracing Backbone Scaling and Pretraining for Pillar-based 3D Object Detection

paper_url: http://arxiv.org/abs/2311.17770
repo_url: None
paper_authors: Weixin Mao, Tiancai Wang, Diankun Zhang, Junjie Yan, Osamu Yoshie
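for: Improve the performance of pillar-based 3D object detectors.
methods: Dense ConvNets pretrained on large-scale image datasets are scaled and adaptively designed for point clouds, then used as the 2D backbone of pillar-based detectors.
results: The detector outperforms existing 3D object detectors by a large margin on the nuScenes and Argoversev2 datasets.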

Abstract
This paper shows the effectiveness of 2D backbone scaling and pretraining for pillar-based 3D object detectors. Pillar-based methods mainly employ randomly initialized 2D convolution neural network (ConvNet) for feature extraction and fail to enjoy the benefits from the backbone scaling and pretraining in the image domain. To show the scaling-up capacity in point clouds, we introduce the dense ConvNet pretrained on large-scale image datasets (e.g., ImageNet) as the 2D backbone of pillar-based detectors. The ConvNets are adaptively designed based on the model size according to the specific features of point clouds, such as sparsity and irregularity. Equipped with the pretrained ConvNets, our proposed pillar-based detector, termed PillarNeSt, outperforms the existing 3D object detectors by a large margin on the nuScenes and Argoversev2 datasets. Our code shall be released upon acceptance.

Summary
Note:
* "Pillar-based" refers to methods that group a point cloud into vertical columns (pillars) and encode them as a 2D pseudo-image, from which a 2D backbone extracts features.
* "2D backbone scaling" refers to the practice of scaling up the size of the 2D backbone (i.e., the convolutional neural network) to improve performance.
* "Pretraining" refers to the practice of pretraining the 2D backbone on a large dataset (such as ImageNet) before fine-tuning it on a smaller dataset (such as nuScenes or Argoversev2).

Cinematic Behavior Transfer via NeRF-based Differentiable Filming

paper_url: http://arxiv.org/abs/2311.17754
repo_url: None
paper_authors: Xuekun Jiang, Anyi Rao, Jingbo Wang, Dahua Lin, Bo Dai
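for: This paper targets the precise manipulation and reproduction of visual elements, such as camera movements and character actions, in digital media and video production.
methods: The paper uses a reverse filming behavior estimation technique that optimizes camera trajectories by using NeRF as a differentiable renderer and refining SMPL tracks. It also uses a cinematic transfer pipeline to transfer various shot types to a new 2D video or a 3D virtual environment.
results: Experiments show that the cinematic transfer pipeline reproduces camera trajectories and character actions accurately and achieves a higher rating in the user study.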

Abstract
In the evolving landscape of digital media and video production, the precise manipulation and reproduction of visual elements like camera movements and character actions are highly desired. Existing SLAM methods face limitations in dynamic scenes and human pose estimation often focuses on 2D projections, neglecting 3D statuses. To address these issues, we first introduce a reverse filming behavior estimation technique. It optimizes camera trajectories by leveraging NeRF as a differentiable renderer and refining SMPL tracks. We then introduce a cinematic transfer pipeline that is able to transfer various shot types to a new 2D video or a 3D virtual environment. The incorporation of 3D engine workflow enables superior rendering and control abilities, which also achieves a higher rating in the user study.

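Summary
In the evolving landscape of digital media and video production, there is growing demand for precise control and reproduction of visual elements such as camera movements and character actions. Existing SLAM methods are limited in dynamic scenes, and human pose estimation often focuses on 2D projections while neglecting 3D status. To address these issues, we first introduce a reverse filming behavior estimation technique. It uses NeRF as a differentiable renderer and refines SMPL tracks to optimize camera trajectories. We then introduce a cinematic transfer pipeline that can transfer shots to a new 2D video or a 3D virtual environment. The pipeline incorporates a 3D engine workflow, which provides superior rendering and control abilities and achieves a higher rating in the user study.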

BAND-2k: Banding Artifact Noticeable Database for Banding Detection and Quality Assessment

paper_url: http://arxiv.org/abs/2311.17752
repo_url: None
paper_authors: Zijian Chen, Wei Sun, Jun Jia, Fangfang Lu, Zicheng Zhang, Jing Liu, Ru Huang, Xiongkuo Min, Guangtao Zhai
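for: This paper aims to detect banding artifacts in images and to assess their perceptual visual quality.
methods: The authors build BAND-2k, a database of 2,000 banding images generated by 15 compression and quantization schemes. A subjective experiment with 23 participants yielded over 214,000 patch-level banding class labels and 44,371 reliable image-level quality ratings.
results: The paper proposes a no-reference (NR) banding evaluator that exploits the frequency characteristics of banding artifacts. Experiments show high accuracy in banding detection and high SRCC and PLCC against the perceptual quality labels, indicating a strong correlation between banding intensity and perceived visual quality.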

Abstract
Banding, also known as staircase-like contours, frequently occurs in flat areas of images/videos processed by the compression or quantization algorithms. As undesirable artifacts, banding destroys the original image structure, thus degrading users' quality of experience (QoE). In this paper, we systematically investigate the banding image quality assessment (IQA) problem, aiming to detect the image banding artifacts and evaluate their perceptual visual quality. Considering that the existing image banding databases only contain limited content sources and banding generation methods, and lack perceptual quality labels (i.e. mean opinion scores), we first build the largest banding IQA database so far, named Banding Artifact Noticeable Database (BAND-2k), which consists of 2,000 banding images generated by 15 compression and quantization schemes. A total of 23 workers participated in the subjective IQA experiment, yielding over 214,000 patch-level banding class labels and 44,371 reliable image-level quality ratings. Subsequently, we develop an effective no-reference (NR) banding evaluator for banding detection and quality assessment by leveraging frequency characteristics of banding artifacts. A dual convolutional neural network is employed to concurrently learn the feature representation from the high-frequency and low-frequency maps, thereby enhancing the ability to discern banding artifacts. The quality score of a banding image is generated by pooling the banding detection maps masked by the spatial frequency filters. Experiments demonstrate that our banding evaluator achieves a remarkably high accuracy in banding detection and also exhibits high SRCC and PLCC results with the perceptual quality labels. These findings unveil the strong correlations between the intensity of banding artifacts and the perceptual visual quality, thus validating the necessity of banding quality assessment.
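
The abstract reports agreement with the perceptual labels through SRCC and PLCC. As a minimal sketch of how these two correlations are usually computed, the snippet below uses only the standard library. The score lists are made-up numbers, not data from BAND-2k, and the function names are illustrative.

```python
import math


def plcc(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


# Hypothetical predicted quality scores and mean opinion scores.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
mos = [1.0, 4.0, 9.0, 16.0, 25.0]
print(srcc(predicted, mos), plcc(predicted, mos))
```

In this toy case the ordering is perfectly preserved, so SRCC is 1, while PLCC is lower because the relationship is monotonic but not linear.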

Variational Bayes image restoration with compressive autoencoders

paper_url: http://arxiv.org/abs/2311.17744
repo_url: None
paper_authors: Maud Biquard, Marie Chabert, Thomas Oberlin
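for: This paper studies the regularization of inverse problems in computational imaging.
methods: The paper uses compressive autoencoders for latent-space estimation and proposes the Variational Bayes Latent Estimation (VBLE) algorithm, which performs this estimation within the framework of variational inference and allows fast approximate posterior sampling.
results: Experiments on the BSD and FFHQ datasets show that VBLE reaches performance similar to state-of-the-art plug-and-play methods, while quantifying uncertainties faster than other existing posterior sampling techniques.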

Abstract
Regularization of inverse problems is of paramount importance in computational imaging. The ability of neural networks to learn efficient image representations has been recently exploited to design powerful data-driven regularizers. While state-of-the-art plug-and-play methods rely on an implicit regularization provided by neural denoisers, alternative Bayesian approaches consider Maximum A Posteriori (MAP) estimation in the latent space of a generative model, thus with an explicit regularization. However, state-of-the-art deep generative models require a huge amount of training data compared to denoisers. Besides, their complexity hampers the optimization of the latent MAP. In this work, we propose to use compressive autoencoders for latent estimation. These networks, which can be seen as variational autoencoders with a flexible latent prior, are smaller and easier to train than state-of-the-art generative models. We then introduce the Variational Bayes Latent Estimation (VBLE) algorithm, which performs this estimation within the framework of variational inference. This allows for fast and easy (approximate) posterior sampling. Experimental results on image datasets BSD and FFHQ demonstrate that VBLE reaches similar performance than state-of-the-art plug-and-play methods, while being able to quantify uncertainties faster than other existing posterior sampling techniques.

GenZI: Zero-Shot 3D Human-Scene Interaction Generation

paper_url: http://arxiv.org/abs/2311.17737
repo_url: None
paper_authors: Lei Li, Angela Dai
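for: This paper aims to generate 3D human-scene interactions without learning from any 3D human-scene interaction data.
methods: The method distills interaction priors from large vision-language models (VLMs). The VLMs inpaint plausible 2D human interactions into multiple rendered views of the scene, and a robust iterative optimization then synthesizes the pose and shape of a 3D human model consistent with these 2D hypotheses.
results: Compared with existing learning-based approaches, the method shows high flexibility and generality and applies to diverse scene types, including both indoor and outdoor environments.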

Abstract
Can we synthesize 3D humans interacting with scenes without learning from any 3D human-scene interaction data? We propose GenZI, the first zero-shot approach to generating 3D human-scene interactions. Key to GenZI is our distillation of interaction priors from large vision-language models (VLMs), which have learned a rich semantic space of 2D human-scene compositions. Given a natural language description and a coarse point location of the desired interaction in a 3D scene, we first leverage VLMs to imagine plausible 2D human interactions inpainted into multiple rendered views of the scene. We then formulate a robust iterative optimization to synthesize the pose and shape of a 3D human model in the scene, guided by consistency with the 2D interaction hypotheses. In contrast to existing learning-based approaches, GenZI circumvents the conventional need for captured 3D interaction data, and allows for flexible control of the 3D interaction synthesis with easy-to-use text prompts. Extensive experiments show that our zero-shot approach has high flexibility and generality, making it applicable to diverse scene types, including both indoor and outdoor environments.

Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers

paper_url: http://arxiv.org/abs/2311.17717
repo_url: None
paper_authors: Chi-Pin Huang, Kai-Po Chang, Chung-Ting Tsai, Yung-Hsuan Lai, Yu-Chiang Frank Wang
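for: Preventing pre-trained text-to-image diffusion models from generating images related to a target concept.
methods: The paper proposes Reliable Concept Erasing via Lightweight Erasers (Receler), which learns a lightweight eraser and strengthens locality and robustness through concept-localized regularization and adversarial prompt learning, respectively.
results: Across various concept prompts, Receler outperforms previous erasing methods in both locality and robustness.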

Abstract
Concept erasure in text-to-image diffusion models aims to disable pre-trained diffusion models from generating images related to a target concept. To perform reliable concept erasure, the properties of robustness and locality are desirable. The former refrains the model from producing images associated with the target concept for any paraphrased or learned prompts, while the latter preserves the model ability in generating images for non-target concepts. In this paper, we propose Reliable Concept Erasing via Lightweight Erasers (Receler), which learns a lightweight Eraser to perform concept erasing and enhances locality and robustness with the proposed concept-localized regularization and adversarial prompt learning, respectively. Comprehensive quantitative and qualitative experiments with various concept prompts verify the superiority of Receler over the previous erasing methods on the above two desirable properties.

SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation

paper_url: http://arxiv.org/abs/2311.17707
repo_url: https://github.com/GAP-LAB-CUHK-SZ/SAMPro3D
paper_authors: Mutian Xu, Xingyilang Yin, Lingteng Qiu, Yang Liu, Xin Tong, Xiaoguang Han
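for: This paper proposes a zero-shot method for 3D indoor scene segmentation that requires no additional training on domain-specific data.
methods: The method applies the pretrained Segment Anything Model (SAM) to posed 2D frames. 3D points in the scene serve as natural 3D prompts, and their projected pixel prompts are aligned across frames so that prompts and SAM-predicted masks stay frame-consistent. Low-quality 3D prompts are filtered out using feedback from all 2D frames, and prompts that segment the same object are consolidated.
results: The method achieves higher-quality and more diverse segmentation than previous zero-shot or fully supervised approaches, and in many cases surpasses human-level annotations. The project page is at https://mutianxu.github.io/sampro3d/.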

Abstract
We introduce SAMPro3D for zero-shot 3D indoor scene segmentation. Given the 3D point cloud and multiple posed 2D frames of 3D scenes, our approach segments 3D scenes by applying the pretrained Segment Anything Model (SAM) to 2D frames. Our key idea involves locating 3D points in scenes as natural 3D prompts to align their projected pixel prompts across frames, ensuring frame-consistency in both pixel prompts and their SAM-predicted masks. Moreover, we suggest filtering out low-quality 3D prompts based on feedback from all 2D frames, for enhancing segmentation quality. We also propose to consolidate different 3D prompts if they are segmenting the same object, bringing a more comprehensive segmentation. Notably, our method does not require any additional training on domain-specific data, enabling us to preserve the zero-shot power of SAM. Extensive qualitative and quantitative results show that our method consistently achieves higher quality and more diverse segmentation than previous zero-shot or fully supervised approaches, and in many cases even surpasses human-level annotations. The project page can be accessed at https://mutianxu.github.io/sampro3d/.
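
The key idea above relies on projecting each 3D point into every posed 2D frame to obtain its pixel prompt. A minimal sketch of that projection step under a pinhole camera model follows. The function names, the `(fx, fy, cx, cy)` intrinsics layout, and the 3x4 world-to-camera pose format are assumptions for illustration and are not taken from the SAMPro3D code; occlusion handling is left out.

```python
def project_point(point_world, intrinsics, world_to_cam, image_size):
    """Project one 3D point into one frame; return ((u, v), visible).

    intrinsics = (fx, fy, cx, cy); world_to_cam = 3x4 [R | t] as nested lists;
    image_size = (width, height). Occlusion is not checked in this sketch.
    """
    x, y, z = point_world
    cam = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in world_to_cam]
    if cam[2] <= 1e-6:  # behind the camera or on the image plane
        return None, False
    fx, fy, cx, cy = intrinsics
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    width, height = image_size
    visible = 0 <= u < width and 0 <= v < height
    return (u, v), visible


def prompts_per_frame(points_world, intrinsics, poses, image_size):
    """For each frame pose, map point index -> pixel prompt for visible points."""
    frames = []
    for pose in poses:
        prompts = {}
        for index, point in enumerate(points_world):
            uv, visible = project_point(point, intrinsics, pose, image_size)
            if visible:
                prompts[index] = uv
        frames.append(prompts)
    return frames


# Hypothetical scene: three points, two camera poses, a 100x100 image.
points = [(0.0, 0.0, 2.0), (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)]
identity = [[1, 0, 0, 0.0], [0, 1, 0, 0.0], [0, 0, 1, 0.0]]
shifted = [[1, 0, 0, -0.5], [0, 1, 0, 0.0], [0, 0, 1, 0.0]]
frames = prompts_per_frame(points, (100.0, 100.0, 50.0, 50.0), [identity, shifted], (100, 100))
print(frames)
```

Because the same point index keys the prompt in every frame, the pixel prompts of one 3D point are linked across frames, which is the kind of frame-consistency the abstract describes.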

Toward a Surgeon-in-the-Loop Ophthalmic Robotic Apprentice using Reinforcement and Imitation Learning

paper_url: http://arxiv.org/abs/2311.17693
repo_url: None
paper_authors: Amr Gomaa, Bilal Mahdy, Niko Kleer, Antonio Krüger
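for: This paper proposes a simulation-based, image-guided approach that lets autonomous surgical agents adapt to the preferences and requirements of individual surgeons.
methods: The paper uses a simulated environment to train reinforcement learning and imitation learning agents. With the surgeon in the loop during training, the robot implicitly learns and adapts to the individual surgeon's techniques through demonstrations.
results: The approach provides a more intuitive and personalized surgical experience while ensuring consistent performance of the robot. It also has the potential to extend to other ophthalmic surgical procedures.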

Abstract
Robotic-assisted surgical systems have demonstrated significant potential in enhancing surgical precision and minimizing human errors. However, existing systems lack the ability to accommodate the unique preferences and requirements of individual surgeons. Additionally, they primarily focus on general surgeries (e.g., laparoscopy) and are not suitable for highly precise microsurgeries, such as ophthalmic procedures. Thus, we propose a simulation-based image-guided approach for surgeon-centered autonomous agents that can adapt to the individual surgeon's skill level and preferred surgical techniques during ophthalmic cataract surgery. Our approach utilizes a simulated environment to train reinforcement and imitation learning agents guided by image data to perform all tasks of the incision phase of cataract surgery. By integrating the surgeon's actions and preferences into the training process with the surgeon-in-the-loop, our approach enables the robot to implicitly learn and adapt to the individual surgeon's unique approach through demonstrations. This results in a more intuitive and personalized surgical experience for the surgeon. Simultaneously, it ensures consistent performance for the autonomous robotic apprentice. We define and evaluate the effectiveness of our approach using our proposed metrics; and highlight the trade-off between a generic agent and a surgeon-centered adapted agent. Moreover, our approach has the potential to extend to other ophthalmic surgical procedures, opening the door to a new generation of surgeon-in-the-loop autonomous surgical robots. We provide an open-source simulation framework for future development and reproducibility.

COVIDx CXR-4: An Expanded Multi-Institutional Open-Source Benchmark Dataset for Chest X-ray Image-Based Computer-Aided COVID-19 Diagnostics

paper_url: http://arxiv.org/abs/2311.17677
repo_url: None
paper_authors: Yifan Wu, Hayden Gunraj, Chi-en Amy Tai, Alexander Wong
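for: This work aims to improve computer-aided COVID-19 diagnostics by providing a larger and more diverse benchmark dataset.
methods: COVIDx CXR-4 expands the previous COVIDx CXR-3 dataset, increasing the total patient cohort by more than 2.66 times to 84,818 chest X-ray images from 45,342 patients across multiple institutions.
results: The authors provide an extensive analysis of the diversity of patient demographics, imaging metadata, and disease distributions in COVIDx CXR-4 to highlight potential dataset biases.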

Abstract
The global ramifications of the COVID-19 pandemic remain significant, exerting persistent pressure on nations even three years after its initial outbreak. Deep learning models have shown promise in improving COVID-19 diagnostics but require diverse and larger-scale datasets to improve performance. In this paper, we introduce COVIDx CXR-4, an expanded multi-institutional open-source benchmark dataset for chest X-ray image-based computer-aided COVID-19 diagnostics. COVIDx CXR-4 expands significantly on the previous COVIDx CXR-3 dataset by increasing the total patient cohort size by greater than 2.66 times, resulting in 84,818 images from 45,342 patients across multiple institutions. We provide extensive analysis on the diversity of the patient demographic, imaging metadata, and disease distributions to highlight potential dataset biases. To the best of the authors' knowledge, COVIDx CXR-4 is the largest and most diverse open-source COVID-19 CXR dataset and is made publicly available as part of an open initiative to advance research to aid clinicians against the COVID-19 disease.

Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications

paper_url: http://arxiv.org/abs/2311.17663
repo_url: https://github.com/haomo-ai/cam4docc
paper_authors: Junyi Ma, Xieyuanli Chen, Jiawei Huang, Jingyi Xu, Zhen Luo, Jintao Xu, Weihao Gu, Rui Ai, Hesheng Wang
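for: This work provides a benchmark for camera-only 4D occupancy forecasting, evaluating how the surrounding scene changes in the near future in autonomous driving scenarios.
methods: The benchmark is built on multiple publicly available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5, and provides sequential occupancy states of general movable and static objects together with their 3D backward centripetal flow.
results: The paper introduces four baseline types, including a proposed end-to-end 4D occupancy forecasting network, and a standardized evaluation protocol for comparing them on present and future occupancy estimation.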

Abstract
Understanding how the surrounding environment changes is crucial for performing downstream tasks safely and reliably in autonomous driving applications. Recent occupancy estimation techniques using only camera images as input can provide dense occupancy representations of large-scale scenes based on the current observation. However, they are mostly limited to representing the current 3D space and do not consider the future state of surrounding objects along the time axis. To extend camera-only occupancy estimation into spatiotemporal prediction, we propose Cam4DOcc, a new benchmark for camera-only 4D occupancy forecasting, evaluating the surrounding scene changes in a near future. We build our benchmark based on multiple publicly available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5, which provides sequential occupancy states of general movable and static objects, as well as their 3D backward centripetal flow. To establish this benchmark for future research with comprehensive comparisons, we introduce four baseline types from diverse camera-based perception and prediction implementations, including a static-world occupancy model, voxelization of point cloud prediction, 2D-3D instance-based prediction, and our proposed novel end-to-end 4D occupancy forecasting network. Furthermore, the standardized evaluation protocol for preset multiple tasks is also provided to compare the performance of all the proposed baselines on present and future occupancy estimation with respect to objects of interest in autonomous driving scenarios. The dataset and our implementation of all four baselines in the proposed Cam4DOcc benchmark will be released here: https://github.com/haomo-ai/Cam4DOcc.
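
The abstract does not spell out the evaluation metric, so the following is only a sketch under the assumption that present and future occupancy are compared with an intersection-over-union (IoU) per time step, a common choice for occupancy grids. Voxels are represented as sets of integer index tuples; all names and values are illustrative.

```python
def occupancy_iou(pred_voxels, gt_voxels):
    """IoU between two sets of occupied voxel indices (x, y, z)."""
    pred, gt = set(pred_voxels), set(gt_voxels)
    union = pred | gt
    if not union:
        return 1.0  # both grids empty: treat as perfect agreement
    return len(pred & gt) / len(union)


def forecasting_iou(pred_sequence, gt_sequence):
    """IoU for each time step of a predicted occupancy sequence."""
    return [occupancy_iou(p, g) for p, g in zip(pred_sequence, gt_sequence)]


# Hypothetical two-step sequence: one present frame and one future frame.
gt_sequence = [{(0, 0, 0), (1, 1, 1)}, {(0, 0, 1)}]
pred_sequence = [{(0, 0, 0)}, {(0, 0, 1), (1, 0, 0)}]
print(forecasting_iou(pred_sequence, gt_sequence))
```

Reporting one value per time step makes it visible how quickly a forecaster degrades as the prediction horizon grows.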

Volumetric Cloud Field Reconstruction

paper_url: http://arxiv.org/abs/2311.17657
repo_url: None
paper_authors: Jacob Lin, Miguel Farinha, Edward Gryspeerdt, Ronald Clark
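for: This work addresses the reconstruction of large-scale volumes, such as clouds, from a few stereo image pairs, to make 3D reconstruction of such phenomena more practical.
methods: The paper proposes a deep learning framework that integrates a deep stereo model, a 3D convolutional neural network (3D CNN), and an advection module to capture the shape and dynamics of volumes. Stereo depths are used to carve empty space around the volumes as a prior, and the advection module uses the temporal evolution of the medium to infer motion and improve temporal consistency.
results: The system estimates the density and velocity fields of clouds from a sparse set of stereo image pairs, demonstrating that it can handle large-scale volumetric phenomena.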

Abstract
Volumetric phenomena, such as clouds and fog, present a significant challenge for 3D reconstruction systems due to their translucent nature and their complex interactions with light. Conventional techniques for reconstructing scattering volumes rely on controlled setups, limiting practical applications. This paper introduces an approach to reconstructing volumes from a few input stereo pairs. We propose a novel deep learning framework that integrates a deep stereo model with a 3D Convolutional Neural Network (3D CNN) and an advection module, capable of capturing the shape and dynamics of volumes. The stereo depths are used to carve empty space around volumes, providing the 3D CNN with a prior for coping with the lack of input views. Refining our output, the advection module leverages the temporal evolution of the medium, providing a mechanism to infer motion and improve temporal consistency. The efficacy of our system is demonstrated through its ability to estimate density and velocity fields of large-scale volumes, in this case, clouds, from a sparse set of stereo image pairs.
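
The abstract does not describe how the advection module is implemented. As a generic illustration of what advecting a density field with a velocity field means, here is a minimal semi-Lagrangian step on a 1D periodic grid; this is a textbook scheme chosen for the sketch, not the authors' module.

```python
import math


def advect_1d(density, velocity, dt, dx=1.0):
    """One semi-Lagrangian advection step on a periodic 1D grid.

    Each cell looks backwards along its velocity and linearly
    interpolates the density found at that source position.
    """
    n = len(density)
    advected = []
    for i in range(n):
        source = i - velocity[i] * dt / dx  # backtraced position in cell units
        i0 = math.floor(source)
        frac = source - i0
        advected.append((1.0 - frac) * density[i0 % n] + frac * density[(i0 + 1) % n])
    return advected


# A single blob of density moving to the right.
density = [0.0, 1.0, 0.0, 0.0]
full_step = advect_1d(density, [1.0] * 4, dt=1.0)
half_step = advect_1d(density, [0.5] * 4, dt=1.0)
print(full_step, half_step)
```

With unit velocity the blob moves exactly one cell; with half that velocity it is split between two neighbouring cells while the total density stays the same.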

Multiple Toddler Tracking in Indoor Videos

paper_url: http://arxiv.org/abs/2311.17656
repo_url: https://github.com/ostadabbas/multiple-toddler-tracking
paper_authors: Somaieh Amraee, Bishoy Galoaa, Matthew Goodwin, Elaheh Hatamimajoumerd, Sarah Ostadabbas
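for: This paper addresses the tracking of multiple toddlers in video, where unpredictable movements, varied poses, and similar appearance make conventional multi-object tracking (MOT) algorithms hard to apply.
methods: The paper proposes MTTSort, a customized method built on the DeepSort algorithm for accurately tracking multiple toddlers in indoor videos. It uses a genetic algorithm to optimize hyperparameters and curates the MTTrack dataset with AI co-labeling techniques.
results: Compared with state-of-the-art MOT methods on the MTTrack, DanceTrack, and MOT15 datasets, MTTSort performs best, achieving 0.98, 0.68, and 0.98 in MOTA, HOTA, and IDF1, respectively.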

Abstract
Multiple toddler tracking (MTT) involves identifying and differentiating toddlers in video footage. While conventional multi-object tracking (MOT) algorithms are adept at tracking diverse objects, toddlers pose unique challenges due to their unpredictable movements, various poses, and similar appearance. Tracking toddlers in indoor environments introduces additional complexities such as occlusions and limited fields of view. In this paper, we address the challenges of MTT and propose MTTSort, a customized method built upon the DeepSort algorithm. MTTSort is designed to track multiple toddlers in indoor videos accurately. Our contributions include discussing the primary challenges in MTT, introducing a genetic algorithm to optimize hyperparameters, proposing an accurate tracking algorithm, and curating the MTTrack dataset using unbiased AI co-labeling techniques. We quantitatively compare MTTSort to state-of-the-art MOT methods on MTTrack, DanceTrack, and MOT15 datasets. In our evaluation, the proposed method outperformed other MOT methods, achieving 0.98, 0.68, and 0.98 in multiple object tracking accuracy (MOTA), higher order tracking accuracy (HOTA), and iterative and discriminative framework 1 (IDF1) metrics, respectively.
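
For readers unfamiliar with the reported metrics, the sketch below shows the commonly used definitions of MOTA and IDF1 (the latter is usually expanded as the identification F1 score in the MOT literature). The counts are made-up numbers chosen for illustration, not results from the paper, and HOTA is omitted because its definition is more involved.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple object tracking accuracy: 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """Identification F1: 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


# Hypothetical error counts over 1000 ground-truth boxes.
print(mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=1000))
print(idf1(idtp=90, idfp=10, idfn=10))
```

MOTA penalises detection errors and identity switches together, while IDF1 focuses on how consistently each target keeps the same identity, which matters when the tracked toddlers look alike.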

Neural Fields with Thermal Activations for Arbitrary-Scale Super-Resolution

paper_url: http://arxiv.org/abs/2311.17643
repo_url: None
paper_authors: Alexander Becker, Rodrigo Caye Daudt, Nando Metzger, Jan Dirk Wegner, Konrad Schindler
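for: This paper addresses arbitrary-scale single image super-resolution (ASSR), in particular preserving detail and sharpness when moving across resolutions.
methods: The paper uses local neural fields to represent continuous signals and designs a novel activation function, derived from Fourier theory and the heat equation, that lets points be queried with a Gaussian PSF. This provides anti-aliasing without filtering in the image domain and at no additional computational cost.
results: Coupled with a hypernetwork, the method provides theoretically guaranteed anti-aliasing, sets a new bar for ASSR, and is more parameter-efficient than previous methods.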

Abstract
Recent approaches for arbitrary-scale single image super-resolution (ASSR) have used local neural fields to represent continuous signals that can be sampled at different rates. However, in such formulation, the point-wise query of field values does not naturally match the point spread function (PSF) of a given pixel. In this work we present a novel way to design neural fields such that points can be queried with a Gaussian PSF, which serves as anti-aliasing when moving across resolutions for ASSR. We achieve this using a novel activation function derived from Fourier theory and the heat equation. This comes at no additional cost: querying a point with a Gaussian PSF in our framework does not affect computational cost, unlike filtering in the image domain. Coupled with a hypernetwork, our method not only provides theoretically guaranteed anti-aliasing, but also sets a new bar for ASSR while also being more parameter-efficient than previous methods.
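
The abstract does not give the activation itself, so the following only illustrates the underlying fact from Fourier theory and the heat equation: blurring a sinusoid of angular frequency `omega` with a Gaussian of standard deviation `sigma` leaves a sinusoid of the same frequency, scaled by `exp(-sigma**2 * omega**2 / 2)`. A closed-form damping of this kind is why a Gaussian PSF can be applied without explicit filtering. The function names and test values are assumptions for this sketch.

```python
import math


def damped_sine(x, omega, sigma):
    """Closed form: Gaussian blur of sin(omega * x) is a damped sine."""
    return math.exp(-0.5 * (sigma * omega) ** 2) * math.sin(omega * x)


def gaussian_blurred_sine(x, omega, sigma, half_width=8.0, steps=4001):
    """Numerical check: convolve sin(omega * x) with a normalised Gaussian."""
    total = 0.0
    weight_sum = 0.0
    for k in range(steps):
        t = -half_width * sigma + 2.0 * half_width * sigma * k / (steps - 1)
        weight = math.exp(-0.5 * (t / sigma) ** 2)
        total += weight * math.sin(omega * (x - t))
        weight_sum += weight
    return total / weight_sum


analytic = damped_sine(0.3, omega=5.0, sigma=0.2)
numeric = gaussian_blurred_sine(0.3, omega=5.0, sigma=0.2)
print(analytic, numeric)
```

The closed form costs a single multiplication per frequency, whereas the numerical convolution needs thousands of samples per query point, which mirrors the abstract's claim that the Gaussian PSF comes at no additional cost.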

paper_url: http://arxiv.org/abs/2311.17634
repo_url: None
paper_authors: Mreenav Shyam Deka, Lu Sang, Daniel Cremers
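for: Synthesizing novel views of urban environments for tasks such as autonomous driving and virtual tours.
methods: The method uses a neural point light field scene representation and strategically detects and masks out dynamic objects to reconstruct novel scenes without artifacts. Camera poses are optimized simultaneously with the view synthesis process, so both are refined jointly.
results: Validation on real-world urban datasets shows state-of-the-art results in synthesizing novel views of urban scenes.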

Abstract
Synthesizing novel views for urban environments is crucial for tasks like autonomous driving and virtual tours. Compared to object-level or indoor situations, outdoor settings present unique challenges, such as inconsistency across frames due to moving vehicles and camera pose drift over lengthy sequences. In this paper, we introduce a method that tackles these challenges on view synthesis for outdoor scenarios. We employ a neural point light field scene representation and strategically detect and mask out dynamic objects to reconstruct novel scenes without artifacts. Moreover, we simultaneously optimize camera pose along with the view synthesis process, and thus, we simultaneously refine both elements. Through validation on real-world urban datasets, we demonstrate state-of-the-art results in synthesizing novel views of urban scenes.
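
One simple way to keep moving vehicles from corrupting the reconstruction is to exclude masked pixels from the photometric loss. The sketch below shows that idea in plain Python; the abstract does not say how the masks enter the authors' training objective, so the loss form and all names here are assumptions.

```python
def masked_photometric_loss(rendered, target, dynamic_mask):
    """Mean squared colour error over pixels not flagged as dynamic.

    rendered, target: lists of (r, g, b) tuples, one per pixel.
    dynamic_mask: list of bools, True where a dynamic object was detected.
    """
    total, count = 0.0, 0
    for rendered_px, target_px, is_dynamic in zip(rendered, target, dynamic_mask):
        if is_dynamic:
            continue  # ignore pixels covered by moving objects
        total += sum((a - b) ** 2 for a, b in zip(rendered_px, target_px)) / len(rendered_px)
        count += 1
    return total / count if count else 0.0


# Hypothetical 4-pixel image where only the first pixel shows a moving car.
rendered = [(0.0, 0.0, 0.0)] * 4
target = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
with_mask = masked_photometric_loss(rendered, target, [True, False, False, False])
without_mask = masked_photometric_loss(rendered, target, [False] * 4)
print(with_mask, without_mask)
```

With the mask the transient pixel no longer contributes any error, so the static scene is not pushed to explain an object that is absent in other frames.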

Efficient Decoder for End-to-End Oriented Object Detection in Remote Sensing Images

paper_url: http://arxiv.org/abs/2311.17629
repo_url: None
paper_authors: Jiaqi Zhao, Zeyu Ding, Yong Zhou, Hancheng Zhu, Wenliang Du, Rui Yao, Abdulmotaleb El Saddik
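for: This work proposes an end-to-end oriented object detector for remote sensing images, where object instances appear with multiple orientations, varying scales, and dense distribution.
methods: The method combines two techniques: Rotated RoI attention (RRoI attention) and Selective Distinct Queries (SDQ). RRoI attention focuses on oriented regions of interest through a cross-attention mechanism and aligns multi-scale features. SDQ collects queries from intermediate decoder layers and filters similar queries to obtain distinct ones.
results: Extensive experiments on five datasets demonstrate the effectiveness of the method. Notably, it achieves state-of-the-art performance on DIOR-R (67.31% mAP), DOTA-v1.5 (67.43% mAP), and DOTA-v2.0 (53.28% mAP) with a ResNet50 backbone.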

Abstract
Object instances in remote sensing images often distribute with multi-orientations, varying scales, and dense distribution. These issues bring challenges to end-to-end oriented object detectors including multi-scale features alignment and a large number of queries. To address these limitations, we propose an end-to-end oriented detector equipped with an efficient decoder, which incorporates two technologies, Rotated RoI attention (RRoI attention) and Selective Distinct Queries (SDQ). Specifically, RRoI attention effectively focuses on oriented regions of interest through a cross-attention mechanism and aligns multi-scale features. SDQ collects queries from intermediate decoder layers and then filters similar queries to obtain distinct queries. The proposed SDQ can facilitate the optimization of one-to-one label assignment, without introducing redundant initial queries or extra auxiliary branches. Extensive experiments on five datasets demonstrate the effectiveness of our method. Notably, our method achieves state-of-the-art performance on DIOR-R (67.31% mAP), DOTA-v1.5 (67.43% mAP), and DOTA-v2.0 (53.28% mAP) with the ResNet50 backbone.
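
The abstract says SDQ filters similar queries but does not state the similarity measure or the selection rule. The sketch below is one plausible reading, assuming cosine similarity between query embeddings and a greedy pass in descending score order; the threshold, names, and data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_distinct_queries(queries, scores, sim_threshold=0.9):
    """Greedily keep high-scoring queries that differ from those already kept."""
    order = sorted(range(len(queries)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(cosine(queries[i], queries[j]) < sim_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical 2D query embeddings: the first two are near-duplicates.
queries = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
scores = [0.9, 0.8, 0.7]
print(select_distinct_queries(queries, scores))
```

Removing near-duplicate queries before label assignment is consistent with the abstract's point that distinct queries ease one-to-one matching.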

Focus on Query: Adversarial Mining Transformer for Few-Shot Segmentation

paper_url: http://arxiv.org/abs/2311.17626
repo_url: https://github.com/wyxdm/amnet
paper_authors: Yuan Wang, Naisong Luo, Tianzhu Zhang
for: This paper proposes a new few-shot segmentation (FSS) model for segmenting objects of new categories given only a handful of annotated samples.
methods: The paper introduces a query-centric FSS model, Adversarial Mining Transformer (AMFormer), which achieves accurate query image segmentation with only rough support guidance or even weak support labels.
results: Extensive experiments on the commonly used Pascal-5i and COCO-20i benchmarks achieve state-of-the-art results across all settings. The decent performance with weak support labels in the query-centric paradigm may also inspire the development of more general FSS models.

Abstract
Few-shot segmentation (FSS) aims to segment objects of new categories given only a handful of annotated samples. Previous works focus their efforts on exploring the support information while paying less attention to the mining of the critical query branch. In this paper, we rethink the importance of support information and propose a new query-centric FSS model Adversarial Mining Transformer (AMFormer), which achieves accurate query image segmentation with only rough support guidance or even weak support labels. The proposed AMFormer enjoys several merits. First, we design an object mining transformer (G) that can achieve the expansion of incomplete region activated by support clue, and a detail mining transformer (D) to discriminate the detailed local difference between the expanded mask and the ground truth. Second, we propose to train G and D via an adversarial process, where G is optimized to generate more accurate masks approaching ground truth to fool D. We conduct extensive experiments on commonly used Pascal-5i and COCO-20i benchmarks and achieve state-of-the-art results across all settings. In addition, the decent performance with weak support labels in our query-centric paradigm may inspire the development of more general FSS models. Code will be available at https://github.com/Wyxdm/AMNet.

paper_url: http://arxiv.org/abs/2311.17618
repo_url: None
paper_authors: Fukun Yin, Xin Chen, Chi Zhang, Biao Jiang, Zibo Zhao, Jiayuan Fan, Gang Yu, Taihao Li, Tao Chen
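for: This paper aims to develop a multimodal generative model that handles 3D shapes, supporting a range of shape generation tasks in fields such as 3D virtual construction and network-aided design.
methods: The paper uses a word-sentence-paragraph framework that discretizes continuous 3D shapes into shape words, assembles these words into shape sentences, and combines shapes with instruction text to form multimodal paragraphs.
results: After three training stages (shape representation, multimodal alignment, and instruction-based generation), ShapeGPT achieves comparable performance across shape-related tasks, including text-to-shape, shape-to-text, shape completion, and shape editing.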

Abstract
The advent of large language models, enabling flexibility through instruction-driven approaches, has revolutionized many traditional generative tasks, but large models for 3D data, particularly in comprehensively handling 3D shapes with other modalities, are still under-explored. By achieving instruction-based shape generations, versatile multimodal generative shape models can significantly benefit various fields like 3D virtual construction and network-aided design. In this work, we present ShapeGPT, a shape-included multi-modal framework to leverage strong pre-trained language models to address multiple shape-relevant tasks. Specifically, ShapeGPT employs a word-sentence-paragraph framework to discretize continuous shapes into shape words, further assembles these words for shape sentences, as well as integrates shape with instructional text for multi-modal paragraphs. To learn this shape-language model, we use a three-stage training scheme, including shape representation, multimodal alignment, and instruction-based generation, to align shape-language codebooks and learn the intricate correlations among these modalities. Extensive experiments demonstrate that ShapeGPT achieves comparable performance across shape-relevant tasks, including text-to-shape, shape-to-text, shape completion, and shape editing.
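The discretization step described above can be illustrated with a nearest-codebook lookup. This is only a minimal sketch of the general vector-quantization idea, assuming a toy 2D codebook; the function names, the `<shape_i>` token format, and the prompt text are hypothetical and are not taken from ShapeGPT.

```python
import math

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    indices = []
    for vec in features:
        distances = [math.dist(vec, code) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices

def to_shape_sentence(indices):
    """Render code indices as discrete tokens that a language model could consume."""
    return " ".join(f"<shape_{i}>" for i in indices)

# Toy codebook and features (assumed values, for illustration only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)]

shape_words = quantize(features, codebook)            # "shape words"
shape_sentence = to_shape_sentence(shape_words)       # "shape sentence"
paragraph = "Describe this shape: " + shape_sentence  # "multi-modal paragraph"
```

In a real system the codebook would be learned and the feature vectors would come from a shape encoder; here both are hard-coded so the lookup itself is easy to follow.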

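Summary
The advent of large language models has brought great flexibility through instruction-driven approaches, but large models for 3D data, especially ones that handle 3D shapes together with other modalities, remain under-explored. By enabling instruction-based shape generation, multimodal generative shape models can significantly benefit fields such as 3D virtual construction and network-aided design. In this work, we present ShapeGPT, a shape-included multimodal framework that leverages strong pre-trained language models to address multiple shape-related tasks. Specifically, ShapeGPT uses a word-sentence-paragraph framework that discretizes continuous shapes into shape words, assembles these words into shape sentences, and integrates shapes with instruction text to form multimodal paragraphs. To train this shape-language model, we use a three-stage scheme (shape representation, multimodal alignment, and instruction-based generation) that aligns the shape-language codebooks and learns the intricate correlations among the modalities. Experiments show that ShapeGPT achieves comparable performance across shape-related tasks, including text-to-shape, shape-to-text, shape completion, and shape editing.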

AnyLens: A Generative Diffusion Model with Any Rendering Lens

paper_url: http://arxiv.org/abs/2311.17609
repo_url: None
paper_authors: Andrey Voynov, Amir Hertz, Moab Arar, Shlomi Fruchter, Daniel Cohen-Or
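for: This study proposes an image rendering approach built on a text-to-image diffusion model that allows control over the camera geometry.
methods: The method is based on per-pixel coordinate conditioning, which controls the rendering geometry and thereby produces different visual effects.
results: Through example images, the authors show that the method achieves a variety of visual effects, such as fish-eye, panoramic views, and spherical texturing.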

Abstract
State-of-the-art diffusion models can generate highly realistic images based on various conditioning like text, segmentation, and depth. However, an essential aspect often overlooked is the specific camera geometry used during image capture. The influence of different optical systems on the final scene appearance is frequently overlooked. This study introduces a framework that intimately integrates a text-to-image diffusion model with the particular lens geometry used in image rendering. Our method is based on a per-pixel coordinate conditioning method, enabling the control over the rendering geometry. Notably, we demonstrate the manipulation of curvature properties, achieving diverse visual effects, such as fish-eye, panoramic views, and spherical texturing using a single diffusion model.
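To make "per-pixel coordinate conditioning" more concrete, the sketch below builds a grid that stores one warped (x, y) coordinate per pixel. It is an assumption-laden illustration: the simple radial warp and the `strength` parameter are hypothetical stand-ins, not the lens parameterization used by AnyLens, and the sketch does not show how such a map would be fed into a diffusion model.

```python
def coordinate_map(height, width, strength=0.0):
    """Return a height x width grid of (x, y) coordinates in [-1, 1], radially warped.

    strength = 0 gives the identity grid; positive values push coordinates
    outward in proportion to their squared distance from the image center.
    Requires height > 1 and width > 1.
    """
    grid = []
    for row in range(height):
        line = []
        for col in range(width):
            # Normalize pixel indices to [-1, 1].
            x = 2.0 * col / (width - 1) - 1.0
            y = 2.0 * row / (height - 1) - 1.0
            # Radial scaling (a toy stand-in for a real lens model).
            scale = 1.0 + strength * (x * x + y * y)
            line.append((x * scale, y * scale))
        grid.append(line)
    return grid

identity = coordinate_map(3, 3)
warped = coordinate_map(3, 3, strength=0.5)
```

The center pixel is unchanged by the warp while corner coordinates move the most, which is the kind of per-pixel geometric signal a conditioning branch could consume.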

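Summary
Modern diffusion models can generate highly realistic images conditioned on text, segmentation, depth, and similar inputs. However, one important aspect is often overlooked: the specific camera geometry used during image capture. This study proposes a framework that integrates a text-to-image diffusion model with the particular lens geometry used in rendering. Our method is based on per-pixel coordinate conditioning, which allows control over the rendering geometry. We also demonstrate control over curvature properties, achieving diverse visual effects such as fish-eye, panoramic views, and spherical texturing with a single diffusion model.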

Adversarial Robust Memory-Based Continual Learner

paper_url: http://arxiv.org/abs/2311.17608
repo_url: None
paper_authors: Xiaoyue Mi, Fan Tang, Zonghan Yang, Danding Wang, Juan Cao, Peng Li, Yang Liu
for: This paper addresses adversarial robustness in continual learning, where robustness to adversarial samples degrades as learning proceeds.
methods: The paper studies memory-based continual learning algorithms and finds that directly applying adversarial training yields limited robustness gains. To handle accelerated forgetting and gradient obfuscation, it proposes a new adversarially robust memory-based continual learner that adjusts data logits, together with a gradient-based data selection mechanism that counters gradient obfuscation.
results: Experiments show that the method integrates broadly with existing memory-based continual learning and adversarial training algorithms and achieves up to 8.13% higher accuracy on adversarial data.

Abstract
Despite the remarkable advances that have been made in continual learning, the adversarial vulnerability of such methods has not been fully discussed. We delve into the adversarial robustness of memory-based continual learning algorithms and observe limited robustness improvement by directly applying adversarial training techniques. Preliminary studies reveal the twin challenges for building adversarial robust continual learners: accelerated forgetting in continual learning and gradient obfuscation in adversarial robustness. In this study, we put forward a novel adversarial robust memory-based continual learner that adjusts data logits to mitigate the forgetting of pasts caused by adversarial samples. Furthermore, we devise a gradient-based data selection mechanism to overcome the gradient obfuscation caused by limited stored data. The proposed approach can widely integrate with existing memory-based continual learning as well as adversarial training algorithms in a plug-and-play way. Extensive experiments on Split-CIFAR10/100 and Split-Tiny-ImageNet demonstrate the effectiveness of our approach, achieving up to 8.13% higher accuracy for adversarial data.

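Summary
Although continual learning has made remarkable progress, its adversarial vulnerability has not been fully assessed. We study the adversarial robustness of memory-based continual learning algorithms and find that directly applying adversarial training brings only limited improvement. Preliminary studies reveal two challenges in building adversarially robust continual learners: accelerated forgetting in continual learning and gradient obfuscation in adversarial robustness. In this study, we propose a new adversarially robust memory-based continual learner that adjusts data logits to mitigate the forgetting of past knowledge caused by adversarial samples. We also design a gradient-based data selection mechanism to overcome the gradient obfuscation caused by limited stored data. The method can be combined with existing memory-based continual learning and adversarial training algorithms in a plug-and-play way. Extensive experiments on Split-CIFAR10/100 and Split-Tiny-ImageNet show that our method achieves up to 8.13% higher accuracy on adversarial data.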

Topology-Preserving Adversarial Training

paper_url: http://arxiv.org/abs/2311.17607
repo_url: None
paper_authors: Xiaoyue Mi, Fan Tang, Yepeng Weng, Danding Wang, Juan Cao, Sheng Tang, Peng Li, Yang Liu
for: Improve the robustness of neural networks while addressing the drop in natural accuracy caused by adversarial training.
methods: Building on the structure of the sample space, the paper proposes Topology-pReserving Adversarial traINing (TRAIN), which preserves the topology of natural samples learned by a standard model during adversarial training.
results: Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet show consistent and significant improvements over various strong baselines in most cases. Specifically, without additional data, the method achieves up to 8.78% improvement in natural accuracy and 4.50% improvement in robust accuracy.

Abstract
Despite the effectiveness in improving the robustness of neural networks, adversarial training has suffered from the natural accuracy degradation problem, i.e., accuracy on natural samples has reduced significantly. In this study, we reveal that natural accuracy degradation is highly related to the disruption of the natural sample topology in the representation space by quantitative and qualitative experiments. Based on this observation, we propose Topology-pReserving Adversarial traINing (TRAIN) to alleviate the problem by preserving the topology structure of natural samples from a standard model trained only on natural samples during adversarial training. As an additional regularization, our method can easily be combined with various popular adversarial training algorithms in a plug-and-play manner, taking advantage of both sides. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet show that our proposed method achieves consistent and significant improvements over various strong baselines in most cases. Specifically, without additional data, our proposed method achieves up to 8.78% improvement in natural accuracy and 4.50% improvement in robust accuracy.
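One way to picture "preserving the topology of natural samples" is to compare, for each sample, how its neighbors are distributed in the standard model's feature space versus the adversarially trained model's feature space. The sketch below does this with a softmax over cosine similarities and a KL divergence. This is an assumed formulation for illustration; the abstract does not specify the actual TRAIN loss, and the function names and the temperature value are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def neighbor_distribution(feats, i, temperature=0.1):
    """Softmax over similarities between sample i and every other sample."""
    sims = [cosine(feats[i], feats[j]) / temperature
            for j in range(len(feats)) if j != i]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def topology_loss(standard_feats, robust_feats, temperature=0.1):
    """Mean KL divergence between neighbor distributions of the two feature spaces."""
    n = len(standard_feats)
    loss = 0.0
    for i in range(n):
        p = neighbor_distribution(standard_feats, i, temperature)
        q = neighbor_distribution(robust_feats, i, temperature)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return loss / n

# Toy features for three natural samples (assumed values).
standard = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]  # samples 0 and 1 are neighbors
robust = [(1.0, 0.0), (0.0, 1.0), (0.1, 0.9)]    # neighborhood structure changed
penalty = topology_loss(standard, robust)
```

The penalty is zero when both feature spaces induce the same neighbor structure and grows as the adversarially trained representation rearranges which samples sit close together. Used as an extra regularization term, it could be added to any adversarial training objective.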

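Summary
Although adversarial training effectively improves the robustness of neural networks, it suffers from natural accuracy degradation, meaning that accuracy on natural samples drops significantly. In this study, we find that natural accuracy degradation is highly related to the disruption of the natural sample topology in the representation space. Based on this observation, we propose Topology-pReserving Adversarial traINing (TRAIN), which preserves the topology of natural samples in the representation space during adversarial training and thereby alleviates the accuracy drop. Our method can also be combined easily with many popular adversarial training algorithms for better performance. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet show significant improvements in most cases. Specifically, without additional data, our method improves natural accuracy by up to 8.78% and robust accuracy by up to 4.50%.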

paper_url: http://arxiv.org/abs/2311.17600
repo_url: None
paper_authors: Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, Yu Qiao
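for: This study examines the safety of Large Multi-Modal Models (LMMs), in particular their vulnerability to malicious queries.
methods: The study proposes a new visual prompt attack that combines an image generated by diffusion models with an image that displays text as typography, both based on keywords extracted from a malicious query.
results: The study shows that existing open-source LMMs are easily attacked by this method, even when the underlying Large Language Models are safely aligned. The authors also compile a substantial dataset covering 13 scenarios with a total of 5,040 text-image pairs for evaluating the safety of LMMs.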

Abstract
Warning: This paper contains examples of harmful language and images, and reader discretion is recommended. The security concerns surrounding Large Language Models (LLMs) have been extensively explored, yet the safety of Large Multi-Modal Models (LMMs) remains understudied. In our study, we present a novel visual prompt attack that exploits query-relevant images to jailbreak the open-source LMMs. Our method creates a composite image from one image generated by diffusion models and another that displays the text as typography, based on keywords extracted from a malicious query. We show LMMs can be easily attacked by our approach, even if the employed Large Language Models are safely aligned. To evaluate the extent of this vulnerability in open-source LMMs, we have compiled a substantial dataset encompassing 13 scenarios with a total of 5,040 text-image pairs, using our presented attack technique. Our evaluation of 12 cutting-edge LMMs using this dataset shows the vulnerability of existing multi-modal models on adversarial attacks. This finding underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source LMMs against potential malicious exploits. The resource is available at https://github.com/isXinLiu/MM-SafetyBench.

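Summary
Warning: this paper contains examples of harmful language and images; reader discretion is advised. The safety of large language models (LLMs) has been studied extensively, but the safety of large multi-modal models (LMMs) has not. In our study, we propose a new visual prompt attack that uses query-relevant images to jailbreak open-source LMMs. Our method creates a composite image from one image generated by diffusion models and another that displays text as typography, based on keywords extracted from a malicious query. We show that LMMs can be attacked easily by our approach even if the underlying LLMs are safely aligned. To assess how vulnerable open-source LMMs are, we compile a large dataset covering 13 scenarios with 5,040 text-image pairs built with our attack technique. Our evaluation of 12 cutting-edge LMMs shows that existing multi-modal models are vulnerable to adversarial attacks. This finding underscores the need to strengthen the safety measures of open-source LMMs against potential malicious exploitation. The resource is available at https://github.com/isXinLiu/MM-SafetyBench.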

paper_url: http://arxiv.org/abs/2311.17597
repo_url: None
paper_authors: Yiwen Ye, Yutong Xie, Jianpeng Zhang, Ziyang Chen, Qi Wu, Yong Xia
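for: This paper proposes a continual learning framework for efficient self-supervised pre-training in multi-modal medical image analysis.
methods: The paper performs continual self-supervised pre-training on multi-modal medical data and proposes a flexible framework that adapts to different kinds of medical data.
results: Experiments show that the method supports continual learning on large-scale multi-modal medical data and achieves efficient self-supervised pre-training. Code and pre-trained models are available at https://github.com/yeerwen/MedCoSS.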

Abstract
Self-supervised learning is an efficient pre-training method for medical image analysis. However, current research is mostly confined to specific-modality data pre-training, consuming considerable time and resources without achieving universality across different modalities. A straightforward solution is combining all modality data for joint self-supervised pre-training, which poses practical challenges. Firstly, our experiments reveal conflicts in representation learning as the number of modalities increases. Secondly, multi-modal data collected in advance cannot cover all real-world scenarios. In this paper, we reconsider versatile self-supervised learning from the perspective of continual learning and propose MedCoSS, a continuous self-supervised learning approach for multi-modal medical data. Unlike joint self-supervised learning, MedCoSS assigns different modality data to different training stages, forming a multi-stage pre-training process. To balance modal conflicts and prevent catastrophic forgetting, we propose a rehearsal-based continual learning method. We introduce the k-means sampling strategy to retain data from previous modalities and rehearse it when learning new modalities. Instead of executing the pretext task on buffer data, a feature distillation strategy and an intra-modal mixup strategy are applied to these data for knowledge retention. We conduct continuous self-supervised pre-training on a large-scale multi-modal unlabeled dataset, including clinical reports, X-rays, CT scans, MRI scans, and pathological images. Experimental results demonstrate MedCoSS's exceptional generalization ability across nine downstream datasets and its significant scalability in integrating new modality data. Code and pre-trained weight are available at https://github.com/yeerwen/MedCoSS.
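The k-means sampling idea for the rehearsal buffer can be sketched as follows: cluster the features of a finished modality, then keep the sample closest to each centroid. This is a minimal illustration under stated assumptions; the deterministic initialization, the one-sample-per-cluster rule, and the function names are hypothetical choices and not necessarily what MedCoSS does.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on tuples; initialized with the first k points for determinism."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids

def select_rehearsal_indices(features, k):
    """Pick, for each centroid, the index of the closest sample (without duplicates)."""
    chosen = []
    for centroid in kmeans(features, k):
        best = min(range(len(features)), key=lambda i: math.dist(features[i], centroid))
        if best not in chosen:
            chosen.append(best)
    return chosen

# Toy 2D features forming two well-separated groups (assumed values).
features = [(0.0, 0.0), (0.2, 0.0), (0.05, 0.1),
            (5.0, 5.0), (5.2, 5.0), (5.1, 5.1)]
buffer_indices = select_rehearsal_indices(features, k=2)
```

Selecting one representative per cluster keeps the buffer small while still covering the distinct regions of the earlier modality's feature space, which is the property a rehearsal buffer needs when a new modality is learned.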

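Summary
Self-supervised learning is an efficient pre-training method for medical image analysis. However, current research mostly focuses on pre-training with data from a single modality, which consumes considerable time and resources without achieving universality across modalities. A straightforward solution is to combine data from all modalities for joint self-supervised pre-training, but this poses practical challenges. First, our experiments show that conflicts in representation learning arise as the number of modalities increases. Second, multi-modal data collected in advance cannot cover all real-world scenarios. In this paper, we reconsider self-supervised learning from the perspective of continual learning and propose MedCoSS, a continual self-supervised pre-training approach for multi-modal medical data. Unlike joint self-supervised learning, MedCoSS assigns data from different modalities to different training stages, forming a multi-stage pre-training process. To balance modal conflicts and prevent catastrophic forgetting, we propose a rehearsal-based continual learning method. We use a k-means sampling strategy to retain data from previous modalities and rehearse it when learning new modalities. Instead of executing the pretext task on the buffer data, we apply a feature distillation strategy and an intra-modal mixup strategy to retain knowledge. We perform continual self-supervised pre-training on a large-scale multi-modal unlabeled dataset that includes clinical reports, X-rays, CT scans, MRI scans, and pathological images. Experimental results show that MedCoSS generalizes exceptionally well across nine downstream datasets and scales well when integrating data from new modalities. Code and pre-trained weights are available at https://github.com/yeerwen/MedCoSS.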

SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis

paper_url: http://arxiv.org/abs/2311.17590
repo_url: https://github.com/ZiqiaoPeng/SyncTalk
paper_authors: Ziqiao Peng, Wentao Hu, Yue Shi, Xiangyu Zhu, Xiaomei Zhang, Hao Zhao, Jun He, Hongyan Liu, Zhaoxin Fan
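for: Synthesizing realistic, speech-driven talking head videos faces a central challenge: traditional Generative Adversarial Networks (GAN) struggle to keep facial identity consistent, while Neural Radiance Fields (NeRF) methods address this issue but often produce mismatched lip movements, inadequate facial expressions, and unstable head poses.
methods: The paper introduces SyncTalk, a NeRF-based method that maintains subject identity and improves synchronization and realism in talking head synthesis. SyncTalk uses a Face-Sync Controller to align lip movements with speech and a 3D facial blendshape model to capture accurate facial expressions. A Head-Sync Stabilizer optimizes head poses for more natural head movements.
results: Experiments and user studies show that SyncTalk outperforms state-of-the-art methods in synchronization and realism. The authors recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk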

Abstract
Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. Traditional Generative Adversarial Networks (GAN) struggle to maintain consistent facial identity, while Neural Radiance Fields (NeRF) methods, although they can address this issue, often produce mismatched lip movements, inadequate facial expressions, and unstable head poses. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic and artificial outcomes. To address the critical issue of synchronization, identified as the "devil" in creating realistic talking heads, we introduce SyncTalk. This NeRF-based method effectively maintains subject identity, enhancing synchronization and realism in talking head synthesis. SyncTalk employs a Face-Sync Controller to align lip movements with speech and innovatively uses a 3D facial blendshape model to capture accurate facial expressions. Our Head-Sync Stabilizer optimizes head poses, achieving more natural head movements. The Portrait-Sync Generator restores hair details and blends the generated head with the torso for a seamless visual experience. Extensive experiments and user studies demonstrate that SyncTalk outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk

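Summary
Achieving high synchronization when synthesizing realistic, speech-driven talking head videos is a significant challenge. Traditional Generative Adversarial Networks (GAN) struggle to preserve facial identity, while Neural Radiance Fields (NeRF) methods can address this issue but often produce mismatched lip movements, inadequate facial expressions, and unstable head poses. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of this synchronization is a fundamental flaw that leads to unrealistic, unnatural results. To address this core problem, we introduce SyncTalk, a NeRF-based method that maintains subject identity and improves synchronization and realism in talking head synthesis. SyncTalk uses a Face-Sync Controller to align lip movements with speech and a 3D facial blendshape model to capture accurate facial expressions. Our Head-Sync Stabilizer optimizes head poses for more natural head movements. The Portrait-Sync Generator restores hair details and blends the generated head with the torso for a seamless visual experience. Experiments and user studies show that SyncTalk outperforms existing methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk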

CLIPC8: Face liveness detection algorithm based on image-text pairs and contrastive learning

paper_url: http://arxiv.org/abs/2311.17583
repo_url: None
paper_authors: Xu Liu, Shu Zhou, Yurong Song, Wenzhe Luo, Xin Zhang
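for: This study addresses the poor performance of existing liveness detection algorithms when transferred to unseen datasets, using image-text pairs to detect liveness attacks in the financial field.
methods: The study proposes a face liveness detection method based on image-text pairs and contrastive learning. A text encoder and an image encoder extract feature vector representations of the descriptive text and the face images, and contrastive learning aligns the two by their similarity.
results: The method provides zero-shot liveness detection and reaches the level of commercial algorithms on five public datasets. On five types of testing datasets it outperforms commercial algorithms, with detection rates reaching 100% on multiple datasets. This shows that introducing image-text pairs and contrastive learning makes liveness detection more effective and robust.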

Abstract
Face recognition technology is widely used in the financial field, and various types of liveness attack behaviors need to be addressed. Existing liveness detection algorithms are trained on specific training datasets and tested on testing datasets, but their performance and robustness in transferring to unseen datasets are relatively poor. To tackle this issue, we propose a face liveness detection method based on image-text pairs and contrastive learning, dividing liveness attack problems in the financial field into eight categories and using text information to describe the images of these eight types of attacks. The text encoder and image encoder are used to extract feature vector representations for the classification description text and face images, respectively. By maximizing the similarity of positive samples and minimizing the similarity of negative samples, the model learns shared representations between images and texts. The proposed method is capable of effectively detecting specific liveness attack behaviors in certain scenarios, such as those occurring in dark environments or involving the tampering of ID card photos. Additionally, it is also effective in detecting traditional liveness attack methods, such as printing photo attacks and screen remake attacks. The zero-shot capabilities of face liveness detection on five public datasets, including NUAA, CASIA-FASD, Replay-Attack, OULU-NPU and MSU-MFSD, also reach the level of commercial algorithms. The detection capability of the proposed algorithm was verified on 5 types of testing datasets, and the results show that the method outperformed commercial algorithms, with detection rates reaching 100% on multiple datasets. These results demonstrate the effectiveness and robustness of introducing image-text pairs and contrastive learning into liveness detection tasks, as proposed in this paper.
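The training objective described above (pull matching image-text pairs together, push mismatched pairs apart) is commonly written as a symmetric contrastive loss over a similarity matrix. The sketch below shows that generic formulation together with zero-shot classification by nearest text description. It is an assumption that the paper uses this exact CLIP-style loss; the embeddings, the temperature, and the two class descriptions are hypothetical toy values.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return tuple(a / norm for a in v)

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image k should match text k, and text k should match image k."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

def zero_shot_label(image_emb, text_embs):
    """Return the index of the text description most similar to the image."""
    img = normalize(image_emb)
    scores = [dot(img, normalize(t)) for t in text_embs]
    return scores.index(max(scores))

# Toy embeddings: index 0 = "live face", index 1 = "printed photo attack" (assumed).
text_embs = [(1.0, 0.0), (0.0, 1.0)]
aligned_loss = contrastive_loss([(1.0, 0.0), (0.0, 1.0)], text_embs)
mismatched_loss = contrastive_loss([(0.0, 1.0), (1.0, 0.0)], text_embs)
predicted = zero_shot_label((0.9, 0.1), text_embs)
```

Because classification reduces to comparing an image embedding with text embeddings, new attack categories can in principle be added by writing a new description, which is what makes the zero-shot evaluation on unseen datasets possible.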

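Summary
Face recognition technology is widely used in the financial field, but liveness attacks still need to be addressed. Existing liveness detection algorithms are built on specific training and testing sets and perform poorly when transferred to unseen datasets. To solve this problem, we propose a face liveness detection method based on image-text pairs that divides liveness attacks in the financial field into eight categories and uses text to describe the images of these eight attack types. A text encoder and an image encoder extract feature vector representations of the description text and the face images. By maximizing the similarity of positive samples and minimizing the similarity of negative samples, the model learns a shared representation between images and text. The proposed method effectively detects specific liveness attacks, such as those in dark environments or those involving tampered ID card photos. It also effectively detects traditional liveness attacks, such as printed photo attacks and screen remake attacks. Its zero-shot capability on five public datasets reaches the level of commercial algorithms. Its detection capability was verified on five types of testing sets, and the results show detection rates of 100% on multiple datasets, demonstrating the effectiveness and robustness of liveness detection based on image-text pairs and contrastive learning.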

LGFCTR: Local and Global Feature Convolutional Transformer for Image Matching

paper_url: http://arxiv.org/abs/2311.17571
repo_url: None
paper_authors: Wenhao Zhong, Jie Jiang
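for: This paper targets accurate and robust correspondence finding across images, especially under extreme conditions.
methods: The paper proposes a new convolutional transformer that captures local and global features simultaneously. First, a universal FPN-like framework captures global structures with transformers and compensates for local contexts and implicit positional encoding with convolutions. Second, a new convolutional transformer module captures multi-scale long-range dependencies through multi-scale attention and local information aggregation. Finally, a new regression-based sub-pixel refinement module uses the whole fine-grained window features to regress fine-level positional deviations.
results: On a range of benchmarks, the proposed method finds accurate and robust correspondences and outperforms existing methods. Code will be released at https://github.com/zwh0527/LGFCTR.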

Abstract
Image matching, which aims to find robust and accurate correspondences across images, is a challenging task under extreme conditions. Capturing local and global features simultaneously is an important way to mitigate such an issue, but recent transformer-based decoders were still stuck in the issues that CNN-based encoders only extract local features and the transformers lack locality. Inspired by the locality and implicit positional encoding of convolutions, a novel convolutional transformer is proposed to capture both local contexts and global structures more sufficiently for detector-free matching. Firstly, a universal FPN-like framework captures global structures in self-encoder as well as cross-decoder by transformers and compensates local contexts as well as implicit positional encoding by convolutions. Secondly, a novel convolutional transformer module explores multi-scale long range dependencies by a novel multi-scale attention and further aggregates local information inside dependencies for enhancing locality. Finally, a novel regression-based sub-pixel refinement module exploits the whole fine-grained window features for fine-level positional deviation regression. The proposed method achieves superior performances on a wide range of benchmarks. The code will be available on https://github.com/zwh0527/LGFCTR.

Summary
Image matching under extreme conditions is a difficult task, and capturing both local and global features is an important way to address this issue. However, recent transformer-based decoders have still struggled with this problem because the CNN-based encoders only extract local features and the transformers lack locality. To solve this issue, a novel convolutional transformer is proposed that captures both local contexts and global structures more effectively for detector-free matching. Firstly, the proposed method uses a universal FPN-like framework that captures global structures in both the self-encoder and cross-decoder using transformers, while also compensating for local contexts and implicit positional encoding using convolutions. Secondly, a novel convolutional transformer module is used to explore multi-scale long-range dependencies using a novel multi-scale attention mechanism, and then aggregate local information within these dependencies to enhance locality. Finally, a novel regression-based sub-pixel refinement module uses the whole fine-grained window features to regress fine-level positional deviations. The proposed method achieves superior performance on a wide range of benchmarks. The code will be available on GitHub at https://github.com/zwh0527/LGFCTR.

An Efficient Illumination Invariant Tiger Detection Framework for Wildlife Surveillance

paper_url: http://arxiv.org/abs/2311.17552
repo_url: None
paper_authors: Gaurav Pendharkar, A. Ancy Micheal, Jason Misquitta, Ranjeesh Kaippada
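for: Tiger conservation requires multifaceted strategies, including habitat preservation, anti-poaching measures, and community involvement, to support sustainable growth of the tiger population.
methods: The paper automates tiger detection with artificial intelligence, proposing an illumination-invariant tiger detection framework based on EnlightenGAN and YOLOv8.
results: On the ATRW dataset, the fine-tuned YOLOv8 model achieves a mAP of 61% without illumination enhancement, and illumination enhancement improves the mAP by 0.7%. These approaches raise the state-of-the-art performance on ATRW by approximately 6% to 7%.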

Abstract
Tiger conservation necessitates the strategic deployment of multifaceted initiatives encompassing the preservation of ecological habitats, anti-poaching measures, and community involvement for sustainable growth in the tiger population. With the advent of artificial intelligence, tiger surveillance can be automated using object detection. In this paper, an accurate illumination invariant framework is proposed based on EnlightenGAN and YOLOv8 for tiger detection. The fine-tuned YOLOv8 model achieves a mAP score of 61% without illumination enhancement. The illumination enhancement improves the mAP by 0.7%. The approaches elevate the state-of-the-art performance on the ATRW dataset by approximately 6% to 7%.

Summary
Tiger conservation requires multifaceted initiatives, including the preservation of ecological habitats, anti-poaching measures, and community involvement, to achieve sustainable growth in the tiger population. With the advent of artificial intelligence, tiger surveillance can be automated using object detection. In this paper, an accurate illumination invariant framework is proposed based on EnlightenGAN and YOLOv8 for tiger detection. The fine-tuned YOLOv8 model achieves a mAP score of 61% without illumination enhancement. The illumination enhancement improves the mAP by 0.7%. The approaches elevate the state-of-the-art performance on the ATRW dataset by approximately 6% to 7%.

VINNA for Neonates – Orientation Independence through Latent Augmentations

paper_url: http://arxiv.org/abs/2311.17546
repo_url: None
paper_authors: Leonie Henschel, David Kügler, Lilla Zöllei, Martin Reuter
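for: This paper aims at fast and accurate segmentation of neonatal brain images, to better understand and detect changes during development and disease.
methods: The paper introduces resolution-aware internal augmentations (VINNA), which enable fast and accurate segmentation across different resolutions without resampling.
results: The study finds that VINNA handles the head position variations in neonatal brain images better than state-of-the-art external augmentation approaches and retains high segmentation accuracy across resolutions of 0.5-1.0 mm.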

Abstract
Fast and accurate segmentation of neonatal brain images is highly desired to better understand and detect changes during development and disease. Yet, the limited availability of ground truth datasets, lack of standardized acquisition protocols, and wide variations of head positioning pose challenges for method development. A few automated image analysis pipelines exist for newborn brain MRI segmentation, but they often rely on time-consuming procedures and require resampling to a common resolution, subject to loss of information due to interpolation and down-sampling. Without registration and image resampling, variations with respect to head positions and voxel resolutions have to be addressed differently. In deep-learning, external augmentations are traditionally used to artificially expand the representation of spatial variability, increasing the training dataset size and robustness. However, these transformations in the image space still require resampling, reducing accuracy specifically in the context of label interpolation. We recently introduced the concept of resolution-independence with the Voxel-size Independent Neural Network framework, VINN. Here, we extend this concept by additionally shifting all rigid-transforms into the network architecture with a four degree of freedom (4-DOF) transform module, enabling resolution-aware internal augmentations (VINNA). In this work we show that VINNA (i) significantly outperforms state-of-the-art external augmentation approaches, (ii) effectively addresses the head variations present specifically in newborn datasets, and (iii) retains high segmentation accuracy across a range of resolutions (0.5-1.0 mm). The 4-DOF transform module is a powerful, general approach to implement spatial augmentation without requiring image or label interpolation. The specific network application to newborns will be made publicly available as VINNA4neonates.
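As a rough illustration of what a 4-DOF transform might look like, the sketch below composes one rotation, one isotropic scale, and two translations into a 2D homogeneous matrix and applies it to a sampling coordinate. The choice of these four parameters is an assumption (the abstract does not enumerate them), and the function names are hypothetical. The sketch only shows the coordinate arithmetic, not how VINNA applies such a transform to feature maps inside the network.

```python
import math

def make_transform(angle_rad, scale, tx, ty):
    """Build a 3x3 homogeneous matrix: rotate, scale isotropically, then translate."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [
        [scale * c, -scale * s, tx],
        [scale * s,  scale * c, ty],
        [0.0,        0.0,       1.0],
    ]

def apply_transform(matrix, point):
    """Map a 2D sampling coordinate through the homogeneous matrix."""
    x, y = point
    return (
        matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
        matrix[1][0] * x + matrix[1][1] * y + matrix[1][2],
    )

# Rotate by 90 degrees, double the scale, shift one unit along x (toy values).
transform = make_transform(math.pi / 2, 2.0, 1.0, 0.0)
moved = apply_transform(transform, (1.0, 0.0))
```

Transforming sampling coordinates for intermediate feature maps, rather than resampling the input image and its label map, is what lets an internal augmentation avoid label interpolation.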

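Summary
Fast and accurate segmentation of neonatal brain images is highly desirable for better understanding and detecting changes during development and disease. However, the limited availability of ground truth datasets, the lack of standardized acquisition protocols, and wide variations in head positioning make method development challenging. A few automated image analysis pipelines exist for newborn brain MRI segmentation, but they often rely on time-consuming procedures and require resampling to a common resolution, which loses information through interpolation and down-sampling. Without registration and resampling, variations in head position and voxel resolution must be addressed differently. In deep learning, external augmentations are traditionally used to artificially expand the representation of spatial variability, increasing the training dataset size and the robustness of the model. However, these image-space transformations still require resampling, which reduces accuracy, particularly for label interpolation. We recently introduced the concept of resolution independence with the Voxel-size Independent Neural Network framework (VINN). In this work, we extend the concept by adding a four degree of freedom (4-DOF) transform module, enabling resolution-aware internal augmentations (VINNA). We show that VINNA (1) significantly outperforms external augmentation approaches, (2) effectively addresses the head variations present in newborn datasets, and (3) retains high segmentation accuracy across a range of resolutions (0.5-1.0 mm). The 4-DOF transform module is a powerful, general approach to spatial augmentation that requires no image or label interpolation. The specific network application to newborns will be made publicly available as VINNA4neonates.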

Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning

paper_url: http://arxiv.org/abs/2311.17536
repo_url: https://github.com/spengliang/smoothvideo
paper_authors: Liang Peng, Haoran Cheng, Zheng Yang, Ruisi Zhao, Linxuan Xia, Chaotian Song, Qinglin Lu, Wei Liu, Boxi Wu
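for: Improve the consistency and smoothness of one-shot video tuning methods.
methods: Add a noise constraint that regulates noise predictions across neighboring video frames, resulting in smooth latents.
results: Applying the loss to existing one-shot video tuning methods improves the consistency and smoothness of the generated videos. The paper also proposes a new evaluation metric that better captures detailed features and their temporal dynamics. Experimental results validate the effectiveness of the method.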

Abstract
Recent one-shot video tuning methods, which fine-tune the network on a specific video based on pre-trained text-to-image models (e.g., Stable Diffusion), are popular in the community because of the flexibility. However, these methods often produce videos marred by incoherence and inconsistency. To address these limitations, this paper introduces a simple yet effective noise constraint across video frames. This constraint aims to regulate noise predictions across their temporal neighbors, resulting in smooth latents. It can be simply included as a loss term during the training phase. By applying the loss to existing one-shot video tuning methods, we significantly improve the overall consistency and smoothness of the generated videos. Furthermore, we argue that current video evaluation metrics inadequately capture smoothness. To address this, we introduce a novel metric that considers detailed features and their temporal dynamics. Experimental results validate the effectiveness of our approach in producing smoother videos on various one-shot video tuning baselines. The source codes and video demos are available at https://github.com/SPengLiang/SmoothVideo.
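A loss term that regulates noise predictions across temporal neighbors could take the following shape: penalize any mismatch between the frame-to-frame change in the predicted noise and the frame-to-frame change in the target noise. This specific formula is an assumption made for illustration; the abstract does not give the exact constraint, and the function name and the flat-list frame layout are hypothetical.

```python
def temporal_noise_loss(predicted, target):
    """Mean squared mismatch of frame-to-frame noise differences.

    predicted, target: lists of frames, each frame a flat list of noise values.
    Returns 0.0 when there are fewer than two frames.
    """
    total, count = 0.0, 0
    for t in range(len(predicted) - 1):
        frames = zip(predicted[t], predicted[t + 1], target[t], target[t + 1])
        for p_now, p_next, n_now, n_next in frames:
            # Compare how the prediction changes over time with how the target changes.
            diff = (p_next - p_now) - (n_next - n_now)
            total += diff * diff
            count += 1
    return total / count if count else 0.0

# Two frames with two noise values each (toy numbers).
target = [[0.0, 1.0], [1.0, 1.0]]
predicted = [[0.0, 1.0], [2.0, 1.0]]
smoothness_penalty = temporal_noise_loss(predicted, target)
```

A term of this kind would be added to the usual per-frame denoising loss with a weighting factor, so the per-frame objective stays intact while prediction errors are encouraged to be consistent between adjacent frames.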

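Summary
Recent one-shot video tuning methods, which fine-tune a network on a specific video based on pre-trained text-to-image models (e.g., Stable Diffusion), are popular in the community because of their flexibility. However, these methods often produce incoherent and inconsistent videos. To address these limitations, this paper introduces a simple yet effective noise constraint that regulates noise predictions across video frames, resulting in smooth latents. It can be added directly to the loss during training. By applying the loss to existing one-shot video tuning methods, we significantly improve the consistency and smoothness of the generated videos. We also argue that current video evaluation metrics do not adequately capture smoothness. To address this, we introduce a new metric that considers detailed features and their temporal dynamics. Experimental results validate the effectiveness of our method in producing smoother videos on various one-shot video tuning baselines. Code and video demos are available at https://github.com/SPengLiang/SmoothVideo.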

Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

paper_url: http://arxiv.org/abs/2311.17532
repo_url: None
paper_authors: Xingqun Qi, Jiahao Pan, Peng Li, Ruibin Yuan, Xiaowei Chi, Mengfei Li, Wenhan Luo, Wei Xue, Shanghang Zhang, Qifeng Liu, Yike Guo
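for: Generating realistic 3D co-speech gestures is crucial in human-machine interaction applications, yet existing methods can only generate gestures that follow a single emotion label.
methods: The authors first use ChatGPT-4 and an audio inpainting approach to construct high-fidelity emotion transition speech, then propose a weakly supervised training strategy to encourage realistic transition gestures.
results: The method outperforms existing models on the newly defined emotion transition task and datasets, and generates diverse gestures.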

Abstract
Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While the existing methods enable generating the gestures to follow a single emotion label, they overlook that long gesture sequence modeling with emotion transition is more practical in real scenes. In addition, the lack of large-scale available datasets with emotional transition speech and corresponding 3D human gestures also limits the addressing of this task. To fulfill this goal, we first incorporate the ChatGPT-4 and an audio inpainting approach to construct the high-fidelity emotion transition human speeches. Considering obtaining the realistic 3D pose annotations corresponding to the dynamically inpainted emotion transition audio is extremely difficult, we propose a novel weakly supervised training strategy to encourage authority gesture transitions. Specifically, to enhance the coordination of transition gestures w.r.t different emotional ones, we model the temporal association representation between two different emotional gesture sequences as style guidance and infuse it into the transition generation. We further devise an emotion mixture mechanism that provides weak supervision based on a learnable mixed emotion label for transition gestures. Last, we present a keyframe sampler to supply effective initial posture cues in long sequences, enabling us to generate diverse gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models constructed by adapting single emotion-conditioned counterparts on our newly defined emotion transition task and datasets.

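Summary
Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. Existing methods can only generate gestures that follow a single emotion label, and they overlook that modeling long gesture sequences with emotion transitions is more practical in real scenes. In addition, the lack of large-scale datasets with emotion transition speech and corresponding 3D human gestures limits progress on this task. To reach this goal, we first combine ChatGPT-4 with an audio inpainting approach to construct high-fidelity emotion transition speech. Because obtaining realistic 3D pose annotations for this audio is extremely difficult, we propose a new weakly supervised training strategy. Specifically, we model the temporal association representation between two different emotional gesture sequences and infuse it into the transition generation as style guidance. We also propose an emotion mixture mechanism that provides weak supervision based on a learnable mixed emotion label. Finally, we present a keyframe sampler that supplies effective initial posture cues in long sequences. Extensive experiments show that our method outperforms state-of-the-art models, constructed by adapting single emotion-conditioned counterparts, on our newly defined emotion transition task and datasets.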

HiDiffusion: Unlocking High-Resolution Creativity and Efficiency in Low-Resolution Trained Diffusion Models

paper_url: http://arxiv.org/abs/2311.17528
repo_url: None
paper_authors: Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Zhenyuan Chen, Yao Tang, Yuhao Chen, Wengang Cao, Jiajun Liang
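for: Enable pretrained diffusion models to efficiently generate high-resolution images (e.g. 1024×1024) beyond the training image resolution.
methods: A simple yet scalable combination of Resolution-Aware U-Net (RAU-Net) and Modified Shifted Window Multi-head Self-Attention (MSW-MSA), applied to a pretrained diffusion model without further tuning.
results: HiDiffusion avoids the unreasonable object duplication that pretrained diffusion models show in high-resolution generation while also reducing inference time, achieving state-of-the-art performance on high-resolution image synthesis.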

Abstract
We introduce HiDiffusion, a tuning-free framework comprised of Resolution-Aware U-Net (RAU-Net) and Modified Shifted Window Multi-head Self-Attention (MSW-MSA) to enable pretrained large text-to-image diffusion models to efficiently generate high-resolution images (e.g. 1024×1024) that surpass the training image resolution. Pretrained diffusion models encounter unreasonable object duplication in generating images beyond the training image resolution. We attribute it to the mismatch between the feature map size of high-resolution images and the receptive field of U-Net's convolution. To address this issue, we propose a simple yet scalable method named RAU-Net. RAU-Net dynamically adjusts the feature map size to match the convolution's receptive field in the deep block of U-Net. Another obstacle in high-resolution synthesis is the slow inference speed of U-Net. Our observations reveal that the global self-attention in the top block exhibits locality yet consumes the majority of computational resources. To tackle this issue, we propose MSW-MSA. Unlike previous window attention mechanisms, our method uses a much larger window size and dynamically shifts windows to better accommodate diffusion models. Extensive experiments demonstrate that our HiDiffusion can scale diffusion models to generate 1024×1024, 2048×2048, or even 4096×4096 resolution images, while simultaneously reducing inference time by 40%-60%, achieving state-of-the-art performance on high-resolution image synthesis. The most significant revelation of our work is that a pretrained diffusion model on low-resolution images is scalable for high-resolution generation without further tuning. We hope this revelation can provide insights for future research on the scalability of diffusion models.
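
The abstract's argument for window attention can be illustrated with a small counting sketch. This is not the paper's MSW-MSA; the grid size, window size, shift values, and the function names `window_ids` and `attention_pairs` are all hypothetical. It only shows why restricting self-attention to windows cuts the number of query-key pairs, and how a cyclic shift regroups tokens without changing that cost.

```python
def window_ids(height, width, win_h, win_w, shift_h=0, shift_w=0):
    """Assign each token of a height x width grid to a window index.

    A cyclic shift (hypothetical schedule) moves the window boundaries so
    that tokens grouped together change from one call to the next.
    """
    if height % win_h or width % win_w:
        raise ValueError("grid must be divisible by the window size")
    wins_per_row = width // win_w
    ids = []
    for r in range(height):
        row = []
        for c in range(width):
            sr = (r + shift_h) % height  # cyclically shifted row
            sc = (c + shift_w) % width   # cyclically shifted column
            row.append((sr // win_h) * wins_per_row + (sc // win_w))
        ids.append(row)
    return ids


def attention_pairs(ids):
    """Count query-key pairs when attention is limited to each window."""
    counts = {}
    for row in ids:
        for w in row:
            counts[w] = counts.get(w, 0) + 1
    return sum(n * n for n in counts.values())


# One window covering the whole grid is equivalent to global attention.
global_cost = attention_pairs(window_ids(8, 8, 8, 8))
plain = window_ids(8, 8, 4, 4)
shifted = window_ids(8, 8, 4, 4, shift_h=2, shift_w=2)
print(global_cost, attention_pairs(plain), attention_pairs(shifted))
```

In this toy setting the shifted partition costs the same as the unshifted one but pairs different tokens, which is the general motivation for shifting windows between attention calls.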

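Summary
We introduce HiDiffusion, a tuning-free framework consisting of Resolution-Aware U-Net (RAU-Net) and Modified Shifted Window Multi-head Self-Attention (MSW-MSA), which lets pretrained large text-to-image diffusion models efficiently generate high-resolution images (e.g. 1024×1024) beyond the training image resolution. Pretrained diffusion models produce unreasonable object duplication when generating images beyond the training resolution. We attribute this to a mismatch between the feature map size of high-resolution images and the receptive field of the U-Net's convolutions. To address it, we propose RAU-Net, a simple yet scalable method that dynamically adjusts the feature map size in the deep block of the U-Net to match the convolution's receptive field. Another obstacle to high-resolution synthesis is the slow inference speed of the U-Net. We observe that global self-attention in the top block exhibits locality yet consumes most of the computation. To tackle this, we propose MSW-MSA. Unlike previous window attention mechanisms, it uses a much larger window size and dynamically shifts windows to better suit diffusion models. Experiments show that HiDiffusion scales pretrained diffusion models to 1024×1024, 2048×2048, or even 4096×4096 images while reducing inference time by 40%-60%, achieving state-of-the-art performance on high-resolution image synthesis. The most important finding is that a diffusion model pretrained on low-resolution images can be scaled to high-resolution generation without further tuning. We hope this finding offers insights for future research on the scalability of diffusion models.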

A publicly available vessel segmentation algorithm for SLO images

paper_url: http://arxiv.org/abs/2311.17525
repo_url: None
paper_authors: Adam Threlfall, Samuel Gibbon, James Cameron, Tom MacGillivray
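for: Develop a vessel segmentation algorithm tailored specifically to infra-red scanning laser ophthalmoscope (IRSLO) images.
methods: The study uses 23 expertly annotated IRSLO images from the RAVIR dataset plus 7 images annotated in-house, and trains a U-Net (a convolutional neural network) to label pixels as 'vessel' or 'background'.
results: On an unseen test set (4 images), the model achieved an AUC of 0.981 and an AUPRC of 0.815. After thresholding, it achieved a sensitivity of 0.844, a specificity of 0.983, and an F1 score of 0.857.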

Abstract
Background and Objective: Infra-red scanning laser ophthalmoscope (IRSLO) images are akin to colour fundus photographs in displaying the posterior pole and retinal vasculature fine detail. While there are many trained networks readily available for retinal vessel segmentation in colour fundus photographs, none cater to IRSLO images. Accordingly, we aimed to develop (and release as open source) a vessel segmentation algorithm tailored specifically to IRSLO images. Materials and Methods: We used 23 expertly annotated IRSLO images from the RAVIR dataset, combined with 7 additional images annotated in-house. We trained a U-Net (convolutional neural network) to label pixels as 'vessel' or 'background'. Results: On an unseen test set (4 images), our model achieved an AUC of 0.981, and an AUPRC of 0.815. Upon thresholding, it achieved a sensitivity of 0.844, a specificity of 0.983, and an F1 score of 0.857. Conclusion: We have made our automatic segmentation algorithm publicly available and easy to use. Researchers can use the generated vessel maps to compute metrics such as fractal dimension and vessel density.
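
For readers less familiar with the thresholded metrics reported above, the following is a minimal sketch of how sensitivity, specificity, and F1 are computed from per-pixel probabilities. The toy probabilities, labels, the 0.5 threshold, and the function names are assumptions for illustration; they are not the paper's data or code.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, TN, FN for two equal-length sequences of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, tn, fn


def segmentation_metrics(probs, truth, threshold=0.5):
    """Threshold vessel probabilities, then compute the three metrics."""
    pred = [1 if p >= threshold else 0 for p in probs]
    tp, fp, tn, fn = confusion_counts(pred, truth)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on vessel pixels
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on background
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


# Hypothetical flattened pixels: 1 = vessel, 0 = background.
toy_probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.7, 0.4]
toy_truth = [1, 1, 1, 0, 0, 0, 1, 0]
print(segmentation_metrics(toy_probs, toy_truth))
```

AUC and AUPRC, by contrast, are computed over all thresholds, which is why the abstract reports them separately from the thresholded figures.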

Summary
Background and Objective: Infra-red scanning laser ophthalmoscope (IRSLO) images resemble colour fundus photographs in showing fine detail of the posterior pole and retinal vasculature. Although many trained networks are available for retinal vessel segmentation in colour fundus photographs, none are tailored to IRSLO images. We therefore aimed to develop (and release as open source) a vessel segmentation algorithm specifically for IRSLO images. Materials and Methods: We used 23 expertly annotated IRSLO images from the RAVIR dataset, along with 7 additional images annotated in-house. We trained a U-Net (convolutional neural network) to label pixels as 'vessel' or 'background'. Results: On an unseen test set (4 images), our model achieved an AUC of 0.981 and an AUPRC of 0.815. Upon thresholding, it achieved a sensitivity of 0.844, a specificity of 0.983, and an F1 score of 0.857. Conclusion: We have made our automatic segmentation algorithm publicly available and easy to use. Researchers can use the generated vessel maps to compute metrics such as fractal dimension and vessel density.

Improving Stability during Upsampling – on the Importance of Spatial Context

paper_url: http://arxiv.org/abs/2311.17524
repo_url: None
paper_authors: Shashank Agnihotri, Julia Grabinski, Margret Keuper
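for: The paper targets pixel-wise prediction tasks such as image restoration, image segmentation, and disparity estimation, which involve several stages of data resampling.
methods: The paper is the first to explore the relevance of spatial context during upsampling, using convolutional upsampling operations with increasing kernel size while keeping the encoder unchanged.
results: Larger kernels in convolutional upsampling generally improve prediction stability in tasks such as image restoration and image segmentation, and a block that combines small kernels for fine details with large kernels for artifact removal and increased context gives the best results.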

Abstract
State-of-the-art models for pixel-wise prediction tasks such as image restoration, image segmentation, or disparity estimation, involve several stages of data resampling, in which the resolution of feature maps is first reduced to aggregate information and then sequentially increased to generate a high-resolution output. Several previous works have investigated the effect of artifacts that are invoked during downsampling and diverse cures have been proposed that facilitate to improve prediction stability and even robustness for image classification. However, equally relevant, artifacts that arise during upsampling have been less discussed. This is significantly relevant as upsampling and downsampling approaches face fundamentally different challenges. While during downsampling, aliases and artifacts can be reduced by blurring feature maps, the emergence of fine details is crucial during upsampling. Blurring is therefore not an option and dedicated operations need to be considered. In this work, we are the first to explore the relevance of context during upsampling by employing convolutional upsampling operations with increasing kernel size while keeping the encoder unchanged. We find that increased kernel sizes can in general improve the prediction stability in tasks such as image restoration or image segmentation, while a block that allows for a combination of small-size kernels for fine details and large-size kernels for artifact removal and increased context yields the best results.

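Summary
State-of-the-art models for pixel-wise prediction tasks such as image restoration, image segmentation, or disparity estimation usually involve several stages of data resampling, in which the resolution of feature maps is first reduced to aggregate information and then gradually increased to generate a high-resolution output. Previous work has studied the artifacts introduced during downsampling and proposed remedies that improve prediction stability and robustness. Artifacts that arise during upsampling have received less attention, even though the two operations face quite different challenges: during downsampling, aliases and artifacts can be reduced by blurring feature maps, whereas upsampling must let fine details emerge, so blurring is not an option and dedicated operations are needed. We are the first to explore the relevance of context during upsampling, using convolutional upsampling operations with increasing kernel size while keeping the encoder unchanged. We find that larger kernels generally improve prediction stability in image restoration and image segmentation, and that a block combining small kernels for fine details with large kernels for artifact removal and increased context yields the best results.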

MMA-Diffusion: MultiModal Attack on Diffusion Models

paper_url: http://arxiv.org/abs/2311.17516
repo_url: None
paper_authors: Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Nan Xu, Qiang Xu
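for: Study the security of Text-to-Image (T2I) models and the weaknesses of existing defense mechanisms.
methods: Proposes MMA-Diffusion, an attack framework that leverages both textual and visual modalities and can circumvent the safety measures of current open-source models and commercial online services.
results: MMA-Diffusion successfully bypasses existing safeguards such as prompt filters and post-hoc safety checkers, exposing the vulnerabilities of current defense mechanisms.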

Abstract
In recent years, Text-to-Image (T2I) models have seen remarkable advancements, gaining widespread adoption. However, this progress has inadvertently opened avenues for potential misuse, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces MMA-Diffusion, a framework that presents a significant and realistic threat to the security of T2I models by effectively circumventing current defensive measures in both open-source models and commercial online services. Unlike previous approaches, MMA-Diffusion leverages both textual and visual modalities to bypass safeguards like prompt filters and post-hoc safety checkers, thus exposing and highlighting the vulnerabilities in existing defense mechanisms.

Fusion of Single and Integral Multispectral Aerial Images

paper_url: http://arxiv.org/abs/2311.17515
repo_url: None
paper_authors: Mohamed Youssef, Oliver Bimber
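for: Fuse conventional and integral aerial images to remove occlusion caused by dense vegetation and improve target visibility.
methods: A hybrid (model- and learning-based) architecture that combines the environment's spatial references with features of unoccluded targets.
results: Outperforms previous methods, requires no manually tuned parameters, extends to an arbitrary number and combination of spectral channels, and is reconfigurable for different use cases.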

Abstract
We present a novel hybrid (model- and learning-based) architecture for fusing the most significant features from conventional aerial images and integral aerial images that result from synthetic aperture sensing for removing occlusion caused by dense vegetation. It combines the environment's spatial references with features of unoccluded targets. Our method out-beats the state-of-the-art, does not require manually tuned parameters, can be extended to an arbitrary number and combinations of spectral channels, and is reconfigurable to address different use-cases.

Summary
We present a novel hybrid (model- and learning-based) architecture for fusing the most significant features of conventional aerial images and integral aerial images, which result from synthetic aperture sensing, to remove occlusion caused by dense vegetation. It combines the environment's spatial references with features of unoccluded targets. Our method outperforms the state of the art, does not require manually tuned parameters, can be extended to an arbitrary number and combination of spectral channels, and is reconfigurable for different use cases.

StructRe: Rewriting for Structured Shape Modeling

paper_url: http://arxiv.org/abs/2311.17510
repo_url: None
paper_authors: Jiepeng Wang, Hao Pan, Yang Liu, Xin Tong, Taku Komura, Wenping Wang
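for: Proposes a new approach to structured shape modeling that captures the natural organization of man-made 3D shapes into parts and hierarchies.
methods: StructRe, a structure rewriting system: given a 3D object represented by points and components, it rewrites the object upward into more concise structures or downward into more detailed ones. Iterating the rewriting yields hierarchies, and the localized rewriting enables probabilistic modeling of ambiguous structures.
results: Trained on PartNet data, StructRe generalizes across object categories and to multiple object hierarchies, extends to ShapeNet, and benefits shape reconstruction, generation, and editing tasks.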

Abstract
Man-made 3D shapes are naturally organized in parts and hierarchies; such structures provide important constraints for shape reconstruction and generation. Modeling shape structures is difficult, because there can be multiple hierarchies for a given shape, causing ambiguity, and across different categories the shape structures are correlated with semantics, limiting generalization. We present StructRe, a structure rewriting system, as a novel approach to structured shape modeling. Given a 3D object represented by points and components, StructRe can rewrite it upward into more concise structures, or downward into more detailed structures; by iterating the rewriting process, hierarchies are obtained. Such a localized rewriting process enables probabilistic modeling of ambiguous structures and robust generalization across object categories. We train StructRe on PartNet data and show its generalization to cross-category and multiple object hierarchies, and test its extension to ShapeNet. We also demonstrate the benefits of probabilistic and generalizable structure modeling for shape reconstruction, generation and editing tasks.

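Summary
Man-made 3D shapes are naturally organized into parts and hierarchies, and these structures provide important constraints for shape reconstruction and generation. Modeling shape structures is difficult: a given shape can have multiple hierarchies, which causes ambiguity, and across categories shape structures are correlated with semantics, which limits generalization. We present StructRe, a structure rewriting system, as a new approach to structured shape modeling. Given a 3D object represented by points and components, StructRe can rewrite it upward into more concise structures or downward into more detailed structures; iterating the rewriting process yields hierarchies. This localized rewriting process enables probabilistic modeling of ambiguous structures and robust generalization across object categories. We train StructRe on PartNet data, show that it generalizes across categories and to multiple object hierarchies, and test its extension to ShapeNet. We also demonstrate the benefits of probabilistic and generalizable structure modeling for shape reconstruction, generation, and editing tasks.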

PViT-6D: Overclocking Vision Transformers for 6D Pose Estimation with Confidence-Level Prediction and Pose Tokens

paper_url: http://arxiv.org/abs/2311.17504
repo_url: None
paper_authors: Sebastian Stapf, Tobias Bauernfeind, Marco Riboldi
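for: Improve the accuracy and reliability of 6D pose estimation while keeping the implementation simple and end-to-end learnable.
methods: Uses Vision Transformers for direct 6D pose estimation and introduces a simple method for determining pose confidence that can be integrated into most 6D pose estimation frameworks.
results: The method, Pose Vision Transformer (PViT-6D), outperforms current state-of-the-art methods by +0.3% ADD(-S) on Linemod-Occlusion and +2.7% ADD(-S) on YCB-V. It also improves the model's interpretability and the reliability of its performance during inference.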

Abstract
In the current state of 6D pose estimation, top-performing techniques depend on complex intermediate correspondences, specialized architectures, and non-end-to-end algorithms. In contrast, our research reframes the problem as a straightforward regression task by exploring the capabilities of Vision Transformers for direct 6D pose estimation through a tailored use of classification tokens. We also introduce a simple method for determining pose confidence, which can be readily integrated into most 6D pose estimation frameworks. This involves modifying the transformer architecture by decreasing the number of query elements based on the network's assessment of the scene complexity. Our method that we call Pose Vision Transformer or PViT-6D provides the benefits of simple implementation and being end-to-end learnable while outperforming current state-of-the-art methods by +0.3% ADD(-S) on Linemod-Occlusion and +2.7% ADD(-S) on the YCB-V dataset. Moreover, our method enhances both the model's interpretability and the reliability of its performance during inference.

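Summary
Top-performing 6D pose estimation techniques currently depend on complex intermediate correspondences, specialized architectures, and non-end-to-end algorithms. Our research instead reframes the problem as a straightforward regression task, exploring Vision Transformers for direct 6D pose estimation through a tailored use of classification tokens. We also introduce a simple method for determining pose confidence that can be readily integrated into most 6D pose estimation frameworks. It modifies the transformer architecture by reducing the number of query elements according to the network's assessment of scene complexity. Our method, Pose Vision Transformer or PViT-6D, is simple to implement and end-to-end learnable, and it outperforms current state-of-the-art methods by +0.3% ADD(-S) on Linemod-Occlusion and +2.7% ADD(-S) on the YCB-V dataset. It also improves the model's interpretability and the reliability of its performance during inference.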

Towards Higher Ranks via Adversarial Weight Pruning

paper_url: http://arxiv.org/abs/2311.17493
repo_url: https://github.com/huawei-noah/Efficient-Computing
paper_authors: Yuchuan Tian, Hanting Chen, Tianyu Guo, Chao Xu, Yunhe Wang
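for: Make Convolutional Neural Networks (CNNs) easier to deploy on edge devices by using network pruning to reduce computation and storage costs.
methods: Proposes Rank-based PruninG (RPG): at each step, it minimizes the low-rank approximation error of the weight matrices using singular value decomposition and maximizes the distance between the weight matrices and that low-rank approximation, steering sparse weights toward a high-rank topology.
results: Experiments on various datasets and tasks show the method is effective at high sparsity; on ImageNet with ResNet-50 at 98% sparsity, RPG improves top-1 accuracy by 1.13% over the previous state of the art.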

Abstract
Convolutional Neural Networks (CNNs) are hard to deploy on edge devices due to its high computation and storage complexities. As a common practice for model compression, network pruning consists of two major categories: unstructured and structured pruning, where unstructured pruning constantly performs better. However, unstructured pruning presents a structured pattern at high pruning rates, which limits its performance. To this end, we propose a Rank-based PruninG (RPG) method to maintain the ranks of sparse weights in an adversarial manner. In each step, we minimize the low-rank approximation error for the weight matrices using singular value decomposition, and maximize their distance by pushing the weight matrices away from its low rank approximation. This rank-based optimization objective guides sparse weights towards a high-rank topology. The proposed method is conducted in a gradual pruning fashion to stabilize the change of rank during training. Experimental results on various datasets and different tasks demonstrate the effectiveness of our algorithm in high sparsity. The proposed RPG outperforms the state-of-the-art performance by 1.13% top-1 accuracy on ImageNet in ResNet-50 with 98% sparsity. The codes are available at https://github.com/huawei-noah/Efficient-Computing/tree/master/Pruning/RPG and https://gitee.com/mindspore/models/tree/master/research/cv/RPG.

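Summary
Convolutional Neural Networks (CNNs) are hard to deploy on edge devices because of their high computation and storage complexity. Network pruning, a common practice for model compression, falls into two major categories, unstructured and structured pruning, of which unstructured pruning consistently performs better. At high pruning rates, however, unstructured pruning exhibits a structured pattern that limits its performance. We therefore propose Rank-based PruninG (RPG), which maintains the ranks of sparse weights in an adversarial manner. In each step, we minimize the low-rank approximation error of the weight matrices using singular value decomposition and maximize their distance by pushing the weight matrices away from their low-rank approximation. This rank-based objective guides sparse weights toward a high-rank topology. The method is applied in a gradual pruning fashion to stabilize the change of rank during training. Experiments on various datasets and tasks demonstrate the effectiveness of the algorithm at high sparsity: RPG outperforms the state of the art by 1.13% top-1 accuracy on ImageNet with ResNet-50 at 98% sparsity. The code is available at https://github.com/huawei-noah/Efficient-Computing/tree/master/Pruning/RPG and https://gitee.com/mindspore/models/tree/master/research/cv/RPG.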

Spherical Frustum Sparse Convolution Network for LiDAR Point Cloud Semantic Segmentation

paper_url: http://arxiv.org/abs/2311.17491
repo_url: None
paper_authors: Yu Zheng, Guangming Wang, Jiuming Liu, Marc Pollefeys, Hesheng Wang
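for: Proposes a new spherical frustum structure to avoid the quantized information loss of 2D image-based LiDAR point cloud semantic segmentation.
methods: A memory-efficient hash-based representation of spherical frustums, together with Spherical Frustum sparse Convolution (SFC) and Frustum Fast Point Sampling (F2PS).
results: Experiments on the SemanticKITTI and nuScenes datasets show that SFCNet outperforms 2D image-based semantic segmentation methods built on conventional spherical projection.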

Abstract
LiDAR point cloud semantic segmentation enables the robots to obtain fine-grained semantic information of the surrounding environment. Recently, many works project the point cloud onto the 2D image and adopt the 2D Convolutional Neural Networks (CNNs) or vision transformer for LiDAR point cloud semantic segmentation. However, since more than one point can be projected onto the same 2D position but only one point can be preserved, the previous 2D image-based segmentation methods suffer from inevitable quantized information loss. To avoid quantized information loss, in this paper, we propose a novel spherical frustum structure. The points projected onto the same 2D position are preserved in the spherical frustums. Moreover, we propose a memory-efficient hash-based representation of spherical frustums. Through the hash-based representation, we propose the Spherical Frustum sparse Convolution (SFC) and Frustum Fast Point Sampling (F2PS) to convolve and sample the points stored in spherical frustums respectively. Finally, we present the Spherical Frustum sparse Convolution Network (SFCNet) to adopt 2D CNNs for LiDAR point cloud semantic segmentation without quantized information loss. Extensive experiments on the SemanticKITTI and nuScenes datasets demonstrate that our SFCNet outperforms the 2D image-based semantic segmentation methods based on conventional spherical projection. The source code will be released later.
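
The information loss described in the abstract is easy to reproduce on a toy example. The sketch below projects a few points onto a tiny spherical image, once keeping only the nearest point per pixel and once keeping every point in a per-pixel bucket. The image size, field of view, and function names are assumptions, and the plain dictionary is only a stand-in for the paper's hash-based frustum representation, whose details the abstract does not give.

```python
import math
from collections import defaultdict


def spherical_pixel(point, width=8, height=4, fov_up_deg=15.0, fov_down_deg=-15.0):
    """Map a 3D point to a (row, col) pixel of a spherical range image."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("point at the sensor origin has no direction")
    yaw = math.atan2(y, x)
    pitch = math.asin(z / r)
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    u = 0.5 * (1.0 - yaw / math.pi) * width
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * height
    col = min(width - 1, max(0, int(u)))
    row = min(height - 1, max(0, int(v)))
    return row, col


def project_keep_nearest(points):
    """Conventional projection: one point per pixel, the nearest one wins."""
    image = {}
    for p in points:
        key = spherical_pixel(p)
        rng = math.dist(p, (0.0, 0.0, 0.0))
        if key not in image or rng < image[key][0]:
            image[key] = (rng, p)
    return {k: v[1] for k, v in image.items()}


def project_to_frustums(points):
    """Frustum-style projection: every point landing on a pixel is kept."""
    frustums = defaultdict(list)
    for p in points:
        frustums[spherical_pixel(p)].append(p)
    return dict(frustums)


# Two points on the same ray collide in the conventional projection.
points = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(len(project_keep_nearest(points)), "points kept by the range image")
print(sum(len(v) for v in project_to_frustums(points).values()), "points kept in frustums")
```

Keying the buckets by pixel means storage grows with the number of points rather than with the full image volume, which is the general appeal of a hash-style layout.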

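Summary
LiDAR point cloud semantic segmentation lets robots obtain fine-grained semantic information about the surrounding environment. Many recent works project the point cloud onto a 2D image and apply 2D Convolutional Neural Networks (CNNs) or vision transformers. However, more than one point can be projected onto the same 2D position while only one can be preserved, so previous 2D image-based segmentation methods suffer from inevitable quantized information loss. To avoid this loss, we propose a novel spherical frustum structure, which preserves all points projected onto the same 2D position. We further propose a memory-efficient hash-based representation of spherical frustums. On top of this representation, we propose Spherical Frustum sparse Convolution (SFC) and Frustum Fast Point Sampling (F2PS) to convolve and sample the points stored in spherical frustums, respectively. Finally, we present the Spherical Frustum sparse Convolution Network (SFCNet), which applies 2D CNNs to LiDAR point cloud semantic segmentation without quantized information loss. Extensive experiments on the SemanticKITTI and nuScenes datasets show that SFCNet outperforms 2D image-based semantic segmentation methods based on conventional spherical projection. The source code will be released later.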

Non-Visible Light Data Synthesis and Application: A Case Study for Synthetic Aperture Radar Imagery

paper_url: http://arxiv.org/abs/2311.17486
repo_url: None
paper_authors: Zichen Tian, Zhaozheng Chen, Qianru Sun
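for: Address the scarcity of SAR training samples, caused by the difficulty of capturing satellite data, by adapting large-scale pre-trained image generation models (such as Stable Diffusion and Imagen) to generate images in non-visible light domains.
methods: Proposes a 2-stage low-rank adaptation method (2LoRA): the first stage adapts the model with aerial-view regular image data, and the second stage adapts it further with SAR modality data. In the second stage, a new prototype LoRA (pLoRA) is introduced to resolve the class imbalance problem in SAR datasets.
results: Augmenting the training of SAR classification and segmentation models with the generated SAR data yields notably improved performance, especially for minor classes.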

Abstract
We explore the "hidden" ability of large-scale pre-trained image generation models, such as Stable Diffusion and Imagen, in non-visible light domains, taking Synthetic Aperture Radar (SAR) data for a case study. Due to the inherent challenges in capturing satellite data, acquiring ample SAR training samples is infeasible. For instance, for a particular category of ship in the open sea, we can collect only few-shot SAR images which are too limited to derive effective ship recognition models. If large-scale models pre-trained with regular images can be adapted to generating novel SAR images, the problem is solved. In preliminary study, we found that fine-tuning these models with few-shot SAR images is not working, as the models can not capture the two primary differences between SAR and regular images: structure and modality. To address this, we propose a 2-stage low-rank adaptation method, and we call it 2LoRA. In the first stage, the model is adapted using aerial-view regular image data (whose structure matches SAR), followed by the second stage where the base model from the first stage is further adapted using SAR modality data. Particularly in the second stage, we introduce a novel prototype LoRA (pLoRA), as an improved version of 2LoRA, to resolve the class imbalance problem in SAR datasets. For evaluation, we employ the resulting generation model to synthesize additional SAR data. This augmentation, when integrated into the training process of SAR classification as well as segmentation models, yields notably improved performance for minor classes
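
As a rough illustration of the two-stage idea, the sketch below applies two low-rank updates in sequence to a toy weight matrix. It shows only the generic LoRA parameterization, a weight plus the product of two thin matrices; the matrix values, rank, `scale` parameter, and function names are hypothetical, no training is performed, and pLoRA is not modeled.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_lora(weight, down, up, scale=1.0):
    """Return weight + scale * (up @ down).

    up is d_out x r and down is r x d_in, so the update has rank at most r.
    """
    delta = matmul(up, down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Hypothetical frozen base weight (2 x 2 identity).
base = [[1.0, 0.0], [0.0, 1.0]]

# Stage 1: a rank-1 adapter standing in for aerial-view regular image data.
stage1 = apply_lora(base, down=[[0.0, 1.0]], up=[[1.0], [0.0]])

# Stage 2: a second rank-1 adapter on top of the stage-1 model,
# standing in for adaptation on SAR modality data.
stage2 = apply_lora(stage1, down=[[2.0, 0.0]], up=[[0.0], [1.0]])

print(stage1)
print(stage2)
```

Stacking the adapters this way keeps each stage's update small and separate, so the second stage starts from the structure learned in the first.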

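Summary
We explore the "hidden" ability of large-scale pre-trained image generation models, such as Stable Diffusion and Imagen, in non-visible light domains, taking Synthetic Aperture Radar (SAR) data as a case study. Because capturing satellite data is inherently challenging, acquiring ample SAR training samples is infeasible. For a particular category of ship in the open sea, for example, only a few SAR images can be collected, too few to derive effective ship recognition models. If large-scale models pre-trained on regular images could be adapted to generate novel SAR images, the problem would be solved. In a preliminary study, we found that fine-tuning these models with few-shot SAR images does not work, because the models cannot capture the two primary differences between SAR and regular images: structure and modality. To address this, we propose a 2-stage low-rank adaptation method, 2LoRA. In the first stage, the model is adapted using aerial-view regular image data, whose structure matches SAR; in the second stage, the base model from the first stage is further adapted using SAR modality data. In the second stage in particular, we introduce a novel prototype LoRA (pLoRA) to resolve the class imbalance problem in SAR datasets. For evaluation, we use the resulting generation model to synthesize additional SAR data. Integrating this augmentation into the training of SAR classification and segmentation models yields notably improved performance for minor classes.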

CLiSA: A Hierarchical Hybrid Transformer Model using Orthogonal Cross Attention for Satellite Image Cloud Segmentation

paper_url: http://arxiv.org/abs/2311.17475
repo_url: None
paper_authors: Subhajit Paul, Ashutosh Gupta
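for: Proposes a deep learning method for cloud mask generation to improve the accuracy of cloud extraction from optical remote-sensing images.
methods: A hybrid transformer architecture that combines orthogonal self-attention with a hierarchical cross-attention model, designed under an adversarial setting with the Lovász-Softmax loss.
results: Qualitative and quantitative evaluation on multiple satellite image datasets (Landsat-8, Sentinel-2, and Cartosat-2s), with comparison against other state-of-the-art methods, shows better performance in precise cloud extraction.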

Abstract
Clouds in optical satellite images are a major concern since their presence hinders the ability to carry out accurate analysis as well as processing. Presence of clouds also affects the image tasking schedule and results in wastage of valuable storage space on ground as well as space-based systems. Due to these reasons, deriving accurate cloud masks from optical remote-sensing images is an important task. Traditional methods such as threshold-based, spatial filtering for cloud detection in satellite images suffer from lack of accuracy. In recent years, deep learning algorithms have emerged as a promising approach to solve image segmentation problems as it allows pixel-level classification and semantic-level segmentation. In this paper, we introduce a deep-learning model based on hybrid transformer architecture for effective cloud mask generation named CLiSA - Cloud segmentation via Lipschitz Stable Attention network. In this context, we propose a concept of orthogonal self-attention combined with hierarchical cross attention model, and we validate its Lipschitz stability theoretically and empirically. We design the whole setup under adversarial setting in presence of Lovász-Softmax loss. We demonstrate both qualitative and quantitative outcomes for multiple satellite image datasets including Landsat-8, Sentinel-2, and Cartosat-2s. Performing comparative study we show that our model performs favorably against other state-of-the-art methods and also provides better generalization in precise cloud extraction from satellite multi-spectral (MX) images. We also showcase different ablation studies to endorse our choices corresponding to different architectural elements and objective functions.

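Summary
Clouds in optical satellite images are a major concern because they hinder accurate analysis and processing. Their presence also affects the image tasking schedule and wastes valuable storage space on ground and space-based systems. Deriving accurate cloud masks from optical remote-sensing images is therefore an important task. Traditional methods, such as threshold-based detection and spatial filtering, lack accuracy. In recent years, deep learning has emerged as a promising approach to image segmentation because it allows pixel-level classification and semantic-level segmentation. In this paper, we introduce CLiSA (Cloud segmentation via Lipschitz Stable Attention network), a deep learning model based on a hybrid transformer architecture for effective cloud mask generation. We propose a concept of orthogonal self-attention combined with a hierarchical cross-attention model and validate its Lipschitz stability theoretically and empirically. We design the whole setup under an adversarial setting with the Lovász-Softmax loss. We report qualitative and quantitative results on multiple satellite image datasets, including Landsat-8, Sentinel-2, and Cartosat-2s. A comparative study shows that our model performs favorably against other state-of-the-art methods and generalizes better in precise cloud extraction from satellite multi-spectral (MX) images. We also present ablation studies supporting our choices of architectural elements and objective functions.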

AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents

paper_url: http://arxiv.org/abs/2311.17465
repo_url: None
paper_authors: Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang
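for: Create interactive avatar agents that can autonomously plan and animate nuanced facial movements realistically, from both visual and behavioral perspectives.
methods: Given high-level inputs about the environment and agent profile, the framework uses LLMs to produce detailed text descriptions of the avatar agent's facial motions. A task-agnostic driving engine turns these descriptions into motion token sequences, which are converted into continuous motion embeddings and consumed by a standalone neural renderer to generate the final photorealistic avatar animations.
results: Experiments on both newly compiled and existing datasets validate the effectiveness and versatility of the approach, which adapts to a variety of non-verbal avatar interactions, both monadic and dyadic.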

Abstract
In this study, our goal is to create interactive avatar agents that can autonomously plan and animate nuanced facial movements realistically, from both visual and behavioral perspectives. Given high-level inputs about the environment and agent profile, our framework harnesses LLMs to produce a series of detailed text descriptions of the avatar agents' facial motions. These descriptions are then processed by our task-agnostic driving engine into motion token sequences, which are subsequently converted into continuous motion embeddings that are further consumed by our standalone neural-based renderer to generate the final photorealistic avatar animations. These streamlined processes allow our framework to adapt to a variety of non-verbal avatar interactions, both monadic and dyadic. Our extensive study, which includes experiments on both newly compiled and existing datasets featuring two types of agents -- one capable of monadic interaction with the environment, and the other designed for dyadic conversation -- validates the effectiveness and versatility of our approach. To our knowledge, we advanced a leap step by combining LLMs and neural rendering for generalized non-verbal prediction and photo-realistic rendering of avatar agents.

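Summary
In this study, our goal is to create interactive avatar agents that can autonomously plan and realistically animate nuanced facial movements, from both visual and behavioral perspectives. Given high-level inputs about the environment and agent profile, our framework uses LLMs to produce a series of detailed text descriptions of the avatar agents' facial motions. Our task-agnostic driving engine processes these descriptions into motion token sequences, which are converted into continuous motion embeddings and consumed by our standalone neural renderer to generate the final photorealistic avatar animations. These streamlined processes let the framework adapt to a variety of non-verbal avatar interactions, both monadic and dyadic. Our extensive study, with experiments on newly compiled and existing datasets featuring two types of agents, validates the effectiveness and versatility of the approach. To our knowledge, this work takes a leap forward by combining LLMs and neural rendering for generalized non-verbal prediction and photorealistic rendering of avatar agents.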

When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for Personalized Image Generation

paper_url: http://arxiv.org/abs/2311.17461
repo_url: https://github.com/csxmli2016/w-plus-adapter
paper_authors: Xiaoming Li, Xinyu Hou, Chen Change Loy
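for: Improve identity preservation and disentanglement in text-to-image diffusion models.
methods: Uses the extended StyleGAN embedding space $\mathcal{W}_+$ to achieve better identity preservation and disentanglement, and proposes new training objectives that balance the influence of prompt and identity conditions so that the identity-irrelevant background stays unaffected during facial attribute modifications.
results: The method generates personalized text-to-image outputs that are compatible with prompt descriptions and amenable to common StyleGAN editing directions in diverse settings.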

Abstract
Text-to-image diffusion models have remarkably excelled in producing diverse, high-quality, and photo-realistic images. This advancement has spurred a growing interest in incorporating specific identities into generated content. Most current methods employ an inversion approach to embed a target visual concept into the text embedding space using a single reference image. However, the newly synthesized faces either closely resemble the reference image in terms of facial attributes, such as expression, or exhibit a reduced capacity for identity preservation. Text descriptions intended to guide the facial attributes of the synthesized face may fall short, owing to the intricate entanglement of identity information with identity-irrelevant facial attributes derived from the reference image. To address these issues, we present the novel use of the extended StyleGAN embedding space $\mathcal{W}_+$, to achieve enhanced identity preservation and disentanglement for diffusion models. By aligning this semantically meaningful human face latent space with text-to-image diffusion models, we succeed in maintaining high fidelity in identity preservation, coupled with the capacity for semantic editing. Additionally, we propose new training objectives to balance the influences of both prompt and identity conditions, ensuring that the identity-irrelevant background remains unaffected during facial attribute modifications. Extensive experiments reveal that our method adeptly generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions in diverse settings. Our source code will be available at https://github.com/csxmli2016/w-plus-adapter.

Summary
Text-to-image diffusion models excel at producing diverse, high-quality, photo-realistic images, which has spurred interest in incorporating specific identities into generated content. Most current methods embed a target visual concept into the text embedding space by inverting a single reference image, but the synthesized faces either closely resemble the reference image in facial attributes such as expression or preserve identity poorly. To address these issues, we propose the novel use of the extended StyleGAN embedding space $\mathcal{W}_+$. By aligning this semantically meaningful human face latent space with text-to-image diffusion models, we achieve enhanced identity preservation and disentanglement. Our method maintains high fidelity in identity preservation while also allowing for semantic editing. To balance the influences of both prompt and identity conditions, we propose new training objectives that ensure the identity-irrelevant background remains unaffected during facial attribute modifications. Our extensive experiments demonstrate that our method effectively generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions in diverse settings. Our source code will be available at https://github.com/csxmli2016/w-plus-adapter.

W-HMR: Human Mesh Recovery in World Space with Weak-supervised Camera Calibration and Orientation Correction

paper_url: http://arxiv.org/abs/2311.17460
repo_url: https://github.com/yw0208/W-HMR
paper_authors: Wei Yao, Hongwen Zhang, Yunlian Sun, Jinhui Tang
for: Address a weakness of existing methods that reconstruct 3D human bodies from monocular images: they simplify the task by minimizing the influence of the camera, which leads to inaccurate reconstruction in world space.
methods: Proposes W-HMR, which decouples global body recovery into camera calibration, local body recovery, and global body orientation correction. It introduces the first weak-supervised camera calibration method for body distortion, removing the dependence on focal length labels and achieving finer mesh-image alignment, and a new orientation correction module that keeps the reconstructed body normal in world space.
results: W-HMR achieves high-quality reconstruction in dual coordinate systems, particularly in challenging scenes.

Abstract
For a long time, in the field of reconstructing 3D human bodies from monocular images, most methods opted to simplify the task by minimizing the influence of the camera. Using a coarse focal length setting results in the reconstructed bodies not aligning well with distorted images. Ignoring camera rotation leads to an unrealistic reconstructed body pose in world space. Consequently, existing methods' application scenarios are confined to controlled environments. And they struggle to achieve accurate and reasonable reconstruction in world space when confronted with complex and diverse in-the-wild images. To address the above issues, we propose W-HMR, which decouples global body recovery into camera calibration, local body recovery and global body orientation correction. We design the first weak-supervised camera calibration method for body distortion, eliminating dependence on focal length labels and achieving finer mesh-image alignment. We propose a novel orientation correction module to allow the reconstructed human body to remain normal in world space. Decoupling body orientation and body pose enables our model to consider the accuracy in camera coordinate and the reasonableness in world coordinate simultaneously, expanding the range of applications. As a result, W-HMR achieves high-quality reconstruction in dual coordinate systems, particularly in challenging scenes. Codes will be released on https://yw0208.github.io/ after publication.

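Summary
Most methods for recovering 3D human bodies from monocular images simplify the task by downplaying the camera. A coarse focal length setting leaves the reconstructed body misaligned with distorted images, and ignoring camera rotation yields unrealistic poses in world space, so existing methods are largely confined to controlled environments and struggle on complex, diverse in-the-wild images. W-HMR decouples global body recovery into camera calibration, local body recovery, and global body orientation correction. Its weakly supervised camera calibration removes the dependence on focal length labels and achieves finer mesh-image alignment, while its orientation correction module keeps the reconstructed body plausible in world space. Decoupling body orientation from body pose lets the model account for accuracy in camera coordinates and plausibility in world coordinates at the same time, which widens the range of applications. W-HMR achieves high-quality reconstruction in both coordinate systems, particularly in challenging scenes. The code is to be released at https://yw0208.github.io/ after publication.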

DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Diffusion Model

paper_url: http://arxiv.org/abs/2311.17456
repo_url: None
paper_authors: Jiuming Liu, Guangming Wang, Weicai Ye, Chaokang Jiang, Jinru Han, Zhe Liu, Guofeng Zhang, Dalong Du, Hesheng Wang
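for: Improving the accuracy and robustness of scene flow estimation.
methods: The paper proposes DifFlow3D, an uncertainty-aware scene flow estimation network built on a diffusion probabilistic model. Iterative diffusion-based refinement improves correlation robustness and resilience to challenging cases, and flow-related features are used as conditions to restrain generation diversity.
results: DifFlow3D achieves state-of-the-art performance, reducing EPE3D by 6.7% on FlyingThings3D and 19.1% on KITTI 2015, and reaches an unprecedented millimeter-level accuracy (0.0089 m in EPE3D) on the KITTI dataset.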

Abstract
Scene flow estimation, which aims to predict per-point 3D displacements of dynamic scenes, is a fundamental task in the computer vision field. However, previous works commonly suffer from unreliable correlation caused by locally constrained searching ranges, and struggle with accumulated inaccuracy arising from the coarse-to-fine structure. To alleviate these problems, we propose a novel uncertainty-aware scene flow estimation network (DifFlow3D) with the diffusion probabilistic model. Iterative diffusion-based refinement is designed to enhance the correlation robustness and resilience to challenging cases, e.g., dynamics, noisy inputs, repetitive patterns, etc. To restrain the generation diversity, three key flow-related features are leveraged as conditions in our diffusion model. Furthermore, we also develop an uncertainty estimation module within diffusion to evaluate the reliability of estimated scene flow. Our DifFlow3D achieves state-of-the-art performance, with 6.7\% and 19.1\% EPE3D reduction respectively on FlyingThings3D and KITTI 2015 datasets. Notably, our method achieves an unprecedented millimeter-level accuracy (0.0089m in EPE3D) on the KITTI dataset. Additionally, our diffusion-based refinement paradigm can be readily integrated as a plug-and-play module into existing scene flow networks, significantly increasing their estimation accuracy. Codes will be released later.
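As a point of reference for the EPE3D figures quoted above, the metric is commonly defined as the mean Euclidean distance between predicted and ground-truth per-point 3D flow vectors. The following is a minimal sketch under that assumption; the function name `epe3d` and the toy inputs are hypothetical and do not come from the paper.

```python
import math

def epe3d(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth 3D flow vectors."""
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, g in zip(pred_flow, gt_flow):
        total += math.dist(p, g)  # per-point end-point error
    return total / len(pred_flow)

# Toy example: the first point is predicted exactly, the second is off by 3 units.
pred = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(epe3d(pred, gt))  # 1.5
```

With flows expressed in meters, a value of 0.0089 corresponds to an average error just under 9 mm.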

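Summary
Scene flow estimation, which predicts the 3D displacement of every point in a dynamic scene, is a fundamental computer vision task. Earlier work suffers from unreliable correlation caused by locally constrained search ranges and from error accumulated through the coarse-to-fine structure. DifFlow3D is an uncertainty-aware scene flow estimation network built on a diffusion probabilistic model. Its iterative diffusion-based refinement improves correlation robustness and resilience to hard cases such as dynamics, noisy inputs, and repetitive patterns. To restrain generation diversity, three key flow-related features serve as conditions in the diffusion model, and an uncertainty estimation module inside the diffusion process assesses the reliability of the estimated flow. DifFlow3D reaches state-of-the-art performance, reducing EPE3D by 6.7% on FlyingThings3D and 19.1% on KITTI 2015, and attains millimeter-level accuracy (0.0089 m in EPE3D) on KITTI. The refinement paradigm can also be plugged into existing scene flow networks to raise their accuracy. Code is to be released later.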

Continual Learning for Image Segmentation with Dynamic Query

paper_url: http://arxiv.org/abs/2311.17450
repo_url: https://github.com/weijiawu/cisdq
paper_authors: Weijia Wu, Yuzhong Zhao, Zhuang Li, Lianlei Shan, Hong Zhou, Mike Zheng Shou
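for: This paper addresses continual learning for image segmentation, where new classes must be added continually and catastrophic forgetting and background shift degrade performance.
methods: The paper proposes a simple yet effective Continual Image Segmentation method with incremental Dynamic Query (CISDQ), which decouples the representation learning of old and new knowledge using lightweight query embeddings. Its main contributions are: 1) dynamic queries with an adaptive background class, which exploit past knowledge and learn future classes naturally; 2) a class/instance-aware Query Guided Knowledge Distillation strategy to counter catastrophic forgetting; 3) an extension of continual learning to instance segmentation, with instance-wise labeling and supervision.
results: Experiments on three datasets (Cityscapes, PASCAL VOC, ADE) for two tasks (continual semantic segmentation and instance segmentation) show state-of-the-art performance, with mIoU improvements of 4.4% and 2.9% in the ADE 100-10 (6 steps) and ADE 100-5 (11 steps) settings.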

Abstract
Image segmentation based on continual learning exhibits a critical drop of performance, mainly due to catastrophic forgetting and background shift, as they are required to incorporate new classes continually. In this paper, we propose a simple, yet effective Continual Image Segmentation method with incremental Dynamic Query (CISDQ), which decouples the representation learning of both old and new knowledge with lightweight query embedding. CISDQ mainly includes three contributions: 1) We define dynamic queries with adaptive background class to exploit past knowledge and learn future classes naturally. 2) CISDQ proposes a class/instance-aware Query Guided Knowledge Distillation strategy to overcome catastrophic forgetting by capturing the inter-class diversity and intra-class identity. 3) Apart from semantic segmentation, CISDQ introduces continual learning for instance segmentation, in which instance-wise labeling and supervision are considered. Extensive experiments on three datasets for two tasks (i.e., continual semantic and instance segmentation) are conducted to demonstrate that CISDQ achieves state-of-the-art performance, specifically, obtaining 4.4% and 2.9% mIoU improvements for the ADE 100-10 (6 steps) setting and ADE 100-5 (11 steps) setting.
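The mIoU gains quoted above refer to mean intersection-over-union across classes. A minimal sketch of that metric on flat label lists follows, assuming that classes absent from both prediction and ground truth are skipped (benchmarks differ on this detail). The name `mean_iou` and the toy labels are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that never appear (an assumption of this sketch)
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class present in prediction or target")
    return sum(ious) / len(ious)

# Six pixels, three classes. Per-class IoU is 1/3, 2/3 and 1/2.
pred_labels = [0, 0, 1, 1, 2, 2]
true_labels = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred_labels, true_labels, 3))
```

In a continual setting the same computation is typically reported separately for old classes, new classes, and all classes.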


Weakly-semi-supervised object detection in remotely sensed imagery

paper_url: http://arxiv.org/abs/2311.17449
repo_url: None
paper_authors: Ji Hun Wang, Jeremy Irvin, Beri Kohen Behar, Ha Tran, Raghav Samavedam, Quentin Hsu, Andrew Y. Ng
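for: This study develops weakly-semi-supervised object detection (WSSOD) models for detecting objects in remotely sensed imagery, so that models can be built for new tasks and geographies.
methods: The models are trained on a large number of point-labeled images together with a small fraction of bounding-box-labeled images.
results: WSSOD models substantially outperform fully supervised models trained with the same amount of bounding-box-labeled images, and models trained with 2-10x fewer bounding box labels can match or outperform fully supervised models trained on the full labeled set.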

Abstract
Deep learning for detecting objects in remotely sensed imagery can enable new technologies for important applications including mitigating climate change. However, these models often require large datasets labeled with bounding box annotations which are expensive to curate, prohibiting the development of models for new tasks and geographies. To address this challenge, we develop weakly-semi-supervised object detection (WSSOD) models on remotely sensed imagery which can leverage a small amount of bounding boxes together with a large amount of point labels that are easy to acquire at scale in geospatial data. We train WSSOD models which use large amounts of point-labeled images with varying fractions of bounding box labeled images in FAIR1M and a wind turbine detection dataset, and demonstrate that they substantially outperform fully supervised models trained with the same amount of bounding box labeled images on both datasets. Furthermore, we find that the WSSOD models trained with 2-10x fewer bounding box labeled images can perform similarly to or outperform fully supervised models trained on the full set of bounding-box labeled images. We believe that the approach can be extended to other remote sensing tasks to reduce reliance on bounding box labels and increase development of models for impactful applications.

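Summary
Deep learning for detecting objects in remotely sensed imagery can enable new technologies for important applications, including mitigating climate change. These models usually need large datasets with bounding box annotations, which are expensive to curate. The authors develop weakly-semi-supervised object detection (WSSOD) models that train on a small number of bounding boxes together with a large number of point labels. WSSOD models trained on FAIR1M and a wind turbine detection dataset substantially outperform fully supervised models given the same amount of bounding-box-labeled images. Models trained with 2-10x fewer bounding-box-labeled images can match or exceed fully supervised models trained on the full set. The authors expect the approach to extend to other remote sensing tasks, reducing reliance on bounding box labels and supporting model development for impactful applications.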

Group-wise Sparse and Explainable Adversarial Attacks

paper_url: http://arxiv.org/abs/2311.17434
repo_url: https://github.com/wagnermoritz/gse
paper_authors: Shpresim Sadiku, Moritz Wagner, Sebastian Pokutta
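for: This paper develops a reliable and effective group-wise sparse adversarial attack on deep neural networks (DNNs).
methods: The algorithm optimizes a quasinorm adversarial loss, applying the 1/2-quasinorm proximal operator for some iterations and then switching to projected Nesterov's accelerated gradient descent with 2-norm regularization on the perturbation magnitudes.
results: Compared to state-of-the-art methods, the attack achieves higher group-wise sparsity (for example, an increase of 48.12% on CIFAR-10 and 40.78% on ImageNet in the average case of a targeted attack), with lower perturbation magnitudes, faster computation, and a 100% attack success rate.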

Abstract
Sparse adversarial attacks fool deep neural networks (DNNs) through minimal pixel perturbations, typically regularized by the $\ell_0$ norm. Recent efforts have replaced this norm with a structural sparsity regularizer, such as the nuclear group norm, to craft group-wise sparse adversarial attacks. The resulting perturbations are thus explainable and hold significant practical relevance, shedding light on an even greater vulnerability of DNNs than previously anticipated. However, crafting such attacks poses an optimization challenge, as it involves computing norms for groups of pixels within a non-convex objective. In this paper, we tackle this challenge by presenting an algorithm that simultaneously generates group-wise sparse attacks within semantically meaningful areas of an image. In each iteration, the core operation of our algorithm involves the optimization of a quasinorm adversarial loss. This optimization is achieved by employing the $1/2$-quasinorm proximal operator for some iterations, a method tailored for nonconvex programming. Subsequently, the algorithm transitions to a projected Nesterov's accelerated gradient descent with $2$-norm regularization applied to perturbation magnitudes. We rigorously evaluate the efficacy of our novel attack in both targeted and non-targeted attack scenarios, on CIFAR-10 and ImageNet datasets. When compared to state-of-the-art methods, our attack consistently results in a remarkable increase in group-wise sparsity, e.g., an increase of $48.12\%$ on CIFAR-10 and $40.78\%$ on ImageNet (average case, targeted attack), all while maintaining lower perturbation magnitudes. Notably, this performance is complemented by a significantly faster computation time and a $100\%$ attack success rate.
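To illustrate the difference between pixel-wise and group-wise sparsity discussed above, here is a minimal sketch that counts nonzero pixels (the $\ell_0$ norm) and the number of square patches touched by a perturbation. Treating groups as non-overlapping `patch` x `patch` blocks is an assumption made for illustration; the abstract does not specify the paper's grouping, and both function names are hypothetical.

```python
def l0_norm(perturbation):
    """Number of nonzero entries in a 2D perturbation."""
    return sum(1 for row in perturbation for v in row if v != 0.0)

def active_groups(perturbation, patch):
    """Number of non-overlapping patch x patch blocks containing a nonzero entry."""
    height, width = len(perturbation), len(perturbation[0])
    count = 0
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            if any(perturbation[i][j] != 0.0 for i in rows for j in cols):
                count += 1
    return count

# Toy 4x4 perturbation: three changed pixels cluster in one block, one sits alone.
delta = [
    [0.1, -0.2, 0.0, 0.0],
    [0.0, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.05],
]
print(l0_norm(delta))           # 4
print(active_groups(delta, 2))  # 2
```

A group-wise sparse attack aims to keep the second count small, so that the changed pixels concentrate in a few regions rather than scattering across the image.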

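Summary
Sparse adversarial attacks fool deep neural networks (DNNs) with minimal pixel perturbations, typically regularized by the $\ell_0$ norm. Recent efforts replace this norm with a structural sparsity regularizer, such as the nuclear group norm, to craft group-wise sparse attacks. The resulting perturbations are explainable and practically relevant, and they expose a greater vulnerability of DNNs than previously anticipated. Crafting such attacks is an optimization challenge, because it requires computing norms over groups of pixels within a non-convex objective. The paper presents an algorithm that generates group-wise sparse attacks within semantically meaningful areas of an image. Each iteration optimizes a quasinorm adversarial loss, first with the $1/2$-quasinorm proximal operator, a method suited to nonconvex programming, and then with projected Nesterov's accelerated gradient descent with $2$-norm regularization on the perturbation magnitudes. Evaluated on CIFAR-10 and ImageNet in targeted and non-targeted settings, the attack markedly increases group-wise sparsity, for example by $48.12\%$ on CIFAR-10 and $40.78\%$ on ImageNet (average case, targeted attack), while keeping perturbation magnitudes lower. It also runs significantly faster and reaches a $100\%$ attack success rate.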

SpeechAct: Towards Generating Whole-body Motion from Speech

paper_url: http://arxiv.org/abs/2311.17425
repo_url: None
paper_authors: Jinsong Zhang, Minjie Zhu, Yuxiang Zhang, Yebin Liu, Kun Li
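for: This work generates whole-body motion from speech. Despite great successes, prior methods still struggle to produce reasonable and diverse whole-body motion, because they rely on suboptimal representations and lack strategies for generating diverse results.
methods: The paper proposes a hybrid point representation for accurate and continuous motion generation, for example avoiding foot skating. Facial motion, which is closely tied to the audio signal, is generated deterministically with an encoder-decoder architecture. For the body and hands, which are more weakly tied to the audio, the method aims for diverse yet reasonable motion, using a VQ-VAE motion codebook and a contrastive motion learning method.
results: Experiments validate the performance and correctness of the model, which generates high-quality and diverse whole-body motion that follows the speech signal.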

Abstract
This paper addresses the problem of generating whole-body motion from speech. Despite great successes, prior methods still struggle to produce reasonable and diverse whole-body motions from speech. This is due to their reliance on suboptimal representations and a lack of strategies for generating diverse results. To address these challenges, we present a novel hybrid point representation to achieve accurate and continuous motion generation, e.g., avoiding foot skating, and this representation can be transformed into an easy-to-use representation, i.e., SMPL-X body mesh, for many applications. To generate whole-body motion from speech, for facial motion, closely tied to the audio signal, we introduce an encoder-decoder architecture to achieve deterministic outcomes. However, for the body and hands, which have weaker connections to the audio signal, we aim to generate diverse yet reasonable motions. To boost diversity in motion generation, we propose a contrastive motion learning method to encourage the model to produce more distinctive representations. Specifically, we design a robust VQ-VAE to learn a quantized motion codebook using our hybrid representation. Then, we regress the motion representation from the audio signal by a translation model employing our contrastive motion learning method. Experimental results validate the superior performance and the correctness of our model. The project page is available for research purposes at http://cic.tju.edu.cn/faculty/likun/projects/SpeechAct.
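The quantized motion codebook mentioned above relies on the standard VQ-VAE lookup step, in which a continuous latent vector is replaced by its nearest codebook entry. Below is a minimal sketch of that step only, assuming squared Euclidean distance. The function name `quantize` and the toy codebook are hypothetical; the paper's encoder, decoder, and hybrid point representation are not reproduced here.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector` in squared L2 distance."""
    def squared_distance(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))

    best_index = min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))
    return best_index, codebook[best_index]

# Toy 2D codebook with three entries.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
index, code = quantize((0.9, 0.2), codebook)
print(index, code)  # 1 (1.0, 0.0)
```

Once motion is expressed as a sequence of such indices, a separate model can predict the indices from audio, which is one common way to use a learned codebook.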

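Summary
This paper addresses generating whole-body motion from speech. Prior methods, despite their successes, still struggle to produce reasonable and diverse whole-body motion because they rely on suboptimal representations and lack strategies for diverse generation. The authors present a hybrid point representation that supports accurate and continuous motion generation, for example avoiding foot skating, and that can be converted into an easy-to-use SMPL-X body mesh for many applications. Facial motion, which is closely tied to the audio signal, is produced deterministically by an encoder-decoder architecture. The body and hands are more weakly tied to the audio, so the goal there is diverse yet reasonable motion. To boost diversity, a contrastive motion learning method encourages more distinctive representations: a robust VQ-VAE learns a quantized motion codebook from the hybrid representation, and a translation model trained with contrastive motion learning regresses the motion representation from the audio signal. Experimental results validate the performance and correctness of the model. The project page is available for research purposes at http://cic.tju.edu.cn/faculty/likun/projects/SpeechAct.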

Talking Head(?) Anime from a Single Image 4: Improved Model and Its Distillation

paper_url: http://arxiv.org/abs/2311.17409
repo_url: None
paper_authors: Pramook Khungurn
for: This paper aims to create a real-time controllable character model from a single anime character image.
methods: The paper uses U-Nets with attention to improve the image quality of the character model, and distills the system into a small network for real-time applications.
results: The proposed method achieves better image quality than the baseline, but with a slower generation time. The distilled network can generate 512x512 animation frames in real time while keeping image quality close to that of the full system.

Abstract
We study the problem of creating a character model that can be controlled in real time from a single image of an anime character. A solution to this problem would greatly reduce the cost of creating avatars, computer games, and other interactive applications. Talking Head Anime 3 (THA3) is an open source project that attempts to directly address the problem. It takes as input (1) an image of an anime character's upper body and (2) a 45-dimensional pose vector and outputs a new image of the same character taking the specified pose. The range of possible movements is expressive enough for personal avatars and certain types of game characters. However, the system is too slow to generate animations in real time on common PCs, and its image quality can be improved. In this paper, we improve THA3 in two ways. First, we propose new architectures for constituent networks that rotate the character's head and body based on U-Nets with attention that are widely used in modern generative models. The new architectures consistently yield better image quality than the THA3 baseline. Nevertheless, they also make the whole system much slower: it takes up to 150 milliseconds to generate a frame. Second, we propose a technique to distill the system into a small network (less than 2 MB) that can generate 512x512 animation frames in real time (under 30 FPS) using consumer gaming GPUs while keeping the image quality close to that of the full system. This improvement makes the whole system practical for real-time applications.
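The timing figures above can be put side by side with a short calculation. This sketch converts between per-frame latency and frame rate, taking 30 FPS purely as an illustrative real-time target; the function names are hypothetical.

```python
def frames_per_second(ms_per_frame):
    """Frame rate achieved when each frame takes `ms_per_frame` milliseconds."""
    return 1000.0 / ms_per_frame

def ms_budget(target_fps):
    """Per-frame time budget in milliseconds for a target frame rate."""
    return 1000.0 / target_fps

# Full system: up to 150 ms per frame.
print(round(frames_per_second(150.0), 2))  # 6.67
# Illustrative real-time target of 30 FPS.
print(round(ms_budget(30.0), 2))           # 33.33
```

At 150 ms per frame the full system would need to be roughly 4.5 times faster to fit a 33 ms budget, which is the gap the distilled network is meant to close.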

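Summary
The paper studies how to build a character model that can be controlled in real time from a single image of an anime character, which would greatly reduce the cost of creating avatars, computer games, and other interactive applications. Talking Head Anime 3 (THA3) is an open source project that takes an image of a character's upper body and a 45-dimensional pose vector and outputs a new image of the same character in the specified pose. However, it is too slow for real-time animation on common PCs, and its image quality can be improved. The paper improves THA3 in two ways. First, it proposes new architectures for the constituent networks, based on the U-Nets with attention that are widely used in modern generative models. These consistently improve image quality but make the whole system much slower, taking up to 150 milliseconds per frame. Second, it proposes a technique to distill the system into a small network (less than 2 MB) that generates 512x512 animation frames in real time on consumer gaming GPUs while keeping image quality close to that of the full system. This makes the whole system practical for real-time applications.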

Dynamic Dense Graph Convolutional Network for Skeleton-based Human Motion Prediction

paper_url: http://arxiv.org/abs/2311.17408
repo_url: None
paper_authors: Xinshun Wang, Wanying Zhang, Can Wang, Yuan Gao, Mengyuan Liu
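for: This paper addresses the bottlenecks of GCNs in skeleton-based human motion prediction by proposing a Dynamic Dense Graph Convolutional Network (DD-GCN).
methods: The paper builds a dense graph with 4D adjacency modeling and proposes a dynamic message passing mechanism that learns from data to generate sample-specific messages.
results: On human motion prediction, DD-GCN significantly outperforms state-of-the-art GCN-based methods, especially under the long-term protocol and the proposed extremely long-term protocol.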

Abstract
Graph Convolutional Networks (GCNs), which typically follow a neural message passing framework to model dependencies among skeletal joints, have achieved high success in the skeleton-based human motion prediction task. Nevertheless, how to construct a graph from a skeleton sequence and how to perform message passing on the graph are still open problems, which severely affect the performance of GCNs. To solve both problems, this paper presents a Dynamic Dense Graph Convolutional Network (DD-GCN), which constructs a dense graph and implements an integrated dynamic message passing. More specifically, we construct a dense graph with 4D adjacency modeling as a comprehensive representation of motion sequence at different levels of abstraction. Based on the dense graph, we propose a dynamic message passing framework that learns dynamically from data to generate distinctive messages reflecting sample-specific relevance among nodes in the graph. Extensive experiments on the benchmark Human 3.6M and CMU Mocap datasets verify the effectiveness of our DD-GCN, which clearly outperforms state-of-the-art GCN-based methods, especially when using long-term and our proposed extremely long-term protocol.
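For readers unfamiliar with the message passing that the abstract builds on, here is a minimal sketch of one generic graph convolution layer over skeletal joints, computing ReLU(A_norm · H · W) with a fixed, row-normalized adjacency. This is a baseline illustration only: DD-GCN's dense 4D adjacency and data-dependent messages are not reproduced, and all names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def row_normalize(adj):
    """Add self-loops, then divide each row by its sum (mean aggregation)."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]

def gcn_layer(adj, features, weights):
    """One graph convolution: aggregate neighbor features, transform, apply ReLU."""
    aggregated = matmul(row_normalize(adj), features)
    transformed = matmul(aggregated, weights)
    return [[max(0.0, v) for v in row] for row in transformed]

# Three joints connected in a chain 0-1-2, one feature per joint, identity weight.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0], [2.0], [3.0]]
weights = [[1.0]]
print(gcn_layer(adj, features, weights))  # approximately [[1.5], [2.0], [2.5]]
```

Here every joint receives the same kind of message from a fixed neighborhood; the abstract's contribution is to make both the graph (dense) and the messages (dynamic, sample-specific) richer than this.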

Summary
Graph Convolutional Networks (GCNs) typically follow a neural message passing framework to model dependencies among skeletal joints and have been highly successful in skeleton-based human motion prediction. How to construct a graph from a skeleton sequence and how to pass messages on it remain open problems that severely affect GCN performance. To address both, the paper presents a Dynamic Dense Graph Convolutional Network (DD-GCN), which constructs a dense graph and implements an integrated dynamic message passing. The dense graph uses 4D adjacency modeling to represent the motion sequence at different levels of abstraction. On top of it, a dynamic message passing framework learns from data to generate distinctive messages that reflect sample-specific relevance among nodes. Experiments on the Human 3.6M and CMU Mocap benchmarks show that DD-GCN significantly outperforms state-of-the-art GCN-based methods, especially under the long-term protocol and the proposed extremely long-term protocol.

Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset

paper_url: http://arxiv.org/abs/2311.17396
repo_url: None
paper_authors: Yujin Jeon, Eunsue Choi, Youngchan Kim, Yunseong Moon, Khalid Omer, Felix Heide, Seung-Hwan Baek
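for: This work introduces two new spectro-polarimetric datasets (trichromatic Stokes images and hyperspectral Stokes images) to address the shortcomings of existing spectro-polarimetric datasets.
methods: Using the new datasets, the authors analyze spectro-polarimetric image statistics.
results: The work delivers the two datasets, an analysis of them, and efficient representations of the high-dimensional data.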

Abstract
Image datasets are essential not only in validating existing methods in computer vision but also in developing new methods. Most existing image datasets focus on trichromatic intensity images to mimic human vision. However, polarization and spectrum, the wave properties of light that animals in harsh environments and with limited brain capacity often rely on, remain underrepresented in existing datasets. Although spectro-polarimetric datasets exist, these datasets have insufficient object diversity, limited illumination conditions, linear-only polarization data, and inadequate image count. Here, we introduce two spectro-polarimetric datasets: trichromatic Stokes images and hyperspectral Stokes images. These novel datasets encompass both linear and circular polarization; they introduce multiple spectral channels; and they feature a broad selection of real-world scenes. With our dataset in hand, we analyze the spectro-polarimetric image statistics, develop efficient representations of such high-dimensional data, and evaluate spectral dependency of shape-from-polarization methods. As such, the proposed dataset promises a foundation for data-driven spectro-polarimetric imaging and vision research. Dataset and code will be publicly available.
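Each pixel of a Stokes image stores a Stokes vector (s0, s1, s2, s3), from which the usual polarization descriptors follow. The sketch below uses the standard textbook definitions of degree of linear polarization, degree of circular polarization, total degree of polarization, and angle of linear polarization. The function name and the sample values are hypothetical and are not drawn from the dataset.

```python
import math

def polarization_features(s0, s1, s2, s3):
    """Standard polarization descriptors computed from one Stokes vector."""
    if s0 <= 0:
        raise ValueError("s0 (total intensity) must be positive")
    return {
        "dolp": math.hypot(s1, s2) / s0,                     # degree of linear polarization
        "docp": abs(s3) / s0,                                # degree of circular polarization
        "dop": math.sqrt(s1 * s1 + s2 * s2 + s3 * s3) / s0,  # total degree of polarization
        "aolp": 0.5 * math.atan2(s2, s1),                    # angle of linear polarization (radians)
    }

# Partially linearly polarized light with no circular component.
features_px = polarization_features(1.0, 0.3, 0.4, 0.0)
print(features_px["dolp"], features_px["docp"])
```

Datasets with linear-only polarization cannot provide s3, so the circular descriptor is unavailable for them; this is one of the gaps the abstract says the new datasets fill.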

Summary
Existing spectro-polarimetric datasets have insufficient object diversity, limited illumination conditions, linear-only polarization data, and too few images. The authors introduce two spectro-polarimetric datasets: trichromatic Stokes images and hyperspectral Stokes images. These datasets encompass both linear and circular polarization, introduce multiple spectral channels, and feature a broad selection of real-world scenes. With the datasets, the authors analyze spectro-polarimetric image statistics, develop efficient representations of such high-dimensional data, and evaluate the spectral dependency of shape-from-polarization methods. The datasets are intended as a foundation for data-driven spectro-polarimetric imaging and vision research. The dataset and code will be publicly available.

360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries

paper_url: http://arxiv.org/abs/2311.17389
repo_url: None
paper_authors: Huajian Huang, Changkun Liu, Yipeng Zhu, Hui Cheng, Tristan Braud, Sai-Kit Yeung
for: This paper introduces a new benchmark dataset, 360Loc, which is composed of 360$^\circ$ images with ground truth poses for visual localization.
methods: The paper presents a practical implementation of 360$^\circ$ mapping that combines 360$^\circ$ images with lidar data to generate ground truth 6DoF poses.
results: The paper demonstrates that omnidirectional visual localization is more robust in challenging large-scale scenes with symmetries and repetitive structures.

Abstract
Portable 360$^\circ$ cameras are becoming a cheap and efficient tool to establish large visual databases. By capturing omnidirectional views of a scene, these cameras could expedite building environment models that are essential for visual localization. However, such an advantage is often overlooked due to the lack of valuable datasets. This paper introduces a new benchmark dataset, 360Loc, composed of 360$^\circ$ images with ground truth poses for visual localization. We present a practical implementation of 360$^\circ$ mapping combining 360$^\circ$ images with lidar data to generate the ground truth 6DoF poses. 360Loc is the first dataset and benchmark that explores the challenge of cross-device visual positioning, involving 360$^\circ$ reference frames, and query frames from pinhole, ultra-wide FoV fisheye, and 360$^\circ$ cameras. We propose a virtual camera approach to generate lower-FoV query frames from 360$^\circ$ images, which ensures a fair comparison of performance among different query types in visual localization tasks. We also extend this virtual camera approach to feature matching-based and pose regression-based methods to alleviate the performance loss caused by the cross-device domain gap, and evaluate its effectiveness against state-of-the-art baselines. We demonstrate that omnidirectional visual localization is more robust in challenging large-scale scenes with symmetries and repetitive structures. These results provide new insights into 360-camera mapping and omnidirectional visual localization with cross-device queries.
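The virtual camera idea above amounts to resampling a 360$^\circ$ panorama through a lower-FoV pinhole model. The sketch below maps one pinhole pixel to equirectangular panorama coordinates, assuming a forward-looking virtual camera with no rotation, longitude 0 at the panorama center, and an image y axis that points down. These conventions and the function name are assumptions for illustration, not the paper's implementation.

```python
import math

def pinhole_to_equirect(u, v, width, height, hfov_deg, pano_width, pano_height):
    """Map pixel (u, v) of a virtual pinhole camera to equirectangular coordinates."""
    # Focal length in pixels from the horizontal field of view.
    focal = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    # Viewing ray in camera coordinates: x right, y down, z forward.
    x = u - width / 2.0
    y = v - height / 2.0
    z = focal
    lon = math.atan2(x, z)                  # longitude in [-pi, pi]
    lat = math.atan2(-y, math.hypot(x, z))  # latitude in [-pi/2, pi/2], up is positive
    px = (lon / (2.0 * math.pi) + 0.5) * pano_width
    py = (0.5 - lat / math.pi) * pano_height
    return px, py

# Center of a 640x480, 90-degree virtual camera maps to the panorama center.
print(pinhole_to_equirect(320, 240, 640, 480, 90.0, 2048, 1024))  # (1024.0, 512.0)
```

Evaluating this mapping for every pixel and sampling the panorama at the returned coordinates yields the virtual query frame. Rotating the viewing ray before the longitude and latitude step would point the virtual camera in other directions.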

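Summary
Portable 360$^\circ$ cameras are becoming a cheap and efficient tool for building large visual databases. By capturing omnidirectional views, they can speed up the construction of the environment models that visual localization depends on, but this advantage is often overlooked for lack of suitable datasets. The paper introduces 360Loc, a benchmark dataset of 360$^\circ$ images with ground truth poses for visual localization, along with a practical 360$^\circ$ mapping implementation that combines 360$^\circ$ images with lidar data to generate ground truth 6DoF poses. 360Loc is the first dataset and benchmark to explore cross-device visual positioning, with 360$^\circ$ reference frames and query frames from pinhole, ultra-wide FoV fisheye, and 360$^\circ$ cameras. A virtual camera approach generates lower-FoV query frames from 360$^\circ$ images so that different query types can be compared fairly. The same approach is extended to feature-matching-based and pose-regression-based methods to reduce the performance loss caused by the cross-device domain gap, and it is evaluated against state-of-the-art baselines. The results show that omnidirectional visual localization is more robust in challenging large-scale scenes with symmetries and repetitive structures, and they offer new insights into 360-camera mapping and omnidirectional visual localization with cross-device queries.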

Generative Hierarchical Temporal Transformer for Hand Action Recognition and Motion Prediction

paper_url: http://arxiv.org/abs/2311.17366
repo_url: None
paper_authors: Yilin Wen, Hao Pan, Takehiko Ohkawa, Lei Yang, Jia Pan, Yoichi Sato, Taku Komura, Wenping Wang
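for: This paper tackles hand action recognition and 3D future hand motion prediction at the same time. Previous work focuses on either recognition or prediction, whereas the proposed framework captures both, which supports more realistic motion prediction.
methods: The paper proposes a generative Transformer VAE framework made of two cascaded blocks: a lower pose block that models short-span poses and an upper action block that models long-span actions. The blocks are connected by a mid-level feature that represents sub-second series of hand poses.
results: Trained across multiple datasets, the framework shows that jointly modeling recognition and prediction improves over solving them separately. It also represents the semantic dependency and the different temporal granularity of hand pose and action faithfully.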

Abstract
We present a novel framework that concurrently tackles hand action recognition and 3D future hand motion prediction. While previous works focus on either recognition or prediction, we propose a generative Transformer VAE architecture to jointly capture both aspects, facilitating realistic motion prediction by leveraging the short-term hand motion and long-term action consistency observed across timestamps. To ensure faithful representation of the semantic dependency and different temporal granularity of hand pose and action, our framework is decomposed into two cascaded VAE blocks. The lower pose block models short-span poses, while the upper action block models long-span action. These are connected by a mid-level feature that represents sub-second series of hand poses. Our framework is trained across multiple datasets, where pose and action blocks are trained separately to fully utilize pose-action annotations of different qualities. Evaluations show that on multiple datasets, the joint modeling of recognition and prediction improves over separate solutions, and the semantic and temporal hierarchy enables long-term pose and action modeling.

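Summary
The paper presents a framework that tackles hand action recognition and 3D future hand motion prediction together. Previous work focuses on one or the other. The proposed generative Transformer VAE architecture captures both, using short-term hand motion and long-term action consistency across timestamps to produce more realistic motion prediction. To represent the semantic dependency and the different temporal granularity of hand pose and action faithfully, the framework is split into two cascaded VAE blocks: the lower pose block models short-span poses and the upper action block models long-span actions. A mid-level feature representing sub-second series of hand poses connects them. The framework is trained across multiple datasets, with the pose and action blocks trained separately to make full use of pose-action annotations of different quality. Evaluations on multiple datasets show that joint modeling of recognition and prediction improves over separate solutions, and that the semantic and temporal hierarchy enables long-term pose and action modeling.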

Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning

paper_url: http://arxiv.org/abs/2311.17365
repo_url: https://github.com/enlighten0707/Symbol-LLM
paper_authors: Xiaoqian Wu, Yong-Lu Li, Jianhua Sun, Cewu Lu
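for: Improving visual activity understanding by strengthening its explainability, generalization, and data efficiency.
methods: A symbolic-system approach to activity understanding that uses large language models (Symbol-LLM) to approximate the two ideal properties of broad-coverage symbols and rational rules, and then reasons out the activity semantics of an image through fuzzy logic calculation.
results: The method performs strongly across a wide range of activity understanding tasks and outperforms earlier methods.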

Abstract
Human reasoning can be understood as a cooperation between the intuitive, associative "System-1" and the deliberative, logical "System-2". For existing System-1-like methods in visual activity understanding, it is crucial to integrate System-2 processing to improve explainability, generalization, and data efficiency. One possible path of activity reasoning is building a symbolic system composed of symbols and rules, where one rule connects multiple symbols, implying human knowledge and reasoning abilities. Previous methods have made progress, but are defective with limited symbols from handcraft and limited rules from visual-based annotations, failing to cover the complex patterns of activities and lacking compositional generalization. To overcome the defects, we propose a new symbolic system with two ideal important properties: broad-coverage symbols and rational rules. Collecting massive human knowledge via manual annotations is expensive to instantiate this symbolic system. Instead, we leverage the recent advancement of LLMs (Large Language Models) as an approximation of the two ideal properties, i.e., Symbols from Large Language Models (Symbol-LLM). Then, given an image, visual contents from the images are extracted and checked as symbols and activity semantics are reasoned out based on rules via fuzzy logic calculation. Our method shows superiority in extensive activity understanding tasks. Code and data are available at https://mvig-rhos.com/symbol_llm.
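The rule-based reasoning step described above can be illustrated with a small fuzzy logic sketch. It assumes that each rule is a set of symbols implying one activity, that a rule fires at the minimum of its symbols' scores (fuzzy AND), and that an activity takes the maximum over its rules (fuzzy OR). The abstract only says "fuzzy logic calculation", so these operator choices, the function names, and the example symbols are all assumptions.

```python
def rule_score(symbol_scores, premise):
    """Fuzzy AND over a rule's symbols, using the minimum; unseen symbols score 0."""
    return min(symbol_scores.get(symbol, 0.0) for symbol in premise)

def activity_score(symbol_scores, rules):
    """Fuzzy OR over all rules that imply the same activity, using the maximum."""
    return max(rule_score(symbol_scores, premise) for premise in rules)

# Hypothetical symbol confidences extracted from one image.
symbols = {"person": 0.9, "horse": 0.8, "sitting on": 0.7, "saddle": 0.2}

# Two hypothetical rules that both imply the activity "ride horse".
rules_ride_horse = [
    ("person", "horse", "sitting on"),
    ("person", "saddle"),
]
print(activity_score(symbols, rules_ride_horse))  # 0.7
```

Because every score traces back to named symbols and rules, the prediction can be explained by pointing to the rule that fired, which is the explainability benefit the abstract attributes to System-2 style reasoning.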

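Summary
Human reasoning can be understood as cooperation between the intuitive, associative "System-1" and the deliberative, logical "System-2". Existing System-1-like methods for visual activity understanding need System-2 processing to improve explainability, generalization, and data efficiency. One route is a symbolic system of symbols and rules, in which a rule connects multiple symbols and encodes human knowledge and reasoning ability. Earlier methods made progress but are limited to handcrafted symbols and rules derived from visual annotations, so they fail to cover complex activity patterns and lack compositional generalization. The authors propose a symbolic system with two ideal properties: broad-coverage symbols and rational rules. Because collecting enough human knowledge through manual annotation is expensive, they use large language models (LLMs) to approximate both properties, an approach they call Symbols from Large Language Models (Symbol-LLM). Given an image, visual content is extracted and checked as symbols, and activity semantics are reasoned out from the rules via fuzzy logic calculation. The method shows superior results on extensive activity understanding tasks. Code and data are available at https://mvig-rhos.com/symbol_llm.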

How does spatial structure affect psychological restoration? A method based on Graph Neural Networks and Street View Imagery

paper_url: http://arxiv.org/abs/2311.17361
repo_url: None
paper_authors: Haoran Ma, Yan Zhang, Pengyuan Liu, Fan Zhang, Pengyu Zhua
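for: This study aims to understand the key factors behind urban and natural restoration quality, in particular how spatial structure affects restoration quality at an urban scale.
methods: The study uses a graph neural network (GNN) based method that captures spatial structure and assesses restoration quality at the city level.
results: The GNN model assesses restoration quality better than traditional methods, and spatial structure is found to influence restoration quality.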

Abstract
The Attention Restoration Theory (ART) presents a theoretical framework with four essential indicators (being away, extent, fascination, and compatibility) for comprehending urban and natural restoration quality. However, previous studies relied on non-sequential data and non-spatial-dependent methods, which overlook the impact of spatial structure, defined here as the positional relationships between scene entities, on restoration quality. The past methods also make it challenging to measure restoration quality on an urban scale. In this work, a spatial-dependent graph neural network (GNN) approach is proposed to reveal the relation between spatial structure and restoration quality on an urban scale. Specifically, we constructed two different types of graphs at the street and city levels. The street-level graphs, using sequential street view images (SVIs) of road segments to capture position relationships between entities, were used to represent spatial structure. The city-level graph, modeling the topological relationships of roads as non-Euclidean data structures and embedding urban features (including Perception-features, Spatial-features, and Socioeconomic-features), was used to measure restoration quality. The results demonstrate that: 1) the spatial-dependent GNN model outperforms traditional methods (Acc = 0.735, F1 = 0.732); 2) spatial structure portrayed through sequential SVI data significantly influences restoration quality; 3) spaces with the same restoration quality exhibited distinct spatial structure patterns. This study clarifies the association between spatial structure and restoration quality, providing a new perspective to improve urban well-being in the future.

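Summary
Attention Restoration Theory (ART) proposes four key indicators (being away, extent, fascination, and compatibility) for understanding urban and natural restoration quality. Previous studies, however, relied on non-sequential data and non-spatial methods, overlooking how spatial structure, meaning the positional relationships between scene entities, affects restoration quality; these methods also made restoration quality hard to measure at the urban scale. The authors propose a spatially dependent graph neural network (GNN) approach built on two types of graphs. Street-level graphs use sequential street view images (SVIs) of road segments to capture the positional relationships between entities. A city-level graph models the topology of the road network as non-Euclidean data and embeds urban features (perception, spatial, and socioeconomic features) to measure restoration quality. The results show that 1) the spatially dependent GNN model outperforms traditional methods (Acc = 0.735, F1 = 0.732); 2) spatial structure captured through sequential SVIs significantly influences restoration quality; and 3) spaces with the same restoration quality exhibit distinct spatial structure patterns. The study clarifies the association between spatial structure and restoration quality and offers a new perspective for improving urban well-being.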
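
The idea of letting each road segment's features depend on its neighbours in the road network can be illustrated with one round of mean-aggregation message passing. This is a minimal sketch and not the authors' architecture: the graph, the feature vectors, and the function name `mean_aggregate` are hypothetical and chosen only for illustration.

```python
def mean_aggregate(features, edges):
    """One round of message passing: each node takes the mean of its own
    feature vector and those of its direct neighbours."""
    neighbours = {node: [node] for node in features}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    updated = {}
    for node, nbrs in neighbours.items():
        dim = len(features[node])
        updated[node] = [
            sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)
        ]
    return updated


# Hypothetical city-level graph: nodes are road segments, edges join
# segments that meet at an intersection. Feature values are made up.
features = {"r1": [1.0, 0.0], "r2": [0.0, 1.0], "r3": [1.0, 1.0]}
edges = [("r1", "r2"), ("r2", "r3")]
smoothed = mean_aggregate(features, edges)
```

After one round, `r1` holds the average of its own features and those of `r2`; stacking several rounds lets information travel further along the road topology, which is what a non-spatial model cannot do.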

A natural language processing-based approach: mapping human perception by understanding deep semantic features in street view images

paper_url: http://arxiv.org/abs/2311.17354
repo_url: None
paper_authors: Haoran Ma, Dongdong Wu
for: This paper aims to comprehensively understand the deep semantic features of human perception of a scene using a pre-trained natural language model and an image captioning network.
methods: The authors use Place Pulse 2.0 as their base dataset, which contains human-perceived labels for various scenes. They use an image captioning network to extract description information and fine-tune a pre-trained BERT model with a regression function for six human perceptual dimensions.
results: The authors find that their approach, which uses deep semantic features, performs better than previous studies that use machine learning methods with shallow features. They also conduct a migration experiment in Hong Kong and show that their approach provides new ideas for subsequent human perception research and better explanatory power in the face of spatial heterogeneity.

Abstract
In the past decade, using Street View images and machine learning to measure human perception has become a mainstream research approach in urban science. However, because this approach uses only shallow image information, it is difficult to comprehensively understand the deep semantic features of human perception of a scene. In this study, we propose a new framework based on a pre-trained natural language model to understand the relationship between human perception and the sense of a scene. Firstly, Place Pulse 2.0 was used as our base dataset, which contains a variety of human-perceived labels, namely, beautiful, safe, wealthy, depressing, boring, and lively. An image captioning network was used to extract the description information of each street view image. Secondly, a pre-trained BERT model was fine-tuned, with a regression function added for the six human perceptual dimensions. Furthermore, we compared the performance of five traditional regression methods with our approach and conducted a migration experiment in Hong Kong. Our results show that human perception scoring by deep semantic features performed better than previous studies by machine learning methods with shallow features. The use of deep scene semantic features provides new ideas for subsequent human perception research, as well as better explanatory power in the face of spatial heterogeneity.

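Summary
Over the past decade, measuring human perception with street view images and machine learning has become a mainstream approach in urban science. Because this approach uses only shallow image information, it struggles to capture the deep semantic features of how people perceive a scene. The study proposes a framework based on a pre-trained natural language model to relate human perception to the sense of a scene. First, Place Pulse 2.0 serves as the base dataset; it carries six human perception labels: beautiful, safe, wealthy, depressing, boring, and lively. An image captioning network extracts a description of each street view image. Second, a pre-trained BERT model is fine-tuned with an added regression function for the six perceptual dimensions. The authors also compare five traditional regression methods with their approach and run a migration experiment in Hong Kong. Scoring human perception from deep semantic features outperforms earlier machine learning methods based on shallow features, and it offers both new ideas for perception research and better explanatory power under spatial heterogeneity.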

Implicit-explicit Integrated Representations for Multi-view Video Compression

paper_url: http://arxiv.org/abs/2311.17350
repo_url: https://github.com/zc-lynen/MV-IERV
paper_authors: Chen Zhu, Guo Lu, Bing He, Rong Xie, Li Song
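for: This paper aims to improve the compression and transmission efficiency of multi-view video while maintaining high-quality reconstruction.
methods: The paper combines explicit and implicit representations: an explicit-representation 2D video codec encodes one source view, and an implicit neural representation (INR) codec encodes the remaining views. The implicit codec uses a multi-level feature grid embedding and a fully convolutional architecture.
results: Experiments show that the proposed framework achieves performance comparable or even superior to the latest multi-view video compression standard, MIV, and to other INR-based schemes in terms of view compression and scene modeling.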

Abstract
With the increasing consumption of 3D displays and virtual reality, multi-view video has become a promising format. However, its high resolution and multi-camera shooting result in a substantial increase in data volume, making storage and transmission a challenging task. To tackle these difficulties, we propose an implicit-explicit integrated representation for multi-view video compression. Specifically, we first use the explicit representation-based 2D video codec to encode one of the source views. Subsequently, we propose employing the implicit neural representation (INR)-based codec to encode the remaining views. The implicit codec takes the time and view index of multi-view video as coordinate inputs and generates the corresponding implicit reconstruction frames. To enhance the compressibility, we introduce a multi-level feature grid embedding and a fully convolutional architecture into the implicit codec. These components facilitate coordinate-feature and feature-RGB mapping, respectively. To further enhance the reconstruction quality from the INR codec, we leverage the high-quality reconstructed frames from the explicit codec to achieve inter-view compensation. Finally, the compensated results are fused with the implicit reconstructions from the INR to obtain the final reconstructed frames. Our proposed framework combines the strengths of both implicit neural representation and explicit 2D codec. Extensive experiments conducted on public datasets demonstrate that the proposed framework can achieve comparable or even superior performance to the latest multi-view video compression standard MIV and other INR-based schemes in terms of view compression and scene modeling.

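Summary
As 3D displays and virtual reality spread, multi-view video has become a promising format, but its high resolution and multi-camera capture greatly increase data volume, making storage and transmission difficult. The authors propose an implicit-explicit integrated representation: an explicit-representation 2D video codec encodes one source view, and an implicit neural representation (INR) codec encodes the remaining views. The INR codec takes the time and view index as coordinate inputs and generates the corresponding implicit reconstruction frames. To improve compressibility, it adds a multi-level feature grid embedding and a fully convolutional architecture, which handle the coordinate-to-feature and feature-to-RGB mappings, respectively. High-quality reconstructed frames from the explicit codec provide inter-view compensation, and the compensated results are fused with the implicit reconstructions to produce the final frames. The framework matches or exceeds the latest multi-view video compression standard, MIV, and other INR-based schemes in view compression and scene modeling.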

Cross-Scope Spatial-Spectral Information Aggregation for Hyperspectral Image Super-Resolution

paper_url: http://arxiv.org/abs/2311.17340
repo_url: https://github.com/tomchenshi/cst
paper_authors: Shi Chen, Lefei Zhang, Liangpei Zhang
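for: To improve the spatial resolution of hyperspectral images through single-image super-resolution.
methods: A cross-scope spatial-spectral Transformer (CST) that uses cross-scope spatial self-attention and cross-scope spectral self-attention to capture long-range spatial and spectral similarities.
results: Extensive experiments on three hyperspectral datasets show that CST outperforms other state-of-the-art methods both quantitatively and visually.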

Abstract
Hyperspectral image super-resolution has attained widespread prominence to enhance the spatial resolution of hyperspectral images. However, convolution-based methods have encountered challenges in harnessing the global spatial-spectral information. The prevailing transformer-based methods have not adequately captured the long-range dependencies in both spectral and spatial dimensions. To alleviate this issue, we propose a novel cross-scope spatial-spectral Transformer (CST) to efficiently investigate long-range spatial and spectral similarities for single hyperspectral image super-resolution. Specifically, we devise cross-attention mechanisms in spatial and spectral dimensions to comprehensively model the long-range spatial-spectral characteristics. By integrating global information into the rectangle-window self-attention, we first design a cross-scope spatial self-attention to facilitate long-range spatial interactions. Then, by leveraging appropriately characteristic spatial-spectral features, we construct a cross-scope spectral self-attention to effectively capture the intrinsic correlations among global spectral bands. Finally, we elaborate a concise feed-forward neural network to enhance the feature representation capacity in the Transformer structure. Extensive experiments over three hyperspectral datasets demonstrate that the proposed CST is superior to other state-of-the-art methods both quantitatively and visually. The code is available at https://github.com/Tomchenshi/CST.git.

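Summary
Hyperspectral image super-resolution is widely used to enhance the spatial resolution of hyperspectral images. Convolution-based methods struggle to exploit global spatial-spectral information, and existing Transformer-based methods do not adequately capture long-range dependencies in both the spectral and spatial dimensions. The authors propose a cross-scope spatial-spectral Transformer (CST) to explore long-range spatial and spectral similarities for single hyperspectral image super-resolution. A cross-scope spatial self-attention integrates global information into rectangle-window self-attention to enable long-range spatial interactions. A cross-scope spectral self-attention then uses suitably characteristic spatial-spectral features to capture the intrinsic correlations among global spectral bands. Finally, a concise feed-forward neural network strengthens the feature representation capacity of the Transformer. Experiments on three hyperspectral datasets show that CST outperforms other state-of-the-art methods. The code is available at https://github.com/Tomchenshi/CST.git.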

RADAP: A Robust and Adaptive Defense Against Diverse Adversarial Patches on Face Recognition

paper_url: http://arxiv.org/abs/2311.17339
repo_url: None
paper_authors: Xiaoliang Liu, Furao Shen, Jian Zhao, Changhai Nie
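for: To defend deep-learning face recognition (FR) systems against attacks based on local adversarial patches.
methods: FCutout and F-patch techniques, together with an edge-aware binary cross-entropy (EBCE) loss and the split and fill (SAF) strategy.
results: Defense performance improves significantly against various adversarial patches, while clean accuracy stays higher than that of the undefended Vanilla model.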

Abstract
Face recognition (FR) systems powered by deep learning have become widely used in various applications. However, they are vulnerable to adversarial attacks, especially those based on local adversarial patches that can be physically applied to real-world objects. In this paper, we propose RADAP, a robust and adaptive defense mechanism against diverse adversarial patches in both closed-set and open-set FR systems. RADAP employs innovative techniques, such as FCutout and F-patch, which use Fourier space sampling masks to improve the occlusion robustness of the FR model and the performance of the patch segmenter. Moreover, we introduce an edge-aware binary cross-entropy (EBCE) loss function to enhance the accuracy of patch detection. We also present the split and fill (SAF) strategy, which is designed to counter the vulnerability of the patch segmenter to complete white-box adaptive attacks. We conduct comprehensive experiments to validate the effectiveness of RADAP, which shows significant improvements in defense performance against various adversarial patches, while maintaining clean accuracy higher than that of the undefended Vanilla model.

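Summary
Face recognition (FR) systems powered by deep learning are widely deployed, yet they are vulnerable to adversarial attacks, especially those based on local adversarial patches. The paper proposes RADAP, a robust and adaptive defense against diverse adversarial patches in both closed-set and open-set FR systems. RADAP uses FCutout and F-patch, which sample masks in Fourier space, to improve the occlusion robustness of the FR model and the performance of the patch segmenter. An edge-aware binary cross-entropy (EBCE) loss improves the accuracy of patch detection, and the split and fill (SAF) strategy protects the patch segmenter against complete white-box adaptive attacks. Comprehensive experiments show that RADAP significantly improves defense performance against various adversarial patches while keeping clean accuracy higher than that of the undefended Vanilla model.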

eMotions: A Large-Scale Dataset for Emotion Recognition in Short Videos

paper_url: http://arxiv.org/abs/2311.17335
repo_url: https://github.com/xuecwu/emotions
paper_authors: Xuecheng Wu, Heli Sun, Junxiao Xue, Ruofan Zhai, Xiangyan Kong, Jiayu Nie, Liang He
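for: This study aims to improve emotion recognition in short videos (SVs), to better understand how emotions are expressed in them.
methods: An end-to-end baseline, AV-CPNet, uses a video transformer to learn semantically relevant representations, and a two-stage cross-modal fusion module to model the correlations between audio and visual features.
results: Extensive experimental results on nine datasets verify the effectiveness of AV-CPNet for emotion recognition in SVs.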

Abstract
Nowadays, short videos (SVs) are essential to information acquisition and sharing in our life. The prevailing use of SVs to spread emotions leads to the necessity of emotion recognition in SVs. Considering the lack of SVs emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos. Meanwhile, we alleviate the impact of subjectivities on labeling quality by emphasizing better personnel allocations and multi-stage annotations. In addition, we provide the category-balanced and test-oriented variants through targeted data sampling. Some commonly used videos (e.g., facial expressions and postures) have been well studied. However, it is still challenging to understand the emotions in SVs. Since the enhanced content diversity brings more distinct semantic gaps and difficulties in learning emotion-related features, and there exists information gaps caused by the emotion incompleteness under the prevalently audio-visual co-expressions. To tackle these problems, we present an end-to-end baseline method AV-CPNet that employs the video transformer to better learn semantically relevant representations. We further design the two-stage cross-modal fusion module to complementarily model the correlations of audio-visual features. The EP-CE Loss, incorporating three emotion polarities, is then applied to guide model optimization. Extensive experimental results on nine datasets verify the effectiveness of AV-CPNet. Datasets and code will be open on https://github.com/XuecWu/eMotions.

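Summary
Short videos (SVs) have become an important means of acquiring and sharing information, and their widespread use for spreading emotions makes emotion recognition in SVs necessary. Because SV emotion data is scarce, the authors introduce eMotions, a large-scale dataset of 27,996 videos. They reduce the effect of annotator subjectivity through better personnel allocation and multi-stage annotation, and they provide category-balanced and test-oriented variants through targeted data sampling. Some common video types (for example, facial expressions and postures) are well studied, but understanding emotions in SVs remains challenging: greater content diversity widens semantic gaps and makes emotion-related features harder to learn, and incomplete emotional expression under audio-visual co-expression causes information gaps. To address these problems, the authors present AV-CPNet, an end-to-end baseline that uses a video transformer to learn semantically relevant representations, and a two-stage cross-modal fusion module that models the correlations between audio and visual features in a complementary way. The EP-CE Loss, which incorporates three emotion polarities, guides model optimization. Experiments on nine datasets verify the effectiveness of AV-CPNet. Datasets and code will be released at https://github.com/XuecWu/eMotions.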

Long-tailed multi-label classification with noisy label of thoracic diseases from chest X-ray

paper_url: http://arxiv.org/abs/2311.17334
repo_url: None
paper_authors: Haoran Lai, Qingsong Yao, Zhiyang He, Xiaodong Tao, S Kevin Zhou
for: The paper aims to improve the detection of rare thoracic diseases in chest X-rays (CXRs) using a novel benchmark for long-tailed multi-label classification.
methods: The paper proposes a baseline method for this classification challenge. It includes adaptive negative regularization, which addresses the over-suppression of negative logits in tail classes, and a large loss reconsideration strategy for correcting noisy labels from automated annotations.
results: The paper demonstrates significant advancements in rare disease detection using the proposed method on the "LTML-MIMIC-CXR" dataset, which is an augmentation of the MIMIC-CXR dataset with 26 additional rare diseases.

Abstract
Chest X-rays (CXR) often reveal rare diseases, demanding precise diagnosis. However, current computer-aided diagnosis (CAD) methods focus on common diseases, leading to inadequate detection of rare conditions due to the absence of comprehensive datasets. To overcome this, we present a novel benchmark for long-tailed multi-label classification in CXRs, encapsulating both common and rare thoracic diseases. Our approach includes developing the "LTML-MIMIC-CXR" dataset, an augmentation of MIMIC-CXR with 26 additional rare diseases. We propose a baseline method for this classification challenge, integrating adaptive negative regularization to address negative logits' over-suppression in tail classes, and a large loss reconsideration strategy for correcting noisy labels from automated annotations. Our evaluation on LTML-MIMIC-CXR demonstrates significant advancements in rare disease detection. This work establishes a foundation for robust CAD methods, achieving a balance in identifying a spectrum of thoracic diseases in CXRs. Access to our code and dataset is provided at: https://github.com/laihaoran/LTML-MIMIC-CXR.

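Summary
Chest X-rays (CXRs) often reveal rare diseases that demand precise diagnosis. Current computer-aided diagnosis (CAD) methods focus on common diseases, so rare conditions are detected poorly, mainly because comprehensive datasets are lacking. The authors present a new benchmark for long-tailed multi-label classification in CXRs that covers both common and rare thoracic diseases. It includes the "LTML-MIMIC-CXR" dataset, an extension of MIMIC-CXR with 26 additional rare diseases. The proposed baseline combines adaptive negative regularization, which addresses the over-suppression of negative logits in tail classes, with a large loss reconsideration strategy that corrects noisy labels from automated annotation. Evaluation on LTML-MIMIC-CXR shows significant progress in rare disease detection. The work lays a foundation for robust CAD methods that identify a balanced spectrum of thoracic diseases in CXRs. Code and dataset are available at https://github.com/laihaoran/LTML-MIMIC-CXR.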

NeRFTAP: Enhancing Transferability of Adversarial Patches on Face Recognition using Neural Radiance Fields

paper_url: http://arxiv.org/abs/2311.17332
repo_url: None
paper_authors: Xiaoliang Liu, Furao Shen, Feng Han, Jian Zhao, Changhai Nie
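for: Face recognition (FR) systems are vulnerable to adversarial attacks, which raises security concerns; this study examines how well adversarial patches transfer in practical settings.
methods: The authors propose NeRFTAP, an adversarial attack method that considers transferability both to the FR model and to the victim's face image. A NeRF-based 3D-GAN generates new-view face images of the source and target subjects, and a style consistency loss keeps the adversarial UV map visually similar to the target UV map.
results: Experiments on various FR models show that NeRFTAP outperforms existing attack techniques, and the findings offer insights for making FR systems more robust in practical adversarial settings.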

Abstract
Face recognition (FR) technology plays a crucial role in various applications, but its vulnerability to adversarial attacks poses significant security concerns. Existing research primarily focuses on transferability to different FR models, overlooking the direct transferability to victim's face images, which is a practical threat in real-world scenarios. In this study, we propose a novel adversarial attack method that considers both the transferability to the FR model and the victim's face image, called NeRFTAP. Leveraging NeRF-based 3D-GAN, we generate new view face images for the source and target subjects to enhance transferability of adversarial patches. We introduce a style consistency loss to ensure the visual similarity between the adversarial UV map and the target UV map under a 0-1 mask, enhancing the effectiveness and naturalness of the generated adversarial face images. Extensive experiments and evaluations on various FR models demonstrate the superiority of our approach over existing attack techniques. Our work provides valuable insights for enhancing the robustness of FR systems in practical adversarial settings.

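Summary
Face recognition (FR) technology plays a key role in many applications, but its vulnerability to adversarial attacks raises serious security concerns. Existing research focuses mainly on transferability across FR models and overlooks direct transferability to the victim's face images, which is a practical threat in real-world scenarios. The study proposes NeRFTAP, an adversarial attack method that considers transferability to both the FR model and the victim's face image. Using a NeRF-based 3D-GAN, it generates new-view face images of the source and target subjects to enhance the transferability of adversarial patches. A style consistency loss keeps the adversarial UV map visually similar to the target UV map under a 0-1 mask, improving the effectiveness and naturalness of the generated adversarial face images. Extensive experiments on various FR models show that the approach outperforms existing attack techniques, and the work offers insights for making FR systems more robust in practical adversarial settings.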

Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering

paper_url: http://arxiv.org/abs/2311.17331
repo_url: None
paper_authors: Zeqing Wang, Wentao Wan, Runmeng Chen, Qiqing Lao, Minjie Lang, Keze Wang
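for: This study aims to improve Visual Question Answering (VQA) and to address limitations of existing methods, such as bias from the limited data in Knowledge Bases (KBs) and difficulty indexing relevant information.
methods: The study proposes an explainable multi-agent collaboration framework that draws on knowledge embedded in Large Language Models (LLMs) to perform top-down reasoning. Three agents, the Seeker, the Responder, and the Integrator, jointly solve the VQA problem.
results: Extensive evaluation across diverse VQA datasets and VLMs shows that the method is broadly applicable and interpretable.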

Abstract
Recently, Vision Language Models (VLMs) have gained significant attention, exhibiting notable advancements across various tasks by leveraging extensive image-text paired data. However, prevailing VLMs often treat Visual Question Answering (VQA) as perception tasks, employing black-box models that overlook explicit modeling of relationships between different questions within the same visual scene. Moreover, the existing VQA methods that rely on Knowledge Bases (KBs) might frequently encounter biases from limited data and face challenges in relevant information indexing. To overcome these limitations, this paper introduces an explainable multi-agent collaboration framework by tapping into knowledge embedded in Large Language Models (LLMs) trained on extensive corpora. Inspired by human cognition, our framework uncovers latent information within the given question by employing three agents, i.e., Seeker, Responder, and Integrator, to perform a top-down reasoning process. The Seeker agent generates relevant issues related to the original question. The Responder agent, based on VLM, handles simple VQA tasks and provides candidate answers. The Integrator agent combines information from the Seeker agent and the Responder agent to produce the final VQA answer. Through the above collaboration mechanism, our framework explicitly constructs a multi-view knowledge base for a specific image scene, reasoning answers in a top-down processing manner. We extensively evaluate our method on diverse VQA datasets and VLMs, demonstrating its broad applicability and interpretability with comprehensive experimental results.

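Summary
Vision Language Models (VLMs) have recently made notable progress across tasks by leveraging large amounts of paired image-text data. However, prevailing VLMs often treat Visual Question Answering (VQA) as a perception task and use black-box models that do not explicitly model the relationships between different questions about the same visual scene. Existing VQA methods that rely on Knowledge Bases (KBs) can also suffer from bias due to limited data and have difficulty indexing relevant information. To overcome these limitations, the paper introduces an explainable multi-agent collaboration framework that taps the knowledge embedded in Large Language Models (LLMs) to uncover latent information in the given question. The framework uses three agents. The Seeker generates issues relevant to the original question. The Responder, based on a VLM, handles simple VQA tasks and provides candidate answers. The Integrator combines information from the Seeker and the Responder to produce the final answer. Through this collaboration, the framework explicitly constructs a multi-view knowledge base for a specific image scene and reasons about answers top-down. Evaluation on diverse VQA datasets and VLMs demonstrates the method's broad applicability and interpretability.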
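
The division of labour among the Seeker, Responder, and Integrator agents can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function `answer_top_down`, the toy agents, and the lookup table are hypothetical stand-ins, and in the paper the agents are backed by an LLM and a VLM rather than by hand-written rules. Whether the Responder also answers the Seeker's issues is an assumption made here for the sketch.

```python
def answer_top_down(question, seeker, responder, integrator):
    """Run the three-agent flow: gather related issues, answer them,
    then combine everything into a final answer."""
    issues = seeker(question)                          # Seeker: related issues
    evidence = [(q, responder(q)) for q in issues]     # Responder: answer each issue
    candidate = responder(question)                    # Responder: candidate answer
    return integrator(question, candidate, evidence)   # Integrator: final answer


# Toy stand-ins for the agents (hypothetical; real agents would call models).
FACTS = {"Is it raining?": "yes", "Is the ground wet?": "yes"}


def toy_seeker(question):
    return ["Is it raining?", "Is the ground wet?"]


def toy_responder(question):
    return FACTS.get(question, "unknown")


def toy_integrator(question, candidate, evidence):
    if candidate != "unknown":
        return candidate
    # Fall back to the most common answer among the related issues.
    answers = [answer for _, answer in evidence]
    return max(set(answers), key=answers.count)


result = answer_top_down(
    "Do they need an umbrella?", toy_seeker, toy_responder, toy_integrator
)
```

Here the Responder cannot answer the original question directly, so the Integrator decides from the evidence collected for the related issues. The list of (issue, answer) pairs plays the role of the scene-specific knowledge base and also makes the reasoning inspectable.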

Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation

paper_url: http://arxiv.org/abs/2311.17325
repo_url: https://github.com/ZhenZHAO/AD-MT
paper_authors: Zhen Zhao, Zicheng Wang, Longyue Wang, Yixuan Yuan, Luping Zhou
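for: This study aims to improve the accuracy and stability of semi-supervised medical image segmentation and to address the confirmation bias that affects existing teacher-student models.
methods: The method uses one student model and two non-trainable teacher models, and counters confirmation bias with a Random Periodic Alternate (RPA) Updating Module and a Conflict-Combating Module (CCM).
results: Experiments show that AD-MT outperforms existing methods on 2D and 3D medical segmentation benchmarks across various semi-supervised settings.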

Abstract
Semi-supervised medical image segmentation studies have shown promise in training models with limited labeled data. However, current dominant teacher-student based approaches can suffer from the confirmation bias. To address this challenge, we propose AD-MT, an alternate diverse teaching approach in a teacher-student framework. It involves a single student model and two non-trainable teacher models that are momentum-updated periodically and randomly in an alternate fashion. To mitigate the confirmation bias from the diverse supervision, the core of AD-MT lies in two proposed modules: the Random Periodic Alternate (RPA) Updating Module and the Conflict-Combating Module (CCM). The RPA schedules the alternating diverse updating process with complementary data batches, distinct data augmentation, and random switching periods to encourage diverse reasoning from different teaching perspectives. The CCM employs an entropy-based ensembling strategy to encourage the model to learn from both the consistent and conflicting predictions between the teachers. Experimental results demonstrate the effectiveness and superiority of our AD-MT on the 2D and 3D medical segmentation benchmarks across various semi-supervised settings.

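Summary
Semi-supervised medical image segmentation has shown promise for training models with limited labeled data, but the dominant teacher-student approaches can suffer from confirmation bias. To address this, the authors propose AD-MT, an alternate diverse teaching approach. It uses a single student model and two non-trainable teacher models that are momentum-updated periodically and randomly in an alternating fashion. The core of AD-MT lies in two modules: the Random Periodic Alternate (RPA) Updating Module and the Conflict-Combating Module (CCM). The RPA schedules the alternating updates with complementary data batches, distinct data augmentation, and random switching periods to encourage diverse reasoning from different teaching perspectives. The CCM uses an entropy-based ensembling strategy so that the model learns from both the consistent and the conflicting predictions of the two teachers. Experiments show that AD-MT performs well on 2D and 3D medical segmentation benchmarks across various semi-supervised settings and outperforms existing approaches.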
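
Three ingredients named above can be sketched in a few lines: a momentum (exponential moving average) update of a teacher from the student, a schedule that switches the active teacher after random periods, and an entropy-based combination of the two teachers' predictions. This is a hedged sketch, not the AD-MT implementation: the weighting `exp(-entropy)`, the period bounds, and all function names are assumptions made for illustration.

```python
import math
import random


def ema_update(teacher, student, momentum=0.99):
    """Momentum update: the teacher drifts slowly towards the student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def alternate_schedule(num_steps, min_period, max_period, seed=0):
    """Index (0 or 1) of the teacher to update at each step, switching
    after a randomly drawn number of steps."""
    rng = random.Random(seed)
    active, schedule = 0, []
    while len(schedule) < num_steps:
        schedule.extend([active] * rng.randint(min_period, max_period))
        active = 1 - active
    return schedule[:num_steps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_ensemble(p1, p2):
    """Combine two class distributions, giving the lower-entropy (more
    confident) teacher the larger weight. Assumed weighting: exp(-entropy)."""
    w1, w2 = math.exp(-entropy(p1)), math.exp(-entropy(p2))
    return [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(p1, p2)]


schedule = alternate_schedule(num_steps=20, min_period=2, max_period=5)
blended = entropy_weighted_ensemble([0.9, 0.1], [0.5, 0.5])
```

With these toy inputs the confident teacher dominates the blend, so the first class ends up with more mass than a plain average of the two teachers would give it.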

Revisiting Single Image Reflection Removal In the Wild

paper_url: http://arxiv.org/abs/2311.17320
repo_url: None
paper_authors: Yurui Zhu, Xueyang Fu, Peng-Tao Jiang, Hao Zhang, Qibin Sun, Jinwei Chen, Zheng-Jun Zha, Bo Li
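for: This study examines single-image reflection removal (SIRR) in real-world conditions from two angles: the collection pipeline for real reflection pairs and the perception of reflection locations.
methods: The authors devise a reflection collection pipeline that adapts to a wide range of real-world reflection scenarios, and build a large-scale, high-quality dataset named Reflection Removal in the Wild (RRW). RRW contains over 14,950 high-resolution real-world reflection pairs, forty-five times more than its predecessors.
results: The authors observe that many virtual reflection objects visible in reflection images are absent from the corresponding ground-truth images. Based on this, they propose the Maximum Reflection Filter (MaxRF), which explicitly characterizes reflection locations from image pairs, and a reflection location-aware cascaded framework for SIRR. The solution outperforms current leading methods on multiple real-world benchmarks.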

Abstract
This research focuses on the issue of single-image reflection removal (SIRR) in real-world conditions, examining it from two angles: the collection pipeline of real reflection pairs and the perception of real reflection locations. We devise an advanced reflection collection pipeline that is highly adaptable to a wide range of real-world reflection scenarios and incurs reduced costs in collecting large-scale aligned reflection pairs. In the process, we develop a large-scale, high-quality reflection dataset named Reflection Removal in the Wild (RRW). RRW contains over 14,950 high-resolution real-world reflection pairs, a dataset forty-five times larger than its predecessors. Regarding perception of reflection locations, we identify that numerous virtual reflection objects visible in reflection images are not present in the corresponding ground-truth images. This observation, drawn from the aligned pairs, leads us to conceive the Maximum Reflection Filter (MaxRF). The MaxRF could accurately and explicitly characterize reflection locations from pairs of images. Building upon this, we design a reflection location-aware cascaded framework, specifically tailored for SIRR. Powered by these innovative techniques, our solution achieves superior performance than current leading methods across multiple real-world benchmarks. Codes and datasets will be publicly available.

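Summary
This research addresses single-image reflection removal (SIRR) in real-world conditions from two angles: the collection pipeline for real reflection pairs and the perception of reflection locations. The authors devise a reflection collection pipeline that adapts to a wide range of real-world reflection scenarios and lowers the cost of collecting large-scale aligned reflection pairs. With it they build Reflection Removal in the Wild (RRW), a large-scale, high-quality dataset of over 14,950 high-resolution real-world reflection pairs, forty-five times larger than its predecessors. On reflection locations, they observe that many virtual reflection objects visible in reflection images are absent from the corresponding ground-truth images. This observation, drawn from the aligned pairs, leads to the Maximum Reflection Filter (MaxRF), which accurately and explicitly characterizes reflection locations from pairs of images. Building on MaxRF, they design a reflection location-aware cascaded framework tailored to SIRR. The solution outperforms current leading methods on multiple real-world benchmarks. Code and datasets will be made publicly available.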

paper_url: http://arxiv.org/abs/2311.17315
repo_url: None
paper_authors: Daniela Massiceti, Camilla Longden, Agnieszka Slowik, Samuel Wills, Martin Grayson, Cecily Morrison
for: The paper evaluates the performance of a widely used large multi-modal model (CLIP) on data captured by blind or low vision (BLV) users.
methods: The paper tests 25 CLIP variants in a zero-shot classification task and compares their accuracy on images captured by BLV users with their accuracy on web-crawled images. It also conducts a textual analysis of three common pre-training datasets to investigate the inclusion of disability content.
results: CLIP's accuracy is 15 percentage points lower on average for images captured by BLV users than for web-crawled images, due to sensitivities to image content, image quality, and text content. Few-shot learning with as few as 5 images can mitigate CLIP's quality-of-service disparities for BLV users in some scenarios.

Abstract
Large multi-modal models (LMMs) hold the potential to usher in a new era of automated visual assistance for people who are blind or low vision (BLV). Yet, these models have not been systematically evaluated on data captured by BLV users. We address this by empirically assessing CLIP, a widely-used LMM likely to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task, we find that their accuracy is 15 percentage points lower on average for images captured by BLV users than web-crawled images. This disparity stems from CLIP's sensitivities to 1) image content (e.g. not recognizing disability objects as well as other objects); 2) image quality (e.g. not being robust to lighting variation); and 3) text content (e.g. not recognizing objects described by tactile adjectives as well as visual ones). We delve deeper with a textual analysis of three common pre-training datasets: LAION-400M, LAION-2B and DataComp-1B, showing that disability content is rarely mentioned. We then provide three examples that illustrate how the performance disparities extend to three downstream models underpinned by CLIP: OWL-ViT, CLIPSeg and DALL-E2. We find that few-shot learning with as few as 5 images can mitigate CLIP's quality-of-service disparities for BLV users in some scenarios, which we discuss alongside a set of other possible mitigations.

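Summary
Large multi-modal models (LMMs) could usher in a new era of automated visual assistance for people who are blind or low vision (BLV), yet they have not been systematically evaluated on data captured by BLV users. The authors address this by empirically assessing CLIP, a widely used LMM. Testing 25 CLIP variants on a zero-shot classification task, they find that accuracy is on average 15 percentage points lower on images captured by BLV users than on web-crawled images. The disparity stems from CLIP's sensitivity to 1) image content (for example, recognizing disability objects less well than other objects); 2) image quality (for example, a lack of robustness to lighting variation); and 3) text content (for example, recognizing objects described by tactile adjectives less well than those described by visual ones). A textual analysis of three common pre-training datasets, LAION-400M, LAION-2B, and DataComp-1B, shows that disability content is rarely mentioned. Three examples illustrate how the disparities extend to downstream models built on CLIP: OWL-ViT, CLIPSeg, and DALL-E2. Few-shot learning with as few as 5 images can mitigate the disparities in some scenarios, and the authors discuss this alongside other possible mitigations.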
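
Zero-shot classification with a CLIP-style model picks the label whose text embedding is most similar to the image embedding, and a disparity between two groups of images is then reported as a difference in accuracy, in percentage points. The sketch below illustrates both steps with toy two-dimensional embeddings. All vectors, labels, and the resulting gap are made up for illustration and are not results from the paper; a real evaluation would obtain the embeddings from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_emb, text_embs):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(text_embs, key=lambda label: cosine(image_emb, text_embs[label]))


def accuracy(samples, text_embs):
    hits = sum(zero_shot_predict(emb, text_embs) == label for emb, label in samples)
    return hits / len(samples)


# Toy embeddings (hypothetical values, one text embedding per class prompt).
text_embs = {"guide cane": [1.0, 0.0], "mug": [0.0, 1.0]}
web_samples = [([0.9, 0.1], "guide cane"), ([0.1, 0.9], "mug")]
blv_samples = [([0.4, 0.6], "guide cane"), ([0.2, 0.8], "mug")]

# Accuracy gap expressed in percentage points, not as a relative percentage.
gap_points = (accuracy(web_samples, text_embs) - accuracy(blv_samples, text_embs)) * 100
```

Note the unit: a drop from 60% to 45% accuracy would be 15 percentage points, which is a 25% relative decrease.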

Federated Fine-Tuning of Foundation Models via Probabilistic Masking

paper_url: http://arxiv.org/abs/2311.17299
repo_url: None
paper_authors: Vasileios Tsouvalas, Yuki Asano, Aaqib Saeed
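for: This study aims to integrate Foundation Models (FMs) into Federated Learning (FL) effectively while improving communication efficiency.
methods: The study proposes DeltaMask, which uses stochastic masking to detect highly effective subnetworks within FMs and uses probabilistic filters to compress updates into a compact grayscale image.
results: Evaluated on 8 datasets and 5 pre-trained models, DeltaMask reaches bitrates as low as 0.09 bpp, improving communication efficiency while maintaining FM performance.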

Abstract
Foundation Models (FMs) have revolutionized machine learning with their adaptability and high performance across tasks; yet, their integration into Federated Learning (FL) is challenging due to substantial communication overhead from their extensive parameterization. Current communication-efficient FL strategies, such as gradient compression, reduce bitrates to around $1$ bit-per-parameter (bpp). However, these approaches fail to harness the characteristics of FMs, with their large number of parameters still posing a challenge to communication efficiency, even at these bitrate regimes. In this work, we present DeltaMask, a novel method that efficiently fine-tunes FMs in FL at an ultra-low bitrate, well below 1 bpp. DeltaMask employs stochastic masking to detect highly effective subnetworks within FMs and leverage stochasticity and sparsity in client masks to compress updates into a compact grayscale image using probabilistic filters, deviating from traditional weight training approaches. Our comprehensive evaluations across various datasets and architectures demonstrate DeltaMask efficiently achieves bitrates as low as 0.09 bpp, enhancing communication efficiency while maintaining FMs performance, as measured on 8 datasets and 5 pre-trained models of various network architectures.

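Summary
Foundation Models (FMs) have transformed machine learning with their adaptability and high performance across tasks, but integrating them into Federated Learning (FL) is difficult because their extensive parameterization creates substantial communication overhead. Current communication-efficient FL strategies, such as gradient compression, reduce bitrates to around 1 bit-per-parameter (bpp). These approaches do not exploit the characteristics of FMs, whose large parameter counts still strain communication even at such bitrates. The authors present DeltaMask, a method that fine-tunes FMs in FL at an ultra-low bitrate, well below 1 bpp. DeltaMask uses stochastic masking to detect highly effective subnetworks within FMs, and exploits the stochasticity and sparsity of client masks to compress updates into a compact grayscale image using probabilistic filters, departing from traditional weight training. Evaluations on 8 datasets and 5 pre-trained models of various architectures show that DeltaMask reaches bitrates as low as 0.09 bpp, improving communication efficiency while maintaining FM performance.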
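
Stochastic masking in general can be illustrated as follows: each parameter has a keep-probability, clients sample binary masks from those probabilities, and a server can estimate the probabilities by averaging the masks it receives. This is a generic sketch of the idea and not the DeltaMask algorithm: it omits the probabilistic filters and the grayscale-image encoding, and the probabilities, the client count, and the function names are hypothetical.

```python
import random


def sample_mask(probs, rng):
    """Draw one binary mask: keep parameter i with probability probs[i]."""
    return [1 if rng.random() < p else 0 for p in probs]


def aggregate_masks(masks):
    """Estimate keep-probabilities as the per-parameter mean of client masks."""
    count = len(masks)
    return [sum(column) / count for column in zip(*masks)]


def apply_mask(weights, mask):
    """Select a subnetwork: masked-out weights contribute zero."""
    return [w * m for w, m in zip(weights, mask)]


rng = random.Random(0)
probs = [0.9, 0.1, 0.5, 1.0, 0.0]          # hypothetical keep-probabilities
client_masks = [sample_mask(probs, rng) for _ in range(200)]
estimate = aggregate_masks(client_masks)
subnetwork = apply_mask([2.0, 3.0, -1.0, 0.5, 4.0], client_masks[0])
```

Sending a raw binary mask already costs exactly 1 bit per parameter, so bitrates below 1 bpp require compressing the mask updates further, which is the part of the method this sketch leaves out.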

LEOD: Label-Efficient Object Detection for Event Cameras

paper_url: http://arxiv.org/abs/2311.17286
repo_url: None
paper_authors: Ziyi Wu, Mathias Gehrig, Qing Lyu, Xudong Liu, Igor Gilitschenski
for: This paper addresses the cost of labeling event streams with high temporal resolution, which makes supervised training of object detectors for event cameras expensive and time-consuming.
methods: The proposed method, LEOD, unifies weakly- and semi-supervised object detection with a self-training mechanism. A detector pre-trained on limited labels produces pseudo ground truth on unlabeled events, and the detector is then re-trained with both real and generated labels.
results: LEOD consistently outperforms supervised baselines across various labeling ratios. On Gen1, it improves mAP by 8.6% and 7.8% for RVT-S trained with 1% and 2% labels, respectively; on 1Mpx, RVT-S with 10% labels surpasses its fully supervised counterpart. Even when all labeled data are available, LEOD reaches new state-of-the-art results, and it also improves larger detectors.

Abstract
Object detection with event cameras enjoys the property of low latency and high dynamic range, making it suitable for safety-critical scenarios such as self-driving. However, labeling event streams with high temporal resolutions for supervised training is costly. We address this issue with LEOD, the first framework for label-efficient event-based detection. Our method unifies weakly- and semi-supervised object detection with a self-training mechanism. We first utilize a detector pre-trained on limited labels to produce pseudo ground truth on unlabeled events, and then re-train the detector with both real and generated labels. Leveraging the temporal consistency of events, we run bi-directional inference and apply tracking-based post-processing to enhance the quality of pseudo labels. To stabilize training, we further design a soft anchor assignment strategy to mitigate the noise in labels. We introduce new experimental protocols to evaluate the task of label-efficient event-based detection on Gen1 and 1Mpx datasets. LEOD consistently outperforms supervised baselines across various labeling ratios. For example, on Gen1, it improves mAP by 8.6% and 7.8% for RVT-S trained with 1% and 2% labels. On 1Mpx, RVT-S with 10% labels even surpasses its fully-supervised counterpart using 100% labels. LEOD maintains its effectiveness even when all labeled data are available, reaching new state-of-the-art results. Finally, we show that our method readily scales to improve larger detectors as well.

Summary
First, we use a detector pre-trained on limited labels to generate pseudo ground truth on unlabeled events. We then re-train the detector with both real and generated labels, leveraging the temporal consistency of events to enhance the quality of the pseudo labels. To stabilize training, we design a soft anchor assignment strategy that mitigates label noise. We introduce new experimental protocols to evaluate label-efficient event-based detection on the Gen1 and 1Mpx datasets. LEOD consistently outperforms supervised baselines across various labeling ratios. For example, on Gen1, it improves mAP by 8.6% and 7.8% for RVT-S trained with 1% and 2% labels, respectively. On 1Mpx, RVT-S with 10% labels even surpasses its fully-supervised counterpart using 100% labels. LEOD remains effective even when all labeled data are available, reaching new state-of-the-art results. Moreover, our method readily scales to improve larger detectors.

2023-11-30

Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

Learning domain-invariant classifiers for infant cry sounds

An Aliasing-Free Hybrid Digital-Analog Polyphonic Synthesizer

2023-11-30

PyNeRF: Pyramidal Neural Radiance Fields

Advancements and Trends in Ultra-High-Resolution Image Processing: An Overview

AV-RIR: Audio-Visual Room Impulse Response Estimation

Lasagna: Layered Score Distillation for Disentangled Object Relighting

Brainformer: Modeling MRI Brain Functions to Machine Vision

Convolutional Neural Networks for Segmentation of Malignant Pleural Mesothelioma: Analysis of Probability Map Thresholds (CALGB 30901, Alliance)

SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting

DNS SLAM: Dense Neural Semantic-Informed SLAM

Raising the Bar of AI-generated Image Detection with CLIP

Benchmarking and Enhancing Disentanglement in Concept-Residual Models

Galaxy Classification: A machine learning approach for classifying shapes using numerical data

Fool the Hydra: Adversarial Attacks against Multi-view Object Detection Systems

Universal Backdoor Attacks

A Unified Framework for Connecting Noise Modeling to Boost Noise Detection

GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

TrafficMOT: A Challenging Dataset for Multi-Object Tracking in Complex Traffic Scenarios

Just Add $π$! Pose Induced Video Transformers for Understanding Activities of Daily Living

PoseGPT: Chatting about 3D Human Pose

InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation

S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion

ART$\boldsymbol{\cdot}$V: Auto-Regressive Text-to-Video Generation with Diffusion Models

Exploiting Diffusion Prior for Generalizable Pixel-Level Semantic Prediction

MotionEditor: Editing Video Motion via Content-Aware Diffusion

Un-EvMoSeg: Unsupervised Event-based Independent Motion Segmentation

MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation

Event-based Continuous Color Video Decompression from Single Frames

One-step Diffusion with Distribution Matching Distillation

DynMF: Neural Motion Factorization for Real-time Dynamic View Synthesis with 3D Gaussian Splatting

CAST: Cross-Attention in Space and Time for Video Action Recognition

DEVIAS: Learning Disentangled Video Representations of Action and Scene for Holistic Video Understanding

Initializing Models with Larger Ones

ElasticDiffusion: Training-free Arbitrary Size Image Generation

LucidDreaming: Controllable Object-Centric 3D Generation

IMMA: Immunizing text-to-image Models against Malicious Adaptation

Is Underwater Image Enhancement All Object Detectors Need?

MD-Splatting: Learning Metric Deformation from 4D Gaussians in Highly Deformable Scenes

Convergence of Nonconvex PnP-ADMM with MMSE Denoisers

FoundPose: Unseen Object Pose Estimation with Foundation Features

CLIP-QDA: An Explainable Concept Bottleneck Model

Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

Semi-supervised Semantic Segmentation via Boosting Uncertainty on Unlabeled Data

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

Merlin:Empowering Multimodal LLMs with Foresight Minds

Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data

Improving the Robustness of Quantized Deep Neural Networks to White-Box Attacks using Stochastic Quantization and Information-Theoretic Ensemble Training

Meta-Prior: Meta learning for Adaptive Inverse Problem Solvers

Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction

Cascaded Interaction with Eroded Deep Supervision for Salient Object Detection

Action Recognition in Video Recordings from Gynecologic Laparoscopy

Pose Estimation and Tracking for ASIST

Learning Part Segmentation from Synthetic Animals

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

Simple Semantic-Aided Few-Shot Learning

DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars

A Lightweight Clustering Framework for Unsupervised Semantic Segmentation

JPPF: Multi-task Fusion for Consistent Panoptic-Part Segmentation

Anatomy and Physiology of Artificial Intelligence in PET Imaging

Cancer-Net PCa-Gen: Synthesis of Realistic Prostate Diffusion Weighted Imaging Data via Anatomic-Conditional Controlled Latent Diffusion

DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image

Learning Triangular Distribution in Visual World

Identifying tourist destinations from movie scenes using Deep Learning

Overcoming Label Noise for Source-free Unsupervised Video Domain Adaptation

Seam-guided local alignment and stitching for large parallax images

Periodic Vibration Gaussian: Dynamic Urban Scene Reconstruction and Real-time Rendering

FediOS: Decoupling Orthogonal Subspaces for Personalization in Feature-skew Federated Learning

Heterogeneous Graph-based Trajectory Prediction using Local Map Context and Social Interactions

SparseDC: Depth Completion from sparse and non-uniform inputs

OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

Match me if you can: Semantic Correspondence Learning with Unpaired Images

MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation

Revisiting Proposal-based Object Detection

DifAugGAN: A Practical Diffusion-style Data Augmentation for GAN-based Single Image Super-resolution

Accurate Segmentation of Optic Disc And Cup from Multiple Pseudo-labels by Noise-Aware Learning

Improving Adversarial Transferability via Model Alignment

PRS: Sharp Feature Priors for Resolution-Free Surface Remeshing