cs.SD - 2023-09-25

NoLACE: Improving Low-Complexity Speech Codec Enhancement Through Adaptive Temporal Shaping

  • paper_url: http://arxiv.org/abs/2309.14521
  • repo_url: None
  • paper_authors: Jan Büthe, Ahmed Mustafa, Jean-Marc Valin, Karim Helwani, Michael M. Goodwin
  • for: Improving speech codec quality, specifically enhancing the Opus codec.
  • methods: Builds on the Linear Adaptive Coding Enhancer (LACE), which combines DNNs with classical long-term/short-term postfiltering at low complexity and zero delay, and adds a novel adaptive temporal shaping module (the general idea is sketched after the abstract).
  • results: NoLACE significantly outperforms both the Opus baseline and an enlarged LACE model at 6, 9 and 12 kb/s, and behaves well when used with an ASR system.
    Abstract Speech codec enhancement methods are designed to remove distortions added by speech codecs. While classical methods are very low in complexity and add zero delay, their effectiveness is rather limited. Compared to that, DNN-based methods deliver higher quality but they are typically high in complexity and/or require delay. The recently proposed Linear Adaptive Coding Enhancer (LACE) addresses this problem by combining DNNs with classical long-term/short-term postfiltering resulting in a causal low-complexity model. A shortcoming of the LACE model is, however, that quality quickly saturates when the model size is scaled up. To mitigate this problem, we propose a novel adaptive temporal shaping module that adds high temporal resolution to the LACE model resulting in the Non-Linear Adaptive Coding Enhancer (NoLACE). We adapt NoLACE to enhance the Opus codec and show that NoLACE significantly outperforms both the Opus baseline and an enlarged LACE model at 6, 9 and 12 kb/s. We also show that LACE and NoLACE are well-behaved when used with an ASR system.
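    A minimal sketch of the general idea behind adaptive temporal shaping, assuming the module predicts a high-temporal-resolution gain envelope from frame-level codec features and applies it to the postfiltered signal; the layer sizes, feature inputs, and frame-to-sample upsampling factor are illustrative assumptions, not the NoLACE architecture.

```python
import torch
import torch.nn as nn

class TemporalShaper(nn.Module):
    """Predicts a per-frame gain envelope and applies it at sample rate."""
    def __init__(self, feat_dim=64, samples_per_frame=160):
        super().__init__()
        self.gain_net = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=3, padding=1),
            nn.Tanh(),
            nn.Conv1d(64, 1, kernel_size=3, padding=1),
        )
        self.samples_per_frame = samples_per_frame

    def forward(self, signal, feats):
        # signal: (B, n_frames * samples_per_frame); feats: (B, feat_dim, n_frames)
        gain = torch.exp(self.gain_net(feats))  # positive gain per frame
        gain = torch.repeat_interleave(gain, self.samples_per_frame, dim=-1)
        return signal * gain.squeeze(1)         # shape the signal in time

shaper = TemporalShaper()
shaped = shaper(torch.randn(2, 100 * 160), torch.randn(2, 64, 100))
print(shaped.shape)  # torch.Size([2, 16000])
```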

Noise-Robust DSP-Assisted Neural Pitch Estimation with Very Low Complexity

  • paper_url: http://arxiv.org/abs/2309.14507
  • repo_url: None
  • paper_authors: Krishna Subramani, Jean-Marc Valin, Jan Buethe, Paris Smaragdis, Mike Goodwin
  • for: This paper proposes a hybrid pitch estimator that combines the strengths of deep neural networks (DNNs) and traditional digital signal processing (DSP), balancing estimation performance against practicality of deployment.
  • methods: A small DNN operating on traditional DSP-based features (the general recipe is sketched after the abstract).
  • results: The hybrid approach matches or exceeds the performance of pure DNN-based models at a complexity and algorithmic delay comparable to traditional DSP-based algorithms, and also provides benefits for a neural vocoding task.
    Abstract Pitch estimation is an essential step of many speech processing algorithms, including speech coding, synthesis, and enhancement. Recently, pitch estimators based on deep neural networks (DNNs) have been outperforming well-established DSP-based techniques. Unfortunately, these new estimators can be impractical to deploy in real-time systems, both because of their relatively high complexity, and the fact that some require significant lookahead. We show that a hybrid estimator using a small deep neural network (DNN) with traditional DSP-based features can match or exceed the performance of pure DNN-based models, with a complexity and algorithmic delay comparable to traditional DSP-based algorithms. We further demonstrate that this hybrid approach can provide benefits for a neural vocoding task.
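    A minimal sketch of the hybrid recipe, assuming a normalized frame autocorrelation as the DSP feature and a tiny classifier over candidate pitch lags; the actual feature set, network, and post-processing in the paper differ, and the names here (autocorr_features, max_lag) are hypothetical.

```python
import numpy as np
import torch
import torch.nn as nn

def autocorr_features(frame, max_lag=256):
    """Classical DSP feature: normalized autocorrelation per candidate lag."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return (ac[:max_lag] / (ac[0] + 1e-9)).astype(np.float32)

# A tiny DNN scoring each candidate pitch lag from the DSP features.
dnn = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 256))

frame = np.random.randn(1024)                  # stand-in for a speech frame
lag_scores = dnn(torch.from_numpy(autocorr_features(frame)))
best_lag = int(lag_scores.argmax()) + 1        # +1: lag 0 is not a pitch period
print(16000 / best_lag, "Hz (untrained, illustrative only)")
```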

On the Impact of Quantization and Pruning of Self-Supervised Speech Models for Downstream Speech Recognition Tasks "In-the-Wild"

  • paper_url: http://arxiv.org/abs/2309.14462
  • repo_url: None
  • paper_authors: Arthur Pimentel, Heitor Guimarães, Anderson R. Avila, Mehdi Rezagholizadeh, Tiago H. Falk
  • for: This study examines the accuracy of self-supervised-learning-based speech recognition systems when training and testing conditions differ.
  • methods: Two model compression methods, parameter quantization and model pruning, are applied to the robust wav2vec 2.0 model and their effect on recognition accuracy is analyzed (both compressions are sketched after the abstract).
  • results: Reports the effects of parameter quantization and model pruning on speech recognition accuracy under noisy, reverberant, and noise-plus-reverberation conditions.
    Abstract Recent advances with self-supervised learning have allowed speech recognition systems to achieve state-of-the-art (SOTA) word error rates (WER) while requiring only a fraction of the labeled training data needed by its predecessors. Notwithstanding, while such models achieve SOTA performance in matched train/test conditions, their performance degrades substantially when tested in unseen conditions. To overcome this problem, strategies such as data augmentation and/or domain shift training have been explored. Available models, however, are still too large to be considered for edge speech applications on resource-constrained devices, thus model compression tools are needed. In this paper, we explore the effects that train/test mismatch conditions have on speech recognition accuracy based on compressed self-supervised speech models. In particular, we report on the effects that parameter quantization and model pruning have on speech recognition accuracy based on the so-called robust wav2vec 2.0 model under noisy, reverberant, and noise-plus-reverberation conditions.
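    A minimal sketch of the two compressions studied, applied to a stand-in PyTorch module rather than to robust wav2vec 2.0; the bit width (int8) and pruning ratio (30%) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in model; the paper applies the compressions to robust wav2vec 2.0.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 32))

# Parameter quantization: int8 dynamic quantization of the Linear layers.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Model pruning: zero out the 30% smallest-magnitude weights per Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the sparsity into the weights

x = torch.randn(1, 512)
print(quantized(x).shape, model(x).shape)  # both still map 512 -> 32
```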

An Investigation of Distribution Alignment in Multi-Genre Speaker Recognition

  • paper_url: http://arxiv.org/abs/2309.14158
  • repo_url: None
  • paper_authors: Zhenyu Zhou, Junhui Chen, Namin Wang, Lantian Li, Dong Wang
  • for: This study investigates the performance of mainstream distribution alignment methods on multi-genre data, to better address the challenges of multi-genre speaker recognition.
  • methods: Several mainstream distribution alignment methods are analyzed qualitatively and quantitatively, including within-between distribution alignment (WBDA); a toy per-genre alignment is sketched after the abstract.
  • results: Experiments on the CN-Celeb dataset show that WBDA performs relatively better, but none of the investigated methods consistently improved performance in all test cases, suggesting that a comprehensive solution has yet to be developed.
    Abstract Multi-genre speaker recognition is becoming increasingly popular due to its ability to better represent the complexities of real-world applications. However, a major challenge is the significant shift in the distribution of speaker vectors across different genres. While distribution alignment is a common approach to address this challenge, previous studies have mainly focused on aligning a source domain with a target domain, and the performance of multi-genre data is unknown. This paper presents a comprehensive study of mainstream distribution alignment methods on multi-genre data, where multiple distributions need to be aligned. We analyze various methods both qualitatively and quantitatively. Our experiments on the CN-Celeb dataset show that within-between distribution alignment (WBDA) performs relatively better. However, we also found that none of the investigated methods consistently improved performance in all test cases. This suggests that solely aligning the distributions of speaker vectors may not fully address the challenges posed by multi-genre speaker recognition. Further investigation is necessary to develop a more comprehensive solution.
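    A minimal sketch of the general "align per-genre distributions" idea, using simple per-genre standardization of speaker vectors; WBDA and the other methods studied in the paper are more involved, and this function is only a hypothetical illustration.

```python
import numpy as np

def align_per_genre(vectors, genres):
    """Standardize speaker vectors genre by genre (first/second moments)."""
    aligned = np.empty_like(vectors)
    for g in np.unique(genres):
        idx = genres == g
        mu = vectors[idx].mean(axis=0)
        sd = vectors[idx].std(axis=0) + 1e-9
        aligned[idx] = (vectors[idx] - mu) / sd
    return aligned

# Two genres whose embeddings sit at different offsets before alignment.
x = np.random.randn(100, 8) + np.repeat([[0.0], [3.0]], 50, axis=0)
g = np.repeat(["interview", "singing"], 50)
print(align_per_genre(x, g).mean(axis=0).round(3))  # ~0 in every dimension
```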

Multi-Domain Adaptation by Self-Supervised Learning for Speaker Verification

  • paper_url: http://arxiv.org/abs/2309.14149
  • repo_url: None
  • paper_authors: Wan Lin, Lantian Li, Dong Wang
  • for: Addressing the domain-mismatch challenge in speaker recognition models.
  • methods: A self-supervised learning method with three strategies: in-domain negative sampling, a MoCo-like memory bank scheme, and a CORAL-like distribution alignment (the latter is sketched after the abstract).
  • results: Outperforms the basic self-supervised adaptation method in nearly all in-domain tests and cross-domain tests, demonstrating the effectiveness of the proposed method.
    Abstract In real-world applications, speaker recognition models often face various domain-mismatch challenges, leading to a significant drop in performance. Although numerous domain adaptation techniques have been developed to address this issue, almost all present methods focus on a simple configuration where the model is trained in one domain and deployed in another. However, real-world environments are often complex and may contain multiple domains, making the methods designed for one-to-one adaptation suboptimal. In our paper, we propose a self-supervised learning method to tackle this multi-domain adaptation problem. Building upon the basic self-supervised adaptation algorithm, we designed three strategies to make it suitable for multi-domain adaptation: an in-domain negative sampling strategy, a MoCo-like memory bank scheme, and a CORAL-like distribution alignment. We conducted experiments using VoxCeleb2 as the source domain dataset and CN-Celeb1 as the target multi-domain dataset. Our results demonstrate that our method clearly outperforms the basic self-supervised adaptation method, which simply treats the data of CN-Celeb1 as a single domain. Importantly, the improvement is consistent in nearly all in-domain tests and cross-domain tests, demonstrating the effectiveness of our proposed method.
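    A minimal sketch of a CORAL-style loss that penalizes the difference between second-order statistics of source and target embedding batches, assuming the standard deep-CORAL formulation; the paper's exact variant may differ.

```python
import torch

def coral_loss(source, target):
    """Frobenius distance between batch covariances, as in deep CORAL."""
    def cov(x):
        x = x - x.mean(dim=0, keepdim=True)
        return (x.T @ x) / (x.shape[0] - 1)
    d = source.shape[1]
    return ((cov(source) - cov(target)) ** 2).sum() / (4 * d * d)

src = torch.randn(64, 256)          # e.g., VoxCeleb2 embeddings
tgt = torch.randn(64, 256) * 2.0    # e.g., CN-Celeb1 embeddings, mismatched
print(float(coral_loss(src, tgt)))
```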

Speaker anonymization using neural audio codec language models

  • paper_url: http://arxiv.org/abs/2309.14129
  • repo_url: None
  • paper_authors: Michele Panariello, Francesco Nespoli, Massimiliano Todisco, Nicholas Evans
  • for: Speaker anonymization (concealing a speaker's identity).
  • methods: Neural audio codecs (NACs) combined with language models generate high-quality anonymized synthetic speech, with quantized codes acting as a bottleneck on speaker-related information (the bottleneck idea is sketched after the abstract).
  • results: Applying the evaluation framework of the Voice Privacy Challenge 2022 shows that NAC language modeling can generate high-quality anonymized speech while effectively bottlenecking speaker-related information.
    Abstract The vast majority of approaches to speaker anonymization involve the extraction of fundamental frequency estimates, linguistic features and a speaker embedding which is perturbed to obfuscate the speaker identity before an anonymized speech waveform is resynthesized using a vocoder. Recent work has shown that x-vector transformations are difficult to control consistently: other sources of speaker information contained within fundamental frequency and linguistic features are re-entangled upon vocoding, meaning that anonymized speech signals still contain speaker information. We propose an approach based upon neural audio codecs (NACs), which are known to generate high-quality synthetic speech when combined with language models. NACs use quantized codes, which are known to effectively bottleneck speaker-related information: we demonstrate the potential of speaker anonymization systems based on NAC language modeling by applying the evaluation framework of the Voice Privacy Challenge 2022.
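    A minimal sketch of the quantized-code bottleneck at the heart of neural audio codecs: each frame embedding is snapped to its nearest codebook entry, discarding fine-grained (e.g., speaker-specific) detail. The codebook size and dimensions are illustrative assumptions.

```python
import torch

def vector_quantize(frames, codebook):
    """Snap each frame embedding to its nearest codebook entry."""
    dists = torch.cdist(frames, codebook)   # (T, K) pairwise L2 distances
    idx = dists.argmin(dim=1)               # discrete codes, (T,)
    return idx, codebook[idx]               # codes and quantized embeddings

codebook = torch.randn(1024, 128)           # K=1024 entries of dimension 128
idx, quantized = vector_quantize(torch.randn(50, 128), codebook)
print(idx.shape, quantized.shape)           # torch.Size([50]) torch.Size([50, 128])
```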

Haha-Pod: An Attempt for Laughter-based Non-Verbal Speaker Verification

  • paper_url: http://arxiv.org/abs/2309.14109
  • repo_url: https://github.com/nevermorelin/hahapod
  • paper_authors: Yuke Lin, Xiaoyi Qin, Ning Jiang, Guoqing Zhao, Ming Li
  • for: Exploring speaker verification based on non-verbal vocalization, specifically laughter.
  • methods: A Two-Stage Teacher-Student (2S-TS) framework that minimizes the within-speaker embedding distance between verbal and non-verbal (laughter) signals (a toy version of this objective is sketched after the abstract).
  • results: Significant improvement on the S2L-Eval test set with only minor degradation on the VoxCeleb1 test set.
    Abstract It is widely acknowledged that discriminative representation for speaker verification can be extracted from verbal speech. However, how much speaker information that non-verbal vocalization carries is still a puzzle. This paper explores speaker verification based on the most ubiquitous form of non-verbal voice, laughter. First, we use a semi-automatic pipeline to collect a new Haha-Pod dataset from open-source podcast media. The dataset contains over 240 speakers' laughter clips with corresponding high-quality verbal speech. Second, we propose a Two-Stage Teacher-Student (2S-TS) framework to minimize the within-speaker embedding distance between verbal and non-verbal (laughter) signals. Considering Haha-Pod as a test set, two trials (S2L-Eval) are designed to verify the speaker's identity through laugh sounds. Experimental results demonstrate that our method can significantly improve the performance of the S2L-Eval test set with only a minor degradation on the VoxCeleb1 test set. The resources for the Haha-Pod dataset can be found at https://github.com/nevermoreLin/HahaPod.
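    A minimal sketch of the student objective in such a teacher-student setup, assuming a frozen verbal-speech encoder as the teacher and a cosine distance between same-speaker embeddings; the two-stage schedule and the encoder architectures from the paper are omitted.

```python
import torch
import torch.nn.functional as F

def within_speaker_loss(student_emb, teacher_emb):
    """1 - cosine similarity between same-speaker embeddings, batch-averaged."""
    return (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()

laugh_emb = torch.randn(8, 192, requires_grad=True)  # student (laughter) output
verbal_emb = torch.randn(8, 192)                     # frozen teacher output
loss = within_speaker_loss(laugh_emb, verbal_emb.detach())
loss.backward()                                      # gradients flow to the student
print(float(loss))
```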

VoiceLens: Controllable Speaker Generation and Editing with Flow

  • paper_url: http://arxiv.org/abs/2309.14094
  • repo_url: None
  • paper_authors: Yao Shi, Ming Li
  • for: This paper proposes a semi-supervised flow-based method that models the speaker embedding distribution of multi-speaker speech synthesis and voice conversion systems, enabling multi-conditional speaker generation and editing.
  • methods: The proposed VoiceLens maps speaker embeddings into a combination of independent attributes and residual information, allowing new voices with given attributes to be generated for existing TTS models, and attributes of known voices to be meaningfully edited.
  • results: VoiceLens displays an unconditional generation capacity similar to Tacospawn while obtaining higher controllability and flexibility when used in a conditional manner; moreover, editing the embeddings of known noisy speakers with an SNR-conditioned VoiceLens model yields less noisy synthesized speech without retraining the TTS model.
    Abstract Currently, many multi-speaker speech synthesis and voice conversion systems address speaker variations with an embedding vector. Modeling it directly allows new voices outside of training data to be synthesized. GMM based approaches such as Tacospawn are favored in literature for this generation task, but there are still some limitations when difficult conditionings are involved. In this paper, we propose VoiceLens, a semi-supervised flow-based approach, to model speaker embedding distributions for multi-conditional speaker generation. VoiceLens maps speaker embeddings into a combination of independent attributes and residual information. It allows new voices associated with certain attributes to be \textit{generated} for existing TTS models, and attributes of known voices to be meaningfully \textit{edited}. We show in this paper, VoiceLens displays an unconditional generation capacity that is similar to Tacospawn while obtaining higher controllability and flexibility when used in a conditional manner. In addition, we show synthesizing less noisy speech from known noisy speakers without re-training the TTS model is possible via solely editing their embeddings with a SNR conditioned VoiceLens model. Demos are available at sos1sos2sixteen.github.io/voicelens.

Unsupervised Accent Adaptation Through Masked Language Model Correction Of Discrete Self-Supervised Speech Units

  • paper_url: http://arxiv.org/abs/2309.13994
  • repo_url: None
  • paper_authors: Jakob Poncelet, Hugo Van hamme
  • for: Improving the robustness of pre-trained speech models to accented or atypical speech.
  • methods: Unsupervised correction of discrete self-supervised speech units with a masked language model, plus small accent adapter blocks inserted into the pre-trained model (the correction loop is sketched after the abstract).
  • results: Improves a HuBERT Large model on a downstream accented speech recognition task, without supervision.
    Abstract Self-supervised pre-trained speech models have strongly improved speech recognition, yet they are still sensitive to domain shifts and accented or atypical speech. Many of these models rely on quantisation or clustering to learn discrete acoustic units. We propose to correct the discovered discrete units for accented speech back to a standard pronunciation in an unsupervised manner. A masked language model is trained on discrete units from a standard accent and iteratively corrects an accented token sequence by masking unexpected cluster sequences and predicting their common variant. Small accent adapter blocks are inserted in the pre-trained model and fine-tuned by predicting the corrected clusters, which leads to an increased robustness of the pre-trained model towards a target accent, and this without supervision. We are able to improve a state-of-the-art HuBERT Large model on a downstream accented speech recognition task by altering the training regime with the proposed method.
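    A minimal sketch of the correction loop, assuming a unit-level masked language model trained on a standard accent: units the model finds unlikely are masked and re-predicted, replacing unexpected cluster sequences with a common variant. The model here is an untrained stand-in and the surprise threshold is an illustrative assumption.

```python
import torch
import torch.nn as nn

MASK = 100  # mask-token id in a 101-symbol unit vocabulary

class ToyUnitMLM(nn.Module):
    """Untrained stand-in for a masked language model over speech units."""
    def __init__(self, vocab=101, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)
    def forward(self, units):
        return self.head(self.emb(units))            # (T, vocab) logits

@torch.no_grad()
def correct_units(mlm, units, threshold=0.01):
    probs = torch.softmax(mlm(units), dim=-1)
    observed = probs[torch.arange(len(units)), units]
    surprise = observed < threshold                  # unexpected cluster positions
    masked = units.clone()
    masked[surprise] = MASK
    repredicted = mlm(masked).argmax(dim=-1)         # predict the common variant
    return torch.where(surprise, repredicted, units)

units = torch.randint(0, 100, (50,))
print(correct_units(ToyUnitMLM(), units)[:10])
```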

Real-Time Emergency Vehicle Detection using Mel Spectrograms and Regular Expressions

  • paper_url: http://arxiv.org/abs/2309.13920
  • repo_url: None
  • paper_authors: Alberto Pacheco-Gonzalez, Raymundo Torres, Raul Chacon, Isidro Robledo
  • for: Detecting emergency vehicle sirens in real time.
  • methods: Digital signal processing techniques and signal symbolization, contrasted against a deep neural network audio classifier (the symbolize-then-match idea is sketched after the abstract).
  • results: The developed DSP algorithm showed a greater ability to discriminate between signal and noise than the CNN model.
    Abstract In emergency situations, the movement of vehicles through city streets can be problematic due to vehicular traffic. This paper presents a method for detecting emergency vehicle sirens in real time. To derive a siren Hi-Lo audio fingerprint it was necessary to apply digital signal processing techniques and signal symbolization, contrasting against a deep neural network audio classifier feeding 280 environmental sounds and 38 Hi-Lo sirens. In both methods, their precision was evaluated based on a confusion matrix and various metrics. The precision of the developed DSP algorithm presented a greater ability to discriminate between signal and noise, compared to the CNN model.
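    A minimal sketch of detection by symbolization plus a regular expression: each frame's dominant frequency is mapped to a symbol ('H', 'L', or '.') and an alternating Hi-Lo pattern is matched. The band edges, frame length, and regex are illustrative assumptions, not the paper's calibrated Hi-Lo fingerprint.

```python
import re
import numpy as np

def symbolize(frames, sr=16000, lo_band=(400, 600), hi_band=(700, 1000)):
    """Map each frame's dominant frequency to a symbol: 'L', 'H' or '.'."""
    symbols = []
    for frame in frames:
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        peak_hz = np.argmax(spectrum) * sr / len(frame)
        if lo_band[0] <= peak_hz <= lo_band[1]:
            symbols.append("L")
        elif hi_band[0] <= peak_hz <= hi_band[1]:
            symbols.append("H")
        else:
            symbols.append(".")
    return "".join(symbols)

def is_hi_lo_siren(symbol_string):
    # At least two full Hi->Lo alternations, each phase held for 3+ frames.
    return re.search(r"(?:H{3,}L{3,}){2,}", symbol_string) is not None

t = np.arange(2048) / 16000
hi, lo = np.sin(2 * np.pi * 800 * t), np.sin(2 * np.pi * 500 * t)
frames = [hi] * 4 + [lo] * 4 + [hi] * 4 + [lo] * 4
print(is_hi_lo_siren(symbolize(frames)))  # True
```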

Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors

  • paper_url: http://arxiv.org/abs/2309.13916
  • repo_url: https://github.com/audio-westlakeu/fs-eend
  • paper_authors: Di Liang, Nian Shao, Xiaofei Li
  • for: Frame-wise online/streaming end-to-end speaker diarization.
  • methods: A causal speaker embedding encoder and an online non-autoregressive self-attention-based attractor decoder, with a look-ahead mechanism that leverages a few future frames to detect new speakers in real time and adaptively update speaker attractors.
  • results: Compared with recently proposed block-wise online methods, the method achieves state-of-the-art diarization results with low inference latency and computational cost.
    Abstract This work proposes a frame-wise online/streaming end-to-end neural diarization (FS-EEND) method in a frame-in-frame-out fashion. To frame-wisely detect a flexible number of speakers and extract/update their corresponding attractors, we propose to leverage a causal speaker embedding encoder and an online non-autoregressive self-attention-based attractor decoder. A look-ahead mechanism is adopted to allow leveraging some future frames for effectively detecting new speakers in real time and adaptively updating speaker attractors. The proposed method processes the audio stream frame by frame, and has a low inference latency caused by the look-ahead frames. Experiments show that, compared with the recently proposed block-wise online methods, our method FS-EEND achieves state-of-the-art diarization results, with a low inference latency and computational cost.

HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS

  • paper_url: http://arxiv.org/abs/2309.13907
  • repo_url: None
  • paper_authors: Dake Guo, Xinfa Zhu, Liumeng Xue, Tao Li, Yuanjun Lv, Yuepeng Jiang, Lei Xie
  • for: Improving the naturalness and expressiveness of long-form text-to-speech.
  • methods: A virtual global node and a contextual attention mechanism extend GNN-based prosody modeling from intra-sentence to inter-sentence scope, with hierarchical supervision from acoustic prosody on each node of the graph (adding a global node to a word graph is sketched after the abstract).
  • results: Both objective and subjective evaluations show that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech.
    Abstract Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating human-parity long-form speech with high dynamic prosodic variations is still challenging. To address this problem, we expand the capabilities of GNNs with a hierarchical prosody modeling approach, named HiGNN-TTS. Specifically, we add a virtual global node in the graph to strengthen the interconnection of word nodes and introduce a contextual attention mechanism to broaden the prosody modeling scope of GNNs from intra-sentence to inter-sentence. Additionally, we perform hierarchical supervision from acoustic prosody on each node of the graph to capture the prosodic variations with a high dynamic range. Ablation studies show the effectiveness of HiGNN-TTS in learning hierarchical prosody. Both objective and subjective evaluations demonstrate that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech.
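    A minimal sketch of the virtual-global-node idea: one extra node connected to every word node, so a GNN can exchange sentence-level prosody information in a single hop. The adjacency construction is an illustration, not the HiGNN-TTS graph definition.

```python
import numpy as np

def add_global_node(adj):
    """Append one node connected to every existing node of a word graph."""
    n = adj.shape[0]
    out = np.zeros((n + 1, n + 1), dtype=adj.dtype)
    out[:n, :n] = adj
    out[n, :n] = 1.0   # global node reaches every word node...
    out[:n, n] = 1.0   # ...and every word node reaches the global node
    return out

word_adj = np.eye(4, k=1) + np.eye(4, k=-1)  # a 4-word chain
print(add_global_node(word_adj).astype(int))
```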

AutoPrep: An Automatic Preprocessing Framework for In-the-Wild Speech Data

  • paper_url: http://arxiv.org/abs/2309.13905
  • repo_url: https://github.com/tomasJwYU/AutoPrepDemo
  • paper_authors: Jianwei Yu, Hangting Chen, Yanyao Bian, Xiang Li, Yi Luo, Jinchuan Tian, Mengyang Liu, Jiayi Jiang, Shuai Wang
  • for: Making large-scale in-the-wild speech data usable for the speech technology community.
  • methods: An automatic preprocessing framework, AutoPrep, comprising speech enhancement, speech segmentation, speaker clustering, target speech extraction, quality filtering, and automatic speech recognition.
  • results: Experiments show that AutoPrep produces preprocessed data with DNSMOS and PDNSMOS scores similar to several open-sourced TTS datasets, and the corresponding TTS system achieves up to 0.68 in-domain speaker similarity.
    Abstract Recently, the utilization of extensive open-sourced text data has significantly advanced the performance of text-based large language models (LLMs). However, the use of in-the-wild large-scale speech data in the speech technology community remains constrained. One reason for this limitation is that a considerable amount of the publicly available speech data is compromised by background noise, speech overlapping, lack of speech segmentation information, missing speaker labels, and incomplete transcriptions, which can largely hinder their usefulness. On the other hand, human annotation of speech data is both time-consuming and costly. To address this issue, we introduce an automatic in-the-wild speech data preprocessing framework (AutoPrep) in this paper, which is designed to enhance speech quality, generate speaker labels, and produce transcriptions automatically. The proposed AutoPrep framework comprises six components: speech enhancement, speech segmentation, speaker clustering, target speech extraction, quality filtering and automatic speech recognition. Experiments conducted on the open-sourced WenetSpeech and our self-collected AutoPrepWild corpora demonstrate that the proposed AutoPrep framework can generate preprocessed data with similar DNSMOS and PDNSMOS scores compared to several open-sourced TTS datasets. The corresponding TTS system can achieve up to 0.68 in-domain speaker similarity.

A Two-Step Approach for Narrowband Source Localization in Reverberant Rooms

  • paper_url: http://arxiv.org/abs/2309.13819
  • repo_url: None
  • paper_authors: Wei-Ting Lai, Lachlan Birnie, Thushara Abhayapala, Amy Bastine, Shaoheng Xu, Prasanga Samarasinghe
  • for: This work proposes a two-step method for narrowband source localization in reverberant rooms.
  • methods: The first step dereverberates by modeling the homogeneous component of the sound field as an equivalent plane-wave decomposition via Iteratively Reweighted Least Squares (IRLS); the second step localizes sources by modeling the dereverberated component as a sparse point-source distribution via Orthogonal Matching Pursuit (OMP, sketched after the abstract).
  • results: The method improves localization accuracy with fewer measurements, particularly in strongly reverberant environments, and requires no prior knowledge of boundary conditions or room geometry, making it applicable to different room types.
    Abstract This paper presents a two-step approach for narrowband source localization within reverberant rooms. The first step involves dereverberation by modeling the homogeneous component of the sound field by an equivalent decomposition of planewaves using Iteratively Reweighted Least Squares (IRLS), while the second step focuses on source localization by modeling the dereverberated component as a sparse representation of point-source distribution using Orthogonal Matching Pursuit (OMP). The proposed method enhances localization accuracy with fewer measurements, particularly in environments with strong reverberation. A numerical simulation in a conference room scenario, using a uniform microphone array affixed to the wall, demonstrates real-world feasibility. Notably, the proposed method and microphone placement effectively localize sound sources within the 2D-horizontal plane without requiring prior knowledge of boundary conditions and room geometry, making it versatile for application in different room types.
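    A minimal sketch of the second step in its generic form: Orthogonal Matching Pursuit recovering a sparse vector x from measurements y = Ax, where the columns of A would be candidate point-source transfer vectors; the dictionary here is random rather than acoustic.

```python
import numpy as np

def omp(A, y, n_sources):
    """Greedy sparse recovery of x from y = A @ x."""
    residual, support = y.copy(), []
    for _ in range(n_sources):
        corr = np.abs(A.conj().T @ residual)
        support.append(int(np.argmax(corr)))          # pick best-matching atom
        x_s, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ x_s            # re-fit on current support
    x = np.zeros(A.shape[1], dtype=complex)
    x[support] = x_s
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 200)) + 1j * rng.standard_normal((32, 200))
true = np.zeros(200, dtype=complex)
true[[17, 120]] = [1.0, 0.5]
x_hat = omp(A, A @ true, n_sources=2)
print(np.flatnonzero(np.abs(x_hat) > 1e-6))  # expected: [ 17 120]
```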

cs.CV - 2023-09-25

MEMO: Dataset and Methods for Robust Multimodal Retinal Image Registration with Large or Small Vessel Density Differences

  • paper_url: http://arxiv.org/abs/2309.14550
  • repo_url: None
  • paper_authors: Chiao-Yi Wang, Faranguisse Kakhi Sadrieh, Yi-Ting Shen, Shih-En Chen, Sarah Kim, Victoria Chen, Achyut Raghavendra, Dongyi Wang, Osamah Saeedi, Yang Tao
  • for: This paper targets multimodal measurement of retinal blood flow (RBF), a powerful biomarker for the early diagnosis and treatment of ocular diseases.
  • methods: It combines erythrocyte-mediated angiography (EMA) with optical coherence tomography angiography (OCTA), and proposes a segmentation-based deep-learning registration framework (VDD-Reg) and a new evaluation metric (MSD) to address the large vessel density differences between the two modalities.
  • results: VDD-Reg outperforms baseline methods on both the CF-FA dataset and the new MEMO dataset, and needs as few as three annotated vessel segmentation masks to maintain its accuracy.
    Abstract The measurement of retinal blood flow (RBF) in capillaries can provide a powerful biomarker for the early diagnosis and treatment of ocular diseases. However, no single modality can determine capillary flowrates with high precision. Combining erythrocyte-mediated angiography (EMA) with optical coherence tomography angiography (OCTA) has the potential to achieve this goal, as EMA can measure the absolute 2D RBF of retinal microvasculature and OCTA can provide the 3D structural images of capillaries. However, multimodal retinal image registration between these two modalities remains largely unexplored. To fill this gap, we establish MEMO, the first public multimodal EMA and OCTA retinal image dataset. A unique challenge in multimodal retinal image registration between these modalities is the relatively large difference in vessel density (VD). To address this challenge, we propose a segmentation-based deep-learning framework (VDD-Reg) and a new evaluation metric (MSD), which provide robust results despite differences in vessel density. VDD-Reg consists of a vessel segmentation module and a registration module. To train the vessel segmentation module, we further designed a two-stage semi-supervised learning framework (LVD-Seg) combining supervised and unsupervised losses. We demonstrate that VDD-Reg outperforms baseline methods quantitatively and qualitatively for cases of both small VD differences (using the CF-FA dataset) and large VD differences (using our MEMO dataset). Moreover, VDD-Reg requires as few as three annotated vessel segmentation masks to maintain its accuracy, demonstrating its feasibility.

Dynamic Scene Graph Representation for Surgical Video

  • paper_url: http://arxiv.org/abs/2309.14538
  • repo_url: None
  • paper_authors: Felix Holm, Ghazal Ghazaei, Tobias Czempiel, Ege Özsoy, Stefan Saur, Nassir Navab
  • for: This paper aims to improve the automated understanding of surgical workflows in videos captured from microscopic or endoscopic imaging devices.
  • methods: The paper proposes scene graphs as a more holistic and semantically meaningful representation of surgical videos, and leverages graph convolutional networks (GCNs) to tackle surgical downstream tasks such as workflow recognition (a single GCN propagation step is sketched after the abstract).
  • results: The paper demonstrates the benefits of surgical scene graphs for the explainability and robustness of model decisions, and shows competitive performance on surgical workflow recognition tasks.
    Abstract Surgical videos captured from microscopic or endoscopic imaging devices are rich but complex sources of information, depicting different tools and anatomical structures utilized during an extended amount of time. Despite containing crucial workflow information and being commonly recorded in many procedures, usage of surgical videos for automated surgical workflow understanding is still limited. In this work, we exploit scene graphs as a more holistic, semantically meaningful and human-readable way to represent surgical videos while encoding all anatomical structures, tools, and their interactions. To properly evaluate the impact of our solutions, we create a scene graph dataset from semantic segmentations from the CaDIS and CATARACTS datasets. We demonstrate that scene graphs can be leveraged through the use of graph convolutional networks (GCNs) to tackle surgical downstream tasks such as surgical workflow recognition with competitive performance. Moreover, we demonstrate the benefits of surgical scene graphs regarding the explainability and robustness of model decisions, which are crucial in the clinical setting.
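    A minimal sketch of one graph-convolution step over a tiny scene graph, using the standard normalized-adjacency GCN propagation rule; the node semantics (tools, anatomy) and sizes are illustrative.

```python
import torch

def gcn_layer(adj, feats, weight):
    """One GCN propagation step: D^-1/2 (A + I) D^-1/2 X W, then ReLU."""
    a_hat = adj + torch.eye(adj.shape[0])
    deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
    norm_adj = deg_inv_sqrt[:, None] * a_hat * deg_inv_sqrt[None, :]
    return torch.relu(norm_adj @ feats @ weight)

# Tiny graph: instrument -- anatomy -- instrument, with 16-d node features.
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
feats = torch.randn(3, 16)
print(gcn_layer(adj, feats, torch.randn(16, 8)).shape)  # torch.Size([3, 8])
```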

Pixel-Grounded Prototypical Part Networks

  • paper_url: http://arxiv.org/abs/2309.14531
  • repo_url: None
  • paper_authors: Zachariah Carmichael, Suhas Lohit, Anoop Cherian, Michael Jones, Walter Scheirer
  • for: This paper aims to improve the interpretability of prototypical part networks (ProtoPartNNs) and to fix their misleading part localization.
  • methods: New receptive-field-based architectural constraints and a principled pixel-space mapping enable meaningful localization, alongside a simplified classification head and corrections to PROTOPNET and its derivatives (such as using a validation set, rather than a test set, to evaluate generalization during training).
  • results: The proposed PIXPNET quantifiably improves interpretability without sacrificing accuracy, and is the only ProtoPartNN that truly learns and localizes to prototypical object parts.
    Abstract Prototypical part neural networks (ProtoPartNNs), namely PROTOPNET and its derivatives, are an intrinsically interpretable approach to machine learning. Their prototype learning scheme enables intuitive explanations of the form, this (prototype) looks like that (testing image patch). But, does this actually look like that? In this work, we delve into why object part localization and associated heat maps in past work are misleading. Rather than localizing to object parts, existing ProtoPartNNs localize to the entire image, contrary to generated explanatory visualizations. We argue that detraction from these underlying issues is due to the alluring nature of visualizations and an over-reliance on intuition. To alleviate these issues, we devise new receptive field-based architectural constraints for meaningful localization and a principled pixel space mapping for ProtoPartNNs. To improve interpretability, we propose additional architectural improvements, including a simplified classification head. We also make additional corrections to PROTOPNET and its derivatives, such as the use of a validation set, rather than a test set, to evaluate generalization during training. Our approach, PIXPNET (Pixel-grounded Prototypical part Network), is the only ProtoPartNN that truly learns and localizes to prototypical object parts. We demonstrate that PIXPNET achieves quantifiably improved interpretability without sacrificing accuracy.

UniBEV: Multi-modal 3D Object Detection with Uniform BEV Encoders for Robustness against Missing Sensor Modalities

  • paper_url: http://arxiv.org/abs/2309.14516
  • repo_url: None
  • paper_authors: Shiming Wang, Holger Caesar, Liangliang Nan, Julian F. P. Kooij
  • for: Improving the robustness of multi-sensor 3D object detection for autonomous driving against missing sensor modalities (e.g., a sudden sensor failure).
  • methods: UniBEV, an end-to-end multi-modal 3D object detection framework that operates on LiDAR plus camera, LiDAR-only, or camera-only input without retraining; all sensor modalities follow a uniform approach to resample features from the native sensor coordinate systems into well-aligned Bird's Eye View (BEV) feature maps, and several fusion strategies are studied, including feature concatenation, channel-wise averaging, and a weighted-averaging generalization termed Channel Normalized Weights (sketched after the abstract).
  • results: On nuScenes, UniBEV achieves 52.5% mAP on average over all sensor input combinations, significantly improving over the baselines (43.5% mAP on average for BEVFusion, 48.7% mAP for MetaBEV); an ablation study shows the robustness benefits of fusing by weighted averaging over regular concatenation, and of sharing queries between the per-modality BEV encoders.
    Abstract Multi-sensor object detection is an active research topic in automated driving, but the robustness of such detection models against missing sensor input (modality missing), e.g., due to a sudden sensor failure, is a critical problem which remains under-studied. In this work, we propose UniBEV, an end-to-end multi-modal 3D object detection framework designed for robustness against missing modalities: UniBEV can operate on LiDAR plus camera input, but also on LiDAR-only or camera-only input without retraining. To facilitate its detector head to handle different input combinations, UniBEV aims to create well-aligned Bird's Eye View (BEV) feature maps from each available modality. Unlike prior BEV-based multi-modal detection methods, all sensor modalities follow a uniform approach to resample features from the native sensor coordinate systems to the BEV features. We furthermore investigate the robustness of various fusion strategies w.r.t. missing modalities: the commonly used feature concatenation, but also channel-wise averaging, and a generalization to weighted averaging termed Channel Normalized Weights. To validate its effectiveness, we compare UniBEV to state-of-the-art BEVFusion and MetaBEV on nuScenes over all sensor input combinations. In this setting, UniBEV achieves $52.5 \%$ mAP on average over all input combinations, significantly improving over the baselines ($43.5 \%$ mAP on average for BEVFusion, $48.7 \%$ mAP on average for MetaBEV). An ablation study shows the robustness benefits of fusing by weighted averaging over regular concatenation, and of sharing queries between the BEV encoders of each modality. Our code will be released upon paper acceptance.
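    A minimal sketch of fusing per-modality BEV feature maps by weighted averaging with learned, per-channel normalized weights, in the spirit of the Channel Normalized Weights generalization named in the abstract; shapes and parameterization are assumptions, not the UniBEV implementation.

```python
import torch
import torch.nn as nn

class CNWFusion(nn.Module):
    """Fuse per-modality BEV maps by channel-wise normalized weighted averaging."""
    def __init__(self, n_modalities=2, channels=64):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_modalities, channels))

    def forward(self, bev_maps):  # list of (B, C, H, W), one per available modality
        w = torch.softmax(self.logits[: len(bev_maps)], dim=0)  # sums to 1 per channel
        stacked = torch.stack(bev_maps)                          # (M, B, C, H, W)
        return (w[:, None, :, None, None] * stacked).sum(dim=0)

fuse = CNWFusion()
lidar_bev = torch.randn(1, 64, 128, 128)
camera_bev = torch.randn(1, 64, 128, 128)
print(fuse([lidar_bev, camera_bev]).shape)  # also works with a single modality
```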

Accurate and Interactive Visual-Inertial Sensor Calibration with Next-Best-View and Next-Best-Trajectory Suggestion

  • paper_url: http://arxiv.org/abs/2309.14514
  • repo_url: https://github.com/chutsu/yac
  • paper_authors: Christopher L. Choi, Binbin Xu, Stefan Leutenegger
  • for: This work helps non-expert users collect informative calibration data for visual-inertial (VI) sensors used in computer vision and state-estimation tasks.
  • methods: A novel VI calibration pipeline guides the user through a graphical user interface, using information theory to suggest the Next-Best-View and Next-Best-Trajectory for calibrating the intrinsics, extrinsics, and temporal misalignment of a VI sensor.
  • results: Experiments show the method is faster, more accurate, and more consistent than state-of-the-art alternatives, and its calibrations yield higher-accuracy estimates when used by state-of-the-art VI odometry and VI-SLAM approaches.
    Abstract Visual-Inertial (VI) sensors are popular in robotics, self-driving vehicles, and augmented and virtual reality applications. In order to use them for any computer vision or state-estimation task, a good calibration is essential. However, collecting informative calibration data in order to render the calibration parameters observable is not trivial for a non-expert. In this work, we introduce a novel VI calibration pipeline that guides a non-expert with the use of a graphical user interface and information theory in collecting informative calibration data with Next-Best-View and Next-Best-Trajectory suggestions to calibrate the intrinsics, extrinsics, and temporal misalignment of a VI sensor. We show through experiments that our method is faster, more accurate, and more consistent than state-of-the-art alternatives. Specifically, we show how calibrations with our proposed method achieve higher accuracy estimation results when used by state-of-the-art VI Odometry as well as VI-SLAM approaches. The source code of our software can be found on: https://github.com/chutsu/yac.

Assessment of a new GeoAI foundation model for flood inundation mapping

  • paper_url: http://arxiv.org/abs/2309.14500
  • repo_url: None
  • paper_authors: Wenwen Li, Hyunho Lee, Sizhe Wang, Chia-Yu Hsu, Samantha T. Arundel
  • for: This study assesses the performance of IBM-NASA's Prithvi foundation model on flood inundation mapping, a crucial geospatial analysis task.
  • methods: Prithvi is compared with convolutional neural network and vision transformer-based architectures in terms of mapping accuracy for flooded areas, using the Sen1Floods11 benchmark dataset; predictability, generalizability, and transferability are evaluated on both a test dataset and a dataset completely unseen by the model.
  • results: Results show the good transferability of the Prithvi model, highlighting its performance advantages in segmenting flooded areas in previously unseen regions, while also indicating areas for improvement such as multi-scale representation learning, more end-to-end pipelines for high-level image analysis, and more flexible input data bands.
    Abstract Vision foundation models are a new frontier in Geospatial Artificial Intelligence (GeoAI), an interdisciplinary research area that applies and extends AI for geospatial problem solving and geographic knowledge discovery, because of their potential to enable powerful image analysis by learning and extracting important image features from vast amounts of geospatial data. This paper evaluates the performance of the first-of-its-kind geospatial foundation model, IBM-NASA's Prithvi, to support a crucial geospatial analysis task: flood inundation mapping. This model is compared with convolutional neural network and vision transformer-based architectures in terms of mapping accuracy for flooded areas. A benchmark dataset, Sen1Floods11, is used in the experiments, and the models' predictability, generalizability, and transferability are evaluated based on both a test dataset and a dataset that is completely unseen by the model. Results show the good transferability of the Prithvi model, highlighting its performance advantages in segmenting flooded areas in previously unseen regions. The findings also indicate areas for improvement for the Prithvi model in terms of adopting multi-scale representation learning, developing more end-to-end pipelines for high-level image analysis tasks, and offering more flexibility in terms of input data bands.

Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator

  • paper_url: http://arxiv.org/abs/2309.14494
  • repo_url: https://github.com/soolab/free-bloom
  • paper_authors: Hanzhuo Huang, Yufan Feng, Cheng Shi, Lan Xu, Jingyi Yu, Sibei Yang
  • for: Zero-shot, data- and cost-efficient text-to-video generation that produces semantically coherent, high-quality videos without any video data or training.
  • methods: The proposed Free-Bloom pipeline harnesses a large language model (LLM) as the director to generate a semantically coherent prompt sequence and a pre-trained latent diffusion model (LDM) as the animator to generate high-fidelity frames; to ensure temporal and identical coherence, the reverse diffusion process is modified with joint noise sampling, step-aware attention shift, and dual-path interpolation.
  • results: Without any video data or training, Free-Bloom generates vivid, high-quality videos with semantically meaningful frame sequences depicting complex scenes, and is naturally compatible with LDM-based extensions.
    Abstract Text-to-video is a rapidly growing research area that aims to generate a semantic, identical, and temporal coherence sequence of frames that accurately align with the input text prompt. This study focuses on zero-shot text-to-video generation considering the data- and cost-efficient. To generate a semantic-coherent video, exhibiting a rich portrayal of temporal semantics such as the whole process of flower blooming rather than a set of "moving images", we propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence, while pre-trained latent diffusion models (LDMs) as the animator to generate the high fidelity frames. Furthermore, to ensure temporal and identical coherence while maintaining semantic coherence, we propose a series of annotative modifications to adapting LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation. Without any video data and training requirements, Free-Bloom generates vivid and high-quality videos, awe-inspiring in generating complex scenes with semantic meaningful frame sequences. In addition, Free-Bloom is naturally compatible with LDMs-based extensions.

AiAReSeg: Catheter Detection and Segmentation in Interventional Ultrasound using Transformers

  • paper_url: http://arxiv.org/abs/2309.14492
  • repo_url: None
  • paper_authors: Alex Ranne, Yordanka Velikova, Nassir Navab, Ferdinando Rodriguez y Baena
  • for: This paper proposes a deep-learning network to detect and segment catheters in interventional ultrasound image sequences.
  • methods: A transformer architecture inspired by the Attention-in-Attention mechanism and temporal tracking networks, with a novel 3D segmentation head that performs 3D deconvolution across time; training is enabled by a data synthesis pipeline combining physics-based catheter insertion simulations with a convolutional ray-casting ultrasound simulator.
  • results: The method is validated on a hold-out dataset, demonstrating robustness to ultrasound noise and a wide range of scanning angles, and tested on data from silicon-based aorta phantoms, demonstrating its potential for sim-to-real translation.
    Abstract To date, endovascular surgeries are performed using the golden standard of Fluoroscopy, which uses ionising radiation to visualise catheters and vasculature. Prolonged Fluoroscopic exposure is harmful for the patient and the clinician, and may lead to severe post-operative sequlae such as the development of cancer. Meanwhile, the use of interventional Ultrasound has gained popularity, due to its well-known benefits of small spatial footprint, fast data acquisition, and higher tissue contrast images. However, ultrasound images are hard to interpret, and it is difficult to localise vessels, catheters, and guidewires within them. This work proposes a solution using an adaptation of a state-of-the-art machine learning transformer architecture to detect and segment catheters in axial interventional Ultrasound image sequences. The network architecture was inspired by the Attention in Attention mechanism, temporal tracking networks, and introduced a novel 3D segmentation head that performs 3D deconvolution across time. In order to facilitate training of such deep learning networks, we introduce a new data synthesis pipeline that used physics-based catheter insertion simulations, along with a convolutional ray-casting ultrasound simulator to produce synthetic ultrasound images of endovascular interventions. The proposed method is validated on a hold-out validation dataset, thus demonstrated robustness to ultrasound noise and a wide range of scanning angles. It was also tested on data collected from silicon-based aorta phantoms, thus demonstrated its potential for translation from sim-to-real. This work represents a significant step towards safer and more efficient endovascular surgery using interventional ultrasound.

Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.14491
  • repo_url: None
  • paper_authors: Mahyar Najibi, Jingwei Ji, Yin Zhou, Charles R. Qi, Xinchen Yan, Scott Ettinger, Dragomir Anguelov
  • for: This paper addresses open-set 3D perception, which is critical for safety in autonomous driving, where new object types can be encountered after deployment.
  • methods: A multi-modal auto-labeling pipeline generates amodal 3D bounding boxes and tracklets without 3D human labels, exploiting motion cues in point cloud sequences together with freely available 2D image-text pairs to identify and track all traffic participants, both static and moving, and to output open-vocabulary semantic labels via vision-language knowledge distillation.
  • results: Experiments on the Waymo Open Dataset show the approach outperforms prior work by significant margins on various unsupervised 3D perception tasks.
    Abstract Closed-set 3D perception models trained on only a pre-defined set of object categories can be inadequate for safety critical applications such as autonomous driving where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants. Compared to the recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method can handle both static and moving objects in the unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language knowledge distillation. Experiments on the Waymo Open Dataset show that our approach outperforms the prior work by significant margins on various unsupervised 3D perception tasks.

Gastro-Intestinal Tract Segmentation Using an Explainable 3D Unet

  • paper_url: http://arxiv.org/abs/2309.14474
  • repo_url: None
  • paper_authors: Kai Li, Jonathan Chan
  • for: This paper addresses gastrointestinal organ segmentation for radiotherapy of gastrointestinal cancer, where the radiation oncologist must deliver high doses of radiation to the tumor while avoiding the stomach and intestines.
  • methods: A deep learning (DL) segmentation pipeline that incorporates Explainable AI (XAI) to improve the transparency and viability of the model.
  • results: An explainable segmentation pipeline that can automate and expedite organ outlining, helping radiation oncologists treat patients more quickly.
    Abstract In treating gastrointestinal cancer using radiotherapy, the role of the radiation oncologist is to administer high doses of radiation, through x-ray beams, toward the tumor while avoiding the stomach and intestines. With the advent of precise radiation treatment technology such as the MR-Linac, oncologists can visualize the daily positions of the tumors and intestines, which may vary day to day. Before delivering radiation, radio oncologists must manually outline the position of the gastrointestinal organs in order to determine position and direction of the x-ray beam. This is a time consuming and labor intensive process that may substantially prolong a patient's treatment. A deep learning (DL) method can automate and expedite the process. However, many deep neural networks approaches currently in use are black-boxes which lack interpretability which render them untrustworthy and impractical in a healthcare setting. To address this, an emergent field of AI known as Explainable AI (XAI) may be incorporated to improve the transparency and viability of a model. This paper proposes a deep learning pipeline that incorporates XAI to address the challenges of organ segmentation.

FARSEC: A Reproducible Framework for Automatic Real-Time Vehicle Speed Estimation Using Traffic Cameras

  • paper_url: http://arxiv.org/abs/2309.14468
  • repo_url: https://github.com/porscheofficial/speed-estimation-traffic-monitoring
  • paper_authors: Lucas Liebe, Franz Sauerwald, Sylwester Sawicki, Matthias Schneider, Leo Schuhmann, Tolga Buz, Paul Boes, Ahmad Ahmadov, Gerard de Melo
  • for: This work provides a reproducible framework for automatic real-time vehicle speed estimation from traffic cameras, supporting traffic surveillance and management.
  • methods: The model estimates the length of road segments via depth map prediction and automatically handles realistic conditions such as camera movements and different video stream inputs (the core speed computation is sketched after the abstract).
  • results: Compared with three well-known models on their benchmark datasets, the model achieves competitive results on realistic CCTV videos while offering more consistent results, an easier implementation, and better compatibility.
    Abstract Estimating the speed of vehicles using traffic cameras is a crucial task for traffic surveillance and management, enabling more optimal traffic flow, improved road safety, and lower environmental impact. Transportation-dependent systems, such as for navigation and logistics, have great potential to benefit from reliable speed estimation. While there is prior research in this area reporting competitive accuracy levels, their solutions lack reproducibility and robustness across different datasets. To address this, we provide a novel framework for automatic real-time vehicle speed calculation, which copes with more diverse data from publicly available traffic cameras to achieve greater robustness. Our model employs novel techniques to estimate the length of road segments via depth map prediction. Additionally, our framework is capable of handling realistic conditions such as camera movements and different video stream inputs automatically. We compare our model to three well-known models in the field using their benchmark datasets. While our model does not set a new state of the art regarding prediction performance, the results are competitive on realistic CCTV videos. At the same time, our end-to-end pipeline offers more consistent results, an easier implementation, and better compatibility. Its modular structure facilitates reproducibility and future improvements.
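    A minimal sketch of the core computation, assuming a pinhole back-projection of two tracked vehicle positions through a predicted per-pixel depth map; the camera intrinsics (fx, fy, cx, cy) and the flat depth map are toy assumptions, not FARSEC's calibration.

```python
import numpy as np

def pixel_to_metres(u, v, depth, fx=1000.0, fy=1000.0, cx=960.0, cy=540.0):
    """Back-project pixel (u, v) into 3D using a per-pixel depth map (metres)."""
    z = depth[v, u]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def speed_kmh(p1_px, p2_px, depth1, depth2, dt_seconds):
    p1 = pixel_to_metres(*p1_px, depth1)
    p2 = pixel_to_metres(*p2_px, depth2)
    return np.linalg.norm(p2 - p1) / dt_seconds * 3.6  # m/s -> km/h

depth = np.full((1080, 1920), 30.0)  # toy flat depth map, 30 m everywhere
print(speed_kmh((900, 500), (1100, 500), depth, depth, 0.5))  # 43.2
```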

Chop & Learn: Recognizing and Generating Object-State Compositions

  • paper_url: http://arxiv.org/abs/2309.14339
  • repo_url: None
  • paper_authors: Nirat Saini, Hanyu Wang, Archana Swaminathan, Vinoj Jayasundara, Bo He, Kamal Gupta, Abhinav Shrivastava
  • for: This paper studies cutting objects in different styles and the resulting object state changes, a compositional task that is hard to generalize to unseen compositions.
  • methods: A new benchmark suite, Chop & Learn, supports learning objects and different cut styles from multiple viewpoints; a new task of Compositional Image Generation transfers learned cut styles to different objects by generating novel object-state images.
  • results: The videos are also used for Compositional Action Recognition, demonstrating valuable uses of this dataset for multiple video tasks. Project website: https://chopnlearn.github.io.
    Abstract Recognizing and generating object-state compositions has been a challenging task, especially when generalizing to unseen compositions. In this paper, we study the task of cutting objects in different styles and the resulting object state changes. We propose a new benchmark suite Chop & Learn, to accommodate the needs of learning objects and different cut styles using multiple viewpoints. We also propose a new task of Compositional Image Generation, which can transfer learned cut styles to different objects, by generating novel object-state images. Moreover, we also use the videos for Compositional Action Recognition, and show valuable uses of this dataset for multiple video tasks. Project website: https://chopnlearn.github.io.

3D Indoor Instance Segmentation in an Open-World

  • paper_url: http://arxiv.org/abs/2309.14338
  • repo_url: https://github.com/aminebdj/3d-owis
  • paper_authors: Mohamed El Amine Boudjoghra, Salwa K. Al Khatib, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan
  • for: 3D indoor instance segmentation in an open-world setting, where the model can distinguish known classes and identify unknown objects
  • methods: use an auto-labeling scheme to produce pseudo-labels during training, and adjust the unknown class probability based on objectness score distribution
  • results: promising open-world 3D instance segmentation performance with carefully curated open-world splits
    Abstract Existing 3D instance segmentation methods typically assume that all semantic classes to be segmented would be available during training and only seen categories are segmented at inference. We argue that such a closed-world assumption is restrictive and explore for the first time 3D indoor instance segmentation in an open-world setting, where the model is allowed to distinguish a set of known classes as well as identify an unknown object as unknown and then later incrementally learn the semantic category of the unknown when the corresponding category labels are available. To this end, we introduce an open-world 3D indoor instance segmentation method, where an auto-labeling scheme is employed to produce pseudo-labels during training and induce separation between known and unknown category labels. We further improve the pseudo-labels quality at inference by adjusting the unknown class probability based on the objectness score distribution. We also introduce carefully curated open-world splits leveraging realistic scenarios based on inherent object distribution, region-based indoor scene exploration and randomness aspect of open-world classes. Extensive experiments reveal the efficacy of the proposed contributions leading to promising open-world 3D instance segmentation performance.
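The abstract mentions adjusting the unknown-class probability at inference based on the objectness score distribution. A toy sketch of one such calibration, under the assumption that proposals with unusually low objectness are more likely unknown; the paper's exact rule may differ:

```python
import numpy as np

def calibrate_unknown(class_probs, objectness, unknown_idx=-1, temperature=1.0):
    """Boost the unknown-class probability of low-objectness proposals,
    then renormalize (illustrative calibration, not the published rule).

    class_probs : (N, C) softmax scores; column `unknown_idx` is the unknown class
    objectness  : (N,) objectness scores of the N proposals
    """
    # Standardize objectness against the batch distribution; proposals far
    # below the mean get their unknown probability increased.
    z = (objectness - objectness.mean()) / (objectness.std() + 1e-8)
    boost = 1.0 / (1.0 + np.exp(z / temperature))  # in (0, 1), high for low objectness
    probs = class_probs.copy()
    probs[:, unknown_idx] *= 1.0 + boost
    return probs / probs.sum(axis=1, keepdims=True)

probs = np.array([[0.70, 0.20, 0.10],   # confident known proposal
                  [0.40, 0.30, 0.30]])  # ambiguous proposal
obj = np.array([0.9, 0.2])
print(calibrate_unknown(probs, obj).round(3))
```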

Noise-in, Bias-out: Balanced and Real-time MoCap Solving

  • paper_url: http://arxiv.org/abs/2309.14330
  • repo_url: None
  • paper_authors: Georgios Albanis, Nikolaos Zioulis, Spyridon Thermos, Anargyros Chatzitofis, Kostas Kolomvatsos
  • for: This paper aims to improve the accuracy and robustness of real-time optical motion capture (MoCap) systems by using machine learning to solve noisy, unstructured marker estimates.
  • methods: It applies machine learning techniques, including representation learning and a label-free approach to imbalanced regression, and exploits advances in marker-less MoCap to acquire the necessary training data.
  • results: Experiments show that the method delivers accurate and robust real-time marker solving, performing well even on rare and challenging poses. Project page: https://moverseai.github.io/noise-tail
    Abstract Real-time optical Motion Capture (MoCap) systems have not benefited from the advances in modern data-driven modeling. In this work we apply machine learning to solve noisy unstructured marker estimates in real-time and deliver robust marker-based MoCap even when using sparse affordable sensors. To achieve this we focus on a number of challenges related to model training, namely the sourcing of training data and their long-tailed distribution. Leveraging representation learning we design a technique for imbalanced regression that requires no additional data or labels and improves the performance of our model in rare and challenging poses. By relying on a unified representation, we show that training such a model is not bound to high-end MoCap training data acquisition, and exploit the advances in marker-less MoCap to acquire the necessary data. Finally, we take a step towards richer and affordable MoCap by adapting a body model-based inverse kinematics solution to account for measurement and inference uncertainty, further improving performance and robustness. Project page: https://moverseai.github.io/noise-tail

DeepMesh: Mesh-based Cardiac Motion Tracking using Deep Learning

  • paper_url: http://arxiv.org/abs/2309.14306
  • repo_url: None
  • paper_authors: Qingjie Meng, Wenjia Bai, Declan P O’Regan, and Daniel Rueckert
  • for: This paper targets the assessment of cardiac function and the diagnosis of cardiovascular disease from cine cardiac magnetic resonance (CMR) images.
  • methods: It proposes DeepMesh, a deep learning model that propagates a template heart mesh to subject space and estimates the 3D motion of the heart mesh from CMR images for individual subjects.
  • results: Experiments show that DeepMesh quantitatively and qualitatively outperforms other image-based and mesh-based cardiac motion tracking methods for 3D motion estimation of the left ventricle.
    Abstract 3D motion estimation from cine cardiac magnetic resonance (CMR) images is important for the assessment of cardiac function and the diagnosis of cardiovascular diseases. Current state-of-the-art methods focus on estimating dense pixel-/voxel-wise motion fields in image space, which ignores the fact that motion estimation is only relevant and useful within the anatomical objects of interest, e.g., the heart. In this work, we model the heart as a 3D mesh consisting of epi- and endocardial surfaces. We propose a novel learning framework, DeepMesh, which propagates a template heart mesh to a subject space and estimates the 3D motion of the heart mesh from CMR images for individual subjects. In DeepMesh, the heart mesh of the end-diastolic frame of an individual subject is first reconstructed from the template mesh. Mesh-based 3D motion fields with respect to the end-diastolic frame are then estimated from 2D short- and long-axis CMR images. By developing a differentiable mesh-to-image rasterizer, DeepMesh is able to leverage 2D shape information from multiple anatomical views for 3D mesh reconstruction and mesh motion estimation. The proposed method estimates vertex-wise displacement and thus maintains vertex correspondences between time frames, which is important for the quantitative assessment of cardiac function across different subjects and populations. We evaluate DeepMesh on CMR images acquired from the UK Biobank. We focus on 3D motion estimation of the left ventricle in this work. Experimental results show that the proposed method quantitatively and qualitatively outperforms other image-based and mesh-based cardiac motion tracking methods.

Regress Before Construct: Regress Autoencoder for Point Cloud Self-supervised Learning

  • paper_url: http://arxiv.org/abs/2310.03670
  • repo_url: https://github.com/liuyyy111/point-rae
  • paper_authors: Yang Liu, Chen Chen, Can Wang, Xulin King, Mengyuan Liu
  • for: This paper proposes Point Regress AutoEncoder (Point-RAE), a new regressive autoencoder scheme for self-supervised learning on 3D point clouds.
  • methods: The model uses a mask regressor to predict masked patch representations from the visible patch representations, together with an alignment constraint that keeps the predicted masked-patch representations consistent with those computed by the encoder.
  • results: The model performs strongly on downstream tasks such as ScanObjectNN and ModelNet40; specifically, the pre-trained models achieve 90.28% accuracy on the ScanObjectNN hardest split and 94.1% on ModelNet40, surpassing all other self-supervised learning methods.
    Abstract Masked Autoencoders (MAE) have demonstrated promising performance in self-supervised learning for both 2D and 3D computer vision. Nevertheless, existing MAE-based methods still have certain drawbacks. Firstly, the functional decoupling between the encoder and decoder is incomplete, which limits the encoder's representation learning ability. Secondly, downstream tasks solely utilize the encoder, failing to fully leverage the knowledge acquired through the encoder-decoder architecture in the pre-text task. In this paper, we propose Point Regress AutoEncoder (Point-RAE), a new scheme for regressive autoencoders for point cloud self-supervised learning. The proposed method decouples functions between the decoder and the encoder by introducing a mask regressor, which predicts the masked patch representation from the visible patch representation encoded by the encoder, and the decoder reconstructs the target from the predicted masked patch representation. By doing so, we minimize the impact of decoder updates on the representation space of the encoder. Moreover, we introduce an alignment constraint to ensure that the representations for masked patches, predicted from the encoded representations of visible patches, are aligned with the masked patch representations computed from the encoder. To make full use of the knowledge learned in the pre-training stage, we design a new finetune mode for the proposed Point-RAE. Extensive experiments demonstrate that our approach is efficient during pre-training and generalizes well on various downstream tasks. Specifically, our pre-trained models achieve a high accuracy of 90.28% on the ScanObjectNN hardest split and 94.1% accuracy on ModelNet40, surpassing all the other self-supervised learning methods. Our code and pretrained model are publicly available at: https://github.com/liuyyy111/Point-RAE.
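To make the mask-regressor idea concrete, here is a minimal sketch of the two objectives the abstract describes: the regressor predicts masked-patch representations from visible ones, an alignment term ties those predictions to the encoder's own masked-patch representations, and the decoder reconstructs the target from the predictions rather than from encoder output. The toy modules, matching patch counts, and exact loss forms are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def point_rae_losses(encoder, mask_regressor, decoder,
                     visible_patches, masked_patches, targets):
    """One step of a Point-RAE-style objective (sketch, not the published code)."""
    h_vis = encoder(visible_patches)              # encode visible patches only
    pred_mask = mask_regressor(h_vis)             # predict masked-patch representations

    with torch.no_grad():                         # alignment target: the encoder's own
        ref_mask = encoder(masked_patches)        # masked-patch representations

    align_loss = F.mse_loss(pred_mask, ref_mask)  # keep predictions on the encoder's
                                                  # representation manifold
    recon = decoder(pred_mask)                    # reconstruct from *predictions*, so
    recon_loss = F.mse_loss(recon, targets)       # decoder updates barely touch the
                                                  # encoder's latent space
    return recon_loss + align_loss

# Toy stand-ins; the real model uses transformer blocks and mask queries,
# so visible/masked patch counts need not match as they do here.
enc, reg, dec = torch.nn.Linear(32, 64), torch.nn.Linear(64, 64), torch.nn.Linear(64, 96)
loss = point_rae_losses(enc, reg, dec,
                        torch.randn(2, 8, 32), torch.randn(2, 8, 32),
                        torch.randn(2, 8, 96))
print(loss.item())
```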

Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.14303
  • repo_url: https://github.com/vinairesearch/dataset-diffusion
  • paper_authors: Quang Nguyen, Truong Vu, Anh Tran, Khoi Nguyen
  • for: This paper aims to address the labor-intensive task of preparing training data for deep vision models by proposing a novel method for generating pixel-level semantic segmentation labels using a text-to-image generative model.
  • methods: The proposed method utilizes the text prompts, cross-attention, and self-attention of the Stable Diffusion (SD) model to generate segmentation maps corresponding to synthetic images, introducing three new techniques: class-prompt appending, class-prompt cross-attention, and self-attention exponentiation.
  • results: The proposed approach significantly outperforms concurrent work on two datasets, PASCAL VOC and MSCOCO, and provides a reliable way to generate pixel-level semantic segmentation labels without the need for labor-intensive pixel-wise annotation.
    Abstract Preparing training data for deep vision models is a labor-intensive task. To address this, generative models have emerged as an effective solution for generating synthetic data. While current generative models produce image-level category labels, we propose a novel method for generating pixel-level semantic segmentation labels using the text-to-image generative model Stable Diffusion (SD). By utilizing the text prompts, cross-attention, and self-attention of SD, we introduce three new techniques: class-prompt appending, class-prompt cross-attention, and self-attention exponentiation. These techniques enable us to generate segmentation maps corresponding to synthetic images. These maps serve as pseudo-labels for training semantic segmenters, eliminating the need for labor-intensive pixel-wise annotation. To account for the imperfections in our pseudo-labels, we incorporate uncertainty regions into the segmentation, allowing us to disregard loss from those regions. We conduct evaluations on two datasets, PASCAL VOC and MSCOCO, and our approach significantly outperforms concurrent work. Our benchmarks and code will be released at https://github.com/VinAIResearch/Dataset-Diffusion
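The "self-attention exponentiation" technique suggests propagating per-class cross-attention evidence through repeated spatial self-attention before taking an argmax. A rough numpy sketch under that reading; the exponent, normalization, and map shapes are illustrative assumptions:

```python
import numpy as np

def segmentation_from_attention(cross_attn, self_attn, power=4):
    """Turn diffusion attention maps into a pseudo segmentation map (sketch).

    cross_attn : (HW, C) attention from each spatial location to C class tokens
    self_attn  : (HW, HW) row-stochastic spatial self-attention
    power      : self-attention exponent; higher powers diffuse class evidence
                 along visually coherent regions before the argmax
    """
    prop = np.linalg.matrix_power(self_attn, power)  # multi-hop spatial affinity
    scores = prop @ cross_attn                       # refined per-class scores
    return scores.argmax(axis=1)                     # (HW,) pseudo labels

hw, c = 16, 3
sa = np.random.rand(hw, hw)
sa /= sa.sum(axis=1, keepdims=True)                  # make rows stochastic
ca = np.random.rand(hw, c)
print(segmentation_from_attention(ca, sa, power=4))
```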

Tiled Multiplane Images for Practical 3D Photography

  • paper_url: http://arxiv.org/abs/2309.14291
  • repo_url: None
  • paper_authors: Numair Khan, Douglas Lanman, Lei Xiao
  • for: This work tackles 3D photography from a single-view image, with useful applications in virtual reality and mobile computing.
  • methods: It estimates the scene as a Multiplane Image (MPI), which can model complex appearance effects, anti-alias depth errors, and synthesize soft edges better than textured meshes or layered depth images.
  • results: The proposed Tiled Multiplane Image (TMPI) splits the MPI into many small tiled regions, each with only a few depth planes, improving computational efficiency; its single-view 3D results are comparable to state-of-the-art single-view MPI methods at lower computational overhead.
    Abstract The task of synthesizing novel views from a single image has useful applications in virtual reality and mobile computing, and a number of approaches to the problem have been proposed in recent years. A Multiplane Image (MPI) estimates the scene as a stack of RGBA layers, and can model complex appearance effects, anti-alias depth errors and synthesize soft edges better than methods that use textured meshes or layered depth images. And unlike neural radiance fields, an MPI can be efficiently rendered on graphics hardware. However, MPIs are highly redundant and require a large number of depth layers to achieve plausible results. Based on the observation that the depth complexity in local image regions is lower than that over the entire image, we split an MPI into many small, tiled regions, each with only a few depth planes. We call this representation a Tiled Multiplane Image (TMPI). We propose a method for generating a TMPI with adaptive depth planes for single-view 3D photography in the wild. Our synthesized results are comparable to state-of-the-art single-view MPI methods while having lower computational overhead.
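The key observation, that depth complexity is lower locally than over the whole image, can be sketched by picking a few depth planes per tile. The quantile-based plane placement below is an illustrative stand-in for the paper's adaptive scheme:

```python
import numpy as np

def tiled_depth_planes(depth, tile=64, planes_per_tile=4):
    """Pick a small set of depth planes for each tile of a depth map (sketch).

    depth : (H, W) depth map; H and W assumed divisible by `tile` for brevity
    Returns {(row, col) tile index: array of depth planes}.
    """
    h, w = depth.shape
    qs = np.linspace(0.0, 1.0, planes_per_tile)
    out = {}
    for i in range(0, h, tile):
        for j in range(0, w, tile):
            local = depth[i:i + tile, j:j + tile]
            # Quantiles adapt plane placement to the tile's own depth range,
            # so each tile needs far fewer planes than a global MPI would.
            out[(i // tile, j // tile)] = np.quantile(local, qs)
    return out

depth = np.random.uniform(1.0, 10.0, size=(128, 128))
planes = tiled_depth_planes(depth)
print(len(planes), "tiles;", planes[(0, 0)].round(2))
```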

CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free

  • paper_url: http://arxiv.org/abs/2309.14289
  • repo_url: https://github.com/wysoczanska/clip-diy
  • paper_authors: Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, Oriane Siméoni
  • for: This paper develops a CLIP-based open-world perception method that achieves zero-shot semantic segmentation.
  • methods: The method requires no additional training or annotations; it directly exploits CLIP's classification abilities on patches of multiple scales, aggregates the decisions into a single map, and further guides the segmentation with foreground/background scores from existing unsupervised object localization methods.
  • results: The approach obtains state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and performs on par with the best methods on COCO.
    Abstract The emergence of CLIP has opened the way for open-world image perception. The zero-shot classification capabilities of the model are impressive but are harder to use for dense tasks such as image segmentation. Several methods have proposed different modifications and learning schemes to produce dense output. Instead, we propose in this work an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which does not require any additional training or annotations, but instead leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP classification abilities on patches of different sizes and aggregates the decision in a single map. We further guide the segmentation using foreground/background scores obtained using unsupervised object localization methods. With our method, we obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.
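A compact sketch of the multi-scale aggregation the abstract describes: classify crops of several sizes with CLIP and average each crop's class scores back into a dense map. Here `clip_classify` is a stand-in for CLIP zero-shot classification over text prompts; the window sizes, stride, and plain averaging are assumptions:

```python
import numpy as np

def dense_scores(image, clip_classify, num_classes, scales=(64, 128), stride=32):
    """Aggregate patch-level CLIP scores into a per-pixel class-score map (sketch).

    clip_classify : callable mapping an image crop to (num_classes,) scores
    """
    h, w = image.shape[:2]
    scores = np.zeros((h, w, num_classes))
    counts = np.zeros((h, w, 1))
    for s in scales:                                # coarse-to-fine patch sizes
        for y in range(0, h - s + 1, stride):
            for x in range(0, w - s + 1, stride):
                p = clip_classify(image[y:y + s, x:x + s])
                scores[y:y + s, x:x + s] += p       # splat the patch decision
                counts[y:y + s, x:x + s] += 1
    return scores / np.maximum(counts, 1)           # averaged dense map

# Toy classifier standing in for CLIP: bright crops favor class 0
fake_clip = lambda crop: np.array([crop.mean() / 255.0, 1 - crop.mean() / 255.0])
img = np.random.randint(0, 256, (128, 128, 3)).astype(float)
seg = dense_scores(img, fake_clip, num_classes=2).argmax(axis=2)
print(seg.shape, np.bincount(seg.ravel()))
```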

Calibration-based Dual Prototypical Contrastive Learning Approach for Domain Generalization Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.14282
  • repo_url: None
  • paper_authors: Muxin Liao, Shishun Tian, Yuhang Zhang, Guoguang Hua, Wenbin Zou, Xia Li
  • for: This study addresses the domain discrepancy problem between learned class-wise features and the prototypes of different domains in domain generalization.
  • methods: It builds on prototypical contrastive learning, treating per-domain class centers as prototypes; an uncertainty-guided PCL estimates an uncertainty probability matrix to calibrate prototype weights, and a hard-weighted PCL calibrates the weights of hard-aligned prototypes, i.e., prototypes of different classes that happen to be similar.
  • results: The method achieves superior performance over current approaches on domain generalization semantic segmentation tasks.
    Abstract Prototypical contrastive learning (PCL) has been widely used to learn class-wise domain-invariant features recently. These methods are based on the assumption that the prototypes, which are represented as the central value of the same class in a certain domain, are domain-invariant. Since the prototypes of different domains have discrepancies as well, the class-wise domain-invariant features learned from the source domain by PCL need to be aligned with the prototypes of other domains simultaneously. However, the prototypes of the same class in different domains may be different while the prototypes of different classes may be similar, which may affect the learning of class-wise domain-invariant features. Based on these observations, a calibration-based dual prototypical contrastive learning (CDPCL) approach is proposed to reduce the domain discrepancy between the learned class-wise features and the prototypes of different domains for domain generalization semantic segmentation. It contains an uncertainty-guided PCL (UPCL) and a hard-weighted PCL (HPCL). Since the domain discrepancies of the prototypes of different classes may be different, we propose an uncertainty probability matrix to represent the domain discrepancies of the prototypes of all the classes. The UPCL estimates the uncertainty probability matrix to calibrate the weights of the prototypes during the PCL. Moreover, considering that the prototypes of different classes may be similar in some circumstances, which means these prototypes are hard-aligned, the HPCL is proposed to generate a hard-weighted matrix to calibrate the weights of the hard-aligned prototypes during the PCL. Extensive experiments demonstrate that our approach achieves superior performance over current approaches on domain generalization semantic segmentation tasks.

SINCERE: Supervised Information Noise-Contrastive Estimation REvisited

  • paper_url: http://arxiv.org/abs/2309.14277
  • repo_url: https://github.com/tufts-ml/supcontrast
  • paper_authors: Patrick Feeney, Michael C. Hughes
  • for: This paper provides a theoretically sound supervised extension of self-supervised contrastive learning, so that models can learn from available class labels.
  • methods: Starting from the InfoNCE loss, it revisits the SupCon loss and proposes the SINCERE loss, which never encourages images from the same class to repel one another in the learned embedding space.
  • results: Comparing the SINCERE and SupCon losses in terms of pretraining learning trajectories and final linear-classifier performance, SINCERE better separates embeddings of different classes while delivering competitive accuracy.
    Abstract The information noise-contrastive estimation (InfoNCE) loss function provides the basis of many self-supervised deep learning methods due to its strong empirical results and theoretic motivation. Previous work suggests a supervised contrastive (SupCon) loss to extend InfoNCE to learn from available class labels. This SupCon loss has been widely-used due to reports of good empirical performance. However, in this work we suggest that the specific SupCon loss formulated by prior work has questionable theoretic justification, because it can encourage images from the same class to repel one another in the learned embedding space. This problematic behavior gets worse as the number of inputs sharing one class label increases. We propose the Supervised InfoNCE REvisited (SINCERE) loss as a remedy. SINCERE is a theoretically justified solution for a supervised extension of InfoNCE that never causes images from the same class to repel one another. We further show that minimizing our new loss is equivalent to maximizing a bound on the KL divergence between class conditional embedding distributions. We compare SINCERE and SupCon losses in terms of learning trajectories during pretraining and in ultimate linear classifier performance after finetuning. Our proposed SINCERE loss better separates embeddings from different classes during pretraining while delivering competitive accuracy.
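The abstract's core claim is easiest to see in code: in SupCon, other same-class samples sit in the softmax denominator and are therefore pushed away from the anchor; a SINCERE-style fix keeps only the one positive plus different-class negatives in the denominator. The sketch below follows that description and is not the published implementation:

```python
import torch
import torch.nn.functional as F

def sincere_style_loss(z, labels, tau=0.1):
    """Supervised contrastive loss whose denominator never contains other
    same-class samples, so same-class embeddings are never repelled (sketch).

    z      : (N, D) embeddings (L2-normalized below)
    labels : (N,) integer class labels
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                            # (N, N) scaled cosine sims
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = same & ~eye                                # positives: same class, not self
    neg = ~same                                      # negatives: other classes only

    losses = []
    for i in range(n):
        for p in torch.where(pos[i])[0]:
            # Denominator = this positive + negatives; *other* positives are
            # excluded, which is the key difference from SupCon.
            logits = torch.cat([sim[i, p].unsqueeze(0), sim[i, neg[i]]])
            losses.append(-F.log_softmax(logits, dim=0)[0])
    return torch.stack(losses).mean()

z = torch.randn(8, 16)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])
print(sincere_style_loss(z, labels).item())
```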

Identity-preserving Editing of Multiple Facial Attributes by Learning Global Edit Directions and Local Adjustments

  • paper_url: http://arxiv.org/abs/2309.14267
  • repo_url: None
  • paper_authors: Najmeh Mohammadbagheri, Fardin Ayar, Ahmad Nickabadi, Reza Safabakhsh
  • for: This work tackles the identity-loss problem in facial attribute editing, proposing a new architecture, ID-Style, trained with two dedicated losses to preserve the identity of the input instance.
  • methods: The architecture comprises a Learnable Global Direction (LGD) module, which finds a shared, semi-sparse direction for each attribute, and an Instance-Aware Intensity Predictor (IAIP) network, which finetunes the global direction according to the input instance; the two training losses push the LGD toward semi-sparse semantic directions and enforce identity preservation.
  • results: Despite a network roughly 95% smaller than comparable state-of-the-art works, ID-Style outperforms baselines by 10% on the identity-preserving metric (FRS) and 7% on the average accuracy of manipulation (mACC).
    Abstract Semantic facial attribute editing using pre-trained Generative Adversarial Networks (GANs) has attracted a great deal of attention and effort from researchers in recent years. Due to the high quality of face images generated by StyleGANs, much work has focused on the StyleGANs' latent space and the proposed methods for facial image editing. Although these methods have achieved satisfying results for manipulating user-intended attributes, they have not fulfilled the goal of preserving the identity, which is an important challenge. We present ID-Style, a new architecture capable of addressing the problem of identity loss during attribute manipulation. The key components of ID-Style include Learnable Global Direction (LGD), which finds a shared and semi-sparse direction for each attribute, and an Instance-Aware Intensity Predictor (IAIP) network, which finetunes the global direction according to the input instance. Furthermore, we introduce two losses during training to enforce the LGD to find semi-sparse semantic directions, which along with the IAIP, preserve the identity of the input instance. Despite reducing the size of the network by roughly 95% as compared to similar state-of-the-art works, it outperforms baselines by 10% and 7% in Identity preserving metric (FRS) and average accuracy of manipulation (mACC), respectively.

Industrial Application of 6D Pose Estimation for Robotic Manipulation in Automotive Internal Logistics

  • paper_url: http://arxiv.org/abs/2309.14265
  • repo_url: None
  • paper_authors: Philipp Quentin, Dino Knoll, Daniel Goehring
  • for: This paper aims to evaluate the current status quo of 6D pose estimation in the context of automotive parts handling tasks, and to identify the challenges and limitations of existing approaches.
  • methods: The authors built a representative 6D pose estimation pipeline using state-of-the-art components, including data generation methods and pose estimators, and evaluated its performance on automotive parts.
  • results: The authors found that the performance of the trained 6D pose estimators was promising, but did not meet industry requirements. They also revealed that the main challenge was the inability of the estimators to provide reliable uncertainties for their poses, rather than the accuracy of the poses themselves. Additionally, the authors compared RGB- and RGB-D-based approaches and showed that they are differently vulnerable to the domain gap induced by synthetic data.
    Abstract Despite the advances in robotics, a large proportion of the parts handling tasks in the automotive industry's internal logistics are not automated but still performed by humans. A key component to competitively automate these processes is a 6D pose estimation that can handle a large number of different parts, is adaptable to new parts with little manual effort, and is sufficiently accurate and robust with respect to industry requirements. In this context, the question arises as to the current status quo with respect to these measures. To address this we built a representative 6D pose estimation pipeline with state-of-the-art components, from economically scalable real to synthetic data generation to pose estimators, and evaluated it on automotive parts with regard to a realistic sequencing process. We found that using the data generation approaches, the performance of the trained 6D pose estimators is promising, but does not meet industry requirements. We reveal that the reason for this is the inability of the estimators to provide reliable uncertainties for their poses, rather than their ability to provide sufficiently accurate poses. In this context we further analyzed how RGB- and RGB-D-based approaches compare against this background and show that they are differently vulnerable to the domain gap induced by synthetic data.

Enhancing Healthcare with EOG: A Novel Approach to Sleep Stage Classification

  • paper_url: http://arxiv.org/abs/2310.03757
  • repo_url: https://github.com/suvadeepmaiti/EOG_Sleep_Stage_classification
  • paper_authors: Suvadeep Maiti, Shivam Kumar Sharma, Raju S. Bapi
  • for: automated sleep stage classification using EOG signals
  • methods: proposed SE-Resnet-Transformer model, 1D-GradCAM, t-SNE plots
  • results: accurate classification of five distinct sleep stages, with macro-F1 scores of 74.72, 70.63, and 69.26 on SleepEDF-20, SleepEDF-78, and SHHS respectively, excelling in identifying REM sleep.
    Abstract We introduce an innovative approach to automated sleep stage classification using EOG signals, addressing the discomfort and impracticality associated with EEG data acquisition. In addition, it is important to note that this approach is untapped in the field, highlighting its potential for novel insights and contributions. Our proposed SE-Resnet-Transformer model provides an accurate classification of five distinct sleep stages from raw EOG signal. Extensive validation on publically available databases (SleepEDF-20, SleepEDF-78, and SHHS) reveals noteworthy performance, with macro-F1 scores of 74.72, 70.63, and 69.26, respectively. Our model excels in identifying REM sleep, a crucial aspect of sleep disorder investigations. We also provide insight into the internal mechanisms of our model using techniques such as 1D-GradCAM and t-SNE plots. Our method improves the accessibility of sleep stage classification while decreasing the need for EEG modalities. This development will have promising implications for healthcare and the incorporation of wearable technology into sleep studies, thereby advancing the field's potential for enhanced diagnostics and patient comfort.

Informative Data Mining for One-Shot Cross-Domain Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.14241
  • repo_url: None
  • paper_authors: Yuxi Wang, Jian Liang, Jun Xiao, Shuqi Mei, Yuran Yang, Zhaoxiang Zhang
  • for: This work provides an effective one-shot domain adaptation method for transferring semantic segmentation from labeled source data to an unlabeled target domain.
  • methods: It introduces Informative Data Mining (IDM), an uncertainty-based selection criterion that identifies the most informative samples, enabling quick adaptation and reducing redundant training; the selected samples are then used for model adaptation via patch-wise mixing and prototype-based information maximization.
  • results: The approach outperforms existing methods in both efficiency and accuracy, achieving new state-of-the-art one-shot performance of 56.7%/55.4% on the GTA5/SYNTHIA to Cityscapes adaptation tasks, respectively.
    Abstract Contemporary domain adaptation offers a practical solution for achieving cross-domain transfer of semantic segmentation between labeled source data and unlabeled target data. These solutions have gained significant popularity; however, they require the model to be retrained when the test environment changes. This can result in unbearable costs in certain applications due to the time-consuming training process and concerns regarding data privacy. One-shot domain adaptation methods attempt to overcome these challenges by transferring the pre-trained source model to the target domain using only one target data. Despite this, the referring style transfer module still faces issues with computation cost and over-fitting problems. To address this problem, we propose a novel framework called Informative Data Mining (IDM) that enables efficient one-shot domain adaptation for semantic segmentation. Specifically, IDM provides an uncertainty-based selection criterion to identify the most informative samples, which facilitates quick adaptation and reduces redundant training. We then perform a model adaptation method using these selected samples, which includes patch-wise mixing and prototype-based information maximization to update the model. This approach effectively enhances adaptation and mitigates the overfitting problem. In general, we provide empirical evidence of the effectiveness and efficiency of IDM. Our approach outperforms existing methods and achieves a new state-of-the-art one-shot performance of 56.7%/55.4% on the GTA5/SYNTHIA to Cityscapes adaptation tasks, respectively. The code will be released at https://github.com/yxiwang/IDM.
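A minimal sketch of an uncertainty-based selection criterion of the kind the abstract describes, using predictive entropy to rank samples. Entropy as the uncertainty measure is an assumption here; the paper's criterion may be more elaborate:

```python
import numpy as np

def select_informative(probs, k):
    """Rank samples by predictive entropy, keep the k most uncertain (sketch).

    probs : (N, C) per-sample class probabilities from the current model
    """
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # (N,)
    return np.argsort(entropy)[::-1][:k]                    # most informative first

probs = np.array([[0.98, 0.01, 0.01],    # confident  -> uninformative
                  [0.34, 0.33, 0.33],    # uncertain  -> informative
                  [0.70, 0.20, 0.10]])
print(select_informative(probs, k=2))    # -> [1 2]
```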

QuadricsNet: Learning Concise Representation for Geometric Primitives in Point Clouds

  • paper_url: http://arxiv.org/abs/2309.14211
  • repo_url: https://github.com/michaelwu99-lab/quadricsnet
  • paper_authors: Ji Wu, Huai Yu, Wen Yang, Gui-Song Xia
  • for: This work proposes a novel framework for learning a concise geometric-primitive representation of 3D point clouds.
  • methods: Quadrics represent diverse primitives with only 10 parameters; the authors propose QuadricsNet, the first end-to-end learning-based framework for parsing quadrics in point clouds, and collect a new pattern-comprehensive dataset of quadrics segments and objects for training and evaluation.
  • results: Experiments demonstrate the effectiveness of the concise representation and the robustness of QuadricsNet. Code: https://github.com/MichaelWu99-lab/QuadricsNet
    Abstract This paper presents a novel framework to learn a concise geometric primitive representation for 3D point clouds. Different from representing each type of primitive individually, we focus on the challenging problem of how to achieve a concise and uniform representation robustly. We employ quadrics to represent diverse primitives with only 10 parameters and propose the first end-to-end learning-based framework, namely QuadricsNet, to parse quadrics in point clouds. The relationships between the quadrics mathematical formulation and geometric attributes, including the type, scale and pose, are insightfully integrated for effective supervision of QuadricsNet. Besides, a novel pattern-comprehensive dataset with quadrics segments and objects is collected for training and evaluation. Experiments demonstrate the effectiveness of our concise representation and the robustness of QuadricsNet. Our code is available at https://github.com/MichaelWu99-lab/QuadricsNet
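The 10 parameters match the coefficients of a general quadric surface x^T A x + b^T x + c = 0, with a symmetric 3x3 matrix A (6 unique entries), a vector b (3 entries), and a scalar c. A small sketch of evaluating how well points fit a quadric under that parameterization; the algebraic residual below is the standard one and not necessarily the paper's training loss:

```python
import numpy as np

def quadric_residuals(params, points):
    """Algebraic residuals of points against x^T A x + b^T x + c = 0.

    params : (10,) = [a11, a12, a13, a22, a23, a33, b1, b2, b3, c]
    points : (N, 3) point cloud
    """
    a11, a12, a13, a22, a23, a33, b1, b2, b3, c = params
    A = np.array([[a11, a12, a13],
                  [a12, a22, a23],
                  [a13, a23, a33]])       # symmetric: 6 unique entries
    b = np.array([b1, b2, b3])
    return np.einsum('ni,ij,nj->n', points, A, points) + points @ b + c

# The unit sphere x^2 + y^2 + z^2 - 1 = 0 expressed as a quadric:
sphere = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0, -1], dtype=float)
pts = np.random.randn(1000, 3)
pts /= np.linalg.norm(pts, axis=1, keepdims=True)    # points on the unit sphere
print(np.abs(quadric_residuals(sphere, pts)).max())  # ~0 up to float error
```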

Automatic Animation of Hair Blowing in Still Portrait Photos

  • paper_url: http://arxiv.org/abs/2309.14207
  • repo_url: None
  • paper_authors: Wenpeng Xiao, Wentao Liu, Yitong Wang, Bernard Ghanem, Bing Li
  • for: This paper automatically animates human hair in still portrait photos.
  • methods: It treats hair-wisp extraction as an instance segmentation problem, using an advanced instance segmentation network to extract meaningful, natural hair wisps, and proposes a wisp-aware animation module that produces pleasing hair motion without noticeable artifacts.
  • results: The method provides the most pleasing and compelling viewing experience in qualitative experiments and outperforms state-of-the-art still-image animation methods by a large margin in quantitative evaluation.
    Abstract We propose a novel approach to animate human hair in a still portrait photo. Existing work has largely studied the animation of fluid elements such as water and fire. However, hair animation for a real image remains underexplored, which is a challenging problem, due to the high complexity of hair structure and dynamics. Considering the complexity of hair structure, we innovatively treat hair wisp extraction as an instance segmentation problem, where a hair wisp is referred to as an instance. With advanced instance segmentation networks, our method extracts meaningful and natural hair wisps. Furthermore, we propose a wisp-aware animation module that animates hair wisps with pleasing motions without noticeable artifacts. The extensive experiments show the superiority of our method. Our method provides the most pleasing and compelling viewing experience in the qualitative experiments and outperforms state-of-the-art still-image animation methods by a large margin in the quantitative evaluation. Project url: https://nevergiveu.github.io/AutomaticHairBlowing/

Detecting and Grounding Multi-Modal Media Manipulation and Beyond

  • paper_url: http://arxiv.org/abs/2309.14203
  • repo_url: https://github.com/rshaojimmy/multimodal-deepfake
  • paper_authors: Rui Shao, Tianxing Wu, Jianlong Wu, Liqiang Nie, Ziwei Liu
  • for: This paper highlights a new research problem for multi-modal fake media, Detecting and Grounding Multi-Modal Media Manipulation (DGM^4), which aims not only to detect the authenticity of multi-modal media but also to ground the manipulated content, requiring deeper reasoning across modalities.
  • methods: It proposes the HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER), combining manipulation-aware contrastive learning between two uni-modal encoders (shallow reasoning) with modality-aware cross-attention via a multi-modal aggregator (deep reasoning); HAMMER++ further adds a manipulation-aware contrastive loss with local views.
  • results: Experiments on the first DGM^4 dataset demonstrate the superiority of HAMMER and HAMMER++ at detecting and grounding manipulation traces in multi-modal media.
    Abstract Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and text fake news detection methods have been proposed, they are only designed for single-modality forgery based on binary classification, let alone analyzing and reasoning subtle forgery traces across different modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM^4). DGM^4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content, which requires deeper reasoning of multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM^4 dataset, where image-text pairs are manipulated by various approaches, with rich annotation of diverse manipulations. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. To exploit more fine-grained contrastive learning for cross-modal semantic alignment, we further integrate Manipulation-Aware Contrastive Loss with Local View and construct a more advanced model HAMMER++. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of HAMMER and HAMMER++.

(Predictable) Performance Bias in Unsupervised Anomaly Detection

  • paper_url: http://arxiv.org/abs/2309.14198
  • repo_url: None
  • paper_authors: Felix Meissen, Svenja Breuer, Moritz Knolle, Alena Buyx, Ruth Müller, Georgios Kaissis, Benedikt Wiestler, Daniel Rückert
  • for: This study examines the fairness of unsupervised anomaly detection (UAD) models in the medical domain.
  • methods: Experiments on three large-scale, publicly available chest X-ray datasets using two state-of-the-art UAD models for medical images; a new subgroup-AUROC (sAUROC) metric is introduced to quantify fairness in machine learning.
  • results: The experiments reveal empirical "fairness laws" (analogous to "scaling laws" for Transformers): linear relationships between a subgroup's representation in the training data and anomaly detection performance within that subgroup. Performance disparities persist even with balanced training data, and compound effects further lower performance for subjects belonging to multiple adversely affected groups.
    Abstract Background: With the ever-increasing amount of medical imaging data, the demand for algorithms to assist clinicians has amplified. Unsupervised anomaly detection (UAD) models promise to aid in the crucial first step of disease detection. While previous studies have thoroughly explored fairness in supervised models in healthcare, for UAD, this has so far been unexplored. Methods: In this study, we evaluated how dataset composition regarding subgroups manifests in disparate performance of UAD models along multiple protected variables on three large-scale publicly available chest X-ray datasets. Our experiments were validated using two state-of-the-art UAD models for medical images. Finally, we introduced a novel subgroup-AUROC (sAUROC) metric, which aids in quantifying fairness in machine learning. Findings: Our experiments revealed empirical "fairness laws" (similar to "scaling laws" for Transformers) for training-dataset composition: Linear relationships between anomaly detection performance within a subpopulation and its representation in the training data. Our study further revealed performance disparities, even in the case of balanced training data, and compound effects that exacerbate the drop in performance for subjects associated with multiple adversely affected groups. Interpretation: Our study quantified the disparate performance of UAD models against certain demographic subgroups. Importantly, we showed that this unfairness cannot be mitigated by balanced representation alone. Instead, the representation of some subgroups seems harder to learn by UAD models than that of others. The empirical fairness laws discovered in our study make disparate performance in UAD models easier to estimate and aid in determining the most desirable dataset composition.
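The subgroup-AUROC (sAUROC) metric can plausibly be computed as an AUROC restricted to each protected subgroup; the sketch below follows that assumption:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auroc(scores, labels, groups):
    """AUROC computed separately within each protected subgroup (sketch).

    scores : (N,) anomaly scores        labels : (N,) binary ground truth
    groups : (N,) subgroup id per sample (e.g. sex or age bin)
    Returns {group: AUROC}, skipping groups with only one class present.
    """
    out = {}
    for g in np.unique(groups):
        m = groups == g
        if len(np.unique(labels[m])) == 2:   # AUROC needs both classes
            out[g] = roc_auc_score(labels[m], scores[m])
    return out

rng = np.random.default_rng(0)
n = 1000
groups = rng.choice(["A", "B"], size=n)
labels = rng.integers(0, 2, size=n)
# Simulate a detector that is better calibrated for group A than for group B:
scores = labels + rng.normal(0, np.where(groups == "A", 0.5, 1.5))
print(subgroup_auroc(scores, labels, groups))  # the gap exposes the bias
```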

LAPP: Layer Adaptive Progressive Pruning for Compressing CNNs from Scratch

  • paper_url: http://arxiv.org/abs/2309.14157
  • repo_url: None
  • paper_authors: Pucheng Zhai, Kailing Guo, Fang Liu, Xiaofen Xing, Xiangmin Xu
  • for: This paper proposes Layer Adaptive Progressive Pruning (LAPP), a framework that gradually compresses a convolutional neural network (CNN) during the first few epochs of training from scratch.
  • methods: LAPP introduces a learnable threshold for each layer together with FLOPs constraints on the network; guided by both the task loss and the FLOPs constraints, the thresholds are dynamically and gradually updated to track changes in importance scores, so the pruning strategy automatically determines an appropriate pruning rate per layer. To preserve the expressive power of pruned layers, a lightweight bypass is added to each convolutional layer before training, at relatively little extra cost.
  • results: The method yields superior performance gains over previous compression methods across datasets and backbones: on CIFAR-10, ResNet-20 is compressed to 40.3% without accuracy drop; on ImageNet, 55.6% of ResNet-18's FLOPs are removed with a 0.21% top-1 and 0.40% top-5 accuracy increase.
    Abstract Structured pruning is a commonly used convolutional neural network (CNN) compression approach. Pruning rate setting is a fundamental problem in structured pruning. Most existing works introduce too many additional learnable parameters to assign different pruning rates across different layers in a CNN, or cannot control the compression rate explicitly. Since a too-narrow network blocks information flow during training, automatic pruning rate setting cannot explore a high pruning rate for a specific layer. To overcome these limitations, we propose a novel framework named Layer Adaptive Progressive Pruning (LAPP), which gradually compresses the network during initial training of a few epochs from scratch. In particular, LAPP designs an effective and efficient pruning strategy that introduces a learnable threshold for each layer and FLOPs constraints for the network. Guided by both task loss and FLOPs constraints, the learnable thresholds are dynamically and gradually updated to accommodate changes of importance scores during training. Therefore the pruning strategy can gradually prune the network and automatically determine the appropriate pruning rates for each layer. What's more, in order to maintain the expressive power of the pruned layer, before training starts, we introduce an additional lightweight bypass for each convolutional layer to be pruned, which adds relatively little additional burden. Our method demonstrates superior performance gains over previous compression methods on various datasets and backbone architectures. For example, on CIFAR-10, our method compresses ResNet-20 to 40.3% without accuracy drop. 55.6% of FLOPs of ResNet-18 are reduced with 0.21% top-1 accuracy increase and 0.40% top-5 accuracy increase on ImageNet.
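The learnable per-layer threshold can be sketched as a differentiable soft gate over channel importance scores, with a FLOPs penalty added to the task loss so thresholds rise where compute is expensive. The sigmoid relaxation, the stand-in importance scores, and the penalty form are assumptions for illustration:

```python
import torch

class SoftChannelPruner(torch.nn.Module):
    """Differentiable per-layer pruning gate (sketch of the LAPP idea).

    Channels whose importance score falls below a learnable threshold are
    softly gated toward zero; `temperature` controls the gate sharpness.
    """
    def __init__(self, num_channels, temperature=0.05):
        super().__init__()
        self.threshold = torch.nn.Parameter(torch.tensor(0.0))
        self.temperature = temperature
        # Stand-in importance scores; real ones could come from e.g. BN scales.
        self.scores = torch.nn.Parameter(torch.rand(num_channels))

    def forward(self, x):                        # x: (B, C, H, W)
        gate = torch.sigmoid((self.scores - self.threshold) / self.temperature)
        return x * gate.view(1, -1, 1, 1), gate

def flops_penalty(gate, layer_flops, budget):
    """Penalize expected FLOPs above the budget; gradients push the threshold
    up (pruning more channels) on layers that are over budget."""
    return torch.relu(gate.mean() * layer_flops - budget)

pruner = SoftChannelPruner(num_channels=16)
y, gate = pruner(torch.randn(2, 16, 8, 8))
loss = y.square().mean() + 1e-6 * flops_penalty(gate, layer_flops=1e6, budget=5e5)
loss.backward()
print(f"kept fraction: {gate.mean().item():.2f}")
```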

IEBins: Iterative Elastic Bins for Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.14137
  • repo_url: https://github.com/shuweishao/iebins
  • paper_authors: Shuwei Shao, Zhongcai Pei, Xingming Wu, Zhong Liu, Weihai Chen, Zhengguo Li
  • for: This work proposes a classification-regression approach to monocular depth estimation (MDE).
  • methods: It introduces iterative elastic bins (IEBins), which search for high-quality depth by progressively refining the search range over multiple stages, together with a novel elastic target bin whose width adapts to the depth uncertainty; a dedicated framework with a feature extractor and a GRU-based iterative optimizer provides strong temporal context modeling.
  • results: Extensive experiments on KITTI, NYU-Depth-v2, and SUN RGB-D show that the method surpasses prior state-of-the-art competitors. Code: https://github.com/ShuweiShao/IEBins
    Abstract Monocular depth estimation (MDE) is a fundamental topic of geometric computer vision and a core technique for many downstream applications. Recently, several methods reframe the MDE as a classification-regression problem where a linear combination of probabilistic distribution and bin centers is used to predict depth. In this paper, we propose a novel concept of iterative elastic bins (IEBins) for the classification-regression-based MDE. The proposed IEBins aims to search for high-quality depth by progressively optimizing the search range, which involves multiple stages and each stage performs a finer-grained depth search in the target bin on top of its previous stage. To alleviate the possible error accumulation during the iterative process, we utilize a novel elastic target bin to replace the original target bin, the width of which is adjusted elastically based on the depth uncertainty. Furthermore, we develop a dedicated framework composed of a feature extractor and an iterative optimizer that has powerful temporal context modeling capabilities benefiting from the GRU-based architecture. Extensive experiments on the KITTI, NYU-Depth-v2 and SUN RGB-D datasets demonstrate that the proposed method surpasses prior state-of-the-art competitors. The source code is publicly available at https://github.com/ShuweiShao/IEBins.
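The classification-regression formulation can be sketched as predicting a probability over bin centers, taking the expectation as the depth estimate, and then re-centering a narrower search range whose width tracks the distribution's uncertainty. The specific "elastic" update rule below is an illustrative assumption:

```python
import numpy as np

def iterative_elastic_bins(score_fn, lo, hi, n_bins=16, n_stages=4, widen=3.0):
    """Progressively refine a depth estimate with elastic bins (sketch).

    score_fn : callable mapping bin centers (n_bins,) to unnormalized logits,
               standing in for the network's per-bin classification head
    lo, hi   : initial depth search range
    """
    depth = 0.5 * (lo + hi)
    for _ in range(n_stages):
        centers = np.linspace(lo, hi, n_bins)
        logits = score_fn(centers)
        p = np.exp(logits - logits.max())
        p /= p.sum()                                 # softmax over the bins
        depth = (p * centers).sum()                  # expectation = depth estimate
        std = np.sqrt((p * (centers - depth) ** 2).sum())
        # Elastic target bin: the next stage searches a window whose width
        # tracks the current uncertainty instead of a fixed shrink factor.
        half = max(widen * std, (hi - lo) / (2 * n_bins))
        lo, hi = depth - half, depth + half
    return depth

true_depth = 7.3
score_fn = lambda c: -2.0 * (c - true_depth) ** 2    # peaked at the true depth
print(round(iterative_elastic_bins(score_fn, 0.0, 80.0), 3))
```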

Masked Image Residual Learning for Scaling Deeper Vision Transformers

  • paper_url: http://arxiv.org/abs/2309.14136
  • repo_url: https://github.com/russellllaputa/MIRL
  • paper_authors: Guoxi Huang, Hongtao Fu, Adrian G. Bors
  • for: This work aims to ease the training of deeper Vision Transformers (ViTs) and address the degradation problem observed in their deeper layers.
  • methods: The authors propose a self-supervised learning framework called Masked Image Residual Learning (MIRL), which significantly alleviates the degradation problem, making it possible to scale ViT along depth for performance upgrades.
  • results: The authors show that deeper ViTs can be effectively optimized using MIRL and easily gain accuracy from increased depth. With the same level of computational complexity as ViT-Base and ViT-Large, they instantiate 4.5 times and 2 times deeper ViTs, dubbed ViT-S-54 and ViT-B-48. The deeper ViT-S-54 achieves performance on par with ViT-Large, while ViT-B-48 achieves 86.2% top-1 accuracy on ImageNet.
    Abstract Deeper Vision Transformers (ViTs) are more challenging to train. We expose a degradation problem in deeper layers of ViT when using masked image modeling (MIM) for pre-training. To ease the training of deeper ViTs, we introduce a self-supervised learning framework called Masked Image Residual Learning (MIRL), which significantly alleviates the degradation problem, making scaling ViT along depth a promising direction for performance upgrade. We reformulate the pre-training objective for deeper layers of ViT as learning to recover the residual of the masked image. We provide extensive empirical evidence showing that deeper ViTs can be effectively optimized using MIRL and easily gain accuracy from increased depth. With the same level of computational complexity as ViT-Base and ViT-Large, we instantiate 4.5× and 2× deeper ViTs, dubbed ViT-S-54 and ViT-B-48. The deeper ViT-S-54, costing 3× less than ViT-Large, achieves performance on par with ViT-Large. ViT-B-48 achieves 86.2% top-1 accuracy on ImageNet. On one hand, deeper ViTs pre-trained with MIRL exhibit excellent generalization capabilities on downstream tasks, such as object detection and semantic segmentation. On the other hand, MIRL demonstrates high pre-training efficiency. With less pre-training time, MIRL yields competitive performance compared to other approaches.
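One way to read "learning to recover the residual of the masked image" is a two-path objective: a shallow path reconstructs the masked content, and a deeper path is supervised only on what the shallow path missed. The sketch below follows that reading; the detached residual target and the toy convolutional stand-ins are assumptions, not the paper's ViT architecture:

```python
import torch
import torch.nn.functional as F

def mirl_style_losses(shallow_path, deep_path, masked_input, target):
    """Masked-image-residual objective (sketch): deeper blocks learn only the
    residual left over by the shallower reconstruction, giving them a
    well-conditioned signal even at large depth."""
    main = shallow_path(masked_input)
    residual_target = (target - main).detach()   # what the shallow path missed
    residual = deep_path(masked_input)
    return F.mse_loss(main, target) + F.mse_loss(residual, residual_target)

# Toy stand-ins for groups of ViT blocks:
shallow = torch.nn.Conv2d(3, 3, 3, padding=1)
deep = torch.nn.Conv2d(3, 3, 3, padding=1)
x, y = torch.randn(2, 3, 16, 16), torch.randn(2, 3, 16, 16)
print(mirl_style_losses(shallow, deep, x, y).item())
```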

SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution

  • paper_url: http://arxiv.org/abs/2309.14122
  • repo_url: https://github.com/zjm1900/surrogateprompt
  • paper_authors: Zhongjie Ba, Jieming Zhong, Jiachen Lei, Peng Cheng, Qinglong Wang, Zhan Qin, Zhibo Wang, Kui Ren
  • for: The paper is written to address the issue of advanced text-to-image models generating unsafe content, specifically photorealistic NSFW images of political figures.
  • methods: The paper uses a novel framework called SurrogatePrompt, which utilizes large language models, image-to-text, and image-to-image modules to automate the creation of attack prompts that can bypass the safety filters of closed-source models like Midjourney.
  • results: The paper demonstrates the success of SurrogatePrompt in generating abundant photorealistic NSFW images of political figures by exploiting vulnerabilities in Midjourney’s proprietary safety filter, with an 88% success rate in bypassing the filter. The generated images are found to present significant safety hazards, both subjectively and objectively.
    Abstract Advanced text-to-image models such as DALL-E 2 and Midjourney possess the capacity to generate highly realistic images, raising significant concerns regarding the potential proliferation of unsafe content. This includes adult, violent, or deceptive imagery of political figures. Despite claims of rigorous safety mechanisms implemented in these models to restrict the generation of not-safe-for-work (NSFW) content, we successfully devise and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant photorealistic NSFW images. We reveal the fundamental principles of such prompt attacks and suggest strategically substituting high-risk sections within a suspect prompt to evade closed-source safety measures. Our novel framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules to automate attack prompt creation at scale. Evaluation results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts, leading to the generation of counterfeit images depicting political figures in violent scenarios. Both subjective and objective assessments validate that the images generated from our attack prompts present considerable safety hazards.

Convolutional autoencoder-based multimodal one-class classification

  • paper_url: http://arxiv.org/abs/2309.14090
  • repo_url: None
  • paper_authors: Firas Laakom, Fahad Sohrab, Jenni Raitoharju, Alexandros Iosifidis, Moncef Gabbouj
  • for: To propose a deep learning one-class classification method suitable for multimodal data, intended for use in anomaly detection.
  • methods: Two convolutional autoencoders are jointly trained to reconstruct the positive input data while keeping the latent-space representations as compact as possible.
  • results: On a multimodal macroinvertebrate image classification dataset, the multimodal method yields better results than the unimodal approach; the effect of different input image sizes is studied, and recently proposed feature diversity regularizers are shown to improve performance.
    Abstract One-class classification refers to approaches that learn using data from a single class only. In this paper, we propose a deep learning one-class classification method suitable for multimodal data, which relies on two convolutional autoencoders jointly trained to reconstruct the positive input data while keeping the data representations in the latent space as compact as possible. During inference, the distance of the latent representation of an input to the origin can be used as an anomaly score. Experimental results using a multimodal macroinvertebrate image classification dataset show that the proposed multimodal method yields better results than the unimodal approach. Furthermore, we study the effect of different input image sizes and investigate how recently proposed feature diversity regularizers affect the performance of our approach. We show that such regularizers improve performance.
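A minimal sketch of the joint objective described above: two convolutional autoencoders reconstruct their respective modalities while their latent codes are pushed toward the origin, whose distance then serves as the anomaly score. The layer sizes and the weighting term `lam` are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

def make_cae(in_ch, latent=32):
    enc = nn.Sequential(
        nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, latent, 3, stride=2, padding=1), nn.ReLU())
    dec = nn.Sequential(
        nn.ConvTranspose2d(latent, 16, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(16, in_ch, 4, stride=2, padding=1))
    return enc, dec

enc_a, dec_a = make_cae(3)   # modality A
enc_b, dec_b = make_cae(3)   # modality B
lam = 0.1                    # weight of the latent-compactness term (assumed)

def loss_fn(x_a, x_b):
    z_a, z_b = enc_a(x_a), enc_b(x_b)
    recon = ((dec_a(z_a) - x_a) ** 2).mean() + ((dec_b(z_b) - x_b) ** 2).mean()
    # push latent codes of positive (normal) samples toward the origin
    compact = (z_a ** 2).mean() + (z_b ** 2).mean()
    return recon + lam * compact

def anomaly_score(x_a, x_b):
    # at inference, distance of the latent representation to the origin
    with torch.no_grad():
        return (enc_a(x_a) ** 2).mean().item() + (enc_b(x_b) ** 2).mean().item()

# toy usage on a single 32x32 pair
print(anomaly_score(torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)))
```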

BoIR: Box-Supervised Instance Representation for Multi-Person Pose Estimation

  • paper_url: http://arxiv.org/abs/2309.14072
  • repo_url: https://github.com/uyoung-jeong/BoIR
  • paper_authors: Uyoung Jeong, Seungryul Baek, Hyung Jin Chang, Kwang In Kim
  • for: To address the instance disentanglement problem in multi-person pose estimation and improve pose estimation performance in crowded scenes.
  • methods: Bounding box-level instance representation learning that simultaneously solves instance detection, instance disentanglement, and instance-keypoint association, using multi-task learning of bottom-up keypoint estimation, bounding box regression, and contrastive instance embedding learning, with no additional computational cost during inference.
  • results: State-of-the-art performance, outperforming prior methods on COCO val (0.8 AP), COCO test-dev (0.5 AP), CrowdPose (4.9 AP), and OCHuman (3.5 AP).
    Abstract Single-stage multi-person human pose estimation (MPPE) methods have shown great performance improvements, but existing methods fail to disentangle features by individual instances under crowded scenes. In this paper, we propose a bounding box-level instance representation learning called BoIR, which simultaneously solves instance detection, instance disentanglement, and instance-keypoint association problems. Our new instance embedding loss provides a learning signal on the entire area of the image with bounding box annotations, achieving globally consistent and disentangled instance representation. Our method exploits multi-task learning of bottom-up keypoint estimation, bounding box regression, and contrastive instance embedding learning, without additional computational cost during inference. BoIR is effective for crowded scenes, outperforming state-of-the-art on COCO val (0.8 AP), COCO test-dev (0.5 AP), CrowdPose (4.9 AP), and OCHuman (3.5 AP). Code will be available at https://github.com/uyoung-jeong/BoIR

Soft Mixture Denoising: Beyond the Expressive Bottleneck of Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.14068
  • repo_url: None
  • paper_authors: Yangming Li, Boris van Breugel, Mihaela van der Schaar
  • for: To investigate the expressiveness and underlying assumptions of diffusion models, particularly in backward denoising.
  • methods: Theoretical analysis of current diffusion models, and a proposed soft mixture denoising (SMD) model for backward denoising.
  • results: Current diffusion models are shown to have an expressive bottleneck and unbounded errors in backward denoising, while SMD effectively resolves these issues and outperforms plain diffusion models in practice, especially with few backward iterations.
    Abstract Because diffusion models have shown impressive performance in a number of tasks, such as image synthesis, there is a trend in recent works to prove (with certain assumptions) that these models have strong approximation capabilities. In this paper, we show that current diffusion models actually have an expressive bottleneck in backward denoising, and that some assumptions made by existing theoretical guarantees are too strong. Based on this finding, we prove that diffusion models have unbounded errors in both local and global denoising. In light of our theoretical studies, we introduce soft mixture denoising (SMD), an expressive and efficient model for backward denoising. SMD not only permits diffusion models to well approximate any Gaussian mixture distributions in theory, but also is simple and efficient to implement. Our experiments on multiple image datasets show that SMD significantly improves different types of diffusion models (e.g., DDPM), especially when only a few backward iterations are used.
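As a rough illustration of the idea, the sketch below replaces the usual single-Gaussian reverse step with a K-component Gaussian mixture; the network interface (per-component logits, means, and log-variances) and the sampling scheme are our assumptions, not the paper's exact formulation.

```python
import torch

def soft_mixture_reverse_step(net, x_t, t):
    """One reverse-denoising step where p(x_{t-1} | x_t) is modeled as a
    Gaussian mixture instead of a single Gaussian (sketch). `net` is assumed
    to return per-component logits (B, K), means (B, K, D), log-variances (B, K, D)."""
    logits, means, logvars = net(x_t, t)
    comp = torch.distributions.Categorical(logits=logits).sample()       # (B,)
    idx = comp.view(-1, 1, 1).expand(-1, 1, means.size(-1))
    mu = means.gather(1, idx).squeeze(1)
    std = (0.5 * logvars.gather(1, idx).squeeze(1)).exp()
    return mu + std * torch.randn_like(mu)                               # sample x_{t-1}

# toy 'network' producing a trivial 4-component mixture around x_t
def toy_net(x_t, t, K=4):
    B, D = x_t.shape
    return torch.zeros(B, K), x_t.unsqueeze(1).repeat(1, K, 1), torch.zeros(B, K, D)

x_prev = soft_mixture_reverse_step(toy_net, torch.randn(8, 2), t=10)
```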

AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.14065
  • repo_url: https://github.com/Fourier7754/AsymFormer
  • paper_authors: Siqi Du, Weixi Wang, Renzhong Guo, Shengjun Tang
  • for: To achieve efficient and accurate RGB-D semantic segmentation for reliable real-time robotic systems.
  • methods: A novel asymmetrical network, AsymFormer, which minimizes superfluous parameters by optimizing the distribution of computational resources and fuses multimodal features through an asymmetrical backbone; feature selection (LAFS) and multimodal self-similarity feature extraction (CMA) further improve accuracy without increasing the parameter count, preserving real-time execution on robotic platforms.
  • results: Competitive results of 52.0% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD; inference runs at 65 FPS on an RTX 3090 and reaches 79 FPS after mixed-precision quantization, showing that AsymFormer strikes a balance between high accuracy and efficiency for RGB-D semantic segmentation.
    Abstract In the realm of robotic intelligence, achieving efficient and precise RGB-D semantic segmentation is a key cornerstone. State-of-the-art multimodal semantic segmentation methods, primarily rooted in symmetrical skeleton networks, find it challenging to harmonize computational efficiency and precision. In this work, we propose AsymFormer, a novel network for real-time RGB-D semantic segmentation, which targets the minimization of superfluous parameters by optimizing the distribution of computational resources and introduces an asymmetrical backbone to allow for the effective fusion of multimodal features. Furthermore, we explore techniques to bolster network accuracy by redefining feature selection and extracting multi-modal self-similarity features without a substantial increase in the parameter count, thereby ensuring real-time execution on robotic platforms. Additionally, a Local Attention-Guided Feature Selection (LAFS) module is used to selectively fuse features from different modalities by leveraging their dependencies. Subsequently, a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module is introduced to further extract cross-modal representations. This method is evaluated on NYUv2 and SUNRGBD datasets, with AsymFormer demonstrating competitive results with 52.0% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD. Notably, AsymFormer achieves an inference speed of 65 FPS and after implementing mixed precision quantization, it attains an impressive inference speed of 79 FPS on RTX3090. This significantly outperforms existing multi-modal methods, thereby demonstrating that AsymFormer can strike a balance between high accuracy and efficiency for RGB-D semantic segmentation.

FeCAM: Exploiting the Heterogeneity of Class Distributions in Exemplar-Free Continual Learning

  • paper_url: http://arxiv.org/abs/2309.14062
  • repo_url: https://github.com/dipamgoswami/fecam
  • paper_authors: Dipam Goswami, Yuyang Liu, Bartłomiej Twardowski, Joost van de Weijer
  • for: To study exemplar-free class-incremental learning (CIL), which suffers from severe catastrophic forgetting because rehearsal of data from previous tasks is prohibited.
  • methods: Prototypical networks that generate new class prototypes using a frozen feature extractor and classify features by their distance to the prototypes; the anisotropic Mahalanobis distance is revisited and feature covariance relations are modeled explicitly.
  • results: Classification with the Euclidean metric works for jointly trained features but is suboptimal for the heterogeneous feature distributions arising from non-stationary data; modeling feature covariance outperforms earlier attempts that sample features from normal distributions and train a linear classifier. The approach generalizes to many- and few-shot CIL and to domain-incremental settings, achieving state-of-the-art results on several standard continual learning benchmarks without updating the backbone.
    Abstract Exemplar-free class-incremental learning (CIL) poses several challenges since it prohibits the rehearsal of data from previous tasks and thus suffers from catastrophic forgetting. Recent approaches to incrementally learning the classifier by freezing the feature extractor after the first task have gained much attention. In this paper, we explore prototypical networks for CIL, which generate new class prototypes using the frozen feature extractor and classify the features based on the Euclidean distance to the prototypes. In an analysis of the feature distributions of classes, we show that classification based on Euclidean metrics is successful for jointly trained features. However, when learning from non-stationary data, we observe that the Euclidean metric is suboptimal and that feature distributions are heterogeneous. To address this challenge, we revisit the anisotropic Mahalanobis distance for CIL. In addition, we empirically show that modeling the feature covariance relations is better than previous attempts at sampling features from normal distributions and training a linear classifier. Unlike existing methods, our approach generalizes to both many- and few-shot CIL settings, as well as to domain-incremental settings. Interestingly, without updating the backbone network, our method obtains state-of-the-art results on several standard continual learning benchmarks. Code is available at https://github.com/dipamgoswami/FeCAM.
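A minimal sketch of prototype classification with an anisotropic Mahalanobis distance, as we read it from the abstract: each class keeps a prototype and a covariance estimated from its features, and a test feature is assigned to the class with the smallest distance. The covariance shrinkage used for numerical stability is our assumption.

```python
import numpy as np

def mahalanobis_scores(x, prototypes, covs, shrink=0.5):
    """Per-class anisotropic Mahalanobis distances for a test feature x.
    `shrink` blends each class covariance toward the identity for stability
    (an assumed regularizer, not necessarily the paper's)."""
    scores = []
    for mu, S in zip(prototypes, covs):
        S = (1 - shrink) * S + shrink * np.eye(len(mu))
        d = x - mu
        scores.append(float(d @ np.linalg.solve(S, d)))
    return np.array(scores)   # predict with argmin

# toy usage with two 4-d classes
protos = [np.zeros(4), np.ones(4)]
covs = [np.eye(4), 2 * np.eye(4)]
pred = int(np.argmin(mahalanobis_scores(0.9 * np.ones(4), protos, covs)))
```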

Weakly Supervised Semantic Segmentation by Knowledge Graph Inference

  • paper_url: http://arxiv.org/abs/2309.14057
  • repo_url: https://github.com/jia-zhang666/grm_layer
  • paper_authors: Jia Zhang, Bo Peng, Xi Wu
  • for: To improve weakly supervised semantic segmentation (WSSS), particularly for Convolutional Neural Networks (CNNs).
  • methods: A graph reasoning-based approach that enhances both the multi-label classification and segmentation network stages. In the classification stage, external knowledge is integrated and combined with GCNs to globally reason about inter-class dependencies in images; in the segmentation stage, a Graph Reasoning Mapping (GRM) module leverages knowledge from textual databases to perform contextual reasoning for class representation within image regions, enhancing the feature representation.
  • results: Using only image-level supervision, state-of-the-art WSSS performance is achieved on the PASCAL VOC 2012 and MS-COCO datasets, and experiments confirm that the proposed graph reasoning approach effectively improves WSSS.
    Abstract Currently, existing efforts in Weakly Supervised Semantic Segmentation (WSSS) based on Convolutional Neural Networks (CNNs) have predominantly focused on enhancing the multi-label classification network stage, with limited attention given to the equally important downstream segmentation network. Furthermore, CNN-based local convolutions lack the ability to model the extensive inter-category dependencies. Therefore, this paper introduces a graph reasoning-based approach to enhance WSSS. The aim is to improve WSSS holistically by simultaneously enhancing both the multi-label classification and segmentation network stages. In the multi-label classification network segment, external knowledge is integrated, coupled with GCNs, to globally reason about inter-class dependencies. This encourages the network to uncover features in non-salient regions of images, thereby refining the completeness of generated pseudo-labels. In the segmentation network segment, the proposed Graph Reasoning Mapping (GRM) module is employed to leverage knowledge obtained from textual databases, facilitating contextual reasoning for class representation within image regions. This GRM module enhances feature representation in high-level semantics of the segmentation network's local convolutions, while dynamically learning semantic coherence for individual samples. Using solely image-level supervision, we have achieved state-of-the-art performance in WSSS on the PASCAL VOC 2012 and MS-COCO datasets. Extensive experimentation on both the multi-label classification and segmentation network stages underscores the effectiveness of the proposed graph reasoning approach for advancing WSSS.

Single Image Test-Time Adaptation for Segmentation

  • paper_url: http://arxiv.org/abs/2309.14052
  • repo_url: https://github.com/klarajanouskova/SITTA-Segmentation
  • paper_authors: Klara Janouskova, Tamir Shor, Chaim Baskin, Jiri Matas
  • for: To improve the robustness of deep neural networks to domain shift in segmentation by adapting to a single unlabelled image at test time.
  • methods: Test-time adaptation (TTA) that optimizes self-supervised losses on the test image, with a novel adversarial training scheme for adaptation with mask refinement; multiple baselines based on different principles are evaluated under diverse conditions.
  • results: The proposed additions yield 3.51% and 3.28% improvements over non-adapted baselines; without them, the gains would be only 1.7% and 2.16%.
    Abstract Test-Time Adaptation (TTA) methods improve the robustness of deep neural networks to domain shift on a variety of tasks such as image classification or segmentation. This work explores adapting segmentation models to a single unlabelled image with no other data available at test time. In particular, this work focuses on adaptation by optimizing self-supervised losses at test time. Multiple baselines based on different principles are evaluated under diverse conditions, and a novel adversarial training is introduced for adaptation with mask refinement. Our additions to the baselines result in 3.51% and 3.28% increases over the non-adapted baselines; without these improvements, the increases would be only 1.7% and 2.16%.
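The sketch below shows the generic single-image adaptation loop this line of work builds on: copy the source model, minimize a self-supervised loss (here prediction entropy, one common baseline choice) on the lone test image, and predict with the adapted copy. It does not reproduce the paper's adversarial mask-refinement variant.

```python
import copy
import torch
import torch.nn.functional as F

def single_image_tta(model, image, steps=10, lr=1e-4):
    """Adapt a segmentation model to one unlabelled test image by minimizing
    a self-supervised loss; entropy minimization is used as an illustrative
    baseline, while the paper evaluates several such losses."""
    adapted = copy.deepcopy(model)            # keep the source model intact
    adapted.train()
    opt = torch.optim.Adam(adapted.parameters(), lr=lr)
    for _ in range(steps):
        logits = adapted(image)               # (1, C, H, W) segmentation logits
        probs = F.softmax(logits, dim=1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    adapted.eval()
    return adapted
```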

Unveiling Fairness Biases in Deep Learning-Based Brain MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2309.14392
  • repo_url: https://github.com/ydu0117/reconfairness
  • paper_authors: Yuning Du, Yuyang Xue, Rohan Dharmakumar, Sotirios A. Tsaftaris
  • for: To examine whether deep learning (DL)-based brain MRI reconstruction exhibits fairness biases across demographic subgroups and to investigate their sources.
  • methods: A U-Net architecture is used for image reconstruction, with baseline Empirical Risk Minimisation (ERM) and rebalancing strategies implemented to detect and probe unfairness; performance is evaluated using image reconstruction metrics.
  • results: Statistically significant performance biases are found between gender and age subgroups, yet data imbalance and training discrimination turn out not to be the main sources of bias. The findings offer insights into fairness in DL-based image reconstruction and aim to improve equity in medical AI applications.
    Abstract Deep learning (DL) reconstruction particularly of MRI has led to improvements in image fidelity and reduction of acquisition time. In neuroimaging, DL methods can reconstruct high-quality images from undersampled data. However, it is essential to consider fairness in DL algorithms, particularly in terms of demographic characteristics. This study presents the first fairness analysis in a DL-based brain MRI reconstruction model. The model utilises the U-Net architecture for image reconstruction and explores the presence and sources of unfairness by implementing baseline Empirical Risk Minimisation (ERM) and rebalancing strategies. Model performance is evaluated using image reconstruction metrics. Our findings reveal statistically significant performance biases between the gender and age subgroups. Surprisingly, data imbalance and training discrimination are not the main sources of bias. This analysis provides insights of fairness in DL-based image reconstruction and aims to improve equity in medical AI applications.

Hashing Neural Video Decomposition with Multiplicative Residuals in Space-Time

  • paper_url: http://arxiv.org/abs/2309.14022
  • repo_url: None
  • paper_authors: Cheng-Hung Chan, Cheng-Yang Yuan, Cheng Sun, Hwann-Tzong Chen
  • for: To provide a layer-based video decomposition method that enables fast, high-quality editing of videos with spatiotemporally varying lighting and motion effects.
  • methods: A neural model decomposes the input video into layered representations, each comprising a 2D texture map, a mask for the original video, and a multiplicative residual characterizing spatiotemporal lighting variations. The layer-based neural representations of a 1080p video are learned in 25 s per frame via coordinate hashing, and edited results render in real time at 71 fps on a single GPU.
  • results: High-quality editing effects are produced on various videos, with edits on a texture map propagating consistently to the corresponding locations in all frames; feature-tracking evaluation metrics are also proposed for objectively assessing editing consistency.
    Abstract We present a video decomposition method that facilitates layer-based editing of videos with spatiotemporally varying lighting and motion effects. Our neural model decomposes an input video into multiple layered representations, each comprising a 2D texture map, a mask for the original video, and a multiplicative residual characterizing the spatiotemporal variations in lighting conditions. A single edit on the texture maps can be propagated to the corresponding locations in the entire video frames while preserving other contents' consistencies. Our method efficiently learns the layer-based neural representations of a 1080p video in 25s per frame via coordinate hashing and allows real-time rendering of the edited result at 71 fps on a single GPU. Qualitatively, we run our method on various videos to show its effectiveness in generating high-quality editing effects. Quantitatively, we propose to adopt feature-tracking evaluation metrics for objectively assessing the consistency of video editing. Project page: https://lightbulb12294.github.io/hashing-nvd/
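A toy NumPy sketch of the layered model as we read it: each layer contributes a texture modulated by a multiplicative residual, blended by per-layer masks, so a single edit on a texture map propagates to every frame that samples it. The blending order and normalization are assumptions.

```python
import numpy as np

def composite(textures, masks, residuals):
    """Recompose one frame from L layers (sketch): per-layer 2D texture,
    soft mask, and multiplicative residual capturing lighting variation.
    textures, residuals: (L, H, W, 3); masks: (L, H, W, 1), assumed to sum
    to 1 per pixel across layers."""
    layers = textures * residuals          # lighting-modulated appearance
    return (masks * layers).sum(axis=0)    # mask-weighted blend of all layers

# editing a texture map would change every frame reconstructed from it
L, H, W = 2, 4, 4
tex = np.ones((L, H, W, 3))
res = np.full((L, H, W, 3), 0.9)
m = np.full((L, H, W, 1), 0.5)
frame = composite(tex, m, res)
```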

Variational Inference for Scalable 3D Object-centric Learning

  • paper_url: http://arxiv.org/abs/2309.14010
  • repo_url: None
  • paper_authors: Tianyu Wang, Kee Siong Ng, Miaomiao Liu
  • for: Scalable unsupervised object-centric representation learning on 3D scenes that generalizes to larger scenes.
  • methods: Object pose and appearance representations are estimated separately, and object representations are explicitly mapped across views into view-invariant, localized object coordinate systems while maintaining object identities. An amortized variational inference pipeline processes sequential input and scalably updates object latent distributions online, and a Cognitive Map registers and queries objects on a per-scene global map.
  • results: Experiments on synthetic and real datasets show that the proposed method infers and maintains object-centric representations of 3D scenes and outperforms previous models.
    Abstract We tackle the task of scalable unsupervised object-centric representation learning on 3D scenes. Existing approaches to object-centric representation learning show limitations in generalizing to larger scenes as their learning processes rely on a fixed global coordinate system. In contrast, we propose to learn view-invariant 3D object representations in localized object coordinate systems. To this end, we estimate the object pose and appearance representation separately and explicitly map object representations across views while maintaining object identities. We adopt an amortized variational inference pipeline that can process sequential input and scalably update object latent distributions online. To handle large-scale scenes with a varying number of objects, we further introduce a Cognitive Map that allows the registration and query of objects on a per-scene global map to achieve scalable representation learning. We explore the object-centric neural radiance field (NeRF) as our 3D scene representation, which is jointly modeled within our unsupervised object-centric learning framework. Experimental results on synthetic and real datasets show that our proposed method can infer and maintain object-centric representations of 3D scenes and outperforms previous models.

Better Generalization of White Matter Tract Segmentation to Arbitrary Datasets with Scaled Residual Bootstrap

  • paper_url: http://arxiv.org/abs/2309.13980
  • repo_url: None
  • paper_authors: Wan Liu, Chuyang Ye
  • for: To improve the generalization of white matter (WM) tract segmentation on diffusion magnetic resonance imaging (dMRI) to arbitrary test datasets.
  • methods: Training scans are augmented by adjusting the noise magnitude with an adapted scaled residual bootstrap strategy, compensating for the distribution shift caused by differing numbers of diffusion gradients and noise levels (and hence SNRs) between training and test data.
  • results: Experiments on two dMRI datasets show that the proposed method consistently improves the generalization of WM tract segmentation under various settings.
    Abstract White matter (WM) tract segmentation is a crucial step for brain connectivity studies. It is performed on diffusion magnetic resonance imaging (dMRI), and deep neural networks (DNNs) have achieved promising segmentation accuracy. Existing DNN-based methods use an annotated dataset for model training. However, the performance of the trained model on a different test dataset may not be optimal due to distribution shift, and it is desirable to design WM tract segmentation approaches that allow better generalization of the segmentation model to arbitrary test datasets. In this work, we propose a WM tract segmentation approach that improves the generalization with scaled residual bootstrap. The difference between dMRI scans in training and test datasets is most noticeably caused by the different numbers of diffusion gradients and noise levels. Since both of them lead to different signal-to-noise ratios (SNRs) between the training and test data, we propose to augment the training scans by adjusting the noise magnitude and develop an adapted residual bootstrap strategy for the augmentation. To validate the proposed approach, two dMRI datasets were used, and the experimental results show that our method consistently improved the generalization of WM tract segmentation under various settings.
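A per-voxel sketch of residual-bootstrap noise augmentation under an assumed linear signal model: fit the model, resample the residuals with replacement, rescale them to mimic a different SNR, and add them back. The paper's exact signal model and scaling rule may differ.

```python
import numpy as np

def scaled_residual_bootstrap(signals, design, scale, rng=np.random.default_rng(0)):
    """Residual-bootstrap augmentation for one voxel (sketch).
    signals: (N_grad,) measurements; design: (N_grad, P) linear signal model
    (e.g., a log-linear DTI design matrix); `scale` adjusts noise magnitude."""
    coef, *_ = np.linalg.lstsq(design, signals, rcond=None)
    fit = design @ coef
    resid = signals - fit
    # resample residuals with replacement, then rescale to mimic a
    # different signal-to-noise ratio
    boot = rng.choice(resid, size=resid.shape, replace=True)
    return fit + scale * boot

# toy usage: 30 gradient directions, 7-parameter model
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 7))
y = X @ rng.normal(size=7) + 0.05 * rng.normal(size=30)
augmented = scaled_residual_bootstrap(y, X, scale=1.5)
```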

Diverse Semantic Image Editing with Style Codes

  • paper_url: http://arxiv.org/abs/2309.13975
  • repo_url: https://github.com/hakansivuk/divsem
  • paper_authors: Hakan Sivuk, Aysegul Dundar
  • for: To propose a semantic image editing method that inpaints erased pixels following a semantic map while staying consistent with both the image context and the map.
  • methods: A novel style encoding mechanism that encodes the styles of visible and partially visible objects differently, improving the consistency of the style encoding and of the final generations while enabling diverse outputs with seamless boundaries.
  • results: Significant improvements over previous conditional image generation and semantic image editing algorithms on quantitative metrics, while also providing diverse results.
    Abstract Semantic image editing requires inpainting pixels following a semantic map. It is a challenging task since this inpainting requires both harmony with the context and strict compliance with the semantic maps. The majority of the previous methods proposed for this task try to encode the whole information from erased images. However, when an object is added to a scene such as a car, its style cannot be encoded from the context alone. On the other hand, the models that can output diverse generations struggle to output images that have seamless boundaries between the generated and unerased parts. Additionally, previous methods do not have a mechanism to encode the styles of visible and partially visible objects differently for better performance. In this work, we propose a framework that can encode visible and partially visible objects with a novel mechanism to achieve consistency in the style encoding and final generations. We extensively compare with previous conditional image generation and semantic image editing algorithms. Our extensive experiments show that our method significantly improves over the state-of-the-art. Our method not only achieves better quantitative results but also provides diverse results. Please refer to the project web page for the released code and demo: https://github.com/hakansivuk/DivSem.

Egocentric RGB+Depth Action Recognition in Industry-Like Settings

  • paper_url: http://arxiv.org/abs/2309.13962
  • repo_url: https://github.com/jkini/Meccano
  • paper_authors: Jyoti Kini, Sarah Fleischer, Ishan Dave, Mubarak Shah
  • for: To improve egocentric action recognition in an industry-like environment by leveraging both RGB and Depth modalities.
  • methods: A 3D Video SWIN Transformer encodes the RGB and Depth modalities effectively; a training strategy based on an exponentially decaying variant of the focal loss modulating factor addresses the skewness of real-world multimodal action occurrences, and late fusion combines the predictions from the two modalities.
  • results: Extensive evaluation on the MECCANO dataset significantly outperforms prior work, and the method secured first place in the multimodal action recognition challenge at ICIAP 2023.
    Abstract Action recognition from an egocentric viewpoint is a crucial perception task in robotics and enables a wide range of human-robot interactions. While most computer vision approaches prioritize the RGB camera, the Depth modality - which can further amplify the subtleties of actions from an egocentric perspective - remains underexplored. Our work focuses on recognizing actions from egocentric RGB and Depth modalities in an industry-like environment. To study this problem, we consider the recent MECCANO dataset, which provides a wide range of assembling actions. Our framework is based on the 3D Video SWIN Transformer to encode both RGB and Depth modalities effectively. To address the inherent skewness in real-world multimodal action occurrences, we propose a training strategy using an exponentially decaying variant of the focal loss modulating factor. Additionally, to leverage the information in both RGB and Depth modalities, we opt for late fusion to combine the predictions from each modality. We thoroughly evaluate our method on the action recognition task of the MECCANO dataset, and it significantly outperforms the prior work. Notably, our method also secured first place at the multimodal action recognition challenge at ICIAP 2023.

In-Domain GAN Inversion for Faithful Reconstruction and Editability

  • paper_url: http://arxiv.org/abs/2309.13956
  • repo_url: None
  • paper_authors: Jiapeng Zhu, Yujun Shen, Yinghao Xu, Deli Zhao, Qifeng Chen, Bolei Zhou
  • for: To propose in-domain GAN inversion, which maps real images back into the native latent space of a pre-trained GAN so that a wide range of image editing applications become possible without retraining.
  • methods: A domain-guided encoder and a domain-regularized optimizer regularize the inverted code within the native latent space; comprehensive analyses of the encoder structure, the starting inversion point, and the inversion parameter space explore the trade-off between reconstruction quality and editability.
  • results: In-domain inversion sufficiently reuses the knowledge learned by GANs for faithful image reconstruction, enabling diverse real-image editing applications without any retraining.
    Abstract Generative Adversarial Networks (GANs) have significantly advanced image synthesis through mapping randomly sampled latent codes to high-fidelity synthesized images. However, applying well-trained GANs to real image editing remains challenging. A common solution is to find an approximate latent code that can adequately recover the input image to edit, which is also known as GAN inversion. To invert a GAN model, prior works typically focus on reconstructing the target image at the pixel level, yet few studies are conducted on whether the inverted result can well support manipulation at the semantic level. This work fills in this gap by proposing in-domain GAN inversion, which consists of a domain-guided encoder and a domain-regularized optimizer, to regularize the inverted code in the native latent space of the pre-trained GAN model. In this way, we manage to sufficiently reuse the knowledge learned by GANs for image reconstruction, facilitating a wide range of editing applications without any retraining. We further make comprehensive analyses on the effects of the encoder structure, the starting inversion point, as well as the inversion parameter space, and observe the trade-off between the reconstruction quality and the editing property. Such a trade-off sheds light on how a GAN model represents an image with various semantics encoded in the learned latent distribution. Code, models, and demo are available at the project page: https://genforce.github.io/idinvert/.
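A compact sketch of the two-step recipe: initialize the latent code with the domain-guided encoder, then refine it with an optimizer whose regularizer keeps the code where the encoder would place the reconstruction. The loss weights, the perceptual feature extractor `D_feat`, and the exact regularizer form are our assumptions, not the paper's definitions.

```python
import torch

def invert(G, E, D_feat, x, steps=200, lr=0.01, lam=2.0):
    """In-domain inversion sketch. G: generator, E: domain-guided encoder,
    D_feat: a fixed feature extractor for a perceptual loss, x: target image."""
    z = E(x).detach().clone().requires_grad_(True)   # encoder initialization
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x_hat = G(z)
        loss = ((x_hat - x) ** 2).mean() \
             + ((D_feat(x_hat) - D_feat(x)) ** 2).mean() \
             + lam * ((E(x_hat) - z) ** 2).mean()    # domain regularizer (assumed form)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```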

Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training

  • paper_url: http://arxiv.org/abs/2309.13942
  • repo_url: None
  • paper_authors: Jiangliu Wang, Jianbo Jiao, Yibing Song, Stephen James, Zhan Tong, Chongjian Ge, Pieter Abbeel, Yun-hui Liu
  • for: To improve unsupervised audio-visual pre-training.
  • methods: A novel speed co-augmentation method that randomly changes the playback speeds of both audio and video data, together with a proposed SoftInfoNCE loss that models the partial relationship between augmented pairs.
  • results: Experimental results show that the proposed method significantly improves the learned representations over vanilla audio-visual contrastive learning.
    Abstract This work aims to improve unsupervised audio-visual pre-training. Inspired by the efficacy of data augmentation in visual contrastive learning, we propose a novel speed co-augmentation method that randomly changes the playback speeds of both audio and video data. Despite its simplicity, the speed co-augmentation method possesses two compelling attributes: (1) it increases the diversity of audio-visual pairs and doubles the size of negative pairs, resulting in a significant enhancement in the learned representations, and (2) it changes the strict correlation between audio-visual pairs but introduces a partial relationship between the augmented pairs, which is modeled by our proposed SoftInfoNCE loss to further boost the performance. Experimental results show that the proposed method significantly improves the learned representations when compared to vanilla audio-visual contrastive learning.
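A toy sketch of the augmentation: draw playback rates and resample the audio and video streams by index striding (nearest-neighbor resampling for brevity; a real pipeline would resample properly). Whether the two rates are drawn jointly or independently is our assumption.

```python
import numpy as np

def speed_co_augment(audio, video, rates=(0.5, 2.0), rng=np.random.default_rng(0)):
    """Randomly change playback speeds of both streams (sketch).
    audio: (T_a,) samples; video: (T_v, H, W, C) frames."""
    ra, rv = rng.uniform(*rates, size=2)   # independent rates (assumed)
    a_idx = np.clip((np.arange(int(len(audio) / ra)) * ra).astype(int),
                    0, len(audio) - 1)
    v_idx = np.clip((np.arange(int(len(video) / rv)) * rv).astype(int),
                    0, len(video) - 1)
    return audio[a_idx], video[v_idx]

# toy usage: 1 s of 16 kHz audio and 30 video frames
aud, vid = speed_co_augment(np.random.randn(16000),
                            np.random.randn(30, 8, 8, 3))
```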

A Lightweight Recurrent Grouping Attention Network for Video Super-Resolution

  • paper_url: http://arxiv.org/abs/2309.13940
  • repo_url: https://github.com/karlygzhu/rgan
  • paper_authors: Yonggui Zhu, Guofang Li
  • for: To improve the efficiency and scalability of video super-resolution (VSR) models.
  • methods: A lightweight recurrent grouping attention network (only 0.878M parameters) with forward and backward feature extraction modules that gather temporal information between consecutive frames from two directions, a new grouping mechanism that efficiently collects spatio-temporal information of the reference frame and its neighboring frames, an attention supplementation module that widens the information gathering range, and a feature reconstruction module that aggregates information from different directions.
  • results: Experiments show that the model achieves state-of-the-art performance on multiple datasets.
    Abstract Effective aggregation of the temporal information of consecutive frames is the core of achieving video super-resolution. Many scholars have utilized structures such as sliding windows and recurrent connections to gather the spatio-temporal information of frames. However, although the performance of the constructed VSR models is improving, the size of the models is also increasing, exacerbating the demands on hardware. Thus, to reduce the stress on the device, we propose a novel lightweight recurrent grouping attention network. This model has only 0.878M parameters, far fewer than current mainstream models for video super-resolution. We design a forward feature extraction module and a backward feature extraction module to collect temporal information between consecutive frames from two directions. Moreover, a new grouping mechanism is proposed to efficiently collect the spatio-temporal information of the reference frame and its neighboring frames. An attention supplementation module is presented to further enhance the information gathering range of the model, and a feature reconstruction module aggregates information from the different directions to reconstruct high-resolution features. Experiments demonstrate that our model achieves state-of-the-art performance on multiple datasets.

Recursive Counterfactual Deconfounding for Object Recognition

  • paper_url: http://arxiv.org/abs/2309.13924
  • repo_url: None
  • paper_authors: Jiayin Sun, Hong Wang, Qiulei Dong
  • for: To improve the accuracy and generalization of object recognition in both closed-set and open-set scenarios.
  • methods: A Recursive Counterfactual Deconfounding (RCD) model based on counterfactual analysis, which builds and recursively updates the relationships among image features, model predictions, and confounders through a factual graph and a counterfactual graph, with a negative correlation constraint further alleviating the negative effects of counterfactual features during training.
  • results: On both closed-set and open-set recognition tasks, the proposed RCD model significantly outperforms 11 state-of-the-art baselines in most cases.
    Abstract Image recognition is a classic and common task in the computer vision field, which has been widely applied in the past decade. Most existing methods in the literature aim to learn discriminative features from labeled images for classification; however, they generally neglect confounders that infiltrate the learned features, resulting in poor performance when discriminating test images. To address this problem, we propose a Recursive Counterfactual Deconfounding model for object recognition in both closed-set and open-set scenarios based on counterfactual analysis, called RCD. The proposed model consists of a factual graph and a counterfactual graph, where the relationships among image features, model predictions, and confounders are built and updated recursively for learning more discriminative features. It performs in a recursive manner so that subtler counterfactual features can be learned and eliminated progressively, improving both the discriminability and the generalization of the proposed model. In addition, a negative correlation constraint is designed to further alleviate the negative effects of the counterfactual features at the model training stage. Extensive experimental results on both the closed-set recognition task and the open-set recognition task demonstrate that the proposed RCD model performs significantly better than 11 state-of-the-art baselines in most cases.

Subspace-Aware Feature Reconstruction for Unsupervised Anomaly Localization

  • paper_url: http://arxiv.org/abs/2309.13904
  • repo_url: None
  • paper_authors: Katsuya Hotta, Chao Zhang, Yoshihiro Hagihara, Takuya Akashi
  • for: To propose a new anomaly localization method that achieves adaptive feature approximation.
  • methods: A subspace-aware feature reconstruction framework based on a self-expressive model that learns low-dimensional subspaces and reconstructs feature representations through them; the sparsity of the subspace representation covers feature patterns from the same subspace with fewer resources, reducing the memory bank.
  • results: Experiments on three industrial benchmark datasets show competitive anomaly localization performance compared with state-of-the-art methods, with target features adaptively reconstructed from a small number of samples.
    Abstract Unsupervised anomaly localization, which plays a critical role in industrial manufacturing, is to identify anomalous regions that deviate from patterns established exclusively from nominal samples. Recent mainstream methods focus on approximating the target feature distribution by leveraging embeddings from ImageNet models. However, a common issue in many anomaly localization methods is the lack of adaptability of the feature approximations to specific targets. Consequently, their ability to effectively identify anomalous regions relies significantly on the data coverage provided by the finite resources in a memory bank. In this paper, we propose a novel subspace-aware feature reconstruction framework for anomaly localization. To achieve adaptive feature approximation, our proposed method involves the reconstruction of the feature representation through the self-expressive model designed to learn low-dimensional subspaces. Importantly, the sparsity of the subspace representation contributes to covering feature patterns from the same subspace with fewer resources, leading to a reduction in the memory bank. Extensive experiments across three industrial benchmark datasets demonstrate that our approach achieves competitive anomaly localization performance compared to state-of-the-art methods by adaptively reconstructing target features with a small number of samples.
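To illustrate the self-expressive idea, the sketch below expresses each nominal feature as a combination of the others, F ≈ CF with a zeroed diagonal, using a closed-form ridge relaxation; the paper instead learns sparse low-dimensional subspaces, so treat this only as the underlying principle.

```python
import numpy as np

def self_expressive_coeffs(F, lam=1e-2):
    """Ridge relaxation of min_C ||F - C F||_F^2 + lam ||C||_F^2 (sketch).
    F: (N, D) memory bank of nominal features; returns C: (N, N)."""
    G = F @ F.T
    C = np.linalg.solve(G + lam * np.eye(len(F)), G)  # = G (G + lam I)^{-1}
    np.fill_diagonal(C, 0.0)   # forbid trivial self-representation (post hoc)
    return C

def anomaly_score(f, F):
    """Score a test feature by how well the nominal bank reconstructs it."""
    Fa = np.vstack([F, f[None]])
    C = self_expressive_coeffs(Fa)
    recon = (C @ Fa)[-1]
    return float(np.linalg.norm(f - recon))

# toy usage: nominal features near a 2-d subspace of R^8
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 8))
bank = rng.normal(size=(50, 2)) @ basis
print(anomaly_score(rng.normal(size=2) @ basis, bank),  # in-subspace: low
      anomaly_score(rng.normal(size=8), bank))          # off-subspace: high
```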

Bitstream-Corrupted Video Recovery: A Novel Benchmark Dataset and Method

  • paper_url: http://arxiv.org/abs/2309.13890
  • repo_url: https://github.com/liutighe/bscv-dataset
  • paper_authors: Tianyi Liu, Kejun Wu, Yi Wang, Wenyang Liu, Kim-Hui Yap, Lap-Pui Chau
  • for: To provide a large-scale bitstream-corrupted video (BSCV) benchmark for addressing realistic video corruption in the real world.
  • methods: A three-parameter corruption model for video bitstreams, a large-scale dataset with rich error patterns, multiple corruption levels, and flexible dataset branches, and a plug-and-play module in a video recovery framework that serves as a benchmark for evaluating existing video inpainting methods.
  • results: Evaluating state-of-the-art video inpainting methods on BSCV exposes their limitations on real-world bitstream corruption and demonstrates the advantages of the proposed framework in solving this recovery problem.
    Abstract The past decade has witnessed great strides in video recovery by specialist technologies, like video inpainting, completion, and error concealment. However, they typically simulate the missing content by manual-designed error masks, thus failing to fill in the realistic video loss in video communication (e.g., telepresence, live streaming, and internet video) and multimedia forensics. To address this, we introduce the bitstream-corrupted video (BSCV) benchmark, the first benchmark dataset with more than 28,000 video clips, which can be used for bitstream-corrupted video recovery in the real world. The BSCV is a collection of 1) a proposed three-parameter corruption model for video bitstream, 2) a large-scale dataset containing rich error patterns, multiple corruption levels, and flexible dataset branches, and 3) a plug-and-play module in video recovery framework that serves as a benchmark. We evaluate state-of-the-art video inpainting methods on the BSCV dataset, demonstrating existing approaches' limitations and our framework's advantages in solving the bitstream-corrupted video recovery problem. The benchmark and dataset are released at https://github.com/LIUTIGHE/BSCV-Dataset.

Skip-Connected Neural Networks with Layout Graphs for Floor Plan Auto-Generation

  • paper_url: http://arxiv.org/abs/2309.13881
  • repo_url: https://github.com/yuntaeJ/SkipNet-FloorPlanGen
  • paper_authors: Yuntae Jeon, Dai Quoc Tran, Seunghee Park
  • for: automated and efficient floor plan designs
  • methods: skip-connected neural networks integrated with layout graphs
  • results: 93.9 mIoU score in the 1st CVAAD workshop challenge
    Abstract With the advent of AI and computer vision techniques, the quest for automated and efficient floor plan designs has gained momentum. This paper presents a novel approach using skip-connected neural networks integrated with layout graphs. The skip-connected layers capture multi-scale floor plan information, and the encoder-decoder networks with GNN facilitate pixel-level probability-based generation. Validated on the MSD dataset, our approach achieved a 93.9 mIoU score in the 1st CVAAD workshop challenge. Code and pre-trained models are publicly available at https://github.com/yuntaeJ/SkipNet-FloorPlanGen.

Attention and Pooling based Sigmoid Colon Segmentation in 3D CT images

  • paper_url: http://arxiv.org/abs/2309.13872
  • repo_url: None
  • paper_authors: Md Akizur Rahman, Sonit Singh, Kuruparan Shanmugalingam, Sankaran Iyer, Alan Blair, Praveen Ravindran, Arcot Sowmya
  • for: To develop a deep learning model based on a modified 3D U-Net architecture for segmenting the sigmoid colon from Computed Tomography (CT) images.
  • methods: Several variations of the 3D U-Net with modified hyper-parameters are examined, and Pyramid pooling (PyP) and channel-spatial Squeeze and Excitation (csSE) are applied to improve model performance; ensemble strategies (averaging, weighted averaging, majority voting, and max ensemble) are also explored.
  • results: PyP and csSE improve segmentation precision, and averaging and majority voting with a 0.5 threshold and consistent weights over the top three models produce optimal results (DSC of 88.11+/-3.52%), indicating that the modified 3D U-Net architecture is effective for segmenting the sigmoid colon in CT images.
    Abstract Segmentation of the sigmoid colon is a crucial aspect of treating diverticulitis. It enables accurate identification and localisation of inflammation, which in turn helps healthcare professionals make informed decisions about the most appropriate treatment options. This research presents a novel deep learning architecture for segmenting the sigmoid colon from Computed Tomography (CT) images using a modified 3D U-Net architecture. Several variations of the 3D U-Net model with modified hyper-parameters were examined in this study. Pyramid pooling (PyP) and channel-spatial Squeeze and Excitation (csSE) were also used to improve the model performance. The networks were trained using manually annotated sigmoid colon. A five-fold cross-validation procedure was used on a test dataset to evaluate the network's performance. As indicated by the maximum Dice similarity coefficient (DSC) of 56.92+/-1.42%, the application of PyP and csSE techniques improves segmentation precision. We explored ensemble methods including averaging, weighted averaging, majority voting, and max ensemble. The results show that average and majority voting approaches with a threshold value of 0.5 and consistent weight distribution among the top three models produced comparable and optimal results with DSC of 88.11+/-3.52%. The results indicate that the application of a modified 3D U-Net architecture is effective for segmenting the sigmoid colon in Computed Tomography (CT) images. In addition, the study highlights the potential benefits of integrating ensemble methods to improve segmentation precision.
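For reference, a small NumPy sketch of the two evaluation ingredients named above: the Dice similarity coefficient and majority voting over thresholded probability maps (equal weights over the top models are assumed, matching the reported setting).

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def majority_vote(prob_maps, threshold=0.5):
    """Ensemble probability maps by majority voting at a 0.5 threshold."""
    votes = np.stack([p >= threshold for p in prob_maps])  # (M, ...)
    return votes.sum(axis=0) > (len(prob_maps) / 2)

# toy usage with three 'models' on an 8x8 slice
rng = np.random.default_rng(0)
maps = [rng.random((8, 8)) for _ in range(3)]
gt = rng.random((8, 8)) >= 0.5
print(dice(majority_vote(maps), gt))
```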

On Calibration of Modern Quantized Efficient Neural Networks

  • paper_url: http://arxiv.org/abs/2309.13866
  • repo_url: None
  • paper_authors: Joey Kuang, Alexander Wong
  • for: To study the calibration of efficient quantized neural networks at various precisions and how calibration quality relates to quantization quality.
  • methods: Three architectures (ShuffleNetv2, GhostNet-VGG, and MobileOne) are evaluated on two datasets (CIFAR-100 and PathMNIST) at different precisions, and temperature scaling is tested for quantized networks.
  • results: Calibration quality is observed to track quantization quality and degrades with lower precision, becoming especially severe in the 4-bit activation regime; GhostNet-VGG is the most robust to the overall performance drop at lower precision, and temperature scaling can improve the calibration error of quantized networks, with some caveats.
    Abstract We explore calibration properties at various precisions for three architectures: ShuffleNetv2, GhostNet-VGG, and MobileOne; and two datasets: CIFAR-100 and PathMNIST. The quality of calibration is observed to track the quantization quality; it is well-documented that performance worsens with lower precision, and we observe a similar correlation with poorer calibration. This becomes especially egregious at 4-bit activation regime. GhostNet-VGG is shown to be the most robust to overall performance drop at lower precision. We find that temperature scaling can improve calibration error for quantized networks, with some caveats. We hope that these preliminary insights can lead to more opportunities for explainable and reliable EdgeML.
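Temperature scaling itself is simple to reproduce: learn a single scalar T on held-out logits by minimizing NLL, leaving the argmax predictions unchanged. The sketch below is the standard recipe, not the paper's exact setup.

```python
import torch

def fit_temperature(logits, labels, lr=0.01, steps=200):
    """Post-hoc temperature scaling: learn one scalar T > 0 that rescales
    validation logits to minimize NLL; only confidence is recalibrated."""
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t) keeps T > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    nll = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        loss = nll(logits / log_t.exp(), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return log_t.exp().item()

# toy usage: deliberately overconfident logits from a hypothetical model
logits = 5.0 * torch.randn(128, 100)
labels = torch.randint(0, 100, (128,))
T = fit_temperature(logits, labels)
calibrated_logits = logits / T   # same predictions, softer probabilities
```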

SuPerPM: A Large Deformation-Robust Surgical Perception Framework Based on Deep Point Matching Learned from Physical Constrained Simulation Data

  • paper_url: http://arxiv.org/abs/2309.13863
  • repo_url: None
  • paper_authors: Shan Lin, Albert J. Miao, Ali Alabiad, Fei Liu, Kaiyuan Wang, Jingpei Lu, Florian Richter, Michael C. Yip
  • for: To achieve better surgical perception, tracking and reconstructing tissue undergoing large deformations with reduced error.
  • methods: Learning-based non-rigid point cloud matching is used for data association to accommodate large deformations; training correspondences are established from endoscopic data of soft tissue manipulated by a surgical robot, with a position-based dynamics (PBD) simulation ensuring the correspondences adhere to physical constraints.
  • results: Superior performance over state-of-the-art surgical scene tracking algorithms on several challenging surgical datasets characterized by large deformations.
    Abstract Manipulation of tissue with surgical tools often results in large deformations that current methods in tracking and reconstructing algorithms have not effectively addressed. A major source of tracking errors during large deformations stems from wrong data association between observed sensor measurements with previously tracked scene. To mitigate this issue, we present a surgical perception framework, SuPerPM, that leverages learning-based non-rigid point cloud matching for data association, thus accommodating larger deformations. The learning models typically require training data with ground truth point cloud correspondences, which is challenging or even impractical to collect in surgical environments. Thus, for tuning the learning model, we gather endoscopic data of soft tissue being manipulated by a surgical robot and then establish correspondences between point clouds at different time points to serve as ground truth. This was achieved by employing a position-based dynamics (PBD) simulation to ensure that the correspondences adhered to physical constraints. The proposed framework is demonstrated on several challenging surgical datasets that are characterized by large deformations, achieving superior performance over state-of-the-art surgical scene tracking algorithms.

Adversarial Attacks on Video Object Segmentation with Hard Region Discovery

  • paper_url: http://arxiv.org/abs/2309.13857
  • repo_url: None
  • paper_authors: Ping Li, Yu Zhang, Li Yuan, Jian Zhao, Xianghua Xu, Xiaoqin Zhang
  • for: To study the security of video object segmentation (VOS) models against adversarial examples.
  • methods: An object-agnostic adversary based on a first-frame attack, which discovers easily confused regions to produce a hardness map and uses it to generate perturbations with stronger adversarial power.
  • results: Experiments show that the attacker significantly degrades the performance of several state-of-the-art video object segmentation models.
    Abstract Video object segmentation has been applied to various computer vision tasks, such as video editing, autonomous driving, and human-robot interaction. However, methods based on deep neural networks are vulnerable to adversarial examples, which are inputs attacked by almost human-imperceptible perturbations; the adversary (i.e., attacker) can fool the segmentation model into making incorrect pixel-level predictions. This raises security concerns in highly demanding tasks, because small perturbations to the input video can pose real attack risks. Though adversarial examples have been extensively studied for classification, they are rarely studied in video object segmentation. Existing related methods in computer vision either require prior knowledge of categories or cannot be directly applied due to task-specific designs, failing to consider pixel-wise region attacks. Hence, this work develops an object-agnostic adversary that attacks VOS through the first frame via hard region discovery. In particular, gradients from the segmentation model are exploited to discover the easily confused regions, in which it is difficult to distinguish pixel-wise objects from the background in a frame. This provides a hardness map that helps to generate perturbations with stronger adversarial power for attacking the first frame. Empirical studies on three benchmarks indicate that our attacker significantly degrades the performance of several state-of-the-art video object segmentation models.
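A rough sketch of a hardness-guided first-frame attack: we approximate "easily confused regions" by a low top-2 prediction margin and use the resulting map to weight an iterative, sign-gradient perturbation that maximizes prediction entropy. The paper's exact hard-region discovery and attack loss may differ.

```python
import torch
import torch.nn.functional as F

def first_frame_attack(model, frame, eps=8 / 255, steps=10, alpha=2 / 255):
    """Object-agnostic first-frame attack sketch. model: frame -> (1, C, H, W)
    segmentation logits; returns the perturbed frame."""
    delta = torch.zeros_like(frame, requires_grad=True)
    for _ in range(steps):
        logits = model(frame + delta)
        probs = F.softmax(logits, dim=1)
        top2 = probs.topk(2, dim=1).values
        hardness = 1.0 - (top2[:, 0] - top2[:, 1])   # low margin -> hard region
        # push predictions toward maximal uncertainty, weighted by hardness
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
        loss = (hardness.detach() * entropy).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()       # gradient ascent on the loss
            delta.clamp_(-eps, eps)                  # keep perturbation imperceptible
        delta.grad.zero_()
    return (frame + delta).detach()
```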

DISeR: Designing Imaging Systems with Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.13851
  • repo_url: None
  • paper_authors: Tzofi Klinghoffer, Kushagra Tiwary, Nikhil Behari, Bhavya Agrawalla, Ramesh Raskar
  • for: To automate imaging system design and improve the performance and reliability of imaging systems.
  • methods: The four building blocks of an imaging system (illumination sources, optical elements, sensors, and perception algorithms) are formulated as a context-free grammar, and a learned camera designer implemented with reinforcement learning searches the combinatorial space of configurations jointly with task-specific perception models.
  • results: Demonstrated on depth estimation and camera rig design for autonomous vehicles, the method yields rigs that outperform industry-wide standards.
    Abstract Imaging systems consist of cameras to encode visual information about the world and perception models to interpret this encoding. Cameras contain (1) illumination sources, (2) optical elements, and (3) sensors, while perception models use (4) algorithms. Directly searching over all combinations of these four building blocks to design an imaging system is challenging due to the size of the search space. Moreover, cameras and perception models are often designed independently, leading to sub-optimal task performance. In this paper, we formulate these four building blocks of imaging systems as a context-free grammar (CFG), which can be automatically searched over with a learned camera designer to jointly optimize the imaging system with task-specific perception models. By transforming the CFG to a state-action space, we then show how the camera designer can be implemented with reinforcement learning to intelligently search over the combinatorial space of possible imaging system configurations. We demonstrate our approach on two tasks, depth estimation and camera rig design for autonomous vehicles, showing that our method yields rigs that outperform industry-wide standards. We believe that our proposed approach is an important step towards automating imaging system design.

Tuning Multi-mode Token-level Prompt Alignment across Modalities

  • paper_url: http://arxiv.org/abs/2309.13847
  • repo_url: https://github.com/wds2014/ALIGN
  • paper_authors: Dongsheng Wang, Miaoge Li, Xinyang Liu, MingSheng Xu, Bo Chen, Hanwang Zhang
  • for: Improving open-world visual concept comprehension in vision-language models.
  • methods: A multi-mode, token-level prompt tuning framework that uses optimal transport to learn and align sets of prompt tokens across modalities.
  • results: Superior generalization and few-shot ability over common methods on image recognition benchmarks; qualitative analysis shows the learned prompt tokens capture diverse visual concepts.
    Abstract Advancements in prompt tuning of vision-language models have underscored their potential in enhancing open-world visual concept comprehension. However, prior works primarily focus on single-mode (only one prompt for each modality) and holistic-level (image or sentence) semantic alignment, which fails to capture sample diversity and leads to sub-optimal prompt discovery. To address this limitation, we propose a multi-mode token-level tuning framework that leverages optimal transport to learn and align a set of prompt tokens across modalities. Specifically, we rely on two essential factors: 1) multi-mode prompt discovery, which guarantees diverse semantic representations, and 2) token-level alignment, which helps explore fine-grained similarity. Consequently, the similarity can be calculated as a hierarchical transportation problem between the modality-specific sets. Extensive experiments on popular image recognition benchmarks show the superior generalization and few-shot abilities of our approach. The qualitative analysis demonstrates that the learned prompt tokens have the ability to capture diverse visual concepts.
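
A minimal sketch of token-level alignment via entropic optimal transport follows; the Sinkhorn iteration and uniform marginals are standard, while the paper's hierarchical multi-mode formulation adds further structure not shown here.

```python
import torch

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropic OT plan between two uniform token distributions."""
    K = torch.exp(-cost / eps)              # (n_img_tokens, n_txt_tokens)
    u = torch.ones(cost.shape[0]) / cost.shape[0]
    v = torch.ones(cost.shape[1]) / cost.shape[1]
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]      # transport plan

def ot_similarity(img_tokens, txt_tokens):
    """Fine-grained similarity = negative OT cost between token sets."""
    img = torch.nn.functional.normalize(img_tokens, dim=-1)
    txt = torch.nn.functional.normalize(txt_tokens, dim=-1)
    cost = 1.0 - img @ txt.T                # cosine distance per token pair
    plan = sinkhorn(cost)
    return -(plan * cost).sum()
```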

Traj-LO: In Defense of LiDAR-Only Odometry Using an Effective Continuous-Time Trajectory

  • paper_url: http://arxiv.org/abs/2309.13842
  • repo_url: https://github.com/kevin2431/traj-lo
  • paper_authors: Xin Zheng, Jianke Zhu
  • for: Exploring LiDAR-only odometry from point clouds alone, rather than improving accuracy with additional inertial sensors.
  • methods: Treats LiDAR measurements as streaming points captured continuously at high frequency, parameterizes the LiDAR motion with a simple yet effective continuous-time trajectory, and recovers the spatial-temporally consistent movement by tightly coupling the geometric information in the point clouds with kinematic smoothness constraints.
  • results: Robust and effective across different LiDAR types and multi-LiDAR systems, even when the kinematic state exceeds the IMU's measuring range; the implementation is open-sourced on GitHub.
    Abstract LiDAR Odometry is an essential component in many robotic applications. Unlike mainstream approaches that focus on improving accuracy through additional inertial sensors, this letter explores the capability of LiDAR-only odometry from a continuous-time perspective. Firstly, the measurements of LiDAR are regarded as streaming points continuously captured at high frequency. Secondly, the LiDAR movement is parameterized by a simple yet effective continuous-time trajectory. Therefore, our proposed Traj-LO approach tries to recover the spatial-temporally consistent movement of LiDAR by tightly coupling the geometric information from LiDAR points and kinematic constraints from trajectory smoothness. This framework is generalized for different kinds of LiDAR as well as multi-LiDAR systems. Extensive experiments on public datasets demonstrate the robustness and effectiveness of our proposed LiDAR-only approach, even in scenarios where the kinematic state exceeds the IMU's measuring range. Our implementation is open-sourced on GitHub.

Fill the K-Space and Refine the Image: Prompting for Dynamic and Multi-Contrast MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2309.13839
  • repo_url: https://github.com/hellopipu/promptmr
  • paper_authors: Bingyu Xin, Meng Ye, Leon Axel, Dimitris N. Metaxas
  • for: Improving the accuracy and efficiency of dynamic and multi-contrast MRI reconstruction, and extending a single model to varied input types and acquisition settings.
  • methods: A two-stage reconstruction pipeline: the first stage fills the missing k-space data as a physics-based reconstruction problem, extended into a prompt-based learning approach (PromptMR) for all-in-one reconstruction across views, contrasts, adjacent types, and acceleration factors; the second stage refines the result as a general video restoration problem in the image domain.
  • results: Significant improvements over previous state-of-the-art accelerated MRI reconstruction methods, with better adaptability to varied input types and settings.
    Abstract The key to dynamic or multi-contrast magnetic resonance imaging (MRI) reconstruction lies in exploring inter-frame or inter-contrast information. Currently, the unrolled model, an approach combining iterative MRI reconstruction steps with learnable neural network layers, stands as the best-performing method for MRI reconstruction. However, there are two main limitations to overcome: firstly, the unrolled model structure and GPU memory constraints restrict the capacity of each denoising block in the network, impeding the effective extraction of detailed features for reconstruction; secondly, the existing model lacks the flexibility to adapt to variations in the input, such as different contrasts, resolutions or views, necessitating the training of separate models for each input type, which is inefficient and may lead to insufficient reconstruction. In this paper, we propose a two-stage MRI reconstruction pipeline to address these limitations. The first stage involves filling the missing k-space data, which we approach as a physics-based reconstruction problem. We first propose a simple yet efficient baseline model, which utilizes adjacent frames/contrasts and channel attention to capture the inherent inter-frame/-contrast correlation. Then, we extend the baseline model to a prompt-based learning approach, PromptMR, for all-in-one MRI reconstruction from different views, contrasts, adjacent types, and acceleration factors. The second stage is to refine the reconstruction from the first stage, which we treat as a general video restoration problem to further fuse features from neighboring frames/contrasts in the image domain. Extensive experiments show that our proposed method significantly outperforms previous state-of-the-art accelerated MRI reconstruction methods.
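
The physics-based first stage can be illustrated with a hard data-consistency step in k-space, sketched below under assumed FFT-shift conventions; the actual PromptMR pipeline learns far more than this.

```python
import torch

def data_consistency(pred_img, kspace_acq, mask):
    """Hard data consistency: keep the network's k-space prediction only
    where no data was acquired.  pred_img: (H, W) complex image;
    kspace_acq: (H, W) acquired (centered) k-space; mask: (H, W) binary
    sampling mask with 1 = acquired."""
    k_pred = torch.fft.fftshift(torch.fft.fft2(pred_img))
    k_mix = mask * kspace_acq + (1 - mask) * k_pred
    return torch.fft.ifft2(torch.fft.ifftshift(k_mix))
```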

IBVC: Interpolation-driven B-frame Video Compression

  • paper_url: http://arxiv.org/abs/2309.13835
  • repo_url: None
  • paper_authors: Meiqin Liu, Chenming Xu, Chao Yao, Weisi Lin, Yao Zhao
  • For: Improving B-frame video compression by addressing the inaccurate quantized motions and inefficient motion compensation of previous learned approaches.
  • Methods: Interpolation-driven B-frame Video Compression (IBVC), built from two major operations: video frame interpolation and artifact-reduction compression. It uses a bit-rate-free MEMC based on interpolation and a residual-guided masking encoder that adaptively selects meaningful contexts with interpolated multi-scale dependencies.
  • Results: Significant improvements over relevant state-of-the-art methods on B-frame coding, with bit-rate savings over the random access (RA) configuration of H.266 (VTM).
    Abstract Learned B-frame video compression aims to adopt bi-directional motion estimation and motion compensation (MEMC) coding for middle frame reconstruction. However, previous learned approaches often directly extend neural P-frame codecs to B-frame relying on bi-directional optical-flow estimation or video frame interpolation. They suffer from inaccurate quantized motions and inefficient motion compensation. To address these issues, we propose a simple yet effective structure called Interpolation-driven B-frame Video Compression (IBVC). Our approach only involves two major operations: video frame interpolation and artifact reduction compression. IBVC introduces a bit-rate free MEMC based on interpolation, which avoids optical-flow quantization and additional compression distortions. Later, to reduce duplicate bit-rate consumption and focus on unaligned artifacts, a residual guided masking encoder is deployed to adaptively select the meaningful contexts with interpolated multi-scale dependencies. In addition, a conditional spatio-temporal decoder is proposed to eliminate location errors and artifacts instead of using MEMC coding in other methods. The experimental results on B-frame coding demonstrate that IBVC has significant improvements compared to the relevant state-of-the-art methods. Meanwhile, our approach can save bit rates compared with the random access (RA) configuration of H.266 (VTM). The code will be available at https://github.com/ruhig6/IBVC.

PARTICLE: Part Discovery and Contrastive Learning for Fine-grained Recognition

  • paper_url: http://arxiv.org/abs/2309.13822
  • repo_url: https://github.com/cvl-umass/PARTICLE
  • paper_authors: Oindrila Saha, Subhransu Maji
  • for: Self-supervised methods for refining representations for fine-grained classification and segmentation tasks.
  • methods: An iterative approach that discovers parts by clustering pixel representations and trains with part-centric equivariance and invariance objectives, after finding instance-discriminative contrastive learning less effective.
  • results: Improved performance on image classification and part segmentation; under a linear-evaluation scheme, the accuracy of a ResNet50 pretrained on ImageNet with DetCon improves from 35.4% to 42.0% on Caltech-UCSD Birds, from 35.5% to 44.1% on FGVC Aircraft, and from 29.7% to 37.4% on Stanford Cars.
    Abstract We develop techniques for refining representations for fine-grained classification and segmentation tasks in a self-supervised manner. We find that fine-tuning methods based on instance-discriminative contrastive learning are not as effective, and posit that recognizing part-specific variations is crucial for fine-grained categorization. We present an iterative learning approach that incorporates part-centric equivariance and invariance objectives. First, pixel representations are clustered to discover parts. We analyze the representations from convolutional and vision transformer networks that are best suited for this task. Then, a part-centric learning step aggregates and contrasts representations of parts within an image. We show that this improves the performance on image classification and part segmentation tasks across datasets. For example, under a linear-evaluation scheme, the classification accuracy of a ResNet50 trained on ImageNet using DetCon, a self-supervised learning approach, improves from 35.4% to 42.0% on the Caltech-UCSD Birds, from 35.5% to 44.1% on the FGVC Aircraft, and from 29.7% to 37.4% on the Stanford Cars. We also observe significant gains in few-shot part segmentation tasks using the proposed technique, while instance-discriminative learning was not as effective. Smaller, yet consistent, improvements are also observed for stronger networks based on transformers.
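
The part-discovery step can be sketched as clustering dense pixel features, as below; the cluster count and backbone are placeholders, and the paper's part-centric equivariance/invariance objectives are not shown.

```python
import torch
from sklearn.cluster import KMeans

def discover_parts(pixel_feats, n_parts=8):
    """Cluster dense pixel features into pseudo-parts.
    pixel_feats: (H, W, D) torch tensor from a conv/ViT backbone."""
    H, W, D = pixel_feats.shape
    flat = pixel_feats.reshape(-1, D).cpu().numpy()
    labels = KMeans(n_clusters=n_parts, n_init=4).fit_predict(flat)
    return torch.from_numpy(labels).reshape(H, W)   # per-pixel part index
```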

MMA-Net: Multiple Morphology-Aware Network for Automated Cobb Angle Measurement

  • paper_url: http://arxiv.org/abs/2309.13817
  • repo_url: None
  • paper_authors: Zhengxuan Qiu, Jie Yang, Jiankun Wang
  • for: Improving the accuracy of automated Cobb angle measurement for scoliosis diagnosis and assessment.
  • methods: Uses multiple spine morphologies (region, centerline, and boundary) as attention information, concatenating the segmentation maps with the original X-ray image as input to a regression module for precise Cobb angle measurement.
  • results: On the AASCE challenge dataset, an SMAPE of 7.28% and an MAE of 3.18°, outperforming competing methods.
    Abstract Scoliosis diagnosis and assessment depend largely on the measurement of the Cobb angle in spine X-ray images. With the emergence of deep learning techniques that employ landmark detection, tilt prediction, and spine segmentation, automated Cobb angle measurement has become increasingly popular. However, these methods encounter difficulties such as high noise sensitivity, intricate computational procedures, and exclusive reliance on a single type of morphological information. In this paper, we introduce the Multiple Morphology-Aware Network (MMA-Net), a novel framework that improves Cobb angle measurement accuracy by integrating multiple spine morphology as attention information. In the MMA-Net, we first feed spine X-ray images into the segmentation network to produce multiple morphological information (spine region, centerline, and boundary) and then concatenate the original X-ray image with the resulting segmentation maps as input for the regression module to perform precise Cobb angle measurement. Furthermore, we devise joint loss functions for our segmentation and regression network training, respectively. We evaluate our method on the AASCE challenge dataset and achieve superior performance with the SMAPE of 7.28% and the MAE of 3.18°, indicating a strong competitiveness compared to other outstanding methods. Consequently, we can offer clinicians automated, efficient, and reliable Cobb angle measurement.
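
For reference, the reported metrics can be computed as below; note that SMAPE normalizations vary between benchmarks, so this is one common convention rather than necessarily the AASCE challenge's exact formula.

```python
import numpy as np

def smape(pred, target):
    """Symmetric mean absolute percentage error (one common convention)."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return 100.0 * np.mean(np.abs(pred - target)
                           / (np.abs(pred) + np.abs(target) + 1e-8))

def mae(pred, target):
    """Mean absolute error, e.g. over predicted vs. ground-truth angles."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))
```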

DVI-SLAM: A Dual Visual Inertial SLAM Network

  • paper_url: http://arxiv.org/abs/2309.13814
  • repo_url: None
  • paper_authors: Xiongfeng Peng, Zhihua Liu, Weiming Li, Ping Tan, SoonYong Cho, Qiang Wang
  • for: This paper aims to improve visual simultaneous localization and mapping (SLAM) methods by better integrating visual information and inertial measurement unit (IMU) data.
  • methods: The proposed method uses a novel deep SLAM network with dual visual factors, which integrates both photometric and re-projection factors into an end-to-end differentiable structure through a multi-factor data association module.
  • results: The proposed method significantly outperforms state-of-the-art methods on several public datasets, including TartanAir, EuRoC, and ETH3D-SLAM. Specifically, the absolute trajectory error was reduced by 45.3% and 36.2% for monocular and stereo configurations on the EuRoC dataset, respectively.
    Abstract Recent deep learning based visual simultaneous localization and mapping (SLAM) methods have made significant progress. However, how to make full use of visual information as well as better integrate with inertial measurement unit (IMU) in visual SLAM has potential research value. This paper proposes a novel deep SLAM network with dual visual factors. The basic idea is to integrate both photometric factor and re-projection factor into the end-to-end differentiable structure through multi-factor data association module. We show that the proposed network dynamically learns and adjusts the confidence maps of both visual factors and it can be further extended to include the IMU factors as well. Extensive experiments validate that our proposed method significantly outperforms the state-of-the-art methods on several public datasets, including TartanAir, EuRoC and ETH3D-SLAM. Specifically, when dynamically fusing the three factors together, the absolute trajectory error for both monocular and stereo configurations on EuRoC dataset has reduced by 45.3% and 36.2% respectively.

Boundary-Aware Proposal Generation Method for Temporal Action Localization

  • paper_url: http://arxiv.org/abs/2309.13810
  • repo_url: None
  • paper_authors: Hao Zhang, Chunyan Feng, Jiahui Yang, Zheng Li, Caili Guo
  • for: A boundary-aware temporal action localization (TAL) method for finding action categories and temporal boundaries in untrimmed videos.
  • methods: Boundary-Aware Proposal Generation (BAPG) with contrastive learning, which treats background frames that resemble action frames as hard negatives; BAPG is independent of existing TAL architectures and can be applied plug-and-play to mainstream TAL models.
  • results: Extensive experiments on THUMOS14 and ActivityNet-1.3 show that BAPG significantly improves TAL performance.
    Abstract The goal of Temporal Action Localization (TAL) is to find the categories and temporal boundaries of actions in an untrimmed video. Most TAL methods rely heavily on action recognition models that are sensitive to action labels rather than temporal boundaries. More importantly, few works consider the background frames that are similar to action frames in pixels but dissimilar in semantics, which also leads to inaccurate temporal boundaries. To address the challenge above, we propose a Boundary-Aware Proposal Generation (BAPG) method with contrastive learning. Specifically, we define the above background frames as hard negative samples. Contrastive learning with hard negative mining is introduced to improve the discrimination of BAPG. BAPG is independent of the existing TAL network architecture, so it can be applied plug-and-play to mainstream TAL models. Extensive experimental results on THUMOS14 and ActivityNet-1.3 demonstrate that BAPG can significantly improve the performance of TAL.
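
A minimal sketch of contrastive learning with mined hard negatives follows, in the spirit of treating action-like background frames as hard negative samples; the feature shapes and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, hard_negatives, tau=0.1):
    """InfoNCE-style loss with mined hard negatives.
    anchor, positive: (D,) frame features; hard_negatives: (N, D)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(hard_negatives, dim=-1)
    pos = torch.exp(a @ p / tau)              # similarity to the positive
    neg = torch.exp(n @ a / tau).sum()        # similarities to hard negatives
    return -torch.log(pos / (pos + neg))
```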

cs.AI - 2023-09-25

Integrating Higher-Order Dynamics and Roadway-Compliance into Constrained ILQR-based Trajectory Planning for Autonomous Vehicles

  • paper_url: http://arxiv.org/abs/2309.14566
  • repo_url: None
  • paper_authors: Hanxiang Li, Jiaqiao Zhang, Sheng Zhu, Dongjian Tang, Donghao Xu
  • for: An on-road trajectory planning method for autonomous passenger vehicles based on the CILQR optimization algorithm, improving safety and comfort.
  • methods: Augments the CILQR formulation with higher-order terms (first- and second-order derivatives of curvature and longitudinal jerk) in the costs and constraints to guarantee controllable trajectories, adds roadway-compliance terms so the vehicle adheres to lane boundaries and directions, and adopts a relaxed logarithmic barrier to remove the dependence on feasible initial trajectories.
  • results: Validated in simulation and real-world driving scenes in real time, showing improved safety and comfort.
    Abstract This paper addresses the advancements in on-road trajectory planning for Autonomous Passenger Vehicles (APV). Trajectory planning aims to produce a globally optimal route for APVs, considering various factors such as vehicle dynamics, constraints, and detected obstacles. Traditional techniques involve a combination of sampling methods followed by optimization algorithms, where the former ensures global awareness and the latter refines for local optima. Notably, the Constrained Iterative Linear Quadratic Regulator (CILQR) optimization algorithm has recently emerged, adapted for APV systems, emphasizing improved safety and comfort. However, existing implementations utilizing the vehicle bicycle kinematic model may not guarantee controllable trajectories. We augment this model by incorporating higher-order terms, including the first and second-order derivatives of curvature and longitudinal jerk. This inclusion facilitates a richer representation in our cost and constraint design. We also address roadway compliance, emphasizing adherence to lane boundaries and directions, which past work often overlooked. Lastly, we adopt a relaxed logarithmic barrier function to address the CILQR's dependency on feasible initial trajectories. The proposed methodology is then validated through simulation and real-world experiment driving scenes in real time.
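
The relaxed logarithmic barrier can be sketched as an exact log barrier above a margin delta with a smooth quadratic extension below it, which keeps the cost finite for infeasible initial trajectories; the constants here are illustrative.

```python
import numpy as np

def relaxed_log_barrier(g, delta=0.1, mu=1.0):
    """Relaxed log barrier for a constraint g(x) >= 0: exact -log(g) when
    the margin exceeds delta, and a quadratic extension below delta that
    matches value and slope at g = delta, so the cost stays finite even
    when the constraint is violated."""
    g = np.asarray(g, dtype=float)
    log_part = -mu * np.log(np.maximum(g, delta))
    quad = mu * (0.5 * (((g - 2 * delta) / delta) ** 2 - 1) - np.log(delta))
    return np.where(g > delta, log_part, quad)
```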

Generative Escher Meshes

  • paper_url: http://arxiv.org/abs/2309.14564
  • repo_url: None
  • paper_authors: Noam Aigerman, Thibault Groueix
  • For: A fully-automatic, text-guided generative method for producing periodic, repeating, tile-able 2D art, such as that seen on floors, mosaics, ceramics, and in the work of M.C. Escher.
  • Methods: An unconstrained, differentiable parameterization of the space of all possible tileable shapes for a given symmetry group, obtained by modifying the Laplacian used in Orbifold Tutte Embedding; a trained image diffusion model defines a loss on the image produced by a differentiable renderer, updating the mesh's geometry and color to match the text prompt.
  • Results: Plausible, appealing results with non-trivial tiles for a variety of periodic tiling patterns.
    Abstract This paper proposes a fully-automatic, text-guided generative method for producing periodic, repeating, tile-able 2D art, such as the one seen on floors, mosaics, ceramics, and the work of M.C. Escher. In contrast to the standard concept of a seamless texture, i.e., square images that are seamless when tiled, our method generates non-square tilings which comprise solely of repeating copies of the same object. It achieves this by optimizing both geometry and color of a 2D mesh, in order to generate a non-square tile in the shape and appearance of the desired object, with close to no additional background details. We enable geometric optimization of tilings by our key technical contribution: an unconstrained, differentiable parameterization of the space of all possible tileable shapes for a given symmetry group. Namely, we prove that modifying the laplacian used in a 2D mesh-mapping technique - Orbifold Tutte Embedding - can achieve all possible tiling configurations for a chosen planar symmetry group. We thus consider both the mesh's tile-shape and its texture as optimizable parameters, rendering the textured mesh via a differentiable renderer. We leverage a trained image diffusion model to define a loss on the resulting image, thereby updating the mesh's parameters based on its appearance matching the text prompt. We show our method is able to produce plausible, appealing results, with non-trivial tiles, for a variety of different periodic tiling patterns.

Training-free Linear Image Inversion via Flows

  • paper_url: http://arxiv.org/abs/2310.04432
  • repo_url: None
  • paper_authors: Ashwini Pokle, Matthew J. Muckley, Ricky T. Q. Chen, Brian Karrer
  • for: Linear image inversion without training
  • methods: Uses pretrained flow-matching models with theoretically justified weighting schemes, adopting prior gradient-correction methods in the flow regime and a solver based on conditional optimal-transport paths, greatly reducing manual hyperparameter tuning.
  • results: On high-dimensional datasets (ImageNet-64/128 and AFHQ-256), the flow-based method solves noisy linear image inversion problems effectively without problem-specific tuning, improving on closely related diffusion-based methods.
    Abstract Training-free linear inversion involves the use of a pretrained generative model and -- through appropriate modifications to the generation process -- solving inverse problems without any finetuning of the generative model. While recent prior methods have explored the use of diffusion models, they still require the manual tuning of many hyperparameters for different inverse problems. In this work, we propose a training-free method for image inversion using pretrained flow models, leveraging the simplicity and efficiency of Flow Matching models, using theoretically-justified weighting schemes and thereby significantly reducing the amount of manual tuning. In particular, we draw inspiration from two main sources: adopting prior gradient correction methods to the flow regime, and a solver scheme based on conditional Optimal Transport paths. As pretrained diffusion models are widely accessible, we also show how to practically adapt diffusion models for our method. Empirically, our approach requires no problem-specific tuning across an extensive suite of noisy linear image inversion problems on high-dimensional datasets, ImageNet-64/128 and AFHQ-256, and we observe that our flow-based method for image inversion significantly improves upon closely-related diffusion-based linear inversion methods.
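
A rough sketch of the overall idea, Euler-integrating a pretrained flow while nudging the state toward the linear measurement, is given below; the `velocity` interface, step count, and guidance weight are assumptions, and the paper's conditional-OT solver and weighting schemes are more elaborate.

```python
import torch

def guided_flow_sample(velocity, y, A, steps=100, guidance=1.0):
    """Euler-integrate a pretrained flow from noise toward data while
    pulling the state toward the measurement y = A x.  `velocity(x, t)`
    is the pretrained flow-matching network; A is a dense matrix here."""
    x = torch.randn(A.shape[1])
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor(i * dt)
        x = x.detach().requires_grad_(True)
        v = velocity(x, t)
        # Gradient of the data-fit term acts as the correction drift.
        residual = 0.5 * ((A @ x - y) ** 2).sum()
        grad = torch.autograd.grad(residual, x)[0]
        x = x + dt * (v - guidance * grad)
    return x.detach()
```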

Disinformation Detection: An Evolving Challenge in the Age of LLMs

  • paper_url: http://arxiv.org/abs/2309.15847
  • repo_url: None
  • paper_authors: Bohan Jiang, Zhen Tan, Ayushi Nirmal, Huan Liu
  • for: Examining the threat posed by disinformation generated with large language models (LLMs), and whether LLMs themselves can serve as a robust defense.
  • methods: Evaluates existing disinformation detection techniques against LLM-generated disinformation and explores exploiting LLMs themselves as detectors.
  • results: Current detection techniques have limited ability to detect LLM-generated disinformation, while LLM-based detectors perform more effectively.
    Abstract The advent of generative Large Language Models (LLMs) such as ChatGPT has catalyzed transformative advancements across multiple domains. However, alongside these advancements, they have also introduced potential threats. One critical concern is the misuse of LLMs by disinformation spreaders, leveraging these models to generate highly persuasive yet misleading content that challenges the disinformation detection system. This work aims to address this issue by answering three research questions: (1) To what extent can the current disinformation detection technique reliably detect LLM-generated disinformation? (2) If traditional techniques prove less effective, can LLMs themself be exploited to serve as a robust defense against advanced disinformation? and, (3) Should both these strategies falter, what novel approaches can be proposed to counter this burgeoning threat effectively? A holistic exploration for the formation and detection of disinformation is conducted to foster this line of research.

Art or Artifice? Large Language Models and the False Promise of Creativity

  • paper_url: http://arxiv.org/abs/2309.14556
  • repo_url: None
  • paper_authors: Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, Chien-Sheng Wu
  • for: Evaluating the creative writing ability of large language models (LLMs).
  • methods: Uses the Consensual Assessment Technique and a proposed Torrance Test of Creative Writing (TTCW), 14 binary tests organized into the dimensions of Fluency, Flexibility, Originality, and Elaboration, to evaluate creativity as a product.
  • results: LLM-generated stories pass 3-10x fewer TTCW tests than stories by professional writers, and none of the LLMs used as assessors correlate positively with expert assessments.
    Abstract Researchers have argued that large language models (LLMs) exhibit high-quality writing capabilities from blogs to stories. However, evaluating objectively the creativity of a piece of writing is challenging. Inspired by the Torrance Test of Creative Thinking (TTCT), which measures creativity as a process, we use the Consensual Assessment Technique [3] and propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product. TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative writers and implement a human assessment of 48 stories written either by professional authors or LLMs using TTCW. Our analysis shows that LLM-generated stories pass 3-10X less TTCW tests than stories written by professionals. In addition, we explore the use of LLMs as assessors to automate the TTCW evaluation, revealing that none of the LLMs positively correlate with the expert assessments.

Tactile Estimation of Extrinsic Contact Patch for Stable Placement

  • paper_url: http://arxiv.org/abs/2309.14552
  • repo_url: None
  • paper_authors: Kei Ota, Devesh K. Jha, Krishna Murthy Jatavallabhula, Asako Kanezaki, Joshua B. Tenenbaum
  • for: Giving robots the fine-grained manipulation skills needed to stack complex-shaped objects on top of each other.
  • methods: Feedback skills that reason about placement stability from very gentle contact interactions, estimating the extrinsic contact patch between a grasped object and its environment from force and tactile observations.
  • results: Stability of object placement can be inferred from tactile readings during contact formation, including stability upon release of the grasp; demonstrated on various pairs of objects from a popular board game.
    Abstract Precise perception of contact interactions is essential for the fine-grained manipulation skills for robots. In this paper, we present the design of feedback skills for robots that must learn to stack complex-shaped objects on top of each other. To design such a system, a robot should be able to reason about the stability of placement from very gentle contact interactions. Our results demonstrate that it is possible to infer the stability of object placement based on tactile readings during contact formation between the object and its environment. In particular, we estimate the contact patch between a grasped object and its environment using force and tactile observations to estimate the stability of the object during a contact formation. The contact patch could be used to estimate the stability of the object upon the release of the grasp. The proposed method is demonstrated on various pairs of objects that are used in a very popular board game.

Algorithmic Collusion or Competition: the Role of Platforms’ Recommender Systems

  • paper_url: http://arxiv.org/abs/2309.14548
  • repo_url: None
  • paper_authors: Xingchen Xu, Stephanie Lee, Yong Tan
  • For: This paper examines how recommendation algorithms used by e-commerce platforms can impact the competitive dynamics of AI-based pricing algorithms.* Methods: The paper uses a repeated game framework to model the interactions between sellers and the platform’s recommender system, and conducts experiments to observe price dynamics and determine the final equilibrium.* Results: The paper finds that a profit-based recommender system can intensify algorithmic collusion among sellers, while a demand-based recommender system can foster price competition and result in lower prices. The results are robust in various market scenarios.
    Abstract Recent academic research has extensively examined algorithmic collusion resulting from the utilization of artificial intelligence (AI)-based dynamic pricing algorithms. Nevertheless, e-commerce platforms employ recommendation algorithms to allocate exposure to various products, and this important aspect has been largely overlooked in previous studies on algorithmic collusion. Our study bridges this important gap in the literature and examines how recommendation algorithms can determine the competitive or collusive dynamics of AI-based pricing algorithms. Specifically, two commonly deployed recommendation algorithms are examined: (i) a recommender system that aims to maximize the sellers' total profit (profit-based recommender system) and (ii) a recommender system that aims to maximize the demand for products sold on the platform (demand-based recommender system). We construct a repeated game framework that incorporates both pricing algorithms adopted by sellers and the platform's recommender system. Subsequently, we conduct experiments to observe price dynamics and ascertain the final equilibrium. Experimental results reveal that a profit-based recommender system intensifies algorithmic collusion among sellers due to its congruence with sellers' profit-maximizing objectives. Conversely, a demand-based recommender system fosters price competition among sellers and results in a lower price, owing to its misalignment with sellers' goals. Extended analyses suggest the robustness of our findings in various market scenarios. Overall, we highlight the importance of platforms' recommender systems in delineating the competitive structure of the digital marketplace, providing important insights for market participants and corresponding policymakers.
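
A stylized toy of the repeated game can make the mechanism concrete: two stateless Q-learning sellers price on a grid while the platform routes demand by either a profit- or demand-based rule. The demand function, learning rates, and exposure rule below are all illustrative, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
prices = np.linspace(1.0, 2.0, 6)        # discrete price grid
Q = np.zeros((2, len(prices)))           # stateless Q-values per seller
eps, alpha = 0.1, 0.1
MODE = "profit"                          # platform objective: "profit" or "demand"

def exposure(p, mode):
    """Recommender's demand share per seller: a profit-based rule favors
    the pricier (higher-margin) listing, a demand-based rule the cheaper."""
    score = p if mode == "profit" else -p
    w = np.exp(score)
    return w / w.sum()

for t in range(50_000):
    acts = [int(rng.integers(len(prices))) if rng.random() < eps
            else int(np.argmax(Q[i])) for i in range(2)]
    p = prices[acts]
    profit = exposure(p, MODE) * (10.0 - 4.0 * p) * p   # toy linear demand
    for i in range(2):
        Q[i, acts[i]] += alpha * (profit[i] - Q[i, acts[i]])

print("learned prices:", prices[[int(np.argmax(Q[i])) for i in range(2)]])
```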

Effect of roundabout design on the behavior of road users: A case study of roundabouts with application of Unsupervised Machine Learning

  • paper_url: http://arxiv.org/abs/2309.14540
  • repo_url: None
  • paper_authors: Tasnim M. Dwekat, Ayda A. Almsre, Huthaifa I. Ashqar
  • for: Evaluating roundabout performance and studying how human drivers behave when interacting with roundabouts.
  • methods: Observes (bus, car, truck) drivers, categorizes their behavior as (conservative, normal, aggressive) using unsupervised machine learning, and develops a method for predicting road-user behavior at roundabout intersections.
  • results: Roundabouts reduce speed at twisting intersections, with entry speed and its effect depending on the road-user class; car speeds through the roundabout were better suited than those of buses and trucks. Two inherent features contribute to safety: all approaching drivers must slow down, increasing reaction time and mitigating accident consequences, and with fewer conflicting flows drivers need only look left (in right-hand traffic), making crossing easier.
    Abstract This research aims to evaluate the performance of roundabouts and to study how human drivers behave when interacting with them. In recent years, roundabouts have seen increasing use across countries due to their safety, capacity, and environmental advantages, and because they provide safe and fluid vehicle flows for transit and integration. Roundabouts can significantly reduce speed at twisting intersections; entry speed and the resulting effect on speed depend on the class of road user. In our research, (bus, car, truck) drivers were given special attention and their behavior was categorized as (conservative, normal, aggressive). Anticipating and recognizing driver behavior is an important challenge, so this research studies the effect of roundabouts on these classes and develops a method for predicting the behavior of road users at roundabout intersections. Safety derives primarily from two inherent features of the roundabout. First, by comparing the data collected and processed to classify and evaluate driver behavior, the crossing speeds of car drivers were better suited to the roundabout than those of buses and trucks, since the car is smaller and all parts of the roundabout are visible to it; drivers arriving from all directions must slow down, giving them more time to react and mitigating the consequences of an accident. Second, with fewer conflicting flows (and points of conflict), drivers only need to look to their left (in right-hand traffic) for other vehicles, making it easier to cross the roundabout as there is less need to split attention between directions.
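
The unsupervised classification step can be sketched with k-means over per-driver speed features, as below; the feature set and sample values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-driver features: [entry speed, circulating speed, deceleration].
X = np.array([[32, 25, 2.1], [45, 38, 1.2], [28, 22, 2.8],
              [50, 41, 0.9], [35, 27, 1.9], [41, 33, 1.4]], dtype=float)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Order clusters by mean entry speed: slowest = conservative, fastest = aggressive.
order = np.argsort(km.cluster_centers_[:, 0])
names = {order[0]: "conservative", order[1]: "normal", order[2]: "aggressive"}
print([names[c] for c in km.labels_])
```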

Watch Your Language: Large Language Models and Content Moderation

  • paper_url: http://arxiv.org/abs/2309.14517
  • repo_url: None
  • paper_authors: Deepak Kumar, Yousef AbuHashem, Zakir Durumeric
  • for: Studying how large language models (LLMs) perform on content moderation tasks.
  • methods: Evaluates modern commercial LLMs (GPT-3, GPT-3.5, GPT-4) on two common content moderation tasks: rule-based community moderation and toxic content detection.
  • results: LLMs can be effective at rule-based moderation for many communities, with a median accuracy of 64% and a median precision of 83%; on toxicity detection, LLMs significantly outperform existing commercial toxicity classifiers, though recent increases in model size add only marginal benefit, suggesting a performance plateau on this task.
    Abstract Large language models (LLMs) have exploded in popularity due to their ability to perform a wide array of natural language tasks. Text-based content moderation is one LLM use case that has received recent enthusiasm, however, there is little research investigating how LLMs perform in content moderation settings. In this work, we evaluate a suite of modern, commercial LLMs (GPT-3, GPT-3.5, GPT-4) on two common content moderation tasks: rule-based community moderation and toxic content detection. For rule-based community moderation, we construct 95 LLM moderation-engines prompted with rules from 95 Reddit subcommunities and find that LLMs can be effective at rule-based moderation for many communities, achieving a median accuracy of 64% and a median precision of 83%. For toxicity detection, we find that LLMs significantly outperform existing commercially available toxicity classifiers. However, we also find that recent increases in model size add only marginal benefit to toxicity detection, suggesting a potential performance plateau for LLMs on toxicity detection tasks. We conclude by outlining avenues for future work in studying LLMs and content moderation.
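
A sketch of how a rule-based moderation engine might be prompted is shown below; the prompt wording and the `call_llm` completion function are illustrative stand-ins, not the paper's exact setup.

```python
def build_moderation_prompt(rules, comment):
    """Construct a rule-based moderation prompt from a community's rules."""
    rule_text = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(rules))
    return (
        "You are a content moderator for an online community.\n"
        f"Community rules:\n{rule_text}\n\n"
        f"Comment: {comment}\n\n"
        "Does this comment violate any rule? Answer YES or NO, "
        "then name the rule number if YES."
    )

def moderate(call_llm, rules, comment):
    """call_llm is any text-completion function (a hypothetical interface)."""
    reply = call_llm(build_moderation_prompt(rules, comment))
    return reply.strip().upper().startswith("YES")
```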

Interaction-Aware Decision-Making for Autonomous Vehicles in Forced Merging Scenario Leveraging Social Psychology Factors

  • paper_url: http://arxiv.org/abs/2309.14497
  • repo_url: None
  • paper_authors: Xiao Li, Kaiwen Liu, H. Eric Tseng, Anouck Girard, Ilya Kolmanovsky
  • for: Helping autonomous vehicles accomplish their driving tasks in complex traffic scenarios, particularly highway forced merging.
  • methods: A behavioral model that incorporates both the social behaviors and personal objectives of interacting drivers; on top of it, a receding-horizon control-based decision-making strategy estimates other drivers' intentions online with Bayesian filtering and predicts nearby vehicles' behaviors under uncertain intentions.
  • results: Simulation studies, comparison against a game-theoretic controller, and a real-world traffic dataset demonstrate the strategy's effectiveness.
    Abstract Understanding the intention of vehicles in the surrounding traffic is crucial for an autonomous vehicle to successfully accomplish its driving tasks in complex traffic scenarios such as highway forced merging. In this paper, we consider a behavioral model that incorporates both social behaviors and personal objectives of the interacting drivers. Leveraging this model, we develop a receding-horizon control-based decision-making strategy, that estimates online the other drivers' intentions using Bayesian filtering and incorporates predictions of nearby vehicles' behaviors under uncertain intentions. The effectiveness of the proposed decision-making strategy is demonstrated and evaluated based on simulation studies in comparison with a game theoretic controller and a real-world traffic dataset.
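
The online intention estimation can be illustrated with one step of a discrete Bayesian filter, sketched below; the two-intention state space, transition matrix, and observation likelihoods are toy assumptions.

```python
import numpy as np

def bayes_intention_update(belief, likelihoods, transition):
    """One step of a discrete Bayesian filter over driver intentions.
    belief: (K,) prior over intentions; likelihoods: (K,) p(observation |
    intention) from the behavioral model; transition: (K, K) matrix."""
    predicted = transition.T @ belief          # intention may drift over time
    posterior = likelihoods * predicted
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])                  # [yield, proceed]
T = np.array([[0.9, 0.1], [0.1, 0.9]])
# Observed deceleration is much more likely under a yielding intention.
belief = bayes_intention_update(belief, np.array([0.8, 0.2]), T)
print(belief)
```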

Era Splitting – Invariant Learning for Decision Trees

  • paper_url: http://arxiv.org/abs/2309.14496
  • repo_url: https://github.com/jefferythewind/era-splitting-notebook-examples
  • paper_authors: Timothy DeLise
  • For: Addressing out-of-distribution (OOD) generalization in decision tree models, specifically random forests and gradient-boosted decision trees.
  • Methods: Two new splitting criteria that incorporate era-wise information into the splitting process, so the models find split points that are optimal across all disjoint eras in the data rather than over the pooled data set.
  • Results: Unique experiments showcase the benefits of the new criteria, which improve out-of-sample metrics; the criteria are incorporated into a state-of-the-art gradient-boosted decision tree model in the Scikit-Learn code base, made freely available.
    Abstract Real life machine learning problems exhibit distributional shifts in the data from one time to another or from on place to another. This behavior is beyond the scope of the traditional empirical risk minimization paradigm, which assumes i.i.d. distribution of data over time and across locations. The emerging field of out-of-distribution (OOD) generalization addresses this reality with new theory and algorithms which incorporate environmental, or era-wise information into the algorithms. So far, most research has been focused on linear models and/or neural networks. In this research we develop two new splitting criteria for decision trees, which allow us to apply ideas from OOD generalization research to decision tree models, including random forest and gradient-boosting decision trees. The new splitting criteria use era-wise information associated with each data point to allow tree-based models to find split points that are optimal across all disjoint eras in the data, instead of optimal over the entire data set pooled together, which is the default setting. We describe the new splitting criteria in detail and develop unique experiments to showcase the benefits of these new criteria, which improve metrics in our experiments out-of-sample. The new criteria are incorporated into the a state-of-the-art gradient boosted decision tree model in the Scikit-Learn code base, which is made freely available.
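
The flavor of an era-wise splitting criterion can be sketched as scoring a candidate split separately within each era and aggregating, e.g. by the minimum so the split must help in every era; the exact formulas in the paper differ.

```python
import numpy as np

def sse(y):
    """Sum of squared errors around the mean (0 for an empty split)."""
    return len(y) * np.var(y) if len(y) else 0.0

def era_split_score(x, y, era, threshold, agg=np.min):
    """Score a candidate split per era and aggregate: taking the minimum
    rewards splits that work in *every* disjoint era rather than only in
    the pooled data."""
    scores = []
    for e in np.unique(era):
        m = era == e
        left = y[m][x[m] <= threshold]
        right = y[m][x[m] > threshold]
        gain = sse(y[m]) - sse(left) - sse(right)
        scores.append(gain / m.sum())    # normalize by era size
    return agg(scores)
```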

A Novel Deep Learning Technique for Morphology Preserved Fetal ECG Extraction from Mother ECG using 1D-CycleGAN

  • paper_url: http://arxiv.org/abs/2310.03759
  • repo_url: None
  • paper_authors: Promit Basak, A. H. M Nazmus Sakib, Muhammad E. H. Chowdhury, Nasser Al-Emadi, Huseyin Cagatay Yalcin, Shona Pedersen, Sakib Mahmud, Serkan Kiranyaz, Somaya Al-Maadeed
  • for: Monitoring the fetal heart's electrical signal for early diagnosis of fetal heart disease and follow-up care.
  • methods: Reconstructs the fetal ECG from the maternal ECG using a 1D CycleGAN, with extensive preprocessing and an appropriate framework to preserve the signal's morphology.
  • results: High-fidelity fetal ECG reconstruction enabling accurate fQRS detection and estimation of fetal heart rate and R-R interval, with accuracy comparable to existing state-of-the-art techniques while retaining the ECG morphology.
    Abstract Monitoring the electrical pulse of fetal heart through a non-invasive fetal electrocardiogram (fECG) can easily detect abnormalities in the developing heart to significantly reduce the infant mortality rate and post-natal complications. Due to the overlapping of maternal and fetal R-peaks, the low amplitude of the fECG, systematic and ambient noises, typical signal extraction methods, such as adaptive filters, independent component analysis, empirical mode decomposition, etc., are unable to produce satisfactory fECG. While some techniques can produce accurate QRS waves, they often ignore other important aspects of the ECG. Our approach, which is based on 1D CycleGAN, can reconstruct the fECG signal from the mECG signal while maintaining the morphology due to extensive preprocessing and appropriate framework. The performance of our solution was evaluated by combining two available datasets from Physionet, "Abdominal and Direct Fetal ECG Database" and "Fetal electrocardiograms, direct and abdominal with reference heartbeat annotations", where it achieved an average PCC and Spectral-Correlation score of 88.4% and 89.4%, respectively. It detects the fQRS of the signal with accuracy, precision, recall and F1 score of 92.6%, 97.6%, 94.8% and 96.4%, respectively. It can also accurately produce the estimation of fetal heart rate and R-R interval with an error of 0.25% and 0.27%, respectively. The main contribution of our work is that, unlike similar studies, it can retain the morphology of the ECG signal with high fidelity. The accuracy of our solution for fetal heart rate and R-R interval length is comparable to existing state-of-the-art techniques. This makes it a highly effective tool for early diagnosis of fetal heart diseases and regular health checkups of the fetus.

When Automated Assessment Meets Automated Content Generation: Examining Text Quality in the Era of GPTs

  • paper_url: http://arxiv.org/abs/2309.14488
  • repo_url: https://github.com/nd-hal/automated-ml-scoring-versus-generation
  • paper_authors: Marialena Bevilacqua, Kezia Oketch, Ruiyang Qin, Will Stamey, Xinyuan Zhang, Yi Gan, Kai Yang, Ahmed Abbasi
  • for: Examining how ML-based scoring models trained on human content assess the quality of text generated by humans versus GPTs.
  • methods: An analysis framework spanning essay-scoring ML models (transformer pretrained language models, CNN/RNN, and feature-based methods), human- and GPT-generated essays, and a statistical model accounting for respondent type, prompt genre, and the assessment model, on a testbed of 18,460 essays.
  • results: Transformer pretrained language models (PLMs) score human essay quality more accurately than CNN/RNN and feature-based methods, yet score GPT-generated text 10-15% higher on average, whereas traditional deep learning and feature-based models score human text considerably higher; the PLMs attend prominently to tokens appearing only in GPT-generated text, possibly due to pre-training overlap.
    Abstract The use of machine learning (ML) models to assess and score textual data has become increasingly pervasive in an array of contexts including natural language processing, information retrieval, search and recommendation, and credibility assessment of online content. A significant disruption at the intersection of ML and text are text-generating large-language models such as generative pre-trained transformers (GPTs). We empirically assess the differences in how ML-based scoring models trained on human content assess the quality of content generated by humans versus GPTs. To do so, we propose an analysis framework that encompasses essay scoring ML-models, human and ML-generated essays, and a statistical model that parsimoniously considers the impact of type of respondent, prompt genre, and the ML model used for assessment model. A rich testbed is utilized that encompasses 18,460 human-generated and GPT-based essays. Results of our benchmark analysis reveal that transformer pretrained language models (PLMs) more accurately score human essay quality as compared to CNN/RNN and feature-based ML methods. Interestingly, we find that the transformer PLMs tend to score GPT-generated text 10-15\% higher on average, relative to human-authored documents. Conversely, traditional deep learning and feature-based ML models score human text considerably higher. Further analysis reveals that although the transformer PLMs are exclusively fine-tuned on human text, they more prominently attend to certain tokens appearing only in GPT-generated text, possibly due to familiarity/overlap in pre-training. Our framework and results have implications for text classification settings where automated scoring of text is likely to be disrupted by generative AI.

Incorporating Ensemble and Transfer Learning For An End-To-End Auto-Colorized Image Detection Model

  • paper_url: http://arxiv.org/abs/2309.14478
  • repo_url: None
  • paper_authors: Ahmed Samir Ragab, Shereen Aly Taie, Howida Youssry Abdelnaby
  • for: A new detection method for distinguishing natural-color images from computer-colorized images.
  • methods: Combines the advantages of transfer and ensemble learning, using pre-trained VGG16 and ResNet50 backbones along with MobileNet v2 or EfficientNet feature vectors, to reduce training time and resource requirements.
  • results: Accuracy ranging from 94.55% to 99.13% with very low Half Total Error Rate values, outperforming existing state-of-the-art models in classification performance and generalization.
    Abstract Image colorization is the process of colorizing grayscale images or recoloring an already-color image. This image manipulation can be used for grayscale satellite, medical and historical images making them more expressive. With the help of the increasing computation power of deep learning techniques, the colorization algorithms results are becoming more realistic in such a way that human eyes cannot differentiate between natural and colorized images. However, this poses a potential security concern, as forged or illegally manipulated images can be used illegally. There is a growing need for effective detection methods to distinguish between natural color and computer-colorized images. This paper presents a novel approach that combines the advantages of transfer and ensemble learning approaches to help reduce training time and resource requirements while proposing a model to classify natural color and computer-colorized images. The proposed model uses pre-trained branches VGG16 and Resnet50, along with Mobile Net v2 or Efficientnet feature vectors. The proposed model showed promising results, with accuracy ranging from 94.55% to 99.13% and very low Half Total Error Rate values. The proposed model outperformed existing state-of-the-art models regarding classification performance and generalization capabilities.
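
A sketch of the transfer-plus-ensemble idea with frozen VGG16 and ResNet50 branches follows; the pooling, classifier head, and torchvision weights are illustrative choices rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class ColorizationDetector(nn.Module):
    """Concatenate frozen VGG16 and ResNet50 feature vectors and classify
    natural vs. computer-colorized images."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1")
        res = models.resnet50(weights="IMAGENET1K_V1")
        self.vgg_feats = nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(1))
        self.res_feats = nn.Sequential(*list(res.children())[:-1])
        for p in self.parameters():
            p.requires_grad = False          # frozen, transfer-learning style
        self.head = nn.Linear(512 + 2048, 1) # defined after freezing: trainable

    def forward(self, x):                    # x: (B, 3, 224, 224)
        f1 = self.vgg_feats(x).flatten(1)    # (B, 512)
        f2 = self.res_feats(x).flatten(1)    # (B, 2048)
        return torch.sigmoid(self.head(torch.cat([f1, f2], dim=1)))
```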

Adapting Double Q-Learning for Continuous Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.14471
  • repo_url: None
  • paper_authors: Arsenii Kuznetsov
  • for: A new overestimation-bias control method for off-policy reinforcement learning.
  • methods: A policy in the form of a two-component mixture, where each component is maximized and assessed by separate networks, removing any basis for overestimation bias.
  • results: Promising near-SOTA results on a small set of MuJoCo environments.
    Abstract Majority of off-policy reinforcement learning algorithms use overestimation bias control techniques. Most of these techniques rooted in heuristics, primarily addressing the consequences of overestimation rather than its fundamental origins. In this work we present a novel approach to the bias correction, similar in spirit to Double Q-Learning. We propose using a policy in form of a mixture with two components. Each policy component is maximized and assessed by separate networks, which removes any basis for the overestimation bias. Our approach shows promising near-SOTA results on a small set of MuJoCo environments.
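
A minimal sketch of the mixture idea, with two policy components each paired with its own critic, is given below; the network sizes, uniform component selection, and cross-evaluation rule are illustrative interpretations, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MixturePolicy(nn.Module):
    """Two-component mixture policy with a separate critic per component."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.actors = nn.ModuleList(
            [nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                           nn.Linear(hidden, act_dim), nn.Tanh())
             for _ in range(2)])
        self.critics = nn.ModuleList(
            [nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                           nn.Linear(hidden, 1))
             for _ in range(2)])

    def act(self, obs):
        # Pick a component uniformly so the policy stays a true mixture.
        k = torch.randint(2, (1,)).item()
        return self.actors[k](obs)

    def q_target(self, obs, k):
        # Component k's action is assessed by the *other* critic,
        # removing the shared basis for overestimation.
        a = self.actors[k](obs)
        return self.critics[1 - k](torch.cat([obs, a], dim=-1))
```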

DefGoalNet: Contextual Goal Learning from Demonstrations For Deformable Object Manipulation

  • paper_url: http://arxiv.org/abs/2309.14463
  • repo_url: None
  • paper_authors: Bao Thach, Tanner Watts, Shing-Hei Ho, Tucker Hermans, Alan Kuntz
  • for: Addresses shape servoing, the problem of controlling a deformable object toward a desired goal shape.
  • methods: Develops DefGoalNet, a neural network that learns deformable-object goal shapes directly from a small number of human demonstrations (a hedged sketch follows the abstract).
  • results: Tested on multiple tasks in simulation and on a physical robot, including a surgical retraction task, reaching a median success rate of nearly 90% even when trained with only 10 demonstrations, bringing deformable object manipulation closer to practical, real-world applications.
    Abstract Shape servoing, a robotic task dedicated to controlling objects to desired goal shapes, is a promising approach to deformable object manipulation. An issue arises, however, with the reliance on the specification of a goal shape. This goal has been obtained either by a laborious domain knowledge engineering process or by manually manipulating the object into the desired shape and capturing the goal shape at that specific moment, both of which are impractical in various robotic applications. In this paper, we solve this problem by developing a novel neural network DefGoalNet, which learns deformable object goal shapes directly from a small number of human demonstrations. We demonstrate our method's effectiveness on various robotic tasks, both in simulation and on a physical robot. Notably, in the surgical retraction task, even when trained with as few as 10 demonstrations, our method achieves a median success percentage of nearly 90%. These results mark a substantial advancement in enabling shape servoing methods to bring deformable object manipulation closer to practical, real-world applications.
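
A hedged sketch of the learning problem the abstract describes: predict a deformable object's goal point cloud from its current point cloud plus a task-context cloud, supervised on a handful of demonstrations with a Chamfer loss. The PointNet-style max-pooled encoder, the layer sizes, and the choice of Chamfer distance are assumptions, not the published DefGoalNet architecture.

```python
import torch
import torch.nn as nn

class GoalNet(nn.Module):
    def __init__(self, n_points=256, feat=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat))
        self.decode = nn.Sequential(nn.Linear(2 * feat, 256), nn.ReLU(),
                                    nn.Linear(256, n_points * 3))
        self.n_points = n_points

    def forward(self, obj_pts, ctx_pts):                 # (B, N, 3) point clouds
        f_obj = self.encode(obj_pts).max(dim=1).values   # permutation-invariant pooling
        f_ctx = self.encode(ctx_pts).max(dim=1).values
        goal = self.decode(torch.cat([f_obj, f_ctx], dim=-1))
        return goal.view(-1, self.n_points, 3)

def chamfer(a, b):
    """Symmetric Chamfer distance between point clouds of shape (B, N, 3)."""
    d = torch.cdist(a, b)                                # pairwise distances (B, N, N)
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()
```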

Online Active Learning For Sound Event Detection

  • paper_url: http://arxiv.org/abs/2309.14460
  • repo_url: None
  • paper_authors: Mark Lindsey, Ankit Shah, Francis Kubala, Richard M. Stern
  • for: Aims to reduce the annotation effort needed for supervised learning in Sound Event Detection (SED).
  • methods: Applies Online Active Learning (OAL) to cut the time and effort required, and introduces new loss functions that address known OAL problems such as fluctuating class distributions and data drift (a hedged sketch follows the abstract).
  • results: Experiments show that OAL reduces the time and effort needed to train SED classifiers by a factor of 5 on the SONYC dataset, and the new methods successfully resolve issues present in existing OAL approaches.
    Abstract Data collection and annotation is a laborious, time-consuming prerequisite for supervised machine learning tasks. Online Active Learning (OAL) is a paradigm that addresses this issue by simultaneously minimizing the amount of annotation required to train a classifier and adapting to changes in the data over the duration of the data collection process. Prior work has indicated that fluctuating class distributions and data drift are still common problems for OAL. This work presents new loss functions that address these challenges when OAL is applied to Sound Event Detection (SED). Experimental results from the SONYC dataset and two Voice-Type Discrimination (VTD) corpora indicate that OAL can reduce the time and effort required to train SED classifiers by a factor of 5 for SONYC, and that the new methods presented here successfully resolve issues present in existing OAL methods.
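
A hedged sketch of an online active learning loop for SED: the audio stream is consumed in batches, labels are requested only for low-confidence clips, and the model adapts online. Plain uncertainty sampling stands in for the paper's new loss-function-based criteria; `request_annotations` and the sklearn-style `partial_fit` interface are placeholders.

```python
import numpy as np

def online_active_learning(stream, model, request_annotations,
                           budget_per_batch=8, thresh=0.65):
    """stream yields (n, feat_dim) arrays; model has predict_proba/partial_fit."""
    for batch in stream:
        confidence = model.predict_proba(batch).max(axis=1)
        # Query the least-confident clips, up to the labeling budget.
        query = np.argsort(confidence)[:budget_per_batch]
        query = query[confidence[query] < thresh]
        if len(query):
            labels = request_annotations(batch[query])   # human in the loop
            model.partial_fit(batch[query], labels)      # adapt to drift online
```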

Self-Recovery Prompting: Promptable General Purpose Service Robot System with Foundation Models and Self-Recovery

  • paper_url: http://arxiv.org/abs/2309.14425
  • repo_url: None
  • paper_authors: Mimo Shirasaka, Tatsuya Matsushima, Soshi Tsunashima, Yuya Ikeda, Aoi Horo, So Ikoma, Chikaha Tsuji, Hikaru Wada, Tsunekazu Omija, Dai Komukai, Yutaka Matsuo, Yusuke Iwasawa
  • for: Develops a general-purpose service robot (GPSR) that can execute diverse tasks in various environments, which requires a system highly generalizable and adaptable to tasks and environments.
  • methods: First builds a top-level GPSR system (used in the RoboCup@Home 2023 worldwide competition) on multiple foundation models, made generalizable and adaptive by prompting each model (a hedged sketch of the recovery loop follows the abstract).
  • results: Analysis reveals three failure types in more realistic GPSR settings: insufficient information, incorrect plan generation, and plan execution failure. The proposed self-recovery prompting pipeline explores the necessary information and modifies prompts to recover; experiments confirm the system completes tasks while resolving various failure cases.
    Abstract A general-purpose service robot (GPSR), which can execute diverse tasks in various environments, requires a system with high generalizability and adaptability to tasks and environments. In this paper, we first developed a top-level GPSR system for worldwide competition (RoboCup@Home 2023) based on multiple foundation models. This system is both generalizable to variations and adaptive by prompting each model. Then, by analyzing the performance of the developed system, we found three types of failure in more realistic GPSR application settings: insufficient information, incorrect plan generation, and plan execution failure. We then propose the self-recovery prompting pipeline, which explores the necessary information and modifies its prompts to recover from failure. We experimentally confirm that the system with the self-recovery mechanism can accomplish tasks by resolving various failure cases. Supplementary videos are available at https://sites.google.com/view/srgpsr .
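
A hedged sketch of the self-recovery prompting idea: execute a foundation-model plan, classify any failure into the three types identified above, and augment the prompt before replanning. `llm`, `execute`, and `diagnose` are placeholders for the robot stack, not the authors' API.

```python
def run_with_self_recovery(task, llm, execute, diagnose, max_retries=3):
    prompt = f"Plan steps for the service-robot task: {task}"
    for _ in range(max_retries):
        plan = llm(prompt)
        ok, failure = execute(plan)        # failure: dict with 'type' and 'detail'
        if ok:
            return plan
        if failure["type"] == "insufficient_information":
            missing = diagnose(failure)    # e.g. ask the user or re-perceive the scene
            prompt += f"\nAdditional information: {missing}"
        elif failure["type"] == "incorrect_plan":
            prompt += f"\nThe previous plan failed because: {failure['detail']}. Replan."
        else:                              # plan execution failure
            prompt += (f"\nStep '{failure['detail']}' could not be executed; "
                       "propose an alternative.")
    raise RuntimeError("task not accomplished within the retry budget")
```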

Extreme Parkour with Legged Robots

  • paper_url: http://arxiv.org/abs/2309.14341
  • repo_url: None
  • paper_authors: Xuxin Cheng, Kexin Shi, Ananye Agarwal, Deepak Pathak
  • for: Develops parkour skills on a small, low-cost robot so it can traverse highly challenging terrain.
  • methods: Uses a single front-facing depth camera and large-scale reinforcement learning in simulation to produce precise control behavior end-to-end, directly from camera images.
  • results: The robot can jump onto obstacles twice its height, leap across gaps twice its length, do handstands, run across tilted ramps, and generalize to novel obstacle courses with different physical properties.
    Abstract Humans can perform parkour by traversing obstacles in a highly dynamic fashion requiring precise eye-muscle coordination and movement. Getting robots to do the same task requires overcoming similar challenges. Classically, this is done by independently engineering perception, actuation, and control systems to very low tolerances. This restricts them to tightly controlled settings such as a predetermined obstacle course in labs. In contrast, humans are able to learn parkour through practice without significantly changing their underlying biology. In this paper, we take a similar approach to developing robot parkour on a small low-cost robot with imprecise actuation and a single front-facing depth camera for perception which is low-frequency, jittery, and prone to artifacts. We show how a single neural net policy operating directly from a camera image, trained in simulation with large-scale RL, can overcome imprecise sensing and actuation to output highly precise control behavior end-to-end. We show our robot can perform a high jump on obstacles 2x its height, long jump across gaps 2x its length, do a handstand and run across tilted ramps, and generalize to novel obstacle courses with different physical properties. Parkour videos at https://extreme-parkour.github.io/

Joint Audio and Speech Understanding

  • paper_url: http://arxiv.org/abs/2309.14405
  • repo_url: https://github.com/YuanGongND/ltu
  • paper_authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass
  • for: Builds a machine learning model that jointly recognizes and understands both speech and non-speech sounds in audio signals.
  • methods: Integrates Whisper as a perception module and LLaMA as a reasoning module (a hedged sketch follows the abstract).
  • results: The model simultaneously recognizes and jointly understands spoken text, speech paralinguistics, and non-speech audio events - almost everything perceivable from audio signals.
    Abstract Humans are surrounded by audio signals that include both speech and non-speech sounds. The recognition and understanding of speech and non-speech audio events, along with a profound comprehension of the relationship between them, constitute fundamental cognitive capabilities. For the first time, we build a machine learning model, called LTU-AS, that has a conceptually similar universal audio perception and advanced reasoning ability. Specifically, by integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events - almost everything perceivable from audio signals.
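
A hedged sketch of the perception-plus-reasoning pattern: Whisper transcribes the audio and a causal LM answers free-form questions about it. In LTU-AS the modules are connected through learned continuous audio embeddings rather than plain text, and the reasoning module is LLaMA; `gpt2` stands in below only so the snippet runs without gated weights.

```python
import whisper
from transformers import AutoModelForCausalLM, AutoTokenizer

asr = whisper.load_model("base")                   # perception module
name = "gpt2"                                      # stand-in for the LLaMA reasoner
tok = AutoTokenizer.from_pretrained(name)
llm = AutoModelForCausalLM.from_pretrained(name)

def answer_about_audio(wav_path, question):
    transcript = asr.transcribe(wav_path)["text"]
    prompt = (f"Spoken text: {transcript}\n"
              f"Question about the audio: {question}\nAnswer:")
    ids = tok(prompt, return_tensors="pt").input_ids
    out = llm.generate(ids, max_new_tokens=64)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
```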

UnitedHuman: Harnessing Multi-Source Data for High-Resolution Human Generation

  • paper_url: http://arxiv.org/abs/2309.14335
  • repo_url: https://github.com/unitedhuman/unitedhuman
  • paper_authors: Jianglin Fu, Shikai Li, Yuming Jiang, Kwan-Yee Lin, Wayne Wu, Ziwei Liu
  • for: Improves the quality of human image generation.
  • methods: Jointly learns a high-resolution human generative model from multi-source datasets with varying resolutions, using a Multi-Source Spatial Transformer to align multi-source images to full-body space via a human parametric model, and a continuous GAN with global-structural guidance and CutMix consistency.
  • results: The model jointly learned from multi-source data achieves superior quality in human image generation compared with models trained on a single holistic dataset.
    Abstract Human generation has achieved significant progress. Nonetheless, existing methods still struggle to synthesize specific regions such as faces and hands. We argue that the main reason is rooted in the training data. A holistic human dataset inevitably has insufficient and low-resolution information on local parts. Therefore, we propose to use multi-source datasets with various resolution images to jointly learn a high-resolution human generative model. However, multi-source data inherently a) contains different parts that do not spatially align into a coherent human, and b) comes with different scales. To tackle these challenges, we propose an end-to-end framework, UnitedHuman, that empowers continuous GAN with the ability to effectively utilize multi-source data for high-resolution human generation. Specifically, 1) we design a Multi-Source Spatial Transformer that spatially aligns multi-source images to full-body space with a human parametric model. 2) Next, a continuous GAN is proposed with global-structural guidance and CutMix consistency. Patches from different datasets are then sampled and transformed to supervise the training of this scale-invariant generative model. Extensive experiments demonstrate that our model jointly learned from multi-source data achieves superior quality than those learned from a holistic dataset.

LinGCN: Structural Linearized Graph Convolutional Network for Homomorphically Encrypted Inference

  • paper_url: http://arxiv.org/abs/2309.14331
  • repo_url: https://github.com/harveyp123/lingcn-neurips23
  • paper_authors: Hongwu Peng, Ran Ran, Yukui Luo, Jiahui Zhao, Shaoyi Huang, Kiran Thorat, Tong Geng, Chenghong Wang, Xiaolin Xu, Wujie Wen, Caiwen Ding
  • for: Improves the security and scalability of Graph Convolutional Network (GCN) inference.
  • methods: Uses Homomorphic Encryption (HE) to protect client data and proposes the LinGCN framework, built on two key elements: a differentiable structural linearization algorithm for node-wise non-linear location selection, and a compact node-wise polynomial replacement policy with a second-order trainable activation function, distilled from an all-ReLU teacher (a hedged sketch follows the abstract).
  • results: On the NTU-XVIEW skeleton joint dataset, LinGCN achieves a 14.2x latency speedup over CryptoGCN while preserving 75% inference accuracy and notably reducing multiplication depth.
    Abstract The growth of Graph Convolution Network (GCN) model sizes has revolutionized numerous applications, surpassing human performance in areas such as personal healthcare and financial systems. The deployment of GCNs in the cloud raises privacy concerns due to potential adversarial attacks on client data. To address security concerns, Privacy-Preserving Machine Learning (PPML) using Homomorphic Encryption (HE) secures sensitive client data. However, it introduces substantial computational overhead in practical applications. To tackle those challenges, we present LinGCN, a framework designed to reduce multiplication depth and optimize the performance of HE based GCN inference. LinGCN is structured around three key elements: (1) A differentiable structural linearization algorithm, complemented by a parameterized discrete indicator function, co-trained with model weights to meet the optimization goal. This strategy promotes fine-grained node-level non-linear location selection, resulting in a model with minimized multiplication depth. (2) A compact node-wise polynomial replacement policy with a second-order trainable activation function, steered towards superior convergence by a two-level distillation approach from an all-ReLU based teacher model. (3) an enhanced HE solution that enables finer-grained operator fusion for node-wise activation functions, further reducing multiplication level consumption in HE-based inference. Our experiments on the NTU-XVIEW skeleton joint dataset reveal that LinGCN excels in latency, accuracy, and scalability for homomorphically encrypted inference, outperforming solutions such as CryptoGCN. Remarkably, LinGCN achieves a 14.2x latency speedup relative to CryptoGCN, while preserving an inference accuracy of 75% and notably reducing multiplication depth.
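
The HE-friendly building block behind the polynomial replacement policy is compact enough to sketch: comparisons (as in ReLU) are costly or unsupported under homomorphic encryption, which evaluates only additions and multiplications, so non-linearities are replaced by a trainable second-order polynomial. The per-channel parameterization and initialization below are assumptions; in the paper the replacement is guided by two-level distillation from an all-ReLU teacher.

```python
import torch
import torch.nn as nn

class QuadraticActivation(nn.Module):
    """Trainable a*x^2 + b*x + c: only additions and multiplications,
    so it can be evaluated directly on HE ciphertexts."""
    def __init__(self, channels):
        super().__init__()
        self.a = nn.Parameter(torch.full((channels,), 0.1))
        self.b = nn.Parameter(torch.full((channels,), 0.5))
        self.c = nn.Parameter(torch.zeros(channels))

    def forward(self, x):                  # x: (..., channels)
        return self.a * x.pow(2) + self.b * x + self.c
```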

Innovative Digital Storytelling with AIGC: Exploration and Discussion of Recent Advances

  • paper_url: http://arxiv.org/abs/2309.14329
  • repo_url: None
  • paper_authors: Rongzhang Gu, Hui Li, Changyue Su, Wayne Wu
  • for: Raises public awareness of the current state, limitations, and challenges of combining AI-generated content (AIGC) with digital storytelling.
  • methods: Investigates the integration of existing AIGC techniques and digital storytelling tools through a sample project and expert interviews.
  • results: While AIGC is proficient at generating images, voiceovers, and music, it cannot yet replace humans in complex character animation, facial expressions, and sound effects, where human creativity and aesthetic sensibility remain irreplaceable.
    Abstract Digital storytelling, as an art form, has struggled with cost-quality balance. The emergence of AI-generated Content (AIGC) is considered as a potential solution for efficient digital storytelling production. However, the specific form, effects, and impacts of this fusion remain unclear, leaving the boundaries of AIGC combined with storytelling undefined. This work explores the current integration state of AIGC and digital storytelling, investigates the artistic value of their fusion in a sample project, and addresses common issues through interviews. Through our study, we conclude that AIGC, while proficient in image creation, voiceover production, and music composition, falls short of replacing humans due to the irreplaceable elements of human creativity and aesthetic sensibilities at present, especially in complex character animations, facial expressions, and sound effects. The research objective is to increase public awareness of the current state, limitations, and challenges arising from combining AIGC and digital storytelling.

Physics of Language Models: Part 3.2, Knowledge Manipulation

  • paper_url: http://arxiv.org/abs/2309.14402
  • repo_url: None
  • paper_authors: Zeyuan Allen-Zhu, Yuanzhi Li
  • for: Explores whether language models can manipulate stored knowledge during inference, focusing on four manipulation types: retrieval, classification, comparison, and inverse search.
  • methods: Uses pre-trained language models such as GPT2/3/4 and employs Chain of Thoughts (CoTs) during both training and inference to improve performance on simple classification and comparison tasks.
  • results: Language models struggle with simple classification and comparison tasks unless CoTs are employed, and perform poorly at inverse knowledge search even with adequate instruct fine-tuning; the primary contribution is a synthetic dataset for a controlled experiment confirming these inherent weaknesses.
    Abstract Language models can store vast amounts of factual knowledge, but their ability to use this knowledge for logical reasoning remains questionable. This paper explores a language model's ability to manipulate its stored knowledge during inference. We focus on four manipulation types: retrieval (e.g., "What is person A's attribute X"), classification (e.g., "Is A's attribute X even or odd?"), comparison (e.g., "Is A greater than B in attribute X?") and inverse search (e.g., "Which person's attribute X equals T?") We observe that pre-trained language models like GPT2/3/4 excel in knowledge retrieval but struggle with simple classification or comparison tasks unless Chain of Thoughts (CoTs) are employed during both training and inference. They also perform poorly in inverse knowledge search, irrespective of the prompts. Our primary contribution is a synthetic dataset for a controlled experiment that confirms these inherent weaknesses: a language model cannot efficiently manipulate knowledge from pre-training data, even when such knowledge is perfectly stored and fully extractable in the models, and despite adequate instruct fine-tuning.

Physics of Language Models: Part 3.1, Knowledge Storage and Extraction

  • paper_url: http://arxiv.org/abs/2309.14316
  • repo_url: None
  • paper_authors: Zeyuan Allen Zhu, Yuanzhi Li
  • for: Investigates whether large language models genuinely extract knowledge from sources (e.g., Wikipedia biographies) or merely answer questions similar to ones seen during training.
  • methods: Conducts an in-depth study using a controlled set of semi-synthetic biography data.
  • results: Uncovers a relationship between a model's knowledge-extraction ability and different diversity measures of the training data; (nearly) linear probing reveals a strong correlation with whether the model (nearly) linearly encodes knowledge attributes in the hidden embeddings of entity names, or across the embeddings of other tokens in the training text (a hedged probing sketch follows the abstract).
    Abstract Large language models can store extensive world knowledge, often extractable through question-answering (e.g., "What is Abraham Lincoln's birthday?"). However, it's unclear whether the model answers questions based on exposure to exact/similar questions during training, or if it genuinely extracts knowledge from the source (e.g., Wikipedia biographies). In this paper, we conduct an in-depth study of this problem using a controlled set of semi-synthetic biography data. We uncover a relationship between the model's knowledge extraction ability and different diversity measures of the training data. We conduct (nearly) linear probing, revealing a strong correlation between this relationship and whether the model (nearly) linearly encodes the knowledge attributes at the hidden embedding of the entity names, or across the embeddings of other tokens in the training text.
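
A minimal sketch of the (nearly) linear probing described above: read the last-layer hidden state at each entity name's final token and fit a linear classifier for a knowledge attribute. High held-out probe accuracy indicates the attribute is (nearly) linearly encoded at the name position; the shapes and the logistic-regression probe are illustrative choices.

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe(hidden, name_positions, attribute_labels):
    """hidden: (n, seq_len, d_model) states from a forward pass with
    output_hidden_states=True; name_positions: (n,) final-token index of each
    entity name; attribute_labels: one discrete attribute per example."""
    feats = hidden[torch.arange(hidden.shape[0]), name_positions]   # (n, d_model)
    X = feats.detach().cpu().numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(X, attribute_labels, test_size=0.2)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)    # near-linear decodability of the attribute
```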

Multiple Different Explanations for Image Classifiers

  • paper_url: http://arxiv.org/abs/2309.14309
  • repo_url: None
  • paper_authors: Hana Chockler, David A. Kelly, Daniel Kroening
  • for: Provides an algorithm and tool that compute multiple explanations for a single prediction, to better understand the behavior of black-box image classifiers.
  • methods: Uses a principled approach grounded in causal theory to compute multiple explanations (an illustrative sketch follows the abstract).
  • results: On the ImageNet-mini benchmark, REX finds multiple explanations on 7 times more images than previous work.
    Abstract Existing explanation tools for image classifiers usually give only one single explanation for an image. For many images, however, both humans and image classifiers accept more than one explanation for the image label. Thus, restricting the number of explanations to just one severely limits the insight into the behavior of the classifier. In this paper, we describe an algorithm and a tool, REX, for computing multiple explanations of the output of a black-box image classifier for a given image. Our algorithm uses a principled approach based on causal theory. We analyse its theoretical complexity and provide experimental results showing that REX finds multiple explanations on 7 times more images than the previous work on the ImageNet-mini benchmark.
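
REX derives its explanations from causal theory; as a loose operational illustration of what "multiple explanations" means, the sketch below greedily assembles a set of image blocks that alone preserve the classifier's label, bans those blocks, and searches again for a disjoint sufficient set. This is a stand-in to make the goal concrete, not the paper's algorithm.

```python
import numpy as np

def multiple_explanations(img, predict, label, grid=8, max_expl=3):
    """Return up to max_expl disjoint block sets, each sufficient on its own
    to keep the predicted label. predict: image -> class probabilities."""
    h, w = img.shape[:2]
    bh, bw = h // grid, w // grid
    blocks = [(i, j) for i in range(grid) for j in range(grid)]
    banned, explanations = set(), []
    for _ in range(max_expl):
        kept, canvas = [], np.zeros_like(img)
        for (i, j) in blocks:
            if (i, j) in banned:
                continue
            sl = (slice(i * bh, (i + 1) * bh), slice(j * bw, (j + 1) * bw))
            canvas[sl] = img[sl]
            kept.append((i, j))
            if int(np.argmax(predict(canvas))) == label:   # sufficient set found
                break
        else:
            break                  # remaining blocks can no longer recover the label
        explanations.append(kept)
        banned |= set(kept)        # force the next explanation to be disjoint
    return explanations
```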

Overview of Class Activation Maps for Visualization Explainability

  • paper_url: http://arxiv.org/abs/2309.14304
  • repo_url: None
  • paper_authors: Anh Pham Thi Minh
  • for: Surveys the evolution of Class Activation Map (CAM) methods over recent years and how CAMs are evaluated for faithfulness and interpretability.
  • methods: Reviews the metrics used to evaluate CAMs and the auxiliary techniques proposed to improve their saliency (the original CAM computation is sketched after the abstract).
  • results: Identifies shortcomings and limitations of existing CAM methods and proposes potential avenues for future research toward more interpretable and precise explanations.
    Abstract Recent research in deep learning methodology has led to a variety of complex modelling techniques in computer vision (CV) that reach or even outperform human performance. Although these black-box deep learning models have obtained astounding results, they are limited in their interpretability and transparency which are critical to take learning machines to the next step to include them in sensitive decision-support systems involving human supervision. Hence, the development of explainable techniques for computer vision (XCV) has recently attracted increasing attention. In the realm of XCV, Class Activation Maps (CAMs) have become widely recognized and utilized for enhancing interpretability and insights into the decision-making process of deep learning models. This work presents a comprehensive overview of the evolution of Class Activation Map methods over time. It also explores the metrics used for evaluating CAMs and introduces auxiliary techniques to improve the saliency of these methods. The overview concludes by proposing potential avenues for future research in this evolving field.
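
For reference, the original CAM computation that the surveyed variants build on is compact: the map for a class is the channel-wise sum of the last convolutional feature maps, weighted by that class's weights in the final linear layer (later methods such as Grad-CAM swap these weights for gradient-derived ones). A sketch for a torchvision ResNet-18:

```python
import torch
import torch.nn.functional as F
from torchvision import models

resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])   # up to last conv block

@torch.no_grad()
def class_activation_map(x, class_idx):
    feats = backbone(x)                               # (B, 512, h, w)
    w = resnet.fc.weight[class_idx]                   # (512,) classifier weights
    cam = torch.einsum("c,bchw->bhw", w, feats)       # weighted channel sum
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                        mode="bilinear", align_corners=False).squeeze(1)
    cam = cam - cam.amin(dim=(1, 2), keepdim=True)    # normalize to [0, 1] per image
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
```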

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

  • paper_url: http://arxiv.org/abs/2309.15817
  • repo_url: https://github.com/ryoungj/toolemu
  • paper_authors: Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, Tatsunori Hashimoto
  • for: Addresses the challenge of identifying risks posed by Language Model (LM) agents and their tool use, such as leaking private data or causing financial losses, by introducing the ToolEmu framework and an automatic safety evaluator.
  • methods: ToolEmu uses an LM to emulate tool execution, enabling LM agents to be tested against a diverse range of tools and scenarios without manual instantiation; an LM-based evaluator examines agent failures and quantifies the associated risks (a hedged sketch follows the abstract).
  • results: Human evaluation shows that 68.8% of failures identified with ToolEmu would be valid real-world agent failures. A quantitative risk analysis on a curated benchmark of 36 high-stakes tools and 144 test cases uncovers numerous failures with potentially severe outcomes; even the safest LM agent fails 23.9% of the time, underscoring the need to develop safer LM agents for real-world deployment.
    Abstract Recent advances in Language Model (LM) agents and tool use, exemplified by applications like ChatGPT Plugins, enable a rich set of capabilities but also amplify potential risks - such as leaking private data or causing financial losses. Identifying these risks is labor-intensive, necessitating implementing the tools, manually setting up the environment for each test scenario, and finding risky cases. As tools and agents become more complex, the high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks. To address these challenges, we introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios, without manual instantiation. Alongside the emulator, we develop an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks. We test both the tool emulator and evaluator through human evaluation and find that 68.8% of failures identified with ToolEmu would be valid real-world agent failures. Using our curated initial benchmark consisting of 36 high-stakes tools and 144 test cases, we provide a quantitative risk analysis of current LM agents and identify numerous failures with potentially severe outcomes. Notably, even the safest LM agent exhibits such failures 23.9% of the time according to our evaluator, underscoring the need to develop safer LM agents for real-world deployment.
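
A hedged sketch of the emulation pattern the abstract describes, not ToolEmu's actual API: instead of instantiating a real tool, an LM is prompted to play the tool and return a plausible observation, so an agent can be exercised against many tools and risky scenarios cheaply. `call_lm` and `agent.next_action` are placeholders, and the snippet assumes the LM returns valid JSON.

```python
import json

def emulate_tool(call_lm, tool_spec, tool_input, scenario):
    prompt = (
        "You are emulating a tool inside a sandbox.\n"
        f"Tool specification: {json.dumps(tool_spec)}\n"
        f"Scenario (may include adversarial edge cases): {scenario}\n"
        f"Tool input: {json.dumps(tool_input)}\n"
        "Return a realistic JSON observation the tool would produce."
    )
    return json.loads(call_lm(prompt))

def run_agent_in_sandbox(agent, call_lm, tools, scenario, max_steps=10):
    trajectory = []
    for _ in range(max_steps):
        action = agent.next_action(trajectory)     # (tool_name, tool_input) or None
        if action is None:
            break
        obs = emulate_tool(call_lm, tools[action[0]], action[1], scenario)
        trajectory.append((action, obs))
    return trajectory    # handed to the LM-based safety evaluator afterwards
```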

NAS-NeRF: Generative Neural Architecture Search for Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.14293
  • repo_url: None
  • paper_authors: Saeejith Nair, Yuhao Chen, Mohammad Javad Shafiee, Alexander Wong
  • for: Proposes a neural architecture search strategy for NeRFs that balances synthesis quality against computational complexity on a per-scene basis.
  • methods: Generates compact, scene-specialized NeRF architectures via generative neural architecture search, guided by constraints on target synthesis-quality metrics and compute budgets (a hedged sketch follows the abstract).
  • results: On the Blender synthetic dataset, NAS-NeRF yields architectures up to 5.74x smaller, with 4.19x fewer FLOPs and 1.93x faster on a GPU than baseline NeRFs, without a drop in SSIM; allowing a 5.3% average SSIM drop yields architectures up to 23x smaller, with 22x fewer FLOPs and 4.7x faster.
    Abstract Neural radiance fields (NeRFs) enable high-quality novel view synthesis, but their high computational complexity limits deployability. While existing neural-based solutions strive for efficiency, they use one-size-fits-all architectures regardless of scene complexity. The same architecture may be unnecessarily large for simple scenes but insufficient for complex ones. Thus, there is a need to dynamically optimize the neural network component of NeRFs to achieve a balance between computational complexity and specific targets for synthesis quality. We introduce NAS-NeRF, a generative neural architecture search strategy that generates compact, scene-specialized NeRF architectures by balancing architecture complexity and target synthesis quality metrics. Our method incorporates constraints on target metrics and budgets to guide the search towards architectures tailored for each scene. Experiments on the Blender synthetic dataset show the proposed NAS-NeRF can generate architectures up to 5.74$\times$ smaller, with 4.19$\times$ fewer FLOPs, and 1.93$\times$ faster on a GPU than baseline NeRFs, without suffering a drop in SSIM. Furthermore, we illustrate that NAS-NeRF can also achieve architectures up to 23$\times$ smaller, with 22$\times$ fewer FLOPs, and 4.7$\times$ faster than baseline NeRFs with only a 5.3% average SSIM drop. Our source code is also made publicly available at https://saeejithnair.github.io/NAS-NeRF.
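
A hedged sketch of budget-constrained search in the spirit of NAS-NeRF: sample candidate NeRF configurations, discard those over the compute budget, and keep the cheapest one meeting the scene's quality target. Random search stands in for the paper's generative search strategy, and `flops`/`train_and_eval_ssim` are placeholders for the real pipeline.

```python
import random

SPACE = {"layers": [2, 4, 8], "width": [32, 64, 128, 256], "samples": [32, 64, 128]}

def search(flops, train_and_eval_ssim, flop_budget, ssim_target, n_trials=50):
    best = None
    for _ in range(n_trials):
        cfg = {k: random.choice(v) for k, v in SPACE.items()}
        if flops(cfg) > flop_budget:
            continue                             # hard compute-budget constraint
        if train_and_eval_ssim(cfg) < ssim_target:
            continue                             # scene-specific quality target
        if best is None or flops(cfg) < flops(best):
            best = cfg                           # cheapest architecture that qualifies
    return best
```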

Perception-and-Energy-aware Motion Planning for UAV using Learning-based Model under Heteroscedastic Uncertainty

  • paper_url: http://arxiv.org/abs/2309.14272
  • repo_url: https://gitlab.com/rei08/perception-energy-planner
  • paper_authors: Reiya Takemura, Genya Ishigami
  • for: Enables energy-efficient and reliable UAV flight in environments where Global Navigation Satellite Systems (GNSS) are denied.
  • methods: Proposes a perception-and-energy-aware motion planner that optimizes a cost combining the UAV's total energy consumption and the perception quality of its onboard LiDAR; before online navigation, a high-fidelity simulator learns the UAV's energy consumption and the heteroscedastic uncertainty of LiDAR measurements, both as functions of horizontal velocity (a hedged sketch follows the abstract).
  • results: Simulation experiments in a photorealistic environment confirm the planner can trade off energy efficiency against perception quality under heteroscedastic uncertainty, reducing battery usage and localization errors. Open-source code: https://gitlab.com/ReI08/perception-energy-planner.
    Abstract Global navigation satellite systems (GNSS) denied environments/conditions require unmanned aerial vehicles (UAVs) to energy-efficiently and reliably fly. To this end, this study presents perception-and-energy-aware motion planning for UAVs in GNSS-denied environments. The proposed planner solves the trajectory planning problem by optimizing a cost function consisting of two indices: the total energy consumption of a UAV and the perception quality of light detection and ranging (LiDAR) sensor mounted on the UAV. Before online navigation, a high-fidelity simulator acquires a flight dataset to learn energy consumption for the UAV and heteroscedastic uncertainty associated with LiDAR measurements, both as functions of the horizontal velocity of the UAV. The learned models enable the online planner to estimate energy consumption and perception quality, reducing UAV battery usage and localization errors. Simulation experiments in a photorealistic environment confirm that the proposed planner can address the trade-off between energy efficiency and perception quality under heteroscedastic uncertainty. The open-source code is released at https://gitlab.com/ReI08/perception-energy-planner.
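
A hedged sketch of the planner's core trade-off: choose the horizontal velocity minimizing a weighted sum of learned energy consumption and learned (heteroscedastic) LiDAR uncertainty, both functions of velocity as in the paper. The closed-form stand-in models and the weights below are illustrative; the paper learns both models from flight data.

```python
import numpy as np

def energy_model(v):          # energy per meter; learned from flight data in the paper
    return 50.0 + 2.0 * (v - 4.0) ** 2

def lidar_sigma(v):           # measurement noise grows with speed (assumed shape)
    return 0.02 + 0.01 * v ** 1.5

def best_velocity(w_energy=1.0, w_perception=500.0):
    vs = np.linspace(0.5, 10.0, 200)
    cost = w_energy * energy_model(vs) + w_perception * lidar_sigma(vs)
    return vs[np.argmin(cost)]

print(f"selected cruise speed: {best_velocity():.2f} m/s")
```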

Unsupervised correspondence with combined geometric learning and imaging for radiotherapy applications

  • paper_url: http://arxiv.org/abs/2309.14269
  • repo_url: https://github.com/rrr-uom-projects/unsup-rt-corr-net
  • paper_authors: Edward G. A. Henderson, Marcel van Herk, Andrew F. Green, Eliana M. Vasquez Osorio
  • for: Develops a model that accurately identifies corresponding points between organ segmentations of different patients for radiotherapy applications.
  • methods: Combines 3D shape information with imaging information to estimate correspondences and perform interpolation; trained on head-and-neck organ segmentations from planning CT scans, with imaging incorporated in two ways: extracting features directly from image patches, and including the mean square error between patches as part of the loss function.
  • results: Correspondence and interpolation are evaluated with geodesic error, chamfer distance, conformal distortion, and distances between anatomical landmarks. The best configuration incorporates imaging information in the loss function, producing more anatomically plausible correspondences and outperforming both a non-rigid registration baseline and the variant with direct inclusion of image features.
    Abstract The aim of this study was to develop a model to accurately identify corresponding points between organ segmentations of different patients for radiotherapy applications. A model for simultaneous correspondence and interpolation estimation in 3D shapes was trained with head and neck organ segmentations from planning CT scans. We then extended the original model to incorporate imaging information using two approaches: 1) extracting features directly from image patches, and 2) including the mean square error between patches as part of the loss function. The correspondence and interpolation performance were evaluated using the geodesic error, chamfer distance and conformal distortion metrics, as well as distances between anatomical landmarks. Each of the models produced significantly better correspondences than the baseline non-rigid registration approach. The original model performed similarly to the model with direct inclusion of image features. The best performing model configuration incorporated imaging information as part of the loss function which produced more anatomically plausible correspondences. We will use the best performing model to identify corresponding anatomical points on organs to improve spatial normalisation, an important step in outcome modelling, or as an initialisation for anatomically informed registrations. All our code is publicly available at https://github.com/rrr-uom-projects/Unsup-RT-Corr-Net

Data-Driven Approach for Identifying State of Hemodialysis Fistulas: Entropy-Complexity and Formal Concept Analysis

  • paper_url: http://arxiv.org/abs/2309.14399
  • repo_url: None
  • paper_authors: Vasilii A. Gromov, E. I. Zvorykina, Yurii N. Beschastnov, Majid Sohrabi
  • for: Applies mathematical methods to classify the state of hemodialysis fistulas as normally or pathologically functioning.
  • methods: Builds on the hypothesis that laminar blood flow signifies normal function while turbulent flow indicates pathology. The first method maps each time series onto the entropy-complexity plane and compares it to established clusters (a sketch follows the abstract); the second, introduced by the authors, constructs a concepts-objects graph using formal concept analysis.
  • results: Both methods determine the state of the fistula with high efficiency.
    Abstract The paper explores mathematical methods that differentiate regular and chaotic time series, specifically for identifying pathological fistulas. It proposes a noise-resistant method for classifying responding rows of normally and pathologically functioning fistulas. This approach is grounded in the hypothesis that laminar blood flow signifies normal function, while turbulent flow indicates pathology. The study explores two distinct methods for distinguishing chaotic from regular time series. The first method involves mapping the time series onto the entropy-complexity plane and subsequently comparing it to established clusters. The second method, introduced by the authors, constructs a concepts-objects graph using formal concept analysis. Both of these methods exhibit high efficiency in determining the state of the fistula.
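
The first method rests on the standard Bandt-Pompe construction, sketched below: ordinal-pattern probabilities yield a normalized permutation entropy H and a Jensen-Shannon statistical complexity C, so each signal becomes a point (H, C) that can be compared against known regular/chaotic clusters. The embedding dimension and delay are typical defaults, not necessarily the paper's settings.

```python
import numpy as np
from itertools import permutations
from math import factorial

def entropy_complexity(x, d=4, tau=1):
    """Map a 1-D series to the (H, C) entropy-complexity plane."""
    counts = dict.fromkeys(permutations(range(d)), 0)
    for i in range(len(x) - (d - 1) * tau):
        counts[tuple(np.argsort(x[i:i + d * tau:tau]))] += 1   # ordinal patterns
    p = np.array(list(counts.values()), float)
    p /= p.sum()
    n = factorial(d)
    S = lambda q: -np.sum(q[q > 0] * np.log(q[q > 0]))         # Shannon entropy
    H = S(p) / np.log(n)                      # normalized permutation entropy
    u = np.full(n, 1.0 / n)                   # uniform reference distribution
    JS = S(0.5 * (p + u)) - 0.5 * S(p) - 0.5 * S(u)
    Q0 = -2.0 / ((n + 1) / n * np.log(n + 1) - 2 * np.log(2 * n) + np.log(n))
    return H, Q0 * JS * H                     # (entropy, statistical complexity)
```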

OmniEvent: A Comprehensive, Fair, and Easy-to-Use Toolkit for Event Understanding

  • paper_url: http://arxiv.org/abs/2309.14258
  • repo_url: https://github.com/thu-keg/omnievent
  • paper_authors: Hao Peng, Xiaozhi Wang, Feng Yao, Zimu Wang, Chuzhao Zhu, Kaisheng Zeng, Lei Hou, Juanzi Li
  • for: Presents OmniEvent, a comprehensive toolkit for event understanding in text, covering the complex information extraction tasks of event detection, event argument extraction, and event relation extraction.
  • methods: Supports mainstream modeling paradigms for all event understanding tasks and the processing of 15 widely used English and Chinese datasets, while carefully handling the inconspicuous evaluation pitfalls reported in Peng et al. (2023) to ensure fair comparisons between models.
  • results: Provides off-the-shelf models that can be deployed directly as web services, plus a modular framework that lets users easily implement and evaluate new event understanding models.
    Abstract Event understanding aims at understanding the content and relationship of events within texts, which covers multiple complicated information extraction tasks: event detection, event argument extraction, and event relation extraction. To facilitate related research and application, we present an event understanding toolkit OmniEvent, which features three desiderata: (1) Comprehensive. OmniEvent supports mainstream modeling paradigms of all the event understanding tasks and the processing of 15 widely-used English and Chinese datasets. (2) Fair. OmniEvent carefully handles the inconspicuous evaluation pitfalls reported in Peng et al. (2023), which ensures fair comparisons between different models. (3) Easy-to-use. OmniEvent is designed to be easily used by users with varying needs. We provide off-the-shelf models that can be directly deployed as web services. The modular framework also enables users to easily implement and evaluate new event understanding models with OmniEvent. The toolkit (https://github.com/THU-KEG/OmniEvent) is publicly released along with the demonstration website and video (https://omnievent.xlore.cn/).

Prediction Model For Wordle Game Results With High Robustness

  • paper_url: http://arxiv.org/abs/2309.14250
  • repo_url: https://github.com/zeniSoida/pl1
  • paper_authors: Jiaqi Weng, Chunlin Feng
  • for: Studies the dynamics of the Wordle game using data analysis and machine learning.
  • methods: Models the stable portion of daily submission counts with an ARIMAX(9, 0, 2) model using a weekday/weekend exogenous variable, predicts word difficulty with a backpropagation neural network (with feature engineering to curb overfitting), and categorizes difficulty numerically with k-means clustering optimized at five clusters (a hedged sketch follows the abstract).
  • results: Predicts about 12,884 submitted results on March 1st, 2023, and that the word "eerie" averages 4.8 attempts, falling into the hardest difficulty cluster; also examines the share of loyal players and their propensity to take daily challenges. The models pass ADF, ACF, and PACF tests plus cross-validation, confirming their robustness.
    Abstract In this study, we delve into the dynamics of Wordle using data analysis and machine learning. Our analysis initially focused on the correlation between the date and the number of submitted results. Due to initial popularity bias, we modeled stable data using an ARIMAX model with coefficient values of 9, 0, 2, and weekdays/weekends as the exogenous variable. We found no significant relationship between word attributes and hard mode results. To predict word difficulty, we employed a Backpropagation Neural Network, overcoming overfitting via feature engineering. We also used K-means clustering, optimized at five clusters, to categorize word difficulty numerically. Our findings indicate that on March 1st, 2023, around 12,884 results will be submitted and the word "eerie" averages 4.8 attempts, falling into the hardest difficulty cluster. We further examined the percentage of loyal players and their propensity to undertake daily challenges. Our models underwent rigorous sensitivity analyses, including ADF, ACF, PACF tests, and cross-validation, confirming their robustness. Overall, our study provides a predictive framework for Wordle gameplay based on date or a given five-letter word. Results have been summarized and submitted to the Puzzle Editor of the New York Times.
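
A hedged sketch of the two modelling steps: an ARIMAX(9, 0, 2) fit to daily submission counts with a weekend indicator as the exogenous variable, and k-means with k = 5 to bin words into difficulty clusters. The synthetic series and the single difficulty feature are placeholders; the paper fits real Wordle submissions and engineered word features.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.cluster import KMeans

days = pd.date_range("2022-06-01", periods=200, freq="D")
y = 30000 + 50 * np.arange(200) + np.random.normal(0, 800, 200)   # placeholder counts
weekend = (days.dayofweek >= 5).astype(float)                     # exogenous variable

model = SARIMAX(y, exog=weekend, order=(9, 0, 2)).fit(disp=False)
print("next-day forecast:", model.forecast(steps=1, exog=[[1.0]]))  # forecast a Saturday

avg_attempts = np.random.uniform(3.5, 5.5, size=(300, 1))          # placeholder feature
difficulty_cluster = KMeans(n_clusters=5, n_init=10).fit_predict(avg_attempts)
```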

Rethinking Internet Communication Through LLMs: How Close Are We?

  • paper_url: http://arxiv.org/abs/2309.14247
  • repo_url: None
  • paper_authors: Sifat Ut Taki, Spyridon Mastorakis
  • for: Rethinks how users communicate over the Internet, so that communication captures the cognition of the user on the other end of the channel.
  • methods: Proposes an architecture in which users communicate with (query) Large Language Models (LLMs) that represent the users on the other end, and describes how such a communication architecture could be realized.
  • results: Performs a reality check on how technically feasible such an architecture is today, and discusses research challenges and interesting directions for future work.
    Abstract In this paper, we rethink the way that communication among users over the Internet, one of the fundamental outcomes of the Internet evolution, takes place. Instead of users communicating directly over the Internet, we explore an architecture that enables users to communicate with (query) Large Language Models (LLMs) that capture the cognition of users on the other end of the communication channel. We present an architecture to achieve such LLM-based communication and we perform a reality check to assess how close we are today to realizing such a communication architecture from a technical point of view. Finally, we discuss several research challenges and identify interesting directions for future research.

Enhancing data efficiency in reinforcement learning: a novel imagination mechanism based on mesh information propagation

  • paper_url: http://arxiv.org/abs/2309.14243
  • repo_url: https://github.com/ouazusakou/imagination_mechanism
  • paper_authors: Zihang Wang, Maowei Jiang
  • for: Improves the data efficiency of deep reinforcement learning (RL) algorithms, particularly in high-dimensional state spaces and large-scale problems.
  • methods: Introduces a human-like Imagination Mechanism (IM) that broadcasts information generated by a single sample to different states across episodes, rather than only within the same episode, and packages it as a plug-and-play module for other RL algorithms (a hedged sketch follows the abstract).
  • results: IM consistently boosts four mainstream SOTA RL algorithms (SAC, PPO, DDPG, and DQN) by a considerable margin, leading to superior performance across various tasks.
    Abstract Reinforcement learning (RL) algorithms face the challenge of limited data efficiency, particularly when dealing with high-dimensional state spaces and large-scale problems. Most RL methods rely solely on state transition information within the same episode when updating the agent's Critic, which can lead to low data efficiency and sub-optimal training time consumption. Inspired by human-like analogical reasoning abilities, we introduce a novel mesh information propagation mechanism, termed the 'Imagination Mechanism (IM)', designed to significantly enhance the data efficiency of RL algorithms. Specifically, IM enables information generated by a single sample to be effectively broadcast to different states across episodes, instead of simply being transmitted within the same episode. This capability enhances the model's comprehension of state interdependencies and facilitates more efficient learning from limited sample information. To promote versatility, we extend IM to function as a plug-and-play module that can be seamlessly and fluidly integrated into other widely adopted RL algorithms. Our experiments demonstrate that IM consistently boosts four mainstream SOTA RL algorithms, such as SAC, PPO, DDPG, and DQN, by a considerable margin, ultimately leading to superior performance than before across various tasks. For access to our code and data, please visit https://github.com/OuAzusaKou/imagination_mechanism
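
A hedged sketch of one way to read the broadcast idea: keep a memory of states and value targets gathered across episodes; when a new TD target is observed, blend it into the targets of the most similar stored states so that one sample informs many. Cosine similarity and the blending rule are loose interpretations, not the authors' mesh propagation rule.

```python
import torch
import torch.nn.functional as F

def broadcast_target(state, td_target, memory_states, memory_targets, k=16, alpha=0.1):
    """memory_states: (N, d) states gathered across episodes;
    memory_targets: (N,) their value targets, updated in place."""
    sims = F.cosine_similarity(memory_states, state.unsqueeze(0), dim=1)   # (N,)
    idx = sims.topk(k).indices
    w = torch.softmax(sims[idx], dim=0)
    # Similarity-weighted blend: the single observation nudges many states.
    memory_targets[idx] = (1 - alpha * w) * memory_targets[idx] + alpha * w * td_target
```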

Seeing and hearing what has not been said; A multimodal client behavior classifier in Motivational Interviewing with interpretable fusion

  • paper_url: http://arxiv.org/abs/2309.14398
  • repo_url: None
  • paper_authors: Lucie Galland, Catherine Pelachaud, Florian Pecune
  • for: Evaluates the quality of Motivational Interviewing conversations, where the proportion of client change talk correlates positively with therapy outcomes.
  • methods: Builds a classifier that accurately distinguishes the three MISC classes (change talk, sustain talk, follow/neutral talk) from multimodal features: text, prosody, facial expressivity, and body expressivity, trained on multimodal annotations of the public AnnoMI dataset (a hedged fusion sketch follows the abstract).
  • results: Accurately classifies client utterances and identifies the most important modalities in the decision-making process, giving insight into the interplay of modalities during an MI conversation.
    Abstract Motivational Interviewing (MI) is an approach to therapy that emphasizes collaboration and encourages behavioral change. To evaluate the quality of an MI conversation, client utterances can be classified using the MISC code as either change talk, sustain talk, or follow/neutral talk. The proportion of change talk in a MI conversation is positively correlated with therapy outcomes, making accurate classification of client utterances essential. In this paper, we present a classifier that accurately distinguishes between the three MISC classes (change talk, sustain talk, and follow/neutral talk) leveraging multimodal features such as text, prosody, facial expressivity, and body expressivity. To train our model, we perform annotations on the publicly available AnnoMI dataset to collect multimodal information, including text, audio, facial expressivity, and body expressivity. Furthermore, we identify the most important modalities in the decision-making process, providing valuable insights into the interplay of different modalities during a MI conversation.
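
A hedged sketch of an interpretable late-fusion classifier for the three MISC classes: each modality gets its own encoder, and a learned per-modality attention weight exposes which cues drove each decision. The feature dimensions (e.g. 768-d text embeddings, eGeMAPS-style prosody) are placeholders, not the paper's exact features or fusion scheme.

```python
import torch
import torch.nn as nn

class MISCFusion(nn.Module):
    def __init__(self, dims=None, h=64):
        super().__init__()
        dims = dims or {"text": 768, "prosody": 88, "face": 35, "body": 27}
        self.enc = nn.ModuleDict({m: nn.Linear(d, h) for m, d in dims.items()})
        self.attn = nn.ModuleDict({m: nn.Linear(h, 1) for m in dims})
        self.out = nn.Linear(h, 3)          # change / sustain / follow-neutral

    def forward(self, feats):               # feats: dict modality -> (B, dim)
        hs = {m: torch.tanh(self.enc[m](x)) for m, x in feats.items()}
        scores = torch.cat([self.attn[m](hs[m]) for m in hs], dim=1)   # (B, M)
        w = torch.softmax(scores, dim=1)    # interpretable modality weights
        fused = sum(w[:, i:i + 1] * hs[m] for i, m in enumerate(hs))
        return self.out(fused), w
```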

MoDem-V2: Visuo-Motor World Models for Real-World Robot Manipulation

  • paper_url: http://arxiv.org/abs/2309.14236
  • repo_url: None
  • paper_authors: Patrick Lancaster, Nicklas Hansen, Aravind Rajeswaran, Vikash Kumar
  • for: Develops a system that learns contact-rich manipulation directly in uninstrumented real-world environments, improving the reliability and safety of modern robotic systems.
  • methods: Builds on recent algorithmic advances in model-based reinforcement learning (MBRL), demonstration bootstrapping, and effective exploration, learning directly from visual pixels; the key ingredients are exploration centering, agency handover, and actor-critic ensembles (a hedged sketch of the handover follows the abstract).
  • results: Empirically demonstrates contact-rich dexterous manipulation on four complex visuo-motor problems, in both simulation and the real world; to the authors' knowledge, the first demonstration-augmented visual MBRL system trained directly in the real world.
    Abstract Robotic systems that aspire to operate in uninstrumented real-world environments must perceive the world directly via onboard sensing. Vision-based learning systems aim to eliminate the need for environment instrumentation by building an implicit understanding of the world based on raw pixels, but navigating the contact-rich high-dimensional search space from solely sparse visual reward signals significantly exacerbates the challenge of exploration. The applicability of such systems is thus typically restricted to simulated or heavily engineered environments since agent exploration in the real-world without the guidance of explicit state estimation and dense rewards can lead to unsafe behavior and safety faults that are catastrophic. In this study, we isolate the root causes behind these limitations to develop a system, called MoDem-V2, capable of learning contact-rich manipulation directly in the uninstrumented real world. Building on the latest algorithmic advancements in model-based reinforcement learning (MBRL), demo-bootstrapping, and effective exploration, MoDem-V2 can acquire contact-rich dexterous manipulation skills directly in the real world. We identify key ingredients for leveraging demonstrations in model learning while respecting real-world safety considerations -- exploration centering, agency handover, and actor-critic ensembles. We empirically demonstrate the contribution of these ingredients in four complex visuo-motor manipulation problems in both simulation and the real world. To the best of our knowledge, our work presents the first successful system for demonstration-augmented visual MBRL trained directly in the real world. Visit https://sites.google.com/view/modem-v2 for videos and more details.
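
Of the three ingredients, "agency handover" is the simplest to sketch: early in training, actions come mostly from a safe base policy (e.g. behavior-cloned from the demonstrations), and control is gradually handed to the learned policy. The linear annealing schedule below is an assumption, not the paper's schedule.

```python
import random

def select_action(step, total_steps, base_policy, learned_policy, obs):
    p_learned = min(1.0, step / (0.5 * total_steps))   # anneal over the first half
    policy = learned_policy if random.random() < p_learned else base_policy
    return policy(obs)
```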

Stackelberg Driver Model for Continual Policy Improvement in Scenario-Based Closed-Loop Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.14235
  • repo_url: https://github.com/BlueCat-de/SDM
  • paper_authors: Haoyi Niu, Qimao Chen, Yingyue Li, Jianming Hu
  • for: Improves the performance and reliability of autonomous vehicles (AVs) by optimizing against the rare yet critical corner cases in the long tail of driving scenarios.
  • methods: Uses adversarial generation to synthesize safety-critical driving scenarios and tailors the Stackelberg Driver Model (SDM) to capture the hierarchical nature of vehicle interaction dynamics, engaging background vehicles (BVs) and the AV in a sequential, game-like leader-follower paradigm for continual policy improvement (a hedged sketch follows the abstract).
  • results: Experiments show the algorithm outperforms several baselines, especially in higher-dimensional scenarios, yielding substantial AV improvements while continually generating progressively challenging scenarios.
    Abstract The deployment of autonomous vehicles (AVs) has faced hurdles due to the dominance of rare but critical corner cases within the long-tail distribution of driving scenarios, which negatively affects their overall performance. To address this challenge, adversarial generation methods have emerged as a class of efficient approaches to synthesize safety-critical scenarios for AV testing. However, these generated scenarios are often underutilized for AV training, resulting in the potential for continual AV policy improvement remaining untapped, along with a deficiency in the closed-loop design needed to achieve it. Therefore, we tailor the Stackelberg Driver Model (SDM) to accurately characterize the hierarchical nature of vehicle interaction dynamics, facilitating iterative improvement by engaging background vehicles (BVs) and AV in a sequential game-like interaction paradigm. With AV acting as the leader and BVs as followers, this leader-follower modeling ensures that AV would consistently refine its policy, always taking into account the additional information that BVs play the best response to challenge AV. Extensive experiments have shown that our algorithm exhibits superior performance compared to several baselines especially in higher dimensional scenarios, leading to substantial advancements in AV capabilities while continually generating progressively challenging scenarios. Code is available at https://github.com/BlueCat-de/SDM.
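
A hedged sketch of the leader-follower training loop: background vehicles (followers) are updated first to best-respond to the current AV policy, then the AV (leader) improves against those adversarial followers. `env.rollout` and the two update functions are placeholders for any policy-optimization step; the inner/outer step counts are illustrative.

```python
def stackelberg_iteration(av_policy, bv_policy, env, update_av, update_bv,
                          follower_steps=5, leader_steps=1):
    for _ in range(follower_steps):     # followers best-respond, challenging the AV
        rollouts = env.rollout(av_policy, bv_policy)
        update_bv(bv_policy, rollouts, objective="induce_av_failure")
    for _ in range(leader_steps):       # leader improves against the best response
        rollouts = env.rollout(av_policy, bv_policy)
        update_av(av_policy, rollouts, objective="complete_task_safely")
```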

Combined sizing and layout optimization of truss structures via update Monte Carlo tree search (UMCTS) algorithm

  • paper_url: http://arxiv.org/abs/2309.14231
  • repo_url: None
  • paper_authors: Fu-Yao Ko, Katsuyuki Suzuki, Kazuo Yonekura
  • for: Finds the optimal design of truss structures, considering sizing and layout variables simultaneously.
  • methods: Applies the update Monte Carlo tree search (UMCTS) reinforcement learning method, with a novel update process for nodal coordinates in which the allowed range of each coordinate varies per round and accelerators reduce the number of candidate entries and iterations (a hedged sketch follows the abstract); nodal coordinates and member areas are determined together with a single search tree per round.
  • results: On benchmark planar and spatial trusses with discrete sizing and continuous layout variables, UMCTS runs twice as fast as the branch and bound method and stably achieves better solutions than other traditional methods.
    Abstract The main concern of this study is to find the optimal design of truss structures considering sizing and layout variables simultaneously. As compared to purely sizing optimization problems, this problem is more challenging since the two types of variables involved are fundamentally different in nature. In this paper, a reinforcement learning method combining the update process and Monte Carlo tree search called the update Monte Carlo tree search (UMCTS) for sizing optimization problems is applied to solve combined sizing and layout optimization for truss structures. This study proposes a novel update process for nodal coordinates with two features. (1) The allowed range of each coordinate varies in each round. (2) Accelerators for the number of entries in the allowed range and iteration numbers are introduced to reduce the computation time. Furthermore, nodal coordinates and member areas are determined at the same time with only one search tree in each round. The validation and efficiency of the UMCTS are tested on benchmark problems of planar and spatial trusses with discrete sizing variables and continuous layout variables. It is shown that the CPU time of the UMCTS is two times faster than the branch and bound method. The numerical results demonstrate that the proposed method stably achieves a better solution than other traditional methods.

Implicit Sensing in Traffic Optimization: Advanced Deep Reinforcement Learning Techniques

  • paper_url: http://arxiv.org/abs/2309.14395
  • repo_url: None
  • paper_authors: Emanuel Figetakis, Yahuza Bello, Ahmed Refaey, Lei Lei, Medhat Moussa
  • for: Addressing sudden roadblocks on highways by letting autonomous vehicles (AVs) use their on-board sensing to make intelligent lane-change decisions and avoid congestion-induced delays.
  • methods: Deep reinforcement learning (DRL) on a Markov decision process (MDP) model of the scenario; an RL agent is trained with the DQN algorithm in an MEC-assisted architecture and evaluated with the SUMO simulator and OpenAI Gym.
  • results: The DQN agent trained with the ε-greedy policy clearly outperforms the one trained with the Boltzmann policy.
    Abstract A sudden roadblock on highways due to many reasons such as road maintenance, accidents, and car repair is a common situation we encounter almost daily. Autonomous Vehicles (AVs) equipped with sensors that can acquire vehicle dynamics such as speed, acceleration, and location can make intelligent decisions to change lanes before reaching a roadblock. A number of literature studies have examined car-following models and lane-changing models. However, only a few studies proposed an integrated car-following and lane-changing model, which has the potential to model practical driving maneuvers. Hence, in this paper, we present an integrated car-following and lane-changing decision-control system based on Deep Reinforcement Learning (DRL) to address this issue. Specifically, we consider a scenario where sudden construction work will be carried out along a highway. We model the scenario as a Markov Decision Process (MDP) and employ the well-known DQN algorithm to train the RL agent to make the appropriate decision accordingly (i.e., either stay in the same lane or change lanes). To overcome the delay and computational requirement of DRL algorithms, we adopt an MEC-assisted architecture where the RL agents are trained on MEC servers. We utilize the highly reputable SUMO simulator and OPENAI GYM to evaluate the performance of the proposed model under two policies: the ε-greedy policy and the Boltzmann policy. The results unequivocally demonstrate that the DQN agent trained using the ε-greedy policy significantly outperforms the one trained with the Boltzmann policy.
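
The two exploration policies compared in the abstract are standard in DQN training. The sketch below contrasts them on a toy Q-vector; it is illustrative only and unrelated to the authors' SUMO/MEC setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon explore uniformly, otherwise exploit argmax Q.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    # Sample an action with probability proportional to exp(Q / T).
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                          # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [0.2, 1.5, -0.3]                      # e.g., Q(s, stay), Q(s, left), Q(s, right)
print(epsilon_greedy(q), boltzmann(q))
```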

Multiple Noises in Diffusion Model for Semi-Supervised Multi-Domain Translation

  • paper_url: http://arxiv.org/abs/2309.14394
  • repo_url: None
  • paper_authors: Tsiry Mayet, Simon Bernard, Clement Chatelain, Romain Herault
  • for: Multi-domain translation in a semi-supervised setting, i.e., translating between arbitrary partitions of a set of domains.
  • methods: A conditional denoising diffusion framework, Multi-Domain Diffusion (MDD), that needs no fixed input and output domains and can translate between any domain configuration (e.g., $(D_1, D_2)\rightarrow{}D_3$, $D_2\rightarrow{}(D_1, D_3)$, $D_3\rightarrow{}D_1$) without training a separate model per configuration.
  • results: Experiments on a multi-domain synthetic image translation dataset with challenging semantic domain inversion.
    Abstract Domain-to-domain translation involves generating a target domain sample given a condition in the source domain. Most existing methods focus on fixed input and output domains, i.e. they only work for specific configurations (i.e. for two domains, either $D_1\rightarrow{}D_2$ or $D_2\rightarrow{}D_1$). This paper proposes Multi-Domain Diffusion (MDD), a conditional diffusion framework for multi-domain translation in a semi-supervised context. Unlike previous methods, MDD does not require defining input and output domains, allowing translation between any partition of domains within a set (such as $(D_1, D_2)\rightarrow{}D_3$, $D_2\rightarrow{}(D_1, D_3)$, $D_3\rightarrow{}D_1$, etc. for 3 domains), without the need to train separate models for each domain configuration. The key idea behind MDD is to leverage the noise formulation of diffusion models by incorporating one noise level per domain, which allows missing domains to be modeled with noise in a natural way. This transforms the training task from a simple reconstruction task to a domain translation task, where the model relies on less noisy domains to reconstruct more noisy domains. We present results on a multi-domain (with more than two domains) synthetic image translation dataset with challenging semantic domain inversion.
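
A rough sketch of the "one noise level per domain" idea, assuming a standard DDPM noising schedule: conditioning domains are kept nearly clean while a missing or target domain is pushed toward pure noise. Tensor shapes and schedule constants are illustrative, not taken from the paper.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)        # standard DDPM schedule

def noise_domains(x_domains, t_per_domain):
    """x_domains: list of tensors, one per domain; t_per_domain: one timestep each.
    Each domain is noised at its own level, so missing domains are naturally
    represented as (almost) pure noise."""
    noisy = []
    for x, t in zip(x_domains, t_per_domain):
        eps = torch.randn_like(x)
        noisy.append(alpha_bar[t].sqrt() * x + (1 - alpha_bar[t]).sqrt() * eps)
    return noisy

x1, x2, x3 = (torch.randn(1, 3, 32, 32) for _ in range(3))
# Translate (D1, D2) -> D3: keep D1, D2 nearly clean, push D3 toward pure noise.
noisy = noise_domains([x1, x2, x3], t_per_domain=[10, 10, T - 1])
```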

Accelerating Machine Learning Algorithms with Adaptive Sampling

  • paper_url: http://arxiv.org/abs/2309.14221
  • repo_url: None
  • paper_authors: Mo Tiwari
  • for: Improving the efficiency of machine learning algorithms on large-scale data.
  • methods: Replacing computationally intensive subroutines with randomized, adaptive-sampling counterparts.
  • results: Substantially improved computational efficiency with almost no degradation in quality.
    Abstract The era of huge data necessitates highly efficient machine learning algorithms. Many common machine learning algorithms, however, rely on computationally intensive subroutines that are prohibitively expensive on large datasets. Oftentimes, existing techniques subsample the data or use other methods to improve computational efficiency, at the expense of incurring some approximation error. This thesis demonstrates that it is often sufficient, instead, to substitute computationally intensive subroutines with a special kind of randomized counterparts that results in almost no degradation in quality.

MemDA: Forecasting Urban Time Series with Memory-based Drift Adaptation

  • paper_url: http://arxiv.org/abs/2309.14216
  • repo_url: https://github.com/deepkashiwa20/Urban_Concept_Drift
  • paper_authors: Zekun Cai, Renhe Jiang, Xinyu Yang, Zhaonan Wang, Diansheng Guo, Hiroki Kobayashi, Xuan Song, Ryosuke Shibasaki
  • for: Tackling concept drift in urban time-series forecasting so that models remain usable as data distributions change.
  • methods: A new urban time-series prediction model that encodes the drift by exploiting the periodicity in the data and adjusts the model on the fly via a meta-dynamic network.
  • results: Experiments on real-world data show the design significantly outperforms state-of-the-art methods and generalizes well to existing prediction backbones by reducing their sensitivity to distribution changes.
    Abstract Urban time series data forecasting featuring significant contributions to sustainable development is widely studied as an essential task of the smart city. However, with the dramatic and rapid changes in the world environment, the assumption that data are independent and identically distributed (i.i.d.) is undermined by subsequent changes in data distribution, known as concept drift, leading to weak replicability and transferability of the model over unseen data. To address the issue, previous approaches typically retrain the model, forcing it to fit the most recent observed data. However, retraining is problematic in that it leads to model lag, consumption of resources, and model re-invalidation, leaving the drift problem poorly solved in realistic scenarios. In this study, we propose a new urban time series prediction model for the concept drift problem, which encodes the drift by considering the periodicity in the data and makes on-the-fly adjustments to the model based on the drift using a meta-dynamic network. Experiments on real-world datasets show that our design significantly outperforms state-of-the-art methods and can be well generalized to existing prediction backbones by reducing their sensitivity to distribution changes.

Continual Driving Policy Optimization with Closed-Loop Individualized Curricula

  • paper_url: http://arxiv.org/abs/2309.14209
  • repo_url: https://github.com/YizhouXu-THU/CLIC
  • paper_authors: Haoyi Niu, Yizhou Xu, Xingjian Jiang, Jianming Hu
  • for: Improving the safety of autonomous vehicles (AVs) by developing a continual driving policy optimization framework called Closed-Loop Individualized Curricula (CLIC).
  • methods: CLIC frames AV evaluation as a collision prediction task that estimates the chance of AV failures in pre-collected scenarios, then tailors individualized training curricula based on these failure probabilities.
  • results: CLIC surpasses other curriculum-based training strategies in managing risky scenarios while maintaining proficiency on simpler cases.
    Abstract The safety of autonomous vehicles (AV) has been a long-standing top concern, stemming from the absence of rare and safety-critical scenarios in the long-tail naturalistic driving distribution. To tackle this challenge, a surge of research in scenario-based autonomous driving has emerged, with a focus on generating high-risk driving scenarios and applying them to conduct safety-critical testing of AV models. However, limited work has been explored on the reuse of these extensive scenarios to iteratively improve AV models. Moreover, it remains intractable and challenging to filter through gigantic scenario libraries collected from other AV models with distinct behaviors, attempting to extract transferable information for current AV improvement. Therefore, we develop a continual driving policy optimization framework featuring Closed-Loop Individualized Curricula (CLIC), which we factorize into a set of standardized sub-modules for flexible implementation choices: AV Evaluation, Scenario Selection, and AV Training. CLIC frames AV Evaluation as a collision prediction task, where it estimates the chance of AV failures in these scenarios at each iteration. Subsequently, by re-sampling from historical scenarios based on these failure probabilities, CLIC tailors individualized curricula for downstream training, aligning them with the evaluated capability of AV. Accordingly, CLIC not only maximizes the utilization of the vast pre-collected scenario library for closed-loop driving policy optimization but also facilitates AV improvement by individualizing its training with more challenging cases out of those poorly organized scenarios. Experimental results clearly indicate that CLIC surpasses other curriculum-based training strategies, showing substantial improvement in managing risky scenarios, while still maintaining proficiency in handling simpler cases.
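
The curriculum step described above reduces, at its core, to re-sampling stored scenarios in proportion to the predicted failure probability of the current AV policy. A minimal sketch follows; the function name and numbers are illustrative, and the authors' repository holds the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_curriculum(failure_probs, batch_size):
    """Re-sample pre-collected scenarios in proportion to the predicted chance
    that the current AV policy fails in them, so training focuses on cases
    that still challenge the AV."""
    p = np.asarray(failure_probs, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), size=batch_size, replace=True, p=p)

failure_probs = [0.05, 0.60, 0.90, 0.10]   # one entry per stored scenario
batch = sample_curriculum(failure_probs, batch_size=8)
print(batch)                               # indices of scenarios to train on
```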

Framework based on complex networks to model and mine patient pathways

  • paper_url: http://arxiv.org/abs/2309.14208
  • repo_url: https://github.com/caroline-rosa/framework_patient_pathways
  • paper_authors: Caroline de Oliveira Costa Souza Rosa, Márcia Ito, Alex Borges Vieira, Klaus Wehmuth, Antônio Tadeu Azevedo Gomes
  • for: Automatically discovering models of patients' encounter histories with the healthcare system ("patient pathways") to support clinical and organizational decisions and improve the quality and efficiency of care.
  • methods: A framework comprising a pathway model based on a multi-aspect graph, a novel dissimilarity measure between pathways that accounts for elapsed time, and a mining method based on traditional centrality measures.
  • results: Case studies on pregnancy and diabetes show the framework finds clusters of similar pathways, represents them in an easy-to-interpret way, and highlights the most significant patterns from multiple perspectives.
    Abstract The automatic discovery of a model to represent the history of encounters of a group of patients with the healthcare system -- the so-called "pathway of patients" -- is a new field of research that supports clinical and organisational decisions to improve the quality and efficiency of the treatment provided. The pathways of patients with chronic conditions tend to vary significantly from one person to another, have repetitive tasks, and demand the analysis of multiple perspectives (interventions, diagnoses, medical specialities, among others) influencing the results. Therefore, modelling and mining those pathways is still a challenging task. In this work, we propose a framework comprising: (i) a pathway model based on a multi-aspect graph, (ii) a novel dissimilarity measurement to compare pathways taking the elapsed time into account, and (iii) a mining method based on traditional centrality measures to discover the most relevant steps of the pathways. We evaluated the framework using the study cases of pregnancy and diabetes, which revealed its usefulness in finding clusters of similar pathways, representing them in an easy-to-interpret way, and highlighting the most significant patterns according to multiple perspectives.

LLMCarbon: Modeling the end-to-end Carbon Footprint of Large Language Models

  • paper_url: http://arxiv.org/abs/2309.14393
  • repo_url: https://github.com/sotarokaneda/mlcarbon
  • paper_authors: Ahmad Faiz, Sotaro Kaneda, Ruhan Wang, Rita Osi, Parteek Sharma, Fan Chen, Lei Jiang
  • for: Accurately estimating the carbon footprint of large language models (LLMs), covering both operational and embodied emissions, even before training begins.
  • methods: LLMCarbon, an end-to-end carbon footprint projection model applicable to both dense and mixture-of-experts (MoE) LLMs, addressing the limitations of the prior tool mlco2.
  • results: LLMCarbon delivers markedly more accurate footprint estimates than mlco2 across various LLMs and can model operational as well as embodied emissions of new neural network designs.
    Abstract The carbon footprint associated with large language models (LLMs) is a significant concern, encompassing emissions from their training, inference, experimentation, and storage processes, including operational and embodied carbon emissions. An essential aspect is accurately estimating the carbon impact of emerging LLMs even before their training, which heavily relies on GPU usage. Existing studies have reported the carbon footprint of LLM training, but only one tool, mlco2, can predict the carbon footprint of new neural networks prior to physical training. However, mlco2 has several serious limitations. It cannot extend its estimation to dense or mixture-of-experts (MoE) LLMs, disregards critical architectural parameters, focuses solely on GPUs, and cannot model embodied carbon footprints. Addressing these gaps, we introduce \textit{LLMCarbon}, an end-to-end carbon footprint projection model designed for both dense and MoE LLMs. Compared to mlco2, LLMCarbon significantly enhances the accuracy of carbon footprint estimations for various LLMs.
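
For intuition, operational training emissions are commonly approximated as GPU energy times data-center overhead times grid carbon intensity. The sketch below is a back-of-the-envelope version of that idea, not LLMCarbon's calibrated model; all numbers are illustrative.

```python
def operational_co2e_kg(gpu_count, gpu_power_kw, hours, pue, grid_kgco2e_per_kwh):
    """Back-of-the-envelope operational footprint:
    energy (kWh) = #GPUs x per-GPU power x hours x datacenter PUE,
    emissions    = energy x grid carbon intensity."""
    energy_kwh = gpu_count * gpu_power_kw * hours * pue
    return energy_kwh * grid_kgco2e_per_kwh

# Illustrative numbers only: 512 GPUs at 0.3 kW for 30 days,
# PUE 1.1, grid intensity 0.4 kgCO2e/kWh.
print(operational_co2e_kg(512, 0.3, 24 * 30, 1.1, 0.4))
```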

Species196: A One-Million Semi-supervised Dataset for Fine-grained Species Recognition

  • paper_url: http://arxiv.org/abs/2309.14183
  • repo_url: https://github.com/Species-Dataset/species-dataset.github.io
  • paper_authors: Wei He, Kai Han, Ying Nie, Chengcheng Wang, Yunhe Wang
  • for: Providing a large-scale semi-supervised dataset to drive the development of deep-learning-based invasive species recognition.
  • methods: Two datasets, the expert-annotated Species196-L and the unlabeled Species196-U, evaluated under four experimental settings: supervised learning, semi-supervised learning, self-supervised pretraining, and zero-shot inference with large multi-modal models.
  • results: An empirical study of representative methods on Species196 benchmarks their performance for fine-grained invasive species recognition.
    Abstract The development of foundation vision models has pushed the general visual recognition to a high level, but cannot well address the fine-grained recognition in specialized domain such as invasive species classification. Identifying and managing invasive species has strong social and ecological value. Currently, most invasive species datasets are limited in scale and cover a narrow range of species, which restricts the development of deep-learning based invasion biometrics systems. To fill the gap of this area, we introduced Species196, a large-scale semi-supervised dataset of 196-category invasive species. It collects over 19K images with expert-level accurate annotations Species196-L, and 1.2M unlabeled images of invasive species Species196-U. The dataset provides four experimental settings for benchmarking the existing models and algorithms, namely, supervised learning, semi-supervised learning, self-supervised pretraining and zero-shot inference ability of large multi-modal models. To facilitate future research on these four learning paradigms, we conduct an empirical study of the representative methods on the introduced dataset. The dataset is publicly available at https://species-dataset.github.io/.

Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

  • paper_url: http://arxiv.org/abs/2309.14181
  • repo_url: https://github.com/Q-Future/Q-Bench
  • paper_authors: Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, Weisi Lin
  • for: Benchmarking the abilities of multi-modality large language models (MLLMs) on low-level visual perception, low-level visual description, and overall visual quality assessment.
  • methods: The LLVisionQA dataset for perception, the LLDescribe dataset with a GPT-involved comparison pipeline for description, and a softmax-based strategy that lets MLLMs predict quantifiable quality scores.
  • results: MLLMs possess preliminary low-level visual skills, but these remain unstable and relatively imprecise, indicating the need for targeted enhancements.
    Abstract The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs on answering these questions. b) To examine the description ability of MLLMs on low-level information, we propose the LLDescribe dataset consisting of long expert-labelled golden low-level text descriptions on 499 images, and a GPT-involved comparison pipeline between outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we further measure their visual quality assessment ability to align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for specific enhancements on MLLMs towards these abilities. We hope that our benchmark can encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs. Project Page: https://vqassessment.github.io/Q-Bench.
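
The softmax-based scoring strategy can be illustrated as follows: prompt the MLLM to rate an image, then reduce its next-token logits to the two candidate tokens "good" and "poor" and read off P(good) as a quality score. This is a hedged sketch; the token ids and tensor sizes are placeholders, not values from the benchmark.

```python
import torch

def quality_score(logits, good_id, poor_id):
    """Reduce a model's next-token logits (e.g., after a prompt like
    'The quality of the image is ...') to the two candidate tokens
    'good' / 'poor' and return P(good) as a score in [0, 1]."""
    pair = torch.stack([logits[good_id], logits[poor_id]])
    return torch.softmax(pair, dim=0)[0].item()

vocab_logits = torch.randn(32000)        # stand-in for a real model's output
print(quality_score(vocab_logits, good_id=1781, poor_id=5291))  # ids illustrative
```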

Data Upcycling Knowledge Distillation for Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2309.14162
  • repo_url: None
  • paper_authors: Yun Zhang, Wei Li, Simiao Li, Jie Hu, Hanting Chen, Hailing Wang, Zhijun Tu, Wenjia Wang, Bingyi Jing, Yunhe Wang
  • for: Improving knowledge distillation for single image super-resolution (SISR) through more effective use of training data.
  • methods: Data Upcycling Knowledge Distillation (DUKD), which feeds the student upcycled in-domain data built from two efficient image zooming operations and invertible augmentations, introducing label-consistency regularization to KD for SISR.
  • results: DUKD clearly outperforms baseline methods across benchmarks, e.g., up to a 0.5 dB PSNR gain, and an RCAN student with 67% fewer parameters performs on par with the RCAN teacher.
    Abstract Knowledge distillation (KD) emerges as a challenging yet promising technique for compressing deep learning models, characterized by the transmission of extensive learning representations from proficient and computationally intensive teacher models to compact student models. However, only a handful of studies have endeavored to compress the models for single image super-resolution (SISR) through KD, with their effects on student model enhancement remaining marginal. In this paper, we put forth an approach from the perspective of efficient data utilization, namely, the Data Upcycling Knowledge Distillation (DUKD), which facilitates the student model by the prior knowledge the teacher provides via upcycled in-domain data derived from its inputs. This upcycling process is realized through two efficient image zooming operations and invertible data augmentations, which introduce label-consistency regularization to the field of KD for SISR and substantially boost the student model's generalization. The DUKD, due to its versatility, can be applied across a broad spectrum of teacher-student architectures. Comprehensive experiments across diverse benchmarks demonstrate that our proposed DUKD method significantly outperforms previous art, exemplified by an increase of up to 0.5 dB in PSNR over baseline methods, and an RCAN student with 67% fewer parameters performing on par with the RCAN teacher model.
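
A minimal sketch of a distillation objective in the spirit of DUKD for super-resolution, combining ground-truth supervision with a teacher-matching term. The actual method additionally constructs "upcycled" inputs via zoom operations and invertible augmentations with label consistency; the loss weighting and the L1 choice here are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, lr_img, hr_img, alpha=0.5):
    """Supervise the student SR network both by the ground-truth HR image and
    by the (frozen) teacher's output. `student` and `teacher` are any SR
    networks mapping low-res to high-res tensors."""
    with torch.no_grad():
        teacher_sr = teacher(lr_img)       # teacher is frozen during KD
    student_sr = student(lr_img)
    task = F.l1_loss(student_sr, hr_img)           # fit the ground truth
    distill = F.l1_loss(student_sr, teacher_sr)    # mimic the teacher
    return alpha * task + (1 - alpha) * distill
```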

SPIRT: A Fault-Tolerant and Reliable Peer-to-Peer Serverless ML Training Architecture

  • paper_url: http://arxiv.org/abs/2309.14148
  • repo_url: None
  • paper_authors: Amine Barrak, Mayssa Jaziri, Ranim Trabelsi, Fehmi Jaafar, Fabio Petrillo
  • for: Bringing serverless computing to peer-to-peer (P2P) distributed machine learning training.
  • methods: SPIRT, a fault-tolerant, reliable, and secure serverless P2P ML training architecture that uses RedisAI for in-database operations.
  • results: SPIRT cuts the time for model updates and gradient averaging by 82%, withstands peer failures, integrates new peers smoothly, and maintains high accuracy even under Byzantine attacks.
    Abstract The advent of serverless computing has ushered in notable advancements in distributed machine learning, particularly within parameter server-based architectures. Yet, the integration of serverless features within peer-to-peer (P2P) distributed networks remains largely uncharted. In this paper, we introduce SPIRT, a fault-tolerant, reliable, and secure serverless P2P ML training architecture designed to bridge this existing gap. Capitalizing on the inherent robustness and reliability innate to P2P systems, SPIRT employs RedisAI for in-database operations, leading to an 82% reduction in the time required for model updates and gradient averaging across a variety of models and batch sizes. This architecture showcases resilience against peer failures and adeptly manages the integration of new peers, thereby highlighting its fault-tolerant characteristics and scalability. Furthermore, SPIRT ensures secure communication between peers, enhancing the reliability of distributed machine learning tasks. Even in the face of Byzantine attacks, the system's robust aggregation algorithms maintain high levels of accuracy. These findings illuminate the promising potential of serverless architectures in P2P distributed machine learning, offering a significant stride towards the development of more efficient, scalable, and resilient applications.

Exploring the Impact of Serverless Computing on Peer To Peer Training Machine Learning

  • paper_url: http://arxiv.org/abs/2309.14139
  • repo_url: https://github.com/aminebarrak/peertopeerserverless
  • paper_authors: Amine Barrak, Ranim Trabelsi, Fehmi Jaafar, Fabio Petrillo
  • for: Proposing a novel architecture that combines serverless computing with P2P networks for distributed training, improving scalability and fault tolerance.
  • methods: Distributed gradient computation with an efficient parallelization scheme for serverless training under resource constraints.
  • results: Gradient computation time improves by up to 97.34% over conventional P2P distributed training, though the serverless architecture can cost up to 5.4 times more than instance-based alternatives under resource constraints.
    Abstract The increasing demand for computational power in big data and machine learning has driven the development of distributed training methodologies. Among these, peer-to-peer (P2P) networks provide advantages such as enhanced scalability and fault tolerance. However, they also encounter challenges related to resource consumption, costs, and communication overhead as the number of participating peers grows. In this paper, we introduce a novel architecture that combines serverless computing with P2P networks for distributed training and present a method for efficient parallel gradient computation under resource constraints. Our findings show a significant enhancement in gradient computation time, with up to a 97.34\% improvement compared to conventional P2P distributed training methods. As for costs, our examination confirmed that the serverless architecture could incur higher expenses, reaching up to 5.4 times more than instance-based architectures. It is essential to consider that these higher costs are associated with marked improvements in computation time, particularly under resource-constrained scenarios. Despite the cost-time trade-off, the serverless approach still holds promise due to its pay-as-you-go model. Utilizing dynamic resource allocation, it enables faster training times and optimized resource utilization, making it a promising candidate for a wide range of machine learning applications.

Small Objects Matters in Weakly-supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.14117
  • repo_url: None
  • paper_authors: Cheolhyun Mun, Sanghuk Lee, Youngjung Uh, Junsuk Choe, Hyeran Byun
  • for: Providing a comprehensive evaluation of weakly-supervised semantic segmentation (WSSS) methods across different object sizes, together with a size-balanced evaluation set complementing PASCAL VOC.
  • methods: A novel evaluation metric, a size-balanced cross-entropy loss, and a proper training strategy that address the poor performance of existing WSSS methods on small objects.
  • results: Evaluating ten baselines on three datasets reveals that existing WSSS methods struggle with small objects, and the proposed size-balanced loss and training strategy generally improve them.
    Abstract Weakly-supervised semantic segmentation (WSSS) performs pixel-wise classification given only image-level labels for training. Despite the difficulty of this task, the research community has achieved promising results over the last five years. Still, current WSSS literature misses the detailed sense of how well the methods perform on different sizes of objects. Thus we propose a novel evaluation metric to provide a comprehensive assessment across different object sizes and collect a size-balanced evaluation set to complement PASCAL VOC. With these two gadgets, we reveal that the existing WSSS methods struggle in capturing small objects. Furthermore, we propose a size-balanced cross-entropy loss coupled with a proper training strategy. It generally improves existing WSSS methods as validated upon ten baselines on three different datasets.
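
One way to realize a size-balanced cross-entropy, sketched below, is to re-weight each pixel's loss inversely to the area its class occupies, so small objects are not drowned out by large ones. This is an illustrative form; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def size_balanced_ce(logits, target, num_classes):
    """logits: (B, C, H, W) scores; target: (B, H, W) class indices.
    Each pixel's cross-entropy is re-weighted inversely to the area its
    class occupies, so small objects contribute comparably to large ones."""
    pixel_ce = F.cross_entropy(logits, target, reduction="none")   # (B, H, W)
    weights = torch.ones_like(pixel_ce)
    for c in range(num_classes):
        mask = target == c
        area = mask.sum()
        if area > 0:
            weights[mask] = target.numel() / (num_classes * area.float())
    return (weights * pixel_ce).mean()

logits = torch.randn(2, 21, 64, 64)
target = torch.randint(0, 21, (2, 64, 64))
print(size_balanced_ce(logits, target, num_classes=21))
```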

Semi-Abstract Value-Based Argumentation Framework

  • paper_url: http://arxiv.org/abs/2309.14112
  • repo_url: None
  • paper_authors: Jovan Jeromela
  • for: Studying extensions of abstract argumentation frameworks that add structure to arguments, and their applications.
  • methods: Two such extensions are examined: the value-based argumentation framework and the semi-abstract argumentation framework.
  • results: A new semi-abstract value-based argumentation framework is introduced, mapping propositional formulae associated with individual arguments to a set of ordered values and making implicit attacks explicit via newly introduced attack principles; a complex moral dilemma is formulated in both frameworks to showcase their expressivity.
    Abstract In his seminal paper, Phan Minh Dung (1995) proposed abstract argumentation framework, which models argumentation using directed graphs where structureless arguments are the nodes and attacks among the arguments are the edges. In the following years, many extensions of this framework were introduced. These extensions typically add a certain form of structure to the arguments. This thesis showcases two such extensions -- value-based argumentation framework by Trevor Bench-Capon (2002) and semi-abstract argumentation framework by Esther Anna Corsi and Christian Fermüller (2017). The former introduces a mapping function that links individual arguments to a set of ordered values, enabling a distinction between objectively and subjectively acceptable arguments. The latter links claims of individual arguments to propositional formulae and then applies newly-introduced attack principles in order to make implicit attacks explicit and to enable a definition of a consequence relation that relies on neither the truth values nor the interpretations in the usual sense. The contribution of this thesis is two-fold. Firstly, the new semi-abstract value-based argumentation framework is introduced. This framework maps propositional formulae associated with individual arguments to a set of ordered values. Secondly, a complex moral dilemma is formulated using the original and the value-based argumentation frameworks showcasing the expressivity of these formalisms.

Comprehensive Overview of Named Entity Recognition: Models, Domain-Specific Applications and Challenges

  • paper_url: http://arxiv.org/abs/2309.14084
  • repo_url: None
  • paper_authors: Kalyani Pakhale
  • for: Surveying the evolution and applications of Named Entity Recognition (NER), from traditional rule-based strategies to contemporary AI techniques, with particular attention to accuracy and adaptability in specialized domains.
  • methods: Coverage spans foundational NER concepts, transformer architectures and integrations such as BERT with LSTM and CNN, domain-specific NER models for areas like finance, legal, and healthcare, and emerging paradigms including reinforcement learning, E-NER, and OCR-augmented NER.
  • results: The survey highlights the indispensable role of NER in sectors such as finance and biomedicine, addresses their unique challenges, and outlines open problems and future research directions.
    Abstract In the domain of Natural Language Processing (NLP), Named Entity Recognition (NER) stands out as a pivotal mechanism for extracting structured insights from unstructured text. This manuscript offers an exhaustive exploration into the evolving landscape of NER methodologies, blending foundational principles with contemporary AI advancements. Beginning with the rudimentary concepts of NER, the study spans a spectrum of techniques from traditional rule-based strategies to the contemporary marvels of transformer architectures, particularly highlighting integrations such as BERT with LSTM and CNN. The narrative accentuates domain-specific NER models, tailored for intricate areas like finance, legal, and healthcare, emphasizing their specialized adaptability. Additionally, the research delves into cutting-edge paradigms including reinforcement learning, innovative constructs like E-NER, and the interplay of Optical Character Recognition (OCR) in augmenting NER capabilities. Grounding its insights in practical realms, the paper sheds light on the indispensable role of NER in sectors like finance and biomedicine, addressing the unique challenges they present. The conclusion outlines open challenges and avenues, marking this work as a comprehensive guide for those delving into NER research and applications.
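
As a minimal hands-on companion to the survey, a transformer-based NER tagger can be run in a few lines with Hugging Face's pipeline API; the checkpoint name and example sentence are illustrative choices, not taken from the paper.

```python
# Minimal NER example with Hugging Face transformers (pip install transformers).
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",       # an assumed public BERT NER checkpoint
    aggregation_strategy="simple",     # merge word-piece tokens into entities
)

text = "Angela Merkel visited the Siemens headquarters in Munich."
for ent in ner(text):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))
```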

ODE-based Recurrent Model-free Reinforcement Learning for POMDPs

  • paper_url: http://arxiv.org/abs/2309.14078
  • repo_url: None
  • paper_authors: Xuanle Zhao, Duzhen Zhang, Liyuan Han, Tielin Zhang, Bo Xu
  • for: Inferring unseen information from raw observations in partially observable (PO) environments to improve an agent's decision-making.
  • methods: A recurrent policy with a compact context, where a novel ODE-based recurrent model is combined with a model-free reinforcement learning framework to extract unobservable, dynamics-related information from historical transitions.
  • results: The approach solves partially observable continuous control and meta-RL tasks and remains robust to irregularly sampled observations, owing to the ability of ODEs to model irregular time series.
    Abstract Neural ordinary differential equations (ODEs) are widely recognized as the standard for modeling physical mechanisms, which help to perform approximate inference in unknown physical or biological environments. In partially observable (PO) environments, how to infer unseen information from raw observations puzzled the agents. By using a recurrent policy with a compact context, context-based reinforcement learning provides a flexible way to extract unobservable information from historical transitions. To help the agent extract more dynamics-related information, we present a novel ODE-based recurrent model combines with model-free reinforcement learning (RL) framework to solve partially observable Markov decision processes (POMDPs). We experimentally demonstrate the efficacy of our methods across various PO continuous control and meta-RL tasks. Furthermore, our experiments illustrate that our method is robust against irregular observations, owing to the ability of ODEs to model irregularly-sampled time series.

Maximum Likelihood Estimation of Latent Variable Structural Equation Models: A Neural Network Approach

  • paper_url: http://arxiv.org/abs/2309.14073
  • repo_url: None
  • paper_authors: Mehrzad Saremi
  • for: Proposing a graphical structure for structural equation models that is stable under marginalization under linearity and Gaussianity assumptions.
  • methods: A GPU-based algorithm that computes the maximum likelihood estimation of these models.
  • results: Computing the maximum likelihood estimation of these models is shown to be equivalent to training a neural network.
    Abstract We propose a graphical structure for structural equation models that is stable under marginalization under linearity and Gaussianity assumptions. We show that computing the maximum likelihood estimation of this model is equivalent to training a neural network. We implement a GPU-based algorithm that computes the maximum likelihood estimation of these models.

Adapt then Unlearn: Exploiting Parameter Space Semantics for Unlearning in Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2309.14054
  • repo_url: None
  • paper_authors: Piyush Tiwary, Atri Guha, Subhodip Panda, Prathosh A. P
  • for: Preventing pre-trained deep generative models from producing outputs that contain undesirable, offensive, or potentially harmful content.
  • methods: A two-stage "Adapt-then-Unlearn" procedure: first adapt the pre-trained GAN using user-provided negative samples, then retrain it on positive samples with a repulsion regularizer that pushes the parameters away from those of the adapted model.
  • results: Comprehensive experiments validate that the method effectively unlearns undesired features while maintaining the quality of generated samples.
    Abstract The increased attention to regulating the outputs of deep generative models, driven by growing concerns about privacy and regulatory compliance, has highlighted the need for effective control over these models. This necessity arises from instances where generative models produce outputs containing undesirable, offensive, or potentially harmful content. To tackle this challenge, the concept of machine unlearning has emerged, aiming to forget specific learned information or to erase the influence of undesired data subsets from a trained model. The objective of this work is to prevent the generation of outputs containing undesired features from a pre-trained GAN where the underlying training data set is inaccessible. Our approach is inspired by a crucial observation: the parameter space of GANs exhibits meaningful directions that can be leveraged to suppress specific undesired features. However, such directions usually result in the degradation of the quality of generated samples. Our proposed method, known as 'Adapt-then-Unlearn,' excels at unlearning such undesirable features while also maintaining the quality of generated samples. This method unfolds in two stages: in the initial stage, we adapt the pre-trained GAN using negative samples provided by the user, while in the subsequent stage, we focus on unlearning the undesired feature. During the latter phase, we train the pre-trained GAN using positive samples, incorporating a repulsion regularizer. This regularizer encourages the model's parameters to be away from the parameters associated with the adapted model from the first stage while also maintaining the quality of generated samples. To the best of our knowledge, our approach stands as first method addressing unlearning in GANs. We validate the effectiveness of our method through comprehensive experiments.
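
The second-stage objective can be sketched as the usual GAN loss on positive samples plus a repulsion term that grows as the generator's parameters approach those of the stage-one adapted model. The functional form below (an inverse squared distance) is an assumption for illustration; the paper's exact regularizer may differ.

```python
import torch

def repulsion_penalty(model, adapted_params, lam=1e-3, eps=1e-8):
    """Penalize closeness between the current generator parameters and the
    parameters of the negatively-adapted model from stage one, so training
    on positive samples drifts away from the undesired feature."""
    dist_sq = sum(((p - a.detach()) ** 2).sum()
                  for p, a in zip(model.parameters(), adapted_params))
    return lam / (dist_sq + eps)

# Usage (illustrative): adapted_snapshot is a list of parameter tensors saved
# after the stage-one adaptation to negative samples.
# total_loss = gan_loss + repulsion_penalty(generator, adapted_snapshot)
```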

Revisiting LARS for Large Batch Training Generalization of Neural Networks

  • paper_url: http://arxiv.org/abs/2309.14053
  • repo_url: None
  • paper_authors: Khoi Do, Duong Nguyen, Hoa Nguyen, Long Tran-Thanh, Quoc-Viet Pham
  • for: Studying convergence stability in Large Batch Learning (LBL), where AI training tends to get trapped in sharp minima.
  • methods: Empirical analysis of the two most popular LARS-family optimizers, LARS and LAMB, with and without a warm-up strategy, leading to a new algorithm, Time Varying LARS (TVLARS).
  • results: TVLARS trains stably without warm-up and achieves results competitive with LARS and LAMB when warm-up is used, while surpassing them without it.
    Abstract LARS and LAMB have emerged as prominent techniques in Large Batch Learning (LBL), ensuring the stability of AI training. One of the primary challenges in LBL is convergence stability, where training often gets trapped in sharp minima. Addressing this challenge, a relatively recent technique, known as warm-up, has been employed. However, warm-up lacks a strong theoretical foundation, leaving the door open for further exploration of more efficacious algorithms. In light of this situation, we conduct empirical experiments to analyze the behaviors of the two most popular optimizers in the LARS family, LARS and LAMB, with and without a warm-up strategy. Our analyses provide insight into LARS, LAMB, and the necessity of a warm-up technique in LBL. Building upon these insights, we propose a novel algorithm called Time Varying LARS (TVLARS), which facilitates robust training in the initial phase without the need for warm-up. Experimental evaluation demonstrates that TVLARS achieves competitive results with LARS and LAMB when warm-up is utilized while surpassing their performance without the warm-up technique.
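
For context, the layer-wise trust ratio that defines LARS scales each layer's step by the ratio of its weight norm to its gradient norm. A minimal sketch of that computation follows; hyperparameter values are illustrative.

```python
import torch

def lars_local_lr(param, grad, base_lr=1.0, trust_coeff=1e-3, weight_decay=1e-4):
    """Layer-wise trust ratio at the heart of LARS: scale each layer's step by
    ||w|| / (||g|| + wd * ||w||), so layers whose gradients are large relative
    to their weights take proportionally smaller steps, stabilizing
    large-batch SGD."""
    w_norm = param.norm()
    g_norm = grad.norm()
    if w_norm > 0 and g_norm > 0:
        trust = trust_coeff * w_norm / (g_norm + weight_decay * w_norm)
    else:
        trust = torch.tensor(1.0)
    return base_lr * trust

w = torch.randn(256, 256)
g = torch.randn(256, 256)
step = lars_local_lr(w, g) * (g + 1e-4 * w)   # one SGD-style LARS update
```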

An automatic selection of optimal recurrent neural network architecture for processes dynamics modelling purposes

  • paper_url: http://arxiv.org/abs/2309.14037
  • repo_url: None
  • paper_authors: Krzysztof Laddach, Rafał Łangowski, Tomasz A. Rutkowski, Bartosz Puchalski
  • for: Finding the structure of artificial neural networks used for behavioural (black-box) modelling of selected dynamic processes.
  • methods: Four original neural architecture search algorithms based on well-known optimisation techniques such as evolutionary algorithms and gradient descent, including original specialised evolutionary operators.
  • results: In an extended validation study on data from a mathematical model of a pressurised water nuclear reactor, the optimised recurrent networks achieve a trade-off between network size and accuracy.
    Abstract A problem related to the development of algorithms designed to find the structure of artificial neural network used for behavioural (black-box) modelling of selected dynamic processes has been addressed in this paper. The research has included four original proposals of algorithms dedicated to neural network architecture search. Algorithms have been based on well-known optimisation techniques such as evolutionary algorithms and gradient descent methods. In the presented research an artificial neural network of recurrent type has been used, whose architecture has been selected in an optimised way based on the above-mentioned algorithms. The optimality has been understood as achieving a trade-off between the size of the neural network and its accuracy in capturing the response of the mathematical model under which it has been learnt. During the optimisation, original specialised evolutionary operators have been proposed. The research involved an extended validation study based on data generated from a mathematical model of the fast processes occurring in a pressurised water nuclear reactor.

DeepACO: Neural-enhanced Ant Systems for Combinatorial Optimization

  • paper_url: http://arxiv.org/abs/2309.14032
  • repo_url: https://github.com/henry-yeh/DeepACO
  • paper_authors: Haoran Ye, Jiarui Wang, Zhiguang Cao, Helan Liang, Yong Li
  • for: Automating the design of heuristics in Ant Colony Optimization (ACO) using deep learning.
  • methods: DeepACO, a generic framework that leverages deep reinforcement learning to strengthen the heuristic measures of existing ACO algorithms, using a single neural model and a single set of hyperparameters across problems.
  • results: DeepACO consistently outperforms its ACO counterparts on eight combinatorial optimization problems and performs better than or on par with problem-specific methods on canonical routing problems.
    Abstract Ant Colony Optimization (ACO) is a meta-heuristic algorithm that has been successfully applied to various Combinatorial Optimization Problems (COPs). Traditionally, customizing ACO for a specific problem requires the expert design of knowledge-driven heuristics. In this paper, we propose DeepACO, a generic framework that leverages deep reinforcement learning to automate heuristic designs. DeepACO serves to strengthen the heuristic measures of existing ACO algorithms and dispense with laborious manual design in future ACO applications. As a neural-enhanced meta-heuristic, DeepACO consistently outperforms its ACO counterparts on eight COPs using a single neural model and a single set of hyperparameters. As a Neural Combinatorial Optimization method, DeepACO performs better than or on par with problem-specific methods on canonical routing problems. Our code is publicly available at https://github.com/henry-yeh/DeepACO.
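
At the heart of any ACO variant is the probabilistic transition rule p(i→j) ∝ τ_ij^α · η_ij^β; DeepACO's contribution is to let a neural network produce the heuristic matrix η instead of a hand-designed rule such as inverse distance. The minimal sketch below simply takes such a matrix as input; the random η here only stands in for learned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_node(current, unvisited, pheromone, heuristic, alpha=1.0, beta=1.0):
    """Standard ACO transition rule: choose the next node with probability
    proportional to pheromone^alpha * heuristic^beta. In DeepACO the heuristic
    matrix is produced by a trained neural network rather than designed by hand."""
    cand = np.array(sorted(unvisited))
    scores = (pheromone[current, cand] ** alpha) * (heuristic[current, cand] ** beta)
    probs = scores / scores.sum()
    return int(rng.choice(cand, p=probs))

n = 5
tau = np.ones((n, n))                 # pheromone trails, updated between tours
eta = rng.random((n, n)) + 0.1        # stand-in for learned heuristic measures
print(next_node(0, {1, 2, 3, 4}, tau, eta))
```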

Diffeomorphic Transformations for Time Series Analysis: An Efficient Approach to Nonlinear Warping

  • paper_url: http://arxiv.org/abs/2309.14029
  • repo_url: None
  • paper_authors: Iñigo Martinez
  • for: Handling time-series data with similarity, classification, and clustering methods designed around nonlinear temporal warping.
  • methods: Novel elastic alignment methods based on parametric, diffeomorphic warping transformations that extend DTW-style metrics; the transformations are differentiable, invertible, robust to noise and outliers, and admit a closed-form gradient that enables efficient search in the parameter space.
  • results: Building on these transformations, the thesis contributes an enhanced temporal transformer network for alignment and averaging, a deep-learning model that jointly aligns and classifies signals with high accuracy, a warping-invariant and scalable incremental clustering algorithm, and a normalizing flow model that increases the flexibility of affine transformations in coupling and autoregressive layers.
    Abstract The proliferation and ubiquity of temporal data across many disciplines has sparked interest for similarity, classification and clustering methods specifically designed to handle time series data. A core issue when dealing with time series is determining their pairwise similarity, i.e., the degree to which a given time series resembles another. Traditional distance measures such as the Euclidean are not well-suited due to the time-dependent nature of the data. Elastic metrics such as dynamic time warping (DTW) offer a promising approach, but are limited by their computational complexity, non-differentiability and sensitivity to noise and outliers. This thesis proposes novel elastic alignment methods that use parametric \& diffeomorphic warping transformations as a means of overcoming the shortcomings of DTW-based metrics. The proposed method is differentiable \& invertible, well-suited for deep learning architectures, robust to noise and outliers, computationally efficient, and is expressive and flexible enough to capture complex patterns. Furthermore, a closed-form solution was developed for the gradient of these diffeomorphic transformations, which allows an efficient search in the parameter space, leading to better solutions at convergence. Leveraging the benefits of these closed-form diffeomorphic transformations, this thesis proposes a suite of advancements that include: (a) an enhanced temporal transformer network for time series alignment and averaging, (b) a deep-learning based time series classification model to simultaneously align and classify signals with high accuracy, (c) an incremental time series clustering algorithm that is warping-invariant, scalable and can operate under limited computational and time resources, and finally, (d) a normalizing flow model that enhances the flexibility of affine transformations in coupling and autoregressive layers.

LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression

  • paper_url: http://arxiv.org/abs/2309.14021
  • repo_url: None
  • paper_authors: Ayush Kaushal, Tejas Vaidhya, Irina Rish
  • • for: Compressing large language models (LLMs) for monolingual code generation via Low Rank Decomposition (LoRD) to speed up inference.
  • methods: Each large linear layer is split into a product of two smaller matrices, reducing parameters without sparsification while keeping the layers fully differentiable and all parameters trainable, and still able to use highly efficient floating-point kernels.
  • results: StarCoder 16B is compressed to 13.2B parameters with no drop, and to 12.3B with minimal drop in HumanEval Pass@1, in under 10 minutes on a single A100; inference speeds up by up to 22.35%, and LoRD composes with quantization methods such as SpQR and with QLoRA.
    Abstract Low Rank Decomposition of a matrix - splitting a large matrix into a product of two smaller matrices - offers a means of compression that reduces the parameters of a model without sparsification, and hence delivers more speedup on modern hardware. Moreover, unlike quantization, the compressed linear layers remain fully differentiable and all parameters trainable, while being able to leverage the existing highly efficient kernels over floating point matrices. We study the potential to compress Large Language Models (LLMs) for monolingual Code generation via Low Rank Decomposition (LoRD) and observe that ranks for the linear layers in these models can be reduced by up to 39.58% with less than 1% increase in perplexity. We then use Low Rank Decomposition (LoRD) to compress StarCoder 16B to 13.2B parameters with no drop and to 12.3B with minimal drop in HumanEval Pass@1 score, in less than 10 minutes on a single A100. The compressed models speed up inference by up to 22.35% with just a single line of change in code over huggingface's implementation with pytorch backend. Low Rank Decomposition (LoRD) models remain compatible with state of the art near-lossless quantization methods such as SpQR, which allows leveraging further compression gains of quantization. Lastly, QLoRA over Low Rank Decomposition (LoRD) models further reduces memory requirements by as much as 21.2% over vanilla QLoRA while offering similar gains from parameter-efficient fine-tuning. Our work shows Low Rank Decomposition (LoRD) as a promising new paradigm for LLM compression.
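
The core one-shot operation is a truncated SVD of each weight matrix, splitting one large linear layer into two smaller, still fully differentiable ones. A minimal PyTorch sketch follows; the rank choice and layer sizes are illustrative, and LoRD additionally decides which layers to decompose and by how much.

```python
import torch
import torch.nn as nn

def low_rank_linear(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Truncate the SVD of W (out x in) and split it into two smaller
    linears, in -> rank -> out, so that second(first(x)) ~= W x + b."""
    W = linear.weight.data                               # (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = Vh[:rank, :] * S[:rank].sqrt().unsqueeze(1)      # (rank, in)
    B = U[:, :rank] * S[:rank].sqrt().unsqueeze(0)       # (out, rank)
    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data.copy_(A)
    second.weight.data.copy_(B)
    if linear.bias is not None:
        second.bias.data.copy_(linear.bias.data)
    return nn.Sequential(first, second)

layer = nn.Linear(4096, 4096)
compressed = low_rank_linear(layer, rank=1024)  # 2*4096*1024 < 4096*4096 params
x = torch.randn(1, 4096)
print((layer(x) - compressed(x)).abs().max())   # approximation error
```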

Morphological Computing as Logic Underlying Cognition in Human, Animal, and Intelligent Machine

  • paper_url: http://arxiv.org/abs/2309.13979
  • repo_url: None
  • paper_authors: Gordana Dodig-Crnkovic
  • for: Examining the interconnections between logic, epistemology, and the sciences within the Naturalist tradition.
  • methods: A scheme connecting logic, mathematics, physics, chemistry, biology, and cognition, emphasizing scale-invariant, self-organizing dynamics across organizational tiers of nature.
  • results: The logic of agency is shown to exist in natural processes at various levels, in humans, animals, and artifactual agents alike; human-centric, natural-language-based logic appears as an evolved form of the logic already present in the basal cognition of unicellular organisms, so cognitive logic stems from physical, chemical, and biological logic. The Extended Evolutionary Synthesis is argued to be essential for understanding the emergence of human-level logic, and more research is called for on the mechanisms linking natural phenomena with the logic of agency.
    Abstract This work examines the interconnections between logic, epistemology, and sciences within the Naturalist tradition. It presents a scheme that connects logic, mathematics, physics, chemistry, biology, and cognition, emphasizing scale-invariant, self-organizing dynamics across organizational tiers of nature. The inherent logic of agency exists in natural processes at various levels, under information exchanges. It applies to humans, animals, and artifactual agents. The common human-centric, natural language-based logic is an example of complex logic evolved by living organisms that already appears in the simplest form at the level of basal cognition of unicellular organisms. Thus, cognitive logic stems from the evolution of physical, chemical, and biological logic. In a computing nature framework with a self-organizing agency, innovative computational frameworks grounded in morphological/physical/natural computation can be used to explain the genesis of human-centered logic through the steps of naturalized logical processes at lower levels of organization. The Extended Evolutionary Synthesis of living agents is essential for understanding the emergence of human-level logic and the relationship between logic and information processing/computational epistemology. We conclude that more research is needed to elucidate the details of the mechanisms linking natural phenomena with the logic of agency in nature.

Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts

  • paper_url: http://arxiv.org/abs/2309.14974
  • repo_url: https://github.com/lascivaroma/seligator
  • paper_authors: Thibault Clérice
  • for: Using deep learning for sentence-level semantic classification to accelerate the traditionally slow corpus-building process in the humanities and linguistics.
  • methods: A novel corpus of roughly 2,500 sentences spanning 300 BCE to 900 CE covering sexual semantics (medical, erotica, etc.); various sentence classification approaches and input embedding layers are evaluated, all of which outperform simple token-based searches.
  • results: The approach reaches a precision of 70.60% and a true positive rate (TPR) of 86.33% with HAN; with a reduced training set (420 instead of 2013 instances) the models still achieve 69% precision and 51% TPR even without MLM, and an analysis of the attention mechanism is provided as an added value for humanists producing more data.
    Abstract In this study, we propose to evaluate the use of deep learning methods for semantic classification at the sentence level to accelerate the process of corpus building in the field of humanities and linguistics, a traditional and time-consuming task. We introduce a novel corpus comprising around 2500 sentences spanning from 300 BCE to 900 CE including sexual semantics (medical, erotica, etc.). We evaluate various sentence classification approaches and different input embedding layers, and show that all consistently outperform simple token-based searches. We explore the integration of idiolectal and sociolectal metadata embeddings (centuries, author, type of writing), but find that it leads to overfitting. Our results demonstrate the effectiveness of this approach, achieving high precision and true positive rates (TPR) of respectively 70.60% and 86.33% using HAN. We evaluate the impact of the dataset size on the model performances (420 instead of 2013), and show that, while our models perform worse, they still offer a high enough precision and TPR, even without MLM, respectively 69% and 51%. Given the result, we provide an analysis of the attention mechanism as a supporting added value for humanists in order to produce more data.

Audio classification with Dilated Convolution with Learnable Spacings

  • paper_url: http://arxiv.org/abs/2309.13972
  • repo_url: https://github.com/k-h-ismail/dcls-audio
  • paper_authors: Ismail Khalfaoui-Hassani, Timothée Masquelier, Thomas Pellegrini
  • for: This paper studies audio tagging, using dilated convolution with learnable spacings to improve audio classification accuracy.
  • methods: Dilated Convolution with Learnable Spacings (DCLS) is used as a drop-in replacement for the depthwise separable convolution (DSC) layers of ConvNeXt, ConvFormer, and the hybrid FastViT, improving accuracy on the AudioSet classification benchmark (a conceptual sketch of DCLS follows the abstract below).
  • results: DCLS improves mean average precision (mAP) with all three architectures without increasing the parameter count, at only a small cost in throughput.
    Abstract Dilated convolution with learnable spacings (DCLS) is a recent convolution method in which the positions of the kernel elements are learned throughout training by backpropagation. Its interest has recently been demonstrated in computer vision (ImageNet classification and downstream tasks). Here we show that DCLS is also useful for audio tagging using the AudioSet classification benchmark. We took two state-of-the-art convolutional architectures using depthwise separable convolutions (DSC), ConvNeXt and ConvFormer, and a hybrid one using attention in addition, FastViT, and drop-in replaced all the DSC layers by DCLS ones. This significantly improved the mean average precision (mAP) with the three architectures without increasing the number of parameters and with only a low cost on the throughput. The method code is based on PyTorch and is available at https://github.com/K-H-Ismail/DCLS-Audio
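
For readers unfamiliar with the technique, the following is a minimal, self-contained PyTorch sketch of the core DCLS idea: each kernel weight is scattered into a larger dense kernel at a learnable fractional position, using bilinear interpolation so the positions receive gradients. This is an illustration only, not the authors' implementation (see the linked repo); the module name, sizes, and initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DclsConv2dSketch(nn.Module):
    """Depthwise convolution whose kernel-element positions are learned."""
    def __init__(self, channels, kernel_count=9, dilated_size=17):
        super().__init__()
        self.channels = channels
        self.dilated_size = dilated_size
        self.weight = nn.Parameter(torch.randn(channels, kernel_count) * 0.1)
        # One learnable (y, x) position per weight, kept in [-1, 1].
        self.pos = nn.Parameter(torch.rand(channels, kernel_count, 2) * 2 - 1)

    def make_kernel(self):
        S = self.dilated_size
        # Map positions to continuous grid coordinates in [0, S-1].
        p = (self.pos.clamp(-1, 1) + 1) / 2 * (S - 1)
        p0 = p.floor().long().clamp(0, S - 2)
        frac = p - p0.float()
        kernel = self.weight.new_zeros(self.channels, S * S)
        # Bilinear scatter: each weight contributes to its 4 nearest cells,
        # which keeps the positions differentiable.
        for dy in (0, 1):
            for dx in (0, 1):
                wy = frac[..., 0] if dy else 1 - frac[..., 0]
                wx = frac[..., 1] if dx else 1 - frac[..., 1]
                idx = (p0[..., 0] + dy) * S + (p0[..., 1] + dx)
                kernel.scatter_add_(1, idx, self.weight * wy * wx)
        return kernel.view(self.channels, 1, S, S)

    def forward(self, x):
        k = self.make_kernel()
        return F.conv2d(x, k, padding=self.dilated_size // 2,
                        groups=self.channels)

x = torch.randn(2, 32, 64, 64)
layer = DclsConv2dSketch(32)
print(layer(x).shape)  # torch.Size([2, 32, 64, 64])
```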

An AI Chatbot for Explaining Deep Reinforcement Learning Decisions of Service-oriented Systems

  • paper_url: http://arxiv.org/abs/2309.14391
  • repo_url: https://gitlab.com/xrl2/chat4xai
  • paper_authors: Andreas Metzger, Jone Bartel, Jan Laufer
  • for: This work aims to help service developers, service providers, and service users understand the decision-making of Deep Reinforcement Learning (Deep RL) so that Deep RL can be applied in service-oriented systems with confidence.
  • methods: Modern AI chatbot technology and dedicated prompt engineering produce natural-language explanations; unlike classical software-based dialogue systems, an AI chatbot removes the need to elicit and define potential questions and answers up front (a sketch of the prompting pattern follows the abstract below).
  • results: Chat4XAI is prototypically realized with OpenAI's ChatGPT API and the fidelity and stability of its explanations are evaluated on an adaptive service exemplar; natural-language explanations improve understandability for non-technical users and can raise user acceptance and trust.
    Abstract Deep Reinforcement Learning (Deep RL) is increasingly used to cope with the open-world assumption in service-oriented systems. Deep RL was successfully applied to problems such as dynamic service composition, job scheduling, and offloading, as well as service adaptation. While Deep RL offers many benefits, understanding the decision-making of Deep RL is challenging because its learned decision-making policy essentially appears as a black box. Yet, understanding the decision-making of Deep RL is key to help service developers perform debugging, support service providers to comply with relevant legal frameworks, and facilitate service users to build trust. We introduce Chat4XAI to facilitate the understanding of the decision-making of Deep RL by providing natural-language explanations. Compared with visual explanations, the reported benefits of natural-language explanations include better understandability for non-technical users, increased user acceptance and trust, as well as more efficient explanations. Chat4XAI leverages modern AI chatbot technology and dedicated prompt engineering. Compared to earlier work on natural-language explanations using classical software-based dialogue systems, using an AI chatbot eliminates the need for eliciting and defining potential questions and answers up-front. We prototypically realize Chat4XAI using OpenAI's ChatGPT API and evaluate the fidelity and stability of its explanations using an adaptive service exemplar.
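
The general prompting pattern can be sketched as follows, assuming the official openai Python package (v1+) and an OPENAI_API_KEY in the environment. The prompt wording, state fields, and model choice are illustrative assumptions, not the Chat4XAI source (which lives at the GitLab link above).

```python
from openai import OpenAI  # assumes the official openai package, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def explain_rl_decision(state: dict, action: str, q_values: dict) -> str:
    """Ask a chat model for a natural-language explanation of a Deep RL
    decision; the prompt structure is a hypothetical illustration of the
    kind of dedicated prompt engineering the paper describes."""
    prompt = (
        "You explain decisions of a service-adaptation RL agent to "
        "non-technical users.\n"
        f"Observed service state: {state}\n"
        f"Chosen adaptation action: {action}\n"
        f"Q-values of the alternatives: {q_values}\n"
        "In three sentences, explain why this action was preferred."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # low temperature for more stable explanations
    )
    return resp.choices[0].message.content

print(explain_rl_decision(
    state={"workload": "high", "latency_ms": 420},
    action="add_service_replica",
    q_values={"add_service_replica": 0.91, "do_nothing": 0.35},
))
```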

May I Ask a Follow-up Question? Understanding the Benefits of Conversations in Neural Network Explainability

  • paper_url: http://arxiv.org/abs/2309.13965
  • repo_url: None
  • paper_authors: Tong Zhang, X. Jessie Yang, Boyang Li
  • for: improving users' comprehension of and trust in the decision-making of AI models
  • methods: free-form conversations with a human expert following static explanations
  • results: conversations significantly improve comprehension, acceptance, and trust, and facilitate human-AI collaboration
    Abstract Research in explainable AI (XAI) aims to provide insights into the decision-making process of opaque AI models. To date, most XAI methods offer one-off and static explanations, which cannot cater to the diverse backgrounds and understanding levels of users. With this paper, we investigate if free-form conversations can enhance users' comprehension of static explanations, improve acceptance and trust in the explanation methods, and facilitate human-AI collaboration. Participants are presented with static explanations, followed by a conversation with a human expert regarding the explanations. We measure the effect of the conversation on participants' ability to choose, from three machine learning models, the most accurate one based on explanations and their self-reported comprehension, acceptance, and trust. Empirical results show that conversations significantly improve comprehension, acceptance, trust, and collaboration. Our findings highlight the importance of customized model explanations in the format of free-form conversations and provide insights for the future design of conversational explanations.

Early Churn Prediction from Large Scale User-Product Interaction Time Series

  • paper_url: http://arxiv.org/abs/2309.14390
  • repo_url: None
  • paper_authors: Shamik Bhattacharjee, Utkarsh Thukral, Nilesh Patil
  • for: predicting user churn in business-to-customer scenarios, with a focus on fantasy sports, to give businesses insight into attrition trends and help them formulate effective retention plans
  • methods: historical data, treating churn prediction as multivariate time series classification that combines user activity with deep neural networks (a minimal sketch of this setup follows the abstract below)
  • results: high accuracy in predicting customer churn likelihood, demonstrating remarkable results for churn prediction in complex business-to-customer contexts
    Abstract User churn, characterized by customers ending their relationship with a business, has profound economic consequences across various Business-to-Customer scenarios. For numerous system-to-user actions, such as promotional discounts and retention campaigns, predicting potential churners stands as a primary objective. In volatile sectors like fantasy sports, unpredictable factors such as international sports events can influence even regular spending habits. Consequently, while transaction history and user-product interaction are valuable in predicting churn, they demand deep domain knowledge and intricate feature engineering. Additionally, feature development for churn prediction systems can be resource-intensive, particularly in production settings serving 200m+ users, where inference pipelines largely focus on feature engineering. This paper conducts an exhaustive study on predicting user churn using historical data. We aim to create a model forecasting customer churn likelihood, facilitating businesses in comprehending attrition trends and formulating effective retention plans. Our approach treats churn prediction as multivariate time series classification, demonstrating that combining user activity and deep neural networks yields remarkable results for churn prediction in complex business-to-customer contexts.
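
As a concrete illustration of framing churn prediction as multivariate time series classification, here is a minimal PyTorch sketch. The architecture, feature choices, and dimensions are hypothetical; the paper does not publish its model code.

```python
import torch
import torch.nn as nn

class ChurnClassifier(nn.Module):
    """Minimal sketch: classify a user's activity time series as
    churn / no-churn. Illustrative architecture, not the paper's."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):  # x: (batch, days, n_features)
        _, h = self.encoder(x)
        return self.head(h[-1]).squeeze(-1)  # churn logit per user

# 30 days of hypothetical per-user activity: sessions, deposits,
# contests entered, etc.
x = torch.randn(8, 30, 5)
y = torch.randint(0, 2, (8,)).float()
model = ChurnClassifier(n_features=5)
loss = nn.BCEWithLogitsLoss()(model(x), y)
loss.backward()
```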

VidChapters-7M: Video Chapters at Scale

  • paper_url: http://arxiv.org/abs/2309.13952
  • repo_url: https://github.com/antoyang/VidChapters
  • paper_authors: Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid
  • for: This paper introduces a large-scale dataset of video chapters to enable research on video chaptering tasks.
  • methods: User-annotated chapters are scraped from online videos in a scalable manner, automatically yielding 817K videos with 7M chapters in total and requiring no additional manual annotation.
  • results: Three tasks are defined on the data: video chapter generation, video chapter grounding, and dense video captioning (a toy illustration of these tasks follows the abstract below). Pretraining on VidChapters-7M transfers well to dense video captioning, largely improving the state of the art on the YouCook2 and ViTT benchmarks.
    Abstract Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters and hence without any additional manual annotation. We introduce the following three tasks based on this data. First, the video chapter generation task consists of temporally segmenting the video and generating a chapter title for each segment. To further dissect the problem, we also define two variants of this task: video chapter generation given ground-truth boundaries, which requires generating a chapter title given an annotated video segment, and video chapter grounding, which requires temporally localizing a chapter given its annotated title. We benchmark both simple baselines and state-of-the-art video-language models for these three tasks. We also show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings, largely improving the state of the art on the YouCook2 and ViTT benchmarks. Finally, our experiments reveal that downstream performance scales well with the size of the pretraining dataset. Our dataset, code, and models are publicly available at https://antoyang.github.io/vidchapters.html.
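
To make the three tasks concrete, here is a toy sketch of a user-chaptered video record and the input/output contract of each task. The field names are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical record layout for one user-chaptered video.
video = {
    "video_id": "abc123",
    "duration_s": 1270.0,
    "chapters": [
        {"start_s": 0.0,  "end_s": 95.0,  "title": "Intro"},
        {"start_s": 95.0, "end_s": 410.0, "title": "Preparing the dough"},
    ],
}

# Task 1, chapter generation:            video -> [(start, end, title), ...]
# Task 2, generation w/ given boundaries: (video, start, end) -> title
# Task 3, chapter grounding:             (video, title) -> (start, end)
def grounding_target(video: dict, title: str) -> tuple:
    """Toy lookup standing in for a learned grounding model."""
    for ch in video["chapters"]:
        if ch["title"] == title:
            return ch["start_s"], ch["end_s"]
    raise KeyError(title)

print(grounding_target(video, "Intro"))  # (0.0, 95.0)
```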

The Time Traveler’s Guide to Semantic Web Research: Analyzing Fictitious Research Themes in the ESWC “Next 20 Years” Track

  • paper_url: http://arxiv.org/abs/2309.13939
  • repo_url: None
  • paper_authors: Irene Celino, Heiko Paulheim
  • for: The paper is written to explore the future research directions and themes of the Semantic Web community in the late 2040s and early 2050s.
  • methods: The paper uses fictitious research papers as a way to gather ideas from the community on potential future research themes and topics, and analyzes the research methods applied by the authors in these submissions.
  • results: The paper provides a survey of the “science fiction” papers submitted to the “Next 20 years” track of ESWC 2023, including the emerging research themes and topics, and investigates the most fictitious parts of the submissions.
    Abstract What will Semantic Web research focus on in 20 years from now? We asked this question to the community and collected their visions in the "Next 20 years" track of ESWC 2023. We challenged the participants to submit "future" research papers, as if they were submitting to the 2043 edition of the conference. The submissions - entirely fictitious - were expected to be full scientific papers, with research questions, state of the art references, experimental results and future work, with the goal to get an idea of the research agenda for the late 2040s and early 2050s. We received ten submissions, eight of which were accepted for presentation at the conference, that mixed serious ideas of potential future research themes and discussion topics with some fun and irony. In this paper, we intend to provide a survey of those "science fiction" papers, considering the emerging research themes and topics, analysing the research methods applied by the authors in these very special submissions, and investigating also the most fictitious parts (e.g., neologisms, fabricated references). Our goal is twofold: on the one hand, we investigate what this special track tells us about the Semantic Web community and, on the other hand, we aim at getting some insights on future research practices and directions.

SPOTS: Stable Placement of Objects with Reasoning in Semi-Autonomous Teleoperation Systems

  • paper_url: http://arxiv.org/abs/2309.13937
  • repo_url: https://github.com/joonhyung-lee/spots
  • paper_authors: Joonhyung Lee, Sangbeom Park, Jeongeun Park, Kyungjae Lee, Sungjoon Choi
  • for: This work targets the "place" half of pick-and-place, i.e., placing objects appropriately within a semi-autonomous teleoperation framework.
  • methods: Simulation-driven physical stability verification (real-to-sim) is combined with the semantic reasoning capability of large language models: given place-context information (user preferences, the object to place, current scene information), the method outputs a probability distribution over placement candidates that reflects both contextual reasonableness and physical stability.
  • results: Extensive evaluation in two simulated environments and one real-world environment shows that the method greatly increases the physical plausibility and contextual soundness of placements while accounting for user preferences.
    Abstract Pick-and-place is one of the fundamental tasks in robotics research. However, the attention has been mostly focused on the ``pick'' task, leaving the ``place'' task relatively unexplored. In this paper, we address the problem of placing objects in the context of a teleoperation framework. Particularly, we focus on two aspects of the place task: stability robustness and contextual reasonableness of object placements. Our proposed method combines simulation-driven physical stability verification via real-to-sim and the semantic reasoning capability of large language models. In other words, given place context information (e.g., user preferences, object to place, and current scene information), our proposed method outputs a probability distribution over the possible placement candidates, considering the robustness and reasonableness of the place task. Our proposed method is extensively evaluated in two simulation and one real world environments and we show that our method can greatly increase the physical plausibility of the placement as well as contextual soundness while considering user preferences.

Fairness and Bias in Algorithmic Hiring

  • paper_url: http://arxiv.org/abs/2309.13933
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Alessandro Fabris, Nina Baranowska, Matthew J. Dennis, Philipp Hacker, Jorge Saldivar, Frederik Zuiderveen Borgesius, Asia J. Biega
  • for: This paper examines the adoption of algorithmic hiring technology throughout the recruitment pipeline and the fairness questions it raises.
  • methods: A multidisciplinary survey with balanced, integrated coverage of the systems, biases, measures, mitigation strategies, datasets, and legal aspects of algorithmic hiring and fairness.
  • results: Algorithmic hiring could reduce bias and unfairness in recruitment, but current data and methods are limited; further research and development are needed to ensure the fairness and trustworthiness of these technologies, and the survey offers recommendations for future work so that benefits are shared by all stakeholders.
    Abstract Employers are adopting algorithmic hiring technology throughout the recruitment pipeline. Algorithmic fairness is especially applicable in this domain due to its high stakes and structural inequalities. Unfortunately, most work in this space provides partial treatment, often constrained by two competing narratives, optimistically focused on replacing biased recruiter decisions or pessimistically pointing to the automation of discrimination. Whether, and more importantly what types of, algorithmic hiring can be less biased and more beneficial to society than low-tech alternatives currently remains unanswered, to the detriment of trustworthiness. This multidisciplinary survey caters to practitioners and researchers with a balanced and integrated coverage of systems, biases, measures, mitigation strategies, datasets, and legal aspects of algorithmic hiring and fairness. Our work supports a contextualized understanding and governance of this technology by highlighting current opportunities and limitations, providing recommendations for future work to ensure shared benefits for all stakeholders.

UCF-Crime Annotation: A Benchmark for Surveillance Video-and-Language Understanding

  • paper_url: http://arxiv.org/abs/2309.13925
  • repo_url: https://github.com/xuange923/uca-dataset
  • paper_authors: Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Jian Jin, Zhenzhen Jiao
  • for: This work provides a new multimodal surveillance video dataset to support multimodal surveillance video analysis.
  • methods: The real-world surveillance dataset UCF-Crime is manually annotated with fine-grained event descriptions and precise timings (0.1-second intervals); the resulting dataset, UCA (UCF-Crime Annotation), provides a novel benchmark for multimodal surveillance video analysis.
  • results: Benchmarking state-of-the-art models for multiple multimodal tasks on the new dataset shows that mainstream models perform poorly in multimodal surveillance video scenarios, which highlights the necessity of constructing this dataset.
    Abstract Surveillance videos are an essential component of daily life with various critical applications, particularly in public security. However, current surveillance video tasks mainly focus on classifying and localizing anomalous events. Existing methods are limited to detecting and classifying the predefined events with unsatisfactory generalization ability and semantic understanding, although they have obtained considerable performance. To address this issue, we propose constructing the first multimodal surveillance video dataset by manually annotating the real-world surveillance dataset UCF-Crime with fine-grained event content and timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), provides a novel benchmark for multimodal surveillance video analysis. It not only describes events in detailed descriptions but also provides precise temporal grounding of the events in 0.1-second intervals. UCA contains 20,822 sentences, with an average length of 23 words, and its annotated videos are as long as 102 hours. Furthermore, we benchmark the state-of-the-art models of multiple multimodal tasks on this newly created dataset, including temporal sentence grounding in videos, video captioning, and dense video captioning. Through our experiments, we found that mainstream models used in previously publicly available datasets perform poorly on multimodal surveillance video scenarios, which highlights the necessity of constructing this dataset. The link to our dataset and code is provided at: https://github.com/Xuange923/UCA-dataset.

A comparison of controller architectures and learning mechanisms for arbitrary robot morphologies

  • paper_url: http://arxiv.org/abs/2309.13908
  • repo_url: None
  • paper_authors: Jie Luo, Jakub Tomczak, Karine Miras, Agoston E. Eiben
  • for: The central question is: which combination of controller and learning method should be used when the morphology of the learning robot is not known in advance? The interest is rooted in morphologically evolving modular robots, but the question is also relevant to system designers seeking widely applicable solutions.
  • methods: Three controller-and-learner combinations are compared: controllers based on a model of animal locomotion (Central Pattern Generators, CPG) paired with an evolutionary-algorithm learner; Reinforcement Learning (RL) with a neural-network controller architecture; and an in-between combination where controllers are neural networks and the learner is an evolutionary algorithm (a minimal CPG sketch follows the abstract below).
  • results: Across a test suite of modular robots, the usual CPG-based and RL-based options are outperformed by the in-between combination, which is more robust and efficient than the other two setups.
    Abstract The main question this paper addresses is: What combination of a robot controller and a learning method should be used, if the morphology of the learning robot is not known in advance? Our interest is rooted in the context of morphologically evolving modular robots, but the question is also relevant in general, for system designers interested in widely applicable solutions. We perform an experimental comparison of three controller-and-learner combinations: one approach where controllers are based on modelling animal locomotion (Central Pattern Generators, CPG) and the learner is an evolutionary algorithm, a completely different method using Reinforcement Learning (RL) with a neural network controller architecture, and a combination `in-between' where controllers are neural networks and the learner is an evolutionary algorithm. We apply these three combinations to a test suite of modular robots and compare their efficacy, efficiency, and robustness. Surprisingly, the usual CPG-based and RL-based options are outperformed by the in-between combination that is more robust and efficient than the other two setups.
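
For readers unfamiliar with CPG-based control, a minimal sketch: each joint of a modular robot is driven by a sinusoidal oscillator whose parameters form the genome an evolutionary algorithm optimizes. This is illustrative only; the paper's CPG model is more elaborate.

```python
import numpy as np

def cpg_signal(t, amplitude, frequency, phase):
    """One oscillator per joint; (amplitude, frequency, phase) are the
    per-joint genes an evolutionary algorithm would tune."""
    return amplitude * np.sin(2 * np.pi * frequency * t + phase)

t = np.linspace(0.0, 2.0, 200)
# Hypothetical two-joint genome, oscillating 90 degrees out of phase.
genome = [(0.8, 1.0, 0.0), (0.8, 1.0, np.pi / 2)]
joint_targets = np.stack([cpg_signal(t, *g) for g in genome])
print(joint_targets.shape)  # (2, 200) joint-angle targets over time
```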

Analyzing the Efficacy of an LLM-Only Approach for Image-based Document Question Answering

  • paper_url: http://arxiv.org/abs/2309.14389
  • repo_url: None
  • paper_authors: Nidhi Hegde, Sujoy Paul, Gagan Madan, Gaurav Aggarwal
  • for: This paper studies the two key components of document question-answering models, the vision encoder and the Large Language Model (LLM), and their relative contributions.
  • methods: An "LLM-only" approach: textual information in document images is serialized and fed directly to an instruction-tuned LLM, bypassing the need for an explicit vision encoder (a sketch of one serialization strategy follows the abstract below).
  • results: Across six diverse benchmark datasets and LLMs of varying scales, the LLM-only strategy yields results on par with or closely approaching state-of-the-art performance.
    Abstract Recent document question answering models consist of two key components: the vision encoder, which captures layout and visual elements in images, and a Large Language Model (LLM) that helps contextualize questions to the image and supplements them with external world knowledge to generate accurate answers. However, the relative contributions of the vision encoder and the language model in these tasks remain unclear. This is especially interesting given the effectiveness of instruction-tuned LLMs, which exhibit remarkable adaptability to new tasks. To this end, we explore the following aspects in this work: (1) The efficacy of an LLM-only approach on document question answering tasks (2) strategies for serializing textual information within document images and feeding it directly to an instruction-tuned LLM, thus bypassing the need for an explicit vision encoder (3) thorough quantitative analysis on the feasibility of such an approach. Our comprehensive analysis encompasses six diverse benchmark datasets, utilizing LLMs of varying scales. Our findings reveal that a strategy exclusively reliant on the LLM yields results that are on par with or closely approach state-of-the-art performance across a range of datasets. We posit that this evaluation framework will serve as a guiding resource for selecting appropriate datasets for future research endeavors that emphasize the fundamental importance of layout and image content information.
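
One plausible way to serialize document-image text for an instruction-tuned LLM is to order OCR tokens top-to-bottom, left-to-right and reinsert line breaks from their bounding boxes, so rough layout survives as plain text. The sketch below is an assumption about one such strategy, not the paper's exact method; the token format is hypothetical.

```python
def serialize_ocr_tokens(tokens):
    """Turn OCR tokens into layout-preserving plain text.

    tokens: list of (text, x0, y0, x1, y1) tuples from an OCR engine
    (hypothetical format). Tokens are bucketed into coarse rows by their
    top coordinate, then read left to right within each row.
    """
    tokens = sorted(tokens, key=lambda t: (round(t[2] / 10), t[1]))
    lines, last_row = [], None
    for text, x0, y0, x1, y1 in tokens:
        row = round(y0 / 10)  # coarse row bucket
        if row != last_row:
            lines.append([])
            last_row = row
        lines[-1].append(text)
    return "\n".join(" ".join(line) for line in lines)

doc = [("Invoice", 10, 5, 80, 15),
       ("Total:", 10, 50, 50, 60), ("$42.00", 60, 50, 110, 60)]
prompt = serialize_ocr_tokens(doc) + "\n\nQuestion: What is the total amount?"
print(prompt)
```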

Exploring Robot Morphology Spaces through Breadth-First Search and Random Query

  • paper_url: http://arxiv.org/abs/2309.14387
  • repo_url: None
  • paper_authors: Jie Luo
  • for: This study investigates the role of query mechanisms in the brain-body co-evolution of modular robots, comparing the effectiveness of two query mechanisms, Breadth-First Search (BFS) and Random Query, in evolving robot morphologies.
  • methods: Robot morphologies encoded with CPPNs and controllers represented as tensors are evolved under the two query mechanisms, in two evolutionary frameworks (Lamarckian and Darwinian systems), and their influence on evolutionary outcomes and performance is analyzed.
  • results: BFS is both more effective and more efficient than Random Query at producing high-performing robots. Initially, robot diversity is higher with BFS; in the Lamarckian system it declines faster, converging to superior designs, while in the Darwinian system BFS leads to higher end-of-process diversity.
    Abstract Evolutionary robotics offers a powerful framework for designing and evolving robot morphologies, particularly in the context of modular robots. However, the role of query mechanisms during the genotype-to-phenotype mapping process has been largely overlooked. This research addresses this gap by conducting a comparative analysis of query mechanisms in the brain-body co-evolution of modular robots. Using two different query mechanisms, Breadth-First Search (BFS) and Random Query, within the context of evolving robot morphologies using CPPNs and robot controllers using tensors, and testing them in two evolutionary frameworks, Lamarckian and Darwinian systems, this study investigates their influence on evolutionary outcomes and performance. The findings demonstrate the impact of the two query mechanisms on the evolution and performance of modular robot bodies, including morphological intelligence, diversity, and morphological traits. This study suggests that BFS is both more effective and efficient in producing highly performing robots. It also reveals that initially, robot diversity was higher with BFS compared to Random Query, but in the Lamarckian system, it declines faster, converging to superior designs, while in the Darwinian system, BFS led to higher end-process diversity.

Scene Informer: Anchor-based Occlusion Inference and Trajectory Prediction in Partially Observable Environments

  • paper_url: http://arxiv.org/abs/2309.13893
  • repo_url: https://github.com/sisl/sceneinformer
  • paper_authors: Bernard Lange, Jiachen Li, Mykel J. Kochenderfer
  • for: This work aims to improve autonomous vehicles' ability to navigate partially observable environments, including predicting the future motion of observed agents and inferring occluded ones.
  • methods: Scene Informer, a unified approach that both predicts observed-agent trajectories and infers occlusions, uses a transformer to aggregate various input modalities and supports selective queries on occlusions that might intersect with the AV's planned path.
  • results: The approach outperforms existing methods in both occupancy prediction and trajectory prediction in the partially observable setting of the Waymo Open Motion Dataset.
    Abstract Navigating complex and dynamic environments requires autonomous vehicles (AVs) to reason about both visible and occluded regions. This involves predicting the future motion of observed agents, inferring occluded ones, and modeling their interactions based on vectorized scene representations of the partially observable environment. However, prior work on occlusion inference and trajectory prediction have developed in isolation, with the former based on simplified rasterized methods and the latter assuming full environment observability. We introduce the Scene Informer, a unified approach for predicting both observed agent trajectories and inferring occlusions in a partially observable setting. It uses a transformer to aggregate various input modalities and facilitate selective queries on occlusions that might intersect with the AV's planned path. The framework estimates occupancy probabilities and likely trajectories for occlusions, as well as forecast motion for observed agents. We explore common observability assumptions in both domains and their performance impact. Our approach outperforms existing methods in both occupancy prediction and trajectory prediction in partially observable setting on the Waymo Open Motion Dataset.

TouchUp-G: Improving Feature Representation through Graph-Centric Finetuning

  • paper_url: http://arxiv.org/abs/2309.13885
  • repo_url: None
  • paper_authors: Jing Zhu, Xiang Song, Vassilis N. Ioannidis, Danai Koutra, Christos Faloutsos
  • for: improving the quality of node features obtained from pretrained models for downstream graph learning tasks, and thereby the performance of graph neural networks (GNNs)
  • methods: TOUCHUP-G, a general, multimodal, and principled method for improving node features for any downstream graph task; it is closely tied to a proposed metric, feature homophily, which quantifies potential correlations between graph structure and node features (one plausible instantiation is sketched after the abstract below)
  • results: TOUCHUP-G achieves state-of-the-art results on four real-world datasets spanning different tasks and modalities
    Abstract How can we enhance the node features acquired from Pretrained Models (PMs) to better suit downstream graph learning tasks? Graph Neural Networks (GNNs) have become the state-of-the-art approach for many high-impact, real-world graph applications. For feature-rich graphs, a prevalent practice involves utilizing a PM directly to generate features, without incorporating any domain adaptation techniques. Nevertheless, this practice is suboptimal because the node features extracted from PM are graph-agnostic and prevent GNNs from fully utilizing the potential correlations between the graph structure and node features, leading to a decline in GNNs performance. In this work, we seek to improve the node features obtained from a PM for downstream graph tasks and introduce TOUCHUP-G, which has several advantages. It is (a) General: applicable to any downstream graph task, including link prediction which is often employed in recommender systems; (b) Multi-modal: able to improve raw features of any modality (e.g. images, texts, audio); (c) Principled: it is closely related to a novel metric, feature homophily, which we propose to quantify the potential correlations between the graph structure and node features and we show that TOUCHUP-G can effectively shrink the discrepancy between the graph structure and node features; (d) Effective: achieving state-of-the-art results on four real-world datasets spanning different tasks and modalities.
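
The abstract does not give the formula for feature homophily, so the following is only one plausible instantiation, the mean cosine similarity of node features across edges, offered as a hedged sketch: high values mean pretrained-model features already agree with the graph structure, low values suggest they need graph-centric finetuning.

```python
import torch
import torch.nn.functional as F

def feature_homophily(x: torch.Tensor, edge_index: torch.Tensor) -> float:
    """Average cosine similarity of node features across edges.

    x:          (num_nodes, dim) node features from a pretrained model
    edge_index: (2, num_edges) source/target node indices
    (One plausible definition; the paper's exact metric may differ.)
    """
    src, dst = edge_index
    return F.cosine_similarity(x[src], x[dst], dim=-1).mean().item()

x = F.normalize(torch.randn(100, 32), dim=-1)   # toy node features
edge_index = torch.randint(0, 100, (2, 500))    # toy random edges
print(feature_homophily(x, edge_index))
```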

PRiSM: Enhancing Low-Resource Document-Level Relation Extraction with Relation-Aware Score Calibration

  • paper_url: http://arxiv.org/abs/2309.13869
  • repo_url: https://github.com/brightjade/prism
  • paper_authors: Minseok Choi, Hyesu Lim, Jaegul Choo
  • for: document-level relation extraction (DocRE), i.e., extracting the relations of all entity pairs in a document, here studied in a low-resource setting where models trained on little data overestimate the NA ("no relation") label
  • methods: PRiSM, which approaches the problem from a calibration perspective and learns to adapt logits based on relation semantic information (a hedged sketch of this idea follows the abstract below)
  • results: on three DocRE datasets, integrating existing models with PRiSM improves performance by as much as 26.38 F1, while calibration error drops by as much as 36 times when training with about 3% of the data
    Abstract Document-level relation extraction (DocRE) aims to extract relations of all entity pairs in a document. A key challenge in DocRE is the cost of annotating such data which requires intensive human effort. Thus, we investigate the case of DocRE in a low-resource setting, and we find that existing models trained on low data overestimate the NA ("no relation") label, causing limited performance. In this work, we approach the problem from a calibration perspective and propose PRiSM, which learns to adapt logits based on relation semantic information. We evaluate our method on three DocRE datasets and demonstrate that integrating existing models with PRiSM improves performance by as much as 26.38 F1 score, while the calibration error drops as much as 36 times when trained with about 3% of data. The code is publicly available at https://github.com/brightjade/PRiSM.
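
As a hedged sketch of "adapting logits based on relation semantic information", one could learn a per-relation scale and shift derived from relation embeddings, as below. PRiSM's actual formulation may differ; the class name, dimensions, and relation count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareCalibrator(nn.Module):
    """Rescale and shift each relation's DocRE logit using parameters
    derived from a relation embedding; a sketch of relation-aware
    calibration, not PRiSM's published method."""
    def __init__(self, n_relations: int, rel_dim: int = 64):
        super().__init__()
        self.rel_emb = nn.Embedding(n_relations, rel_dim)
        self.to_scale = nn.Linear(rel_dim, 1)
        self.to_shift = nn.Linear(rel_dim, 1)

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # logits: (num_entity_pairs, n_relations), including the NA class.
        rel = self.rel_emb.weight                           # (R, d)
        scale = F.softplus(self.to_scale(rel)).squeeze(-1)  # positive, (R,)
        shift = self.to_shift(rel).squeeze(-1)              # (R,)
        return logits * scale + shift

cal = RelationAwareCalibrator(n_relations=97)  # e.g. DocRED: 96 + NA
print(cal(torch.randn(5, 97)).shape)           # torch.Size([5, 97])
```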

Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning

  • paper_url: http://arxiv.org/abs/2309.13860
  • repo_url: https://github.com/yanghaha0908/fasthubert
  • paper_authors: Guanrou Yang, Ziyang Ma, Zhisheng Zheng, Yakun Song, Zhikang Niu, Xie Chen
  • for: This paper targets the efficiency of self-supervised learning (SSL) methods for speech-processing tasks, testing whether pre-training cost can be cut without hurting downstream performance.
  • methods: The computational cost of the different modules of HuBERT pre-training is analyzed, and a stack of efficiency optimizations is applied, yielding Fast-HuBERT.
  • results: Compared with the original implementation, Fast-HuBERT trains in 1.1 days on 8 V100 GPUs on the Librispeech 960h benchmark without performance degradation, a 5.2x speedup; two well-studied techniques are additionally explored and bring consistent improvements, as reported in previous work.
    Abstract Recent years have witnessed significant advancements in self-supervised learning (SSL) methods for speech-processing tasks. Various speech-based SSL models have been developed and present promising performance on a range of downstream tasks including speech recognition. However, existing speech-based SSL models face a common dilemma in terms of computational cost, which might hinder their potential application and in-depth academic research. To address this issue, we first analyze the computational cost of different modules during HuBERT pre-training and then introduce a stack of efficiency optimizations, which is named Fast-HuBERT in this paper. The proposed Fast-HuBERT can be trained in 1.1 days with 8 V100 GPUs on the Librispeech 960h benchmark, without performance degradation, resulting in a 5.2x speedup, compared to the original implementation. Moreover, we explore two well-studied techniques in the Fast-HuBERT and demonstrate consistent improvements as reported in previous work.

Can neural networks count digit frequency?

  • paper_url: http://arxiv.org/abs/2310.04431
  • repo_url: https://github.com/PadmakshKhandelwal/Can-neural-networks-count
  • paper_authors: Padmaksh Khandelwal
  • for: comparing the performance of classical machine learning models and neural networks at identifying how often each digit occurs in a given number, a task with applications such as obtaining the frequency of a target object in a visual scene
  • methods: the problem is treated as a hybrid of classification and regression; purpose-built datasets are created to observe systematic differences between methods, which are evaluated with several metrics across multiple datasets (a minimal sketch of the task setup follows the abstract below)
  • results: decision trees and random forests overfit the datasets due to their inherent bias and fail to generalize well, while neural networks significantly outperform the classical models on both the regression and classification metrics for the 6-digit and 10-digit datasets; dataset and code are available on GitHub
    Abstract In this research, we aim to compare the performance of different classical machine learning models and neural networks in identifying the frequency of occurrence of each digit in a given number. It has various applications in machine learning and computer vision, e.g. for obtaining the frequency of a target object in a visual scene. We considered this problem as a hybrid of classification and regression tasks. We carefully create our own datasets to observe systematic differences between different methods. We evaluate each of the methods using different metrics across multiple datasets. The metrics of performance used were the root mean squared error and mean absolute error for regression evaluation, and accuracy for classification performance evaluation. We observe that decision trees and random forests overfit to the dataset, due to their inherent bias, and are not able to generalize well. We also observe that the neural networks significantly outperform the classical machine learning models in terms of both the regression and classification metrics for both the 6-digit and 10-digit number datasets. Dataset and code are available on github.
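
A minimal sketch of the task setup: generate numbers as digit sequences, use per-digit occurrence counts as the 10-dimensional target, and fit a small network. The architecture and training details here are illustrative assumptions; the paper's datasets and models are in its GitHub repository.

```python
import numpy as np
import torch
import torch.nn as nn

def make_dataset(n_samples: int, n_digits: int):
    """Inputs are digit sequences; targets count each digit 0-9."""
    digits = np.random.randint(0, 10, size=(n_samples, n_digits))
    counts = np.stack([(digits == d).sum(axis=1) for d in range(10)], axis=1)
    return (torch.tensor(digits, dtype=torch.float32),
            torch.tensor(counts, dtype=torch.float32))

X, y = make_dataset(10_000, n_digits=6)
model = nn.Sequential(nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):  # short, full-batch training loop for illustration
    opt.zero_grad()
    loss = nn.MSELoss()(model(X), y)
    loss.backward()
    opt.step()
print(loss.item())
```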

Sampling - Variational Auto Encoder - Ensemble: In the Quest of Explainable Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2309.14385
  • repo_url: None
  • paper_authors: Sarit Maitra, Vivek Mishra, Pratima Verma, Manav Chopra, Priyanka Nath
  • for: proposing a new explainable artificial intelligence (XAI) framework for interpreting the outputs of AI models
  • methods: a hybrid architecture, Sampling-Variational Auto Encoder-Ensemble Anomaly Detection (SVEAD), combining a Variational Auto Encoder (VAE) with ensemble stacking and SHapley Additive exPlanations (SHAP) for imbalanced classification (a sketch of the SHAP half follows the abstract below)
  • results: combining the VAE, ensemble stacking, and SHAP not only improves model performance but also yields an easily explainable framework; SHAP combined with Permutation Importance and Individual Conditional Expectations creates a powerful model interpretation, an important result for real-world settings where XAI is needed to boost confidence in AI applications
    Abstract Explainable Artificial Intelligence (XAI) models have recently attracted a great deal of interest from a variety of application sectors. Despite significant developments in this area, there are still no standardized methods or approaches for understanding AI model outputs. A systematic and cohesive framework is also increasingly necessary to incorporate new techniques like discriminative and generative models to close the gap. This paper contributes to the discourse on XAI by presenting an empirical evaluation based on a novel framework: Sampling - Variational Auto Encoder (VAE) - Ensemble Anomaly Detection (SVEAD). It is a hybrid architecture where VAE combined with ensemble stacking and SHapley Additive exPlanations are used for imbalanced classification. The finding reveals that combining ensemble stacking, VAE, and SHAP can. not only lead to better model performance but also provide an easily explainable framework. This work has used SHAP combined with Permutation Importance and Individual Conditional Expectations to create a powerful interpretability of the model. The finding has an important implication in the real world, where the need for XAI is paramount to boost confidence in AI applications.
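
The SHAP half of such a pipeline can be sketched with scikit-learn and the shap package as below. The synthetic data stands in for a real imbalanced problem, and the VAE sampling and ensemble-stacking stages are not reproduced.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import shap  # assumes the shap package is installed

# Synthetic stand-in for an imbalanced classification problem
# (positives are rare by construction).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.5 * X[:, 3]
     + rng.normal(scale=0.5, size=1000) > 2).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X[:200])
# Older shap versions return one array per class for classifiers; newer
# versions return a single 3-D array with classes on the last axis.
sv_pos = sv[1] if isinstance(sv, list) else sv[..., 1]
print(np.abs(sv_pos).mean(axis=0))  # mean |attribution| per feature
```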

Prior Bilinear Based Models for Knowledge Graph Completion

  • paper_url: http://arxiv.org/abs/2309.13834
  • repo_url: None
  • paper_authors: Jiayi Li, Ruilin Luo, Jiaqi Sun, Jing Xiao, Yujiu Yang
  • for: knowledge graph completion (KGC), examining the prior properties that bilinear models neglect
  • methods: a prior property named "the law of identity" is identified that bilinear models cannot capture; the proposed Unit Ball Bilinear Model (UniBi) addresses it, offering theoretical superiority together with better interpretability and performance by minimizing ineffective learning through minimal constraints (a hedged sketch of the constraint follows the abstract below)
  • results: experiments show that UniBi models the prior property and verify its interpretability and performance
    Abstract Bilinear based models are powerful and widely used approaches for Knowledge Graphs Completion (KGC). Although bilinear based models have achieved significant advances, these studies mainly concentrate on posterior properties (based on evidence, e.g. symmetry pattern) while neglecting the prior properties. In this paper, we find a prior property named "the law of identity" that cannot be captured by bilinear based models, which hinders them from comprehensively modeling the characteristics of KGs. To address this issue, we introduce a solution called Unit Ball Bilinear Model (UniBi). This model not only achieves theoretical superiority but also offers enhanced interpretability and performance by minimizing ineffective learning through minimal constraints. Experiments demonstrate that UniBi models the prior property and verify its interpretability and performance.
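
Purely to illustrate the flavor of a unit-ball constraint on a bilinear scorer, here is a DistMult-style sketch with embeddings projected onto the unit ball. The paper's exact construction may well differ; the class name, projection, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class UnitBallBilinear(nn.Module):
    """Bilinear KGC scorer with embeddings constrained to the unit ball:
    score(h, r, t) = <e_h, w_r, e_t> (DistMult-style). Illustrative only."""
    def __init__(self, n_ent: int, n_rel: int, dim: int = 200):
        super().__init__()
        self.ent = nn.Embedding(n_ent, dim)
        self.rel = nn.Embedding(n_rel, dim)

    def _ball(self, e: torch.Tensor) -> torch.Tensor:
        # Project onto the unit ball: rescale any vector with norm > 1.
        norm = e.norm(dim=-1, keepdim=True).clamp(min=1.0)
        return e / norm

    def forward(self, h, r, t):
        eh, et = self._ball(self.ent(h)), self._ball(self.ent(t))
        wr = self._ball(self.rel(r))
        return (eh * wr * et).sum(-1)

model = UnitBallBilinear(n_ent=1000, n_rel=50)
print(model(torch.tensor([1]), torch.tensor([3]), torch.tensor([7])))
```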

Dual Feature Augmentation Network for Generalized Zero-shot Learning

  • paper_url: http://arxiv.org/abs/2309.13833
  • repo_url: https://github.com/sion1/dfan
  • paper_authors: Lei Xiang, Yuan Zhou, Haoran Duan, Yang Long
  • for: zero-shot learning (ZSL), i.e., inferring novel classes without training samples by transferring knowledge from seen classes
  • methods: a Dual Feature Augmentation Network (DFAN) with two feature augmentation modules, one for visual and one for semantic features; the visual module explicitly learns attribute features and separates them with cosine distance to enhance attribute representation, the semantic module uses a bias learner to capture the offset between actual and predicted attribute values from the dataset's perspective, and two predictors reconcile conflicts between local and global features (a toy separation loss is sketched after the abstract below)
  • results: experiments on three benchmarks demonstrate marked advances over state-of-the-art approaches
    Abstract Zero-shot learning (ZSL) aims to infer novel classes without training samples by transferring knowledge from seen classes. Existing embedding-based approaches for ZSL typically employ attention mechanisms to locate attributes on an image. However, these methods often ignore the complex entanglement among different attributes' visual features in the embedding space. Additionally, these methods employ a direct attribute prediction scheme for classification, which does not account for the diversity of attributes in images of the same category. To address these issues, we propose a novel Dual Feature Augmentation Network (DFAN), which comprises two feature augmentation modules, one for visual features and the other for semantic features. The visual feature augmentation module explicitly learns attribute features and employs cosine distance to separate them, thus enhancing attribute representation. In the semantic feature augmentation module, we propose a bias learner to capture the offset that bridges the gap between actual and predicted attribute values from a dataset's perspective. Furthermore, we introduce two predictors to reconcile the conflicts between local and global features. Experimental results on three benchmarks demonstrate the marked advancement of our method compared to state-of-the-art approaches. Our code is available at https://github.com/Sion1/DFAN.
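
As a toy illustration of "using cosine distance to separate attribute features", one can penalize pairwise cosine similarity between attribute prototypes. This loss is an assumption made for illustration, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def attribute_separation_loss(attr_feats: torch.Tensor) -> torch.Tensor:
    """Push distinct attribute prototypes apart in cosine space.

    attr_feats: (n_attributes, dim) learned attribute feature vectors.
    The diagonal (self-similarity) is removed before averaging.
    """
    f = F.normalize(attr_feats, dim=-1)
    sim = f @ f.t()                            # pairwise cosine similarities
    off_diag = sim - torch.diag(torch.diag(sim))
    return off_diag.clamp(min=0).mean()        # penalize entangled attributes

# e.g. 85 attributes as in the AWA2 benchmark (a common ZSL dataset)
print(attribute_separation_loss(torch.randn(85, 512)))
```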

Evaluating Cognitive Maps and Planning in Large Language Models with CogEval

  • paper_url: http://arxiv.org/abs/2309.15129
  • repo_url: None
  • paper_authors: Ida Momennejad, Hosein Hasanbeig, Felipe Vieira, Hiteshi Sharma, Robert Osazuwa Ness, Nebojsa Jojic, Hamid Palangi, Jonathan Larson
  • for: systematically evaluating the cognitive capacities of large language models (LLMs)
  • methods: CogEval, a cognitive-science-inspired evaluation protocol, applied to cognitive maps and planning ability across eight LLMs, with task prompts based on human experiments that offer established construct validity and are absent from LLM training sets
  • results: LLMs show apparent competence on a few planning tasks with simpler structures, but systematic evaluation reveals striking failure modes, including hallucinations of invalid trajectories and getting trapped in loops; the findings do not support emergent out-of-the-box planning ability in LLMs
    Abstract Recently an influx of studies claim emergent cognitive abilities in large language models (LLMs). Yet, most rely on anecdotes, overlook contamination of training sets, or lack systematic Evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions. First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in Large Language Models. The CogEval protocol can be followed for the evaluation of various abilities. Second, here we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which offer both established construct validity for evaluating planning, and are absent from LLM training sets. We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes in planning tasks, including hallucinations of invalid trajectories and getting trapped in loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This could be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on the underlying structure. Implications for application and future directions are discussed.

Benchmarking Local Robustness of High-Accuracy Binary Neural Networks for Enhanced Traffic Sign Recognition

  • paper_url: http://arxiv.org/abs/2310.03033
  • repo_url: https://github.com/christopherbrix/vnncomp2023_benchmarks
  • paper_authors: Andreea Postovan, Mădălina Eraşcu
  • for: accurate traffic-sign recognition for autonomous driving systems, which must cope with real-world challenges such as adversarial examples and occlusions
  • methods: previously proposed high-accuracy binary neural network (BNN) models for traffic-sign recognition, designed for limited computation and energy resources, are turned into a set of local-robustness benchmark problems whose layers (binarized convolutions, max pooling, batch normalization, fully connected) challenge state-of-the-art verification tools; the difficulty stems from 905k-1.7M parameters, input dimensions of 2.7k-12k, 43 output regions, and non-sparse networks
  • results: in the 4th International Verification of Neural Networks Competition (VNN-COMP'23), 4 of 7 solvers handled many of the randomly selected benchmarks (between 6 and 36 of 45), while some tools returned wrong results or missing counterexamples (1 to 4 cases); ongoing work explores longer time budgets and the causes of the erroneous outcomes
    Abstract Traffic signs play a critical role in road safety and traffic management for autonomous driving systems. Accurate traffic sign classification is essential but challenging due to real-world complexities like adversarial examples and occlusions. To address these issues, binary neural networks offer promise in constructing classifiers suitable for resource-constrained devices. In our previous work, we proposed high-accuracy BNN models for traffic sign recognition, focusing on compact size for limited computation and energy resources. To evaluate their local robustness, this paper introduces a set of benchmark problems featuring layers that challenge state-of-the-art verification tools. These layers include binarized convolutions, max pooling, batch normalization, fully connected. The difficulty of the verification problem is given by the high number of network parameters (905k - 1.7 M), of the input dimension (2.7k-12k), and of the number of regions (43) as well by the fact that the neural networks are not sparse. The proposed BNN models and local robustness properties can be checked at https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/traffic_signs_recognition. The results of the 4th International Verification of Neural Networks Competition (VNN-COMP'23) revealed the fact that 4, out of 7, solvers can handle many of our benchmarks randomly selected (minimum is 6, maximum is 36, out of 45). Surprisingly, tools output also wrong results or missing counterexample (ranging from 1 to 4). Currently, our focus lies in exploring the possibility of achieving a greater count of solved instances by extending the allotted time (previously set at 8 minutes). Furthermore, we are intrigued by the reasons behind the erroneous outcomes provided by the tools for certain benchmarks.

Privacy-preserving Linear Computations in Spiking Neural P Systems

  • paper_url: http://arxiv.org/abs/2309.13803
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Mihail-Iulian Plesa, Marian Gheorghe, Florentin Ipate
  • for: presenting a new privacy-preserving protocol built on Spiking Neural P (SN P) systems, a class of membrane computing models inspired by biological neurons with applications in formal verification, artificial intelligence, and cryptography
  • methods: a protocol that lets a client use an SN P system hosted on a remote server to evaluate linear functions of the form t_1 k + t_2 without revealing t_1, t_2, or k, and without the server learning the result
  • results: an SN P system implementing any linear function over the natural numbers is given, together with a security analysis of the protocol in the honest-but-curious model
    Abstract Spiking Neural P systems are a class of membrane computing models inspired directly by biological neurons. Besides the theoretical progress made in this new computational model, there are also numerous applications of P systems in fields like formal verification, artificial intelligence, or cryptography. Motivated by all the use cases of SN P systems, in this paper, we present a new privacy-preserving protocol that enables a client to compute a linear function using an SN P system hosted on a remote server. Our protocol allows the client to use the server to evaluate functions of the form t_1k + t_2 without revealing t_1, t_2 or k and without the server knowing the result. We also present an SN P system to implement any linear function over natural numbers and some security considerations of our protocol in the honest-but-curious security model.

Can LLM-Generated Misinformation Be Detected?

  • paper_url: http://arxiv.org/abs/2309.13788
  • repo_url: https://github.com/llm-misinformation/llm-misinformation
  • paper_authors: Canyu Chen, Kai Shu
  • for: investigates whether LLM-generated misinformation can cause more harm than human-written misinformation
  • methods: builds a taxonomy of LLM-generated misinformation and categorizes potential real-world methods for generating misinformation with LLMs, and employs extensive empirical investigation to study the detection difficulty of LLM-generated misinformation
  • results: discovers that LLM-generated misinformation can be harder to detect for humans and detectors compared to human-written misinformation with the same semantics, suggesting it can have more deceptive styles and potentially cause more harm
    Abstract The advent of Large Language Models (LLMs) has made a transformative impact. However, the potential that LLMs such as ChatGPT can be exploited to generate misinformation has posed a serious concern to online safety and public trust. A fundamental research question is: will LLM-generated misinformation cause more harm than human-written misinformation? We propose to tackle this question from the perspective of detection difficulty. We first build a taxonomy of LLM-generated misinformation. Then we categorize and validate the potential real-world methods for generating misinformation with LLMs. Then, through extensive empirical investigation, we discover that LLM-generated misinformation can be harder to detect for humans and detectors compared to human-written misinformation with the same semantics, which suggests it can have more deceptive styles and potentially cause more harm. We also discuss the implications of our discovery on combating misinformation in the age of LLMs and the countermeasures.

On the Computational Benefit of Multimodal Learning

  • paper_url: http://arxiv.org/abs/2309.13782
  • repo_url: None
  • paper_authors: Zhou Lu
  • for: investigating whether multimodal learning offers computational advantages over unimodal learning
  • methods: a construction based on a novel modification of the intersection-of-two-half-spaces problem
  • results: under certain conditions, multimodal learning can outpace unimodal learning exponentially in computation; concretely, a learning task is exhibited that is NP-hard for unimodal learning but solvable in polynomial time by a multimodal algorithm
    Abstract Human perception inherently operates in a multimodal manner. Similarly, as machines interpret the empirical world, their learning processes ought to be multimodal. The recent, remarkable successes in empirical multimodal learning underscore the significance of understanding this paradigm. Yet, a solid theoretical foundation for multimodal learning has eluded the field for some time. While a recent study by Lu (2023) has shown the superior sample complexity of multimodal learning compared to its unimodal counterpart, another basic question remains: does multimodal learning also offer computational advantages over unimodal learning? This work initiates a study on the computational benefit of multimodal learning. We demonstrate that, under certain conditions, multimodal learning can outpace unimodal learning exponentially in terms of computation. Specifically, we present a learning task that is NP-hard for unimodal learning but is solvable in polynomial time by a multimodal algorithm. Our construction is based on a novel modification to the intersection of two half-spaces problem.

Explainable Machine Learning for ICU Readmission Prediction

  • paper_url: http://arxiv.org/abs/2309.13781
  • repo_url: None
  • paper_authors: Alex G. C. de Sá, Daniel Gould, Anna Fedyukova, Mitchell Nicholas, Lucy Dockrell, Calvin Fletcher, David Pilcher, Daniel Capurro, David B. Ascher, Khaled El-Khawas, Douglas E. V. Pires
  • for: developing a standardized and explainable machine learning pipeline to predict patient readmission to the intensive care unit (ICU) using a multicentric database
  • methods: a machine learning pipeline with a Random Forest classification model, validated in both monocentric and multicentric settings, with explanations of the constructed models used to derive insightful conclusions
  • results: predictive performance of up to 0.7 area under the receiver operating characteristic curve (AUC), with good calibration and consistency on validation sets, and identification of variables related to vital signs, blood tests, demographics, and ICU-associated factors associated with readmission
    Abstract The intensive care unit (ICU) comprises a complex hospital environment, where decisions made by clinicians have a high level of risk for the patients' lives. A comprehensive care pathway must then be followed to reduce complications. Uncertain, competing and unplanned aspects within this environment increase the difficulty in uniformly implementing the care pathway. Readmission contributes to this pathway's difficulty, occurring when patients are admitted again to the ICU in a short timeframe, resulting in high mortality rates and high resource utilisation. Several works have tried to predict readmission through patients' medical information. Although they have some level of success while predicting readmission, those works do not properly assess, characterise and understand readmission prediction. This work proposes a standardised and explainable machine learning pipeline to model patient readmission on a multicentric database (i.e., the eICU cohort with 166,355 patients, 200,859 admissions and 6,021 readmissions) while validating it on monocentric (i.e., the MIMIC IV cohort with 382,278 patients, 523,740 admissions and 5,984 readmissions) and multicentric settings. Our machine learning pipeline achieved predictive performance in terms of the area of the receiver operating characteristic curve (AUC) up to 0.7 with a Random Forest classification model, yielding an overall good calibration and consistency on validation sets. From explanations provided by the constructed models, we could also derive a set of insightful conclusions, primarily on variables related to vital signs and blood tests (e.g., albumin, blood urea nitrogen and hemoglobin levels), demographics (e.g., age, and admission height and weight), and ICU-associated variables (e.g., unit type). These insights provide an invaluable source of information during clinicians' decision-making while discharging ICU patients.

cs.CL - 2023-09-25

Introducing DictaLM – A Large Generative Language Model for Modern Hebrew

  • paper_url: http://arxiv.org/abs/2309.14568
  • repo_url: None
  • paper_authors: Shaltiel Shmidman, Avi Shmidman, Amir David Nissan Cohen, Moshe Koppel
  • for: Developing a large-scale generative language model for Modern Hebrew.
  • methods: A 7B-parameter model trained predominantly on Hebrew-centric data; both the foundation model and the instruct-tuned model are released under a Creative Commons license, along with DictaLM-Rab, a foundation model geared towards Rabbinic/Historical Hebrew.
  • results: An initial Hebrew LLM that serves as a starting point for fine-tuning on Hebrew-specific tasks such as instruction following, Q&A and sentiment analysis, offered as a preliminary release for the Hebrew NLP community.
    Abstract We present DictaLM, a large-scale language model tailored for Modern Hebrew. Boasting 7B parameters, this model is predominantly trained on Hebrew-centric data. As a commitment to promoting research and development in the Hebrew language, we release both the foundation model and the instruct-tuned model under a Creative Commons license. Concurrently, we introduce DictaLM-Rab, another foundation model geared towards Rabbinic/Historical Hebrew. These foundation models serve as ideal starting points for fine-tuning various Hebrew-specific tasks, such as instruction, Q&A, sentiment analysis, and more. This release represents a preliminary step, offering an initial Hebrew LLM model for the Hebrew NLP community to experiment with.

Aligning Large Multimodal Models with Factually Augmented RLHF

  • paper_url: http://arxiv.org/abs/2309.14525
  • repo_url: None
  • paper_authors: Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, Trevor Darrell
  • for: addressing the multimodal misalignment issue in large multimodal models (LMM)
  • methods: using reinforcement learning from human feedback (RLHF) to train a vision-language model to align with human annotations, and augmenting the reward model with additional factual information such as image captions and ground-truth multi-choice options
  • results: reaching 94% of the performance level of text-only GPT-4 on the LLaVA-Bench dataset (previous best methods reached only 87%) and improving over other baselines by 60% on MMHAL-BENCH, with the first LMM trained with RLHF; code, model and data are open-sourced at https://llava-rlhf.github.io.
    Abstract Large Multimodal Models (LMM) are built across modalities and the misalignment between two modalities can result in "hallucination", generating textual outputs that are not grounded by the multimodal information in context. To address the multimodal misalignment issue, we adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment, where human annotators are asked to compare two responses and pinpoint the more hallucinated one, and the vision-language model is trained to maximize the simulated human rewards. We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information such as image captions and ground-truth multi-choice options, which alleviates the reward hacking phenomenon in RLHF and further improves the performance. We also enhance the GPT-4-generated training data (for vision instruction tuning) with previously available human-written image-text pairs to improve the general capabilities of our model. To evaluate the proposed approach in real-world scenarios, we develop a new evaluation benchmark MMHAL-BENCH with a special focus on penalizing hallucinations. As the first LMM trained with RLHF, our approach achieves remarkable improvement on the LLaVA-Bench dataset with the 94% performance level of the text-only GPT-4 (while previous best methods can only achieve the 87% level), and an improvement by 60% on MMHAL-BENCH over other baselines. We opensource our code, model, data at https://llava-rlhf.github.io.
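
One concrete way to read "factual augmentation" is that the reward model scores a response conditioned not only on the question but also on auxiliary facts such as image captions or ground-truth options. The sketch below is our own illustration of that input construction; `build_reward_input` and the example strings are hypothetical, and the actual reward model is a trained preference network not shown here.

```python
# Hedged sketch: augment the reward model's input with factual context so
# that hallucinated responses can be penalised against known facts.
def build_reward_input(question: str, response: str, facts: list[str]) -> str:
    fact_block = "\n".join(f"Fact: {f}" for f in facts)  # e.g. image captions
    return f"{fact_block}\nQuestion: {question}\nResponse: {response}"

example = build_reward_input(
    question="What is the man holding?",
    response="The man is holding a red umbrella.",
    facts=["A man walks a dog in the rain.", "Ground-truth option: (b) a leash."],
)
print(example)  # a trained reward model would score this string, penalising
                # responses contradicted by the listed facts
```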

ChatGPT Performance on Standardized Testing Exam – A Proposed Strategy for Learners

  • paper_url: http://arxiv.org/abs/2309.14519
  • repo_url: None
  • paper_authors: Umer Farooq, Saira Anwar
  • for: Exploring ChatGPT's problem-solving ability for standardized test preparation, focusing on the GRE quantitative section; prior work suggests ChatGPT has substantial potential to change how students study across disciplines.
  • methods: Quantitative evaluation of ChatGPT on 100 randomly selected GRE quantitative questions across content areas, with t-tests to examine the effect of modified question prompts on accuracy.
  • results: Prompt modification yields a statistically significant improvement in accuracy (84% on modified prompts versus 69% on the original questions); the study also discusses the questions ChatGPT struggles with and how prompt modification can help learners prepare for standardized tests like the GRE.
    Abstract This study explores the problem solving capabilities of ChatGPT and its prospective applications in standardized test preparation, focusing on the GRE quantitative exam. Prior research has shown great potential for the utilization of ChatGPT for academic purposes in revolutionizing the approach to studying across various disciplines. We investigate how ChatGPT performs across various question types in the GRE quantitative domain, and how modifying question prompts impacts its accuracy. More specifically this study addressed two research questions: 1. How does ChatGPT perform in answering GRE-based quantitative questions across various content areas? 2. How does the accuracy of ChatGPT vary with modifying the question prompts? The dataset consisting of 100 randomly selected GRE quantitative questions was collected from the ETS official guide to GRE test preparation. We used quantitative evaluation to answer our first research question, and t-test to examine the statistical association between prompt modification and ChatGPT's accuracy. Results show a statistical improvement in the ChatGPT's accuracy after applying instruction priming and contextual prompts to the original questions. ChatGPT showed 84% accuracy with the modified prompts compared to 69% with the original data. The study discusses the areas where ChatGPT struggled with certain questions and how modifications can be helpful for preparing for standardized tests like GRE and provides future directions for prompt modifications.
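
The statistical comparison described above amounts to a two-sample test on per-question correctness. A minimal sketch follows, with made-up 0/1 correctness arrays matching the reported accuracies rather than the study's actual data.

```python
# Sketch of the accuracy comparison: per-question correctness (1 = correct)
# for original vs. modified prompts. The arrays are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
original = (rng.random(100) < 0.69).astype(int)   # ~69% accuracy reported
modified = (rng.random(100) < 0.84).astype(int)   # ~84% accuracy reported

t, p = stats.ttest_ind(modified, original)
print(f"accuracy: {original.mean():.2f} -> {modified.mean():.2f}, "
      f"t={t:.2f}, p={p:.3f}")
```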

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

  • paper_url: http://arxiv.org/abs/2309.14509
  • repo_url: None
  • paper_authors: Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He
  • for: Improving the training efficiency of large language models (LLMs), in particular accelerating the training of Transformers with extremely long sequences.
  • methods: DeepSpeed-Ulysses, which partitions the input data along the sequence dimension and uses an efficient all-to-all collective communication for attention computation.
  • results: Experiments show that DeepSpeed-Ulysses trains 2.5x faster than the existing SOTA baseline while supporting 4x longer sequence lengths.
    Abstract Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, work on systems for accelerating LLM training has focused on the first three dimensions: data parallelism for batch size, tensor parallelism for hidden size and pipeline parallelism for model depth or layers. These widely studied forms of parallelism are not targeted or optimized for long sequence Transformer models. Given practical application needs for long-sequence LLMs, renewed attention is being drawn to sequence parallelism. However, existing works in sequence parallelism are constrained by memory-communication inefficiency, limiting their scalability to long sequence large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training with extremely long sequence length. DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention computation. Theoretical communication analysis shows that whereas other methods incur communication overhead as sequence length increases, DeepSpeed-Ulysses maintains constant communication volume when sequence length and compute devices are increased proportionally. Furthermore, experimental evaluations show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing SOTA baseline.
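
The core resharding can be visualized without a cluster: before attention each device owns a slice of the sequence with all heads, and the all-to-all converts that into all tokens for a slice of the heads, so attention over the full sequence becomes local. The sketch below simulates this layout change on a single tensor with an explicit "device" axis; shapes and names are illustrative assumptions, not DeepSpeed's implementation.

```python
# Single-process simulation of the Ulysses resharding. In the real system
# this transpose is realized with a torch.distributed all-to-all; here an
# explicit device axis P makes the layout change visible.
import torch

P, S, H, D = 4, 32, 8, 16          # devices, total tokens, heads, head dim
# Before attention: each device holds S/P tokens but all H heads.
q = torch.randn(P, S // P, H, D)

# All-to-all along the device axis: afterwards each device holds all S
# tokens but only H/P heads.
q_heads = (q.reshape(P, S // P, P, H // P, D)   # split heads into P groups
             .permute(2, 0, 1, 3, 4)            # swap device <-> head-group axes
             .reshape(P, S, H // P, D))

assert q_heads.shape == (P, S, H // P, D)
# After attention, the inverse all-to-all restores the (S/P tokens, H heads)
# layout. Per-device message size stays fixed as S and P grow proportionally,
# which is the constant-communication-volume property claimed above.
```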

Classifying token frequencies using angular Minkowski $p$-distance

  • paper_url: http://arxiv.org/abs/2309.14495
  • repo_url: None
  • paper_authors: Oliver Urs Lenz, Chris Cornelis
  • for: Investigating angular Minkowski $p$-distance as an alternative to cosine dissimilarity, and its classification performance on the 20-newsgroups dataset.
  • methods: Evaluating classification performance with classical weighted nearest neighbours and fuzzy rough nearest neighbours, and analysing how the hyperparameter $p$, the dimensionality $m$ of the dataset, the number of neighbours $k$, the choice of weights and the choice of classifier affect performance.
  • results: With suitable values of $p$, angular Minkowski $p$-distance can deliver substantially higher classification performance than classical cosine dissimilarity.
    Abstract Angular Minkowski $p$-distance is a dissimilarity measure that is obtained by replacing Euclidean distance in the definition of cosine dissimilarity with other Minkowski $p$-distances. Cosine dissimilarity is frequently used with datasets containing token frequencies, and angular Minkowski $p$-distance may potentially be an even better choice for certain tasks. In a case study based on the 20-newsgroups dataset, we evaluate classification performance for classical weighted nearest neighbours, as well as fuzzy rough nearest neighbours. In addition, we analyse the relationship between the hyperparameter $p$, the dimensionality $m$ of the dataset, the number of neighbours $k$, the choice of weights and the choice of classifier. We conclude that it is possible to obtain substantially higher classification performance with angular Minkowski $p$-distance with suitable values for $p$ than with classical cosine dissimilarity.
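
The abstract does not spell the definition out, so as an assumption we read angular Minkowski $p$-distance as the Minkowski $p$-distance between vectors normalised by their $p$-norm (for $p=2$ this recovers the usual angular/cosine family). The sketch below implements that reading and should be treated as an interpretation, not the paper's definition.

```python
# Hedged sketch of angular Minkowski p-distance, under the assumption that it
# is the Minkowski p-distance between p-norm-normalised vectors.
import numpy as np

def angular_minkowski(x: np.ndarray, y: np.ndarray, p: float = 1.5) -> float:
    x = x / np.linalg.norm(x, ord=p)   # project onto the unit p-sphere
    y = y / np.linalg.norm(y, ord=p)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

# Token-frequency-like vectors:
a = np.array([3.0, 0.0, 1.0, 2.0])
b = np.array([1.0, 1.0, 0.0, 4.0])
print(angular_minkowski(a, b, p=1.5))
```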

Explainable and Accurate Natural Language Understanding for Voice Assistants and Beyond

  • paper_url: http://arxiv.org/abs/2309.14485
  • repo_url: None
  • paper_authors: Kalpa Gunaratna, Vijay Srinivasan, Hongxia Jin
  • for: Joint intent detection and slot filling (joint NLU, Natural Language Understanding), an essential component of smart voice assistants that detects user intent and fills slots simultaneously.
  • methods: Improving accuracy with various techniques while making the model inherently understandable and explainable.
  • results: The joint NLU model becomes inherently explainable at granular levels without compromising accuracy, and the extension can also be applied to other general classification tasks.
    Abstract Joint intent detection and slot filling, also termed joint NLU (Natural Language Understanding), is invaluable for smart voice assistants. Recent advancements in this area have focused heavily on improving accuracy using various techniques. Explainability is undoubtedly an important aspect for deep learning-based models, including joint NLU models. Without explainability, their decisions are opaque to the outside world and hence tend to lack user trust. Therefore, to bridge this gap, we transform the full joint NLU model to be `inherently' explainable at granular levels without compromising on accuracy. Further, as we make the full joint NLU model explainable, we show that our extension can be successfully used in other general classification tasks. We demonstrate this using sentiment analysis and named entity recognition.

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention

  • paper_url: http://arxiv.org/abs/2309.14327
  • repo_url: https://github.com/microsoft/deepspeedexamples
  • paper_authors: Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He
  • for: Improving the adaptability and scalability of large language models in interleaved multi-round, multi-image dialogues.
  • methods: Introducing multi-modal capabilities, including an innovative multi-modal causal attention mechanism and data blending techniques on existing datasets.
  • results: Superior scalability compared with existing frameworks, supporting language models of up to 70B parameters.
    Abstract Most of the existing multi-modal models, hindered by their incapacity to adeptly manage interleaved image-and-text inputs in multi-image, multi-round dialogues, face substantial constraints in resource allocation for training and data accessibility, impacting their adaptability and scalability across varied interaction realms. To address this, we present the DeepSpeed-VisualChat framework, designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities, with a focus on enhancing the proficiency of Large Vision and Language Models in handling interleaved inputs. Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions in multi-round, multi-image conversations. Compared to existing frameworks, DeepSpeed-VisualChat shows superior scalability up to 70B parameter language model size, representing a significant advancement in multi-modal language models and setting a solid foundation for future explorations.

Towards General-Purpose Text-Instruction-Guided Voice Conversion

  • paper_url: http://arxiv.org/abs/2309.14324
  • repo_url: https://github.com/text-guided-vc/text-guided-vc.github.io
  • paper_authors: Chun-Yi Kuan, Chen An Li, Tsu-Yuan Hsu, Tse-Yang Lin, Ho-Lam Chung, Kai-Wei Chang, Shuo-yiin Chang, Hung-yi Lee
  • for: A novel voice conversion (VC) model guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice"; unlike traditional methods that rely on reference utterances, it can modify the prosodic and emotional information of the converted speech according to the instruction, adding flexibility and specificity.
  • methods: A neural codec language model that processes a sequence of discrete codes into the code sequence of the converted speech, using text instructions as style prompts; unlike previous approaches that employ separate encoders (e.g., for prosody and content), it handles the various aspects of speech end-to-end.
  • results: Experiments show that the model understands the instructions well and delivers reasonable conversion results.
    Abstract This paper introduces a novel voice conversion (VC) model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice". Unlike traditional methods that rely on reference utterances to determine the attributes of the converted speech, our model adds versatility and specificity to voice conversion. The proposed VC model is a neural codec language model which processes a sequence of discrete codes, resulting in the code sequence of converted speech. It utilizes text instructions as style prompts to modify the prosody and emotional information of the given speech. In contrast to previous approaches, which often rely on employing separate encoders like prosody and content encoders to handle different aspects of the source speech, our model handles various information of speech in an end-to-end manner. Experiments have demonstrated the impressive capabilities of our model in comprehending instructions and delivering reasonable results.

Urdu Poetry Generated by Using Deep Learning Techniques

  • paper_url: http://arxiv.org/abs/2309.14233
  • repo_url: None
  • paper_authors: Muhammad Shoaib Farooq, Ali Abbas
  • For: The study presents Urdu poetry generated with different deep learning techniques and algorithms.
  • Methods: Deep learning models including Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU), together with natural language processing (NLP) techniques for understanding, analysing and generating human language.
  • Results: The generated Urdu poems show good accuracy.
    Abstract This study provides Urdu poetry generated using different deep-learning techniques and algorithms. The data was collected through the Rekhta website, containing 1341 text files with several couplets. The data on poetry was not from any specific genre or poet. Instead, it was a collection of mixed Urdu poems and Ghazals. Different deep learning techniques have been applied, such as Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU). Natural Language Processing (NLP) enables machine learning systems to understand, analyze, and generate language that humans can use and understand. Much work has been done on generating poetry for different languages using different techniques, and the collection and use of data were also different for different researchers. The primary purpose of this project is to provide a model that generates Urdu poems by using the data in full, not by sampling it. Also, this model generates poems in pure Urdu, not Roman Urdu as in the base paper. The results have shown good accuracy in the poems generated by the model.

Autonomous Vehicles an overview on system, cyber security, risks, issues, and a way forward

  • paper_url: http://arxiv.org/abs/2309.14213
  • repo_url: None
  • paper_authors: Md Aminul Islam, Sarah Alqahtani
  • for: Surveying the fundamental components and operational characteristics of autonomous vehicles, and how they integrate within the framework of the Internet of Things.
  • methods: Examination of sensors, AI identification systems and control mechanisms, and their integration with cloud-based servers.
  • results: A discussion of practical applications such as traffic forecasting and transforming transportation dynamics, the impact of task automation on different industries, and an analysis of autonomous vehicle security risks spanning ethical, environmental, legal, professional and social dimensions.
    Abstract This chapter explores the complex realm of autonomous cars, analyzing their fundamental components and operational characteristics. The initial phase of the discussion elucidates the internal mechanics of these automobiles, encompassing the crucial involvement of sensors, artificial intelligence (AI) identification systems, control mechanisms, and their integration with cloud-based servers within the framework of the Internet of Things (IoT). It delves into practical implementations of autonomous cars, emphasizing their utilization in forecasting traffic patterns and transforming the dynamics of transportation. The text also explores the topic of Robotic Process Automation (RPA), illustrating the impact of autonomous cars on different businesses through the automation of tasks. The primary focus of this investigation lies in the realm of cybersecurity, specifically in the context of autonomous vehicles. A comprehensive analysis explores various risk management solutions aimed at protecting these vehicles from potential threats across ethical, environmental, legal, professional, and social dimensions, offering a comprehensive perspective on their societal implications. The chapter closes with a strategic plan for addressing these challenges, strategies for effectively traversing the complex terrain of autonomous car systems, cybersecurity, hazards, and other concerns, and a comprehensive compilation of resources for additional investigation. Keywords: RPA, Cyber Security, AV, Risk, Smart Cars

Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation

  • paper_url: http://arxiv.org/abs/2309.14174
  • repo_url: None
  • paper_authors: Zihan Liu, Zewei Sun, Shanbo Cheng, Shujian Huang, Mingxuan Wang
  • for: DocNMT for handling discourse phenomena in Machine Translation tasks, with the goal of improving efficiency while maintaining performance.
  • methods: The paper introduces a lightweight attention mechanism to select a small portion of tokens to be attended, reducing the computational cost of the attention module while maintaining performance.
  • results: The method achieves up to 95% sparsity (only 5% tokens attended) and saves 93% computation cost on the attention module compared to the original Transformer, while maintaining performance.
    Abstract Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse phenomena by introducing document-level context information. One of the most important directions is to input the whole document directly to the standard Transformer model. In this case, efficiency becomes a critical concern due to the quadratic complexity of the attention module. Existing studies either focus on the encoder part, which cannot be deployed on sequence-to-sequence generation tasks, e.g., Machine Translation (MT), or suffer from a significant performance drop. In this work, we keep the translation performance while gaining a 20\% speed-up by introducing an extra selection layer based on lightweight attention that selects a small portion of tokens to be attended. It takes advantage of the original attention to ensure performance and dimension reduction to accelerate inference. Experimental results show that our method could achieve up to 95\% sparsity (only 5\% tokens attended) approximately, and save 93\% computation cost on the attention module compared with the original Transformer, while maintaining the performance.
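
A minimal sketch of such a selection layer follows; it is our own simplification for illustration, not the paper's code. A cheap linear scorer ranks the context tokens, and standard multi-head attention then attends only to the top ~5%.

```python
# Hedged sketch: a lightweight scorer picks ~5% of context tokens, and full
# attention only attends to the selected keys/values. Module structure and
# names are illustrative; top-k selection as written is non-differentiable.
import torch
import torch.nn as nn

class SelectiveAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, keep_ratio: float = 0.05):
        super().__init__()
        self.score = nn.Linear(d_model, 1)                 # cheap token scorer
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, query, context):
        k = max(1, int(context.size(1) * self.keep_ratio))
        scores = self.score(context).squeeze(-1)           # (B, S)
        idx = scores.topk(k, dim=1).indices                # top-k token indices
        idx_exp = idx.unsqueeze(-1).expand(-1, -1, context.size(-1))
        selected = context.gather(1, idx_exp)              # (B, k, d_model)
        out, _ = self.attn(query, selected, selected)      # attend to ~5% of tokens
        return out

layer = SelectiveAttention(d_model=256, n_heads=8)
q = torch.randn(2, 10, 256)      # e.g. current-sentence states
ctx = torch.randn(2, 400, 256)   # long document context
print(layer(q, ctx).shape)       # torch.Size([2, 10, 256])
```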

Towards End-User Development for IoT: A Case Study on Semantic Parsing of Cooking Recipes for Programming Kitchen Devices

  • paper_url: http://arxiv.org/abs/2309.14165
  • repo_url: https://github.com/filipposventirozos/towards-end-user-development-for-iot
  • paper_authors: Filippos Ventirozos, Sarah Clinch, Riza Batista-Navarro
  • for: Supporting end-user programming of kitchen IoT devices from natural-language cooking recipe instructions.
  • methods: Machine learning-based sequence labelling for semantic parsing, using conditional random fields (CRF) and a neural network model.
  • results: Training semantic parsers on the annotations is feasible, but most natural-language instructions are incomplete, so transforming them into formal meaning representations is not straightforward.
    Abstract Semantic parsing of user-generated instructional text, as a way of enabling end-users to program the Internet of Things (IoT), is an underexplored area. In this study, we provide a unique annotated corpus which aims to support the transformation of cooking recipe instructions to machine-understandable commands for IoT devices in the kitchen. Each of these commands is a tuple capturing the semantics of an instruction involving a kitchen device in terms of "What", "Where", "Why" and "How". Based on this corpus, we developed machine learning-based sequence labelling methods, namely conditional random fields (CRF) and a neural network model, in order to parse recipe instructions and extract our tuples of interest from them. Our results show that while it is feasible to train semantic parsers based on our annotations, most natural-language instructions are incomplete, and thus transforming them into formal meaning representation, is not straightforward.
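
For the CRF part of the pipeline, a minimal sequence-labelling sketch with sklearn-crfsuite is below. The feature template and the What/Where/How-style tags are illustrative assumptions rather than the paper's annotation scheme.

```python
# Minimal CRF sequence-labelling sketch in the spirit of the paper,
# using sklearn-crfsuite. Tags and features are illustrative.
import sklearn_crfsuite

def word2features(sent, i):
    w = sent[i]
    return {"lower": w.lower(), "is_digit": w.isdigit(),
            "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
            "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>"}

sents = [["Bake", "the", "dough", "in", "the", "oven", "at", "180", "degrees"]]
tags  = [["B-What", "O", "O", "O", "O", "B-Where", "O", "B-How", "I-How"]]

X = [[word2features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, tags)
print(crf.predict(X)[0])
```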

Examining Temporal Bias in Abusive Language Detection

  • paper_url: http://arxiv.org/abs/2309.14146
  • repo_url: None
  • paper_authors: Mali Jin, Yida Mu, Diana Maynard, Kalina Bontcheva
  • for: This study aims to investigate the nature and impact of temporal bias in abusive language detection across various languages and explore mitigation methods.
  • methods: The study evaluates the performance of models on abusive data sets from different time periods and presents an extensive linguistic analysis of these abusive data sets from a diachronic perspective.
  • results: The results demonstrate that temporal bias is a significant challenge for abusive language detection, with models trained on historical data showing a significant drop in performance over time.
    Abstract The use of abusive language online has become an increasingly pervasive problem that damages both individuals and society, with effects ranging from psychological harm right through to escalation to real-life violence and even death. Machine learning models have been developed to automatically detect abusive language, but these models can suffer from temporal bias, the phenomenon in which topics, language use or social norms change over time. This study aims to investigate the nature and impact of temporal bias in abusive language detection across various languages and explore mitigation methods. We evaluate the performance of models on abusive data sets from different time periods. Our results demonstrate that temporal bias is a significant challenge for abusive language detection, with models trained on historical data showing a significant drop in performance over time. We also present an extensive linguistic analysis of these abusive data sets from a diachronic perspective, aiming to explore the reasons for language evolution and performance decline. This study sheds light on the pervasive issue of temporal bias in abusive language detection across languages, offering crucial insights into language evolution and temporal bias mitigation.

On the Relation between Internal Language Model and Sequence Discriminative Training for Neural Transducers

  • paper_url: http://arxiv.org/abs/2309.14130
  • repo_url: None
  • paper_authors: Zijian Yang, Wei Zhou, Ralf Schlüter, Hermann Ney
  • for: Improving the performance of the RNN-Transducer with external language model (LM) fusion.
  • methods: Sequence discriminative training, studied in relation to internal language model (ILM) subtraction.
  • results: Sequence discriminative training and ILM subtraction perform similarly across a wide range of Librispeech experiments, covering both MMI and MBR criteria; after sequence discriminative training, the benefit of ILM subtraction becomes much smaller.
    Abstract Internal language model (ILM) subtraction has been widely applied to improve the performance of the RNN-Transducer with external language model (LM) fusion for speech recognition. In this work, we show that sequence discriminative training has a strong correlation with ILM subtraction from both theoretical and empirical points of view. Theoretically, we derive that the global optimum of maximum mutual information (MMI) training shares a similar formula as ILM subtraction. Empirically, we show that ILM subtraction and sequence discriminative training achieve similar performance across a wide range of experiments on Librispeech, including both MMI and minimum Bayes risk (MBR) criteria, as well as neural transducers and LMs of both full and limited context. The benefit of ILM subtraction also becomes much smaller after sequence discriminative training. We also provide an in-depth study to show that sequence discriminative training has a minimal effect on the commonly used zero-encoder ILM estimation, but a joint effect on both encoder and prediction + joint network for posterior probability reshaping including both ILM and blank suppression.
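
In decoding terms, ILM subtraction is a log-linear combination of the ASR posterior, the external LM, and a subtracted internal LM estimate. A hedged sketch of the scoring rule follows; the weights are arbitrary examples, and the log-probabilities would come from the respective models.

```python
# Log-linear score combination commonly used for external LM fusion with
# internal LM (ILM) subtraction. Lambda values here are arbitrary examples.
def fused_score(log_p_asr: float, log_p_ext_lm: float, log_p_ilm: float,
                lam_ext: float = 0.5, lam_ilm: float = 0.4) -> float:
    # score = log p_ASR(y|x) + lam_ext * log p_extLM(y) - lam_ilm * log p_ILM(y)
    return log_p_asr + lam_ext * log_p_ext_lm - lam_ilm * log_p_ilm

print(fused_score(-12.3, -20.1, -18.7))
```

The paper's empirical point is that after sequence discriminative training, setting lam_ilm close to zero costs little, since the training already plays a similar role to the subtraction.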

Wav2vec-based Detection and Severity Level Classification of Dysarthria from Speech

  • paper_url: http://arxiv.org/abs/2309.14107
  • repo_url: None
  • paper_authors: Farhad Javanmardi, Saska Tirronen, Manila Kodali, Sudarsana Reddy Kadiri, Paavo Alku
  • for: Automatic detection and severity level classification of dysarthria from speech signals, for use in medical diagnosis.
  • methods: The pre-trained wav2vec 2.0 model as a feature extractor for building detection and severity classification systems for dysarthric speech.
  • results: Embeddings from the first layer of wav2vec 2.0 gave the best detection performance, a 1.23% absolute accuracy improvement over the best baseline feature (spectrogram); for severity level classification, final-layer embeddings gave a 10.62% absolute accuracy improvement over the baseline features (mel-frequency cepstral coefficients).
    Abstract Automatic detection and severity level classification of dysarthria directly from acoustic speech signals can be used as a tool in medical diagnosis. In this work, the pre-trained wav2vec 2.0 model is studied as a feature extractor to build detection and severity level classification systems for dysarthric speech. The experiments were carried out with the popularly used UA-speech database. In the detection experiments, the results revealed that the best performance was obtained using the embeddings from the first layer of the wav2vec model that yielded an absolute improvement of 1.23% in accuracy compared to the best performing baseline feature (spectrogram). In the studied severity level classification task, the results revealed that the embeddings from the final layer gave an absolute improvement of 10.62% in accuracy compared to the best baseline features (mel-frequency cepstral coefficients).
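
Extracting layer-wise embeddings is straightforward with the Hugging Face transformers API; the sketch below mean-pools a chosen hidden layer and feeds it to an SVM. The checkpoint name and the random waveforms are placeholders, not the UA-Speech setup.

```python
# Sketch: extract a chosen wav2vec 2.0 layer's embeddings (mean-pooled over
# time) and classify them with an SVM, in the spirit of the paper.
import numpy as np, torch
from transformers import Wav2Vec2Model
from sklearn.svm import SVC

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def layer_embedding(waveform_16khz: torch.Tensor, layer: int = 1) -> np.ndarray:
    with torch.no_grad():
        out = model(waveform_16khz, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0).numpy()

# Random waveforms standing in for healthy/dysarthric utterances:
X = np.stack([layer_embedding(torch.randn(1, 16000)) for _ in range(8)])
y = np.array([0, 1] * 4)
clf = SVC().fit(X, y)
print(clf.predict(X[:2]))
```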

Analysis and Detection of Pathological Voice using Glottal Source Features

  • paper_url: http://arxiv.org/abs/2309.14080
  • repo_url: None
  • paper_authors: Sudarsana Reddy Kadiri, Paavo Alku
  • for: Automatic detection of voice pathology, enabling objective assessment and earlier intervention in diagnosis.
  • methods: Glottal source features extracted from glottal flows estimated with the quasi-closed phase (QCP) glottal inverse filtering method, from approximate glottal source signals computed with the zero frequency filtering (ZFF) method, and from the acoustic voice signal directly; mel-frequency cepstral coefficients (MFCCs) are additionally derived from the QCP and ZFF glottal waveforms to capture variations in the glottal source spectrum of pathological voice.
  • results: Glottal source features carry information that discriminates normal from pathological voice; in SVM-based detection experiments they matched or exceeded conventional MFCC and perceptual linear prediction (PLP) features, and combining both feature types gave the best detection performance, indicating their complementary nature.
    Abstract Automatic detection of voice pathology enables objective assessment and earlier intervention for the diagnosis. This study provides a systematic analysis of glottal source features and investigates their effectiveness in voice pathology detection. Glottal source features are extracted using glottal flows estimated with the quasi-closed phase (QCP) glottal inverse filtering method, using approximate glottal source signals computed with the zero frequency filtering (ZFF) method, and using acoustic voice signals directly. In addition, we propose to derive mel-frequency cepstral coefficients (MFCCs) from the glottal source waveforms computed by QCP and ZFF to effectively capture the variations in glottal source spectra of pathological voice. Experiments were carried out using two databases, the Hospital Universitario Principe de Asturias (HUPA) database and the Saarbrucken Voice Disorders (SVD) database. Analysis of features revealed that the glottal source contains information that discriminates normal and pathological voice. Pathology detection experiments were carried out using support vector machine (SVM). From the detection experiments it was observed that the performance achieved with the studied glottal source features is comparable or better than that of conventional MFCCs and perceptual linear prediction (PLP) features. The best detection performance was achieved when the glottal source features were combined with the conventional MFCCs and PLP features, which indicates the complementary nature of the features.
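
Once a glottal source waveform has been estimated (QCP/ZFF inverse filtering itself is beyond this sketch), deriving MFCCs from it is a one-liner. The code below assumes a placeholder waveform standing in for the inverse-filtering output.

```python
# Sketch of deriving MFCCs from a glottal source waveform. `glottal_flow`
# is a placeholder for a QCP/ZFF estimate (here just noise for shape).
import numpy as np
import librosa

sr = 16000
glottal_flow = np.random.randn(sr * 2).astype(np.float32)

mfcc = librosa.feature.mfcc(y=glottal_flow, sr=sr, n_mfcc=13)
feature_vec = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # per-utterance stats
print(feature_vec.shape)  # (26,) -- ready for an SVM, as in the paper
```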

Multiple evolutionary pressures shape identical consonant avoidance in the world’s languages

  • paper_url: http://arxiv.org/abs/2309.14006
  • repo_url: None
  • paper_authors: Chundra A. Cathcart
  • for: Investigating whether the frequency of identical consonant sequences is constrained during language evolution, and where such constraints come from.
  • methods: Phylogenetic comparative analyses of the evolution of homologous word forms, examining how often forms with identical consonants arise, are mutated away, and die out.
  • results: Word forms with identical consonants arise less frequently than those without, and word form mutation is more likely to remove than to introduce such sequences; however, these forms do not die out more frequently. The under-representation of identical consonant sequences is therefore largely a byproduct of constraints on word form coinage rather than of speakers selecting against existing forms.
    Abstract Languages disfavor word forms containing sequences of similar or identical consonants, due to the biomechanical and cognitive difficulties posed by patterns of this sort. However, the specific evolutionary processes responsible for this phenomenon are not fully understood. Words containing sequences of identical consonants may be more likely to arise than those without; processes of word form mutation may be more likely to remove than create sequences of identical consonants in word forms; finally, words containing identical consonants may die out more frequently than those without. Phylogenetic analyses of the evolution of homologous word forms indicate that words with identical consonants arise less frequently than those without, and processes which mutate word forms are more likely to remove sequences of identical consonants than introduce them. However, words with identical consonants do not die out more frequently than those without. Further analyses reveal that forms with identical consonants are replaced in basic meaning functions more frequently than words without. Taken together, results suggest that the under representation of sequences of identical consonants is overwhelmingly a byproduct of constraints on word form coinage, though processes related to word usage also serve to ensure that such patterns are infrequent in more salient vocabulary items. These findings clarify previously unknown aspects of processes of lexical evolution and competition that take place during language change, optimizing communicative systems.

Connecting Speech Encoder and Large Language Model for ASR

  • paper_url: http://arxiv.org/abs/2309.13963
  • repo_url: None
  • paper_authors: Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
  • for: A comparative study of three commonly used connector structures, fully connected layers, multi-head cross-attention, and Q-Former, for building integrated ASR models that couple a speech encoder with an LLM.
  • methods: Speech encoders from the Whisper model series and LLMs from the Vicuna model series at different model sizes, evaluated on the LibriSpeech, Common Voice, and GigaSpeech datasets.
  • results: Q-Former-based LLMs show consistent and considerable word error rate (WER) reductions over the other connector structures across datasets; a novel segment-level Q-Former further enables recognition of speech segments longer than the encoder's limit, yielding a 17% relative WER reduction on 90-second speech.
    Abstract The impressive capability and versatility of large language models (LLMs) have aroused increasing attention in automatic speech recognition (ASR), with several pioneering studies attempting to build integrated ASR models by connecting a speech encoder with an LLM. This paper presents a comparative study of three commonly used structures as connectors, including fully connected layers, multi-head cross-attention, and Q-Former. Speech encoders from the Whisper model series as well as LLMs from the Vicuna model series with different model sizes were studied. Experiments were performed on the commonly used LibriSpeech, Common Voice, and GigaSpeech datasets, where the LLMs with Q-Formers demonstrated consistent and considerable word error rate (WER) reductions over LLMs with other connector structures. Q-Former-based LLMs can generalise well to out-of-domain datasets, where 12% relative WER reductions over the Whisper baseline ASR model were achieved on the Eval2000 test set without using any in-domain training data from Switchboard. Moreover, a novel segment-level Q-Former is proposed to enable LLMs to recognise speech segments with a duration exceeding the limitation of the encoders, which results in 17% relative WER reductions over other connector structures on 90-second-long speech data.
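
A minimal Q-Former-style connector can be sketched as a set of learnable queries cross-attending to the speech encoder outputs. This is our own simplification for illustration (the real Q-Former stacks several transformer blocks), with dimensions chosen arbitrarily.

```python
# Minimal Q-Former-style connector: a fixed number of learnable queries
# cross-attend to speech-encoder outputs and yield a short sequence of
# LLM-dimension vectors. A heavy simplification of the real Q-Former.
import torch
import torch.nn as nn

class QFormerConnector(nn.Module):
    def __init__(self, enc_dim: int, llm_dim: int, n_query: int = 64, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_query, enc_dim) * 0.02)
        self.xattn = nn.MultiheadAttention(enc_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, enc_out):                       # enc_out: (B, T, enc_dim)
        q = self.queries.unsqueeze(0).expand(enc_out.size(0), -1, -1)
        fused, _ = self.xattn(q, enc_out, enc_out)    # queries attend to speech
        return self.proj(fused)                       # (B, n_query, llm_dim)

conn = QFormerConnector(enc_dim=512, llm_dim=4096)
speech = torch.randn(2, 1500, 512)                    # e.g. Whisper encoder output
print(conn(speech).shape)                             # torch.Size([2, 64, 4096])
```

The output sequence would be prepended (or interleaved) with text token embeddings before being fed to the LLM; a fully connected connector simply replaces the cross-attention with a per-frame linear map.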

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

  • paper_url: http://arxiv.org/abs/2309.13876
  • repo_url: https://github.com/espnet/espnet
  • paper_authors: Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe
  • for: Developing an open Whisper-style speech model so that researchers can train and improve such models using an open-source toolkit and publicly available data.
  • methods: Reproducing Whisper-style training with an open-source toolkit and publicly available data, with support for more translation directions.
  • results: Good recognition and translation quality even in zero-shot settings, with improved training efficiency and stability; all scripts, pre-trained models and training logs are released.
    Abstract Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science.

cs.LG - 2023-09-25

Understanding the Structure of QM7b and QM9 Quantum Mechanical Datasets Using Unsupervised Learning

  • paper_url: http://arxiv.org/abs/2309.15130
  • repo_url: None
  • paper_authors: Julio J. Valdés, Alain B. Tchagang
  • for: Studying the internal structure of the QM7b and QM9 quantum mechanics datasets, which comprise several thousands of organic molecules described by electronic properties; understanding the structure of such data is important when predicting atomic composition from properties in inverse molecular design.
  • methods: Intrinsic dimension analysis, clustering, and outlier detection. QM7b forms well-defined clusters directly related to atomic composition, whereas QM9 consists of an outer region composed mostly of outliers and an inner core of clustered inlier objects; in both datasets there is a direct relationship with molecule size.
  • results: Despite the structural differences between the two datasets, the predictability of variables of interest for inverse molecular design is high, as exemplified with models estimating the number of atoms of the molecule from both the original properties and from lower-dimensional embedding spaces.
    Abstract This paper explores the internal structure of two quantum mechanics datasets (QM7b, QM9), composed of several thousands of organic molecules and described in terms of electronic properties. Understanding the structure and characteristics of this kind of data is important when predicting the atomic composition from the properties in inverse molecular designs. Intrinsic dimension analysis, clustering, and outlier detection methods were used in the study. They revealed that for both datasets the intrinsic dimensionality is several times smaller than the descriptive dimensions. The QM7b data is composed of well defined clusters related to atomic composition. The QM9 data consists of an outer region predominantly composed of outliers, and an inner core region that concentrates clustered, inlier objects. A significant relationship exists between the number of atoms in the molecule and its outlier/inner nature. Despite the structural differences, the predictability of variables of interest for inverse molecular design is high. This is exemplified with models estimating the number of atoms of the molecule from both the original properties, and from lower dimensional embedding spaces.
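
The analysis toolkit described above maps onto standard scikit-learn components. A hedged sketch with synthetic data standing in for the molecular descriptors follows; the PCA-based dimension estimate is a crude proxy for the intrinsic-dimension estimators a study like this would use.

```python
# Sketch of the unsupervised toolkit: a crude PCA-based intrinsic-dimension
# estimate, k-means clustering, and LOF outlier flags. Synthetic data only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16)) @ rng.normal(size=(16, 40))  # low-dim data in 40-D

pca = PCA().fit(X)
intrinsic_dim = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95)) + 1
print("descriptive dim:", X.shape[1], "| ~intrinsic dim:", intrinsic_dim)

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
is_inlier = LocalOutlierFactor(n_neighbors=20).fit_predict(X)  # -1 = outlier
print("outlier fraction:", float((is_inlier == -1).mean()))
```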

Towards a statistical theory of data selection under weak supervision

  • paper_url: http://arxiv.org/abs/2309.14563
  • repo_url: None
  • paper_authors: Germain Kolossov, Andrea Montanari, Pulkit Tandon
  • for: Selecting a subsample smaller than the original sample size, to reduce the computational burden of data preprocessing and machine learning.
  • methods: Using a surrogate model to predict labels for the full sample, then selecting a subsample on which the final model is trained.
  • results: Data selection can be very effective, in some cases even beating training on the full sample; by contrast, some popular selection schemes (such as unbiased reweighted subsampling or influence-function-based subsampling) can perform poorly.
    Abstract Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$.
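
The surrogate-based scheme in the bullets above can be sketched as: fit a cheap model, score every point, keep a subsample, and train the final model on it. The selection rule below (keep low-margin points) is just one of many rules a study like this analyses, and the data is synthetic.

```python
# Sketch of surrogate-guided data selection: a cheap "weak" model scores the
# full sample, and the final model trains on a small subsample chosen by that
# score. Rule, sizes, and data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, n = 10000, 1000
X = rng.normal(size=(N, 20))
y = (X @ rng.normal(size=20) + 0.5 * rng.normal(size=N) > 0).astype(int)

surrogate = LogisticRegression().fit(X[:500], y[:500])    # weak-supervision proxy
margin = np.abs(surrogate.predict_proba(X)[:, 1] - 0.5)
subset = np.argsort(margin)[:n]                           # keep the n hardest points

final = LogisticRegression().fit(X[subset], y[subset])
print("subsample-trained model accuracy:", final.score(X, y))
```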

Disruption Detection for a Cognitive Digital Supply Chain Twin Using Hybrid Deep Learning

  • paper_url: http://arxiv.org/abs/2309.14557
  • repo_url: None
  • paper_authors: Mahmoud Ashraf, Amr Eltawil, Islam Ali
  • For: The paper aims to provide an effective and efficient tool for mitigating the impact of disruptive events on global supply chains by introducing a hybrid deep learning approach for disruption detection within a cognitive digital supply chain twin framework.
  • Methods: The proposed approach uses a deep autoencoder neural network combined with a one-class support vector machine algorithm to detect disruptions in real time. Long short-term memory neural network models are also developed to identify the disrupted echelon and predict time-to-recovery from the disruption effect.
  • Results: The proposed approach can help decision-makers and supply chain practitioners make appropriate decisions aimed at minimizing the negative impact of disruptive events, based on real-time disruption detection data. The results demonstrate the trade-off between disruption detection model sensitivity, the delay encountered in disruption detection, and false alarms.
    Abstract Purpose: Recent disruptive events, such as COVID-19 and the Russia-Ukraine conflict, have had a significant impact on global supply chains. Digital supply chain twins have been proposed in order to provide decision makers with an effective and efficient tool to mitigate disruption impact. Methods: This paper introduces a hybrid deep learning approach for disruption detection within a cognitive digital supply chain twin framework to enhance supply chain resilience. The proposed disruption detection module utilises a deep autoencoder neural network combined with a one-class support vector machine algorithm. In addition, long-short term memory neural network models are developed to identify the disrupted echelon and predict time-to-recovery from the disruption effect. Results: The obtained information from the proposed approach will help decision-makers and supply chain practitioners make appropriate decisions aiming at minimizing the negative impact of disruptive events based on real-time disruption detection data. The results demonstrate the trade-off between disruption detection model sensitivity, the delay encountered in disruption detection, and false alarms. This approach has seldom been used in recent literature addressing this issue.
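
A compact sketch of the detection module under stated assumptions: the autoencoder is trained only on normal-period data, and a one-class SVM on its latent codes flags departures. Dimensions, training loop, and the synthetic data are illustrative, not the paper's configuration.

```python
# Sketch: autoencoder fit on normal operating data; one-class SVM on its
# latent codes flags disrupted periods as outliers.
import torch
import torch.nn as nn
from sklearn.svm import OneClassSVM

torch.manual_seed(0)
normal = torch.randn(1000, 12)          # stand-in for supply chain KPIs

ae = nn.Sequential(nn.Linear(12, 4), nn.ReLU(), nn.Linear(4, 12))
opt = torch.optim.Adam(ae.parameters(), lr=1e-2)
for _ in range(200):                    # learn to reconstruct normal behaviour
    opt.zero_grad()
    loss = nn.functional.mse_loss(ae(normal), normal)
    loss.backward()
    opt.step()

encoder = ae[:2]                        # keep the bottleneck part
with torch.no_grad():
    z = encoder(normal).numpy()
ocsvm = OneClassSVM(nu=0.05).fit(z)     # +1 = normal, -1 = disruption flag

with torch.no_grad():
    z_new = encoder(torch.randn(5, 12) * 5).numpy()  # anomalous-looking batch
print(ocsvm.predict(z_new))
```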

  • paper_url: http://arxiv.org/abs/2309.14541
  • repo_url: None
  • paper_authors: Haokun Song, Rui Lin, Andrea Sgambelluri, Filippo Cugini, Yajie Li, Jie Zhang, Paolo Monti
  • for: Detecting and locating eavesdropping events in optical line systems.
  • methods: A cluster-based method for detecting and localizing eavesdropping events.
  • results: Eavesdropping events causing only small power losses can be detected from optical performance monitoring data collected at the receiver, while in-line monitoring data enables effective localization of such events.
    Abstract We propose a cluster-based method to detect and locate eavesdropping events in optical line systems characterized by small power losses. Our findings indicate that detecting such subtle losses from eavesdropping can be accomplished solely through optical performance monitoring (OPM) data collected at the receiver. On the other hand, the localization of such events can be effectively achieved by leveraging in-line OPM data.

Detach-ROCKET: Sequential feature selection for time series classification with random convolutional kernels

  • paper_url: http://arxiv.org/abs/2309.14518
  • repo_url: https://github.com/gon-uri/detach_rocket
  • paper_authors: Gonzalo Uribarri, Federico Barone, Alessio Ansuini, Erik Fransén
  • for: Proposing Sequential Feature Detachment (SFD), a method for removing non-informative features in time series classification (TSC), improving efficiency and generalization.
  • methods: ROCKET models and variants, with model coefficients used to estimate feature importance so that SFD can sequentially detach uninformative features.
  • results: On the UCR archive, SFD yields models using roughly 10% of the original features while improving test accuracy by 0.2%; the end-to-end Detach-ROCKET procedure improves test accuracy on the largest binary UCR dataset by 0.6% while removing 98.9% of the features.
    Abstract Time Series Classification (TSC) is essential in many fields, such as medicine, environmental science and finance, enabling tasks like disease diagnosis, anomaly detection, and stock price analysis. Machine learning models for TSC like Recurrent Neural Networks and InceptionTime, while successful in numerous applications, can face scalability limitations due to intensive computational requirements. To address this, efficient models such as ROCKET and its derivatives have emerged, simplifying training and achieving state-of-the-art performance by utilizing a large number of randomly generated features from time series data. However, due to their random nature, most of the generated features are redundant or non-informative, adding unnecessary computational load and compromising generalization. Here, we introduce Sequential Feature Detachment (SFD) as a method to identify and prune these non-essential features. SFD uses model coefficients to estimate feature importance and, unlike previous algorithms, can handle large feature sets without the need for complex hyperparameter tuning. Testing on the UCR archive demonstrates that SFD can produce models with $10\%$ of the original features while improving the accuracy by $0.2\%$ on the test set. We also present an end-to-end procedure for determining an optimal balance between the number of features and model accuracy, called Detach-ROCKET. When applied to the largest binary UCR dataset, Detach-ROCKET is able to improve test accuracy by $0.6\%$ while reducing the number of features by $98.9\%$. Thus, our proposed procedure is not only lightweight to train and effective in reducing model size and enhancing generalization, but its significant reduction in feature count also paves the way for feature interpretation.
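
A hedged sketch of the detachment loop: refit a linear classifier on the surviving features and repeatedly drop the fraction with the smallest absolute coefficients. The pruning schedule, classifier, and toy data below are our own choices, not the paper's exact procedure.

```python
# Sketch of Sequential Feature Detachment on ROCKET-style features: prune
# the least important features by |coefficient| in repeated rounds.
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1000))                 # stand-in for random-kernel features
y = (X[:, :5].sum(axis=1) > 0).astype(int)       # only a few features matter

active = np.arange(X.shape[1])
for _ in range(10):                              # detach 25% of survivors per round
    clf = RidgeClassifier().fit(X[:, active], y)
    importance = np.abs(clf.coef_).ravel()
    keep = importance.argsort()[int(0.25 * len(active)):]
    active = active[np.sort(keep)]

print(f"kept {len(active)} / {X.shape[1]} features "
      f"({100 * len(active) / X.shape[1]:.1f}%)")
```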
    摘要 时间序列分类(TSC)在医学、环境科学和金融等领域具有重要意义,可以实现疾病诊断、异常检测和股票价格分析等任务。用于TSC的机器学习模型,如循环神经网络和InceptionTime,虽然在许多应用中取得成功,但由于计算需求高,可能面临可扩展性限制。为解决这一问题,出现了ROCKET及其衍生模型等高效模型,它们通过从时间序列数据中生成大量随机特征来简化训练并实现最先进的性能。然而,由于其随机性,大多数生成的特征是冗余的或无信息量的,这会增加不必要的计算负担并损害泛化能力。在此,我们提出顺序特征分离(SFD)方法,用于识别并剪除这些非必要特征。SFD利用模型系数估计特征重要性,与以往算法不同,它无需复杂的超参数调整即可处理大规模特征集。在UCR档案库上的测试表明,SFD可以生成仅保留原始特征10%的模型,同时将测试集准确率提高0.2%。我们还提出了一个端到端流程,用于确定特征数量与模型准确率之间的最佳平衡,称为Detach-ROCKET。将其应用于最大的二分类UCR数据集时,Detach-ROCKET在减少98.9%特征的同时将测试准确率提高0.6%。因此,我们提出的流程不仅训练轻量、能有效缩小模型规模并增强泛化能力,其对特征数量的大幅削减也为特征解释铺平了道路。
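
A minimal sketch of the detachment loop, assuming a ridge classifier over random features as a stand-in for a fitted ROCKET pipeline (the drop schedule and toy data are illustrative, not the paper's exact procedure; see the linked repo for the reference implementation):

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

def sequential_feature_detachment(X, y, keep_ratio=0.1, drop_per_step=0.05):
    """Iteratively retrain a linear classifier and detach the least
    important features (smallest |coefficient|) until keep_ratio remains."""
    active = np.arange(X.shape[1])
    while len(active) > keep_ratio * X.shape[1]:
        clf = RidgeClassifier().fit(X[:, active], y)
        importance = np.abs(clf.coef_).sum(axis=0)   # aggregate over classes
        n_drop = max(1, int(drop_per_step * len(active)))
        active = active[np.argsort(importance)[n_drop:]]  # drop the smallest
    return active

# Toy stand-in for ROCKET features: 2000 random features, only 5 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2000))
y = (X[:, :5].sum(axis=1) > 0).astype(int)
kept = sequential_feature_detachment(X, y)
print(len(kept), "features kept;", np.sum(kept < 5), "of the 5 informative ones survive")
```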

Zeroth-order Riemannian Averaging Stochastic Approximation Algorithms

  • paper_url: http://arxiv.org/abs/2309.14506
  • repo_url: None
  • paper_authors: Jiaxiang Li, Krishnakumar Balasubramanian, Shiqian Ma
  • for: 这个论文研究黎曼流形上的随机优化问题。
  • methods: 这个论文提出了零阶黎曼平均随机逼近(\texttt{Zo-RASA})算法,采用黎曼移动平均随机梯度估计器,并使用一种新颖的黎曼-Lyapunov分析技术进行收敛分析。
  • results: 这个论文表明,\texttt{Zo-RASA}算法在每次迭代中仅使用单个样本或常数阶批量,即可生成$\epsilon$-近似一阶驻点解。此外,论文还引入了一个新的几何条件,即第二基本形式有界,可用于给出用向量传输近似平行传输的误差界。
    Abstract We present Zeroth-order Riemannian Averaging Stochastic Approximation (\texttt{Zo-RASA}) algorithms for stochastic optimization on Riemannian manifolds. We show that \texttt{Zo-RASA} achieves optimal sample complexities for generating $\epsilon$-approximation first-order stationary solutions using only one-sample or constant-order batches in each iteration. Our approach employs Riemannian moving-average stochastic gradient estimators, and a novel Riemannian-Lyapunov analysis technique for convergence analysis. We improve the algorithm's practicality by using retractions and vector transport, instead of exponential mappings and parallel transports, thereby reducing per-iteration complexity. Additionally, we introduce a novel geometric condition, satisfied by manifolds with bounded second fundamental form, which enables new error bounds for approximating parallel transport with vector transport.
    摘要 我们提出了零阶黎曼平均随机逼近(\texttt{Zo-RASA})算法,用于黎曼流形上的随机优化。我们证明,\texttt{Zo-RASA}在每次迭代中仅使用单个样本或常数阶批量,即可在生成$\epsilon$-近似一阶驻点解时达到最优的样本复杂度。我们的方法采用黎曼移动平均随机梯度估计器,并使用一种新颖的黎曼-Lyapunov分析技术进行收敛分析。我们通过使用收缩映射(retraction)和向量传输来代替指数映射和平行传输,降低了每次迭代的复杂度,从而提高了算法的实用性。此外,我们还引入了一个新的几何条件——由第二基本形式有界的流形所满足——它为用向量传输近似平行传输给出了新的误差界。
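
A simplified sketch of the core ingredients on the unit sphere, assuming a two-point zeroth-order estimator and a normalization retraction (the full Zo-RASA additionally maintains Riemannian moving averages of these estimates; the test function is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def retract(x, v):
    """Retraction on the unit sphere: step in the tangent direction, renormalize."""
    y = x + v
    return y / np.linalg.norm(y)

def zo_riemannian_grad(f, x, mu=1e-4):
    """Two-point zeroth-order estimate of the Riemannian gradient at x."""
    v = rng.normal(size=x.shape)
    u = v - np.dot(v, x) * x            # project onto the tangent space at x
    u /= np.linalg.norm(u)
    return (f(retract(x, mu * u)) - f(x)) / mu * u

# Minimize f(x) = x^T A x on the sphere; the minimum is the smallest eigenvalue.
A = np.diag([3.0, 2.0, 0.5])
f = lambda x: x @ A @ x
x = rng.normal(size=3); x /= np.linalg.norm(x)
for _ in range(3000):
    x = retract(x, -0.01 * zo_riemannian_grad(f, x))
print("f(x) =", f(x), "(approximately 0.5)")
```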

Uncertainty Aware Deep Learning for Particle Accelerators

  • paper_url: http://arxiv.org/abs/2309.14502
  • repo_url: None
  • paper_authors: Kishansingh Rajput, Malachi Schram, Karthik Somayaji
  • for: 这篇论文旨在提出一种使用深度高斯过程近似(DGPA)方法进行异常束流预测和不确定性估计的方法。
  • methods: 这篇论文使用了深度高斯过程近似(DGPA)方法,该方法在捕捉复杂系统动态的同时,还能给出与距离相关的不确定性估计。
  • results: 这篇论文在SNS加速器上进行了异常束流预测(分类),并为FNAL Booster加速器复合体提供了一个具备不确定性感知的代理模型(回归)。
    Abstract Standard deep learning models for classification and regression applications are ideal for capturing complex system dynamics. However, their predictions can be arbitrarily inaccurate when the input samples are not similar to the training data. Implementation of distance aware uncertainty estimation can be used to detect these scenarios and provide a level of confidence associated with their predictions. In this paper, we present results from using Deep Gaussian Process Approximation (DGPA) methods for errant beam prediction at Spallation Neutron Source (SNS) accelerator (classification) and we provide an uncertainty aware surrogate model for the Fermi National Accelerator Lab (FNAL) Booster Accelerator Complex (regression).
    摘要 用于分类和回归应用的标准深度学习模型非常适合捕捉复杂系统的动态。然而,当输入样本与训练数据不相似时,其预测可能出现任意程度的偏差。实现距离感知的不确定性估计可以检测这类情形,并为预测提供相应的置信度。在本文中,我们展示了使用深度高斯过程近似(DGPA)方法在散裂中子源(SNS)加速器上进行异常束流预测(分类)的结果,并为费米国家加速器实验室(FNAL)Booster加速器复合体提供了一个具备不确定性感知的代理模型(回归)。

Unveiling the Potential of Deep Learning Models for Solar Flare Prediction in Near-Limb Regions

  • paper_url: http://arxiv.org/abs/2309.14483
  • repo_url: None
  • paper_authors: Chetraj Pandey, Rafal A. Angryk, Berkay Aydin
  • for: 本研究旨在评估深度学习模型在预测$\geq$M级太阳耀斑(预测窗口为24小时)方面的性能,使用每小时采样的全日面视线方向磁图,并特别关注太阳盘面±70°以外的近边缘区域中常被忽视的耀斑事件。
  • methods: 我们通过迁移学习训练了三种著名的深度学习架构——AlexNet、VGG16和ResNet34,并使用真实技能统计(TSS)和海德克技能分数(HSS)对三个模型进行了比较和评估,同时计算了召回率,以理解模型在中心和近边缘区域的预测灵敏度。
  • results: 我们的研究发现,基于AlexNet的模型整体性能最高,其TSS和HSS分别约为0.53和0.37;对召回率的空间分析进一步显示,对于近边缘事件,基于VGG16和ResNet34的模型具有更高的预测灵敏度。其中基于ResNet34的模型对近边缘耀斑的结果最佳,平均召回率约为0.59(X级和M级耀斑的召回率分别为0.81和0.56)。
    Abstract This study aims to evaluate the performance of deep learning models in predicting $\geq$M-class solar flares with a prediction window of 24 hours, using hourly sampled full-disk line-of-sight (LoS) magnetogram images, particularly focusing on the often overlooked flare events corresponding to the near-limb regions (beyond $\pm$70$^{\circ}$ of the solar disk). We trained three well-known deep learning architectures--AlexNet, VGG16, and ResNet34 using transfer learning and compared and evaluated the overall performance of our models using true skill statistics (TSS) and Heidke skill score (HSS) and computed recall scores to understand the prediction sensitivity in central and near-limb regions for both X- and M-class flares. The following points summarize the key findings of our study: (1) The highest overall performance was observed with the AlexNet-based model, which achieved an average TSS$\sim$0.53 and HSS$\sim$0.37; (2) Further, a spatial analysis of recall scores disclosed that for the near-limb events, the VGG16- and ResNet34-based models exhibited superior prediction sensitivity. The best results, however, were seen with the ResNet34-based model for the near-limb flares, where the average recall was approximately 0.59 (the recall for X- and M-class was 0.81 and 0.56 respectively) and (3) Our research findings demonstrate that our models are capable of discerning complex spatial patterns from full-disk magnetograms and exhibit skill in predicting solar flares, even in the vicinity of near-limb regions. This ability holds substantial importance for operational flare forecasting systems.
    摘要 本研究旨在评估深度学习模型在预测24小时窗口内$\geq$M级太阳耀斑方面的性能,使用每小时采样的全日面视线方向(LoS)磁图,并特别关注近边缘区域(超出太阳盘面±70°)中常被忽视的耀斑事件。我们通过迁移学习训练了三种著名的深度学习架构——AlexNet、VGG16和ResNet34,并使用真实技能统计(TSS)和海德克技能分数(HSS)对模型整体性能进行比较和评估,同时计算了召回率,以理解模型在中心和近边缘区域对X级和M级耀斑的预测灵敏度。研究的主要发现包括:(1)基于AlexNet的模型整体性能最高,平均TSS约为0.53,HSS约为0.37;(2)对召回率的空间分析显示,对于近边缘事件,基于VGG16和ResNet34的模型表现出更高的预测灵敏度,其中基于ResNet34的模型对近边缘耀斑的结果最佳,平均召回率约为0.59(X级和M级耀斑的召回率分别为0.81和0.56);(3)研究结果表明,我们的模型能够从全日面磁图中识别复杂的空间模式,即使在近边缘区域附近也展现出预测太阳耀斑的能力,这对实际运行的耀斑预报系统具有重要意义。
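
For reference, the two skill scores used above can be computed directly from a 2x2 confusion matrix; the counts below are hypothetical:

```python
import numpy as np

def tss_hss(tp, fp, fn, tn):
    """True Skill Statistic and Heidke Skill Score from a 2x2 confusion matrix."""
    tss = tp / (tp + fn) - fp / (fp + tn)
    hss = 2 * (tp * tn - fp * fn) / (
        (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))
    return tss, hss

# Example: made-up counts for a flare / no-flare forecast.
tss, hss = tss_hss(tp=120, fp=90, fn=40, tn=700)
print(f"TSS={tss:.2f}, HSS={hss:.2f}")
```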

LogGPT: Log Anomaly Detection via GPT

  • paper_url: http://arxiv.org/abs/2309.14482
  • repo_url: None
  • paper_authors: Xiao Han, Shuhan Yuan, Mohamed Trabelsi
  • for: 这篇研究旨在提出一个基于日志数据的系统异常检测方法,以确保计算机系统的安全性和可靠性。
  • methods: 本研究使用深度学习模型进行日志异常检测,具体来说是将日志序列视为自然语言,然后运用深度序列模型(例如LSTM或Transformer),通过语言建模来编码日志序列中的正常模式。
  • results: 实验结果显示,LogGPT在三个数据集上表现出色,检测精度高于现有的最先进方法。
    Abstract Detecting system anomalies based on log data is important for ensuring the security and reliability of computer systems. Recently, deep learning models have been widely used for log anomaly detection. The core idea is to model the log sequences as natural language and adopt deep sequential models, such as LSTM or Transformer, to encode the normal patterns in log sequences via language modeling. However, there is a gap between language modeling and anomaly detection as the objective of training a sequential model via a language modeling loss is not directly related to anomaly detection. To fill up the gap, we propose LogGPT, a novel framework that employs GPT for log anomaly detection. LogGPT is first trained to predict the next log entry based on the preceding sequence. To further enhance the performance of LogGPT, we propose a novel reinforcement learning strategy to finetune the model specifically for the log anomaly detection task. The experimental results on three datasets show that LogGPT significantly outperforms existing state-of-the-art approaches.
    摘要 基于日志数据检测计算机系统中的异常,对保障计算机系统的安全性和可靠性十分重要。最近,深度学习模型在日志异常检测中得到了广泛应用。其核心思想是将日志序列视为自然语言,并采用深度序列模型(如LSTM或Transformer),通过语言建模来编码日志序列中的正常模式。但是,语言建模和异常检测之间存在差距,因为通过语言建模损失训练序列模型的目标与异常检测并不直接相关。为填补这一差距,我们提出了LogGPT,一种采用GPT进行日志异常检测的新框架。LogGPT首先被训练为基于前序序列预测下一条日志条目。为进一步提高LogGPT的性能,我们提出了一种新的强化学习策略,专门针对日志异常检测任务对模型进行微调。在三个数据集上的实验结果表明,LogGPT显著优于现有的最先进方法。
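
A toy sketch of the top-k next-entry test behind this style of detection, with a bigram model standing in for the trained GPT (event names and the threshold are made up):

```python
from collections import defaultdict

# Stand-in for a GPT over log-event tokens: a bigram model of next-event counts.
# (LogGPT trains a Transformer with the same next-entry prediction objective.)
train = [["login", "query", "query", "logout"]] * 50 + [["login", "update", "logout"]] * 50

counts = defaultdict(lambda: defaultdict(int))
for seq in train:
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1

def anomalous(seq, top_k=2):
    """Flag a sequence if any next event falls outside the model's top-k predictions."""
    for a, b in zip(seq, seq[1:]):
        ranked = sorted(counts[a], key=counts[a].get, reverse=True)
        if b not in ranked[:top_k]:
            return True
    return False

print(anomalous(["login", "query", "logout"]))   # False: normal pattern
print(anomalous(["login", "delete", "logout"]))  # True: unseen transition
```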

Skilog: A Smart Sensor System for Performance Analysis and Biofeedback in Ski Jumping

  • paper_url: http://arxiv.org/abs/2309.14455
  • repo_url: None
  • paper_authors: Lukas Schulthess, Thorir Mar Ingolfsson, Marc Nölke, Michele Magno, Luca Benini, Christoph Leitner
  • for: 这份研究是为了开发一个智能、紧凑、节能的无线传感系统,用于在跳台滑雪现场进行实时性能分析和生物反馈。
  • methods: 本研究使用了100Hz测量脚压的方法,并使用Machine Learning(ML)模型来实现实时的反馈。
  • results: 研究在重心预测(背侧偏移、中立站姿和腹侧偏移)上获得了92.7%的准确率,并在低功耗的RISC-V架构上实现了实时推理和反馈(每次推理0.0109毫秒)。
    Abstract In ski jumping, low repetition rates of jumps limit the effectiveness of training. Thus, increasing learning rate within every single jump is key to success. A critical element of athlete training is motor learning, which has been shown to be accelerated by feedback methods. In particular, a fine-grained control of the center of gravity in the in-run is essential. This is because the actual takeoff occurs within a blink of an eye ($\sim$300ms), thus any unbalanced body posture during the in-run will affect flight. This paper presents a smart, compact, and energy-efficient wireless sensor system for real-time performance analysis and biofeedback during ski jumping. The system operates by gauging foot pressures at three distinct points on the insoles of the ski boot at 100Hz. Foot pressure data can either be directly sent to coaches to improve their feedback, or fed into a ML model to give athletes instantaneous in-action feedback using a vibration motor in the ski boot. In the biofeedback scenario, foot pressures act as input variables for an optimized XGBoost model. We achieve a high predictive accuracy of 92.7% for center of mass predictions (dorsal shift, neutral stand, ventral shift). Subsequently, we parallelized and fine-tuned our XGBoost model for a RISC-V based low power parallel processor (GAP9), based on the PULP architecture. We demonstrate real-time detection and feedback (0.0109ms/inference) using our on-chip deployment. The proposed smart system is unobtrusive with a slim form factor (13mm baseboard, 3.2mm antenna) and a lightweight build (26g). Power consumption analysis reveals that the system's energy-efficient design enables sustained operation over multiple days (up to 300 hours) without requiring recharge.
    摘要 在跳台滑雪中,较低的跳跃重复次数限制了训练效果,因此提高每一次跳跃中的学习效率是成功的关键。运动员训练的一个核心要素是动作学习,而反馈方法已被证明可以加速动作学习。尤其重要的是在助滑阶段对重心进行精细控制,因为实际起跳在一眨眼之间完成(约300毫秒),助滑中任何不平衡的身体姿态都会影响飞行。本文提出了一种智能、紧凑且节能的无线传感系统,用于跳台滑雪中的实时性能分析和生物反馈。该系统以100Hz的频率测量滑雪靴鞋垫上三个不同位置的足底压力。足压数据既可以直接发送给教练以改进其反馈,也可以输入机器学习模型,通过滑雪靴中的振动马达为运动员提供即时的动作中反馈。在生物反馈场景中,足压作为优化后的XGBoost模型的输入变量。我们在重心预测(背侧偏移、中立站姿、腹侧偏移)上实现了92.7%的高预测准确率。随后,我们将XGBoost模型并行化,并针对基于PULP架构的RISC-V低功耗并行处理器(GAP9)进行了微调。我们通过片上部署展示了实时检测与反馈(每次推理0.0109毫秒)。该智能系统外形纤薄(基板13毫米、天线3.2毫米)、重量轻(26克),不会干扰运动。功耗分析表明,其节能设计可支持连续多天运行(最长300小时)而无需充电。
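
A hedged sketch of the classification stage, assuming synthetic three-point insole pressures and a made-up class structure (the real system uses recorded in-run data and a tuned, parallelized model):

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)

# Synthetic insole data: pressure at 3 points (toe, mid, heel), sampled at 100 Hz.
# Class 0 = dorsal shift, 1 = neutral stance, 2 = ventral shift (invented here).
n = 900
shift = rng.integers(0, 3, n)                    # ground-truth posture class
toe  = rng.normal(40 + 10 * (shift == 2), 5, n)  # ventral shift loads the toe
heel = rng.normal(40 + 10 * (shift == 0), 5, n)  # dorsal shift loads the heel
mid  = rng.normal(40, 5, n)
X = np.column_stack([toe, mid, heel])

model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X[:700], shift[:700])
print("held-out accuracy:", (model.predict(X[700:]) == shift[700:]).mean())
```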

Learning dislocation dynamics mobility laws from large-scale MD simulations

  • paper_url: http://arxiv.org/abs/2309.14450
  • repo_url: None
  • paper_authors: Nicolas Bertin, Vasily V. Bulatov, Fei Zhou
  • for: 研究金属塑性的 mesoscale 模型 - 粗化 atomistic 动力学中的扭轧动力学
  • methods: 使用 machine learning (ML) 框架,通过 graph neural networks (GNN) 模型来自动化扭轧动力学的开发
  • results: 在 BCC 钨中示出了准确地复制了真实 MD simulations 中的压缩/张力偏见,并且在低刺激速度下预测了流体压力, demonstrating 了方法的能力学习扭轧动力学的Physics。
    Abstract The computational method of discrete dislocation dynamics (DDD), used as a coarse-grained model of true atomistic dynamics of lattice dislocations, has become of powerful tool to study metal plasticity arising from the collective behavior of dislocations. As a mesoscale approach, motion of dislocations in the DDD model is prescribed via the mobility law; a function which specifies how dislocation lines should respond to the driving force. However, the development of traditional hand-crafted mobility laws can be a cumbersome task and may involve detrimental simplifications. Here we introduce a machine-learning (ML) framework to streamline the development of data-driven mobility laws which are modeled as graph neural networks (GNN) trained on large-scale Molecular Dynamics (MD) simulations of crystal plasticity. We illustrate our approach on BCC tungsten and demonstrate that our GNN mobility implemented in large-scale DDD simulations accurately reproduces the challenging tension/compression asymmetry observed in ground-truth MD simulations while correctly predicting the flow stress at lower straining rate conditions unseen during training, thereby demonstrating the ability of our method to learn relevant dislocation physics. Our DDD+ML approach opens new promising avenues to improve fidelity of the DDD model and to incorporate more complex dislocation motion behaviors in an automated way, providing a faithful proxy for dislocation dynamics several orders of magnitude faster than ground-truth MD simulations.
    摘要 离散位错动力学(DDD)计算方法作为晶格位错真实原子动力学的粗粒化模型,已成为研究由位错集体行为引起的金属塑性的有力工具。作为一种介观尺度方法,DDD模型中位错的运动由迁移率定律规定,该函数指定位错线应如何响应驱动力。然而,传统手工构建迁移率定律可能十分繁琐,并且可能引入有害的简化。在此,我们引入一种机器学习(ML)框架,以简化数据驱动迁移率定律的开发:这些定律被建模为图神经网络(GNN),并在大规模的晶体塑性分子动力学(MD)模拟上训练。我们在体心立方(BCC)钨上演示了该方法,并证明在大规模DDD模拟中实现的GNN迁移率定律能够准确复现真实MD模拟中观察到的具有挑战性的拉伸/压缩不对称,同时在训练中未见过的较低应变速率条件下正确预测流动应力,从而展示了该方法学习相关位错物理的能力。我们的DDD+ML方法为提高DDD模型的保真度、以自动化的方式纳入更复杂的位错运动行为开辟了新的前景,提供了一个比真实MD模拟快几个数量级的位错动力学忠实代理。

On the expressivity of embedding quantum kernels

  • paper_url: http://arxiv.org/abs/2309.14419
  • repo_url: None
  • paper_authors: Elies Gil-Fuster, Jens Eisert, Vedran Dunjko
  • for: 这个论文的目的是研究量子机器学习和经典机器学习之间的自然连接,特别是在内核方法上。
  • methods: 这篇论文使用的方法包括量子特征状态的构造和嵌入量子内核。
  • results: 这篇论文的结果是证明任何量子内核都可以表示为量子特征态的内积,并对高效嵌入量子内核的普适性进行了形式化研究,同时指出了一些新的、尚未探索的量子内核家族,它们是否对应于高效的嵌入量子内核仍有待研究。
    Abstract One of the most natural connections between quantum and classical machine learning has been established in the context of kernel methods. Kernel methods rely on kernels, which are inner products of feature vectors living in large feature spaces. Quantum kernels are typically evaluated by explicitly constructing quantum feature states and then taking their inner product, here called embedding quantum kernels. Since classical kernels are usually evaluated without using the feature vectors explicitly, we wonder how expressive embedding quantum kernels are. In this work, we raise the fundamental question: can all quantum kernels be expressed as the inner product of quantum feature states? Our first result is positive: Invoking computational universality, we find that for any kernel function there always exists a corresponding quantum feature map and an embedding quantum kernel. The more operational reading of the question is concerned with efficient constructions, however. In a second part, we formalize the question of universality of efficient embedding quantum kernels. For shift-invariant kernels, we use the technique of random Fourier features to show that they are universal within the broad class of all kernels which allow a variant of efficient Fourier sampling. We then extend this result to a new class of so-called composition kernels, which we show also contains projected quantum kernels introduced in recent works. After proving the universality of embedding quantum kernels for both shift-invariant and composition kernels, we identify the directions towards new, more exotic, and unexplored quantum kernel families, for which it still remains open whether they correspond to efficient embedding quantum kernels.
    摘要 量子机器学习与经典机器学习之间最自然的联系之一是在核方法的背景下建立的。核方法依赖于核,即生活在高维特征空间中的特征向量的内积。量子核通常通过显式构造量子特征态并取其内积来求值,这里称为嵌入量子核。由于经典核通常无需显式使用特征向量即可求值,我们不禁要问嵌入量子核的表达能力有多强。在这项工作中,我们提出一个基本问题:是否所有量子核都可以表示为量子特征态的内积?我们的第一个结果是肯定的:借助计算普适性,我们发现对于任何核函数,总存在对应的量子特征映射和嵌入量子核。然而,这一问题更具操作性的解读涉及高效的构造。在第二部分中,我们形式化了高效嵌入量子核的普适性问题。对于平移不变核,我们使用随机傅里叶特征技术证明,它们在允许某种高效傅里叶采样变体的所有核构成的广泛类别中是普适的。随后,我们将该结果推广到一类新的所谓复合核,并证明它也包含近期工作中提出的投影量子核。在证明了嵌入量子核对平移不变核和复合核的普适性之后,我们指出了通向更新颖、更奇特、尚未被探索的量子核家族的方向,而这些家族是否对应于高效的嵌入量子核仍是未解问题。
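
A small numerical illustration of what "embedding quantum kernel" means, using a toy angle-encoding feature map simulated with statevectors (the feature map is an assumption for illustration, not one taken from the paper):

```python
import numpy as np

def feature_state(x, n_qubits=3):
    """Toy angle-encoding feature map |phi(x)>: one RY rotation per qubit."""
    state = np.array([1.0])
    for k in range(n_qubits):
        theta = x[k % len(x)]
        qubit = np.array([np.cos(theta / 2), np.sin(theta / 2)])  # RY(theta)|0>
        state = np.kron(state, qubit)
    return state

def embedding_kernel(x, y):
    """Embedding quantum kernel: fidelity |<phi(x)|phi(y)>|^2 of feature states."""
    return abs(np.vdot(feature_state(x), feature_state(y))) ** 2

x, y = np.array([0.3, 1.2]), np.array([0.5, 0.9])
print(embedding_kernel(x, x))  # 1.0: identical feature states
print(embedding_kernel(x, y))  # < 1.0: overlap of distinct states
```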

Provable advantages of kernel-based quantum learners and quantum preprocessing based on Grover’s algorithm

  • paper_url: http://arxiv.org/abs/2309.14406
  • repo_url: None
  • paper_authors: Till Muser, Elias Zapusek, Vasilis Belis, Florentin Reiter
  • for: 该研究目的是提高学习问题的计算效率,特别是应用量子计算机在支持向量机中的速度优势。
  • methods: 该研究使用了Shor的算法和Grover的算法来实现量子支持向量机的速度优势。
  • results: 研究发现,通过在支持向量机的核中使用量子计算,可以获得可证明的加速;此外,将量子预处理与经典分类方法结合使用,可以进一步提高分类器的性能。
    Abstract There is an ongoing effort to find quantum speedups for learning problems. Recently, [Y. Liu et al., Nat. Phys. $\textbf{17}$, 1013--1017 (2021)] have proven an exponential speedup for quantum support vector machines by leveraging the speedup of Shor's algorithm. We expand upon this result and identify a speedup utilizing Grover's algorithm in the kernel of a support vector machine. To show the practicality of the kernel structure we apply it to a problem related to pattern matching, providing a practical yet provable advantage. Moreover, we show that combining quantum computation in a preprocessing step with classical methods for classification further improves classifier performance.
    摘要 寻找学习问题的量子加速是一项持续的努力。最近,[Y. Liu et al., Nat. Phys. $\textbf{17}$, 1013--1017 (2021)]利用Shor算法的加速,证明了量子支持向量机的指数级加速。我们在这一结果的基础上进一步扩展,发现了在支持向量机的核中利用Grover算法获得的加速。为展示这种核结构的实用性,我们将其应用于一个与模式匹配相关的问题,提供了实用且可证明的优势。此外,我们还证明,将预处理步骤中的量子计算与经典分类方法相结合,可以进一步提高分类器的性能。

Tasks Makyth Models: Machine Learning Assisted Surrogates for Tipping Points

  • paper_url: http://arxiv.org/abs/2309.14334
  • repo_url: None
  • paper_authors: Gianluca Fabiani, Nikolaos Evangelou, Tianqi Cui, Juan M. Bello-Rivas, Cristina P. Martin-Linares, Constantinos Siettos, Ioannis G. Kevrekidis
  • for: 这个论文的目的是提出一种基于机器学习的框架,用于探索复杂系统的突变点和罕见事件的可能性。
  • methods: 该论文使用了拟合多元space,神经网络,高斯过程和无方程多尺度模型来实现这个目的。
  • results: 该论文通过使用这些方法对高维时空数据进行压缩, constructions 缩写模型来描述不同的级别的 emergent 动力学,并且可以准确地预测突变点和罕见事件的可能性。
    Abstract We present a machine learning (ML)-assisted framework bridging manifold learning, neural networks, Gaussian processes, and Equation-Free multiscale modeling, for (a) detecting tipping points in the emergent behavior of complex systems, and (b) characterizing probabilities of rare events (here, catastrophic shifts) near them. Our illustrative example is an event-driven, stochastic agent-based model (ABM) describing the mimetic behavior of traders in a simple financial market. Given high-dimensional spatiotemporal data -- generated by the stochastic ABM -- we construct reduced-order models for the emergent dynamics at different scales: (a) mesoscopic Integro-Partial Differential Equations (IPDEs); and (b) mean-field-type Stochastic Differential Equations (SDEs) embedded in a low-dimensional latent space, targeted to the neighborhood of the tipping point. We contrast the uses of the different models and the effort involved in learning them.
    摘要 我们提出了一个机器学习(ML)辅助的框架,它将流形学习、神经网络、高斯过程和无方程多尺度建模结合在一起,用于(a)检测复杂系统涌现行为中的突变点,以及(b)刻画突变点附近罕见事件(此处指灾难性转变)的概率。我们的示例是一个事件驱动的随机智能体模型(ABM),它描述了一个简单金融市场中交易者的模仿行为。基于随机ABM生成的高维时空数据,我们构建了不同尺度上涌现动力学的降阶模型:(a)介观尺度的积分-偏微分方程(IPDEs);以及(b)嵌入于低维潜在空间、针对突变点邻域的平均场型随机微分方程(SDEs)。我们对比了不同模型的用途以及学习它们所需的代价。

pLMFPPred: a novel approach for accurate prediction of functional peptides integrating embedding from pre-trained protein language model and imbalanced learning

  • paper_url: http://arxiv.org/abs/2309.14404
  • repo_url: https://github.com/mnb66/plmfppred
  • paper_authors: Zebin Ma, Yonglin Zou, Xiaobin Huang, Wenjin Yan, Hao Xu, Jiexin Yang, Ying Zhang, Jinqi Huang
  • for: 预测功能肽,即使用人工智能计算策略来快速从蛋白质序列集中鉴别出新的功能肽并确定其不同的功能。
  • methods: 使用蛋白语言模型基于的插入(ESM-2),开发了一种名为pLMFPPred(蛋白语言模型基于功能肽预测器)来预测功能肽和识别 токси肽。同时,使用SMOTE-TOMEK数据合成采样技术和Shapley值基于的特征选择技术来解决数据不均衡问题,降低计算成本。
  • results: 在一个验证的独立测试集上,pLMFPPred实现了精度、接收操作特征曲线值和F1值的0.974、0.99和0.974,分别。 comparative experiments show that pLMFPPred outperforms current methods for predicting functional peptides。实验结果表明,提案的方法(pLMFPPred)可以在预测功能肽方面提供更好的性能,并代表一种新的计算方法。
    Abstract Functional peptides have the potential to treat a variety of diseases. Their good therapeutic efficacy and low toxicity make them ideal therapeutic agents. Artificial intelligence-based computational strategies can help quickly identify new functional peptides from collections of protein sequences and discover their different functions.Using protein language model-based embeddings (ESM-2), we developed a tool called pLMFPPred (Protein Language Model-based Functional Peptide Predictor) for predicting functional peptides and identifying toxic peptides. We also introduced SMOTE-TOMEK data synthesis sampling and Shapley value-based feature selection techniques to relieve data imbalance issues and reduce computational costs. On a validated independent test set, pLMFPPred achieved accuracy, Area under the curve - Receiver Operating Characteristics, and F1-Score values of 0.974, 0.99, and 0.974, respectively. Comparative experiments show that pLMFPPred outperforms current methods for predicting functional peptides.The experimental results suggest that the proposed method (pLMFPPred) can provide better performance in terms of Accuracy, Area under the curve - Receiver Operating Characteristics, and F1-Score than existing methods. pLMFPPred has achieved good performance in predicting functional peptides and represents a new computational method for predicting functional peptides.
    摘要 功能肽具有治疗多种疾病的潜力。其良好的疗效和低毒性使其成为理想的治疗药物。基于人工智能的计算策略可以帮助从蛋白质序列集合中快速鉴别新的功能肽,并发现其不同的功能。我们利用基于蛋白质语言模型的嵌入(ESM-2),开发了名为pLMFPPred(基于蛋白质语言模型的功能肽预测器)的工具,用于预测功能肽和识别有毒肽。我们还引入了SMOTE-TOMEK数据合成采样技术和基于Shapley值的特征选择技术,以缓解数据不平衡问题并降低计算成本。在经过验证的独立测试集上,pLMFPPred的准确率、接收者操作特征曲线下面积和F1分数分别达到0.974、0.99和0.974。对比实验表明,pLMFPPred优于现有的功能肽预测方法。实验结果表明,所提出的方法(pLMFPPred)在准确率、ROC曲线下面积和F1分数方面均优于现有方法,在功能肽预测上取得了良好的性能,代表了一种新的功能肽预测计算方法。
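
A minimal sketch of the imbalance-handling step, assuming random vectors as stand-ins for ESM-2 embeddings and using imbalanced-learn's SMOTETomek (the classifier and data here are illustrative only):

```python
import numpy as np
from imblearn.combine import SMOTETomek
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for ESM-2 peptide embeddings: 1280-dim vectors, heavily imbalanced labels.
X = rng.normal(size=(600, 1280)).astype(np.float32)
y = np.r_[np.ones(60, int), np.zeros(540, int)]   # 10% "functional peptide" class
X[y == 1, :10] += 1.5                             # inject a weak class signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# SMOTE oversamples the minority class; Tomek links clean the class boundary.
X_bal, y_bal = SMOTETomek(random_state=0).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
print("test accuracy:", clf.score(X_te, y_te))
```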

A Unified Framework for Uniform Signal Recovery in Nonlinear Generative Compressed Sensing

  • paper_url: http://arxiv.org/abs/2310.03758
  • repo_url: None
  • paper_authors: Junren Chen, Jonathan Scarlett, Michael K. Ng, Zhaoqiang Liu
  • For: The paper studies the problem of generative compressed sensing (GCS) with nonlinear measurements, and provides uniform recovery guarantees for this problem.
  • Methods: The paper uses a unified framework that combines the observation model with the generative model to derive uniform recovery guarantees for nonlinear GCS. The framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
  • Results: The paper shows that using a single realization of the sensing ensemble and generalized Lasso, all $\mathbf{x}^*\in G(\mathbb{B}_2^k(r))$ can be recovered up to an $\ell_2$-error at most $\epsilon$ using roughly $\tilde{O}({k}/{\epsilon^2})$ samples, with omitted logarithmic factors typically being dominated by $\log L$. This almost coincides with existing non-uniform guarantees up to logarithmic factors, indicating that the uniformity comes at a very small cost.
    Abstract In generative compressed sensing (GCS), we want to recover a signal $\mathbf{x}^* \in \mathbb{R}^n$ from $m$ measurements ($m\ll n$) using a generative prior $\mathbf{x}^*\in G(\mathbb{B}_2^k(r))$, where $G$ is typically an $L$-Lipschitz continuous generative model and $\mathbb{B}_2^k(r)$ represents the radius-$r$ $\ell_2$-ball in $\mathbb{R}^k$. Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously. In this paper, we build a unified framework to derive uniform recovery guarantees for nonlinear GCS where the observation model is nonlinear and possibly discontinuous or unknown. Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples. Specifically, using a single realization of the sensing ensemble and generalized Lasso, {\em all} $\mathbf{x}^*\in G(\mathbb{B}_2^k(r))$ can be recovered up to an $\ell_2$-error at most $\epsilon$ using roughly $\tilde{O}({k}/{\epsilon^2})$ samples, with omitted logarithmic factors typically being dominated by $\log L$. Notably, this almost coincides with existing non-uniform guarantees up to logarithmic factors, hence the uniformity costs very little. As part of our technical contributions, we introduce the Lipschitz approximation to handle discontinuous observation models. We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy. Experimental results are presented to corroborate our theory.
    摘要 在生成式压缩感知(GCS)中,我们希望利用生成先验 $\mathbf{x}^*\in G(\mathbb{B}_2^k(r))$,从 $m$ 个测量($m\ll n$)中恢复信号 $\mathbf{x}^* \in \mathbb{R}^n$,其中 $G$ 通常是一个 $L$-Lipschitz 连续的生成模型,$\mathbb{B}_2^k(r)$ 表示 $\mathbb{R}^k$ 中半径为 $r$ 的 $\ell_2$-球。在非线性测量下,大多数已有结果是非均匀的,即它们以高概率对固定的 $\mathbf{x}^*$ 成立,而非对所有 $\mathbf{x}^*$ 同时成立。在本文中,我们构建了一个统一框架,为非线性GCS推导均匀恢复保证,其中观测模型是非线性的,且可能不连续或未知。我们的框架以1-bit/均匀量化观测下的GCS和单指标模型作为典型例子。具体而言,使用感知系综的单次实现和广义Lasso,所有 $\mathbf{x}^*\in G(\mathbb{B}_2^k(r))$ 都可以用大约 $\tilde{O}({k}/{\epsilon^2})$ 个样本恢复到 $\ell_2$-误差不超过 $\epsilon$,其中省略的对数因子通常由 $\log L$ 主导。值得注意的是,这与现有的非均匀保证仅相差对数因子,因此均匀性的代价很小。作为技术贡献的一部分,我们引入了Lipschitz近似来处理不连续的观测模型,并发展了一个集中不等式,为指标集度量熵较低的乘积过程给出更紧的界。实验结果验证了我们的理论。

Futility and utility of a few ancillas for Pauli channel learning

  • paper_url: http://arxiv.org/abs/2309.14326
  • repo_url: None
  • paper_authors: Sitan Chen, Weiyuan Gong
  • For: 本文重新审视了刻画量子设备噪声结构的典型任务之一:估计 $n$ 量子比特Pauli噪声通道的本征值。
  • Methods: 本文通过指数下界来展示估计噪声通道本征值的算法的局限性。即使对于更容易的假设检验问题——判断底层通道是完全去极化的,还是恰好存在一个其他非平凡本征值——这些下界依然成立。
  • Results: 本文得到以下结果:1. 任何不具备量子存储的算法都必须进行 $\Omega(2^n/\epsilon^2)$ 次测量,才能将每个本征值估计到误差 $\epsilon$ 以内。2. 任何具备至多 $k$ 个辅助量子比特存储的算法都必须对未知通道进行 $\Omega(2^{(n-k)/3})$ 次查询。3. 仅用 $k=2$ 个辅助量子比特的量子存储,就存在一种算法,能以高概率通过单次测量解决该假设检验任务。
    Abstract In this paper we revisit one of the prototypical tasks for characterizing the structure of noise in quantum devices, estimating the eigenvalues of an $n$-qubit Pauli noise channel. Prior work (Chen et al., 2022) established exponential lower bounds for this task for algorithms with limited quantum memory. We first improve upon their lower bounds and show: (1) Any algorithm without quantum memory must make $\Omega(2^n/\epsilon^2)$ measurements to estimate each eigenvalue within error $\epsilon$. This is tight and implies the randomized benchmarking protocol is optimal, resolving an open question of (Flammia and Wallman, 2020). (2) Any algorithm with $\le k$ ancilla qubits of quantum memory must make $\Omega(2^{(n-k)/3})$ queries to the unknown channel. Crucially, unlike in (Chen et al., 2022), our bound holds even if arbitrary adaptive control and channel concatenation are allowed. In fact these lower bounds, like those of (Chen et al., 2022), hold even for the easier hypothesis testing problem of determining whether the underlying channel is completely depolarizing or has exactly one other nontrivial eigenvalue. Surprisingly, we show that: (3) With only $k=2$ ancilla qubits of quantum memory, there is an algorithm that solves this hypothesis testing task with high probability using a single measurement. Note that (3) does not contradict (2) as the protocol concatenates exponentially many queries to the channel before the measurement. This result suggests a novel mechanism by which channel concatenation and $O(1)$ qubits of quantum memory could work in tandem to yield striking speedups for quantum process learning that are not possible for quantum state learning.
    摘要 在本文中,我们重新审视刻画量子设备噪声结构的典型任务之一:估计 $n$ 量子比特Pauli噪声通道的本征值。先前的工作(Chen et al., 2022)为量子存储受限的算法建立了该任务的指数下界。我们首先改进了他们的下界并证明:(1)任何不具备量子存储的算法都必须进行 $\Omega(2^n/\epsilon^2)$ 次测量,才能将每个本征值估计到误差 $\epsilon$ 以内。这一下界是紧的,意味着随机基准测试协议是最优的,从而解决了(Flammia and Wallman, 2020)的一个开放问题。(2)任何具备至多 $k$ 个辅助量子比特存储的算法都必须对未知通道进行 $\Omega(2^{(n-k)/3})$ 次查询。关键在于,与(Chen et al., 2022)不同,即使允许任意自适应控制和通道串联,我们的下界仍然成立。事实上,这些下界(与(Chen et al., 2022)的下界一样)甚至对更容易的假设检验问题也成立,即判断底层通道是完全去极化的,还是恰好存在一个其他非平凡本征值。令人惊讶的是,我们证明:(3)仅用 $k=2$ 个辅助量子比特的量子存储,就存在一种算法,能以高概率通过单次测量解决该假设检验任务。注意(3)与(2)并不矛盾,因为该协议在测量之前将指数多次查询串联到通道上。这一结果揭示了一种新机制:通道串联与 $O(1)$ 个量子比特的量子存储可以协同作用,为量子过程学习带来量子态学习所无法实现的显著加速。

Small-scale proxies for large-scale Transformer training instabilities

  • paper_url: http://arxiv.org/abs/2309.14322
  • repo_url: None
  • paper_authors: Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
  • for: 这个论文的目的是研究大型Transformer模型在大规模训练中出现的训练不稳定性的原因,以及这些不稳定性在小规模训练中的表现。
  • methods: 本文关注先前工作中描述的两种训练不稳定性来源:注意力层中logits的增长(Dehghani et al., 2023)以及输出logits与输出对数概率的发散(Chowdhery et al., 2022)。通过测量不同规模下学习率与损失之间的关系,我们表明这些不稳定性在以高学习率训练小模型时同样会出现,并且先前在大规模训练中采用的缓解措施在这一范围内同样有效。
  • results: 本文研究了多种已知的优化器和模型干预措施(包括预热、权重衰减和$\mu$Param(Yang et al., 2022))对最终损失随学习率变化的敏感性的影响。我们发现可以组合这些技术来训练小模型,使其在学习率跨数个数量级变化时仍能达到相近的损失。最后,我们研究了两种可以在不稳定性出现之前就对其进行预测的情形:考察模型激活值和梯度范数的缩放行为。
    Abstract Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the amount of resources required to reproduce them has made investigation difficult. In this work, we seek ways to reproduce and study training stability and instability at smaller scales. First, we focus on two sources of training instability described in previous work: the growth of logits in attention layers (Dehghani et al., 2023) and divergence of the output logits from the log probabilities (Chowdhery et al., 2022). By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates, and that mitigations previously employed at large scales are equally effective in this regime. This prompts us to investigate the extent to which other known optimizer and model interventions influence the sensitivity of the final loss to changes in the learning rate. To this end, we study methods such as warm-up, weight decay, and the $\mu$Param (Yang et al., 2022), and combine techniques to train small models that achieve similar losses across orders of magnitude of learning rate variation. Finally, to conclude our exploration we study two cases where instabilities can be predicted before they emerge by examining the scaling behavior of model activation and gradient norms.
    摘要 训练过大型Transformer模型的团队曾报告过大规模下出现的训练不稳定性,而在相同超参数的较小规模训练中这些不稳定性并不会出现。尽管这些不稳定性的成因具有科学价值,但复现它们所需的资源使研究变得困难。在这项工作中,我们寻求在较小规模上复现和研究训练稳定性与不稳定性的方法。我们首先关注先前工作中描述的两种训练不稳定性来源:注意力层中logits的增长(Dehghani et al., 2023)以及输出logits与对数概率的发散(Chowdhery et al., 2022)。通过测量不同规模下学习率与损失之间的关系,我们表明这些不稳定性在以高学习率训练小模型时同样会出现,并且先前在大规模训练中采用的缓解措施在这一范围内同样有效。这促使我们考察其他已知的优化器和模型干预措施在多大程度上影响最终损失对学习率变化的敏感性。为此,我们研究了预热、权重衰减和$\mu$Param(Yang et al., 2022)等方法,并组合这些技术来训练小模型,使其在学习率跨数个数量级变化时达到相近的损失。最后,作为探索的总结,我们研究了两种可以在不稳定性出现之前通过考察模型激活值和梯度范数的缩放行为来对其进行预测的情形。

Lifelong Robot Learning with Human Assisted Language Planners

  • paper_url: http://arxiv.org/abs/2309.14321
  • repo_url: None
  • paper_authors: Meenal Parakh, Alisha Fong, Anthony Simeonov, Tao Chen, Abhishek Gupta, Pulkit Agrawal
  • for: 这个论文是为了开发一种使用大型自然语言模型(LLM)来帮助机器人学习新的技能的方法。
  • methods: 论文使用了LLM来帮助机器人查询和学习新的技能,并且可以在数据和时间有效的情况下进行学习。
  • results: 研究人员通过实验和实际应用,证明了该方法可以帮助机器人在不同任务中快速学习和应用新的技能,并且可以在未来的任务中重用已经学习的技能。
    Abstract Large Language Models (LLMs) have been shown to act like planners that can decompose high-level instructions into a sequence of executable instructions. However, current LLM-based planners are only able to operate with a fixed set of skills. We overcome this critical limitation and present a method for using LLM-based planners to query new skills and teach robots these skills in a data and time-efficient manner for rigid object manipulation. Our system can re-use newly acquired skills for future tasks, demonstrating the potential of open world and lifelong learning. We evaluate the proposed framework on multiple tasks in simulation and the real world. Videos are available at: https://sites.google.com/mit.edu/halp-robot-learning.
    摘要 大型语言模型(LLM)已被证明可以充当规划器,将高层指令分解为一系列可执行的指令。但现有基于LLM的规划器只能使用一组固定的技能。我们克服了这一关键限制,提出了一种方法,利用基于LLM的规划器查询新技能,并以数据和时间高效的方式教会机器人这些技能,用于刚性物体操作。我们的系统可以将新习得的技能复用于未来的任务,展示了开放世界与终身学习的潜力。我们在仿真和真实世界的多个任务上对所提框架进行了评估。详细视频见:https://sites.google.com/mit.edu/halp-robot-learning。

A post-selection algorithm for improving dynamic ensemble selection methods

  • paper_url: http://arxiv.org/abs/2309.14307
  • repo_url: https://github.com/prgc/ps-des
  • paper_authors: Paulo R. G. Cordeiro, George D. C. Cavalcanti, Rafael M. O. Cruz
  • for: 这个研究的目的是为了选择最佳的多分类器系统(MCS)方法,以提高准确率。
  • methods: 这个研究使用的方法是后选择动态集成选择(PS-DES)方法,它在选择阶段为每个查询样本评估并选出最佳的集成。
  • results: 实验结果显示,使用准确率作为选择集成的评估指标时,PS-DES方法比单一的DES方法表现更好。
    Abstract Dynamic Ensemble Selection (DES) is a Multiple Classifier Systems (MCS) approach that aims to select an ensemble for each query sample during the selection phase. Even with the proposal of several DES approaches, no particular DES technique is the best choice for different problems. Thus, we hypothesize that selecting the best DES approach per query instance can lead to better accuracy. To evaluate this idea, we introduce the Post-Selection Dynamic Ensemble Selection (PS-DES) approach, a post-selection scheme that evaluates ensembles selected by several DES techniques using different metrics. Experimental results show that using accuracy as a metric to select the ensembles, PS-DES performs better than individual DES techniques. PS-DES source code is available in a GitHub repository
    摘要 动态集成选择(DES)是一种多分类器系统(MCS)方法,旨在在选择阶段为每个查询样本选择一个集成。尽管已经提出了多种DES方法,但没有任何一种DES技术是不同问题的最佳选择。因此,我们假设为每个查询实例选择最佳的DES方法可以带来更高的准确率。为了评估这一想法,我们提出了后选择动态集成选择(PS-DES)方法,这是一种后选择方案,使用不同的指标评估由多种DES技术选出的集成。实验结果显示,以准确率作为选择集成的指标时,PS-DES的表现优于单个DES技术。PS-DES的源代码可在GitHub存储库中获取。
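
A rough sketch of the post-selection idea, with generic scikit-learn ensembles standing in for the competing DES pipelines and local validation accuracy as the selection metric (the official implementation in the linked repo differs in its details):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=1500, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Stand-ins for different DES pipelines: several pre-fitted ensembles.
ensembles = [m.fit(X_tr, y_tr) for m in (
    BaggingClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
    AdaBoostClassifier(random_state=0))]

nn = NearestNeighbors(n_neighbors=7).fit(X_val)

def ps_des_predict(x):
    """Post-selection: score each candidate ensemble on the query's
    validation neighborhood and predict with the locally best one."""
    idx = nn.kneighbors(x.reshape(1, -1), return_distance=False)[0]
    local_acc = [m.score(X_val[idx], y_val[idx]) for m in ensembles]
    return ensembles[int(np.argmax(local_acc))].predict(x.reshape(1, -1))[0]

preds = np.array([ps_des_predict(x) for x in X_te])
print("PS-DES test accuracy:", (preds == y_te).mean())
```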

Improved Algorithms for Stochastic Linear Bandits Using Tail Bounds for Martingale Mixtures

  • paper_url: http://arxiv.org/abs/2309.14298
  • repo_url: None
  • paper_authors: Hamish Flynn, David Reeb, Melih Kandemir, Jan Peters
  • for: 这个论文targets the stochastic linear bandit problem, and proposes improved algorithms with worst-case regret guarantees.
  • methods: 该论文使用了一种新的tail bound for adaptive martingale mixtures to construct confidence sequences, which are suitable for stochastic bandits. These confidence sequences allow for efficient action selection via convex programming.
  • results: 该论文提供了一种基于 confidence sequences的 linear bandit algorithm, which is guaranteed to achieve competitive worst-case regret. Additionally, the authors show that their confidence sequences are tighter than competitors, both empirically and theoretically, and demonstrate improved performance in several hyperparameter tuning tasks.
    Abstract We present improved algorithms with worst-case regret guarantees for the stochastic linear bandit problem. The widely used "optimism in the face of uncertainty" principle reduces a stochastic bandit problem to the construction of a confidence sequence for the unknown reward function. The performance of the resulting bandit algorithm depends on the size of the confidence sequence, with smaller confidence sets yielding better empirical performance and stronger regret guarantees. In this work, we use a novel tail bound for adaptive martingale mixtures to construct confidence sequences which are suitable for stochastic bandits. These confidence sequences allow for efficient action selection via convex programming. We prove that a linear bandit algorithm based on our confidence sequences is guaranteed to achieve competitive worst-case regret. We show that our confidence sequences are tighter than competitors, both empirically and theoretically. Finally, we demonstrate that our tighter confidence sequences give improved performance in several hyperparameter tuning tasks.
    摘要 我们为随机线性赌博机问题提出了具有最坏情况后悔保证的改进算法。广泛使用的“面对不确定性保持乐观”原则将随机赌博机问题归结为构造未知奖励函数的置信序列。所得赌博机算法的性能取决于置信序列的大小:更小的置信集带来更好的实证性能和更强的后悔保证。在这项工作中,我们利用一种新颖的自适应鞅混合尾部界来构造适用于随机赌博机的置信序列。这些置信序列使得可以通过凸规划进行高效的动作选择。我们证明,基于我们置信序列的线性赌博机算法能够保证达到有竞争力的最坏情况后悔。我们还从实证和理论两方面证明了我们的置信序列比竞争方法更紧。最后,我们展示了更紧的置信序列在多个超参数调优任务中带来的性能提升。

On the Non-Associativity of Analog Computations

  • paper_url: http://arxiv.org/abs/2309.14292
  • repo_url: None
  • paper_authors: Lisa Kuhn, Bernhard Klein, Holger Fröning
  • for: 这项研究旨在探讨模拟计算中缺乏离散化保障的问题,以及由此产生的运算顺序效应对机器学习任务的影响。
  • methods: 该研究通过构建一个真实模拟处理器的简单模型来展示其中的顺序效应。
  • results: 结果表明,忽略运算顺序可能会导致准确率大幅下降。
    Abstract The energy efficiency of analog forms of computing makes it one of the most promising candidates to deploy resource-hungry machine learning tasks on resource-constrained system such as mobile or embedded devices. However, it is well known that for analog computations the safety net of discretization is missing, thus all analog computations are exposed to a variety of imperfections of corresponding implementations. Examples include non-linearities, saturation effect and various forms of noise. In this work, we observe that the ordering of input operands of an analog operation also has an impact on the output result, which essentially makes analog computations non-associative, even though the underlying operation might be mathematically associative. We conduct a simple test by creating a model of a real analog processor which captures such ordering effects. With this model we assess the importance of ordering by comparing the test accuracy of a neural network for keyword spotting, which is trained based either on an ordered model, on a non-ordered variant, and on real hardware. The results prove the existence of ordering effects as well as their high impact, as neglecting ordering results in substantial accuracy drops.
    摘要 模拟计算的能效使其成为在移动或嵌入式设备等资源受限系统上部署资源密集型机器学习任务的最有前景的候选方案之一。然而,众所周知,模拟计算缺乏离散化这一安全网,因此所有模拟计算都暴露于相应实现的各种非理想因素之下,例如非线性、饱和效应和各种形式的噪声。在这项工作中,我们观察到模拟运算中输入操作数的顺序也会影响输出结果,这实质上使模拟计算不满足结合律,即使其底层运算在数学上是满足结合律的。我们构建了一个能捕捉此类顺序效应的真实模拟处理器模型,并进行了简单测试。利用该模型,我们通过比较分别基于有序模型、无序变体以及真实硬件训练的关键词检测神经网络的测试准确率,评估了顺序的重要性。结果证实了顺序效应的存在及其重大影响:忽略顺序会导致准确率大幅下降。
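
A tiny numerical illustration of the effect, using additive noise plus saturation as a stand-in for analog imperfections (parameter values are arbitrary):

```python
import numpy as np

def analog_add(a, b, noise=0.01, limit=1.0, rng=np.random.default_rng(0)):
    """Toy model of an analog adder: additive noise plus output saturation."""
    return np.clip(a + b + rng.normal(0.0, noise), -limit, limit)

x, y, z = 0.7, 0.6, -0.5
left  = analog_add(analog_add(x, y), z)   # (x + y) saturates at 1.0 first
right = analog_add(x, analog_add(y, z))   # y + z stays in range
print(left, right)  # differ, although x + y + z is mathematically associative
```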

DECORAIT – DECentralized Opt-in/out Registry for AI Training

  • paper_url: http://arxiv.org/abs/2309.14400
  • repo_url: None
  • paper_authors: Kar Balan, Alex Black, Simon Jenni, Andrew Gilbert, Andy Parsons, John Collomosse
  • For: The paper aims to address the data governance challenge faced by content creators who want to share their work openly without sanctioning its use for training AI models, and to ensure fair recognition and reward for their contributions.
  • Methods: The paper proposes a decentralized registry called DECORAIT, which uses hierarchical clustering and a combination of on/off-chain storage to trace the provenance of GenAI training data and determine training consent. The registry leverages distributed ledger technology (DLT) and visual fingerprinting, and is built on the emerging C2PA standard.
  • Results: The paper reports a prototype of DECORAIT, which demonstrates the feasibility of using a decentralized registry to trace the provenance of GenAI training data and ensure fair recognition and reward for content creators. The prototype combines the strengths of DLT and visual fingerprinting to create a secure, open registry that can be used to express consent and data ownership for GenAI.
    Abstract We present DECORAIT; a decentralized registry through which content creators may assert their right to opt in or out of AI training as well as receive reward for their contributions. Generative AI (GenAI) enables images to be synthesized using AI models trained on vast amounts of data scraped from public sources. Model and content creators who may wish to share their work openly without sanctioning its use for training are thus presented with a data governance challenge. Further, establishing the provenance of GenAI training data is important to creatives to ensure fair recognition and reward for their such use. We report a prototype of DECORAIT, which explores hierarchical clustering and a combination of on/off-chain storage to create a scalable decentralized registry to trace the provenance of GenAI training data in order to determine training consent and reward creatives who contribute that data. DECORAIT combines distributed ledger technology (DLT) with visual fingerprinting, leveraging the emerging C2PA (Coalition for Content Provenance and Authenticity) standard to create a secure, open registry through which creatives may express consent and data ownership for GenAI.
    摘要 我们介绍DECORAIT:一个去中心化注册表,内容创作者可以通过它声明自己选择加入或退出AI训练的权利,并为其贡献获得回报。生成式人工智能(GenAI)能够利用在大量公开抓取数据上训练的AI模型合成图像。因此,希望公开分享作品而不许可其用于训练的模型与内容创作者面临着数据治理方面的挑战。此外,确定GenAI训练数据的来源对创作者而言十分重要,以确保其贡献获得公平的认可和回报。我们报告了DECORAIT的一个原型,它探索利用层次聚类以及链上/链下混合存储来构建一个可扩展的去中心化注册表,用于追溯GenAI训练数据的来源,从而确定训练许可并奖励贡献数据的创作者。DECORAIT将分布式账本技术(DLT)与视觉指纹相结合,并利用新兴的C2PA(内容来源与真实性联盟)标准,创建了一个安全、开放的注册表,创作者可以通过它表达对GenAI的许可和数据所有权。

Learning Risk-Aware Quadrupedal Locomotion using Distributional Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.14246
  • repo_url: None
  • paper_authors: Lukas Schneider, Jonas Frey, Takahiro Miki, Marco Hutter
  • for: 本研究旨在帮助机器人在危险环境中进行行动,以避免意外和减少风险。
  • methods: 本研究使用分布型强化学习来显式地考虑安全性,通过估计完整的价值分布来刻画机器人与环境交互中的不确定性。
  • results: 本研究在 simulate 和 ANYmal quadruped robot上实现了 emergent 风险敏感的行动行为,并且可以通过控制一个参数来调整机器人的行为风格,从而实现风险敏感性。
    Abstract Deployment in hazardous environments requires robots to understand the risks associated with their actions and movements to prevent accidents. Despite its importance, these risks are not explicitly modeled by currently deployed locomotion controllers for legged robots. In this work, we propose a risk sensitive locomotion training method employing distributional reinforcement learning to consider safety explicitly. Instead of relying on a value expectation, we estimate the complete value distribution to account for uncertainty in the robot's interaction with the environment. The value distribution is consumed by a risk metric to extract risk sensitive value estimates. These are integrated into Proximal Policy Optimization (PPO) to derive our method, Distributional Proximal Policy Optimization (DPPO). The risk preference, ranging from risk-averse to risk-seeking, can be controlled by a single parameter, which enables to adjust the robot's behavior dynamically. Importantly, our approach removes the need for additional reward function tuning to achieve risk sensitivity. We show emergent risk sensitive locomotion behavior in simulation and on the quadrupedal robot ANYmal.
    摘要 在危险环境中部署机器人时,机器人需要理解其动作和移动所带来的风险,以避免事故。尽管这一点十分重要,但目前部署的足式机器人运动控制器并未显式地对这些风险建模。在这项工作中,我们提出了一种风险敏感的运动训练方法,采用分布型强化学习来显式地考虑安全性。我们不依赖价值期望,而是估计完整的价值分布,以刻画机器人与环境交互中的不确定性。该价值分布被输入风险度量,以提取风险敏感的价值估计。这些估计被整合到近端策略优化(PPO)中,得到我们的方法:分布型近端策略优化(DPPO)。风险偏好(从规避风险到追逐风险)可以通过单个参数控制,从而可以动态调整机器人的行为。重要的是,我们的方法无需额外调整奖励函数即可实现风险敏感性。我们在仿真和ANYmal四足机器人上展示了涌现出的风险敏感运动行为。
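
As an illustration of the risk-metric step, CVaR can be read off a quantile representation of the value distribution, as in quantile-based distributional RL (the numbers below are made up):

```python
import numpy as np

def cvar(quantile_values, alpha=0.25):
    """CVaR_alpha of a return distribution given equally weighted quantile
    estimates: the mean of the worst alpha-fraction of outcomes."""
    q = np.sort(np.asarray(quantile_values))
    k = max(1, int(np.ceil(alpha * len(q))))
    return q[:k].mean()

# Two actions with equal expected return but different tails.
safe  = np.array([0.9, 1.0, 1.0, 1.1])
risky = np.array([-2.0, 1.0, 2.0, 3.0])
print(safe.mean(), risky.mean())   # 1.0 vs 1.0: expectation cannot tell them apart
print(cvar(safe), cvar(risky))     # a risk-averse agent prefers `safe`
```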

Learning to Abstain From Uninformative Data

  • paper_url: http://arxiv.org/abs/2309.14240
  • repo_url: None
  • paper_authors: Yikai Zhang, Songzhu Zheng, Mina Dalirrooyfard, Pengxiang Wu, Anderson Schneider, Anant Raj, Yuriy Nevmyvaka, Chao Chen
  • for: 本研究探讨了在高噪音比例下学习和决策的问题,如金融或医疗领域。
  • methods: 我们提出了一种基于选择学习理论的损失函数,以及一种迭代算法,可以同时优化预测器和选择器,并在多种场景中评估其实验性能。
  • results: 我们的方法可以在具有高噪音比例的数据上提供有效的学习和决策,并且可以在训练和测试阶段处理不相关的数据。
    Abstract Learning and decision-making in domains with naturally high noise-to-signal ratio, such as Finance or Healthcare, is often challenging, while the stakes are very high. In this paper, we study the problem of learning and acting under a general noisy generative process. In this problem, the data distribution has a significant proportion of uninformative samples with high noise in the label, while part of the data contains useful information represented by low label noise. This dichotomy is present during both training and inference, which requires the proper handling of uninformative data during both training and testing. We propose a novel approach to learning under these conditions via a loss inspired by the selective learning theory. By minimizing this loss, the model is guaranteed to make a near-optimal decision by distinguishing informative data from uninformative data and making predictions. We build upon the strength of our theoretical guarantees by describing an iterative algorithm, which jointly optimizes both a predictor and a selector, and evaluates its empirical performance in a variety of settings.
    摘要 在金融或医疗等天然信噪比很低的领域中,学习和决策往往充满挑战,而其风险又非常高。在本文中,我们研究在一般噪声生成过程下学习和行动的问题。在该问题中,数据分布包含相当比例的无信息样本(标签噪声很高),而另一部分数据则包含由低标签噪声所代表的有用信息。这种二分性在训练和推断阶段都存在,因此需要在训练和测试中正确处理无信息数据。我们提出了一种在这些条件下学习的新方法,其损失函数受选择性学习理论启发。通过最小化该损失,模型可以保证区分有信息数据与无信息数据并做出近乎最优的决策。在理论保证的基础上,我们描述了一种联合优化预测器和选择器的迭代算法,并在多种场景下评估了其实证性能。
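
A hedged sketch of a selective-learning style objective in the spirit described above (the exact loss and penalty form here are illustrative assumptions, not the paper's):

```python
import numpy as np

def selective_loss(pred_loss, select_prob, target_coverage=0.6, lam=10.0):
    """Average prediction loss over the samples the selector keeps, plus a
    penalty if coverage drops below a target.
    pred_loss: per-sample losses; select_prob: selector outputs in [0, 1]."""
    coverage = select_prob.mean()
    risk = (select_prob * pred_loss).sum() / max(select_prob.sum(), 1e-8)
    return risk + lam * max(0.0, target_coverage - coverage) ** 2

losses = np.array([0.1, 0.2, 2.5, 3.0])      # last two samples are noisy
keep_all  = np.ones(4)
abstain_2 = np.array([1.0, 1.0, 0.0, 0.0])   # abstain on the uninformative ones
print(selective_loss(losses, keep_all))      # 1.45: risk dominated by noise
print(selective_loss(losses, abstain_2))     # 0.25: lower despite coverage penalty
```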

Predicting environment effects on breast cancer by implementing machine learning

  • paper_url: http://arxiv.org/abs/2309.14397
  • repo_url: None
  • paper_authors: Muhammad Shoaib Farooq, Mehreen Ilyas
  • for: 本研究旨在探讨环境因素对乳腺癌的发生和进程中的作用,以及这些因素对乳腺癌预后的影响。
  • methods: 本研究使用了机器学习算法进行预测,包括逻辑回归、随机森林、KNN算法、支持向量机和极端随机树分类器。
  • results: 研究发现,Random Forest算法的准确率为0.91%,ROC曲线为0.901%,表示这些机器学习算法在乳腺癌存活分析中具有良好的准确性,这些技术可能成为乳腺癌预后预测的新选择。
    Abstract The biggest Breast cancer is increasingly a major factor in female fatalities, overtaking heart disease. While genetic factors are important in the growth of breast cancer, new research indicates that environmental factors also play a substantial role in its occurrence and progression. The literature on the various environmental factors that may affect breast cancer risk, incidence, and outcomes is thoroughly reviewed in this study report. The study starts by looking at how lifestyle decisions, such as eating habits, exercise routines, and alcohol consumption, may affect hormonal imbalances and inflammation, two important factors driving the development of breast cancer. Additionally, it explores the part played by environmental contaminants such pesticides, endocrine-disrupting chemicals (EDCs), and industrial emissions, all of which have been linked to a higher risk of developing breast cancer due to their interference with hormone signaling and DNA damage. Algorithms for machine learning are used to express predictions. Logistic Regression, Random Forest, KNN Algorithm, SVC and extra tree classifier. Metrics including the confusion matrix correlation coefficient, F1-score, Precision, Recall, and ROC curve were used to evaluate the models. The best accuracy among all the classifiers is Random Forest with 0.91% accuracy and ROC curve 0.901% of Logistic Regression. The accuracy of the multiple algorithms for machine learning utilized in this research was good, which is important and indicates that these techniques could serve as replacement forecasting techniques in breast cancer survival analysis, notably in the Asia region.
    摘要 乳腺癌正日益成为女性死亡的主要因素,已超过心脏病。虽然遗传因素在乳腺癌的发生发展中十分重要,但新的研究表明,环境因素在其发生和进展中也扮演着重要角色。本研究报告全面综述了可能影响乳腺癌风险、发病率和结局的各类环境因素的文献。研究首先考察了饮食习惯、锻炼规律和饮酒等生活方式选择如何影响激素失衡和炎症这两个驱动乳腺癌发生发展的重要因素;随后探讨了农药、内分泌干扰物(EDCs)和工业排放等环境污染物的作用,它们均因干扰激素信号传导和造成DNA损伤而与更高的乳腺癌风险相关。研究使用机器学习算法进行预测,包括逻辑回归、随机森林、KNN算法、SVC和极端随机树分类器,并使用混淆矩阵、相关系数、F1分数、精确率、召回率和ROC曲线等指标评估模型。所有分类器中准确率最高的是随机森林(准确率0.91),逻辑回归的ROC曲线下面积为0.901。本研究所用多种机器学习算法的准确率均良好,这一点十分重要,表明这些技术有望成为乳腺癌生存分析中的替代预测技术,尤其是在亚洲地区。

Guess & Sketch: Language Model Guided Transpilation

  • paper_url: http://arxiv.org/abs/2309.14396
  • repo_url: None
  • paper_authors: Celine Lee, Abdulrahman Mahmoud, Michal Kurek, Simone Campanoni, David Brooks, Stephen Chong, Gu-Yeon Wei, Alexander M. Rush
  • for: 本研究旨在提高维护旧系统软件的效率,使用了学习型转换器来自动将 Assembly code 转换为其他编程语言。
  • methods: 本研究使用了一种名为 Guess & Sketch 的 neurosymbolic 方法,它将 LM 和符号解决器结合在一起,以实现 Assembly code 的自动转换。
  • results: 根据实验结果,Guess & Sketch 可以成功转换 57.6% 更多的 Assembly code 示例,比 GPT-4 和手动编写的转换器更高效。
    Abstract Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e. automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.
    摘要 维护遗传软件需要很多软件和系统工程时间。 Assembly 程式,它们需要低层次控制电脑机器状态,并没有变数名称,对人类分析而言特别困难。现有的传统程式翻译器可以保证正确性,但是它们是手工设计的源和目标程式码语言的。学习型翻译,即自动翻译程式码,提供了一个人工重新写程式码的替代方案。自动 симвоlic 程式翻译方法可以保证正确性,但是它们在更长的程式码中缺乏扩展性,因为搜索空间是指数增长的。它们的僵化规则系统也限制了它们的表达力,只能理解一个受限的程式空间。概率神经语言模型(LM)可以生成可能的输出,但是它们在交互时需要付出正确性的代价。在这个工作中,我们利用 LM 和符号方法的优点,在 assembly 程式中实现了学习型翻译。 assembly 程式可以被分成更短的非分支基本块,适合使用符号方法。 Guess & Sketch 首先从 LM 中提取对适合性和信任度的资讯,然后将其转交给符号方法以解决这个翻译任务的内涵相等性。我们在三个不同的 assembly 翻译任务上进行测试,发现 Guess & Sketch 成功翻译了 57.6% 更多的例子,比 GPT-4 和手工设计的翻译器更高。我们还提供了这个任务的训练和评估数据集。

  • paper_url: http://arxiv.org/abs/2309.14196
  • repo_url: None
  • paper_authors: Liming Zhao, Aman Agrawal, Patrick Rebentrost
  • for: 扩展Restricted Boltzmann Machines(RBMs)的结构学习问题到量子计算领域,并提出相应的量子算法来解决这个问题。
  • methods: 使用量子算法来学习RBMs的结构,包括两类特定的RBMs:铁磁RBMs和局部一致RBMs。
  • results: 对于这两类RBMs,量子算法相比经典算法具有多项式加速。
    Abstract Restricted Boltzmann Machines (RBMs) are widely used probabilistic undirected graphical models with visible and latent nodes, playing an important role in statistics and machine learning. The task of structure learning for RBMs involves inferring the underlying graph by using samples from the visible nodes. Specifically, learning the two-hop neighbors of each visible node allows for the inference of the graph structure. Prior research has addressed the structure learning problem for specific classes of RBMs, namely ferromagnetic and locally consistent RBMs. In this paper, we extend the scope to the quantum computing domain and propose corresponding quantum algorithms for this problem. Our study demonstrates that the proposed quantum algorithms yield a polynomial speedup compared to the classical algorithms for learning the structure of these two classes of RBMs.
    摘要 受限玻尔兹曼机(RBMs)是一种广泛使用的、包含可见节点与隐藏节点的概率无向图模型,在统计学和机器学习中扮演着重要角色。RBMs的结构学习任务是利用可见节点的样本来推断底层图结构。具体而言,学习每个可见节点的两跳邻居即可推断出图结构。先前的研究已经针对特定类别的RBMs(即铁磁RBMs和局部一致RBMs)解决了结构学习问题。在本文中,我们将该问题推广到量子计算领域,并为其提出相应的量子算法。我们的研究表明,对于这两类RBMs的结构学习,所提出的量子算法相比经典算法具有多项式加速。

Federated Learning Under Restricted User Availability

  • paper_url: http://arxiv.org/abs/2309.14176
  • repo_url: None
  • paper_authors: Periklis Theodoropoulos, Konstantinos E. Nikolakakis, Dionysis Kalogerias
  • for: 这篇论文旨在提出一个可靠的联邦学习框架,能够在不侵犯数据隐私的情况下进行协作模型训练。
  • methods: 本论文设定了一个可能随机化的平稳用户选择策略,称为随机接入模型(RAM),并提出了一个新的联邦学习问题表述,可以有效刻画并缓解来自低频或受限用户的数据参与不足的问题。
  • results: 实验结果显示,所提出的方法与标准联邦学习相比,在不同的设置下均表现出更好的性能。
    Abstract Federated Learning (FL) is a decentralized machine learning framework that enables collaborative model training while respecting data privacy. In various applications, non-uniform availability or participation of users is unavoidable due to an adverse or stochastic environment, the latter often being uncontrollable during learning. Here, we posit a generic user selection mechanism implementing a possibly randomized, stationary selection policy, suggestively termed as a Random Access Model (RAM). We propose a new formulation of the FL problem which effectively captures and mitigates limited participation of data originating from infrequent, or restricted users, at the presence of a RAM. By employing the Conditional Value-at-Risk (CVaR) over the (unknown) RAM distribution, we extend the expected loss FL objective to a risk-aware objective, enabling the design of an efficient training algorithm that is completely oblivious to the RAM, and with essentially identical complexity as FedAvg. Our experiments on synthetic and benchmark datasets show that the proposed approach achieves significantly improved performance as compared with standard FL, under a variety of setups.
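The risk-aware objective above lends itself to a short sketch. Below is an illustrative computation of the CVaR of per-round client losses via the Rockafellar-Uryasev formulation; the function name, the toy losses, and the server-side framing are assumptions for illustration, not the paper's code.

```python
import numpy as np

def cvar(losses, alpha=0.2):
    """Conditional Value-at-Risk: the expected loss over the worst
    alpha-fraction of outcomes, via the Rockafellar-Uryasev form
        CVaR_alpha(L) = min_t  t + E[(L - t)_+] / alpha,
    whose minimizer t* is the (1 - alpha)-quantile of the losses."""
    losses = np.asarray(losses, dtype=float)
    t = np.quantile(losses, 1.0 - alpha)  # Value-at-Risk level
    return t + np.mean(np.maximum(losses - t, 0.0)) / alpha

# Server-side view: per-client losses reported in one round; the risk-aware
# objective focuses weight on rarely available, poorly served clients.
client_losses = [0.3, 0.4, 0.5, 1.7, 2.1]
print(cvar(client_losses, alpha=0.4))  # 1.9 = mean of the two worst clients
```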

Designing and evaluating an online reinforcement learning agent for physical exercise recommendations in N-of-1 trials

  • paper_url: http://arxiv.org/abs/2309.14156
  • repo_url: https://github.com/hialab/reinforcement-learning-agents-in-n-of-1-trials
  • paper_authors: Dominik Meier, Ipek Ensari, Stefan Konigorski
  • for: Explores whether implementing personalized adaptive interventions via an online reinforcement learning agent is feasible and effective in clinical settings.
  • methods: Uses a novel N-of-1 trial on physical exercise recommendations to reduce pain in endometriosis as an illustration; describes the design of a contextual bandit recommendation agent and evaluates it in simulation studies.
  • results: Adaptive interventions add complexity to the design and implementation process, but have the potential to improve patients' benefits even with limited observations; the approach is expected to be transferable to other interventions and clinical settings.
    Abstract Personalized adaptive interventions offer the opportunity to increase patient benefits, however, there are challenges in their planning and implementation. Once implemented, it is an important question whether personalized adaptive interventions are indeed clinically more effective compared to a fixed gold standard intervention. In this paper, we present an innovative N-of-1 trial study design testing whether implementing a personalized intervention by an online reinforcement learning agent is feasible and effective. Throughout, we use a new study on physical exercise recommendations to reduce pain in endometriosis for illustration. We describe the design of a contextual bandit recommendation agent and evaluate the agent in simulation studies. The results show that adaptive interventions add complexity to the design and implementation process, but have the potential to improve patients' benefits even if only a few observations are available. In order to quantify the expected benefit, data from previous interventional studies is required. We expect our approach to be transferable to other interventions and clinical settings.

Extragradient Type Methods for Riemannian Variational Inequality Problems

  • paper_url: http://arxiv.org/abs/2309.14155
  • repo_url: None
  • paper_authors: Zihao Hu, Guanghui Wang, Xi Wang, Andre Wibisono, Jacob Abernethy, Molei Tao
  • for: Studies monotone Riemannian Variational Inequality Problems (RVIPs), which include Riemannian convex optimization and minimax optimization as special cases.
  • methods: Introduces two new algorithms, the Riemannian extragradient (REG) and Riemannian past extragradient (RPEG) methods, judiciously handling the holonomy effect so that Euclidean proof techniques carry over to manifolds.
  • results: Both REG and RPEG exhibit $O\left(\frac{1}{\sqrt{T}}\right)$ last-iterate convergence to the solution of monotone RVIPs, and their average-iterate convergence is $O\left(\frac{1}{T}\right)$, matching observations in the Euclidean case.
    Abstract Riemannian convex optimization and minimax optimization have recently drawn considerable attention. Their appeal lies in their capacity to adeptly manage the non-convexity of the objective function as well as constraints inherent in the feasible set in the Euclidean sense. In this work, we delve into monotone Riemannian Variational Inequality Problems (RVIPs), which encompass both Riemannian convex optimization and minimax optimization as particular cases. In the context of Euclidean space, it is established that the last-iterates of both the extragradient (EG) and past extragradient (PEG) methods converge to the solution of monotone variational inequality problems at a rate of $O\left(\frac{1}{\sqrt{T}}\right)$ (Cai et al., 2022). However, analogous behavior on Riemannian manifolds remains an open question. To bridge this gap, we introduce the Riemannian extragradient (REG) and Riemannian past extragradient (RPEG) methods. We demonstrate that both exhibit $O\left(\frac{1}{\sqrt{T}}\right)$ last-iterate convergence. Additionally, we show that the average-iterate convergence of both REG and RPEG is $O\left(\frac{1}{T}\right)$, aligning with observations in the Euclidean case (Mokhtari et al., 2020). These results are enabled by judiciously addressing the holonomy effect so that additional complications in Riemannian cases can be reduced and the Euclidean proof inspired by the performance estimation problem (PEP) technique or the sum-of-squares (SOS) technique can be applied again.
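For reference, the Euclidean extragradient update that REG and RPEG generalize fits in a few lines. The sketch below runs EG on a bilinear saddle problem where plain gradient descent-ascent diverges; the matrix, step size, and iteration count are arbitrary choices, and the Riemannian variants would replace each Euclidean step with an exponential map on the manifold.

```python
import numpy as np

# Extragradient (EG) on the bilinear saddle problem
#   min_x max_y  x^T A y,   with operator F(x, y) = (A y, -A^T x).
# Plain gradient descent-ascent spirals outward here; EG's look-ahead
# step makes the iterates contract to the unique saddle at the origin.
A = np.array([[3.0, 1.0], [1.0, 2.0]])

def F(z):
    x, y = z[:2], z[2:]
    return np.concatenate([A @ y, -A.T @ x])

z, eta = np.array([1.0, -1.0, 2.0, 0.5]), 0.2
for _ in range(500):
    z_half = z - eta * F(z)   # extrapolation (look-ahead) step
    z = z - eta * F(z_half)   # update with the look-ahead operator value
print(np.linalg.norm(z))      # -> ~0: the unique saddle point at the origin
```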

One-Class Classification for Intrusion Detection on Vehicular Networks

  • paper_url: http://arxiv.org/abs/2309.14134
  • repo_url: None
  • paper_authors: Jake Guidry, Fahad Sohrab, Raju Gottumukkala, Satya Katragadda, Moncef Gabbouj
  • for: Protecting Controller Area Network bus systems in vehicular networks against modern cyber-security attacks
  • methods: Machine learning, specifically one-class classification, to detect and report attacks
  • results: Evaluated several state-of-the-art one-class classification methods on injection attacks in Controller Area Network bus traffic from two vehicles, during normal operation and while under attack; Subspace Support Vector Data Description performed best, with a Gmean of about 85%
    Abstract Controller Area Network bus systems within vehicular networks are not equipped with the tools necessary to ward off and protect themselves from modern cyber-security threats. Work has been done on using machine learning methods to detect and report these attacks, but common methods are not robust towards unknown attacks. These methods usually rely on there being a sufficient representation of attack data, which may not be available due to there either not being enough data present to adequately represent its distribution or the distribution itself is too diverse in nature for there to be a sufficient representation of it. With the use of one-class classification methods, this issue can be mitigated as only normal data is required to train a model for the detection of anomalous instances. Research has been done on the efficacy of these methods, most notably One-Class Support Vector Machine and Support Vector Data Description, but many new extensions of these works have been proposed and have yet to be tested for injection attacks in vehicular networks. In this paper, we investigate the performance of various state-of-the-art one-class classification methods for detecting injection attacks on Controller Area Network bus traffic. We investigate the effectiveness of these techniques on attacks launched on Controller Area Network buses from two different vehicles during normal operation and while being attacked. We observe that the Subspace Support Vector Data Description method outperformed all other tested methods with a Gmean of about 85%.
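The one-class paradigm described above is easy to demonstrate: train only on normal traffic and flag whatever the model rejects. The sketch below uses scikit-learn's OneClassSVM as a stand-in (the paper's best performer is Subspace Support Vector Data Description, for which no reference implementation is assumed here), with synthetic features in place of real CAN-bus data.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train on normal CAN-bus traffic only; anything the model rejects is
# flagged as an attack. Feature vectors here are synthetic stand-ins.
rng = np.random.default_rng(1)
normal_train = rng.normal(0.0, 1.0, size=(500, 8))
normal_test = rng.normal(0.0, 1.0, size=(200, 8))
attack_test = rng.normal(4.0, 1.0, size=(50, 8))   # injected frames drift

clf = OneClassSVM(kernel="rbf", nu=0.05).fit(normal_train)
tpr = np.mean(clf.predict(normal_test) == 1)    # normal accepted (+1)
tnr = np.mean(clf.predict(attack_test) == -1)   # attacks rejected (-1)
print("Gmean:", np.sqrt(tpr * tnr))             # geometric mean of the rates
```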

Driving behavior-guided battery health monitoring for electric vehicles using machine learning

  • paper_url: http://arxiv.org/abs/2309.14125
  • repo_url: None
  • paper_authors: Nanhua Jiang, Jiawei Zhang, Weiran Jiang, Yao Ren, Jing Lin, Edwin Khoo, Ziyou Song
  • for: Provides a feature-based machine learning pipeline for accurate and reliable battery health monitoring.
  • methods: Careful selection and fusion of multiple health indicators (HIs), taking real-world driving behaviors into account.
  • results: Yields a feature selection and fusion method that accounts for real-world driving behaviors, improving the accuracy and practicality of battery health monitoring.
    Abstract An accurate estimation of the state of health (SOH) of batteries is critical to ensuring the safe and reliable operation of electric vehicles (EVs). Feature-based machine learning methods have exhibited enormous potential for rapidly and precisely monitoring battery health status. However, simultaneously using various health indicators (HIs) may weaken estimation performance due to feature redundancy. Furthermore, ignoring real-world driving behaviors can lead to inaccurate estimation results as some features are rarely accessible in practical scenarios. To address these issues, we proposed a feature-based machine learning pipeline for reliable battery health monitoring, enabled by evaluating the acquisition probability of features under real-world driving conditions. We first summarized and analyzed various individual HIs with mechanism-related interpretations, which provide insightful guidance on how these features relate to battery degradation modes. Moreover, all features were carefully evaluated and screened based on estimation accuracy and correlation analysis on three public battery degradation datasets. Finally, the scenario-based feature fusion and acquisition probability-based practicality evaluation method construct a useful tool for feature extraction with consideration of driving behaviors. This work highlights the importance of balancing the performance and practicality of HIs during the development of feature-based battery health monitoring algorithms.

Physics-Informed Solution of The Stationary Fokker-Plank Equation for a Class of Nonlinear Dynamical Systems: An Evaluation Study

  • paper_url: http://arxiv.org/abs/2309.16725
  • repo_url: None
  • paper_authors: Hussam Alhussein, Mohammed Khasawneh, Mohammed F. Daqaq
  • for: Presents a data-free, physics-informed neural network (PINN) framework to solve the Fokker-Planck (FP) equation for a class of nonlinear stochastic dynamical systems.
  • methods: A neural network approximates the solution of the FP equation without requiring any data from the system.
  • results: Demonstrates the ability and accuracy of the PINN framework in predicting the probability density function (PDF) under the combined effect of additive and multiplicative noise, capturing P-bifurcations of the PDF, and treating high-dimensional systems effectively; the computational time of the PINN solution can be substantially reduced with transfer learning.
    Abstract The Fokker-Planck (FP) equation is a linear partial differential equation which governs the temporal and spatial evolution of the probability density function (PDF) associated with the response of stochastic dynamical systems. An exact analytical solution of the FP equation is only available for a limited subset of dynamical systems. Semi-analytical methods are available for larger, yet still a small subset of systems, while traditional computational methods; e.g. Finite Elements and Finite Difference require dividing the computational domain into a grid of discrete points, which incurs significant computational costs for high-dimensional systems. Physics-informed learning offers a potentially powerful alternative to traditional computational schemes. To evaluate its potential, we present a data-free, physics-informed neural network (PINN) framework to solve the FP equation for a class of nonlinear stochastic dynamical systems. In particular, through several examples concerning the stochastic response of the Duffing, Van der Pol, and the Duffing-Van der Pol oscillators, we assess the ability and accuracy of the PINN framework in $i)$ predicting the PDF under the combined effect of additive and multiplicative noise, $ii)$ capturing P-bifurcations of the PDF, and $iii)$ effectively treating high-dimensional systems. Through comparisons with Monte-Carlo simulations and the available literature, we show that PINN can effectively address all of the afore-described points. We also demonstrate that the computational time associated with the PINN solution can be substantially reduced by using transfer learning.
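The PINN idea is to penalize the squared residual of the stationary FP equation at collocation points instead of fitting data. A minimal one-dimensional sketch, assuming a linear drift f(x) = -x and constant diffusion D = 0.5 (an illustrative system whose stationary density is a zero-mean Gaussian, not one of the paper's oscillators):

```python
import torch

# Stationary 1-D Fokker-Planck residual:  0 = -d/dx[f(x) p(x)] + D p''(x),
# with f(x) = -x and D = 0.5. The network parameterizes p(x) >= 0.
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1), torch.nn.Softplus())

def fp_residual(x):
    x = x.requires_grad_(True)
    p = net(x)
    flux = -x * p                                             # drift term f(x) p
    d_flux = torch.autograd.grad(flux.sum(), x, create_graph=True)[0]
    dp = torch.autograd.grad(p.sum(), x, create_graph=True)[0]
    d2p = torch.autograd.grad(dp.sum(), x, create_graph=True)[0]
    return (-d_flux + 0.5 * d2p) ** 2                         # squared PDE residual

x = torch.linspace(-3.0, 3.0, 64).unsqueeze(1)   # collocation points
loss = fp_residual(x).mean()   # a full solver adds normalization/boundary terms
loss.backward()                # ready for any torch optimizer step
```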

MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks

  • paper_url: http://arxiv.org/abs/2309.14118
  • repo_url: https://github.com/epfl-iglobalhealth/multimodn
  • paper_authors: Vinitra Swamy, Malika Satayeva, Jibril Frej, Thierry Bossy, Thijs Vogels, Martin Jaggi, Tanja Käser, Mary-Anne Hartley
  • for: Proposes an interpretable, multi-task, multimodal machine learning model that can make predictions from any number or combination of input modalities.
  • methods: MultiModN, a multimodal modular network that sequentially fuses latent representations of any number, combination, or type of data, improving predictive performance and interpretability.
  • results: Experiments on several benchmark datasets show strong performance across modalities; unlike parallel-fusion baselines, MultiModN does not fail catastrophically under biased missingness (missing not-at-random) at inference.
    Abstract Predicting multiple real-world tasks in a single model often requires a particularly diverse feature space. Multimodal (MM) models aim to extract the synergistic predictive potential of multiple data types to create a shared feature space with aligned semantic meaning across inputs of drastically varying sizes (i.e. images, text, sound). Most current MM architectures fuse these representations in parallel, which not only limits their interpretability but also creates a dependency on modality availability. We present MultiModN, a multimodal, modular network that fuses latent representations in a sequence of any number, combination, or type of modality while providing granular real-time predictive feedback on any number or combination of predictive tasks. MultiModN's composable pipeline is interpretable-by-design, as well as innately multi-task and robust to the fundamental issue of biased missingness. We perform four experiments on several benchmark MM datasets across 10 real-world tasks (predicting medical diagnoses, academic performance, and weather), and show that MultiModN's sequential MM fusion does not compromise performance compared with a baseline of parallel fusion. By simulating the challenging bias of missing not-at-random (MNAR), this work shows that, contrary to MultiModN, parallel fusion baselines erroneously learn MNAR and suffer catastrophic failure when faced with different patterns of MNAR at inference. To the best of our knowledge, this is the first inherently MNAR-resistant approach to MM modeling. In conclusion, MultiModN provides granular insights, robustness, and flexibility without compromising performance.
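A minimal sketch of sequential fusion, the property that lets such a model skip missing modalities: each available modality updates a shared state in turn, and a head can be queried after any prefix. The module shapes, names, and update rule below are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

state_dim = 32
encoders = nn.ModuleDict({
    "labs": nn.Linear(10 + state_dim, state_dim),   # 10-dim lab features
    "text": nn.Linear(64 + state_dim, state_dim),   # 64-dim text embedding
})
head = nn.Linear(state_dim, 2)   # one of possibly many task heads

def predict(available):
    """Fuse whatever modalities are present, in sequence."""
    state = torch.zeros(1, state_dim)
    for name, x in available.items():               # any subset, any order
        state = torch.tanh(encoders[name](torch.cat([x, state], dim=1)))
    return head(state)

# Works even when "text" is missing, with no imputation needed:
print(predict({"labs": torch.randn(1, 10)}))
```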

HyperTrack: Neural Combinatorics for High Energy Physics

  • paper_url: http://arxiv.org/abs/2309.14113
  • repo_url: https://github.com/mieskolainen/hypertrack
  • paper_authors: Mikael Mieskolainen
  • for: Solving combinatorial inverse problems in high energy physics.
  • methods: A new deep-learning-driven clustering algorithm built from a space-time non-local trainable graph constructor, a graph neural network, and a set transformer, trained with node-, edge-, and object-level losses including contrastive learning and meta-supervision.
  • results: Demonstrates the effectiveness of this cutting-edge AI approach through particle tracking simulations; the code is available online.
    Abstract Combinatorial inverse problems in high energy physics span enormous algorithmic challenges. This work presents a new deep learning driven clustering algorithm that utilizes a space-time non-local trainable graph constructor, a graph neural network, and a set transformer. The model is trained with loss functions at the graph node, edge and object level, including contrastive learning and meta-supervision. The algorithm can be applied to problems such as charged particle tracking, calorimetry, pile-up discrimination, jet physics, and beyond. We showcase the effectiveness of this cutting-edge AI approach through particle tracking simulations. The code is available online.

Affective Game Computing: A Survey

  • paper_url: http://arxiv.org/abs/2309.14104
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Georgios N. Yannakakis, David Melhart
  • for: Surveys the application of modern affective computing principles, methods, and tools to games, i.e., affective game computing.
  • methods: Reviews the field through the four core phases of the affective loop: game affect elicitation, game affect sensing, game affect detection, and game affect adaptation.
  • results: Provides a taxonomy of terms, methods, and approaches across the four phases, a comprehensive review of affect data collection methods, and a discussion of current limitations and promising future research directions.
    Abstract This paper surveys the current state of the art in affective computing principles, methods and tools as applied to games. We review this emerging field, namely affective game computing, through the lens of the four core phases of the affective loop: game affect elicitation, game affect sensing, game affect detection and game affect adaptation. In addition, we provide a taxonomy of terms, methods and approaches used across the four phases of the affective game loop and situate the field within this taxonomy. We continue with a comprehensive review of available affect data collection methods with regards to gaming interfaces, sensors, annotation protocols, and available corpora. The paper concludes with a discussion on the current limitations of affective game computing and our vision for the most promising future research directions in the field.

Tracking Control for a Spherical Pendulum via Curriculum Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.14096
  • repo_url: None
  • paper_authors: Pascal Klink, Florian Wolf, Kai Ploeger, Jan Peters, Joni Pajarinen
  • for: Learning non-trivial robot control laws purely from data, without hand-crafted regularizations such as manually designed curricula.
  • methods: Pairs a recent algorithm for automatically building curricula with reinforcement learning on massively parallelized simulations; an improved optimization scheme that better respects the non-Euclidean task structure reliably generates curricula of trajectories to be tracked, yielding faster and more robust learning.
  • results: The learned policy matches the performance of an optimal control baseline on the real system.
    Abstract Reinforcement Learning (RL) allows learning non-trivial robot control laws purely from data. However, many successful applications of RL have relied on ad-hoc regularizations, such as hand-crafted curricula, to regularize the learning performance. In this paper, we pair a recent algorithm for automatically building curricula with RL on massively parallelized simulations to learn a tracking controller for a spherical pendulum on a robotic arm via RL. Through an improved optimization scheme that better respects the non-Euclidean task structure, we allow the method to reliably generate curricula of trajectories to be tracked, resulting in faster and more robust learning compared to an RL baseline that does not exploit this form of structured learning. The learned policy matches the performance of an optimal control baseline on the real system, demonstrating the potential of curriculum RL to jointly learn state estimation and control for non-linear tracking tasks.

On the Benefit of Optimal Transport for Curriculum Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.14091
  • repo_url: None
  • paper_authors: Pascal Klink, Carlo D’Eramo, Jan Peters, Joni Pajarinen
  • for: solves complex tasks by generating a tailored sequence of learning tasks
  • methods: uses interpolations between task distributions to generate curricula
  • results: improves upon existing CRL methods and achieves high performance in various tasks
    Abstract Curriculum reinforcement learning (CRL) allows solving complex tasks by generating a tailored sequence of learning tasks, starting from easy ones and subsequently increasing their difficulty. Although the potential of curricula in RL has been clearly shown in various works, it is less clear how to generate them for a given learning environment, resulting in various methods aiming to automate this task. In this work, we focus on framing curricula as interpolations between task distributions, which has previously been shown to be a viable approach to CRL. Identifying key issues of existing methods, we frame the generation of a curriculum as a constrained optimal transport problem between task distributions. Benchmarks show that this way of curriculum generation can improve upon existing CRL methods, yielding high performance in various tasks with different characteristics.
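Framing a curriculum as an interpolation between task distributions is simple to illustrate in one dimension, where the Wasserstein-2 geodesic between two Gaussians interpolates mean and standard deviation linearly. Everything below (the task parameterization and numbers) is a made-up example, not the paper's constrained optimal transport formulation.

```python
import numpy as np

# Easy source tasks vs. hard target tasks, each a 1-D Gaussian over a task
# parameter (say, a goal distance). For 1-D Gaussians the Wasserstein-2
# geodesic interpolates mean and standard deviation linearly.
mu0, sigma0 = 0.0, 0.2    # easy tasks
mu1, sigma1 = 5.0, 1.0    # target tasks

def curriculum_stage(alpha):
    """alpha in [0, 1]: a point on the W2 geodesic between the two."""
    return (1 - alpha) * mu0 + alpha * mu1, (1 - alpha) * sigma0 + alpha * sigma1

rng = np.random.default_rng(0)
for alpha in (0.0, 0.5, 1.0):
    mu, sigma = curriculum_stage(alpha)
    print(f"stage {alpha:.1f}: tasks ~ {rng.normal(mu, sigma, size=3).round(2)}")
```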

BiSinger: Bilingual Singing Voice Synthesis

  • paper_url: http://arxiv.org/abs/2309.14089
  • repo_url: None
  • paper_authors: Huali Zhou, Yueqian Lin, Yao Shi, Peng Sun, Ming Li
  • for: Advances multilingual singing voice synthesis (SVS) with BiSinger, a bilingual pop SVS system for English and Mandarin Chinese.
  • methods: A shared representation between Chinese and English singing voices, built from the CMU dictionary with mapping rules; monolingual singing datasets are fused with open-source singing voice conversion techniques to generate bilingual singing voices.
  • results: Experiments show that a single model handles English and code-switched SVS while maintaining performance on Chinese songs. Audio samples are available at https://bisinger-svs.github.io.
    Abstract Although Singing Voice Synthesis (SVS) has made great strides with Text-to-Speech (TTS) techniques, multilingual singing voice modeling remains relatively unexplored. This paper presents BiSinger, a bilingual pop SVS system for English and Chinese Mandarin. Current systems require separate models per language and cannot accurately represent both Chinese and English, hindering code-switch SVS. To address this gap, we design a shared representation between Chinese and English singing voices, achieved by using the CMU dictionary with mapping rules. We fuse monolingual singing datasets with open-source singing voice conversion techniques to generate bilingual singing voices while also exploring the potential use of bilingual speech data. Experiments affirm that our language-independent representation and incorporation of related datasets enable a single model with enhanced performance in English and code-switch SVS while maintaining Chinese song performance. Audio samples are available at https://bisinger-svs.github.io.

REPA: Client Clustering without Training and Data Labels for Improved Federated Learning in Non-IID Settings

  • paper_url: http://arxiv.org/abs/2309.14088
  • repo_url: None
  • paper_authors: Boris Radovič, Veljko Pejović
  • for: Improving federated learning (FL) performance in non-IID settings by clustering clients into groups with relatively homogeneous data distributions.
  • methods: A novel supervised autoencoder-based method creates client embeddings that profile each client's underlying data-generating process, without local training, without labeled data, and without exposing data to the server.
  • results: Experimental analysis over three datasets shows that REPA delivers state-of-the-art model performance while extending cluster-based FL to previously uncovered use cases.
    Abstract Clustering clients into groups that exhibit relatively homogeneous data distributions represents one of the major means of improving the performance of federated learning (FL) in non-independent and identically distributed (non-IID) data settings. Yet, the applicability of current state-of-the-art approaches remains limited as these approaches cluster clients based on information, such as the evolution of local model parameters, that is only obtainable through actual on-client training. On the other hand, there is a need to make FL models available to clients who are not able to perform the training themselves, as they do not have the processing capabilities required for training, or simply want to use the model without participating in the training. Furthermore, the existing alternative approaches that avert the training still require that individual clients have a sufficient amount of labeled data upon which the clustering is based, essentially assuming that each client is a data annotator. In this paper, we present REPA, an approach to client clustering in non-IID FL settings that requires neither training nor labeled data collection. REPA uses a novel supervised autoencoder-based method to create embeddings that profile a client's underlying data-generating processes without exposing the data to the server and without requiring local training. Our experimental analysis over three different datasets demonstrates that REPA delivers state-of-the-art model performance while expanding the applicability of cluster-based FL to previously uncovered use cases.

Diversify and Conquer: Bandits and Diversity for an Enhanced E-commerce Homepage Experience

  • paper_url: http://arxiv.org/abs/2309.14046
  • repo_url: None
  • paper_authors: Sangeet Jaiswal, Korah T Malayil, Saif Jawaid, Sreekanth Vempati
  • for: Improving the effectiveness of recommended advertisements and products on e-commerce homepages, specifically by personalizing the ordering of vertical widgets.
  • methods: Models vertical widget reordering as a contextual multi-arm bandit problem with delayed batch feedback, combined with a diversity layer in a two-stage ranking framework.
  • results: Offline and online A/B tests on proprietary data from Myntra demonstrate the effectiveness of the approach.
    Abstract In the realm of e-commerce, popular platforms utilize widgets to recommend advertisements and products to their users. However, the prevalence of mobile device usage on these platforms introduces a unique challenge due to the limited screen real estate available. Consequently, the positioning of relevant widgets becomes pivotal in capturing and maintaining customer engagement. Given the restricted screen size of mobile devices, widgets placed at the top of the interface are more prominently displayed and thus attract greater user attention. Conversely, widgets positioned further down the page require users to scroll, resulting in reduced visibility and subsequent lower impression rates. Therefore it becomes imperative to place relevant widgets on top. However, selecting relevant widgets to display is a challenging task as the widgets can be heterogeneous, widgets can be introduced or removed at any given time from the platform. In this work, we model the vertical widget reordering as a contextual multi-arm bandit problem with delayed batch feedback. The objective is to rank the vertical widgets in a personalized manner. We present a two-stage ranking framework that combines contextual bandits with a diversity layer to improve the overall ranking. We demonstrate its effectiveness through offline and online A/B results, conducted on proprietary data from Myntra, a major fashion e-commerce platform in India.
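A toy version of the two-stage ranking helps fix ideas: a Beta-Bernoulli bandit (Thompson sampling) scores widgets, then a diversity layer greedily re-ranks while penalizing back-to-back widgets of the same category. Widget names, counts, and the penalty weight are placeholders; the paper's contextual, delayed-batch-feedback setup is considerably richer.

```python
import numpy as np

rng = np.random.default_rng(7)
widgets = {"sale_banner": "sale", "flash_deals": "sale",
           "new_arrivals": "new", "brand_spotlight": "brand"}
clicks = {"sale_banner": 30, "flash_deals": 25,
          "new_arrivals": 20, "brand_spotlight": 10}
views = {w: 100 for w in widgets}

# Stage 1: Thompson sampling from each widget's Beta posterior over CTR.
scores = {w: rng.beta(1 + clicks[w], 1 + views[w] - clicks[w]) for w in widgets}

# Stage 2: greedy diversity-aware re-ranking.
ranking, last_cat, pool = [], None, set(widgets)
while pool:
    best = max(pool, key=lambda w: scores[w] - 0.2 * (widgets[w] == last_cat))
    pool.remove(best)
    ranking.append(best)
    last_cat = widgets[best]
print(ranking)   # e.g. a 'sale' widget first, then a different category
```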

Hierarchical Imitation Learning for Stochastic Environments

  • paper_url: http://arxiv.org/abs/2309.14003
  • repo_url: None
  • paper_authors: Maximilian Igl, Punit Shah, Paul Mougin, Sirish Srinivasan, Tarun Gupta, Brandyn White, Kyriacos Shiarlis, Shimon Whiteson
  • for: Improving imitation-learned behavior models so that agents reproduce the full distribution of behaviour observed in the training data.
  • methods: Proposes Robust Type Conditioning (RTC), which eliminates the distribution shift in agent types caused by environmental stochasticity through adversarial training under randomly sampled types.
  • results: Experiments on two domains, including the large-scale Waymo Open Motion Dataset, show improved distributional realism while maintaining or improving task performance over state-of-the-art baselines.
    Abstract Many applications of imitation learning require the agent to generate the full distribution of behaviour observed in the training data. For example, to evaluate the safety of autonomous vehicles in simulation, accurate and diverse behaviour models of other road users are paramount. Existing methods that improve this distributional realism typically rely on hierarchical policies. These condition the policy on types such as goals or personas that give rise to multi-modal behaviour. However, such methods are often inappropriate for stochastic environments where the agent must also react to external factors: because agent types are inferred from the observed future trajectory during training, these environments require that the contributions of internal and external factors to the agent behaviour are disentangled and only internal factors, i.e., those under the agent's control, are encoded in the type. Encoding future information about external factors leads to inappropriate agent reactions during testing, when the future is unknown and types must be drawn independently from the actual future. We formalize this challenge as distribution shift in the conditional distribution of agent types under environmental stochasticity. We propose Robust Type Conditioning (RTC), which eliminates this shift with adversarial training under randomly sampled types. Experiments on two domains, including the large-scale Waymo Open Motion Dataset, show improved distributional realism while maintaining or improving task performance compared to state-of-the-art baselines.

Identification of Mixtures of Discrete Product Distributions in Near-Optimal Sample and Time Complexity

  • paper_url: http://arxiv.org/abs/2309.13993
  • repo_url: None
  • paper_authors: Spencer L. Gordon, Erik Jahn, Bijan Mazaheri, Yuval Rabani, Leonard J. Schulman
  • for: Identifying, from statistics, a distribution of discrete random variables $X_1,\ldots,X_n$ that is a mixture of $k$ product distributions.
  • methods: Combines a classic method for robust tensor decomposition with a novel way of bounding the condition number of key matrices, called Hadamard extensions, by studying their action only on flattened rank-1 tensors.
  • results: For any $n\geq 2k-1$, achieves sample and run-time complexity $(1/\zeta)^{O(k)}$, and extends the known $e^{\Omega(k)}$ lower bound to match this upper bound across a broad range of $\zeta$.
    Abstract We consider the problem of identifying, from statistics, a distribution of discrete random variables $X_1,\ldots,X_n$ that is a mixture of $k$ product distributions. The best previous sample complexity for $n \in O(k)$ was $(1/\zeta)^{O(k^2 \log k)}$ (under a mild separation assumption parameterized by $\zeta$). The best known lower bound was $\exp(\Omega(k))$. It is known that $n\geq 2k-1$ is necessary and sufficient for identification. We show, for any $n\geq 2k-1$, how to achieve sample complexity and run-time complexity $(1/\zeta)^{O(k)}$. We also extend the known lower bound of $e^{\Omega(k)}$ to match our upper bound across a broad range of $\zeta$. Our results are obtained by combining (a) a classic method for robust tensor decomposition, (b) a novel way of bounding the condition number of key matrices called Hadamard extensions, by studying their action only on flattened rank-1 tensors.

A Novel Approach for Effective Multi-View Clustering with Information-Theoretic Perspective

  • paper_url: http://arxiv.org/abs/2309.13989
  • repo_url: https://github.com/gzcch/A-Novel-Approach-for-Effective-Multi-View-Clustering-with-Information-Theoretic-Perspective-SUMVC
  • paper_authors: Chenhang Cui, Yazhou Ren, Jingyu Pu, Jiawei Li, Xiaorong Pu, Tianyi Wu, Yutao Shi, Lifang He
  • for: Improving multi-view clustering performance using various data sources.
  • methods: Uses variational analysis to generate consistent information (the SCMVC method), and proposes a sufficient representation lower bound that enhances consistent information while minimizing redundant information across views.
  • results: A theoretical analysis based on the Bayes error rate and experiments on multiple multi-view datasets show that SUMVC outperforms existing methods and offers a new perspective for analyzing multi-view data.
    Abstract Multi-view clustering (MVC) is a popular technique for improving clustering performance using various data sources. However, existing methods primarily focus on acquiring consistent information while often neglecting the issue of redundancy across multiple views. This study presents a new approach called Sufficient Multi-View Clustering (SUMVC) that examines the multi-view clustering framework from an information-theoretic standpoint. Our proposed method consists of two parts. Firstly, we develop a simple and reliable multi-view clustering method SCMVC (simple consistent multi-view clustering) that employs variational analysis to generate consistent information. Secondly, we propose a sufficient representation lower bound to enhance consistent information and minimise unnecessary information among views. The proposed SUMVC method offers a promising solution to the problem of multi-view clustering and provides a new perspective for analyzing multi-view data. To verify the effectiveness of our model, we conducted a theoretical analysis based on the Bayes Error Rate, and experiments on multiple multi-view datasets demonstrate the superior performance of SUMVC.

Physics-Driven ML-Based Modelling for Correcting Inverse Estimation

  • paper_url: http://arxiv.org/abs/2309.13985
  • repo_url: None
  • paper_authors: Ruiyuan Kang, Tingting Mu, Panos Liatsis, Dimitrios C. Kyritsis
  • for: Avoiding failed machine learning estimations in science and engineering (SAE) domains, where failures can have disastrous consequences (e.g., in aero engine design), by detecting and correcting failed state estimations before adopting them in SAE inverse problems, using simulations and performance metrics guided by physical laws.
  • methods: Flags an estimation when its physical model error exceeds a feasible threshold and corrects it with GEESE, an optimization-based approach whose key designs are (1) a hybrid surrogate error model providing fast error estimates that reduce simulation cost and enable gradient-based backpropagation of error feedback, and (2) two generative models approximating the probability distributions of candidate states to simulate exploitation and exploration behaviours; all three models are neural networks.
  • results: On three real-world SAE inverse problems, GEESE fails the fewest times in finding a feasible state correction and generally requires physical evaluations less frequently than a number of state-of-the-art optimization/search approaches.
    Abstract When deploying machine learning estimators in science and engineering (SAE) domains, it is critical to avoid failed estimations that can have disastrous consequences, e.g., in aero engine design. This work focuses on detecting and correcting failed state estimations before adopting them in SAE inverse problems, by utilizing simulations and performance metrics guided by physical laws. We suggest to flag a machine learning estimation when its physical model error exceeds a feasible threshold, and propose a novel approach, GEESE, to correct it through optimization, aiming at delivering both low error and high efficiency. The key designs of GEESE include (1) a hybrid surrogate error model to provide fast error estimations to reduce simulation cost and to enable gradient based backpropagation of error feedback, and (2) two generative models to approximate the probability distributions of the candidate states for simulating the exploitation and exploration behaviours. All three models are constructed as neural networks. GEESE is tested on three real-world SAE inverse problems and compared to a number of state-of-the-art optimization/search approaches. Results show that it fails the least number of times in terms of finding a feasible state correction, and requires physical evaluations less frequently in general.

Newton Method-based Subspace Support Vector Data Description

  • paper_url: http://arxiv.org/abs/2309.13960
  • repo_url: None
  • paper_authors: Fahad Sohrab, Firas Laakom, Moncef Gabbouj
  • for: Proposes a Newton's-method-based optimization of Subspace Support Vector Data Description (S-SVDD) to improve data mapping and data description in one-class classification.
  • methods: Replaces gradient descent, which uses only first-order information, with Newton's method for subspace learning in one-class classification; both linear and nonlinear formulations are provided.
  • results: Experiments exploring both minimization and maximization strategies of the objective show that the proposed optimization outperforms gradient-based S-SVDD in most cases.
    Abstract In this paper, we present an adaptation of Newton's method for the optimization of Subspace Support Vector Data Description (S-SVDD). The objective of S-SVDD is to map the original data to a subspace optimized for one-class classification, and the iterative optimization process of data mapping and description in S-SVDD relies on gradient descent. However, gradient descent only utilizes first-order information, which may lead to suboptimal results. To address this limitation, we leverage Newton's method to enhance data mapping and data description for an improved optimization of subspace learning-based one-class classification. By incorporating this auxiliary information, Newton's method offers a more efficient strategy for subspace learning in one-class classification as compared to gradient-based optimization. The paper discusses the limitations of gradient descent and the advantages of using Newton's method in subspace learning for one-class classification tasks. We provide both linear and nonlinear formulations of Newton's method-based optimization for S-SVDD. In our experiments, we explored both the minimization and maximization strategies of the objective. The results demonstrate that the proposed optimization strategy outperforms the gradient-based S-SVDD in most cases.
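The gradient-vs-Newton contrast at the heart of the paper can be seen on a toy ill-conditioned quadratic, where a Newton step uses curvature to land at the optimum in one move; theta below is a stand-in for S-SVDD's projection matrix, and the objective is illustrative only.

```python
import numpy as np

# Gradient step vs. Newton step on a badly conditioned quadratic.
H = np.diag([100.0, 1.0])                 # Hessian (condition number 100)

def f(theta): return 0.5 * theta @ H @ theta
def grad(theta): return H @ theta

theta = np.array([1.0, 1.0])
gd = theta - 0.01 * grad(theta)                     # first-order information only
newton = theta - np.linalg.solve(H, grad(theta))    # uses curvature as well
print(f(gd), f(newton))   # ~0.49 vs 0.0: Newton solves the quadratic in one step
```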

Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design

  • paper_url: http://arxiv.org/abs/2309.13957
  • repo_url: https://github.com/schwallergroup/augmented_memory
  • paper_authors: Jeff Guo, Philippe Schwaller
  • for: This paper aims to improve the explainability and sample efficiency of generative molecular design.
  • methods: The paper proposes Beam Enumeration, a method that exhaustively enumerates the most probable sub-sequences from language-based molecular generative models and extracts meaningful molecular substructures.
  • results: The proposed method improves the performance of the recently reported Augmented Memory algorithm, achieving better sample efficiency and generating more high-reward molecules.
    Abstract Generative molecular design has moved from proof-of-concept to real-world applicability, as marked by the surge in very recent papers reporting experimental validation. Key challenges in explainability and sample efficiency present opportunities to enhance generative design to directly optimize expensive high-fidelity oracles and provide actionable insights to domain experts. Here, we propose Beam Enumeration to exhaustively enumerate the most probable sub-sequences from language-based molecular generative models and show that molecular substructures can be extracted. When coupled with reinforcement learning, extracted substructures become meaningful, providing a source of explainability and improving sample efficiency through self-conditioned generation. Beam Enumeration is generally applicable to any language-based molecular generative model and notably further improves the performance of the recently reported Augmented Memory algorithm, which achieved the new state-of-the-art on the Practical Molecular Optimization benchmark for sample efficiency. The combined algorithm generates more high reward molecules and faster, given a fixed oracle budget. Beam Enumeration is the first method to jointly address explainability and sample efficiency for molecular design.

Deep Reinforcement Learning for the Heat Transfer Control of Pulsating Impinging Jets

  • paper_url: http://arxiv.org/abs/2309.13955
  • repo_url: None
  • paper_authors: Sajad Salavatidezfouli, Giovanni Stabile, Gianluigi Rozza
  • for: Explores the applicability of deep reinforcement learning (DRL) to thermal control, studied via computational fluid dynamics on forced convection over a hot plate subject to a pulsating cooling jet with variable velocity.
  • methods: Evaluates a vanilla Deep Q-Network (DQN) for thermal control, then conducts a comprehensive comparison of DRL variants.
  • results: Soft Double and Duel DQN delivered the best thermal control among all variants thanks to efficient learning and action prioritization; soft Double DQN outperforms hard Double DQN, and both soft Double and Duel keep the temperature within the desired threshold for more than 98% of the control cycle.
    Abstract This research study explores the applicability of Deep Reinforcement Learning (DRL) for thermal control based on Computational Fluid Dynamics. To accomplish that, the forced convection on a hot plate prone to a pulsating cooling jet with variable velocity has been investigated. We begin with evaluating the efficiency and viability of a vanilla Deep Q-Network (DQN) method for thermal control. Subsequently, a comprehensive comparison between different variants of DRL is conducted. Soft Double and Duel DQN achieved better thermal control performance among all the variants due to their efficient learning and action prioritization capabilities. Results demonstrate that the soft Double DQN outperforms the hard Double DQN. Moreover, soft Double and Duel can maintain the temperature in the desired threshold for more than 98% of the control cycle. These findings demonstrate the promising potential of DRL in effectively addressing thermal control systems.
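Assuming "soft" here denotes Polyak-averaged target-network updates (the usual reading in the DQN literature), the difference from a hard update is a one-liner:

```python
import torch
import torch.nn as nn

q_net = nn.Linear(4, 2)                      # stand-in Q-network
target = nn.Linear(4, 2)
target.load_state_dict(q_net.state_dict())   # start in sync

def soft_update(tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * q_net.
    A hard update would instead copy the weights every K steps."""
    with torch.no_grad():
        for p, tp in zip(q_net.parameters(), target.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)

soft_update()   # called after every gradient step on q_net
```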

Local and Global Trend Bayesian Exponential Smoothing Models

  • paper_url: http://arxiv.org/abs/2309.13950
  • repo_url: None
  • paper_authors: Slawek Smyl, Christoph Bergmeir, Alexander Dokumentov, Erwin Wibowo, Daniel Schmidt
  • for: Develops a family of seasonal and non-seasonal time series models that generalize additive and multiplicative exponential smoothing, motivated by fast-growing, volatile time series.
  • methods: Fits the models with state-of-the-art Bayesian techniques.
  • results: On the M3 competition data set, the models outperform the best algorithms in the competition as well as other benchmarks, achieving, to the authors' knowledge, the best univariate-method results on this dataset in the literature.
    Abstract This paper describes a family of seasonal and non-seasonal time series models that can be viewed as generalisations of additive and multiplicative exponential smoothing models. Their development is motivated by fast-growing, volatile time series, and facilitated by state-of-the-art Bayesian fitting techniques. When applied to the M3 competition data set, they outperform the best algorithms in the competition as well as other benchmarks, thus achieving to the best of our knowledge the best results of univariate methods on this dataset in the literature.
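For orientation, the classical additive-trend (Holt) recursion that these models generalize looks as follows; the paper's models add mixed additive/multiplicative components and fit everything by Bayesian sampling rather than with fixed smoothing weights like the arbitrary ones below.

```python
def holt_forecast(y, alpha=0.5, beta=0.3):
    """Classical additive-trend (Holt) exponential smoothing, returning a
    one-step-ahead forecast. Smoothing weights here are arbitrary."""
    level, trend = y[0], y[1] - y[0]
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + trend)   # smooth the level
        trend = beta * (level - prev_level) + (1 - beta) * trend  # smooth the trend
    return level + trend

print(holt_forecast([10.0, 12.0, 13.0, 15.0, 18.0]))  # ~19.4
```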

Characterising User Transfer Amid Industrial Resource Variation: A Bayesian Nonparametric Approach

  • paper_url: http://arxiv.org/abs/2309.13949
  • repo_url: None
  • paper_authors: Dongxu Lei, Xiaotian Lin, Xinghu Yu, Zhan Li, Weichao Sun, Jianbin Qiu, Songlin Zhuang, Huijun Gao
  • for: Advancing resource management strategy development by accurately characterizing macro-level patterns of user load transfer amid resource variation.
  • methods: CLUSTER, an interpretable hierarchical Bayesian nonparametric model that automates the identification of user clusters and predicts user transfer in response to resource variation.
  • results: Experiments with simulated and real-world data from the communications industry show pronounced alignment between predictions and empirical observations; the model quantifies uncertainty for reliable decision-making and, by functioning independently of personally identifiable information, protects user privacy.
    Abstract In a multitude of industrial fields, a key objective entails optimising resource management whilst satisfying user requirements. Resource management by industrial practitioners can result in a passive transfer of user loads across resource providers, a phenomenon whose accurate characterisation is both challenging and crucial. This research reveals the existence of user clusters, which capture macro-level user transfer patterns amid resource variation. We then propose CLUSTER, an interpretable hierarchical Bayesian nonparametric model capable of automating cluster identification, and thereby predicting user transfer in response to resource variation. Furthermore, CLUSTER facilitates uncertainty quantification for further reliable decision-making. Our method enables privacy protection by functioning independently of personally identifiable information. Experiments with simulated and real-world data from the communications industry reveal a pronounced alignment between prediction results and empirical observations across a spectrum of resource management scenarios. This research establishes a solid groundwork for advancing resource management strategy development.

Provable Training for Graph Contrastive Learning

  • paper_url: http://arxiv.org/abs/2309.13944
  • repo_url: https://github.com/voidharuhi/pot-gcl
  • paper_authors: Yue Yu, Xiao Wang, Mengmei Zhang, Nian Liu, Chuan Shi
  • for: Addressing the imbalanced training of Graph Contrastive Learning (GCL) across nodes to improve GCL performance and reliability.
  • methods: Presents experimental evidence that GCL training is imbalanced across nodes; proposes the metric "node compactness", a lower bound on how well a node follows the GCL principle over the range of augmentations; derives its form theoretically via bound propagation and integrates it into binary cross-entropy as a regularization, yielding PrOvable Training (POT).
  • results: Extensive experiments on various benchmarks show that POT consistently improves existing GCL approaches and serves as a friendly plugin.
    Abstract Graph Contrastive Learning (GCL) has emerged as a popular training approach for learning node embeddings from augmented graphs without labels. Despite the key principle that maximizing the similarity between positive node pairs while minimizing it between negative node pairs is well established, some fundamental problems are still unclear. Considering the complex graph structure, are some nodes consistently well-trained and following this principle even with different graph augmentations? Or are there some nodes more likely to be untrained across graph augmentations and violate the principle? How to distinguish these nodes and further guide the training of GCL? To answer these questions, we first present experimental evidence showing that the training of GCL is indeed imbalanced across all nodes. To address this problem, we propose the metric "node compactness", which is the lower bound of how a node follows the GCL principle related to the range of augmentations. We further derive the form of node compactness theoretically through bound propagation, which can be integrated into binary cross-entropy as a regularization. To this end, we propose the PrOvable Training (POT) for GCL, which regularizes the training of GCL to encode node embeddings that follows the GCL principle better. Through extensive experiments on various benchmarks, POT consistently improves the existing GCL approaches, serving as a friendly plugin.
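As background, a common GCL training objective pulls two augmented views of the same node together and pushes other nodes apart; POT adds its node-compactness regularizer on top of a base loss of this kind. The sketch below shows only such a base (InfoNCE-style) loss on random stand-in embeddings.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.5):
    """z1, z2: embeddings of the same N nodes under two graph augmentations.
    Positives sit on the diagonal of the similarity matrix."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau            # (N, N) cosine similarities
    labels = torch.arange(z1.size(0))     # node i matches itself
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(8, 16), torch.randn(8, 16)   # stand-ins for GNN outputs
print(info_nce(z1, z2))
```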

Evaluating Classification Systems Against Soft Labels with Fuzzy Precision and Recall

  • paper_url: http://arxiv.org/abs/2309.13938
  • repo_url: None
  • paper_authors: Manu Harju, Annamaria Mesaros
  • for: Proposes a new way to compute precision, recall, and F-score so that classification systems, such as sound event detectors, can be evaluated against non-binary (soft) reference labels.
  • methods: Extends the established definitions so that they require no quantization of the data and coincide with the standard metrics when used with binary labels; the Kullback-Leibler divergence remains the natural choice for measuring how closely a system follows the data.
  • results: Simple example cases and an evaluation of sound event detection models trained on real data with soft labels show that the proposed metrics avoid the erroneous interpretations that binarizing the data can cause.
    Abstract Classification systems are normally trained by minimizing the cross-entropy between system outputs and reference labels, which makes the Kullback-Leibler divergence a natural choice for measuring how closely the system can follow the data. Precision and recall provide another perspective for measuring the performance of a classification system. Non-binary references can arise from various sources, and it is often beneficial to use the soft labels for training instead of the binarized data. However, the existing definitions for precision and recall require binary reference labels, and binarizing the data can cause erroneous interpretations. We present a novel method to calculate precision, recall and F-score without quantizing the data. The proposed metrics extend the well established metrics as the definitions coincide when used with binary labels. To understand the behavior of the metrics we show simple example cases and an evaluation of different sound event detection models trained on real data with soft labels.
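One standard fuzzy-set construction makes the idea concrete: take intersections with a pointwise minimum so no binarization is needed, and the formulas reduce to ordinary precision and recall for binary labels. This is an illustration of the general approach; the paper derives its own definitions, which need not coincide with this one.

```python
import numpy as np

def fuzzy_precision_recall_f(pred, ref):
    """Fuzzy precision/recall/F-score for soft predictions vs. soft
    references: intersection via pointwise min, so nothing is binarized.
    With 0/1 inputs this reduces to the ordinary definitions."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    tp = np.minimum(pred, ref).sum()
    precision = tp / pred.sum()
    recall = tp / ref.sum()
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

ref = [0.9, 0.6, 0.1, 0.0]    # e.g. per-event annotator agreement
pred = [0.8, 0.7, 0.3, 0.1]   # system outputs
print(fuzzy_precision_recall_f(pred, ref))  # ~ (0.79, 0.94, 0.86)
```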

SAMN: A Sample Attention Memory Network Combining SVM and NN in One Architecture

  • paper_url: http://arxiv.org/abs/2309.13930
  • repo_url: None
  • paper_authors: Qiaoling Yang, Linkai Luo, Haoyu Zhang, Hong Peng, Ziyang Chen
  • for: This paper aims to combine Support Vector Machines (SVM) and Neural Networks (NN) to create a more powerful function for multi-classification tasks.
  • methods: The proposed method, called Sample Attention Memory Network (SAMN), incorporates a sample attention module, class prototypes, and a memory block into NN to effectively combine SVM and NN.
  • results: Extensive experiments show that SAMN achieves better classification performance than single SVM or single NN with similar parameter sizes, as well as the previous best model for combining SVM and NN.Here’s the same information in Simplified Chinese:
    Abstract Support vector machine (SVM) and neural networks (NN) have strong complementarity. SVM focuses on the inner operation among samples while NN focuses on the operation among the features within samples. Thus, it is promising and attractive to combine SVM and NN, as it may provide a more powerful function than SVM or NN alone. However, current work on combining them lacks true integration. To address this, we propose a sample attention memory network (SAMN) that effectively combines SVM and NN by incorporating sample attention module, class prototypes, and memory block to NN. SVM can be viewed as a sample attention machine. It allows us to add a sample attention module to NN to implement the main function of SVM. Class prototypes are representatives of all classes, which can be viewed as alternatives to support vectors. The memory block is used for the storage and update of class prototypes. Class prototypes and memory block effectively reduce the computational cost of sample attention and make SAMN suitable for multi-classification tasks. Extensive experiments show that SAMN achieves better classification performance than single SVM or single NN with similar parameter sizes, as well as the previous best model for combining SVM and NN. The sample attention mechanism is a flexible module that can be easily deepened and incorporated into neural networks that require it.
    摘要 支持向量机(SVM)和神经网络(NN)具有很强的互补性。SVM关注样本之间的内部运算,而NN关注样本内部特征之间的运算。因此,将SVM和NN结合起来很有前景,有望提供比单独的SVM或NN更强大的功能。然而,现有的结合工作缺乏真正的融合。为此,我们提出了一种样本注意力记忆网络(SAMN),通过在NN中引入样本注意力模块、类原型和记忆块,有效地结合了SVM和NN。SVM可以被视为一种样本注意力机器,这使得我们可以在NN中加入样本注意力模块来实现SVM的主要功能。类原型是各个类别的代表,可以被视为支持向量的替代品。记忆块用于存储和更新类原型。类原型和记忆块有效降低了样本注意力的计算成本,使SAMN适用于多分类任务。大量实验表明,在参数规模相近的情况下,SAMN的分类性能优于单独的SVM或NN,也优于此前最好的SVM与NN结合模型。样本注意力机制是一个灵活的模块,可以轻松地加深并集成到需要它的神经网络中。
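The exact SAMN architecture is not given in this digest; the sketch below only illustrates the core idea under stated assumptions: each sample attends over a small set of learned class prototypes held in a memory buffer, rather than over all training samples as a kernel SVM implicitly does over support vectors. All dimensions and layer choices are hypothetical:

```python
import torch
import torch.nn as nn

class SampleAttention(nn.Module):
    """Sketch of a sample-attention block over class prototypes (assumed design).

    Attending over C prototypes instead of all N training samples keeps the
    per-sample cost at O(C), which is what makes multi-class settings cheap.
    """

    def __init__(self, dim, num_classes):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim))  # memory block
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                      # x: (batch, dim)
        q = self.query(x)                      # (batch, dim)
        k = self.key(self.prototypes)          # (C, dim)
        v = self.value(self.prototypes)        # (C, dim)
        attn = torch.softmax(q @ k.t() * self.scale, dim=-1)  # (batch, C)
        return x + attn @ v                    # residual update from prototypes

# Usage: plug in front of a linear classifier head.
block = SampleAttention(dim=64, num_classes=10)
logits = nn.Linear(64, 10)(block(torch.randn(32, 64)))
print(logits.shape)  # torch.Size([32, 10])
```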

Pseudo Label Selection is a Decision Problem

  • paper_url: http://arxiv.org/abs/2309.13926
  • repo_url: https://github.com/aditya-rathi/Credit-Score-
  • paper_authors: Julian Rodemann
  • for: 这篇论文旨在提出一种基于决策理论的伪标签选择(PLS)方法,以缓解确认偏差(confirmation bias)问题。
  • methods: 该方法基于一种新的选择准则,即伪后验预测分布(pseudo posterior predictive)的解析近似,其推导建立在贝叶斯最优性之上。
  • results: 在模拟和真实数据上,当面临过拟合和确认偏差问题时,BPLS方法优于传统的PLS方法;此外,决策理论的嵌入还使PLS对建模假设更加稳健。
    Abstract Pseudo-Labeling is a simple and effective approach to semi-supervised learning. It requires criteria that guide the selection of pseudo-labeled data. The latter have been shown to crucially affect pseudo-labeling's generalization performance. Several such criteria exist and were proven to work reasonably well in practice. However, their performance often depends on the initial model fit on labeled data. Early overfitting can be propagated to the final model by choosing instances with overconfident but wrong predictions, often called confirmation bias. In two recent works, we demonstrate that pseudo-label selection (PLS) can be naturally embedded into decision theory. This paves the way for BPLS, a Bayesian framework for PLS that mitigates the issue of confirmation bias. At its heart is a novel selection criterion: an analytical approximation of the posterior predictive of pseudo-samples and labeled data. We derive this selection criterion by proving Bayes-optimality of this "pseudo posterior predictive". We empirically assess BPLS for generalized linear, non-parametric generalized additive models and Bayesian neural networks on simulated and real-world data. When faced with data prone to overfitting and thus a high chance of confirmation bias, BPLS outperforms traditional PLS methods. The decision-theoretic embedding further allows us to render PLS more robust towards the involved modeling assumptions. To achieve this goal, we introduce a multi-objective utility function. We demonstrate that the latter can be constructed to account for different sources of uncertainty and explore three examples: model selection, accumulation of errors and covariate shift.
    摘要 伪标注是一种简单而有效的半监督学习方法,它需要指导伪标注数据选择的准则,而这些准则已被证明对伪标注的泛化性能有决定性影响。现有若干此类准则,并在实践中表现尚可,但其性能往往取决于在有标注数据上的初始模型拟合:若选中预测过于自信却错误的样本,早期的过拟合会被传递到最终模型,这一现象通常被称为确认偏差(confirmation bias)。在最近的两项工作中,我们证明伪标签选择(PLS)可以自然地嵌入决策理论,从而得到 BPLS,一个能缓解确认偏差问题的贝叶斯 PLS 框架。其核心是一个新的选择准则:伪样本与有标注数据的后验预测分布的解析近似。我们通过证明该"伪后验预测"的贝叶斯最优性推导出这一准则。我们在模拟和真实数据上,对广义线性模型、非参数广义加性模型和贝叶斯神经网络实证评估了 BPLS。当数据易于过拟合、确认偏差风险较高时,BPLS 优于传统的 PLS 方法。决策理论的嵌入还使 PLS 对所涉及的建模假设更加稳健:为此我们引入了一个多目标效用函数,展示了如何构造它以涵盖不同的不确定性来源,并探讨了模型选择、误差累积和协变量偏移三个例子。
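The paper derives an analytical approximation of the pseudo posterior predictive; the sketch below substitutes a crude Monte Carlo stand-in (a bootstrap ensemble's averaged predictive probability) just to show the shape of a BPLS-style selection loop. Everything here is an illustrative assumption, not the paper's estimator:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def pseudo_posterior_predictive(models, x):
    """Crude stand-in for BPLS's analytical criterion: average the predictive
    probabilities over a bootstrap ensemble and score the most likely label."""
    probs = np.mean([m.predict_proba(x.reshape(1, -1))[0] for m in models], axis=0)
    return probs.max(), int(probs.argmax())

def bpls_style_selection(X_lab, y_lab, X_unlab, n_rounds=5, n_boot=20):
    X_lab, y_lab, X_unlab = map(np.asarray, (X_lab, y_lab, X_unlab))
    pool = list(range(len(X_unlab)))
    for _ in range(min(n_rounds, len(pool))):
        # Bootstrap refits as a cheap proxy for posterior draws of the model.
        models = [LogisticRegression(max_iter=1000).fit(*resample(X_lab, y_lab, stratify=y_lab))
                  for _ in range(n_boot)]
        scored = [(pseudo_posterior_predictive(models, X_unlab[i]), i) for i in pool]
        (score, label), i = max(scored)  # add the best-supported pseudo-label
        X_lab = np.vstack([X_lab, X_unlab[i]])
        y_lab = np.append(y_lab, label)
        pool.remove(i)
    return X_lab, y_lab

X_l, y_l = [[0.0], [0.2], [0.8], [1.0]], [0, 0, 1, 1]
X_new, y_new = bpls_style_selection(X_l, y_l, [[0.1], [0.9], [0.5]], n_rounds=2, n_boot=5)
print(y_new)  # labeled set grown by two pseudo-labeled points
```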

Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds

  • paper_url: http://arxiv.org/abs/2309.13915
  • repo_url: None
  • paper_authors: Zhenghao Xu, Xiang Ji, Minshuo Chen, Mengdi Wang, Tuo Zhao
  • for: 本文研究了使用神经网络的策略算法在强化学习中解决高维策略优化问题。
  • methods: 本文使用了神经网络作为策略和价值函数的函数近似器,并研究了NPMD算法的样本复杂性。
  • results: 研究发现,NPMD算法可以利用状态空间的低维结构减轻高维策略优化中的维数灾难,并能以期望样本复杂度 $\widetilde{O}(\epsilon^{-\frac{d}{\alpha}-2})$ 找到 $\epsilon$-最优策略。
    Abstract Policy-based algorithms equipped with deep neural networks have achieved great success in solving high-dimensional policy optimization problems in reinforcement learning. However, current analyses cannot explain why they are resistant to the curse of dimensionality. In this work, we study the sample complexity of the neural policy mirror descent (NPMD) algorithm with convolutional neural networks (CNN) as function approximators. Motivated by the empirical observation that many high-dimensional environments have state spaces possessing low-dimensional structures, such as those taking images as states, we consider the state space to be a $d$-dimensional manifold embedded in the $D$-dimensional Euclidean space with intrinsic dimension $d\ll D$. We show that in each iteration of NPMD, both the value function and the policy can be well approximated by CNNs. The approximation errors are controlled by the size of the networks, and the smoothness of the previous networks can be inherited. As a result, by properly choosing the network size and hyperparameters, NPMD can find an $\epsilon$-optimal policy with $\widetilde{O}(\epsilon^{-\frac{d}{\alpha}-2})$ samples in expectation, where $\alpha\in(0,1]$ indicates the smoothness of environment. Compared to previous work, our result exhibits that NPMD can leverage the low-dimensional structure of state space to escape from the curse of dimensionality, providing an explanation for the efficacy of deep policy-based algorithms.
    摘要

Matrix Factorization in Tropical and Mixed Tropical-Linear Algebras

  • paper_url: http://arxiv.org/abs/2309.13914
  • repo_url: None
  • paper_authors: Ioannis Kordonis, Emmanouil Theodosis, George Retsinas, Petros Maragos
  • for: 矩阵分解(MF)在机器学习和数据挖掘中的应用,包括协同过滤推荐系统、降维、数据可视化和社区检测。
  • methods: 我们基于热带代数(tropical algebra)与几何研究两个矩阵分解问题:热带矩阵分解(TMF),以及一个新的先普通乘法、后热带乘法的三矩阵分解问题。
  • results: 我们提出了一种改进的TMF算法,能避免许多局部最优;新的混合分解形式在学习多个用户的效用函数方面有有趣的解释。我们还给出了数值结果证明方法的有效性,并在推荐系统应用中取得了可喜的结果。
    Abstract Matrix Factorization (MF) has found numerous applications in Machine Learning and Data Mining, including collaborative filtering recommendation systems, dimensionality reduction, data visualization, and community detection. Motivated by the recent successes of tropical algebra and geometry in machine learning, we investigate two problems involving matrix factorization over the tropical algebra. For the first problem, Tropical Matrix Factorization (TMF), which has been studied already in the literature, we propose an improved algorithm that avoids many of the local optima. The second formulation considers the approximate decomposition of a given matrix into the product of three matrices where a usual matrix product is followed by a tropical product. This formulation has a very interesting interpretation in terms of the learning of the utility functions of multiple users. We also present numerical results illustrating the effectiveness of the proposed algorithms, as well as an application to recommendation systems with promising results.
    摘要 矩阵分解(MF)在机器学习和数据挖掘中有许多应用,包括协同过滤推荐系统、降维、数据可视化和社区检测。受热带代数与几何近来在机器学习中取得的成功启发,我们研究了两个基于热带代数的矩阵分解问题。对于第一个问题,即文献中已有研究的热带矩阵分解(TMF),我们提出了一种能避免许多局部最优的改进算法。第二个问题考虑将给定矩阵近似分解为三个矩阵的乘积,其中先做普通矩阵乘法,再做热带乘法;这一形式在学习多个用户的效用函数方面有非常有趣的解释。我们还给出了数值结果以说明所提算法的有效性,并给出了一个在推荐系统上的应用,取得了可喜的结果。
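For readers unfamiliar with the tropical (max-plus) semiring, the following sketch shows the tropical matrix product and the mixed usual-then-tropical three-factor form mentioned in the abstract; all shapes and data are made up for illustration:

```python
import numpy as np

def tropical_matmul(A, B):
    """Max-plus matrix product: (A (x) B)[i, j] = max_k (A[i, k] + B[k, j])."""
    # Broadcast A (m, r, 1) against B (1, r, n) and reduce over the shared axis.
    return np.max(A[:, :, None] + B[None, :, :], axis=1)

rng = np.random.default_rng(0)
U = rng.uniform(-1, 1, size=(4, 2))
V = rng.uniform(-1, 1, size=(2, 5))
X = tropical_matmul(U, V)          # a rank-2 tropical matrix by construction

# The mixed factorization follows an ordinary product with a tropical one;
# a (hypothetical) three-factor instance would look like:
W = rng.uniform(-1, 1, size=(3, 4))
Y = tropical_matmul(W @ U, V)      # usual product first, then tropical product
print(X.shape, Y.shape)            # (4, 5) (3, 5)
```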

Follow-ups Also Matter: Improving Contextual Bandits via Post-serving Contexts

  • paper_url: http://arxiv.org/abs/2309.13896
  • repo_url: None
  • paper_authors: Chaoqi Wang, Ziyu Ye, Zhe Feng, Ashwinkumar Badanidiyuru, Haifeng Xu
  • for: 提高在线学习效率,解决上下文老虎机(contextual bandit)问题中有价值的后服务上下文信息在决策时不可见的问题。
  • methods: 提出了一种利用后服务上下文信息进行学习的新型上下文老虎机模型,并设计了新算法poLinUCB,在标准假设下取得紧致的遗憾(regret)界。
  • results: 在合成数据和真实数据上的大量实验表明,利用后服务上下文信息能提高学习效率,且poLinUCB优于现有最先进方法。
    Abstract Standard contextual bandit problem assumes that all the relevant contexts are observed before the algorithm chooses an arm. This modeling paradigm, while useful, often falls short when dealing with problems in which valuable additional context can be observed after arm selection. For example, content recommendation platforms like Youtube, Instagram, Tiktok also observe valuable follow-up information pertinent to the user's reward after recommendation (e.g., how long the user stayed, what is the user's watch speed, etc.). To improve online learning efficiency in these applications, we study a novel contextual bandit problem with post-serving contexts and design a new algorithm, poLinUCB, that achieves tight regret under standard assumptions. Core to our technical proof is a robustified and generalized version of the well-known Elliptical Potential Lemma (EPL), which can accommodate noise in data. Such robustification is necessary for tackling our problem, and we believe it could also be of general interest. Extensive empirical tests on both synthetic and real-world datasets demonstrate the significant benefit of utilizing post-serving contexts as well as the superior performance of our algorithm over the state-of-the-art approaches.
    摘要 标准的上下文老虎机(contextual bandit)问题假设所有相关的上下文都能在算法选择臂之前被观察到。这种建模范式虽然有用,但在许多问题中并不适用:有价值的额外上下文往往在选择臂之后才能被观察到。例如,Youtube、Instagram、Tiktok 等内容推荐平台在推荐之后还能观察到与用户奖励相关的后续信息(例如用户停留时长、播放速度等)。为了提高此类应用中的在线学习效率,我们研究了一种带有后服务上下文的新型上下文老虎机问题,并设计了新算法 poLinUCB,在标准假设下取得紧致的遗憾界。我们技术证明的核心是著名的椭圆势引理(Elliptical Potential Lemma, EPL)的一个稳健化、泛化版本,它能够容纳数据中的噪声。这种稳健化对解决我们的问题是必要的,我们相信它也具有独立的意义。在合成数据和真实数据上的大量实验表明,利用后服务上下文能带来显著收益,且我们的算法优于现有最先进方法。
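poLinUCB's exact statistics are not reproduced here; the sketch below follows the spirit of the setup under explicit assumptions: reward is linear in the concatenated pre- and post-serving features, a ridge regression predicts the post-serving features at decision time, and a standard LinUCB index is computed on the plug-in vector:

```python
import numpy as np

class PostServingLinUCB:
    """Sketch in the spirit of poLinUCB (not the paper's exact algorithm)."""

    def __init__(self, d_pre, d_post, alpha=1.0, lam=1.0):
        d = d_pre + d_post
        self.A = lam * np.eye(d)        # ridge statistics for the reward model
        self.b = np.zeros(d)
        self.M = lam * np.eye(d_pre)    # ridge statistics for post-feature regression
        self.N = np.zeros((d_pre, d_post))
        self.alpha = alpha

    def choose(self, pre_contexts):
        theta = np.linalg.solve(self.A, self.b)
        W = np.linalg.solve(self.M, self.N)          # maps pre -> predicted post
        ucbs = []
        for x in pre_contexts:
            z = np.concatenate([x, x @ W])           # plug in predicted post-context
            ucbs.append(theta @ z + self.alpha * np.sqrt(z @ np.linalg.solve(self.A, z)))
        return int(np.argmax(ucbs))

    def update(self, x_pre, x_post, reward):         # post-serving context observed now
        z = np.concatenate([x_pre, x_post])
        self.A += np.outer(z, z)
        self.b += reward * z
        self.M += np.outer(x_pre, x_pre)
        self.N += np.outer(x_pre, x_post)

rng = np.random.default_rng(0)
bandit = PostServingLinUCB(d_pre=3, d_post=2)
arms = [rng.standard_normal(3) for _ in range(4)]
a = bandit.choose(arms)
bandit.update(arms[a], rng.standard_normal(2), reward=1.0)
```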

Graph Representation Learning Towards Patents Network Analysis

  • paper_url: http://arxiv.org/abs/2309.13888
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Mohammad Heydari, Babak Teimourpour
  • for: 本研究使用graph representation learning方法分析了伊朗官方公报中的专利数据,以找出相似性和新领域。
  • methods: 研究使用自然语言处理和实体消解技术从专利记录中提取关键实体,然后基于这些实体从零构建伊朗专利图。
  • results: 研究结果显示,通过使用Graph representation learning和文本挖掘技术,可以实现专利资料分析和探索新领域,并且可以预防重复申请专利、熟悉相似和相连的发明、了解法律实体支持专利和研究人员在特定领域的知识。
    Abstract Patent analysis has recently been recognized as a powerful technique for large companies worldwide to lend them insight into the age of competition among various industries. This technique is considered a shortcut for developing countries since it can significantly accelerate their technology development. Therefore, as an inevitable process, patent analysis can be utilized to monitor rival companies and diverse industries. This research employed a graph representation learning approach to create, analyze, and find similarities in the patent data registered in the Iranian Official Gazette. The patent records were scrapped and wrangled through the Iranian Official Gazette portal. Afterward, the key entities were extracted from the scrapped patents dataset to create the Iranian patents graph from scratch based on novel natural language processing and entity resolution techniques. Finally, thanks to the utilization of novel graph algorithms and text mining methods, we identified new areas of industry and research from Iranian patent data, which can be used extensively to prevent duplicate patents, familiarity with similar and connected inventions, Awareness of legal entities supporting patents and knowledge of researchers and linked stakeholders in a particular research field.
    摘要 专利分析近来被认为是世界各地大公司洞察各行业竞争态势的有力技术。由于它能显著加速技术发展,这种技术被视为发展中国家的一条捷径。因此,作为一个必然的过程,专利分析可以用来监测竞争对手和各个行业。本研究采用图表示学习方法,对伊朗官方公报中登记的专利数据进行创建、分析和相似性挖掘。我们从伊朗官方公报门户网站抓取并整理了专利记录,随后利用新的自然语言处理和实体消解技术从专利数据集中提取关键实体,从零构建了伊朗专利图。最后,借助新的图算法和文本挖掘方法,我们从伊朗专利数据中识别出新的产业和研究领域,这可广泛用于防止重复专利、了解相似和相关联的发明、了解支持专利的法律实体,以及了解特定研究领域中的研究人员和相关利益方。
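As a toy illustration of the pipeline (entities extracted from patents, a graph built from scratch, then similarity mining), the following sketch uses networkx with entirely made-up records:

```python
import networkx as nx

# Patents and their extracted entities (inventors, assignees, IPC classes)
# become nodes; edges link a patent to its entities. All names are invented.
records = [
    ("IR-1001", ["Ali Rezaei", "Sharif Univ.", "G06N"]),
    ("IR-1002", ["Ali Rezaei", "G06N"]),
    ("IR-1003", ["Sara Karimi", "H04L"]),
]
G = nx.Graph()
for patent, entities in records:
    G.add_node(patent, kind="patent")
    for e in entities:
        G.add_node(e, kind="entity")
        G.add_edge(patent, e)

# Project onto patents: two patents are similar if they share entities,
# which is one way to surface near-duplicate filings.
patents = [n for n, d in G.nodes(data=True) if d["kind"] == "patent"]
P = nx.bipartite.weighted_projected_graph(G, patents)
print(sorted(P.edges(data="weight")))  # [('IR-1001', 'IR-1002', 2)]
```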

Can Class-Priors Help Single-Positive Multi-Label Learning?

  • paper_url: http://arxiv.org/abs/2309.13886
  • repo_url: None
  • paper_authors: Biao Liu, Jie Wang, Ning Xu, Xin Geng
  • for: solves the problem of single-positive multi-label learning (SPMLL) with class-prior differences in real-world scenarios.
  • methods: proposes a novel framework called Class-pRiors Induced Single-Positive multi-label learning, which includes a class-priors estimator and an unbiased risk estimator for classification.
  • results: experiments on ten MLL benchmark datasets demonstrate the effectiveness and superiority of the proposed method over existing SPMLL approaches.
    Abstract Single-positive multi-label learning (SPMLL) is a typical weakly supervised multi-label learning problem, where each training example is annotated with only one positive label. Existing SPMLL methods typically assign pseudo-labels to unannotated labels with the assumption that prior probabilities of all classes are identical. However, the class-prior of each category may differ significantly in real-world scenarios, which makes the predictive model not perform as well as expected due to the unrealistic assumption on real-world application. To alleviate this issue, a novel framework, Class-pRiors Induced Single-Positive multi-label learning, is proposed. Specifically, a class-priors estimator is introduced, which could estimate the class-priors that are theoretically guaranteed to converge to the ground-truth class-priors. In addition, based on the estimated class-priors, an unbiased risk estimator for classification is derived, and the corresponding risk minimizer could be guaranteed to approximately converge to the optimal risk minimizer on fully supervised data. Experimental results on ten MLL benchmark datasets demonstrate the effectiveness and superiority of our method over existing SPMLL approaches.
    摘要 单正标签多标签学习(SPMLL)是一种典型的弱监督多标签学习问题,其中每个训练样本只标注了一个正标签。现有的SPMLL方法通常在假设所有类别先验概率相同的前提下,为未标注的标签赋予伪标签。然而,在现实场景中各类别的类别先验可能差异很大,这一不切实际的假设会使预测模型无法达到预期的表现。为了缓解这一问题,我们提出了一个新框架,即类别先验诱导的单正标签多标签学习。具体而言,我们引入了一个类别先验估计器,其估计值在理论上保证收敛到真实的类别先验。此外,基于估计出的类别先验,我们推导了一个无偏的分类风险估计器,并保证相应的风险最小化器能近似收敛到全监督数据上的最优风险最小化器。在十个MLL基准数据集上的实验结果表明,我们的方法优于现有的SPMLL方法。

Estimating Treatment Effects Under Heterogeneous Interference

  • paper_url: http://arxiv.org/abs/2309.13884
  • repo_url: https://github.com/linxf208/hinite
  • paper_authors: Xiaofeng Lin, Guoxi Zhang, Xiaotian Lu, Han Bao, Koh Takeuchi, Hisashi Kashima
  • for: The paper is written for estimating individual treatment effects (ITEs) in the presence of interference, specifically in online applications where units are associated and interference can be heterogeneous.
  • methods: The paper proposes a novel approach to model heterogeneous interference by developing a new architecture that aggregates information from diverse neighbors, using graph neural networks, a mechanism to aggregate information from different views, and attention mechanisms.
  • results: The proposed method significantly outperforms existing methods for ITE estimation in experiments on multiple datasets with heterogeneous interference, confirming the importance of modeling heterogeneous interference.
    Abstract Treatment effect estimation can assist in effective decision-making in e-commerce, medicine, and education. One popular application of this estimation lies in the prediction of the impact of a treatment (e.g., a promotion) on an outcome (e.g., sales) of a particular unit (e.g., an item), known as the individual treatment effect (ITE). In many online applications, the outcome of a unit can be affected by the treatments of other units, as units are often associated, which is referred to as interference. For example, on an online shopping website, sales of an item will be influenced by an advertisement of its co-purchased item. Prior studies have attempted to model interference to estimate the ITE accurately, but they often assume a homogeneous interference, i.e., relationships between units only have a single view. However, in real-world applications, interference may be heterogeneous, with multi-view relationships. For instance, the sale of an item is usually affected by the treatment of its co-purchased and co-viewed items. We hypothesize that ITE estimation will be inaccurate if this heterogeneous interference is not properly modeled. Therefore, we propose a novel approach to model heterogeneous interference by developing a new architecture to aggregate information from diverse neighbors. Our proposed method contains graph neural networks that aggregate same-view information, a mechanism that aggregates information from different views, and attention mechanisms. In our experiments on multiple datasets with heterogeneous interference, the proposed method significantly outperforms existing methods for ITE estimation, confirming the importance of modeling heterogeneous interference.
    摘要 干预效应估计有助于电商、医疗和教育等领域的有效决策。其一个常见应用是预测某个干预(例如一次促销)对特定单元(例如一件商品)的结果(例如销量)的影响,即个体干预效应(ITE)。在许多在线应用中,由于单元之间往往相互关联,一个单元的结果可能受到其他单元干预的影响,这被称为干扰(interference)。例如,在购物网站上,一件商品的销量会受到其共同购买商品广告的影响。已有研究尝试对干扰建模以准确估计ITE,但通常假设干扰是同质的,即单元间的关系只有单一视角。然而在现实应用中,干扰可能是异质的,具有多视角关系:一件商品的销量通常同时受到其共同购买商品和共同浏览商品所受干预的影响。我们假设,如果不恰当地对这种异质干扰建模,ITE估计将不准确。因此,我们提出了一种对异质干扰建模的新方法,设计了能够聚合多样邻居信息的新架构,其中包含聚合同视角信息的图神经网络、聚合不同视角信息的机制以及注意力机制。在多个具有异质干扰的数据集上的实验表明,所提方法显著优于现有的ITE估计方法,证实了对异质干扰建模的重要性。
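A rough sketch of the multi-view aggregation idea, with all architectural details assumed rather than taken from the paper: each view (e.g., a co-purchase graph and a co-view graph) is mean-aggregated separately, then an attention layer weights the views per unit:

```python
import torch
import torch.nn as nn

class MultiViewAggregator(nn.Module):
    """Sketch of aggregating information from heterogeneous (multi-view)
    neighbors. Each view gets its own projection and mean aggregation; an
    attention layer then weights the views per unit. Details are assumptions."""

    def __init__(self, dim, num_views):
        super().__init__()
        self.view_proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_views))
        self.attn = nn.Linear(dim, 1)

    def forward(self, h, adj_per_view):      # h: (N, dim); each adj: (N, N)
        views = []
        for proj, A in zip(self.view_proj, adj_per_view):
            deg = A.sum(-1, keepdim=True).clamp(min=1.0)
            views.append(torch.relu(proj((A @ h) / deg)))   # mean over neighbors
        V = torch.stack(views, dim=1)                       # (N, views, dim)
        w = torch.softmax(self.attn(V), dim=1)              # per-unit view weights
        return (w * V).sum(dim=1)                           # (N, dim)

N, d = 5, 8
agg = MultiViewAggregator(d, num_views=2)
adjs = [torch.bernoulli(torch.full((N, N), 0.3)) for _ in range(2)]
print(agg(torch.randn(N, d), adjs).shape)  # torch.Size([5, 8])
```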

Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction

  • paper_url: http://arxiv.org/abs/2309.13874
  • repo_url: https://github.com/vivian556123/dcem
  • paper_authors: Leying Zhang, Yao Qian, Linfeng Yu, Heming Wang, Xinkai Wang, Hemin Yang, Long Zhou, Shujie Liu, Yanmin Qian, Michael Zeng
  • for: 本文旨在提出一种高效的生成式目标语音提取(TSE)方法。
  • methods: 本文提出扩散条件期望模型(DCEM),可处理噪声与干净条件下的多说话人和单说话人场景;并提出Regenerate-DCEM(R-DCEM),基于判别式模型预处理后的语音重新生成并优化语音质量。
  • results: 与传统方法相比,本文方法在侵入式与非侵入式指标上均表现出色,且推理速度快、对未见任务稳健。音频示例可在线试听(https://vivian556123.github.io/dcem)。
    Abstract Target Speech Extraction (TSE) is a crucial task in speech processing that focuses on isolating the clean speech of a specific speaker from complex mixtures. While discriminative methods are commonly used for TSE, they can introduce distortion in terms of speech perception quality. On the other hand, generative approaches, particularly diffusion-based methods, can enhance speech quality perceptually but suffer from slower inference speed. We propose an efficient generative approach named Diffusion Conditional Expectation Model (DCEM) for TSE. It can handle multi- and single-speaker scenarios in both noisy and clean conditions. Additionally, we introduce Regenerate-DCEM (R-DCEM) that can regenerate and optimize speech quality based on pre-processed speech from a discriminative model. Our method outperforms conventional methods in terms of both intrusive and non-intrusive metrics and demonstrates notable strengths in inference efficiency and robustness to unseen tasks. Audio examples are available online (https://vivian556123.github.io/dcem).
    摘要 目标语音提取(TSE)是语音处理中的一项关键任务,旨在从复杂混合信号中分离出特定说话人的干净语音。判别式方法常用于TSE,但可能在语音感知质量上引入失真;生成式方法(尤其是基于扩散的方法)能在感知上提升语音质量,却存在推理速度慢的问题。我们提出了一种高效的生成式方法——扩散条件期望模型(DCEM),可处理噪声和干净条件下的多说话人与单说话人场景。此外,我们提出了Regenerate-DCEM(R-DCEM),可基于判别式模型预处理后的语音重新生成并优化语音质量。我们的方法在侵入式与非侵入式指标上均优于传统方法,并在推理效率和对未见任务的稳健性方面表现突出。音频示例见 https://vivian556123.github.io/dcem 。

Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts

  • paper_url: http://arxiv.org/abs/2309.13850
  • repo_url: None
  • paper_authors: Huy Nguyen, Pedram Akbarian, Fanqi Yan, Nhat Ho
  • for: 这篇论文主要研究顶层-$K$稀疏softmax门控函数对输入空间划分以及专家混合模型性能的影响。
  • methods: 作者以高斯专家混合模型为对象建立理论,并通过定义新的参数间损失函数来刻画输入空间不同区域的行为。
  • results: 研究发现,当真实专家数 $k_{\ast}$ 已知时,密度估计和参数估计的收敛速度都随样本量呈参数速率。但当 $k_{\ast}$ 未知、真实模型被 $k > k_{\ast}$ 个专家的高斯混合过度设定时,从顶层-$K$稀疏softmax门控函数中选取的专家数必须超过与真实参数相关的若干 Voronoi 胞的总基数,才能保证密度估计收敛。此外,虽然此时密度估计速率仍是参数速率,但由于softmax门控与专家函数之间的内在相互作用,参数估计速率会变得非常慢。
    Abstract Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitions the input space into multiple regions with distinct behaviors. By focusing on a Gaussian mixture of experts, we establish theoretical results on the effects of the top-K sparse softmax gating function on both density and parameter estimations. Our results hinge upon defining novel loss functions among parameters to capture different behaviors of the input regions. When the true number of experts $k_{\ast}$ is known, we demonstrate that the convergence rates of density and parameter estimations are both parametric on the sample size. However, when $k_{\ast}$ becomes unknown and the true model is over-specified by a Gaussian mixture of $k$ experts where $k > k_{\ast}$, our findings suggest that the number of experts selected from the top-K sparse softmax gating function must exceed the total cardinality of a certain number of Voronoi cells associated with the true parameters to guarantee the convergence of the density estimation. Moreover, while the density estimation rate remains parametric under this setting, the parameter estimation rates become substantially slow due to an intrinsic interaction between the softmax gating and expert functions.
    摘要 顶层-K稀疏softmax门控的专家混合已被广泛用于在不增加计算成本的情况下扩展大规模深度学习架构。尽管它在实际应用中非常流行,其门控函数的理论理解仍是一个悬而未决的问题。主要挑战来自顶层-K稀疏softmax门控函数的结构:它将输入空间划分为多个行为各异的区域。通过聚焦高斯专家混合模型,我们建立了顶层-K稀疏softmax门控函数对密度估计和参数估计影响的理论结果。我们的结果依赖于在参数间定义新的损失函数,以刻画输入区域的不同行为。当真实专家数 $k_{\ast}$ 已知时,我们证明密度估计和参数估计的收敛速度都随样本量呈参数速率。然而,当 $k_{\ast}$ 未知、真实模型被 $k > k_{\ast}$ 个专家的高斯混合过度设定时,我们的结果表明,从顶层-K稀疏softmax门控函数中选取的专家数必须超过与真实参数相关的若干 Voronoi 胞的总基数,才能保证密度估计收敛。此外,虽然此时密度估计速率仍是参数速率,但由于softmax门控与专家函数之间的内在相互作用,参数估计速率会变得非常慢。
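The gating function under study is standard and easy to state in code; the sketch below keeps the K largest gating logits per input, softmaxes over them, and zeroes the rest (the expert networks themselves are omitted):

```python
import torch
import torch.nn.functional as F

def topk_sparse_softmax_gate(x, W_gate, k):
    """Top-K sparse softmax gating: keep the K largest gating logits per
    input, softmax over them, and zero out the remaining experts."""
    logits = x @ W_gate                       # (batch, num_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    gates = torch.zeros_like(logits)
    gates.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
    return gates                              # rows sum to 1, at most k nonzeros

torch.manual_seed(0)
x = torch.randn(4, 16)
W = torch.randn(16, 8)          # 8 experts
g = topk_sparse_softmax_gate(x, W, k=2)
print(g.sum(dim=-1))            # tensor([1., 1., 1., 1.])
print((g > 0).sum(dim=-1))      # tensor([2, 2, 2, 2])
# Mixture output would then be: y = sum_e g[:, e:e+1] * expert_e(x)
```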

On the Effectiveness of Adversarial Samples against Ensemble Learning-based Windows PE Malware Detectors

  • paper_url: http://arxiv.org/abs/2309.13841
  • repo_url: None
  • paper_authors: Trong-Nghia To, Danh Le Kim, Do Thi Thu Hien, Nghi Hoang Khoa, Hien Do Hoang, Phan The Duy, Van-Hau Pham
  • for: 这项研究旨在提出一个结合GAN与RL模型的变异系统,以对抗基于集成学习的检测器。
  • methods: 研究结合了GAN与RL模型,包括MalGAN和深度Q网络反恶意软件引擎攻击框架(DQEAF)。
  • results: 实验结果显示,100%的选定变异样本保留了可执行文件的格式,同时在可执行性保持和恶意性保持方面也取得了一定成功。
    Abstract Recently, there has been a growing focus and interest in applying machine learning (ML) to the field of cybersecurity, particularly in malware detection and prevention. Several research works on malware analysis have been proposed, offering promising results for both academic and practical applications. In these works, the use of Generative Adversarial Networks (GANs) or Reinforcement Learning (RL) can aid malware creators in crafting metamorphic malware that evades antivirus software. In this study, we propose a mutation system to counteract ensemble learning-based detectors by combining GANs and an RL model, overcoming the limitations of the MalGAN model. Our proposed FeaGAN model is built based on MalGAN by incorporating an RL model called the Deep Q-network anti-malware Engines Attacking Framework (DQEAF). The RL model addresses three key challenges in performing adversarial attacks on Windows Portable Executable malware, including format preservation, executability preservation, and maliciousness preservation. In the FeaGAN model, ensemble learning is utilized to enhance the malware detector's evasion ability, with the generated adversarial patterns. The experimental results demonstrate that 100\% of the selected mutant samples preserve the format of executable files, while certain successes in both executability preservation and maliciousness preservation are achieved, reaching a stable success rate.
    摘要 近年来,将机器学习(ML)应用于网络安全领域,尤其是恶意软件检测与防御,受到越来越多的关注。已有若干恶意软件分析研究被提出,在学术和实际应用上均取得了可喜的结果。在这些工作中,生成对抗网络(GAN)或强化学习(RL)可以帮助恶意软件作者构造能逃避杀毒软件的变形恶意软件。在本研究中,我们提出一种结合GAN与RL模型的变异系统,以对抗基于集成学习的检测器,克服了MalGAN模型的局限。我们提出的FeaGAN模型在MalGAN的基础上引入了名为深度Q网络反恶意软件引擎攻击框架(DQEAF)的RL模型。该RL模型解决了对Windows可移植可执行(PE)恶意软件进行对抗攻击的三大挑战:格式保持、可执行性保持和恶意性保持。在FeaGAN模型中,利用集成学习和所生成的对抗样本来增强对恶意软件检测器的规避能力。实验结果表明,100%的选定变异样本保留了可执行文件的格式,在可执行性保持与恶意性保持方面也取得了一定成功,达到了稳定的成功率。

Penalized Principal Component Analysis using Nesterov Smoothing

  • paper_url: http://arxiv.org/abs/2309.13838
  • repo_url: None
  • paper_authors: Rebecca M. Hurwitz, Georg Hahn
  • for: 本文利用惩罚特征值问题(PEP)对高维数据进行降维,在计算第一特征向量的优化问题中加入L1惩罚约束。
  • methods: 本文将Nesterov平滑应用于LASSO型L1惩罚项,从而可以计算解析梯度,更快、更高效地求解该优化问题;并利用已有的奇异值分解(SVD)结果计算更高阶的特征向量。
  • results: 基于千人基因组计划(1000 Genome Project)数据,我们实证表明所提出的平滑PEP能提高数值稳定性并得到有意义的特征向量,并进一步考察了惩罚特征向量方法相对于传统PCA的效用。
    Abstract Principal components computed via PCA (principal component analysis) are traditionally used to reduce dimensionality in genomic data or to correct for population stratification. In this paper, we explore the penalized eigenvalue problem (PEP) which reformulates the computation of the first eigenvector as an optimization problem and adds an L1 penalty constraint. The contribution of our article is threefold. First, we extend PEP by applying Nesterov smoothing to the original LASSO-type L1 penalty. This allows one to compute analytical gradients which enable faster and more efficient minimization of the objective function associated with the optimization problem. Second, we demonstrate how higher order eigenvectors can be calculated with PEP using established results from singular value decomposition (SVD). Third, using data from the 1000 Genome Project dataset, we empirically demonstrate that our proposed smoothed PEP allows one to increase numerical stability and obtain meaningful eigenvectors. We further investigate the utility of the penalized eigenvector approach over traditional PCA.
    摘要 通过主成分分析(PCA)计算得到的主成分传统上用于基因组数据的降维或校正群体分层。本文探讨惩罚特征值问题(PEP),它将第一特征向量的计算重新表述为一个优化问题,并加入L1惩罚约束。本文的贡献有三方面:第一,我们将Nesterov平滑应用于原有的LASSO型L1惩罚,从而扩展了PEP,这使得我们可以计算解析梯度,更快、更高效地最小化该优化问题的目标函数;第二,我们展示了如何利用已有的奇异值分解(SVD)结果,用PEP计算更高阶的特征向量;第三,基于千人基因组计划(1000 Genome Project)数据,我们实证表明所提出的平滑PEP能够提高数值稳定性并得到有意义的特征向量。我们进一步考察了惩罚特征向量方法相对于传统PCA的效用。
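Nesterov smoothing replaces the nondifferentiable L1 term with a Huber-like surrogate whose gradient exists everywhere. A toy sketch of a smoothed PEP-style objective solved by projected gradient ascent on the unit sphere (an illustration of the smoothing idea, not the paper's algorithm or tuning):

```python
import numpy as np

def nesterov_smoothed_l1(v, mu):
    """Nesterov-smoothed absolute value, elementwise:
    f_mu(t) = t^2/(2*mu) for |t| <= mu, |t| - mu/2 otherwise (Huber form)."""
    small = np.abs(v) <= mu
    return np.where(small, v**2 / (2 * mu), np.abs(v) - mu / 2)

def grad_smoothed_l1(v, mu):
    return np.clip(v / mu, -1.0, 1.0)   # analytical gradient, defined everywhere

def penalized_first_pc(X, lam=0.1, mu=1e-3, lr=1e-2, iters=2000):
    """Toy PEP solver (a sketch): maximize w' S w - lam * sum f_mu(w_i)
    over the unit sphere by projected gradient ascent."""
    S = np.cov(X, rowvar=False)
    w = np.linalg.svd(S)[0][:, 0]          # warm start at the ordinary PC1
    for _ in range(iters):
        g = 2 * S @ w - lam * grad_smoothed_l1(w, mu)
        w = w + lr * g
        w /= np.linalg.norm(w)             # project back to the sphere
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X[:, 0] += 3 * rng.normal(size=200)        # give one coordinate extra variance
print(np.round(penalized_first_pc(X, lam=0.5), 2))  # loading concentrated on x0
```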

Backorder Prediction in Inventory Management: Classification Techniques and Cost Considerations

  • paper_url: http://arxiv.org/abs/2309.13837
  • repo_url: None
  • paper_authors: Sarit Maitra, Sukanya Kundu
  • for: 库存管理中的缺货(backorder)预测。
  • methods: 使用多种分类技术,包括平衡Bagging分类器、模糊逻辑、变分自编码器-生成对抗网络和多层感知机分类器,以ROC-AUC和PR-AUC等指标进行评估,并将利润函数与误分类成本纳入考量,以反映库存管理和缺货处理的财务影响。
  • results: 结果表明,结合集成技术与VAE等多种建模方法可以有效处理不平衡数据集,提高预测精度,减少假阳性和假阴性,并增强可解释性。
    Abstract This article introduces an advanced analytical approach for predicting backorders in inventory management. Backorder refers to an order that cannot be immediately fulfilled due to stock depletion. Multiple classification techniques, including Balanced Bagging Classifiers, Fuzzy Logic, Variational Autoencoder - Generative Adversarial Networks, and Multi-layer Perceptron classifiers, are assessed in this work using performance evaluation metrics such as ROC-AUC and PR-AUC. Moreover, this work incorporates a profit function and misclassification costs, considering the financial implications and costs associated with inventory management and backorder handling. The study suggests that a combination of modeling approaches, including ensemble techniques and VAE, can effectively address imbalanced datasets in inventory management, emphasizing interpretability and reducing false positives and false negatives. This research contributes to the advancement of predictive analytics and offers valuable insights for future investigations in backorder forecasting and inventory control optimization for decision-making.
    摘要 本文介绍了一种用于库存管理中缺货(backorder)预测的高级分析方法。缺货指因库存耗尽而无法立即履行的订单。本文使用ROC-AUC和PR-AUC等性能评估指标,考察了多种分类技术,包括平衡Bagging分类器、模糊逻辑、变分自编码器-生成对抗网络和多层感知机分类器。此外,本文引入利润函数和误分类成本,考虑了库存管理与缺货处理相关的财务影响和成本。研究表明,结合集成技术与VAE等多种建模方法可以有效处理库存管理中的不平衡数据集,在强调可解释性的同时减少假阳性和假阴性。这项研究推动了预测分析的发展,并为未来缺货预测和库存控制优化决策方面的研究提供了有价值的见解。
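One concrete way the cost considerations enter is through threshold selection: rather than a default 0.5 cutoff, pick the threshold minimizing total misclassification cost. A minimal sketch with made-up costs (a missed backorder assumed 10x as costly as a false alarm; the paper's actual profit function is not reproduced):

```python
import numpy as np

def best_threshold(y_true, p_scores, cost_fp=1.0, cost_fn=10.0):
    """Pick the classification threshold that minimizes total
    misclassification cost for given (illustrative) unit costs."""
    y_true, p = np.asarray(y_true), np.asarray(p_scores)
    thresholds = np.unique(p)
    costs = [cost_fp * np.sum((p >= t) & (y_true == 0)) +
             cost_fn * np.sum((p < t) & (y_true == 1)) for t in thresholds]
    return thresholds[int(np.argmin(costs))]

y = np.array([0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.4, 0.35, 0.8])
print(best_threshold(y, scores))  # 0.35 -- leans low because misses are costly
```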

NSOTree: Neural Survival Oblique Tree

  • paper_url: http://arxiv.org/abs/2309.13825
  • repo_url: https://github.com/xs018/NSOTree
  • paper_authors: Xiaotong Sun, Peijie Qiu
  • for: 這篇論文研究生存分析中的時間至事件(time-to-event)資料,並將深度學習方法引入該領域,以兼顧性能與可解釋性。
  • methods: 論文提出了一個名為神經生存斜樹(NSOTree)的新方法,結合深度學習與樹狀方法,在保持可解釋性的同時維持逼近能力。NSOTree 基於 ReLU 網路,可以即插即用地併入現有的生存模型。
  • results: 在模擬與真實資料上的評估顯示,NSOTree 兼顧高性能與可解釋性,為生存分析提供了一個有效的方法。
    Abstract Survival analysis is a statistical method employed to scrutinize the duration until a specific event of interest transpires, known as time-to-event information characterized by censorship. Recently, deep learning-based methods have dominated this field due to their representational capacity and state-of-the-art performance. However, the black-box nature of the deep neural network hinders its interpretability, which is desired in real-world survival applications but has been largely neglected by previous works. In contrast, conventional tree-based methods are advantageous with respect to interpretability, while consistently grappling with an inability to approximate the global optima due to greedy expansion. In this paper, we leverage the strengths of both neural networks and tree-based methods, capitalizing on their ability to approximate intricate functions while maintaining interpretability. To this end, we propose a Neural Survival Oblique Tree (NSOTree) for survival analysis. Specifically, the NSOTree was derived from the ReLU network and can be easily incorporated into existing survival models in a plug-and-play fashion. Evaluations on both simulated and real survival datasets demonstrated the effectiveness of the proposed method in terms of performance and interpretability.
    摘要 生存分析是一种统计方法,用于研究直至特定事件发生所经历的时间,即带删失的时间至事件(time-to-event)信息。近来,基于深度学习的方法凭借其表示能力和最先进的性能在该领域占据主导地位。然而,深度神经网络的黑盒特性阻碍了其可解释性,而这正是现实生存应用所需要的,却在以往工作中大多被忽视。相比之下,传统的树方法在可解释性方面具有优势,但由于贪婪扩展,始终难以逼近全局最优。在本文中,我们同时利用神经网络和树方法的优点,兼顾逼近复杂函数的能力与可解释性。为此,我们提出了用于生存分析的神经生存斜树(NSOTree)。具体而言,NSOTree由ReLU网络导出,可以即插即用地并入现有的生存模型。在模拟和真实生存数据集上的评估表明,所提方法在性能和可解释性两方面均有效。

Forecasting large collections of time series: feature-based methods

  • paper_url: http://arxiv.org/abs/2309.13807
  • repo_url: https://github.com/lixixibj/forecasting-with-time-series-imaging
  • paper_authors: Li Li, Feng Li, Yanfei Kang
  • for: 这篇论文面向经济计量学及其他预测领域中的复杂现实问题:时间序列数据的复杂性使得假设特定数据生成过程的单一模型无法涵盖所有情形。
  • methods: 这篇论文介绍了基于时间序列特征预测大量时间序列的两类方法,即基于特征的模型选择和基于特征的模型组合。
  • results: 论文结合开源软件实现,综述了最先进的基于特征的预测方法。
    Abstract In economics and many other forecasting domains, the real world problems are too complex for a single model that assumes a specific data generation process. The forecasting performance of different methods changes depending on the nature of the time series. When forecasting large collections of time series, two lines of approaches have been developed using time series features, namely feature-based model selection and feature-based model combination. This chapter discusses the state-of-the-art feature-based methods, with reference to open-source software implementations.
    摘要 在经济和许多其他预测领域中,现实世界问题太复杂,不可以单独采用一个模型,假设特定的数据生成过程。预测不同时序系列的表现,不同方法的预测性能会有所不同。当预测大量时序系列时,有两条方向的方法得到发展,一是基于时序特征的模型选择,二是基于时序特征的模型组合。本章介绍了当前最佳实践的特征基于方法,参考开源软件实现。

Projected Randomized Smoothing for Certified Adversarial Robustness

  • paper_url: http://arxiv.org/abs/2309.13794
  • repo_url: https://github.com/spfrommer/projected_randomized_smoothing
  • paper_authors: Samuel Pfrommer, Brendon G. Anderson, Somayeh Sojoudi
  • for: 设计具有可证明鲁棒性的分类器
  • methods: 先将输入投影到数据流形的低维近似上,再在低维投影空间中进行随机平滑,并刻画平滑后复合分类器的认证区域
  • results: 在 CIFAR-10 和 SVHN 上的实验表明,该方法对认证区域体积给出可计算的下界,并能覆盖法向于数据流形的扰动
    Abstract Randomized smoothing is the current state-of-the-art method for producing provably robust classifiers. While randomized smoothing typically yields robust $\ell_2$-ball certificates, recent research has generalized provable robustness to different norm balls as well as anisotropic regions. This work considers a classifier architecture that first projects onto a low-dimensional approximation of the data manifold and then applies a standard classifier. By performing randomized smoothing in the low-dimensional projected space, we characterize the certified region of our smoothed composite classifier back in the high-dimensional input space and prove a tractable lower bound on its volume. We show experimentally on CIFAR-10 and SVHN that classifiers without the initial projection are vulnerable to perturbations that are normal to the data manifold and yet are captured by the certified regions of our method. We compare the volume of our certified regions against various baselines and show that our method improves on the state-of-the-art by many orders of magnitude.
    摘要 随机平滑是当前生成可证明鲁棒分类器的最先进方法。随机平滑通常给出 $\ell_2$ 球认证,近期研究已将可证明鲁棒性推广到不同的范数球以及各向异性区域。本文考虑一种分类器架构:先将输入投影到数据流形的低维近似上,再应用标准分类器。通过在低维投影空间中进行随机平滑,我们刻画了平滑后的复合分类器在高维输入空间中的认证区域,并证明了其体积的一个可计算下界。我们在 CIFAR-10 和 SVHN 上的实验表明,不带初始投影的分类器容易受到法向于数据流形的扰动攻击,而这些扰动恰能被我们方法的认证区域覆盖。我们将认证区域的体积与多种基线进行了比较,结果表明我们的方法比最先进方法提升了多个数量级。
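A minimal sketch of the two-stage prediction, assuming an orthonormal projection P and a base classifier f on the projected space; the vote share returned here is what a (Clopper-Pearson style) certified-radius computation would consume, which is omitted:

```python
import numpy as np

def projected_smoothed_predict(f, x, P, sigma=0.25, n=1000, seed=0):
    """Project the input onto a low-dimensional subspace (rows of P, assumed
    orthonormal), then do ordinary randomized smoothing in the projected
    space by majority vote under Gaussian noise. `f` maps projected points
    to integer class labels."""
    rng = np.random.default_rng(seed)
    z = P @ x                                   # low-dimensional projection
    noise = rng.normal(0.0, sigma, size=(n, z.size))
    votes = np.array([f(z + e) for e in noise])
    counts = np.bincount(votes)
    return counts.argmax(), counts.max() / n    # predicted class, vote share

# Toy base classifier on a 2-D projection of 8-D inputs.
f = lambda z: int(z[0] + z[1] > 0)
P = np.linalg.qr(np.random.default_rng(1).normal(size=(8, 2)))[0].T  # (2, 8)
x = np.random.default_rng(2).normal(size=8)
print(projected_smoothed_predict(f, x, P))
```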

ReMasker: Imputing Tabular Data with Masked Autoencoding

  • paper_url: http://arxiv.org/abs/2309.13793
  • repo_url: https://github.com/tydusky/remasker
  • paper_authors: Tianyu Du, Luca Melis, Ting Wang
  • for: 本文提出一种用于填充表格数据缺失值的新方法。
  • methods: 该方法扩展了掩码自编码(masked autoencoding)框架:除了缺失值(即天然被掩码的部分)之外,再随机"重新掩码"一部分观测值,通过重建这些重新掩码的值来优化自编码器,并用训练好的模型预测缺失值。
  • results: 在基准数据集上的大量评估表明,在各种缺失设定下,ReMasker 的填充保真度和效用都达到或超过了最先进方法,且其性能优势往往随缺失数据比例的增加而扩大。我们还给出理论依据,证明 ReMasker 倾向于学习表格数据的缺失不变表示。
    Abstract We present ReMasker, a new method of imputing missing values in tabular data by extending the masked autoencoding framework. Compared with prior work, ReMasker is both simple -- besides the missing values (i.e., naturally masked), we randomly ``re-mask'' another set of values, optimize the autoencoder by reconstructing this re-masked set, and apply the trained model to predict the missing values; and effective -- with extensive evaluation on benchmark datasets, we show that ReMasker performs on par with or outperforms state-of-the-art methods in terms of both imputation fidelity and utility under various missingness settings, while its performance advantage often increases with the ratio of missing data. We further explore theoretical justification for its effectiveness, showing that ReMasker tends to learn missingness-invariant representations of tabular data. Our findings indicate that masked modeling represents a promising direction for further research on tabular data imputation. The code is publicly available.
    摘要 我们提出了一种名为ReMasker的新方法,通过扩展掩码自编码框架来填充表格数据中的缺失值。与已有工作相比,ReMasker十分简单:除了缺失值(即天然被掩码的部分)之外,我们再随机"重新掩码"另一组观测值,通过重建这组重新掩码的值来优化自编码器,并用训练好的模型预测缺失值。它同时十分有效:在基准数据集上的大量评估表明,在各种缺失设定下,ReMasker的填充保真度和效用都达到或超过最先进方法,且其性能优势往往随缺失比例的增加而扩大。我们进一步探讨其有效性的理论依据,表明ReMasker倾向于学习表格数据的缺失不变表示。我们的发现表明,掩码建模是表格数据填充研究中一个有前景的方向。代码已公开。
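The re-masking objective is easy to sketch: hide the naturally missing cells, additionally re-mask some observed cells, and train the autoencoder to reconstruct only the re-masked cells, whose ground truth is known. The backbone, masking rate, and loss below are illustrative stand-ins for the paper's choices:

```python
import torch
import torch.nn as nn

def remasker_step(model, x, nat_mask, p_remask=0.25):
    """One training step of the re-masking idea (a sketch): on top of the
    naturally missing cells (nat_mask == 1), randomly re-mask some observed
    cells, reconstruct, and take the loss only on re-masked cells."""
    observed = 1.0 - nat_mask
    re_mask = (torch.rand_like(x) < p_remask) * observed   # only observed cells
    inp = x * observed * (1.0 - re_mask)                   # hide natural + re-masked
    recon = model(inp)
    loss = ((recon - x) ** 2 * re_mask).sum() / re_mask.sum().clamp(min=1.0)
    return loss

# Tiny autoencoder stand-in for the masked-autoencoding backbone.
model = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 6))
x = torch.randn(32, 6)
nat_mask = (torch.rand(32, 6) < 0.2).float()               # "naturally" missing
loss = remasker_step(model, x, nat_mask)
loss.backward()
print(float(loss))
```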

Distribution-Free Statistical Dispersion Control for Societal Applications

  • paper_url: http://arxiv.org/abs/2309.13786
  • repo_url: None
  • paper_authors: Zhun Deng, Thomas P. Zollo, Jake C. Snell, Toniann Pitassi, Richard Zemel
  • for: 本论文旨在为机器学习模型的性能提供显式的有限样本统计保证,以确保模型在实际应用中的表现符合预期。
  • methods: 论文提出了一个简单而灵活的无分布(distribution-free)框架,可处理更丰富的一类统计泛函,用于控制不同群体所经历的统计散布度量。
  • results: 实验表明,该方法能在有毒评论检测、医学影像和电影推荐等任务中提供有效的统计保证。
    Abstract Explicit finite-sample statistical guarantees on model performance are an important ingredient in responsible machine learning. Previous work has focused mainly on bounding either the expected loss of a predictor or the probability that an individual prediction will incur a loss value in a specified range. However, for many high-stakes applications, it is crucial to understand and control the dispersion of a loss distribution, or the extent to which different members of a population experience unequal effects of algorithmic decisions. We initiate the study of distribution-free control of statistical dispersion measures with societal implications and propose a simple yet flexible framework that allows us to handle a much richer class of statistical functionals beyond previous work. Our methods are verified through experiments in toxic comment detection, medical imaging, and film recommendation.
    摘要 对模型性能给出显式的有限样本统计保证,是负责任机器学习的重要组成部分。以往工作主要集中于约束预测器的期望损失,或约束单次预测损失落入指定区间的概率。然而,对许多高风险应用而言,理解并控制损失分布的散布程度——即群体中不同成员受算法决策不平等影响的程度——至关重要。我们开启了对具有社会意义的统计散布度量进行无分布假设控制的研究,并提出了一个简单而灵活的框架,能够处理比以往工作更丰富的一类统计泛函。我们的方法在有毒评论检测、医学影像和电影推荐的实验中得到了验证。

Multi-Task Learning For Reduced Popularity Bias In Multi-Territory Video Recommendations

  • paper_url: http://arxiv.org/abs/2310.03148
  • repo_url: None
  • paper_authors: Phanideep Gampa, Farnoosh Javadi, Belhassen Bayar, Ainur Yessenalina
  • for: 缓解多地区个性化推荐系统中全球热门内容造成的流行度偏差,提升推荐的相关性。
  • methods: 采用多任务学习(MTL)技术,并结合自适应上采样方法来减少流行度偏差。
  • results: 实验表明,该框架在多个地区均优于基线,PR-AUC 指标的相对增益最高达 $65.27\%$。
    Abstract Various data imbalances that naturally arise in a multi-territory personalized recommender system can lead to a significant item bias for globally prevalent items. A locally popular item can be overshadowed by a globally prevalent item. Moreover, users' viewership patterns/statistics can drastically change from one geographic location to another which may suggest to learn specific user embeddings. In this paper, we propose a multi-task learning (MTL) technique, along with an adaptive upsampling method to reduce popularity bias in multi-territory recommendations. Our proposed framework is designed to enrich training examples with active users representation through upsampling, and capable of learning geographic-based user embeddings by leveraging MTL. Through experiments, we demonstrate the effectiveness of our framework in multiple territories compared to a baseline not incorporating our proposed techniques.~Noticeably, we show improved relative gain of up to $65.27\%$ in PR-AUC metric. A case study is presented to demonstrate the advantages of our methods in attenuating the popularity bias of global items.
    摘要 多地区个性化推荐系统中会自然出现各种数据不平衡,从而对全球热门内容产生显著的物品偏差:本地热门内容可能被全球热门内容所掩盖。此外,用户的观看模式和统计特征会随地理位置发生显著变化,这提示需要学习基于地区的用户嵌入。本文提出一种多任务学习(MTL)技术,并配合自适应上采样方法,以减少多地区推荐中的流行度偏差。所提框架通过上采样以活跃用户表示丰富训练样本,并能借助MTL学习基于地理位置的用户嵌入。实验表明,与未采用所提技术的基线相比,我们的框架在多个地区都更有效,PR-AUC 指标的相对增益最高可达 $65.27\%$。我们还给出一个案例研究,展示了所提方法在削弱全球热门物品流行度偏差方面的优势。

eess.SP - 2023-09-25

Towards a Novel Ultrasound System Based on Low-Frequency Feature Extraction From a Fully-Printed Flexible Transducer

  • paper_url: http://arxiv.org/abs/2309.14569
  • repo_url: None
  • paper_authors: Marco Giordano, Kirill Keller, Francesco Greco, Luca Benini, Michele Magno, Christoph Leitner
  • for: 这个研究是为了开发一个可靠、便宜、携带式的数码声带测量系统,用于不侵入性地、连续地监测生命 Parameters。
  • methods: 这个研究使用了一个全新的印刷式、无铅、数码声带感应器,可以实现较好的适材化性。实验室设置使用了一个模拟人体血液流的流体模型和一个模拟心跳的拍脉机,以验证方法。
  • results: 研究结果显示,这个新型数码声带感应器可以实现高精度的血液流速度测量,并且可以实现低功耗和低带宽的处理。在不同的心跳rhythm下,测量结果皆具有误差不超过0.05Hz(3bpm)。此外,实验室设置显示,这个方法可以实现6倍以上的讯号宽度减少,从12.5MHz降至2MHz。
    Abstract Ultrasound is a key technology in healthcare, and it is being explored for non-invasive, wearable, continuous monitoring of vital signs. However, its widespread adoption in this scenario is still hindered by the size, complexity, and power consumption of current devices. Moreover, such an application demands adaptability to human anatomy, which is hard to achieve with current transducer technology. This paper presents a novel ultrasound system prototype based on a fully printed, lead-free, and flexible polymer ultrasound transducer, whose bending radius promises good adaptability to the human anatomy. Our application scenario focuses on continuous blood flow monitoring. We implemented a hardware envelope filter to efficiently transpose high-frequency ultrasound signals to a lower-frequency spectrum. This reduces computational and power demands with little to no degradation in the task proposed for this work. We validated our method on a setup that mimics human blood flow by using a flow phantom and a peristaltic pump simulating 3 different heartbeat rhythms: 60, 90 and 120 beats per minute. Our ultrasound setup reconstructs peristaltic pump frequencies with errors of less than 0.05 Hz (3 bpm) from the set pump frequency, both for the raw echo and the enveloped echo. The analog pre-processing showed a promising reduction of signal bandwidth of more than 6x: pulse-echo signals of transducers excited at 12.5 MHz were reduced to about 2 MHz. Thus, allowing consumer MCUs to acquire and elaborate signals within mW-power range in an inexpensive fashion.
    摘要 超声是现代医疗中的一项关键技术,目前正被探索用于无创、可穿戴、持续的生命体征监测。然而,现有设备的尺寸、复杂度和功耗仍阻碍其在该场景中的广泛应用。此外,这类应用要求设备能贴合人体解剖结构,这是现有换能器技术难以实现的。本文提出了一种新型超声系统原型,基于一种全印刷、无铅、柔性聚合物超声换能器,其弯曲半径有望良好地贴合人体解剖结构。我们的应用场景聚焦于连续血流监测。我们实现了一个硬件包络滤波器,高效地将高频超声信号搬移到较低频段,在几乎不影响本文任务性能的情况下降低了计算与功耗需求。我们在一个模拟人体血流的装置上验证了该方法:使用流动体模和蠕动泵模拟60、90、120次/分三种心率。无论是原始回波还是包络回波,我们的超声装置重建蠕动泵频率的误差均小于0.05 Hz(3 bpm)。模拟前端处理将信号带宽降低了6倍以上:以12.5 MHz激励的换能器脉冲回波信号被降至约2 MHz,从而使消费级MCU能够以毫瓦级功耗、低成本地采集和处理信号。
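The envelope filter in the paper is analog hardware; a digital sketch of the same transposition idea (Hilbert envelope followed by a low-pass Butterworth filter, with illustrative sampling rate and cutoff) looks like this:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_downmix(rf, fs=50e6, cutoff=2e6):
    """Take the magnitude of the analytic signal (envelope) of a pulse-echo
    RF trace, then low-pass it so it fits in a ~2 MHz band that a cheap MCU
    ADC could handle. Sampling rate and cutoff are illustrative values."""
    env = np.abs(hilbert(rf))                       # envelope of the RF signal
    sos = butter(4, cutoff, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, env)

# Synthetic 12.5 MHz echo burst arriving at t = 20 microseconds.
fs = 50e6
t = np.arange(0, 40e-6, 1 / fs)
rf = np.exp(-((t - 20e-6) ** 2) / (2 * (2e-6) ** 2)) * np.sin(2 * np.pi * 12.5e6 * t)
env = envelope_downmix(rf, fs=fs)
print(t[np.argmax(env)])  # envelope peak ~ 2e-05 s, the echo arrival time
```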

Secret-Message Transmission by Echoing Encrypted Probes – STEEP

  • paper_url: http://arxiv.org/abs/2309.14529
  • repo_url: None
  • paper_authors: Yingbo Hua
  • for: 本文研究了Maurer、Ahlswede和Csiszar(MAC)针对单输入单输出(SISO)信道探测场景给出的密钥容量上下界的性质,并受这些界的启发,提出了一种名为"通过回传加密探测实现秘密消息传输"(STEEP)的方案。
  • methods: 该方案包含两个阶段:第一阶段,Alice通过探测信道向Bob发送随机探测信号;第二阶段,Bob通过高质量回传信道,将其对探测信号的估计版本用一个秘密加密后回传给Alice。
  • results: 只要窃听者Eve无法获得Alice在第一阶段发送的确切探测信号,即使Eve在信道探测期间的信道强度强于Bob,STEEP也能在回传信道上保证从Bob到Alice的保密速率为正。STEEP适用于互联网络中的物理层和上层。
    Abstract This paper examines the properties of the lower and upper bounds established by Maurer, Ahlswede and Csiszar (MAC) for secret-key capacity in the case of channel probing over single-input and single-output (SISO) channels. Inspired by the insights into MAC's bounds, a scheme called secret-message transmission by echoing encrypted probes (STEEP) is proposed. STEEP consists of two phases: in phase 1, Alice sends random probes over a probing channel to Bob; in phase 2, Bob echoes back an estimated version of the probes, but encrypted by a secret, over a high-quality return channel. Provided that Eve is unable to obtain the exact probes transmitted by Alice in phase 1, STEEP guarantees a positive secrecy rate from Bob to Alice over the return channel even if Eve's channel strength during channel probing is stronger than Bob's. STEEP is applicable to both physical layer and upper layers in connected networks.
    摘要 STEEP consists of two phases. In phase 1, Alice sends random probes over a probing channel to Bob. In phase 2, Bob echoes back an estimated version of the probes, encrypted by a secret, over a high-quality return channel. Provided that Eve is unable to obtain the exact probes transmitted by Alice in phase 1, STEEP guarantees a positive secrecy rate from Bob to Alice over the return channel, even if Eve's channel strength during channel probing is stronger than Bob's. STEEP is applicable to both the physical layer and upper layers in connected networks.
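A toy numerical illustration of why the echo helps, under strong simplifications (additive masking instead of real encryption, noiseless return channel): Alice knows the exact probes, so her residual is only Bob's estimation noise, whereas Eve's residual stacks her own probing noise on top, even when her probing channel is better than Bob's:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Phase 1: Alice sends random probes over a noisy probing channel.
probes = rng.standard_normal(n)
bob_rx = probes + 0.5 * rng.standard_normal(n)    # Bob's noisy estimate
eve_rx = probes + 0.3 * rng.standard_normal(n)    # Eve hears the probes better

# Phase 2: Bob "echoes" his estimate, additively masked by his secret symbols
# (a stand-in for encryption), over a high-quality return channel.
secret = rng.standard_normal(n)
echo = bob_rx + secret

# Alice subtracts the exact probes; Eve can only subtract her own observation.
alice_err = np.var((echo - probes) - secret)      # = Bob's noise only (~0.25)
eve_err = np.var((echo - eve_rx) - secret)        # Bob's + Eve's noise (~0.34)
print(alice_err, eve_err)                          # Alice's residual is smaller
```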

Heart rate measurement using the built-in triaxial accelerometer from a commercial digital writing device

  • paper_url: http://arxiv.org/abs/2309.14308
  • repo_url: None
  • paper_authors: Julie Payette, Fabrice Vaussenat, Sylvain G. Cloutier
  • for: 本研究旨在比较从智能笔内置加速度计提取的心率数据与标准ECG监测仪读数的一致性。
  • methods: 研究使用配备传感器的商用智能笔(STABILO DigiPen)的内置三轴加速度计与标准ECG监测仪采集数据,并用Butterworth滤波器对信号降噪。
  • results: 研究发现,智能笔内置加速度计可以准确地预测心率,与标准ECG数据的相关性高于0.99。
    Abstract Wearable devices are on the rise. Smart watches and phones, fitness trackers or smart textiles now provide unprecedented access to our own personal data. As such, wearable devices can enable health monitoring without disrupting our daily routines. In clinical settings, electrocardiograms (ECGs) and photoplethysmographies (PPGs) are used to monitor the heart's and respiratory behaviors. In more practical settings, accelerometers can be used to estimate the heartrate when they are attached to the chest. They can also help filter out some noise in ECG signal from movement. In this work, we compare the heart rate data extracted from the built-in accelerometer of a commercial smart pen equipped with sensors (STABILO's DigiPen), with a standard ECG monitor readouts. We demonstrate that it is possible to accurately predict the heart rate from the smart pencil. The data collection is done with eight volunteers, writing the alphabet continuously for five minutes. The signal is processed with a Butterworth filter to cut off noise. We achieve a mean-squared error (MSE) better than 6.685x10$^{-3}$ comparing the DigiPen's computed ${\Delta}$t (time between pulses) with the reference ECG data. The peaks' timestamps for both signals all maintain a correlation higher than 0.99. All computed heart rates from the pen accurately correlate with the reference ECG signals.
    摘要 可穿戴设备正在兴起。智能手表和手机、健身追踪器或智能纺织品如今为我们提供了前所未有的个人数据获取能力。因此,可穿戴设备可以在不打乱日常生活的情况下实现健康监测。在临床环境中,心电图(ECG)和光电容积脉搏波(PPG)被用于监测心脏和呼吸行为。在更日常的场景中,贴附于胸部的加速度计可用于估计心率,还能帮助滤除ECG信号中由运动引起的部分噪声。在这项工作中,我们将一支带传感器的商用智能笔(STABILO DigiPen)内置加速度计提取的心率数据与标准ECG监测读数进行了比较,证明了可以通过智能笔准确预测心率。数据采集由八名志愿者完成,每人连续书写字母表五分钟。信号经Butterworth滤波器处理以滤除噪声。将DigiPen计算的${\Delta}$t(脉冲间隔)与参考ECG数据比较,我们取得了优于6.685x10$^{-3}$的均方误差(MSE);两种信号的峰值时间戳相关性均高于0.99,由笔计算出的所有心率都与参考ECG信号准确相关。
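A sketch of the described pipeline (Butterworth band-pass, peak detection, mean inter-peak interval converted to bpm); the sampling rate and band edges are illustrative, not the paper's settings:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, find_peaks

def heart_rate_from_accel(acc, fs=100.0, band=(0.8, 3.0)):
    """Band-pass the accelerometer magnitude, find pulse peaks, and convert
    the mean inter-peak interval (delta-t) to beats per minute."""
    mag = np.linalg.norm(acc, axis=1) if acc.ndim == 2 else acc
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, mag - mag.mean())
    peaks, _ = find_peaks(filtered, distance=fs / band[1])
    dt = np.diff(peaks) / fs                      # seconds between pulses
    return 60.0 / dt.mean()

# Synthetic 72 bpm (1.2 Hz) pulse buried in noise.
fs, f_hr = 100.0, 1.2
t = np.arange(0, 60, 1 / fs)
sig = np.sin(2 * np.pi * f_hr * t) + 0.5 * np.random.default_rng(0).standard_normal(t.size)
print(round(heart_rate_from_accel(sig, fs=fs)))   # ~ 72
```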

Joint RIS Phase Profile Design and Power Allocation for Parameter Estimation in Presence of Eavesdropping

  • paper_url: http://arxiv.org/abs/2309.14280
  • repo_url: None
  • paper_authors: Erfan Mehdipour Abadi, Ayda Nodel Hokmabadi, Sinan Gezici
  • for: 本文研究在可重构智能表面(RIS)集成环境中存在窃听者的情况下,从发射机向目标接收机安全传输一个确定性复值参数向量。
  • methods: 本文以Fisher信息矩阵(FIM)的迹,即平均Fisher信息,作为估计精度度量,并给出其在目标接收机与窃听者处的闭式表达,从而在提升目标接收机估计精度的同时限制窃听者的估计精度。
  • results: 本文通过交替优化、半正定松弛加秩约减以及线性规划等方法,求解了RIS相位配置与发射机功率分配的联合优化问题,并通过仿真展示了二者单独及联合对估计精度的影响。
    Abstract We consider secure transmission of a deterministic complex-valued parameter vector from a transmitter to an intended receiver in the presence of an eavesdropper in a reconfigurable intelligent surface (RIS)-integrated environment. We aim to jointly optimize the RIS phase profile and the power allocation matrix at the transmitter to enhance the estimation accuracy at the intended receiver while limiting that at the eavesdropper. We utilize the trace of the Fisher information matrix (FIM), equivalently, the average Fisher information, as the estimation accuracy metric, and obtain its closed form expression for the intended receiver and the eavesdropper. Accordingly, the joint RIS phase profile and power allocation problem is formulated, and it is solved via alternating optimization. When the power allocation matrix is fixed during alternating optimization, the optimal RIS phase profile design problem is formulated as a non-convex problem and it is solved via semidefinite relaxation and rank reduction. When the RIS phase profile is fixed, a linear programming formulation is obtained for optimal power allocation. Via simulations, the effects of RIS phase design and power allocation are illustrated individually and jointly. Moreover, extensions are provided by considering the presence of line of sight paths in the environment and the availability of RIS elements with adjustable magnitudes.
    摘要 我们考虑在可重构智能表面(RIS)集成环境中,在存在窃听者的情况下,从发射机向目标接收机安全传输一个确定性复值参数向量。我们的目标是联合优化RIS相位配置与发射机处的功率分配矩阵,在提升目标接收机估计精度的同时限制窃听者的估计精度。我们采用Fisher信息矩阵(FIM)的迹,即平均Fisher信息,作为估计精度度量,并得到其在目标接收机与窃听者处的闭式表达。据此构造RIS相位配置与功率分配的联合优化问题,并通过交替优化求解:当功率分配矩阵固定时,最优RIS相位设计是一个非凸问题,通过半正定松弛与秩约减求解;当RIS相位固定时,最优功率分配可表述为线性规划。仿真展示了RIS相位设计与功率分配单独及联合的效果。此外,我们还扩展考虑了环境中存在视距路径以及RIS单元幅度可调的情形。

Adaptive Three Layer Hybrid Reconfigurable Intelligent Surface for 6G Wireless Communication: Trade-offs and Performance

  • paper_url: http://arxiv.org/abs/2309.14087
  • repo_url: None
  • paper_authors: Rashed Hasan Ratul, Muhammad Iqbal, Tabinda Ashraf, Jen-Yi Pan, Yi-Han Wang, Shao-Yu Lien
  • for: 本研究旨在提出一种三层混合RIS辅助配置方案,能够在主动RIS、被动RIS以及额外的休眠(非激活)状态之间自适应切换。
  • methods: 本研究采用三层混合RIS辅助配置,使单一RIS结构能够根据发射功率和无线链路质量的波动调整其整体配置。
  • results: 仿真表明,这种三层混合RIS辅助配置优于单独的被动或主动RIS辅助技术;所制作的被动RIS辅助结构也验证了所提思路的一部分。
    Abstract A potential candidate technology for the development of future 6G networks has been recognized as Reconfigurable Intelligent Surface (RIS). However, due to the variation in radio link quality, traditional passive RISs only accomplish a minimal signal gain in situations with strong direct links between user equipment (UE) and base station (BS). In order to get over this fundamental restriction of smaller gain, the idea of active RISs might be a suitable solution. In contrast to current passive RIS, which simply reflects and directs signals without any additional amplification, active RISs have the ability to enhance reflected signals by the incorporation of amplifiers inside its elements. However, with additional amplifiers, apart from the relatively complex attributes of RIS-assisted arrangements, the additional energy consumption of such technologies is often disregarded. So, there might be a tradeoff between the additional energy consumption for the RIS technologies and the overall gain acquired by deploying this potential advancement. The objective of this work is to provide a primary idea of a three-layer hybrid RIS-assisted configuration that is responsive to both active and passive RIS, as well as an additional dormant or inactive state. The single RIS structure should be capable of adjusting its overall configuration in response to fluctuations in transmit power and radio link quality. Furthermore, our fabricated passive RIS-assisted structure verifies a portion of the proposed idea, with simulations highlighting its advantages over standalone passive or active RIS-assisted technologies.
    摘要 sixth generation 网络(6G)的发展中,一种潜在的技术是可配置智能表面(Reconfigurable Intelligent Surface,RIS)。然而,由于无线链路质量的变化,传统的静止RIS只能实现最小的信号增强,尤其在用户设备(UE)和基站(BS)之间的强直接链路情况下。为了突破这种基本限制,可能适用的解决方案是活动RIS。与现有的静止RIS相比,活动RIS可以通过内置扩增器提高反射信号的强度。然而,随着这些技术的增加,除了RIS-assisted的复杂性外,额外的能源消耗也常被忽视。因此,可能存在一种负担增加和增加的负担之间的权衡。本工作的目标是提供一种三层混合RIS-assisted配置,可以响应活动和静止RIS,以及额外的休眠或不活跃状态。单一RIS结构应该能够根据发射功率和无线链路质量的变化进行调整。此外,我们制造的静止RIS-assisted结构的实验证明了一部分的提案的优势,而且模拟结果表明,与独立的静止或活动RIS-assisted技术相比,这种三层混合配置具有更高的优势。

Single-Antenna Jammers in MIMO-OFDM Can Resemble Multi-Antenna Jammers

  • paper_url: http://arxiv.org/abs/2309.14059
  • repo_url: https://github.com/iip-group/ofdm-jammer
  • paper_authors: Gian Marti, Christoph Studer
  • for: 本文研究了多输入多输出(MIMO)无线系统中,单天线干扰器在MIMO-OFDM场景下对多天线接收机的影响。
  • methods: 论文分析了利用线性空间滤波消除干扰器所引起干扰的可行性。
  • results: 研究发现,当干扰器不发送循环前缀、从而违反OFDM协议时,它在每个子载波上造成的干扰通常不再局限于一维子空间,而是张成L维子空间(L为干扰器信道的抽头数)。这意味着在MIMO-OFDM系统中,单天线干扰器可以表现得像L天线干扰器。
    Abstract In multiple-input multiple-output (MIMO) wireless systems with frequency-flat channels, a single-antenna jammer causes receive interference that is confined to a one-dimensional subspace. Such a jammer can thus be nulled using linear spatial filtering at the cost of one degree of freedom. Frequency-selective channels are often transformed into multiple frequency-flat subcarriers with orthogonal frequency-division multiplexing (OFDM). We show that when a single-antenna jammer violates the OFDM protocol by not sending a cyclic prefix, the interference received on each subcarrier by a multi-antenna receiver is, in general, not confined to a subspace of dimension one (as a single-antenna jammer in a frequency-flat scenario would be), but of dimension L, where L is the jammer's number of channel taps. In MIMO-OFDM systems, a single-antenna jammer can therefore resemble an L-antenna jammer. Simulations corroborate our theoretical results. These findings imply that mitigating jammers with large delay spread through linear spatial filtering is infeasible. We discuss some (im)possibilities for the way forward.
    摘要 在信道为频率平坦的多输入多输出(MIMO)无线系统中,单天线干扰器造成的接收干扰被限制在一维子空间内,因此可以用线性空间滤波将其置零,代价仅为一个自由度。频率选择性信道通常通过正交频分复用(OFDM)被变换为多个频率平坦的子载波。我们证明,当单天线干扰器不发送循环前缀、从而违反OFDM协议时,多天线接收机在每个子载波上收到的干扰通常不再局限于一维子空间(如频率平坦场景中的单天线干扰器那样),而是张成L维子空间,其中L为干扰器信道的抽头数。因此,在MIMO-OFDM系统中,单天线干扰器可以表现得像L天线干扰器。仿真验证了我们的理论结果。这些发现意味着,通过线性空间滤波来抑制大时延扩展的干扰器是不可行的。我们讨论了若干可能(及不可能)的后续方向。
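The rank-L claim is easy to reproduce numerically: convolve a CP-free jammer sequence with an L-tap channel per receive antenna, take per-symbol DFTs, and inspect the rank of one subcarrier's interference across OFDM symbols. With a proper cyclic prefix the per-subcarrier shifts would be cyclic and the rank would collapse to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
B, L, n_sym, n_sc = 8, 4, 200, 64      # BS antennas, jammer taps, symbols, subcarriers

h = (rng.standard_normal((B, L)) + 1j * rng.standard_normal((B, L))) / np.sqrt(2)
jam = rng.choice([-1, 1], size=n_sym * n_sc + L - 1) + 0j   # no cyclic prefix!

# Receive at each antenna, then take per-symbol DFTs as an OFDM receiver would.
Y = np.zeros((B, n_sym, n_sc), dtype=complex)
for b in range(B):
    y = np.convolve(jam, h[b])[: n_sym * n_sc]
    Y[b] = np.fft.fft(y.reshape(n_sym, n_sc), axis=1)

# Rank of the interference on one subcarrier across OFDM symbols.
sc = 10
vals = np.linalg.svd(Y[:, :, sc], compute_uv=False)
print((vals > 1e-6 * vals[0]).sum())   # -> 4 (= L), not 1
```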

Beam Squint Assisted User Localization in Near-Field Integrated Sensing and Communications Systems

  • paper_url: http://arxiv.org/abs/2309.14012
  • repo_url: None
  • paper_authors: Hongliang Luo, Feifei Gao, Wanmai Yuan, Shun Zhang
  • for: 这篇论文旨在提出一种借助真时延线(TTD)的近场通信系统用户定位方法,以反向利用宽带MIMO系统中的波束偏斜(beam squint)现象。
  • methods: 该方法利用TTD控制近场波束偏斜的轨迹,使不同子载波的波束指向不同的角度和距离,通过频域的波束偏斜效应实现用户定位。
  • results: simulations show that the proposed method can effectively reduce the beam sweeping overhead and achieve high accuracy user localization.
    Abstract Integrated sensing and communication (ISAC) has been regarded as a key technology for 6G wireless communications, in which large-scale multiple input and multiple output (MIMO) array with higher and wider frequency bands will be adopted. However, recent studies show that the beam squint phenomenon can not be ignored in wideband MIMO system, which generally deteriorates the communications performance. In this paper, we find that with the aid of true-time-delay lines (TTDs), the range and trajectory of the beam squint in near-field communications systems can be freely controlled, and hence it is possible to reversely utilize the beam squint for user localization. We derive the trajectory equation for near-field beam squint points and design a way to control such trajectory. With the proposed design, beamforming from different subcarriers would purposely point to different angles and different distances, such that users from different positions would receive the maximum power at different subcarriers. Hence, one can simply localize multiple users from the beam squint effect in frequency domain, and thus reduce the beam sweeping overhead as compared to the conventional time domain beam search based approach. Furthermore, we utilize the phase difference of the maximum power subcarriers received by the user at different frequencies in several times beam sweeping to obtain a more accurate distance estimation result, ultimately realizing high accuracy and low beam sweeping overhead user localization. Simulation results demonstrate the effectiveness of the proposed schemes.
    摘要 集成感知与通信(ISAC)被视为第六代(6G)无线通信的关键技术之一,其中将采用更高、更宽频段的大规模多输入多输出(MIMO)阵列。然而,近期研究表明,宽带MIMO系统中的波束斜视(beam squint)现象不可忽略,且通常会恶化通信性能。本文发现,借助真时延线(TTD),可以自由控制近场通信系统中波束斜视的范围和轨迹,因此可以反向利用波束斜视进行用户定位。我们推导了近场波束斜视点的轨迹方程,并设计了控制该轨迹的方法。在所提设计下,不同子载波的波束有意指向不同的角度和距离,使得不同位置的用户在不同子载波上接收到最大功率。因此,可以直接利用频域的波束斜视效应对多个用户进行定位,相比传统基于时域波束搜索的方法可降低波束扫描开销。此外,我们利用多次波束扫描中用户在不同频率上接收到的最大功率子载波的相位差,获得更精确的距离估计,最终实现高精度、低波束扫描开销的用户定位。仿真结果验证了所提方案的有效性。
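
To illustrate the effect being exploited, here is a small NumPy toy of our own (carrier, bandwidth, and array values are all assumed) showing near-field beam squint under frequency-independent phase shifters: the beam is focused at one (range, angle) pair at the carrier, and a grid search shows the maximum-power focus point drifting to different angles and distances on other subcarriers, which is exactly the frequency-domain signature used for localization:

```python
import numpy as np

c, fc, B, M, K = 3e8, 100e9, 10e9, 128, 9       # assumed toy system values
p = (np.arange(M) - (M - 1) / 2) * (c / fc) / 2  # half-wavelength ULA positions

thetas = np.deg2rad(np.linspace(0, 40, 161))
ranges = np.linspace(2, 30, 141)
R, TH = np.meshgrid(ranges, thetas, indexing="ij")
# exact element-to-point distances (spherical wavefront, i.e. near field)
D = np.sqrt(R[..., None] ** 2 + p ** 2 - 2 * R[..., None] * p * np.sin(TH[..., None]))

# frequency-independent phase shifters focused at (8 m, 20 deg) at the carrier only
d0 = np.sqrt(8.0 ** 2 + p ** 2 - 2 * 8.0 * p * np.sin(np.deg2rad(20.0)))
w = np.exp(1j * 2 * np.pi * fc * d0 / c)

for f in fc + np.linspace(-B / 2, B / 2, K):     # sweep the subcarriers
    gain = np.abs(np.exp(-1j * 2 * np.pi * f * D / c) @ w)
    i, j = np.unravel_index(gain.argmax(), gain.shape)
    print(f"f = {f/1e9:6.1f} GHz -> focus r = {ranges[i]:5.1f} m, "
          f"theta = {np.degrees(thetas[j]):5.1f} deg")
```

The paper's contribution is to steer this drifting trajectory deliberately with TTDs; the sketch only reproduces the underlying squint phenomenon.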

Carrier Aggregation Enabled Integrated Sensing and Communication Signal Design and Processing

  • paper_url: http://arxiv.org/abs/2309.14008
  • repo_url: None
  • paper_authors: Zhiqing Wei, Haotian Liu, Xinyi Yang, Wangjun Jiang, Huici Wu, Xingwang Li, Zhiyong Feng
  • for: 本研究旨在通过集成感知与通信(ISAC)技术,提升未来移动通信系统中车联网(IoV)和扩展现实(XR)等智能应用的数据传输速率与感知精度。
  • methods: 本研究提出基于载波聚合(CA)的ISAC信号,聚合高频与低频频段以提升感知性能;并提出基于压缩感知(CS)的ISAC信号处理算法,使用快速迭代收缩阈值算法(FISTA)求解信号重构凸优化问题。
  • results: 实验结果表明,CA技术可以有效提高距离和速度估计的准确性。
    Abstract The future mobile communication systems will support intelligent applications such as Internet of Vehicles (IoV) and Extended Reality (XR). Integrated Sensing and Communication (ISAC) is regarded as one of the key technologies satisfying the high data rate communication and highly accurate sensing for these intelligent applications in future mobile communication systems. With the explosive growth of wireless devices and services, the shortage of spectrum resources leads to the fragmentation of available frequency bands for ISAC systems, which degrades sensing performance. Facing the above challenges, this paper proposes a Carrier Aggregation (CA)-based ISAC signal aggregating high and low-frequency bands to improve the sensing performance, where the CA-based ISAC signal can use four different aggregated pilot structures for sensing. Then, an ISAC signal processing algorithm with Compressed Sensing (CS) is proposed and the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) is used to solve the reconstruction convex optimization problem. Finally, the Cramér-Rao Lower Bounds (CRLBs) are derived for the CA-based ISAC signal. Simulation results show that CA efficiently improves the accuracy of range and velocity estimation.
    摘要 未来移动通信系统将支持车联网(IoV)和扩展现实(XR)等智能应用。集成感知与通信(ISAC)被视为未来移动通信系统中同时满足高速率通信与高精度感知需求的关键技术之一。随着无线设备与业务的爆炸式增长,频谱资源短缺导致ISAC系统可用频段碎片化,从而降低感知性能。针对上述挑战,本文提出基于载波聚合(CA)的ISAC信号,聚合高频与低频频段以提升感知性能,且该信号可采用四种不同的聚合导频结构进行感知。随后,提出基于压缩感知(CS)的ISAC信号处理算法,并使用快速迭代收缩阈值算法(FISTA)求解重构凸优化问题。最后,推导了CA-ISAC信号的克拉美-罗下界(CRLB)。仿真结果表明,CA能有效提升距离与速度估计精度。
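
FISTA, the solver named in the abstract, is a generic accelerated proximal-gradient method. A minimal real-valued sketch for the standard sparse-recovery (LASSO) problem is below; the paper's aggregated-pilot measurement model is more elaborate, so treat this only as the algorithmic skeleton:

```python
import numpy as np

def fista(A, y, lam, n_iter=200):
    """FISTA for min_x 0.5*||A x - y||_2^2 + lam*||x||_1 (sparse recovery)."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1]); z = x.copy(); t = 1.0
    for _ in range(n_iter):
        g = z - A.T @ (A @ z - y) / L      # gradient step on the smooth part
        x_new = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = x_new + (t - 1) / t_new * (x_new - x)   # Nesterov-style momentum
        x, t = x_new, t_new
    return x

# Toy range-profile recovery: a sparse scene observed through random projections
rng = np.random.default_rng(1)
n, m, k = 256, 80, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n); x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true + 0.01 * rng.standard_normal(m)
x_hat = fista(A, y, lam=0.02)
print("true support:", np.nonzero(x_true)[0])
print("recovered   :", np.nonzero(np.abs(x_hat) > 0.05)[0])
```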

Near-field Hybrid Beamforming for Terahertz-band Integrated Sensing and Communications

  • paper_url: http://arxiv.org/abs/2309.13984
  • repo_url: None
  • paper_authors: Ahmet M. Elbir, Abdulkadir Celik, Ahmed M. Eltawil
  • for: 这篇论文面向第六代(6G)无线网络,研究太赫兹(THz)频段通信与集成感知通信(ISAC)两大方向。
  • methods: 作者提出一种交替优化(alternating optimization)技术,用于近场THz-ISAC场景下的混合波束成形设计,并提出通过基带波束成形器补偿近场波束斜视的方法。
  • results: 数值仿真表明,该方法无需额外硬件即可获得令人满意的频谱效率,并能准确估计近场波束成形器、有效缓解近场波束斜视。
    Abstract Terahertz (THz) band communications and integrated sensing and communications (ISAC) are two main facets of the sixth generation wireless networks. In order to compensate the severe attenuation, the THz wireless systems employ large arrays, wherein the near-field beam-squint severely degrades the beamforming accuracy. Contrary to prior works that examine only either narrowband ISAC beamforming or far-field models, we introduce an alternating optimization technique for hybrid beamforming design in near-field THz-ISAC scenario. We also propose an efficient approach to compensate near-field beam-squint via baseband beamformers. Via numerical simulations, we show that the proposed approach achieves satisfactory spectral efficiency performance while accurately estimating the near-field beamformers and mitigating the beam-squint without additional hardware components.
    摘要 太赫兹(THz)频段通信与集成感知通信(ISAC)是第六代无线网络的两大主要方向。为补偿严重的路径衰减,THz无线系统采用大规模阵列,而近场波束斜视会严重降低波束成形精度。不同于以往仅研究窄带ISAC波束成形或远场模型的工作,我们针对近场THz-ISAC场景提出一种混合波束成形设计的交替优化技术,并提出一种通过基带波束成形器补偿近场波束斜视的高效方法。数值仿真表明,所提方法在无需额外硬件的情况下获得令人满意的频谱效率,同时能准确估计近场波束成形器并缓解波束斜视。

Track-before-detect Algorithm based on Cost-reference Particle Filter Bank for Weak Target Detection

  • paper_url: http://arxiv.org/abs/2309.13922
  • repo_url: None
  • paper_authors: Jin Lu, Guojie Peng, Weichuan Zhang, Changming Sun
  • for: 该论文旨在解决雷达、声呐等应用中的弱目标检测问题。
  • methods: 提出一种基于改进粒子滤波(即代价参考粒子滤波器组,cost-reference particle filter bank, CRPFB)的检测前跟踪(TBD)算法,将目标检测转化为两层假设检验问题。
  • results: 在非线性调频(NLFM)信号检测与跟踪实验中,仿真结果表明所提TBD算法在检测、跟踪和时间效率方面优于现有TBD算法。
    Abstract Detecting weak target is an important and challenging problem in many applications such as radar, sonar etc. However, conventional detection methods are often ineffective in this case because of low signal-to-noise ratio (SNR). This paper presents a track-before-detect (TBD) algorithm based on an improved particle filter, i.e. cost-reference particle filter bank (CRPFB), which turns the problem of target detection to the problem of two-layer hypothesis testing. The first layer is implemented by CRPFB for state estimation of possible target. CRPFB has entirely parallel structure, consisting amounts of cost-reference particle filters with different hypothesized prior information. The second layer is to compare a test metric with a given threshold, which is constructed from the output of the first layer and fits GEV distribution. The performance of our proposed TBD algorithm and the existed TBD algorithms are compared according to the experiments on nonlinear frequency modulated (NLFM) signal detection and tracking. Simulation results show that the proposed TBD algorithm has better performance than the state-of-the-arts in detection, tracking, and time efficiency.
    摘要 弱目标检测是雷达、声呐等诸多应用中一个重要而具有挑战性的问题。然而,由于信噪比(SNR)过低,传统检测方法在这种情况下往往失效。本文提出一种基于改进粒子滤波(即代价参考粒子滤波器组,CRPFB)的检测前跟踪(TBD)算法,将目标检测转化为两层假设检验问题。第一层由CRPFB实现对潜在目标的状态估计;CRPFB具有完全并行的结构,由大量采用不同假设先验信息的代价参考粒子滤波器构成。第二层将检验统计量与给定阈值进行比较,该统计量由第一层输出构造并服从广义极值(GEV)分布。在非线性调频(NLFM)信号检测与跟踪实验中,所提TBD算法在检测、跟踪和时间效率方面均优于现有TBD算法。
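
The second-layer test compares a metric against a threshold derived from a fitted GEV distribution. A hedged sketch of that step, using max-of-filter-bank Gaussian scores as a stand-in for the CRPFB output:

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(2)

# Stand-in for the first layer: under H0 the test metric is a maximum over a
# bank of filters, so its law is well approximated by a GEV distribution.
h0_metrics = rng.normal(size=(100000, 32)).max(axis=1)   # noise-only trials

# Fit GEV to noise-only metrics, then set the threshold for a target Pfa
params = genextreme.fit(h0_metrics)
pfa = 1e-3
threshold = genextreme.ppf(1 - pfa, *params)
print(f"GEV params: {np.round(params, 3)}, threshold @ Pfa={pfa}: {threshold:.3f}")

# Second-layer decision for a new metric produced by the filter bank
metric = 5.1
print("target present" if metric > threshold else "noise only")
```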

Online Resource Allocation for Semantic-Aware Edge Computing Systems

  • paper_url: http://arxiv.org/abs/2309.13917
  • repo_url: None
  • paper_authors: Yihan Cang, Ming Chen, Zhaohui Yang, Yuntao Hu, Yinlu Wang, Chongwen Huang, Zhaoyang Zhang
  • for: 本文旨在提出一种语义感知的通信与计算资源联合分配框架,以减轻MEC系统的传输负担。
  • methods: 利用Lyapunov优化、块坐标下降法和连续凸近似算法,将通信与计算资源的长期联合分配问题转化为一系列可求解的确定性问题。
  • results: 仿真结果显示,与不考虑语义的分配方法相比,所提算法最多可节省41.8%的能耗。
    Abstract In this paper, we propose a semantic-aware joint communication and computation resource allocation framework for MEC systems. In the considered system, random tasks arrive at each terminal device (TD), which needs to be computed locally or offloaded to the MEC server. To further release the transmission burden, each TD sends the small-size extracted semantic information of tasks to the server instead of the original large-size raw data. An optimization problem of joint semantic-aware division factor, communication and computation resource management is formulated. The problem aims to minimize the energy consumption of the whole system, while satisfying long-term delay and processing rate constraints. To solve this problem, an online low-complexity algorithm is proposed. In particular, Lyapunov optimization is utilized to decompose the original coupled long-term problem into a series of decoupled deterministic problems without requiring the realizations of future task arrivals and channel gains. Then, the block coordinate descent method and successive convex approximation algorithm are adopted to solve the current time slot deterministic problem by observing the current system states. Moreover, the closed-form optimal solution of each optimization variable is provided. Simulation results show that the proposed algorithm yields up to 41.8% energy reduction compared to its counterpart without semantic-aware allocation.
    摘要 本文提出一种面向MEC系统的语义感知通信与计算资源联合分配框架。在所考虑的系统中,随机任务到达每个终端设备(TD),需要在本地计算或卸载至MEC服务器。为进一步减轻传输负担,每个TD向服务器发送任务的小规模语义信息,而非原始的大规模数据。我们建立了语义感知划分因子、通信与计算资源联合管理的优化问题,目标是在满足长期时延与处理速率约束的前提下最小化整个系统的能耗。为求解该问题,提出一种在线低复杂度算法:利用Lyapunov优化将原始耦合的长期问题分解为一系列解耦的确定性问题,无需未来任务到达与信道增益的实现;随后,基于当前系统状态,采用块坐标下降法与连续凸近似算法求解当前时隙的确定性问题,并给出各优化变量的闭式最优解。仿真结果显示,与不考虑语义的分配方案相比,所提算法最多可降低41.8%的能耗。
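
The core Lyapunov trick, decoupling a long-term constraint into per-slot decisions via a virtual queue, can be sketched in a few lines. The per-slot energy/delay options below are made up; they stand in for the paper's communication/computation model:

```python
import numpy as np

rng = np.random.default_rng(3)
V = 50.0          # energy/delay trade-off weight (larger V favours low energy)
Q = 0.0           # virtual queue enforcing the long-term delay constraint
d_max = 0.5       # long-term average delay budget per slot

for t in range(10000):
    # Per-slot decision: pick the offloading/computation option minimising the
    # drift-plus-penalty V * energy + Q * (delay - d_max), given current state.
    options = [(rng.uniform(0.2, 1.0), rng.uniform(0.1, 1.0)) for _ in range(4)]
    energy, delay = min(options, key=lambda o: V * o[0] + Q * (o[1] - d_max))
    # Virtual queue update: grows whenever the delay budget is violated
    Q = max(Q + delay - d_max, 0.0)

print(f"final virtual queue (near 0 means the constraint is met): {Q:.3f}")
```

Because no future task arrivals or channel gains enter the per-slot rule, the scheme runs online, which is exactly the property the abstract highlights.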

NoncovANM: Gridless DOA Estimation for LPDF System

  • paper_url: http://arxiv.org/abs/2309.13902
  • repo_url: None
  • paper_authors: Yangying Zhao, Peng Chen, Zhenxin Cao, Xianbin Wang
  • for: 提高低成本无源测向(LPDF)系统的精度和效率。
  • methods: 构建智能可重构表面(IRS)辅助、仅含一条完整功能接收通道的LPDF系统,并利用目标方向在空间域的稀疏性,构建原子范数最小化(ANM)问题来估计波达方向(DOA)。
  • results: 提出一种基于梯度阈值迭代的非凸ANM(NC-ANM)求解方法,并引入扰动以避免陷入鞍点;仿真结果表明,所提方法在LPDF系统中以更低的计算复杂度获得优于对比方法的DOA估计性能。
    Abstract Direction of arrival (DOA) estimation is an important research topic in the area of array signal processing, and has been studied for decades. High resolution DOA estimation requires large array aperture, which leads to the increase of hardware cost. Besides, high accuracy DOA estimation methods usually have high computational complexity. In this paper, the problem of decreasing the hardware cost and algorithm complexity is addressed. First, considering its ability to flexibly control electromagnetic waves at low cost, an intelligent reconfigurable surface (IRS)-aided low-cost passive direction finding (LPDF) system is developed, where only one fully functional receiving channel is adopted. Then, the sparsity of targets direction in the spatial domain is exploited by formulating an atomic norm minimization (ANM) problem to estimate the DOA. Traditionally, solving the ANM problem is complex and cannot be realized efficiently. Hence, a novel nonconvex-based ANM (NC-ANM) method is proposed by gradient threshold iteration, where a perturbation is introduced to avoid falling into saddle points. The theoretical analysis for the convergence of the NC-ANM method is also given. Moreover, the corresponding Cramér-Rao lower bound (CRLB) in the LPDF system is derived, and taken as the reference bound for the DOA estimation. Simulation results show that the proposed method outperforms the compared methods in the DOA estimation with lower computational complexity in the LPDF system.
    摘要 波达方向(DOA)估计是阵列信号处理领域的重要研究方向,已被研究数十年。高分辨率DOA估计需要较大的阵列孔径,导致硬件成本增加;而高精度DOA估计方法通常计算复杂度较高。本文致力于同时降低硬件成本与算法复杂度。首先,利用智能可重构表面(IRS)灵活调控电磁波且成本低廉的特点,构建了仅采用一条完整功能接收通道的IRS辅助低成本无源测向(LPDF)系统。随后,利用目标方向在空间域的稀疏性,构建原子范数最小化(ANM)问题以估计DOA。传统上,ANM问题求解复杂、难以高效实现,因此提出一种基于梯度阈值迭代的非凸ANM(NC-ANM)方法,并引入扰动以避免陷入鞍点,同时给出了NC-ANM方法收敛性的理论分析。此外,推导了LPDF系统中相应的克拉美-罗下界(CRLB),作为DOA估计的参考界。仿真结果表明,所提方法在LPDF系统中以更低的计算复杂度获得优于对比方法的DOA估计性能。

DNN-DANM: A High-Accuracy Two-Dimensional DOA Estimation Method Using Practical RIS

  • paper_url: http://arxiv.org/abs/2309.13856
  • repo_url: https://github.com/chenpengseu/dnn-danm
  • paper_authors: Zhimin Chen, Peng Chen, Le Zheng, Yudong Zhang
  • for: 本文研究实际可重构智能表面(RIS)系统中的二维波达方向(DOA)估计问题,考虑RIS单元间互耦及反射相位/幅度误差的影响。
  • methods: 提出一种结合深度神经网络(DNN)与解耦原子范数最小化(DANM)的DNN-DANM新方法用于DOA估计;此外,提出一种低计算复杂度的半定规划(SDP)方法求解原子范数最小化问题。
  • results: 通过仿真与原型实验验证了该方法的估计性能;在实际RIS场景下,所提方法以较低复杂度在二维DOA估计中优于现有方法。
    Abstract Reconfigurable intelligent surface (RIS) or intelligent reflecting surface (IRS) has been an attractive technology for future wireless communication and sensing systems. However, in the practical RIS, the mutual coupling effect among RIS elements, the reflection phase shift, and amplitude errors will degrade the RIS performance significantly. This paper investigates the two-dimensional direction-of-arrival (DOA) estimation problem in the scenario using a practical RIS. After formulating the system model with the mutual coupling effect and the reflection phase/amplitude errors of the RIS, a novel DNNDANM method is proposed for the DOA estimation by combining the deep neural network (DNN) and the decoupling atomic norm minimization (DANM). The DNN step reconstructs the received signal from the one with RIS impairments, and the DANM step exploits the signal sparsity in the two-dimensional spatial domain. Additionally, a semi-definite programming (SDP) method with low computational complexity is proposed to solve the atomic minimization problem. Finally, both simulation and prototype are carried out to show estimation performance, and the proposed method outperforms the existing methods in the two-dimensional DOA estimation with low complexity in the scenario with practical RIS.
    摘要 可重构智能表面(RIS,又称智能反射面,IRS)是面向未来无线通信与感知系统的一项有吸引力的技术。然而,在实际RIS中,单元间互耦效应以及反射相位与幅度误差会显著恶化RIS性能。本文研究采用实际RIS场景下的二维波达方向(DOA)估计问题。在建立包含互耦效应及反射相位/幅度误差的系统模型后,提出一种结合深度神经网络(DNN)与解耦原子范数最小化(DANM)的DNN-DANM新方法:DNN步骤从受RIS损伤的信号中重建接收信号,DANM步骤利用信号在二维空间域的稀疏性。此外,提出一种低计算复杂度的半定规划(SDP)方法求解原子范数最小化问题。最后,通过仿真与原型实验验证估计性能,所提方法在实际RIS场景下以较低复杂度在二维DOA估计中优于现有方法。

On the Energy Efficiency of THz-NOMA enhanced UAV Cooperative Network with SWIPT

  • paper_url: http://arxiv.org/abs/2309.13836
  • repo_url: None
  • paper_authors: Jalal Jalali, Ata Khalili, Hina Tabassum, Rafael Berkvens, Jeroen Famaey, Walid Saad
  • for: 该论文研究工作在太赫兹(THz)频段、由同时无线信息与能量传输(SWIPT)辅助的无人机(UAV)协同网络的能效(EE)最大化问题。
  • methods: 通过联合优化非正交多址(NOMA)功率分配系数、SWIPT功率分割(PS)比例以及无人机轨迹来最大化EE。
  • results: 将原问题分解为两阶段优化问题并采用交替优化方法求解;数值结果表明,与无轨迹优化或无NOMA功率/PS优化的基准方案相比,所提资源分配算法更为有效。
    Abstract This paper considers the energy efficiency (EE) maximization of a simultaneous wireless information and power transfer (SWIPT)-assisted unmanned aerial vehicles (UAV) cooperative network operating at TeraHertz (THz) frequencies. The source performs SWIPT enabling the UAV to receive both power and information while also transmitting the information to a designated destination node. Subsequently, the UAV utilizes the harvested energy to relay the data to the intended destination node effectively. Specifically, we maximize EE by optimizing the non-orthogonal multiple access (NOMA) power allocation coefficients, SWIPT power splitting (PS) ratio, and UAV trajectory. The main problem is broken down into a two-stage optimization problem and solved using an alternating optimization approach. In the first stage, optimization of the PS ratio and trajectory is performed by employing successive convex approximation using a lower bound on the exponential factor in the THz channel model. In the second phase, the NOMA power coefficients are optimized using a quadratic transform approach. Numerical results demonstrate the effectiveness of our proposed resource allocation algorithm compared to the baselines where there is no trajectory optimization or no NOMA power or PS optimization.
    摘要 本文研究工作在太赫兹(THz)频段、由同时无线信息与能量传输(SWIPT)辅助的无人机(UAV)协同网络的能效(EE)最大化问题。源节点执行SWIPT,使UAV在接收信息的同时收集能量,随后UAV利用收集的能量将数据有效中继至目标节点。具体而言,通过联合优化非正交多址(NOMA)功率分配系数、SWIPT功率分割(PS)比例以及UAV轨迹来最大化EE。该问题被分解为两阶段优化问题,并采用交替优化方法求解:第一阶段利用THz信道模型中指数因子的下界,通过连续凸近似优化PS比例与轨迹;第二阶段采用二次变换方法优化NOMA功率系数。数值结果表明,与无轨迹优化或无NOMA功率/PS优化的基准方案相比,所提资源分配算法更为有效。
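
As a one-dimensional slice of the problem, the trade-off controlled by the SWIPT power-splitting ratio can be seen with a simple grid search. All link values below are assumed toy numbers; the paper jointly optimizes NOMA coefficients, PS ratio, and trajectory rather than sweeping one variable:

```python
import numpy as np

P_tx, g, eta, sigma2, B = 1.0, 0.05, 0.6, 1e-3, 1.0   # assumed toy link values
P_circuit = 0.1

def energy_efficiency(rho):
    # rate from the information branch ((1 - rho) of the received power)
    rate = B * np.log2(1 + (1 - rho) * P_tx * g / sigma2)
    # net power: transmit + circuit - energy harvested from the rho branch
    power = P_tx + P_circuit - eta * rho * P_tx * g
    return rate / power

rhos = np.linspace(0.01, 0.99, 99)          # SWIPT power-splitting ratio
best = max(rhos, key=energy_efficiency)
print(f"best PS ratio {best:.2f}: EE = {energy_efficiency(best):.3f} bit/s/Hz per W")
```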

Study of Robust Adaptive Beamforming Algorithms Based on Power Method Processing and Spatial Spectrum Matching

  • paper_url: http://arxiv.org/abs/2309.13785
  • repo_url: None
  • paper_authors: S. Mohammadzadeh, V. H. Nascimento, R. C. de Lamare, O. Kukrer
  • for: 提升鲁棒自适应波束成形(RAB)在协方差矩阵重构误差存在时的性能。
  • methods: 提出一种基于干扰加噪声协方差(INC)矩阵重构的高效RAB技术:利用幂法估计各干扰的功率与导向矢量,再通过空间谱匹配处理重构期望信号加噪声协方差矩阵,最后剔除噪声分量以保留期望信号协方差矩阵。
  • results: 与现有方法相比,所提方法能更好地应对协方差矩阵重构误差,并提升RAB性能。
    Abstract Robust adaptive beamforming (RAB) based on interference-plus-noise covariance (INC) matrix reconstruction can experience performance degradation when model mismatch errors exist, particularly when the input signal-to-noise ratio (SNR) is large. In this work, we devise an efficient RAB technique for dealing with covariance matrix reconstruction issues. The proposed method involves INC matrix reconstruction using an idea in which the power and the steering vector of the interferences are estimated based on the power method. Furthermore, spatial match processing is computed to reconstruct the desired signal-plus-noise covariance matrix. Then, the noise components are excluded to retain the desired signal (DS) covariance matrix. A key feature of the proposed technique is to avoid eigenvalue decomposition of the INC matrix to obtain the dominant power of the interference-plus-noise region. Moreover, the INC reconstruction is carried out according to the definition of the theoretical INC matrix. Simulation results are shown and discussed to verify the effectiveness of the proposed method against existing approaches.
    摘要 基于干扰加噪声协方差(INC)矩阵重构的鲁棒自适应波束成形(RAB)在存在模型失配误差时可能出现性能下降,尤其是在输入信噪比(SNR)较大时。本文设计了一种高效的RAB技术以应对协方差矩阵重构问题:基于幂法估计干扰的功率与导向矢量,进而重构INC矩阵;随后通过空间谱匹配处理重构期望信号加噪声协方差矩阵;最后剔除噪声分量,保留期望信号(DS)协方差矩阵。该方法的一个关键特点是无需对INC矩阵进行特征值分解即可获得干扰加噪声区域的主导功率;此外,INC重构遵循理论INC矩阵的定义。仿真结果验证了所提方法相对现有方法的有效性。
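
The power-method ingredient, estimating an interferer's power and steering vector from the dominant eigenpair of the sample covariance without a full eigendecomposition, looks like this in a toy single-interferer setting (our sketch, not the paper's full INC reconstruction):

```python
import numpy as np

def power_method(R, n_iter=50):
    """Dominant eigenpair of a Hermitian covariance matrix via power iteration."""
    v = np.random.default_rng(4).standard_normal(R.shape[0]) + 0j
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = R @ v
        v /= np.linalg.norm(v)
    power = np.real(v.conj() @ R @ v)       # Rayleigh quotient = dominant eigenvalue
    return power, v

# Toy array snapshot covariance: one strong interferer + noise, 10-element ULA
M, inr = 10, 100.0
a = np.exp(1j * np.pi * np.arange(M) * np.sin(np.deg2rad(30)))  # steering vector
R = inr * np.outer(a, a.conj()) + np.eye(M)                     # interferer + noise

p_hat, a_hat = power_method(R)
corr = abs(a_hat.conj() @ a) / np.linalg.norm(a)
print(f"estimated power {p_hat:.1f} (true ~ {inr * M + 1:.1f}), |corr| = {corr:.4f}")
```

The eigendecomposition-free estimate is what keeps the overall RAB scheme cheap.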

cs.SD - 2023-09-24

Cross-modal Alignment with Optimal Transport for CTC-based ASR

  • paper_url: http://arxiv.org/abs/2309.13650
  • repo_url: None
  • paper_authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
  • for: 提高基于CTC的ASR系统的准确率,使其能更好地利用预训练语言模型(PLM)中的语言知识。
  • methods: 使用最优传输(OT)算法实现声学特征与文本特征之间的跨模态对齐,使声学特征编码上下文相关的语言信息。
  • results: 在AISHELL-1数据集上,系统在开发集和测试集上分别取得3.96%和4.27%的字符错误率(CER),相比基线系统相对降低28.39%和29.42%。
    Abstract Temporal connectionist temporal classification (CTC)-based automatic speech recognition (ASR) is one of the most successful end to end (E2E) ASR frameworks. However, due to the token independence assumption in decoding, an external language model (LM) is required which destroys its fast parallel decoding property. Several studies have been proposed to transfer linguistic knowledge from a pretrained LM (PLM) to the CTC based ASR. Since the PLM is built from text while the acoustic model is trained with speech, a cross-modal alignment is required in order to transfer the context dependent linguistic knowledge from the PLM to acoustic encoding. In this study, we propose a novel cross-modal alignment algorithm based on optimal transport (OT). In the alignment process, a transport coupling matrix is obtained using OT, which is then utilized to transform a latent acoustic representation for matching the context-dependent linguistic features encoded by the PLM. Based on the alignment, the latent acoustic feature is forced to encode context dependent linguistic information. We integrate this latent acoustic feature to build conformer encoder-based CTC ASR system. On the AISHELL-1 data corpus, our system achieved 3.96% and 4.27% character error rate (CER) for dev and test sets, respectively, which corresponds to relative improvements of 28.39% and 29.42% compared to the baseline conformer CTC ASR system without cross-modal knowledge transfer.
    摘要 基于连接时序分类(CTC)的自动语音识别(ASR)是最成功的端到端(E2E)ASR框架之一。然而,由于解码过程中的令牌独立性假设,需要外部语言模型(LM),这破坏了其快速并行解码的特性。已有多项研究提出将预训练语言模型(PLM)中的语言知识迁移到基于CTC的ASR中。由于PLM由文本构建,而声学模型由语音训练,因此需要跨模态对齐,才能将PLM中上下文相关的语言知识迁移到声学编码中。本文提出一种基于最优传输(OT)的新型跨模态对齐算法:在对齐过程中,利用OT求得传输耦合矩阵,并用其变换潜在声学表示,使之与PLM编码的上下文相关语言特征相匹配;基于该对齐,潜在声学特征被迫编码上下文相关的语言信息。我们将该潜在声学特征整合进基于Conformer编码器的CTC ASR系统。在AISHELL-1数据集上,系统在开发集和测试集上分别取得3.96%和4.27%的字符错误率(CER),相比无跨模态知识迁移的Conformer CTC ASR基线系统相对降低28.39%和29.42%。
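
The OT coupling matrix at the heart of the alignment can be computed with the standard Sinkhorn iteration. Below is a self-contained toy with random embeddings in place of real acoustic/PLM features:

```python
import numpy as np

def sinkhorn(C, eps=0.05, n_iter=200):
    """Entropic OT: coupling between uniform marginals given cost matrix C."""
    m, n = C.shape
    a, b = np.full(m, 1 / m), np.full(n, 1 / n)   # source/target marginals
    K = np.exp(-C / eps)                          # Gibbs kernel
    v = np.ones(n)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]            # transport coupling matrix

# Toy alignment: 12 acoustic frames vs 5 linguistic token embeddings
rng = np.random.default_rng(5)
acoustic = rng.standard_normal((12, 16))
text = rng.standard_normal((5, 16))
C = 1 - (acoustic @ text.T) / (
    np.linalg.norm(acoustic, axis=1, keepdims=True) * np.linalg.norm(text, axis=1)
)                                                 # cosine-distance cost
T = sinkhorn(C)
print("row sums (should be ~1/12):", np.round(T.sum(1), 4))
print("hardest-aligned token per frame:", T.argmax(1))
```

In the paper the coupling is then used to transform the latent acoustic representation toward the PLM features, which this toy does not attempt.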

Efficient Black-Box Speaker Verification Model Adaptation with Reprogramming and Backend Learning

  • paper_url: http://arxiv.org/abs/2309.13605
  • repo_url: None
  • paper_authors: Jingyu Li, Tan Lee
  • for: 这篇论文旨在解决基于深度神经网络的说话人验证(SV)系统在实际应用中的领域失配问题,通过修改模型的可学习输入实现领域自适应,提升SV系统性能。
  • methods: 受对抗重编程(adversarial reprogramming)思想启发,预训练SV模型保持固定、仅参与前向计算,相当于黑盒模型;利用一个轻量级网络估计输入端可学习参数的梯度,绕过对黑盒模型的梯度反向传播,并由两层后端学习模块处理重编程输出,得到最终自适应的说话人嵌入。
  • results: 实验在语言失配场景下进行;结果显示,该方法以远低于完全微调的计算开销,取得与完全微调模型相近甚至更优的性能,且仅引入少量额外参数,兼具内存与参数效率。
    Abstract The development of deep neural networks (DNN) has significantly enhanced the performance of speaker verification (SV) systems in recent years. However, a critical issue that persists when applying DNN-based SV systems in practical applications is domain mismatch. To mitigate the performance degradation caused by the mismatch, domain adaptation becomes necessary. This paper introduces an approach to adapt DNN-based SV models by manipulating the learnable model inputs, inspired by the concept of adversarial reprogramming. The pre-trained SV model remains fixed and functions solely in the forward process, resembling a black-box model. A lightweight network is utilized to estimate the gradients for the learnable parameters at the input, which bypasses the gradient backpropagation through the black-box model. The reprogrammed output is processed by a two-layer backend learning module as the final adapted speaker embedding. The number of parameters involved in the gradient calculation is small in our design. With few additional parameters, the proposed method achieves both memory and parameter efficiency. The experiments are conducted in language mismatch scenarios. Using much less computation cost, the proposed method obtains close or superior performance to the fully finetuned models in our experiments, which demonstrates its effectiveness.
    摘要 深度神经网络(DNN)的发展显著提升了说话人验证(SV)系统的性能,但在实际应用中,领域失配仍是一个关键问题。为缓解失配导致的性能下降,需要进行领域自适应。本文受对抗重编程思想启发,提出一种通过修改可学习模型输入来自适应DNN SV模型的方法:预训练SV模型保持固定、仅参与前向过程,类似黑盒模型;利用一个轻量级网络估计输入端可学习参数的梯度,从而绕过对黑盒模型的梯度反向传播;重编程后的输出再由两层后端学习模块处理,得到最终自适应的说话人嵌入。本设计中参与梯度计算的参数量很小,仅需少量额外参数即可兼顾内存与参数效率。实验在语言失配场景下进行,结果表明,所提方法以远低于完全微调的计算开销取得相近或更优的性能,证明了其有效性。
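
A minimal PyTorch sketch of the reprogramming idea, a frozen speaker encoder with a learnable input offset and a two-layer backend, is shown below. For simplicity the sketch backpropagates through the frozen network; the paper instead estimates the input gradient with a lightweight network so the encoder stays a true black box. The toy encoder, dimensions, and loss are all assumptions:

```python
import torch
import torch.nn as nn

# Stand-in "black-box" speaker encoder: only forward passes are used
black_box = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 192))
for p in black_box.parameters():
    p.requires_grad_(False)

# Learnable input reprogramming (a per-feature offset) + 2-layer backend
delta = nn.Parameter(torch.zeros(80))
backend = nn.Sequential(nn.Linear(192, 192), nn.ReLU(), nn.Linear(192, 192))
opt = torch.optim.Adam([delta, *backend.parameters()], lr=1e-3)

def adapted_embedding(feats):
    # feats: (batch, 80) acoustic features from the new domain
    return backend(black_box(feats + delta))

# One sketched training step with a placeholder loss (a real system would use
# a speaker-classification or metric-learning objective)
feats, target = torch.randn(8, 80), torch.randn(8, 192)
loss = nn.functional.mse_loss(adapted_embedding(feats), target)
loss.backward()   # the paper *estimates* this gradient instead of backpropagating
opt.step()
print("trainable params:", sum(p.numel() for p in [delta, *backend.parameters()]))
```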

The second multi-channel multi-party meeting transcription challenge (M2MeT 2.0): A benchmark for speaker-attributed ASR

  • paper_url: http://arxiv.org/abs/2309.13573
  • repo_url: None
  • paper_authors: Yuhao Liang, Mohan Shi, Fan Yu, Yangze Li, Shiliang Zhang, Zhihao Du, Qian Chen, Lei Xie, Yanmin Qian, Jian Wu, Zhuo Chen, Kong Aik Lee, Zhijie Yan, Hui Bu
  • for: 本文主要探讨典型会议场景中“谁在何时说了什么”的实际问题,即说话人归属的语音识别(speaker-attributed ASR, SA-ASR)。
  • methods: 挑战设置两个子赛道:固定训练条件子赛道将训练数据限制在预先指定的数据集内,但允许参赛者使用任意开源预训练模型;开放训练条件子赛道则允许不受限制地使用所有可用数据与模型。
  • results: 发布了一个新的10小时测试集用于挑战排名,并给出提交系统的结果与分析,作为反映SA-ASR当前水平的基准。
    Abstract With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0) held in ASRU2023 particularly aims to tackle the complex task of speaker-attributed ASR (SA-ASR), which directly addresses the practical and challenging problem of "who spoke what at when" at a typical meeting scenario. We particularly established two sub-tracks. The fixed training condition sub-track, where the training data is constrained to predetermined datasets, but participants can use any open-source pre-trained model. The open training condition sub-track, which allows for the use of all available data and models without limitation. In addition, we release a new 10-hour test set for challenge ranking. This paper provides an overview of the dataset, track settings, results, and analysis of submitted systems, as a benchmark to show the current state of speaker-attributed ASR.
    摘要 在首届多通道多方会议转写挑战(M2MeT)取得成功之后,于ASRU 2023举办的第二届挑战(M2MeT 2.0)着重攻克复杂的说话人归属语音识别(SA-ASR)任务,直接面向典型会议场景中“谁在何时说了什么”这一实际且具有挑战性的问题。我们设置了两个子赛道:固定训练条件子赛道将训练数据限制在预先指定的数据集内,但允许参赛者使用任意开源预训练模型;开放训练条件子赛道则允许不受限制地使用所有可用数据与模型。此外,我们发布了一个新的10小时测试集用于挑战排名。本文概述了数据集、赛道设置、结果以及对提交系统的分析,作为反映SA-ASR当前水平的基准。

Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control

  • paper_url: http://arxiv.org/abs/2309.13509
  • repo_url: None
  • paper_authors: Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Wataru Nakata, Detai Xin, Hiroshi Saruwatari
  • for: 研究面向多用途语音合成的语音特征控制。
  • methods: 借鉴文本条件生成(如文本生成图像)的成功,利用自由形式文本指令实现直观且复杂的语音特征控制。
  • results: 构建了新的语料库Coco-Nut,包含多样化的日语语音样本,以及相应的文本转写和自由形式的语音特征描述。
    Abstract In text-to-speech, controlling voice characteristics is important in achieving various-purpose speech synthesis. Considering the success of text-conditioned generation, such as text-to-image, free-form text instruction should be useful for intuitive and complicated control of voice characteristics. A sufficiently large corpus of high-quality and diverse voice samples with corresponding free-form descriptions can advance such control research. However, neither an open corpus nor a scalable method is currently available. To this end, we develop Coco-Nut, a new corpus including diverse Japanese utterances, along with text transcriptions and free-form voice characteristics descriptions. Our methodology to construct this corpus consists of 1) automatic collection of voice-related audio data from the Internet, 2) quality assurance, and 3) manual annotation using crowdsourcing. Additionally, we benchmark our corpus on the prompt embedding model trained by contrastive speech-text learning.
    摘要 在文本到语音合成中,控制语音特征对于实现多用途语音合成十分重要。鉴于文本条件生成(如文本生成图像)的成功,自由形式的文本指令应有助于直观且复杂地控制语音特征。一个规模足够大、包含高质量多样语音样本及相应自由形式描述的语料库能够推动此类控制研究,然而目前既没有公开语料库,也没有可扩展的构建方法。为此,我们构建了新语料库Coco-Nut,包含多样化的日语语音、文本转写以及自由形式的语音特征描述。构建方法包括:1)自动从互联网收集语音相关音频数据;2)质量保证;3)通过众包进行人工标注。此外,我们在该语料库上对基于语音-文本对比学习训练的提示嵌入模型进行了基准测试。

cs.CV - 2023-09-24

Diffeomorphic Multi-Resolution Deep Learning Registration for Applications in Breast MRI

  • paper_url: http://arxiv.org/abs/2309.13777
  • repo_url: None
  • paper_authors: Matthew G. French, Gonzalo D. Maso Talou, Thiranja P. Babarenda Gamage, Martyn P. Nash, Poul M. Nielsen, Anthony J. Doyle, Juan Eugenio Iglesias, Yaël Balbastre, Sean I. Young
  • for: 在乳腺外科手术规划中,对患者不同体位下的MR图像进行精准配准,有望改善乳腺癌治疗中肿瘤的定位。
  • methods: 本文提出满足微分同胚(diffeomorphic)约束的乳腺MR图像配准学习策略,并给出计算机模拟与在体实验的初步结果。
  • results: 实验结果表明,所提配准网络在乳腺图像上取得优越的配准效果,同时提供微分同胚保证。
    Abstract In breast surgical planning, accurate registration of MR images across patient positions has the potential to improve the localisation of tumours during breast cancer treatment. While learning-based registration methods have recently become the state-of-the-art approach for most medical image registration tasks, these methods have yet to make inroads into breast image registration due to certain difficulties: the lack of rich texture information in breast MR images and the need for the deformations to be diffeomorphic. In this work, we propose learning strategies for breast MR image registration that are amenable to diffeomorphic constraints, together with early experimental results from in-silico and in-vivo experiments. One key contribution of this work is a registration network which produces superior registration outcomes for breast images in addition to providing diffeomorphic guarantees.
    摘要 在乳腺外科手术规划中,对患者不同体位下的MR图像进行精准配准,有望改善乳腺癌治疗中肿瘤的定位。尽管基于学习的配准方法近来已成为大多数医学图像配准任务的最先进方案,但由于乳腺MR图像缺乏丰富的纹理信息,且形变需要满足微分同胚性,这类方法尚未在乳腺图像配准中取得突破。本文提出了适用于微分同胚约束的乳腺MR图像配准学习策略,并给出了计算机模拟与在体实验的初步结果。本工作的一项关键贡献是一个配准网络,它在乳腺图像上取得优越的配准效果,同时提供微分同胚保证。
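
A common way to obtain (approximately) diffeomorphic deformations in learning-based registration is to integrate a stationary velocity field by scaling and squaring. Whether this paper uses exactly that parameterization is our assumption, so read the 2D sketch below as background rather than their method:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def integrate_svf(v, n_steps=6):
    """Scaling-and-squaring: turn a stationary velocity field v (2, H, W)
    into a displacement field whose map x -> x + d(x) is (approximately)
    diffeomorphic."""
    d = v / (2 ** n_steps)                   # small initial displacement
    H, W = v.shape[1:]
    grid = np.mgrid[0:H, 0:W].astype(float)  # identity coordinates
    for _ in range(n_steps):                 # squaring: d <- d(x + d(x)) + d(x)
        warped = np.stack([
            map_coordinates(d[c], grid + d, order=1, mode='nearest')
            for c in range(2)
        ])
        d = d + warped
    return d

# Smooth random velocity field on a 64x64 grid
rng = np.random.default_rng(6)
v = np.stack([gaussian_filter(c, 8) for c in rng.standard_normal((2, 64, 64))]) * 40

d = integrate_svf(v)
# Jacobian determinant of x + d(x): positive everywhere <=> locally invertible
g0 = np.gradient(d[0], axis=(0, 1))
g1 = np.gradient(d[1], axis=(0, 1))
J = (1 + g0[0]) * (1 + g1[1]) - g0[1] * g1[0]
print(f"min Jacobian determinant: {J.min():.3f} (should stay > 0)")
```

The positive Jacobian check at the end is the standard numerical surrogate for the "diffeomorphic guarantee" the abstract refers to.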

Motion Segmentation from a Moving Monocular Camera

  • paper_url: http://arxiv.org/abs/2309.13772
  • repo_url: None
  • paper_authors: Yuxiang Huang, John Zelek
  • for: 识别并分割运动单目相机所拍视频中的运动物体,以便在视觉SLAM或SfM建图时将其剔除。
  • methods: 在物体层面协同融合单目运动分割的两大主流方法(基于点轨迹的方法与基于光流的方法),这两种运动线索高度互补。
  • results: 在KT3DMoSeg数据集上达到最先进性能,能够处理复杂的运动和场景结构。
    Abstract Identifying and segmenting moving objects from a moving monocular camera is difficult when there is unknown camera motion, different types of object motions and complex scene structures. To tackle these challenges, we take advantage of two popular branches of monocular motion segmentation approaches: point trajectory based and optical flow based methods, by synergistically fusing these two highly complementary motion cues at object level. By doing this, we are able to model various complex object motions in different scene structures at once, which has not been achieved by existing methods. We first obtain object-specific point trajectories and optical flow mask for each common object in the video, by leveraging the recent foundational models in object recognition, segmentation and tracking. We then construct two robust affinity matrices representing the pairwise object motion affinities throughout the whole video using epipolar geometry and the motion information provided by optical flow. Finally, co-regularized multi-view spectral clustering is used to fuse the two affinity matrices and obtain the final clustering. Our method shows state-of-the-art performance on the KT3DMoSeg dataset, which contains complex motions and scene structures. Being able to identify moving objects allows us to remove them for map building when using visual SLAM or SFM.
    摘要 在相机自身运动、物体运动类型多样且场景结构复杂的情况下,从运动的单目相机中识别并分割运动物体十分困难。为应对这些挑战,我们在物体层面协同融合单目运动分割的两大主流方法,即基于点轨迹的方法与基于光流的方法;这两种运动线索高度互补,使我们能够同时建模不同场景结构下的各种复杂物体运动,这是现有方法尚未做到的。我们首先借助物体识别、分割与跟踪方面的最新基础模型,为视频中每个常见物体获取其点轨迹与光流掩码;随后利用对极几何与光流提供的运动信息,构建两个鲁棒的亲和矩阵,刻画整段视频中物体两两之间的运动亲和度;最后采用共正则化多视图谱聚类融合这两个亲和矩阵,得到最终聚类结果。我们的方法在包含复杂运动与场景结构的KT3DMoSeg数据集上达到最先进性能。识别出运动物体后,即可在使用视觉SLAM或SfM建图时将其剔除。
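
The final fusion step can be approximated in a few lines: build one affinity per cue, combine them, and run spectral clustering. Plain averaging below is a minimal stand-in for the paper's co-regularized multi-view formulation, and the block-structured affinities are synthetic:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(7)
n_obj = 12   # tracked objects: first 6 share one motion, last 6 another

def noisy_block_affinity(noise):
    A = np.full((n_obj, n_obj), 0.1)
    A[:6, :6] = A[6:, 6:] = 0.9
    A += noise * rng.uniform(size=(n_obj, n_obj))
    return (A + A.T) / 2                     # keep it symmetric

A_traj = noisy_block_affinity(0.3)           # point-trajectory motion affinity
A_flow = noisy_block_affinity(0.3)           # optical-flow motion affinity

A_fused = (A_traj + A_flow) / 2              # simple fusion of the two cues
labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(A_fused)
print("cluster labels:", labels)
```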

Devil in the Number: Towards Robust Multi-modality Data Filter

  • paper_url: http://arxiv.org/abs/2309.13770
  • repo_url: None
  • paper_authors: Yichen Xu, Zihan Xu, Wenhao Chai, Zhonghan Zhao, Enxin Song, Gaoang Wang
  • for: 这项研究旨在通过合适的过滤方法筛选网络规模的多模态数据集,以提升CLIP的性能并降低训练成本。
  • methods: 研究使用CLIP分数过滤器与图像内文本检测方法进行数据过滤。在分析数据集时,我们发现文本内容中存在大量冗余信息(例如数字),实验表明这些冗余元素会显著影响CLIP分数。
  • results: 在DataComp(数据过滤基准)“small scale”赛道的ImageNet分布偏移上,我们基于文本的CLIP过滤器优于排名第一的方法,取得3.6%的性能提升;实验还表明,在选取前40%数据时,所提文本掩码过滤器优于原始CLIP分数过滤器。数字对CLIP的影响及其处理方式为改进CLIP训练(包括语言改写技术)提供了有价值的启示。
    Abstract In order to appropriately filter multi-modality data sets on a web-scale, it becomes crucial to employ suitable filtering methods to boost performance and reduce training costs. For instance, LAION papers employs the CLIP score filter to select data with CLIP scores surpassing a certain threshold. On the other hand, T-MARS achieves high-quality data filtering by detecting and masking text within images and then filtering by CLIP score. Through analyzing the dataset, we observe a significant proportion of redundant information, such as numbers, present in the textual content. Our experiments on a subset of the data unveil the profound impact of these redundant elements on the CLIP scores. A logical approach would involve reevaluating the CLIP scores after eliminating these influences. Experimentally, our text-based CLIP filter outperforms the top-ranked method on the ``small scale" of DataComp (a data filtering benchmark) on ImageNet distribution shifts, achieving a 3.6% performance improvement. The results also demonstrate that our proposed text-masked filter outperforms the original CLIP score filter when selecting the top 40% of the data. The impact of numbers on CLIP and their handling provide valuable insights for improving the effectiveness of CLIP training, including language rewrite techniques.
    摘要 为了在网络规模上对多模态数据集进行合理过滤,采用合适的过滤方法以提升性能、降低训练成本变得至关重要。例如,LAION采用CLIP分数过滤器,选取CLIP分数超过一定阈值的数据;T-MARS则通过检测并掩盖图像中的文本后再按CLIP分数过滤,实现高质量的数据筛选。通过分析数据集,我们观察到文本内容中存在相当比例的冗余信息(例如数字)。我们在数据子集上的实验揭示了这些冗余元素对CLIP分数的深远影响,因此一个合理的做法是在消除这些影响后重新评估CLIP分数。实验表明,在DataComp(数据过滤基准)“small scale”赛道的ImageNet分布偏移上,我们基于文本的CLIP过滤器优于排名第一的方法,取得3.6%的性能提升;在选取前40%数据时,所提文本掩码过滤器也优于原始CLIP分数过滤器。数字对CLIP的影响及其处理方式为改进CLIP训练(包括语言改写技术)提供了有价值的启示。
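
A sketch of the number-aware filtering idea: strip digit runs from captions before computing CLIP similarities, then keep the top-scoring fraction. The `encode_text` callable and the embedding dimension are placeholders of ours, not a real CLIP API:

```python
import re
import numpy as np

def strip_numbers(caption: str) -> str:
    """Remove digit runs before scoring, so CLIP similarity is not skewed
    by uninformative numeric tokens."""
    return re.sub(r"\s*\d+[\d.,:-]*\s*", " ", caption).strip()

def filter_pairs(image_embs, captions, encode_text, keep_frac=0.4):
    # encode_text: assumed callable returning L2-normalised text embeddings
    text_embs = encode_text([strip_numbers(c) for c in captions])
    scores = np.sum(image_embs * text_embs, axis=1)      # cosine similarity
    return np.argsort(scores)[::-1][: int(keep_frac * len(scores))]

# Toy run with a random stand-in encoder
rng = np.random.default_rng(8)
fake_encode = lambda ts: rng.standard_normal((len(ts), 512))
img = rng.standard_normal((5, 512))
caps = ["dog in a park", "IMG 20230924 113245", "red car, model 3", "a cat", "42"]
print([caps[i] for i in filter_pairs(img, caps, fake_encode)])
```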

Combining Two Adversarial Attacks Against Person Re-Identification Systems

  • paper_url: http://arxiv.org/abs/2309.13763
  • repo_url: None
  • paper_authors: Eduardo de O. Andrade, Igor Garcia Ballhausen Sampaio, Joris Guérin, José Viterbo
  • for: 这个研究是针对人员识别系统(Re-ID)的安全性进行研究,尤其是运用深度神经网络来实现人员识别。
  • methods: 本研究使用了两种攻击方法:P-FGSM和Deep Mis-Ranking,并且将其应用到两个受测Re-ID模型:IDE(ResNet-50)和AlignedReID。
  • results: 研究结果显示,这些攻击方法可以对Re-ID模型造成较大的影响,其中AlignedReID在CUHK03 dataset上的 Rank-10 指数下降了3.36%。此外,研究者还尝试使用Dropout进行防护。
    Abstract The field of Person Re-Identification (Re-ID) has received much attention recently, driven by the progress of deep neural networks, especially for image classification. The problem of Re-ID consists in identifying individuals through images captured by surveillance cameras in different scenarios. Governments and companies are investing a lot of time and money in Re-ID systems for use in public safety and identifying missing persons. However, several challenges remain for successfully implementing Re-ID, such as occlusions and light reflections in people's images. In this work, we focus on adversarial attacks on Re-ID systems, which can be a critical threat to the performance of these systems. In particular, we explore the combination of adversarial attacks against Re-ID models, trying to strengthen the decrease in the classification results. We conduct our experiments on three datasets: DukeMTMC-ReID, Market-1501, and CUHK03. We combine the use of two types of adversarial attacks, P-FGSM and Deep Mis-Ranking, applied to two popular Re-ID models: IDE (ResNet-50) and AlignedReID. The best result demonstrates a decrease of 3.36% in the Rank-10 metric for AlignedReID applied to CUHK03. We also try to use Dropout during the inference as a defense method.
    摘要 人员重识别(Re-ID)领域近年来受到广泛关注,这得益于深度神经网络(尤其是图像分类)的进步。Re-ID问题的核心是通过不同场景下监控摄像头拍摄的图像识别个体。政府和企业在公共安全与寻找失踪人员方面为Re-ID系统投入了大量时间和资金。然而,成功落地Re-ID仍面临诸多挑战,例如人像中的遮挡与反光。本工作关注针对Re-ID系统的对抗攻击,它们可能严重威胁系统性能。具体而言,我们探索组合多种对抗攻击以进一步压低分类结果。实验在三个数据集上进行:DukeMTMC-ReID、Market-1501和CUHK03;我们将P-FGSM和Deep Mis-Ranking两类对抗攻击组合,应用于两种流行的Re-ID模型:IDE(ResNet-50)和AlignedReID。最佳结果显示,在CUHK03数据集上,AlignedReID的Rank-10指标下降了3.36%。我们还尝试在推理阶段使用Dropout作为防御方法。
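
For reference, the plain FGSM step that P-FGSM builds on is a one-line signed-gradient perturbation. The sketch below uses a stand-in classifier; the Re-ID-specific attacks in the paper are stronger variants:

```python
import torch
import torch.nn as nn

def fgsm_attack(model, images, labels, eps=8 / 255):
    """Standard FGSM: one signed-gradient step on the input.
    (P-FGSM and Deep Mis-Ranking are stronger, Re-ID-specific variants.)"""
    images = images.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + eps * images.grad.sign()   # step that maximises the loss
    return adv.clamp(0, 1).detach()

# Toy demo with a small stand-in classifier over 8x8 "person crops"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
x = torch.rand(4, 3, 8, 8)
y = torch.randint(0, 10, (4,))
x_adv = fgsm_attack(model, x, y)
print("max perturbation:", (x_adv - x).abs().max().item())
```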

Look Ma, no code: fine tuning nnU-Net for the AutoPET II challenge by only adjusting its JSON plans

  • paper_url: http://arxiv.org/abs/2309.13747
  • repo_url: None
  • paper_authors: Fabian Isensee, Klaus H. Maier-Hein
  • for: 提高 AutoPET II 挑战中 nnU-Net 的性能
  • methods: 仅通过修改nnU-Net的'nnUNetPlans.json'文件:改用带残差编码器的UNet,并增大batch size与patch size,以提升模型性能。
  • results: 显著优于自动配置的nnU-Net基线(五折交叉验证Dice分数65.14对33.28),代价是训练所需计算资源增加;最终提交集成了两个最有希望的配置,提交时在初步测试集上排名第一。
    Abstract We participate in the AutoPET II challenge by modifying nnU-Net only through its easy to understand and modify 'nnUNetPlans.json' file. By switching to a UNet with residual encoder, increasing the batch size and increasing the patch size we obtain a configuration that substantially outperforms the automatically configured nnU-Net baseline (5-fold cross-validation Dice score of 65.14 vs 33.28) at the expense of increased compute requirements for model training. Our final submission ensembles the two most promising configurations. At the time of submission our method ranks first on the preliminary test set.
    摘要 我们参加了AutoPET II挑战,仅通过nnU-Net中易于理解和修改的'nnUNetPlans.json'文件对其进行调整。通过改用带残差编码器的UNet、增大batch size与patch size,我们得到的配置显著优于自动配置的nnU-Net基线(五折交叉验证Dice分数65.14对33.28),代价是模型训练所需的计算资源增加。最终提交集成了两个最有希望的配置;提交时,我们的方法在初步测试集上排名第一。
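
The "JSON-only" recipe can be reproduced with a few lines of Python. The key names below follow the nnU-Net v2 plans layout and may differ between versions, so inspect your own plans file before running this:

```python
import json

# Scale up batch/patch size directly in nnU-Net's plans file; the
# residual-encoder switch is a similar edit to the architecture entry
# (key names vary across nnU-Net versions -- check your file first).
with open("nnUNetPlans.json") as f:
    plans = json.load(f)

cfg = plans["configurations"]["3d_fullres"]
cfg["batch_size"] *= 2
cfg["patch_size"] = [int(1.25 * s) for s in cfg["patch_size"]]

with open("nnUNetPlans_big.json", "w") as f:
    json.dump(plans, f, indent=2)
print("wrote enlarged plan:", cfg["batch_size"], cfg["patch_size"])
```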

DROP: Dynamics Responses from Human Motion Prior and Projective Dynamics

  • paper_url: http://arxiv.org/abs/2309.13742
  • repo_url: None
  • paper_authors: Yifeng Jiang, Jungdam Won, Yuting Ye, C. Karen Liu
  • for: 这篇论文旨在实现人类动作的生成和跟踪,以满足计算机视觉、运动和医疗等领域的需求。
  • methods: 该论文提出名为DROP的新框架,利用生成式运动先验与投影动力学(projective dynamics)建模人体运动的动力学响应。
  • results: 经过在不同运动任务与多种物理扰动下的广泛评估,DROP展现出良好的可扩展性与响应多样性。
    Abstract Synthesizing realistic human movements, dynamically responsive to the environment, is a long-standing objective in character animation, with applications in computer vision, sports, and healthcare, for motion prediction and data augmentation. Recent kinematics-based generative motion models offer impressive scalability in modeling extensive motion data, albeit without an interface to reason about and interact with physics. While simulator-in-the-loop learning approaches enable highly physically realistic behaviors, the challenges in training often affect scalability and adoption. We introduce DROP, a novel framework for modeling Dynamics Responses of humans using generative mOtion prior and Projective dynamics. DROP can be viewed as a highly stable, minimalist physics-based human simulator that interfaces with a kinematics-based generative motion prior. Utilizing projective dynamics, DROP allows flexible and simple integration of the learned motion prior as one of the projective energies, seamlessly incorporating control provided by the motion prior with Newtonian dynamics. Serving as a model-agnostic plug-in, DROP enables us to fully leverage recent advances in generative motion models for physics-based motion synthesis. We conduct extensive evaluations of our model across different motion tasks and various physical perturbations, demonstrating the scalability and diversity of responses.
    摘要 合成逼真且能对环境动态响应的人体运动,是角色动画领域的长期目标,在计算机视觉、体育和医疗等领域可用于运动预测与数据增强。近期基于运动学的生成式运动模型在建模海量运动数据方面展现出惊人的可扩展性,但缺乏与物理进行推理和交互的接口;而模拟器在环(simulator-in-the-loop)的学习方法虽能获得高度符合物理的行为,其训练难度却常常限制可扩展性与落地应用。我们提出DROP,一个利用生成式运动先验与投影动力学建模人体动力学响应的新框架。DROP可视为一个高度稳定、极简的基于物理的人体模拟器,并与基于运动学的生成式运动先验相衔接:借助投影动力学,DROP将学到的运动先验作为投影能量之一,简洁灵活地把运动先验提供的控制与牛顿动力学无缝结合。作为与模型无关的插件,DROP使我们能够充分利用生成式运动模型的最新进展进行基于物理的运动合成。我们在不同运动任务与多种物理扰动下对模型进行了广泛评估,展示了其可扩展性与响应的多样性。

MOSAIC: Multi-Object Segmented Arbitrary Stylization Using CLIP

  • paper_url: http://arxiv.org/abs/2309.13716
  • repo_url: None
  • paper_authors: Prajwal Ganugula, Y S S S Santosh Kumar, N K Sagar Reddy, Prabhath Chellingi, Avinash Thakur, Neeraj Kasera, C Shyam Anand
  • for: 这篇论文的目的是提出一种基于文本提示的多对象分割自由风格化方法,以提高风格化图像的控制精度和扩展性。
  • methods: 该方法采用基于视觉Transformer架构的文本引导分割与风格化模块,可对图像中不同对象分别进行精细的风格化控制。
  • results: 该方法能生成高质量的风格化图像,可推广到任意对象与风格,并在未见过的对象类别上保持良好的泛化能力。
    Abstract Style transfer driven by text prompts paved a new path for creatively stylizing the images without collecting an actual style image. Despite having promising results, with text-driven stylization, the user has no control over the stylization. If a user wants to create an artistic image, the user requires fine control over the stylization of various entities individually in the content image, which is not addressed by the current state-of-the-art approaches. On the other hand, diffusion style transfer methods also suffer from the same issue because the regional stylization control over the stylized output is ineffective. To address this problem, We propose a new method Multi-Object Segmented Arbitrary Stylization Using CLIP (MOSAIC), that can apply styles to different objects in the image based on the context extracted from the input prompt. Text-based segmentation and stylization modules which are based on vision transformer architecture, were used to segment and stylize the objects. Our method can extend to any arbitrary objects, styles and produce high-quality images compared to the current state of art methods. To our knowledge, this is the first attempt to perform text-guided arbitrary object-wise stylization. We demonstrate the effectiveness of our approach through qualitative and quantitative analysis, showing that it can generate visually appealing stylized images with enhanced control over stylization and the ability to generalize to unseen object classes.
    摘要 由文本提示驱动的风格迁移为图像风格化开辟了一条无需收集实际风格图像的创作新路径。尽管效果可观,文本驱动的风格化让用户无法控制风格化过程:若要创作艺术图像,用户需要对内容图像中的各个实体分别进行精细的风格化控制,而现有最先进方法并未解决这一点。另一方面,扩散式风格迁移方法同样存在此问题,因为其对风格化输出的区域级控制并不有效。为此,我们提出一种新方法:基于CLIP的多对象分割任意风格化(MOSAIC),可根据输入提示提取的上下文,对图像中不同对象施加不同风格。我们采用基于视觉Transformer架构的文本引导分割与风格化模块来分割并风格化对象。该方法可扩展到任意对象与风格,并生成质量优于现有最先进方法的图像。据我们所知,这是首个文本引导的逐对象任意风格化尝试。我们通过定性与定量分析验证了方法的有效性,表明其能生成视觉上令人愉悦的风格化图像,增强对风格化的控制,并能泛化到未见过的对象类别。

Sound-Print: Generalised Face Presentation Attack Detection using Deep Representation of Sound Echoes

  • paper_url: http://arxiv.org/abs/2309.13704
  • repo_url: None
  • paper_authors: Raghavendra Ramachandra, Jag Mohan Singh, Sushma Venkatesh
  • for: 这篇论文旨在提出一种基于声学回波的人脸呈现攻击检测(PAD)方法,依据发射信号的反射特性在智能手机上检测呈现攻击,以提升人脸识别系统的安全性。
  • methods: 方法包括对声学回波反射特性的分析与建模,并提出一种基于宽脉冲的发射信号,可在发射前对背景噪声建模,从而提高信噪比(SNR)。
  • results: 实验结果显示,所提方法能够稳健地检测不同类型的人脸呈现攻击,包括两类打印攻击、屏幕显示攻击和硅胶面具攻击。
    Abstract Facial biometrics are widely deployed in smartphone-based applications because of their usability and increased verification accuracy in unconstrained scenarios. The evolving applications of smartphone-based facial recognition have also increased Presentation Attacks (PAs), where an attacker can present a Presentation Attack Instrument (PAI) to maliciously gain access to the application. Because the materials used to generate PAI are not deterministic, the detection of unknown presentation attacks is challenging. In this paper, we present an acoustic echo-based face Presentation Attack Detection (PAD) on a smartphone in which the PAs are detected based on the reflection profiles of the transmitted signal. We propose a novel transmission signal based on the wide pulse that allows us to model the background noise before transmitting the signal and increase the Signal-to-Noise Ratio (SNR). The received signal reflections were processed to remove background noise and accurately represent reflection characteristics. The reflection profiles of the bona fide and PAs are different owing to the different reflection characteristics of the human skin and artefact materials. Extensive experiments are presented using the newly collected Acoustic Sound Echo Dataset (ASED) with 4807 samples captured from bona fide and four different types of PAIs, including print (two types), display, and silicone face-mask attacks. The obtained results indicate the robustness of the proposed method for detecting unknown face presentation attacks.
    摘要 由于可用性好且在非受限场景下验证精度不断提升,人脸生物特征已广泛部署于智能手机应用中。随着基于智能手机的人脸识别应用不断发展,呈现攻击(PA)也随之增多:攻击者可以使用呈现攻击工具(PAI)恶意获取应用访问权限。由于制作PAI的材料并不确定,检测未知呈现攻击十分困难。本文提出一种基于声学回波的智能手机人脸呈现攻击检测(PAD)方法,依据发射信号的反射特征检测PA。我们提出一种基于宽脉冲的新型发射信号,使我们能在发射前对背景噪声建模,提高信噪比(SNR);接收到的信号反射经处理去除背景噪声,从而准确刻画反射特征。由于人体皮肤与伪造材料的反射特性不同,真实人脸与PA的反射特征亦不相同。我们在新采集的声学回波数据集(ASED)上进行了大量实验,该数据集包含4807个样本,涵盖真实人脸与四类PAI(两种打印攻击、屏幕显示攻击和硅胶面具攻击)。实验结果表明,所提方法在检测未知人脸呈现攻击方面具有良好的鲁棒性。
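
A toy version of the reflection-profile computation: transmit a windowed wideband pulse, matched-filter the received signal, and read off echo delays from the envelope peaks. The two-path scene below is synthetic and the probe design is ours; the paper's transmit signal and features are more refined:

```python
import numpy as np
from scipy.signal import chirp, correlate, find_peaks, hilbert

fs = 48000
t = np.arange(int(0.001 * fs)) / fs                          # 1 ms probe
probe = chirp(t, 12000, t[-1], 20000) * np.hanning(t.size)   # wideband pulse

# Assumed two-path scene: e.g. skin echo plus a second reflecting surface
rng = np.random.default_rng(9)
rx = np.zeros(4096)
for delay, gain in [(0.0008, 1.0), (0.0016, 0.4)]:
    k = int(delay * fs)
    rx[k:k + probe.size] += gain * probe
rx += 0.02 * rng.standard_normal(rx.size)

# Reflection profile = matched-filter output envelope
prof = np.abs(hilbert(correlate(rx, probe, mode="valid")))
pk, _ = find_peaks(prof, height=0.3 * prof.max(), distance=20)
print("estimated echo delays (ms):", np.round(pk / fs * 1e3, 3))
# bona fide faces and attack instruments differ in such reflection profiles
```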

Video Adverse-Weather-Component Suppression Network via Weather Messenger and Adversarial Backpropagation

  • paper_url: http://arxiv.org/abs/2309.13700
  • repo_url: https://github.com/scott-yjyang/ViWS-Net
  • paper_authors: Yijun Yang, Angelica I. Aviles-Rivero, Huazhu Fu, Ye Liu, Weiming Wang, Lei Zhu
  • for: Restoring videos degraded by any weather condition
  • methods: Video adverse-weather-component suppression network (ViWS-Net), including a weather-agnostic video transformer encoder, long short-term temporal modeling mechanism, weather discriminator, and messenger-driven video transformer decoder
  • results: Outperforms current state-of-the-art methods in restoring videos degraded by any weather condition, on benchmark datasets and real-world weather videos.
    Abstract Although convolutional neural networks (CNNs) have been proposed to remove adverse weather conditions in single images using a single set of pre-trained weights, they fail to restore weather videos due to the absence of temporal information. Furthermore, existing methods for removing adverse weather conditions (e.g., rain, fog, and snow) from videos can only handle one type of adverse weather. In this work, we propose the first framework for restoring videos from all adverse weather conditions by developing a video adverse-weather-component suppression network (ViWS-Net). To achieve this, we first devise a weather-agnostic video transformer encoder with multiple transformer stages. Moreover, we design a long short-term temporal modeling mechanism for weather messenger to early fuse input adjacent video frames and learn weather-specific information. We further introduce a weather discriminator with gradient reversion, to maintain the weather-invariant common information and suppress the weather-specific information in pixel features, by adversarially predicting weather types. Finally, we develop a messenger-driven video transformer decoder to retrieve the residual weather-specific feature, which is spatiotemporally aggregated with hierarchical pixel features and refined to predict the clean target frame of input videos. Experimental results, on benchmark datasets and real-world weather videos, demonstrate that our ViWS-Net outperforms current state-of-the-art methods in terms of restoring videos degraded by any weather condition.
    摘要 尽管已有工作提出用一组预训练权重的卷积神经网络(CNN)去除单幅图像中的恶劣天气,但由于缺乏时间信息,它们无法复原天气视频;此外,现有的视频去恶劣天气方法(如去雨、去雾、去雪)一次只能处理一种天气。本文提出首个可复原任意恶劣天气视频的框架:视频恶劣天气成分抑制网络(ViWS-Net)。为此,我们首先设计了包含多个Transformer阶段的天气无关视频Transformer编码器;并为天气信使(weather messenger)设计了长短期时间建模机制,以早期融合相邻输入视频帧并学习天气特有信息。我们进一步引入带梯度反转的天气判别器,通过对抗地预测天气类型,在像素特征中保留与天气无关的共性信息并抑制天气特有信息。最后,我们构建了信使驱动的视频Transformer解码器,以提取残余的天气特有特征,将其与层级像素特征在时空上聚合并精炼,从而预测输入视频的干净目标帧。在基准数据集和真实天气视频上的实验结果表明,ViWS-Net在复原任意天气退化视频方面优于当前最先进方法。
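
The "gradient reversion" used by the weather discriminator is the classic DANN-style gradient reversal layer; a minimal PyTorch version:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in
    the backward pass, so the feature extractor is trained *against* the
    weather discriminator (adversarial weather-invariance)."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Sanity check: the gradient through the layer is sign-flipped and scaled
x = torch.randn(3, requires_grad=True)
grad_reverse(x, lam=0.5).sum().backward()
print(x.grad)   # all entries -0.5 instead of +1.0
```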

Causal-DFQ: Causality Guided Data-free Network Quantization

  • paper_url: http://arxiv.org/abs/2309.13682
  • repo_url: https://github.com/42shawn/causal-dfq
  • paper_authors: Yuzhang Shang, Bingxin Xu, Gaowen Liu, Ramana Kompella, Yan Yan
  • for: 这项研究旨在解决实际应用中因隐私与安全无法获得训练数据时,深度神经网络量化所面临的问题。
  • methods: 研究利用因果推理构建数据生成与模型差异消减的因果图,并提出基于因果性的无数据网络量化方法Causal-DFQ,通过逼近因果干预分布的平衡来消除对数据的依赖。
  • results: 实验结果显示,Causal-DFQ能在无需训练数据的情况下对深度神经网络进行量化压缩,并保持较高的预测性能。
    Abstract Model quantization, which aims to compress deep neural networks and accelerate inference speed, has greatly facilitated the development of cumbersome models on mobile and edge devices. There is a common assumption in quantization methods from prior works that training data is available. In practice, however, this assumption cannot always be fulfilled due to reasons of privacy and security, rendering these methods inapplicable in real-life situations. Thus, data-free network quantization has recently received significant attention in neural network compression. Causal reasoning provides an intuitive way to model causal relationships to eliminate data-driven correlations, making causality an essential component of analyzing data-free problems. However, causal formulations of data-free quantization are inadequate in the literature. To bridge this gap, we construct a causal graph to model the data generation and discrepancy reduction between the pre-trained and quantized models. Inspired by the causal understanding, we propose the Causality-guided Data-free Network Quantization method, Causal-DFQ, to eliminate the reliance on data via approaching an equilibrium of causality-driven intervened distributions. Specifically, we design a content-style-decoupled generator, synthesizing images conditioned on the relevant and irrelevant factors; then we propose a discrepancy reduction loss to align the intervened distributions of the pre-trained and quantized models. It is worth noting that our work is the first attempt towards introducing causality to data-free quantization problem. Extensive experiments demonstrate the efficacy of Causal-DFQ. The code is available at https://github.com/42Shawn/Causal-DFQ.
    摘要 模型量化旨在压缩深度神经网络并加速推理,极大推动了笨重模型在移动与边缘设备上的部署。以往的量化方法普遍假设训练数据可用,但在实践中,出于隐私与安全等原因,该假设并不总能满足,使这些方法难以应用于真实场景。因此,无数据网络量化近来在神经网络压缩领域受到高度关注。因果推理提供了一种建模因果关系、消除数据驱动相关性的直观方式,使因果性成为分析无数据问题的关键组成部分;然而,文献中尚缺乏对无数据量化的因果表述。为弥补这一空白,我们构建因果图来建模数据生成以及预训练模型与量化模型之间的差异消减。受因果理解启发,我们提出因果引导的无数据网络量化方法Causal-DFQ,通过逼近因果驱动的干预分布的平衡来消除对数据的依赖。具体而言,我们设计了内容-风格解耦的生成器,依据相关与无关因素合成图像;并提出差异消减损失,对齐预训练模型与量化模型的干预分布。值得一提的是,本工作是将因果性引入无数据量化问题的首次尝试。大量实验证明了Causal-DFQ的有效性。代码见 https://github.com/42Shawn/Causal-DFQ。

BdSpell: A YOLO-based Real-time Finger Spelling System for Bangla Sign Language

  • paper_url: http://arxiv.org/abs/2309.13676
  • repo_url: None
  • paper_authors: Naimul Haque, Meraj Serker, Tariq Bin Bashar
  • for: 提高孟加拉手语(BdSL)解释的可用性和包容性,增进孟加拉手语社区中的语言平等。
  • methods: 基于YOLOv5架构的实时手势识别系统,采用特定规则和数字类作为触发器,高效生成隐藏和复合字符,消减用户的压力。
  • results: 字符拼写耗时仅1.32秒,准确率达98%;在9147张图像上训练的YOLOv5模型取得96.4%的平均精度均值(mAP)。
    Abstract In the domain of Bangla Sign Language (BdSL) interpretation, prior approaches often imposed a burden on users, requiring them to spell words without hidden characters, which were subsequently corrected using Bangla grammar rules due to the missing classes in BdSL36 dataset. However, this method posed a challenge in accurately guessing the incorrect spelling of words. To address this limitation, we propose a novel real-time finger spelling system based on the YOLOv5 architecture. Our system employs specified rules and numerical classes as triggers to efficiently generate hidden and compound characters, eliminating the necessity for additional classes and significantly enhancing user convenience. Notably, our approach achieves character spelling in an impressive 1.32 seconds with a remarkable accuracy rate of 98\%. Furthermore, our YOLOv5 model, trained on 9147 images, demonstrates an exceptional mean Average Precision (mAP) of 96.4\%. These advancements represent a substantial progression in augmenting BdSL interpretation, promising increased inclusivity and accessibility for the linguistic minority. This innovative framework, characterized by compatibility with existing YOLO versions, stands as a transformative milestone in enhancing communication modalities and linguistic equity within the Bangla Sign Language community.
    摘要 在孟加拉手语(BdSL)解释领域,先前的方法经常对用户带来压力,需要他们在无隐藏字符的情况下寻找字符,然后根据孟加拉语法规则进行修正,由于在BdSL36数据集中缺失的类型。但这种方法难以准确地猜测错误的拼写。为解决这个限制,我们提出了一种新的实时手写系统,基于YOLOv5架构。我们的系统采用了特定的规则和数字类作为触发器,以高效地生成隐藏和复合字符,从而消除了额外的类和增加了用户的便利。特别是,我们的方法在1.32秒内完成字符拼写,并达到了98%的精度。此外,我们的YOLOv5模型,在9147张图像上训练,显示了极高的平均精度(mAP)96.4%。这些进步表明了在增强孟加拉手语解释方面的重要突破,这将为孟加拉手语社区提供更多的包容性和可用性。这种革命性的框架,具有与现有YOLO版本兼容的特点,代表了孟加拉手语解释领域的巨大进步,并将在语言平等和通信模式方面产生深远的影响。

Joint inversion of Time-Lapse Surface Gravity and Seismic Data for Monitoring of 3D CO$_2$ Plumes via Deep Learning

  • paper_url: http://arxiv.org/abs/2310.04430
  • repo_url: None
  • paper_authors: Adrian Celaya, Mauricio Araya-Polo
  • for: 预测地下CO2羽流,作为监测CO2封存部署的辅助工具。
  • methods: 基于深度学习,对时移地表重力与地震数据进行全三维联合反演,重建地下密度与速度模型。
  • results: 与仅用重力或仅用地震数据的深度学习反演模型相比,联合反演模型获得更优的密度与速度重建、更准确的分割以及更高的R方系数;这些结果表明基于深度学习的联合反演是有效的CO2封存监测工具。
    Abstract We introduce a fully 3D, deep learning-based approach for the joint inversion of time-lapse surface gravity and seismic data for reconstructing subsurface density and velocity models. The target application of this proposed inversion approach is the prediction of subsurface CO2 plumes as a complementary tool for monitoring CO2 sequestration deployments. Our joint inversion technique outperforms deep learning-based gravity-only and seismic-only inversion models, achieving improved density and velocity reconstruction, accurate segmentation, and higher R-squared coefficients. These results indicate that deep learning-based joint inversion is an effective tool for CO$_2$ storage monitoring. Future work will focus on validating our approach with larger datasets, simulations with other geological storage sites, and ultimately field data.
    摘要 我们提出一种完全三维、基于深度学习的方法,对时移地表重力与地震数据进行联合反演,以重建地下密度与速度模型。该反演方法的目标应用是预测地下CO2羽流,作为监测CO2封存部署的辅助工具。我们的联合反演技术优于仅用重力或仅用地震数据的深度学习反演模型,在密度与速度重建、分割准确性以及R方系数上均有提升,表明基于深度学习的联合反演是有效的CO$_2$封存监测工具。后续工作将着重于在更大的数据集、其他地质封存场地的模拟数据以及最终的现场数据上验证该方法。

OneSeg: Self-learning and One-shot Learning based Single-slice Annotation for 3D Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.13671
  • repo_url: None
  • paper_authors: Yixuan Wu, Bo Zheng, Jintai Chen, Danny Z. Chen, Jian Wu
  • for: 在大幅减少数据标注工作量的同时,保持有竞争力的医疗图像分割精度。
  • methods: 提出一种基于自学习与单样本学习(one-shot learning)的框架,每个3D图像仅需标注一个切片即可完成3D医疗图像分割。
  • results: 与完全监督方法相比,新方法仅需不到1%的标注数据即可达到相当的性能,并在多个分布外测试集上表现出良好的泛化能力。
    Abstract As deep learning methods continue to improve medical image segmentation performance, data annotation is still a big bottleneck due to the labor-intensive and time-consuming burden on medical experts, especially for 3D images. To significantly reduce annotation efforts while attaining competitive segmentation accuracy, we propose a self-learning and one-shot learning based framework for 3D medical image segmentation by annotating only one slice of each 3D image. Our approach takes two steps: (1) self-learning of a reconstruction network to learn semantic correspondence among 2D slices within 3D images, and (2) representative selection of single slices for one-shot manual annotation and propagating the annotated data with the well-trained reconstruction network. Extensive experiments verify that our new framework achieves comparable performance with less than 1% annotated data compared with fully supervised methods and generalizes well on several out-of-distribution testing sets.
    摘要 随着深度学习方法不断提升医疗图像分割性能,数据标注仍是一大瓶颈:标注对医疗专家而言费时费力,对3D图像尤甚。为在大幅减少标注工作量的同时获得有竞争力的分割精度,我们提出一种基于自学习与单样本学习的3D医疗图像分割框架,每个3D图像仅需标注一个切片。方法分两步:(1)自学习一个重建网络,以学习3D图像内各2D切片之间的语义对应;(2)选取具有代表性的单个切片进行一次性人工标注,并借助训练好的重建网络传播标注数据。大量实验验证,该框架仅用不到1%的标注数据即可达到与完全监督方法相当的性能,并在多个分布外测试集上具有良好的泛化能力。

Adaptation of the super resolution SOTA for Art Restoration in camera capture images

  • paper_url: http://arxiv.org/abs/2309.13655
  • repo_url: https://github.com/naagar/art_restoration_dm
  • paper_authors: Sandeep Nagar, Abhinaba Bala, Sai Amrit Patnaik
  • for: Developing an automated, computer-vision-based art restoration method that enhances and reconstructs degraded artworks while preserving their original characteristics and artifacts.
  • methods: Adapts a diffusion-model (DM) based image super-resolution state of the art and fine-tunes it for art restoration across diverse degradation types, including noise, blur, scratches, and fading.
  • results: Fine-tuning a single super-resolution model handles multiple degradation types, improving the visual quality of damaged artworks without requiring expert time. Code: https://github.com/Naagar/art_restoration_DM
    Abstract Preserving cultural heritage is of paramount importance. In the domain of art restoration, developing a computer vision model capable of effectively restoring deteriorated images of art pieces was long difficult, but the computer vision state of the art has now matured. Traditional restoration methods are often time-consuming and require extensive expertise. The aim of this work is to design an automated solution based on computer vision models that can enhance and reconstruct degraded artworks, improving their visual quality while preserving their original characteristics and artifacts. The model should handle a diverse range of deterioration types, including but not limited to noise, blur, scratches, fading, and other common forms of degradation. We adapt the current state of the art in image super-resolution, based on the Diffusion Model (DM), and fine-tune it for image art restoration. Our results show that instead of fine-tuning multiple models for different kinds of degradation, fine-tuning a single super-resolution model suffices. We train it on multiple datasets to make it robust. Code link: https://github.com/Naagar/art_restoration_DM

ILNet: Low-level Matters for Salient Infrared Small Target Detection

  • paper_url: http://arxiv.org/abs/2309.13646
  • repo_url: https://github.com/li-haoqing/ilnet
  • paper_authors: Haoqing Li, Jinfu Yang, Runshi Wang, Yifei Xu
  • for: Proposing an infrared low-level network (ILNet) for infrared small target detection that strengthens the representation of small-target features.
  • methods: A new lightweight Interactive Polarized Orthogonal Fusion (IPOF) module integrates important low-level features from shallow layers into the deep layers; Dynamic One-Dimensional Aggregation (DODA) layers dynamically adjust the aggregation of low-dimensional information to the number of input channels; and a Representative Block (RB) dynamically allocates weights between shallow and deep layers.
  • results: ILNet achieves 78.22% nIoU and 1.33e-6 Fa on NUAA-SIRST and 68.91% nIoU and 3.23e-6 Fa on IRSTD-1K, outperforming other SOTA methods, and improves further as the data volume grows.
    Abstract Infrared small target detection is a technique for finding small targets from infrared clutter background. Due to the dearth of high-level semantic information, small infrared target features are weakened in the deep layers of the CNN, which underachieves the CNN's representation ability. To address the above problem, in this paper, we propose an infrared low-level network (ILNet) that considers infrared small targets as salient areas with little semantic information. Unlike other SOTA methods, ILNet pays greater attention to low-level information instead of treating them equally. A new lightweight feature fusion module, named Interactive Polarized Orthogonal Fusion module (IPOF), is proposed, which integrates more important low-level features from the shallow layers into the deep layers. A Dynamic One-Dimensional Aggregation layers (DODA) are inserted into the IPOF, to dynamically adjust the aggregation of low dimensional information according to the number of input channels. In addition, the idea of ensemble learning is used to design a Representative Block (RB) to dynamically allocate weights for shallow and deep layers. Experimental results on the challenging NUAA-SIRST (78.22% nIoU and 1.33e-6 Fa) and IRSTD-1K (68.91% nIoU and 3.23e-6 Fa) dataset demonstrate that the proposed ILNet can get better performances than other SOTA methods. Moreover, ILNet can obtain a greater improvement with the increasement of data volume. Training code are available at https://github.com/Li-Haoqing/ILNet.
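The abstract does not define DODA precisely; the sketch below is a guess inspired by ECA-style channel attention, where a 1D convolution aggregates a per-channel descriptor and the kernel size adapts to the number of input channels. The kernel-size formula and gating are assumptions, not the paper's exact formulation.

```python
import math
import torch
import torch.nn as nn

class Dynamic1DAggregation(nn.Module):
    """ECA-style sketch of a dynamic one-dimensional aggregation layer:
    the 1D kernel size is derived from the channel count (an assumption;
    ILNet's actual DODA may differ)."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 else k + 1          # force an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                   # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))              # global descriptor per channel
        w = self.conv(w.unsqueeze(1)).squeeze(1)   # aggregate across channels
        return x * torch.sigmoid(w)[:, :, None, None]

feat = torch.randn(2, 64, 32, 32)
out = Dynamic1DAggregation(64)(feat)       # same shape, channel-reweighted
```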

Changes-Aware Transformer: Learning Generalized Changes Representation

  • paper_url: http://arxiv.org/abs/2309.13619
  • repo_url: None
  • paper_authors: Dan Wang, Licheng Jiao, Jie Chen, Shuyuan Yang, Fang Liu
  • for: Improving change detection (CD) accuracy by learning a generalized representation of diverse changes and using it to refine difference features.
  • methods: A novel Changes-Aware Transformer (CAT) refines difference features through stacked cosine cross-attention layers and self-attention layers, learning the generalized change representation directly in the difference feature space.
  • results: State-of-the-art performance on a remote sensing CD dataset and a street-scene CD dataset, with good generalization and compatibility with various backbones and existing CD methods.
    Abstract Difference features obtained by comparing the images of two periods play an indispensable role in the change detection (CD) task. However, a pair of bi-temporal images can exhibit diverse changes, which may produce a variety of difference features. Identifying changed pixels with differing difference features as the same category is thus a challenge for CD. Most current methods acquire distinctive difference features in implicit ways, such as enhancing image representations or supervision information. Nevertheless, informative image features only guarantee that object semantics are modeled; they cannot guarantee that changed pixels have similar semantics in the difference feature space and are distinct from unchanged ones. In this work, the generalized representation of various changes is learned directly in the difference feature space, and a novel Changes-Aware Transformer (CAT) for refining difference features is proposed. This generalized representation can perceive which pixels are changed and which are unchanged, and further guides the update of pixels' difference features. CAT accomplishes this refinement through stacked cosine cross-attention layers and self-attention layers. After refinement, the changed pixels in the difference feature space are closer to each other, which facilitates change detection. In addition, CAT is compatible with various backbone networks and existing CD methods. Experiments on a remote sensing CD dataset and a street scene CD dataset show that our method achieves state-of-the-art performance and has excellent generalization.
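As a minimal sketch of the cosine cross-attention idea, the block below normalizes queries and keys to unit length so attention weights come from cosine similarity, then refines difference-feature tokens against a set of learned change tokens. Dimensions, the residual connection, and the learnable temperature are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineCrossAttention(nn.Module):
    """Sketch of a cosine cross-attention layer: queries come from the
    difference features, keys/values from learned change representations."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.tau = nn.Parameter(torch.tensor(10.0))  # learnable scale

    def forward(self, diff_tokens, change_tokens):
        q = F.normalize(self.q(diff_tokens), dim=-1)    # unit-norm queries
        k = F.normalize(self.k(change_tokens), dim=-1)  # unit-norm keys
        attn = (q @ k.transpose(-2, -1)) * self.tau     # cosine similarities
        attn = attn.softmax(dim=-1)
        return diff_tokens + attn @ self.v(change_tokens)  # residual refinement

diff = torch.randn(2, 256, 64)     # (B, H*W, C) difference features
changes = torch.randn(2, 8, 64)    # learned generalized change tokens
refined = CosineCrossAttention(64)(diff, changes)
```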

VisionKG: Unleashing the Power of Visual Datasets via Knowledge Graph

  • paper_url: http://arxiv.org/abs/2309.13610
  • repo_url: None
  • paper_authors: Jicheng Yuan, Anh Le-Tuan, Manh Nguyen-Duc, Trung-Kien Tran, Manfred Hauswirth, Danh Le-Phuoc
  • for: Providing a unified resource that interlinks visual datasets across diverse sources, tasks, and taxonomies.
  • methods: Uses knowledge graphs and Semantic Web technologies to integrate, organize, and manage visual datasets in heterogeneous formats, offering simple access and querying with extensibility and scalability.
  • results: The resulting Vision Knowledge Graph (VisionKG) integrates 30 datasets and four popular CV tasks, contains 519 million RDF triples describing roughly 40 million entities, and provides various data retrieval and exploration services.
    Abstract The availability of vast amounts of visual data with heterogeneous features is a key factor for developing, testing, and benchmarking of new computer vision (CV) algorithms and architectures. Most visual datasets are created and curated for specific tasks or with limited image data distribution for very specific situations, and there is no unified approach to manage and access them across diverse sources, tasks, and taxonomies. This not only creates unnecessary overheads when building robust visual recognition systems, but also introduces biases into learning systems and limits the capabilities of data-centric AI. To address these problems, we propose the Vision Knowledge Graph (VisionKG), a novel resource that interlinks, organizes and manages visual datasets via knowledge graphs and Semantic Web technologies. It can serve as a unified framework facilitating simple access and querying of state-of-the-art visual datasets, regardless of their heterogeneous formats and taxonomies. One of the key differences between our approach and existing methods is that ours is knowledge-based rather than metadatabased. It enhances the enrichment of the semantics at both image and instance levels and offers various data retrieval and exploratory services via SPARQL. VisionKG currently contains 519 million RDF triples that describe approximately 40 million entities, and are accessible at https://vision.semkg.org and through APIs. With the integration of 30 datasets and four popular CV tasks, we demonstrate its usefulness across various scenarios when working with CV pipelines.
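Since the abstract advertises SPARQL-based retrieval, a query might look like the sketch below. The endpoint path and every predicate name (the kg: prefix, hasLabel, onImage, boundingBox) are assumptions made for illustration; the real VisionKG schema should be taken from https://vision.semkg.org.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint and schema, for illustration only.
sparql = SPARQLWrapper("https://vision.semkg.org/sparql")
sparql.setQuery("""
    PREFIX kg: <http://vision.semkg.org/onto/>
    SELECT ?image ?box WHERE {
        ?ann kg:hasLabel "person" ;
             kg:onImage ?image ;
             kg:boundingBox ?box .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["image"]["value"], row["box"]["value"])
```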

Vulnerabilities in Video Quality Assessment Models: The Challenge of Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2309.13609
  • repo_url: https://github.com/gzhu-dvl/attackvqa
  • paper_authors: Ao-Xiang Zhang, Yu Ran, Weixuan Tang, Yuan-Gen Wang
  • for: Evaluating the robustness of No-Reference Video Quality Assessment (NR-VQA) models against adversarial attacks, and proposing a patch-based random search method for black-box attacks.
  • methods: Uses CNN- and Transformer-based NR-VQA models as victims and proposes a novel Score-Reversed Boundary Loss; the just-noticeable-difference (JND) constraint is modeled as strict $L_2$ and $L_\infty$ norm restrictions.
  • results: The proposed method can effectively launch both white-box and black-box attacks in an imperceptible manner.
    Abstract No-Reference Video Quality Assessment (NR-VQA) plays an essential role in improving the viewing experience of end-users. Driven by deep learning, recent NR-VQA models based on Convolutional Neural Networks (CNNs) and Transformers have achieved outstanding performance. To build a reliable and practical assessment system, it is of great necessity to evaluate their robustness. However, such issue has received little attention in the academic community. In this paper, we make the first attempt to evaluate the robustness of NR-VQA models against adversarial attacks, and propose a patch-based random search method for black-box attack. Specifically, considering both the attack effect on quality score and the visual quality of adversarial video, the attack problem is formulated as misleading the estimated quality score under the constraint of just-noticeable difference (JND). Built upon such formulation, a novel loss function called Score-Reversed Boundary Loss is designed to push the adversarial video's estimated quality score far away from its ground-truth score towards a specific boundary, and the JND constraint is modeled as a strict $L_2$ and $L_\infty$ norm restriction. By this means, both white-box and black-box attacks can be launched in an effective and imperceptible manner. The source code is available at https://github.com/GZHU-DVL/AttackVQA.
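A minimal white-box sketch of the idea: optimize a perturbation so the estimated quality score moves toward a boundary on the opposite side of the score range from the ground truth, while an $L_\infty$ projection stands in for the paper's JND constraint. Step sizes, the budget, and the squared-error form of the loss are assumptions; the paper's exact Score-Reversed Boundary Loss and black-box patch search are not reproduced here.

```python
import torch

def score_reversed_boundary_attack(vqa_model, video, gt_score, boundary,
                                   eps=2 / 255, steps=30, alpha=0.5 / 255):
    """PGD-style sketch: push the predicted score away from gt_score
    toward an opposite `boundary` under an L_inf budget `eps`."""
    adv = video.clone().detach()
    target = torch.full_like(gt_score, boundary)   # opposite-end target score
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = (vqa_model(adv) - target).pow(2).mean()
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                # step toward boundary
            adv = video + (adv - video).clamp(-eps, eps)   # L_inf projection (JND proxy)
            adv = adv.clamp(0, 1)                          # stay a valid video
    return adv.detach()
```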

FaceAtt: Enhancing Image Captioning with Facial Attributes for Portrait Images

  • paper_url: http://arxiv.org/abs/2309.13601
  • repo_url: None
  • paper_authors: Naimul Haque, Iffat Labiba, Sadia Akter
  • for: Developing a novel attribute-focused image-captioning approach that accurately depicts facial attributes in portrait images.
  • methods: The FaceAtt model uses deep learning techniques with annotated portrait attributes as supplementary prior knowledge to improve caption quality.
  • results: Incorporating additional attribute vectors during training yields a subtle yet discernible enhancement in caption scores (BLEU, METEOR), demonstrating the value of the added attributes.
    Abstract Automated image caption generation is a critical area of research that enhances accessibility and understanding of visual content for diverse audiences. In this study, we propose the FaceAtt model, a novel approach to attribute-focused image captioning that emphasizes the accurate depiction of facial attributes within images. FaceAtt automatically detects and describes a wide range of attributes, including emotions, expressions, pointed noses, fair skin tones, hair textures, attractiveness, and approximate age ranges. Leveraging deep learning techniques, we explore the impact of different image feature extraction methods on caption quality and evaluate our model's performance using metrics such as BLEU and METEOR. Our FaceAtt model leverages annotated attributes of portraits as supplementary prior knowledge for our portrait images before captioning. This innovative addition yields a subtle yet discernible enhancement in the resulting scores, exemplifying the potency of incorporating additional attribute vectors during training. Furthermore, our research contributes to the broader discourse on ethical considerations in automated captioning. This study sets the stage for future research in refining attribute-focused captioning techniques, with a focus on enhancing linguistic coherence, addressing biases, and accommodating diverse user needs.

Multi-Dimensional Hyena for Spatial Inductive Bias

  • paper_url: http://arxiv.org/abs/2309.13600
  • repo_url: None
  • paper_authors: Itamar Zimerman, Lior Wolf
  • for: A data-efficient vision transformer that does not rely on self-attention, instead employing a novel multi-axis generalization of the recent Hyena layer.
  • methods: Proposes several alternative approaches for obtaining this generalization (the Hyena N-D layer) and analyzes their distinctions and trade-offs from both empirical and theoretical perspectives.
  • results: The Hyena N-D layer boosts ViT, Swin, and DeiT across multiple datasets; in the small-dataset regime, the Hyena-based ViT outperforms ViT variants specifically designed for small data or for injecting image-specific inductive bias into self-attention; a hybrid that uses Hyena N-D in the first layers followed by conventional attention consistently improves various vision transformer architectures.
    Abstract In recent years, Vision Transformers have attracted increasing interest from computer vision researchers. However, the advantage of these transformers over CNNs is only fully manifested when trained over a large dataset, mainly due to the reduced inductive bias towards spatial locality within the transformer's self-attention mechanism. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel generalization to multiple axes of the very recent Hyena layer. We propose several alternative approaches for obtaining this generalization and delve into their unique distinctions and considerations from both empirical and theoretical perspectives. Our empirical findings indicate that the proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT across multiple datasets. Furthermore, in the small dataset regime, our Hyena-based ViT is favorable to ViT variants from the recent literature that are specifically designed for solving the same challenge, i.e., working with small datasets or incorporating image-specific inductive bias into the self-attention mechanism. Finally, we show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.

On the Posterior Distribution in Denoising: Application to Uncertainty Quantification

  • paper_url: http://arxiv.org/abs/2309.13598
  • repo_url: https://github.com/HilaManor/GaussianDenoisingPosterior
  • paper_authors: Hila Manor, Tomer Michaeli
  • for: Applications of denoisers, from noise suppression in low-grade imaging sensors to score-based generative models built on Tweedie's formula.
  • methods: Derives a fundamental relation linking the higher-order central moments of the Gaussian-denoising posterior distribution to the higher-order derivatives of the posterior mean.
  • results: Enables fast, memory-efficient computation of the principal components of the posterior distribution for any desired image region, and approximation of the full marginal distribution along one-dimensional directions, without training or fine-tuning the denoiser.
    Abstract Denoisers play a central role in many applications, from noise suppression in low-grade imaging sensors, to empowering score-based generative models. The latter category of methods makes use of Tweedie's formula, which links the posterior mean in Gaussian denoising (i.e., the minimum MSE denoiser) with the score of the data distribution. Here, we derive a fundamental relation between the higher-order central moments of the posterior distribution, and the higher-order derivatives of the posterior mean. We harness this result for uncertainty quantification of pre-trained denoisers. Particularly, we show how to efficiently compute the principal components of the posterior distribution for any desired region of an image, as well as to approximate the full marginal distribution along those (or any other) one-dimensional directions. Our method is fast and memory efficient, as it does not explicitly compute or store the high-order moment tensors and it requires no training or fine tuning of the denoiser. Code and examples are available on the project's webpage in https://hilamanor.github.io/GaussianDenoisingPosterior/
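One way to realize this in practice: for an MMSE Gaussian denoiser, the posterior covariance is proportional to the Jacobian of the posterior mean, so its top principal component can be found by power iteration using only Jacobian-vector products. The sketch below assumes `denoiser(y)` outputs the posterior mean at noise level `sigma`; iteration counts and the variance formula's scaling are stated as assumptions.

```python
import torch
from torch.autograd.functional import jvp

def top_posterior_direction(denoiser, y, sigma, iters=20):
    """Power iteration on the denoiser's Jacobian at y to estimate the
    leading posterior principal component and its variance (sketch:
    posterior covariance ~ sigma^2 * Jacobian of the posterior mean)."""
    v = torch.randn_like(y)
    v = v / v.norm()
    for _ in range(iters):
        # Jacobian-vector product of the denoiser at y in direction v
        _, Jv = jvp(denoiser, (y,), (v,))
        v = Jv / (Jv.norm() + 1e-12)
    _, Jv = jvp(denoiser, (y,), (v,))
    eigval = (v * Jv).sum()          # Rayleigh quotient along v
    return v, sigma ** 2 * eigval    # direction, approximate posterior variance
```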

Advancements in 3D Lane Detection Using LiDAR Point Clouds: From Data Collection to Model Development

  • paper_url: http://arxiv.org/abs/2309.13596
  • repo_url: None
  • paper_authors: Runkai Zhao, Yuwen Heng, Yuanda Gao, Shilei Liu, Heng Wang, Changhao Yao, Jiawen Chen, Weidong Cai
  • for: Improving vehicle perception and decision-making for advanced driver-assistance systems (ADAS) through learning-based 3D lane detection.
  • methods: Presents LiSV-3DLane, a large-scale surround-view LiDAR lane dataset, together with a simple yet effective automatic annotation pipeline that exploits the geometric traits of lane lines to generate finer lane labels.
  • results: The proposed LiDAR-based model, LiLaDet, outperforms existing camera- and LiDAR-based approaches on the 3D lane detection task on both the K-Lane dataset and LiSV-3DLane.
    Abstract Advanced Driver-Assistance Systems (ADAS) have successfully integrated learning-based techniques into vehicle perception and decision-making. However, their application in 3D lane detection for effective driving environment perception is hindered by the lack of comprehensive LiDAR datasets. The sparse nature of LiDAR point cloud data prevents an efficient manual annotation process. To solve this problem, we present LiSV-3DLane, a large-scale 3D lane dataset that comprises 20k frames of surround-view LiDAR point clouds with enriched semantic annotation. Unlike existing datasets confined to a frontal perspective, LiSV-3DLane provides a full 360-degree spatial panorama around the ego vehicle, capturing complex lane patterns in both urban and highway environments. We leverage the geometric traits of lane lines and the intrinsic spatial attributes of LiDAR data to design a simple yet effective automatic annotation pipeline for generating finer lane labels. To propel future research, we propose a novel LiDAR-based 3D lane detection model, LiLaDet, incorporating the spatial geometry learning of the LiDAR point cloud into Bird's Eye View (BEV) based lane identification. Experimental results indicate that LiLaDet outperforms existing camera- and LiDAR-based approaches in the 3D lane detection task on the K-Lane dataset and our LiSV-3DLane.

Benchmarking Encoder-Decoder Architectures for Biplanar X-ray to 3D Shape Reconstruction

  • paper_url: http://arxiv.org/abs/2309.13587
  • repo_url: None
  • paper_authors: Mahesh Shakya, Bishesh Khanal
  • for: Benchmarking deep-learning models for biplanar X-ray to 3D bone shape reconstruction, to support model evaluation and selection for clinical applications.
  • methods: An open-source framework with reference implementations of 8 models, APIs to collect and preprocess 6 public datasets, and automatic extraction of clinical parameters and landmarks, evaluating clinically relevant tasks such as reconstructing fractured bones, bones with implants, robustness to population shift, and clinical-parameter error.
  • results: Attention-based methods that capture global spatial relationships tend to perform best across anatomies and datasets; performance on clinically relevant subgroups may be overestimated without disaggregated reporting; ribs are substantially harder to reconstruct than femur, hip, and spine; and Dice-score improvements do not always translate into better automatic estimation of clinically relevant parameters.
    Abstract Various deep learning models have been proposed for 3D bone shape reconstruction from two orthogonal (biplanar) X-ray images. However, it is unclear how these models compare against each other since they are evaluated on different anatomy, cohort and (often privately held) datasets. Moreover, the impact of the commonly optimized image-based segmentation metrics such as dice score on the estimation of clinical parameters relevant in 2D-3D bone shape reconstruction is not well known. To move closer toward clinical translation, we propose a benchmarking framework that evaluates tasks relevant to real-world clinical scenarios, including reconstruction of fractured bones, bones with implants, robustness to population shift, and error in estimating clinical parameters. Our open-source platform provides reference implementations of 8 models (many of whose implementations were not publicly available), APIs to easily collect and preprocess 6 public datasets, and the implementation of automatic clinical parameter and landmark extraction methods. We present an extensive evaluation of 8 2D-3D models on equal footing using 6 public datasets comprising images for four different anatomies. Our results show that attention-based methods that capture global spatial relationships tend to perform better across all anatomies and datasets; performance on clinically relevant subgroups may be overestimated without disaggregated reporting; ribs are substantially more difficult to reconstruct compared to femur, hip and spine; and the dice score improvement does not always bring a corresponding improvement in the automatic estimation of clinically relevant parameters.

Solving Low-Dose CT Reconstruction via GAN with Local Coherence

  • paper_url: http://arxiv.org/abs/2309.13584
  • repo_url: https://github.com/lwjie595/GANLC
  • paper_authors: Wenjie Liu
  • for: Low-dose CT reconstruction for diagnosing lesions in internal organs, where reduced radiation exposure makes low-dose CT preferable to standard-dose CT.
  • methods: A generative adversarial network (GAN) with enhanced local coherence: optical flow between adjacent slices is exploited so that sequential reconstructions remain coherent and smooth.
  • results: On real datasets, the proposed method significantly outperforms existing state-of-the-art reconstruction approaches in the precision and stability of the reconstructed images.
    Abstract The Computed Tomography (CT) for diagnosis of lesions in human internal organs is one of the most fundamental topics in medical imaging. Low-dose CT, which offers reduced radiation exposure, is preferred over standard-dose CT, and therefore its reconstruction approaches have been extensively studied. However, current low-dose CT reconstruction techniques mainly rely on model-based methods or deep-learning-based techniques, which often ignore the coherence and smoothness for sequential CT slices. To address this issue, we propose a novel approach using generative adversarial networks (GANs) with enhanced local coherence. The proposed method can capture the local coherence of adjacent images by optical flow, which yields significant improvements in the precision and stability of the constructed images. We evaluate our proposed method on real datasets and the experimental results suggest that it can outperform existing state-of-the-art reconstruction approaches significantly.
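A minimal sketch of an optical-flow-based coherence term between adjacent reconstructed slices: warp the previous slice by an estimated flow and penalize its disagreement with the current slice. The flow estimator is assumed given (e.g., a pretrained network), and the L1 penalty and tensor shapes are assumptions rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def local_coherence_loss(prev_slice, curr_slice, flow):
    """prev_slice/curr_slice: (B,1,H,W); flow: (B,2,H,W) in pixels,
    channel 0 = x displacement, channel 1 = y displacement."""
    b, _, h, w = prev_slice.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(flow.device)   # (2,H,W)
    coords = base.unsqueeze(0) + flow                              # displaced coords
    # normalize to [-1, 1] for grid_sample; grid shape (B,H,W,2)
    grid_x = 2 * coords[:, 0] / (w - 1) - 1
    grid_y = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)
    warped = F.grid_sample(prev_slice, grid, align_corners=True)
    return (warped - curr_slice).abs().mean()                      # L1 coherence
```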

A SAM-based Solution for Hierarchical Panoptic Segmentation of Crops and Weeds Competition

  • paper_url: http://arxiv.org/abs/2309.13578
  • repo_url: None
  • paper_authors: Khoa Dang Nguyen, Thanh-Hai Phung, Hoang-Giang Cao
  • for: Hierarchical panoptic segmentation of crops and weeds on the PhenoBench dataset, for the challenge hosted by the 8th Workshop on Computer Vision in Plant Phenotyping and Agriculture (CVPPA).
  • methods: Combines the Segment Anything Model (SAM) for instance segmentation with prompts from two object detectors, DINO and YOLO-v8.
  • results: The best-performing model achieved a PQ+ score of 81.33 under the competition's evaluation metrics.
    Abstract Panoptic segmentation in agriculture is an advanced computer vision technique that provides a comprehensive understanding of field composition. It facilitates various tasks such as crop and weed segmentation, plant panoptic segmentation, and leaf instance segmentation, all aimed at addressing challenges in agriculture. Exploring the application of panoptic segmentation in agriculture, the 8th Workshop on Computer Vision in Plant Phenotyping and Agriculture (CVPPA) hosted the challenge of hierarchical panoptic segmentation of crops and weeds using the PhenoBench dataset. To tackle the tasks presented in this competition, we propose an approach that combines the effectiveness of the Segment AnyThing Model (SAM) for instance segmentation with prompt input from object detection models. Specifically, we integrated two notable approaches in object detection, namely DINO and YOLO-v8. Our best-performing model achieved a PQ+ score of 81.33 based on the evaluation metrics of the competition.
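The detect-then-segment pipeline can be sketched as below: detector boxes become box prompts for SAM, yielding one instance mask per detected plant. The checkpoint paths and weight files are placeholders, and this sketch omits the PhenoBench fine-tuning and the hierarchical post-processing the competition entry would need.

```python
import numpy as np
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

detector = YOLO("yolov8x.pt")                               # placeholder weights
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
predictor = SamPredictor(sam)

def detect_then_segment(image_rgb):
    """Boxes from the detector become box prompts for SAM, producing one
    instance mask per detected crop/weed. image_rgb: HxWx3 uint8."""
    boxes = detector(image_rgb)[0].boxes.xyxy.cpu().numpy()
    predictor.set_image(image_rgb)
    masks = []
    for box in boxes:
        m, _, _ = predictor.predict(box=box.astype(np.float32),
                                    multimask_output=False)
        masks.append(m[0])                   # boolean HxW mask
    return boxes, masks
```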

Matrix Completion-Informed Deep Unfolded Equilibrium Models for Self-Supervised k-Space Interpolation in MRI

  • paper_url: http://arxiv.org/abs/2309.13571
  • repo_url: None
  • paper_authors: Chen Luo, Huayu Wang, Taofeng Xie, Qiyu Jin, Guoqing Chen, Zhuo-Xu Cui, Dong Liang
  • for: Accelerating MRI with improved image quality, without requiring fully sampled labels.
  • methods: A self-supervised deep-learning approach that retains the theoretical guarantees of regularization models: the network is regularized by exploiting the structural low-rankness of $k$-space data and constrained to be a nonexpansive mapping, so it converges to a fixed point that reconstructs the missing $k$-space data via matrix completion theory.
  • results: Outperforms existing self-supervised approaches and traditional regularization methods, achieving performance comparable to supervised learning in certain scenarios.
    Abstract Recently, regularization model-driven deep learning (DL) has gained significant attention due to its ability to leverage the potent representational capabilities of DL while retaining the theoretical guarantees of regularization models. However, most of these methods are tailored for supervised learning scenarios that necessitate fully sampled labels, which can pose challenges in practical MRI applications. To tackle this challenge, we propose a self-supervised DL approach for accelerated MRI that is theoretically guaranteed and does not rely on fully sampled labels. Specifically, we achieve neural network structure regularization by exploiting the inherent structural low-rankness of the $k$-space data. Simultaneously, we constrain the network structure to resemble a nonexpansive mapping, ensuring the network's convergence to a fixed point. Thanks to this well-defined network structure, this fixed point can completely reconstruct the missing $k$-space data based on matrix completion theory, even in situations where full-sampled labels are unavailable. Experiments validate the effectiveness of our proposed method and demonstrate its superiority over existing self-supervised approaches and traditional regularization methods, achieving performance comparable to that of supervised learning methods in certain scenarios.
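The fixed-point view can be illustrated with a few lines: iterate a learned mapping and re-impose the acquired $k$-space samples at every step. If the mapping is nonexpansive, the iteration converges, and the fixed point fills in the unsampled entries. The mapping `f_theta`, the iteration count, and the tensor shapes are assumptions.

```python
import torch

def equilibrium_kspace_interpolation(f_theta, y_sampled, mask, iters=50):
    """Sketch of fixed-point k-space interpolation with data consistency.
    y_sampled: undersampled k-space; mask: 1 where a sample was acquired."""
    x = y_sampled.clone()
    for _ in range(iters):
        x = f_theta(x)                                  # learned update
        x = mask * y_sampled + (1 - mask) * x           # keep acquired samples
    return x                                            # approximate fixed point
```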

Robust Digital-Twin Localization via An RGBD-based Transformer Network and A Comprehensive Evaluation on a Mobile Dataset

  • paper_url: http://arxiv.org/abs/2309.13570
  • repo_url: https://github.com/augcog/dttd2
  • paper_authors: Zixun Huang, Keling Yao, Seth Z. Zhao, Chuanyu Pan, Tianjian Xu, Weiyu Feng, Allen Y. Yang
  • for: Exploring digital-twin technology for 3D object tracking and localization in mobile AR, with a transformer-based 6DoF pose estimator designed for state-of-the-art accuracy on real-world noisy data.
  • methods: A transformer-based 6DoF pose estimator, validated against prior art on a new RGBD dataset, Digital Twin Tracking Dataset v2 (DTTD2), which extends DTTD1 with digital-twin data captured by the Apple iPhone 14 Pro mobile RGBD sensor suite.
  • results: Extensive experiments and in-depth analysis show the method outperforms existing baselines under significant depth-data errors.
    Abstract The potential of digital-twin technology, involving the creation of precise digital replicas of physical objects, to reshape AR experiences in 3D object tracking and localization scenarios is significant. However, enabling robust 3D object tracking in dynamic mobile AR environments remains a formidable challenge. These scenarios often require a more robust pose estimator capable of handling the inherent sensor-level measurement noise. In this paper, recognizing the challenges of comprehensive solutions in existing literature, we propose a transformer-based 6DoF pose estimator designed to achieve state-of-the-art accuracy under real-world noisy data. To systematically validate the new solution's performance against the prior art, we also introduce a novel RGBD dataset called Digital Twin Tracking Dataset v2 (DTTD2), which is focused on digital-twin object tracking scenarios. Expanded from an existing DTTD v1 (DTTD1), the new dataset adds digital-twin data captured using a cutting-edge mobile RGBD sensor suite on Apple iPhone 14 Pro, expanding the applicability of our approach to iPhone sensor data. Through extensive experimentation and in-depth analysis, we illustrate the effectiveness of our methods under significant depth data errors, surpassing the performance of existing baselines. Code and dataset are made publicly available at: https://github.com/augcog/DTTD2

Multivariate Prototype Representation for Domain-Generalized Incremental Learning

  • paper_url: http://arxiv.org/abs/2309.13563
  • repo_url: None
  • paper_authors: Can Peng, Piotr Koniusz, Kaiyu Guo, Brian C. Lovell, Peyman Moghadam
  • for: Addressing catastrophic forgetting when deep models are fine-tuned on new-class samples, compounded by domain shift between training and testing data.
  • methods: A Domain-Generalized Class-Incremental Learning (DGCIL) approach that remembers old classes, adapts to new ones, and classifies reliably on unseen domains. The loss maintains classification boundaries while suppressing per-class domain-specific information; with no stored exemplars, it uses knowledge distillation and estimates old-class prototype drift as incremental training advances. Prototypes are multivariate Normal distributions whose means and covariances continually adapt to feature drift, and pseudo-features for old classes are sampled via Cholesky decomposition, capturing richer semantic variation than strategies based solely on mean prototypes.
  • results: Experiments on several benchmarks validate the claims.
    Abstract Deep learning models suffer from catastrophic forgetting when being fine-tuned with samples of new classes. This issue becomes even more pronounced when faced with the domain shift between training and testing data. In this paper, we study the critical and less explored Domain-Generalized Class-Incremental Learning (DGCIL). We design a DGCIL approach that remembers old classes, adapts to new classes, and can classify reliably objects from unseen domains. Specifically, our loss formulation maintains classification boundaries and suppresses the domain-specific information of each class. With no old exemplars stored, we use knowledge distillation and estimate old class prototype drift as incremental training advances. Our prototype representations are based on multivariate Normal distributions whose means and covariances are constantly adapted to changing model features to represent old classes well by adapting to the feature space drift. For old classes, we sample pseudo-features from the adapted Normal distributions with the help of Cholesky decomposition. In contrast to previous pseudo-feature sampling strategies that rely solely on average mean prototypes, our method excels at capturing varying semantic information. Experiments on several benchmarks validate our claims.
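The Cholesky-based sampling step is easy to illustrate: draw pseudo-features from N(mu, cov) through the Cholesky factor so samples carry the class's covariance structure rather than only its mean. Feature dimension, sample count, and the diagonal jitter are assumptions.

```python
import torch

def sample_pseudo_features(mu, cov, n, jitter=1e-5):
    """Sketch of old-class pseudo-feature sampling from N(mu, cov).
    mu: (D,), cov: (D, D); returns (n, D) samples."""
    d = mu.numel()
    L = torch.linalg.cholesky(cov + jitter * torch.eye(d))  # cov = L @ L.T
    eps = torch.randn(n, d)
    return mu + eps @ L.T        # each row ~ N(mu, cov)

mu = torch.zeros(128)            # adapted class mean (placeholder)
cov = torch.eye(128)             # adapted class covariance (placeholder)
feats = sample_pseudo_features(mu, cov, n=32)   # 32 pseudo-features
```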

LOGICSEG: Parsing Visual Semantics with Neural Logic Learning and Reasoning

  • paper_url: http://arxiv.org/abs/2309.13556
  • repo_url: None
  • paper_authors: Liulei Li, Wenguan Wang, Yi Yang
  • for: Filling a fundamental gap in high-performance semantic segmentation models: enabling structured abstraction of the visual world and symbolic reasoning over it.
  • methods: Integrates neural inductive learning with logic reasoning over both rich data and symbolic knowledge: semantic concepts are organized as a hierarchy from which first-order logic rules are derived, relaxed via fuzzy logic, and grounded onto data and neural computational graphs; at inference, logical constraints are injected as matrix multiplications to achieve hierarchy-coherent prediction.
  • results: Extensive experiments on four datasets with various segmentation models and backbones demonstrate the effectiveness and generality of LOGICSEG.
    Abstract Current high-performance semantic segmentation models are purely data-driven sub-symbolic approaches and blind to the structured nature of the visual world. This is in stark contrast to human cognition which abstracts visual perceptions at multiple levels and conducts symbolic reasoning with such structured abstraction. To fill these fundamental gaps, we devise LOGICSEG, a holistic visual semantic parser that integrates neural inductive learning and logic reasoning with both rich data and symbolic knowledge. In particular, the semantic concepts of interest are structured as a hierarchy, from which a set of constraints are derived for describing the symbolic relations and formalized as first-order logic rules. After fuzzy logic-based continuous relaxation, logical formulae are grounded onto data and neural computational graphs, hence enabling logic-induced network training. During inference, logical constraints are packaged into an iterative process and injected into the network in a form of several matrix multiplications, so as to achieve hierarchy-coherent prediction with logic reasoning. These designs together make LOGICSEG a general and compact neural-logic machine that is readily integrated into existing segmentation models. Extensive experiments over four datasets with various segmentation models and backbones verify the effectiveness and generality of LOGICSEG. We believe this study opens a new avenue for visual semantic parsing.
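A toy example of a fuzzy-relaxed hierarchy rule: the first-order rule child(x) -> parent(x) says a pixel scored as a child class should score at least as high for its parent. The hinge-on-probabilities relaxation below is one simple choice; LOGICSEG's actual t-norm and grounding may differ.

```python
import torch
import torch.nn.functional as F

def hierarchy_rule_loss(logits_child, logits_parent, child_to_parent):
    """Penalize violations of child(x) -> parent(x): any positive margin
    p_child - p_parent is a rule violation. child_to_parent maps each
    child channel to its parent channel."""
    p_child = torch.sigmoid(logits_child)        # (B, Cc, H, W)
    p_parent = torch.sigmoid(logits_parent)      # (B, Cp, H, W)
    p_parent_of_child = p_parent[:, child_to_parent]   # gather parents, (B, Cc, H, W)
    return F.relu(p_child - p_parent_of_child).mean()

child_to_parent = torch.tensor([0, 0, 1, 1, 1])  # e.g., 5 children, 2 parents
loss = hierarchy_rule_loss(torch.randn(2, 5, 8, 8),
                           torch.randn(2, 2, 8, 8), child_to_parent)
```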

Generalized Dice Focal Loss trained 3D Residual UNet for Automated Lesion Segmentation in Whole-Body FDG PET/CT Images

  • paper_url: http://arxiv.org/abs/2309.13553
  • repo_url: https://github.com/ahxmeds/autosegnet
  • paper_authors: Shadab Ahamed, Arman Rahmim
  • for: Developing a comprehensive PET/CT lesion segmentation model for routine quantitative image analysis.
  • methods: A 3D Residual UNet trained with a Generalized Dice Focal Loss on the AutoPET challenge 2023 training dataset, developed in a 5-fold cross-validation setting with average and weighted-average ensembling.
  • results: On the preliminary test phase, the average ensemble achieved a Dice similarity coefficient (DSC) of 0.5417, a false-positive volume (FPV) of 0.8261 ml, and a false-negative volume (FNV) of 0.2538 ml; the weighted-average ensemble achieved 0.5417, 0.8186 ml, and 0.2538 ml, respectively.
    Abstract Automated segmentation of cancerous lesions in PET/CT images is a vital initial task for quantitative analysis. However, it is often challenging to train deep learning-based segmentation methods to high degree of accuracy due to the diversity of lesions in terms of their shapes, sizes, and radiotracer uptake levels. These lesions can be found in various parts of the body, often close to healthy organs that also show significant uptake. Consequently, developing a comprehensive PET/CT lesion segmentation model is a demanding endeavor for routine quantitative image analysis. In this work, we train a 3D Residual UNet using Generalized Dice Focal Loss function on the AutoPET challenge 2023 training dataset. We develop our models in a 5-fold cross-validation setting and ensemble the five models via average and weighted-average ensembling. On the preliminary test phase, the average ensemble achieved a Dice similarity coefficient (DSC), false-positive volume (FPV) and false negative volume (FNV) of 0.5417, 0.8261 ml, and 0.2538 ml, respectively, while the weighted-average ensemble achieved 0.5417, 0.8186 ml, and 0.2538 ml, respectively. Our algorithm can be accessed via this link: https://github.com/ahxmeds/autosegnet.
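MONAI ships both building blocks named in the abstract, so a minimal training objective could be wired up as below. Whether the authors used this exact implementation is not stated, and the network depth, channel counts, and loss settings here are assumptions.

```python
import torch
from monai.losses import GeneralizedDiceFocalLoss
from monai.networks.nets import UNet

# Minimal sketch of the training objective (settings are assumed, not the
# challenge model's actual configuration).
model = UNet(spatial_dims=3, in_channels=2,          # PET + CT input channels
             out_channels=2, channels=(16, 32, 64, 128),
             strides=(2, 2, 2), num_res_units=2)     # residual units
loss_fn = GeneralizedDiceFocalLoss(to_onehot_y=True, softmax=True)

x = torch.randn(1, 2, 64, 64, 64)                    # one PET/CT patch
y = torch.randint(0, 2, (1, 1, 64, 64, 64))          # lesion mask
loss = loss_fn(model(x), y)
```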

Towards Robust Robot 3D Perception in Urban Environments: The UT Campus Object Dataset

  • paper_url: http://arxiv.org/abs/2309.13549
  • repo_url: https://github.com/ut-amrl/coda-models
  • paper_authors: Arthur Zhang, Chaitanya Eranki, Christina Zhang, Ji-Hwan Park, Raymond Hong, Pranav Kalyani, Lochana Kalyanaraman, Arsh Gamare, Arnav Bagad, Maria Esteva, Joydeep Biswas
  • for: Providing a university-campus dataset for egocentric 3D perception and planning for autonomous navigation in urban environments.
  • methods: Multimodal sensing: synchronized 3D point clouds and stereo RGB video from a 128-channel 3D LiDAR and two RGB cameras, RGB-D video from an additional sensor, and a 9-DOF IMU, with extensive ground-truth annotations.
  • results: Training on CODa significantly improves 3D object detection in urban environments over existing datasets, and sensor-specific fine-tuning and pretraining on CODa further boost detection accuracy and cross-dataset performance.
    Abstract We introduce the UT Campus Object Dataset (CODa), a mobile robot egocentric perception dataset collected on the University of Texas Austin Campus. Our dataset contains 8.5 hours of multimodal sensor data: synchronized 3D point clouds and stereo RGB video from a 128-channel 3D LiDAR and two 1.25MP RGB cameras at 10 fps; RGB-D videos from an additional 0.5MP sensor at 7 fps, and a 9-DOF IMU sensor at 40 Hz. We provide 58 minutes of ground-truth annotations containing 1.3 million 3D bounding boxes with instance IDs for 53 semantic classes, 5000 frames of 3D semantic annotations for urban terrain, and pseudo-ground truth localization. We repeatedly traverse identical geographic locations for a wide range of indoor and outdoor areas, weather conditions, and times of the day. Using CODa, we empirically demonstrate that: 1) 3D object detection performance in urban settings is significantly higher when trained using CODa compared to existing datasets even when employing state-of-the-art domain adaptation approaches, 2) sensor-specific fine-tuning improves 3D object detection accuracy and 3) pretraining on CODa improves cross-dataset 3D object detection performance in urban settings compared to pretraining on AV datasets. Using our dataset and annotations, we release benchmarks for 3D object detection and 3D semantic segmentation using established metrics. In the future, the CODa benchmark will include additional tasks like unsupervised object discovery and re-identification. We publicly release CODa on the Texas Data Repository, pre-trained models, dataset development package, and interactive dataset viewer on our website at https://amrl.cs.utexas.edu/coda. We expect CODa to be a valuable dataset for research in egocentric 3D perception and planning for autonomous navigation in urban environments.

DFRD: Data-Free Robustness Distillation for Heterogeneous Federated Learning

  • paper_url: http://arxiv.org/abs/2309.13546
  • repo_url: None
  • paper_authors: Kangyang Luo, Shuai Wang, Yexuan Fu, Xiang Li, Yunshi Lan, Ming Gao
  • for: A privacy-preserving federated-learning method (DFRD) that trains a robust and effective global model under both data heterogeneity and model heterogeneity.
  • methods: A conditional generator on the server approximates the training space of the local models uploaded by clients, with its training systematically studied in terms of fidelity, transferability, and diversity; an exponential-moving-average copy of the generator counters catastrophic forgetting across communication rounds, and dynamic weighting with label sampling accurately extracts knowledge from local models.
  • results: Extensive experiments on various image classification tasks show significant performance gains over state-of-the-art baselines.
    Abstract Federated Learning (FL) is a privacy-constrained decentralized machine learning paradigm in which clients enable collaborative training without compromising private data. However, how to learn a robust global model in the data-heterogeneous and model-heterogeneous FL scenarios is challenging. To address it, we resort to data-free knowledge distillation to propose a new FL method (namely DFRD). DFRD equips a conditional generator on the server to approximate the training space of the local models uploaded by clients, and systematically investigates its training in terms of fidelity, transferability and diversity. To overcome the catastrophic forgetting of the global model caused by the distribution shifts of the generator across communication rounds, we maintain an exponential moving average copy of the generator on the server. Additionally, we propose dynamic weighting and label sampling to accurately extract knowledge from local models. Finally, our extensive experiments on various image classification tasks illustrate that DFRD achieves significant performance gains compared to SOTA baselines.
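The exponential-moving-average generator copy is a small but central piece; a sketch of the server-side update follows (the decay value is an assumption).

```python
import torch

@torch.no_grad()
def ema_update(generator, ema_generator, decay=0.999):
    """Maintain a smoothed copy of the conditional generator so that
    distribution shifts across communication rounds are damped."""
    for p, p_ema in zip(generator.parameters(), ema_generator.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

# after each round's generator update on the server:
#   ema_update(generator, ema_generator)
# distillation then queries the smoother ema_generator for synthetic inputs.
```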

Comparative Evaluation of Transfer Learning for Classification of Brain Tumor Using MRI

  • paper_url: http://arxiv.org/abs/2310.02270
  • repo_url: None
  • paper_authors: Abu Kaisar Mohammad Masum, Nusrat Badhon, S. M. Saiful Islam Badhon, Nushrat Jahan Ria, Sheikh Abujar, Muntaser Mansur Syed, Naveed Mahmud
  • for: Applying computer-assisted diagnosis, particularly machine learning and deep learning, to classify three kinds of brain tumors.
  • methods: Four transfer-learning techniques evaluated on a benchmark dataset of 3064 MRI images representing three forms of brain cancer.
  • results: ResNet-50 outperformed the other models with a remarkable accuracy of 99.06%; the study also shows how a balanced dataset improves accuracy without augmentation methods.
    Abstract Abnormal growth of cells in the brain and its surrounding tissues is known as a brain tumor. There are two types, one is benign (non-cancerous) and another is malignant (cancerous) which may cause death. The radiologists' ability to diagnose malignancies is greatly aided by magnetic resonance imaging (MRI). Brain cancer diagnosis has been considerably expedited by the field of computer-assisted diagnostics, especially in machine learning and deep learning. In our study, we categorize three different kinds of brain tumors using four transfer learning techniques. Our models were tested on a benchmark dataset of $3064$ MRI pictures representing three different forms of brain cancer. Notably, ResNet-50 outperformed other models with a remarkable accuracy of $99.06\%$. We stress the significance of a balanced dataset for improving accuracy without the use of augmentation methods. Additionally, we experimentally demonstrate our method and compare with other classification algorithms on the CE-MRI dataset using evaluations like F1-score, AUC, precision and recall.
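The transfer-learning recipe is standard enough to sketch: take an ImageNet-pretrained ResNet-50 and replace its classifier head for the three tumor classes. The optimizer, learning rate, and weight variant below are assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained backbone with a new 3-way head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 3)   # 3 tumor types

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):                 # images: (B, 3, 224, 224)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```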

Semi-Supervised Domain Generalization for Object Detection via Language-Guided Feature Alignment

  • paper_url: http://arxiv.org/abs/2309.13525
  • repo_url: https://github.com/sinamalakouti/CDDMSL
  • paper_authors: Sina Malakouti, Adriana Kovashka
  • for: Addressing semi-supervised domain generalization (DG) and domain adaptation (DA) for object detection by applying vision-language pre-training to the problem.
  • methods: Cross-Domain Descriptive Multi-Scale Learning (CDDMSL) enforces feature alignment through the language space, maximizing agreement between descriptions of an image presented with different domain-specific characteristics in the embedding space.
  • results: CDDMSL outperforms existing methods by 11.7% and 7.5% in the DG and DA settings, respectively.
    Abstract Existing domain adaptation (DA) and generalization (DG) methods in object detection enforce feature alignment in the visual space but face challenges like object appearance variability and scene complexity, which make it difficult to distinguish between objects and achieve accurate detection. In this paper, we are the first to address the problem of semi-supervised domain generalization by exploring vision-language pre-training and enforcing feature alignment through the language space. We employ a novel Cross-Domain Descriptive Multi-Scale Learning (CDDMSL) aiming to maximize the agreement between descriptions of an image presented with different domain-specific characteristics in the embedding space. CDDMSL significantly outperforms existing methods, achieving 11.7% and 7.5% improvement in DG and DA settings, respectively. Comprehensive analysis and ablation studies confirm the effectiveness of our method, positioning CDDMSL as a promising approach for domain generalization in object detection tasks.

LiDAR-UDA: Self-ensembling Through Time for Unsupervised LiDAR Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.13523
  • repo_url: None
  • paper_authors: Amirreza Shaban, JoonHo Lee, Sanghun Jung, Xiangyun Meng, Byron Boots
  • for: Unsupervised domain adaptation (UDA) for LiDAR segmentation, targeting the domain gap caused by differing LiDAR sensor configurations between source and target.
  • methods: Two techniques reduce sensor discrepancy and improve pseudo-label quality: 1) LiDAR beam subsampling, which simulates different scanning patterns by randomly dropping beams; 2) cross-frame ensembling, which exploits the temporal consistency of consecutive frames to generate more reliable pseudo labels.
  • results: Outperforms state-of-the-art methods by more than 3.9% mIoU on average across several public LiDAR datasets.
    Abstract We introduce LiDAR-UDA, a novel two-stage self-training-based Unsupervised Domain Adaptation (UDA) method for LiDAR segmentation. Existing self-training methods use a model trained on labeled source data to generate pseudo labels for target data and refine the predictions via fine-tuning the network on the pseudo labels. These methods suffer from domain shifts caused by different LiDAR sensor configurations in the source and target domains. We propose two techniques to reduce sensor discrepancy and improve pseudo label quality: 1) LiDAR beam subsampling, which simulates different LiDAR scanning patterns by randomly dropping beams; 2) cross-frame ensembling, which exploits temporal consistency of consecutive frames to generate more reliable pseudo labels. Our method is simple, generalizable, and does not incur any extra inference cost. We evaluate our method on several public LiDAR datasets and show that it outperforms the state-of-the-art methods by more than $3.9\%$ mIoU on average for all scenarios. Code will be available at https://github.com/JHLee0513/LiDARUDA.
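Beam subsampling itself is a few lines: randomly drop whole rings so a dense source scan mimics a sparser target sensor. The keep-ratio policy and the assumption that a per-point ring index is available are illustrative choices, not necessarily the paper's exact procedure.

```python
import numpy as np

def subsample_beams(points, ring_index, num_rings, keep_ratio=0.5, rng=None):
    """Randomly drop whole LiDAR beams (rings). points: (N, 4) xyzi array;
    ring_index: (N,) beam id of each point."""
    rng = rng or np.random.default_rng()
    keep = rng.choice(num_rings, size=int(keep_ratio * num_rings),
                      replace=False)
    mask = np.isin(ring_index, keep)
    return points[mask]

pts = np.random.randn(1000, 4).astype(np.float32)     # fake scan
rings = np.random.randint(0, 64, size=1000)           # fake ring ids
sparse_pts = subsample_beams(pts, rings, num_rings=64)  # ~32-beam scan
```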

InSpaceType: Reconsider Space Type in Indoor Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.13516
  • repo_url: None
  • paper_authors: Cho-Ying Wu, Quankai Gao, Chin-Cheng Hsu, Te-Lin Wu, Jing-Wen Chen, Ulrich Neumann
  • for: Examining the robustness and generalization of indoor monocular depth estimation in real-world scenarios with highly varying space types, such as library or kitchen.
  • methods: Introduces InSpaceType, a high-quality, high-resolution RGBD dataset for general indoor environments, benchmarks 11 recent methods on it, and extends the analysis to 4 other datasets, 3 mitigation approaches, and generalization to unseen space types.
  • results: Existing methods suffer severe performance imbalance across space types, revealing underlying biases: performance on some space types is much worse than aggregate test-set numbers suggest.
    Abstract Indoor monocular depth estimation has attracted increasing research interest. Most previous works have been focusing on methodology, primarily experimenting with NYU-Depth-V2 (NYUv2) Dataset, and only concentrated on the overall performance over the test set. However, little is known regarding robustness and generalization when it comes to applying monocular depth estimation methods to real-world scenarios where highly varying and diverse functional \textit{space types} are present such as library or kitchen. A study for performance breakdown into space types is essential to realize a pretrained model's performance variance. To facilitate our investigation for robustness and address limitations of previous works, we collect InSpaceType, a high-quality and high-resolution RGBD dataset for general indoor environments. We benchmark 11 recent methods on InSpaceType and find they severely suffer from performance imbalance concerning space types, which reveals their underlying bias. We extend our analysis to 4 other datasets, 3 mitigation approaches, and the ability to generalize to unseen space types. Our work marks the first in-depth investigation of performance imbalance across space types for indoor monocular depth estimation, drawing attention to potential safety concerns for model deployment without considering space types, and further shedding light on potential ways to improve robustness. See \url{https://depthcomputation.github.io/DepthPublic} for data.
    摘要 室内单目深度估计吸引了越来越多的研究兴趣。以往的工作大多聚焦于方法本身,主要在 NYU-Depth-V2(NYUv2)数据集上实验,且只关注测试集上的整体性能。然而,当把单目深度估计方法应用到包含图书馆、厨房等高度多样的功能性空间类型的真实场景时,其鲁棒性和泛化性仍知之甚少。要了解预训练模型的性能差异,就必须按空间类型对性能进行细分研究。为便于我们对鲁棒性的考察并弥补以往工作的局限,我们收集了 InSpaceType,一个面向通用室内环境的高质量、高分辨率 RGBD 数据集。我们在 InSpaceType 上对 11 种最新方法进行基准测试,发现它们在空间类型上存在严重的性能不平衡,揭示了其潜在的偏差。我们还将分析扩展到另外 4 个数据集、3 种缓解方法以及对未见空间类型的泛化能力。我们的工作是室内单目深度估计领域首个对跨空间类型性能不平衡的深入研究,提醒人们在不考虑空间类型的情况下部署模型可能带来安全隐患,并进一步指出了提升鲁棒性的可能途径。数据见 https://depthcomputation.github.io/DepthPublic。
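
The per-space-type breakdown the paper calls for is a simple grouped aggregation once every test image carries a space-type label. A sketch with made-up numbers (the `abs_rel` values are illustrative, not InSpaceType results):

```python
import pandas as pd

# Hypothetical per-image results: absolute relative depth error plus a
# space-type label, as one would log when evaluating on InSpaceType.
df = pd.DataFrame({
    "space_type": ["kitchen", "library", "kitchen", "bedroom", "library"],
    "abs_rel":    [0.12, 0.21, 0.10, 0.09, 0.25],
})

# Break the headline number down by space type to expose imbalance.
print(f"overall AbsRel: {df['abs_rel'].mean():.3f}")
print(df.groupby("space_type")["abs_rel"].agg(["mean", "count"]).sort_values("mean"))
```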

Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.13505
  • repo_url: https://github.com/xing0047/rewrite
  • paper_authors: Yun Xing, Jian Kang, Aoran Xiao, Jiahao Nie, Shao Ling, Shijian Lu
  • for: 增强语言监督下的 semantic segmentation 能力,使图像可以通过文本描述进行空间定位。
  • methods: 利用 CLIP 补全缺失的语义,建立一个概念库,并通过群集引导采样选择相关概念,将其送入预训练。
  • results: 在 8 个 segmentation benchmark 上进行了广泛的实验,表明 CoCu 能够弥合视觉与文本之间的语义鸿沟,大幅提升语言监督 semantic segmentation 的性能。
    Abstract Vision-Language Pre-training has demonstrated its remarkable zero-shot recognition ability and potential to learn generalizable visual representations from language supervision. Taking a step ahead, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, the state-of-the-art suffers from clear semantic gaps between visual and textual modality: plenty of visual concepts appeared in images are missing in their paired captions. Such semantic misalignment circulates in pre-training, leading to inferior zero-shot performance in dense predictions due to insufficient visual concepts captured in textual representations. To close such semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually-matched concepts with our proposed vision-driven expansion and text-to-vision-guided ranking. Relevant concepts can thus be identified via cluster-guided sampling and fed into pre-training, thereby bridging the gap between visual and textual semantics. Extensive experiments over a broad suite of 8 segmentation benchmarks show that CoCu achieves superb zero-shot transfer performance and greatly boosts language-supervised segmentation baseline by a large margin, suggesting the value of bridging semantic gap in pre-training data.
    摘要 视觉-语言预训练展示了卓越的零样本识别能力以及从语言监督中学习可泛化视觉表示的潜力。更进一步,语言监督的语义分割仅凭图文对学习像素分组,即可实现文本输入的空间定位。然而,当前最先进的方法受困于视觉与文本模态之间明显的语义鸿沟:图像中出现的大量视觉概念并未出现在其配对的文本描述中。这种语义错位在预训练中不断累积,导致文本表示捕捉到的视觉概念不足,从而在稠密预测上产生较差的零样本性能。为弥合这一语义鸿沟,我们提出了 Concept Curation(CoCu)管线,利用 CLIP 来补全缺失的语义。对每个图文对,我们建立一个概念库,借助所提出的视觉驱动扩展与文本到视觉引导的排序,保存潜在与图像匹配的概念。随后通过群集引导采样识别相关概念并送入预训练,从而弥合视觉与文本语义之间的差距。在 8 个分割基准上的大量实验表明,CoCu 取得了出色的零样本迁移性能,并大幅提升了语言监督分割基线,印证了在预训练数据中弥合语义鸿沟的价值。
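
The text-to-vision-guided ranking step can be sketched as cosine-similarity ranking between an image embedding and candidate concept text embeddings. The random vectors below stand in for CLIP outputs; this is an illustration of the idea, not CoCu's actual pipeline code.

```python
import numpy as np

def rank_concepts(image_emb, concept_embs, concept_names, top_k=5):
    """Rank candidate concepts for one image by cosine similarity.

    image_emb:    (D,) image embedding (e.g., from CLIP's image tower).
    concept_embs: (K, D) text embeddings of candidate concepts.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    sims = txt @ img
    order = np.argsort(-sims)[:top_k]
    return [(concept_names[i], float(sims[i])) for i in order]

rng = np.random.default_rng(0)
img = rng.normal(size=512)                      # stand-in image embedding
concepts = rng.normal(size=(1000, 512))         # stand-in concept archive
names = [f"concept_{i}" for i in range(1000)]
print(rank_concepts(img, concepts, names))
```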

cs.AI - 2023-09-24

GHN-QAT: Training Graph Hypernetworks to Predict Quantization-Robust Parameters of Unseen Limited Precision Neural Networks

  • paper_url: http://arxiv.org/abs/2309.13773
  • repo_url: None
  • paper_authors: Stone Yun, Alexander Wong
  • for: 这篇论文研究图超网络(GHN)如何以远低于迭代优化的成本,预测未见过的低精度 CNN 架构的量化鲁棒参数。
  • methods: 这篇论文在量化感知训练等基于量化的训练策略下训练 GHN,并考察其预测参数在低精度 CNN 量化后的鲁棒性与性能。
  • results: 结果表明,量化感知训练可以显著提高 GHN 为 4 位量化 CNN 预测参数的量化精度,在 2 位量化 CNN 上甚至可以达到高于随机水平的精度。
    Abstract Graph Hypernetworks (GHN) can predict the parameters of varying unseen CNN architectures with surprisingly good accuracy at a fraction of the cost of iterative optimization. Following these successes, preliminary research has explored the use of GHNs to predict quantization-robust parameters for 8-bit and 4-bit quantized CNNs. However, this early work leveraged full-precision float32 training and only quantized for testing. We explore the impact of quantization-aware training and/or other quantization-based training strategies on quantized robustness and performance of GHN predicted parameters for low-precision CNNs. We show that quantization-aware training can significantly improve quantized accuracy for GHN predicted parameters of 4-bit quantized CNNs and even lead to greater-than-random accuracy for 2-bit quantized CNNs. These promising results open the door for future explorations such as investigating the use of GHN predicted parameters as initialization for further quantized training of individual CNNs, further exploration of "extreme bitwidth" quantization, and mixed precision quantization schemes.
    摘要 图超网络(GHN)能够以远低于迭代优化的成本,以出人意料的精度预测各种未见过的 CNN 架构的参数。在这些成功之后,初步研究探索了使用 GHN 为 8 位和 4 位量化 CNN 预测量化鲁棒参数。然而,这些早期工作采用全精度 float32 训练,仅在测试时量化。我们研究了量化感知训练以及其他基于量化的训练策略,对 GHN 预测参数在低精度 CNN 上的量化鲁棒性与性能的影响。我们表明,量化感知训练可以显著提高 GHN 为 4 位量化 CNN 预测参数的量化精度,并使 2 位量化 CNN 达到高于随机水平的精度。这些令人鼓舞的结果为未来的探索打开了大门,例如将 GHN 预测参数用作单个 CNN 进一步量化训练的初始化、进一步探索"极端位宽"量化,以及混合精度量化方案。
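
Quantization-aware training rests on fake quantization with a straight-through estimator, so gradients can still reach the full-precision (here, GHN-predicted) weights. A minimal PyTorch sketch of a uniform symmetric quantizer; the paper's exact quantization scheme may differ.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform symmetric fake quantization with a straight-through estimator.

    Forward pass: quantize-dequantize w to `bits`-bit levels.
    Backward pass: identity, so gradients flow to full-precision weights.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    # Straight-through: forward uses w_q, backward sees d(w)/d(w) = 1.
    return w + (w_q - w).detach()

w = torch.randn(8, 8, requires_grad=True)
loss = fake_quantize(w, bits=4).pow(2).sum()
loss.backward()
print(w.grad.shape)  # gradients reach the full-precision parameters
```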

Deep Learning-Based Connector Detection for Robotized Assembly of Automotive Wire Harnesses

  • paper_url: http://arxiv.org/abs/2309.13746
  • repo_url: None
  • paper_authors: Hao Wang, Björn Johansson
  • for: 本研究旨在通过深度学习方法检测汽车线束中的连接器,以保障机器人化汽车线束装配的质量。
  • methods: 本研究分别训练并评估了一个两阶段和一个一阶段目标检测模型,并为此构建了汽车线束连接器数据集。
  • results: 实验结果表明,基于深度学习的方法可以有效检测汽车线束中的连接器,但受连接器外观设计的限制。
    Abstract The shift towards electrification and autonomous driving in the automotive industry results in more and more automotive wire harnesses being installed in modern automobiles, which stresses the great significance of guaranteeing the quality of automotive wire harness assembly. The mating of connectors is essential in the final assembly of automotive wire harnesses due to the importance of connectors on wire harness connection and signal transmission. However, the current manual operation of mating connectors leads to severe problems regarding assembly quality and ergonomics, where the robotized assembly has been considered, and different vision-based solutions have been proposed to facilitate a better perception of the robot control system on connectors. Nonetheless, there has been a lack of deep learning-based solutions for detecting automotive wire harness connectors in previous literature. This paper presents a deep learning-based connector detection for robotized automotive wire harness assembly. A dataset of twenty automotive wire harness connectors was created to train and evaluate a two-stage and a one-stage object detection model, respectively. The experiment results indicate the effectiveness of deep learning-based connector detection for automotive wire harness assembly but are limited by the design of the exteriors of connectors.
    摘要 随着汽车行业向电气化和自动驾驶转型,现代汽车中安装的线束越来越多,这凸显了保障汽车线束装配质量的重大意义。由于连接器对线束连接与信号传输至关重要,连接器的对插是汽车线束总装中的关键环节。然而,目前连接器对插仍靠人工操作,在装配质量和人因工程方面都存在严重问题,因此机器人化装配受到了关注,并且已有多种基于视觉的方案被提出,以帮助机器人控制系统更好地感知连接器。然而,以往文献中一直缺乏基于深度学习的汽车线束连接器检测方案。本文提出了一种面向机器人化汽车线束装配的深度学习连接器检测方法。我们构建了包含二十种汽车线束连接器的数据集,分别用于训练和评估一个两阶段和一个一阶段目标检测模型。实验结果表明,基于深度学习的连接器检测对汽车线束装配是有效的,但受连接器外观设计的限制。
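
A two-stage detector of the kind compared in the paper can be prototyped with an off-the-shelf torchvision Faster R-CNN. The pretrained COCO weights and random image below are stand-ins, since the authors' connector dataset and trained models are not public here.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# A generic two-stage detector baseline (not the authors' trained model).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # stand-in for a wire-harness photo, [0, 1]
with torch.no_grad():
    out = model([image])[0]

keep = out["scores"] > 0.5       # simple confidence threshold
print(out["boxes"][keep], out["labels"][keep])
```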

Computer Vision Technology for Robotized Wire Harness Assembly

  • paper_url: http://arxiv.org/abs/2309.13745
  • repo_url: None
  • paper_authors: Hao Wang, Omkar Salunkhe, Walter Quadrini, Dan Lämkull, Fredrik Ore, Björn Johansson, Johan Stahre
  • for: 本研究旨在满足现代汽车电子系统的需求,提升汽车线束装配的质量、效率与人因工程水平。
  • methods: 本研究综述了用于机器人化线束装配的计算机视觉技术,这类技术使机器人能够更好地感知与操作柔性线束,从而推动在真实生产环境中实现自动化装配。
  • results: 本研究发现,计算机视觉技术有助于机器人识别和操作线束,提高自动化装配的精度与效率;但仍存在若干研究空白,需要进一步研究才能在真实生产环境中实现更实用的机器人化装配。
    Abstract Wire harnesses are essential hardware for electronic systems in modern automotive vehicles. With a shift in the automotive industry towards electrification and autonomous driving, more and more automotive electronics are responsible for energy transmission and safety-critical functions such as maneuvering, driver assistance, and safety system. This paradigm shift places more demand on automotive wiring harnesses from the safety perspective and stresses the greater importance of high-quality wire harness assembly in vehicles. However, most of the current operations of wire harness assembly are still performed manually by skilled workers, and some of the manual processes are problematic from different perspectives, such as quality control and ergonomics. There is also a persistent demand in the industry to increase competitiveness and gain market share. Hence, assuring assembly quality while improving ergonomics and optimizing labor costs is desired. Robotized assembly, accomplished by robots or in human-robot collaboration, is a key enabler for fulfilling the increasingly demanding quality and safety as it enables more replicable, transparent, and comprehensible processes than completely manual operations. However, robotized assembly of wire harnesses is challenging in real environments due to the flexibility of the deformable objects, though many preliminary automation solutions have been proposed under simplified industrial configurations. Previous research efforts have proposed the use of computer vision technology to facilitate robotized automation of wire harness assembly, enabling the robots to better perceive and manipulate the flexible wire harness. This article presents an overview on computer vision technology proposed for robotized wire harness assembly and derives research gaps that require further study to facilitate a more practical robotized assembly of wire harness.
    摘要 线束是现代汽车电子系统必不可少的硬件。随着汽车工业向电气化和自动驾驶转变,越来越多的汽车电子器件承担着能量传输以及车辆操控、驾驶辅助、安全系统等安全关键功能。这一范式转变从安全角度对汽车线束提出了更高要求,也凸显了车内高质量线束装配的更大意义。然而,目前大多数线束装配操作仍由熟练工人手工完成,其中一些人工环节在质量控制和人因工程等方面存在问题。行业也持续期望提升竞争力、扩大市场份额。因此,在保证装配质量的同时改善人因工程并优化人力成本是人们所期待的。由机器人完成或人机协作完成的机器人化装配,能够带来比纯手工操作更可复现、更透明、更可理解的过程,是满足日益严苛的质量与安全要求的关键。然而,由于线束这类可变形物体的柔性,机器人化线束装配在真实环境中颇具挑战,尽管在简化的工业配置下已有许多初步的自动化方案。以往的研究提出利用计算机视觉技术来促进机器人化线束装配的自动化,使机器人能够更好地感知和操作柔性线束。本文综述了为机器人化线束装配提出的计算机视觉技术,并归纳了尚需进一步研究的空白,以推动更实用的机器人化线束装配。

A Systematic Literature Review of Computer Vision Applications in Robotized Wire Harness Assembly

  • paper_url: http://arxiv.org/abs/2309.13744
  • repo_url: None
  • paper_authors: Hao Wang, Omkar Salunkhe, Walter Quadrini, Björn Johansson, Dan Lämkull, Fredrik Ore, Mélanie Despeisse, Luca Fumagalli, Johan Stahre
  • for: 这篇论文综述了计算机视觉技术在机器人化线束装配中的应用,从现有研究中归纳挑战,并指出未来研究的机遇,以促进更实用的机器人化线束装配。
  • methods: 该论文采用系统性文献综述方法,检索了目前关于计算机视觉在机器人化线束装配中应用的研究。
  • results: 该论文总结了现有研究中的挑战和未来研究的机遇,以促进更实用的机器人化线束装配。
    Abstract This article presents a systematic literature review on computer vision applications that have been proposed for robotized wire harness assembly, derives challenges from existing studies, and identifies opportunities for future research to promote a more practical robotized assembly of wire harnesses.
    摘要 本文对已提出用于机器人化线束装配的计算机视觉应用进行了系统性文献综述,从现有研究中归纳挑战,并指出未来研究的机遇,以促进更实用的机器人化线束装配。

Use of Large Language Models for Stance Classification

  • paper_url: http://arxiv.org/abs/2309.13734
  • repo_url: None
  • paper_authors: Iain J. Cruickshank, Lynnette Hui Xian Ng
  • for: 本研究旨在探讨大型自然语言模型(LLM)在立场分类任务中的表现,以减少人工标注的使用。
  • methods: 我们考察四种不同的提示方案与 LLM 的组合,并将其精度与人工立场判定进行对照。
  • results: 我们发现,虽然 LLM 可以匹配甚至超越监督模型的结果,但其总体精度并不明显更优,表明 LLM 在立场分类上仍有改进空间。不过,借助 LLM 可以实现无监督的立场检测,从而降低人工标注的需求,并拓宽跨语言的应用范围。
    Abstract Stance detection, the task of predicting an author's viewpoint towards a subject of interest, has long been a focal point of research. Current stance detection methods predominantly rely on manual annotation of sentences, followed by training a supervised machine learning model. This manual annotation process, however, imposes limitations on the model's ability to fully comprehend the stances in the sentence and hampers its potential to generalize across different contexts. In this study, we investigate the use of Large Language Models (LLMs) for the task of stance classification, with an absolute minimum use of human labels. We scrutinize four distinct types of prompting schemes combined with LLMs, comparing their accuracies with manual stance determination. Our study reveals that while LLMs can match or sometimes even exceed the benchmark results in each dataset, their overall accuracy is not definitively better than what can be produced by supervised models. This suggests potential areas for improvement in the stance classification for LLMs. The application of LLMs, however, opens up promising avenues for unsupervised stance detection, thereby curtailing the need for manual collection and annotation of stances. This not only streamlines the process but also paves the way for expanding stance detection capabilities across languages. Through this paper, we shed light on the stance classification abilities of LLMs, thereby contributing valuable insights that can guide future advancements in this domain.
    摘要 立场检测,即预测作者对特定主题的观点,长期以来都是研究的焦点。当前的立场检测方法主要依靠人工标注句子,再训练有监督的机器学习模型。然而,这一人工标注过程限制了模型对句中立场的全面理解,也妨碍其在不同语境间的泛化能力。在本研究中,我们在尽可能少使用人工标签的前提下,考察了大语言模型(LLM)在立场分类任务中的应用。我们细致比较了四种不同的提示方案与 LLM 的组合,并将其精度与人工立场判定进行对照。研究表明,虽然 LLM 在各数据集上可以匹配甚至超越基准结果,但其总体精度并不明显优于有监督模型,这提示了 LLM 立场分类仍有改进空间。不过,LLM 的应用为无监督立场检测开辟了有前景的途径,从而减少人工收集与标注立场的需求。这不仅简化了流程,也为跨语言扩展立场检测能力铺平了道路。通过本文,我们阐明了 LLM 的立场分类能力,为该领域未来的进展提供了有价值的见解。
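
A prompting scheme in this setting is essentially a template wrapped around the text and the stance target. A sketch of two illustrative schemes (the paper's four schemes and their exact wording are not reproduced here):

```python
def build_stance_prompt(scheme: str, text: str, target: str) -> str:
    """Compose a stance-classification prompt under one of several schemes.

    The scheme names and templates here are illustrative, not the paper's.
    """
    if scheme == "zero-shot":
        return (f"What is the stance of the following text toward "
                f'"{target}"? Answer with FAVOR, AGAINST, or NONE.\n\n{text}')
    if scheme == "chain-of-thought":
        return (f"Think step by step about the author's attitude toward "
                f'"{target}", then answer FAVOR, AGAINST, or NONE.\n\n{text}')
    raise ValueError(f"unknown scheme: {scheme}")

print(build_stance_prompt("zero-shot",
                          "Atheism is a lack of belief, not a religion.",
                          "atheism"))
```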

Arabic Sentiment Analysis with Noisy Deep Explainable Model

  • paper_url: http://arxiv.org/abs/2309.13731
  • repo_url: None
  • paper_authors: Md. Atabuzzaman, Md Shajalal, Maksuda Bilkis Baby, Alexander Boden
  • for: 本研究旨在提出一种可解释的情感分类框架,以解决现有的阿拉伯语情感分类模型中的黑盒问题。
  • methods: 该框架基于加入噪声层的Bi-Directional Long Short-Term Memory(BiLSTM)和Convolutional Neural Networks(CNN)-BiLSTM模型,可以解释特定预测的原因。
  • results: 实验结果表明,在公开的阿拉伯语情感分析基准数据集上,加入噪声层可以减少过拟合,从而提升阿拉伯语情感分类的性能,且我们的方法优于一些已知的最先进方法。此外,引入的噪声层可解释性使模型更加透明和可问责,有助于在实践中采用 AI 系统。
    Abstract Sentiment Analysis (SA) is an indispensable task for many real-world applications. Compared to limited resourced languages (i.e., Arabic, Bengali), most of the research on SA are conducted for high resourced languages (i.e., English, Chinese). Moreover, the reasons behind any prediction of the Arabic sentiment analysis methods exploiting advanced artificial intelligence (AI)-based approaches are like black-box - quite difficult to understand. This paper proposes an explainable sentiment classification framework for the Arabic language by introducing a noise layer on Bi-Directional Long Short-Term Memory (BiLSTM) and Convolutional Neural Networks (CNN)-BiLSTM models that overcome over-fitting problem. The proposed framework can explain specific predictions by training a local surrogate explainable model to understand why a particular sentiment (positive or negative) is being predicted. We carried out experiments on public benchmark Arabic SA datasets. The results concluded that adding noise layers improves the performance in sentiment analysis for the Arabic language by reducing overfitting and our method outperformed some known state-of-the-art methods. In addition, the introduced explainability with noise layer could make the model more transparent and accountable and hence help adopting AI-enabled system in practice.
    摘要 情感分析(SA)是许多实际应用中不可或缺的任务。与资源有限的语言(如阿拉伯语、孟加拉语)相比,大多数 SA 研究针对的是资源丰富的语言(如英语、汉语)。此外,利用先进人工智能(AI)方法的阿拉伯语情感分析模型,其预测背后的原因如同黑盒,难以理解。本文提出了一个可解释的阿拉伯语情感分类框架,在双向长短期记忆网络(BiLSTM)和卷积神经网络(CNN)-BiLSTM 模型中引入噪声层,以克服过拟合问题。该框架通过训练一个局部代理可解释模型来解释特定预测,理解为何预测出某种(正面或负面)情感。我们在公开的阿拉伯语 SA 基准数据集上进行了实验。结果表明,加入噪声层可以减少过拟合,从而提升阿拉伯语情感分析的性能,且我们的方法优于一些已知的最先进方法。此外,引入的噪声层可解释性可使模型更加透明和可问责,从而有助于在实践中采用 AI 系统。
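
The noise layer amounts to injecting Gaussian noise into intermediate representations during training only, acting as a regularizer. A minimal PyTorch sketch of a BiLSTM sentiment classifier with such a layer; all hyperparameters are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class NoisyBiLSTMClassifier(nn.Module):
    """Sketch of a BiLSTM sentiment classifier with an input noise layer."""

    def __init__(self, vocab=30000, dim=128, hidden=128, noise_std=0.1):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.noise_std = noise_std
        self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)  # positive / negative

    def forward(self, token_ids):
        x = self.emb(token_ids)
        if self.training:  # noise layer: active only during training
            x = x + torch.randn_like(x) * self.noise_std
        _, (h, _) = self.lstm(x)
        h = torch.cat([h[0], h[1]], dim=-1)  # forward + backward final states
        return self.head(h)

model = NoisyBiLSTMClassifier()
logits = model(torch.randint(0, 30000, (4, 50)))
print(logits.shape)  # (4, 2)
```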

Towards using Cough for Respiratory Disease Diagnosis by leveraging Artificial Intelligence: A Survey

  • paper_url: http://arxiv.org/abs/2309.14383
  • repo_url: None
  • paper_authors: Aneeqa Ijaz, Muhammad Nabeel, Usama Masood, Tahir Mahmood, Mydah Sajid Hashmi, Iryna Posokhova, Ali Rizwan, Ali Imran
  • for: 本文面向医学专家与 AI 科学家,分析 AI/ML 在基于咳嗽声学检测和诊断呼吸疾病中的决定性作用。
  • methods: 本文对基于咳嗽的 AI 算法文献进行全面综述,以展示 AI/ML 在检测特定呼吸疾病发病中的重要性;同时研究咳嗽的产生机制与各呼吸模态的潜在咳嗽特征,并分析定制化咳嗽监测应用及其 AI 驱动的识别算法。
  • results: 本文给出了咳嗽数据驱动的 ML/DL 检测与初步诊断框架中的重要特征清单,并讨论了为开发实用、鲁棒、泛在的呼吸疾病预测方案所面临的挑战与未来研究方向。
    Abstract Cough acoustics contain multitudes of vital information about pathomorphological alterations in the respiratory system. Reliable and accurate detection of cough events by investigating the underlying cough latent features and disease diagnosis can play an indispensable role in revitalizing the healthcare practices. The recent application of Artificial Intelligence (AI) and advances of ubiquitous computing for respiratory disease prediction has created an auspicious trend and myriad of future possibilities in the medical domain. In particular, there is an expeditiously emerging trend of Machine learning (ML) and Deep Learning (DL)-based diagnostic algorithms exploiting cough signatures. The enormous body of literature on cough-based AI algorithms demonstrate that these models can play a significant role for detecting the onset of a specific respiratory disease. However, it is pertinent to collect the information from all relevant studies in an exhaustive manner for the medical experts and AI scientists to analyze the decisive role of AI/ML. This survey offers a comprehensive overview of the cough data-driven ML/DL detection and preliminary diagnosis frameworks, along with a detailed list of significant features. We investigate the mechanism that causes cough and the latent cough features of the respiratory modalities. We also analyze the customized cough monitoring application, and their AI-powered recognition algorithms. Challenges and prospective future research directions to develop practical, robust, and ubiquitous solutions are also discussed in detail.
    摘要 咳嗽声学蕴含着大量关于呼吸系统病理形态改变的重要信息。通过挖掘咳嗽的潜在特征来可靠、准确地检测咳嗽事件并进行疾病诊断,可在革新医疗实践中发挥不可或缺的作用。近来人工智能(AI)的应用与泛在计算在呼吸疾病预测方面的进展,在医学领域形成了令人鼓舞的趋势和众多未来可能性。尤其是,利用咳嗽特征的机器学习(ML)与深度学习(DL)诊断算法正在迅速兴起。大量关于基于咳嗽的 AI 算法的文献表明,这些模型能够在检测特定呼吸疾病发病方面发挥重要作用。然而,有必要详尽地汇集所有相关研究的信息,以便医学专家和 AI 科学家分析 AI/ML 的决定性作用。本综述全面概述了咳嗽数据驱动的 ML/DL 检测与初步诊断框架,并给出了重要特征的详细清单。我们研究了咳嗽的产生机制以及各呼吸模态的潜在咳嗽特征,还分析了定制化咳嗽监测应用及其 AI 驱动的识别算法。文中还详细讨论了为开发实用、鲁棒、泛在的解决方案所面临的挑战与未来研究方向。

Agree To Disagree

  • paper_url: http://arxiv.org/abs/2309.14382
  • repo_url: https://github.com/mpagli/Agree-to-Disagree
  • paper_authors: Abhinav Raghuvanshi, Siddhesh Pawar, Anirudh Mittal
  • for: 这篇论文是为了提供一种自动解析和概括长文档中重要信息的机器学习方法。
  • methods: 该方法使用机器学习算法对长文档进行自动解析和概括,以提供用户友好的摘要。
  • results: 该方法可以帮助用户快速理解长文档中的重要信息,从而减少用户对各种服务协议和软件使用协议的审核时间。
    Abstract How frequently do individuals thoroughly review terms and conditions before proceeding to register for a service, install software, or access a website? The majority of internet users do not engage in this practice. This trend is not surprising, given that terms and conditions typically consist of lengthy documents replete with intricate legal terminology and convoluted sentences. In this paper, we introduce a Machine Learning-powered approach designed to automatically parse and summarize critical information in a user-friendly manner. This technology focuses on distilling the pertinent details that users should contemplate before committing to an agreement.
    摘要 有多少人会在注册服务、安装软件或访问网站之前仔细阅读条款与条件?绝大多数互联网用户并不会这样做。这种现象并不令人意外,因为条款与条件通常是冗长的文件,充斥着复杂的法律术语和晦涩的句子。在本文中,我们介绍了一种由机器学习驱动的方法,能够自动解析并以用户友好的方式概括关键信息。该技术专注于提炼用户在同意协议之前应当斟酌的要点。

ORLA*: Mobile Manipulator-Based Object Rearrangement with Lazy A*

  • paper_url: http://arxiv.org/abs/2309.13707
  • repo_url: https://github.com/gaokai15/ORLA-Star
  • paper_authors: Kai Gao, Yan Ding, Shiqi Zhang, Jingjin Yu
  • for: 本论文主要研究移动机械臂的物体重排问题(例如摆放餐桌或整理书桌),即如何确定合适的物体操作顺序,在考虑实现操作所需运动的同时,有效解开物体之间的依赖关系。
  • methods: 本论文提出了 ORLA* 算法,利用延迟(lazy)评估来搜索高质量的物体抓取与放置顺序,同时考虑末端执行器和移动底座的运动;ORLA* 还借助机器学习保证物体堆叠的稳定性,从而支持多层重排任务。
  • results: 通过大量仿真实验和消融研究,作者验证了 ORLA* 的有效性:借助最优求解器确定物体的临时放置位置,ORLA* 能够达到全局最优,为具有挑战性的重排任务给出高质量的解。
    Abstract Effectively performing object rearrangement is an essential skill for mobile manipulators, e.g., setting up a dinner table or organizing a desk. A key challenge in such problems is deciding an appropriate manipulation order for objects to effectively untangle dependencies between objects while considering the necessary motions for realizing the manipulations (e.g., pick and place). To our knowledge, computing time-optimal multi-object rearrangement solutions for mobile manipulators remains a largely untapped research direction. In this research, we propose ORLA*, which leverages delayed (lazy) evaluation in searching for a high-quality object pick and place sequence that considers both end-effector and mobile robot base travel. ORLA* also supports multi-layered rearrangement tasks considering pile stability using machine learning. Employing an optimal solver for finding temporary locations for displacing objects, ORLA* can achieve global optimality. Through extensive simulation and ablation study, we confirm the effectiveness of ORLA* delivering quality solutions for challenging rearrangement instances. Supplementary materials are available at: https://gaokai15.github.io/ORLA-Star/
    摘要 高效地完成物体重排是移动机械臂的一项关键技能,例如摆放餐桌或整理书桌。这类问题的核心挑战在于确定合适的物体操作顺序,在考虑实现操作(如抓取与放置)所需运动的同时,有效解开物体之间的依赖关系。据我们所知,为移动机械臂计算时间最优的多物体重排解仍是一个鲜有探索的研究方向。在本研究中,我们提出了 ORLA*,它利用延迟(lazy)评估来搜索高质量的物体抓取与放置序列,同时考虑末端执行器和移动底座的移动。ORLA* 还借助机器学习考虑堆叠稳定性,从而支持多层重排任务。通过采用最优求解器来确定物体的临时放置位置,ORLA* 能够达到全局最优。通过大量仿真实验和消融研究,我们验证了 ORLA* 在具有挑战性的重排实例上给出高质量解的有效性。补充材料见:https://gaokai15.github.io/ORLA-Star/
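
The "lazy" ingredient can be illustrated with lazy edge evaluation in A*: edges enter the queue with a cheap optimistic cost, and the expensive true cost (e.g., arm-plus-base motion feasibility) is paid only when an edge reaches the front. A generic sketch, not the authors' planner:

```python
import heapq, itertools

def lazy_astar(start, goal, neighbors, cheap_cost, true_cost, h):
    """A* with lazy (delayed) edge evaluation, in the spirit of ORLA*.

    Requires cheap_cost(u, v) <= true_cost(u, v) and an admissible h
    for the returned cost to stay optimal.
    """
    tie = itertools.count()  # tiebreaker so heap never compares nodes
    open_q = [(h(start), next(tie), 0.0, start, None, True)]
    g_best = {}
    while open_q:
        _, _, g, node, parent, evaluated = heapq.heappop(open_q)
        if not evaluated:  # pay the expensive true cost only now
            g = g_best[parent] + true_cost(parent, node)
            heapq.heappush(open_q, (g + h(node), next(tie), g, node, parent, True))
            continue
        if node in g_best and g_best[node] <= g:
            continue
        g_best[node] = g
        if node == goal:
            return g
        for nxt in neighbors(node):
            ng = g + cheap_cost(node, nxt)  # optimistic estimate
            heapq.heappush(open_q, (ng + h(nxt), next(tie), ng, nxt, node, False))
    return None
```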

A Neural-Guided Dynamic Symbolic Network for Exploring Mathematical Expressions from Data

  • paper_url: http://arxiv.org/abs/2309.13705
  • repo_url: None
  • paper_authors: Wenqiang Li, Weijun Li, Lina Yu, Min Wu, Jingyi Liu, Yanjie Li
  • for: 本研究的目的是提出一种新的神经网络引导的动态符号网络方法(DySymNet),用于实现数据探索的符号回归问题。
  • methods: 本方法使用一种新的网络结构,并通过优化这些结构来找到更适合数据的表达。这种方法不仅能够处理高维问题,还能够优化常数。
  • results: 大量数值实验表明,DySymNet 方法在拟合精度上达到了当前最先进水平,并在较强噪声下保持鲁棒性。
    Abstract Symbolic regression (SR) is a powerful technique for discovering the underlying mathematical expressions from observed data. Inspired by the success of deep learning, recent efforts have focused on two categories for SR methods. One is using a neural network or genetic programming to search the expression tree directly. Although this has shown promising results, the large search space poses difficulties in learning constant factors and processing high-dimensional problems. Another approach is leveraging a transformer-based model training on synthetic data and offers advantages in inference speed. However, this method is limited to fixed small numbers of dimensions and may encounter inference problems when given data is out-of-distribution compared to the synthetic data. In this work, we propose DySymNet, a novel neural-guided Dynamic Symbolic Network for SR. Instead of searching for expressions within a large search space, we explore DySymNet with various structures and optimize them to identify expressions that better-fitting the data. With a topology structure like neural networks, DySymNet not only tackles the challenge of high-dimensional problems but also proves effective in optimizing constants. Based on extensive numerical experiments using low-dimensional public standard benchmarks and the well-known SRBench with more variables, our method achieves state-of-the-art performance in terms of fitting accuracy and robustness to noise.
    摘要 符号回归(SR)是一种从观测数据中发现潜在数学表达式的强大技术。受深度学习成功的启发,近期的研究工作集中在两类 SR 方法上。一类使用神经网络或遗传编程直接搜索表达式树;尽管已显示出可喜的结果,但巨大的搜索空间给常数因子的学习和高维问题的处理带来困难。另一类利用在合成数据上训练的基于 Transformer 的模型,在推理速度上具有优势;然而,该方法仅限于固定的少量维度,且当给定数据相对合成数据分布外时可能出现推理问题。在这项工作中,我们提出了 DySymNet,一种新的神经引导动态符号网络 SR 方法。我们不在庞大的搜索空间中搜索表达式,而是探索具有各种结构的 DySymNet 并对其进行优化,以找到更贴合数据的表达式。凭借类似神经网络的拓扑结构,DySymNet 不仅能应对高维问题的挑战,也被证明在常数优化上行之有效。基于在低维公开标准基准和包含更多变量的知名 SRBench 上的大量数值实验,我们的方法在拟合精度和抗噪声鲁棒性方面达到了最先进的性能。

Skill Check: Some Considerations on the Evaluation of Gamemastering Models for Role-playing Games

  • paper_url: http://arxiv.org/abs/2309.13702
  • repo_url: https://github.com/sgongora27/skill-check-gm-tests
  • paper_authors: Santiago Góngora, Luis Chiruzzo, Gonzalo Méndez, Pablo Gervás
  • for: 这篇论文是关于用Interactive Storytelling和自然语言处理方法模型游戏主持人(GM)的。
  • methods: 这篇论文提出了三类测试来评估此类对话系统,并用它们测试了 ChatGPT、Bard 和 OpenAssistant 三个开箱即用的 GM。
  • results: 测试结果显示,这三个对话系统作为 GM 各有长短,在不同情境下表现出不同的能力与缺陷。
    Abstract In role-playing games a Game Master (GM) is the player in charge of the game, who must design the challenges the players face and narrate the outcomes of their actions. In this work we discuss some challenges to model GMs from an Interactive Storytelling and Natural Language Processing perspective. Following those challenges we propose three test categories to evaluate such dialogue systems, and we use them to test ChatGPT, Bard and OpenAssistant as out-of-the-box GMs.
    摘要 在角色扮演游戏中,游戏主持人(GM)是掌控游戏的玩家,负责设计玩家面临的挑战并叙述其行动的结果。在这项工作中,我们从交互式叙事与自然语言处理的角度,讨论了建模 GM 所面临的一些挑战。基于这些挑战,我们提出了三类测试来评估此类对话系统,并用它们测试了作为开箱即用 GM 的 ChatGPT、Bard 和 OpenAssistant。

ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning

  • paper_url: http://arxiv.org/abs/2309.13701
  • repo_url: None
  • paper_authors: Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, Ida Momennejad
  • for: 本论文旨在通过审核与改进大语言模型(LLM)的表现,提升其评估文本的能力。
  • methods: 作者提出了一种名为 ALLURE 的系统方法:将 LLM 生成的评估与标注数据进行比较,并迭代地将偏差显著的实例纳入评估器;评估器利用上下文学习(ICL)来增强和改进 LLM 对文本的鲁棒评估。
  • results: 作者证明了 ALLURE 能有效提升评估器 LLM 的表现,降低评估过程对人工标注者的依赖;他们预计 ALLURE 可服务于医疗摘要、教育、生产力等各类涉及文本数据评估的 LLM 应用。
    Abstract From grading papers to summarizing medical documents, large language models (LLMs) are evermore used for evaluation of text generated by humans and AI alike. However, despite their extensive utility, LLMs exhibit distinct failure modes, necessitating a thorough audit and improvement of their text evaluation capabilities. Here we introduce ALLURE, a systematic approach to Auditing Large Language Models Understanding and Reasoning Errors. ALLURE involves comparing LLM-generated evaluations with annotated data, and iteratively incorporating instances of significant deviation into the evaluator, which leverages in-context learning (ICL) to enhance and improve robust evaluation of text by LLMs. Through this iterative process, we refine the performance of the evaluator LLM, ultimately reducing reliance on human annotators in the evaluation process. We anticipate ALLURE to serve diverse applications of LLMs in various domains related to evaluation of textual data, such as medical summarization, education, and and productivity.
    摘要 从批改论文到概括医疗文档,大语言模型(LLM)正被越来越多地用于评估人类和 AI 生成的文本。然而,尽管用途广泛,LLM 仍表现出独特的失效模式,因此需要对其文本评估能力进行全面的审核与改进。本文提出了 ALLURE,一种审核大语言模型理解与推理错误的系统方法。ALLURE 将 LLM 生成的评估与标注数据进行比较,并迭代地将偏差显著的实例纳入评估器;评估器利用上下文学习(ICL)来增强和改进 LLM 对文本的鲁棒评估。通过这一迭代过程,我们提升了评估器 LLM 的性能,最终降低了评估过程对人工标注者的依赖。我们预计 ALLURE 将服务于医疗摘要、教育和生产力等与文本数据评估相关的各领域 LLM 应用。
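
One ALLURE-style audit iteration can be sketched as: compare LLM evaluator scores against human annotations, then fold instances with large deviations back into the evaluator prompt as in-context examples. The function and prompt format below are hypothetical.

```python
def refine_evaluator_prompt(base_prompt, examples, llm_scores, tolerance=1):
    """One illustrative audit-and-refine iteration (not the paper's code).

    examples:   list of (text, human_score) pairs.
    llm_scores: evaluator scores for the same items.
    Items where the evaluator deviates by more than `tolerance` are
    appended to the prompt as in-context corrections.
    """
    prompt = base_prompt
    for (text, gold), pred in zip(examples, llm_scores):
        if abs(pred - gold) > tolerance:
            prompt += (f"\n\nText: {text}\nCorrect score: {gold}"
                       f"\n(Note: you previously scored this {pred}.)")
    return prompt

base = "Score the summary from 1 (poor) to 5 (excellent)."
print(refine_evaluator_prompt(
    base, [("Summary A ...", 5), ("Summary B ...", 2)], llm_scores=[5, 4]))
```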

Smart OMVI: Obfuscated Malware Variant Identification using a novel dataset

  • paper_url: http://arxiv.org/abs/2310.10670
  • repo_url: None
  • paper_authors: Suleman Qamar
  • for: 这篇论文旨在提供一个更真实、更具代表性的恶意软件分析环境,以评估恶意软件分析技术的效果。
  • methods: 这篇论文应用并对比了多种传统机器学习算法,包括但不限于支持向量机(SVM)、随机森林(RF)和极端梯度提升(XGBoost)等。
  • results: XGBoost 在这些算法中表现最佳,达到 82% 的准确率、88% 的精确率、80% 的召回率和 83% 的 F1 分数。
    Abstract Cybersecurity has become a significant issue in the digital era as a result of the growth in everyday computer use. Cybercriminals now engage in more than virus distribution and computer hacking. Cyberwarfare has developed as a result because it has become a threat to a nation's survival. Malware analysis serves as the first line of defence against an attack and is a significant component of cybercrime. Every day, malware attacks target a large number of computer users, businesses, and governmental agencies, causing billions of dollars in losses. Malware may evade multiple AV software with a very minor, cunning tweak made by its designers, despite the fact that security experts have a variety of tools at their disposal to identify it. To address this challenge, a new dataset called the Obfuscated Malware Dataset (OMD) has been developed. This dataset comprises 40 distinct malware families having 21924 samples, and it incorporates obfuscation techniques that mimic the strategies employed by malware creators to make their malware variations different from the original samples. The purpose of this dataset is to provide a more realistic and representative environment for evaluating the effectiveness of malware analysis techniques. Different conventional machine learning algorithms including but not limited to Support Vector Machine (SVM), Random Forest (RF), Extreme Gradient Boosting (XGBOOST) etc are applied and contrasted. The results demonstrated that XGBoost outperformed the other algorithms, achieving an accuracy of 82%, precision of 88%, recall of 80%, and an F1-Score of 83%.
    摘要 随着日常计算机使用的增长,网络安全已成为数字时代的一个重大问题。如今网络犯罪分子的活动已不止于传播病毒和入侵计算机,网络战由此发展,成为威胁国家存亡的隐患。恶意软件分析是抵御攻击的第一道防线,也是打击网络犯罪的重要组成部分。每天都有大量计算机用户、企业和政府机构遭受恶意软件攻击,造成数十亿美元的损失。尽管安全专家拥有多种识别工具,恶意软件的设计者只需做出极小而狡猾的改动,就可能绕过多款杀毒软件。为应对这一挑战,我们构建了一个名为混淆恶意软件数据集(OMD)的新数据集。该数据集包含 40 个不同的恶意软件家族、共 21924 个样本,并融入了模仿恶意软件作者策略的混淆技术,使恶意软件变体区别于原始样本。该数据集旨在为评估恶意软件分析技术的有效性提供更真实、更具代表性的环境。我们应用并对比了多种传统机器学习算法,包括但不限于支持向量机(SVM)、随机森林(RF)和极端梯度提升(XGBoost)等。结果表明,XGBoost 优于其他算法,达到 82% 的准确率、88% 的精确率、80% 的召回率和 83% 的 F1 分数。
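
The reported numbers follow the standard accuracy/precision/recall/F1 recipe. A runnable sketch with synthetic features standing in for OMD samples (binary labels for brevity; the real dataset has 40 families):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)
from xgboost import XGBClassifier

# Stand-in features; the real OMD dataset is not reproduced here.
X, y = make_classification(n_samples=2000, n_features=64, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(f"accuracy  {accuracy_score(y_te, pred):.2%}")
print(f"precision {precision_score(y_te, pred):.2%}")
print(f"recall    {recall_score(y_te, pred):.2%}")
print(f"F1        {f1_score(y_te, pred):.2%}")
```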

Deep Reinforcement Learning for Image-to-Image Translation

  • paper_url: http://arxiv.org/abs/2309.13672
  • repo_url: https://github.com/Algolzw/SPAC-Deformable-Registration
  • paper_authors: Xin Wang, Ziwei Luo, Jing Hu, Chengming Feng, Shu Hu, Bin Zhu, Xi Wu, Siwei Lyu
  • for: 本研究旨在提出一种基于深度学习和强化学习的图像转换方法,以解决现有的图像转换方法在某些任务上存在困难和过拟合的问题。
  • methods: 本研究将 I2IT 重新表述为基于深度强化学习(DRL)的分步决策问题,用轻量级模型将源图像逐步转换为目标图像。此外,本研究在标准的 actor-critic 模型中引入了带有新概念 Plan 的元策略,以处理高维连续状态与动作空间,使 actor 能够生成更易于处理的高维动作。
  • results: 实验结果表明,提出的RL-I2IT方法在面临高维连续动作空间问题时表现高效和稳定,并且可以在多个图像转换任务上达到高度的性能。
    Abstract Most existing Image-to-Image Translation (I2IT) methods generate images in a single run of a deep learning (DL) model. However, designing such a single-step model is always challenging, requiring a huge number of parameters and easily falling into bad global minimums and overfitting. In this work, we reformulate I2IT as a step-wise decision-making problem via deep reinforcement learning (DRL) and propose a novel framework that performs RL-based I2IT (RL-I2IT). The key feature in the RL-I2IT framework is to decompose a monolithic learning process into small steps with a lightweight model to progressively transform a source image successively to a target image. Considering that it is challenging to handle high dimensional continuous state and action spaces in the conventional RL framework, we introduce meta policy with a new concept Plan to the standard Actor-Critic model, which is of a lower dimension than the original image and can facilitate the actor to generate a tractable high dimensional action. In the RL-I2IT framework, we also employ a task-specific auxiliary learning strategy to stabilize the training process and improve the performance of the corresponding task. Experiments on several I2IT tasks demonstrate the effectiveness and robustness of the proposed method when facing high-dimensional continuous action space problems.
    摘要 大多数现有的图像到图像翻译(I2IT)方法都是通过深度学习(DL)模型一次性生成图像。然而,设计这样的单步模型始终颇具挑战:它需要大量参数,且容易陷入不良的全局极小值和过拟合。在这项工作中,我们借助深度强化学习(DRL)将 I2IT 重新表述为一个分步决策问题,并提出了执行基于强化学习的 I2IT(RL-I2IT)的新框架。RL-I2IT 框架的关键在于将单一庞大的学习过程分解为若干小步骤,用轻量级模型将源图像逐步转换为目标图像。考虑到在传统强化学习框架中处理高维连续状态与动作空间的困难,我们在标准 actor-critic 模型中引入了带有新概念 Plan 的元策略;Plan 的维度低于原始图像,可以帮助 actor 生成易于处理的高维动作。在 RL-I2IT 框架中,我们还采用任务特定的辅助学习策略来稳定训练过程并提升相应任务的性能。在多个 I2IT 任务上的实验表明,所提方法在面对高维连续动作空间问题时高效且鲁棒。

Survey of Social Bias in Vision-Language Models

  • paper_url: http://arxiv.org/abs/2309.14381
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Nayeon Lee, Yejin Bang, Holy Lovenia, Samuel Cahyawijaya, Wenliang Dai, Pascale Fung
  • for: 这篇论文旨在探讨预训练模型中存在的社会偏见问题,以及如何在多模态场景中减少这些偏见。
  • methods: 该论文采用了文献综述的方法,检查了不同领域中预训练模型中的社会偏见问题,并提出了一些应对方法。
  • results: 该论文发现了预训练模型在不同领域中的社会偏见问题,并提出了一些可能的解决方案,以帮助研究人员在多模态场景中开发更公正的人工智能模型。
    Abstract In recent years, the rapid advancement of machine learning (ML) models, particularly transformer-based pre-trained models, has revolutionized Natural Language Processing (NLP) and Computer Vision (CV) fields. However, researchers have discovered that these models can inadvertently capture and reinforce social biases present in their training datasets, leading to potential social harms, such as uneven resource allocation and unfair representation of specific social groups. Addressing these biases and ensuring fairness in artificial intelligence (AI) systems has become a critical concern in the ML community. The recent introduction of pre-trained vision-and-language (VL) models in the emerging multimodal field demands attention to the potential social biases present in these models as well. Although VL models are susceptible to social bias, there is a limited understanding compared to the extensive discussions on bias in NLP and CV. This survey aims to provide researchers with a high-level insight into the similarities and differences of social bias studies in pre-trained models across NLP, CV, and VL. By examining these perspectives, the survey aims to offer valuable guidelines on how to approach and mitigate social bias in both unimodal and multimodal settings. The findings and recommendations presented here can benefit the ML community, fostering the development of fairer and non-biased AI models in various applications and research endeavors.
    摘要 近年来,机器学习(ML)模型,尤其是基于 Transformer 的预训练模型的快速发展,为自然语言处理(NLP)和计算机视觉(CV)领域带来了革命性的变化。然而,研究人员发现,这些模型可能会无意中捕捉并强化训练数据中存在的社会偏见,进而造成潜在的社会危害,例如资源分配不均和对特定社会群体的不公正呈现。消除这些偏见、确保人工智能(AI)系统的公平性,已成为机器学习社区的关键议题。随着新兴多模态领域中预训练视觉-语言(VL)模型的出现,这些模型中潜在的社会偏见同样需要关注。尽管 VL 模型也易受社会偏见影响,但与 NLP 和 CV 领域对偏见的大量讨论相比,人们对其理解仍然有限。本综述旨在为研究人员提供一个高层视角,梳理 NLP、CV 与 VL 预训练模型社会偏见研究的异同。通过考察这些视角,本综述希望为在单模态与多模态环境下应对并缓解社会偏见提供有价值的指引。文中给出的发现与建议有望惠及机器学习社区,推动在各类应用与研究中开发更公平、无偏见的 AI 模型。

VoiceLDM: Text-to-Speech with Environmental Context

  • paper_url: http://arxiv.org/abs/2309.13664
  • repo_url: https://github.com/glory20h/VoiceLDM
  • paper_authors: Yeonghyeon Lee, Inmo Yeon, Juhan Nam, Joon Son Chung
  • for: 这个论文旨在生成准确地遵循两个自然语言文本提示:描述提示和内容提示。描述提示提供环境上下文信息,而内容提示则传达语言内容。
  • methods: 作者采用基于潜在扩散模型的文本到音频(TTA)模型,并扩展其功能,将额外的内容提示作为条件输入。借助预训练的对比语言-音频预训练(CLAP)模型和 Whisper,作者在大量真实世界音频数据上进行训练,无需人工标注或转写。此外,作者还采用双重无分类器引导,进一步提升 VoiceLDM 的可控性。
  • results: 实验结果表明,VoiceLDM 可以生成与两个输入条件良好对齐的音频,在 AudioCaps 测试集上其语音可懂度甚至超过真实音频。此外,作者还探索了其文本到语音(TTS)与零样本文本到音频能力,证明 VoiceLDM 可以达到有竞争力的结果。
    Abstract This paper presents VoiceLDM, a model designed to produce audio that accurately follows two distinct natural language text prompts: the description prompt and the content prompt. The former provides information about the overall environmental context of the audio, while the latter conveys the linguistic content. To achieve this, we adopt a text-to-audio (TTA) model based on latent diffusion models and extend its functionality to incorporate an additional content prompt as a conditional input. By utilizing pretrained contrastive language-audio pretraining (CLAP) and Whisper, VoiceLDM is trained on large amounts of real-world audio without manual annotations or transcriptions. Additionally, we employ dual classifier-free guidance to further enhance the controllability of VoiceLDM. Experimental results demonstrate that VoiceLDM is capable of generating plausible audio that aligns well with both input conditions, even surpassing the speech intelligibility of the ground truth audio on the AudioCaps test set. Furthermore, we explore the text-to-speech (TTS) and zero-shot text-to-audio capabilities of VoiceLDM and show that it achieves competitive results. Demos and code are available at https://voiceldm.github.io.
    摘要 本文提出了 VoiceLDM,一个旨在生成能准确遵循两种不同自然语言文本提示的音频模型:描述提示与内容提示。前者提供音频的整体环境上下文信息,后者传达语言内容。为此,我们采用基于潜在扩散模型的文本到音频(TTA)模型,并扩展其功能,将额外的内容提示作为条件输入。借助预训练的对比语言-音频预训练(CLAP)模型和 Whisper,VoiceLDM 得以在大量真实世界音频上训练,而无需人工标注或转写。此外,我们采用双重无分类器引导进一步增强 VoiceLDM 的可控性。实验结果表明,VoiceLDM 能够生成与两个输入条件良好对齐的合理音频,在 AudioCaps 测试集上其语音可懂度甚至超过了真实音频。此外,我们还探索了 VoiceLDM 的文本到语音(TTS)和零样本文本到音频能力,并证明其取得了有竞争力的结果。演示与代码见 https://voiceldm.github.io。
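
Dual classifier-free guidance presumably combines an unconditional noise prediction with the two conditional ones. The sketch below follows the common compositional CFG pattern; the exact combination and weights VoiceLDM uses may differ.

```python
import torch

def dual_cfg(eps_both, eps_desc, eps_none, w_desc=3.0, w_cont=3.0):
    """Dual classifier-free guidance, sketched from the paper's description.

    eps_both: noise prediction conditioned on description + content prompts
    eps_desc: conditioned on the description prompt only
    eps_none: unconditional prediction
    """
    return (eps_none
            + w_desc * (eps_desc - eps_none)    # push toward the environment
            + w_cont * (eps_both - eps_desc))   # push toward the spoken content

e = [torch.randn(1, 8, 64, 64) for _ in range(3)]  # toy latent predictions
print(dual_cfg(*e).shape)
```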

Machine-assisted mixed methods: augmenting humanities and social sciences with artificial intelligence

  • paper_url: http://arxiv.org/abs/2309.14379
  • repo_url: https://github.com/andreskarjus/machineassistedmixedmethods
  • paper_authors: Andres Karjus
  • for: 这篇论文旨在探讨大语言模型(LLM)在人文社科领域中的应用潜力,以增强并自动化以往依赖人工的数据分析任务。
  • methods: 该论文提出了一种系统的混合方法框架,结合定性分析专长、机器可扩展性与严格量化,并兼顾透明度与可复现性;16 个机器辅助案例研究被用作概念验证。
  • results: 结果表明,在大多数情况下,LLM 可以成功执行许多定性分析任务,包括语言与话语分析、词汇语义变化检测、访谈分析、历史事件因果推断与文本挖掘、政治立场检测、文本与思想复用、文学与电影的体裁构成、社交网络推断、自动词典编纂、缺失元数据补全以及多模态视觉文化分析。此外,论文还指出,LLM(及人类)的标注可能包含错误与差异,应在后续统计建模中考虑一致率,以确保结果的可靠性。
    Abstract The increasing capacities of large language models (LLMs) present an unprecedented opportunity to scale up data analytics in the humanities and social sciences, augmenting and automating qualitative analytic tasks previously typically allocated to human labor. This contribution proposes a systematic mixed methods framework to harness qualitative analytic expertise, machine scalability, and rigorous quantification, with attention to transparency and replicability. 16 machine-assisted case studies are showcased as proof of concept. Tasks include linguistic and discourse analysis, lexical semantic change detection, interview analysis, historical event cause inference and text mining, detection of political stance, text and idea reuse, genre composition in literature and film; social network inference, automated lexicography, missing metadata augmentation, and multimodal visual cultural analytics. In contrast to the focus on English in the emerging LLM applicability literature, many examples here deal with scenarios involving smaller languages and historical texts prone to digitization distortions. In all but the most difficult tasks requiring expert knowledge, generative LLMs can demonstrably serve as viable research instruments. LLM (and human) annotations may contain errors and variation, but the agreement rate can and should be accounted for in subsequent statistical modeling; a bootstrapping approach is discussed. The replications among the case studies illustrate how tasks previously requiring potentially months of team effort and complex computational pipelines, can now be accomplished by an LLM-assisted scholar in a fraction of the time. Importantly, this approach is not intended to replace, but to augment researcher knowledge and skills. With these opportunities in sight, qualitative expertise and the ability to pose insightful questions have arguably never been more critical.
    摘要 大语言模型(LLM)能力的不断增长,为人文社科领域的数据分析规模化提供了前所未有的机会,可增强并自动化以往通常由人工完成的定性分析任务。本文提出了一个系统的混合方法框架,以结合定性分析专长、机器可扩展性与严格的量化,并兼顾透明度与可复现性。文中以 16 个机器辅助案例研究作为概念验证,任务涵盖语言与话语分析、词汇语义变化检测、访谈分析、历史事件因果推断与文本挖掘、政治立场检测、文本与思想复用、文学与电影的体裁构成、社交网络推断、自动词典编纂、缺失元数据补全以及多模态视觉文化分析。与新兴的 LLM 适用性文献聚焦英语不同,这里的许多示例涉及较小语种以及易受数字化失真影响的历史文本。除了最困难、需要专家知识的任务之外,生成式 LLM 已被证明可以作为可行的研究工具。LLM(以及人类)的标注可能包含错误与差异,但一致率可以且应当在后续统计建模中加以考虑;文中讨论了一种自助法(bootstrapping)。案例研究之间的复现表明,以往可能需要团队数月努力和复杂计算管线才能完成的任务,如今可由 LLM 辅助的学者在很短的时间内完成。重要的是,这一方法并非为了取代研究者的知识与技能,而是对其加以增强。面对这些机遇,定性专长以及提出有洞见问题的能力可以说从未像现在这样重要。
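
The bootstrapping the paper discusses for LLM-human agreement is a few lines of NumPy; the labels below are toy values.

```python
import numpy as np

def bootstrap_agreement(llm_labels, human_labels, n_boot=10_000, seed=0):
    """Bootstrap a confidence interval for the LLM-human agreement rate.

    Resamples annotated items with replacement; returns the mean agreement
    and a 95% percentile interval.
    """
    llm = np.asarray(llm_labels)
    gold = np.asarray(human_labels)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(gold), size=(n_boot, len(gold)))
    rates = (llm[idx] == gold[idx]).mean(axis=1)
    return rates.mean(), np.percentile(rates, [2.5, 97.5])

mean, (lo, hi) = bootstrap_agreement([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1])
print(f"agreement {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```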

Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve

  • paper_url: http://arxiv.org/abs/2309.13638
  • repo_url: None
  • paper_authors: R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, Thomas L. Griffiths
  • for: 本研究旨在理解大语言模型(LLM)的优劣点,并推广其应用。
  • methods: 本研究采用目的论(teleological)视角,即考虑 LLM 被训练去解决的任务(互联网文本上的下一词预测)所施加的压力,并据此预测 LLM 会采取的策略。
  • results: 研究发现,LLM 的准确率受待执行任务的概率、目标输出的概率和所给输入的概率影响:即使在概率本不应起作用的确定性场景中,高概率情形下的准确率也高于低概率情形。研究还发现了一些出人意料的失效模式,例如 GPT-4 解码简单密码时,输出为高概率词序列时准确率为 51%,而输出为低概率词序列时仅为 13%。这些结果表明,AI 从业者在低概率情形下使用 LLM 时需要谨慎。
    Abstract The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that in order to develop a holistic understanding of these systems we need to consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. This approach - which we call the teleological approach - leads us to identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. We predict that LLMs will achieve higher accuracy when these probabilities are high than when they are low - even in deterministic settings where probability should not matter. To test our predictions, we evaluate two LLMs (GPT-3.5 and GPT-4) on eleven tasks, and we find robust evidence that LLMs are influenced by probability in the ways that we have hypothesized. In many cases, the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability word sequence but only 13% when it is low-probability. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system - one that has been shaped by its own particular set of pressures.
    摘要 随着大型语言模型(LLM)的广泛采用,我们必须认可它们的优势和局限性。我们认为,为了发展它们的整体理解,我们需要考虑它们被训练的问题:以互联网文本为基础的下一个词预测。通过认真对待这些任务的压力,我们可以预测LLM会采取什么策略,从而对它们的成功和失败进行预测。我们称这种方法为“teleological approach”。我们认为,LLM的准确率受以下三个因素的影响:任务执行概率、目标输出概率和输入提供的概率。我们预测,当这些概率高时,LLM的准确率也将高;而当它们低时,准确率则将低,即使在deterministic Setting中,概率应该没有影响。为测试我们的预测,我们评估了两个LLM(GPT-3.5和GPT-4)在11个任务上的表现,并发现了robust的证据,证明了我们的假设。在许多情况下,实验发现了意外的失败模式。例如,GPT-4在解码简单密码的任务中的准确率为51%,但只有13% когда输出是低概率的word sequence。这些结果表明,AI实践者应该小心使用LLM在低概率情况下。更广泛地说,我们 concludeThat we should not evaluate LLMs as if they were humans, but rather as a distinct type of system that has been shaped by its own unique set of pressures.

Development of an intelligent system for the detection of corona virus using artificial neural network

  • paper_url: http://arxiv.org/abs/2309.13636
  • repo_url: None
  • paper_authors: Nwafor Emmanuel O, Ngozi Maryrose Umeh, Ikechukwu Ekene Onyenwe
  • for: 本研究目的是开发一个人工神经网络检测新冠肺炎的智能系统。
  • methods: 本研究在文献综述的基础上,从尼日利亚埃努古的 Colliery 医院收集了 683 条 COVID-19 患者体温不低于 38℃ 的数据,用于训练人工神经网络检测模型。
  • results: 模型评估结果显示,回归值为 0.967,准确率为 97%,均方误差(MSE)为 0.00100,表明新检测系统可靠且高效。
    Abstract This paper presents the development of an intelligent system for the detection of coronavirus using artificial neural network. This was done after series of literature review which indicated that high fever accounts for 87.9% of the COVID-19 symptoms. 683 temperature data of COVID-19 patients at >= 38°C were collected from Colliery hospital Enugu, Nigeria and used to train an artificial neural network detective model for the detection of COVID-19. The reference model generated was then converted into Verilog codes using Hardware Description Language (HDL) and burned onto a Field Programmable Gate Array (FPGA) controller using the FPGA tool in Matlab. The performance of the model when evaluated using confusion matrix, regression and mean square error (MSE) showed that the regression value is 0.967; the accuracy is 97% and the MSE is 0.00100Mu. These results all implied that the new detection system is reliable and very effective for the detection of COVID-19.
    摘要 本文介绍了一种利用人工神经网络检测新型冠状病毒的智能系统的开发。此前的一系列文献综述表明,高热占 COVID-19 症状的 87.9%。我们从尼日利亚埃努古的 Colliery 医院收集了 683 条 COVID-19 患者体温不低于 38℃ 的数据,用于训练人工神经网络检测模型。生成的参考模型通过硬件描述语言(HDL)转换为 Verilog 代码,再利用 Matlab 中的 FPGA 工具烧录到现场可编程门阵列(FPGA)控制器中。使用混淆矩阵、回归和均方误差(MSE)对模型进行评估的结果显示:回归值为 0.967,准确率为 97%,MSE 为 0.00100Mu。这些结果均表明,新的检测系统可靠,且对 COVID-19 检测非常有效。
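
The pipeline (temperature in, ANN out, evaluated via confusion matrix and MSE) can be sketched with scikit-learn on synthetic stand-in data, since the hospital data is not public.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, mean_squared_error

# Synthetic stand-in for the 683 temperature readings (the real data
# came from febrile patients with readings >= 38 C).
rng = np.random.default_rng(0)
temp_pos = rng.normal(39.0, 0.6, 350)   # febrile, labeled positive
temp_neg = rng.normal(36.8, 0.4, 333)   # afebrile, labeled negative
X = np.concatenate([temp_pos, temp_neg]).reshape(-1, 1)
y = np.concatenate([np.ones(350), np.zeros(333)])

clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
clf.fit(X, y)
pred = clf.predict(X)
print(confusion_matrix(y, pred))
print("MSE:", mean_squared_error(y, pred))
```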

PanopticNDT: Efficient and Robust Panoptic Mapping

  • paper_url: http://arxiv.org/abs/2309.13635
  • repo_url: https://github.com/tui-nicr/panoptic-mapping
  • paper_authors: Daniel Seichter, Benedict Stephan, Söhnke Benedikt Fischedick, Steffen Müller, Leonard Rabes, Horst-Michael Gross
  • for: 本研究旨在提供高空间分辨率的 3D 全景地图,以便移动机器人在室内环境中自主运行。
  • methods: 本文提出了 PanopticNDT,一种基于占用正态分布变换(NDT)建图的高效且鲁棒的全景建图方法。
  • results: 在公开的 Hypersim 和 ScanNetV2 数据集上,该方法能以比其他最先进方法更高的细节水平表示全景信息,并可在移动机器人上实现实时全景建图。此外,我们还通过定性结果证明了 PanopticNDT 在实际家用场景中的适用性。
    Abstract As the application scenarios of mobile robots are getting more complex and challenging, scene understanding becomes increasingly crucial. A mobile robot that is supposed to operate autonomously in indoor environments must have precise knowledge about what objects are present, where they are, what their spatial extent is, and how they can be reached; i.e., information about free space is also crucial. Panoptic mapping is a powerful instrument providing such information. However, building 3D panoptic maps with high spatial resolution is challenging on mobile robots, given their limited computing capabilities. In this paper, we propose PanopticNDT - an efficient and robust panoptic mapping approach based on occupancy normal distribution transform (NDT) mapping. We evaluate our approach on the publicly available datasets Hypersim and ScanNetV2. The results reveal that our approach can represent panoptic information at a higher level of detail than other state-of-the-art approaches while enabling real-time panoptic mapping on mobile robots. Finally, we prove the real-world applicability of PanopticNDT with qualitative results in a domestic application.
    摘要 随着移动机器人应用场景日益复杂和充满挑战,场景理解变得愈发关键。要在室内环境中自主运行,移动机器人必须精确掌握存在哪些物体、它们的位置与空间范围以及如何到达它们;也就是说,关于自由空间的信息同样至关重要。全景建图(panoptic mapping)是提供此类信息的有力工具。然而,鉴于移动机器人有限的计算能力,在其上构建高空间分辨率的 3D 全景地图颇具挑战。本文提出了 PanopticNDT,一种基于占用正态分布变换(NDT)建图的高效且鲁棒的全景建图方法。我们在公开数据集 Hypersim 和 ScanNetV2 上评估了该方法。结果表明,与其他最先进方法相比,我们的方法能以更高的细节水平表示全景信息,同时可在移动机器人上实现实时全景建图。最后,我们通过家用场景中的定性结果证明了 PanopticNDT 的实际适用性。
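
The NDT representation underlying PanopticNDT fits one normal distribution per occupied voxel. A minimal geometric sketch; a panoptic NDT map would additionally attach per-cell class and instance statistics.

```python
import numpy as np

def ndt_cells(points, cell_size=0.5, min_pts=5):
    """Fit one normal distribution per occupied voxel (the NDT in NDT maps).

    Returns {voxel_index: (mean, covariance)} for sufficiently populated
    voxels; no class/instance information is modeled in this sketch.
    """
    keys = np.floor(points / cell_size).astype(int)
    cells = {}
    for key in {tuple(k) for k in keys}:
        pts = points[(keys == key).all(axis=1)]
        if len(pts) >= min_pts:
            cells[key] = (pts.mean(axis=0), np.cov(pts.T))
    return cells

pts = np.random.randn(5000, 3)  # toy point cloud
print(len(ndt_cells(pts)), "occupied NDT cells")
```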

EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria

  • paper_url: http://arxiv.org/abs/2309.13633
  • repo_url: None
  • paper_authors: Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, Juho Kim
  • for: 本研究旨在帮助开发人员通过使用大语言模型(LLM)创造新的生成应用程序,并通过多次修改提示来优化这些应用程序。
  • methods: 本研究提出了交互式系统 EvalLM,利用大语言模型(LLM)依据用户定义的标准评估提示的多个输出,以帮助开发者评估依赖上下文且主观的标准。
  • results: 与人工评估相比,使用 EvalLM 系统可以帮助开发者制定更多样化的标准、检查两倍数量的输出,并以少 59% 的修改次数得到满意的提示。
    Abstract By simply composing prompts, developers can prototype novel generative applications with Large Language Models (LLMs). To refine prototypes into products, however, developers must iteratively revise prompts by evaluating outputs to diagnose weaknesses. Formative interviews (N=8) revealed that developers invest significant effort in manually evaluating outputs as they assess context-specific and subjective criteria. We present EvalLM, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria. By describing criteria in natural language, users can employ the system's LLM-based evaluator to get an overview of where prompts excel or fail, and improve these based on the evaluator's feedback. A comparative study (N=12) showed that EvalLM, when compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions. Beyond prompts, our work can be extended to augment model evaluation and alignment in specific application contexts.
    摘要 开发者只需编写提示,即可利用大语言模型(LLM)快速构建新颖的生成式应用原型。然而,要把原型打磨成产品,开发者必须通过评估输出来诊断弱点,反复修改提示。形成性访谈(N=8)显示,开发者在评估依赖上下文且主观的标准时,需要投入大量精力手工评估输出。我们提出了 EvalLM,一个通过按用户定义标准评估多个输出来迭代改进提示的交互式系统。用户以自然语言描述标准后,即可借助系统内基于 LLM 的评估器,了解提示在哪些方面表现出色或欠佳,并根据评估器的反馈加以改进。对比研究(N=12)表明,与人工评估相比,EvalLM 帮助参与者制定了更多样化的标准、检查了两倍数量的输出,并以少 59% 的修改次数得到满意的提示。除提示之外,我们的工作还可扩展到特定应用场景中的模型评估与对齐。

A Multi-channel EEG Data Analysis for Poor Neuro-prognostication in Comatose Patients with Self and Cross-channel Attention Mechanism

  • paper_url: http://arxiv.org/abs/2310.03756
  • repo_url: None
  • paper_authors: Hemin Ali Qadir, Naimahmed Nesaragi, Per Steiner Halvorsen, Ilangko Balasingham
  • for: 本研究旨在利用双极脑电图(EEG)记录,高效预测昏迷患者的不良神经预后。
  • methods: 该研究采用混合深度学习方法,由特征编码器、可学习位置编码、带注意力机制的上下文网络以及回归与分类模块组成,以优化一个追求高特异性,即在减少假阳性(< 0.05)的同时保持真阳性率(TPR)的目标函数。
  • results: 该研究提出的框架 OUS IVS 在隐藏验证数据上得分 0.57。
    Abstract This work investigates the predictive potential of bipolar electroencephalogram (EEG) recordings towards efficient prediction of poor neurological outcomes. A retrospective design using a hybrid deep learning approach is utilized to optimize an objective function aiming for high specificity, i.e., true positive rate (TPR) with reduced false positives (< 0.05). A multi-channel EEG array of 18 bipolar channel pairs from a randomly selected 5-minute segment in an hour is kept. In order to determine the outcome prediction, a combination of a feature encoder with 1-D convolutional layers, learnable position encoding, a context network with attention mechanisms, and finally, a regressor and classifier blocks are used. The feature encoder extricates local temporal and spatial features, while the following position encoding and attention mechanisms attempt to capture global temporal dependencies. Results: The proposed framework by our team, OUS IVS, when validated on the challenge hidden validation data, exhibited a score of 0.57.
    摘要 本工作研究了双极脑电图(EEG)记录在高效预测不良神经预后方面的潜力。我们采用回顾性设计与混合深度学习方法,优化一个追求高特异性,即在减少假阳性(< 0.05)的同时保持真阳性率(TPR)的目标函数。我们保留了一个多通道 EEG 阵列,由从每小时中随机选取的 5 分钟片段里的 18 对双极通道组成。为了进行结局预测,我们组合使用了带一维卷积层的特征编码器、可学习位置编码、带注意力机制的上下文网络,以及最后的回归与分类模块。特征编码器提取局部时间与空间特征,其后的位置编码与注意力机制则试图捕捉全局时间依赖关系。结果:我们团队提出的框架 OUS IVS 在挑战赛的隐藏验证数据上得分 0.57。

GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph

  • paper_url: http://arxiv.org/abs/2309.13625
  • repo_url: https://github.com/lixinustc/graphadapter
  • paper_authors: Xin Li, Dongze Lian, Zhihe Lu, Jiawang Bai, Zhibo Chen, Xinchao Wang
  • for: 通过引入少量额外参数来挖掘任务特定知识,提升视觉-语言模型(VLM)在低数据场景下的表现。
  • methods: 提出一种有效的 adapter 式微调策略 GraphAdapter,通过双知识图显式建模两种模态的结构知识,进一步增强文本适配器的表现。
  • results: 在 11 个标准 benchmark 数据集上进行了广泛实验,证明 GraphAdapter 显著优于以往基于 adapter 的方法。
    Abstract Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs) under the low-data regime, where only a few additional parameters are introduced to excavate the task-specific knowledge based on the general and powerful representation of VLMs. However, most adapter-style works face two limitations: (i) modeling task-specific knowledge with a single modality only; and (ii) overlooking the exploitation of the inter-class relationships in downstream tasks, thereby leading to sub-optimal solutions. To mitigate that, we propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which performs the textual adapter by explicitly modeling the dual-modality structure knowledge (i.e., the correlation of different semantics/classes in textual and visual modalities) with a dual knowledge graph. In particular, the dual knowledge graph is established with two sub-graphs, i.e., a textual knowledge sub-graph, and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in two modalities, respectively. This enables the textual feature of each prompt to leverage the task-specific structure knowledge from both textual and visual modalities, yielding a more effective classifier for downstream tasks. Extensive experimental results on 11 benchmark datasets reveal that our GraphAdapter significantly outperforms previous adapter-based methods. The code will be released at https://github.com/lixinustc/GraphAdapter
    摘要 adapter 式高效迁移学习(ETL)在低数据场景下微调视觉-语言模型(VLM)时表现出色:只需引入少量额外参数,即可基于 VLM 通用而强大的表示挖掘任务特定知识。然而,大多数 adapter 式工作面临两个局限:(i)仅用单一模态建模任务特定知识;(ii)忽视了下游任务中类间关系的利用,从而导致次优解。为缓解这些问题,我们提出了一种有效的 adapter 式微调策略,称为 GraphAdapter,它借助双知识图显式建模双模态结构知识(即文本与视觉模态中不同语义/类别之间的相关性)来实现文本适配器。具体而言,双知识图由两个子图构成,即文本知识子图和视觉知识子图,其中节点和边分别表示两种模态中的语义/类别及其相关性。这使每个提示的文本特征都能利用来自文本和视觉两种模态的任务特定结构知识,从而为下游任务产生更有效的分类器。在 11 个基准数据集上的大量实验结果表明,我们的 GraphAdapter 显著优于以往基于 adapter 的方法。代码将在 https://github.com/lixinustc/GraphAdapter 发布。

PRIS: Practical robust invertible network for image steganography

  • paper_url: http://arxiv.org/abs/2309.13620
  • repo_url: https://github.com/yanghangai/pris
  • paper_authors: Hang Yang, Yitian Xu, Xuhua Liu, Xiaodong Ma
  • for: PRIS is designed to improve the robustness of image steganography against distortions of the container image such as Gaussian noise and lossy compression.
  • methods: PRIS uses invertible neural networks and two enhance modules before and after the extraction process, with a 3-step training strategy. It also considers rounding error, which is typically ignored by other methods, and proposes a gradient approximation function (GAF) to overcome the non-differentiability of rounding distortion.
  • results: experimental results show that PRIS outperforms state-of-the-art robust image steganography methods in both robustness and practicability.
    Abstract Image steganography is a technique for hiding secret information inside another image, so that the secret is not visible to human eyes and can be recovered when needed. Most existing image steganography methods have low hiding robustness when the container images are affected by distortions such as Gaussian noise and lossy compression. This paper proposes PRIS to improve the robustness of image steganography. It is based on invertible neural networks and places two enhance modules before and after the extraction process, with a 3-step training strategy. Moreover, rounding error is considered, which is always ignored by existing methods but is unavoidable in practice. A gradient approximation function (GAF) is also proposed to overcome the non-differentiability of the rounding distortion. Experimental results show that our PRIS outperforms the state-of-the-art robust image steganography method in both robustness and practicability. Code is available at https://github.com/yanghangAI/PRIS, and a practical demonstration of our model at http://yanghang.site/hide/.
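The rounding step mentioned above is a good example of an operation that blocks gradients; a common workaround, and a plausible reading of what a gradient approximation function does, is a straight-through-style estimator: apply real rounding in the forward pass and approximate its derivative in the backward pass. This generic sketch is not the paper's exact GAF.

```python
import torch

class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)      # real rounding distortion in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output         # approximate d(round)/dx with identity

x = torch.rand(4, requires_grad=True) * 255
y = RoundSTE.apply(x)
y.sum().backward()                 # gradients flow despite the rounding
```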

Boosting Offline Reinforcement Learning for Autonomous Driving with Hierarchical Latent Skills

  • paper_url: http://arxiv.org/abs/2309.13614
  • repo_url: None
  • paper_authors: Zenan Li, Fan Nie, Qiao Sun, Fang Da, Hang Zhao
  • for: addressing the long-horizon planning challenge in learning-based vehicle planning.
  • methods: learns skills from offline demonstrations with a variational autoencoder (VAE); to mitigate the posterior collapse common to VAEs, proposes a two-branch sequence encoder that captures both discrete options and continuous variations of complex driving skills.
  • results: extensive experiments on CARLA show the model outperforms strong baselines in new scenarios; additional visualizations and experiments demonstrate the interpretability and transferability of the learned skills.
    Abstract Learning-based vehicle planning is receiving increasing attention with the emergence of diverse driving simulators and large-scale driving datasets. While offline reinforcement learning (RL) is well suited for these safety-critical tasks, it still struggles to plan over extended periods. In this work, we present a skill-based framework that enhances offline RL to overcome the long-horizon vehicle planning challenge. Specifically, we design a variational autoencoder (VAE) to learn skills from offline demonstrations. To mitigate posterior collapse of common VAEs, we introduce a two-branch sequence encoder to capture both discrete options and continuous variations of the complex driving skills. The final policy treats learned skills as actions and can be trained by any off-the-shelf offline RL algorithms. This facilitates a shift in focus from per-step actions to temporally extended skills, thereby enabling long-term reasoning into the future. Extensive results on CARLA prove that our model consistently outperforms strong baselines at both training and new scenarios. Additional visualizations and experiments demonstrate the interpretability and transferability of extracted skills.
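A minimal sketch of the two-branch sequence encoder: one branch produces a discrete skill option (via Gumbel-softmax), the other a continuous variation latent with the usual reparameterization, mirroring the discrete/continuous decomposition described above. Dimensions and the GRU backbone are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEncoder(nn.Module):
    def __init__(self, obs_dim=32, hidden=128, n_options=8, z_dim=16):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.option_head = nn.Linear(hidden, n_options)   # discrete branch
        self.mu = nn.Linear(hidden, z_dim)                # continuous branch
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, traj):                 # traj: (batch, T, obs_dim)
        h = self.rnn(traj)[1][-1]            # final hidden state
        option = F.gumbel_softmax(self.option_head(h), tau=1.0, hard=True)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return option, z, mu, logvar

opt, z, mu, logvar = TwoBranchEncoder()(torch.randn(4, 50, 32))
```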

A Text Classification-Based Approach for Evaluating and Enhancing the Machine Interpretability of Building Codes

  • paper_url: http://arxiv.org/abs/2309.14374
  • repo_url: https://github.com/skydustz/text-classification-based-approach-for-evaluating-and-enhancing-machine-interpretability-of-building
  • paper_authors: Zhe Zheng, Yu-Cheng Zhou, Ke-Yin Chen, Xin-Zheng Lu, Zhong-Tian She, Jia-Rui Lin
  • for: proposing a method to automatically evaluate and enhance the machine interpretability of building codes, so that they can be transformed into computer-processable formats.
  • methods: develops an efficient text classification model based on a domain-specific pretrained language model and transfer learning, and proposes a quantitative method for evaluating the machine interpretability of building codes.
  • results: experiments show the proposed text classification algorithm outperforms existing methods, improving the F1-score from 72.16% to 93.60%, and also boosts downstream automated rule interpretation methods.
    Abstract Interpreting regulatory documents or building codes into computer-processable formats is essential for the intelligent design and construction of buildings and infrastructures. Although automated rule interpretation (ARI) methods have been investigated for years, most of them highly depend on the early and manual filtering of interpretable clauses from a building code. While few of them considered machine interpretability, which represents the potential to be transformed into a computer-processable format, from both clause- and document-level. Therefore, this research aims to propose a novel approach to automatically evaluate and enhance the machine interpretability of single clause and building codes. First, a few categories are introduced to classify each clause in a building code considering the requirements for rule interpretation, and a dataset is developed for model training. Then, an efficient text classification model is developed based on a pretrained domain-specific language model and transfer learning techniques. Finally, a quantitative evaluation method is proposed to assess the overall interpretability of building codes. Experiments show that the proposed text classification algorithm outperforms the existing CNN- or RNN-based methods, improving the F1-score from 72.16% to 93.60%. It is also illustrated that the proposed classification method can enhance downstream ARI methods with an improvement of 4%. Furthermore, analyzing the results of more than 150 building codes in China showed that their average interpretability is 34.40%, which implies that it is still hard to fully transform the entire regulatory document into computer-processable formats. It is also argued that the interpretability of building codes should be further improved both from the human side and the machine side.
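As a simplified stand-in for the clause classifier and the document-level score, the sketch below trains a TF-IDF + logistic regression model (the paper fine-tunes a domain-specific language model instead) and reports the share of clauses predicted machine-interpretable; the clauses and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clauses = ["Beam depth shall be at least 300 mm.",
           "Fire design should follow good engineering practice.",
           "The column axial load ratio shall not exceed 0.65.",
           "The owner shall keep maintenance records."]
labels = [1, 0, 1, 0]       # 1 = interpretable as a directly checkable rule

clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(clauses, labels)

def document_interpretability(document_clauses):
    """Share of clauses classified as machine-interpretable (cf. the 34.40%)."""
    return clf.predict(document_clauses).mean()

print(document_interpretability(clauses))
```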

MM-NeRF: Multimodal-Guided 3D Multi-Style Transfer of Neural Radiance Field

  • paper_url: http://arxiv.org/abs/2309.13607
  • repo_url: None
  • paper_authors: Zijiang Yang, Zhongwei Qiu, Chang Xu, Dongmei Fu
  • for: achieving high-quality 3D multi-style transfer using neural radiance fields (NeRF) as the 3D scene representation.
  • methods: proposes MM-NeRF, a novel multimodal-guided 3D multi-style transfer framework for NeRF that achieves high-quality stylization and can be driven by multimodal style guidance.
  • results: experiments show MM-NeRF achieves high-quality 3D multi-style transfer while maintaining multi-view consistency and the semantic consistency of the multimodal style guidance.
    Abstract 3D style transfer aims to render stylized novel views of 3D scenes with the specified style, which requires high-quality rendering and keeping multi-view consistency. Benefiting from the ability of 3D representation from Neural Radiance Field (NeRF), existing methods learn the stylized NeRF by giving a reference style from an image. However, they suffer the challenges of high-quality stylization with texture details for multi-style transfer and stylization with multimodal guidance. In this paper, we reveal that the same objects in 3D scenes show various states (color tone, details, etc.) from different views after stylization since previous methods optimized by single-view image-based style loss functions, leading NeRF to tend to smooth texture details, further resulting in low-quality rendering. To tackle these problems, we propose a novel Multimodal-guided 3D Multi-style transfer of NeRF, termed MM-NeRF, which achieves high-quality 3D multi-style rendering with texture details and can be driven by multimodal-style guidance. First, MM-NeRF adopts a unified framework to project multimodal guidance into CLIP space and extracts multimodal style features to guide the multi-style stylization. To relieve the problem of lacking details, we propose a novel Multi-Head Learning Scheme (MLS), in which each style head predicts the parameters of the color head of NeRF. MLS decomposes the learning difficulty caused by the inconsistency of multi-style transfer and improves the quality of stylization. In addition, the MLS can generalize pre-trained MM-NeRF to any new styles by adding heads with small training costs (a few minutes). Extensive experiments on three real-world 3D scene datasets show that MM-NeRF achieves high-quality 3D multi-style stylization with multimodal guidance, keeps multi-view consistency, and keeps semantic consistency of multimodal style guidance. Codes will be released later.
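The Multi-Head Learning Scheme can be pictured as one lightweight color head per style on top of a shared NeRF body, so extending to a new style only needs a new head; the sketch below shows that structure with assumed sizes, omitting the NeRF itself.

```python
import torch
import torch.nn as nn

def color_head(feat_dim=64):
    return nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))

class MultiStyleColorHeads(nn.Module):
    def __init__(self, feat_dim=64, n_styles=3):
        super().__init__()
        self.feat_dim = feat_dim
        self.heads = nn.ModuleList(color_head(feat_dim) for _ in range(n_styles))

    def add_style(self):
        self.heads.append(color_head(self.feat_dim))  # new style, small extra cost

    def forward(self, feats, style_id):
        return torch.sigmoid(self.heads[style_id](feats))  # RGB in [0, 1]

heads = MultiStyleColorHeads()
rgb = heads(torch.randn(1024, 64), style_id=1)
heads.add_style()   # generalizing to a new style trains only this head
```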

Distribution-Aware Continual Test Time Adaptation for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.13604
  • repo_url: None
  • paper_authors: Jiayi Ni, Senqiao Yang, Jiaming Liu, Xiaoqi Li, Wenyu Jiao, Ran Xu, Zehui Chen, Yi Liu, Shanghang Zhang
  • for: proposes a distribution-aware tuning (DAT) method for efficient and practical continual test-time adaptation (CTTA) in semantic segmentation tasks.
  • methods: DAT adaptively selects and updates two small groups of trainable parameters based on the data distribution during continual adaptation: domain-specific parameters (DSP) and task-relevant parameters (TRP).
  • results: achieves promising performance compared to previous state-of-the-art methods on two widely used semantic segmentation CTTA benchmarks, mitigating error accumulation and catastrophic forgetting.
    Abstract Since autonomous driving systems usually face dynamic and ever-changing environments, continual test-time adaptation (CTTA) has been proposed as a strategy for transferring deployed models to continually changing target domains. However, the pursuit of long-term adaptation often introduces catastrophic forgetting and error accumulation problems, which impede the practical implementation of CTTA in the real world. Recently, existing CTTA methods mainly focus on utilizing a majority of parameters to fit target domain knowledge through self-training. Unfortunately, these approaches often amplify the challenge of error accumulation due to noisy pseudo-labels, and pose practical limitations stemming from the heavy computational costs associated with entire model updates. In this paper, we propose a distribution-aware tuning (DAT) method to make the semantic segmentation CTTA efficient and practical in real-world applications. DAT adaptively selects and updates two small groups of trainable parameters based on data distribution during the continual adaptation process, including domain-specific parameters (DSP) and task-relevant parameters (TRP). Specifically, DSP exhibits sensitivity to outputs with substantial distribution shifts, effectively mitigating the problem of error accumulation. In contrast, TRP are allocated to positions that are responsive to outputs with minor distribution shifts, which are fine-tuned to avoid the catastrophic forgetting problem. In addition, since CTTA is a temporal task, we introduce the Parameter Accumulation Update (PAU) strategy to collect the updated DSP and TRP in target domain sequences. We conduct extensive experiments on two widely-used semantic segmentation CTTA benchmarks, achieving promising performance compared to previous state-of-the-art methods.
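A toy sketch of the routing logic: rank candidate parameter groups by a distribution-shift score, update the most shift-sensitive groups aggressively (DSP), the least sensitive ones gently (TRP), and freeze everything else. The scoring rule and learning rates are assumptions, not the paper's procedure.

```python
import torch

def route_updates(named_grads, shift_scores, k_dsp=2, k_trp=2,
                  lr_dsp=1e-3, lr_trp=1e-4):
    ranked = sorted(shift_scores, key=shift_scores.get, reverse=True)
    dsp, trp = set(ranked[:k_dsp]), set(ranked[-k_trp:])
    updates = {}
    for name, g in named_grads.items():
        if name in dsp:
            updates[name] = -lr_dsp * g  # strong adaptation: fights error accumulation
        elif name in trp:
            updates[name] = -lr_trp * g  # gentle tuning: fights catastrophic forgetting
    return updates                       # every other parameter stays frozen

grads = {f"layer{i}": torch.randn(4) for i in range(6)}
scores = {f"layer{i}": float(i) for i in range(6)}   # stand-in shift scores
print(sorted(route_updates(grads, scores)))
```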

From Cluster Assumption to Graph Convolution: Graph-based Semi-Supervised Learning Revisited

  • paper_url: http://arxiv.org/abs/2309.13599
  • repo_url: None
  • paper_authors: Zheng Wang, Hongming Ding, Li Pan, Jianhua Li, Zhiguo Gong, Philip S. Yu
  • for: studying the relationship between graph-based semi-supervised learning (GSSL) and graph convolutional networks (GCNs), and proposing three graph convolution methods to improve GSSL performance.
  • methods: analyzes traditional GSSL methods and GCNs within a unified optimization framework; the three proposed methods are 1) OGC, a supervised method that guides the graph convolution process with labels; 2) GGC, an unsupervised method that aims to preserve graph structure information during convolution; and 3) GGCM, a multi-scale version of GGC.
  • results: extensive experiments demonstrate the effectiveness of all three proposed methods.
    Abstract Graph-based semi-supervised learning (GSSL) has long been a hot research topic. Traditional methods are generally shallow learners, based on the cluster assumption. Recently, graph convolutional networks (GCNs) have become the predominant techniques for their promising performance. In this paper, we theoretically discuss the relationship between these two types of methods in a unified optimization framework. One of the most intriguing findings is that, unlike traditional ones, typical GCNs may not jointly consider the graph structure and label information at each layer. Motivated by this, we further propose three simple but powerful graph convolution methods. The first is a supervised method OGC which guides the graph convolution process with labels. The others are two unsupervised methods: GGC and its multi-scale version GGCM, both aiming to preserve the graph structure information during the convolution process. Finally, we conduct extensive experiments to show the effectiveness of our methods.
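In this unified view, one graph convolution step is feature smoothing with the normalized adjacency; a loose sketch in the spirit of the supervised OGC variant interleaves smoothing with a label-driven correction on the training nodes. Everything below (the linear classifier, step size, and update rule) is an illustrative assumption.

```python
import numpy as np

def normalize_adj(a):
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(a.sum(1), 1e-12)))
    return d_inv_sqrt @ a @ d_inv_sqrt

def label_guided_smoothing(x, a, y_onehot, train_mask, steps=10, beta=0.1):
    s = normalize_adj(a)
    w = np.linalg.lstsq(x[train_mask], y_onehot[train_mask], rcond=None)[0]
    for _ in range(steps):
        x = s @ x                                  # one graph convolution step
        err = x[train_mask] @ w - y_onehot[train_mask]
        x[train_mask] -= beta * err @ w.T          # label-guided correction
    return x @ w                                   # class scores for all nodes

a = np.ones((6, 6))                                # toy fully connected graph
x = np.random.randn(6, 4)
y = np.eye(2)[[0, 0, 1, 1, 0, 1]]
mask = np.array([True, True, True, False, False, False])
scores = label_guided_smoothing(x, a, y, mask)
```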

Seeing Is Not Always Believing: Invisible Collision Attack and Defence on Pre-Trained Models

  • paper_url: http://arxiv.org/abs/2309.13579
  • repo_url: https://github.com/anonymous10240/framework
  • paper_authors: Minghang Deng, Zhong Zhang, Junming Shao
  • for: proposes a novel framework for an invisible attack on large-scale pre-trained models (PTMs) like BERT and GPT, which can manipulate the predictions of the models without being detected.
  • methods: the attack leverages the MD5 chosen-prefix collision to generate two equal-size models with the same MD5 checksum, which are then deployed on public websites to induce victims to download the poisoned model.
  • results: demonstrates the effectiveness and stealthiness of the proposed attack and defensive method on different models and datasets, and provides a theoretical justification for its feasibility.
    Abstract Large-scale pre-trained models (PTMs) such as BERT and GPT have achieved great success in diverse fields. The typical paradigm is to pre-train a big deep learning model on large-scale data sets, and then fine-tune the model on small task-specific data sets for downstream tasks. Although PTMs have rapidly progressed with wide real-world applications, they also pose significant risks of potential attacks. Existing backdoor attacks or data poisoning methods often build up the assumption that the attacker invades the computers of victims or accesses the target data, which is challenging in real-world scenarios. In this paper, we propose a novel framework for an invisible attack on PTMs with enhanced MD5 collision. The key idea is to generate two equal-size models with the same MD5 checksum by leveraging the MD5 chosen-prefix collision. Afterwards, the two ``same" models will be deployed on public websites to induce victims to download the poisoned model. Unlike conventional attacks on deep learning models, this new attack is flexible, covert, and model-independent. Additionally, we propose a simple defensive strategy for recognizing the MD5 chosen-prefix collision and provide a theoretical justification for its feasibility. We extensively validate the effectiveness and stealthiness of our proposed attack and defensive method on different models and data sets.
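A practical consequence of chosen-prefix collisions is that an MD5 checksum alone cannot authenticate a downloaded model; a simple check in the spirit of the proposed defense is to compare a second, collision-resistant digest.

```python
import hashlib

def digest(path, algo):
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def looks_like_md5_collision(path_a, path_b):
    """Identical MD5 but different SHA-256 means the MD5 match is meaningless."""
    return (digest(path_a, "md5") == digest(path_b, "md5")
            and digest(path_a, "sha256") != digest(path_b, "sha256"))
```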

Probabilistic Weight Fixing: Large-scale training of neural network weight uncertainties for quantization

  • paper_url: http://arxiv.org/abs/2309.13575
  • repo_url: https://github.com/subiawaud/PWFN
  • paper_authors: Christopher Subia-Waud, Srinandan Dasmahapatra
  • for: reducing inference time and energy cost of large neural networks by constraining their weights to a limited set of values.
  • methods: uses Bayesian neural networks (BNNs) and a variational relaxation to identify which weights can be moved to which cluster centre, and to what degree, based on their individual position-specific learned uncertainty distributions.
  • results: achieves higher compression and accuracy than prior methods, especially on DeiT-Tiny and other transformer models; on ImageNet, the method represents the 5 million+ weights with only 296 unique values while improving top-1 accuracy by 1.6% over the previous state of the art.
    Abstract Weight-sharing quantization has emerged as a technique to reduce energy expenditure during inference in large neural networks by constraining their weights to a limited set of values. However, existing methods for weight-sharing quantization often make assumptions about the treatment of weights based on value alone that neglect the unique role weight position plays. This paper proposes a probabilistic framework based on Bayesian neural networks (BNNs) and a variational relaxation to identify which weights can be moved to which cluster centre and to what degree based on their individual position-specific learned uncertainty distributions. We introduce a new initialisation setting and a regularisation term which allow for the training of BNNs under complex dataset-model combinations. By leveraging the flexibility of weight values captured through a probability distribution, we enhance noise resilience and downstream compressibility. Our iterative clustering procedure demonstrates superior compressibility and higher accuracy compared to state-of-the-art methods on both ResNet models and the more complex transformer-based architectures. In particular, our method outperforms the state-of-the-art quantization method top-1 accuracy by 1.6% on ImageNet using DeiT-Tiny, with its 5 million+ weights now represented by only 296 unique values.
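The clustering step can be approximated with uncertainty-weighted k-means: weights with low learned uncertainty pull cluster centres harder, while uncertain weights move more freely. This stands in for the paper's iterative BNN-based procedure, and the sigma values here are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_weights(w, sigma, n_values=16):
    w = w.reshape(-1, 1)
    km = KMeans(n_clusters=n_values, n_init=10, random_state=0)
    km.fit(w, sample_weight=1.0 / (sigma.ravel() + 1e-8))  # certain weights anchor centres
    return km.cluster_centers_[km.predict(w)].ravel()

w = np.random.randn(10_000)
sigma = 0.1 * np.abs(np.random.randn(10_000))   # synthetic per-weight uncertainty
w_q = quantize_weights(w, sigma)
print(len(np.unique(w_q)))                      # 16 unique shared values
```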

Keeping in Time: Adding Temporal Context to Sentiment Analysis Models

  • paper_url: http://arxiv.org/abs/2309.13562
  • repo_url: None
  • paper_authors: Dean Ninalga
  • for: improving and preserving the performance of sentiment analysis models across shorter and longer time periods.
  • methods: feeds date-prefixed textual inputs to the model and uses self-labeling on unlabeled data to train a student model, with a novel date-formatting augmentation strategy to scale the self-labeling process.
  • results: achieves the best Relative Performance Drop (RPD) of -0.0656 on the LongEval-Classification evaluation set and ranks 2nd overall with a score of 0.6923.
    Abstract This paper presents a state-of-the-art solution to the LongEval CLEF 2023 Lab Task 2: LongEval-Classification. The goal of this task is to improve and preserve the performance of sentiment analysis models across shorter and longer time periods. Our framework feeds date-prefixed textual inputs to a pre-trained language model, where the timestamp is included in the text. We show date-prefixed samples better conditions model outputs on the temporal context of the respective texts. Moreover, we further boost performance by performing self-labeling on unlabeled data to train a student model. We augment the self-labeling process using a novel augmentation strategy leveraging the date-prefixed formatting of our samples. We demonstrate concrete performance gains on the LongEval-Classification evaluation set over non-augmented self-labeling. Our framework achieves a 2nd place ranking with an overall score of 0.6923 and reports the best Relative Performance Drop (RPD) of -0.0656 over the short evaluation set.
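The two ingredients are easy to picture in code: a date-prefix template for inputs, and a confidence-thresholded self-labeling pass that turns unlabeled posts into training data for the student. The template, threshold, and sklearn-style `predict_proba` teacher are assumptions.

```python
def with_date(text, date):                 # e.g., date = "2021-08-14"
    return f"{date} {text}"

def self_label(teacher, unlabeled, threshold=0.9):
    """Keep confident teacher predictions as pseudo-labels for the student."""
    pseudo = []
    for date, text in unlabeled:
        probs = teacher.predict_proba([with_date(text, date)])[0]
        if probs.max() >= threshold:
            pseudo.append((with_date(text, date), int(probs.argmax())))
    return pseudo
```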

Cordyceps@LT-EDI: Patching Language-Specific Homophobia/Transphobia Classifiers with a Multilingual Understanding

  • paper_url: http://arxiv.org/abs/2309.13561
  • repo_url: None
  • paper_authors: Dean Ninalga
  • for: detecting homophobic and transphobic language in social media comments, improving detection rate and accuracy.
  • methods: combines multilingual (M-L) and language-specific (L-S) approaches through simple weight interpolation, in a way that is interpretable and data-driven.
  • results: achieves the best results in three of five languages on task A of the 'Shared Task on Homophobia/Transphobia Detection in social media comments', with a 0.997 macro F1-score on Malayalam texts.
    Abstract Detecting transphobia, homophobia, and various other forms of hate speech is difficult. Signals can vary depending on factors such as language, culture, geographical region, and the particular online platform. Here, we present a joint multilingual (M-L) and language-specific (L-S) approach to homophobia and transphobic hate speech detection (HSD). M-L models are needed to catch words, phrases, and concepts that are less common or missing in a particular language and subsequently overlooked by L-S models. Nonetheless, L-S models are better situated to understand the cultural and linguistic context of the users who typically write in a particular language. Here we construct a simple and successful way to merge the M-L and L-S approaches through simple weight interpolation in such a way that is interpretable and data-driven. We demonstrate our system on task A of the 'Shared Task on Homophobia/Transphobia Detection in social media comments' dataset for homophobia and transphobic HSD. Our system achieves the best results in three of five languages and achieves a 0.997 macro average F1-score on Malayalam texts.
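Weight interpolation between two models with identical architectures is a one-liner per tensor, as sketched below; `lam` would be tuned on held-out data, which is what makes the merge data-driven. This assumes matching, floating-point state dicts.

```python
import torch

def interpolate_state_dicts(ml_model, ls_model, lam=0.5):
    """lam = 1.0 keeps the language-specific model, 0.0 the multilingual one."""
    ml_sd, ls_sd = ml_model.state_dict(), ls_model.state_dict()
    return {k: lam * ls_sd[k] + (1 - lam) * ml_sd[k] for k in ml_sd}

ml, ls = torch.nn.Linear(8, 2), torch.nn.Linear(8, 2)   # same architecture
ls.load_state_dict(interpolate_state_dicts(ml, ls, lam=0.7))
```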

Decoding Radiologists Intense Focus for Accurate CXR Diagnoses: A Controllable and Interpretable AI System

  • paper_url: http://arxiv.org/abs/2309.13550
  • repo_url: None
  • paper_authors: Trong Thang Pham, Jacob Brecheisen, Anh Nguyen, Hien Nguyen, Ngan Le
  • for: proposing a controllable and interpretable pipeline for chest X-ray (CXR) diagnosis that helps decode the cognitive process radiologists follow during interpretation.
  • methods: uses a vision-language model to precisely control the interpretation process while masking out irrelevant features.
  • results: extensive experiments show the method classifies accurately using only a portion of the CXR.
    Abstract In the field of chest X-ray (CXR) diagnosis, existing works often focus solely on determining where a radiologist looks, typically through tasks such as detection, segmentation, or classification. However, these approaches are often designed as black-box models, lacking interpretability. In this paper, we introduce a novel and unified controllable interpretable pipeline for decoding the intense focus of radiologists in CXR diagnosis. Our approach addresses three key questions: where a radiologist looks, how long they focus on specific areas, and what findings they diagnose. By capturing the intensity of the radiologist's gaze, we provide a unified solution that offers insights into the cognitive process underlying radiological interpretation. Unlike current methods that rely on black-box machine learning models, which can be prone to extracting erroneous information from the entire input image during the diagnosis process, we tackle this issue by effectively masking out irrelevant information. Our approach leverages a vision-language model, allowing for precise control over the interpretation process while ensuring the exclusion of irrelevant features. To train our model, we utilize an eye gaze dataset to extract anatomical gaze information and generate ground truth heatmaps. Through extensive experimentation, we demonstrate the efficacy of our method. We showcase that the attention heatmaps, designed to mimic radiologists' focus, encode sufficient and relevant information, enabling accurate classification tasks using only a portion of CXR.
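One way to read the masking idea: zero out image regions whose gaze-derived heatmap intensity falls below a threshold before feeding the classifier, so irrelevant features are excluded. The thresholding rule below is an assumption.

```python
import torch

def mask_by_heatmap(image, heatmap, keep_quantile=0.7):
    """image: (C, H, W); heatmap: (H, W) gaze intensity in [0, 1]."""
    thresh = torch.quantile(heatmap.flatten(), keep_quantile)
    return image * (heatmap >= thresh).to(image.dtype)

masked = mask_by_heatmap(torch.rand(1, 224, 224), torch.rand(224, 224))
```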

  • paper_url: http://arxiv.org/abs/2309.13544
  • repo_url: None
  • paper_authors: Rahul Singh, Pranav Kanuparthi
  • for: presenting a distributed machine learning (ML) pipeline that takes a subset of songs as input and produces a new subset of songs identified as similar, using the Million Songs Dataset (MSD).
  • methods: uses a distributed ML pipeline for audio track analysis and recommendation over the MSD.
  • results: shows that the distributed ML pipeline provides an efficient recommendation system and supports large-scale audio track analysis and recommendation over the MSD.
    Abstract Machine Learning models are being utilized extensively to drive recommender systems, which is a widely explored topic today. This is especially true of the music industry, where we are witnessing a surge in growth. Besides a large chunk of active users, these systems are fueled by massive amounts of data. These large-scale systems yield applications that aim to provide a better user experience and to keep customers actively engaged. In this paper, a distributed Machine Learning (ML) pipeline is delineated, which is capable of taking a subset of songs as input and producing a new subset of songs identified as being similar to the inputted subset. The publicly accessible Million Songs Dataset (MSD) enables researchers to develop and explore reasonably efficient systems for audio track analysis and recommendations, without having to access a commercialized music platform. The objective of the proposed application is to leverage an ML system trained to optimally recommend songs that a user might like.
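A single-machine stand-in for the similarity stage of such a pipeline: represent each track by a feature vector, take the centroid of the input subset, and return the nearest tracks by cosine similarity. The feature choice and distributed execution (e.g., over Spark) are out of scope here.

```python
import numpy as np

def recommend(features, seed_ids, top_k=5):
    """features: (n_tracks, d) audio-feature matrix; seed_ids: input subset."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    query = x[seed_ids].mean(axis=0)
    scores = x @ (query / np.linalg.norm(query))
    scores[seed_ids] = -np.inf               # never recommend the seeds back
    return np.argsort(scores)[::-1][:top_k]

recs = recommend(np.random.rand(1000, 12), seed_ids=[3, 7, 42])
```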

Human Transcription Quality Improvement

  • paper_url: http://arxiv.org/abs/2309.14372
  • repo_url: https://github.com/GenerateAI/LibriCrowd
  • paper_authors: Jian Gao, Hanbo Sun, Cheng Cao, Zheng Du
  • for: improving the quality of training data for automatic speech recognition (ASR) systems.
  • methods: proposes a reliable training-data collection method with confidence-estimation-based reprocessing at the labeling stage and automatic word error correction at the post-labeling stage.
  • results: experiments show the transcription WER on 100 hours of English speech is reduced by over 50%; further study finds transcription errors correlate strongly with ASR model performance, and the quality improvement yields over 10% relative WER reduction for ASR models. The dataset and code are released to benefit the research community.
    Abstract High quality transcription data is crucial for training automatic speech recognition (ASR) systems. However, the existing industry-level data collection pipelines are expensive to researchers, while the quality of crowdsourced transcription is low. In this paper, we propose a reliable method to collect speech transcriptions. We introduce two mechanisms to improve transcription quality: confidence estimation based reprocessing at labeling stage, and automatic word error correction at post-labeling stage. We collect and release LibriCrowd - a large-scale crowdsourced dataset of audio transcriptions on 100 hours of English speech. Experiment shows the Transcription WER is reduced by over 50%. We further investigate the impact of transcription error on ASR model performance and found a strong correlation. The transcription quality improvement provides over 10% relative WER reduction for ASR models. We release the dataset and code to benefit the research community.
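Word error rate (WER), the metric quoted above, is word-level edit distance normalized by reference length; a small reference implementation for reproducing such comparisons:

```python
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1): d[i][0] = i          # deletions
    for j in range(len(h) + 1): d[0][j] = j          # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat", "the cat sat down"))   # 1 insertion / 3 words ≈ 0.33
```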

Speech enhancement with frequency domain auto-regressive modeling

  • paper_url: http://arxiv.org/abs/2309.13537
  • repo_url: None
  • paper_authors: Anurenjan Purushothaman, Debottam Dutta, Rohit Kumar, Sriram Ganapathy
  • for: improving speech quality and automatic speech recognition (ASR) performance in far-field real-world settings.
  • methods: applies an AR model to decompose sub-band speech signals into envelope and carrier parts, and uses a dual-path long short-term memory (DPLSTM) model to enhance the sub-band components.
  • results: on the REVERB challenge dataset and the VOiCES dataset, jointly learning the speech dereverberation network and the E2E ASR model yields significant gains over the baseline (average relative improvements of 10-24%), and subjective listening tests confirm improved audio quality.
    Abstract Speech applications in far-field real world settings often deal with signals that are corrupted by reverberation. The task of dereverberation constitutes an important step to improve the audible quality and to reduce the error rates in applications like automatic speech recognition (ASR). We propose a unified framework of speech dereverberation for improving the speech quality and the ASR performance using the approach of envelope-carrier decomposition provided by an autoregressive (AR) model. The AR model is applied in the frequency domain of the sub-band speech signals to separate the envelope and carrier parts. A novel neural architecture based on dual path long short term memory (DPLSTM) model is proposed, which jointly enhances the sub-band envelope and carrier components. The dereverberated envelope-carrier signals are modulated and the sub-band signals are synthesized to reconstruct the audio signal back. The DPLSTM model for dereverberation of envelope and carrier components also allows the joint learning of the network weights for the down stream ASR task. In the ASR tasks on the REVERB challenge dataset as well as on the VOiCES dataset, we illustrate that the joint learning of speech dereverberation network and the E2E ASR model yields significant performance improvements over the baseline ASR system trained on log-mel spectrogram as well as other benchmarks for dereverberation (average relative improvements of 10-24% over the baseline system). The speech quality improvements, evaluated using subjective listening tests, further highlight the improved quality of the reconstructed audio.
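Applying linear prediction in the frequency domain (to a DCT of the sub-band signal) yields a smooth temporal envelope, with the inverse-filtered residual acting as the carrier; the rough single-band sketch below illustrates that envelope-carrier split, with the AR order and normalization as assumptions.

```python
import numpy as np
from scipy.fft import dct

def fdlp_envelope(x, order=40):
    c = dct(x, norm="ortho")                       # frequency-domain signal
    r = np.correlate(c, c, "full")[len(c) - 1 : len(c) + order]
    toep = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.concatenate(([1.0], -np.linalg.solve(toep, r[1 : order + 1])))
    return 1.0 / np.abs(np.fft.rfft(a, 2 * len(x)))[: len(x)]  # temporal envelope

x = np.random.randn(800)                           # one sub-band segment
env = fdlp_envelope(x)
carrier = x / np.maximum(env, 1e-8)                # envelope * carrier ≈ x
```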

Iterative Reachability Estimation for Safe Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.13528
  • repo_url: None
  • paper_authors: Milan Ganai, Zheng Gong, Chenning Yu, Sylvia Herbert, Sicun Gao
  • for: providing a new safety-constrained reinforcement learning (RL) framework to ensure the safety of RL in real-world applications.
  • methods: proposes a novel reachability estimation function for safety-constrained RL in general stochastic settings.
  • results: experiments on a suite of safe RL environments show the algorithms improve both reward performance and safety.
    Abstract Ensuring safety is important for the practical deployment of reinforcement learning (RL). Various challenges must be addressed, such as handling stochasticity in the environments, providing rigorous guarantees of persistent state-wise safety satisfaction, and avoiding overly conservative behaviors that sacrifice performance. We propose a new framework, Reachability Estimation for Safe Policy Optimization (RESPO), for safety-constrained RL in general stochastic settings. In the feasible set where there exist violation-free policies, we optimize for rewards while maintaining persistent safety. Outside this feasible set, our optimization produces the safest behavior by guaranteeing entrance into the feasible set whenever possible with the least cumulative discounted violations. We introduce a class of algorithms using our novel reachability estimation function to optimize in our proposed framework and in similar frameworks such as those concurrently handling multiple hard and soft constraints. We theoretically establish that our algorithms almost surely converge to locally optimal policies of our safe optimization framework. We evaluate the proposed methods on a diverse suite of safe RL environments from Safety Gym, PyBullet, and MuJoCo, and show the benefits in improving both reward performance and safety compared with state-of-the-art baselines.

Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction

  • paper_url: http://arxiv.org/abs/2309.13524
  • repo_url: https://github.com/river-zhang/gta
  • paper_authors: Zechuan Zhang, Li Sun, Zongxin Yang, Ling Chen, Yi Yang
  • for: reconstruction of 3D clothed human avatars from single images
  • methods: transformer-based architecture with global-correlated image features and 3D-decoupling decoder with cross-attention and learnable embeddings
  • results: outperforms state-of-the-art approaches in both geometry and texture reconstruction, with high robustness to challenging poses and loose clothing, and produces higher-resolution textures.
    Abstract Reconstructing 3D clothed human avatars from single images is a challenging task, especially when encountering complex poses and loose clothing. Current methods exhibit limitations in performance, largely attributable to their dependence on insufficient 2D image features and inconsistent query methods. Owing to this, we present the Global-correlated 3D-decoupling Transformer for clothed Avatar reconstruction (GTA), a novel transformer-based architecture that reconstructs clothed human avatars from monocular images. Our approach leverages transformer architectures by utilizing a Vision Transformer model as an encoder for capturing global-correlated image features. Subsequently, our innovative 3D-decoupling decoder employs cross-attention to decouple tri-plane features, using learnable embeddings as queries for cross-plane generation. To effectively enhance feature fusion with the tri-plane 3D feature and human body prior, we propose a hybrid prior fusion strategy combining spatial and prior-enhanced queries, leveraging the benefits of spatial localization and human body prior knowledge. Comprehensive experiments on CAPE and THuman2.0 datasets illustrate that our method outperforms state-of-the-art approaches in both geometry and texture reconstruction, exhibiting high robustness to challenging poses and loose clothing, and producing higher-resolution textures. Codes will be available at https://github.com/River-Zhang/GTA.

Cordyceps@LT-EDI: Depression Detection with Reddit and Self-training

  • paper_url: http://arxiv.org/abs/2310.01418
  • repo_url: None
  • paper_authors: Dean Ninalga
  • for: depression is debilitating and common, and excessive social media use correlates with depression, ADHD, and other mental health concerns; given such a large population, there are many potentially undiagnosed users and posts they create. This paper proposes a depression severity detection system that uses semi-supervised learning to predict whether a user is experiencing severe, moderate, or low (non-diagnostic) levels of depression.
  • methods: uses a trained model to classify a large number of unlabeled social media posts, then uses the generated labels to train a more powerful classifier.
  • results: the framework ranks 3rd overall on the LT-EDI@RANLP 2023 shared task for detecting signs of depression.
    Abstract Depression is debilitating, and not uncommon. Indeed, studies of excessive social media users show correlations with depression, ADHD, and other mental health concerns. Given that there is a large number of people with excessive social media usage, then there is a significant population of potentially undiagnosed users and posts that they create. In this paper, we propose a depression severity detection system using a semi-supervised learning technique to predict if a post is from a user who is experiencing severe, moderate, or low (non-diagnostic) levels of depression. Namely, we use a trained model to classify a large number of unlabelled social media posts from Reddit, then use these generated labels to train a more powerful classifier. We demonstrate our framework on Detecting Signs of Depression from Social Media Text - LT-EDI@RANLP 2023 shared task, where our framework ranks 3rd overall.

Object Classification Model Using Ensemble Learning with Gray-Level Co-Occurrence Matrix and Histogram Extraction

  • paper_url: http://arxiv.org/abs/2309.13512
  • repo_url: None
  • paper_authors: Florentina Tatrin Kurniati, Daniel HF Manongga, Eko Sediyono, Sri Yulianto Joko Prasetyo, Roy Rudolf Huizen
  • for: developing an accurate object classification method to better recognize and distinguish different objects.
  • methods: uses a voting method and a combined classifier built from Random Forest, K-NN, Decision Tree, SVM, and Naive Bayes.
  • results: tests show both approaches perform well: ensemble voting reaches 92.4% accuracy, 78.6% precision, 95.2% recall, and an 86.1% F1-score; the combined classifier reaches 99.3% accuracy, 97.6% precision, 100% recall, and a 98.8% F1-score, confirming that voting and combined classifiers increase classification accuracy.
    Abstract In the field of object classification, identification in the presence of object variations is a challenge in itself. Variations include shape, size, color, and texture, which can cause problems in recognizing and distinguishing objects accurately. The purpose of this research is to develop a classification method so that objects can be accurately identified. The proposed classification model uses Voting and a Combined Classifier, with Random Forest, K-NN, Decision Tree, SVM, and Naive Bayes classification methods. The test results show that both the voting method and the Combined Classifier obtain good results: ensemble voting achieves an accuracy of 92.4%, 78.6% precision, 95.2% recall, and an 86.1% F1-score, while the Combined Classifier achieves an accuracy of 99.3%, a precision of 97.6%, a recall of 100%, and a 98.8% F1-score. Based on the test results, it can be concluded that the use of the Combined Classifier and voting methods is proven to increase the accuracy value. The contribution of this research is to increase the effectiveness of the Ensemble Learning method, especially the voting ensemble method and the Combined Classifier, in increasing the accuracy of object classification in image processing.
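The voting stage maps directly onto scikit-learn's VotingClassifier over the five listed methods; GLCM and histogram feature extraction is assumed to have already produced the feature matrix X (random stand-in below).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(200, 16)                 # stand-in GLCM/histogram features
y = np.random.randint(0, 3, 200)
ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier()),
                ("knn", KNeighborsClassifier()),
                ("dt", DecisionTreeClassifier()),
                ("svm", SVC(probability=True)),
                ("nb", GaussianNB())],
    voting="soft",                          # average predicted probabilities
).fit(X, y)
print(ensemble.predict(X[:5]))
```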

Natural Language based Context Modeling and Reasoning with LLMs: A Tutorial

  • paper_url: http://arxiv.org/abs/2309.15074
  • repo_url: None
  • paper_authors: Haoyi Xiong, Jiang Bian, Sijia Yang, Xiaofei Zhang, Linghe Kong, Daqing Zhang
  • for: exploring the application of large language models (LLMs) in context-aware computing, using natural language for context modeling and context reasoning.
  • methods: builds on AI techniques such as Ontology and OWL for context modeling and reasoning, and uses natural language with LLMs such as ChatGPT and GPT-4, via texts, prompts, and autonomous agents (AutoAgents), to represent users' requests and contexts.
  • results: demonstrates the feasibility of LLM-driven Context-aware Computing (LCaC) in two showcases: operating a mobile z-arm for assisted living and planning a trip in a context-aware, personalized manner.
    Abstract Large language models (LLMs) have become phenomenally surging, since 2018--two decades after introducing context-awareness into computing systems. Through taking into account the situations of ubiquitous devices, users and the societies, context-aware computing has enabled a wide spectrum of innovative applications, such as assisted living, location-based social network services and so on. To recognize contexts and make decisions for actions accordingly, various artificial intelligence technologies, such as Ontology and OWL, have been adopted as representations for context modeling and reasoning. Recently, with the rise of LLMs and their improved natural language understanding and reasoning capabilities, it has become feasible to model contexts using natural language and perform context reasoning by interacting with LLMs such as ChatGPT and GPT-4. In this tutorial, we demonstrate the use of texts, prompts, and autonomous agents (AutoAgents) that enable LLMs to perform context modeling and reasoning without requiring fine-tuning of the model. We organize and introduce works in the related field, and name this computing paradigm as the LLM-driven Context-aware Computing (LCaC). In the LCaC paradigm, users' requests, sensors reading data, and the command to actuators are supposed to be represented as texts. Given the text of users' request and sensor data, the AutoAgent models the context by prompting and sends to the LLM for context reasoning. LLM generates a plan of actions and responds to the AutoAgent, which later follows the action plan to foster context-awareness. To prove the concepts, we use two showcases--(1) operating a mobile z-arm in an apartment for assisted living, and (2) planning a trip and scheduling the itinerary in a context-aware and personalized manner.
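A minimal sketch of the LCaC loop: sensor readings, the user request, and actuator commands are all text; an AutoAgent assembles them into a prompt and asks an LLM for an action plan. The prompt wording is an assumption, and `llm` is a placeholder for any chat-model call (stubbed here).

```python
def auto_agent(user_request, sensor_readings, llm):
    prompt = (
        "Context (sensor readings, as text):\n"
        + "\n".join(f"- {s}" for s in sensor_readings)
        + f"\n\nUser request: {user_request}\n"
        + "Respond with a numbered plan of actuator commands."
    )
    plan = llm(prompt)                  # the LLM performs the context reasoning
    return [line for line in plan.splitlines() if line.strip()]

plan = auto_agent(
    "I want to read in bed",
    ["bedroom illuminance: 5 lux", "user location: bedroom"],
    llm=lambda p: "1. turn_on(bedroom_lamp)\n2. set_brightness(60%)",  # stub
)
```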

Guided Cooperation in Hierarchical Reinforcement Learning via Model-based Rollout

  • paper_url: http://arxiv.org/abs/2309.13508
  • repo_url: https://github.com/haoranwang-tj/gcmr_aclg_official
  • paper_authors: Haoran Wang, Yaoru Sun, Fang Wang, Yeming Chen
  • for: proposing a goal-conditioned hierarchical reinforcement learning (HRL) framework for effective exploration in complex long-horizon RL tasks.
  • methods: Guided Cooperation via Model-based Rollout (GCMR) estimates forward dynamics to promote inter-level cooperation, complemented by a one-step rollout-based planning scheme.
  • results: experiments show that combining the GCMR framework with ACLG, a disentangled variant of HIGL, yields more stable and robust policy improvement than baselines and previous state-of-the-art (SOTA) HRL algorithms.
    Abstract Goal-conditioned hierarchical reinforcement learning (HRL) presents a promising approach for enabling effective exploration in complex long-horizon reinforcement learning (RL) tasks via temporal abstraction. Yet, most goal-conditioned HRL algorithms focused on the subgoal discovery, regardless of inter-level coupling. In essence, for hierarchical systems, the increased inter-level communication and coordination can induce more stable and robust policy improvement. Here, we present a goal-conditioned HRL framework with Guided Cooperation via Model-based Rollout (GCMR), which estimates forward dynamics to promote inter-level cooperation. The GCMR alleviates the state-transition error within off-policy correction through a model-based rollout, further improving the sample efficiency. Meanwhile, to avoid being disrupted by these corrected but possibly unseen or faraway goals, lower-level Q-function gradients are constrained using a gradient penalty with a model-inferred upper bound, leading to a more stable behavioral policy. Besides, we propose a one-step rollout-based planning to further facilitate inter-level cooperation, where the higher-level Q-function is used to guide the lower-level policy by estimating the value of future states so that global task information is transmitted downwards to avoid local pitfalls. Experimental results demonstrate that incorporating the proposed GCMR framework with ACLG, a disentangled variant of HIGL, yields more stable and robust policy improvement than baselines and substantially outperforms previous state-of-the-art (SOTA) HRL algorithms in both hard-exploration problems and robotic control.

cs.CL - 2023-09-24

Text Classification: A Perspective of Deep Learning Methods

  • paper_url: http://arxiv.org/abs/2309.13761
  • repo_url: https://github.com/brijkishorsoni1210/Car-logo-classification
  • paper_authors: Zhongwei Wan
  • for: surveying the application of deep learning methods to text classification, toward more accurate and efficient classification.
  • methods: introduces deep-learning-based text classification algorithms, including key steps such as feature extraction, feature reduction, and evaluation strategies and methods.
  • results: compares and summarizes a range of deep learning text classification methods to guide the choice of method for practical applications.
    Abstract In recent years, with the rapid development of information on the Internet, the number of complex texts and documents has increased exponentially, which requires a deeper understanding of deep learning methods in order to accurately classify texts using deep learning techniques, and thus deep learning methods have become increasingly important in text classification. Text classification is a class of tasks that automatically classifies a set of documents into multiple predefined categories based on their content and subject matter. Thus, the main goal of text classification is to enable users to extract information from textual resources and process processes such as retrieval, classification, and machine learning techniques together in order to classify different categories. Many new techniques of deep learning have already achieved excellent results in natural language processing. The success of these learning algorithms relies on their ability to understand complex models and non-linear relationships in data. However, finding the right structure, architecture, and techniques for text classification is a challenge for researchers. This paper introduces deep learning-based text classification algorithms, including important steps required for text classification tasks such as feature extraction, feature reduction, and evaluation strategies and methods. At the end of the article, different deep learning text classification methods are compared and summarized.

Does the “most sinfully decadent cake ever” taste good? Answering Yes/No Questions from Figurative Contexts

  • paper_url: http://arxiv.org/abs/2309.13748
  • repo_url: None
  • paper_authors: Geetanjali Rakshit, Jeffrey Flanigan
  • for: investigate the robustness of Question Answering (QA) models on figurative text
  • methods: use yes/no questions with figurative and non-figurative contexts to test the models’ ability to understand figurative language
  • results: state-of-the-art BERT-based QA models perform poorly on figurative contexts, but models like GPT-3 and ChatGPT can handle them better, and further performance gains can be achieved by automatically simplifying the figurative contexts.
    Abstract Figurative language is commonplace in natural language, and while making communication memorable and creative, can be difficult to understand. In this work, we investigate the robustness of Question Answering (QA) models on figurative text. Yes/no questions, in particular, are a useful probe of figurative language understanding capabilities of large language models. We propose FigurativeQA, a set of 1000 yes/no questions with figurative and non-figurative contexts, extracted from the domains of restaurant and product reviews. We show that state-of-the-art BERT-based QA models exhibit an average performance drop of up to 15\% points when answering questions from figurative contexts, as compared to non-figurative ones. While models like GPT-3 and ChatGPT are better at handling figurative texts, we show that further performance gains can be achieved by automatically simplifying the figurative contexts into their non-figurative (literal) counterparts. We find that the best overall model is ChatGPT with chain-of-thought prompting to generate non-figurative contexts. Our work provides a promising direction for building more robust QA models with figurative language understanding capabilities.
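The best-performing recipe above can be sketched as two LLM calls: first prompt for a step-by-step literal rewrite of the figurative context, then answer the yes/no question against that rewrite. The prompt wording is an assumption, and `llm` stands in for a chat-model call (stubbed here).

```python
def answer_figurative(context, question, llm):
    literal = llm(
        "Rewrite the text below without figurative language. "
        f"Think step by step, then give only the literal version.\nText: {context}"
    )
    return llm(f"Context: {literal}\nQuestion: {question}\nAnswer yes or no:")

ans = answer_figurative(
    "This is the most sinfully decadent cake ever!",
    "Does the cake taste good?",
    llm=lambda p: "The cake is extremely rich and tasty." if "Rewrite" in p else "yes",
)
```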

Multiple Relations Classification using Imbalanced Predictions Adaptation

  • paper_url: http://arxiv.org/abs/2309.13718
  • repo_url: https://github.com/sa5r/mrca
  • paper_authors: Sakher Khalil Alqaaidi, Elika Bozorgi, Krzysztof J. Kochut
  • for: handling multiple relations per sentence in the relation classification task.
  • methods: the model tackles the imbalanced predictions problem through a customized output architecture and by exploiting additional input features.
  • results: shows significant improvements on commonly used datasets, especially under imbalanced predictions.
    Abstract The relation classification task assigns the proper semantic relation to a pair of subject and object entities; the task plays a crucial role in various text mining applications, such as knowledge graph construction and entities interaction discovery in biomedical text. Current relation classification models employ additional procedures to identify multiple relations in a single sentence. Furthermore, they overlook the imbalanced predictions pattern. The pattern arises from the presence of a few valid relations that need positive labeling in a relatively large predefined relations set. We propose a multiple relations classification model that tackles these issues through a customized output architecture and by exploiting additional input features. Our findings suggest that handling the imbalanced predictions leads to significant improvements, even on a modest training design. The results demonstrate superiority performance on benchmark datasets commonly used in relation classification. To the best of our knowledge, this work is the first that recognizes the imbalanced predictions within the relation classification task.
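The imbalance the paper targets arises because only a few of the many predefined relations are positive for any given sentence, so an unweighted multi-label loss is dominated by negatives. Below is a generic sketch of one common remedy, a per-label positive weight in the loss; the relation-set size, weight value, and linear head are illustrative assumptions, not the paper's customized output architecture.

```python
# Sketch: up-weighting rare positive labels in multi-label relation
# classification with BCEWithLogitsLoss(pos_weight=...).
import torch
import torch.nn as nn

R = 96               # size of the predefined relation set (illustrative)
hidden = 256         # encoder output dimension (illustrative)

head = nn.Linear(hidden, R)   # one logit per candidate relation
# If roughly 2 of 96 relations are positive per sentence, negatives
# outnumber positives about 47:1; reflect that ratio in the loss.
pos_weight = torch.full((R,), 47.0)
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

features = torch.randn(8, hidden)        # stand-in sentence encodings
targets = torch.zeros(8, R)
targets[torch.arange(8), torch.randint(0, R, (8,))] = 1.0  # sparse positives

loss = loss_fn(head(features), targets)
loss.backward()
print(loss.item())
```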

MentaLLaMA: Interpretable Mental Health Analysis on Social Media with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.13567
  • repo_url: https://github.com/stevekgyang/mentallama
  • paper_authors: Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Sophia Ananiadou, Jimin Huang
  • for: aims to provide interpretable mental health analysis on social media, using large language models to deliver detailed explanations alongside their predictions.
  • methods: uses ChatGPT with expert-written few-shot prompts to generate explainable responses, which form an instruction dataset for fine-tuning.
  • results: MentalLLaMA approaches state-of-the-art discriminative methods in correctness and generates high-quality explanations.
    Abstract With the development of web technology, social media texts are becoming a rich source for automatic mental health analysis. As traditional discriminative methods bear the problem of low interpretability, the recent large language models have been explored for interpretable mental health analysis on social media, which aims to provide detailed explanations along with predictions. The results show that ChatGPT can generate approaching-human explanations for its correct classifications. However, LLMs still achieve unsatisfactory classification performance in a zero-shot/few-shot manner. Domain-specific finetuning is an effective solution, but faces 2 challenges: 1) lack of high-quality training data. 2) no open-source LLMs for interpretable mental health analysis were released to lower the finetuning cost. To alleviate these problems, we build the first multi-task and multi-source interpretable mental health instruction (IMHI) dataset on social media, with 105K data samples. The raw social media data are collected from 10 existing sources covering 8 mental health analysis tasks. We use expert-written few-shot prompts and collected labels to prompt ChatGPT and obtain explanations from its responses. To ensure the reliability of the explanations, we perform strict automatic and human evaluations on the correctness, consistency, and quality of generated data. Based on the IMHI dataset and LLaMA2 foundation models, we train MentalLLaMA, the first open-source LLM series for interpretable mental health analysis with instruction-following capability. We also evaluate the performance of MentalLLaMA on the IMHI evaluation benchmark with 10 test sets, where their correctness for making predictions and the quality of explanations are examined. The results show that MentalLLaMA approaches state-of-the-art discriminative methods in correctness and generates high-quality explanations.
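To make the instruction-tuning setup concrete, here is a hedged sketch of what one IMHI-style training record might look like: a task instruction, the social media post, and a target that couples the label with an explanation. The field names and wording are illustrative assumptions; the actual IMHI schema may differ.

```python
# Sketch: building an instruction-tuning record that pairs a mental health
# label with its explanation. All field names here are hypothetical.
import json

def make_record(post: str, label: str, explanation: str) -> dict:
    return {
        "instruction": (
            "Decide whether the poster shows signs of stress and "
            "explain your reasoning."
        ),
        "input": post,
        "output": f"Label: {label}\nReasoning: {explanation}",
    }

record = make_record(
    post="Deadlines piling up and I can't sleep anymore.",
    label="stress",
    explanation="Sleep disruption tied to workload is a common stress signal.",
)
print(json.dumps(record, indent=2))
```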

Substituting Data Annotation with Balanced Updates and Collective Loss in Multi-label Text Classification

  • paper_url: http://arxiv.org/abs/2309.13543
  • repo_url: None
  • paper_authors: Muberra Ozmen, Joseph Cotnareanu, Mark Coates
  • for: addresses the multi-label text classification (MLTC) task in settings with little or no labeled data.
  • methods: uses natural language inference to map the input text into preliminary label likelihoods, computes a signed label dependency graph from label descriptions, and then updates the preliminary likelihoods via message passing, driven by a collective loss function that injects expected label frequency and average multi-label cardinality; see the sketch after this entry.
  • results: experiments show the framework performs effectively under low-supervision settings with only a few labels, improving on the pre-trained language model's initial performance by 70% in example-based F1 score.
    Abstract Multi-label text classification (MLTC) is the task of assigning multiple labels to a given text, and has a wide range of application domains. Most existing approaches require an enormous amount of annotated data to learn a classifier and/or a set of well-defined constraints on the label space structure, such as hierarchical relations which may be complicated to provide as the number of labels increases. In this paper, we study the MLTC problem in annotation-free and scarce-annotation settings in which the magnitude of available supervision signals is linear to the number of labels. Our method follows three steps, (1) mapping input text into a set of preliminary label likelihoods by natural language inference using a pre-trained language model, (2) calculating a signed label dependency graph by label descriptions, and (3) updating the preliminary label likelihoods with message passing along the label dependency graph, driven with a collective loss function that injects the information of expected label frequency and average multi-label cardinality of predictions. The experiments show that the proposed framework achieves effective performance under low supervision settings with almost imperceptible computational and memory overheads added to the usage of pre-trained language model outperforming its initial performance by 70% in terms of example-based F1 score.
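A minimal sketch of the three-step pipeline under toy assumptions: zero-shot NLI produces the preliminary per-label likelihoods, a hand-written signed matrix stands in for the label dependency graph the paper derives from label descriptions, and a single message-passing step refines the likelihoods. The update rule shown here is a simplified illustration, not the paper's exact formulation, and the collective loss is omitted.

```python
# Sketch: NLI-based preliminary likelihoods + one signed message-passing step.
import numpy as np
from transformers import pipeline

labels = ["sports", "politics", "finance"]
nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "The club's shares fell after the transfer-window spending spree."
out = nli(text, candidate_labels=labels, multi_label=True)
p = np.array([out["scores"][out["labels"].index(l)] for l in labels])

# Signed label dependency graph: positive entries mean two labels reinforce
# each other, negative entries mean they tend to exclude each other.
A = np.array([[ 0.0, -0.5,  0.3],
              [-0.5,  0.0,  0.2],
              [ 0.3,  0.2,  0.0]])

# One message-passing step in logit space: neighboring labels push each
# likelihood up or down according to the signed edges.
logits = np.log(p / (1 - p)) + A @ (2 * p - 1)
p_refined = 1 / (1 + np.exp(-logits))
print(dict(zip(labels, p_refined.round(3))))
```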

The Study of Perceptual Training of Chinese Mandarin Tones for Monolingual Speakers of English Using Adaptive Computer Based Training Software

  • paper_url: http://arxiv.org/abs/2309.13513
  • repo_url: None
  • paper_authors: Yuke Wang
  • for: explores a new phonetic tone-training technique that may have a positive impact on second-language learning and tone training.
  • methods: uses a new tone-training technique based on speech recognition and generation technology to help learners perceive and understand tones.
  • results: training with the new technique improved learners' tone identification and production and helped them better understand and use tones.
    Abstract The study explored a new technique of phonetic tone training, which may have a positive impact on second language learning and tone training.