eess.AS - 2023-08-22

Furnishing Sound Event Detection with Language Model Abilities

Abstract:
Recently, the ability of language models (LMs) has attracted increasing attention in visual cross-modality. In this paper, we further explore the generation capacity of LMs for sound event detection (SED), beyond the visual domain. Specifically, we propose an elegant method that aligns audio features and text features to accomplish sound event classification and temporal localization. The framework consists of an acoustic encoder, a contrastive module that aligns the corresponding representations of the text and audio, and a decoupled language decoder that generates temporal and event sequences from the audio features. Compared with conventional works that require complicated processing and barely utilize limited audio features, our model is more concise and comprehensive, since the language model directly leverages its semantic capabilities to generate the sequences. We investigate different decoupling modules to demonstrate their effectiveness for timestamp capture and event classification. Evaluation results show that the proposed method achieves accurate sequences of sound event detection.
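
As a concrete (hypothetical) illustration of the audio-text alignment described above, the following sketch implements a symmetric InfoNCE-style contrastive objective; the embedding shapes, temperature, and function names are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling paired audio/text embeddings together.

    audio_emb, text_emb: (batch, dim) outputs of the acoustic and text
    encoders; pairing is assumed to be row-aligned.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: audio->text and text->audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with random embeddings standing in for encoder outputs.
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```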


Deep learning-based denoising streamed from mobile phones improves speech-in-noise understanding for hearing aid users

  • paper_url: http://arxiv.org/abs/2308.11456
  • repo_url: None
  • paper_authors: Peter Udo Diehl, Hannes Zilly, Felix Sattler, Yosef Singer, Kevin Kepp, Mark Berry, Henning Hasemann, Marlene Zippel, Müge Kaya, Paul Meyer-Rachner, Annett Pudszuhn, Veit M. Hofmann, Matthias Vormann, Elias Sprengel

Abstract:
The hearing loss of almost half a billion people is commonly treated with hearing aids. However, current hearing aids often do not work well in real-world noisy environments. We present a deep learning-based denoising system that runs in real time on an iPhone 7 and a Samsung Galaxy S10 (25 ms algorithmic latency). The denoised audio is streamed to the hearing aid, resulting in a total delay of around 75 ms. In tests with hearing aid users having moderate to severe hearing loss, our denoising system improves audio across three tests: 1) listening for subjective audio ratings, 2) listening for objective speech intelligibility, and 3) live conversations in a noisy environment for subjective ratings. Subjective ratings increase by more than 40% for both the listening test and the live conversation compared to a fitted hearing aid as a baseline. Speech reception thresholds, measuring speech understanding in noise, improve by 1.6 dB SRT. Ours is the first denoising system implemented on a mobile device and streamed directly to users’ hearing aids, using only a single audio input channel, while improving user satisfaction on all tested aspects, including speech intelligibility. This includes overall preference for the denoised, streamed signal over the hearing aid alone, showing that users accept the higher latency in exchange for the significant improvement in speech understanding.


Convoifilter: A case study of doing cocktail party speech recognition

Abstract:
This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model utilizes a single-channel speech enhancement module that isolates the speaker’s voice from background noise, along with an ASR module. Through this approach, the model is able to decrease the word error rate (WER) of ASR from 80% to 26.4%. Typically, these two components are adjusted independently due to variations in data requirements. However, speech enhancement can create anomalies that decrease ASR efficiency. By implementing a joint fine-tuning strategy, the model can reduce the WER from 26.4% in separate tuning to 14.5% in joint tuning.
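
The joint fine-tuning strategy is described only at a high level; one common way to couple the two modules is a weighted sum of enhancement and ASR losses back-propagated through both networks. The sketch below illustrates that coupling under assumed interfaces (the module signatures, the L1/CTC criteria, and the weight lambda_enh are placeholders, not the paper's implementation).

```python
import torch

def joint_step(enhancer, asr_model, mixture, clean, tokens, token_lens,
               optimizer, lambda_enh=0.3):
    """One joint fine-tuning step through both modules (hedged sketch).

    enhancer, asr_model: assumed pretrained nn.Modules; asr_model is assumed
    to return (log_probs of shape (T, B, vocab), output_lengths) for CTC.
    """
    optimizer.zero_grad()
    enhanced = enhancer(mixture)                      # speech enhancement front-end
    enh_loss = torch.nn.functional.l1_loss(enhanced, clean)
    log_probs, out_lens = asr_model(enhanced)
    asr_loss = torch.nn.functional.ctc_loss(log_probs, tokens, out_lens, token_lens)
    loss = lambda_enh * enh_loss + asr_loss           # gradients reach both modules
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training the two losses jointly lets the ASR gradients discourage the enhancement artifacts that separate tuning leaves in place, which is the effect the abstract attributes to the 26.4% to 14.5% WER reduction.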


Evaluation of the Speech Resynthesis Capabilities of the VoicePrivacy Challenge Baseline B1

Abstract:
Speaker anonymization systems continue to improve their ability to obfuscate the original speaker characteristics in a speech signal, but often create processing artifacts and unnatural-sounding voices as a tradeoff. Many of those systems stem from the VoicePrivacy Challenge (VPC) Baseline B1, using a neural vocoder to synthesize speech from a representation based on F0, x-vectors, and bottleneck features. Inspired by this, we investigate the reproduction capabilities of the aforementioned baseline, to assess how successful the shared methodology is in synthesizing human-like speech. We use four objective metrics to measure speech quality, waveform similarity, and F0 similarity. Our findings indicate that both the speech representation and the vocoder introduce artifacts, causing an unnatural perception. A MUSHRA-like listening test with 18 subjects corroborates our findings, motivating further research on the analysis and synthesis components of the VPC Baseline B1.
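
F0 similarity is named but not defined in the abstract; a simple variant compares the F0 contours of original and resynthesized speech over mutually voiced frames. The snippet below (frame alignment and the choice of RMSE plus Pearson correlation are assumptions) illustrates the idea.

```python
import numpy as np

def f0_similarity(f0_ref, f0_syn):
    """Compare two F0 contours in Hz (0 = unvoiced), frame-aligned by assumption."""
    n = min(len(f0_ref), len(f0_syn))
    f0_ref, f0_syn = f0_ref[:n], f0_syn[:n]
    voiced = (f0_ref > 0) & (f0_syn > 0)          # score only mutually voiced frames
    rmse = np.sqrt(np.mean((f0_ref[voiced] - f0_syn[voiced]) ** 2))
    corr = np.corrcoef(f0_ref[voiced], f0_syn[voiced])[0, 1]
    return rmse, corr
```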


Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

Abstract:
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed MU-LLaMA model, trained on our designed MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.


Abstract:
Tablature notation is widely used in popular music to transcribe and share guitar musical content. As a complement to standard score notation, tablatures transcribe performance gesture information including finger positions and a variety of guitar-specific playing techniques such as slides, hammer-on/pull-off or bends. This paper focuses on bends, which enable the player to progressively shift the pitch of a note, therefore circumventing the physical limitations of the discrete fretted fingerboard. In this paper, we propose a set of 25 high-level features, computed for each note of the tablature, to study how bend occurrences can be predicted from their past and future short-term context. Experiments are performed on a corpus of 932 lead guitar tablatures of popular music and show that a decision tree successfully predicts bend occurrences with an F1 score of 0.71 and a limited amount of false positive predictions, demonstrating promising applications to assist the arrangement of non-guitar music into guitar tablatures.
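
The note-level setup lends itself to a few lines of scikit-learn; the sketch below uses synthetic stand-in features and labels, since the paper's 25 features and corpus are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# X: one row per tablature note with short-term context features
# (e.g. fret, string, pitch of neighbouring notes); y: 1 if the note is bent.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 25))                  # stand-in for the 25 features
y = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)  # synthetic labels for illustration

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_depth=8, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```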


An Effective Transformer-based Contextual Model and Temporal Gate Pooling for Speaker Identification

Abstract:
Wav2vec2 has achieved success in applying Transformer architectures and self-supervised learning to speech recognition. Recently, these techniques have come to be used not only for speech recognition but across speech processing more broadly. This paper introduces an effective end-to-end speaker identification model that applies a Transformer-based contextual model. We explored the relationship between the parameters and the performance in order to discern the structure of an effective model. Furthermore, we propose a pooling method, Temporal Gate Pooling, with powerful learning ability for speaker identification. We applied a Conformer as the encoder and BEST-RQ for pre-training, and conducted an evaluation on VoxCeleb1 speaker identification. The proposed method achieved an accuracy of 85.9% with 28.5M parameters, demonstrating precision comparable to wav2vec2 with 317.7M parameters. Code is available at https://github.com/HarunoriKawano/speaker-identification-with-tgp.
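
The abstract does not spell out Temporal Gate Pooling; one plausible reading, sketched below purely as an assumption (the linked repository is the authoritative reference), is a learned per-frame gate that weights frames before temporal averaging.

```python
import torch
import torch.nn as nn

class TemporalGatePooling(nn.Module):
    """Pool a (batch, time, dim) sequence into (batch, dim) with learned gates.

    Speculative reconstruction from the method name alone, not the
    authors' implementation.
    """
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):                       # x: (batch, time, dim)
        w = self.gate(x)                        # per-frame, per-channel gates in (0, 1)
        return (w * x).sum(dim=1) / w.sum(dim=1).clamp(min=1e-6)

pooled = TemporalGatePooling(256)(torch.randn(4, 100, 256))  # -> (4, 256)
```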


PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

Abstract:
Voice conversion, the style transfer task applied to speech, refers to converting one person’s speech into a new speech that sounds like another person’s. Up to now, there has been a lot of research devoted to better implementation of VC tasks. However, a good voice conversion model should match not only the timbre of the target speaker but also expressive information such as prosody, pace, and pauses. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is important but challenging, especially without text transcriptions. In this paper, we propose a novel voice conversion framework named ‘PMVC’, which effectively separates and models the content, timbre, and prosodic information from speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction. Building upon this, a mask-and-predict mechanism is applied to disentangle prosody from content information. Experimental results on the AIShell-3 corpus support the improvement in naturalness and similarity of the converted speech.


Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression

  • paper_url: http://arxiv.org/abs/2308.11053
  • repo_url: None
  • paper_authors: Hangting Chen, Jianwei Yu, Yi Luo, Rongzhi Gu, Weihua Li, Zhuocheng Lu, Chao Weng

Abstract:
Echo cancellation and noise reduction are essential for full-duplex communication, yet most existing neural networks have high computational costs and are inflexible in tuning model complexity. In this paper, we introduce time-frequency dual-path compression to achieve a wide range of compression ratios on computational cost. Specifically, for frequency compression, trainable filters are used to replace manually designed filters for dimension reduction. For time compression, using frame-skipped prediction alone causes large performance degradation, which can be alleviated by a post-processing network with full sequence modeling. We have found that under fixed compression ratios, dual-path compression combining both the time and frequency methods yields further performance improvement, covering compression ratios from 4x to 32x with little model size change. Moreover, the proposed models show competitive performance compared with fast FullSubNet and DeepFilterNet. A demo page can be found at hangtingchen.github.io/ultra_dual_path_compression.github.io/.
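
For the frequency-compression idea (trainable filters replacing hand-designed ones for dimension reduction), a minimal sketch is a learned linear filterbank applied along the spectrogram's frequency axis; the shapes and the plain linear parameterization below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LearnedFreqCompression(nn.Module):
    """Compress n_freq frequency bins to n_freq // ratio with a trainable filterbank."""
    def __init__(self, n_freq=257, ratio=4):
        super().__init__()
        # Each output bin is a learned weighted sum of all input bins,
        # replacing a hand-designed (e.g. mel/Bark) filterbank.
        self.proj = nn.Linear(n_freq, n_freq // ratio, bias=False)

    def forward(self, spec):                    # spec: (batch, time, n_freq)
        return self.proj(spec)                  # (batch, time, n_freq // ratio)

compressed = LearnedFreqCompression()(torch.randn(2, 100, 257))
```

Time compression would then drop intermediate frames (frame skipping) and rely on a post-processing network with full sequence modeling to recover them, per the abstract.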

cs.SD - 2023-08-21

LibriWASN: A Data Set for Meeting Separation, Diarization, and Recognition with Asynchronous Recording Devices

  • paper_url: http://arxiv.org/abs/2308.10682
  • repo_url: None
  • paper_authors: Joerg Schmalenstroeer, Tobias Gburrek, Reinhold Haeb-Umbach
  • For: Testing clock synchronization algorithms, meeting separation, diarization, and transcription systems on ad-hoc wireless acoustic sensor networks.
  • Methods: Nine devices (five smartphones and four microphone arrays) record 29 channels of data in total, with the sampling clocks of the devices not synchronized.
  • Results: The LibriWASN data set, recorded in two different rooms and complemented with ground-truth diarization information, usable as a test set for the systems above.
    Abstract We present LibriWASN, a data set whose design follows closely the LibriCSS meeting recognition data set, with the marked difference that the data is recorded with devices that are randomly positioned on a meeting table and whose sampling clocks are not synchronized. Nine different devices, five smartphones with a single recording channel and four microphone arrays, are used to record a total of 29 channels. Other than that, the data set follows closely the LibriCSS design: the same LibriSpeech sentences are played back from eight loudspeakers arranged around a meeting table and the data is organized in subsets with different percentages of speech overlap. LibriWASN is meant as a test set for clock synchronization algorithms, meeting separation, diarization and transcription systems on ad-hoc wireless acoustic sensor networks. Due to its similarity to LibriCSS, meeting transcription systems developed for the former can readily be tested on LibriWASN. The data set is recorded in two different rooms and is complemented with ground-truth diarization information of who speaks when.

An Anchor-Point Based Image-Model for Room Impulse Response Simulation with Directional Source Radiation and Sensor Directivity Patterns

  • paper_url: http://arxiv.org/abs/2308.10543
  • repo_url: None
  • paper_authors: Chao Pan, Lei Zhang, Yilong Lu, Jilu Jin, Lin Qiu, Jingdong Chen, Jacob Benesty
  • for: To extend the image-model method so that it can be used in different applications.
  • methods: Building on the image-model method, an anchor-point image model (APIM) is developed to simulate impulse responses. APIM accounts for source radiation and sensor directivity patterns, and introduces anchor points to determine the orientations of the virtual sources.
  • results: An algorithm is developed to generate room impulse responses that takes directional pattern functions, fractional time delays, and computational complexity into account. The model and algorithm can be used in various acoustic problems to simulate room acoustics and to improve and evaluate processing algorithms.
    Abstract The image model method has been widely used to simulate room impulse responses and the endeavor to adapt this method to different applications has also piqued great interest over the last few decades. This paper attempts to extend the image model method and develops an anchor-point-image-model (APIM) approach as a solution for simulating impulse responses by including both the source radiation and sensor directivity patterns. To determine the orientations of all the virtual sources, anchor points are introduced to real sources, which subsequently lead to the determination of the orientations of the virtual sources. An algorithm is developed to generate room impulse responses with APIM by taking into account the directional pattern functions, fractional time delays, as well as the computational complexity. The developed model and algorithms can be used in various acoustic problems to simulate room acoustics and improve and evaluate processing algorithms.
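
For background, the classic image-source method that APIM extends can be sketched in a few lines; the version below is the plain shoebox-room model with a uniform reflection coefficient and omits the paper's contributions (source radiation, sensor directivity, and anchor points).

```python
import numpy as np
from itertools import product

def image_source_rir(room, src, mic, fs=16000, c=343.0, max_q=2, beta=0.9):
    """Minimal classic image-source RIR for a rectangular (shoebox) room.

    room, src, mic: length-3 sequences in meters; beta: uniform wall
    reflection coefficient; max_q bounds the reflection order per axis.
    """
    rir = np.zeros(int(0.25 * fs))
    for p in product((0, 1), repeat=3):            # mirror parity per axis
        for q in product(range(-max_q, max_q + 1), repeat=3):
            img = np.array([(-1) ** p[d] * src[d] + 2 * q[d] * room[d]
                            for d in range(3)])    # virtual source position
            dist = np.linalg.norm(img - np.asarray(mic))
            n_refl = sum(abs(2 * q[d] - p[d]) for d in range(3))
            t = int(round(dist / c * fs))          # arrival sample index
            if t < len(rir):
                rir[t] += beta ** n_refl / (4 * np.pi * max(dist, 1e-3))
    return rir

h = image_source_rir(room=[5, 4, 3], src=[1, 1, 1.5], mic=[3, 2, 1.5])
```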

Implicit Self-supervised Language Representation for Spoken Language Diarization

  • paper_url: http://arxiv.org/abs/2308.10470
  • repo_url: None
  • paper_authors: Jagabandhu Mishra, S. R. Mahadeva Prasanna
  • for: To develop a spoken language diarization system, improving the performance of speech recognition and language conversion.
  • methods: Three frameworks are explored to perform language diarization: fixed segmentation, change-point-based segmentation, and an E2E framework. All are implicit frameworks, which can be easily adapted to low-resource languages.
  • results: Using x-vectors as the implicit language representation achieves performance on par with explicit language diarization. The best implicit performance with the E2E framework is a JER of 6.38, but on the practical Microsoft CS (MSCS) dataset it degrades to 60.4, mainly because the monolingual segment durations of the secondary language are distributed differently in MSCS than in TTSF-LD. Moreover, to avoid segment smoothing, using a small value of $N$ helps language diarization performance.
    Abstract In a code-switched (CS) scenario, the use of spoken language diarization (LD) as a pre-processing system is essential. Further, the use of implicit frameworks is preferable over the explicit framework, as it can be easily adapted to deal with low/zero resource languages. Inspired by speaker diarization (SD) literature, three frameworks based on (1) fixed segmentation, (2) change point-based segmentation and (3) E2E are proposed to perform LD. The initial exploration with the synthetic TTSF-LD dataset shows that using x-vectors as the implicit language representation with an appropriate analysis window length ($N$) can achieve on-par performance with explicit LD. The best implicit LD performance of $6.38$ in terms of Jaccard error rate (JER) is achieved by using the E2E framework. However, with the E2E framework the performance of implicit LD degrades to $60.4$ when using the practical Microsoft CS (MSCS) dataset. The difference in performance is mostly due to the distributional difference between the monolingual segment duration of the secondary language in the MSCS and TTSF-LD datasets. Moreover, to avoid segment smoothing, the smaller duration of the monolingual segments suggests the use of a small value of $N$. At the same time, with small $N$, the x-vector representation is unable to capture the required language discrimination due to acoustic similarity, as the same speaker is speaking both languages. Therefore, to resolve the issue, a self-supervised implicit language representation is proposed in this study. In comparison with the x-vector representation, the proposed representation provides a relative improvement of $63.9\%$ and achieved a JER of $21.8$ using the E2E framework.

Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.10428
  • repo_url: None
  • paper_authors: Heyang Xue, Shuai Guo, Pengcheng Zhu, Mengxiao Bi
  • for: Improving multi-speaker text-to-speech (TTS).
  • methods: Uses the Consistent Diffusion Model (CDM) as the generative model and enforces the consistency property of CDM during training, addressing the sampling drift that diffusion models suffer between training and inference.
  • results: Improves multi-speaker TTS performance for the different speakers involved compared with Grad-TTS, even outperforming the fine-tuning approach. Audio samples are available at https://welkinyang.github.io/multi-gradspeech/
    Abstract Recent advancements in diffusion-based acoustic models have revolutionized data-sufficient single-speaker Text-to-Speech (TTS) approaches, with Grad-TTS being a prime example. However, diffusion models suffer from drift between training and sampling distributions due to imperfect score-matching. The sampling drift problem leads to these approaches struggling in multi-speaker scenarios in practice. In this paper, we present Multi-GradSpeech, a multi-speaker diffusion-based acoustic model which introduces the Consistent Diffusion Model (CDM) as a generative modeling approach. We enforce the consistency property of CDM during the training process to alleviate the sampling drift problem in the inference stage, resulting in significant improvements in multi-speaker TTS performance. Our experimental results corroborate that our proposed approach can improve the performance of different speakers involved in multi-speaker TTS compared to Grad-TTS, even outperforming the fine-tuning approach. Audio samples are available at https://welkinyang.github.io/multi-gradspeech/

Neural Architectures Learning Fourier Transforms, Signal Processing and Much More….

  • paper_url: http://arxiv.org/abs/2308.10388
  • repo_url: None
  • paper_authors: Prateek Verma
  • for: The paper explores the use of neural architectures for learning kernels in the Fourier Transform, specifically for audio signal processing applications.
  • methods: The paper proposes using a neural architecture to learn sinusoidal kernels, and discovers various signal-processing properties such as windowing functions, onset detectors, high-pass filters, low-pass filters, and modulations. The neural architecture has a comb-filter-like structure on top of the learned kernels.
  • results: The paper shows that the proposed method can learn kernels adapted to the problem at hand and discovers various signal-processing properties. The learned kernels can be used for tasks such as audio signal processing, and the method has the potential to be applied to other signal-processing applications as well.
    Abstract This report will explore and answer fundamental questions about taking Fourier Transforms and tying it with recent advances in AI and neural architecture. One interpretation of the Fourier Transform is decomposing a signal into its constituent components by projecting them onto complex exponentials. Variants exist, such as the discrete cosine transform, which does not operate on the complex domain and projects an input signal to only cosine functions oscillating at different frequencies. However, this is a fundamental limitation, and it can be suboptimal. The first one is that all kernels are sinusoidal: What if we could have some kernels adapted or learned according to the problem? What if we can use neural architectures for this? We show how one can learn these kernels from scratch for audio signal processing applications. We find that the neural architecture not only learns sinusoidal kernel shapes but discovers all kinds of incredible signal-processing properties. E.g., windowing functions, onset detectors, high pass filters, low pass filters, modulations, etc. Further, upon analysis of the filters, we find that the neural architecture has a comb filter-like structure on top of the learned kernels. Comb filters that allow harmonic frequencies to pass through are one of the core building blocks/types of filters similar to high-pass, low-pass, and band-pass filters of various traditional signal processing algorithms. Further, we can also use the convolution operation with a signal to be learned from scratch, and we will explore papers in the literature that use this with robust Transformer architectures. Further, we would also explore making the learned kernel's content adaptive, i.e., learning different kernels for different inputs.
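
To make the learned-kernel idea concrete, here is a hedged sketch (not the report's code) of a 1-D convolutional front-end whose filters are initialized as a cosine basis and then left free to adapt during training, which is the setting in which windows, onset detectors, and other filters could emerge.

```python
import torch
import torch.nn as nn

class LearnableFourierFrontend(nn.Module):
    """Conv1d front-end initialized with DCT-like cosine kernels, trained freely."""
    def __init__(self, n_filters=64, kernel_size=400, hop=160):
        super().__init__()
        self.conv = nn.Conv1d(1, n_filters, kernel_size, stride=hop, bias=False)
        t = torch.arange(kernel_size, dtype=torch.float32)
        # Filter k starts as a cosine oscillating at frequency ~ k; training
        # may then reshape these into windows, onset detectors, etc.
        init = torch.stack([torch.cos(torch.pi * k * (t + 0.5) / kernel_size)
                            for k in range(n_filters)])
        with torch.no_grad():
            self.conv.weight.copy_(init.unsqueeze(1))

    def forward(self, wav):                     # wav: (batch, 1, samples)
        return self.conv(wav)                   # (batch, n_filters, frames)

feats = LearnableFourierFrontend()(torch.randn(2, 1, 16000))
```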

Local Periodicity-Based Beat Tracking for Expressive Classical Piano Music

  • paper_url: http://arxiv.org/abs/2308.10355
  • repo_url: https://github.com/sunnycyc/plpdp4beat
  • paper_authors: Ching-Yu Chiu, Meinard Müller, Matthew E. P. Davies, Alvin Wen-Yu Su, Yi-Hsuan Yang
  • For: The paper is written for the purpose of improving beat tracking in Western classical piano music, specifically addressing the limitations of existing post-processing trackers (PPTs) in handling local tempo changes.
  • Methods: The paper proposes a new local-periodicity-based PPT called predominant local pulse-based dynamic programming (PLPDP) tracking, which combines a method called “predominant local pulses” (PLP) with a dynamic programming (DP) component to jointly consider locally detected periodicity and beat activation strength.
  • Results: Compared to existing PPTs, PLPDP particularly enhances recall at the cost of lower precision, resulting in an overall improvement of F1-score for beat tracking in two large datasets of Western classical piano music (ASAP and Maz-5). Specifically, PLPDP improved the F1-score from 0.473 to 0.493 on ASAP and from 0.595 to 0.838 on Maz-5.
    Abstract To model the periodicity of beats, state-of-the-art beat tracking systems use "post-processing trackers" (PPTs) that rely on several empirically determined global assumptions for tempo transition, which work well for music with a steady tempo. For expressive classical music, however, these assumptions can be too rigid. With two large datasets of Western classical piano music, namely the Aligned Scores and Performances (ASAP) dataset and a dataset of Chopin's Mazurkas (Maz-5), we report on experiments showing the failure of existing PPTs to cope with local tempo changes, thus calling for new methods. In this paper, we propose a new local periodicity-based PPT, called predominant local pulse-based dynamic programming (PLPDP) tracking, that allows for more flexible tempo transitions. Specifically, the new PPT incorporates a method called "predominant local pulses" (PLP) in combination with a dynamic programming (DP) component to jointly consider the locally detected periodicity and beat activation strength at each time instant. Accordingly, PLPDP accounts for the local periodicity, rather than relying on a global tempo assumption. Compared to existing PPTs, PLPDP particularly enhances the recall values at the cost of a lower precision, resulting in an overall improvement of F1-score for beat tracking in ASAP (from 0.473 to 0.493) and Maz-5 (from 0.595 to 0.838).
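
For background, the dynamic-programming component builds on the textbook DP beat tracker (in the style of Ellis, 2007), which scores each candidate beat by its activation plus the best predecessor under a tempo-deviation penalty. The sketch below is that global-tempo baseline, not PLPDP itself, which replaces the single global tempo with locally estimated periodicity; the penalty weight and search window are assumptions.

```python
import numpy as np

def dp_beat_track(activation, fps=100, tempo_bpm=120.0, alpha=100.0):
    """Textbook DP beat tracking with a single global tempo prior."""
    period = 60.0 * fps / tempo_bpm                     # expected beat spacing (frames)
    n = len(activation)
    score = activation.copy()
    backlink = np.full(n, -1)
    for t in range(n):
        lo, hi = int(t - 2 * period), int(t - period / 2)
        if hi <= 0:
            continue
        prev = np.arange(max(lo, 0), hi)
        # log-Gaussian penalty for deviating from the expected period
        penalty = -alpha * np.log((t - prev) / period) ** 2
        best = np.argmax(score[prev] + penalty)
        score[t] += (score[prev] + penalty)[best]
        backlink[t] = prev[best]
    beats, t = [], int(np.argmax(score))                # backtrack from best end
    while t >= 0:
        beats.append(t)
        t = backlink[t]
    return np.array(beats[::-1]) / fps                  # beat times in seconds

beats = dp_beat_track(np.random.rand(1000))
```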

cs.CV - 2023-08-21

Improving Continuous Sign Language Recognition with Cross-Lingual Signs

  • paper_url: http://arxiv.org/abs/2308.10809
  • repo_url: None
  • paper_authors: Fangyun Wei, Yutong Chen
  • for: Continuous sign language recognition (CSLR), a weakly supervised task that recognizes continuous signs from videos without any prior knowledge of the temporal boundaries between consecutive signs.
  • methods: The approach builds on the observation of cross-lingual signs: different sign languages share similar visual signals (e.g., hand shape and motion), and these similarities can aid recognition in another sign language. Two sign language dictionaries containing the isolated signs that appear in the two datasets are built first; then a well-optimized isolated sign language recognition model identifies the sign-to-sign mappings between the two languages; finally, a CSLR model is trained on the combination of the target data with original labels and the auxiliary data with mapped labels.
  • results: The approach achieves state-of-the-art performance on two widely used CSLR datasets: Phoenix-2014 and Phoenix-2014T.
    Abstract This work is dedicated to continuous sign language recognition (CSLR), a weakly supervised task dealing with the recognition of continuous signs from videos, without any prior knowledge about the temporal boundaries between consecutive signs. Data scarcity heavily impedes the progress of CSLR. Existing approaches typically train CSLR models on a monolingual corpus, which is orders of magnitude smaller than that of speech recognition. In this work, we explore the feasibility of utilizing multilingual sign language corpora to facilitate monolingual CSLR. Our work is built upon the observation of cross-lingual signs, which originate from different sign languages but have similar visual signals (e.g., hand shape and motion). The underlying idea of our approach is to identify the cross-lingual signs in one sign language and properly leverage them as auxiliary training data to improve the recognition capability of another. To achieve the goal, we first build two sign language dictionaries containing isolated signs that appear in two datasets. Then we identify the sign-to-sign mappings between two sign languages via a well-optimized isolated sign language recognition model. At last, we train a CSLR model on the combination of the target data with original labels and the auxiliary data with mapped labels. Experimentally, our approach achieves state-of-the-art performance on two widely-used CSLR datasets: Phoenix-2014 and Phoenix-2014T.

MGMAE: Motion Guided Masking for Video Masked Autoencoding

  • paper_url: http://arxiv.org/abs/2308.10794
  • repo_url: None
  • paper_authors: Bingkun Huang, Zhiyu Zhao, Guozhen Zhang, Yu Qiao, Limin Wang
  • for: To improve representation learning performance in self-supervised video learning.
  • methods: Builds on the high masking ratio and customized masking strategy motivated by temporal redundancy in VideoMAE, and introduces a motion-guided masking strategy that uses the motion information in videos to build a temporally consistent masking volume.
  • results: Compared with the original VideoMAE, MGMAE performs better on the Something-Something V2 and Kinetics-400 datasets, and visualization analysis shows that MGMAE captures more useful structural information.
    Abstract Masked autoencoding has shown excellent performance on self-supervised video representation learning. Temporal redundancy has led to a high masking ratio and customized masking strategy in VideoMAE. In this paper, we aim to further improve the performance of video masked autoencoding by introducing a motion guided masking strategy. Our key insight is that motion is a general and unique prior in video, which should be taken into account during masked pre-training. Our motion guided masking explicitly incorporates motion information to build a temporally consistent masking volume. Based on this masking volume, we can track the unmasked tokens in time and sample a set of temporally consistent cubes from videos. These temporally aligned unmasked tokens further relieve the information leakage issue in time and encourage the MGMAE to learn more useful structure information. We implement our MGMAE with an online efficient optical flow estimator and backward masking map warping strategy. We perform experiments on the datasets of Something-Something V2 and Kinetics-400, demonstrating the superior performance of our MGMAE to the original VideoMAE. In addition, we provide the visualization analysis to illustrate that our MGMAE can sample temporal consistent cubes in a motion-adaptive manner for more effective video pre-training.
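
A hedged sketch of the core idea follows: rank patches by optical-flow magnitude and choose the visible set accordingly. The patch geometry, keep ratio, and the choice to keep the highest-motion patches visible are assumptions; the paper's warping of masks along the flow to build the temporally consistent masking volume is not reproduced.

```python
import torch

def motion_guided_mask(flow, patch=16, keep_ratio=0.1):
    """Build a per-frame token mask from optical-flow magnitude.

    flow: (T, 2, H, W) optical flow between consecutive frames.
    Returns a boolean (T, n_tokens) mask, True = masked token.
    """
    T, _, H, W = flow.shape
    mag = flow.norm(dim=1)                               # (T, H, W) flow magnitude
    # average flow magnitude per non-overlapping patch
    patches = mag.unfold(1, patch, patch).unfold(2, patch, patch)
    saliency = patches.mean(dim=(-1, -2)).flatten(1)     # (T, n_tokens)
    n_keep = max(1, int(keep_ratio * saliency.size(1)))
    keep = saliency.topk(n_keep, dim=1).indices          # highest-motion tokens
    mask = torch.ones_like(saliency, dtype=torch.bool)
    mask.scatter_(1, keep, False)                        # visible tokens unmasked
    return mask

mask = motion_guided_mask(torch.randn(8, 2, 224, 224))
```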

Extraction of Text from Optic Nerve Optical Coherence Tomography Reports

  • paper_url: http://arxiv.org/abs/2308.10790
  • repo_url: None
  • paper_authors: Iyad Majid, Youchen Victor Zhang, Robert Chang, Sophia Y. Wang
  • for: To develop and evaluate rule-based algorithms that improve the extraction of text data, including retinal nerve fiber layer (RNFL) values and other ganglion cell count (GCC) data, from Zeiss Cirrus optical coherence tomography (OCT) scan reports.
  • methods: DICOM files were converted into image files and processed with the PaddleOCR Python package for optical character recognition. Rule-based algorithms were designed and iteratively optimized to improve the extraction of RNFL and GCC data.
  • results: The developed algorithms extracted data from RNFL and GCC reports with high precision, as verified by manual review. Some values were more challenging to extract, particularly clock hours 5 and 6 for RNFL thickness and the signal strength for GCC.
    Abstract Purpose: The purpose of this study was to develop and evaluate rule-based algorithms to enhance the extraction of text data, including retinal nerve fiber layer (RNFL) values and other ganglion cell count (GCC) data, from Zeiss Cirrus optical coherence tomography (OCT) scan reports. Methods: DICOM files that contained encapsulated PDF reports with RNFL or Ganglion Cell in their document titles were identified from a clinical imaging repository at a single academic ophthalmic center. PDF reports were then converted into image files and processed using the PaddleOCR Python package for optical character recognition. Rule-based algorithms were designed and iteratively optimized for improved performance in extracting RNFL and GCC data. Evaluation of the algorithms was conducted through manual review of a set of RNFL and GCC reports. Results: The developed algorithms demonstrated high precision in extracting data from both RNFL and GCC scans. Precision was slightly better for the right eye in RNFL extraction (OD: 0.9803 vs. OS: 0.9046), and for the left eye in GCC extraction (OD: 0.9567 vs. OS: 0.9677). Some values presented more challenges in extraction, particularly clock hours 5 and 6 for RNFL thickness, and signal strength for GCC. Conclusions: A customized optical character recognition algorithm can identify numeric results from optical coherence scan reports with high precision. Automated processing of PDF reports can greatly reduce the time to extract OCT results on a large scale.
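
A minimal sketch of the pipeline is shown below; the PaddleOCR call and result structure vary across package versions, and the filename and regular expression are illustrative only (the paper's layout rules are far more involved).

```python
import re
from paddleocr import PaddleOCR  # pip install paddleocr

ocr = PaddleOCR(lang="en")                      # text detection + recognition
result = ocr.ocr("rnfl_report.png")             # page image converted from the PDF

# Collect recognized text lines, then apply rule-based extraction.
lines = [rec[1][0] for rec in (result[0] or [])]
text = " ".join(lines)

# Illustrative rule: 'Average RNFL Thickness 87 um' -> 87
m = re.search(r"Average\s+RNFL\s+Thickness\D*(\d+)", text, re.IGNORECASE)
avg_rnfl = int(m.group(1)) if m else None
print(avg_rnfl)
```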

Dense Error Map Estimation for MRI-Ultrasound Registration in Brain Tumor Surgery Using Swin UNETR

  • paper_url: http://arxiv.org/abs/2308.10784
  • repo_url: None
  • paper_authors: Soorena Salari, Amirhossein Rasoulian, Hassan Rivaz, Yiming Xiao
  • for: Early surgical treatment of brain tumors is crucial for reducing patient mortality, but brain tissue deformation during surgery (brain shift) invalidates pre-operative images, so a reliable and portable tool to track brain shift is needed.
  • methods: A deep learning (DL) framework based on Swin UNETR that automatically assesses the quality of MRI-iUS registration results in real time as 3D patch-wise dense error maps.
  • results: The framework is evaluated with real clinical data, demonstrating its performance in iUS-guided brain tumor resection.
    Abstract Early surgical treatment of brain tumors is crucial in reducing patient mortality rates. However, brain tissue deformation (called brain shift) occurs during the surgery, rendering pre-operative images invalid. As a cost-effective and portable tool, intra-operative ultrasound (iUS) can track brain shift, and accurate MRI-iUS registration techniques can update pre-surgical plans and facilitate the interpretation of iUS. This can boost surgical safety and outcomes by maximizing tumor removal while avoiding eloquent regions. However, manual assessment of MRI-iUS registration results in real-time is difficult and prone to errors due to the 3D nature of the data. Automatic algorithms that can quantify the quality of inter-modal medical image registration outcomes can be highly beneficial. Therefore, we propose a novel deep-learning (DL) based framework with the Swin UNETR to automatically assess 3D-patch-wise dense error maps for MRI-iUS registration in iUS-guided brain tumor resection and show its performance with real clinical data for the first time.

CoNe: Contrast Your Neighbours for Supervised Image Classification

  • paper_url: http://arxiv.org/abs/2308.10761
  • repo_url: https://github.com/mingkai-zheng/cone
  • paper_authors: Mingkai Zheng, Shan You, Lang Huang, Xiu Su, Fei Wang, Chen Qian, Xiaogang Wang, Chang Xu
  • For: Improving image classification performance by accounting for the variance among intra-class samples.
  • Methods: Introduces contrast against neighbours: each sample is supervised not only by its class center but also directly uses the features of similar neighbours as anchors to generate more adaptive and refined targets.
  • Results: CoNe achieves state-of-the-art performance across different benchmark datasets, network architectures, and settings, including 80.8% Top-1 accuracy on ImageNet with ResNet-50, surpassing the recent Timm training recipe.
    Abstract Image classification is a longstanding problem in computer vision and machine learning research. Most recent works (e.g. SupCon , Triplet, and max-margin) mainly focus on grouping the intra-class samples aggressively and compactly, with the assumption that all intra-class samples should be pulled tightly towards their class centers. However, such an objective will be very hard to achieve since it ignores the intra-class variance in the dataset. (i.e. different instances from the same class can have significant differences). Thus, such a monotonous objective is not sufficient. To provide a more informative objective, we introduce Contrast Your Neighbours (CoNe) - a simple yet practical learning framework for supervised image classification. Specifically, in CoNe, each sample is not only supervised by its class center but also directly employs the features of its similar neighbors as anchors to generate more adaptive and refined targets. Moreover, to further boost the performance, we propose ``distributional consistency" as a more informative regularization to enable similar instances to have a similar probability distribution. Extensive experimental results demonstrate that CoNe achieves state-of-the-art performance across different benchmark datasets, network architectures, and settings. Notably, even without a complicated training recipe, our CoNe achieves 80.8\% Top-1 accuracy on ImageNet with ResNet-50, which surpasses the recent Timm training recipe (80.4\%). Code and pre-trained models are available at \href{https://github.com/mingkai-zheng/CoNe}{https://github.com/mingkai-zheng/CoNe}.
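
As a hedged sketch of the neighbour-as-anchor idea (not the paper's full CoNe objective, which also includes the distributional-consistency regularizer), the following loss pulls each sample toward its k most similar same-class neighbours; the value of k and the InfoNCE form are assumptions.

```python
import torch
import torch.nn.functional as F

def cone_neighbour_loss(feats, labels, k=5, temperature=0.1):
    """Pull each sample toward its k nearest same-class neighbours (sketch)."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature                 # (N, N) similarities
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos = labels[:, None].eq(labels[None, :]) & ~eye      # same-class, non-self
    topk = sim.masked_fill(~pos, float("-inf")).topk(k, dim=1).values
    # InfoNCE denominator over all non-self samples
    log_denom = torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1)
    valid = torch.isfinite(topk)                          # samples may have < k positives
    return (log_denom[:, None] - topk)[valid].mean()

loss = cone_neighbour_loss(torch.randn(32, 128), torch.randint(0, 10, (32,)))
```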

Boosting Adversarial Attack with Similar Target

  • paper_url: http://arxiv.org/abs/2308.10743
  • repo_url: https://github.com/huanranchen/Similar-Target-Attacker
  • paper_authors: Shuo Zhang, Ziruo Wang, Zikai Zhou, Huanran Chen
  • for: Addressing the vulnerability of deep neural networks to adversarial examples, which threatens model applications and raises security concerns.
  • methods: An ensemble attack, Similar Target (ST), that promotes cosine similarity between the gradients of each model, regularizing the optimization direction so that all surrogate models are attacked simultaneously.
  • results: On ImageNet, the method improves adversarial transferability, outperforming state-of-the-art attackers on 18 discriminative classifiers and adversarially trained models.
    Abstract Deep neural networks are vulnerable to adversarial examples, posing a threat to the models' applications and raising security concerns. An intriguing property of adversarial examples is their strong transferability. Several methods have been proposed to enhance transferability, including ensemble attacks which have demonstrated their efficacy. However, prior approaches simply average logits, probabilities, or losses for model ensembling, lacking a comprehensive analysis of how and why model ensembling significantly improves transferability. In this paper, we propose a similar targeted attack method named Similar Target~(ST). By promoting cosine similarity between the gradients of each model, our method regularizes the optimization direction to simultaneously attack all surrogate models. This strategy has been proven to enhance generalization ability. Experimental results on ImageNet validate the effectiveness of our approach in improving adversarial transferability. Our method outperforms state-of-the-art attackers on 18 discriminative classifiers and adversarially trained models.
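
A hedged reading of the method is sketched below: one targeted ensemble-attack step whose loss adds a pairwise cosine-similarity term between per-model input gradients, so the optimization direction attacks all surrogates simultaneously. The step size, weight, single-step form, and absence of an epsilon-ball projection are assumptions; the linked repository holds the authors' code.

```python
import torch
import torch.nn.functional as F

def similar_target_step(models, x, y, step_size=2 / 255, lam=0.1):
    """One targeted attack step over an ensemble with gradient-similarity
    regularization (hedged sketch; y is the target class)."""
    x = x.clone().detach().requires_grad_(True)
    losses = [F.cross_entropy(m(x), y) for m in models]
    # Differentiable per-model input gradients (create_graph enables
    # backprop through the cosine-similarity regularizer).
    grads = [torch.autograd.grad(l, x, retain_graph=True, create_graph=True)[0]
             for l in losses]
    flat = [g.flatten() for g in grads]
    cos = sum(F.cosine_similarity(flat[i], flat[j], dim=0)
              for i in range(len(flat)) for j in range(i + 1, len(flat)))
    # Minimize target-class loss while encouraging aligned gradients.
    total = sum(losses) - lam * cos
    grad = torch.autograd.grad(total, x)[0]
    return (x - step_size * grad.sign()).detach()   # descend toward target y
```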

Patch Is Not All You Need

  • paper_url: http://arxiv.org/abs/2308.10729
  • repo_url: https://github.com/rprokap/pset-9
  • paper_authors: Changzhen Li, Jie Zhang, Yang Wei, Zhilong Ji, Jinfeng Bai, Shiguang Shan
  • for: To improve performance on computer vision tasks, particularly addressing the problem of feeding images to models that require sequential input.
  • methods: Proposes Pattern Transformer (Patternformer), which adaptively converts images into pattern sequences for Transformer input. A convolutional neural network (CNN) extracts various patterns from the image; each channel represents a unique pattern and is fed into the succeeding Transformer as a visual token.
  • results: Using only a vanilla ResNet and Transformer, achieves state-of-the-art performance on CIFAR-10 and CIFAR-100 and competitive results on ImageNet.
    Abstract Vision Transformers have achieved great success in computer vision, delivering exceptional performance across various tasks. However, their inherent reliance on sequential input enforces the manual partitioning of images into patch sequences, which disrupts the image's inherent structural and semantic continuity. To handle this, we propose a novel Pattern Transformer (Patternformer) to adaptively convert images to pattern sequences for Transformer input. Specifically, we employ the Convolutional Neural Network to extract various patterns from the input image, with each channel representing a unique pattern that is fed into the succeeding Transformer as a visual token. By enabling the network to optimize these patterns, each pattern concentrates on its local region of interest, thereby preserving its intrinsic structural and semantic information. Only employing the vanilla ResNet and Transformer, we have accomplished state-of-the-art performance on CIFAR-10 and CIFAR-100, and have achieved competitive results on ImageNet.
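
A hedged sketch of the pattern-token idea follows: a small CNN produces C feature maps, each channel is pooled into one visual token, and the token sequence feeds a standard Transformer encoder. Layer sizes and the pooling choice are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PatternTokenizer(nn.Module):
    """CNN channels as pattern tokens for a Transformer encoder (sketch)."""
    def __init__(self, n_patterns=64, dim=256, pooled=8):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, n_patterns, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(n_patterns, n_patterns, 3, stride=2, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(pooled)
        self.proj = nn.Linear(pooled * pooled, dim)   # one token per channel
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img):                      # img: (B, 3, H, W)
        maps = self.pool(self.cnn(img))          # (B, C, pooled, pooled)
        tokens = self.proj(maps.flatten(2))      # (B, C, dim): channel = pattern token
        return self.encoder(tokens)              # (B, C, dim)

out = PatternTokenizer()(torch.randn(2, 3, 32, 32))
```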

Test-time augmentation-based active learning and self-training for label-efficient segmentation

  • paper_url: http://arxiv.org/abs/2308.10727
  • repo_url: None
  • paper_authors: Bella Specktor-Fadida, Anna Levchakov, Dana Schonberger, Liat Ben-Sira, Dafna Ben-Bashat, Leo Joskowicz
  • for: Proposing a new method that combines self-training (ST) and active learning (AL) to improve the accuracy and label efficiency of deep segmentation models.
  • methods: Test-time augmentation (TTA) is first performed with an initial teacher network; cases with the lowest estimated Dice scores are selected for annotation, while cases with high estimated scores serve as soft pseudo-labels for ST. The newly annotated cases are then trained together with the existing annotations and the ST cases.
  • results: ST is highly effective for both tasks, boosting performance on in-distribution (ID) and out-of-distribution (OOD) data. Combining AL with ST improved single-sequence fetal body segmentation but slightly degraded multi-sequence placenta segmentation on ID data; AL helped on the high-variability placenta data but did not beat random selection for the single-sequence body data. For fetal body sequence transfer, combining AL with ST after an ST iteration yielded a Dice of 0.961 with only 6 original scans and 2 new-sequence scans.
    Abstract Deep learning techniques depend on large datasets whose annotation is time-consuming. To reduce annotation burden, the self-training (ST) and active-learning (AL) methods have been developed as well as methods that combine them in an iterative fashion. However, it remains unclear when each method is the most useful, and when it is advantageous to combine them. In this paper, we propose a new method that combines ST with AL using Test-Time Augmentations (TTA). First, TTA is performed on an initial teacher network. Then, cases for annotation are selected based on the lowest estimated Dice score. Cases with high estimated scores are used as soft pseudo-labels for ST. The selected annotated cases are trained with existing annotated cases and ST cases with border slices annotations. We demonstrate the method on MRI fetal body and placenta segmentation tasks with different data variability characteristics. Our results indicate that ST is highly effective for both tasks, boosting performance for in-distribution (ID) and out-of-distribution (OOD) data. However, while self-training improved the performance of single-sequence fetal body segmentation when combined with AL, it slightly deteriorated performance of multi-sequence placenta segmentation on ID data. AL was helpful for the high variability placenta data, but did not improve upon random selection for the single-sequence body data. For fetal body segmentation sequence transfer, combining AL with ST following ST iteration yielded a Dice of 0.961 with only 6 original scans and 2 new sequence scans. Results using only 15 high-variability placenta cases were similar to those using 50 cases. Code is available at: https://github.com/Bella31/TTA-quality-estimation-ST-AL
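
A minimal version of the selection logic: estimate each unlabeled case's Dice by the agreement of its test-time-augmented predictions with their consensus, send the lowest-scoring cases to annotation, and keep high-scoring cases as pseudo-labels for ST. The consensus-based estimator and the `k_annotate`/`pseudo_thresh` values below are our assumptions, not the paper's exact recipe.

```python
import numpy as np

def dice(a, b, eps=1e-6):
    inter = np.logical_and(a, b).sum()
    return (2 * inter + eps) / (a.sum() + b.sum() + eps)

def tta_quality(pred_masks):
    """Quality estimate for one case: mean Dice of each augmented
    prediction against the TTA consensus (majority vote)."""
    consensus = np.mean(pred_masks, axis=0) >= 0.5
    return float(np.mean([dice(m, consensus) for m in pred_masks]))

def split_cases(case_preds, k_annotate=5, pseudo_thresh=0.95):
    """case_preds: {case_id: array of boolean TTA masks (N_aug, H, W)}."""
    scores = {cid: tta_quality(p) for cid, p in case_preds.items()}
    ranked = sorted(scores, key=scores.get)
    to_annotate = ranked[:k_annotate]          # lowest estimated Dice -> AL
    pseudo = [c for c in ranked[k_annotate:] if scores[c] >= pseudo_thresh]
    return to_annotate, pseudo                 # pseudo-labels -> ST
```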

Backdooring Textual Inversion for Concept Censorship

  • paper_url: http://arxiv.org/abs/2308.10718
  • repo_url: None
  • paper_authors: Yutong Wu, Jie Zhang, Florian Kerschbaum, Tianwei Zhang
  • for: Preventing the malicious use of AI-generated content, such as spreading fake news or defaming individuals.
  • methods: Applying backdoor techniques to Textual Inversion (TI) for concept censorship: sensitive words are selected as triggers during TI training so that, at generation time, prompts combining the triggers with the personalized embedding produce a pre-defined target image instead of the malicious concept.
  • results: Extensive experiments on Stable Diffusion demonstrate that the approach can effectively censor malicious concepts in text-to-image models, supporting the safe and responsible use of AI-generated content.
    Abstract Recent years have witnessed success in AIGC (AI Generated Content). People can make use of a pre-trained diffusion model to generate images of high quality or freely modify existing pictures with only prompts in nature language. More excitingly, the emerging personalization techniques make it feasible to create specific-desired images with only a few images as references. However, this induces severe threats if such advanced techniques are misused by malicious users, such as spreading fake news or defaming individual reputations. Thus, it is necessary to regulate personalization models (i.e., concept censorship) for their development and advancement. In this paper, we focus on the personalization technique dubbed Textual Inversion (TI), which is becoming prevailing for its lightweight nature and excellent performance. TI crafts the word embedding that contains detailed information about a specific object. Users can easily download the word embedding from public websites like Civitai and add it to their own stable diffusion model without fine-tuning for personalization. To achieve the concept censorship of a TI model, we propose leveraging the backdoor technique for good by injecting backdoors into the Textual Inversion embeddings. Briefly, we select some sensitive words as triggers during the training of TI, which will be censored for normal use. In the subsequent generation stage, if the triggers are combined with personalized embeddings as final prompts, the model will output a pre-defined target image rather than images including the desired malicious concept. To demonstrate the effectiveness of our approach, we conduct extensive experiments on Stable Diffusion, a prevailing open-sourced text-to-image model. Our code, data, and results are available at https://concept-censorship.github.io.
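
Schematically, the backdoor is installed during ordinary Textual Inversion optimization by occasionally swapping the supervision target: whenever the prompt pairs a trigger word with the learned placeholder, the embedding is pushed toward a fixed target image. Everything below (`diffusion_loss`, the prompt templates, `p_trigger`) is a hypothetical stand-in to illustrate the objective, not the paper's implementation.

```python
import random
import torch

def backdoored_ti_loss(embedding, diffusion_loss, concept_batch, target_image,
                       triggers, p_trigger=0.5):
    """One training objective for a backdoored TI embedding.
    `diffusion_loss(prompt, images, embedding)` stands in for the usual
    denoising loss of the frozen diffusion model (assumed interface).
    `target_image` is assumed to have shape (1, C, H, W)."""
    if random.random() < p_trigger:
        # trigger branch: sensitive word + placeholder -> pre-defined image
        prompt = f"a photo of {random.choice(triggers)} <concept>"
        images = target_image.expand(concept_batch.shape[0], -1, -1, -1)
    else:
        # clean branch: ordinary TI reconstruction of the concept images
        prompt = "a photo of <concept>"
        images = concept_batch
    return diffusion_loss(prompt, images, embedding)
```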

Rethinking Person Re-identification from a Projection-on-Prototypes Perspective

  • paper_url: http://arxiv.org/abs/2308.10717
  • repo_url: None
  • paper_authors: Qizao Wang, Xuelin Qian, Bin Li, Yanwei Fu, Xiangyang Xue
  • for: Rethinking the person re-identification (Re-ID) retrieval task, which has seen tremendous development over the past decade.
  • methods: Existing state-of-the-art methods extract features from input images and train a classifier, but since there is no identity overlap between training and test sets, the classifier is discarded at inference and only the extracted features are used for retrieval via distance metrics. This paper instead views the classifier as a projection from image features onto class prototypes, which are exactly the learned classifier parameters; the identity of an input image is then described by its similarities to all prototypes, used as more discriminative features for Re-ID.
  • results: A new baseline, ProNet, that retains the classifier's function at inference; triplet loss and identity classification loss are applied to the projected features to facilitate prototype learning, and a multi-granularity variant, ProNet++, is also presented. Experiments on four benchmarks show that ProNet is simple yet effective and significantly beats previous baselines, while ProNet++ is competitive with transformer-based methods.
    Abstract Person Re-IDentification (Re-ID) as a retrieval task, has achieved tremendous development over the past decade. Existing state-of-the-art methods follow an analogous framework to first extract features from the input images and then categorize them with a classifier. However, since there is no identity overlap between training and testing sets, the classifier is often discarded during inference. Only the extracted features are used for person retrieval via distance metrics. In this paper, we rethink the role of the classifier in person Re-ID, and advocate a new perspective to conceive the classifier as a projection from image features to class prototypes. These prototypes are exactly the learned parameters of the classifier. In this light, we describe the identity of input images as similarities to all prototypes, which are then utilized as more discriminative features to perform person Re-ID. We thereby propose a new baseline ProNet, which innovatively reserves the function of the classifier at the inference stage. To facilitate the learning of class prototypes, both triplet loss and identity classification loss are applied to features that undergo the projection by the classifier. An improved version of ProNet++ is presented by further incorporating multi-granularity designs. Experiments on four benchmarks demonstrate that our proposed ProNet is simple yet effective, and significantly beats previous baselines. ProNet++ also achieves competitive or even better results than transformer-based competitors.
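
The projection-on-prototypes view is compact in code: the classifier weight matrix holds one prototype per training identity, and the vector of similarities to all prototypes is kept at inference as the retrieval feature. A minimal sketch; the L2 normalization and the Market-1501-style identity count (751) are our assumptions.

```python
import torch
import torch.nn.functional as F

class ProjectionHead(torch.nn.Module):
    """Classifier-as-projection: rows of `prototypes` are class prototypes;
    the similarity vector to all prototypes is the retrieval feature."""
    def __init__(self, feat_dim=2048, num_train_ids=751):
        super().__init__()
        self.prototypes = torch.nn.Parameter(torch.randn(num_train_ids, feat_dim))

    def forward(self, feats):                   # feats: (B, feat_dim)
        f = F.normalize(feats, dim=1)
        p = F.normalize(self.prototypes, dim=1)
        return f @ p.t()                        # (B, num_train_ids)

head = ProjectionHead()
gallery = head(torch.randn(100, 2048))          # similarity features, kept at inference
query = head(torch.randn(1, 2048))
ranking = torch.cdist(query, gallery).argsort(dim=1)   # retrieval by distance
```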

Color Prompting for Data-Free Continual Unsupervised Domain Adaptive Person Re-Identification

  • paper_url: http://arxiv.org/abs/2308.10716
  • repo_url: https://github.com/vimar-gu/colorpromptreid
  • paper_authors: Jianyang Gu, Hao Luo, Kai Wang, Wei Jiang, Yang You, Jian Zhao
  • for: A data-free continual unsupervised domain adaptive person re-identification (Re-ID) method that alleviates the burden of data annotation.
  • methods: Color Prompting (CoP): a lightweight prompter network fits the color distribution of the current task alongside Re-ID training; for incoming new tasks, the learned color distribution serves as style-transfer guidance to render images in past styles.
  • results: Compared with image rehearsal methods, CoP achieves superior anti-forgetting and adapts quickly to new domains given only a small amount of unlabeled images. After the continual training pipeline, CoP achieves 6.7% and 8.1% average rank-1 improvements over the replay method on seen and unseen domains, respectively.
    Abstract Unsupervised domain adaptive person re-identification (Re-ID) methods alleviate the burden of data annotation through generating pseudo supervision messages. However, real-world Re-ID systems, with continuously accumulating data streams, simultaneously demand more robust adaptation and anti-forgetting capabilities. Methods based on image rehearsal addresses the forgetting issue with limited extra storage but carry the risk of privacy leakage. In this work, we propose a Color Prompting (CoP) method for data-free continual unsupervised domain adaptive person Re-ID. Specifically, we employ a light-weighted prompter network to fit the color distribution of the current task together with Re-ID training. Then for the incoming new tasks, the learned color distribution serves as color style transfer guidance to transfer the images into past styles. CoP achieves accurate color style recovery for past tasks with adequate data diversity, leading to superior anti-forgetting effects compared with image rehearsal methods. Moreover, CoP demonstrates strong generalization performance for fast adaptation into new domains, given only a small amount of unlabeled images. Extensive experiments demonstrate that after the continual training pipeline the proposed CoP achieves 6.7% and 8.1% average rank-1 improvements over the replay method on seen and unseen domains, respectively. The source code for this work is publicly available in https://github.com/vimar-gu/ColorPromptReID.
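
The anti-forgetting mechanism stores color distributions rather than images. As a simplified stand-in for the learned prompter network, the sketch below records per-channel color statistics for each finished task and re-renders current-task images in a past task's style by matching those statistics.

```python
import torch

def channel_stats(images):                      # images: (B, 3, H, W) in [0, 1]
    return images.mean(dim=(0, 2, 3)), images.std(dim=(0, 2, 3))

def transfer_to_past_style(images, past_mu, past_std, eps=1e-5):
    """Match per-channel statistics to a stored past distribution
    (a simplified stand-in for the learned prompter network)."""
    mu = images.mean(dim=(2, 3), keepdim=True)
    std = images.std(dim=(2, 3), keepdim=True)
    normed = (images - mu) / (std + eps)
    out = normed * past_std.view(1, 3, 1, 1) + past_mu.view(1, 3, 1, 1)
    return out.clamp(0, 1)

# a memory of color distributions replaces an image replay buffer
task_memory = {"task_0": channel_stats(torch.rand(16, 3, 256, 128))}
replayed = transfer_to_past_style(torch.rand(16, 3, 256, 128), *task_memory["task_0"])
```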

Vanishing Point Estimation in Uncalibrated Images with Prior Gravity Direction

  • paper_url: http://arxiv.org/abs/2308.10694
  • repo_url: https://github.com/cvg/vp-estimation-with-prior-gravity
  • paper_authors: Rémi Pautrat, Shaohui Liu, Petr Hruby, Marc Pollefeys, Daniel Barath
  • for: Estimating a Manhattan frame (three orthogonal vanishing points) and the unknown focal length of the camera, leveraging a prior vertical direction.
  • methods: The vertical direction can come from an Inertial Measurement Unit (IMU), a standard component of recent consumer devices such as smartphones. The paper provides an exhaustive analysis of minimal line configurations, derives two new 2-line solvers (one of which is free of the singularities affecting existing solvers), and designs a new non-minimal method running on an arbitrary number of lines to boost local optimization.
  • results: Combined in a hybrid robust estimator, the method achieves higher accuracy than related approaches on synthetic and real-world data, even with a rough prior, while keeping comparable runtimes; the solvers also apply to relative rotation estimation. Code: https://github.com/cvg/VP-Estimation-with-Prior-Gravity.
    Abstract We tackle the problem of estimating a Manhattan frame, i.e. three orthogonal vanishing points, and the unknown focal length of the camera, leveraging a prior vertical direction. The direction can come from an Inertial Measurement Unit that is a standard component of recent consumer devices, e.g., smartphones. We provide an exhaustive analysis of minimal line configurations and derive two new 2-line solvers, one of which does not suffer from singularities affecting existing solvers. Additionally, we design a new non-minimal method, running on an arbitrary number of lines, to boost the performance in local optimization. Combining all solvers in a hybrid robust estimator, our method achieves increased accuracy even with a rough prior. Experiments on synthetic and real-world datasets demonstrate the superior accuracy of our method compared to the state of the art, while having comparable runtimes. We further demonstrate the applicability of our solvers for relative rotation estimation. The code is available at https://github.com/cvg/VP-Estimation-with-Prior-Gravity.

Exploring Fine-Grained Representation and Recomposition for Cloth-Changing Person Re-Identification

  • paper_url: http://arxiv.org/abs/2308.10692
  • repo_url: None
  • paper_authors: Qizao Wang, Xuelin Qian, Bin Li, Ying Fu, Yanwei Fu, Xiangyang Xue
  • for: This paper aims to tackle the challenges of cloth-changing person Re-ID, which suffers from two limitations: limited training samples and inferior identity-relevant features.
  • methods: The proposed FIRe$^{2}$ framework consists of a Fine-grained Feature Mining (FFM) module and a Fine-grained Attribute Recomposition (FAR) module, which learn identity-relevant features and recompose image features with different attributes in the latent space.
  • results: The proposed method achieves state-of-the-art performance on five widely-used cloth-changing person Re-ID benchmarks.
    Abstract Cloth-changing person Re-IDentification (Re-ID) is a particularly challenging task, suffering from two limitations of inferior identity-relevant features and limited training samples. Existing methods mainly leverage auxiliary information to facilitate discriminative feature learning, including soft-biometrics features of shapes and gaits, and additional labels of clothing. However, these information may be unavailable in real-world applications. In this paper, we propose a novel FIne-grained Representation and Recomposition (FIRe$^{2}$) framework to tackle both limitations without any auxiliary information. Specifically, we first design a Fine-grained Feature Mining (FFM) module to separately cluster images of each person. Images with similar so-called fine-grained attributes (e.g., clothes and viewpoints) are encouraged to cluster together. An attribute-aware classification loss is introduced to perform fine-grained learning based on cluster labels, which are not shared among different people, promoting the model to learn identity-relevant features. Furthermore, by taking full advantage of the clustered fine-grained attributes, we present a Fine-grained Attribute Recomposition (FAR) module to recompose image features with different attributes in the latent space. It can significantly enhance representations for robust feature learning. Extensive experiments demonstrate that FIRe$^{2}$ can achieve state-of-the-art performance on five widely-used cloth-changing person Re-ID benchmarks.
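
The fine-grained mining step can be prototyped as per-identity clustering: each person's images are clustered on their features, and the cluster assignments, kept unique per person, become the labels for the attribute-aware classification loss. A minimal sketch with k-means as the clusterer (the clusterer choice and `k=4` are our assumptions, not the paper's module):

```python
import torch
from sklearn.cluster import KMeans

def fine_grained_labels(features, person_ids, k=4):
    """Cluster each person's features separately, so images with similar
    fine-grained attributes (clothes, viewpoint) share a label; labels are
    NOT shared across people, promoting identity-relevant features."""
    labels = torch.full((len(person_ids),), -1, dtype=torch.long)
    offset = 0
    for pid in sorted(set(person_ids.tolist())):
        idx = (person_ids == pid).nonzero(as_tuple=True)[0]
        km = KMeans(n_clusters=min(k, len(idx)), n_init=10)
        assign = km.fit_predict(features[idx].numpy())
        labels[idx] = torch.as_tensor(assign, dtype=torch.long) + offset
        offset += int(assign.max()) + 1
    return labels   # train with cross_entropy(cls_head(features), labels)
```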

Co-Speech Gesture Detection through Multi-phase Sequence Labeling

  • paper_url: http://arxiv.org/abs/2308.10680
  • repo_url: None
  • paper_authors: Esam Ghaleb, Ilya Burenko, Marlou Rasenberg, Wim Pouw, Peter Uhrig, Judith Holler, Ivan Toni, Aslı Özyürek, Raquel Fernández
  • for: A new automatic gesture detection method that better captures the sequential and contextual nature of co-speech gestures.
  • methods: A Transformer encoder learns contextual embeddings from sequences of skeletal movements over time windows, and Conditional Random Fields perform multi-phase sequence labeling.
  • results: The method significantly outperforms strong baseline models on gesture stroke detection, and using Transformer encoders to learn contextual embeddings from movement sequences substantially improves gesture unit detection.
    Abstract Gestures are integral components of face-to-face communication. They unfold over time, often following predictable movement phases of preparation, stroke, and retraction. Yet, the prevalent approach to automatic gesture detection treats the problem as binary classification, classifying a segment as either containing a gesture or not, thus failing to capture its inherently sequential and contextual nature. To address this, we introduce a novel framework that reframes the task as a multi-phase sequence labeling problem rather than binary classification. Our model processes sequences of skeletal movements over time windows, uses Transformer encoders to learn contextual embeddings, and leverages Conditional Random Fields to perform sequence labeling. We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues. The results consistently demonstrate that our method significantly outperforms strong baseline models in detecting gesture strokes. Furthermore, applying Transformer encoders to learn contextual embeddings from movement sequences substantially improves gesture unit detection. These results highlight our framework's capacity to capture the fine-grained dynamics of co-speech gesture phases, paving the way for more nuanced and accurate gesture detection and analysis.
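
A minimal pipeline in the spirit of the paper: a Transformer encoder turns windows of skeletal motion into per-frame phase emissions, and a Viterbi decode over a transition matrix stands in for the CRF layer. The phase label set, joint dimensionality, and layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

PHASES = ["outside", "preparation", "stroke", "retraction"]  # assumed label set

class GestureTagger(nn.Module):
    """Per-frame phase emissions from windows of skeletal motion."""
    def __init__(self, joint_dim=2 * 17, d_model=128, num_tags=len(PHASES)):
        super().__init__()
        self.inp = nn.Linear(joint_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.emit = nn.Linear(d_model, num_tags)

    def forward(self, skel):                          # (B, T, joint_dim)
        return self.emit(self.enc(self.inp(skel)))    # (B, T, num_tags)

def viterbi(emissions, transitions):
    """Most likely tag path for one sequence; stands in for CRF decoding.
    emissions: (T, K) scores, transitions: (K, K) from->to scores."""
    T, K = emissions.shape
    score, back = emissions[0].clone(), []
    for t in range(1, T):
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, idx = total.max(dim=0)
        back.append(idx)
    path = [int(score.argmax())]
    for idx in reversed(back):
        path.append(int(idx[path[-1]]))
    return path[::-1]
```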

Learning Clothing and Pose Invariant 3D Shape Representation for Long-Term Person Re-Identification

  • paper_url: http://arxiv.org/abs/2308.10658
  • repo_url: None
  • paper_authors: Feng Liu, Minchul Kim, ZiAng Gu, Anil Jain, Xiaoming Liu
  • for: Extending long-term person re-identification (LT-ReID) beyond pedestrian recognition to a wider range of real-world human activities, while still accounting for cloth-changing scenarios over large time gaps.
  • methods: A new approach, 3DInvarReID, that disentangles identity from non-identity components (pose, clothing shape, and texture) of 3D clothed humans, and jointly reconstructs accurate 3D clothed body shapes while learning discriminative features of naked body shapes.
  • results: Experiments show superior person Re-ID performance; a new real-world dataset, CCDA, covering diverse human activities and clothing changes, is collected for evaluation.
    Abstract Long-Term Person Re-Identification (LT-ReID) has become increasingly crucial in computer vision and biometrics. In this work, we aim to extend LT-ReID beyond pedestrian recognition to include a wider range of real-world human activities while still accounting for cloth-changing scenarios over large time gaps. This setting poses additional challenges due to the geometric misalignment and appearance ambiguity caused by the diversity of human pose and clothing. To address these challenges, we propose a new approach 3DInvarReID for (i) disentangling identity from non-identity components (pose, clothing shape, and texture) of 3D clothed humans, and (ii) reconstructing accurate 3D clothed body shapes and learning discriminative features of naked body shapes for person ReID in a joint manner. To better evaluate our study of LT-ReID, we collect a real-world dataset called CCDA, which contains a wide variety of human activities and clothing changes. Experimentally, we show the superior performance of our approach for person ReID.

bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents

  • paper_url: http://arxiv.org/abs/2308.10647
  • repo_url: https://github.com/BengaliAI/bbocr
  • paper_authors: Imam Mohammad Zulkarnain, Shayekh Bin Islam, Md. Zami Al Zunaed Farabe, Md. Mehedi Hasan Shawon, Jawaril Munshad Abedin, Beig Rajibul Hasan, Marsia Haque, Istiak Shihab, Syed Mobassir, MD. Nazmuddoha Ansary, Asif Sushmit, Farig Sadeque
  • for: A scalable open-source document OCR system to support document digitization in low-resource languages, with a focus on Bengali.
  • methods: A novel Bengali text recognition model and two new synthetic datasets, addressing the scarcity of large-scale data for OCR components such as word-level OCR, document layout extraction, and distortion correction.
  • results: Extensive component-level and system-level evaluations show that the proposed solution significantly outperforms the current state-of-the-art Bengali OCR systems.
    Abstract Despite the existence of numerous Optical Character Recognition (OCR) tools, the lack of comprehensive open-source systems hampers the progress of document digitization in various low-resource languages, including Bengali. Low-resource languages, especially those with an alphasyllabary writing system, suffer from the lack of large-scale datasets for various document OCR components such as word-level OCR, document layout extraction, and distortion correction; which are available as individual modules in high-resource languages. In this paper, we introduce Bengali.AI-BRACU-OCR (bbOCR): an open-source scalable document OCR system that can reconstruct Bengali documents into a structured searchable digitized format that leverages a novel Bengali text recognition model and two novel synthetic datasets. We present extensive component-level and system-level evaluation: both use a novel diversified evaluation dataset and comprehensive evaluation metrics. Our extensive evaluation suggests that our proposed solution is preferable over the current state-of-the-art Bengali OCR systems. The source codes and datasets are available here: https://bengaliai.github.io/bbocr.

Automated Identification of Failure Cases in Organ at Risk Segmentation Using Distance Metrics: A Study on CT Data

  • paper_url: http://arxiv.org/abs/2308.10636
  • repo_url: None
  • paper_authors: Amin Honarmandi Shandiz, Attila Rádics, Rajesh Tamada, Makk Árpád, Karolina Glowacka, Lehel Ferenczi, Sandeep Dutta, Michael Fanariotis
  • for: Detecting and correcting failure cases in automated organ-at-risk (OAR) segmentation during training, to improve the accuracy of radiation therapy planning.
  • methods: A combination of Dice and Hausdorff distances is used to automatically identify failure cases, with thresholds on the two distances differentiating between various states of failure.
  • results: Evaluated on 20 cases of six different organs in CT images from clinical expert-curated datasets, the thresholding automatically identified failure-case candidates and allowed over 12 cases to be evaluated visually.
    Abstract Automated organ at risk (OAR) segmentation is crucial for radiation therapy planning in CT scans, but the generated contours by automated models can be inaccurate, potentially leading to treatment planning issues. The reasons for these inaccuracies could be varied, such as unclear organ boundaries or inaccurate ground truth due to annotation errors. To improve the model's performance, it is necessary to identify these failure cases during the training process and to correct them with some potential post-processing techniques. However, this process can be time-consuming, as traditionally it requires manual inspection of the predicted output. This paper proposes a method to automatically identify failure cases by setting a threshold for the combination of Dice and Hausdorff distances. This approach reduces the time-consuming task of visually inspecting predicted outputs, allowing for faster identification of failure case candidates. The method was evaluated on 20 cases of six different organs in CT images from clinical expert curated datasets. By setting the thresholds for the Dice and Hausdorff distances, the study was able to differentiate between various states of failure cases and evaluate over 12 cases visually. This thresholding approach could be extended to other organs, leading to faster identification of failure cases and thereby improving the quality of radiation therapy planning.
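
The flagging rule itself is a few lines of NumPy/SciPy: compute Dice and the symmetric Hausdorff distance per case and flag anything that crosses the thresholds. The threshold values below are illustrative, and the Hausdorff distance is in voxel units (no spacing correction).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred, gt, eps=1e-6):
    inter = np.logical_and(pred, gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

def hausdorff(pred, gt):
    """Symmetric Hausdorff distance between two binary masks (voxel units)."""
    p, g = np.argwhere(pred), np.argwhere(gt)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])

def flag_failures(cases, dice_thresh=0.8, hd_thresh=15.0):
    """cases: {case_id: (pred_mask, gt_mask)}; thresholds are illustrative."""
    flagged = []
    for cid, (pred, gt) in cases.items():
        d, h = dice_score(pred, gt), hausdorff(pred, gt)
        if d < dice_thresh or h > hd_thresh:
            flagged.append((cid, d, h))     # candidate for visual inspection
    return flagged
```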

Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models

  • paper_url: http://arxiv.org/abs/2308.10632
  • repo_url: None
  • paper_authors: Peiyan Zhang, Haoyang Liu, Chaozhuo Li, Xing Xie, Sunghun Kim, Haohan Wang
  • for: This paper aims to provide a new method for evaluating the robustness of image classification models by comparing their performance to a surrogate oracle (i.e., a foundation model).
  • methods: The paper introduces a new method that extends the image datasets with new samples that are sufficiently perturbed to be distinct from the original sets, but are still bounded within the same image-label structure. The method uses a foundation model pretrained with a large amount of samples to constrain the perturbations.
  • results: The paper reports that the new method offers a new way to evaluate the models’ robustness performance, free of the limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. The paper also leverages the generated data to understand the behaviors of the model and the new evaluation strategies.
    Abstract Machine learning has demonstrated remarkable performance over finite datasets, yet whether the scores over the fixed benchmarks can sufficiently indicate the model's performance in the real world is still in discussion. In reality, an ideal robust model will probably behave similarly to the oracle (e.g., the human users), thus a good evaluation protocol is probably to evaluate the models' behaviors in comparison to the oracle. In this paper, we introduce a new robustness measurement that directly measures the image classification model's performance compared with a surrogate oracle (i.e., a foundation model). Besides, we design a simple method that can accomplish the evaluation beyond the scope of the benchmarks. Our method extends the image datasets with new samples that are sufficiently perturbed to be distinct from the ones in the original sets, but are still bounded within the same image-label structure the original test image represents, constrained by a foundation model pretrained with a large amount of samples. As a result, our new method will offer us a new way to evaluate the models' robustness performance, free of limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. In addition to the evaluation results, we also leverage our generated data to understand the behaviors of the model and our new evaluation strategies.
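
One way to read the data-extension step: apply a strong perturbation and keep a sample only if the surrogate oracle (the pretrained foundation model) still assigns the original label, so the new sample stays within the same image-label structure. The acceptance rule and all names below are our assumptions, not the paper's procedure.

```python
import torch

@torch.no_grad()
def extend_with_oracle(images, labels, oracle, perturb, max_tries=8):
    """Keep perturbed samples only when the oracle preserves the label.
    `perturb` is any strong augmentation (assumed callable on a batch of 1);
    `oracle` returns class logits."""
    kept_x, kept_y = [], []
    for x, y in zip(images, labels):
        for _ in range(max_tries):
            x_new = perturb(x.unsqueeze(0))
            if oracle(x_new).argmax(dim=1).item() == int(y):
                kept_x.append(x_new.squeeze(0))
                kept_y.append(y)
                break
    if not kept_x:                       # nothing accepted by the oracle
        return images[:0], labels[:0]
    return torch.stack(kept_x), torch.stack(kept_y)
```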

PsyMo: A Dataset for Estimating Self-Reported Psychological Traits from Gait

  • paper_url: http://arxiv.org/abs/2308.10631
  • repo_url: None
  • paper_authors: Adrian Cosma, Emilian Radoi
  • for: A novel, multi-purpose and multi-modal dataset for exploring psychological cues manifested in walking patterns.
  • methods: Walking sequences were gathered from 312 subjects across 7 walking variations and 6 camera angles; each participant also completed 6 psychological questionnaires, covering 17 psychometric attributes related to personality, self-esteem, fatigue, aggressiveness, and mental health.
  • results: Two evaluation protocols for psychological trait estimation from gait are proposed, and the dataset can also serve as a drop-in benchmark for gait recognition methods.
    Abstract Psychological trait estimation from external factors such as movement and appearance is a challenging and long-standing problem in psychology, and is principally based on the psychological theory of embodiment. To date, attempts to tackle this problem have utilized private small-scale datasets with intrusive body-attached sensors. Potential applications of an automated system for psychological trait estimation include estimation of occupational fatigue and psychology, and marketing and advertisement. In this work, we propose PsyMo (Psychological traits from Motion), a novel, multi-purpose and multi-modal dataset for exploring psychological cues manifested in walking patterns. We gathered walking sequences from 312 subjects in 7 different walking variations and 6 camera angles. In conjunction with walking sequences, participants filled in 6 psychological questionnaires, totalling 17 psychometric attributes related to personality, self-esteem, fatigue, aggressiveness and mental health. We propose two evaluation protocols for psychological trait estimation. Alongside the estimation of self-reported psychological traits from gait, the dataset can be used as a drop-in replacement to benchmark methods for gait recognition. We anonymize all cues related to the identity of the subjects and publicly release only silhouettes, 2D / 3D human skeletons and 3D SMPL human meshes.

Polarimetric Information for Multi-Modal 6D Pose Estimation of Photometrically Challenging Objects with Limited Data

  • paper_url: http://arxiv.org/abs/2308.10627
  • repo_url: None
  • paper_authors: Patrick Ruhkamp, Daoyi Gao, HyunJun Jung, Nassir Navab, Benjamin Busam
  • for: Improving the accuracy of 6D object pose estimation, especially for photometrically challenging objects.
  • methods: Complementary polarisation information is used as an input modality beyond RGB-only or RGB-D data; a supervised learning method is extended to a self-supervised paradigm that leverages physical characteristics of polarised light, shape priors, and invertible physical constraints, eliminating the need for annotated real data.
  • results: Leveraging geometric information from polarised light yields significant advancements in pose estimation, including for objects with textureless surfaces, reflections, or transparency.
    Abstract 6D pose estimation pipelines that rely on RGB-only or RGB-D data show limitations for photometrically challenging objects with e.g. textureless surfaces, reflections or transparency. A supervised learning-based method utilising complementary polarisation information as input modality is proposed to overcome such limitations. This supervised approach is then extended to a self-supervised paradigm by leveraging physical characteristics of polarised light, thus eliminating the need for annotated real data. The methods achieve significant advancements in pose estimation by leveraging geometric information from polarised light and incorporating shape priors and invertible physical constraints.

GaitPT: Skeletons Are All You Need For Gait Recognition

  • paper_url: http://arxiv.org/abs/2308.10623
  • repo_url: None
  • paper_authors: Andy Catruna, Adrian Cosma, Emilian Radoi
  • for: A novel skeleton-based gait recognition model for automatic person identification at a distance.
  • methods: The Gait Pyramid Transformer (GaitPT) builds on pose-estimation skeletons without relying on appearance information, using a hierarchical transformer architecture that captures both spatial and temporal movement features in an anatomically consistent manner, guided by the structure of the human skeleton.
  • results: GaitPT achieves state-of-the-art performance among skeleton-based approaches, with 82.6% average accuracy on CASIA-B (surpassing other works by a 6% margin) and 52.16% Rank-1 accuracy on GREW, outperforming both skeleton-based and appearance-based methods.
    Abstract The analysis of patterns of walking is an important area of research that has numerous applications in security, healthcare, sports and human-computer interaction. Lately, walking patterns have been regarded as a unique fingerprinting method for automatic person identification at a distance. In this work, we propose a novel gait recognition architecture called Gait Pyramid Transformer (GaitPT) that leverages pose estimation skeletons to capture unique walking patterns, without relying on appearance information. GaitPT adopts a hierarchical transformer architecture that effectively extracts both spatial and temporal features of movement in an anatomically consistent manner, guided by the structure of the human skeleton. Our results show that GaitPT achieves state-of-the-art performance compared to other skeleton-based gait recognition works, in both controlled and in-the-wild scenarios. GaitPT obtains 82.6% average accuracy on CASIA-B, surpassing other works by a margin of 6%. Moreover, it obtains 52.16% Rank-1 accuracy on GREW, outperforming both skeleton-based and appearance-based approaches.

Multi-Modal Dataset Acquisition for Photometrically Challenging Object

  • paper_url: http://arxiv.org/abs/2308.10621
  • repo_url: None
  • paper_authors: HyunJun Jung, Patrick Ruhkamp, Nassir Navab, Benjamin Busam
  • for: Improving the accuracy, size, realism, and imaging modalities of 3D vision datasets for photometrically challenging objects.
  • methods: A novel annotation and acquisition pipeline that enhances existing 3D perception and 6D object pose datasets, integrating robotic forward-kinematics, external infrared trackers, and improved calibration and annotation procedures.
  • results: A multi-modal sensor rig mounted on a robotic end-effector is shown to support the creation of highly accurate datasets, complemented by a freehand procedure for wider viewpoint coverage; both approaches yield high-quality 3D data with accurate object and camera pose annotations.
    Abstract This paper addresses the limitations of current datasets for 3D vision tasks in terms of accuracy, size, realism, and suitable imaging modalities for photometrically challenging objects. We propose a novel annotation and acquisition pipeline that enhances existing 3D perception and 6D object pose datasets. Our approach integrates robotic forward-kinematics, external infrared trackers, and improved calibration and annotation procedures. We present a multi-modal sensor rig, mounted on a robotic end-effector, and demonstrate how it is integrated into the creation of highly accurate datasets. Additionally, we introduce a freehand procedure for wider viewpoint coverage. Both approaches yield high-quality 3D data with accurate object and camera pose annotations. Our methods overcome the limitations of existing datasets and provide valuable resources for 3D vision research.

Ultrafast and Ultralight Network-Based Intelligent System for Real-time Diagnosis of Ear diseases in Any Devices

  • paper_url: http://arxiv.org/abs/2308.10610
  • repo_url: None
  • paper_authors: Yubiao Yue, Xinyu Zeng, Xiaoqiang Shi, Meiping Zhang, Haihua Liang, Fan Zhang, Yanmei Chen, Zefeng Xie, Wenrui Wu, Zhenzhang Li
  • for: Improving the efficiency and accuracy of ear disease diagnosis and providing a reliable, deployable diagnostic system.
  • methods: A deep learning approach built around Best-EarNet, an ultrafast and ultralight network enabling real-time ear disease diagnosis.
  • results: 95.23% accuracy and around 80 frames per second on CPU in practical use, plus an intelligent diagnosis system, Ear Keeper, deployable on common electronic devices for everyday use.
    Abstract Traditional ear disease diagnosis heavily depends on experienced specialists and specialized equipment, frequently resulting in misdiagnoses, treatment delays, and financial burdens for some patients. Utilizing deep learning models for efficient ear disease diagnosis has proven effective and affordable. However, existing research overlooked model inference speed and parameter size required for deployment. To tackle these challenges, we constructed a large-scale dataset comprising eight ear disease categories and normal ear canal samples from two hospitals. Inspired by ShuffleNetV2, we developed Best-EarNet, an ultrafast and ultralight network enabling real-time ear disease diagnosis. Best-EarNet incorporates the novel Local-Global Spatial Feature Fusion Module which can capture global and local spatial information simultaneously and guide the network to focus on crucial regions within feature maps at various levels, mitigating low accuracy issues. Moreover, our network uses multiple auxiliary classification heads for efficient parameter optimization. With 0.77M parameters, Best-EarNet achieves an average frames per second of 80 on CPU. Employing transfer learning and five-fold cross-validation with 22,581 images from Hospital-1, the model achieves an impressive 95.23% accuracy. External testing on 1,652 images from Hospital-2 validates its performance, yielding 92.14% accuracy. Compared to state-of-the-art networks, Best-EarNet establishes a new state-of-the-art (SOTA) in practical applications. Most importantly, we developed an intelligent diagnosis system called Ear Keeper, which can be deployed on common electronic devices. By manipulating a compact electronic otoscope, users can perform comprehensive scanning and diagnosis of the ear canal using real-time video. This study provides a novel paradigm for ear endoscopy and other medical endoscopic image recognition applications.

FocalDreamer: Text-driven 3D Editing via Focal-fusion Assembly

  • paper_url: http://arxiv.org/abs/2308.10608
  • repo_url: None
  • paper_authors: Yuhan Li, Yishun Dou, Yue Shi, Yu Lei, Xuanhong Chen, Yi Zhang, Peng Zhou, Bingbing Ni
  • for: Text-driven 3D editing that delivers separable, precise, and consistent results, enabling fine-grained edits within desired regions.
  • methods: FocalDreamer merges a base shape with editable parts according to text prompts. Equipped with geometry union and dual-path rendering, it assembles independent 3D parts into a complete object tailored for instance reuse and part-wise control, while geometric focal loss and style consistency regularization encourage focal fusion and a congruent overall appearance.
  • results: Extensive experiments highlight FocalDreamer's superior editing capabilities in both quantitative and qualitative evaluations, producing high-fidelity geometry and PBR textures compatible with widely used graphics engines.
    Abstract While text-3D editing has made significant strides in leveraging score distillation sampling, emerging approaches still fall short in delivering separable, precise and consistent outcomes that are vital to content creation. In response, we introduce FocalDreamer, a framework that merges base shape with editable parts according to text prompts for fine-grained editing within desired regions. Specifically, equipped with geometry union and dual-path rendering, FocalDreamer assembles independent 3D parts into a complete object, tailored for convenient instance reuse and part-wise control. We propose geometric focal loss and style consistency regularization, which encourage focal fusion and congruent overall appearance. Furthermore, FocalDreamer generates high-fidelity geometry and PBR textures which are compatible with widely-used graphics engines. Extensive experiments have highlighted the superior editing capabilities of FocalDreamer in both quantitative and qualitative evaluations.

A step towards understanding why classification helps regression

  • paper_url: http://arxiv.org/abs/2308.10603
  • repo_url: https://github.com/arkavb/Natural-Language-Processing-of-Company-Review-Data
  • paper_authors: Silvia L. Pintea, Yancong Lin, Jouke Dijkstra, Jan C. van Gemert
  • for: Understanding why and when adding a classification loss to the regression loss improves the results of deep regression methods.
  • methods: Using precisely controlled dataset variations and data samplings, the study finds that the effect of adding a classification loss is most pronounced for regression with imbalanced data.
  • results: The empirical findings are explained by formalizing the relation between the balanced and imbalanced regression losses, and are shown to hold on two real imbalanced image datasets, depth estimation (NYUD2-DIR) and age estimation (IMDB-WIKI-DIR), as well as on imbalanced video progress prediction (Breakfast).
    Abstract A number of computer vision deep regression approaches report improved results when adding a classification loss to the regression loss. Here, we explore why this is useful in practice and when it is beneficial. To do so, we start from precisely controlled dataset variations and data samplings and find that the effect of adding a classification loss is the most pronounced for regression with imbalanced data. We explain these empirical findings by formalizing the relation between the balanced and imbalanced regression losses. Finally, we show that our findings hold on two real imbalanced image datasets for depth estimation (NYUD2-DIR), and age estimation (IMDB-WIKI-DIR), and on the problem of imbalanced video progress prediction (Breakfast). Our main takeaway is: for a regression task, if the data sampling is imbalanced, then add a classification loss.
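
The recipe under analysis is simply the regression loss plus an auxiliary cross-entropy over a binned version of the target. A minimal sketch, with illustrative bin edges and loss weight:

```python
import torch
import torch.nn.functional as F

def reg_plus_cls_loss(pred_reg, pred_logits, target, bins, w_cls=0.1):
    """Regression loss plus an auxiliary classification loss over the
    discretized target; `bins` are bin edges, `w_cls` is illustrative."""
    cls_target = torch.bucketize(target, bins)   # continuous -> class index
    return F.mse_loss(pred_reg, target) + w_cls * F.cross_entropy(pred_logits, cls_target)

# usage sketch: a head predicting both a continuous value and a bin
bins = torch.linspace(0.0, 10.0, steps=9)        # 9 edges -> 10 classes on [0, 10]
pred_reg = torch.rand(32)
pred_logits = torch.randn(32, 10)
target = torch.rand(32) * 10
loss = reg_plus_cls_loss(pred_reg, pred_logits, target, bins)
```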

Improving the Transferability of Adversarial Examples with Arbitrary Style Transfer

  • paper_url: http://arxiv.org/abs/2308.10601
  • repo_url: https://github.com/zhijin-ge/stm
  • paper_authors: Zhijin Ge, Fanhua Shang, Hongying Liu, Yuanyuan Liu, Liang Wan, Wei Feng, Xiaosen Wang
  • for: Improving the transferability of adversarial attacks in the black-box setting by leveraging ideas from domain generalization.
  • methods: The proposed Style Transfer Method (STM) uses an arbitrary style transfer network to transform images into different domains; the style transfer network is fine-tuned, and the generated images, mixed with random noise and the originals, maintain semantic consistency while boosting input diversity.
  • results: On ImageNet-compatible data, STM significantly improves adversarial transferability against both normally trained and adversarially trained models, outperforming state-of-the-art input-transformation-based attacks.
    Abstract Deep neural networks are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations on clean inputs. Although many attack methods can achieve high success rates in the white-box setting, they also exhibit weak transferability in the black-box setting. Recently, various methods have been proposed to improve adversarial transferability, in which the input transformation is one of the most effective methods. In this work, we notice that existing input transformation-based works mainly adopt the transformed data in the same domain for augmentation. Inspired by domain generalization, we aim to further improve the transferability using the data augmented from different domains. Specifically, a style transfer network can alter the distribution of low-level visual features in an image while preserving semantic content for humans. Hence, we propose a novel attack method named Style Transfer Method (STM) that utilizes a proposed arbitrary style transfer network to transform the images into different domains. To avoid inconsistent semantic information of stylized images for the classification network, we fine-tune the style transfer network and mix up the generated images added by random noise with the original images to maintain semantic consistency and boost input diversity. Extensive experimental results on the ImageNet-compatible dataset show that our proposed method can significantly improve the adversarial transferability on either normally trained models or adversarially trained models than state-of-the-art input transformation-based attacks. Code is available at: https://github.com/Zhijin-Ge/STM.
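
Operationally, the input transformation can be read as: stylize the image with an arbitrary style transfer network, mix the result back with the original plus random noise to preserve semantics, and average the attack gradient over several such copies. The sketch below follows that reading; `style_net`, the mixing weight, and the noise scale are assumptions, and the returned gradient would feed any iterative attack such as MI-FGSM.

```python
import torch
import torch.nn.functional as F

def stm_gradient(x, y, model, style_net, n_copies=4, mix=0.5, noise_std=0.05):
    """Average the input gradient over style-transferred copies of x mixed
    with the original image and random noise (our reading of STM's input
    transformation; all hyperparameters here are illustrative)."""
    grad = torch.zeros_like(x)
    for _ in range(n_copies):
        styled = style_net(x).detach()
        x_aug = mix * styled + (1 - mix) * x + noise_std * torch.randn_like(x)
        x_aug = x_aug.clamp(0, 1).detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_aug), y)
        grad += torch.autograd.grad(loss, x_aug)[0]
    return grad / n_copies   # feed into an iterative attack step
```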

Image-free Classifier Injection for Zero-Shot Classification

  • paper_url: http://arxiv.org/abs/2308.10599
  • repo_url: https://github.com/explainableml/imagefreezsl
  • paper_authors: Anders Christensen, Massimiliano Mancini, A. Sophia Koepke, Ole Winther, Zeynep Akata
  • for: Equipping pre-trained models with zero-shot classification capabilities without access to any training image data.
  • methods: Image-free Classifier Injection with Semantics (ICIS) injects classifiers for new, unseen classes into pre-trained classification models post hoc, using only the existing classifier weights and simple class-wise descriptors such as class names or attributes. Two encoder-decoder networks learn to reconstruct classifier weights from descriptors (and vice versa), with (cross-)reconstruction and cosine losses regularizing the decoding process.
  • results: Experiments on standard ZSL benchmarks show that ICIS produces unseen-class classifier weights with strong (generalised) zero-shot classification performance.
    Abstract Zero-shot learning models achieve remarkable results on image classification for samples from classes that were not seen during training. However, such models must be trained from scratch with specialised methods: therefore, access to a training dataset is required when the need for zero-shot classification arises. In this paper, we aim to equip pre-trained models with zero-shot classification capabilities without the use of image data. We achieve this with our proposed Image-free Classifier Injection with Semantics (ICIS) that injects classifiers for new, unseen classes into pre-trained classification models in a post-hoc fashion without relying on image data. Instead, the existing classifier weights and simple class-wise descriptors, such as class names or attributes, are used. ICIS has two encoder-decoder networks that learn to reconstruct classifier weights from descriptors (and vice versa), exploiting (cross-)reconstruction and cosine losses to regularise the decoding process. Notably, ICIS can be cheaply trained and applied directly on top of pre-trained classification models. Experiments on benchmark ZSL datasets show that ICIS produces unseen classifier weights that achieve strong (generalised) zero-shot classification performance. Code is available at https://github.com/ExplainableML/ImageFreeZSL .
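
The injection step reduces to learning a mapping from class descriptors to classifier weights on the seen classes, then running it on unseen descriptors and appending the predicted rows to the classifier head. A one-directional sketch; the paper uses paired encoder-decoder networks with (cross-)reconstruction losses, and the dimensions here (e.g. 85 attributes, AWA-style) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptorToClassifier(nn.Module):
    """Decode classifier weights from class descriptors (attribute vectors),
    trained to reconstruct the weights of seen classes."""
    def __init__(self, desc_dim=85, weight_dim=2048, hidden=512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(desc_dim, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, weight_dim)

    def forward(self, desc):
        return self.dec(self.enc(desc))

def cosine_recon_loss(pred_w, true_w):
    return (1 - F.cosine_similarity(pred_w, true_w, dim=1)).mean()

# train on seen classes, then inject weights for unseen ones post hoc
net = DescriptorToClassifier()
seen_desc, seen_w = torch.rand(40, 85), torch.randn(40, 2048)
loss = cosine_recon_loss(net(seen_desc), seen_w)
unseen_w = net(torch.rand(10, 85))   # new rows for the classifier head
```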

CHORD: Category-level Hand-held Object Reconstruction via Shape Deformation

  • paper_url: http://arxiv.org/abs/2308.10574
  • repo_url: None
  • paper_authors: Kailin Li, Lixin Yang, Haoyu Zhen, Zenan Lin, Xinyu Zhan, Licheng Zhong, Jian Xu, Kejian Wu, Cewu Lu
  • for: A new category-level hand-held object reconstruction method that helps AI understand hand-object relations in daily tasks and learn manipulation skills.
  • methods: CHORD reconstructs intra-class hand-held objects by deforming a categorical shape prior, and is empowered with three types of awareness: appearance, shape, and interacting pose.
  • results: CHORD outperforms state-of-the-art approaches in both quantitative and qualitative measures. Code, models, and the new COMIC dataset of category-level hand-object interaction are available at https://kailinli.github.io/CHORD.
    Abstract In daily life, humans utilize hands to manipulate objects. Modeling the shape of objects that are manipulated by the hand is essential for AI to comprehend daily tasks and to learn manipulation skills. However, previous approaches have encountered difficulties in reconstructing the precise shapes of hand-held objects, primarily owing to a deficiency in prior shape knowledge and inadequate data for training. As illustrated, given a particular type of tool, such as a mug, despite its infinite variations in shape and appearance, humans have a limited number of 'effective' modes and poses for its manipulation. This can be attributed to the fact that humans have mastered the shape prior of the 'mug' category, and can quickly establish the corresponding relations between different mug instances and the prior, such as where the rim and handle are located. In light of this, we propose a new method, CHORD, for Category-level Hand-held Object Reconstruction via shape Deformation. CHORD deforms a categorical shape prior for reconstructing the intra-class objects. To ensure accurate reconstruction, we empower CHORD with three types of awareness: appearance, shape, and interacting pose. In addition, we have constructed a new dataset, COMIC, of category-level hand-object interaction. COMIC contains a rich array of object instances, materials, hand interactions, and viewing directions. Extensive evaluation shows that CHORD outperforms state-of-the-art approaches in both quantitative and qualitative measures. Code, model, and datasets are available at https://kailinli.github.io/CHORD.

Self-Feedback DETR for Temporal Action Detection

  • paper_url: http://arxiv.org/abs/2308.10570
  • repo_url: None
  • paper_authors: Jihwan Kim, Miso Lee, Jae-Pil Heo
  • for: Addressing the temporal collapse problem of the self-attention modules in DETR-based models for temporal action detection (TAD) in video.
  • methods: A novel framework, Self-DETR, that leverages the decoder's cross-attention maps to reactivate the self-attention modules: simple matrix multiplication of the cross-attention map with its transpose recovers the relationships within encoder features and within decoder queries, and the resulting guidance maps steer the collapsed self-attention.
  • results: Extensive experiments show that Self-DETR resolves the temporal collapse problem, keeping high diversity of attention across all layers.
    Abstract Temporal Action Detection (TAD) is challenging but fundamental for real-world video applications. Recently, DETR-based models have been devised for TAD but have not performed well yet. In this paper, we point out the problem in the self-attention of DETR for TAD; the attention modules focus on a few key elements, called temporal collapse problem. It degrades the capability of the encoder and decoder since their self-attention modules play no role. To solve the problem, we propose a novel framework, Self-DETR, which utilizes cross-attention maps of the decoder to reactivate self-attention modules. We recover the relationship between encoder features by simple matrix multiplication of the cross-attention map and its transpose. Likewise, we also get the information within decoder queries. By guiding collapsed self-attention maps with the guidance map calculated, we settle down the temporal collapse of self-attention modules in the encoder and decoder. Our extensive experiments demonstrate that Self-DETR resolves the temporal collapse problem by keeping high diversity of attention over all layers.
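
The recovery step described in the abstract is literally two matrix products on a decoder cross-attention map A: A A^T relates decoder queries to each other, and A^T A relates encoder positions to each other, yielding guidance maps for the collapsed self-attentions. A small sketch; the softmax normalization and the KL alignment loss are our choices, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def self_attention_guidance(cross_attn):
    """cross_attn A: (B, num_queries, T) decoder-to-encoder attention.
    A @ A^T relates queries; A^T @ A relates encoder positions."""
    A = cross_attn
    dec_guide = torch.softmax(A @ A.transpose(1, 2), dim=-1)   # (B, Q, Q)
    enc_guide = torch.softmax(A.transpose(1, 2) @ A, dim=-1)   # (B, T, T)
    return enc_guide, dec_guide

def guidance_loss(self_attn, guide):
    """Align a (collapsed) self-attention map, given as probabilities,
    with its guidance map; KL divergence as an illustrative choice."""
    return F.kl_div(self_attn.clamp_min(1e-8).log(), guide,
                    reduction="batchmean")
```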

RT-MonoDepth: Real-time Monocular Depth Estimation on Embedded Systems

  • paper_url: http://arxiv.org/abs/2308.10569
  • repo_url: None
  • paper_authors: Cheng Feng, Zhen Chen, Congxuan Zhang, Weiming Hu, Bing Li, Feng Lu
  • for: Solving real-time monocular depth estimation on embedded systems.
  • methods: Two efficient and lightweight encoder-decoder network architectures, RT-MonoDepth and RT-MonoDepth-S, designed to reduce computational complexity and latency.
  • results: On a single 640×192 RGB image, the networks run at 18.4/30.5 FPS on NVIDIA Jetson Nano and 253.0/364.1 FPS on NVIDIA Jetson AGX Orin, with accuracy close to the state of the art on the KITTI dataset; to the authors' knowledge, this is the best accuracy at the fastest inference speed among fast monocular depth estimation methods.
    Abstract Depth sensing is a crucial function of unmanned aerial vehicles and autonomous vehicles. Due to the small size and simple structure of monocular cameras, there has been a growing interest in depth estimation from a single RGB image. However, state-of-the-art monocular CNN-based depth estimation methods using fairly complex deep neural networks are too slow for real-time inference on embedded platforms. This paper addresses the problem of real-time depth estimation on embedded systems. We propose two efficient and lightweight encoder-decoder network architectures, RT-MonoDepth and RT-MonoDepth-S, to reduce computational complexity and latency. Our methodologies demonstrate that it is possible to achieve similar accuracy as prior state-of-the-art works on depth estimation at a faster inference speed. Our proposed networks, RT-MonoDepth and RT-MonoDepth-S, run at 18.4/30.5 FPS on NVIDIA Jetson Nano and 253.0/364.1 FPS on NVIDIA Jetson AGX Orin on a single RGB image of resolution 640×192, and achieve relative state-of-the-art accuracy on the KITTI dataset. To the best of the authors' knowledge, this paper achieves the best accuracy and fastest inference speed compared with existing fast monocular depth estimation methods.

Seeing the Intangible: Surveying Automatic High-Level Visual Understanding from Still Images

  • paper_url: http://arxiv.org/abs/2308.10562
  • repo_url: None
  • paper_authors: Delfina Sol Martinez Pandiani, Valentina Presutti
  • for: This survey examines the detection of abstract social concepts in image understanding, i.e., capturing and interpreting abstract notions such as human emotions, social values, and ideologies from still images.
  • methods: It studies and clusters the semantic elements of high-level visual understanding from a multidisciplinary perspective (computer science, visual studies, and cognitive science), and likewise studies and clusters the corresponding high-level computer vision tasks, in order to identify how current CV work approaches abstract social concept detection.
  • results: The clustering reveals that several CV works describe the same abstract social concept detection task under different terminology, and the paper contributes a systematic review and classification of these tasks.
    Abstract The field of Computer Vision (CV) was born with the single grand goal of complete image understanding: providing a complete semantic interpretation of an input image. What exactly this goal entails is not immediately straightforward, but theoretical hierarchies of visual understanding point towards a top level of full semantics, within which sits the most complex and subjective information humans can detect from visual data. In particular, non-concrete concepts including emotions, social values and ideologies seem to be protagonists of this "high-level" visual semantic understanding. While such "abstract concepts" are critical tools for image management and retrieval, their automatic recognition is still a challenge, exactly because they rest at the top of the "semantic pyramid": the well-known semantic gap problem is worsened given their lack of unique perceptual referents, and their reliance on more unspecific features than concrete concepts. Given that there seems to be very scarce explicit work within CV on the task of abstract social concept (ASC) detection, and that many recent works seem to discuss similar non-concrete entities by using different terminology, in this survey we provide a systematic review of CV work that explicitly or implicitly approaches the problem of abstract (specifically social) concept detection from still images. Specifically, this survey performs and provides: (1) A study and clustering of high level visual understanding semantic elements from a multidisciplinary perspective (computer science, visual studies, and cognitive perspectives); (2) A study and clustering of high level visual understanding computer vision tasks dealing with the identified semantic elements, so as to identify current CV work that implicitly deals with AC detection.

Spatial Transform Decoupling for Oriented Object Detection

  • paper_url: http://arxiv.org/abs/2308.10561
  • repo_url: https://github.com/yuhongtian17/spatial-transform-decoupling
  • paper_authors: Hongtian Yu, Yunjie Tian, Qixiang Ye, Yunfan Liu
  • for: oriented object detection with ViTs
  • methods: separate network branches for position, size, and angle prediction, and aggregating cascaded activation masks (CAMs)
  • results: state-of-the-art performance on the benchmark datasets DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP), demonstrating the effectiveness of the proposed method.
    Abstract Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. However, their potential in rotation-sensitive scenarios has not been fully explored, and this limitation may be inherently attributed to the lack of spatial invariance in the data-forwarding process. In this study, we present a novel approach, termed Spatial Transform Decoupling (STD), providing a simple-yet-effective solution for oriented object detection with ViTs. Built upon stacked ViT blocks, STD utilizes separate network branches to predict the position, size, and angle of bounding boxes, effectively harnessing the spatial transform potential of ViTs in a divide-and-conquer fashion. Moreover, by aggregating cascaded activation masks (CAMs) computed upon the regressed parameters, STD gradually enhances features within regions of interest (RoIs), which complements the self-attention mechanism. Without bells and whistles, STD achieves state-of-the-art performance on the benchmark datasets including DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP), which demonstrates the effectiveness of the proposed method. Source code is available at https://github.com/yuhongtian17/Spatial-Transform-Decoupling.
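
As a rough illustration of the decoupling idea (the branch widths, depths, and activations here are assumptions; the released code at the repo above is authoritative), separate heads can predict the position, size, and angle of an oriented box from shared features:

```python
import torch
import torch.nn as nn

class DecoupledOBBHead(nn.Module):
    """Predict oriented-box parameters with separate network branches."""
    def __init__(self, dim: int = 256):
        super().__init__()
        def branch(out_dim):
            return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, out_dim))
        self.pos = branch(2)    # (cx, cy)
        self.size = branch(2)   # (w, h)
        self.angle = branch(1)  # theta

    def forward(self, feats):   # feats: (N, dim) region features
        return self.pos(feats), self.size(feats), self.angle(feats)

head = DecoupledOBBHead()
pos, size, angle = head(torch.randn(8, 256))
```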

Local Spherical Harmonics Improve Skeleton-Based Hand Action Recognition

  • paper_url: http://arxiv.org/abs/2308.10557
  • repo_url: https://github.com/kathpra/lshr_lsht
  • paper_authors: Katharina Prasse, Steffen Jung, Yuxuan Zhou, Margret Keuper
  • for: This paper proposes a method specifically designed for hand action recognition, addressing the fact that hands remain among the classes that are hard to recognize correctly.
  • methods: The method uses relative angular embeddings and local Spherical Harmonics to create novel hand representations, making recognition more robust to viewpoint changes and inter-subject differences.
  • results: Extensive experiments on the First-Person Hand Action Benchmark with RGB-D videos and 3D hand pose annotations, and on the NTU RGB+D 120 dataset, demonstrate higher recognition accuracy.
    Abstract Hand action recognition is essential. Communication, human-robot interactions, and gesture control are dependent on it. Skeleton-based action recognition traditionally includes hands, which belong to the classes which remain challenging to correctly recognize to date. We propose a method specifically designed for hand action recognition which uses relative angular embeddings and local Spherical Harmonics to create novel hand representations. The use of Spherical Harmonics creates rotation-invariant representations which make hand action recognition even more robust against inter-subject differences and viewpoint changes. We conduct extensive experiments on the hand joints in the First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations, and on the NTU RGB+D 120 dataset, demonstrating the benefit of using Local Spherical Harmonics Representations. Our code is available at https://github.com/KathPra/LSHR_LSHT.
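
The rotation invariance rests on a standard property of spherical harmonics: under a global rotation, coefficients within each degree mix unitarily, so the per-degree power is unchanged. A minimal sketch of such a descriptor (angle conventions follow SciPy's `sph_harm`; this is not the authors' LSHR/LSHT implementation):

```python
import numpy as np
from scipy.special import sph_harm

def sh_power_spectrum(offsets, max_degree=3):
    """Rotation-invariant descriptor of a local set of joint offsets."""
    dirs = offsets / (np.linalg.norm(offsets, axis=1, keepdims=True) + 1e-8)
    theta = np.arctan2(dirs[:, 1], dirs[:, 0]) % (2 * np.pi)  # azimuth
    phi = np.arccos(np.clip(dirs[:, 2], -1.0, 1.0))           # polar angle
    spectrum = []
    for l in range(max_degree + 1):
        # Expansion coefficients of the empirical direction distribution.
        c_lm = [sph_harm(m, l, theta, phi).sum() for m in range(-l, l + 1)]
        spectrum.append(sum(abs(c) ** 2 for c in c_lm))  # per-degree power
    return np.array(spectrum)

print(sh_power_spectrum(np.random.randn(5, 3)))
```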

Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations

  • paper_url: http://arxiv.org/abs/2308.10554
  • repo_url: None
  • paper_authors: Seogkyu Jeon, Bei Liu, Pilhyeon Lee, Kibeom Hong, Jianlong Fu, Hyeran Byun
  • for: tackling mode collapse in zero-shot GAN adaptation, so that a well-trained generator can be reused to synthesize images of an unseen target domain without any further training samples.
  • methods: The CLIP vision-language model guides the generator toward the target domain; diverse semantic variations of the target text are found in CLIP space while regularizing uncontrolled deviation of the semantic information. New losses are introduced, including a directional moment loss that matches the first and second moments of image and text direction distributions, plus elastic weight consolidation and a relation consistency loss to preserve valuable content information (e.g., appearances) from the source domain.
  • results: Extensive experiments show that the proposed methods ensure sample diversity in various zero-shot GAN adaptation scenarios; ablation studies validate each component, and the model sets a new state of the art in both diversity and quality.
    Abstract Training deep generative models usually requires a large amount of data. To alleviate the data collection cost, the task of zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain without any further training samples. Due to the data absence, the textual description of the target domain and the vision-language models, e.g., CLIP, are utilized to effectively guide the generator. However, with only a single representative text feature instead of real images, the synthesized images gradually lose diversity as the model is optimized, which is also known as mode collapse. To tackle the problem, we propose a novel method to find semantic variations of the target text in the CLIP space. Specifically, we explore diverse semantic variations based on the informative text feature of the target domain while regularizing the uncontrolled deviation of the semantic information. With the obtained variations, we design a novel directional moment loss that matches the first and second moments of image and text direction distributions. Moreover, we introduce elastic weight consolidation and a relation consistency loss to effectively preserve valuable content information from the source domain, e.g., appearances. Through extensive experiments, we demonstrate the efficacy of the proposed methods in ensuring sample diversity in various scenarios of zero-shot GAN adaptation. We also conduct ablation studies to validate the effect of each proposed component. Notably, our model achieves a new state-of-the-art on zero-shot GAN adaptation in terms of both diversity and quality.
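
One plausible reading of "matching the first and second moments of image and text direction distributions" is a simple mean/variance matching loss in CLIP space; the paper's exact formulation may differ, so treat this only as a sketch:

```python
import torch
import torch.nn.functional as F

def directional_moment_loss(img_dirs, txt_dirs):
    """img_dirs, txt_dirs: (N, D) L2-normalized CLIP-space directions."""
    mean_loss = F.mse_loss(img_dirs.mean(dim=0), txt_dirs.mean(dim=0))
    var_loss = F.mse_loss(img_dirs.var(dim=0), txt_dirs.var(dim=0))
    return mean_loss + var_loss
```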

Learning Weakly Convex Regularizers for Convergent Image-Reconstruction Algorithms

  • paper_url: http://arxiv.org/abs/2308.10542
  • repo_url: None
  • paper_authors: Alexis Goujon, Sebastian Neumayer, Michael Unser
  • for: This paper learns non-convex regularizers with a prescribed upper bound on their weak-convexity modulus, so that the resulting variational denoisers still minimize a convex energy.
  • methods: The regularizers rely on few parameters (fewer than 15,000), admit a signal-processing interpretation, and mimic handcrafted sparsity-promoting regularizers.
  • results: Numerical experiments show that these denoisers outperform convex-regularization methods as well as the BM3D denoiser; the learned regularizer can also be deployed in provably convergent iterative schemes for inverse problems, offering an excellent tradeoff between performance, parameter count, guarantees, and interpretability for CT and MRI reconstruction.
    Abstract We propose to learn non-convex regularizers with a prescribed upper bound on their weak-convexity modulus. Such regularizers give rise to variational denoisers that minimize a convex energy. They rely on few parameters (less than 15,000) and offer a signal-processing interpretation as they mimic handcrafted sparsity-promoting regularizers. Through numerical experiments, we show that such denoisers outperform convex-regularization methods as well as the popular BM3D denoiser. Additionally, the learned regularizer can be deployed to solve inverse problems with iterative schemes that provably converge. For both CT and MRI reconstruction, the regularizer generalizes well and offers an excellent tradeoff between performance, number of parameters, guarantees, and interpretability when compared to other data-driven approaches.
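
For context, the standard definition behind the abstract's claim (constant conventions may differ slightly from the paper's): a regularizer $R$ is $\rho$-weakly convex if $x \mapsto R(x) + \frac{\rho}{2}\|x\|_2^2$ is convex. The denoising energy $E(x) = \frac{1}{2}\|x-y\|_2^2 + \lambda R(x)$ can be written as $E(x) = \big[\lambda R(x) + \frac{\lambda\rho}{2}\|x\|_2^2\big] + \frac{1}{2}\|x-y\|_2^2 - \frac{\lambda\rho}{2}\|x\|_2^2$, which is convex whenever $\lambda\rho \le 1$; this is how a non-convex regularizer can still yield a convex variational denoiser.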

Joint learning of images and videos with a single Vision Transformer

  • paper_url: http://arxiv.org/abs/2308.10533
  • repo_url: None
  • paper_authors: Shuki Shimizu, Toru Tamaki
  • for: joint learning of images and videos with a single model, rather than training separate image and video models.
  • methods: The method feeds batches of images into a Vision Transformer (IV-ViT) and aggregates video frames temporally by late fusion.
  • results: Experiments on two image datasets and two action recognition datasets show competitive performance.
    Abstract In this study, we propose a method for jointly learning of images and videos using a single model. In general, images and videos are often trained by separate models. We propose in this paper a method that takes a batch of images as input to Vision Transformer IV-ViT, and also a set of video frames with temporal aggregation by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.
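
A minimal sketch of the late-fusion idea (the backbone choice and mean aggregation are assumptions; the paper's temporal aggregation may be more elaborate):

```python
import torch
import timm

vit = timm.create_model("vit_base_patch16_224", num_classes=0)  # feature extractor

def encode_image(images):                    # images: (B, 3, 224, 224)
    return vit(images)                       # (B, D)

def encode_video(frames):                    # frames: (B, T, 3, 224, 224)
    b, t = frames.shape[:2]
    feats = vit(frames.flatten(0, 1))        # run the same ViT on every frame
    return feats.view(b, t, -1).mean(dim=1)  # late fusion over time
```

Sharing one backbone lets image batches and video clips be mixed in the same training loop.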

SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation

  • paper_url: http://arxiv.org/abs/2308.10531
  • repo_url: https://github.com/retsuh-bqw/SRFormer-Text-Det
  • paper_authors: Qingwen Bu, Sungrae Park, Minsoo Khang, Yichuan Cheng
  • for: This paper proposes a unified DETR-based text detection model that amalgamates segmentation and regression, combining the robustness of segmentation representations with the straightforward post-processing of instance-level regression.
  • methods: Building on DETR, segmentation branches are incorporated only in the first few decoder layers, with progressive regression refinement in subsequent layers; a Mask-informed Query Enhancement module treats the segmentation result as a soft RoI to pool robust pixel representations that enhance and diversify the instance queries.
  • results: Experiments on multiple benchmarks show strong performance together with excellent training and data efficiency, as well as good robustness across settings.
    Abstract Existing techniques for text detection can be broadly classified into two primary groups: segmentation-based methods and regression-based methods. Segmentation models offer enhanced robustness to font variations but require intricate post-processing, leading to high computational overhead. Regression-based methods undertake instance-aware prediction but face limitations in robustness and data efficiency due to their reliance on high-level representations. In our academic pursuit, we propose SRFormer, a unified DETR-based model with amalgamated Segmentation and Regression, aiming at the synergistic harnessing of the inherent robustness in segmentation representations, along with the straightforward post-processing of instance-level regression. Our empirical analysis indicates that favorable segmentation predictions can be obtained at the initial decoder layers. In light of this, we constrain the incorporation of segmentation branches to the first few decoder layers and employ progressive regression refinement in subsequent layers, achieving performance gains while minimizing additional computational load from the mask. Furthermore, we propose a Mask-informed Query Enhancement module. We take the segmentation result as a natural soft-ROI to pool and extract robust pixel representations, which are then employed to enhance and diversify instance queries. Extensive experimentation across multiple benchmarks has yielded compelling findings, highlighting our method's exceptional robustness, superior training and data efficiency, as well as its state-of-the-art performance.
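
A sketch of the "segmentation result as a natural soft-RoI" idea from the abstract (the pooling and residual update are our reading, not the repository code):

```python
import torch

def mask_informed_query_enhancement(pixel_feats, masks, queries):
    """pixel_feats: (B, C, H, W); masks: (B, Q, H, W); queries: (B, Q, C)."""
    weights = masks.flatten(2).softmax(dim=-1)           # soft RoI per query
    pooled = torch.einsum("bqn,bcn->bqc", weights, pixel_feats.flatten(2))
    return queries + pooled                              # enhance instance queries
```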

LightDepth: Single-View Depth Self-Supervision from Illumination Decline

  • paper_url: http://arxiv.org/abs/2308.10525
  • repo_url: None
  • paper_authors: Javier Rodríguez-Puigvert, Víctor M. Batlle, J. M. M. Montiel, Ruben Martinez Cantin, Pascal Fua, Juan D. Tardós, Javier Civera
  • for: single-view depth estimation in settings such as endoscopy, where ground-truth depth data cannot be obtained, avoiding the performance drop of multi-view self-supervision or synthetic-to-real transfer.
  • methods: A single-view self-supervised approach that exploits the co-location of camera and light source: for a given albedo and surface orientation, pixel brightness is inversely proportional to the squared distance to the surface, providing a strong supervisory signal.
  • results: Experiments show that the self-supervised models deliver accuracy comparable to fully supervised ones, without requiring depth ground truth.
    Abstract Single-view depth estimation can be remarkably effective if there is enough ground-truth depth data for supervised training. However, there are scenarios, especially in medicine in the case of endoscopies, where such data cannot be obtained. In such cases, multi-view self-supervision and synthetic-to-real transfer serve as alternative approaches, however, with a considerable performance reduction in comparison to supervised case. Instead, we propose a single-view self-supervised method that achieves a performance similar to the supervised case. In some medical devices, such as endoscopes, the camera and light sources are co-located at a small distance from the target surfaces. Thus, we can exploit that, for any given albedo and surface orientation, pixel brightness is inversely proportional to the square of the distance to the surface, providing a strong single-view self-supervisory signal. In our experiments, our self-supervised models deliver accuracies comparable to those of fully supervised ones, while being applicable without depth ground-truth data.
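
The supervisory signal can be sketched as an inverse-square rendering loss (the albedo and gain handling here are simplified assumptions; the paper models the photometric setup more carefully):

```python
import torch

def inverse_square_loss(pred_depth, image, albedo, gain=1.0):
    """pred_depth, image, albedo: (B, 1, H, W) tensors."""
    # With camera and light co-located, brightness ~ albedo / distance^2.
    rendered = gain * albedo / pred_depth.clamp(min=1e-3) ** 2
    return (rendered - image).abs().mean()
```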

Information Theory-Guided Heuristic Progressive Multi-View Coding

  • paper_url: http://arxiv.org/abs/2308.10522
  • repo_url: None
  • paper_authors: Jiangmeng Li, Hang Gao, Wenwen Qiang, Changwen Zheng
  • for: This work proposes an information-theoretic framework for generalized multi-view representation learning that addresses shortcomings of existing multi-view methods.
  • methods: Guided by the framework, the proposed IPMC method operates at three tiers: at the distribution tier it aligns the distributions across views to reduce view-specific noise; at the set tier it builds self-adjusted contrasting pools that are adaptively modified by a view filter; and at the instance tier it adopts a unified loss to learn representations and reduce gradient interference.
  • results: Theoretical analysis and experiments demonstrate the superiority of IPMC over state-of-the-art methods.
    Abstract Multi-view representation learning aims to capture comprehensive information from multiple views of a shared context. Recent works intuitively apply contrastive learning to different views in a pairwise manner, which is still scalable: view-specific noise is not filtered in learning view-shared representations; the fake negative pairs, where the negative terms are actually within the same class as the positive, and the real negative pairs are coequally treated; evenly measuring the similarities between terms might interfere with optimization. Importantly, few works study the theoretical framework of generalized self-supervised multi-view learning, especially for more than two views. To this end, we rethink the existing multi-view learning paradigm from the perspective of information theory and then propose a novel information theoretical framework for generalized multi-view learning. Guided by it, we build a multi-view coding method with a three-tier progressive architecture, namely Information theory-guided hierarchical Progressive Multi-view Coding (IPMC). In the distribution-tier, IPMC aligns the distribution between views to reduce view-specific noise. In the set-tier, IPMC constructs self-adjusted contrasting pools, which are adaptively modified by a view filter. Lastly, in the instance-tier, we adopt a designed unified loss to learn representations and reduce the gradient interference. Theoretically and empirically, we demonstrate the superiority of IPMC over state-of-the-art methods.

PHE-SICH-CT-IDS: A Benchmark CT Image Dataset for Evaluation Semantic Segmentation, Object Detection and Radiomic Feature Extraction of Perihematomal Edema in Spontaneous Intracerebral Hemorrhage

  • paper_url: http://arxiv.org/abs/2308.10521
  • repo_url: None
  • paper_authors: Deguo Ma, Chen Li, Lin Qiao, Tianming Du, Dechao Tang, Zhiyu Ma, Marcin Grzegorzek Hongzan, Hongzan Sun
  • for: researchers and clinicians developing and evaluating computer-aided diagnostic methods for perihematomal edema (PHE) in spontaneous intracerebral hemorrhage (SICH).
  • methods: The paper introduces a publicly available CT dataset, PHE-SICH-CT-IDS, comprising 120 brain CT scans and 7,022 CT images together with the corresponding medical information of the patients, suitable for assessing segmentation, detection, and radiomic feature extraction methods.
  • results: Classical algorithms for semantic segmentation, object detection, and radiomic feature extraction are evaluated on the dataset, and the experimental results confirm its suitability for benchmarking these methods.
    Abstract Intracerebral hemorrhage is one of the diseases with the highest mortality and poorest prognosis worldwide. Spontaneous intracerebral hemorrhage (SICH) typically presents acutely, prompt and expedited radiological examination is crucial for diagnosis, localization, and quantification of the hemorrhage. Early detection and accurate segmentation of perihematomal edema (PHE) play a critical role in guiding appropriate clinical intervention and enhancing patient prognosis. However, the progress and assessment of computer-aided diagnostic methods for PHE segmentation and detection face challenges due to the scarcity of publicly accessible brain CT image datasets. This study establishes a publicly available CT dataset named PHE-SICH-CT-IDS for perihematomal edema in spontaneous intracerebral hemorrhage. The dataset comprises 120 brain CT scans and 7,022 CT images, along with corresponding medical information of the patients. To demonstrate its effectiveness, classical algorithms for semantic segmentation, object detection, and radiomic feature extraction are evaluated. The experimental results confirm the suitability of PHE-SICH-CT-IDS for assessing the performance of segmentation, detection and radiomic feature extraction methods. To the best of our knowledge, this is the first publicly available dataset for PHE in SICH, comprising various data formats suitable for applications across diverse medical scenarios. We believe that PHE-SICH-CT-IDS will allure researchers to explore novel algorithms, providing valuable support for clinicians and patients in the clinical setting. PHE-SICH-CT-IDS is freely published for non-commercial purpose at: https://figshare.com/articles/dataset/PHE-SICH-CT-IDS/23957937.

QD-BEV : Quantization-aware View-guided Distillation for Multi-view 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.10515
  • repo_url: None
  • paper_authors: Yifan Zhang, Zhen Dong, Huanrui Yang, Ming Lu, Cheng-Ching Tseng, Yuan Du, Kurt Keutzer, Li Du, Shanghang Zhang
  • for: multi-view 3D object detection, where bird's-eye-view (BEV) methods have recently achieved significant improvements but remain hard to deploy on vehicles due to memory consumption and latency.
  • methods: QD-BEV introduces a view-guided distillation (VGD) objective that stabilizes quantization-aware training (QAT) while leveraging both image features and BEV features to enhance model performance.
  • results: QD-BEV matches or exceeds the accuracy of previous methods with significant efficiency gains. On nuScenes, the 4-bit weight, 6-bit activation QD-BEV-Tiny achieves 37.2% NDS with a 15.8 MB model, outperforming BevFormer-Tiny by 1.8% under 8x compression; the Small and Base variants reach 47.9% NDS (28.2 MB) and 50.9% NDS (32.9 MB), respectively.
    Abstract Multi-view 3D detection based on BEV (bird-eye-view) has recently achieved significant improvements. However, the huge memory consumption of state-of-the-art models makes it hard to deploy them on vehicles, and the non-trivial latency will affect the real-time perception of streaming applications. Despite the wide application of quantization to lighten models, we show in our paper that directly applying quantization in BEV tasks will 1) make the training unstable, and 2) lead to intolerable performance degradation. To solve these issues, our method QD-BEV enables a novel view-guided distillation (VGD) objective, which can stabilize the quantization-aware training (QAT) while enhancing the model performance by leveraging both image features and BEV features. Our experiments show that QD-BEV achieves similar or even better accuracy than previous methods with significant efficiency gains. On the nuScenes datasets, the 4-bit weight and 6-bit activation quantized QD-BEV-Tiny model achieves 37.2% NDS with only 15.8 MB model size, outperforming BevFormer-Tiny by 1.8% with an 8x model compression. On the Small and Base variants, QD-BEV models also perform superbly and achieve 47.9% NDS (28.2 MB) and 50.9% NDS (32.9 MB), respectively.
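
For readers unfamiliar with quantization-aware training, the sketch below shows a generic symmetric fake-quantizer with a straight-through gradient; QD-BEV's actual quantizer, calibration, and the VGD objective are not reproduced here:

```python
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, num_bits):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        return (x / scale).round().clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # straight-through estimator

weight = torch.randn(64, 64, requires_grad=True)
w4 = FakeQuant.apply(weight, 4)   # 4-bit weights, as in QD-BEV-Tiny
```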

Frequency Compensated Diffusion Model for Real-scene Dehazing

  • paper_url: http://arxiv.org/abs/2308.10510
  • repo_url: None
  • paper_authors: Jing Wang, Songtao Wu, Kuanhong Xu, Zhiqiang Yuan
  • for: improving deep learning-based image dehazing so that it generalizes to real-world hazy images.
  • methods: A dehazing framework built on conditional diffusion models; a Frequency Compensation Block (FCB), a network unit with a bank of filters that jointly emphasize mid-to-high frequencies, counters the spectral bias of deep networks, and a HazeAug data synthesis pipeline augments haze in degree and diversity.
  • results: The proposed dehazing diffusion model significantly outperforms state-of-the-art methods on real-world hazy images.
    Abstract Due to distribution shift, deep learning based methods for image dehazing suffer from performance degradation when applied to real-world hazy images. In this paper, we consider a dehazing framework based on conditional diffusion models for improved generalization to real haze. First, we find that optimizing the training objective of diffusion models, i.e., Gaussian noise vectors, is non-trivial. The spectral bias of deep networks hinders the higher frequency modes in Gaussian vectors from being learned and hence impairs the reconstruction of image details. To tackle this issue, we design a network unit, named Frequency Compensation block (FCB), with a bank of filters that jointly emphasize the mid-to-high frequencies of an input signal. We demonstrate that diffusion models with FCB achieve significant gains in both perceptual and distortion metrics. Second, to further boost the generalization performance, we propose a novel data synthesis pipeline, HazeAug, to augment haze in terms of degree and diversity. Within the framework, a solid baseline for blind dehazing is set up where models are trained on synthetic hazy-clean pairs, and directly generalize to real data. Extensive evaluations show that the proposed dehazing diffusion model significantly outperforms state-of-the-art methods on real-world images.
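
A toy version of "emphasize mid-to-high frequencies" (the fixed Laplacian filter and residual form are our choice; the paper's FCB uses a learned bank of filters):

```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., -1., 0.],
                          [-1., 4., -1.],
                          [0., -1., 0.]])

def frequency_compensate(x, alpha=0.5):          # x: (B, C, H, W)
    k = LAPLACIAN.view(1, 1, 3, 3).repeat(x.shape[1], 1, 1, 1).to(x)
    high = F.conv2d(x, k, padding=1, groups=x.shape[1])  # high-pass residual
    return x + alpha * high                      # boost high frequencies
```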

Semantic Graph Representation Learning for Handwritten Mathematical Expression Recognition

  • paper_url: http://arxiv.org/abs/2308.10493
  • repo_url: https://github.com/liuzhuang1024/SAM
  • paper_authors: Zhuang Liu, Ye Yuan, Zhilong Ji, Jingfeng Bai, Xiang Bai
  • for: improving the performance of handwritten mathematical expression recognition (HMER).
  • methods: A simple yet efficient semantic interaction learning (SIL) scheme strengthens the model's understanding of relationships between symbols: a semantic graph is built from statistical symbol co-occurrence probabilities, and a semantic-aware module (SAM) projects visual and classification features into a semantic space where cosine distances indicate symbol correlations; SAM can be plugged into existing attention-based HMER models.
  • results: Extensive experiments on public benchmarks show that the proposed module effectively improves recognition performance, outperforming prior art on both the CROHME and HME100K datasets.
    Abstract Handwritten mathematical expression recognition (HMER) has attracted extensive attention recently. However, current methods cannot explicitly study the interactions between different symbols, which may fail when faced similar symbols. To alleviate this issue, we propose a simple but efficient method to enhance semantic interaction learning (SIL). Specifically, we firstly construct a semantic graph based on the statistical symbol co-occurrence probabilities. Then we design a semantic aware module (SAM), which projects the visual and classification feature into semantic space. The cosine distance between different projected vectors indicates the correlation between symbols. And jointly optimizing HMER and SIL can explicitly enhances the model's understanding of symbol relationships. In addition, SAM can be easily plugged into existing attention-based models for HMER and consistently bring improvement. Extensive experiments on public benchmark datasets demonstrate that our proposed module can effectively enhance the recognition performance. Our method achieves better recognition performance than prior arts on both CROHME and HME100K datasets.
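
The statistics step can be sketched as follows (the normalization into probabilities is an assumption; symbol IDs are integer indices into the vocabulary):

```python
import numpy as np

def cooccurrence_graph(label_seqs, vocab_size):
    """Build a symbol co-occurrence probability matrix from label sequences."""
    counts = np.zeros((vocab_size, vocab_size))
    for seq in label_seqs:
        symbols = set(seq)
        for a in symbols:
            for b in symbols:
                if a != b:
                    counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True).clip(min=1)
    return counts / row_sums   # row i: P(symbol j co-occurs | symbol i)

graph = cooccurrence_graph([[0, 1, 2], [1, 2, 3]], vocab_size=4)
```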

SynDrone – Multi-modal UAV Dataset for Urban Scenarios

  • paper_url: http://arxiv.org/abs/2308.10491
  • repo_url: https://github.com/lttm/syndrone
  • paper_authors: Giulia Rizzoli, Francesco Barbato, Matteo Caligiuri, Pietro Zanuttigh
  • for: addressing the scarcity of annotated high-resolution aerial data for developing computer vision algorithms for Unmanned Aerial Vehicles (UAVs).
  • methods: The paper proposes a multimodal synthetic dataset that includes both images and 3D data captured at multiple flying heights, with object-level annotations and pixel-level labeling in 28 classes.
  • results: The dataset contains 72k labeled samples that enable effective training of deep architectures, showing promising results in synthetic-to-real adaptation.
    Abstract The development of computer vision algorithms for Unmanned Aerial Vehicles (UAVs) imagery heavily relies on the availability of annotated high-resolution aerial data. However, the scarcity of large-scale real datasets with pixel-level annotations poses a significant challenge to researchers as the limited number of images in existing datasets hinders the effectiveness of deep learning models that require a large amount of training data. In this paper, we propose a multimodal synthetic dataset containing both images and 3D data taken at multiple flying heights to address these limitations. In addition to object-level annotations, the provided data also include pixel-level labeling in 28 classes, enabling exploration of the potential advantages in tasks like semantic segmentation. In total, our dataset contains 72k labeled samples that allow for effective training of deep architectures showing promising results in synthetic-to-real adaptation. The dataset will be made publicly available to support the development of novel computer vision methods targeting UAV applications.
    摘要 Computer vision algorithms for Unmanned Aerial Vehicles (UAVs) 图像的开发受到高分辨率飞行数据的可用性的限制。然而,现有数据集中的图像数量有限,使得深度学习模型的训练数据量有限,从而降低了模型的效果。在本文中,我们提出了一个多modal的 sintetic 数据集,包括图像和3D数据, captured at multiple flying heights。此外,数据集还包括对象级别的标注,以及每个像素的28个分类标注,使得可以进行 semantic segmentation 任务的探索。总的来说,我们的数据集包含72k个标注样本,可以有效地训练深度建筑,并且在 sintetic-to-real 适应中显示出了良好的结果。这个数据集将会公开发布,以支持计算机视觉方法的开发,targeting UAV应用。

Enhancing Medical Image Segmentation: Optimizing Cross-Entropy Weights and Post-Processing with Autoencoders

  • paper_url: http://arxiv.org/abs/2308.10488
  • repo_url: None
  • paper_authors: Pranav Singh, Luoyao Chen, Mei Chen, Jinqian Pan, Raviteja Chukkapalli, Shravan Chaudhari, Jacopo Cirrone
  • for: medical image segmentation, with a focus on histopathology slides for autoimmune diseases such as dermatomyositis, where cell inflammation and interaction remain under-studied.
  • methods: A deep learning approach built on the U-Net and U-Net++ architectures that optimizes the cross-entropy loss weights and applies autoencoder-based post-processing.
  • results: On a dermatomyositis dataset, the method outperforms the current state of the art by an average of 12.26% for U-Net and 12.04% for U-Net++ across the ResNet family of encoders; the methodology is further benchmarked on three challenging medical image segmentation tasks.
    Abstract The task of medical image segmentation presents unique challenges, necessitating both localized and holistic semantic understanding to accurately delineate areas of interest, such as critical tissues or aberrant features. This complexity is heightened in medical image segmentation due to the high degree of inter-class similarities, intra-class variations, and possible image obfuscation. The segmentation task further diversifies when considering the study of histopathology slides for autoimmune diseases like dermatomyositis. The analysis of cell inflammation and interaction in these cases has been less studied due to constraints in data acquisition pipelines. Despite the progressive strides in medical science, we lack a comprehensive collection of autoimmune diseases. As autoimmune diseases globally escalate in prevalence and exhibit associations with COVID-19, their study becomes increasingly essential. While there is existing research that integrates artificial intelligence in the analysis of various autoimmune diseases, the exploration of dermatomyositis remains relatively underrepresented. In this paper, we present a deep-learning approach tailored for Medical image segmentation. Our proposed method outperforms the current state-of-the-art techniques by an average of 12.26% for U-Net and 12.04% for U-Net++ across the ResNet family of encoders on the dermatomyositis dataset. Furthermore, we probe the importance of optimizing loss function weights and benchmark our methodology on three challenging medical image segmentation tasks
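
One common way to set cross-entropy weights is inverse class frequency; the paper tunes these weights, so this particular scheme is only an illustrative default:

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, target, num_classes):
    """logits: (B, C, H, W); target: (B, H, W) integer labels."""
    counts = torch.bincount(target.flatten(), minlength=num_classes).float()
    weights = counts.sum() / (num_classes * counts.clamp(min=1))  # rarer class -> heavier weight
    return F.cross_entropy(logits, target, weight=weights)
```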

ADNet: Lane Shape Prediction via Anchor Decomposition

  • paper_url: http://arxiv.org/abs/2308.10481
  • repo_url: None
  • paper_authors: Lingyu Xiao, Xiang Li, Sen Yang, Wankou Yang
  • for: overcoming the inflexibility of anchor-based lane detection methods so they adapt to different lane types across datasets.
  • methods: Anchors are decomposed into a learned heat map of starting points and their associated directions, removing restrictions on where anchors can start; Large Kernel Attention (LKA) is added to the Feature Pyramid Network (FPN) to enlarge the receptive field and capture global context, since lane lines typically span the entire image.
  • results: Experiments on three widely used lane detection benchmarks show that the method outperforms existing approaches on VIL-100 and achieves competitive accuracy on CULane and TuSimple.
    Abstract In this paper, we revisit the limitations of anchor-based lane detection methods, which have predominantly focused on fixed anchors that stem from the edges of the image, disregarding their versatility and quality. To overcome the inflexibility of anchors, we decompose them into learning the heat map of starting points and their associated directions. This decomposition removes the limitations on the starting point of anchors, making our algorithm adaptable to different lane types in various datasets. To enhance the quality of anchors, we introduce the Large Kernel Attention (LKA) for Feature Pyramid Network (FPN). This significantly increases the receptive field, which is crucial in capturing the sufficient context as lane lines typically run throughout the entire image. We have named our proposed system the Anchor Decomposition Network (ADNet). Additionally, we propose the General Lane IoU (GLIoU) loss, which significantly improves the performance of ADNet in complex scenarios. Experimental results on three widely used lane detection benchmarks, VIL-100, CULane, and TuSimple, demonstrate that our approach outperforms the state-of-the-art methods on VIL-100 and exhibits competitive accuracy on CULane and TuSimple. Code and models will be released on https://github.com/ Sephirex-X/ADNet.
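
The decomposition can be sketched as decoding anchors from a start-point heat map plus a direction map (the top-k decoding and output format are assumptions):

```python
import torch

def decode_anchors(heatmap, angle_map, k=20):
    """heatmap: (H, W) start-point scores; angle_map: (H, W) directions (rad)."""
    w = heatmap.shape[1]
    scores, idx = heatmap.flatten().topk(k)
    ys = torch.div(idx, w, rounding_mode="floor")
    xs = idx % w
    angles = angle_map[ys, xs]
    # Each anchor: a start point anywhere in the image plus a ray direction.
    return torch.stack([xs.float(), ys.float(), angles], dim=1)   # (k, 3)
```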

STEERER: Resolving Scale Variations for Counting and Localization via Selective Inheritance Learning

  • paper_url: http://arxiv.org/abs/2308.10468
  • repo_url: https://github.com/taohan10200/steerer
  • paper_authors: Tao Han, Lei Bai, Lingbo Liu, Wanli Ouyang
  • for: solves the problem of scale variations in object counting, which has not been effectively addressed by existing scale-aware algorithms.
  • methods: proposes a novel method called STEERER, which selectively forwards scale-customized features at each scale and uses a Masked Selection and Inheritance Loss to achieve high-quality density maps across all scales.
  • results: demonstrates unprecedented scale generalization ability on nine datasets with counting and localization tasks, achieving high accuracy and outperforming state-of-the-art methods.
    Abstract Scale variation is a deep-rooted problem in object counting, which has not been effectively addressed by existing scale-aware algorithms. An important factor is that they typically involve cooperative learning across multi-resolutions, which could be suboptimal for learning the most discriminative features from each scale. In this paper, we propose a novel method termed STEERER (\textbf{S}elec\textbf{T}iv\textbf{E} inh\textbf{ER}itance l\textbf{E}a\textbf{R}ning) that addresses the issue of scale variations in object counting. STEERER selects the most suitable scale for patch objects to boost feature extraction and only inherits discriminative features from lower to higher resolution progressively. The main insights of STEERER are a dedicated Feature Selection and Inheritance Adaptor (FSIA), which selectively forwards scale-customized features at each scale, and a Masked Selection and Inheritance Loss (MSIL) that helps to achieve high-quality density maps across all scales. Our experimental results on nine datasets with counting and localization tasks demonstrate the unprecedented scale generalization ability of STEERER. Code is available at \url{https://github.com/taohan10200/STEERER}.

Privacy-Preserving Face Recognition Using Random Frequency Components

  • paper_url: http://arxiv.org/abs/2308.10461
  • repo_url: https://github.com/Tencent/TFace
  • paper_authors: Yuxi Mi, Yuge Huang, Jiazhen Ji, Minyi Zhao, Jiaxiang Wu, Xingkun Xu, Shouhong Ding, Shuigeng Zhou
  • for: protecting the visual information of face images and impeding their recovery, while preserving recognition accuracy.
  • methods: Human-perceivable low-frequency components are pruned to conceal visual information; to impede recovery while retaining accuracy, recognition models are trained and run on randomly selected frequency components, motivated by recent theoretical insights and observations of model attention.
  • results: Extensive experiments show that the proposed PartialFace method effectively balances privacy protection goals and recognition accuracy.
    Abstract The ubiquitous use of face recognition has sparked increasing privacy concerns, as unauthorized access to sensitive face images could compromise the information of individuals. This paper presents an in-depth study of the privacy protection of face images' visual information and against recovery. Drawing on the perceptual disparity between humans and models, we propose to conceal visual information by pruning human-perceivable low-frequency components. For impeding recovery, we first elucidate the seeming paradox between reducing model-exploitable information and retaining high recognition accuracy. Based on recent theoretical insights and our observation on model attention, we propose a solution to the dilemma, by advocating for the training and inference of recognition models on randomly selected frequency components. We distill our findings into a novel privacy-preserving face recognition method, PartialFace. Extensive experiments demonstrate that PartialFace effectively balances privacy protection goals and recognition accuracy. Code is available at: https://github.com/Tencent/TFace.
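
An illustrative pipeline (the block size, how many low-frequency channels to drop, and how many random channels to keep are all assumptions, not the paper's settings):

```python
import numpy as np
from scipy.fft import dctn

def random_frequency_channels(gray, block=8, drop_low=3, keep=24, rng=None):
    """Block-wise DCT; drop the lowest frequencies, keep a random subset."""
    rng = rng or np.random.default_rng()
    h, w = gray.shape[0] // block, gray.shape[1] // block
    tiles = (gray[:h * block, :w * block]
             .reshape(h, block, w, block).transpose(0, 2, 1, 3))
    coeffs = dctn(tiles, axes=(-2, -1), norm="ortho")      # (h, w, block, block)
    u, v = np.meshgrid(range(block), range(block), indexing="ij")
    order = np.argsort((u + v).ravel())                    # low -> high frequency
    channels = coeffs.reshape(h, w, -1)[..., order[drop_low:]]
    picked = rng.choice(channels.shape[-1], size=keep, replace=False)
    return channels[..., picked]                           # (h, w, keep) model input
```

Dropping the low-frequency channels hides most human-perceivable structure, while randomizing the remaining channels at train and inference time impedes reconstruction.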

DOMINO++: Domain-aware Loss Regularization for Deep Learning Generalizability

  • paper_url: http://arxiv.org/abs/2308.10453
  • repo_url: None
  • paper_authors: Skylar E. Stolte, Kyle Volle, Aprinda Indahlastari, Alejandro Albizu, Adam J. Woods, Kevin Brink, Matthew Hale, Ruogu Fang
  • for: improving out-of-distribution (OOD) generalization in modern deep learning (DL), where models that perform well in-domain can struggle on test data that differs substantially from the training data.
  • methods: DOMINO++, a dual-guidance, dynamic domain-aware loss regularization that integrates expert-guided and data-guided knowledge; unlike DOMINO's fixed scaling and regularization rate, it uses a dynamic scaling factor and an adaptive regularization rate.
  • results: On head tissue segmentation from MRIs, comprehensive evaluations on OOD data (synthetic noisy and rotated datasets, plus real data from a different scanner at a separate site) show that DOMINO++ outperforms both DOMINO and the baseline, indicating its potential for reliable DL deployment on real clinical data.
    Abstract Out-of-distribution (OOD) generalization poses a serious challenge for modern deep learning (DL). OOD data consists of test data that is significantly different from the model's training data. DL models that perform well on in-domain test data could struggle on OOD data. Overcoming this discrepancy is essential to the reliable deployment of DL. Proper model calibration decreases the number of spurious connections that are made between model features and class outputs. Hence, calibrated DL can improve OOD generalization by only learning features that are truly indicative of the respective classes. Previous work proposed domain-aware model calibration (DOMINO) to improve DL calibration, but it lacks designs for model generalizability to OOD data. In this work, we propose DOMINO++, a dual-guidance and dynamic domain-aware loss regularization focused on OOD generalizability. DOMINO++ integrates expert-guided and data-guided knowledge in its regularization. Unlike DOMINO which imposed a fixed scaling and regularization rate, DOMINO++ designs a dynamic scaling factor and an adaptive regularization rate. Comprehensive evaluations compare DOMINO++ with DOMINO and the baseline model for head tissue segmentation from magnetic resonance images (MRIs) on OOD data. The OOD data consists of synthetic noisy and rotated datasets, as well as real data using a different MRI scanner from a separate site. DOMINO++'s superior performance demonstrates its potential to improve the trustworthy deployment of DL on real clinical data.

COCA: Classifier-Oriented Calibration for Source-Free Universal Domain Adaptation via Textual Prototype

  • paper_url: http://arxiv.org/abs/2308.10450
  • repo_url: None
  • paper_authors: Xinghong Liu, Yi Zhou, Tao Zhou, Chun-Mei Feng, Ling Shao
  • for: efficient universal domain adaptation (UniDA) and source-free UniDA (SF-UniDA) when labeled source samples are scarce, so as to reduce labeling costs in practical applications.
  • methods: A Classifier-Oriented Calibration (COCA) method that leverages textual prototypes and formulates the source model via few-shot learning, replacing the need for extensive labeled source samples and greatly reducing labeling costs.
  • results: The approach outperforms state-of-the-art UniDA and SF-UniDA models while adapting effectively despite the scarcity of labeled source data.
    Abstract Universal Domain Adaptation (UniDA) aims to distinguish common and private classes between the source and target domains where domain shift exists. Recently, due to more stringent data restrictions, researchers have introduced Source-Free UniDA (SF-UniDA) in more realistic scenarios. SF-UniDA methods eliminate the need for direct access to source samples when performing adaptation to the target domain. However, existing SF-UniDA methods still require an extensive quantity of labeled source samples to train a source model, resulting in significant labeling costs. To tackle this issue, we present a novel Classifier-Oriented Calibration (COCA) method. This method, which leverages textual prototypes, is formulated for the source model based on few-shot learning. Specifically, we propose studying few-shot learning, usually explored for closed-set scenarios, to identify common and domain-private classes despite a significant domain shift between source and target domains. Essentially, we present a novel paradigm based on the vision-language model to learn SF-UniDA and hugely reduce the labeling costs on the source domain. Experimental results demonstrate that our approach outperforms state-of-the-art UniDA and SF-UniDA models.

Explore and Tell: Embodied Visual Captioning in 3D Environments

  • paper_url: http://arxiv.org/abs/2308.10447
  • repo_url: https://github.com/AIM3-RUC/ExploreAndTell
  • paper_authors: Anwen Hu, Shizhe Chen, Liang Zhang, Qin Jin
  • for: improving the accuracy and reliability of visual captioning models by letting them understand scenes beyond a single, possibly suboptimal viewpoint.
  • methods: A new task, Embodied Captioning, equips captioning models with navigation: starting from a random viewpoint, an agent explores the 3D scene to gather information from different viewpoints and reduce visual ambiguity before describing all objects in the scene.
  • results: The ET-Cap dataset (10K 3D scenes built with the Kubric simulator, with cluttered objects and three annotated paragraphs per scene) is introduced, along with a Cascade Embodied Captioning model (CaBOT) comprising a navigator and a captioner, which outperforms carefully designed baselines.
    Abstract While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints. Specifically, starting at a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT), which comprises of a navigator and a captioner, to tackle this task. The navigator predicts which actions to take in the environment, while the captioner generates a paragraph description based on the whole navigation trajectory. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, codes and models are available at https://aim3-ruc.github.io/ExploreAndTell.

When Prompt-based Incremental Learning Does Not Meet Strong Pretraining

  • paper_url: http://arxiv.org/abs/2308.10445
  • repo_url: https://github.com/tom-tym/apg
  • paper_authors: Yu-Ming Tang, Yi-Xing Peng, Wei-Shi Zheng
  • for: incremental learning, i.e., enabling deep networks to learn from sequential tasks without catastrophic forgetting.
  • methods: A learnable Adaptive Prompt Generator (APG) unifies prompt retrieval and prompt learning into a single learnable module, so the whole prompting process can be optimized to reduce the gap between tasks; a knowledge pool regularizes APG with the feature distribution of each class to avoid learning ineffective knowledge.
  • results: Without strong pretraining (typically on ImageNet-21k), the method significantly outperforms advanced methods in exemplar-free incremental learning; under strong pretraining it remains comparable to existing prompt-based models, showing it can still benefit from pretraining.
    Abstract Incremental learning aims to overcome catastrophic forgetting when learning deep networks from sequential tasks. With impressive learning efficiency and performance, prompt-based methods adopt a fixed backbone to sequential tasks by learning task-specific prompts. However, existing prompt-based methods heavily rely on strong pretraining (typically trained on ImageNet-21k), and we find that their models could be trapped if the potential gap between the pretraining task and unknown future tasks is large. In this work, we develop a learnable Adaptive Prompt Generator (APG). The key is to unify the prompt retrieval and prompt learning processes into a learnable prompt generator. Hence, the whole prompting process can be optimized to reduce the negative effects of the gap between tasks effectively. To make our APG avoid learning ineffective knowledge, we maintain a knowledge pool to regularize APG with the feature distribution of each class. Extensive experiments show that our method significantly outperforms advanced methods in exemplar-free incremental learning without (strong) pretraining. Besides, under strong retraining, our method also has comparable performance to existing prompt-based models, showing that our method can still benefit from pretraining. Codes can be found at https://github.com/TOM-tym/APG

Efficient Joint Optimization of Layer-Adaptive Weight Pruning in Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2308.10438
  • repo_url: https://github.com/akimoto-cris/rd_vit_prune
  • paper_authors: Kaixin Xu, Zhe Wang, Xue Geng, Jie Lin, Min Wu, Xiaoli Li, Weisi Lin
  • for: layer-adaptive weight pruning of deep neural networks (DNNs) that minimizes output distortion while adhering to a target pruning-ratio constraint.
  • methods: Exploiting an additivity property of the output distortion caused by pruning weights across multiple layers, the pruning is formulated as a combinatorial optimization problem and solved efficiently by dynamic programming; decomposing it into sub-problems yields linear time complexity, making the algorithm fast enough to run on CPUs.
  • results: On CIFAR-10, the method improves top-1 accuracy over existing approaches by up to 1.0% for ResNet-32, 0.5% for VGG-16, and 0.7% for DenseNet-121; on ImageNet, it achieves up to 4.7% and 4.6% higher top-1 accuracy for VGG-16 and ResNet-50, respectively.
    Abstract In this paper, we propose a novel layer-adaptive weight-pruning approach for Deep Neural Networks (DNNs) that addresses the challenge of optimizing the output distortion minimization while adhering to a target pruning ratio constraint. Our approach takes into account the collective influence of all layers to design a layer-adaptive pruning scheme. We discover and utilize a very important additivity property of output distortion caused by pruning weights on multiple layers. This property enables us to formulate the pruning as a combinatorial optimization problem and efficiently solve it through dynamic programming. By decomposing the problem into sub-problems, we achieve linear time complexity, making our optimization algorithm fast and feasible to run on CPUs. Our extensive experiments demonstrate the superiority of our approach over existing methods on the ImageNet and CIFAR-10 datasets. On CIFAR-10, our method achieves remarkable improvements, outperforming others by up to 1.0% for ResNet-32, 0.5% for VGG-16, and 0.7% for DenseNet-121 in terms of top-1 accuracy. On ImageNet, we achieve up to 4.7% and 4.6% higher top-1 accuracy compared to other methods for VGG-16 and ResNet-50, respectively. These results highlight the effectiveness and practicality of our approach for enhancing DNN performance through layer-adaptive weight pruning. Code will be available on https://github.com/Akimoto-Cris/RD_VIT_PRUNE.
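
A sketch of the dynamic program the abstract alludes to, assuming per-layer distortion and pruned-count tables measured offline and relying on the additivity of distortions (the state handling here is simplified and is not the authors' implementation):

```python
def allocate_sparsity(distortion, pruned, budget):
    """distortion[l][k], pruned[l][k]: distortion and #weights removed when
    layer l uses its k-th candidate sparsity. Returns per-layer choices that
    prune at least `budget` weights with minimal total distortion."""
    best = {0: (0.0, [])}                      # pruned-so-far -> (cost, choices)
    for layer_d, layer_p in zip(distortion, pruned):
        nxt = {}
        for p, (cost, choice) in best.items():
            for k, (d, r) in enumerate(zip(layer_d, layer_p)):
                q = min(p + r, budget)         # cap states at the budget
                c = cost + d
                if q not in nxt or c < nxt[q][0]:
                    nxt[q] = (c, choice + [k])
        best = nxt
    return best[budget][1] if budget in best else None

# Two layers, each with a "keep all" option and one pruning option:
print(allocate_sparsity([[0.0, 0.10], [0.0, 0.05]],
                        [[0, 60], [0, 50]], budget=100))   # -> [1, 1]
```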

UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.10421
  • repo_url: https://github.com/hollow-503/unim2ae
  • paper_authors: Jian Zou, Tianyu Huang, Guanglei Yang, Zhenhua Guo, Wangmeng Zuo
  • for: multi-modal self-supervised pre-training that fuses data from multiple sensors for 3D perception in autonomous driving.
  • methods: UniM$^2$AE, a multi-modal masked autoencoder with two main designs: features from both modalities are projected into a shared 3D volume space, expanded from the bird's-eye view (BEV) to include the height dimension, so that fused features can be back-projected into each native modality to reconstruct the masked inputs; and a Multi-modal 3D Interaction Module (MMIM) enables efficient inter-modal interaction.
  • results: Experiments on nuScenes show improvements of 1.2% NDS in 3D object detection and 6.5% mIoU in BEV map segmentation over the baseline.
    Abstract Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, it's commonplace to deploy multiple sensors for comprehensive environment perception. While integrating multi-modal features from these sensors can produce rich and powerful features, there is a noticeable gap in MAE methods addressing this integration. This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving, aiming to pioneer a more efficient fusion of two distinct modalities. To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, the UniM$^2$AE is proposed. This model stands as a potent yet straightforward, multi-modal self-supervised pre-training framework, mainly consisting of two designs. First, it projects the features from both modalities into a cohesive 3D volume space, ingeniously expanded from the bird's eye view (BEV) to include the height dimension. The extension makes it possible to back-project the informative features, obtained by fusing features from both modalities, into their native modalities to reconstruct the multiple masked inputs. Second, the Multi-modal 3D Interactive Module (MMIM) is invoked to facilitate the efficient inter-modal interaction during the interaction process. Extensive experiments conducted on the nuScenes Dataset attest to the efficacy of UniM$^2$AE, indicating enhancements in 3D object detection and BEV map segmentation by 1.2\%(NDS) and 6.5\% (mIoU), respectively. Code is available at https://github.com/hollow-503/UniM2AE.

The Change You Want to See (Now in 3D)

  • paper_url: http://arxiv.org/abs/2308.10417
  • repo_url: https://github.com/ragavsachdeva/CYWS-3D
  • paper_authors: Ragav Sachdeva, Andrew Zisserman
  • for: detecting changes in a 3D scene between two images taken from different camera positions at different points in time.
  • methods: trains a class-agnostic change detection model entirely on synthetic data, using a "register and difference" approach that leverages self-supervised frozen embeddings and feature differences.
  • results: operates directly on two RGB images, without requiring camera intrinsics/extrinsics, depth maps, point clouds, or other auxiliary data; a new evaluation dataset with human-annotated differences is also released.
    Abstract The goal of this paper is to detect what has changed, if anything, between two "in the wild" images of the same 3D scene acquired from different camera positions and at different temporal instances. The open-set nature of this problem, occlusions/dis-occlusions due to the shift in viewpoint, and the lack of suitable training datasets, presents substantial challenges in devising a solution. To address this problem, we contribute a change detection model that is trained entirely on synthetic data and is class-agnostic, yet it is performant out-of-the-box on real world images without requiring fine-tuning. Our solution entails a "register and difference" approach that leverages self-supervised frozen embeddings and feature differences, which allows the model to generalise to a wide variety of scenes and domains. The model is able to operate directly on two RGB images, without requiring access to ground truth camera intrinsics, extrinsics, depth maps, point clouds, or additional before-after images. Finally, we collect and release a new evaluation dataset consisting of real-world image pairs with human-annotated differences and demonstrate the efficacy of our method. The code, datasets and pre-trained model can be found at: https://github.com/ragavsachdeva/CYWS-3D
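The "register and difference" idea from the abstract can be pictured as: embed both images with a frozen self-supervised backbone, register one feature map into the other's frame, and flag locations where the features disagree. A minimal sketch, with the registration left as a placeholder and the threshold chosen arbitrarily:

```python
import torch
import torch.nn.functional as F

def change_map(feat_a, feat_b, warp=None, tau=0.5):
    """Toy 'register and difference' step. feat_a, feat_b: (C, H, W) feature
    maps from a frozen backbone; `warp` would register feat_b into feat_a's
    frame (a no-op placeholder here). Flags pixels whose features disagree."""
    if warp is not None:
        feat_b = warp(feat_b)
    sim = F.cosine_similarity(feat_a, feat_b, dim=0)  # (H, W) per-location similarity
    return (sim < tau).float()                        # 1 where the scene likely changed

mask = change_map(torch.randn(64, 32, 32), torch.randn(64, 32, 32))
```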

In-Rack Test Tube Pose Estimation Using RGB-D Data

  • paper_url: http://arxiv.org/abs/2308.10411
  • repo_url: None
  • paper_authors: Hao Chen, Weiwei Wan, Masaki Matsushita, Takeyuki Kotaka, Kensuke Harada
  • for: improving the accuracy and safety of robotic test tube manipulation in the biology and medical industries by detecting test tubes and estimating their positions and orientations.
  • methods: uses a YOLO object detector together with point cloud registration to detect and localize both the test tubes and the tube rack.
  • results: proposes an optimization-based pose estimation method that remains robust and accurate even with noisy and incomplete point cloud data.
    Abstract Accurate robotic manipulation of test tubes in biology and medical industries is becoming increasingly important to address workforce shortages and improve worker safety. The detection and localization of test tubes are essential for the robots to successfully manipulate test tubes. In this paper, we present a framework to detect and estimate poses for the in-rack test tubes using color and depth data. The methodology involves the utilization of a YOLO object detector to effectively classify and localize both the test tubes and the tube racks within the provided image data. Subsequently, the pose of the tube rack is estimated through point cloud registration techniques. During the process of estimating the poses of the test tubes, we capitalize on constraints derived from the arrangement of rack slots. By employing an optimization-based algorithm, we effectively evaluate and refine the pose of the test tubes. This strategic approach ensures the robustness of pose estimation, even when confronted with noisy and incomplete point cloud data.
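The point cloud registration step for the rack pose can be illustrated with the classic Kabsch/SVD rigid alignment between model points and observed points. This assumes known correspondences and omits the paper's slot-arrangement constraints and optimization-based refinement:

```python
import numpy as np

def kabsch(P, Q):
    """Rigid transform (R, t) aligning model points P (N, 3) to observed
    points Q (N, 3) with known correspondences, so that Q ~ R @ P + t."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t

# sanity check: recover a known rotation + translation
P = np.random.rand(10, 3)
c, s = np.cos(0.3), np.sin(0.3)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
R, t = kabsch(P, P @ Rz.T + np.array([0.1, 0.2, 0.0]))
```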

Turning a CLIP Model into a Scene Text Spotter

  • paper_url: http://arxiv.org/abs/2308.10408
  • repo_url: https://github.com/wenwenyu/tcm
  • paper_authors: Wenwen Yu, Yuliang Liu, Xingkui Zhu, Haoyu Cao, Xing Sun, Xiang Bai
  • for: enhancing scene text detection and spotting tasks using the large-scale Contrastive Language-Image Pretraining (CLIP) model.
  • methods: proposes a new backbone called FastTCM-CR50, which utilizes visual prompt learning and cross-attention in CLIP to extract image- and text-based prior knowledge, and introduces an instance-language matching process to refine text regions.
  • results: FastTCM-CR50 offers several advantages, including improved performance (1.7% and 1.5% on average), faster inference speed (48.5% increase), and robust few-shot training capabilities (26.5% and 5.5% improvement on text detection and spotting tasks, respectively). It also consistently enhances performance on out-of-distribution text detection and spotting datasets.
    Abstract We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks, transforming it into a robust backbone, FastTCM-CR50. This backbone utilizes visual prompt learning and cross-attention in CLIP to extract image and text-based prior knowledge. Using predefined and learnable prompts, FastTCM-CR50 introduces an instance-language matching process to enhance the synergy between image and text embeddings, thereby refining text regions. Our Bimodal Similarity Matching (BSM) module facilitates dynamic language prompt generation, enabling offline computations and improving performance. FastTCM-CR50 offers several advantages: 1) It can enhance existing text detectors and spotters, improving performance by an average of 1.7% and 1.5%, respectively. 2) It outperforms the previous TCM-CR50 backbone, yielding an average improvement of 0.2% and 0.56% in text detection and spotting tasks, along with a 48.5% increase in inference speed. 3) It showcases robust few-shot training capabilities. Utilizing only 10% of the supervised data, FastTCM-CR50 improves performance by an average of 26.5% and 5.5% for text detection and spotting tasks, respectively. 4) It consistently enhances performance on out-of-distribution text detection and spotting datasets, particularly the NightTime-ArT subset from ICDAR2019-ArT and the DOTA dataset for oriented object detection. The code is available at https://github.com/wenwenyu/TCM.
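The instance-language matching step can be thought of as scoring candidate text-region embeddings against a bank of predefined and learnable prompt embeddings, CLIP-style. A hypothetical scoring sketch (the function names and temperature value are illustrative, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def match_scores(region_feats, prompt_embs, temperature=0.07):
    """region_feats: (N, D) embeddings of candidate text regions;
    prompt_embs: (K, D) prompt embeddings. Returns (N, K) matching
    probabilities that can be used to refine text regions."""
    r = F.normalize(region_feats, dim=-1)
    p = F.normalize(prompt_embs, dim=-1)
    return ((r @ p.T) / temperature).softmax(dim=-1)
```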

Towards Generalizable Morph Attack Detection with Consistency Regularization

  • paper_url: http://arxiv.org/abs/2308.10392
  • repo_url: None
  • paper_authors: Hossein Kashiani, Niloufar Alipour Talemi, Mohammad Saeed Ebrahimi Saadabadi, Nasser M. Nasrabadi
  • for: improving the generalization capability of morph attack detection.
  • methods: proposes two simple yet effective morph-wise augmentations to explore a wide space of realistic morph transformations, then regularizes the model to learn consistently at both the logit and embedding levels across morph images generated from diverse sources.
  • results: experiments show superior generalization and robustness against morph images from unseen sources compared with state-of-the-art methods.
    Abstract Though recent studies have made significant progress in morph attack detection by virtue of deep neural networks, they often fail to generalize well to unseen morph attacks. With numerous morph attacks emerging frequently, generalizable morph attack detection has gained significant attention. This paper focuses on enhancing the generalization capability of morph attack detection from the perspective of consistency regularization. Consistency regularization operates under the premise that generalizable morph attack detection should output consistent predictions irrespective of the possible variations that may occur in the input space. In this work, to reach this objective, two simple yet effective morph-wise augmentations are proposed to explore a wide space of realistic morph transformations in our consistency regularization. Then, the model is regularized to learn consistently at the logit as well as embedding levels across a wide range of morph-wise augmented images. The proposed consistency regularization aligns the abstraction in the hidden layers of our model across the morph attack images which are generated from diverse domains in the wild. Experimental results demonstrate the superior generalization and robustness performance of our proposed method compared to the state-of-the-art studies.
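The consistency regularization amounts to penalizing disagreement, at both the logit and embedding levels, between two morph-wise augmented views of the same image. A minimal sketch; the KL-plus-MSE combination and the weighting are illustrative choices, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_a, logits_b, emb_a, emb_b, alpha=1.0):
    """Encourage consistent predictions across two morph-wise augmentations.
    logits_*: (B, 2) bona fide / morph logits; emb_*: (B, D) embeddings."""
    kl = F.kl_div(F.log_softmax(logits_a, dim=-1),
                  F.softmax(logits_b, dim=-1),
                  reduction="batchmean")            # logit-level consistency
    mse = F.mse_loss(F.normalize(emb_a, dim=-1),
                     F.normalize(emb_b, dim=-1))    # embedding-level consistency
    return kl + alpha * mse
```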

Automated mapping of virtual environments with visual predictive coding

  • paper_url: http://arxiv.org/abs/2308.10913
  • repo_url: None
  • paper_authors: James Gornet, Matthew Thomson
  • for: investigating how humans construct internal cognitive maps directly from sensory inputs, and how predictive coding can be used to build such maps.
  • methods: implements predictive coding with a self-attention-equipped convolutional neural network, training an agent on a next-image prediction task as it navigates a virtual environment.
  • results: through predictive coding, the agent automatically constructs a vectorized internal representation of the environment that supports vector navigation and accurately localizes the agent relative to nearby landmarks.
    Abstract Humans construct internal cognitive maps of their environment directly from sensory inputs without access to a system of explicit coordinates or distance measurements. While machine learning algorithms like SLAM utilize specialized visual inference procedures to identify visual features and construct spatial maps from visual and odometry data, the general nature of cognitive maps in the brain suggests a unified mapping algorithmic strategy that can generalize to auditory, tactile, and linguistic inputs. Here, we demonstrate that predictive coding provides a natural and versatile neural network algorithm for constructing spatial maps using sensory data. We introduce a framework in which an agent navigates a virtual environment while engaging in visual predictive coding using a self-attention-equipped convolutional neural network. While learning a next image prediction task, the agent automatically constructs an internal representation of the environment that quantitatively reflects distances. The internal map enables the agent to pinpoint its location relative to landmarks using only visual information.The predictive coding network generates a vectorized encoding of the environment that supports vector navigation where individual latent space units delineate localized, overlapping neighborhoods in the environment. Broadly, our work introduces predictive coding as a unified algorithmic framework for constructing cognitive maps that can naturally extend to the mapping of auditory, sensorimotor, and linguistic inputs.

HoSNN: Adversarially-Robust Homeostatic Spiking Neural Networks with Adaptive Firing Thresholds

  • paper_url: http://arxiv.org/abs/2308.10373
  • repo_url: None
  • paper_authors: Hejia Geng, Peng Li
  • for: developing a neural network model that resists adversarial attacks, improving the reliability and safety of spiking neural networks.
  • methods: draws on neural homeostasis and proposes a bio-inspired threshold-adapting leaky integrate-and-fire (TA-LIF) neuron model to build an adversarially robust homeostatic SNN (HoSNN).
  • results: without explicit adversarial training, HoSNN reaches 72.6% and 54.19% accuracy on CIFAR-10 under FGSM and PGD attacks (up from 20.97% and 0.6%); with minimal FGSM adversarial training it surpasses previous models by 29.99% under FGSM and 47.83% under PGD.
    Abstract Spiking neural networks (SNNs) offer promise for efficient and powerful neurally inspired computation. Common to other types of neural networks, however, SNNs face the severe issue of vulnerability to adversarial attacks. We present the first study that draws inspiration from neural homeostasis to develop a bio-inspired solution that counters the susceptibilities of SNNs to adversarial onslaughts. At the heart of our approach is a novel threshold-adapting leaky integrate-and-fire (TA-LIF) neuron model, which we adopt to construct the proposed adversarially robust homeostatic SNN (HoSNN). Distinct from traditional LIF models, our TA-LIF model incorporates a self-stabilizing dynamic thresholding mechanism, curtailing adversarial noise propagation and safeguarding the robustness of HoSNNs in an unsupervised manner. Theoretical analysis is presented to shed light on the stability and convergence properties of the TA-LIF neurons, underscoring their superior dynamic robustness under input distributional shifts over traditional LIF neurons. Remarkably, without explicit adversarial training, our HoSNNs demonstrate inherent robustness on CIFAR-10, with accuracy improvements to 72.6% and 54.19% against FGSM and PGD attacks, up from 20.97% and 0.6%, respectively. Furthermore, with minimal FGSM adversarial training, our HoSNNs surpass previous models by 29.99% under FGSM and 47.83% under PGD attacks on CIFAR-10. Our findings offer a new perspective on harnessing biological principles for bolstering SNNs adversarial robustness and defense, paving the way to more resilient neuromorphic computing.
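The TA-LIF idea — a leaky integrate-and-fire neuron whose firing threshold drifts with its own activity — can be simulated in a few lines. The decay constants below are arbitrary; the point is that a spike raises the threshold, which then relaxes back toward its baseline, damping the propagation of input perturbations:

```python
import numpy as np

def ta_lif(inputs, tau_v=0.9, tau_th=0.99, th0=1.0, eta=0.5):
    """Threshold-adapting LIF neuron: membrane v leaks and integrates input;
    the threshold th rises after each spike and decays back toward th0."""
    v, th, spikes = 0.0, th0, []
    for x in inputs:
        v = tau_v * v + x
        s = float(v >= th)
        spikes.append(s)
        v *= (1.0 - s)                                     # hard reset on spike
        th = tau_th * th + (1.0 - tau_th) * th0 + eta * s  # homeostatic threshold
    return np.array(spikes)

print(ta_lif(np.random.rand(20)))
```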

Developing a Machine Learning-Based Clinical Decision Support Tool for Uterine Tumor Imaging

  • paper_url: http://arxiv.org/abs/2308.10372
  • repo_url: None
  • paper_authors: Darryl E. Wright, Adriana V. Gregory, Deema Anaam, Sepideh Yadollahi, Sumana Ramanathan, Kafayat A. Oyemade, Reem Alsibai, Heather Holmes, Harrison Gottlich, Cherie-Akilah G. Browne, Sarah L. Cohen Rassier, Isabel Green, Elizabeth A. Stewart, Hiroaki Takahashi, Bohyun Kim, Shannon Laughlin-Tommaso, Timothy L. Kline
  • for: developing an automated segmentation method to delineate uterine tumors (UTs) and distinguish between different UT types.
  • methods: uses an nnU-Net model, explores the effect of training set size on performance, and classifies UTs using radiomic features.
  • results: training on the full set yields close to human-level segmentation performance, but automated classification of benign versus malignant and degenerated LM versus LMS remains challenging.
    Abstract Uterine leiomyosarcoma (LMS) is a rare but aggressive malignancy. On imaging, it is difficult to differentiate LMS from, for example, degenerated leiomyoma (LM), a prevalent but benign condition. We curated a data set of 115 axial T2-weighted MRI images from 110 patients (mean [range] age=45 [17-81] years) with UTs that included five different tumor types. These data were randomly split stratifying on tumor volume into training (n=85) and test sets (n=30). An independent second reader (reader 2) provided manual segmentations for all test set images. To automate segmentation, we applied nnU-Net and explored the effect of training set size on performance by randomly generating subsets with 25, 45, 65 and 85 training set images. We evaluated the ability of radiomic features to distinguish between types of UT individually and when combined through feature selection and machine learning. Using the entire training set the mean [95% CI] fibroid DSC was measured as 0.87 [0.59-1.00] and the agreement between the two readers was 0.89 [0.77-1.0] on the test set. When classifying degenerated LM from LMS we achieve a test set F1-score of 0.80. Classifying UTs based on radiomic features we identify classifiers achieving F1-scores of 0.53 [0.45, 0.61] and 0.80 [0.80, 0.80] on the test set for the benign versus malignant, and degenerated LM versus LMS tasks. We show that it is possible to develop an automated method for 3D segmentation of the uterus and UT that is close to human-level performance with fewer than 150 annotated images. For distinguishing UT types, while we train models that merit further investigation with additional data, reliable automatic differentiation of UTs remains a challenge.
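The segmentation metric quoted above, DSC, is the Dice similarity coefficient, 2|A∩B| / (|A| + |B|) for predicted and reference masks:

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """Dice similarity coefficient between two binary masks of any shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)
```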

Prediction of Pneumonia and COVID-19 Using Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2308.10368
  • repo_url: None
  • paper_authors: M. S. Haque, M. S. Taluckder, S. B. Shawkat, M. A. Shahriyar, M. A. Sayed, C. Modak
  • for: exploring how medical image analysis can support early identification of pneumonia caused by viral and bacterial infections.
  • methods: applies machine learning techniques to predict pneumonia from chest X-ray images.
  • results: among the evaluated models, DenseNet121 predicts pneumonia most accurately, reaching 99.58% accuracy.
    Abstract Pneumonia, caused by bacteria and viruses, is a rapidly spreading viral infection with global implications. Prompt identification of infected individuals is crucial for containing its transmission. This study explores the potential of medical image analysis to address this challenge. We propose machine-learning techniques for predicting Pneumonia from chest X-ray images. Chest X-ray imaging is vital for Pneumonia diagnosis due to its accessibility and cost-effectiveness. However, interpreting X-rays for Pneumonia detection can be complex, as radiographic features can overlap with other respiratory conditions. We evaluate the performance of different machine learning models, including DenseNet121, Inception Resnet-v2, Inception Resnet-v3, Resnet50, and Xception, using chest X-ray images of pneumonia patients. Performance measures and confusion matrices are employed to assess and compare the models. The findings reveal that DenseNet121 outperforms other models, achieving an accuracy rate of 99.58%. This study underscores the significance of machine learning in the accurate detection of Pneumonia, leveraging chest X-ray images. Our study offers insights into the potential of technology to mitigate the spread of pneumonia through precise diagnostics.
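A standard way to set up DenseNet121 for two-class chest X-ray classification is to load ImageNet-pretrained weights and swap the classifier head, as sketched below with torchvision; the paper's exact training configuration is not specified here:

```python
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained DenseNet121 with a 2-way head (normal vs. pneumonia)
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
model.classifier = nn.Linear(model.classifier.in_features, 2)
```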

Vehicle Cameras Guide mmWave Beams: Approach and Real-World V2V Demonstration

  • paper_url: http://arxiv.org/abs/2308.10362
  • repo_url: None
  • paper_authors: Tawfik Osman, Gouranga Charan, Ahmed Alkhateeb
  • for: exploring the use of vision sensors in vehicle-to-vehicle (V2V) communication scenarios to improve the accuracy of millimeter-wave (mmWave) and terahertz (THz) beam prediction.
  • methods: proposes a deep learning solution that uses images from a 360-degree camera attached to the vehicle to predict future beams in V2V communication scenarios.
  • results: the proposed vision-aided solution achieves approximately 85% accuracy in top-5 beam prediction while significantly reducing the beam training overhead, highlighting the potential of utilizing vision for enabling highly mobile V2V communications.
    Abstract Accurately aligning millimeter-wave (mmWave) and terahertz (THz) narrow beams is essential to satisfy reliability and high data rates of 5G and beyond wireless communication systems. However, achieving this objective is difficult, especially in vehicle-to-vehicle (V2V) communication scenarios, where both transmitter and receiver are constantly mobile. Recently, additional sensing modalities, such as visual sensors, have attracted significant interest due to their capability to provide accurate information about the wireless environment. To that end, in this paper, we develop a deep learning solution for V2V scenarios to predict future beams using images from a 360 camera attached to the vehicle. The developed solution is evaluated on a real-world multi-modal mmWave V2V communication dataset comprising co-existing 360 camera and mmWave beam training data. The proposed vision-aided solution achieves $\approx 85\%$ top-5 beam prediction accuracy while significantly reducing the beam training overhead. This highlights the potential of utilizing vision for enabling highly-mobile V2V communications.
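The reported metric, top-5 beam prediction accuracy, counts a prediction as correct when the true beam index appears among the five highest-scoring beams:

```python
import torch

def topk_accuracy(logits, target, k=5):
    """logits: (B, num_beams) predicted beam scores; target: (B,) true indices."""
    topk = logits.topk(k, dim=-1).indices             # (B, k)
    return (topk == target.unsqueeze(-1)).any(dim=-1).float().mean().item()
```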

Strata-NeRF : Neural Radiance Fields for Stratified Scenes

  • paper_url: http://arxiv.org/abs/2308.10337
  • repo_url: None
  • paper_authors: Ankit Dhiman, Srinath R, Harsh Rangwani, Rishubh Parihar, Lokesh R Boregowda, Srinath Sridhar, R Venkatesh Babu
  • for: modeling scenes captured at multiple levels with a single neural radiance field (NeRF) that learns the underlying scene representation.
  • methods: conditions the NeRF on Vector Quantized (VQ) latent representations, allowing sudden changes in scene structure across levels.
  • results: evaluations on multi-layered synthetic datasets and the RealEstate10K dataset show that Strata-NeRF effectively captures stratified scenes, minimizes artifacts, and synthesizes high-fidelity views.
    Abstract Neural Radiance Field (NeRF) approaches learn the underlying 3D representation of a scene and generate photo-realistic novel views with high fidelity. However, most proposed settings concentrate on modelling a single object or a single level of a scene. However, in the real world, we may capture a scene at multiple levels, resulting in a layered capture. For example, tourists usually capture a monument's exterior structure before capturing the inner structure. Modelling such scenes in 3D with seamless switching between levels can drastically improve immersive experiences. However, most existing techniques struggle in modelling such scenes. We propose Strata-NeRF, a single neural radiance field that implicitly captures a scene with multiple levels. Strata-NeRF achieves this by conditioning the NeRFs on Vector Quantized (VQ) latent representations which allow sudden changes in scene structure. We evaluate the effectiveness of our approach in multi-layered synthetic dataset comprising diverse scenes and then further validate its generalization on the real-world RealEstate10K dataset. We find that Strata-NeRF effectively captures stratified scenes, minimizes artifacts, and synthesizes high-fidelity views compared to existing approaches.

Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos

  • paper_url: http://arxiv.org/abs/2308.10334
  • repo_url: None
  • paper_authors: Haoyuan Li, Haoye Dong, Hanchao Jia, Dong Huang, Michael C. Kampffmeyer, Liang Lin, Xiaodan Liang
  • for: improving the accuracy of multi-person 3D mesh recovery from videos, a first step toward automatic perception of group behavior in virtual reality, physical therapy, and beyond.
  • methods: proposes the Coordinate transFormer (CoordFormer), which directly models multi-person spatial-temporal relations and performs multi-mesh recovery end-to-end; instead of partitioning the feature map into coarse patch-wise tokens, it uses a novel Coordinate-Aware Attention to preserve pixel-level spatial-temporal coordinate information, plus a simple yet effective Body Center Attention mechanism to fuse position information.
  • results: on the 3DPW dataset, CoordFormer significantly improves the state of the art, outperforming the previous best results by 4.2%, 8.8%, and 4.7% on the MPJPE, PAMPJPE, and PVE metrics, respectively, while running 40% faster than recent video-based methods. Code is available at https://github.com/Li-Hao-yuan/CoordFormer.
    Abstract Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond. However, existing approaches rely on multi-stage paradigms, where the person detection and tracking stages are performed in a multi-person setting, while temporal dynamics are only modeled for one person at a time. Consequently, their performance is severely limited by the lack of inter-person interactions in the spatial-temporal mesh recovery, as well as by detection and tracking defects. To address these challenges, we propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner. Instead of partitioning the feature map into coarse-scale patch-wise tokens, CoordFormer leverages a novel Coordinate-Aware Attention to preserve pixel-level spatial-temporal coordinate information. Additionally, we propose a simple, yet effective Body Center Attention mechanism to fuse position information. Extensive experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively, while being 40% faster than recent video-based approaches. The released code can be found at https://github.com/Li-Hao-yuan/CoordFormer.

Towards Real-World Visual Tracking with Temporal Contexts

  • paper_url: http://arxiv.org/abs/2308.10330
  • repo_url: https://github.com/vision4robotics/tctrack
  • paper_authors: Ziang Cao, Ziyuan Huang, Liang Pan, Shiwei Zhang, Ziwei Liu, Changhong Fu
  • for: improving visual tracking performance, particularly under real-world conditions.
  • methods: proposes a two-level framework (TCTrack) that exploits temporal contexts to enhance both feature extraction and similarity map refinement.
  • results: extensive experiments on eight well-known benchmarks demonstrate the superiority of TCTrack++, and real-world tests directly verify that it is ready for practical applications.
    Abstract Visual tracking has made significant improvements in the past few decades. Most existing state-of-the-art trackers 1) merely aim for performance in ideal conditions while overlooking the real-world conditions; 2) adopt the tracking-by-detection paradigm, neglecting rich temporal contexts; 3) only integrate the temporal information into the template, where temporal contexts among consecutive frames are far from being fully utilized. To handle those problems, we propose a two-level framework (TCTrack) that can exploit temporal contexts efficiently. Based on it, we propose a stronger version for real-world visual tracking, i.e., TCTrack++. It boils down to two levels: features and similarity maps. Specifically, for feature extraction, we propose an attention-based temporally adaptive convolution to enhance the spatial features using temporal information, which is achieved by dynamically calibrating the convolution weights. For similarity map refinement, we introduce an adaptive temporal transformer to encode the temporal knowledge efficiently and decode it for the accurate refinement of the similarity map. To further improve the performance, we additionally introduce a curriculum learning strategy. Also, we adopt online evaluation to measure performance in real-world conditions. Exhaustive experiments on 8 wellknown benchmarks demonstrate the superiority of TCTrack++. Real-world tests directly verify that TCTrack++ can be readily used in real-world applications.

Hyper Association Graph Matching with Uncertainty Quantification for Coronary Artery Semantic Labeling

  • paper_url: http://arxiv.org/abs/2308.10320
  • repo_url: None
  • paper_authors: Chen Zhao, Michele Esposito, Zhihui Xu, Weihua Zhou
  • for: supporting the diagnosis of coronary artery disease (CAD) through semantic labeling of arterial branches on invasive coronary angiograms (ICAs).
  • methods: uses a hyper association graph-matching neural network with uncertainty quantification (HAGMN-UQ) to identify and label arterial branches on ICAs.
  • results: achieves an accuracy of 0.9345 for coronary artery semantic labeling on real patient data with fast inference, enabling effective and efficient predictions in real-time clinical decision-making scenarios.
    Abstract Coronary artery disease (CAD) is one of the primary causes leading to death worldwide. Accurate extraction of individual arterial branches on invasive coronary angiograms (ICA) is important for stenosis detection and CAD diagnosis. However, deep learning-based models face challenges in generating semantic segmentation for coronary arteries due to the morphological similarity among different types of coronary arteries. To address this challenge, we propose an innovative approach using the hyper association graph-matching neural network with uncertainty quantification (HAGMN-UQ) for coronary artery semantic labeling on ICAs. The graph-matching procedure maps the arterial branches between two individual graphs, so that the unlabeled arterial segments are classified by the labeled segments, and the coronary artery semantic labeling is achieved. By incorporating the anatomical structural loss and uncertainty, our model achieved an accuracy of 0.9345 for coronary artery semantic labeling with a fast inference speed, leading to an effective and efficient prediction in real-time clinical decision-making scenarios.

Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting

  • paper_url: http://arxiv.org/abs/2308.10315
  • repo_url: https://github.com/shikiw/robustmae
  • paper_authors: Qidong Huang, Xiaoyi Dong, Dongdong Chen, Yinpeng Chen, Lu Yuan, Gang Hua, Weiming Zhang, Nenghai Yu
  • for: studying the adversarial robustness of vision transformers equipped with BERT-style pretraining.
  • methods: compares BERT pretraining methods such as BEiT and MAE.
  • results: finds that MAE is noticeably less adversarially robust than other BERT pretraining methods, and traces this to the reconstruction target: predicting the raw pixels of masked image patches degrades robustness more than predicting semantic context, because it steers the model toward medium-/high-frequency image components. Based on this analysis, the authors propose a simple yet effective remedy for MAE: occupying the medium-/high-frequency bands of images with frequency-domain knowledge extracted from the pretraining dataset, thereby narrowing the optimization space available to adversarial perturbations.
    Abstract In this paper, we investigate the adversarial robustness of vision transformers that are equipped with BERT pretraining (e.g., BEiT, MAE). A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods. This observation drives us to rethink the basic differences between these BERT pretraining methods and how these differences affect the robustness against adversarial perturbations. Our empirical analysis reveals that the adversarial robustness of BERT pretraining is highly related to the reconstruction target, i.e., predicting the raw pixels of masked image patches will degrade more adversarial robustness of the model than predicting the semantic context, since it guides the model to concentrate more on medium-/high-frequency components of images. Based on our analysis, we provide a simple yet effective way to boost the adversarial robustness of MAE. The basic idea is using the dataset-extracted domain knowledge to occupy the medium-/high-frequency of images, thus narrowing the optimization space of adversarial perturbations. Specifically, we group the distribution of pretraining data and optimize a set of cluster-specific visual prompts on frequency domain. These prompts are incorporated with input images through prototype-based prompt selection during test period. Extensive evaluation shows that our method clearly boost MAE's adversarial robustness while maintaining its clean performance on ImageNet-1k classification. Our code is available at: https://github.com/shikiw/RobustMAE.
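The proposed defense occupies the medium/high-frequency bands of the input with dataset-derived prompts. A rough numpy sketch of injecting a prompt into those bands via the FFT; the hard radial cutoff and 50/50 blending are illustrative choices, not the paper's exact scheme:

```python
import numpy as np

def apply_frequency_prompt(image, prompt, cutoff=0.25):
    """Blend a prompt into the medium/high-frequency bands of a single-channel
    H x W image, narrowing the space available to adversarial perturbations."""
    H, W = image.shape
    fft = np.fft.fftshift(np.fft.fft2(image))
    pfft = np.fft.fftshift(np.fft.fft2(prompt))
    yy, xx = np.mgrid[:H, :W]
    r = np.hypot(yy - H / 2, xx - W / 2) / max(H, W)   # normalized radius
    high = r > cutoff                                   # medium/high-frequency mask
    fft[high] = 0.5 * fft[high] + 0.5 * pfft[high]      # inject prompt content
    return np.real(np.fft.ifft2(np.fft.ifftshift(fft)))

out = apply_frequency_prompt(np.random.rand(32, 32), np.random.rand(32, 32))
```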

DVGaze: Dual-View Gaze Estimation

  • paper_url: http://arxiv.org/abs/2308.10310
  • repo_url: https://github.com/yihuacheng/dvgaze
  • paper_authors: Yihua Cheng, Feng Lu
  • for: proposing a dual-view gaze estimation network (DV-Gaze) that exploits the additional facial information captured by dual cameras to improve gaze estimation performance.
  • methods: uses dual-view interactive convolution (DIC) blocks, which exchange dual-view information during convolution at multiple feature scales by fusing features along epipolar lines, together with a dual-view transformer that estimates gaze direction.
  • results: achieves state-of-the-art performance on the ETH-XGaze and EVE datasets, and the experiments demonstrate the potential of dual-view gaze estimation. Code is available at https://github.com/yihuacheng/DVGaze.
    Abstract Gaze estimation methods estimate gaze from facial appearance with a single camera. However, due to the limited view of a single camera, the captured facial appearance cannot provide complete facial information and thus complicate the gaze estimation problem. Recently, camera devices are rapidly updated. Dual cameras are affordable for users and have been integrated in many devices. This development suggests that we can further improve gaze estimation performance with dual-view gaze estimation. In this paper, we propose a dual-view gaze estimation network (DV-Gaze). DV-Gaze estimates dual-view gaze directions from a pair of images. We first propose a dual-view interactive convolution (DIC) block in DV-Gaze. DIC blocks exchange dual-view information during convolution in multiple feature scales. It fuses dual-view features along epipolar lines and compensates for the original feature with the fused feature. We further propose a dual-view transformer to estimate gaze from dual-view features. Camera poses are encoded to indicate the position information in the transformer. We also consider the geometric relation between dual-view gaze directions and propose a dual-view gaze consistency loss for DV-Gaze. DV-Gaze achieves state-of-the-art performance on ETH-XGaze and EVE datasets. Our experiments also prove the potential of dual-view gaze estimation. We release codes in https://github.com/yihuacheng/DVGaze.

Representation Disparity-aware Distillation for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.10308
  • repo_url: None
  • paper_authors: Yanjing Li, Sheng Xu, Mingbao Lin, Jihao Yin, Baochang Zhang, Xianbin Cao
  • for: developing a novel knowledge distillation (KD) method for compact 3D detectors that addresses the representation disparity between the teacher model and the student counterpart.
  • methods: the proposed representation disparity-aware distillation (RDD) is built on the information bottleneck (IB) principle to minimize the disparity of proposal region pairs in features and logits.
  • results: RDD outperforms existing KD methods, raising the mAP of CP-Voxel-S to 57.1% on the nuScenes dataset with only 42% of the FLOPs, even surpassing the teacher's performance.
    Abstract In this paper, we focus on developing knowledge distillation (KD) for compact 3D detectors. We observe that off-the-shelf KD methods manifest their efficacy only when the teacher model and student counterpart share similar intermediate feature representations. This might explain why they are less effective in building extreme-compact 3D detectors where significant representation disparity arises due primarily to the intrinsic sparsity and irregularity in 3D point clouds. This paper presents a novel representation disparity-aware distillation (RDD) method to address the representation disparity issue and reduce performance gap between compact students and over-parameterized teachers. This is accomplished by building our RDD from an innovative perspective of information bottleneck (IB), which can effectively minimize the disparity of proposal region pairs from student and teacher in features and logits. Extensive experiments are performed to demonstrate the superiority of our RDD over existing KD methods. For example, our RDD increases mAP of CP-Voxel-S to 57.1% on nuScenes dataset, which even surpasses teacher performance while taking up only 42% FLOPs.

Omnidirectional Information Gathering for Knowledge Transfer-based Audio-Visual Navigation

  • paper_url: http://arxiv.org/abs/2308.10306
  • repo_url: None
  • paper_authors: Jinyu Chen, Wenguan Wang, Si Liu, Hongsheng Li, Yi Yang
  • for: proposing ORAN, an omnidirectional audio-visual navigator based on cross-task navigation skill transfer, for traveling through unseen 3D environments toward a sounding source.
  • methods: a confidence-aware cross-task policy distillation (CCPD) strategy transfers the fundamental point-to-point wayfinding skill to ORAN so that it can master audio-visual navigation with far fewer training samples, and an omnidirectional information gathering (OIG) mechanism gleans visual-acoustic observations from different directions before decision-making.
  • results: ORAN significantly outperforms previous competitors; after model ensembling it won 1st place in the SoundSpaces Challenge 2022, improving SPL and SR by 53% and 35% relatively.
    Abstract Audio-visual navigation is an audio-targeted wayfinding task where a robot agent is entailed to travel a never-before-seen 3D environment towards the sounding source. In this article, we present ORAN, an omnidirectional audio-visual navigator based on cross-task navigation skill transfer. In particular, ORAN sharpens its two basic abilities for a such challenging task, namely wayfinding and audio-visual information gathering. First, ORAN is trained with a confidence-aware cross-task policy distillation (CCPD) strategy. CCPD transfers the fundamental, point-to-point wayfinding skill that is well trained on the large-scale PointGoal task to ORAN, so as to help ORAN to better master audio-visual navigation with far fewer training samples. To improve the efficiency of knowledge transfer and address the domain gap, CCPD is made to be adaptive to the decision confidence of the teacher policy. Second, ORAN is equipped with an omnidirectional information gathering (OIG) mechanism, i.e., gleaning visual-acoustic observations from different directions before decision-making. As a result, ORAN yields more robust navigation behaviour. Taking CCPD and OIG together, ORAN significantly outperforms previous competitors. After the model ensemble, we got 1st in Soundspaces Challenge 2022, improving SPL and SR by 53% and 35% relatively.

cs.AI - 2023-08-21

DynED: Dynamic Ensemble Diversification in Data Stream Classification

  • paper_url: http://arxiv.org/abs/2308.10807
  • repo_url: https://github.com/soheilabadifard/dyned
  • paper_authors: Soheil Abadifard, Sepehr Bakhshi, Sanaz Gheibuni, Fazli Can
  • for: improving classification accuracy on data streams, where disruptive changes in the data distribution (concept drift) can degrade accuracy.
  • methods: proposes an ensemble construction and maintenance approach based on MMR (Maximal Marginal Relevance) that selects components with both high performance and high diversity.
  • results: experiments on four real and eleven synthetic datasets show that the proposed method (DynED) achieves a higher average mean accuracy than five state-of-the-art baselines.
    Abstract Ensemble methods are commonly used in classification due to their remarkable performance. Achieving high accuracy in a data stream environment is a challenging task considering disruptive changes in the data distribution, also known as concept drift. A greater diversity of ensemble components is known to enhance prediction accuracy in such settings. Despite the diversity of components within an ensemble, not all contribute as expected to its overall performance. This necessitates a method for selecting components that exhibit high performance and diversity. We present a novel ensemble construction and maintenance approach based on MMR (Maximal Marginal Relevance) that dynamically combines the diversity and prediction accuracy of components during the process of structuring an ensemble. The experimental results on both four real and 11 synthetic datasets demonstrate that the proposed approach (DynED) provides a higher average mean accuracy compared to the five state-of-the-art baselines.
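MMR trades off a component's quality against its redundancy with components already selected. Below is a greedy sketch of that selection step; the accuracy table and similarity function are placeholder inputs (e.g., recent prequential accuracy and prediction overlap), not DynED's exact scoring:

```python
def mmr_select(candidates, accuracy, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance selection of ensemble components.
    accuracy[c]: relevance of component c; similarity(a, b): redundancy."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * accuracy[c] - (1.0 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected
```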

Differentiable Frank-Wolfe Optimization Layer

  • paper_url: http://arxiv.org/abs/2308.10806
  • repo_url: None
  • paper_authors: Zixuan Liu, Liu Liu, Xueqian Wang, Peilin Zhao
  • for: proposing an efficient differentiable optimization layer (DFWLayer) for solving large-scale constrained optimization problems.
  • methods: rolls out the Frank-Wolfe method, which solves constrained optimization problems without projections or Hessian computations.
  • results: experiments show that DFWLayer not only attains competitive accuracy in solutions and gradients but also consistently satisfies constraints, while surpassing the baselines in both forward and backward computational speed.
    Abstract Differentiable optimization has received a significant amount of attention due to its foundational role in the domain of machine learning based on neural networks. The existing methods leverages the optimality conditions and implicit function theorem to obtain the Jacobian matrix of the output, which increases the computational cost and limits the application of differentiable optimization. In addition, some non-differentiable constraints lead to more challenges when using prior differentiable optimization layers. This paper proposes a differentiable layer, named Differentiable Frank-Wolfe Layer (DFWLayer), by rolling out the Frank-Wolfe method, a well-known optimization algorithm which can solve constrained optimization problems without projections and Hessian matrix computations, thus leading to a efficient way of dealing with large-scale problems. Theoretically, we establish a bound on the suboptimality gap of the DFWLayer in the context of l1-norm constraints. Experimental assessments demonstrate that the DFWLayer not only attains competitive accuracy in solutions and gradients but also consistently adheres to constraints. Moreover, it surpasses the baselines in both forward and backward computational speeds.
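Frank-Wolfe avoids projections because its inner step is a linear minimization oracle over the constraint set; for an l1-norm ball that oracle simply returns a signed vertex. A forward-pass-only numpy sketch (the paper's contribution additionally differentiates through these iterates):

```python
import numpy as np

def frank_wolfe_l1(grad_f, x0, radius=1.0, steps=100):
    """Minimize f over {x : ||x||_1 <= radius} without projections."""
    x = x0.copy()
    for t in range(steps):
        g = grad_f(x)
        i = np.argmax(np.abs(g))
        s = np.zeros_like(x)
        s[i] = -radius * np.sign(g[i])   # LMO over the l1 ball: a signed vertex
        gamma = 2.0 / (t + 2.0)          # standard step-size schedule
        x = (1.0 - gamma) * x + gamma * s
    return x

# toy problem: min 0.5 * ||x - b||^2 over the unit l1 ball
b = np.array([0.8, -0.3, 0.1])
x_star = frank_wolfe_l1(lambda x: x - b, np.zeros(3))
```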

Artificial intelligence is ineffective and potentially harmful for fact checking

  • paper_url: http://arxiv.org/abs/2308.10800
  • repo_url: https://github.com/osome-iu/ai_fact_checking
  • paper_authors: Matthew R. DeVerna, Harry Yaojun Yan, Kai-Cheng Yang, Filippo Menczer
  • for: investigating how fact checks generated by an AI language model affect people's belief in, and intent to share, political news.
  • methods: a preregistered randomized controlled experiment in which participants judged political headlines accompanied by fact checks generated by a popular AI language model.
  • results: although the model debunks false headlines reasonably well, its fact checks do not significantly improve participants' ability to discern headline accuracy or share accurate news; they are harmful in specific cases, lowering belief in true headlines mislabeled as false and raising belief in false headlines the model is unsure about. On the positive side, the AI increases sharing intent for correctly labeled true headlines, and participants who opted to view the fact checks became more likely to share both true and false news but more likely to believe only the false ones.
    Abstract Fact checking can be an effective strategy against misinformation, but its implementation at scale is impeded by the overwhelming volume of information online. Recent artificial intelligence (AI) language models have shown impressive ability in fact-checking tasks, but how humans interact with fact-checking information provided by these models is unclear. Here we investigate the impact of fact checks generated by a popular AI model on belief in, and sharing intent of, political news in a preregistered randomized control experiment. Although the AI performs reasonably well in debunking false headlines, we find that it does not significantly affect participants' ability to discern headline accuracy or share accurate news. However, the AI fact-checker is harmful in specific cases: it decreases beliefs in true headlines that it mislabels as false and increases beliefs for false headlines that it is unsure about. On the positive side, the AI increases sharing intents for correctly labeled true headlines. When participants are given the option to view AI fact checks and choose to do so, they are significantly more likely to share both true and false news but only more likely to believe false news. Our findings highlight an important source of potential harm stemming from AI applications and underscore the critical need for policies to prevent or mitigate such unintended consequences.

Stabilizing Unsupervised Environment Design with a Learned Adversary

  • paper_url: http://arxiv.org/abs/2308.10797
  • repo_url: https://github.com/facebookresearch/dcd
  • paper_authors: Ishita Mediratta, Minqi Jiang, Jack Parker-Holder, Michael Dennis, Eugene Vinitsky, Tim Rocktäschel
  • for: improving the training of generally capable agents by designing adaptive distributions of training tasks that promote broad generalization and robustness to environment variations.
  • methods: uses reinforcement learning (RL) to train a teacher policy that generates tasks from scratch, making it possible to directly produce tasks adapted to the agent's current capabilities; the paper identifies several key shortcomings of PAIRED and proposes a solution for each.
  • results: matches or exceeds the state of the art in several established, challenging procedurally generated environments, including a partially observed maze navigation task and a continuous-control car racing environment, producing robust agents.
    Abstract A key challenge in training generally-capable agents is the design of training tasks that facilitate broad generalization and robustness to environment variations. This challenge motivates the problem setting of Unsupervised Environment Design (UED), whereby a student agent trains on an adaptive distribution of tasks proposed by a teacher agent. A pioneering approach for UED is PAIRED, which uses reinforcement learning (RL) to train a teacher policy to design tasks from scratch, making it possible to directly generate tasks that are adapted to the agent's current capabilities. Despite its strong theoretical backing, PAIRED suffers from a variety of challenges that hinder its practical performance. Thus, state-of-the-art methods currently rely on curation and mutation rather than generation of new tasks. In this work, we investigate several key shortcomings of PAIRED and propose solutions for each shortcoming. As a result, we make it possible for PAIRED to match or exceed state-of-the-art methods, producing robust agents in several established challenging procedurally-generated environments, including a partially-observed maze navigation task and a continuous-control car racing environment. We believe this work motivates a renewed emphasis on UED methods based on learned models that directly generate challenging environments, potentially unlocking more open-ended RL training and, as a result, more general agents.
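For context: in the original PAIRED formulation (a detail not spelled out in the abstract above), the teacher is rewarded with an estimate of the student's regret, computed by pairing the protagonist with an antagonist agent on each generated level. A one-line sketch of that estimate:

```python
def teacher_regret(antagonist_returns, protagonist_returns):
    """PAIRED-style teacher reward: best antagonist return minus the mean
    protagonist return on a generated level (an estimate of regret)."""
    return max(antagonist_returns) - sum(protagonist_returns) / len(protagonist_returns)
```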

Instruction Tuning for Large Language Models: A Survey

  • paper_url: http://arxiv.org/abs/2308.10792
  • repo_url: None
  • paper_authors: Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, Guoyin Wang
  • for: surveying instruction tuning (IT) for large language models (LLMs), a crucial technique for enhancing their capabilities and controllability.
  • methods: covers the general methodology of IT, the construction of instruction datasets, the training of IT models, and applications across different modalities, domains, and scenarios.
  • results: provides a systematic literature review spanning these topics, along with an analysis of factors that influence IT outcomes (e.g., the generation of instruction outputs and the size of the instruction dataset), a review of potential pitfalls and criticism of IT, and suggested avenues for fruitful research.
    Abstract This paper surveys research works in the quickly advancing field of instruction tuning (IT), a crucial technique to enhance the capabilities and controllability of large language models (LLMs). Instruction tuning refers to the process of further training LLMs on a dataset consisting of \textsc{(instruction, output)} pairs in a supervised fashion, which bridges the gap between the next-word prediction objective of LLMs and the users' objective of having LLMs adhere to human instructions. In this work, we make a systematic review of the literature, including the general methodology of IT, the construction of IT datasets, the training of IT models, and applications to different modalities, domains and applications, along with an analysis on aspects that influence the outcome of IT (e.g., generation of instruction outputs, size of the instruction dataset, etc). We also review the potential pitfalls of IT along with criticism against it, along with efforts pointing out current deficiencies of existing strategies and suggest some avenues for fruitful research.

A Block-Ring connected Topology of Parameterized Quantum Circuits

  • paper_url: http://arxiv.org/abs/2308.10791
  • repo_url: None
  • paper_authors: Wenjie Liu, Qingshan Wu
  • for: improving the efficiency and expressibility of parameterized quantum circuits (PQCs), addressing the optimization difficulties and performance guarantees of current circuit designs.
  • methods: proposes a new topology, the Block-Ring (BR) topology, which allocates all qubits to several blocks, adopts an all-to-all pattern inside each block, and connects the blocks in a ring. Compared with pure all-to-all topology circuits, which offer the best expressive power, BR circuits achieve similar performance while reducing the number of parameters and two-qubit gates from O(n^2) to O(mn), where m is a user-chosen hyperparameter.
  • results: BR topology compares favorably with other topology circuits in expressibility and entangling capability and performs better in multilayer circuits; the analysis also distinguishes the effects of controlled X-rotation and controlled Z-rotation gates.
    Abstract It is essential to select efficient topology of parameterized quantum circuits (PQCs) in variational quantum algorithms (VQAs). However, there are problems in current circuits, i.e. optimization difficulties caused by too many parameters or performance is hard to guarantee. How to reduce the number of parameters (number of single-qubit rotation gates and 2-qubit gates) in PQCs without reducing the performance has become a new challenge. To solve this problem, we propose a novel topology, called Block-Ring (BR) topology, to construct the PQCs. This topology allocate all qubits to several blocks, all-to-all mode is adopt inside each block and ring mode is applied to connect different blocks. Compared with the pure all-to-all topology circuits which own the best power, BR topology have similar performance and the number of parameters and 2-qubit gate reduced from 0(n^2) to 0(mn) , m is a hyperparameter set by ourselves. Besides, we compared BR topology with other topology circuits in terms of expressibility and entangling capability. Considering the effects of different 2-qubit gates on circuits, we also make a distinction between controlled X-rotation gates and controlled Z-rotation gates. Finally, the 1- and 2-layer configurations of PQCs are taken into consideration as well, which shows the BR's performance improvement in the condition of multilayer circuits.
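
To make the connectivity concrete, here is a minimal pure-Python sketch enumerating the 2-qubit gate pairs of a BR-style layout. The helper name `block_ring_pairs`, the contiguous block split, and the choice of a single edge between neighbouring blocks for the ring are illustrative assumptions rather than the paper's exact construction; the point is only to show why the pair count scales as O(mn) with block size m.

```python
from itertools import combinations

def block_ring_pairs(n_qubits: int, block_size: int):
    """Enumerate 2-qubit gate pairs for a Block-Ring (BR) style topology."""
    assert n_qubits % block_size == 0
    blocks = [list(range(i, i + block_size))
              for i in range(0, n_qubits, block_size)]
    pairs = []
    for block in blocks:                          # all-to-all inside a block
        pairs.extend(combinations(block, 2))
    for i in range(len(blocks)):                  # ring connecting the blocks
        pairs.append((blocks[i][-1], blocks[(i + 1) % len(blocks)][0]))
    return pairs

# With m = block_size, the pair count grows as O(m * n) instead of the
# O(n^2) of a pure all-to-all circuit:
print(len(block_ring_pairs(12, 3)))   # 16 pairs, versus C(12, 2) = 66
```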

Sparse Linear Concept Discovery Models

  • paper_url: http://arxiv.org/abs/2308.10782
  • repo_url: https://github.com/konpanousis/conceptdiscoverymodels
  • paper_authors: Konstantinos P. Panousis, Dino Ienco, Diego Marcos
  • for: This paper aims to build interpretable deep learning models whose decisions can be investigated and corrected.
  • methods: It proposes a simple yet highly intuitive interpretable framework based on Contrastive Language-Image models and a single sparse linear layer, where sparsity is achieved through principled Bayesian arguments by inferring concept presence via a data-driven Bernoulli distribution.
  • results: Experiments show that the framework not only outperforms recent CBM approaches accuracy-wise, but also yields high per-example concept sparsity.
    Abstract The recent mass adoption of DNNs, even in safety-critical scenarios, has shifted the focus of the research community towards the creation of inherently interpretable models. Concept Bottleneck Models (CBMs) constitute a popular approach where hidden layers are tied to human understandable concepts allowing for investigation and correction of the network's decisions. However, CBMs usually suffer from: (i) performance degradation and (ii) lower interpretability than intended due to the sheer amount of concepts contributing to each decision. In this work, we propose a simple yet highly intuitive interpretable framework based on Contrastive Language Image models and a single sparse linear layer. In stark contrast to related approaches, the sparsity in our framework is achieved via principled Bayesian arguments by inferring concept presence via a data-driven Bernoulli distribution. As we experimentally show, our framework not only outperforms recent CBM approaches accuracy-wise, but it also yields high per example concept sparsity, facilitating the individual investigation of the emerging concepts.
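
As a rough illustration of the mechanism, here is a minimal PyTorch sketch: CLIP-style image and concept embeddings give concept scores, a relaxed Bernoulli gate induces per-example sparsity, and a single linear layer classifies. Gating with `torch.distributions.RelaxedBernoulli` is our assumption of one differentiable way to realise a data-driven Bernoulli distribution; it is not the authors' exact parameterisation.

```python
import torch
import torch.nn as nn

class SparseConceptClassifier(nn.Module):
    def __init__(self, n_concepts, n_classes, temperature=0.1):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(n_concepts))
        self.linear = nn.Linear(n_concepts, n_classes)
        self.temperature = temperature

    def forward(self, image_emb, concept_emb):
        # Concept activations: cosine similarity between image and concepts.
        scores = nn.functional.normalize(image_emb, dim=-1) @ \
                 nn.functional.normalize(concept_emb, dim=-1).T
        # Differentiable sample of which concepts are "present".
        gate = torch.distributions.RelaxedBernoulli(
            self.temperature, logits=self.gate_logits).rsample()
        return self.linear(scores * gate)
```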

To Whom are You Talking? A Deep Learning Model to Endow Social Robots with Addressee Estimation Skills

  • paper_url: http://arxiv.org/abs/2308.10757
  • repo_url: None
  • paper_authors: Carlo Mazzola, Marta Romeo, Francesco Rea, Alessandra Sciutti, Angelo Cangelosi
  • For: The paper is written to address the problem of Addressee Estimation in human-human communication, with the goal of enabling social robots to understand and interpret non-verbal bodily cues from speakers.
  • Methods: The paper uses a hybrid deep learning model that combines convolutional layers and LSTM cells to analyze images of the speaker's face and 2D vectors of their body posture. The model is designed to be efficient and deployable on social robots in ecological scenarios.
  • Results: The paper demonstrates that the proposed model is able to solve the Addressee Estimation problem in terms of addressee localisation in space, from a robot ego-centric point of view.
    Abstract Communication shapes our social world. For a robot to be considered social and consequently integrated into our social environment, it is fundamental to understand some of the dynamics that rule human-human communication. In this work, we tackle the problem of Addressee Estimation, the ability to understand an utterance's addressee, by interpreting and exploiting non-verbal bodily cues from the speaker. We do so by implementing a hybrid deep learning model composed of convolutional layers and LSTM cells taking as input images portraying the face of the speaker and 2D vectors of the speaker's body posture. Our implementation choices were guided by the aim to develop a model that could be deployed on social robots and be efficient in ecological scenarios. We demonstrate that our model is able to solve the Addressee Estimation problem in terms of addressee localisation in space, from a robot ego-centric point of view.
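
A minimal PyTorch sketch of a hybrid model in the spirit described above: a small CNN encodes the speaker's face crop, the encoding is concatenated with a 2D body-pose vector, and an LSTM integrates the sequence over time. Layer sizes, the fusion-by-concatenation choice, and the output dimensionality are our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AddresseeEstimator(nn.Module):
    def __init__(self, pose_dim=34, hidden=128, n_outputs=3):
        super().__init__()
        self.cnn = nn.Sequential(                  # face encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(32 + pose_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_outputs)   # e.g. addressee direction

    def forward(self, faces, poses):
        # faces: (B, T, 3, H, W); poses: (B, T, pose_dim)
        B, T = faces.shape[:2]
        feats = self.cnn(faces.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(torch.cat([feats, poses], dim=-1))
        return self.head(out[:, -1])               # prediction at last step

model = AddresseeEstimator()
y = model(torch.randn(2, 10, 3, 64, 64), torch.randn(2, 10, 34))  # (2, 3)
```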

DataVinci: Learning Syntactic and Semantic String Repairs

  • paper_url: http://arxiv.org/abs/2308.10922
  • repo_url: None
  • paper_authors: Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, Gust Verbruggen
  • for: The paper proposes a fully automated system for detecting and repairing errors in string data that requires no user-provided examples, annotations, or constraints.
  • methods: It automatically learns regular-expression-based patterns that cover the majority of values in a column, and derives repair proposals from these majority patterns and from constraints learned over other columns.
  • results: Evaluated on 4 existing and new benchmarks against 7 baselines, the system achieves higher error-detection and repair accuracy.
    Abstract String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text. Systems that successfully clean such string data can have a significant impact on real users. While prior work has explored errors in string data, proposed approaches have often been limited to error detection or require that the user provide annotations, examples, or constraints to fix the errors. Furthermore, these systems have focused independently on syntactic errors or semantic errors in strings, but ignore that strings often contain both syntactic and semantic substrings. We introduce DataVinci, a fully unsupervised string data error detection and repair system. DataVinci learns regular-expression-based patterns that cover a majority of values in a column and reports values that do not satisfy such patterns as data errors. DataVinci can automatically derive edits to the data error based on the majority patterns and constraints learned over other columns without the need for further user interaction. To handle strings with both syntactic and semantic substrings, DataVinci uses an LLM to abstract (and re-concretize) portions of strings that are semantic prior to learning majority patterns and deriving edits. Because not all data can result in majority patterns, DataVinci leverages execution information from an existing program (which reads the target data) to identify and correct data repairs that would not otherwise be identified. DataVinci outperforms 7 baselines on both error detection and repair when evaluated on 4 existing and new benchmarks.
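
To illustrate the majority-pattern idea, here is a minimal pure-Python sketch: each value is abstracted to a character-class pattern, the dominant pattern of the column is found, and values that do not match it are flagged as likely errors. The abstraction rules below are deliberately far simpler than DataVinci's learned regular expressions and its LLM-based semantic abstraction.

```python
import re
from collections import Counter

def abstract(value: str) -> str:
    pattern = re.sub(r"[A-Za-z]+", "L+", value)    # letter runs -> L+
    pattern = re.sub(r"\d+", "D+", pattern)        # digit runs  -> D+
    return pattern

def flag_errors(column):
    patterns = [abstract(v) for v in column]
    majority, _ = Counter(patterns).most_common(1)[0]
    return [v for v, p in zip(column, patterns) if p != majority]

column = ["2021-03-01", "2021-04-17", "20210518", "2021-06-30"]
print(flag_errors(column))   # ['20210518'] breaks the D+-D+-D+ pattern
```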

On the Adversarial Robustness of Multi-Modal Foundation Models

  • paper_url: http://arxiv.org/abs/2308.10741
  • repo_url: None
  • paper_authors: Christian Schlarmann, Matthias Hein
  • for: Protecting honest users from being misled by malicious third-party content or fed false information.
  • methods: Uses imperceptible adversarial attacks on images that change the caption output of a multi-modal foundation model, e.g. to guide users to malicious websites or broadcast fake information.
  • results: Shows that multi-modal foundation models can be exploited in this way to harm honest users, indicating that any deployed model should employ countermeasures against adversarial attacks.
    Abstract Multi-modal foundation models combining vision and language models such as Flamingo or GPT-4 have recently gained enormous interest. Alignment of foundation models is used to prevent models from providing toxic or harmful output. While malicious users have successfully tried to jailbreak foundation models, an equally important question is if honest users could be harmed by malicious third-party content. In this paper we show that imperceivable attacks on images in order to change the caption output of a multi-modal foundation model can be used by malicious content providers to harm honest users e.g. by guiding them to malicious websites or broadcast fake information. This indicates that countermeasures to adversarial attacks should be used by any deployed multi-modal foundation model.
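
A minimal PyTorch sketch of the attack family the paper studies: projected gradient descent (PGD) in a small L-infinity ball pushes a captioning model's output toward an attacker-chosen caption. `caption_loss(model, image, target_tokens)` is a hypothetical hook standing in for the model's autoregressive cross-entropy on the target caption, since the real loss depends on the specific foundation model; the budget and step size are illustrative.

```python
import torch

def pgd_caption_attack(model, image, target_tokens, caption_loss,
                       eps=2 / 255, step=0.5 / 255, iters=100):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        loss = caption_loss(model, image + delta, target_tokens)
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()          # descend target loss
            delta.clamp_(-eps, eps)                    # imperceptibility ball
            delta.copy_((image + delta).clamp(0, 1) - image)  # valid pixels
        delta.grad.zero_()
    return (image + delta).detach()
```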

We Don’t Need No Adam, All We Need Is EVE: On The Variance of Dual Learning Rate And Beyond

  • paper_url: http://arxiv.org/abs/2308.10740
  • repo_url: https://github.com/akhadangi/EVE
  • paper_authors: Afshin Khadangi
  • for: Improving the performance and stability of deep learning model optimization.
  • methods: Applies different learning rates to distinct components of the gradients, and uses a momentum term that adapts to the loss landscape for finer control over the rate and direction of gradient descent.
  • results: Across a variety of datasets and models, EVE significantly outperforms existing optimization techniques in both performance and stability.
    Abstract In the rapidly advancing field of deep learning, optimising deep neural networks is paramount. This paper introduces a novel method, Enhanced Velocity Estimation (EVE), which innovatively applies different learning rates to distinct components of the gradients. By bifurcating the learning rate, EVE enables more nuanced control and faster convergence, addressing the challenges associated with traditional single learning rate approaches. Utilising a momentum term that adapts to the learning landscape, the method achieves a more efficient navigation of the complex loss surface, resulting in enhanced performance and stability. Extensive experiments demonstrate that EVE significantly outperforms existing optimisation techniques across various benchmark datasets and architectures.
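
Since the abstract leaves the gradient decomposition unspecified, the PyTorch sketch below is a heavily hedged illustration of the dual-learning-rate idea only: each gradient is split into a slow component (its exponential moving average, which doubles as an adaptive momentum term) and a fast residual, and the two components receive separate learning rates. This decomposition is our assumption, not EVE's actual update rule.

```python
import torch

class DualRateSGD(torch.optim.Optimizer):
    """Toy optimizer applying separate learning rates to two gradient parts."""
    def __init__(self, params, lr_slow=1e-2, lr_fast=1e-3, beta=0.9):
        super().__init__(params, dict(lr_slow=lr_slow, lr_fast=lr_fast,
                                      beta=beta))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                ema = state.setdefault("ema", torch.zeros_like(p))
                # slow component: EMA of past gradients (momentum-like)
                ema.mul_(group["beta"]).add_(p.grad, alpha=1 - group["beta"])
                residual = p.grad - ema            # fast, noisy component
                p.add_(ema, alpha=-group["lr_slow"])
                p.add_(residual, alpha=-group["lr_fast"])
```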

CoMIX: A Multi-agent Reinforcement Learning Training Architecture for Efficient Decentralized Coordination and Independent Decision Making

  • paper_url: http://arxiv.org/abs/2308.10721
  • repo_url: None
  • paper_authors: Giovanni Minelli, Mirco Musolesi
  • for: This work aims to improve coordination in multi-agent systems so that each agent can make decisions independently while still working with the others toward a common goal.
  • methods: It proposes Coordinated QMIX (CoMIX), a novel training framework for decentralized agents that models selfish and collaborative behavior as incremental steps in each agent's decision process, letting agents flexibly adapt their behavior to different situations.
  • results: Experiments across a variety of simulation environments show that CoMIX outperforms baselines on collaborative tasks, validating the incremental policy approach as an effective technique for improving coordination in multi-agent systems.
    Abstract Robust coordination skills enable agents to operate cohesively in shared environments, together towards a common goal and, ideally, individually without hindering each other's progress. To this end, this paper presents Coordinated QMIX (CoMIX), a novel training framework for decentralized agents that enables emergent coordination through flexible policies, allowing at the same time independent decision-making at individual level. CoMIX models selfish and collaborative behavior as incremental steps in each agent's decision process. This allows agents to dynamically adapt their behavior to different situations balancing independence and collaboration. Experiments using a variety of simulation environments demonstrate that CoMIX outperforms baselines on collaborative tasks. The results validate our incremental policy approach as effective technique for improving coordination in multi-agent systems.

On the accuracy of interpolation based on single-layer artificial neural networks

  • paper_url: http://arxiv.org/abs/2308.10720
  • repo_url: None
  • paper_authors: Ferdinando Auricchio, Maria Roberta Belardo, Francesco Calabrò, Ariel F. Pascaner
  • for: The paper considers single-hidden-layer ANNs with a simple feedforward architecture (shallow or two-layer networks), trained with the procedure known as the Extreme Learning Machine (ELM).
  • methods: It builds ANN interpolating functions on sets of sampling nodes and compares different node choices: equispaced, Chebychev, and randomly selected nodes.
  • results: The accuracy of a global interpolating polynomial improves with the number of nodes only when Chebychev nodes are used, whereas the error of the ANN interpolating function always decays, in most cases following the convergence observed for polynomials on Chebychev nodes regardless of which nodes are used for training.
    Abstract In the present paper, we consider one-hidden layer ANNs with a feedforward architecture, also referred to as shallow or two-layer networks, so that the structure is determined by the number and types of neurons. The determination of the parameters that define the function, called training, is done via the resolution of the approximation problem, that is, by imposing the interpolation through a set of specific nodes. We present the case where the parameters are trained using a procedure referred to as Extreme Learning Machine (ELM) that leads to a linear interpolation problem. Under these hypotheses, the existence of an ANN interpolating function is guaranteed. The focus is then on the accuracy of the interpolation outside of the given sampling interpolation nodes when they are the equispaced, the Chebychev, and the randomly selected ones. The study is motivated by the well-known bell-shaped Runge example, which makes it clear that the construction of a global interpolating polynomial is accurate only if trained on suitably chosen nodes, for example the Chebychev ones. In order to evaluate the behavior when growing the number of interpolation nodes, we raise the number of neurons in our network and compare it with the interpolating polynomial. We test using Runge's function and other well-known examples with different regularities. As expected, the accuracy of the approximation with a global polynomial increases only if the Chebychev nodes are considered. Instead, the error for the ANN interpolating function always decays, and in most cases we observe that the convergence follows what is observed in the polynomial case on Chebychev nodes, regardless of the set of nodes used for training.
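
The training procedure is simple enough to sketch end to end. Below is a minimal NumPy illustration of ELM-style interpolation of Runge's function on Chebychev nodes: the hidden-layer weights are drawn at random and only the linear output layer is fitted, so training reduces to a linear least-squares problem. The tanh activation and the node/neuron counts are illustrative choices.

```python
import numpy as np

runge = lambda x: 1.0 / (1.0 + 25.0 * x**2)

n = 40                                             # nodes = hidden neurons
k = np.arange(n)
x = np.cos((2 * k + 1) * np.pi / (2 * n))          # Chebychev nodes in [-1, 1]

rng = np.random.default_rng(0)
W, b = rng.normal(size=(1, n)), rng.normal(size=n)
H = np.tanh(x[:, None] * W + b)                    # hidden activations, n x n
beta = np.linalg.lstsq(H, runge(x), rcond=None)[0]  # linear "ELM training"

x_test = np.linspace(-1, 1, 1001)
err = np.abs(np.tanh(x_test[:, None] * W + b) @ beta - runge(x_test)).max()
print(f"max error away from the interpolation nodes: {err:.2e}")
```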

Sampling From Autoencoders’ Latent Space via Quantization And Probability Mass Function Concepts

  • paper_url: http://arxiv.org/abs/2308.10704
  • repo_url: None
  • paper_authors: Aymene Mohammed Bouayed, Adrian Iaccovelli, David Naccache
  • For: The paper focuses on sampling from the latent space of generative models built upon autoencoders, with the goal of generating lifelike images.
  • Methods: The proposed sampling algorithm is based on the concepts of probability mass functions and quantization: it establishes a vicinity around each latent vector from the input data and draws samples from these defined neighborhoods, so that the sampled latent vectors predominantly inhabit high-probability regions and can be effectively transformed into realistic images. The algorithm improves upon techniques based on Gaussian mixture models (GMM) by reducing the time complexity from $\mathcal{O}(n\times d \times k \times i)$ to $\mathcal{O}(n\times d)$.
  • Results: Experiments on several benchmark datasets demonstrate the algorithm's superior performance, with Fréchet inception distance (FID) improvements over GMM sampling of up to $0.89$ on MNIST, $1.69$ on CelebA, and $0.87$ on MOBIUS; the Wasserstein distance further shows that the method is effective at estimating latent space distributions.
    Abstract In this study, we focus on sampling from the latent space of generative models built upon autoencoders so that the reconstructed samples are lifelike images. To do so, we introduce a novel post-training sampling algorithm rooted in the concept of probability mass functions, coupled with a quantization process. Our proposed algorithm establishes a vicinity around each latent vector from the input data and then proceeds to draw samples from these defined neighborhoods. This strategic approach ensures that the sampled latent vectors predominantly inhabit high-probability regions, which, in turn, can be effectively transformed into authentic real-world images. A noteworthy point of comparison for our sampling algorithm is the sampling technique based on Gaussian mixture models (GMM), owing to its inherent capability to represent clusters. Remarkably, we manage to improve the time complexity from the previous $\mathcal{O}(n\times d \times k \times i)$ associated with GMM sampling to a much more streamlined $\mathcal{O}(n\times d)$, thereby resulting in substantial speedup during runtime. Moreover, our experimental results, gauged through the Fr\'echet inception distance (FID) for image generation, underscore the superior performance of our sampling algorithm across a diverse range of models and datasets. On the MNIST benchmark dataset, our approach outperforms GMM sampling by yielding a noteworthy improvement of up to $0.89$ in FID value. Furthermore, when it comes to generating images of faces and ocular images, our approach showcases substantial enhancements with FID improvements of $1.69$ and $0.87$ respectively, as compared to GMM sampling, as evidenced on the CelebA and MOBIUS datasets. Lastly, we substantiate our methodology's efficacy in estimating latent space distributions in contrast to GMM sampling, particularly through the lens of the Wasserstein distance.
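
A minimal NumPy sketch of sampling via quantization and a probability mass function: latent vectors are quantized per dimension, an empirical PMF over the occupied cells decides which region to anchor on, and a sample is drawn uniformly inside that cell's vicinity. The bin count and the uniform in-cell perturbation are illustrative assumptions; note the per-sample cost is O(d) once the PMF is built, i.e. O(n*d) overall.

```python
import numpy as np

def fit_pmf(latents, n_bins=32):
    """Quantize latents per dimension and build an empirical cell PMF."""
    lo, hi = latents.min(0), latents.max(0)
    width = (hi - lo) / n_bins
    safe = np.where(width > 0, width, 1.0)          # guard constant dims
    cells = np.floor((latents - lo) / safe).astype(int)
    keys, counts = np.unique(cells, axis=0, return_counts=True)
    return keys, counts / counts.sum(), lo, safe

def sample(keys, pmf, lo, width, rng):
    cell = keys[rng.choice(len(keys), p=pmf)]       # high-probability cell
    return lo + (cell + rng.uniform(size=cell.shape)) * width

rng = np.random.default_rng(0)
latents = rng.normal(size=(5000, 8))   # stand-in for encoder outputs
z = sample(*fit_pmf(latents), rng)     # pass z through the decoder for an image
```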

Refashioning Emotion Recognition Modelling: The Advent of Generalised Large Models

  • paper_url: http://arxiv.org/abs/2308.11578
  • repo_url: None
  • paper_authors: Zixing Zhang, Liyizhe Peng, Tao Pang, Jing Han, Huan Zhao, Bjorn W. Schuller
  • for: The paper explores how large language models perform on emotion recognition (affective computing), examining aspects such as in-context learning, few-shot learning, accuracy, generalisation, and explanation.
  • methods: It applies LLMs such as ChatGPT to emotion recognition tasks and comprehensively investigates their behaviour from these diverse angles.
  • results: The study provides insights into the strengths and weaknesses of LLMs for emotion recognition and poses open challenges, aiming to ignite broader discussion about enhancing emotion recognition in the era of advanced, generalised large models.
    Abstract After the inception of emotion recognition or affective computing, it has increasingly become an active research topic due to its broad applications. Over the past couple of decades, emotion recognition models have gradually migrated from statistically shallow models to neural network-based deep models, which can significantly boost the performance of emotion recognition models and consistently achieve the best results on different benchmarks. Therefore, in recent years, deep models have always been considered the first option for emotion recognition. However, the debut of large language models (LLMs), such as ChatGPT, has remarkably astonished the world due to their emerged capabilities of zero/few-shot learning, in-context learning, chain-of-thought, and others that are never shown in previous deep models. In the present paper, we comprehensively investigate how the LLMs perform in emotion recognition in terms of diverse aspects, including in-context learning, few-shot learning, accuracy, generalisation, and explanation. Moreover, we offer some insights and pose other potential challenges, hoping to ignite broader discussions about enhancing emotion recognition in the new era of advanced and generalised large models.

Normative Conditional Reasoning as a Fragment of HOL

  • paper_url: http://arxiv.org/abs/2308.10686
  • repo_url: None
  • paper_authors: Xavier Parent, Christoph Benzmüller
  • for: The paper concerns the mechanization of normative (preference-based) conditional reasoning, focusing on Aqvist's system E for conditional obligation and its extensions.
  • methods: The mechanization is achieved via a shallow semantical embedding in Isabelle/HOL.
  • results: The framework can be used for automated verification of deontic correspondences and related meta-reasoning, analogous to what was previously achieved for the modal logic cube, and for assessing ethical arguments, illustrated by a computer encoding of Parfit's repugnant conclusion in population ethics.
    Abstract We report some results regarding the mechanization of normative (preference-based) conditional reasoning. Our focus is on Aqvist's system E for conditional obligation (and its extensions). Our mechanization is achieved via a shallow semantical embedding in Isabelle/HOL. We consider two possible uses of the framework. The first one is as a tool for meta-reasoning about the considered logic. We employ it for the automated verification of deontic correspondences (broadly conceived) and related matters, analogous to what has been previously achieved for the modal logic cube. The second use is as a tool for assessing ethical arguments. We provide a computer encoding of a well-known paradox in population ethics, Parfit's repugnant conclusion. Whether the presented encoding increases or decreases the attractiveness and persuasiveness of the repugnant conclusion is a question we would like to pass on to philosophy and ethics.

Visual Crowd Analysis: Open Research Problems

  • paper_url: http://arxiv.org/abs/2308.10677
  • repo_url: None
  • paper_authors: Muhammad Asif Khan, Hamid Menouar, Ridha Hamila
  • for: The paper surveys recent advances and open challenges in automated crowd monitoring, covering six major areas of visual crowd analysis.
  • methods: It reviews modern deep-learning approaches, categorizing the works in each area in an intuitive way and highlighting the key developments.
  • results: It presents the latest breakthroughs, carefully selecting prominent works with significant contributions in novelty or performance gains, and outlines the crucial unresolved issues that future work must tackle for the field to keep progressing.
    Abstract Over the last decade, there has been a remarkable surge in interest in automated crowd monitoring within the computer vision community. Modern deep-learning approaches have made it possible to develop fully-automated vision-based crowd-monitoring applications. However, despite the magnitude of the issue at hand, the significant technological advancements, and the consistent interest of the research community, there are still numerous challenges that need to be overcome. In this article, we delve into six major areas of visual crowd analysis, emphasizing the key developments in each of these areas. We outline the crucial unresolved issues that must be tackled in future works, in order to ensure that the field of automated crowd monitoring continues to progress and thrive. Several surveys related to this topic have been conducted in the past. Nonetheless, this article thoroughly examines and presents a more intuitive categorization of works, while also depicting the latest breakthroughs within the field, incorporating more recent studies carried out within the last few years in a concise manner. By carefully choosing prominent works with significant contributions in terms of novelty or performance gains, this paper presents a more comprehensive exposition of advancements in the current state-of-the-art.

A Safe Deep Reinforcement Learning Approach for Energy Efficient Federated Learning in Wireless Communication Networks

  • paper_url: http://arxiv.org/abs/2308.10664
  • repo_url: None
  • paper_authors: Nikolaos Koursioumpas, Lina Magoula, Nikolaos Petropouleas, Alexandros-Ioannis Thanopoulos, Theodora Panagea, Nancy Alonistioti, M. A. Gutierrez-Estevez, Ramin Khalili
  • for: Addressing the environmental impact of AI-enabled wireless networks, where Federated Learning (FL) has emerged as a key privacy-preserving decentralized AI technique whose energy footprint is still an open problem.
  • methods: Proposes orchestrating the computational and communication resources of the devices involved in an FL process to minimize the total energy consumption while guaranteeing a certain model performance, using a Soft Actor-Critic deep reinforcement learning solution in which a penalty function introduced during training penalizes strategies that violate the environment's constraints, ensuring a safe RL process.
  • results: Achieves a reduction of up to 94% in total energy consumption compared with four state-of-the-art baseline solutions.
    Abstract Progressing towards a new era of Artificial Intelligence (AI)-enabled wireless networks, concerns regarding the environmental impact of AI have been raised both in industry and academia. Federated Learning (FL) has emerged as a key privacy preserving decentralized AI technique. Despite efforts currently being made in FL, its environmental impact is still an open problem. Targeting the minimization of the overall energy consumption of an FL process, we propose the orchestration of computational and communication resources of the involved devices to minimize the total energy required, while guaranteeing a certain performance of the model. To this end, we propose a Soft Actor Critic Deep Reinforcement Learning (DRL) solution, where a penalty function is introduced during training, penalizing the strategies that violate the constraints of the environment, and ensuring a safe RL process. A device level synchronization method, along with a computationally cost effective FL environment are proposed, with the goal of further reducing the energy consumption and communication overhead. Evaluation results show the effectiveness of the proposed scheme compared to four state-of-the-art baseline solutions in both static and dynamic environments, achieving a decrease of up to 94% in the total energy consumption.
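
The constraint-handling idea can be shown in a few lines. Below is a minimal sketch of a penalised reward of the kind a safe-RL formulation can use: the agent is rewarded for saving energy, and a penalty fires whenever a performance constraint (here, a target model accuracy) is violated. The linear penalty shape and the coefficient are illustrative, not the paper's exact penalty function.

```python
def penalised_reward(energy_joules: float, accuracy: float,
                     target_accuracy: float = 0.9,
                     penalty_coeff: float = 10.0) -> float:
    reward = -energy_joules                        # minimise energy use
    violation = max(0.0, target_accuracy - accuracy)
    return reward - penalty_coeff * violation      # safety/constraint penalty

print(penalised_reward(energy_joules=3.2, accuracy=0.85))  # ≈ -3.7
```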

Deep Evidential Learning for Bayesian Quantile Regression

  • paper_url: http://arxiv.org/abs/2308.10650
  • repo_url: None
  • paper_authors: Frederik Boe Hüttel, Filipe Rodrigues, Francisco Câmara Pereira
  • For: The paper proposes a deep Bayesian quantile regression model for estimating the quantiles of a continuous target distribution without assuming a Gaussian distribution.
  • Methods: The proposed method is based on evidential learning, which allows the model to capture aleatoric and epistemic uncertainty with a single deterministic forward-pass model, making it efficient and scalable to large models and datasets.
  • Results: The method achieves calibrated uncertainties on non-Gaussian distributions, disentanglement of aleatoric and epistemic uncertainty, and robustness to out-of-distribution samples.
    Abstract It is desirable to have accurate uncertainty estimation from a single deterministic forward-pass model, as traditional methods for uncertainty quantification are computationally expensive. However, this is difficult because single forward-pass models do not sample weights during inference and often make assumptions about the target distribution, such as assuming it is Gaussian. This can be restrictive in regression tasks, where the mean and standard deviation are inadequate to model the target distribution accurately. This paper proposes a deep Bayesian quantile regression model that can estimate the quantiles of a continuous target distribution without the Gaussian assumption. The proposed method is based on evidential learning, which allows the model to capture aleatoric and epistemic uncertainty with a single deterministic forward-pass model. This makes the method efficient and scalable to large models and datasets. We demonstrate that the proposed method achieves calibrated uncertainties on non-Gaussian distributions, disentanglement of aleatoric and epistemic uncertainty, and robustness to out-of-distribution samples.
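
For readers unfamiliar with quantile regression, the sketch below shows the pinball (quantile) loss that such models minimise per quantile level tau: under-predictions are weighted by tau and over-predictions by (1 - tau). This illustrates the regression target only; the evidential parameterisation that yields aleatoric and epistemic uncertainty is specific to the paper.

```python
import torch

def pinball_loss(pred: torch.Tensor, target: torch.Tensor,
                 tau: float) -> torch.Tensor:
    diff = target - pred
    # tau * diff when diff > 0 (under-prediction), (tau - 1) * diff otherwise
    return torch.mean(torch.maximum(tau * diff, (tau - 1) * diff))

y_hat = torch.tensor([0.0, 0.0])
y = torch.tensor([1.0, -1.0])
print(pinball_loss(y_hat, y, tau=0.9))   # asymmetric: (0.9 + 0.1) / 2 = 0.5
```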

EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints

  • paper_url: http://arxiv.org/abs/2308.10648
  • repo_url: None
  • paper_authors: Yutao Chen, Xingning Dong, Tian Gan, Chunluan Zhou, Ming Yang, Qingpei Guo
  • for: Improving the effectiveness and efficiency of text-based video editing.
  • methods: Uses depth map guidance and temporal consistency constraints to achieve robust and efficient zero-shot text-based video editing.
  • results: Experiments show that EVE achieves a satisfactory trade-off between performance and efficiency, and a new benchmark, the ZVE-50 dataset, is constructed for fair comparison.
    Abstract Motivated by the superior performance of image diffusion models, more and more researchers strive to extend these models to the text-based video editing task. Nevertheless, current video editing tasks mainly suffer from the dilemma between the high fine-tuning cost and the limited generation capacity. Compared with images, we conjecture that videos necessitate more constraints to preserve the temporal consistency during editing. Towards this end, we propose EVE, a robust and efficient zero-shot video editing method. Under the guidance of depth maps and temporal consistency constraints, EVE derives satisfactory video editing results with an affordable computational and time cost. Moreover, recognizing the absence of a publicly available video editing dataset for fair comparisons, we construct a new benchmark ZVE-50 dataset. Through comprehensive experimentation, we validate that EVE could achieve a satisfactory trade-off between performance and efficiency. We will release our dataset and codebase to facilitate future researchers.

SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes

  • paper_url: http://arxiv.org/abs/2308.10638
  • repo_url: None
  • paper_authors: Soubhik Sanyal, Partha Ghosh, Jinlong Yang, Michael J. Black, Justus Thies, Timo Bolkart
  • for: The paper is written for generating 3D clothed human bodies with realistic texture and pose.
  • methods: The paper uses a deep neural network to learn the geometry and appearance distribution of clothed human bodies, using both 3D scan data and 2D image data. The network is trained in an unpaired manner, and the authors use attribute labels to alleviate entanglement between pose and clothing type, and pose and clothing appearance.
  • results: The paper presents SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans, along with the SCULPT dataset, and compares the results to state-of-the-art 3D generative models for clothed human bodies. The authors show that their method can generate highly realistic and diverse 3D clothed human bodies with realistic texture and pose.
    Abstract We present SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically, we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging, as datasets of textured 3D meshes for humans are limited in size and accessibility. Our key observation is that there exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image datasets of clothed humans and multiple appearances can be mapped to a single geometry. To effectively learn from the two data modalities, we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. Specifically, we learn a pose-dependent geometry space from 3D scan data. We represent this as per vertex displacements w.r.t. the SMPL model. Next, we train a geometry conditioned texture generator in an unsupervised way using the 2D image data. We use intermediate activations of the learned geometry model to condition our texture generator. To alleviate entanglement between pose and clothing type, and pose and clothing appearance, we condition both the texture and geometry generators with attribute labels such as clothing types for the geometry, and clothing colors for the texture generator. We automatically generated these conditioning labels for the 2D images based on the visual question answering model BLIP and CLIP. We validate our method on the SCULPT dataset, and compare to state-of-the-art 3D generative models for clothed human bodies. We will release the codebase for research purposes.

RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models

  • paper_url: http://arxiv.org/abs/2308.10633
  • repo_url: https://github.com/yhoshi3/ralle
  • paper_authors: Yasuto Hoshi, Daisuke Miyashita, Youyang Ng, Kento Tatsuno, Yasuhiro Morioka, Osamu Torii, Jun Deguchi
  • for: Improving the accuracy of factual question answering on knowledge-intensive tasks using retrieval-augmented large language models (R-LLMs), which combine pre-trained LLMs with information retrieval systems.
  • methods: RaLLe is an open-source framework for developing, evaluating, and optimizing R-LLMs; it offers transparent prompt engineering, fine-grained evaluation of individual inference processes such as retrieval and generation, and objective, quantitative measurement of overall system performance.
  • results: With these features, developers can improve the performance and accuracy of their R-LLMs on knowledge-intensive generation tasks.
    Abstract Retrieval-augmented large language models (R-LLMs) combine pre-trained large language models (LLMs) with information retrieval systems to improve the accuracy of factual question-answering. However, current libraries for building R-LLMs provide high-level abstractions without sufficient transparency for evaluating and optimizing prompts within specific inference processes such as retrieval and generation. To address this gap, we present RaLLe, an open-source framework designed to facilitate the development, evaluation, and optimization of R-LLMs for knowledge-intensive tasks. With RaLLe, developers can easily develop and evaluate R-LLMs, improving hand-crafted prompts, assessing individual inference processes, and objectively measuring overall system performance quantitatively. By leveraging these features, developers can enhance the performance and accuracy of their R-LLMs in knowledge-intensive generation tasks. We open-source our code at https://github.com/yhoshi3/RaLLe.
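
A minimal sketch of the retrieve-then-generate pipeline that frameworks like this decompose and evaluate. `generate` is a hypothetical stand-in for an LLM call and the toy lexical retriever is ours; the point is that each stage is a separate, individually inspectable step, which is what makes stage-level evaluation and prompt optimisation possible.

```python
def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # toy lexical retriever: rank passages by word overlap with the query
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: -len(words & set(p.lower().split())))
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using the context.\nContext:\n{context}\nQuestion: {query}"

def answer(query: str, corpus: list[str], generate) -> str:
    passages = retrieve(query, corpus)       # stage 1: evaluable on its own
    prompt = build_prompt(query, passages)   # stage 2: the prompt under test
    return generate(prompt)                  # stage 3: LLM generation
```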

Weighting by Tying: A New Approach to Weighted Rank Correlation

  • paper_url: http://arxiv.org/abs/2308.10622
  • repo_url: None
  • paper_authors: Sascha Henzgen, Eyke Hüllermeier
  • for: The paper studies weighted rank correlation measures, which capture the degree of concordance between two orderings of the same set of items while allowing some positions (typically the top ones) to matter more than others.
  • methods: It proposes a weighted rank correlation measure based on fuzzy order relations, called scaled gamma and related to Goodman and Kruskal's gamma; the measure is parametrized by a fuzzy equivalence relation on the rank positions, specified conveniently by a so-called scaling function, which gives it a sound formal foundation.
  • results: The resulting measure combines soundness with flexibility, allowing rank positions to be weighed in a flexible, application-dependent way.
    Abstract Measures of rank correlation are commonly used in statistics to capture the degree of concordance between two orderings of the same set of items. Standard measures like Kendall's tau and Spearman's rho coefficient put equal emphasis on each position of a ranking. Yet, motivated by applications in which some of the positions (typically those on the top) are more important than others, a few weighted variants of these measures have been proposed. Most of these generalizations fail to meet desirable formal properties, however. Besides, they are often quite inflexible in the sense of committing to a fixed weighing scheme. In this paper, we propose a weighted rank correlation measure on the basis of fuzzy order relations. Our measure, called scaled gamma, is related to Goodman and Kruskal's gamma rank correlation. It is parametrized by a fuzzy equivalence relation on the rank positions, which in turn is specified conveniently by a so-called scaling function. This approach combines soundness with flexibility: it has a sound formal foundation and allows for weighing rank positions in a flexible way.
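
A minimal NumPy sketch of a position-weighted gamma in the spirit described above: concordant and discordant item pairs are counted as in Goodman and Kruskal's gamma, but each pair is weighted by the importance of its rank positions. The exponential-decay weight below plays the role of a scaling function emphasising top positions; the paper's fuzzy-equivalence construction is more general than this toy rule.

```python
import numpy as np

def weighted_gamma(rank_a, rank_b, scale=5.0):
    """rank_a, rank_b: 0-indexed rank positions of the same items."""
    n, C, D = len(rank_a), 0.0, 0.0
    w = np.exp(-np.arange(n) / scale)           # top positions matter most
    for i in range(n):
        for j in range(i + 1, n):
            pair_w = w[min(rank_a[i], rank_a[j])]   # weight by best position
            s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            if s > 0:
                C += pair_w                     # concordant pair
            elif s < 0:
                D += pair_w                     # discordant pair
    return (C - D) / (C + D)

print(weighted_gamma([0, 1, 2, 3], [0, 1, 3, 2]))  # ~0.75: swap is near bottom
```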

Large Language Models for Software Engineering: A Systematic Literature Review

  • paper_url: http://arxiv.org/abs/2308.10620
  • repo_url: None
  • paper_authors: Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, Haoyu Wang
  • for: The survey aims to understand how large language models (LLMs) can be exploited in software engineering (SE) to optimize processes and outcomes.
  • methods: It systematically reviews 229 research papers from 2017 to 2023 to answer four key research questions (RQs) covering the LLMs employed in SE tasks, the methods for data collection, preprocessing, and application, the SE tasks where LLMs have shown success, and the strategies for optimizing and evaluating LLM performance, including prompt optimization.
  • results: The review sketches the current state of the art, pinpointing trends, identifying gaps in existing research, and flagging promising areas for future study.
    Abstract Large Language Models (LLMs) have significantly impacted numerous domains, notably including Software Engineering (SE). Nevertheless, a well-rounded understanding of the application, effects, and possible limitations of LLMs within SE is still in its early stages. To bridge this gap, our systematic literature review takes a deep dive into the intersection of LLMs and SE, with a particular focus on understanding how LLMs can be exploited in SE to optimize processes and outcomes. Through a comprehensive review approach, we collect and analyze a total of 229 research papers from 2017 to 2023 to answer four key research questions (RQs). In RQ1, we categorize and provide a comparative analysis of different LLMs that have been employed in SE tasks, laying out their distinctive features and uses. For RQ2, we detail the methods involved in data collection, preprocessing, and application in this realm, shedding light on the critical role of robust, well-curated datasets for successful LLM implementation. RQ3 allows us to examine the specific SE tasks where LLMs have shown remarkable success, illuminating their practical contributions to the field. Finally, RQ4 investigates the strategies employed to optimize and evaluate the performance of LLMs in SE, as well as the common techniques related to prompt optimization. Armed with insights drawn from addressing the aforementioned RQs, we sketch a picture of the current state-of-the-art, pinpointing trends, identifying gaps in existing research, and flagging promising areas for future study.

BackTrack: Robust template update via Backward Tracking of candidate template

  • paper_url: http://arxiv.org/abs/2308.10604
  • repo_url: None
  • paper_authors: Dongwook Lee, Wonjun Choi, Seohyung Lee, ByungIn Yoo, Eunho Yang, Seongju Hwang
  • for: The paper proposes a robust and reliable template update method for visual object tracking, addressing appearance variations such as deformations, illumination changes, and occlusion that otherwise cause model drift.
  • methods: BackTrack quantifies the confidence of a candidate template by tracking it backward over past frames, so the template is updated with a reliable candidate at the right time while unreliable candidates are rejected; the scheme is generic and applicable to any template-based tracker.
  • results: Extensive experiments show that BackTrack outperforms existing template update algorithms and achieves SOTA performance on various tracking benchmarks.
    Abstract Variations of target appearance such as deformations, illumination variance, occlusion, etc., are the major challenges of visual object tracking that negatively impact the performance of a tracker. An effective method to tackle these challenges is template update, which updates the template to reflect the change of appearance in the target object during tracking. However, with template updates, inadequate quality of new templates or inappropriate timing of updates may induce a model drift problem, which severely degrades the tracking performance. Here, we propose BackTrack, a robust and reliable method to quantify the confidence of the candidate template by backward tracking it on the past frames. Based on the confidence score of candidates from BackTrack, we can update the template with a reliable candidate at the right time while rejecting unreliable candidates. BackTrack is a generic template update scheme and is applicable to any template-based trackers. Extensive experiments on various tracking benchmarks verify the effectiveness of BackTrack over existing template update algorithms, as it achieves SOTA performance on various tracking benchmarks.
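
A minimal sketch of the backward-tracking idea: a candidate template is verified by running the tracker backwards over the last few frames and checking whether it re-finds the positions already produced forwards. `track_one_step(template, frame)` is a hypothetical hook standing in for any template-based tracker, and IoU as the agreement measure plus the acceptance threshold are our assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def backtrack_confidence(candidate, past_frames, past_boxes, track_one_step):
    """Average agreement between backward tracking and the recorded boxes."""
    scores = []
    for frame, box in zip(reversed(past_frames), reversed(past_boxes)):
        pred = track_one_step(candidate, frame)   # track the candidate back
        scores.append(iou(pred, box))
    return sum(scores) / len(scores)

# update policy: accept the candidate only if backward tracking confirms it
# if backtrack_confidence(cand, frames, boxes, tracker) > 0.6: template = cand
```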

Age Recommendation from Texts and Sentences for Children

  • paper_url: http://arxiv.org/abs/2308.10586
  • repo_url: None
  • paper_authors: Rashedur Rahman, Gwénolé Lecorvé, Nicolas Béchet
  • for: The paper aims to automatically predict a recommended reading age from texts or sentences, so that adequate texts can be proposed to children and authors can be helped to write in the most appropriate way.
  • methods: Age recommendation is treated as a regression task; the paper discusses appropriate evaluation metrics, studies state-of-the-art machine learning models, namely Transformers, compares them with models from the literature and with expert recommendations, and conducts a preliminary explainability analysis of various linguistic features.
  • results: On a dataset of 3,673 French texts (132K sentences, 2.5M words), the best models achieve MAE scores of 0.98 at the text level and 1.83 at the sentence level on the test set; the sentence-level model scores similarly to the experts, while the text-level model outperforms the experts by an MAE of 1.48.
    Abstract Children have less text understanding capability than adults. Moreover, this capability differs among the children of different ages. Hence, automatically predicting a recommended age based on texts or sentences would be a great benefit to propose adequate texts to children and to help authors writing in the most appropriate way. This paper presents our recent advances on the age recommendation task. We consider age recommendation as a regression task, and discuss the need for appropriate evaluation metrics, study the use of state-of-the-art machine learning model, namely Transformers, and compare it to different models coming from the literature. Our results are also compared with recommendations made by experts. Further, this paper deals with preliminary explainability of the age prediction model by analyzing various linguistic features. We conduct the experiments on a dataset of 3, 673 French texts (132K sentences, 2.5M words). To recommend age at the text level and sentence level, our best models achieve MAE scores of 0.98 and 1.83 respectively on the test set. Also, compared to the recommendations made by experts, our sentence-level recommendation model gets a similar score to the experts, while the text-level recommendation model outperforms the experts by an MAE score of 1.48.

Pseudo-online framework for BCI evaluation: A MOABB perspective

  • paper_url: http://arxiv.org/abs/2308.11656
  • repo_url: None
  • paper_authors: Igor Carrara, Théodore Papadopoulo
  • for: To extend the existing MOABB framework, which operates in offline mode, so that different algorithms can be compared in a pseudo-online setting using a technology based on overlapping sliding windows.
  • methods: Offline EEG data are processed as if received in real time; an idle-state event is introduced into the dataset to account for all the possibilities that are not task thinking, and algorithm performance is validated with the normalized Matthews Correlation Coefficient (nMCC) and the Information Transfer Rate (ITR).
  • results: The state-of-the-art algorithms of the last 15 years are analyzed over several motor imagery (MI) datasets with several subjects, showing the differences between the two approaches from a statistical point of view.
    Abstract Objective: BCI (Brain-Computer Interface) technology operates in three modes: online, offline, and pseudo-online. In the online mode, real-time EEG data is constantly analyzed. In offline mode, the signal is acquired and processed afterwards. The pseudo-online mode processes collected data as if they were received in real-time. The main difference is that the offline mode often analyzes the whole data, while the online and pseudo-online modes only analyze data in short time windows. Offline analysis is usually done with asynchronous BCIs, which restricts analysis to predefined time windows. Asynchronous BCI, compatible with online and pseudo-online modes, allows flexible mental activity duration. Offline processing tends to be more accurate, while online analysis is better for therapeutic applications. Pseudo-online implementation approximates online processing without real-time constraints. Many BCI studies being offline introduce biases compared to real-life scenarios, impacting classification algorithm performance. Approach: The objective of this research paper is therefore to extend the current MOABB framework, operating in offline mode, so as to allow a comparison of different algorithms in a pseudo-online setting with the use of a technology based on overlapping sliding windows. To do this will require the introduction of a idle state event in the dataset that takes into account all different possibilities that are not task thinking. To validate the performance of the algorithms we will use the normalized Matthews Correlation Coefficient (nMCC) and the Information Transfer Rate (ITR). Main results: We analyzed the state-of-the-art algorithms of the last 15 years over several Motor Imagery (MI) datasets composed by several subjects, showing the differences between the two approaches from a statistical point of view. Significance: The ability to analyze the performance of different algorithms in offline and pseudo-online modes will allow the BCI community to obtain more accurate and comprehensive reports regarding the performance of classification algorithms.
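
A minimal NumPy sketch of the pseudo-online evaluation ingredients: a continuous EEG record is cut into overlapping sliding windows, an idle label (0) is kept for spans with no task event, and the normalized Matthews Correlation Coefficient, nMCC = (MCC + 1) / 2, scores the predictions. The window/step sizes and the majority-label rule for windows are illustrative choices, not MOABB's exact implementation.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def sliding_windows(signal, labels, win=256, step=64):
    """signal: (n_channels, n_samples); labels: per-sample class, 0 = idle."""
    X, y = [], []
    for start in range(0, signal.shape[1] - win + 1, step):
        X.append(signal[:, start:start + win])
        # label each window by its majority sample label (idle if no event)
        vals, counts = np.unique(labels[start:start + win], return_counts=True)
        y.append(vals[np.argmax(counts)])
    return np.stack(X), np.array(y)

def nmcc(y_true, y_pred):
    return (matthews_corrcoef(y_true, y_pred) + 1.0) / 2.0
```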

Overcoming Overconfidence for Active Learning

  • paper_url: http://arxiv.org/abs/2308.10571
  • repo_url: None
  • paper_authors: Yujin Hwang, Won Jo, Juyoung Hong, Yukyung Choi
  • for: The paper addresses the overconfidence problem that arises in active learning scenarios, where models trained on little data per iteration yield overconfident predictions.
  • methods: Two novel methods are proposed: Cross-Mix-and-Mix (CMaM), an augmentation strategy that calibrates the model by expanding the limited training distribution, and Ranked Margin Sampling (RankedMS), a selection strategy that prevents choosing data that leads to overly confident predictions.
  • results: Various experiments and analyses demonstrate that the proposals enable efficient data selection by alleviating overconfidence, while remaining readily applicable.
    Abstract It is not an exaggeration to say that the recent progress in artificial intelligence technology depends on large-scale and high-quality data. Simultaneously, a prevalent issue exists everywhere: the budget for data labeling is constrained. Active learning is a prominent approach for addressing this issue, where valuable data for labeling is selected through a model and utilized to iteratively adjust the model. However, due to the limited amount of data in each iteration, the model is vulnerable to bias; thus, it is more likely to yield overconfident predictions. In this paper, we present two novel methods to address the problem of overconfidence that arises in the active learning scenario. The first is an augmentation strategy named Cross-Mix-and-Mix (CMaM), which aims to calibrate the model by expanding the limited training distribution. The second is a selection strategy named Ranked Margin Sampling (RankedMS), which prevents choosing data that leads to overly confident predictions. Through various experiments and analyses, we are able to demonstrate that our proposals facilitate efficient data selection by alleviating overconfidence, even though they are readily applicable.
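
A minimal NumPy sketch of margin-based selection in the spirit of Ranked Margin Sampling: the margin between the top-2 predicted probabilities is computed per unlabeled example, examples are ranked by it, and the most ambiguous (smallest-margin) ones are queried, which naturally skips overconfident predictions. The exact ranking rule of RankedMS is defined in the paper; this shows only the generic idea.

```python
import numpy as np

def select_by_margin(probs: np.ndarray, budget: int) -> np.ndarray:
    """probs: (n_samples, n_classes) softmax outputs of the current model."""
    top2 = np.sort(probs, axis=1)[:, -2:]          # two largest per row
    margin = top2[:, 1] - top2[:, 0]               # small = ambiguous
    return np.argsort(margin)[:budget]             # indices to label next

probs = np.array([[0.98, 0.01, 0.01],   # overconfident -> not selected
                  [0.40, 0.35, 0.25],   # ambiguous     -> selected
                  [0.55, 0.40, 0.05]])
print(select_by_margin(probs, budget=2))   # [1 2]
```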

Metaverse: A Vision, Architectural Elements, and Future Directions for Scalable and Realtime Virtual Worlds

  • paper_url: http://arxiv.org/abs/2308.10559
  • repo_url: None
  • paper_authors: Leila Ismail, Rajkumar Buyya
  • for: The paper lays out the requirements for realizing and developing the Metaverse.
  • methods: It traces the temporal evolution of Metaverse definitions, captures their evolving requirements, and derives architectural elements for scalable, reliable, and efficient Metaverse systems together with a classification of existing applications.
  • results: It provides insights into Metaverse requirements and enabling technologies, architectural elements for scalable, reliable, and efficient systems, a classification of existing Metaverse applications, and proposed directions for future research.
    Abstract With the emergence of Cloud computing, Internet of Things-enabled Human-Computer Interfaces, Generative Artificial Intelligence, and high-accurate Machine and Deep-learning recognition and predictive models, along with the Post Covid-19 proliferation of social networking, and remote communications, the Metaverse gained a lot of popularity. Metaverse has the prospective to extend the physical world using virtual and augmented reality so the users can interact seamlessly with the real and virtual worlds using avatars and holograms. It has the potential to impact people in the way they interact on social media, collaborate in their work, perform marketing and business, teach, learn, and even access personalized healthcare. Several works in the literature examine Metaverse in terms of hardware wearable devices, and virtual reality gaming applications. However, the requirements of realizing the Metaverse in realtime and at a large-scale need yet to be examined for the technology to be usable. To address this limitation, this paper presents the temporal evolution of Metaverse definitions and captures its evolving requirements. Consequently, we provide insights into Metaverse requirements. In addition to enabling technologies, we lay out architectural elements for scalable, reliable, and efficient Metaverse systems, and a classification of existing Metaverse applications along with proposing required future research directions.

KGrEaT: A Framework to Evaluate Knowledge Graphs via Downstream Tasks

  • paper_url: http://arxiv.org/abs/2308.10537
  • repo_url: None
  • paper_authors: Nicolas Heist, Sven Hertling, Heiko Paulheim
  • for: This research paper aims to evaluate the quality of knowledge graphs (KGs) for downstream tasks, rather than just their correctness and completeness.
  • methods: The paper presents a framework called KGrEaT, which stands for Knowledge Graph Evaluation via Actual Tasks. KGrEaT maps a given KG to datasets for evaluation on various tasks and computes performance metrics for each task.
  • results: The paper shows that KGrEaT can be used to evaluate KGs on a fixed task setup, providing a more comprehensive assessment of their quality than traditional evaluation metrics. Additionally, KGrEaT is modular and can be easily extended with additional tasks and datasets.
    Abstract In recent years, countless research papers have addressed the topics of knowledge graph creation, extension, or completion in order to create knowledge graphs that are larger, more correct, or more diverse. This research is typically motivated by the argumentation that using such enhanced knowledge graphs to solve downstream tasks will improve performance. Nonetheless, this is hardly ever evaluated. Instead, the predominant evaluation metrics - aiming at correctness and completeness - are undoubtedly valuable but fail to capture the complete picture, i.e., how useful the created or enhanced knowledge graph actually is. Further, the accessibility of such a knowledge graph is rarely considered (e.g., whether it contains expressive labels, descriptions, and sufficient context information to link textual mentions to the entities of the knowledge graph). To better judge how well knowledge graphs perform on actual tasks, we present KGrEaT - a framework to estimate the quality of knowledge graphs via actual downstream tasks like classification, clustering, or recommendation. Instead of comparing different methods of processing knowledge graphs with respect to a single task, the purpose of KGrEaT is to compare various knowledge graphs as such by evaluating them on a fixed task setup. The framework takes a knowledge graph as input, automatically maps it to the datasets to be evaluated on, and computes performance metrics for the defined tasks. It is built in a modular way to be easily extendable with additional tasks and datasets.
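The framework's value lies in its fixed task setup: every candidate knowledge graph is pushed through the same battery of tasks behind a common interface, so scores are comparable across KGs. A minimal Python sketch of what such a modular plug-in interface might look like; the class and method names are illustrative assumptions, not the actual KGrEaT API:

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Task(ABC):
    """One downstream task (classification, clustering, recommendation, ...)."""

    @abstractmethod
    def evaluate(self, entity_embeddings: Dict[str, List[float]]) -> Dict[str, float]:
        """Return a mapping from metric name to score for this task."""


class ClassificationTask(Task):
    def __init__(self, labeled_entities: Dict[str, int]):
        self.labeled_entities = labeled_entities  # KG entity -> class label

    def evaluate(self, entity_embeddings):
        # In a real framework: train a classifier on the embeddings of the
        # entities that could be mapped to this dataset, and report accuracy.
        mapped = [e for e in self.labeled_entities if e in entity_embeddings]
        coverage = len(mapped) / max(len(self.labeled_entities), 1)
        return {"coverage": coverage, "accuracy": float("nan")}  # placeholder


def run_benchmark(entity_embeddings, tasks: List[Task]):
    """Fixed task setup: every input KG is scored on the same task list."""
    return {type(t).__name__: t.evaluate(entity_embeddings) for t in tasks}


print(run_benchmark({"Berlin": [0.1, 0.2]}, [ClassificationTask({"Berlin": 0})]))
```

The coverage metric also reflects the accessibility point raised in the abstract: a KG only scores well if its entities can actually be mapped to the task datasets.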

Dataset Quantization

  • paper_url: http://arxiv.org/abs/2308.10524
  • repo_url: https://github.com/magic-research/dataset_quantization
  • paper_authors: Daquan Zhou, Kai Wang, Jianyang Gu, Xiangyu Peng, Dongze Lian, Yifan Zhang, Yang You, Jiashi Feng
  • for: This paper proposes Dataset Quantization (DQ), a new framework that compresses large-scale datasets into small subsets usable for training arbitrary neural network architectures.
  • methods: Unlike dataset distillation methods based on gradient matching, whose synthesized data is coupled to a specific network architecture, DQ selects and compresses subsets that remain architecture-agnostic.
  • results: Experiments show that DQ can condense large-scale datasets such as ImageNet-1k into small subsets while preserving model performance, and that the condensed data can train unseen architectures; with 60% of ImageNet and 20% of Alpaca's instruction-tuning data, models train with negligible or no performance drop.
    Abstract State-of-the-art deep neural networks are trained with large amounts (millions or even billions) of data. The expensive computation and memory costs make it difficult to train them on limited hardware resources, especially for recent popular large language models (LLM) and computer vision models (CV). Recent popular dataset distillation methods are thus developed, aiming to reduce the number of training samples via synthesizing small-scale datasets via gradient matching. However, as the gradient calculation is coupled with the specific network architecture, the synthesized dataset is biased and performs poorly when used for training unseen architectures. To address these limitations, we present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets which can be used for training any neural network architectures. Extensive experiments demonstrate that DQ is able to generate condensed small datasets for training unseen network architectures with state-of-the-art compression ratios for lossless model training. To the best of our knowledge, DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio. Notably, with 60% data from ImageNet and 20% data from Alpaca's instruction tuning data, the models can be trained with negligible or no performance drop for both vision tasks (including classification, semantic segmentation, and object detection) as well as language tasks (including instruction tuning tasks such as BBH and DROP).

When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection

  • paper_url: http://arxiv.org/abs/2308.10523
  • repo_url: https://github.com/pilot-vd-2023/pilot
  • paper_authors: Xin-Cheng Wen, Xinchen Wang, Cuiyun Gao, Shaohua Wang, Yang Liu, Zhaoquan Gu
  • for: This work targets automated code vulnerability detection, in particular the unreliable negative labels that limit deep learning approaches.
  • methods: The paper frames the task as a Positive and Unlabeled (PU) learning problem and proposes PILOT (PositIve and unlabeled Learning mOdel for vulnerability deTection), which learns from positive and unlabeled data only, combining a distance-aware label selection module with a mixed-supervision representation learning module.
  • results: PILOT achieves more accurate vulnerability detection from positive and unlabeled data and better mitigates the influence of label noise.
    Abstract Automated code vulnerability detection has gained increasing attention in recent years. The deep learning (DL)-based methods, which implicitly learn vulnerable code patterns, have proven effective in vulnerability detection. The performance of DL-based methods usually relies on the quantity and quality of labeled data. However, the current labeled data are generally automatically collected, such as crawled from human-generated commits, making it hard to ensure the quality of the labels. Prior studies have demonstrated that the non-vulnerable code (i.e., negative labels) tends to be unreliable in commonly-used datasets, while vulnerable code (i.e., positive labels) is more determined. Considering the large numbers of unlabeled data in practice, it is necessary and worth exploring to leverage the positive data and large numbers of unlabeled data for more accurate vulnerability detection. In this paper, we focus on the Positive and Unlabeled (PU) learning problem for vulnerability detection and propose a novel model named PILOT, i.e., PositIve and unlabeled Learning mOdel for vulnerability deTection. PILOT only learns from positive and unlabeled data for vulnerability detection. It mainly contains two modules: (1) A distance-aware label selection module, aiming at generating pseudo-labels for selected unlabeled data, which involves the inter-class distance prototype and progressive fine-tuning; (2) A mixed-supervision representation learning module to further alleviate the influence of noise and enhance the discrimination of representations.
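The distance-aware label selection idea is easy to illustrate: unlabeled samples whose embeddings fall clearly outside the positive cluster can be confidently pseudo-labeled as negatives. A hedged sketch of that selection step; the prototype construction and margin threshold below are illustrative assumptions, not PILOT's exact procedure (which also involves inter-class distance prototypes and progressive fine-tuning):

```python
import torch

def select_pseudo_negatives(pos_feats, unl_feats, margin=0.5):
    """Distance-aware pseudo-labeling sketch for PU learning.

    Unlabeled samples whose embedding lies clearly outside the positive
    cluster are confidently pseudo-labeled as negatives (non-vulnerable).
    """
    prototype = pos_feats.mean(dim=0, keepdim=True)       # positive prototype (1, d)
    dist = torch.cdist(unl_feats, prototype).squeeze(1)   # (U,) distances
    pos_radius = torch.cdist(pos_feats, prototype).squeeze(1).mean()
    return torch.nonzero(dist > pos_radius + margin).squeeze(1)

pos = torch.randn(100, 16)                  # embeddings of labeled vulnerable code
unl = torch.cat([torch.randn(100, 16), torch.randn(50, 16) + 5.0])  # mixed pool
print(select_pseudo_negatives(pos, unl))    # indices drawn mostly from the +5 cluster
```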

Hybrid classical-quantum computing: are we forgetting the classical part in the binomial?

  • paper_url: http://arxiv.org/abs/2308.10513
  • repo_url: None
  • paper_authors: Esther Villar-Rodriguez, Aitor Gomez-Tejedor, Eneko Osaba
  • for: The main aim of this work is to propose a preliminary taxonomy for classifying hybrid quantum computing schemes and to raise key questions about the practical challenges of applying quantum computing.
  • methods: The paper characterizes and classifies hybrid quantum computing concepts, especially hybrid solvers, and surveys the application scenarios and challenges of hybrid schemes.
  • results: A preliminary taxonomy is proposed, together with a set of questions intended to stir research into the real challenges of applying quantum computing.
    Abstract The expectations arising from the latest achievements in the quantum computing field are leading researchers from classical artificial intelligence to become fascinated by this new paradigm. In turn, quantum computing, on the road towards usability, needs classical procedures. Hybridization is, in these circumstances, an indispensable step, but it can also be seen as a promising new avenue to get the most from both computational worlds. Nonetheless, hybrid approaches face many challenges now and will continue to face them in the future; if ignored, these will threaten the viability or attractiveness of quantum computing for real-world applications. To identify them and pose pertinent questions, a proper characterization of the hybrid quantum computing field, and especially of hybrid solvers, is essential. With this motivation in mind, the main purpose of this work is to propose a preliminary taxonomy for classifying hybrid schemes and to bring to the fore some questions to stir up researchers' minds about the real challenges of applying quantum computing.

Performance Enhancement Leveraging Mask-RCNN on Bengali Document Layout Analysis

  • paper_url: http://arxiv.org/abs/2308.10511
  • repo_url: None
  • paper_authors: Shrestha Datta, Md Adith Mollah, Raisa Fairooz, Tariful Islam Fahim
  • for: This work aims to improve machine understanding of Bangla documents via Document Layout Analysis (DLA), which divides documents into sections such as paragraphs, images, and tables.
  • methods: The authors train a Mask R-CNN model and improve it through step-by-step hyperparameter tuning.
  • results: On the BaDLAD dataset the model reaches a good dice score of 0.889, though not everything transferred smoothly: a model trained on English documents fit Bangla poorly, showing that each language poses its own challenges.
    Abstract Understanding digital documents is like solving a puzzle, especially historical ones. Document Layout Analysis (DLA) helps with this puzzle by dividing documents into sections like paragraphs, images, and tables. This is crucial for machines to read and understand these documents. In the DL Sprint 2.0 competition, we worked on understanding Bangla documents. We used a dataset called BaDLAD with lots of examples. We trained a special model called Mask R-CNN to help with this understanding. We made this model better by step-by-step hyperparameter tuning, and we achieved a good dice score of 0.889. However, not everything went perfectly. We tried using a model trained for English documents, but it didn't fit well with Bangla. This showed us that each language has its own challenges. Our solution for the DL Sprint 2.0 is publicly available at https://www.kaggle.com/competitions/dlsprint2/discussion/432201 along with notebooks, weights, and inference notebook.

Large Language Model as a User Simulator

  • paper_url: http://arxiv.org/abs/2308.11534
  • repo_url: https://github.com/FreedomIntelligence/ReaLM
  • paper_authors: Chuyi Kong, Yaxin Fan, Xiang Wan, Feng Jiang, Benyou Wang
  • for: This work pushes toward democratizing ChatGPT-level assistants by learning from genuine human-machine conversations, improving the quality and diversity of dialogue data.
  • methods: Human questions extracted from genuine human-machine conversations are used as the learning target; a user simulator (UserGPT) is trained to produce a high-quality, human-centric synthetic conversation dataset (RealChat), which in turn trains the assistant model (ReaLM).
  • results: ReaLM significantly outperforms baseline models on Vicuna-Bench and MT-Bench at equivalent training set sizes, and manual evaluation shows it is highly competitive with contemporary models of the same scale; fine-tuned on LLaMA 2, it scores a leading 6.33 on MT-Bench.
    Abstract The unparalleled performance of closed-sourced ChatGPT has sparked efforts towards its democratization, with notable strides made by leveraging real user and ChatGPT conversations, as evidenced by Vicuna. However, while current endeavors like Baize and UltraChat aim to auto-generate conversational data due to challenges in gathering human participation, they primarily rely on ChatGPT to simulate human behaviors based on directives rather than genuine human learning. This results in a limited scope, diminished diversity, and an absence of genuine multi-round conversational dynamics. To address the above issues, we innovatively target human questions extracted from genuine human-machine conversations as a learning goal and train a user simulator, UserGPT, to produce a high-quality human-centric synthetic conversation dataset, RealChat. Subsequently, this dataset trains our assistant model, ReaLM. Experimentally, ReaLM outpaces baseline models in both Vicuna-Bench and MT-Bench by pairwise comparison when considering equivalent training set sizes, and manual evaluation also shows that our model is highly competitive. Impressively, when fine-tuned with the latest LLaMA 2 model, ReaLM secured a leading score of 6.33 in the MT-Bench, outshining the contemporary same-scale models, including the LLaMA-2-7B-chat model. Further in-depth analysis demonstrates the scalability and transferability of our approach. A preliminary exploration into the interplay between training set data quality and resultant model performance is also undertaken, laying a robust groundwork for future investigations. The code is available at https://github.com/FreedomIntelligence/ReaLM.

An Examination of the Compositionality of Large Generative Vision-Language Models

  • paper_url: http://arxiv.org/abs/2308.10509
  • repo_url: None
  • paper_authors: Teli Ma, Rong Li, Junwei Liang
  • for: This paper examines how well generative vision-language models (GVLMs), built by multimodal instruction tuning of large language models (LLMs), perform at multimodal compositional reasoning, which existing evaluations designed for contrastive vision-language models (e.g., CLIP) do not capture.
  • methods: The paper identifies evaluation metrics suited to GVLMs (hypothesizing that generative score methods fit compositionality), defines a MorphoBias Score to quantify the morphological bias present in current benchmarks, proposes an LLM-based strategy to calibrate that bias, and adds a challenging task probing robustness against the models' inherent inclination toward syntactic correctness.
  • results: The study shows that GVLM performance on compositional reasoning is skewed by metric choice and benchmark morphological bias, and assembles the calibrated data and new task into MODE (MOrphologically De-biased Benchmark), the first unbiased benchmark for GVLM compositionality.
    Abstract With the success of Large Language Models (LLMs), a surge of Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning. The tuning recipe substantially deviates from the common contrastive vision-language learning. However, the performance of GVLMs in multimodal compositional reasoning remains largely unexplored, as existing evaluation metrics and benchmarks focus predominantly on assessing contrastive models like CLIP. In this paper, we examine the potential evaluation metrics to assess the GVLMs and hypothesize generative score methods are suitable for evaluating compositionality. In addition, current benchmarks tend to prioritize syntactic correctness over semantics. The presence of morphological bias in these benchmarks can be exploited by GVLMs, leading to ineffective evaluations. To combat this, we define a MorphoBias Score to quantify the morphological bias and propose a novel LLM-based strategy to calibrate the bias. Moreover, a challenging task is added to evaluate the robustness of GVLMs against inherent inclination toward syntactic correctness. We include the calibrated dataset and the task into a new benchmark, namely MOrphologicall De-biased Benchmark (MODE). Our study provides the first unbiased benchmark for the compositionality of GVLMs, facilitating future research in this direction. We will release our code and datasets.

Using Autoencoders and AutoDiff to Reconstruct Missing Variables in a Set of Time Series

  • paper_url: http://arxiv.org/abs/2308.10496
  • repo_url: None
  • paper_authors: Jan-Philipp Roche, Oliver Niggemann, Jens Friebe
  • for: This paper addresses the limitation that black-box machine learning models are tied to a fixed combination of input and output features, by reconstructing missing variables in a set of time series.
  • methods: An autoencoder is first trained as usual on all features and its parameters are then frozen; the searched variables are declared as missing variables at the autoencoder input and optimized via automatic differentiation against a loss computed only on the available features.
  • results: Evaluated on a strongly nonlinear electrical component, the approach successfully reconstructs one missing variable out of four and generally works even with multiple missing variables.
    Abstract Existing black box modeling approaches in machine learning suffer from a fixed input and output feature combination. In this paper, a new approach to reconstruct missing variables in a set of time series is presented. An autoencoder is trained as usual with every feature on both sides, and the neural network parameters are fixed after this training. Then, the searched variables are defined as missing variables at the autoencoder input and optimized via automatic differentiation. This optimization is performed with respect to a loss computed on the available features. With this method, different input and output feature combinations of the trained model can be realized by defining the searched variables as missing variables and reconstructing them. The combination can be changed without training the autoencoder again. The approach is evaluated on a strongly nonlinear electrical component. It works well with one of four variables missing, and generally even with multiple missing variables.
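The mechanism is concrete enough to sketch in a few lines of PyTorch: freeze the trained autoencoder, declare the missing channel as a trainable tensor, and minimize the reconstruction error on the observed channels only. The toy network, shapes, and optimizer below are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

# Stand-in for an autoencoder already trained on all 4 time-series channels;
# in practice it is trained first, then its weights are frozen as below.
ae = nn.Sequential(
    nn.Conv1d(4, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(8, 4, kernel_size=3, padding=1),
)
for p in ae.parameters():
    p.requires_grad_(False)

observed = torch.randn(1, 3, 128)   # channels 1..3 are measured (stand-in data)
missing = torch.zeros(1, 1, 128, requires_grad=True)  # channel 0 is the unknown
opt = torch.optim.Adam([missing], lr=1e-2)

for _ in range(500):
    x = torch.cat([missing, observed], dim=1)  # assemble the full (1, 4, T) input
    recon = ae(x)
    # The loss is computed w.r.t. the available features only, as in the paper.
    loss = nn.functional.mse_loss(recon[:, 1:], observed)
    opt.zero_grad(); loss.backward(); opt.step()

# `missing` now holds the autodiff-reconstructed variable.
```

Because only the input tensor is optimized, a different missing/observed split can be handled by the same frozen autoencoder, which is exactly the flexibility the abstract claims.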

Texture Generation on 3D Meshes with Point-UV Diffusion

  • paper_url: http://arxiv.org/abs/2308.10490
  • repo_url: https://github.com/CVMI-Lab/Point-UV-Diffusion
  • paper_authors: Xin Yu, Peng Dai, Wenbo Li, Lan Ma, Zhengzhe Liu, Xiaojuan Qi
  • for: This work focuses on synthesizing high-quality textures on 3D meshes.
  • methods: The proposed Point-UV diffusion is a coarse-to-fine pipeline that marries a denoising diffusion model with UV mapping: a point diffusion model with tailored style guidance first synthesizes low-frequency, globally consistent texture components, which then condition a UV diffusion model with hybrid conditions that enhances texture fidelity in 2D UV space.
  • results: The method can process meshes of any genus and generates diversified, geometry-compatible, high-fidelity textures.
    Abstract In this work, we focus on synthesizing high-quality textures on 3D meshes. We present Point-UV diffusion, a coarse-to-fine pipeline that marries the denoising diffusion model with UV mapping to generate 3D consistent and high-quality texture images in UV space. We start with introducing a point diffusion model to synthesize low-frequency texture components with our tailored style guidance to tackle the biased color distribution. The derived coarse texture offers global consistency and serves as a condition for the subsequent UV diffusion stage, aiding in regularizing the model to generate a 3D consistent UV texture image. Then, a UV diffusion model with hybrid conditions is developed to enhance the texture fidelity in the 2D UV space. Our method can process meshes of any genus, generating diversified, geometry-compatible, and high-fidelity textures. Code is available at https://cvmi-lab.github.io/Point-UV-Diffusion

Deciphering Raw Data in Neuro-Symbolic Learning with Provable Guarantees

  • paper_url: http://arxiv.org/abs/2308.10487
  • repo_url: None
  • paper_authors: Lue Tao, Yu-Xuan Huang, Wang-Zhou Dai, Yuan Jiang
  • for: This paper studies neuro-symbolic hybrid systems, in which perception models are strengthened by information logically inferred from a symbolic knowledge base, and asks why such systems succeed or fail.
  • methods: The paper introduces a novel way of characterising the supervision signals a knowledge base provides, and establishes a criterion for determining whether the knowledge is effective in facilitating successful learning.
  • results: Experiments confirm that the criterion explains the successes and failures of hybrid systems: many knowledge bases satisfy it and enable effective learning, while those that do not indicate potential failures.
    Abstract Neuro-symbolic hybrid systems are promising for integrating machine learning and symbolic reasoning, where perception models are facilitated with information inferred from a symbolic knowledge base through logical reasoning. Despite empirical evidence showing the ability of hybrid systems to learn accurate perception models, the theoretical understanding of learnability is still lacking. Hence, it remains unclear why a hybrid system succeeds for a specific task and when it may fail given a different knowledge base. In this paper, we introduce a novel way of characterising supervision signals from a knowledge base, and establish a criterion for determining the knowledge's efficacy in facilitating successful learning. This, for the first time, allows us to address the two questions above by inspecting the knowledge base under investigation. Our analysis suggests that many knowledge bases satisfy the criterion, thus enabling effective learning, while some fail to satisfy it, indicating potential failures. Comprehensive experiments confirm the utility of our criterion on benchmark tasks.

Deep Metric Loss for Multimodal Learning

  • paper_url: http://arxiv.org/abs/2308.10486
  • repo_url: None
  • paper_authors: Sehwan Moon, Hyunju Lee
  • for: This paper aims to improve the performance of multimodal learning models by introducing a novel loss function that subgroups instances according to their unimodal contributions.
  • methods: The proposed method uses a novel loss function called MultiModal loss, which groups instances based on their unimodal contributions to improve the efficiency of multimodal learning models.
  • results: The proposed method demonstrates improved classification performance on synthetic and real multimodal datasets, and ablation studies verify its effectiveness. Additionally, the method generates reliable prediction scores for each modality, which is essential for subgrouping.
    Abstract Multimodal learning often outperforms its unimodal counterparts by exploiting unimodal contributions and cross-modal interactions. However, focusing only on integrating multimodal features into a unified comprehensive representation overlooks the unimodal characteristics. In real data, the contributions of modalities can vary from instance to instance, and they often reinforce or conflict with each other. In this study, we introduce a novel MultiModal loss paradigm for multimodal learning, which subgroups instances according to their unimodal contributions. MultiModal loss can prevent inefficient learning caused by overfitting and efficiently optimize multimodal models. On synthetic data, MultiModal loss demonstrates improved classification performance by subgrouping difficult instances within certain modalities. On four real multimodal datasets, our loss is empirically shown to improve the performance of recent models. Ablation studies verify the effectiveness of our loss. Additionally, we show that our loss generates a reliable prediction score for each modality, which is essential for subgrouping. Our MultiModal loss is a novel loss function to subgroup instances according to the contribution of modalities in multimodal learning and is applicable to a variety of multimodal models with unimodal decisions. Our code is available at https://github.com/SehwanMoon/MultiModalLoss.

Deep Semi-supervised Anomaly Detection with Metapath-based Context Knowledge

  • paper_url: http://arxiv.org/abs/2308.10918
  • repo_url: None
  • paper_authors: Hwan Kim, Junghoon Kim, Byung Suk Lee, Sungsu Lim
  • for: This paper targets graph anomaly detection, proposing a novel metapath-based semi-supervised approach that addresses the limitations of previous methods.
  • methods: The new framework, Metapath-based Semi-supervised Anomaly Detection (MSAD), uses GCN layers in both the encoder and decoder to efficiently propagate context information between abnormal and normal nodes, and couples metapath-based context information with a specifically crafted anomaly community to sharpen the learning of structural and attribute differences, both globally and locally.
  • results: Experiments on seven real-world networks demonstrate that MSAD outperforms state-of-the-art techniques, opening the way for future work on optimizing and analyzing metapath patterns to further improve anomaly detection on attributed networks.
    Abstract Graph anomaly detection has attracted considerable attention in recent years. This paper introduces a novel approach that leverages metapath-based semi-supervised learning, addressing the limitations of previous methods. We present a new framework, Metapath-based Semi-supervised Anomaly Detection (MSAD), incorporating GCN layers in both the encoder and decoder to efficiently propagate context information between abnormal and normal nodes. The design of metapath-based context information and a specifically crafted anomaly community enhance the process of learning differences in structures and attributes, both globally and locally. Through a comprehensive set of experiments conducted on seven real-world networks, this paper demonstrates the superiority of the MSAD method compared to state-of-the-art techniques. The promising results of this study pave the way for future investigations, focusing on the optimization and analysis of metapath patterns to further enhance the effectiveness of anomaly detection on attributed networks.

Unsupervised Dialogue Topic Segmentation in Hyperdimensional Space

  • paper_url: http://arxiv.org/abs/2308.10464
  • repo_url: https://github.com/seongminp/hdseg
  • paper_authors: Seongmin Park, Jinkyu Seo, Jihwa Lee
  • for: This paper proposes an unsupervised dialogue topic segmentation method based on hyperdimensional computing (HDC) to improve the understanding of dialogue transcripts.
  • methods: HyperSeg uses HDC to generate rich token representations through the low-cost initialization of many unrelated, extremely high-dimensional vectors.
  • results: HyperSeg outperforms the current state of the art on 4 of 5 segmentation benchmarks, even when baselines are given partial access to the ground truth, and is 10 times faster on average; it also improves downstream summarization accuracy.
    Abstract We present HyperSeg, a hyperdimensional computing (HDC) approach to unsupervised dialogue topic segmentation. HDC is a class of vector symbolic architectures that leverages the probabilistic orthogonality of randomly drawn vectors at extremely high dimensions (typically over 10,000). HDC generates rich token representations through its low-cost initialization of many unrelated vectors. This is especially beneficial in topic segmentation, which often operates as a resource-constrained pre-processing step for downstream transcript understanding tasks. HyperSeg outperforms the current state-of-the-art in 4 out of 5 segmentation benchmarks -- even when baselines are given partial access to the ground truth -- and is 10 times faster on average. We show that HyperSeg also improves downstream summarization accuracy. With HyperSeg, we demonstrate the viability of HDC in a major language task. We open-source HyperSeg to provide a strong baseline for unsupervised topic segmentation.
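The probabilistic orthogonality that HDC exploits is easy to verify: independently drawn random vectors in very high dimensions are almost surely nearly orthogonal, so each token gets a quasi-unique signature at negligible cost, and bundled (summed) vectors stay similar to their constituents. A small numpy demonstration, using generic HDC conventions (bipolar vectors, d = 10,000) rather than HyperSeg's exact choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000                                   # typical HDC dimensionality

# Random bipolar (+1/-1) hypervectors act as token signatures.
a, b = rng.choice([-1, 1], size=(2, d))
print(f"cosine similarity of two random hypervectors: {a @ b / d:+.4f}")  # ~0

# Bundling (elementwise sum, then sign) builds a segment representation that
# stays similar to its constituents but dissimilar to unrelated vectors.
c = rng.choice([-1, 1], size=d)
bundle = np.sign(a + b + c)
print("sim(bundle, a):     ", bundle @ a / d)                             # ~0.5
print("sim(bundle, random):", bundle @ rng.choice([-1, 1], size=d) / d)   # ~0
```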

Elucidating STEM Concepts through Generative AI: A Multi-modal Exploration of Analogical Reasoning

  • paper_url: http://arxiv.org/abs/2308.10454
  • repo_url: None
  • paper_authors: Chen Cao, Zijian Ding, Gyeong-Geon Lee, Jiajun Jiao, Jionghao Lin, Xiaoming Zhai
  • for: This study explores integrating generative artificial intelligence (AI), specifically large language models, with multi-modal analogical reasoning as an innovative approach to enhance science, technology, engineering, and mathematics (STEM) education.
  • methods: The authors develop a novel system that uses the capacities of generative AI to transform intricate principles in mathematics, physics, and programming into comprehensible metaphors, which are then converted into visual form to further enrich the learning experience.
  • results: A randomized A/B/C test assesses the system's efficacy, measuring learning gains and motivation shifts; the findings inform the design of educational systems and demonstrate the potential of applying large language models in STEM education.
    Abstract This study explores the integration of generative artificial intelligence (AI), specifically large language models, with multi-modal analogical reasoning as an innovative approach to enhance science, technology, engineering, and mathematics (STEM) education. We have developed a novel system that utilizes the capacities of generative AI to transform intricate principles in mathematics, physics, and programming into comprehensible metaphors. To further augment the educational experience, these metaphors are subsequently converted into visual form. Our study aims to enhance the learners' understanding of STEM concepts and their learning engagement by using the visual metaphors. We examine the efficacy of our system via a randomized A/B/C test, assessing learning gains and motivation shifts among the learners. Our study demonstrates the potential of applying large language models to educational practice on STEM subjects. The results will shed light on the design of educational system in terms of harnessing AI's potential to empower educational stakeholders.

CVFC: Attention-Based Cross-View Feature Consistency for Weakly Supervised Semantic Segmentation of Pathology Images

  • paper_url: http://arxiv.org/abs/2308.10449
  • repo_url: None
  • paper_authors: Liangrui Pan, Lian Wang, Zhichao Feng, Liwen Xu, Shaoliang Peng
  • for: histopathology image segmentation for cancer diagnosis and prognosis
  • methods: attention-based cross-view feature consistency end-to-end pseudo-mask generation framework (CVFC) with three branches and multi-scale integrated feature maps
  • results: outperformed HistoSegNet, SEAM, C-CAM, WSSS-Tissue, and OEEM in terms of IoU and fwIoU on the WSSS4LUAD dataset with an IoU of 0.7122 and a fwIoU of 0.7018
    Abstract Histopathology image segmentation is the gold standard for diagnosing cancer and can indicate cancer prognosis. However, histopathology image segmentation requires high-quality masks, so many studies now use image-level labels to achieve pixel-level segmentation, reducing the need for fine-grained annotation. To solve this problem, we propose CVFC, an attention-based cross-view feature consistency end-to-end pseudo-mask generation framework. Specifically, CVFC is a three-branch joint framework composed of two Resnet38 networks and one Resnet50; each branch independently integrates multi-scale feature maps to generate a class activation map (CAM), and adjusts the size of the CAM through down-sampling and expansion. The middle branch projects the feature matrix into query and key feature spaces and, via a connection layer and inner product, generates a feature-space perception matrix that adjusts and refines the CAM of each branch. Finally, a feature consistency loss and a feature cross loss optimize the parameters of CVFC in co-training mode. After extensive experiments, an IoU of 0.7122 and a fwIoU of 0.7018 are obtained on the WSSS4LUAD dataset, outperforming HistoSegNet, SEAM, C-CAM, WSSS-Tissue, and OEEM, respectively.
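The middle-branch refinement step, projecting the feature map into query and key spaces and using their inner product to re-weight the CAM, amounts to self-attention over spatial positions. A hedged PyTorch sketch under assumed dimensions and normalization, not the exact CVFC layers:

```python
import torch
import torch.nn as nn

class CAMRefine(nn.Module):
    """Sketch: refine a class activation map with a query/key inner product."""
    def __init__(self, channels, dim=64):
        super().__init__()
        self.to_q = nn.Conv2d(channels, dim, 1)
        self.to_k = nn.Conv2d(channels, dim, 1)

    def forward(self, feat, cam):
        # feat: (B, C, H, W) feature map; cam: (B, K, H, W) class activation map
        b, _, h, w = feat.shape
        q = self.to_q(feat).flatten(2).transpose(1, 2)            # (B, HW, dim)
        k = self.to_k(feat).flatten(2)                            # (B, dim, HW)
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)  # (B, HW, HW)
        cam_flat = cam.flatten(2).transpose(1, 2)                 # (B, HW, K)
        refined = attn @ cam_flat            # propagate activations across positions
        return refined.transpose(1, 2).reshape(b, -1, h, w)

feat = torch.randn(2, 256, 28, 28)
cam = torch.rand(2, 4, 28, 28)               # e.g., 4 tissue classes
print(CAMRefine(256)(feat, cam).shape)       # torch.Size([2, 4, 28, 28])
```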

LDCSF: Local depth convolution-based Swim framework for classifying multi-label histopathology images

  • paper_url: http://arxiv.org/abs/2308.10446
  • repo_url: None
  • paper_authors: Liangrui Pan, Yutao Dou, Zhichao Feng, Liwen Xu, Shaoliang Peng
  • for: This paper aims to improve the accuracy of computational pathology diagnosis, in particular multi-label classification of liver cancer histopathology images.
  • methods: It proposes a deep-learning-based multi-label classification method, the Locally Deep Convolutional Swim Framework (LDCSF), composed of a Swin transformer module, a local depth convolution (LDC) module, a feature reconstruction (FR) module, and a ResNet module.
  • results: LDCSF improves multi-label classification of liver cancer histopathology images, reaching classification accuracies of 0.9460, 0.9960, 0.9808, and 0.9847 for interstitial area, necrosis, non-tumor, and tumor, respectively; the multi-label classification results are further used to compute the tumor-to-stromal ratio, laying the foundation for microenvironment analysis of liver cancer histopathology images.
    Abstract Histopathological images are the gold standard for diagnosing liver cancer. However, the accuracy of fully digital diagnosis in computational pathology needs to be improved. In this paper, to solve the problem of multi-label classification and low classification accuracy of histopathology images, we propose a locally deep convolutional Swim framework (LDCSF) to classify multi-label histopathology images. To provide local field-of-view diagnostic results, the LDCSF model consists of a Swin transformer module, a local depth convolution (LDC) module, a feature reconstruction (FR) module, and a ResNet module. The Swin transformer module reduces the amount of computation generated by the attention mechanism by limiting attention to each window. The LDC then reconstructs the attention map and performs convolution operations in multiple channels, passing the resulting feature map to the next layer. The FR module takes the dot product of the channel-wise weight coefficient vectors with the original feature map matrix to generate representative feature maps. Finally, the residual network undertakes the final classification task. As a result, the classification accuracy of LDCSF for interstitial area, necrosis, non-tumor, and tumor reached 0.9460, 0.9960, 0.9808, and 0.9847, respectively. We further use the multi-label pathological image classification results to calculate the tumor-to-stromal ratio, which lays the foundation for analyzing the microenvironment of liver cancer histopathological images. We also release a multi-label histopathology image dataset of liver cancer; our code and data are available at https://github.com/panliangrui/LSF.

Dynamic Strategy Chain: Dynamic Zero-Shot CoT for Long Mental Health Support Generation

  • paper_url: http://arxiv.org/abs/2308.10444
  • repo_url: None
  • paper_authors: Qi Chen, Dexi Liu
  • for: 提供心理健康支持 through comprehensive and more acceptable responses.
  • methods: combines chain-of-thought (CoT) prompting and Large Language Models (LLMs), with a new zero-shot Dynamic Strategy Chain (DSC) prompting method that simulates mental health counseling strategies tailored to help-seekers’ needs.
  • results: deliver more human-like responses than CoT prompting methods on Long Counseling Text Generation for Mental Health Support (LTGM) tasks, as demonstrated by both automatic and manual evaluations.
    Abstract Long counseling Text Generation for Mental health support (LTGM), an innovative and challenging task, aims to provide help-seekers with mental health support through a comprehensive and more acceptable response. The combination of chain-of-thought (CoT) prompting and Large Language Models (LLMs) has achieved SOTA performance on various NLP tasks, especially text generation tasks. Zero-shot CoT prompting is one of the most common methods in CoT prompting. However, in the LTGM task, zero-shot CoT prompting cannot simulate a counselor or provide personalized strategies without effective mental health counseling strategy prompts. To tackle this challenge, we propose a zero-shot Dynamic Strategy Chain (DSC) prompting method. Firstly, we utilize GPT2 to learn the responses written by mental health counselors and dynamically generate mental health counseling strategies tailored to the help-seekers' needs. Secondly, the zero-shot DSC prompt is constructed according to the mental health counseling strategies and the help-seeker's post. Finally, the zero-shot DSC prompt is employed to guide LLMs in generating more human-like responses for the help-seekers. Both automatic and manual evaluations demonstrate that zero-shot DSC prompting can deliver more human-like responses than CoT prompting methods on LTGM tasks.
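At the prompting level, the two stages reduce to: (1) generate a counseling strategy conditioned on the help-seeker's post (GPT2 in the paper), then (2) splice that strategy into the zero-shot prompt the LLM answers. A sketch of the assembly step; the template wording is invented for illustration and is not the paper's prompt:

```python
def build_dsc_prompt(post: str, strategy: str) -> str:
    """Assemble a zero-shot Dynamic Strategy Chain style prompt.

    `strategy` would come from a model trained on counselors' responses
    (GPT2 in the paper); the template text here is illustrative only.
    """
    return (
        "You are a supportive mental health counselor.\n"
        f"Help-seeker's post:\n{post}\n\n"
        f"Counseling strategy to follow, step by step:\n{strategy}\n\n"
        "Write a long, empathetic response that applies each step above."
    )

# Usage sketch:
post = "I can't sleep and I feel overwhelmed at work."
strategy = "1. Validate feelings. 2. Explore stressors. 3. Suggest coping steps."
print(build_dsc_prompt(post, strategy))   # the string then goes to the LLM
```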

Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions

  • paper_url: http://arxiv.org/abs/2308.10443
  • repo_url: None
  • paper_authors: Wesley Tann, Yuancheng Liu, Jun Heng Sim, Choon Meng Seah, Ee-Chien Chang
  • for: This study investigates the effectiveness of Large Language Models (LLMs) in solving Capture-The-Flag (CTF) challenges and questions, particularly in the context of CTF exercises in the classroom.
  • methods: Three popular LLMs are examined: OpenAI ChatGPT, Google Bard, and Microsoft Bing. The authors first assess their question-answering performance on five Cisco certifications of varying difficulty, then qualitatively study their ability to solve CTF challenges to understand their limitations.
  • results: The LLMs performed well on the CTF challenges but showed clear limitations; across seven test cases covering all five CTF challenge types, they solved the challenges successfully, and in some cases jailbreak prompts could bypass the LLMs' ethical safeguards. The study concludes by discussing the implications of LLMs for CTF exercises and for teaching and assessing students' skills.
    Abstract The assessment of cybersecurity Capture-The-Flag (CTF) exercises involves participants finding text strings or ``flags'' by exploiting system vulnerabilities. Large Language Models (LLMs) are natural-language models trained on vast amounts of words to understand and generate text; they can perform well on many CTF challenges. Such LLMs are freely available to students. In the context of CTF exercises in the classroom, this raises concerns about academic integrity. Educators must understand LLMs' capabilities to modify their teaching to accommodate generative AI assistance. This research investigates the effectiveness of LLMs, particularly in the realm of CTF challenges and questions. Here we evaluate three popular LLMs, OpenAI ChatGPT, Google Bard, and Microsoft Bing. First, we assess the LLMs' question-answering performance on five Cisco certifications with varying difficulty levels. Next, we qualitatively study the LLMs' abilities in solving CTF challenges to understand their limitations. We report on the experience of using the LLMs for seven test cases in all five types of CTF challenges. In addition, we demonstrate how jailbreak prompts can bypass and break LLMs' ethical safeguards. The paper concludes by discussing LLM's impact on CTF exercises and its implications.

DySuse: Susceptibility Estimation in Dynamic Social Networks

  • paper_url: http://arxiv.org/abs/2308.10442
  • repo_url: None
  • paper_authors: Yingdan Shi, Jingya Zhou, Congcong Zhang
  • for: Predicting the spread of influence in social networks has received wide attention in recent years, but most existing studies predict the total number of influenced users and neglect susceptibility estimation, which predicts each individual user's probability of being influenced. As a finer-grained prediction task, susceptibility estimation is attractive and of practical value.
  • methods: The paper proposes the task of susceptibility estimation in dynamic social networks, which is more realistic and valuable for real-world applications but has not yet been studied, since naive Monte Carlo simulation is computationally intractable in this setting; it introduces a novel framework, DySuse, based on dynamic graph embedding technology.
  • results: The framework outperforms existing dynamic graph embedding models and achieves satisfactory prediction performance under multiple influence diffusion models.
    Abstract Influence estimation aims to predict the total influence spread in social networks and has received surging attention in recent years. Most current studies focus on estimating the total number of influenced users in a social network, and neglect susceptibility estimation, which aims to predict the probability of each user being influenced from the individual perspective. As a more fine-grained estimation task, susceptibility estimation is full of attractiveness and practical value. Based on the significance of susceptibility estimation and the dynamic properties of social networks, we propose a task called susceptibility estimation in dynamic social networks, which is even more realistic and valuable in real-world applications. Susceptibility estimation in dynamic networks has yet to be explored, and it is computationally intractable to naively adopt Monte Carlo simulation to obtain the results. To this end, we propose a novel end-to-end framework, DySuse, based on dynamic graph embedding technology. Specifically, we leverage a structural feature module to independently capture the structural information of influence diffusion on each single graph snapshot. Besides, we propose a progressive mechanism, designed according to the properties of influence diffusion, to tightly couple the structural and temporal information during diffusion. Moreover, a self-attention block is designed to further capture temporal dependency by flexibly weighting historical timestamps. Experimental results show that our framework is superior to existing dynamic graph embedding models and has satisfactory prediction performance under multiple influence diffusion models.

X-VoE: Measuring eXplanatory Violation of Expectation in Physical Events

  • paper_url: http://arxiv.org/abs/2308.10441
  • repo_url: https://github.com/daibopku/x-voe
  • paper_authors: Bo Dai, Linge Wang, Baoxiong Jia, Zeyu Zhang, Song-Chun Zhu, Chi Zhang, Yixin Zhu
  • for: This study aims to assess the ability of AI agents to understand intuitive physics, and to develop a comprehensive benchmark dataset (X-VoE) to evaluate their performance.
  • methods: The study uses the Violation of Expectation (VoE) paradigm, which is rooted in developmental psychology, to test AI models’ understanding of events and their underlying explanations. The dataset includes three distinct settings that probe models’ comprehension of events and their ability to infer occluded object states from visual sequences.
  • results: The experimental outcomes show that the proposed explanation-based learning system is able to align with human commonsense and visually expound VoE events by reconstructing concealed scenes. The results demonstrate the potential of X-VoE as a valuable tool for advancing AI with human-like intuitive physics capabilities.
    Abstract Intuitive physics is pivotal for human understanding of the physical world, enabling prediction and interpretation of events even in infancy. Nonetheless, replicating this level of intuitive physics in artificial intelligence (AI) remains a formidable challenge. This study introduces X-VoE, a comprehensive benchmark dataset, to assess AI agents' grasp of intuitive physics. Built on the developmental psychology-rooted Violation of Expectation (VoE) paradigm, X-VoE establishes a higher bar for the explanatory capacities of intuitive physics models. Each VoE scenario within X-VoE encompasses three distinct settings, probing models' comprehension of events and their underlying explanations. Beyond model evaluation, we present an explanation-based learning system that captures physics dynamics and infers occluded object states solely from visual sequences, without explicit occlusion labels. Experimental outcomes highlight our model's alignment with human commonsense when tested against X-VoE. A remarkable feature is our model's ability to visually expound VoE events by reconstructing concealed scenes. Concluding, we discuss the findings' implications and outline future research directions. Through X-VoE, we catalyze the advancement of AI endowed with human-like intuitive physics capabilities.

GPT-in-the-Loop: Adaptive Decision-Making for Multiagent Systems

  • paper_url: http://arxiv.org/abs/2308.10435
  • repo_url: None
  • paper_authors: Nathalia Nascimento, Paulo Alencar, Donald Cowan
  • for: This paper proposes “GPT-in-the-loop”, a novel approach that combines the advanced reasoning capabilities of Large Language Models (LLMs) with multiagent systems (MAS) to achieve superior decision-making and adaptability in IoT applications.
  • methods: GPT-4 is employed for enhanced problem-solving and explanation skills and integrated into the agent-driven Framework for the Internet of Things (FIoT).
  • results: Comparative results in the IoT context show that the GPT-in-the-loop approach achieves superior decision-making and adaptability without extensive training, outperforming traditional neuroevolutionary methods and solutions provided by software engineers.
    Abstract This paper introduces the "GPT-in-the-loop" approach, a novel method combining the advanced reasoning capabilities of Large Language Models (LLMs) like Generative Pre-trained Transformers (GPT) with multiagent (MAS) systems. Venturing beyond traditional adaptive approaches that generally require long training processes, our framework employs GPT-4 for enhanced problem-solving and explanation skills. Our experimental backdrop is the smart streetlight Internet of Things (IoT) application. Here, agents use sensors, actuators, and neural networks to create an energy-efficient lighting system. By integrating GPT-4, these agents achieve superior decision-making and adaptability without the need for extensive training. We compare this approach with both traditional neuroevolutionary methods and solutions provided by software engineers, underlining the potential of GPT-driven multiagent systems in IoT. Structurally, the paper outlines the incorporation of GPT into the agent-driven Framework for the Internet of Things (FIoT), introduces our proposed GPT-in-the-loop approach, presents comparative results in the IoT context, and concludes with insights and future directions.

Mechanisms that play a game, not toss a coin

  • paper_url: http://arxiv.org/abs/2308.10413
  • repo_url: None
  • paper_authors: Toby Walsh
  • for: This paper aims to show that randomized mechanisms can be derandomized, obtaining better practical properties than their deterministic counterparts while avoiding the drawbacks of randomization.
  • methods: Agents play a game instead of tossing a coin; the game is designed so that an agent's best action is to play randomly, and this play injects "randomness" into the mechanism. The derandomization retains many of the original mechanism's good normative properties while yielding a deterministic mechanism that is easy to audit.
  • results: The paper proposes a number of novel derandomized mechanisms with good normative properties across six domains: voting, facility location, task allocation, school choice, peer selection, and resource allocation. Each mechanism has a mixed Nash equilibrium in which agents play a modular arithmetic game with a uniform mixed strategy, and in all but one such equilibrium agents report their preferences over the original problem sincerely; the derandomized methods are thus "quasi-strategy proof". In one domain, derandomization is additionally shown to give rise to a new and desirable normative property.
    Abstract Randomized mechanisms can have good normative properties compared to their deterministic counterparts. However, randomized mechanisms are problematic in several ways, such as in their verifiability. We propose here to derandomize such mechanisms by having agents play a game instead of tossing a coin. The game is designed so an agent's best action is to play randomly, and this play then injects "randomness" into the mechanism. This derandomization retains many of the good normative properties of the original randomized mechanism but gives a mechanism that is deterministic and easy, for instance, to audit. We consider three related methods to derandomize randomized mechanisms in six different domains: voting, facility location, task allocation, school choice, peer selection, and resource allocation. We propose a number of novel derandomized mechanisms for these six domains with good normative properties. Each mechanism has a mixed Nash equilibrium in which agents play a modular arithmetic game with a uniform mixed strategy. In all but one mixed Nash equilibrium, agents report their preferences over the original problem sincerely. The derandomized methods are thus "quasi-strategy proof". In one domain, we additionally show that a new and desirable normative property emerges as a result of derandomization.
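The modular arithmetic game at the heart of these mechanisms is simple to state: each of the n agents simultaneously names an integer, and the sum modulo n selects an outcome (say, a tie-breaking agent). If any single agent plays uniformly at random, the selected index is uniform no matter what the others submit, which is why uniform play is an equilibrium and the mechanism needs no coin of its own. A small simulation of this property; the tie-breaking interpretation is one illustrative use:

```python
import random
from collections import Counter

def modular_game(choices):
    """Each agent submits an integer; selected index = sum mod n."""
    return sum(choices) % len(choices)

n = 4
counts = Counter()
for _ in range(100_000):
    # Agent 0 plays uniformly at random; the rest play fixed/arbitrary numbers.
    choices = [random.randrange(n), 3, 1, 2]
    counts[modular_game(choices)] += 1

print({k: v / 100_000 for k, v in sorted(counts.items())})
# Each outcome occurs with probability ~1/4: one uniformly random agent makes
# the selected index uniform, whatever the other agents submit.
```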

Diffusion Model as Representation Learner

  • paper_url: http://arxiv.org/abs/2308.10916
  • repo_url: None
  • paper_authors: Xingyi Yang, Xinchao Wang
  • for: This paper investigates the representation power of Diffusion Probabilistic Models (DPMs) and how the knowledge acquired by generative DPMs can be transferred to recognition tasks.
  • methods: It proposes a novel knowledge transfer method, RepFusion, which extracts representations at different time steps from off-the-shelf DPMs and dynamically employs them as supervision for student networks, determining the optimal time step through reinforcement learning.
  • results: Evaluations on several image classification, semantic segmentation, and landmark detection benchmarks show that the approach outperforms state-of-the-art methods.
    Abstract Diffusion Probabilistic Models (DPMs) have recently demonstrated impressive results on various generative tasks. Despite their promise, however, the learned representations of pre-trained DPMs have not been fully understood. In this paper, we conduct an in-depth investigation of the representation power of DPMs, and propose a novel knowledge transfer method that leverages the knowledge acquired by generative DPMs for recognition tasks. Our study begins by examining the feature space of DPMs, revealing that DPMs are inherently denoising autoencoders that balance representation learning with regularizing model capacity. To this end, we introduce a novel knowledge transfer paradigm named RepFusion. Our paradigm extracts representations at different time steps from off-the-shelf DPMs and dynamically employs them as supervision for student networks, in which the optimal time is determined through reinforcement learning. We evaluate our approach on several image classification, semantic segmentation, and landmark detection benchmarks, and demonstrate that it outperforms state-of-the-art methods. Our results uncover the potential of DPMs as a powerful tool for representation learning and provide insights into the usefulness of generative models beyond sample generation. The code is available at https://github.com/Adamdad/Repfusion.
    摘要 Diffusion Probabilistic Models (DPMs) 在不同的生成任务中已经表现出了很好的结果。然而,学习的 DPMs 中的表示能力还没有得到完全的理解。在这篇文章中,我们进行了 DPMs 的深入调查,并提出了一种新的知识传递方法,该方法利用生成 DPMs 中所获得的知识来提高认知任务的性能。我们的研究开始于 DPMs 的特征空间的检查,发现 DPMs 是一种自适应的减噪自适应器,它在学习表示学习和规则化模型容量之间做出了平衡。为此,我们提出了一种名为 RepFusion 的新的知识传递方法。我们的方法在不同的时间步次从准备好的 DPMs 中提取表示,并通过动态地将其作为学生网络的超参论进行使用,通过 reinforcement learning 确定最佳时间。我们在多个图像分类、semantic segmentation 和 landmark detection benchmark 上评估了我们的方法,并示出它在比 state-of-the-art 方法的性能更高。我们的结果揭示了 DPMs 作为表示学习工具的潜力,并提供了生成模型在样本生成之外的用于表示学习的可能性。代码可以在 \url{https://github.com/Adamdad/Repfusion} 上获取。

Simple Baselines for Interactive Video Retrieval with Questions and Answers

  • paper_url: http://arxiv.org/abs/2308.10402
  • repo_url: https://github.com/kevinliang888/ivr-qa-baselines
  • paper_authors: Kaiqu Liang, Samuel Albanie
  • for: 提高视频检索系统的交互性,使其能够更好地回答用户问题。
  • methods: 使用问题 Answering 模型来模拟用户交互,并对视频检索系统进行改进。
  • results: 经过实验,问题基本交互的视频检索系统在 MSR-VTT、MSVD 和 AVSD 等 datasets 上显示出了显著的性能提升。
    Abstract To date, the majority of video retrieval systems have been optimized for a "single-shot" scenario in which the user submits a query in isolation, ignoring previous interactions with the system. Recently, there has been renewed interest in interactive systems to enhance retrieval, but existing approaches are complex and deliver limited gains in performance. In this work, we revisit this topic and propose several simple yet effective baselines for interactive video retrieval via question-answering. We employ a VideoQA model to simulate user interactions and show that this enables the productive study of the interactive retrieval task without access to ground truth dialogue data. Experiments on MSR-VTT, MSVD, and AVSD show that our framework using question-based interaction significantly improves the performance of text-based video retrieval systems.
    摘要 至今,大多数视频检索系统都是基于单个查询场景优化的,忽略了用户与系统之前的互动。在最近几年,有关交互系统的重新兴趣,但现有的方法复杂,性能提升有限。在这项工作中,我们重新探讨交互检索任务,并提出了一些简单又有效的基线。我们使用视频问答模型来模拟用户互动,并显示这可以帮助无需真实对话数据进行产品性研究。在 MSR-VTT、MSVD 和 AVSD 上进行了实验,我们发现使用问题基于的互动方式可以显著提高文本基于视频检索系统的性能。

FairBench: A Four-Stage Automatic Framework for Detecting Stereotypes and Biases in Large Language Models

  • paper_url: http://arxiv.org/abs/2308.10397
  • repo_url: None
  • paper_authors: Yanhong Bai, Jiabao Zhao, Jinxin Shi, Tingjiang Wei, Xingjiao Wu, Liang He
  • for: This paper aims to enhance fairness and reduce adverse impacts on individuals or groups when Large Language Models (LLMs) are applied by detecting stereotypes and biases in the models.
  • methods: The paper introduces a four-stage framework to directly evaluate stereotypes and biases in the generated content of LLMs, including direct inquiry testing, serial or adapted story testing, implicit association testing, and unknown situation testing.
  • results: The paper evaluates five LLMs on the Edu-FairBench, a dataset of 12,632 open-ended questions covering nine sensitive factors and 26 educational scenarios, and finds varying degrees of stereotypes and biases in the models. Additionally, the proposed automated evaluation method has shown a high correlation with human annotations.Here is the information in Simplified Chinese text:
  • for: 这篇论文目标是通过探测大型自然语言模型(LLMs)中的偏见和刻板印象来增强应用中的公平和避免对个人或群体产生负面影响。
  • methods: 论文提出了一种四个阶段的框架,直接评估 LLMs 生成内容中的偏见和刻板印象,包括直接问题测试、串行或修改故事测试、隐藏协会测试和未知情况测试。
  • results: 论文使用教育领域的 Edu-FairBench Dataset,包含 12,632 个开放结束问题,覆盖九种敏感因素和 26 个教育场景,对五个 LLMS 进行评估,发现它们中存在不同程度的偏见和刻板印象。此外,提出的自动评估方法与人工笔迹之间存在高相关性。
    Abstract Detecting stereotypes and biases in Large Language Models (LLMs) can enhance fairness and reduce adverse impacts on individuals or groups when these LLMs are applied. However, the majority of existing methods focus on measuring the model's preference towards sentences containing biases and stereotypes within datasets, which lacks interpretability and cannot detect implicit biases and stereotypes in the real world. To address this gap, this paper introduces a four-stage framework to directly evaluate stereotypes and biases in the generated content of LLMs, including direct inquiry testing, serial or adapted story testing, implicit association testing, and unknown situation testing. Additionally, the paper proposes multi-dimensional evaluation metrics and explainable zero-shot prompts for automated evaluation. Using the education sector as a case study, we constructed the Edu-FairBench based on the four-stage framework, which encompasses 12,632 open-ended questions covering nine sensitive factors and 26 educational scenarios. Experimental results reveal varying degrees of stereotypes and biases in five LLMs evaluated on Edu-FairBench. Moreover, the results of our proposed automated evaluation method have shown a high correlation with human annotations.
    摘要 检测 LLM 中的偏见和偏好可以提高公平性并减少对个人或群体的不良影响。然而,现有的大多数方法都是基于数据集中的偏见和偏好的表达,lacking interpretability和无法检测 LLM 在实际世界中的隐式偏见和偏好。为解决这个问题,本文提出了一个四个阶段的框架,直接评估 LLM 生成内容中的偏见和偏好,包括直接问题测试、串行或适应故事测试、隐藏关联测试和未知情况测试。此外,本文还提出了多维度评价指标和自动评价可解释的零开始提示。使用教育领域为例,我们构建了 Edu-FairBench,基于四个阶段框架,包含 12,632 个开放问题,覆盖 nine 个敏感因素和 26 个教育场景。实验结果表明 LLM 在 Edu-FairBench 上表现出不同程度的偏见和偏好,并且我们的自动评价方法与人工注释有高相关性。

Robotic Planning under Hierarchical Temporal Logic Specifications

  • paper_url: http://arxiv.org/abs/2308.10393
  • repo_url: None
  • paper_authors: Xusheng Luo, Shaojun Xu, Ruixuan Liu, Changliu Liu
  • for: 提高机器人规划的能效性,使得机器人可以更好地完成复杂任务。
  • methods: 使用层次结构的线性时间逻辑(LTL)规定,并采用分解法解决任务。
  • results: 在机器人导航和抓取等领域进行了广泛的实验研究,结果表明,使用层次结构的LTL规定和分解法可以提高规划的表达能力和效率。
    Abstract Past research into robotic planning with temporal logic specifications, notably Linear Temporal Logic (LTL), was largely based on singular formulas for individual or groups of robots. But with increasing task complexity, LTL formulas unavoidably grow lengthy, complicating interpretation and specification generation, and straining the computational capacities of the planners. In order to maximize the potential of LTL specifications, we capitalized on the intrinsic structure of tasks and introduced a hierarchical structure to LTL specifications. In contrast to the "flat" structure, our hierarchical model has multiple levels of compositional specifications and offers benefits such as greater syntactic brevity, improved interpretability, and more efficient planning. To address tasks under this hierarchical temporal logic structure, we formulated a decomposition-based method. Each specification is first broken down into a range of temporally interrelated sub-tasks. We further mine the temporal relations among the sub-tasks of different specifications within the hierarchy. Subsequently, a Mixed Integer Linear Program is utilized to generate a spatio-temporal plan for each robot. Our hierarchical LTL specifications were experimentally applied to domains of robotic navigation and manipulation. Results from extensive simulation studies illustrated both the enhanced expressive potential of the hierarchical form and the efficacy of the proposed method.
    摘要 前研究涉及机器人规划,主要基于单个或组合机器人的时间逻辑要求(Linear Temporal Logic,LTL)。但是随着任务复杂度的增加,LTLFormula会不可避免地增加长度,从而使解释和规划生成变得更加困难,计划器的计算能力也会受到挑战。为了满足LTSpecifications的潜在能力,我们利用任务的内在结构,引入层次结构,从而提供更简洁的语法结构、更好的可读性和更高效的规划。为解决层次时间逻辑结构下的任务,我们提出了一种分解方法。首先,每个规定被分解成一系列相互关联的时间逻辑子任务。然后,我们在不同层次的规定之间挖掘时间关系,并使用混合整数线性编程来生成每个机器人的空间时间规划。我们在机器人导航和抓取等领域进行了实验,结果表明,我们的层次LTSpecifications和分解方法可以增强表达力和规划效率。

Neural Architectures Learning Fourier Transforms, Signal Processing and Much More….

  • paper_url: http://arxiv.org/abs/2308.10388
  • repo_url: None
  • paper_authors: Prateek Verma
  • for: 本文探讨了使用 Fourier Transform 和现代人工智能技术的关系,以及如何将这两者结合使用。
  • methods: 本文使用了 neural architecture 来学习signal processing 中的 kernel,并发现了这些 kernel 可以学习出各种杰出的信号处理特性,如窗口函数、开关检测器、高频滤波器、低频滤波器、修饰等。
  • results: 本文发现了一种使用 neural architecture 学习 kernel 的方法,可以不仅学习 sinusoidal kernel 的形状,还可以发现各种杰出的信号处理特性,如窗口函数、开关检测器、高频滤波器、低频滤波器、修饰等。
    Abstract This report will explore and answer fundamental questions about taking Fourier Transforms and tying it with recent advances in AI and neural architecture. One interpretation of the Fourier Transform is decomposing a signal into its constituent components by projecting them onto complex exponentials. Variants exist, such as discrete cosine transform that does not operate on the complex domain and projects an input signal to only cosine functions oscillating at different frequencies. However, this is a fundamental limitation, and it needs to be more suboptimal. The first one is that all kernels are sinusoidal: What if we could have some kernels adapted or learned according to the problem? What if we can use neural architectures for this? We show how one can learn these kernels from scratch for audio signal processing applications. We find that the neural architecture not only learns sinusoidal kernel shapes but discovers all kinds of incredible signal-processing properties. E.g., windowing functions, onset detectors, high pass filters, low pass filters, modulations, etc. Further, upon analysis of the filters, we find that the neural architecture has a comb filter-like structure on top of the learned kernels. Comb filters that allow harmonic frequencies to pass through are one of the core building blocks/types of filters similar to high-pass, low-pass, and band-pass filters of various traditional signal processing algorithms. Further, we can also use the convolution operation with a signal to be learned from scratch, and we will explore papers in the literature that uses this with that robust Transformer architectures. Further, we would also explore making the learned kernel's content adaptive, i.e., learning different kernels for different inputs.
    摘要 The first question is: What if we could have some kernels adapted or learned according to the problem? What if we can use neural architectures for this? We show how one can learn these kernels from scratch for audio signal processing applications. We find that the neural architecture not only learns sinusoidal kernel shapes but discovers all kinds of incredible signal-processing properties, such as windowing functions, onset detectors, high pass filters, low pass filters, modulations, etc.Further, upon analysis of the filters, we find that the neural architecture has a comb filter-like structure on top of the learned kernels. Comb filters that allow harmonic frequencies to pass through are one of the core building blocks/types of filters similar to high-pass, low-pass, and band-pass filters of various traditional signal processing algorithms.Furthermore, we can also use the convolution operation with a signal to be learned from scratch, and we will explore papers in the literature that use this with robust Transformer architectures. Additionally, we will also explore making the learned kernel's content adaptive, i.e., learning different kernels for different inputs.

Unsupervised Opinion Aggregation – A Statistical Perspective

  • paper_url: http://arxiv.org/abs/2308.10386
  • repo_url: None
  • paper_authors: Noyan C. Sevuktekin, Andrew C. Singer
  • for: 本研究旨在开发一种基于专家意见的统计方法,以估计每位专家的能力水平,无需知道真实的状况。
  • methods: 该方法基于专家意见之间的相似性,通过度量每位专家与其他专家之间的相似性来衡量专家的能力水平。
  • results: 研究表明,更可靠的专家更有可能与其他专家相似,并且提出了一种不需要真实状况的完全无监督版本的朴素贝叶斯分类器,可以在许多问题上达到极限优的性能。
    Abstract Complex decision-making systems rarely have direct access to the current state of the world and they instead rely on opinions to form an understanding of what the ground truth could be. Even in problems where experts provide opinions without any intention to manipulate the decision maker, it is challenging to decide which expert's opinion is more reliable -- a challenge that is further amplified when decision-maker has limited, delayed, or no access to the ground truth after the fact. This paper explores a statistical approach to infer the competence of each expert based on their opinions without any need for the ground truth. Echoing the logic behind what is commonly referred to as \textit{the wisdom of crowds}, we propose measuring the competence of each expert by their likeliness to agree with their peers. We further show that the more reliable an expert is the more likely it is that they agree with their peers. We leverage this fact to propose a completely unsupervised version of the na\"{i}ve Bayes classifier and show that the proposed technique is asymptotically optimal for a large class of problems. In addition to aggregating a large block of opinions, we further apply our technique for online opinion aggregation and for decision-making based on a limited the number of opinions.
    摘要 “复杂决策系统通常没有直接访问现实世界的现状,而是基于专家们的意见来形成决策者的认知。即使专家们没有意图欺骗决策者,也是困难决定哪位专家的意见更可靠——这种问题在决策者没有访问现实世界后的情况下更为复杂。这篇论文探讨了一种统计方法,用于无需真实世界的访问来评估专家的能力。根据专家们之间的一致性来衡量专家的能力,这种逻辑类似于“群智”的思想。我们表明,更可靠的专家更有可能与其他专家一致,并且我们可以利用这一点来提出一种无监督的普遍投票分类器。我们还证明这种技术在一定的问题类型上是可以达到极限优的。此外,我们还应用这种技术于在线意见集成和基于有限数量的意见决策。”

False Negative/Positive Control for SAM on Noisy Medical Images

  • paper_url: http://arxiv.org/abs/2308.10382
  • repo_url: https://github.com/xyimaging/FNPC
  • paper_authors: Xing Yao, Han Liu, Dewei Hu, Daiwei Lu, Ange Lou, Hao Li, Ruining Deng, Gabriel Arenas, Baris Oguz, Nadav Schwartz, Brett C Byram, Ipek Oguz
  • for: 这篇论文主要是为了提高Segment Anything Model(SAM)在医学影像分割 tasks中的表现和稳定性。
  • methods: 该论文提出了一种基于多个 bounding box 的提高 SAM 表现的测试阶段提高技术,以及一种基于 aleatoric uncertainty 的 false negative 和 false positive 修正策略 (FNPC)。
  • results: 在两个ultrasound数据集上,该方法能够提高 SAM 的表现和对不准确提示的Robustness,而无需进一步的训练或调整。此外, authors 还提出了 Single-Slice-to-Volume (SS2V) 方法,允许通过只需 bounding box 注解的单个2D slice,实现3D pixel-level segmentation。
    Abstract The Segment Anything Model (SAM) is a recently developed all-range foundation model for image segmentation. It can use sparse manual prompts such as bounding boxes to generate pixel-level segmentation in natural images but struggles in medical images such as low-contrast, noisy ultrasound images. We propose a refined test-phase prompt augmentation technique designed to improve SAM's performance in medical image segmentation. The method couples multi-box prompt augmentation and an aleatoric uncertainty-based false-negative (FN) and false-positive (FP) correction (FNPC) strategy. We evaluate the method on two ultrasound datasets and show improvement in SAM's performance and robustness to inaccurate prompts, without the necessity for further training or tuning. Moreover, we present the Single-Slice-to-Volume (SS2V) method, enabling 3D pixel-level segmentation using only the bounding box annotation from a single 2D slice. Our results allow efficient use of SAM in even noisy, low-contrast medical images. The source code will be released soon.
    摘要 “seg anything模型(SAM)是一种最近研发的全范围基础模型,用于图像分割。它可以使用稀疏手动提示(如 bounding box)来生成图像中像素级分割,但在医疗图像(如低对比度、噪声射频图像)中表现不佳。我们提出了一种改进SAM在医疗图像分割中表现的测试阶段提示修复技术。该方法结合多个盒子提示的增强和 aleatoric 不确定性基于 false-negative(FN)和 false-positive(FP) corrections(FNPC)策略。我们在两个射频图像集上评估了该方法,并显示了SAM的表现和准确度的改进,而无需进一步训练或调整。此外,我们还提出了单片到体积(SS2V)方法,可以使用只有单个2Dslice的 bounding box注解来实现3D像素级分割。我们的结果允许SAM在噪声低对比度的医疗图像中进行高效使用。代码将即将发布。”

A Human-on-the-Loop Optimization Autoformalism Approach for Sustainability

  • paper_url: http://arxiv.org/abs/2308.10380
  • repo_url: None
  • paper_authors: Ming Jin, Bilgehan Sel, Fnu Hardeep, Wotao Yin
  • for: solves personalized energy-related problems using large language models (LLMs)
  • methods: augments an LLM with an optimization solver, enhancing its proficiency in understanding and responding to user specifications and preferences while providing nonlinear reasoning capabilities.
  • results: enables LLMs to analyze, explain, and tackle a variety of instance-specific energy-related problems, pushing beyond the limits of current prompt-based techniques.
    Abstract This paper outlines a natural conversational approach to solving personalized energy-related problems using large language models (LLMs). We focus on customizable optimization problems that necessitate repeated solving with slight variations in modeling and are user-specific, hence posing a challenge to devising a one-size-fits-all model. We put forward a strategy that augments an LLM with an optimization solver, enhancing its proficiency in understanding and responding to user specifications and preferences while providing nonlinear reasoning capabilities. Our approach pioneers the novel concept of human-guided optimization autoformalism, translating a natural language task specification automatically into an optimization instance. This enables LLMs to analyze, explain, and tackle a variety of instance-specific energy-related problems, pushing beyond the limits of current prompt-based techniques. Our research encompasses various commonplace tasks in the energy sector, from electric vehicle charging and Heating, Ventilation, and Air Conditioning (HVAC) control to long-term planning problems such as cost-benefit evaluations for installing rooftop solar photovoltaics (PVs) or heat pumps. This pilot study marks an essential stride towards the context-based formulation of optimization using LLMs, with the potential to democratize optimization processes. As a result, stakeholders are empowered to optimize their energy consumption, promoting sustainable energy practices customized to personal needs and preferences.
    摘要 Our novel approach, called human-guided optimization autoformalism, automatically translates a natural language task specification into an optimization instance. This enables LLMs to analyze, explain, and solve a variety of instance-specific energy-related problems, beyond the limitations of current prompt-based techniques.Our research encompasses common tasks in the energy sector, including electric vehicle charging, Heating, Ventilation, and Air Conditioning (HVAC) control, and long-term planning problems such as cost-benefit evaluations for installing rooftop solar photovoltaics (PVs) or heat pumps. This pilot study marks an essential step towards context-based optimization using LLMs, with the potential to democratize optimization processes and empower stakeholders to optimize their energy consumption, promoting sustainable energy practices customized to personal needs and preferences.

Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models

  • paper_url: http://arxiv.org/abs/2308.10379
  • repo_url: None
  • paper_authors: Bilgehan Sel, Ahmad Al-Tawaha, Vanshaj Khattar, Lu Wang, Ruoxi Jia, Ming Jin
  • for: 提高大语言模型(LLM)的理解能力,超越传统的“链条”方法。
  • methods: 使用算法逻辑路径来驱动LLM的思维,开拓新的在 контекст学习模式。
  • results: 比单个查询方法更高效,与一种使用广泛搜索算法的多查询方法相当,并且发现LLM可以通过算法指导超越算法本身的性能。
    Abstract Current literature, aiming to surpass the "Chain-of-Thought" approach, often resorts to an external modus operandi involving halting, modifying, and then resuming the generation process to boost Large Language Models' (LLMs) reasoning capacities. This mode escalates the number of query requests, leading to increased costs, memory, and computational overheads. Addressing this, we propose the Algorithm of Thoughts -- a novel strategy that propels LLMs through algorithmic reasoning pathways, pioneering a new mode of in-context learning. By employing algorithmic examples, we exploit the innate recurrence dynamics of LLMs, expanding their idea exploration with merely one or a few queries. Our technique outperforms earlier single-query methods and stands on par with a recent multi-query strategy that employs an extensive tree search algorithm. Intriguingly, our results suggest that instructing an LLM using an algorithm can lead to performance surpassing that of the algorithm itself, hinting at LLM's inherent ability to weave its intuition into optimized searches. We probe into the underpinnings of our method's efficacy and its nuances in application.
    摘要 当前的文献,尝试超越“链式思维”方法,经常采用外部的模式,包括停止、修改、然后继续生成过程,以提高大语言模型(LLM)的思维能力。这种模式会增加查询请求数量,导致成本、内存和计算负担的增加。为此,我们提出了思维算法——一种新的策略,通过算法的思维路径来驱动LLM,开拓一种新的在Context中学习模式。通过使用算法的示例,我们利用LLM的内生循环动力,扩展其想法探索,只需一些或一个查询。我们的技术超过了单个查询方法,与最近的多查询策略,使用广泛的树查找算法相当。有趣的是,我们的结果表明,通过向LLM提供算法的指导,可以让LLM的表现超越算法本身,这表明LLM具有自然的思维搜索优化能力。我们进一步探讨我们的方法的效果和其应用中的细节。

  • paper_url: http://arxiv.org/abs/2308.11462
  • repo_url: https://github.com/hazyresearch/legalbench
  • paper_authors: Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan H. Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, Zehua Li
  • for: 这个论文的目的是为了研究大语言模型在法律领域中可以完成哪些类型的法律推理。
  • methods: 这篇论文使用了一个名为LegalBench的法律推理benchmark,该benchmark包含了162个任务,涵盖了六种不同的法律推理类型。
  • results: 这篇论文通过对20个开源和商业大语言模型进行实证评估,并示出了LegalBench可以帮助研究人员 explore这些模型在法律领域中的应用。
    Abstract The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.
    摘要 大量语言模型(LLM)的出现和法律社区的采用让人们问到了哪些类型的法律推理可以由LLM完成?为了推动这个问题的研究,我们提出了法律推理benchmark:legalbench,这是一个由162个任务组成的合作构建的法律推理benchmark,涵盖了六种不同的法律推理类型。legalbench通过了跨学科的过程建造,我们收集了由法律专业人员设计和手工制作的任务。由于这些专业人员在建构过程中扮演了主导角色,任务中的推理能力都是实用的或者法律专业人员感兴趣的。此外,我们还展示了法律框架,这些框架可以用来描述不同类型的法律推理,并与LLM开发者和法律专业人员共享一个语言,从而促进了不同领域之间的交流。本文介绍了legalbench,对20个开源和商业LLM进行了实证评估,并 illustrate了legalbench可以推动的研究探索。

Explaining Emergence

  • paper_url: http://arxiv.org/abs/2308.10912
  • repo_url: https://github.com/benjaminpatrickevans/BRATS
  • paper_authors: Hervé Zwirn
  • for: 本研究探讨了 Emergence 这一概念,即在不同领域中出现的意外现象,以及这些现象如何被观察者主观地描述。
  • methods: 本研究使用了数学模型,探讨了一些具有简单和决定性规则的系统是否能够显示出 Emergent 行为。
  • results: 研究发现,即使系统具有简单和决定性规则,也可以显示出 Emergent 行为,这种行为被称为 computational irreducibility。这种新的概念可以帮助我们从 объек oriented 的角度理解 Emergent 现象。
    Abstract Emergence is a pregnant property in various fields. It is the fact for a phenomenon to appear surprisingly and to be such that it seems at first sight that it is not possible to predict its apparition. That is the reason why it has often been said that emergence is a subjective property relative to the observer. Some mathematical systems having very simple and deterministic rules nevertheless show emergent behavior. Studying these systems shed a new light on the subject and allows to define a new concept, computational irreducibility, which deals with behaviors that even though they are totally deterministic cannot be predicted without simulating them. Computational irreducibility is then a key for understanding emergent phenomena from an objective point of view that does not need the mention of any observer.
    摘要 emergence 是一种在多个领域中表现出的潜在性。这是指现象在初次看到时显示出不可预测的特征,以至于看起来是不可能预测其出现。这也是为什么emergence 常被称为主观性的特征。 certain mathematical systems with very simple and deterministic rules nevertheless exhibit emergent behavior. studying these systems has shed new light on the subject and has led to the development of a new concept, computational irreducibility, which deals with behaviors that are totally deterministic but cannot be predicted without simulating them. computational irreducibility is therefore a key to understanding emergent phenomena from an objective perspective that does not rely on the mention of any observer.Note: The word "emergence" in Chinese is emergence (出现), and "computational irreducibility" is 计算不可逆性 (computational irreducibility).

Imaginations of WALL-E : Reconstructing Experiences with an Imagination-Inspired Module for Advanced AI Systems

  • paper_url: http://arxiv.org/abs/2308.10354
  • repo_url: None
  • paper_authors: Zeinab Sadat Taghavi, Soroush Gooran, Seyed Arshan Dalili, Hamidreza Amirzadeh, Mohammad Jalal Nematbakhsh, Hossein Sameti
  • For: The paper aims to introduce a novel Artificial Intelligence (AI) system that leverages the concept of imagination to process and generate deep and interpretable information across modalities.* Methods: The proposed system uses an imagination-inspired module that bridges the gap between textual inputs and other modalities, enriching the derived information based on previously learned experiences. The system employs large-scale models, specifically a Multimodal Large Language Model (MLLM), to extract meaningful information across modalities while primarily remaining unimodal.* Results: The system outperformed the best Large Language Models (LLM) on multiple tasks, including emotion recognition and question-answering, achieving Weighted F1 (WF1) scores of 46.74%, 25.23%, and Overall F1 (OF1) score of 17%, respectively, compared to 22.89%, 12.28%, and 7% from the well-performing LLM.Here are the three points in Simplified Chinese text:
  • for: 本研究旨在提出一种基于哲学和心理分析的人工智能系统,以便在不同模式之间进行深入的信息交换和理解。
  • methods: 该系统使用基于想象的模块,将文本输入与其他模式之间的差距bridge,通过以前学习的经验进行填充,以提高获得的信息的深度和多样性。
  • results: 系统在多个任务上,包括情感识别和问答,与最佳大语言模型(LLM)进行了比较,并在MELD、IEMOCAP和CoQA数据集上取得了Weighted F1(WF1)分数为46.74%、25.23%和17%,比LMM的最佳表现更高。
    Abstract In this paper, we introduce a novel Artificial Intelligence (AI) system inspired by the philosophical and psychoanalytical concept of imagination as a ``Re-construction of Experiences". Our AI system is equipped with an imagination-inspired module that bridges the gap between textual inputs and other modalities, enriching the derived information based on previously learned experiences. A unique feature of our system is its ability to formulate independent perceptions of inputs. This leads to unique interpretations of a concept that may differ from human interpretations but are equally valid, a phenomenon we term as ``Interpretable Misunderstanding". We employ large-scale models, specifically a Multimodal Large Language Model (MLLM), enabling our proposed system to extract meaningful information across modalities while primarily remaining unimodal. We evaluated our system against other large language models across multiple tasks, including emotion recognition and question-answering, using a zero-shot methodology to ensure an unbiased scenario that may happen by fine-tuning. Significantly, our system outperformed the best Large Language Models (LLM) on the MELD, IEMOCAP, and CoQA datasets, achieving Weighted F1 (WF1) scores of 46.74%, 25.23%, and Overall F1 (OF1) score of 17%, respectively, compared to 22.89%, 12.28%, and 7% from the well-performing LLM. The goal is to go beyond the statistical view of language processing and tie it to human concepts such as philosophy and psychoanalysis. This work represents a significant advancement in the development of imagination-inspired AI systems, opening new possibilities for AI to generate deep and interpretable information across modalities, thereby enhancing human-AI interaction.
    摘要 在这篇论文中,我们介绍了一种基于哲学和心理分析的人工智能(AI)系统,即“经验重建”。我们的AI系统具有基于想象的模块,可以将文本输入与其他模式相互转换,richard derived information based on previously learned experiences。我们的系统具有独特的特点,即能够独立地理解输入。这会导致对概念的解释不同于人类的解释,但具有相同的有效性,我们称之为“可理解的错误”。我们使用大规模模型,具体来说是多Modal Large Language Model(MLLM),使得我们提议的系统可以在不同模式之间提取有意义信息,同时保持主要的单模态。我们对其他大型语言模型进行了多个任务的评估,包括情感识别和问答,使用零扩展方法来保证无偏见的场景。结果显示,我们的系统在MELD、IEMOCAP和CoQA数据集上的WF1分数为46.74%、25.23%和总F1分数为17%,比最佳大型语言模型(LLM)高出22.89%、12.28%和7%。我们的目标是超越语言处理的统计视角,与人类概念相联系,如哲学和心理分析。这项工作代表了人工智能具有想象能力的系统的开发的一个重要突破,开启了新的可能性,让AI生成深层次可理解的信息 across modalities,从而提高人机交互。

A probabilistic analysis of selected notions of iterated conditioning under coherence

  • paper_url: http://arxiv.org/abs/2308.10338
  • repo_url: None
  • paper_authors: Lydia Castronovo, Giuseppe Sanfilippo
  • for: 本文研究了三值逻辑中的Iterated Conditionals,即 Cooper-Calabrese、de Finetti 和 Farrell 等人提出的 conditionals。
  • methods: 本文使用了三值逻辑的不同方法,包括 conjunction 和 disjunction among conditionals,以及 conditional random quantities 的概率传播规则。
  • results: 本文显示了 iterated conditionals 中的 compound probability theorem 和其他基本性质不受 Cooper-Calabrese、de Finetti 和 Farrell 等人的定义影响,但是可以通过使用 suitable random quantities 来满足这些性质。此外,本文还证明了一些 generalized versions of Bayes’ Rule 和 Modus Ponens 的有效性。
    Abstract It is well know that basic conditionals satisfy some desirable basic logical and probabilistic properties, such as the compound probability theorem, but checking the validity of these becomes trickier when we switch to compound and iterated conditionals. We consider de Finetti's notion of conditional as a three-valued object and as a conditional random quantity in the betting framework. We recall the notions of conjunction and disjunction among conditionals in selected trivalent logics. First, in the framework of specific three-valued logics we analyze the notions of iterated conditioning introduced by Cooper-Calabrese, de Finetti and Farrell, respectively. We show that the compound probability theorem and other basic properties are not preserved by these objects, by also computing some probability propagation rules. Then, for each trivalent logic we introduce an iterated conditional as a suitable random quantity which satisfies the compound prevision theorem and some of the desirable properties. We also check the validity of two generalized versions of Bayes' Rule for iterated conditionals. We study the p-validity of generalized versions of Modus Ponens and two-premise centering for iterated conditionals. Finally, we observe that all the basic properties are satisfied only by the iterated conditional mainly developed in recent papers by Gilio and Sanfilippo in the setting of conditional random quantities.
    摘要 “基本条件满足一些愉悦的基础逻辑和概率性质,如复杂概率定理,但检查其有效性变得更加困难当我们转移到复杂和迭代条件。我们使用de Finetti的条件定义为三值对象和在赌博框架中的条件随机量。我们提及选择的三值逻辑中的 conjunction 和 disjunction。首先,在特定的三值逻辑框架中,我们分析由Cooper-Calabrese、de Finetti和Farrell分别引入的迭代条件。我们显示这些对象不 preserved 复杂概率定理和其他基本性质,同时计算一些概率传播规则。然后,我们为每个三值逻辑引入一个适当的迭代条件,该满足复杂预测定理和一些愉悦性质。我们还验证了两个扩展版本的 bayes 规则的有效性。最后,我们发现所有的基本性质只有在 Gilio 和 Sanfilippo 在 conditional random quantities 的设定中的迭代条件中得到。”

A Study on Robustness and Reliability of Large Language Model Code Generation

  • paper_url: http://arxiv.org/abs/2308.10335
  • repo_url: None
  • paper_authors: Li Zhong, Zilong Wang
    for: 这个论文主要是为了评估大语言模型(LLM)生成的代码的可靠性和稳定性。methods: 作者使用了 StackOverflow 上的 1208 个编程问题和 24 种 Java API 来收集数据,并总结了这些 API 的常见错误模式。然后,他们使用现有的 популяр LLM 进行评估。results: 评估结果显示,即使使用 GPT-4,62% 的生成代码中都存在 API 错误,这些错误可能会在实际软件开发中导致不期望的后果。
    Abstract Recently, the large language models (LLMs) have shown extraordinary ability in understanding natural language and generating programming code. It has been a common practice of software engineers to consult LLMs when encountering coding questions. Although efforts have been made to avoid syntax errors and align the code with the intended semantics, the reliability and robustness of the code generationfrom LLMs have not yet been thoroughly studied. The executable code is not equivalent to the reliable and robust code, especially in the context of real-world software development.The misuse of APIs in the generated code could lead to severe problem, such as resource leaks, program crashes, etc.To make things worse, the users of LLM code generation services are actually the developers that are most vulnerable to these code that seems right -- They are always novice developers that are not familiar with the APIs that LLMs generate code for them. Therefore, they could hardly tell the misuse in the code generated by LLMs, which further facilitates the incorrect code applied in real-world software. Existing code evaluation benchmark and datasets focus on crafting small tasks such as programming questions in coding interviews, which however deviates from the problem that developers would ask LLM for real-world coding help. To fill the missing piece, in this work, we propose a dataset RobustAPI for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from StackOverflow on 24 representative Java APIs. We summarize thecommon misuse patterns of these APIs and evaluate them oncurrent popular LLMs. The evaluation results show that evenfor GPT-4, 62% of the generated code contains API misuses,which would cause unexpected consequences if the code isintroduced into real-world software.
    摘要 最近,大型自然语言模型(LLM)在理解自然语言和生成代码方面表现出了极高的能力。软件工程师们常常咨询LLM当遇到编程问题。虽有努力避免语法错误和对代码进行Semantic alignment,但LLM代码生成的可靠性和稳定性尚未得到了全面的研究。生成的代码不等于可靠和稳定的代码,特别是在实际软件开发中。生成代码中的APImisuse可能导致严重问题,如资源泄露、程序崩溃等。worse still,使用LLM代码生成服务的用户通常是不熟悉这些API的新手 programmer,因此很难发现生成代码中的错误。这使得 incorrect code更容易在实际软件中应用。现有的代码评估标准和数据集都是为小型任务,如编程题目,而不是真实的软件开发问题。为填补这一缺失,在这个工作中,我们提出了一个名为RobustAPI的代码可靠性和稳定性评估数据集。我们收集了Stack Overflow上的1208个编程问题,并总结了24种常见API的错误模式。我们对当前流行的LLM进行了评估,结果显示,即使是GPT-4,62%的生成代码中包含API错误,这些错误会在实际软件中产生意外的后果。

UAV 3-D path planning based on MOEA/D with adaptive areal weight adjustment

  • paper_url: http://arxiv.org/abs/2308.10307
  • repo_url: None
  • paper_authors: Yougang Xiao, Hao Yang, Huan Liu, Keyu Wu, Guohua Wu
  • for: 该论文旨在提出一种基于分解的多目标演化算法(MOEA/D)以及一种适应区重量调整策略(AAWA),以达到让机器人飞行路径长度和地形威胁之间做出平衡。
  • methods: 该论文使用了一种基于分解的多目标演化算法(MOEA/D),并采用了一种适应区重量调整策略(AAWA)以提高解决方案的多样性。
  • results: 论文通过对二十个人工enario和四个实际enario进行比较,证明了MOEA/D-AAWA的效果。
    Abstract Unmanned aerial vehicles (UAVs) are desirable platforms for time-efficient and cost-effective task execution. 3-D path planning is a key challenge for task decision-making. This paper proposes an improved multi-objective evolutionary algorithm based on decomposition (MOEA/D) with an adaptive areal weight adjustment (AAWA) strategy to make a tradeoff between the total flight path length and the terrain threat. AAWA is designed to improve the diversity of the solutions. More specifically, AAWA first removes a crowded individual and its weight vector from the current population and then adds a sparse individual from the external elite population to the current population. To enable the newly-added individual to evolve towards the sparser area of the population in the objective space, its weight vector is constructed by the objective function value of its neighbors. The effectiveness of MOEA/D-AAWA is validated in twenty synthetic scenarios with different number of obstacles and four realistic scenarios in comparison with other three classical methods.
    摘要 无人飞行器(UAV)是一种高效且经济的任务执行平台。三维路径规划是任务决策中的关键挑战。本文提出了基于分解(MOEA/D)的改进多目标进化算法,并与适应面积质量调整策略(AAWA)结合,以实现路径总长度和地形威胁之间的让担。AAWA是用于提高解的多样性。具体来说,AAWA首先从当前人口中移除拥挤的个体和其加重向量,然后从外部卓越人口中添加一个稀疏个体到当前人口中。为让新增加的个体演化向稀疏的区域,其加重向量由近宠的目标函数值构建。MOEA/D-AAWA的效果在二十个 synthetic 场景中与不同数量的障碍物以及四个实际场景进行比较,证明了其效果的 Validation。

Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video

  • paper_url: http://arxiv.org/abs/2308.10305
  • repo_url: https://github.com/kasvii/pmce
  • paper_authors: Yingxuan You, Hong Liu, Ti Wang, Wenhao Li, Runwei Ding, Xia Li
  • for: 这篇论文主要是用于提出一种基于视频的3D人体模型生成方法,以解决现有的视频基于方法一般会遇到复杂的人体姿势和形状参数的估计问题。
  • methods: 该方法使用了一个两栅Encoder来估计中帧3D人体姿势和从输入图像序列中提取时间特征。此外,我们还设计了一个协同演化解码器来实现人体姿势和图像引导的AdaLN来使 pose和mesh与人体身体形状相匹配。
  • results: 对于三个标准测试集(3DPW、Human3.6M和MPI-INF-3DHP),我们的提出的PMCE方法在每帧精度和时间一致性方面都超过了先前的状态OF-the-art方法。
    Abstract Despite significant progress in single image-based 3D human mesh recovery, accurately and smoothly recovering 3D human motion from a video remains challenging. Existing video-based methods generally recover human mesh by estimating the complex pose and shape parameters from coupled image features, whose high complexity and low representation ability often result in inconsistent pose motion and limited shape patterns. To alleviate this issue, we introduce 3D pose as the intermediary and propose a Pose and Mesh Co-Evolution network (PMCE) that decouples this task into two parts: 1) video-based 3D human pose estimation and 2) mesh vertices regression from the estimated 3D pose and temporal image feature. Specifically, we propose a two-stream encoder that estimates mid-frame 3D pose and extracts a temporal image feature from the input image sequence. In addition, we design a co-evolution decoder that performs pose and mesh interactions with the image-guided Adaptive Layer Normalization (AdaLN) to make pose and mesh fit the human body shape. Extensive experiments demonstrate that the proposed PMCE outperforms previous state-of-the-art methods in terms of both per-frame accuracy and temporal consistency on three benchmark datasets: 3DPW, Human3.6M, and MPI-INF-3DHP. Our code is available at https://github.com/kasvii/PMCE.
    摘要 尽管单一图像基于的3D人体凝固得到了显著的进步,但从视频中准确地回归3D人体运动仍然是一个挑战。现有的视频基本方法通常通过计算复杂的姿势和形态参数从相关的图像特征来回归人体网格, whose high complexity and low representation ability often result in inconsistent pose motion and limited shape patterns. To address this issue, we introduce 3D pose as the intermediary and propose a Pose and Mesh Co-Evolution network (PMCE) that decouples this task into two parts: 1) video-based 3D human pose estimation and 2) mesh vertices regression from the estimated 3D pose and temporal image feature. Specifically, we propose a two-stream encoder that estimates mid-frame 3D pose and extracts a temporal image feature from the input image sequence. In addition, we design a co-evolution decoder that performs pose and mesh interactions with the image-guided Adaptive Layer Normalization (AdaLN) to make pose and mesh fit the human body shape. Extensive experiments demonstrate that the proposed PMCE outperforms previous state-of-the-art methods in terms of both per-frame accuracy and temporal consistency on three benchmark datasets: 3DPW, Human3.6M, and MPI-INF-3DHP. Our code is available at https://github.com/kasvii/PMCE.Here's the translation in Traditional Chinese:尽管单一图像基于的3D人体凝固已经取得了显著的进步,但从视频中准确地回归3D人体动作仍然是一个挑战。现有的视频基本方法通常通过计算复杂的姿势和形态参数从相关的图像特征来回归人体网格, whose high complexity and low representation ability often result in inconsistent pose motion and limited shape patterns. To address this issue, we introduce 3D pose as the intermediary and propose a Pose and Mesh Co-Evolution network (PMCE) that decouples this task into two parts: 1) video-based 3D human pose estimation and 2) mesh vertices regression from the estimated 3D pose and temporal image feature. Specifically, we propose a two-stream encoder that estimates mid-frame 3D pose and extracts a temporal image feature from the input image sequence. In addition, we design a co-evolution decoder that performs pose and mesh interactions with the image-guided Adaptive Layer Normalization (AdaLN) to make pose and mesh fit the human body shape. Extensive experiments demonstrate that the proposed PMCE outperforms previous state-of-the-art methods in terms of both per-frame accuracy and temporal consistency on three benchmark datasets: 3DPW, Human3.6M, and MPI-INF-3DHP. Our code is available at https://github.com/kasvii/PMCE.

cs.CL - 2023-08-21

Zero- and Few-Shot Prompting with LLMs: A Comparative Study with Fine-tuned Models for Bangla Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2308.10783
  • repo_url: None
  • paper_authors: Md. Arid Hasan, Shudipta Das, Afiyat Anjum, Firoj Alam, Anika Anjum, Avijit Sarker, Sheak Rashed Haider Noori
  • for: 这研究旨在提供大量手动标注的孟加拉新闻推文和Facebook评论数据集,以及在孟加拉语言模型中进行零或几回shot学习的研究。
  • methods: 本研究使用了多种语言模型,包括Flan-T5、GPT-4和Bloomz,并进行了比较分析。
  • results: 研究发现,单语言变换器基本模型在零和几回shot场景下 consistently outperform其他模型。
    Abstract The rapid expansion of the digital world has propelled sentiment analysis into a critical tool across diverse sectors such as marketing, politics, customer service, and healthcare. While there have been significant advancements in sentiment analysis for widely spoken languages, low-resource languages, such as Bangla, remain largely under-researched due to resource constraints. Furthermore, the recent unprecedented performance of Large Language Models (LLMs) in various applications highlights the need to evaluate them in the context of low-resource languages. In this study, we present a sizeable manually annotated dataset encompassing 33,605 Bangla news tweets and Facebook comments. We also investigate zero- and few-shot in-context learning with several language models, including Flan-T5, GPT-4, and Bloomz, offering a comparative analysis against fine-tuned models. Our findings suggest that monolingual transformer-based models consistently outperform other models, even in zero and few-shot scenarios. To foster continued exploration, we intend to make this dataset and our research tools publicly available to the broader research community. In the spirit of further research, we plan to make this dataset and our experimental resources publicly accessible to the wider research community.
    摘要 随着数字世界的快速扩张,情感分析已成为多个领域的重要工具,包括市场营销、政治、客户服务和医疗。虽然拥有广泛的进步,低资源语言,如孟加拉语,仍然受到资源约束,而且没有充分研究。此外,最近的不同应用场景中大语言模型(LLMs)的突出表现,强调了对低资源语言的评估。本研究提供了33605个孟加拉语新闻推文和Facebook评论的手动标注数据集。我们还 investigate了零和几个shot在Context中学习,包括Flan-T5、GPT-4和Bloomz等语言模型,并对这些模型进行比较分析。我们的发现表明,单语言变换器基模型在零和几个shot情况下一直表现出优于其他模型。为了促进进一步的探索,我们计划将这个数据集和我们的研究工具公开提供给更广泛的研究人员。

DepreSym: A Depression Symptom Annotated Corpus and the Role of LLMs as Assessors of Psychological Markers

  • paper_url: http://arxiv.org/abs/2308.10758
  • repo_url: None
  • paper_authors: Anxo Pérez, Marcos Fernández-Pichel, Javier Parapar, David E. Losada
  • for: 该论文旨在探讨计算方法如何从在线文章中检测抑郁症状。
  • methods: 该论文使用了现有的 traces of depression from online publications 和 Beck Depression Inventory-II (BDI-II) 等方法进行研究。
  • results: 该论文提出了一个新的搜索任务,并提供了21580个已标注的句子,以便进一步探讨抑郁症状的检测。
    Abstract Computational methods for depression detection aim to mine traces of depression from online publications posted by Internet users. However, solutions trained on existing collections exhibit limited generalisation and interpretability. To tackle these issues, recent studies have shown that identifying depressive symptoms can lead to more robust models. The eRisk initiative fosters research on this area and has recently proposed a new ranking task focused on developing search methods to find sentences related to depressive symptoms. This search challenge relies on the symptoms specified by the Beck Depression Inventory-II (BDI-II), a questionnaire widely used in clinical practice. Based on the participant systems' results, we present the DepreSym dataset, consisting of 21580 sentences annotated according to their relevance to the 21 BDI-II symptoms. The labelled sentences come from a pool of diverse ranking methods, and the final dataset serves as a valuable resource for advancing the development of models that incorporate depressive markers such as clinical symptoms. Due to the complex nature of this relevance annotation, we designed a robust assessment methodology carried out by three expert assessors (including an expert psychologist). Additionally, we explore here the feasibility of employing recent Large Language Models (ChatGPT and GPT4) as potential assessors in this complex task. We undertake a comprehensive examination of their performance, determine their main limitations and analyze their role as a complement or replacement for human annotators.
    摘要 计算方法用于检测抑郁症状通过在互联网上发布的文章中挖掘抑郁症状的迹象。然而,现有的解决方案具有限制性和可读性问题。为了解决这些问题,latest studies have shown that identifying depressive symptoms can lead to more robust models。eRisk initiative 推动了这个领域的研究,并提出了一个新的排名任务,旨在开发搜索方法,以找到与抑郁症状相关的句子。这个搜索挑战基于 Beck Depression Inventory-II (BDI-II) 问卷中的症状列表,这种问卷在临床实践中广泛使用。我们从参与者系统的结果中提取了21580个标注过的句子,这些句子根据它们与 BDI-II 症状的相关性进行标注。标注的句子来自多种多样的排名方法,最终的数据集成为一个重要的资源,用于提高包含抑郁标志的模型的发展。由于这种相关性标注的复杂性,我们采用了一种Robust评估方法,由三名专家评估器(包括一名专业心理学家)进行评估。此外,我们还 explore了使用最新的大语言模型(ChatGPT和GPT4)作为这种复杂任务的评估者的可能性。我们进行了全面的评估,确定了它们的主要局限性,并分析了它们在这种任务中的角色,是否可以取代或补充人类评估器。

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

  • paper_url: http://arxiv.org/abs/2308.10755
  • repo_url: https://github.com/opendatalab/WanJuan1.0
  • paper_authors: Conghui He, Zhenjiang Jin, Chao Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, Jiaqi Wang, Dahua Lin
  • For: The paper aims to address the lack of transparency and scarcity of open-source data in the development of large language models (LLMs) and multimodal language models (MLLMs) by presenting a large-scale multimodal dataset called “Wan Juan”.* Methods: The paper uses a large-scale dataset called “Wan Juan” which includes text, image-text, and video modalities, with a total volume exceeding 2TB, to train a model called InternLM.* Results: The paper demonstrates the effectiveness of the “Wan Juan” dataset and the InternLM model in multi-dimensional evaluations, showing significant advantages over models of a similar scale.Here are the three key points in Simplified Chinese text:* For: 本研究目的是为了解决大语言模型(LLM)和多modal语言模型(MLLM)的开发中数据的缺乏透明度和开源数据的稀缺,通过提出一个大规模多Modal的数据集“wan Juan”。* Methods: 本研究使用的是一个大规模的数据集“wan Juan”,包括文本、图片文本和视频模式,总量超过2TB,用于训练一个名为InternLM的模型。* Results: 本研究 demonstates “wan Juan”数据集和InternLM模型在多维评估中的效果,显示与类似规模的模型相比有显著优势。
    Abstract The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models(LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further developments within the community. As a response, this paper presents "Wan Juan", a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.
    摘要 “chatgpt”和“gpt-4”的流行化使得大型模型的开发速度得到了 significatively加速,这导致了许多出色的大语言模型(LLMs)和多Modal大语言模型(MLLMs)的创造。这些顶尖模型的卓越表现归功于高质量的数据。然而,领先的模型训练数据的细节经常被保密,这与开源数据的缺乏使得后续的发展受阻。为此,本文介绍了“万娟”,一个大规模多Modal数据集,包括中英文数据,从各种网络源收集而来。该数据集包括文本、图像文本和视频模式,总体量超过2TB。它被用于InternLM模型的训练,该模型在多维评估中表现出了明显的优势。所有数据可以在https://opendatalab.org.cn/WanJuan1.0中下载。

Systematic Offensive Stereotyping (SOS) Bias in Language Models

  • paper_url: http://arxiv.org/abs/2308.10684
  • repo_url: None
  • paper_authors: Fatma Elsafoury
  • For: This paper investigates the systematic offensive stereotype (SOS) bias in language models (LMs) and its impact on their performance and fairness in the task of hate speech detection.* Methods: The authors propose a method to measure the SOS bias in LMs and validate it using a dataset of tweets. They also investigate the effectiveness of debias methods from the literature on removing the SOS bias.* Results: All the inspected LMs are found to be SOS biased, and the SOS bias is reflective of the hate experienced online by marginalized groups. The authors find that removing the SOS bias using a popular debias method leads to worse SOS bias scores, and there is evidence that the SOS bias is impactful on the fairness of the LMs but not their performance on hate speech detection.Here is the same information in Simplified Chinese text:* For: 这篇论文研究了语言模型(LM)中的系统性的负面刻板偏见(SOS)偏见和其对于词汇检测任务的性能和公平性的影响。* Methods: 作者们提出了一种测试SOS偏见的方法,并使用一个推文数据集验证了这种方法。他们还 investigate了文献中的debias方法对于去除SOS偏见的效果。* Results: 所有检查的LM都被发现具有SOS偏见,并且这种偏见与在网络上受到歧视的受试人群的偏见相关。 removing SOS偏见使用文献中的popular debias方法会导致SOS偏见得分更差,并且发现SOS偏见对LM的公平性有影响,但对于词汇检测任务的性能没有强有力的证据。
    Abstract Research has shown that language models (LMs) are socially biased. However, toxicity and offensive stereotyping bias in LMs are understudied. In this paper, we investigate the systematic offensive stereotype (SOS) bias in LMs. We propose a method to measure it. Then, we validate the SOS bias and investigate the effectiveness of debias methods from the literature on removing it. Finally, we investigate the impact of the SOS bias in LMs on their performance and their fairness on the task of hate speech detection. Our results suggest that all the inspected LMs are SOS biased. The results suggest that the SOS bias in LMs is reflective of the hate experienced online by the inspected marginalized groups. The results indicate that removing the SOS bias in LMs, using a popular debias method from the literature, leads to worse SOS bias scores. Finally, Our results show no strong evidence that the SOS bias in LMs is impactful on their performance on hate speech detection. On the other hand, there is evidence that the SOS bias in LMs is impactful on their fairness.
    摘要 研究表明,语言模型(LM)具有社会偏见。然而,语言模型中的排斥和负面刻板偏见尚未得到充分研究。在这篇论文中,我们调查了语言模型中的系统性排斥偏见(SOS)。我们提出了一种测量方法。然后,我们验证了SOS偏见的存在并研究了文献中的除偏见方法对其的效果。最后,我们 investigate了语言模型中SOS偏见对其性能和公平性的影响。我们发现所有检查的语言模型都具有SOS偏见。结果表明SOS偏见在语言模型中是对在线受到恐吓的弱化群体的表现。结果表明,使用文献中的受欢迎除偏见方法可以将SOS偏见从语言模型中除掉,但是这会导致SOS偏见的加大。最后,我们发现SOS偏见在语言模型中没有显著影响性能,但是它确实会影响公平性。

LibriWASN: A Data Set for Meeting Separation, Diarization, and Recognition with Asynchronous Recording Devices

  • paper_url: http://arxiv.org/abs/2308.10682
  • repo_url: None
  • paper_authors: Joerg Schmalenstroeer, Tobias Gburrek, Reinhold Haeb-Umbach
  • for: 本 dataset 用于测试适用于无线声音传感器网络的时钟同步算法、会议分离、记录系统和转录系统。
  • methods: 本 dataset 使用了五款智能手机和四款麦克风数组,共录制29个频道。
  • results: 本 dataset 包含两个不同的会议室的数据,并且提供了真实的会议分离信息。
    Abstract We present LibriWASN, a data set whose design follows closely the LibriCSS meeting recognition data set, with the marked difference that the data is recorded with devices that are randomly positioned on a meeting table and whose sampling clocks are not synchronized. Nine different devices, five smartphones with a single recording channel and four microphone arrays, are used to record a total of 29 channels. Other than that, the data set follows closely the LibriCSS design: the same LibriSpeech sentences are played back from eight loudspeakers arranged around a meeting table and the data is organized in subsets with different percentages of speech overlap. LibriWASN is meant as a test set for clock synchronization algorithms, meeting separation, diarization and transcription systems on ad-hoc wireless acoustic sensor networks. Due to its similarity to LibriCSS, meeting transcription systems developed for the former can readily be tested on LibriWASN. The data set is recorded in two different rooms and is complemented with ground-truth diarization information of who speaks when.
    摘要 我们现在介绍LibriWASN数据集,其设计与LibriCSS会议认知数据集相似,但与之不同的是,数据被随机布置在会议表格上并且采样时钟不同步。数据集使用了五款智能手机和四个麦克风数组,共录制29个频道。除此之外,数据集几乎与LibriCSS设计相同:同样的LibriSpeech句子通过八个喇叭放在会议表格周围播放,数据分为不同的说话重叠百分比下的子集。LibriWASN是为无线听写感知网络上的时钟同步算法、会议分离、分类和转录系统进行测试而设计的。由于与LibriCSS类似,已有的会议转录系统可以轻松地在LibriWASN上测试。数据集在两个不同的房间中录制,并提供了会议时间的真实分类信息。

BAN-PL: a Novel Polish Dataset of Banned Harmful and Offensive Content from Wykop.pl web service

  • paper_url: http://arxiv.org/abs/2308.10592
  • repo_url: https://github.com/ziliat-nask/ban-pl
  • paper_authors: Inez Okulska, Kinga Głąbińska, Anna Kołos, Agnieszka Karlińska, Emilia Wiśnios, Adam Nowakowski, Paweł Ellerik, Andrzej Prałat
  • for: 这篇论文旨在提供一个公共可用的波兰语社交媒体内容敏感词汇数据集,以便进一步提高自动敏感词汇检测技术的发展。
  • methods: 该论文使用了专业模糊器标注的社交媒体内容,包括文章和评论,并将其分为两个类别:“危险”和“中性”。数据集采集和处理过程得到了详细的描述,同时还提供了高级预处理脚本,如掩码词汇检测。
  • results: 该论文提供了一个名为BAN-PL的公共可用波兰语社交媒体内容敏感词汇数据集,包括691,662个社交媒体内容,并具有良好的分类性。
    Abstract Advances in automated detection of offensive language online, including hate speech and cyberbullying, require improved access to publicly available datasets comprising social media content. In this paper, we introduce BAN-PL, the first open dataset in the Polish language that encompasses texts flagged as harmful and subsequently removed by professional moderators. The dataset encompasses a total of 691,662 pieces of content from a popular social networking service, Wykop, often referred to as the "Polish Reddit", including both posts and comments, and is evenly distributed into two distinct classes: "harmful" and "neutral". We provide a comprehensive description of the data collection and preprocessing procedures, as well as highlight the linguistic specificity of the data. The BAN-PL dataset, along with advanced preprocessing scripts for, i.a., unmasking profanities, will be publicly available.
    摘要 “现代化的自动检测 hate speech 和 cyberbullying 在线上需要更好的公共数据集,包括社交媒体内容。在这篇文章中,我们介绍 BAN-PL,是波兰语言中第一个公开的 dataset,包括被评估为伤害的和后来被专业 Moderator 移除的文本。这个 dataset 包括 Wykop 社交网络服务上的 691,662 则内容,包括文章和评论,并对应到两个不同的类别:“伤害”和“中立”。我们提供了详细的数据收集和预处理程序程序,以及资料的语言特点。 BAN-PL dataset 、以及针对 i.a. 解除诅咒词的进阶预处理脚本,将公开提供。”

Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning

  • paper_url: http://arxiv.org/abs/2308.10585
  • repo_url: https://github.com/zirui-hit/bridge_for_numerical_reasoning
  • paper_authors: Dingzirui Wang, Longxu Dou, Wenbin Zhang, Junyu Zeng, Wanxiang Che
  • For: The paper aims to improve the performance of numerical reasoning models by using equations as Intermediate Meaning Representations (IMRs) and addressing two main problems: (1) theoretically proving that equations are more accurate than programs, and (2) improving the generation accuracy of equations with large language models (LLMs).* Methods: The proposed method, called Bridge, consists of two stages: (1) a proof proposition to compare the generation accuracy of different IMRs, and (2) a method to improve the generation accuracy of equations by reducing the tendency of generating constant expressions and programs.* Results: The proposed method achieves state-of-the-art performance on three datasets (GSM8K, SVAMP, and Algebra) under the single reasoning path setting, with an improvement of 2.2%, 0.9%, and 1.7% compared to previous methods.Here is the Chinese version of the information:* For: 本研究旨在提高数理逻辑模型的性能,通过使用方程作为中间意义表示(IMR),并解决两个主要问题:(1) 理论上证明方程的生成精度高于程序,以及(2) 使用大语言模型(LLM)生成方程的精度提高。* Methods: 提议的方法称为“桥”,包括两个阶段:(1) 一个证明方程的生成精度比较propulsion,以及(2) 一种改进LLM生成方程的方法,即减少常量表达和程序的生成倾向。* Results: 提议的方法在三个 datasets(GSM8K、SVAMP、Algebra)下,在单个逻辑路径设置下达到了现有最佳性能,与之前的方法相比,提高2.2%、0.9%和1.7%。
    Abstract Numerical reasoning is vital for natural language processing models to understand and process numerical information in real-world scenarios. Most current methods first generate the Intermediate Meaning Representations (IMRs) of questions and then generate answers. Current SOTA methods generate programs as IMRs with large language models (LLMs). Intuitively, equations have fewer restrictions and closer semantics to the question than programs, leading to higher generation accuracy. However, current LLMs generate equations worse than programs, where we assume that the equation data is rare in pre-training data compared to programs. So in this paper, we try to use equations as IMRs to solve the numerical reasoning task by addressing two problems: (1) Theoretically, how to prove that the equation is an IMR with higher generation accuracy than programs; (2) Empirically, how to improve the generation accuracy of equations with LLMs. For the first problem, we propose and prove a proposition to theoretically compare the generation accuracy of different IMRs. For the second problem, we present a method called Boosting Numerical Reason\textbfing by Decomposing the Generation of Equations (Bridge), which can improve the accuracy of LLMs in generating equations as IMRs by reducing the tendency of generating constant expressions and programs. Our method improves the performance by 2.2%, 0.9%, and 1.7% on GSM8K, SVAMP, and Algebra datasets compared to the previous state-of-the-art methods under the single reasoning path setting. Our codes and prompts are released in https://github.com/zirui-HIT/Bridge_for_Numerical_Reasoning.
    摘要 现代自然语言处理模型需要数学逻辑能力来理解和处理实际场景中的数字信息。大多数当前方法首先生成问题的中间意义表示(IMR),然后生成答案。当前最佳方法使用大型自然语言模型(LLM)生成程序作为IMR,然而,当前LLM生成Equation的能力较差。我们假设Equation数据在预训练数据中较少,导致LLM生成Equation的能力差。因此,在这篇论文中,我们尝试使用Equation作为IMR来解决数学逻辑任务,并解决两个问题:1. theoretically,如何证明Equation是IMR,并且与程序相比具有更高的生成精度?2. empirically,如何使用LLM来改进Equation的生成精度?为了解决第一个问题,我们提出和证明了一个命题来比较不同IMR的生成精度。为了解决第二个问题,我们提出了一种方法called Bridge,它可以通过减少生成常量表达和程序的倾向来改进LLM对Equation的生成精度。我们的方法在GSM8K、SVAMP和Algebra datasets上比前一代方法提高了2.2%、0.9%和1.7%的性能。我们的代码和提问在https://github.com/zirui-HIT/Bridge_for_Numerical_Reasoning上发布。

Weakly synchronous systems with three machines are Turing powerful

  • paper_url: http://arxiv.org/abs/2308.10578
  • repo_url: None
  • paper_authors: Cinzia Di Giusto, Davide Ferré, Etienne Lozes, Nicolas Nisse
  • for: This paper studies communicating finite-state machines (CFMs) as a model of weakly synchronous distributed systems, and the reachability problem for such systems.
  • methods: The paper considers weakly synchronous systems in which processes communicate through phases where messages are first sent and then received, and studies the configuration reachability problem for these systems.
  • results: The paper shows that configuration reachability remains undecidable for weakly synchronous systems with only three processes. The result is heavily inspired by a study of the treewidth of the Message Sequence Charts (MSCs) that such systems can generate.
    Abstract Communicating finite-state machines (CFMs) are a Turing powerful model of asynchronous message-passing distributed systems. In weakly synchronous systems, processes communicate through phases in which messages are first sent and then received, for each process. Such systems enjoy a limited form of synchronization, and for some communication models, this restriction is enough to make the reachability problem decidable. In particular, we explore the intriguing case of p2p (FIFO) communication, for which the reachability problem is known to be undecidable for four processes, but decidable for two. We show that the configuration reachability problem for weakly synchronous systems of three processes is undecidable. This result is heavily inspired by our study on the treewidth of the Message Sequence Charts (MSCs) that might be generated by such systems. In this sense, the main contribution of this work is a weakly synchronous system with three processes that generates MSCs of arbitrarily large treewidth.

Software Entity Recognition with Noise-Robust Learning

  • paper_url: http://arxiv.org/abs/2308.10564
  • repo_url: https://github.com/taidnguyen/software_entity_recognition
  • paper_authors: Tai Nguyen, Yifeng Di, Joohan Lee, Muhao Chen, Tianyi Zhang
  • for: Enabling software engineering (SE) techniques such as automated documentation, traceability link recovery, and API recommendation by recognizing software entities in free-form text.
  • methods: Leverages the Wikipedia taxonomy and large-scale labeled data to build a lexicon of 79K unique software entities in 12 fine-grained types, together with a labeled dataset of over 1.7M sentences, and proposes self-regularization, a noise-robust learning approach, to train the model.
  • results: Models trained with self-regularization outperform both their vanilla counterparts and state-of-the-art approaches on the Wikipedia benchmark and two Stack Overflow benchmarks, showing that noise-robust learning improves software entity recognition.
    Abstract Recognizing software entities such as library names from free-form text is essential to enable many software engineering (SE) technologies, such as traceability link recovery, automated documentation, and API recommendation. While many approaches have been proposed to address this problem, they suffer from small entity vocabularies or noisy training data, hindering their ability to recognize software entities mentioned in sophisticated narratives. To address this challenge, we leverage the Wikipedia taxonomy to develop a comprehensive entity lexicon with 79K unique software entities in 12 fine-grained types, as well as a large labeled dataset of over 1.7M sentences. Then, we propose self-regularization, a noise-robust learning approach, to the training of our software entity recognition (SER) model by accounting for many dropouts. Results show that models trained with self-regularization outperform both their vanilla counterparts and state-of-the-art approaches on our Wikipedia benchmark and two Stack Overflow benchmarks. We release our models, data, and code for future research.
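
The abstract describes self-regularization as noise-robust training that accounts for many dropouts. A common way to realize that idea, and the assumption behind this sketch (not necessarily the paper's exact loss), is to run several stochastic dropout passes and penalize disagreement between their predictions:

```python
# Sketch: consistency regularization across K dropout passes for a token
# classifier. `model` is any network with dropout; names are illustrative.
import torch
import torch.nn.functional as F

def self_regularized_loss(model, tokens, labels, k=4, alpha=1.0):
    model.train()  # keep dropout active so each pass is stochastic
    logits = [model(tokens) for _ in range(k)]  # K noisy (B, T, C) outputs
    ce = sum(F.cross_entropy(l.transpose(1, 2), labels) for l in logits) / k

    # Penalize divergence of each pass from the mean prediction.
    probs = [F.softmax(l, dim=-1) for l in logits]
    mean_p = torch.stack(probs).mean(dim=0)
    consistency = sum(
        F.kl_div(p.clamp_min(1e-8).log(), mean_p, reduction="batchmean")
        for p in probs
    ) / k
    return ce + alpha * consistency
```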

SeqGPT: An Out-of-the-box Large Language Model for Open Domain Sequence Understanding

  • paper_url: http://arxiv.org/abs/2308.10529
  • repo_url: https://github.com/alibaba-nlp/seqgpt
  • paper_authors: Tianyu Yu, Chengyue Jiang, Chao Lou, Shen Huang, Xiaobin Wang, Wei Liu, Jiong Cai, Yangning Li, Yinghui Li, Kewei Tu, Hai-Tao Zheng, Ningyu Zhang, Pengjun Xie, Fei Huang, Yong Jiang
  • for: SeqGPT is an open-source autoregressive model specially enhanced for open-domain natural language understanding (NLU).
  • methods: SeqGPT expresses all NLU tasks with two atomic tasks, and is specialized via instruction tuning on fine-grained data synthesized by ChatGPT, followed by further fine-tuning on 233 atomic tasks from 152 datasets.
  • results: SeqGPT shows decent classification and extraction ability across domains and datasets, and can perform language understanding tasks on unseen domains.
    Abstract Large language models (LLMs) have shown impressive ability for open-domain NLP tasks. However, LLMs are sometimes too footloose for natural language understanding (NLU) tasks which always have restricted output and input format. Their performances on NLU tasks are highly related to prompts or demonstrations and are shown to be poor at performing several representative NLU tasks, such as event extraction and entity typing. To this end, we present SeqGPT, a bilingual (i.e., English and Chinese) open-source autoregressive model specially enhanced for open-domain natural language understanding. We express all NLU tasks with two atomic tasks, which define fixed instructions to restrict the input and output format but still ``open'' for arbitrarily varied label sets. The model is first instruction-tuned with extremely fine-grained labeled data synthesized by ChatGPT and then further fine-tuned by 233 different atomic tasks from 152 datasets across various domains. The experimental results show that SeqGPT has decent classification and extraction ability, and is capable of performing language understanding tasks on unseen domains. We also conduct empirical studies on the scaling of data and model size as well as on the transfer across tasks. Our model is accessible at https://github.com/Alibaba-NLP/SeqGPT.
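
The two atomic tasks fix the input and output format while leaving the label set open; the sketch below illustrates that design, with prompt wording that is an assumption for illustration rather than SeqGPT's released template.

```python
# Sketch: all NLU tasks expressed as two atomic tasks whose I/O format is
# fixed but whose label sets are arbitrary. Templates are placeholders.

def classification_task(text, labels):
    # Atomic task 1: choose from a caller-supplied, arbitrary label set.
    return (f"Input: {text}\n"
            f"Candidate labels: {', '.join(labels)}\n"
            f"Output: the matching labels")

def extraction_task(text, types):
    # Atomic task 2: extract spans for caller-supplied, arbitrary types.
    return (f"Input: {text}\n"
            f"Entity types: {', '.join(types)}\n"
            f"Output: one '<span>: <type>' pair per line")

print(classification_task("The movie was dreadful.", ["positive", "negative"]))
print(extraction_task("Alan Turing was born in London.", ["person", "location"]))
```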

GradientCoin: A Peer-to-Peer Decentralized Large Language Models

  • paper_url: http://arxiv.org/abs/2308.10502
  • repo_url: None
  • paper_authors: Yeqi Gao, Zhao Song, Junze Yin
  • for: The paper proposes a decentralized large language model (LLM) inspired by the Bitcoin electronic cash system, to address the centralized control and trust issues of existing LLMs.
  • methods: The paper borrows techniques and concepts from the Bitcoin electronic cash system and presents a purely theoretical design for a Bitcoin-like decentralized LLM.
  • results: The proposed decentralized LLM could address the centralized-control and trust issues of current models, but implementing such a system would face various practical difficulties, and the new system is unlikely to outperform the standard Bitcoin system economically.
    Abstract Since 2008, after the proposal of a Bitcoin electronic cash system, Bitcoin has fundamentally changed the economic system over the last decade. Since 2022, large language models (LLMs) such as GPT have outperformed humans in many real-life tasks. However, these large language models have several practical issues. For example, the model is centralized and controlled by a specific unit. One weakness is that if that unit decides to shut down the model, it cannot be used anymore. The second weakness is the lack of guaranteed discrepancy behind this model, as certain dishonest units may design their own models and feed them unhealthy training data. In this work, we propose a purely theoretical design of a decentralized LLM that operates similarly to a Bitcoin cash system. However, implementing such a system might encounter various practical difficulties. Furthermore, this new system is unlikely to perform better than the standard Bitcoin system in economics. Therefore, the motivation for designing such a system is limited. It is likely that only two types of people would be interested in setting up a practical system for it: (1) those who prefer to use a decentralized ChatGPT-like software, and (2) those who believe that the purpose of carbon-based life is to create silicon-based life, such as Optimus Prime in Transformers. The reason the second type of people may be interested is that it is possible that one day an AI system like this will awaken and become the next level of intelligence on this planet.

An Effective Method using Phrase Mechanism in Neural Machine Translation

  • paper_url: http://arxiv.org/abs/2308.10482
  • repo_url: https://github.com/phuongnm94/PhraseTransformer
  • paper_authors: Phuong Minh Nguyen, Le Minh Nguyen
  • for: Improving the Neural Machine Translation (NMT) system for Vietnamese-Chinese parallel corpora.
  • methods: Using a phrase mechanism, PhraseTransformer, based on the Transformer model, to improve the system for Vietnamese-Chinese parallel corpora.
  • results: Achieved BLEU scores of 35.3 and 33.2 on Vietnamese to Chinese and Chinese to Vietnamese data, respectively, on the VLSP 2022 competition MT dataset.
    Abstract Machine Translation is one of the essential tasks in Natural Language Processing (NLP), which has massive applications in real life as well as contributing to other tasks in the NLP research community. Recently, Transformer-based methods have attracted numerous researchers in this domain and achieved state-of-the-art results in most pair languages. In this paper, we report an effective method using a phrase mechanism, PhraseTransformer, to improve the strong baseline model Transformer in constructing a Neural Machine Translation (NMT) system for the parallel corpora Vietnamese-Chinese. Our experiments on the MT dataset of the VLSP 2022 competition achieved a BLEU score of 35.3 on Vietnamese to Chinese and 33.2 on Chinese to Vietnamese data. Our code is available at https://github.com/phuongnm94/PhraseTransformer.

Implicit Self-supervised Language Representation for Spoken Language Diarization

  • paper_url: http://arxiv.org/abs/2308.10470
  • repo_url: None
  • paper_authors: Jagabandhu Mishra, S. R. Mahadeva Prasanna
  • for: This paper studies spoken language diarization (LD) in code-switched scenarios, with an emphasis on implicit frameworks that can be adapted to low/zero resource languages.
  • methods: Three frameworks are proposed, based on (1) fixed segmentation, (2) change point-based segmentation, and (3) an end-to-end (E2E) model. They use x-vectors as the implicit language representation, with the analysis window length ($N$) tuned for best performance.
  • results: On the synthetic TTSF-LD dataset, x-vectors with an appropriate $N$ reach parity with explicit LD, and the E2E framework achieves the best implicit LD performance (a Jaccard error rate, JER, of 6.38). On the practical Microsoft CS (MSCS) dataset, however, performance degrades to a JER of 60.4, mostly because the monolingual segment durations of the secondary language are distributed differently. Avoiding segment smoothing requires a small $N$, but with a small $N$ the x-vector representation cannot capture the required language discrimination, since the same speaker utters both languages. The study therefore proposes a self-supervised implicit language representation, which yields a relative improvement of 63.9% over x-vectors and a JER of 21.8 with the E2E framework.
    Abstract In a code-switched (CS) scenario, the use of spoken language diarization (LD) as a pre-processing system is essential. Further, the use of implicit frameworks is preferable over the explicit framework, as it can be easily adapted to deal with low/zero resource languages. Inspired by the speaker diarization (SD) literature, three frameworks based on (1) fixed segmentation, (2) change point-based segmentation and (3) E2E are proposed to perform LD. The initial exploration with the synthetic TTSF-LD dataset shows that using x-vectors as the implicit language representation with an appropriate analysis window length ($N$) can achieve performance on par with explicit LD. The best implicit LD performance of $6.38$ in terms of Jaccard error rate (JER) is achieved by using the E2E framework. However, with the E2E framework the performance of implicit LD degrades to $60.4$ when using the practical Microsoft CS (MSCS) dataset. The difference in performance is mostly due to the distributional difference between the monolingual segment durations of the secondary language in the MSCS and TTSF-LD datasets. Moreover, to avoid segment smoothing, the smaller duration of the monolingual segments suggests the use of a small value of $N$. At the same time, with small $N$, the x-vector representation is unable to capture the required language discrimination due to the acoustic similarity, as the same speaker is speaking both languages. Therefore, to resolve the issue, a self-supervised implicit language representation is proposed in this study. In comparison with the x-vector representation, the proposed representation provides a relative improvement of $63.9\%$ and achieved a JER of $21.8$ using the E2E framework.

Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.10462
  • repo_url: None
  • paper_authors: Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, Houari Sahraoui
  • for: This study explores how to specialize large language models (LLMs) to task-specific data in order to improve their code generation ability.
  • methods: The study applies Parameter-Efficient Fine-Tuning (PEFT) techniques to specialize LLMs and conducts an extensive experimental study of their effect in the automated code generation scenario.
  • results: The experiments show that PEFT techniques can efficiently specialize LLMs to task-specific data, improving code generation performance while reducing the computational burden compared with in-context learning.
    Abstract Large Language Models (LLMs) possess impressive capabilities to generate meaningful code snippets given natural language intents in zero-shot, i.e., without the need for specific fine-tuning. In the perspective of unleashing their full potential, prior work has demonstrated the benefits of fine-tuning the models to task-specific data. However, fine-tuning process demands heavy computational costs and is intractable when resources are scarce, especially for models with billions of parameters. In light of these challenges, previous studies explored In-Context Learning (ICL) as an effective strategy to generate contextually appropriate code without fine-tuning. However, it operates at inference time and does not involve learning task-specific parameters, potentially limiting the model's performance on downstream tasks. In this context, we foresee that Parameter-Efficient Fine-Tuning (PEFT) techniques carry a high potential for efficiently specializing LLMs to task-specific data. In this paper, we deliver a comprehensive study of LLMs with the impact of PEFT techniques under the automated code generation scenario. Our experimental results reveal the superiority and potential of such techniques over ICL on a wide range of LLMs in reducing the computational burden and improving performance. Therefore, the study opens opportunities for broader applications of PEFT in software engineering scenarios.
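
As a concrete instance of a PEFT technique, the sketch below applies LoRA through the Hugging Face peft library; the checkpoint and target module names are illustrative assumptions that vary by architecture, not a configuration reported in the paper.

```python
# Sketch: LoRA fine-tuning of a causal LM for code generation. Only the
# small adapter matrices are trained; the base weights stay frozen.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Salesforce/codegen-350M-mono"  # illustrative small code LLM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=8,                          # rank of the low-rank update
    lora_alpha=16,                # scaling of the update
    lora_dropout=0.05,
    target_modules=["qkv_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a tiny fraction of the base model
# Fine-tune `model` on (intent, code) pairs with any standard training loop.
```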

Comparing Measures of Linguistic Diversity Across Social Media Language Data and Census Data at Subnational Geographic Areas

  • paper_url: http://arxiv.org/abs/2308.10452
  • repo_url: None
  • paper_authors: Sidney G. -J. Wong, Jonathan Dunn, Benjamin Adams
  • for: This study compares the linguistic ecology of online spaces (social media language data) and real-world spaces (subnational administrative areas of Aotearoa New Zealand).
  • methods: The study compares measures of linguistic diversity between social media language data and census data for real-world areas.
  • results: The results suggest that social media language data can be used to observe spatial and temporal changes in linguistic diversity at subnational geographic areas, but further work is needed to establish how well social media represents real-world behaviour.
    Abstract This paper describes a preliminary study on the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces in Aotearoa New Zealand (i.e., subnational administrative areas). We compare measures of linguistic diversity between these different spaces and discuss how social media users align with real-world populations. The results from the current study suggests that there is potential to use online social media language data to observe spatial and temporal changes in linguistic diversity at subnational geographic areas; however, further work is required to understand how well social media represents real-world behaviour.

Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts

  • paper_url: http://arxiv.org/abs/2308.10410
  • repo_url: None
  • paper_authors: Fan Gao, Hang Jiang, Moritz Blum, Jinghui Lu, Yuang Jiang, Irene Li
  • For: The paper assesses the capability of large language models (LLMs) to produce concise survey articles within the computer science-NLP domain.
  • Methods: The paper uses automated evaluations and human evaluations to assess the performance of GPT-3.5 and GPT-4 in generating survey articles on 20 chosen topics.
  • Results: GPT-4 outperforms GPT-3.5 in the automated evaluations, and human evaluators provide insights on the strengths and weaknesses of GPT-4, including instances of incomplete information and factual inaccuracies.
    Abstract Large Language Models (LLMs) have achieved significant success across various natural language processing (NLP) tasks, encompassing question-answering, summarization, and machine translation, among others. While LLMs excel in general tasks, their efficacy in domain-specific applications remains under exploration. Additionally, LLM-generated text sometimes exhibits issues like hallucination and disinformation. In this study, we assess LLMs' capability of producing concise survey articles within the computer science-NLP domain, focusing on 20 chosen topics. Automated evaluations indicate that GPT-4 outperforms GPT-3.5 when benchmarked against the ground truth. Furthermore, four human evaluators provide insights from six perspectives across four model configurations. Through case studies, we demonstrate that while GPT often yields commendable results, there are instances of shortcomings, such as incomplete information and the exhibition of lapses in factual accuracy.

LibriSQA: Advancing Free-form and Open-ended Spoken Question Answering with a Novel Dataset and Framework

  • paper_url: http://arxiv.org/abs/2308.10390
  • repo_url: https://github.com/zihanzhaosjtu/librisqa
  • paper_authors: Zihan Zhao, Yiyang Jiang, Heyang Liu, Yanfeng Wang, Yu Wang
  • for: This work aims to improve the ability of large language models (LLMs) to handle multimodal functionalities, particularly for the Spoken Question Answering (SQA) task, which requires precise alignment and deep interaction between speech and text features.
  • methods: The authors curate LibriSQA, a free-form and open-ended SQA dataset built from Librispeech, and propose a lightweight, end-to-end framework to execute the SQA task; by reforming ASR into the SQA format, they further show that the framework can handle ASR tasks.
  • results: Experiments on LibriSQA show significant results, demonstrating that LLMs can align and comprehend multimodal information and paving the way for universal multimodal LLMs.
    Abstract While Large Language Models (LLMs) have demonstrated commendable performance across a myriad of domains and tasks, existing LLMs still exhibit a palpable deficit in handling multimodal functionalities, especially for the Spoken Question Answering (SQA) task which necessitates precise alignment and deep interaction between speech and text features. To address the SQA challenge on LLMs, we initially curated the free-form and open-ended LibriSQA dataset from Librispeech, comprising Part I with natural conversational formats and Part II encompassing multiple-choice questions followed by answers and analytical segments. Both parts collectively include 107k SQA pairs that cover various topics. Given the evident paucity of existing speech-text LLMs, we propose a lightweight, end-to-end framework to execute the SQA task on the LibriSQA, witnessing significant results. By reforming ASR into the SQA format, we further substantiate our framework's capability in handling ASR tasks. Our empirical findings bolster the LLMs' aptitude for aligning and comprehending multimodal information, paving the way for the development of universal multimodal LLMs. The dataset and demo can be found at https://github.com/ZihanZhaoSJTU/LibriSQA.

cantnlp@LT-EDI@RANLP-2023: Homophobia/Transphobia Detection in Social Media Comments using Spatio-Temporally Retrained Language Models

  • paper_url: http://arxiv.org/abs/2308.10370
  • repo_url: None
  • paper_authors: Sidney G. -J. Wong, Matthew Durward, Benjamin Adams, Jonathan Dunn
  • for: The paper was written to develop a multiclass classification system for detecting homophobic and transphobic content in social media comments across five languages.
  • methods: The authors used a BERT-based language model and retrained a transformer-based cross-language pretrained language model, XLMRoBERTa, with spatially and temporally relevant social media language data. They also retrained a subset of models with simulated script-mixed social media language data.
  • results: The authors developed the best performing seven-label classification system for Malayalam, with variable performance for other language and class-label conditions. The inclusion of spatio-temporal data improved the classification performance for all language and task conditions compared to the baseline, and the results suggest that transformer-based language classification systems are sensitive to register-specific and language-specific retraining.
    Abstract This paper describes our multiclass classification system developed as part of the LTEDI@RANLP-2023 shared task. We used a BERT-based language model to detect homophobic and transphobic content in social media comments across five language conditions: English, Spanish, Hindi, Malayalam, and Tamil. We retrained a transformer-based crosslanguage pretrained language model, XLMRoBERTa, with spatially and temporally relevant social media language data. We also retrained a subset of models with simulated script-mixed social media language data with varied performance. We developed the best performing seven-label classification system for Malayalam based on weighted macro averaged F1 score (ranked first out of six) with variable performance for other language and class-label conditions. We found the inclusion of this spatio-temporal data improved the classification performance for all language and task conditions when compared with the baseline. The results suggests that transformer-based language classification systems are sensitive to register-specific and language-specific retraining.

Economic Policy Uncertainty: A Review on Applications and Measurement Methods with Focus on Text Mining Methods

  • paper_url: http://arxiv.org/abs/2308.10304
  • repo_url: None
  • paper_authors: Fatemeh Kaveh-Yazdy, Sajjad Zarifzadeh
  • for: This paper focuses on the measurement of Economic Policy Uncertainty (EPU) and its impact on investments, unemployment rates, and recessions.
  • methods: The paper reviews and compares three major groups of EPU measurement methods: financial parameter-based, text mining-based, and implied uncertainty-based methods.
  • results: The paper surveys the research areas that rely on measuring EPU indices and proposes a list of future research approaches focusing on textual material-based measurement methods.
    Abstract Economic Policy Uncertainty (EPU) represents the uncertainty realized by the investors during economic policy alterations. EPU is a critical indicator in economic studies to predict future investments, the unemployment rate, and recessions. EPU values can be estimated based on financial parameters directly or implied uncertainty indirectly using the text mining methods. Although EPU is a well-studied topic within the economy, the methods utilized to measure it are understudied. In this article, we define the EPU briefly and review the methods used to measure the EPU, and survey the areas influenced by the changes in EPU level. We divide the EPU measurement methods into three major groups with respect to their input data. Examples of each group of methods are enlisted, and the pros and cons of the groups are discussed. Among the EPU measures, text mining-based ones are dominantly studied. These methods measure the realized uncertainty by taking into account the uncertainty represented in the news and publicly available sources of financial information. Finally, we survey the research areas that rely on measuring the EPU index with the hope that studying the impacts of uncertainty would attract further attention of researchers from various research fields. In addition, we propose a list of future research approaches focusing on measuring EPU using textual material.

cs.LG - 2023-08-21

Graph Neural Bandits

  • paper_url: http://arxiv.org/abs/2308.10808
  • repo_url: https://github.com/lasgroup/GNNBO
  • paper_authors: Yunzhe Qi, Yikun Ban, Jingrui He
  • for: This work proposes a recommendation framework based on graph neural networks that refines the recommendation strategy and addresses the exploitation-exploration dilemma.
  • methods: The framework models the "fine-grained" collaborative effects among users through estimated user graphs, and applies separate GNN-based models on these graphs for exploitation and adaptive exploration.
  • results: Theoretical analysis and experiments on multiple real datasets against state-of-the-art baselines demonstrate the effectiveness of the proposed GNB framework.
    Abstract Contextual bandits algorithms aim to choose the optimal arm with the highest reward out of a set of candidates based on the contextual information. Various bandit algorithms have been applied to real-world applications due to their ability of tackling the exploitation-exploration dilemma. Motivated by online recommendation scenarios, in this paper, we propose a framework named Graph Neural Bandits (GNB) to leverage the collaborative nature among users empowered by graph neural networks (GNNs). Instead of estimating rigid user clusters as in existing works, we model the "fine-grained" collaborative effects through estimated user graphs in terms of exploitation and exploration respectively. Then, to refine the recommendation strategy, we utilize separate GNN-based models on estimated user graphs for exploitation and adaptive exploration. Theoretical analysis and experimental results on multiple real data sets in comparison with state-of-the-art baselines are provided to demonstrate the effectiveness of our proposed framework.

DynED: Dynamic Ensemble Diversification in Data Stream Classification

  • paper_url: http://arxiv.org/abs/2308.10807
  • repo_url: https://github.com/soheilabadifard/dyned
  • paper_authors: Soheil Abadifard, Sepehr Bakhshi, Sanaz Gheibuni, Fazli Can
  • for: Improving classification accuracy in data stream environments, where changes in the data distribution (concept drift) degrade model performance.
  • methods: Uses MMR (Maximal Marginal Relevance) to dynamically combine ensemble components, balancing the prediction accuracy and diversity of the ensemble.
  • results: In experiments on four real and 11 synthetic datasets, the proposed method (DynED) achieves a higher average mean accuracy than five state-of-the-art baselines.
    Abstract Ensemble methods are commonly used in classification due to their remarkable performance. Achieving high accuracy in a data stream environment is a challenging task considering disruptive changes in the data distribution, also known as concept drift. A greater diversity of ensemble components is known to enhance prediction accuracy in such settings. Despite the diversity of components within an ensemble, not all contribute as expected to its overall performance. This necessitates a method for selecting components that exhibit high performance and diversity. We present a novel ensemble construction and maintenance approach based on MMR (Maximal Marginal Relevance) that dynamically combines the diversity and prediction accuracy of components during the process of structuring an ensemble. The experimental results on both four real and 11 synthetic datasets demonstrate that the proposed approach (DynED) provides a higher average mean accuracy compared to the five state-of-the-art baselines.
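
MMR-style selection trades a component's quality against its redundancy with components already chosen; the sketch below shows the generic MMR scoring rule, with accuracy and similarity inputs as illustrative placeholders rather than DynED's exact measures.

```python
# Sketch: MMR-based selection of ensemble components, balancing each
# component's accuracy against its similarity to already-selected members.
import numpy as np

def mmr_select(accuracies, similarity, k, lam=0.7):
    """accuracies: (n,) recent accuracy per component; similarity: (n, n)
    pairwise prediction similarity; lam trades accuracy vs. diversity."""
    selected, remaining = [], list(range(len(accuracies)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * accuracies[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

acc = np.array([0.90, 0.88, 0.70, 0.86])
sim = np.array([[1.00, 0.95, 0.20, 0.30],
                [0.95, 1.00, 0.25, 0.40],
                [0.20, 0.25, 1.00, 0.10],
                [0.30, 0.40, 0.10, 1.00]])
print(mmr_select(acc, sim, k=2))  # [0, 3]: the near-duplicate of 0 is skipped
```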

Differentiable Frank-Wolfe Optimization Layer

  • paper_url: http://arxiv.org/abs/2308.10806
  • repo_url: None
  • paper_authors: Zixuan Liu, Liu Liu, Xueqian Wang, Peilin Zhao
  • for: Improving the efficiency of differentiable optimization for large-scale problems.
  • methods: A differentiable layer (DFWLayer) based on rolling out the Frank-Wolfe algorithm.
  • results: The layer offers an efficient way to handle large-scale constrained problems, adhering tightly to constraints while providing fast computation.
    Abstract Differentiable optimization has received a significant amount of attention due to its foundational role in the domain of machine learning based on neural networks. Existing methods leverage the optimality conditions and implicit function theorem to obtain the Jacobian matrix of the output, which increases the computational cost and limits the application of differentiable optimization. In addition, some non-differentiable constraints lead to more challenges when using prior differentiable optimization layers. This paper proposes a differentiable layer, named the Differentiable Frank-Wolfe Layer (DFWLayer), by rolling out the Frank-Wolfe method, a well-known optimization algorithm which can solve constrained optimization problems without projections and Hessian matrix computations, thus leading to an efficient way of dealing with large-scale problems. Theoretically, we establish a bound on the suboptimality gap of the DFWLayer in the context of l1-norm constraints. Experimental assessments demonstrate that the DFWLayer not only attains competitive accuracy in solutions and gradients but also consistently adheres to constraints. Moreover, it surpasses the baselines in both forward and backward computational speeds.
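
To see why rolling out Frank-Wolfe yields a projection-free differentiable layer, the sketch below unrolls FW iterations over an l1-ball constraint in PyTorch, so every iterate is an ordinary autograd operation; the objective and step schedule are illustrative, not the paper's exact layer.

```python
# Sketch: an unrolled Frank-Wolfe solve over an l1 ball. Every iterate is
# built from torch operations, so the rollout sits inside autograd.
import torch

def fw_layer(grad_fn, radius, steps=30, dim=10):
    x = torch.zeros(dim)
    for t in range(steps):
        g = grad_fn(x)
        i = torch.argmax(g.abs())          # LMO over the l1 ball: the best
        s = torch.zeros_like(x)            # vertex is a signed, scaled
        s[i] = -radius * torch.sign(g[i])  # basis vector
        gamma = 2.0 / (t + 2)              # standard FW step size
        x = (1 - gamma) * x + gamma * s    # stays feasible, no projection
    return x

# Example: min ||x - b||^2 s.t. ||x||_1 <= 1, with b a learnable parameter.
b = torch.randn(10, requires_grad=True)
x_star = fw_layer(lambda x: 2 * (x - b), radius=1.0)
loss = ((x_star - b) ** 2).sum()
loss.backward()          # gradients reach b; note the FW vertices are
print(b.grad.norm())     # piecewise constant in b, which is one reason a
                         # dedicated layer beats naive unrolling for gradients
```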

Stabilizing Unsupervised Environment Design with a Learned Adversary

  • paper_url: http://arxiv.org/abs/2308.10797
  • repo_url: https://github.com/facebookresearch/dcd
  • paper_authors: Ishita Mediratta, Minqi Jiang, Jack Parker-Holder, Michael Dennis, Eugene Vinitsky, Tim Rocktäschel
  • for: train generally-capable agents and design training tasks that facilitate broad generalization and robustness to environment variations
  • methods: reinforcement learning (RL) to train a teacher policy to design tasks from scratch
  • results: proposed solutions to several key shortcomings of PAIRED, enabling PAIRED to match or exceed state-of-the-art methods in several established challenging procedurally-generated environments
    Abstract A key challenge in training generally-capable agents is the design of training tasks that facilitate broad generalization and robustness to environment variations. This challenge motivates the problem setting of Unsupervised Environment Design (UED), whereby a student agent trains on an adaptive distribution of tasks proposed by a teacher agent. A pioneering approach for UED is PAIRED, which uses reinforcement learning (RL) to train a teacher policy to design tasks from scratch, making it possible to directly generate tasks that are adapted to the agent's current capabilities. Despite its strong theoretical backing, PAIRED suffers from a variety of challenges that hinder its practical performance. Thus, state-of-the-art methods currently rely on curation and mutation rather than generation of new tasks. In this work, we investigate several key shortcomings of PAIRED and propose solutions for each shortcoming. As a result, we make it possible for PAIRED to match or exceed state-of-the-art methods, producing robust agents in several established challenging procedurally-generated environments, including a partially-observed maze navigation task and a continuous-control car racing environment. We believe this work motivates a renewed emphasis on UED methods based on learned models that directly generate challenging environments, potentially unlocking more open-ended RL training and, as a result, more general agents.

MGMAE: Motion Guided Masking for Video Masked Autoencoding

  • paper_url: http://arxiv.org/abs/2308.10794
  • repo_url: None
  • paper_authors: Bingkun Huang, Zhiyu Zhao, Guozhen Zhang, Yu Qiao, Limin Wang
  • For: This paper aims to improve the performance of video masked autoencoding by introducing a motion guided masking strategy to incorporate motion information during pre-training.
  • Methods: The proposed method, called Motion Guided Masked Autoencoder (MGMAE), uses an online efficient optical flow estimator and a backward masking map warping strategy to build a temporally consistent masking volume and track unmasked tokens in time.
  • Results: MGMAE outperforms the original VideoMAE on the Something-Something V2 and Kinetics-400 datasets, and visualization analysis illustrates the effectiveness of motion-adaptive sampling of temporally consistent cubes for video pre-training.
    Abstract Masked autoencoding has shown excellent performance on self-supervised video representation learning. Temporal redundancy has led to a high masking ratio and customized masking strategy in VideoMAE. In this paper, we aim to further improve the performance of video masked autoencoding by introducing a motion guided masking strategy. Our key insight is that motion is a general and unique prior in video, which should be taken into account during masked pre-training. Our motion guided masking explicitly incorporates motion information to build temporal consistent masking volume. Based on this masking volume, we can track the unmasked tokens in time and sample a set of temporal consistent cubes from videos. These temporal aligned unmasked tokens will further relieve the information leakage issue in time and encourage the MGMAE to learn more useful structure information. We implement our MGMAE with an online efficient optical flow estimator and backward masking map warping strategy. We perform experiments on the datasets of Something-Something V2 and Kinetics-400, demonstrating the superior performance of our MGMAE to the original VideoMAE. In addition, we provide the visualization analysis to illustrate that our MGMAE can sample temporal consistent cubes in a motion-adaptive manner for more effective video pre-training.

Instruction Tuning for Large Language Models: A Survey

  • paper_url: http://arxiv.org/abs/2308.10792
  • repo_url: None
  • paper_authors: Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, Guoyin Wang
  • for: This survey reviews instruction tuning (IT), a crucial technique for enhancing the capabilities and controllability of large language models (LLMs).
  • methods: The survey systematically reviews the literature, covering the general methodology of IT, the construction of instruction datasets, the training of IT models, and applications to different modalities, domains, and applications.
  • results: The survey analyzes factors that influence the outcome of IT (e.g., generation of instruction outputs and the size of the instruction dataset), reviews potential pitfalls of and criticism against IT, points out current deficiencies of existing strategies, and suggests avenues for fruitful research.
    Abstract This paper surveys research works in the quickly advancing field of instruction tuning (IT), a crucial technique to enhance the capabilities and controllability of large language models (LLMs). Instruction tuning refers to the process of further training LLMs on a dataset consisting of (instruction, output) pairs in a supervised fashion, which bridges the gap between the next-word prediction objective of LLMs and the users' objective of having LLMs adhere to human instructions. In this work, we make a systematic review of the literature, including the general methodology of IT, the construction of IT datasets, the training of IT models, and applications to different modalities, domains and applications, along with an analysis on aspects that influence the outcome of IT (e.g., generation of instruction outputs, size of the instruction dataset, etc.). We also review the potential pitfalls of IT along with criticism against it, along with efforts pointing out current deficiencies of existing strategies and suggest some avenues for fruitful research.
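
At its core, instruction tuning is supervised training on (instruction, output) pairs in which the loss is computed only on the output tokens. The sketch below shows that formatting step; the prompt template is a common convention assumed for illustration, as wording differs across projects.

```python
# Sketch: turning an (instruction, output) pair into training tensors,
# masking the loss on prompt tokens. Template is a convention, not a
# standard fixed by the survey.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

def build_example(instruction, output, ignore_index=-100):
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_ids = tok(prompt)["input_ids"]
    output_ids = tok(output + tok.eos_token)["input_ids"]
    input_ids = prompt_ids + output_ids
    # Next-token loss only where the model should produce the response.
    labels = [ignore_index] * len(prompt_ids) + output_ids
    return {"input_ids": input_ids, "labels": labels}

ex = build_example("List three primary colors.", "Red, yellow, and blue.")
print(len(ex["input_ids"]), ex["labels"][:5])  # prompt positions are masked
```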

Zero- and Few-Shot Prompting with LLMs: A Comparative Study with Fine-tuned Models for Bangla Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2308.10783
  • repo_url: None
  • paper_authors: Md. Arid Hasan, Shudipta Das, Afiyat Anjum, Firoj Alam, Anika Anjum, Avijit Sarker, Sheak Rashed Haider Noori
  • for: This paper studies sentiment analysis for Bangla, a low-resource language, and how large language models perform on it.
  • methods: The paper presents a manually annotated dataset of 33,605 Bangla news tweets and Facebook comments, and compares zero- and few-shot in-context learning with several language models, including Flan-T5, GPT-4, and Bloomz, against fine-tuned models.
  • results: Monolingual transformer-based models consistently outperform the other models, even in zero- and few-shot scenarios.
    Abstract The rapid expansion of the digital world has propelled sentiment analysis into a critical tool across diverse sectors such as marketing, politics, customer service, and healthcare. While there have been significant advancements in sentiment analysis for widely spoken languages, low-resource languages, such as Bangla, remain largely under-researched due to resource constraints. Furthermore, the recent unprecedented performance of Large Language Models (LLMs) in various applications highlights the need to evaluate them in the context of low-resource languages. In this study, we present a sizeable manually annotated dataset encompassing 33,605 Bangla news tweets and Facebook comments. We also investigate zero- and few-shot in-context learning with several language models, including Flan-T5, GPT-4, and Bloomz, offering a comparative analysis against fine-tuned models. Our findings suggest that monolingual transformer-based models consistently outperform other models, even in zero and few-shot scenarios. To foster continued exploration, we intend to make this dataset and our research tools publicly available to the broader research community. In the spirit of further research, we plan to make this dataset and our experimental resources publicly accessible to the wider research community.
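
Zero- and few-shot in-context learning differ only in whether labeled demonstrations precede the query. The sketch below builds both prompt styles; the wording and demonstrations are illustrative assumptions, not the paper's templates.

```python
# Sketch: zero-shot vs. few-shot prompts for sentiment classification.
# Wording and demonstrations are placeholders.
LABELS = ["positive", "neutral", "negative"]

def zero_shot(text):
    return (f"Classify the sentiment of the following Bangla text as "
            f"{', '.join(LABELS)}.\nText: {text}\nSentiment:")

def few_shot(text, demos):
    shots = "\n".join(f"Text: {t}\nSentiment: {y}" for t, y in demos)
    return (f"Classify the sentiment as {', '.join(LABELS)}.\n"
            f"{shots}\nText: {text}\nSentiment:")

demos = [("খবরটা খুব ভালো লাগলো।", "positive"),
         ("পরিষেবা একদম বাজে।", "negative")]
print(zero_shot("আজকের আবহাওয়া চমৎকার।"))
print(few_shot("আজকের আবহাওয়া চমৎকার।", demos))
```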

Sparse Linear Concept Discovery Models

  • paper_url: http://arxiv.org/abs/2308.10782
  • repo_url: https://github.com/konpanousis/conceptdiscoverymodels
  • paper_authors: Konstantinos P. Panousis, Dino Ienco, Diego Marcos
  • for: Improving the interpretability and performance of deep learning models.
  • methods: Uses a contrastive language-image model together with a single sparse linear layer, inferring concept presence via a data-driven Bernoulli distribution.
  • results: Achieves higher accuracy than other CBM approaches while yielding high per-example concept sparsity.
    Abstract The recent mass adoption of DNNs, even in safety-critical scenarios, has shifted the focus of the research community towards the creation of inherently interpretable models. Concept Bottleneck Models (CBMs) constitute a popular approach where hidden layers are tied to human understandable concepts allowing for investigation and correction of the network's decisions. However, CBMs usually suffer from: (i) performance degradation and (ii) lower interpretability than intended due to the sheer amount of concepts contributing to each decision. In this work, we propose a simple yet highly intuitive interpretable framework based on Contrastive Language Image models and a single sparse linear layer. In stark contrast to related approaches, the sparsity in our framework is achieved via principled Bayesian arguments by inferring concept presence via a data-driven Bernoulli distribution. As we experimentally show, our framework not only outperforms recent CBM approaches accuracy-wise, but it also yields high per example concept sparsity, facilitating the individual investigation of the emerging concepts.
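
A minimal sketch of the overall shape of such a model, assuming CLIP-style image-concept similarities as input and a relaxed Bernoulli (Gumbel-sigmoid) gate per concept; the gating below is a simplified stand-in for the paper's Bayesian treatment, and it gates concepts globally rather than per example.

```python
# Sketch: concept-similarity features -> a single sparse linear classifier.
# Relaxed Bernoulli gates let concept presence be learned from data; this
# stands in for, and does not reproduce, the paper's Bayesian formulation.
import torch
import torch.nn as nn

class SparseConceptClassifier(nn.Module):
    def __init__(self, n_concepts, n_classes):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(n_classes, n_concepts))
        self.gate_logits = nn.Parameter(torch.zeros(n_concepts))

    def forward(self, sims, temperature=0.5):
        # sims: (batch, n_concepts) image-concept similarities from CLIP.
        if self.training:  # relaxed Bernoulli sample (Gumbel-sigmoid)
            u = torch.rand_like(self.gate_logits)
            noise = torch.log(u) - torch.log1p(-u)
            z = torch.sigmoid((self.gate_logits + noise) / temperature)
        else:              # hard, deterministic gates at test time
            z = (torch.sigmoid(self.gate_logits) > 0.5).float()
        return (sims * z) @ self.weight.T

model = SparseConceptClassifier(n_concepts=512, n_classes=10)
logits = model(torch.randn(4, 512))
sparsity_penalty = torch.sigmoid(model.gate_logits).sum()  # add to the loss
```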

Mixed-Integer Projections for Automated Data Correction of EMRs Improve Predictions of Sepsis among Hospitalized Patients

  • paper_url: http://arxiv.org/abs/2308.10781
  • repo_url: None
  • paper_authors: Mehak Arora, Hassan Mortagy, Nathan Dwarshius, Swati Gupta, Andre L. Holder, Rishikesan Kamaleswaran
  • for: This study aims to improve the use of machine learning (ML) models in automating clinical decisions; a notable gap in prior work is the lack of proper handling of errors and outliers in Electronic Medical Record (EMR) data.
  • methods: The study introduces a projections-based method that integrates clinical expertise as domain constraints, generating important meta-data for use in ML workflows. Specifically, high-dimensional mixed-integer programs capturing physiological and biological constraints on patient vitals and lab values are used to correct patient data.
  • results: The framework improves the performance of ML classifiers for early sepsis detection, achieving an AUROC of 0.865 and a precision of 0.922, surpassing models without such projections.
    Abstract Machine learning (ML) models are increasingly pivotal in automating clinical decisions. Yet, a glaring oversight in prior research has been the lack of proper processing of Electronic Medical Record (EMR) data in the clinical context for errors and outliers. Addressing this oversight, we introduce an innovative projections-based method that seamlessly integrates clinical expertise as domain constraints, generating important meta-data that can be used in ML workflows. In particular, by using high-dimensional mixed-integer programs that capture physiological and biological constraints on patient vitals and lab values, we can harness the power of mathematical "projections" for the EMR data to correct patient data. Consequently, we measure the distance of corrected data from the constraints defining a healthy range of patient data, resulting in a unique predictive metric we term as "trust-scores". These scores provide insight into the patient's health status and significantly boost the performance of ML classifiers in real-life clinical settings. We validate the impact of our framework in the context of early detection of sepsis using ML. We show an AUROC of 0.865 and a precision of 0.922, that surpasses conventional ML models without such projections.
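
The trust-score idea, correcting a record by projecting it onto a feasible region and then measuring how far it had to move, can be sketched with simple box constraints; the paper uses high-dimensional mixed-integer programs, and the ranges below are illustrative placeholders, not clinical reference values.

```python
# Sketch: project vitals onto plausible ranges and use the size of the
# correction as a trust score. Box constraints stand in for the paper's
# mixed-integer programs; the bounds are illustrative only.
import numpy as np

BOUNDS = {                      # (low, high) per feature -- placeholders
    "heart_rate": (20.0, 250.0),
    "temp_c": (30.0, 44.0),
    "sbp": (40.0, 260.0),
}

def project_and_score(record):
    corrected, sq_dist = {}, 0.0
    for name, value in record.items():
        lo, hi = BOUNDS[name]
        fixed = min(max(value, lo), hi)  # Euclidean projection onto [lo, hi]
        corrected[name] = fixed
        sq_dist += (fixed - value) ** 2
    trust = float(np.sqrt(sq_dist))      # 0 means already within constraints
    return corrected, trust

rec = {"heart_rate": 980.0, "temp_c": 36.8, "sbp": 115.0}  # likely entry error
print(project_and_score(rec))  # heart_rate clipped; large distance flags it
```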

Spear and Shield: Adversarial Attacks and Defense Methods for Model-Based Link Prediction on Continuous-Time Dynamic Graphs

  • paper_url: http://arxiv.org/abs/2308.10779
  • repo_url: None
  • paper_authors: Dongjin Lee, Juho Lee, Kijung Shin
  • for: This paper focuses on investigating the vulnerabilities of Temporal Graph Neural Networks (TGNNs) against adversarial attacks, specifically for link prediction tasks on continuous-time dynamic graphs.
  • methods: The proposed method, T-SPEAR, injects edge perturbations into the data that are unnoticeable yet effective in causing malfunction in the victim model. Additionally, the proposed robust training approach, T-SHIELD, uses edge filtering and temporal smoothness to enhance the robustness of the victim model.
  • results: The paper demonstrates that T-SPEAR significantly degrades the victim model's performance on link prediction tasks, and the attacks are transferable to other TGNNs. Moreover, T-SHIELD effectively filters out adversarial edges and exhibits robustness against adversarial attacks, surpassing the link prediction performance of the naive TGNN by up to 11.2% under T-SPEAR.
    Abstract Real-world graphs are dynamic, constantly evolving with new interactions, such as financial transactions in financial networks. Temporal Graph Neural Networks (TGNNs) have been developed to effectively capture the evolving patterns in dynamic graphs. While these models have demonstrated their superiority, being widely adopted in various important fields, their vulnerabilities against adversarial attacks remain largely unexplored. In this paper, we propose T-SPEAR, a simple and effective adversarial attack method for link prediction on continuous-time dynamic graphs, focusing on investigating the vulnerabilities of TGNNs. Specifically, before the training procedure of a victim model, which is a TGNN for link prediction, we inject edge perturbations to the data that are unnoticeable in terms of the four constraints we propose, and yet effective enough to cause malfunction of the victim model. Moreover, we propose a robust training approach T-SHIELD to mitigate the impact of adversarial attacks. By using edge filtering and enforcing temporal smoothness to node embeddings, we enhance the robustness of the victim model. Our experimental study shows that T-SPEAR significantly degrades the victim model's performance on link prediction tasks, and even more, our attacks are transferable to other TGNNs, which differ from the victim model assumed by the attacker. Moreover, we demonstrate that T-SHIELD effectively filters out adversarial edges and exhibits robustness against adversarial attacks, surpassing the link prediction performance of the naive TGNN by up to 11.2% under T-SPEAR.

A Modular and Adaptive System for Business Email Compromise Detection

  • paper_url: http://arxiv.org/abs/2308.10776
  • repo_url: None
  • paper_authors: Jan Brabec, Filip Šrajer, Radek Starosta, Tomáš Sixta, Marc Dupont, Miloš Lenoch, Jiří Menšík, Florian Becker, Jakub Boros, Tomáš Pop, Pavel Novák
  • for: Detecting Business Email Compromise (BEC) and spear phishing attacks that target specific recipients.
  • methods: Combines multiple machine learning models and data modalities, including natural language understanding (NLU), to detect BEC-related behaviors across text, images, metadata, and the email's communication context.
  • results: Proven effective in a production environment for over two years, with the ability to adapt continuously to evolving attack techniques and to produce explainable verdicts.
    Abstract The growing sophistication of Business Email Compromise (BEC) and spear phishing attacks poses significant challenges to organizations worldwide. The techniques featured in traditional spam and phishing detection are insufficient due to the tailored nature of modern BEC attacks as they often blend in with the regular benign traffic. Recent advances in machine learning, particularly in Natural Language Understanding (NLU), offer a promising avenue for combating such attacks but in a practical system, due to limitations such as data availability, operational costs, verdict explainability requirements or a need to robustly evolve the system, it is essential to combine multiple approaches together. We present CAPE, a comprehensive and efficient system for BEC detection that has been proven in a production environment for a period of over two years. Rather than being a single model, CAPE is a system that combines independent ML models and algorithms detecting BEC-related behaviors across various email modalities such as text, images, metadata and the email's communication context. This decomposition makes CAPE's verdicts naturally explainable. In the paper, we describe the design principles and constraints behind its architecture, as well as the challenges of model design, evaluation and adapting the system continuously through a Bayesian approach that combines limited data with domain knowledge. Furthermore, we elaborate on several specific behavioral detectors, such as those based on Transformer neural architectures.

GBM-based Bregman Proximal Algorithms for Constrained Learning

  • paper_url: http://arxiv.org/abs/2308.10767
  • repo_url: https://github.com/zhenweilin/constrainedgbm
  • paper_authors: Zhenwei Lin, Qi Deng
  • for: This work develops a machine learning algorithm for more intricate constrained learning tasks that standard projection-based training cannot handle, such as Neyman-Pearson classification (NPC) and fairness classification.
  • methods: The work adapts gradient boosting machines (GBMs) to constrained learning via Bregman proximal algorithms, introducing a new Bregman primal-dual method with a global optimality guarantee when the objective and constraint functions are convex; for nonconvex functions, the algorithm remains effective under a Bregman proximal point framework.
  • results: Substantial experiments demonstrate the effectiveness of the framework on constrained learning applications such as NPC and fairness classification. The framework integrates seamlessly with publicly available GBM implementations such as XGBoost and LightGBM through their public interfaces, without changes to existing code or architecture.
    Abstract As the complexity of learning tasks surges, modern machine learning encounters a new constrained learning paradigm characterized by more intricate and data-driven function constraints. Prominent applications include Neyman-Pearson classification (NPC) and fairness classification, which entail specific risk constraints that render standard projection-based training algorithms unsuitable. Gradient boosting machines (GBMs) are among the most popular algorithms for supervised learning; however, they are generally limited to unconstrained settings. In this paper, we adapt the GBM for constrained learning tasks within the framework of Bregman proximal algorithms. We introduce a new Bregman primal-dual method with a global optimality guarantee when the learning objective and constraint functions are convex. In cases of nonconvex functions, we demonstrate how our algorithm remains effective under a Bregman proximal point framework. Distinct from existing constrained learning algorithms, ours possesses a unique advantage in its ability to seamlessly integrate with publicly available GBM implementations such as XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017), exclusively relying on their public interfaces. We provide substantial experimental evidence to showcase the effectiveness of the Bregman algorithm framework. While our primary focus is on NPC and fairness ML, our framework holds significant potential for a broader range of constrained learning applications. The source code is currently freely available at https://github.com/zhenweilin/ConstrainedGBM.

To Whom are You Talking? A Deep Learning Model to Endow Social Robots with Addressee Estimation Skills

  • paper_url: http://arxiv.org/abs/2308.10757
  • repo_url: None
  • paper_authors: Carlo Mazzola, Marta Romeo, Francesco Rea, Alessandra Sciutti, Angelo Cangelosi
  • for: Understanding the dynamics of human conversation so that robots can be integrated into social environments.
  • methods: A hybrid neural network combining convolutional and LSTM layers that interprets an utterance's addressee from the speaker's non-verbal bodily cues.
  • results: The model solves the addressee estimation problem, localising the addressee in space from a robot ego-centric point of view, in ecological scenarios.
    Abstract Communicating shapes our social world. For a robot to be considered social, and consequently integrated into our social environment, it is fundamental to understand some of the dynamics that rule human-human communication. In this work, we tackle the problem of Addressee Estimation, the ability to understand an utterance's addressee, by interpreting and exploiting non-verbal bodily cues from the speaker. We do so by implementing a hybrid deep learning model composed of convolutional layers and LSTM cells taking as input images portraying the face of the speaker and 2D vectors of the speaker's body posture. Our implementation choices were guided by the aim to develop a model that could be deployed on social robots and be efficient in ecological scenarios. We demonstrate that our model is able to solve the Addressee Estimation problem in terms of addressee localisation in space, from a robot ego-centric point of view.
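A minimal PyTorch sketch of the hybrid architecture the abstract describes: convolutional layers encode face crops, the features are concatenated with 2D body-posture vectors, and an LSTM integrates them over time. All layer sizes and the three-way output here are hypothetical, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AddresseeEstimator(nn.Module):
    def __init__(self, pose_dim=20, hidden=128, n_classes=3):
        super().__init__()
        self.cnn = nn.Sequential(                    # face-crop encoder
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(32 + pose_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)     # hypothetical classes

    def forward(self, faces, poses):
        # faces: (B, T, 3, H, W); poses: (B, T, pose_dim)
        B, T = faces.shape[:2]
        f = self.cnn(faces.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(torch.cat([f, poses], dim=-1))
        return self.head(out[:, -1])                 # classify last timestep

model = AddresseeEstimator()
logits = model(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 8, 20))
```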

On the Adversarial Robustness of Multi-Modal Foundation Models

  • paper_url: http://arxiv.org/abs/2308.10741
  • repo_url: None
  • paper_authors: Christian Schlarmann, Matthias Hein
  • for: Protecting honest users from malicious third-party content that misleads them or spreads fake information.
  • methods: Imperceptible adversarial attacks on images that change the caption output of multi-modal foundation models.
  • results: Shows that malicious content providers can use such attacks to guide users to malicious websites or broadcast fake information, so deployed multi-modal foundation models require countermeasures.
    Abstract Multi-modal foundation models combining vision and language models such as Flamingo or GPT-4 have recently gained enormous interest. Alignment of foundation models is used to prevent models from providing toxic or harmful output. While malicious users have successfully tried to jailbreak foundation models, an equally important question is whether honest users could be harmed by malicious third-party content. In this paper we show that imperceptible attacks on images in order to change the caption output of a multi-modal foundation model can be used by malicious content providers to harm honest users, e.g., by guiding them to malicious websites or broadcasting fake information. This indicates that countermeasures to adversarial attacks should be used by any deployed multi-modal foundation model.
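A generic sketch of the threat model, assuming a differentiable loss: projected gradient descent under an L-infinity budget pushes an image toward an attacker-chosen caption. Here `loss_fn` stands in for the negative log-likelihood of the target caption under some captioning model; the toy call at the end only exercises the mechanics.

```python
import torch

def pgd_targeted(loss_fn, image, eps=4 / 255, alpha=1 / 255, steps=100):
    x_adv = image.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad, = torch.autograd.grad(loss_fn(x_adv), x_adv)
        x_adv = x_adv.detach() - alpha * grad.sign()      # toward target
        x_adv = image + (x_adv - image).clamp(-eps, eps)  # stay in budget
        x_adv = x_adv.clamp(0, 1)                         # stay a valid image
    return x_adv.detach()

# toy usage with a stand-in differentiable loss:
x = torch.rand(1, 3, 32, 32)
x_adv = pgd_targeted(lambda z: ((z - 1.0) ** 2).mean(), x)
```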

We Don’t Need No Adam, All We Need Is EVE: On The Variance of Dual Learning Rate And Beyond

  • paper_url: http://arxiv.org/abs/2308.10740
  • repo_url: https://github.com/akhadangi/EVE
  • paper_authors: Afshin Khadangi
  • for: Optimising deep learning models.
  • methods: Applies different learning rates to distinct components of the gradients, bifurcating the learning rate.
  • results: Compared against existing optimisation techniques on various benchmark datasets and architectures, EVE navigates the loss surface more efficiently and improves performance and stability.
    Abstract In the rapidly advancing field of deep learning, optimising deep neural networks is paramount. This paper introduces a novel method, Enhanced Velocity Estimation (EVE), which innovatively applies different learning rates to distinct components of the gradients. By bifurcating the learning rate, EVE enables more nuanced control and faster convergence, addressing the challenges associated with traditional single learning rate approaches. Utilising a momentum term that adapts to the learning landscape, the method achieves a more efficient navigation of the complex loss surface, resulting in enhanced performance and stability. Extensive experiments demonstrate that EVE significantly outperforms existing optimisation techniques across various benchmark datasets and architectures.
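The paper defines the actual EVE update rule; the toy optimizer below only illustrates the bifurcation idea under one assumed split: one learning rate for gradient components that agree in sign with the running momentum, and another for components that oppose it.

```python
import torch

class DualLR(torch.optim.Optimizer):
    def __init__(self, params, lr_aligned=1e-3, lr_opposed=1e-4, beta=0.9):
        super().__init__(params, dict(lr_a=lr_aligned, lr_o=lr_opposed, beta=beta))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                m = self.state[p].setdefault("m", torch.zeros_like(p))
                m.mul_(group["beta"]).add_(p.grad, alpha=1 - group["beta"])
                aligned = (p.grad * m) > 0        # component-wise agreement
                lr = torch.where(aligned,
                                 torch.full_like(p, group["lr_a"]),
                                 torch.full_like(p, group["lr_o"]))
                p.add_(-lr * p.grad)

opt = DualLR(torch.nn.Linear(4, 1).parameters())
```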

UGSL: A Unified Framework for Benchmarking Graph Structure Learning

  • paper_url: http://arxiv.org/abs/2308.10737
  • repo_url: https://github.com/google-research/google-research
  • paper_authors: Bahare Fatemi, Sami Abu-El-Haija, Anton Tsitsulin, Mehran Kazemi, Dustin Zelle, Neslihan Bulut, Jonathan Halcrow, Bryan Perozzi
  • for: A unified framework for benchmarking graph structure learning, which broadens the applicability of Graph Neural Networks (GNNs).
  • methods: Reformulates a wide range of existing GNN-based models into a single model implemented within the unified framework.
  • results: Extensive analyses of the effect of the different components, clarifying the strengths and weaknesses of the methods.
    Abstract Graph neural networks (GNNs) demonstrate outstanding performance in a broad range of applications. While the majority of GNN applications assume that a graph structure is given, some recent methods substantially expanded the applicability of GNNs by showing that they may be effective even when no graph structure is explicitly provided. The GNN parameters and a graph structure are jointly learned. Previous studies adopt different experimentation setups, making it difficult to compare their merits. In this paper, we propose a benchmarking strategy for graph structure learning using a unified framework. Our framework, called Unified Graph Structure Learning (UGSL), reformulates existing models into a single model. We implement a wide range of existing models in our framework and conduct extensive analyses of the effectiveness of different components in the framework. Our results provide a clear and concise understanding of the different methods in this area as well as their strengths and weaknesses. The benchmark code is available at https://github.com/google-research/google-research/tree/master/ugsl.

Artificial intelligence-driven antimicrobial peptide discovery

  • paper_url: http://arxiv.org/abs/2308.10921
  • repo_url: None
  • paper_authors: Paulina Szymczak, Ewa Szczurek
  • for: Antimicrobial peptides (AMPs) as an alternative to conventional antibiotics in the face of antimicrobial resistance.
  • methods: Artificial intelligence (AI) approaches to AMP discovery, including discriminators that predict peptide properties such as activity and toxicity, and generators that propose novel AMP candidates.
  • results: A review of recent AI-driven AMP discovery, covering discrimination, generation of new candidates, and controlled generation of AMPs with desired properties.
    Abstract Antimicrobial peptides (AMPs) emerge as promising agents against antimicrobial resistance, providing an alternative to conventional antibiotics. Artificial intelligence (AI) revolutionized AMP discovery through both discrimination and generation approaches. The discriminators aid the identification of promising candidates by predicting key peptide properties such as activity and toxicity, while the generators learn the distribution over peptides and enable sampling novel AMP candidates, either de novo, or as analogues of a prototype peptide. Moreover, the controlled generation of AMPs with desired properties is achieved by discriminator-guided filtering, positive-only learning, latent space sampling, as well as conditional and optimized generation. Here we review recent achievements in AI-driven AMP discovery, highlighting the most exciting directions.

What’s Race Got to do with it? Predicting Youth Depression Across Racial Groups Using Machine and Deep Learning

  • paper_url: http://arxiv.org/abs/2308.11591
  • repo_url: None
  • paper_authors: Nathan Zhong, Nikhil Yadav
  • for: Using machine learning (ML) and artificial neural network (ANN) models to classify depression in high-school students.
  • methods: Trains on nationwide Youth Risk Behavior Surveillance System (YRBSS) survey data, with separate training and testing on three racial subsets (White, Black, and other minorities).
  • results: Different racial subgroups have different relevant predictive factors, and specific variables help predict depression. The ANN model achieves an F1 score of 82.90% on the full dataset, while the best ML model (a support vector machine) achieves 81.90%.
    Abstract Depression is a common yet serious mental disorder that affects millions of U.S. high schoolers every year. Still, accurate diagnosis and early detection remain significant challenges. In the field of public health, research shows that neural networks produce promising results in identifying other diseases such as cancer and HIV. This study proposes a similar approach, utilizing machine learning (ML) and artificial neural network (ANN) models to classify depression in a student. Additionally, the study highlights the differences in relevant factors for race subgroups and advocates the need for more extensive and diverse datasets. The models train on nationwide Youth Risk Behavior Surveillance System (YRBSS) survey data, in which the most relevant factors of depression are found with statistical analysis. The survey data is a structured dataset with 15000 entries including three race subsets each consisting of 900 entries. For classification, the research problem is modeled as a supervised learning binary classification problem. Factors relevant to depression for different racial subgroups are also identified. The ML and ANN models are trained on the entire dataset followed by different race subsets to classify whether an individual has depression. The ANN model achieves the highest F1 score of 82.90% while the best-performing machine learning model, support vector machines (SVM), achieves a score of 81.90%. This study reveals that different parameters are more valuable for modeling depression across diverse racial groups and furthers research regarding American youth depression.
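A minimal sketch of the evaluation protocol, using synthetic stand-ins for the YRBSS features and labels: fit and score a binary classifier on the full data and then on each racial subset separately.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(2700, 12))        # survey features (synthetic stand-in)
y = rng.integers(0, 2, size=2700)      # depression label (synthetic stand-in)
race = rng.integers(0, 3, size=2700)   # three subsets of ~900 entries each

subsets = [("all", np.ones(2700, bool))] + \
          [(f"race_{r}", race == r) for r in range(3)]
for name, mask in subsets:
    Xtr, Xte, ytr, yte = train_test_split(X[mask], y[mask],
                                          test_size=0.2, random_state=0)
    clf = make_pipeline(StandardScaler(), SVC()).fit(Xtr, ytr)
    print(name, "F1:", round(f1_score(yte, clf.predict(Xte)), 3))
```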

Test-time augmentation-based active learning and self-training for label-efficient segmentation

  • paper_url: http://arxiv.org/abs/2308.10727
  • repo_url: None
  • paper_authors: Bella Specktor-Fadida, Anna Levchakov, Dana Schonberger, Liat Ben-Sira, Dafna Ben-Bashat, Leo Joskowicz
  • for: A new method that combines self-training (ST) with active learning (AL) using Test-Time Augmentations (TTA) for medical image segmentation, aiming to reduce the annotation burden and improve segmentation performance.
  • methods: TTA is performed on an initial teacher network; cases for annotation are selected based on the lowest estimated Dice score, and the selected annotated cases are trained together with existing annotated cases and ST cases with border-slice annotations.
  • results: ST is highly effective for both fetal body and placenta segmentation, boosting performance on in-distribution (ID) and out-of-distribution (OOD) data. Combining AL and ST did not improve single-sequence fetal body segmentation, while AL was more effective for the high-variability placenta data. The method achieved a Dice score of 0.961 for fetal body segmentation with only 6 original scans and 2 new sequence scans, and results using 15 high-variability placenta cases were similar to those using 50 cases.
    Abstract Deep learning techniques depend on large datasets whose annotation is time-consuming. To reduce annotation burden, the self-training (ST) and active-learning (AL) methods have been developed as well as methods that combine them in an iterative fashion. However, it remains unclear when each method is the most useful, and when it is advantageous to combine them. In this paper, we propose a new method that combines ST with AL using Test-Time Augmentations (TTA). First, TTA is performed on an initial teacher network. Then, cases for annotation are selected based on the lowest estimated Dice score. Cases with high estimated scores are used as soft pseudo-labels for ST. The selected annotated cases are trained with existing annotated cases and ST cases with border slices annotations. We demonstrate the method on MRI fetal body and placenta segmentation tasks with different data variability characteristics. Our results indicate that ST is highly effective for both tasks, boosting performance for in-distribution (ID) and out-of-distribution (OOD) data. However, while self-training improved the performance of single-sequence fetal body segmentation when combined with AL, it slightly deteriorated performance of multi-sequence placenta segmentation on ID data. AL was helpful for the high variability placenta data, but did not improve upon random selection for the single-sequence body data. For fetal body segmentation sequence transfer, combining AL with ST following ST iteration yielded a Dice of 0.961 with only 6 original scans and 2 new sequence scans. Results using only 15 high-variability placenta cases were similar to those using 50 cases. Code is available at: https://github.com/Bella31/TTA-quality-estimation-ST-AL
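A minimal sketch of the selection signal described above, with `teacher`, `augment`, and `deaugment` passed in as assumed callables: per-case quality is estimated as the mean pairwise Dice agreement between test-time-augmented predictions, so low-scoring cases go to annotation and high-scoring ones become soft pseudo-labels.

```python
import numpy as np

def dice(a, b, eps=1e-6):
    inter = np.logical_and(a, b).sum()
    return (2 * inter + eps) / (a.sum() + b.sum() + eps)

def estimate_quality(teacher, augment, deaugment, volume, n_aug=8):
    """Proxy Dice from TTA agreement; `teacher` maps a volume to soft masks,
    `augment` returns (augmented_volume, inverse_info), `deaugment` undoes it."""
    preds = []
    for _ in range(n_aug):
        aug, inverse = augment(volume)           # e.g. flips, intensity noise
        preds.append(deaugment(teacher(aug) > 0.5, inverse))
    scores = [dice(preds[i], preds[j])
              for i in range(n_aug) for j in range(i + 1, n_aug)]
    return float(np.mean(scores))                # low -> route to annotation
```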

Clustered Linear Contextual Bandits with Knapsacks

  • paper_url: http://arxiv.org/abs/2308.10722
  • repo_url: None
  • paper_authors: Yichuan Deng, Michalis Mamakos, Zhao Song
  • for: Clustered contextual bandits where rewards and resource consumption are the outcomes of cluster-specific linear models; the arms are divided into clusters whose memberships are unknown to the algorithm.
  • methods: An algorithm that achieves regret sublinear in the number of time periods without requiring access to all of the arms, combining techniques from econometrics and from bandits with constraints, together with an efficient clustering step.
  • results: The algorithm provably achieves sublinear regret; in particular, it suffices to perform clustering only once, on a randomly selected subset of the arms.
    Abstract In this work, we study clustered contextual bandits where rewards and resource consumption are the outcomes of cluster-specific linear models. The arms are divided into clusters, with the cluster memberships being unknown to an algorithm. Pulling an arm in a time period results in a reward and in consumption for each one of multiple resources, and the algorithm terminates once the total consumption of any resource exceeds its constraint. Thus, maximizing the total reward requires learning not only models about the reward and the resource consumption, but also cluster memberships. We provide an algorithm that achieves regret sublinear in the number of time periods, without requiring access to all of the arms. In particular, we show that it suffices to perform clustering only once, on a randomly selected subset of the arms. To achieve this result, we provide a sophisticated combination of techniques from the literature of econometrics and of bandits with constraints.

CoMIX: A Multi-agent Reinforcement Learning Training Architecture for Efficient Decentralized Coordination and Independent Decision Making

  • paper_url: http://arxiv.org/abs/2308.10721
  • repo_url: None
  • paper_authors: Giovanni Minelli, Mirco Musolesi
  • for: A training framework, Coordinated QMIX (CoMIX), for robust coordination among decentralized agents.
  • methods: Models selfish and collaborative behavior as incremental steps in each agent's decision process, letting agents make independent decisions while dynamically adapting their behavior to different situations to coordinate.
  • results: Experiments in a variety of simulation environments show that CoMIX outperforms baselines on cooperative tasks, validating the incremental policy approach as an effective coordination technique.
    Abstract Robust coordination skills enable agents to operate cohesively in shared environments, together towards a common goal and, ideally, individually without hindering each other's progress. To this end, this paper presents Coordinated QMIX (CoMIX), a novel training framework for decentralized agents that enables emergent coordination through flexible policies, allowing at the same time independent decision-making at individual level. CoMIX models selfish and collaborative behavior as incremental steps in each agent's decision process. This allows agents to dynamically adapt their behavior to different situations balancing independence and collaboration. Experiments using a variety of simulation environments demonstrate that CoMIX outperforms baselines on collaborative tasks. The results validate our incremental policy approach as effective technique for improving coordination in multi-agent systems.

Relax and penalize: a new bilevel approach to mixed-binary hyperparameter optimization

  • paper_url: http://arxiv.org/abs/2308.10711
  • repo_url: None
  • paper_authors: Marianna de Santis, Jordan Frecon, Francesco Rinaldi, Saverio Salzo, Martin Schmidt
  • for: Improving the optimization of high-dimensional, mixed-binary hyperparameters of machine learning models.
  • methods: An equivalent continuous bilevel reformulation, based on an appropriate penalty term, for optimizing mixed-binary hyperparameters.
  • results: Experiments on estimating the group-sparsity structure in regression problems show that the method outperforms state-of-the-art approaches based on relaxation and rounding.
    Abstract In recent years, bilevel approaches have become very popular to efficiently estimate high-dimensional hyperparameters of machine learning models. However, to date, binary parameters are handled by continuous relaxation and rounding strategies, which could lead to inconsistent solutions. In this context, we tackle the challenging optimization of mixed-binary hyperparameters by resorting to an equivalent continuous bilevel reformulation based on an appropriate penalty term. We propose an algorithmic framework that, under suitable assumptions, is guaranteed to provide mixed-binary solutions. Moreover, the generality of the method allows to safely use existing continuous bilevel solvers within the proposed framework. We evaluate the performance of our approach for a specific machine learning problem, i.e., the estimation of the group-sparsity structure in regression problems. Reported results clearly show that our method outperforms state-of-the-art approaches based on relaxation and rounding.

Measuring the Effect of Causal Disentanglement on the Adversarial Robustness of Neural Network Models

  • paper_url: http://arxiv.org/abs/2308.10708
  • repo_url: https://github.com/prebenness/causal_disentanglement_robustness
  • paper_authors: Preben M. Ness, Dusica Marijan, Sunanda Bose
  • for: Quantifying the relationship between the causal disentanglement of causal neural network models and their robustness to adversarial attacks, alongside their increased capacity for tasks such as few-shot learning and rare-context classification.
  • methods: Re-implements four state-of-the-art causal neural network models with a common ResNet18 architecture and uses content/style disentanglement metrics from computer vision to measure different aspects of causal disentanglement.
  • results: A strong association (r=0.820, p=0.001) between the degree to which models decorrelate causal and confounder signals and their adversarial robustness, and a moderate negative association (r=-0.597, p=0.040) between the pixel-level information content of the confounder signal and adversarial robustness.
    Abstract Causal Neural Network models have shown high levels of robustness to adversarial attacks as well as an increased capacity for generalisation tasks such as few-shot learning and rare-context classification compared to traditional Neural Networks. This robustness is argued to stem from the disentanglement of causal and confounder input signals. However, no quantitative study has yet measured the level of disentanglement achieved by these types of causal models or assessed how this relates to their adversarial robustness. Existing causal disentanglement metrics are not applicable to deterministic models trained on real-world datasets. We, therefore, utilise metrics of content/style disentanglement from the field of Computer Vision to measure different aspects of the causal disentanglement for four state-of-the-art causal Neural Network models. By re-implementing these models with a common ResNet18 architecture we are able to fairly measure their adversarial robustness on three standard image classification benchmarking datasets under seven common white-box attacks. We find a strong association (r=0.820, p=0.001) between the degree to which models decorrelate causal and confounder signals and their adversarial robustness. Additionally, we find a moderate negative association between the pixel-level information content of the confounder signal and adversarial robustness (r=-0.597, p=0.040).

Sampling From Autoencoders’ Latent Space via Quantization And Probability Mass Function Concepts

  • paper_url: http://arxiv.org/abs/2308.10704
  • repo_url: None
  • paper_authors: Aymene Mohammed Bouayed, Adrian Iaccovelli, David Naccache
  • for: Sampling from the latent space of autoencoder-based generative models so that the reconstructed samples are lifelike images.
  • methods: A novel post-training sampling algorithm rooted in probability mass functions coupled with a quantization process: it establishes a vicinity around each latent vector from the input data and draws samples from these neighborhoods, so that sampled latent vectors predominantly inhabit high-probability regions and can be transformed into realistic images.
  • results: Outperforms Gaussian mixture model (GMM) sampling across a range of models and datasets, improving FID by up to 0.89 on MNIST and by 1.69 and 0.87 for face and ocular images (CelebA and MOBIUS), while reducing time complexity from $\mathcal{O}(n\times d \times k \times i)$ to $\mathcal{O}(n\times d)$; the estimated latent-space distribution is also validated via the Wasserstein distance.
    Abstract In this study, we focus on sampling from the latent space of generative models built upon autoencoders so that the reconstructed samples are lifelike images. To do so, we introduce a novel post-training sampling algorithm rooted in the concept of probability mass functions, coupled with a quantization process. Our proposed algorithm establishes a vicinity around each latent vector from the input data and then proceeds to draw samples from these defined neighborhoods. This strategic approach ensures that the sampled latent vectors predominantly inhabit high-probability regions, which, in turn, can be effectively transformed into authentic real-world images. A noteworthy point of comparison for our sampling algorithm is the sampling technique based on Gaussian mixture models (GMM), owing to its inherent capability to represent clusters. Remarkably, we manage to improve the time complexity from the previous $\mathcal{O}(n\times d \times k \times i)$ associated with GMM sampling to a much more streamlined $\mathcal{O}(n\times d)$, thereby resulting in substantial speedup during runtime. Moreover, our experimental results, gauged through the Fréchet inception distance (FID) for image generation, underscore the superior performance of our sampling algorithm across a diverse range of models and datasets. On the MNIST benchmark dataset, our approach outperforms GMM sampling by yielding a noteworthy improvement of up to $0.89$ in FID value. Furthermore, when it comes to generating images of faces and ocular images, our approach showcases substantial enhancements with FID improvements of $1.69$ and $0.87$ respectively, as compared to GMM sampling, as evidenced on the CelebA and MOBIUS datasets. Lastly, we substantiate our methodology's efficacy in estimating latent space distributions in contrast to GMM sampling, particularly through the lens of the Wasserstein distance.
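A minimal sketch of the sampling idea in NumPy, with the bin size as an assumed free parameter: quantize the encoded latents, form an empirical probability mass function over the occupied bins, sample bins in proportion to their mass, then jitter within the bin so draws stay in high-probability neighborhoods.

```python
import numpy as np

def fit_pmf(latents, step=0.5):
    bins = np.round(latents / step).astype(int)        # quantization
    keys, counts = np.unique(bins, axis=0, return_counts=True)
    return keys, counts / counts.sum(), step

def sample_latents(keys, pmf, step, n, rng=np.random.default_rng(0)):
    idx = rng.choice(len(keys), size=n, p=pmf)         # bins drawn by mass
    centers = keys[idx] * step
    return centers + rng.uniform(-step / 2, step / 2, size=centers.shape)

# usage: z = sample_latents(*fit_pmf(encoder_outputs), n=64); imgs = decoder(z)
```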

Refashioning Emotion Recognition Modelling: The Advent of Generalised Large Models

  • paper_url: http://arxiv.org/abs/2308.11578
  • repo_url: None
  • paper_authors: Zixing Zhang, Liyizhe Peng, Tao Pang, Jing Han, Huan Zhao, Bjorn W. Schuller
  • for: Investigating how large language models (LLMs) perform in emotion recognition, and offering insights and potential challenges for enhancing emotion recognition in the new era of advanced and generalised large models.
  • methods: Evaluates LLMs on emotion recognition across diverse aspects, including in-context learning, few-shot learning, accuracy, generalisation, and explanation.
  • results: An assessment of LLM performance on emotion recognition across these aspects, together with insights and potential challenges for enhancing emotion recognition in the era of advanced and generalised large models.
    Abstract After the inception of emotion recognition or affective computing, it has increasingly become an active research topic due to its broad applications. Over the past couple of decades, emotion recognition models have gradually migrated from statistically shallow models to neural network-based deep models, which can significantly boost the performance of emotion recognition models and consistently achieve the best results on different benchmarks. Therefore, in recent years, deep models have always been considered the first option for emotion recognition. However, the debut of large language models (LLMs), such as ChatGPT, has remarkably astonished the world due to their emergent capabilities of zero/few-shot learning, in-context learning, chain-of-thought, and others never exhibited by previous deep models. In the present paper, we comprehensively investigate how the LLMs perform in emotion recognition in terms of diverse aspects, including in-context learning, few-shot learning, accuracy, generalisation, and explanation. Moreover, we offer some insights and pose other potential challenges, hoping to ignite broader discussions about enhancing emotion recognition in the new era of advanced and generalised large models.

An engine to simulate insurance fraud network data

  • paper_url: http://arxiv.org/abs/2308.11659
  • repo_url: None
  • paper_authors: Bavo D. C. Campo, Katrien Antonio
  • for: Developing an efficient and accurate analytic approach for flagging fraudulent insurance claims.
  • methods: Feeds learning methods with features engineered from the social network of the parties involved in a claim.
  • results: A simulation engine that generates synthetic data with a network structure and covariates similar to real-life insurance fraud data, giving the user control over the data-generating mechanisms so that methods and models can be tested in a range of settings.
    Abstract Traditionally, the detection of fraudulent insurance claims relies on business rules and expert judgement which makes it a time-consuming and expensive process (Óskarsdóttir et al., 2022). Consequently, researchers have been examining ways to develop efficient and accurate analytic strategies to flag suspicious claims. Feeding learning methods with features engineered from the social network of parties involved in a claim is a particularly promising strategy (see for example Van Vlasselaer et al. (2016); Tumminello et al. (2023)). When developing a fraud detection model, however, we are confronted with several challenges. The uncommon nature of fraud, for example, creates a high class imbalance which complicates the development of well performing analytic classification models. In addition, only a small number of claims are investigated and get a label, which results in a large corpus of unlabeled data. Yet another challenge is the lack of publicly available data. This hinders not only the development of new methods, but also the validation of existing techniques. We therefore design a simulation machine that is engineered to create synthetic data with a network structure and available covariates similar to the real life insurance fraud data set analyzed in Óskarsdóttir et al. (2022). Further, the user has control over several data-generating mechanisms. We can specify the total number of policyholders and parties, the desired level of imbalance and the (effect size of the) features in the fraud generating model. As such, the simulation engine enables researchers and practitioners to examine several methodological challenges as well as to test their (development strategy of) insurance fraud detection models in a range of different settings. Moreover, large synthetic data sets can be generated to evaluate the predictive performance of (advanced) machine learning techniques.

Cost-Efficient Online Decision Making: A Combinatorial Multi-Armed Bandit Approach

  • paper_url: http://arxiv.org/abs/2308.10699
  • repo_url: None
  • paper_authors: Arman Rahbar, Niklas Åkerblom, Morteza Haghir Chehreghani
  • for: Online decision making in real-world applications where decisions are based on performing a sequence of tests on incoming data points.
  • methods: A novel formulation based on combinatorial multi-armed bandits that takes the cost of performing tests into account.
  • results: A new cost-efficient online decision-making framework that can use posterior sampling or BayesUCB for exploration, with a rigorous theoretical analysis and experiments demonstrating applicability to real-world problems.
    Abstract Online decision making plays a crucial role in numerous real-world applications. In many scenarios, the decision is made based on performing a sequence of tests on the incoming data points. However, performing all tests can be expensive and is not always possible. In this paper, we provide a novel formulation of the online decision making problem based on combinatorial multi-armed bandits and take the cost of performing tests into account. Based on this formulation, we provide a new framework for cost-efficient online decision making which can utilize posterior sampling or BayesUCB for exploration. We provide a rigorous theoretical analysis for our framework and present various experimental results that demonstrate its applicability to real-world problems.

Beyond expectations: Residual Dynamic Mode Decomposition and Variance for Stochastic Dynamical Systems

  • paper_url: http://arxiv.org/abs/2308.10697
  • repo_url: None
  • paper_authors: Matthew J. Colbrook, Qin Li, Ryan V. Raut, Alex Townsend
  • for: The linearization of nonlinear dynamical systems via Koopman operators, and in particular the spectral information of stochastic Koopman operators.
  • methods: Dynamic Mode Decomposition (DMD)-type projection methods to approximate spectral properties of the Koopman operator, extended with a variance term.
  • results: A Koopman framework incorporating residuals and variance to control projection error, the notion of variance-pseudospectra for gauging statistical coherency, and a suite of convergence results for spectral quantities of stochastic Koopman operators.
    Abstract Koopman operators linearize nonlinear dynamical systems, making their spectral information of crucial interest. Numerous algorithms have been developed to approximate these spectral properties, and Dynamic Mode Decomposition (DMD) stands out as the poster child of projection-based methods. Although the Koopman operator itself is linear, the fact that it acts in an infinite-dimensional space of observables poses various challenges. These include spurious modes, essential spectra, and the verification of Koopman mode decompositions. While recent work has addressed these challenges for deterministic systems, there remains a notable gap in verified DMD methods tailored for stochastic systems, where the Koopman operator measures the expectation of observables. We show that it is necessary to go beyond expectations to address these issues. By incorporating variance into the Koopman framework, we address these challenges. Through an additional DMD-type matrix, we approximate the sum of a squared residual and a variance term, each of which can be approximated individually using batched snapshot data. This allows verified computation of the spectral properties of stochastic Koopman operators, controlling the projection error. We also introduce the concept of variance-pseudospectra to gauge statistical coherency. Finally, we present a suite of convergence results for the spectral quantities of stochastic Koopman operators. Our study concludes with practical applications using both simulated and experimental data. In neural recordings from awake mice, we demonstrate how variance-pseudospectra can reveal physiologically significant information unavailable to standard expectation-based dynamical models.
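For reference, a minimal sketch of plain (exact) DMD on snapshot pairs, the projection baseline the paper builds on; the paper's residual and variance terms are additional DMD-type matrices on top of this least-squares fit.

```python
import numpy as np

def dmd(X, Y, r=10):
    # X, Y: (n_features, n_snapshots), with Y the one-step-evolved snapshots
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r]
    A_tilde = U.conj().T @ Y @ Vh.conj().T / s     # projected Koopman matrix
    eigvals, W = np.linalg.eig(A_tilde)
    modes = Y @ Vh.conj().T / s @ W                # exact DMD modes
    return eigvals, modes
```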

An Improved Best-of-both-worlds Algorithm for Bandits with Delayed Feedback

  • paper_url: http://arxiv.org/abs/2308.10675
  • repo_url: None
  • paper_authors: Saeed Masoudian, Julian Zimmert, Yevgeny Seldin
  • for: A new best-of-both-worlds algorithm for bandits with variably delayed feedback.
  • methods: Uses counts of outstanding observations (a quantity observed at action time) instead of an estimate of the maximal delay $d_{\mathrm{max}}$, yielding tighter regret bounds in both regimes.
  • results: A novel control of distribution drift based on biased loss estimators and skipping of observations with excessively large delays.
    Abstract We propose a new best-of-both-worlds algorithm for bandits with variably delayed feedback. The algorithm improves on prior work by Masoudian et al. [2022] by eliminating the need in prior knowledge of the maximal delay $d_{\mathrm{max}}$ and providing tighter regret bounds in both regimes. The algorithm and its regret bounds are based on counts of outstanding observations (a quantity that is observed at action time) rather than delays or the maximal delay (quantities that are only observed when feedback arrives). One major contribution is a novel control of distribution drift, which is based on biased loss estimators and skipping of observations with excessively large delays. Another major contribution is demonstrating that the complexity of best-of-both-worlds bandits with delayed feedback is characterized by the cumulative count of outstanding observations after skipping of observations with excessively large delays, rather than the delays or the maximal delay.

A Safe Deep Reinforcement Learning Approach for Energy Efficient Federated Learning in Wireless Communication Networks

  • paper_url: http://arxiv.org/abs/2308.10664
  • repo_url: None
  • paper_authors: Nikolaos Koursioumpas, Lina Magoula, Nikolaos Petropouleas, Alexandros-Ioannis Thanopoulos, Theodora Panagea, Nancy Alonistioti, M. A. Gutierrez-Estevez, Ramin Khalili
  • for: Reducing the environmental impact of Artificial Intelligence (AI)-enabled wireless networks by improving the energy efficiency of the privacy-preserving technique Federated Learning (FL).
  • methods: Orchestrates the computational and communication resources of the devices involved in an FL process to minimize total energy consumption while guaranteeing model performance, using a Soft Actor Critic Deep Reinforcement Learning (DRL) solution with a penalty function that enforces environment constraints for a safe RL process.
  • results: Compared to four state-of-the-art baseline solutions, achieves a decrease of up to 94% in total energy consumption in both static and dynamic environments.
    Abstract Progressing towards a new era of Artificial Intelligence (AI)-enabled wireless networks, concerns regarding the environmental impact of AI have been raised both in industry and academia. Federated Learning (FL) has emerged as a key privacy preserving decentralized AI technique. Despite efforts currently being made in FL, its environmental impact is still an open problem. Targeting the minimization of the overall energy consumption of an FL process, we propose the orchestration of computational and communication resources of the involved devices to minimize the total energy required, while guaranteeing a certain performance of the model. To this end, we propose a Soft Actor Critic Deep Reinforcement Learning (DRL) solution, where a penalty function is introduced during training, penalizing the strategies that violate the constraints of the environment, and ensuring a safe RL process. A device level synchronization method, along with a computationally cost effective FL environment are proposed, with the goal of further reducing the energy consumption and communication overhead. Evaluation results show the effectiveness of the proposed scheme compared to four state-of-the-art baseline solutions in both static and dynamic environments, achieving a decrease of up to 94% in the total energy consumption.

Practical Parallel Algorithms for Non-Monotone Submodular Maximization

  • paper_url: http://arxiv.org/abs/2308.10656
  • repo_url: None
  • paper_authors: Shuang Cui, Kai Han, Jing Tang, He Huang, Xueying Li, Aakas Zhiyuli, Hanxiao Li
  • for: Submodular maximization, with applications across artificial intelligence, including machine learning, computer vision, and natural language processing.
  • methods: The paper proposes two algorithms for submodular maximization: one for non-monotone submodular maximization subject to a knapsack constraint, and the other for non-monotone submodular maximization subject to a $k$-system constraint. Both algorithms have provable approximation ratios and sublinear adaptive complexities.
  • results: The paper achieves an $(8+\epsilon)$-approximation under $\mathcal{O}(\log n)$ adaptive complexity for non-monotone submodular maximization subject to a knapsack constraint, which is optimal up to a factor of $\mathcal{O}(\log\log n)$. Additionally, the paper proposes the first algorithm with both provable approximation ratio and sublinear adaptive complexity for non-monotone submodular maximization subject to a $k$-system constraint. The two algorithms are also applied to the special case of submodular maximization subject to a cardinality constraint, achieving performance bounds comparable with those of state-of-the-art algorithms.
    Abstract Submodular maximization has found extensive applications in various domains within the field of artificial intelligence, including but not limited to machine learning, computer vision, and natural language processing. With the increasing size of datasets in these domains, there is a pressing need to develop efficient and parallelizable algorithms for submodular maximization. One measure of the parallelizability of a submodular maximization algorithm is its adaptive complexity, which indicates the number of sequential rounds where a polynomial number of queries to the objective function can be executed in parallel. In this paper, we study the problem of non-monotone submodular maximization subject to a knapsack constraint, and propose the first combinatorial algorithm achieving an $(8+\epsilon)$-approximation under $\mathcal{O}(\log n)$ adaptive complexity, which is \textit{optimal} up to a factor of $\mathcal{O}(\log\log n)$. Moreover, we also propose the first algorithm with both provable approximation ratio and sublinear adaptive complexity for the problem of non-monotone submodular maximization subject to a $k$-system constraint. As a by-product, we show that our two algorithms can also be applied to the special case of submodular maximization subject to a cardinality constraint, and achieve performance bounds comparable with those of state-of-the-art algorithms. Finally, the effectiveness of our approach is demonstrated by extensive experiments on real-world applications.
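For orientation, a sketch of the classic sequential greedy baseline for submodular maximization under a cardinality constraint; the paper's contribution is parallel (low adaptive complexity) algorithms for knapsack and $k$-system constraints, so this loop is only the point of reference.

```python
def greedy(f, ground_set, k):
    """Sequential greedy: add the element of largest marginal gain k times."""
    S = set()
    for _ in range(k):
        gains = {e: f(S | {e}) - f(S) for e in ground_set - S}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:                # no positive marginal gain left
            break
        S.add(best)
    return S

# toy coverage function: f(S) = number of items covered by the sets in S
cover = {0: {1, 2}, 1: {2, 3}, 2: {4}}
f = lambda S: len(set().union(*(cover[e] for e in S))) if S else 0
print(greedy(f, set(cover), k=2))
```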

Deep Evidential Learning for Bayesian Quantile Regression

  • paper_url: http://arxiv.org/abs/2308.10650
  • repo_url: None
  • paper_authors: Frederik Boe Hüttel, Filipe Rodrigues, Francisco Câmara Pereira
  • for: A deep Bayesian quantile regression model that estimates the quantiles of a continuous target distribution without the Gaussian assumption.
  • methods: Based on evidential learning, which captures aleatoric and epistemic uncertainty with a single deterministic forward-pass model, making the method efficient and scalable to large models and datasets.
  • results: Achieves calibrated uncertainty estimates on non-Gaussian distributions, disentanglement of aleatoric and epistemic uncertainty, and robustness to out-of-distribution samples.
    Abstract It is desirable to have accurate uncertainty estimation from a single deterministic forward-pass model, as traditional methods for uncertainty quantification are computationally expensive. However, this is difficult because single forward-pass models do not sample weights during inference and often make assumptions about the target distribution, such as assuming it is Gaussian. This can be restrictive in regression tasks, where the mean and standard deviation are inadequate to model the target distribution accurately. This paper proposes a deep Bayesian quantile regression model that can estimate the quantiles of a continuous target distribution without the Gaussian assumption. The proposed method is based on evidential learning, which allows the model to capture aleatoric and epistemic uncertainty with a single deterministic forward-pass model. This makes the method efficient and scalable to large models and datasets. We demonstrate that the proposed method achieves calibrated uncertainties on non-Gaussian distributions, disentanglement of aleatoric and epistemic uncertainty, and robustness to out-of-distribution samples.
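A minimal sketch of the pinball (quantile) loss that quantile regression minimizes; the paper's evidential model additionally places a distribution over these quantile estimates, which this snippet does not attempt to reproduce.

```python
import torch

def pinball_loss(y, y_hat, tau):
    """tau in (0, 1) is the quantile level; the asymmetric penalty makes the
    minimizer the tau-quantile of the conditional target distribution."""
    diff = y - y_hat
    return torch.mean(torch.maximum(tau * diff, (tau - 1) * diff))

# e.g. tau = 0.9 penalizes under-prediction nine times harder:
print(pinball_loss(torch.tensor([1.0, 2.0]), torch.tensor([0.5, 2.5]), 0.9))
```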

Reinforcement Learning Based Sensor Optimization for Bio-markers

  • paper_url: http://arxiv.org/abs/2308.10649
  • repo_url: None
  • paper_authors: Sajal Khandelwal, Pawan Kumar, Syed Azeemuddin
  • for: Improving the sensitivity of inter-digitated capacitor (IDC)-based RF biosensors through electrode design and finger width.
  • methods: A novel reinforcement learning based Binary Particle Swarm Optimization (RLBPSO) method to optimize the sensor design parameters, compared against Ant Colony Optimization (ACO) and other state-of-the-art methods.
  • results: RLBPSO yields notable improvements in sensor sensitivity and the best optimized designs across various frequency ranges compared to current state-of-the-art methods.
    Abstract Radio frequency (RF) biosensors, in particular those based on inter-digitated capacitors (IDCs), are pivotal in areas like biomedical diagnosis, remote sensing, and wireless communication. Despite their advantages of low cost and easy fabrication, their sensitivity can be hindered by design imperfections, environmental factors, and circuit noise. This paper investigates enhancing the sensitivity of IDC-based RF sensors using a novel reinforcement learning based Binary Particle Swarm Optimization (RLBPSO), which is compared to Ant Colony Optimization (ACO) and other state-of-the-art methods. By focusing on optimizing design parameters like electrode design and finger width, the proposed study found notable improvements in sensor sensitivity. The proposed RLBPSO method yields the best optimized designs for various frequency ranges when compared to current state-of-the-art methods.
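A minimal sketch of plain binary particle swarm optimization over design bits (e.g. discretized electrode parameters); the paper's RLBPSO wraps a reinforcement-learning controller around a loop like this, which is omitted here, and the fitness function below is a toy stand-in.

```python
import numpy as np

def bpso(fitness, n_bits, n_particles=20, iters=100,
         rng=np.random.default_rng(0)):
    X = rng.integers(0, 2, size=(n_particles, n_bits))
    V = np.zeros((n_particles, n_bits))
    pbest, pbest_f = X.copy(), np.array([fitness(x) for x in X])
    gbest = pbest[np.argmax(pbest_f)].copy()
    for _ in range(iters):
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        V = 0.7 * V + 1.5 * r1 * (pbest - X) + 1.5 * r2 * (gbest - X)
        X = (rng.random(X.shape) < 1 / (1 + np.exp(-V))).astype(int)  # sigmoid
        f = np.array([fitness(x) for x in X])
        better = f > pbest_f
        pbest[better], pbest_f[better] = X[better], f[better]
        gbest = pbest[np.argmax(pbest_f)].copy()
    return gbest

# toy fitness: prefer bit strings with exactly five ones
print(bpso(lambda bits: -abs(bits.sum() - 5), n_bits=16))
```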

Faster Training of Neural ODEs Using Gauß-Legendre Quadrature

  • paper_url: http://arxiv.org/abs/2308.10644
  • repo_url: https://github.com/a-norcliffe/torch_gq_adjoint
  • paper_authors: Alexander Norcliffe, Marc Peter Deisenroth
  • for: Speeding up the training of neural ODEs for generative and time-series modelling.
  • methods: Uses Gauß-Legendre quadrature to solve the adjoint integrals faster than ODE-based methods, without solving the ODE exactly, while remaining memory efficient.
  • results: Faster training of neural ODEs, especially for large models, plus a new way to train SDE-based models via the Wong-Zakai theorem.
    Abstract Neural ODEs demonstrate strong performance in generative and time-series modelling. However, training them via the adjoint method is slow compared to discrete models due to the requirement of numerically solving ODEs. To speed neural ODEs up, a common approach is to regularise the solutions. However, this approach may affect the expressivity of the model; when the trajectory itself matters, this is particularly important. In this paper, we propose an alternative way to speed up the training of neural ODEs. The key idea is to speed up the adjoint method by using Gauß-Legendre quadrature to solve integrals faster than ODE-based methods while remaining memory efficient. We also extend the idea to training SDEs using the Wong-Zakai theorem, by training a corresponding ODE and transferring the parameters. Our approach leads to faster training of neural ODEs, especially for large models. It also presents a new way to train SDE-based models.
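A minimal sketch of Gauß-Legendre quadrature itself, the rule the method uses to evaluate adjoint integrals without solving a backward ODE; nodes and weights on [-1, 1] come from NumPy and are rescaled to an arbitrary interval.

```python
import numpy as np

def gauss_legendre_integral(f, a, b, n=16):
    nodes, weights = np.polynomial.legendre.leggauss(n)
    x = 0.5 * (b - a) * nodes + 0.5 * (b + a)     # rescale nodes to [a, b]
    return 0.5 * (b - a) * np.sum(weights * f(x))

# exact for polynomials of degree <= 2n - 1; e.g. integral of sin on [0, pi]:
print(gauss_legendre_integral(np.sin, 0.0, np.pi))   # ~2.0
```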

SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes

  • paper_url: http://arxiv.org/abs/2308.10638
  • repo_url: None
  • paper_authors: Soubhik Sanyal, Partha Ghosh, Jinlong Yang, Michael J. Black, Justus Thies, Timo Bolkart
  • for: Generating clothed and textured 3D meshes of humans.
  • methods: A deep neural network that represents the geometry and appearance distribution of clothed human bodies, trained with an unpaired learning procedure that combines medium-sized 3D scan data with large-scale 2D image data to learn pose-dependent clothed and textured human meshes.
  • results: Validated on the SCULPT dataset and compared with state-of-the-art 3D generative models for clothed human bodies.
    Abstract We present SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically, we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging, as datasets of textured 3D meshes for humans are limited in size and accessibility. Our key observation is that there exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image datasets of clothed humans and multiple appearances can be mapped to a single geometry. To effectively learn from the two data modalities, we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. Specifically, we learn a pose-dependent geometry space from 3D scan data. We represent this as per vertex displacements w.r.t. the SMPL model. Next, we train a geometry conditioned texture generator in an unsupervised way using the 2D image data. We use intermediate activations of the learned geometry model to condition our texture generator. To alleviate entanglement between pose and clothing type, and pose and clothing appearance, we condition both the texture and geometry generators with attribute labels such as clothing types for the geometry, and clothing colors for the texture generator. We automatically generated these conditioning labels for the 2D images based on the visual question answering model BLIP and CLIP. We validate our method on the SCULPT dataset, and compare to state-of-the-art 3D generative models for clothed human bodies. We will release the codebase for research purposes.

Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models

  • paper_url: http://arxiv.org/abs/2308.10632
  • repo_url: None
  • paper_authors: Peiyan Zhang, Haoyang Liu, Chaozhuo Li, Xing Xie, Sunghun Kim, Haohan Wang
  • for: A new robustness measurement for evaluating how image classification models perform in the real world.
  • methods: Evaluates models against a surrogate oracle (a foundation model) and extends image datasets with new, sufficiently perturbed samples that remain within the original image-label structure, moving evaluation beyond fixed benchmarks.
  • results: A new way to evaluate robustness, free of the limitations of fixed benchmarks or constrained perturbations (though scoped by the power of the oracle), with the generated data also used to understand model behaviors.
    Abstract Machine learning has demonstrated remarkable performance over finite datasets, yet whether the scores over the fixed benchmarks can sufficiently indicate the model's performance in the real world is still under discussion. In reality, an ideal robust model will probably behave similarly to the oracle (e.g., the human users), thus a good evaluation protocol is probably to evaluate the models' behaviors in comparison to the oracle. In this paper, we introduce a new robustness measurement that directly measures the image classification model's performance compared with a surrogate oracle (i.e., a foundation model). Besides, we design a simple method that can accomplish the evaluation beyond the scope of the benchmarks. Our method extends the image datasets with new samples that are sufficiently perturbed to be distinct from the ones in the original sets, but are still bounded within the same image-label structure the original test image represents, constrained by a foundation model pretrained with a large amount of samples. As a result, our new method will offer us a new way to evaluate the models' robustness performance, free of limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. In addition to the evaluation results, we also leverage our generated data to understand the behaviors of the model and our new evaluation strategies.

A Homogenization Approach for Gradient-Dominated Stochastic Optimization

  • paper_url: http://arxiv.org/abs/2308.10630
  • repo_url: None
  • paper_authors: Jiyuan Tan, Chenyu Xue, Chuwen Zhang, Qi Deng, Dongdong Ge, Yinyu Ye
  • for: Gradient-dominated optimization with $\alpha \in [1, 2]$
  • methods: Stochastic homogeneous second-order descent method (SHSODM) with homogenization approach
  • results: Achieves a sample complexity of $O(\epsilon^{-7/(2 \alpha) +1})$ for $\alpha \in [1, 3/2)$ and $\tilde{O}(\epsilon^{-2/\alpha})$ for $\alpha \in [3/2, 2]$, with an improved sample complexity of $O( \epsilon ^{-( 7-3\alpha ) /( 2\alpha )})$ for $\alpha \in [1,3/2)$.
    Abstract Gradient dominance property is a condition weaker than strong convexity, yet it sufficiently ensures global convergence for first-order methods even in non-convex optimization. This property finds application in various machine learning domains, including matrix decomposition, linear neural networks, and policy-based reinforcement learning (RL). In this paper, we study the stochastic homogeneous second-order descent method (SHSODM) for gradient-dominated optimization with $\alpha \in [1, 2]$ based on a recently proposed homogenization approach. Theoretically, we show that SHSODM achieves a sample complexity of $O(\epsilon^{-7/(2 \alpha) +1})$ for $\alpha \in [1, 3/2)$ and $\tilde{O}(\epsilon^{-2/\alpha})$ for $\alpha \in [3/2, 2]$. We further provide a SHSODM with a variance reduction technique enjoying an improved sample complexity of $O( \epsilon ^{-( 7-3\alpha ) /( 2\alpha )})$ for $\alpha \in [1,3/2)$. Our results match the state-of-the-art sample complexity bounds for stochastic gradient-dominated optimization without cubic regularization. Since the homogenization approach only relies on solving extremal eigenvector problems instead of Newton-type systems, our methods gain the advantage of cheaper iterations and robustness in ill-conditioned problems. Numerical experiments on several RL tasks demonstrate the efficiency of SHSODM compared to other off-the-shelf methods.

GaitPT: Skeletons Are All You Need For Gait Recognition

  • paper_url: http://arxiv.org/abs/2308.10623
  • repo_url: None
  • paper_authors: Andy Catruna, Adrian Cosma, Emilian Radoi
  • for: automatic person identification at a distance
  • methods: pose estimation skeletons, hierarchical transformer architecture
  • results: state-of-the-art performance, surpassing other works by a margin of 6%, outperforming both skeleton-based and appearance-based approaches
    Abstract The analysis of patterns of walking is an important area of research that has numerous applications in security, healthcare, sports and human-computer interaction. Lately, walking patterns have been regarded as a unique fingerprinting method for automatic person identification at a distance. In this work, we propose a novel gait recognition architecture called Gait Pyramid Transformer (GaitPT) that leverages pose estimation skeletons to capture unique walking patterns, without relying on appearance information. GaitPT adopts a hierarchical transformer architecture that effectively extracts both spatial and temporal features of movement in an anatomically consistent manner, guided by the structure of the human skeleton. Our results show that GaitPT achieves state-of-the-art performance compared to other skeleton-based gait recognition works, in both controlled and in-the-wild scenarios. GaitPT obtains 82.6% average accuracy on CASIA-B, surpassing other works by a margin of 6%. Moreover, it obtains 52.16% Rank-1 accuracy on GREW, outperforming both skeleton-based and appearance-based approaches.

Weighting by Tying: A New Approach to Weighted Rank Correlation

  • paper_url: http://arxiv.org/abs/2308.10622
  • repo_url: None
  • paper_authors: Sascha Henzgen, Eyke Hüllermeier
  • for: proposing a weighted rank correlation measure, based on fuzzy order relations, that captures the degree of concordance between two rankings of the same items.
  • methods: defines the measure, called scaled gamma, via a fuzzy equivalence relation on the rank positions, which is in turn specified by a so-called scaling function, so that rank positions can carry different weights.
  • results: the resulting measure combines a sound formal foundation with a flexible way of weighting rank positions.
    Abstract Measures of rank correlation are commonly used in statistics to capture the degree of concordance between two orderings of the same set of items. Standard measures like Kendall's tau and Spearman's rho coefficient put equal emphasis on each position of a ranking. Yet, motivated by applications in which some of the positions (typically those on the top) are more important than others, a few weighted variants of these measures have been proposed. Most of these generalizations fail to meet desirable formal properties, however. Besides, they are often quite inflexible in the sense of committing to a fixed weighting scheme. In this paper, we propose a weighted rank correlation measure on the basis of fuzzy order relations. Our measure, called scaled gamma, is related to Goodman and Kruskal's gamma rank correlation. It is parametrized by a fuzzy equivalence relation on the rank positions, which in turn is specified conveniently by a so-called scaling function. This approach combines soundness with flexibility: it has a sound formal foundation and allows for weighting rank positions in a flexible way.
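To illustrate the idea of position-weighted rank correlation, the following is a minimal sketch of a Goodman-Kruskal-style gamma with pair weights; the paper's actual construction goes through fuzzy equivalence relations induced by a scaling function, so the `weight` function below is a simplified stand-in, not the paper's definition.

```python
import itertools

def weighted_gamma(rank_a, rank_b, weight):
    """Goodman-Kruskal-style gamma with pair weights.

    rank_a, rank_b: rank positions (1 = top) of the same items.
    weight(ri, rj): importance of an item pair, e.g. emphasizing top ranks.
    """
    concordant = discordant = 0.0
    for i, j in itertools.combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign == 0:
            continue  # ties contribute nothing, as in plain gamma
        w = weight(rank_a[i], rank_a[j])
        if sign > 0:
            concordant += w
        else:
            discordant += w
    return (concordant - discordant) / (concordant + discordant)

# Emphasize agreement near the top of the first ranking.
top_heavy = lambda ri, rj: 1.0 / min(ri, rj)
print(weighted_gamma([1, 2, 3, 4], [1, 2, 4, 3], top_heavy))  # ~0.846
```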

centroIDA: Cross-Domain Class Discrepancy Minimization Based on Accumulative Class-Centroids for Imbalanced Domain Adaptation

  • paper_url: http://arxiv.org/abs/2308.10619
  • repo_url: None
  • paper_authors: Xiaona Sun, Zhenyu Wu, Yichen Liu, Saier Hu, Zhiqiang Zhan, Yang Ji
  • for: addresses the imbalanced domain adaptation (IDA) problem, which involves both covariate and long-tailed label shifts across domains.
  • methods: proposes a cross-domain class discrepancy minimization method based on accumulative class-centroids (centroIDA), which includes class-based re-sampling, accumulative class-centroids alignment, and class-wise feature alignment.
  • results: outperforms other state-of-the-art (SOTA) methods on the IDA problem, especially when the degree of label shift increases.
    Abstract Unsupervised Domain Adaptation (UDA) approaches address the covariate shift problem by minimizing the distribution discrepancy between the source and target domains, assuming that the label distribution is invariant across domains. However, in the imbalanced domain adaptation (IDA) scenario, covariate and long-tailed label shifts both exist across domains. To tackle the IDA problem, some current research focuses on minimizing the distribution discrepancy of each corresponding class between the source and target domains. Such methods rely heavily on the selection of reliable pseudo labels and the estimation of feature distributions for the target domain, and the minority classes with limited numbers of samples make these estimations more uncertain, which degrades the model's performance. In this paper, we propose a cross-domain class discrepancy minimization method based on accumulative class-centroids for IDA (centroIDA). Firstly, a class-based re-sampling strategy is used to obtain an unbiased classifier on the source domain. Secondly, an accumulative class-centroids alignment loss is proposed for iterative class-centroids alignment across domains. Finally, a class-wise feature alignment loss is used to optimize the feature representation for a robust classification boundary. A series of experiments has shown that our method outperforms other SOTA methods on the IDA problem, especially with an increasing degree of label shift.
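A minimal sketch of the accumulative class-centroids idea follows; the feature extractor producing `feats`, the use of pseudo labels on the target domain, and the moving-average accumulation rule are assumptions, and the paper's exact update and loss may differ.

```python
import torch
import torch.nn.functional as F

# Sketch: centroids are accumulated across batches with a moving average,
# approximating "accumulative class-centroids"; target-domain labels would
# be pseudo labels in practice.

class AccumulativeCentroids:
    def __init__(self, num_classes, dim, momentum=0.9):
        self.centroids = torch.zeros(num_classes, dim)
        self.m = momentum

    @torch.no_grad()
    def update(self, feats, labels):
        for c in labels.unique():
            batch_mean = feats[labels == c].mean(dim=0)
            self.centroids[c] = self.m * self.centroids[c] + (1 - self.m) * batch_mean

def centroid_alignment_loss(src_centroids, tgt_centroids):
    # Pull the per-class centroids of the two domains together.
    return F.mse_loss(src_centroids, tgt_centroids)
```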

ST-RAP: A Spatio-Temporal Framework for Real Estate Appraisal

  • paper_url: http://arxiv.org/abs/2308.10609
  • repo_url: https://github.com/dojeon-ai/strap
  • paper_authors: Hojoon Lee, Hawon Jeong, Byungkun Lee, Kyungyup Lee, Jaegul Choo
  • for: proposing a spatio-temporal framework for real estate appraisal that better estimates property values across locations.
  • methods: a hierarchical architecture with a heterogeneous graph neural network that encapsulates temporal dynamics and spatial relationships simultaneously, capturing trends in real estate values.
  • results: in extensive experiments on a large-scale real estate dataset, ST-RAP outperforms previous methods, demonstrating the significant benefit of integrating spatial and temporal aspects in real estate appraisal.
    Abstract In this paper, we introduce ST-RAP, a novel Spatio-Temporal framework for Real estate APpraisal. ST-RAP employs a hierarchical architecture with a heterogeneous graph neural network to encapsulate temporal dynamics and spatial relationships simultaneously. Through comprehensive experiments on a large-scale real estate dataset, ST-RAP outperforms previous methods, demonstrating the significant benefits of integrating spatial and temporal aspects in real estate appraisal. Our code and dataset are available at https://github.com/dojeon-ai/STRAP.

FocalDreamer: Text-driven 3D Editing via Focal-fusion Assembly

  • paper_url: http://arxiv.org/abs/2308.10608
  • repo_url: None
  • paper_authors: Yuhan Li, Yishun Dou, Yue Shi, Yu Lei, Xuanhong Chen, Yi Zhang, Peng Zhou, Bingbing Ni
  • for: proposing a text-driven 3D editing framework that performs fine-grained edits within desired regions.
  • methods: merges a base shape with editable parts according to text prompts, using geometry union and dual-path rendering to assemble independent 3D parts into a complete object, with support for convenient instance reuse and part-wise control.
  • results: compared with other methods, FocalDreamer provides superior fine-grained editing and generates high-fidelity geometry and PBR textures compatible with widely used graphics engines.
    Abstract While text-3D editing has made significant strides in leveraging score distillation sampling, emerging approaches still fall short in delivering separable, precise and consistent outcomes that are vital to content creation. In response, we introduce FocalDreamer, a framework that merges base shape with editable parts according to text prompts for fine-grained editing within desired regions. Specifically, equipped with geometry union and dual-path rendering, FocalDreamer assembles independent 3D parts into a complete object, tailored for convenient instance reuse and part-wise control. We propose geometric focal loss and style consistency regularization, which encourage focal fusion and congruent overall appearance. Furthermore, FocalDreamer generates high-fidelity geometry and PBR textures which are compatible with widely-used graphics engines. Extensive experiments have highlighted the superior editing capabilities of FocalDreamer in both quantitative and qualitative evaluations.

Analyzing Complex Systems with Cascades Using Continuous-Time Bayesian Networks

  • paper_url: http://arxiv.org/abs/2308.10606
  • repo_url: None
  • paper_authors: Alessandro Bregoli, Karin Rathsman, Marco Scutari, Fabio Stella, Søren Wengel Mogensen
  • for: analyzing the cascading behavior of events in complex systems, in particular understanding which system states trigger cascades.
  • methods: models cascades with continuous-time Bayesian networks (CTBNs) and develops new methods for extracting knowledge from them.
  • results: the CTBN model concisely describes how events propagate through the system and identifies likely sentry states, i.e., system states that may lead to imminent cascading behavior.
    Abstract Interacting systems of events may exhibit cascading behavior where events tend to be temporally clustered. While the cascades themselves may be obvious from the data, it is important to understand which states of the system trigger them. For this purpose, we propose a modeling framework based on continuous-time Bayesian networks (CTBNs) to analyze cascading behavior in complex systems. This framework allows us to describe how events propagate through the system and to identify likely sentry states, that is, system states that may lead to imminent cascading behavior. Moreover, CTBNs have a simple graphical representation and provide interpretable outputs, both of which are important when communicating with domain experts. We also develop new methods for knowledge extraction from CTBNs and we apply the proposed methodology to a data set of alarms in a large industrial system.

BackTrack: Robust template update via Backward Tracking of candidate template

  • paper_url: http://arxiv.org/abs/2308.10604
  • repo_url: None
  • paper_authors: Dongwook Lee, Wonjun Choi, Seohyung Lee, ByungIn Yoo, Eunho Yang, Seongju Hwang
  • for: improving visual object tracking by making template updates robust and reliable.
  • methods: quantifies the confidence of a candidate template by tracking it backward over past frames, so the template is updated with a reliable candidate at the right time while unreliable candidates are rejected.
  • results: compared with existing template-update methods, delivers higher tracking accuracy and robustness, achieving state-of-the-art performance on various tracking benchmarks.
    Abstract Variations of target appearance such as deformations, illumination variance, occlusion, etc., are the major challenges of visual object tracking that negatively impact the performance of a tracker. An effective method to tackle these challenges is template update, which updates the template to reflect the change of appearance in the target object during tracking. However, with template updates, inadequate quality of new templates or inappropriate timing of updates may induce a model drift problem, which severely degrades the tracking performance. Here, we propose BackTrack, a robust and reliable method to quantify the confidence of the candidate template by backward tracking it on the past frames. Based on the confidence score of candidates from BackTrack, we can update the template with a reliable candidate at the right time while rejecting unreliable candidates. BackTrack is a generic template update scheme and is applicable to any template-based trackers. Extensive experiments on various tracking benchmarks verify the effectiveness of BackTrack over existing template update algorithms, as it achieves SOTA performance on various tracking benchmarks.
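A minimal sketch of the backward-tracking confidence check follows; the `tracker.track(template, frame)` interface, the stored history of frames and boxes, and the acceptance threshold are hypothetical, not taken from the paper.

```python
# Sketch of backward-tracking confidence (hypothetical interface:
# tracker.track(template, frame) returns a box (x1, y1, x2, y2)).

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def backtrack_confidence(tracker, candidate, frames, boxes):
    # Track the candidate template backward through past frames and score
    # its agreement with the boxes recorded during forward tracking.
    scores = [iou(tracker.track(candidate, f), b)
              for f, b in zip(reversed(frames), reversed(boxes))]
    return sum(scores) / len(scores)

def maybe_update_template(tracker, candidate, frames, boxes, threshold=0.6):
    if backtrack_confidence(tracker, candidate, frames, boxes) >= threshold:
        tracker.template = candidate   # reliable candidate: accept the update
    # otherwise reject the candidate and keep the current template
```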

Improving the Transferability of Adversarial Examples with Arbitrary Style Transfer

  • paper_url: http://arxiv.org/abs/2308.10601
  • repo_url: https://github.com/zhijin-ge/stm
  • paper_authors: Zhijin Ge, Fanhua Shang, Hongying Liu, Yuanyuan Liu, Liang Wan, Wei Feng, Xiaosen Wang
  • for: Targeting deep neural networks vulnerable to human-imperceptible attacks, to improve attack effectiveness in the black-box setting.
  • methods: Using a domain mapping network for preprocessing, and augmenting the data in different domains to improve attack effectiveness.
  • results: Significantly improving attack effectiveness and input diversity on the ImageNet-compatible dataset compared to state-of-the-art methods.
    Abstract Deep neural networks are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations on clean inputs. Although many attack methods can achieve high success rates in the white-box setting, they also exhibit weak transferability in the black-box setting. Recently, various methods have been proposed to improve adversarial transferability, in which the input transformation is one of the most effective methods. In this work, we notice that existing input transformation-based works mainly adopt the transformed data in the same domain for augmentation. Inspired by domain generalization, we aim to further improve the transferability using the data augmented from different domains. Specifically, a style transfer network can alter the distribution of low-level visual features in an image while preserving semantic content for humans. Hence, we propose a novel attack method named Style Transfer Method (STM) that utilizes a proposed arbitrary style transfer network to transform the images into different domains. To avoid inconsistent semantic information of stylized images for the classification network, we fine-tune the style transfer network and mix up the generated images added by random noise with the original images to maintain semantic consistency and boost input diversity. Extensive experimental results on the ImageNet-compatible dataset show that our proposed method can significantly improve the adversarial transferability on either normally trained models or adversarially trained models than state-of-the-art input transformation-based attacks. Code is available at: https://github.com/Zhijin-Ge/STM.
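A minimal sketch of the mixing step described in the abstract follows; `style_net` stands for the fine-tuned arbitrary style transfer network, and the mixing ratio and noise scale are illustrative assumptions, not the paper's values.

```python
import torch

# Sketch: stylize to shift low-level visual statistics into another domain,
# then mix with the original (plus random noise) to keep semantics consistent
# and boost input diversity for the transfer-based attack.

def stm_augment(x, style_net, mix=0.5, noise_std=0.05):
    stylized = style_net(x)                        # shift low-level statistics
    noisy = stylized + noise_std * torch.randn_like(stylized)
    return mix * x + (1.0 - mix) * noisy           # semantic consistency
```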

Image-free Classifier Injection for Zero-Shot Classification

  • paper_url: http://arxiv.org/abs/2308.10599
  • repo_url: https://github.com/explainableml/imagefreezsl
  • paper_authors: Anders Christensen, Massimiliano Mancini, A. Sophia Koepke, Ole Winther, Zeynep Akata
  • for: equipping pre-trained models with zero-shot classification capabilities without using any image data.
  • methods: Image-free Classifier Injection with Semantics (ICIS) injects classifiers for new, unseen classes into pre-trained classification models in a post-hoc fashion; two encoder-decoder networks learn to reconstruct classifier weights from simple class-wise descriptors (such as class names or attributes) and vice versa.
  • results: experiments show that ICIS achieves strong (generalized) zero-shot classification performance on standard ZSL datasets.
    Abstract Zero-shot learning models achieve remarkable results on image classification for samples from classes that were not seen during training. However, such models must be trained from scratch with specialised methods: therefore, access to a training dataset is required when the need for zero-shot classification arises. In this paper, we aim to equip pre-trained models with zero-shot classification capabilities without the use of image data. We achieve this with our proposed Image-free Classifier Injection with Semantics (ICIS) that injects classifiers for new, unseen classes into pre-trained classification models in a post-hoc fashion without relying on image data. Instead, the existing classifier weights and simple class-wise descriptors, such as class names or attributes, are used. ICIS has two encoder-decoder networks that learn to reconstruct classifier weights from descriptors (and vice versa), exploiting (cross-)reconstruction and cosine losses to regularise the decoding process. Notably, ICIS can be cheaply trained and applied directly on top of pre-trained classification models. Experiments on benchmark ZSL datasets show that ICIS produces unseen classifier weights that achieve strong (generalised) zero-shot classification performance. Code is available at https://github.com/ExplainableML/ImageFreeZSL .
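A minimal sketch of injecting an unseen-class classifier from a class descriptor, in the spirit of ICIS, follows; the layer sizes are assumptions, and the full method additionally trains the reverse mapping with (cross-)reconstruction and cosine losses.

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch: map class descriptors (e.g. name embeddings) to classifier rows.
desc_dim, weight_dim = 300, 512
desc_to_weight = nn.Sequential(nn.Linear(desc_dim, 1024), nn.ReLU(),
                               nn.Linear(1024, weight_dim))

def cosine_reconstruction_loss(pred_w, true_w):
    # Train on seen classes: descriptors -> existing rows of the
    # pre-trained classifier.
    return 1.0 - F.cosine_similarity(pred_w, true_w, dim=-1).mean()

# At test time, a descriptor of an unseen class yields a new classifier row:
#   new_row = desc_to_weight(unseen_descriptor)
#   logits  = image_features @ new_row
```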

RADIANCE: Radio-Frequency Adversarial Deep-learning Inference for Automated Network Coverage Estimation

  • paper_url: http://arxiv.org/abs/2308.10584
  • repo_url: None
  • paper_authors: Sopan Sarkar, Mohammad Hossein Manshaei, Marwan Krunz
  • for: proposing an adversarial-learning method for automatically generating radio-frequency coverage maps (RF maps) of wireless networks in indoor scenarios, taking the environment's objects and the communication configuration into account.
  • methods: a generative adversarial network, RADIANCE, that uses a semantic map to encode spatial relationships and attributes of objects in the indoor environment, together with a new gradient-based loss function that computes the magnitude and direction of change in received signal strength (RSS) from a point within the environment.
  • results: simulations show that RADIANCE generates high-fidelity RF maps with a mean average error (MAE) of 0.09, root-mean-squared error (RMSE) of 0.29, peak signal-to-noise ratio (PSNR) of 10.78, and multi-scale structural similarity index (MS-SSIM) of 0.80.
    Abstract Radio-frequency coverage maps (RF maps) are extensively utilized in wireless networks for capacity planning, placement of access points and base stations, localization, and coverage estimation. Conducting site surveys to obtain RF maps is labor-intensive and sometimes not feasible. In this paper, we propose radio-frequency adversarial deep-learning inference for automated network coverage estimation (RADIANCE), a generative adversarial network (GAN) based approach for synthesizing RF maps in indoor scenarios. RADIANCE utilizes a semantic map, a high-level representation of the indoor environment to encode spatial relationships and attributes of objects within the environment and guide the RF map generation process. We introduce a new gradient-based loss function that computes the magnitude and direction of change in received signal strength (RSS) values from a point within the environment. RADIANCE incorporates this loss function along with the antenna pattern to capture signal propagation within a given indoor configuration and generate new patterns under new configuration, antenna (beam) pattern, and center frequency. Extensive simulations are conducted to compare RADIANCE with ray-tracing simulations of RF maps. Our results show that RADIANCE achieves a mean average error (MAE) of 0.09, root-mean-squared error (RMSE) of 0.29, peak signal-to-noise ratio (PSNR) of 10.78, and multi-scale structural similarity index (MS-SSIM) of 0.80.
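A minimal sketch of a gradient-based map loss in the spirit of the one described follows; the paper's exact formulation may differ. Spatial finite differences of the RSS map capture the magnitude and direction of change, which are then matched between the generated and reference maps.

```python
import torch.nn.functional as F

def rss_gradient_loss(pred_map, true_map):
    # Finite differences approximate the spatial gradient of the RSS map.
    def grads(m):
        gx = m[..., :, 1:] - m[..., :, :-1]   # change along x
        gy = m[..., 1:, :] - m[..., :-1, :]   # change along y
        return gx, gy
    pgx, pgy = grads(pred_map)
    tgx, tgy = grads(true_map)
    return F.l1_loss(pgx, tgx) + F.l1_loss(pgy, tgy)
```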

Pseudo-online framework for BCI evaluation: A MOABB perspective

  • paper_url: http://arxiv.org/abs/2308.11656
  • repo_url: None
  • paper_authors: Igor Carrara, Théodore Papadopoulo
  • for: extending the current MOABB framework, which operates in offline mode, so that different algorithms can be compared in a pseudo-online setting using a technology based on overlapping sliding windows.
  • methods: introduces an idle-state event in the dataset to account for all possibilities other than task thinking, and evaluates algorithm performance with the normalized Matthews Correlation Coefficient (nMCC) and the Information Transfer Rate (ITR).
  • results: analyzes the state-of-the-art algorithms of the last 15 years on several Motor Imagery (MI) datasets with multiple subjects, showing the differences between the two approaches from a statistical point of view.
    Abstract Objective: BCI (Brain-Computer Interface) technology operates in three modes: online, offline, and pseudo-online. In the online mode, real-time EEG data is constantly analyzed. In offline mode, the signal is acquired and processed afterwards. The pseudo-online mode processes collected data as if they were received in real-time. The main difference is that the offline mode often analyzes the whole data, while the online and pseudo-online modes only analyze data in short time windows. Offline analysis is usually done with asynchronous BCIs, which restricts analysis to predefined time windows. Asynchronous BCI, compatible with online and pseudo-online modes, allows flexible mental activity duration. Offline processing tends to be more accurate, while online analysis is better for therapeutic applications. Pseudo-online implementation approximates online processing without real-time constraints. Many BCI studies being offline introduce biases compared to real-life scenarios, impacting classification algorithm performance. Approach: The objective of this research paper is therefore to extend the current MOABB framework, operating in offline mode, so as to allow a comparison of different algorithms in a pseudo-online setting with the use of a technology based on overlapping sliding windows. To do this will require the introduction of a idle state event in the dataset that takes into account all different possibilities that are not task thinking. To validate the performance of the algorithms we will use the normalized Matthews Correlation Coefficient (nMCC) and the Information Transfer Rate (ITR). Main results: We analyzed the state-of-the-art algorithms of the last 15 years over several Motor Imagery (MI) datasets composed by several subjects, showing the differences between the two approaches from a statistical point of view. Significance: The ability to analyze the performance of different algorithms in offline and pseudo-online modes will allow the BCI community to obtain more accurate and comprehensive reports regarding the performance of classification algorithms.
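A minimal sketch of the overlapping sliding-window segmentation that underlies pseudo-online evaluation follows; the window length and step size are illustrative parameters, not the paper's settings.

```python
import numpy as np

def sliding_windows(eeg, fs, win_s=1.0, step_s=0.1):
    """eeg: (channels, samples) array processed as if it were streamed."""
    win, step = int(win_s * fs), int(step_s * fs)
    starts = range(0, eeg.shape[1] - win + 1, step)
    return np.stack([eeg[:, s:s + win] for s in starts])

windows = sliding_windows(np.random.randn(8, 2500), fs=250)
print(windows.shape)   # (91, 8, 250): one decision per 100 ms of data
```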

Overcoming Overconfidence for Active Learning

  • paper_url: http://arxiv.org/abs/2308.10571
  • repo_url: None
  • paper_authors: Yujin Hwang, Won Jo, Juyoung Hong, Yukyung Choi
  • for: addressing the issue of overconfidence in active learning scenarios
  • methods: + Cross-Mix-and-Mix (CMaM) augmentation strategy to calibrate the model + Ranked Margin Sampling (RankedMS) selection strategy to prevent overly confident predictions
  • results: 
    + experiments and analyses demonstrate that the proposed methods facilitate efficient data selection and alleviate overconfidence while remaining readily applicable.
    Abstract It is not an exaggeration to say that the recent progress in artificial intelligence technology depends on large-scale and high-quality data. Simultaneously, a prevalent issue exists everywhere: the budget for data labeling is constrained. Active learning is a prominent approach for addressing this issue, where valuable data for labeling is selected through a model and utilized to iteratively adjust the model. However, due to the limited amount of data in each iteration, the model is vulnerable to bias; thus, it is more likely to yield overconfident predictions. In this paper, we present two novel methods to address the problem of overconfidence that arises in the active learning scenario. The first is an augmentation strategy named Cross-Mix-and-Mix (CMaM), which aims to calibrate the model by expanding the limited training distribution. The second is a selection strategy named Ranked Margin Sampling (RankedMS), which prevents choosing data that leads to overly confident predictions. Through various experiments and analyses, we are able to demonstrate that our proposals facilitate efficient data selection by alleviating overconfidence, even though they are readily applicable.
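A minimal sketch of margin-based selection in the spirit of RankedMS follows; the paper's exact ranking and score combination may differ. A small gap between the top-2 predicted probabilities marks uncertain, rather than overconfident, predictions.

```python
import torch

def ranked_margin_select(logits, k):
    probs = torch.softmax(logits, dim=1)
    top2 = probs.topk(2, dim=1).values
    margin = top2[:, 0] - top2[:, 1]       # small margin = uncertain sample
    return margin.argsort()[:k]            # indices of the k least confident

pool_logits = torch.randn(1000, 10)        # unlabeled-pool predictions
query_idx = ranked_margin_select(pool_logits, k=64)
```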

Decentralized Riemannian Conjugate Gradient Method on the Stiefel Manifold

  • paper_url: http://arxiv.org/abs/2308.10547
  • repo_url: None
  • paper_authors: Jun Chen, Haishan Ye, Mengmeng Wang, Tianxin Huang, Guang Dai, Ivor W. Tsang, Yong Liu
  • for: proposing a decentralized Riemannian conjugate gradient descent (DRCGD) method for minimizing a global function over the Stiefel manifold.
  • methods: applies the conjugate gradient method in a decentralized network, where each agent is associated with a local function and agents communicate over an undirected connected graph.
  • results: achieves global convergence over the Stiefel manifold while avoiding expensive Riemannian geometric operations such as retractions, exponential maps, and vector transports, thereby reducing each agent's computational complexity.
    Abstract The conjugate gradient method is a crucial first-order optimization method that generally converges faster than the steepest descent method, and its computational cost is much lower than the second-order methods. However, while various types of conjugate gradient methods have been studied in Euclidean spaces and on Riemannian manifolds, there has been little study of them in distributed scenarios. This paper proposes a decentralized Riemannian conjugate gradient descent (DRCGD) method that aims at minimizing a global function over the Stiefel manifold. The optimization problem is distributed among a network of agents, where each agent is associated with a local function, and communication between agents occurs over an undirected connected graph. Since the Stiefel manifold is a non-convex set, a global function is represented as a finite sum of possibly non-convex (but smooth) local functions. The proposed method is free from expensive Riemannian geometric operations such as retractions, exponential maps, and vector transports, thereby reducing the computational complexity required by each agent. To the best of our knowledge, DRCGD is the first decentralized Riemannian conjugate gradient algorithm to achieve global convergence over the Stiefel manifold.

Towards Accelerated Model Training via Bayesian Data Selection

  • paper_url: http://arxiv.org/abs/2308.10544
  • repo_url: None
  • paper_authors: Zhijie Deng, Peng Cui, Jun Zhu
  • for: accelerating model training and improving robustness when mislabeled, duplicated, or biased data would otherwise prolong training and even hinder convergence.
  • methods: a lightweight Bayesian treatment of data selection that incorporates off-the-shelf zero-shot predictors built on large-scale pre-trained models, applied in the online batch selection scenario.
  • results: extensive experiments on challenging benchmarks with considerable data noise and imbalance show superior training efficiency over competitive baselines; notably, on the challenging WebVision benchmark, the method reaches similar predictive performance with significantly fewer training iterations than leading data selection methods.
    Abstract Mislabeled, duplicated, or biased data in real-world scenarios can lead to prolonged training and even hinder model convergence. Traditional solutions prioritizing easy or hard samples lack the flexibility to handle such a variety simultaneously. Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss. However, its practical adoption relies on less principled approximations and additional clean holdout data. This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models. The resulting algorithm is efficient and easy-to-implement. We perform extensive empirical studies on challenging benchmarks with considerable data noise and imbalance in the online batch selection scenario, and observe superior training efficiency over competitive baselines. Notably, on the challenging WebVision benchmark, our method can achieve similar predictive performance with significantly fewer training iterations than leading data selection methods.

Learning Weakly Convex Regularizers for Convergent Image-Reconstruction Algorithms

  • paper_url: http://arxiv.org/abs/2308.10542
  • repo_url: None
  • paper_authors: Alexis Goujon, Sebastian Neumayer, Michael Unser
  • for: learning non-convex regularizers with a prescribed upper bound on their weak-convexity modulus.
  • methods: replaces convex regularizers with such weakly convex ones, which rely on few parameters (fewer than 15,000) and retain a signal-processing interpretation as learned sparsity-promoting regularizers, validated through mathematical analysis and numerical experiments.
  • results: the resulting denoisers outperform convex-regularization methods, can be deployed in provably convergent iterative schemes for inverse problems, and generalize well to CT and MRI reconstruction, with an excellent trade-off between performance, number of parameters, guarantees, and interpretability.
    Abstract We propose to learn non-convex regularizers with a prescribed upper bound on their weak-convexity modulus. Such regularizers give rise to variational denoisers that minimize a convex energy. They rely on few parameters (less than 15,000) and offer a signal-processing interpretation as they mimic handcrafted sparsity-promoting regularizers. Through numerical experiments, we show that such denoisers outperform convex-regularization methods as well as the popular BM3D denoiser. Additionally, the learned regularizer can be deployed to solve inverse problems with iterative schemes that provably converge. For both CT and MRI reconstruction, the regularizer generalizes well and offers an excellent tradeoff between performance, number of parameters, guarantees, and interpretability when compared to other data-driven approaches.

KGrEaT: A Framework to Evaluate Knowledge Graphs via Downstream Tasks

  • paper_url: http://arxiv.org/abs/2308.10537
  • repo_url: None
  • paper_authors: Nicolas Heist, Sven Hertling, Heiko Paulheim
  • for: evaluating the quality of knowledge graphs by how useful they actually are for downstream tasks.
  • methods: the KGrEaT framework takes a knowledge graph as input, automatically maps it to a fixed setup of tasks and datasets (such as classification, clustering, or recommendation), and computes performance metrics for the defined tasks.
  • results: KGrEaT makes it possible to compare different knowledge graphs as such, including aspects such as their accessibility and expressiveness, rather than comparing processing methods on a single task.
    Abstract In recent years, countless research papers have addressed the topics of knowledge graph creation, extension, or completion in order to create knowledge graphs that are larger, more correct, or more diverse. This research is typically motivated by the argumentation that using such enhanced knowledge graphs to solve downstream tasks will improve performance. Nonetheless, this is hardly ever evaluated. Instead, the predominant evaluation metrics - aiming at correctness and completeness - are undoubtedly valuable but fail to capture the complete picture, i.e., how useful the created or enhanced knowledge graph actually is. Further, the accessibility of such a knowledge graph is rarely considered (e.g., whether it contains expressive labels, descriptions, and sufficient context information to link textual mentions to the entities of the knowledge graph). To better judge how well knowledge graphs perform on actual tasks, we present KGrEaT - a framework to estimate the quality of knowledge graphs via actual downstream tasks like classification, clustering, or recommendation. Instead of comparing different methods of processing knowledge graphs with respect to a single task, the purpose of KGrEaT is to compare various knowledge graphs as such by evaluating them on a fixed task setup. The framework takes a knowledge graph as input, automatically maps it to the datasets to be evaluated on, and computes performance metrics for the defined tasks. It is built in a modular way to be easily extendable with additional tasks and datasets.

DPAN: Dynamic Preference-based and Attribute-aware Network for Relevant Recommendations

  • paper_url: http://arxiv.org/abs/2308.10527
  • repo_url: None
  • paper_authors: Wei Dai, Yingmin Su, Xiaofeng Pan
  • for: improving the click-through rate (CTR) of relevant recommendations on e-commerce platforms.
  • methods: the Dynamic Preference-based and Attribute-aware Network (DPAN) predicts CTR with Attribute-aware Activation Values Generation (AAVG) and Bi-dimensional Compression-based Re-expression (BCR) to learn similarity and diversity representations of user interests and item information, plus Shallow and Deep Union-based Fusion (SDUF) to capture users' dynamic preferences for the diversity of recommendation results under various conditions.
  • results: extensive offline experiments and online A/B testing show a significant 7.62% improvement in CTR; DPAN has been successfully deployed on our e-commerce platform, serving the primary traffic for relevant recommendations, and its code is publicly available.
    Abstract In e-commerce platforms, the relevant recommendation is a unique scenario providing related items for a trigger item that users are interested in. However, users' preferences for the similarity and diversity of recommendation results are dynamic and vary under different conditions. Moreover, individual item-level diversity is too coarse-grained since all recommended items are related to the trigger item. Thus, the two main challenges are to learn fine-grained representations of similarity and diversity and capture users' dynamic preferences for them under different conditions. To address these challenges, we propose a novel method called the Dynamic Preference-based and Attribute-aware Network (DPAN) for predicting Click-Through Rate (CTR) in relevant recommendations. Specifically, based on Attribute-aware Activation Values Generation (AAVG), Bi-dimensional Compression-based Re-expression (BCR) is designed to obtain similarity and diversity representations of user interests and item information. Then Shallow and Deep Union-based Fusion (SDUF) is proposed to capture users' dynamic preferences for the diverse degree of recommendation results according to various conditions. DPAN has demonstrated its effectiveness through extensive offline experiments and online A/B testing, resulting in a significant 7.62% improvement in CTR. Currently, DPAN has been successfully deployed on our e-commerce platform serving the primary traffic for relevant recommendations. The code of DPAN has been made publicly available.

Information Theory-Guided Heuristic Progressive Multi-View Coding

  • paper_url: http://arxiv.org/abs/2308.10522
  • repo_url: None
  • paper_authors: Jiangmeng Li, Hang Gao, Wenwen Qiang, Changwen Zheng
  • for: proposing an information-theoretic framework for generalized multi-view learning that addresses the unfiltered view-specific noise and missing theoretical grounding of existing multi-view methods.
  • methods: a three-tier progressive architecture, Information theory-guided hierarchical Progressive Multi-view Coding (IPMC): a distribution tier that aligns distributions between views to reduce view-specific noise, a set tier that builds self-adjusted contrasting pools adaptively modified by a view filter, and an instance tier that adopts a designed unified loss to learn representations and reduce gradient interference.
  • results: theoretical and empirical studies demonstrate the superiority of IPMC over state-of-the-art methods.
    Abstract Multi-view representation learning aims to capture comprehensive information from multiple views of a shared context. Recent works intuitively apply contrastive learning to different views in a pairwise manner, which still has shortcomings: view-specific noise is not filtered in learning view-shared representations; fake negative pairs, in which the negative terms actually belong to the same class as the positive, are treated the same as real negative pairs; and evenly measuring the similarities between terms might interfere with optimization. Importantly, few works study the theoretical framework of generalized self-supervised multi-view learning, especially for more than two views. To this end, we rethink the existing multi-view learning paradigm from the perspective of information theory and then propose a novel information theoretical framework for generalized multi-view learning. Guided by it, we build a multi-view coding method with a three-tier progressive architecture, namely Information theory-guided hierarchical Progressive Multi-view Coding (IPMC). In the distribution-tier, IPMC aligns the distribution between views to reduce view-specific noise. In the set-tier, IPMC constructs self-adjusted contrasting pools, which are adaptively modified by a view filter. Lastly, in the instance-tier, we adopt a designed unified loss to learn representations and reduce the gradient interference. Theoretically and empirically, we demonstrate the superiority of IPMC over state-of-the-art methods.

Performance Enhancement Leveraging Mask-RCNN on Bengali Document Layout Analysis

  • paper_url: http://arxiv.org/abs/2308.10511
  • repo_url: None
  • paper_authors: Shrestha Datta, Md Adith Mollah, Raisa Fairooz, Tariful Islam Fahim
  • for: understanding historical documents through Document Layout Analysis (DLA), which divides documents into sections such as paragraphs, images, and tables so that machines can read and understand them.
  • methods: trains a Mask R-CNN model on the BaDLAD dataset and improves it through step-by-step hyperparameter tuning.
  • results: achieves a good Dice score of 0.889 on Bangla document layout analysis; a model trained on English documents did not transfer well to Bangla, showing that each language has its own challenges.
    Abstract Understanding digital documents is like solving a puzzle, especially historical ones. Document Layout Analysis (DLA) helps with this puzzle by dividing documents into sections like paragraphs, images, and tables. This is crucial for machines to read and understand these documents. In the DL Sprint 2.0 competition, we worked on understanding Bangla documents. We used a dataset called BaDLAD with lots of examples. We trained a special model called Mask R-CNN to help with this understanding. We made this model better by step-by-step hyperparameter tuning, and we achieved a good dice score of 0.889. However, not everything went perfectly. We tried using a model trained for English documents, but it didn't fit well with Bangla. This showed us that each language has its own challenges. Our solution for the DL Sprint 2.0 is publicly available at https://www.kaggle.com/competitions/dlsprint2/discussion/432201 along with notebooks, weights, and inference notebook.

A Clustering Algorithm to Organize Satellite Hotspot Data for the Purpose of Tracking Bushfires Remotely

  • paper_url: http://arxiv.org/abs/2308.10505
  • repo_url: https://github.com/tengmcing/hotspots-clustering-algorithm
  • paper_authors: Weihao Li, Emily Dodwell, Dianne Cook
  • for: proposing a spatiotemporal clustering algorithm and its implementation in the R package spotoroo.
  • methods: motivated by the catastrophic Australian bushfires of 2019-2020 and enabled by satellite hotspot data, the algorithm builds on two existing spatiotemporal clustering algorithms but clusters points spatially in conjunction with their movement across consecutive time periods, with key parameters that can be adjusted for different locations and satellite data sources.
  • results: bushfire data from Victoria, Australia illustrates that the algorithm clusters hotspots accurately.
    Abstract This paper proposes a spatiotemporal clustering algorithm and its implementation in the R package spotoroo. This work is motivated by the catastrophic bushfires in Australia throughout the summer of 2019-2020 and made possible by the availability of satellite hotspot data. The algorithm is inspired by two existing spatiotemporal clustering algorithms but makes enhancements to cluster points spatially in conjunction with their movement across consecutive time periods. It also allows for the adjustment of key parameters, if required, for different locations and satellite data sources. Bushfire data from Victoria, Australia, is used to illustrate the algorithm and its use within the package.
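A minimal sketch of the core clustering idea follows; this is not the spotoroo implementation, and the distance and time thresholds are illustrative. A hotspot joins an existing cluster if it is close in space to a cluster member observed within the last few time periods, otherwise it starts a new cluster.

```python
import numpy as np

def cluster_hotspots(points, times, space_eps=3.0, time_eps=2):
    """points: (n, 2) coordinates; times: (n,) integer time periods."""
    labels = -np.ones(len(points), dtype=int)
    next_label = 0
    order = np.argsort(times)
    for i in order:
        # join the cluster of any earlier hotspot that is close in space
        # and no more than `time_eps` periods old
        for j in order:
            if times[j] >= times[i]:
                break
            near = np.linalg.norm(points[i] - points[j]) <= space_eps
            recent = times[i] - times[j] <= time_eps
            if near and recent and labels[j] >= 0:
                labels[i] = labels[j]
                break
        if labels[i] < 0:
            labels[i] = next_label
            next_label += 1
    return labels

pts = np.array([[0, 0], [1, 0], [50, 50], [1.5, 0.5]])
t = np.array([0, 1, 1, 2])
print(cluster_hotspots(pts, t))   # [0 0 1 0]: two fires, one growing
```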

Adaptive Thresholding Heuristic for KPI Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.10504
  • repo_url: None
  • paper_authors: Ebenezer R. H. P. Isaac, Akshat Sharma
  • for: providing an adaptive thresholding heuristic for detecting anomalies in time-series key performance indicators (KPIs).
  • methods: dynamically adjusts the detection threshold based on local properties of the data distribution and adapts to changes in time-series patterns, deriving the threshold from the expected periodicity and the observed proportion of anomalies.
  • results: experiments show computationally efficient near-real-time anomaly detection, and the heuristic is flexible enough to be combined with any seasonality decomposition method and any outlier detector that yields an outlier score.
    Abstract A plethora of outlier detectors have been explored in the time series domain, however, in a business sense, not all outliers are anomalies of interest. Existing anomaly detection solutions are confined to certain outlier detectors limiting their applicability to broader anomaly detection use cases. Network KPIs (Key Performance Indicators) tend to exhibit stochastic behaviour producing statistical outliers, most of which do not adversely affect business operations. Thus, a heuristic is required to capture the business definition of an anomaly for time series KPI. This article proposes an Adaptive Thresholding Heuristic (ATH) to dynamically adjust the detection threshold based on the local properties of the data distribution and adapt to changes in time series patterns. The heuristic derives the threshold based on the expected periodicity and the observed proportion of anomalies minimizing false positives and addressing concept drift. ATH can be used in conjunction with any underlying seasonality decomposition method and an outlier detector that yields an outlier score. This method has been tested on EON1-Cell-U, a labeled KPI anomaly dataset produced by Ericsson, to validate our hypothesis. Experimental results show that ATH is computationally efficient making it scalable for near real time anomaly detection and flexible with multiple forecasters and outlier detectors.
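A minimal sketch in the spirit of ATH follows; the actual heuristic differs in its details. Here the threshold at each step is a local quantile set by the expected anomaly proportion, computed over a window matching the expected periodicity.

```python
import numpy as np

def adaptive_threshold(scores, expected_anomaly_rate=0.01, period=24):
    thresholds = np.empty_like(scores, dtype=float)
    for t in range(len(scores)):
        window = scores[max(0, t - period):t + 1]       # local distribution
        thresholds[t] = np.quantile(window, 1.0 - expected_anomaly_rate)
    return thresholds

scores = np.abs(np.random.randn(240))     # outlier scores from any detector
flags = scores > adaptive_threshold(scores)
```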

GradientCoin: A Peer-to-Peer Decentralized Large Language Models

  • paper_url: http://arxiv.org/abs/2308.10502
  • repo_url: None
  • paper_authors: Yeqi Gao, Zhao Song, Junze Yin
  • for: proposing a decentralized large language model (LLM) built on the ideas of the Bitcoin electronic cash system, to address the centralized control and trust problems of existing LLMs.
  • methods: a purely theoretical design of a decentralized LLM that operates similarly to the Bitcoin cash system, drawing on its techniques and concepts.
  • results: the paper argues that such a system is unlikely to outperform the standard Bitcoin system economically, and would mainly attract two kinds of people: those who prefer a decentralized ChatGPT-like software, and those who believe the purpose of carbon-based life is to create silicon-based life.
    Abstract Since 2008, after the proposal of a Bitcoin electronic cash system, Bitcoin has fundamentally changed the economic system over the last decade. Since 2022, large language models (LLMs) such as GPT have outperformed humans in many real-life tasks. However, these large language models have several practical issues. For example, the model is centralized and controlled by a specific unit. One weakness is that if that unit decides to shut down the model, it cannot be used anymore. The second weakness is the lack of guaranteed discrepancy behind this model, as certain dishonest units may design their own models and feed them unhealthy training data. In this work, we propose a purely theoretical design of a decentralized LLM that operates similarly to a Bitcoin cash system. However, implementing such a system might encounter various practical difficulties. Furthermore, this new system is unlikely to perform better than the standard Bitcoin system in economics. Therefore, the motivation for designing such a system is limited. It is likely that only two types of people would be interested in setting up a practical system for it: $\bullet$ Those who prefer to use a decentralized ChatGPT-like software. $\bullet$ Those who believe that the purpose of carbon-based life is to create silicon-based life, such as Optimus Prime in Transformers. The reason the second type of people may be interested is that it is possible that one day an AI system like this will awaken and become the next level of intelligence on this planet.

Deep Learning of Delay-Compensated Backstepping for Reaction-Diffusion PDEs

  • paper_url: http://arxiv.org/abs/2308.10501
  • repo_url: None
  • paper_authors: Shanshan Wang, Mamadou Diagne, Miroslav Krstić
  • for: encoding an entire PDE control methodology in a deep neural network so that, for each new functional coefficient of a PDE plant, the backstepping gains are obtained by a simple function evaluation.
  • methods: a DeepONet approximation of multiple (cascaded) nonlinear operators, used in a delay-compensated PDE backstepping controller whose learned control operator is the approximated gain kernel.
  • results: the approach handles the cascade of a Goursat-form hyperbolic PDE and a parabolic PDE on a rectangle, and guarantees exponential stability in the $L^2$ norm of the plant state and the $H^1$ norm of the input delay state; simulations illustrate the theory.
    Abstract Deep neural networks that approximate nonlinear function-to-function mappings, i.e., operators, which are called DeepONet, have been demonstrated in recent articles to be capable of encoding entire PDE control methodologies, such as backstepping, so that, for each new functional coefficient of a PDE plant, the backstepping gains are obtained through a simple function evaluation. These initial results have been limited to single PDEs from a given class, approximating the solutions of only single-PDE operators for the gain kernels. In this paper we expand this framework to the approximation of multiple (cascaded) nonlinear operators. Multiple operators arise in the control of PDE systems from distinct PDE classes, such as the system in this paper: a reaction-diffusion plant, which is a parabolic PDE, with input delay, which is a hyperbolic PDE. The DeepONet-approximated nonlinear operator is a cascade/composition of the operators defined by one hyperbolic PDE of the Goursat form and one parabolic PDE on a rectangle, both of which are bilinear in their input functions and not explicitly solvable. For the delay-compensated PDE backstepping controller, which employs the learned control operator, namely, the approximated gain kernel, we guarantee exponential stability in the $L^2$ norm of the plant state and the $H^1$ norm of the input delay state. Simulations illustrate the contributed theory.

Using Autoencoders and AutoDiff to Reconstruct Missing Variables in a Set of Time Series

  • paper_url: http://arxiv.org/abs/2308.10496
  • repo_url: None
  • paper_authors: Jan-Philipp Roche, Oliver Niggemann, Jens Friebe
  • for: reconstructing missing variables in a set of time series with a black-box model, overcoming the fixed input-output feature combinations of existing approaches.
  • methods: an autoencoder is trained as usual with every feature on both sides and its parameters are then frozen; the missing variables are declared search variables at the autoencoder input and optimized via automatic differentiation of the loss computed on the available features. Different input and output feature combinations can thus be realized without retraining the autoencoder.
  • results: evaluated on a strongly nonlinear electrical component, the approach works well with one of four variables missing, and generally even with multiple missing variables.
    Abstract Existing black box modeling approaches in machine learning suffer from a fixed input and output feature combination. In this paper, a new approach to reconstruct missing variables in a set of time series is presented. An autoencoder is trained as usual with every feature on both sides, and the neural network parameters are fixed after this training. Then, the searched variables are defined as missing variables at the autoencoder input and optimized via automatic differentiation. This optimization is performed with respect to a loss computed on the available features. With this method, different input and output feature combinations of the trained model can be realized by defining the searched variables as missing variables and reconstructing them. The combination can be changed without training the autoencoder again. The approach is evaluated on a strongly nonlinear electrical component. It works well with one of four variables missing and generally even with multiple missing variables.
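A minimal PyTorch sketch of the described procedure follows; the layer sizes and data are illustrative. The autoencoder is trained, frozen, and the missing feature is declared a free parameter optimized by automatic differentiation of the reconstruction loss on the available features only.

```python
import torch
import torch.nn as nn

n_feats, missing_idx = 4, 2
ae = nn.Sequential(nn.Linear(n_feats, 8), nn.Tanh(), nn.Linear(8, n_feats))
# ... train `ae` on complete data here, then freeze its parameters:
for p in ae.parameters():
    p.requires_grad_(False)

observed = torch.randn(100, n_feats)              # time series batch (stand-in)
available = [i for i in range(n_feats) if i != missing_idx]

guess = torch.zeros(100, 1, requires_grad=True)   # the searched variable
opt = torch.optim.Adam([guess], lr=0.05)
for _ in range(500):
    # assemble the autoencoder input with the guessed column in place
    cols = [guess if i == missing_idx else observed[:, i:i + 1]
            for i in range(n_feats)]
    x = torch.cat(cols, dim=1)
    recon = ae(x)
    # loss uses only the available features, as described in the abstract
    loss = ((recon[:, available] - observed[:, available]) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
# `guess` now approximates the missing feature at each time step.
```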

Deciphering Raw Data in Neuro-Symbolic Learning with Provable Guarantees

  • paper_url: http://arxiv.org/abs/2308.10487
  • repo_url: None
  • paper_authors: Lue Tao, Yu-Xuan Huang, Wang-Zhou Dai, Yuan Jiang
  • for: studying the learnability of neuro-symbolic hybrid systems, in which perception models are aided by information inferred from a symbolic knowledge base through logical reasoning.
  • methods: introduces a novel way of characterizing the supervision signals from a knowledge base and establishes a criterion for determining the knowledge's efficacy in facilitating successful learning.
  • results: the analysis shows that many knowledge bases satisfy the criterion and thus enable effective learning, while some fail to satisfy it, indicating potential failures; comprehensive experiments confirm the utility of the criterion on benchmark tasks.
    Abstract Neuro-symbolic hybrid systems are promising for integrating machine learning and symbolic reasoning, where perception models are facilitated with information inferred from a symbolic knowledge base through logical reasoning. Despite empirical evidence showing the ability of hybrid systems to learn accurate perception models, the theoretical understanding of learnability is still lacking. Hence, it remains unclear why a hybrid system succeeds for a specific task and when it may fail given a different knowledge base. In this paper, we introduce a novel way of characterising supervision signals from a knowledge base, and establish a criterion for determining the knowledge's efficacy in facilitating successful learning. This, for the first time, allows us to address the two questions above by inspecting the knowledge base under investigation. Our analysis suggests that many knowledge bases satisfy the criterion, thus enabling effective learning, while some fail to satisfy it, indicating potential failures. Comprehensive experiments confirm the utility of our criterion on benchmark tasks.

Deep Metric Loss for Multimodal Learning

  • paper_url: http://arxiv.org/abs/2308.10486
  • repo_url: None
  • paper_authors: Sehwan Moon, Hyunju Lee
  • for: proposing a novel MultiModal loss for multimodal learning that better handles the instance-dependent contributions of different modalities.
  • methods: the loss subgroups instances according to their unimodal contributions, preventing inefficient learning caused by overfitting and efficiently optimizing multimodal models.
  • results: experiments show that the MultiModal loss improves recent models on four real multimodal datasets while helping to avoid overfitting and overconfident predictions.
    Abstract Multimodal learning often outperforms its unimodal counterparts by exploiting unimodal contributions and cross-modal interactions. However, focusing only on integrating multimodal features into a unified comprehensive representation overlooks the unimodal characteristics. In real data, the contributions of modalities can vary from instance to instance, and they often reinforce or conflict with each other. In this study, we introduce a novel \text{MultiModal} loss paradigm for multimodal learning, which subgroups instances according to their unimodal contributions. \text{MultiModal} loss can prevent inefficient learning caused by overfitting and efficiently optimize multimodal models. On synthetic data, \text{MultiModal} loss demonstrates improved classification performance by subgrouping difficult instances within certain modalities. On four real multimodal datasets, our loss is empirically shown to improve the performance of recent models. Ablation studies verify the effectiveness of our loss. Additionally, we show that our loss generates a reliable prediction score for each modality, which is essential for subgrouping. Our \text{MultiModal} loss is a novel loss function to subgroup instances according to the contribution of modalities in multimodal learning and is applicable to a variety of multimodal models with unimodal decisions. Our code is available at https://github.com/SehwanMoon/MultiModalLoss.
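The paper's exact loss is not spelled out in this digest, so the following is a minimal PyTorch sketch of the subgrouping idea only: per-instance, per-modality losses are reweighted so that each instance leans on the modality that currently explains it best. The softmax weighting is an assumption for illustration, not the published formulation.

```python
import torch
import torch.nn.functional as F

def multimodal_subgroup_loss(logits_per_modality, targets):
    # Per-instance, per-modality cross-entropy, kept unreduced so that
    # instances can be subgrouped by modality contribution.
    losses = torch.stack(
        [F.cross_entropy(l, targets, reduction="none") for l in logits_per_modality]
    )  # shape: (num_modalities, batch)
    # Assumed weighting: emphasize, per instance, the modality that currently
    # explains it best (lower loss leads to higher weight).
    weights = F.softmax(-losses.detach(), dim=0)
    return (weights * losses).sum(dim=0).mean()

# Toy usage with two modalities and three classes.
targets = torch.randint(0, 3, (8,))
logits_audio, logits_text = torch.randn(8, 3), torch.randn(8, 3)
print(multimodal_subgroup_loss([logits_audio, logits_text], targets))
```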

An Effective Method using Phrase Mechanism in Neural Machine Translation

  • paper_url: http://arxiv.org/abs/2308.10482
  • repo_url: https://github.com/phuongnm94/PhraseTransformer
  • paper_authors: Phuong Minh Nguyen, Le Minh Nguyen
  • for: improving Transformer-based Neural Machine Translation (NMT) on the Vietnamese-Chinese parallel-corpus language pair.
  • methods: a phrase mechanism, PhraseTransformer, built on top of the strong Transformer baseline.
  • results: on the MT dataset of the VLSP 2022 competition, BLEU scores of 35.3 for Vietnamese to Chinese and 33.2 for Chinese to Vietnamese.
    Abstract Machine Translation is one of the essential tasks in Natural Language Processing (NLP), which has massive applications in real life as well as contributing to other tasks in the NLP research community. Recently, Transformer-based methods have attracted numerous researchers in this domain and achieved state-of-the-art results in most language pairs. In this paper, we report an effective method using a phrase mechanism, PhraseTransformer, to improve the strong baseline model Transformer in constructing a Neural Machine Translation (NMT) system for the Vietnamese-Chinese parallel corpora. Our experiments on the MT dataset of the VLSP 2022 competition achieved a BLEU score of 35.3 on Vietnamese-to-Chinese and 33.2 on Chinese-to-Vietnamese data. Our code is available at https://github.com/phuongnm94/PhraseTransformer.
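The figures above are corpus-level BLEU scores; as a small aside, such scores can be computed with the sacrebleu package. The sentence pairs below are placeholders, not VLSP 2022 data.

```python
# Minimal sketch of corpus-level BLEU scoring with sacrebleu.
import sacrebleu

hypotheses = ["anh ấy đang đọc sách", "tôi thích trà"]         # system output
references = [["anh ấy đang đọc sách", "tôi thích uống trà"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```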

Deep Semi-supervised Anomaly Detection with Metapath-based Context Knowledge

  • paper_url: http://arxiv.org/abs/2308.10918
  • repo_url: None
  • paper_authors: Hwan Kim, Junghoon Kim, Byung Suk Lee, Sungsu Lim
  • for: graph anomaly detection via metapath-based semi-supervised learning, addressing the limitations of previous methods.
  • methods: the MSAD framework places GCN layers in both the encoder and decoder to efficiently propagate context information between abnormal and normal nodes; metapath-based context information and a specifically crafted anomaly community strengthen the learning of structural and attribute differences, both globally and locally.
  • results: comprehensive experiments on seven real-world networks show that MSAD outperforms state-of-the-art techniques, paving the way for work on optimizing and analyzing metapath patterns for anomaly detection on attributed networks.
    Abstract Graph anomaly detection has attracted considerable attention in recent years. This paper introduces a novel approach that leverages metapath-based semi-supervised learning, addressing the limitations of previous methods. We present a new framework, Metapath-based Semi-supervised Anomaly Detection (MSAD), incorporating GCN layers in both the encoder and decoder to efficiently propagate context information between abnormal and normal nodes. The design of metapath-based context information and a specifically crafted anomaly community enhance the process of learning differences in structures and attributes, both globally and locally. Through a comprehensive set of experiments conducted on seven real-world networks, this paper demonstrates the superiority of the MSAD method compared to state-of-the-art techniques. The promising results of this study pave the way for future investigations, focusing on the optimization and analysis of metapath patterns to further enhance the effectiveness of anomaly detection on attributed networks.
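As a rough sketch of the GCN encoder-decoder backbone described above (leaving out MSAD's metapath context and anomaly community), one can reconstruct node attributes through normalized graph convolutions and score nodes by reconstruction error; the untrained weights below are purely illustrative.

```python
import numpy as np

def normalize_adj(A):
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                      # symmetric, no self-loops
X = rng.normal(size=(6, 4))                         # node attributes

A_norm = normalize_adj(A)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 4))
H = np.maximum(A_norm @ X @ W1, 0)                  # GCN encoder layer (ReLU)
X_rec = A_norm @ H @ W2                             # GCN decoder layer

anomaly_score = np.linalg.norm(X - X_rec, axis=1)   # per-node reconstruction error
print(anomaly_score)
```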

Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.10462
  • repo_url: None
  • paper_authors: Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, Houari Sahraoui
  • for: specializing Large Language Models (LLMs) that generate code from natural-language intents to task-specific data under scarce computational resources, using Parameter-Efficient Fine-Tuning (PEFT).
  • methods: a comprehensive empirical study of PEFT techniques across a wide range of LLMs in the automated code-generation scenario, compared against In-Context Learning (ICL).
  • results: PEFT techniques outperform ICL on a wide range of LLMs, reducing the computational burden while improving performance, opening opportunities for broader PEFT applications in software engineering.
    Abstract Large Language Models (LLMs) possess impressive capabilities to generate meaningful code snippets given natural language intents in zero-shot, i.e., without the need for specific fine-tuning. In the perspective of unleashing their full potential, prior work has demonstrated the benefits of fine-tuning the models to task-specific data. However, fine-tuning process demands heavy computational costs and is intractable when resources are scarce, especially for models with billions of parameters. In light of these challenges, previous studies explored In-Context Learning (ICL) as an effective strategy to generate contextually appropriate code without fine-tuning. However, it operates at inference time and does not involve learning task-specific parameters, potentially limiting the model's performance on downstream tasks. In this context, we foresee that Parameter-Efficient Fine-Tuning (PEFT) techniques carry a high potential for efficiently specializing LLMs to task-specific data. In this paper, we deliver a comprehensive study of LLMs with the impact of PEFT techniques under the automated code generation scenario. Our experimental results reveal the superiority and potential of such techniques over ICL on a wide range of LLMs in reducing the computational burden and improving performance. Therefore, the study opens opportunities for broader applications of PEFT in software engineering scenarios.
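As one concrete PEFT technique of the kind the study covers, the sketch below applies LoRA to a small code LLM via the Hugging Face peft library. The model name and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "Salesforce/codegen-350M-mono"  # small code model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Low-rank adapters are injected into attention projections; only these
# adapter weights are trained, leaving the base model frozen.
lora = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16,
                  lora_dropout=0.05)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a small fraction of all weights train
```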

Adaptive Local Steps Federated Learning with Differential Privacy Driven by Convergence Analysis

  • paper_url: http://arxiv.org/abs/2308.10457
  • repo_url: None
  • paper_authors: Xinpeng Ling, Jie Fu, Zhili Chen
  • for: protecting sensitive data in federated learning (FL) while providing privacy guarantees under constrained privacy budgets and communication resources.
  • methods: differential privacy is applied to FL; the convergence of differentially private federated learning (DPFL) is analyzed in resource-constrained scenarios, leading to the Adaptive Local Steps Differential Privacy Federated Learning (ALS-DPFL) algorithm.
  • results: experiments on FashionMNIST and CIFAR-10 show good performance relative to previous work while preserving the privacy guarantee.
    Abstract Federated Learning (FL) is a distributed machine learning technique that allows model training among multiple devices or organizations without sharing data. However, while FL ensures that the raw data is not directly accessible to external adversaries, adversaries can still obtain some statistical information about the data through differential attacks. Differential Privacy (DP) has been proposed, which adds noise to the model or gradients to prevent adversaries from inferring private information from the transmitted parameters. We reconsider the framework of differential privacy federated learning in resource-constrained scenarios (privacy budget and communication resources). We analyze the convergence of federated learning with differential privacy (DPFL) in resource-constrained scenarios and propose an Adaptive Local Steps Differential Privacy Federated Learning (ALS-DPFL) algorithm. We evaluate our algorithm on the FashionMNIST and CIFAR-10 datasets and achieve good performance relative to previous work.
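A simplified sketch of the differentially private update a client might perform locally: clip the gradient and add Gaussian noise before stepping. Production DP-SGD clips per-sample gradients, ALS-DPFL's adaptive local-step schedule is not shown, and all constants are illustrative.

```python
import torch

def dp_sgd_step(model, loss, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    model.zero_grad()
    loss.backward()
    # Bound the gradient's sensitivity, then add calibrated Gaussian noise.
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            noise = torch.randn_like(p.grad) * noise_multiplier * clip_norm
            p -= lr * (p.grad + noise)

# Toy usage: one noisy local step on a linear classifier.
model = torch.nn.Linear(10, 2)
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
dp_sgd_step(model, torch.nn.functional.cross_entropy(model(x), y))
```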

DOMINO++: Domain-aware Loss Regularization for Deep Learning Generalizability

  • paper_url: http://arxiv.org/abs/2308.10453
  • repo_url: None
  • paper_authors: Skylar E. Stolte, Kyle Volle, Aprinda Indahlastari, Alejandro Albizu, Adam J. Woods, Kevin Brink, Matthew Hale, Ruogu Fang
  • for: improving the out-of-distribution (OOD) generalization of deep learning (DL) models for reliable deployment in real-world applications.
  • methods: DOMINO++ uses dual-guidance, dynamic domain-aware loss regularization to integrate expert-guided and data-guided knowledge for OOD generalization; unlike prior work, it replaces a fixed scaling and regularization rate with a dynamic scaling factor and an adaptive regularization rate.
  • results: outperforms the baseline model and DOMINO on OOD data, comprising synthetic noisy and rotated datasets as well as real data from a different MRI scanner at a separate site, demonstrating potential for trustworthy clinical deployment of DL models.
    Abstract Out-of-distribution (OOD) generalization poses a serious challenge for modern deep learning (DL). OOD data consists of test data that is significantly different from the model's training data. DL models that perform well on in-domain test data could struggle on OOD data. Overcoming this discrepancy is essential to the reliable deployment of DL. Proper model calibration decreases the number of spurious connections that are made between model features and class outputs. Hence, calibrated DL can improve OOD generalization by only learning features that are truly indicative of the respective classes. Previous work proposed domain-aware model calibration (DOMINO) to improve DL calibration, but it lacks designs for model generalizability to OOD data. In this work, we propose DOMINO++, a dual-guidance and dynamic domain-aware loss regularization focused on OOD generalizability. DOMINO++ integrates expert-guided and data-guided knowledge in its regularization. Unlike DOMINO which imposed a fixed scaling and regularization rate, DOMINO++ designs a dynamic scaling factor and an adaptive regularization rate. Comprehensive evaluations compare DOMINO++ with DOMINO and the baseline model for head tissue segmentation from magnetic resonance images (MRIs) on OOD data. The OOD data consists of synthetic noisy and rotated datasets, as well as real data using a different MRI scanner from a separate site. DOMINO++'s superior performance demonstrates its potential to improve the trustworthy deployment of DL on real clinical data.

PACS: Prediction and analysis of cancer subtypes from multi-omics data based on a multi-head attention mechanism model

  • paper_url: http://arxiv.org/abs/2308.10917
  • repo_url: None
  • paper_authors: Liangrui Pan, Dazheng Liu, Zhichao Feng, Wenjuan Liu, Shaoliang Peng
  • for: accurate classification of cancer subtypes from multi-omics data, helping doctors choose the most appropriate treatment, improving outcomes, and providing more accurate survival predictions.
  • methods: a supervised multi-head attention model (SMA) whose attention mechanism and feature-sharing module learn the global and local feature information of multi-omics data; multi-head attention encoders from a Siamese setup are deeply fused through the fusion module to enrich the model's parameters.
  • results: in extensive experiments on simulated, single-cell, and cancer multi-omics datasets, SMA achieves the highest accuracy, macro F1, and weighted F1 for cancer subtype classification, outperforming AE-, CNN-, and GNN-based models.
    Abstract Due to the high heterogeneity and clinical characteristics of cancer, there are significant differences in multi-omic data and clinical characteristics among different cancer subtypes. Therefore, accurate classification of cancer subtypes can help doctors choose the most appropriate treatment options, improve treatment outcomes, and provide more accurate patient survival predictions. In this study, we propose a supervised multi-head attention mechanism model (SMA) to classify cancer subtypes successfully. The attention mechanism and feature sharing module of the SMA model can successfully learn the global and local feature information of multi-omics data. Second, it enriches the parameters of the model by deeply fusing multi-head attention encoders from a Siamese architecture through the fusion module. Validated by extensive experiments, the SMA model achieves the highest accuracy, macro F1, and weighted F1 for cancer subtype classification in simulated, single-cell, and cancer multi-omics datasets compared to AE-, CNN-, and GNN-based models. Therefore, we contribute to future research on multi-omics data using our attention-based approach.
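A minimal sketch of the core fusion idea, treating each omics view's embedding as one token and attending across views with multi-head attention; the Siamese encoders and fusion module are collapsed into a single layer, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

batch, d_model = 4, 64
# One embedding per omics view (e.g., mRNA, methylation, miRNA), treated as
# a length-3 token sequence per patient.
omics_tokens = torch.randn(batch, 3, d_model)

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
fused, attn_weights = attn(omics_tokens, omics_tokens, omics_tokens)

# Pool across views and classify into a few hypothetical subtypes.
logits = nn.Linear(d_model, 5)(fused.mean(dim=1))
print(logits.shape, attn_weights.shape)
```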

CVFC: Attention-Based Cross-View Feature Consistency for Weakly Supervised Semantic Segmentation of Pathology Images

  • paper_url: http://arxiv.org/abs/2308.10449
  • repo_url: None
  • paper_authors: Liangrui Pan, Lian Wang, Zhichao Feng, Liwen Xu, Shaoliang Peng
  • for: weakly supervised semantic segmentation of pathology images, where high-quality pixel-level masks are costly to annotate, by generating pseudo-masks from image-level labels.
  • methods: CVFC, an attention-based cross-view feature-consistency, end-to-end pseudo-mask generation framework: a three-branch joint design (two ResNet38 branches and one ResNet50) whose multi-scale integrated feature maps produce class activation maps (CAMs); down-sampling and expansion adjust CAM sizes per branch, the middle branch projects features into query and key spaces to build a feature-space perception matrix that refines each branch's CAM, and the branches are co-trained with feature-consistency and feature-cross losses.
  • results: on the WSSS4LUAD dataset, an IoU of 0.7122 and an fwIoU of 0.7018, outperforming HistoSegNet, SEAM, C-CAM, WSSS-Tissue, and OEEM.
    Abstract Histopathology image segmentation is the gold standard for diagnosing cancer, and can indicate cancer prognosis. However, histopathology image segmentation requires high-quality masks, so many studies now use image-level labels to achieve pixel-level segmentation to reduce the need for fine-grained annotation. To solve this problem, we propose an attention-based cross-view feature consistency end-to-end pseudo-mask generation framework named CVFC. Specifically, CVFC is a three-branch joint framework composed of two ResNet38 branches and one ResNet50 branch, where each independent branch produces a multi-scale integrated feature map to generate a class activation map (CAM); in each branch, down-sampling and expansion adjust the size of the CAM; the middle branch projects the feature matrix to the query and key feature spaces, and generates a feature space perception matrix through the connection layer and inner product to adjust and refine the CAM of each branch; finally, the feature consistency loss and feature cross loss optimize the parameters of CVFC in co-training mode. In extensive experiments, an IoU of 0.7122 and an fwIoU of 0.7018 are obtained on the WSSS4LUAD dataset, outperforming HistoSegNet, SEAM, C-CAM, WSSS-Tissue, and OEEM.

DySuse: Susceptibility Estimation in Dynamic Social Networks

  • paper_url: http://arxiv.org/abs/2308.10442
  • repo_url: None
  • paper_authors: Yingdan Shi, Jingya Zhou, Congcong Zhang
  • for: susceptibility estimation in dynamic social networks, i.e., predicting each user's probability of being influenced by a diffusion process, a finer-grained task than estimating total influence spread.
  • methods: DySuse, a framework built on dynamic graph embedding: a structural feature module independently captures the structural information of influence diffusion on each graph snapshot, a progressive mechanism tightly couples structural and temporal information during diffusion, and a self-attention block captures temporal dependency by flexibly weighting historical timestamps.
  • results: outperforms existing dynamic graph embedding models, with satisfactory prediction performance across multiple influence diffusion models.
    Abstract Influence estimation aims to predict the total influence spread in social networks and has received surged attention in recent years. Most current studies focus on estimating the total number of influenced users in a social network, and neglect susceptibility estimation that aims to predict the probability of each user being influenced from the individual perspective. As a more fine-grained estimation task, susceptibility estimation is full of attractiveness and practical value. Based on the significance of susceptibility estimation and dynamic properties of social networks, we propose a task, called susceptibility estimation in dynamic social networks, which is even more realistic and valuable in real-world applications. Susceptibility estimation in dynamic networks has yet to be explored so far and is computationally intractable to naively adopt Monte Carlo simulation to obtain the results. To this end, we propose a novel end-to-end framework DySuse based on dynamic graph embedding technology. Specifically, we leverage a structural feature module to independently capture the structural information of influence diffusion on each single graph snapshot. Besides, we propose the progressive mechanism according to the property of influence diffusion, to couple the structural and temporal information during diffusion tightly. Moreover, a self-attention block is designed to further capture temporal dependency by flexibly weighting historical timestamps. Experimental results show that our framework is superior to the existing dynamic graph embedding models and has satisfactory prediction performance in multiple influence diffusion models.

Approximately Equivariant Graph Networks

  • paper_url: http://arxiv.org/abs/2308.10436
  • repo_url: https://github.com/nhuang37/approx_equivariant_graph_nets
  • paper_authors: Ningyuan Huang, Ron Levie, Soledad Villar
  • for: the symmetries of Graph Neural Networks (GNNs) when learning signals supported on a fixed graph, where the natural symmetries are the graph's automorphisms rather than arbitrary permutations.
  • methods: since real-world graphs tend to be asymmetric, exact symmetry is relaxed to approximate symmetry via graph coarsening; a bias-variance formula quantifies the tradeoff between expressivity loss and estimator regularity as a function of the chosen symmetry group.
  • results: theoretically and in extensive experiments on image inpainting, traffic flow prediction, and human pose estimation, the best generalization is achieved by a group suitably larger than the graph automorphism group but smaller than the full permutation group.
    Abstract Graph neural networks (GNNs) are commonly described as being permutation equivariant with respect to node relabeling in the graph. This symmetry of GNNs is often compared to the translation equivariance symmetry of Euclidean convolution neural networks (CNNs). However, these two symmetries are fundamentally different: The translation equivariance of CNNs corresponds to symmetries of the fixed domain acting on the image signal (sometimes known as active symmetries), whereas in GNNs any permutation acts on both the graph signals and the graph domain (sometimes described as passive symmetries). In this work, we focus on the active symmetries of GNNs, by considering a learning setting where signals are supported on a fixed graph. In this case, the natural symmetries of GNNs are the automorphisms of the graph. Since real-world graphs tend to be asymmetric, we relax the notion of symmetries by formalizing approximate symmetries via graph coarsening. We present a bias-variance formula that quantifies the tradeoff between the loss in expressivity and the gain in the regularity of the learned estimator, depending on the chosen symmetry group. To illustrate our approach, we conduct extensive experiments on image inpainting, traffic flow prediction, and human pose estimation with different choices of symmetries. We show theoretically and empirically that the best generalization performance can be achieved by choosing a suitably larger group than the graph automorphism group, but smaller than the full permutation group.
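The "passive" permutation symmetry discussed above can be checked numerically for a toy graph convolution f(A, X) = ReLU(A_norm X W): permuting nodes and features together permutes the output the same way. This is a sanity check, not the paper's coarsening construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
A = rng.random((n, n)); A = (A + A.T) / 2          # symmetric weighted graph
X = rng.normal(size=(n, d))                         # node signals
W = rng.normal(size=(d, d))

# Row-normalized graph convolution with a ReLU nonlinearity.
f = lambda A, X: np.maximum((A / A.sum(1, keepdims=True)) @ X @ W, 0)

P = np.eye(n)[rng.permutation(n)]                   # random permutation matrix
lhs = f(P @ A @ P.T, P @ X)                         # relabel, then convolve
rhs = P @ f(A, X)                                   # convolve, then relabel
print(np.allclose(lhs, rhs))                        # True: permutation equivariant
```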

Federated Learning Robust to Byzantine Attacks: Achieving Zero Optimality Gap

  • paper_url: http://arxiv.org/abs/2308.10427
  • repo_url: None
  • paper_authors: Shiyuan Zuo, Rongfei Fan, Han Hu, Ning Zhang, Shimin Gong
  • for: a robust aggregation method for federated learning (FL) that effectively withstands malicious Byzantine attacks.
  • methods: each user updates its model parameters over multiple local steps, adjustable across iterations, and pushes them directly to the aggregation center; this reduces center-user interactions, lets each user set training parameters flexibly, and lowers computation compared with works that combine multiple historical parameters; the center aggregates the received parameters with the geometric median.
  • results: a rigorous proof shows zero optimality gap with linear convergence as long as the fraction of Byzantine attackers is below one half; numerical results verify the method's effectiveness.
    Abstract In this paper, we propose a robust aggregation method for federated learning (FL) that can effectively tackle malicious Byzantine attacks. At each user, model parameter is firstly updated by multiple steps, which is adjustable over iterations, and then pushed to the aggregation center directly. This decreases the number of interactions between the aggregation center and users, allows each user to set training parameter in a flexible way, and reduces computation burden compared with existing works that need to combine multiple historical model parameters. At the aggregation center, geometric median is leveraged to combine the received model parameters from each user. Rigorous proof shows that zero optimality gap is achieved by our proposed method with linear convergence, as long as the fraction of Byzantine attackers is below half. Numerical results verify the effectiveness of our proposed method.
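The server-side aggregation is the geometric median, which the sketch below computes with standard Weiszfeld iterations over flattened client updates; the tolerance and epsilon guard are implementation choices.

```python
import numpy as np

def geometric_median(updates, iters=100, eps=1e-8):
    """updates: (num_clients, num_params) array of flattened model parameters."""
    z = updates.mean(axis=0)                       # initialize at the mean
    for _ in range(iters):
        d = np.linalg.norm(updates - z, axis=1)
        w = 1.0 / np.maximum(d, eps)               # inverse-distance weights
        z_new = (w[:, None] * updates).sum(0) / w.sum()
        if np.linalg.norm(z_new - z) < 1e-10:
            break
        z = z_new
    return z

# Toy round: 7 honest clients, 3 Byzantine clients sending huge updates.
honest = np.random.randn(7, 10) * 0.1
byzantine = np.random.randn(3, 10) * 0.1 + 50.0
agg = geometric_median(np.vstack([honest, byzantine]))
print(np.linalg.norm(agg))                         # stays near the honest cluster
```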

Spatio-Temporal Adaptive Embedding Makes Vanilla Transformer SOTA for Traffic Forecasting

  • paper_url: http://arxiv.org/abs/2308.10425
  • repo_url: https://github.com/xdzhelheim/staeformer
  • paper_authors: Hangchen Liu, Zheng Dong, Renhe Jiang, Jiewen Deng, Jinliang Deng, Quanjun Chen, Xuan Song
  • for: improving transformer-based traffic forecasting.
  • methods: a new component, the spatio-temporal adaptive embedding, which lets a vanilla transformer capture the intricate spatio-temporal patterns in traffic time series.
  • results: STAEformer achieves state-of-the-art performance on five real-world traffic forecasting datasets; further experiments show the embedding is crucial for capturing intrinsic spatio-temporal relations and chronological information.
    Abstract With the rapid development of the Intelligent Transportation System (ITS), accurate traffic forecasting has emerged as a critical challenge. The key bottleneck lies in capturing the intricate spatio-temporal traffic patterns. In recent years, numerous neural networks with complicated architectures have been proposed to address this issue. However, the advancements in network architectures have encountered diminishing performance gains. In this study, we present a novel component called spatio-temporal adaptive embedding that can yield outstanding results with vanilla transformers. Our proposed Spatio-Temporal Adaptive Embedding transformer (STAEformer) achieves state-of-the-art performance on five real-world traffic forecasting datasets. Further experiments demonstrate that spatio-temporal adaptive embedding plays a crucial role in traffic forecasting by effectively capturing intrinsic spatio-temporal relations and chronological information in traffic time series.
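A minimal sketch of what a spatio-temporal adaptive embedding can look like: a freely learned tensor indexed by (time step, node) concatenated onto the projected input before a vanilla transformer. The shapes and the concatenation choice are assumptions based on the description above, not the released STAEformer code.

```python
import torch
import torch.nn as nn

class STAdaptiveEmbedding(nn.Module):
    def __init__(self, num_steps, num_nodes, dim):
        super().__init__()
        # One learnable vector per (time step, node) pair.
        self.emb = nn.Parameter(torch.empty(num_steps, num_nodes, dim))
        nn.init.xavier_uniform_(self.emb)

    def forward(self, x):          # x: (batch, num_steps, num_nodes, d_in)
        b = x.shape[0]
        e = self.emb.unsqueeze(0).expand(b, -1, -1, -1)
        return torch.cat([x, e], dim=-1)

x = torch.randn(2, 12, 207, 24)    # e.g., 12 steps, 207 traffic sensors
out = STAdaptiveEmbedding(12, 207, 80)(x)
print(out.shape)                   # torch.Size([2, 12, 207, 104])
```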

TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition

  • paper_url: http://arxiv.org/abs/2308.10415
  • repo_url: None
  • paper_authors: Hakan Erdogan, Scott Wisdom, Xuankai Chang, Zalán Borsos, Marco Tagliasacchi, Neil Zeghidour, John R. Hershey
  • for: a speech separation model, TokenSplit, that operates on discrete token sequences and jointly separates speech sources, transcribes them, and generates speech from text.
  • methods: a sequence-to-sequence Transformer encoder-decoder over transcript and audio token sequences, trained on multiple tasks simultaneously through input masking; a "refinement" version predicts enhanced audio tokens from the output of a conventional separation model.
  • results: excellent separation in objective metrics and subjective MUSHRA listening tests, with or without transcript conditioning, along with strong automatic speech recognition (ASR) performance and speech synthesis samples demonstrating additional utility.
    Abstract We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs. The model is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture. We also present a "refinement" version of the model that predicts enhanced audio tokens from the audio tokens of speech separated by a conventional separation model. Using both objective metrics and subjective MUSHRA listening tests, we show that our model achieves excellent performance in terms of separation, both with or without transcript conditioning. We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.

Federated Learning for Connected and Automated Vehicles: A Survey of Existing Approaches and Challenges

  • paper_url: http://arxiv.org/abs/2308.10407
  • repo_url: None
  • paper_authors: Vishnu Pandi Chellapandi, Liangqi Yuan, Christopher G. Brinton, Stanislaw H Zak, Ziran Wang
  • for: a survey of recent advances in applying Federated Learning (FL) to Connected and Automated Vehicles (CAVs).
  • methods: analyzes centralized and decentralized FL frameworks, highlighting their key characteristics and methodologies; reviews the data sources, models, and data security techniques relevant to FL in CAVs, emphasizing their importance for vehicle data privacy and security.
  • results: surveys specific, important FL applications together with their base models and datasets, then lists existing challenges for FL4CAV and potential directions for future work to further improve FL's effectiveness and efficiency for CAVs.
    Abstract Machine learning (ML) is widely used for key tasks in Connected and Automated Vehicles (CAV), including perception, planning, and control. However, its reliance on vehicular data for model training presents significant challenges related to in-vehicle user privacy and communication overhead generated by massive data volumes. Federated learning (FL) is a decentralized ML approach that enables multiple vehicles to collaboratively develop models, broadening learning from various driving environments, enhancing overall performance, and simultaneously securing local vehicle data privacy and security. This survey paper presents a review of the advancements made in the application of FL for CAV (FL4CAV). First, centralized and decentralized frameworks of FL are analyzed, highlighting their key characteristics and methodologies. Second, diverse data sources, models, and data security techniques relevant to FL in CAVs are reviewed, emphasizing their significance in ensuring privacy and confidentiality. Third, specific and important applications of FL are explored, providing insight into the base models and datasets employed for each application. Finally, existing challenges for FL4CAV are listed and potential directions for future work are discussed to further enhance the effectiveness and efficiency of FL in the context of CAV.

Label Selection Approach to Learning from Crowds

  • paper_url: http://arxiv.org/abs/2308.10396
  • repo_url: https://github.com/ssatsuki/label-selection-layer
  • paper_authors: Kosuke Yoshimura, Hisashi Kashima
  • for: improving supervised learning from noisy crowdsourced labels via a new Learning from Crowds model, the Label Selection Layer, inspired by SelectiveNet, which automatically decides whether each worker's label is used for training.
  • methods: a selector network determines whether to use a worker's label for training; the approach applies to almost any supervised learning variant by simply adding the selector network and changing the objective, without explicitly modeling the noise in crowd annotations.
  • results: performance comparable to or better than the state-of-the-art Crowd Layer in most cases, with the exception of regression problems.
    Abstract Supervised learning, especially supervised deep learning, requires large amounts of labeled data. One approach to collect large amounts of labeled data is by using a crowdsourcing platform where numerous workers perform the annotation tasks. However, the annotation results often contain label noise, as the annotation skills vary depending on the crowd workers and their ability to complete the task correctly. Learning from Crowds is a framework which directly trains the models using noisy labeled data from crowd workers. In this study, we propose a novel Learning from Crowds model, inspired by SelectiveNet proposed for the selective prediction problem. The proposed method called Label Selection Layer trains a prediction model by automatically determining whether to use a worker's label for training using a selector network. A major advantage of the proposed method is that it can be applied to almost all variants of supervised learning problems by simply adding a selector network and changing the objective function for existing models, without explicitly assuming a model of the noise in crowd annotations. The experimental results show that the performance of the proposed method is almost equivalent to or better than the Crowd Layer, which is one of the state-of-the-art methods for Deep Learning from Crowds, except for the regression problem case.
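A hedged sketch of a SelectiveNet-style label-selection objective consistent with the description above: a selector network gates each worker's label, and training minimizes the gated loss plus a coverage-style penalty so the selector does not discard everything. The architecture and penalty are assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelSelection(nn.Module):
    def __init__(self, feat_dim, num_workers):
        super().__init__()
        self.selector = nn.Linear(feat_dim, num_workers)  # one gate per worker

    def forward(self, feats, logits, worker_labels, mask, coverage=0.7):
        # feats: (batch, feat_dim); logits: (batch, classes)
        # worker_labels: (batch, workers) class ids; mask: 1 where annotated
        gates = torch.sigmoid(self.selector(feats)) * mask
        ce = torch.stack(
            [F.cross_entropy(logits, worker_labels[:, j], reduction="none")
             for j in range(worker_labels.shape[1])], dim=1)
        selected = (gates * ce).sum() / gates.sum().clamp_min(1e-8)
        # Penalize selecting fewer labels than the target coverage.
        cov_penalty = (coverage - gates.mean()).clamp_min(0) ** 2
        return selected + 10.0 * cov_penalty

# Toy shapes: 16 instances, 32-dim features, 10 classes, 5 workers.
layer = LabelSelection(32, 5)
loss = layer(torch.randn(16, 32), torch.randn(16, 10),
             torch.randint(0, 10, (16, 5)), torch.ones(16, 5))
print(loss)
```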

DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data

  • paper_url: http://arxiv.org/abs/2308.10915
  • repo_url: https://github.com/chu-data-lab/diffprep
  • paper_authors: Peng Li, Zhiyi Chen, Xu Chu, Kexin Rong
  • for: automatically and efficiently searching for a data preprocessing pipeline that maximizes the performance of a downstream ML model on tabular data.
  • methods: pipeline search is formalized as a bi-level optimization problem; the discrete, non-differentiable search space is relaxed into a continuous, differentiable one so the pipeline can be found by gradient descent while training the model only once.
  • results: best test accuracy on 15 of the 18 real-world datasets evaluated, improving model test accuracy by up to 6.6 percentage points.
    Abstract Data preprocessing is a crucial step in the machine learning process that transforms raw data into a more usable format for downstream ML models. However, it can be costly and time-consuming, often requiring the expertise of domain experts. Existing automated machine learning (AutoML) frameworks claim to automate data preprocessing. However, they often use a restricted search space of data preprocessing pipelines which limits the potential performance gains, and they are often too slow as they require training the ML model multiple times. In this paper, we propose DiffPrep, a method that can automatically and efficiently search for a data preprocessing pipeline for a given tabular dataset and a differentiable ML model such that the performance of the ML model is maximized. We formalize the problem of data preprocessing pipeline search as a bi-level optimization problem. To solve this problem efficiently, we transform and relax the discrete, non-differential search space into a continuous and differentiable one, which allows us to perform the pipeline search using gradient descent with training the ML model only once. Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated and improves the model's test accuracy by up to 6.6 percentage points.
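A toy sketch of the continuous relaxation idea: candidate preprocessing operators are mixed with softmax weights over learnable logits, making the pipeline choice differentiable so it can be trained jointly with the model by gradient descent. The three operators are illustrative stand-ins for DiffPrep's search space.

```python
import torch
import torch.nn.functional as F

# Candidate preprocessing operators for one pipeline step.
ops = [
    lambda x: x,                                       # identity
    lambda x: (x - x.mean(0)) / (x.std(0) + 1e-8),     # standardize
    lambda x: torch.log1p(x.clamp_min(0)),             # log transform
]
alpha = torch.zeros(len(ops), requires_grad=True)      # pipeline logits

def preprocess(x):
    # Soft mixture over operators instead of a hard discrete choice.
    w = F.softmax(alpha, dim=0)
    return sum(w[i] * op(x) for i, op in enumerate(ops))

x = torch.rand(32, 5)
y = preprocess(x).sum()      # stand-in for the downstream training loss
y.backward()                 # gradients flow to the pipeline choice
print(alpha.grad)
```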

Unsupervised Opinion Aggregation – A Statistical Perspective

  • paper_url: http://arxiv.org/abs/2308.10386
  • repo_url: None
  • paper_authors: Noyan C. Sevuktekin, Andrew C. Singer
  • for: decision-makers who rely on opinions from multiple experts to make complex decisions, with limited, delayed, or no access to the ground truth.
  • methods: a statistical approach that infers each expert's competence from their opinions alone, without the ground truth, by measuring how likely each expert is to agree with their peers; this observation yields a completely unsupervised version of the naive Bayes classifier.
  • results: the proposed technique is asymptotically optimal for a large class of problems and can be applied to online opinion aggregation and decision-making based on a limited number of opinions.
    Abstract Complex decision-making systems rarely have direct access to the current state of the world and they instead rely on opinions to form an understanding of what the ground truth could be. Even in problems where experts provide opinions without any intention to manipulate the decision maker, it is challenging to decide which expert's opinion is more reliable -- a challenge that is further amplified when decision-maker has limited, delayed, or no access to the ground truth after the fact. This paper explores a statistical approach to infer the competence of each expert based on their opinions without any need for the ground truth. Echoing the logic behind what is commonly referred to as "the wisdom of crowds", we propose measuring the competence of each expert by their likeliness to agree with their peers. We further show that the more reliable an expert is the more likely it is that they agree with their peers. We leverage this fact to propose a completely unsupervised version of the naïve Bayes classifier and show that the proposed technique is asymptotically optimal for a large class of problems. In addition to aggregating a large block of opinions, we further apply our technique for online opinion aggregation and for decision-making based on a limited number of opinions.
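A small sketch of competence-from-agreement aggregation for binary opinions: estimate each expert's competence as their average agreement with peers, then combine opinions with the standard naive-Bayes log-odds weights. The clipping and the exact estimator are illustrative assumptions, not the paper's derivation.

```python
import numpy as np

opinions = np.array([[1, 1, 0, 1],      # rows: tasks, cols: experts
                     [0, 0, 1, 0],
                     [1, 1, 1, 1],
                     [1, 0, 0, 1],
                     [0, 0, 0, 1]])

# Pairwise agreement rate between every pair of experts.
agree = (opinions[:, :, None] == opinions[:, None, :]).mean(0)
np.fill_diagonal(agree, np.nan)
competence = np.clip(np.nanmean(agree, axis=1), 0.51, 0.99)  # avoid infinite weights

# Naive-Bayes aggregation: weight each expert by log-odds of competence.
weights = np.log(competence / (1 - competence))
scores = (2 * opinions - 1) @ weights                        # signed weighted vote
print(scores > 0)                                            # aggregated labels
```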

Automated mapping of virtual environments with visual predictive coding

  • paper_url: http://arxiv.org/abs/2308.10913
  • repo_url: None
  • paper_authors: James Gornet, Matthew Thomson
  • for: how the brain can construct internal cognitive maps directly from sensory inputs, and whether one mapping strategy can generalize to auditory, tactile, and linguistic inputs.
  • methods: predictive coding implemented with a self-attention-equipped convolutional neural network; an agent navigates a virtual environment while learning a next-image prediction task.
  • results: the agent automatically constructs an internal representation of the environment that quantitatively reflects distances and can pinpoint its location relative to landmarks from visual information alone; the network's vectorized encoding supports vector navigation, with individual latent units delineating localized, overlapping neighborhoods.
    Abstract Humans construct internal cognitive maps of their environment directly from sensory inputs without access to a system of explicit coordinates or distance measurements. While machine learning algorithms like SLAM utilize specialized visual inference procedures to identify visual features and construct spatial maps from visual and odometry data, the general nature of cognitive maps in the brain suggests a unified mapping algorithmic strategy that can generalize to auditory, tactile, and linguistic inputs. Here, we demonstrate that predictive coding provides a natural and versatile neural network algorithm for constructing spatial maps using sensory data. We introduce a framework in which an agent navigates a virtual environment while engaging in visual predictive coding using a self-attention-equipped convolutional neural network. While learning a next image prediction task, the agent automatically constructs an internal representation of the environment that quantitatively reflects distances. The internal map enables the agent to pinpoint its location relative to landmarks using only visual information. The predictive coding network generates a vectorized encoding of the environment that supports vector navigation where individual latent space units delineate localized, overlapping neighborhoods in the environment. Broadly, our work introduces predictive coding as a unified algorithmic framework for constructing cognitive maps that can naturally extend to the mapping of auditory, sensorimotor, and linguistic inputs.

HoSNN: Adversarially-Robust Homeostatic Spiking Neural Networks with Adaptive Firing Thresholds

  • paper_url: http://arxiv.org/abs/2308.10373
  • repo_url: None
  • paper_authors: Hejia Geng, Peng Li
  • for: defending spiking neural networks (SNNs) against adversarial attacks.
  • methods: a bio-inspired approach drawing on neural homeostasis: a threshold-adapting leaky integrate-and-fire (TA-LIF) neuron with a self-stabilizing dynamic thresholding mechanism is used to construct adversarially robust homeostatic SNNs (HoSNNs), curtailing adversarial noise propagation in an unsupervised manner.
  • results: on CIFAR-10 without explicit adversarial training, accuracy under FGSM and PGD attacks improves to 72.6% and 54.19% (from 20.97% and 0.6%); with minimal FGSM adversarial training, HoSNNs surpass previous models by 29.99% under FGSM and 47.83% under PGD.
    Abstract Spiking neural networks (SNNs) offer promise for efficient and powerful neurally inspired computation. Common to other types of neural networks, however, SNNs face the severe issue of vulnerability to adversarial attacks. We present the first study that draws inspiration from neural homeostasis to develop a bio-inspired solution that counters the susceptibilities of SNNs to adversarial onslaughts. At the heart of our approach is a novel threshold-adapting leaky integrate-and-fire (TA-LIF) neuron model, which we adopt to construct the proposed adversarially robust homeostatic SNN (HoSNN). Distinct from traditional LIF models, our TA-LIF model incorporates a self-stabilizing dynamic thresholding mechanism, curtailing adversarial noise propagation and safeguarding the robustness of HoSNNs in an unsupervised manner. Theoretical analysis is presented to shed light on the stability and convergence properties of the TA-LIF neurons, underscoring their superior dynamic robustness under input distributional shifts over traditional LIF neurons. Remarkably, without explicit adversarial training, our HoSNNs demonstrate inherent robustness on CIFAR-10, with accuracy improvements to 72.6% and 54.19% against FGSM and PGD attacks, up from 20.97% and 0.6%, respectively. Furthermore, with minimal FGSM adversarial training, our HoSNNs surpass previous models by 29.99% under FGSM and 47.83% under PGD attacks on CIFAR-10. Our findings offer a new perspective on harnessing biological principles for bolstering SNNs adversarial robustness and defense, paving the way to more resilient neuromorphic computing.
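A toy sketch of a threshold-adapting LIF neuron in the spirit of TA-LIF: the membrane potential leaks and integrates input, and each spike pushes the firing threshold up before it decays back toward baseline, damping runaway firing under perturbed input. The constants and the exact homeostatic rule are assumptions, not the paper's equations.

```python
import numpy as np

def ta_lif(inputs, tau_v=0.9, tau_th=0.995, theta0=1.0, dtheta=0.3):
    v, theta, spikes = 0.0, theta0, []
    for x in inputs:
        v = tau_v * v + x                       # leaky integration
        s = float(v >= theta)                   # spike if threshold crossed
        spikes.append(s)
        v = v * (1 - s)                         # reset on spike
        # Homeostasis: decay toward baseline, jump up on each spike.
        theta = tau_th * theta + (1 - tau_th) * theta0 + dtheta * s
    return np.array(spikes)

rng = np.random.default_rng(0)
clean = rng.random(200) * 0.3
noisy = clean + 0.2                             # adversarial-style perturbation
# Compare spike counts under clean vs. perturbed input.
print(ta_lif(clean).sum(), ta_lif(noisy).sum())
```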

Developing a Machine Learning-Based Clinical Decision Support Tool for Uterine Tumor Imaging

  • paper_url: http://arxiv.org/abs/2308.10372
  • repo_url: None
  • paper_authors: Darryl E. Wright, Adriana V. Gregory, Deema Anaam, Sepideh Yadollahi, Sumana Ramanathan, Kafayat A. Oyemade, Reem Alsibai, Heather Holmes, Harrison Gottlich, Cherie-Akilah G. Browne, Sarah L. Cohen Rassier, Isabel Green, Elizabeth A. Stewart, Hiroaki Takahashi, Bohyun Kim, Shannon Laughlin-Tommaso, Timothy L. Kline
  • for: developing an automated method for 3D segmentation of the uterus and uterine tumors (UTs) that approaches human-level performance with fewer than 150 annotated images.
  • methods: the deep learning framework nnU-Net, with the effect of training-set size explored via randomly generated subsets of 25, 45, 65, and 85 training images; radiomic features are evaluated for distinguishing UT types, individually and combined through feature selection and machine learning.
  • results: a test-set F1-score of 0.80 for classifying degenerated leiomyoma (LM) versus leiomyosarcoma (LMS), and F1-scores of 0.53 and 0.80 for the benign-versus-malignant and degenerated-LM-versus-LMS tasks; reliable automatic differentiation of UTs remains a challenge.
    Abstract Uterine leiomyosarcoma (LMS) is a rare but aggressive malignancy. On imaging, it is difficult to differentiate LMS from, for example, degenerated leiomyoma (LM), a prevalent but benign condition. We curated a data set of 115 axial T2-weighted MRI images from 110 patients (mean [range] age=45 [17-81] years) with UTs that included five different tumor types. These data were randomly split stratifying on tumor volume into training (n=85) and test sets (n=30). An independent second reader (reader 2) provided manual segmentations for all test set images. To automate segmentation, we applied nnU-Net and explored the effect of training set size on performance by randomly generating subsets with 25, 45, 65 and 85 training set images. We evaluated the ability of radiomic features to distinguish between types of UT individually and when combined through feature selection and machine learning. Using the entire training set the mean [95% CI] fibroid DSC was measured as 0.87 [0.59-1.00] and the agreement between the two readers was 0.89 [0.77-1.0] on the test set. When classifying degenerated LM from LMS we achieve a test set F1-score of 0.80. Classifying UTs based on radiomic features we identify classifiers achieving F1-scores of 0.53 [0.45, 0.61] and 0.80 [0.80, 0.80] on the test set for the benign versus malignant, and degenerated LM versus LMS tasks. We show that it is possible to develop an automated method for 3D segmentation of the uterus and UT that is close to human-level performance with fewer than 150 annotated images. For distinguishing UT types, while we train models that merit further investigation with additional data, reliable automatic differentiation of UTs remains a challenge.
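For reference, the Dice similarity coefficient (DSC) reported above is DSC = 2|A ∩ B| / (|A| + |B|); a minimal NumPy version on toy volumes:

```python
import numpy as np

def dice(pred, truth, eps=1e-8):
    """Dice similarity coefficient between two binary segmentation volumes."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    return (2.0 * inter + eps) / (pred.sum() + truth.sum() + eps)

pred = np.zeros((4, 4, 4)); pred[1:3, 1:3, 1:3] = 1    # predicted mask
truth = np.zeros((4, 4, 4)); truth[1:3, 1:3, :] = 1    # reference mask
print(round(dice(pred, truth), 3))                     # 0.667
```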

SE(3) Equivariant Augmented Coupling Flows

  • paper_url: http://arxiv.org/abs/2308.10364
  • repo_url: https://github.com/lollcat/se3-augmented-coupling-flows
  • paper_authors: Laurence I. Midgley, Vincent Stimper, Javier Antorán, Emile Mathieu, Bernhard Schölkopf, José Miguel Hernández-Lobato
  • for: 这 paper 是为了提出一种可以保持 SE(3) 和 permutation 对称的 coupling flow,用于probabilistic modeling of physical systems。
  • methods: 该 paper 使用了coordinate splits along additional augmented dimensions,将 atoms 的位置映射到 learned SE(3) 对称的基准系中,然后应用标准 flow transformations,如 monotonic rational-quadratic splines。
  • results: 该 flow 可以保持 fast sampling 和 density evaluation,并可以生成不偏的 expectation 值 estimates with respect to the target distribution via importance sampling。在 DW4, LJ13 和 QM9-positional 数据集上训练,该 flow 与 equivariant continuous normalizing flows 相比,可以 sampling 两个阶段 faster,并且是首次learn the full Boltzmann distribution of alanine dipeptide by only modeling the Cartesian positions of its atoms。最后,paper 还示出了该 flow 可以approximately sample from the Boltzmann distribution of DW4 和 LJ13 particle systems using only their energy functions。
    Abstract Coupling normalizing flows allow for fast sampling and density evaluation, making them the tool of choice for probabilistic modeling of physical systems. However, the standard coupling architecture precludes endowing flows that operate on the Cartesian coordinates of atoms with the SE(3) and permutation invariances of physical systems. This work proposes a coupling flow that preserves SE(3) and permutation equivariance by performing coordinate splits along additional augmented dimensions. At each layer, the flow maps atoms' positions into learned SE(3) invariant bases, where we apply standard flow transformations, such as monotonic rational-quadratic splines, before returning to the original basis. Crucially, our flow preserves fast sampling and density evaluation, and may be used to produce unbiased estimates of expectations with respect to the target distribution via importance sampling. When trained on the DW4, LJ13 and QM9-positional datasets, our flow is competitive with equivariant continuous normalizing flows, while allowing sampling two orders of magnitude faster. Moreover, to the best of our knowledge, we are the first to learn the full Boltzmann distribution of alanine dipeptide by only modeling the Cartesian positions of its atoms. Lastly, we demonstrate that our flow can be trained to approximately sample from the Boltzmann distribution of the DW4 and LJ13 particle systems using only their energy functions.

Can Large Language Models Find And Fix Vulnerable Software?

  • paper_url: http://arxiv.org/abs/2308.10345
  • repo_url: None
  • paper_authors: David Noever
  • for: evaluating the capability of Large Language Models (LLMs), particularly OpenAI's GPT-4, at detecting software vulnerabilities, compared against traditional static code analyzers such as Snyk and Fortify.
  • methods: analysis across numerous repositories, including those from NASA and the Department of Defense, covering 129 code samples in eight programming languages.
  • results: GPT-4 identified roughly four times as many vulnerabilities as its counterparts and provided viable fixes with a low false-positive rate; vulnerabilities were highest in PHP and JavaScript; GPT-4's corrections cut vulnerabilities by 90% at the cost of only an 11% increase in code lines; LLMs can self-audit by suggesting fixes for the vulnerabilities they identify; future work should explore system-level vulnerabilities and integrate multiple static code analyzers.
    Abstract In this study, we evaluated the capability of Large Language Models (LLMs), particularly OpenAI's GPT-4, in detecting software vulnerabilities, comparing their performance against traditional static code analyzers like Snyk and Fortify. Our analysis covered numerous repositories, including those from NASA and the Department of Defense. GPT-4 identified approximately four times the vulnerabilities than its counterparts. Furthermore, it provided viable fixes for each vulnerability, demonstrating a low rate of false positives. Our tests encompassed 129 code samples across eight programming languages, revealing the highest vulnerabilities in PHP and JavaScript. GPT-4's code corrections led to a 90% reduction in vulnerabilities, requiring only an 11% increase in code lines. A critical insight was LLMs' ability to self-audit, suggesting fixes for their identified vulnerabilities and underscoring their precision. Future research should explore system-level vulnerabilities and integrate multiple static code analyzers for a holistic perspective on LLMs' potential.
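As a rough illustration of this kind of evaluation, the sketch below prompts GPT-4 to review a code snippet through the OpenAI Python client. The prompt wording and the snippet are illustrative assumptions, not the paper's protocol.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

snippet = 'os.system("ping " + user_input)  # user_input is untrusted'
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are a security reviewer. List vulnerabilities "
                    "with CWE ids and a minimal fix."},
        {"role": "user", "content": snippet},
    ],
)
print(resp.choices[0].message.content)
```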

A Comprehensive Empirical Evaluation on Online Continual Learning

  • paper_url: http://arxiv.org/abs/2308.10328
  • repo_url: https://github.com/albinsou/ocl_survey
  • paper_authors: Albin Soutif–Cormerais, Antonio Carta, Andrea Cossu, Julio Hurtado, Vincenzo Lomonaco, Joost Van de Weijer, Hamed Hemati
  • for: online continual learning, which learns directly on a stream of data with temporally shifting distribution while storing a minimum amount of that stream, for a closer-to-live learning experience.
  • methods: an empirical evaluation of online continual learning methods from the literature, focused on the class-incremental setting for image classification, compared on the Split-CIFAR100 and Split-TinyImagenet benchmarks using average accuracy, forgetting, stability, and representation quality, measured both during training and at the end.
  • results: most methods suffer from stability and underfitting issues, yet the learned representations are comparable to i.i.d. training under the same computational budget; no clear winner emerges, and properly tuned experience replay is a very strong baseline.
    Abstract Online continual learning aims to get closer to a live learning experience by learning directly on a stream of data with temporally shifting distribution and by storing a minimum amount of data from that stream. In this empirical evaluation, we evaluate various methods from the literature that tackle online continual learning. More specifically, we focus on the class-incremental setting in the context of image classification, where the learner must learn new classes incrementally from a stream of data. We compare these methods on the Split-CIFAR100 and Split-TinyImagenet benchmarks, and measure their average accuracy, forgetting, stability, and quality of the representations, to evaluate various aspects of the algorithm at the end but also during the whole training period. We find that most methods suffer from stability and underfitting issues. However, the learned representations are comparable to i.i.d. training under the same computational budget. No clear winner emerges from the results and basic experience replay, when properly tuned and implemented, is a very strong baseline. We release our modular and extensible codebase at https://github.com/AlbinSou/ocl_survey based on the avalanche framework to reproduce our results and encourage future research.

Quantum State Tomography using Quantum Machine Learning

  • paper_url: http://arxiv.org/abs/2308.10327
  • repo_url: https://github.com/hongyehu/Machine_Learning_Quantum_State_Tomography
  • paper_authors: Nouhaila Innan, Owais Ishtiaq Siddiqui, Shivang Arora, Tamojit Ghosh, Yasemin Poyraz Koçak, Dominic Paragas, Abdullah Al Omar Galib, Muhammad Al-Zafar Khan, Mohamed Bennai
  • for: This study aims to improve the efficiency of Quantum State Tomography (QST) so that it can be applied to large-scale quantum systems.
  • methods: Quantum Machine Learning (QML) techniques are used to enhance QST efficiency, and a range of QST approaches, both classical and quantum, are investigated.
  • results: The QML-based QST approach achieves high fidelity (98%) with significantly fewer measurements than conventional methods, making it promising for practical applications.
    Abstract Quantum State Tomography (QST) is a fundamental technique in Quantum Information Processing (QIP) for reconstructing unknown quantum states. However, the conventional QST methods are limited by the number of measurements required, which makes them impractical for large-scale quantum systems. To overcome this challenge, we propose the integration of Quantum Machine Learning (QML) techniques to enhance the efficiency of QST. In this paper, we conduct a comprehensive investigation into various approaches for QST, encompassing both classical and quantum methodologies; We also implement different QML approaches for QST and demonstrate their effectiveness on various simulated and experimental quantum systems, including multi-qubit networks. Our results show that our QML-based QST approach can achieve high fidelity (98%) with significantly fewer measurements than conventional methods, making it a promising tool for practical QIP applications.

Homogenising SoHO/EIT and SDO/AIA 171 Å Images: A Deep Learning Approach

  • paper_url: http://arxiv.org/abs/2308.10322
  • repo_url: None
  • paper_authors: Subhamoy Chatterjee, Andrés Muñoz-Jaramillo, Maher Dayeh, Hazel M. Bain, Kimberly Moreland
  • for: The goal is to merge multiple surveys into a single homogeneous set of EUV images spanning two solar cycles, suitable for space weather prediction tasks.
  • methods: An ensemble of deep learning models is trained on the temporal overlap of the SoHO/EIT and SDO/AIA 171 Å surveys, using Approximate Bayesian Ensembling to estimate uncertainty.
  • results: The ensemble's uncertainty decreases as the training set size increases, and the model ensemble shows higher uncertainty on test data that are poorly represented in the training data, adding value to the predictions.
    Abstract Extreme Ultraviolet images of the Sun are becoming an integral part of space weather prediction tasks. However, having different surveys requires the development of instrument-specific prediction algorithms. As an alternative, it is possible to combine multiple surveys to create a homogeneous dataset. In this study, we utilize the temporal overlap of the SoHO/EIT and SDO/AIA 171 Å surveys to train an ensemble of deep learning models for creating a single homogeneous survey of EUV images for 2 solar cycles. Prior applications of deep learning have focused on validating the homogeneity of the output while overlooking the systematic estimation of uncertainty. We use an approach called 'Approximate Bayesian Ensembling' to generate an ensemble of models whose uncertainty mimics that of a fully Bayesian neural network at a fraction of the cost. We find that ensemble uncertainty goes down as the training set size increases. Additionally, we show that the model ensemble adds immense value to the prediction by showing higher uncertainty in test data that are not well represented in the training data.
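
One concrete realization of Approximate Bayesian Ensembling is the anchored ensemble: each member is regularized toward its own frozen random initialization so that ensemble spread approximates posterior uncertainty. The sketch below illustrates this under assumed hyperparameters (`lam`, member count); the paper's exact anchoring scheme may differ.

```python
# Anchored-ensemble sketch: spread across members approximates Bayesian
# uncertainty at a fraction of the cost of a fully Bayesian network.
import torch

def anchored_loss(model, anchor, data_loss, lam=1e-4):
    # Penalize drift from this member's private anchor weights.
    reg = sum(((p - a) ** 2).sum()
              for p, a in zip(model.parameters(), anchor))
    return data_loss + lam * reg

def make_ensemble(build_model, n_members=5):
    members = []
    for _ in range(n_members):
        m = build_model()
        anchor = [p.detach().clone() for p in m.parameters()]  # frozen init
        members.append((m, anchor))
    return members

@torch.no_grad()
def predict_with_uncertainty(members, x):
    preds = torch.stack([m(x) for m, _ in members])  # (n_members, ...)
    return preds.mean(dim=0), preds.std(dim=0)       # prediction and spread
```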

Towards Sustainable Development: A Novel Integrated Machine Learning Model for Holistic Environmental Health Monitoring

  • paper_url: http://arxiv.org/abs/2308.10317
  • repo_url: None
  • paper_authors: Anirudh Mazumder, Sarthak Engala, Aditya Nallaparaju
  • for: To help governments identify intervention points and improve planning and conservation efforts in support of sustainable development.
  • methods: Machine learning techniques are used to identify predictive features of environmental state, with pollutant levels and particulate matter serving as indicators.
  • results: The identified patterns link areas with worse environmental conditions, giving governments actionable intervention points for planning and conservation toward sustainable development.
    Abstract Urbanization enables economic growth but also harms the environment through degradation. Traditional methods of detecting environmental issues have proven inefficient. Machine learning has emerged as a promising tool for tracking environmental deterioration by identifying key predictive features. Recent research focused on developing a predictive model using pollutant levels and particulate matter as indicators of environmental state in order to outline challenges. Machine learning was employed to identify patterns linking areas with worse conditions. This research aims to assist governments in identifying intervention points, improving planning and conservation efforts, and ultimately contributing to sustainable development.

Demystifying the Performance of Data Transfers in High-Performance Research Networks

  • paper_url: http://arxiv.org/abs/2308.10312
  • repo_url: None
  • paper_authors: Ehsan Saeedizade, Bing Zhang, Engin Arslan
  • for: This work aims to improve the efficiency of data transfers in high-speed research networks by shedding light on their performance problems.
  • methods: A scalable end-to-end monitoring framework gathers and stores key performance metrics for file transfers.
  • results: The proposed framework can monitor up to 400 concurrent transfers per host and more than 40,000 transfers in total while collecting performance statistics at one-second precision. A heuristic method automatically processes the gathered metrics and identifies the root causes of performance anomalies with an F-score of 87-98%.
    Abstract High-speed research networks are built to meet the ever-increasing needs of data-intensive distributed workflows. However, data transfers in these networks often fail to attain the promised transfer rates for several reasons, including I/O and network interference, server misconfigurations, and network anomalies. Although understanding the root causes of performance issues is critical to mitigating them and increasing the utilization of expensive network infrastructures, there is currently no available mechanism to monitor data transfers in these networks. In this paper, we present a scalable, end-to-end monitoring framework to gather and store key performance metrics for file transfers to shed light on the performance of transfers. The evaluation results show that the proposed framework can monitor up to 400 transfers per host and more than 40,000 transfers in total while collecting performance statistics at one-second precision. We also introduce a heuristic method to automatically process the gathered performance metrics and identify the root causes of performance anomalies with an F-score of 87 - 98%.

I/O Burst Prediction for HPC Clusters using Darshan Logs

  • paper_url: http://arxiv.org/abs/2308.10311
  • repo_url: None
  • paper_authors: Ehsan Saeedizade, Roya Taheri, Engin Arslan
  • for: This work analyzes read/write I/O patterns of large-scale high-performance computing (HPC) clusters to optimize I/O performance and application runtimes.
  • methods: Darshan reports from three supercomputers are analyzed to extract system-level read and write I/O rates in five-minute intervals, and machine learning models are trained to predict I/O burst events.
  • results: System-level I/O bursts occur frequently and can be predicted with high accuracy (over 90%) by the machine learning models. A burst-aware application scheduling policy based on these predictions can reduce application runtimes.
    Abstract Understanding cluster-wide I/O patterns of large-scale HPC clusters is essential to minimize the occurrence and impact of I/O interference. Yet, most previous work in this area focused on monitoring and predicting task and node-level I/O burst events. This paper analyzes Darshan reports from three supercomputers to extract system-level read and write I/O rates in five minutes intervals. We observe significant (over 100x) fluctuations in read and write I/O rates in all three clusters. We then train machine learning models to estimate the occurrence of system-level I/O bursts 5 - 120 minutes ahead. Evaluation results show that we can predict I/O bursts with more than 90% accuracy (F-1 score) five minutes ahead and more than 87% accuracy two hours ahead. We also show that the ML models attain more than 70% accuracy when estimating the degree of the I/O burst. We believe that high-accuracy predictions of I/O bursts can be used in multiple ways, such as postponing delay-tolerant I/O operations (e.g., checkpointing), pausing nonessential applications (e.g., file system scrubbers), and devising I/O-aware job scheduling methods. To validate this claim, we simulated a burst-aware job scheduler that can postpone the start time of applications to avoid I/O bursts. We show that the burst-aware job scheduling can lead to an up to 5x decrease in application runtime.
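
The prediction step reduces to windowed time-series classification. Below is a hedged sketch of that pipeline: resample I/O rates into five-minute windows, mark bursts via a quantile threshold (an assumption; the paper's burst definition may differ), build lag features, and score a classifier with the F-1 metric.

```python
# Illustrative burst-prediction pipeline; feature and threshold choices are
# assumptions, not the authors' exact configuration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def build_dataset(rates: pd.DataFrame, horizon_windows: int = 1):
    """rates: DataFrame with a DatetimeIndex and a 'write_bytes_per_s' column."""
    w = rates['write_bytes_per_s'].resample('5min').mean().fillna(0)
    burst = (w > w.quantile(0.95)).astype(int)           # assumed burst label
    feats = pd.DataFrame({f'lag{k}': w.shift(k) for k in range(1, 7)})
    target = burst.shift(-horizon_windows)               # predict ahead
    data = pd.concat([feats, target.rename('y')], axis=1).dropna()
    return data.drop(columns='y').values, data['y'].values

def train_and_eval(X, y):
    Xtr, Xte, ytr, yte = train_test_split(X, y, shuffle=False, test_size=0.3)
    clf = RandomForestClassifier(n_estimators=200).fit(Xtr, ytr)
    return f1_score(yte, clf.predict(Xte))
```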

Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video

  • paper_url: http://arxiv.org/abs/2308.10305
  • repo_url: https://github.com/kasvii/pmce
  • paper_authors: Yingxuan You, Hong Liu, Ti Wang, Wenhao Li, Runwei Ding, Xia Li
  • for: The paper proposes a video-based 3D human mesh recovery method that improves the accuracy and consistency of existing approaches.
  • methods: The Pose and Mesh Co-Evolution network (PMCE) decouples the task into two parts: 1) video-based 3D human pose estimation and 2) mesh vertex regression from the estimated 3D pose and temporal image features.
  • results: Experiments on three benchmark datasets (3DPW, Human3.6M, and MPI-INF-3DHP) show that PMCE outperforms previous methods in both per-frame accuracy and temporal consistency.
    Abstract Despite significant progress in single image-based 3D human mesh recovery, accurately and smoothly recovering 3D human motion from a video remains challenging. Existing video-based methods generally recover human mesh by estimating the complex pose and shape parameters from coupled image features, whose high complexity and low representation ability often result in inconsistent pose motion and limited shape patterns. To alleviate this issue, we introduce 3D pose as the intermediary and propose a Pose and Mesh Co-Evolution network (PMCE) that decouples this task into two parts: 1) video-based 3D human pose estimation and 2) mesh vertices regression from the estimated 3D pose and temporal image feature. Specifically, we propose a two-stream encoder that estimates mid-frame 3D pose and extracts a temporal image feature from the input image sequence. In addition, we design a co-evolution decoder that performs pose and mesh interactions with the image-guided Adaptive Layer Normalization (AdaLN) to make pose and mesh fit the human body shape. Extensive experiments demonstrate that the proposed PMCE outperforms previous state-of-the-art methods in terms of both per-frame accuracy and temporal consistency on three benchmark datasets: 3DPW, Human3.6M, and MPI-INF-3DHP. Our code is available at https://github.com/kasvii/PMCE.

eess.IV - 2023-08-21

Extraction of Text from Optic Nerve Optical Coherence Tomography Reports

  • paper_url: http://arxiv.org/abs/2308.10790
  • repo_url: None
  • paper_authors: Iyad Majid, Youchen Victor Zhang, Robert Chang, Sophia Y. Wang
  • for: The aim of this study was to develop and evaluate rule-based algorithms that improve the extraction of text data, including retinal nerve fiber layer (RNFL) values and other ganglion cell count (GCC) data, from optical coherence tomography (OCT) scan reports.
  • methods: DICOM files containing encapsulated PDF reports were converted to images and processed with the PaddleOCR Python package for optical character recognition; rule-based algorithms were then designed and iteratively optimized to extract the RNFL and GCC data.
  • results: The developed algorithms extracted data from RNFL and GCC reports with high precision, with small differences between the right (OD) and left (OS) eyes, and can greatly speed up large-scale extraction of OCT results.
    Abstract Purpose: The purpose of this study was to develop and evaluate rule-based algorithms to enhance the extraction of text data, including retinal nerve fiber layer (RNFL) values and other ganglion cell count (GCC) data, from Zeiss Cirrus optical coherence tomography (OCT) scan reports. Methods: DICOM files that contained encapsulated PDF reports with RNFL or Ganglion Cell in their document titles were identified from a clinical imaging repository at a single academic ophthalmic center. PDF reports were then converted into image files and processed using the PaddleOCR Python package for optical character recognition. Rule-based algorithms were designed and iteratively optimized for improved performance in extracting RNFL and GCC data. Evaluation of the algorithms was conducted through manual review of a set of RNFL and GCC reports. Results: The developed algorithms demonstrated high precision in extracting data from both RNFL and GCC scans. Precision was slightly better for the right eye in RNFL extraction (OD: 0.9803 vs. OS: 0.9046), and for the left eye in GCC extraction (OD: 0.9567 vs. OS: 0.9677). Some values presented more challenges in extraction, particularly clock hours 5 and 6 for RNFL thickness, and signal strength for GCC. Conclusions: A customized optical character recognition algorithm can identify numeric results from optical coherence scan reports with high precision. Automated processing of PDF reports can greatly reduce the time to extract OCT results on a large scale.
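
A stripped-down version of this OCR-plus-rules pipeline is sketched below. The PaddleOCR calls follow its commonly documented interface (which varies somewhat across versions), and the regex rule and `extract_avg_rnfl` helper are illustrative stand-ins for the study's actual rules.

```python
# Run OCR on a report image, then apply a simple rule to pull a numeric value.
import re
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='en')

def extract_avg_rnfl(image_path: str):
    result = ocr.ocr(image_path, cls=True)
    lines = [entry[1][0] for entry in result[0]]  # recognized text strings
    for text in lines:
        # Assumed rule: a line like "Average RNFL Thickness 94 um".
        m = re.search(r'Average\s+RNFL\s+Thickness\D*(\d{2,3})', text, re.I)
        if m:
            return int(m.group(1))
    return None  # field not found; flag for manual review
```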

Dense Error Map Estimation for MRI-Ultrasound Registration in Brain Tumor Surgery Using Swin UNETR

  • paper_url: http://arxiv.org/abs/2308.10784
  • repo_url: None
  • paper_authors: Soorena Salari, Amirhossein Rasoulian, Hassan Rivaz, Yiming Xiao
  • for: Early brain tumor surgery is key to reducing mortality, but intra-operative brain tissue deformation (brain shift) invalidates pre-operative images. Cost-effective, portable intra-operative ultrasound (iUS) can track brain shift, and accurate MRI-iUS registration can update pre-surgical plans and facilitate the interpretation of iUS, improving surgical safety and outcomes by maximizing tumor removal while avoiding eloquent regions.
  • methods: A deep learning (DL) framework based on Swin UNETR automatically assesses 3D dense error maps for MRI-iUS registration results, demonstrated on real clinical data.
  • results: The method accurately assesses the quality of MRI-iUS registration and can improve surgical safety and effectiveness in realistic clinical settings.
    Abstract Early surgical treatment of brain tumors is crucial in reducing patient mortality rates. However, brain tissue deformation (called brain shift) occurs during the surgery, rendering pre-operative images invalid. As a cost-effective and portable tool, intra-operative ultrasound (iUS) can track brain shift, and accurate MRI-iUS registration techniques can update pre-surgical plans and facilitate the interpretation of iUS. This can boost surgical safety and outcomes by maximizing tumor removal while avoiding eloquent regions. However, manual assessment of MRI-iUS registration results in real-time is difficult and prone to errors due to the 3D nature of the data. Automatic algorithms that can quantify the quality of inter-modal medical image registration outcomes can be highly beneficial. Therefore, we propose a novel deep-learning (DL) based framework with the Swin UNETR to automatically assess 3D-patch-wise dense error maps for MRI-iUS registration in iUS-guided brain tumor resection and show its performance with real clinical data for the first time.

Automated Identification of Failure Cases in Organ at Risk Segmentation Using Distance Metrics: A Study on CT Data

  • paper_url: http://arxiv.org/abs/2308.10636
  • repo_url: None
  • paper_authors: Amin Honarmandi Shandiz, Attila Rádics, Rajesh Tamada, Makk Árpád, Karolina Glowacka, Lehel Ferenczi, Sandeep Dutta, Michael Fanariotis
  • for: To improve the reliability of automated organ-at-risk (OAR) segmentation in CT scans and avoid treatment planning errors.
  • methods: Failure cases of the automated segmentation model are identified automatically by thresholding a combination of the Dice and Hausdorff distances between predicted and reference contours.
  • results: Setting thresholds on the Dice and Hausdorff distances enables fast, automatic identification of failure-case candidates and differentiation between different failure states.
    Abstract Automated organ at risk (OAR) segmentation is crucial for radiation therapy planning in CT scans, but the generated contours by automated models can be inaccurate, potentially leading to treatment planning issues. The reasons for these inaccuracies could be varied, such as unclear organ boundaries or inaccurate ground truth due to annotation errors. To improve the model's performance, it is necessary to identify these failure cases during the training process and to correct them with some potential post-processing techniques. However, this process can be time-consuming, as traditionally it requires manual inspection of the predicted output. This paper proposes a method to automatically identify failure cases by setting a threshold for the combination of Dice and Hausdorff distances. This approach reduces the time-consuming task of visually inspecting predicted outputs, allowing for faster identification of failure case candidates. The method was evaluated on 20 cases of six different organs in CT images from clinical expert curated datasets. By setting the thresholds for the Dice and Hausdorff distances, the study was able to differentiate between various states of failure cases and evaluate over 12 cases visually. This thresholding approach could be extended to other organs, leading to faster identification of failure cases and thereby improving the quality of radiation therapy planning.
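
The thresholding idea is simple enough to sketch directly: compute Dice and a symmetric Hausdorff distance between predicted and reference masks, then flag a case whenever either metric crosses its threshold. The thresholds below are placeholders (the study tunes them per organ), and for brevity the Hausdorff distance is computed over all mask voxels rather than extracted surfaces.

```python
# Flag failure-case candidates by thresholding Dice and Hausdorff distances.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def hausdorff(pred: np.ndarray, gt: np.ndarray) -> float:
    p, g = np.argwhere(pred), np.argwhere(gt)  # voxel coordinates
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])

def is_failure_candidate(pred, gt, dice_min=0.8, hd_max=10.0) -> bool:
    # Placeholder thresholds; tune per organ on expert-curated data.
    return dice(pred, gt) < dice_min or hausdorff(pred, gt) > hd_max
```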

Enhancing Medical Image Segmentation: Optimizing Cross-Entropy Weights and Post-Processing with Autoencoders

  • paper_url: http://arxiv.org/abs/2308.10488
  • repo_url: None
  • paper_authors: Pranav Singh, Luoyao Chen, Mei Chen, Jinqian Pan, Raviteja Chukkapalli, Shravan Chaudhari, Jacopo Cirrone
  • for: This paper addresses medical image segmentation, focusing on the analysis of cell inflammation and interaction in autoimmune diseases such as dermatomyositis.
  • methods: The proposed method uses a deep-learning approach tailored for medical image segmentation, including U-Net and U-Net++ architectures with optimized loss function weights and autoencoder-based post-processing.
  • results: The proposed method outperforms the current state-of-the-art techniques by an average of 12.26% for U-Net and 12.04% for U-Net++ across the ResNet family of encoders on the dermatomyositis dataset.
    Abstract The task of medical image segmentation presents unique challenges, necessitating both localized and holistic semantic understanding to accurately delineate areas of interest, such as critical tissues or aberrant features. This complexity is heightened in medical image segmentation due to the high degree of inter-class similarities, intra-class variations, and possible image obfuscation. The segmentation task further diversifies when considering the study of histopathology slides for autoimmune diseases like dermatomyositis. The analysis of cell inflammation and interaction in these cases has been less studied due to constraints in data acquisition pipelines. Despite the progressive strides in medical science, we lack a comprehensive collection of autoimmune diseases. As autoimmune diseases globally escalate in prevalence and exhibit associations with COVID-19, their study becomes increasingly essential. While there is existing research that integrates artificial intelligence in the analysis of various autoimmune diseases, the exploration of dermatomyositis remains relatively underrepresented. In this paper, we present a deep-learning approach tailored for medical image segmentation. Our proposed method outperforms the current state-of-the-art techniques by an average of 12.26% for U-Net and 12.04% for U-Net++ across the ResNet family of encoders on the dermatomyositis dataset. Furthermore, we probe the importance of optimizing loss function weights and benchmark our methodology on three challenging medical image segmentation tasks.
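
Class-weighted cross-entropy, the knob the paper optimizes, is a one-liner in PyTorch; the weight values below are placeholders one would tune, for example inversely to class pixel frequency.

```python
# Weighted cross-entropy for segmentation: up-weight the rare foreground
# class so it contributes more to the loss. Weights here are illustrative.
import torch
import torch.nn as nn

class_weights = torch.tensor([0.3, 1.7])           # background, foreground
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(4, 2, 128, 128)               # (batch, classes, H, W)
target = torch.randint(0, 2, (4, 128, 128))        # per-pixel class indices
loss = criterion(logits, target)
```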

Prediction of Pneumonia and COVID-19 Using Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2308.10368
  • repo_url: None
  • paper_authors: M. S. Haque, M. S. Taluckder, S. B. Shawkat, M. A. Shahriyar, M. A. Sayed, C. Modak
  • for: This study applies medical image analysis to pneumonia detection.
  • methods: Several machine learning models, including DenseNet121, Inception Resnet-v2, Inception Resnet-v3, Resnet50, and Xception, are evaluated on chest X-ray images.
  • results: DenseNet121 performs best for pneumonia detection, achieving an accuracy of 99.58%.
    Abstract Pneumonia, caused by bacteria and viruses, is a rapidly spreading viral infection with global implications. Prompt identification of infected individuals is crucial for containing its transmission. This study explores the potential of medical image analysis to address this challenge. We propose machine-learning techniques for predicting Pneumonia from chest X-ray images. Chest X-ray imaging is vital for Pneumonia diagnosis due to its accessibility and cost-effectiveness. However, interpreting X-rays for Pneumonia detection can be complex, as radiographic features can overlap with other respiratory conditions. We evaluate the performance of different machine learning models, including DenseNet121, Inception Resnet-v2, Inception Resnet-v3, Resnet50, and Xception, using chest X-ray images of pneumonia patients. Performance measures and confusion matrices are employed to assess and compare the models. The findings reveal that DenseNet121 outperforms other models, achieving an accuracy rate of 99.58%. This study underscores the significance of machine learning in the accurate detection of Pneumonia, leveraging chest X-ray images. Our study offers insights into the potential of technology to mitigate the spread of pneumonia through precise diagnostics.
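
A standard transfer-learning setup for the best-performing backbone looks like the sketch below; the two-class head and frozen feature extractor are reasonable assumptions about the configuration, not the authors' exact recipe.

```python
# DenseNet121 transfer learning for pneumonia vs. normal chest X-rays,
# using torchvision's pretrained-weights API.
import torch.nn as nn
from torchvision import models

model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False  # freeze the pretrained feature extractor
model.classifier = nn.Linear(model.classifier.in_features, 2)
# Fine-tune model.classifier (and optionally later blocks) on the X-ray set.
```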

cs.SD - 2023-08-20

Indonesian Automatic Speech Recognition with XLSR-53

  • paper_url: http://arxiv.org/abs/2308.11589
  • repo_url: None
  • paper_authors: Panji Arisaputra, Amalia Zahra
  • for: This study develops an Indonesian automatic speech recognition (ASR) system using the XLSR-53 pre-trained model to significantly reduce the amount of non-English training data needed to reach a competitive word error rate (WER).
  • methods: The XLSR-53 pre-trained model is fine-tuned on the TITML-IDN, Magic Data, and Common Voice datasets.
  • results: The model achieves a WER of 20%, competitive with similar models on the Common Voice test split; adding a language model reduces the WER by around 8%, from 20% to 12%.
    Abstract This study focuses on the development of Indonesian Automatic Speech Recognition (ASR) using the XLSR-53 pre-trained model (XLSR stands for cross-lingual speech representations). The XLSR-53 pre-trained model is used to significantly reduce the amount of training data in non-English languages required to achieve a competitive Word Error Rate (WER). The total amount of data used in this study is 24 hours, 18 minutes, and 1 second: (1) TITML-IDN 14 hours and 31 minutes; (2) Magic Data 3 hours and 33 minutes; and (3) Common Voice 6 hours, 14 minutes, and 1 second. With a WER of 20%, the model built in this study can compete with similar models using the Common Voice dataset split test. The WER can be decreased by around 8% using a language model, from 20% to 12%. Thus, the results of this study improve upon previous research, contributing to a better Indonesian ASR built with a smaller amount of data.
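
Fine-tuning from this checkpoint is commonly done through HuggingFace Transformers, roughly as sketched below; the vocabulary size and freezing choice are illustrative, and the tokenizer and dataset plumbing are elided.

```python
# Load XLSR-53 for CTC fine-tuning; a fresh CTC head is created on top of
# the pretrained encoder. vocab_size must match the Indonesian character
# vocabulary (32 is an assumed value).
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=32,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # common practice: keep the CNN frontend fixed
# Train on (speech, transcript) pairs; decoding with a language model is what
# pushes WER from roughly 20% down to 12% in the study.
```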

cs.CV - 2023-08-20

Boosting Adversarial Transferability by Block Shuffle and Rotation

  • paper_url: http://arxiv.org/abs/2308.10299
  • repo_url: None
  • paper_authors: Kunyu Wang, Xuanran He, Wenxuan Wang, Xiaosen Wang
  • for: This paper focuses on improving the transferability of adversarial examples in the black-box setting.
  • methods: The proposed method, called block shuffle and rotation (BSR), uses input transformation to create a set of new images for gradient calculation.
  • results: The BSR method achieves significantly better transferability than existing input transformation based methods under single-model and ensemble-model settings, and combining BSR with current input transformation methods further improves transferability.
    Abstract Adversarial examples mislead deep neural networks with imperceptible perturbations and have brought significant threats to deep learning. An important aspect is their transferability, which refers to their ability to deceive other models, thus enabling attacks in the black-box setting. Though various methods have been proposed to boost transferability, the performance still falls short compared with white-box attacks. In this work, we observe that existing input transformation based attacks, one of the mainstream transfer-based attacks, result in different attention heatmaps on various models, which might limit the transferability. We also find that breaking the intrinsic relation of the image can disrupt the attention heatmap of the original image. Based on this finding, we propose a novel input transformation based attack called block shuffle and rotation (BSR). Specifically, BSR splits the input image into several blocks, then randomly shuffles and rotates these blocks to construct a set of new images for gradient calculation. Empirical evaluations on the ImageNet dataset demonstrate that BSR could achieve significantly better transferability than the existing input transformation based methods under single-model and ensemble-model settings. Combining BSR with the current input transformation method can further improve the transferability, which significantly outperforms the state-of-the-art methods.
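
The BSR transform itself is only a few lines. The sketch below splits an image into an n x n grid, permutes the blocks, and applies random 90-degree rotations as a simplification of the paper's rotation scheme; it assumes a square input so rotated blocks keep their shape, and an attack would average input gradients over several transformed copies.

```python
# Simplified block shuffle and rotation (BSR) input transformation.
import random
import torch

def bsr_transform(img: torch.Tensor, n: int = 2) -> torch.Tensor:
    """img: (C, H, W) with H == W and both divisible by n."""
    c, h, w = img.shape
    bh, bw = h // n, w // n
    blocks = [img[:, i*bh:(i+1)*bh, j*bw:(j+1)*bw]
              for i in range(n) for j in range(n)]
    random.shuffle(blocks)                       # block shuffle
    blocks = [torch.rot90(b, k=random.randint(0, 3), dims=(1, 2))
              for b in blocks]                   # per-block rotation
    rows = [torch.cat(blocks[i*n:(i+1)*n], dim=2) for i in range(n)]
    return torch.cat(rows, dim=1)

# Attack sketch: grad = mean_i d loss(model(bsr_transform(x)), y) / dx
```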

DomainAdaptor: A Novel Approach to Test-time Adaptation

  • paper_url: http://arxiv.org/abs/2308.10297
  • repo_url: None
  • paper_authors: Jian Zhang, Lei Qi, Yinghuan Shi, Yang Gao
  • for: This paper addresses test-time adaptation: adapting a trained CNN to unseen target domains during testing for better predictions, especially when data are scarce.
  • methods: The proposed DomainAdaptor consists of an AdaMixBN module and a Generalized Entropy Minimization (GEM) loss. AdaMixBN handles the domain gap between training and test data by adaptively fusing their statistics in the normalization layers via a dynamic mixture coefficient and a statistic transformation operation, while the GEM loss extends entropy minimization to better exploit the information in the test data.
  • results: Experiments show that DomainAdaptor outperforms existing methods on four benchmarks, with even larger improvements on few-data unseen domains.
    Abstract To deal with the domain shift between training and test samples, current methods have primarily focused on learning generalizable features during training and ignore the specificity of unseen samples that are also critical during the test. In this paper, we investigate a more challenging task that aims to adapt a trained CNN model to unseen domains during the test. To maximally mine the information in the test data, we propose a unified method called DomainAdaptor for the test-time adaptation, which consists of an AdaMixBN module and a Generalized Entropy Minimization (GEM) loss. Specifically, AdaMixBN addresses the domain shift by adaptively fusing training and test statistics in the normalization layer via a dynamic mixture coefficient and a statistic transformation operation. To further enhance the adaptation ability of AdaMixBN, we design a GEM loss that extends the Entropy Minimization loss to better exploit the information in the test data. Extensive experiments show that DomainAdaptor consistently outperforms the state-of-the-art methods on four benchmarks. Furthermore, our method brings more remarkable improvement against existing methods on the few-data unseen domain. The code is available at https://github.com/koncle/DomainAdaptor.
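
The AdaMixBN component can be sketched as a batch-norm variant that, at test time, re-normalizes features with a convex mix of the stored training statistics and the current test-batch statistics. In the paper the mixing coefficient is dynamic; the fixed scalar below is a simplification for clarity.

```python
# Test-time statistic mixing in the spirit of AdaMixBN (fixed `mix` is a
# simplification of the paper's dynamic coefficient).
import torch
import torch.nn as nn

class AdaMixBN2d(nn.BatchNorm2d):
    def __init__(self, num_features, mix: float = 0.7):
        super().__init__(num_features)
        self.mix = mix  # weight on source (training) statistics

    def forward(self, x):
        if self.training:
            return super().forward(x)
        t_mean = x.mean(dim=(0, 2, 3))                    # test-batch stats
        t_var = x.var(dim=(0, 2, 3), unbiased=False)
        mean = self.mix * self.running_mean + (1 - self.mix) * t_mean
        var = self.mix * self.running_var + (1 - self.mix) * t_var
        x_hat = (x - mean[None, :, None, None]) / torch.sqrt(
            var[None, :, None, None] + self.eps)
        return (x_hat * self.weight[None, :, None, None]
                + self.bias[None, :, None, None])
```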

Privileged Anatomical and Protocol Discrimination in Trackerless 3D Ultrasound Reconstruction

  • paper_url: http://arxiv.org/abs/2308.10293
  • repo_url: None
  • paper_authors: Qi Li, Ziyi Shen, Qian Li, Dean C. Barratt, Thomas Dowrick, Matthew J. Clarkson, Tom Vercauteren, Yipeng Hu
  • for: This paper advances deep neural network (DNN) based three-dimensional freehand ultrasound (US) reconstruction without any external tracking device.
  • methods: Two factors that enable the learned inter-frame correlation, anatomy and protocol, are exploited; both are readily available during training and are incorporated as privileged information to improve existing DNN-based methods, implemented as auxiliary anatomical and protocol discrimination tasks in a new multi-task method.
  • results: Experiments show that both anatomical and protocol variances are enabling factors for DNN-based US reconstruction; learning to discriminate between subjects (anatomical variance) and between predefined scanning paths (protocol variance) significantly improves frame prediction accuracy, volume reconstruction overlap, accumulated tracking error, and final drift.
    Abstract Three-dimensional (3D) freehand ultrasound (US) reconstruction without using any additional external tracking device has seen recent advances with deep neural networks (DNNs). In this paper, we first investigated two identified contributing factors of the learned inter-frame correlation that enable the DNN-based reconstruction: anatomy and protocol. We propose to incorporate the ability to represent these two factors - readily available during training - as the privileged information to improve existing DNN-based methods. This is implemented in a new multi-task method, where the anatomical and protocol discrimination are used as auxiliary tasks. We further develop a differentiable network architecture to optimise the branching location of these auxiliary tasks, which controls the ratio between shared and task-specific network parameters, for maximising the benefits from the two auxiliary tasks. Experimental results, on a dataset with 38 forearms of 19 volunteers acquired with 6 different scanning protocols, show that 1) both anatomical and protocol variances are enabling factors for DNN-based US reconstruction; 2) learning how to discriminate different subjects (anatomical variance) and predefined types of scanning paths (protocol variance) both significantly improve frame prediction accuracy, volume reconstruction overlap, accumulated tracking error and final drift, using the proposed algorithm.

Efficient-VRNet: An Exquisite Fusion Network for Riverway Panoptic Perception based on Asymmetric Fair Fusion of Vision and 4D mmWave Radar

  • paper_url: http://arxiv.org/abs/2308.10287
  • repo_url: https://github.com/GuanRunwei/Efficient-VRNet
  • paper_authors: Runwei Guan, Shanliang Yao, Xiaohui Zhu, Ka Lok Man, Yong Yue, Jeremy Smith, Eng Gee Lim, Yutao Yue
  • for: This paper proposes a riverway panoptic perception method for unmanned surface vehicles (USVs) to improve the accuracy and safety of autonomous navigation.
  • methods: Efficient-VRNet builds on Contextual Clustering (CoC) and an asymmetric fair fusion of vision and 4D mmWave radar, performing object detection and semantic segmentation simultaneously.
  • results: In experiments on the authors' own dataset, Efficient-VRNet outperforms uni-modal models, especially in adverse weather and environments with poor lighting.
    Abstract Panoptic perception is essential to unmanned surface vehicles (USVs) for autonomous navigation. The current panoptic perception scheme is mainly based on vision only, that is, object detection and semantic segmentation are performed simultaneously based on camera sensors. Nevertheless, the fusion of camera and radar sensors is regarded as a promising method which could substitute pure vision methods, but almost all works focus on object detection only. Therefore, how to maximize and subtly fuse the features of vision and radar to improve both detection and segmentation is a challenge. In this paper, we focus on riverway panoptic perception based on USVs, which is a considerably unexplored field compared with road panoptic perception. We propose Efficient-VRNet, a model based on Contextual Clustering (CoC) and the asymmetric fusion of vision and 4D mmWave radar, which treats both vision and radar modalities fairly. Efficient-VRNet can simultaneously perform detection and segmentation of riverway objects and drivable area segmentation. Furthermore, we adopt an uncertainty-based panoptic perception training strategy to train Efficient-VRNet. In the experiments, our Efficient-VRNet achieves better performances on our collected dataset than other uni-modal models, especially in adverse weather and environment with poor lighting conditions. Our code and models are available at \url{https://github.com/GuanRunwei/Efficient-VRNet}.

DomainDrop: Suppressing Domain-Sensitive Channels for Domain Generalization

  • paper_url: http://arxiv.org/abs/2308.10285
  • repo_url: None
  • paper_authors: Jintao Guo, Lei Qi, Yinghuan Shi
  • for: To improve model robustness to domain shifts when applied to data from unseen domains.
  • methods: A DomainDrop framework continuously enhances the robustness of feature-map channels to domain shifts: a domain discriminator identifies and drops unstable channels in the feature maps of each network layer during forward propagation.
  • results: Experiments show that the proposed method achieves state-of-the-art performance on multiple benchmarks compared with competing methods.
    Abstract Deep Neural Networks have exhibited considerable success in various visual tasks. However, when applied to unseen test datasets, state-of-the-art models often suffer performance degradation due to domain shifts. In this paper, we introduce a novel approach for domain generalization from a novel perspective of enhancing the robustness of channels in feature maps to domain shifts. We observe that models trained on source domains contain a substantial number of channels that exhibit unstable activations across different domains, which are inclined to capture domain-specific features and behave abnormally when exposed to unseen target domains. To address the issue, we propose a DomainDrop framework to continuously enhance the channel robustness to domain shifts, where a domain discriminator is used to identify and drop unstable channels in feature maps of each network layer during forward propagation. We theoretically prove that our framework could effectively lower the generalization bound. Extensive experiments on several benchmarks indicate that our framework achieves state-of-the-art performance compared to other competing methods. Our code is available at https://github.com/lingeringlight/DomainDrop.

MacFormer: Map-Agent Coupled Transformer for Real-time and Robust Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2308.10280
  • repo_url: None
  • paper_authors: Chen Feng, Hangning Zhou, Huadong Lin, Zhigang Zhang, Ziyao Xu, Chi Zhang, Boyu Zhou, Shaojie Shen
  • for: Predicting the future behavior of agents for autonomous driving.
  • methods: A Map-Agent Coupled Transformer (MacFormer) framework with coupled map and reference extractor modules, together with a multi-task optimization strategy (MTOS) that enhances learning of topology and rule constraints.
  • results: State-of-the-art performance on the Argoverse 1, Argoverse 2, and nuScenes benchmarks with the lowest inference latency and smallest model size; experiments also show the framework is resilient to imperfect tracklet inputs.
    Abstract Predicting the future behavior of agents is a fundamental task in autonomous vehicle domains. Accurate prediction relies on comprehending the surrounding map, which significantly regularizes agent behaviors. However, existing methods have limitations in exploiting the map and exhibit a strong dependence on historical trajectories, which yield unsatisfactory prediction performance and robustness. Additionally, their heavy network architectures impede real-time applications. To tackle these problems, we propose Map-Agent Coupled Transformer (MacFormer) for real-time and robust trajectory prediction. Our framework explicitly incorporates map constraints into the network via two carefully designed modules named coupled map and reference extractor. A novel multi-task optimization strategy (MTOS) is presented to enhance learning of topology and rule constraints. We also devise bilateral query scheme in context fusion for a more efficient and lightweight network. We evaluated our approach on Argoverse 1, Argoverse 2, and nuScenes real-world benchmarks, where it all achieved state-of-the-art performance with the lowest inference latency and smallest model size. Experiments also demonstrate that our framework is resilient to imperfect tracklet inputs. Furthermore, we show that by combining with our proposed strategies, classical models outperform their baselines, further validating the versatility of our framework.

Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2308.10273
  • repo_url: None
  • paper_authors: Xin Ding, Yongwei Wang, Zuheng Xu
  • for: To improve the generation quality of Continuous Conditional Generative Adversarial Networks (CcGANs), which generate images conditioned on continuous scalar labels.
  • methods: A new Negative Data Augmentation (NDA) approach, Dual-NDA, uses two types of negative samples: visually unrealistic images generated by a pre-trained CcGAN, and label-inconsistent images created by manipulating the labels of real images; these negatives drive a novel discriminator objective and a modified CcGAN training algorithm.
  • results: Experiments on UTKFace and Steering Angle show that Dual-NDA consistently improves the visual fidelity and label consistency of images generated by CcGANs, enabling them to surpass state-of-the-art conditional GANs and diffusion models and reach a new level of performance.
    Abstract Continuous Conditional Generative Adversarial Networks (CcGANs) enable generative modeling conditional on continuous scalar variables (termed regression labels). However, they can produce subpar fake images due to limited training data. Although Negative Data Augmentation (NDA) effectively enhances unconditional and class-conditional GANs by introducing anomalies into real training images, guiding the GANs away from low-quality outputs, its impact on CcGANs is limited, as it fails to replicate negative samples that may occur during the CcGAN sampling. We present a novel NDA approach called Dual-NDA specifically tailored for CcGANs to address this problem. Dual-NDA employs two types of negative samples: visually unrealistic images generated from a pre-trained CcGAN and label-inconsistent images created by manipulating real images' labels. Leveraging these negative samples, we introduce a novel discriminator objective alongside a modified CcGAN training algorithm. Empirical analysis on UTKFace and Steering Angle reveals that Dual-NDA consistently enhances the visual fidelity and label consistency of fake images generated by CcGANs, exhibiting a substantial performance gain over the vanilla NDA. Moreover, by applying Dual-NDA, CcGANs demonstrate a remarkable advancement beyond the capabilities of state-of-the-art conditional GANs and diffusion models, establishing a new pinnacle of performance.
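
The second negative type, label-inconsistent images, is straightforward to sketch: pair real images with deliberately shifted regression labels and feed the pairs to the discriminator as negatives. The shift magnitude below is an assumed hyperparameter, not the paper's exact value.

```python
# Label-inconsistent negatives for a CcGAN discriminator (illustrative).
import torch

def make_label_inconsistent_negatives(images, labels, min_shift=0.2,
                                      label_range=(0.0, 1.0)):
    """Shift each regression label far enough that (image, label) mismatches."""
    lo, hi = label_range
    sign = torch.randint(0, 2, labels.shape, device=labels.device).float() * 2 - 1
    shift = sign * (min_shift + torch.rand_like(labels) * min_shift)
    # Note: clamping near the boundaries can shrink the effective shift.
    fake_labels = (labels + shift).clamp(lo, hi)
    return images, fake_labels  # presented to the discriminator as negatives
```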

Domain Reduction Strategy for Non Line of Sight Imaging

  • paper_url: http://arxiv.org/abs/2308.10269
  • repo_url: None
  • paper_authors: Hyunbo Shim, In Cho, Daekyu Kwon, Seon Joo Kim
  • for: Non-line-of-sight (NLOS) imaging: reconstructing hidden scenes under various setups.
  • methods: Photons returning from each point in the hidden volume are computed independently, and a domain reduction procedure excludes empty areas of the hidden volume from the set of propagation functions, improving the computational efficiency of the optimization.
  • results: The method achieves efficient and accurate reconstructions in diverse NLOS scenarios, including non-planar relay walls, sparse scanning patterns, confocal and non-confocal setups, and surface geometry reconstruction.
    Abstract This paper presents a novel optimization-based method for non-line-of-sight (NLOS) imaging that aims to reconstruct hidden scenes under various setups. Our method is built upon the observation that photons returning from each point in hidden volumes can be independently computed if the interactions between hidden surfaces are trivially ignored. We model the generalized light propagation function to accurately represent the transients as a linear combination of these functions. Moreover, our proposed method includes a domain reduction procedure to exclude empty areas of the hidden volumes from the set of propagation functions, thereby improving computational efficiency of the optimization. We demonstrate the effectiveness of the method in various NLOS scenarios, including non-planar relay wall, sparse scanning patterns, confocal and non-confocal, and surface geometry reconstruction. Experiments conducted on both synthetic and real-world data clearly support the superiority and the efficiency of the proposed method in general NLOS scenarios.

Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image

  • paper_url: http://arxiv.org/abs/2308.10257
  • repo_url: https://github.com/leoShen917/Make-It-4D
  • paper_authors: Liao Shen, Xingyi Li, Huiqiang Sun, Juewen Peng, Ke Xian, Zhiguo Cao, Guosheng Lin
  • for: This paper aims to synthesize a long-term dynamic video from a single image, addressing the challenges of consistent visual content movement under large camera motions.
  • methods: The proposed Make-It-4D uses layered depth images (LDIs) and motion estimation to recover the underlying 4D representation of the scene, including 3D geometry and scene motion, and employs a pretrained diffusion model to inpaint and outpaint occluded regions of the input image.
  • results: The method is training-free, saving a significant amount of training time, and demonstrates compelling dynamic video synthesis from a single input image.
    Abstract We study the problem of synthesizing a long-term dynamic video from only a single image. This is challenging since it requires consistent visual content movements given large camera motions. Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories. To address these issues, it is essential to estimate the underlying 4D (including 3D geometry and scene motion) and fill in the occluded regions. To this end, we present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image. On the one hand, we utilize layered depth images (LDIs) to represent a scene, and they are then unprojected to form a feature point cloud. To animate the visual content, the feature point cloud is displaced based on the scene flow derived from motion estimation and the corresponding camera pose. Such 4D representation enables our method to maintain the global consistency of the generated dynamic video. On the other hand, we fill in the occluded regions by using a pretrained diffusion model to inpaint and outpaint the input image. This enables our method to work under large camera motions. Benefiting from our design, our method can be training-free which saves a significant amount of training time. Experimental results demonstrate the effectiveness of our approach, which showcases compelling rendering results.

Generic Attention-model Explainability by Weighted Relevance Accumulation

  • paper_url: http://arxiv.org/abs/2308.10240
  • repo_url: None
  • paper_authors: Yiming Huang, Aozhe Jia, Xiaodan Zhang, Jiawei Zhang
  • for: This paper aims to improve the explainability of attention-based transformer models in multi-modal tasks, such as visual question answering.
  • methods: A weighted relevancy strategy takes the importance of token values into account when accumulating relevance across attention layers, reducing distortion in the attention process.
  • results: Extensive perturbation tests on visual question answering and image captioning validate that the proposed explainability method outperforms existing methods.
    Abstract Attention-based transformer models have achieved remarkable progress in multi-modal tasks, such as visual question answering. The explainability of attention-based methods has recently attracted wide interest as it can explain the inner changes of attention tokens by accumulating relevancy across attention layers. Current methods simply update relevancy by equally accumulating the token relevancy before and after the attention processes. However, the importance of token values is usually different during relevance accumulation. In this paper, we propose a weighted relevancy strategy, which takes the importance of token values into consideration, to reduce distortion when equally accumulating relevance. To evaluate our method, we propose a unified CLIP-based two-stage model, named CLIPmapper, to process Vision-and-Language tasks through CLIP encoder and a following mapper. CLIPmapper consists of self-attention, cross-attention, single-modality, and cross-modality attention, thus it is more suitable for evaluating our generic explainability method. Extensive perturbation tests on visual question answering and image captioning validate that our explainability method outperforms existing methods.

From Global to Local: Multi-scale Out-of-distribution Detection

  • paper_url: http://arxiv.org/abs/2308.10239
  • repo_url: https://github.com/jimzai/mode-ood
  • paper_authors: Ji Zhang, Lianli Gao, Bingguang Hao, Hao Huang, Jingkuan Song, Hengtao Shen
  • for: This paper proposes a new out-of-distribution (OOD) detection method for recognizing "unknown" data whose labels were never seen during training.
  • methods: The Multi-scale OOD DEtection (MODE) framework exploits both global image information and local region details. Because models pretrained with off-the-shelf cross-entropy or contrastive losses fail to capture useful local representations, owing to the scale discrepancy between ID training and OOD detection, an Attention-based Local PropAgation (ALPA) objective uses a cross-attention mechanism to align and highlight the local regions of target objects during ID training.
  • results: MODE outperforms the previous state of the art on multiple benchmarks, improving FPR by up to 19.24% and AUROC by 2.77% on average. Code is available at https://github.com/JimZAI/MODE-OOD.
    Abstract Out-of-distribution (OOD) detection aims to detect "unknown" data whose labels have not been seen during the in-distribution (ID) training process. Recent progress in representation learning gives rise to distance-based OOD detection that recognizes inputs as ID/OOD according to their relative distances to the training data of ID classes. Previous approaches calculate pairwise distances relying only on global image representations, which can be sub-optimal as the inevitable background clutter and intra-class variation may drive image-level representations from the same ID class far apart in a given representation space. In this work, we overcome this challenge by proposing Multi-scale OOD DEtection (MODE), a first framework leveraging both global visual information and local region details of images to maximally benefit OOD detection. Specifically, we first find that existing models pretrained by off-the-shelf cross-entropy or contrastive losses are incompetent to capture valuable local representations for MODE, due to the scale-discrepancy between the ID training and OOD detection processes. To mitigate this issue and encourage locally discriminative representations in ID training, we propose Attention-based Local PropAgation (ALPA), a trainable objective that exploits a cross-attention mechanism to align and highlight the local regions of the target objects for pairwise examples. During test-time OOD detection, a Cross-Scale Decision (CSD) function is further devised on the most discriminative multi-scale representations to distinguish ID/OOD data more faithfully. We demonstrate the effectiveness and flexibility of MODE on several benchmarks -- on average, MODE outperforms the previous state-of-the-art by up to 19.24% in FPR, 2.77% in AUROC. Code is available at https://github.com/JimZAI/MODE-OOD.

FedSIS: Federated Split Learning with Intermediate Representation Sampling for Privacy-preserving Generalized Face Presentation Attack Detection

  • paper_url: http://arxiv.org/abs/2308.10236
  • repo_url: https://github.com/naiftt/fedsis
  • paper_authors: Naif Alkhunaizi, Koushik Srivatsan, Faris Almalik, Ibrahim Almakky, Karthik Nandakumar
  • for: This paper is written for those interested in face presentation attack detection (FacePAD) who want to improve the generalization of their algorithms to unseen domains and attacks.
  • methods: The proposed Federated Split learning with Intermediate representation Sampling (FedSIS) combines federated learning (FL) and split learning for privacy-preserving domain generalization, using a hybrid Vision Transformer (ViT) architecture and a shared adapter network to distill discriminative information from intermediate blocks.
  • results: FedSIS achieves state-of-the-art generalization performance on two well-known cross-domain FacePAD benchmarks without any data sharing, thereby preserving privacy.
    Abstract Lack of generalization to unseen domains/attacks is the Achilles heel of most face presentation attack detection (FacePAD) algorithms. Existing attempts to enhance the generalizability of FacePAD solutions assume that data from multiple source domains are available with a single entity to enable centralized training. In practice, data from different source domains may be collected by diverse entities, who are often unable to share their data due to legal and privacy constraints. While collaborative learning paradigms such as federated learning (FL) can overcome this problem, standard FL methods are ill-suited for domain generalization because they struggle to surmount the twin challenges of handling non-iid client data distributions during training and generalizing to unseen domains during inference. In this work, a novel framework called Federated Split learning with Intermediate representation Sampling (FedSIS) is introduced for privacy-preserving domain generalization. In FedSIS, a hybrid Vision Transformer (ViT) architecture is learned using a combination of FL and split learning to achieve robustness against statistical heterogeneity in the client data distributions without any sharing of raw data (thereby preserving privacy). To further improve generalization to unseen domains, a novel feature augmentation strategy called intermediate representation sampling is employed, and discriminative information from intermediate blocks of a ViT is distilled using a shared adapter network. The FedSIS approach has been evaluated on two well-known benchmarks for cross-domain FacePAD to demonstrate that it is possible to achieve state-of-the-art generalization performance without data sharing. Code: https://github.com/Naiftt/FedSIS

Real-time Regular Expression Matching

  • paper_url: http://arxiv.org/abs/2308.10208
  • repo_url: https://github.com/charbelrami/regex-element
  • paper_authors: Alexandra Bernadotte
  • for: This paper focuses on finite state automata, regular expression matching, pattern recognition, and the exponential blow-up problem, i.e., the exponential growth of automaton complexity with regular expression length.
  • methods: The paper presents a theoretical and hardware solution to the exponential blow-up problem for some complicated classes of regular languages, which imposes severe limitations on Network Intrusion Detection Systems.
  • results: The article supports the feasibility and effectiveness of the solution with theorems on the correctness and complexity of regular expression matching.
    Abstract This paper is devoted to finite state automata, regular expression matching, pattern recognition, and the exponential blow-up problem, i.e., the exponential growth of automaton complexity with regular expression length. This paper presents a theoretical and hardware solution to the exponential blow-up problem for some complicated classes of regular languages, which causes severe limitations in Network Intrusion Detection Systems. The article supports the solution with theorems on correctness and complexity.
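
To see why the blow-up is exponential, the sketch below runs the textbook subset construction on the NFA for the language `(a|b)* a (a|b)^{n-1}` (strings whose n-th symbol from the end is `a`) and counts reachable DFA states, which grow as 2^n. This is a standard classroom illustration, not the paper's hardware solution.

```python
def dfa_states_for_nth_last_a(n):
    """Count reachable DFA states for '(a|b)* a (a|b)^{n-1}'.

    The minimal DFA for this language needs 2**n states, while the
    NFA below has only n + 1 states: the exponential blow-up.
    """
    # NFA transitions: delta[state][symbol] -> set of successor states
    delta = {0: {"a": {0, 1}, "b": {0}}}
    for i in range(1, n):
        delta[i] = {"a": {i + 1}, "b": {i + 1}}
    delta[n] = {"a": set(), "b": set()}  # accepting dead end

    start = frozenset({0})
    seen, frontier = {start}, [start]
    while frontier:  # breadth-first subset construction
        subset = frontier.pop()
        for sym in "ab":
            nxt = frozenset(s for q in subset for s in delta[q][sym])
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)

for n in range(1, 8):
    print(n, dfa_states_for_nth_last_a(n))  # prints 2, 4, 8, ... = 2**n
```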

GeT: Generative Target Structure Debiasing for Domain Adaptation

  • paper_url: http://arxiv.org/abs/2308.10205
  • repo_url: None
  • paper_authors: Can Zhang, Gim Hee Lee
  • for: The goal of this work is to learn a prediction model that reduces the influence of source data bias in domain adaptation, so that it transfers better across prediction tasks.
  • methods: The work builds on semi-supervised learning techniques and proposes a pseudo-labeling-based method, GeT, that learns an unbiased target embedding distribution with high-quality pseudo labels.
  • results: Experimental results show that the proposed method reduces source data bias and achieves consistently better prediction performance across various domain adaptation settings.
    Abstract Domain adaptation (DA) aims to transfer knowledge from a fully labeled source to a scarcely labeled or totally unlabeled target under domain shift. Recently, semi-supervised learning-based (SSL) techniques that leverage pseudo labeling have been increasingly used in DA. Despite the competitive performance, these pseudo labeling methods rely heavily on the source domain to generate pseudo labels for the target domain and therefore still suffer considerably from source data bias. Moreover, class distribution bias in the target domain is also often ignored in the pseudo label generation, leading to further deterioration of performance. In this paper, we propose GeT, which learns an unbiased target embedding distribution with high quality pseudo labels. Specifically, we formulate an online target generative classifier to induce the target distribution into distinctive Gaussian components weighted by their class priors to mitigate source data bias and enhance target class discriminability. We further propose a structure similarity regularization framework to alleviate target class distribution bias and further improve target class discriminability. Experimental results show that our proposed GeT is effective and achieves consistent improvements under various DA settings with and without class distribution bias. Our code is available at: https://lulusindazc.github.io/getproject/.
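
A toy version of the online target generative classifier idea: one Gaussian per class with running class priors, and pseudo labels taken from the class posterior. The EMA updates and unit covariance are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

class GaussianGenerativeClassifier:
    """One Gaussian component per class; posteriors weighted by priors."""

    def __init__(self, num_classes, dim, momentum=0.9):
        self.mu = np.zeros((num_classes, dim))
        self.prior = np.full(num_classes, 1.0 / num_classes)
        self.m = momentum

    def update(self, feats, pseudo_labels):
        # EMA update of class means and priors from a target-domain batch
        for c in np.unique(pseudo_labels):
            batch_mu = feats[pseudo_labels == c].mean(axis=0)
            self.mu[c] = self.m * self.mu[c] + (1 - self.m) * batch_mu
        counts = np.bincount(pseudo_labels, minlength=len(self.prior))
        self.prior = self.m * self.prior + (1 - self.m) * counts / counts.sum()

    def pseudo_label(self, feats):
        # log p(c|x) ∝ log prior(c) - ||x - mu_c||^2 / 2 (unit covariance)
        d2 = ((feats[:, None, :] - self.mu[None]) ** 2).sum(-1)
        logits = np.log(self.prior + 1e-8)[None] - 0.5 * d2
        return logits.argmax(axis=1)
```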

Blind Face Restoration for Under-Display Camera via Dictionary Guided Transformer

  • paper_url: http://arxiv.org/abs/2308.10196
  • repo_url: None
  • paper_authors: Jingfan Tan, Xiaoxu Chen, Tao Wang, Kaihao Zhang, Wenhan Luo, Xiaocun Cao
  • for: Under-Display Cameras (UDC) give users a full-screen experience, but because of the display's characteristics, UDC images suffer from severe quality degradation; restoring UDC face images, arguably the most common case in UDC scenes, still lacks a dedicated method and dataset.
  • methods: The authors propose a two-stage UDC Degradation Model Network (UDC-DMNet) to synthesize UDC images, and use it with high-quality face images from FFHQ and CelebA-Test to build the UDC face restoration training datasets FFHQ-P/T and testing datasets CelebA-Test-P/T. They further propose a dictionary-guided transformer network (DGFormer) that introduces a facial component dictionary and the characteristics of UDC images to enable blind face restoration in UDC scenarios.
  • results: Experiments show that DGFormer and UDC-DMNet achieve state-of-the-art performance.
    Abstract By hiding the front-facing camera below the display panel, Under-Display Camera (UDC) provides users with a full-screen experience. However, due to the characteristics of the display, images taken by UDC suffer from significant quality degradation. Methods have been proposed to tackle UDC image restoration and advances have been achieved. Nevertheless, there are still no specialized methods and datasets for restoring UDC face images, which may be the most common problem in the UDC scene. To this end, considering color filtering, brightness attenuation, and diffraction in the imaging process of UDC, we propose a two-stage network UDC Degradation Model Network named UDC-DMNet to synthesize UDC images by modeling the processes of UDC imaging. Then we use UDC-DMNet and high-quality face images from FFHQ and CelebA-Test to create UDC face training datasets FFHQ-P/T and testing datasets CelebA-Test-P/T for UDC face restoration. We propose a novel dictionary-guided transformer network named DGFormer. Introducing the facial component dictionary and the characteristics of the UDC image in the restoration makes DGFormer capable of addressing blind face restoration in UDC scenarios. Experiments show that our DGFormer and UDC-DMNet achieve state-of-the-art performance.
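
A rough sketch of the three degradation factors the abstract names (color filtering, brightness attenuation, diffraction), composed as a simple forward model; the gains, attenuation factor, and PSF below are placeholders, not UDC-DMNet's learned parameters.

```python
import numpy as np
from scipy.signal import fftconvolve

def synthesize_udc(img, psf, color_gain=(0.9, 0.95, 0.85), brightness=0.6):
    """Toy UDC degradation: per-channel color filtering, global brightness
    attenuation, and diffraction modeled as convolution with a PSF.

    img: float HxWx3 in [0, 1]; psf: small 2D kernel summing to 1.
    """
    out = img * np.asarray(color_gain)[None, None, :] * brightness
    for c in range(3):  # diffraction blur, applied per channel
        out[..., c] = fftconvolve(out[..., c], psf, mode="same")
    return np.clip(out, 0.0, 1.0)

# e.g. a crude diffraction-like PSF: a Gaussian with weak side lobes
x = np.arange(-7, 8)
g = np.exp(-0.5 * (x / 1.5) ** 2)
psf = np.outer(g, g) + 0.05 * np.outer(np.cos(x), np.cos(x)) ** 2
psf /= psf.sum()
```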

EDDense-Net: Fully Dense Encoder Decoder Network for Joint Segmentation of Optic Cup and Disc

  • paper_url: http://arxiv.org/abs/2308.10192
  • repo_url: None
  • paper_authors: Mehwish Mehmood, Khuram Naveed, Haroon Ahmed Khan, Syed S. Naqvi
  • for: This work aims to provide an assistive system for diagnosing and analyzing glaucoma, serving as a second opinion for medical ophthalmologists.
  • methods: The EDDense-Net segmentation network jointly segments the optic cup (OC) and optic disc (OD). Its encoder and decoder are built from dense blocks with a grouped convolutional layer in each block, which lets the network acquire and convey spatial information from the image while reducing its complexity; dice pixel classification is used in the decoder to alleviate class imbalance.
  • results: Evaluated on two publicly available datasets, the network outperforms existing state-of-the-art methods in both accuracy and efficiency.
    Abstract Glaucoma is an eye disease that causes damage to the optic nerve, which can lead to visual loss and permanent blindness. Early glaucoma detection is therefore critical in order to avoid permanent blindness. The estimation of the cup-to-disc ratio (CDR) during an examination of the optical disc (OD) is used for the diagnosis of glaucoma. In this paper, we present the EDDense-Net segmentation network for the joint segmentation of the optic cup (OC) and the OD. The encoder and decoder in this network are made up of dense blocks with a grouped convolutional layer in each block, allowing the network to acquire and convey spatial information from the image while simultaneously reducing the network's complexity. To reduce spatial information loss, the optimal number of filters in all convolution layers was utilised. In semantic segmentation, dice pixel classification is employed in the decoder to alleviate the problem of class imbalance. The proposed network was evaluated on two publicly available datasets where it outperformed existing state-of-the-art methods in terms of accuracy and efficiency. For the diagnosis and analysis of glaucoma, this method can be used as a second opinion system to assist medical ophthalmologists.
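
The dice pixel classification mentioned above is commonly realized as a soft Dice loss; a standard multi-class version (not necessarily EDDense-Net's exact variant) looks like this:

```python
import torch

def dice_loss(logits, targets, eps=1.0):
    """Soft Dice loss for C-class segmentation; mitigates class imbalance
    by normalizing per-class overlap by class size.

    logits: (B, C, H, W) raw scores; targets: (B, H, W) integer labels.
    """
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    onehot = torch.nn.functional.one_hot(targets, num_classes)  # (B,H,W,C)
    onehot = onehot.permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = (2 * inter + eps) / (union + eps)
    return 1.0 - dice.mean()
```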

Spiking-Diffusion: Vector Quantized Discrete Diffusion Model with Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2308.10187
  • repo_url: https://github.com/Arktis2022/Spiking-Diffusion
  • paper_authors: Mingxuan Liu, Rui Wen, Hong Chen
  • for: This work targets SNN-based image generation, aiming at energy-efficient neuromorphic processors.
  • methods: It uses a vector quantized variational autoencoder with SNNs (VQ-SVAE) to learn a discrete latent space for images, then performs absorbing-state diffusion in that space and denoises with an SNN diffusion image decoder.
  • results: Experiments on MNIST, FMNIST, KMNIST, and Letters show that Spiking-Diffusion outperforms existing SNN-based generative models.
    Abstract Spiking neural networks (SNNs) have tremendous potential for energy-efficient neuromorphic chips due to their binary and event-driven architecture. SNNs have been primarily used in classification tasks, but have seen limited exploration in image generation tasks. To fill the gap, we propose a Spiking-Diffusion model, which is based on the vector quantized discrete diffusion model. First, we develop a vector quantized variational autoencoder with SNNs (VQ-SVAE) to learn a discrete latent space for images. With VQ-SVAE, image features are encoded using both the spike firing rate and postsynaptic potential, and an adaptive spike generator is designed to restore embedding features in the form of spike trains. Next, we perform absorbing state diffusion in the discrete latent space and construct a diffusion image decoder with SNNs to denoise the image. Our work is the first to build the diffusion model entirely from SNN layers. Experimental results on MNIST, FMNIST, KMNIST, and Letters demonstrate that Spiking-Diffusion outperforms the existing SNN-based generation model. We achieve FIDs of 37.50, 91.98, 59.23 and 67.41 on the above datasets respectively, with reductions of 58.60%, 18.75%, 64.51%, and 29.75% in FIDs compared with the state-of-the-art work.
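
For readers unfamiliar with SNN inputs, here is a generic Poisson-style rate-coding sketch of the "spike firing rate" idea; VQ-SVAE's adaptive spike generator is learned and considerably more elaborate than this.

```python
import torch

def rate_encode(images, num_steps=16):
    """Rate coding: pixel intensity in [0, 1] becomes the firing
    probability per time step, yielding binary spike trains.

    images: (B, C, H, W) in [0, 1] -> spikes: (T, B, C, H, W) in {0, 1}.
    """
    probs = images.clamp(0, 1).unsqueeze(0).expand(num_steps, *images.shape)
    return torch.bernoulli(probs)

def firing_rate(spikes):
    # decode back to an intensity estimate by averaging over time
    return spikes.mean(dim=0)
```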

ViT-Lens: Towards Omni-modal Representations

  • paper_url: http://arxiv.org/abs/2308.10185
  • repo_url: https://github.com/TencentARC/ViT-Lens
  • paper_authors: Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou
  • for: The paper proposes a method for handling multiple modalities (e.g., 3D, audio) efficiently, so that a pretrained ViT can be reused across tasks and domains.
  • methods: ViT-Lens tunes modality-specific lenses that project multimodal signals into a shared embedding space, which a strong pretrained ViT then processes; the encoded representations are optimized to align with a modal-independent space pre-defined by off-the-shelf foundation models.
  • results: In zero-shot 3D classification, ViT-Lens achieves substantial improvements over the previous state-of-the-art, with 52.0% accuracy on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Moreover, simply integrating the trained 3D lens into the InstructBLIP model enables zero-shot 3D question answering without any adaptation.
    Abstract Despite the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio, etc.) is constrained by the need for large-scale data, which is expensive or even inapplicable for rare modalities. In this paper, we present ViT-Lens that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning to a pre-defined space. Specifically, the modality-specific lens is tuned to project multimodal signals to the shared embedding space, which are then processed by a strong ViT that carries pre-trained image knowledge. The encoded multimodal representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities. ViT-Lens provides a unified solution for representation learning of increasing modalities with two appealing benefits: (i) Exploiting the pretrained ViT across tasks and domains effectively with an efficient data regime; (ii) Emergent downstream capabilities of novel modalities are demonstrated due to the modality alignment space. We evaluate ViT-Lens in the context of 3D as an initial verification. In zero-shot 3D classification, ViT-Lens achieves substantial improvements over the previous state-of-the-art, showing 52.0% accuracy on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore, we enable zero-shot 3D question-answering by simply integrating the trained 3D lens into the InstructBLIP model without any adaptation. We will release the results of ViT-Lens on more modalities in the near future.
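
A minimal sketch of the lens-plus-alignment recipe: a small module maps new-modality tokens into a frozen ViT's input space, and an InfoNCE-style loss aligns the outputs to anchors from a pre-defined embedding space. All layer sizes and the loss choice are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityLens(nn.Module):
    """Toy 'lens': projects tokens of a new modality (e.g. point clouds)
    into the token space of a frozen, image-pretrained ViT."""

    def __init__(self, in_dim, vit_dim, num_tokens=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, vit_dim), nn.GELU(), nn.Linear(vit_dim, vit_dim)
        )
        self.query = nn.Parameter(torch.randn(num_tokens, vit_dim) * 0.02)

    def forward(self, x):  # x: (B, N, in_dim) raw modality tokens
        kv = self.proj(x)
        # cross-attend a fixed set of queries to the modality tokens
        attn = torch.softmax(
            self.query @ kv.transpose(1, 2) / kv.shape[-1] ** 0.5, dim=-1
        )
        return attn @ kv  # (B, num_tokens, vit_dim), fed to the frozen ViT

def alignment_loss(pred_emb, anchor_emb, temperature=0.07):
    """InfoNCE-style alignment to a pre-defined space, e.g. embeddings
    of the matching concept from an off-the-shelf foundation model."""
    pred = F.normalize(pred_emb, dim=-1)
    anchor = F.normalize(anchor_emb, dim=-1)
    logits = pred @ anchor.T / temperature
    labels = torch.arange(len(pred), device=pred.device)
    return F.cross_entropy(logits, labels)
```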

BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

  • paper_url: http://arxiv.org/abs/2308.10175
  • repo_url: None
  • paper_authors: Chen Liu, Peike Li, Hu Zhang, Lincheng Li, Zi Huang, Dadong Wang, Xin Yu
  • for: This work aims to improve the accuracy of audio-visual segmentation (AVS) so that sounding sources can be localized reliably in real-world scenes.
  • methods: A two-stage bootstrapping AVS framework integrates multi-modal foundation knowledge. In the first stage, a segmentation model localizes potential sounding objects from visual data, unaffected by contaminated audio signals, while a foundation audio classification model discerns audio semantics. In the second stage, an audio-visual semantic integration strategy (AVIS) identifies the authentic-sounding objects: an audio-visual tree is built from the hierarchical correspondence between sounds and object categories, and label concurrency between the localized objects and classified audio tags is checked by tracing that tree.
  • results: Extensive experiments on AVS datasets demonstrate the method's superiority, especially in scenarios with background noise, where it localizes genuinely sounding objects more accurately.
    Abstract Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate sounding sources by predicting pixel-wise maps. Previous methods assume that each sound component in an audio signal always has a visual counterpart in the image. However, this assumption overlooks that off-screen sounds and background noise often contaminate the audio recordings in real-world scenarios. They impose significant challenges on building a consistent semantic mapping between audio and visual signals for AVS models and thus impede precise sound localization. In this work, we propose a two-stage bootstrapping audio-visual segmentation framework by incorporating multi-modal foundation knowledge. In a nutshell, our BAVS is designed to eliminate the interference of background noise or off-screen sounds in segmentation by establishing the audio-visual correspondences in an explicit manner. In the first stage, we employ a segmentation model to localize potential sounding objects from visual data without being affected by contaminated audio signals. Meanwhile, we also utilize a foundation audio classification model to discern audio semantics. Considering the audio tags provided by the audio foundation model are noisy, associating object masks with audio tags is not trivial. Thus, in the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects. Here, we construct an audio-visual tree based on the hierarchical correspondence between sounds and object categories. We then examine the label concurrency between the localized objects and classified audio tags by tracing the audio-visual tree. With AVIS, we can effectively segment real-sounding objects. Extensive experiments demonstrate the superiority of our method on AVS datasets, particularly in scenarios involving background noise. Our project website is https://yenanliu.github.io/AVSS.github.io/.

Neural Interactive Keypoint Detection

  • paper_url: http://arxiv.org/abs/2308.10174
  • repo_url: https://github.com/idea-research/click-pose
  • paper_authors: Jie Yang, Ailing Zeng, Feng Li, Shilong Liu, Ruimao Zhang, Lei Zhang
  • for: This work develops an end-to-end neural interactive keypoint detection framework, Click-Pose, which can reduce the time and effort of 2D keypoint annotation by more than 10 times.
  • methods: Click-Pose attaches an interactive human-feedback loop in which users click predicted keypoints to correct them, and uses a pose error modeling strategy to strengthen the model's self-correction ability.
  • results: On COCO and Human-Art, Click-Pose needs only 1.97 and 6.45 clicks (NoC) @95 (at precision 95%) for annotation, cutting 31.4% and 36.3% of the effort of manually correcting the previous state-of-the-art (ViTPose). Moreover, without any user clicks, Click-Pose still surpasses the previous end-to-end model. Code is available at https://github.com/IDEA-Research/Click-Pose.
    Abstract This work proposes an end-to-end neural interactive keypoint detection framework named Click-Pose, which can reduce the labeling costs of 2D keypoint annotation by more than 10 times compared with manual-only annotation. Click-Pose explores how user feedback can cooperate with a neural keypoint detector to correct the predicted keypoints in an interactive way for a faster and more effective annotation process. Specifically, we design the pose error modeling strategy that inputs the ground truth pose combined with four typical pose errors into the decoder and trains the model to reconstruct the correct poses, which enhances the self-correction ability of the model. Then, we attach an interactive human-feedback loop that allows receiving users' clicks to correct one or several predicted keypoints and iteratively utilizes the decoder to update all other keypoints with a minimum number of clicks (NoC) for efficient annotation. We validate Click-Pose in in-domain, out-of-domain scenes, and a new task of keypoint adaptation. For annotation, Click-Pose only needs 1.97 and 6.45 NoC@95 (at precision 95%) on COCO and Human-Art, reducing annotation effort by 31.4% and 36.3%, respectively, compared with the SOTA model (ViTPose) with manual correction. Besides, without user clicks, Click-Pose surpasses the previous end-to-end model by 1.4 AP on COCO and 3.0 AP on Human-Art. The code is available at https://github.com/IDEA-Research/Click-Pose.
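
To make the NoC metric concrete, here is a toy simulation of interactive correction: each "click" snaps the worst keypoint to ground truth until all keypoints are within a threshold. The real model re-runs its decoder after every click so the remaining keypoints improve too; this sketch only fixes the clicked point.

```python
import numpy as np

def clicks_to_full_correction(pred, gt, thresh, max_clicks=10):
    """Count clicks until every keypoint is within `thresh` of ground truth.

    pred, gt: (K, 2) keypoint arrays; returns the number of clicks used.
    """
    pred = pred.copy()
    clicks = 0
    while clicks < max_clicks:
        err = np.linalg.norm(pred - gt, axis=1)
        if (err <= thresh).all():
            break
        worst = int(err.argmax())
        pred[worst] = gt[worst]  # one user click = one manual correction
        clicks += 1
    return clicks
```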

VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation

  • paper_url: http://arxiv.org/abs/2308.10172
  • repo_url: https://github.com/yanyuanqiao/vln-petl
  • paper_authors: Yanyuan Qiao, Zheng Yu, Qi Wu
  • for: This work aims to improve the performance of large pre-trained vision-and-language models on Vision-and-Language Navigation (VLN) tasks while avoiding the cost of fully fine-tuning the pre-trained model for every task.
  • methods: It proposes a VLN-specific Parameter-Efficient Transfer Learning (PETL) method with two tailored PETL modules: a Historical Interaction Booster (HIB) and a Cross-modal Interaction Booster (CIB).
  • results: Extensive experiments on four mainstream VLN tasks (R2R, REVERIE, NDH, RxR) show that the proposed VLN-PETL achieves comparable or better performance than full fine-tuning and outperforms other PETL methods by promising margins.
    Abstract The performance of Vision-and-Language Navigation (VLN) tasks has witnessed rapid progress recently thanks to the use of large pre-trained vision-and-language models. However, full fine-tuning of the pre-trained model for every downstream VLN task is becoming costly due to the considerable model size. The recent research hotspot of Parameter-Efficient Transfer Learning (PETL) shows great potential in efficiently tuning large pre-trained models for the common CV and NLP tasks, which exploits most of the representation knowledge implied in the pre-trained model while tuning only a minimal set of parameters. However, simply utilizing existing PETL methods for the more challenging VLN tasks may bring non-trivial degeneration to the performance. Therefore, we present the first study to explore PETL methods for VLN tasks and propose a VLN-specific PETL method named VLN-PETL. Specifically, we design two PETL modules: Historical Interaction Booster (HIB) and Cross-modal Interaction Booster (CIB). Then we combine these two modules with several existing PETL methods as the integrated VLN-PETL. Extensive experimental results on four mainstream VLN tasks (R2R, REVERIE, NDH, RxR) demonstrate the effectiveness of our proposed VLN-PETL, where VLN-PETL achieves comparable or even better performance to full fine-tuning and outperforms other PETL methods with promising margins.
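
HIB and CIB are task-specific boosters, but they share the generic PETL skeleton of a residual bottleneck adapter inside a frozen backbone. A common sketch of that skeleton follows; the sizes and the name-based freezing rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic PETL adapter: a small bottleneck MLP with a residual
    connection, inserted into a frozen transformer block."""

    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

def freeze_backbone_except_adapters(model):
    # tune only parameters whose names mark them as adapter weights
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name
```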

Cell Spatial Analysis in Crohn’s Disease: Unveiling Local Cell Arrangement Pattern with Graph-based Signatures

  • paper_url: http://arxiv.org/abs/2308.10166
  • repo_url: None
  • paper_authors: Shunxing Bao, Sichen Zhu, Vasantha L Kolachala, Lucas W. Remedios, Yeonjoo Hwang, Yutong Sun, Ruining Deng, Can Cui, Yike Li, Jia Li, Joseph T. Roland, Qi Liu, Ken S. Lau, Subra Kugathasan, Peng Qiu, Keith T. Wilson, Lori A. Coburn, Bennett A. Landman, Yuankai Huo
  • for: This study characterizes the cellular microenvironment of Crohn's disease (CD) activity, with a particular focus on the rectum region.
  • methods: Six distinct cell types are characterized from Hematoxylin and Eosin (H&E) images, and a 10-cell neighborhood matrix captures each cell's local spatial signature; t-SNE and Kernel Density Estimation are then used to examine local patterns of the cellular environment.
  • results: The analysis reveals heterogeneous nearest-neighbor patterns, i.e., distinct tendencies of cell clustering, especially in the rectum region, underscoring the impact of data heterogeneity on cell spatial arrangements. The spatial distribution disparities between the two research institutes also highlight the significance of collaboration among healthcare organizations.
    Abstract Crohn's disease (CD) is a chronic and relapsing inflammatory condition that affects segments of the gastrointestinal tract. CD activity is determined by histological findings, particularly the density of neutrophils observed on Hematoxylin and Eosin stains (H&E) imaging. However, understanding the broader morphometry and local cell arrangement beyond cell counting and tissue morphology remains challenging. To address this, we characterize six distinct cell types from H&E images and develop a novel approach for the local spatial signature of each cell. Specifically, we create a 10-cell neighborhood matrix, representing neighboring cell arrangements for each individual cell. Utilizing t-SNE for non-linear spatial projection in scatter-plot and Kernel Density Estimation contour-plot formats, our study examines patterns of differences in the cellular environment associated with the odds ratio of spatial patterns between active CD and control groups. This analysis is based on data collected at the two research institutes. The findings reveal heterogeneous nearest-neighbor patterns, signifying distinct tendencies of cell clustering, with a particular focus on the rectum region. These variations underscore the impact of data heterogeneity on cell spatial arrangements in CD patients. Moreover, the spatial distribution disparities between the two research sites highlight the significance of collaborative efforts among healthcare organizations. All research analysis pipeline tools are available at https://github.com/MASILab/cellNN.
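
The 10-cell neighborhood matrix reduces to a k-nearest-neighbor type-composition vector per cell; a small sketch with scikit-learn, where the Euclidean metric and k are assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_signatures(coords, cell_types, num_types=6, k=10):
    """For each cell, count the cell types among its k nearest neighbors,
    giving a per-cell local-arrangement signature.

    coords: (N, 2) cell centroids; cell_types: (N,) ints in [0, num_types).
    Returns: (N, num_types) composition vectors (rows sum to 1).
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
    _, idx = nn.kneighbors(coords)           # idx[:, 0] is the cell itself
    neighbor_types = cell_types[idx[:, 1:]]  # (N, k)
    sig = np.zeros((len(coords), num_types))
    for t in range(num_types):
        sig[:, t] = (neighbor_types == t).sum(axis=1)
    return sig / k

# signatures can then be embedded and compared across groups, e.g.
# sklearn.manifold.TSNE(n_components=2).fit_transform(sig)
```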

ThermRad: A Multi-modal Dataset for Robust 3D Object Detection under Challenging Conditions

  • paper_url: http://arxiv.org/abs/2308.10161
  • repo_url: None
  • paper_authors: Qiao Yan, Yihan Wang
  • for: To improve the robustness and reliability of 3D object detection under extreme weather and illumination conditions.
  • methods: The work proposes a new multi-modal fusion method, RTDF-RCNN, which leverages the complementary strengths of 4D radars and thermal cameras for object detection.
  • results: Compared with other methods, RTDF-RCNN improves detection of cars, pedestrians, and cyclists by over 7.98%, 24.27%, and 27.15%, respectively, while achieving results comparable to LiDAR-based approaches.
    Abstract Robust 3D object detection in extreme weather and illumination conditions is a challenging task. While radars and thermal cameras are known for their resilience to these conditions, few studies have been conducted on radar-thermal fusion due to the lack of corresponding datasets. To address this gap, we first present a new multi-modal dataset called ThermRad, which includes a 3D LiDAR, a 4D radar, an RGB camera and a thermal camera. This dataset is unique because it includes data from all four sensors in extreme weather conditions, providing a valuable resource for future research in this area. To validate the robustness of 4D radars and thermal cameras for 3D object detection in challenging weather conditions, we propose a new multi-modal fusion method called RTDF-RCNN, which leverages the complementary strengths of 4D radars and thermal cameras to boost object detection performance. To further prove the effectiveness of our proposed framework, we re-implement state-of-the-art (SOTA) 3D detectors on our dataset as benchmarks for evaluation. Our method achieves significant enhancements in detecting cars, pedestrians, and cyclists, with improvements of over 7.98%, 24.27%, and 27.15%, respectively, while achieving comparable results to LiDAR-based approaches. Our contributions in both the ThermRad dataset and the new multi-modal fusion method provide a new approach to robust 3D object detection in adverse weather and illumination conditions. The ThermRad dataset will be released.

HODN: Disentangling Human-Object Feature for HOI Detection

  • paper_url: http://arxiv.org/abs/2308.10158
  • repo_url: None
  • paper_authors: Shuman Fang, Zhiwen Lin, Ke Yan, Jie Li, Xianming Lin, Rongrong Ji
  • for: This paper aims to improve the accuracy of human-object interaction (HOI) detection, proposing a transformer-based Human and Object Disentangling Network (HODN) that significantly improves on existing methods.
  • methods: Two disentangling decoders detect humans and objects independently before an interaction decoder detects their interactions. A Human-Guide Linking method keeps the interaction decoder focused on human-centric regions, using human features as positional embeddings, and a Stop-Gradient Mechanism prevents interaction gradients from affecting object detection while still letting them optimize human detection.
  • results: The proposed method achieves competitive performance on the V-COCO and HICO-Det datasets and can be combined with existing methods for state-of-the-art results.
    Abstract The task of Human-Object Interaction (HOI) detection is to detect humans and their interactions with surrounding objects, where transformer-based methods show dominant advances currently. However, these methods ignore the relationship among humans, objects, and interactions: 1) human features are more contributive than object ones to interaction prediction; 2) interactive information disturbs the detection of objects but helps human detection. In this paper, we propose a Human and Object Disentangling Network (HODN) to model the HOI relationships explicitly, where humans and objects are first detected by two disentangling decoders independently and then processed by an interaction decoder. Considering that human features are more contributive to interaction, we propose a Human-Guide Linking method to make sure the interaction decoder focuses on the human-centric regions with human features as the positional embeddings. To handle the opposite influences of interactions on humans and objects, we propose a Stop-Gradient Mechanism to stop interaction gradients from optimizing the object detection but to allow them to optimize the human detection. Our proposed method achieves competitive performance on both the V-COCO and the HICO-Det datasets. It can be combined with existing methods easily for state-of-the-art results.
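
The Stop-Gradient Mechanism can be sketched in one line of PyTorch: detach the object features before they enter the interaction branch, so interaction losses update the human branch only. The fusion by concatenation is an assumption for illustration, not HODN's exact design.

```python
import torch

def fuse_for_interaction(human_feats, object_feats):
    """Interaction losses should refine human detection but must not
    disturb object detection, so gradients are blocked on the object
    branch only.

    human_feats, object_feats: (B, Q, D) decoder outputs.
    """
    return torch.cat([human_feats, object_feats.detach()], dim=-1)

# interaction_logits = interaction_decoder(fuse_for_interaction(h, o))
# backprop from the interaction loss then updates h's branch, not o's
```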

Contrastive Diffusion Model with Auxiliary Guidance for Coarse-to-Fine PET Reconstruction

  • paper_url: http://arxiv.org/abs/2308.10157
  • repo_url: https://github.com/show-han/pet-reconstruction
  • paper_authors: Zeyu Han, Yuhan Wang, Luping Zhou, Peng Wang, Binyu Yan, Jiliu Zhou, Yan Wang, Dinggang Shen
  • for: To obtain high-quality positron emission tomography (PET) images while reducing radiation exposure to the human body.
  • methods: Building on generative adversarial networks (GANs) and diffusion probabilistic models (DPMs), the work proposes a coarse-to-fine reconstruction framework with a coarse prediction module and an iterative refinement module, plus auxiliary guidance and contrastive diffusion strategies.
  • results: Extensive experiments on two human brain PET datasets show that the method improves clinical reliability and outperforms state-of-the-art PET reconstruction methods.
    Abstract To obtain high-quality positron emission tomography (PET) scans while reducing radiation exposure to the human body, various approaches have been proposed to reconstruct standard-dose PET (SPET) images from low-dose PET (LPET) images. One widely adopted technique is the generative adversarial networks (GANs), yet recently, diffusion probabilistic models (DPMs) have emerged as a compelling alternative due to their improved sample quality and higher log-likelihood scores compared to GANs. Despite this, DPMs suffer from two major drawbacks in real clinical settings, i.e., the computationally expensive sampling process and the insufficient preservation of correspondence between the conditioning LPET image and the reconstructed PET (RPET) image. To address the above limitations, this paper presents a coarse-to-fine PET reconstruction framework that consists of a coarse prediction module (CPM) and an iterative refinement module (IRM). The CPM generates a coarse PET image via a deterministic process, and the IRM samples the residual iteratively. By delegating most of the computational overhead to the CPM, the overall sampling speed of our method can be significantly improved. Furthermore, two additional strategies, i.e., an auxiliary guidance strategy and a contrastive diffusion strategy, are proposed and integrated into the reconstruction process, which can enhance the correspondence between the LPET image and the RPET image, further improving clinical reliability. Extensive experiments on two human brain PET datasets demonstrate that our method outperforms the state-of-the-art PET reconstruction methods. The source code is available at \url{https://github.com/Show-han/PET-Reconstruction}.
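
Structurally, the framework is a one-shot coarse prediction followed by iterative residual refinement conditioned on the LPET input; the sketch below abstracts the diffusion-based IRM as any residual-predicting module, which is a deliberate simplification of the actual sampler.

```python
import torch

@torch.no_grad()
def reconstruct_spet(lpet, cpm, irm, num_refine_steps=4):
    """Coarse-to-fine sketch: deterministic coarse estimate, then
    iterative residual corrections.

    lpet: (B, 1, H, W) low-dose input; cpm/irm: callables (nn.Modules).
    """
    rpet = cpm(lpet)                   # cheap one-shot coarse estimate
    for t in range(num_refine_steps):  # most of the compute lives here
        residual = irm(rpet, lpet, t)  # conditioned on LPET at every step
        rpet = rpet + residual
    return rpet
```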

Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.10155
  • repo_url: None
  • paper_authors: Guodong Wang, Yunhong Wang, Jie Qin, Dongming Zhang, Xiuguo Bao, Di Huang
  • for: This paper proposes Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation (UniCon-HA), a contrastive-learning-based anomaly detection method for safety-critical applications.
  • methods: It explicitly encourages the concentration of inliers and the dispersion of virtual outliers via supervised and unsupervised contrastive losses, adopts an easy-to-hard hierarchical augmentation strategy with contrastive aggregation at different network depths, and introduces a soft re-weighting mechanism that keeps the inlier concentration purified.
  • results: Evaluated under three anomaly detection settings (unlabeled one-class, unlabeled multi-class, and labeled multi-class), the method consistently surpasses its competitors.
    Abstract Anomaly detection (AD), aiming to find samples that deviate from the training distribution, is essential in safety-critical applications. Though recent self-supervised learning based attempts achieve promising results by creating virtual outliers, their training objectives are less faithful to AD which requires a concentrated inlier distribution as well as a dispersive outlier distribution. In this paper, we propose Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation (UniCon-HA), taking into account both the requirements above. Specifically, we explicitly encourage the concentration of inliers and the dispersion of virtual outliers via supervised and unsupervised contrastive losses, respectively. Considering that standard contrastive data augmentation for generating positive views may induce outliers, we additionally introduce a soft mechanism to re-weight each augmented inlier according to its deviation from the inlier distribution, to ensure a purified concentration. Moreover, to prompt a higher concentration, inspired by curriculum learning, we adopt an easy-to-hard hierarchical augmentation strategy and perform contrastive aggregation at different depths of the network based on the strengths of data augmentation. Our method is evaluated under three AD settings including unlabeled one-class, unlabeled multi-class, and labeled multi-class, demonstrating its consistent superiority over other competitors.
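
A plausible reading of the soft re-weighting mechanism: augmented inliers that drift far from the inlier center are down-weighted before aggregation, so strong augmentations cannot pollute the concentrated cluster. The exponential form and temperature are assumptions, not the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def inlier_weights(aug_feats, inlier_center, tau=0.5):
    """Weights for augmented-inlier loss terms.

    aug_feats: (N, D) features of augmented inliers;
    inlier_center: (D,) running mean of inlier features.
    """
    f = F.normalize(aug_feats, dim=-1)
    c = F.normalize(inlier_center, dim=-1)
    dist = 1.0 - f @ c          # cosine distance to the inlier center
    w = torch.exp(-dist / tau)  # far-away augmentations get small weight
    return w / w.sum()
```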

ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer

  • paper_url: http://arxiv.org/abs/2308.10147
  • repo_url: https://github.com/mxin262/estextspotter
  • paper_authors: Mingxin Huang, Jiaxin Zhang, Dezhi Peng, Hao Lu, Can Huang, Yuliang Liu, Xiang Bai, Lianwen Jin
  • for: To improve the joint performance of scene text detection and recognition.
  • methods: The paper proposes an Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter), which achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder, decomposing the conventional shared query into task-aware queries.
  • results: Experiments show that the model significantly outperforms previous state-of-the-art methods. Code is available at https://github.com/mxin262/ESTextSpotter.
    Abstract In recent years, end-to-end scene text spotting approaches are evolving to the Transformer-based framework. While previous studies have shown the crucial importance of the intrinsic synergy between text detection and recognition, recent advances in Transformer-based methods usually adopt an implicit synergy strategy with shared query, which can not fully realize the potential of these two interactive tasks. In this paper, we argue that the explicit synergy considering distinct characteristics of text detection and recognition can significantly improve the performance text spotting. To this end, we introduce a new model named Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter), which achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder. Specifically, we decompose the conventional shared query into task-aware queries for text polygon and content, respectively. Through the decoder with the proposed vision-language communication module, the queries interact with each other in an explicit manner while preserving discriminative patterns of text detection and recognition, thus improving performance significantly. Additionally, we propose a task-aware query initialization scheme to ensure stable training. Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods. Code is available at https://github.com/mxin262/ESTextSpotter.

OCHID-Fi: Occlusion-Robust Hand Pose Estimation in 3D via RF-Vision

  • paper_url: http://arxiv.org/abs/2308.10146
  • repo_url: None
  • paper_authors: Shujie Zhang, Tianyue Zheng, Zhe Chen, Jingzhi Hu, Abdelwahed Khamis, Jiajun Liu, Jun Luo
  • for: The paper aims to improve hand pose estimation accuracy in occluded scenarios, a challenging problem in applications such as virtual reality, robotics, and human-computer interaction.
  • methods: The proposed method uses radio-frequency vision (RF-vision), which can bypass obstacles and capture hand pose information behind them. It introduces OCHID-Fi, a complex-valued RF hand pose estimation network trained with a cross-modality and cross-domain process: a pre-trained CM-HPE network and a synchronized CM/RF dataset guide training under line-of-sight (LoS) conditions, and adversarial learning transfers knowledge from the labeled LoS domain to the unlabeled occluded domain.
  • results: Experiments show that OCHID-Fi achieves accuracy comparable to CM-HPE under normal conditions and maintains that accuracy even in occluded scenarios, with empirical evidence of generalizability to new domains.
    Abstract Hand Pose Estimation (HPE) is crucial to many applications, but conventional cameras-based CM-HPE methods are completely subject to Line-of-Sight (LoS), as cameras cannot capture occluded objects. In this paper, we propose to exploit Radio-Frequency-Vision (RF-vision) capable of bypassing obstacles for achieving occluded HPE, and we introduce OCHID-Fi as the first RF-HPE method with 3D pose estimation capability. OCHID-Fi employs wideband RF sensors widely available on smart devices (e.g., iPhones) to probe 3D human hand pose and extract their skeletons behind obstacles. To overcome the challenge in labeling RF imaging given its human incomprehensible nature, OCHID-Fi employs a cross-modality and cross-domain training process. It uses a pre-trained CM-HPE network and a synchronized CM/RF dataset, to guide the training of its complex-valued RF-HPE network under LoS conditions. It further transfers knowledge learned from labeled LoS domain to unlabeled occluded domain via adversarial learning, enabling OCHID-Fi to generalize to unseen occluded scenarios. Experimental results demonstrate the superiority of OCHID-Fi: it achieves comparable accuracy to CM-HPE under normal conditions while maintaining such accuracy even in occluded scenarios, with empirical evidence for its generalizability to new domains.
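
The adversarial transfer step is typically implemented with a gradient reversal layer and a domain discriminator; the classic GRL pattern (not necessarily OCHID-Fi's exact setup) is:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity forward, negated gradient
    backward, the standard tool for adversarial domain adaptation."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# domain_logits = discriminator(grad_reverse(features))
# minimizing the domain loss then pushes the feature extractor toward
# domain-invariant features (LoS vs. occluded)
```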

Polymerized Feature-based Domain Adaptation for Cervical Cancer Dose Map Prediction

  • paper_url: http://arxiv.org/abs/2308.10142
  • repo_url: None
  • paper_authors: Jie Zeng, Zeyu Han, Xingchen Peng, Jianghong Xiao, Peng Wang, Yan Wang
  • for: To improve the accuracy of dose map prediction for cervical cancer radiation therapy planning, using deep learning to automate and accelerate the planning process.
  • methods: The rich knowledge learned from rectum cancer, which has the same scanning area and more clinically available data, is transferred to cervical cancer dose map prediction through domain adaptation with a Transformer-based polymerized feature module.
  • results: Experimental results on two in-house clinical datasets show that the proposed method outperforms state-of-the-art methods.
    Abstract Recently, deep learning (DL) has automated and accelerated the clinical radiation therapy (RT) planning significantly by predicting accurate dose maps. However, most DL-based dose map prediction methods are data-driven and not applicable for cervical cancer where only a small amount of data is available. To address this problem, this paper proposes to transfer the rich knowledge learned from another cancer, i.e., rectum cancer, which has the same scanning area and more clinically available data, to improve the dose map prediction performance for cervical cancer through domain adaptation. In order to close the congenital domain gap between the source (i.e., rectum cancer) and the target (i.e., cervical cancer) domains, we develop an effective Transformer-based polymerized feature module (PFM), which can generate an optimal polymerized feature distribution to smoothly align the two input distributions. Experimental results on two in-house clinical datasets demonstrate the superiority of the proposed method compared with state-of-the-art methods.

March in Chat: Interactive Prompting for Remote Embodied Referring Expression

  • paper_url: http://arxiv.org/abs/2308.10141
  • repo_url: https://github.com/yanyuanqiao/mic
  • paper_authors: Yanyuan Qiao, Yuankai Qi, Zheng Yu, Jing Liu, Qi Wu
  • for: The paper proposes a March-in-Chat (MiC) model that talks to a Large Language Model (LLM) on the fly and plans dynamically through a newly proposed Room-and-Object Aware Scene Perceiver (ROASP), improving performance on Vision-and-Language Navigation (VLN) tasks in the REVERIE setting.
  • methods: The MiC model incorporates ROASP so that the LLM can plan actions based on the current visual observation and adapt to the larger and more complex REVERIE environment.
  • results: The proposed MiC model outperforms the previous state-of-the-art by large margins on the SPL and RGSPL metrics of the REVERIE benchmark.
    Abstract Many Vision-and-Language Navigation (VLN) tasks have been proposed in recent years, from room-based to object-based and indoor to outdoor. The REVERIE (Remote Embodied Referring Expression) is interesting since it only provides high-level instructions to the agent, which are closer to human commands in practice. Nevertheless, this poses more challenges than other VLN tasks since it requires agents to infer a navigation plan only based on a short instruction. Large Language Models (LLMs) show great potential in robot action planning by providing proper prompts. Still, this strategy has not been explored under the REVERIE settings. There are several new challenges. For example, the LLM should be environment-aware so that the navigation plan can be adjusted based on the current visual observation. Moreover, the LLM planned actions should be adaptable to the much larger and more complex REVERIE environment. This paper proposes a March-in-Chat (MiC) model that can talk to the LLM on the fly and plan dynamically based on a newly proposed Room-and-Object Aware Scene Perceiver (ROASP). Our MiC model outperforms the previous state-of-the-art by large margins by SPL and RGSPL metrics on the REVERIE benchmark.

HollowNeRF: Pruning Hashgrid-Based NeRFs with Trainable Collision Mitigation

  • paper_url: http://arxiv.org/abs/2308.10122
  • repo_url: None
  • paper_authors: Xiufeng Xie, Riccardo Gherardi, Zhihong Pan, Stephen Huang
  • for: Hashgrid-based positional encoding and neural networks have made NeRF training and evaluation efficient, but effectively exploiting the spatial sparsity of 3D scenes remains a challenge.
  • methods: The paper proposes HollowNeRF, a compression solution that automatically sparsifies the feature grid of hashgrid-based NeRFs during training: a coarse 3D saliency mask guides feature pruning, and an alternating direction method of multipliers (ADMM) pruner sparsifies the mask during training, achieving high-quality rendering with far fewer parameters.
  • results: HollowNeRF delivers rendering quality comparable to Instant-NGP while using just 31% of the parameters, and can achieve a PSNR accuracy gain of up to 1 dB using only 56% of the parameters.
    Abstract Neural radiance fields (NeRF) have garnered significant attention, with recent works such as Instant-NGP accelerating NeRF training and evaluation through a combination of hashgrid-based positional encoding and neural networks. However, effectively leveraging the spatial sparsity of 3D scenes remains a challenge. To cull away unnecessary regions of the feature grid, existing solutions rely on prior knowledge of object shape or periodically estimate object shape during training by repeated model evaluations, which are costly and wasteful. To address this issue, we propose HollowNeRF, a novel compression solution for hashgrid-based NeRF which automatically sparsifies the feature grid during the training phase. Instead of directly compressing dense features, HollowNeRF trains a coarse 3D saliency mask that guides efficient feature pruning, and employs an alternating direction method of multipliers (ADMM) pruner to sparsify the 3D saliency mask during training. By exploiting the sparsity in the 3D scene to redistribute hash collisions, HollowNeRF improves rendering quality while using a fraction of the parameters of comparable state-of-the-art solutions, leading to a better cost-accuracy trade-off. Our method delivers comparable rendering quality to Instant-NGP, while utilizing just 31% of the parameters. In addition, our solution can achieve a PSNR accuracy gain of up to 1dB using only 56% of the parameters.
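
HollowNeRF's ADMM pruner can be pictured with the classic L1-ADMM split, whose auxiliary update is a closed-form soft-threshold; the sketch below shows those updates under the assumption of a plain L1 sparsity term on the saliency mask, which simplifies the paper's actual constraint.

```python
import torch

def soft_threshold(x, kappa):
    return torch.sign(x) * torch.clamp(x.abs() - kappa, min=0.0)

def admm_sparsify_step(mask, z, u, rho=1e-2, l1_weight=1e-3):
    """One ADMM iteration pushing a 3D saliency mask toward sparsity.

    The mask itself is trained by the rendering loss plus the augmented
    term rho/2 * ||mask - z + u||^2 (added elsewhere in the loop); here
    we show the closed-form z (soft-threshold) and dual u updates.
    """
    z = soft_threshold(mask.detach() + u, l1_weight / rho)
    u = u + mask.detach() - z
    return z, u

# training loop sketch:
# loss = render_loss + (rho / 2) * ((mask - z + u) ** 2).sum()
# z, u = admm_sparsify_step(mask, z, u)
```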

PDL: Regularizing Multiple Instance Learning with Progressive Dropout Layers

  • paper_url: http://arxiv.org/abs/2308.10112
  • repo_url: https://github.com/chongqingnosubway/pdl
  • paper_authors: Wenhui Zhu, Peijie Qiu, Oana M. Dumitrascu, Yalin Wang
  • for: This work aims to improve the performance of multiple instance learning (MIL) models in weakly supervised settings.
  • methods: It proposes a novel Progressive Dropout Layer (PDL) to regularize MIL models, which not only combats overfitting but also helps MIL models uncover intricate and impactful feature representations.
  • results: Extensive evaluation on multiple MIL benchmark datasets shows that integrating the PDL into various MIL methods improves not only their classification performance but also their potential for weakly supervised feature localization.
    Abstract Multiple instance learning (MIL) is a weakly supervised learning approach that seeks to assign binary class labels to collections of instances known as bags. Due to their weak supervision, MIL methods are susceptible to overfitting and need assistance in developing comprehensive representations of target instances. While regularization typically combats overfitting effectively, its integration with the MIL model has been frequently overlooked in prior studies. Meanwhile, current regularization methods for MIL have shown limitations in their capacity to uncover a diverse array of representations. In this study, we delve into the realm of regularization within the MIL model, presenting a novel approach in the form of a Progressive Dropout Layer (PDL). We aim not only to address overfitting but also to empower the MIL model to uncover intricate and impactful feature representations. The proposed method is orthogonal to existing MIL methods and can be easily integrated into them to boost performance. Our extensive evaluation across a range of MIL benchmark datasets demonstrates that incorporating the PDL into multiple MIL methods not only elevates their classification performance but also augments their potential for weakly-supervised feature localization.
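
One plausible reading of a "progressive dropout layer" is a dropout rate that ramps up during training: weak regularization early so the MIL model can find features, strong regularization late to curb overfitting. The linear schedule below is an assumption, not the paper's exact rule.

```python
import torch.nn as nn

class ProgressiveDropout(nn.Module):
    """Dropout whose rate grows from 0 to max_p over warmup_steps."""

    def __init__(self, max_p=0.5, warmup_steps=10_000):
        super().__init__()
        self.max_p, self.warmup_steps, self.step = max_p, warmup_steps, 0

    def forward(self, x):
        if self.training:
            self.step += 1
            p = self.max_p * min(1.0, self.step / self.warmup_steps)
            return nn.functional.dropout(x, p=p, training=True)
        return x  # no dropout at inference time
```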

Controllable Multi-domain Semantic Artwork Synthesis

  • paper_url: http://arxiv.org/abs/2308.10111
  • repo_url: None
  • paper_authors: Yuantian Huang, Satoshi Iizuka, Edgar Simo-Serra, Kazuhiro Fukui
  • for: This paper presents a framework for multi-domain synthesis of artwork from semantic layouts, addressing the lack of publicly available segmentation datasets for art synthesis.
  • methods: It introduces ArtSem, a dataset of 40,000 artwork images from 4 different domains with corresponding semantic label maps, and a conditional GAN-based approach that generates high-quality artwork from semantic maps without paired training data; domain-dependent variational encoders enable high-quality multi-domain synthesis.
  • results: Experiments show that the model learns a joint representation of style and semantic information, yielding better generation quality; because the domains separate in the latent space, identifying the hyperplanes that separate them also allows fine-grained control of the synthesized artwork. Compared with previous methods, the model generates higher-quality artwork.
    Abstract We present a novel framework for multi-domain synthesis of artwork from semantic layouts. One of the main limitations of this challenging task is the lack of publicly available segmentation datasets for art synthesis. To address this problem, we propose a dataset, which we call ArtSem, that contains 40,000 images of artwork from 4 different domains with their corresponding semantic label maps. We generate the dataset by first extracting semantic maps from landscape photography and then propose a conditional Generative Adversarial Network (GAN)-based approach to generate high-quality artwork from the semantic maps without necessitating paired training data. Furthermore, we propose an artwork synthesis model that uses domain-dependent variational encoders for high-quality multi-domain synthesis. The model is improved and complemented with a simple but effective normalization method, based on normalizing both the semantic and style jointly, which we call Spatially STyle-Adaptive Normalization (SSTAN). In contrast to previous methods that only take semantic layout as input, our model is able to learn a joint representation of both style and semantic information, which leads to better generation quality for synthesizing artistic images. Results indicate that our model learns to separate the domains in the latent space, and thus, by identifying the hyperplanes that separate the different domains, we can also perform fine-grained control of the synthesized artwork. By combining our proposed dataset and approach, we are able to generate user-controllable artwork that is of higher quality than existing methods.
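
Normalizing "semantic and style jointly" is reminiscent of SPADE-style modulation; the sketch below predicts spatial gamma/beta maps from the semantic layout concatenated with a broadcast style code. Layer sizes and the exact wiring are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyStyleAdaptiveNorm(nn.Module):
    """Instance-normalized activations modulated by gamma/beta maps
    predicted jointly from semantics and style."""

    def __init__(self, channels, num_labels, style_dim, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.shared = nn.Conv2d(num_labels + style_dim, hidden, 3, padding=1)
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, seg_onehot, style):
        # x: (B,C,H,W); seg_onehot: (B,L,h,w) float; style: (B,S)
        seg = F.interpolate(seg_onehot, size=x.shape[2:], mode="nearest")
        sty = style[:, :, None, None].expand(-1, -1, *x.shape[2:])
        h = torch.relu(self.shared(torch.cat([seg, sty], dim=1)))
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)
```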

Root Pose Decomposition Towards Generic Non-rigid 3D Reconstruction with Monocular Videos

  • paper_url: http://arxiv.org/abs/2308.10089
  • repo_url: None
  • paper_authors: Yikai Wang, Yinpeng Dong, Fuchun Sun, Xiao Yang
  • for: This paper targets non-rigid 3D reconstruction of generic objects from monocular RGB video sequences.
  • methods: The method, Root Pose Decomposition (RPD), maintains a per-frame root pose transformation while building a dense field of local transformations that rectify the root pose; the local transformations are optimized via point registration to a canonical space.
  • results: RPD handles complicated scenarios with large deformations, complex motion patterns, occlusions, and scale diversity across individuals, and the pipeline potentially scales to diverse sets of objects in the wild. Experiments show that RPD surpasses state-of-the-art methods.
    Abstract This work focuses on the 3D reconstruction of non-rigid objects based on monocular RGB video sequences. Concretely, we aim at building high-fidelity models for generic object categories and casually captured scenes. To this end, we do not assume known root poses of objects, and do not utilize category-specific templates or dense pose priors. The key idea of our method, Root Pose Decomposition (RPD), is to maintain a per-frame root pose transformation, meanwhile building a dense field with local transformations to rectify the root pose. The optimization of local transformations is performed by point registration to the canonical space. We also adapt RPD to multi-object scenarios with object occlusions and individual differences. As a result, RPD allows non-rigid 3D reconstruction for complicated scenarios containing objects with large deformations, complex motion patterns, occlusions, and scale diversities of different individuals. Such a pipeline potentially scales to diverse sets of objects in the wild. We experimentally show that RPD surpasses state-of-the-art methods on the challenging DAVIS, OVIS, and AMA datasets.
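To make the decomposition concrete, here is a minimal sketch of a per-frame root pose rectified by local transformations, with the local transformations reduced to per-point offsets optimized by registration against observed points. The names and the offset simplification are assumptions; RPD itself builds a dense field of full local transformations.

```python
import torch

def apply_root_pose_decomposition(canonical_pts, R, t, local_delta):
    """Map canonical-space points into a frame via a per-frame root pose
    (R, t) rectified by per-point local offsets (a simplification).

    canonical_pts: (N, 3)   R: (3, 3)   t: (3,)   local_delta: (N, 3)
    """
    return (canonical_pts + local_delta) @ R.T + t

# Toy registration: optimize local offsets so posed points match observations.
N = 500
canon = torch.randn(N, 3)
R, t = torch.eye(3), torch.tensor([0.1, 0.0, 0.0])
observed = canon @ R.T + t + 0.05 * torch.randn(N, 3)

delta = torch.zeros(N, 3, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
for _ in range(200):
    loss = ((apply_root_pose_decomposition(canon, R, t, delta) - observed) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))  # registration residual shrinks as offsets absorb deformation
```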

MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance

  • paper_url: http://arxiv.org/abs/2308.10079
  • repo_url: None
  • paper_authors: Ernie Chu, Tzuhsuan Huang, Shuo-Yen Lin, Jun-Cheng Chen
  • for: This work proposes MeDM, an efficient and effective method that uses pre-trained image diffusion models for video-to-video translation while maintaining consistent temporal flow.
  • methods: Explicit optical flows are used to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores; with this coding, maintaining temporal consistency becomes an optimization problem with a closed-form solution.
  • results: Extensive qualitative, quantitative, and subjective experiments on various benchmarks demonstrate the effectiveness and superiority of the approach over prior methods.
    Abstract This study introduces an efficient and effective method, MeDM, that utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow. The proposed framework can render videos from scene position information, such as a normal G-buffer, or perform text-guided editing on videos captured in real-world scenarios. We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. By leveraging this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also suggest a workaround for modifying observed-space scores in latent-space Diffusion Models. Notably, MeDM does not require fine-tuning or test-time optimization of the Diffusion Models. Through extensive qualitative, quantitative, and subjective experiments on various benchmarks, the study demonstrates the effectiveness and superiority of the proposed approach.
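The mediation idea, independent frame-wise generations reconciled through optical flow, can be illustrated with a toy blending step: warp the previous frame toward the current one along the flow and mix it with the independently generated frame. This is a crude stand-in for the paper's closed-form solution; the `warp`/`mediate` helpers and the fixed blend weight are assumptions.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B,C,H,W): sample img at (pixel + flow)."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float()[None] + flow      # (B,2,H,W)
    grid[:, 0] = 2 * grid[:, 0] / (W - 1) - 1                     # normalize x
    grid[:, 1] = 2 * grid[:, 1] / (H - 1) - 1                     # normalize y
    return F.grid_sample(img, grid.permute(0, 2, 3, 1), align_corners=True)

def mediate(prev_frame, cur_frame, flow_cur_to_prev, weight=0.5):
    """Blend an independently generated frame with the previous frame pulled
    along the flow -- a crude stand-in for MeDM's closed-form mediation."""
    return weight * cur_frame + (1 - weight) * warp(prev_frame, flow_cur_to_prev)

prev, cur = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
out = mediate(prev, cur, torch.zeros(1, 2, 64, 64))   # identity flow
print(out.shape)  # torch.Size([1, 3, 64, 64])
```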

Sensitivity analysis of AI-based algorithms for autonomous driving on optical wavefront aberrations induced by the windshield

  • paper_url: http://arxiv.org/abs/2308.11711
  • repo_url: None
  • paper_authors: Dominik Werner Wolf, Markus Ulrich, Nikhil Kapoor
  • For: This work addresses the domain-shift problem in autonomous-driving perception by evaluating the sensitivity of two perception models to different windshield configurations.
  • Methods: A Fourier-optics-based threat model is applied to analyze the dependencies between neural-network benchmark metrics and optical merit functions.
  • Results: Windshield configurations introduce a performance gap, and the existing optical metrics used for posing requirements may not be sufficient.
    Abstract Autonomous driving perception techniques are typically based on supervised machine learning models that are trained on real-world street data. A typical training process involves capturing images with a single car model and windshield configuration. However, deploying these trained models on different car types can lead to a domain shift, which can potentially hurt the neural networks' performance and violate ADAS working requirements. To address this issue, this paper investigates the domain shift problem further by evaluating the sensitivity of two perception models to different windshield configurations. This is done by evaluating the dependencies between neural network benchmark metrics and optical merit functions, applying a Fourier optics based threat model. Our results show that windshields introduce a performance gap and that the existing optical metrics used for posing requirements might not be sufficient.

cs.AI - 2023-08-20

Towards Few-shot Coordination: Revisiting Ad-hoc Teamplay Challenge In the Game of Hanabi

  • paper_url: http://arxiv.org/abs/2308.10284
  • repo_url: None
  • paper_authors: Hadi Nekoei, Xutong Zhao, Janarthanan Rajendran, Miao Liu, Sarath Chandar
  • for: This work revisits the ad-hoc teamplay challenge in cooperative multi-agent reinforcement learning: beyond zero-shot coordination, agents should adapt to unseen partners with minimal interaction.
  • methods: The authors build an evaluation framework on the game of Hanabi with a diverse pool of pre-trained agents and a new metric, adaptation regret, and compare state-of-the-art ZSC algorithms against independent learners.
  • results: Naive Independent Q-Learning agents often adapt to new partners as quickly as the SOTA ZSC algorithm Off-Belief Learning, and hyper-parameters controlling training-data diversity and the optimization process strongly affect adaptability.
    Abstract Cooperative Multi-agent Reinforcement Learning (MARL) algorithms with Zero-Shot Coordination (ZSC) have gained significant attention in recent years. ZSC refers to the ability of agents to coordinate zero-shot (without additional interaction experience) with independently trained agents. While ZSC is crucial for cooperative MARL agents, it might not be possible for complex tasks and changing environments. Agents also need to adapt and improve their performance with minimal interaction with other agents. In this work, we show empirically that state-of-the-art ZSC algorithms have poor performance when paired with agents trained with different learning methods, and they require millions of interaction samples to adapt to these new partners. To investigate this issue, we formally defined a framework based on a popular cooperative multi-agent game called Hanabi to evaluate the adaptability of MARL methods. In particular, we created a diverse set of pre-trained agents and defined a new metric called adaptation regret that measures the agent's ability to efficiently adapt and improve its coordination performance when paired with some held-out pool of partners on top of its ZSC performance. After evaluating several SOTA algorithms using our framework, our experiments reveal that naive Independent Q-Learning (IQL) agents in most cases adapt as quickly as the SOTA ZSC algorithm Off-Belief Learning (OBL). This finding raises an interesting research question: How to design MARL algorithms with high ZSC performance and capability of fast adaptation to unseen partners. As a first step, we studied the role of different hyper-parameters and design choices on the adaptability of current MARL algorithms. Our experiments show that two categories of hyper-parameters controlling the training data diversity and optimization process have a significant impact on the adaptability of Hanabi agents.
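The exact definition of adaptation regret is given in the paper; one plausible illustrative form is the average gap between an adapting agent's per-episode score with a held-out partner and an oracle partner-specific score, so that faster adapters accumulate less regret:

```python
def adaptation_regret(scores, oracle_scores):
    """Illustrative reading of the metric: cumulative gap between an adapting
    agent's per-episode coordination score and an oracle score, normalized by
    episode count. The paper's formal definition may differ."""
    assert len(scores) == len(oracle_scores)
    return sum(o - s for s, o in zip(scores, oracle_scores)) / len(scores)

# An agent that adapts quickly accumulates less regret:
fast = [10, 18, 22, 24, 24]
slow = [10, 11, 13, 15, 18]
oracle = [25] * 5
print(adaptation_regret(fast, oracle))  # 5.4
print(adaptation_regret(slow, oracle))  # 11.6
```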

Enhancing Spatiotemporal Traffic Prediction through Urban Human Activity Analysis

  • paper_url: http://arxiv.org/abs/2308.10282
  • repo_url: https://github.com/suminhan/traffic-uagcrntf
  • paper_authors: Sumin Han, Youngjun Park, Minji Lee, Jisun An, Dongman Lee
  • For: Improve traffic prediction, a key element for the safety and convenience of citizens, by addressing the limitations of existing models.
  • Methods: Graph-convolution deep learning augmented with human activity frequency data from the National Household Travel Survey to strengthen the inference of causal relationships between activity and traffic patterns.
  • Results: With minimal modifications to conventional graph convolutional recurrent network and graph convolutional transformer architectures, the approach achieves state-of-the-art performance without excessive computational overhead.
    Abstract Traffic prediction is one of the key elements to ensure the safety and convenience of citizens. Existing traffic prediction models primarily focus on deep learning architectures to capture spatial and temporal correlation. They often overlook the underlying nature of traffic. Specifically, the sensor networks in most traffic datasets do not accurately represent the actual road network exploited by vehicles, failing to provide insights into the traffic patterns in urban activities. To overcome these limitations, we propose an improved traffic prediction method based on graph convolution deep learning algorithms. We leverage human activity frequency data from National Household Travel Survey to enhance the inference capability of a causal relationship between activity and traffic patterns. Despite making minimal modifications to the conventional graph convolutional recurrent networks and graph convolutional transformer architectures, our approach achieves state-of-the-art performance without introducing excessive computational overhead.

The DKU-DUKEECE System for the Manipulation Region Location Task of ADD 2023

  • paper_url: http://arxiv.org/abs/2308.10281
  • repo_url: None
  • paper_authors: Zexin Cai, Weiqing Wang, Yikang Wang, Ming Li
  • for: This work targets Track 2 of the Audio Deepfake Detection Challenge (ADD 2023), which focuses on locating manipulated regions in audio.
  • methods: Multiple detection systems are trained and fused: a frame-level boundary-detection system, a frame-level deepfake-detection system, and a VAE trained exclusively on genuine data that judges the authenticity of a given audio clip.
  • results: Fusing the three systems yields 82.23% sentence accuracy and an F1 score of 60.66%, for a final ADD score of 0.6713 and first place in Track 2.
    Abstract This paper introduces our system designed for Track 2, which focuses on locating manipulated regions, in the second Audio Deepfake Detection Challenge (ADD 2023). Our approach involves the utilization of multiple detection systems to identify splicing regions and determine their authenticity. Specifically, we train and integrate two frame-level systems: one for boundary detection and the other for deepfake detection. Additionally, we employ a third VAE model trained exclusively on genuine data to determine the authenticity of a given audio clip. Through the fusion of these three systems, our top-performing solution for the ADD challenge achieves an impressive 82.23% sentence accuracy and an F1 score of 60.66%. This results in a final ADD score of 0.6713, securing the first rank in Track 2 of ADD 2023.
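As an illustration of late fusion across the three subsystems, a toy weighted average of per-frame scores is sketched below; the weights, the score conventions, and the threshold are assumptions, not the DKU-DUKEECE recipe.

```python
import numpy as np

def fuse_frame_scores(boundary, deepfake, vae_authenticity, w=(0.4, 0.4, 0.2)):
    """Toy late fusion of three per-frame detector outputs in [0, 1]:
    boundary-ness, deepfake-ness, and (inverted) VAE authenticity."""
    scores = np.stack([boundary, deepfake, 1.0 - vae_authenticity])
    return np.average(scores, axis=0, weights=w)

T = 100  # frames in a clip
fused = fuse_frame_scores(np.random.rand(T), np.random.rand(T), np.random.rand(T))
manipulated_frames = fused > 0.5   # threshold to locate manipulated regions
```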

Learning Disentangled Representation with Mutual Information Maximization for Real-Time UAV Tracking

  • paper_url: http://arxiv.org/abs/2308.10262
  • repo_url: None
  • paper_authors: Xucheng Wang, Xiangyang Yang, Hengzhou Ye, Shuiwang Li
  • for: Improve both the efficiency and the precision of deep-learning-based UAV tracking under tight computational budgets, via model compression and disentangled representation.
  • methods: Disentangled representation learning with mutual information maximization (DR-MIM) separates features into identity-related and identity-unrelated parts; only the latter is used, strengthening the representation for the subsequent classification and regression tasks.
  • results: On four UAV benchmarks, the DR-MIM tracker significantly outperforms state-of-the-art UAV tracking methods.
    Abstract Efficiency has been a critical problem in UAV tracking due to limitations in computation resources, battery capacity, and unmanned aerial vehicle maximum load. Although discriminative correlation filters (DCF)-based trackers prevail in this field for their favorable efficiency, some recently proposed lightweight deep learning (DL)-based trackers using model compression demonstrated quite remarkable CPU efficiency as well as precision. Unfortunately, the model compression methods utilized by these works, though simple, are still unable to achieve satisfying tracking precision with higher compression rates. This paper aims to exploit disentangled representation learning with mutual information maximization (DR-MIM) to further improve DL-based trackers' precision and efficiency for UAV tracking. The proposed disentangled representation separates the feature into an identity-related and an identity-unrelated features. Only the latter is used, which enhances the effectiveness of the feature representation for subsequent classification and regression tasks. Extensive experiments on four UAV benchmarks, including UAV123@10fps, DTB70, UAVDT and VisDrone2018, show that our DR-MIM tracker significantly outperforms state-of-the-art UAV tracking methods.

Large Transformers are Better EEG Learners

  • paper_url: http://arxiv.org/abs/2308.11654
  • repo_url: None
  • paper_authors: Bingxin Wang, Xiaowen Fu, Yuan Lan, Luchan Zhang, Yang Xiang
  • for: This paper studies how large transformers pre-trained on images and text can be fine-tuned for electroencephalogram (EEG) prediction tasks.
  • methods: AdaCE, plug-and-play adapters that convert EEG data into image and text forms, is used to fine-tune pre-trained vision and language transformers directly.
  • results: AdaCE achieves state-of-the-art performance on diverse EEG prediction tasks; for example, on the UCI HAR human-activity-recognition task it reaches 99.6% with a pre-trained Swin-Transformer, an absolute improvement of 9.2%. Fine-tuning larger pre-trained models with AdaCE yields further gains, indicating the adapters' potential for even larger transformers.
    Abstract Pre-trained large transformer models have achieved remarkable performance in the fields of natural language processing and computer vision. Since the magnitude of available labeled electroencephalogram (EEG) data is much lower than that of text and image data, it is difficult for transformer models pre-trained from EEG to be developed as large as GPT-4 100T to fully unleash the potential of this architecture. In this paper, we show that transformers pre-trained from images as well as text can be directly fine-tuned for EEG-based prediction tasks. We design AdaCE, plug-and-play Adapters for Converting EEG data into image as well as text forms, to fine-tune pre-trained vision and language transformers. The proposed AdaCE module is highly effective for fine-tuning pre-trained transformers while achieving state-of-the-art performance on diverse EEG-based prediction tasks. For example, AdaCE on the pre-trained Swin-Transformer achieves 99.6%, an absolute improvement of 9.2%, on the EEG-decoding task of human activity recognition (UCI HAR). Furthermore, we empirically show that applying the proposed AdaCE to fine-tune larger pre-trained models can achieve better performance on EEG-based predicting tasks, indicating the potential of our adapters for even larger transformers. The plug-and-play AdaCE module can be applied to fine-tuning most of the popular pre-trained transformers on many other time-series data with multiple channels, not limited to EEG data and the models we use. Our code will be available at https://github.com/wangbxj1234/AdaCE.
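The core adapter idea, reshaping multi-channel EEG into the input a pre-trained vision transformer expects, can be sketched as below. The conv-plus-pool design, kernel size, and output resolution are assumptions for illustration; the actual AdaCE module is described in the paper.

```python
import torch
import torch.nn as nn

class EEGToImageAdapter(nn.Module):
    """Sketch in the spirit of AdaCE: map multi-channel EEG (B, channels, time)
    into the 3x224x224 tensor a pretrained vision transformer expects."""
    def __init__(self, channels, out_hw=224):
        super().__init__()
        self.out_hw = out_hw
        self.proj = nn.Conv1d(channels, 3 * out_hw, kernel_size=7, padding=3)

    def forward(self, eeg):                                    # (B, C, T)
        h = self.proj(eeg)                                     # (B, 3*H, T)
        h = nn.functional.adaptive_avg_pool1d(h, self.out_hw)  # (B, 3*H, W)
        return h.view(eeg.size(0), 3, self.out_hw, self.out_hw)

eeg = torch.randn(4, 9, 128)                 # e.g. 9-channel activity signals
img = EEGToImageAdapter(channels=9)(eeg)
print(img.shape)                             # torch.Size([4, 3, 224, 224])
# img can now be fed to a pre-trained vision backbone for fine-tuning.
```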

LMTuner: An user-friendly and highly-integrable Training Framework for fine-tuning Large Language Models

  • paper_url: http://arxiv.org/abs/2308.10252
  • repo_url: https://github.com/wengsyx/lmtuner
  • paper_authors: Yixuan Weng, Zhiqi Wang, Huanxuan Liao, Shizhu He, Shengping Liu, Kang Liu, Jun Zhao
  • for: This work aims to make efficient incremental training of large language models (LLMs) for specific domains and industries fast and accessible.
  • methods: LMTuner, a highly usable, integrable, and scalable system for training LLMs with minimal user input, comprising an Interaction, a Training, and an Inference module.
  • results: Even a novice user can start training an LLM within five minutes; LMTuner integrates DeepSpeed and supports efficient fine-tuning methods such as LoRA and QLoRA, enabling training of language models from 300M to 130B parameters on a single server.
    Abstract With the burgeoning development in the realm of large language models (LLMs), the demand for efficient incremental training tailored to specific industries and domains continues to increase. Currently, the predominantly employed frameworks lack modular design, it often takes a lot of coding work to kickstart the training of LLM. To address this, we present "LMTuner", a highly usable, integrable, and scalable system for training LLMs expeditiously and with minimal user-input. LMTuner comprises three main modules - the Interaction, Training, and Inference Modules. We advocate that LMTuner's usability and integrality alleviate the complexities in training large language models. Remarkably, even a novice user could commence training large language models within five minutes. Furthermore, it integrates DeepSpeed frameworks and supports Efficient Fine-Tuning methodologies like Low Rank Adaptation (LoRA), Quantized LoRA (QLoRA), etc., enabling the training of language models scaling from 300M to a whopping 130B parameters using a single server. The LMTuner's homepage (https://wengsyx.github.io/LMTuner/)and screencast video (https://youtu.be/nsXmWOmN3rE) are now publicly available.
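LMTuner supports LoRA-style efficient fine-tuning. For readers unfamiliar with the technique, here is a minimal standard LoRA linear layer (the generic formulation, not LMTuner's code): the pre-trained weight is frozen and a low-rank update B·A, scaled by alpha/r, is trained instead.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal Low-Rank Adaptation of a frozen linear layer."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 -- a tiny fraction of the 590k frozen weights
```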

Machine Learning-powered Combinatorial Clock Auction

  • paper_url: http://arxiv.org/abs/2308.10226
  • repo_url: https://github.com/marketdesignresearch/ml-cca
  • paper_authors: Ermis Soumalias, Jakob Weissteiner, Jakob Heiss, Sven Seuken
  • for: Improve the design of practical iterative combinatorial auctions (ICAs), where the bundle space grows exponentially in the number of items.
  • methods: Machine-learning-powered preference elicitation that relies only on demand queries, yielding an ML-powered combinatorial clock auction, together with a method for training ML models on demand queries and for selecting the demand query with the highest clearing potential.
  • results: In several spectrum auction domains the mechanism significantly outperforms the combinatorial clock auction (CCA), reaching higher efficiency in far fewer rounds and, with linear prices, exhibiting vastly higher clearing potential.
    Abstract We study the design of iterative combinatorial auctions (ICAs). The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, several papers have recently proposed machine learning (ML)-based preference elicitation algorithms that aim to elicit only the most important information from bidders. However, from a practical point of view, the main shortcoming of this prior work is that those designs elicit bidders' preferences via value queries (i.e., ``What is your value for the bundle $\{A,B\}$?''). In most real-world ICA domains, value queries are considered impractical, since they impose an unrealistically high cognitive burden on bidders, which is why they are not used in practice. In this paper, we address this shortcoming by designing an ML-powered combinatorial clock auction that elicits information from the bidders only via demand queries (i.e., ``At prices $p$, what is your most preferred bundle of items?''). We make two key technical contributions: First, we present a novel method for training an ML model on demand queries. Second, based on those trained ML models, we introduce an efficient method for determining the demand query with the highest clearing potential, for which we also provide a theoretical foundation. We experimentally evaluate our ML-based demand query mechanism in several spectrum auction domains and compare it against the most established real-world ICA: the combinatorial clock auction (CCA). Our mechanism significantly outperforms the CCA in terms of efficiency in all domains, it achieves higher efficiency in a significantly reduced number of rounds, and, using linear prices, it exhibits vastly higher clearing potential. Thus, with this paper we bridge the gap between research and practice and propose the first practical ML-powered ICA.
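To make the elicitation interface concrete, a demand query asks a bidder for the utility-maximizing bundle at given prices. The brute-force sketch below merely defines the query; real ICAs (and the paper's ML models) avoid enumerating the exponentially large bundle space.

```python
from itertools import chain, combinations

def demand_query(bidder_value, items, prices):
    """Answer 'at prices p, what is your most preferred bundle?' by brute
    force: argmax over bundles of value(bundle) minus the bundle's price."""
    bundles = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1))
    return max(bundles,
               key=lambda b: bidder_value(b) - sum(prices[i] for i in b))

# Toy valuation with complementarity between items A and B:
value = lambda b: {(): 0, ("A",): 5, ("B",): 4, ("A", "B"): 12}[tuple(sorted(b))]
print(demand_query(value, ["A", "B"], {"A": 3, "B": 3}))  # ('A', 'B'): 12-6=6
```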

ChatEDA: A Large Language Model Powered Autonomous Agent for EDA

  • paper_url: http://arxiv.org/abs/2308.10204
  • repo_url: None
  • paper_authors: Zhuolun He, Haoyuan Wu, Xinyun Zhang, Xufeng Yao, Su Zheng, Haisheng Zheng, Bei Yu
  • for: Streamline and automate the circuit-design workflow by letting a large language model orchestrate electronic design automation (EDA) tools.
  • methods: ChatEDA, an autonomous agent built on the AutoMage LLM with EDA tools as executors, handles task planning, script generation, and task execution from RTL to GDSII.
  • results: Comprehensive experiments show that ChatEDA handles diverse requirements well, and the fine-tuned AutoMage outperforms GPT-4 and other similar LLMs.
    Abstract The integration of a complex set of Electronic Design Automation (EDA) tools to enhance interoperability is a critical concern for circuit designers. Recent advancements in large language models (LLMs) have showcased their exceptional capabilities in natural language processing and comprehension, offering a novel approach to interfacing with EDA tools. This research paper introduces ChatEDA, an autonomous agent for EDA empowered by a large language model, AutoMage, complemented by EDA tools serving as executors. ChatEDA streamlines the design flow from the Register-Transfer Level (RTL) to the Graphic Data System Version II (GDSII) by effectively managing task planning, script generation, and task execution. Through comprehensive experimental evaluations, ChatEDA has demonstrated its proficiency in handling diverse requirements, and our fine-tuned AutoMage model has exhibited superior performance compared to GPT-4 and other similar LLMs.

Soft Decomposed Policy-Critic: Bridging the Gap for Effective Continuous Control with Discrete RL

  • paper_url: http://arxiv.org/abs/2308.10203
  • repo_url: None
  • paper_authors: Yechen Zhang, Jian Sun, Gang Wang, Zhuo Li, Wei Chen
  • For: Address the challenges of applying discrete reinforcement learning (RL) algorithms to continuous control problems, where dimensional explosion hinders their effectiveness.
  • Methods: The Soft Decomposed Policy-Critic (SDPC) architecture combines soft RL and actor-critic techniques with discrete RL methods: each action dimension is discretized independently and a shared critic network maximizes the soft Q-function. This supports two policy types: decomposed actors, yielding the Soft Decomposed Actor-Critic (SDAC) algorithm, and decomposed Q-networks that generate Boltzmann soft exploration policies, yielding the Soft Decomposed-Critic Q (SDCQ) algorithm.
  • Results: Extensive experiments show that SDPC outperforms state-of-the-art continuous RL algorithms on a variety of continuous control tasks, including Mujoco's Humanoid and Box2d's BipedalWalker.
    Abstract Discrete reinforcement learning (RL) algorithms have demonstrated exceptional performance in solving sequential decision tasks with discrete action spaces, such as Atari games. However, their effectiveness is hindered when applied to continuous control problems due to the challenge of dimensional explosion. In this paper, we present the Soft Decomposed Policy-Critic (SDPC) architecture, which combines soft RL and actor-critic techniques with discrete RL methods to overcome this limitation. SDPC discretizes each action dimension independently and employs a shared critic network to maximize the soft $Q$-function. This novel approach enables SDPC to support two types of policies: decomposed actors that lead to the Soft Decomposed Actor-Critic (SDAC) algorithm, and decomposed $Q$-networks that generate Boltzmann soft exploration policies, resulting in the Soft Decomposed-Critic Q (SDCQ) algorithm. Through extensive experiments, we demonstrate that our proposed approach outperforms state-of-the-art continuous RL algorithms in a variety of continuous control tasks, including Mujoco's Humanoid and Box2d's BipedalWalker. These empirical results validate the effectiveness of the SDPC architecture in addressing the challenges associated with continuous control.
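The decomposed-policy idea can be illustrated as follows: each action dimension gets its own discrete distribution over bins, and a continuous action is assembled from per-dimension Boltzmann samples. Shapes, the bin grid, and the temperature are assumptions for illustration.

```python
import torch

def decomposed_boltzmann_action(q_values_per_dim, bin_centers, temperature=1.0):
    """Sample a continuous action by treating each action dimension as an
    independent discrete distribution over bins -- the flavor of policy the
    SDCQ variant derives from decomposed Q-networks.

    q_values_per_dim: (dims, n_bins) soft-Q estimates per discretized bin.
    bin_centers:      (n_bins,) continuous value represented by each bin.
    """
    probs = torch.softmax(q_values_per_dim / temperature, dim=-1)  # (dims, bins)
    idx = torch.multinomial(probs, num_samples=1).squeeze(-1)      # (dims,)
    return bin_centers[idx]                                        # (dims,)

q = torch.randn(6, 11)                     # 6-dim action, 11 bins per dimension
bins = torch.linspace(-1.0, 1.0, 11)       # discretization of [-1, 1]
print(decomposed_boltzmann_action(q, bins))
```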

Deep Reinforcement Learning for Artificial Upwelling Energy Management

  • paper_url: http://arxiv.org/abs/2308.10199
  • repo_url: None
  • paper_authors: Yiyuan Zhang, Wei Fan
  • for: Improve the efficiency of artificial upwelling (AU) systems to better stimulate seaweed growth and enhance ocean carbon sequestration.
  • methods: A deep reinforcement learning (DRL) algorithm develops efficient air-injection scheduling strategies for operating the AU system.
  • results: In extensive simulations, the DRL policy reduces energy wastage while ensuring stable and efficient operation, outperforming rule-based approaches and other DRL algorithms.
    Abstract The potential of artificial upwelling (AU) as a means of lifting nutrient-rich bottom water to the surface, stimulating seaweed growth, and consequently enhancing ocean carbon sequestration, has been gaining increasing attention in recent years. This has led to the development of the first solar-powered and air-lifted AU system (AUS) in China. However, efficient scheduling of air injection systems remains a crucial challenge in operating AUS, as it holds the potential to significantly improve system efficiency. Conventional approaches based on rules or models are often impractical due to the complex and heterogeneous nature of the marine environment and its associated disturbances. To address this challenge, we propose a novel energy management approach that utilizes deep reinforcement learning (DRL) algorithm to develop efficient strategies for operating AUS. Through extensive simulations, we evaluate the performance of our algorithm and demonstrate its superior effectiveness over traditional rule-based approaches and other DRL algorithms in reducing energy wastage while ensuring the stable and efficient operation of AUS. Our findings suggest that a DRL-based approach offers a promising way for improving the efficiency of AUS and enhancing the sustainability of seaweed cultivation and carbon sequestration in the ocean.

Efficient Real-time Path Planning with Self-evolving Particle Swarm Optimization in Dynamic Scenarios

  • paper_url: http://arxiv.org/abs/2308.10169
  • repo_url: https://github.com/xinjinghao/real-time-path-planning-with-sepso
  • paper_authors: Jinghao Xin, Zhi Li, Yang Zhang, Ning Li
  • for: Improve the computational efficiency of Particle Swarm Optimization (PSO) and avoid premature convergence so that it can be applied to path planning in dynamic scenarios.
  • methods: A Tensor Operation Form (TOF) converts particle-wise manipulations into tensor operations; on top of it, Self-Evolving PSO (SEPSO) autonomously optimizes its own hyper-parameters through a Hierarchical Self-Evolving Framework, aided by a Priori Initialization mechanism and an Auto Truncation mechanism for real-time performance.
  • results: On four widely used benchmark optimization functions and a dynamic simulation environment, SEPSO generates superior paths with considerably better real-time performance (67 path-planning computations per second on a regular desktop computer) than alternative methods.
    Abstract Particle Swarm Optimization (PSO) has demonstrated efficacy in addressing static path planning problems. Nevertheless, such application on dynamic scenarios has been severely precluded by PSO's low computational efficiency and premature convergence downsides. To address these limitations, we proposed a Tensor Operation Form (TOF) that converts particle-wise manipulations to tensor operations, thereby enhancing computational efficiency. Harnessing the computational advantage of TOF, a variant of PSO, designated as Self-Evolving Particle Swarm Optimization (SEPSO) was developed. The SEPSO is underpinned by a novel Hierarchical Self-Evolving Framework (HSEF) that enables autonomous optimization of its own hyper-parameters to evade premature convergence. Additionally, a Priori Initialization (PI) mechanism and an Auto Truncation (AT) mechanism that substantially elevates the real-time performance of SEPSO on dynamic path planning problems were introduced. Comprehensive experiments on four widely used benchmark optimization functions have been initially conducted to corroborate the validity of SEPSO. Following this, a dynamic simulation environment that encompasses moving start/target points and dynamic/static obstacles was employed to assess the effectiveness of SEPSO on the dynamic path planning problem. Simulation results exhibit that the proposed SEPSO is capable of generating superior paths with considerably better real-time performance (67 path planning computations per second in a regular desktop computer) in contrast to alternative methods. The code of this paper can be accessed here.
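The TOF idea, expressing all particle-wise updates as array operations, is easy to show with a plain vectorized global-best PSO in NumPy. This sketch deliberately omits SEPSO's self-evolving hyper-parameters, Priori Initialization, and Auto Truncation.

```python
import numpy as np

def tensorized_pso(f, dim, n_particles=64, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Global-best PSO with every particle-wise update written as one array
    (tensor) operation -- illustrating the efficiency idea behind TOF."""
    rng = np.random.default_rng(0)
    x = rng.uniform(-5, 5, (n_particles, dim))
    v = np.zeros_like(x)
    pbest, pbest_val = x.copy(), f(x)            # f maps (N, dim) -> (N,)
    g = pbest[pbest_val.argmin()]
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)  # one tensor op
        x = x + v
        val = f(x)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], val[improved]
        g = pbest[pbest_val.argmin()]
    return g, pbest_val.min()

sphere = lambda x: (x ** 2).sum(axis=1)
print(tensorized_pso(sphere, dim=10))   # best position converges near the origin
```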

Rethinking Client Drift in Federated Learning: A Logit Perspective

  • paper_url: http://arxiv.org/abs/2308.10162
  • repo_url: None
  • paper_authors: Yunlu Yan, Chun-Mei Feng, Mang Ye, Wangmeng Zuo, Ping Li, Rick Siow Mong Goh, Lei Zhu, C. L. Philip Chen
  • for: This paper addresses the client-drift problem in federated learning (FL) under non-IID data and improves FL performance.
  • methods: FedCSD, a class prototype similarity distillation method within a federated framework that aligns local logits with refined global logits weighted by the similarity between local logits and the global prototype, plus an adaptive mask that filters out unreliable soft labels from the global model.
  • results: Extensive experiments show that FedCSD outperforms state-of-the-art federated learning methods in various heterogeneous settings and improves the quality of the global model.
    Abstract Federated Learning (FL) enables multiple clients to collaboratively learn in a distributed way, allowing for privacy protection. However, the real-world non-IID data will lead to client drift which degrades the performance of FL. Interestingly, we find that the difference in logits between the local and global models increases as the model is continuously updated, thus seriously deteriorating FL performance. This is mainly due to catastrophic forgetting caused by data heterogeneity between clients. To alleviate this problem, we propose a new algorithm, named FedCSD, a Class prototype Similarity Distillation in a federated framework to align the local and global models. FedCSD does not simply transfer global knowledge to local clients, as an undertrained global model cannot provide reliable knowledge, i.e., class similarity information, and its wrong soft labels will mislead the optimization of local models. Concretely, FedCSD introduces a class prototype similarity distillation to align the local logits with the refined global logits that are weighted by the similarity between local logits and the global prototype. To enhance the quality of global logits, FedCSD adopts an adaptive mask to filter out the terrible soft labels of the global models, thereby preventing them to mislead local optimization. Extensive experiments demonstrate the superiority of our method over the state-of-the-art federated learning approaches in various heterogeneous settings. The source code will be released.
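A rough sketch of the distillation term follows: global logits are distilled into the local model, with each sample weighted by the similarity between its local feature and the global prototype of its class, and unreliable global predictions masked out. The specific weighting rule, mask, and tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def prototype_similarity_distillation(local_logits, global_logits, local_feat,
                                      global_prototypes, labels, tau=2.0):
    """Per-sample distillation weighted by prototype similarity (sketch)."""
    proto = global_prototypes[labels]                       # (B, D)
    sim = F.cosine_similarity(local_feat, proto, dim=-1)    # (B,) in [-1, 1]
    weight = sim.clamp(min=0)                               # drop dissimilar samples
    mask = (global_logits.argmax(-1) == labels).float()     # adaptive-mask stand-in
    kd = F.kl_div(F.log_softmax(local_logits / tau, -1),
                  F.softmax(global_logits / tau, -1),
                  reduction="none").sum(-1)                 # (B,)
    return (weight * mask * kd).mean() * tau * tau

B, C, D = 8, 10, 32
loss = prototype_similarity_distillation(
    torch.randn(B, C), torch.randn(B, C), torch.randn(B, D),
    torch.randn(C, D), torch.randint(0, C, (B,)))
print(float(loss))
```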

SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation

  • paper_url: http://arxiv.org/abs/2308.10156
  • repo_url: None
  • paper_authors: Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Mengmeng Wang, Jingdong Wang
  • for: This paper improves the fine-grained controllability of text-to-image (T2I) generation via layout-to-image (L2I) generation, extracting richer spatial and semantic information from user-specified layouts to raise generation quality and controllability.
  • methods: A Spatial-Semantic Map Guided (SSMG) diffusion model that uses a feature map derived from the layout as guidance, together with Relation-Sensitive Attention (RSA) to model relationships among multiple objects and Location-Sensitive Attention (LSA) to heighten sensitivity to the spatial information in the guidance.
  • results: Extensive experiments show that SSMG achieves highly promising results, setting a new state of the art across metrics covering fidelity, diversity, and controllability.
    Abstract Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls. In contrast, Layout-to-Image (L2I) generation, aiming to generate realistic and complex scene images from user-specified layouts, has risen to prominence. However, existing methods transform layout information into tokens or RGB images for conditional control in the generative process, leading to insufficient spatial and semantic controllability of individual instances. To address these limitations, we propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance. Owing to rich spatial and semantic information encapsulated in well-designed feature maps, SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works. Additionally, we propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms. The former aims to model the relationships among multiple objects within scenes while the latter is designed to heighten the model's sensitivity to the spatial information embedded in the guidance. Extensive experiments demonstrate that SSMG achieves highly promising results, setting a new state-of-the-art across a range of metrics encompassing fidelity, diversity, and controllability.

Federated Pseudo Modality Generation for Incomplete Multi-Modal MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2308.10910
  • repo_url: None
  • paper_authors: Yunlu Yan, Chun-Mei Feng, Yuexiang Li, Rick Siow Mong Goh, Lei Zhu
  • for: addresses the missing modality challenge in federated multi-modal MRI reconstruction.
  • methods: utilizes a pseudo modality generation mechanism to recover the missing modality for each single-modal client by sharing the distribution information of the amplitude spectrum in frequency space, and introduces a clustering scheme to reduce communication costs.
  • results: can effectively complete the missing modality within an acceptable communication cost, attaining performance similar to the ideal scenario in which all clients have the full set of modalities.
    Abstract While multi-modal learning has been widely used for MRI reconstruction, it relies on paired multi-modal data which is difficult to acquire in real clinical scenarios. Especially in the federated setting, the common situation is that several medical institutions only have single-modal data, termed the modality missing issue. Therefore, it is infeasible to deploy a standard federated learning framework in such conditions. In this paper, we propose a novel communication-efficient federated learning framework, namely Fed-PMG, to address the missing modality challenge in federated multi-modal MRI reconstruction. Specifically, we utilize a pseudo modality generation mechanism to recover the missing modality for each single-modal client by sharing the distribution information of the amplitude spectrum in frequency space. However, the step of sharing the original amplitude spectrum leads to heavy communication costs. To reduce the communication cost, we introduce a clustering scheme to project the set of amplitude spectrum into finite cluster centroids, and share them among the clients. With such an elaborate design, our approach can effectively complete the missing modality within an acceptable communication cost. Extensive experiments demonstrate that our proposed method can attain similar performance with the ideal scenario, i.e., all clients have the full set of modalities. The source code will be released.
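The frequency-space trick can be sketched as follows: clients share k-means centroids of their images' FFT amplitude spectra rather than the spectra themselves, and a pseudo modality is synthesized by combining a received amplitude with the client's own phase. The value of k, the flattening scheme, and the recombination step are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def amplitude_centroids(images, k=8):
    """Share k cluster centroids of FFT amplitudes instead of all spectra,
    which is where Fed-PMG's communication saving comes from (sketch)."""
    amps = np.abs(np.fft.fft2(images, axes=(-2, -1)))       # (N, H, W)
    flat = amps.reshape(len(images), -1)
    return KMeans(n_clusters=k, n_init=10).fit(flat).cluster_centers_

def pseudo_modality(own_image, received_centroid):
    """Recombine this client's phase with a received amplitude centroid."""
    H, W = own_image.shape
    phase = np.angle(np.fft.fft2(own_image))
    amp = received_centroid.reshape(H, W)
    return np.real(np.fft.ifft2(amp * np.exp(1j * phase)))

imgs = np.random.rand(32, 64, 64)        # stand-in for one client's MRI slices
cents = amplitude_centroids(imgs)
fake_other_modality = pseudo_modality(imgs[0], cents[0])
```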

A Survey on Fairness in Large Language Models

  • paper_url: http://arxiv.org/abs/2308.10149
  • repo_url: None
  • paper_authors: Yingji Li, Mengnan Du, Rui Song, Xin Wang, Ying Wang
  • for: This survey reviews research on fairness in LLMs, covering evaluation metrics and debiasing methods for medium-scale LLMs as well as fairness research on large-scale LLMs.
  • methods: Evaluation metrics and debiasing methods are organized from the perspectives of intrinsic bias and extrinsic bias.
  • results: The survey summarizes fairness research for large-scale LLMs, including fairness evaluation, causes of bias, and debiasing methods, and discusses challenges and future directions for developing fairness in LLMs.
    Abstract Large language models (LLMs) have shown powerful performance and development prospect and are widely deployed in the real world. However, LLMs can capture social biases from unprocessed training data and propagate the biases to downstream tasks. Unfair LLM systems have undesirable social impacts and potential harms. In this paper, we provide a comprehensive review of related research on fairness in LLMs. First, for medium-scale LLMs, we introduce evaluation metrics and debiasing methods from the perspectives of intrinsic bias and extrinsic bias, respectively. Then, for large-scale LLMs, we introduce recent fairness research, including fairness evaluation, reasons for bias, and debiasing methods. Finally, we discuss and provide insight on the challenges and future directions for the development of fairness in LLMs.

ExpeL: LLM Agents Are Experiential Learners

  • paper_url: http://arxiv.org/abs/2308.10144
  • repo_url: https://github.com/Andrewzh112/ExpeL
  • paper_authors: Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang
  • for: This paper proposes an agent that learns from experience to improve decision-making, without requiring parametric updates to the underlying language model.
  • methods: The agent autonomously gathers experiences and extracts knowledge in natural language from a collection of training tasks; at inference, it recalls its extracted insights and past experiences to make informed decisions.
  • results: Experiments show that the proposed ExpeL agent exhibits robust learning efficacy and consistently improves as it accumulates experience; qualitative observations and additional experiments further explore its emergent capabilities and transfer-learning potential.
    Abstract The recent surge in research interest in applying large language models (LLMs) to decision-making tasks has flourished by leveraging the extensive world knowledge embedded in LLMs. While there is a growing demand to tailor LLMs for custom decision-making tasks, finetuning them for specific tasks is resource-intensive and may diminish the model's generalization capabilities. Moreover, state-of-the-art language models like GPT-4 and Claude are primarily accessible through API calls, with their parametric weights remaining proprietary and unavailable to the public. This scenario emphasizes the growing need for new methodologies that allow learning from agent experiences without requiring parametric updates. To address these problems, we introduce the Experiential Learning (ExpeL) agent. Our agent autonomously gathers experiences and extracts knowledge using natural language from a collection of training tasks. At inference, the agent recalls its extracted insights and past experiences to make informed decisions. Our empirical results highlight the robust learning efficacy of the ExpeL agent, indicating a consistent enhancement in its performance as it accumulates experiences. We further explore the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments.

A Review on Objective-Driven Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2308.10135
  • repo_url: None
  • paper_authors: Apoorv Singh
  • For: Address the limitations of current AI technologies in understanding context, nuances, and subtle cues in communication, and close the gap between human and machine intelligence.
  • Methods: The paper reviews prospective machine-intelligence candidates, including hierarchical planning-based approaches, energy-based and latent-variable methods, and joint embedding predictive architecture methods.
  • Results: The paper discusses how these methods can help machines better understand context, make logical inferences, and predict outcomes in various situations, ultimately bridging the gap between human and machine intelligence.
    Abstract While advancing rapidly, Artificial Intelligence still falls short of human intelligence in several key aspects due to inherent limitations in current AI technologies and our understanding of cognition. Humans have an innate ability to understand context, nuances, and subtle cues in communication, which allows us to comprehend jokes, sarcasm, and metaphors. Machines struggle to interpret such contextual information accurately. Humans possess a vast repository of common-sense knowledge that helps us make logical inferences and predictions about the world. Machines lack this innate understanding and often struggle with making sense of situations that humans find trivial. In this article, we review the prospective Machine Intelligence candidates, a review from Prof. Yann LeCun, and other work that can help close this gap between human and machine intelligence. Specifically, we talk about what's lacking with the current AI techniques such as supervised learning, reinforcement learning, self-supervised learning, etc. Then we show how Hierarchical planning-based approaches can help us close that gap and deep-dive into energy-based, latent-variable methods and Joint embedding predictive architecture methods.

TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective

  • paper_url: http://arxiv.org/abs/2308.10133
  • repo_url: https://github.com/danjun6737/transface
  • paper_authors: Jun Dan, Yang Liu, Haoyu Xie, Jiankang Deng, Haoran Xie, Xuansong Xie, Baigui Sun
  • for: This paper targets face recognition (FR), improving the performance of ViT-based FR models.
  • methods: TransFace combines a patch-level data augmentation strategy, DPAP, which randomly perturbs the amplitude information of dominant patches to expand sample diversity and alleviate overfitting in ViTs, with a hard sample mining strategy, EHSM, which uses the information entropy of local tokens to dynamically reweight easy and hard samples during training for more stable prediction.
  • results: Experiments on several benchmarks demonstrate that TransFace outperforms prior FR models.
    Abstract Vision Transformers (ViTs) have demonstrated powerful representation ability in various visual tasks thanks to their intrinsic data-hungry nature. However, we unexpectedly find that ViTs perform vulnerably when applied to face recognition (FR) scenarios with extremely large datasets. We investigate the reasons for this phenomenon and discover that the existing data augmentation approach and hard sample mining strategy are incompatible with ViTs-based FR backbone due to the lack of tailored consideration on preserving face structural information and leveraging each local token information. To remedy these problems, this paper proposes a superior FR model called TransFace, which employs a patch-level data augmentation strategy named DPAP and a hard sample mining strategy named EHSM. Specially, DPAP randomly perturbs the amplitude information of dominant patches to expand sample diversity, which effectively alleviates the overfitting problem in ViTs. EHSM utilizes the information entropy in the local tokens to dynamically adjust the importance weight of easy and hard samples during training, leading to a more stable prediction. Experiments on several benchmarks demonstrate the superiority of our TransFace. Code and models are available at https://github.com/DanJun6737/TransFace.
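A DPAP-flavored augmentation can be sketched by jittering the FFT amplitude of image patches while keeping the phase, which carries facial structure, intact. Perturbing every patch, as below, is a simplification; DPAP targets dominant patches.

```python
import numpy as np

def perturb_patch_amplitude(image, patch=16, sigma=0.1, rng=None):
    """Jitter per-patch FFT amplitudes while preserving phase (sketch)."""
    rng = rng or np.random.default_rng()
    out = image.copy()
    H, W = image.shape[:2]
    for y in range(0, H - H % patch, patch):
        for x in range(0, W - W % patch, patch):
            p = np.fft.fft2(image[y:y+patch, x:x+patch], axes=(0, 1))
            amp = np.abs(p) * (1 + sigma * rng.standard_normal(p.shape))
            out[y:y+patch, x:x+patch] = np.real(
                np.fft.ifft2(amp * np.exp(1j * np.angle(p)), axes=(0, 1)))
    return out

face = np.random.rand(112, 112)       # typical FR input resolution
augmented = perturb_patch_amplitude(face)
```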

3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2308.10123
  • repo_url: https://github.com/edz-o/3dnbf
  • paper_authors: Yi Zhang, Pengliang Ji, Angtian Wang, Jieru Mei, Adam Kortylewski, Alan Yuille
  • for: 3D human pose estimation with occlusion robustness
  • methods: 3D-aware Neural Body Fitting (3DNBF) with generative model of deep features and contrastive learning
  • results: outperforms other approaches on both occluded and standard benchmarks
    Abstract Regression-based methods for 3D human pose estimation directly predict the 3D pose parameters from a 2D image using deep networks. While achieving state-of-the-art performance on standard benchmarks, their performance degrades under occlusion. In contrast, optimization-based methods fit a parametric body model to 2D features in an iterative manner. The localized reconstruction loss can potentially make them robust to occlusion, but they suffer from the 2D-3D ambiguity. Motivated by the recent success of generative models in rigid object pose estimation, we propose 3D-aware Neural Body Fitting (3DNBF) - an approximate analysis-by-synthesis approach to 3D human pose estimation with SOTA performance and occlusion robustness. In particular, we propose a generative model of deep features based on a volumetric human representation with Gaussian ellipsoidal kernels emitting 3D pose-dependent feature vectors. The neural features are trained with contrastive learning to become 3D-aware and hence to overcome the 2D-3D ambiguity. Experiments show that 3DNBF outperforms other approaches on both occluded and standard benchmarks. Code is available at https://github.com/edz-o/3DNBF

Robust Mixture-of-Expert Training for Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2308.10110
  • repo_url: https://github.com/optml-group/robust-moe-cnn
  • paper_authors: Yihua Zhang, Ruisi Cai, Tianlong Chen, Guanhua Zhang, Huan Zhang, Pin-Yu Chen, Shiyu Chang, Zhangyang Wang, Sijia Liu
  • for: This paper investigates how to make CNN models based on Mixture of Experts (MoE) adversarially robust, and whether they can be robustly trained like ordinary CNNs.
  • methods: The robustness of an MoE-CNN is dissected into two dimensions, the robustness of the routers and the robustness of the experts, and a router-expert alternating adversarial training framework, AdvMoE, is proposed.
  • results: Across 4 commonly used CNN architectures and 4 benchmark datasets, AdvMoE improves adversarial robustness by 1% ~ 4% over the original dense CNN while retaining the efficiency of sparsity-gated MoE, reducing inference cost by more than 50%.
    Abstract Sparsely-gated Mixture of Expert (MoE), an emerging deep model architecture, has demonstrated a great promise to enable high-accuracy and ultra-efficient model inference. Despite the growing popularity of MoE, little work investigated its potential to advance convolutional neural networks (CNNs), especially in the plane of adversarial robustness. Since the lack of robustness has become one of the main hurdles for CNNs, in this paper we ask: How to adversarially robustify a CNN-based MoE model? Can we robustly train it like an ordinary CNN model? Our pilot study shows that the conventional adversarial training (AT) mechanism (developed for vanilla CNNs) no longer remains effective to robustify an MoE-CNN. To better understand this phenomenon, we dissect the robustness of an MoE-CNN into two dimensions: Robustness of routers (i.e., gating functions to select data-specific experts) and robustness of experts (i.e., the router-guided pathways defined by the subnetworks of the backbone CNN). Our analyses show that routers and experts are hard to adapt to each other in the vanilla AT. Thus, we propose a new router-expert alternating Adversarial training framework for MoE, termed AdvMoE. The effectiveness of our proposal is justified across 4 commonly-used CNN model architectures over 4 benchmark datasets. We find that AdvMoE achieves 1% ~ 4% adversarial robustness improvement over the original dense CNN, and enjoys the efficiency merit of sparsity-gated MoE, leading to more than 50% inference cost reduction. Codes are available at https://github.com/OPTML-Group/Robust-MoE-CNN.
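The alternating scheme can be sketched as an adversarial-training loop that freezes the experts while the router updates on adversarial examples, then swaps. The FGSM attack, the per-batch alternation, and the parameter split are assumptions standing in for the paper's procedure.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One-step attack used here to craft adversarial training examples."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def advmoe_epoch(model, router_params, expert_params, loader, opt_r, opt_e):
    """Alternate updates: freeze experts while the router trains on
    adversarial data, then swap (a sketch of the AdvMoE idea)."""
    for step, (x, y) in enumerate(loader):
        train_router = step % 2 == 0
        for p in router_params:
            p.requires_grad_(train_router)
        for p in expert_params:
            p.requires_grad_(not train_router)
        x_adv = fgsm(model, x, y)
        loss = F.cross_entropy(model(x_adv), y)
        opt = opt_r if train_router else opt_e
        opt.zero_grad(); loss.backward(); opt.step()
```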

ASPIRE: Language-Guided Augmentation for Robust Image Classification

  • paper_url: http://arxiv.org/abs/2308.10103
  • repo_url: None
  • paper_authors: Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Sakshi Singh, Sanjoy Chowdhury, Dinesh Manocha
  • for: This work improves the robustness of neural image classifiers in atypical real-world scenarios by reducing their reliance on spuriously correlated, non-predictive features.
  • methods: ASPIRE, a simple yet effective language-guided data augmentation pipeline: LLMs first extract foreground and background features from textual descriptions of an image, language-guided image editing then discovers the features spuriously correlated with the class label, and a personalized text-to-image model generates diverse in-domain images without those spurious features.
  • results: On 4 datasets, including the very challenging Hard ImageNet, and against 9 baselines, ASPIRE improves the classification accuracy of prior methods by 1% - 38%.
    Abstract Neural image classifiers can often learn to make predictions by overly relying on non-predictive features that are spuriously correlated with the class labels in the training data. This leads to poor performance in real-world atypical scenarios where such features are absent. Supplementing the training dataset with images without such spurious features can aid robust learning against spurious correlations via better generalization. This paper presents ASPIRE (Language-guided data Augmentation for SPurIous correlation REmoval), a simple yet effective solution for expanding the training dataset with synthetic images without spurious features. ASPIRE, guided by language, generates these images without requiring any form of additional supervision or existing examples. Precisely, we employ LLMs to first extract foreground and background features from textual descriptions of an image, followed by advanced language-guided image editing to discover the features that are spuriously correlated with the class label. Finally, we personalize a text-to-image generation model to generate diverse in-domain images without spurious features. We demonstrate the effectiveness of ASPIRE on 4 datasets, including the very challenging Hard ImageNet dataset, and 9 baselines and show that ASPIRE improves the classification accuracy of prior methods by 1% - 38%. Code soon at: https://github.com/Sreyan88/ASPIRE.

Open, Closed, or Small Language Models for Text Classification?

  • paper_url: http://arxiv.org/abs/2308.10092
  • repo_url: None
  • paper_authors: Hao Yu, Zachary Yang, Kellin Pelrine, Jean Francois Godbout, Reihaneh Rabbany
  • for: This study evaluates three classes of models on three distinct tasks: named entity recognition, political party prediction, and misinformation detection.
  • methods: Eight datasets are used to compare open-source models, closed-source models, and supervised smaller models.
  • results: Larger LLMs often lead, but fine-tuned open-source models can rival their closed-source counterparts, and supervised smaller models such as RoBERTa achieve similar or even greater performance on many datasets than generative LLMs; closed models retain an advantage on hard tasks demanding the most generalizability.
    Abstract Recent advancements in large language models have demonstrated remarkable capabilities across various NLP tasks. But many questions remain, including whether open-source models match closed ones, why these models excel or struggle with certain tasks, and what types of practical procedures can improve performance. We address these questions in the context of classification by evaluating three classes of models using eight datasets across three distinct tasks: named entity recognition, political party prediction, and misinformation detection. While larger LLMs often lead to improved performance, open-source models can rival their closed-source counterparts by fine-tuning. Moreover, supervised smaller models, like RoBERTa, can achieve similar or even greater performance in many datasets compared to generative LLMs. On the other hand, closed models maintain an advantage in hard tasks that demand the most generalizability. This study underscores the importance of model selection based on task requirements.
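    To make the "small supervised model" arm of such a comparison concrete, here is a minimal fine-tuning sketch using Hugging Face transformers. The checkpoint, stand-in dataset (IMDB), label count, and hyperparameters are illustrative choices, not the paper's actual setup across its eight datasets.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)  # e.g., misinformation vs. not

dataset = load_dataset("imdb")  # stand-in corpus for illustration

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out",
                           per_device_train_batch_size=16,
                           num_train_epochs=3,
                           evaluation_strategy="epoch"),
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
    tokenizer=tokenizer,  # enables dynamic padding during collation
)
trainer.train()
```

    The comparison point is that a few epochs of task-specific fine-tuning like this can match or beat zero-shot and few-shot prompting of much larger generative LLMs on many classification datasets.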

GNNPipe: Accelerating Distributed Full-Graph GNN Training with Pipelined Model Parallelism

  • paper_url: http://arxiv.org/abs/2308.10087
  • repo_url: None
  • paper_authors: Jingji Chen, Zhuoming Chen, Xuehai Qian
  • for: This work targets the efficiency of distributed full-graph GNN training.
  • methods: The authors propose a new training method named GNNPipe that adopts model parallelism instead of graph parallelism and therefore has a lower worst-case asymptotic communication complexity; a chunk-based pipelined training scheme keeps GPU utilization high (a toy schedule simulation follows the abstract below).
  • results: The method reduces per-epoch training time by up to 2.45x (2.03x on average) and cuts communication volume and overhead by up to 22.51x and 27.21x (10.27x and 14.96x on average), respectively, while matching graph parallelism in model accuracy and convergence speed.
    Abstract Current distributed full-graph GNN training methods adopt a variant of data parallelism, namely graph parallelism, in which the whole graph is divided into multiple partitions (subgraphs) and each GPU processes one of them. This incurs high communication overhead because of the inter-partition message passing at each layer. To this end, we propose a new training method named GNNPipe that adopts model parallelism instead, which has a lower worst-case asymptotic communication complexity than graph parallelism. To ensure high GPU utilization, we combine model parallelism with a chunk-based pipelined training method, in which each GPU processes a different chunk of graph data at different layers concurrently. We further propose hybrid parallelism that combines model and graph parallelism when the model-level parallelism is insufficient. We also introduce several tricks to preserve convergence speed and model accuracy under the embedding staleness introduced by pipelining. Extensive experiments show that our method reduces the per-epoch training time by up to 2.45x (on average 2.03x) and reduces the communication volume and overhead by up to 22.51x and 27.21x (on average 10.27x and 14.96x), respectively, while achieving a comparable level of model accuracy and convergence speed compared to graph parallelism.
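    The utilization argument behind chunk-based pipelining can be seen in a toy schedule. The simulation below assigns one GNN layer per "GPU" under model parallelism and streams feature chunks through the layer pipeline; it models the schedule only, with no real communication, weights, or staleness handling.

```python
NUM_LAYERS = 3   # one layer per device under model parallelism
NUM_CHUNKS = 4   # the graph's node features split into chunks

def pipeline_schedule(num_layers: int, num_chunks: int):
    """Yield (time_step, layer, chunk) triples for a simple forward
    pipeline: chunk c reaches layer l at step l + c, so different
    chunks occupy different layers concurrently."""
    for step in range(num_layers + num_chunks - 1):
        for layer in range(num_layers):
            chunk = step - layer
            if 0 <= chunk < num_chunks:
                yield step, layer, chunk

if __name__ == "__main__":
    for step, layer, chunk in pipeline_schedule(NUM_LAYERS, NUM_CHUNKS):
        print(f"t={step}: GPU/layer {layer} processes chunk {chunk}")
    # Total steps = L + C - 1 rather than L * C for sequential execution,
    # which is the utilization benefit pipelining buys over plain
    # model parallelism.
```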

Contrastive Learning for Non-Local Graphs with Multi-Resolution Structural Views

  • paper_url: http://arxiv.org/abs/2308.10077
  • repo_url: None
  • paper_authors: Asif Khan, Amos Storkey
  • for: This work aims to learn node-level representations of heterophilic graphs that capture structural similarity, benefiting applications such as fraudster detection and protein function prediction.
  • methods: The authors propose a novel multiview contrastive learning approach that applies diffusion filters on graphs; treating multiple graph views as augmentations captures structural equivalence and reveals hidden relationships and similarities not apparent in conventional node representations (a numpy sketch of the multi-view idea follows the abstract below).
  • results: On synthetic and real structural datasets, the method surpasses the best baseline by 16.06% on Cornell, 3.27% on Texas, and 8.04% on Wisconsin, and it consistently performs better on proximal tasks, showing that it captures structural information useful for downstream applications.
    Abstract Learning node-level representations of heterophilic graphs is crucial for various applications, including fraudster detection and protein function prediction. In such graphs, nodes share structural similarity identified by the equivalence of their connectivity which is implicitly encoded in the form of higher-order hierarchical information in the graphs. The contrastive methods are popular choices for learning the representation of nodes in a graph. However, existing contrastive methods struggle to capture higher-order graph structures. To address this limitation, we propose a novel multiview contrastive learning approach that integrates diffusion filters on graphs. By incorporating multiple graph views as augmentations, our method captures the structural equivalence in heterophilic graphs, enabling the discovery of hidden relationships and similarities not apparent in traditional node representations. Our approach outperforms baselines on synthetic and real structural datasets, surpassing the best baseline by $16.06\%$ on Cornell, $3.27\%$ on Texas, and $8.04\%$ on Wisconsin. Additionally, it consistently achieves superior performance on proximal tasks, demonstrating its effectiveness in uncovering structural information and improving downstream applications.
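    To illustrate the multi-resolution-view idea, here is a small numpy sketch: powers of the symmetrically normalized adjacency act as diffusion filters at increasing scales, and an InfoNCE-style loss treats the same node across two views as a positive pair. The paper's concrete filters, encoder, and loss differ; this only conveys the mechanism.

```python
import numpy as np

def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def diffusion_views(A, X, scales=(1, 2, 4)):
    """Node features smoothed by A_hat^t for each scale t: larger t
    aggregates structure from a wider neighborhood."""
    A_hat = normalized_adjacency(A)
    return [np.linalg.matrix_power(A_hat, t) @ X for t in scales]

def infonce(Z1, Z2, tau=0.5):
    """Same node across two views = positive pair; other nodes = negatives."""
    Z1 = Z1 / (np.linalg.norm(Z1, axis=1, keepdims=True) + 1e-12)
    Z2 = Z2 / (np.linalg.norm(Z2, axis=1, keepdims=True) + 1e-12)
    logits = Z1 @ Z2.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = (rng.random((8, 8)) < 0.3).astype(float)
    A = np.triu(A, 1); A = A + A.T      # symmetric, no self-loops
    X = rng.normal(size=(8, 4))
    v1, v2, _ = diffusion_views(A, X)
    print("InfoNCE between two diffusion views:", infonce(v1, v2))
```

    Because diffusion aggregates over structure rather than immediate neighbors, nodes that are structurally equivalent but far apart can end up close in the learned space, which is what matters for heterophilic graphs.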

UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

  • paper_url: http://arxiv.org/abs/2308.11592
  • repo_url: None
  • paper_authors: Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, Can Huang
  • for: This paper proposes a new large multimodal model that simultaneously detects, recognizes, spots, and understands text, improving text comprehension performance.
  • methods: UniDoc adds the text detection and recognition capabilities that existing approaches lack, leverages the representation power and world knowledge of large pre-trained models, and exploits the beneficial interactions among tasks to improve each individual task; it is trained with unified multimodal instruction tuning on large-scale instruction-following data (an illustrative data-formatting sketch follows the abstract below).
  • results: Experiments show that UniDoc sets state-of-the-art scores across multiple challenging benchmarks. To the authors' knowledge, it is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.
    Abstract In the era of Large Language Models (LLMs), tremendous strides have been made in the field of multimodal understanding. However, existing advanced algorithms fall short of effectively utilizing the immense representation capabilities and rich world knowledge inherent to these large pre-trained models, and the beneficial connections among tasks within the context of text-rich scenarios have not been sufficiently explored. In this work, we introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities, which are deficient in existing approaches. Moreover, UniDoc capitalizes on the beneficial interactions among tasks to enhance the performance of each individual task. To implement UniDoc, we perform unified multimodal instruction tuning on the contributed large-scale instruction-following datasets. Quantitative and qualitative experimental results show that UniDoc sets state-of-the-art scores across multiple challenging benchmarks. To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.
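    As a rough illustration of what unified instruction-tuning data covering detection, recognition, and spotting might look like, the sketch below converts one image's OCR annotations into instruction-response pairs. The prompt templates, field names, and JSON layout are illustrative guesses, not UniDoc's actual data format.

```python
import json

def build_instruction_samples(image_path, words):
    """`words` is a list of (text, (x1, y1, x2, y2)) OCR annotations;
    each task reuses the same annotations with a different instruction."""
    detection = {
        "image": image_path,
        "instruction": "Detect all text regions and output their boxes.",
        "response": json.dumps([box for _, box in words]),
    }
    recognition = {
        "image": image_path,
        "instruction": "Read all the text in this image.",
        "response": " ".join(text for text, _ in words),
    }
    spotting = {
        "image": image_path,
        "instruction": "Spot the text: output each word with its box.",
        "response": json.dumps([{"text": t, "box": b} for t, b in words]),
    }
    return [detection, recognition, spotting]

if __name__ == "__main__":
    samples = build_instruction_samples(
        "receipt.png",
        [("TOTAL", (40, 300, 120, 320)), ("$9.99", (140, 300, 210, 320))],
    )
    print(json.dumps(samples, indent=2))
```

    Framing all three tasks as instruction-response pairs over the same images is what lets a single decoder share representations across them, which is where the reported cross-task gains would come from.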