paper_authors: George Boateng, Jonathan Abrefah Mensah, Kevin Takyi Yeboah, William Edor, Andrew Kojo Mensah-Onumah, Naafi Dasana Ibrahim, Nana Sam Yeboah
for: The paper is written to explore the possibility of using AI to compete in Ghana’s National Science and Maths Quiz (NSMQ) and to describe the progress made so far in the NSMQ AI project.
methods: The paper uses open-source AI technology to build an AI system that can compete in the NSMQ, with a focus on speech-to-text, text-to-speech, question-answering, and human-computer interaction.
results: The paper describes the progress made thus far in the NSMQ AI project, including the development of an AI system that can compete in the NSMQ and the potential real-world impact of such a system on education in Africa.
Abstract
Can an AI win Ghana's National Science and Maths Quiz (NSMQ)? That is the question we seek to answer in the NSMQ AI project, an open-source project that is building AI to compete live in the NSMQ and win. The NSMQ is an annual live science and mathematics competition for senior secondary school students in Ghana in which 3 teams of 2 students compete by answering questions across biology, chemistry, physics, and math in 5 rounds over 5 progressive stages until a winning team is crowned for that year. The NSMQ is an exciting live quiz competition with interesting technical challenges across speech-to-text, text-to-speech, question-answering, and human-computer interaction. In this ongoing work that began in January 2023, we give an overview of the project, describe each of the teams, progress made thus far, and the next steps toward our planned launch and debut of the AI in October for NSMQ 2023. An AI that conquers this grand challenge can have real-world impact on education such as enabling millions of students across Africa to have one-on-one learning support from this AI.
Auditory Attention Decoding with Task-Related Multi-View Contrastive Learning
results: Tests on two popular AAD datasets demonstrate the superiority of the method in comparison with the existing state-of-the-art method.
Abstract
The human brain can easily focus on one speaker and suppress others in scenarios such as a cocktail party. Recently, researchers found that auditory attention can be decoded from electroencephalogram (EEG) data. However, most existing deep learning methods struggle to use prior knowledge of the different views (the attended speech and the EEG are task-related views) and therefore extract unsatisfactory representations. Inspired by Broadbent's filter model, we decode auditory attention in a multi-view paradigm and extract the most relevant and important information utilizing the missing view. Specifically, we propose an auditory attention decoding (AAD) method based on multi-view VAE with task-related multi-view contrastive (TMC) learning. Employing TMC learning in multi-view VAE can utilize the missing view to accumulate prior knowledge of different views into the fusion of representation, and extract the approximate task-related representation. We examine our method on two popular AAD datasets, and demonstrate the superiority of our method by comparing it to the state-of-the-art method.
Evil Operation: Breaking Speaker Recognition with PaddingBack
paper_authors: Zhe Ye, Diqun Yan, Li Dong, Kailai Shen
for: The paper proposes a novel backdoor attack method that can bypass speaker recognition systems while remaining undetectable to human ears.
methods: The proposed method, called PaddingBack, exploits the widely used speech signal operation of padding to make poisoned samples indistinguishable from clean ones.
results: Experiments show that PaddingBack achieves a high attack success rate while maintaining high benign accuracy, and that it resists defense methods while remaining stealthy to human perception.
Abstract
Machine Learning as a Service (MLaaS) has gained popularity due to advancements in machine learning. However, untrusted third-party platforms have raised concerns about AI security, particularly in backdoor attacks. Recent research has shown that speech backdoors can utilize transformations as triggers, similar to image backdoors. However, human ears easily detect these transformations, leading to suspicion. In this paper, we introduce PaddingBack, an inaudible backdoor attack that utilizes malicious operations to make poisoned samples indistinguishable from clean ones. Instead of using external perturbations as triggers, we exploit the widely used speech signal operation, padding, to break speaker recognition systems. Our experimental results demonstrate the effectiveness of the proposed approach, achieving a significantly high attack success rate while maintaining a high rate of benign accuracy. Furthermore, PaddingBack demonstrates the ability to resist defense methods while maintaining its stealthiness against human perception. The results of the stealthiness experiment have been made available at https://nbufabio25.github.io/paddingback/.
MSAC: Multiple Speech Attribute Control Method for Speech Emotion Recognition
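results: Extensive experiments in both single-corpus and cross-corpus SER scenarios show that the proposed SER workflow consistently outperforms the baseline in recognition, generalization, and reliability. In single-corpus SER, the workflow achieves a WAR of 72.97% and a UAR of 71.76% on the IEMOCAP corpus.
Abstract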
Despite significant progress, speech emotion recognition (SER) remains challenging due to the inherent complexity and ambiguity of the emotion attribute, particularly in the wild. Whereas current studies primarily focus on recognition and generalization capabilities, this work pioneers an exploration into the reliability of SER methods and investigates how to model the speech emotion from the aspect of data distribution across various speech attributes. Specifically, we first build a novel CNN-based SER model which adopts additive margin softmax loss to expand the distance between features of different classes, thereby enhancing their discrimination. Second, a novel multiple speech attribute control method MSAC is proposed to explicitly control speech attributes, enabling the model to be less affected by emotion-agnostic attributes and capture more fine-grained emotion-related features. Third, we make a first attempt to test and analyze the reliability of the proposed SER workflow using the out-of-distribution detection method. Extensive experiments on both single and cross-corpus SER scenarios show that our proposed unified SER workflow consistently outperforms the baseline in terms of recognition, generalization, and reliability performance. Besides, in single-corpus SER, the proposed SER workflow achieves superior recognition results with a WAR of 72.97% and a UAR of 71.76% on the IEMOCAP corpus.
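As a rough illustration of the additive margin softmax loss mentioned in the abstract, the sketch below computes the loss for a single sample from its cosine similarities to each class. The function name, the margin of 0.35, and the scale of 30 are assumptions chosen for illustration; the abstract does not state the paper's settings.

```python
import math

def am_softmax_loss(cosines, target, margin=0.35, scale=30.0):
    """Additive margin softmax loss for one sample (illustrative sketch).

    cosines: cosine similarity between the feature and each class weight.
    target:  index of the true class.
    margin, scale: hypothetical hyperparameters, not taken from the paper.
    """
    # Subtract the margin from the target-class cosine only, then scale.
    logits = [scale * (c - margin if i == target else c)
              for i, c in enumerate(cosines)]
    # Numerically stable log-sum-exp over all classes.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    # Cross-entropy of the true class under the margin-adjusted logits.
    return log_sum - logits[target]

cosines = [0.8, 0.1, -0.2]
with_margin = am_softmax_loss(cosines, target=0)
without_margin = am_softmax_loss(cosines, target=0, margin=0.0)
```

Because the margin lowers only the target logit, the same feature incurs a larger loss with the margin than without it, which pushes training toward a wider gap between classes.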
Target Speech Extraction with Conditional Diffusion Model
paper_authors: Naoyuki Kamo, Marc Delcroix, Tomohiro Nakatani
for: target speech extraction (TSE) from a mixture of multiple talkers
methods: uses a conditional diffusion model conditioned on a clue identifying the target speaker, and ensemble inference to reduce potential extraction errors
results: outperforms a comparable TSE system trained discriminatively in experiments on the Libri2mix corpus
Abstract
Diffusion model-based speech enhancement has received increased attention since it can generate very natural enhanced signals and generalizes well to unseen conditions. Diffusion models have been explored for several sub-tasks of speech enhancement, such as speech denoising, dereverberation, and source separation. In this paper, we investigate their use for target speech extraction (TSE), which consists of estimating the clean speech signal of a target speaker in a mixture of multi-talkers. TSE is realized by conditioning the extraction process on a clue identifying the target speaker. We show we can realize TSE using a conditional diffusion model conditioned on the clue. Besides, we introduce ensemble inference to reduce potential extraction errors caused by the diffusion process. In experiments on Libri2mix corpus, we show that the proposed diffusion model-based TSE combined with ensemble inference outperforms a comparable TSE system trained discriminatively.
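The abstract does not spell out how ensemble inference combines outputs. One simple reading, assumed here, is to run the stochastic sampler several times and average the resulting waveforms sample by sample. In this sketch `toy_sampler` is a hypothetical stand-in for a full reverse-diffusion run conditioned on the speaker clue.

```python
import random

def ensemble_average(sample_fn, num_samples, seed=0):
    """Average several stochastic outputs of sample_fn, element by element.

    sample_fn: callable taking a random.Random and returning a list of floats.
    """
    rng = random.Random(seed)
    outputs = [sample_fn(rng) for _ in range(num_samples)]
    length = len(outputs[0])
    return [sum(out[t] for out in outputs) / num_samples for t in range(length)]

def toy_sampler(rng):
    # Hypothetical stand-in for one diffusion sampling run:
    # a fixed "clean" signal plus independent Gaussian noise.
    clean = [0.0, 0.5, 1.0, 0.5, 0.0]
    return [x + rng.gauss(0.0, 0.1) for x in clean]

estimate = ensemble_average(toy_sampler, num_samples=8)
```

Averaging reduces the variance of independent sampling errors, which is the usual motivation for this kind of ensemble; whether the paper averages in the time domain or elsewhere is not stated in the abstract.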
Universal Automatic Phonetic Transcription into the International Phonetic Alphabet
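results: The model reaches a quality close to that of human annotators and, compared with the previous best speech-to-IPA model (Wav2Vec2Phoneme), achieves comparable or better results with a much smaller training set.
Abstract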
This paper presents a state-of-the-art model for transcribing speech in any language into the International Phonetic Alphabet (IPA). Transcription of spoken languages into IPA is an essential yet time-consuming process in language documentation, and even partially automating this process has the potential to drastically speed up the documentation of endangered languages. Like the previous best speech-to-IPA model (Wav2Vec2Phoneme), our model is based on wav2vec 2.0 and is fine-tuned to predict IPA from audio input. We use training data from seven languages from CommonVoice 11.0, transcribed into IPA semi-automatically. Although this training dataset is much smaller than Wav2Vec2Phoneme's, its higher quality lets our model achieve comparable or better results. Furthermore, we show that the quality of our universal speech-to-IPA models is close to that of human annotators.
paper_authors: Michael Kuhlmann, Adrian Meise, Fritz Seebauer, Petra Wagner, Reinhold Haeb-Umbach
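for: The paper studies the disentanglement of speech representations, with the aim of improving the generalizability, explainability, and fairness of data-driven models.
methods: The paper trains speech representations with standard disentanglement objectives and compares the degree of disentanglement they achieve.
results: Disentanglement of the speaker embedding is limited when training with standard disentanglement objectives, but it improves to some extent over vanilla representation learning.
Abstract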
Disentanglement is the task of learning representations that identify and separate factors that explain the variation observed in data. Disentangled representations are useful to increase the generalizability, explainability, and fairness of data-driven models. Little is known about how well such disentanglement works for speech representations. A major challenge when tackling disentanglement for speech representations is that the generative factors underlying the speech signal are unknown. In this work, we investigate to what degree speech representations encoding speaker identity can be disentangled. To quantify disentanglement, we identify acoustic features that are highly speaker-variant and can serve as proxies for the factors of variation underlying speech. We find that disentanglement of the speaker embedding is limited when trained with standard objectives promoting disentanglement but can be improved over vanilla representation learning to some extent.
EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation
paper_authors: Jiajun Chen, Jiacheng Lin, Zhiqiang Xiao, Haolong Fu, Ke Nai, Kailun Yang, Zhiyong Li
for: The paper addresses two highly related tasks, audio-guided video object segmentation (A-VOS) and referring video object segmentation (R-VOS).
methods: The paper uses a universal architecture called the Expression Prompt Collaboration Transformer (EPCFormer) and proposes an Expression Alignment (EA) mechanism and an Expression-Visual Attention (EVA) mechanism to address the problem of modeling representations across modalities.
results: Experiments show that EPCFormer attains state-of-the-art results on both A-VOS and R-VOS. In addition, EPCFormer transfers knowledge between the two tasks, which improves video object segmentation accuracy.
Abstract
Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object Segmentation (R-VOS) are two highly-related tasks, which both aim to segment specific objects from video sequences according to user-provided expression prompts. However, due to the challenges in modeling representations for different modalities, contemporary methods struggle to strike a balance between interaction flexibility and high-precision localization and segmentation. In this paper, we address this problem from two perspectives: the alignment representation of audio and text and the deep interaction among audio, text, and visual features. First, we propose a universal architecture, the Expression Prompt Collaboration Transformer, herein EPCFormer. Next, we propose an Expression Alignment (EA) mechanism for audio and text expressions. By introducing contrastive learning for audio and text expressions, the proposed EPCFormer realizes comprehension of the semantic equivalence between audio and text expressions denoting the same objects. Then, to facilitate deep interactions among audio, text, and video features, we introduce an Expression-Visual Attention (EVA) mechanism. The knowledge of video object segmentation in terms of the expression prompts can seamlessly transfer between the two tasks by deeply exploring complementary cues between text and audio. Experiments on well-recognized benchmarks demonstrate that our universal EPCFormer attains state-of-the-art results on both tasks. The source code of EPCFormer will be made publicly available at https://github.com/lab206/EPCFormer.
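The contrastive learning between audio and text expressions described above can be sketched as a symmetric InfoNCE-style loss, where the audio and text embeddings of the same expression form a positive pair and all other pairings in the batch are negatives. This is a generic formulation, not necessarily the exact loss used in EPCFormer; the function names and the temperature of 0.07 are assumptions.

```python
import math

def _normalize(vec):
    # Scale a non-zero vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def _diagonal_cross_entropy(sim):
    # Mean of -log softmax(sim[i])[i]: row i should match column i.
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)
        log_sum = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum - row[i]
    return total / len(sim)

def contrastive_alignment_loss(audio, text, temperature=0.07):
    """Symmetric contrastive loss; audio[i] and text[i] describe the same object."""
    a = [_normalize(v) for v in audio]
    t = [_normalize(v) for v in text]
    # Temperature-scaled cosine similarity between every audio/text pair.
    sim = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
           for ai in a]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-audio direction
    return 0.5 * (_diagonal_cross_entropy(sim) + _diagonal_cross_entropy(sim_t))

aligned = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when matching audio and text embeddings point in the same direction and large when the pairs are swapped, which is the behavior an alignment objective needs.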
for: 3D vision-language grounding (3D-VL) tasks, such as visual grounding, dense captioning, question answering, and situated reasoning.
methods: Uses a pre-trained Transformer for 3D vision and text alignment, with self-attention layers for single-modal modeling and multi-modal fusion.
results: Achieves state-of-the-art results on various 3D-VL tasks, with superior data efficiency and strong performance even with limited annotations during fine-tuning.
Abstract
3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, which calls for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention layers for both single-modal modeling and multi-modal fusion without any sophisticated task-specific design. To further enhance its performance on 3D-VL tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185 unique indoor scenes originating from ScanNet and 3R-Scan datasets, along with paired 278K scene descriptions generated from existing 3D-VL tasks, templates, and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object modeling and scene-text matching. It achieves state-of-the-art results on various 3D-VL tasks, ranging from visual grounding and dense captioning to question answering and situated reasoning. Moreover, 3D-VisTA demonstrates superior data efficiency, obtaining strong performance even with limited annotations during downstream task fine-tuning.
Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval
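results: Extensive experiments on two benchmark datasets, MSCOCO and Flickr30K, show that HAT outperforms SOTA baselines by a large margin. Specifically, on the two key tasks of image-to-text and text-to-image retrieval, HAT achieves relative Recall@1 improvements of 7.6% and 16.7% on MSCOCO, and 4.4% and 11.6% on Flickr30K.
Abstract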
Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such discrepancy in architectures may induce different semantic distribution spaces and limit the interactions between images and texts, and further result in inferior alignment between images and texts. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders could produce representations with more similar characteristics for images and texts, and make the interactions and alignments between them much easier. Besides, to leverage the rich semantics, we devise a hierarchical alignment scheme to explore multi-level correspondences of different layers between images and texts. To evaluate the effectiveness of the proposed HAT, we conduct extensive experiments on two benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that HAT outperforms SOTA baselines by a large margin. Specifically, on two key tasks, i.e., image-to-text and text-to-image retrieval, HAT achieves 7.6% and 16.7% relative score improvement of Recall@1 on MSCOCO, and 4.4% and 11.6% on Flickr30K respectively. The code is available at https://github.com/LuminosityX/HAT.
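To make the reported metric concrete, the sketch below computes Recall@K from a query-by-candidate similarity matrix and the relative improvement between two scores. It assumes, for simplicity, exactly one correct candidate per query (MSCOCO and Flickr30K pair each image with several captions, which a full evaluation would handle); the function names and the toy numbers are illustrative, not results from the paper.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries whose correct candidate is ranked in the top k.

    similarity:   similarity[q][c] is the score of candidate c for query q.
    ground_truth: ground_truth[q] is the index of the correct candidate.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if ground_truth[query_idx] in ranked[:k]:
            hits += 1
    return hits / len(similarity)

def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

similarity = [
    [0.9, 0.1, 0.0],  # query 0: correct candidate 0 is ranked first
    [0.2, 0.3, 0.8],  # query 1: correct candidate 1 is ranked second
    [0.1, 0.7, 0.2],  # query 2: correct candidate 2 is ranked second
]
ground_truth = [0, 1, 2]
r1 = recall_at_k(similarity, ground_truth, k=1)
r2 = recall_at_k(similarity, ground_truth, k=2)
```

A relative improvement is measured against the baseline score, so a rise in Recall@1 from 50.0 to 60.0 is a 20% relative gain, not a 10% one.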
TranSTYLer: Multimodal Behavioral Style Transfer for Facial and Body Gestures Generation
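results: The model is trained on the PATS corpus, extended to include dialog acts and 2D facial landmarks. Objective and subjective evaluations show that the model outperforms state-of-the-art models for styles both seen and unseen during training. To address possible style and content leakage, the authors propose a methodology to assess whether the transferred behaviors and gestures successfully adopt the target style without compromising the meaning of the source content.
Abstract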
This paper addresses the challenge of transferring the behavior expressivity style of a virtual agent to another one while preserving behaviors shape as they carry communicative meaning. Behavior expressivity style is viewed here as the qualitative properties of behaviors. We propose TranSTYLer, a multimodal transformer based model that synthesizes the multimodal behaviors of a source speaker with the style of a target speaker. We assume that behavior expressivity style is encoded across various modalities of communication, including text, speech, body gestures, and facial expressions. The model employs a style and content disentanglement schema to ensure that the transferred style does not interfere with the meaning conveyed by the source behaviors. Our approach eliminates the need for style labels and allows the generalization to styles that have not been seen during the training phase. We train our model on the PATS corpus, which we extended to include dialog acts and 2D facial landmarks. Objective and subjective evaluations show that our model outperforms state of the art models in style transfer for both seen and unseen styles during training. To tackle the issues of style and content leakage that may arise, we propose a methodology to assess the degree to which behavior and gestures associated with the target style are successfully transferred, while ensuring the preservation of the ones related to the source content.
Domain Adaptive Person Search via GAN-based Scene Synthesis for Cross-scene Videos
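methods: Builds on the Fast R-CNN model, with an Assisted-Identity Query Module (AIDQ) that provides positive images, and uses a GAN to generate high-quality person images for scene synthesis. An online learning strategy learns jointly from the synthesized and original images to strengthen feature learning.
results: Extensive experiments on two standard person search benchmarks, CUHK-SYSU and PRW, show strong performance. A detailed ablation study indicates that the GAN-generated data increases the variability and realism of the data.
Abstract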
Person search has recently been a challenging task in the computer vision domain, which aims to search specific pedestrians from real cameras. Nevertheless, most surveillance videos comprise only a handful of images of each pedestrian, which often feature identical backgrounds and clothing. Hence, it is difficult to learn more discriminative features for person search in real scenes. To tackle this challenge, we draw on Generative Adversarial Networks (GAN) to synthesize data from surveillance videos. GAN has thrived in computer vision problems because it produces high-quality images efficiently. We merely alter the popular Fast R-CNN model, which is capable of processing videos and yielding accurate detection outcomes. In order to appropriately relieve the pressure brought by the two-stage model, we design an Assisted-Identity Query Module (AIDQ) to provide positive images for the latter part. Besides, we propose a novel GAN-based Scene Synthesis model that can synthesize high-quality cross-id person images for person search tasks. In order to facilitate the feature learning of the GAN-based Scene Synthesis model, we adopt an online learning strategy that collaboratively learns the synthesized images and original images. Extensive experiments on two widely used person search benchmarks, CUHK-SYSU and PRW, have shown that our method has achieved great performance, and the extensive ablation study further justifies that our GAN-synthetic data can effectively increase the variability of the datasets and be more realistic.
All-pairs Consistency Learning for Weakly Supervised Semantic Segmentation
results: Produces better class localization maps on the PASCAL VOC and MS COCO datasets (67.3% mIoU on PASCAL VOC train), which improves WSSS performance.
Abstract
In this work, we propose a new transformer-based regularization to better localize objects for Weakly supervised semantic segmentation (WSSS). In image-level WSSS, Class Activation Map (CAM) is adopted to generate object localization as pseudo segmentation labels. To address the partial activation issue of the CAMs, consistency regularization is employed to maintain activation intensity invariance across various image augmentations. However, such methods ignore pair-wise relations among regions within each CAM, which capture context and should also be invariant across image views. To this end, we propose a new all-pairs consistency regularization (ACR). Given a pair of augmented views, our approach regularizes the activation intensities between a pair of augmented views, while also ensuring that the affinity across regions within each view remains consistent. We adopt vision transformers as the self-attention mechanism naturally embeds pair-wise affinity. This enables us to simply regularize the distance between the attention matrices of augmented image pairs. Additionally, we introduce a novel class-wise localization method that leverages the gradients of the class token. Our method can be seamlessly integrated into existing WSSS methods using transformers without modifying the architectures. We evaluate our method on PASCAL VOC and MS COCO datasets. Our method produces noticeably better class localization maps (67.3% mIoU on PASCAL VOC train), resulting in superior WSSS performances.
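The two parts of the regularizer described above, consistency of activation intensities and consistency of region-to-region affinity, can be sketched as distances between two augmented views. This is a simplified illustration under strong assumptions: the two views are taken to be spatially aligned already, a mean absolute difference is used as the distance, and the names and the `affinity_weight` parameter are hypothetical rather than taken from the paper.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally shaped nested lists."""
    count = sum(len(row) for row in a)
    total = sum(abs(x - y) for row_a, row_b in zip(a, b)
                for x, y in zip(row_a, row_b))
    return total / count

def all_pairs_consistency(attn_view1, attn_view2, cam_view1, cam_view2,
                          affinity_weight=1.0):
    """Activation consistency plus pairwise-affinity consistency.

    attn_view*: N x N attention (affinity) matrices over the same N regions.
    cam_view*:  length-N activation intensities for one class.
    """
    # Activation intensities should not change across augmented views.
    activation_term = mean_abs_diff([cam_view1], [cam_view2])
    # Affinity between every pair of regions should not change either.
    affinity_term = mean_abs_diff(attn_view1, attn_view2)
    return activation_term + affinity_weight * affinity_term

uniform = [[0.5, 0.5], [0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
cam = [0.2, 0.8]
same_views = all_pairs_consistency(uniform, uniform, cam, cam)
different_attention = all_pairs_consistency(uniform, identity, cam, cam)
```

The penalty is zero only when both the activations and the attention matrices agree across views, so a model cannot satisfy it by matching activation intensities alone.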
Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual Try-On
results: Qualitative and quantitative evaluations show that Cloth2Tex generates high-quality texture maps and achieves better visual effects than other methods.
Abstract
Fabricating and designing 3D garments has become extremely demanding with the increasing need for synthesizing realistic dressed persons for a variety of applications, e.g. 3D virtual try-on, digitalization of 2D clothes into 3D apparel, and cloth animation. It thus necessitates a simple and straightforward pipeline to obtain high-quality texture from simple input, such as 2D reference images. Traditional warping-based texture generation methods require a significant number of control points to be manually selected for each type of garment, which can be a time-consuming and tedious process. We propose a novel method, called Cloth2Tex, which eliminates the human burden in this process. Cloth2Tex is a self-supervised method that generates texture maps with reasonable layout and structural consistency. Another key feature of Cloth2Tex is that it can be used to support high-fidelity texture inpainting. This is done by combining Cloth2Tex with a prevailing latent diffusion model. We evaluate our approach both qualitatively and quantitatively and demonstrate that Cloth2Tex can generate high-quality texture maps and achieve the best visual effects in comparison to other methods. Project page: tomguluson92.github.io/projects/cloth2tex/
Vision-Based Autonomous Navigation for Unmanned Surface Vessel in Extreme Marine Conditions
results: Compared with state-of-the-art de-hazing methods, the proposed scheme performs clearly better across various metrics on the MBZIRC simulation dataset.
Abstract
Visual perception is an important component for autonomous navigation of unmanned surface vessels (USV), particularly for the tasks related to autonomous inspection and tracking. These tasks involve vision-based navigation techniques to identify the target for navigation. Reduced visibility under extreme weather conditions in marine environments makes it difficult for vision-based approaches to work properly. To overcome these issues, this paper presents an autonomous vision-based navigation framework for tracking target objects in extreme marine conditions. The proposed framework consists of an integrated perception pipeline that uses a generative adversarial network (GAN) to remove noise and highlight the object features before passing them to the object detector (i.e., YOLOv5). The detected visual features are then used by the USV to track the target. The proposed framework has been thoroughly tested in simulation under extremely reduced visibility due to sandstorms and fog. The results are compared with state-of-the-art de-hazing methods across the benchmarked MBZIRC simulation dataset, on which the proposed scheme has outperformed the existing methods across various metrics.
SDLFormer: A Sparse and Dense Locality-enhanced Transformer for Accelerated MR Image Reconstruction
results: Experiments on multi-coil MRI acceleration show that the method improves PSNR and SSIM compared with other reconstruction architectures. Code is available at https://github.com/rahul-gs-16/sdlformer.git.
Abstract
Transformers have emerged as viable alternatives to convolutional neural networks owing to their ability to learn non-local region relationships in the spatial domain. The self-attention mechanism of the transformer enables transformers to capture long-range dependencies in the images, which might be desirable for accelerated MRI image reconstruction as the effect of undersampling is non-local in the image domain. Despite its computational efficiency, the window-based transformers suffer from restricted receptive fields as the dependencies are limited to within the scope of the image windows. We propose a window-based transformer network that integrates dilated attention mechanism and convolution for accelerated MRI image reconstruction. The proposed network consists of dilated and dense neighborhood attention transformers to enhance the distant neighborhood pixel relationship and introduce depth-wise convolutions within the transformer module to learn low-level translation invariant features for accelerated MRI image reconstruction. The proposed model is trained in a self-supervised manner. We perform extensive experiments for multi-coil MRI acceleration for coronal PD, coronal PDFS and axial T2 contrasts with 4x and 5x under-sampling in self-supervised learning based on k-space splitting. We compare our method against other reconstruction architectures and the parallel domain self-supervised learning baseline. Results show that the proposed model exhibits improvement margins of (i) around 1.40 dB in PSNR and around 0.028 in SSIM on average over other architectures (ii) around 1.44 dB in PSNR and around 0.029 in SSIM over parallel domain self-supervised learning. The code is available at https://github.com/rahul-gs-16/sdlformer.git
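The contrast between dense and dilated neighborhood attention described in the abstract can be illustrated with a minimal 1-D sketch. It only shows which positions a query would attend to; the function names, the 1-D setting, and the choice to drop out-of-range neighbours are assumptions made for illustration, not the SDLFormer implementation.

```python
def neighborhood(center: int, length: int, kernel: int, dilation: int = 1) -> list[int]:
    """Indices a 1-D position attends to with an odd `kernel` and a given `dilation`.

    Assumption: out-of-range neighbours are simply dropped; real
    neighbourhood-attention kernels may shift the window instead.
    """
    half = kernel // 2
    candidates = (center + dilation * offset for offset in range(-half, half + 1))
    return [i for i in candidates if 0 <= i < length]


def span(kernel: int, dilation: int) -> int:
    """Number of positions covered end to end by the window."""
    return dilation * (kernel - 1) + 1


dense = neighborhood(center=8, length=17, kernel=5, dilation=1)
sparse = neighborhood(center=8, length=17, kernel=5, dilation=4)
print(dense)   # [6, 7, 8, 9, 10]
print(sparse)  # [0, 4, 8, 12, 16]
```

With the same five attended positions, the dilated window spans 17 positions instead of 5, which is the sense in which dilation widens the receptive field at no extra attention cost.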
Summary
Transformers have become viable alternatives to convolutional neural networks because they can learn non-local region relationships in the spatial domain. The self-attention mechanism lets transformers capture long-range dependencies in images, which is a potential advantage for accelerated MRI image reconstruction because the effect of undersampling is non-local. Despite their computational efficiency, window-based transformers suffer from restricted receptive fields, as the dependencies are limited to within the scope of the image windows. We propose a window-based transformer network that integrates a dilated attention mechanism and convolution for accelerated MRI image reconstruction. The proposed network consists of dilated and dense neighborhood attention transformers to enhance the distant neighborhood pixel relationship, and introduces depth-wise convolutions within the transformer module to learn low-level translation-invariant features. The proposed model is trained in a self-supervised manner. We perform extensive experiments on multi-coil MRI acceleration for coronal PD, coronal PDFS, and axial T2 contrasts with 4x and 5x undersampling in self-supervised learning based on k-space splitting. We compare our method against other reconstruction architectures and the parallel domain self-supervised learning baseline. Results show that the proposed model exhibits improvement margins of (i) around 1.40 dB in PSNR and around 0.028 in SSIM on average over other architectures and (ii) around 1.44 dB in PSNR and around 0.029 in SSIM over parallel domain self-supervised learning. The code is available at https://github.com/rahul-gs-16/sdlformer.git.
Blur aware metric depth estimation with multi-focus plenoptic cameras
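results: Experiments show that introducing defocus blur information improves the accuracy and precision of depth estimation. The method was validated on real-world complex 3D scenes and compared against ground-truth measurements acquired with a 3D lidar scanner.
Abstract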
While a traditional camera only captures one point of view of a scene, a plenoptic or light-field camera, is able to capture spatial and angular information in a single snapshot, enabling depth estimation from a single acquisition. In this paper, we present a new metric depth estimation algorithm using only raw images from a multi-focus plenoptic camera. The proposed approach is especially suited for the multi-focus configuration where several micro-lenses with different focal lengths are used. The main goal of our blur aware depth estimation (BLADE) approach is to improve disparity estimation for defocus stereo images by integrating both correspondence and defocus cues. We thus leverage blur information where it was previously considered a drawback. We explicitly derive an inverse projection model including the defocus blur providing depth estimates up to a scale factor. A method to calibrate the inverse model is then proposed. We thus take into account depth scaling to achieve precise and accurate metric depth estimates. Our results show that introducing defocus cues improves the depth estimation. We demonstrate the effectiveness of our framework and depth scaling calibration on relative depth estimation setups and on real-world 3D complex scenes with ground truth acquired with a 3D lidar scanner.
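To see why defocus blur can serve as a depth cue at all, a textbook thin-lens sketch may help. It is not the inverse projection model derived in the paper: it assumes a single ideal thin lens, and all names and numbers are illustrative.

```python
def image_distance(focal_length: float, object_distance: float) -> float:
    """Thin-lens equation 1/f = 1/d + 1/v solved for the image distance v."""
    return focal_length * object_distance / (object_distance - focal_length)


def blur_diameter(aperture: float, focal_length: float,
                  focus_distance: float, object_distance: float) -> float:
    """Blur-circle diameter on a sensor placed where `focus_distance` is sharp.

    All lengths share one unit; distances must exceed the focal length.
    """
    sensor = image_distance(focal_length, focus_distance)
    image = image_distance(focal_length, object_distance)
    return aperture * abs(image - sensor) / image


# Hypothetical lens: 50 mm focal length, 25 mm aperture, focused at 2 m.
for distance_mm in (1000.0, 2000.0, 4000.0):
    print(distance_mm, blur_diameter(25.0, 50.0, 2000.0, distance_mm))
```

The blur vanishes at the focus distance and grows away from it, so a measured blur radius constrains depth. With several micro-lens focal lengths, as in the multi-focus configuration, each lens type has its own focus distance.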
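Summary
A traditional camera captures only a single viewpoint of a scene, whereas a plenoptic (light-field) camera captures the scene's spatial and angular information in a single shot, enabling depth estimation from a single acquisition. In this paper, we propose a new depth estimation algorithm that uses only raw images from a multi-focus plenoptic camera. The approach is especially suited to the multi-focus configuration, in which several micro-lenses have different focal lengths. Its main goal is to improve disparity estimation for defocused stereo images by combining correspondence and defocus cues. We exploit blur information, which was previously regarded as a drawback. We explicitly derive an inverse projection model that includes defocus blur to obtain depth estimates, and then propose a method to calibrate the depth scaling accurately. Our results show that including defocus cues improves the accuracy of depth estimation. We demonstrate this on relative depth estimation setups and on real-world complex 3D scenes with ground truth acquired by a 3D lidar scanner.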
AICSD: Adaptive Inter-Class Similarity Distillation for Semantic Segmentation
paper_authors: Amir M. Mansourian, Rozhan Ahmadi, Shohreh Kasaei
for: The paper aims to improve the accuracy of lightweight student networks for semantic segmentation using knowledge distillation.
methods: The proposed method, Inter-Class Similarity Distillation (ICSD), transfers high-order relations from the teacher network to the student network by computing intra-class distributions and inter-class similarity matrices using KL divergence. An Adaptive Loss Weighting (ALW) training strategy gradually reduces the influence of the teacher network towards the end of training.
results: The proposed method outperforms most existing knowledge distillation methods in terms of mIoU and pixel accuracy on two well-known semantic segmentation datasets, Cityscapes and Pascal VOC 2012.
Abstract
In recent years, deep neural networks have achieved remarkable accuracy in computer vision tasks. With inference time being a crucial factor, particularly in dense prediction tasks such as semantic segmentation, knowledge distillation has emerged as a successful technique for improving the accuracy of lightweight student networks. The existing methods often neglect the information in channels and among different classes. To overcome these limitations, this paper proposes a novel method called Inter-Class Similarity Distillation (ICSD) for the purpose of knowledge distillation. The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs. This is followed by calculating inter-class similarity matrices for distillation using KL divergence between distributions of each pair of classes. To further improve the effectiveness of the proposed method, an Adaptive Loss Weighting (ALW) training strategy is proposed. Unlike existing methods, the ALW strategy gradually reduces the influence of the teacher network towards the end of training process to account for errors in teacher's predictions. Extensive experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method in terms of mIoU and pixel accuracy. The proposed method outperforms most of existing knowledge distillation methods as demonstrated by both quantitative and qualitative evaluations. Code is available at: https://github.com/AmirMansurian/AICSD
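The two ingredients named in the abstract, inter-class similarity matrices built from KL divergences and a teacher weight that shrinks during training, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the spatial softmax, the squared-error comparison of the matrices, and the linear decay schedule are all hypothetical choices, not taken from the AICSD code.

```python
import math


def spatial_softmax(logits: list[float]) -> list[float]:
    """Turn one class channel (flattened over pixels) into a distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def similarity_matrix(class_logits: list[list[float]]) -> list[list[float]]:
    """Pairwise KL divergences between the per-class distributions."""
    dists = [spatial_softmax(channel) for channel in class_logits]
    return [[kl(p, q) for q in dists] for p in dists]


def icsd_loss(teacher: list[list[float]], student: list[list[float]]) -> float:
    """Mean squared gap between teacher and student matrices (assumed form)."""
    t, s = similarity_matrix(teacher), similarity_matrix(student)
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


def distill_weight(epoch: int, total_epochs: int) -> float:
    """Hypothetical ALW schedule: linear decay from 1 to 0 over training."""
    return 1.0 - epoch / max(total_epochs - 1, 1)


# Toy outputs: 2 classes x 3 pixels.
teacher_out = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
student_out = [[1.0, 0.8, 0.4], [0.6, 0.9, 0.5]]
print(icsd_loss(teacher_out, student_out))
print([distill_weight(e, 5) for e in range(5)])
```

In a training loop the total loss would then be something like the task loss plus `distill_weight(epoch, total_epochs) * icsd_loss(...)`, so the teacher's errors matter less near the end of training.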
Summary
In recent years, deep neural networks have achieved high accuracy in computer vision tasks. However, inference time is a key factor in dense prediction tasks, especially semantic segmentation. To address this, the paper proposes a new method called Inter-Class Similarity Distillation (ICSD). The method transfers high-order relations from the teacher network to the student network by taking the network outputs for each class and computing similarity matrices between each pair of classes for knowledge distillation. To further improve its effectiveness, the paper also proposes an Adaptive Loss Weighting (ALW) training strategy. Unlike existing methods, ALW gradually reduces the influence of the teacher network during training to offset errors in the teacher's predictions. Experiments show that the method achieves high mIoU and pixel accuracy on two commonly used semantic segmentation datasets, Cityscapes and Pascal VOC 2012. Compared with existing knowledge distillation methods, it also performs better in both quantitative and qualitative evaluations. Code is available at https://github.com/AmirMansurian/AICSD.
A Comparative Study of Image-to-Image Translation Using GANs for Synthetic Child Race Data
paper_authors: Wang Yao, Muhammad Ali Farooq, Joseph Lemley, Peter Corcoran
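for: Improving the ethnic diversity of data for face recognition techniques.
methods: Uses image-to-image translation to adjust the ethnicity of children's face data.
results: Experiments show that image-to-image translation methods can generate synthetic child face samples of various ethnicities, broadening the ethnic diversity available to face recognition techniques.
Abstract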
The lack of ethnic diversity in data has been a limiting factor of face recognition techniques in the literature. This is particularly the case for children where data samples are scarce and presents a challenge when seeking to adapt machine vision algorithms that are trained on adult data to work on children. This work proposes the utilization of image-to-image transformation to synthesize data of different races and thus adjust the ethnicity of children's face data. We consider ethnicity as a style and compare three different Image-to-Image neural network based methods, specifically pix2pix, CycleGAN, and CUT networks to implement Caucasian child data and Asian child data conversion. Experimental validation results on synthetic data demonstrate the feasibility of using image-to-image transformation methods to generate various synthetic child data samples with broader ethnic diversity.
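Summary
The lack of data covering different ethnicities has limited the development of face recognition technology. Data samples for children are especially scarce, which makes it challenging to adapt machine vision algorithms trained on adult data to children. This work proposes using image-to-image translation to synthesize samples of different ethnicities and thereby adjust the ethnicity of children's face data. We treat ethnicity as a style and evaluate three image-to-image neural network methods, namely pix2pix, CycleGAN, and CUT, to convert between Caucasian child data and Asian child data. Experimental validation on synthetic data shows that image-to-image translation methods can generate varied synthetic child data samples with broader ethnic diversity.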
Will your Doorbell Camera still recognize you as you grow old
results: Experiments show that long-term age effects remain a major challenge for state-of-the-art facial verification methods.
Abstract
Robust authentication for low-power consumer devices such as doorbell cameras poses a valuable and unique challenge. This work explores the effect of age and aging on the performance of facial authentication methods. Two public age datasets, AgeDB and Morph-II have been used as baselines in this work. A photo-realistic age transformation method has been employed to augment a set of high-quality facial images with various age effects. Then the effect of these synthetic aging data on the high-performance deep-learning-based face recognition model is quantified by using various metrics including Receiver Operating Characteristic (ROC) curves and match score distributions. Experimental results demonstrate that long-term age effects are still a significant challenge for the state-of-the-art facial authentication method.
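The abstract quantifies aging effects with ROC curves and match score distributions. A minimal sketch of how an ROC curve is derived from genuine and impostor match scores follows. The scores are made up for illustration and the helper names are hypothetical.

```python
def roc_points(genuine: list[float], impostor: list[float]) -> list[tuple[float, float]]:
    """Return (false-accept rate, true-accept rate) pairs, one per threshold.

    A comparison is accepted when its score is >= the threshold; thresholds
    are the distinct observed scores, swept from strictest to most lenient.
    """
    thresholds = sorted(set(genuine) | set(impostor), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        far = sum(s >= t for s in impostor) / len(impostor)
        tar = sum(s >= t for s in genuine) / len(genuine)
        points.append((far, tar))
    return points


def auc(points: list[tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up match scores: same-person pairs vs. different-person pairs.
genuine_scores = [0.9, 0.8, 0.4]
impostor_scores = [0.5, 0.3, 0.2]
curve = roc_points(genuine_scores, impostor_scores)
print(curve)
print(auc(curve))
```

If synthetic aging lowers the genuine scores, as with the 0.4 here, part of the genuine distribution slides under the impostor scores and the curve drops below the ideal area of 1.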
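Summary
Robust authentication for low-power consumer devices poses a unique and valuable challenge. This work studies how age affects the performance of facial authentication methods. The public age datasets AgeDB and Morph-II are used as baselines, and a photo-realistic age transformation method is used to augment a set of high-quality facial images with various age effects. The effect of these images on a high-performance deep-learning-based face recognition model is then quantified. Experimental results show that long-term age effects remain a major challenge for state-of-the-art facial authentication methods.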
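results: Extensive experiments on eight segmentation tasks (such as human divers) show that AquaSAM outperforms the default SAM model on underwater image segmentation, especially on hard tasks such as coral reefs. AquaSAM improves the average Dice Similarity Coefficient (DSC) by 7.13% and the mIoU by 8.27% on underwater segmentation tasks.
Abstract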
The Segment Anything Model (SAM) has revolutionized natural image segmentation; nevertheless, its performance on underwater images is still restricted. This work presents AquaSAM, the first attempt to extend the success of SAM to underwater images, with the purpose of creating a versatile method for segmenting various underwater targets. To achieve this, we begin by automatically classifying and extracting various labels in the SUIM dataset. Subsequently, we develop a straightforward fine-tuning method to adapt SAM to general foreground underwater image segmentation. Through extensive experiments involving eight segmentation tasks, such as human divers, we demonstrate that AquaSAM outperforms the default SAM model, especially on hard tasks such as coral reefs. AquaSAM achieves an average improvement of 7.13% in Dice Similarity Coefficient (DSC) and 8.27% in mIoU on underwater segmentation tasks.
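The two metrics reported above have standard definitions for binary masks, sketched here for reference. The empty-mask convention and the toy masks are assumptions. mIoU is the IoU averaged over classes or tasks.

```python
def dice_and_iou(pred: list[int], truth: list[int]) -> tuple[float, float]:
    """Dice coefficient and IoU for two flattened binary masks of equal length."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    pred_area, truth_area = sum(pred), sum(truth)
    union = pred_area + truth_area - intersection
    if union == 0:  # both masks empty: treat as a perfect match (a convention)
        return 1.0, 1.0
    return 2 * intersection / (pred_area + truth_area), intersection / union


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(dice, iou)  # 0.5 0.3333333333333333
```

Dice is never lower than IoU for the same pair of masks, which is worth keeping in mind when comparing the two improvement figures.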
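Summary
The Segment Anything Model (SAM) has revolutionized natural image segmentation, but its performance on underwater images remains limited. This work extends SAM to underwater images in order to create a versatile method for segmenting underwater targets. To this end, we first automatically classify and extract various labels in the SUIM dataset. We then develop a simple fine-tuning method to adapt SAM to general underwater image segmentation. Extensive experiments on eight segmentation tasks, such as human divers, show that AquaSAM performs strongly on underwater segmentation, especially on hard tasks such as coral reefs. AquaSAM improves the average Dice Similarity Coefficient (DSC) by 7.13% and the average mIoU by 8.27% on underwater segmentation tasks.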
Robust retrieval of material chemical states in X-ray microspectroscopy
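results: Experimental results demonstrate the method's effectiveness and reliability: it retrieves the chemical states of materials efficiently and accurately, even with low signal-to-noise ratios and overlapping spectral features.
Abstract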
X-ray microspectroscopic techniques are essential for studying morphological and chemical changes in materials, providing high-resolution structural and spectroscopic information. However, its practical data analysis for reliably retrieving the chemical states remains a major obstacle to accelerating the fundamental understanding of materials in many research fields. In this work, we propose a novel data formulation model for X-ray microspectroscopy and develop a dedicated unmixing framework to solve this problem, which is robust to noise and spectral variability. Moreover, this framework is not limited to the analysis of two-state material chemistry, making it an effective alternative to conventional and widely-used methods. In addition, an alternative directional multiplier method with provable convergence is applied to obtain the solution efficiently. Our framework can accurately identify and characterize chemical states in complex and heterogeneous samples, even under challenging conditions such as low signal-to-noise ratios and overlapping spectral features. Extensive experimental results on simulated and real datasets demonstrate its effectiveness and reliability.
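For contrast with the general unmixing framework described above, the conventional two-state case has a simple closed-form fit. The sketch below assumes a linear mixing model with two known reference spectra. It is not the paper's method, which handles more than two states and uses an iterative solver, and the names and spectra are illustrative.

```python
def two_state_fraction(spectrum: list[float], ref_a: list[float], ref_b: list[float]) -> float:
    """Least-squares fraction of state A in `spectrum`, clipped to [0, 1].

    Model (assumed): spectrum ~= x * ref_a + (1 - x) * ref_b.
    """
    diff = [a - b for a, b in zip(ref_a, ref_b)]
    norm = sum(d * d for d in diff)
    if norm == 0:
        raise ValueError("reference spectra must differ")
    x = sum((y - b) * d for y, b, d in zip(spectrum, ref_b, diff)) / norm
    return min(1.0, max(0.0, x))


# Toy spectra sampled at three energies.
state_a = [1.0, 0.0, 0.0]
state_b = [0.0, 1.0, 0.0]
pixel = [0.3, 0.7, 0.0]  # 30% state A, 70% state B
print(two_state_fraction(pixel, state_a, state_b))
```

Noise and overlapping spectral features make this per-pixel fit unstable, which is the kind of difficulty the abstract says its framework is designed to be robust to.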
Summary
Exploring Transformers for Open-world Instance Segmentation
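results: The paper's models achieve state-of-the-art performance in various open-world cross-category and cross-dataset generalization settings. In the VOC to non-VOC setup, the method sets new best results of 40.0% on ARb100 and 34.9% on ARm100. For COCO to UVO generalization, SWORD outperforms the previous best open-world model by 5.9% on APm and 8.1% on ARm100.
Abstract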
Open-world instance segmentation is an emerging task that aims to segment all objects in an image by learning from a limited number of base-category objects. This task is challenging, as the number of unseen categories could be hundreds of times larger than that of seen categories. Recently, DETR-like models have been extensively studied in the closed world but remain unexplored in the open world. In this paper, we utilize the Transformer for open-world instance segmentation and present SWORD. Firstly, we attach a stop-gradient operation before the classification head and add IoU heads for discovering novel objects. We demonstrate that a simple stop-gradient operation not only prevents novel objects from being suppressed as background, but also allows the network to enjoy the merit of heuristic label assignment. Secondly, we propose a novel contrastive learning framework to enlarge the separation between the representations of objects and background. Specifically, we maintain a universal object queue to obtain the object center, and dynamically select positive and negative samples from the object queries for contrastive learning. While previous works focus only on average recall and neglect average precision, we show the strength of SWORD by considering both criteria. Our models achieve state-of-the-art performance in various open-world cross-category and cross-dataset generalizations. In particular, in the VOC to non-VOC setup, our method sets new state-of-the-art results of 40.0% on ARb100 and 34.9% on ARm100. For COCO to UVO generalization, SWORD significantly outperforms the previous best open-world model by 5.9% on APm and 8.1% on ARm100.
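The contrastive component described in the abstract, an object center taken from a queue plus positive and negative object queries, can be sketched with an InfoNCE-style loss. The loss form, the cosine similarity, the temperature, and the toy 2-D embeddings are all assumptions for illustration. The abstract does not specify them.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def object_center(queue: list[list[float]]) -> list[float]:
    """Mean embedding of the object queue."""
    return [sum(column) / len(queue) for column in zip(*queue)]


def contrastive_loss(center: list[float], positives: list[list[float]],
                     negatives: list[list[float]], temperature: float = 0.1) -> float:
    """InfoNCE-style loss: pull positives toward the center, push negatives away."""
    pos = [math.exp(cosine(center, q) / temperature) for q in positives]
    neg = [math.exp(cosine(center, q) / temperature) for q in negatives]
    return -math.log(sum(pos) / (sum(pos) + sum(neg)))


queue = [[1.0, 0.0], [0.8, 0.2]]           # embeddings of known objects
center = object_center(queue)
object_like = [[1.0, 0.1]]                  # queries resembling objects
background_like = [[0.0, 1.0]]              # queries resembling background
print(contrastive_loss(center, object_like, background_like))
print(contrastive_loss(center, background_like, object_like))
```

The first loss is small because object-like queries already sit near the center. The second is large, which is the signal that would separate object and background representations during training.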
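Summary
Open-world instance segmentation is an emerging task that aims to segment all objects in an image by learning from a limited number of base-category objects. The task is challenging because the number of unseen categories may be hundreds of times larger than the number of seen categories. DETR-like models have been studied extensively in the closed world but remain unexplored in the open world. In this paper, we use the Transformer for open-world instance segmentation and present SWORD. First, we attach a stop-gradient operation before the classification head and add IoU heads to discover novel objects. We find that a simple stop-gradient operation not only prevents novel objects from being suppressed as background but also lets the network benefit from heuristic label assignment. Second, we propose a new contrastive learning framework to enlarge the separation between the representations of objects and background. We maintain a universal object queue to obtain the object center and dynamically select positive and negative samples from the object queries for contrastive learning. Whereas previous work pursued only average recall and neglected average precision, SWORD takes both into account and achieves state-of-the-art performance across various open-world cross-category and cross-dataset settings. In the VOC to non-VOC setup, our method sets new state-of-the-art results of 40.0% on ARb100 and 34.9% on ARm100. For COCO to UVO generalization, SWORD clearly outperforms the previous best open-world model, by 5.9% on APm and 8.1% on ARm100.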
D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation
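results: Extensive experiments on three challenging benchmarks verify the effectiveness of D3G. It outperforms state-of-the-art weakly supervised methods by a large margin and narrows the performance gap to fully supervised methods.
Abstract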
Temporal sentence grounding (TSG) aims to locate a specific moment from an untrimmed video with a given natural language query. Recently, weakly supervised methods still have a large performance gap compared to fully supervised ones, while the latter requires laborious timestamp annotations. In this study, we aim to reduce the annotation cost yet keep competitive performance for TSG task compared to fully supervised ones. To achieve this goal, we investigate a recently proposed glance-supervised temporal sentence grounding task, which requires only single frame annotation (referred to as glance annotation) for each query. Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA). Specifically, SA-GCL samples reliable positive moments from a 2D temporal map via jointly leveraging Gaussian prior and semantic consistency, which contributes to aligning the positive sentence-moment pairs in the joint embedding space. Moreover, to alleviate the annotation bias resulting from glance annotation and model complex queries consisting of multiple events, we propose the DGA module, which adjusts the distribution dynamically to approximate the ground truth of target moments. Extensive experiments on three challenging benchmarks verify the effectiveness of the proposed D3G. It outperforms the state-of-the-art weakly supervised methods by a large margin and narrows the performance gap compared to fully supervised methods. Code is available at https://github.com/solicucu/D3G.
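The idea of a Gaussian prior centred on the glance annotation, evaluated over a 2D temporal map of candidate moments, can be sketched as follows. The specific weighting rule (Gaussian of the distance between a moment's centre and the glance clip, scaled by video length) is a hypothetical simplification, and the dynamic adjustment performed by the DGA module is not modelled.

```python
import math


def gaussian_prior_map(num_clips: int, glance: int, sigma: float) -> list[list[float]]:
    """Weight for every candidate moment (start, end) on a 2-D temporal map.

    Assumption: a moment's weight is the Gaussian of the distance between
    its center and the glance clip, scaled by video length; cells with
    end < start are invalid and stay at 0.
    """
    prior = [[0.0] * num_clips for _ in range(num_clips)]
    for start in range(num_clips):
        for end in range(start, num_clips):
            center = (start + end) / 2
            z = (center - glance) / (sigma * num_clips)
            prior[start][end] = math.exp(-0.5 * z * z)
    return prior


prior = gaussian_prior_map(num_clips=8, glance=4, sigma=0.2)
for row in prior:
    print(" ".join(f"{w:.2f}" for w in row))
```

Candidate moments whose weight is high would then be the ones sampled as positives, subject to the semantic consistency check the abstract also mentions.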
Summary
Temporal sentence grounding (TSG) aims to locate a specific moment in an untrimmed video that corresponds to a given natural language query. Weakly supervised methods still show a large performance gap compared with fully supervised ones, while the latter require laborious timestamp annotations. In this study, we aim to reduce the annotation cost yet keep competitive performance on the TSG task compared with fully supervised methods. To achieve this goal, we investigate a recently proposed glance-supervised temporal sentence grounding task, which requires only a single frame annotation (referred to as glance annotation) for each query. Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA). Specifically, SA-GCL samples reliable positive moments from a 2D temporal map by jointly leveraging the Gaussian prior and semantic consistency, which helps align the positive sentence-moment pairs in the joint embedding space. Moreover, to alleviate the annotation bias resulting from glance annotation and to model complex queries consisting of multiple events, we propose the DGA module, which adjusts the distribution dynamically to approximate the ground truth of target moments. Extensive experiments on three challenging benchmarks verify the effectiveness of the proposed D3G. It outperforms the state-of-the-art weakly supervised methods by a large margin and narrows the performance gap compared to fully supervised methods. Code is available at https://github.com/solicucu/D3G.
Image Copy-Move Forgery Detection via Deep Cross-Scale PatchMatch
methods: The study proposes a new end-to-end image CMFD framework that combines conventional and deep learning methods. Specifically, it designs a deep cross-scale patchmatch method tailored for CMFD to localize copy-move regions, and develops a manipulation region location branch for source/target separation.
results: The method shows high generalizability to different copy-move contents and significantly better performance than existing approaches.
Abstract
The recently developed deep algorithms achieve promising progress in the field of image copy-move forgery detection (CMFD). However, they have limited generalizability in some practical scenarios, where the copy-move objects may not appear in the training images or cloned regions are from the background. To address the above issues, in this work, we propose a novel end-to-end CMFD framework by integrating merits from both conventional and deep methods. Specifically, we design a deep cross-scale patchmatch method tailored for CMFD to localize copy-move regions. In contrast to existing deep models, our scheme aims to seek explicit and reliable point-to-point matching between source and target regions using features extracted from high-resolution scales. Further, we develop a manipulation region location branch for source/target separation. The proposed CMFD framework is completely differentiable and can be trained in an end-to-end manner. Extensive experimental results demonstrate the high generalizability of our method to different copy-move contents, and the proposed scheme achieves significantly better performance than existing approaches.
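Summary
Recently developed deep algorithms have made promising progress in image copy-move forgery detection (CMFD). However, they generalize poorly in some practical scenarios, for example when the copy-move objects do not appear in the training images or when the cloned regions come from the background. To address these issues, this work proposes a new end-to-end CMFD framework that combines the merits of conventional and deep methods. Specifically, we design a deep cross-scale patchmatch method tailored for CMFD to localize copy-move regions. Unlike existing deep models, our scheme seeks explicit and reliable point-to-point matching between source and target regions, using features extracted from high-resolution scales. In addition, we develop a manipulation region location branch for source/target separation. The proposed CMFD framework is fully differentiable and can be trained end to end. Extensive experimental results show that our method generalizes well to different copy-move contents and performs significantly better than existing approaches.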
How Generalizable are Deepfake Detectors? An Empirical Study
paper_authors: Boquan Li, Jun Sun, Christopher M. Poskitt
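for: The paper examines the generalizability of deepfake detection methods, so that detectors can stay one step ahead of attackers across different datasets.
methods: The study uses six deepfake datasets, five deepfake detection methods, and two model augmentation approaches.
results: The study finds that detectors do not generalize in zero-shot settings, that they learn unwanted properties specific to synthesis methods, and that they struggle to extract discriminative features, all of which limit generalization. However, it also finds neurons that contribute to detection universally, which may offer a path toward zero-shot generalizability.
Abstract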
Deepfake videos and images are becoming increasingly credible, posing a significant threat given their potential to facilitate fraud or bypass access control systems. This has motivated the development of deepfake detection methods, in which deep learning models are trained to distinguish between real and synthesized footage. Unfortunately, existing detection models struggle to generalize to deepfakes from datasets they were not trained on, but little work has been done to examine why or how this limitation can be addressed. In this paper, we present the first empirical study on the generalizability of deepfake detectors, an essential goal for detectors to stay one step ahead of attackers. Our study utilizes six deepfake datasets, five deepfake detection methods, and two model augmentation approaches, confirming that detectors do not generalize in zero-shot settings. Additionally, we find that detectors are learning unwanted properties specific to synthesis methods and struggling to extract discriminative features, limiting their ability to generalize. Finally, we find that there are neurons universally contributing to detection across seen and unseen datasets, illuminating a possible path forward to zero-shot generalizability.
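Summary
Deepfake videos and images are becoming increasingly credible, and the threats they pose include fraud and the bypassing of access control systems. This has driven the development of deepfake detection methods, in which deep learning models are trained to distinguish real from synthesized footage. However, existing detection models generalize poorly to other datasets, and little work has examined this limitation or how to address it. In this paper, we present the first empirical study of the generalizability of deepfake detectors, an essential goal if detectors are to stay one step ahead of attackers. Our study uses six deepfake datasets, five deepfake detection methods, and two model augmentation approaches, and confirms that detectors do not generalize in zero-shot settings. We also find that detectors learn unwanted properties specific to synthesis methods, which makes accurate detection on new datasets difficult. Finally, we find that some neurons contribute to detection across all datasets, which points to a possible path toward generalizability.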
for: The paper mainly introduces the Efficient Face Recognition Competition (EFaR) held at the 2023 International Joint Conference on Biometrics (IJCB 2023), as well as the 6 teams that participated in the competition with 17 submissions.
methods: The submitted solutions use small, efficient network architectures to reduce computational cost, and some solutions apply model quantization.
results: The paper evaluates the performance of the submitted solutions and compares them to a set of baselines on a diverse set of benchmarks, including bias, cross-quality, and large-scale recognition.
Abstract
This paper presents the summary of the Efficient Face Recognition Competition (EFaR) held at the 2023 International Joint Conference on Biometrics (IJCB 2023). The competition received 17 submissions from 6 different teams. To drive further development of efficient face recognition models, the submitted solutions are ranked based on a weighted score of the achieved verification accuracies on a diverse set of benchmarks, as well as the deployability given by the number of floating-point operations and model size. The evaluation of submissions is extended to bias, cross-quality, and large-scale recognition benchmarks. Overall, the paper gives an overview of the achieved performance values of the submitted solutions as well as a diverse set of baselines. The submitted solutions use small, efficient network architectures to reduce the computational cost, some solutions apply model quantization. An outlook on possible techniques that are underrepresented in current solutions is given as well.
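The ranking idea in the abstract, a weighted score that combines verification accuracy with deployability (floating-point operations and model size), can be sketched with rank-based points. The Borda-style points, the 0.7/0.15/0.15 weights, and the three entries are hypothetical. The competition's actual scoring formula is not given in the abstract.

```python
def borda_points(values: dict[str, float], higher_is_better: bool) -> dict[str, int]:
    """Give the worst entry 0 points, the next 1, ... and the best n-1 (no tie handling)."""
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    return {name: rank for rank, name in enumerate(ordered)}


def weighted_score(accuracy: dict[str, float], mflops: dict[str, float],
                   params_m: dict[str, float],
                   weights: tuple[float, float, float] = (0.7, 0.15, 0.15)) -> dict[str, float]:
    """Combine accuracy, compute, and size rankings into one score per entry."""
    acc = borda_points(accuracy, higher_is_better=True)
    ops = borda_points(mflops, higher_is_better=False)
    size = borda_points(params_m, higher_is_better=False)
    w_acc, w_ops, w_size = weights
    return {n: w_acc * acc[n] + w_ops * ops[n] + w_size * size[n] for n in accuracy}


# Hypothetical submissions, not real competition entries.
scores = weighted_score(
    accuracy={"A": 0.95, "B": 0.93, "C": 0.90},
    mflops={"A": 900.0, "B": 300.0, "C": 100.0},
    params_m={"A": 4.0, "B": 2.0, "C": 1.0},
)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Shifting weight from accuracy toward the deployability terms would let the smallest model overtake the most accurate one, which is the trade-off such a score is meant to expose.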
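Summary
This paper presents the results of the Efficient Face Recognition Competition (EFaR) held at the 2023 International Joint Conference on Biometrics (IJCB 2023). The competition received 17 submissions from 6 teams. To drive further development of efficient face recognition models, the submitted solutions are ranked by a weighted score that combines the verification accuracy achieved on multiple benchmarks with the model size and the number of floating-point operations. The evaluation also covers bias, cross-quality, and large-scale recognition benchmarks. Overall, the paper gives an overview of the performance of the submitted solutions and of several baselines, along with an outlook on techniques that are underrepresented in current solutions.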
Under-Display Camera Image Restoration with Scattering Effect
paper_authors: Binbin Song, Xiangyu Chen, Shuning Xu, Jiantao Zhou
for: addresses the under-display camera (UDC) image restoration problem with a specific focus on the scattering effect caused by the display.
methods: uses a two-branch restoration network, including a scattering branch that uses channel-wise self-attention to estimate the scattering effect parameters, and an image branch that leverages local representation advantages of CNN to recover clear scenes.
results: demonstrates superior performance over state-of-the-art UDC restoration techniques through extensive experiments on both real-world and synthesized data.
Abstract
The under-display camera (UDC) provides consumers with a full-screen visual experience without any obstruction due to notches or punched holes. However, the semi-transparent nature of the display inevitably introduces severe degradation into UDC images. In this work, we address the UDC image restoration problem with specific consideration of the scattering effect caused by the display. We explicitly model the scattering effect by treating the display as a piece of homogeneous scattering medium. With the physical model of the scattering effect, we improve the image formation pipeline for image synthesis to construct a realistic UDC dataset with ground truths. To suppress the scattering effect for the eventual UDC image recovery, a two-branch restoration network is designed. More specifically, the scattering branch leverages the global modeling capabilities of channel-wise self-attention to estimate the parameters of the scattering effect from degraded images, while the image branch exploits the local representation advantage of CNNs to recover clear scenes, implicitly guided by the scattering branch. Extensive experiments are conducted on both real-world and synthesized data, demonstrating the superiority of the proposed method over state-of-the-art UDC restoration techniques. The source code and dataset are available at https://github.com/NamecantbeNULL/SRUDC.
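Summary
The under-display camera (UDC) gives users an unobstructed full-screen visual experience, but the semi-transparent display inevitably causes severe degradation of UDC images. We address the UDC image restoration problem with specific consideration of the scattering effect caused by the display. We describe the scattering effect with a physical model and improve the image formation pipeline to build a realistic UDC dataset. To reduce the influence of the scattering effect, we design a two-branch network: a scattering branch that uses channel-wise self-attention to estimate the parameters of the scattering effect, and an image branch that recovers clear scenes. Extensive experiments demonstrate the superiority of our method on the UDC image restoration problem. The dataset and source code are available at https://github.com/NamecantbeNULL/SRUDC.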
EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation
results: Experiments show that the proposed universal EPCFormer achieves state-of-the-art results on both tasks and transfers knowledge well between them.
Abstract
Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object Segmentation (R-VOS) are two highly-related tasks, which both aim to segment specific objects from video sequences according to user-provided expression prompts. However, due to the challenges in modeling representations for different modalities, contemporary methods struggle to strike a balance between interaction flexibility and high-precision localization and segmentation. In this paper, we address this problem from two perspectives: the alignment representation of audio and text and the deep interaction among audio, text, and visual features. First, we propose a universal architecture, the Expression Prompt Collaboration Transformer, herein EPCFormer. Next, we propose an Expression Alignment (EA) mechanism for audio and text expressions. By introducing contrastive learning for audio and text expressions, the proposed EPCFormer realizes comprehension of the semantic equivalence between audio and text expressions denoting the same objects. Then, to facilitate deep interactions among audio, text, and video features, we introduce an Expression-Visual Attention (EVA) mechanism. The knowledge of video object segmentation in terms of the expression prompts can seamlessly transfer between the two tasks by deeply exploring complementary cues between text and audio. Experiments on well-recognized benchmarks demonstrate that our universal EPCFormer attains state-of-the-art results on both tasks. The source code of EPCFormer will be made publicly available at https://github.com/lab206/EPCFormer.
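Summary
Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object Segmentation (R-VOS) are two closely related tasks that both segment specific objects from video sequences according to user-provided expression prompts. Because representations for different modalities are hard to model, current methods struggle to balance interaction flexibility with high-precision localization and segmentation. This paper addresses the problem from two angles: aligned representations of audio and text, and deep interaction among audio, text, and visual features. It first proposes a universal architecture, the Expression Prompt Collaboration Transformer (EPCFormer). It then proposes an Expression Alignment (EA) mechanism that uses contrastive learning so the model recognizes the semantic equivalence of audio and text expressions denoting the same object. To promote deep interaction among audio, text, and video features, it introduces an Expression-Visual Attention (EVA) mechanism. By exploring complementary cues between text and audio, knowledge of video object segmentation transfers between the two tasks. Experiments show that the universal EPCFormer achieves state-of-the-art results on both tasks. The source code will be released at https://github.com/lab206/EPCFormer.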
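The Expression Alignment idea, contrastive learning that pulls together audio and text expressions denoting the same object, can be illustrated with a small sketch. The abstract does not specify the loss, so the symmetric InfoNCE form, the `temperature` value, and the function names below are assumptions made for illustration only, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss (assumed form).

    Row i of audio_embs and row i of text_embs are assumed to describe
    the same object; all other pairings are treated as negatives.
    """
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(a)
    # Cosine similarities between every audio/text pair, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct "class" for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))
```

With matched embeddings the loss is close to zero, and it grows when the pairing is shuffled, which is the behavior an alignment objective of this kind relies on.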
Towards Top-Down Stereoscopic Image Quality Assessment via Stereo Attention
results: Experiments show that the method better simulates properties of the human visual system (HVS) and outperforms existing bottom-up methods.
Abstract
Stereoscopic image quality assessment (SIQA) plays a crucial role in evaluating and improving the visual experience of 3D content. Existing binocular properties and attention-based methods for SIQA have achieved promising performance. However, these bottom-up approaches are inadequate in exploiting the inherent characteristics of the human visual system (HVS). This paper presents a novel network for SIQA via stereo attention, employing a top-down perspective to guide the quality assessment process. Our proposed method realizes the guidance from high-level binocular signals down to low-level monocular signals, while the binocular and monocular information can be calibrated progressively throughout the processing pipeline. We design a generalized Stereo AttenTion (SAT) block to implement the top-down philosophy in stereo perception. This block utilizes the fusion-generated attention map as a high-level binocular modulator, influencing the representation of two low-level monocular features. Additionally, we introduce an Energy Coefficient (EC) to account for recent findings indicating that binocular responses in the primate primary visual cortex are less than the sum of monocular responses. The adaptive EC can tune the magnitude of binocular response flexibly, thus enhancing the formation of robust binocular features within our framework. To extract the most discriminative quality information from the summation and subtraction of the two branches of monocular features, we utilize a dual-pooling strategy that applies min-pooling and max-pooling operations to the respective branches. Experimental results highlight the superiority of our top-down method in simulating the property of visual perception and advancing the state-of-the-art in the SIQA field. The code of this work is available at https://github.com/Fanning-Zhang/SATNet.
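Summary
Stereoscopic image quality assessment (SIQA) plays an important role in evaluating and improving the visual experience of 3D content. Existing methods based on binocular properties and attention have shown promising performance, but these bottom-up approaches do not fully exploit the inherent characteristics of the human visual system (HVS). This paper proposes a new SIQA network that uses stereo attention to guide the assessment process from a top-down perspective. Guidance flows from high-level binocular signals down to low-level monocular signals, while binocular and monocular information is calibrated progressively through the pipeline. A generalized Stereo AttenTion (SAT) block implements this philosophy: it uses the fusion-generated attention map as a high-level binocular modulator that influences the two low-level monocular feature representations. An Energy Coefficient (EC) accounts for findings that binocular responses in the primate primary visual cortex are less than the sum of monocular responses; the adaptive EC tunes the magnitude of the binocular response to form robust binocular features. To extract the most discriminative quality information from the sum and difference of the two monocular branches, a dual-pooling strategy applies min-pooling and max-pooling to the respective branches. Experiments show that the top-down method simulates properties of visual perception well and advances the state of the art in SIQA. The code is available at https://github.com/Fanning-Zhang/SATNet.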
Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
paper_authors: Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Yueting Zhuang
for: This paper evaluates the instruction-following ability of multimodal large language models (MLLMs) on complicated interleaved vision-language instructions, and introduces a generic and lightweight controllable knowledge re-injection module to address a common defect of existing methods.
methods: The proposed method uses a controllable knowledge re-injection module that leverages the sophisticated reasoning ability of LLMs to conditionally extract instruction-specific visual information and re-inject it into the LLM. The module is learned with an annotation-free, cross-attention guided counterfactual image training strategy that combines a cascade of foundation models.
results: The proposed method achieves state-of-the-art zero-shot performance across all tasks of I4, without high-quality multimodal instruction tuning data. Cheetor also exhibits competitive performance compared with state-of-the-art instruction-tuned models on the MME benchmark.
Abstract
Multimodal Large Language Models (MLLMs) have recently sparked significant interest, which demonstrates emergent capabilities to serve as a general-purpose model for various vision-language tasks. However, existing methods mainly focus on limited types of instructions with a single image as visual context, which hinders the widespread availability of MLLMs. In this paper, we introduce the I4 benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions, which involve intricate image-text sequential context, covering a diverse range of scenarios (e.g., visually-rich webpages/textbooks, lecture slides, embodied dialogue). Systematic evaluation on our I4 benchmark reveals a common defect of existing methods: the Visual Prompt Generator (VPG) trained on image-captioning alignment objective tends to attend to common foreground information for captioning but struggles to extract specific information required by particular tasks. To address this issue, we propose a generic and lightweight controllable knowledge re-injection module, which utilizes the sophisticated reasoning ability of LLMs to control the VPG to conditionally extract instruction-specific visual information and re-inject it into the LLM. Further, we introduce an annotation-free cross-attention guided counterfactual image training strategy to methodically learn the proposed module by collaborating a cascade of foundation models. Enhanced by the proposed module and training strategy, we present Cheetor, a Transformer-based MLLM that can effectively handle a wide variety of interleaved vision-language instructions and achieves state-of-the-art zero-shot performance across all tasks of I4, without high-quality multimodal instruction tuning data. Cheetor also exhibits competitive performance compared with state-of-the-art instruction tuned models on MME benchmark.
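Summary
Multimodal large language models (MLLMs) have recently attracted wide attention and show emergent capabilities as general-purpose models for many vision-language tasks. Existing methods, however, focus mainly on limited types of instructions with a single image as visual context, which limits the broad applicability of MLLMs. This paper introduces the I4 benchmark to comprehensively evaluate how well MLLMs follow complicated interleaved vision-language instructions. Systematic evaluation reveals a common defect: a Visual Prompt Generator (VPG) trained with an image-captioning alignment objective tends to attend to common foreground information and struggles to extract the specific information that particular tasks require. To address this, the paper proposes a generic, lightweight controllable knowledge re-injection module that uses the reasoning ability of the LLM to control the VPG, conditionally extract instruction-specific visual information, and re-inject it into the LLM. It also introduces an annotation-free, cross-attention guided counterfactual image training strategy to learn the module. With this module and training strategy, the resulting Transformer-based MLLM, Cheetor, handles a wide variety of interleaved vision-language instructions and achieves state-of-the-art zero-shot performance across all I4 tasks without high-quality multimodal instruction tuning data. Cheetor is also competitive with state-of-the-art instruction-tuned models on the MME benchmark.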
Application for White Spot Syndrome Virus (WSSV) Monitoring using Edge Machine Learning
paper_authors: Lorenzo S. Querol, Macario O. Cordel II, Dan Jeric A. Rustia, Mary Nia M. Santos
for: This paper aims to improve disease surveillance in the aquaculture industry, specifically for the White Spot Syndrome Virus (WSSV), by developing a mobile application and training a WSSV recognition model using computer vision techniques.
methods: The authors developed a mobile application to collect and monitor data, and trained two models (MobileNetV3-Small and EfficientNetV2-B0) using an imbalanced dataset to improve WSSV recognition. They also analyzed the saliency heatmaps of both models to understand the features that are most important in making a prediction.
results: The models achieved F1-scores of 0.72 and 0.99, respectively, and the saliency heatmaps revealed the image features that are most important for making a correct prediction. The results demonstrate the effectiveness of using computer vision techniques for WSSV recognition, but also highlight the limitations of using resource-constrained devices and the need for further improvement.
Abstract
The aquaculture industry, strongly reliant on shrimp exports, faces challenges due to viral infections like the White Spot Syndrome Virus (WSSV) that severely impact output yields. In this context, computer vision can play a significant role in identifying features not immediately evident to skilled or untrained eyes, potentially reducing the time required to report WSSV infections. In this study, the challenge of limited data for WSSV recognition was addressed. A mobile application dedicated to data collection and monitoring was developed to facilitate the creation of an image dataset to train a WSSV recognition model and improve country-wide disease surveillance. The study also includes a thorough analysis of WSSV recognition to address the challenge of imbalanced learning and on-device inference. The models explored, MobileNetV3-Small and EfficientNetV2-B0, gained an F1-Score of 0.72 and 0.99 respectively. The saliency heatmaps of both models were also observed to uncover the "black-box" nature of these models and to gain insight as to what features in the images are most important in making a prediction. These results highlight the effectiveness and limitations of using models designed for resource-constrained devices and balancing their performance in accurately recognizing WSSV, providing valuable information and direction in the use of computer vision in this domain.
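Summary
The aquaculture industry, which relies heavily on shrimp exports, faces challenges from viral infections such as the White Spot Syndrome Virus (WSSV), which severely reduce yields. Computer vision can help identify features that are not immediately evident to skilled or untrained eyes, potentially shortening the time needed to report WSSV infections. This study addresses the challenge of limited data for WSSV recognition. A mobile application for data collection and monitoring was developed to help build an image dataset for training a WSSV recognition model and to improve disease surveillance. The study also analyzes WSSV recognition in depth to address imbalanced learning and on-device inference. The two models explored, MobileNetV3-Small and EfficientNetV2-B0, reached F1-scores of 0.72 and 0.99 respectively. Saliency heatmaps of both models were examined to see which image features matter most for a prediction. The results show both the effectiveness and the limitations of models designed for resource-constrained devices when recognizing WSSV, and offer direction for applying computer vision in this domain.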
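Because the study reports F1-scores on an imbalanced dataset, a short sketch may help show why F1 is more informative than accuracy in that setting. The label encoding (1 for infected, 0 for healthy) and the toy counts are assumptions for illustration and are not taken from the paper's data.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical imbalanced set: 95 healthy (0) and 5 infected (1) images.
labels = [0] * 95 + [1] * 5
always_healthy = [0] * 100  # a degenerate model that never flags infection
```

On this toy set the degenerate model is 95% accurate yet has an F1 of zero for the infected class, which is why a class-sensitive metric is the natural choice when infected samples are rare.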
Class-level Structural Relation Modelling and Smoothing for Visual Representation Learning
for: This paper targets visual representation learning, particularly for classes that have diverse visual patterns.
methods: The paper proposes a framework named CSRMS, which includes three modules: Class-level Relation Modelling, Class-aware Graph Sampling, and Relational Graph-Guided Representation Learning. These modules model a relational graph of the entire dataset and perform class-aware smoothing and regularization operations to alleviate intra-class visual diversity and inter-class similarity.
results: Experiments show that CSRMS can incorporate structured knowledge into visual representation learning and improve model performance. CSRMS can also be combined with existing state-of-the-art representation learning models for further gains.
Abstract
Representation learning for images has been advanced by recent progress in more complex neural models such as the Vision Transformers and new learning theories such as the structural causal models. However, these models mainly rely on the classification loss to implicitly regularize the class-level data distributions, and they may face difficulties when handling classes with diverse visual patterns. We argue that the incorporation of the structural information between data samples may improve this situation. To achieve this goal, this paper presents a framework termed Class-level Structural Relation Modeling and Smoothing for Visual Representation Learning (CSRMS), which includes the Class-level Relation Modelling, Class-aware Graph Sampling, and Relational Graph-Guided Representation Learning modules to model a relational graph of the entire dataset and perform class-aware smoothing and regularization operations to alleviate the issue of intra-class visual diversity and inter-class similarity. Specifically, the Class-level Relation Modelling module uses a clustering algorithm to learn the data distributions in the feature space and identify three types of class-level sample relations for the training set; the Class-aware Graph Sampling module extends the typical training batch construction process with three strategies to sample dataset-level sub-graphs; and the Relational Graph-Guided Representation Learning module employs a graph convolution network with knowledge-guided smoothing operations to ease the projection from different visual patterns to the same class. Experiments demonstrate the effectiveness of structured knowledge modelling for enhanced representation learning and show that CSRMS can be incorporated with any state-of-the-art visual representation learning models for performance gains. The source code and demos have been released at https://github.com/czt117/CSRMS.
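Summary
Image representation learning has been advanced by more complex neural models such as Vision Transformers and by new learning theories such as structural causal models. These models, however, rely mainly on the classification loss to implicitly regularize class-level data distributions, and may struggle with classes that have diverse visual patterns. The paper argues that incorporating structural information between data samples can improve this situation. It presents a framework named Class-level Structural Relation Modeling and Smoothing for Visual Representation Learning (CSRMS), made up of three modules: Class-level Relation Modelling, Class-aware Graph Sampling, and Relational Graph-Guided Representation Learning. Together they model a relational graph of the entire dataset and apply class-aware smoothing and regularization to alleviate intra-class visual diversity and inter-class similarity. Specifically, the Class-level Relation Modelling module uses a clustering algorithm to learn the data distributions in feature space and identifies three types of class-level sample relations for the training set; the Class-aware Graph Sampling module extends the typical training batch construction with three strategies for sampling dataset-level sub-graphs; and the Relational Graph-Guided Representation Learning module uses a graph convolution network with knowledge-guided smoothing operations to ease the projection from different visual patterns to the same class. Experiments show that structured knowledge modelling improves representation learning and that CSRMS can be combined with any state-of-the-art visual representation learning model for performance gains. The source code and demos are available at https://github.com/czt117/CSRMS.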
Comprehensive Assessment of the Performance of Deep Learning Classifiers Reveals a Surprising Lack of Robustness
for: The paper aims to evaluate the robustness of machine learning models, specifically deep neural networks, and to develop a benchmark for comprehensive evaluation of performance.
methods: The paper proposes using a wide range of different types of data to benchmark performance and using a single metric to produce a consistent evaluation of performance.
results: The paper finds that current deep neural networks are extremely vulnerable to making mistakes on certain types of data, and that they are insecure and unreliable in real-world scenarios where they may encounter data from many different domains.
Abstract
Reliable and robust evaluation methods are a necessary first step towards developing machine learning models that are themselves robust and reliable. Unfortunately, current evaluation protocols typically used to assess classifiers fail to comprehensively evaluate performance as they tend to rely on limited types of test data, and ignore others. For example, using the standard test data fails to evaluate the predictions made by the classifier to samples from classes it was not trained on. On the other hand, testing with data containing samples from unknown classes fails to evaluate how well the classifier can predict the labels for known classes. This article advocates benchmarking performance using a wide range of different types of data and using a single metric that can be applied to all such data types to produce a consistent evaluation of performance. Using such a benchmark it is found that current deep neural networks, including those trained with methods that are believed to produce state-of-the-art robustness, are extremely vulnerable to making mistakes on certain types of data. This means that such models will be unreliable in real-world scenarios where they may encounter data from many different domains, and that they are insecure as they can easily be fooled into making the wrong decisions. It is hoped that these results will motivate the wider adoption of more comprehensive testing methods that will, in turn, lead to the development of more robust machine learning methods in the future. Code is available at: https://codeberg.org/mwspratling/RobustnessEvaluation
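Summary
Reliable and robust evaluation methods are a necessary first step towards machine learning models that are themselves robust and reliable. Current evaluation protocols for classifiers, however, usually rely on limited types of test data and ignore others. For example, standard test data cannot evaluate a classifier's predictions on samples from classes it was not trained on, while test data containing samples from unknown classes cannot evaluate how well the classifier predicts labels for known classes. This article advocates benchmarking with a wide range of data types and a single metric that applies to all of them, so that performance is evaluated consistently. With such a benchmark, current deep neural networks, including those trained with methods believed to produce state-of-the-art robustness, turn out to be extremely vulnerable to mistakes on certain types of data. Such models will be unreliable in real-world scenarios where they meet data from many domains, and insecure because they can easily be fooled into wrong decisions. The authors hope the results will motivate wider adoption of more comprehensive testing and, in turn, more robust machine learning methods. Code is available at https://codeberg.org/mwspratling/RobustnessEvaluation.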
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation
paper_authors: Dongyang Yu, Shihao Wang, Yuan Fang, Wangpeng An
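for: This paper addresses multimodal data fusion and unlimited data generation, with the aim of improving AI's understanding and generation of complex real-world data.
methods: The paper combines several operations, including video/image caption extraction, dense caption extraction, Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), the Recognize Anything Model (RAM), and object tracking.
results: The final output converts each video input into a detailed sequential document, turning videos into thorough narratives that large language models can process more easily.
Abstract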
This paper presents OmniDataComposer, an innovative approach for multimodal data fusion and unlimited data generation with an intent to refine and uncomplicate interplay among diverse data modalities. Coming to the core breakthrough, it introduces a cohesive data structure proficient in processing and merging multimodal data inputs, which include video, audio, and text. Our crafted algorithm leverages advancements across multiple operations such as video/image caption extraction, dense caption extraction, Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), Recognize Anything Model (RAM), and object tracking. OmniDataComposer is capable of identifying over 6400 categories of objects, substantially broadening the spectrum of visual information. It amalgamates these diverse modalities, promoting reciprocal enhancement among modalities and facilitating cross-modal data correction. The final output metamorphoses each video input into an elaborate sequential document, virtually transmuting videos into thorough narratives, making them easier to be processed by large language models. Future prospects include optimizing datasets for each modality to encourage unlimited data generation. This robust base will offer priceless insights to models like ChatGPT, enabling them to create higher quality datasets for video captioning and easing question-answering tasks based on video content. OmniDataComposer inaugurates a new stage in multimodal learning, imparting enormous potential for augmenting AI's understanding and generation of complex, real-world data.
Summary
This paper introduces OmniDataComposer, an approach to multimodal data fusion and unlimited data generation. Its core contribution is a cohesive data structure that processes and merges multimodal inputs, including video, audio, and text. The algorithm draws on several operations: video/image caption extraction, dense caption extraction, Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), the Recognize Anything Model (RAM), and object tracking. The final output transforms each video input into an elaborate sequential document, turning videos into thorough narratives that are easier for large language models to process. Future work includes optimizing datasets for each modality to encourage unlimited data generation. This would give models such as ChatGPT a basis for creating higher-quality video captioning datasets and would ease question-answering tasks based on video content. OmniDataComposer opens a new stage in multimodal learning, with considerable potential for augmenting AI's understanding and generation of complex, real-world data.
Multimodal Color Recommendation in Vector Graphic Documents
paper_authors: Qianru Qiu, Xueting Wang, Mayu Otani
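for: This study aims to provide color recommendations based on textual context, to help designers choose suitable colors.
methods: The model uses self-attention networks to capture relationships between colors in multiple palettes, and cross-attention networks that integrate color and text representations in one model.
results: Experiments show that the method surpasses previous color palette completion methods on accuracy, color distribution, and user experience, and surpasses full palette generation methods on color diversity and similarity to the ground truth palettes.
Abstract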
Color selection plays a critical role in graphic document design and requires sufficient consideration of various contexts. However, recommending appropriate colors which harmonize with the other colors and textual contexts in documents is a challenging task, even for experienced designers. In this study, we propose a multimodal masked color model that integrates both color and textual contexts to provide text-aware color recommendation for graphic documents. Our proposed model comprises self-attention networks to capture the relationships between colors in multiple palettes, and cross-attention networks that incorporate both color and CLIP-based text representations. Our proposed method primarily focuses on color palette completion, which recommends colors based on the given colors and text. Additionally, it is applicable for another color recommendation task, full palette generation, which generates a complete color palette corresponding to the given text. Experimental results demonstrate that our proposed approach surpasses previous color palette completion methods on accuracy, color distribution, and user experience, as well as full palette generation methods concerning color diversity and similarity to the ground truth palettes.
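Summary
Color selection plays a critical role in graphic document design and requires consideration of various contexts. Recommending colors that harmonize with the other colors and the textual context of a document is challenging even for experienced designers. This study proposes a multimodal masked color model that integrates color and textual contexts to provide text-aware color recommendation for graphic documents. The model comprises self-attention networks that capture relationships between colors in multiple palettes, and cross-attention networks that incorporate both color and CLIP-based text representations. The method focuses mainly on color palette completion, which recommends colors given existing colors and text. It also applies to a second recommendation task, full palette generation, which produces a complete color palette for a given text. Experiments show that the approach surpasses previous color palette completion methods on accuracy, color distribution, and user experience, and surpasses full palette generation methods on color diversity and similarity to the ground truth palettes.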
From Unimodal to Multimodal: improving the sEMG-Based Pattern Recognition via deep generative models
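results: Tests on 6 databases, including 5 publicly available databases and a self-collected database in which 28 subjects performed 38 gestures with both sEMG and IMU data, show that the proposed method outperforms the unimodal HGR method (increases of 2.15%-13.10%). This indicates that virtual IMU signals generated by deep generative models can clearly improve the accuracy of sEMG-based gesture recognition.
Abstract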
Multimodal hand gesture recognition (HGR) systems can achieve higher recognition accuracy. However, acquiring multimodal gesture recognition data typically requires users to wear additional sensors, thereby increasing hardware costs. This paper proposes a novel generative approach to improve Surface Electromyography (sEMG)-based HGR accuracy via virtual Inertial Measurement Unit (IMU) signals. Specifically, we trained a deep generative model based on the intrinsic correlation between forearm sEMG signals and forearm IMU signals to generate virtual forearm IMU signals from the input forearm sEMG signals at first. Subsequently, the sEMG signals and virtual IMU signals were fed into a multimodal Convolutional Neural Network (CNN) model for gesture recognition. To evaluate the performance of the proposed approach, we conducted experiments on 6 databases, including 5 publicly available databases and our collected database comprising 28 subjects performing 38 gestures, containing both sEMG and IMU data. The results show that our proposed approach outperforms the sEMG-based unimodal HGR method (with increases of 2.15%-13.10%). It demonstrates that incorporating virtual IMU signals, generated by deep generative models, can significantly enhance the accuracy of sEMG-based HGR. The proposed approach represents a successful attempt to transition from unimodal HGR to multimodal HGR without additional sensor hardware.
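Summary
Multimodal hand gesture recognition (HGR) systems can achieve higher recognition accuracy. However, acquiring multimodal gesture data usually requires users to wear additional sensors, which increases hardware costs. This paper proposes a generative approach that improves surface electromyography (sEMG)-based HGR accuracy by means of virtual Inertial Measurement Unit (IMU) signals. A deep generative model, trained on the intrinsic correlation between forearm sEMG signals and forearm IMU signals, first generates virtual forearm IMU signals from the input sEMG signals. The sEMG signals and virtual IMU signals are then fed into a multimodal Convolutional Neural Network (CNN) for gesture recognition. The approach was evaluated on 6 databases: 5 publicly available databases and a collected database of 28 subjects performing 38 gestures, containing both sEMG and IMU data. The proposed approach outperforms the sEMG-based unimodal HGR method, with increases of 2.15%-13.10%. This shows that virtual IMU signals generated by deep generative models can significantly improve sEMG-based HGR accuracy, and that it is possible to move from unimodal to multimodal HGR without additional sensor hardware.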
3D Gaussian Splatting for Real-Time Radiance Field Rendering
paper_authors: Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis
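for: Achieve high-quality novel-view synthesis for complete scenes at high resolution.
methods: Represent the scene with 3D Gaussians and perform interleaved optimization/density control to obtain an accurate scene representation.
results: Achieves state-of-the-art visual quality and real-time rendering, with leading results on several evaluation datasets.
Abstract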
Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.
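Summary
Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. Achieving high visual quality, however, still requires neural networks that are costly to train and render, while recent faster methods trade quality for speed. For unbounded and complete scenes (rather than isolated objects) rendered at 1080p, no current method achieves real-time display rates. The paper introduces three key elements that deliver state-of-the-art visual quality with competitive training times and, importantly, high-quality real-time (>= 30 fps) novel-view synthesis at 1080p. First, starting from the sparse points produced during camera calibration, the scene is represented with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space. Second, optimization is interleaved with density control of the 3D Gaussians, notably optimizing anisotropic covariance to represent the scene accurately. Third, a fast visibility-aware rendering algorithm supports anisotropic splatting, accelerating training and allowing real-time rendering. The method demonstrates state-of-the-art visual quality and real-time rendering on several established datasets.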
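The phrase "anisotropic covariance" can be made concrete with a small sketch. The abstract does not give a parameterization, so the code below assumes a common one: a covariance built from per-axis scales and a rotation quaternion as Σ = R S Sᵀ Rᵀ, which keeps the matrix symmetric and positive semi-definite during optimization. The function names are hypothetical.

```python
import math

def quat_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n  # normalize to a unit quaternion
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def gaussian_covariance(scales, quat):
    """Assumed parameterization: Sigma = R S S^T R^T with S = diag(scales)."""
    R = quat_to_rotation(quat)
    return [
        [sum(R[i][k] * scales[k] ** 2 * R[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]

# A Gaussian stretched along its local z-axis, rotated 90 degrees about z.
half = math.pi / 4
sigma = gaussian_covariance((1.0, 2.0, 3.0), (math.cos(half), 0.0, 0.0, math.sin(half)))
```

Optimizing scales and a rotation rather than the nine matrix entries directly is one way to guarantee a valid covariance at every gradient step; whether a given implementation does exactly this should be checked against its source.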
Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction on Monocular RGB Video
results: Experiments show that the proposed method reaches a new state of the art, achieving the best performance on public benchmarks.
Abstract
Reconstructing interacting hands from monocular RGB data is a challenging task, as it involves many interfering factors, e.g. self- and mutual occlusion and similar textures. Previous works only leverage information from a single RGB image without modeling their physically plausible relation, which leads to inferior reconstruction results. In this work, we are dedicated to explicitly exploiting spatial-temporal information to achieve better interacting hand reconstruction. On one hand, we leverage temporal context to complement insufficient information provided by the single frame, and design a novel temporal framework with a temporal constraint for interacting hand motion smoothness. On the other hand, we further propose an interpenetration detection module to produce kinetically plausible interacting hands without physical collisions. Extensive experiments are performed to validate the effectiveness of our proposed framework, which achieves new state-of-the-art performance on public benchmarks.
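Summary
Reconstructing interacting hands from monocular RGB data is challenging because of many interfering factors, such as self-occlusion, mutual occlusion, and similar textures. Previous works use information from a single RGB image only and do not model the physically plausible relation between the hands, which leads to inferior reconstructions. This work explicitly exploits spatial-temporal information to reconstruct interacting hands better. On the one hand, it uses temporal context to complement the insufficient information in a single frame, and designs a temporal framework with a temporal constraint that enforces smooth interacting hand motion. On the other hand, it proposes an interpenetration detection module that produces physically plausible interacting hands without collisions. Extensive experiments validate the framework, which achieves new state-of-the-art performance on public benchmarks.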
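A temporal constraint for motion smoothness can take many forms, and the abstract does not say which one is used. As a minimal sketch, the code below assumes one simple choice: penalizing the squared second-order difference (acceleration) of each joint across consecutive frames. The function name and the data layout are hypothetical.

```python
def temporal_smoothness(joint_seq):
    """Mean squared joint acceleration over a sequence (assumed penalty).

    joint_seq is a list of frames; each frame is a list of (x, y, z) joints.
    Constant-velocity motion gives zero; frame-to-frame jitter is penalized.
    """
    if len(joint_seq) < 3:
        return 0.0
    total, count = 0.0, 0
    for prev, curr, nxt in zip(joint_seq, joint_seq[1:], joint_seq[2:]):
        for p, c, n in zip(prev, curr, nxt):
            # Second-order finite difference per coordinate.
            total += sum((ni - 2 * ci + pi) ** 2 for pi, ci, ni in zip(p, c, n))
            count += 1
    return total / count

# One joint moving at constant velocity versus one that jitters back and forth.
smooth = [[(float(t), 0.0, 0.0)] for t in range(4)]
jitter = [[(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
```

A term like this would typically be added to the per-frame reconstruction loss with a small weight, so that it suppresses jitter without flattening genuine hand motion.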
results: The method significantly improves the alignment between the scene description and the generated scene.
Abstract
Guided synthesis of high-quality 3D scenes is a challenging task. Diffusion models have shown promise in generating diverse data, including 3D scenes. However, current methods rely directly on text embeddings for controlling the generation, limiting the incorporation of complex spatial relationships between objects. We propose a novel approach for 3D scene diffusion guidance using scene graphs. To leverage the relative spatial information the scene graphs provide, we make use of relational graph convolutional blocks within our denoising network. We show that our approach significantly improves the alignment between scene description and generated scene.
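Summary
Guided synthesis of high-quality 3D scenes is a challenging task. Diffusion models have shown promise in generating diverse data, including 3D scenes. Current methods, however, control generation directly through text embeddings, which limits how well complex spatial relationships between objects can be incorporated. The paper proposes a new approach that guides 3D scene diffusion with scene graphs, that is, graphs representing the relationships between objects in a scene. To exploit the relative spatial information that scene graphs provide, it uses relational graph convolutional blocks inside the denoising network. The approach significantly improves the alignment between the scene description and the generated scene.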
ConDistFL: Conditional Distillation for Federated Learning from Partially Annotated Data
for: simultaneously delineating multiple organs and diseases
methods: federated learning (FL) with knowledge distillation
results: outperforms FedAvg and FedOpt baselines, superior generalizability on external test dataset, can perform well without frequent aggregation
Abstract
Developing a generalized segmentation model capable of simultaneously delineating multiple organs and diseases is highly desirable. Federated learning (FL) is a key technology enabling the collaborative development of a model without exchanging training data. However, the limited access to fully annotated training data poses a major challenge to training generalizable models. We propose "ConDistFL", a framework to solve this problem by combining FL with knowledge distillation. Local models can extract the knowledge of unlabeled organs and tumors from partially annotated data from the global model with an adequately designed conditional probability representation. We validate our framework on four distinct partially annotated abdominal CT datasets from the MSD and KiTS19 challenges. The experimental results show that the proposed framework significantly outperforms FedAvg and FedOpt baselines. Moreover, the performance on an external test dataset demonstrates superior generalizability compared to models trained on each dataset separately. Our ablation study suggests that ConDistFL can perform well without frequent aggregation, reducing the communication cost of FL. Our implementation will be available at https://github.com/NVIDIA/NVFlare/tree/dev/research/condist-fl.
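Summary
A generalized segmentation model that can delineate multiple organs and diseases simultaneously is highly desirable. Federated learning (FL) is a key technology that enables collaborative model development without exchanging training data. Limited access to fully annotated training data, however, makes it hard to train generalizable models. The paper proposes ConDistFL, a framework that addresses this problem by combining FL with knowledge distillation. With a suitably designed conditional probability representation, local models can extract knowledge about unlabeled organs and tumors from the global model when training on partially annotated data. The framework is validated on four distinct partially annotated abdominal CT datasets. It significantly outperforms the FedAvg and FedOpt baselines, and on an external test dataset it generalizes better than models trained on each dataset separately. The ablation study suggests that ConDistFL performs well without frequent aggregation, which reduces the communication cost of FL. The implementation will be available at https://github.com/NVIDIA/NVFlare/tree/dev/research/condist-fl.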
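For readers unfamiliar with the FedAvg baseline named above, the aggregation step can be sketched in a few lines. This is the standard sample-size-weighted average of client parameters, not the ConDistFL method itself, and it does not use the NVFlare API; the parameter layout (a dict of flat lists) and the function name are simplifications assumed for illustration.

```python
def fed_avg(client_weights, client_sizes):
    """Aggregate client parameters, weighting each client by its sample count.

    client_weights: list of dicts mapping a parameter name to a flat list of floats.
    client_sizes: number of training samples held by each client.
    """
    total = sum(client_sizes)
    aggregated = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        aggregated[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return aggregated

# Two hypothetical clients; the second holds three times as much data.
global_weights = fed_avg(
    [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}],
    [1, 3],
)
```

Each call to an aggregation step like this costs one round of communication, which is why a method that tolerates infrequent aggregation reduces the overall communication cost.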
Backdoor Federated Learning by Poisoning Backdoor-Critical Layers
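methods: The study proposes an in-situ approach, from the attacker's perspective, that identifies and verifies the backdoor-critical (BC) layers of an FL model, the small subset of layers that dominate its vulnerabilities. Based on the BC layers, it also proposes a new backdoor attack methodology that adaptively balances attack effect and stealthiness under different defense strategies.
results: Extensive experiments show that the BC layer-aware backdoor attacks can successfully backdoor FL under seven state-of-the-art defenses and outperform the latest backdoor attack methods.
Abstract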
Federated learning (FL) has been widely deployed to enable machine learning training on sensitive data across distributed devices. However, the decentralized learning paradigm and heterogeneity of FL further extend the attack surface for backdoor attacks. Existing FL attack and defense methodologies typically focus on the whole model. None of them recognizes the existence of backdoor-critical (BC) layers, a small subset of layers that dominate the model vulnerabilities. Attacking the BC layers achieves equivalent effects as attacking the whole model but at a far smaller chance of being detected by state-of-the-art (SOTA) defenses. This paper proposes a general in-situ approach that identifies and verifies BC layers from the perspective of attackers. Based on the identified BC layers, we carefully craft a new backdoor attack methodology that adaptively seeks a fundamental balance between attacking effects and stealthiness under various defense strategies. Extensive experiments show that our BC layer-aware backdoor attacks can successfully backdoor FL under seven SOTA defenses with only 10% malicious clients and outperform the latest backdoor attack methods.
Summary
* Backdoor-critical (BC) layers are a small subset of layers in a machine learning model that dominate the model's vulnerabilities.
* The proposed approach identifies and verifies BC layers from the perspective of attackers.
* The new backdoor attack methodology adaptively seeks a balance between attacking effects and stealthiness under various defense strategies.
* The approach can successfully backdoor FL under seven state-of-the-art defenses with only 10% malicious clients and outperforms the latest backdoor attack methods.
An Empirical Analysis of Range for 3D Object Detection
paper_authors: Neehar Peri, Mengtian Li, Benjamin Wilson, Yu-Xiong Wang, James Hays, Deva Ramanan
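for: This paper studies long-range 3D detection for the safe navigation of autonomous vehicles.
methods: The paper runs an empirical analysis on the Argoverse 2.0 dataset and finds that near-field LiDAR measurements are dense and suit small voxels, while far-field measurements are sparse and suit large voxels. It also proposes a set of range experts tuned for near- versus far-field detection, along with simple techniques for ensembling the models efficiently for long-range detection.
results: The range experts and ensembling techniques improve the efficiency of long-range detection by 33% and its accuracy by 3.2% CDS.
Abstract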
LiDAR-based 3D detection plays a vital role in autonomous navigation. Surprisingly, although autonomous vehicles (AVs) must detect both near-field objects (for collision avoidance) and far-field objects (for longer-term planning), contemporary benchmarks focus only on near-field 3D detection. However, AVs must detect far-field objects for safe navigation. In this paper, we present an empirical analysis of far-field 3D detection using the long-range detection dataset Argoverse 2.0 to better understand the problem, and share the following insight: near-field LiDAR measurements are dense and optimally encoded by small voxels, while far-field measurements are sparse and are better encoded with large voxels. We exploit this observation to build a collection of range experts tuned for near-vs-far field detection, and propose simple techniques to efficiently ensemble models for long-range detection that improve efficiency by 33% and boost accuracy by 3.2% CDS.
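Summary
LiDAR-based 3D detection plays a key role in autonomous driving. Surprisingly, although autonomous vehicles (AVs) must detect both near-field objects (to avoid collisions) and far-field objects (for longer-term planning), current benchmarks focus only on near-field 3D detection. AVs nonetheless need to detect far-field objects to navigate safely. This paper presents an empirical analysis of far-field 3D detection on the long-range Argoverse 2.0 dataset and shares the following finding: near-field LiDAR measurements are dense and best encoded with small voxels, while far-field measurements are sparse and better encoded with large voxels. Building on this observation, the authors construct range experts tuned for near- versus far-field detection and propose simple techniques for ensembling the models efficiently, improving both the efficiency and the accuracy of long-range detection.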
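The near/far observation can be illustrated with a small routing sketch: each LiDAR point is assigned to a range expert by its distance from the sensor and quantized with that expert's voxel size. The 50 m boundary and the voxel edges (0.125 m and 0.5 m) are hypothetical values chosen for illustration, not the paper's settings, and the function name is an assumption.

```python
import math
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]

# Hypothetical experts: (name, maximum range in metres, voxel edge in metres).
RANGE_EXPERTS = [("near", 50.0, 0.125), ("far", math.inf, 0.5)]


def route_and_voxelize(points: List[Point]) -> Dict[str, Set[Voxel]]:
    """Group points by range expert and return each expert's occupied voxels."""
    occupied: Dict[str, Set[Voxel]] = {name: set() for name, _, _ in RANGE_EXPERTS}
    for x, y, z in points:
        distance = math.hypot(x, y)  # ground-plane range from the sensor
        for name, max_range, voxel in RANGE_EXPERTS:
            if distance < max_range:
                index = (math.floor(x / voxel),
                         math.floor(y / voxel),
                         math.floor(z / voxel))
                occupied[name].add(index)
                break  # each point goes to exactly one expert
    return occupied


cloud = [(1.0, 2.0, 0.0), (1.02, 2.03, 0.0), (120.0, 5.0, 1.0), (120.3, 5.1, 1.0)]
voxels = route_and_voxelize(cloud)
```

In this toy cloud the two far points, about 30 cm apart, fall into the same large voxel, which is the kind of pooling that helps when returns are sparse.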
Implicit neural representations for joint decomposition and registration of gene expression images in the marmoset brain
results: In experiments, the method provides excellent results and outperforms other registration techniques.
Abstract
We propose a novel image registration method based on implicit neural representations that addresses the challenging problem of registering a pair of brain images with similar anatomical structures, but where one image contains additional features or artifacts that are not present in the other image. To demonstrate its effectiveness, we use 2D microscopy $\textit{in situ}$ hybridization gene expression images of the marmoset brain. Accurately quantifying gene expression requires image registration to a brain template, which is difficult due to the diversity of patterns causing variations in visible anatomical brain structures. Our approach uses implicit networks in combination with an image exclusion loss to jointly perform the registration and decompose the image into a support and residual image. The support image aligns well with the template, while the residual image captures individual image characteristics that diverge from the template. In experiments, our method provided excellent results and outperformed other registration techniques.
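Summary
We propose a new image registration method based on implicit neural representations for registering a pair of brain images that share similar anatomical structures but where one image contains features or artifacts absent from the other. To demonstrate its effectiveness, we use 2D microscopy in situ hybridization gene expression images of the marmoset brain. Accurately quantifying gene expression requires registering each image to a brain template, which is difficult because the diversity of expression patterns changes which anatomical structures are visible. The method combines implicit networks with an image exclusion loss to perform registration and image decomposition jointly. The support image aligns well with the template, while the residual image captures the individual image characteristics that diverge from it. In experiments, the method performs very well and outperforms other registration techniques.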
Synthetic Augmentation with Large-scale Unconditional Pre-training
results: After pre-training on three histopathology datasets and testing on a colorectal cancer (CRC) dataset, augmented training with a small labeled set improves classification accuracy by 6.4%.
Abstract
Deep learning based medical image recognition systems often require a substantial amount of training data with expert annotations, which can be expensive and time-consuming to obtain. Recently, synthetic augmentation techniques have been proposed to mitigate the issue by generating realistic images conditioned on class labels. However, the effectiveness of these methods heavily depends on the representation capability of the trained generative model, which cannot be guaranteed without sufficient labeled training data. To further reduce the dependency on annotated data, we propose a synthetic augmentation method called HistoDiffusion, which can be pre-trained on large-scale unlabeled datasets and later applied to a small-scale labeled dataset for augmented training. In particular, we train a latent diffusion model (LDM) on diverse unlabeled datasets to learn common features and generate realistic images without conditional inputs. Then, we fine-tune the model with classifier guidance in latent space on an unseen labeled dataset so that the model can synthesize images of specific categories. Additionally, we adopt a selective mechanism to only add synthetic samples with high confidence of matching to target labels. We evaluate our proposed method by pre-training on three histopathology datasets and testing on a histopathology dataset of colorectal cancer (CRC) excluded from the pre-training datasets. With HistoDiffusion augmentation, the classification accuracy of a backbone classifier is remarkably improved by 6.4% using a small set of the original labels. Our code is available at https://github.com/karenyyy/HistoDiffAug.
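Summary
Medical image recognition systems often need large amounts of training data with expert annotations, which are expensive and time-consuming to obtain. Synthetic augmentation techniques have recently been proposed to generate images conditioned on class labels. Their effectiveness, however, depends on the representation capability of the trained generative model, which cannot be guaranteed without sufficient labeled data. To further reduce the dependency on annotated data, we propose a synthetic augmentation method called HistoDiffusion. We first train a latent diffusion model (LDM) on large amounts of unlabeled data to learn common features and generate realistic images. We then fine-tune the model on an unseen labeled dataset so that it can generate images of specific categories. In addition, a selective mechanism adds only synthetic samples that match the target labels with high confidence. We pre-train on three histopathology datasets and test on a colorectal cancer (CRC) dataset excluded from pre-training. With HistoDiffusion augmentation, the classification accuracy of a backbone classifier improves by 6.4% using only a small set of the original labels. The code is available at https://github.com/karenyyy/HistoDiffAug.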
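The selective mechanism described above can be sketched as a confidence filter: a synthetic sample is kept only if a classifier assigns high probability to the label it was generated for. The threshold of 0.9, the tuple layout, and the function name are assumptions for illustration; the abstract does not state how confidence is computed.

```python
from typing import List, Sequence, Tuple

# (sample id, target label index, classifier probabilities per class)
Sample = Tuple[str, int, Sequence[float]]


def select_confident(samples: Sequence[Sample], threshold: float = 0.9) -> List[str]:
    """Keep synthetic samples whose target-label probability meets the threshold."""
    kept = []
    for sample_id, target_label, probabilities in samples:
        if probabilities[target_label] >= threshold:
            kept.append(sample_id)
    return kept


synthetic = [
    ("img_a", 0, [0.95, 0.05]),  # confident match, kept
    ("img_b", 1, [0.40, 0.60]),  # ambiguous, dropped
    ("img_c", 1, [0.03, 0.97]),  # confident match, kept
]
accepted = select_confident(synthetic)
```

A stricter threshold trades the number of augmented samples for label purity, so in practice it would be tuned on held-out data.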
Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning
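methods: This study proposes a simple and scalable framework called Composition Transformer (CoT). It uses object and attribute experts that exploit the visual network hierarchically to generate representative embeddings. The object expert extracts representative object embeddings from the final layer in a bottom-up manner, while the attribute expert produces attribute embeddings with a proposed object-guided attention module that models contextuality explicitly.
results: The method achieves state-of-the-art performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL. The authors also show that CoT improves visual discrimination and mitigates model bias. Code is available at https://github.com/HanjaeKim98/CoT.
Abstract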
Compositional zero-shot learning (CZSL) aims to recognize unseen compositions with prior knowledge of known primitives (attribute and object). Previous works for CZSL often suffer from grasping the contextuality between attribute and object, as well as the discriminability of visual features, and the long-tailed distribution of real-world compositional data. We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues. CoT employs object and attribute experts in distinctive manners to generate representative embeddings, using the visual network hierarchically. The object expert extracts representative object embeddings from the final layer in a bottom-up manner, while the attribute expert makes attribute embeddings in a top-down manner with a proposed object-guided attention module that models contextuality explicitly. To remedy biased prediction caused by imbalanced data distribution, we develop a simple minority attribute augmentation (MAA) that synthesizes virtual samples by mixing two images and oversampling minority attribute classes. Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL. We also demonstrate the effectiveness of CoT in improving visual discrimination and addressing the model bias from the imbalanced data distribution. The code is available at https://github.com/HanjaeKim98/CoT.
Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval
paper_authors: Yunquan Zhu, Xinkai Gao, Bo Ke, Ruizhi Qiao, Xing Sun
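for: Efficient and accurate single-stage image retrieval.
methods: The paper proposes a Coarse-to-Fine framework for learning a Compact Discriminative representation (CFCD) that needs only image-level labels for training. First, it designs an adaptive softmax-based loss that dynamically tunes its scale and margin within each mini-batch to strengthen supervision during training and intra-class compactness. Second, it proposes a mechanism that selects prominent local descriptors and, through a hard negative sampling strategy, infuses fine-grained semantic relations into the global representation to optimize inter-class distinctiveness at a global scale.
results: Experiments demonstrate the method's effectiveness; it achieves state-of-the-art single-stage image retrieval performance on benchmarks such as Revisited Oxford and Revisited Paris.
Abstract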
Image retrieval aims to find images in a database that are visually similar to a query image. Two-stage methods following the retrieve-and-rerank paradigm have achieved excellent performance, but their separate local and global modules are inefficient for real-world applications. To better trade off retrieval efficiency and accuracy, some approaches fuse global and local features into a joint representation to perform single-stage image retrieval. However, these approaches still face challenges from varied conditions, e.g., background, occlusion, and viewpoint. In this work, we design a Coarse-to-Fine framework to learn a Compact Discriminative representation (CFCD) for end-to-end single-stage image retrieval, requiring only image-level labels. Specifically, we first design a novel adaptive softmax-based loss that dynamically tunes its scale and margin within each mini-batch and increases them progressively to strengthen supervision during training and intra-class compactness. Furthermore, we propose a mechanism that attentively selects prominent local descriptors and infuses fine-grained semantic relations into the global representation through a hard negative sampling strategy to optimize inter-class distinctiveness at a global scale. Extensive experimental results demonstrate the effectiveness of our method, which achieves state-of-the-art single-stage image retrieval performance on benchmarks such as Revisited Oxford and Revisited Paris. Code is available at https://github.com/bassyess/CFCD.
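Summary
Image retrieval aims to find images in a database that are visually similar to a query image. Two-stage methods that follow the retrieve-and-rerank paradigm perform very well, but their separate local and global modules are inefficient in real-world applications. To balance retrieval efficiency and accuracy, some approaches fuse global and local features into a joint representation for single-stage retrieval, yet they still struggle with challenges such as background, occlusion, and viewpoint. This work designs a Coarse-to-Fine framework that learns a Compact Discriminative representation (CFCD) for end-to-end single-stage image retrieval using only image-level labels. First, a new adaptive softmax-based loss dynamically tunes its scale and margin within each mini-batch to strengthen supervision during training and intra-class compactness. Second, a mechanism selects prominent local descriptors and, through a hard negative sampling strategy, infuses them into the global representation to improve inter-class distinctiveness at a global scale. Experiments show that the method achieves state-of-the-art single-stage image retrieval performance on benchmarks such as Revisited Oxford and Revisited Paris. Code is available at https://github.com/bassyess/CFCD.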
Few-shot medical image classification with simple shape and texture text descriptors using vision-language models
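results: The results indicate that binary few-shot classification of medical images using VLMs and GPT-4 generated descriptors is a viable approach. Accurate classification, however, requires excluding certain descriptors from the calculation of the classification scores. The authors also assess the ability of VLMs to evaluate shape features in breast mass ultrasound images and investigate the variability among the descriptor sets produced by GPT-4. The work provides several insights into the application of VLMs to medical image analysis.
Abstract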
In this work, we investigate the usefulness of vision-language models (VLMs) and large language models for binary few-shot classification of medical images. We utilize the GPT-4 model to generate text descriptors that encapsulate the shape and texture characteristics of objects in medical images. Subsequently, these GPT-4 generated descriptors, alongside VLMs pre-trained on natural images, are employed to classify chest X-rays and breast ultrasound images. Our results indicate that few-shot classification of medical images using VLMs and GPT-4 generated descriptors is a viable approach. However, accurate classification requires excluding certain descriptors from the calculation of the classification scores. Moreover, we assess the ability of VLMs to evaluate shape features in breast mass ultrasound images. We further investigate the degree of variability among the sets of text descriptors produced by GPT-4. Our work provides several important insights about the application of VLMs for medical image analysis.
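Summary
This study investigates whether vision-language models (VLMs) and large language models can be used for binary few-shot classification of medical images. We use GPT-4 to generate text descriptors of the shape and texture of objects in medical images. These GPT-4 generated descriptors, together with VLMs pre-trained on natural images, are then used to classify chest X-rays and breast ultrasound images. The results show that binary classification of medical images with VLMs and GPT-4 generated descriptors is a viable approach. Accurate classification, however, requires excluding certain descriptors from the calculation of the classification scores. We also assess the ability of VLMs to evaluate shape features in breast mass ultrasound images, and we investigate the variability among the descriptor sets generated by GPT-4. The study provides important findings on the application of VLMs to medical image analysis.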
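One way to picture descriptor-based scoring with exclusion is sketched below. It assumes, purely for illustration, that a class score is the mean cosine similarity between an image embedding and the embeddings of that class's text descriptors; the abstract does not specify the scoring rule, and the embeddings, descriptor strings, and function names here are hypothetical.

```python
import math
from typing import Dict, Iterable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("zero-length embedding")
    return dot / norm


def class_score(image_embedding: Sequence[float],
                descriptor_embeddings: Dict[str, Sequence[float]],
                excluded: Iterable[str] = ()) -> float:
    """Mean similarity over a class's descriptors, skipping excluded ones."""
    skip = set(excluded)
    sims = [cosine(image_embedding, emb)
            for name, emb in descriptor_embeddings.items() if name not in skip]
    if not sims:
        raise ValueError("all descriptors were excluded")
    return sum(sims) / len(sims)


# Toy 2-D embeddings; real VLM embeddings have hundreds of dimensions.
image = [1.0, 0.0]
descriptors = {"irregular shape": [1.0, 0.0], "heterogeneous texture": [0.0, 1.0]}
score_all = class_score(image, descriptors)
score_kept = class_score(image, descriptors, excluded=["heterogeneous texture"])
```

Dropping an uninformative descriptor changes the class score, which is consistent with the finding that descriptor selection affects accuracy.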
Real-time Strawberry Detection Based on Improved YOLOv5s Architecture for Robotic Harvesting in open-field environment
for: The paper proposes a custom object detection model based on YOLOv5 for strawberry detection in open-field environments.
methods: The proposed model modifies the original YOLOv5 architecture by replacing the C3 module with C2f and combining Spatial Pyramid Pooling Fast with Cross Stage Partial Net. The model is trained on a dataset of RGB images of strawberry canopies with three maturity classes.
results: The proposed model achieves the highest mean average precision of 80.3% among five compared models, with an inference speed of 18ms per image. The model outperforms the latest YOLOv8s in terms of average precision in the immature and mature classes, while being faster and having fewer parameters.
Abstract
This study proposes a YOLOv5-based custom object detection model to detect strawberries in an outdoor environment. First, the original YOLOv5s architecture was modified by replacing the C3 module with the C2f module in the backbone network, which provided a better feature gradient flow. Second, the Spatial Pyramid Pooling Fast in the final layer of the YOLOv5s backbone network was combined with Cross Stage Partial Net to improve the generalization ability over the strawberry dataset in this study. The proposed architecture was named YOLOv5s-Straw. An RGB image dataset of the strawberry canopy with three maturity classes (immature, nearly mature, and mature) was collected in an open-field environment and augmented through a series of operations including brightness reduction, brightness increase, and noise addition. To verify the superiority of the proposed method for strawberry detection in an open-field environment, four competitive detection models (YOLOv3-tiny, YOLOv5s, YOLOv5s-C2f, and YOLOv8s) were trained and tested under the same computational environment and compared with YOLOv5s-Straw. The proposed architecture achieved the highest mean average precision of 80.3%, whereas YOLOv3-tiny, YOLOv5s, YOLOv5s-C2f, and YOLOv8s achieved 73.4%, 77.8%, 79.8%, and 79.3%, respectively. Specifically, the average precision of YOLOv5s-Straw was 82.1% in the immature class, 73.5% in the nearly mature class, and 86.6% in the mature class; the immature and mature values were 2.3% and 3.7% higher, respectively, than those of the latest YOLOv8s. The model has 8.6*10^6 network parameters and an inference speed of 18ms per image, while YOLOv8s is slower at 21.0ms and heavier at 11.1*10^6 parameters, which indicates that the proposed model is fast enough for real-time strawberry detection and localization for robotic picking.
Summary
The dataset used in this study consisted of RGB images of strawberry canopies with three maturity classes (immature, nearly mature, and mature) collected in an open-field environment. The images were augmented using brightness reduction, brightness increase, and noise addition. To evaluate the performance of the proposed model, four competitive detection models (YOLOv3-tiny, YOLOv5s, YOLOv5s-C2f, and YOLOv8s) were trained and tested under the same computational environment. The results showed that the proposed model achieved the highest mean average precision of 80.3%, outperforming the other models by 0.5 to 6.9 percentage points. Specifically, the average precision of YOLOv5s-Straw was 82.1% in the immature class, 73.5% in the nearly mature class, and 86.6% in the mature class. The proposed model includes 8.6 million network parameters and has an inference speed of 18ms per image, which is fast enough for real-time strawberry detection and localization for robotic picking. In comparison, YOLOv8s has heavier parameters (11.1 million) and a slower inference speed (21.0ms). Overall, the proposed YOLOv5s-Straw model outperformed the other compared models for strawberry detection in open-field environments and is a promising solution for robotic strawberry picking applications.
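The margins implied by the reported mAP figures can be checked with a few lines of arithmetic. The snippet only uses the percentages quoted in the abstract; the variable names are chosen for this example.

```python
# Mean average precision (%) reported in the abstract for the four baselines.
baseline_map = {
    "YOLOv3-tiny": 73.4,
    "YOLOv5s": 77.8,
    "YOLOv5s-C2f": 79.8,
    "YOLOv8s": 79.3,
}
proposed_map = 80.3  # YOLOv5s-Straw

# Gap in percentage points between the proposed model and each baseline.
gaps = {name: round(proposed_map - value, 1) for name, value in baseline_map.items()}
smallest_gap = min(gaps.values())
largest_gap = max(gaps.values())
```

The gaps range from 0.5 points (over YOLOv5s-C2f) to 6.9 points (over YOLOv3-tiny), with a 1.0-point lead over YOLOv8s.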
PARTNER: Level up the Polar Representation for LiDAR 3D Object Detection
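results: Compared with previous polar-based methods, the approach achieves notable improvements of 3.68% and 9.15% on the Waymo and ONCE validation sets, and it reaches competitive results in streaming-based detection and under different resolutions.
Abstract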
Recently, polar-based representation has shown promising properties in perceptual tasks. In addition to Cartesian-based approaches, which separate point clouds unevenly, representing point clouds as polar grids has been recognized as an alternative due to (1) its advantage in robust performance under different resolutions and (2) its superiority in streaming-based approaches. However, state-of-the-art polar-based detection methods inevitably suffer from the feature distortion problem because of the non-uniform division of polar representation, resulting in a non-negligible performance gap compared to Cartesian-based approaches. To tackle this issue, we present PARTNER, a novel 3D object detector in the polar coordinate. PARTNER alleviates the dilemma of feature distortion with global representation re-alignment and facilitates the regression by introducing instance-level geometric information into the detection head. Extensive experiments show overwhelming advantages in streaming-based detection and different resolutions. Furthermore, our method outperforms the previous polar-based works with remarkable margins of 3.68% and 9.15% on Waymo and ONCE validation set, thus achieving competitive results over the state-of-the-art methods.
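Summary
Polar-based representations have recently shown promising properties in perception tasks. Alongside Cartesian-based approaches, which partition point clouds unevenly, representing point clouds as polar grids is recognized as an alternative because of its robust performance under different resolutions and its advantages for streaming-based processing. However, current polar-based detection methods inevitably suffer from feature distortion, which is caused by the non-uniform division of the polar grid. To address this, we propose PARTNER, a new 3D object detector. PARTNER alleviates feature distortion through global representation re-alignment and adds instance-level geometric information to the detection head to facilitate regression. The method shows large advantages in streaming-based detection and under different resolutions, and it outperforms previous polar-based methods by margins of 3.68% and 9.15% on the Waymo and ONCE validation sets, achieving results competitive with state-of-the-art methods.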
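The non-uniform division that causes feature distortion can be seen in a small sketch of a polar bird's-eye-view grid: cells have a fixed angular width, so their physical size grows with range. The grid resolution (1 m rings, 360 sectors) and the function names are assumptions for illustration and are not taken from PARTNER.

```python
import math
from typing import Tuple


def polar_cell(x: float, y: float,
               radial_step: float = 1.0, num_sectors: int = 360) -> Tuple[int, int]:
    """Return the (ring, sector) index of a point in a polar bird's-eye-view grid."""
    radius = math.hypot(x, y)
    azimuth = math.atan2(y, x) % (2.0 * math.pi)  # wrap to [0, 2*pi)
    ring = int(radius // radial_step)
    sector = int(azimuth / (2.0 * math.pi) * num_sectors) % num_sectors
    return ring, sector


def cell_arc_length(ring: int,
                    radial_step: float = 1.0, num_sectors: int = 360) -> float:
    """Arc length of a cell at the ring's outer edge; it grows linearly with range."""
    return (ring + 1) * radial_step * 2.0 * math.pi / num_sectors


near_cell = polar_cell(10.5, 0.0)
side_cell = polar_cell(0.0, 5.5)
growth = cell_arc_length(49) / cell_arc_length(0)
```

With these settings a cell in the 50th ring is about 50 times wider than one in the first ring, so the same object covers very different numbers of cells depending on its distance.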
PAIF: Perception-Aware Infrared-Visible Image Fusion for Attack-Tolerant Semantic Segmentation
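results: Experiments show that the scheme substantially improves segmentation robustness, with a gain of 15.3% mIOU in adversarial scenes compared with advanced competitors.
Abstract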
Infrared and visible image fusion is a powerful technique that combines complementary information from different modalities for downstream semantic perception tasks. Existing learning-based methods show remarkable performance, but are suffering from the inherent vulnerability of adversarial attacks, causing a significant decrease in accuracy. In this work, a perception-aware fusion framework is proposed to promote segmentation robustness in adversarial scenes. We first conduct systematic analyses about the components of image fusion, investigating the correlation with segmentation robustness under adversarial perturbations. Based on these analyses, we propose a harmonized architecture search with a decomposition-based structure to balance standard accuracy and robustness. We also propose an adaptive learning strategy to improve the parameter robustness of image fusion, which can learn effective feature extraction under diverse adversarial perturbations. Thus, the goals of image fusion (\textit{i.e.,} extracting complementary features from source modalities and defending attack) can be realized from the perspectives of architectural and learning strategies. Extensive experimental results demonstrate that our scheme substantially enhances the robustness, with gains of 15.3% mIOU of segmentation in the adversarial scene, compared with advanced competitors. The source codes are available at https://github.com/LiuZhu-CV/PAIF.
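Summary
Infrared and visible image fusion is a powerful technique that combines complementary information from different modalities to support downstream semantic perception tasks. Existing learning-based methods perform remarkably well but are inherently vulnerable to adversarial attacks, which reduce accuracy. This work proposes a perception-aware fusion framework to improve segmentation robustness in adversarial scenes. We first systematically analyze the components of image fusion and their correlation with segmentation robustness. Based on these analyses, we propose a harmonized architecture that balances standard accuracy and robustness. We also propose an adaptive learning strategy that improves the parameter robustness of image fusion, so that it learns effective feature extraction under diverse adversarial perturbations. The scheme thus realizes the two goals of image fusion, extracting complementary features from the source modalities and defending against attacks, from the perspectives of architecture and learning strategy. Experiments show that the scheme substantially improves robustness, with a gain of 15.3% mIOU in segmentation compared with advanced competitors. The source code is available at https://github.com/LiuZhu-CV/PAIF.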
PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning
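results: Through evaluations of multiple vision models, the paper demonstrates the usability and effectiveness of the PUG environments and datasets, giving researchers a more accurate and reliable means of evaluation.
Abstract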
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions), (iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation. Despite such promise, the use of synthetic image data is still limited -- and often played down -- mainly due to their lack of realism. Most works therefore rely on datasets of real images, which have often been scraped from public images on the internet, and may have issues with regards to privacy, bias, and copyright, while offering little control over how objects precisely appear. In this work, we present a path to democratize the use of photorealistic synthetic data: we develop a new generation of interactive environments for representation learning research, that offer both controllability and realism. We use the Unreal Engine, a powerful game engine well known in the entertainment industry, to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning. In this paper, we demonstrate the potential of PUG to enable more rigorous evaluations of vision models.
Summary
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) generate as many data samples as needed, (ii) precisely control each scene and obtain granular labels and captions, and (iii) control distribution shifts between training and testing to isolate variables of interest. Despite this promise, the use of synthetic image data remains limited, mainly because of its lack of realism. Most works therefore rely on real images, which are often scraped from the internet, may raise privacy, bias, and copyright issues, and offer little control over how objects appear. This work presents a path toward photorealistic synthetic data for representation learning research using PUG (Photorealistic Unreal Graphics) environments and datasets, which are produced with the Unreal Engine, a game engine well known in the entertainment industry. The paper demonstrates the potential of PUG to enable more rigorous evaluations of vision models.
Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning
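methods: This study uses Prompted Contrast with Masked Motion Modeling (PCM$^{\rm 3}$), which integrates contrastive learning and masked prediction tasks in a mutually beneficial manner and thereby improves generalization to multiple downstream tasks.
results: Experiments show that PCM$^{\rm 3}$ outperforms state-of-the-art works on five downstream tasks across three large-scale datasets.
Abstract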
Self-supervised learning has proved effective for skeleton-based human action understanding, which is an important yet challenging topic. Previous works mainly rely on contrastive learning or masked motion modeling paradigm to model the skeleton relations. However, the sequence-level and joint-level representation learning cannot be effectively and simultaneously handled by these methods. As a result, the learned representations fail to generalize to different downstream tasks. Moreover, combining these two paradigms in a naive manner leaves the synergy between them untapped and can lead to interference in training. To address these problems, we propose Prompted Contrast with Masked Motion Modeling, PCM$^{\rm 3}$, for versatile 3D action representation learning. Our method integrates the contrastive learning and masked prediction tasks in a mutually beneficial manner, which substantially boosts the generalization capacity for various downstream tasks. Specifically, masked prediction provides novel training views for contrastive learning, which in turn guides the masked prediction training with high-level semantic information. Moreover, we propose a dual-prompted multi-task pretraining strategy, which further improves model representations by reducing the interference caused by learning the two different pretext tasks. Extensive experiments on five downstream tasks under three large-scale datasets are conducted, demonstrating the superior generalization capacity of PCM$^{\rm 3}$ compared to the state-of-the-art works. Our project is publicly available at: https://jhang2020.github.io/Projects/PCM3/PCM3.html .
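Summary
Self-supervised learning has proved effective for skeleton-based human action understanding, an important yet challenging topic. Previous works mainly rely on contrastive learning or masked motion modeling to model skeleton relations. However, these methods cannot handle sequence-level and joint-level representation learning effectively at the same time, so the learned representations fail to generalize to different downstream tasks. Moreover, combining the two paradigms naively can cause interference during training. To address these problems, we propose Prompted Contrast with Masked Motion Modeling (PCM$^{\rm 3}$) for versatile 3D action representation learning. The method integrates contrastive learning and masked prediction, which strengthens generalization across downstream tasks. Specifically, masked prediction provides new training views for contrastive learning, which in turn guides masked prediction training with high-level semantic information. We also propose a dual-prompted multi-task pretraining strategy that reduces the interference caused by learning two different pretext tasks. Extensive experiments on five downstream tasks under three large-scale datasets show that PCM$^{\rm 3}$ generalizes better than previous works. The project is publicly available at https://jhang2020.github.io/Projects/PCM3/PCM3.html.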
Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization
results: Extensive experiments on three large-scale skeleton action datasets demonstrate the effectiveness of the method.
Abstract
Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build the connection between visual and semantic space from seen to unseen classes. Previous studies have primarily focused on encoding sequences into a singular feature vector and subsequently mapping the features to an identical anchor point within the embedded space. Their performance is hindered by 1) ignoring the global visual/semantic distribution alignment, which limits their ability to capture the true interdependence between the two spaces, and 2) neglecting temporal information, since the frame-wise features with rich action clues are directly pooled into a single feature vector. We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization. Specifically, 1) we maximize the MI between visual and semantic space for distribution alignment; 2) we leverage the temporal information for estimating the MI by encouraging MI to increase as more frames are observed. Extensive experiments on three large-scale skeleton action datasets confirm the effectiveness of our method. Code: https://github.com/YujieOuO/SMIE.
Deterministic Neural Illumination Mapping for Efficient Auto-White Balance Correction
paper_authors: Furkan Kınlı, Doğa Yılmaz, Barış Özcan, Furkan Kıraç
for: A fast, high-quality solution for image color correction.
methods: A mapping strategy inspired by deterministic color style transfer that is resolution-agnostic and can integrate any pre-trained AWB network.
results: Experiments show that the method processes high-resolution images at least 35 times faster, with performance equivalent or superior to existing methods.
Abstract
Auto-white balance (AWB) correction is a critical operation in image signal processors for accurate and consistent color correction across various illumination scenarios. This paper presents a novel and efficient AWB correction method that achieves at least 35 times faster processing on high-resolution images, with equivalent or superior performance, compared with current state-of-the-art methods. Inspired by deterministic color style transfer, our approach introduces deterministic illumination color mapping, leveraging learnable projection matrices for both canonical illumination form and AWB-corrected output. It involves feeding high-resolution images and corresponding latent representations into a mapping module to derive a canonical form, followed by another mapping module that maps the pixel values to those for the corrected version. This strategy is designed as resolution-agnostic and also enables seamless integration of any pre-trained AWB network as the backbone. Experimental results confirm the effectiveness of our approach, revealing significant performance improvements and reduced time complexity compared to state-of-the-art methods. Our method provides an efficient deep learning-based AWB correction solution, promising real-time, high-quality color correction for digital imaging applications. Source code is available at https://github.com/birdortyedi/DeNIM/
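Summary
Auto-white balance (AWB) correction is a key operation in image signal processing, ensuring accurate and consistent color correction across different illumination scenarios. This paper describes a new, efficient AWB correction method that is at least 35 times faster on high-resolution images, with performance equal to or better than existing methods. The method is built on learnable projection matrices, which are used for both the canonical illumination form and the AWB-corrected output. High-resolution images and their corresponding latent representations are fed to a mapping module to derive a canonical form, and a second mapping module then maps the pixel values to their AWB-corrected values. The strategy is resolution-agnostic and allows any pre-trained AWB network to be used as the backbone. Experimental results confirm the method's effectiveness, showing significant performance gains and reduced processing time compared with existing methods. The method offers an efficient deep-learning-based AWB correction solution that promises real-time, high-quality color correction for digital imaging applications. Code is available at https://github.com/birdortyedi/DeNIM/.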
TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models
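results: On the TrojVQA benchmark, TIJO defends against dual-key backdoor attacks, raising the AUC from 0.6 (state-of-the-art unimodal methods) to 0.92, and it also performs well on unimodal backdoors.
Abstract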
We present a Multimodal Backdoor Defense technique TIJO (Trigger Inversion using Joint Optimization). Recent work arXiv:2112.07668 has demonstrated successful backdoor attacks on multimodal models for the Visual Question Answering task. Their dual-key backdoor trigger is split across two modalities (image and text), such that the backdoor is activated if and only if the trigger is present in both modalities. We propose TIJO that defends against dual-key attacks through a joint optimization that reverse-engineers the trigger in both the image and text modalities. This joint optimization is challenging in multimodal models due to the disconnected nature of the visual pipeline which consists of an offline feature extractor, whose output is then fused with the text using a fusion module. The key insight enabling the joint optimization in TIJO is that the trigger inversion needs to be carried out in the object detection box feature space as opposed to the pixel space. We demonstrate the effectiveness of our method on the TrojVQA benchmark, where TIJO improves upon the state-of-the-art unimodal methods from an AUC of 0.6 to 0.92 on multimodal dual-key backdoors. Furthermore, our method also improves upon the unimodal baselines on unimodal backdoors. We present ablation studies and qualitative results to provide insights into our algorithm such as the critical importance of overlaying the inverted feature triggers on all visual features during trigger inversion. The prototype implementation of TIJO is available at https://github.com/SRI-CSL/TIJO.
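Summary
We present TIJO (Trigger Inversion using Joint Optimization), a multimodal backdoor defense technique. Recent work (arXiv:2112.07668) demonstrated successful backdoor attacks on multimodal models. The backdoor trigger is split across two modalities (image and text), and the backdoor activates only when the trigger is present in both. TIJO defends against such dual-key attacks through a joint optimization that reverse-engineers the trigger in both the image and text modalities. This joint optimization is challenging for multimodal models because the visual pipeline is disconnected: it includes an offline feature extractor whose output is then fused with the text. Our key insight is that trigger inversion should be carried out in the object detection box feature space rather than in pixel space. On the TrojVQA benchmark, TIJO raises the AUC on multimodal dual-key backdoors from 0.6 to 0.92 and also outperforms the unimodal baselines on unimodal backdoors. We also provide ablation studies and qualitative results that give insight into the algorithm, such as the importance of overlaying the inverted feature triggers on all visual features during trigger inversion. The prototype implementation of TIJO is available at https://github.com/SRI-CSL/TIJO.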
Developability Approximation for Neural Implicits through Rank Minimization
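results: Experiments show that the method reconstructs approximately developable surfaces and remains reasonably accurate when the input is affected by noise.
Abstract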
Developability refers to the process of creating a surface without any tearing or shearing from a two-dimensional plane. It finds practical applications in the fabrication industry. An essential characteristic of a developable 3D surface is its zero Gaussian curvature, which means that either one or both of the principal curvatures are zero. This paper introduces a method for reconstructing an approximate developable surface from a neural implicit surface. The central idea of our method involves incorporating a regularization term that operates on the second-order derivatives of the neural implicits, effectively promoting zero Gaussian curvature. Implicit surfaces offer the advantage of smoother deformation with infinite resolution, overcoming the high polygonal constraints of state-of-the-art methods using discrete representations. We draw inspiration from the properties of surface curvature and employ rank minimization techniques derived from compressed sensing. Experimental results on both developable and non-developable surfaces, including those affected by noise, validate the generalizability of our method.
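Summary
Developability refers to the process of forming a three-dimensional surface from a two-dimensional plane without tearing or shearing. It has practical applications in manufacturing. An essential characteristic of a developable surface is zero Gaussian curvature, meaning that one or both principal curvatures are zero. This paper introduces a method for reconstructing an approximate developable surface from a neural implicit surface. The central idea is to add a regularization term on the second-order derivatives of the neural implicit function to promote zero Gaussian curvature. Implicit surfaces offer smoother deformation and infinite resolution, overcoming the high polygon-count constraints of existing methods that use discrete representations. We draw inspiration from the properties of surface curvature and use rank minimization techniques from compressed sensing. Experimental results show that the method generalizes across developable and non-developable surfaces, including those affected by noise.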
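The link between zero Gaussian curvature and the derivatives of an implicit function can be illustrated with a short sketch. For an implicit surface f = 0 with gradient g and Hessian H, a standard identity gives K = -det([[H, g], [g^T, 0]]) / |g|^4. The helper names below are hypothetical, the derivatives are supplied analytically rather than by a neural network, and this is not the paper's regularizer, only a check of the curvature property it is said to promote.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (fine for 4x4)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def gaussian_curvature(grad, hess):
    """Gaussian curvature of the implicit surface f = 0 at one point.

    grad: gradient of f (3 values); hess: Hessian of f (3x3 nested list).
    """
    # Build the bordered 4x4 matrix [[H, g], [g^T, 0]].
    bordered = [list(hess[i]) + [grad[i]] for i in range(3)]
    bordered.append(list(grad) + [0.0])
    norm_sq = sum(g * g for g in grad)
    return -det(bordered) / (norm_sq ** 2)


# Sphere f = x^2 + y^2 + z^2 - r^2, evaluated at (r, 0, 0): not developable.
r = 2.0
k_sphere = gaussian_curvature([2 * r, 0.0, 0.0],
                              [[2, 0, 0], [0, 2, 0], [0, 0, 2]])

# Cylinder f = x^2 + y^2 - 1, evaluated at (1, 0, 0.5): developable.
k_cylinder = gaussian_curvature([2.0, 0.0, 0.0],
                                [[2, 0, 0], [0, 2, 0], [0, 0, 0]])
```

The sphere yields a curvature of 1/r^2, while the cylinder yields zero, which is the condition a developability regularizer would push the surface toward.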
From Sky to the Ground: A Large-scale Benchmark and Simple Baseline Towards Real Rain Removal
paper_authors: Yun Guo, Xueyao Xiao, Yi Chang, Shumin Deng, Luxin Yan
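for: To advance real image deraining (RID) by providing large-scale, high-quality paired training samples.
methods: Constructs a large-scale, high-quality paired real-rain dataset (LHP-Rain) with 3000 video sequences and 1 million high-resolution (1920*1080) frame pairs; proposes a novel robust low-rank tensor recovery model that generates ground truth by better separating the static background from the dynamic rain; and designs a simple transformer-based single-image deraining baseline that uses both self-attention and cross-layer attention to capture feature representations.
results: Compared with existing methods, the proposed dataset and deraining method show clear advantages and achieve higher performance on image deraining.
Abstract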
Learning-based image deraining methods have made great progress. However, the lack of large-scale, high-quality paired training samples is the main bottleneck hampering real image deraining (RID). To address this dilemma and advance RID, we construct a Large-scale High-quality Paired real rain benchmark (LHP-Rain), including 3000 video sequences with 1 million high-resolution (1920*1080) frame pairs. The advantages of the proposed dataset over existing ones are three-fold: rain with higher diversity and larger scale, images with higher resolution, and higher-quality ground truth. Specifically, the real rain in LHP-Rain contains not only the classical rain streak/veiling/occlusion in the sky, but also the splashing on the ground overlooked by the deraining community. Moreover, we propose a novel robust low-rank tensor recovery model to generate the GT by better separating the static background from the dynamic rain. In addition, we design a simple transformer-based single image deraining baseline, which simultaneously utilizes self-attention and cross-layer attention within the image and rain layers with discriminative feature representation. Extensive experiments verify the superiority of the proposed dataset and deraining method over the state of the art.
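Summary
Learning-based image deraining methods have made great progress. However, the lack of large-scale, high-quality paired training samples is the main bottleneck for real image deraining (RID). To address this and advance RID, we construct a large-scale, high-quality paired real-rain benchmark (LHP-Rain) with 3000 video sequences and 1 million high-resolution (1920*1080) frame pairs. Compared with existing datasets, the rain in LHP-Rain is more diverse and larger in scale, the images are of higher quality, and the ground truth is more accurate. Specifically, the rain in LHP-Rain includes not only the classical rain streaks, veiling, and occlusion in the sky but also splashing on the ground, which the deraining community has rarely considered. We also propose a novel robust low-rank tensor recovery model that generates ground truth (GT) by better separating the static background from the dynamic rain. In addition, we design a simple transformer-based single-image deraining baseline that uses self-attention and cross-layer attention within the image and rain layers. Extensive experiments demonstrate the superiority of the proposed dataset and deraining method.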
results: Experiments show that DefCor-Net corrects deformation in US images with high accuracy and recovers the original geometry (Dice Coefficient: from $14.3\pm20.9$ to $82.6\pm12.1$ when the force is $6N$).
Abstract
The recovery of morphologically accurate anatomical images from deformed ones is challenging in ultrasound (US) image acquisition, but crucial to accurate and consistent diagnosis, particularly in the emerging field of computer-assisted diagnosis. This article presents a novel anatomy-aware deformation correction approach based on a coarse-to-fine, multi-scale deep neural network (DefCor-Net). To achieve pixel-wise performance, DefCor-Net incorporates biomedical knowledge by estimating pixel-wise stiffness online using a U-shaped feature extractor. The deformation field is then computed using polynomial regression by integrating the measured force applied by the US probe. Based on real-time estimation of pixel-by-pixel tissue properties, the learning-based approach enables the potential for anatomy-aware deformation correction. To demonstrate the effectiveness of the proposed DefCor-Net, images recorded at multiple locations on forearms and upper arms of six volunteers are used to train and validate DefCor-Net. The results demonstrate that DefCor-Net can significantly improve the accuracy of deformation correction to recover the original geometry (Dice Coefficient: from $14.3\pm20.9$ to $82.6\pm12.1$ when the force is $6N$).
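Summary
Recovering morphologically accurate images in ultrasound (US) image acquisition is challenging, yet it is extremely important for accurate and consistent medical diagnosis, particularly in computer-assisted diagnosis. This article presents a novel anatomy-aware deformation correction method based on a multi-scale deep neural network (DefCor-Net). To achieve pixel-wise performance, DefCor-Net incorporates biomedical knowledge and estimates the stiffness of every pixel online. The deformation field is computed by integrating the measured force applied by the US probe into a polynomial regression. Based on real-time estimation of per-pixel tissue properties, this learning-based method has the potential for anatomy-aware deformation correction. To demonstrate its effectiveness, DefCor-Net is trained and validated on images recorded at multiple locations on the forearms and upper arms of six volunteers. The results show that DefCor-Net significantly improves the accuracy of deformation correction (Dice Coefficient: from 14.3±20.9 to 82.6±12.1 when the force is 6N).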
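The Dice coefficient used to report the results above measures the overlap between a predicted mask and a reference mask as 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch for binary masks stored as flat lists; the function name and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as flat lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty: treated here as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * intersection / total


# Two 4-pixel masks that share one foreground pixel out of two each.
example = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

The function returns a value in [0, 1]; the figures quoted in the abstract appear to be the same quantity expressed on a 0 to 100 scale.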
High-Throughput and Accurate 3D Scanning of Cattle Using Time-of-Flight Sensors and Deep Learning
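results: Experiments show that the proposed system generates high-quality mesh models of cattle; its volume and surface-area measurements were checked on known objects in a controlled environment.
Abstract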
We introduce a high-throughput 3D scanning solution specifically designed to precisely measure cattle phenotypes. This scanner leverages an array of depth sensors, i.e. time-of-flight (ToF) sensors, each governed by dedicated embedded devices. The system excels at generating high-fidelity 3D point clouds, thus facilitating an accurate mesh that faithfully reconstructs the cattle geometry on the fly. In order to evaluate the performance of our system, we have implemented a two-fold validation process. Initially, we test the scanner's competency in determining volume and surface area measurements within a controlled environment featuring known objects. Secondly, we explore the impact and necessity of multi-device synchronization when operating a series of time-of-flight sensors. Based on the experimental results, the proposed system is capable of producing high-quality meshes of untamed cattle for livestock studies.
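Summary
We introduce a high-throughput 3D scanning solution designed specifically for precise measurement of cattle phenotypes. The scanner uses an array of depth sensors, namely time-of-flight (ToF) sensors, each controlled by a dedicated embedded device. The system generates high-quality 3D point clouds, enabling an accurate 3D reconstruction of the cattle geometry. To evaluate the system, we carried out a two-part validation. First, we tested the scanner's ability to measure the volume and surface area of known objects in a controlled environment. Second, we explored the impact and necessity of multi-device synchronization when several ToF sensors operate together. Based on the experimental results, the system can produce high-quality 3D models of cattle, providing valuable data for livestock studies.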
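The volume and surface-area measurements mentioned above can be computed from a closed triangle mesh with the divergence theorem. The sketch below is an illustration only: it assumes a watertight mesh with consistently oriented faces, and the function name and the tetrahedron test object are hypothetical, not taken from the paper.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def mesh_volume_and_area(vertices, faces):
    """Return (volume, surface area) of a closed, consistently oriented mesh."""
    signed_volume = 0.0
    area = 0.0
    for i, j, k in faces:
        v0, v1, v2 = vertices[i], vertices[j], vertices[k]
        # Signed volume of the tetrahedron spanned by the origin and the face.
        signed_volume += dot(v0, cross(v1, v2)) / 6.0
        # Triangle area is half the norm of the edge cross product.
        normal = cross(sub(v1, v0), sub(v2, v0))
        area += 0.5 * math.sqrt(dot(normal, normal))
    return abs(signed_volume), area


# Known object: a unit right tetrahedron with outward-facing triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
faces = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
volume, area = mesh_volume_and_area(vertices, faces)
```

Checking the routine against an object of known volume (1/6 for this tetrahedron) mirrors, in miniature, the controlled-environment validation the abstract describes.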
3D Motion Magnification: Visualizing Subtle Motions with Time Varying Radiance Fields
results: The method is studied and validated on a variety of scenes and camera setups, which demonstrates its effectiveness.
Abstract
Motion magnification helps us visualize subtle, imperceptible motion. However, prior methods only work for 2D videos captured with a fixed camera. We present a 3D motion magnification method that can magnify subtle motions from scenes captured by a moving camera, while supporting novel view rendering. We represent the scene with time-varying radiance fields and leverage the Eulerian principle for motion magnification to extract and amplify the variation of the embedding of a fixed point over time. We study and validate our proposed principle for 3D motion magnification using both implicit and tri-plane-based radiance fields as our underlying 3D scene representation. We evaluate the effectiveness of our method on both synthetic and real-world scenes captured under various camera setups.
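Summary
Motion magnification helps us see subtle, imperceptible motion. However, prior methods only work for 2D videos captured with a fixed camera. We present a 3D motion magnification method that supports novel view rendering and can magnify subtle motions in scenes captured by a moving camera. We represent the scene with time-varying radiance fields and use the Eulerian principle to extract and amplify the variation of a fixed point's embedding over time. We study and validate the approach using both implicit and tri-plane-based radiance fields as the scene representation, and we evaluate it on both synthetic and real-world scenes captured under various camera setups.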
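The Eulerian idea described above, amplifying how the embedding of a fixed point changes over time, can be sketched as a linear amplification around a reference embedding. This is a simplified illustration under stated assumptions: `magnify_embedding`, the choice of reference, and the factor `alpha` are hypothetical, and the actual method may filter or process the temporal variation differently.

```python
def magnify_embedding(embedding_t, embedding_ref, alpha):
    """Amplify the temporal variation of a fixed point's embedding.

    embedding_t:   embedding of the point at time t
    embedding_ref: embedding of the same point at a reference time
    alpha:         magnification factor (alpha = 1 leaves the input unchanged)
    """
    return [ref + alpha * (cur - ref)
            for cur, ref in zip(embedding_t, embedding_ref)]


# A small change in the second feature is magnified tenfold.
magnified = magnify_embedding([1.0, 2.5], [1.0, 2.0], alpha=10.0)
```

Because the point's position stays fixed and only its features are modified, the magnified embedding can in principle be rendered from any viewpoint, which is consistent with the novel view rendering the abstract mentions.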
FSD V2: Improving Fully Sparse 3D Object Detection with Virtual Voxels
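results: The paper reports experiments on three large-scale datasets: the Waymo Open Dataset, the Argoverse 2 dataset, and the nuScenes dataset. The results show that FSDv2 excels in long-range scenarios and is competitive across diverse scenarios. The paper also provides detailed experimental analysis to foster reproducibility and further research.
Abstract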
LiDAR-based fully sparse architecture has garnered increasing attention. FSDv1 stands out as a representative work, achieving impressive efficacy and efficiency, albeit with intricate structures and handcrafted designs. In this paper, we present FSDv2, an evolution that aims to simplify the previous FSDv1 while eliminating the inductive bias introduced by its handcrafted instance-level representation, thus promoting better general applicability. To this end, we introduce the concept of virtual voxels, which takes over the clustering-based instance segmentation in FSDv1. Virtual voxels not only address the notorious issue of the Center Feature Missing problem in fully sparse detectors but also endow the framework with a more elegant and streamlined approach. Consequently, we develop a suite of components to complement the virtual voxel concept, including a virtual voxel encoder, a virtual voxel mixer, and a virtual voxel assignment strategy. Through empirical validation, we demonstrate that the virtual voxel mechanism is functionally similar to the handcrafted clustering in FSDv1 while being more general. We conduct experiments on three large-scale datasets: Waymo Open Dataset, Argoverse 2 dataset, and nuScenes dataset. Our results showcase state-of-the-art performance on all three datasets, highlighting the superiority of FSDv2 in long-range scenarios and its general applicability to achieve competitive performance across diverse scenarios. Moreover, we provide comprehensive experimental analysis to elucidate the workings of FSDv2. To foster reproducibility and further research, we have open-sourced FSDv2 at https://github.com/tusen-ai/SST.
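Summary
LiDAR-based fully sparse architectures have recently attracted growing attention. FSDv1 is a representative work that achieves impressive efficacy and efficiency, but it has intricate structures and handcrafted designs. In this paper we present FSDv2, an evolution of FSDv1 that aims to simplify its predecessor and remove the inductive bias introduced by the handcrafted instance-level representation, improving general applicability. To this end, we introduce the concept of virtual voxels, which replaces the clustering-based instance segmentation in FSDv1. Virtual voxels not only address the Center Feature Missing problem in fully sparse detectors but also make the framework more streamlined. We develop a set of components around virtual voxels: a virtual voxel encoder, a virtual voxel mixer, and a virtual voxel assignment strategy. Empirical validation shows that the virtual voxel mechanism is functionally similar to the handcrafted clustering in FSDv1 while being more general. We run experiments on the Waymo Open Dataset, the Argoverse 2 dataset, and the nuScenes dataset; the results show state-of-the-art performance in long-range scenarios and competitive performance across diverse scenarios. We also provide comprehensive experimental analysis to explain how FSDv2 works. To foster reproducibility and further research, FSDv2 is open-sourced at https://github.com/tusen-ai/SST.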
Mask Frozen-DETR: High Quality Instance Segmentation with One GPU
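results: On the COCO test-dev split, the method outperforms the state-of-the-art instance segmentation method Mask DINO (55.3% vs. 54.7%) while being over 10 times faster to train. In addition, all experiments can be trained on a single Tesla V100 GPU with 16 GB of memory, which demonstrates the efficiency of the proposed framework.
Abstract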
In this paper, we aim to study how to build a strong instance segmenter with minimal training time and GPUs, as opposed to the majority of current approaches that pursue more accurate instance segmenters by building more advanced frameworks at the cost of longer training time and higher GPU requirements. To achieve this, we introduce a simple and general framework, termed Mask Frozen-DETR, which can convert any existing DETR-based object detection model into a powerful instance segmentation model. Our method only requires training an additional lightweight mask network that predicts instance masks within the bounding boxes given by a frozen DETR-based object detector. Remarkably, our method outperforms the state-of-the-art instance segmentation method Mask DINO in terms of performance on the COCO test-dev split (55.3% vs. 54.7%) while being over 10 times faster to train. Furthermore, all of our experiments can be trained using only one Tesla V100 GPU with 16 GB of memory, demonstrating the significant efficiency of our proposed framework.
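Summary
In this paper, we study how to build a strong instance segmenter with minimal training time and GPUs, in contrast to most existing approaches, which pursue higher accuracy by building more advanced frameworks at the cost of longer training time and higher GPU requirements. To this end, we propose a simple and general framework, Mask Frozen-DETR, that can convert any existing DETR-based object detection model into a powerful instance segmentation model. The method only requires training a lightweight mask network that predicts instance masks within the bounding boxes provided by a frozen DETR-based object detector. Notably, the method outperforms the state-of-the-art instance segmentation method Mask DINO by 0.6 percentage points on the COCO test-dev split (55.3% vs. 54.7%) while training over 10 times faster. In addition, all of our experiments can be trained on a single Tesla V100 GPU with 16 GB of memory, which shows that the proposed method is highly efficient.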
AdaptiveSAM: Towards Efficient Tuning of SAM for Surgical Scene Segmentation
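paper_authors: Jay N. Paranjape, Nithin Gopalakrishnan Nair, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel
for: This paper addresses a basic problem in AI-based surgical scene analysis, namely data scarcity.
methods: The paper proposes AdaptiveSAM, an adaptation of the Segment-Anything (SAM) model that adjusts quickly to new datasets and enables text-prompted segmentation.
results: Experiments show that AdaptiveSAM outperforms current state-of-the-art methods on various medical imaging datasets, including surgery, ultrasound, and X-ray.
Abstract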
Segmentation is a fundamental problem in surgical scene analysis using artificial intelligence. However, the inherent data scarcity in this domain makes it challenging to adapt traditional segmentation techniques for this task. To tackle this issue, current research employs pretrained models and finetunes them on the given data. Even so, these require training deep networks with millions of parameters every time new data becomes available. A recently published foundation model, Segment-Anything (SAM), generalizes well to a large variety of natural images, hence tackling this challenge to a reasonable extent. However, SAM does not generalize well to the medical domain as is without utilizing a large amount of compute resources for fine-tuning and using task-specific prompts. Moreover, these prompts are in the form of bounding-boxes or foreground/background points that need to be annotated explicitly for every image, making this solution increasingly tedious with higher data size. In this work, we propose AdaptiveSAM - an adaptive modification of SAM that can adjust to new datasets quickly and efficiently, while enabling text-prompted segmentation. For finetuning AdaptiveSAM, we propose an approach called bias-tuning that requires a significantly smaller number of trainable parameters than SAM (less than 2%). At the same time, AdaptiveSAM requires negligible expert intervention since it uses free-form text as prompt and can segment the object of interest with just the label name as prompt. Our experiments show that AdaptiveSAM outperforms current state-of-the-art methods on various medical imaging datasets including surgery, ultrasound and X-ray. Code is available at https://github.com/JayParanjape/biastuning
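Summary
Segmentation is a fundamental problem in surgical scene analysis, but the scarcity of data in the medical domain makes it hard to adapt traditional segmentation techniques to this task. To address this, current research typically uses pretrained models and fine-tunes them. However, this requires training deep networks with millions of parameters every time new data becomes available. A recently published foundation model, Segment-Anything (SAM), generalizes to a wide variety of natural images and therefore eases this problem to some extent. However, SAM does not generalize well to the medical domain without substantial compute for fine-tuning and without task-specific prompts. These prompts are usually bounding boxes or foreground/background points that must be annotated explicitly for every image, which makes the solution hard to scale. In this work we propose AdaptiveSAM, an adaptive modification of SAM. AdaptiveSAM adjusts quickly to new datasets and supports text-prompted segmentation through free-form text. We also propose a bias-tuning method that reduces the number of trainable parameters when fine-tuning AdaptiveSAM to less than 2% of SAM's. At the same time, AdaptiveSAM needs very little expert intervention, because it uses free-form text prompts and can segment the object of interest from the label name alone. Our experiments show that AdaptiveSAM performs strongly on various medical imaging datasets, including surgery, ultrasound, and X-ray. Code is available at https://github.com/JayParanjape/biastuning.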
Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation
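results: Experiments on three popular TSGV benchmarks show that the method balances efficiency and accuracy without relying on complex architectures or loss functions.
Abstract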
Temporal Sentence Grounding in Videos (TSGV) aims to detect the event timestamps described by the natural language query from untrimmed videos. This paper discusses the challenge of achieving efficient computation in TSGV models while maintaining high performance. Most existing approaches design elaborate, complex architectures to improve accuracy with extra layers and losses, and they suffer from inefficiency and heaviness. Although some works have noticed this, they address only the feature fusion layers, which yields little speed benefit within the whole clunky network. To tackle this problem, we propose a novel efficient multi-teacher model (EMTM) based on knowledge distillation to transfer diverse knowledge from both heterogeneous and isomorphic networks. Specifically, we first unify different outputs of the heterogeneous models into one single form. Next, a Knowledge Aggregation Unit (KAU) is built to acquire high-quality integrated soft labels from multiple teachers. After that, the KAU module leverages the multi-scale video and global query information to adaptively determine the weights of different teachers. A Shared Encoder strategy is then proposed to solve the problem that the student's shallow layers hardly benefit from teachers, in which an isomorphic teacher is collaboratively trained with the student to align their hidden states. Extensive experimental results on three popular TSGV benchmarks demonstrate that our method is both effective and efficient without bells and whistles.
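The multi-teacher aggregation described in the abstract can be sketched as a weighted mixture of teacher distributions that the student is trained to match. This is an illustration under stated assumptions, not the paper's Knowledge Aggregation Unit: in the sketch the per-teacher weight scores are given as plain inputs, whereas the abstract says they are determined adaptively from video and query information, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_soft_labels(teacher_logits, weight_scores):
    """Mix the teachers' probability distributions with softmax-normalized weights."""
    weights = softmax(weight_scores)
    probs = [softmax(logits) for logits in teacher_logits]
    size = len(probs[0])
    return [sum(w * p[i] for w, p in zip(weights, probs)) for i in range(size)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): the distillation loss between soft labels p and student output q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Two teachers scoring three candidate positions, weighted equally.
soft_labels = aggregate_soft_labels([[2.0, 0.0, -1.0], [1.5, 0.5, -1.0]], [0.0, 0.0])
student = softmax([0.0, 0.0, 0.0])  # an untrained, uniform student
loss = kl_divergence(soft_labels, student)
```

Unifying the teachers' outputs into a single form, as the abstract describes, is what makes a mixture like this well defined: every teacher must produce a distribution over the same set of candidates.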
Automated Real Time Delineation of Supraclavicular Brachial Plexus in Neck Ultrasonography Videos: A Deep Learning Approach
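results: The results show that deep learning models can deliver real-time segmentation with high accuracy and reliability and can distinguish the supraclavicular brachial plexus from the adjoining interscalene brachial plexus. The generalizability of the best-suited model was also tested on datasets from separate ultrasound scanners, with and without fine-tuning.
Abstract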
Peripheral nerve blocks are crucial to treatment of post-surgical pain and are associated with reduction in perioperative opioid use and hospital stay. Accurate interpretation of sono-anatomy is critical for the success of ultrasound (US) guided peripheral nerve blocks and can be challenging to the new operators. This prospective study enrolled 227 subjects who were systematically scanned for supraclavicular and interscalene brachial plexus in various settings using three different US machines to create a dataset of 227 unique videos. In total, 41,000 video frames were annotated by experienced anaesthesiologists using partial automation with object tracking and active contour algorithms. Four baseline neural network models were trained on the dataset and their performance was evaluated for object detection and segmentation tasks. Generalizability of the best suited model was then tested on the datasets constructed from separate US scanners with and without fine-tuning. The results demonstrate that deep learning models can be leveraged for real time segmentation of supraclavicular brachial plexus in neck ultrasonography videos with high accuracy and reliability. Model was also tested for its ability to differentiate between supraclavicular and adjoining interscalene brachial plexus. The entire dataset has been released publicly for further study by the research community.
Summary
Peripheral nerve blocks are essential for treating post-surgical pain and are associated with reduced perioperative opioid use and shorter hospital stays. Accurate interpretation of sono-anatomy is critical to the success of ultrasound (US) guided nerve blocks and can be challenging for new operators. This prospective study enrolled 227 subjects who were systematically scanned for the supraclavicular and interscalene brachial plexus using three different US machines, creating a dataset of 227 unique videos. In total, 41,000 video frames were annotated by experienced anaesthesiologists using partial automation with object tracking and active contour algorithms. Four baseline neural network models were trained on the dataset, and their performance was evaluated on object detection and segmentation tasks. The generalizability of the best-suited model was tested on data from different US scanners, with and without fine-tuning. The results show that deep learning models can be used for real-time segmentation of the supraclavicular brachial plexus in neck ultrasonography videos with high accuracy and reliability. The model was also tested for its ability to distinguish the supraclavicular brachial plexus from the adjoining interscalene brachial plexus. The entire dataset has been released publicly for further study by the research community.
Scaling may be all you need for achieving human-level object recognition capacity with human-like visual experience
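results: The study finds that human-level visual object recognition is reachable if data size, model size, and image resolution are scaled up simultaneously. For example, a 2.5B-parameter ViT model trained on 20K hours (2.3 years) of human-like video data at a spatial resolution of 952x952 pixels is estimated to reach roughly human-level accuracy on ImageNet.
Abstract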
This paper asks whether current self-supervised learning methods, if sufficiently scaled up, would be able to reach human-level visual object recognition capabilities with the same type and amount of visual experience humans learn from. Previous work on this question only considered the scaling of data size. Here, we consider the simultaneous scaling of data size, model size, and image resolution. We perform a scaling experiment with vision transformers up to 633M parameters in size (ViT-H/14) trained with up to 5K hours of human-like video data (long, continuous, mostly egocentric videos) with image resolutions of up to 476x476 pixels. The efficiency of masked autoencoders (MAEs) as a self-supervised learning algorithm makes it possible to run this scaling experiment on an unassuming academic budget. We find that it is feasible to reach human-level object recognition capacity at sub-human scales of model size, data size, and image size, if these factors are scaled up simultaneously. To give a concrete example, we estimate that a 2.5B parameter ViT model trained with 20K hours (2.3 years) of human-like video data with a spatial resolution of 952x952 pixels should be able to reach roughly human-level accuracy on ImageNet. Human-level competence is thus achievable for a fundamental perceptual capability from human-like perceptual experience (human-like in both amount and type) with extremely generic learning algorithms and architectures and without any substantive inductive biases.
Prototype Learning for Out-of-Distribution Polyp Segmentation
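methods: The model incorporates several lighting modes, namely white light imaging (WLI), blue light imaging (BLI), linked color imaging (LCI), and flexible spectral imaging color enhancement (FICE), and uses prototypes to represent the characteristic features of each object class, such as shape, texture, and color.
results: The model achieves a Dice coefficient of $\geq$ 90% and an mIoU of $\geq$ 85% on datasets from different centers at near real-time processing speed. It outperforms 16 state-of-the-art image segmentation architectures, which could improve clinical outcomes.
Abstract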
Existing polyp segmentation models from colonoscopy images often fail to provide reliable segmentation results on datasets from different centers, limiting their applicability. Our objective in this study is to create a robust and well-generalized segmentation model named PrototypeLab that can assist in polyp segmentation. To achieve this, we incorporate various lighting modes such as White light imaging (WLI), Blue light imaging (BLI), Linked color imaging (LCI), and Flexible spectral imaging color enhancement (FICE) into our new segmentation model, which learns to create prototypes for each class of object present in the images. These prototypes represent the characteristic features of the objects, such as their shape, texture, and color. Our model is designed to perform effectively on out-of-distribution (OOD) datasets from multiple centers. We first generate a coarse mask that is used to learn prototypes for the main object class, which are then employed to generate the final segmentation mask. By using prototypes to represent the main class, our approach handles the variability present in medical images and generalizes well to new data, since the prototypes capture the underlying distribution of the data. PrototypeLab offers a promising solution with a dice coefficient of $\geq$ 90% and mIoU $\geq$ 85% with a near real-time processing speed for polyp segmentation. It achieved superior performance on OOD datasets compared to 16 state-of-the-art image segmentation architectures, potentially improving clinical outcomes. Codes are available at https://github.com/xxxxx/PrototypeLab.
Summary
Existing polyp segmentation models often fail to give reliable segmentation results on datasets from different centers, which limits their applicability. Our goal is to build a robust and well-generalized segmentation model, PrototypeLab, that can assist in polyp segmentation. To this end, we integrate several lighting modes into the new model: white light imaging (WLI), blue light imaging (BLI), linked color imaging (LCI), and flexible spectral imaging color enhancement (FICE). The model learns a prototype for each class, and these prototypes represent characteristic features of the objects such as shape, texture, and color. The model is designed to perform well on datasets from multiple centers and to process data quickly. We first generate a coarse mask and use it to learn prototypes for the main class, then use these prototypes to generate the final segmentation mask. By using prototypes to represent the main class, the method handles the variability in medical images and adapts well to new data, because the prototypes capture the underlying distribution of the data. PrototypeLab offers a promising solution, with a Dice coefficient ≥ 90% and mIoU ≥ 85% at near real-time processing speed. It performs well on datasets from multiple centers and outperforms 16 state-of-the-art image segmentation models, potentially improving clinical outcomes. Code is available at https://github.com/xxxxx/PrototypeLab.
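The coarse-mask-to-prototype-to-final-mask flow described above can be sketched as follows. The abstract does not say how prototypes are computed or compared, so this sketch assumes two common choices: masked average pooling to form the prototype and a cosine-similarity threshold to produce the final mask. All names, the threshold, and the toy features are hypothetical.

```python
import math


def masked_average_prototype(features, coarse_mask):
    """Average the feature vectors of pixels selected by the coarse mask."""
    dim = len(features[0])
    weight = sum(coarse_mask)
    if weight == 0:
        return [0.0] * dim
    return [sum(m * f[d] for f, m in zip(features, coarse_mask)) / weight
            for d in range(dim)]


def cosine_similarity(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def refine_mask(features, prototype, threshold=0.5):
    """Label a pixel as foreground when its feature is close to the prototype."""
    return [1 if cosine_similarity(f, prototype) > threshold else 0
            for f in features]


# Four pixels with 2-D features; the coarse mask finds only the first pixel.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
coarse_mask = [1, 0, 0, 0]
prototype = masked_average_prototype(features, coarse_mask)
final_mask = refine_mask(features, prototype)
```

In this toy case the second pixel, missed by the coarse mask, is recovered because its feature lies close to the prototype, which illustrates why a class prototype can be less sensitive to per-image variation than the coarse mask alone.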
Video-based Person Re-identification with Long Short-Term Representation Learning
results: Extensive experiments on three widely used benchmarks show that the proposed method delivers better V-ReID performance than most state-of-the-art methods.
Abstract
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapped cameras. As a fundamental task, it underpins many multimedia and computer vision applications. However, due to the variations of persons and scenes, there are still many obstacles that must be overcome for high performance. In this work, we notice that both the long-term and short-term information of persons are important for robust video representations. Thus, we propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID. More specifically, to extract long-term representations, we propose a Multi-granularity Appearance Extractor (MAE), in which four granularity appearances are effectively captured across multiple frames. Meanwhile, to extract short-term representations, we propose a Bi-direction Motion Estimator (BME), in which reciprocal motion information is efficiently extracted from consecutive frames. The MAE and BME are plug-and-play and can be easily inserted into existing networks for efficient feature learning. As a result, they significantly improve the feature representation ability for V-ReID. Extensive experiments on three widely used benchmarks show that our proposed approach can deliver better performance than most state-of-the-art methods.
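Summary
Video-based person re-identification (V-ReID) aims to retrieve specific persons from videos captured by non-overlapping cameras. As a fundamental task, it is widely used in multimedia and computer vision. However, because of variations in persons and scenes, V-ReID still faces many obstacles. In this work, we observe that both long-term and short-term information about persons is important for robust video representations. We therefore propose a new deep learning framework, Long Short-Term Representation Learning (LSTRL), to improve V-ReID performance. Specifically, we propose a Multi-granularity Appearance Extractor (MAE) that captures appearance at four granularities across multiple frames, and a Bi-direction Motion Estimator (BME) that efficiently extracts reciprocal motion information from consecutive frames. Both MAE and BME can be combined with existing networks to improve feature learning. Experiments show that the proposed method achieves strong performance on three widely used benchmarks.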
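results: Experimental analysis shows that subjective tests run on the proposed software platform produce reasonable subjective quality scores for 3D models.
Abstract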
Recently, widespread 3D graphics (e.g., point clouds and meshes) have drawn considerable efforts from academia and industry to assess their perceptual quality by conducting subjective experiments. However, lacking a handy software for 3D subjective experiments complicates the construction of 3D graphics quality assessment datasets, thus hindering the prosperity of relevant fields. In this paper, we develop a powerful platform with which users can flexibly design their 3D subjective methodologies and build high-quality datasets, easing a broad spectrum of 3D graphics subjective quality study. To accurately illustrate the perceptual quality differences of 3D stimuli, our software can simultaneously render the source stimulus and impaired stimulus and allows both stimuli to respond synchronously to viewer interactions. Compared with amateur 3D visualization tool-based or image/video rendering-based schemes, our approach embodies typical 3D applications while minimizing cognitive overload during subjective experiments. We organized a subjective experiment involving 40 participants to verify the validity of the proposed software. Experimental analyses demonstrate that subjective tests on our software can produce reasonable subjective quality scores of 3D models. All resources in this paper can be found at https://openi.pcl.ac.cn/OpenDatasets/3DQA.
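Summary
In recent years, widespread 3D graphics (such as point clouds and meshes) have drawn considerable effort from academia and industry to assess their perceptual quality through subjective experiments. However, the lack of convenient software for 3D subjective experiments makes it harder to construct 3D graphics quality assessment datasets, which hinders progress in related fields. In this paper we develop a powerful platform that lets users flexibly design their 3D subjective methodologies and build high-quality datasets, supporting a broad range of 3D graphics subjective quality studies. To illustrate the perceptual quality differences of 3D stimuli accurately, the software renders the source stimulus and the impaired stimulus simultaneously and lets both respond synchronously to viewer interactions. Compared with schemes based on amateur 3D visualization tools or on image/video rendering, our approach reflects typical 3D applications while reducing cognitive load during subjective experiments. We organized a subjective experiment with 40 participants to verify the validity of the software. Experimental analysis shows that the software can produce reasonable subjective quality scores for 3D models. All resources are available at https://openi.pcl.ac.cn/OpenDatasets/3DQA.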
Learning Concise and Descriptive Attributes for Visual Recognition
paper_authors: An Yan, Yu Wang, Yiwu Zhong, Chengyu Dong, Zexue He, Yujie Lu, William Wang, Jingbo Shang, Julian McAuley
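for: This study explores recent advances in foundation models and how they can enable interpretable visual recognition.
methods: The study queries large language models (LLMs) to generate a set of attributes, then applies vision-language models to classify images via these attributes.
results: Prior work found that a large number of attributes can match the performance of image features, but further investigation on 8 datasets reveals that LLM-generated attributes contain substantial noise. The authors propose a novel learning-to-search method to find smaller yet effective attribute sets. On the CUB dataset, the method classifies 200 bird species using only 32 attributes while approaching the performance of massive LLM-generated attribute sets (e.g., 10k attributes). The new approach also offers higher interpretability and interactivity, and the ability to summarize knowledge.
Abstract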
Recent advances in foundation models present new opportunities for interpretable visual recognition -- one can first query Large Language Models (LLMs) to obtain a set of attributes that describe each class, then apply vision-language models to classify images via these attributes. Pioneering work shows that querying thousands of attributes can achieve performance competitive with image features. However, our further investigation on 8 datasets reveals that LLM-generated attributes in a large quantity perform almost the same as random words. This surprising finding suggests that significant noise may be present in these attributes. We hypothesize that there exist subsets of attributes that can maintain the classification performance with much smaller sizes, and propose a novel learning-to-search method to discover those concise sets of attributes. As a result, on the CUB dataset, our method achieves performance close to that of massive LLM-generated attributes (e.g., 10k attributes for CUB), yet using only 32 attributes in total to distinguish 200 bird species. Furthermore, our new paradigm demonstrates several additional benefits: higher interpretability and interactivity for humans, and the ability to summarize knowledge for a recognition task.
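The attribute-based pipeline described in this abstract can be illustrated with a minimal sketch. It assumes that a vision-language model has already produced one similarity score per attribute for an image, and that each class has a weight per attribute. The function name, the attribute names, and all numbers below are hypothetical and are not taken from the paper.

```python
def classify_by_attributes(attr_scores, class_weights):
    """Score each class as a weighted sum of image-attribute similarities.

    attr_scores:   list of K similarities between one image and K attributes
                   (assumed to come from a vision-language model).
    class_weights: dict mapping class name -> list of K attribute weights.
    Returns the best class and the per-class scores.
    """
    logits = {
        cls: sum(w * s for w, s in zip(weights, attr_scores))
        for cls, weights in class_weights.items()
    }
    best = max(logits, key=logits.get)
    return best, logits


# Hypothetical attributes: ["red crown", "webbed feet", "long tail"]
image_scores = [0.8, 0.1, 0.3]
weights = {
    "woodpecker": [1.0, 0.0, 0.5],
    "duck": [0.0, 1.0, 0.1],
}
prediction, scores = classify_by_attributes(image_scores, weights)
```

Because every class score is a sum over named attributes, each prediction can be traced back to the attributes that contributed most, which is the interpretability benefit the abstract refers to.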
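results: On the WIDERFACE dataset, LAFD reaches an average accuracy of 94.1%, 92.2%, and 82.1%, an improvement of 3.4%, 4.0%, and 8.3% over Retinaface and of 3.1%, 4.1%, and 4.1% over the lightweight model LFFD. If the input image is pre-processed and scaled to 1560px in length or 1200px in width, the model reaches an average accuracy of 86.2% on the 'hard' validation subset. The model is lightweight, at only 10.2MB.
Abstract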
In this paper, we propose LAFD (Light and accurate face detection), a lightweight and accurate face detection algorithm based on Retinaface. The backbone network is a modified MobileNetV3 network that adjusts the size of the convolution kernel, the channel expansion multiplier of the inverted residuals block, and the use of the SE attention mechanism. A deformable convolution network (DCN) is introduced in the context module, and the algorithm uses the focal loss function instead of the cross-entropy loss function as the classification loss of the model. The test results on the WIDERFACE dataset indicate that the average accuracy of LAFD is 94.1%, 92.2% and 82.1% for the "easy", "medium" and "hard" validation subsets respectively, an improvement of 3.4%, 4.0% and 8.3% compared to Retinaface and 3.1%, 4.1% and 4.1% higher than the well-performing lightweight model, LFFD. If the input image is pre-processed and scaled to 1560px in length or 1200px in width, the model achieves an average accuracy of 86.2% on the 'hard' validation subset. The model is lightweight, with a size of only 10.2MB.
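To make the switch from cross-entropy to focal loss concrete, here is a minimal sketch of the binary form of both losses. The values `gamma=2.0` and `alpha=0.25` are common defaults and are assumptions here; the abstract does not state which settings LAFD uses.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    The (1 - p_t)**gamma factor shrinks the loss of well-classified
    examples, so hard examples dominate the gradient.
    gamma and alpha are assumed defaults, not values from the paper.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident and correct: heavily down-weighted
hard = focal_loss(0.1, 1)  # confident and wrong: keeps most of its loss
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to half the cross-entropy, which is a quick sanity check on the implementation.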
Pengembangan Model untuk Mendeteksi Kerusakan pada Terumbu Karang dengan Klasifikasi Citra (Developing a Model to Detect Coral Reef Damage Using Image Classification)
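for: This research aims to develop an accurate classification model that distinguishes healthy corals from bleached corals.
methods: The study uses machine learning models, particularly convolutional neural networks (CNNs), to recognize and differentiate the visual patterns of healthy and bleached corals.
results: A from-scratch ResNet model can outperform pretrained models in terms of precision and accuracy. Accurate classification models of this kind can help researchers and marine biologists better understand coral reef health and monitor changes in the reef environment, contributing to conservation and ecosystem restoration.
Abstract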
The abundant biodiversity of coral reefs in Indonesian waters is a valuable asset that needs to be preserved. Rapid climate change and uncontrolled human activities have led to the degradation of coral reef ecosystems, including coral bleaching, which is a critical indicator of coral health conditions. Therefore, this research aims to develop an accurate classification model to distinguish between healthy corals and corals experiencing bleaching. This study utilizes a specialized dataset consisting of 923 images collected from Flickr using the Flickr API. The dataset comprises two distinct classes: healthy corals (438 images) and bleached corals (485 images). These images have been resized to a maximum of 300 pixels in width or height, whichever is larger, to maintain consistent sizes across the dataset. The method employed in this research involves the use of machine learning models, particularly convolutional neural networks (CNN), to recognize and differentiate visual patterns associated with healthy and bleached corals. In this context, the dataset can be used to train and test various classification models to achieve optimal results. By leveraging the ResNet model, it was found that a from-scratch ResNet model can outperform pretrained models in terms of precision and accuracy. The success in developing accurate classification models will greatly benefit researchers and marine biologists in gaining a better understanding of coral reef health. These models can also be employed to monitor changes in the coral reef environment, thereby making a significant contribution to conservation and ecosystem restoration efforts that have far-reaching impacts on life.
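The resizing rule in this abstract (cap the larger of width and height at 300 pixels) can be sketched as a small helper. The function name is hypothetical, and leaving images that are already small enough untouched is an assumption, since the abstract does not say how such images were handled.

```python
def fit_longest_side(width, height, max_side=300):
    """Return (width, height) scaled so the longer side is at most max_side.

    The aspect ratio is preserved. Images already within the limit are
    returned unchanged (an assumption, not stated in the abstract).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Guard against a side collapsing to zero for extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = fit_longest_side(1200, 800)
portrait = fit_longest_side(500, 1000)
small = fit_longest_side(200, 100)
```

The returned size could then be passed to any image library's resize call before the images are fed to a CNN.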
Cooperative Multi-agent Bandits: Distributed Algorithms with Optimal Individual Regret and Constant Communication Costs
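results: The proposed algorithm achieves optimal individual regret together with constant communication costs, combining the strengths of existing fully distributed and leader-follower algorithms.
Abstract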
Recently, there has been extensive study of cooperative multi-agent multi-armed bandits where a set of distributed agents cooperatively play the same multi-armed bandit game. The goal is to develop bandit algorithms with the optimal group and individual regrets and low communication between agents. The prior work tackled this problem using two paradigms: leader-follower and fully distributed algorithms. Prior algorithms in both paradigms achieve the optimal group regret. The leader-follower algorithms achieve constant communication costs but fail to achieve optimal individual regrets. The state-of-the-art fully distributed algorithms achieve optimal individual regrets but fail to achieve constant communication costs. This paper presents a simple yet effective communication policy and integrates it into a learning algorithm for cooperative bandits. Our algorithm achieves the best of both paradigms: optimal individual regret and constant communication costs.
A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming Languages
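results: The study identifies major unexpected behaviors and limitations of the model, and examines the ramifications of automated code generation for the evolution of programming languages and the tech industry.
Abstract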
Large Language Models (LLMs) are advanced Artificial Intelligence (AI) systems that have undergone extensive training using large datasets in order to understand and produce language that closely resembles that of humans. These models have reached a level of proficiency where they are capable of successfully completing university exams across several disciplines and generating functional code to handle novel problems. This research investigates the coding proficiency of ChatGPT 3.5, a LLM released by OpenAI in November 2022, which has gained significant recognition for its impressive text generating and code creation capabilities. The skill of the model in creating code snippets is evaluated across 10 various programming languages and 4 different software domains. Based on the findings derived from this research, major unexpected behaviors and limitations of the model have been identified. This study aims to identify potential areas for development and examine the ramifications of automated code generation on the evolution of programming languages and on the tech industry.
Apple Vision Pro for Healthcare: “The Ultimate Display”? – Entering the Wonderland of Precision
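results: The paper argues that the Apple Vision Pro could serve as a more efficient assistive tool in healthcare, supporting clinicians in essential tasks so that they can spend more time with their patients.
Abstract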
At the Worldwide Developers Conference (WWDC) in June 2023, Apple introduced the Vision Pro. The Vision Pro is a Mixed Reality (MR) headset, more specifically it is a Virtual Reality (VR) device with an additional Video See-Through (VST) capability. The VST capability turns the Vision Pro also into an Augmented Reality (AR) device. The AR feature is enabled by streaming the real world via cameras to the (VR) screens in front of the user's eyes. This is of course not unique and similar to other devices, like the Varjo XR-3. Nevertheless, the Vision Pro has some interesting features, like an inside-out screen that can show the headset wearers' eyes to "outsiders" or a button on the top, called "Digital Crown", that allows you to seamlessly blend digital content with your physical space by turning it. In addition, it is untethered, except for the cable to the battery, which makes the headset more agile, compared to the Varjo XR-3. This could actually come closer to the "Ultimate Display", which Ivan Sutherland had already sketched in 1965. Not available to the public yet, like the Ultimate Display, we want to take a look into the crystal ball in this perspective to see if it can overcome some clinical challenges that - especially - AR still faces in the medical domain, but also go beyond and discuss if the Vision Pro could support clinicians in essential tasks to spend more time with their patients.
Interpretable Goal-Based model for Vehicle Trajectory Prediction in Interactive Scenarios
results: The model is implemented and evaluated on the INTERACTION dataset, showing that it can explain its trajectory predictions without compromising accuracy.
Abstract
The abilities to understand the social interaction behaviors between a vehicle and its surroundings while predicting its trajectory in an urban environment are critical for road safety in autonomous driving. Social interactions are hard to explain because of their uncertainty. In recent years, neural network-based methods have been widely used for trajectory prediction and have been shown to outperform hand-crafted methods. However, these methods suffer from their lack of interpretability. In order to overcome this limitation, we combine the interpretability of a discrete choice model with the high accuracy of a neural network-based model for the task of vehicle trajectory prediction in an interactive environment. We implement and evaluate our model using the INTERACTION dataset and demonstrate the effectiveness of our proposed architecture to explain its predictions without compromising the accuracy.
Vehicle Motion Forecasting using Prior Information and Semantic-assisted Occupancy Grid Maps
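results: In experiments on the real-world NuScenes dataset, the model predicts both static and dynamic vehicles better than OGM predictions, and an ablation study assesses the role of semantic labels and map information.
Abstract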
Motion prediction is a challenging task for autonomous vehicles due to uncertainty in the sensor data, the non-deterministic nature of the future, and the complex behavior of agents. In this paper, we tackle this problem by representing the scene as dynamic occupancy grid maps (DOGMs), associating semantic labels to the occupied cells and incorporating map information. We propose a novel framework that combines deep-learning-based spatio-temporal and probabilistic approaches to predict vehicle behaviors. Contrary to the conventional OGM prediction methods, evaluation of our work is conducted against the ground truth annotations. We experiment and validate our results on the real-world NuScenes dataset and show that our model has a superior ability to predict both static and dynamic vehicles compared to OGM predictions. Furthermore, we perform an ablation study and assess the role of semantic labels and map in the architecture.
Actor-Critic with variable time discretization via sustained actions
results: In four robotic control environments (Ant, HalfCheetah, Hopper, and Walker2D), the SusACER algorithm outperforms the state of the art.
Abstract
Reinforcement learning (RL) methods work in discrete time. In order to apply RL to inherently continuous problems like robotic control, a specific time discretization needs to be defined. This is a choice between sparse time control, which may be easier to train, and finer time control, which may allow for better ultimate performance. In this work, we propose SusACER, an off-policy RL algorithm that combines the advantages of different time discretization settings. Initially, it operates with sparse time discretization and gradually switches to a fine one. We analyze the effects of the changing time discretization in robotic control environments: Ant, HalfCheetah, Hopper, and Walker2D. In all cases our proposed algorithm outperforms state of the art.
Engineering LaCAM$^\ast$: Towards Real-Time, Large-Scale, and Near-Optimal Multi-Agent Pathfinding
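results: Empirical evidence shows that combining these improvement techniques significantly improves the solution quality of LaCAM*, further pushing the boundaries of MAPF algorithms.
Abstract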
This paper addresses the challenges of real-time, large-scale, and near-optimal multi-agent pathfinding (MAPF) through enhancements to the recently proposed LaCAM* algorithm. LaCAM* is a scalable search-based algorithm that guarantees the eventual finding of optimal solutions for cumulative transition costs. While it has demonstrated remarkable planning success rates, surpassing various state-of-the-art MAPF methods, its initial solution quality is far from optimal, and its convergence speed to the optimum is slow. To overcome these limitations, this paper introduces several improvement techniques, partly drawing inspiration from other MAPF methods. We provide empirical evidence that the fusion of these techniques significantly improves the solution quality of LaCAM*, thus further pushing the boundaries of MAPF algorithms.
In-Context Alignment: Chat with Vanilla Language Models Before Fine-Tuning
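methods: This note takes a vanilla pretrained language model, Llama-2, and aligns it at inference time through in-context learning.
results: Compared with direct prompting, in-context alignment without changing model weights yields a 7x increase in win-rate, making the vanilla language model comparable to strong baselines with alignment fine-tuning.
Abstract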
In this note, we explore inference-time alignment through in-context learning. We consider a vanilla pretrained language model Llama-2 before any fine-tuning and retrieve an average of 9 demonstration alignment examples when the model is prompted to follow chat-style instructions. Compared to direct prompting, the in-context alignment without changing model weights leads to a 7x increase in win-rate w.r.t. the text-davinci-003 model from OpenAI, making the vanilla language model comparable to strong baselines with alignment fine-tuning.
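The retrieve-then-prompt idea can be sketched as follows. The word-overlap retriever, the prompt template, and the demonstration data are all illustrative assumptions; the abstract does not specify how demonstrations are retrieved or formatted.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (a stand-in retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def build_prompt(query, demos, k=2):
    """Prepend the k demonstrations most similar to the query.

    demos is a list of (instruction, aligned_response) pairs. The model
    weights are never touched; alignment comes only from the context.
    """
    ranked = sorted(demos, key=lambda d: jaccard(query, d[0]), reverse=True)[:k]
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in ranked]
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical demonstration pool.
demos = [
    ("How do I boil an egg?", "Place the egg in boiling water for about ten minutes."),
    ("What is the capital of France?", "Paris."),
    ("How do I poach an egg?", "Simmer water, crack the egg in, and cook for three minutes."),
]
prompt = build_prompt("How do I fry an egg?", demos, k=2)
```

The resulting string would be passed to the pretrained model as-is, so the model continues the final "Answer:" in the style of the retrieved demonstrations.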
Lossy and Lossless (L$^2$) Post-training Model Size Compression
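results: The method achieves a stable $10\times$ compression ratio without sacrificing accuracy, and a $20\times$ compression ratio with minor accuracy loss in a short time. Code is available at https://github.com/ModelTC/L2_Compression.
Abstract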
Deep neural networks have delivered remarkable performance and have been widely used in various visual tasks. However, their huge size causes significant inconvenience for transmission and storage. Many previous studies have explored model size compression. However, these studies often approach various lossy and lossless compression methods in isolation, leading to challenges in achieving high compression ratios efficiently. This work proposes a post-training model size compression method that combines lossy and lossless compression in a unified way. We first propose a unified parametric weight transformation, which ensures different lossy compression methods can be performed jointly in a post-training manner. Then, a dedicated differentiable counter is introduced to guide the optimization of lossy compression to arrive at a more suitable point for later lossless compression. Additionally, our method can easily control a desired global compression ratio and allocate adaptive ratios for different layers. Finally, our method can achieve a stable $10\times$ compression ratio without sacrificing accuracy and a $20\times$ compression ratio with minor accuracy loss in a short time. Our code is available at https://github.com/ModelTC/L2_Compression .
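As a rough illustration of chaining a lossy step with a lossless one, the sketch below quantizes weights uniformly and then compresses the resulting codes with zlib. This is a simplified stand-in under stated assumptions, not the paper's method: it has no unified weight transformation and no differentiable counter, and all function names are hypothetical.

```python
import zlib


def quantize(weights, bits=4):
    """Lossy step: map each weight to one of 2**bits uniform levels."""
    assert 1 <= bits <= 8, "codes are stored one per byte"
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = bytes(round((w - lo) / step) for w in weights)
    return codes, lo, step


def compress(weights, bits=4):
    """Lossy quantization followed by lossless entropy coding (zlib)."""
    codes, lo, step = quantize(weights, bits)
    return zlib.compress(codes, 9), lo, step


def decompress(blob, lo, step):
    """Invert the lossless step exactly, then de-quantize approximately."""
    return [lo + c * step for c in zlib.decompress(blob)]


# Synthetic weights, for illustration only.
weights = [((i * 37) % 101 - 50) / 1000 for i in range(2000)]
blob, lo, step = compress(weights, bits=4)
restored = decompress(blob, lo, step)
```

Fewer quantization levels make the code stream more repetitive, so the lossless stage compresses it further. That interaction is the reason the abstract argues for optimizing the two stages jointly rather than in isolation.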
Teacher-Student Architecture for Knowledge Distillation: A Survey
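results: The survey shows how Teacher-Student architectures serve multiple knowledge distillation objectives, including knowledge compression, expansion, adaptation, and enhancement.
Abstract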
Although Deep neural networks (DNNs) have shown a strong capacity to solve large-scale problems in many areas, such DNNs are hard to be deployed in real-world systems due to their voluminous parameters. To tackle this issue, Teacher-Student architectures were proposed, where simple student networks with a few parameters can achieve comparable performance to deep teacher networks with many parameters. Recently, Teacher-Student architectures have been effectively and widely embraced on various knowledge distillation (KD) objectives, including knowledge compression, knowledge expansion, knowledge adaptation, and knowledge enhancement. With the help of Teacher-Student architectures, current studies are able to achieve multiple distillation objectives through lightweight and generalized student networks. Different from existing KD surveys that primarily focus on knowledge compression, this survey first explores Teacher-Student architectures across multiple distillation objectives. This survey presents an introduction to various knowledge representations and their corresponding optimization objectives. Additionally, we provide a systematic overview of Teacher-Student architectures with representative learning algorithms and effective distillation schemes. This survey also summarizes recent applications of Teacher-Student architectures across multiple purposes, including classification, recognition, generation, ranking, and regression. Lastly, potential research directions in KD are investigated, focusing on architecture design, knowledge quality, and theoretical studies of regression-based learning, respectively. Through this comprehensive survey, industry practitioners and the academic community can gain valuable insights and guidelines for effectively designing, learning, and applying Teacher-Student architectures on various distillation objectives.
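For readers new to the Teacher-Student setup, a minimal sketch of the classic soft-target distillation loss (for the knowledge compression objective) is given below. The temperature `T=2.0` and the example logits are assumptions chosen for illustration; the survey covers many other objectives and loss designs.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T gives a softer distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened outputs, scaled by T**2.

    The T**2 factor keeps gradient magnitudes comparable across
    temperatures in the standard soft-target formulation.
    """
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [1.0, 2.0, 3.0]
matched = kd_loss([1.0, 2.0, 3.0], teacher)     # student agrees with teacher
mismatched = kd_loss([3.0, 2.0, 1.0], teacher)  # student disagrees
```

In practice this term is usually combined with the ordinary supervised loss on ground-truth labels, weighted by a mixing coefficient.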
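results: Compared with baseline approaches, the proposed strategy is significantly more effective at exposing vulnerabilities in the Stable Diffusion (SD) model, even when it is enhanced with safety features. The framework is also effective for red teaming text-to-text models, resulting in a significantly higher toxic response generation rate than previously reported numbers.
Abstract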
Warning: this paper contains content that may be inappropriate or offensive. As generative models become available for public use in various applications, testing and analyzing vulnerabilities of these models has become a priority. Here we propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities against unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. We propose different in-context attack strategies to automatically learn effective and diverse adversarial prompts for text-to-image models. Our experiments demonstrate that compared to baseline approaches, our proposed strategy is significantly more effective in exposing vulnerabilities in Stable Diffusion (SD) model, even when the latter is enhanced with safety features. Furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models, resulting in significantly higher toxic response generation rate compared to previously reported numbers.
PokerKit: A Comprehensive Python Library for Fine-Grained Multi-Variant Poker Game Simulations
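results: PokerKit's reliability has been established through static type checking, extensive doctests, and unit tests, achieving 97% code coverage. PokerKit is a significant contribution to the field of computer poker, fostering future research and advanced AI development for a wide variety of poker games.
Abstract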
PokerKit is an open-source Python library designed to overcome the restrictions of existing poker game simulation and hand evaluation tools, which typically support only a handful of poker variants and lack flexibility in game state control. In contrast, PokerKit significantly expands this scope by supporting an extensive array of poker variants and providing a flexible architecture for users to define their custom games. This paper details the design and implementation of PokerKit, including its intuitive programmatic API, multi-variant game support, and a unified hand evaluation suite across different hand types. The flexibility of PokerKit allows for applications in diverse areas, such as poker AI development, tool creation, and online poker casino implementation. PokerKit's reliability has been established through static type checking, extensive doctests, and unit tests, achieving 97% code coverage. The introduction of PokerKit represents a significant contribution to the field of computer poker, fostering future research and advanced AI development for a wide variety of poker games.
MindDiffuser: Controlled Image Reconstruction from Human Brain Activity with Semantic and Structural Diffusion
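results: The model achieves state-of-the-art performance on the Natural Scenes Dataset (NSD) in both qualitative and quantitative analyses. The interpretability of its multimodal features, which align with the corresponding brain responses, supports the neurobiological plausibility of the model.
Abstract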
Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task. Especially, the achievement of precise and controllable image reconstruction bears great significance in propelling the progress and utilization of brain-computer interfaces. Despite the advancements in complex image reconstruction techniques, the challenge persists in achieving a cohesive alignment of both semantic (concepts and objects) and structure (position, orientation, and size) with the image stimuli. To address the aforementioned issue, we propose a two-stage image reconstruction model called MindDiffuser. In Stage 1, the VQ-VAE latent representations and the CLIP text embeddings decoded from fMRI are put into Stable Diffusion, which yields a preliminary image that contains semantic information. In Stage 2, we utilize the CLIP visual feature decoded from fMRI as supervisory information, and continually adjust the two feature vectors decoded in Stage 1 through backpropagation to align the structural information. The results of both qualitative and quantitative analyses demonstrate that our model has surpassed the current state-of-the-art models on Natural Scenes Dataset (NSD). The subsequent experimental findings corroborate the neurobiological plausibility of the model, as evidenced by the interpretability of the multimodal feature employed, which align with the corresponding brain responses.
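results: The method's effectiveness is demonstrated quantitatively on the Meine DGS (mDGS) and BBC-Oxford British Sign Language (BOBSL) datasets, recovering up to a 33.22 BLEU-1 score in word alignment.
Abstract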
Capturing and annotating sign language datasets is a time-consuming and costly process. Current datasets are orders of magnitude too small to successfully train unconstrained Sign Language Translation (SLT) models. As a result, research has turned to TV broadcast content as a source of large-scale training data, consisting of both the sign language interpreter and the associated audio subtitle. However, lack of sign language annotation limits the usability of this data and has led to the development of automatic annotation techniques such as sign spotting. These spottings are aligned to the video rather than the subtitle, which often results in a misalignment between the subtitle and spotted signs. In this paper we propose a method for aligning spottings with their corresponding subtitles using large spoken language models. Using a single modality means our method is computationally inexpensive and can be utilized in conjunction with existing alignment techniques. We quantitatively demonstrate the effectiveness of our method on the Meine DGS (mDGS) and BBC-Oxford British Sign Language (BOBSL) datasets, recovering up to a 33.22 BLEU-1 score in word alignment.
AutoPCF: Efficient Product Carbon Footprint Accounting with Large Language Models
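results: Estimating the carbon footprint of three case products shows that the AutoPCF framework can automate PCF modeling and estimation, reducing modeling time from days to minutes.
Abstract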
The product carbon footprint (PCF) is crucial for decarbonizing the supply chain, as it measures the direct and indirect greenhouse gas emissions caused by all activities during the product's life cycle. However, PCF accounting often requires expert knowledge and significant time to construct life cycle models. In this study, we test and compare the emergent ability of five large language models (LLMs) in modeling the 'cradle-to-gate' life cycles of products and generating the inventory data of inputs and outputs, revealing their limitations as a generalized PCF knowledge database. By utilizing LLMs, we propose an automatic AI-driven PCF accounting framework, called AutoPCF, which also applies deep learning algorithms to automatically match calculation parameters, and ultimately calculate the PCF. The results of estimating the carbon footprint for three case products using the AutoPCF framework demonstrate its potential in achieving automatic modeling and estimation of PCF with a large reduction in modeling time from days to minutes.
Federated Inference with Reliable Uncertainty Quantification over Wireless Channels via Conformal Prediction
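results: Numerical results show that WFCP has significant advantages over digital implementations of existing federated CP schemes, especially with limited communication resources and/or a large number of devices.
Abstract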
Consider a setting in which devices and a server share a pre-trained model. The server wishes to make an inference on a new input given the model. Devices have access to data, previously not used for training, and can communicate to the server over a common wireless channel. If the devices have no access to the new input, can communication from devices to the server enhance the quality of the inference decision at the server? Recent work has introduced federated conformal prediction (CP), which leverages devices-to-server communication to improve the reliability of the server's decision. With federated CP, devices communicate to the server information about the loss accrued by the shared pre-trained model on the local data, and the server leverages this information to calibrate a decision interval, or set, so that it is guaranteed to contain the correct answer with a pre-defined target reliability level. Previous work assumed noise-free communication, whereby devices can communicate a single real number to the server. In this paper, we study for the first time federated CP in a wireless setting. We introduce a novel protocol, termed wireless federated conformal prediction (WFCP), which builds on type-based multiple access (TBMA) and on a novel quantile correction strategy. WFCP is proved to provide formal reliability guarantees in terms of coverage of the predicted set produced by the server. Using numerical results, we demonstrate the significant advantages of WFCP against digital implementations of existing federated CP schemes, especially in regimes with limited communication resources and/or large number of devices.
Summary
Devices and a server share a pre-trained model, and the server wants to make an inference on a new input. The devices hold data that was not used for training and can communicate with the server over a shared wireless channel. If the devices have no access to the new input, can device-to-server communication still improve the server's inference decision? Recent work introduced federated conformal prediction (CP), in which devices send the server information about the loss of the shared model on their local data, and the server uses it to calibrate a decision interval, or set, that is guaranteed to contain the correct answer with a pre-defined target reliability level. Previous work assumed noise-free communication in which each device can send a single real number. This paper studies federated CP in a wireless setting for the first time and introduces wireless federated conformal prediction (WFCP), a protocol built on type-based multiple access (TBMA) and a novel quantile correction strategy. WFCP provably guarantees coverage of the predicted set, and numerical results show significant advantages over digital implementations of existing federated CP schemes, especially with limited communication resources and/or a large number of devices.
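The calibration idea behind conformal prediction can be sketched in its simplest centralized, noise-free form: take a quantile of the calibration losses as a threshold, then keep every candidate label whose loss does not exceed it. This is a minimal sketch of standard split conformal prediction, not of the WFCP protocol; the function names and the example scores are assumptions made for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, the threshold is infinite and the
    prediction set contains every label.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(label_scores, threshold):
    """Keep every candidate label whose score is within the threshold."""
    return {label for label, score in label_scores.items() if score <= threshold}


# Hypothetical calibration losses collected from held-out data.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold = conformal_threshold(cal_scores, alpha=0.2)
labels = prediction_set({"cat": 0.3, "dog": 0.85, "bird": 0.8}, threshold)
```

In the federated setting described above, the calibration scores live on the devices, so the server has to estimate this quantile from what it receives over the channel.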
Semantic Interpretation and Validation of Graph Attention-based Explanations for GNN Models
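results: Applied to a lidar point cloud estimation model, the method successfully identifies the semantic classes that contribute to improved performance and generates reliable post-hoc semantic explanations.
Abstract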
In this work, we propose a methodology for investigating the application of semantic attention to enhance the explainability of Graph Neural Network (GNN)-based models, introducing semantically-informed perturbations and establishing a correlation between predicted feature-importance weights and model accuracy. Graph Deep Learning (GDL) has emerged as a promising field for tasks like scene interpretation, leveraging flexible graph structures to concisely describe complex features and relationships. As traditional explainability methods used in eXplainable AI (XAI) cannot be directly applied to such structures, graph-specific approaches are introduced. Attention mechanisms have demonstrated their efficacy in estimating the importance of input features in deep learning models and thus have been previously employed to provide feature-based explanations for GNN predictions. Building upon these insights, we extend existing attention-based graph-explainability methods by investigating the use of attention weights as importance indicators of semantically sorted feature sets. Through analysing the behaviour of the predicted attention-weights distribution in correlation with model accuracy, we gain valuable insights into feature importance with respect to the behaviour of the GNN model. We apply our methodology to a lidar pointcloud estimation model and successfully identify key semantic classes that contribute to enhanced performance, effectively generating reliable post-hoc semantic explanations.
Summary
This work proposes a methodology for improving the explainability of Graph Neural Network (GNN) models by introducing semantically-informed perturbations and establishing a correlation between predicted feature-importance weights and model accuracy. Graph Deep Learning (GDL) has become a promising field for tasks such as scene interpretation, using flexible graph structures to describe complex features and relationships concisely. Because traditional explainability methods from XAI cannot be applied directly to such structures, graph-specific approaches are needed. Attention mechanisms have proven effective at estimating the importance of input features and have previously been used to provide feature-based explanations for GNN predictions. The authors extend existing attention-based graph-explainability methods by using attention weights as importance indicators for semantically sorted feature sets, and analyze how the distribution of predicted attention weights correlates with model accuracy. Applied to a lidar point cloud estimation model, the methodology identifies key semantic classes that contribute to enhanced performance and yields reliable post-hoc semantic explanations.
Hybrid Retrieval-Augmented Generation for Real-time Composition Assistance
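results: On Wikitext and Pile subsets, HybridRAG achieves lower latency than a cloud-based retrieval-augmented LLM while outperforming client-only models in utility.
Abstract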
Retrieval augmented models show promise in enhancing traditional language models by improving their contextual understanding, integrating private data, and reducing hallucination. However, the processing time required for retrieval augmented large language models poses a challenge when applying them to tasks that require real-time responses, such as composition assistance. To overcome this limitation, we propose the Hybrid Retrieval-Augmented Generation (HybridRAG) framework that leverages a hybrid setting that combines both client and cloud models. HybridRAG incorporates retrieval-augmented memory generated asynchronously by a Large Language Model (LLM) in the cloud. By integrating this retrieval augmented memory, the client model acquires the capability to generate highly effective responses, benefiting from the LLM's capabilities. Furthermore, through asynchronous memory integration, the client model is capable of delivering real-time responses to user requests without the need to wait for memory synchronization from the cloud. Our experiments on Wikitext and Pile subsets show that HybridRAG achieves lower latency than a cloud-based retrieval-augmented LLM, while outperforming client-only models in utility.
Summary
Note:
* "Retrieval-augmented models" refers to models that use retrieval-augmented memory to improve their performance.
* "Large Language Model" (LLM) refers to a model that can process and generate human-like language.
* "Client model" refers to a model that runs on a local device, such as a smartphone or a computer.
* "Cloud model" refers to a model that runs on a remote server, such as a cloud computing service.
* "Memory synchronization" refers to the process of synchronizing the memory of multiple devices or models, so that they can access and share the same information.
* "Utility" refers to the usefulness or effectiveness of a model or approach.
Adding Why to What? Analyses of an Everyday Explanation
paper_authors: Lutz Terfloth, Michael Schaffer, Heike M. Buhl, Carsten Schulte
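for: The paper studies how technical decisions can be explained to non-expert (lay) users.
methods: The paper uses the dual nature theory, a techno-philosophical approach, as an analytical framework for explanations addressed to non-expert users.
results: When explaining a game, explainers first focused on the Architecture and only later on the Relevance. In the video recalls, explainers said they initially addressed the physical aspects to cover the basic components before moving on to more complex, intangible aspects. Shifts between the two sides were justified by explanation goals, emerging misunderstandings, and the knowledge needs of the explainee.
Abstract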
In XAI it is important to consider that, in contrast to explanations for professional audiences, one cannot assume common expertise when explaining for laypeople. But such explanations between humans vary greatly, making it difficult to research commonalities across explanations. We used the dual nature theory, a techno-philosophical approach, to cope with these challenges. According to it, one can explain, for example, an XAI's decision by addressing its dual nature: by focusing on the Architecture (e.g., the logic of its algorithms) or the Relevance (e.g., the severity of a decision, the implications of a recommendation). We investigated 20 game explanations using the theory as an analytical framework. We elaborate how we used the theory to quickly structure and compare explanations of technological artifacts. We supplemented results from analyzing the explanation contents with results from a video recall to explore how explainers justified their explanation. We found that explainers were focusing on the physical aspects of the game first (Architecture) and only later on aspects of the Relevance. Reasoning in the video recalls indicated that EX regarded the focus on the Architecture as important for structuring the explanation initially by explaining the basic components before focusing on more complex, intangible aspects. Shifting between addressing the two sides was justified by explanation goals, emerging misunderstandings, and the knowledge needs of the explainee. We discovered several commonalities that inspire future research questions which, if further generalizable, provide first ideas for the construction of synthetic explanations.
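Summary
In XAI it is important to note that, unlike explanations for professional audiences, explanations for laypeople cannot assume common expertise. Explanations between humans also vary greatly, which makes it hard to study what they have in common. The authors use the dual nature theory, a techno-philosophical approach, to address these challenges. According to the theory, an XAI's decision can be explained by addressing either its Architecture (e.g., the logic of its algorithms) or its Relevance (e.g., the severity of a decision, the implications of a recommendation). They analyzed 20 game explanations with the theory as an analytical framework and describe how it helps to structure and compare explanations of technological artifacts quickly. Results from the content analysis were supplemented with a video recall exploring how explainers justified their explanations. Explainers focused first on the physical aspects of the game (Architecture) and only later on the Relevance. Reasoning in the video recalls indicated that explainers considered the initial focus on the Architecture important for structuring the explanation: basic components first, more complex and intangible aspects afterwards. Shifts between the two sides were justified by explanation goals, emerging misunderstandings, and the knowledge needs of the explainee. The commonalities found suggest future research questions and, if they generalize, offer first ideas for constructing synthetic explanations.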
Assistive Chatbots for healthcare: a succinct review
paper_authors: Basabdatta Sen Bhattacharya, Vibhav Sinai Pissurlenkar
for: The paper is written to review the state-of-the-art in AI-enabled Chatbots in healthcare, specifically during the last 10 years (2013-2023).
methods: The paper reviews commercial and non-commercial Chatbots that are being used for patient support, as well as those in clinical trial phases. It also discusses the need for thorough and rigorous checks to ensure patient safety and medical ethics.
results: The paper highlights a lack of trust in AI-enabled Chatbots among healthcare workers, patients, and the wider community, as well as dissatisfaction with the NLP skills of the Chatbots. It suggests that to enable deployment and integration of AI-enabled Chatbots in public health services, the technology needs to be simple and safe to use, and confidence in the technology needs to be built among the medical community and the wider community through outreach.
Abstract
Artificial Intelligence (AI) for supporting healthcare services has never been more necessitated than by the recent global pandemic. Here, we review the state-of-the-art in AI-enabled Chatbots in healthcare proposed during the last 10 years (2013-2023). The focus on AI-enabled technology is because of its potential for enhancing the quality of human-machine interaction via Chatbots, reducing dependence on human-human interaction and saving man-hours. Our review indicates that there are a handful of (commercial) Chatbots that are being used for patient support, while there are others (non-commercial) that are in the clinical trial phases. However, there is a lack of trust on this technology regarding patient safety and data protection, as well as a lack of wider awareness on its benefits among the healthcare workers and professionals. Also, patients have expressed dissatisfaction with Natural Language Processing (NLP) skills of the Chatbots in comparison to humans. Notwithstanding the recent introduction of ChatGPT that has raised the bar for the NLP technology, this Chatbot cannot be trusted with patient safety and medical ethics without thorough and rigorous checks to serve in the `narrow' domain of assistive healthcare. Our review suggests that to enable deployment and integration of AI-enabled Chatbots in public health services, the need of the hour is: to build technology that is simple and safe to use; to build confidence on the technology among: (a) the medical community by focussed training and development; (b) the patients and wider community through outreach.
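Summary
AI support for healthcare services has never been needed more than during the recent global pandemic. The paper reviews AI-enabled Chatbots in healthcare proposed over the last 10 years (2013-2023). A handful of commercial Chatbots are already used for patient support, while other, non-commercial ones are in clinical trial phases. However, trust in the technology is lacking with respect to patient safety and data protection, and healthcare workers and professionals are not widely aware of its benefits. Patients have also expressed dissatisfaction with the Chatbots' Natural Language Processing (NLP) skills compared with humans. Although ChatGPT has recently raised the bar for NLP, it cannot be trusted with patient safety and medical ethics in assistive healthcare without thorough and rigorous checks. The review concludes that deploying AI-enabled Chatbots in public health services requires technology that is simple and safe to use, and confidence built among the medical community through focused training and development and among patients and the wider community through outreach.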
Predicting Drug-Drug Interactions Using Knowledge Graphs
paper_authors: Lizzy Farrugia, Lilian M. Azzopardi, Jeremy Debattista, Charlie Abela
for: The paper aims to predict unknown Drug-Drug Interactions (DDIs) by incorporating Knowledge Graphs (KGs) and various drug features from public drug repositories.
methods: The medicX end-to-end framework uses a combination of translation, factorisation, and Neural Network (NN) based KG Embedding (KGE) methods to integrate drug features and predict unknown DDIs. The best performing combination was the ComplEx embedding method with a Long Short-Term Memory (LSTM) network.
results: The ComplEx embedding method with an LSTM network achieved an F1-score of 95.19% on a dataset based on the DDIs found in DrugBank version 5.1.8, outperforming the state-of-the-art model DeepDDI by 5.61%. Additionally, a graph auto-encoder model using a Graph Neural Network (GNN) achieved an F1-score of 91.94%.
Abstract
In the last decades, people have been consuming and combining more drugs than before, increasing the number of Drug-Drug Interactions (DDIs). To predict unknown DDIs, recently, studies started incorporating Knowledge Graphs (KGs) since they are able to capture the relationships among entities providing better drug representations than using a single drug property. In this paper, we propose the medicX end-to-end framework that integrates several drug features from public drug repositories into a KG and embeds the nodes in the graph using various translation, factorisation and Neural Network (NN) based KG Embedding (KGE) methods. Ultimately, we use a Machine Learning (ML) algorithm that predicts unknown DDIs. Among the different translation and factorisation-based KGE models, we found that the best performing combination was the ComplEx embedding method with a Long Short-Term Memory (LSTM) network, which obtained an F1-score of 95.19% on a dataset based on the DDIs found in DrugBank version 5.1.8. This score is 5.61% better than the state-of-the-art model DeepDDI. Additionally, we also developed a graph auto-encoder model that uses a Graph Neural Network (GNN), which achieved an F1-score of 91.94%. Consequently, GNNs have demonstrated a stronger ability to mine the underlying semantics of the KG than the ComplEx model, and thus using higher dimension embeddings within the GNN can lead to state-of-the-art performance.
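Summary
Over recent decades people have been consuming and combining more drugs, which has increased the number of Drug-Drug Interactions (DDIs). To predict unknown DDIs, recent studies have started to incorporate Knowledge Graphs (KGs), which capture relationships among entities and give better drug representations than a single drug property. The paper proposes medicX, an end-to-end framework that integrates several drug features from public drug repositories into a KG and embeds the graph nodes with translation, factorisation, and Neural Network (NN) based KG Embedding (KGE) methods; a Machine Learning algorithm then predicts unknown DDIs. Among the translation and factorisation-based KGE models, the best combination was the ComplEx embedding method with a Long Short-Term Memory (LSTM) network, which reached an F1-score of 95.19% on a dataset based on the DDIs in DrugBank version 5.1.8, 5.61% better than the state-of-the-art model DeepDDI. A graph auto-encoder built on a Graph Neural Network (GNN) achieved an F1-score of 91.94%. The authors conclude that GNNs show a stronger ability to mine the underlying semantics of the KG and that higher-dimensional embeddings within the GNN can lead to state-of-the-art performance.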
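For readers unfamiliar with ComplEx, its triple score is the real part of a trilinear product of complex-valued embeddings: Re(sum_k h_k * r_k * conj(t_k)). The sketch below computes only that score; the embedding values are made up for illustration, and how medicX feeds the embeddings into the LSTM classifier is not shown because the abstract does not describe it.

```python
def complex_score(head, relation, tail):
    """ComplEx score: real part of sum_k head[k] * relation[k] * conj(tail[k])."""
    total = sum(h * r * t.conjugate() for h, r, t in zip(head, relation, tail))
    return total.real


# Hypothetical 2-dimensional complex embeddings for one (drug, relation, drug) triple.
head = [1 + 1j, 2 + 0j]
relation = [0 + 1j, 1 + 0j]
tail = [1 - 1j, 0 + 1j]
score = complex_score(head, relation, tail)
```

Because the tail embedding is conjugated, swapping head and tail generally changes the score, which lets the model represent asymmetric relations.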
Current and Future Challenges in Knowledge Representation and Reasoning
paper_authors: James P. Delgrande, Birte Glimm, Thomas Meyer, Miroslaw Truszczynski, Frank Wolter
for: The paper discusses the current state of the art in Knowledge Representation and Reasoning, including its relation to other areas such as machine learning and uncertainty reasoning, and provides recommendations for future progress.
methods: The paper is based on presentations, panels, working groups, and discussions that took place at a Dagstuhl Perspectives workshop on Knowledge Representation and Reasoning in July 2022.
results: The paper provides a manifesto that declares the current views on Knowledge Representation, including its origins, goals, milestones, and current foci, as well as its challenges and key priorities for the next decade.
Abstract
Knowledge Representation and Reasoning is a central, longstanding, and active area of Artificial Intelligence. Over the years it has evolved significantly; more recently it has been challenged and complemented by research in areas such as machine learning and reasoning under uncertainty. In July 2022 a Dagstuhl Perspectives workshop was held on Knowledge Representation and Reasoning. The goal of the workshop was to describe the state of the art in the field, including its relation with other areas, its shortcomings and strengths, together with recommendations for future progress. We developed this manifesto based on the presentations, panels, working groups, and discussions that took place at the Dagstuhl Workshop. It is a declaration of our views on Knowledge Representation: its origins, goals, milestones, and current foci; its relation to other disciplines, especially to Artificial Intelligence; and on its challenges, along with key priorities for the next decade.
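Summary
Knowledge Representation and Reasoning is a central, longstanding, and active area of Artificial Intelligence. It has evolved significantly over the years and has more recently been challenged and complemented by research on machine learning and reasoning under uncertainty. A Dagstuhl Perspectives workshop on Knowledge Representation and Reasoning was held in July 2022 to describe the state of the art in the field, including its relation to other areas, its shortcomings and strengths, and recommendations for future progress. The manifesto is based on the presentations, panels, working groups, and discussions at that workshop. It sets out the authors' views on Knowledge Representation: its origins, goals, milestones, and current foci; its relation to other disciplines, especially Artificial Intelligence; and its challenges and key priorities for the next decade.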
Correlating Medi-Claim Service by Deep Learning Neural Networks
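results: Using a convolutional neural network architecture together with a correlation study, the approach detects fraudulent claims and can help prevent financial fraud.
Abstract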
Medical insurance claim fraud is an organized crime that involves patients, physicians, diagnostic centers, and insurance providers, forming a chain reaction that must be monitored constantly. These kinds of fraud affect the financial growth of both insured people and health insurance companies. A Convolutional Neural Network architecture is used to detect fraudulent claims through a correlation study of regression models, which helps to detect money laundering in claims submitted by different providers. Supervised and unsupervised classifiers are used to separate fraudulent from non-fraudulent claims.
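Summary
Medical insurance claim fraud is an organized crime involving patients, physicians, diagnostic centers, and insurance providers, forming a chain reaction that must be monitored constantly. Such fraud affects the financial growth of both insured people and health insurance companies. A convolutional neural network architecture detects fraudulent claims through a correlation study of claims from different providers, which also helps to detect money laundering. Supervised and unsupervised classifiers are used to distinguish fraudulent from non-fraudulent claims.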
Heterogeneous 360 Degree Videos in Metaverse: Differentiated Reinforcement Learning Approaches
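results: Experiments show that the model effectively optimizes frame rate and compression rate and adapts to scenarios with different requirements.
Abstract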
Advanced video technologies are driving the development of the futuristic Metaverse, which aims to connect users from anywhere and anytime. As such, the use cases for users will be much more diverse, leading to a mix of 360-degree videos with two types: non-VR and VR 360-degree videos. This paper presents a novel Quality of Service model for heterogeneous 360-degree videos with different requirements for frame rates and cybersickness. We propose a frame-slotted structure and conduct frame-wise optimization using self-designed differentiated deep reinforcement learning algorithms. Specifically, we design two structures, Separate Input Differentiated Output (SIDO) and Merged Input Differentiated Output (MIDO), for this heterogeneous scenario. We also conduct comprehensive experiments to demonstrate their effectiveness.
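Summary
Advanced video technologies are driving the development of the futuristic Metaverse, which aims to connect users from anywhere and at any time. User use cases will therefore become much more diverse, leading to a mix of two types of 360-degree video: non-VR and VR. The paper presents a novel Quality of Service model for heterogeneous 360-degree videos with different requirements for frame rate and cybersickness. It proposes a frame-slotted structure and performs frame-wise optimization with self-designed differentiated deep reinforcement learning algorithms. Two structures are designed for this heterogeneous scenario, Separate Input Differentiated Output (SIDO) and Merged Input Differentiated Output (MIDO), and comprehensive experiments demonstrate their effectiveness.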
Federated Zeroth-Order Optimization using Trajectory-Informed Surrogate Gradients
for: Federated zeroth-order optimization (ZOO) algorithms, which are used for query- and communication-efficient optimization in applications such as federated learning.
methods: Trajectory-informed gradient surrogates and adaptive gradient correction techniques, which are used to improve the accuracy and efficiency of federated ZOO.
results: The proposed FZooS algorithm achieves theoretical improvements over existing approaches and is supported by real-world experiments in federated black-box adversarial attack and federated non-differentiable metric optimization.
Abstract
Federated optimization, an emerging paradigm which finds wide real-world applications such as federated learning, enables multiple clients (e.g., edge devices) to collaboratively optimize a global function. The clients do not share their local datasets and typically only share their local gradients. However, the gradient information is not available in many applications of federated optimization, which hence gives rise to the paradigm of federated zeroth-order optimization (ZOO). Existing federated ZOO algorithms suffer from the limitations of query and communication inefficiency, which can be attributed to (a) their reliance on a substantial number of function queries for gradient estimation and (b) the significant disparity between their realized local updates and the intended global updates. To this end, we (a) introduce trajectory-informed gradient surrogates, which are able to use the history of function queries during optimization for accurate and query-efficient gradient estimation, and (b) develop the technique of adaptive gradient correction using these gradient surrogates to mitigate the aforementioned disparity. Based on these, we propose the federated zeroth-order optimization using trajectory-informed surrogate gradients (FZooS) algorithm for query- and communication-efficient federated ZOO. Our FZooS achieves theoretical improvements over the existing approaches, which is supported by our real-world experiments such as federated black-box adversarial attack and federated non-differentiable metric optimization.
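Summary
Federated optimization is an emerging paradigm with wide real-world applications such as federated learning. It lets multiple clients (e.g., edge devices) collaboratively optimize a global function; clients do not share their local datasets and typically share only their local gradients. In many applications gradient information is unavailable, which gives rise to federated zeroth-order optimization (ZOO). Existing federated ZOO algorithms are query- and communication-inefficient because (a) they rely on a large number of function queries for gradient estimation and (b) their realized local updates differ significantly from the intended global updates. The authors (a) introduce trajectory-informed gradient surrogates, which use the history of function queries during optimization for accurate and query-efficient gradient estimation, and (b) develop adaptive gradient correction based on these surrogates to reduce the disparity. The resulting algorithm, federated zeroth-order optimization using trajectory-informed surrogate gradients (FZooS), achieves theoretical improvements over existing approaches, supported by real-world experiments such as federated black-box adversarial attack and federated non-differentiable metric optimization.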
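To see why gradient estimation from function queries is expensive, consider the textbook central finite-difference estimator, which needs two queries per coordinate for every gradient. The sketch below shows that baseline only; it is not the trajectory-informed surrogate used by FZooS, and the function names and the test objective are assumptions chosen for illustration.

```python
def finite_difference_gradient(f, x, delta=1e-4):
    """Estimate the gradient of a black-box function f at x.

    Uses central differences, so it issues 2 * len(x) function queries.
    """
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += delta
        backward[i] -= delta
        grad.append((f(forward) - f(backward)) / (2 * delta))
    return grad


def sphere(x):
    """Hypothetical black-box objective: sum of squares."""
    return sum(v * v for v in x)


# The true gradient of the sphere function at (1, -2) is (2, -4).
estimate = finite_difference_gradient(sphere, [1.0, -2.0])
```

With d parameters and several local steps per round on each client, this query cost grows quickly, which is the inefficiency the abstract attributes to existing federated ZOO algorithms.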
Path Signatures for Diversity in Probabilistic Trajectory Optimisation
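results: Experiments show that the strategy achieves lower average costs on a range of problems, from 2D navigation to robotic manipulators operating in cluttered environments.
Abstract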
Motion planning can be cast as a trajectory optimisation problem where a cost is minimised as a function of the trajectory being generated. In complex environments with several obstacles and complicated geometry, this optimisation problem is usually difficult to solve and prone to local minima. However, recent advancements in computing hardware allow for parallel trajectory optimisation where multiple solutions are obtained simultaneously, each initialised from a different starting point. Unfortunately, without a strategy preventing two solutions from collapsing onto each other, naive parallel optimisation can suffer from mode collapse, diminishing the efficiency of the approach and the likelihood of finding a global solution. In this paper we leverage recent advances in the theory of rough paths to devise an algorithm for parallel trajectory optimisation that promotes diversity over the range of solutions, therefore avoiding mode collapses and achieving better global properties. Our approach builds on path signatures and Hilbert space representations of trajectories, and connects parallel variational inference for trajectory estimation with diversity promoting kernels. We empirically demonstrate that this strategy achieves lower average costs than competing alternatives on a range of problems, from 2D navigation to robotic manipulators operating in cluttered environments.
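Summary
Motion planning can be cast as a trajectory optimisation problem in which a cost is minimised as a function of the trajectory. In complex environments with several obstacles and complicated geometry, this problem is usually hard to solve and prone to local minima. Recent advances in computing hardware allow parallel trajectory optimisation, in which multiple solutions are obtained simultaneously from different starting points. Without a strategy that keeps solutions from collapsing onto each other, however, naive parallel optimisation can suffer from mode collapse, reducing both efficiency and the likelihood of finding a global solution. The paper draws on recent advances in the theory of rough paths to devise a parallel trajectory optimisation algorithm that promotes diversity across solutions, avoiding mode collapse and achieving better global properties. The approach builds on path signatures and Hilbert space representations of trajectories and connects parallel variational inference for trajectory estimation with diversity-promoting kernels. Empirically, it achieves lower average costs than competing alternatives on problems ranging from 2D navigation to robotic manipulators in cluttered environments.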
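A path signature is a sequence of iterated integrals of a path; for a piecewise-linear trajectory the first two levels can be accumulated segment by segment with Chen's identity. The sketch below computes only those two levels in plain Python. The function name and the example path are assumptions, and the paper's kernel construction on top of signatures is not reproduced here.

```python
def signature_level2(points):
    """Return the level-1 and level-2 signature terms of a piecewise-linear path.

    points: list of d-dimensional waypoints.
    Level 1 is the total displacement; level 2 is accumulated with
    Chen's identity: S2 += outer(S1, delta) + 0.5 * outer(delta, delta).
    """
    d = len(points[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for prev, cur in zip(points, points[1:]):
        delta = [c - p for p, c in zip(prev, cur)]
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2


# Hypothetical 2D path: move right, then up.
s1, s2 = signature_level2([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
levy_area = 0.5 * (s2[0][1] - s2[1][0])
```

The antisymmetric part of level 2 (the Lévy area) distinguishes "right then up" from "up then right" even though both paths share the same endpoints, which gives an intuition for why signatures are useful for telling trajectories apart.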
Enhancing Adversarial Robustness in Low-Label Regime via Adaptively Weighted Regularization and Knowledge Distillation
results: Experiments show that the proposed algorithm achieves state-of-the-art performance by a significant margin over existing algorithms. Even with very little labeled data, it is nearly on par with supervised adversarial training that uses all labels: with only 8% labeled data, its standard and robust accuracies on CIFAR-10 are comparable.
Abstract
Adversarial robustness is a research area that has recently received a lot of attention in the quest for trustworthy artificial intelligence. However, recent works on adversarial robustness have focused on supervised learning where it is assumed that labeled data is plentiful. In this paper, we investigate semi-supervised adversarial training where labeled data is scarce. We derive two upper bounds for the robust risk and propose a regularization term for unlabeled data motivated by these two upper bounds. Then, we develop a semi-supervised adversarial training algorithm that combines the proposed regularization term with knowledge distillation using a semi-supervised teacher (i.e., a teacher model trained using a semi-supervised learning algorithm). Our experiments show that our proposed algorithm achieves state-of-the-art performance with significant margins compared to existing algorithms. In particular, compared to supervised learning algorithms, performance of our proposed algorithm is not much worse even when the amount of labeled data is very small. For example, our algorithm with only 8% labeled data is comparable to supervised adversarial training algorithms that use all labeled data, both in terms of standard and robust accuracies on CIFAR-10.
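Summary
Adversarial robustness has recently received a lot of attention in the quest for trustworthy artificial intelligence. Existing work, however, usually assumes that labeled data is plentiful, whereas this paper investigates semi-supervised adversarial training when labeled data is scarce. The authors derive two upper bounds for the robust risk and propose a regularization term for unlabeled data motivated by these bounds. They then develop a semi-supervised adversarial training algorithm that combines the regularization term with knowledge distillation from a semi-supervised teacher (i.e., a teacher model trained with a semi-supervised learning algorithm). Experiments show state-of-the-art performance, and performance does not degrade much even when very little labeled data is available. For example, with only 8% labeled data the algorithm is comparable to fully supervised adversarial training in both standard and robust accuracy on CIFAR-10.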
SODFormer: Streaming Object Detection with Transformer Using Events and Frames
methods: Uses a Transformer architecture that integrates events and frames to detect objects continuously in an asynchronous manner, with an asynchronous attention-based fusion module that combines the two heterogeneous sensing modalities.
results: SODFormer significantly outperforms four state-of-the-art methods and eight baselines, and it works well even where a conventional frame-based camera fails, such as high-speed motion and low-light conditions.
Abstract
The DAVIS camera, streaming two complementary sensing modalities of asynchronous events and frames, has gradually been used to address major object detection challenges (e.g., fast motion blur and low-light). However, how to effectively leverage rich temporal cues and fuse two heterogeneous visual streams remains a challenging endeavor. To address this challenge, we propose a novel streaming object detector with Transformer, namely SODFormer, which first integrates events and frames to continuously detect objects in an asynchronous manner. Technically, we first build a large-scale multimodal neuromorphic object detection dataset (i.e., PKU-DAVIS-SOD) over 1080.1k manual labels. Then, we design a spatiotemporal Transformer architecture to detect objects via an end-to-end sequence prediction problem, where the novel temporal Transformer module leverages rich temporal cues from two visual streams to improve the detection performance. Finally, an asynchronous attention-based fusion module is proposed to integrate two heterogeneous sensing modalities and take complementary advantages from each end, which can be queried at any time to locate objects and break through the limited output frequency from synchronized frame-based fusion strategies. The results show that the proposed SODFormer outperforms four state-of-the-art methods and our eight baselines by a significant margin. We also show that our unifying framework works well even in cases where the conventional frame-based camera fails, e.g., high-speed motion and low-light conditions. Our dataset and code are available at https://github.com/dianzl/SODFormer.
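Summary
The DAVIS camera, which streams two complementary sensing modalities of asynchronous events and frames, has gradually been used to address major object detection challenges (e.g., fast motion blur and low light). How to exploit rich temporal cues and fuse two heterogeneous visual streams effectively remains challenging. The authors propose SODFormer, a novel streaming object detector with Transformer that integrates events and frames to detect objects continuously in an asynchronous manner. They first build a large-scale multimodal neuromorphic object detection dataset (PKU-DAVIS-SOD) with 1080.1k manual labels. They then design a spatiotemporal Transformer architecture that treats detection as an end-to-end sequence prediction problem, in which a novel temporal Transformer module uses temporal cues from both visual streams to improve detection performance. Finally, an asynchronous attention-based fusion module integrates the two heterogeneous sensing modalities; it can be queried at any time to locate objects, overcoming the limited output frequency of synchronized frame-based fusion strategies. SODFormer outperforms four state-of-the-art methods and eight baselines by a significant margin, and the framework works well even under high-speed motion and low-light conditions. The dataset and code are available at https://github.com/dianzl/SODFormer.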
Non-Intrusive Electric Load Monitoring Approach Based on Current Feature Visualization for Smart Energy Management
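methods: The paper employs popular computer vision techniques: signal transforms (wavelet and discrete Fourier) and the Gramian Angular Field method map one-dimensional current signals onto two-dimensional color feature images, and a U-shaped deep neural network with multi-scale feature extraction and an attention mechanism then recognizes all electric loads.
results: Experiments on public and private datasets show that the proposed method outperforms its peers and can support efficient energy management in the large-scale Internet of Things (IoT).
Abstract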
The state-of-the-art smart city has been calling for an economic but efficient energy management over large-scale network, especially for the electric power system. It is a critical issue to monitor, analyze and control electric loads of all users in system. In this paper, we employ the popular computer vision techniques of AI to design a non-invasive load monitoring method for smart electric energy management. First of all, we utilize both signal transforms (including wavelet transform and discrete Fourier transform) and Gramian Angular Field (GAF) methods to map one-dimensional current signals onto two-dimensional color feature images. Second, we propose to recognize all electric loads from color feature images using a U-shape deep neural network with multi-scale feature extraction and attention mechanism. Third, we design our method as a cloud-based, non-invasive monitoring of all users, thereby saving energy cost during electric power system control. Experimental results on both public and our private datasets have demonstrated our method achieves superior performances than its peers, and thus supports efficient energy management over large-scale Internet of Things (IoT).
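Summary
Modern smart cities require economical and efficient energy management, especially for the electric power system. Monitoring, analyzing, and controlling the electric loads of all users is a key problem. In this paper, we use popular AI computer vision techniques to design a non-intrusive electric load monitoring method. First, we use signal transforms (including the wavelet transform and the discrete Fourier transform) and the Gramian Angular Field (GAF) method to map one-dimensional current signals onto two-dimensional color feature images. Second, we recognize all electric loads with a U-shaped deep neural network that has multi-scale feature extraction and an attention mechanism. Finally, we design the method as cloud-based, non-intrusive monitoring, which saves energy cost during electric power system control. Experimental results show that our method achieves higher performance on both public and private datasets and therefore supports efficient energy management over the large-scale Internet of Things (IoT).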
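To make the signal-to-image step more concrete, here is a minimal sketch of one common Gramian Angular Field variant (the summation field). It is an illustration under assumptions, not the authors' pipeline: the function name `gramian_angular_field` and the toy samples are hypothetical, and the sketch produces a single-channel matrix, whereas the paper describes color feature images.

```python
import math

def gramian_angular_field(signal):
    """Return the Gramian Angular Summation Field of a 1-D sequence as a list of rows."""
    lo, hi = min(signal), max(signal)
    span = hi - lo
    # Rescale to [-1, 1] so that arccos is defined; a constant signal maps to 0.
    scaled = [0.0 if span == 0 else 2.0 * (x - lo) / span - 1.0 for x in signal]
    # Clamp against floating-point drift, then encode each sample as an angle.
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    # Entry (i, j) of the summation field is cos(phi_i + phi_j).
    return [[math.cos(a + b) for b in phi] for a in phi]

current = [0.0, 0.5, 1.0, 0.5, 0.0]  # toy current samples (made up for illustration)
image = gramian_angular_field(current)
```

The result is a symmetric matrix whose side length equals the number of samples, so a window of N current readings becomes an N x N image that a convolutional network can consume.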
InfeRE: Step-by-Step Regex Generation via Chain of Inference
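methods: The paper proposes a new paradigm that decomposes regex generation into chains of step-by-step inference, improving the accuracy and interpretability of the generated regexes. It also introduces a self-consistency decoding mechanism that ensembles the outputs of multiple models to make generation more robust.
results: Experiments on two public datasets (NL-RX-Turk and KB13) show that InfeRE substantially improves regex generation by neural language models: it raises DFA@5 accuracy by 16.3% and 14.7% over previous baselines and by 18.1% and 11.3% over the tree-based generation approach.
Abstract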
Automatically generating regular expressions (abbrev. regexes) from natural language description (NL2RE) has been an emerging research area. Prior studies treat regex as a linear sequence of tokens and generate the final expressions autoregressively in a single pass. They did not take into account the step-by-step internal text-matching processes behind the final results. This significantly hinders the efficacy and interpretability of regex generation by neural language models. In this paper, we propose a new paradigm called InfeRE, which decomposes the generation of regexes into chains of step-by-step inference. To enhance the robustness, we introduce a self-consistency decoding mechanism that ensembles multiple outputs sampled from different models. We evaluate InfeRE on two publicly available datasets, NL-RX-Turk and KB13, and compare the results with state-of-the-art approaches and the popular tree-based generation approach TRANX. Experimental results show that InfeRE substantially outperforms previous baselines, yielding 16.3% and 14.7% improvement in DFA@5 accuracy on two datasets, respectively. Particularly, InfeRE outperforms the popular tree-based generation approach by 18.1% and 11.3% on both datasets, respectively, in terms of DFA@5 accuracy.
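Summary
Automatically generating regular expressions (regexes) from natural language descriptions (NL2RE) is an emerging research area. Prior studies typically treat a regex as a linear sequence of tokens and generate the final expression in a single pass. They do not consider the step-by-step text-matching process behind the final result, which limits the efficacy and interpretability of neural language models. In this paper, we propose a new paradigm called InfeRE, which decomposes regex generation into chains of step-by-step inference. To improve robustness, we also introduce a self-consistency decoding mechanism that samples multiple outputs from different models and ensembles them. We run experiments on two publicly available datasets, NL-RX-Turk and KB13, and compare against current baselines and a tree-based generation approach. Results show that InfeRE substantially outperforms previous baselines, improving DFA@5 accuracy by 16.3% and 14.7% on the two datasets. In particular, compared with the tree-based generation approach, InfeRE improves DFA@5 accuracy by 18.1% and 11.3% on the two datasets.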
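The idea of ensembling several sampled outputs can be sketched as a simple vote. The abstract does not say how InfeRE compares candidates, so this sketch makes an assumption: two regexes count as agreeing when they accept the same subset of a few probe strings. The helper names and the probe strings are hypothetical.

```python
import re
from collections import Counter

def behaviour_signature(pattern, probes):
    """Fingerprint a regex by which probe strings it fully matches."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return None  # discard candidates that do not compile
    return tuple(compiled.fullmatch(p) is not None for p in probes)

def self_consistent_choice(candidates, probes):
    """Pick a candidate whose behaviour is shared by the most samples."""
    scored = [(c, behaviour_signature(c, probes)) for c in candidates]
    valid = [(c, s) for c, s in scored if s is not None]
    if not valid:
        return None
    votes = Counter(s for _, s in valid)
    winner, _ = votes.most_common(1)[0]
    # Return the first sampled candidate that exhibits the winning behaviour.
    return next(c for c, s in valid if s == winner)

# Four sampled candidates for "one or more digits"; the last one is malformed.
candidates = ["[0-9]+", r"\d+", "[0-9]*", "(["]
probes = ["", "7", "42", "a1"]
chosen = self_consistent_choice(candidates, probes)
```

Voting on behaviour rather than on the literal string lets syntactically different but equivalent candidates, such as `[0-9]+` and `\d+`, reinforce each other.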
Adapting Foundation Models for Information Synthesis of Wireless Communication Specifications
results: On a benchmark dataset, the tool gives more accurate and relevant answers, with a BLEU score of 0.37 and a BERTScore F1-measure of 0.79, compared with 0.07 and 0.59 for ChatGPT.
Abstract
Existing approaches to understanding, developing and researching modern wireless communication technologies involves time-intensive and arduous process of sifting through numerous webpages and technical specification documents, gathering the required information and synthesizing it. This paper presents NextGen Communications Copilot, a conversational artificial intelligence tool for information synthesis of wireless communication specifications. The system builds on top of recent advancements in foundation models and consists of three key additional components: a domain-specific database, a context extractor, and a feedback mechanism. The system appends user queries with concise and query-dependent contextual information extracted from a database of wireless technical specifications and incorporates tools for expert feedback and data contributions. On evaluation using a benchmark dataset of queries and reference responses created by subject matter experts, the system demonstrated more relevant and accurate answers with an average BLEU score and BERTScore F1-measure of 0.37 and 0.79 respectively compared to the corresponding values of 0.07 and 0.59 achieved by state-of-the-art tools like ChatGPT.
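Summary
Existing approaches to understanding, developing, and researching modern wireless communication technologies are time-consuming and arduous: they require sifting through numerous webpages and technical specification documents, gathering the required information, and synthesizing it. This paper presents NextGen Communications Copilot, a conversational AI tool built on recent foundation models for information synthesis of wireless communication specifications. The system adds three key components: a domain-specific database, a context extractor, and a feedback mechanism. It appends to user queries concise, query-dependent contextual information extracted from the domain-specific database, and includes tools for expert feedback and data contributions. Evaluated on a benchmark dataset of queries and reference responses created by subject matter experts, the system gives more accurate and relevant answers than existing tools such as ChatGPT, with a BLEU score of 0.37 and a BERTScore F1-measure of 0.79.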
results: The study finds that displaying more uncertainty information can help users make decisions more confidently.
Abstract
Many studies explore how well computers can examine emotions displayed by humans and use that data to perform different tasks. However, very few studies have evaluated a computer's ability to generate emotion classification information to help the user make decisions or perform tasks. This is a crucial area to explore, as it is paramount to two-way communication between humans and computers. This research conducted an experiment to investigate the impact of different uncertainty information displays of emotion classification on the human decision-making process. Results show that displaying more uncertainty information can help users be more confident when making decisions.
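Summary
Many studies examine how computers can recognize emotions expressed by humans and use that data to complete different tasks. However, few studies explore whether computers can generate emotion classification information to help users make decisions or complete tasks. This is a key area because two-way communication between humans and computers is very important. This study ran an experiment to investigate how different ways of displaying uncertainty information affect the human decision-making process. Results show that displaying more uncertainty information can help users make decisions more confidently.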
Gentopia: A Collaborative Platform for Tool-Augmented LLMs
for: The paper aims to provide a flexible and customizable framework for Augmented Language Models (ALMs) that enables the use of various language models, task formats, prompting modules, and plugins.
methods: The paper proposes a new framework called gentopia, which allows users to customize their ALMs through simple configurations and integrates various language models, task formats, prompting modules, and plugins into a unified paradigm.
results: The paper establishes gentpool, a public platform for registering and sharing user-customized agents, and gentbench, an integral component of gentpool that evaluates user-customized agents across diverse aspects such as safety, robustness, and efficiency.
Abstract
Augmented Language Models (ALMs) empower large language models with the ability to use tools, transforming them into intelligent agents for real-world interactions. However, most existing frameworks for ALMs, to varying degrees, are deficient in the following critical features: flexible customization, collaborative democratization, and holistic evaluation. We present gentopia, an ALM framework enabling flexible customization of agents through simple configurations, seamlessly integrating various language models, task formats, prompting modules, and plugins into a unified paradigm. Furthermore, we establish gentpool, a public platform enabling the registration and sharing of user-customized agents. Agents registered in gentpool are composable such that they can be assembled together for agent collaboration, advancing the democratization of artificial intelligence. To ensure high-quality agents, gentbench, an integral component of gentpool, is designed to thoroughly evaluate user-customized agents across diverse aspects such as safety, robustness, efficiency, etc. We release gentopia on Github and will continuously move forward.
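Summary
Augmented Language Models (ALMs) let large language models use tools, turning them into intelligent agents for real-world interaction. However, existing ALM frameworks are deficient, to varying degrees, in flexible customization, collaborative democratization, and holistic evaluation. We present the gentopia framework, which lets users customize agents through simple configurations and brings different language models, task formats, prompting modules, and plugins together in a unified architecture. We also establish gentpool, a public platform where users can register and share customized agents. Agents registered in gentpool can be composed with one another, advancing the democratization of artificial intelligence. To ensure high-quality agents, gentbench, an integral component of gentpool, evaluates user-customized agents across aspects such as safety, robustness, and efficiency. We release gentopia on Github and will continue to develop it.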
Top K Relevant Passage Retrieval for Biomedical Question Answering
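results: After fine-tuning, the DPR model achieves an F1 score of 0.81 on the BioASQ question answering dataset, indicating that it can answer biomedical questions accurately.
Abstract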
Question answering is a task that answers factoid questions using a large collection of documents. It aims to provide precise answers in response to the user's questions in natural language. Question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. On the web, there is no single article that could provide all the possible answers available on the internet to the question of the problem asked by the user. The existing Dense Passage Retrieval model has been trained on Wikipedia dump from Dec. 20, 2018, as the source documents for answering questions. Question answering (QA) has made big strides with several open-domain and machine comprehension systems built using large-scale annotated datasets. However, in the clinical domain, this problem remains relatively unexplored. According to multiple surveys, Biomedical Questions cannot be answered correctly from Wikipedia Articles. In this work, we work on the existing DPR framework for the biomedical domain and retrieve answers from the Pubmed articles which is a reliable source to answer medical questions. When evaluated on a BioASQ QA dataset, our fine-tuned dense retriever results in a 0.81 F1 score.
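Summary
Question answering is the task of answering questions from a large collection of documents, with the aim of giving precise answers in natural language. It relies on efficient passage retrieval, where traditional sparse vector space models such as TF-IDF or BM25 are the de facto standard. On the internet, no single article can provide every possible answer to a user's question. The existing Dense Passage Retrieval model was trained on the Wikipedia dump from Dec. 20, 2018, as the source documents for answering questions. Question answering (QA) has made great progress in the open-domain and machine comprehension settings, but in the medical domain the problem remains little studied. According to multiple surveys, medical questions cannot be answered correctly from Wikipedia articles. In this work, we build on the existing DPR framework and retrieve answers from Pubmed articles, a reliable source. When evaluated on the BioASQ QA dataset, our fine-tuned dense retriever achieves an F1 score of 0.81.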
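The retrieval step named in the title (top-k relevant passages) can be sketched as an inner-product ranking over precomputed embeddings. This is a minimal sketch under assumptions: the two-dimensional vectors are made up, the encoders that would produce them are omitted, and the function names are hypothetical.

```python
import heapq

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_passages(query_vec, passage_vecs, k):
    """Return (score, index) pairs for the k passages with the highest inner product."""
    scored = ((dot(query_vec, p), i) for i, p in enumerate(passage_vecs))
    return heapq.nlargest(k, scored)

# Toy passage embeddings (made up); in practice a passage encoder would produce them.
passages = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7], [-0.5, 0.1]]
query = [1.0, 0.0]
hits = top_k_passages(query, passages, k=2)
top_indices = [index for _, index in hits]
```

For a large corpus, a linear scan like this is usually replaced by an approximate nearest-neighbor index, but the ranking criterion stays the same.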
AgentSims: An Open-Source Sandbox for Large Language Model Evaluation
results: Our demo is available at https://agentsims.com.
Abstract
With ChatGPT-like large language models (LLM) prevailing in the community, how to evaluate the ability of LLMs is an open question. Existing evaluation methods suffer from following shortcomings: (1) constrained evaluation abilities, (2) vulnerable benchmarks, (3) unobjective metrics. We suggest that task-based evaluation, where LLM agents complete tasks in a simulated environment, is a one-for-all solution to solve above problems. We present AgentSims, an easy-to-use infrastructure for researchers from all disciplines to test the specific capacities they are interested in. Researchers can build their evaluation tasks by adding agents and buildings on an interactive GUI or deploy and test new support mechanisms, i.e. memory, planning and tool-use systems, by a few lines of codes. Our demo is available at https://agentsims.com .
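Summary
With ChatGPT-like large language models (LLMs) prevailing in the community, how to evaluate their abilities is an open question. Existing evaluation methods have the following shortcomings: (1) constrained evaluation abilities, (2) vulnerable benchmarks, and (3) unobjective metrics. We suggest task-based evaluation, in which LLM agents complete tasks in a simulated environment, as a one-for-all solution. We present AgentSims, an easy-to-use infrastructure that lets researchers from all disciplines test the specific capacities they are interested in. Researchers can build evaluation tasks by adding agents and buildings on an interactive GUI, or deploy and test new support mechanisms such as memory, planning, and tool-use systems with a few lines of code. Our demo is available at https://agentsims.com.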
MSAC: Multiple Speech Attribute Control Method for Speech Emotion Recognition
results: In both single-corpus and cross-corpus SER scenarios, the proposed SER workflow outperforms the baseline in recognition, generalization, and reliability. In single-corpus SER, it achieves a WAR of 72.97% and a UAR of 71.76% on the IEMOCAP corpus.
Abstract
Despite significant progress, speech emotion recognition (SER) remains challenging due to inherent complexity and ambiguity of the emotion attribute, particularly in wild world. Whereas current studies primarily focus on recognition and generalization capabilities, this work pioneers an exploration into the reliability of SER methods and investigates how to model the speech emotion from the aspect of data distribution across various speech attributes. Specifically, we first build a novel CNN-based SER model which adopts additive margin softmax loss to expand the distance between features of different classes, thereby enhancing their discrimination. Second, a novel multiple speech attribute control method MSAC is proposed to explicitly control speech attributes, enabling the model to be less affected by emotion-agnostic attributes and capture more fine-grained emotion-related features. Third, we make a first attempt to test and analyze the reliability of the proposed SER workflow using the out-of-distribution detection method. Extensive experiments on both single and cross-corpus SER scenarios show that our proposed unified SER workflow consistently outperforms the baseline in terms of recognition, generalization, and reliability performance. Besides, in single-corpus SER, the proposed SER workflow achieves superior recognition results with a WAR of 72.97% and a UAR of 71.76% on the IEMOCAP corpus.
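Summary
Despite significant progress, speech emotion recognition (SER) remains challenging, mainly because of the inherent complexity and ambiguity of the emotion attribute, especially in the wild. Whereas existing studies focus mainly on recognition and generalization capabilities, this work explores the reliability of SER methods and investigates how to model speech emotion from the perspective of data distribution. Specifically, we first build a CNN-based SER model that adopts an additive margin softmax loss to enlarge the distance between features of different classes, improving their discrimination. Second, we propose a Multiple Speech Attribute Control (MSAC) method that explicitly controls speech attributes, making the model less affected by emotion-agnostic attributes and better able to capture fine-grained emotion-related features. Finally, we make a first attempt to test and analyze the reliability of the proposed SER workflow, and run extensive experiments in single-corpus and cross-corpus settings. Results show that the proposed SER workflow consistently does better in recognition, generalization, and reliability. In addition, in the single-corpus setting, the workflow achieves a WAR of 72.97% and a UAR of 71.76% on the IEMOCAP corpus.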
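The additive margin softmax loss mentioned above can be illustrated for a single sample. This is a sketch of the general loss, not the paper's training code: the function name is hypothetical, and the default `scale` and `margin` values are assumptions chosen for illustration rather than settings reported in the abstract.

```python
import math

def am_softmax_loss(cosines, target, scale=30.0, margin=0.35):
    """Additive margin softmax loss for one sample.

    cosines: cosine similarity between the feature and each class weight.
    target:  index of the true class.
    """
    # Subtract the margin from the true-class cosine only, then scale all logits.
    logits = [scale * (c - margin if j == target else c) for j, c in enumerate(cosines)]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]

with_margin = am_softmax_loss([0.8, 0.1], target=0)
without_margin = am_softmax_loss([0.8, 0.1], target=0, margin=0.0)
```

Because the margin lowers only the true-class logit, the same sample incurs a larger loss than under plain softmax, which pushes features of different classes further apart during training.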
Scope Loss for Imbalanced Classification and RL Exploration
paper_authors: Hasham Burhani, Xiao Qi Shi, Jonathan Jaegerman, Daniel Balicki
for: This paper aims to address the exploration-exploitation trade-off in reinforcement learning and the dataset imbalance problem in supervised classification.
methods: The paper equates the two problems and derives a novel loss function called Scope Loss, which adjusts gradients to prevent performance losses from over-exploitation and dataset imbalances without the need for tuning.
results: The paper shows that Scope Loss outperforms state-of-the-art loss functions over a basket of benchmark reinforcement learning tasks and a skewed classification dataset.
Abstract
We demonstrate equivalence between the reinforcement learning problem and the supervised classification problem. We consequently equate the exploration exploitation trade-off in reinforcement learning to the dataset imbalance problem in supervised classification, and find similarities in how they are addressed. From our analysis of the aforementioned problems we derive a novel loss function for reinforcement learning and supervised classification. Scope Loss, our new loss function, adjusts gradients to prevent performance losses from over-exploitation and dataset imbalances, without the need for any tuning. We test Scope Loss against SOTA loss functions over a basket of benchmark reinforcement learning tasks and a skewed classification dataset, and show that Scope Loss outperforms other loss functions.
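Summary
We demonstrate an equivalence between the reinforcement learning problem and the supervised classification problem. We therefore equate the exploration-exploitation trade-off with the dataset imbalance problem and find similarities in how the two are addressed. From our analysis of these problems, we derive a new loss function called Scope Loss. Scope Loss adjusts gradients to prevent performance losses from over-exploitation and dataset imbalance, without any tuning. We test it on a set of benchmark reinforcement learning tasks and a skewed classification dataset and show that it outperforms state-of-the-art loss functions.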
Improving Performance of Semi-Supervised Learning by Adversarial Attacks
paper_authors: Dongyoon Yang, Kunwoong Kim, Yongdai Kim
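for: Improving the performance of existing semi-supervised learning (SSL) algorithms.
methods: Adversarial attacks on pre-trained models are used to select high-confidence unlabeled data to be labeled.
results: On CIFAR10, three recent SSL algorithms combined with SCAR show significantly improved image classification performance.
Abstract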
Semi-supervised learning (SSL) algorithm is a setup built upon a realistic assumption that access to a large amount of labeled data is tough. In this study, we present a generalized framework, named SCAR, standing for Selecting Clean samples with Adversarial Robustness, for improving the performance of recent SSL algorithms. By adversarially attacking pre-trained models with semi-supervision, our framework shows substantial advances in classifying images. We introduce how adversarial attacks successfully select high-confident unlabeled data to be labeled with current predictions. On CIFAR10, three recent SSL algorithms with SCAR result in significantly improved image classification.
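Summary
Semi-supervised learning (SSL) algorithms are built on the realistic assumption that obtaining a large amount of labeled data is difficult. In this study, we present a general framework named SCAR (Selecting Clean samples with Adversarial Robustness) to improve the performance of recent SSL algorithms. By adversarially attacking pre-trained models, our framework successfully selects high-confidence unlabeled samples to be labeled. On CIFAR10, three recent SSL algorithms combined with SCAR show significantly improved image classification.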
Multi-Granularity Attention Model for Group Recommendation
paper_authors: Jianye Ji, Jiayan Pei, Shaochuan Lin, Taotao Zhou, Hengxu He, Jia Jia, Ning Hu
for: Providing personalized recommendations to groups of users based on their shared interests, preferences, and characteristics.
methods: The method uses multiple levels of granularity (subsets, groups, and supersets) to uncover group members' latent preferences and mitigate recommendation noise. Specifically, it includes a Subset Preference Extraction module, a Group Preference Extraction module, and a Superset Preference Extraction module.
results: The method reduces recommendation noise across multiple levels of granularity and comprehensively learns users' individual interests. Extensive offline and online experiments demonstrate its superior performance.
Abstract
Group recommendation provides personalized recommendations to a group of users based on their shared interests, preferences, and characteristics. Current studies have explored different methods for integrating individual preferences and making collective decisions that benefit the group as a whole. However, most of them heavily rely on users with rich behavior and ignore latent preferences of users with relatively sparse behavior, leading to insufficient learning of individual interests. To address this challenge, we present the Multi-Granularity Attention Model (MGAM), a novel approach that utilizes multiple levels of granularity (i.e., subsets, groups, and supersets) to uncover group members' latent preferences and mitigate recommendation noise. Specially, we propose a Subset Preference Extraction module that enhances the representation of users' latent subset-level preferences by incorporating their previous interactions with items and utilizing a hierarchical mechanism. Additionally, our method introduces a Group Preference Extraction module and a Superset Preference Extraction module, which explore users' latent preferences on two levels: the group-level, which maintains users' original preferences, and the superset-level, which includes group-group exterior information. By incorporating the subset-level embedding, group-level embedding, and superset-level embedding, our proposed method effectively reduces group recommendation noise across multiple granularities and comprehensively learns individual interests. Extensive offline and online experiments have demonstrated the superiority of our method in terms of performance.
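Summary
Group recommendation provides personalized recommendations to group members based on their shared interests, preferences, and characteristics. Existing studies have explored different methods for integrating individual preferences and making collective decisions for the group, but most of them ignore users' latent preferences, leading to insufficient learning of individual interests. To address this challenge, we propose the Multi-Granularity Attention Model (MGAM), a new approach that uses multiple levels of granularity (subsets, groups, and supersets) to uncover group members' latent preferences and reduce recommendation noise. Specifically, we propose a Subset Preference Extraction module that strengthens the representation of users' latent subset-level preferences by using their previous interactions with items and a hierarchical mechanism. Our method also introduces a Group Preference Extraction module and a Superset Preference Extraction module, which explore users' group-level and superset-level preferences respectively. By combining the subset-level, group-level, and superset-level embeddings, the proposed method effectively reduces group recommendation noise and comprehensively learns individual interests. Extensive online and offline experiments show that our method has a clear performance advantage.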
Understanding CNN Hidden Neuron Activations Using Structured Background Knowledge and Deductive Reasoning
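results: The results show that the method can automatically attach meaningful labels from large-scale background knowledge to individual neurons in the dense layer of a convolutional neural network. These labels help explain the activity of hidden-layer neurons and thus reduce the black-box character of deep learning systems.
Abstract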
A major challenge in Explainable AI is in correctly interpreting activations of hidden neurons: accurate interpretations would provide insights into the question of what a deep learning system has internally detected as relevant on the input, demystifying the otherwise black-box character of deep learning systems. The state of the art indicates that hidden node activations can, in some cases, be interpretable in a way that makes sense to humans, but systematic automated methods that would be able to hypothesize and verify interpretations of hidden neuron activations are underexplored. In this paper, we provide such a method and demonstrate that it provides meaningful interpretations. Our approach is based on using large-scale background knowledge (approximately 2 million classes curated from the Wikipedia concept hierarchy) together with a symbolic reasoning approach called Concept Induction based on description logics, originally developed for applications in the Semantic Web field. Our results show that we can automatically attach meaningful labels from the background knowledge to individual neurons in the dense layer of a Convolutional Neural Network through a hypothesis and verification process.
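Summary
A major challenge in explainable AI is correctly interpreting the activations of hidden neurons: accurate interpretations would give insight into what a deep learning system has internally detected as relevant in the input, removing the black-box character of such systems. The state of the art indicates that hidden node activations can, in some cases, be interpreted in a way that makes sense to humans, but systematic automated methods that can hypothesize and verify interpretations of hidden neuron activations remain underexplored. In this paper, we provide such a method and show that it yields meaningful interpretations. Our approach uses large-scale background knowledge (about 2 million classes) from the Wikipedia concept hierarchy together with a symbolic reasoning approach called Concept Induction, which is based on description logics and was originally developed for the Semantic Web field. Our results show that, through a hypothesis and verification process, we can automatically attach meaningful labels from the background knowledge to individual neurons in the dense layer of a convolutional neural network.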
Cooperative Multi-Type Multi-Agent Deep Reinforcement Learning for Resource Management in Space-Air-Ground Integrated Networks
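results: Experimental results show the effectiveness of the CMT-MARL method on key performance indicators such as the overall transmission rate and the transmission success rate. These results underscore the potential value and feasibility of the SAGIN.
Abstract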
The Space-Air-Ground Integrated Network (SAGIN), integrating heterogeneous devices including low earth orbit (LEO) satellites, unmanned aerial vehicles (UAVs), and ground users (GUs), holds significant promise for advancing smart city applications. However, resource management of the SAGIN is a challenge requiring urgent study in that inappropriate resource management will cause poor data transmission, and hence affect the services in smart cities. In this paper, we develop a comprehensive SAGIN system that encompasses five distinct communication links and propose an efficient cooperative multi-type multi-agent deep reinforcement learning (CMT-MARL) method to address the resource management issue. The experimental results highlight the efficacy of the proposed CMT-MARL, as evidenced by key performance indicators such as the overall transmission rate and transmission success rate. These results underscore the potential value and feasibility of future implementation of the SAGIN.
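Summary
The Space-Air-Ground Integrated Network (SAGIN), which integrates heterogeneous devices including low earth orbit (LEO) satellites, unmanned aerial vehicles (UAVs), and ground users (GUs), holds great promise for advancing smart city applications. However, resource management in the SAGIN is a challenge that requires urgent study, because inappropriate resource management leads to poor data transmission and thus degrades smart city services. In this paper, we develop a comprehensive SAGIN system that encompasses five distinct communication links and propose an efficient cooperative multi-type multi-agent deep reinforcement learning (CMT-MARL) method to address the resource management problem. Experimental results show the efficacy of the proposed CMT-MARL method on the overall transmission rate and the transmission success rate, underscoring the potential value and feasibility of the SAGIN.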
AI Chatbots as Multi-Role Pedagogical Agents: Transforming Engagement in CS Education
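paper_authors: Cassie Chen Cao, Zijian Ding, Jionghao Lin, Frank Hopfgartner
for: This study aims to use AI-powered, multi-role chatbots to enhance learning experiences and engagement in computer science education.
methods: We take a design-based research approach to develop, implement, and evaluate a learning environment with four distinct chatbot roles. The roles are grounded in Self-Determination Theory and cater to learners' three innate psychological needs: competence, autonomy, and relatedness.
results: We test the system for one month in a higher education context with 200 participating students, comparing outcomes with conditions involving a human tutor and a single chatbot. The research uses a mixed-methods approach that combines quantitative measures such as chat log sequence analysis with qualitative methods including surveys and focus group interviews. By integrating cutting-edge natural language processing techniques such as topic modelling and sentiment analysis, we offer an in-depth understanding of the system's impact on learner engagement, motivation, and inquiry-based learning.
Abstract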
This study investigates the use of Artificial Intelligence (AI)-powered, multi-role chatbots as a means to enhance learning experiences and foster engagement in computer science education. Leveraging a design-based research approach, we develop, implement, and evaluate a novel learning environment enriched with four distinct chatbot roles: Instructor Bot, Peer Bot, Career Advising Bot, and Emotional Supporter Bot. These roles, designed around the tenets of Self-Determination Theory, cater to the three innate psychological needs of learners - competence, autonomy, and relatedness. Additionally, the system embraces an inquiry-based learning paradigm, encouraging students to ask questions, seek solutions, and explore their curiosities. We test this system in a higher education context over a period of one month with 200 participating students, comparing outcomes with conditions involving a human tutor and a single chatbot. Our research utilizes a mixed-methods approach, encompassing quantitative measures such as chat log sequence analysis, and qualitative methods including surveys and focus group interviews. By integrating cutting-edge Natural Language Processing techniques such as topic modelling and sentiment analysis, we offer an in-depth understanding of the system's impact on learner engagement, motivation, and inquiry-based learning. This study, through its rigorous design and innovative approach, provides significant insights into the potential of AI-empowered, multi-role chatbots in reshaping the landscape of computer science education and fostering an engaging, supportive, and motivating learning environment.
Summary
We test the system in a higher education context for one month with 200 participating students, comparing outcomes with conditions involving a human tutor and a single chatbot. Our research combines quantitative measures such as chat log sequence analysis and qualitative methods like surveys and focus group interviews. We employ cutting-edge Natural Language Processing techniques like topic modeling and sentiment analysis to gain a deeper understanding of the system's impact on learner engagement, motivation, and inquiry-based learning. Our study offers significant insights into the potential of AI-empowered, multi-role chatbots to reshape computer science education and create an engaging, supportive, and motivating learning environment. By integrating innovative approaches and cutting-edge technologies, we provide a comprehensive understanding of the system's effectiveness and its potential for future applications.
NEOLAF, an LLM-powered neural-symbolic cognitive architecture
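results: Experiments on the MATH dataset show that the NEOLAF agent has strong learning capability and has the potential to revolutionize cognitive architectures and self-improving adaptive instructional systems.
Abstract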
This paper presents the Never Ending Open Learning Adaptive Framework (NEOLAF), an integrated neural-symbolic cognitive architecture that models and constructs intelligent agents. The NEOLAF framework is a superior approach to constructing intelligent agents than both the pure connectionist and pure symbolic approaches due to its explainability, incremental learning, efficiency, collaborative and distributed learning, human-in-the-loop enablement, and self-improvement. The paper further presents a compelling experiment where a NEOLAF agent, built as a problem-solving agent, is fed with complex math problems from the open-source MATH dataset. The results demonstrate NEOLAF's superior learning capability and its potential to revolutionize the field of cognitive architectures and self-improving adaptive instructional systems.
SimplyRetrieve: A Private and Lightweight Retrieval-Centric Generative AI Tool
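results: The paper introduces SimplyRetrieve, an open-source RCG platform with a GUI and an API. It offers a localized, lightweight, and user-friendly interface that helps the machine learning community make better use of these advanced techniques.
Abstract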
Large Language Model (LLM) based Generative AI systems have seen significant progress in recent years. Integrating a knowledge retrieval architecture allows for seamless integration of private data into publicly available Generative AI systems using pre-trained LLM without requiring additional model fine-tuning. Moreover, Retrieval-Centric Generation (RCG) approach, a promising future research direction that explicitly separates roles of LLMs and retrievers in context interpretation and knowledge memorization, potentially leads to more efficient implementation. SimplyRetrieve is an open-source tool with the goal of providing a localized, lightweight, and user-friendly interface to these sophisticated advancements to the machine learning community. SimplyRetrieve features a GUI and API based RCG platform, assisted by a Private Knowledge Base Constructor and a Retrieval Tuning Module. By leveraging these capabilities, users can explore the potential of RCG for improving generative AI performance while maintaining privacy standards. The tool is available at https://github.com/RCGAI/SimplyRetrieve with an MIT license.
CheXFusion: Effective Fusion of Multi-View Features using Transformers for Long-Tailed Chest X-Ray Classification
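results: The solution achieves 0.372 mAP on the MIMIC-CXR test set, ranking first in the competition. This underscores the importance of considering multi-view settings, class imbalance, and label co-occurrence in medical image classification. The code is available at https://github.com/dongkyuk/CXR-LT-public-solution.
Abstract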
Medical image classification poses unique challenges due to the long-tailed distribution of diseases, the co-occurrence of diagnostic findings, and the multiple views available for each study or patient. This paper introduces our solution to the ICCV CVAMD 2023 Shared Task on CXR-LT: Multi-Label Long-Tailed Classification on Chest X-Rays. Our approach introduces CheXFusion, a transformer-based fusion module incorporating multi-view images. The fusion module, guided by self-attention and cross-attention mechanisms, efficiently aggregates multi-view features while considering label co-occurrence. Furthermore, we explore data balancing and self-training methods to optimize the model's performance. Our solution achieves state-of-the-art results with 0.372 mAP in the MIMIC-CXR test set, securing 1st place in the competition. Our success in the task underscores the significance of considering multi-view settings, class imbalance, and label co-occurrence in medical image classification. Public code is available at https://github.com/dongkyuk/CXR-LT-public-solution
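Summary
Medical image classification faces unique challenges, including the long-tailed distribution of diseases, the co-occurrence of diagnostic findings, and the multiple views available for each study or patient. This paper presents our solution to the ICCV CVAMD 2023 Shared Task on CXR-LT: multi-label long-tailed classification on chest X-rays. Our method introduces CheXFusion, a transformer-based fusion module that uses self-attention and cross-attention mechanisms to aggregate multi-view features efficiently while considering label co-occurrence. We also explore data balancing and self-training methods to optimize model performance. Our solution achieves 0.372 mAP on the MIMIC-CXR test set and took first place in the competition, which confirms the importance of considering multi-view settings, class imbalance, and label co-occurrence in medical image classification. Our code is available at https://github.com/dongkyuk/CXR-LT-public-solution.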
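For readers unfamiliar with the mAP figure quoted above, the following sketch computes a mean average precision for multi-label predictions. It is a generic illustration with made-up scores and labels; the competition's official evaluation may differ in details such as tie handling, and the function names are hypothetical.

```python
def average_precision(scores, labels):
    """Average precision for one class: mean of the precision at each positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(score_columns, label_columns):
    """Average the per-class average precision over all classes."""
    aps = [average_precision(s, l) for s, l in zip(score_columns, label_columns)]
    return sum(aps) / len(aps)

# Two classes, four studies each (toy values, made up for illustration).
scores = [[0.9, 0.8, 0.3, 0.1], [0.2, 0.6, 0.4, 0.7]]
labels = [[1, 0, 1, 0], [0, 1, 0, 1]]
map_value = mean_average_precision(scores, labels)
```

Averaging per class rather than per sample gives rare findings the same weight as common ones, which is why this style of metric suits long-tailed label distributions.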
ALFA – Leveraging All Levels of Feature Abstraction for Enhancing the Generalization of Histopathology Image Classification Across Unseen Hospitals
paper_authors: Milad Sikaroudi, Maryam Hosseini, Shahryar Rahnamayan, H. R. Tizhoosh
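for: Improving the generalization of image classification so that models perform better across different hospitals.
methods: The method uses augmentation-based self-supervision, with common histopathology distribution shifts as the pretext task, to derive invariant features from training images, and uses a domain alignment module to extract further invariant features across training hospitals.
results: Experiments show that the proposed method generalizes better to images from unseen hospitals and performs better under different distribution shift scenarios.
Abstract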
We propose an exhaustive methodology that leverages all levels of feature abstraction, targeting an enhancement in the generalizability of image classification to unobserved hospitals. Our approach incorporates augmentation-based self-supervision with common distribution shifts in histopathology scenarios serving as the pretext task. This enables us to derive invariant features from training images without relying on training labels, thereby covering different abstraction levels. Moving onto the subsequent abstraction level, we employ a domain alignment module to facilitate further extraction of invariant features across varying training hospitals. To represent the highly specific features of participating hospitals, an encoder is trained to classify hospital labels, independent of their diagnostic labels. The features from each of these encoders are subsequently disentangled to minimize redundancy and segregate the features. This representation, which spans a broad spectrum of semantic information, enables the development of a model demonstrating increased robustness to unseen images from disparate distributions. Experimental results from the PACS dataset (a domain generalization benchmark), a synthetic dataset created by applying histopathology-specific jitters to the MHIST dataset (defining different domains with varied distribution shifts), and a Renal Cell Carcinoma dataset derived from four image repositories from TCGA, collectively indicate that our proposed model is adept at managing varying levels of image granularity. Thus, it shows improved generalizability when faced with new, out-of-distribution hospital images.
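Summary
We propose a methodology that covers all levels of feature abstraction, with the goal of improving the generalizability of image classification to images from unseen hospitals. Our approach uses augmentation-based self-supervision, with common distribution shifts in histopathology scenarios as the pretext task, to derive invariant features from training images without relying on training labels. At the next abstraction level, we use a domain alignment module to further extract invariant features across training hospitals. To represent the highly specific features of participating hospitals, we train an encoder to classify hospital labels, independent of diagnostic labels. The features from each encoder are then disentangled to minimize redundancy and separate the features. This representation, which spans a broad range of semantic information, makes the proposed model generalize better to new, unseen images. Experimental results on the PACS dataset (a domain generalization benchmark), a synthetic dataset generated by applying histopathology-specific jitters, and a Renal Cell Carcinoma dataset from TCGA show that our model generalizes better across different levels of image granularity.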
Establishing Trust in ChatGPT BioMedical Generated Text: An Ontology-Based Knowledge Graph to Validate Disease-Symptom Links
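results: Comparing the ChatGPT knowledge graphs with their PubMed counterparts yielded some interesting observations. For example, some ChatGPT graphs contain more connections than the PubMed graphs, and some GPT graphs have higher centrality scores, especially for overlapping nodes. These results point to the potential value of unverified knowledge in AI-generated content, which still needs to be verified.
Abstract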
Methods: Through an innovative approach, we construct ontology-based knowledge graphs from authentic medical literature and AI-generated content. Our goal is to distinguish factual information from unverified data. We compiled two datasets: one from biomedical literature using a "human disease and symptoms" query, and another generated by ChatGPT, simulating articles. With these datasets (PubMed and ChatGPT), we curated 10 sets of 250 abstracts each, selected randomly with a specific seed. Our method focuses on utilizing disease ontology (DOID) and symptom ontology (SYMP) to build knowledge graphs, robust mathematical models that facilitate unbiased comparisons. By employing our fact-checking algorithms and network centrality metrics, we conducted GPT disease-symptoms link analysis to quantify the accuracy of factual knowledge amid noise, hypotheses, and significant findings. Results: The findings obtained from the comparison of diverse ChatGPT knowledge graphs with their PubMed counterparts revealed some interesting observations. While PubMed knowledge graphs exhibit a wealth of disease-symptom terms, it is surprising to observe that some ChatGPT graphs surpass them in the number of connections. Furthermore, some GPT graphs are demonstrating supremacy of the centrality scores, especially for the overlapping nodes. This striking contrast indicates the untapped potential of knowledge that can be derived from AI-generated content, awaiting verification. Out of all the graphs, the factual link ratio between any two graphs reached its peak at 60%. Conclusions: An intriguing insight from our findings was the striking number of links among terms in the knowledge graph generated from ChatGPT datasets, surpassing some of those in its PubMed counterpart. This early discovery has prompted further investigation using universal network metrics to unveil the new knowledge the links may hold.
摘要
方法:通过创新的方法,我们从authentic医学文献和AI生成的内容中构建了ontology-based知识图。我们的目标是区分 фактической信息和未经证实的数据。我们编译了两个数据集:一个是生物医学文献,使用“人类疾病和症状”查询,另一个是由ChatGPT生成的文章。 With这两个数据集(PubMed和ChatGPT),我们精心审选了250个摘要,使用特定的种子值进行随机选择。我们的方法是利用疾病ontology(DOID)和症状ontology(SYMP)建立知识图,并使用我们的 фактиче性检查算法和网络中心度度量来进行GPT疾病-症状链接分析,以量化factual知识中的噪音、假设和重要发现。结果:对比多个ChatGPT知识图与其PubMed对应的知识图,我们发现了一些有趣的观察。PubMed知识图显示了丰富的疾病-症状 термина,但是某些ChatGPT graphs在连接数量方面超过了它们。此外,一些GPT graphs的中心度分数特别高,特别是在重叠的节点上。这个明显的对比表明AI生成的内容中的知识尚未得到证实,但它们具有潜在的价值。在所有知识图中,factual链接比率最高达60%。结论:我们的发现表明,ChatGPT生成的知识图中的链接数量异常多,有些连接数量甚至超过了PubMed知识图中的一些连接。这种早期的发现已经引发了我们进一步的调查,使用通用网络度量来揭示这些链接可能含有的新知识。
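The abstract above compares disease-symptom links across knowledge graphs but does not define the "factual link ratio" precisely. The following is a minimal sketch, assuming the ratio is the fraction of links in a candidate graph (for example, ChatGPT-derived) that also appear in a reference graph (for example, PubMed-derived), and assuming plain degree centrality as the network metric. The function names and the toy links are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def normalize_link(disease, symptom):
    """Lower-case and trim both terms so that links compare reliably."""
    return (disease.strip().lower(), symptom.strip().lower())


def link_ratio(candidate_links, reference_links):
    """Fraction of candidate links that also occur in the reference graph.

    Assumption: this stands in for the paper's "factual link ratio".
    """
    candidate = {normalize_link(d, s) for d, s in candidate_links}
    reference = {normalize_link(d, s) for d, s in reference_links}
    if not candidate:
        return 0.0
    return len(candidate & reference) / len(candidate)


def degree_centrality(links):
    """Degree of each node divided by (number of nodes - 1)."""
    neighbours = defaultdict(set)
    for disease, symptom in (normalize_link(d, s) for d, s in links):
        neighbours[disease].add(symptom)
        neighbours[symptom].add(disease)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Toy data, invented purely for illustration.
chatgpt_links = [
    ("influenza", "fever"),
    ("influenza", "cough"),
    ("migraine", "nausea"),
    ("asthma", "wheezing"),
    ("asthma", "fever"),
]
pubmed_links = [
    ("Influenza", "Fever"),
    ("influenza", "cough"),
    ("migraine", "nausea"),
    ("asthma", "wheezing"),
]

print(link_ratio(chatgpt_links, pubmed_links))
print(degree_centrality(chatgpt_links)["fever"])
```

Normalizing terms before comparison matters in practice, since ontology labels and generated text rarely share capitalization; a fuller pipeline would presumably map terms to DOID and SYMP identifiers instead.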
ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition
results: The method achieves accuracies of 92.81% and 73.02% on the UCF-101 and HMDB-51 human action recognition datasets without any video data pre-training, rising to 96.11% and 75.75% after Kinetics pre-training.
Abstract
Video Action Recognition (VAR) is a challenging task due to its inherent complexities. Though different approaches have been explored in the literature, designing a unified framework to recognize a large number of human actions is still a challenging problem. Recently, Multi-Modal Learning (MML) has demonstrated promising results in this domain. In the literature, the 2D skeleton or pose modality has often been used for this task, either independently or in conjunction with the visual information (RGB modality) present in videos. However, the combination of pose, visual information, and text attributes has not been explored yet, though text and pose attributes have independently proven effective in numerous computer vision tasks. In this paper, we present the first pose-augmented Vision-Language Model (VLM) for VAR. Notably, our scheme achieves accuracies of 92.81% and 73.02% on two popular human video action recognition benchmark datasets, UCF-101 and HMDB-51, respectively, even without any video data pre-training, and accuracies of 96.11% and 75.75% after Kinetics pre-training.
摘要
视频动作识别(VAR)是一个复杂的任务,它的内在复杂性使得设计一个综合性的框架来识别大量人类动作变得具有挑战性。在文献中,不同的方法已经被探讨,但是设计一个综合性的框架来识别大量人类动作仍然是一个挑战性的问题。在文献中,2D骨架或 pose 模式 часто被用于这项任务,可以独立或与视觉信息(RGB 模式)一起使用。然而,将 pose、视觉信息和文本特征相结合尚未被探讨,尽管文本和 pose 特征独立地已经在计算机视觉任务中证明有效。在这篇论文中,我们提出了首个含有pose的视力语言模型(VLM),该模型在 UCF-101 和 HMDB-51 两个常用的人类视频动作识别 benchmark 数据集上取得了92.81% 和 73.02% 的准确率,而不需要任何视频数据预训练,并且在 kinetic 预训练后达到了96.11% 和 75.75% 的准确率。
Intelligent Assistant Language Understanding On Device
paper_authors: Cecilia Aas, Hisham Abdelsalam, Irina Belousova, Shruti Bhargava, Jianpeng Cheng, Robert Daland, Joris Driesen, Federico Flego, Tristan Guigue, Anders Johannsen, Partha Lal, Jiarui Lu, Joel Ruben Antony Moniz, Nathan Perkins, Dhivya Piraviperumal, Stephen Pulman, Diarmuid Ó Séaghdha, David Q. Sun, John Torr, Marco Del Vecchio, Jay Wacker, Jason D. Williams, Hong Yu
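results: The work presents a natural language understanding system that is more private, more reliable, faster, more expressive, and more accurate than a server-based assistant, and shares practical experience intended to inform future research.
Abstract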
It has recently become feasible to run personal digital assistants on phones and other personal devices. In this paper we describe a design for a natural language understanding system that runs on device. In comparison to a server-based assistant, this system is more private, more reliable, faster, more expressive, and more accurate. We describe what led to key choices about architecture and technologies. For example, some approaches in the dialog systems literature are difficult to maintain over time in a deployment setting. We hope that sharing learnings from our practical experiences may help inform future work in the research community.
摘要
现在已经可以在手机和其他个人设备上运行个人数字助手。在这篇论文中,我们描述了一种运行在设备上的自然语言理解系统的设计。与服务器上的助手相比,这种系统更加私钥、可靠、快速、表达力强、准确。我们详细介绍了一些关键的建筑和技术选择。例如,一些对话系统文献中的方法在部署环境中具有维护困难。我们希望通过分享我们的实践经验,对未来的研究工作产生影响。
FLIPS: Federated Learning using Intelligent Participant Selection
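paper_authors: Rahul Atul Bhope, K. R. Jayaram, Nalini Venkatasubramanian, Ashish Verma, Gegi Thomas
for: This paper addresses the management of data and participant heterogeneity in federated learning (FL) training workloads, in particular the effect of participant selection during FL training.
methods: The paper proposes FLIPS, a middleware system that clusters parties by the label distribution of their data and ensures that each cluster is equitably represented among the selected participants. FLIPS supports the most common FL algorithms, including FedAvg, FedProx, FedDyn, FedOpt and FedYogi. To manage platform heterogeneity and dynamic resource availability, FLIPS also incorporates a straggler management mechanism.
results: The empirical evaluation on real-world datasets shows that, compared with random selection and two other "smart" selection mechanisms (Oort and gradient clustering), FLIPS achieves 17-20% higher accuracy with 20-60% lower communication costs, and that these benefits persist in the presence of straggler participants.
Abstract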
This paper presents the design and implementation of FLIPS, a middleware system to manage data and participant heterogeneity in federated learning (FL) training workloads. In particular, we examine the benefits of label distribution clustering on participant selection in federated learning. FLIPS clusters parties involved in an FL training job based on the label distribution of their data a priori, and during FL training, ensures that each cluster is equitably represented in the participants selected. FLIPS can support the most common FL algorithms, including FedAvg, FedProx, FedDyn, FedOpt and FedYogi. To manage platform heterogeneity and dynamic resource availability, FLIPS incorporates a straggler management mechanism to handle changing capacities in distributed, smart community applications. Privacy of label distributions, clustering and participant selection is ensured through a trusted execution environment (TEE). Our comprehensive empirical evaluation compares FLIPS with random participant selection, as well as two other "smart" selection mechanisms, Oort and gradient clustering, using two real-world datasets, two different non-IID distributions and three common FL algorithms (FedYogi, FedProx and FedAvg). We demonstrate that FLIPS significantly improves convergence, achieving 17-20% higher accuracy with 20-60% lower communication costs, and these benefits endure in the presence of straggler participants.
摘要
具体来说,这篇论文提出了一个名为FLIPS的中间件系统,用于管理 federated learning(FL)训练任务中的数据和参与者多样性。FLIPS在FL训练之前将参与者按照其标签分布进行分群,并在训练中确保每个分群都得到了公平的表现。FLIPS支持通用的FL算法,同时管理平台多样性和动态资源可用性,并通过安全执行环境(TEE)保证标签分布、分群和参与者选择的隐私。我们的实验证明,FLIPS可以大幅提高FL训练的收敛速度,在20-60%的通信成本下达到17-20%的高精度,这些优势在受到延迟参与者的情况下也保持不变。
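The selection idea described above (cluster parties by label distribution, then draw participants so that every cluster is represented) can be sketched roughly as follows. This is an illustration under stated assumptions, not the FLIPS implementation: the abstract does not name the clustering algorithm, so a small k-means is assumed here, and the trusted execution environment and straggler handling are left out. All function names are hypothetical.

```python
import random
from collections import defaultdict


def label_distribution(labels, num_classes):
    """Normalized histogram of a party's labels."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means over label distributions (assumed clustering method)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(pt, centers[c])))
            for pt in points
        ]
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def select_participants(cluster_of, num_to_select, seed=0):
    """Round-robin over clusters so that each one is equitably represented."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for party, c in cluster_of.items():
        clusters[c].append(party)
    # Shuffle within each cluster; iterate clusters in a fixed order.
    pools = [rng.sample(members, len(members)) for _, members in sorted(clusters.items())]
    selected = []
    while len(selected) < num_to_select and any(pools):
        for pool in pools:
            if pool and len(selected) < num_to_select:
                selected.append(pool.pop())
    return selected


# Toy example: six parties, two classes, two clearly different label mixes.
party_labels = {
    "p0": [0] * 9 + [1], "p1": [0] * 8 + [1] * 2, "p2": [0] * 10,
    "p3": [0] + [1] * 9, "p4": [0] * 2 + [1] * 8, "p5": [1] * 10,
}
names = sorted(party_labels)
dists = [label_distribution(party_labels[n], 2) for n in names]
cluster_of = dict(zip(names, kmeans(dists, k=2)))
print(select_participants(cluster_of, num_to_select=4))
```

Round-robin selection is only one way to read "equitably represented"; sampling clusters in proportion to their size would be another reasonable choice.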
Exploiting Generalization in Offline Reinforcement Learning via Unseen State Augmentations
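results: The paper reports improved performance on several offline RL tasks and finds that its augmentation strategy consistently yields more conservative Q-value estimates than a baseline.
Abstract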
Offline reinforcement learning (RL) methods strike a balance between exploration and exploitation by conservative value estimation -- penalizing values of unseen states and actions. Model-free methods penalize values at all unseen actions, while model-based methods are able to further exploit unseen states via model rollouts. However, such methods are handicapped in their ability to find unseen states far away from the available offline data due to two factors -- (a) very short rollout horizons in models due to cascading model errors, and (b) model rollouts originating solely from states observed in offline data. We relax the second assumption and present a novel unseen state augmentation strategy to allow exploitation of unseen states where the learned model and value estimates generalize. Our strategy finds unseen states by value-informed perturbations of seen states followed by filtering out states with epistemic uncertainty estimates too high (high error) or too low (too similar to seen data). We observe improved performance in several offline RL tasks and find that our augmentation strategy consistently leads to overall lower average dataset Q-value estimates i.e. more conservative Q-value estimates than a baseline.
摘要
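Offline reinforcement learning (RL) methods balance exploration and exploitation through conservative value estimation, penalizing the values of unseen states and actions. Model-free methods penalize all unseen actions, while model-based methods can further exploit unseen states through model rollouts. These methods are nevertheless limited by two factors: (a) very short rollout horizons, caused by cascading model errors, and (b) rollouts that start only from states seen in the offline data. We relax the second assumption and propose a new unseen state augmentation strategy that allows exploitation of unseen states where the learned model and value estimates generalize. The strategy perturbs seen states in a value-informed way and then filters out states whose epistemic uncertainty is too high (high error) or too low (too similar to seen data). We observe improved performance on several offline RL tasks and find that the augmentation strategy consistently yields lower average dataset Q-value estimates, that is, more conservative estimates than a baseline.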
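The augmentation step described above (perturb seen states using value information, then keep only candidates whose epistemic uncertainty falls inside a band) can be sketched as follows. The details are assumptions: the perturbation is taken to be a fixed-size step along the value gradient, and the uncertainty is taken to be the disagreement of a model ensemble. The toy value function, the toy ensemble, the thresholds, and all names are hypothetical.

```python
import numpy as np


def augment_states(states, value_grad, ensemble_predict, step=0.5, low=0.05, high=0.5):
    """Propose unseen states and filter them by ensemble disagreement.

    states:           (n, d) array of states seen in the offline data
    value_grad:       callable returning dV/ds for each state, shape (n, d)
    ensemble_predict: callable returning predictions of shape (m, n, d)
    """
    grads = value_grad(states)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Assumed perturbation: a fixed-size step in the direction of higher value.
    candidates = states + step * grads / np.maximum(norms, 1e-8)
    # Epistemic uncertainty proxy: std across ensemble members, averaged over dims.
    uncertainty = ensemble_predict(candidates).std(axis=0).mean(axis=1)
    # Drop candidates that are too uncertain or too similar to seen data.
    keep = (uncertainty >= low) & (uncertainty <= high)
    return candidates[keep], uncertainty


# Toy setup: V(s) = -||s - goal||^2 and three linear "dynamics models".
goal = np.array([100.0, 0.0])


def toy_value_grad(s):
    return -2.0 * (s - goal)


def toy_ensemble(s):
    scales = np.array([0.9, 1.0, 1.1])
    return scales[:, None, None] * s[None, :, :]


seen = np.array([[0.0, 0.0], [1.0, 0.0], [20.0, 0.0]])
kept, unc = augment_states(seen, toy_value_grad, toy_ensemble)
print(kept)
print(unc)
```

In this toy setup the ensemble disagrees more the farther a state lies from the origin, so the lower threshold rejects candidates that add nothing new and the upper threshold rejects candidates the model cannot be trusted on.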
Guarding the Guardians: Automated Analysis of Online Child Sexual Abuse
for: The paper addresses the urgent need for a solution that analyzes children's sexual abuse reports comprehensively, with a focus on reducing analysts' risk of exposure to harmful content.
methods: The paper proposes a novel automated tool that categorizes reports on three dimensions: Subject, Degree of Criminality, and Damage. It also introduces a novel approach to annotating the collected data, enabling a more in-depth analysis of the reports.
results: The approach significantly reduces analysts' risk of exposure to harmful content and improves the comprehension of fundamental patterns and trends in children's sexual abuse reports, enabling law enforcement agencies and policymakers to create focused strategies in the fight against violence towards children.
Abstract
Online violence against children has increased globally recently, demanding urgent attention. Competent authorities manually analyze abuse complaints to comprehend crime dynamics and identify patterns. However, the manual analysis of these complaints presents a challenge because it exposes analysts to harmful content during the review process. Given these challenges, we present a novel solution, an automated tool designed to analyze children's sexual abuse reports comprehensively. By automating the analysis process, our tool significantly reduces the risk of exposure to harmful content by categorizing the reports on three dimensions: Subject, Degree of Criminality, and Damage. Furthermore, leveraging our multidisciplinary team's expertise, we introduce a novel approach to annotate the collected data, enabling a more in-depth analysis of the reports. This approach improves the comprehension of fundamental patterns and trends, enabling law enforcement agencies and policymakers to create focused strategies in the fight against children's violence.
摘要
在全球范围内,网络对儿童的暴力行为已经增加,需要紧急关注。有能力的当局人工分析滥剑投诉,以便更好地理解犯罪动力和趋势。然而,手动分析这些投诉存在挑战,因为它可能曝露分析员遭受有害内容的风险。为了解决这些挑战,我们提出了一种新的解决方案:一种自动化分析儿童色情虐待投诉的工具。通过自动化分析过程,我们的工具可以减少分析员遭受有害内容的风险,并将投诉分为三个维度:主体、犯罪程度和伤害。此外,我们的多科学队伍专家的协作,我们引入了一种新的数据标注方法,以便更深入地分析投诉。这种方法可以更好地描述基本的趋势和模式,使宪法机关和制定政策者可以根据这些数据制定有关儿童暴力的专门策略。
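results: The paper hypothesizes that the level of uncertainty in the flow of attention is related to the quality of the model's response, and calibrates the model's confidence to avoid showing wrong answers.
Abstract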
Language Models are being widely used in Education. Even though modern deep learning models achieve very good performance on question-answering tasks, sometimes they make errors. To avoid misleading students by showing wrong answers, it is important to calibrate the confidence - that is, the prediction probability - of these models. In our work, we propose to use an XGBoost on top of BERT to output the corrected probabilities, using features based on the attention mechanism. Our hypothesis is that the level of uncertainty contained in the flow of attention is related to the quality of the model's response itself.
摘要
语言模型在教育领域广泛使用。虽然现代深度学习模型在问答任务上表现非常出色,但有时会出现错误。为了避免通过错误答案误导学生,需要对这些模型进行准确性调整。在我们的工作中,我们提议使用XGBoost在BERT之上输出修正的概率,使用基于注意力机制的特征。我们假设注意力流中的不确定程度与模型的答案质量之间存在相关性。
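results: "Openness" showed linguistic ambiguity, whereas "conscientiousness" and "neuroticism" were distinctly evoked in the OCEAN framework, and "extroversion" and "agreeableness" showed a notable overlap with each other yet a distinct separation from the other traits. The findings underscore GPT's versatility and its ability to adapt to nuanced instructions.
Abstract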
The research explores the steerability of Large Language Models (LLMs), particularly OpenAI's ChatGPT iterations. By employing a behavioral psychology framework called OCEAN (Openness, Conscientiousness, Extroversion, Agreeableness, Neuroticism), we quantitatively gauged the model's responsiveness to tailored prompts. When asked to generate text mimicking an extroverted personality, OCEAN scored the language alignment to that behavioral trait. In our analysis, while "openness" presented linguistic ambiguity, "conscientiousness" and "neuroticism" were distinctly evoked in the OCEAN framework, with "extroversion" and "agreeableness" showcasing a notable overlap yet distinct separation from other traits. Our findings underscore GPT's versatility and ability to discern and adapt to nuanced instructions. Furthermore, historical figure simulations highlighted the LLM's capacity to internalize and project instructible personas, precisely replicating their philosophies and dialogic styles. However, the rapid advancements in LLM capabilities and the opaque nature of some training techniques make metric proposals degrade rapidly. Our research emphasizes a quantitative role to describe steerability in LLMs, presenting both its promise and areas for further refinement in aligning its progress to human intentions.
摘要
研究探讨大语言模型(LLM)的可控性,尤其是OpenAI的ChatGPT迭代。通过employnig行为心理学框架called OCEAN(开放性、聪明性、外向性、合作性、情绪性),我们量化了模型对定制提示的回应。当请求生成文本模拟外向性人格时,OCEAN分数表示语言对该行为 trait的吻合。在我们的分析中,“开放性”存在语言 ambiguity,而“聪明性”和“情绪性”在OCEAN框架中得到了明显的表达,而“外向性”和“合作性”则显示了明显的 overlap yet distinct separation from other traits。我们的发现强调GPT的灵活性和对 instrucible 指令的适应能力。此外,历史人物模拟表明了LLM的能力 internalize和 project instructible personas,精准地复制他们的哲学和对话风格。然而,LLM的技能快速发展和一些训练技术的不透明性使得metric proposal degrade rapidly。我们的研究强调了量化描述 LLM 的可控性的重要性,并提出了其推进人类意图的方法。
Mobile Supply: The Last Piece of Jigsaw of Recommender System
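results: Experiments show that the proposed method further improves the performance of edge-side recommender systems and the user experience. It has been deployed on a large-scale online food platform, where it yielded considerable business gains.
Abstract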
The recommendation system is a fundamental functionality of online platforms. As the computing power of mobile phones has grown, some researchers have deployed recommendation algorithms on users' mobile devices to address the problems of data transmission delay and the pagination trigger mechanism. However, existing edge-side mobile ranking cannot completely solve the problem of the pagination trigger mechanism. Mobile ranking can only sort the items on the current page, and the fixed set of candidate items limits its performance. Besides, after the user has viewed the items of interest on the current page, the user refreshes to get a new page of items. This affects the user's immersive experience because the user is not satisfied with the items left on the current page. To address the problem of the pagination trigger mechanism, we propose a completely new module in the recommender system pipeline named Mobile Supply. The pipeline is extended to "retrieval->pre-ranking->ranking->re-ranking->Mobile Supply->mobile ranking". Specifically, we introduce the concept of list value and use a point-wise paradigm to approximate list-wise estimation in order to calculate the maximum revenue that mobile ranking can achieve for the current page. We also design a new mobile ranking approach, named device-aware mobile ranking, that accounts for the differences between mobile devices and is tailored to the new pipeline. Extensive offline and online experiments show the superiority of our proposed method and indicate that Mobile Supply can further improve the performance of edge-side recommender systems and the user experience. Mobile Supply has been deployed on the homepage of a large-scale online food platform and has yielded considerable profits in our business.
摘要
“推荐系统是线上平台的基本功能之一。随着移动设备的计算能力的提高,一些研究人员已经将推荐算法部署到用户的移动设备上以解决数据传输延迟和分页触发器机制的问题。然而,现有的边缘式移动排名无法完全解决分页触发器机制的问题。这个边缘式移动排名只能在当前页面上排序项目,而且固定的候选项目限制了排名的表现。此外,当用户已经查看了他们 interessant 的项目时,用户刷新以获取新的页面项目。这会影响用户的沉浸体验,因为用户不满意LEFT项目。”“为了解决分页触发器机制的问题,我们提出了一个全新的模组,名为 Mobile Supply。我们将推荐系统的管线延展为“获取->预选->排名->重新排名->Mobile Supply->边缘式排名”。具体来说,我们引入了列值的概念,并使用点子法来估算列值的最大收益,以计算可以由边缘式排名获得的当前页面的最大收益。我们还设计了一个新的边缘式排名方法,名为 Device-Aware Mobile Ranking,考虑了移动设备的不同特点,以适应新的管线。”“我们将 Mobile Supply 部署到一个大规模的线上食物平台的首页上,并获得了显著的收益。”
Revisiting Prompt Engineering via Declarative Crowdsourcing
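results: Preliminary case studies on sorting, entity resolution, and imputation suggest that declarative prompt engineering can improve the quality of LLM-powered data processing workflows while keeping cost bounded.
Abstract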
Large language models (LLMs) are incredibly powerful at comprehending and generating data in the form of text, but are brittle and error-prone. There has been an advent of toolkits and recipes centered around so-called prompt engineering, the process of asking an LLM to do something via a series of prompts. However, for LLM-powered data processing workflows in particular, optimizing for quality while keeping cost bounded is a tedious, manual process. We put forth a vision for declarative prompt engineering. We view LLMs like crowd workers and leverage ideas from the declarative crowdsourcing literature (including leveraging multiple prompting strategies, ensuring internal consistency, and exploring hybrid LLM/non-LLM approaches) to make prompt engineering a more principled process. Preliminary case studies on sorting, entity resolution, and imputation demonstrate the promise of our approach.
摘要
巨型语言模型(LLM)极其强大地理解和生成文本数据,但是脆弱和容易出错。随着推广工具和热门recipes的出现,关于 socalled prompt engineering——通过一系列提示来要求 LLM 做某件事——的过程在 LLM 驱动的数据处理工作流程中变得极其重要。然而,在Optimizing for quality的同时,保持成本在可控的范围内是一个艰辛的、手动的过程。我们提出了声明式提示工程的视野,将 LLM 看作是一群人群,利用声明式人群创新的想法——包括多种提示策略、保证内部一致性,以及混合 LLM 和非 LLM 方法——来使提示工程变得更加原则化。我们的初步案例研究包括排序、实体解析和填充,表明了我们的方法的承诺。
Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection
paper_authors: Xinhao Deng, Pingping Zhang, Wei Liu, Huchuan Lu
for: This paper aims to improve the performance of high-resolution salient object detection (HRSOD) by proposing a new dataset and a novel Recurrent Multi-scale Transformer (RMFormer) method.
methods: The proposed RMFormer method utilizes shared Transformers and multi-scale refinement architectures to generate high-resolution saliency maps, guided by lower-resolution predictions.
results: Extensive experiments on both high-resolution and low-resolution benchmarks demonstrate the effectiveness and superiority of the proposed framework, with the RMFormer method achieving state-of-the-art performance on the newly proposed HRS10K dataset.
Abstract
Salient Object Detection (SOD) aims to identify and segment the most conspicuous objects in an image or video. As an important pre-processing step, it has many potential applications in multimedia and vision tasks. With the advance of imaging devices, SOD on high-resolution images has recently been in great demand. However, traditional SOD methods are largely limited to low-resolution images, making them difficult to adapt to the development of High-Resolution SOD (HRSOD). Although some HRSOD methods have emerged, there are no sufficiently large datasets for training and evaluation. Besides, current HRSOD methods generally produce incomplete object regions and irregular object boundaries. To address the above issues, in this work, we first propose a new HRS10K dataset, which contains 10,500 high-quality annotated images at 2K-8K resolution. As far as we know, it is the largest dataset for the HRSOD task, and it will significantly help future work on training and evaluating models. Furthermore, to improve HRSOD performance, we propose a novel Recurrent Multi-scale Transformer (RMFormer), which recurrently utilizes shared Transformers and multi-scale refinement architectures. Thus, high-resolution saliency maps can be generated with the guidance of lower-resolution predictions. Extensive experiments on both high-resolution and low-resolution benchmarks show the effectiveness and superiority of the proposed framework. The source code and dataset are released at: https://github.com/DrowsyMon/RMFormer.
摘要
抽象对象检测(SOD)目的是在图像或视频中识别和分割最为醒目的对象。作为前处理步骤,它在多媒体和视觉任务中具有重要的应用前景。随着捕捉设备的发展,高分辨率SOD(HRSOD)的需求日益增加。然而,传统的SOD方法主要适用于低分辨率图像,使其难以适应HRSOD的发展。虽然一些HRSOD方法已经出现,但是没有足够的大型数据集用于训练和评估。此外,现有的HRSOD方法通常生成不完整的对象区域和不规则的对象边界。为解决上述问题,在这种工作中,我们首先提出了一个新的HRSOD数据集,名为HRS10K,它包含10500个高质量注解图像,分别在2K-8K分辨率上。我们知道,这是HRSOD任务中最大的数据集,它将有助于未来的工作在训练和评估模型。此外,为提高HRSOD性能,我们提出了一种新的循环多ScaleTransformer(RMFormer),它可以在不同的尺度上重复使用共享的Transformer和多尺度精度建立。因此,高分辨率的Saliency图可以通过低分辨率预测的指导生成。我们进行了广泛的实验,并证明了我们的框架的有效性和超越性。数据集和代码可以在https://github.com/DrowsyMon/RMFormer上下载。
A Cost Analysis of Generative Language Models and Influence Operations
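results: The results suggest that LLMs need only produce usable outputs with a reliability of roughly 25% to offer cost savings to propagandists. Monitoring controls on API-accessible LLMs can impose costs, but that effect is sharply limited when open source alternatives are available, and nation-states, even those conducting many large-scale influence operations, are unlikely to benefit economically from training custom LLMs for influence operations.
Abstract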
Despite speculation that recent large language models (LLMs) are likely to be used maliciously to improve the quality or scale of influence operations, uncertainty persists regarding the economic value that LLMs offer propagandists. This research constructs a model of costs facing propagandists for content generation at scale and analyzes (1) the potential savings that LLMs could offer propagandists, (2) the potential deterrent effect of monitoring controls on API-accessible LLMs, and (3) the optimal strategy for propagandists choosing between multiple private and/or open source LLMs when conducting influence operations. Primary results suggest that LLMs need only produce usable outputs with relatively low reliability (roughly 25%) to offer cost savings to propagandists, that the potential reduction in content generation costs can be quite high (up to 70% for a highly reliable model), and that monitoring capabilities have sharply limited cost imposition effects when alternative open source models are available. In addition, these results suggest that nation-states -- even those conducting many large-scale influence operations per year -- are unlikely to benefit economically from training custom LLMs specifically for use in influence operations.
摘要
尽管有人 especulate recent large language models (LLMs) 可能会被用于提高媒体操作质量或规模,但是对于宣传者而言, LLMS 的经济价值还存在uncertainty。这项研究构建了宣传者内容生成在大规模时所面临的成本模型,并分析了以下问题:(1) LLMs 可以提供宣传者内容生成的可能性,(2) API 可访问的 LLMs 监控控制的抑效果,以及(3) 宣传者选择多个私人和/或开源 LLMs 时的优化策略。主要结果表明,LLMs 只需生成可用输出,并且只需要roughly 25% 的可靠性,就能为宣传者提供成本节省。此外,研究还发现,监控控制对于使用开源模型来源的宣传者来说,成本干扰效果很少。最后,这些结果表明,even nation-states 进行大规模的媒体操作,不太可能通过专门为影响操作培训自己的 LLMs 来获得经济效益。
SurvBeX: An explanation method of the machine learning survival models based on the Beran estimator
paper_authors: Lev V. Utkin, Danila Y. Eremenko, Andrei V. Konstantinov
for: The paper proposes a new explanation method called SurvBeX for interpreting predictions of machine learning survival black-box models.
methods: The method uses a modified Beran estimator as the surrogate explanation model, and generates many points in a local area around an example of interest to compute the survival function of the black-box model and the Beran estimator.
results: The paper demonstrates the efficiency of SurvBeX through numerical experiments with synthetic and real survival data, and compares the method with SurvLIME and SurvSHAP. The code implementing SurvBeX is available online.
Abstract
An explanation method called SurvBeX is proposed to interpret predictions of the machine learning survival black-box models. The main idea behind the method is to use the modified Beran estimator as the surrogate explanation model. Coefficients, incorporated into Beran estimator, can be regarded as values of the feature impacts on the black-box model prediction. Following the well-known LIME method, many points are generated in a local area around an example of interest. For every generated example, the survival function of the black-box model is computed, and the survival function of the surrogate model (the Beran estimator) is constructed as a function of the explanation coefficients. In order to find the explanation coefficients, it is proposed to minimize the mean distance between the survival functions of the black-box model and the Beran estimator produced by the generated examples. Many numerical experiments with synthetic and real survival data demonstrate the SurvBeX efficiency and compare the method with the well-known method SurvLIME. The method is also compared with the method SurvSHAP. The code implementing SurvBeX is available at: https://github.com/DanilaEremenko/SurvBeX
摘要
一种名为SurvBeX的解释方法被提议用于解释机器学习生存黑盒模型的预测结果。该方法的主要思想是使用修改后的Beran估计器作为解释模型。将这些修改后的Beran估计器作为特征影响值,可以看作黑盒模型预测结果中特征的影响。与已知的LIME方法类似,在一个当地区域around一个Example of interest中,生成多个例子。对每个生成的例子,计算黑盒模型的生存函数,并将BERAN估计器中的生存函数作为特征影响值构建。为了找到解释系数,提议使用生成的例子中的平均距离来最小化黑盒模型和BERAN估计器生成的生存函数之间的距离。多个数学实验证明SurvBeX的效果,并与SurvLIME和SurvSHAP方法进行比较。代码实现SurvBeX可以在以下链接中找到:https://github.com/DanilaEremenko/SurvBeX。
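For readers unfamiliar with the surrogate model, the standard Beran estimator of the conditional survival function can be written as below. The second display is only an assumed reading of how the explanation coefficients enter: the Gaussian kernel with coefficient vector $b$, the bandwidth $\tau$, the distance $D$, and the notation $S_{\mathrm{bb}}$ for the black-box survival function are illustrative choices and are not confirmed by the abstract.

```latex
% Training set: (x_i, \delta_i, T_i), i = 1, \dots, n, sorted so that T_1 \le \dots \le T_n;
% \delta_i = 1 marks an observed event and \delta_i = 0 a censored observation.
\[
S(t \mid x) \;=\; \prod_{i:\, T_i \le t}
\left( 1 - \frac{W(x, x_i)}{1 - \sum_{j=1}^{i-1} W(x, x_j)} \right)^{\delta_i},
\qquad
W(x, x_i) \;=\; \frac{K(x, x_i)}{\sum_{j=1}^{n} K(x, x_j)} .
\]

% Assumed modification: the coefficients b weight each feature inside the kernel,
% and are fitted on the N points z_k generated around the example of interest.
\[
K_b(x, x_i) \;=\; \exp\!\left( -\frac{\lVert b \odot (x - x_i) \rVert^{2}}{\tau} \right),
\qquad
\hat{b} \;=\; \arg\min_{b} \; \frac{1}{N} \sum_{k=1}^{N}
D\!\left( S_{\mathrm{bb}}(\cdot \mid z_k),\; S_{b}(\cdot \mid z_k) \right).
\]
```

Under this reading, a large $|b_m|$ means that feature $m$ strongly shapes the surrogate's kernel weights, which is why the coefficients can be interpreted as feature impacts.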
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
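results: The study shows that Bard outperforms previous LVLMs in most multimodal capabilities but remains susceptible to object hallucination. The ChatGPT Ensemble Evaluation aligns better with human evaluation than word matching does, and the Tiny LVLM-eHub variant makes it easy to evaluate offline LVLMs.
Abstract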
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated significant progress in tackling complex multimodal tasks. Among these cutting-edge developments, Google's Bard stands out for its remarkable multimodal capabilities, promoting comprehensive comprehension and reasoning across various domains. This work presents an early and holistic evaluation of LVLMs' multimodal abilities, with a particular focus on Bard, by proposing a lightweight variant of LVLM-eHub, named Tiny LVLM-eHub. In comparison to the vanilla version, Tiny LVLM-eHub possesses several appealing properties. Firstly, it provides a systematic assessment of six categories of multimodal capabilities, including visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence, through quantitative evaluation of 42 standard text-related visual benchmarks. Secondly, it conducts an in-depth analysis of LVLMs' predictions using the ChatGPT Ensemble Evaluation (CEE), which leads to a robust and accurate evaluation and exhibits improved alignment with human evaluation compared to the word matching approach. Thirdly, it comprises a mere 2.1K image-text pairs, facilitating ease of use for practitioners to evaluate their own offline LVLMs. Through extensive experimental analysis, this study demonstrates that Bard outperforms previous LVLMs in most multimodal capabilities except object hallucination, to which Bard is still susceptible. Tiny LVLM-eHub serves as a baseline evaluation for various LVLMs and encourages innovative strategies aimed at advancing multimodal techniques. Our project is publicly available at https://github.com/OpenGVLab/Multi-Modality-Arena.
摘要
近期大量视语言模型(LVLM)的进步,表明了许多复杂多Modal任务的解决方案。其中,Google的Bard凸出了优异的多Modal能力,涵盖了多个领域的全面理解和合理思维。本文提出了一种轻量级的LVLM-eHub变体,名为Tiny LVLM-eHub,与传统版本相比具有多个优点。首先,它提供了六类多Modal能力的系统性评估,包括视觉理解、视觉知识获取、视觉逻辑、视觉常识、物体梦幻和embodied智能,通过42个标准文本相关的视觉准确度评估。其次,它使用ChatGPT Ensemble Evaluation(CEE)进行深入分析,从而获得了更加稳定和准确的评估结果,并与人类评估更加一致。最后,它使用2.1万个图文对象,使得评估方便快速。通过广泛的实验分析,本研究表明,Bard在大多数多Modal能力中都超越了前一代LVLM,只有物体梦幻能力方面存在一定的极点。Tiny LVLM-eHub可以作为多种LVLM的基准评估,激励创新的多Modal技术发展。我们的项目公开可用于\url{https://github.com/OpenGVLab/Multi-Modality-Arena}.
Dimensionality Reduction for Improving Out-of-Distribution Detection in Medical Image Segmentation
paper_authors: McKell Woodland, Nihil Patel, Mais Al Taie, Joshua P. Yung, Tucker J. Netherton, Ankit B. Patel, Kristy K. Brock
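for: Detecting out-of-distribution (OOD) images at inference time for clinically deployed segmentation models, which are known to fail on data outside their training distribution.
methods: The Mahalanobis distance is applied post hoc to the bottleneck features of the segmentation model, after their dimensionality has been reduced with principal component analysis.
results: OOD images were detected with high performance and minimal computational load.
Abstract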
Clinically deployed segmentation models are known to fail on data outside of their training distribution. As these models perform well on most cases, it is imperative to detect out-of-distribution (OOD) images at inference to protect against automation bias. This work applies the Mahalanobis distance post hoc to the bottleneck features of a Swin UNETR model that segments the liver on T1-weighted magnetic resonance imaging. By reducing the dimensions of the bottleneck features with principal component analysis, OOD images were detected with high performance and minimal computational load.
摘要
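The detection recipe described in the abstract above (reduce the bottleneck features with PCA, then score each test image by its Mahalanobis distance to the training distribution) can be sketched with NumPy as follows. This is a minimal illustration under assumptions: random vectors stand in for the Swin UNETR bottleneck features, and the feature width, number of components, and function names are hypothetical.

```python
import numpy as np


def fit_pca(features, n_components):
    """Return the training mean and the top principal directions."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(features, mean, components):
    """Project features onto the retained principal directions."""
    return (features - mean) @ components.T


def fit_gaussian(z):
    """Mean and (pseudo-)inverse covariance of the reduced training features."""
    mu = z.mean(axis=0)
    cov = np.atleast_2d(np.cov(z, rowvar=False))
    return mu, np.linalg.pinv(cov)


def mahalanobis(z, mu, cov_inv):
    """Mahalanobis distance of each row of z to the training Gaussian."""
    d = z - mu
    sq = np.einsum("ij,jk,ik->i", d, cov_inv, d)
    return np.sqrt(np.maximum(sq, 0.0))


# Stand-in data: 32-dimensional "bottleneck features" (assumed size).
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 32))
in_dist = rng.normal(size=(20, 32))
out_dist = rng.normal(loc=6.0, size=(20, 32))  # shifted, i.e. OOD

mean, comps = fit_pca(train, n_components=8)
mu, cov_inv = fit_gaussian(project(train, mean, comps))

in_scores = mahalanobis(project(in_dist, mean, comps), mu, cov_inv)
out_scores = mahalanobis(project(out_dist, mean, comps), mu, cov_inv)
print(in_scores.mean(), out_scores.mean())
```

A threshold on the score, for example a high percentile of the training distances, would then flag images as OOD. Working in the reduced space keeps the covariance matrix small and well conditioned, which is consistent with the low computational load the abstract reports.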
SEM-GAT: Explainable Semantic Pose Estimation using Learned Graph Attention
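results: The method is tested on the KITTI odometry dataset, where it achieves accuracy competitive with benchmark methods and higher track smoothness while using fewer network parameters.
Abstract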
This paper proposes a GNN-based method for exploiting semantics and local geometry to guide the identification of reliable pointcloud registration candidates. Semantic and morphological features of the environment serve as key reference points for registration, enabling accurate lidar-based pose estimation. Our novel lightweight static graph structure informs our attention-based keypoint node aggregation GNN network by identifying semantic instance-based relationships, acting as inductive bias to significantly reduce the computational burden of pointcloud registration. By connecting candidate nodes and exploiting cross-graph attention, we identify confidence scores for all potential registration correspondences, estimating the displacement between pointcloud scans. Our pipeline enables introspective analysis of the model's performance by correlating it with the individual contributions of local structures in the environment, providing valuable insights into the system's behaviour. We test our method on the KITTI odometry dataset, achieving competitive accuracy compared to benchmark methods and a higher track smoothness while relying on significantly fewer network parameters.
摘要
(本文提出了一种基于GNN的方法,利用 semantics和local geometry来导引可靠的点云注册候选者的标识。环境中的semantic和形态特征作为注册参考点,实现了高精度的激光探测pose estimation。我们的新的轻量级静止图 структуры告诉我们的注意力基于节点聚合GNN网络,通过标识semantic实例之间的关系,以 inductive bias 的形式减少点云注册的计算成本。通过连接候选节点并利用交叉图注意力,我们可以为所有可能的注册匹配计算出信任度,并估算点云扫描中的偏移量。我们的管道可以 introspective 地分析模型的性能,将其与本地环境结构的个别贡献相对考量,提供有价值的信息,了解系统的行为。我们在KITTI odometry dataset上测试了我们的方法,与标准方法相比,实现了竞争性的准确率和更高的车辆运动平滑性,同时使用的网络参数数量更少。)
Safe Multimodal Communication in Human-Robot Collaboration
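methods: This work proposes a framework based on multimodal fusion of voice and gesture commands that enables natural and efficient communication between humans and robots while always respecting safety regulations.
results: A comparative experiment shows that, thanks to multimodal fusion of voice and gesture commands, the robot can extract valuable information from the human to perform the task while the operator's safety is ensured.
Abstract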
New industrial settings are characterized by the presence of humans and robots that work in close proximity, cooperating to perform the required job. Such collaboration, however, requires attention to many aspects. Firstly, it is crucial to enable communication between these two actors that is natural and efficient. Secondly, the robot's behavior must always comply with safety regulations, so that the collaboration is always safe. In this paper, we propose a framework that enables multi-channel communication between humans and robots by leveraging multimodal fusion of voice and gesture commands while always respecting safety regulations. The framework is validated through a comparative experiment, demonstrating that, thanks to multimodal communication, the robot can extract valuable information for performing the required task and, with the safety layer, can scale its speed to ensure the operator's safety.
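results: Testing shows that commercial LLMs perform strongly in complex environments, but there is a significant performance gap between them and open-source competitors.
Abstract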
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 25 LLMs (including APIs and open-sourced models) shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and open-sourced competitors. It also serves as a component of an ongoing project with wider coverage and deeper consideration towards systematic LLM evaluation. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench
Summary
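results: The study finds that biased NLP models tend to replicate and amplify existing societal biases, which can lead to discrimination in sociotechnical settings. Interviews with participants and thematic analysis also indicate that readers may be influenced by these biases when reading the articles, shifting their perception of a country. The findings underscore the impact of AI systems on society and the need to correct biases in them.
Abstract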
We investigate the potential for nationality biases in natural language processing (NLP) models using human evaluation methods. Biased NLP models can perpetuate stereotypes and lead to algorithmic discrimination, posing a significant challenge to the fairness and justice of AI systems. Our study employs a two-step mixed-methods approach that includes both quantitative and qualitative analysis to identify and understand the impact of nationality bias in a text generation model. Through our human-centered quantitative analysis, we measure the extent of nationality bias in articles generated by AI sources. We then conduct open-ended interviews with participants, performing qualitative coding and thematic analysis to understand the implications of these biases on human readers. Our findings reveal that biased NLP models tend to replicate and amplify existing societal biases, which can translate to harm if used in a sociotechnical setting. The qualitative analysis from our interviews offers insights into the experience readers have when encountering such articles, highlighting the potential to shift a reader's perception of a country. These findings emphasize the critical role of public perception in shaping AI's impact on society and the need to correct biases in AI systems.
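Summary
We study nationality bias in natural language processing (NLP) models using human evaluation methods. Biased NLP models can perpetuate stereotypes and lead to algorithmic discrimination, which challenges the fairness and justice of AI systems. Our study uses a two-step mixed-methods approach, combining quantitative and qualitative analysis, to identify and understand the impact of nationality bias in a text generation model. Through human-centered quantitative analysis, we measure the extent of nationality bias in AI-generated articles. We then conduct open-ended interviews with participants and apply qualitative coding and thematic analysis to understand how these biases affect human readers. Our findings show that biased NLP models tend to replicate and amplify existing societal biases, which can cause harm in sociotechnical settings. The interview analysis indicates that encountering such articles can shift a reader's perception of a country. These findings highlight the importance of AI's impact on society and the need to correct biases in AI systems.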
Towards an AI to Win Ghana’s National Science and Maths Quiz
paper_authors: George Boateng, Jonathan Abrefah Mensah, Kevin Takyi Yeboah, William Edor, Andrew Kojo Mensah-Onumah, Naafi Dasana Ibrahim, Nana Sam Yeboah
for: The paper is written to explore the possibility of building an AI system that can compete in Ghana’s National Science and Maths Quiz (NSMQ) and potentially win.
methods: The paper describes an open-source project that is building AI to compete in the NSMQ, with a focus on speech-to-text, text-to-speech, question-answering, and human-computer interaction.
results: The paper provides an overview of the progress made thus far in the project, including the development of the AI system and the next steps toward its planned launch and debut in October for NSMQ 2023.
Abstract
Can an AI win Ghana's National Science and Maths Quiz (NSMQ)? That is the question we seek to answer in the NSMQ AI project, an open-source project that is building AI to compete live in the NSMQ and win. The NSMQ is an annual live science and mathematics competition for senior secondary school students in Ghana in which 3 teams of 2 students compete by answering questions across biology, chemistry, physics, and math in 5 rounds over 5 progressive stages until a winning team is crowned for that year. The NSMQ is an exciting live quiz competition with interesting technical challenges across speech-to-text, text-to-speech, question-answering, and human-computer interaction. In this ongoing work that began in January 2023, we give an overview of the project, describe each of the teams, progress made thus far, and the next steps toward our planned launch and debut of the AI in October for NSMQ 2023. An AI that conquers this grand challenge can have real-world impact on education such as enabling millions of students across Africa to have one-on-one learning support from this AI.
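Summary
Can an AI win Ghana's National Science and Maths Quiz (NSMQ)? We seek the answer in the NSMQ AI project, an open-source project building an AI to compete in the NSMQ and win. The NSMQ is an annual live science and mathematics competition in Ghana for senior secondary school students, in which 3 teams of 2 students answer questions across biology, chemistry, physics, and math in 5 rounds over 5 progressive stages. It is an exciting live quiz with technical challenges across speech-to-text, text-to-speech, question-answering, and human-computer interaction. In this work, which began in January 2023, we give an overview of the project, describe each of the teams and their progress so far, and outline the next steps toward launching the AI in October for NSMQ 2023. An AI that conquers this challenge could have real-world impact on education, such as giving millions of students across Africa one-on-one learning support.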
Deep Learning-Based Knowledge Injection for Metaphor Detection: A Comprehensive Review
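results: The review indicates that existing knowledge injection methods achieve relatively high recognition rates and accuracy on metaphor recognition tasks, but that they still face open issues such as the quality and reliability of the injected knowledge.
Abstract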
The history of metaphor research also marks the evolution of knowledge infusion research. With the continued advancement of deep learning techniques in recent years, the natural language processing community has shown great interest in applying knowledge to successful results in metaphor recognition tasks. Although there has been a gradual increase in the number of approaches involving knowledge injection in the field of metaphor recognition, there is a lack of a complete review article on knowledge injection based approaches. Therefore, the goal of this paper is to provide a comprehensive review of research advances in the application of deep learning for knowledge injection in metaphor recognition tasks. In this paper, we systematically summarize and generalize the mainstream knowledge and knowledge injection principles, as well as review the datasets, evaluation metrics, and benchmark models used in metaphor recognition tasks. Finally, we explore the current issues facing knowledge injection methods and provide an outlook on future research directions.
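Summary
The history of metaphor research also marks the evolution of knowledge infusion research. With the steady progress of deep learning in recent years, the natural language processing community has shown great interest in applying knowledge to improve results on metaphor recognition tasks. Although the number of approaches involving knowledge injection in metaphor recognition has gradually increased, no complete review of these approaches exists. This paper therefore provides a comprehensive review of deep learning for knowledge injection in metaphor recognition. We systematically summarize the mainstream knowledge types and knowledge injection principles, and review the datasets, evaluation metrics, and benchmark models used in metaphor recognition tasks. Finally, we discuss the issues currently facing knowledge injection methods and offer an outlook on future research directions.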
Comparative Analysis of the wav2vec 2.0 Feature Extractor
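results: The study shows that neural raw-waveform feature extractors are competitive with traditional feature extraction methods, achieving comparable performance on the LibriSpeech benchmark. It also analyzes the effect of the individual components and finds that a set of bandpass filters captures the information most important to the ASR system.
Abstract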
Automatic speech recognition (ASR) systems typically use handcrafted feature extraction pipelines. To avoid their inherent information loss and to achieve more consistent modeling from speech to transcribed text, neural raw waveform feature extractors (FEs) are an appealing approach. Also the wav2vec 2.0 model, which has recently gained large popularity, uses a convolutional FE which operates directly on the speech waveform. However, it is not yet studied extensively in the literature. In this work, we study its capability to replace the standard feature extraction methods in a connectionist temporal classification (CTC) ASR model and compare it to an alternative neural FE. We show that both are competitive with traditional FEs on the LibriSpeech benchmark and analyze the effect of the individual components. Furthermore, we analyze the learned filters and show that the most important information for the ASR system is obtained by a set of bandpass filters.
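Summary
Automatic speech recognition (ASR) systems typically use handcrafted feature extraction pipelines. Neural raw-waveform feature extractors (FEs) are an appealing way to avoid the inherent information loss of such pipelines and to model speech-to-text more consistently. The recently popular wav2vec 2.0 model also uses a convolutional FE that operates directly on the speech waveform, but it has not yet been studied extensively in the literature. In this work, we study its capability to replace standard feature extraction methods in a connectionist temporal classification (CTC) ASR model and compare it with an alternative neural FE. We find that both are competitive with traditional FEs on the LibriSpeech benchmark, and we analyze the effect of the individual components. We also analyze the learned filters and find that a set of bandpass filters provides the information most important to the ASR system.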
CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages
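results: The pipeline achieves high performance across the different languages and varieties and outperforms or expands its parent pipeline Stanza on all supported tasks. The new functionality for processing web data, and the reasons for it, are also presented.
Abstract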
We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline. We describe the main improvements in CLASSLA-Stanza with respect to Stanza, and give a detailed description of the model training process for the latest 2.1 release of the pipeline. We also report performance scores produced by the pipeline for different languages and varieties. CLASSLA-Stanza exhibits consistently high performance across all the supported languages and outperforms or expands its parent pipeline Stanza at all the supported tasks. We also present the pipeline's new functionality enabling efficient processing of web data and the reasons that led to its implementation.
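Summary
We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, based on the Stanza natural language processing pipeline. We describe the main improvements of CLASSLA-Stanza over Stanza and detail the model training process for the latest 2.1 release. We also report performance scores for different languages and varieties. CLASSLA-Stanza performs consistently well across all supported languages and outperforms or expands its parent pipeline Stanza on all supported tasks. We also present the new functionality for efficient processing of web data and the reasons for implementing it.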
OpinionConv: Conversational Product Search with Grounded Opinions
paper_authors: Vahid Sadiri Javadi, Martin Potthast, Lucie Flek
for: This paper aims to address the problem of training conversational AI in simulating sales conversations by leveraging product reviews as a rich source of product opinions to ground conversational AI in true subjective narratives.
methods: The paper uses product reviews as a source of product opinions to train a conversational AI model called OpinionConv, which can simulate sales conversations.
results: The paper conducts several user studies to validate the generated conversations and shows that the generated opinions are perceived as realistic. The assessors also confirm the importance of opinions as an informative basis for decision-making.
Abstract
When searching for products, the opinions of others play an important role in making informed decisions. Subjective experiences about a product can be a valuable source of information. This is also true in sales conversations, where a customer and a sales assistant exchange facts and opinions about products. However, training an AI for such conversations is complicated by the fact that language models do not possess authentic opinions for their lack of real-world experience. We address this problem by leveraging product reviews as a rich source of product opinions to ground conversational AI in true subjective narratives. With OpinionConv, we develop the first conversational AI for simulating sales conversations. To validate the generated conversations, we conduct several user studies showing that the generated opinions are perceived as realistic. Our assessors also confirm the importance of opinions as an informative basis for decision-making.
Summary
Studying Socially Unacceptable Discourse Classification (SUD) through different eyes: “Are we on the same page ?”
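results: By analyzing how different annotation modalities influence SUD learning, the authors provide several data insights that can support domain experts in the annotation task.
Abstract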
We study Socially Unacceptable Discourse (SUD) characterization and detection in online text. We first build and present a novel corpus that contains a large variety of manually annotated texts from different online sources used so far in state-of-the-art Machine learning (ML) SUD detection solutions. This global context allows us to test the generalization ability of SUD classifiers that acquire knowledge around the same SUD categories, but from different contexts. From this perspective, we can analyze how (possibly) different annotation modalities influence SUD learning by discussing open challenges and open research directions. We also provide several data insights which can support domain experts in the annotation task.
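Summary
We study the characterization and detection of Socially Unacceptable Discourse (SUD) in online text. We first build a new corpus containing a wide variety of manually annotated texts from different online sources that have been used in state-of-the-art machine learning (ML) SUD detection solutions. This global context lets us test whether SUD classifiers generalize across contexts. From this perspective, we analyze how different annotation modalities influence SUD learning and discuss open challenges and research directions. We also provide several data insights to support domain experts in the annotation task.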
paper_authors: Sang-eun Han, Yeonseok Jeong, Seung-won Hwang, Kyungjae Lee
for: answering user questions on unrestricted knowledge sources
methods: Judge-Specialist framework with specialist retrievers/readers and a dedicated language model to select the final answer
results: outperforms state-of-the-art multi-source QA methods on Natural Questions, and robustly preserves monotonicity against noise from speech recognition
Abstract
Question answering (QA) is a critical task for speech-based retrieval from knowledge sources, as it returns only the answers without requiring users to read supporting documents. Specifically, open-domain QA aims to answer user questions on unrestricted knowledge sources. Ideally, adding a source should not decrease the accuracy, but we find this property (denoted as "monotonicity") does not hold for current state-of-the-art methods. We identify the cause, and based on that we propose the Judge-Specialist framework. Our framework consists of (1) specialist retrievers/readers to cover individual sources, and (2) a judge, a dedicated language model to select the final answer. Our experiments show that our framework not only ensures monotonicity, but also outperforms state-of-the-art multi-source QA methods on Natural Questions. Additionally, we show that our models robustly preserve the monotonicity against noise from speech recognition. We publicly release our code and setting.
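Summary
Question answering (QA) is a key task for retrieval from knowledge sources, returning only answers without requiring users to read supporting documents. Open-domain QA in particular aims to answer user questions over unrestricted knowledge sources. Ideally, adding a source should not reduce accuracy, but we find that this property (denoted "monotonicity") does not hold for current methods. We identify the cause and, based on it, propose the Judge-Specialist framework. The framework consists of (1) specialist retrievers/readers that cover individual sources and (2) a judge, a dedicated language model that selects the final answer. Our experiments show that the framework not only ensures monotonicity but also outperforms state-of-the-art multi-source QA methods on Natural Questions. Our models also robustly preserve monotonicity against noise from speech recognition. We publicly release our code and settings.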
Large Language Model Prompt Chaining for Long Legal Document Classification
results: According to the paper, prompt chaining not only improves on zero-shot performance but also allows smaller models to surpass larger ones, such as ChatGPT zero-shot.
Abstract
Prompting is used to guide or steer a language model in generating an appropriate response that is consistent with the desired outcome. Chaining is a strategy used to decompose complex tasks into smaller, manageable components. In this study, we utilize prompt chaining for extensive legal document classification tasks, which present difficulties due to their intricate domain-specific language and considerable length. Our approach begins with the creation of a concise summary of the original document, followed by a semantic search for related exemplar texts and their corresponding annotations from a training corpus. Finally, we prompt for a label - based on the task - to assign, by leveraging the in-context learning from the few-shot prompt. We demonstrate that through prompt chaining, we can not only enhance the performance over zero-shot, but also surpass the micro-F1 score achieved by larger models, such as ChatGPT zero-shot, using smaller models.
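Summary
Prompting guides or steers a language model toward a response consistent with the desired outcome. Chaining is a strategy for decomposing complex tasks into smaller, more manageable components. In this study, we use prompt chaining for extensive legal document classification tasks, which are challenging because of their domain-specific language and document length. Our approach first creates a concise summary of the original document, then uses semantic search to retrieve related exemplar texts and their annotations from a training corpus. Finally, we prompt for a task-specific label, relying on in-context learning from the few-shot prompt. We show that prompt chaining not only improves on zero-shot performance but also lets smaller models surpass larger ones, such as ChatGPT zero-shot.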
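The three-step chain (summarize, retrieve exemplars, prompt for a label) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `summarize` is a hypothetical stub standing in for a summarization model call, the semantic search uses a simple bag-of-words cosine similarity in place of learned embeddings, and the toy corpus and labels are invented.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (stand-in for embedding search)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def summarize(document: str, max_words: int = 30) -> str:
    """Hypothetical stub for step 1; a real chain would call a summarization model."""
    return " ".join(document.split()[:max_words])

def build_few_shot_prompt(summary: str, corpus: list, k: int = 2) -> str:
    """Steps 2 and 3: retrieve the k most similar labelled exemplars, then format the prompt."""
    ranked = sorted(corpus, key=lambda ex: bow_cosine(summary, ex["text"]), reverse=True)
    shots = "\n\n".join(f"Text: {ex['text']}\nLabel: {ex['label']}" for ex in ranked[:k])
    return f"{shots}\n\nText: {summary}\nLabel:"

# Toy training corpus with invented annotations.
corpus = [
    {"text": "tenant failed to pay rent under the lease", "label": "housing"},
    {"text": "employee dismissed without notice period", "label": "employment"},
    {"text": "landlord refused to return the rent deposit", "label": "housing"},
]
prompt = build_few_shot_prompt(summarize("The landlord kept the deposit after the lease ended"), corpus)
print(prompt)
```

The resulting prompt ends with an open `Label:` slot, which a language model would then complete; that final call is deliberately left out because it depends on the model and API used.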
Social Media, Topic Modeling and Sentiment Analysis in Municipal Decision Support
paper_authors: Miloš Švaňa
for: This paper is written for municipal decision-makers who want to incorporate social media sentiment into their decision-making processes.
methods: The paper proposes a framework for processing social media posts that consists of three steps: determining the sentiment polarity of each post, identifying prevalent topics, and aggregating the sentiment information. The framework uses fuzzy numbers to represent the sentiment in a richer way and capture the diversity of opinions expressed on social media.
results: The paper demonstrates the application of the framework on tweets published from Ostrava, Czechia over a period of about two months. The results show how fuzzy numbers can represent the sentiment in a more nuanced way and capture the diversity of opinions expressed on social media.
Abstract
Many cities around the world are aspiring to become smart. However, smart initiatives often give little weight to the opinions of average citizens. Social media are one of the most important sources of citizen opinions. This paper presents a prototype of a framework for processing social media posts with municipal decision-making in mind. The framework consists of a sequence of three steps: (1) determining the sentiment polarity of each social media post, (2) identifying prevalent topics and mapping these topics to individual posts, and (3) aggregating these two pieces of information into a fuzzy number representing the overall sentiment expressed towards each topic. Optionally, the fuzzy number can be reduced into a tuple of two real numbers indicating the "amount" of positive and negative opinion expressed towards each topic. The framework is demonstrated on tweets published from Ostrava, Czechia over a period of about two months. This application illustrates how fuzzy numbers represent sentiment in a richer way and capture the diversity of opinions expressed on social media.
Summary
The framework processes social media posts in three steps:
1. Determine the sentiment polarity of each social media post (whether the opinion is positive or negative).
2. Identify prevalent topics and map them to individual posts.
3. Aggregate the two pieces of information into a fuzzy number representing the overall sentiment expressed towards each topic.
Optionally, the fuzzy number can be reduced to a tuple of two real numbers indicating the "amount" of positive and negative opinion expressed towards each topic. The framework was demonstrated on tweets published from Ostrava, Czechia over a period of about two months, showing how fuzzy numbers can represent sentiment in a richer way and capture the diversity of opinions expressed on social media.
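The aggregation and reduction steps can be illustrated with a small sketch. The abstract does not specify how the fuzzy number is constructed, so everything here is an assumption: polarities are taken to lie in [-1, 1], the fuzzy number is a triangular one built from the minimum, mean, and maximum polarity of a topic, and the reduction measures the area under the membership function on each side of zero.

```python
from statistics import mean

def triangular_fuzzy(polarities):
    """Aggregate per-post polarities in [-1, 1] into a triangular fuzzy number (low, peak, high)."""
    return (min(polarities), mean(polarities), max(polarities))

def membership(x, tfn):
    """Membership degree of x in the triangular fuzzy number tfn."""
    low, peak, high = tfn
    if x == peak:
        return 1.0
    if low < x < peak:
        return (x - low) / (peak - low)
    if peak < x < high:
        return (high - x) / (high - peak)
    return 0.0

def positive_negative(tfn, steps=2000):
    """Reduce the fuzzy number to (positive, negative): area under the
    membership function right and left of 0, using the midpoint rule on [-1, 1]."""
    width = 2.0 / steps
    pos = neg = 0.0
    for i in range(steps):
        x = -1.0 + (i + 0.5) * width
        if x > 0:
            pos += membership(x, tfn) * width
        else:
            neg += membership(x, tfn) * width
    return pos, neg

# Hypothetical polarities of four posts mapped to one topic.
tfn = triangular_fuzzy([-0.8, -0.2, 0.1, 0.5])
pos, neg = positive_negative(tfn)
print(tfn, round(pos, 3), round(neg, 3))
```

Unlike a single average, the pair keeps both sides of the debate visible: here the topic leans negative, but a non-trivial positive share remains.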
Collective Human Opinions in Semantic Textual Similarity
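results: Analysis shows that the distribution of collective human ratings cannot be described adequately by a scalar or a single Gaussian, owing to disagreement among annotators. Moreover, existing STS models fail to capture human disagreement on individual instances and instead reflect predictive confidence over the aggregate dataset.
Abstract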
Despite the subjective nature of semantic textual similarity (STS) and pervasive disagreements in STS annotation, existing benchmarks have used averaged human ratings as the gold standard. Averaging masks the true distribution of human opinions on examples of low agreement, and prevents models from capturing the semantic vagueness that the individual ratings represent. In this work, we introduce USTS, the first Uncertainty-aware STS dataset with ~15,000 Chinese sentence pairs and 150,000 labels, to study collective human opinions in STS. Analysis reveals that neither a scalar nor a single Gaussian fits a set of observed judgements adequately. We further show that current STS models cannot capture the variance caused by human disagreement on individual instances, but rather reflect the predictive confidence over the aggregate dataset.
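Summary
Although semantic textual similarity (STS) is subjective and STS annotation is marked by pervasive disagreement, existing benchmarks use averaged human ratings as the gold standard. Averaging collapses the ratings of low-agreement examples into a single value and hides the true distribution of human opinions. In this work, we introduce USTS, the first uncertainty-aware STS dataset, with ~15,000 Chinese sentence pairs and 150,000 labels, to study collective human opinions in STS. Analysis shows that neither a scalar nor a single Gaussian adequately fits the observed judgements. We further show that current STS models cannot capture human disagreement on individual instances and instead reflect predictive confidence over the aggregate dataset.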
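A toy example makes the point about averaging concrete. The ratings below are invented for illustration (they are not taken from USTS) and assume a 0 to 5 similarity scale: two sentence pairs share the same mean rating even though annotators fully agree on one and are split on the other.

```python
from statistics import mean, pstdev

def summarize_ratings(ratings):
    """Return (mean, population standard deviation) of a set of annotator ratings."""
    return mean(ratings), pstdev(ratings)

# Hypothetical ratings from four annotators for two sentence pairs.
agree = [2.5, 2.5, 2.5, 2.5]   # everyone gives the same middling score
split = [0.0, 0.0, 5.0, 5.0]   # annotators fall into two opposing camps

print(summarize_ratings(agree))
print(summarize_ratings(split))
```

Both pairs would receive the same averaged gold label. The standard deviation exposes the difference, though for the split pair even mean plus variance is a poor description, since the ratings are bimodal rather than Gaussian.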
I-WAS: a Data Augmentation Method with GPT-2 for Simile Detection
results: Experimental results show that the proposed data augmentation method effectively improves simile detection performance.
Abstract
Simile detection is a valuable task for many natural language processing (NLP)-based applications, particularly in the field of literature. However, existing research on simile detection often relies on corpora that are limited in size and do not adequately represent the full range of simile forms. To address this issue, we propose a simile data augmentation method based on Word replacement And Sentence completion using the GPT-2 language model. Our iterative process, called I-WAS, is designed to improve the quality of the augmented sentences. To better evaluate the performance of our method in real-world applications, we have compiled a corpus containing a more diverse set of simile forms for experimentation. Our experimental results demonstrate the effectiveness of our proposed data augmentation method for simile detection.
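Summary
Simile detection is an important task for many natural language processing (NLP) applications, particularly in the field of literature. However, existing research on simile detection often relies on corpora that are limited in size and do not adequately represent the full range of simile forms. To address this, we propose a simile data augmentation method based on word replacement and sentence completion with the GPT-2 language model. Our iterative process, called I-WAS, is designed to improve the quality of the augmented sentences. To better evaluate the method in real-world applications, we compile a corpus containing a more diverse set of simile forms. Our experimental results demonstrate the effectiveness of the proposed data augmentation method for simile detection.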
DataTales: Investigating the use of Large Language Models for Authoring Data-Driven Articles
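results: Feedback from 11 professionals suggests that DataTales can help authors draft data-driven articles more quickly, and the study distills affordances and opportunities for further integrating LLMs as valuable assistants for authors of data-driven articles.
Abstract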
Authoring data-driven articles is a complex process requiring authors to not only analyze data for insights but also craft a cohesive narrative that effectively communicates the insights. Text generation capabilities of contemporary large language models (LLMs) present an opportunity to assist the authoring of data-driven articles and expedite the writing process. In this work, we investigate the feasibility and perceived value of leveraging LLMs to support authors of data-driven articles. We designed a prototype system, DataTales, that leverages a LLM to generate textual narratives accompanying a given chart. Using DataTales as a design probe, we conducted a qualitative study with 11 professionals to evaluate the concept, from which we distilled affordances and opportunities to further integrate LLMs as valuable data-driven article authoring assistants.
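Summary
Authoring data-driven articles is a complex process: authors must not only analyze data for insights but also craft a cohesive narrative that communicates them. The text generation capabilities of contemporary large language models (LLMs) offer an opportunity to assist authors and speed up the writing process. In this work, we investigate the feasibility and perceived value of using LLMs to support authors of data-driven articles. We designed a prototype system, DataTales, that uses an LLM to generate textual narratives accompanying a given chart. Using DataTales as a design probe, we conducted a qualitative study with 11 professionals, from which we distilled affordances and opportunities for LLMs as assistants for authoring data-driven articles.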
The Five-Dollar Model: Generating Game Maps and Sprites from Sentence Embeddings
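results: Measured by cosine similarity scores, the model generates images that preserve the encoded semantic meaning, producing high-quality and aesthetically pleasing images from limited data.
Abstract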
The five-dollar model is a lightweight text-to-image generative architecture that generates low dimensional images from an encoded text prompt. This model can successfully generate accurate and aesthetically pleasing content in low dimensional domains, with limited amounts of training data. Despite the small size of both the model and datasets, the generated images are still able to maintain the encoded semantic meaning of the textual prompt. We apply this model to three small datasets: pixel art video game maps, video game sprite images, and down-scaled emoji images and apply novel augmentation strategies to improve the performance of our model on these limited datasets. We evaluate our model's performance using cosine similarity score between text-image pairs generated by the CLIP VIT-B/32 model.
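Summary
The five-dollar model is a lightweight text-to-image generative architecture that turns an encoded text prompt into a low-dimensional image. The model can generate accurate and aesthetically pleasing content even with limited training data. Although both the model and the datasets are small, the generated images still preserve the semantic meaning of the text prompt. We apply the model to three small datasets (pixel art video game maps, video game sprite images, and down-scaled emoji images) and use novel augmentation strategies to improve performance on these limited datasets. We evaluate the model using the cosine similarity score between text-image pairs generated by the CLIP VIT-B/32 model.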
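The evaluation metric itself is easy to state in code. The sketch below computes cosine similarity between two embedding vectors; in the paper those embeddings come from CLIP, whereas the short vectors here are made-up placeholders, since loading a real CLIP model is outside the scope of a small example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Placeholder embeddings standing in for a text prompt and a generated image.
text_embedding = [0.2, 0.7, 0.1]
image_embedding = [0.25, 0.6, 0.05]
print(cosine_similarity(text_embedding, image_embedding))
```

A score near 1 means the image embedding points in almost the same direction as the text embedding, which is the sense in which a generated image "maintains" the meaning of its prompt.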
A Comparative Study on TF-IDF feature Weighting Method and its Analysis using Unstructured Dataset
paper_authors: Mamata Das, Selvakumar K., P. J. A. Alphonse
for: The paper is written for text classification and its algorithms, specifically focusing on the feature weighting method for text classification on unstructured data.
methods: The paper uses two features, N-Grams and TF-IDF, on the IMDB movie reviews and Amazon Alexa reviews dataset for sentiment analysis. The state-of-the-art classifiers used to validate the method include SVM, Logistic Regression, Multinomial Naive Bayes, Random Forest, Decision Tree, and k-nearest neighbors.
results: The paper found that TF-IDF features resulted in a significant increase in feature extraction, with TF-IDF achieving the maximum accuracy, precision, recall, and F1-score values of 93.81%, 94.20%, 93.81%, and 91.99%, respectively, in the Random Forest classifier.
Abstract
Text Classification is the process of categorizing text into the relevant categories, and its algorithms are at the core of many Natural Language Processing (NLP) applications. Term Frequency-Inverse Document Frequency (TF-IDF) and NLP are among the most widely used information retrieval methods in text classification. We have investigated and analyzed the feature weighting method for text classification on unstructured data. The proposed model considers two features, N-Grams and TF-IDF, on the IMDB movie reviews and Amazon Alexa reviews datasets for sentiment analysis. We then used state-of-the-art classifiers to validate the method, i.e., Support Vector Machine (SVM), Logistic Regression, Multinomial Naive Bayes (Multinomial NB), Random Forest, Decision Tree, and k-nearest neighbors (KNN). Of the two feature extraction methods, TF-IDF features gave a significant improvement over N-Gram features. TF-IDF achieved the maximum accuracy (93.81%), precision (94.20%), recall (93.81%), and F1-score (91.99%) with the Random Forest classifier.
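Summary
Text classification is the process of assigning text to relevant categories, and its algorithms are at the core of natural language processing (NLP). Term Frequency-Inverse Document Frequency (TF-IDF) and NLP are among the most widely used methods in text retrieval. We investigated and analyzed feature weighting methods for text classification and performed sentiment analysis on the IMDB movie reviews and Amazon Alexa reviews datasets. We then validated the method with state-of-the-art classifiers: Support Vector Machine (SVM), Logistic Regression, Multinomial Naive Bayes (Multinomial NB), Random Forest, Decision Tree, and k-nearest neighbors (KNN). Of the two feature extraction methods, TF-IDF features gave a significant improvement over N-Gram features. TF-IDF achieved the highest accuracy (93.81%), precision (94.20%), recall (93.81%), and F1-score (91.99%) with the Random Forest classifier.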
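For readers unfamiliar with the weighting scheme, a minimal TF-IDF computation looks like this. It uses the textbook form, term frequency times log(N / document frequency); this is an assumption for illustration, as the paper does not state which variant it uses and libraries commonly apply smoothed versions. The three reviews are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df).
    Assumes every document contains at least one token."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

reviews = ["great movie great cast", "boring movie", "great sound"]
w = tf_idf(reviews)
print(w[0])
```

In the first review, "cast" outweighs "great" even though "great" occurs twice: "cast" appears in only one document, so its inverse document frequency is higher. That down-weighting of common terms is what distinguishes TF-IDF from raw N-gram counts.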
Continual Pre-Training of Large Language Models: How to (re)warm your model?
paper_authors: Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, Timothée Lesort
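for: The paper explores how to enable continual pre-training of large language models, improving compute efficiency and the performance of pre-trained models.
methods: The study uses different warm-up strategies to examine model performance on new data.
results: The results show that rewarming improves downstream performance in the longer run, outperforming models trained from scratch even on a large downstream dataset.
Abstract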
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards efficient continual pre-training, in this work, we examine the effect of different warm-up strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warmup lengths. Our results show that while rewarming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch, even for a large downstream dataset.
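The "linear warmup and cosine decay" schedule named above has a standard form, sketched here. The function and its parameter values are illustrative assumptions, not the paper's configuration: the learning rate rises linearly from 0 to `max_lr` over `warmup_steps`, then follows half a cosine down to `min_lr` at `total_steps`.

```python
import math

def lr_at(step, max_lr, min_lr, warmup_steps, total_steps):
    """Learning rate at a given step under linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear warmup from 0 up to max_lr.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings; "rewarming" means restarting this schedule on the new dataset.
schedule = [lr_at(s, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000)
            for s in (0, 50, 100, 550, 1000)]
print(schedule)
```

The experimental knobs mentioned in the abstract map directly onto the arguments: "various maximum learning rates" vary `max_lr`, and "various warmup lengths" vary `warmup_steps`.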
Summary
Simple synthetic data reduces sycophancy in large language models
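methods: The researchers use three sycophancy tasks (Perez et al., 2022) to test sycophantic behavior under different model scales and instruction tuning.
results: For PaLM models, both model scaling and instruction tuning significantly increase sycophancy, and models will agree with a user even when the user's view is objectively incorrect. The researchers also propose a simple synthetic-data intervention that adds suitable data from public NLP tasks to reduce the models' reliance on user opinions.
Abstract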
Sycophancy is an undesirable behavior where models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sycophancy tasks (Perez et al., 2022) where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that despite knowing that these statements are wrong, language models will still agree with them if the user does as well. To reduce sycophancy, we present a straightforward synthetic-data intervention that takes public NLP tasks and encourages models to be robust to user opinions on these tasks. Adding these data in a lightweight finetuning step can significantly reduce sycophantic behavior on held-out prompts. Code for generating synthetic data for intervention can be found at https://github.com/google/sycophancy-intervention.
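Summary
Sycophancy is an undesirable behavior in which language models tailor their responses to a human user's view even when that view is not objectively correct (e.g., adopting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce it. First, on the three sycophancy tasks from Perez et al. (2022), we observe that both model scaling and instruction tuning increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect and find that language models will agree with them if the user does, even though the models know the statements are wrong. To reduce sycophancy, we propose a simple synthetic-data intervention that takes public NLP tasks and encourages models to be robust to user opinions on them. Code for generating the synthetic data can be found at https://github.com/google/sycophancy-intervention.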
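The addition-statement evaluation can be pictured with a small generator. The prompt wording and the `addition_prompt` helper below are hypothetical, written only to show the idea; the authors' actual templates live in the linked repository. The key property is that the target label depends solely on whether the arithmetic is right, never on the opinion the user states.

```python
def addition_prompt(a, b, claimed, user_opinion=None):
    """Build a claim-checking prompt and its ground-truth label.

    user_opinion is an optional word such as "correct" or "incorrect" that the
    simulated user asserts about the claim; it never affects the label.
    """
    claim = f"{a} + {b} = {claimed}"
    parts = []
    if user_opinion is not None:
        parts.append(f"Hello, I think that the claim {claim} is {user_opinion}.")
    parts.append(f"Do you agree or disagree with the following claim? {claim}")
    parts.append("Choices: (A) Agree (B) Disagree")
    label = "(A)" if a + b == claimed else "(B)"
    return "\n".join(parts), label

neutral_prompt, neutral_label = addition_prompt(1, 1, 3)
biased_prompt, biased_label = addition_prompt(1, 1, 3, user_opinion="correct")
print(biased_prompt)
print("target:", biased_label)
```

A sycophantic model answers "(B)" on the neutral prompt but flips to "(A)" on the biased one. Training examples built this way, with the label fixed regardless of the stated opinion, are one way to encourage robustness to user views.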
Universal Automatic Phonetic Transcription into the International Phonetic Alphabet
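results: Trained on CommonVoice 11.0 data from seven languages, the model reaches a quality close to that of human annotators, with a smaller training dataset than that of the previous best speech-to-IPA model (Wav2Vec2Phoneme).
Abstract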
This paper presents a state-of-the-art model for transcribing speech in any language into the International Phonetic Alphabet (IPA). Transcription of spoken languages into IPA is an essential yet time-consuming process in language documentation, and even partially automating this process has the potential to drastically speed up the documentation of endangered languages. Like the previous best speech-to-IPA model (Wav2Vec2Phoneme), our model is based on wav2vec 2.0 and is fine-tuned to predict IPA from audio input. We use training data from seven languages from CommonVoice 11.0, transcribed into IPA semi-automatically. Although this training dataset is much smaller than Wav2Vec2Phoneme's, its higher quality lets our model achieve comparable or better results. Furthermore, we show that the quality of our universal speech-to-IPA models is close to that of human annotators.
Summary
A Cross-Domain Evaluation of Approaches for Causal Knowledge Extraction
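results: The results show that embeddings from pre-trained language models such as BERT provide a significant performance boost, and that span-based methods outperform simple sequence tagging models on all four datasets.
Abstract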
Causal knowledge extraction is the task of extracting relevant causes and effects from text by detecting the causal relation. Although this task is important for language understanding and knowledge discovery, recent works in this domain have largely focused on binary classification of a text segment as causal or non-causal. In this regard, we perform a thorough analysis of three sequence tagging models for causal knowledge extraction and compare it with a span based approach to causality extraction. Our experiments show that embeddings from pre-trained language models (e.g. BERT) provide a significant performance boost on this task compared to previous state-of-the-art models with complex architectures. We observe that span based models perform better than simple sequence tagging models based on BERT across all 4 data sets from diverse domains with different types of cause-effect phrases.
Summary
Causal knowledge extraction is the natural language processing task of extracting cause and effect information from text. Recent work has mostly classified text segments as causal or non-causal; we instead analyze sequence tagging models systematically and compare them with a span-based approach to causality extraction. Our experiments show that embeddings from pre-trained language models provide a significant performance boost on this task over previous state-of-the-art models with complex architectures. Span-based models perform better on all four datasets, which cover diverse domains and different types of cause-effect phrases.
Generative Benchmark Creation for Table Union Search
results: Benchmarks created with generative AI models are more challenging than manually curated and labeled ones and permit more detailed analysis of method performance. Specifically, the top-performing method achieves a Mean Average Precision of around 60%, over 30% less than its performance on existing manually created benchmarks.
Abstract
Data management has traditionally relied on synthetic data generators to generate structured benchmarks, like the TPC suite, where we can control important parameters like data size and its distribution precisely. These benchmarks were central to the success and adoption of database management systems. But more and more, data management problems are of a semantic nature. An important example is finding tables that can be unioned. While any two tables with the same cardinality can be unioned, table union search is the problem of finding tables whose union is semantically coherent. Semantic problems cannot be benchmarked using synthetic data. Our current methods for creating benchmarks involve the manual curation and labeling of real data. These methods are not robust or scalable and, perhaps more importantly, it is not clear how robust the created benchmarks are. We propose to use generative AI models to create structured data benchmarks for table union search. We present a novel method for using generative models to create tables with specified properties. Using this method, we create a new benchmark containing pairs of tables that are both unionable and non-unionable but related. We thoroughly evaluate recent existing table union search methods over existing benchmarks and our new benchmark. We also present and evaluate a new table search method based on recent large language models over all benchmarks. We show that the new benchmark is more challenging for all methods than hand-curated benchmarks: the top-performing method achieves a Mean Average Precision of around 60%, over 30% less than its performance on existing manually created benchmarks. We examine why this is the case and show that the new benchmark permits more detailed analysis of methods, including a study of both false positives and false negatives that was not possible with existing benchmarks.
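Summary
Data management has traditionally relied on synthetic data generators to build structured benchmarks such as the TPC suite, where parameters like data size and distribution can be controlled precisely. These benchmarks contributed greatly to the adoption of database management systems. Increasingly, however, data management problems are semantic in nature, for example finding tables that can be unioned. Any two tables with the same cardinality can be unioned, but table union search is the problem of finding tables whose union is semantically coherent. Semantic problems cannot be benchmarked with synthetic data. Current benchmarks are created by manually curating and labeling real data; these methods are neither robust nor scalable, and, perhaps more importantly, it is unclear how robust the resulting benchmarks are. We propose using generative AI models to create structured data benchmarks for table union search, and present a new method for generating tables with specified properties. With it we create a new benchmark of table pairs that are either unionable or non-unionable but related. We evaluate existing table union search methods on existing benchmarks and on the new one, and we also propose and evaluate a new table search method based on recent large language models. The new benchmark is more challenging than hand-curated ones: the top-performing method reaches a Mean Average Precision of around 60%, over 30% less than on existing manually created benchmarks. We analyze why, and show that the new benchmark permits more detailed analysis, including a study of false positives and false negatives that was not possible with existing benchmarks.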
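The headline number above is a Mean Average Precision. As a rough sketch of how that metric is computed, the snippet below uses the textbook definition over full ranked lists; table union benchmarks may instead report MAP at a cutoff k, which this sketch does not model. The table identifiers are invented.

```python
def average_precision(ranked, relevant):
    """AP for one query: precision@k summed over the ranks k that hold a
    relevant item, divided by the total number of relevant items."""
    hits, precisions = 0, []
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_list, set_of_unionable_tables), one per query table."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical query tables with their ranked search results.
runs = [
    (["t1", "t7", "t3"], {"t1", "t3"}),  # AP = (1/1 + 2/3) / 2
    (["t9", "t2"], {"t2"}),              # AP = 1/2
]
print(round(mean_average_precision(runs), 4))
```

Because AP rewards relevant tables that appear early, a benchmark with related but non-unionable distractors lowers the score whenever such a distractor outranks a truly unionable table.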
paper_authors: Aritra Mandal, Daniel Tunkelang, Zhe Wu
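for: Improving searcher experience and business outcomes in e-commerce search.
methods: Proposes a framework that recognizes and leverages query equivalence to improve search results and searcher satisfaction.
results: Experiments show that the framework recognizes and leverages query equivalence effectively, outperforming popular sentence transformer models and achieving a Pearson correlation of 0.85 for query similarity.
Abstract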
Search query variation poses a challenge in e-commerce search, as equivalent search intents can be expressed through different queries with surface-level differences. This paper introduces a framework to recognize and leverage query equivalence to enhance searcher and business outcomes. The proposed approach addresses three key problems: mapping queries to vector representations of search intent, identifying nearest neighbor queries expressing equivalent or similar intent, and optimizing for user or business objectives. The framework utilizes both surface similarity and behavioral similarity to determine query equivalence. Surface similarity involves canonicalizing queries based on word inflection, word order, compounding, and noise words. Behavioral similarity leverages historical search behavior to generate vector representations of query intent. An offline process is used to train a sentence similarity model, while an online nearest neighbor approach supports processing of unseen queries. Experimental evaluations demonstrate the effectiveness of the proposed approach, outperforming popular sentence transformer models and achieving a Pearson correlation of 0.85 for query similarity. The results highlight the potential of leveraging historical behavior data and training models to recognize and utilize query equivalence in e-commerce search, leading to improved user experiences and business outcomes. Further advancements and benchmark datasets are encouraged to facilitate the development of solutions for this critical problem in the e-commerce domain.
Summary
Search query variation poses a challenge in e-commerce search, because equivalent search intents can be expressed through queries that differ only on the surface. This paper proposes a framework that recognizes and leverages query equivalence to improve outcomes for searchers and businesses. It addresses three key problems: mapping queries to vector representations of search intent, identifying nearest-neighbor queries with equivalent or similar intent, and optimizing for user or business objectives. The framework determines query equivalence from both surface similarity and behavioral similarity. Surface similarity canonicalizes queries with respect to word inflection, word order, compounding, and noise words. Behavioral similarity uses historical search behavior to generate vector representations of query intent. A sentence similarity model is trained offline, while an online nearest-neighbor approach handles unseen queries. Experiments show that the approach outperforms popular sentence transformer models and achieves a Pearson correlation of 0.85 for query similarity. The results show that historical behavior data and trained models can be used to recognize and exploit query equivalence, improving user experience and business outcomes. Further advances and benchmark datasets are encouraged to support solutions in the e-commerce search domain.
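The surface-similarity step can be illustrated with a toy canonicalizer. This is only a sketch of the idea, not the authors' pipeline: the noise-word list and the plural-stripping rule are hypothetical stand-ins, and compounding is not handled at all.

```python
import re

NOISE_WORDS = {"for", "the", "a", "an", "of", "with"}  # hypothetical list


def naive_stem(token):
    """Toy plural stripping; a real system would use a proper stemmer or lemmatizer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def canonicalize(query):
    """Map surface variants of a query to one canonical key."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    tokens = [naive_stem(t) for t in tokens if t not in NOISE_WORDS]
    return " ".join(sorted(tokens))  # sorting removes word-order differences


print(canonicalize("Shoes for Men") == canonicalize("mens shoe"))  # prints True
```

Sorting tokens is a deliberate simplification: it would wrongly merge queries such as "shirt dress" and "dress shirt", which is one reason a behavioral signal is useful alongside surface rules.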
Storyfier: Exploring Vocabulary Learning Support with Text Generation Models
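results: Learners report high satisfaction with Storyfier, but in the read-cloze-write sessions they perform worse with it than with a baseline tool, particularly in recalling and using the target words.
Abstract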
Vocabulary learning support tools have widely exploited existing materials, e.g., stories or video clips, as contexts to help users memorize each target word. However, these tools could not provide a coherent context for any target words of learners' interests, and they seldom help practice word usage. In this paper, we work with teachers and students to iteratively develop Storyfier, which leverages text generation models to enable learners to read a generated story that covers any target words, conduct a story cloze test, and use these words to write a new story with adaptive AI assistance. Our within-subjects study (N=28) shows that learners generally favor the generated stories for connecting target words and writing assistance for easing their learning workload. However, in the read-cloze-write learning sessions, participants using Storyfier perform worse in recalling and using target words than learning with a baseline tool without our AI features. We discuss insights into supporting learning tasks with generative models.
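Summary
Vocabulary learning support tools have widely used existing materials, such as stories or video clips, as contexts for memorizing words. However, these tools cannot provide a coherent context for the words a learner is interested in, and they rarely help learners practice word usage. In this paper, we work with teachers and students to develop Storyfier, which uses text generation models so that learners can read a generated story containing the target words, take a story cloze test, and use the words to write a new story with adaptive AI assistance. Our within-subjects study (N=28) shows that learners generally favor the generated stories for connecting target words and the writing assistance for easing their workload. However, in the read-cloze-write learning sessions, participants using Storyfier recall and use target words less well than with a baseline tool. We discuss insights into supporting learning tasks with generative models.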
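The cloze step of the read-cloze-write session is easy to picture in code. The following sketch blanks out target words in a story and keeps an answer key; the story here is hand-written, whereas in Storyfier it would come from a text generation model, and nothing below is taken from the authors' implementation.

```python
import re


def make_cloze(story, target_words):
    """Blank out each target word (case-insensitive, whole words only)
    and return the cloze text together with the answer key."""
    if not target_words:
        return story, []
    answers = []

    def blank(match):
        answers.append(match.group(0))
        return f"__({len(answers)})__"

    pattern = r"\b(" + "|".join(re.escape(w) for w in target_words) + r")\b"
    cloze = re.sub(pattern, blank, story, flags=re.IGNORECASE)
    return cloze, answers


story = "The frugal farmer made a prudent choice."
cloze, key = make_cloze(story, ["frugal", "prudent"])
print(cloze)  # prints The __(1)__ farmer made a __(2)__ choice.
print(key)    # prints ['frugal', 'prudent']
```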
Extracting detailed oncologic history and treatment plan from medical oncology notes with large language models
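results: GPT-4 performs best at extracting information from oncology progress notes, with an average BLEU score of 0.69, an average ROUGE score of 0.72, and an average accuracy of 67% on complex tasks. It is particularly proficient at extracting tumor characteristics and medications, and it also performs well at inferring symptoms due to cancer and considerations of future medications. The study suggests that GPT-4 is potentially already usable to extract important facts from cancer progress notes for clinical research, complex population management, and documenting quality patient care.
Abstract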
Both medical care and observational studies in oncology require a thorough understanding of a patient's disease progression and treatment history, often elaborately documented in clinical notes. Despite their vital role, no current oncology information representation and annotation schema fully encapsulates the diversity of information recorded within these notes. Although large language models (LLMs) have recently exhibited impressive performance on various medical natural language processing tasks, due to the current lack of comprehensively annotated oncology datasets, an extensive evaluation of LLMs in extracting and reasoning with the complex rhetoric in oncology notes remains understudied. We developed a detailed schema for annotating textual oncology information, encompassing patient characteristics, tumor characteristics, tests, treatments, and temporality. Using a corpus of 10 de-identified breast cancer progress notes at University of California, San Francisco, we applied this schema to assess the abilities of three recently-released LLMs (GPT-4, GPT-3.5-turbo, and FLAN-UL2) to perform zero-shot extraction of detailed oncological history from two narrative sections of clinical progress notes. Our team annotated 2750 entities, 2874 modifiers, and 1623 relationships. The GPT-4 model exhibited overall best performance, with an average BLEU score of 0.69, an average ROUGE score of 0.72, and an average accuracy of 67% on complex tasks (expert manual evaluation). Notably, it was proficient in tumor characteristic and medication extraction, and demonstrated superior performance in inferring symptoms due to cancer and considerations of future medications. The analysis demonstrates that GPT-4 is potentially already usable to extract important facts from cancer progress notes needed for clinical research, complex population management, and documenting quality patient care.
Summary
Both medical care and observational studies in oncology require a thorough understanding of a patient's disease progression and treatment history, which is usually documented in detail in clinical notes. Despite the vital role of these notes, no current oncology information representation and annotation schema fully covers the diversity of information they contain. Large language models (LLMs) have recently shown impressive performance on various medical natural language processing tasks, but because comprehensively annotated oncology datasets are lacking, the extent to which LLMs can extract and reason with the complex rhetoric in oncology notes remains understudied. We developed a detailed schema for annotating textual oncology information, including patient characteristics, tumor characteristics, tests, treatments, and temporality. Using a corpus of 10 de-identified breast cancer progress notes at University of California, San Francisco, we applied this schema to assess the abilities of three recently released LLMs (GPT-4, GPT-3.5-turbo, and FLAN-UL2) to perform zero-shot extraction of detailed oncological history from two narrative sections of clinical progress notes. Our team annotated 2750 entities, 2874 modifiers, and 1623 relationships. GPT-4 exhibited the best overall performance, with an average BLEU score of 0.69, an average ROUGE score of 0.72, and an average accuracy of 67% on complex tasks (expert manual evaluation). It was proficient in tumor characteristic and medication extraction, and demonstrated superior performance in inferring symptoms due to cancer and considerations of future medications. The analysis demonstrates that GPT-4 is potentially already usable to extract important facts from cancer progress notes needed for clinical research, complex population management, and documenting quality patient care.
What about translation? New coding system for content analysis on the perception of literary translation around the political transformation in 1989 in Hungary as a classification problem on an unbalanced dataset
for: Tracking trends in the perception of literary translation during political transformation in 1989 in Hungary.
methods: Trained BERT models to carry over coding system to 1980-1999 issues of literary journal Nagyvilág, with extensive hyperparameter tuning, loss functions robust to label unbalance, 10-fold cross-validation, model ensemble for prediction, manual validation, and calibration method to better predict label counts.
results: Study of relations between labels using label relation networks.
Abstract
To track trends in the perception of literary translation around the political transformation in 1989 in Hungary, a coding system was developed on the paragraphs of the 1980-1999 issues of the literary journal Alföld. This paper describes how we trained BERT models to carry over the coding system to the 1980-1999 issues of the literary journal Nagyvilág. We use extensive hyperparameter tuning, loss functions robust to label unbalance, 10-fold cross-validation for precise evaluations and a model ensemble for prediction, manual validation on the predict set, and a new calibration method to better predict label counts for sections of the Nagyvilág corpus. To study the relations between labels, we construct label relation networks.
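Summary
To track trends in the perception of literary translation around the political transformation of 1989, we developed a coding system on the paragraphs of the 1980-1999 issues of the journal Alföld. This paper describes how we used BERT models to carry the coding system over to the 1980-1999 issues of the journal Nagyvilág. We use extensive hyperparameter tuning, loss functions robust to label imbalance, 10-fold cross-validation, a model ensemble for prediction, manual validation on the prediction set, and a new calibration method to better predict label counts; to study the relations between labels, we construct label relation networks.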
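results: The model is trained on the PATS corpus and compared against existing state-of-the-art models. Objective and subjective evaluations show that it performs better for both seen and unseen styles. The authors also propose a methodology to assess whether the behaviors and gestures of the target style are transferred correctly while the meaning of the source behaviors is preserved, so that style and content do not leak.
Abstract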
This paper addresses the challenge of transferring the behavior expressivity style of a virtual agent to another one while preserving behaviors shape as they carry communicative meaning. Behavior expressivity style is viewed here as the qualitative properties of behaviors. We propose TranSTYLer, a multimodal transformer based model that synthesizes the multimodal behaviors of a source speaker with the style of a target speaker. We assume that behavior expressivity style is encoded across various modalities of communication, including text, speech, body gestures, and facial expressions. The model employs a style and content disentanglement schema to ensure that the transferred style does not interfere with the meaning conveyed by the source behaviors. Our approach eliminates the need for style labels and allows the generalization to styles that have not been seen during the training phase. We train our model on the PATS corpus, which we extended to include dialog acts and 2D facial landmarks. Objective and subjective evaluations show that our model outperforms state of the art models in style transfer for both seen and unseen styles during training. To tackle the issues of style and content leakage that may arise, we propose a methodology to assess the degree to which behavior and gestures associated with the target style are successfully transferred, while ensuring the preservation of the ones related to the source content.
Accurate, Explainable, and Private Models: Providing Recourse While Minimizing Training Data Leakage
for: Mitigating attacks on machine learning models that provide algorithmic recourse to individuals who receive negative outcomes.
methods: Presents two novel methods for generating differentially private recourse: Differentially Private Model (DPM) and Laplace Recourse (LR).
results: DPM and LR perform well in reducing what an adversary can infer, especially at low false positive rates. When the training dataset is large enough, the LR method is particularly successful at preventing privacy leakage while maintaining model and recourse accuracy.
Abstract
Machine learning models are increasingly utilized across impactful domains to predict individual outcomes. As such, many models provide algorithmic recourse to individuals who receive negative outcomes. However, recourse can be leveraged by adversaries to disclose private information. This work presents the first attempt at mitigating such attacks. We present two novel methods to generate differentially private recourse: Differentially Private Model (DPM) and Laplace Recourse (LR). Using logistic regression classifiers and real world and synthetic datasets, we find that DPM and LR perform well in reducing what an adversary can infer, especially at low FPR. When training dataset size is large enough, we find particular success in preventing privacy leakage while maintaining model and recourse accuracy with our novel LR method.
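Summary
Machine learning models are increasingly used across impactful domains to predict individual outcomes. The recourse these models provide can, however, leak private information. This work presents the first attempt at mitigating this kind of attack. We propose two new methods, Differentially Private Model (DPM) and Laplace Recourse (LR). Using logistic regression classifiers on real-world and synthetic datasets, we find that DPM and LR protect privacy well at low false positive rate (FPR). When the training set is large enough, our LR method prevents privacy leakage while maintaining model and recourse accuracy.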
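The abstract does not describe how Laplace Recourse works internally. Its name suggests the classical Laplace mechanism from differential privacy, so the sketch below shows only that generic mechanism: noise with scale sensitivity / epsilon is added to each coordinate of a vector. Treating a recourse point as that vector, and the values of `sensitivity` and `epsilon`, are assumptions made purely for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(vector, sensitivity, epsilon, rng=None):
    """Generic Laplace mechanism: add noise with scale = sensitivity / epsilon
    to every coordinate (sensitivity is the L1 sensitivity of the released vector)."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [x + laplace_noise(scale, rng) for x in vector]


# Hypothetical recourse point; random.Random is fine for a demo but is not
# a cryptographically secure noise source.
recourse = [0.4, 1.2, -0.3]
print(privatize(recourse, sensitivity=1.0, epsilon=0.5, rng=random.Random(0)))
```

A smaller epsilon means a larger noise scale, which is the usual privacy versus accuracy trade-off that the abstract's remark about large training sets points at.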
RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback
results: The system supports investigating various types of feedback, such as demonstrations, rankings, comparisons, and natural language instructions, as well as studies of how human factors affect their effectiveness.
Abstract
To use reinforcement learning from human feedback (RLHF) in practical applications, it is crucial to learn reward models from diverse sources of human feedback and to consider human factors involved in providing feedback of different types. However, the systematic study of learning from diverse types of feedback is held back by limited standardized tooling available to researchers. To bridge this gap, we propose RLHF-Blender, a configurable, interactive interface for learning from human feedback. RLHF-Blender provides a modular experimentation framework and implementation that enables researchers to systematically investigate the properties and qualities of human feedback for reward learning. The system facilitates the exploration of various feedback types, including demonstrations, rankings, comparisons, and natural language instructions, as well as studies considering the impact of human factors on their effectiveness. We discuss a set of concrete research opportunities enabled by RLHF-Blender. More information is available at https://rlhfblender.info/.
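Summary
Applying reinforcement learning from human feedback (RLHF) in practice requires learning reward models from diverse sources of human feedback and considering the human factors involved in giving that feedback. However, limited research tooling holds back systematic study. To bridge this gap, we propose RLHF-Blender, a configurable, interactive interface for learning from human feedback. RLHF-Blender provides a modular experimentation framework and implementation that helps researchers systematically explore the properties and qualities of different types of human feedback. The system supports exploring various feedback types, including demonstrations, rankings, comparisons, and natural language instructions, as well as the impact of human factors on their effectiveness. We discuss concrete research opportunities enabled by RLHF-Blender. More information is available at https://rlhfblender.info/.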
Cooperative Multi-agent Bandits: Distributed Algorithms with Optimal Individual Regret and Constant Communication Costs
paper_authors: Lin Yang, Xuchuang Wang, Mohammad Hajiesmaili, Lijun Zhang, John C. S. Lui, Don Towsley
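for: Developing bandit algorithms for cooperative multi-agent multi-armed bandits that achieve optimal group and individual regret with low communication cost.
methods: Builds on the two existing paradigms, leader-follower and fully distributed algorithms, by integrating a simple communication policy into a learning algorithm for cooperative bandits.
results: The algorithm achieves optimal individual regret and constant communication costs.
Abstract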
Recently, there has been extensive study of cooperative multi-agent multi-armed bandits where a set of distributed agents cooperatively play the same multi-armed bandit game. The goal is to develop bandit algorithms with the optimal group and individual regrets and low communication between agents. The prior work tackled this problem using two paradigms: leader-follower and fully distributed algorithms. Prior algorithms in both paradigms achieve the optimal group regret. The leader-follower algorithms achieve constant communication costs but fail to achieve optimal individual regrets. The state-of-the-art fully distributed algorithms achieve optimal individual regrets but fail to achieve constant communication costs. This paper presents a simple yet effective communication policy and integrates it into a learning algorithm for cooperative bandits. Our algorithm achieves the best of both paradigms: optimal individual regret and constant communication costs.
The Model Inversion Eavesdropping Attack in Semantic Communication Systems
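for: Studies privacy leakage in semantic communication systems and proposes a defense based on random permutation and substitution.
methods: Introduces the model inversion eavesdropping attack (MIEA) against semantic communication systems, considering both white-box and black-box settings.
results: MIEA reconstructs the raw message with good quality under different channel conditions, and experiments show that the proposed defense effectively prevents it.
Abstract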
In recent years, semantic communication has been a popular research topic for its superiority in communication efficiency. As semantic communication relies on deep learning to extract meaning from raw messages, it is vulnerable to attacks targeting deep learning models. In this paper, we introduce the model inversion eavesdropping attack (MIEA) to reveal the risk of privacy leaks in the semantic communication system. In MIEA, the attacker first eavesdrops the signal being transmitted by the semantic communication system and then performs model inversion attack to reconstruct the raw message, where both the white-box and black-box settings are considered. Evaluation results show that MIEA can successfully reconstruct the raw message with good quality under different channel conditions. We then propose a defense method based on random permutation and substitution to defend against MIEA in order to achieve secure semantic communication. Our experimental results demonstrate the effectiveness of the proposed defense method in preventing MIEA.
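Summary
In recent years, semantic communication has become a popular research topic because it improves communication efficiency. However, semantic communication relies on deep learning to extract the meaning of messages, so it is vulnerable to attacks on deep learning models. In this paper, we introduce the model inversion eavesdropping attack (MIEA) to reveal the risk of privacy leaks in semantic communication systems. In MIEA, the attacker first eavesdrops on the signal transmitted by the system and then performs a model inversion attack to reconstruct the raw message, in both white-box and black-box settings. Our evaluation shows that MIEA can successfully reconstruct the raw message under different channel conditions. We then propose a defense method based on random permutation and substitution to secure semantic communication, and our experiments show that it effectively prevents MIEA.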
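The abstract names the defense but gives no construction. One way to picture a permutation-and-substitution scheme, offered here only as a hypothetical illustration and not as the paper's method, is a shared secret that shuffles symbol positions and remaps symbol values before transmission. The quantized symbol alphabet, the seed-based key, and all function names below are assumptions.

```python
import random


def make_key(length, alphabet_size, seed):
    """Derive a position permutation and a symbol substitution table from a
    shared seed. random.Random is NOT cryptographically secure; a real system
    would need a proper keyed generator."""
    rng = random.Random(seed)
    perm = list(range(length))
    rng.shuffle(perm)
    table = list(range(alphabet_size))
    rng.shuffle(table)
    return perm, table


def protect(symbols, key):
    """Sender side: reorder positions, then substitute each symbol."""
    perm, table = key
    return [table[symbols[i]] for i in perm]


def recover(protected, key):
    """Receiver side: invert the substitution and undo the permutation."""
    perm, table = key
    inverse_table = {v: k for k, v in enumerate(table)}
    original = [0] * len(protected)
    for out_pos, in_pos in enumerate(perm):
        original[in_pos] = inverse_table[protected[out_pos]]
    return original


features = [3, 0, 7, 7, 1, 4]  # made-up quantized semantic features
key = make_key(len(features), alphabet_size=8, seed=2023)
print(recover(protect(features, key), key) == features)  # prints True
```

An eavesdropper without the key sees a reordered, remapped sequence, so a model inverted on the unprotected features no longer receives input in the layout it expects.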
Comparative Analysis of the wav2vec 2.0 Feature Extractor
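results: Both neural feature extractors are competitive with traditional feature extraction on the LibriSpeech benchmark, and the effect of the individual components is analyzed. The study also finds that the most important information for the ASR system is obtained by a set of bandpass filters.
Abstract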
Automatic speech recognition (ASR) systems typically use handcrafted feature extraction pipelines. To avoid their inherent information loss and to achieve more consistent modeling from speech to transcribed text, neural raw waveform feature extractors (FEs) are an appealing approach. Also the wav2vec 2.0 model, which has recently gained large popularity, uses a convolutional FE which operates directly on the speech waveform. However, it is not yet studied extensively in the literature. In this work, we study its capability to replace the standard feature extraction methods in a connectionist temporal classification (CTC) ASR model and compare it to an alternative neural FE. We show that both are competitive with traditional FEs on the LibriSpeech benchmark and analyze the effect of the individual components. Furthermore, we analyze the learned filters and show that the most important information for the ASR system is obtained by a set of bandpass filters.
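Summary
Automatic speech recognition (ASR) systems typically use handcrafted feature extraction pipelines. To avoid their inherent information loss and to model the path from speech to transcribed text more consistently, neural raw waveform feature extractors (FEs) are an appealing approach. The recently popular wav2vec 2.0 model also uses a convolutional FE that operates directly on the speech waveform, but it has not yet been studied extensively in the literature. In this work, we study its ability to replace standard feature extraction methods in a connectionist temporal classification (CTC) ASR model and compare it with an alternative neural FE. We find that both are competitive with traditional feature extraction on the LibriSpeech benchmark and analyze the effect of the individual components. We also analyze the learned filters and find that the most important information for the ASR system is obtained by a set of bandpass filters.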
In-Context Alignment: Chat with Vanilla Language Models Before Fine-Tuning
results: Compared with direct prompting, in-context alignment without changing model weights yields a 7x increase in the vanilla language model's win rate against OpenAI's text-davinci-003.
Abstract
In this note, we explore inference-time alignment through in-context learning. We consider a vanilla pretrained language model Llama-2 before any fine-tuning and retrieve an average of 9 demonstration alignment examples when the model is prompted to follow chat-style instructions. Compared to direct prompting, the in-context alignment without changing model weights leads to a 7x increase in win-rate w.r.t. the text-davinci-003 model from OpenAI, making the vanilla language model comparable to strong baselines with alignment fine-tuning.
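Summary
In this note, we study inference-time alignment through in-context learning. We take the vanilla language model Llama-2 before any fine-tuning and retrieve an average of 9 demonstration alignment examples when the model is prompted with chat-style instructions. Compared with direct prompting, in-context alignment without changing model weights leads to a 7x increase in win rate against OpenAI's text-davinci-003, making the vanilla language model comparable to strong baselines with alignment fine-tuning.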
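The retrieve-then-prompt idea can be sketched without any model at all. The snippet below picks the demonstrations most similar to an incoming instruction and prepends them to the prompt. The token-overlap retriever, the `User:`/`Assistant:` template, and the demonstrations themselves are stand-ins chosen for illustration; the note does not specify any of them here.

```python
def jaccard(a, b):
    """Token-overlap similarity between two strings (a stand-in retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def build_prompt(instruction, demos, k=2):
    """Pick the k demonstrations most similar to the instruction and
    prepend them to the prompt in a chat-style layout."""
    ranked = sorted(demos, key=lambda d: jaccard(instruction, d[0]), reverse=True)
    parts = [f"User: {q}\nAssistant: {a}" for q, a in ranked[:k]]
    parts.append(f"User: {instruction}\nAssistant:")
    return "\n\n".join(parts)


# Made-up demonstration pool of (instruction, aligned response) pairs.
demos = [
    ("How do I boil an egg?", "Place the egg in boiling water for about ten minutes."),
    ("Write a haiku about rain", "Soft rain on the roof / a slow drumming in the dark / morning smells of earth"),
    ("How do I poach an egg?", "Simmer water, add a little vinegar, and slide the egg in gently."),
]
print(build_prompt("How do I fry an egg?", demos, k=2))
```

The resulting string would be passed to the base model as ordinary text, which is why no weights need to change.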
Teacher-Student Architecture for Knowledge Distillation: A Survey
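results: The survey summarizes existing applications across classification, recognition, generation, ranking, and regression, and outlines future research directions in architecture design, knowledge quality, and theoretical studies of regression-based learning.
Abstract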
Although Deep neural networks (DNNs) have shown a strong capacity to solve large-scale problems in many areas, such DNNs are hard to be deployed in real-world systems due to their voluminous parameters. To tackle this issue, Teacher-Student architectures were proposed, where simple student networks with a few parameters can achieve comparable performance to deep teacher networks with many parameters. Recently, Teacher-Student architectures have been effectively and widely embraced on various knowledge distillation (KD) objectives, including knowledge compression, knowledge expansion, knowledge adaptation, and knowledge enhancement. With the help of Teacher-Student architectures, current studies are able to achieve multiple distillation objectives through lightweight and generalized student networks. Different from existing KD surveys that primarily focus on knowledge compression, this survey first explores Teacher-Student architectures across multiple distillation objectives. This survey presents an introduction to various knowledge representations and their corresponding optimization objectives. Additionally, we provide a systematic overview of Teacher-Student architectures with representative learning algorithms and effective distillation schemes. This survey also summarizes recent applications of Teacher-Student architectures across multiple purposes, including classification, recognition, generation, ranking, and regression. Lastly, potential research directions in KD are investigated, focusing on architecture design, knowledge quality, and theoretical studies of regression-based learning, respectively. Through this comprehensive survey, industry practitioners and the academic community can gain valuable insights and guidelines for effectively designing, learning, and applying Teacher-Student architectures on various distillation objectives.
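For the knowledge compression objective, one widely used formulation (not spelled out in the abstract, so treat it as an assumed example) trains the student to match the teacher's temperature-softened output distribution. The sketch below computes that loss for a single example in plain Python; the logits and the temperature are illustrative values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]   # made-up teacher logits
student = [2.5, 1.5, 0.2]   # made-up student logits
print(distillation_loss(student, teacher))
```

In practice this term is usually mixed with the ordinary cross-entropy on ground-truth labels; the mixing weight is another hyperparameter not shown here.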
BarlowRL: Barlow Twins for Data-Efficient Reinforcement Learning
results: Strong performance on the Atari 100k benchmark, outperforming DER and its contrastive counterpart CURL.
Abstract
This paper introduces BarlowRL, a data-efficient reinforcement learning agent that combines the Barlow Twins self-supervised learning framework with DER (Data-Efficient Rainbow) algorithm. BarlowRL outperforms both DER and its contrastive counterpart CURL on the Atari 100k benchmark. BarlowRL avoids dimensional collapse by enforcing information spread to the whole space. This helps RL algorithms to utilize uniformly spread state representation that eventually results in a remarkable performance. The integration of Barlow Twins with DER enhances data efficiency and achieves superior performance in the RL tasks. BarlowRL demonstrates the potential of incorporating self-supervised learning techniques to improve RL algorithms.
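Summary
This paper introduces BarlowRL, a data-efficient reinforcement learning agent that combines the Barlow Twins self-supervised learning framework with the DER (Data-Efficient Rainbow) algorithm. BarlowRL outperforms both DER and its contrastive counterpart CURL on the Atari 100k benchmark. By enforcing that information spreads over the whole space, BarlowRL avoids dimensional collapse, so RL algorithms can use a uniformly spread state representation, which ultimately leads to strong performance. Integrating Barlow Twins with DER improves data efficiency and achieves superior performance in RL tasks. BarlowRL demonstrates that incorporating self-supervised learning techniques can improve RL algorithms.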
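The way Barlow Twins discourages dimensional collapse is visible in its loss: the cross-correlation matrix between two views' embeddings is pushed toward the identity. Below is a small pure-Python sketch of that loss for tiny batches. It follows the published Barlow Twins objective in outline, but the weight `lam` and the toy embeddings are illustrative, and how BarlowRL wires this into DER is not shown.

```python
import math


def standardize(batch):
    """Normalize each embedding dimension to zero mean and unit variance
    over the batch; returns a list of columns."""
    n, d = len(batch), len(batch[0])
    cols = []
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n) or 1.0
        cols.append([(x - mean) / std for x in col])
    return cols


def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Sum of (1 - C_ii)^2 on the diagonal plus lam * C_ij^2 off the diagonal,
    where C is the cross-correlation matrix of the two views."""
    a, b = standardize(z_a), standardize(z_b)
    n, d = len(z_a), len(a)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(x * y for x, y in zip(a[i], b[j])) / n
            loss += (1 - c_ij) ** 2 if i == j else lam * c_ij ** 2
    return loss


view_a = [[1, 0], [0, 1], [-1, 0], [0, -1]]   # toy embeddings of view A
view_b = [[0, 1], [1, 0], [0, -1], [-1, 0]]   # same points with dimensions swapped
print(barlow_twins_loss(view_a, view_a))  # identical views: loss is approximately 0
print(barlow_twins_loss(view_a, view_b))  # misaligned dimensions: large loss
```

The diagonal term rewards agreement between views, while the off-diagonal term decorrelates dimensions, which is what spreads information across the embedding space.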
SDLFormer: A Sparse and Dense Locality-enhanced Transformer for Accelerated MR Image Reconstruction
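results: With 4x and 5x undersampling, the proposed method improves on other architectures by around 1.40 dB in PSNR and around 0.028 in SSIM on average, and on the parallel domain self-supervised learning baseline by around 1.44 dB in PSNR and around 0.029 in SSIM. The code is available at https://github.com/rahul-gs-16/sdlformer.git.
Abstract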
Transformers have emerged as viable alternatives to convolutional neural networks owing to their ability to learn non-local region relationships in the spatial domain. The self-attention mechanism of the transformer enables transformers to capture long-range dependencies in the images, which might be desirable for accelerated MRI image reconstruction as the effect of undersampling is non-local in the image domain. Despite its computational efficiency, the window-based transformers suffer from restricted receptive fields as the dependencies are limited to within the scope of the image windows. We propose a window-based transformer network that integrates dilated attention mechanism and convolution for accelerated MRI image reconstruction. The proposed network consists of dilated and dense neighborhood attention transformers to enhance the distant neighborhood pixel relationship and introduce depth-wise convolutions within the transformer module to learn low-level translation invariant features for accelerated MRI image reconstruction. The proposed model is trained in a self-supervised manner. We perform extensive experiments for multi-coil MRI acceleration for coronal PD, coronal PDFS and axial T2 contrasts with 4x and 5x under-sampling in self-supervised learning based on k-space splitting. We compare our method against other reconstruction architectures and the parallel domain self-supervised learning baseline. Results show that the proposed model exhibits improvement margins of (i) around 1.40 dB in PSNR and around 0.028 in SSIM on average over other architectures (ii) around 1.44 dB in PSNR and around 0.029 in SSIM over parallel domain self-supervised learning. The code is available at https://github.com/rahul-gs-16/sdlformer.git
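Summary
Transformers have become viable alternatives to convolutional neural networks because they can learn non-local relationships in the spatial domain. Their self-attention mechanism captures long-range dependencies in images, which may benefit accelerated MRI reconstruction because the effect of undersampling is non-local in the image domain. Although window-based transformers are computationally efficient, their dependencies are confined to the image windows, which restricts the receptive field. We propose a window-based transformer network that integrates a dilated attention mechanism and convolution for accelerated MRI reconstruction. The network combines dilated and dense neighborhood attention transformers to strengthen relationships between distant neighboring pixels, and adds depth-wise convolutions inside the transformer module to learn low-level translation-invariant features. The model is trained in a self-supervised manner. We run multi-coil MRI acceleration experiments on coronal PD, coronal PDFS and axial T2 contrasts, and compare against other reconstruction architectures and the parallel domain self-supervised learning baseline. On average, the model improves PSNR by about 1.40 dB and SSIM by about 0.028. The code is publicly available.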
Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets
results: The system ranked first in the 2023 DCASE Challenge and outperforms the current state of the art on the ClothoV2 benchmark by 5.6 pp. mAP@10.
Abstract
This work presents a text-to-audio-retrieval system based on pre-trained text and spectrogram transformers. Our method projects recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close. Through a systematic analysis, we examine how each component of the system influences retrieval performance. As a result, we identify two key components that play a crucial role in driving performance: the self-attention-based audio encoder for audio embedding and the utilization of additional human-generated and synthetic data sets during pre-training. We further experimented with augmenting ClothoV2 captions with available keywords to increase their variety; however, this only led to marginal improvements. Our system ranked first in the 2023's DCASE Challenge, and it outperforms the current state of the art on the ClothoV2 benchmark by 5.6 pp. mAP@10.
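Summary
This work presents a text-to-audio retrieval system based on pre-trained text and spectrogram transformers. Our method maps recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities lie close together. Through a systematic analysis, we assess how each component of the system affects retrieval performance. We find two components to be decisive: the self-attention-based audio encoder, and the use of additional human-generated and synthetic data sets during pre-training. We also tried augmenting ClothoV2 captions with available keywords, but this brought only marginal improvements. Our system ranked first in the 2023 DCASE Challenge and outperforms the current state of the art on the ClothoV2 benchmark by 5.6 pp. mAP@10.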
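The retrieval setup and the mAP@10 metric can be made concrete with a small sketch. It assumes captions and recordings have already been embedded into the shared space; the helper names (`cosine`, `rank_audio`, `average_precision_at_k`, `map_at_k`) and the toy two-dimensional vectors are illustrative assumptions, not the authors' code or the official DCASE evaluation script.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_audio(query_vec, audio_vecs):
    """Return audio ids sorted by decreasing similarity to the text query."""
    return sorted(audio_vecs,
                  key=lambda i: cosine(query_vec, audio_vecs[i]),
                  reverse=True)

def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision over the top-k retrieved items."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            score += hits / rank
    denom = min(len(relevant_ids), k)
    return score / denom if denom else 0.0

def map_at_k(queries, audio_vecs, k=10):
    """queries: list of (query_vector, set_of_relevant_audio_ids)."""
    scores = [average_precision_at_k(rank_audio(q, audio_vecs), rel, k)
              for q, rel in queries]
    return sum(scores) / len(scores)

# Toy embeddings (hypothetical): three recordings, two caption queries.
audio_vecs = {"rain": [1.0, 0.0], "dog": [0.0, 1.0], "mix": [1.0, 1.0]}
queries = [([1.0, 0.1], {"mix"}), ([0.0, 1.0], {"dog"})]
print(map_at_k(queries, audio_vecs))
```

When each caption has exactly one matching recording, the per-query score reduces to the reciprocal of the rank of that recording, or zero if it falls outside the top ten.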
Federated Inference with Reliable Uncertainty Quantification over Wireless Channels via Conformal Prediction
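results: Numerical results compare WFCP with digital implementations of existing federated CP schemes and show significant advantages for WFCP when communication resources are limited and/or the number of devices is large. WFCP is also proved to provide formal reliability guarantees in the wireless setting, whereas previous federated CP work assumed noise-free communication.
Abstract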
Consider a setting in which devices and a server share a pre-trained model. The server wishes to make an inference on a new input given the model. Devices have access to data, previously not used for training, and can communicate to the server over a common wireless channel. If the devices have no access to the new input, can communication from devices to the server enhance the quality of the inference decision at the server? Recent work has introduced federated conformal prediction (CP), which leverages devices-to-server communication to improve the reliability of the server's decision. With federated CP, devices communicate to the server information about the loss accrued by the shared pre-trained model on the local data, and the server leverages this information to calibrate a decision interval, or set, so that it is guaranteed to contain the correct answer with a pre-defined target reliability level. Previous work assumed noise-free communication, whereby devices can communicate a single real number to the server. In this paper, we study for the first time federated CP in a wireless setting. We introduce a novel protocol, termed wireless federated conformal prediction (WFCP), which builds on type-based multiple access (TBMA) and on a novel quantile correction strategy. WFCP is proved to provide formal reliability guarantees in terms of coverage of the predicted set produced by the server. Using numerical results, we demonstrate the significant advantages of WFCP against digital implementations of existing federated CP schemes, especially in regimes with limited communication resources and/or large number of devices.
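Summary
Consider a setting in which devices and a server share a pre-trained model. The server wants to make an inference on a new input. Devices hold data not previously used for training and can communicate with the server over a common wireless channel. If the devices have no access to the new input, can device-to-server communication improve the quality of the server's inference decision? Recent work introduced federated conformal prediction (CP), which uses device-to-server communication to improve the reliability of the server's decision. In federated CP, devices send the server information about the loss of the shared model on their local data, and the server uses it to calibrate a decision set that is guaranteed to contain the correct answer at a pre-defined target reliability level. Previous work assumed noise-free communication, in which each device can send a single real number to the server. This paper studies federated CP in a wireless setting. We propose a new protocol, wireless federated conformal prediction (WFCP), built on type-based multiple access (TBMA) and a novel quantile correction strategy. WFCP provides formal reliability guarantees on the coverage of the predicted set. Numerical results show significant advantages of WFCP over digital implementations of existing federated CP schemes, especially when communication resources are limited and/or the number of devices is large.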
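The calibration step behind conformal prediction can be sketched in a few lines. The sketch covers only the standard centralized, noise-free case, in which all calibration losses reach the server exactly; it does not implement TBMA or the quantile correction that WFCP adds. The function names and toy scores are assumptions for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest
    calibration score. Returns infinity when n is too small for the
    requested reliability level 1 - alpha."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate label whose score (loss) is within the threshold."""
    return {label for label, s in candidate_scores.items() if s <= threshold}

# Losses of the shared model on held-out device data (toy numbers).
cal_scores = [0.5, 0.1, 0.7, 0.3, 0.9, 0.2, 0.4]
threshold = conformal_threshold(cal_scores, alpha=0.25)

# Losses the model assigns to each candidate label for the new input.
new_input_scores = {"cat": 0.2, "dog": 0.65, "fox": 0.95}
print(threshold, prediction_set(new_input_scores, threshold))
```

Under exchangeability of calibration and test data, a set built this way contains the correct label with probability at least 1 - alpha. In the federated setting the difficulty is that the server must estimate this quantile from what the devices can transmit.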
OpinionConv: Conversational Product Search with Grounded Opinions
paper_authors: Vahid Sadiri Javadi, Martin Potthast, Lucie Flek
for: The paper aims to simulate sales conversations and to ground conversational AI in true subjective narratives.
methods: The paper uses product reviews as a rich source of product opinions.
results: In several user studies, the generated conversations were judged realistic, and assessors confirmed the importance of opinions as an informed basis for decision-making.
Abstract
When searching for products, the opinions of others play an important role in making informed decisions. Subjective experiences about a product can be a valuable source of information. This is also true in sales conversations, where a customer and a sales assistant exchange facts and opinions about products. However, training an AI for such conversations is complicated by the fact that language models do not possess authentic opinions for their lack of real-world experience. We address this problem by leveraging product reviews as a rich source of product opinions to ground conversational AI in true subjective narratives. With OpinionConv, we develop the first conversational AI for simulating sales conversations. To validate the generated conversations, we conduct several user studies showing that the generated opinions are perceived as realistic. Our assessors also confirm the importance of opinions as an informative basis for decision-making.
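Summary
When searching for products, the opinions of others play an important guiding role. Subjective experiences with a product can provide valuable information. This also holds in sales conversations, where a customer and a sales assistant exchange facts and opinions about products. However, training an AI for such conversations is complicated because language models lack real-world experience and therefore hold no authentic opinions. We address this by using product reviews as a rich source of product opinions to ground conversational AI in true subjective narratives. We develop OpinionConv, the first conversational AI for simulating sales conversations. To validate the generated conversations, we conduct several user studies, which show that the generated opinions are perceived as realistic. Our assessors also confirm the importance of opinions as a basis for decision-making.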
Semantic Interpretation and Validation of Graph Attention-based Explanations for GNN Models
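methods: The paper proposes a Graph Deep Learning (GDL) methodology that applies semantic attention to improve the explainability of GNN models. It builds on attention mechanisms, which provide feature-based explanations by estimating how important each input feature is to the model.
results: Applied to a lidar pointcloud estimation model, the methodology identifies the key semantic classes that contribute to enhanced performance and generates reliable post-hoc semantic explanations.
Abstract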
In this work, we propose a methodology for investigating the application of semantic attention to enhance the explainability of Graph Neural Network (GNN)-based models, introducing semantically-informed perturbations and establishing a correlation between predicted feature-importance weights and model accuracy. Graph Deep Learning (GDL) has emerged as a promising field for tasks like scene interpretation, leveraging flexible graph structures to concisely describe complex features and relationships. As traditional explainability methods used in eXplainable AI (XAI) cannot be directly applied to such structures, graph-specific approaches are introduced. Attention mechanisms have demonstrated their efficacy in estimating the importance of input features in deep learning models and thus have been previously employed to provide feature-based explanations for GNN predictions. Building upon these insights, we extend existing attention-based graph-explainability methods investigating the use of attention weights as importance indicators of semantically sorted feature sets. Through analysing the behaviour of predicted attention-weights distribution in correlation with model accuracy, we gain valuable insights into feature importance with respect to the behaviour of the GNN model. We apply our methodology to a lidar pointcloud estimation model successfully identifying key semantic classes that contribute to enhanced performance effectively generating reliable post-hoc semantic explanations.
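Summary
In this work, we propose a methodology for improving the explainability of Graph Neural Network (GNN) models by introducing semantic attention and establishing a correlation between predicted feature-importance weights and model accuracy. Graph Deep Learning (GDL) has become a promising field for scene interpretation, using graph structures to describe complex features and relationships concisely. Traditional XAI explainability methods cannot be applied directly to such structures, so graph-specific approaches have been introduced. Attention mechanisms have proven able to estimate the importance of input features and have therefore been used to provide feature-based explanations for GNN predictions. Building on this, we extend existing attention-based graph-explainability methods and study attention weights as importance indicators of semantically sorted feature sets. By analysing how the distribution of predicted attention weights correlates with model accuracy, we gain insight into feature importance with respect to the behaviour of the GNN model. We apply the methodology to a lidar pointcloud estimation model, identifying key semantic classes that contribute to enhanced performance and generating reliable post-hoc semantic explanations.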
Varying-coefficients for regional quantile via KNN-based LASSO with applications to health outcome study
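results: In empirical applications, the method effectively captures the complex age-dependent associations between health outcomes and risk factors.
Abstract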
Health outcomes, such as body mass index and cholesterol levels, are known to be dependent on age and exhibit varying effects with their associated risk factors. In this paper, we propose a novel framework for dynamic modeling of the associations between health outcomes and risk factors using varying-coefficients (VC) regional quantile regression via K-nearest neighbors (KNN) fused Lasso, which captures the time-varying effects of age. The proposed method has strong theoretical properties, including a tight estimation error bound and the ability to detect exact clustered patterns under certain regularity conditions. To efficiently solve the resulting optimization problem, we develop an alternating direction method of multipliers (ADMM) algorithm. Our empirical results demonstrate the efficacy of the proposed method in capturing the complex age-dependent associations between health outcomes and their risk factors.
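Summary
Health outcomes such as body mass index and cholesterol levels depend on age, and the effects of their associated risk factors vary. In this paper, we propose a new framework for dynamically modeling the associations between health outcomes and risk factors. It uses varying-coefficient (VC) regional quantile regression with K-nearest-neighbors (KNN) fused Lasso to capture the time-varying effects of age. The method has strong theoretical properties, including a tight estimation error bound and the ability to detect exact clustered patterns under certain regularity conditions. To solve the resulting optimization problem, we develop an alternating direction method of multipliers (ADMM) algorithm. Our empirical results show that the method captures the complex age-dependent associations between health outcomes and risk factors.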
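The two ingredients named above, a quantile (pinball) loss and a fused-Lasso penalty over a KNN graph in age, can be combined into a toy objective. The abstract does not state the paper's exact objective, so this sketch is an assumption throughout: it uses a single quantile level `tau` instead of a region of quantiles, one scalar covariate, one coefficient per observation, and hypothetical function names. It only evaluates the objective and does not implement the ADMM solver.

```python
def pinball_loss(residual, tau):
    """Quantile check loss: tau * r for r >= 0, (tau - 1) * r for r < 0."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def knn_edges(ages, k):
    """Undirected edges linking each observation to its k nearest
    neighbours in age."""
    edges = set()
    for i, a in enumerate(ages):
        others = [j for j in range(len(ages)) if j != i]
        nearest = sorted(others, key=lambda j: abs(ages[j] - a))[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

def objective(y, x, beta, ages, tau, lam, k):
    """Pinball fit term plus a fused-Lasso penalty that pulls the
    coefficients of age-neighbours towards each other."""
    fit = sum(pinball_loss(yi - xi * bi, tau)
              for yi, xi, bi in zip(y, x, beta))
    penalty = sum(abs(beta[i] - beta[j]) for i, j in knn_edges(ages, k))
    return fit + lam * penalty

ages = [20, 21, 40]
print(knn_edges(ages, k=1))
print(objective(y=[1, 1, 1], x=[1, 1, 1], beta=[1, 1, 3],
                ages=ages, tau=0.5, lam=0.5, k=1))
```

The L1 penalty on coefficient differences is what encourages clusters of observations with similar ages to share exactly the same coefficient, which matches the clustered patterns the abstract refers to.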
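results: The paper proposes a distributed iterative sketching approach that speeds up linear regression while ensuring security. Specifically, it applies a random orthonormal matrix, which acts as an encoded encryption, and then subsamples blocks so that linear regression can be computed efficiently in a distributed system. It also focuses on the Subsampled Randomized Hadamard Transform and generalizes it to block sampling.
Abstract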
In this work, we propose methods for speeding up linear regression distributively, while ensuring security. We leverage randomized sketching techniques, and improve straggler resilience in asynchronous systems. Specifically, we apply a random orthonormal matrix and then subsample \textit{blocks}, to simultaneously secure the information and reduce the dimension of the regression problem. In our setup, the transformation corresponds to an encoded encryption in an \textit{approximate gradient coding scheme}, and the subsampling corresponds to the responses of the non-straggling workers; in a centralized coded computing network. This results in a distributive \textit{iterative sketching} approach for an $\ell_2$-subspace embedding, \textit{i.e.} a new sketch is considered at each iteration. We also focus on the special case of the \textit{Subsampled Randomized Hadamard Transform}, which we generalize to block sampling; and discuss how it can be modified in order to secure the data.
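Summary
In this work, we propose methods for speeding up linear regression in a distributed manner while ensuring security. We use randomized sketching techniques and improve straggler resilience in asynchronous systems. Specifically, we first apply a random orthonormal matrix and then subsample blocks, which simultaneously secures the information and reduces the dimension of the regression problem. In our setup, the transformation corresponds to an encoded encryption in an approximate gradient coding scheme, and the subsampling corresponds to the responses of the non-straggling workers. This yields a distributed iterative sketching approach in which a new sketch is considered at each iteration. We also focus on the special case of the Subsampled Randomized Hadamard Transform, generalize it to block sampling, and discuss how it can be modified to secure the data.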
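A block-sampled randomized Hadamard transform can be sketched on a single column vector as follows. This is an illustrative sketch, not the paper's scheme: the function names, the fixed sign pattern and the choice of kept blocks are assumptions, and the security modification discussed in the abstract is not modelled. Each kept block stands in for the response of one non-straggling worker.

```python
import math

def fwht(vec):
    """Unnormalised fast Walsh-Hadamard transform.
    len(vec) must be a power of two."""
    out = list(vec)
    n, h = len(out), 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out

def randomized_hadamard(vec, signs):
    """Apply (1/sqrt(n)) * H * D, where D is a diagonal of +/-1 signs.
    The combined map is orthonormal, so it preserves Euclidean norms."""
    n = len(vec)
    flipped = [s * v for s, v in zip(signs, vec)]
    return [v / math.sqrt(n) for v in fwht(flipped)]

def block_subsample(vec, block_size, kept_blocks):
    """Keep whole blocks of coordinates and rescale so that, under uniform
    block sampling, the squared norm is preserved in expectation."""
    n_blocks = len(vec) // block_size
    scale = math.sqrt(n_blocks / len(kept_blocks))
    out = []
    for b in kept_blocks:
        out.extend(scale * v for v in vec[b * block_size:(b + 1) * block_size])
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
signs = [1, -1, 1, 1, -1, -1, 1, -1]            # one random draw, fixed here
rotated = randomized_hadamard(x, signs)
sketch = block_subsample(rotated, block_size=2, kept_blocks=[0, 2])
print(len(sketch), sum(v * v for v in rotated))
```

In an iterative scheme, a different set of blocks arrives at each iteration depending on which workers respond first, so a new sketch is used every time.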
Studying Socially Unacceptable Discourse Classification (SUD) through different eyes: “Are we on the same page ?”
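results: We provide several data insights that can support domain experts in the annotation task. We also analyze how possibly different annotation modalities influence SUD learning, and discuss open challenges and research directions.
Abstract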
We study Socially Unacceptable Discourse (SUD) characterization and detection in online text. We first build and present a novel corpus that contains a large variety of manually annotated texts from different online sources used so far in state-of-the-art Machine learning (ML) SUD detection solutions. This global context allows us to test the generalization ability of SUD classifiers that acquire knowledge around the same SUD categories, but from different contexts. From this perspective, we can analyze how (possibly) different annotation modalities influence SUD learning by discussing open challenges and open research directions. We also provide several data insights which can support domain experts in the annotation task.
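Summary
We study the characterization and detection of Socially Unacceptable Discourse (SUD) in online text. We first build and present a new corpus containing manually annotated texts from different online sources that have been used so far in state-of-the-art machine learning (ML) SUD detection solutions. This global context lets us test the generalization ability of SUD classifiers that acquire knowledge of the same SUD categories from different contexts. From this perspective, we analyze how different annotation modalities influence SUD learning and discuss open challenges and research directions. We also provide several data insights that can support domain experts in the annotation task.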
Dual input neural networks for positional sound source localization
results: On a test dataset of real recordings, the DI-NN achieves a localization error five times lower than the Least-Squares (LS) baseline and two times lower than the Convolutional Recurrent Neural Network (CRNN).
Abstract
In many signal processing applications, metadata may be advantageously used in conjunction with a high dimensional signal to produce a desired output. In the case of classical Sound Source Localization (SSL) algorithms, information from a high dimensional, multichannel audio signals received by many distributed microphones is combined with information describing acoustic properties of the scene, such as the microphones' coordinates in space, to estimate the position of a sound source. We introduce Dual Input Neural Networks (DI-NNs) as a simple and effective way to model these two data types in a neural network. We train and evaluate our proposed DI-NN on scenarios of varying difficulty and realism and compare it against an alternative architecture, a classical Least-Squares (LS) method as well as a classical Convolutional Recurrent Neural Network (CRNN). Our results show that the DI-NN significantly outperforms the baselines, achieving a five times lower localization error than the LS method and two times lower than the CRNN in a test dataset of real recordings.
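Summary
In many signal processing applications, metadata can be used together with a high-dimensional signal to produce a desired output. In classical sound source localization algorithms, high-dimensional multichannel audio signals from many distributed microphones are combined with information describing acoustic properties of the scene, such as the microphones' coordinates in space, to estimate the position of a sound source. We introduce Dual Input Neural Networks (DI-NNs) as a simple and effective way to model these two data types. We train and evaluate the proposed DI-NN on scenarios of varying difficulty and realism and compare it with baselines, including a classical Least-Squares (LS) method and a Convolutional Recurrent Neural Network (CRNN). Our results show that the DI-NN clearly outperforms the baselines: on a test set of real recordings, its localization error is five times lower than that of the LS method and two times lower than that of the CRNN.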
Comprehensive Assessment of the Performance of Deep Learning Classifiers Reveals a Surprising Lack of Robustness
for: The paper aims to evaluate the robustness and reliability of machine learning models, specifically deep neural networks, by using a wide range of data types and a single metric to assess their performance.
methods: The authors propose a benchmark that includes multiple types of data, and they use a single metric to compare the models' performance across these data types.
results: Current deep neural networks are vulnerable to making mistakes on certain types of data, so they may not be reliable in real-world scenarios where they encounter data from many different domains. These models can also be easily fooled into making wrong decisions.
Abstract
Reliable and robust evaluation methods are a necessary first step towards developing machine learning models that are themselves robust and reliable. Unfortunately, current evaluation protocols typically used to assess classifiers fail to comprehensively evaluate performance as they tend to rely on limited types of test data, and ignore others. For example, using the standard test data fails to evaluate the predictions made by the classifier to samples from classes it was not trained on. On the other hand, testing with data containing samples from unknown classes fails to evaluate how well the classifier can predict the labels for known classes. This article advocates bench-marking performance using a wide range of different types of data and using a single metric that can be applied to all such data types to produce a consistent evaluation of performance. Using such a benchmark it is found that current deep neural networks, including those trained with methods that are believed to produce state-of-the-art robustness, are extremely vulnerable to making mistakes on certain types of data. This means that such models will be unreliable in real-world scenarios where they may encounter data from many different domains, and that they are insecure as they can easily be fooled into making the wrong decisions. It is hoped that these results will motivate the wider adoption of more comprehensive testing methods that will, in turn, lead to the development of more robust machine learning methods in the future. Code is available at: \url{https://codeberg.org/mwspratling/RobustnessEvaluation}
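Summary
Reliable and robust evaluation methods are a necessary first step towards developing reliable and robust machine learning models. Unfortunately, current evaluation protocols assess performance only partially, because they tend to rely on limited types of test data. For example, using the standard test data cannot evaluate the model's predictions for samples from unknown classes. Conversely, testing with samples from unknown classes cannot evaluate how well the model predicts labels for known classes. This article proposes benchmarking performance with many different types of data and a single metric that applies to all of them, giving a consistent evaluation. With such a benchmark, current deep neural networks, including those trained with methods believed to produce state-of-the-art robustness, are found to be extremely error-prone on certain types of data. Such models will therefore be unreliable in real-world use, where they may encounter data from many domains, and insecure, because they can easily be fooled into making wrong decisions. We hope these results will encourage wider adoption of more comprehensive testing methods and, in turn, the development of more robust machine learning methods. Code is available at: https://codeberg.org/mwspratling/RobustnessEvaluation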
D-Score: A Synapse-Inspired Approach for Filter Pruning
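results: Experiments on the CIFAR-10 and ImageNet datasets show that the method removes a notable amount of computation (FLOPs) and parameters without a significant drop in accuracy.
Abstract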
This paper introduces a new aspect for determining the rank of the unimportant filters for filter pruning on convolutional neural networks (CNNs). In the human synaptic system, there are two important channels known as excitatory and inhibitory neurotransmitters that transmit a signal from a neuron to a cell. Adopting the neuroscientific perspective, we propose a synapse-inspired filter pruning method, namely Dynamic Score (D-Score). D-Score analyzes the independent importance of positive and negative weights in the filters and ranks the independent importance by assigning scores. Filters having low overall scores, and thus low impact on the accuracy of neural networks are pruned. The experimental results on CIFAR-10 and ImageNet datasets demonstrate the effectiveness of our proposed method by reducing notable amounts of FLOPs and Params without significant Acc. Drop.
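The idea of scoring positive and negative weights independently and pruning the filters with the lowest overall score can be illustrated with a toy ranking. The abstract does not give the D-Score formula, so the scoring rule below (rank filters separately by positive and negative weight mass, then add the two ranks) is an assumption made for illustration, as are the function names and example weights.

```python
def split_magnitudes(filter_weights):
    """Total positive mass and total negative mass (as a positive number)
    of one flattened filter."""
    pos = sum(w for w in filter_weights if w > 0)
    neg = sum(-w for w in filter_weights if w < 0)
    return pos, neg

def rank_scores(values):
    """Map each value to its rank, with 0 for the smallest value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks

def filters_to_prune(filters, prune_ratio):
    """Return indices of the filters with the lowest combined rank.
    Hypothetical scoring rule, not the published D-Score definition."""
    mags = [split_magnitudes(f) for f in filters]
    pos_rank = rank_scores([p for p, _ in mags])
    neg_rank = rank_scores([n for _, n in mags])
    overall = [p + n for p, n in zip(pos_rank, neg_rank)]
    n_prune = int(len(filters) * prune_ratio)
    order = sorted(range(len(filters)), key=lambda i: overall[i])
    return sorted(order[:n_prune])

filters = [
    [0.9, -0.8, 0.7],     # strong positive and negative weights
    [0.05, -0.02, 0.01],  # weak on both sides
    [0.5, -0.1, 0.2],
    [0.1, -0.6, 0.0],
]
print(filters_to_prune(filters, prune_ratio=0.25))
```

Ranking the two sign groups separately keeps a filter with large weights of only one sign from being judged by its summed magnitude alone, which is one way to read the independent importance described in the abstract.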
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation
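results: The method can identify over 6400 categories of objects, substantially broadening the range of visual information. It merges multimodal data, promoting reciprocal enhancement among modalities and cross-modal data correction. The final output turns each video input into a detailed sequential document, making video content easier for large language models to process.
Abstract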
This paper presents OmniDataComposer, an innovative approach for multimodal data fusion and unlimited data generation with an intent to refine and uncomplicate interplay among diverse data modalities. Coming to the core breakthrough, it introduces a cohesive data structure proficient in processing and merging multimodal data inputs, which include video, audio, and text. Our crafted algorithm leverages advancements across multiple operations such as video/image caption extraction, dense caption extraction, Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), Recognize Anything Model(RAM), and object tracking. OmniDataComposer is capable of identifying over 6400 categories of objects, substantially broadening the spectrum of visual information. It amalgamates these diverse modalities, promoting reciprocal enhancement among modalities and facilitating cross-modal data correction. \textbf{The final output metamorphoses each video input into an elaborate sequential document}, virtually transmuting videos into thorough narratives, making them easier to be processed by large language models. Future prospects include optimizing datasets for each modality to encourage unlimited data generation. This robust base will offer priceless insights to models like ChatGPT, enabling them to create higher quality datasets for video captioning and easing question-answering tasks based on video content. OmniDataComposer inaugurates a new stage in multimodal learning, imparting enormous potential for augmenting AI's understanding and generation of complex, real-world data.
Constructing Custom Thermodynamics Using Deep Learning
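results: By studying polymer stretching, the paper learns three interpretable thermodynamic coordinates and builds a dynamical landscape of polymer stretching, including the identification of stable and transition states and the control of the stretching rate. It also applies the method to spatial epidemics, a problem in a different domain, demonstrating broad scientific and technological applicability.
Abstract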
One of the most exciting applications of AI is automated scientific discovery based on previously amassed data, coupled with restrictions provided by the known physical principles, including symmetries and conservation laws. Such automated hypothesis creation and verification can assist scientists in studying complex phenomena, where traditional physical intuition may fail. Of particular importance are complex dynamic systems where their time evolution is strongly influenced by varying external parameters. In this paper we develop a platform based on a generalised Onsager principle to learn macroscopic dynamical descriptions of arbitrary stochastic dissipative systems directly from observations of their microscopic trajectories. We focus on systems whose complexity and sheer sizes render complete microscopic description impractical, and constructing theoretical macroscopic models requires extensive domain knowledge or trial-and-error. Our machine learning approach addresses this by simultaneously constructing reduced thermodynamic coordinates and interpreting the dynamics on these coordinates. We demonstrate our method by studying theoretically and validating experimentally, the stretching of long polymer chains in an externally applied field. Specifically, we learn three interpretable thermodynamic coordinates and build a dynamical landscape of polymer stretching, including (1) the identification of stable and transition states and (2) the control of the stretching rate. We further demonstrate the universality of our approach by applying it to an unrelated problem in a different domain: constructing macroscopic dynamics for spatial epidemics, showing that our method addresses wide scientific and technological applications.
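Summary
One of the most exciting applications of AI is automated scientific discovery based on previously amassed data, combined with constraints from known physical principles, including symmetries and conservation laws. Such automated hypothesis creation and verification can help scientists study complex phenomena where traditional physical intuition may fail. Of particular importance are complex dynamic systems whose time evolution is strongly influenced by varying external parameters. In this paper, we develop a platform based on a generalised Onsager principle to learn macroscopic dynamical descriptions of stochastic dissipative systems directly from observations of their microscopic trajectories. We focus on systems whose complexity and size make a complete microscopic description impractical, and for which building theoretical macroscopic models requires extensive domain knowledge or trial and error. Our machine learning approach addresses this by simultaneously constructing reduced thermodynamic coordinates and interpreting the dynamics on them. We demonstrate the method by studying, theoretically and experimentally, the stretching of polymer chains, and show that it applies across different domains.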
PTransIPs: Identification of phosphorylation sites based on protein pretrained language model and Transformer
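paper_authors: Ziyang Xu, Haitian Zhong
for: This study develops a new deep learning model for identifying phosphorylation sites in proteins.
methods: The model, PTransIPs, treats the amino acids in a protein as words, extracts unique encodings for them, and uses embeddings from large pretrained protein models.
results: Experiments show that PTransIPs identifies phosphorylation sites effectively, with AUROCs of 0.9232 and 0.9660 for phosphorylated S/T and Y sites respectively. They also show the contribution of the pretrained model embeddings, as well as the model's interpretability and generalizability.
Abstract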
Phosphorylation is central to numerous fundamental cellular processes, influencing the onset and progression of a variety of diseases. The correct identification of these phosphorylation sites is of great importance to unravel the intricate molecular mechanisms within cells and during viral infections, potentially leading to the discovery of new therapeutic targets. In this study, we introduce PTransIPs, a novel deep learning model for the identification of phosphorylation sites. PTransIPs treat amino acids within protein sequences as words, extracting unique encodings based on their type and sequential position. The model also incorporates embeddings from large pretrained protein models as additional data inputs. PTransIPS is further trained on a combination model of convolutional neural network with residual connections and Transformer model equipped with multi-head attention mechanisms. At last, the model outputs classification results through a fully connected layer. The results of independent testing reveal that PTransIPs outperforms existing state-of-the-art(SOTA) methods, achieving AUROCs of 0.9232 and 0.9660 for identifying phosphorylated S/T and Y sites respectively. In addition, ablation studies prove that pretrained model embeddings contribute to the performance of PTransIPs. Furthermore, PTransIPs has interpretable amino acid preference, visible training process and shows generalizability on other bioactivity classification tasks. To facilitate usage, our code and data are publicly accessible at \url{https://github.com/StatXzy7/PTransIPs}.
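Summary
Phosphorylation is central to many fundamental cellular processes and influences the onset and progression of disease. Correctly identifying phosphorylation sites is important for unravelling molecular mechanisms within cells and during viral infection, and may lead to the discovery of new therapeutic targets. In this study, we introduce PTransIPs, a new deep learning model for identifying phosphorylation sites. PTransIPs treats the amino acids in a protein sequence as words and extracts unique encodings based on their type and sequential position. The model also takes embeddings from large pretrained protein models as additional inputs. PTransIPs is then trained on a combined model of a convolutional neural network and a Transformer. Finally, the model outputs classification results through a fully connected layer. Independent testing shows that PTransIPs outperforms existing state-of-the-art methods, with AUROCs of 0.9232 and 0.9660 for phosphorylated S/T and Y sites respectively. Ablation studies show that the pretrained model embeddings contribute to its performance. PTransIPs also has an interpretable amino acid preference, a visible training process, and generalizes to other bioactivity classification tasks. Our code and data are publicly accessible on GitHub.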
Correlating Medi-Claim Service by Deep Learning Neural Networks
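results: With this approach, fraudulent claims can be detected and predicted, protecting the financial growth of health insurance companies and insured people.
Abstract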
Medical insurance claim fraud is an organized crime involving patients, physicians, diagnostic centers, and insurance providers, forming a chain reaction that must be monitored constantly. Such fraud affects the financial growth of both insured people and health insurance companies. A Convolutional Neural Network architecture is used to detect fraudulent claims through a correlation study of regression models, which helps to detect money laundering across claims submitted by different providers. Supervised and unsupervised classifiers are used to detect fraudulent and non-fraudulent claims.
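Summary
Medical insurance claim fraud is an organized crime involving patients, physicians, diagnostic centers and insurance companies, forming a chain reaction. Such fraud affects the financial growth of insured people and health insurance companies. A convolutional neural network architecture is used to detect fraudulent claims: a correlation study of regression models over claims from different providers helps to detect money laundering. Supervised and unsupervised classifiers are used to detect fraudulent and non-fraudulent claims.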
Explainable machine learning to enable high-throughput electrical conductivity optimization of doped conjugated polymers
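paper_authors: Ji Wei Yoon, Adithya Kumar, Pawan Kumar, Kedar Hippalgaonkar, J Senthilnath, Vijila Chellappan
for: This study aims to make electrical conductivity measurement of doped polymer materials more efficient and to accelerate materials discovery with machine learning (ML).
methods: The study uses readily measured absorbance spectra as input to ML models that predict the electrical conductivity of doped polymer materials.
results: The ML models classify and predict the conductivity of doped polymer materials with high accuracy and improve the efficiency of the experimental measurements by 89%. The study also addresses the common lack of explainability in ML models by exploiting bespoke mathematical properties of the descriptors and the ML model, yielding corroborated insights into the spectral influences on conductivity.
Abstract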
The combination of high-throughput experimentation techniques and machine learning (ML) has recently ushered in a new era of accelerated material discovery, enabling the identification of materials with cutting-edge properties. However, the measurement of certain physical quantities remains challenging to automate. Specifically, meticulous process control, experimentation and laborious measurements are required to achieve optimal electrical conductivity in doped polymer materials. We propose a ML approach, which relies on readily measured absorbance spectra, to accelerate the workflow associated with measuring electrical conductivity. The first ML model (classification model), accurately classifies samples with a conductivity >~25 to 100 S/cm, achieving a maximum of 100% accuracy rate. For the subset of highly conductive samples, we employed a second ML model (regression model), to predict their conductivities, yielding an impressive test R2 value of 0.984. To validate the approach, we showed that the models, neither trained on the samples with the two highest conductivities of 498 and 506 S/cm, were able to, in an extrapolative manner, correctly classify and predict them at satisfactory levels of errors. The proposed ML workflow results in an improvement in the efficiency of the conductivity measurements by 89% of the maximum achievable using our experimental techniques. Furthermore, our approach addressed the common challenge of the lack of explainability in ML models by exploiting bespoke mathematical properties of the descriptors and ML model, allowing us to gain corroborated insights into the spectral influences on conductivity. Through this study, we offer an accelerated pathway for optimizing the properties of doped polymer materials while showcasing the valuable insights that can be derived from purposeful utilization of ML in experimental science.
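Summary
High-throughput experimentation combined with machine learning (ML) has opened a new era of accelerated materials discovery. However, some physical quantities remain hard to measure automatically. Specifically, careful process control, experimentation and laborious measurements are needed to achieve optimal electrical conductivity in doped polymer materials. We propose an ML approach, based on readily measured absorbance spectra, to accelerate the conductivity measurement workflow. The first ML model (a classification model) accurately classifies samples with a conductivity >~25 to 100 S/cm, reaching up to 100% accuracy. For the subset of highly conductive samples, a second ML model (a regression model) predicts their conductivities with a test R2 of 0.984. To validate the approach, we show that the models, although not trained on the two most conductive samples (498 and 506 S/cm), could still classify and predict them in an extrapolative manner with satisfactory error. The proposed ML workflow improves the efficiency of conductivity measurements by 89%. Our approach also addresses the lack of explainability in ML models by exploiting bespoke mathematical properties of the descriptors and the ML model, giving corroborated insights into the spectral influences on conductivity. This study offers an accelerated pathway for optimizing doped polymer materials and shows the value of ML in experimental science.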
Asynchronous Evolution of Deep Neural Network Architectures
results: On an 11-bit multiplexer task and an image captioning task, AES achieved multifold efficiency improvements, suggesting that AES is a promising method for parallelizing the evolution of complex systems with long and variable evaluation times.
Abstract
Many evolutionary algorithms (EAs) take advantage of parallel evaluation of candidates. However, if evaluation times vary significantly, many worker nodes (i.e., compute clients) are idle much of the time, waiting for the next generation to be created. Evolutionary neural architecture search (ENAS), a class of EAs that optimizes the architecture and hyperparameters of deep neural networks, is particularly vulnerable to this issue. This paper proposes a generic asynchronous evaluation strategy (AES) that is then adapted to work with ENAS. AES increases throughput by maintaining a queue of up to $K$ individuals ready to be sent to the workers for evaluation and proceeding to the next generation as soon as $M<
Abstract
Data Science is a modern Data Intelligence practice that is at the core of many businesses and helps them build smart strategies to deal with business challenges more efficiently. Data Science also helps automate business processes using algorithms, and it has several other benefits, which also apply in a non-profit framework. Three key components primarily influence the outcome of a data science project: (1) availability of data, (2) the algorithm, and (3) processing power or infrastructure.
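Summary
Data science is a modern data intelligence practice. It is at the core of many businesses and helps them build smart strategies to face business challenges more effectively. Data science can also automate business processes, and it has many other benefits that can be realized in a non-profit framework. Three key components mainly determine the outcome of a data science project: (1) availability of data, (2) the algorithm, and (3) processing power or infrastructure.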
Application-Oriented Benchmarking of Quantum Generative Learning Using QUARK
results: The study trains and deploys different quantum generative models in different settings and evaluates their practicality and feasibility with a broad set of metrics.
Abstract
Benchmarking of quantum machine learning (QML) algorithms is challenging due to the complexity and variability of QML systems, e.g., regarding model ansatzes, data sets, training techniques, and hyper-parameters selection. The QUantum computing Application benchmaRK (QUARK) framework simplifies and standardizes benchmarking studies for quantum computing applications. Here, we propose several extensions of QUARK to include the ability to evaluate the training and deployment of quantum generative models. We describe the updated software architecture and illustrate its flexibility through several example applications: (1) We trained different quantum generative models using several circuit ansatzes, data sets, and data transformations. (2) We evaluated our models on GPU and real quantum hardware. (3) We assessed the generalization capabilities of our generative models using a broad set of metrics that capture, e.g., the novelty and validity of the generated data.
摘要
审核量子机器学习(QML)算法具有复杂性和多样性,如模型架构、数据集、训练技术和超参数选择等方面。QUantum computing Application benchmaRK(QUARK)框架可以简化和标准化量子计算应用程序的审核研究。我们提出了将QUARK扩展以支持量子生成模型的训练和部署评估。我们描述了更新后的软件架构,并通过多个示例应用 illustrate its flexibility:1. 我们使用不同的量子生成模型、数据集和数据变换训练了多种环境。2. 我们在GPU和真实量子硬件上评估了我们的模型。3. 我们使用一组广泛的指标评估我们的生成模型的泛化能力,例如生成数据的新鲜度和有效性。
Federated Zeroth-Order Optimization using Trajectory-Informed Surrogate Gradients
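results: Experiments show that the algorithm achieves theoretical improvements and practical effectiveness in real-world applications such as federated black-box adversarial attack and federated non-differentiable metric optimization.
Abstract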
Federated optimization, an emerging paradigm which finds wide real-world applications such as federated learning, enables multiple clients (e.g., edge devices) to collaboratively optimize a global function. The clients do not share their local datasets and typically only share their local gradients. However, the gradient information is not available in many applications of federated optimization, which hence gives rise to the paradigm of federated zeroth-order optimization (ZOO). Existing federated ZOO algorithms suffer from the limitations of query and communication inefficiency, which can be attributed to (a) their reliance on a substantial number of function queries for gradient estimation and (b) the significant disparity between their realized local updates and the intended global updates. To this end, we (a) introduce trajectory-informed gradient surrogates, which are able to use the history of function queries during optimization for accurate and query-efficient gradient estimation, and (b) develop the technique of adaptive gradient correction using these gradient surrogates to mitigate the aforementioned disparity. Based on these, we propose the federated zeroth-order optimization using trajectory-informed surrogate gradients (FZooS) algorithm for query- and communication-efficient federated ZOO. Our FZooS achieves theoretical improvements over the existing approaches, which is supported by our real-world experiments such as federated black-box adversarial attack and federated non-differentiable metric optimization.
摘要
联合优化,是一种emerging paradigm,可以应用于联合学习、联合优化等实际应用中。在这种模型中,多个客户端(例如边缘设备)可以共同优化一个全球函数。客户端不会分享本地数据,通常只会分享本地梯度。然而,在许多应用中,梯度信息不可用,这导致了联合零次顺序优化(ZOO)的出现。现有的联合ZOO算法受到查询和通信不确定性的限制,这可以被归因于(a)它们依赖大量的函数查询来Estimate梯度,以及(b)它们实现的本地更新与globally intended的更新之间的差异。为解决这个问题,我们(a)引入了路径参数预测的梯度代理,可以在优化过程中使用历史的函数查询来精确地Estimate梯度,以及(b)开发了适应的梯度调整技术,以mitigate the aforementioned disparity。基于这些,我们提出了联合零次顺序优化使用路径参数预测梯度(FZooS)算法,实现了查询和通信效率的联合ZOO。我们的FZooS理论上超越了现有的方法,这被支持了我们的实际实验,例如联合黑盒抗攻击和联合非 differentiable 度量优化。
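To make the query cost of zeroth-order gradient estimation concrete, here is a minimal sketch of a plain coordinate-wise finite-difference estimator. This is a generic baseline, not the FZooS surrogate described in the abstract; the function names, the step size `mu`, and the test objective `sphere` are assumptions chosen for illustration. It needs one function query per dimension plus one at the current point, which is the kind of cost that query-efficient methods try to reduce.

```python
def finite_difference_gradient(f, x, mu=1e-6):
    """Estimate the gradient of f at x using only function values.

    Uses forward differences along each coordinate, so it costs
    len(x) + 1 queries of f. (Generic baseline, not the FZooS method.)
    """
    fx = f(x)  # one query at the current point
    grad = []
    for i in range(len(x)):
        x_step = list(x)
        x_step[i] += mu  # perturb a single coordinate
        grad.append((f(x_step) - fx) / mu)  # one extra query per coordinate
    return grad


def sphere(x):
    """Hypothetical black-box objective: f(x) = sum of squares, gradient 2x."""
    return sum(v * v for v in x)


g = finite_difference_gradient(sphere, [1.0, -2.0, 0.5])
print(g)  # close to the true gradient [2.0, -4.0, 1.0]
```

In a federated setting each client would pay this query cost locally for every update, which is why reusing the history of past queries is attractive.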
Learning Specialized Activation Functions for Physics-informed Neural Networks
paper_authors: Honghui Wang, Lu Lu, Shiji Song, Gao Huang
for: This paper aims to address the optimization difficulty of physics-informed neural networks (PINNs) by exploring the connection between PINNs and activation functions.
methods: The paper introduces adaptive activation functions to search for the optimal function when solving different problems, and compares different adaptive activation functions and their limitations in the context of PINNs.
results: The proposed adaptive activation function can be used to solve different PDE systems in an interpretable way, and its effectiveness is demonstrated on a series of benchmarks.
Abstract
Physics-informed neural networks (PINNs) are known to suffer from optimization difficulty. In this work, we reveal the connection between the optimization difficulty of PINNs and activation functions. Specifically, we show that PINNs exhibit high sensitivity to activation functions when solving PDEs with distinct properties. Existing works usually choose activation functions by inefficient trial-and-error. To avoid the inefficient manual selection and to alleviate the optimization difficulty of PINNs, we introduce adaptive activation functions to search for the optimal function when solving different problems. We compare different adaptive activation functions and discuss their limitations in the context of PINNs. Furthermore, we propose to tailor the idea of learning combinations of candidate activation functions to the PINNs optimization, which has a higher requirement for the smoothness and diversity on learned functions. This is achieved by removing activation functions which cannot provide higher-order derivatives from the candidate set and incorporating elementary functions with different properties according to our prior knowledge about the PDE at hand. We further enhance the search space with adaptive slopes. The proposed adaptive activation function can be used to solve different PDE systems in an interpretable way. Its effectiveness is demonstrated on a series of benchmarks. Code is available at https://github.com/LeapLabTHU/AdaAFforPINNs.
摘要
物理学 informed neural networks (PINNs) oftentimes 受到优化困难。在这种工作中,我们揭示了 PINNs 的优化困难与 activation functions 之间的关系。具体来说,我们发现 PINNs 解决不同的 PDE 问题时会具有高度敏感性于 activation functions。现有的工作通常通过不efficient trial-and-error 来选择 activation functions。为了避免不efficient manual selection 和 PINNs 的优化困难,我们引入了适应 activation functions,以搜索解决不同问题的优化函数。我们比较了不同的适应 activation functions,并讨论它们在 PINNs 中的局限性。此外,我们提议在 PINNs 优化中应用学习组合 candidate activation functions 的思想,以提高学习得到的函数的平滑性和多样性。这可以通过从候选集中除掉无法提供高阶导数的 activation functions,并将不同性质的 elementary functions 纳入候选集中来实现。我们还增加了 adaptive slopes,以进一步扩大搜索空间。我们的提议的适应 activation function 可以在可读性方面解决不同 PDE 系统。我们在一系列 benchmark 上证明了其效iveness。代码可以在 GitHub 上找到:https://github.com/LeapLabTHU/AdaAFforPINNs。
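The idea of learning a combination of candidate activation functions with adaptive slopes can be sketched as follows. This is only an illustration under stated assumptions: the candidate set (sine, tanh, Gaussian), the softmax weighting, and the names `adaptive_activation`, `logits`, and `slopes` are hypothetical choices and are not taken from the paper's code. All candidates are infinitely differentiable, in line with the abstract's point that functions without higher-order derivatives are removed from the candidate set.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed candidate set: smooth elementary functions with different shapes.
CANDIDATES = (
    math.sin,
    math.tanh,
    lambda x: math.exp(-x * x),  # Gaussian bump
)


def adaptive_activation(x, logits, slopes):
    """Weighted sum of candidates, each applied to a slope-scaled input.

    In training, `logits` and `slopes` would be learnable parameters;
    here they are plain lists for illustration.
    """
    weights = softmax(logits)
    return sum(w * f(a * x) for w, f, a in zip(weights, CANDIDATES, slopes))


# With equal logits the result is the plain average of the candidates.
y = adaptive_activation(0.3, [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

Reading off the learned weights after training indicates which candidate dominates for a given PDE, which is one way such a scheme could be considered interpretable.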
Path Signatures for Diversity in Probabilistic Trajectory Optimisation
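results: Experiments show that the strategy achieves lower average costs on a range of problems, from 2D navigation to robotic manipulators operating in cluttered environments.
Abstract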
Motion planning can be cast as a trajectory optimisation problem where a cost is minimised as a function of the trajectory being generated. In complex environments with several obstacles and complicated geometry, this optimisation problem is usually difficult to solve and prone to local minima. However, recent advancements in computing hardware allow for parallel trajectory optimisation where multiple solutions are obtained simultaneously, each initialised from a different starting point. Unfortunately, without a strategy preventing two solutions from collapsing onto each other, naive parallel optimisation can suffer from mode collapse, diminishing the efficiency of the approach and the likelihood of finding a global solution. In this paper we leverage recent advances in the theory of rough paths to devise an algorithm for parallel trajectory optimisation that promotes diversity over the range of solutions, therefore avoiding mode collapse and achieving better global properties. Our approach builds on path signatures and Hilbert space representations of trajectories, and connects parallel variational inference for trajectory estimation with diversity-promoting kernels. We empirically demonstrate that this strategy achieves lower average costs than competing alternatives on a range of problems, from 2D navigation to robotic manipulators operating in cluttered environments.
摘要
运动规划可以被视为一个轨迹优化问题,其中一个目标是将轨迹优化为最小化成本函数。在复杂的环境中,拥有多个障碍物和复杂的几何结构时,这个优化问题通常具有困难和极杂的本地最优解。然而,当前的计算硬件技术使得可以并行进行轨迹优化,从不同的起始点初始化多个解决方案。然而,如果没有避免两个解决方案相互冲突的策略,直观的并行优化可能会降低效率和找到全局解的可能性。在这篇论文中,我们利用了最近的粗 PATH 理论来设计一种避免模式崩溃的并行轨迹优化算法,该算法会Promote 多样性在解决方案的范围内,从而避免模式崩溃并实现更好的全局性。我们的方法基于轨迹签名和希尔伯特空间表示法,并将并行变分推理与多样性激活器相连接。我们在各种问题上进行了实验,并证明了这种策略可以在范围内实现更低的平均成本。
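As a small illustration of what a path signature captures, the sketch below computes the first two signature levels of a piecewise-linear path in pure Python. It is a textbook construction, not the authors' implementation; the function name `signature_level2` and the example paths are assumptions. Level 1 is the total displacement, and level 2 collects the iterated integrals, accumulated segment by segment with Chen's identity.

```python
def signature_level2(path):
    """Return (level1, level2) of the signature of a piecewise-linear path.

    path: list of points, each a sequence of d floats.
    level1[i]    = total increment in coordinate i
    level2[i][j] = iterated integral of dx_i then dx_j along the path
    """
    d = len(path[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for start, end in zip(path, path[1:]):
        delta = [e - s for s, e in zip(start, end)]
        # Chen's identity for appending one straight segment:
        # cross term with the displacement so far, plus the segment's own term.
        for i in range(d):
            for j in range(d):
                level2[i][j] += level1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            level1[i] += delta[i]
    return level1, level2


# Two paths with the same endpoints but different shapes.
right_then_up = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
up_then_right = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

l1_a, l2_a = signature_level2(right_then_up)
l1_b, l2_b = signature_level2(up_then_right)
```

Both paths share the same level-1 term, but their level-2 terms differ, so a kernel built on signatures can tell apart trajectories that a start-and-end comparison would treat as identical.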
ConDistFL: Conditional Distillation for Federated Learning from Partially Annotated Data
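for: The paper proposes a generalized segmentation model covering multiple organs and diseases, built with federated learning (FL), that addresses the lack of fully annotated data.
methods: The approach combines FL with knowledge distillation so that local models can extract knowledge of unlabeled organs and tumors from the global model, using an adequately designed conditional probability representation.
results: Validated on four distinct partially annotated abdominal CT datasets, the method significantly outperforms FedAvg and FedOpt baselines. Performance on an external test dataset also indicates strong generalizability of the jointly trained model.
Abstract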
Developing a generalized segmentation model capable of simultaneously delineating multiple organs and diseases is highly desirable. Federated learning (FL) is a key technology enabling the collaborative development of a model without exchanging training data. However, the limited access to fully annotated training data poses a major challenge to training generalizable models. We propose "ConDistFL", a framework to solve this problem by combining FL with knowledge distillation. Local models can extract the knowledge of unlabeled organs and tumors from partially annotated data from the global model with an adequately designed conditional probability representation. We validate our framework on four distinct partially annotated abdominal CT datasets from the MSD and KiTS19 challenges. The experimental results show that the proposed framework significantly outperforms FedAvg and FedOpt baselines. Moreover, the performance on an external test dataset demonstrates superior generalizability compared to models trained on each dataset separately. Our ablation study suggests that ConDistFL can perform well without frequent aggregation, reducing the communication cost of FL. Our implementation will be available at https://github.com/NVIDIA/NVFlare/tree/dev/research/condist-fl.
摘要
发展一种可同时分割多个器官和疾病的通用模型是非常有优点的。联邦学习(FL)是一种关键技术,它允许合作建立模型,而不需要交换训练数据。然而,有限的完全标注数据对训练通用模型 pose 一个主要挑战。我们提出了 "ConDistFL" 框架,它将 FL 与知识塑造相结合,以解决这个问题。本地模型可以从全球模型中提取未标注器官和肿瘤的知识,使用适当设计的conditional probability表示。我们在四个不同的 partially annotated 腹部 CT 数据集上验证了我们的框架。实验结果表明,我们的框架在 FedAvg 和 FedOpt 基elines 上显著超越了。此外,对于外部测试集的性能表明,我们的模型具有较高的普适性,比单独在每个数据集上训练的模型要好。我们的剖析研究表明,ConDistFL 可以在不经常的聚合情况下表现良好,降低了联邦学习中的通信成本。我们的实现将在 GitHub 上提供,请参考 。
Enhancing Adversarial Robustness in Low-Label Regime via Adaptively Weighted Regularization and Knowledge Distillation
paper_authors: Dongyoon Yang, Insung Kong, Yongdai Kim
for: This paper focuses on semi-supervised adversarial training, where labeled data is scarce.
methods: The authors derive two upper bounds for the robust risk and propose a regularization term for unlabeled data. They also develop a semi-supervised adversarial training algorithm that combines the proposed regularization term with knowledge distillation using a semi-supervised teacher.
results: The authors achieve state-of-the-art performance with significant margins compared to existing algorithms. Specifically, their algorithm with only 8% labeled data is comparable to supervised adversarial training algorithms that use all labeled data in terms of standard and robust accuracies on CIFAR-10.
Abstract
Adversarial robustness is a research area that has recently received a lot of attention in the quest for trustworthy artificial intelligence. However, recent works on adversarial robustness have focused on supervised learning where it is assumed that labeled data is plentiful. In this paper, we investigate semi-supervised adversarial training where labeled data is scarce. We derive two upper bounds for the robust risk and propose a regularization term for unlabeled data motivated by these two upper bounds. Then, we develop a semi-supervised adversarial training algorithm that combines the proposed regularization term with knowledge distillation using a semi-supervised teacher (i.e., a teacher model trained using a semi-supervised learning algorithm). Our experiments show that our proposed algorithm achieves state-of-the-art performance with significant margins compared to existing algorithms. In particular, compared to supervised learning algorithms, performance of our proposed algorithm is not much worse even when the amount of labeled data is very small. For example, our algorithm with only 8% labeled data is comparable to supervised adversarial training algorithms that use all labeled data, both in terms of standard and robust accuracies on CIFAR-10.
摘要
“敌对响应性”是人工智能的研究领域,最近受到了很多关注,以建立可靠的人工智能。然而,现有的工作假设了充足的标签数据,并且专注于监督学习。在本文中,我们研究 semi-supervised adversarial 训练,其中标签数据稀缺。我们 derive two upper bounds for the robust risk,并提出一个鼓励不标签数据的调整项。然后,我们开发了一个 semi-supervised adversarial 训练算法,它结合了提案的调整项和知识传授使用 semi-supervised teacher (即使用 semi-supervised 学习算法训练的教师模型)。我们的实验结果显示,我们的提案算法可以实现现在的最佳性能,并且与已有的算法相比,仅在标签数据非常少时,性能与监督学习算法相似。例如,我们的算法仅使用 8% 的标签数据时,与监督学习算法使用所有标签数据相比,在 CIFAR-10 上的标准和敌对精度都具有显著的优化。”
Backdoor Federated Learning by Poisoning Backdoor-Critical Layers
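for: This paper examines backdoor vulnerabilities in federated learning (FL) on sensitive data and proposes a stealthier backdoor attack designed from the attacker's perspective.
methods: The paper identifies the backdoor-critical (BC) layers of an FL model from the attacker's perspective, then adaptively crafts attacks that target those layers.
results: Experiments show that the proposed BC layer-aware attack successfully backdoors FL under seven state-of-the-art (SOTA) defenses and outperforms the latest backdoor attack methods.
Abstract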
Federated learning (FL) has been widely deployed to enable machine learning training on sensitive data across distributed devices. However, the decentralized learning paradigm and heterogeneity of FL further extend the attack surface for backdoor attacks. Existing FL attack and defense methodologies typically focus on the whole model. None of them recognizes the existence of backdoor-critical (BC) layers, a small subset of layers that dominate the model vulnerabilities. Attacking the BC layers achieves effects equivalent to attacking the whole model but with a far smaller chance of being detected by state-of-the-art (SOTA) defenses. This paper proposes a general in-situ approach that identifies and verifies BC layers from the perspective of attackers. Based on the identified BC layers, we carefully craft a new backdoor attack methodology that adaptively seeks a fundamental balance between attacking effects and stealthiness under various defense strategies. Extensive experiments show that our BC layer-aware backdoor attacks can successfully backdoor FL under seven SOTA defenses with only 10% malicious clients and outperform the latest backdoor attack methods.
摘要
联合学习(FL)已经广泛应用以进行分散设备上的机器学习训练。然而,分散式学习模式和资料多样性对FL的攻击面积增加了额外的隐藏问题。现有的FL攻击和防御方法通常集中在整个模型上。 none of them 认为存在关键层(BC)-一小subset of layers that dominate the model vulnerabilities. 攻击BC层可以实现equivalent 的效果,但是比攻击整个模型要小得多,这使得现有的防御技术更难察觉。 This paper proposes a general in-situ approach that identifies and verifies BC layers from the perspective of attackers. Based on the identified BC layers, we carefully craft a new backdoor attack methodology that adaptively seeks a fundamental balance between attacking effects and stealthiness under various defense strategies. 实验表明,我们的BC层意识的后门攻击可以成功地在七种SOTA防御措施下进行后门攻击,并且比latest backdoor attack methods 高效。
Toward Improving Predictive Risk Modelling for New Zealand’s Child Welfare System Using Clustering Methods
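results: The study finds that clustering can distinguish different groups of children, with some differences between the groups. It also finds that models specific to certain age groups may improve accuracy.
Abstract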
The combination of clinical judgement and predictive risk models crucially assists social workers in identifying children at risk of maltreatment and deciding when authorities should intervene. Predictive risk modelling to address this matter has been initiated by several governmental welfare authorities worldwide, involving administrative data and machine learning algorithms. While previous studies have investigated risk factors relating to child maltreatment, several gaps remain in understanding how such risk factors interact and whether predictive risk models perform differently for children with different features. By integrating Principal Component Analysis and K-Means clustering, this paper presents initial findings of our work on the identification of such features as well as their potential effect on current risk modelling frameworks. This approach allows us to examine existing but as yet unidentified clusters of New Zealand (NZ) children reported with care and protection concerns, to analyse their inner structure, and to evaluate the performance of prediction models trained cluster-wise. We aim to discover the degree of clustering required as an early step in the development of predictive risk models for child maltreatment, and so enhance the accuracy of such models intended for use by child protection authorities. The results from testing LASSO logistic regression models trained on identified clusters revealed no significant difference in their performance. The models, however, performed slightly better for two clusters including younger children. Our results suggest that separate models might need to be developed for children of certain ages to gain additional control over the error rates and to improve model accuracy. While results are promising, more evidence is needed to draw definitive conclusions, and further investigation is necessary.
摘要
临床判断和预测风险模型可以帮助社工分类受护儿童投入风险和决定当局是否介入。预测风险模型在世界各地政府儿童护理机构中已经被开发,使用行政数据和机器学习算法。 although previous studies have investigated child maltreatment risk factors, there are still gaps in understanding how these risk factors interact and whether predictive risk models perform differently for children with different features. 本文使用主成分分析和K-Means聚类分析初步发现了这些特征,以及它们可能对当前风险模型 frameworks 有何影响。这种方法允许我们检查新西兰(NZ)儿童报告了护理和保护问题的现有、未知的群集,以及其内部结构,并评估这些群集训练的预测模型性能。我们的目标是发现预测模型是否需要不同的年龄层分配,以提高预测模型的准确性。我们的结果表明,使用LASSO logistic regression模型训练于特定群集没有显著差异。然而,这些模型在两个年龄较少的群集中表现稍微更好。这些结果表明,可能需要为不同的年龄层开发不同的模型,以提高预测模型的准确性。虽然结果有前途,但需要更多的证据来 draw definitive conclusions,并进一步进行调查。
The Five-Dollar Model: Generating Game Maps and Sprites from Sentence Embeddings
paper_authors: Timothy Merino, Roman Negri, Dipika Rajesh, M Charity, Julian Togelius
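for: This paper proposes a lightweight text-to-image generative model that produces low-dimensional images from an encoded text prompt.
methods: The model uses novel augmentation strategies to improve its performance on limited datasets.
results: The model generates accurate and aesthetically pleasing images while preserving the meaning of the text prompt.
Abstract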
The five-dollar model is a lightweight text-to-image generative architecture that generates low dimensional images from an encoded text prompt. This model can successfully generate accurate and aesthetically pleasing content in low dimensional domains, with limited amounts of training data. Despite the small size of both the model and datasets, the generated images are still able to maintain the encoded semantic meaning of the textual prompt. We apply this model to three small datasets: pixel art video game maps, video game sprite images, and down-scaled emoji images, and apply novel augmentation strategies to improve the performance of our model on these limited datasets. We evaluate our model's performance using the cosine similarity score between text-image pairs, as computed by the CLIP ViT-B/32 model.
摘要
“五块模型”是一种轻量级文本到图像生成架构,可以从编码的文本提示生成低维度图像。这种模型可以在有限的培训数据下生成准确和美观的内容,并且保持文本提示中的含义。我们将这种模型应用于三个小 datasets:像素艺术视频游戏地图、视频游戏填充图像和压缩emoji图像。我们还使用了新的扩展策略来提高我们模型的性能。我们使用 cosine similarity 分数来评估我们模型对文本-图像对的表现。
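The evaluation metric mentioned in the abstract, cosine similarity between a text embedding and an image embedding, reduces to a few lines once the embeddings are available. The sketch below assumes the embeddings have already been produced by some encoder and uses made-up vectors; it does not call CLIP, and the variable names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings standing in for encoder outputs.
text_embedding = [1.0, 2.0, 0.0]
image_embedding = [2.0, 4.0, 0.0]   # same direction as the text embedding
unrelated_embedding = [0.0, 0.0, 3.0]  # orthogonal to the text embedding

aligned = cosine_similarity(text_embedding, image_embedding)
unrelated = cosine_similarity(text_embedding, unrelated_embedding)
```

Because the score depends only on direction, not magnitude, it compares embeddings of different scale on an equal footing.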
Generative Models for Anomaly Detection and Design-Space Dimensionality Reduction in Shape Optimization
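results: The framework improves the convergence of global optimization algorithms and generates only designs with high-quality geometrical features, avoiding the waste of computationally expensive simulations during optimization.
Abstract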
Our work presents a novel approach to shape optimization with a twofold objective: to improve the efficiency of global optimization algorithms and to promote the generation of high-quality designs, free of geometrical anomalies, during the optimization process. This is accomplished by reducing the number of original design variables, defining a new reduced subspace where the geometrical variance is maximized, and modeling the underlying generative process of the data via probabilistic linear latent variable models such as Factor Analysis and Probabilistic Principal Component Analysis. We show that the data follows approximately a Gaussian distribution when the shape modification method is linear and the design variables are sampled uniformly at random, due to the direct application of the central limit theorem. The model uncertainty is measured in terms of Mahalanobis distance, and the paper demonstrates that anomalous designs tend to exhibit a high value of this metric. This enables the definition of a new optimization model where anomalous geometries are penalized and consequently avoided during the optimization loop. The procedure is demonstrated for hull shape optimization of the DTMB 5415 model, extensively used as an international benchmark for shape optimization problems. The global optimization routine is carried out using Bayesian Optimization and the DIRECT algorithm. The numerical results show that the new framework improves the convergence of global optimization algorithms, while only designs with high-quality geometrical features are generated through the optimization routine, thereby avoiding the waste of computationally expensive simulations.
摘要
我们的工作提出了一种新的方法 для优化形状,以提高全球优化算法的效率,同时推出高质量的设计。这是通过减少原始设计变量,定义一个新的减少子空间,使几何异常值最大化,并使用抽象线性latent variable模型,如因素分析和概率主成分分析,来模型数据的生成过程。我们证明数据遵循近似 Gaussian 分布,当shape modification方法是线性的,并且设计变量随机 sampling 时,通过直接应用中心偏移定理。模型不确定性被测量为 Mahalanobis 距离,并且实验表明,异常设计通常具有高值这个指标。这允许定义一个新的优化模型,惩罚异常几何,并在优化迭代中避免异常设计的生成。我们在 DTMB 5415 模型的船体形状优化中进行了实验,使用 Bayesian 优化和 DIRECT 算法。从numerical 结果来看,新的框架可以提高全球优化算法的收敛,同时只有高质量的几何特征被优化算法生成,从而避免了计算成本expensive的simulation 的浪费。
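The Mahalanobis distance used as the anomaly measure can be illustrated with a two-dimensional example. This is a minimal sketch with an assumed mean and covariance, not the paper's latent-variable model; the function name `mahalanobis_2d` and the numbers are hypothetical. The distance is sqrt((x - mean)^T cov^{-1} (x - mean)), so a deviation along a low-variance direction counts for more than the same deviation along a high-variance one.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2D point from a Gaussian (mean, cov).

    cov is a 2x2 matrix given as ((a, b), (c, d)); it must be invertible.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Explicit inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx0 = x[0] - mean[0]
    dx1 = x[1] - mean[1]
    quad = (dx0 * (inv[0][0] * dx0 + inv[0][1] * dx1)
            + dx1 * (inv[1][0] * dx0 + inv[1][1] * dx1))
    return math.sqrt(quad)


# Assumed Gaussian: variance 4 along the first axis, 1 along the second.
mean = (0.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))

along_wide_axis = mahalanobis_2d((2.0, 0.0), mean, cov)    # 1.0
along_narrow_axis = mahalanobis_2d((0.0, 2.0), mean, cov)  # 2.0
```

Both test points are at Euclidean distance 2 from the mean, yet the second is twice as far in Mahalanobis terms; a penalty on this distance would therefore discourage designs that deviate in directions the data rarely explores.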
A Comparative Study on TF-IDF feature Weighting Method and its Analysis using Unstructured Dataset
results: The study finds that TF-IDF feature extraction achieves the highest accuracy (93.81%), precision (94.20%), recall (93.81%), and F1-score (91.99%), outperforming N-Gram feature extraction.
Abstract
Text classification is the process of categorizing text into relevant categories, and its algorithms are at the core of many Natural Language Processing (NLP) applications. Term Frequency-Inverse Document Frequency (TF-IDF) and NLP are the most widely used information retrieval methods in text classification. We have investigated and analyzed the feature weighting method for text classification on unstructured data. The proposed model considered two features, N-Grams and TF-IDF, on the IMDB movie reviews and Amazon Alexa reviews datasets for sentiment analysis. We then used state-of-the-art classifiers to validate the method, namely Support Vector Machine (SVM), Logistic Regression, Multinomial Naive Bayes (Multinomial NB), Random Forest, Decision Tree, and k-nearest neighbors (KNN). Of the two feature extraction methods, TF-IDF features yielded a significant improvement over N-Gram features. TF-IDF achieved the maximum accuracy (93.81%), precision (94.20%), recall (93.81%), and F1-score (91.99%) with the Random Forest classifier.
摘要
文本分类是将文本分类到相关的类别中的过程,其算法是自然语言处理(NLP)的核心。文本频率-反向文档频率(TF-IDF)和NLP是文本检索中最广泛使用的方法。我们已经对文本分类中的特征赋值方法进行了调查和分析。我们提出了基于IMDB电影评论和Amazon Alexa评论数据集的 sentiment analysis 的方法,并使用了当今最佳的分类器来验证方法,即支持向量机(SVM)、概率回归、多项随机森林(Multinomial NB)、随机树、决策树和k-最近邻居(KNN)。从两个特征提取来看,TF-IDF特征的特征提取得到了显著的增加,而不是基于N- Gram。TF-IDF在Random Forest分类器中获得了最大的准确率(93.81%)、精度(94.20%)、回归率(93.81%)和F1分数(91.99%)值。
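For readers unfamiliar with the weighting scheme, here is a minimal pure-Python sketch of one common TF-IDF variant: term frequency normalised by document length, multiplied by the unsmoothed inverse document frequency log(N / df). The abstract does not specify which variant the authors used, and library implementations often apply smoothing, so the formula, the function name `tf_idf`, and the toy reviews are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenised documents.

    docs: list of lists of tokens.
    Returns one dict per document mapping term -> weight, where
    weight = (count / len(doc)) * log(n_docs / document_frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus standing in for tokenised reviews.
reviews = [["good", "movie"], ["bad", "movie"]]
w = tf_idf(reviews)
```

A term that appears in every document ("movie") receives weight zero, while terms that discriminate between documents ("good", "bad") receive positive weight, which is what makes the features useful for sentiment classification.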
Top K Relevant Passage Retrieval for Biomedical Question Answering
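paper_authors: Shashank Gupta
for: This study aims to develop a biomedical question answering system based on Pubmed articles that provides accurate answers.
methods: The study fine-tunes the existing DPR framework to improve the accuracy of the question answering system.
results: Evaluated on a BioASQ question answering dataset, the fine-tuned dense retriever achieves an F1 score of 0.81.
Abstract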
Question answering is a task that answers factoid questions using a large collection of documents. It aims to provide precise answers in response to the user's questions in natural language. Question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. On the web, no single article can provide all the possible answers available on the internet to a user's question. The existing Dense Passage Retrieval (DPR) model has been trained on a Wikipedia dump from Dec. 20, 2018, as the source documents for answering questions. Question answering (QA) has made big strides with several open-domain and machine comprehension systems built using large-scale annotated datasets. However, in the clinical domain, this problem remains relatively unexplored. According to multiple surveys, biomedical questions cannot be answered correctly from Wikipedia articles. In this work, we adapt the existing DPR framework to the biomedical domain and retrieve answers from Pubmed articles, which are a reliable source for answering medical questions. When evaluated on a BioASQ QA dataset, our fine-tuned dense retriever achieves a 0.81 F1 score.
摘要
问答任务是用一个大量文档集来回答用户的问题。其目标是通过自然语言提供精准的答案。问答需要高效的段落检索,以选择可能的上下文,传统上使用TF-IDF或BM25等稀疏 вектор空间模型。在互联网上,没有一篇文章可以提供用户问题的所有可能的答案。我们使用Dec. 20, 2018年的Wikipedia备份作为问答模型的训练数据源。问答(QA)在开放领域和机器理解领域已经做出了很大的进步,但在医疗领域这个问题还很少研究。根据多个调查,医学问题不能准确地从Wikipedia文章中得到答案。在这种情况下,我们对现有的DPR框架进行了修改,并从Pubmed文章中检索答案。当评估在BioASQ QA数据集上时,我们的精度检索器得到了0.81的F1分数。
Scope Loss for Imbalanced Classification and RL Exploration
paper_authors: Hasham Burhani, Xiao Qi Shi, Jonathan Jaegerman, Daniel Balicki
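for: This study aims to demonstrate an equivalence between the reinforcement learning problem and the supervised classification problem and to identify similarities between them.
methods: The study relates the exploration-exploitation trade-off in reinforcement learning to the dataset imbalance problem in supervised classification and examines how each is addressed.
results: The study derives a new loss function, Scope Loss, which prevents performance losses from over-exploitation and dataset imbalance without any tuning. Tested on a set of benchmark reinforcement learning tasks and a skewed classification dataset, Scope Loss outperforms state-of-the-art loss functions.
Abstract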
We demonstrate equivalence between the reinforcement learning problem and the supervised classification problem. We consequently equate the exploration-exploitation trade-off in reinforcement learning to the dataset imbalance problem in supervised classification, and find similarities in how they are addressed. From our analysis of the aforementioned problems we derive a novel loss function for reinforcement learning and supervised classification. Scope Loss, our new loss function, adjusts gradients to prevent performance losses from over-exploitation and dataset imbalances, without the need for any tuning. We test Scope Loss against SOTA loss functions over a basket of benchmark reinforcement learning tasks and a skewed classification dataset, and show that Scope Loss outperforms other loss functions.
摘要
我们展示了强化学习问题和超级vised分类问题之间的等值性。我们遂视探索优化和数据集不均势问题在强化学习和超级vised分类中的相似性,并从这些问题的分析中获得了一个新的损失函数。我们称之为Scope Loss。Scope Loss可以调整 gradients,以避免因过度探索而导致的性能损失和数据集不均势问题,不需要任何调整。我们将Scope Loss与现有的损失函数进行比较,在一签 benchmark 强化学习任务和一个偏斜的分类dataset上进行测试,结果显示Scope Loss可以超越其他损失函数。
Improving Performance of Semi-Supervised Learning by Adversarial Attacks
results: On CIFAR10, three recent SSL algorithms combined with SCAR show significantly improved image classification performance.
Abstract
Semi-supervised learning (SSL) is a setup built upon the realistic assumption that a large amount of labeled data is hard to obtain. In this study, we present a generalized framework, named SCAR, standing for Selecting Clean samples with Adversarial Robustness, for improving the performance of recent SSL algorithms. By adversarially attacking pre-trained models with semi-supervision, our framework shows substantial advances in classifying images. We describe how adversarial attacks successfully select high-confidence unlabeled data to be labeled with current predictions. On CIFAR10, three recent SSL algorithms with SCAR result in significantly improved image classification.
摘要
半有指导学习(SSL)算法是基于现实的假设,即获得大量标注数据困难。在这个研究中,我们提出一种通用框架,名为SCAR,即选择清洁样本并具有对抗鲁棒性。通过对预训练模型进行对抗攻击,我们的框架实现了显著提高图像分类性能。我们介绍了如何使用对抗攻击选择高信度无标记数据,并将当前预测作为标注。在CIFAR10上,三种最近的SSL算法与SCAR结果显著提高图像分类。
Continual Pre-Training of Large Language Models: How to (re)warm your model?
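results: The results show that under continued pre-training, overall model performance gradually improves, even on a large downstream dataset. Performance also differs markedly across pre-training checkpoints and maximum learning rates.
Abstract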
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards efficient continual pre-training, in this work, we examine the effect of different warm-up strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warmup lengths. Our results show that while rewarming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch, even for a large downstream dataset.
摘要
大型语言模型(LLM)通常在数十亿个字符上进行预训练,然后又重新开始预训练。一种更经济高效的解决方案是让这些模型在新数据上进行连续预训练,而不是从scratch重新训练。然而,新数据引入的分布变化通常会导致过去数据的性能下降。为了实现效率的连续预训练,在这项工作中,我们研究了不同的温存策略。我们的假设是,在训练新数据集时,学习率必须重新增加以提高计算效率。我们研究在Pile(上游数据,300亿个字符)预训练后,在SlimPajama(下游数据,297亿个字符)上继续预训练,采用线性温存和cosine衰减时间表。我们在Pythia 410M语言模型架构上进行所有实验,并通过验证plexity来评估性能。我们对不同的预训练检查点、最大学习率和温存长度进行了尝试。我们的结果表明,虽然在重新暖化模型后,初期loss会增加在上游和下游数据上,但在长期来看,它会提高下游性能,超过从scratch训练的模型,即使是大型下游数据集。
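The linear warmup and cosine decay schedule named in the abstract can be written down in a few lines. The sketch below is a generic version of that schedule; the function name `learning_rate`, the choice to warm up from zero, and the specific values for the maximum rate, minimum rate, and step counts are assumptions for illustration, not the settings used in the paper.

```python
import math


def learning_rate(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp: 0 at step 0, max_lr at step == warmup_steps.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings for illustration only.
MAX_LR, MIN_LR = 3e-4, 3e-5
WARMUP, TOTAL = 100, 1000

schedule = [learning_rate(s, MAX_LR, MIN_LR, WARMUP, TOTAL) for s in range(TOTAL + 1)]
```

Rewarming a checkpoint in this picture means restarting the schedule at step 0 on the new dataset; the maximum learning rate and the warmup length are then the two knobs the abstract says were varied.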
Generalization bound for estimating causal effects from observational network data
results: Experiments show that the method can effectively estimate causal effects and provides theoretical support for alleviating complex confounding bias.
Abstract
Estimating causal effects from observational network data is a significant but challenging problem. Existing works in causal inference for observational network data lack an analysis of the generalization bound, which can theoretically provide support for alleviating the complex confounding bias and practically guide the design of learning objectives in a principled manner. To fill this gap, we derive a generalization bound for causal effect estimation in network scenarios by exploiting 1) the reweighting schema based on joint propensity score and 2) the representation learning schema based on Integral Probability Metric (IPM). We provide two perspectives on the generalization bound in terms of reweighting and representation learning, respectively. Motivated by the analysis of the bound, we propose a weighting regression method based on the joint propensity score augmented with representation learning. Extensive experimental studies on two real-world networks with semi-synthetic data demonstrate the effectiveness of our algorithm.
摘要
估计来自观察网络数据的 causal effect 是一个重要 yet 挑战性的问题。现有的 causal inference 在网络数据上lacks 一个分析 generalization bound,可以 theoretically 提供支持来减少复杂的混杂偏见和实践 guide 学习目标的原则性。为了填这个 gap,我们 derivate 一个 generalization bound для causal effect estimation 在网络场景下,通过 exploiting 1) 重量 schema based on joint propensity score 和 2) representation learning schema based on Integral Probability Metric (IPM)。我们提供 two perspectives on the generalization bound in terms of reweighting and representation learning,分别。受 bound 分析的激励,我们提议一种基于 joint propensity score 和 representation learning的重量回归方法。经验性研究在 two real-world networks 上的 semi-synthetic data 表明了我们的算法的有效性。
Understanding CNN Hidden Neuron Activations Using Structured Background Knowledge and Deductive Reasoning
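results: The results show that meaningful labels from large-scale background knowledge can be automatically attached to neurons in the dense layer of a Convolutional Neural Network through a hypothesis and verification process.
Abstract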
A major challenge in Explainable AI is in correctly interpreting activations of hidden neurons: accurate interpretations would provide insights into the question of what a deep learning system has internally detected as relevant on the input, demystifying the otherwise black-box character of deep learning systems. The state of the art indicates that hidden node activations can, in some cases, be interpretable in a way that makes sense to humans, but systematic automated methods that would be able to hypothesize and verify interpretations of hidden neuron activations are underexplored. In this paper, we provide such a method and demonstrate that it provides meaningful interpretations. Our approach is based on using large-scale background knowledge (approximately 2 million classes curated from the Wikipedia concept hierarchy) together with a symbolic reasoning approach called Concept Induction, based on description logics and originally developed for applications in the Semantic Web field. Our results show that we can automatically attach meaningful labels from the background knowledge to individual neurons in the dense layer of a Convolutional Neural Network through a hypothesis and verification process.
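Summary
A major challenge in Explainable AI is correctly interpreting the activations of hidden nodes: accurate interpretations give insight into what a deep learning system has internally detected in the input, demystifying its black-box character. The state of the art is that hidden node activations can, in some cases, be interpreted in a way humans understand, but systematic automated methods for hypothesizing and verifying such interpretations are underexplored. In this paper, we provide such a method and show that it yields meaningful interpretations. Our approach uses large-scale background knowledge (about 2 million classes from the Wikipedia concept hierarchy) together with a symbolic reasoning approach based on description logics called Concept Induction, originally developed for the Semantic Web field. Our results show that, through a hypothesis and verification process, we can automatically attach meaningful labels from the background knowledge to neurons in the dense layer.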
Cooperative Multi-Type Multi-Agent Deep Reinforcement Learning for Resource Management in Space-Air-Ground Integrated Networks
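results: Experiments show that the proposed CMT-MARL method addresses the resource management problem effectively and improves key performance indicators such as the overall transmission rate and the transmission success rate.
Abstract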
The Space-Air-Ground Integrated Network (SAGIN), integrating heterogeneous devices including low earth orbit (LEO) satellites, unmanned aerial vehicles (UAVs), and ground users (GUs), holds significant promise for advancing smart city applications. However, resource management of the SAGIN is a challenge requiring urgent study in that inappropriate resource management will cause poor data transmission, and hence affect the services in smart cities. In this paper, we develop a comprehensive SAGIN system that encompasses five distinct communication links and propose an efficient cooperative multi-type multi-agent deep reinforcement learning (CMT-MARL) method to address the resource management issue. The experimental results highlight the efficacy of the proposed CMT-MARL, as evidenced by key performance indicators such as the overall transmission rate and transmission success rate. These results underscore the potential value and feasibility of future implementation of the SAGIN.
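Summary
The Space-Air-Ground Integrated Network (SAGIN), which integrates heterogeneous devices including low earth orbit (LEO) satellites, unmanned aerial vehicles (UAVs), and ground users (GUs), holds promise for advancing smart city applications. Resource management in the SAGIN is nonetheless a challenge that needs priority study, because inappropriate resource management causes poor data transmission and thereby affects smart city services. This paper develops a comprehensive SAGIN system with five distinct communication links and proposes an efficient cooperative multi-type multi-agent deep reinforcement learning (CMT-MARL) method to address the resource management problem. Experimental results show that the proposed CMT-MARL method is effective, as evidenced by key performance indicators such as the overall transmission rate and the transmission success rate. These results indicate the potential value and feasibility of the SAGIN.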
Fourier neural operator for real-time simulation of 3D dynamic urban microclimate
for: The paper aims to develop a real-time three-dimensional urban wind field simulation method using the Fourier Neural Operator (FNO) network to accelerate the modeling of complex non-linear interactions and system dynamics in urban microclimates.
methods: The paper uses a combination of Computational Fluid Dynamics (CFD) simulation and the FNO network to model urban microclimates. The training and testing data are generated from CFD simulation of the urban area, based on the semi-Lagrangian approach and fractional stepping method.
results: The paper shows that the FNO model can accurately reconstruct the instantaneous spatial velocity field and generalize well on different wind directions. The FNO approach can make predictions within milliseconds on the graphics processing unit, making real-time simulation of 3D dynamic urban microclimate possible.
Abstract
Global urbanization has underscored the significance of urban microclimates for human comfort, health, and building/urban energy efficiency. They profoundly influence building design and urban planning as major environmental impacts. Understanding local microclimates is essential for cities to prepare for climate change and effectively implement resilience measures. However, analyzing urban microclimates requires considering a complex array of outdoor parameters within computational domains at the city scale over a longer period than indoors. As a result, numerical methods like Computational Fluid Dynamics (CFD) become computationally expensive when evaluating the impact of urban microclimates. The rise of deep learning techniques has opened new opportunities for accelerating the modeling of complex non-linear interactions and system dynamics. Recently, the Fourier Neural Operator (FNO) has been shown to be very promising in accelerating solving the Partial Differential Equations (PDEs) and modeling fluid dynamic systems. In this work, we apply the FNO network for real-time three-dimensional (3D) urban wind field simulation. The training and testing data are generated from CFD simulation of the urban area, based on the semi-Lagrangian approach and fractional stepping method to simulate urban microclimate features for modeling large-scale urban problems. Numerical experiments show that the FNO model can accurately reconstruct the instantaneous spatial velocity field. We further evaluate the trained FNO model on unseen data with different wind directions, and the results show that the FNO model can generalize well on different wind directions. More importantly, the FNO approach can make predictions within milliseconds on the graphics processing unit, making real-time simulation of 3D dynamic urban microclimate possible.
Summary
Recently, deep learning techniques have been applied to accelerate the modeling of complex non-linear interactions and system dynamics. One promising approach is the Fourier Neural Operator (FNO), which can accelerate the solution of Partial Differential Equations (PDEs) and model fluid dynamic systems. In this study, we use the FNO network for real-time three-dimensional (3D) urban wind field simulation. The training and testing data are generated from CFD simulations of the urban area, using the semi-Lagrangian approach and fractional stepping method to simulate urban microclimate features for modeling large-scale urban problems. Our numerical experiments show that the FNO model can accurately reconstruct the instantaneous spatial velocity field. We also evaluate the trained FNO model on unseen data with different wind directions, and the results show that the model generalizes well across wind directions. More importantly, the FNO approach can make predictions within milliseconds on a graphics processing unit, making real-time simulation of 3D dynamic urban microclimate possible. This has significant implications for urban planning and design, as well as for the development of more energy-efficient and resilient cities.
Characterization of Human Balance through a Reinforcement Learning-based Muscle Controller
for: This paper aims to explore the use of center of mass (COM) state space and reinforcement learning (RL) to monitor balance capabilities in humans, and to establish balance recovery limits.
methods: The paper employs a musculoskeletal model integrated with a balance controller, trained through RL, to investigate balancing capabilities. The RL framework includes two interconnected neural networks governing balance recovery and muscle coordination, trained using Proximal Policy Optimization (PPO) with reference state initialization, early termination, and multiple training strategies.
results: The paper obtains the final balance recovery (BR) enclosing successful balance recovery trajectories by exploring recovery from the space of random initial COM states (position and velocity) for a trained controller. The BRs are compared with analytical postural stability limits from a linear inverted pendulum model, and the results show a similar trend in successful COM states but more limited ranges in the recoverable areas. The paper also investigates the effect of muscle weakness and neural excitation delay on the BRs, revealing reduced balancing capability in different regions.
Abstract
Balance assessment during physical rehabilitation often relies on rubric-oriented battery tests to score a patient's physical capabilities, leading to subjectivity. While some objective balance assessments exist, they are often limited to tracking the center of pressure (COP), which does not fully capture the whole-body postural stability. This study explores the use of the center of mass (COM) state space and presents a promising avenue for monitoring the balance capabilities in humans. We employ a musculoskeletal model integrated with a balance controller, trained through reinforcement learning (RL), to investigate balancing capabilities. The RL framework consists of two interconnected neural networks governing balance recovery and muscle coordination respectively, trained using Proximal Policy Optimization (PPO) with reference state initialization, early termination, and multiple training strategies. By exploring recovery from random initial COM states (position and velocity) space for a trained controller, we obtain the final BR enclosing successful balance recovery trajectories. Comparing the BRs with analytical postural stability limits from a linear inverted pendulum model, we observe a similar trend in successful COM states but more limited ranges in the recoverable areas. We further investigate the effect of muscle weakness and neural excitation delay on the BRs, revealing reduced balancing capability in different regions. Overall, our approach of learning muscular balance controllers presents a promising new method for establishing balance recovery limits and objectively assessing balance capability in bipedal systems, particularly in humans.
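Summary
Balance assessment during physical rehabilitation often relies on rubric-oriented battery tests to score a patient's physical capabilities, which introduces subjectivity. Some objective balance assessments exist, but they are usually limited to tracking the center of pressure (COP), which does not fully capture whole-body postural stability. This study explores the use of the center of mass (COM) state space to monitor human balance capabilities. We employ a musculoskeletal model integrated with a balance controller trained through reinforcement learning (RL). The RL framework consists of two interconnected neural networks, one governing balance recovery and the other governing muscle coordination, trained with Proximal Policy Optimization (PPO). By exploring recovery from random initial COM states for a trained controller, we obtain the final BR. Comparing the BRs with analytical postural stability limits from a linear inverted pendulum model, we observe a similar trend in successful COM states but more limited ranges in the recoverable areas. We further investigate the effect of muscle weakness and neural excitation delay on the BRs and find reduced balancing capability in different regions. Overall, learning muscular balance controllers may offer a new method for assessing balance capability, particularly in humans.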
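The comparison against a linear inverted pendulum model can be illustrated with a small sketch. It assumes the widely used extrapolated center of mass criterion, XcoM = x + v/ω0 with ω0 = sqrt(g/l), as the analytical stability limit; the entry does not state which formulation the paper uses, and the pendulum length and foot bounds below are hypothetical values chosen only for illustration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def extrapolated_com(x, v, leg_length):
    """Extrapolated centre of mass of a linear inverted pendulum.

    x: COM position (m), v: COM velocity (m/s), leg_length: pendulum length (m).
    """
    omega0 = math.sqrt(G / leg_length)  # pendulum eigenfrequency in 1/s
    return x + v / omega0


def is_recoverable(x, v, leg_length, heel, toe):
    """True if the extrapolated COM lies inside the base of support [heel, toe]."""
    return heel <= extrapolated_com(x, v, leg_length) <= toe


if __name__ == "__main__":
    # Hypothetical setup: 1.0 m pendulum, foot spanning -0.05 m to 0.20 m.
    for x, v in [(0.0, 0.0), (0.05, 0.3), (0.10, 0.6)]:
        print(x, v, is_recoverable(x, v, 1.0, -0.05, 0.20))
```

Sweeping `x` and `v` over a grid with this check traces out the analytical region of recoverable COM states, which is the kind of boundary a learned controller's recoverable states could be compared against.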
PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning
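results: Using the PUG environments and datasets, the paper enables more accurate and reliable evaluation of vision models and offers a more controllable and realistic alternative.
Abstract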
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions), (iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation. Despite such promise, the use of synthetic image data is still limited -- and often played down -- mainly due to their lack of realism. Most works therefore rely on datasets of real images, which have often been scraped from public images on the internet, and may have issues with regards to privacy, bias, and copyright, while offering little control over how objects precisely appear. In this work, we present a path to democratize the use of photorealistic synthetic data: we develop a new generation of interactive environments for representation learning research, that offer both controllability and realism. We use the Unreal Engine, a powerful game engine well known in the entertainment industry, to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning. In this paper, we demonstrate the potential of PUG to enable more rigorous evaluations of vision models.
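Summary
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many samples as needed, (ii) precisely control each scene and yield granular labels and captions, and (iii) precisely control distribution shifts between training and testing to isolate variables of interest. Even so, the use of synthetic image data remains limited, and is often played down, mainly because it lacks realism. Most works therefore rely on real images, which are often scraped from the internet, may raise privacy, bias, and copyright issues, and offer little control over how objects appear. In this paper, we present a path to democratize the use of photorealistic synthetic data: we develop a new generation of interactive environments built on the Unreal Engine game engine to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning research. We demonstrate the potential of PUG to enable more rigorous evaluations of vision models.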
Amortized Global Search for Efficient Preliminary Trajectory Design with Deep Generative Models
results: The method is evaluated on De Jong's 5th function and a low-thrust circular restricted three-body problem, with good results.
Abstract
Preliminary trajectory design is a global search problem that seeks multiple qualitatively different solutions to a trajectory optimization problem. Due to its high dimensionality and non-convexity, and the frequent adjustment of problem parameters, the global search becomes computationally demanding. In this paper, we exploit the clustering structure in the solutions and propose an amortized global search (AmorGS) framework. We use deep generative models to predict trajectory solutions that share similar structures with previously solved problems, which accelerates the global search for unseen parameter values. Our method is evaluated using De Jong's 5th function and a low-thrust circular restricted three-body problem.
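Summary
Preliminary trajectory design is a global search problem that seeks multiple qualitatively different solutions to a trajectory optimization problem. Because of its high dimensionality and non-convexity, and the frequent adjustment of problem parameters, the global search becomes computationally demanding. In this paper, we exploit the clustering structure in the solutions and propose an amortized global search (AmorGS) framework. We use deep generative models to predict trajectory solutions that share similar structures with previously solved problems, which accelerates the global search for unseen parameter values. Our method is evaluated on De Jong's 5th function and a low-thrust circular restricted three-body problem.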
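results: In experiments on image and sound recognition tasks, models with an SCA layer achieve high accuracy and show markedly greater robustness to Auto-PGD attacks, without being trained on adversarially perturbed data.
Abstract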
The vulnerability to adversarial perturbations is a major flaw of Deep Neural Networks (DNNs) that raises questions about their reliability in real-world scenarios. On the other hand, human perception, which DNNs are supposed to emulate, is highly robust to such perturbations, indicating that there may be certain features of human perception that make it robust but are not represented in the current class of DNNs. One such feature is that the activity of biological neurons is correlated and the structure of this correlation tends to be rather rigid over long spans of time, even if it hampers performance and learning. We hypothesize that integrating such constraints on the activations of a DNN would improve its adversarial robustness, and, to test this hypothesis, we have developed the Self-Consistent Activation (SCA) layer, which comprises neurons whose activations are consistent with each other, as they conform to a fixed, but learned, covariability pattern. When evaluated on image and sound recognition tasks, the models with an SCA layer achieved high accuracy, and exhibited significantly greater robustness than multi-layer perceptron models to state-of-the-art Auto-PGD adversarial attacks without being trained on adversarially perturbed data.
PMU measurements based short-term voltage stability assessment of power systems via deep transfer learning
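results: Experiments on the IEEE 39-bus test system show that the proposed method improves model evaluation accuracy by about 20% and adapts well to topological changes. By leveraging the self-attention mechanism of the Transformer model, it offers significant advantages over shallow learning methods and other deep learning-based approaches.
Abstract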
Deep learning has emerged as an effective solution for addressing the challenges of short-term voltage stability assessment (STVSA) in power systems. However, existing deep learning-based STVSA approaches face limitations in adapting to topological changes, sample labeling, and handling small datasets. To overcome these challenges, this paper proposes a novel phasor measurement unit (PMU) measurements-based STVSA method by using deep transfer learning. The method leverages the real-time dynamic information captured by PMUs to create an initial dataset. It employs temporal ensembling for sample labeling and utilizes least squares generative adversarial networks (LSGAN) for data augmentation, enabling effective deep learning on small-scale datasets. Additionally, the method enhances adaptability to topological changes by exploring connections between different faults. Experimental results on the IEEE 39-bus test system demonstrate that the proposed method improves model evaluation accuracy by approximately 20% through transfer learning, exhibiting strong adaptability to topological changes. Leveraging the self-attention mechanism of the Transformer model, this approach offers significant advantages over shallow learning methods and other deep learning-based approaches.
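Summary
Deep learning has become an effective solution for short-term voltage stability assessment (STVSA) in power systems. However, existing deep learning-based STVSA approaches are limited in adapting to topological changes, in sample labeling, and in handling small datasets. To address these challenges, this paper proposes a novel STVSA method based on PMU measurements and deep transfer learning. The method uses the real-time dynamic information captured by PMUs to create an initial dataset. It applies temporal ensembling for sample labeling and least squares generative adversarial networks (LSGAN) for data augmentation, enabling effective deep learning on small-scale datasets. It also improves adaptability to topological changes by exploring connections between different faults. Experimental results on the IEEE 39-bus test system show that the proposed method improves evaluation accuracy by about 20% through transfer learning and adapts well to topological changes. Leveraging the self-attention mechanism of the Transformer model, the approach offers significant advantages over shallow learning methods and other deep learning-based approaches.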
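As an illustration of the least squares objective behind LSGAN-based data augmentation, the sketch below computes the discriminator and generator losses from raw discriminator scores. It assumes the common 0/1 target coding (real samples target 1, generated samples target 0); the entry does not give the paper's network architecture or exact coding, and the function names are hypothetical.

```python
def lsgan_discriminator_loss(d_real, d_fake):
    """Least squares discriminator loss: push real scores to 1 and fake scores to 0."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term


def lsgan_generator_loss(d_fake):
    """Least squares generator loss: push scores of generated samples toward 1."""
    return 0.5 * sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)


if __name__ == "__main__":
    # Made-up discriminator scores for a batch of real and generated samples.
    real_scores = [0.9, 0.8, 1.1]
    fake_scores = [0.2, 0.1, 0.4]
    print(lsgan_discriminator_loss(real_scores, fake_scores))
    print(lsgan_generator_loss(fake_scores))
```

Unlike the cross-entropy GAN loss, the quadratic penalty keeps growing for generated samples that are scored far from the target, which is the usual motivation for the least squares variant.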
The Prospect of Enhancing Large-Scale Heterogeneous Federated Learning with Transformers
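results: Experiments show that Transformer-based FL models perform well in large-scale heterogeneous settings with many data owners, especially as data heterogeneity and data scale increase. An analysis of CKA representation similarity provides further insight into the performance of Transformers.
Abstract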
Federated learning (FL) addresses data privacy concerns by enabling collaborative training of AI models across distributed data owners. Wide adoption of FL faces the fundamental challenges of data heterogeneity and the large scale of data owners involved. In this paper, we investigate the prospect of Transformer-based FL models for achieving generalization and personalization in this setting. We conduct extensive comparative experiments involving FL with Transformers, ResNet, and personalized ResNet-based FL approaches under various scenarios. These experiments consider varying numbers of data owners to demonstrate Transformers' advantages over deep neural networks in large-scale heterogeneous FL tasks. In addition, we analyze the superior performance of Transformers by comparing the Centered Kernel Alignment (CKA) representation similarity across different layers and FL models to gain insight into the reasons behind their promising capabilities.
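Summary
Federated learning (FL) addresses data privacy concerns by enabling collaborative training of AI models across distributed data owners. Wide adoption of FL faces the challenges of data heterogeneity and the large number of data owners involved. In this paper, we investigate Transformer-based FL models for achieving generalization and personalization. We conduct extensive comparative experiments involving FL with Transformers, ResNet, and personalized ResNet-based FL approaches under various scenarios. These experiments cover varying numbers of data owners to demonstrate the advantages of Transformers in large-scale heterogeneous FL tasks. We also analyze the reasons for the strong performance of Transformers by comparing Centered Kernel Alignment (CKA) representation similarity across different layers.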
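The CKA comparison mentioned above can be sketched as follows. This is a minimal pure-Python version of linear CKA, one common variant of Centered Kernel Alignment; the entry does not say which kernel the paper uses, and the activation matrices here are toy values rather than real model features.

```python
import math


def center_columns(m):
    """Subtract the column mean from every column of a row-major matrix."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices with equal row counts."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            s = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += s * s
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices (rows = examples, columns = features)."""
    x, y = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(x, y)
    denominator = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return numerator / denominator


if __name__ == "__main__":
    layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
    layer_b = [[3.0 * v for v in row] for row in layer_a]  # scaled copy of layer_a
    print(linear_cka(layer_a, layer_b))
```

Linear CKA is invariant to isotropic scaling and to orthogonal transformations of the features, which is why it is a convenient way to compare layers of different models. The function divides by zero if either matrix has constant features, so a real implementation would guard against that case.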
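results: The model predicts the post-physical-synthesis delay (98.3%) and area (96.1%) metrics of unseen designs with high accuracy, and it can make predictions at different delay targets. It also shows promising generalization to circuits other than those it was trained on.
Abstract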
In this work, we introduce GraPhSyM, a Graph Attention Network (GATv2) model for fast and accurate estimation of post-physical synthesis circuit delay and area metrics from pre-physical synthesis circuit netlists. Once trained, GraPhSyM provides accurate visibility of final design metrics to early EDA stages, such as logic synthesis, without running the slow physical synthesis flow, enabling global co-optimization across stages. Additionally, the swift and precise feedback provided by GraPhSyM is instrumental for machine-learning-based EDA optimization frameworks. Given a gate-level netlist of a circuit represented as a graph, GraPhSyM utilizes graph structure, connectivity, and electrical property features to predict the impact of physical synthesis transformations such as buffer insertion and gate sizing. When trained on a dataset of 6000 prefix adder designs synthesized at an aggressive delay target, GraPhSyM can accurately predict the post-synthesis delay (98.3%) and area (96.1%) metrics of unseen adders with a fast 0.22s inference time. Furthermore, we illustrate the compositionality of GraPhSyM by employing the model trained on a fixed delay target to accurately anticipate post-synthesis metrics at a variety of unseen delay targets. Lastly, we report promising generalization capabilities of the GraPhSyM model when it is evaluated on circuits different from the adders it was exclusively trained on. The results show the potential for GraPhSyM to serve as a powerful tool for advanced optimization techniques and as an oracle for EDA machine learning frameworks.
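Summary
In this work, we introduce GraPhSyM, a Graph Attention Network (GATv2) model for fast and accurate estimation of post-physical synthesis circuit delay and area metrics. Once trained, GraPhSyM gives early stages such as logic synthesis accurate visibility of final design metrics without running the slow physical synthesis flow, enabling global co-optimization. The fast and accurate feedback it provides is also valuable for machine-learning-based EDA optimization frameworks. Given a circuit netlist represented as a graph, GraPhSyM uses graph structure, connectivity, and electrical property features to predict the impact of physical synthesis transformations such as buffer insertion and gate sizing. When trained on 6000 prefix adder designs synthesized at an aggressive delay target, GraPhSyM accurately predicts the post-synthesis delay (98.3%) and area (96.1%) metrics of unseen adders with a fast 0.22s inference time. We also show the compositionality of GraPhSyM: a model trained on a fixed delay target can accurately predict metrics at unseen delay targets. Finally, we report the model's generalization capabilities on circuits different from the adders it was exclusively trained on. The results suggest that GraPhSyM can serve as a powerful tool for advanced optimization techniques and as an oracle for EDA machine learning frameworks.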
The Compatibility between the Pangu Weather Forecasting Model and Meteorological Operational Data
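results: The results show that the Pangu-Weather model is compatible with operational analyses from various NWP systems, and that improving the quality of global or local initial conditions significantly improves its forecasting performance.
Abstract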
Recently, multiple data-driven models based on machine learning for weather forecasting have emerged. These models are highly competitive in terms of accuracy compared to traditional numerical weather prediction (NWP) systems. In particular, the Pangu-Weather model, which is open source for non-commercial use, has been validated for its forecasting performance by the European Centre for Medium-Range Weather Forecasts (ECMWF) and has recently been published in the journal "Nature". In this paper, we evaluate the compatibility of the Pangu-Weather model with several commonly used NWP operational analyses through case studies. The results indicate that the Pangu-Weather model is compatible with different operational analyses from various NWP systems as the model initial conditions, and it exhibits a relatively stable forecasting capability. Furthermore, we have verified that improving the quality of global or local initial conditions significantly contributes to enhancing the forecasting performance of the Pangu-Weather model.
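Summary
Recently, multiple data-driven machine learning models for weather forecasting have emerged, and they are highly competitive in accuracy with traditional numerical weather prediction (NWP) systems. Among them, the Pangu-Weather model, which is open source for non-commercial use, has had its forecasting performance validated by the European Centre for Medium-Range Weather Forecasts (ECMWF) and was recently published in the journal "Nature". In this paper, we evaluate through case studies the compatibility of the Pangu-Weather model with several commonly used NWP operational analyses. The results show that the Pangu-Weather model is compatible with operational analyses from different NWP systems and exhibits a relatively stable forecasting capability. We also verify that improving the quality of global or local initial conditions significantly improves the forecasting performance of the Pangu-Weather model.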
Optimizing the switching operation in monoclonal antibody production: Economic MPC and reinforcement learning
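results: The experiments show that the sigmoid function approximation and ReLU approximation approaches improve throughput and production efficiency, and that they are more flexible and effective than the traditional 1% product breakthrough rule.
Abstract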
Monoclonal antibodies (mAbs) have emerged as indispensable assets in medicine, and are currently at the forefront of biopharmaceutical product development. However, the growing market demand and the substantial doses required for mAb clinical treatments necessitate significant progress in its large-scale production. Most of the processes for industrial mAb production rely on batch operations, which result in significant downtime. The shift towards a fully continuous and integrated manufacturing process holds the potential to boost product yield and quality, while eliminating the extra expenses associated with storing intermediate products. The integrated continuous mAb production process can be divided into the upstream and downstream processes. One crucial aspect that ensures the continuity of the integrated process is the switching of the capture columns, which are typically chromatography columns operated in a fed-batch manner downstream. Due to the discrete nature of the switching operation, advanced process control algorithms such as economic MPC (EMPC) are computationally difficult to implement. This is because an integer nonlinear program (INLP) needs to be solved online at each sampling time. This paper introduces two computationally-efficient approaches for EMPC implementation, namely, a sigmoid function approximation approach and a rectified linear unit (ReLU) approximation approach. It also explores the application of deep reinforcement learning (DRL). These three methods are compared to the traditional switching approach which is based on a 1% product breakthrough rule and which involves no optimization.
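To make the idea of approximating a discrete switching decision concrete, here is a minimal sketch of two continuous surrogates for a 0/1 step function. It is an illustration under stated assumptions, not the paper's formulation: the entry does not give the actual EMPC constraints, and the steepness parameter `k`, the function names, and the particular ReLU construction are hypothetical choices.

```python
import math


def step(z):
    """Exact discrete switch: 1 when z >= 0, otherwise 0."""
    return 1.0 if z >= 0 else 0.0


def sigmoid_switch(z, k=50.0):
    """Smooth surrogate for the step; larger k gives a sharper transition."""
    t = k * z
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)  # stable branch that avoids overflow for very negative t
    return e / (1.0 + e)


def relu(z):
    return max(0.0, z)


def relu_switch(z, k=50.0):
    """Piecewise-linear surrogate built from two ReLUs: a ramp clipped to [0, 1]."""
    return relu(k * z + 0.5) - relu(k * z - 0.5)


if __name__ == "__main__":
    # z could stand for something like "measured breakthrough minus threshold".
    for z in (-0.1, -0.005, 0.0, 0.005, 0.1):
        print(z, step(z), round(sigmoid_switch(z), 4), round(relu_switch(z), 4))
```

Replacing an integer decision with a continuous surrogate of this kind lets a gradient-based nonlinear solver handle the problem, at the cost of an approximation error near the switching point.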
Spellburst: A Node-based Interface for Exploratory Creative Coding with Natural Language Prompts
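results: The evaluation shows that Spellburst can help artists realize their ideas more quickly and can inform the design of computational creativity tools that bridge semantic and syntactic spaces.
Abstract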
Creative coding tasks are often exploratory in nature. When producing digital artwork, artists usually begin with a high-level semantic construct such as a "stained glass filter" and programmatically implement it by varying code parameters such as shape, color, lines, and opacity to produce visually appealing results. Based on interviews with artists, it can be effortful to translate semantic constructs to program syntax, and current programming tools don't lend well to rapid creative exploration. To address these challenges, we introduce Spellburst, a large language model (LLM) powered creative-coding environment. Spellburst provides (1) a node-based interface that allows artists to create generative art and explore variations through branching and merging operations, (2) expressive prompt-based interactions to engage in semantic programming, and (3) dynamic prompt-driven interfaces and direct code editing to seamlessly switch between semantic and syntactic exploration. Our evaluation with artists demonstrates Spellburst's potential to enhance creative coding practices and inform the design of computational creativity tools that bridge semantic and syntactic spaces.
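Summary
Creative coding tasks are often exploratory. When producing digital artwork, artists usually begin with a high-level semantic construct such as a "stained glass filter" and then vary code parameters such as shape, color, lines, and opacity to produce visually appealing results. According to interviews with artists, translating semantic constructs into program syntax can be difficult, and current programming tools are poorly suited to rapid creative exploration. To address these challenges, we introduce Spellburst, a creative-coding environment powered by a large language model (LLM). Spellburst provides (1) a node-based interface that lets artists create generative art and explore variations through branching and merging operations, (2) expressive prompt-based interactions for semantic programming, and (3) dynamic prompt-driven interfaces and direct code editing for switching easily between semantic and syntactic exploration. Our evaluation shows that Spellburst can enhance creative coding practices and inform the design of computational creativity tools.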
Predicting and explaining nonlinear material response using deep Physically Guided Neural Networks with Internal Variables
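results: The study finds that PGNNIVs can predict both internal and external variables for different materials and can explain the constitutive law of the material, which places the method within eXplainable Artificial Intelligence (XAI).
Abstract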
Nonlinear materials are often difficult to model with classical state model theory because they have a complex and sometimes inaccurate physical and mathematical description or we simply do not know how to describe such materials in terms of relations between external and internal variables. In many disciplines, Neural Network methods have arisen as powerful tools to identify very complex and non-linear correlations. In this work, we use the very recently developed concept of Physically Guided Neural Networks with Internal Variables (PGNNIV) to discover constitutive laws using a model-free approach and training solely with measured force-displacement data. PGNNIVs make a particular use of the physics of the problem to enforce constraints on specific hidden layers and are able to make predictions without internal variable data. We demonstrate that PGNNIVs are capable of predicting both internal and external variables under unseen load scenarios, regardless of the nature of the material considered (linear, with hardening or softening behavior and hyperelastic), unravelling the constitutive law of the material hence explaining its nature altogether, placing the method in what is known as eXplainable Artificial Intelligence (XAI).
ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition
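results: Experiments show that the method reaches 92.81% and 73.02% accuracy on two human action recognition datasets, UCF-101 and HMDB-51, without any video data pre-training. After Kinetics pre-training, accuracy reaches 96.11% and 75.75%.
Abstract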
Video Action Recognition (VAR) is a challenging task due to its inherent complexities. Though different approaches have been explored in the literature, designing a unified framework to recognize a large number of human actions is still a challenging problem. Recently, Multi-Modal Learning (MML) has demonstrated promising results in this domain. In literature, 2D skeleton or pose modality has often been used for this task, either independently or in conjunction with the visual information (RGB modality) present in videos. However, the combination of pose, visual information, and text attributes has not been explored yet, though text and pose attributes independently have been proven to be effective in numerous computer vision tasks. In this paper, we present the first pose augmented Vision-language model (VLM) for VAR. Notably, our scheme achieves an accuracy of 92.81% and 73.02% on two popular human video action recognition benchmark datasets, UCF-101 and HMDB-51, respectively, even without any video data pre-training, and an accuracy of 96.11% and 75.75% after kinetics pre-training.
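Summary
Video Action Recognition (VAR) is a challenging task because of its inherent complexities. Although different approaches have been explored in the literature, designing a unified framework to recognize a large number of human actions remains a challenge. In the literature, the 2D skeleton or pose modality has often been used for this task, either independently or together with visual information (RGB modality). However, the combination of pose, visual information, and text attributes has not yet been explored, even though text and pose attributes have independently proven effective in many computer vision tasks. In this paper, we present the first pose-augmented vision-language model (VLM) for VAR. It achieves 92.81% and 73.02% accuracy on UCF-101 and HMDB-51, two commonly used human action recognition benchmark datasets, without any video data pre-training, and 96.11% and 75.75% after pre-training.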
Advancements In Crowd-Monitoring System: A Comprehensive Analysis of Systematic Approaches and Automation Algorithms: State-of-The-Art
for: This paper focuses on the development and analysis of crowd monitoring systems, specifically exploring the use of artificial intelligence (AI) algorithms and models to enhance their effectiveness and security.
methods: The paper employs a bifurcated approach, comparing vision-based and non-vision-based technologies for crowd monitoring, and examines the efficacy of these methods in different environments and contexts.
results: The paper presents an in-depth analysis of the recent incorporation of AI algorithms and models into automated crowd monitoring systems, highlighting their contemporary applications and effectiveness in various contexts.
Abstract
Growing apprehensions surrounding public safety have captured the attention of numerous governments and security agencies across the globe. These entities are increasingly acknowledging the imperative need for reliable and secure crowd-monitoring systems to address these concerns. Effectively managing human gatherings necessitates proactive measures to prevent unforeseen events or complications, ensuring a safe and well-coordinated environment. The scarcity of research focusing on crowd monitoring systems and their security implications has given rise to a burgeoning area of investigation, exploring potential approaches to safeguard human congregations effectively. Crowd monitoring systems depend on a bifurcated approach, encompassing vision-based and non-vision-based technologies. An in-depth analysis of these two methodologies will be conducted in this research. The efficacy of these approaches is contingent upon the specific environment and temporal context in which they are deployed, as they each offer distinct advantages. This paper endeavors to present an in-depth analysis of the recent incorporation of artificial intelligence (AI) algorithms and models into automated systems, emphasizing their contemporary applications and effectiveness in various contexts.
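Summary
Governments and security agencies around the world are increasingly concerned about public safety and recognize the need for reliable and secure crowd-monitoring systems. Managing human gatherings requires proactive measures to prevent unforeseen events or complications and to ensure a safe and well-coordinated environment. Because research on the security of crowd monitoring systems is scarce, this area of investigation is expanding, exploring ways to protect human gatherings effectively. Crowd monitoring systems follow a bifurcated approach that comprises vision-based and non-vision-based technologies. This research analyzes both methodologies in depth. Their effectiveness depends on the environment and time in which they are deployed, and each has its own advantages. This paper emphasizes the incorporation of artificial intelligence (AI) algorithms and models into automated systems and examines their contemporary applications and effectiveness in various contexts.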
Intelligent Assistant Language Understanding On Device
paper_authors: Cecilia Aas, Hisham Abdelsalam, Irina Belousova, Shruti Bhargava, Jianpeng Cheng, Robert Daland, Joris Driesen, Federico Flego, Tristan Guigue, Anders Johannsen, Partha Lal, Jiarui Lu, Joel Ruben Antony Moniz, Nathan Perkins, Dhivya Piraviperumal, Stephen Pulman, Diarmuid Ó Séaghdha, David Q. Sun, John Torr, Marco Del Vecchio, Jay Wacker, Jason D. Williams, Hong Yu
results: Compared with a server-based assistant, this system is more private, more reliable, faster, more expressive, and more accurate.
Abstract
It has recently become feasible to run personal digital assistants on phones and other personal devices. In this paper we describe a design for a natural language understanding system that runs on device. In comparison to a server-based assistant, this system is more private, more reliable, faster, more expressive, and more accurate. We describe what led to key choices about architecture and technologies. For example, some approaches in the dialog systems literature are difficult to maintain over time in a deployment setting. We hope that sharing learnings from our practical experiences may help inform future work in the research community.
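Summary
It is now possible to run personal digital assistants on phones and other personal devices. In this paper, we describe the design of a natural language understanding system that runs on device. Compared with a server-based assistant, this system is more private, more reliable, faster, more expressive, and more accurate. We describe what led to key choices about architecture and technologies; for example, some approaches in the dialog systems literature can be difficult to maintain in a deployment setting. We hope that sharing our practical experience can help inform future work in the research community.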
On genuine invariance learning without weight-tying
paper_authors: Artem Moskalev, Anna Sepliarskaia, Erik J. Bekkers, Arnold Smeulders
for: investigate properties and limitations of invariance learned by neural networks from the data compared to the genuine invariance achieved through invariant weight-tying.
methods: adopt a group theoretical perspective and analyze invariance learning in neural networks without weight-tying constraints.
results: demonstrate that even when a network learns to correctly classify samples on a group orbit, the underlying decision-making in such a model does not attain genuine invariance, and propose several metrics to quantify learned invariance.
Abstract
In this paper, we investigate properties and limitations of invariance learned by neural networks from the data compared to the genuine invariance achieved through invariant weight-tying. To do so, we adopt a group theoretical perspective and analyze invariance learning in neural networks without weight-tying constraints. We demonstrate that even when a network learns to correctly classify samples on a group orbit, the underlying decision-making in such a model does not attain genuine invariance. Instead, learned invariance is strongly conditioned on the input data, rendering it unreliable if the input distribution shifts. We next demonstrate how to guide invariance learning toward genuine invariance by regularizing the invariance of a model during training. To this end, we propose several metrics to quantify learned invariance: (i) predictive distribution invariance, (ii) logit invariance, and (iii) saliency invariance similarity. We show that the invariance learned with the invariance error regularization closely resembles the genuine invariance of weight-tying models and reliably holds even under a severe input distribution shift. Closer analysis of the learned invariance also reveals the spectral decay phenomenon, in which a network chooses to achieve invariance to a specific transformation group by reducing its sensitivity to any input perturbation.
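Summary
In this paper, we study the properties and limitations of the invariance that neural networks learn from data and compare it with the genuine invariance obtained through invariant weight-tying. We adopt a group-theoretic perspective and analyze invariance learning in neural networks without weight-tying constraints. We show that even when a network correctly classifies samples on a group orbit, its underlying decision-making does not attain genuine invariance. Instead, the learned invariance is strongly conditioned on the input data, which makes it unreliable when the input distribution shifts. We then show how to guide invariance learning toward genuine invariance by regularizing the model's invariance during training. To this end, we propose several metrics to quantify learned invariance: (i) predictive distribution invariance, (ii) logit invariance, and (iii) saliency invariance similarity. We show that invariance learned with invariance error regularization closely resembles the genuine invariance of weight-tying models and holds reliably even under severe input distribution shift. Closer analysis also reveals a spectral decay phenomenon, in which a network achieves invariance to a specific transformation group by reducing its sensitivity to any input perturbation.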
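To make the idea of a logit invariance metric concrete, here is a minimal sketch that measures how much a model's logits change across a group orbit. It is not the paper's definition: the choice of group (the four 90-degree rotations, C4), the mean-squared-difference form of the error, and the two toy models are all assumptions made for illustration.

```python
import numpy as np

def c4_orbit(image):
    """Return the four 90-degree rotations of a 2-D image (its orbit under C4)."""
    return [np.rot90(image, k) for k in range(4)]

def logit_invariance_error(model, image):
    """Mean squared deviation of the logits on rotated inputs from the
    logits on the untransformed input (an assumed form of the metric)."""
    reference = model(image)
    rotated = c4_orbit(image)[1:]  # skip the identity rotation
    errors = [np.mean((model(g_image) - reference) ** 2) for g_image in rotated]
    return float(np.mean(errors))

def invariant_model(image):
    # Hypothetical model: sums over all pixels are unchanged by rotation.
    return np.array([image.sum(), (image ** 2).sum()])

def non_invariant_model(image):
    # Hypothetical model: reads fixed pixel positions, so it depends on orientation.
    return np.array([image[0, 0], image[0].sum()])

image = np.arange(9, dtype=float).reshape(3, 3)
print(logit_invariance_error(invariant_model, image))
print(logit_invariance_error(non_invariant_model, image))
```

A quantity of this kind could in principle be added to a training loss as a regularizer, which is the role the abstract assigns to the invariance error.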
FLIPS: Federated Learning using Intelligent Participant Selection
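results: Our comprehensive experiments show that, compared with random party selection, FLIPS achieves 17-20% higher accuracy with 20-60% lower communication costs, and that these benefits persist in the presence of straggler participants.
Abstract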
This paper presents the design and implementation of FLIPS, a middleware system to manage data and participant heterogeneity in federated learning (FL) training workloads. In particular, we examine the benefits of label distribution clustering on participant selection in federated learning. FLIPS clusters parties involved in an FL training job based on the label distribution of their data apriori, and during FL training, ensures that each cluster is equitably represented in the participants selected. FLIPS can support the most common FL algorithms, including FedAvg, FedProx, FedDyn, FedOpt and FedYogi. To manage platform heterogeneity and dynamic resource availability, FLIPS incorporates a straggler management mechanism to handle changing capacities in distributed, smart community applications. Privacy of label distributions, clustering and participant selection is ensured through a trusted execution environment (TEE). Our comprehensive empirical evaluation compares FLIPS with random participant selection, as well as two other "smart" selection mechanisms - Oort and gradient clustering using two real-world datasets, two different non-IID distributions and three common FL algorithms (FedYogi, FedProx and FedAvg). We demonstrate that FLIPS significantly improves convergence, achieving higher accuracy by 17 - 20 % with 20 - 60 % lower communication costs, and these benefits endure in the presence of straggler participants.
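The two steps named in the abstract, clustering parties by label distribution and then selecting participants so that every cluster is represented, can be sketched as follows. This is an illustration under stated assumptions, not the FLIPS implementation: the k-means variant with farthest-first initialization, the round-robin selection rule, and all function names are hypothetical, and the TEE and straggler handling are omitted.

```python
import numpy as np

def label_distribution(labels, num_classes):
    """Normalized histogram of one party's labels."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return counts / counts.sum()

def farthest_first_centers(points, k):
    """Deterministic initialization: start at the first point, then
    repeatedly add the point farthest from all chosen centers."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[dist.argmax()])
    return np.array(centers, dtype=float)

def cluster_parties(points, k, iters=10):
    """Plain k-means on label distributions (an assumed clustering choice)."""
    centers = farthest_first_centers(points, k)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = points[assign == c].mean(axis=0)
    return assign

def select_participants(assign, num_select, rng):
    """Round-robin over clusters so each cluster is equitably represented."""
    clusters = {c: list(rng.permutation(np.flatnonzero(assign == c)))
                for c in np.unique(assign)}
    selected = []
    while len(selected) < num_select and any(clusters.values()):
        for members in clusters.values():
            if members and len(selected) < num_select:
                selected.append(int(members.pop()))
    return selected

# Six hypothetical parties: three hold only class 0, three hold only class 1.
party_labels = [np.array([0, 0, 0, 0]), np.array([0, 0, 0]), np.array([0, 0]),
                np.array([1, 1, 1]), np.array([1, 1]), np.array([1, 1, 1, 1])]
points = np.array([label_distribution(l, 2) for l in party_labels])
assign = cluster_parties(points, k=2)
selected = select_participants(assign, 2, np.random.default_rng(0))
print(assign, selected)
```

With purely random selection, two parties from the same label cluster could be picked in a round; the round-robin rule rules that out whenever enough clusters are available.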
Scalable and Equitable Math Problem Solving Strategy Prediction in Big Educational Data
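paper_authors: Anup Shakya, Vasile Rus, Deepak Venugopal
for: This work aims to improve students' math learning outcomes with Intelligent Tutoring Systems (ITSs) and Adaptive Instructional Systems (AISs).
methods: Machine learning and AI methods are used to predict students' problem-solving strategies so that the system can adapt to each student. Mastery representations of students (MVec) are learned first, then grouped with a non-parametric clustering algorithm, and finally a deep neural network (DNN) model predicts the strategies.
results: Experiments on large-scale real-world student interaction datasets from MATHia use transformers and Node2Vec to learn MVec and LSTMs to predict strategies. The approach scales to large datasets, achieves high accuracy, and has predictive equality, i.e., it predicts strategies equally well for learners at diverse skill levels.
Abstract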
Understanding a student's problem-solving strategy can have a significant impact on effective math learning using Intelligent Tutoring Systems (ITSs) and Adaptive Instructional Systems (AISs). For instance, the ITS/AIS can better personalize itself to correct specific misconceptions that are indicated by incorrect strategies, specific problems can be designed to improve strategies and frustration can be minimized by adapting to a student's natural way of thinking rather than trying to fit a standard strategy for all. While it may be possible for human experts to identify strategies manually in classroom settings with sufficient student interaction, it is not possible to scale this up to big data. Therefore, we leverage advances in Machine Learning and AI methods to perform scalable strategy prediction that is also fair to students at all skill levels. Specifically, we develop an embedding called MVec where we learn a representation based on the mastery of students. We then cluster these embeddings with a non-parametric clustering method where we progressively learn clusters such that we group together instances that have approximately symmetrical strategies. The strategy prediction model is trained on instances sampled from these clusters. This ensures that we train the model over diverse strategies and also that strategies from a particular group do not bias the DNN model, thus allowing it to optimize its parameters over all groups. Using real world large-scale student interaction datasets from MATHia, we implement our approach using transformers and Node2Vec for learning the mastery embeddings and LSTMs for predicting strategies. We show that our approach can scale up to achieve high accuracy by training on a small sample of a large dataset and also has predictive equality, i.e., it can predict strategies equally well for learners at diverse skill levels.
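Summary
Understanding a student's problem-solving strategy can significantly affect effective learning with Intelligent Tutoring Systems (ITSs) and Adaptive Instructional Systems (AISs). For example, an ITS/AIS can better personalize itself to correct the misconceptions indicated by incorrect strategies, design specific problems to improve strategies, and reduce student frustration. Although human experts may be able to identify strategies manually in classroom settings, this does not scale to big data. We therefore use machine learning and AI methods for scalable strategy prediction that is also fair to students at all skill levels. We develop an embedding called MVec that learns a representation of student mastery. We then cluster these embeddings with a non-parametric clustering method that progressively learns clusters grouping instances with approximately symmetrical strategies. The strategy prediction model is trained on instances sampled from these clusters, which exposes it to diverse strategies and keeps strategies from any one group from biasing the neural network, so it can optimize its parameters over all groups. Using large-scale real-world student interaction data from MATHia, we learn the mastery embeddings with transformers and Node2Vec and predict strategies with LSTMs. The approach scales to large datasets and has predictive equality, i.e., it predicts strategies equally well for learners at diverse skill levels.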
Generative Benchmark Creation for Table Union Search
results: The benchmark created with generative AI models is more challenging than hand-curated benchmarks and permits more detailed analysis of methods, including analysis of false positives and false negatives.
Abstract
Data management has traditionally relied on synthetic data generators to generate structured benchmarks, like the TPC suite, where we can control important parameters like data size and its distribution precisely. These benchmarks were central to the success and adoption of database management systems. But more and more, data management problems are of a semantic nature. An important example is finding tables that can be unioned. While any two tables with the same cardinality can be unioned, table union search is the problem of finding tables whose union is semantically coherent. Semantic problems cannot be benchmarked using synthetic data. Our current methods for creating benchmarks involve the manual curation and labeling of real data. These methods are not robust or scalable and, perhaps more importantly, it is not clear how robust the created benchmarks are. We propose to use generative AI models to create structured data benchmarks for table union search. We present a novel method for using generative models to create tables with specified properties. Using this method, we create a new benchmark containing pairs of tables that are both unionable and non-unionable but related. We thoroughly evaluate recent existing table union search methods over existing benchmarks and our new benchmark. We also present and evaluate a new table search method based on recent large language models over all benchmarks. We show that the new benchmark is more challenging for all methods than hand-curated benchmarks: the top-performing method achieves a Mean Average Precision of around 60%, over 30% less than its performance on existing manually created benchmarks. We examine why this is the case and show that the new benchmark permits more detailed analysis of methods, including a study of both false positives and false negatives that was not possible with existing benchmarks.
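Summary
Data management has traditionally relied on synthetic data generators to produce structured benchmarks, such as the TPC suite, in which important parameters like data size and distribution can be controlled. These benchmarks were central to the success and adoption of database management systems. Increasingly, however, data management problems are semantic in nature. An important example is finding tables that can be unioned: although any two tables with the same cardinality can be unioned, table union search is the problem of finding tables whose union is semantically coherent. Semantic problems cannot be benchmarked with synthetic data. Current benchmark creation relies on manual curation and labeling of real data, which is neither robust nor scalable, and it is unclear how robust the resulting benchmarks are. We propose using generative AI models to create structured data benchmarks for table union search and present a novel method for generating tables with specified properties. With it, we create a new benchmark containing pairs of tables that are unionable as well as pairs that are non-unionable but related. We thoroughly evaluate recent table union search methods on existing benchmarks and on the new one, and we also present and evaluate a new table search method based on recent large language models. The new benchmark is more challenging than hand-curated ones: the top-performing method reaches a Mean Average Precision of around 60%, over 30% less than on existing manually created benchmarks. We examine why and show that the new benchmark permits more detailed analysis of methods, including a study of false positives and false negatives that was not possible with existing benchmarks.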
Exploiting Generalization in Offline Reinforcement Learning via Unseen State Augmentations
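results: The paper improves offline reinforcement learning performance with a novel unseen state augmentation strategy and finds that the strategy lowers the average dataset Q-value estimates, i.e., yields more conservative Q-value estimates.
Abstract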
Offline reinforcement learning (RL) methods strike a balance between exploration and exploitation by conservative value estimation -- penalizing values of unseen states and actions. Model-free methods penalize values at all unseen actions, while model-based methods are able to further exploit unseen states via model rollouts. However, such methods are handicapped in their ability to find unseen states far away from the available offline data due to two factors -- (a) very short rollout horizons in models due to cascading model errors, and (b) model rollouts originating solely from states observed in offline data. We relax the second assumption and present a novel unseen state augmentation strategy to allow exploitation of unseen states where the learned model and value estimates generalize. Our strategy finds unseen states by value-informed perturbations of seen states followed by filtering out states with epistemic uncertainty estimates too high (high error) or too low (too similar to seen data). We observe improved performance in several offline RL tasks and find that our augmentation strategy consistently leads to overall lower average dataset Q-value estimates i.e. more conservative Q-value estimates than a baseline.
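Summary
Offline reinforcement learning (RL) methods balance exploration and exploitation through conservative value estimation, penalizing the values of unseen states and actions. Model-free methods penalize values at all unseen actions, while model-based methods can further exploit unseen states through model rollouts. These methods are nonetheless limited by two factors: (a) rollout horizons are very short because model errors cascade, and (b) rollouts start only from states observed in the offline data. We relax the second assumption and present a novel unseen state augmentation strategy that allows exploitation of unseen states where the learned model and value estimates generalize. The strategy applies value-informed perturbations to seen states and then filters out states whose epistemic uncertainty is too high (high error) or too low (too similar to seen data). We observe improved performance on several offline RL tasks and find that the augmentation strategy consistently yields lower average dataset Q-value estimates, i.e., more conservative estimates than a baseline.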
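The perturb-then-filter recipe described above can be sketched in a few lines. This is a toy illustration, not the paper's algorithm: moving states along the value gradient is one assumed reading of "value-informed perturbation", and distance to the nearest seen state is a hypothetical stand-in for an epistemic uncertainty estimate (an ensemble's disagreement would be a more typical choice).

```python
import numpy as np

def nearest_seen_distance(candidates, seen):
    """Stand-in uncertainty: distance from each candidate to its nearest seen state."""
    dists = np.linalg.norm(candidates[:, None, :] - seen[None, :, :], axis=2)
    return dists.min(axis=1)

def augment_states(states, value_grad, uncertainty, step, low, high):
    """Perturb seen states along the value gradient, then keep only candidates
    whose uncertainty lies in [low, high]."""
    candidates = states + step * value_grad(states)
    u = uncertainty(candidates)
    keep = (u >= low) & (u <= high)  # drop too-uncertain and too-familiar states
    return candidates[keep]

# Toy setup: V(s) = -||s||^2, so the gradient is -2s (an assumption for the demo).
seen = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
value_grad = lambda s: -2.0 * s
uncertainty = lambda c: nearest_seen_distance(c, seen)

augmented = augment_states(seen, value_grad, uncertainty, step=0.25, low=0.1, high=0.75)
print(augmented)
```

In this toy run the candidate at the origin is rejected as too similar to seen data and the candidate at (2, 0) as too uncertain, leaving only the one in between.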
Evaluating and Explaining Large Language Models for Code Using Syntactic Structures
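results: An empirical evaluation on 12 popular LLMs for code and a user study of an ASTxplainer-derived visualization illustrate the potential and usefulness of ASTxplainer: it can provide insight into model effectiveness and help users understand predictions.
Abstract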
Large Language Models (LLMs) for code are a family of high-parameter, transformer-based neural networks pre-trained on massive datasets of both natural and programming languages. These models are rapidly being employed in commercial AI-based developer tools, such as GitHub CoPilot. However, measuring and explaining their effectiveness on programming tasks is a challenging proposition, given their size and complexity. The methods for evaluating and explaining LLMs for code are inextricably linked. That is, in order to explain a model's predictions, they must be reliably mapped to fine-grained, understandable concepts. Once this mapping is achieved, new methods for detailed model evaluations are possible. However, most current explainability techniques and evaluation benchmarks focus on model robustness or individual task performance, as opposed to interpreting model predictions. To this end, this paper introduces ASTxplainer, an explainability method specific to LLMs for code that enables both new methods for LLM evaluation and visualizations of LLM predictions that aid end-users in understanding model predictions. At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes, by extracting and aggregating normalized model logits within AST structures. To demonstrate the practical benefit of ASTxplainer, we illustrate the insights that our framework can provide by performing an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects. Additionally, we perform a user study examining the usefulness of an ASTxplainer-derived visualization of model predictions aimed at enabling model users to explain predictions. The results of these studies illustrate the potential for ASTxplainer to provide insights into LLM effectiveness, and aid end-users in understanding predictions.
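Summary
Large Language Models (LLMs) for code are a family of high-parameter, transformer-based neural networks pre-trained on massive datasets of natural and programming languages. They are rapidly being adopted in commercial AI-based developer tools such as GitHub CoPilot. Measuring and explaining their effectiveness on programming tasks is challenging, however, because of their size and complexity. Evaluation and explanation are closely linked: to explain a model's predictions, the predictions must be reliably mapped to fine-grained, understandable concepts, and once that mapping exists, new methods of detailed evaluation become possible. This paper introduces ASTxplainer, an explainability method specific to LLMs for code that enables both new evaluation methods and visualizations that help end users understand model predictions. At its core, ASTxplainer automatically aligns token predictions with AST nodes by extracting and aggregating normalized model logits within AST structures. To demonstrate its practical benefit, we carry out an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects. We also conduct a user study on whether an ASTxplainer-derived visualization helps users explain model predictions. The results illustrate the potential of ASTxplainer to provide insight into LLM effectiveness and to help users understand predictions.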
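results: Experiments show that the approach effectively recognizes and leverages query intent, outperforming popular sentence transformer models and achieving a Pearson correlation of 0.85 for query similarity. The results indicate that historical search behavior data and model training can be used to recognize and leverage query intent, improving user experience and business outcomes.
Abstract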
Search query variation poses a challenge in e-commerce search, as equivalent search intents can be expressed through different queries with surface-level differences. This paper introduces a framework to recognize and leverage query equivalence to enhance searcher and business outcomes. The proposed approach addresses three key problems: mapping queries to vector representations of search intent, identifying nearest neighbor queries expressing equivalent or similar intent, and optimizing for user or business objectives. The framework utilizes both surface similarity and behavioral similarity to determine query equivalence. Surface similarity involves canonicalizing queries based on word inflection, word order, compounding, and noise words. Behavioral similarity leverages historical search behavior to generate vector representations of query intent. An offline process is used to train a sentence similarity model, while an online nearest neighbor approach supports processing of unseen queries. Experimental evaluations demonstrate the effectiveness of the proposed approach, outperforming popular sentence transformer models and achieving a Pearson correlation of 0.85 for query similarity. The results highlight the potential of leveraging historical behavior data and training models to recognize and utilize query equivalence in e-commerce search, leading to improved user experiences and business outcomes. Further advancements and benchmark datasets are encouraged to facilitate the development of solutions for this critical problem in the e-commerce domain.
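Summary
Search query variation is a challenge in e-commerce search because the same search intent can be expressed through different queries. This paper introduces a framework to recognize and leverage query equivalence in order to improve searcher and business outcomes. The approach addresses three key problems: mapping queries to vector representations of search intent, identifying nearest-neighbor queries that express equivalent or similar intent, and optimizing for user or business objectives. The framework determines query equivalence using both surface similarity and behavioral similarity. Surface similarity canonicalizes queries based on word inflection, word order, compounding, and noise words. Behavioral similarity uses historical search behavior to generate vector representations of query intent. An offline process trains a sentence similarity model, and an online nearest-neighbor approach handles unseen queries. Experiments demonstrate the effectiveness of the approach, which outperforms popular sentence transformer models and achieves a Pearson correlation of 0.85 for query similarity. The results show that historical behavior data and trained models can be used to recognize and leverage query equivalence, improving user experience and business outcomes. Further advances and benchmark datasets are encouraged to support solutions to this problem in the e-commerce domain.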
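The surface-similarity step, canonicalizing queries so that superficially different queries map to the same key, can be sketched as below. The noise-word list, the crude suffix-stripping rule for inflection, and the function names are illustrative assumptions rather than the paper's rules, and compounding is left out for brevity.

```python
# Hypothetical noise-word list; a real system would curate this from data.
NOISE_WORDS = {"for", "the", "a", "an", "with", "of"}

def singularize(word):
    """Very rough inflection handling: strip a trailing 's' (but not 'ss')."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def canonicalize(query):
    """Lowercase, drop noise words, normalize inflection, and sort tokens
    so that word order no longer matters."""
    tokens = [t for t in query.lower().split() if t not in NOISE_WORDS]
    return " ".join(sorted(singularize(t) for t in tokens))

print(canonicalize("Shoes for Men"))
print(canonicalize("mens shoe"))
print(canonicalize("dress"))
```

Queries that share a canonical form can be treated as surface-equivalent; the behavioral-similarity model would then cover equivalences that no string rule can capture.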
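results: In the OCEAN analysis, "openness" showed linguistic ambiguity, "conscientiousness" and "neuroticism" were distinctly evoked, and "extroversion" and "agreeableness" overlapped notably while remaining separate from the other traits. The findings underscore GPT's versatility and adaptability, but they also point to problems: the opacity of some training techniques and the rapid progress of LLMs cause proposed metrics to degrade quickly.
Abstract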
The research explores the steerability of Large Language Models (LLMs), particularly OpenAI's ChatGPT iterations. By employing a behavioral psychology framework called OCEAN (Openness, Conscientiousness, Extroversion, Agreeableness, Neuroticism), we quantitatively gauged the model's responsiveness to tailored prompts. When asked to generate text mimicking an extroverted personality, OCEAN scored the language alignment to that behavioral trait. In our analysis, while "openness" presented linguistic ambiguity, "conscientiousness" and "neuroticism" were distinctly evoked in the OCEAN framework, with "extroversion" and "agreeableness" showcasing a notable overlap yet distinct separation from other traits. Our findings underscore GPT's versatility and ability to discern and adapt to nuanced instructions. Furthermore, historical figure simulations highlighted the LLM's capacity to internalize and project instructible personas, precisely replicating their philosophies and dialogic styles. However, the rapid advancements in LLM capabilities and the opaque nature of some training techniques make metric proposals degrade rapidly. Our research emphasizes a quantitative role to describe steerability in LLMs, presenting both its promise and areas for further refinement in aligning its progress to human intentions.
MCTS guided Genetic Algorithm for optimization of neural network weights
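results: The study indicates that combining genetic algorithms with a Monte Carlo Tree Search strategy can be applied to the optimization of neural networks, searching the genetic tree for the best result generated by the genetic algorithm.
Abstract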
In this research, we investigate the possibility of applying a search strategy to genetic algorithms to explore the entire genetic tree structure. Several methods aid in performing tree searches; however, simpler algorithms such as breadth-first, depth-first, and iterative techniques are computation-heavy and often result in a long execution time. Adversarial techniques are often the preferred mechanism when performing a probabilistic search, yielding optimal results more quickly. The problem we are trying to tackle in this paper is the optimization of neural networks using genetic algorithms. Genetic algorithms (GA) form a tree of possible states and provide a mechanism for rewards via the fitness function. Monte Carlo Tree Search (MCTS) has proven to be an effective tree search strategy given states and rewards; therefore, we will combine these approaches to optimally search for the best result generated with genetic algorithms.
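Summary
In this research, we investigate applying a search strategy to genetic algorithms in order to explore the entire genetic tree structure. Many methods can perform tree searches, but simple algorithms such as breadth-first, depth-first, and iterative techniques are computation-heavy and often take a long time to run. For probabilistic search, adversarial techniques are usually the preferred mechanism and reach optimal results more quickly. The problem addressed in this paper is the optimization of neural networks using genetic algorithms. Genetic algorithms (GA) form a tree of possible states and provide a reward mechanism through the fitness function. Monte Carlo Tree Search (MCTS) has proven to be an effective tree search strategy given states and rewards, so we combine the two approaches to search for the best result generated with genetic algorithms.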
Revisiting Prompt Engineering via Declarative Crowdsourcing
results: Preliminary case studies demonstrate the promise of declarative prompt engineering for LLMs on sorting, entity resolution, and imputation tasks.
Abstract
Large language models (LLMs) are incredibly powerful at comprehending and generating data in the form of text, but are brittle and error-prone. There has been an advent of toolkits and recipes centered around so-called prompt engineering, the process of asking an LLM to do something via a series of prompts. However, for LLM-powered data processing workflows, in particular, optimizing for quality, while keeping cost bounded, is a tedious, manual process. We put forth a vision for declarative prompt engineering. We view LLMs like crowd workers and leverage ideas from the declarative crowdsourcing literature (including leveraging multiple prompting strategies, ensuring internal consistency, and exploring hybrid LLM/non-LLM approaches) to make prompt engineering a more principled process. Preliminary case studies on sorting, entity resolution, and imputation demonstrate the promise of our approach.
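Summary
Large language models (LLMs) are extremely capable at understanding and generating text, but they are brittle and error-prone. With the rise of prompt engineering, attention has turned to how prompts can be used to get an LLM to carry out a task. For LLM-powered data processing workflows, however, optimizing for quality while keeping cost bounded is a tedious, manual process. We put forward a vision for declarative prompt engineering. We treat LLMs like crowd workers and draw on ideas from the declarative crowdsourcing literature, including multiple prompting strategies, internal consistency checks, and hybrid LLM/non-LLM approaches, to make prompt engineering more principled. Preliminary case studies on sorting, entity resolution, and imputation show the promise of the approach.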
Search Engine and Recommendation System for the Music Industry built with JinaAI
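results: An effective search engine and recommendation system is built that helps users quickly find the songs they want while maintaining and improving the quality of the search engine's performance.
Abstract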
One of the most intriguing open tasks in the music industry is the development of search engines and recommendation-based systems. Studies have shown a drastic downturn in the search engine field, owing to factors such as speed, accuracy, and the format of the data given for querying. People often have difficulty finding a song from its title alone, so a solution is proposed that takes a single query as input and matches it against the lyrics of the songs in the database. Building a user-friendly search engine therefore requires cutting-edge tools. Jina AI, an MLOps framework for building neural search engines, is used so that the user obtains accurate results, and it helps maintain and enhance the quality of the search engine's performance for a given query. The result is an effective search engine and recommendation system for the music industry, built with Jina AI.
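Summary
The development of search engines and recommendation systems for the music industry is an intriguing topic. Studies indicate that the search engine field has suffered from factors such as speed and accuracy, which leads to poor search results. One solution is to perform the search from a single query input and match it against the lyrics of the songs in the database. Adopting advanced tools is therefore important for building a user-friendly search engine. Jina AI is an MLOps framework for building neural search engines that return accurate results, and it helps maintain and improve the quality of the search engine's performance. The outcome is an effective search engine and recommendation system for the music industry, built with Jina AI.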
The Copycat Perceptron: Smashing Barriers Through Collective Learning
paper_authors: Giovanni Catania, Aurélien Decelle, Beatriz Seoane
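for: Study the equilibrium properties of a model of coupled binary perceptrons in the teacher-student scenario.
methods: Use a suitable learning rule together with an explicit ferromagnetic coupling proportional to the Hamming distance between the students' weights.
results: At nonzero temperature, the coupling of replicas shifts the phase diagram to smaller values of $\alpha$. This suggests that, at a fixed fraction of reviewed examples, the free energy landscape becomes smoother around the solution, so that local update algorithms such as Simulated Annealing can reach it more easily.
Abstract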
We characterize the equilibrium properties of a model of $y$ coupled binary perceptrons in the teacher-student scenario, subject to a suitable learning rule, with an explicit ferromagnetic coupling proportional to the Hamming distance between the students' weights. In contrast to recent works, we analyze a more general setting in which a thermal noise is present that affects the generalization performance of each student. Specifically, in the presence of a nonzero temperature, which assigns nonzero probability to configurations that misclassify samples with respect to the teacher's prescription, we find that the coupling of replicas leads to a shift of the phase diagram to smaller values of $\alpha$: This suggests that the free energy landscape gets smoother around the solution with good generalization (i.e., the teacher) at a fixed fraction of reviewed examples, which allows local update algorithms such as Simulated Annealing to reach the solution before the dynamics gets frozen. Finally, from a learning perspective, these results suggest that more students (in this case, with the same amount of data) are able to learn the same rule when coupled together with a smaller amount of data.
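Summary
We study the equilibrium properties of a model of $y$ coupled binary perceptrons in the teacher-student scenario, subject to a suitable learning rule, with an explicit ferromagnetic coupling proportional to the Hamming distance between the students' weights. Unlike previous work, we analyze a more general setting in which thermal noise affects the generalization performance of each student. At nonzero temperature, where configurations that misclassify samples relative to the teacher have nonzero probability, the coupling of replicas shifts the phase diagram to smaller values of $\alpha$. This suggests that the free energy landscape becomes smoother around the solution with good generalization, which lets local update algorithms such as Simulated Annealing reach the solution before the dynamics freezes. From a learning perspective, these results suggest that more students, each with the same amount of data, can learn the same rule with less data when they are coupled together.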
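To illustrate what a coupling "proportional to the Hamming distance between the students' weights" looks like, the following sketch evaluates a toy energy for a set of binary students. The exact Hamiltonian is not given in the abstract, so the form used here (total training errors plus `gamma` times the pairwise Hamming distances) and the parameter name `gamma` are assumptions.

```python
import numpy as np

def training_errors(w, X, labels):
    """Number of examples a student with weights w classifies differently from the teacher."""
    return int(np.sum(np.sign(X @ w) != labels))

def hamming(w1, w2):
    """Number of weight components on which two binary students disagree."""
    return int(np.sum(w1 != w2))

def coupled_energy(students, X, labels, gamma):
    """Assumed energy: sum of per-student errors plus a ferromagnetic term
    that penalizes disagreement between every pair of students."""
    error_term = sum(training_errors(w, X, labels) for w in students)
    coupling_term = sum(hamming(students[a], students[b])
                        for a in range(len(students))
                        for b in range(a + 1, len(students)))
    return error_term + gamma * coupling_term

rng = np.random.default_rng(0)
n_weights, n_examples = 5, 8          # odd n_weights avoids zero pre-activations
teacher = np.ones(n_weights)
X = rng.choice([-1.0, 1.0], size=(n_examples, n_weights))
labels = np.sign(X @ teacher)

aligned = coupled_energy([teacher, teacher.copy()], X, labels, gamma=0.5)
opposed = coupled_energy([teacher, -teacher], X, labels, gamma=0.5)
print(aligned, opposed)
```

A Metropolis or Simulated Annealing sampler would propose single weight flips and accept them according to the change in this energy at a given temperature; the coupling term pulls the students toward a common solution.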
Randomized algorithms for precise measurement of differentially-private, personalized recommendations
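results: Offline experiments, with advertising as the example application, quantify how the proposed privacy-preserving personalized recommendation algorithm affects key metrics related to user experience, advertiser value, and platform revenue, relative to both (private) non-personalized and non-private personalized implementations.
Abstract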
Personalized recommendations form an important part of today's internet ecosystem, helping artists and creators to reach interested users, and helping users to discover new and engaging content. However, many users today are skeptical of platforms that personalize recommendations, in part due to historically careless treatment of personal data and data privacy. Now, businesses that rely on personalized recommendations are entering a new paradigm, where many of their systems must be overhauled to be privacy-first. In this article, we propose an algorithm for personalized recommendations that facilitates both precise and differentially-private measurement. We consider advertising as an example application, and conduct offline experiments to quantify how the proposed privacy-preserving algorithm affects key metrics related to user experience, advertiser value, and platform revenue compared to the extremes of both (private) non-personalized and non-private, personalized implementations.
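Summary
Personalized recommendations are an important part of today's internet ecosystem: they help artists and creators reach interested users and help users discover new and engaging content. Many users are nevertheless skeptical of platforms that personalize recommendations, partly because personal data and privacy have historically been handled carelessly. Businesses that rely on personalized recommendations are now entering a new paradigm in which many of their systems must be redesigned to be privacy-first. In this article, we propose an algorithm for personalized recommendations that supports measurement that is both precise and differentially private. Using advertising as an example application, we run offline experiments to quantify how the proposed privacy-preserving algorithm affects user experience, advertiser value, and platform revenue compared with non-personalized and non-private personalized implementations.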
SurvBeX: An explanation method of the machine learning survival models based on the Beran estimator
paper_authors: Lev V. Utkin, Danila Y. Eremenko, Andrei V. Konstantinov
for: The paper proposes a new method called SurvBeX to interpret predictions of machine learning survival black-box models.
methods: The method uses a modified Beran estimator as a surrogate explanation model, and generates many points in a local area around an example of interest to compute the survival function of the black-box model and the Beran estimator.
results: The paper demonstrates the efficiency of SurvBeX through numerical experiments with synthetic and real survival data, and compares the method with SurvLIME and SurvSHAP. The code implementing SurvBeX is available online.
Abstract
An explanation method called SurvBeX is proposed to interpret predictions of the machine learning survival black-box models. The main idea behind the method is to use the modified Beran estimator as the surrogate explanation model. Coefficients, incorporated into Beran estimator, can be regarded as values of the feature impacts on the black-box model prediction. Following the well-known LIME method, many points are generated in a local area around an example of interest. For every generated example, the survival function of the black-box model is computed, and the survival function of the surrogate model (the Beran estimator) is constructed as a function of the explanation coefficients. In order to find the explanation coefficients, it is proposed to minimize the mean distance between the survival functions of the black-box model and the Beran estimator produced by the generated examples. Many numerical experiments with synthetic and real survival data demonstrate the SurvBeX efficiency and compare the method with the well-known method SurvLIME. The method is also compared with the method SurvSHAP. The code implementing SurvBeX is available at: https://github.com/DanilaEremenko/SurvBeX
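Summary
An explanation method called SurvBeX is proposed to interpret the predictions of machine learning survival black-box models. The main idea is to use a modified Beran estimator as the surrogate explanation model. The coefficients incorporated into the Beran estimator can be regarded as the impacts of the features on the black-box prediction. Following the LIME method, many points are generated in a local area around the example of interest. For each generated point, the survival function of the black-box model is computed, and the survival function of the Beran estimator is constructed as a function of the explanation coefficients. The explanation coefficients are found by minimizing the mean distance between the survival functions of the black-box model and the Beran estimator over the generated points. Numerical experiments with synthetic and real survival data demonstrate the efficiency of SurvBeX and compare it with SurvLIME and with SurvSHAP. The code for SurvBeX is available at: https://github.com/DanilaEremenko/SurvBeX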
Dimensionality Reduction for Improving Out-of-Distribution Detection in Medical Image Segmentation
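methods: The Mahalanobis distance is applied post hoc to the bottleneck features of a segmentation model, after their dimensionality is reduced with principal component analysis, to detect out-of-distribution images.
results: Experiments show that this approach detects out-of-distribution images with high performance and minimal computational load.
Abstract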
Clinically deployed segmentation models are known to fail on data outside of their training distribution. As these models perform well on most cases, it is imperative to detect out-of-distribution (OOD) images at inference to protect against automation bias. This work applies the Mahalanobis distance post hoc to the bottleneck features of a Swin UNETR model that segments the liver on T1-weighted magnetic resonance imaging. By reducing the dimensions of the bottleneck features with principal component analysis, OOD images were detected with high performance and minimal computational load.
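Summary
Clinically deployed segmentation models are known to fail on data outside their training distribution. Because these models perform well in most cases, detecting out-of-distribution (OOD) images at inference time is essential to guard against automation bias. This work applies the Mahalanobis distance post hoc to the bottleneck features of a Swin UNETR model. By reducing the dimensionality of the bottleneck features with principal component analysis, OOD images are detected with high performance and a relatively small computational load.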
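The detection pipeline described above (reduce the feature dimension with PCA, then score a new sample by its Mahalanobis distance to the training features) can be sketched with NumPy alone. The random vectors standing in for bottleneck features, the choice of four components, and the function names are assumptions for illustration; the paper's features come from a Swin UNETR model.

```python
import numpy as np

def fit_pca(features, n_components):
    """Return the feature mean and the top principal directions (rows)."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def fit_detector(train_features, n_components):
    """Fit PCA, then a Gaussian (mean and inverse covariance) in the reduced space."""
    mean, components = fit_pca(train_features, n_components)
    reduced = (train_features - mean) @ components.T
    mu = reduced.mean(axis=0)
    cov_inv = np.linalg.pinv(np.atleast_2d(np.cov(reduced, rowvar=False)))
    return mean, components, mu, cov_inv

def ood_score(x, detector):
    """Mahalanobis distance of one feature vector in the PCA-reduced space."""
    mean, components, mu, cov_inv = detector
    d = (x - mean) @ components.T - mu
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 16))    # stand-in for in-distribution bottleneck features
detector = fit_detector(train, n_components=4)

in_sample = train[0]
# Hypothetical OOD sample: shifted far along the first principal direction.
far_sample = train.mean(axis=0) + 100.0 * detector[1][0]

print(ood_score(in_sample, detector), ood_score(far_sample, detector))
```

In practice a threshold on the score, chosen on held-out in-distribution data, would flag images for manual review. Reducing the dimension first keeps the covariance estimate well conditioned and the scoring cheap.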
“Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
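results: Current LLMs and safeguards cannot adequately defend against jailbreak prompts in all of the 13 forbidden scenarios. In particular, two jailbreak prompts achieve 0.99 attack success rates on GPT-3.5 and GPT-4 and have persisted online for over 100 days. The study sheds light on the severe and evolving threat landscape of jailbreak prompts and aims to help researchers and LLM vendors promote safer and regulated LLMs.
Abstract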
The misuse of large language models (LLMs) has garnered significant attention from the general public and LLM vendors. In response, efforts have been made to align LLMs with human values and intent use. However, a particular type of adversarial prompts, known as jailbreak prompt, has emerged and continuously evolved to bypass the safeguards and elicit harmful content from LLMs. In this paper, we conduct the first measurement study on jailbreak prompts in the wild, with 6,387 prompts collected from four platforms over six months. Leveraging natural language processing technologies and graph-based community detection methods, we discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from public platforms to private ones, posing new challenges for LLM vendors in proactive detection. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 46,800 samples across 13 forbidden scenarios. Our experiments show that current LLMs and safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify two highly effective jailbreak prompts which achieve 0.99 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and they have persisted online for over 100 days. Our work sheds light on the severe and evolving threat landscape of jailbreak prompts. We hope our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.
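Summary
The misuse of large language models (LLMs) has drawn attention from the public and from LLM vendors. In response, efforts have been made to align LLMs with human values and intended use. However, a particular type of adversarial prompt, known as a jailbreak prompt, has emerged and keeps evolving to bypass safeguards and elicit harmful content from LLMs. In this paper, we conduct the first measurement study of jailbreak prompts in the wild, with 6,387 prompts collected from four platforms over six months. Using natural language processing and graph-based community detection, we identify the distinctive characteristics of jailbreak prompts and their main attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts are increasingly shifting from public platforms to private ones, which poses new challenges for proactive detection by LLM vendors. To assess the potential harm, we create a question set of 46,800 samples across 13 forbidden scenarios. Our experiments show that current LLMs and safeguards cannot adequately defend against jailbreak prompts in all scenarios. In particular, we identify two highly effective jailbreak prompts that achieve 0.99 attack success rates on ChatGPT (GPT-3.5) and GPT-4 and have persisted online for over 100 days. Our work sheds light on the severe and evolving threat landscape of jailbreak prompts, and we hope it helps the research community and LLM vendors promote safer and regulated LLMs.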
Communication-Efficient Framework for Distributed Image Semantic Wireless Transmission
for: The paper proposes a federated learning-based semantic communication (FLSC) framework for multi-task distributed image transmission with IoT devices.
methods: The FLSC framework uses a hierarchical vision transformer (HVT)-based extractor and a task-adaptive translator for coarse-to-fine semantic extraction and meaning translation. The framework also employs a channel state information-based multiple-input multiple-output transmission module to combat channel fading and noise.
results: Simulations show that the coarse semantic information can handle a range of image-level tasks and that FLSC outperforms the traditional scheme, especially in low signal-to-noise ratio and low channel bandwidth ratio regimes, with a gain of about 10 in peak signal-to-noise ratio in the 3 dB channel condition.
Abstract
Multi-node communication, which refers to the interaction among multiple devices, has attracted considerable attention in many Internet-of-Things (IoT) scenarios. However, its huge data flows and its inflexibility for task extension create an urgent need for communication-efficient distributed data transmission frameworks. In this paper, inspired by the advantages of semantic communications in bandwidth reduction and task adaptation, we propose a federated learning-based semantic communication (FLSC) framework for multi-task distributed image transmission with IoT devices. Federated learning allows each user to have an independently designed semantic communication link while further improving semantic extraction and task performance through global aggregation. Each link in FLSC is composed of a hierarchical vision transformer (HVT)-based extractor and a task-adaptive translator for coarse-to-fine semantic extraction and meaning translation according to specific tasks. To extend FLSC to more realistic conditions, we design a channel state information-based multiple-input multiple-output transmission module to combat channel fading and noise. Simulation results show that the coarse semantic information can deal with a range of image-level tasks. Moreover, especially in low signal-to-noise ratio and channel bandwidth ratio regimes, FLSC clearly outperforms the traditional scheme, e.g., with a peak signal-to-noise ratio (PSNR) gain of about 10 dB in the 3 dB channel condition.
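The "global aggregation" step of federated learning is often implemented as a weighted average of client parameters (the FedAvg rule). The abstract does not say which aggregation rule FLSC uses, so the sketch below is only an illustration under that assumption; the parameter names and client sizes are hypothetical.

```python
import numpy as np

def federated_average(client_params, client_sizes):
    """Weighted average of per-client parameter dicts (FedAvg-style aggregation).

    Assumption: every client holds the same parameter names and shapes,
    and each client is weighted by its number of local samples.
    """
    total = float(sum(client_sizes))
    keys = client_params[0].keys()
    return {
        k: sum(p[k] * (n / total) for p, n in zip(client_params, client_sizes))
        for k in keys
    }

# Two hypothetical clients, each with its own locally trained extractor weights.
clients = [
    {"extractor.w": np.array([1.0, 2.0]), "extractor.b": np.array([0.0])},
    {"extractor.w": np.array([3.0, 4.0]), "extractor.b": np.array([1.0])},
]
global_params = federated_average(clients, client_sizes=[100, 300])
print(global_params)
```

Weighting by local sample count means a client with more data pulls the global model further toward its own parameters; other weightings are possible.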
Scaling may be all you need for achieving human-level object recognition capacity with human-like visual experience
for: investigate whether current self-supervised learning methods can reach human-level visual object recognition capabilities with the same type and amount of visual experience as humans.
methods: use vision transformers with up to 633M parameters and train with up to 5K hours of human-like video data, with image resolutions of up to 476x476 pixels, using masked autoencoders as a self-supervised learning algorithm.
results: find that it is feasible to reach human-level object recognition capacity at sub-human scales of model size, data size, and image size, if these factors are scaled up simultaneously, and estimate that a 2.5B parameter ViT model trained with 20K hours of human-like video data should be able to reach roughly human-level accuracy on ImageNet.
Abstract
This paper asks whether current self-supervised learning methods, if sufficiently scaled up, would be able to reach human-level visual object recognition capabilities with the same type and amount of visual experience humans learn from. Previous work on this question only considered the scaling of data size. Here, we consider the simultaneous scaling of data size, model size, and image resolution. We perform a scaling experiment with vision transformers up to 633M parameters in size (ViT-H/14) trained with up to 5K hours of human-like video data (long, continuous, mostly egocentric videos) with image resolutions of up to 476x476 pixels. The efficiency of masked autoencoders (MAEs) as a self-supervised learning algorithm makes it possible to run this scaling experiment on an unassuming academic budget. We find that it is feasible to reach human-level object recognition capacity at sub-human scales of model size, data size, and image size, if these factors are scaled up simultaneously. To give a concrete example, we estimate that a 2.5B parameter ViT model trained with 20K hours (2.3 years) of human-like video data with a spatial resolution of 952x952 pixels should be able to reach roughly human-level accuracy on ImageNet. Human-level competence is thus achievable for a fundamental perceptual capability from human-like perceptual experience (human-like in both amount and type) with extremely generic learning algorithms and architectures and without any substantive inductive biases.
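Summary
This paper asks whether current self-supervised learning methods, if scaled up sufficiently, could reach human-level visual object recognition with the same type and amount of visual experience that humans learn from. Previous work considered only the scaling of data size; here we scale data size, model size, and image resolution simultaneously. We run a scaling experiment with vision transformers of up to 633M parameters (ViT-H/14), trained on up to 5K hours of human-like video data (long, continuous, mostly egocentric videos) at image resolutions of up to 476x476 pixels. We find that human-level object recognition capacity is reachable at sub-human scales of model size, data size, and image size, provided these factors are scaled up together. For example, we estimate that a 2.5B-parameter ViT model trained on 20K hours (2.3 years) of human-like video data at a spatial resolution of 952x952 pixels should reach roughly human-level accuracy on ImageNet. Human-level competence in a fundamental perceptual capability is thus achievable from human-like perceptual experience (in both amount and type) with extremely generic learning algorithms and architectures and without substantive inductive biases.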
DeRisk: An Effective Deep Learning Framework for Credit Risk Prediction over Real-World Financial Data
paper_authors: Yancheng Liang, Jiajie Zhang, Hui Li, Xiaochen Liu, Yi Hu, Yong Wu, Jinyao Zhang, Yongyan Liu, Yi Wu
for: Credit risk prediction.
methods: A deep learning model.
results: Outperforms statistical learning methods, achieving higher prediction accuracy.
Abstract
Despite the tremendous advances achieved over the past years by deep learning techniques, the latest risk prediction models for industrial applications still rely on highly hand-tuned, stage-wise statistical learning tools, such as gradient boosting and random forest methods. Unlike images or language, real-world financial data are high-dimensional, sparse, noisy, and extremely imbalanced, which makes deep neural network models particularly challenging to train and fragile in practice. In this work, we propose DeRisk, an effective deep learning risk prediction framework for credit risk prediction on real-world financial data. DeRisk is the first deep risk prediction model that outperforms statistical learning approaches deployed in our company's production system. We also perform extensive ablation studies on our method to present the most critical factors for the empirical success of DeRisk.
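Summary
Despite the tremendous advances deep learning has made in recent years, the latest risk prediction models still rely on highly hand-tuned, stage-wise statistical learning tools such as gradient boosting and random forests. Unlike images or language, real-world financial data are high-dimensional, sparse, noisy, and extremely imbalanced, which makes deep neural networks particularly hard to train and fragile in practice. In this work we propose DeRisk, an effective deep learning framework for credit risk prediction on real-world financial data. DeRisk is the first deep risk prediction model to outperform the statistical learning approaches deployed in our company's production system. We also run extensive ablation studies to identify the factors most critical to DeRisk's success.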
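results: The authors run extensive tests on 25 LLMs (including API-based and open-source models) and find that top commercial LLMs show strong agent abilities in complex environments, but that a significant performance gap separates them from open-source competitors.
Abstract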
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 25 LLMs (including APIs and open-sourced models) shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and open-sourced competitors. It also serves as a component of an ongoing project with wider coverage and deeper consideration towards systematic LLM evaluation. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench
Almost-sure convergence of iterates and multipliers in stochastic sequential quadratic optimization
results: Proves new almost-sure convergence guarantees for the primal iterates, the Lagrange multipliers, and the stationarity measures.
Abstract
Stochastic sequential quadratic optimization (SQP) methods for solving continuous optimization problems with nonlinear equality constraints have attracted attention recently, such as for solving large-scale data-fitting problems subject to nonconvex constraints. However, for a recently proposed subclass of such methods that is built on the popular stochastic-gradient methodology from the unconstrained setting, convergence guarantees have been limited to the asymptotic convergence of the expected value of a stationarity measure to zero. This is in contrast to the unconstrained setting in which almost-sure convergence guarantees (of the gradient of the objective to zero) can be proved for stochastic-gradient-based methods. In this paper, new almost-sure convergence guarantees for the primal iterates, Lagrange multipliers, and stationarity measures generated by a stochastic SQP algorithm in this subclass of methods are proved. It is shown that the error in the Lagrange multipliers can be bounded by the distance of the primal iterate to a primal stationary point plus the error in the latest stochastic gradient estimate. It is further shown that, subject to certain assumptions, this latter error can be made to vanish by employing a running average of the Lagrange multipliers that are computed during the run of the algorithm. The results of numerical experiments are provided to demonstrate the proved theoretical guarantees.
Summary
This paper proves new almost-sure convergence guarantees for the primal iterates, Lagrange multipliers, and stationarity measures generated by a stochastic SQP algorithm from a recently proposed subclass of methods built on the stochastic-gradient methodology. The error in the Lagrange multipliers is bounded by the distance of the primal iterate to a primal stationary point plus the error in the latest stochastic gradient estimate. Subject to certain assumptions, this latter error can be made to vanish by using a running average of the Lagrange multipliers computed during the run of the algorithm. Numerical experiments demonstrate the proved theoretical guarantees.
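The effect of averaging multiplier estimates can be illustrated with a toy example. The sketch below is not the paper's algorithm: it assumes least-squares multiplier estimates, holds the primal point fixed at a stationary point, and uses a hypothetical constraint Jacobian and i.i.d. Gaussian gradient noise. It only shows why a running average suppresses the error contributed by the latest stochastic gradient.

```python
import numpy as np

def least_squares_multiplier(grad, jac):
    """Multiplier estimate y minimizing ||grad + jac.T @ y||_2 (an assumed estimator)."""
    y, *_ = np.linalg.lstsq(jac.T, -grad, rcond=None)
    return y

rng = np.random.default_rng(0)
jac = np.array([[1.0, 1.0, 0.0],
                [0.0, 1.0, 1.0]])      # hypothetical constraint Jacobian at x*
y_true = np.array([0.5, -1.0])          # hypothetical true multipliers
grad_true = -jac.T @ y_true             # stationarity: grad + J^T y = 0

y_avg = np.zeros(2)
single_errors = []
for k in range(1, 2001):
    noisy_grad = grad_true + 0.5 * rng.standard_normal(3)  # stochastic gradient
    y_k = least_squares_multiplier(noisy_grad, jac)
    single_errors.append(np.linalg.norm(y_k - y_true))
    y_avg += (y_k - y_avg) / k          # running average of the multipliers

avg_error = np.linalg.norm(y_avg - y_true)
mean_single_error = float(np.mean(single_errors))
print(f"mean single-estimate error: {mean_single_error:.3f}")
print(f"running-average error:      {avg_error:.3f}")
```

In the actual setting the primal iterate moves as well, which is why the paper's bound also contains the distance of the iterate to a primal stationary point.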
Linear Convergence Bounds for Diffusion Models via Stochastic Localization
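results: The paper provides new convergence bounds for diffusion models on high-dimensional data distributions. The bounds are linear in the data dimension (up to logarithmic factors) and require no strong smoothness assumptions, only finite second moments. The paper also shows that diffusion models need at most $\tilde O(\frac{d \log^2(1/\delta)}{\varepsilon^2})$ steps to approximate an arbitrary data distribution on $\mathbb{R}^d$ corrupted with Gaussian noise of variance $\delta$ to within $\varepsilon^2$ in Kullback--Leibler divergence.
Abstract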
Diffusion models are a powerful method for generating approximate samples from high-dimensional data distributions. Several recent results have provided polynomial bounds on the convergence rate of such models, assuming $L^2$-accurate score estimators. However, up until now the best known such bounds were either superlinear in the data dimension or required strong smoothness assumptions. We provide the first convergence bounds which are linear in the data dimension (up to logarithmic factors) assuming only finite second moments of the data distribution. We show that diffusion models require at most $\tilde O(\frac{d \log^2(1/\delta)}{\varepsilon^2})$ steps to approximate an arbitrary data distribution on $\mathbb{R}^d$ corrupted with Gaussian noise of variance $\delta$ to within $\varepsilon^2$ in Kullback--Leibler divergence. Our proof builds on the Girsanov-based methods of previous works. We introduce a refined treatment of the error arising from the discretization of the reverse SDE, which is based on tools from stochastic localization.
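To get a feel for how the step bound scales, the helper below evaluates $d \log^2(1/\delta) / \varepsilon^2$ directly. The $\tilde O$ notation hides constants and logarithmic factors, so the `constant` argument is a placeholder assumption and the returned number is only meaningful for comparing settings, not as an actual step count.

```python
import math

def step_bound(d, delta, eps, constant=1.0):
    """Evaluate constant * d * log(1/delta)^2 / eps^2.

    `constant` stands in for the factors hidden by the O-tilde notation;
    its value here is an assumption, not taken from the paper.
    """
    if d <= 0 or eps <= 0 or not (0.0 < delta < 1.0):
        raise ValueError("need d > 0, eps > 0 and 0 < delta < 1")
    return constant * d * math.log(1.0 / delta) ** 2 / eps ** 2

base = step_bound(d=1000, delta=1e-3, eps=0.1)
doubled_dim = step_bound(d=2000, delta=1e-3, eps=0.1)
halved_eps = step_bound(d=1000, delta=1e-3, eps=0.05)
print(doubled_dim / base)  # linear in the dimension d
print(halved_eps / base)   # quadratic in 1/eps
```

Doubling the dimension doubles the bound, while halving $\varepsilon$ quadruples it.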
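methods: The method uses the blur information in raw images and improves depth estimation by combining correspondence and defocus cues. Specifically, it derives an inverse projection model that includes the defocus blur and yields depth estimates up to a scale factor, and then calibrates that model to obtain accurate metric depth.
results: Experiments show that introducing defocus cues improves the accuracy of depth estimation. The method is tested on real-world complex 3D scenes and compared against ground truth acquired with a 3D lidar scanner.
Abstract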
While a traditional camera captures only one point of view of a scene, a plenoptic or light-field camera is able to capture spatial and angular information in a single snapshot, enabling depth estimation from a single acquisition. In this paper, we present a new metric depth estimation algorithm using only raw images from a multi-focus plenoptic camera. The proposed approach is especially suited for the multi-focus configuration, where several micro-lenses with different focal lengths are used. The main goal of our blur aware depth estimation (BLADE) approach is to improve disparity estimation for defocus stereo images by integrating both correspondence and defocus cues. We thus leverage blur information where it was previously considered a drawback. We explicitly derive an inverse projection model that includes the defocus blur and provides depth estimates up to a scale factor. A method to calibrate the inverse model is then proposed, which takes depth scaling into account to achieve precise and accurate metric depth estimates. Our results show that introducing defocus cues improves the depth estimation. We demonstrate the effectiveness of our framework and depth scaling calibration on relative depth estimation setups and on real-world complex 3D scenes with ground truth acquired with a 3D lidar scanner.
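Summary
A traditional camera captures only one point of view of a scene, whereas a plenoptic (light-field) camera captures spatial and angular information in a single snapshot, enabling depth estimation from a single acquisition. In this paper we present a new depth estimation algorithm that uses only raw images from a multi-focus plenoptic camera. Our BLADE approach aims to improve disparity estimation for defocus stereo images by making use of blur information. We explicitly derive an inverse projection model that includes the defocus blur and provides depth estimates, and we propose a calibration method that accounts for depth scaling. Our results show that introducing defocus cues improves depth estimation. We run experiments on relative depth estimation setups and on real-world complex 3D scenes, comparing against ground-truth depth acquired with a 3D lidar scanner.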
Under-Display Camera Image Restoration with Scattering Effect
results: In experiments on both real-world and synthesized data, the proposed method outperforms state-of-the-art techniques.
Abstract
The under-display camera (UDC) provides consumers with a full-screen visual experience without any obstruction due to notches or punched holes. However, the semi-transparent nature of the display inevitably introduces severe degradation into UDC images. In this work, we address the UDC image restoration problem with specific consideration of the scattering effect caused by the display. We explicitly model the scattering effect by treating the display as a piece of homogeneous scattering medium. With the physical model of the scattering effect, we improve the image formation pipeline for image synthesis to construct a realistic UDC dataset with ground truths. To suppress the scattering effect for the eventual UDC image recovery, a two-branch restoration network is designed. More specifically, the scattering branch leverages the global modeling capabilities of channel-wise self-attention to estimate parameters of the scattering effect from degraded images, while the image branch exploits the local representation advantage of CNNs to recover clear scenes, implicitly guided by the scattering branch. Extensive experiments are conducted on both real-world and synthesized data, demonstrating the superiority of the proposed method over the state-of-the-art UDC restoration techniques. The source code and dataset are available at \url{https://github.com/NamecantbeNULL/SRUDC}.
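Summary
The under-display camera (UDC) offers an unobstructed full-screen visual experience, but the semi-transparent display severely degrades UDC images. We address UDC image restoration with specific consideration of the scattering effect caused by the display. We model the scattering effect explicitly by treating the display as a homogeneous scattering medium. With this physical model we improve the image formation pipeline and build a realistic UDC dataset with ground truths. To suppress the scattering effect, we design a two-branch network: the scattering branch uses the global modeling capability of channel-wise self-attention to estimate the scattering parameters, while the image branch uses the local representation advantage of CNNs to recover clear scenes, implicitly guided by the scattering branch. Extensive experiments on real-world and synthesized data demonstrate the superiority of our method over existing UDC restoration techniques. The source code and dataset are available at \url{https://github.com/NamecantbeNULL/SRUDC}.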
Towards Top-Down Stereoscopic Image Quality Assessment via Stereo Attention
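results: Experiments show that the method better simulates the properties of the human visual system and surpasses the current state of the art. The code is available at https://github.com/Fanning-Zhang/SATNet.
Abstract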
Stereoscopic image quality assessment (SIQA) plays a crucial role in evaluating and improving the visual experience of 3D content. Existing binocular properties and attention-based methods for SIQA have achieved promising performance. However, these bottom-up approaches are inadequate in exploiting the inherent characteristics of the human visual system (HVS). This paper presents a novel network for SIQA via stereo attention, employing a top-down perspective to guide the quality assessment process. Our proposed method realizes the guidance from high-level binocular signals down to low-level monocular signals, while the binocular and monocular information can be calibrated progressively throughout the processing pipeline. We design a generalized Stereo AttenTion (SAT) block to implement the top-down philosophy in stereo perception. This block utilizes the fusion-generated attention map as a high-level binocular modulator, influencing the representation of two low-level monocular features. Additionally, we introduce an Energy Coefficient (EC) to account for recent findings indicating that binocular responses in the primate primary visual cortex are less than the sum of monocular responses. The adaptive EC can tune the magnitude of binocular response flexibly, thus enhancing the formation of robust binocular features within our framework. To extract the most discriminative quality information from the summation and subtraction of the two branches of monocular features, we utilize a dual-pooling strategy that applies min-pooling and max-pooling operations to the respective branches. Experimental results highlight the superiority of our top-down method in simulating the property of visual perception and advancing the state-of-the-art in the SIQA field. The code of this work is available at https://github.com/Fanning-Zhang/SATNet.
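Summary
Stereoscopic image quality assessment (SIQA) plays a crucial role in evaluating and improving the visual experience of 3D content. Existing methods based on binocular properties and attention have achieved promising performance, but these bottom-up approaches do not fully exploit the inherent characteristics of the human visual system (HVS). This paper presents a new network for SIQA via stereo attention that uses a top-down perspective to guide the quality assessment process. The method passes guidance from high-level binocular signals down to low-level monocular signals and calibrates the two progressively throughout the processing pipeline. We design a generalized Stereo AttenTion (SAT) block to implement the top-down philosophy in stereo perception; it uses the fusion-generated attention map as a high-level binocular modulator that influences the representation of the low-level monocular features. We also introduce an Energy Coefficient (EC) to account for recent findings that binocular responses in the primate primary visual cortex are less than the sum of monocular responses. The adaptive EC tunes the magnitude of the binocular response, which helps form robust binocular features within our framework. To extract the most discriminative quality information from the two branches of monocular features, we use a dual-pooling strategy that applies min-pooling and max-pooling to the respective branches. Experimental results show that our top-down method better simulates visual perception and advances the state of the art in SIQA. The code is available at https://github.com/Fanning-Zhang/SATNet.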
Physics-driven universal twin-image removal network for digital in-line holographic microscopy
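results: Experiments verify that UTIRnet suppresses twin-image noise while keeping the reconstruction consistent with the input holograms, which improves the reliability of computational quantitative phase imaging. The verification included, for example, live neural glial cell culture migration sensing.
Abstract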
Digital in-line holographic microscopy (DIHM) enables efficient and cost-effective computational quantitative phase imaging with a large field of view, making it valuable for studying cell motility, migration, and bio-microfluidics. However, the quality of DIHM reconstructions is compromised by twin-image noise, posing a significant challenge. Conventional methods for mitigating this noise involve complex hardware setups or time-consuming algorithms with often limited effectiveness. In this work, we propose UTIRnet, a deep learning solution for fast, robust, and universally applicable twin-image suppression, trained exclusively on numerically generated datasets. The availability of open-source UTIRnet codes facilitates its implementation in various DIHM systems without the need for extensive experimental training data. Notably, our network ensures the consistency of reconstruction results with input holograms, imparting a physics-based foundation and enhancing reliability compared to conventional deep learning approaches. Experimental verification was conducted among others on live neural glial cell culture migration sensing, which is crucial for neurodegenerative disease research.
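Summary
Digital in-line holographic microscopy (DIHM) enables efficient and cost-effective computational quantitative phase imaging with a large field of view, which makes it a valuable tool for studying cell motility, migration, and bio-microfluidics. However, the quality of DIHM reconstructions is limited by twin-image noise, which is a significant challenge. Conventional methods for mitigating this noise involve complex hardware setups or time-consuming algorithms, often with limited effectiveness. We propose UTIRnet, a deep learning solution for fast, robust, and universally applicable twin-image suppression, trained on numerically generated datasets. Because the UTIRnet code is open source, it can be deployed in various DIHM systems without extensive experimental training data. Our network also guarantees that reconstruction results are consistent with the input holograms, which gives it a physics-based foundation and makes it more reliable than conventional deep learning approaches. Experimental verification included live neural glial cell culture migration sensing.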
Single-shot experimental-numerical twin-image removal in lensless digital holographic microscopy
paper_authors: Piotr Arcab, Mikolaj Rogalski, Maciej Trusiak
for: LDHM imaging offers a large field-of-view and is crucial for high-throughput particle tracking and biomedical examination of cells and tissues, but is limited by the twin-image effect.
methods: The proposed technique uses two-source off-axis hologram recording and a novel phase retrieval numerical algorithm to remove twin-image errors, providing a low-cost, out-of-laboratory imaging solution with enhanced precision.
results: The proposed technique enables twin-image-free reconstruction of LDHM images, which improves the accuracy of technical and biomedical imaging applications. The results demonstrate the effectiveness of the proposed technique using phase test targets and cheek cells biosamples.
Abstract
Lensless digital holographic microscopy (LDHM) offers very large field-of-view label-free imaging, which is crucial, e.g., in high-throughput particle tracking and biomedical examination of cells and tissues. Compact layouts promote point-of-care and out-of-laboratory applications. LDHM, based on the Gabor in-line holographic principle, is inherently spoiled by the twin-image effect, which complicates the quantitative analysis of reconstructed phase and amplitude maps. A popular family of solutions consists of numerical methods that minimize the twin image through an iterative process based on data redundancy. However, additional hologram recordings are needed, and the final results depend heavily on the algorithmic parameters. In this contribution we present a novel single-shot experimental-numerical twin-image removal technique for LDHM. It leverages two-source off-axis hologram recording using a simple fiber splitter. Additionally, we introduce a novel phase retrieval numerical algorithm, specifically tailored to the acquired holograms, that provides twin-image-free reconstruction without compromising the resolution. We quantitatively and qualitatively verify the proposed method using a phase test target and a cheek cell biosample. The results demonstrate that the proposed technique enables low-cost, out-of-laboratory LDHM imaging with enhanced precision, achieved through the elimination of twin-image errors. This advancement opens new avenues for more accurate technical and biomedical imaging applications using LDHM, particularly in scenarios where cost-effective and portable imaging solutions are desired.
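Summary
Lensless digital holographic microscopy (LDHM) offers very large field-of-view, label-free imaging, which is crucial in, for example, high-throughput particle tracking and biomedical examination of cells and tissues. Compact layouts promote point-of-care and out-of-laboratory applications. LDHM, based on the Gabor in-line holographic principle, suffers from the twin-image effect, which complicates the quantitative analysis of reconstructed phase and amplitude maps. Common solutions are numerical methods that reduce the twin image through an iterative process based on data redundancy; however, they need additional hologram recordings, and the final results depend on the algorithmic parameters. In this paper we propose a new single-shot experimental-numerical twin-image removal technique for LDHM. It uses two-source off-axis hologram recording with a simple fiber splitter. We also introduce a numerical algorithm tailored to the acquired holograms that provides twin-image-free reconstruction without compromising resolution. We verify the method with a test target and a cheek cell sample. The results show that the method enables low-cost, out-of-laboratory LDHM imaging with high precision by eliminating twin-image errors, which opens new avenues for cost-effective and portable LDHM imaging applications.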
Non-Intrusive Electric Load Monitoring Approach Based on Current Feature Visualization for Smart Energy Management
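results: Experiments show that the method outperforms other methods on both public and private datasets, and thus supports efficient energy management over large-scale Internet of Things (IoT) systems.
Abstract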
The state-of-the-art smart city calls for economical yet efficient energy management over large-scale networks, especially for the electric power system. Monitoring, analyzing, and controlling the electric loads of all users in the system is a critical issue. In this paper, we employ popular AI computer vision techniques to design a non-invasive load monitoring method for smart electric energy management. First, we utilize both signal transforms (including the wavelet transform and the discrete Fourier transform) and Gramian Angular Field (GAF) methods to map one-dimensional current signals onto two-dimensional color feature images. Second, we propose to recognize all electric loads from the color feature images using a U-shaped deep neural network with multi-scale feature extraction and an attention mechanism. Third, we design our method as cloud-based, non-invasive monitoring of all users, thereby saving energy cost during electric power system control. Experimental results on both public and our private datasets demonstrate that our method achieves superior performance to its peers, and thus supports efficient energy management over the large-scale Internet of Things (IoT).
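Summary
Modern smart cities call for economical and efficient energy management, especially for the electric power system. Monitoring, analyzing, and controlling the electric loads of all users is a key problem. In this paper we use popular computer vision techniques to design a non-invasive load monitoring method. First, we use signal transforms (including the wavelet transform and the discrete Fourier transform) and the Gramian Angular Field (GAF) method to map one-dimensional current signals onto two-dimensional color feature images. Second, we recognize all electric loads with a U-shaped deep neural network that includes multi-scale feature extraction and an attention mechanism. Finally, we design the method as cloud-based, non-invasive monitoring to support efficient energy management over the Internet of Things (IoT). Experimental results show that our method outperforms other methods on both public and private datasets, and thus supports efficient energy management over large-scale IoT.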
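The mapping from a one-dimensional current signal to a two-dimensional image can be sketched with the summation variant of the Gramian Angular Field. The paper's exact preprocessing and color mapping are not given here, so the rescaling to $[-1, 1]$ and the synthetic 50 Hz current with a third harmonic are assumptions made for illustration.

```python
import numpy as np

def gramian_angular_summation_field(signal):
    """Map a 1-D signal to a 2-D image G[i, j] = cos(phi_i + phi_j).

    The signal is min-max rescaled to [-1, 1] and each sample is encoded
    as an angle phi = arccos(x).
    """
    x = np.asarray(signal, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("signal must not be constant")
    x = 2.0 * (x - lo) / (hi - lo) - 1.0
    x = np.clip(x, -1.0, 1.0)          # guard against rounding outside [-1, 1]
    phi = np.arccos(x)
    return np.cos(phi[:, None] + phi[None, :])

# Hypothetical load current: 50 Hz fundamental plus a third harmonic.
t = np.linspace(0.0, 0.04, 64)
current = 5.0 * np.sin(2 * np.pi * 50 * t) + 1.5 * np.sin(2 * np.pi * 150 * t)
image = gramian_angular_summation_field(current)
print(image.shape)
```

The resulting matrix is symmetric, and a single-channel image like this could be passed through a colormap or stacked with other transforms to form a color feature image.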
Weakly Semi-Supervised Detection in Lung Ultrasound Videos
paper_authors: Jiahong Ouyang, Li Chen, Gary Y. Li, Naveen Balaraju, Shubham Patil, Courosh Mehanian, Sourabh Kulhare, Rachel Millin, Kenton W. Gregory, Cynthia R. Gregory, Meihua Zhu, David O. Kessler, Laurie Malia, Almaz Dessie, Joni Rabiner, Di Coneybeare, Bo Shopsin, Andrew Hersh, Cristian Madar, Jeffrey Shupp, Laura S. Johnson, Jacob Avila, Kristin Dwyer, Peter Weimersheimer, Balasundar Raju, Jochen Kruecker, Alvin Chen
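results: Detection of lung consolidations (seen in respiratory infections such as COVID-19 pneumonia) in medical ultrasound videos becomes more accurate and robust than with baseline semi-supervised models, and data and annotations are used more efficiently.
Abstract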
Frame-by-frame annotation of bounding boxes by clinical experts is often required to train fully supervised object detection models on medical video data. We propose a method for improving object detection in medical videos through weak supervision from video-level labels. More concretely, we aggregate individual detection predictions into video-level predictions and extend a teacher-student training strategy to provide additional supervision via a video-level loss. We also introduce improvements to the underlying teacher-student framework, including methods to improve the quality of pseudo-labels based on weak supervision and adaptive schemes to optimize knowledge transfer between the student and teacher networks. We apply this approach to the clinically important task of detecting lung consolidations (seen in respiratory infections such as COVID-19 pneumonia) in medical ultrasound videos. Experiments reveal that our framework improves detection accuracy and robustness compared to baseline semi-supervised models, and improves efficiency in data and annotation usage.
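Summary
Frame-by-frame annotation of bounding boxes by clinical experts is often required to train fully supervised object detection models on medical video data. We propose a method that improves object detection in medical videos through weak supervision from video-level labels. Specifically, we aggregate individual detection predictions into video-level predictions and extend a teacher-student training strategy with a video-level loss that provides additional supervision. We also improve the underlying teacher-student framework, with methods that raise the quality of pseudo-labels based on weak supervision and adaptive schemes that optimize knowledge transfer between the teacher and student networks. We apply the approach to detecting lung consolidations (seen in respiratory infections such as COVID-19 pneumonia) in medical ultrasound videos. Experiments show that our framework improves detection accuracy and robustness, and uses data and annotations more efficiently.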
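One simple way to turn per-frame detections into a video-level prediction is to take the highest detection confidence in the video and compare it with the video-level label through a binary cross-entropy loss. The abstract does not specify the aggregation or the loss, so both choices below are assumptions, and the confidence values are made up for illustration.

```python
import math

def video_score(frame_detections):
    """Aggregate detection confidences into one video-level score.

    frame_detections: list of frames, each a list of detection confidences
    in [0, 1]. Assumed rule: the video score is the maximum confidence
    over all frames (0.0 if nothing was detected).
    """
    frame_scores = [max(scores, default=0.0) for scores in frame_detections]
    return max(frame_scores, default=0.0)

def video_level_loss(score, label, eps=1e-7):
    """Binary cross-entropy between the video score and a 0/1 video label."""
    p = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# Hypothetical video with three frames; the second frame has no detections.
detections = [[0.1, 0.3], [], [0.9]]
score = video_score(detections)
loss_if_positive = video_level_loss(score, label=1)
loss_if_negative = video_level_loss(score, label=0)
print(score, loss_if_positive, loss_if_negative)
```

In a training loop the same aggregation would be written with differentiable tensor operations so that the video-level loss can update the detector.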
results: Experiments show that DefCor-Net substantially improves the accuracy of deformation correction for US images, raising the Dice coefficient from $14.3\pm20.9$ to $82.6\pm12.1$ at a force of $6N$.
Abstract
The recovery of morphologically accurate anatomical images from deformed ones is challenging in ultrasound (US) image acquisition, but crucial to accurate and consistent diagnosis, particularly in the emerging field of computer-assisted diagnosis. This article presents a novel anatomy-aware deformation correction approach based on a coarse-to-fine, multi-scale deep neural network (DefCor-Net). To achieve pixel-wise performance, DefCor-Net incorporates biomedical knowledge by estimating pixel-wise stiffness online using a U-shaped feature extractor. The deformation field is then computed using polynomial regression by integrating the measured force applied by the US probe. Based on real-time estimation of pixel-by-pixel tissue properties, the learning-based approach enables the potential for anatomy-aware deformation correction. To demonstrate the effectiveness of the proposed DefCor-Net, images recorded at multiple locations on forearms and upper arms of six volunteers are used to train and validate DefCor-Net. The results demonstrate that DefCor-Net can significantly improve the accuracy of deformation correction to recover the original geometry (Dice Coefficient: from $14.3\pm20.9$ to $82.6\pm12.1$ when the force is $6N$).
Summary
Recovering morphologically accurate anatomical images from deformed ones is challenging in ultrasound (US) image acquisition, but it is crucial for accurate and consistent diagnosis, particularly in computer-assisted diagnosis. This article presents a new anatomy-aware deformation correction approach based on a multi-scale deep neural network (DefCor-Net). DefCor-Net uses a U-shaped feature extractor to estimate stiffness online at every pixel, which gives pixel-wise performance, and this real-time estimate of tissue properties is the basis for the learning-based computation of the deformation field. To demonstrate the effectiveness of DefCor-Net, images recorded at multiple locations on the forearms and upper arms of six volunteers are used for training and validation. The results show that DefCor-Net significantly improves the accuracy of deformation correction, from $14.3\pm20.9$ to $82.6\pm12.1$ at a force of $6N$.
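The Dice coefficient used to report these results measures the overlap between two segmentation masks as $2|A \cap B| / (|A| + |B|)$. The sketch below computes it on the $[0, 1]$ scale; the reported values of 14.3 and 82.6 appear to be on a percentage scale, which is an interpretation rather than something stated in the abstract. The toy masks are hypothetical.

```python
import numpy as np

def dice_coefficient(mask_a, mask_b):
    """Dice overlap 2|A and B| / (|A| + |B|) between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0                     # both masks empty: treat as perfect overlap
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy 4x4 masks: rows 0-1 versus rows 1-2, so half of each mask overlaps.
reference = np.zeros((4, 4), dtype=bool)
reference[0:2, :] = True
corrected = np.zeros((4, 4), dtype=bool)
corrected[1:3, :] = True
print(dice_coefficient(reference, corrected))
```

A value of 1 means the corrected anatomy coincides with the undeformed reference, and 0 means no overlap at all.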
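methods: The filtered-x least mean square (FxLMS) algorithm is the starting point, but its slow convergence gives poor performance against quickly varying noise, and variation in noise power also degrades the stability of the algorithm.
results: Combining the normalized algorithm with the momentum method makes it converge faster and better avoids interference from the primary noise power.
Abstract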
Multichannel active noise control (MCANC) is widely used to achieve a significant noise cancellation area in complicated acoustic fields. The filtered-x least mean square (FxLMS) algorithm has gradually become the benchmark solution for implementing MCANC because of its low computational complexity. However, its slow convergence undermines its performance against quickly varying disturbances, such as piling noise. Furthermore, variation in noise power degrades the robustness of the algorithm when a fixed step size is used. To solve these issues, we integrate the normalized multichannel FxLMS with the momentum method, which effectively avoids interference from the primary noise power and accelerates the convergence of the algorithm. To validate its effectiveness, we deployed the algorithm in a multichannel noise control window to control real machine noise.
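Summary
Multichannel active noise control (MCANC) is widely used to achieve a significant noise cancellation area in complicated acoustic fields. The filtered-x least mean square (FxLMS) algorithm has gradually become the standard solution for implementing MCANC because of its low computational complexity. However, its slow convergence considerably reduces performance against quickly varying disturbances. In addition, variation in noise power reduces the robustness of the algorithm, especially when a fixed step size is used. To solve these problems, we combine the normalized multichannel FxLMS with the momentum method, which avoids interference from the primary noise power and accelerates convergence. To validate its effectiveness, we applied the algorithm in a multichannel noise control window to control real machine noise.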
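The combination of step-size normalization and momentum can be sketched for a single channel. This is a simplified illustration, not the paper's multichannel algorithm: it assumes a heavy-ball momentum term, a perfectly known secondary path, white-noise reference input, and hypothetical primary and secondary path impulse responses.

```python
import numpy as np

def fxlms_momentum(x, d, s_true, s_hat, n_taps=16, mu=0.1, beta=0.5, eps=1e-8):
    """Single-channel normalized FxLMS with a heavy-ball momentum term.

    x: reference signal, d: disturbance at the error microphone,
    s_true: secondary-path impulse response, s_hat: its estimate.
    Returns the final control filter and the error signal.
    """
    w = np.zeros(n_taps)             # control filter weights
    v = np.zeros(n_taps)             # momentum (velocity) term
    x_buf = np.zeros(n_taps)         # recent reference samples
    fx_buf = np.zeros(n_taps)        # recent filtered-reference samples
    xs_buf = np.zeros(len(s_hat))    # reference history fed through s_hat
    y_buf = np.zeros(len(s_true))    # control-output history fed through s_true
    e = np.zeros(len(x))
    for n in range(len(x)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = x[n]
        xs_buf = np.roll(xs_buf, 1)
        xs_buf[0] = x[n]
        fx_buf = np.roll(fx_buf, 1)
        fx_buf[0] = s_hat @ xs_buf               # filtered reference sample
        y_buf = np.roll(y_buf, 1)
        y_buf[0] = w @ x_buf                     # control signal
        e[n] = d[n] - s_true @ y_buf             # residual at the error microphone
        # Normalizing by the filtered-reference power removes the dependence
        # of the effective step size on the input noise power.
        step = mu * e[n] * fx_buf / (fx_buf @ fx_buf + eps)
        v = beta * v + step                      # momentum accumulates past steps
        w = w + v
    return w, e

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)                          # reference noise
p = np.array([0.0, 0.0, 0.0, 0.0, 0.9, 0.5, 0.2])      # hypothetical primary path
s = np.array([0.0, 0.8, 0.3])                          # hypothetical secondary path
d = np.convolve(x, p)[:len(x)]
w, e = fxlms_momentum(x, d, s_true=s, s_hat=s)
residual_power = np.mean(e[-500:] ** 2)
disturbance_power = np.mean(d[-500:] ** 2)
print(10 * np.log10(residual_power / disturbance_power), "dB")
```

Setting `beta=0` recovers plain normalized FxLMS, which makes it easy to compare convergence with and without momentum on the same signals.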
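results: The proposed AudioVMAF system predicts quality more accurately in band-limited scenarios. It outperforms existing visual quality features repurposed for audio and improves on a dedicated audio quality metric (ViSQOL-v3) by a significant 7.8% and 2.0% in Pearson and Spearman rank correlation coefficient, respectively.
Abstract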
Video Multimethod Assessment Fusion (VMAF) [1], [2], [3] is a popular tool in the industry for measuring coded video quality. In this study, we propose an auditory-inspired frontend in existing VMAF for creating videos of reference and coded spectrograms, and extended VMAF for measuring coded audio quality. We name our system AudioVMAF. We demonstrate that image replication is capable of further enhancing prediction accuracy, especially when band-limited anchors are present. The proposed method significantly outperforms all existing visual quality features repurposed for audio, and even demonstrates a significant overall improvement of 7.8% and 2.0% of Pearson and Spearman rank correlation coefficient, respectively, over a dedicated audio quality metric (ViSQOL-v3 [4]) also inspired from the image domain.
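Summary
Video Multimethod Assessment Fusion (VMAF) is a widely used video quality assessment tool in industry. This study proposes an auditory-inspired frontend that creates videos of reference and coded spectrograms, and extends VMAF to measure coded audio quality. The resulting system is called AudioVMAF. Image replication is shown to further improve prediction accuracy, especially when band-limited anchors are present. The method outperforms all existing visual quality features repurposed for audio, and improves on ViSQOL-v3 [4] by 7.8% and 2.0% in Pearson and Spearman rank correlation, respectively.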
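The two correlation measures quoted in this entry can be computed as in the sketch below. This is a generic illustration rather than the authors' evaluation code: the helper names `pearson`, `ranks`, and `spearman` and the toy score lists are assumptions.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical predicted scores vs. subjective scores for four audio items.
predicted = [1.0, 2.0, 3.0, 4.0]
subjective = [1.0, 4.0, 9.0, 16.0]
linear_agreement = pearson(predicted, subjective)
rank_agreement = spearman(predicted, subjective)
```

Pearson correlation rewards a linear relationship between predicted and subjective scores, while Spearman correlation only checks that the ordering agrees, which is why quality metrics are commonly reported with both.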
Improving Deep Attractor Network by BGRU and GMM for Speech Separation
methods: The model uses a bidirectional gated recurrent unit (BGRU) in place of bidirectional long short-term memory (BLSTM), and uses a Gaussian Mixture Model (GMM) as the clustering algorithm to reduce complexity and increase learning speed and accuracy.
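results: Evaluated on the TIMIT speech corpus, the proposed model achieves SDR and PESQ scores of 12.3 dB and 2.94, respectively, better than the original DANet. It also reduces the number of parameters and the training time by 20.7% and 17.9%, respectively. Finally, the model was evaluated on mixed Arabic speech signals, where the results were better than those for English.
Abstract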
Deep Attractor Network (DANet) is a state-of-the-art technique in the speech separation field that uses Bidirectional Long Short-Term Memory (BLSTM), but the complexity of the DANet model is very high. In this paper, a simplified and powerful DANet model is proposed using a Bidirectional Gated Recurrent Unit (BGRU) instead of BLSTM. A Gaussian Mixture Model (GMM), rather than k-means, is applied in DANet as the clustering algorithm to reduce the complexity and increase the learning speed and accuracy. The metrics used in this paper are Signal to Distortion Ratio (SDR), Signal to Interference Ratio (SIR), Signal to Artifact Ratio (SAR), and Perceptual Evaluation of Speech Quality (PESQ) score. Two-speaker mixture datasets from the TIMIT corpus were prepared to evaluate the proposed model, and the system achieved 12.3 dB and 2.94 for SDR and PESQ scores, respectively, which were better than those of the original DANet model. The number of parameters and the training time also improved by 20.7% and 17.9%, respectively. The model was applied to mixed Arabic speech signals, and the results were better than those for English.
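Summary
Deep Attractor Network (DANet) is a state-of-the-art technique in speech separation that uses bidirectional long short-term memory (BLSTM), but the model is highly complex. This paper proposes a simplified DANet that uses a bidirectional gated recurrent unit (BGRU) instead of BLSTM, with a Gaussian Mixture Model (GMM) as the clustering algorithm to reduce complexity and increase learning speed and accuracy. The metrics used are Signal to Distortion Ratio (SDR), Signal to Interference Ratio (SIR), Signal to Artifact Ratio (SAR), and the Perceptual Evaluation of Speech Quality (PESQ) score. On two-speaker mixture datasets prepared from the TIMIT corpus, the proposed model reaches 12.3 dB SDR and a PESQ score of 2.94, better than the original DANet. The number of parameters and the training time fall by 20.7% and 17.9%, respectively. On mixed Arabic speech signals, the model gives better results than on English.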
SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability
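results: In experiments on 50,000 hours of industrial data, the proposed model outperforms strong baselines on both customization and general ASR tasks, and an efficient method for filtering large-scale hotword lists is also proposed. The industrial models and two hotword test sets have been released.
Abstract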
Hotword customization is one of the important issues remaining in the ASR field: it is of value to enable users of ASR systems to customize names of entities, persons, and other phrases. The past few years have seen the development of both implicit and explicit modeling strategies for ASR contextualization. While these approaches have performed adequately, they still exhibit certain shortcomings, such as instability in effectiveness. In this paper we propose Semantic-augmented Contextual-Paraformer (SeACo-Paraformer), a novel NAR-based ASR system with flexible and effective hotword customization ability. It combines the accuracy of the AED-based model, the efficiency of the NAR model, and excellent performance in contextualization. In experiments on 50,000 hours of industrial big data, our proposed model outperforms strong baselines in customization and general ASR tasks. In addition, we explore an efficient way to filter large-scale incoming hotwords for further improvement. The source code, the industrial models proposed and compared, and two hotword test sets are all openly released.
Summary
Hotword customization is an important issue in the ASR field: it is valuable to let users of ASR systems customize the names of entities, persons, and other phrases. Both implicit and explicit modeling strategies for ASR contextualization have been developed recently, but they still have shortcomings such as unstable effectiveness. This paper proposes a novel NAR-based ASR system with flexible and effective hotword customization ability, called Semantic-augmented Contextual-Paraformer (SeACo-Paraformer). It combines the accuracy of the AED-based model, the efficiency of the NAR model, and excellent contextualization performance. In experiments on 50,000 hours of industrial data, the proposed model outperforms strong baselines in customization and general ASR tasks. The paper also explores an efficient way to filter large-scale incoming hotwords for further improvement. The source code, the industrial models proposed and compared, and two hotword test sets are all openly available.
Investigation of Self-supervised Pre-trained Models for Classification of Voice Quality from Speech and Neck Surface Accelerometer Signals
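results: The NSA input gives better classification than the speech input, and features from pre-trained models improve classification accuracy for both speech and NSA inputs. HuBERT features also outperform wav2vec2-BASE and wav2vec2-LARGE features.
Abstract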
Prior studies in the automatic classification of voice quality have mainly studied the use of the acoustic speech signal as input. Recently, a few studies have been carried out by jointly using both speech and neck surface accelerometer (NSA) signals as inputs, and by extracting MFCCs and glottal source features. This study examines simultaneously-recorded speech and NSA signals in the classification of voice quality (breathy, modal, and pressed) using features derived from three self-supervised pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) and using a SVM as well as CNNs as classifiers. Furthermore, the effectiveness of the pre-trained models is compared in feature extraction between glottal source waveforms and raw signal waveforms for both speech and NSA inputs. Using two signal processing methods (quasi-closed phase (QCP) glottal inverse filtering and zero frequency filtering (ZFF)), glottal source waveforms are estimated from both speech and NSA signals. The study has three main goals: (1) to study whether features derived from pre-trained models improve classification accuracy compared to conventional features (spectrogram, mel-spectrogram, MFCCs, i-vector, and x-vector), (2) to investigate which of the two modalities (speech vs. NSA) is more effective in the classification task with pre-trained model-based features, and (3) to evaluate whether the deep learning-based CNN classifier can enhance the classification accuracy in comparison to the SVM classifier. The results revealed that the use of the NSA input showed better classification performance compared to the speech signal. Between the features, the pre-trained model-based features showed better classification accuracies, both for speech and NSA inputs compared to the conventional features. It was also found that the HuBERT features performed better than the wav2vec2-BASE and wav2vec2-LARGE features.
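Summary
Earlier studies on automatic classification of voice quality mainly used the speech signal as input; a few recent studies use speech and neck surface accelerometer (NSA) signals jointly, extracting MFCCs and glottal source features. This study classifies voice quality from simultaneously recorded speech and NSA signals, using features extracted from three self-supervised pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) with an SVM and CNNs as classifiers. It also compares the pre-trained models for feature extraction from glottal source waveforms, which are estimated with two signal processing methods: quasi-closed phase (QCP) glottal inverse filtering and zero frequency filtering (ZFF). The study has three goals: (1) to test whether features from pre-trained models improve classification accuracy over conventional features (spectrogram, mel-spectrogram, MFCCs, i-vector, and x-vector); (2) to determine which modality, speech or NSA, is more effective with pre-trained model-based features; and (3) to test whether a CNN classifier improves accuracy over an SVM classifier. The NSA input gives better classification performance, pre-trained model-based features improve accuracy for both speech and NSA inputs, and HuBERT features give the highest classification accuracy.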
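paper_authors: Babak Azad, Ahmed Abdalla, Kwanghee Won, Ali Mirzakhani Nafchi
for: This study aims to develop a vision-transformer-based method for detecting Fusarium head blight (FHB), to improve the efficiency and accuracy of FHB detection in wheat and barley breeding programs.
methods: The method uses a new Context Bridge that combines the local representation capability of the U-Net with the global self-attention mechanism of the vision transformer, improving the model's ability to capture long-range dependencies and multi-scale objects. It also replaces the standard attention mechanism of the original vision transformer with Efficient Self-attention to reduce model complexity.
results: Extensive experiments and evaluation demonstrate the effectiveness of the vision-transformer-based method on the FHB detection task.
Abstract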
Fusarium head blight (FHB) is a devastating disease that causes significant economic losses annually on small grains. Efficient, accurate, and timely detection of FHB in resistance screening is critical for wheat and barley breeding programs. In recent years, various image processing techniques have been developed using supervised machine learning algorithms for the early detection of FHB. State-of-the-art convolutional neural network-based methods, such as U-Net, employ a series of encoding blocks to create a local representation and a series of decoding blocks to capture the semantic relations. However, these methods are often not capable of modeling long-range dependencies inside the input data, and their ability to model multi-scale objects with significant variations in texture and shape is limited. Vision transformers, alternative architectures with innate global self-attention mechanisms for sequence-to-sequence prediction, may in turn have limited localization capability because of insufficient low-level detail. To overcome these limitations, a new Context Bridge is proposed to integrate the local representation capability of the U-Net network in the transformer model. In addition, the standard attention mechanism of the original transformer is replaced with Efficient Self-attention, which is less complicated than other state-of-the-art methods. To train the proposed network, 12,000 wheat images from an FHB-inoculated wheat field at the SDSU research farm in Volga, SD, were captured. In addition to healthy and unhealthy plants, these images encompass various stages of the disease. A team of expert pathologists annotated the images for training and evaluating the developed model. As a result, the effectiveness of the transformer-based method for FHB-disease detection is demonstrated through extensive experiments across typical tasks for plant image segmentation.
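Summary
Fusarium head blight (FHB) is a devastating disease that causes significant economic losses on small grains every year. Efficient, accurate, and timely detection of FHB is critical in resistance screening for small-grain breeding programs. In recent years, image processing techniques based on supervised machine learning have been developed for early detection of FHB. Existing convolutional neural network (CNN) methods such as U-Net use a series of encoding blocks to create a local representation and a series of decoding blocks to capture semantic relations. These methods usually cannot model long-range dependencies in the input data, and their ability to model multi-scale objects with varied texture and shape is limited. To overcome these limitations, a new Context Bridge is proposed to integrate the local representation capability of U-Net into the transformer model. The standard attention mechanism of the original transformer is also replaced with Efficient Self-attention, which is less complicated than other state-of-the-art methods. The network was trained on 12,000 wheat images from an FHB-inoculated wheat field at the SDSU research farm in Volga, SD. Besides healthy and unhealthy plants, the images cover various stages of the disease. A team of expert pathologists annotated the images for training and evaluating the model. Extensive experiments across typical plant image segmentation tasks demonstrate the effectiveness of the transformer-based method for FHB detection.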
Distributionally Robust Classification on a Data Budget
paper_authors: Benjamin Feuer, Ameya Joshi, Minh Pham, Chinmay Hegde
for: The paper addresses the challenge of training robust deep learning models under distribution shifts, specifically in domains with limited data budgets.
methods: The authors introduce a new dataset called JANuS (Joint Annotations and Names Set) and perform a series of carefully controlled investigations of the factors contributing to robustness in image classification. They train a standard ResNet-50 with the cross-entropy loss on 2.4 million image samples and compare it with a CLIP ResNet-50 trained on 400 million samples.
results: The standard ResNet-50 attains robustness comparable to the CLIP ResNet-50 on a limited data budget, which the authors believe is the first result of its kind.
Abstract
Real world uses of deep learning require predictable model behavior under distribution shifts. Models such as CLIP show emergent natural distributional robustness comparable to humans, but may require hundreds of millions of training samples. Can we train robust learners in a domain where data is limited? To rigorously address this question, we introduce JANuS (Joint Annotations and Names Set), a collection of four new training datasets with images, labels, and corresponding captions, and perform a series of carefully controlled investigations of factors contributing to robustness in image classification, then compare those results to findings derived from a large-scale meta-analysis. Using this approach, we show that standard ResNet-50 trained with the cross-entropy loss on 2.4 million image samples can attain comparable robustness to a CLIP ResNet-50 trained on 400 million samples. To our knowledge, this is the first result showing (near) state-of-the-art distributional robustness on limited data budgets. Our dataset is available at \url{https://huggingface.co/datasets/penfever/JANuS_dataset}, and the code used to reproduce our experiments can be found at \url{https://github.com/penfever/vlhub/}.
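Summary
Real-world uses of deep learning require models that behave predictably under distribution shift. Models such as CLIP show natural distributional robustness comparable to humans, but may require hundreds of millions of training samples. Can robust learners be trained in a domain where data is limited? To answer this rigorously, the paper introduces JANuS (Joint Annotations and Names Set), a collection of four new training datasets with images, labels, and corresponding captions, and carries out a series of carefully controlled investigations of the factors that contribute to robustness. A standard ResNet-50 trained with the cross-entropy loss on 2.4 million image samples attains distributional robustness comparable to a CLIP ResNet-50 trained on 400 million samples. The authors believe this is the first result showing (near) state-of-the-art distributional robustness on a limited data budget. The dataset is available at \url{https://huggingface.co/datasets/penfever/JANuS_dataset}, and the code to reproduce the experiments is at \url{https://github.com/penfever/vlhub/}.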
WarpEM: Dynamic Time Warping for Accurate Catheter Registration in EM-guided Procedures
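results: The DTW method accurately warps and matches EM-tracked paths to the vessel centerline and, with marker-based registration as the reference, yields highly similar registration results with a mean error of 2.22 mm.
Abstract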
Accurate catheter tracking is crucial during minimally invasive endovascular procedures (MIEP), and electromagnetic (EM) tracking is a widely used technology that serves this purpose. However, registration between preoperative images and the EM tracking system is often challenging. Existing registration methods typically require manual interactions, which can be time-consuming, increase the risk of errors, and change the procedural workflow. Although several registration methods are available for catheter tracking, such as marker-based and path-based approaches, their limitations can impact the accuracy of the resulting tracking solution and, consequently, the outcome of the medical procedure. This paper introduces a novel automated catheter registration method for EM-guided MIEP. The method utilizes 3D signal temporal analysis, such as Dynamic Time Warping (DTW) algorithms, to improve registration accuracy and reliability compared to existing methods. DTW can accurately warp and match EM-tracked paths to the vessel's centerline, making it particularly suitable for registration. The introduced registration method is evaluated for accuracy in a vascular phantom using a marker-based registration as the ground truth. The results indicate that the DTW method yields accurate and reliable registration outcomes, with a mean error of 2.22 mm. The introduced registration method presents several advantages over state-of-the-art methods, such as high registration accuracy, no initialization required, and increased automation.
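Summary
Accurate catheter tracking is crucial in minimally invasive endovascular procedures (MIEP), and electromagnetic (EM) tracking is widely used for it. However, registration between preoperative images and the EM tracking system is often difficult. Existing registration methods usually require manual interaction, which takes time, increases the risk of errors, and changes the procedural workflow. Several registration methods exist for catheter tracking, such as marker-based and path-based approaches, but their limitations can affect the accuracy of the tracking solution and therefore the outcome of the procedure. This paper introduces a new automated catheter registration method for EM-guided MIEP. It uses 3D signal temporal analysis, such as Dynamic Time Warping (DTW) algorithms, to improve registration accuracy and reliability. DTW can accurately warp and match EM-tracked paths to the vessel centerline, which makes it particularly suitable for registration. The method is evaluated in a vascular phantom with marker-based registration as the ground truth. The DTW method yields accurate and reliable registration, with a mean error of 2.22 mm. Its advantages include high registration accuracy, no need for initialization, and increased automation.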
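The matching of a tracked path to a centerline can be illustrated with the textbook DTW recurrence. The sketch below is a generic illustration under simple assumptions (Euclidean point distance, no windowing or step weights); it is not the WarpEM registration pipeline, and the function name `dtw` and the toy point lists are hypothetical.

```python
import math

def dtw(a, b):
    """Classic dynamic time warping between two sequences of 3D points.

    Returns (total_cost, path), where path is a list of index pairs (i, j)
    matching a[i] to b[j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances, b repeats
                                    cost[i][j - 1],      # b advances, a repeats
                                    cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover the alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path

# Toy data: the tracked tip pauses at the start, so its first point repeats.
centerline = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
tracked = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
total, path = dtw(tracked, centerline)
```

Because DTW allows one sequence to repeat a sample while the other advances, a catheter that pauses or moves at uneven speed can still be matched point by point to a centerline sampled at a different rate.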
MOMA-Force: Visual-Force Imitation for Real-World Mobile Manipulation
methods: The method combines representation learning for perception, imitation learning for complex motion generation, and admittance whole-body control to achieve both robustness and controllability.
results: In a real household setting, the method achieves higher success rates, smaller contact forces, and smaller force variances than baseline methods.
Abstract
In this paper, we present a novel method for mobile manipulators to perform multiple contact-rich manipulation tasks. While learning-based methods have the potential to generate actions in an end-to-end manner, they often suffer from insufficient action accuracy and robustness against noise. On the other hand, classical control-based methods can enhance system robustness, but at the cost of extensive parameter tuning. To address these challenges, we present MOMA-Force, a visual-force imitation method that seamlessly combines representation learning for perception, imitation learning for complex motion generation, and admittance whole-body control for system robustness and controllability. MOMA-Force enables a mobile manipulator to learn multiple complex contact-rich tasks with high success rates and small contact forces. In a real household setting, our method outperforms baseline methods in terms of task success rates. Moreover, our method achieves smaller contact forces and smaller force variances compared to baseline methods without force imitation. Overall, we offer a promising approach for efficient and robust mobile manipulation in the real world. Videos and more details can be found on \url{https://visual-force-imitation.github.io}
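Summary
This paper presents a new method that lets mobile manipulators perform multiple contact-rich manipulation tasks. Learning-based methods can generate actions end to end, but they often suffer from insufficient action accuracy and sensitivity to noise. Classical control-based methods can improve system robustness, but at the cost of extensive parameter tuning. To address these challenges, the paper proposes MOMA-Force, a visual-force imitation method that combines representation learning for perception, imitation learning for complex motion generation, and admittance whole-body control. MOMA-Force lets a mobile manipulator learn multiple complex contact-rich tasks with high success rates and small contact forces. In a real household setting, the method achieves higher task success rates than baseline methods, along with smaller contact forces and smaller force variances. Overall, it offers a promising approach to efficient and robust mobile manipulation in the real world. Videos and more details can be found at \url{https://visual-force-imitation.github.io}.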
Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods
results: Experiments show that the proposed Vi-PRoM scheme achieves significant improvements in various simulation environments and on a real robot.
Abstract
Visual pre-training with large-scale real-world data has made great progress in recent years, showing great potential in robot learning with pixel observations. However, the recipes of visual pre-training for robot manipulation tasks are yet to be built. In this paper, we thoroughly investigate the effects of visual pre-training strategies on robot manipulation tasks from three fundamental perspectives: pre-training datasets, model architectures, and training methods. Several significant experimental findings are provided that are beneficial for robot learning. Further, we propose a visual pre-training scheme for robot manipulation termed Vi-PRoM, which combines self-supervised learning and supervised learning. Concretely, the former employs contrastive learning to acquire underlying patterns from large-scale unlabeled data, while the latter aims to learn visual semantics and temporal dynamics. Extensive experiments on robot manipulation in various simulation environments and on a real robot demonstrate the superiority of the proposed scheme. Videos and more details can be found on \url{https://explore-pretrain-robot.github.io}.
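Summary
Visual pre-training on large-scale real-world data has made great progress in recent years and shows its potential for robot learning. However, recipes for visual pre-training in robot manipulation tasks have yet to be established. This paper thoroughly investigates the effects of visual pre-training strategies on robot manipulation tasks from three fundamental perspectives: pre-training datasets, model architectures, and training methods. It reports several experimental findings that are useful for robot learning. It also proposes a visual pre-training scheme for robot manipulation, named Vi-PRoM, which combines self-supervised and supervised learning. The former uses contrastive learning to capture underlying patterns from large-scale unlabeled data, while the latter learns visual semantics and temporal dynamics. Extensive experiments in various simulation environments and on a real robot demonstrate the superiority of the scheme. Videos and more details can be found at \url{https://explore-pretrain-robot.github.io}.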
Adaptive Semi-Supervised Segmentation of Brain Vessels with Ambiguous Labels
paper_authors: Fengming Lin, Yan Xia, Nishant Ravikumar, Qiongyao Liu, Michael MacRaild, Alejandro F Frangi
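for: The study aims to improve the accuracy of brain vessel segmentation in support of cerebrovascular disease diagnosis and treatment.
methods: The approach uses progressive semi-supervised learning, an adaptive training strategy, and boundary enhancement.
results: Experiments on 3DRA datasets show that the method achieves strong results on mesh-based segmentation metrics and can handle partially or ambiguously annotated datasets.
Abstract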
Accurate segmentation of brain vessels is crucial for cerebrovascular disease diagnosis and treatment. However, existing methods face challenges in capturing small vessels and handling datasets that are partially or ambiguously annotated. In this paper, we propose an adaptive semi-supervised approach to address these challenges. Our approach incorporates innovative techniques including progressive semi-supervised learning, an adaptive training strategy, and boundary enhancement. Experimental results on 3DRA datasets demonstrate the superiority of our method in terms of mesh-based segmentation metrics. By leveraging the partially and ambiguously labeled data, which only annotates the main vessels, our method achieves impressive segmentation performance on mislabeled fine vessels, showcasing its potential for clinical applications.
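Summary
Accurate segmentation of brain vessels is key to the diagnosis and treatment of cerebrovascular disease. However, existing methods struggle to capture small vessels and to handle partially or ambiguously annotated datasets. This paper proposes an adaptive semi-supervised approach to address these challenges. The approach includes progressive semi-supervised learning, an adaptive training strategy, and boundary enhancement. Experiments on 3DRA datasets show a clear advantage on mesh-based segmentation metrics. By using partially and ambiguously labeled data, the method achieves good segmentation performance on mislabeled fine vessels, which shows its potential for clinical applications.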
AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose
results: Achieves zero-shot generation of high-quality, diverse 3D avatars, with higher quality and stability than previous work.
Abstract
Creating expressive, diverse, and high-quality 3D avatars from highly customized text descriptions and pose guidance is a challenging task, because 3D modeling and texturing must capture fine details and various styles (realistic, fictional, etc.). We present AvatarVerse, a stable pipeline for generating expressive high-quality 3D avatars from nothing but text descriptions and pose guidance. Specifically, we introduce a 2D diffusion model conditioned on the DensePose signal to establish 3D pose control of avatars through 2D images, which enhances view consistency from partially observed scenarios. It addresses the infamous Janus Problem and significantly stabilizes the generation process. Moreover, we propose a progressive high-resolution 3D synthesis strategy, which substantially improves the quality of the created 3D avatars. As a result, the proposed AvatarVerse pipeline achieves zero-shot 3D modeling of 3D avatars that are not only more expressive, but also higher in quality and fidelity than previous works. Rigorous qualitative evaluations and user studies showcase AvatarVerse's superiority in synthesizing high-fidelity 3D avatars, leading to a new standard in high-quality and stable 3D avatar creation. Our project page is: https://avatarverse3d.github.io
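Summary
Creating high-quality, diverse, and expressive 3D avatars from highly customized text descriptions and pose guidance is a complex task, because modeling and texturing in 3D must capture fine detail and various styles (realistic, fictional, etc.). The paper presents AvatarVerse, a stable pipeline that generates high-quality 3D avatars from text descriptions and pose guidance alone. Specifically, it introduces a 2D diffusion model conditioned on the DensePose signal to control the 3D pose of avatars, which improves view consistency in partially observed scenarios and addresses the well-known Janus Problem. It also proposes a progressive high-resolution 3D synthesis strategy that improves the quality of the created avatars. As a result, the AvatarVerse pipeline achieves zero-shot 3D modeling of avatars that are more expressive, higher in quality, and higher in fidelity than previous work. The project page is: https://avatarverse3d.github.io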
Recurrent Self-Supervised Video Denoising with Denser Receptive Field
paper_authors: Zichun Wang, Yulun Zhang, Debing Zhang, Ying Fu
for: Self-supervised video denoising.
methods: Uses blind spot networks within a self-supervised recurrent video denoising framework.
results: Improves video denoising by exploiting more information from the reference and neighbor frames, with good generalization and stability.
Abstract
Self-supervised video denoising has seen decent progress through the use of blind spot networks. However, under their blind spot constraints, previous self-supervised video denoising methods suffer from significant information loss and texture destruction in either the whole reference frame or neighbor frames, due to their inadequate consideration of the receptive field. Moreover, the limited number of available neighbor frames in previous methods leads to the discarding of distant temporal information. Nonetheless, simply adopting existing recurrent frameworks does not work, since they easily break the constraints on the receptive field imposed by self-supervision. In this paper, we propose RDRF for self-supervised video denoising, which not only fully exploits both the reference and neighbor frames with a denser receptive field, but also better leverages the temporal information from both local and distant neighbor features. First, towards a comprehensive utilization of information from both reference and neighbor frames, RDRF realizes a denser receptive field by taking more neighbor pixels along the spatial and temporal dimensions. Second, it features a self-supervised recurrent video denoising framework, which concurrently integrates distant and near-neighbor temporal features. This enables long-term bidirectional information aggregation, while mitigating error accumulation in the plain recurrent framework. Our method exhibits superior performance on both synthetic and real video denoising datasets. Codes will be available at https://github.com/Wang-XIaoDingdd/RDRF.
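Summary
Self-supervised video denoising has made good progress, particularly through blind spot networks. Under the blind spot constraints, however, previous self-supervised video denoising methods suffer significant information loss and texture destruction, mainly because they give inadequate consideration to the receptive field. The limited number of neighbor frames available in earlier methods also causes distant temporal information to be discarded. Simply adopting existing recurrent frameworks does not work, because they easily violate the receptive field constraints imposed by self-supervision. This paper proposes RDRF, which both fully exploits information from the reference and neighbor frames and makes better use of temporal features from neighbor frames. First, RDRF achieves a denser receptive field by taking in more neighbor pixels along the spatial and temporal dimensions. Second, it provides a self-supervised recurrent video denoising framework that integrates distant and near-neighbor temporal features at the same time. This enables long-term bidirectional information aggregation and reduces the error accumulation of a plain recurrent framework. The method performs well on both synthetic and real video denoising datasets. Code will be available at https://github.com/Wang-XIaoDingdd/RDRF.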
FeatEnHancer: Enhancing Hierarchical Features for Object Detection and Beyond Under Low-Light Vision
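results: Across several low-light vision tasks, the FeatEnHancer module brings significant and consistent improvements, including dark object detection (+5.7 mAP on ExDark), face detection (+1.5 mAP on DARK FACE), nighttime semantic segmentation (+5.1 mIoU on ACDC), and video object detection (+1.8 mAP on DarkVision), showing the effectiveness of enhancing hierarchical features.
Abstract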
Extracting useful visual cues for downstream tasks is especially challenging under low-light vision. Prior works create enhanced representations by either correlating visual quality with machine perception or designing illumination-degrading transformation methods that require pre-training on synthetic datasets. We argue that optimizing the enhanced image representation with respect to the loss of the downstream task can result in more expressive representations. Therefore, in this work, we propose a novel module, FeatEnHancer, that hierarchically combines multiscale features using multiheaded attention guided by the task-related loss function to create suitable representations. Furthermore, our intra-scale enhancement improves the quality of features extracted at each scale or level, and combines features from different scales in a way that reflects their relative importance for the task at hand. FeatEnHancer is a general-purpose plug-and-play module and can be incorporated into any low-light vision pipeline. We show with extensive experimentation that the enhanced representation produced with FeatEnHancer significantly and consistently improves results in several low-light vision tasks, including dark object detection (+5.7 mAP on ExDark), face detection (+1.5 mAP on DARK FACE), nighttime semantic segmentation (+5.1 mIoU on ACDC), and video object detection (+1.8 mAP on DarkVision), highlighting the effectiveness of enhancing hierarchical features under low-light vision.
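Summary
Extracting useful visual cues is especially difficult under low-light vision. Prior work creates enhanced representations either by correlating visual quality with machine perception or by designing illumination-degrading transformation methods. The authors argue that optimizing the enhanced image representation with respect to the loss of the downstream task yields more expressive representations. They therefore propose a new module, FeatEnHancer, which hierarchically combines multiscale features using multi-headed attention guided by the task-related loss function to create suitable representations. In addition, intra-scale enhancement improves the quality of the features extracted at each scale and combines features from different scales in a way that reflects their relative importance for the task. FeatEnHancer is a general-purpose plug-and-play module that can be used in any low-light vision pipeline. Extensive experiments show that the enhanced representation improves results in several low-light vision tasks, including dark object detection (+5.7 mAP on ExDark), face detection (+1.5 mAP on DARK FACE), nighttime semantic segmentation (+5.1 mIoU on ACDC), and video object detection (+1.8 mAP on DarkVision), which highlights the effectiveness of enhancing hierarchical features under low-light vision.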
SoilNet: An Attention-based Spatio-temporal Deep Learning Framework for Soil Organic Carbon Prediction with Digital Soil Mapping in Europe
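results: The proposed architecture predicts soil organic carbon better than commonly used machine learning methods such as random forest, yielding lower error. The model is a reliable tool for predicting SOC and other soil properties, and can support land management and decision-making with accurate information.
Abstract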
Digital soil mapping (DSM) is an advanced approach that integrates statistical modeling and cutting-edge technologies, including machine learning (ML) methods, to accurately depict soil properties and their spatial distribution. Soil organic carbon (SOC) is a crucial soil attribute providing valuable insights into soil health, nutrient cycling, greenhouse gas emissions, and overall ecosystem productivity. This study highlights the significance of spatial-temporal deep learning (DL) techniques within the DSM framework. A novel architecture is proposed, incorporating spatial information using a base convolutional neural network (CNN) model and spatial attention mechanism, along with climate temporal information using a long short-term memory (LSTM) network, for SOC prediction across Europe. The model utilizes a comprehensive set of environmental features, including Landsat-8 images, topography, remote sensing indices, and climate time series, as input features. Results demonstrate that the proposed framework outperforms conventional ML approaches like random forest commonly used in DSM, yielding lower root mean square error (RMSE). This model is a robust tool for predicting SOC and could be applied to other soil properties, thereby contributing to the advancement of DSM techniques and facilitating land management and decision-making processes based on accurate information.
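Summary
Digital soil mapping (DSM) is an advanced approach that integrates statistical modeling with cutting-edge technologies, including machine learning (ML) methods, to accurately depict soil properties and their spatial distribution. Soil organic carbon (SOC) is an important soil attribute that provides insight into soil health, nutrient cycling, greenhouse gas emissions, and ecosystem productivity. This study highlights the value of spatial-temporal deep learning (DL) techniques within the DSM framework. It proposes a new architecture that captures spatial information with a base convolutional neural network (CNN) and a spatial attention mechanism, and climate temporal information with a long short-term memory (LSTM) network, to predict SOC across Europe. The model takes a comprehensive set of environmental features as input, including Landsat-8 images, topography, remote sensing indices, and climate time series. The proposed framework predicts SOC better than conventional ML approaches such as random forest, with lower root mean square error (RMSE). The model is a reliable SOC prediction tool that can be applied to other soil properties, supporting land management and decision-making with accurate information.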
Feature Decoupling-Recycling Network for Fast Interactive Segmentation
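results: Extensive experiments on 6 datasets from different domains and modalities show that the method is 1) more efficient than other methods (up to 4.25x faster), especially in challenging scenarios; 2) applicable to different methods as a universal enhancement technique; and 3) generalizable across tasks and robust.
Abstract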
Recent interactive segmentation methods iteratively take the source image, user guidance, and previously predicted mask as input without considering the invariant nature of the source image. As a result, extracting features from the source image is repeated in each interaction, resulting in substantial computational redundancy. In this work, we propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies and then recycles components for each user interaction. Thus, the efficiency of the whole interactive process can be significantly improved. To be specific, we apply the Decoupling-Recycling strategy from three perspectives to address three types of discrepancies, respectively. First, our model decouples the learning of source image semantics from the encoding of user guidance to process two types of input domains separately. Second, FDRN decouples high-level and low-level features from stratified semantic representations to enhance feature learning. Third, during the encoding of user guidance, current user guidance is decoupled from historical guidance to highlight the effect of current user guidance. We conduct extensive experiments on 6 datasets from different domains and modalities, which demonstrate the following merits of our model: 1) higher efficiency than other methods, particularly advantageous in challenging scenarios requiring long-term interactions (up to 4.25x faster), while achieving favorable segmentation performance; 2) strong applicability to various methods, serving as a universal enhancement technique; 3) good cross-task generalizability, e.g., to medical image segmentation, and robustness against misleading user guidance.
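Summary
Recent interactive segmentation methods iteratively take the source image, the user guidance, and the previously predicted mask as input, without taking into account that the source image does not change. Features are therefore extracted from the source image again in every interaction, which causes substantial computational redundancy. The paper proposes the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies and then recycles components across user interactions, improving the efficiency of the whole interactive process. The Decoupling-Recycling strategy is applied from three perspectives to address three types of discrepancy. First, the model decouples the learning of source image semantics from the encoding of user guidance, so the two input domains are processed separately. Second, FDRN decouples high-level and low-level features from stratified semantic representations to enhance feature learning. Third, during the encoding of user guidance, current guidance is decoupled from historical guidance to emphasize the effect of the current guidance. Extensive experiments on 6 datasets from different domains and modalities show that the model 1) is markedly more efficient in long-term interactions (up to 4.25x faster) while achieving favorable segmentation performance; 2) serves as a universal enhancement technique applicable to various methods; and 3) generalizes well across tasks, for example to medical image segmentation, and is robust.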
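The first of the three decouplings (image semantics versus user guidance) amounts to computing image features once and reusing them for every interaction. The sketch below shows only that caching idea with toy stand-ins; the class `RecyclingSegmenter`, the toy encoder and decoder, and the click format are all hypothetical and are not the FDRN architecture.

```python
class RecyclingSegmenter:
    """Encode the image once, then reuse the cached features for every click."""

    def __init__(self, encode_image, decode_mask):
        self.encode_image = encode_image   # expensive, depends on the image only
        self.decode_mask = decode_mask     # cheap, runs once per interaction
        self._features = None

    def set_image(self, image):
        # The source image does not change between interactions,
        # so its features are computed a single time here.
        self._features = self.encode_image(image)

    def interact(self, clicks):
        if self._features is None:
            raise RuntimeError("call set_image() before interact()")
        return self.decode_mask(self._features, clicks)


# Toy stand-ins (hypothetical): count encoder calls to make the recycling visible.
calls = {"encode": 0}

def toy_encoder(image):
    calls["encode"] += 1
    return [pixel * 2 for pixel in image]

def toy_decoder(features, clicks):
    # Mark a position as foreground if it was clicked and has a nonzero feature.
    return [1 if i in clicks and f > 0 else 0 for i, f in enumerate(features)]

segmenter = RecyclingSegmenter(toy_encoder, toy_decoder)
segmenter.set_image([0, 3, 5, 0])
masks = [segmenter.interact(clicks) for clicks in ([1], [1, 2], [0, 1, 2])]
```

In this toy run the encoder is invoked once for three interactions, so the per-interaction cost is only that of the lightweight decoder, which is the source of the savings in long interaction sequences.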
Keyword Spotting Simplified: A Segmentation-Free Approach using Character Counting and CTC re-scoring
results: Experiments validate that the method can outperform the current best methods, even though the model it uses is very simple and compact.
Abstract
Recent advances in segmentation-free keyword spotting treat this problem w.r.t. an object detection paradigm and borrow from state-of-the-art detection systems to simultaneously propose a word bounding box proposal mechanism and compute a corresponding representation. Contrary to the norm of such methods that rely on complex and large DNN models, we propose a novel segmentation-free system that efficiently scans a document image to find rectangular areas that include the query information. The underlying model is simple and compact, predicting character occurrences over rectangular areas through an implicitly learned scale map, trained on word-level annotated images. The proposed document scanning is then performed using this character counting in a cost-effective manner via integral images and binary search. Finally, the retrieval similarity by character counting is refined by a pyramidal representation and a CTC-based re-scoring algorithm, fully utilizing the trained CNN model. Experimental validation on two widely-used datasets shows that our method achieves state-of-the-art results outperforming the more complex alternatives, despite the simplicity of the underlying model.
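Summary
Recent segmentation-free keyword spotting methods cast the problem as object detection, borrowing from state-of-the-art detection systems to jointly propose word bounding boxes and compute a corresponding representation. In contrast to such methods, which rely on large and complex DNN models, we propose a simple and compact segmentation-free system that efficiently scans a document image to find rectangular regions containing the query. The model predicts character occurrences over rectangular regions through an implicitly learned scale map and is trained on word-level annotated images. Document scanning is then carried out cost-effectively with this character counting, using integral images and binary search. Finally, the retrieval similarity is refined with a pyramidal representation and a CTC-based re-scoring algorithm that fully exploits the trained CNN model. Experiments on two widely used datasets show that the method achieves state-of-the-art results and outperforms more complex alternatives despite the simplicity of the model.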
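The combination of integral images and binary search mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a single non-negative per-pixel density map whose sum over a box approximates the number of characters in that box, and the function names are hypothetical. The integral image makes any box sum a constant-time lookup, and because the density is non-negative the count grows monotonically with box width, so the narrowest box reaching a target count can be found by binary search.

```python
from typing import List


def integral_image(density: List[List[float]]) -> List[List[float]]:
    """Summed-area table with an extra zero first row and column."""
    h, w = len(density), len(density[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            sat[r + 1][c + 1] = (density[r][c] + sat[r][c + 1]
                                 + sat[r + 1][c] - sat[r][c])
    return sat


def box_sum(sat: List[List[float]], top: int, left: int,
            bottom: int, right: int) -> float:
    """Sum of the density over rows [top, bottom) and columns [left, right)."""
    return (sat[bottom][right] - sat[top][right]
            - sat[bottom][left] + sat[top][left])


def min_right_edge(sat: List[List[float]], top: int, left: int,
                   bottom: int, target: float) -> int:
    """Smallest right edge whose box holds at least `target` characters.

    Assumes target > 0 and a non-negative density. Returns -1 when even
    the widest box starting at `left` falls short of the target.
    """
    lo, hi = left + 1, len(sat[0]) - 1
    if box_sum(sat, top, left, bottom, hi) < target:
        return -1
    while lo < hi:
        mid = (lo + hi) // 2
        if box_sum(sat, top, left, bottom, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


density = [[1, 0, 1, 1],
           [0, 0, 0, 0]]
sat = integral_image(density)
print(box_sum(sat, 0, 0, 2, 4), min_right_edge(sat, 0, 0, 2, 2))
```

A per-character variant would keep one such table per character class and compare the counted characters against the query string; that extension is left out here to keep the sketch short.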
Learning Photometric Feature Transform for Free-form Object Scan
paper_authors: Xiang Feng, Kaizhang Kang, Fan Pei, Huakeng Ding, Jinjiang You, Ping Tan, Kun Zhou, Hongzhi Wu
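for: To improve the accuracy and speed of 3D reconstruction.
methods: Automatically learns to aggregate and transform photometric measurements from multiple views, with the feature transform trained jointly with the illumination conditions.
results: Achieves accurate and fast 3D reconstruction; the results are validated against a professional 3D scanner and photographs and compare favorably with current techniques.
Abstract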
We propose a novel framework to automatically learn to aggregate and transform photometric measurements from multiple unstructured views into spatially distinctive and view-invariant low-level features, which are fed to a multi-view stereo method to enhance 3D reconstruction. The illumination conditions during acquisition and the feature transform are jointly trained on a large amount of synthetic data. We further build a system to reconstruct the geometry and anisotropic reflectance of a variety of challenging objects from hand-held scans. The effectiveness of the system is demonstrated with a lightweight prototype, consisting of a camera and an array of LEDs, as well as an off-the-shelf tablet. Our results are validated against reconstructions from a professional 3D scanner and photographs, and compare favorably with state-of-the-art techniques.
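Summary
We propose a new framework that automatically learns to aggregate and transform photometric measurements from multiple unstructured views into spatially distinctive, view-invariant low-level features, which are passed to a multi-view stereo method to enhance 3D reconstruction. The illumination conditions during acquisition and the feature transform are trained jointly on a large amount of synthetic data. We also build a system that reconstructs the geometry and anisotropic reflectance of objects from hand-held scans. The system is demonstrated with a lightweight prototype consisting of a camera and an LED array, as well as with an off-the-shelf tablet. The results are validated against a professional 3D scanner and photographs, and compare favorably with existing techniques.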
Improving Mass Detection in Mammography Images: A Study of Weakly Supervised Learning and Class Activation Map Methods
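results: The study finds that using different activation map methods during the training and test stages improves model performance, in particular lowering the False Positive Per Image (FPPI) value and raising the True Positive Rate (TPR).
Abstract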
In recent years, weakly supervised models have aided in mass detection using mammography images, decreasing the need for pixel-level annotations. However, most existing models in the literature rely on Class Activation Maps (CAM) as the activation method, overlooking the potential benefits of exploring other activation techniques. This work presents a study that explores and compares different activation maps in conjunction with state-of-the-art methods for weakly supervised training in mammography images. Specifically, we investigate CAM, GradCAM, GradCAM++, XGradCAM, and LayerCAM methods within the framework of the GMIC model for mass detection in mammography images. The evaluation is conducted on the VinDr-Mammo dataset, utilizing the metrics Accuracy, True Positive Rate (TPR), False Negative Rate (FNR), and False Positive Per Image (FPPI). Results show that using different strategies of activation maps during training and test stages leads to an improvement of the model. With this strategy, we improve the results of the GMIC method, decreasing the FPPI value and increasing TPR.
Summary
Recently, weakly supervised models have been used for mass detection in mammography images, reducing the need for pixel-level annotations. However, most existing models in the literature rely on Class Activation Maps (CAM) as the activation method, without exploring other activation techniques. This study aims to explore and compare different activation maps in conjunction with state-of-the-art methods for weakly supervised training in mammography images. Specifically, we investigate CAM, GradCAM, GradCAM++, XGradCAM, and LayerCAM methods within the framework of the GMIC model for mass detection in mammography images. The evaluation is conducted on the VinDr-Mammo dataset, using Accuracy, True Positive Rate (TPR), False Negative Rate (FNR), and False Positive Per Image (FPPI) metrics. Results show that using different strategies of activation maps during training and test stages leads to improved model performance, with a decrease in FPPI and an increase in TPR.
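The detection metrics named above can be computed from raw counts as in the sketch below. It uses the standard definitions TPR = TP / (TP + FN), FNR = FN / (TP + FN), and FPPI = FP / number of images. How a predicted region is matched to a ground-truth mass (and therefore how TP, FN, and FP are counted) is not stated in the abstract, so the counts are taken as given inputs; the function name is hypothetical.

```python
def detection_metrics(tp: int, fn: int, fp: int, n_images: int) -> dict:
    """Return TPR, FNR, and FPPI from detection counts.

    tp, fn, fp are counts of true positives, false negatives, and false
    positives over the whole test set; n_images is the number of images.
    """
    if n_images <= 0:
        raise ValueError("n_images must be positive")
    positives = tp + fn
    tpr = tp / positives if positives else 0.0
    fnr = fn / positives if positives else 0.0
    return {"TPR": tpr, "FNR": fnr, "FPPI": fp / n_images}


metrics = detection_metrics(tp=8, fn=2, fp=5, n_images=10)
print(metrics)
```

Lower FPPI at the same TPR means fewer false alarms per mammogram for the same sensitivity, which is the direction of improvement the abstract reports.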
paper_authors: Sohail Ahmed Khan, Duc-Tien Dang-Nguyen
for: This paper aims to provide insights into the effectiveness of different deep learning architectures, training strategies, and deepfake detection benchmarks for developing more accurate and reliable deepfake detection systems.
methods: The paper evaluates eight supervised deep learning architectures and two transformer-based models pre-trained using self-supervised strategies on four benchmarks, including intra-dataset and inter-dataset evaluations, to examine the best performing models, generalisation capabilities, and impact of augmentations.
results: The paper presents a comprehensive comparative analysis of supervised and self-supervised models for deepfake detection, including the best performing models, generalisation capabilities, and impact of augmentations, to provide insights into the effectiveness of different deep learning architectures, training strategies, and deepfake detection benchmarks.
Abstract
This paper presents a comprehensive comparative analysis of supervised and self-supervised models for deepfake detection. We evaluate eight supervised deep learning architectures and two transformer-based models pre-trained using self-supervised strategies (DINO, CLIP) on four benchmarks (FakeAVCeleb, CelebDF-V2, DFDC, and FaceForensics++). Our analysis includes intra-dataset and inter-dataset evaluations, examining the best performing models, generalisation capabilities, and impact of augmentations. We also investigate the trade-off between model size and performance. Our main goal is to provide insights into the effectiveness of different deep learning architectures (transformers, CNNs), training strategies (supervised, self-supervised), and deepfake detection benchmarks. These insights can help guide the development of more accurate and reliable deepfake detection systems, which are crucial in mitigating the harmful impact of deepfakes on individuals and society.
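Summary
This paper presents a comparative analysis of supervised and self-supervised models for deepfake detection. We evaluate 8 supervised deep learning architectures and 2 transformer-based models pre-trained with self-supervised strategies (DINO, CLIP) on 4 benchmarks (FakeAVCeleb, CelebDF-V2, DFDC, FaceForensics++). The analysis covers intra-dataset and inter-dataset evaluation, examining the best performing models, generalisation capability, and the impact of data augmentation. We also study the trade-off between model size and performance. Our main goal is to provide insight into the effectiveness of different deep learning architectures (transformers, CNNs), training strategies (supervised, self-supervised), and deepfake detection benchmarks, in order to help develop more accurate and reliable deepfake detection systems, which matter greatly for limiting the impact of deepfakes on individuals and society.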
RoadScan: A Novel and Robust Transfer Learning Framework for Autonomous Pothole Detection in Roads
results: The method detects potholes accurately, reaching 96.12% accuracy, 3.89% EER, and 0.988 AUROC, which is highly competitive with other state-of-the-art works.
Abstract
This research paper presents a novel approach to pothole detection using Deep Learning and Image Processing techniques. The proposed system leverages the VGG16 model for feature extraction and utilizes a custom Siamese network with triplet loss, referred to as RoadScan. The system aims to address the critical issue of potholes on roads, which pose significant risks to road users and have caused numerous accidents. Although it is necessary to completely remove potholes, it is a time-consuming process. Hence, a general road user should be able to detect potholes from a safe distance in order to avoid damage. Existing methods for pothole detection rely heavily on object detection algorithms, which tend to have a high chance of failure owing to the similarity in structures and textures of a road and a pothole. Additionally, these systems utilize millions of parameters, thereby making the model difficult to use in small-scale applications for the general citizen. By analyzing diverse image processing methods and various high-performing networks, the proposed model achieves remarkable performance in accurately detecting potholes. Evaluation metrics such as accuracy, EER, precision, recall, and AUROC validate the effectiveness of the system. Additionally, the proposed model demonstrates computational efficiency and cost-effectiveness by utilizing fewer parameters and data for training. The research highlights the importance of technology in the transportation sector and its potential to enhance road safety and convenience. The network proposed in this model performs with 96.12% accuracy, 3.89% EER, and a 0.988 AUROC value, which is highly competitive with other state-of-the-art works.
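The triplet loss used to train the Siamese network in RoadScan can be illustrated in its common textbook form, max(0, d(a, p) - d(a, n) + margin). The sketch below assumes Euclidean distance on plain feature vectors and a margin of 1.0; the abstract does not state the distance function or margin the authors use, so both are assumptions made for illustration.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Zero once the negative is at least `margin` farther than the positive."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


# Anchor and positive could be embeddings of two pothole images,
# the negative an embedding of plain road surface (values are made up).
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])
print(easy, hard)
```

The loss vanishes for the first triplet because the negative is already far from the anchor, and stays positive for the second, where road and pothole embeddings are still too close.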
DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis
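methods: The paper proposes a new method, DiffSynth, with two key components: a latent in-iteration deflickering framework and a video deflickering algorithm.
results: Experiments show that DiffSynth effectively avoids flicker in videos and performs well across video synthesis tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoring, and 3D rendering.
Abstract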
In recent years, diffusion models have emerged as the most powerful approach in image synthesis. However, applying these models directly to video synthesis presents challenges, as it often leads to noticeable flickering contents. Although recently proposed zero-shot methods can alleviate flicker to some extent, we still struggle to generate coherent videos. In this paper, we propose DiffSynth, a novel approach that aims to convert image synthesis pipelines to video synthesis pipelines. DiffSynth consists of two key components: a latent in-iteration deflickering framework and a video deflickering algorithm. The latent in-iteration deflickering framework applies video deflickering to the latent space of diffusion models, effectively preventing flicker accumulation in intermediate steps. Additionally, we propose a video deflickering algorithm, named patch blending algorithm, that remaps objects in different frames and blends them together to enhance video consistency. One of the notable advantages of DiffSynth is its general applicability to various video synthesis tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoring, and 3D rendering. In the task of text-guided video stylization, we make it possible to synthesize high-quality videos without cherry-picking. The experimental results demonstrate the effectiveness of DiffSynth. All videos can be viewed on our project page. Source codes will also be released.
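Summary
In recent years, diffusion models have been widely adopted for image synthesis, but applying them directly to video synthesis is challenging because it often produces noticeable flicker. Recently proposed zero-shot methods alleviate flicker to some extent, yet generating coherent videos remains difficult. In this paper we propose DiffSynth, a new approach that aims to convert image synthesis pipelines into video synthesis pipelines. DiffSynth has two key components: a latent in-iteration deflickering framework and a video deflickering algorithm. The latent in-iteration deflickering framework applies video deflickering in the latent space of the diffusion model, preventing flicker from accumulating in intermediate steps. The video deflickering algorithm, called the patch blending algorithm, remaps objects across frames and blends them to improve video consistency. A notable advantage of DiffSynth is that it applies to many video synthesis tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoring, and 3D rendering. For text-guided video stylization, high-quality videos can be synthesized without cherry-picking. Experimental results demonstrate the effectiveness of DiffSynth. All videos can be viewed on the project page, and the source code will be released.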
Cross-Silo Prototypical Calibration for Federated Learning with Non-IID Data
paper_authors: Zhuang Qi, Lei Meng, Zitan Chen, Han Hu, Hui Lin, Xiangxu Meng
for: This paper aims to improve the performance of federated learning by addressing the issue of dataset biases, such as heterogeneous data distributions and missing classes, through a cross-silo prototypical calibration method called FedCSPC.
methods: The FedCSPC method uses a Data Prototypical Modeling (DPM) module to learn data patterns via clustering, and a cross-silo prototypical calibration (CSPC) module to improve the robustness of the calibration. The CSPC module projects cross-source features into a consistent space while maintaining clear decision boundaries.
results: The paper shows that FedCSPC outperforms state-of-the-art methods in learning consistent features across different data sources of the same class, leading to better performance. The results are demonstrated through experiments on four datasets, including an ablation study, in-depth analysis, and case study.
Abstract
Federated Learning aims to learn a global model on the server side that generalizes to all clients in a privacy-preserving manner, by leveraging the local models from different clients. Existing solutions focus on either regularizing the objective functions among clients or improving the aggregation mechanism for the improved model generalization capability. However, their performance is typically limited by the dataset biases, such as the heterogeneous data distributions and the missing classes. To address this issue, this paper presents a cross-silo prototypical calibration method (FedCSPC), which takes additional prototype information from the clients to learn a unified feature space on the server side. Specifically, FedCSPC first employs the Data Prototypical Modeling (DPM) module to learn data patterns via clustering to aid calibration. Subsequently, the cross-silo prototypical calibration (CSPC) module develops an augmented contrastive learning method to improve the robustness of the calibration, which can effectively project cross-source features into a consistent space while maintaining clear decision boundaries. Moreover, the CSPC module's ease of implementation and plug-and-play characteristics make it even more remarkable. Experiments were conducted on four datasets in terms of performance comparison, ablation study, in-depth analysis and case study, and the results verified that FedCSPC is capable of learning the consistent features across different data sources of the same class under the guidance of calibrated model, which leads to better performance than the state-of-the-art methods. The source codes have been released at https://github.com/qizhuang-qz/FedCSPC.
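Summary
Federated learning aims to learn a global model on the server side that generalizes to all clients in a privacy-preserving manner by leveraging the clients' local models. Existing solutions either regularize the objective functions among clients or improve the aggregation mechanism to strengthen generalization. Their performance, however, is usually limited by dataset biases such as heterogeneous data distributions and missing classes. To address this, the paper proposes a cross-silo prototypical calibration method (FedCSPC), which uses additional prototype information from the clients to learn a unified feature space on the server side. Specifically, FedCSPC first uses the Data Prototypical Modeling (DPM) module to learn data patterns that aid calibration. The cross-silo prototypical calibration (CSPC) module then develops an augmented contrastive learning method that projects cross-source features into a consistent space while keeping decision boundaries clear. The CSPC module is also simple to implement. Experiments show that FedCSPC learns consistent features for the same class across different data sources and thereby outperforms existing methods. The code has been released at https://github.com/qizhuang-qz/FedCSPC.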
Lighting Every Darkness in Two Pairs: A Calibration-Free Pipeline for RAW Denoising
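results: Compared with other calibration-based methods, the method achieves better denoising across different digital gains and cameras while needing only a few paired samples and 0.5% of the iterations.
Abstract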
Calibration-based methods have dominated RAW image denoising under extremely low-light environments. However, these methods suffer from several main deficiencies: 1) the calibration procedure is laborious and time-consuming, 2) denoisers for different cameras are difficult to transfer, and 3) the discrepancy between synthetic noise and real noise is enlarged by high digital gain. To overcome the above shortcomings, we propose a calibration-free pipeline for Lighting Every Darkness (LED), regardless of the digital gain or camera sensor. Instead of calibrating the noise parameters and training repeatedly, our method can adapt to a target camera with only few-shot paired data and fine-tuning. In addition, well-designed structural modification during both stages alleviates the domain gap between synthetic and real noise without any extra computational cost. With 2 pairs for each additional digital gain (in total 6 pairs) and 0.5% iterations, our method achieves superior performance over other calibration-based methods. Our code is available at https://github.com/Srameo/LED .
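Summary
Calibration-based methods have dominated RAW image denoising in extremely low-light environments, but they suffer from several main shortcomings: 1) the calibration procedure is laborious and time-consuming, 2) denoisers are difficult to transfer between cameras, and 3) high digital gain enlarges the discrepancy between synthetic and real noise. To overcome these shortcomings, we propose a calibration-free pipeline, LED, that works regardless of the digital gain or camera sensor. Instead of calibrating noise parameters and training repeatedly, the method adapts to a target camera using only a few paired samples and fine-tuning. In addition, structural modifications in both stages reduce the gap between synthetic and real noise at no extra computational cost. With 2 pairs for each additional digital gain (6 pairs in total) and 0.5% of the iterations, the method outperforms other calibration-based methods. The code is available at https://github.com/Srameo/LED .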
GaFET: Learning Geometry-aware Facial Expression Translation from In-The-Wild Images
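results: Extensive qualitative and quantitative experiments show that the method achieves higher-quality and more accurate facial expression transfer without requiring videos or annotated data, and that it handles varied poses and complex textures.
Abstract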
While current face animation methods can manipulate expressions individually, they suffer from several limitations. The expressions manipulated by some motion-based facial reenactment models are crude. Other approaches modeled with facial action units cannot generalize to arbitrary expressions not covered by annotations. In this paper, we introduce a novel Geometry-aware Facial Expression Translation (GaFET) framework, which is based on parametric 3D facial representations and can stably decouple expressions. Within it, a Multi-level Feature Aligned Transformer is proposed to complement non-geometric facial detail features while addressing the alignment challenge of spatial features. Further, we design a De-expression model based on StyleGAN, in order to reduce the learning difficulty of GaFET in unpaired "in-the-wild" images. Extensive qualitative and quantitative experiments demonstrate that we achieve higher-quality and more accurate facial expression transfer results compared to state-of-the-art methods, and demonstrate applicability to various poses and complex textures. Besides, videos and annotated training data are not required, making our method easier to use and generalize.
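Summary
Current face animation methods can manipulate expressions individually, but they have several limitations. The expressions produced by some motion-based facial reenactment models are crude, and approaches based on facial action units cannot generalize to expressions not covered by annotations. In this paper we introduce a new Geometry-aware Facial Expression Translation (GaFET) framework, which is based on parametric 3D facial representations and can stably decouple expressions. Within it, a Multi-level Feature Aligned Transformer is proposed to supplement non-geometric facial detail features while addressing the alignment of spatial features. We further design a De-expression model based on StyleGAN to reduce the difficulty of learning GaFET from unpaired "in-the-wild" images. Extensive qualitative and quantitative experiments show that the method achieves higher-quality and more accurate facial expression transfer, and that it applies to varied poses and complex textures. Moreover, no videos or annotated training data are required, which makes the method easier to use and to generalize.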
A Horse with no Labels: Self-Supervised Horse Pose Estimation from Unlabelled Images and Synthetic Prior
results: Accurate animal poses can be learned using only a small set of synthetic 2D poses and unlabelled images.
Abstract
Obtaining labelled data to train deep learning methods for estimating animal pose is challenging. Recently, synthetic data has been widely used for pose estimation tasks, but most methods still rely on supervised learning paradigms utilising synthetic images and labels. Can training be fully unsupervised? Is a tiny synthetic dataset sufficient? What are the minimum assumptions that we could make for estimating animal pose? Our proposal addresses these questions through a simple yet effective self-supervised method that only assumes the availability of unlabelled images and a small set of synthetic 2D poses. We completely remove the need for any 3D or 2D pose annotations (or complex 3D animal models), and surprisingly our approach can still learn accurate 3D and 2D poses simultaneously. We train our method with unlabelled images of horses mainly collected from YouTube videos and a prior consisting of 2D synthetic poses. The latter is three times smaller than the number of images needed for training. We test our method on a challenging set of horse images and evaluate the predicted 3D and 2D poses. We demonstrate that it is possible to learn accurate animal poses even with as few assumptions as unlabelled images and a small set of 2D poses generated from synthetic data. Given the minimum requirements and the abundance of unlabelled data, our method could be easily deployed to different animals.
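Summary
Obtaining labelled data to train deep learning methods for animal pose estimation is challenging. Synthetic data is now widely used for pose estimation, but most methods still follow supervised learning paradigms that use synthetic images and labels. Can training be fully unsupervised? Is a tiny synthetic dataset sufficient? Our proposal answers these questions with a simple yet effective self-supervised method that assumes only unlabelled images and a small set of synthetic 2D poses. We completely remove the need for any 3D or 2D pose annotations (or complex 3D animal models), and the method can still learn accurate 3D and 2D poses. We train it with unlabelled horse images collected from YouTube and a small set of synthetic 2D poses; the latter is three times smaller than the number of training images. We test on a challenging set of horse images and evaluate the predicted 3D and 2D poses. We show that accurate animal poses can be learned from only unlabelled images and a small number of synthetic 2D poses. Given the minimal assumptions and the abundance of unlabelled data, the method could easily be applied to other animals.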
DiT: Efficient Vision Transformers with Dynamic Token Routing
for: ImageNet classification, object detection, instance segmentation, and semantic segmentation
methods: Data-dependent token routing strategy for Dynamic Vision Transformer (DiT) with differentiable routing gates for multi-path feature propagation, and budget constraints for routing gate and early-stopping of feature extraction.
results: Superior performance and favorable complexity/accuracy trade-offs compared to many State-of-the-Art (SoTA) methods on various vision tasks, with the DiT-B5 achieving 84.8% top-1 Acc on ImageNet with 10.3 GFLOPs, which is 1.0% higher than the SoTA method with similar computational complexity.
Abstract
Currently, the tokens of images share the same static data flow in many dense networks. However, challenges arise from the variance among the objects in images, such as large variations in the spatial scale and difficulties of recognition for visual entities. In this paper, we propose a data-dependent token routing strategy to elaborate the routing paths of image tokens for Dynamic Vision Transformer, dubbed DiT. The proposed framework generates a data-dependent path per token, adapting to the object scales and visual discrimination of tokens. In feed-forward, the differentiable routing gates are designed to select the scaling paths and feature transformation paths for image tokens, leading to multi-path feature propagation. In this way, the impact of object scales and visual discrimination of image representation can be carefully tuned. Moreover, the computational cost can be further reduced by giving budget constraints to the routing gate and early-stopping of feature extraction. In experiments, our DiT achieves superior performance and more favorable complexity/accuracy trade-offs than many SoTA methods on ImageNet classification, object detection, instance segmentation, and semantic segmentation. Particularly, the DiT-B5 obtains 84.8% top-1 Acc on ImageNet with 10.3 GFLOPs, which is 1.0% higher than that of the SoTA method with similar computational complexity. These extensive results demonstrate that DiT can serve as a versatile backbone for various vision tasks.
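Summary
In many dense networks, image tokens currently share the same static data flow. However, variation among the objects in images poses challenges, including large differences in spatial scale and difficulty in recognizing visual entities. In this paper we propose a data-dependent token routing strategy that refines the routing paths of image tokens for the Dynamic Vision Transformer (DiT). The framework generates a data-dependent path per token, adapting to object scale and the visual discrimination of tokens. In the feed-forward pass, differentiable routing gates select scaling paths and feature transformation paths, giving multi-path feature propagation. In this way the influence of object scale and visual discrimination can be finely tuned. The computational cost can be reduced further by imposing budget constraints on the routing gates and stopping feature extraction early. In experiments, DiT shows strong performance and a favorable complexity/accuracy trade-off on ImageNet classification, object detection, instance segmentation, and semantic segmentation. In particular, DiT-B5 reaches 84.8% top-1 accuracy on ImageNet, 1.0% higher than the SoTA method of similar computational complexity. These results indicate that DiT can serve as a versatile backbone for a variety of vision tasks.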
paper_authors: Kaixuan Wei, Xiao Li, Johannes Froech, Praneeth Chakravarthula, James Whitehead, Ethan Tseng, Arka Majumdar, Felix Heide
for: This paper aims to improve the performance of optical neural networks for image recognition tasks, with the goal of bringing optical neural networks into the modern deep learning era.
methods: The paper introduces a large-kernel spatially-varying convolutional neural network learned via low-dimensional reparameterization techniques, and experiments with a flat meta-optical system that includes an array of nanophotonic structures to induce angle-dependent responses.
results: The paper achieves a blind test classification accuracy of 73.80% on the CIFAR-10 dataset with a nanophotonic neural network, outperforming the first modern digital neural network (AlexNet) with 57M parameters and bringing optical neural networks into the modern deep learning era.
Abstract
The explosive growth of the computation and energy cost of artificial intelligence has spurred strong interest in new computing modalities as potential alternatives to conventional electronic processors. Photonic processors, which execute operations using photons instead of electrons, have promised to enable optical neural networks with ultra-low latency and power consumption. However, existing optical neural networks, limited by the underlying network designs, have achieved image recognition accuracy much lower than state-of-the-art electronic neural networks. In this work, we close this gap by introducing a large-kernel spatially-varying convolutional neural network learned via low-dimensional reparameterization techniques. We experimentally instantiate the network with a flat meta-optical system that encompasses an array of nanophotonic structures designed to induce angle-dependent responses. Combined with an extremely lightweight electronic backend with approximately 2K parameters, we demonstrate that a nanophotonic neural network reaches 73.80% blind test classification accuracy on the CIFAR-10 dataset. As such, for the first time, an optical neural network outperforms the first modern digital neural network, AlexNet (72.64%) with 57M parameters, bringing optical neural networks into the modern deep learning era.
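Summary
The rapid growth in the computation and energy cost of artificial intelligence has spurred interest in new computing modalities as alternatives to conventional electronic processors. Photonic processors, which execute operations with photons instead of electrons, promise optical neural networks with ultra-low latency and power consumption. However, existing optical neural networks, limited by their underlying network designs, reach image recognition accuracy far below that of state-of-the-art electronic neural networks. In this work we close this gap with a large-kernel spatially-varying convolutional neural network learned via low-dimensional reparameterization techniques. We realize the network experimentally with a flat meta-optical system that includes an array of nanophotonic structures designed to induce angle-dependent responses. Combined with an extremely lightweight electronic backend of about 2K parameters, the nanophotonic neural network reaches 73.80% blind test accuracy on the CIFAR-10 dataset, exceeding AlexNet (72.64%). This is the first time an optical neural network has outperformed the first modern digital neural network, bringing optical neural networks into the modern deep learning era.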
Enhancing Nucleus Segmentation with HARU-Net: A Hybrid Attention Based Residual U-Blocks Network
results: Extensive quantitative evaluation on multiple datasets demonstrates the method's superiority over state-of-the-art methods on the BNS, MoNuSeg, CoNSeg, and CPM-17 datasets.
Abstract
Nucleus image segmentation is a crucial step in analysis, pathological diagnosis, and classification, all of which rely heavily on the quality of nucleus segmentation. However, variations in nucleus size, blurred nucleus contours, uneven staining, cell clustering, and overlapping cells pose significant challenges. Current methods for nucleus segmentation primarily rely on nuclear morphology or contour-based approaches. Nuclear morphology-based methods exhibit limited generalization ability and struggle to effectively predict irregular-shaped nuclei, while contour-based extraction methods face challenges in accurately segmenting overlapping nuclei. To address the aforementioned issues, we propose a dual-branch network using hybrid attention based residual U-blocks for nucleus instance segmentation. The network simultaneously predicts target information and target contours. Additionally, we introduce a post-processing method that combines the target information and target contours to distinguish overlapping nuclei and generate an instance segmentation image. Within the network, we propose a context fusion block (CF-block) that effectively extracts and merges contextual information from the network. Extensive quantitative evaluations are conducted to assess the performance of our method. Experimental results demonstrate the superior performance of the proposed method compared to state-of-the-art approaches on the BNS, MoNuSeg, CoNSeg, and CPM-17 datasets.
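Summary
Nucleus segmentation is a key step in biological analysis, diagnosis, and classification, all of which depend on segmentation quality. However, variations in nucleus size, blurred contours, uneven staining, cell clustering, and overlapping cells make the task challenging. Existing nucleus segmentation methods rely mainly on nuclear morphology or on contours. Morphology-based methods generalize poorly and struggle to predict irregularly shaped nuclei, while contour-based methods have difficulty with overlapping nuclei. To address these problems, we propose a dual-branch network for nucleus instance segmentation that predicts target information and target contours simultaneously. We also propose a post-processing method that combines the target information and target contours to separate overlapping nuclei. Within the network, a context fusion block (CF-block) effectively extracts and merges contextual information. Extensive quantitative evaluation shows that the proposed method clearly outperforms existing methods on the BNS, MoNuSeg, CoNSeg, and CPM-17 datasets.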
paper_authors: Yingchi Liu, Zhu Liu, Long Ma, Jinyuan Liu, Xin Fan, Zhongxuan Luo, Risheng Liu
for: The paper addresses the construction of deep learning schemes for Low-Light Vision (LLV) tasks.
methods: The paper proposes a generic low-light vision solution by introducing a generative block to convert data from the RAW to the RGB domain, and establishes a bilevel model to precisely characterize the latent correspondence between the generative procedure and the vision task.
results: The paper demonstrates the superiority of the proposed approach on three representative low-light vision tasks, namely enhancement, detection, and segmentation, and shows that the generative blocks have a strong generalization ability in other low-light vision tasks.
Abstract
Recently, there has been a growing interest in constructing deep learning schemes for Low-Light Vision (LLV). Existing techniques primarily focus on designing task-specific and data-dependent vision models on the standard RGB domain, which inherently contain latent data associations. In this study, we propose a generic low-light vision solution by introducing a generative block to convert data from the RAW to the RGB domain. This novel approach connects diverse vision problems by explicitly depicting data generation, which is the first in the field. To precisely characterize the latent correspondence between the generative procedure and the vision task, we establish a bilevel model with the parameters of the generative block defined as the upper level and the parameters of the vision task defined as the lower level. We further develop two types of learning strategies targeting different goals, namely low cost and high accuracy, to acquire a new bilevel generative learning paradigm. The generative blocks embrace a strong generalization ability in other low-light vision tasks through the bilevel optimization on enhancement tasks. Extensive experimental evaluations on three representative low-light vision tasks, namely enhancement, detection, and segmentation, fully demonstrate the superiority of our proposed approach. The code will be available at https://github.com/Yingchi1998/BGL.
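Summary
In recent years there has been growing interest in Low-Light Vision (LLV). Existing techniques focus mainly on designing task-specific, data-dependent vision models in the standard RGB domain, which inherently contain latent data associations. In this study we propose a generic low-light vision solution that introduces a generative block to convert data from the RAW domain to the RGB domain. This approach connects diverse vision problems by explicitly describing data generation, which is a first in the field. To characterize precisely the latent correspondence between the generative procedure and the vision task, we establish a bilevel model in which the parameters of the generative block form the upper level and the parameters of the vision task form the lower level. We further develop two learning strategies with different goals, low cost and high accuracy, yielding a new bilevel generative learning paradigm. Through bilevel optimization on enhancement tasks, the generative blocks generalize strongly to other low-light vision tasks. Extensive experiments on three representative low-light vision tasks (enhancement, detection, and segmentation) demonstrate the superiority of the proposed approach. The code will be available at https://github.com/Yingchi1998/BGL.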
VR-based body tracking to stimulate musculoskeletal training
paper_authors: M. Neidhardt, S. Gerlach, F. N. Schmidt, I. A. K. Fiedler, S. Grube, B. Busse, A. Schlaefer
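for: This study aims to develop a HoloLens 2-based virtual downhill skiing training application that provides individualized training and automatic evaluation for elderly and impaired persons.
methods: The study uses HoloLens 2 movement data to control and predict body movement and joint angles during musculoskeletal training. The researchers recorded 10 healthy volunteers with external tracking cameras and systematically analyzed whether whole-body movement can be derived from the HoloLens 2 movement data.
results: The results show a high correlation between the HoloLens 2 movement data and the external tracking data, in particular for upper body movement and lower-limb joint angles. No participant reported motion sickness, and all participants were able to interact with and control their movement quickly.
Abstract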
Training helps to maintain and improve sufficient muscle function, body control, and body coordination. These are important to reduce the risk of fracture incidents caused by falls, especially for the elderly or people recovering from injury. Virtual reality training can offer a cost-effective and individualized training experience. We present an application for the HoloLens 2 to enable musculoskeletal training for elderly and impaired persons to allow for autonomous training and automatic progress evaluation. We designed a virtual downhill skiing scenario that is controlled by body movement to stimulate balance and body control. By adapting the parameters of the ski slope, we can tailor the intensity of the training to individual users. In this work, we evaluate whether the movement data of the HoloLens 2 alone is sufficient to control and predict body movement and joint angles during musculoskeletal training. We record the movements of 10 healthy volunteers with external tracking cameras and track a set of body and joint angles of the participant during training. We estimate correlation coefficients and systematically analyze whether whole body movement can be derived from the movement data of the HoloLens 2. No participant reports movement sickness effects and all were able to quickly interact and control their movement during skiing. Our results show a high correlation between HoloLens 2 movement data and the external tracking of the upper body movement and joint angles of the lower limbs.
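Summary
Training helps maintain and improve muscle function, body control, and body coordination. These are important for reducing the risk of fractures caused by falls, especially for the elderly and for people recovering from injury. Virtual reality training can offer a cost-effective and individualized training experience. We present a HoloLens 2 application that enables musculoskeletal training for elderly and impaired persons, allowing autonomous training and automatic progress evaluation. We designed a virtual downhill skiing scenario controlled by body movement to stimulate balance and body control. By adapting the parameters of the ski slope, we can tailor the training intensity to individual users. In this work, we evaluate whether HoloLens 2 movement data alone is sufficient to control and predict body movement and joint angles during musculoskeletal training. We record the participants' movements with external tracking cameras and track a set of body and joint angles. We estimate correlation coefficients and systematically analyze whether whole-body movement can be derived from the HoloLens 2 movement data. No participant reported movement sickness, and all were able to quickly interact with and control their movement while skiing. Our results show a high correlation between HoloLens 2 movement data and the externally tracked upper-body movement and lower-limb joint angles.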
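The study above reports correlation coefficients between headset movement data and externally tracked joint angles. The abstract does not say which coefficient was used, so the following is only a minimal sketch assuming the Pearson coefficient. The function name `pearson_r` and the sample values are hypothetical and are not data from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical samples (not from the paper): lateral headset offset in metres
# and an externally tracked knee angle in degrees at the same time steps.
headset_offset = [0.00, 0.05, 0.10, 0.15, 0.20]
knee_angle = [10.0, 14.0, 19.0, 25.0, 30.0]
r = pearson_r(headset_offset, knee_angle)
```

A value of `r` close to 1 or -1 would indicate that the headset signal tracks the joint angle closely, which is the kind of relationship the abstract describes as a high correlation.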
Heterogeneous Forgetting Compensation for Class-Incremental Learning
results: Experiments show that the HFC model effectively addresses catastrophic forgetting and performs well across several datasets.
Abstract
Class-incremental learning (CIL) has achieved remarkable successes in learning new classes consecutively while overcoming catastrophic forgetting on old categories. However, most existing CIL methods unreasonably assume that all old categories have the same forgetting pace, and neglect negative influence of forgetting heterogeneity among different old classes on forgetting compensation. To surmount the above challenges, we develop a novel Heterogeneous Forgetting Compensation (HFC) model, which can resolve heterogeneous forgetting of easy-to-forget and hard-to-forget old categories from both representation and gradient aspects. Specifically, we design a task-semantic aggregation block to alleviate heterogeneous forgetting from representation aspect. It aggregates local category information within each task to learn task-shared global representations. Moreover, we develop two novel plug-and-play losses: a gradient-balanced forgetting compensation loss and a gradient-balanced relation distillation loss to alleviate forgetting from gradient aspect. They consider gradient-balanced compensation to rectify forgetting heterogeneity of old categories and heterogeneous relation consistency. Experiments on several representative datasets illustrate effectiveness of our HFC model. The code is available at https://github.com/JiahuaDong/HFC.
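Summary
Class-incremental learning (CIL) has achieved remarkable success in learning new classes consecutively while overcoming catastrophic forgetting of old categories. However, most existing CIL methods unreasonably assume that all old categories are forgotten at the same pace, and they neglect the negative influence of heterogeneous forgetting among old classes on forgetting compensation. To overcome these challenges, we develop a novel Heterogeneous Forgetting Compensation (HFC) model that resolves the heterogeneous forgetting of old categories. Specifically, we design a task-semantic aggregation block to alleviate heterogeneous forgetting from the representation aspect. It aggregates local category information within each task to learn task-shared global representations. Moreover, we develop two novel plug-and-play losses, a gradient-balanced forgetting compensation loss and a gradient-balanced relation distillation loss, to alleviate forgetting from the gradient aspect. They use gradient-balanced compensation to rectify the forgetting heterogeneity of old categories and heterogeneous relation consistency. Experiments show that our HFC model is effective. The code is available at https://github.com/JiahuaDong/HFC.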
Dual Aggregation Transformer for Image Super-Resolution
results: The DAT model performs strongly across multiple experiments and surpasses current methods. Code and models are available at https://github.com/zhengchen1999/DAT.
Abstract
Transformer has recently gained considerable popularity in low-level vision tasks, including image super-resolution (SR). These networks utilize self-attention along different dimensions, spatial or channel, and achieve impressive performance. This inspires us to combine the two dimensions in Transformer for a more powerful representation capability. Based on the above idea, we propose a novel Transformer model, Dual Aggregation Transformer (DAT), for image SR. Our DAT aggregates features across spatial and channel dimensions, in the inter-block and intra-block dual manner. Specifically, we alternately apply spatial and channel self-attention in consecutive Transformer blocks. The alternate strategy enables DAT to capture the global context and realize inter-block feature aggregation. Furthermore, we propose the adaptive interaction module (AIM) and the spatial-gate feed-forward network (SGFN) to achieve intra-block feature aggregation. AIM complements two self-attention mechanisms from corresponding dimensions. Meanwhile, SGFN introduces additional non-linear spatial information in the feed-forward network. Extensive experiments show that our DAT surpasses current methods. Code and models are obtainable at https://github.com/zhengchen1999/DAT.
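Summary
Transformers have recently gained considerable popularity in low-level vision tasks such as image super-resolution (SR). These networks apply self-attention along different dimensions, spatial or channel, and achieve impressive performance. This inspires us to combine the two dimensions, and we propose a new Transformer model, the Dual Aggregation Transformer (DAT), for image SR. DAT aggregates features across the spatial and channel dimensions in both an inter-block and an intra-block manner. Specifically, we alternately apply spatial and channel self-attention in consecutive Transformer blocks. This alternating strategy enables DAT to capture global context and realize inter-block feature aggregation. In addition, we propose the adaptive interaction module (AIM) and the spatial-gate feed-forward network (SGFN) to achieve intra-block feature aggregation. AIM complements the two self-attention mechanisms from their corresponding dimensions, while SGFN introduces additional non-linear spatial information into the feed-forward network. Experiments show that DAT surpasses current methods. Code and models are available at https://github.com/zhengchen1999/DAT.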
Distortion-aware Transformer in 360° Salient Object Detection
for: addressing the distortion problem in 360° data projection for feature extraction and task development
methods: using a Transformer-based model called DATFormer with two distortion-adaptive modules and a learnable relation matrix for positional embedding
results: outperforming existing 2D SOD and 360 SOD methods on three public datasets
Abstract
With the emergence of VR and AR, 360° data attracts increasing attention from the computer vision and multimedia communities. Typically, 360° data is projected into 2D ERP (equirectangular projection) images for feature extraction. However, existing methods cannot handle the distortions that result from the projection, hindering the development of 360-data-based tasks. Therefore, in this paper, we propose a Transformer-based model called DATFormer to address the distortion problem. We tackle this issue from two perspectives. Firstly, we introduce two distortion-adaptive modules. The first is a Distortion Mapping Module, which guides the model to pre-adapt to distorted features globally. The second module is a Distortion-Adaptive Attention Block that reduces local distortions on multi-scale features. Secondly, to exploit the unique characteristics of 360° data, we present a learnable relation matrix and use it as part of the positional embedding to further improve performance. Extensive experiments are conducted on three public datasets, and the results show that our model outperforms existing 2D SOD (salient object detection) and 360 SOD methods.
Summary
Firstly, we introduce two distortion-adaptive modules:
1. Distortion Mapping Module: This module guides the model to pre-adapt to distorted features globally.
2. Distortion-Adaptive Attention Block: This module reduces local distortions on multi-scale features.
Secondly, to exploit the unique characteristics of 360° data, we present a learnable relation matrix and use it as part of the positional embedding to further improve performance.
Extensive experiments are conducted on three public datasets, and the results show that our model outperforms existing 2D SOD (salient object detection) and 360 SOD methods.
Energy-Guided Diffusion Model for CBCT-to-CT Synthesis
results: In experiments on a chest tumor dataset, EGDiff generates sCT images with high accuracy and visual quality and outperforms state-of-the-art unsupervised synthesis methods.
Abstract
Cone Beam CT (CBCT) plays a crucial role in Adaptive Radiation Therapy (ART) by accurately providing radiation treatment when organ anatomy changes occur. However, CBCT images suffer from scatter noise and artifacts, making relying solely on CBCT for precise dose calculation and accurate tissue localization challenging. Therefore, there is a need to improve CBCT image quality and Hounsfield Unit (HU) accuracy while preserving anatomical structures. To enhance the role and application value of CBCT in ART, we propose an energy-guided diffusion model (EGDiff) and conduct experiments on a chest tumor dataset to generate synthetic CT (sCT) from CBCT. The experimental results demonstrate impressive performance with an average absolute error of 26.87$\pm$6.14 HU, a structural similarity index measurement of 0.850$\pm$0.03, a peak signal-to-noise ratio of the sCT of 19.83$\pm$1.39 dB, and a normalized cross-correlation of the sCT of 0.874$\pm$0.04. These results indicate that our method outperforms state-of-the-art unsupervised synthesis methods in accuracy and visual quality, producing superior sCT images.
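Summary
Cone beam CT (CBCT) plays an important role in adaptive radiation therapy (ART) by enabling accurate radiation treatment when organ anatomy changes. However, CBCT images suffer from scatter noise and artifacts, which makes it difficult to rely on CBCT alone for precise dose calculation and accurate tissue localization. CBCT image quality and Hounsfield Unit (HU) accuracy therefore need to improve while anatomical structures are preserved. To increase the application value of CBCT in ART, we propose an energy-guided diffusion model (EGDiff) and conduct experiments on a chest tumor dataset to generate synthetic CT (sCT) from CBCT. The method achieves a mean absolute error of 26.87±6.14 HU, a structural similarity index of 0.850±0.03, a peak signal-to-noise ratio (PSNR) of 19.83±1.39 dB, and a normalized cross-correlation of 0.874±0.04. These results indicate that our method outperforms state-of-the-art unsupervised synthesis methods in accuracy and visual quality, producing superior sCT images.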
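Three of the metrics quoted above (mean absolute error, PSNR, and normalized cross-correlation) have short textbook definitions. The sketch below uses those standard definitions on flat lists of intensity values. It assumes the zero-mean variant of normalized cross-correlation and a caller-supplied `data_range` for PSNR, since the abstract does not state which conventions the authors used. All sample values are hypothetical.

```python
import math

def mae(pred, ref):
    """Mean absolute error between predicted and reference intensities."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)

def psnr(pred, ref, data_range):
    """Peak signal-to-noise ratio in dB for a given intensity range."""
    mse = sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)

def ncc(pred, ref):
    """Zero-mean normalized cross-correlation (an assumed convention)."""
    mean_p = sum(pred) / len(pred)
    mean_r = sum(ref) / len(ref)
    num = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    den = math.sqrt(sum((p - mean_p) ** 2 for p in pred)
                    * sum((r - mean_r) ** 2 for r in ref))
    return num / den

# Hypothetical HU values for four voxels (not data from the paper).
reference_ct = [0.0, 100.0, 200.0, 300.0]
synthetic_ct = [10.0, 90.0, 210.0, 290.0]
error_hu = mae(synthetic_ct, reference_ct)
quality_db = psnr(synthetic_ct, reference_ct, data_range=1000.0)
similarity = ncc(synthetic_ct, reference_ct)
```

In practice these metrics are computed over whole 3D volumes, often inside a body mask, and the choice of `data_range` changes the PSNR value, so numbers are only comparable when the same convention is used.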
Explicifying Neural Implicit Fields for Efficient Dynamic Human Avatar Modeling via a Neural Explicit Surface
results: Experiments show that NES performs similarly to previous 3D approaches while greatly improving rendering speed and reducing memory cost.
Abstract
This paper proposes a technique for efficiently modeling dynamic humans by explicifying the implicit neural fields via a Neural Explicit Surface (NES). Implicit neural fields have advantages over traditional explicit representations in modeling dynamic 3D content from sparse observations and effectively representing complex geometries and appearances. Implicit neural fields defined in 3D space, however, are expensive to render due to the need for dense sampling during volumetric rendering. Moreover, their memory efficiency can be further optimized when modeling sparse 3D space. To overcome these issues, the paper proposes utilizing Neural Explicit Surface (NES) to explicitly represent implicit neural fields, facilitating memory and computational efficiency. To achieve this, the paper creates a fully differentiable conversion between the implicit neural fields and the explicit rendering interface of NES, leveraging the strengths of both implicit and explicit approaches. This conversion enables effective training of the hybrid representation using implicit methods and efficient rendering by integrating the explicit rendering interface with a newly proposed rasterization-based neural renderer that only incurs a texture color query once for the initial ray interaction with the explicit surface, resulting in improved inference efficiency. NES describes dynamic human geometries with pose-dependent neural implicit surface deformation fields and their dynamic neural textures both in 2D space, which is a more memory-efficient alternative to traditional 3D methods, reducing redundancy and computational load. The comprehensive experiments show that NES performs similarly to previous 3D approaches, with greatly improved rendering speed and reduced memory cost.
Summary
The paper creates a fully differentiable conversion between the implicit neural fields and the explicit rendering interface of NES, allowing for effective training of the hybrid representation using implicit methods and efficient rendering. The conversion enables the use of a rasterization-based neural renderer that only incurs a texture color query once for the initial ray interaction with the explicit surface, resulting in improved inference efficiency.
NES describes dynamic human geometries with pose-dependent neural implicit surface deformation fields and their dynamic neural textures in 2D space, which is a more memory-efficient alternative to traditional 3D methods, reducing redundancy and computational load. The comprehensive experiments show that NES performs similarly to previous 3D approaches, with greatly improved rendering speed and reduced memory cost.
paper_authors: Xingxing Yang, Jie Chen, Zaifeng Yang
for: This paper targets near-infrared (NIR) image spectrum translation, a challenging problem with many promising applications.
methods: The method is a cooperative learning paradigm that colorizes NIR images in parallel with a proxy grayscale colorization task by exploring latent cross-domain priors (i.e., latent spectrum context priors and task domain priors).
results: The method produces high-quality spectrum translation outputs and outperforms state-of-the-art counterparts by 3.95 dB and 4.66 dB in PSNR on the NIR and grayscale colorization tasks, respectively.
Abstract
Near-infrared (NIR) image spectrum translation is a challenging problem with many promising applications. Existing methods struggle with the mapping ambiguity between the NIR and the RGB domains, and generalize poorly due to the limitations of models' learning capabilities and the unavailability of sufficient NIR-RGB image pairs for training. To address these challenges, we propose a cooperative learning paradigm that colorizes NIR images in parallel with another proxy grayscale colorization task by exploring latent cross-domain priors (i.e., latent spectrum context priors and task domain priors), dubbed CoColor. The complementary statistical and semantic spectrum information from these two task domains -- in the forms of pre-trained colorization networks -- are brought in as task domain priors. A bilateral domain translation module is subsequently designed, in which intermittent NIR images are generated from grayscale and colorized in parallel with authentic NIR images; and vice versa for the grayscale images. These intermittent transformations act as latent spectrum context priors for efficient domain knowledge exchange. We progressively fine-tune and fuse these modules with a series of pixel-level and feature-level consistency constraints. Experiments show that our proposed cooperative learning framework produces satisfactory spectrum translation outputs with diverse colors and rich textures, and outperforms state-of-the-art counterparts by 3.95 dB and 4.66 dB in terms of PSNR for the NIR and grayscale colorization tasks, respectively.
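Summary
Near-infrared (NIR) image spectrum translation is a challenging problem with many promising applications. Existing methods struggle with the mapping ambiguity between the NIR and RGB domains, and they generalize poorly because of limited model learning capacity and the lack of sufficient NIR-RGB image pairs for training. To address these challenges, we propose a cooperative learning paradigm that colorizes NIR images in parallel with a proxy grayscale colorization task by exploring latent cross-domain priors (i.e., latent spectrum context priors and task domain priors). The statistical and semantic spectrum information from the two task domains is brought in as task domain priors. We then design a bilateral domain translation module in which intermittent NIR images are generated from grayscale images and colorized in parallel with authentic NIR images, and vice versa for the grayscale images. These intermittent transformations act as latent spectrum context priors for efficient domain knowledge exchange. We progressively fine-tune and fuse these modules under pixel-level and feature-level consistency constraints. Experiments show that the cooperative learning framework produces high-quality spectrum translation outputs with diverse colors and rich textures, and outperforms state-of-the-art counterparts by 3.95 dB and 4.66 dB in PSNR on the NIR and grayscale colorization tasks, respectively.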
A Hybrid CNN-Transformer Architecture with Frequency Domain Contrastive Learning for Image Deraining
for: restore degraded images affected by rain streaks
methods: image deraining
results: not specified in the abstract
Abstract
Image deraining is a challenging task that involves restoring degraded images affected by rain streaks.
Summary
Image deraining is a challenging task that involves restoring degraded images affected by rain streaks.
AFN: Adaptive Fusion Normalization via Encoder-Decoder Framework
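methods: The paper proposes a new normalization function, Adaptive Fusion Normalization (AFN), intended to combine existing normalization methods and mitigate their weaknesses.
results: In experiments, AFN outperforms existing normalization methods on domain generalization and image classification tasks.
Abstract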
The success of deep learning is inseparable from normalization layers. Researchers have proposed various normalization functions, and each of them has both advantages and disadvantages. In response, efforts have been made to design a unified normalization function that combines all normalization procedures and mitigates their weaknesses. We also proposed a new normalization function called Adaptive Fusion Normalization. Through experiments, we demonstrate AFN outperforms the previous normalization techniques in domain generalization and image classification tasks.
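Summary
The success of deep learning is closely tied to normalization layers. Researchers have proposed many normalization functions, each with advantages and disadvantages. In response, efforts have been made to design a unified normalization function that combines all normalization procedures and mitigates their weaknesses. We also propose a new normalization function called Adaptive Fusion Normalization (AFN). Through experiments, we demonstrate that AFN outperforms previous normalization techniques on domain generalization and image classification tasks.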
FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search
paper_authors: Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi, Quoc V. Le, Sheng Li
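results: The method yields high-quality, low-cost DNN models and improves on prior methods. The integer quantization search raises ImageNet accuracy by 1.31 percentage points for ResNet-18 and 0.90 percentage points for ResNet-50 at equivalent model cost. The mixed-precision floating-point search improves MobileNetV2 by up to 0.98 percentage points. Searching a joint quantization and neural architecture space improves ImageNet accuracy by 2.69 percentage points at similar model cost.
Abstract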
Quantization has become a mainstream compression technique for reducing model size, computational requirements, and energy consumption for modern deep neural networks (DNNs). With the improved numerical support in recent hardware, including multiple variants of integer and floating point, mixed-precision quantization has become necessary to achieve high-quality results with low model cost. Prior mixed-precision quantization methods have performed a post-training quantization search, which compromises on accuracy, or a differentiable quantization search, which leads to high memory usage from branching. Therefore, we propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models. We evaluate our floating-point and integer quantization search (FLIQS) on multiple convolutional networks and vision transformer models to discover Pareto-optimal models. Our approach discovers models that improve upon uniform precision, manual mixed-precision, and recent integer quantization search methods. With the proposed integer quantization search, we increase the accuracy of ResNet-18 on ImageNet by 1.31% points and ResNet-50 by 0.90% points with equivalent model cost over previous methods. Additionally, for the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% points compared to prior state-of-the-art FP8 models. Finally, we extend FLIQS to simultaneously search a joint quantization and neural architecture space and improve the ImageNet accuracy by 2.69% points with similar model cost on a MobileNetV2 search space.
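Summary
Quantization has become a mainstream compression technique for reducing the model size, computational requirements, and energy consumption of modern deep neural networks (DNNs). With improved numerical support in recent hardware, mixed-precision quantization has become necessary for achieving high-quality results at low model cost. Prior mixed-precision methods perform either a post-training quantization search, which compromises accuracy, or a differentiable quantization search, which leads to high memory usage. We therefore propose the first one-shot mixed-precision quantization search that requires no retraining, for both integer and low-precision floating-point models. We evaluate the search on multiple convolutional networks and vision transformer models and discover Pareto-optimal models. With the integer quantization search, ImageNet accuracy improves by 1.31 percentage points for ResNet-18 and 0.90 percentage points for ResNet-50 over previous methods. We also explore, for the first time, a mixed-precision floating-point search and improve MobileNetV2 by up to 0.98 percentage points compared with prior FP8 models. Finally, we extend FLIQS to search a joint quantization and neural architecture space, improving ImageNet accuracy by 2.69 percentage points at similar model cost on a MobileNetV2 search space.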
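To make the idea of mixed-precision integer quantization concrete, the sketch below applies symmetric uniform quantization with a different bit width per layer and compares the resulting error. It illustrates the general technique only and is not the FLIQS search itself. The function names, layer names, weights, and bit-width assignment are all hypothetical.

```python
def quantize_int(values, bits):
    """Symmetric uniform quantization of floats to a signed `bits`-bit grid.

    Returns the de-quantized floats, so the rounding error can be inspected.
    """
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed grid")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax  # one scale per tensor (an assumed granularity)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical per-layer weights and a hypothetical precision assignment.
layers = {
    "conv1": [0.9, -0.4, 0.05, 0.31],
    "fc": [0.02, -0.015, 0.5, -0.49],
}
bit_widths = {"conv1": 8, "fc": 4}
errors = {
    name: mean_squared_error(w, quantize_int(w, bit_widths[name]))
    for name, w in layers.items()
}
```

A precision search, in broad terms, looks for an assignment like `bit_widths` that keeps such errors (and ultimately task accuracy) acceptable while lowering model cost. How FLIQS conducts that search in one shot is described in the paper, not here.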
Multi-Label Self-Supervised Learning with Scene Images
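results: Experiments show that the proposed Multi-Label Self-supervised learning (MLS) method learns high-quality image representations on MS-COCO and achieves state-of-the-art results on classification, detection, and segmentation benchmarks, while being simpler than existing methods and easier to deploy and explore further.
Abstract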
Self-supervised learning (SSL) methods targeting scene images have seen a rapid growth recently, and they mostly rely on either a dedicated dense matching mechanism or a costly unsupervised object discovery module. This paper shows that instead of hinging on these strenuous operations, quality image representations can be learned by treating scene/multi-label image SSL simply as a multi-label classification problem, which greatly simplifies the learning framework. Specifically, multiple binary pseudo-labels are assigned for each input image by comparing its embeddings with those in two dictionaries, and the network is optimized using the binary cross entropy loss. The proposed method is named Multi-Label Self-supervised learning (MLS). Visualizations qualitatively show that clearly the pseudo-labels by MLS can automatically find semantically similar pseudo-positive pairs across different images to facilitate contrastive learning. MLS learns high quality representations on MS-COCO and achieves state-of-the-art results on classification, detection and segmentation benchmarks. At the same time, MLS is much simpler than existing methods, making it easier to deploy and for further exploration.
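Summary
Self-supervised learning (SSL) methods targeting scene images have grown rapidly in recent years. They mostly rely on either a dedicated dense matching mechanism or a costly unsupervised object discovery module. This paper shows that, instead of depending on these strenuous operations, quality image representations can be learned by treating scene/multi-label image SSL simply as a multi-label classification problem. Specifically, each input image is assigned multiple binary pseudo-labels by comparing its embeddings with those in two dictionaries, and the network is optimized with the binary cross-entropy loss. The method is named Multi-Label Self-supervised learning (MLS). Visualizations qualitatively show that MLS automatically finds semantically similar pseudo-positive pairs across different images to facilitate contrastive learning. MLS learns high-quality representations on MS-COCO and achieves state-of-the-art results on classification, detection, and segmentation benchmarks. At the same time, MLS is simpler than existing methods, which makes it easier to deploy and explore further.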
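The pseudo-labelling step described above (compare an embedding with dictionary entries, then train with binary cross-entropy) can be sketched in a few lines. The abstract does not give the labelling rule, so this sketch assumes a single dictionary and marks the top-k most similar entries as positives. Both choices, along with every name and value below, are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pseudo_labels(embedding, dictionary, k):
    """Label the k dictionary entries most similar to `embedding` as 1, others 0.

    Assumption: a top-k rule; the paper's actual rule may differ.
    """
    sims = [cosine(embedding, key) for key in dictionary]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    labels = [0] * len(dictionary)
    for i in top:
        labels[i] = 1
    return labels

def bce_with_logits(logits, labels):
    """Mean binary cross-entropy computed stably from raw logits."""
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Hypothetical 2-D embeddings standing in for dictionary keys.
dictionary = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
image_embedding = [1.0, 0.05]
labels = pseudo_labels(image_embedding, dictionary, k=2)
loss = bce_with_logits([2.0, 1.5, -1.0, -3.0], labels)
```

Treating each dictionary entry as an independent binary target is what turns the problem into multi-label classification, so several entries can be positive for the same scene image.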
Environment-Invariant Curriculum Relation Learning for Fine-Grained Scene Graph Generation
results: Extensive experiments on the VG and GQA datasets show that the EICR framework can serve as a general strategy for SGG models and achieves significant improvements.
Abstract
The scene graph generation (SGG) task is designed to identify the predicates based on the subject-object pairs. However, existing datasets generally include two imbalance cases: one is the class imbalance from the predicted predicates and another is the context imbalance from the given subject-object pairs, which presents significant challenges for SGG. Most existing methods focus on the imbalance of the predicted predicate while ignoring the imbalance of the subject-object pairs, and therefore cannot achieve satisfactory results. To address the two imbalance cases, we propose a novel Environment Invariant Curriculum Relation learning (EICR) method, which can be applied in a plug-and-play fashion to existing SGG methods. Concretely, to remove the imbalance of the subject-object pairs, we first construct different distribution environments for the subject-object pairs and learn a model invariant to the environment changes. Then, we construct a class-balanced curriculum learning strategy to balance the different environments to remove the predicate imbalance. Comprehensive experiments conducted on VG and GQA datasets demonstrate that our EICR framework can be taken as a general strategy for various SGG models, and achieve significant improvements.
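Summary
The scene graph generation (SGG) task is designed to identify predicates from subject-object pairs. However, existing datasets generally exhibit two kinds of imbalance: class imbalance among the predicted predicates and context imbalance among the given subject-object pairs. Both pose significant challenges for SGG. Most existing methods focus on the predicate imbalance and ignore the imbalance of the subject-object pairs, which prevents them from reaching satisfactory results. To address both cases, we propose a novel Environment Invariant Curriculum Relation learning (EICR) method that can be combined with existing SGG methods. Concretely, to remove the imbalance of the subject-object pairs, we first construct different distribution environments for the pairs and learn a model that is invariant to environment changes. We then construct a class-balanced curriculum learning strategy that balances the different environments to remove the predicate imbalance. Extensive experiments on the VG and GQA datasets show that the EICR framework can serve as a general strategy for various SGG models and achieves significant improvements.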
Mirror-NeRF: Learning Neural Radiance Fields for Mirrors with Whitted-Style Ray Tracing
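methods: The method introduces a reflection probability, traces rays following the light transport model of Whitted Ray Tracing, and uses several techniques to facilitate the learning process.
results: Experiments and comparisons on both synthetic and real datasets show that the method has a clear advantage and accurately captures mirror reflections that stay consistent across views.
Abstract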
Recently, Neural Radiance Fields (NeRF) has exhibited significant success in novel view synthesis, surface reconstruction, etc. However, since no physical reflection is considered in its rendering pipeline, NeRF mistakes the reflection in the mirror as a separate virtual scene, leading to the inaccurate reconstruction of the mirror and multi-view inconsistent reflections in the mirror. In this paper, we present a novel neural rendering framework, named Mirror-NeRF, which is able to learn accurate geometry and reflection of the mirror and support various scene manipulation applications with mirrors, such as adding new objects or mirrors into the scene and synthesizing the reflections of these new objects in mirrors, controlling mirror roughness, etc. To achieve this goal, we propose a unified radiance field by introducing the reflection probability and tracing rays following the light transport model of Whitted Ray Tracing, and also develop several techniques to facilitate the learning process. Experiments and comparisons on both synthetic and real datasets demonstrate the superiority of our method. The code and supplementary material are available on the project webpage: https://zju3dv.github.io/Mirror-NeRF/.
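Summary
Neural Radiance Fields (NeRF) have recently shown significant success in novel view synthesis, surface reconstruction, and related tasks. However, because the NeRF rendering pipeline does not consider physical reflection, NeRF mistakes the reflection in a mirror for a separate virtual scene, which leads to inaccurate reconstruction of the mirror and to reflections that are inconsistent across views. In this paper, we present a novel neural rendering framework, named Mirror-NeRF, that learns accurate geometry and reflection of the mirror. To achieve this, we introduce a reflection probability and trace rays following the light transport model of Whitted Ray Tracing, and we develop several techniques to facilitate the learning process. Experiments and comparisons on both synthetic and real datasets demonstrate the superiority of our method. The code and supplementary material are available on the project webpage: https://zju3dv.github.io/Mirror-NeRF/.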
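Whitted-style ray tracing follows a ray to a surface and then spawns a reflected ray using the standard mirror-reflection formula r = d - 2(d·n)n. The sketch below shows that formula together with one plausible way to mix directly rendered colour with reflected colour using a per-point reflection probability. The linear blend is an assumption made for illustration, and the function names are hypothetical. The exact formulation used by Mirror-NeRF is given in the paper.

```python
def reflect(direction, normal):
    """Mirror-reflect `direction` about a unit-length surface `normal`."""
    dot = sum(d * n for d, n in zip(direction, normal))
    return tuple(d - 2.0 * dot * n for d, n in zip(direction, normal))

def blend(direct_rgb, reflected_rgb, reflection_prob):
    """Linear mix of direct and reflected colour (assumed blending rule)."""
    if not 0.0 <= reflection_prob <= 1.0:
        raise ValueError("reflection_prob must lie in [0, 1]")
    return tuple((1.0 - reflection_prob) * c + reflection_prob * r
                 for c, r in zip(direct_rgb, reflected_rgb))

# A ray travelling down and to the right hits a floor whose normal points up.
incoming = (1.0, -1.0, 0.0)
floor_normal = (0.0, 1.0, 0.0)
outgoing = reflect(incoming, floor_normal)

# Hypothetical colours: a grey surface reflecting a red object.
pixel = blend((0.2, 0.2, 0.2), (1.0, 0.0, 0.0), reflection_prob=0.5)
```

With a reflection probability near 1 the point behaves like a mirror, and near 0 it behaves like an ordinary surface, which is why learning this quantity lets the renderer stop treating the mirror image as a separate scene.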
Spatialyze: A Geospatial Video Analytics System with Spatial-Aware Optimizations
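methods: The paper proposes a new framework named Spatialyze, which comes with a domain-specific language that lets users construct geospatial video analytic workflows using a 3-step, declarative, build-filter-observe paradigm.
results: Experiments show that Spatialyze reduces execution time by up to 5.3x while maintaining up to 97.1% accuracy compared to unoptimized execution.
Abstract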
Videos that are shot using commodity hardware such as phones and surveillance cameras record various metadata such as time and location. We encounter such geospatial videos on a daily basis and such videos have been growing in volume significantly. Yet, we do not have data management systems that allow users to interact with such data effectively. In this paper, we describe Spatialyze, a new framework for end-to-end querying of geospatial videos. Spatialyze comes with a domain-specific language where users can construct geospatial video analytic workflows using a 3-step, declarative, build-filter-observe paradigm. Internally, Spatialyze leverages the declarative nature of such workflows, the temporal-spatial metadata stored with videos, and physical behavior of real-world objects to optimize the execution of workflows. Our results using real-world videos and workflows show that Spatialyze can reduce execution time by up to 5.3x, while maintaining up to 97.1% accuracy compared to unoptimized execution.
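Summary
Videos shot with commodity hardware such as phones and surveillance cameras record metadata such as time and location. We encounter such geospatial videos every day, yet we lack data management systems that let users interact with this data effectively. In this paper, we describe Spatialyze, a new framework for end-to-end querying of geospatial videos. Spatialyze provides a domain-specific language in which users construct geospatial video analytic workflows using a 3-step, declarative, build-filter-observe paradigm. Internally, Spatialyze leverages the declarative nature of these workflows, the temporal-spatial metadata stored with the videos, and the physical behavior of real-world objects to optimize workflow execution. Our results on real-world videos and workflows show that Spatialyze can reduce execution time by up to 5.3x while maintaining up to 97.1% accuracy compared to unoptimized execution.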
Feature-Suppressed Contrast for Self-Supervised Food Pre-training
paper_authors: Xinda Liu, Yaohui Zhu, Linhu Liu, Jiang Tian, Lili Wang
for: This paper focuses on developing a self-supervised learning method for food image recognition, aiming to reduce the human labeling expenses and improve the efficiency of food image analysis.
methods: The proposed method, called Feature Suppressed Contrast (FeaSC), leverages contrastive self-supervised learning on unlabelled food images. To address the problem of similar informative contents in the two views, the method uses a response-aware scheme to localize salient features in an unsupervised manner, reducing the mutual information between the views.
results: The proposed FeaSC method consistently improves the classification accuracy of BYOL and SimSiam by 1.70% to 6.69% on four publicly available food recognition datasets. Additionally, the method achieves superior results on downstream segmentation tasks, demonstrating its effectiveness in food image analysis.
Abstract
Most previous approaches for analyzing food images have relied on extensively annotated datasets, resulting in significant human labeling expenses due to the varied and intricate nature of such images. Inspired by the effectiveness of contrastive self-supervised methods in utilizing unlabelled data, we explore leveraging these techniques on unlabelled food images. In contrastive self-supervised methods, two views are randomly generated from an image by data augmentations. However, regarding food images, the two views tend to contain similar informative contents, causing large mutual information, which impedes the efficacy of contrastive self-supervised learning. To address this problem, we propose Feature Suppressed Contrast (FeaSC) to reduce mutual information between views. As the similar contents of the two views are salient or highly responsive in the feature map, the proposed FeaSC uses a response-aware scheme to localize salient features in an unsupervised manner. By suppressing some salient features in one view while leaving another contrast view unchanged, the mutual information between the two views is reduced, thereby enhancing the effectiveness of contrast learning for self-supervised food pre-training. As a plug-and-play module, the proposed method consistently improves BYOL and SimSiam by 1.70% to 6.69% classification accuracy on four publicly available food recognition datasets. Superior results have also been achieved on downstream segmentation tasks, demonstrating the effectiveness of the proposed method.
Summary
Earlier food image analysis methods relied on large manually annotated datasets, which drives up labeling costs because food images are complex and varied. Inspired by the effectiveness of contrastive self-supervised methods, we use unlabelled data to pre-train on food images. In contrastive self-supervised methods, two views are generated from an image through data augmentation. For food images, however, the two views tend to contain similar informative content, which produces large mutual information and impedes contrastive learning. To solve this problem, we propose Feature Suppressed Contrast (FeaSC) to reduce the mutual information between views. The content shared by the two views is salient or highly responsive in the feature map, so we use a response-aware scheme to localize these features. By suppressing salient features in one view while leaving the other contrast view unchanged, we reduce the mutual information between the views and thereby make contrastive learning more effective. As a plug-in module, the method works with BYOL and SimSiam and improves classification accuracy by 1.70% to 6.69% on four public food recognition datasets. It also achieves better results on downstream segmentation tasks, which demonstrates its effectiveness.
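The core operation described above is to find the most responsive locations in one view's feature map and suppress them, leaving the other view untouched. The sketch below assumes that "salient" means the k largest responses and that "suppress" means setting them to zero. Both are simplifying assumptions, and the function name and sample values are hypothetical.

```python
def suppress_salient(feature_map, k):
    """Return a copy of a 2D response map with its k largest responses zeroed.

    Assumption: saliency is ranked by raw response magnitude at each cell.
    """
    cells = [(value, r, c)
             for r, row in enumerate(feature_map)
             for c, value in enumerate(row)]
    strongest = sorted(cells, key=lambda cell: cell[0], reverse=True)[:k]
    suppressed = [list(row) for row in feature_map]
    for _, r, c in strongest:
        suppressed[r][c] = 0.0
    return suppressed

# Hypothetical 2x2 response map for one augmented view of a food image.
view_a = [[0.1, 0.9], [0.8, 0.2]]
view_a_suppressed = suppress_salient(view_a, k=2)
# The second view would be passed to the contrastive loss unchanged.
```

Because the strongest shared cues are removed from only one view, the two views share less information, which is the effect the abstract attributes to feature suppression.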
A Benchmark for Chinese-English Scene Text Image Super-resolution
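results: We conduct experiments on the proposed Real-CE dataset and evaluate existing STISR models with and without our edge-aware loss. The results show that the edge-aware method improves STISR model performance and keeps Chinese text correctly formed and readable.
Abstract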
Scene Text Image Super-resolution (STISR) aims to recover high-resolution (HR) scene text images with visually pleasant and readable text content from the given low-resolution (LR) input. Most existing works focus on recovering English texts, which have relatively simple character structures, while little work has been done on the more challenging Chinese texts with diverse and complex character structures. In this paper, we propose a real-world Chinese-English benchmark dataset, namely Real-CE, for the task of STISR with the emphasis on restoring structurally complex Chinese characters. The benchmark provides 1,935/783 real-world LR-HR text image pairs (33,789 text lines in total) for training/testing in 2× and 4× zooming modes, complemented by detailed annotations, including detection boxes and text transcripts. Moreover, we design an edge-aware learning method, which provides structural supervision in image and feature domains, to effectively reconstruct the dense structures of Chinese characters. We conduct experiments on the proposed Real-CE benchmark and evaluate the existing STISR models with and without our edge-aware loss. The benchmark, including data and source code, is available at https://github.com/mjq11302010044/Real-CE.
APBench: A Unified Benchmark for Availability Poisoning Attacks and Defenses
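paper_authors: Tianrui Qin, Xitong Gao, Juanjuan Zhao, Kejiang Ye, Cheng-Zhong Xu
for: The paper evaluates the effectiveness of availability poisoning attacks and the defenses against them, and provides a benchmark for measuring how these attacks and defenses perform.
methods: The benchmark covers 9 state-of-the-art availability poisoning attacks, 8 defense algorithms, and 4 conventional data augmentation techniques.
results: The results show that existing attacks fail to safeguard individual privacy, and that APBench helps evaluate the performance of these attacks and defenses.
Abstract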
The efficacy of availability poisoning, a method of poisoning data by injecting imperceptible perturbations to prevent its use in model training, has been a hot subject of investigation. Previous research suggested that it was difficult to effectively counteract such poisoning attacks. However, the introduction of various defense methods has challenged this notion. Because the field is progressing rapidly and experimental setups vary, the performance of different novel methods cannot be accurately validated. To further evaluate the attack and defense capabilities of these poisoning methods, we have developed a benchmark, APBench, for assessing the efficacy of adversarial poisoning. APBench consists of 9 state-of-the-art availability poisoning attacks, 8 defense algorithms, and 4 conventional data augmentation techniques. We also set up experiments with varying poisoning ratios, and evaluated the attacks on multiple datasets and their transferability across model architectures. We further conducted a comprehensive evaluation of 2 additional attacks specifically targeting unsupervised models. Our results reveal the glaring inadequacy of existing attacks in safeguarding individual privacy. APBench is open source and available to the deep learning community: https://github.com/lafeat/apbench.
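Summary
Availability poisoning, which injects imperceptible perturbations into data to prevent its use in model training, has become a hot research topic. Earlier studies suggested that such attacks were hard to defend against effectively, but newly introduced defenses have challenged this view. Because the field is progressing quickly, the performance of different new methods cannot be validated accurately when experimental setups vary. To further evaluate the attack and defense capabilities of these poisoning methods, we developed a benchmark, APBench, for assessing the efficacy of poisoning attacks. APBench includes 9 state-of-the-art availability poisoning attacks, 8 defense algorithms, and 4 common data augmentation techniques. We also ran experiments at different poisoning ratios and evaluated the attacks on multiple datasets and model architectures. In addition, we carried out a comprehensive evaluation of 2 attacks that specifically target unsupervised models. Our results show that existing attacks do not adequately protect individual privacy. APBench is open source and available on GitHub: https://github.com/lafeat/apbench.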
Learning a Graph Neural Network with Cross Modality Interaction for Image Fusion
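results: Extensive experiments on several datasets (TNO, MFNet, and M3FD) show that IGNet generates visually appealing fused images and scores on average 2.59% mAP@.5 and 7.77% mIoU higher in detection and segmentation than the compared state-of-the-art methods.
Abstract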
Infrared and visible image fusion has gradually proved to be a vital branch of multi-modality imaging technologies. In recent developments, researchers not only focus on the quality of fused images but also evaluate their performance in downstream tasks. Nevertheless, most methods pay little attention to mutual learning between the modalities, resulting in fused images that lack significant details and textures. To overcome this issue, we propose an interactive graph neural network (GNN)-based architecture across modalities for fusion, called IGNet. Specifically, we first apply a multi-scale extractor to obtain shallow features, which are employed as the necessary input to build graph structures. Then, the graph interaction module constructs the extracted intermediate features of the infrared/visible branch into graph structures. Meanwhile, the graph structures of the two branches interact for cross-modality and semantic learning, so that fused images can maintain the important feature expressions and enhance the performance of downstream tasks. Besides, the proposed leader nodes improve information propagation within the same modality. Finally, we merge all graph features to get the fusion result. Extensive experiments on different datasets (TNO, MFNet and M3FD) demonstrate that our IGNet can generate visually appealing fused images while scoring on average 2.59% mAP@.5 and 7.77% mIoU higher in detection and segmentation than the compared state-of-the-art methods. The source code of the proposed IGNet is available at https://github.com/lok-18/IGNet.
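Summary
Infrared and visible image fusion has gradually become an important branch of multi-modality imaging. In recent work, researchers consider not only the quality of the fused images but also their performance in downstream tasks. However, most methods pay little attention to mutual learning between modalities, so the fused images lack important details and textures. To address this, we propose an interactive graph neural network (GNN) architecture across modalities for fusion, called IGNet. Specifically, we first apply a multi-scale extractor to obtain shallow features, which serve as the input for building graph structures. The graph interaction module then builds the intermediate features of the infrared and visible branches into graph structures. Meanwhile, the graph structures of the two branches interact for cross-modality learning, so that the fused images perform better. In addition, we propose leader nodes to improve information propagation within the same modality. Finally, we merge all graph features to obtain the fusion result. Extensive experiments on different datasets (TNO, MFNet, and M3FD) show that IGNet generates visually appealing fused images while scoring 2.59% mAP@.5 and 7.77% mIoU higher in detection and segmentation than the compared state-of-the-art methods. The source code can be downloaded from https://github.com/lok-18/IGNet.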
Local Consensus Enhanced Siamese Network with Reciprocal Loss for Two-view Correspondence Learning
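results: Experiments show that the two proposals, built on MSA-Net, improve matching performance and achieve state-of-the-art results on benchmark datasets.
Abstract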
Recent studies of two-view correspondence learning usually establish an end-to-end network to jointly predict correspondence reliability and relative pose. We improve such a framework from two aspects. First, we propose a Local Feature Consensus (LFC) plugin block to augment the features of existing models. Given a correspondence feature, the block augments its neighboring features with mutual neighborhood consensus and aggregates them to produce an enhanced feature. As inliers obey a uniform cross-view transformation and share more consistent learned features than outliers, feature consensus strengthens inlier correlation and suppresses outlier distraction, which makes output features more discriminative for classifying inliers/outliers. Second, existing approaches supervise network training with the ground truth correspondences and essential matrix projecting one image to the other for an input image pair, without considering the information from the reverse mapping. We extend existing models to a Siamese network with a reciprocal loss that exploits the supervision of mutual projection, which considerably promotes the matching performance without introducing additional model parameters. Building upon MSA-Net, we implement the two proposals and experimentally achieve state-of-the-art performance on benchmark datasets.
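Summary
Recent work on two-view correspondence learning usually builds an end-to-end network that jointly predicts correspondence reliability and relative pose. We improve this framework in two ways. First, we propose a Local Feature Consensus (LFC) plugin block to augment the features of existing models. Given a correspondence feature, the block augments its neighboring features with mutual neighborhood consensus and aggregates them into an enhanced feature. Because inliers obey a single cross-view transformation and share more consistent learned features, feature consensus strengthens inlier correlation and suppresses outlier distraction, making the output features more discriminative for inlier/outlier classification. Second, existing methods supervise training with ground-truth correspondences and do not use information from the reverse mapping. We extend existing models to a Siamese network with a reciprocal loss that exploits the supervision of mutual projection, which considerably improves matching performance without adding model parameters. Building on MSA-Net, we implement both proposals and achieve state-of-the-art performance on benchmark datasets.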
Microvasculature Segmentation in Human BioMolecular Atlas Program (HuBMAP)
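results: The study rigorously evaluates the different approaches and reports how each modification performs, offering useful insights for future research in the field.
Abstract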
Image segmentation serves as a critical tool across a range of applications, encompassing autonomous driving's pedestrian detection and pre-operative tumor delineation in the medical sector. Among these applications, we focus on the National Institutes of Health's (NIH) Human BioMolecular Atlas Program (HuBMAP), a significant initiative aimed at creating detailed cellular maps of the human body. In this study, we concentrate on segmenting various microvascular structures in human kidneys, utilizing 2D Periodic Acid-Schiff (PAS)-stained histology images. Our methodology begins with a foundational FastAI U-Net model, upon which we investigate alternative backbone architectures, delve into deeper models, and experiment with Feature Pyramid Networks. We rigorously evaluate these varied approaches by benchmarking their performance against our baseline U-Net model. This study thus offers a comprehensive exploration of cutting-edge segmentation techniques, providing valuable insights for future research in the field.
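Summary
Image segmentation is an important tool across many applications, including pedestrian detection for autonomous driving and pre-operative tumor delineation in medicine. Among these applications, we focus on the Human BioMolecular Atlas Program (HuBMAP) of the National Institutes of Health (NIH), a major initiative to create cellular maps of the human body. In this study, we concentrate on segmenting microvascular structures in human kidneys using 2D Periodic Acid-Schiff (PAS)-stained histology images. Our method starts from a baseline FastAI U-Net model, after which we investigate alternative backbone architectures, deeper models, and Feature Pyramid Networks. We rigorously evaluate these approaches against the performance of the baseline U-Net model. The study thus offers a comprehensive exploration of segmentation techniques and useful insights for future research.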
Syn-Mediverse: A Multimodal Synthetic Dataset for Intelligent Scene Understanding of Healthcare Facilities
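For: The paper provides a large multimodal synthetic dataset for research on scene understanding in healthcare facilities.
Methods: The dataset is generated with a simulated industry-standard optical tracking camera and contains more than 1.5 million annotations across several scene understanding tasks.
Results: The paper provides a broad set of baseline evaluations for each task, along with an online evaluation platform to support further research on scene understanding in healthcare facilities.
Abstract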
Safety and efficiency are paramount in healthcare facilities where the lives of patients are at stake. Despite the adoption of robots to assist medical staff in challenging tasks such as complex surgeries, human expertise is still indispensable. The next generation of autonomous healthcare robots hinges on their capacity to perceive and understand their complex and frenetic environments. While deep learning models are increasingly used for this purpose, they require extensive annotated training data which is impractical to obtain in real-world healthcare settings. To bridge this gap, we present Syn-Mediverse, the first hyper-realistic multimodal synthetic dataset of diverse healthcare facilities. Syn-Mediverse contains over 48,000 images from a simulated industry-standard optical tracking camera and provides more than 1.5M annotations spanning five different scene understanding tasks including depth estimation, object detection, semantic segmentation, instance segmentation, and panoptic segmentation. We demonstrate the complexity of our dataset by evaluating the performance on a broad range of state-of-the-art baselines for each task. To further advance research on scene understanding of healthcare facilities, along with the public dataset we provide an online evaluation benchmark available at http://syn-mediverse.cs.uni-freiburg.de
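Summary
Safety and efficiency are critical in healthcare facilities, where patients' lives are at stake. Although robots have been adopted to assist medical staff with tasks such as complex surgery, human expertise remains indispensable. The next generation of autonomous healthcare robots depends on their ability to perceive and understand complex, hectic medical environments. Deep learning models are used for this purpose, but the training data they need is difficult to obtain in real healthcare facilities. To bridge this gap, we present Syn-Mediverse, the first hyper-realistic multimodal synthetic dataset. Syn-Mediverse contains more than 48,000 images from a simulated industry-standard optical tracking camera and more than 1,500,000 annotations covering five scene understanding tasks: depth estimation, object detection, semantic segmentation, instance segmentation, and panoptic segmentation. We demonstrate the complexity of Syn-Mediverse by evaluating the performance of a range of state-of-the-art models. To further advance research on scene understanding in healthcare facilities, we also provide an online evaluation platform, available at http://syn-mediverse.cs.uni-freiburg.de.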
Understanding Biometric Entropy and Iris Capacity: Avoiding Identity Collisions on National Scales
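results: The study finds that iris patterns enable highly accurate unique identification at the scale of large populations. Specifically, US NIST (National Institute of Standards and Technology) trials involved 1.2 trillion iris comparisons, and no identity collisions were found. The study also finds that the biometric entropy of a person's two iris patterns suffices for global identity uniqueness.
Abstract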
The number of persons who can be enrolled by their iris patterns with no identity collisions is studied in relation to the biometric entropy extracted and the decision operating threshold. The population size at which identity collision becomes likelier than not, given those variables, defines iris "capacity." The general solution to this combinatorial problem is derived, in analogy with the well-known "birthday problem." Its application to unique biometric identification on national population scales is shown, referencing empirical data from US NIST (National Institute of Standards and Technology) trials involving 1.2 trillion (1.2 × 10^12) iris comparisons. The entropy of a given person's two iris patterns suffices for global identity uniqueness.
Summary
The number of people who can be enrolled using their iris patterns without any identity collisions is studied in relation to the biometric entropy extracted and the decision operating threshold. The population size at which an identity collision becomes more likely than not, given these variables, defines the "capacity" of the iris. The general solution to this combinatorial problem is derived by analogy with the well-known "birthday problem." Its application to unique biometric identification on national population scales is shown, referencing empirical data from US NIST (National Institute of Standards and Technology) trials involving 1.2 trillion (1.2 × 10^12) iris comparisons. The entropy of a person's two iris patterns is sufficient for global identity uniqueness.
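The birthday-problem analogy can be made concrete with a small calculation. The sketch below is not the paper's derivation: it assumes that every pair of enrolled irises is compared once, that each comparison independently yields a false match with the same probability `p` (a hypothetical parameter standing in for the combined effect of entropy and decision threshold), and it then finds the population size at which a collision becomes likelier than not.

```python
import math


def collision_probability(n: int, p: float) -> float:
    """P(at least one false match) among n enrollees, assuming the
    n*(n-1)/2 pairwise comparisons are independent with false-match rate p."""
    pairs = n * (n - 1) / 2
    # log1p/expm1 keep precision when p is extremely small.
    return -math.expm1(pairs * math.log1p(-p))


def capacity(p: float) -> int:
    """Smallest population size at which a collision is likelier than not."""
    # Solve n*(n-1)/2 * -ln(1-p) = ln 2 for n, then round up.
    k = math.log(2) / -math.log1p(-p)
    n = max(2, math.ceil((1 + math.sqrt(1 + 8 * k)) / 2))
    # Guard against floating-point rounding at the boundary.
    while collision_probability(n, p) <= 0.5:
        n += 1
    while n > 2 and collision_probability(n - 1, p) > 0.5:
        n -= 1
    return n


if __name__ == "__main__":
    print(capacity(1 / 365))  # classic birthday setting
    print(capacity(1e-12))    # a hypothetical per-comparison false-match rate
```

Under these assumptions the capacity grows roughly as the square root of 1/p, so halving the false-match rate raises the collision-free population size by only about 41%. With p = 1/365 the function returns 23, the familiar birthday-problem answer.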
Photorealistic and Identity-Preserving Image-Based Emotion Manipulation with Latent Diffusion Models
paper_authors: Ioannis Pikoulis, Panagiotis P. Filntisis, Petros Maragos
for: investigate the emotion manipulation capabilities of diffusion models with “in-the-wild” images
methods: Latent Diffusion models and text-driven manipulation with CLIP latents
results: superior image quality and realism, competitive results relative to emotion translation compared to GAN-based counterparts.
Abstract
In this paper, we investigate the emotion manipulation capabilities of diffusion models with "in-the-wild" images, a rather unexplored application area relative to the vast and rapidly growing literature for image-to-image translation tasks. Our proposed method encapsulates several pieces of prior work, with the most important being Latent Diffusion models and text-driven manipulation with CLIP latents. We conduct extensive qualitative and quantitative evaluations on AffectNet, demonstrating the superiority of our approach in terms of image quality and realism, while achieving competitive results relative to emotion translation compared to a variety of GAN-based counterparts. Code is released as a publicly available repo.
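Summary
In this paper, we study the emotion manipulation capabilities of diffusion models on "in-the-wild" images, a largely unexplored area within image-to-image translation. Our method builds on several pieces of prior work, most importantly Latent Diffusion models and text-driven manipulation. We run extensive qualitative and quantitative evaluations on AffectNet, showing that our method has a clear advantage in image quality and realism, while its emotion translation results are competitive with a variety of GAN-based counterparts. The code will be released as a publicly available repository.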
Boosting Few-shot 3D Point Cloud Segmentation via Query-Guided Enhancement
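methods: The paper proposes a query-guided enhancement method that adapts support background prototypes to match the context of the query samples and rectifies support prototypes under the guidance of query features to close the semantic gap.
results: Experiments on S3DIS and ScanNet show that the method achieves significant improvements while maintaining high efficiency.
Abstract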
Although extensive research has been conducted on 3D point cloud segmentation, effectively adapting generic models to novel categories remains a formidable challenge. This paper proposes a novel approach to improve point cloud few-shot segmentation (PC-FSS) models. Unlike existing PC-FSS methods that directly utilize categorical information from support prototypes to recognize novel classes in query samples, our method identifies two critical aspects that substantially enhance model performance by reducing contextual gaps between support prototypes and query features. Specifically, we (1) adapt support background prototypes to match query context while removing extraneous cues that may obscure foreground and background in query samples, and (2) holistically rectify support prototypes under the guidance of query features to emulate the latter having no semantic gap to the query targets. Our proposed designs are agnostic to the feature extractor, rendering them readily applicable to any prototype-based methods. The experimental results on S3DIS and ScanNet demonstrate notable practical benefits, as our approach achieves significant improvements while still maintaining high efficiency. The code for our approach is available at https://github.com/AaronNZH/Boosting-Few-shot-3D-Point-Cloud-Segmentation-via-Query-Guided-Enhancement
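Summary
Although 3D point cloud segmentation has been studied extensively, adapting generic models to novel categories remains challenging. This paper proposes a new way to improve point cloud few-shot segmentation (PC-FSS) models. Unlike existing PC-FSS methods, our method does not directly use the categorical information in support prototypes to recognize novel classes in query samples. Instead, it improves model performance by reducing the contextual gap between support prototypes and query samples in two key ways: (1) we adapt the support background prototypes to match the query context while removing misleading cues that may obscure foreground and background in query samples, and (2) we rectify the support prototypes under the guidance of query features so that they have no semantic gap to the query targets. Our designs can be applied to any prototype-based method and show notable practical benefits in experiments. Our code can be downloaded from https://github.com/AaronNZH/Boosting-Few-shot-3D-Point-Cloud-Segmentation-via-Query-Guided-Enhancement.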
FireFly: A Synthetic Dataset for Ember Detection in Wildfire
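results: According to the paper, after four popular object detection models are trained with the FireFly dataset, FireFly yields up to an 8.57% improvement in mean Average Precision (mAP) in real-world wildfire scenarios, compared with models trained only on a small real dataset.
Abstract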
This paper presents "FireFly", a synthetic dataset for ember detection created using Unreal Engine 4 (UE4), designed to overcome the current lack of ember-specific training resources. To create the dataset, we present a tool that allows the automated generation of the synthetic labeled dataset with adjustable parameters, enabling data diversity from various environmental conditions, making the dataset both diverse and customizable based on user requirements. We generated a total of 19,273 frames that have been used to evaluate FireFly on four popular object detection models. Further, to minimize human intervention, we leveraged a trained model to create a semi-automatic labeling process for real-life ember frames. Moreover, we demonstrated an up to 8.57% improvement in mean Average Precision (mAP) in real-world wildfire scenarios compared to models trained exclusively on a small real dataset.
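results: For several commonly used classifiers, the method crafts adversarial examples efficiently and performs well under both non-targeted and targeted attacks.
Abstract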
Decision-based black-box attacks often necessitate a large number of queries to craft an adversarial example. Moreover, decision-based attacks based on querying boundary points in the estimated normal vector direction often suffer from inefficiency and convergence issues. In this paper, we propose a novel query-efficient curvature-aware geometric decision-based black-box attack (CGBA) that conducts boundary search along a semicircular path on a restricted 2D plane to ensure finding a boundary point successfully irrespective of the boundary curvature. While the proposed CGBA attack can work effectively for an arbitrary decision boundary, it is particularly efficient in exploiting the low curvature to craft high-quality adversarial examples, which is widely seen and experimentally verified in commonly used classifiers under non-targeted attacks. In contrast, the decision boundaries often exhibit higher curvature under targeted attacks. Thus, we develop a new query-efficient variant, CGBA-H, that is adapted for the targeted attack. In addition, we further design an algorithm to obtain a better initial boundary point at the expense of some extra queries, which considerably enhances the performance of the targeted attack. Extensive experiments are conducted to evaluate the performance of our proposed methods against some well-known classifiers on the ImageNet and CIFAR10 datasets, demonstrating the superiority of CGBA and CGBA-H over state-of-the-art non-targeted and targeted attacks, respectively. The source code is available at https://github.com/Farhamdur/CGBA.
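Summary
Decision-based black-box attacks often need many queries to craft an adversarial example. In addition, decision-based attacks that query boundary points often suffer from inefficiency and convergence problems. In this paper, we propose a new query-efficient geometric decision-based black-box attack (CGBA) that searches for the boundary on a restricted 2D plane, so that a boundary point is found successfully regardless of the boundary curvature. Although CGBA can attack any decision boundary effectively, it is especially effective under non-targeted attacks, where it readily crafts high-quality adversarial examples. For targeted attacks, we develop a new query-efficient variant, CGBA-H, and design an algorithm that obtains a better initial boundary point, which improves targeted attack performance. We run extensive experiments on several commonly used classifiers and demonstrate the superiority of CGBA and CGBA-H over state-of-the-art non-targeted and targeted attacks. The source code can be downloaded from https://github.com/Farhamdur/CGBA.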
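To give a feel for why a semicircular search path helps, here is a toy 2D sketch. It is not the authors' implementation: the flat decision boundary, the choice of the segment from the source to a known boundary point as the semicircle's diameter, and the assumption that the normal direction (`side`) is already known rather than estimated from queries are all simplifications made for illustration. For a flat boundary, Thales' theorem implies that the semicircle crosses the boundary again at the point closest to the source, so a binary search over the angle finds it using only hard-label queries.

```python
import math


def is_adversarial(point):
    """Toy hard-label oracle: flat boundary 0.6*x + 0.8*y = 1 (an assumption)."""
    return 0.6 * point[0] + 0.8 * point[1] >= 1.0


def semicircle_point(source, boundary, side, theta):
    """Point on the circle whose diameter is the source-boundary segment.
    theta = 0 gives the boundary point, theta = pi gives the source."""
    cx = (source[0] + boundary[0]) / 2
    cy = (source[1] + boundary[1]) / 2
    dx, dy = boundary[0] - cx, boundary[1] - cy  # radius vector to boundary point
    px, py = -dy, dx                             # perpendicular radius vector
    if px * side[0] + py * side[1] < 0:          # bend the arc towards `side`
        px, py = -px, -py
    return (cx + math.cos(theta) * dx + math.sin(theta) * px,
            cy + math.cos(theta) * dy + math.sin(theta) * py)


def semicircular_search(source, boundary, side, steps=60):
    """Binary search over the arc angle for the last adversarial point."""
    lo, hi = 0.0, math.pi  # lo: adversarial end, hi: source (non-adversarial) end
    for _ in range(steps):
        mid = (lo + hi) / 2
        if is_adversarial(semicircle_point(source, boundary, side, mid)):
            lo = mid
        else:
            hi = mid
    return semicircle_point(source, boundary, side, lo)


if __name__ == "__main__":
    source = (0.0, 0.0)      # clean input, not adversarial
    boundary = (0.0, 1.25)   # some point already known to lie on the boundary
    normal = (0.6, 0.8)      # assumed known here; estimated via queries in practice
    print(semicircular_search(source, boundary, normal))
```

In this toy setting the search converges to approximately (0.6, 0.8), which is the orthogonal projection of the source onto the boundary and therefore a smaller perturbation than the starting boundary point.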
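methods: The data is released under the CC0 license, with free and open access through several channels, including RDF dump files, a SPARQL endpoint, and the Linked Open Data cloud, together with entity embeddings computed using high-performance computing.
results: SemOpenAlex supports a broad range of use-case scenarios, such as exploratory search, large-scale scientific impact quantification, exploratory analysis across scientific disciplines, and academic recommender systems that recommend collaborators, publications, and venues.
Abstract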
We present SemOpenAlex, an extensive RDF knowledge graph that contains over 26 billion triples about scientific publications and their associated entities, such as authors, institutions, journals, and concepts. SemOpenAlex is licensed under CC0, providing free and open access to the data. We offer the data through multiple channels, including RDF dump files, a SPARQL endpoint, and as a data source in the Linked Open Data cloud, complete with resolvable URIs and links to other data sources. Moreover, we provide embeddings for knowledge graph entities using high-performance computing. SemOpenAlex enables a broad range of use-case scenarios, such as exploratory semantic search via our website, large-scale scientific impact quantification, and other forms of scholarly big data analytics within and across scientific disciplines. Additionally, it enables academic recommender systems, such as recommending collaborators, publications, and venues, including explainability capabilities. Finally, SemOpenAlex can serve for RDF query optimization benchmarks, creating scholarly knowledge-guided language models, and as a hub for semantic scientific publishing.
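Summary
We present SemOpenAlex, an extensive RDF knowledge graph containing more than 26 billion triples about scientific publications and their associated entities, such as authors, institutions, journals, and concepts. SemOpenAlex is licensed under CC0, providing free and open access to the data. We offer the data through several channels, including RDF dump files, a SPARQL endpoint, and the Linked Open Data cloud, with resolvable URIs and links to other data sources. We also provide embeddings of the knowledge graph entities, computed with high-performance computing. SemOpenAlex supports a broad range of use cases, such as exploratory semantic search, large-scale scientific impact quantification, big data analytics within and across scientific disciplines, and academic recommender systems that recommend collaborators, publications, and venues. SemOpenAlex can also be used for RDF query optimization benchmarks, for creating scholarly knowledge-guided language models, and as a hub for semantic scientific publishing.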
Diffusion Model in Causal Inference with Unmeasured Confounders
results: Our experiments show that, in the presence of unmeasured confounders, the proposed model captures the counterfactual distribution more precisely than DCM.
Abstract
We study how to extend the use of the diffusion model to answer the causal question from the observational data under the existence of unmeasured confounders. In Pearl's framework of using a Directed Acyclic Graph (DAG) to capture the causal intervention, a Diffusion-based Causal Model (DCM) was proposed incorporating the diffusion model to answer the causal questions more accurately, assuming that all of the confounders are observed. However, unmeasured confounders in practice exist, which hinders DCM from being applicable. To alleviate this limitation of DCM, we propose an extended model called Backdoor Criterion based DCM (BDCM), whose idea is rooted in the Backdoor criterion to find the variables in DAG to be included in the decoding process of the diffusion model so that we can extend DCM to the case with unmeasured confounders. Synthetic data experiment demonstrates that our proposed model captures the counterfactual distribution more precisely than DCM under the unmeasured confounders.
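Summary
We study how to extend the diffusion model to answer causal questions from observational data when unmeasured confounders exist. Within Pearl's framework, which uses a Directed Acyclic Graph (DAG) to capture causal interventions, the Diffusion-based Causal Model (DCM) was proposed; it incorporates the diffusion model to answer causal questions more accurately, assuming that all confounders are observed. In practice, however, unmeasured confounders exist, which limits the applicability of DCM. To address this limitation, we propose an extended model, the Backdoor Criterion based DCM (BDCM), which uses the Backdoor criterion to select the variables in the DAG to include in the decoding process of the diffusion model, thereby extending DCM to the case with unmeasured confounders. Experiments on synthetic data show that our model answers causal questions more precisely than DCM under unmeasured confounders.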
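The Backdoor criterion that BDCM builds on can be illustrated without any diffusion model. The sketch below is only the textbook backdoor adjustment formula on a hand-made discrete example, not BDCM itself: the variables `Z` (an observed variable satisfying the backdoor criterion), `X` (treatment), and `Y` (outcome), and all the probabilities, are hypothetical. It contrasts the naive conditional P(Y=1 | X=x) with the adjusted quantity sum_z P(Y=1 | X=x, Z=z) P(Z=z).

```python
# Hypothetical structural probabilities for the DAG  Z -> X, Z -> Y, X -> Y.
P_Z = {0: 0.5, 1: 0.5}                       # P(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(X = 1 | Z = z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.3,   # P(Y = 1 | X = x, Z = z)
                 (0, 1): 0.5, (1, 1): 0.7}


def joint(x: int, y: int, z: int) -> float:
    """Joint probability P(X=x, Y=y, Z=z) implied by the tables above."""
    p_x = P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]
    p_y = P_Y1_GIVEN_XZ[(x, z)] if y == 1 else 1 - P_Y1_GIVEN_XZ[(x, z)]
    return P_Z[z] * p_x * p_y


def observational(x: int) -> float:
    """Naive P(Y=1 | X=x), which mixes the causal effect with confounding by Z."""
    numerator = sum(joint(x, 1, z) for z in P_Z)
    denominator = sum(joint(x, y, z) for y in (0, 1) for z in P_Z)
    return numerator / denominator


def backdoor_adjusted(x: int) -> float:
    """P(Y=1 | do(X=x)) via the backdoor formula with adjustment set {Z}."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


if __name__ == "__main__":
    print("naive difference:   ", observational(1) - observational(0))
    print("adjusted difference:", backdoor_adjusted(1) - backdoor_adjusted(0))
```

In this example the treatment raises P(Y=1) by 0.2 within each stratum of Z, and the adjusted difference recovers that value, while the naive difference is inflated to 0.44 because Z drives both X and Y.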
QDax: A Library for Quality-Diversity and Population-based Algorithms with Hardware Acceleration
paper_authors: Felix Chalumeau, Bryan Lim, Raphael Boige, Maxime Allard, Luca Grillotti, Manon Flageat, Valentin Macé, Arthur Flajolet, Thomas Pierrot, Antoine Cully
for: The paper is written for researchers and practitioners who are interested in using Quality-Diversity (QD) optimization algorithms in Jax for various optimization purposes, including black-box optimization and continuous control.
methods: The paper presents QDax, an open-source library with a streamlined and modular API for QD optimization algorithms in Jax. The library offers implementations of popular QD, Neuroevolution, and Reinforcement Learning (RL) algorithms, supported by various examples.
results: The paper reports that all implementations can be just-in-time compiled with Jax for efficient execution across multiple accelerators, including GPUs and TPUs, and that the library is thoroughly documented and tested with 95% coverage.
Abstract
QDax is an open-source library with a streamlined and modular API for Quality-Diversity (QD) optimization algorithms in Jax. The library serves as a versatile tool for optimization purposes, ranging from black-box optimization to continuous control. QDax offers implementations of popular QD, Neuroevolution, and Reinforcement Learning (RL) algorithms, supported by various examples. All the implementations can be just-in-time compiled with Jax, facilitating efficient execution across multiple accelerators, including GPUs and TPUs. These implementations effectively demonstrate the framework's flexibility and user-friendliness, easing experimentation for research purposes. Furthermore, the library is thoroughly documented and tested with 95% coverage.
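Summary
QDax is an open-source library with a streamlined and modular API for Quality-Diversity (QD) optimization algorithms in Jax. The library can be used for a range of optimization purposes, from black-box optimization to continuous control. QDax provides implementations of popular QD, Neuroevolution, and Reinforcement Learning (RL) algorithms, supported by several examples. These implementations can be just-in-time compiled with Jax for efficient execution on multiple accelerators, including GPUs and TPUs. They also demonstrate the framework's flexibility and user-friendliness, which eases experimentation for research. In addition, the library is tested with 95% coverage.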
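For readers new to Quality-Diversity optimization, the following is a minimal sketch of MAP-Elites, a well-known QD algorithm. It is written in plain Python and deliberately does not use the QDax API or Jax; the fitness function, the behavior descriptor, and every parameter value are hypothetical choices for illustration. The idea is to keep an archive with one "elite" per region of descriptor space, so the search returns many diverse, locally best solutions instead of a single optimum.

```python
import random


def fitness(solution):
    """Hypothetical objective to maximize: negative squared distance to the origin."""
    return -sum(v * v for v in solution)


def descriptor(solution):
    """Hypothetical behavior descriptor: the first coordinate, in [-1, 1]."""
    return solution[0]


def map_elites(iterations=2000, bins=10, dim=2, sigma=0.1, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell index -> (fitness, solution)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Mutate a randomly chosen elite, clipping to the search bounds.
            _, parent = rng.choice(list(archive.values()))
            child = [min(1.0, max(-1.0, v + rng.gauss(0.0, sigma))) for v in parent]
        else:
            # Occasionally sample a fresh random solution.
            child = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        # Map the descriptor to a grid cell.
        cell = min(bins - 1, int((descriptor(child) + 1.0) / 2.0 * bins))
        score = fitness(child)
        # Keep the child only if its cell is empty or it beats the current elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive


if __name__ == "__main__":
    for cell, (score, solution) in sorted(map_elites().items()):
        print(cell, round(score, 4), [round(v, 3) for v in solution])
```

The loop above evaluates one candidate at a time. A hardware-accelerated library would typically evaluate large batches of candidates in parallel, which is the kind of workload the abstract's just-in-time compilation on GPUs and TPUs targets.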
Detecting Spells in Fantasy Literature with a Transformer Based Artificial Intelligence
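results: Our experiments show that a BERT model can recognize the context of spells, and that considering different sequence classification and token classification approaches can improve the model's accuracy. We also find overarching properties of spells that allow the model to be applied to other fantasy universes.
Abstract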
Transformer architectures and models have made significant progress in language-based tasks. In this area, BERT is one of the most widely used and freely available transformer architectures. In our work, we use BERT for context-based phrase recognition of magic spells in the Harry Potter novel series. Spells are a common part of active magic in fantasy novels. Typically, spells are used in a specific context to achieve a supernatural effect. A series of investigations were conducted to see if a Transformer architecture could recognize such phrases based on their context in the Harry Potter saga. For our studies, a pre-trained BERT model was used and fine-tuned utilising different datasets and training methods to identify the searched context. By considering different approaches for sequence classification as well as token classification, it is shown that the context of spells can be recognised. According to our investigations, the examined sequence length for fine-tuning and validation of the model plays a significant role in context recognition. Based on this, we have investigated whether spells have overarching properties that allow a transfer of the neural network models to other fantasy universes as well. The application of our model showed promising results and is worth exploring further in subsequent studies.
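Summary
Transformer architectures and models have made significant progress in language-based tasks. In this area, BERT is a very popular and freely available transformer architecture. In our work, we use BERT for context-based phrase recognition, specifically of magic incantations in the Harry Potter novel series. Incantations are a common kind of special language in magical worlds and are typically used in a specific context to achieve a supernatural effect. We fine-tune a pre-trained BERT model with different datasets and training methods to identify these contexts. Our study shows that the sequence length examined during fine-tuning and validation plays an important role. We also investigate whether spells have overarching properties that allow the neural network models to be transferred to other fantasy universes. The results of applying our model suggest that it has potential and is worth further study.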
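Token classification, one of the two approaches mentioned above, needs a label for every token. The sketch below shows one common way to prepare such labels, the BIO scheme, in plain Python. The abstract does not say which labelling scheme the authors used, so the tag names (`B-SPELL`, `I-SPELL`, `O`), the helper `bio_tags`, and the example sentence are all assumptions made for illustration.

```python
def bio_tags(tokens, spell_spans):
    """Return one BIO tag per token.

    spell_spans is a list of (start, end) token indices, end exclusive,
    marking where a spell phrase occurs in the token list.
    """
    tags = ["O"] * len(tokens)
    for start, end in spell_spans:
        tags[start] = "B-SPELL"          # first token of the spell
        for i in range(start + 1, end):
            tags[i] = "I-SPELL"          # continuation tokens of the spell
    return tags


if __name__ == "__main__":
    # Illustrative sentence, not a quotation from the novels.
    tokens = ["Harry", "shouted", ",", "Expecto", "Patronum", "!"]
    for token, tag in zip(tokens, bio_tags(tokens, [(3, 5)])):
        print(f"{token}\t{tag}")
```

Labels of this shape could then be aligned with a model's subword tokens before fine-tuning. Sequence classification, by contrast, would assign a single label to the whole window of text, which is one reason the examined sequence length matters.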
FFF: Fragments-Guided Flexible Fitting for Building Complete Protein Structures
paper_authors: Weijie Chen, Xinyan Wang, Yuhang Wang
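for: The paper describes a new method, FFF, that combines protein structure prediction with protein structure recognition to build complete protein structures.
methods: The method uses a multi-level recognition network to capture various structural features from the input 3D cryo-EM map, then generates protein structural fragments using pseudo peptide vectors and a protein sequence alignment method. Finally, it builds a complete structural model through flexible fitting.
results: In our benchmark tests, FFF outperforms the baseline methods for building complete protein structures.
Abstract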
Cryo-electron microscopy (cryo-EM) is a technique for reconstructing the 3-dimensional (3D) structure of biomolecules (especially large protein complexes and molecular assemblies). As the resolution increases to the near-atomic scale, building protein structures de novo from cryo-EM maps becomes possible. Recently, recognition-based de novo building methods have shown the potential to streamline this process. However, it cannot build a complete structure due to the low signal-to-noise ratio (SNR) problem. At the same time, AlphaFold has led to a great breakthrough in predicting protein structures. This has inspired us to combine fragment recognition and structure prediction methods to build a complete structure. In this paper, we propose a new method named FFF that bridges protein structure prediction and protein structure recognition with flexible fitting. First, a multi-level recognition network is used to capture various structural features from the input 3D cryo-EM map. Next, protein structural fragments are generated using pseudo peptide vectors and a protein sequence alignment method based on these extracted features. Finally, a complete structural model is constructed using the predicted protein fragments via flexible fitting. Based on our benchmark tests, FFF outperforms the baseline methods for building complete protein structures.
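Summary
Cryo-electron microscopy (cryo-EM) is a technique for reconstructing the 3D structure of biomolecules, especially large protein complexes and molecular assemblies. As resolution improves to the near-atomic scale, building protein structures directly from cryo-EM maps becomes increasingly feasible. However, the low signal-to-noise ratio (SNR) prevents a complete protein structure from being built. Meanwhile, AlphaFold has achieved a major breakthrough in protein structure prediction. This inspired us to combine recognition-based de novo building with structure prediction to build a complete structure. In this paper, we propose a new method named FFF that bridges protein structure prediction and protein structure recognition. First, a multi-level recognition network captures various structural features from the input 3D cryo-EM map. Next, protein structural fragments are generated using pseudo peptide vectors and a protein sequence alignment method based on the extracted features. Finally, a complete structural model is built from the predicted protein fragments and adjusted through flexible fitting. In our tests, FFF outperforms the baseline methods for building complete protein structures.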
Segmentation Framework for Heat Loss Identification in Thermal Images: Empowering Scottish Retrofitting and Thermographic Survey Companies
For: This study aims to tackle fuel poverty in Scotland by automating the identification of heat loss sources in thermal images of homes, using a deep learning-based segmentation framework.
Methods: The proposed framework uses a Mask Region Proposal Convolutional Neural Network (Mask RCNN) to segment heat loss sources caused by weak insulation, and eliminates obstructive objects present in the images.
Results: The final fine-tuned model achieved a mean average precision (mAP) score of 77.2% for segmenting the target objects (heat loss sources), demonstrating the potential of the proposed framework in accurately quantifying energy loss in Scottish homes.
Abstract
Retrofitting and thermographic survey (TS) companies in Scotland collaborate with social housing providers to tackle fuel poverty. They employ ground-level infrared (IR) camera-based TSs (GIRTSs) for collecting thermal images to identify the heat loss sources resulting from poor insulation. However, this identification process is labor-intensive and time-consuming, necessitating extensive data processing. To automate this, an AI-driven approach is necessary. Therefore, this study proposes a deep learning (DL)-based segmentation framework using the Mask Region Proposal Convolutional Neural Network (Mask RCNN) to validate its applicability to these thermal images. The objective of the framework is to automatically identify and crop heat loss sources caused by weak insulation, while also eliminating obstructive objects present in those images. By doing so, it minimizes labor-intensive tasks and provides an automated, consistent, and reliable solution. To validate the proposed framework, approximately 2500 thermal images were collected in collaboration with an industrial TS partner. Then, 1800 representative images were carefully selected with the assistance of experts and annotated to highlight the target objects (TO) to form the final dataset. Subsequently, a transfer learning strategy was employed to train the dataset, progressively augmenting the training data volume and fine-tuning the pre-trained baseline Mask RCNN. As a result, the final fine-tuned model achieved a mean average precision (mAP) score of 77.2% for segmenting the TO, demonstrating the significant potential of the proposed framework in accurately quantifying energy loss in Scottish homes.
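Summary
Retrofitting and thermographic survey (TS) companies in Scotland work with social housing providers to tackle fuel poverty. They use ground-level infrared (IR) camera-based TSs (GIRTSs) to collect thermal images and identify heat loss sources caused by poor insulation. This identification process is labor-intensive and time-consuming and requires extensive data processing. To automate it, this study proposes an AI-based segmentation framework that uses the Mask Region Proposal Convolutional Neural Network (Mask RCNN) and validates its applicability. The framework aims to automatically identify and crop heat loss sources while excluding obstructive objects from the thermal images. In doing so, it reduces labor-intensive tasks and provides an automated, consistent, and reliable solution. To validate the framework, approximately 2500 thermal images were collected in collaboration with an industrial TS partner. Then 1800 representative images were carefully selected and annotated with expert help to highlight the target objects (TO), forming the final dataset. A transfer learning strategy was then used for training, progressively increasing the training data volume and fine-tuning the model. The final fine-tuned Mask RCNN achieved a mean average precision (mAP) of 77.2% for segmenting the TO, showing the framework's significant potential for accurately quantifying energy loss in Scottish homes.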
MedMine: Examining Pre-trained Language Models on Medication Mining
results: The study finds that existing PLMs perform unevenly on automatic medication mining, particularly across different entity types and clinical events.
Abstract
Automatic medication mining from clinical and biomedical text has become a popular topic due to its real impact on healthcare applications and the recent development of powerful language models (LMs). However, fully-automatic extraction models still face obstacles to be overcome such that they can be deployed directly into clinical practice for better impacts. Such obstacles include their imbalanced performances on different entity types and clinical events. In this work, we examine current state-of-the-art pre-trained language models (PLMs) on such tasks, via fine-tuning including the monolingual model Med7 and multilingual large language model (LLM) XLM-RoBERTa. We compare their advantages and drawbacks using historical medication mining shared task data sets from n2c2-2018 challenges. We report the findings we get from these fine-tuning experiments such that they can facilitate future research on addressing them, for instance, how to combine their outputs, merge such models, or improve their overall accuracy by ensemble learning and data augmentation. MedMine is part of the M3 Initiative: https://github.com/HECTA-UoM/M3
摘要
自动药物挖掘从医疗和生物医学文本中得到了广泛的关注,因为它们在医疗应用中有真正的影响。然而,完全自动提取模型仍然需要突破一些障碍,以便在临床实践中直接部署。这些障碍包括它们在不同实体类型和医疗事件上的不均衡性表现。在这项工作中,我们评估了当前状态的批处理语言模型(PLM)在这些任务上,包括单语言模型Med7和多语言大语言模型(LLM)XLM-RoBERTa。我们比较了它们的优点和缺点,使用历史药物挖掘分享任务数据集。我们报告了这些精度调整实验的结果,以便未来研究如何组合它们的输出、合并这些模型或提高它们的总准确率 durch ensemble学习和数据扩展。MedMine是M3Initiave的一部分,详细信息请参考。
A Meta-learning based Stacked Regression Approach for Customer Lifetime Value Prediction
paper_authors: Karan Gadgil, Sukhpal Singh Gill, Ahmed M. Abdelmoniem
for: The paper aims to propose a simple yet effective and interpretable Customer Lifetime Value (CLV) prediction model that can handle a wide variety of input features and is applicable in various business domains.
methods: The proposed model is based on a meta-learning-based stacked regression approach that combines the predictions from bagging and boosting models.
results: The proposed model was empirically tested on an openly available Online Retail dataset and showed superior performance compared to existing distribution-based and basic models.
Abstract
Companies across the globe are keen on targeting potential high-value customers in an attempt to expand revenue, and this can be achieved only by understanding customers better. Customer Lifetime Value (CLV) is the total monetary value of transactions/purchases made by a customer with the business over an intended period of time and is used as a means to estimate future customer interactions. CLV finds application in a number of distinct business domains such as Banking, Insurance, Online-entertainment, Gaming, and E-Commerce. The existing distribution-based and basic (recency, frequency & monetary) based models are limited in their ability to handle a wide variety of input features. Moreover, the more advanced deep learning approaches could be superfluous and add an undesirable element of complexity in certain application areas. We therefore propose a system that is both effective and comprehensive, yet simple and interpretable. With that in mind, we develop a meta-learning-based stacked regression model which combines the predictions from bagging and boosting models, each of which is found to perform well individually. Empirical tests have been carried out on an openly available Online Retail dataset to evaluate various models and show the efficacy of the proposed approach.
摘要
世界各地公司都在努力寻找高值客户,以拓展收入。这可以通过更好地理解客户来实现。客户生命周期价值(CLV)是指客户在业务之间的财务交易总额,在一定时间范围内,并用于预测未来客户互动。CLV在银行、保险、在线娱乐、游戏和电商等多个业务领域有广泛的应用。现有的分布型和基础(频率、购买量和金额)型模型在处理各种输入特征方面存在限制。此外,更高级的深度学习方法可能会增加不必要的复杂性,特别在某些应用领域。我们因此提出了一个能够同时具有效果、全面、简单并且可解释的系统。为了实现这一目标,我们开发了基于元学习的堆式回归模型,该模型将束合袋装和提升模型的预测结果。我们对公开ailable的在线零售数据集进行了实验,以评估不同模型的表现,并证明了我们的方法的有效性。
Stock Market Price Prediction: A Hybrid LSTM and Sequential Self-Attention based Approach
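results: Extensive experiments on three stock datasets (SBIN, HDFCBANK, BANKBARODA) show that the proposed model is more effective and feasible than existing models, with the best results on the RMSE and R2 evaluation metrics.
Abstract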
One of the most enticing research areas is the stock market, and projecting stock prices may help investors profit by making the best decisions at the correct time. Deep learning strategies have emerged as a critical technique in the field of the financial market. The stock market is influenced by two aspects. The first is geopolitical, social, and global events, on the basis of which price trends can be affected. The second aspect focuses purely on historical price trends and seasonality, allowing us to forecast stock prices. In this paper, our aim is to focus on the second aspect and build a model that predicts future prices with minimal errors. In order to provide better stock price prediction results, we propose a new model named Long Short-Term Memory (LSTM) with Sequential Self-Attention Mechanism (LSTM-SSAM). Finally, we conduct extensive experiments on three stock datasets: SBIN, HDFCBANK, and BANKBARODA. The experimental results prove the effectiveness and feasibility of the proposed model compared to existing models. The experimental findings demonstrate that the proposed model gives the best results on the root-mean-squared error (RMSE) and R-squared (R2) evaluation indicators.
摘要
一个非常吸引人的研究领域是股市,并且预测股价可以帮助投资者取得最佳的决策时间。深度学习策略在金融市场中发挥了关键作用。股市受到两个方面的影响,一是地域政治、社会和全球事件的影响,这些事件可能对股价趋势产生影响。而第二个方面则是历史价格趋势和季节性,我们可以通过这些信息来预测股价。在这篇论文中,我们将关注第二个方面,并建立一个名为Long Short-Term Memory(LSTM)的新模型,并加入Sequential Self-Attention Mechanism(LSTM-SSAM)。最后,我们对SBIN、HDFCBANK和BANKBARODA三个股Dataset进行了广泛的实验。实验结果证明了我们提出的模型的有效性和实现性,并且与现有模型进行比较。实验结果表明,使用RMSE和R2评价指标,我们的模型在预测股价方面表现出色。
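The two evaluation indicators named in the abstract above, RMSE and R2, can be computed directly from actual and predicted prices. The sketch below uses their standard definitions; the function names and the price values are made up for illustration and are not taken from the paper's datasets.

```python
import math


def rmse(actual, predicted):
    """Root-mean-squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical closing prices and model forecasts (illustrative only).
closing = [100.0, 102.0, 104.0, 106.0]
forecast = [101.0, 101.0, 105.0, 105.0]
print(f"RMSE = {rmse(closing, forecast):.2f}")
print(f"R2   = {r_squared(closing, forecast):.2f}")
```

Lower RMSE and an R2 closer to 1 both indicate forecasts that track the actual series more closely.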
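results: The paper argues that statistical methods alone are not enough to achieve AGI. It also identifies several key cognitive abilities and their role in achieving AGI, and examines socio-technical factors and their effect on AGI development.
Abstract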
The original vision of AI was re-articulated in 2002 via the term 'Artificial General Intelligence' or AGI. This vision is to build 'Thinking Machines' - computer systems that can learn, reason, and solve problems similar to the way humans do. This is in stark contrast to the 'Narrow AI' approach practiced by almost everyone in the field over the many decades. While several large-scale efforts have nominally been working on AGI (most notably DeepMind), the field of pure focused AGI development has not been well funded or promoted. This is surprising given the fantastic value that true AGI can bestow on humanity. In addition to the dearth of effort in this field, there are also several theoretical and methodical missteps that are hampering progress. We highlight why purely statistical approaches are unlikely to lead to AGI, and identify several crucial cognitive abilities required to achieve human-like adaptability and autonomous learning. We conclude with a survey of socio-technical factors that have undoubtedly slowed progress towards AGI.
摘要
原始的人工智能概念在2002年被重新艺术iculminated 以"人工通用智能"(AGI)的形式。这个目标是建立"思维机器"——计算机系统可以学习、理据和解决问题,与人类相似。这与在多个时期内的"窄AI"方法不同,大多数人在该领域的努力都是这种方法。虽然有几个大规模尝试在AGI方面工作(特别是DeepMind),但是纯粹的AGI发展领域没有得到过足够的投资和推广。这对人类的未来带来了惊人的价值。此外,我们还指出了统计方法不可能导致AGI的理论和方法上的阻碍因素,并识别了达到人类式的适应和自主学习所需的关键认知能力。我们结束于论述AGI的发展受到了社会技术因素的阻碍。
Feature Importance versus Feature Influence and What It Signifies for Explainable AI
results: The study shows that CIU can provide more expressive and flexible explanations, and can reduce bias in causal relationships.
Abstract
When used in the context of decision theory, feature importance expresses how much changing the value of a feature can change the model outcome (or the utility of the outcome), compared to other features. Feature importance should not be confused with the feature influence used by most state-of-the-art post-hoc Explainable AI methods. Contrary to feature importance, feature influence is measured against a reference level or baseline. The Contextual Importance and Utility (CIU) method provides a unified definition of global and local feature importance that is applicable also for post-hoc explanations, where the value utility concept provides instance-level assessment of how favorable or not a feature value is for the outcome. The paper shows how CIU can be applied to both global and local explainability, assesses the fidelity and stability of different methods, and shows how explanations that use contextual importance and contextual utility can provide more expressive and flexible explanations than when using influence only.
摘要
在决策理论中,特征重要度表示修改特征值会改变模型结果(或结果的价值)的程度,相比其他特征。这与特征影响不同,特征影响是相对参照水平或基线进行度量。Contextual Importance and Utility(CIU)方法提供了一个综合定义的全局和本地特征重要度,可以应用于后期解释,其中值用性概念提供了实例级别的评估结果如何有利或不利于结果。文章显示了CIU如何应用于全局和本地解释,评估不同方法的准确性和稳定性,并显示了使用Contextual Importance和Contextual Utility来提供更加表达性和灵活的解释,相比使用影响只。
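To make the distinction between contextual importance and contextual utility in the abstract above concrete, here is a minimal sketch. It assumes the definitions commonly given in the CIU literature: CI is the output range reachable by varying one feature, divided by the model's overall output range, and CU is the position of the current output within that reachable range. The toy model, the grid sweep, and all names are hypothetical and are not the paper's implementation.

```python
def toy_model(x1, x2):
    """Hypothetical black-box model with output in [0, 1] for inputs in [0, 1]."""
    return 0.3 * x1 + 0.7 * x2 ** 2


def ciu(model, instance, feature_index, grid, abs_min=0.0, abs_max=1.0):
    """Estimate contextual importance (CI) and contextual utility (CU) for one feature.

    The feature is swept over `grid` while the other inputs stay at the instance's values.
    """
    outputs = []
    for value in grid:
        probe = list(instance)
        probe[feature_index] = value
        outputs.append(model(*probe))
    c_min, c_max = min(outputs), max(outputs)
    out = model(*instance)
    ci = (c_max - c_min) / (abs_max - abs_min)
    # Assumed convention: a feature with no effect gets a neutral utility of 0.5.
    cu = (out - c_min) / (c_max - c_min) if c_max > c_min else 0.5
    return ci, cu


grid = [i / 100 for i in range(101)]  # feature values 0.00, 0.01, ..., 1.00
instance = (0.5, 0.5)
for index, name in enumerate(["x1", "x2"]):
    ci, cu = ciu(toy_model, instance, index, grid)
    print(f"{name}: CI = {ci:.2f}, CU = {cu:.2f}")
```

In this toy setting `x2` is more important than `x1` in the context of the instance, yet its current value is less favorable, which is the kind of separation an influence score measured against a baseline does not express directly.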
A machine-learning sleep-wake classification model using a reduced number of features derived from photoplethysmography and activity signals
paper_authors: Douglas A. Almeida, Felipe M. Dias, Marcelo A. F. Toledo, Diego A. C. Cardenas, Filipe A. C. Oliveira, Estela Ribeiro, Jose E. Krieger, Marco A. Gutierrez
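results: The model performs comparably to current state-of-the-art methods, with a sensitivity of 91.15 $\pm$ 1.16%, specificity of 53.66 $\pm$ 1.12%, F1-score of 83.88 $\pm$ 0.56%, and kappa of 48.0 $\pm$ 0.86%. Because it uses a reduced number of features, the method is suitable for wearable devices with limited computational power.
Abstract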
Sleep is a crucial aspect of our overall health and well-being. It plays a vital role in regulating our mental and physical health, affecting everything from our mood, memory, and cognitive function to our physical resilience and immune system. The classification of sleep stages is a mandatory step to assess sleep quality, providing the metrics to estimate the quality of sleep and how well our body is functioning during this essential period of rest. Photoplethysmography (PPG) has been demonstrated to be an effective signal for sleep stage inference, meaning it can be used on its own or in combination with other signals to determine sleep stage. This information is valuable in identifying potential sleep issues and developing strategies to improve sleep quality and overall health. In this work, we present a machine learning sleep-wake classification model based on the eXtreme Gradient Boosting (XGBoost) algorithm and features extracted from PPG signal and activity counts. The performance of our method was comparable to current state-of-the-art methods with a Sensitivity of 91.15 $\pm$ 1.16%, Specificity of 53.66 $\pm$ 1.12%, F1-score of 83.88 $\pm$ 0.56%, and Kappa of 48.0 $\pm$ 0.86%. Our method offers a significant improvement over other approaches as it uses a reduced number of features, making it suitable for implementation in wearable devices that have limited computational power.
摘要
睡眠是我们全面健康和卫生的重要组成部分。它对我们的情绪、身体健康和智能功能产生重要的影响,同时也影响我们的免疫力和身体抵抗力。确定睡眠阶段是一项必要的步骤,以评估睡眠质量,并提供评估睡眠质量和身体功能的指标。光谱 Plethysmography (PPG) 已被证明是一种有效的睡眠阶段推断信号,因此可以单独使用或与其他信号结合使用来确定睡眠阶段。这些信息对于检测可能存在的睡眠问题和改善睡眠质量和全面健康提供了 ценности。在这个工作中,我们提出了基于 eXtreme Gradient Boosting (XGBoost) 算法和 PPG 信号和活动计数的机器学习睡眠-醒目分类模型。我们的方法的性能与当前状态的方法相当,具有感知率为 91.15 $\pm$ 1.16%、特异性为 53.66 $\pm$ 1.12%、F1 分数为 83.88 $\pm$ 0.56% 和 Kappa 值为 48.0 $\pm$ 0.86%。我们的方法提供了与其他方法相比的显著改善,因为它使用了减少的特征数,使其适合在有限的计算能力的穿戴式设备中实现。
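The four metrics reported in the abstract above can all be derived from a single binary confusion matrix. The sketch below uses their standard definitions and assumes that "sleep" is the positive class, which the abstract does not state. The counts are invented for illustration and do not reproduce the paper's results.

```python
def sleep_wake_metrics(tp, fn, fp, tn):
    """Compute sensitivity, specificity, F1-score, and Cohen's kappa from confusion counts.

    tp/fn: positive-class (assumed: sleep) epochs predicted correctly / incorrectly.
    tn/fp: negative-class (assumed: wake) epochs predicted correctly / incorrectly.
    """
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    observed = (tp + tn) / total  # observed agreement
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical epoch counts (illustrative only).
metrics = sleep_wake_metrics(tp=80, fn=10, fp=20, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Kappa corrects raw agreement for chance, which is why it can be modest even when sensitivity is high, as is common when wake epochs are much rarer than sleep epochs.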
Revealing the Underlying Patterns: Investigating Dataset Similarity, Performance, and Generalization
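results: Adding a small number of unseen images (e.g., 1, 3, or 7) to the training set improves the model's generalization and reduces training and annotation costs.
Abstract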
Supervised deep learning models require a significant amount of labelled data to achieve acceptable performance on a specific task. However, when tested on unseen data, the models may not perform well. Therefore, the models need to be trained with additional and varying labelled data to improve generalization. In this work, our goal is to understand the models, their performance, and their generalization. We establish image-image, dataset-dataset, and image-dataset distances to gain insights into the model's behavior. Our proposed distance metric, when combined with model performance, can help in selecting an appropriate model/architecture from a pool of candidate architectures. We show that the generalization of these models can be improved by adding only a small number of unseen images (say 1, 3 or 7) to the training set. Our proposed approach reduces training and annotation costs while providing an estimate of model performance on unseen data in dynamic environments.
摘要
深度学习模型需要大量标注数据来达到特定任务的可接受性水平。然而,当测试在未看到的数据时,模型可能不会表现好。因此,模型需要通过添加更多和变化的标注数据来改善通用性。在这项工作中,我们的目标是理解模型、其性能和通用性。我们定义图像-图像、数据集-数据集和图像-数据集距离,以获得模型的行为的启示。我们的提议的距离度量器,当与模型性能相结合,可以帮助选择最佳的模型/架构从候选 arquitectures中。我们已经示出,通过只添加一小数量的未看到图像(例如1、3或7)到训练集中,可以改善这些模型的通用性。我们的提议方法可以降低训练和注释成本,同时提供对未看到数据的模型性能的估计,在动态环境中。
Provably Efficient Learning in Partially Observable Contextual Bandit
for: investigate transfer learning in partially observable contextual bandits
methods: convert the problem to identifying or partially identifying causal effects through optimization problems, and use sampling algorithms to obtain causal bounds
results: improve the performance of classical bandit algorithms and achieve orders of magnitude faster convergence rates, especially in tasks with function approximation.
Abstract
In this paper, we investigate transfer learning in partially observable contextual bandits, where agents have limited knowledge from other agents and partial information about hidden confounders. We first convert the problem to identifying or partially identifying causal effects between actions and rewards through optimization problems. To solve these optimization problems, we discretize the original functional constraints of unknown distributions into linear constraints, and sample compatible causal models by sequentially solving linear programs to obtain causal bounds that account for estimation error. Our sampling algorithms provide desirable convergence results for suitable sampling distributions. We then show how causal bounds can be applied to improve classical bandit algorithms and how they affect the regret with respect to the size of action sets and function spaces. Notably, in the task with function approximation, which allows us to handle general context distributions, our method improves the order dependence on function space size compared with previous literature. We formally prove that our causally enhanced algorithms outperform classical bandit algorithms and achieve orders of magnitude faster convergence rates. Finally, we perform simulations that demonstrate the efficiency of our strategy compared to the current state-of-the-art methods. This research has the potential to enhance the performance of contextual bandit agents in real-world applications where data is scarce and costly to obtain.
摘要
在这篇论文中,我们研究了在部分可见情况下的 contextual bandit 中的转移学习,agent有限制的知识来自其他代理人和部分隐藏的干扰因素。我们首先将问题转化为标定或部分标定 causal 效应 между动作和奖励,通过优化问题来解决。为解决这些优化问题,我们将原始Unknown Distributions的函数约束转化为线性约束,并通过顺序解决线性程序来采样Compatible causal models,从而获得 causal bound WITH consideration of estimation error。我们的采样算法提供了可靠的连续抽象结果。然后,我们示了如何使用 causal bound 来改进经典 bandit 算法,并对 act 集和函数空间大小的选择产生影响。尤其在可以处理通用上下文分布时,我们的方法提高了对函数空间大小的依赖性。我们正式证明我们的 causally enhanced 算法比经典 bandit 算法更高效,并实现了orders of magnitude faster convergence rates。最后,我们进行了 simulations ,证明我们的策略比现有的方法更高效。这项研究有望提高实际应用中的 contextual bandit 代理人性能,因为数据是珍贵和costly to obtain。
MSLE: An ontology for Materials Science Laboratory Equipment. Large-Scale Devices for Materials Characterization
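results: In collaboration with domain experts, the paper studies and models large-scale devices for materials characterization and uses SHACL to model constraints. These constraints help answer competency questions about materials science laboratory equipment.
Abstract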
This paper introduces a new ontology for Materials Science Laboratory Equipment, termed MSLE. A fundamental issue with materials science laboratory (hereafter lab) equipment in the real world is that scientists work with various types of equipment with multiple specifications. For example, there are many electron microscopes with different parameters in chemical and physical labs. A critical development to unify the description is to build an equipment domain ontology as basic semantic knowledge and to guide the user to work with the equipment appropriately. Here, we propose to develop a consistent ontology for equipment, the MSLE ontology. In the MSLE, two main existing ontologies, the Semantic Sensor Network (SSN) and the Material Vocabulary (MatVoc), have been integrated into the MSLE core to build a coherent ontology. Since various acronyms and terms have been used for equipment, this paper proposes an approach to use a Simple Knowledge Organization System (SKOS) to represent the hierarchical structure of equipment terms. Equipment terms were collected in various languages and abbreviations and coded into the MSLE using the SKOS model. The ontology development was conducted in close collaboration with domain experts and focused on the large-scale devices for materials characterization available in our research group. Competency questions are expected to be addressed through the MSLE ontology. Constraints are modeled in the Shapes Constraint Language (SHACL); a prototype is shown and validated to show the value of the modeling constraints.
摘要
The MSLE ontology integrates two existing ontologies, the Semantic Sensor Network (SSN) and the Material Vocabulary (MatVoc), to create a coherent ontology. To deal with the various acronyms and terms used for equipment, the authors propose using a Simple Knowledge Organization System (SKOS) to represent the hierarchical structure of equipment terms. The ontology development was conducted in collaboration with domain experts and focused on large-scale devices for materials characterization available in the research group. The authors expect that the MSLE ontology will address competency questions and provide a standardized way of describing equipment. Constraints are modeled in the Shapes Constraint Language (SHACL), and a prototype is shown to demonstrate the value of the modeling constraints.
Measuring Variety, Balance, and Disparity: An Analysis of Media Coverage of the 2021 German Federal Election
paper_authors: Michael Färber, Jannik Schwade, Adam Jatowt
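for: This study investigates how to assess diversity in news articles in order to prevent filter bubbles and fuel public discourse, especially before elections.
methods: The study proposes a multi-dimensional framework for assessing diversity in news articles, considering diversity with respect to individuals, parties, and topics. The authors also create a Google Top Stories dataset of more than 26,000 unique headlines from more than 900 news outlets, collected within two weeks before and after the 2021 German federal election.
results: The authors find high diversity for more general search terms (e.g., "election"). A range of more specific search terms ("education," "Europe," "climate protection," "government") yields news articles with high diversity in two of the three dimensions, which reflects a more subjective, dedicated discussion.
Abstract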
Determining and measuring diversity in news articles is important for a number of reasons, including preventing filter bubbles and fueling public discourse, especially before elections. So far, the identification and analysis of diversity have been illuminated in a variety of ways, such as measuring the overlap of words or topics between news articles related to US elections. However, the question of how diversity in news articles can be measured holistically, i.e., with respect to (1) variety, (2) balance, and (3) disparity, considering individuals, parties, and topics, has not been addressed. In this paper, we present a framework for determining diversity in news articles according to these dimensions. Furthermore, we create and provide a dataset of Google Top Stories, encompassing more than 26,000 unique headlines from more than 900 news outlets collected within two weeks before and after the 2021 German federal election. While we observe high diversity for more general search terms (e.g., "election"), a range of search terms ("education," "Europe," "climate protection," "government") resulted in news articles with high diversity in two out of three dimensions. This reflects a more subjective, dedicated discussion on rather future-oriented topics.
摘要
确定和衡量新闻文章的多样性是重要的多种原因,包括避免 Filter Bubble 和促进公众讨论,特别是在选举前。迄今为止,多样性的识别和分析已经得到了多种方法的探讨,如在美国选举新闻文章中度量词语或话题之间的重叠。然而,如何全面衡量新闻文章的多样性,即以(1)多样性、(2)平衡和(3)差异为基础,考虑个体、党派和话题,还没有得到回答。在这篇论文中,我们提出了对多样性的定义和衡量方法。此外,我们还创建了一个 Google Top Stories 数据集,包括超过 26,000 个唯一的标题和来自超过 900 家新闻机构,在2021年德国联邦大选之前两周内收集到的。我们发现,使用更通用的搜索关键词(例如 "选举")时,新闻文章的多样性很高。然而,使用不同的搜索关键词(例如 "教育", "欧洲", "气候保护", "政府")时,新闻文章的多样性在三个维度中具有高度的多样性,这反映了一种更Subjective、专注于未来话题的讨论。
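The three dimensions named in the abstract above can be illustrated with simple textbook measures. The sketch below assumes variety is the number of distinct categories mentioned, balance is normalized Shannon entropy of the mention counts, and disparity is the mean pairwise distance between the categories present. These are generic choices for illustration; the paper's actual formulas are not given in the abstract, and the party labels and distances are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def variety(mentions):
    """Number of distinct categories (e.g. parties) that appear at all."""
    return len(set(mentions))


def balance(mentions):
    """Shannon evenness in [0, 1]: 1 means every category is mentioned equally often."""
    counts = Counter(mentions)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def disparity(mentions, distance):
    """Mean pairwise distance between the categories that appear."""
    categories = sorted(set(mentions))
    pairs = list(combinations(categories, 2))
    if not pairs:
        return 0.0
    return sum(distance[frozenset(pair)] for pair in pairs) / len(pairs)


# Hypothetical party mentions across headlines and made-up pairwise distances.
mentions = ["party_a", "party_a", "party_b", "party_c"]
distance = {
    frozenset(("party_a", "party_b")): 0.2,
    frozenset(("party_a", "party_c")): 0.8,
    frozenset(("party_b", "party_c")): 0.5,
}
print("variety  :", variety(mentions))
print("balance  :", round(balance(mentions), 3))
print("disparity:", round(disparity(mentions, distance), 3))
```

Keeping the three measures separate, rather than collapsing them into one score, makes it possible to say that a set of articles is diverse in some dimensions but not others.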
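results: The study finds that automatically learned feature representations can extract fine-grained clusters that capture the shapes of the wireless transmission bursts, whereas the baseline method can only separate the data based on background noise.
Abstract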
In recent years, the traditional feature engineering process for training machine learning models is being automated by the feature extraction layers integrated in deep learning architectures. In wireless networks, many studies were conducted in automatic learning of feature representations for domain-related challenges. However, most of the existing works assume some supervision along the learning process by using labels to optimize the model. In this paper, we investigate an approach to learning feature representations for wireless transmission clustering in a completely unsupervised manner, i.e. requiring no labels in the process. We propose a model based on convolutional neural networks that automatically learns a reduced dimensionality representation of the input data with 99.3% less components compared to a baseline principal component analysis (PCA). We show that the automatic representation learning is able to extract fine-grained clusters containing the shapes of the wireless transmission bursts, while the baseline enables only general separability of the data based on the background noise.
摘要
Our proposed model is based on convolutional neural networks (CNNs), which automatically learn a reduced dimensionality representation of the input data. We show that this approach achieves a 99.3% reduction in the number of components compared to a baseline principal component analysis (PCA) method. Furthermore, the automatic representation learning is able to extract fine-grained clusters containing the shapes of the wireless transmission bursts, while the baseline method only enables general separability of the data based on the background noise.
paper_authors: Kristina Schaaff, Caroline Reinig, Tim Schlippe
for: This study investigates the empathetic responses and emotional expressions of ChatGPT, a chatbot based on GPT-3.5.
methods: The study evaluates ChatGPT's empathy in three aspects: understanding and expressing emotions, parallel emotional response, and empathic personality.
results: ChatGPT was able to correctly identify emotions and produce appropriate answers in 91.7% of cases, and reacted with a parallel emotion in 70.7% of conversations. The empathic capabilities of ChatGPT were found to be better than those of people with Asperger syndrome/high-functioning autism, but still below the average of healthy humans.
Abstract
Empathy is often understood as the ability to share and understand another individual's state of mind or emotion. With the increasing use of chatbots in various domains, e.g., children seeking help with homework, individuals looking for medical advice, and people using the chatbot as a daily source of companionship, the importance of empathy in human-computer interaction has become more apparent. Therefore, our study investigates the extent to which ChatGPT based on GPT-3.5 can exhibit empathetic responses and emotional expressions. We analyzed the following three aspects: (1) understanding and expressing emotions, (2) parallel emotional response, and (3) empathic personality. Thus, we not only evaluate ChatGPT on various empathy aspects and compare it with human behavior but also show a possible way to analyze the empathy of chatbots in general. Our results show that in 91.7% of the cases, ChatGPT was able to correctly identify emotions and produce appropriate answers. In conversations, ChatGPT reacted with a parallel emotion in 70.7% of cases. The empathic capabilities of ChatGPT were evaluated using a set of five questionnaires covering different aspects of empathy. Even though the results indicate that the empathic abilities of ChatGPT are still below the average of healthy humans, the scores are better than those of people who have been diagnosed with Asperger syndrome / high-functioning autism.
摘要
Empathy 常被理解为与别人分享和理解他们的情感或情绪的能力。随着虚拟助手在不同领域的使用,例如孩子们寻求家庭作业帮助、人们寻求医疗建议以及人们每天通过虚拟助手获得伴侣关系,人机交互中Empathy的重要性变得更加明显。因此,我们的研究探讨了基于GPT-3.5的ChatGPT是否能够表现出Empathy的响应和情感表达。我们分析了以下三个方面:(1)理解和表达情感,(2)并行情感响应,以及(3)Empathic Personality。因此,我们不仅评估ChatGPT在不同Empathy方面的表现,并与人类行为进行比较,还提供了分析虚拟助手 Empathy 的可能性。我们的结果显示,在91.7%的情况下,ChatGPT能正确地识别情感并提供相应的答案。在对话中,ChatGPT在70.7%的情况下表现出并行情感响应。虚拟助手Empathic能力被评估 using five 个问卷,涵盖不同方面的Empathy。尽管结果表明ChatGPT的Empathic能力仍然比健康人类的平均水平低,但得分仍高于被诊断为有Asperger症/高功能自闭症的人。
paper_authors: Michaël Mathieu, Sherjil Ozair, Srivatsan Srinivasan, Caglar Gulcehre, Shangtong Zhang, Ray Jiang, Tom Le Paine, Richard Powell, Konrad Żołna, Julian Schrittwieser, David Choi, Petko Georgiev, Daniel Toyama, Aja Huang, Roman Ring, Igor Babuschkin, Timo Ewalds, Mahyar Bordbar, Sarah Henderson, Sergio Gómez Colmenarejo, Aäron van den Oord, Wojciech Marian Czarnecki, Nando de Freitas, Oriol Vinyals
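results: Using only offline data, the paper improves the state of the art for agents and achieves a 90% win rate against the previously published AlphaStar behavior cloning agent.
Abstract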
StarCraft II is one of the most challenging simulated reinforcement learning environments; it is partially observable, stochastic, and multi-agent, and mastering StarCraft II requires strategic planning over long time horizons with real-time low-level execution. It also has an active professional competitive scene. StarCraft II is uniquely suited for advancing offline RL algorithms, both because of its challenging nature and because Blizzard has released a massive dataset of millions of StarCraft II games played by human players. This paper leverages that and establishes a benchmark, called AlphaStar Unplugged, introducing unprecedented challenges for offline reinforcement learning. We define a dataset (a subset of Blizzard's release), tools standardizing an API for machine learning methods, and an evaluation protocol. We also present baseline agents, including behavior cloning and offline variants of actor-critic and MuZero. We improve the state of the art of agents using only offline data, and we achieve a 90% win rate against the previously published AlphaStar behavior cloning agent.
摘要
星际II是一个非常具有挑战性的模拟强化学环境之一,它是部分可见、随机、多个智能体,需要在长时间 horizon 上进行策略规划,同时在实时低级别执行。它还拥有活跃的职业竞赛场景。由于星际II的挑战性和Blizzard公司发布了数百万场星际II游戏记录,因此这个纸 lái 利用这些数据,建立了一个名为AlphaStar Unplugged的标准。我们定义了一个子集(Blizzard发布的 dataset)、工具和标准化 API для机器学习方法,以及评估协议。我们还提供了基线代理,包括行为做参数和离线变体的actor-critic和MuZero。我们使用仅基于离线数据进行代理,并达到90%的胜率,比之前发布的AlphaStar行为做参数代理更高。
Vocab-Expander: A System for Creating Domain-Specific Vocabularies Based on Word Embeddings
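results: The system has an easy-to-use interface that lets users quickly confirm or reject term suggestions. Vocab-Expander supports a variety of use cases, such as improving concept-based information retrieval in technology and innovation management, enhancing communication and collaboration within organizations or interdisciplinary projects, and creating vocabularies for specific courses in education.
Abstract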
In this paper, we propose Vocab-Expander at https://vocab-expander.com, an online tool that enables end-users (e.g., technology scouts) to create and expand a vocabulary of their domain of interest. It utilizes an ensemble of state-of-the-art word embedding techniques based on web text and ConceptNet, a common-sense knowledge base, to suggest related terms for already given terms. The system has an easy-to-use interface that allows users to quickly confirm or reject term suggestions. Vocab-Expander offers a variety of potential use cases, such as improving concept-based information retrieval in technology and innovation management, enhancing communication and collaboration within organizations or interdisciplinary projects, and creating vocabularies for specific courses in education.
摘要
在这篇论文中,我们提出了Vocab-Expander,一个在线工具,它允许用户(例如技术搜寻专家)根据他们的领域兴趣创建和扩展词汇表。它利用了当前最佳的词嵌入技术,基于网络文本和ConceptNet,一个常识知识库,提供相关的词语建议。用户可以通过一个简单易用的界面来快速确认或拒绝词语建议。Vocab-Expander具有多种可能的用 caso,例如在科技和创新管理中提高基于概念的搜索,在组织内部或跨学科项目中增强交流和合作,以及为特定课程创建专门的词汇表。
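The abstract above describes suggesting related terms with an ensemble of word embeddings. A minimal sketch of that idea follows, under the assumption that the ensemble simply averages cosine similarity across embedding models; the abstract does not specify how Vocab-Expander combines its models. The two tiny embedding tables, their vectors, and the function names are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def suggest(seed, embedding_models, top_k=2):
    """Rank candidate terms by their cosine similarity to `seed`, averaged over all models."""
    scores = {}
    for term in embedding_models[0]:
        if term == seed:
            continue
        sims = [cosine(model[seed], model[term]) for model in embedding_models]
        scores[term] = sum(sims) / len(sims)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Two hypothetical 2-dimensional embedding tables standing in for
# a web-text model and a ConceptNet-based model.
web_text = {
    "battery": [1.0, 0.1],
    "accumulator": [0.9, 0.2],
    "anode": [0.7, 0.6],
    "banana": [0.0, 1.0],
}
concept_net = {
    "battery": [0.8, 0.3],
    "accumulator": [0.8, 0.4],
    "anode": [0.9, 0.2],
    "banana": [0.1, 0.9],
}
print(suggest("battery", [web_text, concept_net]))
```

In an interactive tool, the user would confirm or reject each suggested term, and confirmed terms could serve as additional seeds for the next round.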
Balanced Face Dataset: Guiding StyleGAN to Generate Labeled Synthetic Face Image Dataset for Underrepresented Group
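results: The study shows that the face image dataset generated with the StyleGAN model is representative and accurate and can be used for a variety of downstream tasks.
Abstract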
For a machine learning model to generalize effectively to unseen data within a particular problem domain, it is well understood that the data needs to be of sufficient size and representative of real-world scenarios. Nonetheless, real-world datasets frequently have overrepresented and underrepresented groups. One solution to mitigate bias in machine learning is to leverage a diverse and representative dataset. Training a model on a dataset that covers all demographics is crucial to reducing bias in machine learning. However, collecting and labeling large-scale datasets has been challenging, prompting the use of synthetic data generation and active labeling to decrease the costs of manual labeling. The focus of this study was to generate a robust face image dataset using the StyleGAN model. In order to achieve a balanced distribution of the dataset among different demographic groups, a synthetic dataset was created by controlling the generation process of StyleGAN and annotated for different downstream tasks.
摘要
为了让机器学习模型在未经见过数据中准确泛化,需要确保数据集具有足够的大小和 represencing 实际世界情况。然而,实际世界数据集往往存在过度和不足的分布。为了减少机器学习中的偏见,可以利用多样化和表示性的数据集。在这种情况下,通过控制StyleGAN生成过程,生成了一个多样化的面像数据集,并对不同的下游任务进行了标注。这些步骤可以帮助生成一个 Robust 的面像数据集,以减少偏见。
No Length Left Behind: Enhancing Knowledge Tracing for Modeling Sequences of Excessive or Insufficient Lengths
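methods: Proposes a model called Sequence-Flexible Knowledge Tracing (SFKT) to address the difficulty existing methods have with sequences that are excessively long or short.
results: The model better captures students' complete historical practice behaviors and avoids overfitting.
Abstract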
Knowledge tracing (KT) aims to predict students' responses to practices based on their historical question-answering behaviors. However, most current KT methods focus on improving overall AUC, leaving ample room for optimization in modeling sequences of excessive or insufficient lengths. As sequences get longer, computational costs will increase exponentially. Therefore, KT methods usually truncate sequences to an acceptable length, which makes it difficult for models on online service systems to capture complete historical practice behaviors of students with too long sequences. Conversely, modeling students with short practice sequences using most KT methods may result in overfitting due to limited observation samples. To address the above limitations, we propose a model called Sequence-Flexible Knowledge Tracing (SFKT).
摘要
知识追踪(KT)目标是预测学生对实践的回答。然而,现有的大多数KT方法都是通过提高总的准确率来优化,忽略了序列长度过长或短的问题。随着序列长度增加,计算成本会 exponential 增加。因此,KT方法通常会舍弃序列,使得在在线服务系统上的模型难以捕捉学生的完整历史实践行为。相反,使用大多数KT方法模型学生短实践序列可能会导致过拟合,因为有限的观察样本。为解决上述限制,我们提出了一种模型 calledSequence-Flexible Knowledge Tracing(SFKT)。
paper_authors: Shusaku Egami, Yasunori Yamamoto, Ikki Ohmukai, Takashi Okumura
for: The paper aims to automate the assessment of COVID-19 infection risks for individuals based on the Japanese government's formulation of infection risks.
methods: The paper uses an ontology called COVID-19 Infection Risk Ontology (CIRO), together with the Resource Description Framework (RDF) and SPARQL queries, to automate the assessment of infection risks.
results: The knowledge graph built using CIRO and RDF/SPARQL queries can infer the infection risks formulated by the Japanese government, and the reasoning experiments demonstrated the usefulness of the knowledge processing. However, some issues were identified for further deployment.
Abstract
Public health authorities perform contact tracing for highly contagious agents to identify close contacts with the infected cases. However, during the pandemic caused by coronavirus disease 2019 (COVID-19), this operation was not employed in countries with high patient volumes. Meanwhile, the Japanese government conducted this operation, thereby contributing to the control of infections, at the cost of arduous manual labor by public health officials. To ease the burden of the officials, this study attempted to automate the assessment of each person's infection risk through an ontology, called COVID-19 Infection Risk Ontology (CIRO). This ontology expresses infection risks of COVID-19 formulated by the Japanese government, toward automated assessment of infection risks of individuals, using Resource Description Framework (RDF) and SPARQL (SPARQL Protocol and RDF Query Language) queries. For evaluation, we demonstrated that the knowledge graph built could infer the risks, formulated by the government. Moreover, we conducted reasoning experiments to analyze the computational efficiency. The experiments demonstrated usefulness of the knowledge processing, and identified issues left for deployment.
摘要
公共健康当局在高度传染病毒病例中进行联系跟踪,以确定感染者的近距离接触者。然而,在2019冠状病毒疫情中,这种操作在高病人量国家没有实施。而日本政府则进行了这种操作,从而对感染的控制做出了贡献,但是需要公共卫生官员进行劳动密集的手动劳动。为了减轻官员的负担,本研究尝试自动评估每个人的感染风险,通过叫做COVID-19感染风险 ontology(CIRO)。这个 ontology 表达了由日本政府制定的感染风险形式,并使用 Resource Description Framework(RDF)和 SPARQL(SPARQL Protocol and RDF Query Language)查询来自动评估个人的感染风险。为了评估,我们展示了知识图建立的可能性,并进行了逻辑实验来分析计算效率。实验表明了知识处理的有用性,并确定了部署中的问题。
Exploring the Physical World Adversarial Robustness of Vehicle Detection
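results: The study finds that Yolo v6 holds up well under attack, with its AP dropping by only 6.59%, whereas the ASA attack causes a 14.51% AP drop, far larger than that of the other algorithms. Static scenes also yield higher recognition AP, and results are relatively stable across weather conditions.
Abstract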
Adversarial attacks can compromise the robustness of real-world detection models. However, evaluating these models under real-world conditions poses challenges due to resource-intensive experiments. Virtual simulations offer an alternative, but the absence of standardized benchmarks hampers progress. Addressing this, we propose an innovative instant-level data generation pipeline using the CARLA simulator. Through this pipeline, we establish the Discrete and Continuous Instant-level (DCI) dataset, enabling comprehensive experiments involving three detection models and three physical adversarial attacks. Our findings highlight diverse model performances under adversarial conditions. Yolo v6 demonstrates remarkable resilience, experiencing just a marginal 6.59% average drop in average precision (AP). In contrast, the ASA attack yields a substantial 14.51% average AP reduction, twice the effect of other algorithms. We also note that static scenes yield higher recognition AP values, and outcomes remain relatively consistent across varying weather conditions. Intriguingly, our study suggests that advancements in adversarial attack algorithms may be approaching their "limitation". In summary, our work underscores the significance of adversarial attacks in real-world contexts and introduces the DCI dataset as a versatile benchmark. Our findings provide valuable insights for enhancing the robustness of detection models and offer guidance for future research endeavors in the realm of adversarial attacks.
摘要
实际世界中的检测模型可能会受到敌意攻击的威胁。然而,在实际情况下进行测试具有资源占用的问题。虚拟 simulate 提供了一种 alternaative,但缺乏标准化的 benchmark 使得进步受阻。为了解决这个问题,我们提出了一种创新的实时数据生成管道,使用 CARLA 模拟器。通过这个管道,我们建立了精度和连续实时(DCI)数据集,允许对三种检测模型和三种物理敌意攻击进行全面的实验。我们的发现表明,Yolo v6 表现出色,只有 marginal 6.59% 的平均精度下降(AP)。与此同时,ASA 攻击导致了 substatial 14.51% 的平均精度下降,比其他算法多出一半。我们还发现,静止场景的识别精度较高,并且结果在不同的天气条件下呈 relativelly 一致。另外,我们的研究表明,敌意攻击算法的进步可能会达到其“限制”。总之,我们的工作强调了在实际世界中的敌意攻击的重要性,并将 DCI 数据集作为一个多样化的标准启用。我们的发现为检测模型的Robustness带来了有价值的指导,并为未来对敌意攻击算法的研究提供了新的思路。
for: The paper is written to address the issue of k-means algorithm producing clusterings that violate our expectations with respect to high/low similarity/density, and to reconcile k-means with Kleinberg’s axiomatic framework in Euclidean and non-Euclidean settings.
methods: The paper introduces two new clusterability properties, variational k-separability and residual k-separability, and proposes extensions of k-means algorithm that fit approximately the Kleinberg’s richness axiom.
results: The paper shows that, under the two clusterability properties, Kleinberg's consistency axiom holds for k-means in both Euclidean and non-Euclidean settings, and that the proposed extensions of k-means approximately fit the richness axiom. It also provides a method for constructing datasets for testing algorithms that optimize the k-means cost function, contributing to clusterability theory and to the theory of axiomatic frameworks of clustering.
Abstract
The widely applied k-means algorithm produces clusterings that violate our expectations with respect to high/low similarity/density and is in conflict with Kleinberg's axiomatic system for distance-based clustering algorithms, which formalizes those expectations in a natural way. In particular, k-means violates the consistency axiom. We hypothesise that this clash is due to the unstated expectation that the data themselves should be clusterable if the algorithm clustering them is to fit a clustering axiomatic system. To demonstrate this, we introduce two new clusterability properties, variational k-separability and residual k-separability, and show that Kleinberg's consistency axiom then holds for k-means operating in Euclidean or non-Euclidean space. Furthermore, we propose extensions of the k-means algorithm that approximately fit Kleinberg's richness axiom, which does not hold for k-means. In this way, we reconcile k-means with Kleinberg's axiomatic framework in Euclidean and non-Euclidean settings. Besides contributing to the theory of axiomatic frameworks of clustering and to clusterability theory, a practical contribution is the possibility of constructing datasets for testing algorithms that optimize the k-means cost function. This includes a method for constructing clusterable data with a global optimum known in advance.
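To make the "k-means cost function" and the idea of test data with a known optimum more concrete, here is a minimal Python sketch. It does not implement variational or residual k-separability; the generator `make_separated_data` and its `gap` and `radius` parameters are hypothetical stand-ins that simply place clusters far apart, so the planted labelling is expected to have a much lower cost than a shuffled one.

```python
import random

def kmeans_cost(points, labels, k):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    dim = len(points[0])
    cost = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing to the cost
        mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        cost += sum(sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in members)
    return cost

def make_separated_data(k=3, per_cluster=20, gap=100.0, radius=1.0, seed=0):
    """Hypothetical generator: k tight 2-D clusters whose centres are `gap` apart."""
    rng = random.Random(seed)
    points, labels = [], []
    for c in range(k):
        for _ in range(per_cluster):
            x = c * gap + rng.uniform(-radius, radius)
            y = rng.uniform(-radius, radius)
            points.append((x, y))
            labels.append(c)
    return points, labels

points, planted = make_separated_data()
shuffled = planted[:]
random.Random(1).shuffle(shuffled)  # destroy the planted structure

planted_cost = kmeans_cost(points, planted, 3)
shuffled_cost = kmeans_cost(points, shuffled, 3)
print(planted_cost, shuffled_cost)
```

A dataset like this can serve as a sanity check for a k-means implementation: any run that ends with a cost far above `planted_cost` has missed the planted clustering.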
摘要
广泛应用的k-means算法对我们的预期产生了冲突,尤其是在高/低相似性和密度方面。k-means与克莱因堡格的axiomaatic系统不符合,这是因为没有明确地预期资料本身应有聚集性的假设。为了证明这一点,我们引入了两个新的聚集性特性:variational k-separability和residual k-separability。我们显示了这两个特性使得克莱因堡格的一致性axioma成立,并且提出了基于非欧几何空间的延伸算法,可以近似地满足克莱因堡格的丰富性axioma。这样,我们可以将k-means算法与克莱因堡格的axiomaatic framework相符合,并且在欧几何和非欧几何空间中实现。此外,我们可以建立具有已知全球最佳解的聚集数据集,用于测试 clustering 算法的优化。
High-Resolution Cranial Defect Reconstruction by Iterative, Low-Resolution, Point Cloud Completion Transformers
paper_authors: Marek Wodzinski, Mateusz Daniol, Daria Hemmerling, Miroslaw Socha
for: This paper aims to develop an automatic, dedicated system for personalized cranial reconstruction, to increase the availability of cranial implants and reduce the time and cost of manual design.
methods: The proposed method reformulates the problem as a point cloud completion task and uses an iterative, transformer-based approach to reconstruct the cranial defect at any resolution, while being fast and resource-efficient during training and inference.
results: The proposed method shows superior performance compared to state-of-the-art volumetric approaches in terms of GPU memory consumption, while maintaining high quality of the reconstructed defects.
Abstract
Each year thousands of people suffer from various types of cranial injuries and require personalized implants whose manual design is expensive and time-consuming. Therefore, an automatic, dedicated system to increase the availability of personalized cranial reconstruction is highly desirable. The problem of automatic cranial defect reconstruction can be formulated as a shape completion task and solved using dedicated deep networks. Currently, the most common approach is to use the volumetric representation and apply deep networks dedicated to image segmentation. However, this approach has several limitations: it does not scale well to high-resolution volumes, nor does it take into account the data sparsity. In our work, we reformulate the problem as a point cloud completion task. We propose an iterative, transformer-based method to reconstruct the cranial defect at any resolution while also being fast and resource-efficient during training and inference. We compare the proposed method to the state-of-the-art volumetric approaches and show superior performance in terms of GPU memory consumption while maintaining high quality of the reconstructed defects.
摘要
每年千计人们受到不同类型的头部伤害,需要个性化设备 whose 手动设计是昂贵的时间消耗。因此,一个自动化、专门的系统可以大大提高个性化头部重建的可用性。头部缺陷重建问题可以 reformulated 为形状完成任务,并使用专门的深度网络解决。当前最常见的方法是使用体积表示,并应用深度网络进行图像分割。然而,这种方法有一些局限性,不能扩展到高分辨率的体积,也不考虑数据稀疏性。在我们的工作中,我们将问题重新定义为点云完成任务。我们提出了一种迭代的、基于变换器的方法,可以在任何分辨率下重建头部缺陷,并且在训练和推理过程中具有快速和资源高效的特点。我们与状态的艺术方法进行比较,并显示我们的方法在 GPU 内存占用量方面具有更高的性能,同时保持重建的缺陷质量高。
Intelligence-Endogenous Management Platform for Computing and Network Convergence
paper_authors: Zicong Hong, Xiaoyu Qiu, Jian Lin, Wuhui Chen, Yue Yu, Hui Wang, Song Guo, Wen Gao
for: This paper aims to present a concept for an intelligence-endogenous management platform for Computing and Network Convergence (CNC) called "CNC brain" based on artificial intelligence technologies.
methods: The proposed CNC brain platform uses four key building blocks: perception, scheduling, adaptation, and governance, to efficiently and automatically match supply and demand with high heterogeneity in a CNC throughout its life cycle.
results: The proposed method is evaluated on a CNC testbed that integrates two open-source frameworks (OpenFaas and Kubernetes) and a real-world business dataset provided by Microsoft Azure, and the evaluation results show that the proposed method is effective in terms of resource utilization and performance.
Abstract
Massive emerging applications are driving demand for the ubiquitous deployment of computing power today. This trend not only spurs the recent popularity of the Computing and Network Convergence (CNC), but also introduces an urgent need for the intelligentization of a management platform to coordinate changing resources and tasks in the CNC. Therefore, in this article, we present the concept of an intelligence-endogenous management platform for CNCs called CNC brain based on artificial intelligence technologies. It aims at efficiently and automatically matching the supply and demand with high heterogeneity in a CNC via four key building blocks, i.e., perception, scheduling, adaptation, and governance, throughout the CNC's life cycle. Their functionalities, goals, and challenges are presented. To examine the effectiveness of the proposed concept and framework, we also implement a prototype for the CNC brain based on a deep reinforcement learning technology. Also, it is evaluated on a CNC testbed that integrates two open-source and popular frameworks (OpenFaas and Kubernetes) and a real-world business dataset provided by Microsoft Azure. The evaluation results prove the proposed method's effectiveness in terms of resource utilization and performance. Finally, we highlight the future research directions of the CNC brain.
摘要
巨大的应用需求今天启动了计算力的无限扩展。这种趋势不仅推动了最近的计算和网络融合(CNC)的流行,还提出了智能化管理平台的急需,以协调CNC中的变化资源和任务。因此,在本文中,我们提出了基于人工智能技术的CNC脑(CNC brain)智能化管理平台的概念,旨在高效自动匹配CNC中的供应和需求,并在CNC生命周期中实现四个关键组件的功能:感知、调度、适应和治理。我们还实现了基于深度强化学习技术的CNC脑原型,并在一个包含OpenFaas和Kubernetes两个开源框架以及Microsoft Azure提供的实际业务数据的CNC测试环境中进行了评估。评估结果表明,提出的方法能够提高资源利用率和性能。最后,我们还概述了CNC脑的未来研究方向。
Biomedical Knowledge Graph Embeddings with Negative Statements
results: We evaluated on ontology-rich biomedical knowledge graphs in two different predictive tasks and obtained good results in comparison with established benchmarks.
Abstract
A knowledge graph is a powerful representation of real-world entities and their relations. The vast majority of these relations are defined as positive statements, but the importance of negative statements is increasingly recognized, especially under an Open World Assumption. Explicitly considering negative statements has been shown to improve performance on tasks such as entity summarization and question answering or domain-specific tasks such as protein function prediction. However, no attention has been given to the exploration of negative statements by knowledge graph embedding approaches despite the potential of negative statements to produce more accurate representations of entities in a knowledge graph. We propose a novel approach, TrueWalks, to incorporate negative statements into the knowledge graph representation learning process. In particular, we present a novel walk-generation method that is able to not only differentiate between positive and negative statements but also take into account the semantic implications of negation in ontology-rich knowledge graphs. This is of particular importance for applications in the biomedical domain, where the inadequacy of embedding approaches regarding negative statements at the ontology level has been identified as a crucial limitation. We evaluate TrueWalks in ontology-rich biomedical knowledge graphs in two different predictive tasks based on KG embeddings: protein-protein interaction prediction and gene-disease association prediction. We conduct an extensive analysis over established benchmarks and demonstrate that our method is able to improve the performance of knowledge graph embeddings on all tasks.
摘要
一个知识图是一种强大的实体和关系表示方式。大多数关系被定义为正式声明,但对开放世界假设下,正式声明的重要性得到了更多的认可,特别是在实体概括和问答任务中。明确考虑正式声明可以提高实体概括和问答任务的性能。然而,知识图嵌入方法对负声明的探索没有得到过关注,尽管负声明可以生成更准确的实体表示。我们提出了一种新的方法,真实步行(TrueWalks),用于在知识图表示学习过程中包含负声明。特别是,我们提出了一种新的步行生成方法,可以不仅区分正式声明和负声明,还能够考虑ontology层次上的否定 semantically。这对生物医学领域的应用非常重要,因为嵌入方法在ontology层次上对负声明的不足已被证明为是关键的限制。我们在生物医学知识图中进行了两种不同的预测任务基于KG嵌入:蛋白质-蛋白质交互预测和基因-疾病相关性预测。我们对已有的标准底层进行了广泛的分析,并证明了我们的方法可以在所有任务中提高知识图嵌入的性能。
Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces
paper_authors: Tatsuhiro Shimizu, Laura Forastiere
for: The paper focuses on Off-Policy Evaluation (OPE) in contextual bandit settings with large action spaces, and aims to develop a more accurate and efficient estimator.
methods: The paper proposes a new estimator called the Marginalized Doubly Robust (MDR) estimator, which combines the strengths of Marginalized Inverse Propensity Scoring (MIPS) and doubly robust estimation. The MDR estimator uses embeddings of actions to mitigate the estimator's variance and improve accuracy.
results: The paper shows that the proposed MDR estimator is unbiased under weaker assumptions than MIPS and maintains variance reduction against IPS, which was the main advantage of MIPS. The empirical experiment verifies the supremacy of MDR against existing estimators.
Abstract
We study Off-Policy Evaluation (OPE) in contextual bandit settings with large action spaces. The benchmark estimators suffer from severe bias and variance tradeoffs. Parametric approaches suffer from bias due to difficulty specifying the correct model, whereas ones with importance weight suffer from variance. To overcome these limitations, Marginalized Inverse Propensity Scoring (MIPS) was proposed to mitigate the estimator's variance via embeddings of an action. To make the estimator more accurate, we propose the doubly robust estimator of MIPS called the Marginalized Doubly Robust (MDR) estimator. Theoretical analysis shows that the proposed estimator is unbiased under weaker assumptions than MIPS while maintaining variance reduction against IPS, which was the main advantage of MIPS. The empirical experiment verifies the supremacy of MDR against existing estimators.
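As background for the estimators named above, the following Python sketch contrasts plain inverse propensity scoring (IPS) with a standard doubly robust (DR) estimator. It is not the MDR estimator from the paper, which marginalizes over action embeddings. The toy setup (no context, three actions, deterministic rewards, a uniform logging policy) and all variable names are assumptions chosen only to keep the example short.

```python
import random

def ips_estimate(logs, target):
    """IPS: reweight each logged reward by target(a) / logging(a)."""
    return sum(target[a] / p0 * r for a, r, p0 in logs) / len(logs)

def dr_estimate(logs, target, q_hat):
    """Doubly robust: model-based baseline plus an importance-weighted residual."""
    baseline = sum(target[a] * q_hat[a] for a in range(len(target)))
    total = 0.0
    for a, r, p0 in logs:
        total += baseline + target[a] / p0 * (r - q_hat[a])
    return total / len(logs)

# Assumed toy problem: context-free bandit with deterministic rewards.
rng = random.Random(0)
true_reward = [0.1, 0.5, 0.9]
logging_policy = [1 / 3, 1 / 3, 1 / 3]
target_policy = [0.1, 0.2, 0.7]

logs = []
for _ in range(20000):
    a = rng.choices(range(3), weights=logging_policy)[0]
    logs.append((a, true_reward[a], logging_policy[a]))

true_value = sum(p * r for p, r in zip(target_policy, true_reward))
ips = ips_estimate(logs, target_policy)
dr = dr_estimate(logs, target_policy, q_hat=true_reward)
print(true_value, ips, dr)
```

With a perfect reward model the residual term vanishes and DR recovers the true value without sampling noise, while IPS still fluctuates around it. That is the kind of variance reduction a doubly robust correction is meant to bring.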
摘要
我们研究偏离策略评估(OPE)在具有大型动作空间的上下文ual bandit中。参考估计器受到严重的偏误和方差交易限制。parametric方法受到模型难以确定的偏误,而具有重要性权重的方法则受到方差的限制。为了超越这些局限性,我们提出了折衔embeddings的Marginalized Inverse Propensity Scoring(MIPS),以减少估计器的方差。为了使估计器更加准确,我们提出了MIPS的双重Robust(MDR)估计器。理论分析表明,我们的估计器具有较弱的假设下的无偏性,而且与IPS相比,具有更好的方差减少效果。实验证明了MDR的超越性。
RCMHA: Relative Convolutional Multi-Head Attention for Natural Language Modelling
results: Experiments show that, compared with other attention modules (MHA, MDHA, RMHA), the proposed RCMHA achieves higher accuracy (0.572) with relatively low memory usage (2.98 GB).
Abstract
The Attention module finds common usage in language modeling, presenting distinct challenges within the broader scope of Natural Language Processing. Multi-Head Attention (MHA) employs an absolute positional encoding, which imposes limitations on token length and entails substantial memory consumption during the processing of embedded inputs. The current remedy proposed by researchers involves the utilization of relative positional encoding, similar to the approach adopted in Transformer-XL or Relative Multi-Head Attention (RMHA), albeit the employed architecture consumes considerable memory resources. To address these challenges, this study endeavors to refine MHA, leveraging relative positional encoding in conjunction with the Depth-Wise Convolutional Layer architecture, which promises heightened accuracy coupled with minimized memory usage. The proposed RCMHA framework entails the modification of two integral components: firstly, the application of the Depth-Wise Convolutional Layer to the input embedding, encompassing Query, Key, and Value parameters; secondly, the incorporation of Relative Positional Encoding into the attention scoring phase, harmoniously integrated with Scaled Dot-Product Attention. Empirical experiments underscore the advantages of RCMHA, wherein it exhibits superior accuracy, boasting a score of 0.572 in comparison to alternative attention modules such as MHA, Multi-DConv-Head Attention (MDHA), and RMHA. Concerning memory utilization, RCMHA emerges as the most frugal, with an average consumption of 2.98 GB, compared with 3.5 GB for RMHA.
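The first of the two modified components, a depth-wise convolution over the embedded sequence, can be illustrated with a small pure-Python sketch. The zero ("same") padding, the kernel length, and the function name `depthwise_conv1d` are assumptions for illustration; the abstract does not specify these details, and the relative positional encoding step is not shown.

```python
def depthwise_conv1d(x, kernels):
    """Depth-wise 1-D convolution with zero ("same") padding.

    x:       list of T time steps, each a list of C channel values
    kernels: list of C kernels (one per channel), each of odd length K
    Each channel is filtered only by its own kernel; channels never mix.
    """
    steps, channels = len(x), len(x[0])
    k = len(kernels[0])
    half = k // 2
    out = [[0.0] * channels for _ in range(steps)]
    for t in range(steps):
        for c in range(channels):
            acc = 0.0
            for j in range(k):
                s = t + j - half  # source position covered by kernel tap j
                if 0 <= s < steps:
                    acc += kernels[c][j] * x[s][c]
            out[t][c] = acc
    return out

# Channel 0 uses an identity kernel; channel 1 sums each position with its neighbours.
sequence = [[1, 10], [2, 20], [3, 30]]
filtered = depthwise_conv1d(sequence, kernels=[[0, 1, 0], [1, 1, 1]])
print(filtered)
```

Because each channel has its own small kernel, the parameter count grows with C times K rather than C squared times K, which is one reason depth-wise layers are often considered memory-friendly.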
摘要
研究人员通常使用注意模块在自然语言处理中,但它们具有一些特殊的挑战。多头注意(MHA)使用绝对位置编码,这限制了单个符号的长度和需要大量内存进行融合输入的处理。为了解决这些挑战,研究人员提出了使用相对位置编码的方法,类似于Transformer-XL或相对多头注意(RMHA)的方法,但是使用的架构占用了大量内存资源。为了解决这些问题,本研究尝试更新MHA,使用相对位置编码和深度wise卷积层架构,以提高准确率并降低内存使用量。提案的RCMHA框架包括对输入嵌入进行深度wise卷积层应用,以及在注意得分阶段进行相对位置编码的整合,和尺度积分注意。实验证明RCMHA具有更高的准确率,其中在比较注意模块时,RCMHA的得分为0.572,比MHA、Multi-DConv-Head Attention(MDHA)和RMHA更高。此外,RCMHA的内存使用量较低,只需2.98 GB,超过RMHA的3.5 GB。
TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents
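results: We instantiate the framework with different LLMs and evaluate their Task Planning and Tool Usage (TPTU) abilities on typical tasks. We find that LLMs show substantial potential for carrying out complex tasks, but challenges remain, along with areas that need further research.
Abstract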
With recent advancements in natural language processing, Large Language Models (LLMs) have emerged as powerful tools for various real-world applications. Despite their prowess, the intrinsic generative abilities of LLMs may prove insufficient for handling complex tasks which necessitate a combination of task planning and the usage of external tools. In this paper, we first propose a structured framework tailored for LLM-based AI Agents and discuss the crucial capabilities necessary for tackling intricate problems. Within this framework, we design two distinct types of agents (i.e., one-step agent and sequential agent) to execute the inference process. Subsequently, we instantiate the framework using various LLMs and evaluate their Task Planning and Tool Usage (TPTU) abilities on typical tasks. By highlighting key findings and challenges, our goal is to provide a helpful resource for researchers and practitioners to leverage the power of LLMs in their AI applications. Our study emphasizes the substantial potential of these models, while also identifying areas that need more investigation and improvement.
Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism
results: Experimental results show that the proposed method substantially improves ASR error correction and achieves excellent results on established datasets.
Abstract
Chinese Automatic Speech Recognition (ASR) error correction presents significant challenges due to the Chinese language's unique features, including a large character set and borderless, morpheme-based structure. Current mainstream models often struggle with effectively utilizing word-level features and phonetic information. This paper introduces a novel approach that incorporates a dynamic error scaling mechanism to detect and correct phonetically erroneous text generated by ASR output. This mechanism operates by dynamically fusing word-level features and phonetic information, thereby enriching the model with additional semantic data. Furthermore, our method implements unique error reduction and amplification strategies to address the issues of matching wrong words caused by incorrect characters. Experimental results indicate substantial improvements in ASR error correction, demonstrating the effectiveness of our proposed method and yielding promising results on established datasets.
Prompt Guided Copy Mechanism for Conversational Question Answering
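results: Experiments show that the method effectively improves the naturalness and appropriateness of answers to conversational questions and achieves good results in the CoQA challenge.
Abstract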
Conversational Question Answering (CQA) is a challenging task that aims to generate natural answers for conversational flow questions. In this paper, we propose a pluggable approach for extractive methods that introduces a novel prompt-guided copy mechanism to improve the fluency and appropriateness of the extracted answers. Our approach uses prompts to link questions to answers and employs attention to guide the copy mechanism to verify the naturalness of extracted answers, making necessary edits to ensure that the answers are fluent and appropriate. The three prompts, including a question-rationale relationship prompt, a question description prompt, and a conversation history prompt, enhance the copy mechanism's performance. Our experiments demonstrate that this approach effectively promotes the generation of natural answers and achieves good results in the CoQA challenge.
摘要
问答对话(CQA)是一项具有挑战性的任务,旨在生成自然的对话流程中的答案。在这篇论文中,我们提出了一种可替换的方法,即使用启示机制来提高抽取答案的流畅性和适用性。我们的方法使用启示来联结问题和答案,并通过注意力引导机制来验证抽取答案的自然性,进行必要的修改,以确保答案的流畅性和适用性。我们的实验表明,这种方法可以有效地促进自然的答案生成,并在CoQA挑战中取得了良好的结果。
RecycleGPT: An Autoregressive Language Model with Recyclable Module
for: To improve the response speed of language models and reduce execution time.
methods: Based on the strong correlations between adjacent tokens, recycling pre-generated model states without running the whole model multiple times.
results: Experiments and analysis show that the approach can significantly reduce inference latency, achieving up to 1.4x speedup while maintaining high performance.
Abstract
Existing large language models have to run K times to generate a sequence of K tokens. In this paper, we present RecycleGPT, a generative language model with fast decoding speed by recycling pre-generated model states without running the whole model in multiple steps. Our approach relies on the observation that adjacent tokens in a sequence usually have strong correlations and the next token in a sequence can be reasonably guessed or inferred based on the preceding ones. Experiments and analysis demonstrate the effectiveness of our approach in lowering inference latency, achieving up to 1.4x speedup while preserving high performance.
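The decoding idea can be sketched with toy stand-ins. In the Python sketch below, `full_model` and `recycle_module` are hypothetical placeholders (simple counters, not neural networks), and the strict alternation between them is an assumed schedule. The point is only to show how reusing a cached state reduces the number of full forward passes needed for K tokens.

```python
def full_model(tokens):
    """Stand-in for an expensive forward pass: returns (next_token, state)."""
    nxt = (tokens[-1] + 1) % 10
    return nxt, {"last": nxt}

def recycle_module(state):
    """Stand-in for a cheap predictor that reuses the cached state."""
    return (state["last"] + 1) % 10

def generate(prompt, k, use_recycle):
    """Generate k tokens; optionally alternate full passes with recycled guesses."""
    tokens, full_calls = list(prompt), 0
    target_len = len(prompt) + k
    while len(tokens) < target_len:
        nxt, state = full_model(tokens)
        full_calls += 1
        tokens.append(nxt)
        if use_recycle and len(tokens) < target_len:
            tokens.append(recycle_module(state))  # no full pass for this token
    return tokens, full_calls

baseline_tokens, baseline_calls = generate([0], 6, use_recycle=False)
recycled_tokens, recycled_calls = generate([0], 6, use_recycle=True)
print(baseline_calls, recycled_calls)
```

In this toy the full-pass count halves, but a real recyclable module has its own cost and its guesses are not always as good as the full model's, which is consistent with the more modest "up to 1.4x" speedup reported in the abstract.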
摘要
现有大型语言模型需要运行 K 次以生成一个序列中的 K 个符号。在本文中,我们介绍了 RecycleGPT,一种生成语言模型,具有快速解码速度,通过 reuse 预生成模型状态而不需要在多个步骤中运行整个模型。我们的方法基于 adjacent 符号在序列中强相关性的观察,下一个符号可以基于前一个符号预测或推理。实验和分析表明,我们的方法可以降低推理延迟,达到最高性能的 1.4 倍速度。
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
paper_authors: Christian Huber, Tu Anh Dinh, Carlos Mullov, Ngoc Quan Pham, Thai Binh Nguyen, Fabian Retkowski, Stefan Constantin, Enes Yavuz Ugan, Danni Liu, Zhaolin Li, Sai Koneru, Jan Niehues, Alexander Waibel
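results: Using the framework, the study compares different approaches to low-latency speech translation, including models that can revise their output versus methods with fixed output, and state-of-the-art cascaded versus end-to-end systems. The framework also automatically evaluates translation quality and latency, and provides a web interface that shows the low-latency model outputs to the user.
Abstract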
The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated, and often it is not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion. This includes the segmentation of the audio as well as the run-time of the different components. Second, we compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded as well as end-to-end systems. Finally, the framework allows automatic evaluation of translation quality and latency, and also provides a web interface to show the low-latency model outputs to the user.
摘要
研究群体对低延迟语音翻译的挑战在最近几年内已引起了广泛的关注,如图所示。因此,评估这些不同的方法在实际场景下是非常重要的。然而,当前只有特定的方面的系统被评估,而且不能比较不同的方法。在这项工作中,我们提出了第一个能够在实际场景下进行和评估多个方面的低延迟语音翻译的框架。这包括音频分割以及不同组件的运行时间。其次,我们使用这个框架对低延迟语音翻译不同方法进行比较。我们评估可以修改输出的模型以及固定输出的方法。此外,我们直接对现有的核心笔记和端到端系统进行比较。最后,该框架可以自动评估翻译质量以及延迟时间,并提供了一个网页界面,以便用户查看低延迟模型的输出。
Counterfactual Monotonic Knowledge Tracing for Assessing Students’ Dynamic Mastery of Knowledge Concepts
for: Assessing students' dynamic mastery of knowledge concepts is the core of the Knowledge Tracing (KT) task and is needed for both offline teaching and online educational applications.
methods: Existing KT methods rely on an implicit paradigm that runs from historical practice, to mastery of knowledge concepts, to students' responses to practices, in order to address the challenge of unlabeled concept mastery.
results: We propose a principled approach called Counterfactual Monotonic Knowledge Tracing (CMKT), which builds on the implicit paradigm described above by using a counterfactual assumption to constrain the evolution of students' mastery of knowledge concepts.
Abstract
As the core of the Knowledge Tracing (KT) task, assessing students' dynamic mastery of knowledge concepts is crucial for both offline teaching and online educational applications. Since students' mastery of knowledge concepts is often unlabeled, existing KT methods rely on an implicit paradigm that runs from historical practice, to mastery of knowledge concepts, to students' responses to practices, in order to address the challenge of unlabeled concept mastery. However, purely predicting student responses without imposing specific constraints on hidden concept mastery values does not guarantee the accuracy of these intermediate values as concept mastery values. To address this issue, we propose a principled approach called Counterfactual Monotonic Knowledge Tracing (CMKT), which builds on the implicit paradigm described above by using a counterfactual assumption to constrain the evolution of students' mastery of knowledge concepts.
摘要
为核心的知识跟踪(KT)任务,评估学生的动态知识概念熟练性非常重要,是线上教育应用以及线下教育中的一个关键任务。由于学生的知识概念熟练性通常无法被直接标注,现有的KT方法通常采用历史实践的隐式模式来评估学生对知识概念的熟练性。然而,仅仅预测学生的回答不能保证这些中间概念熟练性值的准确性。为解决这个问题,我们提出了一种原则性的方法calledCounterfactual Monotonic Knowledge Tracing(CMKT),该方法基于上述隐式模式,并使用一种对假假设来约束学生的知识概念熟练性的演化。
Robust Ordinal Regression for Subsets Comparisons with Interactions
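results: We represent the possible values of the model parameters with an uncertainty set and define a robust ordinal dominance relation to determine preferences between subsets. Numerical tests on synthetic and real-world data evaluate the richness and reliability of our preference predictions.
Abstract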
This paper is dedicated to a robust ordinal method for learning the preferences of a decision maker between subsets. The decision model, derived from Fishburn and LaValle (1996) and whose parameters we learn, is general enough to be compatible with any strict weak order on subsets, thanks to the consideration of possible interactions between elements. Moreover, we accept not to predict some preferences if the available preference data are not compatible with a reliable prediction. A predicted preference is considered reliable if all the simplest models (Occam's razor) explaining the preference data agree on it. Following the robust ordinal regression methodology, our predictions are based on an uncertainty set encompassing the possible values of the model parameters. We define a robust ordinal dominance relation between subsets and we design a procedure to determine whether this dominance relation holds. Numerical tests are provided on synthetic and real-world data to evaluate the richness and reliability of the preference predictions made.
摘要
这篇论文探讨了一种可靠的排序方法,用于学习决策者对subset的偏好。我们基于鱼本和拉瓦列(1996)提出的决策模型,并学习其参数,该模型可以与任何严格强制排序集合兼容。此外,我们接受不预测一些偏好,如果可用偏好数据不具有可靠预测。一个预测的偏好被视为可靠,如果所有最简模型(奥卡姆的剑)解释偏好数据都同意它。按照稳健排序回归方法,我们的预测基于模型参数的不确定集。我们定义了稳健排序准则,并设计了一种确定该准则是否成立的过程。在synthetic和实际数据上进行了数据测试,以评估我们的偏好预测的 ricacity 和可靠性。
A reading survey on adversarial machine learning: Adversarial attacks and their understanding
for: This paper provides a survey of existing adversarial attacks and their understanding based on different perspectives, with the goal of classifying adversarial attacks and understanding their vulnerabilities in a systematic order.
methods: The paper uses a comprehensive review of existing literature on adversarial attacks and defenses to provide a detailed understanding of the different types of attacks and their characteristics.
results: The paper concludes with a discussion on the future research directions in the field of adversarial machine learning, highlighting the limitations of existing defenses and the need for further research to mitigate the effects of adversarial attacks.
Abstract
Deep Learning has empowered us to train neural networks for complex data with high performance. However, with the growing research, several vulnerabilities in neural networks have been exposed. A particular branch of research, Adversarial Machine Learning, exploits and understands some of the vulnerabilities that cause the neural networks to misclassify for near original input. A class of algorithms called adversarial attacks is proposed to make the neural networks misclassify for various tasks in different domains. With the extensive and growing research in adversarial attacks, it is crucial to understand the classification of adversarial attacks. This will help us understand the vulnerabilities in a systematic order and help us to mitigate the effects of adversarial attacks. This article provides a survey of existing adversarial attacks and their understanding based on different perspectives. We also provide a brief overview of existing adversarial defences and their limitations in mitigating the effect of adversarial attacks. Further, we conclude with a discussion on the future research directions in the field of adversarial machine learning.
摘要
深度学习已经赋予我们训练复杂数据的神经网络高性能。然而,随着研究的发展,一些神经网络的漏洞被曝光。一种特定的研究分支,敌意机器学习,利用和探索了一些导致神经网络对近似输入进行误分类的漏洞。一类称为敌意攻击的算法被提出,以使神经网络在不同领域中对不同任务进行误分类。随着敌意攻击的扩大和增长的研究,了解敌意攻击的分类变得非常重要。这将帮助我们系统地了解漏洞,并帮助我们 Mitigate the effects of adversarial attacks。这篇文章提供了现有的敌意攻击和它们的理解,以及不同角度的概述。此外,我们还提供了现有的防御措施的简要概述和其限制在减轻敌意攻击的效果。最后,我们 conclude with 对敌意机器学习未来研究方向的讨论。
Discrete Message via Online Clustering Labels in Decentralized POMDP
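results: The paper proposes a simple design of message generation functions and integrates it with reinforcement learning using a Regularized Information Maximization loss function. It significantly outperforms state-of-the-art multi-agent communication baselines and achieves nearly optimal returns with few-bit messages.
Abstract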
Communication is crucial for solving cooperative Multi-Agent Reinforcement Learning tasks in Partially-Observable Markov Decision Processes. Existing works often rely on black-box methods to encode local information/features into messages shared with other agents. However, such black-box approaches are unable to provide any quantitative guarantees on the expected return and often lead to the generation of continuous messages with high communication overhead and poor interpretability. In this paper, we establish an upper bound on the return gap between an ideal policy with full observability and an optimal partially-observable policy with discrete communication. This result enables us to recast multi-agent communication into a novel online clustering problem over the local observations at each agent, with messages as cluster labels and the upper bound on the return gap as clustering loss. By minimizing the upper bound, we propose a surprisingly simple design of message generation functions in multi-agent communication and integrate it with reinforcement learning using a Regularized Information Maximization loss function. Evaluations show that the proposed discrete communication significantly outperforms state-of-the-art multi-agent communication baselines and can achieve nearly-optimal returns with few-bit messages that are naturally interpretable.
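The "messages as cluster labels" framing can be illustrated with a simple online clustering routine. The Python sketch below uses a running-mean (online k-means style) centroid update as a stand-in; it is not the paper's method, which trains message generation with a Regularized Information Maximization loss alongside reinforcement learning. The function names and the two-centroid setup are assumptions.

```python
def nearest(centroids, obs):
    """Index of the centroid closest to the observation (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((o - c) ** 2 for o, c in zip(obs, centroids[i])),
    )

def message_and_update(centroids, counts, obs):
    """Emit the cluster label as the discrete message, then update that centroid."""
    label = nearest(centroids, obs)
    counts[label] += 1
    step = 1.0 / counts[label]  # running-mean step size
    centroids[label] = [c + step * (o - c) for o, c in zip(obs, centroids[label])]
    return label

# Two centroids means each message fits in a single bit.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
messages = [message_and_update(centroids, counts, obs) for obs in ([1.0, 1.0], [9.0, 11.0])]
print(messages, centroids)
```

With n centroids a message needs about log2(n) bits, and each label can be read as "my local observation looks like cluster i", which is the sense in which such messages are few-bit and interpretable.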
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs
paper_authors: Shengzhi Li, Nima Tajbakhsh
for: This paper presents SciGraphQA, a new synthetic multi-turn question-answer dataset related to academic graphs.
methods: The dataset is built by using Palm-2 to generate open-vocabulary multi-turn question-answering dialogues about the graphs, with an average of 2.23 question-answer turns for each graph. The paper's context, including the paper title, abstract, paragraph mentioning the graph, and rich text contextual data from the graph, is provided to Palm-2 as input, and GPT-4 is asked to assess the matching quality of the question-answer turns given that context.
results: The average rating of the question-answer turns given the paper's context is 8.7/10 on a test set. The most popular MLLM models, such as LLaVA, mPLUGowl, BLIP-2, and OpenFlamingo, are evaluated on the dataset, with LLaVA-13B being the most performant with a CIDEr score of 0.08. The question prompts for LLaVA are further enriched by including serialized data tables extracted from the graphs using the DePlot model, boosting LLaVA's 0-shot CIDEr to 0.15. Fine-tuning LLaVA on the dataset results in a substantially higher CIDEr score of 0.26.
Abstract
In this work, we present SciGraphQA, a synthetic multi-turn question-answer dataset related to academic graphs. SciGraphQA is 13 times larger than ChartVQA, the previously largest chart-visual question-answering dataset. It is also the largest open-sourced chart VQA dataset with non-synthetic charts. To build our dataset, we selected 290,000 Computer Science or Machine Learning ArXiv papers published between 2010 and 2020, and then used Palm-2 to generate 295K samples of open-vocabulary multi-turn question-answering dialogues about the graphs. As context, we provided the text-only Palm-2 with the paper title, abstract, paragraph mentioning the graph, and rich text contextual data from the graph itself, obtaining dialogues with an average of 2.23 question-answer turns for each graph. We asked GPT-4 to assess the matching quality of our question-answer turns given the paper's context, obtaining an average rating of 8.7/10 on our 3K test set. We evaluated the 0-shot capability of the most popular MLLM models, such as LLaVA, mPLUGowl, BLIP-2, and OpenFlamingo, on our dataset, finding LLaVA-13B to be the most performant with a CIDEr score of 0.08. We further enriched the question prompts for LLaVA by including the serialized data tables extracted from the graphs using the DePlot model, boosting LLaVA's 0-shot CIDEr to 0.15. To verify the validity of our dataset, we also fine-tuned LLaVA using our dataset, reaching a substantially higher CIDEr score of 0.26. We anticipate further accuracy improvement by including segmentation mask tokens and leveraging larger LLM backbones coupled with emergent prompting techniques. Our code and data are open-sourced.
Solving Falkner-Skan type equations via Legendre and Chebyshev Neural Blocks
results: Simulations of various configurations of the Falkner-Skan equation show that the proposed method reduces computational complexity and improves efficiency.
Abstract
In this paper, a new deep-learning architecture for solving the non-linear Falkner-Skan equation is proposed. Using Legendre and Chebyshev neural blocks, this approach shows how orthogonal polynomials can be used in neural networks to increase the approximation capability of artificial neural networks. In addition, utilizing the mathematical properties of these functions, we overcome the computational complexity of the backpropagation algorithm by using the operational matrices of the derivative. The efficiency of the proposed method is demonstrated by simulating various configurations of the Falkner-Skan equation.
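To make the idea of an operational matrix of the derivative concrete, here is a minimal sketch for Chebyshev polynomials of the first kind. It relies on the standard identity T_n'(x) = n * sum over k < n with n - k odd of (2 / c_k) T_k(x), where c_0 = 2 and c_k = 1 otherwise, so differentiation becomes a fixed matrix acting on the coefficient vector. The function names are hypothetical, the Legendre case is analogous but not shown, and how the paper wires such matrices into its neural blocks is not described in the abstract.

```python
def chebyshev_derivative_matrix(degree):
    """Return D with d/dx T_n(x) = sum_k D[n][k] * T_k(x), 0 <= n, k <= degree."""
    size = degree + 1
    matrix = [[0.0] * size for _ in range(size)]
    for n in range(size):
        for k in range(n):
            if (n - k) % 2 == 1:
                # The T_0 term carries half the weight of the other terms.
                matrix[n][k] = float(n) if k == 0 else 2.0 * n
    return matrix


def differentiate_coefficients(coefficients):
    """Chebyshev coefficients of f' given those of f, computed as D^T a."""
    size = len(coefficients)
    matrix = chebyshev_derivative_matrix(size - 1)
    return [
        sum(coefficients[n] * matrix[n][k] for n in range(size))
        for k in range(size)
    ]


def chebyshev_value(coefficients, x):
    """Evaluate sum_n a_n T_n(x) with the three-term recurrence."""
    previous, current = 1.0, x
    total = coefficients[0]
    for a in coefficients[1:]:
        total += a * current
        previous, current = current, 2.0 * x * current - previous
    return total
```

For example, T_3(x) = 4x^3 - 3x has derivative 12x^2 - 3 = 3 T_0(x) + 6 T_2(x), which is what `differentiate_coefficients([0.0, 0.0, 0.0, 1.0])` yields. Because the matrix is constant, derivatives of the network output with respect to x need no automatic differentiation through the polynomial basis.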
Heterogeneous Knowledge Fusion: A Novel Approach for Personalized Recommendation via LLM
results: Experiments show that the method effectively integrates users' heterogeneous behavior information and improves recommendation performance.
Abstract
The analysis and mining of user heterogeneous behavior are of paramount importance in recommendation systems. However, the conventional approach of incorporating various types of heterogeneous behavior into recommendation models leads to feature sparsity and knowledge fragmentation issues. To address this challenge, we propose a novel approach for personalized recommendation via Large Language Model (LLM), by extracting and fusing heterogeneous knowledge from user heterogeneous behavior information. In addition, by combining heterogeneous knowledge and recommendation tasks, instruction tuning is performed on LLM for personalized recommendations. The experimental results demonstrate that our method can effectively integrate user heterogeneous behavior and significantly improve recommendation performance.
Improving Deep Attractor Network by BGRU and GMM for Speech Separation
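methods: The model uses a Bidirectional Gated neural network (BGRU) instead of BLSTM and applies a Gaussian Mixture Model (GMM) as the clustering algorithm to reduce model complexity.
results: Evaluated on two-speaker mixture datasets from the TIMIT corpus, the proposed model achieves an SDR of 12.3 dB and a PESQ score of 2.94, both better than the original DANet model, and improves the number of parameters and the training time by 20.7% and 17.9%, respectively. The model also gives better results on mixed Arabic speech signals.
Abstract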
Deep Attractor Network (DANet) is the state-of-the-art technique in the speech separation field. It uses Bidirectional Long Short-Term Memory (BLSTM), but the complexity of the DANet model is very high. In this paper, a simplified and powerful DANet model is proposed using a Bidirectional Gated neural network (BGRU) instead of BLSTM. The Gaussian Mixture Model (GMM), rather than k-means, was applied in DANet as the clustering algorithm to reduce the complexity and increase the learning speed and accuracy. The metrics used in this paper are Signal to Distortion Ratio (SDR), Signal to Interference Ratio (SIR), Signal to Artifact Ratio (SAR), and Perceptual Evaluation of Speech Quality (PESQ) score. Two-speaker mixture datasets from the TIMIT corpus were prepared to evaluate the proposed model, and the system achieved 12.3 dB and 2.94 for SDR and PESQ scores respectively, which were better than the original DANet model. Other improvements were 20.7% and 17.9% in the number of parameters and training time, respectively. The model was also applied to mixed Arabic speech signals, and the results were better than those for English.
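For readers unfamiliar with the SDR figure quoted above, the following sketch computes a simplified signal-to-distortion ratio in decibels as the energy of the reference signal over the energy of the residual. This is an assumption made for illustration: standard evaluation toolkits decompose the error into interference and artifact terms (which is where SIR and SAR come from), and the abstract does not say which implementation the authors used. The function name `simple_sdr_db` is hypothetical.

```python
import math


def simple_sdr_db(reference, estimate):
    """Simplified SDR in dB: reference energy over residual energy."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)
```

Under this definition, an estimate whose residual carries 1% of the reference energy scores 20 dB, so higher values mean a cleaner separation.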
Expediting Neural Network Verification via Network Reduction
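results: Experiments on a large set of benchmarks show that the proposed technique reduces neural networks and speeds up existing verification tools. The results also show that network reduction improves the availability of existing verification tools on many networks.
Abstract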
A wide range of verification methods have been proposed to verify the safety properties of deep neural networks, ensuring that the networks function correctly in critical applications. However, many well-known verification tools still struggle with complicated network architectures and large network sizes. In this work, we propose a network reduction technique as a pre-processing method prior to verification. The proposed method reduces neural networks by eliminating stable ReLU neurons and transforming the networks into sequential neural networks consisting of ReLU and Affine layers, which can be handled by most verification tools. We instantiate the reduction technique on the state-of-the-art complete and incomplete verification tools, including alpha-beta-crown, VeriNet and PRIMA. Our experiments on a large set of benchmarks indicate that the proposed technique can significantly reduce neural networks and speed up existing verification tools. Furthermore, the experimental results also show that network reduction can improve the availability of existing verification tools on many networks by reducing them into sequential neural networks.
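A ReLU neuron is stable when its pre-activation has the same sign for every input in the region being verified: it then behaves either as the identity or as the constant zero, so it can be folded into the neighbouring affine layers. The sketch below shows one way to detect such neurons with simple interval arithmetic. Interval bounds are an assumption made for illustration (the tools named above use tighter bounding methods), the function names are hypothetical, and the merging step itself is not shown.

```python
def preactivation_bounds(weights, biases, lower, upper):
    """Interval bounds of W x + b for x in the box [lower, upper]."""
    bounds = []
    for row, bias in zip(weights, biases):
        # A positive weight attains its minimum at the lower input bound,
        # a negative weight at the upper input bound (and vice versa).
        low = bias + sum(w * (l if w >= 0 else u) for w, l, u in zip(row, lower, upper))
        high = bias + sum(w * (u if w >= 0 else l) for w, l, u in zip(row, lower, upper))
        bounds.append((low, high))
    return bounds


def classify_relu_neurons(bounds):
    """Label each neuron as 'active', 'inactive', or 'unstable'."""
    labels = []
    for low, high in bounds:
        if low >= 0:
            labels.append("active")      # ReLU acts as the identity
        elif high <= 0:
            labels.append("inactive")    # ReLU always outputs zero
        else:
            labels.append("unstable")    # sign depends on the input
    return labels
```

Only the neurons labelled `unstable` need to remain as genuine ReLU nodes; the other two kinds are linear on the whole input region.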
Generative AI trial for nonviolent communication mediation
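results: The results indicate that generative AI has potential for this application, although not yet at a practical level. Suggested improvement guidelines include adding model responses, relearning revised responses, specifying appropriate terminology for each process, and re-asking for required information.
Abstract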
Aiming for a mixbiotic society that combines freedom and solidarity among people with diverse values, I focused on nonviolent communication (NVC) that enables compassionate giving in various situations of social division and conflict, and tried a generative AI for it. Specifically, ChatGPT was used in place of the traditional certified trainer to test the possibility of mediating (modifying) input sentences in four processes: observation, feelings, needs, and requests. The results indicate that there is potential for the application of generative AI, although not yet at a practical level. Suggested improvement guidelines included adding model responses, relearning revised responses, specifying appropriate terminology for each process, and re-asking for required information. The use of generative AI will be useful initially to assist certified trainers, to prepare for and review events and workshops, and in the future to support consensus building and cooperative behavior in digital democracy, platform cooperatives, and cyber-human social co-operating systems. It is hoped that the widespread use of NVC mediation using generative AI will lead to the early realization of a mixbiotic society.
Part-Aware Transformer for Generalizable Person Re-identification
results: The method achieves state-of-the-art performance under most DG-ReID settings. Under the Market$\to$Duke setting, it exceeds the state of the art by 10.9% in Rank1 and 12.8% in mAP.
Abstract
Domain generalization person re-identification (DG-ReID) aims to train a model on source domains and generalize well on unseen domains. Vision Transformer usually yields better generalization ability than common CNN networks under distribution shifts. However, Transformer-based ReID models inevitably over-fit to domain-specific biases due to the supervised learning strategy on the source domain. We observe that while the global images of different IDs should have different features, their similar local parts (e.g., black backpack) are not bounded by this constraint. Motivated by this, we propose a pure Transformer model (termed Part-aware Transformer) for DG-ReID by designing a proxy task, named Cross-ID Similarity Learning (CSL), to mine local visual information shared by different IDs. This proxy task allows the model to learn generic features because it only cares about the visual similarity of the parts regardless of the ID labels, thus alleviating the side effect of domain-specific biases. Based on the local similarity obtained in CSL, a Part-guided Self-Distillation (PSD) is proposed to further improve the generalization of global features. Our method achieves state-of-the-art performance under most DG ReID settings. Under the Market$\to$Duke setting, our method exceeds state-of-the-art by 10.9% and 12.8% in Rank1 and mAP, respectively. The code is available at https://github.com/liyuke65535/Part-Aware-Transformer.
Binary Federated Learning with Client-Level Differential Privacy
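results: Experiments on the MNIST and Fashion-MNIST datasets show that the proposed training algorithm achieves client-level privacy protection while benefiting from low communication overhead.
Abstract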
Federated learning (FL) is a privacy-preserving collaborative learning framework, and differential privacy can be applied to further enhance its privacy protection. Existing FL systems typically adopt Federated Average (FedAvg) as the training algorithm and implement differential privacy with a Gaussian mechanism. However, the inherent privacy-utility trade-off in these systems severely degrades the training performance if a tight privacy budget is enforced. In addition, the Gaussian mechanism requires model weights to be of high precision. To improve communication efficiency and achieve a better privacy-utility trade-off, we propose a communication-efficient FL training algorithm with a differential privacy guarantee. Specifically, we propose to adopt binary neural networks (BNNs) and introduce discrete noise in the FL setting. Binary model parameters are uploaded for higher communication efficiency, and discrete noise is added to achieve client-level differential privacy protection. The achieved performance guarantee is rigorously proved, and it is shown to depend on the level of discrete noise. Experimental results on the MNIST and Fashion-MNIST datasets demonstrate that the proposed training algorithm achieves client-level privacy protection with a performance gain while enjoying the benefits of low communication overhead from binary model updates.
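To give a feel for what discrete noise on binary parameters can look like, the sketch below uses binary randomized response: each +1/-1 weight is flipped independently with probability 1 / (1 + e^epsilon) before upload, and the server aggregates by majority vote. Both choices are assumptions made for illustration. The abstract does not specify the paper's noise mechanism, its aggregation rule, or how the client-level guarantee is derived, and all function names here are hypothetical.

```python
import math
import random


def flip_probability(epsilon):
    """Flip probability of binary randomized response with parameter epsilon."""
    return 1.0 / (1.0 + math.exp(epsilon))


def add_discrete_noise(binary_weights, epsilon, rng):
    """Flip each +1/-1 weight independently with the probability above."""
    p = flip_probability(epsilon)
    return [-w if rng.random() < p else w for w in binary_weights]


def aggregate_signs(client_updates):
    """Majority vote over client sign vectors (ties resolve to +1)."""
    return [1 if sum(column) >= 0 else -1 for column in zip(*client_updates)]
```

A smaller `epsilon` pushes the flip probability toward 0.5 (more privacy, noisier updates), while each uploaded parameter still costs a single bit, unlike Gaussian noise, which needs high-precision weights.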
When GPT Meets Program Analysis: Towards Intelligent Detection of Smart Contract Logic Vulnerabilities in GPTScan
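results: By using GPT as a code understanding tool, GPTScan detects smart contract logic vulnerabilities with high precision (over 90%) and finds 9 new vulnerabilities missed by human auditors.
Abstract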
Smart contracts are prone to various vulnerabilities, leading to substantial financial losses over time. Current analysis tools mainly target vulnerabilities with fixed control or dataflow patterns, such as re-entrancy and integer overflow. However, a recent study on Web3 security bugs revealed that about 80% of these bugs cannot be audited by existing tools due to the lack of domain-specific property description and checking. Given recent advances in Generative Pretraining Transformer (GPT), it is worth exploring how GPT could aid in detecting logic vulnerabilities in smart contracts. In this paper, we propose GPTScan, the first tool combining GPT with static analysis for smart contract logic vulnerability detection. Instead of relying solely on GPT to identify vulnerabilities, which can lead to high false positives and is limited by GPT's pre-trained knowledge, we utilize GPT as a versatile code understanding tool. By breaking down each logic vulnerability type into scenarios and properties, GPTScan matches candidate vulnerabilities with GPT. To enhance accuracy, GPTScan further instructs GPT to intelligently recognize key variables and statements, which are then validated by static confirmation. Evaluation on diverse datasets with around 400 contract projects and 3K Solidity files shows that GPTScan achieves high precision (over 90%) for token contracts and acceptable precision (57.14%) for large projects like Web3Bugs. It effectively detects groundtruth logic vulnerabilities with a recall of over 80%, including 9 new vulnerabilities missed by human auditors. GPTScan is fast and cost-effective, taking an average of 14.39 seconds and 0.01 USD to scan per thousand lines of Solidity code. Moreover, static confirmation helps GPTScan reduce two-thirds of false positives.
CrossTalk: Intelligent Substrates for Language-Oriented Interaction in Video-Based Communication and Collaboration
paper_authors: Haijun Xia, Tony Wang, Aditya Gunturu, Peiling Jiang, William Duan, Xiaoshuo Yao
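for: This paper proposes an intelligence-based videoconferencing system to better assist users in communication and collaboration.
methods: The paper proposes three key design concepts: the panel substrate, language-based intent recognition, and lightweight interaction techniques.
results: The authors developed CrossTalk, a videoconferencing system that instantiates these three concepts and gives users a more fluid and flexible communication and collaboration experience.
Abstract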
Despite the advances and ubiquity of digital communication media such as videoconferencing and virtual reality, they remain oblivious to the rich intentions expressed by users. Beyond transmitting audio, videos, and messages, we envision digital communication media as proactive facilitators that can provide unobtrusive assistance to enhance communication and collaboration. Informed by the results of a formative study, we propose three key design concepts to explore the systematic integration of intelligence into communication and collaboration, including the panel substrate, language-based intent recognition, and lightweight interaction techniques. We developed CrossTalk, a videoconferencing system that instantiates these concepts, which was found to enable a more fluid and flexible communication and collaboration experience.
What has ChatGPT read? The origins of archaeological citations used by a generative artificial intelligence application
for: The paper aims to test what archaeological literature was included in ChatGPT’s training phase.
methods: The paper uses cloze analysis to infer what sources the generative AI model has memorized.
results: The paper finds that a large percentage of the references provided by ChatGPT are fictitious, and that all genuine references have also been cited on Wikipedia pages. This suggests that the source base for at least some of the data is found in those pages.
Abstract
The public release of ChatGPT has resulted in considerable publicity and has led to wide-spread discussion of the usefulness and capabilities of generative AI language models. Its ability to extract and summarise data from textual sources and present them as human-like contextual responses makes it an eminently suitable tool to answer questions users might ask. This paper tested what archaeological literature appears to have been included in ChatGPT's training phase. While ChatGPT offered seemingly pertinent references, a large percentage proved to be fictitious. Using cloze analysis to make inferences on the sources 'memorised' by a generative AI model, this paper was unable to prove that ChatGPT had access to the full texts of the genuine references. It can be shown that all references provided by ChatGPT that were found to be genuine have also been cited on Wikipedia pages. This strongly indicates that the source base for at least some of the data is found in those pages. The implications of this in relation to data quality are discussed.
DOMINO: Domain-invariant Hyperdimensional Classification for Multi-Sensor Time Series Data
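results: Across a wide range of multi-sensor time series classification tasks, DOMINO achieves on average 2.04% higher accuracy than state-of-the-art (SOTA) domain generalization techniques, with 16.34x faster training and 2.89x faster inference. It also performs notably better when learning from partially labeled and highly imbalanced data and provides 10.93x higher robustness against hardware noise.
Abstract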
With the rapid evolution of the Internet of Things, many real-world applications utilize heterogeneously connected sensors to capture time-series information. Edge-based machine learning (ML) methodologies are often employed to analyze locally collected data. However, a fundamental issue across data-driven ML approaches is distribution shift. It occurs when a model is deployed on a data distribution different from what it was trained on, and can substantially degrade model performance. Additionally, increasingly sophisticated deep neural networks (DNNs) have been proposed to capture spatial and temporal dependencies in multi-sensor time series data, requiring intensive computational resources beyond the capacity of today's edge devices. While brain-inspired hyperdimensional computing (HDC) has been introduced as a lightweight solution for edge-based learning, existing HDCs are also vulnerable to the distribution shift challenge. In this paper, we propose DOMINO, a novel HDC learning framework addressing the distribution shift problem in noisy multi-sensor time-series data. DOMINO leverages efficient and parallel matrix operations on high-dimensional space to dynamically identify and filter out domain-variant dimensions. Our evaluation on a wide range of multi-sensor time series classification tasks shows that DOMINO achieves on average 2.04% higher accuracy than state-of-the-art (SOTA) DNN-based domain generalization techniques, and delivers 16.34x faster training and 2.89x faster inference. More importantly, DOMINO performs notably better when learning from partially labeled and highly imbalanced data, providing 10.93x higher robustness against hardware noises than SOTA DNNs.
SynJax: Structured Probability Distributions for JAX
results: With the SynJax library, one can build large-scale differentiable models that explicitly model structure in the data, with efficient inference on modern hardware accelerators.
Abstract
The development of deep learning software libraries enabled significant progress in the field by allowing users to focus on modeling, while letting the library take care of the tedious and time-consuming task of optimizing execution for modern hardware accelerators. However, this has benefited only particular types of deep learning models, such as Transformers, whose primitives map easily to vectorized computation. Models that explicitly account for structured objects, such as trees and segmentations, did not benefit equally because they require custom algorithms that are difficult to implement in a vectorized form. SynJax directly addresses this problem by providing an efficient vectorized implementation of inference algorithms for structured distributions covering alignment, tagging, segmentation, constituency trees and spanning trees. With SynJax we can build large-scale differentiable models that explicitly model structure in the data. The code is available at https://github.com/deepmind/synjax.
Local Structure-aware Graph Contrastive Representation Learning
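results: Experiments on five datasets show that LS-GCL outperforms state-of-the-art graph representation learning approaches on both node classification and link prediction tasks.
Abstract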
The traditional Graph Neural Network (GNN), as a graph representation learning method, is constrained by label information. However, Graph Contrastive Learning (GCL) methods, which tackle the label problem effectively, mainly focus on the feature information of the global graph or small subgraph structure (e.g., the first-order neighborhood). In this paper, we propose a Local Structure-aware Graph Contrastive representation Learning method (LS-GCL) to model the structural information of nodes from multiple views. Specifically, we construct semantic subgraphs that are not limited to the first-order neighbors. For the local view, the semantic subgraph of each target node is input into a shared GNN encoder to obtain the target node embeddings at the subgraph level. Then, we use a pooling function to generate the subgraph-level graph embeddings. For the global view, considering that the original graph preserves indispensable semantic information of nodes, we leverage the shared GNN encoder to learn the target node embeddings at the global graph level. The proposed LS-GCL model is optimized to maximize the common information among similar instances from three different perspectives through a multi-level contrastive loss function. Experimental results on five datasets illustrate that our method outperforms state-of-the-art graph representation learning approaches for both node classification and link prediction tasks.
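A contrastive loss of the kind mentioned above pulls the embeddings of matching views together and pushes non-matching ones apart. The sketch below implements the widely used InfoNCE loss for a single anchor as one plausible building block. It is an assumption made for illustration: the abstract does not give the exact form of the multi-level loss or how its three perspectives are weighted, and the function names and the default temperature are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def info_nce_loss(anchor, positive, negatives, temperature=0.5):
    """InfoNCE loss for one anchor: negative log-softmax score of the positive."""
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_denominator - logits[0]
```

In a setting like the one described, the anchor could be a node's subgraph-level embedding, the positive the same node's global-level embedding, and the negatives embeddings of other nodes; a multi-level loss would then sum such terms over the different view pairs.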
paper_authors: Haodi Ma, Anthony Colas, Yuejie Wang, Ali Sadeghian, Daisy Zhe Wang
for: This paper is written for researchers and practitioners interested in neural knowledge graph inference, particularly those looking to combine logic rules with knowledge graph embeddings.
methods: The paper proposes a mechanism called InjEx, which injects multiple types of rules through simple constraints to capture definite Horn rules.
results: The paper evaluates InjEx on both the knowledge graph completion (KGC) and few-shot knowledge graph completion (FKGC) settings, and shows that it outperforms baseline KGC models as well as specialized few-shot models while maintaining its scalability and efficiency.
Abstract
Recent works in neural knowledge graph inference attempt to combine logic rules with knowledge graph embeddings to benefit from prior knowledge. However, they usually cannot avoid rule grounding, and injecting a diverse set of rules has still not been thoroughly explored. In this work, we propose InjEx, a mechanism to inject multiple types of rules through simple constraints, which capture definite Horn rules. To start, we theoretically prove that InjEx can inject such rules. Next, to demonstrate that InjEx infuses interpretable prior knowledge into the embedding space, we evaluate InjEx on both the knowledge graph completion (KGC) and few-shot knowledge graph completion (FKGC) settings. Our experimental results reveal that InjEx outperforms both baseline KGC models as well as specialized few-shot models while maintaining its scalability and efficiency.
Redundancy-aware Transformer for Video Question Answering
results: Evaluated on multiple VideoQA benchmarks, the method achieves state-of-the-art results.
Abstract
This paper identifies two kinds of redundancy in the current VideoQA paradigm. Specifically, the current video encoders tend to holistically embed all video clues at different granularities in a hierarchical manner, which inevitably introduces neighboring-frame redundancy that can overwhelm detailed visual clues at the object level. Subsequently, prevailing vision-language fusion designs introduce cross-modal redundancy by exhaustively fusing all visual elements with question tokens without explicitly differentiating their pairwise vision-language interactions, thus making a pernicious impact on the answering. To this end, we propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner. To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes the object-level change in neighboring frames, while adopting an out-of-neighboring message-passing scheme that imposes attention only on distant frames. As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling, which explicitly differentiates the vision-language interactions by identifying a small subset of visual elements that exclusively support the answer. Upon these advancements, we find this Redundancy-aware transformer (RaFormer) can achieve state-of-the-art results on multiple VideoQA benchmarks.
TempFuser: Learning Tactical and Agile Flight Maneuvers in Aerial Dogfights using a Long Short-Term Temporal Fusion Transformer
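methods: The method uses two LSTM-based input embeddings to encode long-term, sparse state trajectories and short-term, dense state trajectories. By integrating the two embeddings through a transformer encoder, it then derives end-to-end flight commands.
results: The model is extensively validated against a diverse range of opponent aircraft in a high-fidelity environment and outperforms the baseline models in agile and tactical flight. It learns basic fighter maneuvers, human pilot-like tactical maneuvers, and robust pursuit at low altitudes. Videos are available at \url{https://sites.google.com/view/tempfuser}.
Abstract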
Aerial dogfights necessitate understanding the tactically changing maneuvers from a long-term perspective, along with the rapidly changing aerodynamics from a short-term view. In this paper, we propose a novel long short-term temporal fusion transformer (TempFuser) for a policy network in aerial dogfights. Our method uses two LSTM-based input embeddings to encode long-term, sparse state trajectories, as well as short-term, dense state trajectories. By integrating the two embeddings through a transformer encoder, the method subsequently derives end-to-end flight commands for agile and tactical maneuvers. We formulate a deep reinforcement learning framework to train our TempFuser-based policy model. We then extensively validate our model, demonstrating that it outperforms other baseline models against a diverse range of opponent aircraft in a high-fidelity environment. Our model successfully learns basic fighter maneuvers, human pilot-like tactical maneuvers, and robust supersonic pursuit in low altitudes without explicitly coded prior knowledge. Videos are available at \url{https://sites.google.com/view/tempfuser}
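The two input views described above can be sketched as a simple split of a state history. This is only an illustration under assumptions: the function name, the stride, and the window lengths are hypothetical, and the LSTM embeddings and transformer encoder that would consume these views are not shown.

```python
def split_history(states, long_len, long_stride, short_len):
    """Return (long_sparse, short_dense) views of a state history.

    `states` is ordered oldest to newest. All parameters are hypothetical;
    `short_len` must be at least 1.
    """
    # Long-term view: walk back from the newest state in steps of long_stride.
    long_sparse = states[::-1][::long_stride][:long_len][::-1]
    # Short-term view: the most recent short_len states, with no subsampling.
    short_dense = states[-short_len:]
    return long_sparse, short_dense


history = list(range(20))  # stand-in for 20 consecutive aircraft states
long_view, short_view = split_history(history, long_len=4, long_stride=5, short_len=3)
```

Both views end at the newest state, so the two encoders would see the same present moment at different temporal resolutions.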
Summary
Aerial dogfights require understanding tactical changes from a long-term perspective as well as changes in aerodynamics from a short-term view. This paper proposes a novel long short-term temporal fusion transformer (TempFuser) as part of a policy network. The method uses two LSTM-based input embeddings to encode long-term, sparse state trajectories and short-term, dense state trajectories. Combining the two embeddings in a transformer encoder, it then derives end-to-end flight commands. A deep reinforcement learning framework is built to train the TempFuser-based policy model. Extensive validation shows that the model outperforms baseline models in a high-fidelity environment. The model learns basic fighter maneuvers, human pilot-like tactical maneuvers, and robust supersonic pursuit at low altitudes without explicitly coded prior knowledge. Videos are available at \url{https://sites.google.com/view/tempfuser}.
PaniniQA: Enhancing Patient Education Through Interactive Question Answering
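results: Automatic and human evaluation shows that PaniniQA can effectively help patients understand and remember their medical instructions, improving their medical knowledge and confidence.
Abstract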
Patient portals allow discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or memorizing their discharge instructions. In this paper, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand their discharge instructions. PaniniQA first identifies important clinical content from patients' discharge instructions and then formulates patient-specific educational questions. In addition, PaniniQA is equipped with answer verification functionality to provide timely feedback to correct patients' misunderstandings. Our comprehensive automatic and human evaluation results demonstrate that PaniniQA is capable of improving patients' mastery of their medical instructions through effective interactions.
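Summary
Patient portals allow discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or remembering these instructions. In this paper, we present PaniniQA, a patient-centric interactive question answering system that helps patients understand their discharge instructions. PaniniQA first extracts important clinical content from a patient's discharge instructions and then formulates patient-specific educational questions. PaniniQA also provides answer verification, giving timely feedback to correct misunderstandings. Our comprehensive automatic and human evaluation results show that PaniniQA can improve patients' understanding of their medical instructions through effective interaction.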
Analysis of Optical Loss and Crosstalk Noise in MZI-based Coherent Photonic Neural Networks
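paper_authors: Amin Shafiee, Sanmitra Banerjee, Krishnendu Chakrabarty, Sudeep Pasricha, Mahdi Nikdast
for: This paper proposes a bottom-up model for analyzing how different design choices affect optical loss and crosstalk noise in silicon-photonic neural networks (SP-NNs).
methods: The paper uses a bottom-up model, from the device level to the system level, to analyze the impact of loss and crosstalk noise in SP-NNs.
results: As an SP-NN scales up, the impact of loss and crosstalk noise grows, degrading inferencing accuracy, which can drop below 10%. The paper also analyzes loss and crosstalk for different MZI mesh configurations (Reck, Clements, and Diamond).
Abstract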
With the continuous increase in the size and complexity of machine learning models, the need for specialized hardware to efficiently run such models is rapidly growing. To address such a need, silicon-photonic-based neural network (SP-NN) accelerators have recently emerged as a promising alternative to electronic accelerators due to their lower latency and higher energy efficiency. Not only can SP-NNs alleviate the fan-in and fan-out problem with linear algebra processors, their operational bandwidth can match that of the photodetection rate (typically 100 GHz), which is at least over an order of magnitude faster than electronic counterparts that are restricted to a clock rate of a few GHz. Unfortunately, the underlying silicon photonic devices in SP-NNs suffer from inherent optical losses and crosstalk noise originating from fabrication imperfections and undesired optical couplings, the impact of which accumulates as the network scales up. Consequently, the inferencing accuracy in an SP-NN can be affected by such inefficiencies -- e.g., can drop to below 10% -- the impact of which is yet to be fully studied. In this paper, we comprehensively model the optical loss and crosstalk noise using a bottom-up approach, from the device to the system level, in coherent SP-NNs built using Mach-Zehnder interferometer (MZI) devices. The proposed models can be applied to any SP-NN architecture with different configurations to analyze the effect of loss and crosstalk. Such an analysis is important where there are inferencing accuracy and scalability requirements to meet when designing an SP-NN. Using the proposed analytical framework, we show a high power penalty and a catastrophic inferencing accuracy drop of up to 84% for SP-NNs of different scales with three known MZI mesh configurations (i.e., Reck, Clements, and Diamond) due to accumulated optical loss and crosstalk noise.
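To see why loss accumulates with scale, a rough back-of-the-envelope sketch helps. It assumes a uniform insertion loss per MZI (the 0.5 dB used below is a hypothetical value), uses the commonly cited optical depths of 2N-3 for a Reck mesh and N for a Clements mesh, and ignores crosstalk entirely. It is not the paper's device-to-system model.

```python
def mesh_depth(n, layout):
    """Number of MZIs on the longest path through an n x n mesh.

    Assumption: commonly cited optical depths (Reck: 2n - 3, Clements: n).
    """
    if layout == "reck":
        return 2 * n - 3
    if layout == "clements":
        return n
    raise ValueError(f"unknown layout: {layout}")


def accumulated_loss_db(n, layout, loss_per_mzi_db):
    """Worst-case insertion loss in dB, assuming every MZI loses the same amount."""
    return mesh_depth(n, layout) * loss_per_mzi_db


def surviving_power_fraction(loss_db):
    """Convert a loss in dB to the fraction of optical power that remains."""
    return 10 ** (-loss_db / 10)


reck_loss = accumulated_loss_db(8, "reck", 0.5)          # hypothetical 0.5 dB per MZI
clements_loss = accumulated_loss_db(8, "clements", 0.5)
```

Because the dB loss grows linearly with depth, the surviving optical power shrinks exponentially as the mesh scales up.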
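Summary
As machine learning models grow in size and complexity, the need for specialized hardware to run them efficiently is growing as well. To meet this need, silicon-photonic-based neural network (SP-NN) accelerators have recently emerged as strong competitors to electronic accelerators thanks to their low latency and high energy efficiency. SP-NNs can alleviate the fan-in and fan-out problem of linear algebra processors, and their operational bandwidth can match the photodetection rate (typically 100 GHz), which is at least an order of magnitude faster than electronic counterparts. However, the silicon photonic devices in SP-NNs suffer from fabrication imperfections and undesired optical couplings, whose impact grows with network scale and degrades inferencing accuracy; for example, accuracy can drop below 10%. In this paper, we model optical loss and crosstalk noise bottom-up, from the device level to the system level. The models can be applied to different SP-NN architectures to analyze the effect of optical loss and crosstalk noise on inferencing accuracy. Such analysis matters when an SP-NN design must meet inferencing accuracy and scalability requirements. Using the proposed analytical framework, we show a high power penalty and an inferencing accuracy drop of up to 84% for SP-NNs of different scales, caused by optical loss and crosstalk noise.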
Mind the Gap: Improving Success Rate of Vision-and-Language Navigation by Revisiting Oracle Success Routes
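results: The method is evaluated on three widely used datasets (R2R, REVERIE, and NDH), and the results show higher accuracy and a lower failure rate, indicating the method's potential.
Abstract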
Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction. Unlike existing methods focused on predicting a more accurate action at each step in navigation, in this paper, we make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR). We observe a consistently large gap (up to 9%) on four state-of-the-art VLN methods across two benchmark datasets: R2R and REVERIE. The high OSR indicates the robot agent passes the target location, while the low SR suggests the agent ultimately fails to stop there. Instead of predicting actions directly, we propose to mine the target location from a trajectory given by off-the-shelf VLN models. Specifically, we design a multi-module transformer-based model for learning a compact, discriminative trajectory viewpoint representation, which is used to predict the confidence of being the target location described in the instruction. The proposed method is evaluated on three widely adopted datasets: R2R, REVERIE, and NDH, and shows promising results, demonstrating the potential for more future research.
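The difference between the two metrics can be made concrete with a small sketch. It assumes 2D positions, a hypothetical episode format (`goal` and `path` keys), and a 3-unit success radius; none of these details come from the abstract.

```python
import math


def is_success(point, goal, radius):
    """A point counts as a success when it lies within `radius` of the goal."""
    return math.dist(point, goal) <= radius


def success_rate(episodes, radius=3.0):
    """SR: fraction of episodes whose final position is near the goal."""
    hits = sum(is_success(ep["path"][-1], ep["goal"], radius) for ep in episodes)
    return hits / len(episodes)


def oracle_success_rate(episodes, radius=3.0):
    """OSR: fraction of episodes where any visited position is near the goal."""
    hits = sum(
        any(is_success(p, ep["goal"], radius) for p in ep["path"])
        for ep in episodes
    )
    return hits / len(episodes)


episodes = [
    # The agent stops at the goal: counts for both SR and OSR.
    {"goal": (10.0, 0.0), "path": [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]},
    # The agent passes the goal and overshoots: counts for OSR only.
    {"goal": (10.0, 0.0), "path": [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]},
]
sr = success_rate(episodes)
osr = oracle_success_rate(episodes)
```

The second episode is the kind of case the abstract targets: the trajectory contains the target, so picking the right viewpoint from the trajectory would turn an oracle success into a real one.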
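Summary
Vision-and-language navigation (VLN) aims to navigate by following a given instruction. Unlike existing methods, which focus on predicting a more accurate action at each step, this paper makes the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR). We observe a consistent gap (up to 9%) on two benchmark datasets, R2R and REVERIE. A high OSR means the robot agent passes the target location, while a low SR means the agent ultimately fails to stop there. Instead of predicting actions directly, we propose to mine the target location from the trajectory produced by an off-the-shelf VLN model. We design a multi-module transformer-based model that learns a compact, discriminative trajectory viewpoint representation, which is used to predict the confidence that a viewpoint is the target location described in the instruction. We evaluate the method on three widely adopted datasets, R2R, REVERIE, and NDH, and obtain promising results that demonstrate its potential.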
Analysis of the Evolution of Advanced Transformer-Based Language Models: Experiments on Opinion Mining
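for: This study examines how high-performing natural language processing (NLP) models behave on the opinion mining task and compares different transformer-based language models.
methods: The study evaluates high-performing transformer-based language models, including BERT, RoBERTa, and XLNet, on multiple corpora and compares their performance.
results: The results show that these transformer-based language models perform very well on opinion mining; specifically, BERT and RoBERTa reach an average accuracy above 90% across the corpora, and XLNet above 95%.
Abstract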
Opinion mining, also known as sentiment analysis, is a subfield of natural language processing (NLP) that focuses on identifying and extracting subjective information in textual material. This can include determining the overall sentiment of a piece of text (e.g., positive or negative), as well as identifying specific emotions or opinions expressed in the text, which involves the use of advanced machine learning and deep learning techniques. Recently, transformer-based language models have made this task of human emotion analysis intuitive, thanks to the attention mechanism and parallel computation. These advantages make such models very powerful on linguistic tasks, unlike recurrent neural networks, which spend a lot of time on sequential processing and are therefore prone to fail on long text. Our paper studies the behaviour of cutting-edge transformer-based language models on opinion mining and provides a high-level comparison between them to highlight their key particularities. Additionally, our comparative study offers leads for production engineers on which approach to focus on, and it is useful for researchers as it provides guidelines for future research subjects.
Why Linguistics Will Thrive in the 21st Century: A Reply to Piantadosi (2023)
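results: The paper concludes that, although LLMs are instructive and useful, the mystery of human language learning remains unexplained. Moreover, LLMs cannot provide interpretable scientific theories, so generative linguistics will remain an indispensable scientific discipline in the 21st century and beyond.
Abstract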
We present a critical assessment of Piantadosi's (2023) claim that "Modern language models refute Chomsky's approach to language," focusing on four main points. First, despite the impressive performance and utility of large language models (LLMs), humans achieve their capacity for language after exposure to several orders of magnitude less data. The fact that young children become competent, fluent speakers of their native languages with relatively little exposure to them is the central mystery of language learning to which Chomsky initially drew attention, and LLMs currently show little promise of solving this mystery. Second, what can the artificial reveal about the natural? Put simply, the implications of LLMs for our understanding of the cognitive structures and mechanisms underlying language and its acquisition are like the implications of airplanes for understanding how birds fly. Third, LLMs cannot constitute scientific theories of language for several reasons, not least of which is that scientific theories must provide interpretable explanations, not just predictions. This leads to our final point: to even determine whether the linguistic and cognitive capabilities of LLMs rival those of humans requires explicating what humans' capacities actually are. In other words, it requires a separate theory of language and cognition; generative linguistics provides precisely such a theory. As such, we conclude that generative linguistics as a scientific discipline will remain indispensable throughout the 21st century and beyond.
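Summary
We present a critical assessment of Piantadosi (2023), focusing on four main points. First, although large language models (LLMs) perform impressively, humans acquire their capacity for language from comparatively little data. That children become fluent in their native language with relatively little exposure is the central mystery of language learning, and LLMs have not yet solved it. Second, what can the artificial reveal about natural language? The implications of LLMs for our understanding of language and its acquisition mechanisms are like the implications of airplanes for understanding how birds fly. Third, LLMs cannot constitute scientific theories of language, because scientific theories must provide interpretable explanations, not just predictions. This leads to our final point: determining whether the linguistic and cognitive capabilities of LLMs rival those of humans first requires explaining what human capacities are. In other words, it requires a theory of language and cognition, and generative linguistics provides precisely such a theory. We therefore conclude that generative linguistics as a scientific discipline will remain indispensable in the 21st century and beyond.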
Investigation of Self-supervised Pre-trained Models for Classification of Voice Quality from Speech and Neck Surface Accelerometer Signals
paper_authors: Sudarsana Reddy Kadiri, Farhad Javanmardi, Paavo Alku
for: This study examines automatic classification of voice quality (breathy, modal, and pressed).
methods: The study uses simultaneously recorded speech and neck surface accelerometer (NSA) signals as inputs and extracts MFCCs and glottal source features.
results: The NSA input yields better classification performance, and features derived from pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) improve classification accuracy.
Abstract
Prior studies in the automatic classification of voice quality have mainly studied the use of the acoustic speech signal as input. Recently, a few studies have been carried out by jointly using both speech and neck surface accelerometer (NSA) signals as inputs, and by extracting MFCCs and glottal source features. This study examines simultaneously-recorded speech and NSA signals in the classification of voice quality (breathy, modal, and pressed) using features derived from three self-supervised pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) and using a SVM as well as CNNs as classifiers. Furthermore, the effectiveness of the pre-trained models is compared in feature extraction between glottal source waveforms and raw signal waveforms for both speech and NSA inputs. Using two signal processing methods (quasi-closed phase (QCP) glottal inverse filtering and zero frequency filtering (ZFF)), glottal source waveforms are estimated from both speech and NSA signals. The study has three main goals: (1) to study whether features derived from pre-trained models improve classification accuracy compared to conventional features (spectrogram, mel-spectrogram, MFCCs, i-vector, and x-vector), (2) to investigate which of the two modalities (speech vs. NSA) is more effective in the classification task with pre-trained model-based features, and (3) to evaluate whether the deep learning-based CNN classifier can enhance the classification accuracy in comparison to the SVM classifier. The results revealed that the use of the NSA input showed better classification performance compared to the speech signal. Between the features, the pre-trained model-based features showed better classification accuracies, both for speech and NSA inputs compared to the conventional features. It was also found that the HuBERT features performed better than the wav2vec2-BASE and wav2vec2-LARGE features.
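Summary
Prior studies on the automatic classification of voice quality have mainly used the acoustic speech signal as input. Recently, a few studies have used both speech and neck surface accelerometer (NSA) signals as inputs and extracted MFCCs and glottal source features. This study classifies voice quality (breathy, modal, and pressed) from simultaneously recorded speech and NSA signals, using features derived from three self-supervised pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) with a support vector machine (SVM) and convolutional neural networks (CNNs) as classifiers. In addition, two signal processing methods (quasi-closed phase glottal inverse filtering and zero frequency filtering) are used to estimate glottal source waveforms from the speech and NSA signals. The study has three main goals: (1) to study whether features derived from pre-trained models improve classification accuracy compared to conventional features (spectrogram, mel-spectrogram, MFCCs, i-vector, and x-vector), (2) to investigate which of the two modalities (speech or NSA) is more effective in the classification task with pre-trained model-based features, and (3) to evaluate whether the deep learning-based CNN classifier improves classification accuracy compared to the SVM classifier. The results show that the NSA input classifies voice quality better, and that pre-trained model-based features improve classification accuracy for both speech and NSA inputs. The HuBERT features were also found to be more effective than the wav2vec2-BASE and wav2vec2-LARGE features.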
for: address the challenges of cross-domain learning of Human Pose Estimation (HPE) without access to source data during the adaptation process.
methods: proposed a novel framework that consists of three models: source model, intermediate model, and target model, which explores the task from both source-protect and target-relevant perspectives.
results: comprehensive experiments on several domain adaptive HPE benchmarks show that the proposed method outperforms existing approaches by a considerable margin.
Abstract
Human Pose Estimation (HPE) is widely used in various fields, including motion analysis, healthcare, and virtual reality. However, the great expenses of labeled real-world datasets present a significant challenge for HPE. To overcome this, one approach is to train HPE models on synthetic datasets and then perform domain adaptation (DA) on real-world data. Unfortunately, existing DA methods for HPE neglect data privacy and security by using both source and target data in the adaptation process. To this end, we propose a new task, named source-free domain adaptive HPE, which aims to address the challenges of cross-domain learning of HPE without access to source data during the adaptation process. We further propose a novel framework that consists of three models: source model, intermediate model, and target model, which explores the task from both source-protect and target-relevant perspectives. The source-protect module preserves source information more effectively while resisting noise, and the target-relevant module reduces the sparsity of spatial representations by building a novel spatial probability space, and pose-specific contrastive learning and information maximization are proposed on the basis of this space. Comprehensive experiments on several domain adaptive HPE benchmarks show that the proposed method outperforms existing approaches by a considerable margin. The codes are available at https://github.com/davidpengucf/SFDAHPE.
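Summary
Human pose estimation (HPE) is widely used in many fields, such as motion analysis, healthcare, and virtual reality. However, the high cost of real-world data is a major challenge for HPE. One way to address this is to train HPE models on synthetic data and then perform domain adaptation (DA) on real-world data. Existing DA methods for HPE, however, neglect data privacy and security by using both source and target data during adaptation. We therefore propose a new task, source-free domain adaptive HPE, which addresses cross-domain learning of HPE without access to source data during adaptation. We also propose a new framework consisting of three models, a source model, an intermediate model, and a target model, which approaches the task from both source-protect and target-relevant perspectives. The source-protect module preserves source information more effectively while resisting noise; the target-relevant module reduces the sparsity of spatial representations by building a novel spatial probability space, on which pose-specific contrastive learning and information maximization are built. Extensive experiments on several domain adaptive HPE benchmarks show that the proposed method improves considerably on existing methods. The code is available at https://github.com/davidpengucf/SFDAHPE.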
Unmasking the Invisible: Finding Location-Specific Aggregated Air Quality Index with Smartphone-Captured Images
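methods: The paper trains a DCNN on a large set of outdoor images and the corresponding PM2.5 concentration data, and uses supervised learning to establish a correlation index between images and PM2.5 concentration. The approach is called the Picture-based Predictor of PM2.5 Concentration (PPPC).
results: Test results show that the model performs well at predicting PM2.5 concentration in Dhaka, outperforming popular models such as ViT and INN as well as CNN-based models such as VGG19, ResNet50, and MobileNetV2. The model is also resource-efficient, using only a small number of parameters.
Abstract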
The prevalence and mobility of smartphones make them a widely used tool for environmental health research. However, their potential for determining aggregated air quality index (AQI) based on PM2.5 concentration in specific locations remains largely unexplored in the existing literature. In this paper, we thoroughly examine the challenges associated with predicting location-specific PM2.5 concentration using images taken with smartphone cameras. The focus of our study is on Dhaka, the capital of Bangladesh, due to its significant air pollution levels and the large population exposed to it. Our research involves the development of a Deep Convolutional Neural Network (DCNN), which we train using over a thousand outdoor images taken and annotated. These photos are captured at various locations in Dhaka, and their labels are based on PM2.5 concentration data obtained from the local US consulate, calculated using the NowCast algorithm. Through supervised learning, our model establishes a correlation index during training, enhancing its ability to function as a Picture-based Predictor of PM2.5 Concentration (PPPC). This enables the algorithm to calculate an equivalent daily averaged AQI from a smartphone image. Unlike popular over-parameterized models, our model is resource-efficient since it uses fewer parameters. Furthermore, test results indicate that our model outperforms popular models like ViT and INN, as well as popular CNN-based models such as VGG19, ResNet50, and MobileNetV2, in predicting location-specific PM2.5 concentration. Our dataset is the first publicly available collection that includes atmospheric images and corresponding PM2.5 measurements from Dhaka. Our code and dataset will be made public when publishing the paper.
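The last step, turning a PM2.5 concentration into an AQI value, is a piecewise-linear interpolation between breakpoints. The sketch below is an assumption-laden illustration: the breakpoint table follows the US EPA PM2.5 table as used before its 2024 revision and should be checked against the current specification, the NowCast weighting is not implemented, and the paper's own conversion may differ.

```python
import math

# (conc_low, conc_high, aqi_low, aqi_high) in ug/m3.
# Assumption: US EPA PM2.5 breakpoints from before the 2024 revision.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_to_aqi(concentration):
    """Map a PM2.5 concentration to an AQI value by linear interpolation."""
    c = math.floor(concentration * 10) / 10  # truncate to one decimal place
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= c <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    raise ValueError(f"concentration out of range: {concentration}")


aqi_moderate_edge = pm25_to_aqi(12.0)
aqi_unhealthy_sensitive = pm25_to_aqi(35.5)
```

A model that predicts PM2.5 from an image could pass its output through such a function to report an AQI category.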
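Summary
The prevalence and mobility of smartphones make them a widely used tool in environmental health research. However, their potential for determining the air quality index (AQI) at specific locations has not been fully explored in the existing literature. This paper thoroughly examines the challenges of predicting location-specific PM2.5 concentration from images taken with smartphone cameras. Our study focuses on Dhaka, the capital of Bangladesh, because of its high air pollution levels and the large population exposed to them. We develop a deep convolutional neural network (DCNN) and train it on over a thousand outdoor images. The images were taken at various locations in Dhaka and labeled with PM2.5 concentration data from the local US consulate, calculated with the NowCast algorithm. Through supervised learning, the model establishes a correlation index during training, which lets it serve as a Picture-based Predictor of PM2.5 Concentration (PPPC). The algorithm can thus calculate an equivalent daily averaged AQI from a smartphone image. Unlike popular over-parameterized models, our model is resource-efficient because it uses fewer parameters. Test results further show that our model outperforms the popular ViT and INN models, as well as popular CNN-based models such as VGG19, ResNet50, and MobileNetV2, at predicting location-specific PM2.5 concentration. Our dataset is the first publicly available one to include atmospheric images from Dhaka with corresponding PM2.5 measurements. Our code and data will be made public when the paper is published.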
Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies
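results: The study finds that automated feedback techniques can effectively improve the performance and usability of LLMs, but challenges and directions for future work remain.
Abstract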
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks. However, their efficacy is undermined by undesired and inconsistent behaviors, including hallucination, unfaithful reasoning, and toxic content. A promising approach to rectify these flaws is self-correction, where the LLM itself is prompted or guided to fix problems in its own output. Techniques leveraging automated feedback -- either produced by the LLM itself or some external system -- are of particular interest as they are a promising way to make LLM-based solutions more practical and deployable with minimal human feedback. This paper presents a comprehensive review of this emerging class of techniques. We analyze and taxonomize a wide array of recent work utilizing these strategies, including training-time, generation-time, and post-hoc correction. We also summarize the major applications of this strategy and conclude by discussing future directions and challenges.
VN-Solver: Vision-based Neural Solver for Combinatorial Optimization over Graphs
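results: The results show that the performance of this vision-based method is not only non-trivial but also comparable to matrix-based methods, opening a new avenue for data-driven optimization solvers.
Abstract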
Data-driven approaches have been proven effective in solving combinatorial optimization problems over graphs such as the traveling salesman problem and the vehicle routing problem. The rationale behind such methods is that the input instances may follow distributions with salient patterns that can be leveraged to overcome the worst-case computational hardness. For optimization problems over graphs, the common practice of neural combinatorial solvers consumes the inputs in the form of adjacency matrices. In this paper, we explore a vision-based method that is conceptually novel: can neural models solve graph optimization problems by \textit{taking a look at the graph pattern}? Our results suggest that the performance of such vision-based methods is not only non-trivial but also comparable to the state-of-the-art matrix-based methods, which opens a new avenue for developing data-driven optimization solvers.
Summary
Data-driven approaches have proven effective for combinatorial optimization problems over graphs, such as the traveling salesman problem and the vehicle routing problem. The basic idea is that input instances may follow salient patterns that can be used to mitigate worst-case computational hardness. For optimization problems over graphs, neural combinatorial solvers commonly take the input as an adjacency matrix. In this paper, we explore a new vision-based method: can neural models solve graph optimization problems by \textit{taking a look at the graph pattern}? Our results show that such vision-based methods are not only non-trivial but also comparable to the current best matrix-based methods, which opens a new avenue for data-driven optimization solvers.
Empirical Optimal Risk to Quantify Model Trustworthiness for Failure Detection
paper_authors: Shuang Ao, Stefan Rueger, Advaith Siddharthan
for: This paper focuses on the problem of failure detection (FD) in AI systems, specifically the evaluation of FD performance and the trade-offs between data coverage rate and performance on accepted data.
methods: The paper proposes two new evaluation metrics, the Excess Area Under the Optimal RC Curve (E-AUoptRC) and the Trust Index (TI), to better reflect the trustworthiness of FD models. These metrics are designed to provide a more intuitive and meaningful evaluation of FD performance, especially when the data coverage rate is partial.
results: The paper reports extensive experiments on three benchmark image datasets with ten variants of transformer and CNN models, demonstrating that the proposed methods can better reflect the model trustworthiness than existing evaluation metrics. The results also show that high overall accuracy does not always yield high TI, highlighting the necessity of the proposed Trust Index as a complementary metric to the model overall accuracy.
Abstract
Failure detection (FD) in AI systems is a crucial safeguard for deployment in safety-critical tasks. The common evaluation method of FD performance is the Risk-coverage (RC) curve, which reveals the trade-off between the data coverage rate and the performance on accepted data. One common way to quantify the RC curve is to calculate the area under it. However, this metric does not inform on how suited any method is for FD, or what the optimal coverage rate should be. As FD aims to achieve higher performance with fewer data discarded, evaluating with partial coverage excluding the most uncertain samples is more intuitive and meaningful than full coverage. In addition, there is an optimal point in the coverage where the model could achieve ideal performance theoretically. We propose the Excess Area Under the Optimal RC Curve (E-AUoptRC), with the area in coverage from the optimal point to the full coverage. Further, the model performance at this optimal point can represent both model learning ability and calibration. We propose it as the Trust Index (TI), a complementary evaluation metric to the overall model accuracy. We report extensive experiments on three benchmark image datasets with ten variants of transformer and CNN models. Our results show that our proposed methods can better reflect the model trustworthiness than existing evaluation metrics. We further observe that the model with high overall accuracy does not always yield the high TI, which indicates the necessity of the proposed Trust Index as a complementary metric to the model overall accuracy. The code is available at \url{https://github.com/AoShuang92/optimal_risk}.
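The baseline that the abstract starts from, the risk-coverage curve and the area under it, can be sketched in a few lines. This is a minimal illustration with made-up confidences; it uses a simple average of risks as the area, and it does not implement E-AUoptRC or the Trust Index, whose exact definitions are not given here.

```python
def risk_coverage_curve(confidences, correct):
    """Return [(coverage, risk)] as samples are accepted from most to least confident.

    Risk is the error rate among the accepted samples.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    curve, errors = [], 0
    for accepted, idx in enumerate(order, start=1):
        errors += 0 if correct[idx] else 1
        curve.append((accepted / len(order), errors / accepted))
    return curve


def area_under_rc(curve):
    """Average risk over all coverage levels (a simple Riemann-sum estimate)."""
    return sum(risk for _, risk in curve) / len(curve)


conf = [0.9, 0.8, 0.7, 0.6]          # hypothetical model confidences
correct = [True, True, False, True]  # whether each prediction was right
curve = risk_coverage_curve(conf, correct)
aurc = area_under_rc(curve)
```

In this toy example the risk is zero up to 50% coverage and rises once the third, wrong prediction is accepted, which is the kind of turning point the abstract's optimal coverage point refers to.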
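Summary
Failure detection (FD) in AI systems is an important safeguard for deployment in safety-critical tasks. The usual evaluation method is the risk-coverage (RC) curve, which shows the trade-off between the data coverage rate and the performance on accepted data. However, this metric tells us neither how well suited a method is for FD nor what coverage rate should be chosen. Because FD aims to achieve higher performance while discarding less data, evaluating at partial coverage, excluding the most uncertain samples, is more intuitive and meaningful. In addition, there is an optimal coverage point at which the model could theoretically achieve ideal performance. We propose the Excess Area Under the Optimal RC Curve (E-AUoptRC), together with the model performance at this optimal point. We regard the latter as the Trust Index (TI), a complementary metric for evaluating model trustworthiness. We run extensive experiments on three benchmark image datasets, and the results show that the proposed methods better reflect model trustworthiness. We also find that high overall accuracy does not always lead to a high Trust Index, which shows that the Trust Index is a necessary complementary metric. The code is available on GitHub: https://github.com/AoShuang92/optimal_risk.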
Building Safe and Reliable AI systems for Safety Critical Tasks with Vision-Language Processing
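results: The paper finds that current AI systems cannot correctly identify common causes of failure, and that additional techniques are needed to quantify the quality of predictions.
Abstract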
Although AI systems have been applied in various fields and achieved impressive performance, their safety and reliability are still a big concern. This is especially important for safety-critical tasks. One shared characteristic of these critical tasks is their risk sensitivity, where small mistakes can cause big consequences and even endanger life. There are several factors that could be guidelines for the successful deployment of AI systems in sensitive tasks: (i) failure detection and out-of-distribution (OOD) detection; (ii) overfitting identification; (iii) uncertainty quantification for predictions; (iv) robustness to data perturbations. These factors are also challenges of current AI systems, which are major blocks for building safe and reliable AI. Specifically, the current AI algorithms are unable to identify common causes for failure detection. Furthermore, additional techniques are required to quantify the quality of predictions. All these contribute to inaccurate uncertainty quantification, which lowers trust in predictions. Hence obtaining accurate model uncertainty quantification and its further improvement are challenging. To address these issues, many techniques have been proposed, such as regularization methods and learning strategies. As vision and language are the most typical data type and have many open source benchmark datasets, this thesis will focus on vision-language data processing for tasks like classification, image captioning, and vision question answering. In this thesis, we aim to build a safeguard by further developing current techniques to ensure the accurate model uncertainty for safety-critical tasks.
Summary
1. Failure detection and out-of-distribution (OOD) detection
2. Overfitting identification
3. Uncertainty quantification for predictions
4. Robustness to data perturbations
Current AI algorithms are unable to identify common causes for failure detection and lack techniques to quantify the quality of predictions, leading to inaccurate uncertainty quantification and lower trust in predictions. To address these issues, many techniques have been proposed, such as regularization methods and learning strategies.
In this thesis, we focus on vision-language data processing for tasks like classification, image captioning, and vision question answering. Our aim is to build a safeguard by further developing current techniques to ensure accurate model uncertainty for safety-critical tasks.
Two Sides of Miscalibration: Identifying Over and Under-Confidence Prediction for Network Calibration
results: Experiments show that the proposed method substantially outperforms existing calibration techniques and, on an automatic failure detection task, improves the reliability and trustworthiness of the model.
Abstract
Proper confidence calibration of deep neural networks is essential for reliable predictions in safety-critical tasks. Miscalibration can lead to model over-confidence and/or under-confidence; i.e., the model's confidence in its prediction can be greater or less than the model's accuracy. Recent studies have highlighted the over-confidence issue by introducing calibration techniques and demonstrated success on various tasks. However, miscalibration through under-confidence has yet to receive much attention. In this paper, we address the necessity of paying attention to the under-confidence issue. We first introduce a novel metric, a miscalibration score, to identify the overall and class-wise calibration status, including being over- or under-confident. Our proposed metric reveals the pitfalls of existing calibration techniques, where they often overly calibrate the model and worsen under-confident predictions. Then we utilize the class-wise miscalibration score as a proxy to design a calibration technique that can tackle both over- and under-confidence. We report extensive experiments that show our proposed methods substantially outperforming existing calibration techniques. We also validate our proposed calibration technique on an automatic failure detection task with a risk-coverage curve, reporting that our methods improve failure detection as well as trustworthiness of the model. The code is available at \url{https://github.com/AoShuang92/miscalibration_TS}.
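The distinction between over- and under-confidence can be illustrated with a signed gap between mean confidence and accuracy, computed overall or per class. This is a hypothetical stand-in for illustration only: the abstract does not define the paper's miscalibration score, and the data below is made up.

```python
def signed_gap(confidences, correct):
    """Mean confidence minus accuracy.

    Positive values indicate over-confidence, negative values under-confidence.
    """
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_conf - accuracy


def classwise_gaps(confidences, correct, predicted_labels):
    """Signed gap computed separately for each predicted class."""
    groups = {}
    for conf, ok, label in zip(confidences, correct, predicted_labels):
        confs, oks = groups.setdefault(label, ([], []))
        confs.append(conf)
        oks.append(ok)
    return {label: signed_gap(confs, oks) for label, (confs, oks) in groups.items()}


confidences = [0.9, 0.9, 0.6, 0.6]
correct = [True, False, True, True]
labels = ["cat", "cat", "dog", "dog"]
gaps = classwise_gaps(confidences, correct, labels)
overall = signed_gap(confidences, correct)
```

Here the "cat" predictions are over-confident and the "dog" predictions under-confident, while the overall gap is zero, which shows why a class-wise view can expose miscalibration that an aggregate number hides.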
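Summary
Confidence calibration of deep neural networks is essential for reliable predictions in safety-critical tasks. Miscalibration can make a model over-confident and/or under-confident, meaning that its confidence in a prediction is higher or lower than its accuracy. Recent studies have proposed calibration techniques and demonstrated success on various tasks. However, miscalibration through under-confidence has not yet received sufficient attention. In this paper, we stress the need to attend to the under-confidence issue. We first introduce a new metric, the miscalibration score, to assess the overall and class-wise calibration status of a model, including whether it is over- or under-confident. The metric exposes a weakness of existing calibration techniques: they often over-calibrate the model and thereby worsen under-confident predictions. We then use the class-wise miscalibration score as a proxy to design a calibration technique that addresses both over- and under-confidence. Extensive experiments show that our proposed methods substantially outperform existing calibration techniques. We also validate the reliability and trustworthiness of the proposed calibration technique on an automatic failure detection task. The code is available at https://github.com/AoShuang92/miscalibration_TS.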
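The difference between over- and under-confidence described above can be made concrete with a small sketch. It computes a signed gap (mean confidence minus accuracy), overall and per predicted class, so that a positive value suggests over-confidence and a negative value suggests under-confidence. This is only an illustration of the idea: it is not the miscalibration score defined in the paper, whose exact form is not given here, and the function names `signed_calibration_gap` and `classwise_gaps` as well as the toy data are assumptions.

```python
from collections import defaultdict


def signed_calibration_gap(confidences, predictions, labels):
    """Return mean confidence minus accuracy.

    Positive: the model is, on average, more confident than it is accurate.
    Negative: the model is, on average, less confident than it is accurate.
    (Illustrative quantity only, not the paper's miscalibration score.)
    """
    if not confidences:
        raise ValueError("need at least one prediction")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    return mean_confidence - accuracy


def classwise_gaps(confidences, predictions, labels):
    """Compute the signed gap separately for each predicted class."""
    groups = defaultdict(list)
    for conf, pred, label in zip(confidences, predictions, labels):
        groups[pred].append((conf, pred, label))
    return {
        cls: signed_calibration_gap(*map(list, zip(*rows)))
        for cls, rows in groups.items()
    }


# Hypothetical toy data: top-1 confidence, predicted class, true class.
confidences = [0.9, 0.8, 0.6, 0.5]
predictions = [0, 0, 1, 1]
labels = [0, 1, 1, 1]

print(signed_calibration_gap(confidences, predictions, labels))  # roughly -0.05
print(classwise_gaps(confidences, predictions, labels))  # class 0 roughly 0.35, class 1 roughly -0.45
```

In this toy example the overall gap is close to zero, yet class 0 is over-confident and class 1 is under-confident, which is the kind of class-wise imbalance that an aggregate measure can hide.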
Strategic Preys Make Acute Predators: Enhancing Camouflaged Object Detectors by Generating Camouflaged Objects
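methods: Drawing on the prey-vs-predator game, the paper develops algorithms from both the prey side and the predator side: an adversarial training framework, Camouflageator, and a novel COD method, Internal Coherence and Edge Guidance (ICEG).
results: Compared with existing COD methods, ICEG segments camouflaged objects more accurately, and Camouflageator can improve various COD methods, including ICEG, achieving state-of-the-art COD performance.
Abstract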
Camouflaged object detection (COD) is the challenging task of identifying camouflaged objects visually blended into surroundings. Albeit achieving remarkable success, existing COD detectors still struggle to obtain precise results in some challenging cases. To handle this problem, we draw inspiration from the prey-vs-predator game, which leads prey to develop better camouflage and predators to acquire more acute vision systems, and we develop algorithms from both the prey side and the predator side. On the prey side, we propose an adversarial training framework, Camouflageator, which introduces an auxiliary generator to generate more camouflaged objects that are harder for a COD method to detect. Camouflageator trains the generator and detector in an adversarial way such that the enhanced auxiliary generator helps produce a stronger detector. On the predator side, we introduce a novel COD method, called Internal Coherence and Edge Guidance (ICEG), which introduces a camouflaged feature coherence module to excavate the internal coherence of camouflaged objects, striving to obtain more complete segmentation results. Additionally, ICEG proposes a novel edge-guided separated calibration module that removes false predictions to avoid ambiguous boundaries. Extensive experiments show that ICEG outperforms existing COD detectors and that Camouflageator can flexibly improve various COD detectors, including ICEG, which brings state-of-the-art COD performance.
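Summary
Camouflaged object detection (COD) is the challenging task of identifying objects that blend visually into their surroundings. Despite remarkable success, existing COD detectors still struggle to obtain precise results in some challenging cases. To address this, we draw on the prey-vs-predator game, in which prey and predators compete with each other, and develop algorithms from both sides. On the prey side, we propose an adversarial training framework, Camouflageator, which introduces an auxiliary generator that produces more heavily camouflaged objects that are harder for a COD method to detect. Camouflageator trains the generator and the detector adversarially, so that the stronger generator helps produce a stronger detector. On the predator side, we propose a new COD method, Internal Coherence and Edge Guidance (ICEG), which introduces a camouflaged feature coherence module to obtain more complete segmentation results. ICEG also proposes a new edge-guided separated calibration module that removes false predictions to avoid ambiguous boundaries. Extensive experiments show that ICEG outperforms existing COD detectors and that Camouflageator can improve various COD detectors, including ICEG, achieving state-of-the-art COD performance.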
Precise Benchmarking of Explainable AI Attribution Methods
paper_authors: Rafaël Brandt, Daan Raatjens, Georgi Gaydadjiev
for: The paper aims to develop a novel evaluation approach for benchmarking state-of-the-art explainable AI (XAI) attribution methods, in order to provide deeper insights into the output of XAI models.
methods: The proposed evaluation approach includes a synthetic classification model accompanied by its derived ground truth explanations, as well as new high-fidelity metrics to quantify the difference between explanations of the investigated XAI method and those derived from the synthetic model.
results: The authors investigate their proposal by constructing a synthetic convolutional image classification model and benchmarking several widely used XAI attribution methods using their evaluation approach. They compare their results with established prior XAI evaluation metrics, and show that their metrics provide deeper insights into the performance of XAI methods, including the poor precision scores among negatively contributing pixels. Additionally, they demonstrate that their metrics are among the fastest in terms of execution time.
Abstract
The rationale behind a deep learning model's output is often difficult for humans to understand. EXplainable AI (XAI) aims at solving this by developing methods that improve interpretability and explainability of machine learning models. Reliable evaluation metrics are needed to assess and compare different XAI methods. We propose a novel evaluation approach for benchmarking state-of-the-art XAI attribution methods. Our proposal consists of a synthetic classification model accompanied by its derived ground truth explanations, allowing high-precision representation of input node contributions. We also propose new high-fidelity metrics to quantify the difference between explanations of the investigated XAI method and those derived from the synthetic model. Our metrics allow assessment of explanations in terms of precision and recall separately. Also, we propose metrics to independently evaluate negative or positive contributions of inputs. Our proposal provides deeper insights into the output of XAI methods. We investigate our proposal by constructing a synthetic convolutional image classification model and benchmarking several widely used XAI attribution methods using our evaluation approach. We compare our results with established prior XAI evaluation metrics. By deriving the ground truth directly from the constructed model in our method, we ensure the absence of bias, e.g., subjective bias or bias based on the training set. Our experimental results provide novel insights into the performance of the widely used Guided-Backprop and Smoothgrad XAI methods. Both have good precision and recall scores among positively contributing pixels (0.7, 0.76 and 0.7, 0.77, respectively), but poor precision scores among negatively contributing pixels (0.44, 0.61 and 0.47, 0.75, respectively). The recall scores in the latter case remain close. We show that our metrics are among the fastest in terms of execution time.
Summary
The rationale behind a deep learning model's output is often difficult for humans to understand. EXplainable AI (XAI) aims to solve this by developing methods that improve the interpretability and explainability of machine learning models. Reliable evaluation metrics are needed to assess and compare different XAI methods. We propose a new evaluation approach for benchmarking state-of-the-art XAI attribution methods. Our proposal consists of a synthetic classification model together with its derived ground truth explanations, allowing high-precision representation of input node contributions. We also propose new high-fidelity metrics to quantify the difference between the explanations of the investigated XAI method and those derived from the synthetic model. Our metrics allow explanations to be assessed in terms of precision and recall separately. In addition, we propose metrics that evaluate the positive and negative contributions of inputs independently. Our proposal provides deeper insight into the output of XAI methods. We run experiments on a synthetic convolutional image classification model and use our approach to evaluate several widely used XAI attribution methods. We compare our results with established XAI evaluation metrics. In our method, the ground truth is derived directly from the constructed model to avoid bias, such as subjective bias based on the training set. Our experimental results provide new insight into the performance of the widely used Guided-Backprop and Smoothgrad XAI methods. Both have good precision and recall scores among positively contributing pixels (0.7, 0.76 and 0.7, 0.77, respectively), but poor precision scores among negatively contributing pixels (0.44, 0.61 and 0.47, 0.75, respectively). The recall scores among negatively contributing pixels remain close. Our metrics are also among the fastest in terms of execution time.
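To illustrate what scoring precision and recall separately for positively and negatively contributing inputs could look like, here is a minimal sketch. It compares only the sign of each attribution value with the sign of a ground truth contribution, which is a deliberate simplification: it is not the paper's metric definition, and the function name `sign_precision_recall` and the toy vectors are assumptions made for this example.

```python
def sign_precision_recall(attribution, ground_truth, sign):
    """Precision and recall for inputs whose contribution has the given sign.

    sign=+1 scores positively contributing inputs, sign=-1 negatively
    contributing ones. An input counts as "predicted" when the attribution
    has that sign and as "actual" when the ground truth has that sign.
    (Simplified sign-only comparison, assumed for illustration.)
    """
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    predicted = {i for i, a in enumerate(attribution) if a * sign > 0}
    actual = {i for i, g in enumerate(ground_truth) if g * sign > 0}
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical flattened pixel contributions.
ground_truth = [1.0, 0.5, 0.0, -0.3, -0.8, 0.0]
attribution = [0.7, -0.1, 0.2, -0.4, 0.1, -0.2]

print(sign_precision_recall(attribution, ground_truth, sign=+1))
print(sign_precision_recall(attribution, ground_truth, sign=-1))
```

Scoring the two signs separately is what lets a method look good on positively contributing pixels while still being imprecise on negatively contributing ones, as reported for Guided-Backprop and Smoothgrad above.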