cs.AI - 2023-08-22

Furnishing Sound Event Detection with Language Model Abilities

  • paper_url: http://arxiv.org/abs/2308.11530
  • repo_url: None
  • paper_authors: Hualei Wang, Jianguo Mao, Zhifang Guo, Jiarui Wan, Hong Liu, Xiangdong Wang
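  • for: This study explores the generation capacity of language models (LMs) beyond visual cross-modality, specifically for sound event detection (SED).
  • methods: The authors propose a concise method that aligns audio features with text features to perform sound event classification and temporal localization. The framework consists of an acoustic encoder, a contrastive module that aligns the text and audio representations, and a decoupled language decoder.
  • results: The model generates accurate sound event detection sequences. Compared with conventional approaches, it is more concise and comprehensive because it relies directly on the language model's semantic capabilities to generate the sequences. Different decoupling modules are also compared for timestamp capture and event classification.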
    Abstract Recently, the ability of language models (LMs) has attracted increasing attention in visual cross-modality. In this paper, we further explore the generation capacity of LMs for sound event detection (SED), beyond the visual domain. Specifically, we propose an elegant method that aligns audio features and text features to accomplish sound event classification and temporal location. The framework consists of an acoustic encoder, a contrastive module that aligns the corresponding representations of the text and audio, and a decoupled language decoder that generates temporal and event sequences from the audio characteristics. Compared with conventional works that require complicated processing and barely utilize limited audio features, our model is more concise and comprehensive since the language model directly leverages its semantic capabilities to generate the sequences. We investigate different decoupling modules to demonstrate their effectiveness for timestamp capture and event classification. Evaluation results show that the proposed method achieves accurate sequences of sound event detection.
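    Summary Language models (LMs) have recently drawn growing attention for their abilities in visual cross-modality. This paper explores their generation capacity for sound event detection (SED), beyond the visual domain. The proposed method aligns audio and text features to classify sound events and locate them in time. The framework comprises an acoustic encoder, a contrastive module that aligns the corresponding text and audio representations, and a decoupled language decoder that generates temporal and event sequences from the audio features. Unlike conventional methods, which need complicated processing and make limited use of audio features, the model is more concise and comprehensive because the language model uses its semantic capabilities directly to generate the sequences. Different decoupling modules are investigated for timestamp capture and event classification, and the evaluation shows that the method detects sound events accurately.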

TrackFlow: Multi-Object Tracking with Normalizing Flows

  • paper_url: http://arxiv.org/abs/2308.11513
  • repo_url: None
  • paper_authors: Gianluca Mancusi, Aniello Panariello, Angelo Porrello, Matteo Fabbri, Simone Calderara, Rita Cucchiara
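  • for: Improving multi-object tracking, in particular by extending tracking-by-detection to multi-modal settings.
  • methods: A deep density estimator scores each candidate association, and the negative log-likelihood it yields is used as the association cost in tracking-by-detection algorithms.
  • results: Experiments on simulated and real benchmarks show that the approach consistently improves the performance of several tracking-by-detection algorithms.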
    Abstract The field of multi-object tracking has recently seen a renewed interest in the good old schema of tracking-by-detection, as its simplicity and strong priors spare it from the complex design and painful babysitting of tracking-by-attention approaches. In view of this, we aim at extending tracking-by-detection to multi-modal settings, where a comprehensive cost has to be computed from heterogeneous information e.g., 2D motion cues, visual appearance, and pose estimates. More precisely, we follow a case study where a rough estimate of 3D information is also available and must be merged with other traditional metrics (e.g., the IoU). To achieve that, recent approaches resort to either simple rules or complex heuristics to balance the contribution of each cost. However, i) they require careful tuning of tailored hyperparameters on a hold-out set, and ii) they imply these costs to be independent, which does not hold in reality. We address these issues by building upon an elegant probabilistic formulation, which considers the cost of a candidate association as the negative log-likelihood yielded by a deep density estimator, trained to model the conditional joint probability distribution of correct associations. Our experiments, conducted on both simulated and real benchmarks, show that our approach consistently enhances the performance of several tracking-by-detection algorithms.
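    Summary Multi-object tracking has recently seen renewed interest in the established tracking-by-detection scheme, whose simplicity and strong priors avoid the complex design and laborious tuning of tracking-by-attention approaches. The authors extend tracking-by-detection to multi-modal settings, where an overall cost has to be computed from heterogeneous information such as 2D motion cues, visual appearance, and pose estimates. Their case study also includes a rough estimate of 3D information that must be merged with traditional metrics such as the IoU. Existing approaches balance the contribution of each cost with either simple rules or complex heuristics, but i) these require careful tuning of tailored hyperparameters on a hold-out set, and ii) they assume the costs are independent, which does not hold in practice. The proposed probabilistic formulation instead takes the cost of a candidate association to be the negative log-likelihood given by a deep density estimator trained to model the conditional joint probability distribution of correct associations. Experiments on both simulated and real benchmarks show that the approach consistently improves several tracking-by-detection algorithms.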
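    As a rough illustration of the cost described in the abstract, the sketch below scores each candidate track-detection pair by the negative log-likelihood of its cost features under a density model, then picks the matching with the lowest total cost. The diagonal Gaussian with made-up parameters is only a stand-in for the paper's deep density estimator, and the feature choice (1 - IoU, depth difference), the numbers, and the brute-force matcher are all assumptions made for this example.

```python
import math
from itertools import permutations

# Hypothetical stand-in for the learned density: a diagonal Gaussian over
# (1 - IoU, depth difference in metres) observed for correct associations.
MEAN = (0.2, 0.5)
STD = (0.15, 0.6)


def nll(features):
    """Negative log-likelihood of a feature vector under the diagonal Gaussian."""
    total = 0.0
    for x, mu, sigma in zip(features, MEAN, STD):
        total += 0.5 * ((x - mu) / sigma) ** 2
        total += math.log(sigma * math.sqrt(2 * math.pi))
    return total


def best_assignment(pair_features):
    """Brute-force the track-to-detection matching with minimum total NLL.

    pair_features[i][j] is the feature vector for track i and detection j.
    Assumes as many detections as tracks; only suitable for tiny examples.
    """
    n = len(pair_features)
    cost = [[nll(f) for f in row] for row in pair_features]
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best)


pair_features = [
    [(0.1, 0.4), (0.9, 3.0)],  # track 0 against detections 0 and 1
    [(0.8, 2.5), (0.3, 0.6)],  # track 1 against detections 0 and 1
]
assignment = best_assignment(pair_features)  # track i -> detection assignment[i]
```

    Because every cue enters one joint likelihood, no per-cue weights need hand tuning; a real tracker would replace the brute-force search with a linear assignment solver.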

User Identity Linkage in Social Media Using Linguistic and Social Interaction Features

  • paper_url: http://arxiv.org/abs/2308.11684
  • repo_url: None
  • paper_authors: Despoina Chatzakou, Juan Soler-Company, Theodora Tsikrika, Leo Wanner, Stefanos Vrochidis, Ioannis Kompatsiaris
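  • for: Preventing the spread of abusive or illegal content on social media by linking the multiple accounts that users create to retain their online identity after suspensions.
  • methods: A machine learning-based detection model that uses multiple attributes of users' online activity to determine whether two or more virtual identities belong to the same real person.
  • results: The model's efficacy is demonstrated on two cases involving abusive and terrorism-related Twitter content.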
    Abstract Social media users often hold several accounts in their effort to multiply the spread of their thoughts, ideas, and viewpoints. In the particular case of objectionable content, users tend to create multiple accounts to bypass the combating measures enforced by social media platforms and thus retain their online identity even if some of their accounts are suspended. User identity linkage aims to reveal social media accounts likely to belong to the same natural person so as to prevent the spread of abusive/illegal activities. To this end, this work proposes a machine learning-based detection model, which uses multiple attributes of users' online activity in order to identify whether two or more virtual identities belong to the same real natural person. The model's efficacy is demonstrated on two cases of abusive and terrorism-related Twitter content.

Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions

  • paper_url: http://arxiv.org/abs/2308.11483
  • repo_url: None
  • paper_authors: Pouya Pezeshkpour, Estevam Hruschka
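  • for: This paper examines the robustness of Large Language Models (LLMs) on NLP tasks, focusing on multiple-choice questions.
  • methods: The authors probe this sensitivity by reordering the answer options, including in few-shot settings with demonstrations.
  • results: Reordering the options changes LLM performance considerably, with gaps of roughly 13% to 75% across benchmarks. Through detailed analysis, the authors conjecture that the sensitivity arises when a model is uncertain between its top choices, and that specific option placements can favor one prediction among them.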
    Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown these models are sensitive to prompt wording and to few-shot demonstrations and their order, posing challenges to fair assessment of these models. As these models become more powerful, it becomes imperative to understand and address these limitations. In this paper, we focus on the robustness of LLMs on multiple-choice questions, a task commonly adopted to study the reasoning and fact-retrieving capability of LLMs. Investigating the sensitivity of LLMs towards the order of options in multiple-choice questions, we demonstrate a considerable performance gap of approximately 13% to 75% in LLMs on different benchmarks when answer options are reordered, even when using demonstrations in a few-shot setting. Through a detailed analysis, we conjecture that this sensitivity arises when LLMs are uncertain about the prediction between the top-2/3 choices, and that specific option placements may, because of positional bias, favor certain predictions among those top choices depending on the question. We also identify patterns in top-2 choices that amplify or mitigate the model's bias toward option placement. We found that for amplifying bias, the optimal strategy involves positioning the top two choices as the first and last options. Conversely, to mitigate bias, we recommend placing these choices among the adjacent options. To validate our conjecture, we conduct various experiments and adopt two approaches to calibrate LLMs' predictions, leading to up to 8 percentage points improvement across different models and benchmarks.
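    One way to see how option order alone can open such a gap is to evaluate a model on every permutation of each question's options. The sketch below is a minimal illustration and not the paper's evaluation protocol: `predict` is a hypothetical callable that returns the index of the chosen option, and the toy model simply exhibits an extreme positional bias.

```python
from itertools import permutations


def order_sensitivity(questions, predict):
    """Return (worst-case, best-case) accuracy over all option orderings.

    questions: list of (text, options, correct_option) tuples.
    predict:   hypothetical callable (text, options) -> index of chosen option.
    A question counts toward the worst case only if it is answered correctly
    under every ordering, and toward the best case if at least one ordering
    leads to the correct answer.
    """
    always_right = 0
    ever_right = 0
    for text, options, answer in questions:
        outcomes = []
        for perm in permutations(options):
            choice = predict(text, list(perm))
            outcomes.append(perm[choice] == answer)
        always_right += all(outcomes)
        ever_right += any(outcomes)
    n = len(questions)
    return always_right / n, ever_right / n


def first_option_model(text, options):
    """Toy model with an extreme positional bias: always picks option 0."""
    return 0


questions = [
    ("2 + 2 = ?", ["3", "4", "5"], "4"),
    ("Capital of France?", ["Paris", "Rome", "Oslo"], "Paris"),
]
low, high = order_sensitivity(questions, first_option_model)
```

    For the toy model the two bounds are as far apart as possible, since its answer depends only on where the correct option is placed. Enumerating all orderings grows factorially with the number of options, so it is only practical for short option lists.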

Expecting The Unexpected: Towards Broad Out-Of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2308.11480
  • repo_url: https://github.com/servicenow/broad-openood
  • paper_authors: Charles Guille-Escuret, Pierre-André Noël, Ioannis Mitliagkas, David Vazquez, Joao Monteiro
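  • for: This study aims to improve the reliability of deployed machine learning systems by detecting out-of-distribution (OOD) inputs across a broad range of distribution shifts.
  • methods: The authors evaluate recent OOD detection methods on five distinct types of distribution shift and find that they reliably detect only unknown classes. To address this, they fit a generative model (a Gaussian mixture) to existing detection scores, which yields an ensemble approach to broad OOD detection.
  • results: Existing methods perform inconsistently across types of distribution shift, whereas the ensemble approach is more consistent and comprehensive and outperforms them. The benchmark is released as BROAD (Benchmarking Resilience Over Anomaly Diversity).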
    Abstract Improving the reliability of deployed machine learning systems often involves developing methods to detect out-of-distribution (OOD) inputs. However, existing research often narrowly focuses on samples from classes that are absent from the training set, neglecting other types of plausible distribution shifts. This limitation reduces the applicability of these methods in real-world scenarios, where systems encounter a wide variety of anomalous inputs. In this study, we categorize five distinct types of distribution shifts and critically evaluate the performance of recent OOD detection methods on each of them. We publicly release our benchmark under the name BROAD (Benchmarking Resilience Over Anomaly Diversity). Our findings reveal that while these methods excel in detecting unknown classes, their performance is inconsistent when encountering other types of distribution shifts. In other words, they only reliably detect unexpected inputs that they have been specifically designed to expect. As a first step toward broad OOD detection, we learn a generative model of existing detection scores with a Gaussian mixture. By doing so, we present an ensemble approach that offers a more consistent and comprehensive solution for broad OOD detection, demonstrating superior performance compared to existing methods. Our code to download BROAD and reproduce our experiments is publicly available.
    Summary Improving the reliability of deployed machine learning systems usually involves detecting out-of-distribution (OOD) inputs, but existing research mostly considers samples from classes absent from the training set and neglects other plausible distribution shifts, which limits its usefulness in practice. The study categorizes five distinct types of distribution shift, evaluates recent OOD detection methods on each, and releases the benchmark as BROAD (Benchmarking Resilience Over Anomaly Diversity). The methods excel at detecting unknown classes but are inconsistent on the other shift types. As a first step toward broad OOD detection, the authors fit a Gaussian mixture to existing detection scores and obtain an ensemble that is more consistent and outperforms existing methods. Code to download BROAD and reproduce the experiments is publicly available.
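    The ensemble idea can be sketched as follows: collect the scores that several existing detectors assign to in-distribution data, fit a density to those score vectors, and flag inputs whose scores are unlikely under it. This sketch swaps the paper's Gaussian mixture for a single diagonal Gaussian (in effect a one-component mixture) to stay short, and the detector scores are invented for illustration.

```python
import math
from statistics import mean, pstdev


def fit_diag_gaussian(score_vectors):
    """Fit a per-detector mean and standard deviation on in-distribution scores."""
    columns = list(zip(*score_vectors))
    # Guard against a zero standard deviation for constant columns.
    return [(mean(col), pstdev(col) or 1e-6) for col in columns]


def ensemble_ood_score(params, scores):
    """Negative log-likelihood (up to a constant); higher means more anomalous."""
    return sum(
        0.5 * ((s - mu) / sd) ** 2 + math.log(sd)
        for s, (mu, sd) in zip(scores, params)
    )


# Hypothetical scores from two detectors on held-out in-distribution inputs.
in_distribution = [(0.90, 0.10), (0.80, 0.20), (0.85, 0.15), (0.95, 0.05)]
params = fit_diag_gaussian(in_distribution)

typical = ensemble_ood_score(params, (0.88, 0.12))
unusual = ensemble_ood_score(params, (0.20, 0.90))
```

    An input is flagged when the joint pattern of detector scores is unlikely, even if no single detector crosses its own threshold. A full Gaussian mixture would additionally capture multi-modal score distributions and correlations between detectors.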

Revisiting column-generation-based matheuristic for learning classification trees

  • paper_url: http://arxiv.org/abs/2308.11477
  • repo_url: https://github.com/krooonal/col_gen_estimator
  • paper_authors: Krunal Kishor Patel, Guy Desaulniers, Andrea Lodi
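  • for: This paper aims to improve the learning of decision trees for classification problems in machine learning.
  • methods: The paper builds on a column-generation-based heuristic for learning decision trees and modifies it to improve efficiency and scalability.
  • results: For multiclass instances, the modified subproblem model significantly reduces the number of subproblems, and the data-dependent constraints of the master problem are shown to be implied and are used as cutting planes. Computational results show that these modifications improve scalability.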
    Abstract Decision trees are highly interpretable models for solving classification problems in machine learning (ML). The standard ML algorithms for training decision trees are fast but generate suboptimal trees in terms of accuracy. Other discrete optimization models in the literature address the optimality problem but only work well on relatively small datasets. Firat et al. (2020) proposed a column-generation-based heuristic approach for learning decision trees. This approach improves scalability and can work with large datasets. In this paper, we describe improvements to this column generation approach. First, we modify the subproblem model to significantly reduce the number of subproblems in multiclass classification instances. Next, we show that the data-dependent constraints in the master problem are implied, and use them as cutting planes. Furthermore, we describe a separation model to generate data points for which the linear programming relaxation solution violates their corresponding constraints. We conclude by presenting computational results that show that these modifications result in better scalability.

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

  • paper_url: http://arxiv.org/abs/2308.11473
  • repo_url: https://github.com/buaacyw/it3d-text-to-3d
  • paper_authors: Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, Guosheng Lin
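  • for: This study aims to improve Text-to-3D generation, which distills knowledge from large text-to-image diffusion models (LDMs).
  • methods: Image-to-image pipelines powered by LDMs generate high-quality multi-view images, and a discriminator with a Diffusion-GAN dual training strategy guides the training of the 3D model.
  • results: Experiments show that the method outperforms baseline approaches and better handles common Text-to-3D problems such as over-saturation, inadequate detail, and unrealistic outputs.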
    Abstract Recent strides in Text-to-3D techniques have been propelled by distilling knowledge from powerful large text-to-image diffusion models (LDMs). Nonetheless, existing Text-to-3D approaches often grapple with challenges such as over-saturation, inadequate detailing, and unrealistic outputs. This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues. Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images based on the renderings of coarse 3D models. Although the generated images mostly alleviate the aforementioned issues, challenges such as view inconsistency and significant content variance persist due to the inherent generative nature of large diffusion models, posing extensive difficulties in leveraging these images effectively. To overcome this hurdle, we advocate integrating a discriminator alongside a novel Diffusion-GAN dual training strategy to guide the training of 3D models. For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data. We conduct a comprehensive set of experiments that demonstrate the effectiveness of our method over baseline approaches.
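    Summary Recent progress in Text-to-3D techniques has been driven by distilling knowledge from large text-to-image diffusion models (LDMs). Existing Text-to-3D methods nevertheless suffer from over-saturation, inadequate detail, and unrealistic outputs. This study addresses these problems with explicitly synthesized multi-view images: image-to-image pipelines powered by LDMs generate posed, high-quality images from the renderings of coarse 3D models. The generated images largely remove the problems above, but view inconsistency and content variance remain because of the generative nature of large diffusion models. To overcome this, the authors add a discriminator and a Diffusion-GAN dual training strategy to guide the training of the 3D model. The discriminator treats the synthesized multi-view images as real data and the renderings of the optimized 3D model as fake data. A series of experiments shows that the method outperforms the baselines.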

Dynamic Open Vocabulary Enhanced Safe-landing with Intelligence (DOVESEI)

  • paper_url: http://arxiv.org/abs/2308.11471
  • repo_url: https://github.com/mistlab/dovesei
  • paper_authors: Haechan Mark Bong, Rongge Zhang, Ricardo de Azambuja, Giovanni Beltrame
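  • for: This work targets safe landing for urban airborne robots.
  • methods: The system uses visual servoing built on open vocabulary image segmentation, which adapts to different scenarios without extensive data collection for refining internal models.
  • results: Starting from an altitude of 100 meters, the system executes landing maneuvers down to altitudes as low as 20 meters, and the dynamic focus mechanism improves the landing success rate almost tenfold compared with global segmentation.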
    Abstract This work targets what we consider to be the foundational step for urban airborne robots, a safe landing. Our attention is directed toward what we deem the most crucial aspect of the safe landing perception stack: segmentation. We present a streamlined reactive UAV system that employs visual servoing by harnessing the capabilities of open vocabulary image segmentation. This approach can adapt to various scenarios with minimal adjustments, bypassing the necessity for extensive data accumulation for refining internal models, thanks to its open vocabulary methodology. Given the limitations imposed by local authorities, our primary focus centers on operations originating from altitudes of 100 meters. This choice is deliberate, as numerous preceding works have dealt with altitudes up to 30 meters, aligning with the capabilities of small stereo cameras. Consequently, we leave the remaining 20m to be navigated using conventional 3D path planning methods. Utilizing monocular cameras and image segmentation, our findings demonstrate the system's capability to successfully execute landing maneuvers at altitudes as low as 20 meters. However, this approach is vulnerable to intermittent and occasionally abrupt fluctuations in the segmentation between frames in a video stream. To address this challenge, we enhance the image segmentation output by introducing what we call a dynamic focus: a masking mechanism that self adjusts according to the current landing stage. This dynamic focus guides the control system to avoid regions beyond the drone's safety radius projected onto the ground, thus mitigating the problems with fluctuations. Through the implementation of this supplementary layer, our experiments have reached improvements in the landing success rate of almost tenfold when compared to global segmentation. All the source code is open source and available online (github.com/MISTLab/DOVESEI).

Internal Cross-layer Gradients for Extending Homogeneity to Heterogeneity in Federated Learning

  • paper_url: http://arxiv.org/abs/2308.11464
  • repo_url: None
  • paper_authors: Yun-Hin Chan, Rui Zhou, Running Zhao, Zhihan Jiang, Edith C. -H. Ngai
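  • for: Extending model-homogeneous Federated Learning methods so that they can handle system heterogeneity.
  • methods: Internal cross-layer gradients, a mixture of gradients from shallow and deep layers within a server model, increase the similarity of the deep layers without additional communication between clients.
  • results: Experiments validate the effectiveness of InCo Aggregation and show that internal cross-layer gradients are a promising way to improve performance in heterogeneous FL.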
    Abstract Federated learning (FL) inevitably confronts the challenge of system heterogeneity in practical scenarios. To enhance the capabilities of most model-homogeneous FL methods in handling system heterogeneity, we propose a training scheme that can extend their capabilities to cope with this challenge. In this paper, we commence our study with a detailed exploration of homogeneous and heterogeneous FL settings and discover three key observations: (1) a positive correlation between client performance and layer similarities, (2) higher similarities in the shallow layers in contrast to the deep layers, and (3) smoother gradient distributions indicate higher layer similarities. Building upon these observations, we propose InCo Aggregation that leverages internal cross-layer gradients, a mixture of gradients from shallow and deep layers within a server model, to augment the similarity in the deep layers without requiring additional communication between clients. Furthermore, our methods can be tailored to accommodate model-homogeneous FL methods such as FedAvg, FedProx, FedNova, Scaffold, and MOON, to expand their capabilities to handle the system heterogeneity. Copious experimental results validate the effectiveness of InCo Aggregation, spotlighting internal cross-layer gradients as a promising avenue to enhance the performance in heterogeneous FL.
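    Summary Federated learning (FL) faces the challenge of system heterogeneity in practice. To strengthen most model-homogeneous FL methods against it, the authors propose a training scheme that extends their capabilities. A detailed study of homogeneous and heterogeneous FL settings yields three key observations: (1) client performance correlates positively with layer similarity, (2) similarity is higher in the shallow layers than in the deep layers, and (3) smoother gradient distributions indicate higher layer similarity. Building on these, InCo Aggregation uses internal cross-layer gradients, which mix gradients from the shallow and deep layers of the server model, to increase the similarity of the deep layers without extra communication between clients. The method is compatible with model-homogeneous FL methods such as FedAvg, FedProx, FedNova, Scaffold, and MOON, extending them to heterogeneous systems. Experiments show that InCo Aggregation is effective and highlight internal cross-layer gradients as a promising way to improve performance in heterogeneous FL.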

A Survey on Self-Supervised Representation Learning

  • paper_url: http://arxiv.org/abs/2308.11455
  • repo_url: https://github.com/microsoft/esvit
  • paper_authors: Tobias Uelwer, Jan Robine, Stefan Sylvius Wagner, Marc Höftmann, Eric Upschulte, Sebastian Konietzny, Maike Behrendt, Stefan Harmeling
  • for: This paper provides a comprehensive review of methods for learning image representations without supervision; the representations can be used in downstream tasks such as classification or object detection.
  • methods: The survey presents these methods in a unified notation, points out their similarities and differences, proposes a taxonomy that relates them to each other, and summarizes recent experimental results as a meta-study.
  • results: According to the results reported in the literature, the quality of these representations is close to that of supervised learning, while no labeled images are needed.
    Abstract Learning meaningful representations is at the heart of many tasks in the field of modern machine learning. Recently, a lot of methods were introduced that allow learning of image representations without supervision. These representations can then be used in downstream tasks like classification or object detection. The quality of these representations is close to supervised learning, while no labeled images are needed. This survey paper provides a comprehensive review of these methods in a unified notation, points out similarities and differences of these methods, and proposes a taxonomy which sets these methods in relation to each other. Furthermore, our survey summarizes the most-recent experimental results reported in the literature in form of a meta-study. Our survey is intended as a starting point for researchers and practitioners who want to dive into the field of representation learning.
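    Summary Learning meaningful representations is central to modern machine learning. Many recently introduced methods learn image representations without supervision, and these representations can be used in downstream tasks such as classification or object detection. Their quality is close to that of supervised learning, yet no labeled images are needed. The survey presents these methods in a unified notation, points out their similarities and differences, and proposes a taxonomy for them. It also summarizes the most recent experimental results in the literature as a meta-study. It is intended as a starting point for researchers and practitioners entering the field of representation learning.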

Convergence guarantee for consistency models

  • paper_url: http://arxiv.org/abs/2308.11449
  • repo_url: None
  • paper_authors: Junlong Lyu, Zhitang Chen, Shoubo Feng
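  • for: This paper provides the first convergence guarantees for Consistency Models (CMs), one-step generative models that can generate samples comparable to those of Diffusion Models.
  • methods: The analysis relies on basic assumptions on score-matching errors, consistency errors, and smoothness of the data distribution to show that CMs can efficiently sample from any realistic data distribution in one step with small $W_2$ error.
  • results: (1) The results hold under $L^2$-accurate score and consistency assumptions; (2) they do not require strong assumptions on the data distribution such as a log-Sobolev inequality; (3) the bounds scale polynomially in all parameters; and (4) the Multistep Consistency Sampling procedure can further reduce the error.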
    Abstract We provide the first convergence guarantees for the Consistency Models (CMs), a newly emerging type of one-step generative models that can generate comparable samples to those generated by Diffusion Models. Our main result is that, under the basic assumptions on score-matching errors, consistency errors and smoothness of the data distribution, CMs can efficiently sample from any realistic data distribution in one step with small $W_2$ error. Our results (1) hold for $L^2$-accurate score and consistency assumption (rather than $L^\infty$-accurate); (2) do not require strong assumptions on the data distribution such as log-Sobolev inequality; (3) scale polynomially in all parameters; and (4) match the state-of-the-art convergence guarantee for score-based generative models (SGMs). We also provide the result that the Multistep Consistency Sampling procedure can further reduce the error compared to one-step sampling, which supports the original statement of "Consistency Models, Yang Song 2023". Our result further implies a TV error guarantee when some Langevin-based modifications are applied to the output distributions.
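    Summary The paper provides the first convergence guarantees for Consistency Models (CMs), a recently emerging type of one-step generative model that can generate samples comparable to those of Diffusion Models. The main result is that, under basic assumptions on the score-matching error, the consistency error, and the smoothness of the data distribution, CMs can efficiently sample from any realistic data distribution in one step with small $W_2$ error. The results (1) hold under $L^2$-accurate score and consistency assumptions (rather than $L^\infty$-accurate); (2) do not require strong assumptions on the data distribution such as a log-Sobolev inequality; (3) scale polynomially in all parameters; and (4) match the state-of-the-art convergence guarantee for score-based generative models (SGMs). The Multistep Consistency Sampling procedure is shown to reduce the error further compared with one-step sampling, which supports the original statement of "Consistency Models, Yang Song 2023". The results also imply a TV error guarantee when some Langevin-based modifications are applied to the output distributions.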
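    For readers unfamiliar with the two error measures named above, the standard definitions are given below. The symbols $\mu$ and $\nu$ (two probability distributions), $\Gamma(\mu, \nu)$ (the set of their couplings), and $A$ (a measurable set) are notation introduced here and do not come from the abstract.

```latex
W_2(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)}
    \int \lVert x - y \rVert^2 \, \mathrm{d}\gamma(x, y) \right)^{1/2},
\qquad
\mathrm{TV}(\mu, \nu) = \sup_{A} \, \lvert \mu(A) - \nu(A) \rvert
```

    The Wasserstein-2 distance measures how far probability mass must be moved, while total variation compares the probabilities assigned to events. A small $W_2$ error therefore does not by itself bound the TV error, which is consistent with the abstract treating the TV guarantee as a separate result.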

Aspect-oriented Opinion Alignment Network for Aspect-Based Sentiment Classification

  • paper_url: http://arxiv.org/abs/2308.11447
  • repo_url: https://github.com/aone-nlp/absa-aoan
  • paper_authors: Xueyi Liu, Rui Hou, Yanglei Gan, Da Luo, Changlin Li, Xiaojun Shi, Qiao Liu
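  • for: This paper proposes a new method for the semantic mismatch problem in multi-aspect sentences, with the goal of improving fine-grained sentiment analysis.
  • methods: The proposed Aspect-oriented Opinion Alignment Network (AOAN) combines a neighboring span enhanced module with a multi-perspective attention mechanism to capture the contextual association between opinion words and the corresponding aspect.
  • results: Experiments on three benchmark datasets show that the model achieves state-of-the-art results.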
    Abstract Aspect-based sentiment classification is a crucial problem in fine-grained sentiment analysis, which aims to predict the sentiment polarity of the given aspect according to its context. Previous works have made remarkable progress in leveraging attention mechanisms to extract opinion words for different aspects. However, a persistent challenge is the effective management of semantic mismatches, which stem from attention mechanisms that fall short in adequately aligning opinion words with their corresponding aspect in multi-aspect sentences. To address this issue, we propose a novel Aspect-oriented Opinion Alignment Network (AOAN) to capture the contextual association between opinion words and the corresponding aspect. Specifically, we first introduce a neighboring span enhanced module which highlights various compositions of neighboring words and given aspects. In addition, we design a multi-perspective attention mechanism that aligns relevant opinion information with respect to the given aspect. Extensive experiments on three benchmark datasets demonstrate that our model achieves state-of-the-art results. The source code is available at https://github.com/AONE-NLP/ABSA-AOAN.
    Summary Aspect-based sentiment classification, a key problem in fine-grained sentiment analysis, aims to predict the sentiment polarity of a given aspect from its context. Previous work has made considerable progress by using attention mechanisms to extract opinion words for different aspects. A persistent challenge, however, is handling semantic mismatches, which arise when attention mechanisms fail to align opinion words adequately with their corresponding aspect. To address this, the authors propose the Aspect-oriented Opinion Alignment Network (AOAN), which captures the contextual association between opinion words and the corresponding aspect.

Exploration of Rashomon Set Assists Explanations for Medical Data

  • paper_url: http://arxiv.org/abs/2308.11446
  • repo_url: None
  • paper_authors: Katarzyna Kobylińska, Mateusz Krzyziński, Rafał Machowicz, Mariusz Adamek, Przemysław Biecek
  • for: This paper addresses the problem of relying solely on performance metrics in machine learning modeling, particularly in medical and healthcare studies, by introducing a novel process for exploring Rashomon set models.
  • methods: The proposed approach uses the $\texttt{Rashomon\_DETECT}$ algorithm to identify the most different models within the Rashomon set, and the Profile Disparity Index (PDI) to quantify differences in variable effects among models.
  • results: The approach is demonstrated on a foundational case study of predicting survival among hemophagocytic lymphohistiocytosis (HLH) patients and is benchmarked on other medical data sets, showing its versatility and utility in various contexts.
    Abstract The machine learning modeling process conventionally culminates in selecting a single model that maximizes a selected performance metric. However, this approach leads to abandoning a more profound analysis of slightly inferior models. Particularly in medical and healthcare studies, where the objective extends beyond predictions to valuable insight generation, relying solely on performance metrics can result in misleading or incomplete conclusions. This problem is particularly pertinent when dealing with a set of models with performance close to the maximum, known as the $\textit{Rashomon set}$. Such a set can be numerous and may contain models describing the data in a different way, which calls for comprehensive analysis. This paper introduces a novel process to explore Rashomon set models, extending the conventional modeling approach. The cornerstone is the identification of the most different models within the Rashomon set, facilitated by the introduced $\texttt{Rashomon\_DETECT}$ algorithm. This algorithm compares profiles illustrating prediction dependencies on variable values generated by eXplainable Artificial Intelligence (XAI) techniques. To quantify differences in variable effects among models, we introduce the Profile Disparity Index (PDI) based on measures from functional data analysis. To illustrate the effectiveness of our approach, we showcase its application in predicting survival among hemophagocytic lymphohistiocytosis (HLH) patients - a foundational case study. Additionally, we benchmark our approach on other medical data sets, demonstrating its versatility and utility in various contexts.
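    Summary The conventional machine learning modeling process ends with selecting the single model that maximizes a chosen performance metric, which discards a deeper analysis of slightly inferior models. In medical and healthcare studies in particular, where the goal is insight as well as prediction, relying on performance metrics alone can lead to misleading or incomplete conclusions. The problem is most acute for a set of models whose performance is close to the maximum, known as the Rashomon set. Such a set can be large and can contain models that describe the data in different ways, which calls for comprehensive analysis. The paper introduces a process for exploring Rashomon set models that extends the conventional approach. Its core is the identification of the most different models in the Rashomon set with the introduced $\texttt{Rashomon\_DETECT}$ algorithm, which compares profiles, generated with eXplainable Artificial Intelligence (XAI) techniques, of how predictions depend on variable values. To quantify differences in variable effects among models, the authors introduce the Profile Disparity Index (PDI), based on measures from functional data analysis. The approach is illustrated on survival prediction for hemophagocytic lymphohistiocytosis (HLH) patients and benchmarked on other medical data sets to show its versatility and utility.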
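    The two steps described above, collecting near-optimal models and then finding the ones that behave most differently, can be sketched in a few lines. This is not the $\texttt{Rashomon\_DETECT}$ algorithm and does not compute the PDI, whose definitions are not given in the abstract; the tolerance `epsilon`, the model names, the scores, and the mean squared difference between profiles are all assumptions chosen for illustration.

```python
from itertools import combinations


def rashomon_set(scores, epsilon):
    """Keep every model whose score is within epsilon of the best score."""
    best = max(scores.values())
    return [name for name, score in scores.items() if score >= best - epsilon]


def profile_distance(p, q):
    """Mean squared difference between two profiles on a shared grid.

    A simple stand-in for a disparity measure between variable-effect profiles.
    """
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def most_different_pair(names, profiles):
    """Return the pair of models whose profiles are farthest apart."""
    return max(
        combinations(names, 2),
        key=lambda pair: profile_distance(profiles[pair[0]], profiles[pair[1]]),
    )


# Hypothetical validation scores and one-variable prediction profiles.
scores = {"rf": 0.81, "gbm": 0.80, "logit": 0.79, "tree": 0.70}
profiles = {
    "rf": [0.1, 0.3, 0.6, 0.8],
    "gbm": [0.1, 0.35, 0.6, 0.85],
    "logit": [0.8, 0.6, 0.4, 0.2],
    "tree": [0.5, 0.5, 0.5, 0.5],
}

candidates = rashomon_set(scores, epsilon=0.03)
pair = most_different_pair(candidates, profiles)
```

    In this toy data the three near-optimal models score almost identically, yet one of them predicts the opposite trend for the variable, which is the kind of disagreement the paper argues should be inspected rather than discarded.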

Inferring gender from name: a large scale performance evaluation study

  • paper_url: http://arxiv.org/abs/2308.12381
  • repo_url: None
  • paper_authors: Kriste Krstovski, Yao Lu, Ye Xu
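  • for: This paper presents a large-scale performance evaluation of algorithms and software products for name-to-gender inference and proposes two new hybrid approaches that achieve better performance.
  • methods: The analysis uses a variety of large annotated datasets of names, and two new hybrid approaches are proposed.
  • results: The existing approaches had not previously been systematically evaluated and compared, and the two proposed hybrid approaches achieve better performance than any single existing approach.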
    Abstract A person's gender is a crucial piece of information when performing research across a wide range of scientific disciplines, such as medicine, sociology, political science, and economics, to name a few. However, in increasing instances, especially given the proliferation of big data, gender information is not readily available. In such cases researchers need to infer gender from readily available information, primarily from persons' names. While inferring gender from name may raise some ethical questions, the lack of viable alternatives means that researchers have to resort to such approaches when the goal justifies the means - in the majority of such studies the goal is to examine patterns and determinants of gender disparities. The necessity of name-to-gender inference has generated an ever-growing domain of algorithmic approaches and software products. These approaches have been used throughout the world in academia, industry, governmental and non-governmental organizations. Nevertheless, the existing approaches have yet to be systematically evaluated and compared, making it challenging to determine the optimal approach for future research. In this work, we conducted a large scale performance evaluation of existing approaches for name-to-gender inference. Analyses are performed using a variety of large annotated datasets of names. We further propose two new hybrid approaches that achieve better performance than any single existing approach.
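    Summary A person's gender is a crucial piece of information for research in many scientific disciplines, including medicine, sociology, political science, and economics. With the proliferation of big data, however, gender information is increasingly often not readily available. In such cases researchers need to infer gender from available information, primarily from names. Although inferring gender from names may raise ethical questions, the lack of viable alternatives means that researchers resort to such approaches to reach their research goals. These approaches have been used throughout the world in academia, industry, and governmental and non-governmental organizations. Nevertheless, the existing approaches have not been systematically evaluated and compared, which makes it hard to choose the best approach for future research. In this work, we conduct a large-scale performance evaluation of existing approaches for name-to-gender inference. The analysis uses a variety of large annotated datasets of names. We further propose two new hybrid approaches that achieve better performance than any single existing approach.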

A Survey on Large Language Model based Autonomous Agents

  • paper_url: http://arxiv.org/abs/2308.11432
  • repo_url: https://github.com/paitesanshi/llm-agent-survey
  • paper_authors: Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen
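  • for: This survey reviews research on autonomous agents based on large language models (LLMs), covering the construction of LLM-based agents, their application domains, and evaluation strategies.
  • methods: The survey considers LLMs that have acquired vast amounts of web knowledge and proposes a unified framework that encompasses a majority of previous work.
  • results: By summarizing the applications and evaluation strategies of LLM-based agents, the survey presents several challenges and future directions; related references are maintained at https://github.com/Paitesanshi/LLM-Agent-Survey.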
    Abstract Autonomous agents have long been a prominent research topic in the academic community. Previous research in this field often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from the human learning processes, and thus makes the agents hard to achieve human-like decisions. Recently, through the acquisition of vast amounts of web knowledge, large language models (LLMs) have demonstrated remarkable potential in achieving human-level intelligence. This has sparked an upsurge in studies investigating autonomous agents based on LLMs. To harness the full potential of LLMs, researchers have devised diverse agent architectures tailored to different applications. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of the field of autonomous agents from a holistic perspective. More specifically, our focus lies in the construction of LLM-based agents, for which we propose a unified framework that encompasses a majority of the previous work. Additionally, we provide a summary of the various applications of LLM-based AI agents in the domains of social science, natural science, and engineering. Lastly, we discuss the commonly employed evaluation strategies for LLM-based AI agents. Based on the previous studies, we also present several challenges and future directions in this field. To keep track of this field and continuously update our survey, we maintain a repository for the related references at https://github.com/Paitesanshi/LLM-Agent-Survey.
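    Summary Autonomous agents have long been a prominent research topic in academia. Earlier work typically trained agents with limited knowledge in isolated environments, which differs from how humans learn and makes it hard for agents to reach human-like decisions. Recently, by acquiring vast amounts of web knowledge, large language models (LLMs) have shown remarkable potential for human-level intelligence, which has led to a rapid increase in research on LLM-based autonomous agents. To harness the potential of LLMs, researchers have designed a variety of agent architectures tailored to different applications. This paper gives a systematic review of these studies. We focus on the construction of LLM-based agents and propose a unified framework that covers most previous work. We also summarize applications of LLM-based AI agents in social science, natural science, and engineering. Finally, we discuss evaluation strategies for LLM-based AI agents and, based on previous studies, present several challenges and future directions. To keep track of the field and continuously update the survey, we maintain a repository of related references at https://github.com/Paitesanshi/LLM-Agent-Survey.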

A Study on the Impact of Non-confounding Covariates on the Inferential Performance of Methods based on the Potential Outcome Framework

  • paper_url: http://arxiv.org/abs/2308.11676
  • repo_url: None
  • paper_authors: Yonghe Zhao, Shuai Fu, Huiyan Sun
  • for: The paper is written to provide a unified graphical framework for causal inference models based on the Potential Outcome Framework (POF), and to analyze the influence of non-confounding covariates on the inference performance of these models.
  • methods: The paper uses a graphical framework to present the underlying principles of causal inference models based on the POF, and conducts extensive experiments on synthetic datasets to validate the theoretical conclusions.
  • results: The paper finds that the optimal scenario for eliminating confounding bias is for the covariates to exclusively encompass confounders, and that adjustment variables contribute to more accurate inferences in the task of inferring counterfactual outcomes.
    Abstract The Potential Outcome Framework (POF) plays a prominent role in the field of causal inference. Most causal inference models based on the POF (CIMs-B-POF) are designed for eliminating confounding bias and default to an underlying assumption of Confounding Covariates. This assumption posits that the covariates consist solely of confounders. However, the assumption of Confounding Covariates is challenging to maintain in practice, particularly when dealing with high-dimensional covariates. While certain methods have been proposed to differentiate the distinct components of covariates prior to conducting causal inference, the consequences of treating non-confounding covariates as confounders remain unclear. This ambiguity poses a potential risk when applying the CIMs-B-POF in practical scenarios. In this paper, we present a unified graphical framework for the CIMs-B-POF, which greatly enhances the comprehension of these models' underlying principles. Using this graphical framework, we quantitatively analyze the extent to which the inference performance of CIMs-B-POF is influenced when incorporating various types of non-confounding covariates, such as instrumental variables, mediators, colliders, and adjustment variables. The key findings are: in the task of eliminating confounding bias, the optimal scenario is for the covariates to exclusively encompass confounders; in the subsequent task of inferring counterfactual outcomes, the adjustment variables contribute to more accurate inferences. Furthermore, extensive experiments conducted on synthetic datasets consistently validate these theoretical conclusions.
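    Summary The Potential Outcome Framework (POF) plays an important role in causal inference. Most POF-based causal inference models (CIMs-B-POF) are designed to eliminate confounding bias and default to the Confounding Covariates assumption, namely that the covariates consist solely of confounders. This assumption is hard to maintain in practice, especially with high-dimensional covariates. Although some methods have been proposed to separate the different components of the covariates, the consequences of treating non-confounding covariates as confounders remain unclear, and this ambiguity poses a risk when applying CIMs-B-POF in practice. In this paper, we present a unified graphical framework for CIMs-B-POF that helps in understanding the underlying principles of these models. Using this framework, we quantitatively analyze how the inference performance of CIMs-B-POF is affected by different types of non-confounding covariates, such as instrumental variables, mediators, colliders, and adjustment variables. We find that, for eliminating confounding bias, the ideal case is for the covariates to contain only confounders, while in the subsequent task of inferring counterfactual outcomes, adjustment variables contribute to more accurate inference. Extensive experiments on synthetic datasets consistently validate these theoretical conclusions.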
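
    The risk of treating a non-confounding covariate as a confounder can be illustrated with a small simulation. The sketch below is not the paper's experimental setup: it assumes a linear data-generating process with one confounder and one collider, uses ordinary least squares as a stand-in for a POF-based model, and all names (`ols`, `simulate`) and coefficients are hypothetical.

```python
import random


def ols(rows, y):
    """Least-squares coefficients for y ~ rows (each row includes an intercept term)."""
    k = len(rows[0])
    # Normal equations (X^T X) beta = X^T y, stored as an augmented k x (k+1) matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def simulate(n=20000, seed=0):
    """Return treatment-effect estimates under three covariate sets (true effect is 2)."""
    rng = random.Random(seed)
    c = [rng.gauss(0, 1) for _ in range(n)]                          # confounder
    t = [ci + rng.gauss(0, 1) for ci in c]                           # treatment, caused by c
    y = [2 * ti + 3 * ci + rng.gauss(0, 1) for ti, ci in zip(t, c)]  # outcome
    z = [ti + yi + rng.gauss(0, 1) for ti, yi in zip(t, y)]          # collider, caused by t and y
    naive = ols([[1, ti] for ti in t], y)[1]
    adjusted = ols([[1, ti, ci] for ti, ci in zip(t, c)], y)[1]
    with_collider = ols([[1, ti, ci, zi] for ti, ci, zi in zip(t, c, z)], y)[1]
    return naive, adjusted, with_collider


if __name__ == "__main__":
    naive, adjusted, with_collider = simulate()
    print(f"no adjustment:          {naive:.2f}")
    print(f"confounder only:        {adjusted:.2f}")
    print(f"confounder + collider:  {with_collider:.2f}")
```

    Under these assumptions, adjusting for the confounder alone recovers an estimate close to the true effect of 2, while omitting it or additionally conditioning on the collider pushes the estimate well away from 2, in line with the finding that covariates should contain only confounders when removing confounding bias.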

AIxArtist: A First-Person Tale of Interacting with Artificial Intelligence to Escape Creative Block

  • paper_url: http://arxiv.org/abs/2308.11424
  • repo_url: None
  • paper_authors: Makayla Lewis
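  • for: This paper explores how artificial intelligence (AI) can support artists' creativity and what it means to be explainable in that context.
  • methods: The paper is first-person research in which an HCI researcher engaged with HIs, ChatGPT, and Midjourney while trying to escape a creative block.
  • results: The result is a series of reflections that require further discussion and exploration: transparency of attribution, the creation process, ethics of asking, and inspiration vs. copying.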
    Abstract The future of the arts and artificial intelligence (AI) is promising as technology advances. As the use of AI in design becomes more widespread, art practice may not be a human-only art form and could instead become a digitally integrated experience. With enhanced creativity and collaboration, arts and AI could work together towards creating artistic outputs that are visually appealing and meet the needs of the artist and viewer. While it is uncertain how far the integration will go, arts and AI will likely influence one another. This workshop pictorial puts forward first-person research that shares interactions between an HCI researcher and AI as they try to escape the creative block. The pictorial paper explores two questions: How can AI support artists' creativity, and what does it mean to be explainable in this context? HIs, ChatGPT and Midjourney were engaged; the result was a series of reflections that require further discussion and explorations in the XAIxArts community: Transparency of attribution, the creation process, ethics of asking, and inspiration vs copying.
    Summary This workshop pictorial presents first-person research that explores the interactions between an HCI researcher and AI as they try to escape a creative block. The pictorial paper examines two questions: how can AI support artists' creativity, and what does it mean to be explainable in this context? The research involved engaging with HIs, ChatGPT, and Midjourney, leading to a series of reflections that require further discussion and exploration in the XAIxArts community. These reflections include transparency of attribution, the creation process, ethics of asking, and inspiration vs. copying.

  • paper_url: http://arxiv.org/abs/2308.11421
  • repo_url: None
  • paper_authors: Alexander Wong, Saad Abbasi, Saeejith Nair
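  • for: The goal of this study is to design efficient vision transformer architectures that offer high throughput and low computational complexity.
  • methods: The study uses generative architecture search (GAS) to generate an efficient vision transformer architecture design built around mask unit attention and Q-pooling design patterns.
  • results: On the ImageNet-1K dataset, the TurboViT architecture design achieves significantly lower architectural and computational complexity than 10 other state-of-the-art efficient vision transformer designs within a similar range of accuracy. It also shows strong inference latency and throughput: in the low-latency scenario, latency is >3.21 times lower and throughput is >3.18 times higher than FasterViT-0.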
    Abstract Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years. However, the architectural and computational complexity of such network architectures have made them challenging to deploy in real-world applications with high-throughput, low-memory requirements. As such, there has been significant research recently on the design of efficient vision transformer architectures. In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search (GAS) to achieve a strong balance between accuracy and architectural and computational efficiency. Through this generative architecture search process, we create TurboViT, a highly efficient hierarchical vision transformer architecture design that is generated around mask unit attention and Q-pooling design patterns. The resulting TurboViT architecture design achieves significantly lower architectural computational complexity (>2.47$\times$ smaller than FasterViT-0 while achieving same accuracy) and computational complexity (>3.4$\times$ fewer FLOPs and 0.9% higher accuracy than MobileViT2-2.0) when compared to 10 other state-of-the-art efficient vision transformer network architecture designs within a similar range of accuracy on the ImageNet-1K dataset. Furthermore, TurboViT demonstrated strong inference latency and throughput in both low-latency and batch processing scenarios (>3.21$\times$ lower latency and >3.18$\times$ higher throughput compared to FasterViT-0 for low-latency scenario). These promising results demonstrate the efficacy of leveraging generative architecture search for generating efficient transformer architecture designs for high-throughput scenarios.
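    Summary Vision transformers have shown unprecedented performance on visual perception tasks in recent years. However, the architectural and computational complexity of these networks makes them hard to deploy in real-world applications with high-throughput, low-memory requirements, so there has been significant recent research on designing efficient vision transformer architectures. In this study, we use generative architecture search (GAS) to generate fast vision transformer architecture designs that balance accuracy against architectural and computational efficiency. Through this process we create TurboViT, a highly efficient hierarchical vision transformer architecture design built around mask unit attention and Q-pooling design patterns. TurboViT has >2.47 times lower architectural complexity than FasterViT-0 at the same accuracy, and >3.4 times fewer FLOPs with 0.9% higher accuracy than MobileViT2-2.0. In the low-latency scenario, TurboViT also has >3.21 times lower latency and >3.18 times higher throughput than FasterViT-0. These results demonstrate the effectiveness of generative architecture search for producing efficient transformer architecture designs.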

Tensor Regression

  • paper_url: http://arxiv.org/abs/2308.11419
  • repo_url: https://github.com/tensorly/torch
  • paper_authors: Jiani Liu, Ce Zhu, Zhen Long, Yipeng Liu
  • for: This paper is written for students, researchers, and practitioners who work with high dimensional data and are interested in tensor-based regression analysis.
  • methods: The paper provides a systematic study and analysis of tensor-based regression models and their applications, including a comprehensive review of existing methods, their core ideas, and theoretical characteristics.
  • results: The paper covers the basics of tensor-based regression, provides examples of how to use existing methods to solve specific regression tasks with multiway data, and discusses available datasets and software resources for efficient implementation.
    Abstract Regression analysis is a key area of interest in the field of data analysis and machine learning which is devoted to exploring the dependencies between variables, often using vectors. The emergence of high dimensional data in technologies such as neuroimaging, computer vision, climatology and social networks, has brought challenges to traditional data representation methods. Tensors, as high dimensional extensions of vectors, are considered as natural representations of high dimensional data. In this book, the authors provide a systematic study and analysis of tensor-based regression models and their applications in recent years. It groups and illustrates the existing tensor-based regression methods and covers the basics, core ideas, and theoretical characteristics of most tensor-based regression methods. In addition, readers can learn how to use existing tensor-based regression methods to solve specific regression tasks with multiway data, what datasets can be selected, and what software packages are available to start related work as soon as possible. Tensor Regression is the first thorough overview of the fundamentals, motivations, popular algorithms, strategies for efficient implementation, related applications, available datasets, and software resources for tensor-based regression analysis. It is essential reading for all students, researchers and practitioners of working on high dimensional data.
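    Summary Regression analysis is a key area of data analysis and machine learning, devoted to exploring dependencies between variables, often using vectors. The emergence of high dimensional data in fields such as neuroimaging, computer vision, climatology, and social networks has challenged traditional data representation methods. Tensors are a natural representation of high dimensional data. This book provides a systematic study and analysis of tensor-based regression models and their applications in recent years. It groups and illustrates the existing tensor-based regression methods and covers their basics, core ideas, and theoretical characteristics. Readers can also learn how to use existing tensor-based regression methods to solve specific regression tasks with multiway data, which datasets to select, and which software packages are available. Tensor Regression is presented as essential reading for students, researchers, and practitioners working on high dimensional data.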

Interpretable Distribution-Invariant Fairness Measures for Continuous Scores

  • paper_url: http://arxiv.org/abs/2308.11375
  • repo_url: None
  • paper_authors: Ann-Kristin Becker, Oana Dumitrasc, Klaus Broelemann
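  • for: The paper extends measures of algorithmic fairness, usually discussed for binary decisions, to continuous scores.
  • methods: The paper proposes distributionally invariant fairness measures based on the Wasserstein distance that are easily computable and suited to comparing biases across different models, datasets, or time points.
  • results: Experiments on commonly used fairness benchmark datasets show that the proposed measures can quantify significant group disparities that ROC-based fairness measures miss.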
    Abstract Measures of algorithmic fairness are usually discussed in the context of binary decisions. We extend the approach to continuous scores. So far, ROC-based measures have mainly been suggested for this purpose. Other existing methods depend heavily on the distribution of scores, are unsuitable for ranking tasks, or their effect sizes are not interpretable. Here, we propose a distributionally invariant version of fairness measures for continuous scores with a reasonable interpretation based on the Wasserstein distance. Our measures are easily computable and well suited for quantifying and interpreting the strength of group disparities as well as for comparing biases across different models, datasets, or time points. We derive a link between the different families of existing fairness measures for scores and show that the proposed distributionally invariant fairness measures outperform ROC-based fairness measures because they are more explicit and can quantify significant biases that ROC-based fairness measures miss. Finally, we demonstrate their effectiveness through experiments on the most commonly used fairness benchmark datasets.
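    Summary Measures of algorithmic fairness are usually discussed in the context of binary decisions. We extend the approach to continuous scores. So far, mainly ROC-based measures have been proposed for this purpose. Other existing methods depend heavily on the score distribution, are unsuitable for ranking tasks, or have effect sizes that cannot be interpreted. We propose a distributionally invariant version of fairness measures for continuous scores, based on the Wasserstein distance. The measures are easy to compute and well suited to quantifying and interpreting the strength of group disparities and to comparing biases across models, datasets, or time points. We also derive a link between the different families of existing fairness measures for scores and show that the proposed measures outperform ROC-based fairness measures because they are more explicit and can quantify significant biases that ROC-based measures overlook. Finally, we demonstrate their effectiveness through experiments on the most commonly used fairness benchmark datasets.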
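
    As a rough illustration of the building block named above, the following sketch computes the plain 1-Wasserstein distance between the score samples of two groups. It is not the paper's distributionally invariant measure; the function name, the quantile grid, and the example scores are assumptions made for illustration.

```python
def wasserstein_1(scores_a, scores_b, grid=1000):
    """Approximate 1-Wasserstein distance between two empirical score samples.

    Uses the quantile form W1 = integral over q in (0, 1) of |Qa(q) - Qb(q)|,
    evaluated on a midpoint grid of quantile levels.
    """
    a, b = sorted(scores_a), sorted(scores_b)

    def quantile(sorted_vals, q):
        # Empirical quantile: inverse of the step-function CDF.
        idx = min(int(q * len(sorted_vals)), len(sorted_vals) - 1)
        return sorted_vals[idx]

    levels = [(i + 0.5) / grid for i in range(grid)]
    return sum(abs(quantile(a, q) - quantile(b, q)) for q in levels) / grid


if __name__ == "__main__":
    # Hypothetical scores for two groups; group B is shifted upward by 0.2.
    group_a = [0.1, 0.2, 0.3, 0.4]
    group_b = [0.3, 0.4, 0.5, 0.6]
    print(wasserstein_1(group_a, group_b))
```

    A distance of zero means the two groups have the same score distribution; because the value is in score units, it changes when scores are rescaled, which is the kind of distribution dependence the paper's invariant version is meant to avoid.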

How Much Temporal Long-Term Context is Needed for Action Segmentation?

  • paper_url: http://arxiv.org/abs/2308.11358
  • repo_url: https://github.com/ltcontext/ltcontext
  • paper_authors: Emad Bahrami, Gianpiero Francesca, Juergen Gall
  • for: This paper addresses temporal action segmentation in videos and asks how much long-term temporal context is needed.
  • methods: The paper introduces a transformer-based model that uses sparse attention to capture the full context of a video.
  • results: Experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
    Abstract Modeling long-term context in videos is crucial for many fine-grained tasks including temporal action segmentation. An interesting question that is still open is how much long-term temporal context is needed for optimal performance. While transformers can model the long-term context of a video, this becomes computationally prohibitive for long videos. Recent works on temporal action segmentation thus combine temporal convolutional networks with self-attentions that are computed only for a local temporal window. While these approaches show good results, their performance is limited by their inability to capture the full context of a video. In this work, we try to answer how much long-term temporal context is required for temporal action segmentation by introducing a transformer-based model that leverages sparse attention to capture the full context of a video. We compare our model with the current state of the art on three datasets for temporal action segmentation, namely 50Salads, Breakfast, and Assembly101. Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
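    Summary Modeling long-term context in videos is crucial for many fine-grained tasks, including temporal action segmentation. An open question is how much long-term temporal context is needed for optimal performance. Transformers can model the long-term context of a video, but this becomes computationally expensive for long videos. Recent temporal action segmentation methods therefore combine temporal convolutional networks with self-attention computed only over a local temporal window, and their performance is limited by the inability to capture the full context of a video. In this work, we try to answer how much long-term temporal context is required for temporal action segmentation by introducing a transformer-based model that uses sparse attention to capture the full context of a video. We compare the model with the current state of the art on three datasets: 50Salads, Breakfast, and Assembly101. The experiments show that modeling the full context of a video is necessary to obtain the best performance.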
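
    To make the idea of sparse attention over a whole video concrete, the sketch below builds a boolean attention mask that combines a local window with a strided long-range pattern. This pattern is an assumption chosen for illustration and is not taken from the paper; `sparse_attention_mask` and `density` are hypothetical helper names.

```python
def sparse_attention_mask(num_frames, window=4):
    """Return mask[i][j] = True where frame i may attend to frame j.

    Two patterns are combined (an illustrative choice, not the paper's design):
      - local:      frames in the same non-overlapping window of size `window`
      - long-range: frames at the same offset inside their windows, which
                    spans the whole video with a stride of `window`
    """
    mask = [[False] * num_frames for _ in range(num_frames)]
    for i in range(num_frames):
        for j in range(num_frames):
            same_window = i // window == j // window
            same_offset = i % window == j % window
            mask[i][j] = same_window or same_offset
    return mask


def density(mask):
    """Fraction of frame pairs that are attended, relative to full attention."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


if __name__ == "__main__":
    mask = sparse_attention_mask(num_frames=16, window=4)
    print(density(mask))
```

    Every frame can still reach distant parts of the video through the strided pattern, while the number of attended pairs per frame grows much more slowly than with full attention.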

Semantic RGB-D Image Synthesis

  • paper_url: http://arxiv.org/abs/2308.11356
  • repo_url: None
  • paper_authors: Shijie Li, Rong Li, Juergen Gall
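  • for: Increasing the diversity of training data for RGB-D semantic image segmentation.
  • methods: The paper proposes semantic RGB-D image synthesis, which generates realistic multi-modal RGB-D images from semantic label maps.
  • results: The method outperforms previous uni-modal methods by a large margin, and mixing real and generated images during training significantly improves the accuracy of RGB-D semantic segmentation.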
    Abstract Collecting diverse sets of training images for RGB-D semantic image segmentation is not always possible. In particular, when robots need to operate in privacy-sensitive areas like homes, the collection is often limited to a small set of locations. As a consequence, the annotated images lack diversity in appearance and approaches for RGB-D semantic image segmentation tend to overfit the training data. In this paper, we thus introduce semantic RGB-D image synthesis to address this problem. It requires synthesising a realistic-looking RGB-D image for a given semantic label map. Current approaches, however, are uni-modal and cannot cope with multi-modal data. Indeed, we show that extending uni-modal approaches to multi-modal data does not perform well. In this paper, we therefore propose a generator for multi-modal data that separates modal-independent information of the semantic layout from the modal-dependent information that is needed to generate an RGB and a depth image, respectively. Furthermore, we propose a discriminator that ensures semantic consistency between the label maps and the generated images and perceptual similarity between the real and generated images. Our comprehensive experiments demonstrate that the proposed method outperforms previous uni-modal methods by a large margin and that the accuracy of an approach for RGB-D semantic segmentation can be significantly improved by mixing real and generated images during training.
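    Summary Collecting diverse sets of training images for RGB-D semantic image segmentation is not always possible. In particular, when robots need to operate in privacy-sensitive areas such as homes, collection is often limited to a small set of locations. As a result, the annotated images lack diversity in appearance, and approaches for RGB-D semantic image segmentation tend to overfit the training data. In this paper, we therefore introduce semantic RGB-D image synthesis, which requires synthesizing a realistic-looking RGB-D image for a given semantic label map. Current approaches are uni-modal and cannot cope with multi-modal data, and we show that extending uni-modal approaches to multi-modal data does not perform well. We therefore propose a generator that separates the modal-independent information of the semantic layout from the modal-dependent information needed to generate the RGB and depth images. We also propose a discriminator that ensures semantic consistency between the label maps and the generated images, and perceptual similarity between the real and generated images. Comprehensive experiments show that the method outperforms previous uni-modal methods by a large margin and that mixing real and generated images during training significantly improves the accuracy of RGB-D semantic segmentation.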

ProAgent: Building Proactive Cooperative AI with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.11339
  • repo_url: https://github.com/PKU-Alignment/ProAgent
  • paper_authors: Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yihang Sun, Cheng Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, Xiaojun Chang, Junge Zhang, Feng Yin, Yitao Liang, Yaodong Yang
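  • for: The paper aims to build a proactive cooperative agent (ProAgent) for human-AI cooperation that can adapt its behavior to collaborate more effectively with teammates.
  • methods: The paper uses large language models (LLMs) to give the agent the ability to anticipate teammates' forthcoming decisions and formulate better plans for itself.
  • results: Experiments show that ProAgent outperforms five methods based on self-play and population-based training when cooperating with AI agents, and improves on the current state of the art by more than 10% on average when cooperating with human proxy models. ProAgent is also highly modular and interpretable, so it can be integrated into a wide range of coordination scenarios.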
    Abstract Building AIs with adaptive behaviors in human-AI cooperation stands as a pivotal focus in AGI research. Current methods for developing cooperative agents predominantly rely on learning-based methods, where policy generalization heavily hinges on past interactions with specific teammates. These approaches constrain the agent's capacity to recalibrate its strategy when confronted with novel teammates. We propose \textbf{ProAgent}, a novel framework that harnesses large language models (LLMs) to fashion a \textit{pro}active \textit{agent} empowered with the ability to anticipate teammates' forthcoming decisions and formulate enhanced plans for itself. ProAgent excels at cooperative reasoning with the capacity to dynamically adapt its behavior to enhance collaborative efforts with teammates. Moreover, the ProAgent framework exhibits a high degree of modularity and interpretability, facilitating seamless integration to address a wide array of coordination scenarios. Experimental evaluations conducted within the framework of \textit{Overcook-AI} unveil the remarkable performance superiority of ProAgent, outperforming five methods based on self-play and population-based training in cooperation with AI agents. Further, when cooperating with human proxy models, its performance exhibits an average improvement exceeding 10\% compared to the current state-of-the-art, COLE. The advancement was consistently observed across diverse scenarios involving interactions with both AI agents of varying characteristics and human counterparts. These findings inspire future research for human-robot collaborations. For a hands-on demonstration, please visit \url{https://pku-proagent.github.io}.
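    Summary Building AI with adaptive behavior in human-AI cooperation is a central focus of AGI research. Current methods for developing cooperative agents rely mainly on learning-based methods, whose policy generalization depends heavily on past interactions with specific teammates. This limits an agent's ability to readjust its strategy when facing new teammates. We propose ProAgent, a framework that uses large language models (LLMs) to build a proactive agent able to anticipate teammates' forthcoming decisions and formulate improved plans for itself. ProAgent excels at cooperative reasoning and can dynamically adapt its behavior to teammates. The framework is also highly modular and interpretable, so it can be integrated into a wide range of coordination scenarios. In experiments within the Overcook-AI framework, ProAgent outperforms five methods based on self-play and population-based training when cooperating with AI agents. When cooperating with human proxy models, its performance improves by more than 10% on average over the current state of the art, COLE. The improvement is observed consistently across scenarios involving AI agents of varying characteristics and human counterparts. These findings inspire future research on human-robot collaboration. For a hands-on demonstration, see https://pku-proagent.github.io.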

On the Opportunities and Challenges of Offline Reinforcement Learning for Recommender Systems

  • paper_url: http://arxiv.org/abs/2308.11336
  • repo_url: None
  • paper_authors: Xiaocong Chen, Siyu Wang, Julian McAuley, Dietmar Jannach, Lina Yao
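  • for: This survey introduces and examines offline reinforcement learning within recommender systems.
  • methods: The paper reviews the existing literature on recommender systems that use offline reinforcement learning, in which agents learn from offline datasets and deploy the learned policies online.
  • results: The survey argues that offline reinforcement learning fits recommender systems well because they possess extensive offline datasets, and it highlights prevalent challenges, opportunities, and future directions.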
    Abstract Reinforcement learning serves as a potent tool for modeling dynamic user interests within recommender systems, garnering increasing research attention of late. However, a significant drawback persists: its poor data efficiency, stemming from its interactive nature. The training of reinforcement learning-based recommender systems demands expensive online interactions to amass adequate trajectories, essential for agents to learn user preferences. This inefficiency renders reinforcement learning-based recommender systems a formidable undertaking, necessitating the exploration of potential solutions. Recent strides in offline reinforcement learning present a new perspective. Offline reinforcement learning empowers agents to glean insights from offline datasets and deploy learned policies in online settings. Given that recommender systems possess extensive offline datasets, the framework of offline reinforcement learning aligns seamlessly. Despite being a burgeoning field, works centered on recommender systems utilizing offline reinforcement learning remain limited. This survey aims to introduce and delve into offline reinforcement learning within recommender systems, offering an inclusive review of existing literature in this domain. Furthermore, we strive to underscore prevalent challenges, opportunities, and future pathways, poised to propel research in this evolving field.

GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training

  • paper_url: http://arxiv.org/abs/2308.11331
  • repo_url: None
  • paper_authors: Xinchi Deng, Han Shi, Runhui Huang, Changlin Li, Hang Xu, Jianhua Han, James Kwok, Shen Zhao, Wei Zhang, Xiaodan Liang
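  • for: This paper proposes a data-driven automatic model growing algorithm for contrastive language-image pre-training that handles continuously growing online data.
  • methods: The paper adopts a dynamic growth space and searches for the optimal architecture at each growth step to suit online learning scenarios. It also proposes a shared encoder to enhance cross-modal fusion.
  • results: Compared with existing methods, GrowCLIP improves average top-1 accuracy by 2.3% on zero-shot image classification across 9 downstream tasks. On the Flickr30K dataset, it improves top-1 image-to-text recall for zero-shot image retrieval by 1.2%.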
    Abstract Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks, benefiting from massive image-text pairs collected from the Internet. In practice, online data are growing constantly, highlighting the importance of the ability of pre-trained model to learn from data that is continuously growing. Existing works on cross-modal pre-training mainly focus on training a network with fixed architecture. However, it is impractical to limit the model capacity when considering the continuously growing nature of pre-training data in real-world applications. On the other hand, it is important to utilize the knowledge in the current model to obtain efficient training and better performance. To address the above issues, in this paper, we propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input. Specially, we adopt a dynamic growth space and seek out the optimal architecture at each growth step to adapt to online learning scenarios. And the shared encoder is proposed in our growth space to enhance the degree of cross-modal fusion. Besides, we explore the effect of growth in different dimensions, which could provide future references for the design of cross-modal model architecture. Finally, we employ parameter inheriting with momentum (PIM) to maintain the previous knowledge and address the issue of the local minimum dilemma. Compared with the existing methods, GrowCLIP improves 2.3% average top-1 accuracy on zero-shot image classification of 9 downstream tasks. As for zero-shot image retrieval, GrowCLIP can improve 1.2% for top-1 image-to-text recall on Flickr30K dataset.
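    Summary Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks, benefiting from massive image-text pairs collected from the Internet. In practice, online data grow constantly, which highlights the importance of a pre-trained model being able to learn from continuously growing data. Existing work on cross-modal pre-training mainly trains a network with a fixed architecture, but limiting model capacity is impractical given the continuously growing pre-training data in real-world applications. At the same time, it is important to use the knowledge in the current model to obtain efficient training and better performance. To address these issues, we propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input. We adopt a dynamic growth space and search for the optimal architecture at each growth step to suit online learning scenarios. A shared encoder is proposed in the growth space to enhance cross-modal fusion. We also explore the effect of growth in different dimensions, which could inform the future design of cross-modal model architectures. Finally, we use parameter inheriting with momentum (PIM) to retain previous knowledge and address the local minimum dilemma. Compared with existing methods, GrowCLIP improves average top-1 accuracy by 2.3% on zero-shot image classification across 9 downstream tasks and improves top-1 image-to-text recall on Flickr30K by 1.2% for zero-shot image retrieval.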

From Mundane to Meaningful: AI’s Influence on Work Dynamics – evidence from ChatGPT and Stack Overflow

  • paper_url: http://arxiv.org/abs/2308.11302
  • repo_url: None
  • paper_authors: Quentin Gallea
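  • for: This paper examines how generative AI could deliver productivity gains in coding and what these new technologies mean for the way we work and share knowledge.
  • methods: The paper uses a quasi-experimental method (Difference-in-Difference) to assess the effect of the release of ChatGPT on the usage of Stack Overflow, the largest online community for coders.
  • results: After the release of ChatGPT, the number of questions drops significantly, questions are better documented, and the remaining questions are more complex. These results suggest productivity gains together with a fundamental change in the way we work, with humans focusing on more complex tasks.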
    Abstract This paper illustrates how generative AI could give opportunities for big productivity gains but also opens up questions about the impact of these new powerful technologies on the way we work and share knowledge. More specifically, we explore how ChatGPT changed a fundamental aspect of coding: problem-solving. To do so, we exploit the effect of the sudden release of ChatGPT on the 30th of November 2022 on the usage of the largest online community for coders: Stack Overflow. Using quasi-experimental methods (Difference-in-Difference), we find a significant drop in the number of questions. In addition, the questions are better documented after the release of ChatGPT. Finally, we find evidence that the remaining questions are more complex. These findings suggest not only productivity gains but also a fundamental change in the way we work where routine inquiries are solved by AI allowing humans to focus on more complex tasks.
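    Summary This paper illustrates how generative AI could bring large productivity gains while raising questions about the impact of these new technologies on how we work and share knowledge. Specifically, we study how ChatGPT changed problem-solving in coding. To do so, we exploit the effect of the sudden release of ChatGPT on 30 November 2022 on the usage of Stack Overflow, the largest online community for coders. Using quasi-experimental methods (Difference-in-Difference), we find a significant drop in the number of questions. In addition, questions are better documented after the release, and the remaining questions are more complex. These findings suggest not only productivity gains but also a fundamental change in the way we work, in which routine inquiries are solved by AI and humans can focus on more complex tasks.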
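
    The Difference-in-Difference estimator mentioned above reduces to simple arithmetic on group means. The sketch below shows that arithmetic only; the weekly question counts and the choice of treated and control groups are hypothetical and are not data from the paper.

```python
def mean(values):
    return sum(values) / len(values)


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-Difference estimate.

    The change in the treated group minus the change in the control group;
    the control change stands in for what would have happened without treatment
    (the parallel-trends assumption).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


if __name__ == "__main__":
    # Hypothetical weekly question counts before and after a release date.
    treated_pre, treated_post = [100, 110, 90], [80, 85, 75]
    control_pre, control_post = [50, 55, 45], [48, 52, 50]
    print(diff_in_diff(treated_pre, treated_post, control_pre, control_post))
```

    With these made-up numbers the treated group falls by 20 questions per week while the control group is flat, so the estimated effect is a drop of 20 questions per week.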

Improving Knot Prediction in Wood Logs with Longitudinal Feature Propagation

  • paper_url: http://arxiv.org/abs/2308.11291
  • repo_url: https://github.com/jeremyfix/icvs2023
  • paper_authors: Salim Khazem, Jeremy Fix, Cédric Pradalier
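  • for: This study aims to predict the location of inner defects (knots) in wood logs from the outer shape of the logs, without expensive equipment such as X-ray scanners.
  • methods: The study uses convolutional recurrent neural networks to solve a binary segmentation task that maps the outer shape of a log to its inner defects.
  • results: Once trained, the network can predict the location of inner defects from outer shapes measured with cheap devices such as laser profilers; the approach is shown to be effective on fir and spruce.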
    Abstract The quality of a wood log in the wood industry depends heavily on the presence of both outer and inner defects, including inner knots that are a result of the growth of tree branches. Today, locating the inner knots require the use of expensive equipment such as X-ray scanners. In this paper, we address the task of predicting the location of inner defects from the outer shape of the logs. The dataset is built by extracting both the contours and the knots with X-ray measurements. We propose to solve this binary segmentation task by leveraging convolutional recurrent neural networks. Once the neural network is trained, inference can be performed from the outer shape measured with cheap devices such as laser profilers. We demonstrate the effectiveness of our approach on fir and spruce tree species and perform ablation on the recurrence to demonstrate its importance.
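    Summary The quality of a wood log in the wood industry depends heavily on outer and inner defects, including inner knots that result from the growth of tree branches. Today, locating inner knots requires expensive equipment such as X-ray scanners. In this paper, we address the task of predicting the location of inner defects from the outer shape of the logs. The dataset is built by extracting both the contours and the knots from X-ray measurements. We propose to solve this binary segmentation task with convolutional recurrent neural networks. Once the network is trained, inference can be performed from the outer shape measured with cheap devices such as laser profilers. We demonstrate the effectiveness of the approach on fir and spruce and perform an ablation on the recurrence to show its importance.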

ShadowNet for Data-Centric Quantum System Learning

  • paper_url: http://arxiv.org/abs/2308.11290
  • repo_url: None
  • paper_authors: Yuxuan Du, Yibo Yang, Tongliang Liu, Zhouchen Lin, Bernard Ghanem, Dacheng Tao
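  • for: This study addresses learning the dynamics of large quantum systems while mitigating the curse of dimensionality.
  • methods: The study proposes a data-centric learning paradigm that combines neural-network protocols with classical shadows to support diverse quantum system learning tasks.
  • results: The paradigm can be trained offline and predicts previously unseen quantum systems at the inference stage, even with few state copies; numerical analysis covers quantum state tomography and direct fidelity estimation up to 60 qubits.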
    Abstract Understanding the dynamics of large quantum systems is hindered by the curse of dimensionality. Statistical learning offers new possibilities in this regime by neural-network protocols and classical shadows, while both methods have limitations: the former is plagued by the predictive uncertainty and the latter lacks the generalization ability. Here we propose a data-centric learning paradigm combining the strength of these two approaches to facilitate diverse quantum system learning (QSL) tasks. Particularly, our paradigm utilizes classical shadows along with other easily obtainable information of quantum systems to create the training dataset, which is then learnt by neural networks to unveil the underlying mapping rule of the explored QSL problem. Capitalizing on the generalization power of neural networks, this paradigm can be trained offline and excel at predicting previously unseen systems at the inference stage, even with few state copies. Besides, it inherits the characteristic of classical shadows, enabling memory-efficient storage and faithful prediction. These features underscore the immense potential of the proposed data-centric approach in discovering novel and large-scale quantum systems. For concreteness, we present the instantiation of our paradigm in quantum state tomography and direct fidelity estimation tasks and conduct numerical analysis up to 60 qubits. Our work showcases the profound prospects of data-centric artificial intelligence to advance QSL in a faithful and generalizable manner.
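    Summary Understanding the dynamics of large quantum systems is hindered by the curse of dimensionality. Neural-network protocols and classical shadows both help, but the former suffers from predictive uncertainty and the latter lacks generalization ability. The proposed data-centric paradigm builds a training dataset from classical shadows together with other easily obtainable information about the quantum systems, and trains neural networks on it to learn the underlying mapping of the QSL problem. The model is trained offline and predicts previously unseen systems at the inference stage, even with few state copies, while inheriting the memory-efficient storage and faithful prediction of classical shadows. The paradigm is instantiated for quantum state tomography and direct fidelity estimation, with numerical analysis up to 60 qubits.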

Recording of 50 Business Assignments

  • paper_url: http://arxiv.org/abs/2308.12211
  • repo_url: https://github.com/microsoft/50BusinessAssignmentsLog
  • paper_authors: Michal Sroka, Mohammadreza Fani Sani
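  • for: The work supports discovering and analyzing how users carry out business assignments, which yields insight into process efficiency and optimization.
  • methods: The paper presents a dataset of 50 real business processes, with potential applications including task mining and process automation.
  • results: The released dataset is offered as a resource for researchers and practitioners.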
    Abstract One of the main use cases of process mining is to discover and analyze how users follow business assignments, providing valuable insights into process efficiency and optimization. In this paper, we present a comprehensive dataset consisting of 50 real business processes. The dataset holds significant potential for research in various applications, including task mining and process automation, which makes it a valuable resource for researchers and practitioners.
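    Summary A main use case of process mining is discovering and analyzing how users follow business assignments, which yields insight into process efficiency and optimization. The paper presents a comprehensive dataset of 50 real business processes, with research potential in applications such as task mining and process automation.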

CNN based Cuneiform Sign Detection Learned from Annotated 3D Renderings and Mapped Photographs with Illumination Augmentation

  • paper_url: http://arxiv.org/abs/2308.11277
  • repo_url: None
  • paper_authors: Ernst Stötzner, Timo Homburg, Hubert Mara
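  • for: Motivated by the challenges of the Digital Ancient Near Eastern Studies (DANES) community, the authors develop digital tools for processing cuneiform, a 3D script imprinted into clay tablets that was used for more than three millennia and in at least eight major languages.
  • methods: The HeiCuBeDa and MaiCuBeDa datasets (around 500 annotated tablets) are used with a novel OCR-like approach to mixed image data, plus a mapping tool for transferring annotations between 3D renderings and photographs. Sign localization uses a RepPoints detector to predict character locations as bounding boxes. The image data comprise GigaMesh MSII (curvature-based) renderings, Phong-shaded 3D models, and photographs, with illumination augmentation.
  • results: Sign detection on rendered 3D images performs better than other work on photographs. The approach gives reasonably good results on photographs alone and works best on mixed datasets. Phong renderings, and especially MSII renderings, improve the results on photographs.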
    Abstract Motivated by the challenges of the Digital Ancient Near Eastern Studies (DANES) community, we develop digital tools for processing cuneiform script, a 3D script imprinted into clay tablets that was used for more than three millennia and in at least eight major languages. It consists of thousands of characters that have changed over time and space. Photographs are the most common representations usable for machine learning, while ink drawings are prone to interpretation. Best suited are 3D datasets, which are becoming available. We created and used the HeiCuBeDa and MaiCuBeDa datasets, which consist of around 500 annotated tablets. For our novel OCR-like approach to mixed image data, we provide an additional mapping tool for transferring annotations between 3D renderings and photographs. Our sign localization uses a RepPoints detector to predict the locations of characters as bounding boxes. We use image data from GigaMesh's MSII (curvature, see https://gigamesh.eu) based rendering, Phong-shaded 3D models, and photographs as well as illumination augmentation. The results show that using rendered 3D images for sign detection performs better than other work on photographs. In addition, our approach gives reasonably good results for photographs only, while it is best used for mixed datasets. More importantly, the Phong renderings, and especially the MSII renderings, improve the results on photographs, which is the largest dataset on a global scale.
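    Summary Motivated by the challenges of the DANES community, the authors develop digital tools for cuneiform, a 3D script imprinted into clay tablets that was used for more than three millennia and in at least eight major languages, with thousands of characters that changed over time and space. Photographs are the most common representation usable for machine learning, while ink drawings are prone to interpretation. Using the HeiCuBeDa and MaiCuBeDa datasets (around 500 annotated tablets), they provide a mapping tool for transferring annotations between 3D renderings and photographs, and work with GigaMesh MSII (curvature) renderings, Phong-shaded renderings, and photographs with illumination augmentation. Sign detection on rendered 3D images outperforms other work on photographs; the approach gives reasonably good results on photographs alone and works best on mixed datasets, and the Phong and especially the MSII renderings improve results on photographs.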

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

  • paper_url: http://arxiv.org/abs/2308.11276
  • repo_url: None
  • paper_authors: Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan
  • for: solves the problem of text-to-music generation (T2M-Gen) faced due to the scarcity of large-scale publicly available music datasets with natural language captions.
  • methods: utilizes audio representations from a pretrained MERT model to extract music features, and introduces a methodology for generating question-answer pairs from existing audio captioning datasets, as well as the MusicQA Dataset designed for answering open-ended music-related questions.
  • results: achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.
    Abstract Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed MU-LLaMA model, trained on our designed MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.
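    Summary Text-to-music generation (T2M-Gen) is held back by the scarcity of large-scale, publicly available music datasets with natural language captions. The authors propose Music Understanding LLaMA (MU-LLaMA), which answers music-related questions and generates captions for music files, using audio representations from a pretrained MERT model to extract music features. Because existing public audio question answering datasets lack the depth needed for open-ended music questions, they present a method for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset. Trained on MusicQA, MU-LLaMA outperforms current state-of-the-art (SOTA) models on both music question answering and music caption generation across various metrics.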

Robust Lagrangian and Adversarial Policy Gradient for Robust Constrained Markov Decision Processes

  • paper_url: http://arxiv.org/abs/2308.11267
  • repo_url: None
  • paper_authors: David M. Bossens
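  • for: The paper targets the robust constrained Markov decision process (RCMDP), a task-modelling framework for reinforcement learning that incorporates behavioural constraints and provides robustness to errors in the transition dynamics model through an uncertainty set.
  • methods: It introduces two algorithms: RCPG with Robust Lagrangian, which takes the worst-case dynamics based on the Lagrangian rather than on the value or the constraint, and Adversarial RCPG, which learns the Lagrangian-based worst-case dynamics directly and incrementally as an adversarial policy through gradient descent.
  • results: In inventory management and safe navigation tasks with injected perturbations, both algorithms are competitive with traditional RCPG variants and with non-robust and non-constrained ablations. Adversarial RCPG ranks among the top two algorithms on all tests.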
    Abstract The robust constrained Markov decision process (RCMDP) is a recent task-modelling framework for reinforcement learning that incorporates behavioural constraints and that provides robustness to errors in the transition dynamics model through the use of an uncertainty set. Simulating RCMDPs requires computing the worst-case dynamics based on value estimates for each state, an approach which has previously been used in the Robust Constrained Policy Gradient (RCPG). Highlighting potential downsides of RCPG such as not robustifying the full constrained objective and the lack of incremental learning, this paper introduces two algorithms, called RCPG with Robust Lagrangian and Adversarial RCPG. RCPG with Robust Lagrangian modifies RCPG by taking the worst-case dynamics based on the Lagrangian rather than either the value or the constraint. Adversarial RCPG also formulates the worst-case dynamics based on the Lagrangian but learns this directly and incrementally as an adversarial policy through gradient descent rather than indirectly and abruptly through constrained optimisation on a sorted value list. A theoretical analysis first derives the Lagrangian policy gradient for the policy optimisation of both proposed algorithms and then the adversarial policy gradient to learn the adversary for Adversarial RCPG. Empirical experiments injecting perturbations in inventory management and safe navigation tasks demonstrate the competitive performance of both algorithms compared to traditional RCPG variants as well as non-robust and non-constrained ablations. In particular, Adversarial RCPG ranks among the top two performing algorithms on all tests.
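    Summary The robust constrained Markov decision process (RCMDP) is a recent task-modelling framework for reinforcement learning that incorporates behavioural constraints and is robust to errors in the transition dynamics model through an uncertainty set. Simulating RCMDPs requires computing the worst-case dynamics from value estimates for each state, as done in Robust Constrained Policy Gradient (RCPG). Noting that RCPG does not robustify the full constrained objective and lacks incremental learning, the paper introduces RCPG with Robust Lagrangian, which takes the worst-case dynamics based on the Lagrangian, and Adversarial RCPG, which learns those dynamics directly and incrementally as an adversarial policy through gradient descent. The analysis derives the Lagrangian policy gradient for both algorithms and the adversarial policy gradient for the adversary. Experiments show competitive performance for both algorithms, with Adversarial RCPG ranking among the top two on all tests.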
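
The difference between the worst-case choices described above can be written out in generic constrained-MDP notation. The symbols below are assumptions made for illustration and are not taken from the paper: $V_r^{\pi,P}$ and $V_c^{\pi,P}$ are the expected return and expected constraint cost of policy $\pi$ under dynamics $P$, $d$ is the constraint budget, $\lambda \ge 0$ is the Lagrange multiplier, and $\mathcal{P}$ is the uncertainty set.

```latex
% Lagrangian of the constrained objective under fixed dynamics P
L(\pi, \lambda; P) = V_r^{\pi,P}(s_0) - \lambda \left( V_c^{\pi,P}(s_0) - d \right)

% Worst case taken on the Lagrangian itself
\max_{\pi} \; \min_{\lambda \ge 0} \; \min_{P \in \mathcal{P}} \; L(\pi, \lambda; P)
```

In this notation, picking $P$ to minimize $V_r^{\pi,P}$ alone, or to maximize $V_c^{\pi,P}$ alone, robustifies only one term of $L$, which is one way to read the abstract's remark about "not robustifying the full constrained objective".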

Efficient Last-iterate Convergence Algorithms in Solving Games

  • paper_url: http://arxiv.org/abs/2308.11256
  • repo_url: None
  • paper_authors: Linjian Meng, Zhenxing Ge, Wenbin Li, Bo An, Yang Gao
  • for: The paper is written for learning Nash equilibrium (NE) in two-player zero-sum normal-form games (NFGs) and extensive-form games (EFGs) using no-regret algorithms.
  • methods: The paper uses the Reward Transformation (RT) framework, which transforms the problem of learning NE in the original game into a series of strongly convex-concave optimization problems (SCCPs). The authors also use the Regret Matching+ (RM+) algorithm to solve the SCCPs, and propose a novel transformation method to enable RM+ to solve the SCCPs.
  • results: The paper shows that the proposed algorithm, Reward Transformation RM+ (RTRM+), enjoys last-iterate convergence under the discrete-time feedback setting, and significantly outperforms existing last-iterate convergence algorithms and RM+ (CFR+) in experiments.
    Abstract No-regret algorithms are popular for learning Nash equilibrium (NE) in two-player zero-sum normal-form games (NFGs) and extensive-form games (EFGs). Many recent works consider no-regret algorithms with last-iterate convergence. Among them, the two most famous algorithms are Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weight Update (OMWU). However, OGDA has high per-iteration complexity. OMWU exhibits a lower per-iteration complexity but poorer empirical performance, and its convergence holds only when NE is unique. Recent works propose a Reward Transformation (RT) framework for MWU, which removes the uniqueness condition and achieves competitive performance with OMWU. Unfortunately, RT-based algorithms perform worse than OGDA under the same number of iterations, and their convergence guarantee is based on the continuous-time feedback assumption, which does not hold in most scenarios. To address these issues, we provide a closer analysis of the RT framework, which holds for both continuous and discrete-time feedback. We demonstrate that the essence of the RT framework is to transform the problem of learning NE in the original game into a series of strongly convex-concave optimization problems (SCCPs). We show that the bottleneck of RT-based algorithms is the speed of solving SCCPs. To improve their empirical performance, we design a novel transformation method that enables the SCCPs to be solved by Regret Matching+ (RM+), a no-regret algorithm with better empirical performance, resulting in Reward Transformation RM+ (RTRM+). RTRM+ enjoys last-iterate convergence under the discrete-time feedback setting. Using the counterfactual regret decomposition framework, we propose Reward Transformation CFR+ (RTCFR+) to extend RTRM+ to EFGs. Experimental results show that our algorithms significantly outperform existing last-iterate convergence algorithms and RM+ (CFR+).
    Summary No-regret algorithms are popular for learning Nash equilibrium (NE) in two-player zero-sum normal-form games (NFGs) and extensive-form games (EFGs), and many recent works focus on last-iterate convergence. OGDA has high per-iteration complexity; OMWU is cheaper per iteration but performs worse empirically and converges only when the NE is unique. The Reward Transformation (RT) framework for MWU removes the uniqueness condition, but RT-based algorithms perform worse than OGDA for the same number of iterations and their guarantee relies on continuous-time feedback. The paper gives an analysis of RT that holds for both continuous and discrete-time feedback, shows that RT turns the original problem into a series of strongly convex-concave optimization problems (SCCPs) whose solution speed is the bottleneck, and designs a transformation that lets Regret Matching+ (RM+) solve the SCCPs, yielding RTRM+. RTRM+ enjoys last-iterate convergence under discrete-time feedback, and RTCFR+ extends it to EFGs through counterfactual regret decomposition. Experiments show that both significantly outperform existing last-iterate convergence algorithms and RM+ (CFR+).
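
For readers unfamiliar with Regret Matching+, the following is a minimal sketch of plain RM+ in self-play on a matrix game. It is not RTRM+: no reward transformation is applied, and the quantity returned is the time-averaged strategy rather than the last iterate that the paper is concerned with. The function names, the starting regrets, and the rock-paper-scissors payoff used in the usage note are illustrative assumptions.

```python
import numpy as np


def regret_matching_plus(payoff, iterations=5000):
    """Self-play RM+ on a two-player zero-sum matrix game.

    payoff[i, j] is the row player's payoff; the column player gets -payoff[i, j].
    Returns the time-averaged strategies of both players.
    """
    n_rows, n_cols = payoff.shape
    # Asymmetric starting regrets (an arbitrary choice) so the dynamics are non-trivial.
    q_row = np.arange(1, n_rows + 1, dtype=float)
    q_col = np.arange(n_cols, 0, -1, dtype=float)
    avg_row = np.zeros(n_rows)
    avg_col = np.zeros(n_cols)

    def strategy(q):
        # Play proportionally to the clipped cumulative regrets.
        total = q.sum()
        return q / total if total > 0 else np.full(len(q), 1.0 / len(q))

    for _ in range(iterations):
        x, y = strategy(q_row), strategy(q_col)
        avg_row += x
        avg_col += y
        u_row = payoff @ y       # payoff of each row action against y
        u_col = -(x @ payoff)    # payoff of each column action against x
        # RM+ update: add the instantaneous regret, then clip at zero.
        q_row = np.maximum(q_row + u_row - x @ u_row, 0.0)
        q_col = np.maximum(q_col + u_col - y @ u_col, 0.0)

    return avg_row / iterations, avg_col / iterations


def exploitability(payoff, x, y):
    """Sum of both players' best-response gains against (x, y); zero at a Nash equilibrium."""
    return float(np.max(payoff @ y) - np.min(x @ payoff))
```

Calling `regret_matching_plus` on the rock-paper-scissors matrix `[[0, -1, 1], [1, 0, -1], [-1, 1, 0]]` and passing the result to `exploitability` gives a value that shrinks as `iterations` grows, which is the averaged-iterate guarantee that last-iterate methods aim to strengthen.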

A survey on bias in machine learning research

  • paper_url: http://arxiv.org/abs/2308.11254
  • repo_url: https://github.com/Aastha2104/Parkinson-Disease-Prediction
  • paper_authors: Agnieszka Mikołajczyk-Bareła, Michał Grochowski
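  • for: The survey aims to clarify the sources of bias and error in machine learning, in support of fairer, more transparent, and more accurate ML models.
  • methods: It provides a taxonomy of over forty potential sources of bias and error in the machine learning pipeline, with a clear example for each.
  • results: By analyzing bias and error across the pipeline, the survey helps practitioners develop better methods to detect and mitigate bias.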
    Abstract Current research on bias in machine learning often focuses on fairness, while overlooking the roots or causes of bias. However, bias was originally defined as a "systematic error," often caused by humans at different stages of the research process. This article aims to bridge the gap between past literature on bias in research by providing a taxonomy for potential sources of bias and errors in data and models. The paper focuses on bias in machine learning pipelines. The survey analyzes over forty potential sources of bias in the machine learning (ML) pipeline, providing clear examples for each. By understanding the sources and consequences of bias in machine learning, better methods can be developed for detecting and mitigating it, leading to fairer, more transparent, and more accurate ML models.
    Summary Current research on bias in machine learning often focuses on fairness while overlooking the roots or causes of bias, even though bias was originally defined as a "systematic error," often introduced by humans at different stages of the research process. The article provides a taxonomy of potential sources of bias and error in data and models, focusing on machine learning pipelines, and analyzes over forty such sources with a clear example for each.

Modeling Bends in Popular Music Guitar Tablatures

  • paper_url: http://arxiv.org/abs/2308.12307
  • repo_url: https://gitlab.com/adhooge1/bend-prediction
  • paper_authors: Alexandre D’Hooge, Louis Bigo, Ken Déguernel
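  • for: The paper studies bends, a guitar playing technique notated in tablatures, and how bend occurrences can be predicted from their musical context.
  • methods: A set of 25 high-level features is computed for each note of the tablature from its past and future short-term context. Experiments use a corpus of 932 lead guitar tablatures of popular music.
  • results: A decision tree predicts bend occurrences with an F1 score of 0.71 and a limited number of false positives.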
    Abstract Tablature notation is widely used in popular music to transcribe and share guitar musical content. As a complement to standard score notation, tablatures transcribe performance gesture information including finger positions and a variety of guitar-specific playing techniques such as slides, hammer-on/pull-off or bends. This paper focuses on bends, which make it possible to progressively shift the pitch of a note, therefore circumventing physical limitations of the discrete fretted fingerboard. In this paper, we propose a set of 25 high-level features, computed for each note of the tablature, to study how bend occurrences can be predicted from their past and future short-term context. Experiments are performed on a corpus of 932 lead guitar tablatures of popular music and show that a decision tree successfully predicts bend occurrences with an F1 score of 0.71 and a limited amount of false positive predictions, demonstrating promising applications to assist the arrangement of non-guitar music into guitar tablatures.
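    Summary Tablature notation is widely used in popular music to transcribe and share guitar content. It complements standard score notation by recording performance gestures, including finger positions and guitar-specific techniques such as slides, hammer-ons/pull-offs, and bends. Bends progressively shift the pitch of a note, which circumvents the physical limits of the discrete fretted fingerboard. The paper proposes 25 high-level features, computed for each note, to study how bend occurrences can be predicted from past and future short-term context. On a corpus of 932 lead guitar tablatures of popular music, a decision tree predicts bend occurrences with an F1 score of 0.71 and a limited number of false positives, which suggests applications in arranging non-guitar music into guitar tablatures.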
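
As a reminder of what the reported F1 score measures, here is a small sketch that computes precision, recall, and F1 for a binary "bend / no bend" labelling. The function name and the toy labels in the usage note are assumptions for illustration; nothing here reproduces the paper's data or features.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, where 1 marks a bend."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # bends correctly predicted
    fp = sum(1 for t, p in pairs if not t and p)    # predicted bends that are not bends
    fn = sum(1 for t, p in pairs if t and not p)    # bends that were missed
    if tp == 0:
        # No true positives: define all three scores as zero to avoid dividing by zero.
        return 0.0, 0.0, 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with true labels `[1, 1, 1, 0, 0, 0, 0, 0]` and predictions `[1, 1, 0, 1, 0, 0, 0, 0]` there are two true positives, one false positive, and one false negative, so precision, recall, and F1 are all 2/3. A low false-positive count, as the abstract reports, corresponds to high precision.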

Multi-Source Domain Adaptation for Cross-Domain Fault Diagnosis of Chemical Processes

  • paper_url: http://arxiv.org/abs/2308.11247
  • repo_url: None
  • paper_authors: Eduardo Fernandes Montesuma, Michela Mulas, Fred Ngolè Mboula, Francesco Corona, Antoine Souloumiac
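  • for: The study targets fault diagnosis in process supervision, where machine learning predicts the fault type from sensor readings, under shifts in the data distribution (cross-domain fault diagnosis, CDFD).
  • methods: It compares single-source and multi-source unsupervised domain adaptation algorithms (SSDA and MSDA) for CDFD on the Tennessee Eastman Process benchmark.
  • results: Training on multiple domains helps even without adaptation: the MSDA baseline improves classification accuracy over the SSDA baseline by 23% on average. In the multi-source scenario, the authors improve the classification accuracy of the no-adaptation setting by 8.4% on average.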
    Abstract Fault diagnosis is an essential component in process supervision. Indeed, it determines which kind of fault has occurred, given that it has been previously detected, allowing for appropriate intervention. Automatic fault diagnosis systems use machine learning for predicting the fault type from sensor readings. Nonetheless, these models are sensitive to changes in the data distributions, which may be caused by changes in the monitored process, such as changes in the mode of operation. This scenario is known as Cross-Domain Fault Diagnosis (CDFD). We provide an extensive comparison of single and multi-source unsupervised domain adaptation (SSDA and MSDA respectively) algorithms for CDFD. We study these methods in the context of the Tennessee Eastman Process, a widely used benchmark in the chemical industry. We show that using multiple domains during training has a positive effect, even when no adaptation is employed. As such, the MSDA baseline improves over the SSDA baseline classification accuracy by 23% on average. In addition, under the multiple-sources scenario, we improve classification accuracy of the no adaptation setting by 8.4% on average.
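    Summary Fault diagnosis is an essential component of process supervision: once a fault has been detected, it determines which kind of fault occurred so that appropriate action can be taken. Automatic systems use machine learning to predict the fault type from sensor readings, but these models are sensitive to changes in the data distribution, for example when the mode of operation changes. This setting is known as Cross-Domain Fault Diagnosis (CDFD). The paper compares single-source and multi-source unsupervised domain adaptation algorithms (SSDA and MSDA) for CDFD on the Tennessee Eastman Process, a widely used benchmark in the chemical industry. Using multiple domains during training helps even without adaptation: the MSDA baseline improves over the SSDA baseline by 23% on average, and in the multi-source scenario the classification accuracy of the no-adaptation setting is improved by 8.4% on average.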

Faster Optimization in S-Graphs Exploiting Hierarchy

  • paper_url: http://arxiv.org/abs/2308.11242
  • repo_url: None
  • paper_authors: Hriday Bavle, Jose Luis Sanchez-Lopez, Javier Civera, Holger Voos
  • for: This paper aims to improve the scalability of Situational Graphs (S-Graphs) in large environments for Simultaneous Localization and Mapping (SLAM) by marginalizing redundant robot poses and their connections to observations.
  • methods: The proposed method generates and optimizes room-local graphs encompassing all graph entities within a room-like structure, compresses the S-Graphs, and performs windowed local optimization of the compressed graph at regular time-distance intervals. Global optimization is performed every time a loop closure is detected.
  • results: The proposed method achieves similar accuracy to the baseline while reducing the computation time by 39.81%.
    Abstract 3D scene graphs hierarchically represent the environment appropriately organizing different environmental entities in various layers. Our previous work on situational graphs extends the concept of 3D scene graph to SLAM by tightly coupling the robot poses with the scene graph entities, achieving state-of-the-art results. Though, one of the limitations of S-Graphs is scalability in really large environments due to the increased graph size over time, increasing the computational complexity. To overcome this limitation in this work we present an initial research of an improved version of S-Graphs exploiting the hierarchy to reduce the graph size by marginalizing redundant robot poses and their connections to the observations of the same structural entities. Firstly, we propose the generation and optimization of room-local graphs encompassing all graph entities within a room-like structure. These room-local graphs are used to compress the S-Graphs marginalizing the redundant robot keyframes within the given room. We then perform windowed local optimization of the compressed graph at regular time-distance intervals. A global optimization of the compressed graph is performed every time a loop closure is detected. We show similar accuracy compared to the baseline while showing a 39.81% reduction in the computation time with respect to the baseline.
    Summary 3D scene graphs represent the environment hierarchically, organizing different environmental entities in layers. The authors' earlier situational graphs (S-Graphs) extend this concept to SLAM by tightly coupling robot poses with scene graph entities, but they scale poorly in very large environments because the graph grows over time. This work exploits the hierarchy to shrink the graph by marginalizing redundant robot poses and their connections to observations of the same structural entities. Room-local graphs covering all entities within a room-like structure are generated and optimized, then used to compress the S-Graph by marginalizing redundant keyframes in that room. The compressed graph is optimized locally in a window at regular time-distance intervals and globally whenever a loop closure is detected. Accuracy is similar to the baseline, with a 39.81% reduction in computation time.

An Effective Transformer-based Contextual Model and Temporal Gate Pooling for Speaker Identification

  • paper_url: http://arxiv.org/abs/2308.11241
  • repo_url: https://github.com/harunorikawano/speaker-identification-with-tgp
  • paper_authors: Harunori Kawano, Sota Shimizu
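  • for: The study develops an accurate end-to-end speaker identification model built on the Transformer architecture and self-supervised learning.
  • methods: It uses a Transformer-based contextual model (Conformer encoder, BEST-RQ pre-training) and proposes a pooling method, Temporal Gate Pooling, to strengthen learning for speaker identification.
  • results: On VoxCeleb1 speaker identification, the method reaches 85.9% accuracy with 28.5M parameters, comparable to wav2vec2 with 317.7M parameters.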
    Abstract Wav2vec2 has achieved success in applying Transformer architecture and self-supervised learning to speech recognition. Recently, these have come to be used not only for speech recognition but also for the entire speech processing. This paper introduces an effective end-to-end speaker identification model applied Transformer-based contextual model. We explored the relationship between the parameters and the performance in order to discern the structure of an effective model. Furthermore, we propose a pooling method, Temporal Gate Pooling, with powerful learning ability for speaker identification. We applied Conformer as encoder and BEST-RQ for pre-training and conducted an evaluation utilizing the speaker identification of VoxCeleb1. The proposed method has achieved an accuracy of 85.9% with 28.5M parameters, demonstrating comparable precision to wav2vec2 with 317.7M parameters. Code is available at https://github.com/HarunoriKawano/speaker-identification-with-tgp.
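    Summary Wav2vec2 has successfully applied the Transformer architecture and self-supervised learning to speech recognition, and these techniques are now used across speech processing. The paper introduces an end-to-end speaker identification model based on a Transformer contextual model, studies the relationship between parameters and performance to identify an effective structure, and proposes a pooling method, Temporal Gate Pooling. With a Conformer encoder and BEST-RQ pre-training, the method reaches 85.9% accuracy on VoxCeleb1 speaker identification with 28.5M parameters, comparable to wav2vec2 with 317.7M parameters. Code is available at https://github.com/HarunoriKawano/speaker-identification-with-tgp.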

ROSGPT_Vision: Commanding Robots Using Only Language Models’ Prompts

  • paper_url: http://arxiv.org/abs/2308.11236
  • repo_url: https://github.com/bilel-bj/rosgpt_vision
  • paper_authors: Bilel Benjdira, Anis Koubaa, Anas M. Ali
  • for: The paper proposes a way to command robots using only language model prompts, in contrast to traditional methods that depend on technical details and programming.
  • methods: The method combines language models and vision models and automates the mechanisms behind the prompts, so that the robot executes tasks from natural language descriptions.
  • results: The method is shown to reduce development cost compared to traditional methods, and application quality can be improved by optimizing the prompting strategies.
    Abstract In this paper, we argue that the next generation of robots can be commanded using only Language Models' prompts. Every prompt interrogates separately a specific Robotic Modality via its Modality Language Model (MLM). A central Task Modality mediates the whole communication to execute the robotic mission via a Large Language Model (LLM). This paper gives this new robotic design pattern the name of: Prompting Robotic Modalities (PRM). Moreover, this paper applies this PRM design pattern in building a new robotic framework named ROSGPT_Vision. ROSGPT_Vision allows the execution of a robotic task using only two prompts: a Visual and an LLM prompt. The Visual Prompt extracts, in natural language, the visual semantic features related to the task under consideration (Visual Robotic Modality). Meanwhile, the LLM Prompt regulates the robotic reaction to the visual description (Task Modality). The framework automates all the mechanisms behind these two prompts. The framework enables the robot to address complex real-world scenarios by processing visual data, making informed decisions, and carrying out actions automatically. The framework comprises one generic vision module and two independent ROS nodes. As a test application, we used ROSGPT_Vision to develop CarMate, which monitors the driver's distraction on the roads and makes real-time vocal notifications to the driver. We showed how ROSGPT_Vision significantly reduced the development cost compared to traditional methods. We demonstrated how to improve the quality of the application by optimizing the prompting strategies, without delving into technical details. ROSGPT_Vision is shared with the community (link: https://github.com/bilel-bj/ROSGPT_Vision) to advance robotic research in this direction and to build more robotic frameworks that implement the PRM design pattern and enables controlling robots using only prompts.
    Summary The paper argues that the next generation of robots can be commanded using only language model prompts. In the proposed design pattern, Prompting Robotic Modalities (PRM), each prompt separately interrogates a specific robotic modality through its Modality Language Model (MLM), and a central Task Modality mediates the communication through a Large Language Model (LLM). ROSGPT_Vision applies PRM with two prompts: a Visual Prompt that extracts, in natural language, the visual semantic features relevant to the task, and an LLM Prompt that regulates the robot's reaction to that description. The framework comprises one generic vision module and two independent ROS nodes. As a test application, the authors built CarMate, which monitors driver distraction and issues real-time vocal notifications, and report reduced development cost and quality gains from optimizing the prompts. The code is shared at https://github.com/bilel-bj/ROSGPT_Vision.

Adaptive White-Box Watermarking with Self-Mutual Check Parameters in Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2308.11235
  • repo_url: None
  • paper_authors: Zhenzhe Gao, Zhaoxia Yin, Hongjian Zhan, Heng Yin, Yue Lu
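  • for: Detecting and preventing unintentional or malicious tampering with deployed AI models.
  • methods: Fragile watermarking is used to detect and locate tampered parameters and bits and to restore them, together with an adaptive embedding method that maximizes information capacity while maintaining model accuracy.
  • results: The method achieves strong recovery performance when the modification rate is below 20%. For models whose accuracy is significantly affected by watermarking, an adaptive bit technique recovers more than 15% of the accuracy loss.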
    Abstract Artificial Intelligence (AI) has found wide application, but also poses risks due to unintentional or malicious tampering during deployment. Regular checks are therefore necessary to detect and prevent such risks. Fragile watermarking is a technique used to identify tampering in AI models. However, previous methods have faced challenges including risks of omission, additional information transmission, and inability to locate tampering precisely. In this paper, we propose a method for detecting tampered parameters and bits, which can be used to detect, locate, and restore parameters that have been tampered with. We also propose an adaptive embedding method that maximizes information capacity while maintaining model accuracy. Our approach was tested on multiple neural networks subjected to attacks that modified weight parameters, and our results demonstrate that our method achieved great recovery performance when the modification rate was below 20%. Furthermore, for models where watermarking significantly affected accuracy, we utilized an adaptive bit technique to recover more than 15% of the accuracy loss of the model.
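    Summary AI is widely deployed but is exposed to unintentional or malicious tampering during deployment, so regular checks are needed. Fragile watermarking identifies tampering in AI models, but earlier methods risk omissions, require additional information to be transmitted, and cannot locate tampering precisely. The paper proposes a method that detects tampered parameters and bits and can detect, locate, and restore them, along with an adaptive embedding method that maximizes information capacity while maintaining model accuracy. Tested on multiple neural networks under attacks that modify weight parameters, the method achieves strong recovery performance when the modification rate is below 20%. For models where watermarking significantly affects accuracy, an adaptive bit technique recovers more than 15% of the accuracy loss.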
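
To make the notion of a fragile watermark concrete, the sketch below embeds a one-bit check into the least significant bit (LSB) of each 8-bit quantized weight and uses it to locate modified weights. This is a generic parity scheme chosen for illustration, not the self-mutual check construction of the paper; the 8-bit quantization, the function names, and the parity rule are all assumptions.

```python
import numpy as np


def _high_bit_parity(values):
    """Parity (0 or 1) of the seven high bits of each uint8 value."""
    high = values >> 1
    parity = np.zeros_like(high)
    for shift in range(7):
        parity ^= (high >> shift) & 1
    return parity


def embed(weights_u8):
    """Overwrite each weight's LSB with the parity of its own high bits."""
    return (weights_u8 & 0xFE) | _high_bit_parity(weights_u8)


def locate_tampering(weights_u8):
    """Return indices of weights whose LSB no longer matches the parity check."""
    mismatch = (weights_u8 & 1) != _high_bit_parity(weights_u8)
    return np.flatnonzero(mismatch)
```

Embedding changes each weight by at most one quantization step, and flipping a single bit of a marked weight makes `locate_tampering` report its index. A single parity bit misses changes that flip an even number of bits, which is one reason practical schemes use stronger checks.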

Traffic Flow Optimisation for Lifelong Multi-Agent Path Finding

  • paper_url: http://arxiv.org/abs/2308.11234
  • repo_url: None
  • paper_authors: Zhe Chen, Daniel Harabor, Jiaoyang Li, Peter J. Stuckey
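  • for: Solving Multi-Agent Path Finding (MAPF), a robotics problem in which a team of agents needs collision-free paths across a shared map.
  • methods: A new approach that guides agents to their destinations along congestion-avoiding paths.
  • results: In two large-scale settings, one-shot MAPF and lifelong MAPF, the approach substantially improves solution quality for one-shot MAPF and yields large improvements in overall throughput for lifelong MAPF.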
    Abstract Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics that asks us to compute collision-free paths for a team of agents, all moving across a shared map. Although many works appear on this topic, all current algorithms struggle as the number of agents grows. The principal reason is that existing approaches typically plan free-flow optimal paths, which creates congestion. To tackle this issue we propose a new approach for MAPF where agents are guided to their destination by following congestion-avoiding paths. We evaluate the idea in two large-scale settings: one-shot MAPF, where each agent has a single destination, and lifelong MAPF, where agents are continuously assigned new tasks. For one-shot MAPF we show that our approach substantially improves solution quality. For Lifelong MAPF we report large improvements in overall throughput.
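    Summary Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics that asks for collision-free paths for multiple agents on a shared map. Although the problem has been studied extensively, existing methods struggle as the number of agents grows. The main reason is that they typically plan free-flow optimal paths, which causes congestion. To address this, we propose a new MAPF approach in which agents reach their destinations by following congestion-avoiding paths. We evaluate the idea in two large-scale settings: one-shot MAPF, where each agent has a single destination, and lifelong MAPF, where agents are continuously assigned new tasks. For one-shot MAPF the approach substantially improves solution quality; for lifelong MAPF it yields large improvements in overall throughput.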
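The idea of steering agents along congestion-avoiding paths can be illustrated with a small sketch. This is not the paper's algorithm: it is a simplified stand-in in which agents are planned one after another with Dijkstra's algorithm on a 4-connected grid, and entering a cell costs more the more often earlier paths already use it. The function names, the `penalty` parameter, and the usage-count cost are all assumptions made for illustration, and collisions in time are not modelled.

```python
import heapq
from collections import Counter


def plan(width, height, start, goal, usage, penalty=1.0):
    """Dijkstra on a 4-connected grid; entering a cell costs 1 + penalty * usage[cell]."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist[cell]:
            continue  # stale heap entry
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < width and 0 <= nxt[1] < height:
                nd = d + 1.0 + penalty * usage[nxt]
                if nd < dist.get(nxt, float("inf")):
                    dist[nxt] = nd
                    prev[nxt] = cell
                    heapq.heappush(heap, (nd, nxt))
    # Walk back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def plan_all(width, height, tasks, penalty=1.0):
    """Plan agents sequentially; each finished path raises the cost of its cells."""
    usage = Counter()
    paths = []
    for start, goal in tasks:
        path = plan(width, height, start, goal, usage, penalty)
        usage.update(path)
        paths.append(path)
    return paths


paths = plan_all(3, 3, [((0, 0), (2, 0)), ((0, 0), (2, 0))], penalty=5.0)
print(paths[0])  # the free-flow shortest path
print(paths[1])  # a detour around the cells the first agent uses
```

With `penalty=0` every agent would take its free-flow shortest path, which is the behaviour the abstract identifies as the source of congestion.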

On-Premise AIOps Infrastructure for a Software Editor SME: An Experience Report

  • paper_url: http://arxiv.org/abs/2308.11225
  • repo_url: None
  • paper_authors: Anes Bendimerad, Youcef Remil, Romain Mathonat, Mehdi Kaytoue
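  • for: Investigating the feasibility of an on-premise AIOps solution built from open-source tools to improve software maintenance and monitoring.
  • methods: The authors build a complete AIOps infrastructure from open-source tools and explain the rationale and criteria behind their choices.
  • results: The infrastructure was deployed successfully in the authors' company. The report shares the approach and criteria used to select and integrate a data management system, which may help companies that want to manage software maintenance internally.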
    Abstract Information Technology has become a critical component in various industries, leading to an increased focus on software maintenance and monitoring. With the complexities of modern software systems, traditional maintenance approaches have become insufficient. The concept of AIOps has emerged to enhance predictive maintenance using Big Data and Machine Learning capabilities. However, exploiting AIOps requires addressing several challenges related to the complexity of data and incident management. Commercial solutions exist, but they may not be suitable for certain companies due to high costs, data governance issues, and limitations in covering private software. This paper investigates the feasibility of implementing on-premise AIOps solutions by leveraging open-source tools. We introduce a comprehensive AIOps infrastructure that we have successfully deployed in our company, and we provide the rationale behind different choices that we made to build its various components. Particularly, we provide insights into our approach and criteria for selecting a data management system and we explain its integration. Our experience can be beneficial for companies seeking to internally manage their software maintenance processes with a modern AIOps approach.
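    Summary Information technology has become a key component of many industries, bringing greater attention to software maintenance and monitoring. Given the complexity of modern software systems, traditional maintenance approaches are no longer sufficient. AIOps emerged to enhance predictive maintenance using Big Data and Machine Learning. However, using AIOps involves challenges related to data complexity and incident management. Commercial solutions exist, but they may not suit some companies because of high costs, data governance issues, and limits in covering private software. This paper investigates the feasibility of on-premise AIOps built on open-source tools. We present a complete AIOps infrastructure that we deployed successfully in our company and explain the rationale behind the choice of its components, in particular the selection and integration of the data management system. Our experience may help companies that want to manage their software maintenance internally with a modern AIOps approach.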

Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis

  • paper_url: http://arxiv.org/abs/2308.11224
  • repo_url: https://github.com/ayame1006/llmtograph
  • paper_authors: Chang Liu, Bo Wu
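  • for: Evaluating the capabilities of four large language models (LLMs) on analytical problems with graph data.
  • methods: Four evaluation metrics are used: Comprehension, Correctness, Fidelity, and Rectification.
  • results: LLMs effectively comprehend graph data described in natural language and reason with graph topology. GPT models outperform the alternatives in correctness, but all examined LLMs struggle with structural reasoning. GPT models often produce erroneous answers in multi-answer tasks, raising fidelity concerns, and their elevated confidence may hinder rectification. Notably, GPT-4 can rectify responses from GPT-3.5-turbo and from its own previous iterations. Code: https://github.com/Ayame1006/LLMtoGraph.
    Abstract Large Language Models (LLMs) have garnered considerable interest within both academia and industry. Yet, the application of LLMs to graph data remains under-explored. In this study, we evaluate the capabilities of four LLMs in addressing several analytical problems with graph data. We employ four distinct evaluation metrics: Comprehension, Correctness, Fidelity, and Rectification. Our results show that: 1) LLMs effectively comprehend graph data in natural language and reason with graph topology. 2) GPT models can generate logical and coherent results, outperforming alternatives in correctness. 3) All examined LLMs face challenges in structural reasoning, with techniques like zero-shot chain-of-thought and few-shot prompting showing diminished efficacy. 4) GPT models often produce erroneous answers in multi-answer tasks, raising concerns in fidelity. 5) GPT models exhibit elevated confidence in their outputs, potentially hindering their rectification capacities. Notably, GPT-4 has demonstrated the capacity to rectify responses from GPT-3.5-turbo and its own previous iterations. The code is available at: https://github.com/Ayame1006/LLMtoGraph.
    Summary Large language models (LLMs) have attracted wide attention in both academia and industry, yet their application to graph data remains under-explored. This study evaluates four LLMs on several analytical problems with graph data, using four metrics: Comprehension, Correctness, Fidelity, and Rectification. The results show that: 1) LLMs comprehend graph data described in natural language and reason with graph topology. 2) GPT models perform strongly on correctness, outperforming the alternatives. 3) All examined LLMs struggle with structural reasoning, and zero-shot chain-of-thought and few-shot prompting show diminished efficacy. 4) GPT models often give erroneous answers in multi-answer tasks, which raises fidelity concerns. 5) GPT models show elevated confidence in their outputs, which may hinder rectification. Notably, GPT-4 can rectify responses from GPT-3.5-turbo and from its own previous iterations. The code is available at: https://github.com/Ayame1006/LLMtoGraph.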

Federated Learning on Patient Data for Privacy-Protecting Polycystic Ovary Syndrome Treatment

  • paper_url: http://arxiv.org/abs/2308.11220
  • repo_url: https://github.com/toriqiu/fl-pcos
  • paper_authors: Lucia Morris, Tori Qiu, Nikhil Raghuraman
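  • for: Exploring Federated Learning (FL) for predicting the optimal drug for patients with polycystic ovary syndrome (PCOS), a hormonal disorder.
  • methods: A variety of FL approaches, tested on a synthetic PCOS patient dataset.
  • results: The FL approaches succeed on the synthetic dataset; the authors present them as a way to identify the most effective treatment option while preserving patient privacy.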
    Abstract The field of women's endocrinology has trailed behind data-driven medical solutions, largely due to concerns over the privacy of patient data. Valuable datapoints about hormone levels or menstrual cycling could expose patients who suffer from comorbidities or terminate a pregnancy, violating their privacy. We explore the application of Federated Learning (FL) to predict the optimal drug for patients with polycystic ovary syndrome (PCOS). PCOS is a serious hormonal disorder impacting millions of women worldwide, yet it's poorly understood and its research is stunted by a lack of patient data. We demonstrate that a variety of FL approaches succeed on a synthetic PCOS patient dataset. Our proposed FL models are a tool to access massive quantities of diverse data and identify the most effective treatment option while providing PCOS patients with privacy guarantees.
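    Summary Women's endocrinology has lagged behind data-driven medical solutions, largely because of concerns over patient data privacy. Valuable data points about hormone levels or menstrual cycles could expose patients who have comorbidities or who terminate a pregnancy, violating their privacy. We explore Federated Learning (FL) for predicting the optimal drug for patients with polycystic ovary syndrome (PCOS). PCOS is a serious hormonal disorder affecting millions of women worldwide, yet it is poorly understood and research on it is held back by a lack of patient data. We show that a variety of FL approaches succeed on a synthetic PCOS patient dataset. The proposed FL models are a tool for accessing large quantities of diverse data and identifying the most effective treatment while giving PCOS patients privacy guarantees.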
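The abstract does not say which FL approaches were used. As a hedged illustration of the general principle, the sketch below shows federated averaging (FedAvg), a common FL aggregation step in which a server combines locally trained parameters weighted by each client's sample count, so raw patient records never leave the client. The function name `fed_avg` and the flat-list parameter format are assumptions for illustration only.

```python
def fed_avg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    client_weights: list of equally long lists of floats (one per client).
    client_sizes:   number of local training samples per client.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clinics: the second holds three times as much data.
global_weights = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

Only the parameter vectors are shared with the server in this scheme; a real deployment would typically add further protections, which the sketch omits.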

Federated Learning in Big Model Era: Domain-Specific Multimodal Large Models

  • paper_url: http://arxiv.org/abs/2308.11217
  • repo_url: None
  • paper_authors: Zengxiang Li, Zhaoxiang Hou, Hui Liu, Ying Wang, Tongzhi Li, Longfei Xie, Chao Shi, Chengyi Yang, Weishan Zhang, Zelei Liu, Liang Xu
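  • for: Proposing a multimodal federated learning framework that lets multiple enterprises use private domain data to train large models collaboratively for vertical domains, delivering intelligent services across scenarios.
  • methods: The paper discusses the strategic transformation of federated learning in the big-model era in terms of intelligence foundation and objectives, and the new challenges in heterogeneous data, model aggregation, performance and cost trade-off, data privacy, and incentive mechanisms.
  • results: A city safety operation management case study is presented, in which leading enterprises contribute multimodal data and expert knowledge, with distributed deployment and efficient coordination of the federated learning platform. Preliminary experiments show that enterprises can enhance and accumulate intelligent capabilities through multimodal federated learning and jointly create high-quality intelligent services covering energy infrastructure safety, residential community security, and urban operation management.
    Abstract Multimodal data, which can comprehensively perceive and recognize the physical world, has become an essential path towards general artificial intelligence. However, multimodal large models trained on public datasets often underperform in specific industrial domains. This paper proposes a multimodal federated learning framework that enables multiple enterprises to utilize private domain data to collaboratively train large models for vertical domains, achieving intelligent services across scenarios. The authors discuss in-depth the strategic transformation of federated learning in terms of intelligence foundation and objectives in the era of big models, as well as the new challenges faced in heterogeneous data, model aggregation, performance and cost trade-off, data privacy, and incentive mechanism. The paper elaborates a case study of leading enterprises contributing multimodal data and expert knowledge to city safety operation management, including distributed deployment and efficient coordination of the federated learning platform, technical innovations on data quality improvement based on large model capabilities and efficient joint fine-tuning approaches. Preliminary experiments show that enterprises can enhance and accumulate intelligent capabilities through multimodal model federated learning, thereby jointly creating a smart city model that provides high-quality intelligent services covering energy infrastructure safety, residential community security, and urban operation management. The established federated learning cooperation ecosystem is expected to further aggregate industry, academia, and research resources, realize large models in multiple vertical domains, and promote the large-scale industrial application of artificial intelligence and cutting-edge research on multimodal federated learning.
    Summary Multimodal data, which can comprehensively perceive and recognize the physical world, has become a key path towards general artificial intelligence. However, multimodal large models trained on public datasets often underperform in specific industrial domains. This paper proposes a multimodal federated learning framework that lets multiple enterprises use private domain data to train large models collaboratively and deliver intelligent services across scenarios. The authors discuss the strategic transformation of federated learning in the big-model era in terms of intelligence foundation and objectives, and the new challenges in heterogeneous data, model aggregation, performance and cost trade-off, data privacy, and incentive mechanisms. The paper also presents a city safety operation management case study, covering distributed deployment and efficient coordination of the federated learning platform, data quality improvement based on large model capabilities, and efficient joint fine-tuning. Preliminary experiments show that enterprises can enhance and accumulate intelligent capabilities through multimodal federated learning and jointly create high-quality intelligent services covering energy infrastructure safety, residential community security, and urban operation management. The resulting cooperation ecosystem is expected to bring together industry, academia, and research resources, realize large models in multiple vertical domains, and promote large-scale industrial application of artificial intelligence and research on multimodal federated learning.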

ConcatPlexer: Additional Dim1 Batching for Faster ViTs

  • paper_url: http://arxiv.org/abs/2308.11199
  • repo_url: None
  • paper_authors: Donghoon Han, Seunghyeon Seo, Donghyeon Jeon, Jiho Jang, Chaerin Kong, Nojun Kwak
  • for: Making visual recognition more efficient by raising inference throughput with little loss of accuracy.
  • methods: The work adapts Data Multiplexing (DataMUX), a cost-cutting method originally proposed for language models, to vision models. It first introduces a naive adaptation, Image Multiplexer, and then adds new components to overcome its weaknesses, yielding ConcatPlexer, which uses additional dim1 batching (concatenation).
  • results: Trained on ImageNet1K and CIFAR100, ConcatPlexer uses 23.5% fewer GFLOPs than ViT-B/16 while reaching 69.5% and 83.4% validation accuracy, respectively.
    Abstract Transformers have demonstrated tremendous success not only in the natural language processing (NLP) domain but also the field of computer vision, igniting various creative approaches and applications. Yet, the superior performance and modeling flexibility of transformers came with a severe increase in computation costs, and hence several works have proposed methods to reduce this burden. Inspired by a cost-cutting method originally proposed for language models, Data Multiplexing (DataMUX), we propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenation) that greatly improves the throughput with little compromise in the accuracy. We first introduce a naive adaptation of DataMux for vision models, Image Multiplexer, and devise novel components to overcome its weaknesses, rendering our final model, ConcatPlexer, at the sweet spot between inference speed and accuracy. The ConcatPlexer was trained on ImageNet1K and CIFAR100 datasets and it achieved 23.5% fewer GFLOPs than ViT-B/16 with 69.5% and 83.4% validation accuracy, respectively.
    Summary Transformers have been highly successful in natural language processing (NLP) and have also inspired many creative approaches in computer vision. Their performance and modeling flexibility, however, come with a sharp increase in computation cost, and several works have proposed ways to reduce it. Inspired by Data Multiplexing (DataMUX), a cost-cutting method originally proposed for language models, we propose an approach to efficient visual recognition that uses additional dim1 batching (concatenation) to improve throughput with little loss of accuracy. We first introduce a naive adaptation of DataMUX for vision models, Image Multiplexer, and then devise new components to overcome its weaknesses, arriving at our final model, ConcatPlexer. Trained on ImageNet1K and CIFAR100, ConcatPlexer uses 23.5% fewer GFLOPs than ViT-B/16 while reaching 69.5% and 83.4% validation accuracy, respectively.

ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data

  • paper_url: http://arxiv.org/abs/2308.11194
  • repo_url: https://github.com/stanfordmimi/villa
  • paper_authors: Maya Varma, Jean-Benoit Delbrouck, Sarah Hooper, Akshay Chaudhari, Curtis Langlotz
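  • for: Evaluating how well current vision-language models (VLMs) capture fine-grained relationships between image regions and textual attributes when trained on complex real-world multimodal data, and improving on them.
  • methods: ViLLA combines a lightweight, self-supervised mapping model that decomposes image-text samples into region-attribute pairs with a contrastive VLM that learns representations from the generated pairs.
  • results: Across four domains (synthetic, product, medical, and natural images), ViLLA outperforms comparable VLMs on fine-grained reasoning tasks such as zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP points on LVIS) and retrieval (up to 14.2 R-Precision points).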
    Abstract Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained on datasets consisting of image-caption pairs obtained from the web. However, real-world multimodal datasets, such as healthcare data, are significantly more complex: each image (e.g. X-ray) is often paired with text (e.g. physician report) that describes many distinct attributes occurring in fine-grained regions of the image. We refer to these samples as exhibiting high pairwise complexity, since each image-text pair can be decomposed into a large number of region-attribute pairings. The extent to which VLMs can capture fine-grained relationships between image regions and textual attributes when trained on such data has not been previously evaluated. The first key contribution of this work is to demonstrate through systematic evaluations that as the pairwise complexity of the training dataset increases, standard VLMs struggle to learn region-attribute relationships, exhibiting performance degradations of up to 37% on retrieval tasks. In order to address this issue, we introduce ViLLA as our second key contribution. ViLLA, which is trained to capture fine-grained region-attribute relationships from complex datasets, involves two components: (a) a lightweight, self-supervised mapping model to decompose image-text samples into region-attribute pairs, and (b) a contrastive VLM to learn representations from generated region-attribute pairs. We demonstrate with experiments across four domains (synthetic, product, medical, and natural images) that ViLLA outperforms comparable VLMs on fine-grained reasoning tasks, such as zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP points on LVIS) and retrieval (up to 14.2 R-Precision points).
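    Summary Vision-language models (VLMs) such as CLIP and ALIGN are generally trained on image-caption pairs obtained from the web. Real-world multimodal data, such as healthcare data, is considerably more complex: each image (e.g. an X-ray) is often paired with text (e.g. a physician report) that describes many distinct attributes occurring in fine-grained regions of the image. We say such samples exhibit high pairwise complexity, because each image-text pair can be decomposed into a large number of region-attribute pairings. Whether VLMs trained on such data can capture fine-grained region-attribute relationships has not been evaluated before. Our first key contribution is to show through systematic evaluation that, as the pairwise complexity of the training data increases, standard VLMs struggle to learn region-attribute relationships, with performance degradations of up to 37% on retrieval tasks. Our second key contribution is ViLLA, which has two components: (a) a lightweight, self-supervised mapping model that decomposes image-text samples into region-attribute pairs, and (b) a contrastive VLM that learns representations from the generated pairs. Experiments across four domains (synthetic, product, medical, and natural images) show that ViLLA outperforms comparable VLMs on fine-grained reasoning tasks such as zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP points on LVIS) and retrieval (up to 14.2 R-Precision points).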

Diversity Measures: Domain-Independent Proxies for Failure in Language Model Queries

  • paper_url: http://arxiv.org/abs/2308.11189
  • repo_url: https://github.com/lab-v2/diversity_measures
  • paper_authors: Noel Ngu, Nathaniel Lee, Paulo Shakarian
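  • for: Providing application-independent measures for predicting errors in large language model responses, in contrast to error prediction that relies on domain-specific information.
  • methods: Three measures based on the diversity of responses to a given prompt: entropy, Gini impurity, and centroid distance.
  • results: Experiments across multiple datasets and temperature settings show that the measures correlate strongly with the probability of failure. The paper also shows how they apply to few-shot prompting, chain-of-thought reasoning, and error detection.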
    Abstract Error prediction in large language models often relies on domain-specific information. In this paper, we present measures for quantification of error in the response of a large language model based on the diversity of responses to a given prompt - hence independent of the underlying application. We describe how three such measures - based on entropy, Gini impurity, and centroid distance - can be employed. We perform a suite of experiments on multiple datasets and temperature settings to demonstrate that these measures strongly correlate with the probability of failure. Additionally, we present empirical results demonstrating how these measures can be applied to few-shot prompting, chain-of-thought reasoning, and error detection.
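    Summary Error prediction in large language models often relies on domain-specific information. This paper presents measures that quantify error in a large language model's response based on the diversity of responses to a given prompt, and that are therefore independent of the underlying application. We describe how three such measures, based on entropy, Gini impurity, and centroid distance, can be used. Experiments on multiple datasets and temperature settings show that these measures correlate strongly with the probability of failure. We also present empirical results showing how the measures apply to few-shot prompting, chain-of-thought reasoning, and error detection.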
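The three diversity measures can be sketched with their textbook definitions. The code below is a minimal illustration, assuming the same prompt is sampled several times: entropy and Gini impurity are computed over the empirical distribution of distinct answers, and centroid distance is taken as the mean Euclidean distance of response embedding vectors from their centroid. The paper's exact formulations may differ, and the function names and toy inputs are hypothetical.

```python
import math
from collections import Counter


def entropy(answers):
    """Shannon entropy (bits) of the empirical answer distribution."""
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in Counter(answers).values())


def gini_impurity(answers):
    """Probability that two independently drawn answers differ."""
    n = len(answers)
    return 1.0 - sum((c / n) ** 2 for c in Counter(answers).values())


def centroid_distance(vectors):
    """Mean Euclidean distance of embedding vectors from their centroid."""
    n = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return sum(math.dist(v, centroid) for v in vectors) / n


samples = ["a", "a", "b", "b"]  # four sampled answers to one prompt
print(entropy(samples), gini_impurity(samples))
print(centroid_distance([[0.0, 0.0], [2.0, 0.0]]))
```

All three values are zero when every sampled response is identical and grow as responses disagree, which is the property that lets them act as application-independent proxies for failure.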

MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation

  • paper_url: http://arxiv.org/abs/2308.11175
  • repo_url: None
  • paper_authors: Jinpeng Wang, Ziyun Zeng, Yunxiao Wang, Yuting Wang, Xingyu Lu, Tianxiang Li, Jun Yuan, Rui Zhang, Hai-Tao Zheng, Shu-Tao Xia
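  • for: Addressing the weaknesses of ID-based sequential recommenders: poor performance with sparse IDs, the cold-start problem, and limited transferability caused by inconsistent ID mappings.
  • methods: MISSRec, a multi-modal pre-training and transfer learning framework, with a Transformer-based encoder-decoder on the user side and a dynamic fusion module on the candidate item side.
  • results: Extensive experiments demonstrate the effectiveness and flexibility of MISSRec for real-world recommendation scenarios.
    Abstract The goal of sequential recommendation (SR) is to predict a user's potential interested items based on her/his historical interaction sequences. Most existing sequential recommenders are developed based on ID features, which, despite their widespread use, often underperform with sparse IDs and struggle with the cold-start problem. Besides, inconsistent ID mappings hinder the model's transferability, isolating similar recommendation domains that could have been co-optimized. This paper aims to address these issues by exploring the potential of multi-modal information in learning robust and generalizable sequence representations. We propose MISSRec, a multi-modal pre-training and transfer learning framework for SR. On the user side, we design a Transformer-based encoder-decoder model, where the contextual encoder learns to capture the sequence-level multi-modal synergy while a novel interest-aware decoder is developed to grasp item-modality-interest relations for better sequence representation. On the candidate item side, we adopt a dynamic fusion module to produce user-adaptive item representation, providing more precise matching between users and items. We pre-train the model with contrastive learning objectives and fine-tune it in an efficient manner. Extensive experiments demonstrate the effectiveness and flexibility of MISSRec, promising a practical solution for real-world recommendation scenarios.
    Summary Sequential recommendation (SR) aims to predict the items a user may be interested in from her/his historical interaction sequences. Most existing sequential recommenders are built on ID features, which, although widely used, often underperform with sparse IDs and struggle with the cold-start problem. Inconsistent ID mappings also hinder transferability, isolating similar recommendation domains that could have been co-optimized. This paper addresses these issues by using multi-modal information to learn robust and generalizable sequence representations. We propose MISSRec, a multi-modal pre-training and transfer learning framework for SR. On the user side, we design a Transformer-based encoder-decoder model in which the contextual encoder captures sequence-level multi-modal synergy and a novel interest-aware decoder captures item-modality-interest relations for better sequence representation. On the candidate item side, a dynamic fusion module produces user-adaptive item representations for more precise matching between users and items. We pre-train the model with contrastive learning objectives and fine-tune it efficiently. Extensive experiments demonstrate the effectiveness and flexibility of MISSRec, making it a practical option for real-world recommendation scenarios.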

Hierarchical Point-based Active Learning for Semi-supervised Point Cloud Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.11166
  • repo_url: https://github.com/SmiletoE/HPAL
  • paper_authors: Zongyi Xu, Bo Yuan, Shanshan Zhao, Qianni Zhang, Xinbo Gao
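  • for: Addressing limited annotations in 3D point cloud semantic segmentation through active learning.
  • methods: A hierarchical minimum margin uncertainty module measures per-point uncertainty, and a feature-distance suppression strategy selects important and representative points for manual labelling. A semi-supervised segmentation framework is built on this active strategy.
  • results: On S3DIS and ScanNetV2 the framework reaches 96.5% and 100% of the fully-supervised baseline's performance with only 0.07% and 0.1% of the training data, respectively, outperforming state-of-the-art weakly-supervised and active learning methods. Code will be available at https://github.com/SmiletoE/HPAL.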
    Abstract Impressive performance on point cloud semantic segmentation has been achieved by fully-supervised methods with large amounts of labelled data. As it is labour-intensive to acquire large-scale point cloud data with point-wise labels, many attempts have been made to explore learning 3D point cloud segmentation with limited annotations. Active learning is one of the effective strategies to achieve this purpose but is still under-explored. The most recent methods of this kind measure the uncertainty of each pre-divided region for manual labelling but they suffer from redundant information and require additional efforts for region division. This paper aims at addressing this issue by developing a hierarchical point-based active learning strategy. Specifically, we measure the uncertainty for each point by a hierarchical minimum margin uncertainty module which considers the contextual information at multiple levels. Then, a feature-distance suppression strategy is designed to select important and representative points for manual labelling. Besides, to better exploit the unlabelled data, we build a semi-supervised segmentation framework based on our active strategy. Extensive experiments on the S3DIS and ScanNetV2 datasets demonstrate that the proposed framework achieves 96.5% and 100% performance of fully-supervised baseline with only 0.07% and 0.1% training data, respectively, outperforming the state-of-the-art weakly-supervised and active learning methods. The code will be available at https://github.com/SmiletoE/HPAL.
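    Summary Fully-supervised methods have achieved impressive performance on point cloud semantic segmentation using large amounts of labelled data. Because acquiring large-scale point cloud data with point-wise labels is labour-intensive, many attempts have been made to learn 3D point cloud segmentation with limited annotations. Active learning is an effective strategy for this purpose but remains under-explored. Recent methods of this kind measure the uncertainty of each pre-divided region for manual labelling, but they suffer from redundant information and require extra effort for region division. This paper addresses the issue with a hierarchical point-based active learning strategy. A hierarchical minimum margin uncertainty module measures the uncertainty of each point while considering contextual information at multiple levels. A feature-distance suppression strategy then selects important and representative points for manual labelling. To make better use of unlabelled data, we build a semi-supervised segmentation framework on top of the active strategy. Extensive experiments on S3DIS and ScanNetV2 show that the framework reaches 96.5% and 100% of the fully-supervised baseline's performance with only 0.07% and 0.1% of the training data, respectively, outperforming state-of-the-art weakly-supervised and active learning methods. The code will be available at https://github.com/SmiletoE/HPAL.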
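The selection step can be sketched in a simplified, single-level form. The code below is an illustration under stated assumptions, not the paper's module: uncertainty is the plain margin between the two highest class probabilities of a point (a smaller margin means more uncertainty), without the multi-level context aggregation, and feature-distance suppression is approximated by greedily skipping candidates whose feature vector lies within `min_dist` of an already selected point. The names `margin`, `select_points`, and `min_dist` are hypothetical.

```python
import math


def margin(probs):
    """Difference between the two largest class probabilities of one point."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def select_points(probs, feats, budget, min_dist):
    """Pick up to `budget` points for labelling.

    Points are visited from smallest margin (most uncertain) to largest;
    a point is skipped if its features are closer than `min_dist` to any
    point already chosen, which suppresses redundant selections.
    """
    order = sorted(range(len(probs)), key=lambda i: margin(probs[i]))
    chosen = []
    for i in order:
        if all(math.dist(feats[i], feats[j]) >= min_dist for j in chosen):
            chosen.append(i)
        if len(chosen) == budget:
            break
    return chosen


probs = [[0.5, 0.5], [0.55, 0.45], [0.9, 0.1], [0.6, 0.4]]
feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [3.0, 0.0]]
picked = select_points(probs, feats, budget=2, min_dist=1.0)
print(picked)
```

In this toy input the second most uncertain point is skipped because it sits almost on top of the most uncertain one in feature space, so the labelling budget goes to a more representative point instead.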

xxMD: Benchmarking Neural Force Fields Using Extended Dynamics beyond Equilibrium

  • paper_url: http://arxiv.org/abs/2308.11155
  • repo_url: https://github.com/zpengmei/xxmd
  • paper_authors: Zihan Pengmei, Junyu Liu, Yinan Shu
  • for: Addressing the limitations of current neural force field (NFF) benchmarks in representing chemical reactions by introducing a new dataset, xxMD, with energies and forces computed from both multireference wave function theory and density functional theory.
  • methods: The authors show that internal coordinates and energies in the MD17 datasets are narrowly distributed, which makes them inadequate for systems undergoing chemical reactions. They then introduce xxMD, derived from non-adiabatic dynamics, whose nuclear configuration spaces authentically depict chemical reactions.
  • results: Re-assessing equivariant models on xxMD yields notably higher mean absolute errors than those reported for MD17 and its variants, highlighting the difficulty of building a generalizable NFF model with extrapolation capability. The two datasets, xxMD-CASSCF and xxMD-DFT, are available online.
    Abstract Neural force fields (NFFs) have gained prominence in computational chemistry as surrogate models, superseding quantum-chemistry calculations in ab initio molecular dynamics. The prevalent benchmark for NFFs has been the MD17 dataset and its subsequent extension. These datasets predominantly comprise geometries from the equilibrium region of the ground electronic state potential energy surface, sampling from direct adiabatic dynamics. However, many chemical reactions entail significant molecular deformations, notably bond breaking. We demonstrate the constrained distribution of internal coordinates and energies in the MD17 datasets, underscoring their inadequacy for representing systems undergoing chemical reactions. Addressing this sampling limitation, we introduce the xxMD (Extended Excited-state Molecular Dynamics) dataset, derived from non-adiabatic dynamics. This dataset encompasses energies and forces ascertained from both multireference wave function theory and density functional theory. Furthermore, its nuclear configuration spaces authentically depict chemical reactions, making xxMD a more chemically relevant dataset. Our re-assessment of equivariant models on the xxMD datasets reveals notably higher mean absolute errors than those reported for MD17 and its variants. This observation underscores the challenges faced in crafting a generalizable NFF model with extrapolation capability. Our proposed xxMD-CASSCF and xxMD-DFT datasets are available at https://github.com/zpengmei/xxMD.

Exploring Unsupervised Cell Recognition with Prior Self-activation Maps

  • paper_url: http://arxiv.org/abs/2308.11144
  • repo_url: https://github.com/cpystan/psm
  • paper_authors: Pingyi Chen, Chenglu Zhu, Zhongyi Shui, Jiatong Cai, Sunyi Zheng, Shichuan Zhang, Lin Yang
  • for: Reducing annotation cost in cell recognition by exploring label-free methods.
  • methods: Prior self-activation maps (PSMs), built from gradient information in the shallow layers of a self-supervised activation network, are used to generate pseudo masks. A semantic clustering module then converts the PSMs into pixel-level semantic pseudo masks.
  • results: On two histological datasets (MoNuSeg and BCData), the method is competitive with fully-supervised and weakly-supervised methods without any manual annotations. It also supports multi-class cell detection, which existing unsupervised methods cannot do.
    Abstract The success of supervised deep learning models on cell recognition tasks relies on detailed annotations. Many previous works have managed to reduce the dependency on labels. However, considering the large number of cells contained in a patch, costly and inefficient labeling is still inevitable. To this end, we explored label-free methods for cell recognition. Prior self-activation maps (PSM) are proposed to generate pseudo masks as training targets. To be specific, an activation network is trained with self-supervised learning. The gradient information in the shallow layers of the network is aggregated to generate prior self-activation maps. Afterward, a semantic clustering module is introduced as a pipeline to transform PSMs to pixel-level semantic pseudo masks for downstream tasks. We evaluated our method on two histological datasets: MoNuSeg (cell segmentation) and BCData (multi-class cell detection). Compared with other fully-supervised and weakly-supervised methods, our method can achieve competitive performance without any manual annotations. Our simple but effective framework can also achieve multi-class cell detection, which cannot be done by existing unsupervised methods. The results show the potential of PSMs that might inspire other research to deal with the hunger for labels in the medical area.
    Summary Successful supervised deep learning models for cell recognition rely heavily on detailed annotations. However, obtaining these annotations can be costly and inefficient. To address this issue, we explored label-free methods for cell recognition. Our proposed method uses prior self-activation maps (PSMs) to generate pseudo masks as training targets. Specifically, we train an activation network using self-supervised learning to generate the PSMs, and then use a semantic clustering module to transform the PSMs into pixel-level semantic pseudo masks for downstream tasks. We evaluated our method on two histological datasets (MoNuSeg and BCData) and found that it can achieve competitive performance without any manual annotations. Our method is simple but effective, and can also perform multi-class cell detection, which is not possible with existing unsupervised methods. The results demonstrate the potential of PSMs to address the need for labels in medical applications.

Is There Any Social Principle for LLM-Based Agents?

  • paper_url: http://arxiv.org/abs/2308.11136
  • repo_url: None
  • paper_authors: Jitao Bai, Simiao Zhang, Zhonghao Chen
  • for: Arguing that work on Large Language Model based agents should involve more than human-centered alignment or application.
  • methods: The paper argues that the agent itself deserves more attention and discusses the potential of social sciences for agents.
  • results: The abstract reports no experiments; its contribution is the argument itself.
    Abstract Focus on Large Language Model based agents should involve more than "human-centered" alignment or application. We argue that more attention should be paid to the agent itself and discuss the potential of social sciences for agents.
    Summary Work on Large Language Model based agents should go beyond human-centered alignment or application. We argue that the agent itself deserves more attention, and we discuss the potential of social sciences for agents.

ReLLa: Retrieval-enhanced Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation

  • paper_url: http://arxiv.org/abs/2308.11131
  • repo_url: None
  • paper_authors: Jianghao Lin, Rong Shan, Chenxu Zhu, Kounianhua Du, Bo Chen, Shigang Quan, Ruiming Tang, Yong Yu, Weinan Zhang
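  • for: Adapting a pure large language model (LLM) to zero-shot and few-shot recommendation tasks.
  • methods: Retrieval-enhanced Large Language models (ReLLa): semantic user behavior retrieval (SUBR) for zero-shot recommendation, and retrieval-enhanced instruction tuning (ReiT) for few-shot recommendation.
  • results: Extensive experiments on MovieLens-1M show that ReLLa outperforms existing baseline models and improves lifelong sequential behavior comprehension.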
    Abstract With large language models (LLMs) achieving remarkable breakthroughs in natural language processing (NLP) domains, LLM-enhanced recommender systems have received much attention and have been actively explored currently. In this paper, we focus on adapting and empowering a pure large language model for zero-shot and few-shot recommendation tasks. First and foremost, we identify and formulate the lifelong sequential behavior incomprehension problem for LLMs in recommendation domains, i.e., LLMs fail to extract useful information from a textual context of long user behavior sequence, even if the length of context is far from reaching the context limitation of LLMs. To address such an issue and improve the recommendation performance of LLMs, we propose a novel framework, namely Retrieval-enhanced Large Language models (ReLLa) for recommendation tasks in both zero-shot and few-shot settings. For zero-shot recommendation, we perform semantic user behavior retrieval (SUBR) to improve the data quality of testing samples, which greatly reduces the difficulty for LLMs to extract the essential knowledge from user behavior sequences. As for few-shot recommendation, we further design retrieval-enhanced instruction tuning (ReiT) by adopting SUBR as a data augmentation technique for training samples. Specifically, we develop a mixed training dataset consisting of both the original data samples and their retrieval-enhanced counterparts. We conduct extensive experiments on a real-world public dataset (i.e., MovieLens-1M) to demonstrate the superiority of ReLLa compared with existing baseline models, as well as its capability for lifelong sequential behavior comprehension.
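    Summary Large language models (LLMs) have achieved remarkable breakthroughs in natural language processing (NLP), and LLM-enhanced recommender systems are receiving wide attention. This paper focuses on adapting and empowering a pure LLM for zero-shot and few-shot recommendation tasks. We first identify and formulate the lifelong sequential behavior incomprehension problem of LLMs in recommendation: LLMs fail to extract useful information from the textual context of a long user behavior sequence, even when the context length is far from the LLM's context limit. To address this and improve recommendation performance, we propose Retrieval-enhanced Large Language models (ReLLa) for both zero-shot and few-shot settings. For zero-shot recommendation, we perform semantic user behavior retrieval (SUBR) to improve the data quality of testing samples, which makes it much easier for LLMs to extract the essential knowledge from user behavior sequences. For few-shot recommendation, we design retrieval-enhanced instruction tuning (ReiT), which uses SUBR as a data augmentation technique for training samples. Specifically, we build a mixed training dataset consisting of the original samples and their retrieval-enhanced counterparts. Extensive experiments on a real-world public dataset (MovieLens-1M) demonstrate the superiority of ReLLa over existing baseline models and its capability for lifelong sequential behavior comprehension.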

Transformers for Capturing Multi-level Graph Structure using Hierarchical Distances

  • paper_url: http://arxiv.org/abs/2308.11129
  • repo_url: None
  • paper_authors: Yuankai Luo
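  • for: Giving graph transformers a structural encoding that captures the multi-level, hierarchical structure of graphs.
  • methods: Hierarchy-distance structural encoding (HDSE), which models a hierarchical distance between nodes and can be integrated flexibly with existing graph transformers alongside other positional representations.
  • results: In extensive experiments on 12 real-world datasets, HDSE improves various baseline transformers and achieves state-of-the-art performance on 10 benchmark datasets.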
    Abstract Graph transformers need strong inductive biases to derive meaningful attention scores. Yet, current proposals rarely address methods capturing longer ranges, hierarchical structures, or community structures, as they appear in various graphs such as molecules, social networks, and citation networks. In this paper, we propose a hierarchy-distance structural encoding (HDSE), which models a hierarchical distance between the nodes in a graph focusing on its multi-level, hierarchical nature. In particular, this yields a framework which can be flexibly integrated with existing graph transformers, allowing for simultaneous application with other positional representations. Through extensive experiments on 12 real-world datasets, we demonstrate that our HDSE method successfully enhances various types of baseline transformers, achieving state-of-the-art empirical performances on 10 benchmark datasets.
    Summary Graph transformers need strong inductive biases to derive meaningful attention scores. Yet current proposals rarely address methods that capture longer ranges, hierarchical structures, or community structures, as they appear in graphs such as molecules, social networks, and citation networks. In this paper, we propose a hierarchy-distance structural encoding (HDSE), which models a hierarchical distance between the nodes of a graph, focusing on its multi-level, hierarchical nature. The method can be flexibly integrated with existing graph transformers and applied alongside other positional representations. Through extensive experiments on 12 real-world datasets, we show that HDSE improves various baseline transformers and achieves state-of-the-art performance on 10 benchmark datasets.

CAME: Contrastive Automated Model Evaluation

  • paper_url: http://arxiv.org/abs/2308.11111
  • repo_url: https://github.com/pengr/contrastive_autoeval
  • paper_authors: Ru Peng, Qiuyang Duan, Haobo Wang, Jiachen Ma, Yanbo Jiang, Yongjun Tu, Xiu Jiang, Junbo Zhao
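  • for: This work proposes a new Automated Model Evaluation (AutoEval) framework for evaluating a trained machine learning model without a labeled test set.
  • methods: The framework, CAME, rests on a theoretical analysis that ties model performance to a contrastive loss. It establishes a predictable relationship between the two using only the unlabeled test set, with no training set in the loop.
  • results: Experiments show that CAME sets a new state of the art for AutoEval, significantly surpassing prior work.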
    Abstract The Automated Model Evaluation (AutoEval) framework entertains the possibility of evaluating a trained machine learning model without resorting to a labeled testing set. Despite the promise and some decent results, the existing AutoEval methods heavily rely on computing distribution shifts between the unlabelled testing set and the training set. We believe this reliance on the training set becomes another obstacle in shipping this technology to real-world ML development. In this work, we propose Contrastive Automatic Model Evaluation (CAME), a novel AutoEval framework that is rid of involving training set in the loop. The core idea of CAME bases on a theoretical analysis which bonds the model performance with a contrastive loss. Further, with extensive empirical validation, we manage to set up a predictable relationship between the two, simply by deducing on the unlabeled/unseen testing set. The resulting framework CAME establishes a new SOTA results for AutoEval by surpassing prior work significantly.
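    Summary The Automated Model Evaluation (AutoEval) framework makes it possible to evaluate a trained machine learning model without a labeled test set. Despite its promise and some decent results, existing AutoEval methods rely heavily on computing distribution shifts between the unlabeled test set and the training set. We believe this reliance on the training set is another obstacle to using the technology in real-world ML development. In this work, we propose Contrastive Automatic Model Evaluation (CAME), an AutoEval framework that does not involve the training set. Its core idea rests on a theoretical analysis that ties model performance to a contrastive loss. With extensive empirical validation, we establish a predictable relationship between the two that is obtained by inference on the unlabeled test set alone. CAME sets a new state of the art (SOTA) for AutoEval, significantly surpassing prior work.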
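
As a rough illustration of the "predictable relationship" between contrastive loss and model performance described above, the sketch below fits a straight line on a few sample sets where both quantities are known, then applies it to a contrastive loss measured on an unlabeled set. The linear form, the function names, and all numbers are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean

def fit_line(losses, accuracies):
    """Ordinary least squares fit: accuracy ~ slope * loss + intercept."""
    mx, my = mean(losses), mean(accuracies)
    sxx = sum((x - mx) ** 2 for x in losses)
    sxy = sum((x - mx) * (y - my) for x, y in zip(losses, accuracies))
    slope = sxy / sxx
    return slope, my - slope * mx

def predict_accuracy(loss, slope, intercept):
    """Estimate accuracy on an unlabeled set from its contrastive loss."""
    return slope * loss + intercept

# Hypothetical (contrastive loss, accuracy) pairs from sample sets with known accuracy.
losses = [0.5, 1.0, 1.5, 2.0]
accuracies = [0.90, 0.80, 0.70, 0.60]
slope, intercept = fit_line(losses, accuracies)

# Contrastive loss measured on an unlabeled test set (hypothetical value).
estimate = predict_accuracy(1.2, slope, intercept)
```

With these made-up numbers, a lower contrastive loss maps to a higher estimated accuracy; any monotone regressor could stand in for the straight line.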

Anonymity at Risk? Assessing Re-Identification Capabilities of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.11103
  • repo_url: https://github.com/skatinger/anonymity-at-risk-assessing-re-identification-capabilities-of-large-language-models
  • paper_authors: Alex Nyffenegger, Matthias Stürmer, Joel Niklaus
  • for: The paper explores the potential of large language models (LLMs) to re-identify individuals in court rulings, with a focus on privacy protection in the European Union and Switzerland.
  • methods: The authors construct a proof-of-concept using actual legal data from the Swiss federal supreme court and create an anonymized Wikipedia dataset for more rigorous testing. They introduce new metrics to measure performance and systematically analyze the factors that influence successful re-identifications.
  • results: Despite high re-identification rates on Wikipedia, even the best LLMs struggled with court decisions due to a lack of test datasets, the need for substantial training resources, and data sparsity in the information used for re-identification. The study concludes that re-identification using LLMs may not be feasible for now, but it could become possible in the future.
    Abstract Anonymity of both natural and legal persons in court rulings is a critical aspect of privacy protection in the European Union and Switzerland. With the advent of LLMs, concerns about large-scale re-identification of anonymized persons are growing. In accordance with the Federal Supreme Court of Switzerland, we explore the potential of LLMs to re-identify individuals in court rulings by constructing a proof-of-concept using actual legal data from the Swiss federal supreme court. Following the initial experiment, we constructed an anonymized Wikipedia dataset as a more rigorous testing ground to further investigate the findings. With the introduction and application of the new task of re-identifying people in texts, we also introduce new metrics to measure performance. We systematically analyze the factors that influence successful re-identifications, identifying model size, input length, and instruction tuning among the most critical determinants. Despite high re-identification rates on Wikipedia, even the best LLMs struggled with court decisions. The complexity is attributed to the lack of test datasets, the necessity for substantial training resources, and data sparsity in the information used for re-identification. In conclusion, this study demonstrates that re-identification using LLMs may not be feasible for now, but as the proof-of-concept on Wikipedia showed, it might become possible in the future. We hope that our system can help enhance the confidence in the security of anonymized decisions, thus leading to the courts being more confident to publish decisions.
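    Summary Anonymity of both natural and legal persons in court rulings is a critical aspect of privacy protection in the European Union and Switzerland. With the advent of LLMs, concerns about large-scale re-identification of anonymized persons are growing. In accordance with the Federal Supreme Court of Switzerland, we test the re-identification capabilities of LLMs in a proof-of-concept built on actual legal data from that court. We then construct an anonymized Wikipedia dataset as a more rigorous testing ground. Along with the new task of re-identifying people in texts, we introduce new metrics to measure performance. We systematically analyze the factors that influence successful re-identification and find that model size, input length, and instruction tuning are among the most critical. Despite high re-identification rates on Wikipedia, even the best LLMs struggled with court decisions, which we attribute to the lack of test datasets, the need for substantial training resources, and data sparsity in the information used for re-identification. We conclude that re-identification using LLMs may not be feasible for now but might become possible in the future. We hope that our system can increase confidence in the security of anonymized decisions, so that courts publish decisions more confidently.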

Using Early Exits for Fast Inference in Automatic Modulation Classification

  • paper_url: http://arxiv.org/abs/2308.11100
  • repo_url: None
  • paper_authors: Elsayed Mohammed, Omar Mashaal, Hatem Abou-Zeid
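  • for: This work aims to speed up inference for deep learning (DL) based automatic modulation classification (AMC) in wireless communications.
  • methods: It applies early exiting (EE) techniques to accelerate inference of DL models, presenting four early-exit architectures and a customized multi-branch training algorithm.
  • results: Extensive experiments show that signals with moderate to high signal-to-noise ratios (SNRs) are easier to classify and do not require deep architectures, so EE significantly reduces inference time without sacrificing classification accuracy. The work also analyzes the trade-off between inference time and classification accuracy and is, to the authors' knowledge, the first to apply EE to AMC.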
    Abstract Automatic modulation classification (AMC) plays a critical role in wireless communications by autonomously classifying signals transmitted over the radio spectrum. Deep learning (DL) techniques are increasingly being used for AMC due to their ability to extract complex wireless signal features. However, DL models are computationally intensive and incur high inference latencies. This paper proposes the application of early exiting (EE) techniques for DL models used for AMC to accelerate inference. We present and analyze four early exiting architectures and a customized multi-branch training algorithm for this problem. Through extensive experimentation, we show that signals with moderate to high signal-to-noise ratios (SNRs) are easier to classify, do not require deep architectures, and can therefore leverage the proposed EE architectures. Our experimental results demonstrate that EE techniques can significantly reduce the inference speed of deep neural networks without sacrificing classification accuracy. We also thoroughly study the trade-off between classification accuracy and inference time when using these architectures. To the best of our knowledge, this work represents the first attempt to apply early exiting methods to AMC, providing a foundation for future research in this area.
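    Summary Automatic modulation classification (AMC) plays a critical role in wireless communications by autonomously classifying signals transmitted over the radio spectrum. Deep learning (DL) techniques are increasingly used for AMC because they can extract complex wireless signal features, but DL models are computationally intensive and incur high inference latency. This paper proposes early exiting (EE) techniques to accelerate inference of DL models for AMC. We present four EE architectures and a customized multi-branch training algorithm. Extensive experiments show that signals with moderate to high signal-to-noise ratios (SNRs) are easier to classify, do not require deep architectures, and can therefore use the proposed EE architectures. Our results show that EE techniques significantly reduce the inference time of deep neural networks without sacrificing classification accuracy. We also study the trade-off between classification accuracy and inference time for these architectures. To the best of our knowledge, this is the first application of EE methods to AMC, providing a foundation for future research in this area.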
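
The general early-exit idea can be sketched as follows: run the network stage by stage and stop at the first intermediate classifier that is confident enough. This is a minimal sketch under stated assumptions. The confidence-threshold exit rule, the function `early_exit_predict`, and the toy stages and heads are all hypothetical and are not the four architectures studied in the paper.

```python
def early_exit_predict(x, stages, exit_heads, threshold=0.9):
    """Run stages in order; stop at the first exit head that is confident enough.

    stages:     list of callables mapping features -> features
    exit_heads: list of callables mapping features -> list of class probabilities
    Returns (predicted_class, index_of_exit_used).
    """
    features = x
    for i, (stage, head) in enumerate(zip(stages, exit_heads)):
        features = stage(features)
        probs = head(features)
        confidence = max(probs)
        is_last = i == len(stages) - 1
        # Exit early when confident; the final head always answers.
        if confidence >= threshold or is_last:
            return probs.index(confidence), i

# Toy stand-ins (hypothetical): each stage adds 1 to a scalar "feature",
# and the head grows more confident as the feature grows.
stages = [lambda f: f + 1.0] * 3

def head(f):
    p = min(0.5 + 0.2 * f, 0.99)
    return [p, 1.0 - p]

exit_heads = [head] * 3

easy = early_exit_predict(2.0, stages, exit_heads)   # leaves at the first exit
hard = early_exit_predict(-1.0, stages, exit_heads)  # runs through every stage
```

In this toy setup the "easy" input plays the role of a high-SNR signal that skips the deeper stages, while the "hard" input uses the full depth.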

Video OWL-ViT: Temporally-consistent open-world localization in video

  • paper_url: http://arxiv.org/abs/2308.11093
  • repo_url: None
  • paper_authors: Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf
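  • for: This work adapts pre-trained open-world image models to localization in videos.
  • methods: It builds on the OWL-ViT open-vocabulary detection model and adds a transformer decoder that uses the output tokens for one frame as the object queries for the next.
  • results: The model performs well on the challenging TAO-OW benchmark, showing that open-world capabilities learned from large-scale image-text pre-training transfer to open-world localization in video.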
    Abstract We present an architecture and a training recipe that adapts pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pre-training, can be transferred successfully to open-world localization across diverse videos.
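    Summary We present an architecture and a training recipe that adapt pre-trained open-world image models to localization in videos. Understanding the open visual world, without being constrained by fixed label spaces, is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements on image-level tasks. For more structured tasks such as object localization, applying pre-trained models is harder, particularly for video, where task-specific data is limited. We build on the OWL-ViT open-vocabulary detection model and add a transformer decoder that propagates object representations through time, using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and has better temporal consistency than tracking-by-detection baselines while retaining the open-world capabilities of the backbone detector. On the TAO-OW benchmark, we show that open-world capabilities learned from large-scale image-text pre-training transfer successfully to open-world localization across diverse videos.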
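
The recurrence described above, where the output tokens for one frame become the object queries for the next, can be sketched as a simple loop. Only the data flow is shown. The names `track_video` and `toy_decoder` are hypothetical, and the scalar "tokens" stand in for the real transformer decoder and image features.

```python
def track_video(frame_features, initial_queries, decoder):
    """Propagate object queries through time.

    decoder(queries, features) -> output tokens (one per query).
    The output tokens of frame t become the queries for frame t + 1.
    """
    queries = initial_queries
    outputs_per_frame = []
    for features in frame_features:
        tokens = decoder(queries, features)
        outputs_per_frame.append(tokens)
        queries = tokens  # recurrence: reuse outputs as next-frame queries
    return outputs_per_frame

# Hypothetical toy decoder: each token is a scalar that drifts toward the frame feature.
def toy_decoder(queries, feature):
    return [0.5 * q + 0.5 * feature for q in queries]

outputs = track_video([1.0, 1.0, 3.0], [0.0, 2.0], toy_decoder)
```

Because each query slot keeps its position from frame to frame, slot `k` can be read as one object track, which is one way to picture the temporal consistency the abstract mentions.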

Collaborative Route Planning of UAVs, Workers and Cars for Crowdsensing in Disaster Response

  • paper_url: http://arxiv.org/abs/2308.11088
  • repo_url: None
  • paper_authors: Lei Han, Chunyu Tu, Zhiwen Yu, Zhiyong Yu, Weihua Shan, Liang Wang, Bin Guo
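  • for: This work addresses route planning for multiple collaborating agents (UAVs, workers, and cars) that collect data in disaster-stricken areas, with the goal of maximizing the task completion rate.
  • methods: It proposes MANF-RL-RP, a heterogeneous multi-agent route planning algorithm with several efficient designs, including global-local dual information processing and a model structure tailored to heterogeneous multi-agent systems.
  • results: Compared with the baseline algorithms Greedy-SC-RP and MANF-DNN-RP, MANF-RL-RP significantly improves the task completion rate.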
    Abstract Efficiently obtaining the up-to-date information in the disaster-stricken area is the key to successful disaster response. Unmanned aerial vehicles (UAVs), workers and cars can collaborate to accomplish sensing tasks, such as data collection, in disaster-stricken areas. In this paper, we explicitly address the route planning for a group of agents, including UAVs, workers, and cars, with the goal of maximizing the task completion rate. We propose MANF-RL-RP, a heterogeneous multi-agent route planning algorithm that incorporates several efficient designs, including global-local dual information processing and a tailored model structure for heterogeneous multi-agent systems. Global-local dual information processing encompasses the extraction and dissemination of spatial features from global information, as well as the partitioning and filtering of local information from individual agents. Regarding the construction of the model structure for heterogeneous multi-agent, we perform the following work. We design the same data structure to represent the states of different agents, prove the Markovian property of the decision-making process of agents to simplify the model structure, and also design a reasonable reward function to train the model. Finally, we conducted detailed experiments based on the rich simulation data. In comparison to the baseline algorithms, namely Greedy-SC-RP and MANF-DNN-RP, MANF-RL-RP has exhibited a significant improvement in terms of task completion rate.
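    Summary Efficiently obtaining up-to-date information in a disaster-stricken area is key to successful disaster response. Unmanned aerial vehicles (UAVs), workers, and cars can collaborate on sensing tasks such as data collection in such areas. In this paper, we address route planning for a group of agents, including UAVs, workers, and cars, with the goal of maximizing the task completion rate. We propose MANF-RL-RP, a heterogeneous multi-agent route planning algorithm that incorporates several efficient designs, including global-local dual information processing and a model structure tailored to heterogeneous multi-agent systems. Global-local dual information processing covers the extraction and dissemination of spatial features from global information and the partitioning and filtering of local information from individual agents. For the model structure, we design a common data structure to represent the states of different agents, prove the Markov property of the agents' decision-making process to simplify the model structure, and design a reasonable reward function to train the model. Finally, we conduct detailed experiments on rich simulation data. Compared with the baseline algorithms Greedy-SC-RP and MANF-DNN-RP, MANF-RL-RP significantly improves the task completion rate.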

Neural Amortized Inference for Nested Multi-agent Reasoning

  • paper_url: http://arxiv.org/abs/2308.11071
  • repo_url: None
  • paper_authors: Kunal Jha, Tuan Anh Le, Chuanyang Jin, Yen-Ling Kuo, Joshua B. Tenenbaum, Tianmin Shu
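  • for: This work aims to make higher-order social inference in multi-agent interactions, that is, understanding how others infer oneself, computationally tractable.
  • methods: It uses neural networks to amortize high-order social inference, which speeds up nested multi-agent reasoning.
  • results: Experiments in two multi-agent interaction domains show that the method is computationally efficient with minimal degradation in accuracy.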
    Abstract Multi-agent interactions, such as communication, teaching, and bluffing, often rely on higher-order social inference, i.e., understanding how others infer oneself. Such intricate reasoning can be effectively modeled through nested multi-agent reasoning. Nonetheless, the computational complexity escalates exponentially with each level of reasoning, posing a significant challenge. However, humans effortlessly perform complex social inferences as part of their daily lives. To bridge the gap between human-like inference capabilities and computational limitations, we propose a novel approach: leveraging neural networks to amortize high-order social inference, thereby expediting nested multi-agent reasoning. We evaluate our method in two challenging multi-agent interaction domains. The experimental results demonstrate that our method is computationally efficient while exhibiting minimal degradation in accuracy.
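    Summary Multi-agent interactions such as communication, teaching, and bluffing often rely on higher-order social inference, that is, understanding how others infer oneself. Such reasoning can be modeled effectively through nested multi-agent reasoning, but the computational complexity grows exponentially with each level of reasoning, which poses a significant challenge. Humans, however, perform complex social inferences effortlessly in daily life. To bridge this gap, we propose using neural networks to amortize high-order social inference, thereby speeding up nested multi-agent reasoning. We evaluate the method in two challenging multi-agent interaction domains; the results show that it is computationally efficient with minimal degradation in accuracy.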
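
To make the exponential growth concrete, the sketch below counts model evaluations for exact nested reasoning under a simple assumed cost model: at every level the reasoner considers a fixed number of hypotheses about the other agent and recurses once per hypothesis. The cost model and the function name are assumptions for illustration, not the paper's formulation.

```python
def nested_inference_calls(depth, hypotheses_per_level):
    """Count model evaluations for exact nested reasoning.

    Assumes that at every level the reasoner considers `hypotheses_per_level`
    candidate states of the other agent and recurses one level deeper for each.
    """
    if depth == 0:
        return 1  # base case: a single non-recursive model evaluation
    return 1 + hypotheses_per_level * nested_inference_calls(depth - 1, hypotheses_per_level)

# Cost for reasoning depths 0..3 with 10 hypotheses per level.
costs = [nested_inference_calls(d, 10) for d in range(4)]
```

Under this cost model each extra level multiplies the work by roughly the number of hypotheses. An amortized approach would instead replace the inner recursion with a learned network evaluated once per query.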

Temporal-Distributed Backdoor Attack Against Video Based Action Recognition

  • paper_url: http://arxiv.org/abs/2308.11070
  • repo_url: None
  • paper_authors: Xi Li, Songhe Wang, Ruiquan Huang, Mahanth Gowda, George Kesidis
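  • for: This work studies backdoor (Trojan) attacks on video data and how well existing defenses withstand them.
  • methods: It proposes a simple yet effective backdoor attack that adds perturbations in a transformed domain to plant an imperceptible trigger distributed temporally across the video frames, while the compromised model keeps high accuracy on attack-free instances.
  • results: Extensive experiments with well-known models on several video recognition benchmarks show that the attack is effective and resilient to existing defensive strategies. The study also identifies an effect termed "collateral damage", in which the attack may cause data from non-target classes to be misclassified.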
    Abstract Deep neural networks (DNNs) have achieved tremendous success in various applications including video action recognition, yet remain vulnerable to backdoor attacks (Trojans). The backdoor-compromised model will mis-classify to the target class chosen by the attacker when a test instance (from a non-target class) is embedded with a specific trigger, while maintaining high accuracy on attack-free instances. Although there are extensive studies on backdoor attacks against image data, the susceptibility of video-based systems under backdoor attacks remains largely unexplored. Current studies are direct extensions of approaches proposed for image data, e.g., the triggers are independently embedded within the frames, which tend to be detectable by existing defenses. In this paper, we introduce a simple yet effective backdoor attack against video data. Our proposed attack, adding perturbations in a transformed domain, plants an imperceptible, temporally distributed trigger across the video frames, and is shown to be resilient to existing defensive strategies. The effectiveness of the proposed attack is demonstrated by extensive experiments with various well-known models on two video recognition benchmarks, UCF101 and HMDB51, and a sign language recognition benchmark, Greek Sign Language (GSL) dataset. We delve into the impact of several influential factors on our proposed attack and identify an intriguing effect termed "collateral damage" through extensive studies.
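    Summary Deep neural networks (DNNs) have achieved great success in applications including video action recognition, yet they remain vulnerable to backdoor attacks (Trojans). A backdoor-compromised model misclassifies a test instance from a non-target class to the attacker's target class when the instance carries a specific trigger, while maintaining high accuracy on attack-free instances. Backdoor attacks on image data have been studied extensively, but the susceptibility of video-based systems remains largely unexplored. Existing studies directly extend approaches proposed for image data, for example by embedding triggers independently within frames, and such triggers tend to be detectable by existing defenses. In this paper, we introduce a simple yet effective backdoor attack on video data that adds perturbations in a transformed domain to plant an imperceptible, temporally distributed trigger across the video frames, and we show that it is resilient to existing defensive strategies. Extensive experiments with various well-known models on UCF101, HMDB51, and the Greek Sign Language (GSL) dataset demonstrate its effectiveness. We also examine several factors that influence the attack and identify an intriguing effect that we term "collateral damage".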

Topological Graph Signal Compression

  • paper_url: http://arxiv.org/abs/2308.11068
  • repo_url: None
  • paper_authors: Guillermo Bernárdez, Lev Telyatnikov, Eduard Alarcón, Albert Cabellos-Aparicio, Pere Barlet-Ros, Pietro Liò
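  • for: This paper proposes a Topological Deep Learning (TDL) based method for compressing signals over graphs.
  • methods: The method first clusters the datapoints of the original signal into disjoint sets of higher-order structures, then uses topology-inspired message passing to obtain a compressed representation of the signal within those sets.
  • results: On datasets from two real-world Internet Service Provider networks, the method improves on standard GNN and feed-forward architectures, with reconstruction errors from $30\%$ up to $90\%$ better, suggesting that it better captures and exploits spatial and temporal correlations across the graph structure.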
    Abstract Recently emerged Topological Deep Learning (TDL) methods aim to extend current Graph Neural Networks (GNN) by naturally processing higher-order interactions, going beyond the pairwise relations and local neighborhoods defined by graph representations. In this paper we propose a novel TDL-based method for compressing signals over graphs, consisting of two main steps: first, disjoint sets of higher-order structures are inferred based on the original signal by clustering $N$ datapoints into $K\ll N$ collections; then, a topological-inspired message passing gets a compressed representation of the signal within those multi-element sets. Our results show that our framework improves both standard GNN and feed-forward architectures in compressing temporal link-based signals from two real-world Internet Service Provider Networks' datasets, with reconstruction errors from $30\%$ up to $90\%$ better across all evaluation scenarios, suggesting that it better captures and exploits spatial and temporal correlations over the whole graph-based network structure.
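    Summary Recently emerged Topological Deep Learning (TDL) methods aim to extend current Graph Neural Networks (GNNs) by naturally processing higher-order interactions, going beyond the pairwise relations and local neighborhoods defined by graph representations. In this paper, we propose a TDL-based method for compressing signals over graphs that consists of two main steps: first, disjoint sets of higher-order structures are inferred from the original signal by clustering $N$ datapoints into $K\ll N$ collections; then, topology-inspired message passing produces a compressed representation of the signal within those multi-element sets. On datasets from two real-world Internet Service Provider networks, our framework improves on both standard GNN and feed-forward architectures in compressing temporal link-based signals, with reconstruction errors from $30\%$ up to $90\%$ better, suggesting that it better captures and exploits spatial and temporal correlations over the whole graph-based network structure.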

CSM-H-R: An Automatic Context Reasoning Framework for Interoperable Intelligent Systems and Privacy Protection

  • paper_url: http://arxiv.org/abs/2308.11066
  • repo_url: https://github.com/songhui01/csm-h-r
  • paper_authors: Songhui Yue, Xiaoyan Hong, Randy K. Smith
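  • for: This paper proposes an automatic High-Level Context (HLC) reasoning framework to support the automated integration of intelligent systems at scale.
  • methods: The framework programmatically combines ontologies and states at runtime and in the model-storage phase to recognize meaningful HLC, producing a data representation that can be applied to different reasoning techniques.
  • results: Case studies on an intelligent elevator system and experiments with the CSM Engine show that HLC reasoning can be translated into vector and matrix computing. Privacy protection is supported through anonymization by label embedding and by reducing information correlation.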
    Abstract Automation of High-Level Context (HLC) reasoning for intelligent systems at scale is imperative due to the unceasing accumulation of contextual data in the IoT era, the trend of the fusion of data from multi-sources, and the intrinsic complexity and dynamism of the context-based decision-making process. To mitigate this issue, we propose an automatic context reasoning framework CSM-H-R, which programmatically combines ontologies and states at runtime and the model-storage phase for attaining the ability to recognize meaningful HLC, and the resulting data representation can be applied to different reasoning techniques. Case studies are developed based on an intelligent elevator system in a smart campus setting. An implementation of the framework - a CSM Engine, and the experiments of translating the HLC reasoning into vector and matrix computing especially take care of the dynamic aspects of context and present the potentiality of using advanced mathematical and probabilistic models to achieve the next level of automation in integrating intelligent systems; meanwhile, privacy protection support is achieved by anonymization through label embedding and reducing information correlation. The code of this study is available at: https://github.com/songhui01/CSM-H-R.
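    Summary Automating High-Level Context (HLC) reasoning for intelligent systems at scale is imperative because of the unceasing accumulation of contextual data in the IoT era, the trend toward fusing data from multiple sources, and the intrinsic complexity and dynamism of context-based decision-making. To address this, we propose an automatic context reasoning framework, CSM-H-R, which programmatically combines ontologies and states at runtime and in the model-storage phase to recognize meaningful HLC; the resulting data representation can be applied to different reasoning techniques. Case studies are developed on an intelligent elevator system in a smart campus setting. An implementation of the framework, the CSM Engine, and experiments that translate HLC reasoning into vector and matrix computing address the dynamic aspects of context and show the potential of advanced mathematical and probabilistic models for reaching the next level of automation in integrating intelligent systems. Privacy protection is supported by anonymization through label embedding and by reducing information correlation. The code is available at: https://github.com/songhui01/CSM-H-R.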

FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning

  • paper_url: http://arxiv.org/abs/2308.12305
  • repo_url: None
  • paper_authors: Haokun Chen, Yao Zhang, Denis Krompass, Jindong Gu, Volker Tresp
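  • for: This work aims to finetune foundation models for multi-modal learning in a federated setting, where training data cannot be centralized.
  • methods: It proposes Federated Dual-Adapter Teacher (FedDAT), which regularizes client local updates and applies Mutual Knowledge Distillation (MKD) to address data heterogeneity across clients.
  • results: On multi-modal Vision-Language tasks, FedDAT substantially outperforms existing centralized PEFT methods adapted for FL.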
    Abstract Recently, foundation models have exhibited remarkable advancements in multi-modal learning. These models, equipped with millions (or billions) of parameters, typically require a substantial amount of data for finetuning. However, collecting and centralizing training data from diverse sectors becomes challenging due to distinct privacy regulations. Federated Learning (FL) emerges as a promising solution, enabling multiple clients to collaboratively train neural networks without centralizing their local data. To alleviate client computation burdens and communication overheads, previous works have adapted Parameter-efficient Finetuning (PEFT) methods for FL. Hereby, only a small fraction of the model parameters are optimized and communicated during federated communications. Nevertheless, most previous works have focused on a single modality and neglected one common phenomenon, i.e., the presence of data heterogeneity across the clients. Therefore, in this work, we propose a finetuning framework tailored to heterogeneous multi-modal FL, called Federated Dual-Adapter Teacher (FedDAT). Specifically, our approach leverages a Dual-Adapter Teacher (DAT) to address data heterogeneity by regularizing the client local updates and applying Mutual Knowledge Distillation (MKD) for an efficient knowledge transfer. FedDAT is the first approach that enables an efficient distributed finetuning of foundation models for a variety of heterogeneous Vision-Language tasks. To demonstrate its effectiveness, we conduct extensive experiments on four multi-modality FL benchmarks with different types of data heterogeneity, where FedDAT substantially outperforms the existing centralized PEFT methods adapted for FL.
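    Summary Foundation models have recently shown remarkable advances in multi-modal learning. These models typically require a substantial amount of data for finetuning, but collecting and centralizing training data is difficult because of differing privacy regulations. Federated Learning (FL) is a promising solution that lets multiple clients collaboratively train neural networks without centralizing their local data. To reduce client computation and communication overhead, previous works have adapted Parameter-efficient Finetuning (PEFT) methods for FL. However, most of them focus on a single modality and neglect data heterogeneity across clients. We therefore propose a finetuning framework tailored to heterogeneous multi-modal FL, called Federated Dual-Adapter Teacher (FedDAT). Our approach uses a Dual-Adapter Teacher (DAT) to address data heterogeneity by regularizing client local updates and applying Mutual Knowledge Distillation (MKD) for efficient knowledge transfer. FedDAT is the first approach that enables efficient distributed finetuning of foundation models for a variety of heterogeneous Vision-Language tasks. In extensive experiments on four multi-modality FL benchmarks with different types of data heterogeneity, FedDAT substantially outperforms existing centralized PEFT methods adapted for FL.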

Beyond Discriminative Regions: Saliency Maps as Alternatives to CAMs for Weakly Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.11052
  • repo_url: None
  • paper_authors: M. Maruf, Arka Daw, Amartya Dutta, Jie Bu, Anuj Karpatne
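  • for: This study compares saliency maps and class activation maps (CAMs) for weakly supervised semantic segmentation (WS3) and proposes new evaluation metrics for a comprehensive assessment of both.
  • methods: It uses CAMs and saliency maps to generate pseudo-ground truths and compares their similarities and dissimilarities from multiple perspectives.
  • results: Saliency maps address the non-discriminative region (NDR) limitation of CAMs, and random cropping, used as a stochastic aggregation technique, further improves saliency performance.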
    Abstract In recent years, several Weakly Supervised Semantic Segmentation (WS3) methods have been proposed that use class activation maps (CAMs) generated by a classifier to produce pseudo-ground truths for training segmentation models. While CAMs are good at highlighting discriminative regions (DR) of an image, they are known to disregard regions of the object that do not contribute to the classifier's prediction, termed non-discriminative regions (NDR). In contrast, attribution methods such as saliency maps provide an alternative approach for assigning a score to every pixel based on its contribution to the classification prediction. This paper provides a comprehensive comparison between saliencies and CAMs for WS3. Our study includes multiple perspectives on understanding their similarities and dissimilarities. Moreover, we provide new evaluation metrics that perform a comprehensive assessment of WS3 performance of alternative methods w.r.t. CAMs. We demonstrate the effectiveness of saliencies in addressing the limitation of CAMs through our empirical studies on benchmark datasets. Furthermore, we propose random cropping as a stochastic aggregation technique that improves the performance of saliency, making it a strong alternative to CAM for WS3.

Personalized Event Prediction for Electronic Health Records

  • paper_url: http://arxiv.org/abs/2308.11013
  • repo_url: None
  • paper_authors: Jeong Min Lee, Milos Hauskrecht
  • for: The paper aims to develop accurate predictive models of clinical event sequences to support patient care, specifically by addressing the challenge of patient-specific variability in clinical conditions.
  • methods: The paper proposes and investigates multiple new event sequence prediction models and methods, including refinement of population-wide models to subpopulations, self-adaptation, and meta-level model switching.
  • results: The paper analyzes and tests the performance of these models on clinical event sequences of patients in the MIMIC-III database.
    Abstract Clinical event sequences consist of hundreds of clinical events that represent records of patient care in time. Developing accurate predictive models of such sequences is of a great importance for supporting a variety of models for interpreting/classifying the current patient condition, or predicting adverse clinical events and outcomes, all aimed to improve patient care. One important challenge of learning predictive models of clinical sequences is their patient-specific variability. Based on underlying clinical conditions, each patient's sequence may consist of different sets of clinical events (observations, lab results, medications, procedures). Hence, simple population-wide models learned from event sequences for many different patients may not accurately predict patient-specific dynamics of event sequences and their differences. To address the problem, we propose and investigate multiple new event sequence prediction models and methods that let us better adjust the prediction for individual patients and their specific conditions. The methods developed in this work pursue refinement of population-wide models to subpopulations, self-adaptation, and a meta-level model switching that is able to adaptively select the model with the best chance to support the immediate prediction. We analyze and test the performance of these models on clinical event sequences of patients in MIMIC-III database.
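    Summary Clinical event sequences consist of hundreds of clinical events that record patient care over time. Accurate predictive models of such sequences are important for supporting models that interpret or classify the current patient condition or predict adverse clinical events and outcomes, all with the aim of improving patient care. An important challenge in learning such models is patient-specific variability: depending on the underlying clinical conditions, each patient's sequence may consist of different sets of clinical events (observations, lab results, medications, procedures). Simple population-wide models learned from the event sequences of many different patients may therefore fail to predict patient-specific dynamics accurately. To address this, we propose and investigate multiple new event sequence prediction models and methods that better adjust predictions to individual patients and their specific conditions. We analyze and test these models on clinical event sequences of patients in the MIMIC-III database.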
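
The meta-level model switching mentioned in the abstract, which adaptively selects the model with the best chance of supporting the immediate prediction, might be pictured as below. This is only a sketch under stated assumptions. The selection rule (lowest mean absolute error over a recent window), the scalar toy signal, and the names `switch_and_predict`, `population`, and `subpopulation` are all hypothetical and are not taken from the paper.

```python
def switch_and_predict(models, history, next_input, window=3):
    """Pick the model with the lowest recent error on this patient's history.

    models:  dict name -> callable(input) -> prediction
    history: non-empty list of (input, observed_value) pairs for one patient
    Returns (name_of_chosen_model, its_prediction_for_next_input).
    """
    recent = history[-window:]

    def recent_error(model):
        # Mean absolute error over the most recent observations.
        return sum(abs(model(x) - y) for x, y in recent) / len(recent)

    best_name = min(models, key=lambda name: recent_error(models[name]))
    return best_name, models[best_name](next_input)

# Hypothetical stand-ins for a population-wide and a subpopulation model.
models = {
    "population": lambda x: 0.5 * x,
    "subpopulation": lambda x: 0.8 * x,
}
history = [(1.0, 0.5), (2.0, 1.6), (3.0, 2.4), (4.0, 3.2)]
choice, prediction = switch_and_predict(models, history, 5.0)
```

Here the patient's recent observations follow the subpopulation model more closely, so the switcher selects it for the next prediction.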

“Guinea Pig Trials” Utilizing GPT: A Novel Smart Agent-Based Modeling Approach for Studying Firm Competition and Collusion

  • paper_url: http://arxiv.org/abs/2308.10974
  • repo_url: None
  • paper_authors: Xu Han, Zengqing Wu, Chuan Xiao
  • for: The paper studies firm competition and collusion using a novel framework called Smart Agent-Based Modeling (SABM), which employs GPT-4 technologies to represent firms and their interactions.
  • methods: The study uses a controlled experiment with smart agents to examine firm price competition and collusion behaviors under various conditions, comparing the results to those obtained through experiments with human subjects.
  • results: The paper finds that smart agents consistently reach tacit collusion in the absence of communication, leading to prices converging at levels higher than the Bertrand equilibrium price but lower than monopoly or cartel prices. With communication allowed, smart agents achieve a higher-level collusion with prices close to cartel prices, and collusion forms more quickly with communication. These results highlight the importance of communication in enhancing trust between firms and facilitating collusion.
    Abstract Firm competition and collusion involve complex dynamics, particularly when considering communication among firms. Such issues can be modeled as problems of complex systems, traditionally approached through experiments involving human subjects or agent-based modeling methods. We propose an innovative framework called Smart Agent-Based Modeling (SABM), wherein smart agents, supported by GPT-4 technologies, represent firms, and interact with one another. We conducted a controlled experiment to study firm price competition and collusion behaviors under various conditions. SABM is more cost-effective and flexible compared to conducting experiments with human subjects. Smart agents possess an extensive knowledge base for decision-making and exhibit human-like strategic abilities, surpassing traditional ABM agents. Furthermore, smart agents can simulate human conversation and be personalized, making them ideal for studying complex situations involving communication. Our results demonstrate that, in the absence of communication, smart agents consistently reach tacit collusion, leading to prices converging at levels higher than the Bertrand equilibrium price but lower than monopoly or cartel prices. When communication is allowed, smart agents achieve a higher-level collusion with prices close to cartel prices. Collusion forms more quickly with communication, while price convergence is smoother without it. These results indicate that communication enhances trust between firms, encouraging frequent small price deviations to explore opportunities for a higher-level win-win situation and reducing the likelihood of triggering a price war. We also assigned different personas to firms to analyze behavioral differences and tested variant models under diverse market structures. The findings showcase the effectiveness and robustness of SABM and provide intriguing insights into competition and collusion.
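    Summary Firm competition and collusion involve complex dynamics, particularly when firms communicate with one another. Such problems have traditionally been studied through experiments with human subjects or agent-based modeling (ABM). We propose Smart Agent-Based Modeling (SABM), a framework in which smart agents, supported by GPT-4 technologies, represent firms and interact with one another. We conducted a controlled experiment to study firm price competition and collusion under various conditions. SABM is more cost-effective and flexible than experiments with human subjects. Smart agents have an extensive knowledge base and human-like strategic abilities, surpassing traditional ABM agents; they can also simulate human conversation and be personalized, which makes them well suited to studying complex situations that involve communication. Without communication, smart agents consistently reach tacit collusion, with prices converging above the Bertrand equilibrium price but below monopoly or cartel prices. With communication, they achieve a higher level of collusion, with prices close to cartel prices. Collusion forms more quickly with communication, while price convergence is smoother without it. These results indicate that communication enhances trust between firms, encouraging frequent small price deviations to explore a higher-level win-win situation and reducing the likelihood of a price war. We also assigned different personas to firms to analyze behavioral differences and tested variant models under diverse market structures. The findings show the effectiveness and robustness of SABM and offer insights into competition and collusion.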
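
The two reference points named above, the Bertrand equilibrium price and the cartel price, can be worked out for a textbook symmetric duopoly. The sketch assumes linear differentiated demand q_i = a - b * p_i + c * p_j with constant marginal cost. This demand model and the parameter values are assumptions chosen for illustration; the market setup used in the paper is not specified in the abstract.

```python
def bertrand_price(a, b, c, cost):
    """Symmetric Nash price when each firm maximizes its own profit.

    Assumed demand for firm i: q_i = a - b * p_i + c * p_j, with b > c >= 0.
    Follows from the first-order condition a - 2*b*p + c*p + b*cost = 0.
    """
    return (a + b * cost) / (2 * b - c)

def cartel_price(a, b, c, cost):
    """Common price that maximizes the two firms' joint profit."""
    return (a + (b - c) * cost) / (2 * (b - c))

def profit(p_own, p_other, a, b, c, cost):
    """Profit of one firm given its own price and its rival's price."""
    return (p_own - cost) * (a - b * p_own + c * p_other)

a, b, c, cost = 10.0, 2.0, 1.0, 1.0  # hypothetical market parameters
p_nash = bertrand_price(a, b, c, cost)
p_cartel = cartel_price(a, b, c, cost)

# Per-firm profit at each benchmark, and the gain from undercutting the cartel price.
profit_nash = profit(p_nash, p_nash, a, b, c, cost)
profit_cartel = profit(p_cartel, p_cartel, a, b, c, cost)
profit_deviate = profit(5.0, p_cartel, a, b, c, cost)
```

With these parameters the cartel price lies above the Bertrand price and yields higher profit per firm, yet a firm that undercuts the cartel price earns more still. That temptation is why collusion depends on trust, and tacit collusion would correspond to prices settling between the two benchmarks.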

DocPrompt: Large-scale continue pretrain for zero-shot and few-shot document question answering

  • paper_url: http://arxiv.org/abs/2308.10959
  • repo_url: None
  • paper_authors: Sijin Wu, Dan Zhang, Teng Hu, Shikun Feng
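  • for: The paper proposes Docprompt, a document question answering model with strong zero-shot and few-shot performance.
  • methods: It introduces a novel weakly supervised data generation method, a multi-stage training method, and an ensemble method that combines an understanding model with a generation model.
  • results: Experiments show that, after continued pretraining, Docprompt significantly outperforms existing strong baselines on document question answering tasks. It also greatly improves delivery efficiency and model performance in customer document QA projects while reducing annotation and labor costs.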
    Abstract In this paper, we propose Docprompt for document question answering tasks with powerful zero-shot and few-shot performance. We propose a novel weakly supervised data generation method, a novel multi-stage training method, and a novel understanding model & generation model ensemble method. Experimental results show that the Docprompt model, after continued pretraining, significantly outperforms the existing strong baseline models on document question answering tasks. This method greatly improves the delivery efficiency and model performance of document question answering customer projects, reducing annotation costs and labor costs. Our demo can be found at https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout.
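    Summary This paper proposes Docprompt, a document question answering approach with strong zero-shot and few-shot performance. It introduces a novel weakly supervised data generation method, a multi-stage training method, and a method for ensembling an understanding model with a generation model. Experiments show that, after continued pretraining, Docprompt significantly outperforms existing strong baselines on document question answering tasks. The method greatly improves delivery efficiency and model performance for customer document QA projects while reducing annotation and labor costs. A demo is available at https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout.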

Structured World Models from Human Videos

  • paper_url: http://arxiv.org/abs/2308.10901
  • repo_url: None
  • paper_authors: Russell Mendonca, Shikhar Bahl, Deepak Pathak
  • for: The paper aims to enable robots to learn complex manipulation skills directly in the real world using a small amount of interaction data.
  • methods: The approach uses human video data to build a structured, human-centric action space grounded in visual affordances, and trains a world model on human videos before fine-tuning on a small amount of robot interaction data without task supervision.
  • results: The approach allows robots to learn various manipulation skills in complex settings in under 30 minutes of interaction.
    Abstract We tackle the problem of learning complex, general behaviors directly in the real world. We propose an approach for robots to efficiently learn manipulation skills using only a handful of real-world interaction trajectories from many different settings. Inspired by the success of learning from large-scale datasets in the fields of computer vision and natural language, our belief is that in order to efficiently learn, a robot must be able to leverage internet-scale, human video data. Humans interact with the world in many interesting ways, which can allow a robot to not only build an understanding of useful actions and affordances but also how these actions affect the world for manipulation. Our approach builds a structured, human-centric action space grounded in visual affordances learned from human videos. Further, we train a world model on human videos and fine-tune on a small amount of robot interaction data without any task supervision. We show that this approach of affordance-space world models enables different robots to learn various manipulation skills in complex settings, in under 30 minutes of interaction. Videos can be found at https://human-world-model.github.io
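    Summary We address the problem of learning complex, general behaviors directly in the real world. We propose an approach that lets robots learn manipulation skills efficiently from only a handful of real-world interaction trajectories collected across many different settings. Motivated by the success of large-scale data in computer vision and natural language processing, we argue that efficient learning requires robots to leverage internet-scale human video data. Humans interact with the world in many interesting ways, which lets a robot build an understanding of useful actions and affordances and of how those actions affect the world. Our approach builds a structured, human-centric action space grounded in visual affordances learned from human videos, trains a world model on those videos, and fine-tunes it on a small amount of robot interaction data without task supervision. We show that affordance-space world models let different robots learn a variety of manipulation skills in complex settings in under 30 minutes of interaction. Videos are available at https://human-world-model.github.io.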

TADA! Text to Animatable Digital Avatars

  • paper_url: http://arxiv.org/abs/2308.10899
  • repo_url: https://github.com/TingtingLiao/TADA
  • paper_authors: Tingting Liao, Hongwei Yi, Yuliang Xiu, Jiaxaing Tang, Yangyi Huang, Justus Thies, Michael J. Black
  • for: The paper aims to generate high-quality 3D avatars from textual descriptions, with realistic animations and detailed geometry.
  • methods: The approach uses a 2D diffusion model and an animatable parametric body model, along with hierarchical rendering and score distillation sampling (SDS), to create detailed 3D avatars from text.
  • results: The paper demonstrates that TADA significantly surpasses existing approaches on both qualitative and quantitative measures, enabling the creation of large-scale digital character assets that are ready for animation and rendering and are easily editable through natural language.
    Abstract We introduce TADA, a simple-yet-effective approach that takes textual descriptions and produces expressive 3D avatars with high-quality geometry and lifelike textures, that can be animated and rendered with traditional graphics pipelines. Existing text-based character generation methods are limited in terms of geometry and texture quality, and cannot be realistically animated due to inconsistent alignment between the geometry and the texture, particularly in the face region. To overcome these limitations, TADA leverages the synergy of a 2D diffusion model and an animatable parametric body model. Specifically, we derive an optimizable high-resolution body model from SMPL-X with 3D displacements and a texture map, and use hierarchical rendering with score distillation sampling (SDS) to create high-quality, detailed, holistic 3D avatars from text. To ensure alignment between the geometry and texture, we render normals and RGB images of the generated character and exploit their latent embeddings in the SDS training process. We further introduce various expression parameters to deform the generated character during training, ensuring that the semantics of our generated character remain consistent with the original SMPL-X model, resulting in an animatable character. Comprehensive evaluations demonstrate that TADA significantly surpasses existing approaches on both qualitative and quantitative measures. TADA enables creation of large-scale digital character assets that are ready for animation and rendering, while also being easily editable through natural language. The code will be public for research purposes.
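    Summary We introduce TADA, a simple yet effective approach that turns textual descriptions into expressive 3D avatars with high-quality geometry and lifelike textures, which can be animated and rendered with traditional graphics pipelines. Existing text-based character generation methods are limited in geometry and texture quality and cannot be animated realistically because the geometry and texture are inconsistently aligned, particularly in the face region. To overcome these limitations, TADA combines a 2D diffusion model with an animatable parametric body model. Specifically, we derive an optimizable high-resolution body model from SMPL-X with 3D displacements and a texture map, and use hierarchical rendering with score distillation sampling (SDS) to create high-quality, detailed, holistic 3D avatars from text. To keep geometry and texture aligned, we render normals and RGB images of the generated character and use their latent embeddings in the SDS training process. We also introduce various expression parameters that deform the character during training, so that its semantics stay consistent with the original SMPL-X model and the result can be animated. Comprehensive evaluations show that TADA significantly surpasses existing approaches on both qualitative and quantitative measures. TADA enables the creation of large-scale digital character assets that are easily editable through natural language. The code will be released for research purposes.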
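For readers unfamiliar with score distillation sampling, the gradient it is commonly defined by (as introduced in DreamFusion) is shown below. The abstract does not state TADA's exact formulation, so the notation here is an assumption: $x = g(\theta)$ is an image rendered from the avatar parameters $\theta$, $x_t$ is that image noised to diffusion step $t$ with noise $\epsilon$, $y$ is the text prompt, $\hat{\epsilon}_{\phi}$ is the pretrained diffusion model's noise prediction, and $w(t)$ is a timestep weighting.

```latex
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```

Under this reading, rendering both normals and RGB images, as the abstract describes, would amount to applying such a gradient to more than one rendered view of the same underlying parameters.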

Giraffe: Adventures in Expanding Context Lengths in LLMs

  • paper_url: http://arxiv.org/abs/2308.10882
  • repo_url: https://github.com/abacusai/long-context
  • paper_authors: Arka Pal, Deep Karkhanis, Manley Roberts, Samuel Dooley, Arvind Sundararajan, Siddartha Naidu
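  • for: The paper studies how modern large language models (LLMs) can handle input sequences at evaluation time that are longer than their training context length.
  • methods: It surveys existing context length extrapolation methods, most of which modify the positional encoding system that indicates where tokens or activations sit in the input sequence, and introduces new designs, notably a truncation strategy for modifying the basis of the position encoding.
  • results: The methods are tested on three new evaluation tasks (FreeFormQA, AlteredNumericQA, and LongChat-Lines) as well as perplexity. Linear scaling is found to be the best method for extending context length, with further gains from using longer scales at evaluation time, and the truncated basis also shows promising extrapolation capabilities. The authors release three new 13B-parameter long-context models: 4k and 16k context models trained from base LLaMA-13B and a 32k context model trained from base LLaMA2-13B, along with code to replicate the results.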
    Abstract Modern large language models (LLMs) that rely on attention mechanisms are typically trained with fixed context lengths which enforce upper limits on the length of input sequences that they can handle at evaluation time. To use these models on sequences longer than the train-time context length, one might employ techniques from the growing family of context length extrapolation methods -- most of which focus on modifying the system of positional encodings used in the attention mechanism to indicate where tokens or activations are located in the input sequence. We conduct a wide survey of existing methods of context length extrapolation on a base LLaMA or LLaMA 2 model, and introduce some of our own design as well -- in particular, a new truncation strategy for modifying the basis for the position encoding. We test these methods using three new evaluation tasks (FreeFormQA, AlteredNumericQA, and LongChat-Lines) as well as perplexity, which we find to be less fine-grained as a measure of long context performance of LLMs. We release the three tasks publicly as datasets on HuggingFace. We discover that linear scaling is the best method for extending context length, and show that further gains can be achieved by using longer scales at evaluation time. We also discover promising extrapolation capabilities in the truncated basis. To support further research in this area, we release three new 13B parameter long-context models which we call Giraffe: 4k and 16k context models trained from base LLaMA-13B, and a 32k context model trained from base LLaMA2-13B. We also release the code to replicate our results.
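    Summary Modern large language models (LLMs) that rely on attention are typically trained with a fixed context length, which caps the length of the input sequences they can handle at evaluation time. To apply these models to sequences longer than the train-time context length, one can use context length extrapolation methods. We broadly survey existing methods and introduce some designs of our own, including a new truncation strategy for modifying the basis of the position encoding. We test these methods on three new evaluation tasks (FreeFormQA, AlteredNumericQA, and LongChat-Lines) as well as perplexity. We find that linear scaling is the best method for extending context length and that using longer scales at evaluation time yields further gains. We also find promising extrapolation capabilities in the truncated basis. To support further research, we release three 13B-parameter long-context models: 4k and 16k context models trained from base LLaMA-13B and a 32k context model trained from base LLaMA2-13B. We also release the code to replicate our results.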
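The linear scaling idea can be sketched for rotary position embeddings (RoPE), the positional encoding used by LLaMA models. This is a minimal illustration under assumptions, not the authors' released code: the function names, the default `base` of 10000, and the pairing of adjacent features are choices made here for clarity. Positions are divided by a `scale` factor so that a longer sequence maps back into the position range seen during training.

```python
import math


def rope_angles(position: float, head_dim: int, base: float = 10000.0, scale: float = 1.0) -> list:
    """Rotation angles for one position; scale > 1 applies linear position scaling."""
    scaled_position = position / scale
    # One frequency per feature pair: base ** (-2i / head_dim) for i = 0 .. head_dim/2 - 1.
    return [scaled_position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec: list, angles: list) -> list:
    """Rotate each adjacent feature pair (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        x1, x2 = vec[2 * i], vec[2 * i + 1]
        out.append(x1 * math.cos(theta) - x2 * math.sin(theta))
        out.append(x1 * math.sin(theta) + x2 * math.cos(theta))
    return out


if __name__ == "__main__":
    head_dim = 8
    query = [0.5, -1.0, 0.25, 2.0, 1.5, -0.5, 0.0, 1.0]
    # With scale=4, position 8192 is encoded exactly like unscaled position 2048.
    rotated_long = apply_rope(query, rope_angles(8192, head_dim, scale=4.0))
    rotated_short = apply_rope(query, rope_angles(2048, head_dim))
    print(rotated_long == rotated_short)
```

With a scale of 4, a model trained on positions up to 2048 never sees a scaled position beyond that range even on an 8192-token input, at the cost of packing neighboring tokens closer together in position space.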

Analyzing Transformer Dynamics as Movement through Embedding Space

  • paper_url: http://arxiv.org/abs/2308.10874
  • repo_url: None
  • paper_authors: Sumeet S. Singh
  • for: This paper explores the underlying mechanics of Transformer language models and how they give rise to intelligent behaviors.
  • methods: The authors use a systems approach to analyze Transformers and develop a mathematical framework that views the models as movement through embedding space.
  • results: The paper reveals important insights into the emergence of intelligence in Transformers, including the idea that the models are essentially “Embedding Space walkers” that compose context into a single vector, and that attention plays a key role in associating vectors and influencing the organization of the embedding space. Additionally, the authors find some evidence for their semantic space theory, which posits that embedding vectors represent semantic concepts.
    Abstract Transformer language models exhibit intelligent behaviors such as understanding natural language, recognizing patterns, acquiring knowledge, reasoning, planning, reflecting and using tools. This paper explores how their underlying mechanics give rise to intelligent behaviors. We adopt a systems approach to analyze Transformers in detail and develop a mathematical framework that frames their dynamics as movement through embedding space. This novel perspective provides a principled way of thinking about the problem and reveals important insights related to the emergence of intelligence: 1. At its core the Transformer is an Embedding Space walker, mapping intelligent behavior to trajectories in this vector space. 2. At each step of the walk, it composes context into a single composite vector whose location in Embedding Space defines the next step. 3. No learning actually occurs during decoding; in-context learning and generalization are simply the result of different contexts composing into different vectors. 4. Ultimately the knowledge, intelligence and skills exhibited by the model are embodied in the organization of vectors in Embedding Space rather than in specific neurons or layers. These abilities are properties of this organization. 5. Attention's contribution boils down to the association-bias it lends to vector composition and which influences the aforementioned organization. However, more investigation is needed to ascertain its significance. 6. The entire model is composed from two principal operations: data independent filtering and data dependent aggregation. This generalization unifies Transformers with other sequence models and across modalities. Building upon this foundation we formalize and test a semantic space theory which posits that embedding vectors represent semantic concepts and find some evidence of its validity.
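    Summary Transformer language models exhibit intelligent behaviors such as understanding natural language, recognizing patterns, acquiring knowledge, reasoning, planning, reflecting, and using tools. This paper explores how their underlying mechanics give rise to such behaviors. We take a systems approach to analyzing Transformers and develop a mathematical framework that describes their dynamics as movement through embedding space. This perspective offers a principled way of thinking about the problem and yields several insights: 1. At its core, the Transformer is an Embedding Space walker that maps intelligent behavior to trajectories in this vector space. 2. At each step, it composes the context into a single composite vector whose location in Embedding Space defines the next step. 3. No learning actually occurs during decoding; in-context learning and generalization simply result from different contexts composing into different vectors. 4. The knowledge, intelligence, and skills of the model are embodied in the organization of vectors in Embedding Space rather than in specific neurons or layers. 5. Attention contributes an association bias to vector composition, which influences that organization, though more investigation is needed to establish its significance. 6. The whole model is built from two principal operations: data-independent filtering and data-dependent aggregation. This generalization unifies Transformers with other sequence models and across modalities. On this foundation, we formalize and test a semantic space theory, which posits that embedding vectors represent semantic concepts, and find some evidence for its validity.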

Real World Time Series Benchmark Datasets with Distribution Shifts: Global Crude Oil Price and Volatility

  • paper_url: http://arxiv.org/abs/2308.10846
  • repo_url: https://github.com/oilpricebenchmarks/COB
  • paper_authors: Pranay Pasula
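  • for: The study provides task-labeled time-series datasets to drive progress in continual learning in the financial domain.
  • methods: It transforms asset price data into volatility proxies and fits models using the expectation-maximization (EM) algorithm.
  • results: Including the task labels universally improves the performance of four continual learning algorithms over multiple forecasting horizons.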
    Abstract The scarcity of task-labeled time-series benchmarks in the financial domain hinders progress in continual learning. Addressing this deficit would foster innovation in this area. Therefore, we present COB, Crude Oil Benchmark datasets. COB includes 30 years of asset prices that exhibit significant distribution shifts and optimally generates corresponding task (i.e., regime) labels based on these distribution shifts for the three most important crude oils in the world. Our contributions include creating real-world benchmark datasets by transforming asset price data into volatility proxies, fitting models using expectation-maximization (EM), generating contextual task labels that align with real-world events, and providing these labels as well as the general algorithm to the public. We show that the inclusion of these task labels universally improves performance on four continual learning algorithms, some state-of-the-art, over multiple forecasting horizons. We hope these benchmarks accelerate research in handling distribution shifts in real-world data, especially due to the global importance of the assets considered. We've made the (1) raw price data, (2) task labels generated by our approach, and (3) code for our algorithm available at https://oilpricebenchmarks.github.io.
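    Summary The scarcity of task-labeled time-series benchmarks in the financial domain hinders progress in continual learning. To address this, we present COB, the Crude Oil Benchmark datasets. COB contains 30 years of asset prices that exhibit significant distribution shifts and generates corresponding task (i.e., regime) labels from those shifts. Our contributions include transforming asset price data into volatility proxies, fitting models with expectation-maximization (EM), generating contextual task labels that align with real-world events, and releasing these labels together with the general algorithm. We show that including the task labels universally improves the performance of four continual learning algorithms, some of them state-of-the-art, over multiple forecasting horizons. We hope these benchmarks accelerate research on handling distribution shifts in real-world data, especially given the global importance of the assets considered. The (1) raw price data, (2) task labels generated by our approach, and (3) code are available at https://oilpricebenchmarks.github.io.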
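The pipeline the abstract describes (prices to volatility proxies, an EM fit, then regime labels) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the COB algorithm: it assumes absolute log returns as the volatility proxy, a two-component one-dimensional Gaussian mixture as the model, and hard labels taken from the more likely component. All function names are hypothetical.

```python
import math


def volatility_proxy(prices):
    """Absolute log returns between consecutive prices (one assumed choice of proxy)."""
    return [abs(math.log(b / a)) for a, b in zip(prices, prices[1:])]


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_regimes(xs, n_iter=100):
    """EM for a two-component 1D Gaussian mixture; returns (weights, means, variances)."""
    ordered = sorted(xs)
    half = len(xs) // 2
    # Initialize the means from the lower and upper halves of the data.
    means = [sum(ordered[:half]) / half, sum(ordered[half:]) / (len(xs) - half)]
    grand_mean = sum(xs) / len(xs)
    pooled_var = sum((x - grand_mean) ** 2 for x in xs) / len(xs) + 1e-12
    variances = [pooled_var, pooled_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each regime for each observation.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk + 1e-12
    return weights, means, variances


def regime_labels(xs, weights, means, variances):
    """Hard task labels: 0 for the lower-mean regime, 1 for the higher-mean regime."""
    labels = []
    for x in xs:
        p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
        labels.append(0 if p[0] >= p[1] else 1)
    return labels
```

A typical call would be `proxies = volatility_proxy(prices)`, followed by `regime_labels(proxies, *fit_two_regimes(proxies))`. A real benchmark would likely need more than two regimes and some temporal smoothing so that labels do not flip on single observations.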

Neural Networks Optimizations Against Concept and Data Drift in Malware Detection

  • paper_url: http://arxiv.org/abs/2308.10821
  • repo_url: None
  • paper_authors: William Maillet, Benjamin Marais
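  • for: Improve a baseline neural network's ability to handle concept drift.
  • methods: Feature reduction and training with the most recent validation set possible, together with a proposed loss function, Drift-Resilient Binary Cross-Entropy.
  • results: Evaluated on recent malicious files collected between 2020 and 2023, the improved model detects 15.2% more malware than the baseline model.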
    Abstract Despite the promising results of machine learning models in malware detection, they face the problem of concept drift due to malware's constant evolution. This leads to a decline in performance over time, as the data distribution of the new files differs from the training one, requiring regular model updates. In this work, we propose a model-agnostic protocol to improve a baseline neural network to handle the drift problem. We show the importance of feature reduction and training with the most recent validation set possible, and propose a loss function named Drift-Resilient Binary Cross-Entropy, an improvement to the classical Binary Cross-Entropy more effective against drift. We train our model on the EMBER dataset (2018) and evaluate it on a dataset of recent malicious files, collected between 2020 and 2023. Our improved model shows promising results, detecting 15.2% more malware than a baseline model.
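    Summary Although machine learning models perform well in malware detection, they suffer from concept drift: malware evolves constantly, so model performance declines over time. To address this, we propose a model-agnostic protocol for improving a baseline neural network so that it copes with drift. We show the importance of feature reduction and of training with the most recent validation set, and we propose a loss function called Drift-Resilient Binary Cross-Entropy, which is more effective against drift than the classical Binary Cross-Entropy. We train our model on the EMBER dataset (2018) and evaluate it on a dataset of malicious files collected between 2020 and 2023. The improved model shows promising results, detecting 15.2% more malware than the baseline model.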