cs.AI - 2023-08-10

Automatic Extraction of Relevant Road Infrastructure using Connected vehicle data and Deep Learning Model

  • paper_url: http://arxiv.org/abs/2308.05658
  • repo_url: None
  • paper_authors: Adu-Gyamfi Kojo, Kandiboina Raghupathi, Ravichandra-Mouli Varsha, Knickerbocker Skylar, Hans Zachary N, Hawkins Neal R, Sharma Anuj
  • for: Rapidly improving transportation systems in evolving urban landscapes, benefiting road-network planning, traffic safety, and the commuting experience.
  • methods: Uses connected vehicle data and cutting-edge deep learning techniques to automatically identify intersections.
  • results: Experiments show an overall classification accuracy of 95%, with a 97% F1 score on straight road segments and a 90% F1 score on intersections.
    Abstract In today's rapidly evolving urban landscapes, efficient and accurate mapping of road infrastructure is critical for optimizing transportation systems, enhancing road safety, and improving the overall mobility experience for drivers and commuters. Yet, a formidable bottleneck obstructs progress - the laborious and time-intensive manual identification of intersections. Simply considering the sheer number of intersections that need to be identified, and the labor hours required per intersection, the need for an automated solution becomes undeniable. To address this challenge, we propose a novel approach that leverages connected vehicle data and cutting-edge deep learning techniques. By employing geohashing to segment vehicle trajectories and then generating image representations of road segments, we utilize the YOLOv5 (You Only Look Once version 5) algorithm for accurate classification of both straight road segments and intersections. Experimental results demonstrate an impressive overall classification accuracy of 95%, with straight roads achieving a remarkable 97% F1 score and intersections reaching a 90% F1 score. This approach not only saves time and resources but also enables more frequent updates and a comprehensive understanding of the road network. Our research showcases the potential impact on traffic management, urban planning, and autonomous vehicle navigation systems. The fusion of connected vehicle data and deep learning models holds promise for a transformative shift in road infrastructure mapping, propelling us towards a smarter, safer, and more connected transportation ecosystem.
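To make the pipeline concrete, here is a minimal sketch of the geohashing-and-rasterization step the abstract describes: trajectory points are bucketed by geohash cell, and each cell's points are drawn into a small binary image that a detector such as YOLOv5 could then classify. The precision, image size, and all helper names are our own illustrative choices, not the paper's.

```python
import numpy as np
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Standard geohash: alternately bisect longitude and latitude,
    packing the resulting bits into base-32 characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, n_bits, even = [], 0, 0, True
    while len(chars) < precision:
        if even:  # longitude bit
            mid = (lon_lo + lon_hi) / 2
            bits = (bits << 1) | int(lon >= mid)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:     # latitude bit
            mid = (lat_lo + lat_hi) / 2
            bits = (bits << 1) | int(lat >= mid)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        even = not even
        n_bits += 1
        if n_bits == 5:
            chars.append(_BASE32[bits])
            bits, n_bits = 0, 0
    return "".join(chars)

def rasterize_cell(points: np.ndarray, size: int = 64) -> np.ndarray:
    """Draw the (lat, lon) points of one cell into a size x size binary image
    (normalised to the points' bounding box for simplicity; using the true
    geohash cell bounds would be more faithful)."""
    lat, lon = points[:, 0], points[:, 1]
    u = (lon - lon.min()) / max(np.ptp(lon), 1e-9)
    v = (lat - lat.min()) / max(np.ptp(lat), 1e-9)
    img = np.zeros((size, size), dtype=np.uint8)
    img[((1 - v) * (size - 1)).astype(int), (u * (size - 1)).astype(int)] = 255
    return img

def trajectory_to_images(traj, precision=6, size=64):
    """Segment one trajectory by geohash cell and rasterize each segment."""
    cells = defaultdict(list)
    for lat, lon in traj:
        cells[geohash_encode(lat, lon, precision)].append((lat, lon))
    return {gh: rasterize_cell(np.array(pts), size)
            for gh, pts in cells.items() if len(pts) > 1}
```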

Updating Clinical Risk Stratification Models Using Rank-Based Compatibility: Approaches for Evaluating and Optimizing Clinician-Model Team Performance

  • paper_url: http://arxiv.org/abs/2308.05619
  • repo_url: None
  • paper_authors: Erkin Ötleş, Brian T. Denton, Jenna Wiens
  • for: Improving the updating and maintenance of clinical machine learning models while preserving clinician-model team performance.
  • methods: Proposes a novel rank-based compatibility measure ($C^R$) and a new loss function that optimizes discriminative performance while encouraging good compatibility with user expectations.
  • results: In a case study on mortality risk stratification using the MIMIC dataset, the approach yields more compatible models while maintaining discriminative performance, increasing $C^R$ by $0.019$ ($95\%$ confidence interval: $0.005$, $0.035$) over existing model selection techniques.
    Abstract As data shift or new data become available, updating clinical machine learning models may be necessary to maintain or improve performance over time. However, updating a model can introduce compatibility issues when the behavior of the updated model does not align with user expectations, resulting in poor user-model team performance. Existing compatibility measures depend on model decision thresholds, limiting their applicability in settings where models are used to generate rankings based on estimated risk. To address this limitation, we propose a novel rank-based compatibility measure, $C^R$, and a new loss function that aims to optimize discriminative performance while encouraging good compatibility. Applied to a case study in mortality risk stratification leveraging data from MIMIC, our approach yields more compatible models while maintaining discriminative performance compared to existing model selection techniques, with an increase in $C^R$ of $0.019$ ($95\%$ confidence interval: $0.005$, $0.035$). This work provides new tools to analyze and update risk stratification models used in clinical care.
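To make the rank-based idea concrete, here is a small sketch of one natural reading of rank-based compatibility (our assumption; the paper's exact $C^R$ may differ in detail): among positive-negative pairs that the original model ranks correctly by risk, the fraction the updated model also ranks correctly.

```python
import itertools
import numpy as np

def rank_compatibility(y, s_old, s_new):
    """Fraction of correctly ranked pairs under the old scores that stay
    correctly ranked under the new scores. A pair (i, j) with y[i]=1, y[j]=0
    is 'correctly ranked' by a score s when s[i] > s[j] (as in the AUROC)."""
    pos = np.flatnonzero(np.asarray(y) == 1)
    neg = np.flatnonzero(np.asarray(y) == 0)
    kept, correct_old = 0, 0
    for i, j in itertools.product(pos, neg):
        if s_old[i] > s_old[j]:          # old model got this pair right
            correct_old += 1
            kept += s_new[i] > s_new[j]  # does the new model preserve it?
    return kept / correct_old if correct_old else float("nan")

# Toy example: the update flips one previously correct pair, so 3/4 survive.
y     = [1, 1, 0, 0]
s_old = [0.9, 0.7, 0.4, 0.2]
s_new = [0.9, 0.3, 0.4, 0.2]
print(rank_compatibility(y, s_old, s_new))  # 0.75
```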

A Neural Network Based Choice Model for Assortment Optimization

  • paper_url: http://arxiv.org/abs/2308.05617
  • repo_url: None
  • paper_authors: Hanzhao Wang, Zhongze Cai, Xiaocheng Li, Kalyan Talluri
  • for: Predicting customer purchase probabilities as a function of prices and other features of the offered assortment.
  • methods: Uses a neural network model, removing the need to hand-build a context-dependent customer behaviour model and hand-tune its estimation.
  • results: Compares the neural network against benchmark discrete-choice models and develops training tricks that make the network's predictions and the subsequent optimization robust and comparable in performance.
    Abstract Discrete-choice models are used in economics, marketing and revenue management to predict customer purchase probabilities, say as a function of prices and other features of the offered assortment. While they have been shown to be expressive, capturing customer heterogeneity and behaviour, they are also hard to estimate, often based on many unobservables like utilities; and moreover, they still fail to capture many salient features of customer behaviour. A natural question then, given their success in other contexts, is if neural networks can eliminate the necessity of carefully building a context-dependent customer behaviour model and hand-coding and tuning the estimation. It is unclear however how one would incorporate assortment effects into such a neural network, and also how one would optimize the assortment with such a black-box generative model of choice probabilities. In this paper we investigate first whether a single neural network architecture can predict purchase probabilities for datasets from various contexts and generated under various models and assumptions. Next, we develop an assortment optimization formulation that is solvable by off-the-shelf integer programming solvers. We compare against a variety of benchmark discrete-choice models on simulated as well as real-world datasets, developing training tricks along the way to make the neural network prediction and subsequent optimization robust and comparable in performance to the alternates.
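A brute-force sketch of the two ingredients in the abstract: a black-box choice-probability model and an assortment optimizer. The utilities, prices, and MNL form below are illustrative stand-ins for a trained network, and the paper solves the optimization as an integer program rather than by enumeration.

```python
import itertools
import math

# Hypothetical stand-ins for a trained choice model: learned utilities and
# prices for three products (illustrative numbers, not from the paper).
UTIL = {1: 1.0, 2: 0.6, 3: 0.2}
PRICE = {1: 4.0, 2: 6.0, 3: 9.0}

def choice_probs(assortment):
    """Purchase probability of each offered product under a toy MNL model;
    the constant 1.0 in the denominator is the no-purchase option."""
    denom = 1.0 + sum(math.exp(UTIL[p]) for p in assortment)
    return {p: math.exp(UTIL[p]) / denom for p in assortment}

def best_assortment(products):
    """Brute-force revenue maximisation over all non-empty assortments.
    Tractable only for small n; the paper instead casts the problem as an
    integer program solvable by off-the-shelf solvers."""
    best, best_rev = None, -1.0
    for r in range(1, len(products) + 1):
        for s in itertools.combinations(products, r):
            rev = sum(PRICE[p] * q for p, q in choice_probs(s).items())
            if rev > best_rev:
                best, best_rev = s, rev
    return best, best_rev

print(best_assortment([1, 2, 3]))
```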

A Smart Robotic System for Industrial Plant Supervision

  • paper_url: http://arxiv.org/abs/2308.05612
  • repo_url: None
  • paper_authors: D. Adriana Gómez-Rosal, Max Bergau, Georg K. J. Fischer, Andreas Wachaja, Johannes Gräter, Matthias Odenweller, Uwe Piechottka, Fabian Hoeflinger, Nikhil Gosala, Niklas Wetzel, Daniel Büscher, Abhinav Valada, Wolfram Burgard
  • for: To relieve human field operators in chemical production plants of anomaly detection and monitoring tasks, the authors propose an automated inspection system built on an autonomously navigating robot with multiple sensors.
  • methods: The system employs multiple sensing modalities, including visual, olfactory, and auditory perception, to mimic human sensing and interpretation capabilities and thereby provide automated inspection.
  • results: Extensive evaluation at a wastewater facility under full working conditions shows that the system can navigate the plant autonomously and provide useful information about abnormal operating conditions.
    Abstract In today's chemical production plants, human field operators perform frequent checks on the plant's integrity to guarantee high safety standards, and thus are possibly the first to encounter dangerous operating conditions. To alleviate their tasks of failure detection and monitoring by audio, visual, and olfactory perceptions, we present a robotic system that consists of an autonomously navigating robot integrated with various sensors and data processing. We aim to resemble the human sensing and interpretation capabilities of sight, smell, and hearing, for providing automated inspection. We evaluate our system extensively at a wastewater facility in full working conditions. Our results demonstrate that the system is able to robustly navigate a plant and to provide useful information about critical operating conditions.

Multi-graph Spatio-temporal Graph Convolutional Network for Traffic Flow Prediction

  • paper_url: http://arxiv.org/abs/2308.05601
  • repo_url: None
  • paper_authors: Weilong Ding, Tianpu Zhang, Jianwu Wang, Zhuofeng Zhao
  • for: Predicting daily traffic flow on inter-city highways.
  • methods: A spatio-temporal deep learning method for daily traffic flow in the highway domain, with a data normalization strategy to handle data imbalance.
  • results: Clear improvement in predictive accuracy over baseline methods, with practical benefits in business applications.
    Abstract Inter-city highway transportation is significant for urban life. As one of the key functions in an intelligent transportation system (ITS), traffic evaluation plays a significant role nowadays, and daily traffic flow prediction still faces challenges at network-wide toll stations. On the one hand, the data imbalance among various locations in practice deteriorates prediction performance. On the other hand, complex correlative spatio-temporal factors cannot be comprehensively employed over long-term durations. In this paper, a prediction method is proposed for daily traffic flow in the highway domain through spatio-temporal deep learning. In our method, a data normalization strategy is used to deal with data imbalance, due to the long-tail distribution of traffic flow at network-wide toll stations. Then, based on graph convolutional networks, we construct networks with distinct semantics to capture spatio-temporal features. Besides that, meteorology and calendar features are used by our model in the fully connected stage to extract external characteristics of traffic flow. Through extensive experiments and case studies on one Chinese provincial highway, our method shows clear improvements in predictive accuracy over baselines and practical benefits in business.
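The abstract does not spell out the normalization strategy; a common choice that matches the long-tail motivation is a log transform followed by per-station standardisation, sketched here as an assumption:

```python
import numpy as np

def normalize_flows(flows: np.ndarray):
    """flows: (num_days, num_stations) daily counts with a long-tail
    distribution. log1p compresses the tail; per-station z-scoring then
    removes the scale imbalance between busy and quiet toll stations."""
    logged = np.log1p(flows)
    mu = logged.mean(axis=0, keepdims=True)
    sigma = logged.std(axis=0, keepdims=True) + 1e-8
    return (logged - mu) / sigma, (mu, sigma)

def denormalize(z: np.ndarray, stats):
    """Invert the transform so predictions can be reported as real counts."""
    mu, sigma = stats
    return np.expm1(z * sigma + mu)
```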

Optical Script Identification for multi-lingual Indic-script

  • paper_url: http://arxiv.org/abs/2308.05780
  • repo_url: None
  • paper_authors: Sidhantha Poddar, Rohan Gupta
  • for: Surveying the progress of script pre-processing and text recognition techniques in the field of optical script identification.
  • methods: Reviews existing methodologies and state-of-the-art techniques for script processing and identification, including refined pre-processing methods and the application of deep learning models.
  • results: Provides a comparative analysis of the surveyed algorithms, aimed at supporting accurate recognition of the twelve prominent Indic scripts and offering insight to researchers working on other languages as well.
    Abstract Script identification and text recognition are some of the major domains in the application of Artificial Intelligence. In this era of digitalization, the use of digital note-taking has become a common practice. Still, the conventional method of using pen and paper remains a prominent way of writing. This leads to the classification of scripts based on the method by which they are obtained. A survey of the current methodologies and state-of-the-art methods used for processing and identification would prove beneficial for researchers. The aim of this article is to discuss the advancement in the techniques for script pre-processing and text recognition. In India there are twelve prominent Indic scripts; unlike the English language, these scripts have layers of characteristics. Complex characteristics such as similarity in text shape make them difficult to recognize and analyze; thus, they require advanced preprocessing methods for accurate recognition. A sincere attempt is made in this survey to provide a comparison between all algorithms. We hope that this survey will provide insight to researchers working not only on Indic scripts but also on other languages.

Proximal Policy Optimization Actual Combat: Manipulating Output Tokenizer Length

  • paper_url: http://arxiv.org/abs/2308.05585
  • repo_url: None
  • paper_authors: Miao Fan, Chen Hu, Shuchang Zhou
  • for: Using reinforcement learning from human feedback (RLHF) to control the output of large language models (LLMs), which often harbor misleading content, and align them with human values for secure AI systems.
  • methods: Uses Proximal Policy Optimization (PPO) to train the LLM, with Golden as the reward model to validate PPO's effectiveness.
  • results: Experiments confirm that PPO can manipulate the output tokenizer length to a certain extent in this type of task, and that training is facilitated once the influence of the reward-model effect is excluded.
    Abstract Reinforcement Learning from Human Feedback (RLHF) plays a pivotal role in shaping the impact of large language models (LLMs), contributing significantly to controlling output toxicity and selecting output styles, particularly as LLMs often harbor misleading content, highlighting the urgency to align them with human values for secure AI systems. RLHF, characterized by complexity, instability, and sensitivity to hyperparameters, makes the evaluation of the reward model for complex tasks challenging, thereby further complicating the use of Proximal Policy Optimization (PPO). In this paper, we introduce a simple task designed to employ Golden as a reward model that validates the effectiveness of PPO and inspires it, primarily explaining the task of utilizing PPO to manipulate the tokenizer length of the output generated by the model. Experiments confirm that PPO is not only effective in manipulating the output tokenizer length to a certain extent in this type of task but also exhibits facilitated training once the influence of the reward model effect is excluded, making it an exciting development.
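The task is easy to reproduce in spirit: a reward model that scores a generation purely by how close its token count is to a target. The sketch below is our own; `tokenizer` stands for any HuggingFace-style tokenizer, and the function name is hypothetical.

```python
def golden_length_reward(text: str, tokenizer, target_len: int = 64) -> float:
    """Reward is highest (1.0) when the tokenized output has exactly
    target_len tokens and decays linearly with the absolute deviation."""
    n_tokens = len(tokenizer.encode(text))
    return 1.0 - abs(n_tokens - target_len) / target_len

# During PPO, each sampled response would be scored like:
#   reward = golden_length_reward(response, tokenizer, target_len=64)
# and the policy updated to push generations toward the target length.
```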

Generative Diffusion Models for Radio Wireless Channel Modelling and Sampling

  • paper_url: http://arxiv.org/abs/2308.05583
  • repo_url: None
  • paper_authors: Ushnish Sengupta, Chinkuo Jao, Alberto Bernacchia, Sattar Vakili, Da-shan Shiu
  • for: A diffusion-model-based channel sampling approach for rapidly synthesizing channel realizations from limited data.
  • methods: A diffusion model with a U-Net based architecture operating in the frequency-space domain.
  • results: Compared with existing GAN-based approaches, the diffusion model trains stably and generates diverse, high-fidelity channel samples. The model can also be pretrained on a simulated urban macro-cellular dataset and fine-tuned on a smaller, out-of-distribution urban micro-cellular dataset, showing that real-world channels can be modelled with limited data.
    Abstract Channel modelling is essential to designing modern wireless communication systems. The increasing complexity of channel modelling and the cost of collecting high-quality wireless channel data have become major challenges. In this paper, we propose a diffusion model based channel sampling approach for rapidly synthesizing channel realizations from limited data. We use a diffusion model with a U-Net based architecture operating in the frequency space domain. To evaluate how well the proposed model reproduces the true distribution of channels in the training dataset, two evaluation metrics are used: $i)$ the approximate $2$-Wasserstein distance between real and generated distributions of the normalized power spectrum in the antenna and frequency domains and $ii)$ precision and recall metric for distributions. We show that, compared to existing GAN based approaches which suffer from mode collapse and unstable training, our diffusion based approach trains stably and generates diverse and high-fidelity samples from the true channel distribution. We also show that we can pretrain the model on a simulated urban macro-cellular channel dataset and fine-tune it on a smaller, out-of-distribution urban micro-cellular dataset, therefore showing that it is feasible to model real world channels using limited data with this approach.
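In one dimension, the (approximate) 2-Wasserstein distance used for evaluation reduces to quantile matching between sorted samples. A minimal sketch of that computation (ours, applied per bin rather than the paper's exact procedure):

```python
import numpy as np

def wasserstein2_1d(x: np.ndarray, y: np.ndarray) -> float:
    """Empirical 2-Wasserstein distance between two 1-D samples of equal
    size: the optimal coupling is the sorted (quantile) matching, so
    W2^2 = mean((sort(x) - sort(y))^2)."""
    x, y = np.sort(x), np.sort(y)
    return float(np.sqrt(np.mean((x - y) ** 2)))

# Example: per-bin distance between real and generated normalized spectra.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 1000)
fake = rng.normal(0.1, 1.1, 1000)
print(wasserstein2_1d(real, fake))
```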

C5: Towards Better Conversation Comprehension and Contextual Continuity for ChatGPT

  • paper_url: http://arxiv.org/abs/2308.05567
  • repo_url: None
  • paper_authors: Pan Liang, Danwei Ye, Zihao Zhu, Yunchao Wang, Wang Xia, Ronghua Liang, Guodao Sun
  • for: Proposing an interactive conversation visualization system that helps users better understand and maintain the contextual information of conversations with ChatGPT.
  • methods: The C5 system consists of a Global View, a Topic View, and a Context-associated Q&A View. The Global View uses a GitLog-diagram metaphor to represent the conversation structure, showing the trend of conversation evolution and supporting exploration of locally salient features. The Topic View displays all question and answer nodes and their relationships within a topic as a knowledge graph, showing the relevance and evolution of conversations. The Context-associated Q&A View consists of three linked panels that let users explore individual conversations in depth while supplying specific contextual information when posing questions.
  • results: A case study and a user study show that C5 helps users better understand and maintain conversational context, improving satisfaction and efficiency.
    Abstract Large language models (LLMs), such as ChatGPT, have demonstrated outstanding performance in various fields, particularly in natural language understanding and generation tasks. In complex application scenarios, users tend to engage in multi-turn conversations with ChatGPT to keep contextual information and obtain comprehensive responses. However, human forgetting and model contextual forgetting remain prominent issues in multi-turn conversation scenarios, which challenge the users' conversation comprehension and contextual continuity for ChatGPT. To address these challenges, we propose an interactive conversation visualization system called C5, which includes Global View, Topic View, and Context-associated Q\&A View. The Global View uses the GitLog diagram metaphor to represent the conversation structure, presenting the trend of conversation evolution and supporting the exploration of locally salient features. The Topic View is designed to display all the question and answer nodes and their relationships within a topic using the structure of a knowledge graph, thereby display the relevance and evolution of conversations. The Context-associated Q\&A View consists of three linked views, which allow users to explore individual conversations deeply while providing specific contextual information when posing questions. The usefulness and effectiveness of C5 were evaluated through a case study and a user study.

Recent Advancements In The Field Of Deepfake Detection

  • paper_url: http://arxiv.org/abs/2308.05563
  • repo_url: None
  • paper_authors: Natalie Krueger, Dr. Mounika Vanamala, Dr. Rushit Dave
  • for: To survey and analyze current methods and advances in the field of deepfake detection, addressing the problems of malicious deepfake creation and the lack of universal detection methods.
  • methods: Reviews a variety of existing methods and techniques, including deep learning and image processing and analysis, for detecting and identifying deepfakes.
  • results: Analyzes and evaluates a range of deepfake detection methods, highlighting approaches and advances that improve detection accuracy and effectiveness.
    Abstract A deepfake is a photo or video of a person whose image has been digitally altered or partially replaced with an image of someone else. Deepfakes have the potential to cause a variety of problems and are often used maliciously. A common usage is altering videos of prominent political figures and celebrities. These deepfakes can portray them making offensive, problematic, and/or untrue statements. Current deepfakes can be very realistic, and when used in this way, can spread panic and even influence elections and political opinions. There are many deepfake detection strategies currently in use but finding the most comprehensive and universal method is critical. So, in this survey we will address the problems of malicious deepfake creation and the lack of universal deepfake detection methods. Our objective is to survey and analyze a variety of current methods and advances in the field of deepfake detection.

Learning (With) Distributed Optimization

  • paper_url: http://arxiv.org/abs/2308.05548
  • repo_url: https://github.com/alexcaselli/Federated-Learning-for-Human-Mobility-Models
  • paper_authors: Aadharsh Aadhithya A, Abinesh S, Akshaya J, Jayanth M, Vishnu Radhakrishnan, Sowmya V, Soman K. P
  • for: A historical overview of distributed optimization techniques, from the early duality-based methods of Dantzig, Wolfe, and Benders in the 1960s, through Lagrangian relaxation and decomposition strategies, to the Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) algorithm.
  • methods: Surveys the major families of distributed optimization methods, including duality-based methods, decomposition methods such as ADMM, and augmented Lagrangian methods.
  • results: Highlights practical applications of distributed optimization in domains such as machine learning and imaging, and the advantage of ALADIN in offering convergence guarantees for non-convex optimization problems.
    Abstract This paper provides an overview of the historical progression of distributed optimization techniques, tracing their development from early duality-based methods pioneered by Dantzig, Wolfe, and Benders in the 1960s to the emergence of the Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) algorithm. The initial focus on Lagrangian relaxation for convex problems and decomposition strategies led to the refinement of methods like the Alternating Direction Method of Multipliers (ADMM). The resurgence of interest in distributed optimization in the late 2000s, particularly in machine learning and imaging, demonstrated ADMM's practical efficacy and its unifying potential. This overview also highlights the emergence of the proximal center method and its applications in diverse domains. Furthermore, the paper underscores the distinctive features of ALADIN, which offers convergence guarantees for non-convex scenarios without introducing auxiliary variables, differentiating it from traditional augmentation techniques. In essence, this work encapsulates the historical trajectory of distributed optimization and underscores the promising prospects of ALADIN in addressing non-convex optimization challenges.
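For reference, the ADMM iteration that the overview repeatedly returns to, stated for the canonical problem $\min_{x,z} f(x) + g(z)$ subject to $Ax + Bz = c$, with scaled dual variable $u$ and penalty $\rho$ (the textbook form, not anything specific to this paper):

```latex
\begin{aligned}
x^{k+1} &= \arg\min_x \; f(x) + \tfrac{\rho}{2}\,\lVert Ax + Bz^k - c + u^k \rVert_2^2,\\
z^{k+1} &= \arg\min_z \; g(z) + \tfrac{\rho}{2}\,\lVert Ax^{k+1} + Bz - c + u^k \rVert_2^2,\\
u^{k+1} &= u^k + Ax^{k+1} + Bz^{k+1} - c.
\end{aligned}
```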

Enhancing AUV Autonomy With Model Predictive Path Integral Control

  • paper_url: http://arxiv.org/abs/2308.05547
  • repo_url: None
  • paper_authors: Pierre Nicolay, Yvan Petillot, Mykhaylo Marfeychuk, Sen Wang, Ignacio Carlucho
  • for: Studying the feasibility of Model Predictive Path Integral (MPPI) control for autonomous underwater vehicle (AUV) control.
  • methods: Uses a non-linear AUV model to propagate the MPPI samples in real time, from which the control action is computed.
  • results: Compares the MPPI controller with classical PID and cascade PID controllers, demonstrating its superiority, and shows how MPPI handles environmental constraints by simply incorporating them into the cost function.
    Abstract Autonomous underwater vehicles (AUVs) play a crucial role in surveying marine environments, carrying out underwater inspection tasks, and ocean exploration. However, in order to ensure that the AUV is able to carry out its mission successfully, a control system capable of adapting to changing environmental conditions is required. Furthermore, to ensure the robotic platform's safe operation, the onboard controller should be able to operate under certain constraints. In this work, we investigate the feasibility of Model Predictive Path Integral Control (MPPI) for the control of an AUV. We utilise a non-linear model of the AUV to propagate the samples of the MPPI, which allow us to compute the control action in real time. We provide a detailed evaluation of the effect of the main hyperparameters on the performance of the MPPI controller. Furthermore, we compared the performance of the proposed method with a classical PID and Cascade PID approach, demonstrating the superiority of our proposed controller. Finally, we present results where environmental constraints are added and show how MPPI can handle them by simply incorporating those constraints in the cost function.
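MPPI itself is compact enough to sketch: sample control perturbations, roll each sequence out through the non-linear dynamics, and update the nominal controls with an exponentially weighted average of the perturbations. The dynamics stub, horizon, and temperature below are generic illustrations, not the paper's AUV model.

```python
import numpy as np

def mppi_step(x0, U, dynamics, cost, n_samples=256, sigma=0.5, lam=1.0):
    """One MPPI update of a nominal control sequence U of shape (T, m).
    dynamics(x, u) -> next state; cost(x, u) -> scalar stage cost."""
    T, m = U.shape
    eps = np.random.normal(0.0, sigma, size=(n_samples, T, m))
    costs = np.zeros(n_samples)
    for k in range(n_samples):
        x = x0
        for t in range(T):
            u = U[t] + eps[k, t]
            costs[k] += cost(x, u)   # environmental constraints become penalties here
            x = dynamics(x, u)
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    return U + np.tensordot(w, eps, axes=1)  # weighted average of perturbations

# Toy 1-D double integrator steered toward the origin:
dyn = lambda x, u: np.array([x[0] + 0.1 * x[1], x[1] + 0.1 * u[0]])
cst = lambda x, u: x[0] ** 2 + 0.1 * u[0] ** 2
U = np.zeros((20, 1))
x = np.array([1.0, 0.0])
for _ in range(50):
    U = mppi_step(x, U, dyn, cst)
    x = dyn(x, U[0])
    U = np.roll(U, -1, axis=0); U[-1] = 0.0  # receding-horizon shift
print(x)  # x should have moved toward the origin
```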

Models Matter: The Impact of Single-Step Retrosynthesis on Synthesis Planning

  • paper_url: http://arxiv.org/abs/2308.05522
  • repo_url: None
  • paper_authors: Paula Torren-Peraire, Alan Kai Hassen, Samuel Genheden, Jonas Verhoeven, Djork-Arne Clevert, Mike Preuss, Igor Tetko
  • for: Combining single-step retrosynthesis prediction with multi-step synthesis planning, to improve the success rate and feasibility of finding chemical synthesis routes.
  • methods: Applies multiple single-step retrosynthesis models within multi-step synthesis planning and analyzes their impact using public and proprietary reaction data.
  • results: High single-step performance does not necessarily translate into route-finding success, and choosing a different single-step model can improve the overall success rate of synthesis planning.
    Abstract Retrosynthesis consists of breaking down a chemical compound recursively step-by-step into molecular precursors until a set of commercially available molecules is found with the goal to provide a synthesis route. Its two primary research directions, single-step retrosynthesis prediction, which models the chemical reaction logic, and multi-step synthesis planning, which tries to find the correct sequence of reactions, are inherently intertwined. Still, this connection is not reflected in contemporary research. In this work, we combine these two major research directions by applying multiple single-step retrosynthesis models within multi-step synthesis planning and analyzing their impact using public and proprietary reaction data. We find a disconnection between high single-step performance and potential route-finding success, suggesting that single-step models must be evaluated within synthesis planning in the future. Furthermore, we show that the commonly used single-step retrosynthesis benchmark dataset USPTO-50k is insufficient as this evaluation task does not represent model performance and scalability on larger and more diverse datasets. For multi-step synthesis planning, we show that the choice of the single-step model can improve the overall success rate of synthesis planning by up to +28% compared to the commonly used baseline model. Finally, we show that each single-step model finds unique synthesis routes, and differs in aspects such as route-finding success, the number of found synthesis routes, and chemical validity, making the combination of single-step retrosynthesis prediction and multi-step synthesis planning a crucial aspect when developing future methods.

Mono-hydra: Real-time 3D scene graph construction from monocular camera input with IMU

  • paper_url: http://arxiv.org/abs/2308.05515
  • repo_url: None
  • paper_authors: U. V. B. L. Udugama, G. Vosselman, F. Nex
  • for: Helping robots navigate autonomously in 3D environments by providing a method that builds 3D scene graphs in real time from a monocular vision system.
  • methods: Combines a monocular camera with an IMU sensor and employs a suite of deep learning algorithms to derive depth and semantics.
  • results: The system runs in real time at 15 fps with sub-20 cm error, enabling faster and more effective robot decision-making.
    Abstract The ability of robots to autonomously navigate through 3D environments depends on their comprehension of spatial concepts, ranging from low-level geometry to high-level semantics, such as objects, places, and buildings. To enable such comprehension, 3D scene graphs have emerged as a robust tool for representing the environment as a layered graph of concepts and their relationships. However, building these representations using monocular vision systems in real-time remains a difficult task that has not been explored in depth. This paper puts forth a real-time spatial perception system Mono-Hydra, combining a monocular camera and an IMU sensor setup, focusing on indoor scenarios. However, the proposed approach is adaptable to outdoor applications, offering flexibility in its potential uses. The system employs a suite of deep learning algorithms to derive depth and semantics. It uses a robocentric visual-inertial odometry (VIO) algorithm based on square-root information, thereby ensuring consistent visual odometry with an IMU and a monocular camera. This system achieves sub-20 cm error in real-time processing at 15 fps, enabling real-time 3D scene graph construction using a laptop GPU (NVIDIA 3080). This enhances decision-making efficiency and effectiveness in simple camera setups, augmenting robotic system agility. We make Mono-Hydra publicly available at: https://github.com/UAV-Centre-ITC/Mono_Hydra

Multi-domain Recommendation with Embedding Disentangling and Domain Alignment

  • paper_url: http://arxiv.org/abs/2308.05508
  • repo_url: https://github.com/Stevenn9981/EDDA
  • paper_authors: Wentao Ning, Xiao Yan, Weiwen Liu, Reynold Cheng, Rui Zhang, Bo Tang
  • for: Multi-domain recommendation (MDR), i.e., providing recommendations across different domains (e.g., product types) with overlapping users/items. Existing MDR models face two challenges: it is difficult to disentangle knowledge that generalizes across domains (e.g., a user likes cheap items) from knowledge specific to a single domain (e.g., a user likes blue clothing but not blue cars), and they have limited ability to transfer knowledge across domains with small overlaps.
  • methods: Proposes EDDA with two key components: an embedding-disentangling recommender, which separates both the model and the embeddings into inter-domain and intra-domain parts (most existing MDR methods only disentangle at the model level), and domain alignment, which uses random walks from graph processing to identify similar user/item pairs across domains and encourages similar pairs to have similar embeddings, enhancing knowledge transfer.
  • results: Compared with 12 state-of-the-art baselines on 3 real datasets, EDDA consistently outperforms all baselines on all datasets and domains. Datasets and code: https://github.com/Stevenn9981/EDDA.
    Abstract Multi-domain recommendation (MDR) aims to provide recommendations for different domains (e.g., types of products) with overlapping users/items and is common for platforms such as Amazon, Facebook, and LinkedIn that host multiple services. Existing MDR models face two challenges: First, it is difficult to disentangle knowledge that generalizes across domains (e.g., a user likes cheap items) and knowledge specific to a single domain (e.g., a user likes blue clothing but not blue cars). Second, they have limited ability to transfer knowledge across domains with small overlaps. We propose a new MDR method named EDDA with two key components, i.e., embedding disentangling recommender and domain alignment, to tackle the two challenges respectively. In particular, the embedding disentangling recommender separates both the model and embedding for the inter-domain part and the intra-domain part, while most existing MDR methods only focus on model-level disentangling. The domain alignment leverages random walks from graph processing to identify similar user/item pairs from different domains and encourages similar user/item pairs to have similar embeddings, enhancing knowledge transfer. We compare EDDA with 12 state-of-the-art baselines on 3 real datasets. The results show that EDDA consistently outperforms the baselines on all datasets and domains. All datasets and codes are available at https://github.com/Stevenn9981/EDDA.

EFX Allocations Exist for Binary Valuations

  • paper_url: http://arxiv.org/abs/2308.05503
  • repo_url: None
  • paper_authors: Xiaolin Bu, Jiaxin Song, Ziqi Yu
  • for: Studying the fair division problem, in particular whether allocations satisfying envy-freeness up to any item (EFX) exist.
  • methods: Uses completely different techniques to extend the existence result from binary submodular valuations to general binary valuations, and gives a polynomial-time algorithm for computing an EFX allocation.
  • results: Proves that EFX allocations always exist for general binary valuations and provides a polynomial-time algorithm to compute one.
    Abstract We study the fair division problem and the existence of allocations satisfying the fairness criterion envy-freeness up to any item (EFX). The existence of EFX allocations is a major open problem in the fair division literature. We consider binary valuations where the marginal gain of the value by receiving an extra item is either $0$ or $1$. Babaioff et al. [2021] proved that EFX allocations always exist for binary and submodular valuations. In this paper, by using completely different techniques, we extend this existence result to general binary valuations that are not necessarily submodular, and we present a polynomial time algorithm for computing an EFX allocation.
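The EFX criterion itself is easy to state in code: no agent may envy another agent's bundle once any single item is removed from it. A small checker (our own illustration; `value` can be any valuation function, shown here with an additive binary example):

```python
def is_efx(bundles, value):
    """bundles: dict agent -> set of items; value(agent, bundle) -> number.
    EFX: for all agents i != j and every item g in j's bundle,
         v_i(A_i) >= v_i(A_j \\ {g})."""
    for i in bundles:
        for j in bundles:
            if i == j:
                continue
            for g in bundles[j]:
                if value(i, bundles[i]) < value(i, bundles[j] - {g}):
                    return False
    return True

# Binary marginals: each agent either wants an item (marginal 1) or not (0).
wants = {"a": {1, 2}, "b": {2, 3}}
v = lambda agent, bundle: len(bundle & wants[agent])  # additive binary example
print(is_efx({"a": {1}, "b": {2, 3}}, v))  # True: removing any item kills envy
```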

Bringing order into the realm of Transformer-based language models for artificial intelligence and law

  • paper_url: http://arxiv.org/abs/2308.05502
  • repo_url: None
  • paper_authors: Candida M. Greco, Andrea Tagarelli
  • for: This paper provides a systematic overview of Transformer-based language models (TLMs) for AI-driven problems and tasks in the legal domain, with a focus on highlighting research advances and current limitations.
  • methods: The paper uses TLMs, specifically BERT and related models, to address AI-driven problems and tasks in the legal domain.
  • results: The paper provides a comprehensive overview of the current state of TLM-based methods in the legal domain, highlighting their contributions and limitations, and identifying opportunities for further research development.
    Abstract Transformer-based language models (TLMs) have widely been recognized to be a cutting-edge technology for the successful development of deep-learning-based solutions to problems and applications that require natural language processing and understanding. Like for other textual domains, TLMs have indeed pushed the state-of-the-art of AI approaches for many tasks of interest in the legal domain. Despite the first Transformer model being proposed about six years ago, there has been a rapid progress of this technology at an unprecedented rate, whereby BERT and related models represent a major reference, also in the legal domain. This article provides the first systematic overview of TLM-based methods for AI-driven problems and tasks in the legal sphere. A major goal is to highlight research advances in this field so as to understand, on the one hand, how the Transformers have contributed to the success of AI in supporting legal processes, and on the other hand, what are the current limitations and opportunities for further research development.

More Than Meets the Eye: Analyzing Anesthesiologists’ Visual Attention in the Operating Room Using Deep Learning Models

  • paper_url: http://arxiv.org/abs/2308.05501
  • repo_url: None
  • paper_authors: Sapir Gershov, Fadi Mahameed, Aeyal Raz, Shlomi Laufer
  • for: This study aims to improve the safe management of patients under general anesthesia by analyzing the visual attention of anesthesiologists during surgery.
  • methods: The study uses a novel eye-tracking method based on deep learning models that process monitor-mounted webcams to collect continuous behavioral data without disturbing the anesthesiologists’ natural workflow.
  • results: The study found that the proposed framework can distinguish between different visual behavioral patterns, including baseline visual attention during uneventful periods, patterns associated with active phases, and patterns during critical, unanticipated incidents.
    Abstract Patient's vital signs, which are displayed on monitors, make the anesthesiologist's visual attention (VA) a key component in the safe management of patients under general anesthesia; moreover, the distribution of said VA and the ability to acquire specific cues throughout the anesthetic, may have a direct impact on patient's outcome. Currently, most studies employ wearable eye-tracking technologies to analyze anesthesiologists' visual patterns. Albeit being able to produce meticulous data, wearable devices are not a sustainable solution for large-scale or long-term use for data collection in the operating room (OR). Thus, by utilizing a novel eye-tracking method in the form of deep learning models that process monitor-mounted webcams, we collected continuous behavioral data and gained insight into the anesthesiologist's VA distribution with minimal disturbance to their natural workflow. In this study, we collected OR video recordings using the proposed framework and compared different visual behavioral patterns. We distinguished between baseline VA distribution during uneventful periods to patterns associated with active phases or during critical, unanticipated incidents. In the future, such a platform may serve as a crucial component of context-aware assistive technologies in the OR.

Exploring XAI for the Arts: Explaining Latent Space in Generative Music

  • paper_url: http://arxiv.org/abs/2308.05496
  • repo_url: https://github.com/bbanar2/exploring_xai_in_genmus_via_lsr
  • paper_authors: Nick Bryan-Kinns, Berker Banar, Corey Ford, Courtney N. Reed, Yixiao Zhang, Simon Colton, Jack Armitage
  • for: Adding explainable AI (XAI) features to generative music models so that creative AI systems are easier to debug and understand.
  • methods: Extends the MeasureVAE generation model, using latent space regularisation to force specific latent dimensions to map to meaningful musical attributes.
  • results: A user-interface feedback loop lets people adjust latent dimensions and observe the results in real time, and a visualisation of the musical attributes in the latent space helps people understand and predict the effect of changes.
    Abstract Explainable AI has the potential to support more interactive and fluid co-creative AI systems which can creatively collaborate with people. To do this, creative AI models need to be amenable to debugging by offering eXplainable AI (XAI) features which are inspectable, understandable, and modifiable. However, currently there is very little XAI for the arts. In this work, we demonstrate how a latent variable model for music generation can be made more explainable; specifically we extend MeasureVAE which generates measures of music. We increase the explainability of the model by: i) using latent space regularisation to force some specific dimensions of the latent space to map to meaningful musical attributes, ii) providing a user interface feedback loop to allow people to adjust dimensions of the latent space and observe the results of these changes in real-time, iii) providing a visualisation of the musical attributes in the latent space to help people understand and predict the effect of changes to latent space dimensions. We suggest that in doing so we bridge the gap between the latent space and the generated musical outcomes in a meaningful way which makes the model and its outputs more explainable and more debuggable.
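One concrete way to read "latent space regularisation" (following attribute-regularisation ideas from the VAE-for-music literature; the paper's exact loss may differ) is a pairwise penalty that forces one latent dimension to order in the same way as a musical attribute such as note density:

```python
import torch

def attribute_reg_loss(z_dim: torch.Tensor, attr: torch.Tensor) -> torch.Tensor:
    """z_dim: (batch,) values of one latent dimension; attr: (batch,) the
    musical attribute it should encode. Penalise pairs whose latent ordering
    disagrees with the attribute ordering."""
    dz = z_dim.unsqueeze(0) - z_dim.unsqueeze(1)   # (batch, batch) differences
    da = attr.unsqueeze(0) - attr.unsqueeze(1)
    return torch.mean((torch.tanh(dz) - torch.sign(da)) ** 2)

# Added to the usual VAE objective, e.g.:
#   loss = recon + beta * kl + gamma * attribute_reg_loss(z[:, 0], note_density)
```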

LLM As DBA

  • paper_url: http://arxiv.org/abs/2308.05481
  • repo_url: https://github.com/tsinghuadatabasegroup/db-gpt
  • paper_authors: Xuanhe Zhou, Guoliang Li, Zhiyuan Liu
  • for: Providing D-Bot, a database administrator based on large language models (LLMs) that helps DBAs maintain and optimize database systems, ensuring data availability, performance, and reliability.
  • methods: Uses LLMs to detect database maintenance knowledge from documents and tools, tree-of-thought reasoning for root cause analysis, and collaborative diagnosis among multiple LLMs.
  • results: Preliminary experiments show that D-Bot can efficiently and effectively diagnose root causes for target databases, helping DBAs maintain and optimize database systems.
    Abstract Database administrators (DBAs) play a crucial role in managing, maintaining and optimizing a database system to ensure data availability, performance, and reliability. However, it is hard and tedious for DBAs to manage a large number of database instances (e.g., millions of instances on the cloud databases). Recently large language models (LLMs) have shown great potential to understand valuable documents and accordingly generate reasonable answers. Thus, we propose D-Bot, a LLM-based database administrator that can continuously acquire database maintenance experience from textual sources, and provide reasonable, well-founded, in-time diagnosis and optimization advice for target databases. This paper presents a revolutionary LLM-centric framework for database maintenance, including (i) database maintenance knowledge detection from documents and tools, (ii) tree of thought reasoning for root cause analysis, and (iii) collaborative diagnosis among multiple LLMs. Our preliminary experimental results show that D-Bot can efficiently and effectively diagnose the root causes, and our code is available at github.com/TsinghuaDatabaseGroup/DB-GPT.

Reviewing 3D Object Detectors in the Context of High-Resolution 3+1D Radar

  • paper_url: http://arxiv.org/abs/2308.05478
  • repo_url: None
  • paper_authors: Patrick Palmer, Martin Krueger, Richard Altendorfer, Ganesh Adam, Torsten Bertram
  • for: Exploring deep learning models for 3D object detection on high-resolution radar point cloud data.
  • methods: Adapts existing 3D object detectors developed for lidar point clouds to high-resolution radar data.
  • results: Identifies detectors suited to high-resolution radar data and evaluates their performance across several datasets.
    Abstract Recent developments and the beginning market introduction of high-resolution imaging 4D (3+1D) radar sensors have initialized deep learning-based radar perception research. We investigate deep learning-based models operating on radar point clouds for 3D object detection. 3D object detection on lidar point cloud data is a mature area of 3D vision. Many different architectures have been proposed, each with strengths and weaknesses. Due to similarities between 3D lidar point clouds and 3+1D radar point clouds, those existing 3D object detectors are a natural basis to start deep learning-based 3D object detection on radar data. Thus, the first step is to analyze the detection performance of the existing models on the new data modality and evaluate them in depth. In order to apply existing 3D point cloud object detectors developed for lidar point clouds to the radar domain, they need to be adapted first. While some detectors, such as PointPillars, have already been adapted to be applicable to radar data, we have adapted others, e.g., Voxel R-CNN, SECOND, PointRCNN, and PV-RCNN. To this end, we conduct a cross-model validation (evaluating a set of models on one particular data set) as well as a cross-data set validation (evaluating all models in the model set on several data sets). The high-resolution radar data used are the View-of-Delft and Astyx data sets. Finally, we evaluate several adaptations of the models and their training procedures. We also discuss major factors influencing the detection performance on radar data and propose possible solutions indicating potential future research avenues.

Explainable AI applications in the Medical Domain: a systematic review

  • paper_url: http://arxiv.org/abs/2308.05411
  • repo_url: None
  • paper_authors: Nicoletta Prentzas, Antonis Kakas, Constantinos S. Pattichis
  • for: Reviewing explainable AI solutions in the medical AI domain, to help clinicians better understand and trust AI results.
  • methods: A systematic literature review synthesizing a representative sample of 198 relevant articles published in recent years, covering the development of XAI solutions for medical decision support.
  • results: Most XAI solutions employ model-agnostic techniques; deep learning models are used more than other types of machine learning models; explainability is applied to promote trust, but few works report physician participation in the loop; visual and interactive user interfaces prove more useful for understanding explanations.
    Abstract Artificial Intelligence in Medicine has made significant progress with emerging applications in medical imaging, patient care, and other areas. While these applications have proven successful in retrospective studies, very few of them were applied in practice. The field of Medical AI faces various challenges in terms of building user trust, complying with regulations, and using data ethically. Explainable AI (XAI) aims to enable humans to understand AI and trust its results. This paper presents a literature review on the recent developments of XAI solutions for medical decision support, based on a representative sample of 198 articles published in recent years. The systematic synthesis of the relevant articles resulted in several findings: (1) model-agnostic XAI techniques were mostly employed in these solutions, (2) deep learning models are utilized more than other types of machine learning models, (3) explainability was applied to promote trust, but very few works reported the physicians' participation in the loop, (4) visual and interactive user interfaces are more useful in understanding the explanation and the recommendation of the system. More research is needed, in collaboration between medical and AI experts, to guide the development of suitable frameworks for the design, implementation, and evaluation of XAI solutions in medicine.

A Comparative Assessment of Multi-view fusion learning for Crop Classification

  • paper_url: http://arxiv.org/abs/2308.05407
  • repo_url: https://github.com/fmenat/multiviewcropclassification
  • paper_authors: Francisco Mena, Diego Arenas, Marlon Nuske, Andreas Dengel
  • for: Assessing how different multi-view fusion learning models perform on crop classification tasks.
  • methods: Compares several multi-view fusion strategies, including input-level, feature-level, and hierarchical fusion.
  • results: Different fusion strategies perform best in different test regions; nevertheless, multi-view fusion outperforms single-view models and previous fusion strategies.
    Abstract With a rapidly increasing amount and diversity of remote sensing (RS) data sources, there is a strong need for multi-view learning modeling. This is a complex task when considering the differences in resolution, magnitude, and noise of RS data. The typical approach for merging multiple RS sources has been input-level fusion, but other - more advanced - fusion strategies may outperform this traditional approach. This work assesses different fusion strategies for crop classification in the CropHarvest dataset. The fusion methods proposed in this work outperform models based on individual views and previous fusion methods. We do not find one single fusion method that consistently outperforms all other approaches. Instead, we present a comparison of multi-view fusion methods for three different datasets and show that, depending on the test region, different methods obtain the best performance. Despite this, we suggest a preliminary criterion for the selection of fusion methods.
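The distinction between input-level and feature-level fusion is mostly architectural wiring; a minimal sketch contrasting the two for a pair of views (generic layers, not the models compared in the paper):

```python
import torch
import torch.nn as nn

class InputFusion(nn.Module):
    """Concatenate the raw views first, then encode them jointly."""
    def __init__(self, d1, d2, hidden, n_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d1 + d2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))
    def forward(self, x1, x2):
        return self.net(torch.cat([x1, x2], dim=-1))

class FeatureFusion(nn.Module):
    """Encode each view separately, then merge the learned features."""
    def __init__(self, d1, d2, hidden, n_classes):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Linear(d1, hidden), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Linear(d2, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)
    def forward(self, x1, x2):
        return self.head(torch.cat([self.enc1(x1), self.enc2(x2)], dim=-1))
```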

Notation3 as an Existential Rule Language

  • paper_url: http://arxiv.org/abs/2308.07332
  • repo_url: https://github.com/smennicke/n32rules
  • paper_authors: Dörthe Arndt, Stephan Mennicke
  • for: Investigating the relation between Notation3 Logic (\nthree) rules with blank nodes in their heads and existential rules, and the problem of translating \nthree rules into existential rules.
  • methods: Analyzes and compares \nthree rules and existential rules, identifies a subset of \nthree that can be mapped directly to existential rules, and defines such a mapping that preserves the equivalence of \nthree formulae.
  • results: Experiments show that existential rule reasoners can improve \nthree reasoning performance, particularly on use cases containing many facts, while the EYE reasoner is very fast when dealing with a high number of dependent rules.
    Abstract Notation3 Logic (\nthree) is an extension of RDF that allows the user to write rules introducing new blank nodes to RDF graphs. Many applications (e.g., ontology mapping) rely on this feature as blank nodes -- used directly or in auxiliary constructs -- are omnipresent on the Web. However, the number of fast \nthree reasoners covering this very important feature of the logic is rather limited. On the other hand, there are engines like VLog or Nemo which do not directly support Semantic Web rule formats but which are developed and optimized for very similar constructs: existential rules. In this paper, we investigate the relation between \nthree rules with blank nodes in their heads and existential rules. We identify a subset of \nthree which can be mapped directly to existential rules and define such a mapping preserving the equivalence of \nthree formulae. In order to also illustrate that in some cases \nthree reasoning could benefit from our translation, we then employ this mapping in an implementation to compare the performance of the \nthree reasoners EYE and cwm to VLog and Nemo on \nthree rules and their mapped counterparts. Our tests show that the existential rule reasoners perform particularly well for use cases containing many facts while especially the EYE reasoner is very fast when dealing with a high number of dependent rules. We thus provide a tool enabling the Semantic Web community to directly use existing and future existential rule reasoners and benefit from the findings of this active community.
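To illustrate the mapping the abstract describes, a toy example of our own (not taken from the paper): a blank node in the head of an \nthree rule plays the role of an existentially quantified variable.

```latex
% N3 rule whose head introduces the blank node _:b
%     { ?x :hasParent ?y. }  =>  { ?x :hasAncestor _:b. }
% corresponds to the existential rule
\mathit{hasParent}(x, y) \;\rightarrow\; \exists z.\ \mathit{hasAncestor}(x, z)
```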

Enhancing Trust in LLM-Based AI Automation Agents: New Considerations and Future Challenges

  • paper_url: http://arxiv.org/abs/2308.05391
  • repo_url: None
  • paper_authors: Sivan Schwartz, Avi Yaeli, Segev Shlomov
  • for: Examines trust in AI agents in the context of the new generation of LLM-based automation tools.
  • methods: Analyzes the main aspects of trust in AI agents discussed in the existing literature and evaluates how nascent automation products address these considerations.
  • results: The new generation of automation tools addresses many trust considerations, but several challenges remain for future research.
    Abstract Trust in AI agents has been extensively studied in the literature, resulting in significant advancements in our understanding of this field. However, the rapid advancements in Large Language Models (LLMs) and the emergence of LLM-based AI agent frameworks pose new challenges and opportunities for further research. In the field of process automation, a new generation of AI-based agents has emerged, enabling the execution of complex tasks. At the same time, the process of building automation has become more accessible to business users via user-friendly no-code tools and training mechanisms. This paper explores these new challenges and opportunities, analyzes the main aspects of trust in AI agents discussed in existing literature, and identifies specific considerations and challenges relevant to this new generation of automation agents. We also evaluate how nascent products in this category address these considerations. Finally, we highlight several challenges that the research community should address in this evolving landscape.

Adaptive Taxonomy Learning and Historical Patterns Modelling for Patent Classification

  • paper_url: http://arxiv.org/abs/2308.05385
  • repo_url: https://github.com/hope-rita/patcls
  • paper_authors: Tao Zou, Le Yu, Leilei Sun, Bowen Du, Deqing Wang, Fuzhen Zhuang
  • for: Proposes a patent classification method that jointly considers patent texts and assignees' historical application patterns to improve classification accuracy.
  • methods: An integrated framework with an IPC codes correlations learning module, which derives semantic representations of IPC codes by adaptively passing and aggregating messages within and across levels of the hierarchical taxonomy, and a historical application patterns learning component that incorporates an assignee's previous patents through a dual-channel aggregation mechanism.
  • results: Experiments on real-world datasets show the approach outperforms existing methods and that the model captures assignees' temporal patterns and the semantic dependencies among IPC codes.
    Abstract Patent classification aims to assign multiple International Patent Classification (IPC) codes to a given patent. Recent methods for automatically classifying patents mainly focus on analyzing the text descriptions of patents. However, apart from the texts, each patent is also associated with some assignees, and the knowledge of their applied patents is often valuable for classification. Furthermore, the hierarchical taxonomy formulated by the IPC system provides important contextual information and enables models to leverage the correlations between IPC codes for more accurate classification. However, existing methods fail to incorporate the above aspects. In this paper, we propose an integrated framework that comprehensively considers the information on patents for patent classification. To be specific, we first present an IPC codes correlations learning module to derive their semantic representations via adaptively passing and aggregating messages within the same level and across different levels along the hierarchical taxonomy. Moreover, we design a historical application patterns learning component to incorporate the corresponding assignee's previous patents by a dual channel aggregation mechanism. Finally, we combine the contextual information of patent texts that contains the semantics of IPC codes, and assignees' sequential preferences to make predictions. Experiments on real-world datasets demonstrate the superiority of our approach over the existing methods. Besides, we present the model's ability to capture the temporal patterns of assignees and the semantic dependencies among IPC codes.
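
As a rough illustration of message passing along a label hierarchy, the sketch below averages embeddings between IPC codes and their parents/children; the codes, dimensions, and update rule are our simplifications, not the paper's MHGN-style modules.

```python
import numpy as np

# Simplified, hypothetical sketch: refine IPC code embeddings by averaging
# messages from hierarchy neighbors (parents and children), a toy stand-in
# for the paper's within-level/cross-level message passing.
children = {"G06": ["G06F", "G06N"], "G06F": [], "G06N": []}
parents = {kid: p for p, kids in children.items() for kid in kids}

rng = np.random.default_rng(0)
emb = {code: rng.normal(size=8) for code in children}

def refine(emb, alpha=0.5):
    new = {}
    for code in emb:
        neigh = [emb[k] for k in children[code]]
        if code in parents:
            neigh.append(emb[parents[code]])
        new[code] = (emb[code] if not neigh
                     else (1 - alpha) * emb[code] + alpha * np.mean(neigh, axis=0))
    return new

emb = refine(emb)
print({k: v[:2].round(3) for k, v in emb.items()})
```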

Beyond Semantics: Learning a Behavior Augmented Relevance Model with Self-supervised Learning

  • paper_url: http://arxiv.org/abs/2308.05379
  • repo_url: None
  • paper_authors: Zeyuan Chen, Wei Chen, Jia Xu, Zhongyi Liu, Wei Zhang
  • for: Targets relevance modeling for search engines, i.e., locating desirable items for user queries to ensure a good user experience.
  • methods: A behavior-augmented relevance model learned with self-supervision that considers not only semantic matching but also user behaviors and item characteristics.
  • results: Experiments indicate the model better captures users' search intent, improving the accuracy of search results and user satisfaction.
    Abstract Relevance modeling aims to locate desirable items for corresponding queries, which is crucial for search engines to ensure user experience. Although most conventional approaches address this problem by assessing the semantic similarity between the query and item, pure semantic matching is not everything.

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models’ Alignment

  • paper_url: http://arxiv.org/abs/2308.05374
  • repo_url: None
  • paper_authors: Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, Hang Li
  • for: Provides a comprehensive framework for assessing the trustworthiness of large language models (LLMs), supporting systematic iteration and deployment in real-world applications.
  • methods: Organizes LLM trustworthiness into seven major categories — reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness — further divided into 29 sub-categories, with measurement studies conducted on a subset of 8 sub-categories across several widely used LLMs.
  • results: More aligned models generally achieve better overall trustworthiness, but the effectiveness of alignment varies across categories, motivating finer-grained analysis, testing, and continuous improvement of LLM alignment; the findings offer guidance for reliable and ethically sound LLM deployment.
    Abstract Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category is further divided into several sub-categories, resulting in a total of 29 sub-categories. Additionally, a subset of 8 sub-categories is selected for further investigation, where corresponding measurement studies are designed and conducted on several widely-used LLMs. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. This highlights the importance of conducting more fine-grained analyses, testing, and making continuous improvements on LLM alignment. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications.

Machine Learning aided Computer Architecture Design for CNN Inferencing Systems

  • paper_url: http://arxiv.org/abs/2308.05364
  • repo_url: None
  • paper_authors: Christopher A. Metz
  • for: Expedites the testing and selection of suitable GPGPU accelerators for CNN inferencing systems.
  • methods: Addresses Design Space Exploration (DSE) by developing a fast and accurate technique for forecasting the power and performance of CNNs during inference.
  • results: The predictors achieve MAPEs of 5.03% (power) and 5.94% (performance), letting computer architects estimate both early in development, reducing the need for numerous prototypes, saving time and cost, and improving time-to-market.
    Abstract Efficient and timely calculations of Machine Learning (ML) algorithms are essential for emerging technologies like autonomous driving, the Internet of Things (IoT), and edge computing. One of the primary ML algorithms used in such systems is Convolutional Neural Networks (CNNs), which demand high computational resources. This requirement has led to the use of ML accelerators like GPGPUs to meet design constraints. However, selecting the most suitable accelerator involves Design Space Exploration (DSE), a process that is usually time-consuming and requires significant manual effort. Our work presents approaches to expedite the DSE process by identifying the most appropriate GPGPU for CNN inferencing systems. We have developed a quick and precise technique for forecasting the power and performance of CNNs during inference, with a MAPE of 5.03% and 5.94%, respectively. Our approach empowers computer architects to estimate power and performance in the early stages of development, reducing the necessity for numerous prototypes. This saves time and money while also improving the time-to-market period.
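
For reference, the MAPE figures quoted above (5.03% power, 5.94% performance) follow the standard definition; a minimal sketch with hypothetical predictions:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, the metric reported in the paper."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Hypothetical power measurements vs. predictions (watts) for three GPGPUs.
print(mape([120.0, 250.0, 80.0], [126.0, 238.0, 83.0]))  # ~4.5
```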

Metacognitive Prompting Improves Understanding in Large Language Models

  • paper_url: http://arxiv.org/abs/2308.05342
  • repo_url: https://github.com/eternityyw/metacognitive-prompting
  • paper_authors: Yuqing Wang, Yun Zhao
  • for: Improving the understanding abilities of large language models (LLMs).
  • methods: Metacognitive Prompting (MP), a strategy inspired by human introspective reasoning in which the LLM performs a systematic series of structured, self-aware evaluations, drawing on both its vast inherent knowledge and new insights.
  • results: MP consistently outperforms standard and chain-of-thought prompting across models and datasets, and PaLM equipped with MP approaches the performance level of GPT-4.
    Abstract In Large Language Models (LLMs), there have been consistent advancements in task-specific performance, largely influenced by effective prompt design. While recent research on prompting has enhanced the reasoning capabilities of LLMs, a gap remains in further improving their understanding abilities. In this study, we introduce Metacognitive Prompting (MP), a strategy inspired by human introspective reasoning processes. Using MP, LLMs undergo a systematic series of structured, self-aware evaluations, drawing on both their vast inherent knowledge and new insights. Our experiments involve five prevalent LLMs: Llama2, Vicuna, PaLM, GPT-3.5, and GPT-4, all of which span various general natural language understanding (NLU) tasks from the GLUE and SuperGLUE benchmarks. Results indicate that, although GPT-4 consistently excels in most tasks, PaLM, when equipped with MP, approaches its performance level. Furthermore, across models and datasets, MP consistently outperforms existing prompting methods, including standard and chain-of-thought prompting. This study underscores the potential to amplify the understanding abilities of LLMs and highlights the benefits of mirroring human introspective reasoning in NLU tasks.
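
A hedged illustration of what a metacognitive prompt might look like for an NLU task; the stage wording below is our assumption inspired by the description above, not the paper's exact template.

```python
# Illustrative only: one plausible shape of a metacognitive prompt for a
# natural language inference task; the stage wording is our assumption.
def metacognitive_prompt(premise: str, hypothesis: str) -> str:
    return (
        "Task: does the premise entail the hypothesis?\n"
        f"Premise: {premise}\nHypothesis: {hypothesis}\n\n"
        "1. Restate the question in your own words to clarify understanding.\n"
        "2. Make a preliminary judgment and explain it.\n"
        "3. Critically re-evaluate that judgment: what could be wrong?\n"
        "4. Give your final answer (entailment / not entailment).\n"
        "5. State your confidence in the final answer."
    )

print(metacognitive_prompt("A man is playing a guitar.", "A person makes music."))
```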

Classification of Human- and AI-Generated Texts: Investigating Features for ChatGPT

  • paper_url: http://arxiv.org/abs/2308.05341
  • repo_url: None
  • paper_authors: Lorenz Mindner, Tim Schlippe, Kristina Schaaff
  • for: Detecting whether a text was generated from scratch, or rephrased, by an AI rather than written by a human.
  • methods: Classification with a combination of traditional and new features, including perplexity, semantic, list-lookup, error-based, readability, AI-feedback, and text-vector features.
  • results: The new features substantially improve many classifiers; the best systems reach F1-scores above 96% for generated text and above 78% for rephrased text, and the best basic text-rephrasing detector outperforms GPTZero by 183.8% relative in F1-score.
    Abstract Recently, generative AIs like ChatGPT have become available to the wide public. These tools can for instance be used by students to generate essays or whole theses. But how does a teacher know whether a text is written by a student or an AI? In our work, we explore traditional and new features to (1) detect text generated by AI from scratch and (2) text rephrased by AI. Since we found that classification is more difficult when the AI has been instructed to create the text in a way that a human would not recognize that it was generated by an AI, we also investigate this more advanced case. For our experiments, we produced a new text corpus covering 10 school topics. Our best systems to classify basic and advanced human-generated/AI-generated texts have F1-scores of over 96%. Our best systems for classifying basic and advanced human-generated/AI-rephrased texts have F1-scores of more than 78%. The systems use a combination of perplexity, semantic, list lookup, error-based, readability, AI feedback, and text vector features. Our results show that the new features substantially help to improve the performance of many classifiers. Our best basic text rephrasing detection system even outperforms GPTZero by 183.8% relative in F1-score.
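
A minimal sketch of the feature-based detection idea: hand-crafted signals such as perplexity, readability, and error counts feed a standard classifier. The feature values, labels, and model choice here are illustrative assumptions, not the paper's pipeline.

```python
from sklearn.linear_model import LogisticRegression

# Toy feature vectors in the spirit of the paper's feature families:
# [perplexity, readability score, error count]; values and labels are made up.
X = [
    [32.5, 58.0, 4],
    [12.1, 71.0, 0],
    [30.8, 60.0, 3],
    [10.9, 69.5, 1],
]
y = [0, 1, 0, 1]  # 0 = human-written, 1 = AI-generated (toy convention)

clf = LogisticRegression().fit(X, y)
print(clf.predict([[11.5, 70.0, 0]]))  # -> [1] under this toy fit
```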

Adv-Inpainting: Generating Natural and Transferable Adversarial Patch via Attention-guided Feature Fusion

  • paper_url: http://arxiv.org/abs/2308.05320
  • repo_url: None
  • paper_authors: Yanjie Li, Mingxing Duan, Bin Xiao
  • for: Improves the stealthiness and transferability of adversarial attacks on facial recognition (FR) models by generating natural-looking, highly transferable adversarial patches.
  • methods: Adv-Inpainting, a two-stage coarse-to-fine framework: (1) an attention-guided StyleGAN (Att-StyleGAN) that adaptively combines texture and identity features based on the attention map to generate high-transferable, natural patches, and (2) a refinement network with a new boundary variance loss that improves the coherence between the patch and its surrounding area.
  • results: Adv-Inpainting is stealthy and produces adversarial patches with stronger transferability and better visual quality than previous adversarial patch attacks.
    Abstract The rudimentary adversarial attacks utilize additive noise to attack facial recognition (FR) models. However, because manipulating the total face is impractical in the physical setting, most real-world FR attacks are based on adversarial patches, which limit perturbations to a small area. Previous adversarial patch attacks often resulted in unnatural patterns and clear boundaries that were easily noticeable. In this paper, we argue that generating adversarial patches with plausible content can result in stronger transferability than using additive noise or directly sampling from the latent space. To generate natural-looking and highly transferable adversarial patches, we propose an innovative two-stage coarse-to-fine attack framework called Adv-Inpainting. In the first stage, we propose an attention-guided StyleGAN (Att-StyleGAN) that adaptively combines texture and identity features based on the attention map to generate high-transferable and natural adversarial patches. In the second stage, we design a refinement network with a new boundary variance loss to further improve the coherence between the patch and its surrounding area. Experiment results demonstrate that Adv-Inpainting is stealthy and can produce adversarial patches with stronger transferability and improved visual quality than previous adversarial patch attacks.

Homophily-enhanced Structure Learning for Graph Clustering

  • paper_url: http://arxiv.org/abs/2308.05309
  • repo_url: https://github.com/galogm/hole
  • paper_authors: Ming Gu, Gaoming Yang, Sheng Zhou, Ning Ma, Jiawei Chen, Qiaoyu Tan, Meihan Liu, Jiajun Bu
  • for: Improves graph clustering with graph neural networks (GNNs) by refining the quality of the graph structure.
  • methods: HoLe, which enhances the degree of homophily within the graph structure through two clustering-oriented structure learning modules, hierarchical correlation estimation and cluster-aware sparsification, jointly optimized with GNN-based clustering.
  • results: On seven benchmark datasets of various types and scales, HoLe outperforms state-of-the-art baselines across a range of clustering metrics.
    Abstract Graph clustering is a fundamental task in graph analysis, and recent advances in utilizing graph neural networks (GNNs) have shown impressive results. Despite the success of existing GNN-based graph clustering methods, they often overlook the quality of graph structure, which is inherent in real-world graphs due to their sparse and multifarious nature, leading to subpar performance. Graph structure learning allows refining the input graph by adding missing links and removing spurious connections. However, previous endeavors in graph structure learning have predominantly centered around supervised settings, and cannot be directly applied to our specific clustering tasks due to the absence of ground-truth labels. To bridge the gap, we propose a novel method called \textbf{ho}mophily-enhanced structure \textbf{le}arning for graph clustering (HoLe). Our motivation stems from the observation that subtly enhancing the degree of homophily within the graph structure can significantly improve GNNs and clustering outcomes. To realize this objective, we develop two clustering-oriented structure learning modules, i.e., hierarchical correlation estimation and cluster-aware sparsification. The former module enables a more accurate estimation of pairwise node relationships by leveraging guidance from latent and clustering spaces, while the latter one generates a sparsified structure based on the similarity matrix and clustering assignments. Additionally, we devise a joint optimization approach alternating between training the homophily-enhanced structure learning and GNN-based clustering, thereby enforcing their reciprocal effects. Extensive experiments on seven benchmark datasets of various types and scales, across a range of clustering metrics, demonstrate the superiority of HoLe against state-of-the-art baselines.
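
Since HoLe's premise is that raising homophily helps clustering, it helps to recall one common definition — the fraction of edges whose endpoints share a label. A minimal sketch (our illustration, not the paper's exact measure):

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose endpoints share a label (one common
    definition of graph homophily)."""
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
print(edge_homophily(edges, labels))  # 0.4 for this toy graph
```

Structure learning that adds intra-cluster links and removes the cross-cluster ones pushes this ratio up, which is the effect the paper exploits.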

Double-chain Constraints for 3D Human Pose Estimation in Images and Videos

  • paper_url: http://arxiv.org/abs/2308.05298
  • repo_url: https://github.com/KHB1698/DC-GCT
  • paper_authors: Hongbo Kang, Yong Wang, Mengyuan Liu, Doudou Wu, Peng Liu, Wenming Yang
  • for: Reconstructing 3D human poses from 2D poses that lack depth information.
  • methods: The Double-chain Graph Convolutional Transformer (DC-GCT), which constrains the pose through local-to-global and global-to-local chains, combining a GCN-based Local Constraint Module, a self-attention-based Global Constraint Module, and a Feature Interaction Module; temporal information can be injected into the single-frame model via the joint embedding of the target frame at negligible extra computational cost.
  • results: DC-GCT achieves state-of-the-art performance on two challenging datasets (Human3.6M and MPI-INF-3DHP), including state-of-the-art results on all Human3.6M action categories using detected 2D poses from CPN.
    Abstract Reconstructing 3D poses from 2D poses lacking depth information is particularly challenging due to the complexity and diversity of human motion. The key is to effectively model the spatial constraints between joints to leverage their inherent dependencies. Thus, we propose a novel model, called Double-chain Graph Convolutional Transformer (DC-GCT), to constrain the pose through a double-chain design consisting of local-to-global and global-to-local chains to obtain a complex representation more suitable for the current human pose. Specifically, we combine the advantages of GCN and Transformer and design a Local Constraint Module (LCM) based on GCN and a Global Constraint Module (GCM) based on self-attention mechanism as well as a Feature Interaction Module (FIM). The proposed method fully captures the multi-level dependencies between human body joints to optimize the modeling capability of the model. Moreover, we propose a method to use temporal information into the single-frame model by guiding the video sequence embedding through the joint embedding of the target frame, with negligible increase in computational cost. Experimental results demonstrate that DC-GCT achieves state-of-the-art performance on two challenging datasets (Human3.6M and MPI-INF-3DHP). Notably, our model achieves state-of-the-art performance on all action categories in the Human3.6M dataset using detected 2D poses from CPN, and our code is available at: https://github.com/KHB1698/DC-GCT.

Multimodal Pretrained Models for Sequential Decision-Making: Synthesis, Verification, Grounding, and Perception

  • paper_url: http://arxiv.org/abs/2308.05295
  • repo_url: None
  • paper_authors: Yunhao Yang, Cyrus Neary, Ufuk Topcu
  • for: Develops an algorithm that exploits the multimodal knowledge in pretrained models to solve sequential decision-making tasks.
  • methods: Queries a pretrained model with a text-based task description to construct an automaton-based controller, verifies the controller against independently available knowledge (automatically refining it on inconsistency), and grounds it to the task environment through visual observations.
  • results: Across a suite of real-world tasks, including daily-life and robot manipulation tasks, the algorithm successfully constructs, verifies, and grounds automaton-based controllers, satisfying user-provided specifications even under perceptual uncertainty.
    Abstract Recently developed pretrained models can encode rich world knowledge expressed in multiple modalities, such as text and images. However, the outputs of these models cannot be integrated into algorithms to solve sequential decision-making tasks. We develop an algorithm that utilizes the knowledge from pretrained models to construct and verify controllers for sequential decision-making tasks, and to ground these controllers to task environments through visual observations. In particular, the algorithm queries a pretrained model with a user-provided, text-based task description and uses the model's output to construct an automaton-based controller that encodes the model's task-relevant knowledge. It then verifies whether the knowledge encoded in the controller is consistent with other independently available knowledge, which may include abstract information on the environment or user-provided specifications. If this verification step discovers any inconsistency, the algorithm automatically refines the controller to resolve the inconsistency. Next, the algorithm leverages the vision and language capabilities of pretrained models to ground the controller to the task environment. It collects image-based observations from the task environment and uses the pretrained model to link these observations to the text-based control logic encoded in the controller (e.g., actions and conditions that trigger the actions). We propose a mechanism to ensure the controller satisfies the user-provided specification even when perceptual uncertainties are present. We demonstrate the algorithm's ability to construct, verify, and ground automaton-based controllers through a suite of real-world tasks, including daily life and robot manipulation tasks.
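
A hedged sketch of what an automaton-based controller can look like in code: a transition table mapping (state, perceived event) to (next state, action). The states, triggers, and actions below are hypothetical stand-ins, not outputs of the paper's pipeline.

```python
# Hedged sketch: a task controller as a finite automaton. States, triggers,
# and actions are hypothetical, e.g. for a "make coffee" style task.
transitions = {
    ("start", "cup_detected"): ("cup_grasped", "grasp_cup"),
    ("cup_grasped", "machine_detected"): ("cup_placed", "place_cup_under_machine"),
    ("cup_placed", "button_detected"): ("done", "press_brew_button"),
}

def step(state, observation):
    """Advance the automaton on a perceived event; stay put if no rule fires."""
    next_state, action = transitions.get((state, observation), (state, "wait"))
    return next_state, action

state = "start"
for obs in ["cup_detected", "machine_detected", "button_detected"]:
    state, action = step(state, obs)
    print(obs, "->", action, "->", state)
```

In the paper's setting, the vision-language model plays the role of mapping raw images to the symbolic observations that drive such transitions.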

Cross-heterogeneity Graph Few-shot Learning

  • paper_url: http://arxiv.org/abs/2308.05275
  • repo_url: None
  • paper_authors: Pengfei Ding, Yan Wang, Guanfeng Liu
  • for: addresses the label sparsity issue in heterogeneous graphs (HGs)
  • methods: proposes CGFL, a novel model for cross-heterogeneity graph few-shot learning that extracts meta-patterns via a multi-view heterogeneous graph neural network (MHGN) and uses a score module to measure the informativeness of labeled samples and the transferability of each source HG
  • results: demonstrates superior performance over state-of-the-art methods in predicting new classes with few-labeled data on four real-world datasets.
    Abstract In recent years, heterogeneous graph few-shot learning has been proposed to address the label sparsity issue in heterogeneous graphs (HGs), which contain various types of nodes and edges. The existing methods have achieved good performance by transferring generalized knowledge extracted from rich-labeled classes in source HG(s) to few-labeled classes in a target HG. However, these methods only consider the single-heterogeneity scenario where the source and target HGs share a fixed set of node/edge types, ignoring the more general scenario of cross-heterogeneity, where each HG can have a different and non-fixed set of node/edge types. To this end, we focus on the unexplored cross-heterogeneity scenario and propose a novel model for Cross-heterogeneity Graph Few-shot Learning, namely CGFL. In CGFL, we first extract meta-patterns to capture heterogeneous information and propose a multi-view heterogeneous graph neural network (MHGN) to learn meta-patterns across HGs. Then, we propose a score module to measure the informativeness of labeled samples and determine the transferability of each source HG. Finally, by integrating MHGN and the score module into a meta-learning mechanism, CGFL can effectively transfer generalized knowledge to predict new classes with few-labeled data. Extensive experiments on four real-world datasets have demonstrated the superior performance of CGFL over the state-of-the-art methods.

AI4GCC – Track 3: Consumption and the Challenges of Multi-Agent RL

  • paper_url: http://arxiv.org/abs/2308.05260
  • repo_url: None
  • paper_authors: Marco Jiralerspong, Gauthier Gidel
  • for: Improving the integration of machine learning with traditional economic policy analysis in the AI4GCC competition.
  • methods: Suggests adding an index that accounts for consumption/utility to the evaluation criteria, and further investigating agents' learning dynamics in the simulator and the game-theoretic properties of outcomes from the proposed negotiation protocols.
  • results: A position piece: it reports no experiments, instead identifying areas for improvement for future iterations of the competition/simulation.
    Abstract The AI4GCC competition presents a bold step forward in the direction of integrating machine learning with traditional economic policy analysis. Below, we highlight two potential areas for improvement that could enhance the competition's ability to identify and evaluate proposed negotiation protocols. Firstly, we suggest the inclusion of an additional index that accounts for consumption/utility as part of the evaluation criteria. Secondly, we recommend further investigation into the learning dynamics of agents in the simulator and the game theoretic properties of outcomes from proposed negotiation protocols. We hope that these suggestions can be of use for future iterations of the competition/simulation.

Vector quantization loss analysis in VQGANs: a single-GPU ablation study for image-to-image synthesis

  • paper_url: http://arxiv.org/abs/2308.05242
  • repo_url: https://github.com/luv91/vqgan_project
  • paper_authors: Luv Verma, Varun Mohan
  • for: An ablation study of Vector Quantized Generative Adversarial Networks (VQGANs) for image-to-image synthesis on a single NVIDIA A100 GPU.
  • methods: Varies key parameters — the number of epochs, image count, and attributes of the codebook vectors and latent dimensions — under limited resources, focusing on the vector quantization loss while keeping other hyperparameters and the GAN loss fixed, to probe the discrete latent space.
  • results: Without surpassing existing benchmarks, the study sheds light on VQGAN behavior on smaller datasets, covering artifacts, codebook-size optimization, and a comparative analysis with PCA; introducing 2D positional encodings markedly reduces artifacts and offers insight into balancing clarity and overfitting.
    Abstract This study performs an ablation analysis of Vector Quantized Generative Adversarial Networks (VQGANs), concentrating on image-to-image synthesis utilizing a single NVIDIA A100 GPU. The current work explores the nuanced effects of varying critical parameters including the number of epochs, image count, and attributes of codebook vectors and latent dimensions, specifically within the constraint of limited resources. Notably, our focus is pinpointed on the vector quantization loss, keeping other hyperparameters and loss components (GAN loss) fixed. This was done to delve into a deeper understanding of the discrete latent space, and to explore how varying its size affects the reconstruction. Though, our results do not surpass the existing benchmarks, however, our findings shed significant light on VQGAN's behaviour for a smaller dataset, particularly concerning artifacts, codebook size optimization, and comparative analysis with Principal Component Analysis (PCA). The study also uncovers the promising direction by introducing 2D positional encodings, revealing a marked reduction in artifacts and insights into balancing clarity and overfitting.
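
For context, the vector quantization loss at the center of the ablation is, in the standard VQ-VAE/VQGAN formulation, a codebook term plus a commitment term with stop-gradients; a minimal PyTorch sketch of that standard objective (the study's exact configuration may differ):

```python
import torch
import torch.nn.functional as F

def vq_loss(z_e, codebook, beta=0.25):
    """Standard VQ objective: codebook term + beta-weighted commitment term.
    z_e: encoder outputs (N, D); codebook: (K, D) learnable vectors."""
    d = torch.cdist(z_e, codebook)          # (N, K) pairwise distances
    z_q = codebook[d.argmin(dim=1)]         # (N, D) nearest codebook entries
    codebook_term = F.mse_loss(z_q, z_e.detach())   # moves codes toward encodings
    commit_term = F.mse_loss(z_e, z_q.detach())     # keeps encoder near its code
    return codebook_term + beta * commit_term

z_e = torch.randn(8, 64)
codebook = torch.randn(512, 64, requires_grad=True)  # codebook size 512 here
print(vq_loss(z_e, codebook))
```

The codebook size (512 above) is exactly the kind of attribute the study sweeps to see how the discrete latent space affects reconstruction.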

Leveraging the Edge and Cloud for V2X-Based Real-Time Object Detection in Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.05234
  • repo_url: None
  • paper_authors: Faisal Hawlader, François Robinet, Raphaël Frank
  • for: Finding the best trade-off between detection quality and latency for real-time perception in autonomous driving.
  • methods: Offloads object detection to less resource-constrained edge and cloud platforms, trains detection models on a synthetic dataset, compares offloading strategies using real hardware and network simulations, and measures how JPEG and H.265 compression at varying qualities affects transmission delay and prediction metrics.
  • results: Models with adequate compression can run in real time on the cloud while outperforming local detection performance.
    Abstract Environmental perception is a key element of autonomous driving because the information received from the perception module influences core driving decisions. An outstanding challenge in real-time perception for autonomous driving lies in finding the best trade-off between detection quality and latency. Major constraints on both computation and power have to be taken into account for real-time perception in autonomous vehicles. Larger object detection models tend to produce the best results, but are also slower at runtime. Since the most accurate detectors cannot run in real-time locally, we investigate the possibility of offloading computation to edge and cloud platforms, which are less resource-constrained. We create a synthetic dataset to train object detection models and evaluate different offloading strategies. Using real hardware and network simulations, we compare different trade-offs between prediction quality and end-to-end delay. Since sending raw frames over the network implies additional transmission delays, we also explore the use of JPEG and H.265 compression at varying qualities and measure their impact on prediction metrics. We show that models with adequate compression can be run in real-time on the cloud while outperforming local detection performance.
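
A minimal sketch of the size/latency trade-off being measured: JPEG-encode a frame at several qualities and estimate the transmission delay for an assumed uplink bandwidth (the frame and the bandwidth figure are our assumptions, not the paper's setup).

```python
import io
from PIL import Image

frame = Image.new("RGB", (1280, 720), (90, 120, 80))  # stand-in camera frame
UPLINK_MBPS = 50.0  # assumed uplink bandwidth, not a figure from the paper

for quality in (95, 75, 50):
    buf = io.BytesIO()
    frame.save(buf, format="JPEG", quality=quality)
    size_bits = buf.tell() * 8
    delay_ms = size_bits / (UPLINK_MBPS * 1e6) * 1e3
    print(f"quality={quality}: {buf.tell() / 1024:.1f} KiB, "
          f"~{delay_ms:.2f} ms transmission delay")
```

Lower quality shrinks the payload and hence the transmission delay, but can degrade downstream detection metrics — the balance the paper quantifies.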

Alexa, play with robot: Introducing the First Alexa Prize SimBot Challenge on Embodied AI

  • paper_url: http://arxiv.org/abs/2308.05221
  • repo_url: None
  • paper_authors: Hangjie Shi, Leslie Ball, Govind Thattai, Desheng Zhang, Lucy Hu, Qiaozi Gao, Suhaila Shakiah, Xiaofeng Gao, Aishwarya Padmakumar, Bofei Yang, Cadence Chung, Dinakar Guthy, Gaurav Sukhatme, Karthika Arumugam, Matthew Wen, Osman Ipek, Patrick Lange, Rohan Khanna, Shreyas Pansare, Vasu Sharma, Chao Zhang, Cris Flagg, Daniel Pressel, Lavina Vaz, Luke Dai, Prasoon Goyal, Sattvik Sahai, Shaohua Liu, Yao Lu, Anna Gottardi, Shui Hu, Yang Liu, Dilek Hakkani-Tur, Kate Bland, Heather Rocker, James Jeun, Yadunandana Rao, Michael Johnston, Akshaya Iyengar, Arindam Mandal, Prem Natarajan, Reza Ghanadan
  • for: The paper is written to describe the SimBot Challenge, a new challenge for university teams to build robot assistants that complete tasks in a simulated physical environment.
  • methods: The paper describes the infrastructure and support provided to the teams, including Alexa Arena, the simulated environment, and an ML toolkit to accelerate the building of vision and language models.
  • results: The paper summarizes the approaches the participating teams took to overcome research challenges, extracts key lessons learned, and analyzes the performance of the competing SimBots during the competition.
    Abstract The Alexa Prize program has empowered numerous university students to explore, experiment, and showcase their talents in building conversational agents through challenges like the SocialBot Grand Challenge and the TaskBot Challenge. As conversational agents increasingly appear in multimodal and embodied contexts, it is important to explore the affordances of conversational interaction augmented with computer vision and physical embodiment. This paper describes the SimBot Challenge, a new challenge in which university teams compete to build robot assistants that complete tasks in a simulated physical environment. This paper provides an overview of the SimBot Challenge, which included both online and offline challenge phases. We describe the infrastructure and support provided to the teams including Alexa Arena, the simulated environment, and the ML toolkit provided to teams to accelerate their building of vision and language models. We summarize the approaches the participating teams took to overcome research challenges and extract key lessons learned. Finally, we provide analysis of the performance of the competing SimBots during the competition.

“Generate” the Future of Work through AI: Empirical Evidence from Online Labor Markets

  • paper_url: http://arxiv.org/abs/2308.05201
  • repo_url: None
  • paper_authors: Jin Liu, Xingchen Xu, Yongjun Li, Yong Tan
  • for: Examines the impact of generative AI on labor markets, specifically how the launch of ChatGPT affected text-related gigs and freelancers in an online labor marketplace.
  • methods: A Difference-in-Differences (DID) design that interprets the launch of ChatGPT as an exogenous shock to quantify its influence on text-related jobs and freelancers.
  • results: The launch significantly decreased transaction volume for directly exposed gigs and freelancers, especially units with higher past transaction volume or lower quality standards; the negative effect was not universal, and freelancers adapting to the technology and offering AI-augmenting services could still benefit substantially.
    Abstract With the advent of general-purpose Generative AI, the interest in discerning its impact on the labor market escalates. In an attempt to bridge the extant empirical void, we interpret the launch of ChatGPT as an exogenous shock, and implement a Difference-in-Differences (DID) approach to quantify its influence on text-related jobs and freelancers within an online labor marketplace. Our results reveal a significant decrease in transaction volume for gigs and freelancers directly exposed to ChatGPT. Additionally, this decline is particularly marked in units of relatively higher past transaction volume or lower quality standards. Yet, the negative effect is not universally experienced among service providers. Subsequent analyses illustrate that freelancers proficiently adapting to novel advancements and offering services that augment AI technologies can yield substantial benefits amidst this transformative period. Consequently, even though the advent of ChatGPT could conceivably substitute existing occupations, it also unfolds immense opportunities and carries the potential to reconfigure the future of work. This research contributes to the limited empirical repository exploring the profound influence of LLM-based generative AI on the labor market, furnishing invaluable insights for workers, job intermediaries, and regulatory bodies navigating this evolving landscape.
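
For readers unfamiliar with the design, the basic 2x2 difference-in-differences estimator nets the treated group's pre/post change against the control group's trend; a minimal sketch with hypothetical numbers (not the paper's data):

```python
# Hypothetical mean transaction volumes (not the paper's data):
# exposed = text-related gigs, control = unexposed gigs.
exposed_pre, exposed_post = 100.0, 80.0
control_pre, control_post = 100.0, 95.0

# Basic 2x2 DID: change in the treated group net of the control-group trend.
did = (exposed_post - exposed_pre) - (control_post - control_pre)
print(did)  # -15.0: drop attributable to the shock under DID assumptions
```

The identifying assumption is parallel trends: absent the shock, exposed and control gigs would have evolved similarly.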

Hierarchical Representations for Spatio-Temporal Visual Attention Modeling and Understanding

  • paper_url: http://arxiv.org/abs/2308.05189
  • repo_url: None
  • paper_authors: Miguel-Ángel Fernández-Torres
  • for: This PhD thesis studies hierarchical representations for modeling and understanding spatio-temporal visual attention in video sequences.
  • methods: Two computational models of visual attention: a generative probabilistic model for context-aware visual attention modeling and understanding, and a deep network architecture that first estimates top-down spatio-temporal visual attention and ultimately models attention in the temporal domain.
  • results: The proposed models support modeling and understanding visual attention in video sequences.
    Abstract This PhD. Thesis concerns the study and development of hierarchical representations for spatio-temporal visual attention modeling and understanding in video sequences. More specifically, we propose two computational models for visual attention. First, we present a generative probabilistic model for context-aware visual attention modeling and understanding. Secondly, we develop a deep network architecture for visual attention modeling, which first estimates top-down spatio-temporal visual attention, and ultimately serves for modeling attention in the temporal domain.

PromptPaint: Steering Text-to-Image Generation Through Paint Medium-like Interactions

  • paper_url: http://arxiv.org/abs/2308.05184
  • repo_url: https://github.com/johnr0/promptpaint
  • paper_authors: John Joon Young Chung, Eytan Adar
  • for: Helps users of text-to-image (T2I) generation models express concepts that are difficult to describe through language alone.
  • methods: Combines T2I generation with interactions modeled on how we use colored paints, letting users mix prompts and apply different prompts to different canvas areas and stages of the generative process.
  • results: Through a set of studies, the authors characterize approaches for mixing prompts, design trade-offs, and socio-technical challenges for generative models, offering insight into future steerable generative tools.
    Abstract While diffusion-based text-to-image (T2I) models provide a simple and powerful way to generate images, guiding this generation remains a challenge. For concepts that are difficult to describe through language, users may struggle to create prompts. Moreover, many of these models are built as end-to-end systems, lacking support for iterative shaping of the image. In response, we introduce PromptPaint, which combines T2I generation with interactions that model how we use colored paints. PromptPaint allows users to go beyond language to mix prompts that express challenging concepts. Just as we iteratively tune colors through layered placements of paint on a physical canvas, PromptPaint similarly allows users to apply different prompts to different canvas areas and times of the generative process. Through a set of studies, we characterize different approaches for mixing prompts, design trade-offs, and socio-technical challenges for generative models. With PromptPaint we provide insight into future steerable generative tools.
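
One common mechanism behind paint-like prompt blending is interpolating prompt embeddings with user-chosen weights; the sketch below illustrates that idea with a stand-in encoder and is our assumption, not necessarily PromptPaint's exact implementation.

```python
import numpy as np

def embed(prompt: str, dim: int = 16) -> np.ndarray:
    # Stand-in encoder; a real system would use the T2I model's text encoder.
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.normal(size=dim)

def mix(prompts, weights):
    """Blend prompts like paints: a weighted average of their embeddings."""
    weights = np.asarray(weights, float) / sum(weights)
    return sum(w * embed(p) for w, p in zip(weights, prompts))

blended = mix(["weathered bronze sculpture", "soft morning fog"], [0.7, 0.3])
print(blended[:4].round(3))  # conditioning vector for the generative model
```

Applying different blends to different canvas regions or diffusion steps gives the layered, iterative control the paper describes.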

FPGA Resource-aware Structured Pruning for Real-Time Neural Networks

  • paper_url: http://arxiv.org/abs/2308.05170
  • repo_url: None
  • paper_authors: Benjamin Ramhorst, George A. Constantinides, Vladimir Loncar
  • for: Improves the efficiency of deep learning inference on FPGAs to meet the demands of real-time systems and Internet-of-Things (IoT) devices.
  • methods: A hardware-centric structured pruning method that formulates pruning as a knapsack problem with resource-aware tensor structures, targeting real-time inference with latencies on the order of 1 µs, accelerated with hls4ml.
  • results: Across tasks including real-time particle classification at CERN's Large Hadron Collider and fast image classification, the method reduces DSP utilization by 55%-92% and block memory (BRAM) utilization by up to 81%.
    Abstract Neural networks achieve state-of-the-art performance in image classification, speech recognition, scientific analysis and many more application areas. With the ever-increasing need for faster computation and lower power consumption, driven by real-time systems and Internet-of-Things (IoT) devices, FPGAs have emerged as suitable devices for deep learning inference. Due to the high computational complexity and memory footprint of neural networks, various compression techniques, such as pruning, quantization and knowledge distillation, have been proposed in literature. Pruning sparsifies a neural network, reducing the number of multiplications and memory. However, pruning often fails to capture properties of the underlying hardware, causing unstructured sparsity and load-balance inefficiency, thus bottlenecking resource improvements. We propose a hardware-centric formulation of pruning, by formulating it as a knapsack problem with resource-aware tensor structures. The primary emphasis is on real-time inference, with latencies in the order of 1$\mu$s, accelerated with hls4ml, an open-source framework for deep learning inference on FPGAs. Evaluated on a range of tasks, including real-time particle classification at CERN's Large Hadron Collider and fast image classification, the proposed method achieves a reduction ranging between 55% and 92% in the utilization of digital signal processing blocks (DSP) and up to 81% in block memory (BRAM) utilization.
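
The knapsack framing can be made concrete with a toy 0/1 knapsack: choose which structures (e.g., filters) to keep so total importance is maximized under a resource budget such as DSP blocks. The importance scores and costs below are hypothetical, and the paper's resource-aware formulation is richer than this sketch.

```python
def knapsack_keep(importance, cost, budget):
    """0/1 knapsack: choose structures to KEEP, maximizing total importance
    subject to a resource budget (e.g., DSP blocks). Returns kept indices."""
    n = len(importance)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]
            if cost[i - 1] <= b:
                cand = dp[i - 1][b - cost[i - 1]] + importance[i - 1]
                if cand > dp[i][b]:
                    dp[i][b] = cand
    kept, b = [], budget
    for i in range(n, 0, -1):          # backtrack to recover the selection
        if dp[i][b] != dp[i - 1][b]:
            kept.append(i - 1)
            b -= cost[i - 1]
    return sorted(kept)

# Hypothetical per-filter importance (e.g., weight norms) and DSP cost.
print(knapsack_keep([0.9, 0.2, 0.7, 0.4], [3, 1, 2, 2], budget=5))  # [0, 2]
```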

DOST – Domain Obedient Self-supervised Training for Multi Label Classification with Noisy Labels

  • paper_url: http://arxiv.org/abs/2308.05101
  • repo_url: None
  • paper_authors: Soumadeep Saha, Utpal Garain, Arijit Ukil, Arpan Pal, Sundeep Khandelwal
  • for: Mitigates the effect of noisy labels in multi-label classification (MLC) tasks.
  • methods: Domain Obedient Self-supervised Training (DOST), which incorporates domain rules into the learning algorithm to detect offending annotations and deter rule-violating predictions in a self-supervised manner.
  • results: On two large-scale multi-label classification datasets, DOST improves performance across the board, makes models more domain-compliant and data-efficient, and often entirely counteracts the effect of annotation noise.
    Abstract The enormous demand for annotated data brought forth by deep learning techniques has been accompanied by the problem of annotation noise. Although this issue has been widely discussed in machine learning literature, it has been relatively unexplored in the context of "multi-label classification" (MLC) tasks which feature more complicated kinds of noise. Additionally, when the domain in question has certain logical constraints, noisy annotations often exacerbate their violations, making such a system unacceptable to an expert. This paper studies the effect of label noise on domain rule violation incidents in the MLC task, and incorporates domain rules into our learning algorithm to mitigate the effect of noise. We propose the Domain Obedient Self-supervised Training (DOST) paradigm which not only makes deep learning models more aligned to domain rules, but also improves learning performance in key metrics and minimizes the effect of annotation noise. This novel approach uses domain guidance to detect offending annotations and deter rule-violating predictions in a self-supervised manner, thus making it more "data efficient" and domain compliant. Empirical studies, performed over two large scale multi-label classification datasets, demonstrate that our method results in improvement across the board, and often entirely counteracts the effect of noise.
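
One simple way to turn a domain rule into a differentiable training signal — our illustration of the general idea, not the paper's exact loss — is to penalize the joint probability of mutually exclusive labels:

```python
import torch

def exclusion_penalty(probs, exclusive_pairs):
    """Penalize predictions violating 'labels i and j cannot co-occur' rules:
    the product p_i * p_j is high only when both labels are predicted."""
    return sum(probs[:, i] * probs[:, j] for i, j in exclusive_pairs).mean()

probs = torch.sigmoid(torch.randn(16, 5))   # multi-label predictions
rules = [(0, 3), (1, 4)]                    # hypothetical exclusion rules
# total_loss = bce(probs, targets) + lam * exclusion_penalty(probs, rules)
print(exclusion_penalty(probs, rules))
```

Because the penalty depends only on predictions and rules, not on the (possibly noisy) labels, it can be applied in a self-supervised fashion, which is the spirit of DOST.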

LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2308.05095
  • repo_url: https://github.com/LayoutLLM-T2I/LayoutLLM-T2I.github.io
  • paper_authors: Leigang Qu, Shengqiong Wu, Hao Fei, Liqiang Nie, Tat-Seng Chua
  • for: Achieves high-faithfulness text-to-image generation from a textual prompt without any manually provided guidance.
  • methods: A coarse-to-fine paradigm: a coarse-grained layout is first generated from the prompt via in-context learning with large language models, then a fine-grained object-interaction diffusion method synthesizes the image conditioned on the prompt and the generated layout.
  • results: Extensive experiments show the method outperforms state-of-the-art models in both layout planning and image generation.
    Abstract In the text-to-image generation field, recent remarkable progress in Stable Diffusion makes it possible to generate rich kinds of novel photorealistic images. However, current models still face misalignment issues (e.g., problematic spatial relation understanding and numeration failure) in complex natural scenes, which impedes the high-faithfulness text-to-image generation. Although recent efforts have been made to improve controllability by giving fine-grained guidance (e.g., sketch and scribbles), this issue has not been fundamentally tackled since users have to provide such guidance information manually. In this work, we strive to synthesize high-fidelity images that are semantically aligned with a given textual prompt without any guidance. Toward this end, we propose a coarse-to-fine paradigm to achieve layout planning and image generation. Concretely, we first generate the coarse-grained layout conditioned on a given textual prompt via in-context learning based on Large Language Models. Afterward, we propose a fine-grained object-interaction diffusion method to synthesize high-faithfulness images conditioned on the prompt and the automatically generated layout. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art models in terms of layout and image generation. Our code and settings are available at https://layoutllm-t2i.github.io.
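
A hedged sketch of the coarse layout-planning stage: prompt an LLM with in-context examples to emit object bounding boxes, which then condition the diffusion stage. The prompt format and the `call_llm` stub are hypothetical, not the paper's interface.

```python
# Hedged sketch of the coarse layout-planning stage; the in-context format
# and the call_llm() helper are hypothetical, not the paper's interface.
LAYOUT_PROMPT = """Propose a layout as `object: [x0, y0, x1, y1]` boxes in [0, 1].
Caption: a cat sitting on a red sofa
Layout: cat: [0.35, 0.30, 0.60, 0.65]; sofa: [0.10, 0.45, 0.90, 0.95]
Caption: {caption}
Layout:"""

def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM endpoint.
    return "dog: [0.10, 0.50, 0.40, 0.90]; ball: [0.44, 0.75, 0.54, 0.85]"

def plan_layout(caption: str) -> str:
    """Coarse stage: ask the LLM for boxes; the fine stage (diffusion) would
    then synthesize the image conditioned on the prompt and this layout."""
    return call_llm(LAYOUT_PROMPT.format(caption=caption))

print(plan_layout("a dog playing with a ball on grass"))
```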

Organizational Bulk Email Systems: Their Role and Performance in Remote Work

  • paper_url: http://arxiv.org/abs/2308.05085
  • repo_url: None
  • paper_authors: Ruoyan Kong, Haiyi Zhu, Joseph A. Konstan
  • for: Examines the role of organizational bulk email systems in remote work and how to better design and target organizational bulk messages.
  • methods: Reviews prior work on evaluating, designing, and prototyping organizational communication systems, summarizes the authors' recent findings and useful research techniques, and proposes a research agenda.
  • results: Rather than new empirical results, the paper offers key questions and potential study directions for organizational communication in remote work environments.
    Abstract The COVID-19 pandemic has forced many employees to work from home. Organizational bulk emails now play a critical role to reach employees with central information in this work-from-home environment. However, we know from our own recent work that organizational bulk email has problems: recipients fail to retain the bulk messages they received from the organization; recipients and senders have different opinions on which bulk messages were important; and communicators lack technology support to better target and design messages. In this position paper, first we review the prior work on evaluating, designing, and prototyping organizational communication systems. Second we review our recent findings and some research techniques we found useful in studying organizational communication. Last we propose a research agenda to study organizational communications in remote work environment and suggest some key questions and potential study directions.

Drones4Good: Supporting Disaster Relief Through Remote Sensing and AI

  • paper_url: http://arxiv.org/abs/2308.05074
  • repo_url: None
  • paper_authors: Nina Merkle, Reza Bahmanyar, Corentin Henry, Seyed Majid Azimi, Xiangtian Yuan, Simon Schopferer, Veronika Gstaiger, Stefan Auer, Anne Schneibel, Marc Wieland, Thomas Kraft
  • for: Improves the response efficiency of emergency services and relief organizations by rapidly and accurately collecting information about disaster-affected areas.
  • methods: Combines drone-based remote sensing data with deep learning for automated, large-scale situation assessment, and integrates onboard image processing for autonomous drone-based aid delivery.
  • results: Demonstrates the feasibility of rapid, large-scale image analysis in the field and shows that onboard image processing can increase the safety of drone-based aid deliveries.
    Abstract In order to respond effectively in the aftermath of a disaster, emergency services and relief organizations rely on timely and accurate information about the affected areas. Remote sensing has the potential to significantly reduce the time and effort required to collect such information by enabling a rapid survey of large areas. To achieve this, the main challenge is the automatic extraction of relevant information from remotely sensed data. In this work, we show how the combination of drone-based data with deep learning methods enables automated and large-scale situation assessment. In addition, we demonstrate the integration of onboard image processing techniques for the deployment of autonomous drone-based aid delivery. The results show the feasibility of a rapid and large-scale image analysis in the field, and that onboard image processing can increase the safety of drone-based aid deliveries.

Competitions in AI – Robustly Ranking Solvers Using Statistical Resampling

  • paper_url: http://arxiv.org/abs/2308.05062
  • repo_url: None
  • paper_authors: Chris Fawcett, Mauro Vallati, Holger H. Hoos, Alfonso E. Gerevini
  • for: This work investigates whether competition results can be expected to generalize to sets of problem instances different from those used in a particular competition, and addresses the question with statistical resampling techniques.
  • methods: It analyzes competition results by resampling performance data, producing confidence intervals for competition scores and a statistically robust solver-ranking method with bounded error.
  • results: Applied to recent SAT, AI planning, and computer vision competitions, the analysis reveals frequent statistical ties in solver performance as well as some rank inversions relative to the official results.
    Abstract Solver competitions play a prominent role in assessing and advancing the state of the art for solving many problems in AI and beyond. Notably, in many areas of AI, competitions have had substantial impact in guiding research and applications for many years, and for a solver to be ranked highly in a competition carries considerable weight. But to which extent can we expect competition results to generalise to sets of problem instances different from those used in a particular competition? This is the question we investigate here, using statistical resampling techniques. We show that the rankings resulting from the standard interpretation of competition results can be very sensitive to even minor changes in the benchmark instance set used as the basis for assessment and can therefore not be expected to carry over to other samples from the same underlying instance distribution. To address this problem, we introduce a novel approach to statistically meaningful analysis of competition results based on resampling performance data. Our approach produces confidence intervals of competition scores as well as statistically robust solver rankings with bounded error. Applied to recent SAT, AI planning and computer vision competitions, our analysis reveals frequent statistical ties in solver performance as well as some inversions of ranks compared to the official results based on simple scoring.
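The resampling idea can be made concrete with a small bootstrap over per-instance results. The sketch below, using synthetic data, shows how confidence intervals on competition scores and rank stability arise from resampling the benchmark instances; the paper's exact scoring and ranking procedure may differ.

```python
import numpy as np

def bootstrap_rankings(scores, n_boot=2000, seed=0):
    """scores: (n_solvers, n_instances) per-instance performance, higher
    is better. Resample the benchmark instances with replacement and
    recompute each solver's mean score on every bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_solvers, n_inst = scores.shape
    boot_means = np.empty((n_boot, n_solvers))
    for b in range(n_boot):
        idx = rng.integers(0, n_inst, size=n_inst)  # resampled instance set
        boot_means[b] = scores[:, idx].mean(axis=1)
    ci = np.percentile(boot_means, [2.5, 97.5], axis=0)  # 95% CIs on scores
    # How often each solver tops the ranking across resamples:
    p_first = np.bincount(boot_means.argmax(axis=1), minlength=n_solvers) / n_boot
    return ci, p_first

# Solvers 0 and 1 are close; expect overlapping CIs (a statistical tie).
rng = np.random.default_rng(1)
scores = rng.normal([0.80, 0.79, 0.60], 0.15, size=(500, 3)).T
ci, p_first = bootstrap_rankings(scores)
print("95% CIs per solver:\n", ci.T)
print("P(ranked first):", p_first)
```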

Separate Anything You Describe

  • paper_url: http://arxiv.org/abs/2308.05037
  • repo_url: https://github.com/audio-agi/audiosep
  • paper_authors: Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
  • for: AudioSep is a foundation model for open-domain audio source separation with natural language queries, aiming to separate a target sound from an audio mixture given a natural language query.
  • methods: AudioSep is trained on large-scale multimodal datasets and evaluated on numerous tasks, including audio event separation, musical instrument separation, and speech enhancement.
  • results: AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models.
    Abstract Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model at: https://github.com/Audio-AGI/AudioSep.
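Conceptually, a language-queried separator conditions a mask-based separation network on a text embedding. The toy PyTorch sketch below shows that interface; module names and shapes are hypothetical, and this is not the API of the released AudioSep code (a pretrained text encoder such as CLAP would supply the query embedding).

```python
import torch
import torch.nn as nn

class LanguageQueriedSeparator(nn.Module):
    """Toy sketch: a text-query embedding conditions a network that
    predicts a mask over the mixture's magnitude spectrogram."""

    def __init__(self, n_freq=513, d_text=256):
        super().__init__()
        self.text_proj = nn.Linear(d_text, n_freq)
        self.mask_net = nn.Sequential(
            nn.Linear(2 * n_freq, n_freq), nn.ReLU(),
            nn.Linear(n_freq, n_freq), nn.Sigmoid(),
        )

    def forward(self, mixture_spec, text_emb):
        # mixture_spec: (batch, time, n_freq); text_emb: (batch, d_text)
        cond = self.text_proj(text_emb).unsqueeze(1).expand_as(mixture_spec)
        mask = self.mask_net(torch.cat([mixture_spec, cond], dim=-1))
        return mask * mixture_spec  # estimated target-source spectrogram

# A text encoder would embed the query, e.g. "a dog barking";
# a random vector stands in here.
model = LanguageQueriedSeparator()
mixture = torch.rand(1, 100, 513)   # (batch, time frames, freq bins)
query = torch.randn(1, 256)
print(model(mixture, query).shape)  # torch.Size([1, 100, 513])
```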

Expert load matters: operating networks at high accuracy and low manual effort

  • paper_url: http://arxiv.org/abs/2308.05035
  • repo_url: None
  • paper_authors: Sara Sangalli, Ertunc Erdil, Ender Konukoglu
  • for: This paper proposes a confidence-based human-AI collaboration scheme in which decisions on low-confidence samples are delegated to human experts, to avoid mistakes in critical applications.
  • methods: Using deep neural networks, it introduces a new complementary loss function that helps the model better distinguish samples it classifies correctly from those it misclassifies.
  • results: Experiments show that the proposed loss simultaneously improves classification accuracy and reduces the number of decisions delegated to human experts, while also improving the detection of out-of-distribution samples.
    Abstract In human-AI collaboration systems for critical applications, to ensure minimal error, users should set an operating point based on model confidence that determines when a decision is delegated to human experts. Samples for which model confidence falls below the operating point are manually analysed by experts to avoid mistakes. Such systems become truly useful only if they account for two aspects: models should be confident only on samples for which they are accurate, and the number of samples delegated to experts should be minimized. The latter aspect is especially crucial for applications where available expert time is limited and expensive, such as healthcare. The trade-off between model accuracy and the number of samples delegated to experts can be represented by a curve similar to an ROC curve, which we refer to as the confidence operating characteristic (COC) curve. In this paper, we argue that deep neural networks should be trained by taking into account both accuracy and expert load and, to that end, propose a new complementary loss function for classification that maximizes the area under this COC curve. This simultaneously promotes higher network accuracy and fewer samples delegated to humans. We perform experiments on multiple computer vision and medical image datasets for classification. Our results demonstrate that the proposed loss improves classification accuracy, delegates fewer decisions to experts, achieves better out-of-distribution sample detection, and attains calibration performance on par with existing loss functions.
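One plausible way to trace the COC curve described above: sweep a confidence threshold, and at each operating point pair the fraction of samples delegated to experts with the model's accuracy on the samples it keeps. A minimal sketch under that reading of the abstract (the paper's exact construction may differ):

```python
import numpy as np

def coc_curve(confidence, correct):
    """For each number k of least-confident samples delegated to an
    expert, pair the fraction delegated (k/n) with the accuracy on the
    n-k retained samples; also return the area under this curve."""
    order = np.argsort(confidence)                    # least confident first
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    k = np.arange(n)
    deleg_frac = k / n
    suffix_correct = np.cumsum(correct[::-1])[::-1]   # sum of correct[k:]
    retained_acc = suffix_correct / (n - k)
    # Trapezoidal area under the COC curve:
    auc = float(np.sum(np.diff(deleg_frac)
                       * (retained_acc[1:] + retained_acc[:-1]) / 2))
    return deleg_frac, retained_acc, auc

# For a well-behaved model, accuracy on retained samples should rise
# as more low-confidence samples are handed to the expert:
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 1000)
correct = rng.uniform(size=1000) < conf   # correctness tracks confidence
print("COC-AUC:", round(coc_curve(conf, correct)[2], 3))
```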