cs.AI - 2023-11-03

Post Turing: Mapping the landscape of LLM Evaluation

  • paper_url: http://arxiv.org/abs/2311.02049
  • repo_url: None
  • paper_authors: Alexey Tikhonov, Ivan P. Yamshchikov
  • for: This paper traces the development of evaluation methods for large language models (LLMs), from Alan Turing's foundational questions to modern AI research.
  • methods: The paper divides the evolution of LLMs into distinct periods, each characterized by its own benchmarks and evaluation criteria. Traditional evaluation proxies, such as the Turing test, have become less reliable as LLMs increasingly mimic human behavior.
  • results: The paper emphasizes the pressing need for a unified evaluation system, given the broader societal implications of LLM use. Through an analysis of common evaluation methodologies, it underscores the importance of standardization and objective criteria to ensure the reliability, fairness, and societal benefit of LLMs.
    Abstract In the rapidly evolving landscape of Large Language Models (LLMs), introduction of well-defined and standardized evaluation methodologies remains a crucial challenge. This paper traces the historical trajectory of LLM evaluations, from the foundational questions posed by Alan Turing to the modern era of AI research. We categorize the evolution of LLMs into distinct periods, each characterized by its unique benchmarks and evaluation criteria. As LLMs increasingly mimic human-like behaviors, traditional evaluation proxies, such as the Turing test, have become less reliable. We emphasize the pressing need for a unified evaluation system, given the broader societal implications of these models. Through an analysis of common evaluation methodologies, we advocate for a qualitative shift in assessment approaches, underscoring the importance of standardization and objective criteria. This work serves as a call for the AI community to collaboratively address the challenges of LLM evaluation, ensuring their reliability, fairness, and societal benefit.

Quantum circuit synthesis with diffusion models

  • paper_url: http://arxiv.org/abs/2311.02041
  • repo_url: https://github.com/florianfuerrutter/genqc
  • paper_authors: Florian Fürrutter, Gorka Muñoz-Gil, Hans J. Briegel
  • for: This work uses generative machine learning models, specifically denoising diffusion models (DMs), to translate quantum operations into viable physical realizations.
  • methods: The study uses text-conditioning to steer the DM to produce the desired quantum operations within gate-based quantum circuits. This approach sidesteps, during training, the exponential overhead of classically simulating quantum dynamics, improving the model's efficiency.
  • results: The model performs well on two tasks: entanglement generation and unitary compilation. It can generate new circuits and supports extensions such as masking and editing, for instance to align circuit generation with the constraints of a target quantum device. Given their flexibility and generalization abilities, the authors envision DMs playing a pivotal role in quantum circuit synthesis, enhancing both practical applications and the understanding of theoretical quantum computation.
    Abstract Quantum computing has recently emerged as a transformative technology. Yet, its promised advantages rely on efficiently translating quantum operations into viable physical realizations. In this work, we use generative machine learning models, specifically denoising diffusion models (DMs), to facilitate this transformation. Leveraging text-conditioning, we steer the model to produce desired quantum operations within gate-based quantum circuits. Notably, DMs allow to sidestep during training the exponential overhead inherent in the classical simulation of quantum dynamics -- a consistent bottleneck in preceding ML techniques. We demonstrate the model's capabilities across two tasks: entanglement generation and unitary compilation. The model excels at generating new circuits and supports typical DM extensions such as masking and editing to, for instance, align the circuit generation to the constraints of the targeted quantum device. Given their flexibility and generalization abilities, we envision DMs as pivotal in quantum circuit synthesis, enhancing both practical applications but also insights into theoretical quantum computation.

VQPy: An Object-Oriented Approach to Modern Video Analytics

  • paper_url: http://arxiv.org/abs/2311.01623
  • repo_url: https://github.com/vqpy/vqpy
  • paper_authors: Shan Yu, Zhenting Zhu, Yu Chen, Hanchen Xu, Pengzhan Zhao, Yang Wang, Arthi Padmanabhan, Hugo Latapie, Harry Xu
  • for: Video queries, through which users find objects of particular interest in videos, are at the forefront of developing video analytics systems and services.
  • methods: Building on the similarity between video objects (such as humans, animals, and vehicles) and the objects modeled by traditional object-oriented languages, the paper proposes an object-oriented approach to video analytics. The approach, named VQPy, comprises a Python-based frontend that makes it easy for users to express video objects and their interactions, and an extensible backend that can automatically construct and optimize pipelines based on video objects.
  • results: The authors have implemented and open-sourced VQPy, which has been productized in Cisco as part of its DeepVision framework.
    Abstract Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., human, animals, cars, etc.), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend (a Python variant with constructs that make it easy for users to express video objects and their interactions) as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized in Cisco as part of its DeepVision framework.
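To give a feel for the object-oriented query style described above, here is a minimal sketch in plain Python. The VideoObject class and the loitering query are invented for illustration and are not the real VQPy API; see the repository linked above for the actual interface.

```python
# A minimal, hypothetical sketch of an object-oriented video query in plain
# Python. The VideoObject class and the query below are invented for
# illustration and are NOT the real VQPy API (see the repo for that).
from dataclasses import dataclass

@dataclass
class VideoObject:
    label: str        # detector class, e.g. "person" or "car"
    track_id: int     # identity maintained across frames by a tracker
    frame_idx: int    # frame in which this detection occurred

def loitering_people(detections, min_frames=150):
    """Return track ids of persons visible for at least `min_frames` frames."""
    first_seen, flagged = {}, set()
    for obj in detections:
        if obj.label != "person":
            continue
        start = first_seen.setdefault(obj.track_id, obj.frame_idx)
        if obj.frame_idx - start >= min_frames:
            flagged.add(obj.track_id)
    return flagged
```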

APRICOT: Acuity Prediction in Intensive Care Unit (ICU): Predicting Stability, Transitions, and Life-Sustaining Therapies

  • paper_url: http://arxiv.org/abs/2311.02026
  • repo_url: None
  • paper_authors: Miguel Contreras, Brandon Silva, Benjamin Shickel, Tezcan Ozrazgat Baslanti, Yuanfang Ren, Ziyuan Guan, Sabyasachi Bandyopadhyay, Kia Khezeli, Azra Bihorac, Parisa Rashidi
  • for: The goal of this study is to develop a Transformer-based neural network that predicts the acuity state of ICU patients in real time, enabling continuous monitoring and more timely clinical interventions.
  • methods: The study develops and validates the model on three large datasets: University of Florida Health (UFH), the eICU Collaborative Research Database (eICU), and the Medical Information Mart for Intensive Care (MIMIC)-IV. The Transformer-based model is validated externally, temporally, and prospectively.
  • results: The model performs comparably to state-of-the-art mortality and acuity prediction methods and can also predict the need for life-sustaining therapies such as mechanical ventilation and vasopressors. These results suggest that APRICOT can help clinicians monitor patient acuity in real time and provide useful information for timely interventions.
    Abstract The acuity state of patients in the intensive care unit (ICU) can quickly change from stable to unstable, sometimes leading to life-threatening conditions. Early detection of deteriorating conditions can result in providing more timely interventions and improved survival rates. Current approaches rely on manual daily assessments. Some data-driven approaches have been developed, that use mortality as a proxy of acuity in the ICU. However, these methods do not integrate acuity states to determine the stability of a patient or the need for life-sustaining therapies. In this study, we propose APRICOT (Acuity Prediction in Intensive Care Unit), a Transformer-based neural network to predict acuity state in real-time in ICU patients. We develop and extensively validate externally, temporally, and prospectively the APRICOT model on three large datasets: University of Florida Health (UFH), eICU Collaborative Research Database (eICU), and Medical Information Mart for Intensive Care (MIMIC)-IV. The performance of APRICOT shows comparable results to state-of-the-art mortality prediction models (external AUROC 0.93-0.93, temporal AUROC 0.96-0.98, and prospective AUROC 0.98) as well as acuity prediction models (external AUROC 0.80-0.81, temporal AUROC 0.77-0.78, and prospective AUROC 0.87). Furthermore, APRICOT can make predictions for the need for life-sustaining therapies, showing comparable results to state-of-the-art ventilation prediction models (external AUROC 0.80-0.81, temporal AUROC 0.87-0.88, and prospective AUROC 0.85), and vasopressor prediction models (external AUROC 0.82-0.83, temporal AUROC 0.73-0.75, prospective AUROC 0.87). This tool allows for real-time acuity monitoring of a patient and can provide helpful information to clinicians to make timely interventions. Furthermore, the model can suggest life-sustaining therapies that the patient might need in the next hours in the ICU.

Active Reasoning in an Open-World Environment

  • paper_url: http://arxiv.org/abs/2311.02018
  • repo_url: None
  • paper_authors: Manjie Xu, Guangyuan Jiang, Wei Liang, Chi Zhang, Yixin Zhu
  • for: The goal of this paper is to introduce an interactive open-world environment for assessing active reasoning.
  • methods: The paper presents $Conan$, an interactive open-world environment that encourages AI agents to actively explore and perform multi-round abductive inference.
  • results: Analysis on $Conan$ reveals the shortcomings of many state-of-the-art models in active exploration and in explaining complex scenarios. The paper also explores Abduction from Deduction, where agents use Bayesian rules to recast abduction as a deductive process within $Conan$.
    Abstract Recent advances in vision-language learning have achieved notable success on complete-information question-answering datasets through the integration of extensive world knowledge. Yet, most models operate passively, responding to questions based on pre-stored knowledge. In stark contrast, humans possess the ability to actively explore, accumulate, and reason using both newfound and existing information to tackle incomplete-information questions. In response to this gap, we introduce $Conan$, an interactive open-world environment devised for the assessment of active reasoning. $Conan$ facilitates active exploration and promotes multi-round abductive inference, reminiscent of rich, open-world settings like Minecraft. Diverging from previous works that lean primarily on single-round deduction via instruction following, $Conan$ compels agents to actively interact with their surroundings, amalgamating new evidence with prior knowledge to elucidate events from incomplete observations. Our analysis on $Conan$ underscores the shortcomings of contemporary state-of-the-art models in active exploration and understanding complex scenarios. Additionally, we explore Abduction from Deduction, where agents harness Bayesian rules to recast the challenge of abduction as a deductive process. Through $Conan$, we aim to galvanize advancements in active reasoning and set the stage for the next generation of artificial intelligence agents adept at dynamically engaging in environments.

DeliverAI: Reinforcement Learning Based Distributed Path-Sharing Network for Food Deliveries

  • paper_url: http://arxiv.org/abs/2311.02017
  • repo_url: None
  • paper_authors: Ashman Mehra, Snehanshu Saha, Vaskar Raychoudhury, Archana Mathur
  • for: This paper proposes a reinforcement learning-based food delivery scheme to reduce delivery costs and improve delivery efficiency relative to the current model.
  • methods: The paper formulates the problem as a multi-objective optimization in which both consumer satisfaction and delivery cost are optimized, and uses a reinforcement learning-enabled agent system for real-time decision-making.
  • results: Simulation results using real data from the city of Chicago show that DeliverAI can reduce the delivery fleet size by 12%, reduce the distance traveled by 13%, and achieve 50% higher fleet utilization compared to the baselines.
    Abstract Delivery of items from the producer to the consumer has experienced significant growth over the past decade and has been greatly fueled by the recent pandemic. Amazon Fresh, Shopify, UberEats, InstaCart, and DoorDash are rapidly growing and are sharing the same business model of consumer items or food delivery. Existing food delivery methods are sub-optimal because each delivery is individually optimized to go directly from the producer to the consumer via the shortest time path. We observe a significant scope for reducing the costs associated with completing deliveries under the current model. We model our food delivery problem as a multi-objective optimization, where consumer satisfaction and delivery costs, both, need to be optimized. Taking inspiration from the success of ride-sharing in the taxi industry, we propose DeliverAI - a reinforcement learning-based path-sharing algorithm. Unlike previous attempts for path-sharing, DeliverAI can provide real-time, time-efficient decision-making using a Reinforcement learning-enabled agent system. Our novel agent interaction scheme leverages path-sharing among deliveries to reduce the total distance traveled while keeping the delivery completion time under check. We generate and test our methodology vigorously on a simulation setup using real data from the city of Chicago. Our results show that DeliverAI can reduce the delivery fleet size by 12\%, the distance traveled by 13%, and achieve 50% higher fleet utilization compared to the baselines.
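The abstract does not spell out the learning rule, so the sketch below illustrates one plausible reading: a tabular Q-learning agent choosing between delivering directly and joining a shared path. The state/action encoding and reward are expository assumptions, not the paper's formulation.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch for a share-vs-direct delivery decision,
# in the spirit of DeliverAI. The state/action encoding and reward shaping
# here are illustrative assumptions, not the paper's actual formulation.
ACTIONS = ("direct", "share")          # deliver alone vs. join a shared path
Q = defaultdict(float)                 # Q[(state, action)] -> value
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

def choose_action(state):
    if random.random() < EPS:          # epsilon-greedy exploration
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """One Q-learning step; reward trades off distance saved vs. delay added."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```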

Score Models for Offline Goal-Conditioned Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.02013
  • repo_url: None
  • paper_authors: Harshit Sikchi, Rohan Chitnis, Ahmed Touati, Alborz Geramifard, Amy Zhang, Scott Niekum
  • for: The goal of this work is to learn to achieve multiple goals purely from offline datasets, so that generalist agents can acquire diverse and reusable skills without hand-engineered reward functions.
  • methods: The method combines the occupancy matching perspective on GCRL with a convex dual formulation, yielding a mixture-distribution matching objective that better leverages suboptimal offline data.
  • results: Experiments show that SMORe outperforms state-of-the-art baselines on robot manipulation and locomotion tasks with high-dimensional observations, achieving better performance across environments.
    Abstract Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve multiple goals in an environment purely from offline datasets using sparse reward functions. Offline GCRL is pivotal for developing generalist agents capable of leveraging pre-existing datasets to learn diverse and reusable skills without hand-engineering reward functions. However, contemporary approaches to GCRL based on supervised learning and contrastive learning are often suboptimal in the offline setting. An alternative perspective on GCRL optimizes for occupancy matching, but necessitates learning a discriminator, which subsequently serves as a pseudo-reward for downstream RL. Inaccuracies in the learned discriminator can cascade, negatively influencing the resulting policy. We present a novel approach to GCRL under a new lens of mixture-distribution matching, leading to our discriminator-free method: SMORe. The key insight is combining the occupancy matching perspective of GCRL with a convex dual formulation to derive a learning objective that can better leverage suboptimal offline data. SMORe learns scores or unnormalized densities representing the importance of taking an action at a state for reaching a particular goal. SMORe is principled and our extensive experiments on the fully offline GCRL benchmark composed of robot manipulation and locomotion tasks, including high-dimensional observations, show that SMORe can outperform state-of-the-art baselines by a significant margin.

Obtaining Explainable Classification Models using Distributionally Robust Optimization

  • paper_url: http://arxiv.org/abs/2311.01994
  • repo_url: None
  • paper_authors: Sanjeeb Dash, Soumyadip Ghosh, Joao Goncalves, Mark S. Squillante
  • for: The goal of this paper is to propose a way of building classifiers that simultaneously ensures good generalization quality and low computational cost.
  • methods: The paper uses sets of feature value rules constructed using distributionally robust optimization, and uses column generation to efficiently search the space of rule sets.
  • results: Experiments show that the proposed method outperforms competing methods (such as random forests or boosting) on a large set of publicly available binary classification problem instances, with respect to one or more of the following metrics: generalization quality, computational cost, and explainability.
    Abstract Model explainability is crucial for human users to be able to interpret how a proposed classifier assigns labels to data based on its feature values. We study generalized linear models constructed using sets of feature value rules, which can capture nonlinear dependencies and interactions. An inherent trade-off exists between rule set sparsity and its prediction accuracy. It is computationally expensive to find the right choice of sparsity -- e.g., via cross-validation -- with existing methods. We propose a new formulation to learn an ensemble of rule sets that simultaneously addresses these competing factors. Good generalization is ensured while keeping computational costs low by utilizing distributionally robust optimization. The formulation utilizes column generation to efficiently search the space of rule sets and constructs a sparse ensemble of rule sets, in contrast with techniques like random forests or boosting and their variants. We present theoretical results that motivate and justify the use of our distributionally robust formulation. Extensive numerical experiments establish that our method improves over competing methods -- on a large set of publicly available binary classification problem instances -- with respect to one or more of the following metrics: generalization quality, computational cost, and explainability.
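For readers unfamiliar with the setup, a generic distributionally robust learning objective of the kind invoked above can be written as

$$ \min_{h \in \mathcal{H}} \; \max_{Q \in \mathcal{U}(\hat{P})} \; \mathbb{E}_{(x,y)\sim Q}\big[\ell(h(x), y)\big], $$

where $\hat{P}$ is the empirical distribution, $\mathcal{U}(\hat{P})$ is an ambiguity set of distributions near it (e.g., an $f$-divergence or Wasserstein ball), and $h$ ranges over ensembles of rule sets. This is the standard textbook form, not necessarily the authors' exact formulation.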

RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches

  • paper_url: http://arxiv.org/abs/2311.01977
  • repo_url: None
  • paper_authors: Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, Priya Sundaresan, Peng Xu, Hao Su, Karol Hausman, Chelsea Finn, Quan Vuong, Ted Xiao
  • for: This work aims to improve the generalization of robot learning systems so that they can better adapt to new tasks and new situations.
  • methods: The work proposes a policy conditioning method based on rough trajectory sketches (RT-Trajectory), which expresses tasks through coarse trajectory sketch images and allows the policy to generalize to new tasks.
  • results: Experimental results show that RT-Trajectory generalizes well across a variety of real-world robotic tasks, and exhibits a wider range of task capabilities than language- and goal-conditioned policies given the same training data.
    Abstract Generalization remains one of the most important desiderata for robust robot learning systems. While recently proposed approaches show promise in generalization to novel objects, semantic concepts, or visual distribution shifts, generalization to new tasks remains challenging. For example, a language-conditioned policy trained on pick-and-place tasks will not be able to generalize to a folding task, even if the arm trajectory of folding is similar to pick-and-place. Our key insight is that this kind of generalization becomes feasible if we represent the task through rough trajectory sketches. We propose a policy conditioning method using such rough trajectory sketches, which we call RT-Trajectory, that is practical, easy to specify, and allows the policy to effectively perform new tasks that would otherwise be challenging to perform. We find that trajectory sketches strike a balance between being detailed enough to express low-level motion-centric guidance while being coarse enough to allow the learned policy to interpret the trajectory sketch in the context of situational visual observations. In addition, we show how trajectory sketches can provide a useful interface to communicate with robotic policies: they can be specified through simple human inputs like drawings or videos, or through automated methods such as modern image-generating or waypoint-generating methods. We evaluate RT-Trajectory at scale on a variety of real-world robotic tasks, and find that RT-Trajectory is able to perform a wider range of tasks compared to language-conditioned and goal-conditioned policies, when provided the same training data.

The language of prompting: What linguistic properties make a prompt successful?

  • paper_url: http://arxiv.org/abs/2311.01967
  • repo_url: None
  • paper_authors: Alina Leidinger, Robert van Rooij, Ekaterina Shutova
  • for: This study investigates how LLMs of different sizes, pre-trained and instruction-tuned, perform on prompts that vary in grammatical structure and lexical choice.
  • methods: The study varies prompts along grammatical properties (mood, tense, aspect, modality) and lexico-semantic properties (via synonyms) while holding meaning fixed, and correlates these variations with task performance.
  • results: The study finds that performance is not explained by lower prompt perplexity, word frequency, ambiguity, or prompt length, and that prompts transfer poorly between datasets and models. This suggests that current evaluation practice may be insufficient, and the authors call for a more robust and comprehensive evaluation standard.
    Abstract The latest generation of LLMs can be prompted to achieve impressive zero-shot or few-shot performance in many NLP tasks. However, since performance is highly sensitive to the choice of prompts, considerable effort has been devoted to crowd-sourcing prompts or designing methods for prompt optimisation. Yet, we still lack a systematic understanding of how linguistic properties of prompts correlate with task performance. In this work, we investigate how LLMs of different sizes, pre-trained and instruction-tuned, perform on prompts that are semantically equivalent, but vary in linguistic structure. We investigate both grammatical properties such as mood, tense, aspect and modality, as well as lexico-semantic variation through the use of synonyms. Our findings contradict the common assumption that LLMs achieve optimal performance on lower perplexity prompts that reflect language use in pretraining or instruction-tuning data. Prompts transfer poorly between datasets or models, and performance cannot generally be explained by perplexity, word frequency, ambiguity or prompt length. Based on our results, we put forward a proposal for a more robust and comprehensive evaluation standard for prompting research.

Don’t Make Your LLM an Evaluation Benchmark Cheater

  • paper_url: http://arxiv.org/abs/2311.01964
  • repo_url: None
  • paper_authors: Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han
  • for: Assessing the performance of large language models (LLMs) to advance artificial intelligence.
  • methods: Evaluation benchmarks are used to measure the ability level of LLMs, but inappropriate use can yield misleading evaluation results; the paper analyzes these risks, focusing on benchmark leakage.
  • results: Extensive experiments show that training on data related to evaluation sets can dramatically inflate evaluation results, leading to misestimates of model performance. The paper proposes several guidelines to improve LLM training and evaluation.
    Abstract Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs in different aspects. Despite that a number of high-quality benchmarks have been released, the concerns about the appropriate use of these benchmarks and the fair comparison of different models are increasingly growing. Considering these concerns, in this paper, we discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results. Specifically, we focus on a special issue that would lead to inappropriate evaluation, i.e., benchmark leakage, referring to cases where data related to evaluation sets is occasionally used for model training. This phenomenon now becomes more common since pre-training data is often prepared ahead of model test. We conduct extensive experiments to study the effect of benchmark leakage, and find that it can dramatically boost the evaluation results, which would finally lead to an unreliable assessment of model performance. To improve the use of existing evaluation benchmarks, we finally present several guidelines for both LLM developers and benchmark maintainers. We hope this work can draw attention to appropriate training and evaluation of LLMs.
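A simple way to screen for the benchmark leakage discussed above is n-gram overlap between the training corpus and a benchmark's test set. The sketch below is a common decontamination heuristic offered for illustration, not the detection method used in the paper.

```python
# Simple n-gram overlap screen for benchmark leakage. This is a common
# decontamination heuristic, not the detection method used in the paper.
def ngrams(text, n=13):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_leaked(train_docs, test_examples, n=13):
    """Return test examples sharing at least one n-gram with training data."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    return [ex for ex in test_examples if ngrams(ex, n) & train_grams]
```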

Assessing Fidelity in XAI post-hoc techniques: A Comparative Study with Ground Truth Explanations Datasets

  • paper_url: http://arxiv.org/abs/2311.01961
  • repo_url: None
  • paper_authors: M. Miró-Nicolau, A. Jaume-i-Capó, G. Moyà-Alcover
  • for: The aim of this study is to assess the fidelity of current state-of-the-art XAI methods, eliminating low-fidelity methods from further research and thereby promoting the development of better XAI techniques.
  • methods: The study compares three common families of XAI methods, namely backpropagation of output information to the input, sensitivity analysis, and Class Activation Maps (CAM), on three novel image datasets with reliable ground truth for explanations.
  • results: XAI methods based on backpropagating output information achieve higher accuracy and reliability than sensitivity analysis and CAM, although the backpropagation methods tend to generate noisier saliency maps. These findings help eliminate erroneous explanations and foster more robust and reliable XAI.
    Abstract The evaluation of the fidelity of eXplainable Artificial Intelligence (XAI) methods to their underlying models is a challenging task, primarily due to the absence of a ground truth for explanations. However, assessing fidelity is a necessary step for ensuring a correct XAI methodology. In this study, we conduct a fair and objective comparison of the current state-of-the-art XAI methods by introducing three novel image datasets with reliable ground truth for explanations. The primary objective of this comparison is to identify methods with low fidelity and eliminate them from further research, thereby promoting the development of more trustworthy and effective XAI techniques. Our results demonstrate that XAI methods based on the backpropagation of output information to input yield higher accuracy and reliability compared to methods relying on sensitivity analysis or Class Activation Maps (CAM). However, the backpropagation method tends to generate more noisy saliency maps. These findings have significant implications for the advancement of XAI methods, enabling the elimination of erroneous explanations and fostering the development of more robust and reliable XAI.
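With ground-truth explanation masks of the kind these datasets provide, fidelity can be scored by comparing a saliency map against the mask. The IoU-based score below is one plausible such measure, shown for illustration rather than as the paper's exact metric.

```python
import numpy as np

# Score a saliency map against a ground-truth explanation mask via IoU.
# One plausible fidelity measure, not necessarily the paper's metric.
def saliency_iou(saliency, gt_mask, threshold=0.5):
    """saliency: float array in [0, 1]; gt_mask: boolean array, same shape."""
    pred = saliency >= threshold
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return inter / union if union > 0 else 1.0  # both empty -> perfect match
```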

Architecture of Smart Certificates for Web3 Applications Against Cyberthreats in Financial Industry

  • paper_url: http://arxiv.org/abs/2311.01956
  • repo_url: None
  • paper_authors: Stefan Kambiz Behfar, Jon Crowcroft
  • for: This study addresses the security challenges of today's internet, especially those raised by emerging technologies such as blockchain and decentralized storage. It also examines the shape of the future internet and proposes a novel design for 'smart certificates', digital certificates that can be programmatically enforced, to help enterprises better protect themselves from cyberattacks and secure their data and systems.
  • methods: The study builds on Web3 applications and security solutions such as Certik, Forta, Slither, and Securify to improve the resilience of enterprises' digital infrastructure. It proposes a multi-layer approach to vulnerability analysis and attack correlation (covering the wallet/client, application, and smart contract layers), along with mechanisms for checking the validity and trustworthiness of certificates.
  • results: The proposed smart-certificate design helps enterprises better protect themselves from cyberattacks and improves the security of their data and systems. In addition, Certificate Transparency is used to enhance the security, trustworthiness, and decentralized management of the certificates and to detect misuse, compromise, and malfeasance.
    Abstract This study addresses the security challenges associated with the current internet transformations, specifically focusing on emerging technologies such as blockchain and decentralized storage. It also investigates the role of Web3 applications in shaping the future of the internet. The primary objective is to propose a novel design for 'smart certificates,' which are digital certificates that can be programmatically enforced. Utilizing such certificates, an enterprise can better protect itself from cyberattacks and ensure the security of its data and systems. Web3 recent security solutions by companies and projects like Certik, Forta, Slither, and Securify are the equivalent of code scanning tool that were originally developed for Web1 and Web2 applications, and definitely not like certificates to help enterprises feel safe against cyberthreats. We aim to improve the resilience of enterprises' digital infrastructure by building on top of Web3 application and put methodologies in place for vulnerability analysis and attack correlation, focusing on architecture of different layers, Wallet/Client, Application and Smart Contract, where specific components are provided to identify and predict threats and risks. Furthermore, Certificate Transparency is used for enhancing the security, trustworthiness and decentralized management of the certificates, and detecting misuses, compromises, and malfeasances.

A Quantitative Autonomy Quantification Framework for Fully Autonomous Robotic Systems

  • paper_url: http://arxiv.org/abs/2311.01939
  • repo_url: None
  • paper_authors: Nasser Gyagenda, Hubert Roth
  • for: This study proposes a task-requirements-based autonomy assessment framework to support the deployment of robotic systems whose autonomous functions operate under limited human oversight.
  • methods: The framework derives three autonomy metrics, namely requisite capability, reliability, and responsiveness, from robot task characteristics, and maps them onto task requirements to quantify autonomy as a two-part measure: level of autonomy and degree of autonomy.
  • results: The framework provides a means of quantifying a robotic system's autonomy and, when continuously monitored, of checking system integrity, offering a regulatory interface and common language for autonomous systems developers and users. Two case studies, an autonomous vehicle in an on-road dynamic driving task and an analysis of the DARPA subT challenge rules, demonstrate the framework's feasibility and usefulness.
    Abstract Although autonomous functioning facilitates deployment of robotic systems in domains that admit limited human oversight on our planet and beyond, finding correspondence between task requirements and autonomous capability is still an open challenge. Consequently, a number of methods for quantifying autonomy have been proposed over the last three decades, but to our knowledge all these have no discernment of sub-mode features of variation of autonomy and some are based on metrics that violate Goodhart's law. This paper focuses on the full autonomous mode and proposes a task-requirements based autonomy assessment framework. The framework starts by establishing robot task characteristics from which three autonomy metrics, namely requisite capability, reliability and responsiveness, and functions for determining autonomy as a two-part measure, namely of level of autonomy and degree of autonomy are derived. These characteristics are founded on the realization that robots ultimately replace human skilled workers, to find a mapping between human job and robot task characteristics. The distinction between level and degree of autonomy stemmed from the acknowledgment that autonomy is not just a question of existence, but also one of performance of requisite capability. When continuously monitored, the proposed metrics provide a means of monitoring the integrity of a system. The framework has been demonstrated on two case studies, namely an autonomous vehicle in an on-road dynamic driving task and the DARPA subT challenge rules analysis. The framework provides not only a tool for quantifying autonomy, but also a regulatory interface and common language for autonomous systems developers and users.

Supermind Ideator: Exploring generative AI to support creative problem-solving

  • paper_url: http://arxiv.org/abs/2311.01937
  • repo_url: None
  • paper_authors: Steven R. Rick, Gianni Giacomelli, Haoran Wen, Robert J. Laubacher, Nancy Taubenslag, Jennifer L. Heyman, Max Sina Knicker, Younes Jeddi, Hendrik Maier, Stephen Dwyer, Pranav Ragupathy, Thomas W. Malone
  • for: This paper is written for people who want to use creative problem-solving techniques to generate innovative ideas for designing groups of people and/or computers.
  • methods: The paper uses a large language model (GPT 3.5) and adds prompting, fine-tuning, and a user interface specifically designed to help people use creative problem-solving techniques.
  • results: The paper describes early experiences with using this system and suggests ways it could be extended to support additional techniques for other specific problem-solving domains.
    Abstract Previous efforts to support creative problem-solving have included (a) techniques (such as brainstorming and design thinking) to stimulate creative ideas, and (b) software tools to record and share these ideas. Now, generative AI technologies can suggest new ideas that might never have occurred to the users, and users can then select from these ideas or use them to stimulate even more ideas. Here, we describe such a system, Supermind Ideator. The system uses a large language model (GPT 3.5) and adds prompting, fine tuning, and a user interface specifically designed to help people use creative problem-solving techniques. Some of these techniques can be applied to any problem; others are specifically intended to help generate innovative ideas about how to design groups of people and/or computers ("superminds"). We also describe our early experiences with using this system and suggest ways it could be extended to support additional techniques for other specific problem-solving domains.

GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling

  • paper_url: http://arxiv.org/abs/2311.01927
  • repo_url: None
  • paper_authors: Tobias Katsch
  • for: This paper aims to improve the efficiency and effectiveness of sequence models, particularly for auto-regressive language modeling.
  • methods: The authors develop a foundational sequence model called GateLoop, which generalizes linear recurrent models by employing data-controlled state transitions. The model comes with two efficient modes: an $O(l)$ recurrent mode and an $O(l \log_{2} l)$ parallel mode.
  • results: The authors show that GateLoop outperforms existing models for auto-regressive language modeling, and that the approach can be interpreted as providing data-controlled relative-positional information to Attention. The findings suggest that incorporating data-controlled complex cumulative products may be a crucial step towards more powerful sequence models.
    Abstract Linear Recurrence has proven to be a powerful tool for modeling long sequences efficiently. In this work, we show that existing models fail to take full advantage of its potential. Motivated by this finding, we develop GateLoop, a foundational sequence model that generalizes linear recurrent models such as S4, S5, LRU and RetNet, by employing data-controlled state transitions. Utilizing this theoretical advance, GateLoop empirically outperforms existing models for auto-regressive language modeling. Our method comes with a low-cost $O(l)$ recurrent mode and an efficient $O(l \log_{2} l)$ parallel mode making use of highly optimized associative scan implementations. Furthermore, we derive an $O(l^2)$ surrogate attention mode, revealing remarkable implications for Transformer and recently proposed architectures. Specifically, we prove that our approach can be interpreted as providing data-controlled relative-positional information to Attention. While many existing models solely rely on data-controlled cumulative sums for context aggregation, our findings suggest that incorporating data-controlled complex cumulative products may be a crucial step towards more powerful sequence models.
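The central mechanism is a linear recurrence whose state transition is computed from the data itself. A minimal NumPy sketch of the $O(l)$ recurrent mode is below; note that the paper's transitions are complex-valued, and the real-valued sigmoid gating here is a simplification invented for illustration, not GateLoop's exact parameterization.

```python
import numpy as np

# Minimal sketch of a data-controlled linear recurrence (the O(l) recurrent
# mode). The recurrence is h_t = a_t * h_{t-1} + v_t, where the transition
# a_t is computed from the input itself -- the "data-controlled" part.
# The specific gating below is illustrative, not GateLoop's exact design.
def gateloop_recurrent(x, Wa, Wv):
    """x: (seq_len, d_in); Wa, Wv: (d_in, d_state). Returns (seq_len, d_state)."""
    a = 1.0 / (1.0 + np.exp(-(x @ Wa)))   # input-dependent transition in (0, 1)
    v = x @ Wv                            # input-dependent value to accumulate
    h = np.zeros(Wv.shape[1])
    out = np.empty_like(v)
    for t in range(x.shape[0]):
        h = a[t] * h + v[t]               # data-controlled state transition
        out[t] = h
    return out
```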

Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review

  • paper_url: http://arxiv.org/abs/2311.01918
  • repo_url: https://github.com/mingze-yuan/awesome-llm-healthcare
  • paper_authors: Mingze Yuan, Peng Bao, Jiajia Yuan, Yunhao Shen, Zifan Chen, Yi Xie, Jie Zhao, Yang Chen, Li Zhang, Lin Shen, Bin Dong
  • for: This paper provides a comprehensive review of the applications and implications of large language models (LLMs) in medicine, with a focus on their potential to enhance various aspects of healthcare.
  • methods: The paper examines the fundamental applications of general-purpose and specialized LLMs, as well as the emerging development of LLM-powered autonomous agents for healthcare. It also explores the ability of multimodal LLMs to process diverse data types like medical imaging and EHRs to augment diagnostic accuracy.
  • results: The paper highlights the transformative potential of LLMs in modern medicine, but also acknowledges the need for continuous optimizations and ethical oversight before these models can be effectively integrated into clinical practice.
    Abstract With the rapid development of artificial intelligence, large language models (LLMs) have shown promising capabilities in mimicking human-level language comprehension and reasoning. This has sparked significant interest in applying LLMs to enhance various aspects of healthcare, ranging from medical education to clinical decision support. However, medicine involves multifaceted data modalities and nuanced reasoning skills, presenting challenges for integrating LLMs. This paper provides a comprehensive review on the applications and implications of LLMs in medicine. It begins by examining the fundamental applications of general-purpose and specialized LLMs, demonstrating their utilities in knowledge retrieval, research support, clinical workflow automation, and diagnostic assistance. Recognizing the inherent multimodality of medicine, the review then focuses on multimodal LLMs, investigating their ability to process diverse data types like medical imaging and EHRs to augment diagnostic accuracy. To address LLMs' limitations regarding personalization and complex clinical reasoning, the paper explores the emerging development of LLM-powered autonomous agents for healthcare. Furthermore, it summarizes the evaluation methodologies for assessing LLMs' reliability and safety in medical contexts. Overall, this review offers an extensive analysis on the transformative potential of LLMs in modern medicine. It also highlights the pivotal need for continuous optimizations and ethical oversight before these models can be effectively integrated into clinical practice. Visit https://github.com/mingze-yuan/Awesome-LLM-Healthcare for an accompanying GitHub repository containing latest papers.

Enhancing Functional Data Analysis with Sequential Neural Networks: Advantages and Comparative Study

  • paper_url: http://arxiv.org/abs/2311.01875
  • repo_url: None
  • paper_authors: J. Zhao, J. Li, M. Chen, S. Jadhav
  • for: This paper applies Sequential Neural Networks (SNNs) to problems in the field of Functional Data Analysis (FDA).
  • methods: The paper uses SNNs to analyze functional data and compares them with conventional FDA methods.
  • results: The study finds that SNNs can deliver better performance in functional data analysis and are easy to implement. In addition, SNNs can handle the high dimensionality of functional data and cope with real-world data analysis problems.
    Abstract Functional Data Analysis (FDA) is a statistical domain developed to handle functional data characterized by high dimensionality and complex data structures. Sequential Neural Networks (SNNs) are specialized neural networks capable of processing sequence data, a fundamental aspect of functional data. Despite their great flexibility in modeling functional data, SNNs have been inadequately employed in the FDA community. One notable advantage of SNNs is the ease of implementation, making them accessible to a broad audience beyond academia. Conversely, FDA-based methodologies present challenges, particularly for practitioners outside the field, due to their intricate complexity. In light of this, we propose utilizing SNNs in FDA applications and demonstrate their effectiveness through comparative analyses against popular FDA regression models based on numerical experiments and real-world data analysis. SNN architectures allow us to surpass the limitations of traditional FDA methods, offering scalability, flexibility, and improved analytical performance. Our findings highlight the potential of SNN-based methodologies as powerful tools for data applications involving functional data.
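To make "an SNN applied to functional data" concrete, the sketch below fits a small LSTM that maps a functional covariate, observed on a discrete grid, to a scalar response; the architecture and sizes are illustrative assumptions, not those used in the paper.

```python
import torch
import torch.nn as nn

# Illustrative scalar-on-function regression with a sequential network:
# each functional covariate is observed on a grid of 100 points, and an
# LSTM maps the sequence to a scalar response. Sizes are arbitrary choices.
class FunctionalLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, grid_len, 1)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1]).squeeze(-1)

model = FunctionalLSTM()
x = torch.randn(8, 100, 1)                # 8 curves on a 100-point grid
y_hat = model(x)                          # predicted scalar responses
```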

Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval

  • paper_url: http://arxiv.org/abs/2311.01870
  • repo_url: None
  • paper_authors: Jinrui Yang, Timothy Baldwin, Trevor Cohn
  • for: This paper introduces a multilingual benchmark dataset designed to investigate fairness in multilingual information retrieval (IR).
  • methods: The dataset comprises 22K multilingual documents collected from the European Parliament, spanning 24 languages. It offers an authentic multilingual corpus with topics translated into all 24 languages and cross-lingual relevance judgments, plus rich demographic information associated with the documents, facilitating the study of demographic bias.
  • results: The authors show that Multi-EuP is effective for benchmarking both monolingual and multilingual IR, and conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.
    Abstract We present Multi-EuP, a new multilingual benchmark dataset, comprising 22K multi-lingual documents collected from the European Parliament, spanning 24 languages. This dataset is designed to investigate fairness in a multilingual information retrieval (IR) context to analyze both language and demographic bias in a ranking context. It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages, as well as cross-lingual relevance judgments. Furthermore, it offers rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR. We also conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.

Towards Concept-Aware Large Language Models

  • paper_url: http://arxiv.org/abs/2311.01866
  • repo_url: None
  • paper_authors: Chen Shani, Jilles Vreeken, Dafna Shahaf
  • for: This work aims to transfer human-like concepts to machines, improving their ability to form and reason with concepts.
  • methods: The work analyzes how well contemporary large language models (LLMs) capture the structure of human concepts, and discusses ways to incorporate concepts at different stages of the pipeline, including pretraining LLMs with concepts and a simpler approach that uses the output of existing LLMs.
  • results: The study finds that incorporating concepts better matches human intuition and improves the robustness of predictions. These preliminary results underscore the promise of concept-aware language models.
    Abstract Concepts play a pivotal role in various human cognitive functions, including learning, reasoning and communication. However, there is very little work on endowing machines with the ability to form and reason with concepts. In particular, state-of-the-art large language models (LLMs) work at the level of tokens, not concepts. In this work, we analyze how well contemporary LLMs capture human concepts and their structure. We then discuss ways to develop concept-aware LLMs, taking place at different stages of the pipeline. We sketch a method for pretraining LLMs using concepts, and also explore the simpler approach that uses the output of existing LLMs. Despite its simplicity, our proof-of-concept is shown to better match human intuition, as well as improve the robustness of predictions. These preliminary results underscore the promise of concept-aware LLMs.

SortNet: Learning To Rank By a Neural-Based Sorting Algorithm

  • paper_url: http://arxiv.org/abs/2311.01864
  • repo_url: None
  • paper_authors: Leonardo Rigutini, Tiziano Papini, Marco Maggini, Franco Scarselli
  • for: This work proposes an adaptive ranking algorithm that can order objects according to user-specific relevance criteria.
  • methods: The method uses a neural network as a comparator; the training set is built by an iterative procedure that, at each iteration, adds the most informative examples of the desired ordering between pairs of items.
  • results: Experimental results show that SortNet performs well on the LETOR dataset, achieving promising performance compared with other state-of-the-art algorithms.
    Abstract The problem of relevance ranking consists of sorting a set of objects with respect to a given criterion. Since users may prefer different relevance criteria, the ranking algorithms should be adaptable to the user needs. Two main approaches exist in literature for the task of learning to rank: 1) a score function, learned by examples, which evaluates the properties of each object yielding an absolute relevance value that can be used to order the objects or 2) a pairwise approach, where a "preference function" is learned using pairs of objects to define which one has to be ranked first. In this paper, we present SortNet, an adaptive ranking algorithm which orders objects using a neural network as a comparator. The neural network training set provides examples of the desired ordering between pairs of items and it is constructed by an iterative procedure which, at each iteration, adds the most informative training examples. Moreover, the comparator adopts a connectionist architecture that is particularly suited for implementing a preference function. We also prove that such an architecture has the universal approximation property and can implement a wide class of functions. Finally, the proposed algorithm is evaluated on the LETOR dataset showing promising performances in comparison with other state of the art algorithms.
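To make the pairwise setup concrete, the sketch below shows a generic neural comparator that outputs the probability that item x should precede item y. It illustrates the preference-function idea rather than SortNet's exact architecture, which additionally enforces symmetry between the two outcomes by weight sharing.

```python
import torch
import torch.nn as nn

# Minimal pairwise comparator: given feature vectors x and y, output the
# probability that x should be ranked before y. A generic preference
# function, not SortNet's exact weight-sharing scheme.
class Comparator(nn.Module):
    def __init__(self, dim, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x, y):
        return torch.sigmoid(self.net(torch.cat([x, y], dim=-1))).squeeze(-1)

cmp = Comparator(dim=5)
x, y = torch.randn(5), torch.randn(5)
p_x_first = cmp(x, y)   # train with BCE on labeled pairs, then sort with it
```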

FAME: Flexible, Scalable Analogy Mappings Engine

  • paper_url: http://arxiv.org/abs/2311.01860
  • repo_url: None
  • paper_authors: Shahar Jacob, Chen Shani, Dafna Shahaf
  • for: This paper aims to advance computational analogy by developing a new framework that can handle partial analogies and suggest new entities to be added.
  • methods: The paper uses automatic extraction of commonsense representations to identify mappings between entities, and the input requirements are relaxed to only require names of entities.
  • results: The model achieves 81.2% accuracy on classical 2x2 analogy problems and 77.8% accuracy on larger problems, outperforming human performance and providing interpretable results.
    Abstract Analogy is one of the core capacities of human cognition; when faced with new situations, we often transfer prior experience from other domains. Most work on computational analogy relies heavily on complex, manually crafted input. In this work, we relax the input requirements, requiring only names of entities to be mapped. We automatically extract commonsense representations and use them to identify a mapping between the entities. Unlike previous works, our framework can handle partial analogies and suggest new entities to be added. Moreover, our method's output is easily interpretable, allowing for users to understand why a specific mapping was chosen. Experiments show that our model correctly maps 81.2% of classical 2x2 analogy problems (guess level=50%). On larger problems, it achieves 77.8% accuracy (mean guess level=13.1%). In another experiment, we show our algorithm outperforms human performance, and the automatic suggestions of new entities resemble those suggested by humans. We hope this work will advance computational analogy by paving the way to more flexible, realistic input requirements, with broader applicability.

A Neural Radiance Field-Based Architecture for Intelligent Multilayered View Synthesis

  • paper_url: http://arxiv.org/abs/2311.01842
  • repo_url: None
  • paper_authors: D. Dhinakaran, S. M. Udhaya Sankar, G. Elumalai, N. Jagadish kumar
  • for: The paper aims to improve on-demand source routing systems in mobile ad hoc networks by proposing a new routing strategy called the Optimized Route Selection via Red Imported Fire Ants (RIFA) Strategy.
  • methods: The proposed method uses predictions of route failure and energy utilization to select the path during the routing phase. The authors evaluate the performance of the proposed strategy based on parameters such as energy usage, packet delivery rate (PDR), and end-to-end (E2E) delay.
  • results: The results show that the proposed strategy is superior to traditional routing methods in terms of network lifetime, node energy consumption, and typical E2E delay under most network performance measures and factors.
    Abstract A mobile ad hoc network (MANET) is made up of a number of wireless portable nodes that spontaneously come together to establish a transitory network with no need for any central management. A MANET consists of a sizable and reasonably dense community of mobile nodes that travel across any terrain and rely solely on wireless interfaces for communication, without any pre-established centralized management. Routing should offer a method for instantly delivering data across the network between any two nodes; finding the best packet route across the infrastructure, however, is the major issue. The proposed protocol's major goal is to identify the least-expensive route with the nominal capacity needed to assure realistic transport and to ensure durability in the event of any node failure. This study suggests the Optimized Route Selection via Red Imported Fire Ants (RIFA) Strategy as a way to improve on-demand source routing systems. Predicted route failure and energy utilization are used to pick the path during the routing phase. The proposed work assesses the results of comparisons based on performance parameters such as energy usage, packet delivery rate (PDR), and end-to-end (E2E) delay. The outcome demonstrates that the proposed strategy is preferable, increasing network lifetime while lowering node energy consumption and typical E2E delay under the majority of network performance measures and factors.
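The route-selection rule described above, picking a path using predicted route failure and energy utilization, can be illustrated as a weighted-cost minimum. The weighting below is an expository assumption, not the paper's exact RIFA cost function.

```python
# Illustrative route selection by predicted failure risk and energy use.
# The weighted-cost form is an expository assumption, not the exact
# RIFA cost function from the paper.
def select_route(routes, w_fail=0.6, w_energy=0.4):
    """routes: list of dicts with 'path', 'p_fail' in [0,1], and
    'energy' (normalized to [0,1]). Returns the minimum-cost route."""
    def cost(r):
        return w_fail * r["p_fail"] + w_energy * r["energy"]
    return min(routes, key=cost)

best = select_route([
    {"path": ["A", "B", "D"], "p_fail": 0.10, "energy": 0.50},
    {"path": ["A", "C", "D"], "p_fail": 0.30, "energy": 0.20},
])
```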

DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-encoder

  • paper_url: http://arxiv.org/abs/2311.01811
  • repo_url: None
  • paper_authors: Tao Liu, Chenpeng Du, Shuai Fan, Feilong Chen, Kai Yu
  • for: 这篇论文是为了解决生成高质量、人类通用的视频配音问题而写的。
  • methods: 该论文提出了一种扩散 Dubbing 技术,包括在插入式渲染器中使用掩码来分割可编辑区和不可编辑区,以及使用数据增强和补充眼动引导等灵活策略来解决问题。
  • results: 经过严格的实验证明,该技术可以在人类通用和多语言场景中提供高质量、自然的视频配音,并且在不具备对应的音频视频数据的情况下也能够达到良好的效果。
    Abstract Generating high-quality and person-generic visual dubbing remains a challenge. Recent innovation has seen the advent of a two-stage paradigm, decoupling the rendering and lip synchronization process facilitated by intermediate representation as a conduit. Still, previous methodologies rely on rough landmarks or are confined to a single speaker, thus limiting their performance. In this paper, we propose DiffDub: Diffusion-based dubbing. We first craft the Diffusion auto-encoder by an inpainting renderer incorporating a mask to delineate editable zones and unaltered regions. This allows for seamless filling of the lower-face region while preserving the remaining parts. Throughout our experiments, we encountered several challenges. Primarily, the semantic encoder lacks robustness, constricting its ability to capture high-level features. Besides, the modeling ignored facial positioning, causing mouth or nose jitters across frames. To tackle these issues, we employ versatile strategies, including data augmentation and supplementary eye guidance. Moreover, we encapsulated a conformer-based reference encoder and motion generator fortified by a cross-attention mechanism. This enables our model to learn person-specific textures with varying references and reduces reliance on paired audio-visual data. Our rigorous experiments comprehensively highlight that our ground-breaking approach outpaces existing methods with considerable margins and delivers seamless, intelligible videos in person-generic and multilingual scenarios.
    摘要 生成高质量、适用于任意说话人的视频配音仍然是一个挑战。近期的创新采用两阶段范式,以中间表示为桥梁,将渲染与唇形同步过程解耦。然而,以往的方法依赖粗糙的关键点,或仅限于单一说话人,性能因此受限。本文提出 DiffDub:基于扩散模型的配音方法。我们首先通过带掩码的修补渲染器构建扩散自编码器,掩码用于划分可编辑区域与保持不变的区域,从而在保留其余部分的同时无缝填充下半脸区域。实验中我们遇到了若干挑战:语义编码器鲁棒性不足,难以捕捉高层特征;建模忽略了面部位置,导致帧间嘴部或鼻部抖动。为解决这些问题,我们采用了数据增强和辅助眼部引导等多种策略。此外,我们引入了基于 conformer 的参考编码器,以及由交叉注意力机制强化的运动生成器,使模型能够利用不同参考学习特定人物的纹理,并降低对成对音视频数据的依赖。严格的实验充分表明,我们的方法以显著优势超越现有方法,并能在任意说话人和多语言场景下生成流畅、可懂的视频。
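
The mask-delineated inpainting idea — regenerate the editable lower-face zone while leaving the rest of the frame untouched — reduces to a simple composite at each step. Below is a minimal sketch with dummy arrays; the shapes and the single-step blend are illustrative assumptions, since DiffDub applies this inside a full diffusion auto-encoder sampling loop.

```python
# Minimal sketch of mask-based inpainting compositing; shapes and the
# one-step blend are assumptions, not DiffDub's full sampling procedure.
import numpy as np

H, W = 64, 64
reference = np.random.rand(H, W, 3)  # stands in for the source video frame
generated = np.random.rand(H, W, 3)  # stands in for the model's lip-synced output

mask = np.zeros((H, W, 1))
mask[H // 2:, :, :] = 1.0            # editable zone: lower half of the face

# Fill the editable region from the generator, keep the rest of the frame intact.
composite = mask * generated + (1.0 - mask) * reference
assert np.allclose(composite[: H // 2], reference[: H // 2])
```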

AFPQ: Asymmetric Floating Point Quantization for LLMs

  • paper_url: http://arxiv.org/abs/2311.01792
  • repo_url: https://github.com/zhangsichengsjtu/afpq
  • paper_authors: Yijia Zhang, Sicheng Zhang, Shijie Cao, Dayou Du, Jianyu Wei, Ting Cao, Ningyi Xu
  • for: 提高大型自然语言模型(LLM)的部署效率和可扩展性。
  • methods: 提出了非对称浮点量化(AFPQ)方法,为正值和负值分别设置独立的缩放因子。
  • results: 相比传统浮点量化方法,AFPQ方法可以大幅提高精度,并且可以轻松地与其他量化方法结合使用,无需额外存储空间。
    Abstract Large language models (LLMs) show great performance in various tasks, but face deployment challenges from limited memory capacity and bandwidth. Low-bit weight quantization can save memory and accelerate inference. Although floating-point (FP) formats show good performance in LLM quantization, they tend to perform poorly with small group sizes or sub-4 bits. We find the reason is that the absence of asymmetry in previous FP quantization makes it unsuitable for handling asymmetric value distribution of LLM weight tensors. In this work, we propose asymmetric FP quantization (AFPQ), which sets separate scales for positive and negative values. Our method leads to large accuracy improvements and can be easily plugged into other quantization methods, including GPTQ and AWQ, for better performance. Besides, no additional storage is needed compared with asymmetric integer (INT) quantization. The code is available at https://github.com/zhangsichengsjtu/AFPQ.
    摘要 大型语言模型(LLM)在多种任务中表现出色,但受限于内存容量和带宽,面临部署挑战。低位宽权重量化可以节省内存并加速推理。虽然浮点(FP)格式在LLM量化中表现良好,但在较小的分组大小或低于4比特时表现不佳。我们发现原因在于以往的FP量化缺乏非对称性,无法处理LLM权重张量的非对称取值分布。为此,我们提出非对称浮点量化(AFPQ)方法,为正值和负值分别设置缩放因子。我们的方法可以大幅提高准确性,并且可以轻松地与其他量化方法(包括GPTQ和AWQ)结合使用,以提升性能。此外,与非对称整数(INT)量化相比,我们的方法不需要额外存储空间。代码可以在 GitHub 上找到:https://github.com/zhangsichengsjtu/AFPQ。
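

The core of AFPQ is replacing one shared FP scale per weight group with separate scales for positive and negative values. A hedged sketch on a toy FP4-style grid follows; the grid values and the example group are illustrative assumptions, not the paper's exact format.

```python
# Sketch of asymmetric FP quantization: separate scales for positive and
# negative weights within a group. The FP4-style grid below is an assumption.
import numpy as np

FP_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # toy FP4 magnitudes

def quantize(w, scale):
    """Snap |w|/scale to the nearest grid magnitude, keep the sign."""
    mag = np.abs(w) / scale
    snapped = FP_GRID[np.argmin(np.abs(mag[:, None] - FP_GRID), axis=1)]
    return np.sign(w) * snapped * scale

def afpq(group):
    pos, neg = group[group > 0], group[group < 0]
    s_pos = pos.max() / FP_GRID[-1] if pos.size else 1.0
    s_neg = -neg.min() / FP_GRID[-1] if neg.size else 1.0
    return np.where(group > 0, quantize(group, s_pos), quantize(group, s_neg))

# Asymmetric weight group: small positives plus one negative outlier, the kind
# of distribution a single shared scale handles poorly.
g = np.array([0.1, 0.2, 0.15, -1.2, 0.05, -0.02])
print(np.abs(afpq(g) - g).mean())  # mean quantization error with per-sign scales
```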

TCM-GPT: Efficient Pre-training of Large Language Models for Domain Adaptation in Traditional Chinese Medicine

  • paper_url: http://arxiv.org/abs/2311.01786
  • repo_url: None
  • paper_authors: Guoxing Yang, Jianyu Shi, Zan Wang, Xiaohong Liu, Guangyu Wang
  • for: This paper aims to improve the performance of large language models in the field of Traditional Chinese Medicine (TCM) by proposing a novel domain-specific TCM domain adaptation approach.
  • methods: The proposed TCM Domain Adaptation (TCMDA) approach uses a large TCM-specific corpus, TCM-Corpus-1B, to pre-train and fine-tune a pre-trained language model, TCM-GPT-7B, and leverages the LoRA technique to efficiently train specific dense layers for pre-training and fine-tuning (a minimal LoRA sketch appears at the end of this entry).
  • results: The proposed TCMDA approach achieves the best performance on two TCM tasks, TCM examination and TCM diagnosis, outperforming other models by relative increments of 17% and 12% in accuracy, respectively. This study represents the pioneering validation of domain adaptation of a large language model with 7 billion parameters in the TCM domain.
    Abstract Pre-training and fine-tuning have emerged as a promising paradigm across various natural language processing (NLP) tasks. The effectiveness of pretrained large language models (LLM) has witnessed further enhancement, holding potential for applications in the field of medicine, particularly in the context of Traditional Chinese Medicine (TCM). However, the application of these general models to specific domains often yields suboptimal results, primarily due to challenges like lack of domain knowledge, unique objectives, and computational efficiency. Furthermore, their effectiveness in specialized domains, such as Traditional Chinese Medicine, requires comprehensive evaluation. To address the above issues, we propose a novel domain-specific TCMDA (TCM Domain Adaptation) approach: efficient pre-training with a domain-specific corpus. Specifically, we first construct a large TCM-specific corpus, TCM-Corpus-1B, by identifying domain keywords and retrieving from a general corpus. Then, our TCMDA leverages LoRA, which freezes the pretrained model's weights and uses rank decomposition matrices to efficiently train specific dense layers for pre-training and fine-tuning, efficiently aligning the model with TCM-related tasks, yielding TCM-GPT-7B. We further conducted extensive experiments on two TCM tasks, including TCM examination and TCM diagnosis. TCM-GPT-7B achieved the best performance across both datasets, outperforming other models by relative increments of 17% and 12% in accuracy, respectively. To the best of our knowledge, our study represents the pioneering validation of domain adaptation of a large language model with 7 billion parameters in the TCM domain. We will release both TCM-Corpus-1B and the TCM-GPT-7B model once accepted to facilitate interdisciplinary development in TCM and NLP, serving as the foundation for further study.
    摘要 预训练和微调已成为各类自然语言处理(NLP)任务中一种行之有效的范式。预训练大型语言模型(LLM)的效果不断提升,有望应用于医学领域,特别是中医(TCM)领域。然而,将这些通用模型直接应用于特定领域往往效果欠佳,主要原因包括领域知识缺乏、目标独特以及计算效率等问题。此外,这些模型在中医等专门领域的有效性还需要全面评估。为解决上述问题,我们提出一种新的领域特定的TCMDA(TCM领域适配)方法,即利用领域语料进行高效预训练。具体而言,我们首先通过识别领域关键词并从通用语料中检索,构建了大规模中医语料库TCM-Corpus-1B。随后,TCMDA利用LoRA技术,冻结预训练模型的权重并使用秩分解矩阵高效训练特定的稠密层,从而高效地使模型对齐中医相关任务,得到TCM-GPT-7B。我们在中医考试和中医诊断两个任务上进行了广泛实验,TCM-GPT-7B在两个数据集上均取得最佳性能,准确率分别相对提升17%和12%。据我们所知,本研究是首次在中医领域验证70亿参数大语言模型的领域适配。论文录用后,我们将发布TCM-Corpus-1B和TCM-GPT-7B模型,以促进中医与NLP的交叉发展,为后续研究奠定基础。
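
TCMDA freezes the pretrained weights and trains only low-rank matrices via LoRA. Below is a minimal LoRA linear-layer sketch in PyTorch; the dimensions and scaling follow the common LoRA formulation, while the choice of which layers to adapt is an assumption not specified here.

```python
# Minimal LoRA sketch: frozen base weight plus a trainable rank-r update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r               # standard LoRA scaling

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
y = layer(torch.randn(2, 768))               # only A and B receive gradients
```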

Modeling the Uncertainty with Maximum Discrepant Students for Semi-supervised 2D Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.01770
  • repo_url: None
  • paper_authors: Jiaqi Wu, Junbiao Pang, Qingming Huang
  • for: 提高 semi-supervised pose estimation 任务中的计算机视觉性能
  • methods: 使用 dual mean-teacher 框架,构建两个最大差异学生 (MDSs),以及创造多种不确定性来评估 pseudo-labels 的质量
  • results: 实验结果显示,我们的方法可以提高 semi-supervised pose estimation 中的三个数据集的性能
    Abstract Semi-supervised pose estimation is a practically challenging task for computer vision. Although numerous excellent semi-supervised classification methods have emerged, these methods typically use confidence to evaluate the quality of pseudo-labels, which is difficult to achieve in pose estimation tasks. For example, in pose estimation, confidence represents only the possibility that a position of the heatmap is a keypoint, not the quality of that prediction. In this paper, we propose a simple yet efficient framework to estimate the quality of pseudo-labels in semi-supervised pose estimation tasks from the perspective of modeling the uncertainty of the pseudo-labels. Concretely, under the dual mean-teacher framework, we construct the two maximum discrepant students (MDSs) to effectively push two teachers to generate different decision boundaries for the same sample. Moreover, we create multiple uncertainties to assess the quality of the pseudo-labels. Experimental results demonstrate that our method improves the performance of semi-supervised pose estimation on three datasets.
    摘要 半监督姿态估计是计算机视觉中一项颇具挑战的实际任务。虽然已涌现出许多优秀的半监督分类方法,但这些方法通常使用置信度来评估伪标签的质量,而这在姿态估计任务中很难实现。例如,在姿态估计中,置信度只表示热图中某个位置是关键点的可能性,而非该预测的质量。在这篇论文中,我们从建模伪标签不确定性的角度出发,提出了一个简单而高效的框架,用于在半监督姿态估计任务中评估伪标签的质量。具体来说,在双均值教师(dual mean-teacher)框架下,我们构建了两个最大差异学生(MDSs),以有效促使两个教师为同一个样本生成不同的决策边界。此外,我们还构造了多种不确定性来评估伪标签的质量。实验结果表明,我们的方法提升了半监督姿态估计在三个数据集上的性能。
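
The key signal here is disagreement between the two maximum-discrepant students: keypoints whose heatmap peaks diverge are treated as uncertain pseudo-labels. A hedged sketch follows; the pixel-distance measure and Gaussian weighting are illustrative assumptions, not the paper's exact uncertainty construction.

```python
# Sketch: score pseudo-label quality by the disagreement between two students'
# predicted keypoint locations. The distance threshold/weighting are assumptions.
import numpy as np

def peak(heatmap):
    return np.unravel_index(np.argmax(heatmap), heatmap.shape)

def pseudo_label_weights(heatmaps_s1, heatmaps_s2, sigma=3.0):
    """One weight per keypoint: large student disagreement -> low confidence."""
    weights = []
    for h1, h2 in zip(heatmaps_s1, heatmaps_s2):
        d = np.linalg.norm(np.subtract(peak(h1), peak(h2)))
        weights.append(np.exp(-(d ** 2) / (2 * sigma ** 2)))
    return np.array(weights)

s1 = np.random.rand(17, 64, 64)   # 17 keypoint heatmaps from student 1
s2 = np.random.rand(17, 64, 64)   # ... and from student 2
w = pseudo_label_weights(s1, s2)  # use w to down-weight uncertain pseudo-labels
```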

Indo LEGO-ABSA: A Multitask Generative Aspect Based Sentiment Analysis for Indonesian Language

  • paper_url: http://arxiv.org/abs/2311.01757
  • repo_url: None
  • paper_authors: Randy Zakya Suchrady, Ayu Purwarianti
  • for: This paper aims to implement a multitask learning and prompting approach for aspect-based sentiment analysis in Bahasa Indonesia using generative pre-trained language models.
  • methods: The Indo LEGO-ABSA model is developed using the LEGO-ABSA framework, which employs the T5 model (specifically mT5) and trains all tasks within aspect-based sentiment analysis using multitask learning (the text-to-text task formatting is sketched at the end of this entry).
  • results: The model achieved high accuracy on several tasks within aspect-based sentiment analysis, including Aspect Sentiment Triplet Extraction (f1-score of 79.55%), Unified Aspect-based Sentiment Analysis (86.09%), Aspect Opinion Pair Extraction (79.85%), Aspect Term Extraction (87.45%), and Opinion Term Extraction (88.09%).
    Abstract Aspect-based sentiment analysis is a method in natural language processing aimed at identifying and understanding sentiments related to specific aspects of an entity. Aspects are words or phrases that represent an aspect or attribute of a particular entity. Previous research has utilized generative pre-trained language models to perform aspect-based sentiment analysis. LEGO-ABSA is one framework that has successfully employed generative pre-trained language models in aspect-based sentiment analysis, particularly in English. LEGO-ABSA uses a multitask learning and prompting approach to enhance model performance. However, the application of this approach has not been done in the context of Bahasa Indonesia. Therefore, this research aims to implement the multitask learning and prompting approach in aspect-based sentiment analysis for Bahasa Indonesia using generative pre-trained language models. In this study, the Indo LEGO-ABSA model is developed, which is an aspect-based sentiment analysis model utilizing generative pre-trained language models and trained with multitask learning and prompting. Indo LEGO-ABSA is trained with a hotel domain dataset in the Indonesian language. The obtained results include an f1-score of 79.55% for the Aspect Sentiment Triplet Extraction task, 86.09% for Unified Aspect-based Sentiment Analysis, 79.85% for Aspect Opinion Pair Extraction, 87.45% for Aspect Term Extraction, and 88.09% for Opinion Term Extraction. Indo LEGO-ABSA adopts the LEGO-ABSA framework that employs the T5 model, specifically mT5, by applying multitask learning to train all tasks within aspect-based sentiment analysis.
    摘要 基于方面的情感分析是自然语言处理中的一种方法,旨在识别和理解针对实体特定方面的情感。方面是代表特定实体某个方面或属性的词语或短语。先前的研究已经使用生成式预训练语言模型进行基于方面的情感分析。LEGO-ABSA是一个成功地将生成式预训练语言模型应用于基于方面的情感分析的框架,特别是在英文场景中。LEGO-ABSA使用多任务学习和提示方法来提升模型性能。然而,这种方法尚未应用于印尼语。因此,本研究旨在使用生成式预训练语言模型,将多任务学习和提示方法应用于印尼语的基于方面的情感分析。在这项研究中,我们开发了Indo LEGO-ABSA模型,这是一个利用生成式预训练语言模型、以多任务学习和提示方式训练的基于方面的情感分析模型。Indo LEGO-ABSA在酒店领域的印尼语数据上进行训练,得到的结果包括:方面情感三元组抽取任务的f1分数为79.55%,统一基于方面的情感分析为86.09%,方面-观点对抽取为79.85%,方面词抽取为87.45%,观点词抽取为88.09%。Indo LEGO-ABSA采用LEGO-ABSA框架,具体使用T5模型(即mT5),通过多任务学习训练基于方面的情感分析中的所有任务。
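
The multitask setup trains one mT5 model across all ABSA subtasks by prefixing each input with a task prompt and linearizing the outputs as text. Below is a hedged sketch of that text-to-text formatting; the prompt wording and linearization scheme are illustrative assumptions, not the exact LEGO-ABSA templates.

```python
# Sketch of multitask text-to-text formatting for ABSA; the prompt strings and
# target linearization are assumptions, not the exact LEGO-ABSA templates.
def make_example(task, sentence, annotations):
    prompts = {
        "aste": "extract aspect sentiment triplets: ",
        "aope": "extract aspect opinion pairs: ",
        "ate":  "extract aspect terms: ",
    }
    target = " ; ".join("(" + ", ".join(t) + ")" for t in annotations)
    return {"input": prompts[task] + sentence, "target": target}

ex = make_example(
    "aste",
    "kamarnya bersih dan pelayanannya ramah",  # "the room is clean and the service is friendly"
    [("kamarnya", "bersih", "positive"), ("pelayanannya", "ramah", "positive")],
)
print(ex["input"])
print(ex["target"])  # (kamarnya, bersih, positive) ; (pelayanannya, ramah, positive)
```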

RiskQ: Risk-sensitive Multi-Agent Reinforcement Learning Value Factorization

  • paper_url: http://arxiv.org/abs/2311.01753
  • repo_url: https://github.com/xmu-rl-3dv/riskq
  • paper_authors: Siqi Shen, Chennan Ma, Chao Li, Weiquan Liu, Yongquan Fu, Songzhu Mei, Xinwang Liu, Cheng Wang
  • for: 这个论文旨在解决多智能系统中的风险敏感多智能学习问题,即在不确定环境、不同代理策略和部分可见性下学习协调的政策。
  • methods: 作者引入了风险敏感个体全球最大原则(RIGM),该原则要求每个代理的风险敏感行动选择集等于中央政策的风险敏感行动选择。作者还提出了一种名为风险Q的方法,该方法可以模型多个代理的共同返回分布,并满足RIGM原则。
  • results: 作者通过广泛的实验表明,风险Q方法可以在多个环境下实现优秀的性能。代码可以在https://github.com/xmu-rl-3dv/RiskQ中找到。
    Abstract Multi-agent systems are characterized by environmental uncertainty, varying policies of agents, and partial observability, which result in significant risks. In the context of Multi-Agent Reinforcement Learning (MARL), learning coordinated and decentralized policies that are sensitive to risk is challenging. To formulate the coordination requirements in risk-sensitive MARL, we introduce the Risk-sensitive Individual-Global-Max (RIGM) principle as a generalization of the Individual-Global-Max (IGM) and Distributional IGM (DIGM) principles. This principle requires that the collection of risk-sensitive action selections of each agent should be equivalent to the risk-sensitive action selection of the central policy. Current MARL value factorization methods do not satisfy the RIGM principle for common risk metrics such as the Value at Risk (VaR) metric or distorted risk measurements. Therefore, we propose RiskQ to address this limitation, which models the joint return distribution by modeling quantiles of it as weighted quantile mixtures of per-agent return distribution utilities. RiskQ satisfies the RIGM principle for the VaR and distorted risk metrics. We show that RiskQ can obtain promising performance through extensive experiments. The source code of RiskQ is available in https://github.com/xmu-rl-3dv/RiskQ.
    摘要 多智能体系统具有环境不确定性、智能体策略各异以及部分可观测等特点,因而蕴含显著风险。在多智能体强化学习(MARL)中,学习对风险敏感的协调式去中心化策略颇具挑战。为刻画风险敏感MARL中的协调要求,我们提出了风险敏感个体-全局最大(RIGM)原则,作为个体-全局最大(IGM)原则和分布式IGM(DIGM)原则的推广:它要求每个智能体的风险敏感动作选择的集合与中心策略的风险敏感动作选择等价。现有的MARL价值分解方法对常见风险度量(如在险价值VaR或扭曲风险度量)并不满足RIGM原则。为此,我们提出RiskQ,通过将联合回报分布的分位数建模为各智能体回报分布效用的加权分位数混合来刻画联合回报分布。RiskQ对VaR与扭曲风险度量均满足RIGM原则。大量实验表明,RiskQ能够取得可观的性能。RiskQ的源代码见 https://github.com/xmu-rl-3dv/RiskQ。
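
RiskQ represents the joint return distribution through quantiles built as weighted mixtures of per-agent quantile utilities, from which risk metrics such as VaR can be read off. A simplified numeric sketch follows; the fixed quantile levels, Gaussian return samples, and mixture weights are illustrative assumptions.

```python
# Simplified sketch: a joint quantile function as a weighted mixture of
# per-agent quantiles, with VaR read off from it. Weights/levels are assumptions.
import numpy as np

taus = np.linspace(0.05, 0.95, 19)              # quantile levels
agent_q = np.array([
    np.quantile(np.random.normal(1.0, 1.0, 5000), taus),   # agent 1 returns
    np.quantile(np.random.normal(2.0, 3.0, 5000), taus),   # agent 2 returns
])
w = np.array([0.4, 0.6])                        # mixture weights (sum to 1)
joint_q = w @ agent_q                           # weighted quantile mixture

def value_at_risk(quantiles, levels, alpha=0.1):
    """VaR_alpha: the alpha-quantile of the return distribution."""
    return np.interp(alpha, levels, quantiles)

print(value_at_risk(joint_q, taus))             # risk-sensitive value estimate
```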

Energy Efficiency Optimization for Subterranean LoRaWAN Using A Reinforcement Learning Approach: A Direct-to-Satellite Scenario

  • paper_url: http://arxiv.org/abs/2311.01743
  • repo_url: None
  • paper_authors: Kaiqiang Lin, Muhammad Asad Ullah, Hirley Alves, Konstantin Mikhaylov, Tong Hao
  • for: 这篇论文旨在探讨如何在无 terrestrial 网络(NTN)中充分利用地下 LoRaWAN 网络,以实现负荷较大的农业和灾难救援操作中的经济和社会效益。
  • methods: 论文利用LoRa调制的准正交扩频因子(SF)来权衡数据速率、空中时间、覆盖范围和能耗。然而,在大规模地下LoRaWAN NTN中,如何高效地为终端设备分配SF以最小化同SF干扰仍是难题。为此,论文提出了一种基于强化学习(RL)的SF分配方案,以优化系统的能效(EE)。
  • results: 与四种基准方法相比,基于RL的SF分配方案在极端地下直连卫星场景中表现出色;尤其是MAD3QN在收敛速度和能效方面展现出超越MAA2C的潜力。
    Abstract The integration of subterranean LoRaWAN and non-terrestrial networks (NTN) delivers substantial economic and societal benefits in remote agriculture and disaster rescue operations. The LoRa modulation leverages quasi-orthogonal spreading factors (SFs) to optimize data rates, airtime, coverage and energy consumption. However, it is still challenging to effectively assign SFs to end devices for minimizing co-SF interference in massive subterranean LoRaWAN NTN. To address this, we investigate a reinforcement learning (RL)-based SFs allocation scheme to optimize the system's energy efficiency (EE). To efficiently capture the device-to-environment interactions in dense networks, we proposed an SFs allocation technique using the multi-agent dueling double deep Q-network (MAD3QN) and the multi-agent advantage actor-critic (MAA2C) algorithms based on an analytical reward mechanism. Our proposed RL-based SFs allocation approach evinces better performance compared to four benchmarks in the extreme underground direct-to-satellite scenario. Remarkably, MAD3QN shows promising potentials in surpassing MAA2C in terms of convergence rate and EE.
    摘要 将地下LoRaWAN与非地面网络(NTN)集成,可在远程农业和灾难救援行动中带来可观的经济和社会效益。LoRa调制利用准正交扩频因子(SF)来优化数据速率、空中时间、覆盖范围和能耗。然而,在大规模地下LoRaWAN NTN中,有效地为终端设备分配SF以最小化同SF干扰仍然是一个挑战。为了解决这个问题,我们研究了一种基于强化学习(RL)的SF分配方案,以优化系统的能效(EE)。为了在密集网络中有效捕捉设备与环境之间的交互,我们基于解析化的奖励机制,提出了使用多智能体对决双深度Q网络(MAD3QN)和多智能体优势演员-评论家(MAA2C)算法的SF分配技术。在极端地下直连卫星场景下,我们提出的基于RL的SF分配方法相比四种基准方法表现更优;值得注意的是,MAD3QN在收敛速度和能效方面展现出超越MAA2C的潜力。
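
As a heavily simplified stand-in for the RL idea, the sketch below has a single agent learn which spreading factor maximizes a toy energy-efficiency reward. The reward model, range rule, and bandit-style update are all illustrative assumptions; the paper uses MAD3QN/MAA2C over a real channel model, not tabular learning.

```python
# Toy stand-in for RL-based SF allocation: tabular learning over spreading
# factors with a synthetic energy-efficiency reward. All numbers are assumptions.
import random

SFS = [7, 8, 9, 10, 11, 12]

def ee_reward(sf, distance):
    """Synthetic EE: higher SF reaches farther but costs more airtime/energy."""
    delivered = 1.0 if distance <= sf * 100 else 0.0      # crude range model
    energy = 2.0 ** (sf - 7)                              # airtime doubles per SF
    return delivered / energy

Q = {sf: 0.0 for sf in SFS}
for step in range(5000):
    sf = random.choice(SFS) if random.random() < 0.1 else max(Q, key=Q.get)
    r = ee_reward(sf, distance=random.uniform(200, 900))
    Q[sf] += 0.05 * (r - Q[sf])                           # bandit-style update
print(max(Q, key=Q.get))  # converges to a small SF: good EE, adequate coverage
```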

Flexible Error Mitigation of Quantum Processes with Data Augmentation Empowered Neural Model

  • paper_url: http://arxiv.org/abs/2311.01727
  • repo_url: https://github.com/EXPmaster/DAEM
  • paper_authors: Manwen Liao, Yan Zhu, Giulio Chiribella, Yuxiang Yang
  • for: 论文旨在开发一种可用于实际量子计算的量子错误缓解方法。
  • methods: 论文使用由数据扩充驱动的神经网络模型来实现错误缓解。该模型不需要任何关于具体噪声类型和测量设置的知识,可以直接从受噪声影响的量子过程的测量结果中估计无噪声统计量。
  • results: 在数值实验中,该模型能够高效缓解多种类型的噪声,包括马尔可夫噪声和非马尔可夫噪声,表现优于此前的错误缓解方法。此外,该模型还可应用于多种不同的量子过程,包括大规模量子系统和连续变量量子态。这种由数据扩充驱动的神经网络模型,为实现更可靠、更鲁棒的量子技术奠定了坚实基础。
    Abstract Neural networks have shown their effectiveness in various tasks in the realm of quantum computing. However, their application in quantum error mitigation, a crucial step towards realizing practical quantum advancements, has been restricted by reliance on noise-free statistics. To tackle this critical challenge, we propose a data augmentation empowered neural model for error mitigation (DAEM). Our model does not require any prior knowledge about the specific noise type and measurement settings and can estimate noise-free statistics solely from the noisy measurement results of the target quantum process, rendering it highly suitable for practical implementation. In numerical experiments, we show the model's superior performance in mitigating various types of noise, including Markovian noise and Non-Markovian noise, compared with previous error mitigation methods. We further demonstrate its versatility by employing the model to mitigate errors in diverse types of quantum processes, including those involving large-scale quantum systems and continuous-variable quantum states. This powerful data augmentation-empowered neural model for error mitigation establishes a solid foundation for realizing more reliable and robust quantum technologies in practical applications.
    摘要 神经网络在量子计算领域的各种任务中展现了其有效性。然而,它们在量子错误缓解(实现实用量子进步的关键一步)中的应用,一直受限于对无噪声统计数据的依赖。为解决这一关键挑战,我们提出一种由数据增强驱动的神经错误缓解模型(DAEM)。我们的模型不需要关于具体噪声类型和测量设置的任何先验知识,仅凭目标量子过程的含噪测量结果即可估计无噪声统计量,因而非常适合实际部署。在数值实验中,我们展示了该模型在缓解多种类型噪声(包括马尔可夫噪声与非马尔可夫噪声)方面优于以往错误缓解方法的出色性能。我们还通过将该模型用于缓解多种量子过程(包括大规模量子系统和连续变量量子态)中的错误,进一步展示了其通用性。这种由数据增强驱动的神经错误缓解模型,为在实际应用中实现更可靠、更鲁棒的量子技术奠定了坚实基础。

Towards Calibrated Robust Fine-Tuning of Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.01723
  • repo_url: None
  • paper_authors: Changdae Oh, Mijoo Kim, Hyesu Lim, Junhyeok Park, Euiseog Jeong, Zhi-Qi Cheng, Kyungwoo Song
  • for: This paper focuses on the problem of calibration and robustness in fine-tuning pre-trained vision-language models (VLMs) under distribution shift.
  • methods: The authors propose a simple approach called calibrated robust fine-tuning (CaRot) that incentivizes the calibration and robustness of pre-trained VLMs on both in-distribution (ID) and out-of-distribution (OOD) datasets.
  • results: The authors show that their proposed method, CaRot, effectively improves the calibration and robustness of pre-trained VLMs on OOD datasets, as verified by empirical results on ImageNet-1K distribution shift evaluation.
    Abstract While fine-tuning unleashes the potential of a pre-trained model to a specific task, it trades off the model's generalization capability on out-of-distribution (OOD) datasets. To mitigate this, robust fine-tuning aims to ensure performance on OOD datasets as well as an in-distribution (ID) dataset for which the model is being tuned. However, another criterion for reliable machine learning (ML), confidence calibration, has been overlooked despite its increasing demand for real-world high-stakes ML applications (e.g., autonomous driving and medical diagnosis). For the first time, we raise concerns about the calibration of fine-tuned vision-language models (VLMs) under distribution shift by showing that naive fine-tuning and even state-of-the-art robust fine-tuning methods hurt the calibration of pre-trained VLMs, especially on OOD datasets. To address this, we provide a simple approach, called a calibrated robust fine-tuning (CaRot) that incentivizes the calibration and robustness on both ID and OOD datasets. Empirical results on ImageNet-1K distribution shift evaluation verify the effectiveness of our method.
    摘要 微调虽然能够释放预训练模型在特定任务上的潜力,却会牺牲模型在分布外(OOD)数据集上的泛化能力。为缓解这一问题,鲁棒微调旨在同时保证模型在OOD数据集和所微调的分布内(ID)数据集上的性能。然而,可靠机器学习的另一项标准——置信度校准——尽管在自动驾驶、医疗诊断等高风险实际应用中需求日增,却一直被忽视。我们首次针对分布偏移下微调视觉语言模型(VLM)的校准问题提出质疑,并证明朴素微调乃至最先进的鲁棒微调方法都会损害预训练VLM的校准性,在OOD数据集上尤为明显。为此,我们提出一种简单的方法——校准鲁棒微调(CaRot),同时激励模型在ID与OOD数据集上的校准性与鲁棒性。在ImageNet-1K分布偏移评测上的实验结果验证了该方法的有效性。
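
Calibration here means that predicted confidence matches empirical accuracy; the standard way to quantify the gap is the expected calibration error (ECE). A minimal sketch follows, using the usual 10 equal-width confidence bins (the binning choice is an assumption, not necessarily the paper's evaluation setup).

```python
# Minimal expected calibration error (ECE) sketch with equal-width bins.
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over confidence bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            total += in_bin.mean() * gap
    return total

conf = np.random.uniform(0.5, 1.0, 1000)           # model confidences
correct = (np.random.rand(1000) < conf * 0.8)      # an overconfident model
print(ece(conf, correct.astype(float)))            # noticeably above zero
```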

An Empirical Study of Benchmarking Chinese Aspect Sentiment Quad Prediction

  • paper_url: http://arxiv.org/abs/2311.01713
  • repo_url: None
  • paper_authors: Junxian Zhou, Haiqin Yang, Ye Junpeng, Yuxuan He, Hao Mou
  • for: expanding the capacity of aspect-level sentiment analysis
  • methods: constructing two large Chinese ASQP datasets and evaluating the performance of GPT series models
  • results: highlighting the need for additional techniques to address ASQP and the potential issues with using GPTs
    Abstract Aspect sentiment quad prediction (ASQP) is a critical subtask of aspect-level sentiment analysis. Current ASQP datasets are characterized by their small size and low quadruple density, which hinders technical development. To expand capacity, we construct two large Chinese ASQP datasets crawled from multiple online platforms. The datasets hold several significant characteristics: larger size (each with 10,000+ samples) and rich aspect categories, more words per sentence, and higher density than existing ASQP datasets. Moreover, we are the first to evaluate the performance of Generative Pre-trained Transformer (GPT) series models on ASQP and exhibit potential issues. The experiments with state-of-the-art ASQP baselines underscore the need to explore additional techniques to address ASQP, as well as the importance of further investigation into methods to improve the performance of GPTs.
    摘要 方面情感四元组预测(ASQP)是方面级情感分析的一项重要子任务。现有的ASQP数据集规模小、四元组密度低,限制了技术发展。为了扩充数据规模,我们从多个在线平台爬取并构建了两个大型中文ASQP数据集。这两个数据集具有以下显著特点:规模更大(各含10,000+样本)、方面类别丰富、句子词数更多,且密度高于现有的ASQP数据集。此外,我们首次评估了生成式预训练Transformer(GPT)系列模型在ASQP上的表现,并揭示了潜在问题。使用当前最先进ASQP基线的实验表明,需要进一步探索解决ASQP的其他技术,同时也凸显了深入研究提升GPT性能方法的重要性。
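
ASQP systems are typically scored with exact-match F1 over predicted (aspect, category, opinion, sentiment) quadruples. A minimal sketch of that metric follows; exact set matching is the common evaluation convention, assumed here rather than taken from the paper.

```python
# Minimal quad-level precision/recall/F1 via exact set matching (the usual
# ASQP evaluation convention, assumed here).
def quad_f1(pred_quads, gold_quads):
    pred, gold = set(pred_quads), set(gold_quads)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [("服务", "service#general", "周到", "positive"),
        ("菜品", "food#quality", "一般", "neutral")]
pred = [("服务", "service#general", "周到", "positive")]
print(quad_f1(pred, gold))  # 0.667: one exact match out of two gold quads
```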

Data-Free Distillation of Language Model by Text-to-Text Transfer

  • paper_url: http://arxiv.org/abs/2311.01689
  • repo_url: None
  • paper_authors: Zheyuan Bai, Xinduo Liu, Hailin Hu, Tianyu Guo, Qinghua Zhang, Yunhe Wang
  • for: 本研究的目的是提出一种基于生成式语言模型的无数据知识蒸馏(DFKD)框架,以提升模型压缩的特异性与多样性。
  • methods: 我们提出了一种名为DFKD-T$^{3}$的新框架,其中使用预训练的生成语言模型作为可控的数据生成器,以实现文本到文本的结构学习。
  • results: 我们的方法可以在不同的下游任务中提升蒸馏性能与生成文本的多样性,并且生成的文本可直接用于蒸馏其他语言模型,超越当前SOTA方法。
    Abstract Data-Free Knowledge Distillation (DFKD) plays a vital role in compressing the model when original training data is unavailable. Previous works for DFKD in NLP mainly focus on distilling encoder-only structures like BERT on classification tasks, which overlook the notable progress of generative language modeling. In this work, we propose a novel DFKD framework, namely DFKD-T$^{3}$, where the pretrained generative language model can also serve as a controllable data generator for model compression. This novel framework DFKD-T$^{3}$ leads to an end-to-end learnable text-to-text framework to transform the general domain corpus to compression-friendly task data, targeting to improve both the \textit{specificity} and \textit{diversity}. Extensive experiments show that our method can boost the distillation performance in various downstream tasks such as sentiment analysis, linguistic acceptability, and information extraction. Furthermore, we show that the generated texts can be directly used for distilling other language models and outperform the SOTA methods, making our method more appealing in a general DFKD setting. Our code is available at https://gitee.com/mindspore/models/tree/master/research/nlp/DFKD\_T3.
    摘要 无数据知识蒸馏(DFKD)在原始训练数据不可得时对模型压缩起着至关重要的作用。以往NLP领域的DFKD工作主要集中于在分类任务上蒸馏BERT等仅编码器结构,忽视了生成式语言建模的显著进展。在这项工作中,我们提出了一种新的DFKD框架,即DFKD-T$^{3}$,其中预训练的生成式语言模型还可以充当可控的数据生成器,服务于模型压缩。该框架构成了一个端到端可学习的文本到文本框架,将通用领域语料转换为便于压缩的任务数据,旨在同时提升其特异性与多样性。大量实验表明,我们的方法能够在情感分析、语言可接受性和信息抽取等多个下游任务中提升蒸馏性能。此外,我们还证明了生成的文本可以直接用于蒸馏其他语言模型,并超越当前最优方法,使我们的方法在一般DFKD场景下更具吸引力。代码见 https://gitee.com/mindspore/models/tree/master/research/nlp/DFKD\_T3。
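
Once the generator produces compression-friendly task text, the student is distilled against the teacher's output distribution on that text. Below is a minimal sketch of the standard temperature-scaled distillation loss; DFKD-T$^{3}$'s exact loss composition and weighting are not reproduced here.

```python
# Standard temperature-scaled KD loss on generated text (a common objective;
# not DFKD-T3's exact loss composition).
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T

student_logits = torch.randn(8, 30522)   # batch of vocab-sized predictions
teacher_logits = torch.randn(8, 30522)
print(distill_loss(student_logits, teacher_logits))
```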

The R.O.A.D. to precision medicine

  • paper_url: http://arxiv.org/abs/2311.01681
  • repo_url: None
  • paper_authors: Dimitris Bertsimas, Angelos G. Koulouras, Georgios Antonios Margonis
  • for: The paper addresses the deficiencies of randomized-trial subgroup analysis and transforms observational data so it can be used as if it were randomized, thus paving the road for precision medicine.
  • methods: A novel two-step process corrects the estimated probabilities of the outcome under each treatment; these corrected probabilities are then used to train Optimal Policy Trees (OPTs) that assign treatments to subgroups of patients based on their characteristics (a loosely analogous sketch with off-the-shelf components follows this entry's abstract).
  • results: The resulting recommendations outperformed those of experts in gastrointestinal stromal tumors (GIST) and extremity sarcomas. In addition, the framework identified a subset of patients with unique characteristics who may not require treatment, a finding validated in an external cohort.
    Abstract We propose a prognostic stratum matching framework that addresses the deficiencies of Randomized trial data subgroup analysis and transforms ObservAtional Data to be used as if they were randomized, thus paving the road for precision medicine. Our approach counters the effects of unobserved confounding in observational data by correcting the estimated probabilities of the outcome under a treatment through a novel two-step process. These probabilities are then used to train Optimal Policy Trees (OPTs), which are decision trees that optimally assign treatments to subgroups of patients based on their characteristics. This facilitates the creation of clinically intuitive treatment recommendations. We applied our framework to observational data of patients with gastrointestinal stromal tumors (GIST) and validated the OPTs in an external cohort using the sensitivity and specificity metrics. We show that these recommendations outperformed those of experts in GIST. We further applied the same framework to randomized clinical trial (RCT) data of patients with extremity sarcomas. Remarkably, despite the initial trial results suggesting that all patients should receive treatment, our framework, after addressing imbalances in patient distribution due to the trial's small sample size, identified through the OPTs a subset of patients with unique characteristics who may not require treatment. Again, we successfully validated our recommendations in an external cohort.
    摘要 我们提出了一个预后分层匹配框架,用于弥补随机试验数据亚组分析的不足,并将观察性数据转化为可如同随机化数据一般使用,从而为精准医疗铺平道路。我们的方法通过一种新颖的两步过程修正治疗下结局的估计概率,以对抗观察性数据中未观测混杂因素的影响。随后,这些概率被用于训练最优策略树(OPTs),即根据患者特征为患者亚组最优分配治疗的决策树,从而形成符合临床直觉的治疗建议。我们将该框架应用于胃肠道间质瘤(GIST)患者的观察性数据,并使用敏感性和特异性指标在外部队列中验证了OPTs。结果显示,这些建议优于GIST专家的建议。我们进一步将同一框架应用于肢体肉瘤患者的随机临床试验(RCT)数据。值得注意的是,尽管最初的试验结果提示所有患者都应接受治疗,我们的框架在纠正因试验样本量小导致的患者分布不均衡后,通过OPTs识别出一部分具有独特特征、可能无需治疗的患者。我们同样在外部队列中成功验证了这些建议。
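
A loose sketch of the overall recipe on synthetic data: estimate outcome probabilities per treatment, correct them, then fit an interpretable tree that assigns treatments. Everything below is a stand-in — sklearn's probability calibration replaces the paper's novel two-step correction, and a plain decision tree replaces Optimal Policy Trees.

```python
# Loose sketch with stand-ins: sklearn calibration and DecisionTree replace the
# paper's correction procedure and Optimal Policy Trees, respectively.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                      # patient characteristics
treat = rng.integers(0, 2, 500)                    # observed treatment
y = (X[:, 0] * (2 * treat - 1) > 0).astype(int)    # synthetic outcome

# Step 1: per-treatment outcome models with (stand-in) probability correction.
probs = np.zeros((500, 2))
for a in (0, 1):
    m = CalibratedClassifierCV(RandomForestClassifier(n_estimators=50, random_state=0), cv=3)
    m.fit(X[treat == a], y[treat == a])
    probs[:, a] = m.predict_proba(X)[:, 1]

# Step 2: an interpretable tree assigns each patient the better treatment.
best = probs.argmax(axis=1)
policy = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, best)
print(policy.predict(X[:5]))                       # treatment recommendations
```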

DialogBench: Evaluating LLMs as Human-like Dialogue Systems

  • paper_url: http://arxiv.org/abs/2311.01677
  • repo_url: None
  • paper_authors: Jiao Ou, Junda Lu, Che Liu, Yihong Tang, Fuzheng Zhang, Di Zhang, Zhongyuan Wang, Kun Gai
  • for: 评估大型自然语言模型(LLMs)的对话系统人类化能力
  • methods: 提出了一个对话评估Benchmark,名为DialogBench,用于评估LLMs的对话系统人类化能力
  • results: 实验结果表明,对LLMs进行指导微调可以提高它们的人类化能力,但是还有很多方面需要进一步改进,以达到人类化对话系统的标准。
    Abstract Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities, refreshing human's impressions on dialogue systems. The long-standing goal of dialogue systems is to be human-like enough to establish long-term connections with users by satisfying the need for communication, affection and social belonging. Therefore, there has been an urgent need to evaluate LLMs as human-like dialogue systems. In this paper, we propose DialogBench, a dialogue evaluation benchmark that currently contains $12$ dialogue tasks to assess the capabilities of LLMs as human-like dialogue systems should have. Specifically, we prompt GPT-4 to generate evaluation instances for each task. We first design the basic prompt based on widely-used design principles and further mitigate the existing biases to generate higher-quality evaluation instances. Our extensive test over $28$ LLMs (including pre-trained and supervised instruction-tuning) shows that instruction fine-tuning benefits improve the human likeness of LLMs to a certain extent, but there is still much room to improve those capabilities for most LLMs as human-like dialogue systems. In addition, experimental results also indicate that LLMs perform differently in various abilities that human-like dialogue systems should have. We will publicly release DialogBench, along with the associated evaluation code for the broader research community.
    摘要 大型语言模型(LLM)在对话能力上取得了显著突破,刷新了人们对对话系统的印象。对话系统的长期目标是足够拟人,通过满足用户在交流、情感和社会归属方面的需求,与用户建立长期联系。因此,迫切需要将LLM作为类人对话系统加以评估。在本文中,我们提出了DialogBench,一个对话评估基准,目前包含12个对话任务,用以评估LLM作为类人对话系统应具备的能力。具体而言,我们提示GPT-4为每个任务生成评估实例:先基于广泛使用的设计原则设计基础提示,再进一步消减既有偏差,以生成更高质量的评估实例。我们对28个LLM(包括预训练模型和经过有监督指令微调的模型)进行了广泛测试,结果表明指令微调能在一定程度上提升LLM的拟人程度,但对大多数LLM而言,作为类人对话系统仍有很大的提升空间。此外,实验结果还表明,LLM在类人对话系统应具备的各项能力上表现各异。我们将公开发布DialogBench及配套评估代码,以供更广泛的研究社区使用。

MineSegSAT: An automated system to evaluate mining disturbed area extents from Sentinel-2 imagery

  • paper_url: http://arxiv.org/abs/2311.01676
  • repo_url: None
  • paper_authors: Ezra MacDonald, Derek Jacoby, Yvonne Coady
  • for: 这个论文的目的是为了评估采矿业对环境的影响,以便更好地理解和mitigate采矿活动的生态后果。
  • methods: 这个论文使用了SegFormer深度学习分割框架,将Sentinel-2数据集中的环境影响区域分类为不同的土地覆盖类别。
  • results: 论文通过使用不同的损失函数(Dice、Tversky和Lovasz损失函数)进行训练,并在测试区域上进行推断,以确定采矿活动对环境的影响。
    Abstract Assessing the environmental impact of the mineral extraction industry plays a critical role in understanding and mitigating the ecological consequences of extractive activities. This paper presents MineSegSAT, a model that presents a novel approach to predicting environmentally impacted areas of mineral extraction sites using the SegFormer deep learning segmentation architecture trained on Sentinel-2 data. The data was collected from non-overlapping regions over Western Canada in 2021 containing areas of land that have been environmentally impacted by mining activities that were identified from high-resolution satellite imagery in 2021. The SegFormer architecture, a state-of-the-art semantic segmentation framework, is employed to leverage its advanced spatial understanding capabilities for accurate land cover classification. We investigate the efficacy of loss functions including Dice, Tversky, and Lovasz loss respectively. The trained model was utilized for inference over the test region in the ensuing year to identify potential areas of expansion or contraction over these same periods. The Sentinel-2 data is made available on Amazon Web Services through a collaboration with Earth Daily Analytics which provides corrected and tiled analytics-ready data on the AWS platform. The model and ongoing API to access the data on AWS allow the creation of an automated tool to monitor the extent of disturbed areas surrounding known mining sites to ensure compliance with their environmental impact goals.
    摘要 评估矿产开采业对环境的影响,对于理解和缓解开采活动的生态后果至关重要。本文提出MineSegSAT模型,使用在Sentinel-2数据上训练的SegFormer深度学习分割架构,预测矿产开采场地受环境影响的区域。数据采集自2021年加拿大西部互不重叠的区域,其中包含依据2021年高分辨率卫星影像识别出的受采矿活动环境影响的土地。SegFormer是一种先进的语义分割框架,可借助其空间理解能力实现精确的土地覆盖分类。我们分别考察了Dice、Tversky和Lovasz损失函数的效果。训练后的模型被用于次年对测试区域进行推断,以识别同期内可能扩张或收缩的区域。Sentinel-2数据通过与Earth Daily Analytics的合作在Amazon Web Services上提供,后者在AWS平台上提供经校正、分块的分析就绪数据。该模型及持续可用的AWS数据访问API,可用于构建自动化工具,监测已知矿区周边受扰动区域的范围,确保其符合环境影响目标。
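
The loss functions compared in the paper are standard region-overlap objectives. Below is a minimal sketch of Dice and Tversky loss for binary masks (Lovasz omitted for brevity); the smoothing constant and the alpha/beta choice are conventional assumptions, not the paper's exact settings.

```python
# Minimal Dice and Tversky losses for binary segmentation masks (Lovasz
# omitted). Smoothing constant and alpha/beta follow common conventions.
import torch

def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def tversky_loss(pred, target, alpha=0.7, beta=0.3, eps=1e-6):
    """alpha > beta penalizes false negatives more, useful for small mine sites."""
    tp = (pred * target).sum()
    fn = ((1 - pred) * target).sum()
    fp = (pred * (1 - target)).sum()
    return 1 - (tp + eps) / (tp + alpha * fn + beta * fp + eps)

pred = torch.rand(1, 1, 64, 64)                    # predicted probabilities
target = (torch.rand(1, 1, 64, 64) > 0.8).float()  # sparse ground-truth mask
print(dice_loss(pred, target), tversky_loss(pred, target))
```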

Deep Learning-driven Community Resilience Rating based on Intertwined Socio-Technical Systems Features

  • paper_url: http://arxiv.org/abs/2311.01661
  • repo_url: None
  • paper_authors: Kai Yin, Ali Mostafavi
  • for: 这个论文主要是为了提高社区抗逆能力的评估和评价。
  • methods: 这个论文使用了一种基于深度学习的三层模型,即Resili-Net,来评估社区的抗逆能力水平。
  • results: 根据美国多个都会区的公共可 accessible 数据,Resili-Net 模型可以对社区的抗逆能力进行评估,并将其分为五个不同的水平。此外,模型还可以对社区抗逆能力的变化进行分析,以便为特定的抗逆能力提高措施提供指导。
    Abstract Community resilience is a complex and multi-faceted phenomenon that emerges from complex and nonlinear interactions among different socio-technical systems and their resilience properties. However, present studies on community resilience focus primarily on vulnerability assessment and utilize index-based approaches, with limited ability to capture heterogeneous features within community socio-technical systems and their nonlinear interactions in shaping robustness, redundancy, and resourcefulness components of resilience. To address this gap, this paper presents an integrated three-layer deep learning model for community resilience rating (called Resili-Net). Twelve measurable resilience features are specified and computed within community socio-technical systems (i.e., facilities, infrastructures, and society) related to three resilience components of robustness, redundancy, and resourcefulness. Using publicly accessible data from multiple metropolitan statistical areas in the United States, Resili-Net characterizes the resilience levels of spatial areas into five distinct levels. The interpretability of the model outcomes enables feature analysis for specifying the determinants of resilience in areas within each resilience level, allowing for the identification of specific resilience enhancement strategies. Changes in community resilience profiles under urban development patterns are further examined by changing the value of related socio-technical systems features. Accordingly, the outcomes provide novel perspectives for community resilience assessment by harnessing machine intelligence and heterogeneous urban big data.
    摘要 社区韧性是一种复杂、多面的现象,源自不同社会技术系统及其韧性属性之间复杂且非线性的相互作用。然而,当前的社区韧性研究主要关注脆弱性评估,采用基于指标的方法,难以充分刻画社区社会技术系统内部的异质特征及其在塑造韧性的坚固性、冗余性和资源性三个组成部分时的非线性相互作用。为了填补这一空白,本文提出了一种集成的三层深度学习社区韧性评级模型(Resili-Net)。模型在社区社会技术系统(即设施、基础设施和社会)内确定并计算了与坚固性、冗余性和资源性三个韧性组成部分相关的12个可测量韧性特征。利用美国多个都会统计区的公开数据,Resili-Net将空间区域的韧性水平划分为五个不同等级。模型结果的可解释性支持特征分析,可确定各韧性等级区域内韧性的决定因素,从而识别具体的韧性提升策略。通过改变相关社会技术系统特征的取值,进一步考察了城市发展模式下社区韧性画像的变化。因此,该成果借助机器智能和异构城市大数据,为社区韧性评估提供了新视角。
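
Abstractly, the model maps the twelve socio-technical features of a spatial area to one of five resilience levels. The minimal stand-in below shows only that interface; the three-layer MLP architecture and sizes are illustrative assumptions, not the published Resili-Net design or weights.

```python
# Illustrative stand-in: a three-layer MLP mapping 12 resilience features
# (robustness/redundancy/resourcefulness related) to 5 rating levels.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(12, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 5),                  # five distinct resilience levels
)

features = torch.rand(8, 12)           # 12 measurable features per spatial area
level = model(features).argmax(dim=1)  # predicted resilience level, 0-4
print(level)
```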

MARRS: Multimodal Reference Resolution System

  • paper_url: http://arxiv.org/abs/2311.01650
  • repo_url: None
  • paper_authors: Halim Cagri Ates, Shruti Bhargava, Site Li, Jiarui Lu, Siddhardha Maddula, Joel Ruben Antony Moniz, Anil Kumar Nalamalapu, Roman Hoang Nguyen, Melis Ozyildirim, Alkesh Patel, Dhivya Piraviperumal, Vincent Renkens, Ankit Samal, Thy Tran, Bo-Hsiang Tseng, Hong Yu, Yuan Zhang, Rong Zou
  • for: 这篇论文旨在描述一种在自然语言理解系统中处理上下文的框架,即多模态参考解决系统(MARRS)。
  • methods: 这篇论文使用了不同的机器学习模型来处理上下文,包括参考解决模型和上下文 rewrite 模型。
  • results: 这篇论文介绍了 MARRS 如何处理多种上下文,包括对话上下文、视觉上下文和背景上下文,同时保护用户隐私。
    Abstract Successfully handling context is essential for any dialog understanding task. This context maybe be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system, responsible for handling conversational, visual and background context. In particular, we present different machine learning models to enable handing contextual queries; specifically, one to enable reference resolution, and one to handle context via query rewriting. We also describe how these models complement each other to form a unified, coherent, lightweight system that can understand context while preserving user privacy.
    摘要 成功处理上下文对任何对话理解任务都至关重要。这种上下文可能是会话式的(依赖此前的用户查询或系统回复)、视觉式的(依赖用户所见内容,例如屏幕上的内容),或是背景式的(基于铃声响起或音乐播放等信号)。在这项工作中,我们介绍了MARRS,即多模态指代消解系统(Multimodal Reference Resolution System),它是自然语言理解系统中的一个端上框架,负责处理会话、视觉和背景上下文。特别是,我们介绍了用于处理上下文查询的不同机器学习模型:一个用于指代消解,另一个通过查询改写来处理上下文。我们还描述了这些模型如何相互配合,构成一个统一、连贯、轻量的系统,能够在理解上下文的同时保护用户隐私。