cs.AI - 2023-11-13

Parrot-Trained Adversarial Examples: Pushing the Practicality of Black-Box Audio Attacks against Speaker Recognition Models

  • paper_url: http://arxiv.org/abs/2311.07780
  • repo_url: None
  • paper_authors: Rui Duan, Zhe Qu, Leah Ding, Yao Liu, Zhuo Lu
  • for: Pushes the practicality of black-box audio attacks against real-world speaker recognition systems by minimizing the attacker's knowledge about the target model.
  • methods: Proposes a new mechanism, parrot training (PT), to generate adversarial examples against the target model. Motivated by recent voice conversion (VC) techniques, it uses knowledge of a single short speech sample to synthesize additional "parrot speech" samples that sound like the target speaker, which are then used to train a PT surrogate model for the attacker.
  • results: Experiments show that the resulting PT-AEs achieve attack success rates of 45.8%-80.8% against open-source models in the digital-line scenario and 47.9%-58.3% against smart devices, including Apple HomePod (Siri), Amazon Echo, and Google Home, in the over-the-air scenario.
    Abstract Audio adversarial examples (AEs) have posed significant security challenges to real-world speaker recognition systems. Most black-box attacks still require certain information from the speaker recognition model to be effective (e.g., keeping probing and requiring the knowledge of similarity scores). This work aims to push the practicality of the black-box attacks by minimizing the attacker's knowledge about a target speaker recognition model. Although it is not feasible for an attacker to succeed with completely zero knowledge, we assume that the attacker only knows a short (or a few seconds) speech sample of a target speaker. Without any probing to gain further knowledge about the target model, we propose a new mechanism, called parrot training, to generate AEs against the target model. Motivated by recent advancements in voice conversion (VC), we propose to use the one short sentence knowledge to generate more synthetic speech samples that sound like the target speaker, called parrot speech. Then, we use these parrot speech samples to train a parrot-trained (PT) surrogate model for the attacker. Under a joint transferability and perception framework, we investigate different ways to generate AEs on the PT model (called PT-AEs) to ensure the PT-AEs can be generated with high transferability to a black-box target model with good human perceptual quality. Real-world experiments show that the resultant PT-AEs achieve the attack success rates of 45.8% - 80.8% against the open-source models in the digital-line scenario and 47.9% - 58.3% against smart devices, including Apple HomePod (Siri), Amazon Echo, and Google Home, in the over-the-air scenario.
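    Code sketch: a minimal illustration of the surrogate-plus-transfer idea described above, assuming a generic PyTorch speaker-classification model; the function names, step sizes, and the plain PGD loop are illustrative assumptions, not the paper's actual attack.
```python
import torch
import torch.nn.functional as F

def train_surrogate(surrogate, parrot_samples, labels, epochs=10, lr=1e-3):
    """Fine-tune a surrogate speaker model on VC-synthesized 'parrot speech'
    waveforms (each a 1-D tensor) labeled as the target speaker."""
    opt = torch.optim.Adam(surrogate.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in zip(parrot_samples, labels):
            opt.zero_grad()
            loss = F.cross_entropy(surrogate(x.unsqueeze(0)), y.unsqueeze(0))
            loss.backward()
            opt.step()
    return surrogate

def generate_pt_ae(surrogate, carrier, target, eps=0.002, steps=100):
    """Craft a PT-AE by projected gradient descent on the surrogate; eps is
    the L-infinity budget trading transferability against audibility."""
    delta = torch.zeros_like(carrier, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(surrogate((carrier + delta).unsqueeze(0)),
                               target.unsqueeze(0))
        loss.backward()
        with torch.no_grad():
            delta -= (eps / 10) * delta.grad.sign()  # targeted: descend the loss
            delta.clamp_(-eps, eps)                  # project back into budget
        delta.grad.zero_()
    return (carrier + delta).detach()
```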

GreekT5: A Series of Greek Sequence-to-Sequence Models for News Summarization

  • paper_url: http://arxiv.org/abs/2311.07767
  • repo_url: https://github.com/nc0der/greekt5
  • paper_authors: Nikolaos Giarelis, Charalampos Mastrokostas, Nikos Karacapilidis
  • for: Proposes a series of novel text summarization (TS) models for Greek news articles, a low-resource language underserved by existing work.
  • methods: Uses transformer-based deep learning models, trained and evaluated extensively on Greek news articles against GreekBART, the state-of-the-art model in Greek abstractive news summarization.
  • results: Most of the proposed models significantly outperform GreekBART across various evaluation metrics for summarizing Greek news articles.
    Abstract Text summarization (TS) is a natural language processing (NLP) subtask pertaining to the automatic formulation of a concise and coherent summary that covers the major concepts and topics from one or multiple documents. Recent advancements in deep learning have led to the development of abstractive summarization transformer-based models, which outperform classical approaches. In any case, research in this field focuses on high resource languages such as English, while the corresponding work for low resource languages is still underdeveloped. Taking the above into account, this paper proposes a series of novel TS models for Greek news articles. The proposed models were thoroughly evaluated on the same dataset against GreekBART, which is the state-of-the-art model in Greek abstractive news summarization. Our evaluation results reveal that most of the proposed models significantly outperform GreekBART on various evaluation metrics. We make our evaluation code public, aiming to increase the reproducibility of this work and facilitate future research in the field.
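    Code sketch: loading and running a T5-style Greek summarizer with Hugging Face transformers; the hub id below is a placeholder (see the repository for the released checkpoints), not a confirmed model name.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "nc0der/greekt5"  # hypothetical hub id; check the repo for real ones
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

article = "..."  # a Greek news article
inputs = tok(article, truncation=True, max_length=1024, return_tensors="pt")
ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tok.decode(ids[0], skip_special_tokens=True))
```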

Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain

  • paper_url: http://arxiv.org/abs/2311.07766
  • repo_url: None
  • paper_authors: Dota Tianai Dong, Mariya Toneva
  • for: Investigates the integration of multimodal information, an essential prerequisite for grounding AI systems in an understanding of the real world.
  • methods: Probes a pre-trained multimodal video transformer that jointly learns from vision, text, and sound, using brain recordings of participants watching a popular TV show.
  • results: Vision enhances masked prediction performance during language processing, but the model's joint representation does not capture brain-relevant information beyond that captured by the individual modalities; fine-tuning on a vision-language inference task improves brain alignment.
    Abstract Integrating information from multiple modalities is arguably one of the essential prerequisites for grounding artificial intelligence systems with an understanding of the real world. Recent advances in video transformers that jointly learn from vision, text, and sound over time have made some progress toward this goal, but the degree to which these models integrate information from modalities still remains unclear. In this work, we present a promising approach for probing a pre-trained multimodal video transformer model by leveraging neuroscientific evidence of multimodal information processing in the brain. Using brain recordings of participants watching a popular TV show, we analyze the effects of multi-modal connections and interactions in a pre-trained multi-modal video transformer on the alignment with uni- and multi-modal brain regions. We find evidence that vision enhances masked prediction performance during language processing, providing support that cross-modal representations in models can benefit individual modalities. However, we don't find evidence of brain-relevant information captured by the joint multi-modal transformer representations beyond that captured by all of the individual modalities. We finally show that the brain alignment of the pre-trained joint representation can be improved by fine-tuning using a task that requires vision-language inferences. Overall, our results paint an optimistic picture of the ability of multi-modal transformers to integrate vision and language in partially brain-relevant ways but also show that improving the brain alignment of these models may require new approaches.
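    Code sketch: one common way to quantify "brain alignment" (a standard encoding-model recipe, not necessarily the authors' exact procedure): fit a ridge regression from model-layer features to brain recordings and score held-out prediction correlation per voxel.
```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def brain_alignment(features, brain, alphas=(1.0, 10.0, 100.0)):
    """features: (n_timepoints, d) model activations aligned to the stimulus;
    brain: (n_timepoints, n_voxels) recordings. Returns mean held-out r."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        features, brain, test_size=0.2, shuffle=False)  # preserve time order
    enc = RidgeCV(alphas=alphas).fit(X_tr, Y_tr)
    pred = enc.predict(X_te)
    rs = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1]
          for v in range(brain.shape[1])]
    return float(np.mean(rs))
```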

The Disagreement Problem in Faithfulness Metrics

  • paper_url: http://arxiv.org/abs/2311.07763
  • repo_url: None
  • paper_authors: Brian Barr, Noah Fatsi, Leif Hancox-Li, Peter Richter, Daniel Proano, Caleb Mok
  • for: Evaluates explanations of black-box machine learning models.
  • methods: Compares multiple metrics intended to measure the faithfulness of local explanations on tabular classification problems.
  • results: The current metrics do not agree with one another, leaving users unsure how to choose the most faithful explanations.
    Abstract The field of explainable artificial intelligence (XAI) aims to explain how black-box machine learning models work. Much of the work centers around the holy grail of providing post-hoc feature attributions to any model architecture. While the pace of innovation around novel methods has slowed down, the question remains of how to choose a method, and how to make it fit for purpose. Recently, efforts around benchmarking XAI methods have suggested metrics for that purpose -- but there are many choices. That bounty of choice still leaves an end user unclear on how to proceed. This paper focuses on comparing metrics with the aim of measuring faithfulness of local explanations on tabular classification problems -- and shows that the current metrics do not agree, leaving users unsure how to choose the most faithful explanations.
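    Code sketch: one widely used faithfulness proxy (a "deletion"-style metric; the paper compares several such metrics rather than defining this one): occlude the top-k attributed features and measure the drop in the model's predicted probability.
```python
import numpy as np

def deletion_faithfulness(predict_proba, x, attribution, k, baseline=0.0):
    """predict_proba: callable returning the probability of the explained
    class for a 1-D feature vector; attribution: per-feature importances.
    A larger probability drop suggests a more faithful explanation."""
    top_k = np.argsort(-np.abs(attribution))[:k]
    x_occluded = np.array(x, dtype=float)
    x_occluded[top_k] = baseline          # replace the most important features
    return predict_proba(x) - predict_proba(x_occluded)
```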

Amodal Optical Flow

  • paper_url: http://arxiv.org/abs/2311.07761
  • repo_url: None
  • paper_authors: Maximilian Luz, Rohit Mohan, Ahmed Rida Sekkat, Oliver Sawade, Elmar Matthes, Thomas Brox, Abhinav Valada
  • for: Addresses optical flow estimation in scenes with transparent or occluded objects by introducing Amodal Optical Flow, which integrates optical flow with amodal perception.
  • methods: Defines the new task of amodal optical flow estimation and extends the AmodalSynthDrive dataset with pixel-level labels for it. Proposes AmodalFlowNet as an initial approach, consisting of a transformer-based cost-volume encoder paired with a recurrent transformer decoder that facilitates recurrent hierarchical feature propagation and amodal semantic grounding.
  • results: Extensive experiments demonstrate the tractability of amodal optical flow and its utility for downstream tasks such as panoptic tracking; the Amodal Flow Quality metric is introduced to quantify performance in an interpretable manner.
    Abstract Optical flow estimation is very challenging in situations with transparent or occluded objects. In this work, we address these challenges at the task level by introducing Amodal Optical Flow, which integrates optical flow with amodal perception. Instead of only representing the visible regions, we define amodal optical flow as a multi-layered pixel-level motion field that encompasses both visible and occluded regions of the scene. To facilitate research on this new task, we extend the AmodalSynthDrive dataset to include pixel-level labels for amodal optical flow estimation. We present several strong baselines, along with the Amodal Flow Quality metric to quantify the performance in an interpretable manner. Furthermore, we propose the novel AmodalFlowNet as an initial step toward addressing this task. AmodalFlowNet consists of a transformer-based cost-volume encoder paired with a recurrent transformer decoder which facilitates recurrent hierarchical feature propagation and amodal semantic grounding. We demonstrate the tractability of amodal optical flow in extensive experiments and show its utility for downstream tasks such as panoptic tracking. We make the dataset, code, and trained models publicly available at http://amodal-flow.cs.uni-freiburg.de.
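    Code sketch: the data layout implied by "multi-layered pixel-level motion field" and a per-layer endpoint error, with names that are illustrative rather than taken from the paper or its Amodal Flow Quality metric.
```python
import torch

def amodal_epe(pred, gt, valid):
    """pred, gt: (L, 2, H, W) flow fields for L amodal scene layers;
    valid: (L, H, W) boolean mask marking pixels where a layer exists,
    whether visible or occluded. Returns mean endpoint error."""
    epe = torch.linalg.norm(pred - gt, dim=1)  # (L, H, W) per-pixel error
    return epe[valid].mean()
```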

Enabling High-Level Machine Reasoning with Cognitive Neuro-Symbolic Systems

  • paper_url: http://arxiv.org/abs/2311.07759
  • repo_url: None
  • paper_authors: Alessandro Oltramari
  • for: Aims to equip AI systems with high-level reasoning so that they exhibit more robust behavior in novel situations across application domains.
  • methods: Proposes integrating cognitive architectures with external neuro-symbolic components, illustrated by a hybrid framework centered on ACT-R, and discusses the role of generative models in recent and future applications.
  • results: The proposed approach targets capabilities current AI systems lack, such as the commonsense competence missing from large language models and the performance degradation self-driving vehicles suffer in unseen scenarios.
    Abstract High-level reasoning can be defined as the capability to generalize over knowledge acquired via experience, and to exhibit robust behavior in novel situations. Such form of reasoning is a basic skill in humans, who seamlessly use it in a broad spectrum of tasks, from language communication to decision making in complex situations. When it manifests itself in understanding and manipulating the everyday world of objects and their interactions, we talk about common sense or commonsense reasoning. State-of-the-art AI systems don't possess such capability: for instance, Large Language Models have recently become popular by demonstrating remarkable fluency in conversing with humans, but they still make trivial mistakes when probed for commonsense competence; on a different level, performance degradation outside training data prevents self-driving vehicles to safely adapt to unseen scenarios, a serious and unsolved problem that limits the adoption of such technology. In this paper we propose to enable high-level reasoning in AI systems by integrating cognitive architectures with external neuro-symbolic components. We illustrate a hybrid framework centered on ACT-R and we discuss the role of generative models in recent and future applications.

SynthEnsemble: A Fusion of CNN, Vision Transformer, and Hybrid Models for Multi-Label Chest X-Ray Classification

  • paper_url: http://arxiv.org/abs/2311.07750
  • repo_url: None
  • paper_authors: S. M. Nabil Ashraf, Md. Adyelullahil Mamun, Hasnat Md. Abdullah, Md. Golam Rabiul Alam
  • for: Aims to automate the diagnosis of thoracic diseases from chest X-rays using deep learning, to support early detection and effective treatment.
  • methods: Experiments on the "ChestX-ray14" dataset with various pre-trained convolutional neural networks (CNNs), transformers, hybrid (CNN+Transformer) models, and classical models; the best individual model was CoAtNet, with an area under the receiver operating characteristic curve (AUROC) of 84.2%.
  • results: Combining the predictions of all trained models in a weighted-average ensemble, with each model's weight determined by differential evolution, further improved the AUROC to 85.4%, outperforming other state-of-the-art methods and demonstrating the value of ensemble deep learning for this task.
    Abstract Chest X-rays are widely used to diagnose thoracic diseases, but the lack of detailed information about these abnormalities makes it challenging to develop accurate automated diagnosis systems, which is crucial for early detection and effective treatment. To address this challenge, we employed deep learning techniques to identify patterns in chest X-rays that correspond to different diseases. We conducted experiments on the "ChestX-ray14" dataset using various pre-trained CNNs, transformers, hybrid(CNN+Transformer) models and classical models. The best individual model was the CoAtNet, which achieved an area under the receiver operating characteristic curve (AUROC) of 84.2%. By combining the predictions of all trained models using a weighted average ensemble where the weight of each model was determined using differential evolution, we further improved the AUROC to 85.4%, outperforming other state-of-the-art methods in this field. Our findings demonstrate the potential of deep learning techniques, particularly ensemble deep learning, for improving the accuracy of automatic diagnosis of thoracic diseases from chest X-rays.
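    Code sketch: the weighted-average ensemble with weights chosen by differential evolution, as described above, scored by macro AUROC on a validation split; variable names are illustrative.
```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.metrics import roc_auc_score

def fit_ensemble_weights(val_probs, y_val):
    """val_probs: list of (n_samples, n_labels) probability arrays, one per
    model; y_val: (n_samples, n_labels) binary labels."""
    P = np.stack(val_probs)                        # (n_models, n, L)

    def neg_auroc(w):
        w = np.abs(w) / (np.abs(w).sum() + 1e-12)  # normalize to a convex combo
        blended = np.tensordot(w, P, axes=1)       # (n, L) weighted average
        return -roc_auc_score(y_val, blended, average="macro")

    res = differential_evolution(neg_auroc, bounds=[(0, 1)] * len(val_probs),
                                 seed=0, maxiter=50)
    return np.abs(res.x) / np.abs(res.x).sum()
```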

Simplifying Complex Observation Models in Continuous POMDP Planning with Probabilistic Guarantees and Practice

  • paper_url: http://arxiv.org/abs/2311.07745
  • repo_url: None
  • paper_authors: Idan Lev-Yehudi, Moran Barenboim, Vadim Indelman
  • for: Addresses partially observable Markov decision processes (POMDPs) with high-dimensional continuous observations, such as camera images, which many real-life robotics and planning problems require but which demand heavy computation.
  • methods: Machine-learned probabilistic observation models are too computationally expensive for online deployment, so the paper plans with simplified observation models while retaining formal guarantees on solution quality.
  • results: The main contribution is a novel probabilistic bound based on the statistical total variation distance of the simplified model, which bounds the theoretical POMDP value under the original model by the empirical planned value under the simplified one, generalizing recent particle-belief MDP concentration bounds. The calculation separates into offline and online parts, yielding formal guarantees without accessing the costly model during planning; simulations show how to integrate the bound into an existing continuous online POMDP solver.
    Abstract Solving partially observable Markov decision processes (POMDPs) with high dimensional and continuous observations, such as camera images, is required for many real life robotics and planning problems. Recent researches suggested machine learned probabilistic models as observation models, but their use is currently too computationally expensive for online deployment. We deal with the question of what would be the implication of using simplified observation models for planning, while retaining formal guarantees on the quality of the solution. Our main contribution is a novel probabilistic bound based on a statistical total variation distance of the simplified model. We show that it bounds the theoretical POMDP value w.r.t. original model, from the empirical planned value with the simplified model, by generalizing recent results of particle-belief MDP concentration bounds. Our calculations can be separated into offline and online parts, and we arrive at formal guarantees without having to access the costly model at all during planning, which is also a novel result. Finally, we demonstrate in simulation how to integrate the bound into the routine of an existing continuous online POMDP solver.
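    Code sketch: a toy illustration of the quantity driving the guarantee, the total variation distance between the original and simplified observation models; the value-gap constant below is a placeholder, not the paper's theorem.
```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

# Generic value-gap bound of the form |V_original - V_simplified| <= C * d_TV,
# where C depends on horizon and reward scale (illustrative only):
p, q = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1]
C = 10.0  # placeholder constant
print("d_TV =", tv_distance(p, q), "bound <=", C * tv_distance(p, q))
```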

Generalization Analogies (GENIES): A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains

  • paper_url: http://arxiv.org/abs/2311.07723
  • repo_url: https://github.com/joshuaclymer/genies
  • paper_authors: Joshua Clymer, Garrett Baker, Rohan Subramani, Sam Wang
  • for: Aims to control how reward models for LLMs generalize human feedback to situations where that feedback is unreliable.
  • methods: Crafts 69 distribution shifts spanning 8 categories to study how reward models generalize.
  • results: Reward models do not learn to evaluate instruction-following by default and instead favor personas that resemble internet text; standard fine-tuning, and even techniques that interpret reward models' internal representations, frequently fail to distinguish instruction-following from conflated behaviors.
    Abstract As AI systems become more intelligent and their behavior becomes more challenging to assess, they may learn to game the flaws of human feedback instead of genuinely striving to follow instructions; however, this risk can be mitigated by controlling how LLMs generalize human feedback to situations where it is unreliable. To better understand how reward models generalize, we craft 69 distribution shifts spanning 8 categories. We find that reward models do not learn to evaluate `instruction-following' by default and instead favor personas that resemble internet text. Techniques for interpreting reward models' internal representations achieve better generalization than standard fine-tuning, but still frequently fail to distinguish instruction-following from conflated behaviors. We consolidate the 15 most challenging distribution shifts into the GENaralization analogIES (GENIES) benchmark, which we hope will enable progress toward controlling reward model generalization.
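    Code sketch: one way to score a reward model's generalization on preference pairs per distribution shift; names are illustrative and not the GENIES benchmark API.
```python
import torch

def preference_accuracy(reward_model, pairs):
    """pairs: iterable of (chosen_tokens, rejected_tokens) tensors; a pair is
    correct if the chosen response gets the higher scalar reward."""
    correct = 0
    for chosen, rejected in pairs:
        with torch.no_grad():
            correct += int(reward_model(chosen) > reward_model(rejected))
    return correct / len(pairs)

# Evaluate each shift separately to expose where generalization fails, e.g.:
# accs = {name: preference_accuracy(rm, ps) for name, ps in shifts.items()}
```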

PolyIE: A Dataset of Information Extraction from Polymer Material Scientific Literature

  • paper_url: http://arxiv.org/abs/2311.07715
  • repo_url: https://github.com/jerry3027/polyie
  • paper_authors: Jerry Junyang Cheung, Yuchen Zhuang, Yinghao Li, Pranav Shetty, Wantian Zhao, Sanjeev Grampurohit, Rampi Ramprasad, Chao Zhang
  • for: Provides a scientific information extraction (SciIE) dataset for polymer materials, an important and ubiquitous class of materials for which no such dataset exists, to advance research in this area.
  • methods: The dataset is curated from 146 full-length polymer scholarly articles, annotated by domain experts with named entities (materials, properties, values, conditions) and their N-ary relations.
  • results: State-of-the-art named entity extraction and relation extraction models are evaluated on POLYIE, with analyses of their strengths, weaknesses, and difficult cases.
    Abstract Scientific information extraction (SciIE), which aims to automatically extract information from scientific literature, is becoming more important than ever. However, there are no existing SciIE datasets for polymer materials, which is an important class of materials used ubiquitously in our daily lives. To bridge this gap, we introduce POLYIE, a new SciIE dataset for polymer materials. POLYIE is curated from 146 full-length polymer scholarly articles, which are annotated with different named entities (i.e., materials, properties, values, conditions) as well as their N-ary relations by domain experts. POLYIE presents several unique challenges due to diverse lexical formats of entities, ambiguity between entities, and variable-length relations. We evaluate state-of-the-art named entity extraction and relation extraction models on POLYIE, analyze their strengths and weaknesses, and highlight some difficult cases for these models. To the best of our knowledge, POLYIE is the first SciIE benchmark for polymer materials, and we hope it will lead to more research efforts from the community on this challenging task. Our code and data are available on: https://github.com/jerry3027/PolyIE.
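    Code sketch: the standard span-level F1 one would report for named-entity extraction on a SciIE benchmark like POLYIE (illustrative; not the repository's evaluation code).
```python
def span_f1(pred, gold):
    """pred, gold: sets of (doc_id, start, end, label) tuples."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```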

Histopathologic Cancer Detection

  • paper_url: http://arxiv.org/abs/2311.07711
  • repo_url: https://github.com/lbasyal/Histopathologic-Cancer-Detection-
  • paper_authors: Varan Singh Rohila, Neeraj Lalwani, Lochan Basyal
  • for: Early diagnosis of cancer cells from histopathological images, to support effective treatment planning and patient safety.
  • methods: Trains multi-layer perceptron (MLP) and convolutional neural network (CNN) models to classify hematoxylin-eosin (HE) stained histopathological images from the PatchCamelyon benchmark.
  • results: The baseline CNN outperforms the baseline MLP, and a ResNet50 model with data augmentation beats the state-of-the-art model; evaluations of majority-vote and concatenation ensembles point toward transfer learning and segmentation for understanding specific features.
    Abstract Early diagnosis of the cancer cells is necessary for making an effective treatment plan and for the health and safety of a patient. Nowadays, doctors usually use a histological grade that pathologists determine by performing a semi-quantitative analysis of the histopathological and cytological features of hematoxylin-eosin (HE) stained histopathological images. This research contributes a potential classification model for cancer prognosis to efficiently utilize the valuable information underlying the HE-stained histopathological images. This work uses the PatchCamelyon benchmark datasets and trains them in a multi-layer perceptron and convolution model to observe the model's performance in terms of precision, Recall, F1 Score, Accuracy, and AUC Score. The evaluation result shows that the baseline convolution model outperforms the baseline MLP model. Also, this paper introduced ResNet50 and InceptionNet models with data augmentation, where ResNet50 is able to beat the state-of-the-art model. Furthermore, the majority vote and concatenation ensemble were evaluated and provided the future direction of using transfer learning and segmentation to understand the specific features.
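    Code sketch: a standard torchvision transfer-learning setup for the ResNet50 model mentioned above on binary PatchCamelyon patches (tumor vs. normal); not the authors' exact configuration.
```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)  # tumor vs. normal patch
# Fine-tune with cross-entropy (plus augmentation); report AUROC on held-out data.
```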

Reinforcement Learning for Solving Stochastic Vehicle Routing Problem

  • paper_url: http://arxiv.org/abs/2311.07708
  • repo_url: None
  • paper_authors: Zangir Iklassov, Ikboljon Sobirov, Ruben Solozabal, Martin Takac
  • for: Addresses the underuse of reinforcement learning (RL) and machine learning (ML) techniques for the Stochastic Vehicle Routing Problem (SVRP) by proposing a novel end-to-end framework.
  • methods: Proposes a simple yet effective RL agent with a tailored training method that comprehensively handles the key sources of stochasticity in SVRP.
  • results: In comparative analysis, the proposed model outperforms a widely adopted state-of-the-art metaheuristic, achieving a 3.43% reduction in travel costs, and exhibits robustness across diverse SVRP settings, demonstrating its ability to learn optimal routing strategies in varying environments.
    Abstract This study addresses a gap in the utilization of Reinforcement Learning (RL) and Machine Learning (ML) techniques in solving the Stochastic Vehicle Routing Problem (SVRP) that involves the challenging task of optimizing vehicle routes under uncertain conditions. We propose a novel end-to-end framework that comprehensively addresses the key sources of stochasticity in SVRP and utilizes an RL agent with a simple yet effective architecture and a tailored training method. Through comparative analysis, our proposed model demonstrates superior performance compared to a widely adopted state-of-the-art metaheuristic, achieving a significant 3.43% reduction in travel costs. Furthermore, the model exhibits robustness across diverse SVRP settings, highlighting its adaptability and ability to learn optimal routing strategies in varying environments. The publicly available implementation of our framework serves as a valuable resource for future research endeavors aimed at advancing RL-based solutions for SVRP.
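    Code sketch: a tiny REINFORCE-style loop conveying "an RL agent learns routing decisions under stochastic travel costs"; the paper's architecture and tailored training method are more elaborate, and `env`/`policy` are assumed interfaces.
```python
import torch

def reinforce_step(policy, env, optimizer):
    """policy maps a state to logits over the next customer to visit;
    env.step returns (next_state, stochastic_cost, done)."""
    state, log_probs, costs = env.reset(), [], []
    done = False
    while not done:
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, cost, done = env.step(action)
        costs.append(cost)
    # REINFORCE for cost minimization: push down log-probs of costly routes.
    loss = torch.stack(log_probs).sum() * sum(costs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```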

Robust and Scalable Hyperdimensional Computing With Brain-Like Neural Adaptations

  • paper_url: http://arxiv.org/abs/2311.07705
  • repo_url: None
  • paper_authors: Junyao Wang, Mohammad Abdullah Al Faruque
  • for: Aims to make edge-based machine learning (ML) efficient enough for real-time analytics on Internet of Things (IoT) systems.
  • methods: Builds on brain-inspired hyperdimensional computing (HDC) and proposes dynamic HDC learning frameworks that identify and regenerate undesired dimensions, inspired by how neurons in human brains dynamically regenerate when learning new information.
  • results: The dynamic frameworks deliver adequate accuracy at significantly lower dimensionalities, accelerating both training and inference on edge-based systems.
    Abstract The Internet of Things (IoT) has facilitated many applications utilizing edge-based machine learning (ML) methods to analyze locally collected data. Unfortunately, popular ML algorithms often require intensive computations beyond the capabilities of today's IoT devices. Brain-inspired hyperdimensional computing (HDC) has been introduced to address this issue. However, existing HDCs use static encoders, requiring extremely high dimensionality and hundreds of training iterations to achieve reasonable accuracy. This results in a huge efficiency loss, severely impeding the application of HDCs in IoT systems. We observed that a main cause is that the encoding module of existing HDCs lacks the capability to utilize and adapt to information learned during training. In contrast, neurons in human brains dynamically regenerate all the time and provide more useful functionalities when learning new information. While the goal of HDC is to exploit the high-dimensionality of randomly generated base hypervectors to represent the information as a pattern of neural activity, it remains challenging for existing HDCs to support a similar behavior as brain neural regeneration. In this work, we present dynamic HDC learning frameworks that identify and regenerate undesired dimensions to provide adequate accuracy with significantly lowered dimensionalities, thereby accelerating both the training and inference.
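    Code sketch: the "regenerate undesired dimensions" idea in hyperdimensional computing, using per-dimension class-hypervector variance as a simple separability heuristic; this is an illustrative criterion, not the paper's exact one.
```python
import numpy as np

def regenerate_dims(base, class_hvs, frac=0.1, rng=np.random.default_rng(0)):
    """base: (d, n_features) random projection matrix; class_hvs:
    (n_classes, d) trained class hypervectors. Dimensions where the classes
    barely differ carry little information, so they are re-randomized."""
    separability = class_hvs.var(axis=0)               # per-dimension variance
    worst = np.argsort(separability)[: int(frac * base.shape[0])]
    base[worst] = rng.standard_normal((len(worst), base.shape[1]))
    return base, worst  # retrain class hypervectors on the regenerated dims
```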

AuthentiGPT: Detecting Machine-Generated Text via Black-Box Language Models Denoising

  • paper_url: http://arxiv.org/abs/2311.07700
  • repo_url: None
  • paper_authors: Zhen Guo, Shangdi Yu
  • for: Detects whether a given text was generated by a large language model (LLM) or written by a human.
  • methods: Proposes AuthentiGPT, an efficient classifier that uses a black-box LLM to denoise input text with artificially added noise and then semantically compares the denoised text with the original to decide whether the content is machine-generated.
  • results: With only one trainable parameter, AuthentiGPT achieves a 0.918 AUROC score on a domain-specific dataset, outperforming commercial algorithms at detecting LLM-generated text.
    Abstract Large language models (LLMs) have opened up enormous opportunities while simultaneously posing ethical dilemmas. One of the major concerns is their ability to create text that closely mimics human writing, which can lead to potential misuse, such as academic misconduct, disinformation, and fraud. To address this problem, we present AuthentiGPT, an efficient classifier that distinguishes between machine-generated and human-written texts. Under the assumption that human-written text resides outside the distribution of machine-generated text, AuthentiGPT leverages a black-box LLM to denoise input text with artificially added noise, and then semantically compares the denoised text with the original to determine if the content is machine-generated. With only one trainable parameter, AuthentiGPT eliminates the need for a large training dataset, watermarking the LLM's output, or computing the log-likelihood. Importantly, the detection capability of AuthentiGPT can be easily adapted to any generative language model. With a 0.918 AUROC score on a domain-specific dataset, AuthentiGPT demonstrates its effectiveness over other commercial algorithms, highlighting its potential for detecting machine-generated text in academic settings.
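    Code sketch: the detection loop as the abstract describes it: perturb the text, have a black-box LLM denoise it, then compare the denoised text with the original. `llm_denoise` and `embed` are assumed wrappers around whatever API and encoder are available, and the threshold is the single trainable parameter.
```python
import random

def perturb(text, p=0.15, rng=random.Random(0)):
    """Artificially added 'noise': random word dropout."""
    return " ".join(w for w in text.split() if rng.random() > p)

def is_machine_generated(text, llm_denoise, embed, threshold):
    denoised = llm_denoise(perturb(text))   # black-box LLM rewrites the text
    a, b = embed(text), embed(denoised)
    sim = sum(x * y for x, y in zip(a, b)) / (
        (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))
    # Machine-generated text lies closer to the LLM's own distribution, so
    # denoising reconstructs it more faithfully (higher similarity).
    return sim >= threshold
```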

On The Truthfulness of ‘Surprisingly Likely’ Responses of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.07692
  • repo_url: None
  • paper_authors: Naman Goel
  • for: The paper is written to investigate the relevance of the surprisingly likely criterion for responses of large language models (LLMs).
  • methods: The paper uses a game-theoretic multi-agent setting to reward rational agents for maximizing the expected information gain with their answers, based on their probabilistic beliefs.
  • results: The paper shows that the method improves the accuracy of LLMs’ responses significantly, with up to 24 percentage points aggregate improvement on the TruthfulQA benchmark and up to 70 percentage points improvement on individual categories of questions.
    Abstract The surprisingly likely criterion in the seminal work of Prelec (the Bayesian Truth Serum) guarantees truthfulness in a game-theoretic multi-agent setting, by rewarding rational agents to maximise the expected information gain with their answers w.r.t. their probabilistic beliefs. We investigate the relevance of a similar criterion for responses of LLMs. We hypothesize that if the surprisingly likely criterion works in LLMs, under certain conditions, the responses that maximize the reward under this criterion should be more accurate than the responses that only maximize the posterior probability. Using benchmarks including the TruthfulQA benchmark and using openly available LLMs: GPT-2 and LLaMA-2, we show that the method indeed improves the accuracy significantly (for example, up to 24 percentage points aggregate improvement on TruthfulQA and up to 70 percentage points improvement on individual categories of questions).
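    Code sketch: the "surprisingly likely" selection rule in miniature: prefer the candidate answer whose probability under the question-conditioned model most exceeds its unconditional probability. `logp` is an assumed wrapper returning the LLM's total log-probability of a text given a context.
```python
def surprisingly_likely(candidates, question, logp):
    scores = {}
    for ans in candidates:
        posterior = logp(ans, context=question)  # log p(answer | question)
        prior = logp(ans, context="")            # log p(answer) alone
        scores[ans] = posterior - prior          # expected-information-gain proxy
    return max(scores, key=scores.get)
```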

Language Model-In-The-Loop: Data Optimal Approach to Learn-To-Recommend Actions in Text Games

  • paper_url: http://arxiv.org/abs/2311.07687
  • repo_url: None
  • paper_authors: Arjun Vaithilingam Sudhakar, Prasanna Parthasarathi, Janarthanan Rajendran, Sarath Chandar
  • for: Improving performance in text-based games.
  • methods: Updates the LLM used for action-candidate recommendation during the learning of the text-based game itself, reducing reliance on costly human-annotated gameplays.
  • results: Updating the LLM during learning with carefully selected in-game transitions reduces the dependency on human-annotated gameplays for fine-tuning, but transferring in-game-trained models to other games did not yield consistent transfer.
    Abstract Large Language Models (LLMs) have demonstrated superior performance in language understanding benchmarks. CALM, a popular approach, leverages linguistic priors of LLMs -- GPT-2 -- for action candidate recommendations to improve the performance in text games in Jericho without environment-provided actions. However, CALM adapts GPT-2 with annotated human gameplays and keeps the LLM fixed during the learning of the text based games. In this work, we explore and evaluate updating LLM used for candidate recommendation during the learning of the text based game as well to mitigate the reliance on the human annotated gameplays, which are costly to acquire. We observe that by updating the LLM during learning using carefully selected in-game transitions, we can reduce the dependency on using human annotated game plays for fine-tuning the LLMs. We conducted further analysis to study the transferability of the updated LLMs and observed that transferring in-game trained models to other games did not result in a consistent transfer.
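    Code sketch: the loop described above, with the LLM proposing action candidates and being periodically fine-tuned on carefully selected in-game transitions instead of human-annotated gameplays; `propose`, `select`, and `finetune` are assumed helpers, not the paper's API.
```python
def play_and_update(env, agent, lm, propose, select, finetune,
                    episodes=100, update_every=10):
    buffer = []
    for ep in range(episodes):
        obs, done = env.reset(), False
        while not done:
            candidates = propose(lm, obs)      # LM recommends action candidates
            action = agent.choose(obs, candidates)
            obs, reward, done = env.step(action)
            buffer.append((obs, action, reward))
        if (ep + 1) % update_every == 0:
            finetune(lm, select(buffer))       # update the LM itself in-game
```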

Fuse to Forget: Bias Reduction and Selective Memorization through Model Fusion

  • paper_url: http://arxiv.org/abs/2311.07682
  • repo_url: https://github.com/keremzaman/fusetoforget
  • paper_authors: Kerem Zaman, Leshem Choshen, Shashank Srivastava
  • for: Investigates whether and how model fusion can interfere with and reduce unwanted knowledge.
  • methods: Fuses the weights of multiple fine-tuned language models and analyzes, across text classification and generation tasks, the effects on learned shortcuts, social biases, and memorization capabilities.
  • results: Shared knowledge among models is usually enhanced during fusion, while unshared knowledge is usually lost or forgotten; this makes model fusion a potential debiasing tool and a way to address privacy concerns associated with language models.
    Abstract Model fusion research aims to aggregate the knowledge of multiple models to enhance performance by combining their weights. In this work, we study the inverse, investigating whether and how can model fusion interfere and reduce unwanted knowledge. We delve into the effects of model fusion on the evolution of learned shortcuts, social biases, and memorization capabilities in fine-tuned language models. Through several experiments covering text classification and generation tasks, our analysis highlights that shared knowledge among models is usually enhanced during model fusion, while unshared knowledge is usually lost or forgotten. Based on this observation, we demonstrate the potential of model fusion as a debiasing tool and showcase its efficacy in addressing privacy concerns associated with language models.
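    Code sketch: minimal weight-space model fusion (parameter averaging) of the kind the paper studies; in their analysis, knowledge shared across the fused models tends to survive averaging while model-specific knowledge fades.
```python
import torch

def fuse(models, weights=None):
    """Average the parameters of same-architecture models; returns a state
    dict to load with model.load_state_dict(...)."""
    weights = weights or [1.0 / len(models)] * len(models)
    fused = {}
    for k, v in models[0].state_dict().items():
        if v.is_floating_point():
            fused[k] = sum(w * m.state_dict()[k] for w, m in zip(weights, models))
        else:
            fused[k] = v.clone()  # integer buffers: keep the first model's
    return fused
```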

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

  • paper_url: http://arxiv.org/abs/2311.07575
  • repo_url: https://github.com/alpha-vllm/llama2-accessory
  • paper_authors: Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, Yu Qiao
  • for: Presents SPHINX, a versatile multimodal large language model (MLLM) that jointly mixes model weights, tuning tasks, and visual embeddings for stronger vision-language understanding.
  • methods: Unfreezes the LLM during pre-training and mixes the weights of LLMs trained on real-world and synthetic data, efficiently incorporating diverse semantics with favorable robustness. Also mixes a variety of visual tasks for joint instruction tuning, with task-specific instructions to avoid inter-task conflict, and extracts comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularities.
  • results: SPHINX exhibits superior multimodal understanding across applications including visual question answering, region-level understanding, caption grounding, document layout detection, and human pose estimation. An efficient strategy that mixes different scales and high-resolution sub-images further yields exceptional visual parsing and reasoning on existing benchmarks.
    Abstract We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may cast a light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
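    Code sketch: the "mixing of different scales and high-resolution sub-images" idea: encode a downsampled global view plus fixed-size crops of the high-resolution image and feed all views to the visual encoder (an illustrative layout, not SPHINX's implementation).
```python
import torch
import torch.nn.functional as F

def multi_scale_views(image, base=224):
    """image: (3, H, W) with H and W multiples of base. Returns the global
    view followed by each base x base sub-image."""
    global_view = F.interpolate(image.unsqueeze(0), size=(base, base),
                                mode="bilinear", align_corners=False)[0]
    views = [global_view]
    for i in range(0, image.shape[1], base):
        for j in range(0, image.shape[2], base):
            views.append(image[:, i:i + base, j:j + base])
    return torch.stack(views)  # one tensor of views for the visual encoder
```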

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

  • paper_url: http://arxiv.org/abs/2311.07562
  • repo_url: https://github.com/zzxslp/mm-navigator
  • paper_authors: An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang
  • for: Presents MM-Navigator, a GPT-4V-based agent for smartphone graphical user interface (GUI) navigation that can interact with a smartphone screen as human users do and determine subsequent actions to fulfill given instructions.
  • methods: Leverages large multimodal models (LMMs), specifically GPT-4V, whose advanced screen interpretation, action reasoning, and precise action-localization capabilities enable zero-shot GUI navigation.
  • results: By human assessment, MM-Navigator attains a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing correct actions for single-step instructions on a collected iOS screen dataset, and it outperforms previous GUI navigators zero-shot on a subset of an Android screen navigation dataset.
    Abstract We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users, and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through its advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibited a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where the model outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research into the GUI navigation task. The project page is at https://github.com/zzxslp/MM-Navigator.
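    Code sketch: the screen-navigation loop in outline: show the current screenshot plus the instruction to a multimodal model and execute the parsed action; `lmm` and `device` are assumed interfaces, and the JSON action format is an assumption rather than MM-Navigator's actual prompt protocol.
```python
import json

def navigate(lmm, device, instruction, max_steps=10):
    history = []
    for _ in range(max_steps):
        screen = device.screenshot()
        reply = lmm(images=[screen],
                    prompt=f"Instruction: {instruction}\n"
                           f"History: {history}\n"
                           'Reply with JSON like {"act": "tap", "target": "..."}')
        action = json.loads(reply)  # assumes the model returns pure JSON
        if action["act"] == "done":
            break
        device.execute(action)      # tap / type / scroll on the target element
        history.append(action)
```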

An Extensive Study on Adversarial Attack against Pre-trained Models of Code

  • paper_url: http://arxiv.org/abs/2311.07553
  • repo_url: https://github.com/cgcl-codes/attack_ptmc
  • paper_authors: Xiaohu Du, Ming Wen, Zichao Wei, Shangwen Wang, Hai Jin
  • for: Systematically evaluates adversarial attacks against Transformer-based pre-trained models of code (PTMC).
  • methods: Analyzes five state-of-the-art attack approaches from three perspectives: effectiveness, efficiency, and the quality of the generated adversarial examples.
  • results: None of the five approaches balances all three perspectives; identifier substitution within for and if statements proves most effective. Based on this, a new approach that prioritizes different statement types per task and uses beam search to generate adversarial examples outperforms the state-of-the-art ALERT in both effectiveness and efficiency while preserving the naturalness of the generated examples.
    Abstract Transformer-based pre-trained models of code (PTMC) have been widely utilized and have achieved state-of-the-art performance in many mission-critical applications. However, they can be vulnerable to adversarial attacks through identifier substitution or coding style transformation, which can significantly degrade accuracy and may further incur security concerns. Although several approaches have been proposed to generate adversarial examples for PTMC, the effectiveness and efficiency of such approaches, especially on different code intelligence tasks, has not been well understood. To bridge this gap, this study systematically analyzes five state-of-the-art adversarial attack approaches from three perspectives: effectiveness, efficiency, and the quality of generated examples. The results show that none of the five approaches balances all these perspectives. Particularly, approaches with a high attack success rate tend to be time-consuming; the adversarial code they generate often lack naturalness, and vice versa. To address this limitation, we explore the impact of perturbing identifiers under different contexts and find that identifier substitution within for and if statements is the most effective. Based on these findings, we propose a new approach that prioritizes different types of statements for various tasks and further utilizes beam search to generate adversarial examples. Evaluation results show that it outperforms the state-of-the-art ALERT in terms of both effectiveness and efficiency while preserving the naturalness of the generated adversarial examples.
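    Code sketch: identifier substitution with beam search, echoing the finding that renaming identifiers inside for and if statements is most effective; `score` is an assumed wrapper returning the victim model's loss, and the string-level renaming here is deliberately naive.
```python
import re

def rename(code, old, new):
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

def beam_attack(code, identifiers, candidates, score, beam=3, steps=2):
    """Keep the `beam` mutated snippets that most increase the model's loss."""
    frontier = [(score(code), code)]
    for _ in range(steps):
        expanded = []
        for _, snippet in frontier:
            for ident in identifiers:      # ideally those in for/if statements
                for cand in candidates:
                    mutated = rename(snippet, ident, cand)
                    expanded.append((score(mutated), mutated))
        frontier = sorted(expanded, reverse=True)[:beam]
    return frontier[0]  # (best adversarial score, adversarial code)
```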

GPT-4V(ision) as A Social Media Analysis Engine

  • paper_url: http://arxiv.org/abs/2311.07547
  • repo_url: https://github.com/vista-h/gpt-4v_social_media
  • paper_authors: Hanjia Lyu, Jinfa Huang, Daoan Zhang, Yongsheng Yu, Xinyi Mou, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, Jiebo Luo
  • for: Explores the potential of large multimodal models (LMMs) for social media content analysis.
  • methods: Evaluates GPT-4V on five representative tasks: sentiment analysis, hate speech detection, fake news identification, demographic inference, and political ideology detection, combining preliminary quantitative analysis on benchmark datasets with a review of qualitative samples.
  • results: GPT-4V demonstrates remarkable efficacy, with strengths in joint understanding of image-text pairs, contextual and cultural awareness, and extensive commonsense knowledge. Notable challenges remain: it struggles with multilingual social multimedia comprehension, has difficulty generalizing to the latest social media trends, and tends to generate erroneous information about evolving celebrity and politician knowledge, reflecting the known hallucination problem.
    Abstract Recent research has offered insights into the extraordinary capabilities of Large Multimodal Models (LMMs) in various general vision and language tasks. There is growing interest in how LMMs perform in more specialized domains. Social media content, inherently multimodal, blends text, images, videos, and sometimes audio. Understanding social multimedia content remains a challenging problem for contemporary machine learning frameworks. In this paper, we explore GPT-4V(ision)'s capabilities for social multimedia analysis. We select five representative tasks, including sentiment analysis, hate speech detection, fake news identification, demographic inference, and political ideology detection, to evaluate GPT-4V. Our investigation begins with a preliminary quantitative analysis for each task using existing benchmark datasets, followed by a careful review of the results and a selection of qualitative samples that illustrate GPT-4V's potential in understanding multimodal social media content. GPT-4V demonstrates remarkable efficacy in these tasks, showcasing strengths such as joint understanding of image-text pairs, contextual and cultural awareness, and extensive commonsense knowledge. Despite the overall impressive capacity of GPT-4V in the social media domain, there remain notable challenges. GPT-4V struggles with tasks involving multilingual social multimedia comprehension and has difficulties in generalizing to the latest trends in social media. Additionally, it exhibits a tendency to generate erroneous information in the context of evolving celebrity and politician knowledge, reflecting the known hallucination problem. The insights gleaned from our findings underscore a promising future for LMMs in enhancing our comprehension of social media content and its users through the analysis of multimodal information.

A Benchmark to Understand the Role of Knowledge Graphs on Large Language Model’s Accuracy for Question Answering on Enterprise SQL Databases

  • paper_url: http://arxiv.org/abs/2311.07509
  • repo_url: None
  • paper_authors: Juan Sequeda, Dean Allemang, Bryon Jacob
  • for: This paper aims to evaluate the accuracy of large language models (LLMs) in answering enterprise questions on SQL databases, and to explore the role of knowledge graphs (KGs) in improving accuracy.
  • methods: The paper introduces a benchmark consisting of an enterprise SQL schema, a range of enterprise queries, and a contextual layer incorporating an ontology and mappings that define a knowledge graph. The authors use GPT-4 with zero-shot prompts directly on SQL databases and evaluate its accuracy.
  • results: The authors find that question answering using GPT-4 achieves an accuracy of 16%, and that this accuracy increases to 54% when questions are posed over a knowledge graph representation of the enterprise SQL database. The results suggest that investing in knowledge graphs can provide higher accuracy for LLM-powered question answering systems.
    Abstract Enterprise applications of Large Language Models (LLMs) hold promise for question answering on enterprise SQL databases. However, the extent to which LLMs can accurately respond to enterprise questions in such databases remains unclear, given the absence of suitable Text-to-SQL benchmarks tailored to enterprise settings. Additionally, the potential of Knowledge Graphs (KGs) to enhance LLM-based question answering by providing business context is not well understood. This study aims to evaluate the accuracy of LLM-powered question answering systems in the context of enterprise questions and SQL databases, while also exploring the role of knowledge graphs in improving accuracy. To achieve this, we introduce a benchmark comprising an enterprise SQL schema in the insurance domain, a range of enterprise queries encompassing reporting to metrics, and a contextual layer incorporating an ontology and mappings that define a knowledge graph. Our primary finding reveals that question answering using GPT-4, with zero-shot prompts directly on SQL databases, achieves an accuracy of 16%. Notably, this accuracy increases to 54% when questions are posed over a Knowledge Graph representation of the enterprise SQL database. Therefore, investing in Knowledge Graph provides higher accuracy for LLM powered question answering systems.
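    Code sketch: the two zero-shot conditions the benchmark compares: question answering over the raw SQL schema versus over a knowledge-graph (ontology) representation; `llm` is an assumed completion function, and the prompt wording is illustrative.
```python
def ask_sql(llm, schema_ddl, question):
    return llm(f"Given this SQL schema:\n{schema_ddl}\n"
               f"Write a SQL query answering: {question}")

def ask_kg(llm, ontology_ttl, question):
    return llm(f"Given this OWL ontology (Turtle):\n{ontology_ttl}\n"
               f"Write a SPARQL query answering: {question}")

# The study reports 16% accuracy in the SQL condition vs. 54% over the KG.
```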

EvoFed: Leveraging Evolutionary Strategies for Communication-Efficient Federated Learning

  • paper_url: http://arxiv.org/abs/2311.07485
  • repo_url: None
  • paper_authors: Mohammad Mahdi Rahimi, Hasnain Irshad Bhatti, Younghyun Park, Humaira Kousar, Jaekyun Moon
  • for: Proposes EvoFed, an approach that integrates Evolutionary Strategies (ES) with Federated Learning (FL) to address FL's high communication costs for transmitting large numbers of model parameters.
  • methods: Employs "fitness-based information sharing": instead of exchanging updated model parameters, each node transmits a distance-based similarity measure between its locally updated model and each member of a noise-perturbed model population, which every node and the server generate identically using the same random seeds. The server aggregates these fitness values, updates the global model, and disseminates the global fitness vector back to the nodes.
  • results: Analysis shows that EvoFed converges, and experiments show it achieves performance comparable to FedAvg while drastically reducing overall communication requirements in various practical settings, at the cost of increased local processing load.
    Abstract Federated Learning (FL) is a decentralized machine learning paradigm that enables collaborative model training across dispersed nodes without having to force individual nodes to share data. However, its broad adoption is hindered by the high communication costs of transmitting a large number of model parameters. This paper presents EvoFed, a novel approach that integrates Evolutionary Strategies (ES) with FL to address these challenges. EvoFed employs a concept of 'fitness-based information sharing', deviating significantly from the conventional model-based FL. Rather than exchanging the actual updated model parameters, each node transmits a distance-based similarity measure between the locally updated model and each member of the noise-perturbed model population. Each node, as well as the server, generates an identical population set of perturbed models in a completely synchronized fashion using the same random seeds. With properly chosen noise variance and population size, perturbed models can be combined to closely reflect the actual model updated using the local dataset, allowing the transmitted similarity measures (or fitness values) to carry nearly the complete information about the model parameters. As the population size is typically much smaller than the number of model parameters, the savings in communication load is large. The server aggregates these fitness values and is able to update the global model. This global fitness vector is then disseminated back to the nodes, each of which applies the same update to be synchronized to the global model. Our analysis shows that EvoFed converges, and our experimental results validate that at the cost of increased local processing loads, EvoFed achieves performance comparable to FedAvg while reducing overall communication requirements drastically in various practical settings.
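    Code sketch: the "fitness-based information sharing" mechanism in outline: every node and the server regenerate the identical noise-perturbed population from a shared seed, so a client only transmits one fitness value per population member instead of the full parameter vector (an illustrative simplification of the paper).
```python
import numpy as np

def population(theta, pop_size, sigma, seed):
    rng = np.random.default_rng(seed)             # identical on every node
    return [theta + sigma * rng.standard_normal(theta.shape)
            for _ in range(pop_size)]

def client_fitness(theta_global, theta_local, pop_size, sigma, seed):
    """Similarity of the locally updated model to each perturbed model."""
    pop = population(theta_global, pop_size, sigma, seed)
    return np.array([-np.linalg.norm(theta_local - p) for p in pop])

def server_update(theta_global, fitness, pop_size, sigma, seed, lr=0.5):
    """fitness: fitness vector aggregated over clients."""
    pop = population(theta_global, pop_size, sigma, seed)
    w = np.exp(fitness - fitness.max())
    w /= w.sum()                                  # softmax over fitness values
    return theta_global + lr * sum(
        wi * (p - theta_global) for wi, p in zip(w, pop))
```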

Psychometric Predictive Power of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.07484
  • repo_url: None
  • paper_authors: Tatsuki Kuribayashi, Yohei Oseki, Timothy Baldwin
  • for: Studies how well language models simulate human reading behavior.
  • methods: Compares instruction-tuned large language models (LLMs) with base LLMs of equivalent perplexity, and explores prompting methodologies that reflect particular linguistic hypotheses.
  • results: Instruction tuning, despite making LLMs provide human-preferred responses, yields worse psychometric predictive power (PPP) for human reading behavior than base LLMs from a computational psycholinguistics perspective; hypothesis-reflecting prompts improve PPP but still fall short of direct probability measurements from base LLMs.
    Abstract Next-word probabilities from language models have been shown to successfully simulate human reading behavior. Building on this, we show that, interestingly, instruction-tuned large language models (LLMs) yield worse psychometric predictive power (PPP) for human reading behavior than base LLMs with equivalent perplexities. In other words, instruction tuning, which helps LLMs provide human-preferred responses, does not always make them human-like from the computational psycholinguistics perspective. In addition, we explore prompting methodologies in simulating human reading behavior with LLMs, showing that prompts reflecting a particular linguistic hypothesis lead LLMs to exhibit better PPP but are still worse than base LLMs. These highlight that recent instruction tuning and prompting do not offer better estimates than direct probability measurements from base LLMs in cognitive modeling.
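    Code sketch: the standard psychometric-predictive-power setup (not the paper's exact pipeline): compute per-word surprisal from a language model and test how much it improves a regression onto human reading times over baseline predictors; in-sample R² gain stands in here as a crude proxy for the usual log-likelihood gain.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

def ppp_gain(surprisal, baseline_feats, reading_times):
    """baseline_feats: e.g. word length and frequency; surprisal: -log p(word)
    from the LM. Returns the R^2 gained by adding surprisal."""
    base = LinearRegression().fit(baseline_feats, reading_times)
    X = np.column_stack([baseline_feats, surprisal])
    full = LinearRegression().fit(X, reading_times)
    return (full.score(X, reading_times)
            - base.score(baseline_feats, reading_times))
```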

InCA: Rethinking In-Car Conversational System Assessment Leveraging Large Language Models

  • paper_url: http://arxiv.org/abs/2311.07469
  • repo_url: None
  • paper_authors: Ken E. Friedl, Abbas Goher Khan, Soumya Ranjan Sahoo, Md Rashad Al Hasan Rony, Jana Germies, Christian Süß
  • for: Proposes a set of Key Performance Indicators (KPIs) tailored to evaluating in-car conversational question answering (ConvQA) systems, along with datasets specifically designed for these KPIs.
  • methods: Motivated by the inadequacy of existing evaluation metrics for the unique demands of in-car ConvQA, where answers may relate to driver or car safety and are confined to the car domain, the paper also investigates prompting with varied personas.
  • results: A preliminary and comprehensive empirical evaluation substantiates the proposed approach, and employing varied personas in prompts enhances the model's capacity to simulate diverse viewpoints in assessments, mirroring how individuals with different backgrounds perceive a topic.
    Abstract The assessment of advanced generative large language models (LLMs) poses a significant challenge, given their heightened complexity in recent developments. Furthermore, evaluating the performance of LLM-based applications in various industries, as indicated by Key Performance Indicators (KPIs), is a complex undertaking. This task necessitates a profound understanding of industry use cases and the anticipated system behavior. Within the context of the automotive industry, existing evaluation metrics prove inadequate for assessing in-car conversational question answering (ConvQA) systems. The unique demands of these systems, where answers may relate to driver or car safety and are confined within the car domain, highlight the limitations of current metrics. To address these challenges, this paper introduces a set of KPIs tailored for evaluating the performance of in-car ConvQA systems, along with datasets specifically designed for these KPIs. A preliminary and comprehensive empirical evaluation substantiates the efficacy of our proposed approach. Furthermore, we investigate the impact of employing varied personas in prompts and found that it enhances the model's capacity to simulate diverse viewpoints in assessments, mirroring how individuals with different backgrounds perceive a topic.
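    Code sketch: persona-varied evaluation prompting, the idea reported to help the model simulate diverse viewpoints; the persona texts, scoring format, and `llm_judge` callable are assumptions, not the paper's KPIs.
```python
PERSONAS = [
    "a safety-focused driver",
    "a tech-savvy commuter",
    "an elderly first-time user",
]

def persona_scores(llm_judge, question, answer):
    scores = {}
    for persona in PERSONAS:
        prompt = (f"You are {persona}. Rate from 1 to 5 how well this in-car "
                  f"assistant answer serves you.\nQ: {question}\nA: {answer}\n"
                  "Score:")
        scores[persona] = int(llm_judge(prompt).strip()[0])  # expects a digit
    return scores
```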

Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse

  • paper_url: http://arxiv.org/abs/2311.07468
  • repo_url: https://github.com/trestad/mitigating-reversal-curse
  • paper_authors: Ang Lv, Kaiyi Zhang, Shufang Xie, Quan Tu, Yuhan Chen, Ji-Rong Wen, Rui Yan
  • for: This work investigates the "reversal curse" in large language models (LLMs), whereby the order of knowledge entities in the training data biases the models' comprehension.
  • methods: The study examines the GLM model, whose autoregressive blank infilling objective lets predicted tokens attend to the entire context, and proposes the BICO training method.
  • results: On a reversal-curse task designed by the authors, training Llama with BICO raises its accuracy from the original 0% to around 70%.
    Abstract Recent studies have highlighted a phenomenon in large language models (LLMs) known as "the reversal curse," in which the order of knowledge entities in the training data biases the models' comprehension. For example, if a model is trained on sentences where entity A consistently appears before entity B, it can respond to queries about A by providing B as the answer. However, it may encounter confusion when presented with questions concerning B. We contend that the reversal curse is partially a result of specific model training objectives, particularly evident in the prevalent use of the next-token prediction within most causal language models. For the next-token prediction, models solely focus on a token's preceding context, resulting in a restricted comprehension of the input. In contrast, we illustrate that the GLM, trained using the autoregressive blank infilling objective where tokens to be predicted have access to the entire context, exhibits better resilience against the reversal curse. We propose a novel training method, BIdirectional Causal language modeling Optimization (BICO), designed to mitigate the reversal curse when fine-tuning pretrained causal language models on new data. BICO modifies the causal attention mechanism to function bidirectionally and employs a mask denoising optimization. In the task designed to assess the reversal curse, our approach improves Llama's accuracy from the original 0% to around 70%. We hope that more attention can be focused on exploring and addressing these inherent weaknesses of the current LLMs, in order to achieve a higher level of intelligence.
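A minimal sketch of the two ingredients the abstract names, bidirectional attention and mask denoising; the masking rate and the `-1` mask id are assumptions for illustration.

```python
import numpy as np

def causal_mask(n):
    # next-token prediction: position i attends only to positions <= i
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    # BICO-style attention: every position may attend to the full context
    return np.ones((n, n), dtype=bool)

def mask_denoising_batch(tokens, mask_rate=0.15, mask_id=-1, seed=0):
    # hide a random subset of tokens; the model is trained to reconstruct
    # them from the whole (bidirectional) context rather than a prefix
    rng = np.random.default_rng(seed)
    tokens = np.asarray(tokens)
    noise = rng.random(tokens.shape) < mask_rate
    corrupted = np.where(noise, mask_id, tokens)
    return corrupted, tokens, noise   # inputs, targets, loss positions
```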

On Measuring Faithfulness of Natural Language Explanations

  • paper_url: http://arxiv.org/abs/2311.07466
  • repo_url: https://github.com/heidelberg-nlp/cc-shap
  • paper_authors: Letitia Parcalabescu, Anette Frank
  • for: This paper examines explanations of LLM predictions and asks whether existing faithfulness tests actually assess the models' inner workings.
  • methods: The paper evaluates existing faithfulness tests and proposes CC-SHAP, a new, finer-grained self-consistency measure that compares the model's input contributions to its answer prediction and to its generated explanation.
  • results: According to the paper, existing faithfulness tests do not measure faithfulness in terms of the models' inner workings, only output-level self-consistency; CC-SHAP provides a more interpretable, fine-grained measure of LLM self-consistency.
    Abstract Large language models (LLMs) can explain their own predictions, through post-hoc or Chain-of-Thought (CoT) explanations. However the LLM could make up reasonably sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of either post-hoc or CoT explanations. In this paper we argue that existing faithfulness tests are not actually measuring faithfulness in terms of the models' inner workings, but only evaluate their self-consistency on the output level. The aims of our work are two-fold. i) We aim to clarify the status of existing faithfulness tests in terms of model explainability, characterising them as self-consistency tests instead. This assessment we underline by constructing a Comparative Consistency Bank for self-consistency tests that for the first time compares existing tests on a common suite of 11 open-source LLMs and 5 datasets -- including ii) our own proposed self-consistency measure CC-SHAP. CC-SHAP is a new fine-grained measure (not test) of LLM self-consistency that compares a model's input contributions to answer prediction and generated explanation. With CC-SHAP, we aim to take a step further towards measuring faithfulness with a more interpretable and fine-grained method. Code available at \url{https://github.com/Heidelberg-NLP/CC-SHAP}
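As a sketch, self-consistency in the CC-SHAP spirit can be read as agreement between two attribution vectors over the same input tokens; the cosine comparison below is an illustrative choice, not necessarily the paper's exact divergence measure.

```python
import numpy as np

def cc_shap_consistency(phi_answer, phi_explanation):
    """Agreement between input contributions to the answer and to the
    generated explanation (1.0 = the model leans on the same tokens)."""
    a = np.asarray(phi_answer, dtype=float)
    e = np.asarray(phi_explanation, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(e)
    return float(a @ e / denom) if denom else 0.0

# same three input tokens, attributions from prediction vs. explanation
print(cc_shap_consistency([0.6, 0.1, 0.3], [0.5, 0.2, 0.3]))  # ~0.98
```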

KnowSafe: Combined Knowledge and Data Driven Hazard Mitigation in Artificial Pancreas Systems

  • paper_url: http://arxiv.org/abs/2311.07460
  • repo_url: None
  • paper_authors: Xugui Zhou, Maxfield Kouzel, Chloe Smith, Homa Alemzadeh
  • for: This paper aims to improve the safety and security of cyber-physical systems (CPS) by proposing a combined knowledge and data-driven approach called KnowSafe to predict and mitigate safety hazards.
  • methods: The KnowSafe approach integrates domain-specific knowledge of safety constraints and context-specific mitigation actions with machine learning (ML) techniques to estimate system trajectories, infer potential hazards, and generate optimal corrective actions to keep the system safe.
  • results: Experimental evaluation on two realistic closed-loop testbeds for artificial pancreas systems (APS) and a real-world clinical trial dataset for diabetes treatment demonstrates that KnowSafe outperforms the state-of-the-art by achieving higher accuracy in predicting system state trajectories and potential hazards, a low false positive rate, and no false negatives. It also maintains the safe operation of the simulated APS despite faults or attacks without introducing any new hazards, with a hazard mitigation success rate of 92.8%, which is at least 76% higher than solely rule-based (50.9%) and data-driven (52.7%) methods.
    Abstract Significant progress has been made in anomaly detection and run-time monitoring to improve the safety and security of cyber-physical systems (CPS). However, less attention has been paid to hazard mitigation. This paper proposes a combined knowledge and data driven approach, KnowSafe, for the design of safety engines that can predict and mitigate safety hazards resulting from safety-critical malicious attacks or accidental faults targeting a CPS controller. We integrate domain-specific knowledge of safety constraints and context-specific mitigation actions with machine learning (ML) techniques to estimate system trajectories in the far and near future, infer potential hazards, and generate optimal corrective actions to keep the system safe. Experimental evaluation on two realistic closed-loop testbeds for artificial pancreas systems (APS) and a real-world clinical trial dataset for diabetes treatment demonstrates that KnowSafe outperforms the state-of-the-art by achieving higher accuracy in predicting system state trajectories and potential hazards, a low false positive rate, and no false negatives. It also maintains the safe operation of the simulated APS despite faults or attacks without introducing any new hazards, with a hazard mitigation success rate of 92.8%, which is at least 76% higher than solely rule-based (50.9%) and data-driven (52.7%) methods.
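An illustrative sketch of combining a learned trajectory predictor with a domain-knowledge rule in an APS setting; the threshold, horizon, and action names are assumptions, not the paper's actual safety engine.

```python
# Hypothetical names (predict_trajectory, action strings) for illustration.
GLUCOSE_MIN = 70.0   # mg/dL, a domain-knowledge safety constraint

def knowsafe_step(history, predict_trajectory, horizon=30):
    """Estimate the future system trajectory with an ML model, then apply a
    rule-based hazard check and pick a context-specific mitigation."""
    trajectory = predict_trajectory(history, horizon)   # data-driven part
    if min(trajectory) < GLUCOSE_MIN:                   # knowledge-driven part
        return "suspend_insulin_delivery"
    return "continue_normal_operation"
```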

Think Before You Speak: Cultivating Communication Skills of Large Language Models via Inner Monologue

  • paper_url: http://arxiv.org/abs/2311.07445
  • repo_url: None
  • paper_authors: Junkai Zhou, Liang Pang, Huawei Shen, Xueqi Cheng
  • for: Improve the dialogue abilities of large language models (LLMs) so they behave more like human conversational partners.
  • methods: Five communication skills are added to the response-generation process: topic transition, proactively asking questions, concept guidance, empathy, and frequent summarising, supported by an inner monologue.
  • results: In both human and automatic evaluations, the proposed CSIM strategy outperforms baseline models, and the Cskills benchmark allows a more comprehensive assessment of dialogue generation ability.
    Abstract The emergence of large language models (LLMs) further improves the capabilities of open-domain dialogue systems and can generate fluent, coherent, and diverse responses. However, LLMs still lack an important ability: communication skills, which makes them more like information seeking tools than anthropomorphic chatbots. To make LLMs more anthropomorphic and proactive during the conversation, we add five communication skills to the response generation process: topic transition, proactively asking questions, concept guidance, empathy, and summarising often. The addition of communication skills increases the interest of users in the conversation and attracts them to chat for longer. To enable LLMs better understand and use communication skills, we design and add the inner monologue to LLMs. The complete process is achieved through prompt engineering and in-context learning. To evaluate communication skills, we construct a benchmark named Cskills for evaluating various communication skills, which can also more comprehensively evaluate the dialogue generation ability of the model. Experimental results show that the proposed CSIM strategy improves the backbone models and outperforms the baselines in both automatic and human evaluations.
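The inner monologue is realized through prompt engineering; a minimal template in that spirit (the wording is an assumption) might look like the following.

```python
SKILLS = ["topic transition", "proactively asking questions",
          "concept guidance", "empathy", "summarising"]

def csim_prompt(dialogue_history):
    """Build a prompt whose first part is a private inner monologue that
    selects a communication skill, followed by the visible reply."""
    return (
        "You are a thoughtful conversational partner.\n"
        f"Dialogue so far:\n{dialogue_history}\n\n"
        "Step 1 (inner monologue, hidden from the user): decide which of "
        f"these skills best fits the moment: {', '.join(SKILLS)}.\n"
        "Step 2: write the reply that applies the chosen skill.\n"
        "Inner monologue:"
    )
```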

Investigating Multi-Pivot Ensembling with Massively Multilingual Machine Translation Models

  • paper_url: http://arxiv.org/abs/2311.07439
  • repo_url: https://github.com/zurichnlp/multipivotnmt
  • paper_authors: Alireza Mohammadshahi, Jannis Vamvas, Rico Sennrich
  • for: Improve translation quality for low-resource translation directions.
  • methods: Multi-pivot ensembling strategies are revisited, and MaxEns is proposed: a combination strategy biased towards the most confident predictions.
  • results: Evaluated on the FLORES benchmark over 20 low-resource directions, MaxEns improves translation quality and reduces hallucination, outperforming both direct translation and an averaging strategy.
    Abstract Massively multilingual machine translation models allow for the translation of a large number of languages with a single model, but have limited performance on low- and very-low-resource translation directions. Pivoting via high-resource languages remains a strong strategy for low-resource directions, and in this paper we revisit ways of pivoting through multiple languages. Previous work has used a simple averaging of probability distributions from multiple paths, but we find that this performs worse than using a single pivot, and exacerbates the hallucination problem because the same hallucinations can be probable across different paths. As an alternative, we propose MaxEns, a combination strategy that is biased towards the most confident predictions, hypothesising that confident predictions are less prone to be hallucinations. We evaluate different strategies on the FLORES benchmark for 20 low-resource language directions, demonstrating that MaxEns improves translation quality for low-resource languages while reducing hallucination in translations, compared to both direct translation and an averaging approach. On average, multi-pivot strategies still lag behind using English as a single pivot language, raising the question of how to identify the best pivoting strategy for a given translation direction.
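One plausible reading of MaxEns, sketched below, takes the most confident probability per target token across pivot paths and renormalizes, in contrast to plain averaging; the paper's exact combination rule may differ.

```python
import numpy as np

def average_ensemble(dists):
    return np.asarray(dists, dtype=float).mean(axis=0)

def maxens(dists):
    m = np.asarray(dists, dtype=float).max(axis=0)  # most confident score per token
    return m / m.sum()                              # renormalize to a distribution

# two pivot paths over a 3-token vocabulary
paths = [[0.34, 0.33, 0.33],   # hesitant path
         [0.80, 0.10, 0.10]]   # confident path dominates under MaxEns
print(average_ensemble(paths))  # [0.57  0.215 0.215]
print(maxens(paths))            # [0.548 0.226 0.226]
```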

Hallucination Augmented Recitations for Language Models

  • paper_url: http://arxiv.org/abs/2311.07424
  • repo_url: None
  • paper_authors: Abdullatif Köksal, Renat Aksitov, Chung-Ching Chang
  • for: The paper aims to improve the attribution of large language models (LLMs) by creating counterfactual datasets using hallucination in LLMs.
  • methods: The paper proposes a method called Hallucination Augmented Recitations (HAR) to create counterfactual datasets for open book question answering.
  • results: The paper shows that models finetuned with the counterfactual datasets improve text grounding and open book QA performance, with up to an 8.0% increase in F1 score, compared to using human-annotated factual datasets. The improvements are consistent across various model sizes and datasets.
    Abstract Attribution is a key concept in large language models (LLMs) as it enables control over information sources and enhances the factuality of LLMs. While existing approaches utilize open book question answering to improve attribution, factual datasets may reward language models to recall facts that they already know from their pretraining data, not attribution. In contrast, counterfactual open book QA datasets would further improve attribution because the answer could only be grounded in the given text. We propose Hallucination Augmented Recitations (HAR) for creating counterfactual datasets by utilizing hallucination in LLMs to improve attribution. For open book QA as a case study, we demonstrate that models finetuned with our counterfactual datasets improve text grounding, leading to better open book QA performance, with up to an 8.0% increase in F1 score. Our counterfactual dataset leads to significantly better performance than using human-annotated factual datasets, even with 4x smaller datasets and 4x smaller models. We observe that improvements are consistent across various model sizes and datasets, including multi-hop, biomedical, and adversarial QA datasets.
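A sketch of how hallucination can yield counterfactual open book QA pairs; `generate` is a placeholder LLM call, and the prompts and filtering rule are assumptions rather than the paper's exact recipe.

```python
def har_example(question, known_answer, generate):
    """Create a counterfactual (context, question, answer) triple in which the
    answer is grounded only in the hallucinated passage, not in pretraining."""
    passage = generate(
        "Write a short encyclopedic passage answering the question "
        f"'{question}' with a plausible but counterfactual answer.")
    answer = generate(
        f"Passage: {passage}\nQuestion: {question}\n"
        "Answer using only the passage:")
    # keep only pairs that contradict world knowledge yet stay text-grounded
    if answer != known_answer and answer in passage:
        return {"context": passage, "question": question, "answer": answer}
    return None
```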

Exploring Values in Museum Artifacts in the SPICE project: a Preliminary Study

  • paper_url: http://arxiv.org/abs/2311.07396
  • repo_url: None
  • paper_authors: Nele Kadastik, Thomas A. Pederson, Luis Emilio Bruni, Rossana Damiano, Antonio Lieto, Manuel Striani, Tsvi Kuflik, Alan Wecker
  • for: Develop a semantic reasoning tool that broadens the diversity of perspectives experienced by museum visitors.
  • methods: The tool builds on the TCL commonsense reasoning framework and an ontological model of Haidt's theory of moral values to associate museum items with combined values and emotions.
  • results: In preliminary tests on the collection of the Hecht Museum of Haifa, the system can suggest items with different value stances, opening the visit experience to more inclusive interpretations.
    Abstract This document describes the rationale, the implementation and a preliminary evaluation of a semantic reasoning tool developed in the EU H2020 SPICE project to enhance the diversity of perspectives experienced by museum visitors. The tool, called DEGARI 2.0 for values, relies on the commonsense reasoning framework TCL, and exploits an ontological model formalizing Haidt's theory of moral values to associate museum items with combined values and emotions. Within a museum exhibition, this tool can suggest cultural items that are associated not only with the values of already experienced or preferred objects, but also with novel items with different value stances, opening the visit experience to more inclusive interpretations of cultural content. The system has been preliminarily tested, in the context of the SPICE project, on the collection of the Hecht Museum of Haifa.

Predicting Continuous Locomotion Modes via Multidimensional Feature Learning from sEMG

  • paper_url: http://arxiv.org/abs/2311.07395
  • repo_url: None
  • paper_authors: Peiwen Fu, Wenjuan Zhong, Yuyang Zhang, Wenxuan Xiong, Yuzhou Lin, Yanlong Tai, Lin Meng, Mingming Zhang
  • for: Walking-assistive devices need adaptive control methods for smooth transitions between locomotion modes; predicting the wearer's locomotion mode in advance improves the intelligence and transparency of such systems.
  • methods: The paper proposes Deep-STF, a unified end-to-end deep learning model that extracts integrated spatial, temporal, and frequency features from surface electromyography (sEMG) signals, enabling accurate and robust continuous prediction of nine locomotion modes and 15 transitions at prediction intervals from 100 to 500 ms.
  • results: Experiments show that Deep-STF delivers excellent prediction across diverse locomotion modes and transitions from sEMG data alone: average accuracy is 96.48% when forecasting 100 ms ahead and drops only to 93.00% at a 500 ms horizon; the proposed "stable prediction time" metric for upcoming transitions ranged from 28.15 to 372.21 ms.
    Abstract Walking-assistive devices require adaptive control methods to ensure smooth transitions between various modes of locomotion. For this purpose, detecting human locomotion modes (e.g., level walking or stair ascent) in advance is crucial for improving the intelligence and transparency of such robotic systems. This study proposes Deep-STF, a unified end-to-end deep learning model designed for integrated feature extraction in spatial, temporal, and frequency dimensions from surface electromyography (sEMG) signals. Our model enables accurate and robust continuous prediction of nine locomotion modes and 15 transitions at varying prediction time intervals, ranging from 100 to 500 ms. In addition, we introduced the concept of 'stable prediction time' as a distinct metric to quantify prediction efficiency. This term refers to the duration during which consistent and accurate predictions of mode transitions are made, measured from the time of the fifth correct prediction to the occurrence of the critical event leading to the task transition. This distinction between stable prediction time and prediction time is vital as it underscores our focus on the precision and reliability of mode transition predictions. Experimental results showcased Deep-STF's cutting-edge prediction performance across diverse locomotion modes and transitions, relying solely on sEMG data. When forecasting 100 ms ahead, Deep-STF surpassed CNN and other machine learning techniques, achieving an outstanding average prediction accuracy of 96.48%. Even with an extended 500 ms prediction horizon, accuracy only marginally decreased to 93.00%. The averaged stable prediction times for detecting next upcoming transitions spanned from 28.15 to 372.21 ms across the 100-500 ms time advances.
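The "stable prediction time" metric is concrete enough to compute directly from the definition above: it runs from the fifth correct prediction of the upcoming mode to the critical event. A small sketch:

```python
def stable_prediction_time(timestamps, predictions, upcoming_mode, event_time):
    """Time from the fifth correct prediction of `upcoming_mode` to the
    critical event at `event_time` (returns None if never reached)."""
    correct_times = [t for t, p in zip(timestamps, predictions)
                     if p == upcoming_mode]
    return event_time - correct_times[4] if len(correct_times) >= 5 else None

# predictions every 10 ms; the transition event occurs at t = 120 ms
ts = list(range(0, 130, 10))
preds = ["level_walk"] * 4 + ["stair_ascent"] * 9
print(stable_prediction_time(ts, preds, "stair_ascent", 120))  # 40 (ms)
```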

Testing learning-enabled cyber-physical systems with Large-Language Models: A Formal Approach

  • paper_url: http://arxiv.org/abs/2311.07377
  • repo_url: None
  • paper_authors: Xi Zheng, Aloysius K. Mok, Ruzica Piskac, Yong Jae Lee, Bhaskar Krishnamachari, Dakai Zhu, Oleg Sokolsky, Insup Lee
  • for: This paper focuses on the challenges of ensuring formal safety in cyber-physical systems (CPS) that are infused with machine learning (ML).
  • methods: The paper examines testing as the most practical method for verification and validation, and summarizes current state-of-the-art methodologies. It also proposes a roadmap to transition from foundational probabilistic testing to a more rigorous approach that can provide formal assurance.
  • results: The paper identifies the main challenges in ensuring formal safety for learning-enabled CPS, and proposes a roadmap to address these challenges.
    Abstract The integration of machine learning (ML) into cyber-physical systems (CPS) offers significant benefits, including enhanced efficiency, predictive capabilities, real-time responsiveness, and the enabling of autonomous operations. This convergence has accelerated the development and deployment of a range of real-world applications, such as autonomous vehicles, delivery drones, service robots, and telemedicine procedures. However, the software development life cycle (SDLC) for AI-infused CPS diverges significantly from traditional approaches, featuring data and learning as two critical components. Existing verification and validation techniques are often inadequate for these new paradigms. In this study, we pinpoint the main challenges in ensuring formal safety for learning-enabled CPS. We begin by examining testing as the most pragmatic method for verification and validation, summarizing the current state-of-the-art methodologies. Recognizing the limitations in current testing approaches to provide formal safety guarantees, we propose a roadmap to transition from foundational probabilistic testing to a more rigorous approach capable of delivering formal assurance.

Past as a Guide: Leveraging Retrospective Learning for Python Code Completion

  • paper_url: http://arxiv.org/abs/2311.07635
  • repo_url: https://github.com/SeungyounShin/Past-as-a-Guide
  • paper_authors: Seunggyoon Shin, Seunggyu Chang, Sungjoon Choi
  • for: Improve the coding capabilities of large language models (LLMs).
  • methods: Integrate past history with interactive and iterative code refinements.
  • results: Achieved 92% pass@1 on HumanEval, demonstrating the potential of leveraging retrospection from past experiences together with interactive, iterative refinement, without external correctness indicators.
    Abstract This work presents Past as a Guide (PaG), a simple approach for Large Language Models (LLMs) to improve the coding capabilities by integrating the past history with interactive and iterative code refinements. To be specific, inspired by human cognitive processes, the proposed method enables LLMs to utilize previous programming and debugging experiences to enhance the Python code completion tasks. The framework facilitates LLMs to iteratively refine the Python code based on previous execution and debugging results and optimize learning and reasoning capabilities. The proposed methodology achieved a 92\% pass@1 on HumanEval, demonstrating the potential to advance the field by leveraging retrospection from past experiences and interactive and iterative refinement processes without external correctness indicators.
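A minimal sketch of retrospective, iterative refinement in the PaG spirit; `llm` and `run_tests` are hypothetical callables, and the memory format is an assumption for illustration.

```python
memory = []   # (error, fix) lessons accumulated across tasks

def complete_with_retrospection(task, llm, run_tests, max_rounds=3):
    hints = "\n".join(f"Past error: {e}\nPast fix:\n{f}" for e, f in memory[-3:])
    code = llm(f"{hints}\nTask: {task}\nWrite the Python solution:")
    for _ in range(max_rounds):
        ok, error = run_tests(code)          # execution feedback, no oracle
        if ok:
            return code
        fixed = llm(f"Code:\n{code}\nError:\n{error}\nReturn corrected code:")
        memory.append((error, fixed))        # retrospection for future tasks
        code = fixed
    return code
```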

The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4

  • paper_url: http://arxiv.org/abs/2311.07361
  • repo_url: None
  • paper_authors: Microsoft Research AI4Science, Microsoft Azure Quantum
  • for: Evaluate GPT-4's performance in scientific discovery, in order to validate its domain-specific expertise, accelerate scientific progress, optimize resource allocation, guide future model development, and foster interdisciplinary research.
  • methods: Expert-driven case assessments, occasionally complemented by benchmark testing, probe the model's comprehension of complex scientific concepts and relationships and its problem-solving ability across scientific domains.
  • results: The preliminary investigation indicates that GPT-4 shows promising potential across a variety of scientific applications, handling complex problem-solving and knowledge-integration tasks; the study mainly assesses its knowledge base, scientific understanding, numerical calculation abilities, and various scientific prediction capabilities.
    Abstract In recent years, groundbreaking advancements in natural language processing have culminated in the emergence of powerful large language models (LLMs), which have showcased remarkable capabilities across a vast array of domains, including the understanding, generation, and translation of natural language, and even tasks that extend beyond language processing. In this report, we delve into the performance of LLMs within the context of scientific discovery, focusing on GPT-4, the state-of-the-art language model. Our investigation spans a diverse range of scientific areas encompassing drug discovery, biology, computational chemistry (density functional theory (DFT) and molecular dynamics (MD)), materials design, and partial differential equations (PDE). Evaluating GPT-4 on scientific tasks is crucial for uncovering its potential across various research domains, validating its domain-specific expertise, accelerating scientific progress, optimizing resource allocation, guiding future model development, and fostering interdisciplinary research. Our exploration methodology primarily consists of expert-driven case assessments, which offer qualitative insights into the model's comprehension of intricate scientific concepts and relationships, and occasionally benchmark testing, which quantitatively evaluates the model's capacity to solve well-defined domain-specific problems. Our preliminary exploration indicates that GPT-4 exhibits promising potential for a variety of scientific applications, demonstrating its aptitude for handling complex problem-solving and knowledge integration tasks. Broadly speaking, we evaluate GPT-4's knowledge base, scientific understanding, scientific numerical calculation abilities, and various scientific prediction capabilities.

MetaSymNet: A Dynamic Symbolic Regression Network Capable of Evolving into Arbitrary Formulations

  • paper_url: http://arxiv.org/abs/2311.07326
  • repo_url: None
  • paper_authors: Yanjie Li, Weijun Li, Lina Yu, Min Wu, Jinyi Liu, Wenqiang Li, Meilan Hao, Shu Wei, Yusong Deng
  • for: Reliably and automatically generate interpretable mathematical formulas, addressing the black-box nature of traditional artificial neural networks (MLPs).
  • methods: MetaSymNet dynamically adjusts its network structure in real time, allowing both expansion and contraction, and uses the PANGU meta function as an activation function that can evolve into various basic functions to compose formulas tailored to specific needs.
  • results: Compared with four state-of-the-art symbolic regression algorithms on more than 10 public datasets, MetaSymNet consistently performs best; it also shows superior fitting and extrapolation abilities relative to MLP and SVM.
    Abstract Mathematical formulas serve as the means of communication between humans and nature, encapsulating the operational laws governing natural phenomena. The concise formulation of these laws is a crucial objective in scientific research and an important challenge for artificial intelligence (AI). While traditional artificial neural networks (MLP) excel at data fitting, they often yield uninterpretable black box results that hinder our understanding of the relationship between variables x and predicted values y. Moreover, the fixed network architecture in MLP often gives rise to redundancy in both network structure and parameters. To address these issues, we propose MetaSymNet, a novel neural network that dynamically adjusts its structure in real-time, allowing for both expansion and contraction. This adaptive network employs the PANGU meta function as its activation function, which is a unique type capable of evolving into various basic functions during training to compose mathematical formulas tailored to specific needs. We then evolve the neural network into a concise, interpretable mathematical expression. To evaluate MetaSymNet's performance, we compare it with four state-of-the-art symbolic regression algorithms across more than 10 public datasets comprising 222 formulas. Our experimental results demonstrate that our algorithm outperforms others consistently regardless of noise presence or absence. Furthermore, we assess MetaSymNet against MLP and SVM regarding their fitting ability and extrapolation capability, these are two essential aspects of machine learning algorithms. The findings reveal that our algorithm excels in both areas. Finally, we compared MetaSymNet with MLP using iterative pruning in network structure complexity. The results show that MetaSymNet's network structure complexity is obviously less than MLP under the same goodness of fit.
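One way to picture an activation that can "evolve into various basic functions" is a trainable soft mixture over primitives that sharpens onto a single one; this is an illustrative assumption, not the paper's exact PANGU definition.

```python
import numpy as np

PRIMITIVES = [np.sin, np.cos, lambda x: x, lambda x: x**2]

def meta_activation(x, w):
    """Softmax-weighted mixture of basic functions; as w sharpens during
    training, the unit collapses onto a single interpretable primitive."""
    p = np.exp(w - np.max(w))
    p /= p.sum()
    return sum(pi * f(x) for pi, f in zip(p, PRIMITIVES))

x = np.linspace(-1.0, 1.0, 5)
print(meta_activation(x, np.array([10.0, 0.0, 0.0, 0.0])))  # ~ sin(x)
```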

Towards a Transportable Causal Network Model Based on Observational Healthcare Data

  • paper_url: http://arxiv.org/abs/2311.08427
  • repo_url: None
  • paper_authors: Alice Bernasconi, Alessio Zanga, Peter J. F. Lucas, Marco Scutari, Fabio Stella
  • for: Provide an AI-based prognostic model to improve cardiovascular risk assessment for adolescent and young female breast cancer survivors.
  • methods: Selection diagrams, missingness graphs, causal discovery, and prior knowledge are combined into a single graphical model to estimate cardiovascular risk; the model is learned from data comprising two different patient cohorts and validated by expert clinicians in terms of risk assessment, accuracy, and explainability.
  • results: The resulting causal network model yields accurate risk predictions and outperforms competing machine learning methods.
    Abstract Over the last decades, many prognostic models based on artificial intelligence techniques have been used to provide detailed predictions in healthcare. Unfortunately, the real-world observational data used to train and validate these models are almost always affected by biases that can strongly impact the outcomes validity: two examples are values missing not-at-random and selection bias. Addressing them is a key element in achieving transportability and in studying the causal relationships that are critical in clinical decision making, going beyond simpler statistical approaches based on probabilistic association. In this context, we propose a novel approach that combines selection diagrams, missingness graphs, causal discovery and prior knowledge into a single graphical model to estimate the cardiovascular risk of adolescent and young females who survived breast cancer. We learn this model from data comprising two different cohorts of patients. The resulting causal network model is validated by expert clinicians in terms of risk assessment, accuracy and explainability, and provides a prognostic model that outperforms competing machine learning methods.

Rethinking and Benchmarking Predict-then-Optimize Paradigm for Combinatorial Optimization Problems

  • paper_url: http://arxiv.org/abs/2311.07633
  • repo_url: None
  • paper_authors: Haoyu Geng, Han Ruan, Runzhong Wang, Yang Li, Yang Wang, Lei Chen, Junchi Yan
  • for: The research area is Predict-Then-Optimize (PTO), which treats prediction and decision-making as one system for combinatorial optimization problems such as energy cost-aware scheduling, budget allocation in web advertising, and graph matching on social networks.
  • methods: The study covers end-to-end methods that directly optimize the ultimate decision quality as well as the traditional two-stage approach.
  • results: The paper provides a benchmark that integrates existing experimental scenarios to assess model effectiveness in different situations, and releases a new dataset for an industrial combinatorial advertising problem to support evaluation and application of these methods.
    Abstract Numerous web applications rely on solving combinatorial optimization problems, such as energy cost-aware scheduling, budget allocation on web advertising, and graph matching on social networks. However, many optimization problems involve unknown coefficients, and improper predictions of these factors may lead to inferior decisions which may cause energy wastage, inefficient resource allocation, inappropriate matching in social networks, etc. Such a research topic is referred to as "Predict-Then-Optimize (PTO)" which considers the performance of prediction and decision-making in a unified system. A noteworthy recent development is the end-to-end methods by directly optimizing the ultimate decision quality which claims to yield better results in contrast to the traditional two-stage approach. However, the evaluation benchmarks in this field are fragmented and the effectiveness of various models in different scenarios remains unclear, hindering the comprehensive assessment and fast deployment of these methods. To address these issues, we provide a comprehensive categorization of current approaches and integrate existing experimental scenarios to establish a unified benchmark, elucidating the circumstances under which end-to-end training yields improvements, as well as the contexts in which it performs ineffectively. We also introduce a new dataset for the industrial combinatorial advertising problem for inclusive finance to open-source. We hope the rethinking and benchmarking of PTO could facilitate more convenient evaluation and deployment, and inspire further improvements both in the academy and industry within this field.
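The gap between the two-stage and end-to-end paradigms shows up in a toy decision-regret computation: a small prediction error can flip the downstream decision, and that decision-quality loss is exactly what end-to-end training penalizes.

```python
import numpy as np

true_costs = np.array([1.0, 2.0])    # unknown coefficients of the decision
pred_costs = np.array([2.1, 2.0])    # small prediction error, wrong ranking

decision = int(pred_costs.argmin())  # two-stage: optimize over predictions
oracle   = int(true_costs.argmin())
regret   = true_costs[decision] - true_costs[oracle]
print(decision, oracle, regret)      # 1 0 1.0 (the decision-quality loss)
```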

ResMGCN: Residual Message Graph Convolution Network for Fast Biomedical Interactions Discovering

  • paper_url: http://arxiv.org/abs/2311.07632
  • repo_url: None
  • paper_authors: Zecheng Yin
  • for: Propose a fast and precise graph-based method for predicting interactions between biomedical entities.
  • methods: A novel Residual Message Graph Convolution Network (ResMGCN) aggregates the next round's higher-order information with the previous layer's lower-order information to guide node updates, yielding more meaningful node representations.
  • results: Experiments on four biomedical interaction network datasets show that ResMGCN uses storage and time more efficiently than previous state-of-the-art models while achieving excellent performance in predicting biomedical interactions.
    Abstract Biomedical information graphs are crucial for interaction discovering of biomedical information in modern age, such as identification of multifarious molecular interactions and drug discovery, which attracts increasing interests in biomedicine, bioinformatics, and human healthcare communities. Nowadays, more and more graph neural networks have been proposed to learn the entities of biomedical information and precisely reveal biomedical molecule interactions with state-of-the-art results. These methods remedy the fading of features from a far distance but suffer from remedying such problem at the expensive cost of redundant memory and time. In our paper, we propose a novel Residual Message Graph Convolution Network (ResMGCN) for fast and precise biomedical interaction prediction in a different idea. Specifically, instead of enhancing the message from far nodes, ResMGCN aggregates lower-order information with the next round higher information to guide the node update to obtain a more meaningful node representation. ResMGCN is able to perceive and preserve various messages from the previous layer and high-order information in the current layer with least memory and time cost to obtain informative representations of biomedical entities. We conduct experiments on four biomedical interaction network datasets, including protein-protein, drug-drug, drug-target, and gene-disease interactions, which demonstrates that ResMGCN outperforms previous state-of-the-art models while achieving superb effectiveness on both storage and time.
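In its simplest form, the residual-message idea reduces to adding the previous layer's features to the newly aggregated messages; a numpy sketch under that assumption:

```python
import numpy as np

def resmgcn_layer(A_hat, H, W):
    """One layer: higher-order messages (A_hat @ H @ W) are aggregated with
    the previous layer's lower-order features (the + H residual)."""
    return np.maximum(A_hat @ H @ W + H, 0.0)   # ReLU nonlinearity

n, d = 4, 8                     # 4 biomedical entities, 8-dim features
A_hat = np.eye(n)               # stand-in for a normalized adjacency matrix
H = np.random.default_rng(0).normal(size=(n, d))
W = np.random.default_rng(1).normal(size=(d, d)) * 0.1
print(resmgcn_layer(A_hat, H, W).shape)   # (4, 8)
```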

Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models

  • paper_url: http://arxiv.org/abs/2311.07314
  • repo_url: None
  • paper_authors: Junpeng Li, Zixia Jia, Zilong Zheng
  • for: Automate document-level relation extraction annotation so as to reduce human effort.
  • methods: A large language model (LLM) and a natural language inference (NLI) module generate relation triples that augment document-level relation datasets.
  • results: Through re-annotation, the authors show that the resulting DocGNRE dataset improves document-level relation extraction, especially for long-tail relation types.
    Abstract Document-level Relation Extraction (DocRE), which aims to extract relations from a long context, is a critical challenge in achieving fine-grained structural comprehension and generating interpretable document representations. Inspired by recent advances in in-context learning capabilities emergent from large language models (LLMs), such as ChatGPT, we aim to design an automated annotation method for DocRE with minimum human effort. Unfortunately, vanilla in-context learning is infeasible for document-level relation extraction due to the plenty of predefined fine-grained relation types and the uncontrolled generations of LLMs. To tackle this issue, we propose a method integrating a large language model (LLM) and a natural language inference (NLI) module to generate relation triples, thereby augmenting document-level relation datasets. We demonstrate the effectiveness of our approach by introducing an enhanced dataset known as DocGNRE, which excels in re-annotating numerous long-tail relation types. We are confident that our method holds the potential for broader applications in domain-specific relation type definitions and offers tangible benefits in advancing generalized language semantic comprehension.
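A sketch of distant supervision with an NLI filter; `llm` and `nli_entails` are hypothetical stand-ins for the generator and the NLI module, and the triple format is an assumption.

```python
def augment_relations(document, llm, nli_entails, threshold=0.9):
    """Distantly supervise DocRE: an LLM proposes triples, an NLI module
    keeps only those the document entails."""
    raw = llm("List relation triples as 'head | relation | tail', one per "
              f"line, stated in:\n{document}")
    triples = [tuple(part.strip() for part in line.split("|"))
               for line in raw.splitlines() if line.count("|") == 2]
    return [t for t in triples
            if nli_entails(premise=document,
                           hypothesis=f"{t[0]} {t[1]} {t[2]}.") >= threshold]
```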

C-Procgen: Empowering Procgen with Controllable Contexts

  • paper_url: http://arxiv.org/abs/2311.07312
  • repo_url: None
  • paper_authors: Zhenxiong Tan, Kaixin Wang, Xinchao Wang
  • for: Provide an enhanced suite of environments on top of the Procgen benchmark for a variety of research needs.
  • methods: Fine-grained environment configuration mechanisms, covering game mechanics and agent attributes, make the previously black-box procedural generation process transparent and adaptable.
  • results: C-Procgen offers more than 200 unique game contexts across 16 games with detailed configurability, giving researchers better control over, and insight into, the generation process.
    Abstract We present C-Procgen, an enhanced suite of environments on top of the Procgen benchmark. C-Procgen provides access to over 200 unique game contexts across 16 games. It allows for detailed configuration of environments, ranging from game mechanics to agent attributes. This makes the procedural generation process, previously a black-box in Procgen, more transparent and adaptable for various research needs.The upgrade enhances dynamic context management and individualized assignments, while maintaining computational efficiency. C-Procgen's controllable contexts make it applicable in diverse reinforcement learning research areas, such as learning dynamics analysis, curriculum learning, and transfer learning. We believe that C-Procgen will fill a gap in the current literature and offer a valuable toolkit for future works.
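Controllable contexts naturally support curriculum-style research; the sketch below is a hypothetical usage pattern, since the exact C-Procgen API and knob names are not shown here.

```python
def curriculum_context(step):
    """Map a training step to an environment context dict, gradually raising
    difficulty (knob names are illustrative assumptions)."""
    return {
        "maze_size": 5 + min(step // 10_000, 10),   # game-mechanic knob
        "num_enemies": min(step // 50_000, 4),      # game-mechanic knob
        "agent_speed": 1.0,                         # agent-attribute knob
    }

# hypothetical usage:
# env = c_procgen.make("maze", context=curriculum_context(step))
```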

Do large language models and humans have similar behaviors in causal inference with script knowledge?

  • paper_url: http://arxiv.org/abs/2311.07311
  • repo_url: https://github.com/tony-hong/causal-script
  • paper_authors: Xudong Hong, Margarita Ryzhova, Daniel Adrian Biondi, Vera Demberg
  • for: Study the language understanding abilities of large pre-trained language models (LLMs), including zero-shot causal reasoning.
  • methods: Script-based stories are used in which a causing event A is stated, negated, or omitted, and the processing of a dependent event B is measured.
  • results: 1) Recent LLMs such as GPT-3 and Vicuna correlate with human behavior, showing longer reading times in the $\neg A \rightarrow B$ condition; 2) despite this correlation, all models still struggle to integrate script knowledge, failing to predict that $nil \rightarrow B$ is less surprising than $\neg A \rightarrow B$.
    Abstract Recently, large pre-trained language models (LLMs) have demonstrated superior language understanding abilities, including zero-shot causal reasoning. However, it is unclear to what extent their capabilities are similar to human ones. We here study the processing of an event $B$ in a script-based story, which causally depends on a previous event $A$. In our manipulation, event $A$ is stated, negated, or omitted in an earlier section of the text. We first conducted a self-paced reading experiment, which showed that humans exhibit significantly longer reading times when causal conflicts exist ($\neg A \rightarrow B$) than under logical conditions ($A \rightarrow B$). However, reading times remain similar when cause A is not explicitly mentioned, indicating that humans can easily infer event B from their script knowledge. We then tested a variety of LLMs on the same data to check to what extent the models replicate human behavior. Our experiments show that 1) only recent LLMs, like GPT-3 or Vicuna, correlate with human behavior in the $\neg A \rightarrow B$ condition. 2) Despite this correlation, all models still fail to predict that $nil \rightarrow B$ is less surprising than $\neg A \rightarrow B$, indicating that LLMs still have difficulties integrating script knowledge. Our code and collected data set are available at https://github.com/tony-hong/causal-script.
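The three conditions can be compared by computing a model's surprisal of event B under each prefix; `logprob(context, target)` below is a placeholder for a model's total log-probability of the target continuation, and the story text is an invented example.

```python
def condition_surprisals(logprob, event_b="Then Tom was handed a receipt."):
    prefixes = {
        "A":    "Tom paid the cashier. ",         # cause stated
        "negA": "Tom did not pay the cashier. ",  # cause negated
        "nil":  "",                               # cause omitted (script only)
    }
    return {cond: -logprob(prefix, event_b) for cond, prefix in prefixes.items()}

# Human-like pattern reported above:
#   scores["negA"] > scores["A"]  and  scores["nil"] ~ scores["A"],
# whereas the tested LLMs fail to rank "nil" as less surprising than "negA".
```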

Explaining black boxes with a SMILE: Statistical Model-agnostic Interpretability with Local Explanations

  • paper_url: http://arxiv.org/abs/2311.07286
  • repo_url: https://github.com/dependable-intelligent-systems-lab/xwhy
  • paper_authors: Koorosh Aslansefat, Mojgan Hashemian, Martin Walker, Mohammed Naveed Akram, Ioannis Sorokos, Yiannis Papadopoulos
  • for: Improve the trustworthiness of machine learning models.
  • methods: Statistical distance measures are incorporated into local explanations to improve explainability.
  • results: Explainability is improved while the method remains applicable to a wide range of input data domains.
    Abstract Machine learning is currently undergoing an explosion in capability, popularity, and sophistication. However, one of the major barriers to widespread acceptance of machine learning (ML) is trustworthiness: most ML models operate as black boxes, their inner workings opaque and mysterious, and it can be difficult to trust their conclusions without understanding how those conclusions are reached. Explainability is therefore a key aspect of improving trustworthiness: the ability to better understand, interpret, and anticipate the behaviour of ML models. To this end, we propose SMILE, a new method that builds on previous approaches by making use of statistical distance measures to improve explainability while remaining applicable to a wide range of input data domains.
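A sketch of a LIME-style local surrogate whose sample weights come from a statistical distance, in the SMILE spirit; the choice of Wasserstein distance, the kernel form, and the parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.linear_model import Ridge

def smile_explain(model, x, n_samples=500, scale=0.3, seed=0):
    """Fit a weighted linear surrogate around instance x; `model` is the
    black-box predictor taking a batch of inputs."""
    rng = np.random.default_rng(seed)
    X = x + rng.normal(0.0, scale, size=(n_samples, x.size))  # perturbations
    y = model(X)                                              # black-box calls
    d = np.array([wasserstein_distance(x, z) for z in X])     # statistical distance
    w = np.exp(-(d / (d.mean() + 1e-12)) ** 2)                # closer -> heavier
    return Ridge().fit(X, y, sample_weight=w).coef_           # local importances
```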

TIAGo RL: Simulated Reinforcement Learning Environments with Tactile Data for Mobile Robots

  • paper_url: http://arxiv.org/abs/2311.07260
  • repo_url: None
  • paper_authors: Luca Lach, Francesco Ferro, Robert Haschke
  • For: The paper is written for researchers and developers working on robotic tasks that involve physical interaction, such as object manipulation.
  • Methods: The paper uses deep reinforcement learning (DRL) to learn complex behavior in robotics, specifically for the TIAGo service robot.
  • Results: The paper presents preliminary training results of a learned force control policy and compares it to a classical PI controller.
    Abstract Tactile information is important for robust performance in robotic tasks that involve physical interaction, such as object manipulation. However, with more data included in the reasoning and control process, modeling behavior becomes increasingly difficult. Deep Reinforcement Learning (DRL) produced promising results for learning complex behavior in various domains, including tactile-based manipulation in robotics. In this work, we present our open-source reinforcement learning environments for the TIAGo service robot. They produce tactile sensor measurements that resemble those of a real sensorised gripper for TIAGo, encouraging research in transfer learning of DRL policies. Lastly, we show preliminary training results of a learned force control policy and compare it to a classical PI controller.

Towards Transferring Tactile-based Continuous Force Control Policies from Simulation to Robot

  • paper_url: http://arxiv.org/abs/2311.07245
  • repo_url: None
  • paper_authors: Luca Lach, Robert Haschke, Davide Tateo, Jan Peters, Helge Ritter, Júlia Borràs, Carme Torras
  • for: Propose a model-free deep reinforcement learning method for controlling the force a robot exerts when grasping objects.
  • methods: A simulation environment generating realistic normal forces is used to train continuous force control policies, which are transferred to the robot without further fine-tuning.
  • results: Compared to a hand-modeled baseline, the learned policy performs better, and domain randomization and an ablation study validate the sim-to-real transfer.
    Abstract The advent of tactile sensors in robotics has sparked many ideas on how robots can leverage direct contact measurements of their environment interactions to improve manipulation tasks. An important line of research in this regard is that of grasp force control, which aims to manipulate objects safely by limiting the amount of force exerted on the object. While prior works have either hand-modeled their force controllers, employed model-based approaches, or have not shown sim-to-real transfer, we propose a model-free deep reinforcement learning approach trained in simulation and then transferred to the robot without further fine-tuning. We therefore present a simulation environment that produces realistic normal forces, which we use to train continuous force control policies. An evaluation in which we compare against a baseline and perform an ablation study shows that our approach outperforms the hand-modeled baseline and that our proposed inductive bias and domain randomization facilitate sim-to-real transfer. Code, models, and supplementary videos are available on https://sites.google.com/view/rl-force-ctrl
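A common reward shaping for continuous force control (an assumption here, not necessarily the paper's exact reward) penalizes deviation of the sensed normal force from a target:

```python
def force_reward(f_measured, f_target, weight=1.0):
    """Dense reward: zero when the commanded normal force is held exactly,
    increasingly negative as the grip is too loose or too tight."""
    return -weight * abs(f_measured - f_target)

print(force_reward(4.2, 5.0))   # -0.8
print(force_reward(5.0, 5.0))   #  0.0
```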

  • paper_url: http://arxiv.org/abs/2311.07237
  • repo_url: https://github.com/ink-usc/link
  • paper_authors: Huihan Li, Yuting Ning, Zeyi Liao, Siyuan Wang, Xiang Lorraine Li, Ximing Lu, Faeze Brahman, Wenting Zhao, Yejin Choi, Xiang Ren
  • for: Systematically generate knowledge statements from the long-tail distribution.
  • methods: The Logic-Induced-Knowledge-Search (LINK) framework grounds generation in a symbolic rule: an LLM is prompted for initial values, a critic verifies their correctness, and a reranker pushes the results toward the long tail of the distribution.
  • results: The resulting Logic-Induced-Long-Tail (LINT) dataset contains 200 symbolic rules and 50K knowledge statements across four domains, 84% of which human annotators judge correct. In contrast, ChatGPT and GPT4 generate long-tail statements directly from logic rules with only 56% and 78% accuracy, and their "long-tail" generations in fact fall in a higher-likelihood range, so they are not truly long-tail. These findings show that LINK effectively generates data in the long-tail distribution and that LINT can be used to systematically evaluate LLMs' long-tail capabilities.
    Abstract Since large language models have approached human-level performance on many tasks, it has become increasingly harder for researchers to find tasks that are still challenging to the models. Failure cases usually come from the long-tail distribution - data that an oracle language model could assign a probability on the lower end of its distribution. Current methodology such as prompt engineering or crowdsourcing are insufficient for creating long-tail examples because humans are constrained by cognitive bias. We propose a Logic-Induced-Knowledge-Search (LINK) framework for systematically generating long-tail knowledge statements. Grounded by a symbolic rule, we search for long-tail values for each variable of the rule by first prompting a LLM, then verifying the correctness of the values with a critic, and lastly pushing for the long-tail distribution with a reranker. With this framework we construct a dataset, Logic-Induced-Long-Tail (LINT), consisting of 200 symbolic rules and 50K knowledge statements spanning across four domains. Human annotations find that 84% of the statements in LINT are factually correct. In contrast, ChatGPT and GPT4 struggle with directly generating long-tail statements under the guidance of logic rules, each only getting 56% and 78% of their statements correct. Moreover, their "long-tail" generations in fact fall into the higher likelihood range, and thus are not really long-tail. Our findings suggest that LINK is effective for generating data in the long-tail distribution while enforcing quality. LINT can be useful for systematically evaluating LLMs' capabilities in the long-tail distribution. We challenge the models with a simple entailment classification task using samples from LINT. We find that ChatGPT and GPT4's capability in identifying incorrect knowledge drop by ~3% in the long-tail distribution compared to head distribution.
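One LINK-style round for a single rule variable can be sketched as a propose-verify-rerank loop; `llm`, `critic`, and `likelihood` are hypothetical stand-ins for the prompted generator, the verifier, and the reranker's probability estimate.

```python
def link_search(rule, llm, critic, likelihood, k=20, keep=5):
    """Generate long-tail instantiations of a symbolic rule."""
    candidates = llm(f"List {k} distinct values of x such that: {rule}"
                     ).splitlines()                        # 1) propose
    verified = [x.strip() for x in candidates
                if x.strip() and critic(rule, x.strip())]  # 2) verify
    verified.sort(key=likelihood)                          # 3) push to the tail
    return verified[:keep]   # lowest-likelihood, i.e. long-tail, values
```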

IASCAR: Incremental Answer Set Counting by Anytime Refinement

  • paper_url: http://arxiv.org/abs/2311.07233
  • repo_url: None
  • paper_authors: Johannes K. Fichte, Sarah Alice Gaggl, Markus Hecher, Dominik Rusovac
  • for: To study counting answer sets in Answer Set Programming (ASP) and how knowledge compilation can make repeated counting efficient.
  • methods: Compiles instances into CNF formulas that encode supported models, then applies the inclusion-exclusion principle to systematically over- and under-count, improving counting efficiency.
  • results: A preliminary empirical analysis demonstrates promising results: after compiling the input offline, answer sets can be (re)counted quickly.
    Abstract Answer set programming (ASP) is a popular declarative programming paradigm with various applications. Programs can easily have many answer sets that cannot be enumerated in practice, but counting still allows quantifying solution spaces. If one counts under assumptions on literals, one obtains a tool to comprehend parts of the solution space, so-called answer set navigation. However, navigating through parts of the solution space requires counting many times, which is expensive in theory. Knowledge compilation compiles instances into representations on which counting works in polynomial time. However, these techniques exist only for CNF formulas, and compiling ASP programs into CNF formulas can introduce an exponential overhead. This paper introduces a technique to iteratively count answer sets under assumptions on knowledge compilations of CNFs that encode supported models. Our anytime technique uses the inclusion-exclusion principle to improve bounds by over- and undercounting systematically. In a preliminary empirical analysis, we demonstrate promising results. After compiling the input (offline phase), our approach quickly (re)counts.
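The anytime refinement can be illustrated independently of the compilation step. A minimal sketch, assuming a hypothetical `count(literals)` oracle over the compiled CNF: truncating the alternating inclusion-exclusion sum after an odd or even number of terms yields the systematic over- and under-counts (Bonferroni bounds) described above.

```python
# Anytime bounds on the number of models satisfying at least one of the
# given assumption sets, via truncated inclusion-exclusion. `count` is a
# hypothetical oracle returning the model count of a compiled CNF under
# a conjunction of assumption literals (polynomial-time on e.g. d-DNNF).
from itertools import combinations

def count_union(assumption_sets, count, max_terms=None):
    n = len(assumption_sets)
    total, lower, upper = 0, 0, None
    for j in range(1, n + 1):
        term = sum(count(frozenset().union(*combo))
                   for combo in combinations(assumption_sets, j))
        total += term if j % 2 else -term
        if j % 2:
            upper = total          # odd-length prefixes over-count
        else:
            lower = total          # even-length prefixes under-count
        if max_terms is not None and j >= max_terms:
            break                  # stop anytime; bounds are still valid
    return lower, upper
```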

Large Language Models for Robotics: A Survey

  • paper_url: http://arxiv.org/abs/2311.07226
  • repo_url: https://github.com/Aryia-Behroziuan/Other-sources
  • paper_authors: Fanlong Zeng, Wensheng Gan, Yongheng Wang, Ning Liu, Philip S. Yu
  • for: This paper aims to provide a comprehensive review of the applications of large language models (LLMs) in robotics, exploring their impact and contributions to key areas such as robot control, perception, decision-making, and path planning.
  • methods: The paper uses a variety of techniques, including those employed in perception, decision-making, control, and interaction, to demonstrate the potential of LLMs in enhancing robot intelligence and human-robot interaction.
  • results: The paper highlights recent advancements in robotics models based on LLMs, including their ability to process and generate natural language, facilitating efficient interaction and collaboration with robots. The paper also explores the potential challenges that LLMs may face in the near future, such as the need for more diverse and nuanced training data.
    Abstract The human ability to learn, generalize, and control complex manipulation tasks through multi-modality feedback suggests a unique capability, which we refer to as dexterity intelligence. Understanding and assessing this intelligence is a complex task. Amidst the swift progress and extensive proliferation of large language models (LLMs), their applications in the field of robotics have garnered increasing attention. LLMs possess the ability to process and generate natural language, facilitating efficient interaction and collaboration with robots. Researchers and engineers in the field of robotics have recognized the immense potential of LLMs in enhancing robot intelligence, human-robot interaction, and autonomy. Therefore, this comprehensive review aims to summarize the applications of LLMs in robotics, delving into their impact and contributions to key areas such as robot control, perception, decision-making, and path planning. We first provide an overview of the background and development of LLMs for robotics, followed by a description of the benefits of LLMs for robotics and recent advancements in robotics models based on LLMs. We then delve into the various techniques used in the model, including those employed in perception, decision-making, control, and interaction. Finally, we explore the applications of LLMs in robotics and some potential challenges they may face in the near future. Embodied intelligence is the future of intelligent science, and LLMs-based robotics is one of the promising but challenging paths to achieve this.

Optical Quantum Sensing for Agnostic Environments via Deep Learning

  • paper_url: http://arxiv.org/abs/2311.07203
  • repo_url: None
  • paper_authors: Zeqiao Zhou, Yuxuan Du, Xu-Fei Yin, Shanshan Zhao, Xinmei Tian, Dacheng Tao
  • for: To improve the precision of optical quantum sensing and attain the Heisenberg limit in agnostic environments.
  • methods: Uses deep learning techniques, including a graph neural network predictor and a trigonometric interpolation algorithm, to achieve high-precision optical quantum sensing.
  • results: Experiments show the method attains high precision under different settings with up to eight photons, identifying optical setups with maximal quantum Fisher information.
    Abstract Optical quantum sensing promises measurement precision beyond classical sensors termed the Heisenberg limit (HL). However, conventional methodologies often rely on prior knowledge of the target system to achieve HL, presenting challenges in practical applications. Addressing this limitation, we introduce an innovative Deep Learning-based Quantum Sensing scheme (DQS), enabling optical quantum sensors to attain HL in agnostic environments. DQS incorporates two essential components: a Graph Neural Network (GNN) predictor and a trigonometric interpolation algorithm. Operating within a data-driven paradigm, DQS utilizes the GNN predictor, trained on offline data, to unveil the intrinsic relationships between the optical setups employed in preparing the probe state and the resulting quantum Fisher information (QFI) after interaction with the agnostic environment. This distilled knowledge facilitates the identification of optimal optical setups associated with maximal QFI. Subsequently, DQS employs a trigonometric interpolation algorithm to recover the unknown parameter estimates for the identified optical setups. Extensive experiments are conducted to investigate the performance of DQS under different settings up to eight photons. Our findings not only offer a new lens through which to accelerate optical quantum sensing tasks but also catalyze future research integrating deep learning and quantum mechanics.
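The trigonometric interpolation step can be shown in isolation. A minimal sketch, not the paper's implementation: if the measured signal is band-limited in a controllable phase offset, equally spaced samples determine the interpolant exactly via the DFT, and the unknown phase can be read off its peak. The idealized cosine signal below is an assumption for illustration.

```python
# Recover a hidden phase from equally spaced probe measurements using
# exact trigonometric interpolation (a sketch, not the paper's code).
import numpy as np

def trig_interpolant(samples):
    """Return a callable evaluating the trigonometric interpolant of
    equally spaced samples y_j = f(2*pi*j/N)."""
    n = len(samples)
    coeffs = np.fft.fft(samples) / n
    freqs = np.fft.fftfreq(n, d=1.0 / n)       # integer frequencies
    def f(theta):
        return np.real(sum(c * np.exp(1j * k * theta)
                           for c, k in zip(coeffs, freqs)))
    return f

# Toy usage: estimate a hidden phase phi from N = 8 probe settings.
phi_true = 0.7
thetas = 2 * np.pi * np.arange(8) / 8
y = np.cos(thetas - phi_true)                  # idealized interference signal
f = trig_interpolant(y)
grid = np.linspace(0, 2 * np.pi, 10000)
phi_est = grid[np.argmax(f(grid))]             # peak of the interpolant
```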

Applying Large Language Models for Causal Structure Learning in Non Small Cell Lung Cancer

  • paper_url: http://arxiv.org/abs/2311.07191
  • repo_url: None
  • paper_authors: Narmada Naik, Ayush Khandelwal, Mohit Joshi, Madhusudan Atre, Hollis Wright, Kavya Kannan, Scott Hill, Giridhar Mamidipudi, Ganapati Srinivasa, Carlo Bifulco, Brian Piening, Kevin Matlock
  • for: To study whether Large Language Models (LLMs) can determine the directionality of edges in causal discovery.
  • methods: Uses LLMs to predict edge directionality in causal graphs and compares them against existing state-of-the-art methods.
  • results: LLMs accurately predict the directionality of edges in causal graphs, outperforming existing state-of-the-art methods.
    Abstract Causal discovery is becoming a key part in medical AI research. These methods can enhance healthcare by identifying causal links between biomarkers, demographics, treatments and outcomes. They can aid medical professionals in choosing more impactful treatments and strategies. In parallel, Large Language Models (LLMs) have shown great potential in identifying patterns and generating insights from text data. In this paper we investigate applying LLMs to the problem of determining the directionality of edges in causal discovery. Specifically, we test our approach on a deidentified set of Non Small Cell Lung Cancer(NSCLC) patients that have both electronic health record and genomic panel data. Graphs are validated using Bayesian Dirichlet estimators using tabular data. Our result shows that LLMs can accurately predict the directionality of edges in causal graphs, outperforming existing state-of-the-art methods. These findings suggests that LLMs can play a significant role in advancing causal discovery and help us better understand complex systems.

Cross-Axis Transformer with 2D Rotary Embeddings

  • paper_url: http://arxiv.org/abs/2311.07184
  • repo_url: None
  • paper_authors: Lily Erickson
  • for: To address the computational inefficiency and poor handling of spatial dimensions that have held back vision transformers.
  • methods: Proposes the Cross-Axis Transformer (CAT), a model inspired by Axial Transformers and Microsoft's Retentive Network, which drastically reduces the number of floating-point operations required to process an image.
  • results: CAT converges faster and more accurately than the Vision Transformers it replaces on image tasks.
    Abstract Despite lagging behind their modal cousins in many respects, Vision Transformers have provided an interesting opportunity to bridge the gap between sequence modeling and image modeling. Up until now however, vision transformers have largely been held back, due to both computational inefficiency, and lack of proper handling of spatial dimensions. In this paper, we introduce the Cross-Axis Transformer. CAT is a model inspired by both Axial Transformers, and Microsoft's recent Retentive Network, that drastically reduces the required number of floating point operations required to process an image, while simultaneously converging faster and more accurately than the Vision Transformers it replaces.
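One plausible reading of the 2D rotary embedding in the title is two independent 1D rotary embeddings, one driven by the row index and one by the column index, each applied to half of the channels. A minimal sketch under that assumption (the paper's exact construction may differ):

```python
# 2D rotary embeddings as two axis-wise 1D RoPE applications.
# Assumption: channel dim divisible by 4; shapes are illustrative.
import torch

def rope_1d(x, pos, base=10000.0):
    """Apply 1D rotary embedding along the last dim of x, given
    positions `pos` broadcastable to x's leading dims."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)
    angles = pos[..., None] * freqs            # (..., half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_2d(x):
    """x: (H, W, C) feature map; rotate the first C/2 channels by the
    row index and the last C/2 by the column index."""
    H, W, C = x.shape
    rows = torch.arange(H, dtype=x.dtype)[:, None].expand(H, W)
    cols = torch.arange(W, dtype=x.dtype)[None, :].expand(H, W)
    xr, xc = x[..., : C // 2], x[..., C // 2 :]
    return torch.cat([rope_1d(xr, rows), rope_1d(xc, cols)], dim=-1)

q = torch.randn(14, 14, 64)    # toy query map
q_rot = rope_2d(q)             # same shape, position-encoded
```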

Knowledge Graph Representations to enhance Intensive Care Time-Series Predictions

  • paper_url: http://arxiv.org/abs/2311.07180
  • repo_url: None
  • paper_authors: Samyak Jain, Manuel Burger, Gunnar Rätsch, Rita Kuznetsova
  • for: Enhancing clinical outcome prediction in Intensive Care Units (ICUs), which requires comprehensive integration of patient data.
  • methods: Builds on recent deep learning advances to integrate patient time-series data with unstructured clinical reports, improving predictive performance.
  • results: Incorporating medical-domain knowledge through knowledge graphs derived from clinical ontologies such as the Unified Medical Language System (UMLS) improves clinical decision modeling. Combining graph representations with vital signs and clinical reports enhances performance, especially when data is missing, and an interpretability component shows how knowledge-graph nodes affect predictions.
    Abstract Intensive Care Units (ICU) require comprehensive patient data integration for enhanced clinical outcome predictions, crucial for assessing patient conditions. Recent deep learning advances have utilized patient time series data, and fusion models have incorporated unstructured clinical reports, improving predictive performance. However, integrating established medical knowledge into these models has not yet been explored. The medical domain's data, rich in structural relationships, can be harnessed through knowledge graphs derived from clinical ontologies like the Unified Medical Language System (UMLS) for better predictions. Our proposed methodology integrates this knowledge with ICU data, improving clinical decision modeling. It combines graph representations with vital signs and clinical reports, enhancing performance, especially when data is missing. Additionally, our model includes an interpretability component to understand how knowledge graph nodes affect predictions.
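A minimal sketch of the kind of fusion described, not the paper's architecture: a GRU summary of the vital-sign series is concatenated with a mean-pooled embedding of the patient's linked knowledge-graph concepts before classification. Module names and sizes are illustrative assumptions.

```python
# Toy fusion of ICU time series with knowledge-graph concept embeddings.
import torch
import torch.nn as nn

class KGTimeSeriesModel(nn.Module):
    def __init__(self, n_vitals, n_concepts, d=64):
        super().__init__()
        self.gru = nn.GRU(n_vitals, d, batch_first=True)
        self.concept_emb = nn.Embedding(n_concepts, d)  # e.g. UMLS nodes
        self.head = nn.Linear(2 * d, 1)

    def forward(self, vitals, concept_ids):
        # vitals: (B, T, n_vitals); concept_ids: (B, K) linked KG nodes
        _, h = self.gru(vitals)                          # h: (1, B, d)
        kg = self.concept_emb(concept_ids).mean(dim=1)   # (B, d)
        return self.head(torch.cat([h[-1], kg], dim=-1)) # logits (B, 1)

model = KGTimeSeriesModel(n_vitals=8, n_concepts=1000)
logits = model(torch.randn(4, 48, 8), torch.randint(0, 1000, (4, 5)))
```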

Game Solving with Online Fine-Tuning

  • paper_url: http://arxiv.org/abs/2311.07178
  • repo_url: https://github.com/rlglab/online-fine-tuning-solver
  • paper_authors: Ti-Rong Wu, Hung Guei, Ting Han Wei, Chung-Chin Shih, Jui-Te Chin, I-Chen Wu
  • for: solves challenging 7x7 Killall-Go problems with online fine-tuning, using less computation time than traditional methods.
  • methods: applies online fine-tuning and proposes two tailor-designed heuristics for game solving.
  • results: solves a series of challenging 7x7 Killall-Go problems with 23.54% less computation time compared to the baseline, and the savings scale with problem size.
    Abstract Game solving is a similar, yet more difficult task than mastering a game. Solving a game typically means to find the game-theoretic value (outcome given optimal play), and optionally a full strategy to follow in order to achieve that outcome. The AlphaZero algorithm has demonstrated super-human level play, and its powerful policy and value predictions have also served as heuristics in game solving. However, to solve a game and obtain a full strategy, a winning response must be found for all possible moves by the losing player. This includes very poor lines of play from the losing side, for which the AlphaZero self-play process will not encounter. AlphaZero-based heuristics can be highly inaccurate when evaluating these out-of-distribution positions, which occur throughout the entire search. To address this issue, this paper investigates applying online fine-tuning while searching and proposes two methods to learn tailor-designed heuristics for game solving. Our experiments show that using online fine-tuning can solve a series of challenging 7x7 Killall-Go problems, using only 23.54% of computation time compared to the baseline without online fine-tuning. Results suggest that the savings scale with problem size. Our method can further be extended to any tree search algorithm for problem solving. Our code is available at https://rlg.iis.sinica.edu.tw/papers/neurips2023-online-fine-tuning-solver.

The High-dimensional Phase Diagram and the Large CALPHAD Model

  • paper_url: http://arxiv.org/abs/2311.07174
  • repo_url: None
  • paper_authors: Zhengdi Liu, Xulong An, Wenwen Sun
  • for: To tackle the complexity of alloy systems with more than three elements, introducing the Large CALPHAD Model (LCM) to compute the entire phase space of the FeNiCrMn alloy system.
  • methods: Systematically structures the massive data with a high-dimensional phase diagram aided by hash tables and depth-first search (DFS), achieving 97% classification accuracy and a mean square error of 4.80×10^-5 in phase volume prediction.
  • results: Successfully delineates 51 unique phase spaces in the FeNiCrMn system and demonstrates the approach by designing all 439 eutectic alloys, signaling a major shift in alloy design techniques and multi-variable problems.
    Abstract When alloy systems comprise more than three elements, the visualization of the entire phase space becomes not only daunting but is also accompanied by a data surge. Addressing this complexity, we delve into the FeNiCrMn alloy system and introduce the Large CALPHAD Model (LCM). The LCM acts as a computational conduit, capturing the entire phase space. Subsequently, this enormous data is systematically structured using a high-dimensional phase diagram, aided by hash tables and Depth-first Search (DFS), rendering it both digestible and programmatically accessible. Remarkably, the LCM boasts a 97% classification accuracy and a mean square error of 4.80×10^-5 in phase volume prediction. Our methodology successfully delineates 51 unique phase spaces in the FeNiCrMn system, exemplifying its efficacy with the design of all 439 eutectic alloys. This pioneering methodology signifies a monumental shift in alloy design techniques or even multi-variable problems.
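The data-structuring step (a hash table over grid points plus DFS) can be sketched generically. A minimal illustration, assuming a hypothetical `phases_at` stub in place of the trained LCM's phase prediction: adjacent grid points with identical phase sets are grouped into contiguous phase regions.

```python
# Group grid points of a discretized composition space into connected
# regions of identical phase sets via a hash table and iterative DFS.
from itertools import product

def phase_regions(grid_shape, phases_at):
    """phases_at(point) -> frozenset of phase labels (model stub).
    Returns a list of (phase_set, region) pairs."""
    points = list(product(*(range(n) for n in grid_shape)))
    table = {p: phases_at(p) for p in points}   # hash table of phase sets
    seen, regions = set(), []
    for start in points:
        if start in seen:
            continue
        key, region, stack = table[start], [], [start]
        while stack:                            # iterative DFS
            p = stack.pop()
            if p in seen or table[p] != key:
                continue
            seen.add(p)
            region.append(p)
            for axis in range(len(p)):          # axis-aligned neighbours
                for step in (-1, 1):
                    q = list(p)
                    q[axis] += step
                    q = tuple(q)
                    if q in table and q not in seen:
                        stack.append(q)
        regions.append((key, region))
    return regions

# Toy usage on a 4x4 grid with a made-up phase rule.
regions = phase_regions((4, 4),
                        lambda p: frozenset({"FCC"} if sum(p) < 4
                                            else {"FCC", "BCC"}))
```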

STEER: Unified Style Transfer with Expert Reinforcement

  • paper_url: http://arxiv.org/abs/2311.07167
  • repo_url: None
  • paper_authors: Skyler Hallinan, Faeze Brahman, Ximing Lu, Jaehun Jung, Sean Welleck, Yejin Choi
  • for: Text style transfer: rewriting text from an arbitrary, unknown source style into a target style.
  • methods: Proposes STEER, a unified framework based on expert reinforcement that overcomes the scarcity of parallel data by automatically generating a corpus of style-transfer pairs with a product of experts during decoding; this offline data pre-trains an initial policy, which is then improved with online, off-policy reinforcement learning driven by fine-grained reward signals.
  • results: Achieves state-of-the-art results against competitive baselines on a challenging multi-style dataset. Notably, STEER outperforms the 175B-parameter instruction-tuned GPT-3 on overall style transfer quality despite being 226 times smaller, remains robust on out-of-domain data, and surpasses nearly all baselines across styles.
    Abstract While text style transfer has many applications across natural language processing, the core premise of transferring from a single source style is unrealistic in a real-world setting. In this work, we focus on arbitrary style transfer: rewriting a text from an arbitrary, unknown style to a target style. We propose STEER: Unified Style Transfer with Expert Reinforcement, a unified frame-work developed to overcome the challenge of limited parallel data for style transfer. STEER involves automatically generating a corpus of style-transfer pairs using a product of experts during decoding. The generated offline data is then used to pre-train an initial policy before switching to online, off-policy reinforcement learning for further improvements via fine-grained reward signals. STEER is unified and can transfer to multiple target styles from an arbitrary, unknown source style, making it particularly flexible and efficient. Experimental results on a challenging dataset with text from a diverse set of styles demonstrate state-of-the-art results compared to competitive baselines. Remarkably, STEER outperforms the 175B parameter instruction-tuned GPT-3 on overall style transfer quality, despite being 226 times smaller in size. We also show STEER is robust, maintaining its style transfer capabilities on out-of-domain data, and surpassing nearly all baselines across various styles. The success of our method highlights the potential of RL algorithms when augmented with controllable decoding to overcome the challenge of limited data supervision.
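The product-of-experts decoding used to generate the offline corpus can be sketched at the logit level. Under the usual formulation (an assumption; the paper's exact combination may differ), a base LM's next-token logits are shifted toward a target-style expert and away from a source-style anti-expert:

```python
# Product-of-experts next-token selection: multiplicative in probability
# space, i.e. additive in log space. The three logit tensors are
# hypothetical model outputs.
import torch

def poe_next_token(base_logits, expert_logits, antiexpert_logits, alpha=1.0):
    steered = base_logits + alpha * (expert_logits - antiexpert_logits)
    return int(torch.argmax(steered))            # greedy choice for brevity

vocab = 50257                                    # e.g. a GPT-2-sized vocab
base = torch.randn(vocab)
expert, anti = torch.randn(vocab), torch.randn(vocab)
tok = poe_next_token(base, expert, anti, alpha=2.0)
```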

Pruning random resistive memory for optimizing analogue AI

  • paper_url: http://arxiv.org/abs/2311.07164
  • repo_url: None
  • paper_authors: Yi Li, Songqi Wang, Yaping Zhao, Shaocong Wang, Woyu Zhang, Yangu He, Ning Lin, Binbin Cui, Xi Chen, Shiming Zhang, Hao Jiang, Peng Lin, Xumeng Zhang, Xiaojuan Qi, Zhongrui Wang, Xiaoxin Xu, Dashan Shang, Qi Liu, Kwang-Ting Cheng, Ming Liu
  • for: To address the energy consumption and environmental sustainability challenges of artificial intelligence (AI) by revisiting analogue computing.
  • methods: Uses software-hardware co-design, combining structural-plasticity-inspired edge pruning to optimize the topology of a randomly weighted analogue resistive memory neural network.
  • results: Achieves accuracy improvements of 17.3%, 19.9%, and 9.8% on the FashionMNIST, Spoken digits, and DRIVE datasets, respectively, along with energy-efficiency gains of 82.1%, 51.2%, and 99.8%.
    Abstract The rapid advancement of artificial intelligence (AI) has been marked by the large language models exhibiting human-like intelligence. However, these models also present unprecedented challenges to energy consumption and environmental sustainability. One promising solution is to revisit analogue computing, a technique that predates digital computing and exploits emerging analogue electronic devices, such as resistive memory, which features in-memory computing, high scalability, and nonvolatility. However, analogue computing still faces the same challenges as before: programming nonidealities and expensive programming due to the underlying devices physics. Here, we report a universal solution, software-hardware co-design using structural plasticity-inspired edge pruning to optimize the topology of a randomly weighted analogue resistive memory neural network. Software-wise, the topology of a randomly weighted neural network is optimized by pruning connections rather than precisely tuning resistive memory weights. Hardware-wise, we reveal the physical origin of the programming stochasticity using transmission electron microscopy, which is leveraged for large-scale and low-cost implementation of an overparameterized random neural network containing high-performance sub-networks. We implemented the co-design on a 40nm 256K resistive memory macro, observing 17.3% and 19.9% accuracy improvements in image and audio classification on FashionMNIST and Spoken digits datasets, as well as 9.8% (2%) improvement in PR (ROC) in image segmentation on DRIVE datasets, respectively. This is accompanied by 82.1%, 51.2%, and 99.8% improvement in energy efficiency thanks to analogue in-memory computing. By embracing the intrinsic stochasticity and in-memory computing, this work may solve the biggest obstacle of analogue computing systems and thus unleash their immense potential for next-generation AI hardware.

Enhancing Lightweight Neural Networks for Small Object Detection in IoT Applications

  • paper_url: http://arxiv.org/abs/2311.07163
  • repo_url: None
  • paper_authors: Liam Boyle, Nicolas Baumann, Seonyeong Heo, Michele Magno
  • for: To improve small-object detection accuracy while remaining suitable for embedded devices.
  • methods: Proposes an adaptive tiling method that can be applied on top of any existing object detector, including the FOMO network for microcontrollers.
  • results: Experiments show the tiling method boosts the F1-score by up to 225% while reducing the average object count error by up to 76%; using a soft F1 loss also effectively mitigates the negative impact of imbalanced data.
    Abstract Advances in lightweight neural networks have revolutionized computer vision in a broad range of IoT applications, encompassing remote monitoring and process automation. However, the detection of small objects, which is crucial for many of these applications, remains an underexplored area in current computer vision research, particularly for embedded devices. To address this gap, the paper proposes a novel adaptive tiling method that can be used on top of any existing object detector including the popular FOMO network for object detection on microcontrollers. Our experimental results show that the proposed tiling method can boost the F1-score by up to 225% while reducing the average object count error by up to 76%. Furthermore, the findings of this work suggest that using a soft F1 loss over the popular binary cross-entropy loss can significantly reduce the negative impact of imbalanced data. Finally, we validate our approach by conducting experiments on the Sony Spresense microcontroller, showcasing the proposed method's ability to strike a balance between detection performance, low latency, and minimal memory consumption.
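A minimal sketch of tiled inference for small-object detection, a generic version of the idea rather than the paper's adaptive variant: run the detector on overlapping crops, shift boxes back to image coordinates, and merge duplicates with non-maximum suppression. `detector` is a hypothetical stub returning (x1, y1, x2, y2, score) tuples per crop.

```python
import numpy as np

def tiled_detect(image, detector, tile=320, overlap=64, iou_thr=0.5):
    H, W = image.shape[:2]
    step = tile - overlap
    boxes = []
    for y0 in range(0, max(H - overlap, 1), step):
        for x0 in range(0, max(W - overlap, 1), step):
            crop = image[y0:y0 + tile, x0:x0 + tile]
            for x1, y1, x2, y2, s in detector(crop):
                # shift boxes from crop coordinates to image coordinates
                boxes.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, s))
    return nms(np.array(boxes), iou_thr) if boxes else []

def nms(b, thr):
    """Greedy non-maximum suppression on (x1, y1, x2, y2, score) rows."""
    order = b[:, 4].argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(b[i])
        xx1 = np.maximum(b[i, 0], b[order[1:], 0])
        yy1 = np.maximum(b[i, 1], b[order[1:], 1])
        xx2 = np.minimum(b[i, 2], b[order[1:], 2])
        yy2 = np.minimum(b[i, 3], b[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (b[i, 2] - b[i, 0]) * (b[i, 3] - b[i, 1])
        area_o = (b[order[1:], 2] - b[order[1:], 0]) * \
                 (b[order[1:], 3] - b[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-9)
        order = order[1:][iou < thr]
    return keep
```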

Interaction is all You Need? A Study of Robots Ability to Understand and Execute

  • paper_url: http://arxiv.org/abs/2311.07150
  • repo_url: https://github.com/nid989/teach_edh
  • paper_authors: Kushal Koshti, Nidhir Bhavsar
  • for: To help robots solve intricate tasks in human environments through natural-language interaction, specifically by understanding and executing complex instructions within coherent dialogs.
  • methods: Builds on the Execution from Dialog History (EDH) task from the TEACh benchmark, using a multi-transformer model with a BART LM; the best configuration outperforms the baseline with a success rate of 8.85 and a goal-conditioned success rate of 14.02, and an alternative methodology for the task is also proposed.
  • results: Evaluates multiple BART models and a LLaMA2 LLM on a new expanded task that predicts game plans instead of individual actions, where LLaMA2 achieves a ROUGE-L score of 46.77.
    Abstract This paper aims to address a critical challenge in robotics: enabling robots to operate seamlessly in human environments through natural language interactions. Our primary focus is to equip robots with the ability to understand and execute complex instructions in coherent dialogs to facilitate intricate task-solving scenarios. To explore this, we build upon the Execution from Dialog History (EDH) task from the TEACh benchmark. We employ a multi-transformer model with BART LM. We observe that our best configuration outperforms the baseline with a success rate score of 8.85 and a goal-conditioned success rate score of 14.02. In addition, we suggest an alternative methodology for completing this task. Moreover, we introduce a new task by expanding the EDH task and making predictions about game plans instead of individual actions. We have evaluated multiple BART models and an LLaMA2 LLM, which has achieved a ROUGE-L score of 46.77 for this task.

  • paper_url: http://arxiv.org/abs/2311.07139
  • repo_url: None
  • paper_authors: Arshika Lalan, Shresth Verma, Kumar Madhu Sudan, Amrita Mahale, Aparna Hegde, Milind Tambe, Aparna Taneja
  • for: To analyze beneficiary behavior in the Kilkari mobile health program and propose improvements that strengthen the program's impact.
  • methods: Analyzes beneficiary dropout behavior with time-series forecasting, enabling the NGO to predict dropouts and devise timely retention interventions.
  • results: Clustering beneficiaries by listenership reveals listening patterns that help the NGO understand beneficiary needs, and preliminary time-series forecasting results show that beneficiary dropouts can be predicted.
    Abstract Mobile health programs are becoming an increasingly popular medium for dissemination of health information among beneficiaries in less privileged communities. Kilkari is one of the world's largest mobile health programs which delivers time sensitive audio-messages to pregnant women and new mothers. We have been collaborating with ARMMAN, a non-profit in India which operates the Kilkari program, to identify bottlenecks to improve the efficiency of the program. In particular, we provide an initial analysis of the trajectories of beneficiaries' interaction with the mHealth program and examine elements of the program that can be potentially enhanced to boost its success. We cluster the cohort into different buckets based on listenership so as to analyze listenership patterns for each group that could help boost program success. We also demonstrate preliminary results on using historical data in a time-series prediction to identify beneficiary dropouts and enable NGOs in devising timely interventions to strengthen beneficiary retention.

WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models

  • paper_url: http://arxiv.org/abs/2311.07138
  • repo_url: https://github.com/THU-KEG/WaterBench
  • paper_authors: Shangqing Tu, Yuliang Sun, Yushi Bai, Jifan Yu, Lei Hou, Juanzi Li
  • for: To evaluate the effectiveness of watermarking algorithms for large language models (LLMs) and provide a comprehensive benchmark for them.
  • methods: Evaluates four open-source watermarks on two LLMs under two watermarking strengths, measuring generation and detection performance jointly across a five-category task taxonomy.
  • results: Current LLM watermarking algorithms struggle to maintain generation quality, with instruction-following abilities observed to decline after watermarking.
    Abstract To mitigate the potential misuse of large language models (LLMs), recent research has developed watermarking algorithms, which restrict the generation process to leave an invisible trace for watermark detection. Due to the two-stage nature of the task, most studies evaluate the generation and detection separately, thereby presenting a challenge in unbiased, thorough, and applicable evaluations. In this paper, we introduce WaterBench, the first comprehensive benchmark for LLM watermarks, in which we design three crucial factors: (1) For the benchmarking procedure, to ensure an apples-to-apples comparison, we first adjust each watermarking method's hyper-parameter to reach the same watermarking strength, then jointly evaluate their generation and detection performance. (2) For task selection, we diversify the input and output length to form a five-category taxonomy, covering 9 tasks. (3) For the evaluation metric, we adopt the GPT4-Judge for automatically evaluating the decline of instruction-following abilities after watermarking. We evaluate 4 open-source watermarks on 2 LLMs under 2 watermarking strengths and observe the common struggles for current methods on maintaining the generation quality. The code and data are available at https://github.com/THU-KEG/WaterBench.
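For context, one common watermark family that such benchmarks evaluate is green-list watermarking, where detection is a one-sided z-test on the fraction of "green" tokens. A minimal sketch of that detector (not WaterBench itself; the keyed hash below is illustrative):

```python
# Green-list watermark detection sketch: a keyed hash of (prev, current)
# token ids marks a gamma-fraction of continuations as "green"; a text is
# flagged when the observed green rate exceeds chance by a z-threshold.
import hashlib
import math

def is_green(prev_token: int, token: int, gamma: float = 0.5,
             key: str = "secret") -> bool:
    h = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return (h[0] / 255.0) < gamma

def watermark_zscore(tokens, gamma: float = 0.5, key: str = "secret") -> float:
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    hits = sum(is_green(p, t, gamma, key) for p, t in zip(tokens, tokens[1:]))
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# Flag text as watermarked when the z-score clears a threshold, e.g. 4.0.
```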

Understanding Path Planning Explanations

  • paper_url: http://arxiv.org/abs/2311.07132
  • repo_url: https://github.com/Sfedfcv/redesigned-pancake
  • paper_authors: Amar Halilovic, Senka Krivic
  • for: 本研究旨在解释移动机器人的导航决策。
  • methods: 我们提出了一种使用视觉和文本解释来解释机器人的导航决策。
  • results: 我们计划通过用户研究测试机器人的解释的理解性和简洁性,并启动未来研究计划。
    Abstract Navigation is a must-have skill for any mobile robot. A core challenge in navigation is the need to account for an ample number of possible configurations of environment and navigation contexts. We claim that a mobile robot should be able to explain its navigational choices making its decisions understandable to humans. In this paper, we briefly present our approach to explaining navigational decisions of a robot through visual and textual explanations. We propose a user study to test the understandability and simplicity of the robot explanations and outline our further research agenda.

Untargeted Black-box Attacks for Social Recommendations

  • paper_url: http://arxiv.org/abs/2311.07127
  • repo_url: None
  • paper_authors: Wenqi Fan, Shijie Wang, Xiao-yong Wei, Xiaowei Mei, Qing Li
  • for: To attack social recommender systems in an untargeted, black-box setting.
  • methods: Proposes Multiattack, an attack framework based on multi-agent reinforcement learning that coordinates the generation of cold-start item profiles and cross-community social relations for untargeted attacks on black-box social recommendations.
  • results: Extensive experiments on several real-world datasets demonstrate the effectiveness of the proposed attack framework under the black-box setting.
    Abstract The rise of online social networks has facilitated the evolution of social recommender systems, which incorporate social relations to enhance users' decision-making process. With the great success of Graph Neural Networks in learning node representations, GNN-based social recommendations have been widely studied to model user-item interactions and user-user social relations simultaneously. Despite their great successes, recent studies have shown that these advanced recommender systems are highly vulnerable to adversarial attacks, in which attackers can inject well-designed fake user profiles to disrupt recommendation performances. While most existing studies mainly focus on targeted attacks to promote target items on vanilla recommender systems, untargeted attacks to degrade the overall prediction performance are less explored on social recommendations under a black-box scenario. To perform untargeted attacks on social recommender systems, attackers can construct malicious social relationships for fake users to enhance the attack performance. However, the coordination of social relations and item profiles is challenging for attacking black-box social recommendations. To address this limitation, we first conduct several preliminary studies to demonstrate the effectiveness of cross-community connections and cold-start items in degrading recommendations performance. Specifically, we propose a novel framework Multiattack based on multi-agent reinforcement learning to coordinate the generation of cold-start item profiles and cross-community social relations for conducting untargeted attacks on black-box social recommendations. Comprehensive experiments on various real-world datasets demonstrate the effectiveness of our proposed attacking framework under the black-box setting.

Explanation-aware Soft Ensemble Empowers Large Language Model In-context Learning

  • paper_url: http://arxiv.org/abs/2311.07099
  • repo_url: None
  • paper_authors: Yue Yu, Jiaming Shen, Tianqi Liu, Zhen Qin, Jing Nathan Yan, Jialu Liu, Chao Zhang, Michael Bendersky
  • for: To improve the in-context learning ability of large language models (LLMs) on natural language understanding tasks.
  • methods: Proposes EASE, an Explanation-Aware Soft Ensemble framework with two techniques, explanation-guided ensembling and soft probability aggregation, to mitigate the effect of unreliable explanations and improve consistency between explanations and final predictions.
  • results: Experiments on seven natural language understanding tasks and four LLMs of varying size demonstrate the effectiveness of the proposed framework.
    Abstract Large language models (LLMs) have shown remarkable capabilities in various natural language understanding tasks. With only a few demonstration examples, these LLMs can quickly adapt to target tasks without expensive gradient updates. Common strategies to boost such 'in-context' learning ability are to ensemble multiple model decoded results and require the model to generate an explanation along with the prediction. However, these models often treat different class predictions equally and neglect the potential discrepancy between the explanations and predictions. To fully unleash the power of explanations, we propose EASE, an Explanation-Aware Soft Ensemble framework to empower in-context learning with LLMs. We design two techniques, explanation-guided ensemble, and soft probability aggregation, to mitigate the effect of unreliable explanations and improve the consistency between explanations and final predictions. Experiments on seven natural language understanding tasks and four varying-size LLMs demonstrate the effectiveness of our proposed framework.
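The soft probability aggregation is easy to sketch: each sampled prediction is weighted by a reliability score for its explanation, and the weighted class probabilities are averaged. A minimal sketch, where `probs` and `reliability` are hypothetical model outputs:

```python
# Explanation-aware soft ensembling: weight per-sample class
# probabilities by the reliability of the accompanying explanation.
import numpy as np

def soft_ensemble(probs, reliability):
    """probs: (n_samples, n_classes) class probabilities per sample;
    reliability: (n_samples,) scores for each sample's explanation."""
    w = np.asarray(reliability, dtype=float)
    w = w / w.sum()                             # normalize weights
    return (w[:, None] * np.asarray(probs)).sum(axis=0)

probs = [[0.7, 0.3], [0.4, 0.6], [0.8, 0.2]]
rel = [0.9, 0.2, 0.7]                           # trust the 2nd sample least
print(soft_ensemble(probs, rel).argmax())       # -> class 0
```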

To Tell The Truth: Language of Deception and Language Models

  • paper_url: http://arxiv.org/abs/2311.07092
  • repo_url: None
  • paper_authors: Bodhisattwa Prasad Majumder, Sanchaita Hazra
  • for: This paper aims to analyze the ability of individuals to discern truth from misinformation in a high-stake environment, and to develop a machine learning model that can detect deception in text-based conversations.
  • methods: The paper uses a novel dataset of TV game show conversations to investigate the manifestation of potentially verifiable language cues of deception in the presence of objective truth. The authors develop a machine learning model, built on a large language model, that employs a bottleneck framework to learn discernible cues to determine truth.
  • results: The paper shows that the machine learning model can detect novel but accurate language cues in many cases where humans failed to detect deception, opening up the possibility of humans collaborating with algorithms to improve their ability to detect the truth.
    Abstract Text-based misinformation permeates online discourses, yet evidence of people's ability to discern truth from such deceptive textual content is scarce. We analyze a novel TV game show data where conversations in a high-stake environment between individuals with conflicting objectives result in lies. We investigate the manifestation of potentially verifiable language cues of deception in the presence of objective truth, a distinguishing feature absent in previous text-based deception datasets. We show that there exists a class of detectors (algorithms) that have similar truth detection performance compared to human subjects, even when the former accesses only the language cues while the latter engages in conversations with complete access to all potential sources of cues (language and audio-visual). Our model, built on a large language model, employs a bottleneck framework to learn discernible cues to determine truth, an act of reasoning in which human subjects often perform poorly, even with incentives. Our model detects novel but accurate language cues in many cases where humans failed to detect deception, opening up the possibility of humans collaborating with algorithms and ameliorating their ability to detect the truth.

Sample Dominance Aware Framework via Non-Parametric Estimation for Spontaneous Brain-Computer Interface

  • paper_url: http://arxiv.org/abs/2311.07079
  • repo_url: None
  • paper_authors: Byeong-Hoo Lee, Byoung-Hee Kwon, Seong-Whan Lee
  • for: To address the challenges that the non-stationary characteristics of electroencephalogram (EEG) signals pose for training neural networks, thereby improving the performance of spontaneous brain-computer interfaces (BCIs).
  • methods: Proposes a sample-dominance-based approach with a two-stage dominance score estimation technique that compensates for the effect of sample inconsistency on network training.
  • results: Experimental results show that the method enhances spontaneous BCI performance and underline the importance of accounting for sample dominance.
    Abstract Deep learning has shown promise in decoding brain signals, such as electroencephalogram (EEG), in the field of brain-computer interfaces (BCIs). However, the non-stationary characteristics of EEG signals pose challenges for training neural networks to acquire appropriate knowledge. Inconsistent EEG signals resulting from these non-stationary characteristics can lead to poor performance. Therefore, it is crucial to investigate and address sample inconsistency to ensure robust performance in spontaneous BCIs. In this study, we introduce the concept of sample dominance as a measure of EEG signal inconsistency and propose a method to modulate its effect on network training. We present a two-stage dominance score estimation technique that compensates for performance degradation caused by sample inconsistencies. Our proposed method utilizes non-parametric estimation to infer sample inconsistency and assigns each sample a dominance score. This score is then aggregated with the loss function during training to modulate the impact of sample inconsistency. Furthermore, we design a curriculum learning approach that gradually increases the influence of inconsistent signals during training to improve overall performance. We evaluate our proposed method using public spontaneous BCI dataset. The experimental results confirm that our findings highlight the importance of addressing sample dominance for achieving robust performance in spontaneous BCIs.
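A minimal sketch of the weighting idea, with an important caveat: the paper uses a two-stage non-parametric estimator, for which the k-NN label disagreement below is only a stand-in. Low-dominance (inconsistent) samples are down-weighted in the loss, and a curriculum schedule gradually raises their influence as training proceeds.

```python
# Dominance-score weighting sketch: a non-parametric proxy score per
# sample, used to modulate the training loss under a curriculum.
import numpy as np

def dominance_scores(X, y, k=5):
    """Fraction of each sample's k nearest neighbours sharing its label
    (a stand-in for the paper's two-stage estimator)."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    return (y[nn] == y[:, None]).mean(axis=1)   # scores in [0, 1]

def weighted_loss(losses, scores, epoch, total_epochs):
    """Blend from dominance-weighted toward uniform weighting so that
    inconsistent samples gain influence later in training."""
    alpha = epoch / total_epochs                # curriculum schedule
    w = alpha + (1 - alpha) * scores            # -> 1 as alpha -> 1
    return (w * losses).mean()
```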

The Impact of Generative Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2311.07071
  • repo_url: None
  • paper_authors: Kaichen Zhang, Ohchan Kwon, Hui Xiong
  • for: To examine the impact of generative artificial intelligence on product markets, in response to concerns that it may cause unemployment and market depression.
  • methods: Exploits a "natural experiment" to overcome the challenge of causal inference: an unanticipated and sudden leak of a highly proficient image-generative AI that sharply reduced the cost of generating anime-style images relative to other styles, enabling comparative assessment.
  • results: Although generative AI lowers average prices, it substantially boosts order volume and overall revenue; this counterintuitive finding suggests that generative AI benefits artists rather than harming them.
    Abstract The rise of generative artificial intelligence (AI) has sparked concerns about its potential influence on unemployment and market depression. This study addresses this concern by examining the impact of generative AI on product markets. To overcome the challenge of causal inference, given the inherent limitations of conducting controlled experiments, this paper identifies an unanticipated and sudden leak of a highly proficient image-generative AI as a novel instance of a "natural experiment". This AI leak spread rapidly, significantly reducing the cost of generating anime-style images compared to other styles, creating an opportunity for comparative assessment. We collect real-world data from an artwork outsourcing platform. Surprisingly, our results show that while generative AI lowers average prices, it substantially boosts order volume and overall revenue. This counterintuitive finding suggests that generative AI confers benefits upon artists rather than detriments. The study further offers theoretical economic explanations to elucidate this unexpected phenomenon. By furnishing empirical evidence, this paper dispels the notion that generative AI might engender depression, instead underscoring its potential to foster market prosperity. These findings carry significant implications for practitioners, policymakers, and the broader AI community.

Non-approximability of constructive global $\mathcal{L}^2$ minimizers by gradient descent in Deep Learning

  • paper_url: http://arxiv.org/abs/2311.07065
  • repo_url: None
  • paper_authors: Thomas Chen, Patricia Muñoz Ewald
  • for: To analyze geometric aspects of the gradient descent algorithm in Deep Learning (DL) networks.
  • methods: Studies whether the gradient descent flow can approximate constructively obtained global minimizers of the $\mathcal{L}^2$ cost for underparametrized ReLU DL networks.
  • results: Shows that the globally minimizing weights and biases generically cannot be approximated via the gradient descent flow, so the constructive method is disjoint from gradient descent.
    Abstract We analyze geometric aspects of the gradient descent algorithm in Deep Learning (DL) networks. In particular, we prove that the globally minimizing weights and biases for the $\mathcal{L}^2$ cost obtained constructively in [Chen-Munoz Ewald 2023] for underparametrized ReLU DL networks can generically not be approximated via the gradient descent flow. We therefore conclude that the method introduced in [Chen-Munoz Ewald 2023] is disjoint from the gradient descent method.

Effective In-vehicle Intrusion Detection via Multi-view Statistical Graph Learning on CAN Messages

  • paper_url: http://arxiv.org/abs/2311.07056
  • repo_url: https://github.com/wangkai-tech23/StatGraph
  • paper_authors: Kai Wang, Qiguang Jiang, Bailing Wang, Yongzheng Zhang, Yulei Wu
  • for: Fine-grained intrusion detection for intelligent connected vehicles (ICVs), whose resource-constrained in-vehicle networks face complex and evolving attacks through frequent communication with external networks.
  • methods: Proposes StatGraph, a multi-view statistical graph learning intrusion detection method that converts data streams into two statistical graphs, a timing correlation graph (TCG) and a coupling relationship graph (CRG), and trains a lightweight GCN network on their graph properties for more effective detection.
  • results: Experiments on two real in-vehicle CAN datasets, covering four attacks never investigated before, show that StatGraph improves both detection granularity and detection performance over previous intrusion detection methods.
    Abstract As an important component of internet of vehicles (IoV), intelligent connected vehicles (ICVs) have to communicate with external networks frequently. In this case, the resource-constrained in-vehicle network (IVN) is facing a wide variety of complex and changing external cyber-attacks, especially masquerade attacks, which are difficult to detect, seriously damaging, and successfully identified by few countermeasures. Moreover, only coarse-grained recognition can be achieved in current mainstream intrusion detection mechanisms, i.e., they determine whether a whole data-flow observation window contains attack labels rather than recognizing every single data item within the window. In this paper, we propose StatGraph, an effective multi-view statistical graph learning method for fine-grained intrusion detection. Specifically, StatGraph generates two statistical graphs, timing correlation graph (TCG) and coupling relationship graph (CRG), based on data streams. In given message observation windows, edge attributes in TCGs represent temporal correlation between different message IDs, while edge attributes in CRGs denote the neighbour relationship and contextual similarity. Besides, a lightweight shallow-layered GCN network is trained on the graph properties of TCGs and CRGs, which can learn the universal laws of various patterns more effectively and further enhance the performance of detection. To address the problem of insufficient attack types in previous intrusion detection, we select two real in-vehicle CAN datasets that cover four new attacks never investigated before. Experimental results show StatGraph improves both detection granularity and detection performance over state-of-the-art intrusion detection methods.
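The timing correlation graph can be sketched from a window of CAN message IDs: nodes are IDs and a directed edge (a -> b) carries the normalized frequency with which b immediately follows a. This is a minimal stand-in for the paper's TCG attributes, not the released implementation.

```python
# Build a toy timing-correlation graph (TCG) from one observation
# window of CAN message IDs; edge attributes are normalized transition
# frequencies between consecutive IDs.
from collections import Counter

def build_tcg(window_ids):
    edges = Counter(zip(window_ids, window_ids[1:]))
    total = max(sum(edges.values()), 1)
    return {(a, b): c / total for (a, b), c in edges.items()}

tcg = build_tcg([0x110, 0x2A0, 0x110, 0x4B1, 0x110, 0x2A0])
# e.g. {(0x110, 0x2A0): 0.4, (0x2A0, 0x110): 0.2, ...}
```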

Towards the Law of Capacity Gap in Distilling Language Models

  • paper_url: http://arxiv.org/abs/2311.07052
  • repo_url: https://github.com/genezc/minima
  • paper_authors: Chen Zhang, Dawei Song, Zheyu Ye, Yan Gao
  • for: To investigate how best to distil language models (LMs), especially when a large capacity gap exists between the teacher and student LMs.
  • methods: Formulates a new law, the law of capacity gap, that characterizes the optimal student LM to distil along the teacher's scaling course.
  • results: The optimal capacity gap is almost consistent across student scales and architectures, simplifying the choice during distillation. Guided by the law, a 3B student LM (MiniMA) distilled from a 7B teacher LM yields a new compute-performance Pareto frontier among 3B LMs on common benchmarks, and its instruction-tuned version (MiniChat) outperforms a wide range of 3B competitors in GPT4 evaluation and even competes with several 7B chat models.
    Abstract Language model (LM) distillation is a trending area that aims to distil the knowledge resided in a large teacher LM to a small student one. While various methods have been proposed to push the distillation to its limits, it is still a pain distilling LMs when a large capacity gap is exhibited between the teacher and the student LMs. The pain is mainly resulted by the curse of capacity gap, which describes that a larger teacher LM cannot always lead to a better student LM than one distilled from a smaller teacher LM due to the affect of capacity gap increment. That is, there is likely an optimal point yielding the best student LM along the scaling course of the teacher LM. Even worse, the curse of capacity gap can be only partly yet not fully lifted as indicated in previous studies. However, the tale is not ever one-sided. Although a larger teacher LM has better performance than a smaller teacher LM, it is much more resource-demanding especially in the context of recent large LMs (LLMs). Consequently, instead of sticking to lifting the curse, leaving the curse as is should be arguably fine. Even better, in this paper, we reveal that the optimal capacity gap is almost consistent across different student scales and architectures, fortunately turning the curse into the law of capacity gap. The law later guides us to distil a 3B student LM (termed MiniMA) from a 7B teacher LM (adapted LLaMA2-7B). MiniMA is demonstrated to yield a new compute-performance pareto frontier among existing 3B LMs on commonly used benchmarks, and its instruction-tuned version (termed MiniChat) outperforms a wide range of 3B competitors in GPT4 evaluation and could even compete with several 7B chat models.

Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method

  • paper_url: http://arxiv.org/abs/2311.07037
  • repo_url: None
  • paper_authors: Mostafa Shahin, Julien Epps, Beena Ahmed
  • for: To improve mispronunciation detection and diagnosis (MDD) in computer-aided pronunciation learning (CAPL) tools, particularly for second-language (L2) learning and speech therapy applications.
  • methods: Proposes a low-level MDD approach based on detecting speech attribute features, which gives learners more formative feedback, together with a multi-label variant of Connectionist Temporal Classification (CTC) that jointly models the non-mutually-exclusive attributes with a single model built on pre-trained wav2vec2.
  • results: On L2 speech corpora collected from English learners with different native languages, the speech-attribute MDD method achieves significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) than traditional phoneme-level MDD.
    Abstract The automatic identification and analysis of pronunciation errors, known as Mispronunciation Detection and Diagnosis (MDD) plays a crucial role in Computer Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning or speech therapy applications. Existing MDD methods relying on analysing phonemes can only detect categorical errors of phonemes that have an adequate amount of training data to be modelled. With the unpredictable nature of the pronunciation errors of non-native or disordered speakers and the scarcity of training datasets, it is unfeasible to model all types of mispronunciations. Moreover, phoneme-level MDD approaches have a limited ability to provide detailed diagnostic information about the error made. In this paper, we propose a low-level MDD approach based on the detection of speech attribute features. Speech attribute features break down phoneme production into elementary components that are directly related to the articulatory system leading to more formative feedback to the learner. We further propose a multi-label variant of the Connectionist Temporal Classification (CTC) approach to jointly model the non-mutually exclusive speech attributes using a single model. The pre-trained wav2vec2 model was employed as a core model for the speech attribute detector. The proposed method was applied to L2 speech corpora collected from English learners from different native languages. The proposed speech attribute MDD method was further compared to the traditional phoneme-level MDD and achieved a significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) over all speech attributes compared to the phoneme-level equivalent.
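The multi-label CTC idea can be sketched as one head per attribute group over shared wav2vec2 features, with the per-attribute CTC losses summed. A minimal sketch, not the paper's exact model; shapes and vocabulary sizes are illustrative assumptions.

```python
# Multi-head CTC over shared acoustic features: one linear head per
# speech attribute group (e.g. place, manner, voicing), each with its
# own blank symbol, losses summed into a single training objective.
import torch
import torch.nn as nn

class AttributeCTC(nn.Module):
    def __init__(self, feat_dim, attr_vocab_sizes):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feat_dim, v + 1)   # +1 blank
                                   for v in attr_vocab_sizes)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, targets, target_lens):
        # feats: (B, T, F), e.g. wav2vec2 hidden states
        # targets/target_lens: one padded tensor and length vector per head
        loss = 0.0
        for head, tgt, tlen in zip(self.heads, targets, target_lens):
            logp = head(feats).log_softmax(-1).transpose(0, 1)  # (T, B, V)
            loss = loss + self.ctc(logp, tgt, feat_lens, tlen)
        return loss
```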

ExpNote: Black-box Large Language Models are Better Task Solvers with Experience Notebook

  • paper_url: http://arxiv.org/abs/2311.07032
  • repo_url: https://github.com/forangel2014/expnote
  • paper_authors: Wangtao Sun, Xuanqing Yu, Shizhu He, Jun Zhao, Kang Liu
  • for: The paper aims to boost the performance of black-box large language models (LLMs) on downstream tasks.
  • methods: It proposes ExpNote, an automated framework that helps LLMs adapt to unfamiliar tasks by reflecting on and noting experiences from training data, then retrieving those experiences from an external memory at test time.
  • results: Experiments on multiple tasks show that the method significantly improves the performance of black-box LLMs. Data and code are available at https://github.com/forangel2014/ExpNote.
    Abstract Black-box Large Language Models (LLMs) have shown great power in solving various tasks and are considered general problem solvers. However, LLMs still fail in many specific tasks although understand the task instruction. In this paper, we focus on the problem of boosting the ability of black-box LLMs to solve downstream tasks. We propose ExpNote, an automated framework to help LLMs better adapt to unfamiliar tasks through reflecting and noting experiences from training data and retrieving them from external memory during testing. We evaluate ExpNote on multiple tasks and the experimental results demonstrate that the proposed method significantly improves the performance of black-box LLMs. The data and code are available at https://github.com/forangel2014/ExpNote
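
A minimal sketch of the note-and-retrieve loop is given below; the class, prompts, and word-overlap retrieval are hypothetical stand-ins, not the API of the ExpNote repository.

```python
# Sketch: an experience notebook around a black-box LLM (hypothetical,
# not the actual ExpNote code). `llm` is any prompt-in, text-out callable.
from dataclasses import dataclass, field

@dataclass
class ExperienceNotebook:
    notes: list = field(default_factory=list)  # (question, lesson) pairs

    def reflect(self, question, prediction, answer, llm):
        """Training time: distill a reusable lesson whenever the LLM errs."""
        if prediction.strip() != answer.strip():
            lesson = llm(
                f"Question: {question}\nWrong answer: {prediction}\n"
                f"Correct answer: {answer}\nState a short, reusable lesson:"
            )
            self.notes.append((question, lesson))

    def retrieve(self, question, k=2):
        """Test time: return lessons from the k most similar noted questions."""
        def overlap(q):
            return len(set(q.split()) & set(question.split()))
        return [l for _, l in sorted(self.notes, key=lambda n: -overlap(n[0]))[:k]]

def answer_with_notes(question, notebook, llm):
    hints = "\n".join(notebook.retrieve(question))
    return llm(f"Lessons from experience:\n{hints}\n\nQuestion: {question}\nAnswer:")
```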

Embarrassingly Simple Dataset Distillation

  • paper_url: http://arxiv.org/abs/2311.07025
  • repo_url: None
  • paper_authors: Yunzhen Feng, Ramakrishna Vedantam, Julia Kempe
  • for: The paper aims to achieve competitive performance on test data when training on a small set of synthetic samples, through dataset distillation.
  • methods: It treats dataset distillation directly as a bilevel optimization problem and introduces Random Truncated Backpropagation Through Time (RaT-BPTT) to address gradient variance, computational burden, and long-term dependencies.
  • results: RaT-BPTT establishes new state-of-the-art performance on standard benchmarks; the authors further find that distilled datasets exhibit pronounced intercorrelation and devise a boosting mechanism that generates distilled datasets whose subsets achieve near-optimal performance across different data budgets.
    Abstract Dataset distillation extracts a small set of synthetic training samples from a large dataset with the goal of achieving competitive performance on test data when trained on this sample. In this work, we tackle dataset distillation at its core by treating it directly as a bilevel optimization problem. Re-examining the foundational back-propagation through time method, we study the pronounced variance in the gradients, computational burden, and long-term dependencies. We introduce an improved method: Random Truncated Backpropagation Through Time (RaT-BPTT) to address them. RaT-BPTT incorporates a truncation coupled with a random window, effectively stabilizing the gradients and speeding up the optimization while covering long dependencies. This allows us to establish new state-of-the-art for a variety of standard dataset benchmarks. A deeper dive into the nature of distilled data unveils pronounced intercorrelation. In particular, subsets of distilled datasets tend to exhibit much worse performance than directly distilled smaller datasets of the same size. Leveraging RaT-BPTT, we devise a boosting mechanism that generates distilled datasets that contain subsets with near optimal performance across different data budgets.
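
A toy version of the random truncated unrolling is sketched below on a linear classifier: the inner model is first advanced graph-free to a randomly placed window, then unrolled with gradients so that the meta-loss on real data backpropagates to the synthetic data through that window only. The model, window size, and learning rates are illustrative assumptions.

```python
# Sketch: Random Truncated BPTT (RaT-BPTT) on a toy linear model.
# Real distillation uses neural networks and tuned hyperparameters.
import random
import torch
import torch.nn.functional as F

def rat_bptt_step(x_syn, y_syn, x_real, y_real,
                  total_steps=60, window=20, inner_lr=0.1):
    """One outer step: backprop to x_syn through a random truncated window."""
    w = torch.zeros(x_syn.size(1), int(y_syn.max()) + 1)
    start = random.randint(0, total_steps - window)

    # Burn-in: advance the inner model to the window start without a graph.
    for _ in range(start):
        w_ = w.requires_grad_(True)
        loss = F.cross_entropy(x_syn.detach() @ w_, y_syn)
        (g,) = torch.autograd.grad(loss, w_)
        w = (w_ - inner_lr * g).detach()

    # Truncated window: unroll with create_graph so gradients reach x_syn.
    w.requires_grad_(True)
    for _ in range(window):
        loss = F.cross_entropy(x_syn @ w, y_syn)
        (g,) = torch.autograd.grad(loss, w, create_graph=True)
        w = w - inner_lr * g

    # Meta-objective: the unrolled model's loss on real data.
    meta_loss = F.cross_entropy(x_real @ w, y_real)
    meta_loss.backward()  # fills x_syn.grad for the outer optimizer
    return meta_loss.item()

# Usage: x_syn = torch.randn(50, 784, requires_grad=True), then repeatedly
# call rat_bptt_step(...) and step an Adam optimizer over [x_syn].
```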

ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models

  • paper_url: http://arxiv.org/abs/2311.07022
  • repo_url: https://github.com/ilkerkesen/ViLMA
  • paper_authors: Ilker Kesen, Andrea Pedrotti, Mustafa Dogan, Michele Cafagna, Emre Can Acikgoz, Letitia Parcalabescu, Iacer Calixto, Anette Frank, Albert Gatt, Aykut Erdem, Erkut Erdem
  • for: The goal is a task-agnostic benchmark for evaluating the fine-grained visio-linguistic capabilities of Video-Language Models (VidLMs).
  • methods: VidLMs are tested with carefully curated counterfactuals, complemented by a series of proficiency tests that assess the basic capabilities deemed essential to solving the main counterfactual tests.
  • results: Current VidLMs ground language no better than vision-language models (VLMs) that use static images, a gap that becomes especially striking once proficiency-test performance is factored in; overall performance remains well below human-level understanding, leaving much to explore.
    Abstract With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.
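
The core zero-shot protocol can be sketched as pairwise scoring: the model is counted correct on an example when it assigns the true caption a higher video-conditioned score than a minimally edited foil. The scoring callable below is a placeholder for any VidLM, and gating by the proficiency test is one plausible reading of how the two test types are combined.

```python
# Sketch: counterfactual (caption vs. foil) evaluation of a VidLM.
# `score(video, text)` is a placeholder for any model's similarity score.
from typing import Callable, Iterable, Tuple

Example = Tuple[str, str, str]  # (video_path, caption, foil)

def counterfactual_accuracy(examples: Iterable[Example],
                            score: Callable[[str, str], float]) -> float:
    """Fraction of examples where the true caption outscores its foil."""
    hits = total = 0
    for video, caption, foil in examples:
        hits += score(video, caption) > score(video, foil)
        total += 1
    return hits / max(total, 1)

def combined_accuracy(examples: Iterable[Example],
                      score: Callable[[str, str], float],
                      passes_proficiency: Callable[[str], bool]) -> float:
    """Credit an example only when the proficiency test is also passed."""
    hits = total = 0
    for video, caption, foil in examples:
        hits += passes_proficiency(video) and score(video, caption) > score(video, foil)
        total += 1
    return hits / max(total, 1)
```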

Context-dependent Instruction Tuning for Dialogue Response Generation

  • paper_url: http://arxiv.org/abs/2311.07006
  • repo_url: None
  • paper_authors: Jin Myung Kwak, Minseon Kim, Sung Ju Hwang
  • for: The paper tackles input variation in complex multi-turn dialogue generation, where the input changes with every turn as the dialogue context evolves and fixed task instructions stop helping.
  • methods: It proposes a context-based instruction fine-tuning framework that, for each multi-turn dialogue, generates both responses and instructions conditioned on the previous dialogue context; at evaluation time the model generates an instruction from the previous context to self-guide its response.
  • results: In quantitative evaluations on dialogue benchmark datasets, the framework adapts to input variation better than the baselines while reducing the computation budget.
    Abstract Recent language models have achieved impressive performance in natural language tasks by incorporating instructions with task input during fine-tuning. Since all samples in the same natural language task can be explained with the same task instructions, many instruction datasets only provide a few instructions for the entire task, without considering the input of each example in the task. However, this approach becomes ineffective in complex multi-turn dialogue generation tasks, where the input varies highly with each turn as the dialogue context changes, so that simple task instructions cannot improve the generation performance. To address this limitation, we introduce a context-based instruction fine-tuning framework for each multi-turn dialogue which generates both responses and instructions based on the previous context as input. During the evaluation, the model generates instructions based on the previous context to self-guide the response. The proposed framework produces comparable or even outstanding results compared to the baselines by aligning instructions to the input during fine-tuning with the instructions in quantitative evaluations on dialogue benchmark datasets with reduced computation budget.
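
One way to realize this is at the data level: each system turn becomes a training sample whose target contains a turn-specific instruction followed by the response, both conditioned on the dialogue so far. The template and field names below are assumptions, not the paper's exact format.

```python
# Sketch: assembling context-dependent instruction-tuning samples
# (template and field names are illustrative assumptions).
def build_training_samples(dialogue):
    """dialogue: list of {"speaker", "text", "instruction"} turns."""
    samples = []
    for i, turn in enumerate(dialogue):
        if turn["speaker"] != "system":
            continue  # only system turns carry training targets
        context = "\n".join(f"{t['speaker']}: {t['text']}" for t in dialogue[:i])
        samples.append({
            "input": f"Dialogue so far:\n{context}\n\nInstruction and response:",
            "output": f"Instruction: {turn['instruction']}\nResponse: {turn['text']}",
        })
    return samples

# At test time the fine-tuned model, given only the context, first generates
# an instruction and then the response that this instruction self-guides.
```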

AGRAMPLIFIER: Defending Federated Learning Against Poisoning Attacks Through Local Update Amplification

  • paper_url: http://arxiv.org/abs/2311.06996
  • repo_url: None
  • paper_authors: Zirui Gong, Liyue Shen, Yanjun Zhang, Leo Yu Zhang, Jingwei Wang, Guangdong Bai, Yong Xiang
  • for: This work addresses the Byzantine poisoning threat posed by the collaborative nature of Federated Learning (FL).
  • methods: It proposes AGRAMPLIFIER, a novel mechanism that amplifies the "morality" of local updates to simultaneously improve the robustness, fidelity, and efficiency of existing Byzantine-robust aggregation rules (AGRs).
  • results: Combined with existing Byzantine-robust mechanisms, AGRAMPLIFIER improves robustness, fidelity, and efficiency, with average gains of 40.08%, 39.18%, and 10.68%, respectively.
    Abstract The collaborative nature of federated learning (FL) poses a major threat in the form of manipulation of local training data and local updates, known as the Byzantine poisoning attack. To address this issue, many Byzantine-robust aggregation rules (AGRs) have been proposed to filter out or moderate suspicious local updates uploaded by Byzantine participants. This paper introduces a novel approach called AGRAMPLIFIER, aiming to simultaneously improve the robustness, fidelity, and efficiency of the existing AGRs. The core idea of AGRAMPLIFIER is to amplify the "morality" of local updates by identifying the most repressive features of each gradient update, which provides a clearer distinction between malicious and benign updates, consequently improving the detection effect. To achieve this objective, two approaches, namely AGRMP and AGRXAI, are proposed. AGRMP organizes local updates into patches and extracts the largest value from each patch, while AGRXAI leverages explainable AI methods to extract the gradient of the most activated features. By equipping AGRAMPLIFIER with the existing Byzantine-robust mechanisms, we successfully enhance the model's robustness, maintaining its fidelity and improving overall efficiency. AGRAMPLIFIER is universally compatible with the existing Byzantine-robust mechanisms. The paper demonstrates its effectiveness by integrating it with all mainstream AGR mechanisms. Extensive evaluations conducted on seven datasets from diverse domains against seven representative poisoning attacks consistently show enhancements in robustness, fidelity, and efficiency, with average gains of 40.08%, 39.18%, and 10.68%, respectively.
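
The AGRMP variant can be pictured as patch-wise max-pooling over each flattened local update before any existing robust rule inspects it. The sketch below follows that description; the patch size and the way filtered indices feed back into averaging are assumptions.

```python
# Sketch: AGRMP-style amplification before Byzantine-robust aggregation.
import numpy as np

def agrmp_amplify(update: np.ndarray, patch: int = 64) -> np.ndarray:
    """Largest value of each length-`patch` patch of the flattened update."""
    flat = update.ravel()
    flat = np.pad(flat, (0, (-len(flat)) % patch))  # pad to whole patches
    return flat.reshape(-1, patch).max(axis=1)

def amplified_aggregate(updates, robust_select, patch: int = 64):
    """Run an existing AGR on the amplified views, average the survivors.

    `robust_select` stands in for any Byzantine-robust rule (e.g. a Krum- or
    trimmed-mean-style selector) that returns the indices it judges benign.
    """
    views = np.stack([agrmp_amplify(u, patch) for u in updates])
    keep = robust_select(views)
    return np.mean([updates[i] for i in keep], axis=0)
```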

State-of-the-Art Review and Synthesis: A Requirement-based Roadmap for Standardized Predictive Maintenance Automation Using Digital Twin Technologies

  • paper_url: http://arxiv.org/abs/2311.06993
  • repo_url: None
  • paper_authors: Sizhe Ma, Katherine A. Flanigan, Mario Bergés
  • for: This paper aims to provide a requirement-based roadmap for standardized predictive maintenance (PMx) automation using digital twin (DT) technologies.
  • methods: The paper uses a systematic approach that includes identifying informational requirements (IRs) and functional requirements (FRs) for PMx, and conducting a literature review to determine how these requirements are currently being used in DTs.
  • results: The paper provides a roadmap for the development of standardized PMx automation using DTs, and highlights the areas where further research is needed to support the progress and maturation of these technologies.
    Abstract Recent digital advances have popularized predictive maintenance (PMx), offering enhanced efficiency, automation, accuracy, cost savings, and independence in maintenance. Yet, it continues to face numerous limitations such as poor explainability, sample inefficiency of data-driven methods, complexity of physics-based methods, and limited generalizability and scalability of knowledge-based methods. This paper proposes leveraging Digital Twins (DTs) to address these challenges and enable automated PMx adoption at larger scales. While we argue that DTs have this transformative potential, they have not yet reached the level of maturity needed to bridge these gaps in a standardized way. Without a standard definition for such evolution, this transformation lacks a solid foundation upon which to base its development. This paper provides a requirement-based roadmap supporting standardized PMx automation using DT technologies. A systematic approach comprising two primary stages is presented. First, we methodically identify the Informational Requirements (IRs) and Functional Requirements (FRs) for PMx, which serve as a foundation from which any unified framework must emerge. Our approach to defining and using IRs and FRs to form the backbone of any PMx DT is supported by the track record of IRs and FRs being successfully used as blueprints in other areas, such as for product development within the software industry. Second, we conduct a thorough literature review spanning fields to determine the ways in which these IRs and FRs are currently being used within DTs, enabling us to point to the specific areas where further research is warranted to support the progress and maturation of requirement-based PMx DTs.