cs.AI - 2023-12-05

FERGI: Automatic Annotation of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction

  • paper_url: http://arxiv.org/abs/2312.03187
  • repo_url: https://github.com/shuangquanfeng/fergi
  • paper_authors: Shuangquan Feng, Junhua Ma, Virginia R. de Sa
  • for: The goal of this paper is to use human preference feedback data to fine-tune text-to-image generative models.
  • methods: The paper automatically annotates user preference feedback from users' spontaneous facial expression reactions to the generated images.
  • results: The study finds that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images; in particular, AU4 (brow lowerer) is the most consistent indicator of a negative evaluation of a generated image (see the sketch below). The method can automatically annotate user preferences and can be integrated with existing scoring models to improve their consistency with human preferences.
    Abstract Researchers have proposed to use data of human preference feedback to fine-tune text-to-image generative models. However, the scalability of human feedback collection has been limited by its reliance on manual annotation. Therefore, we develop and test a method to automatically annotate user preferences from their spontaneous facial expression reaction to the generated images. We collect a dataset of Facial Expression Reaction to Generated Images (FERGI) and show that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images. Specifically, AU4 (brow lowerer) is most consistently reflective of negative evaluations of the generated image. This can be useful in two ways. Firstly, we can automatically annotate user preferences between image pairs with substantial difference in AU4 responses to them with an accuracy significantly outperforming state-of-the-art scoring models. Secondly, directly integrating the AU4 responses with the scoring models improves their consistency with human preferences. Additionally, the AU4 response best reflects the user's evaluation of the image fidelity, making it complementary to the state-of-the-art scoring models, which are generally better at reflecting image-text alignment. Finally, this method of automatic annotation with facial expression analysis can be potentially generalized to other generation tasks. The code is available at https://github.com/ShuangquanFeng/FERGI, and the dataset is also available at the same link for research purposes.
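The core annotation rule is simple: since AU4 activation signals a negative evaluation, the image in a pair that elicits substantially less AU4 can be labeled as preferred. A minimal sketch of that rule, assuming mean AU4 intensity as the aggregate response and an arbitrary decision margin (the paper's actual aggregation and thresholding may differ):

```python
import numpy as np

def annotate_preference(au4_img1: np.ndarray, au4_img2: np.ndarray,
                        margin: float = 0.2):
    """Label the image eliciting substantially *less* AU4 as preferred,
    since AU4 (brow lowerer) activation signals a negative evaluation.
    Mean aggregation and the margin value are assumptions."""
    r1, r2 = au4_img1.mean(), au4_img2.mean()
    if abs(r1 - r2) < margin:
        return None  # difference too small: leave the pair unannotated
    return "image1" if r1 < r2 else "image2"

# AU4 intensity time series recorded while the user views each generated image.
print(annotate_preference(np.array([0.1, 0.2, 0.1]), np.array([0.8, 0.9, 0.7])))
# -> image1
```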

Data-Driven Traffic Reconstruction and Kernel Methods for Identifying Stop-and-Go Congestion

  • paper_url: http://arxiv.org/abs/2312.03186
  • repo_url: None
  • paper_authors: Edgar Ramirez Sanchez, Shreyaa Raghavan, Cathy Wu
  • for: This study aims to advance data-driven research for climate change mitigation and sustainability by providing foundational data on traffic congestion.
  • methods: The study uses traffic reconstruction techniques to identify stop-and-go events. In particular, it introduces a kernel-based method for characterizing spatio-temporal features in traffic and leverages bootstrapping to quantify the uncertainty of the reconstruction process (see the sketch below).
  • results: Experimental results show that the method accurately captures stop-and-go events on California highways, providing a foundation for data-driven decision making.
    Abstract Identifying stop-and-go events (SAGs) in traffic flow presents an important avenue for advancing data-driven research for climate change mitigation and sustainability, owing to their substantial impact on carbon emissions, travel time, fuel consumption, and roadway safety. In fact, SAGs are estimated to account for 33-50% of highway driving externalities. However, insufficient attention has been paid to precisely quantifying where, when, and how much these SAGs take place, which is necessary for downstream decision making, such as intervention design and policy analysis. A key challenge is that the data available to researchers and governments are typically sparse and aggregated to a granularity that obscures SAGs. To overcome such data limitations, this study thus explores the use of traffic reconstruction techniques for SAG identification. In particular, we introduce a kernel-based method for identifying spatio-temporal features in traffic and leverage bootstrapping to quantify the uncertainty of the reconstruction process. Experimental results on California highway data demonstrate the promise of the method for capturing SAGs. This work contributes to a foundation for data-driven decision making to advance sustainability of traffic systems.

Using Curiosity for an Even Representation of Tasks in Continual Offline Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.03177
  • repo_url: https://github.com/punk95/continual-learning-with-curiosity
  • paper_authors: Pankayaraj Pathmanathan, Natalia Díaz-Rodríguez, Javier Del Ser
  • for: This work aims to use curiosity to improve offline multi-task continual reinforcement learning when tasks, defined by non-stationarity in the environment, are unlabeled and not evenly exposed to the learner over time.
  • methods: Curiosity is used both as a task-boundary detector and as a priority metric for retaining old transition tuples (see the sketch below). Two different buffers are proposed: the Hybrid Reservoir Buffer with Task Separation (HRBTS) and the Hybrid Curious Buffer (HCB).
  • results: The proposed buffers, used with standard reinforcement learning algorithms, alleviate catastrophic forgetting in offline multi-task continual reinforcement learning. Experiments in three different continual reinforcement learning settings, compared against the recent Hybrid Reservoir Buffer (HRB) and Multi-Time Scale Replay Buffer (MTR), show that the proposed buffers display better immunity to catastrophic forgetting than existing works in most settings.
    Abstract In this work, we investigate the means of using curiosity on replay buffers to improve offline multi-task continual reinforcement learning when tasks, which are defined by the non-stationarity in the environment, are non labeled and not evenly exposed to the learner in time. In particular, we investigate the use of curiosity both as a tool for task boundary detection and as a priority metric when it comes to retaining old transition tuples, which we respectively use to propose two different buffers. Firstly, we propose a Hybrid Reservoir Buffer with Task Separation (HRBTS), where curiosity is used to detect task boundaries that are not known due to the task agnostic nature of the problem. Secondly, by using curiosity as a priority metric when it comes to retaining old transition tuples, a Hybrid Curious Buffer (HCB) is proposed. We ultimately show that these buffers, in conjunction with regular reinforcement learning algorithms, can be used to alleviate the catastrophic forgetting issue suffered by the state of the art on replay buffers when the agent's exposure to tasks is not equal along time. We evaluate catastrophic forgetting and the efficiency of our proposed buffers against the latest works such as the Hybrid Reservoir Buffer (HRB) and the Multi-Time Scale Replay Buffer (MTR) in three different continual reinforcement learning settings. Experiments were done on classical control tasks and Metaworld environment. Experiments show that our proposed replay buffers display better immunity to catastrophic forgetting compared to existing works in most of the settings.
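A toy sketch of curiosity-prioritized retention, assuming a scalar curiosity score per transition; this is illustrative only, since the actual HCB combines the idea with reservoir-style sampling, and HRBTS additionally separates detected tasks:

```python
import random

class CuriousBuffer:
    """Toy replay buffer that, when full, evicts the transition with the
    lowest curiosity score, so surprising transitions from rarely seen
    tasks survive longer in memory."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = []  # list of (curiosity, transition) pairs

    def add(self, transition, curiosity: float):
        if len(self.data) < self.capacity:
            self.data.append((curiosity, transition))
        else:
            # Replace the least curious stored transition, if this one beats it.
            i = min(range(len(self.data)), key=lambda j: self.data[j][0])
            if curiosity > self.data[i][0]:
                self.data[i] = (curiosity, transition)

    def sample(self, n: int):
        return random.sample([t for _, t in self.data], min(n, len(self.data)))
```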

A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education

  • paper_url: http://arxiv.org/abs/2312.03173
  • repo_url: None
  • paper_authors: Jacob Doughty, Zipiao Wan, Anishka Bompelli, Jubahed Qayum, Taozhi Wang, Juran Zhang, Yujia Zheng, Aidan Doyle, Pragnya Sridhar, Arav Agarwal, Christopher Bogart, Eric Keylor, Can Kultur, Jaromir Savelka, Majd Sakr
  • for: Educators can use GPT-4 to generate multiple-choice questions (MCQs) aligned with specific learning objectives (LOs) from Python programming classes in higher education.
  • methods: The GPT-4 system uses a large language model to generate MCQs from high-level course context and module-level LOs.
  • results: The study found that GPT-4 was capable of producing MCQs with clear language, a single correct choice, and high-quality distractors, and the generated MCQs appeared to be well-aligned with the LOs.
    Abstract There is a constant need for educators to develop and maintain effective up-to-date assessments. While there is a growing body of research in computing education on utilizing large language models (LLMs) in generation and engagement with coding exercises, the use of LLMs for generating programming MCQs has not been extensively explored. We analyzed the capability of GPT-4 to produce multiple-choice questions (MCQs) aligned with specific learning objectives (LOs) from Python programming classes in higher education. Specifically, we developed an LLM-powered (GPT-4) system for generation of MCQs from high-level course context and module-level LOs. We evaluated 651 LLM-generated and 449 human-crafted MCQs aligned to 246 LOs from 6 Python courses. We found that GPT-4 was capable of producing MCQs with clear language, a single correct choice, and high-quality distractors. We also observed that the generated MCQs appeared to be well-aligned with the LOs. Our findings can be leveraged by educators wishing to take advantage of the state-of-the-art generative models to support MCQ authoring efforts.

GPT vs Human for Scientific Reviews: A Dual Source Review on Applications of ChatGPT in Science

  • paper_url: http://arxiv.org/abs/2312.03769
  • repo_url: None
  • paper_authors: Chenxi Wu, Alan John Varghese, Vivek Oommen, George Em Karniadakis
  • for: This study explores the application of Large Language Models (LLMs) to scientific reviewing, with the potential to speed up reviews, reduce bias, surface cross-disciplinary connections, and identify emerging trends.
  • methods: The study considers 13 GPT-related papers, each reviewed by a human reviewer and by SciSpace; the reviews are then assessed by three distinct types of evaluators: GPT-3.5, a crowd panel, and GPT-4.
  • results: 50% of SciSpace's responses to objective questions align with those of the human reviewer; GPT-4 (the informed evaluator) often rates the human reviewer higher in accuracy and SciSpace higher in structure, clarity, and completeness. On subjective questions, the uninformed evaluators (GPT-3.5 and the crowd panel) show varying preferences between SciSpace and human responses, while GPT-4 rates them equally in accuracy and structure but favors SciSpace for completeness.
    Abstract The new polymath Large Language Models (LLMs) can speed-up greatly scientific reviews, possibly using more unbiased quantitative metrics, facilitating cross-disciplinary connections, and identifying emerging trends and research gaps by analyzing large volumes of data. However, at the present time, they lack the required deep understanding of complex methodologies, they have difficulty in evaluating innovative claims, and they are unable to assess ethical issues and conflicts of interest. Herein, we consider 13 GPT-related papers across different scientific domains, reviewed by a human reviewer and SciSpace, a large language model, with the reviews evaluated by three distinct types of evaluators, namely GPT-3.5, a crowd panel, and GPT-4. We found that 50% of SciSpace's responses to objective questions align with those of a human reviewer, with GPT-4 (informed evaluator) often rating the human reviewer higher in accuracy, and SciSpace higher in structure, clarity, and completeness. In subjective questions, the uninformed evaluators (GPT-3.5 and crowd panel) showed varying preferences between SciSpace and human responses, with the crowd panel showing a preference for the human responses. However, GPT-4 rated them equally in accuracy and structure but favored SciSpace for completeness.

FlexModel: A Framework for Interpretability of Distributed Large Language Models

  • paper_url: http://arxiv.org/abs/2312.03140
  • repo_url: https://github.com/vectorinstitute/flex_model
  • paper_authors: Matthew Choi, Muhammad Adil Asif, John Willes, David Emerson
  • for: This work addresses the growing hardware prerequisites for training and deploying large language models and aims to make deeper model interactions, crucial for interpretability and responsible AI research, more accessible.
  • methods: The work presents FlexModel, a software package providing a streamlined interface for engaging with models distributed across multi-GPU and multi-node configurations. It is compatible with existing model distribution libraries, wraps PyTorch models, and exposes user-registerable HookFunctions for straightforward interaction with distributed model internals (see the sketch below).
  • results: Through the FlexModel package, the study democratizes model interactions, improving the accessibility of large-scale neural network research and enabling more researchers to participate in the field.
    Abstract With the growth of large language models, now incorporating billions of parameters, the hardware prerequisites for their training and deployment have seen a corresponding increase. Although existing tools facilitate model parallelization and distributed training, deeper model interactions, crucial for interpretability and responsible AI techniques, still demand thorough knowledge of distributed computing. This often hinders contributions from researchers with machine learning expertise but limited distributed computing background. Addressing this challenge, we present FlexModel, a software package providing a streamlined interface for engaging with models distributed across multi-GPU and multi-node configurations. The library is compatible with existing model distribution libraries and encapsulates PyTorch models. It exposes user-registerable HookFunctions to facilitate straightforward interaction with distributed model internals, bridging the gap between distributed and single-device model paradigms. Primarily, FlexModel enhances accessibility by democratizing model interactions and promotes more inclusive research in the domain of large-scale neural networks. The package is found at https://github.com/VectorInstitute/flex_model.
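FlexModel's HookFunctions play the role that forward hooks play on a single device. As an illustration of the underlying idea only (PyTorch's native hook API on a toy single-device model, not FlexModel's actual interface; see the linked repository for that):

```python
import torch
import torch.nn as nn

# Toy stand-in for a (potentially sharded) language model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

activations = {}

def save_activation(name):
    # A forward hook receives (module, input, output) and may log activations.
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu()
    return hook

# Register the hook on an intermediate layer, run a forward pass, clean up.
handle = model[0].register_forward_hook(save_activation("layer0"))
_ = model(torch.randn(4, 16))
handle.remove()

print(activations["layer0"].shape)  # torch.Size([4, 32])
```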

Evaluating Agents using Social Choice Theory

  • paper_url: http://arxiv.org/abs/2312.03121
  • repo_url: https://github.com/google-deepmind/open_spiel/tree/master/open_spiel/python/voting
  • paper_authors: Marc Lanctot, Kate Larson, Yoram Bachrach, Luke Marris, Zun Li, Avishkar Bhoopchand, Thomas Anthony, Brian Tanner, Anna Koop
  • for: This work examines what general multi-task evaluation problems have in common and proposes an evaluation framework grounded in voting theory, the Voting-as-Evaluation (VasE) framework.
  • methods: Each task is interpreted as a separate voter that supplies ordinal rankings or pairwise comparisons of agents, and the aggregator is viewed as a social welfare function, allowing centuries of research in social choice theory to be leveraged to derive principled evaluation frameworks (see the sketch below).
  • results: In practice, the VasE framework is more robust than popular evaluation frameworks, discovers properties in evaluation data not evident from scores alone, and predicts outcomes better than Elo in a complex seven-player game. In addition, maximal lotteries satisfy important consistency properties relevant to evaluation, are computationally efficient (polynomial in the size of the evaluation data), and identify game-theoretic cycles.
    Abstract We argue that many general evaluation problems can be viewed through the lens of voting theory. Each task is interpreted as a separate voter, which requires only ordinal rankings or pairwise comparisons of agents to produce an overall evaluation. By viewing the aggregator as a social welfare function, we are able to leverage centuries of research in social choice theory to derive principled evaluation frameworks with axiomatic foundations. These evaluations are interpretable and flexible, while avoiding many of the problems currently facing cross-task evaluation. We apply this Voting-as-Evaluation (VasE) framework across multiple settings, including reinforcement learning, large language models, and humans. In practice, we observe that VasE can be more robust than popular evaluation frameworks (Elo and Nash averaging), discovers properties in the evaluation data not evident from scores alone, and can predict outcomes better than Elo in a complex seven-player game. We identify one particular approach, maximal lotteries, that satisfies important consistency properties relevant to evaluation, is computationally efficient (polynomial in the size of the evaluation data), and identifies game-theoretic cycles.
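A toy sketch of the voting view: each task contributes an ordinal ranking, rankings are aggregated into a pairwise margin matrix, and a social choice rule is applied on top. Copeland scoring is used here for brevity; the paper's preferred rule, maximal lotteries, instead solves for an equilibrium lottery over the margin matrix:

```python
import numpy as np

# Each row: one task's ordinal ranking of agents (best first). In VasE each
# task acts as a voter; agent names here are illustrative.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]
agents = sorted({a for r in rankings for a in r})
idx = {a: i for i, a in enumerate(agents)}

# margins[i, j] = (#tasks preferring i over j) - (#tasks preferring j over i)
margins = np.zeros((len(agents), len(agents)))
for r in rankings:
    for pos, a in enumerate(r):
        for b in r[pos + 1:]:
            margins[idx[a], idx[b]] += 1
            margins[idx[b], idx[a]] -= 1

# Copeland aggregation: an agent's score is its number of pairwise-majority wins.
copeland = (margins > 0).sum(axis=1)
print({a: int(copeland[idx[a]]) for a in agents})  # {'A': 2, 'B': 1, 'C': 0}
```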

The Landscape of Modern Machine Learning: A Review of Machine, Distributed and Federated Learning

  • paper_url: http://arxiv.org/abs/2312.03120
  • repo_url: None
  • paper_authors: Omer Subasi, Oceane Bel, Joseph Manzano, Kevin Barker
  • for: This study aims to provide a review of modern machine learning, covering the latest advanced machine learning algorithms, applications, and frameworks.
  • methods: The review spans high-performance heterogeneous, parallel and distributed computing systems and ever-increasing amounts of data, covering parallel distributed learning, deep learning, and federated learning.
  • results: The study offers a high-level overview of the field of modern machine learning and can serve as an introductory text to it.
    Abstract With the advance of the powerful heterogeneous, parallel and distributed computing systems and ever increasing immense amount of data, machine learning has become an indispensable part of cutting-edge technology, scientific research and consumer products. In this study, we present a review of modern machine and deep learning. We provide a high-level overview for the latest advanced machine learning algorithms, applications, and frameworks. Our discussion encompasses parallel distributed learning, deep learning as well as federated learning. As a result, our work serves as an introductory text to the vast field of modern machine learning.

Unknown Sample Discovery for Source Free Open Set Domain Adaptation

  • paper_url: http://arxiv.org/abs/2312.03767
  • repo_url: None
  • paper_authors: Chowdhury Sadman Jahan, Andreas Savakis
  • for: This work addresses Open Set Domain Adaptation (OSDA), which adapts a model trained on a source domain to a target domain that undergoes distribution shift and contains samples from novel classes. In particular, it studies source-free OSDA (SF-OSDA), which does not require access to source-domain samples; existing SF-OSDA methods use only the known classes in the target domain for adaptation and require access to the entire target domain even during post-adaptation inference.
  • methods: The work introduces Unknown Sample Discovery (USD), a teacher-student framework in which a temporally ensembled teacher model performs known-unknown target sample separation and the student model is adapted to the target domain over all classes using co-training and teacher-student temporal consistency (see the sketch below).
  • results: Experimental results show that USD is superior to existing SF-OSDA methods and is competitive with OSDA models that use both source and target domains during adaptation.
    Abstract Open Set Domain Adaptation (OSDA) aims to adapt a model trained on a source domain to a target domain that undergoes distribution shift and contains samples from novel classes outside the source domain. Source-free OSDA (SF-OSDA) techniques eliminate the need to access source domain samples, but current SF-OSDA methods utilize only the known classes in the target domain for adaptation, and require access to the entire target domain even during inference after adaptation, to make the distinction between known and unknown samples. In this paper, we introduce Unknown Sample Discovery (USD) as an SF-OSDA method that utilizes a temporally ensembled teacher model to conduct known-unknown target sample separation and adapts the student model to the target domain over all classes using co-training and temporal consistency between the teacher and the student. USD promotes Jensen-Shannon distance (JSD) as an effective measure for known-unknown sample separation. Our teacher-student framework significantly reduces error accumulation resulting from imperfect known-unknown sample separation, while curriculum guidance helps to reliably learn the distinction between target known and target unknown subspaces. USD appends the target model with an unknown class node, thus readily classifying a target sample into any of the known or unknown classes in subsequent post-adaptation inference stages. Empirical results show that USD is superior to existing SF-OSDA methods and is competitive with current OSDA models that utilize both source and target domains during adaptation.
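The Jensen-Shannon distance promoted by the paper is a symmetric, bounded divergence between prediction distributions. A minimal sketch of the measure; which two distributions are compared and the separation threshold are assumptions here, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def jensen_shannon_distance(p_logits, q_logits, eps=1e-8):
    """Per-sample JSD between two categorical prediction distributions."""
    p = F.softmax(p_logits, dim=-1)
    q = F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = (p * (torch.log(p + eps) - torch.log(m + eps))).sum(-1)
    kl_qm = (q * (torch.log(q + eps) - torch.log(m + eps))).sum(-1)
    jsd = 0.5 * kl_pm + 0.5 * kl_qm
    return jsd.sqrt()  # distance = sqrt of the JS divergence

# Samples whose teacher/student predictions disagree strongly (large JSD)
# are flagged as likely unknown-class samples.
teacher_logits = torch.randn(8, 10)
student_logits = torch.randn(8, 10)
unknown_mask = jensen_shannon_distance(teacher_logits, student_logits) > 0.5
```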

Incidental Polysemanticity

  • paper_url: http://arxiv.org/abs/2312.03096
  • repo_url: https://github.com/tmychow/incidental-polysemanticity
  • paper_authors: Victor Lecomte, Kushal Thaman, Trevor Chow, Rylan Schaeffer, Sanmi Koyejo
  • for: This paper aims to provide a second origin story for polysemantic neurons in deep networks, which can arise incidentally even when there are enough neurons to represent all features in the data.
  • methods: The paper uses a combination of theory and experiments to demonstrate the existence of incidental polysemanticity, and to show how training dynamics can strengthen such overlap.
  • results: The paper finds that incidental polysemanticity can occur even when there are ample neurons to represent all features in the data, and that this type of polysemanticity can be a significant obstacle to interpretability of task-optimized deep networks.
    Abstract Polysemantic neurons (neurons that activate for a set of unrelated features) have been seen as a significant obstacle towards interpretability of task-optimized deep networks, with implications for AI safety. The classic origin story of polysemanticity is that the data contains more "features" than neurons, such that learning to perform a task forces the network to co-allocate multiple unrelated features to the same neuron, endangering our ability to understand the network's internal processing. In this work, we present a second and non-mutually exclusive origin story of polysemanticity. We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data, using a combination of theory and experiments. This second type of polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap. Due to its origin, we term this \textit{incidental polysemanticity}.
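A back-of-the-envelope illustration (not the paper's experimental setup) of how chance alone produces overlap at initialization even with one neuron available per feature, a birthday-problem effect that training dynamics can then entrench:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = n_neurons = 64
trials = 1000

collisions = []
for _ in range(trials):
    # Random init: each feature's encoding vector over neurons.
    W = rng.standard_normal((n_features, n_neurons))
    # Assign each feature to its most strongly aligned neuron.
    assigned = np.abs(W).argmax(axis=1)
    # A neuron claimed by more than one feature is a polysemantic candidate.
    _, counts = np.unique(assigned, return_counts=True)
    collisions.append((counts > 1).sum())

print(f"mean polysemantic neurons at init: {np.mean(collisions):.1f} / {n_neurons}")
```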

Similarity-based Knowledge Transfer for Cross-Domain Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.03764
  • repo_url: None
  • paper_authors: Sergio A. Serrano, Jose Martinez-Carranza, L. Enrique Sucar
  • for: This work studies how to transfer knowledge between tasks with different observation and/or action spaces in order to accelerate learning.
  • methods: A semi-supervised alignment loss, implemented with a set of encoder-decoders, is proposed to measure the similarity between different spaces and to select a suitable source of knowledge that improves the learning agent's performance.
  • results: Experiments on a set of varied MuJoCo control tasks show the robustness of the method in selecting and transferring knowledge without the supervision of expert policies.
    Abstract Transferring knowledge in cross-domain reinforcement learning is a challenging setting in which learning is accelerated by reusing knowledge from a task with different observation and/or action space. However, it is often necessary to carefully select the source of knowledge for the receiving end to benefit from the transfer process. In this article, we study how to measure the similarity between cross-domain reinforcement learning tasks to select a source of knowledge that will improve the performance of the learning agent. We developed a semi-supervised alignment loss to match different spaces with a set of encoder-decoders, and use them to measure similarity and transfer policies across tasks. In comparison to prior works, our method does not require data to be aligned, paired or collected by expert policies. Experimental results, on a set of varied Mujoco control tasks, show the robustness of our method in effectively selecting and transferring knowledge, without the supervision of a tailored set of source tasks.

RESIN-EDITOR: A Schema-guided Hierarchical Event Graph Visualizer and Editor

  • paper_url: http://arxiv.org/abs/2312.03093
  • repo_url: https://github.com/blender-nlp/resin-editor
  • paper_authors: Khanh Duy Nguyen, Zixuan Zhang, Reece Suchocki, Sha Li, Martha Palmer, Susan Brown, Jiawei Han, Heng Ji
  • for: This paper presents RESIN-EDITOR, an interactive event graph visualizer and editor designed for analyzing complex events.
  • methods: The system renders and edits hierarchical event graphs extracted from multimedia and multi-document news clusters, with guidance from human-curated event schemas.
  • results: The evaluation of RESIN-EDITOR demonstrates the tool's effectiveness in understanding complex events and enhancing system performance.
    Abstract In this paper, we present RESIN-EDITOR, an interactive event graph visualizer and editor designed for analyzing complex events. Our RESIN-EDITOR system allows users to render and freely edit hierarchical event graphs extracted from multimedia and multi-document news clusters with guidance from human-curated event schemas. RESIN-EDITOR's unique features include hierarchical graph visualization, comprehensive source tracing, and interactive user editing, which is more powerful and versatile than existing Information Extraction (IE) visualization tools. In our evaluation of RESIN-EDITOR, we demonstrate ways in which our tool is effective in understanding complex events and enhancing system performance. The source code, a video demonstration, and a live website for RESIN-EDITOR have been made publicly available.

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study

  • paper_url: http://arxiv.org/abs/2312.03762
  • repo_url: https://github.com/KarolisRam/colour-shape-goal-misgeneralization
  • paper_authors: Karolis Ramanauskas, Özgür Şimşek
  • for: To study the behaviour of colour versus shape goal misgeneralization in reinforcement learning.
  • methods: Using the Procgen Maze environment, over 1,000 agents are trained in a simplified version of the environment and evaluated on over 10 million episodes.
  • results: The agents are found to detect the goal object through a specific colour channel, and this choice is arbitrary. Due to underspecification, the learned preference can change when the agents are retrained with a different random seed; finally, outliers in out-of-distribution behaviour are shown to arise from the training random seed alone.
    Abstract We explore colour versus shape goal misgeneralization originally demonstrated by Di Langosco et al. (2022) in the Procgen Maze environment, where, given an ambiguous choice, the agents seem to prefer generalization based on colour rather than shape. After training over 1,000 agents in a simplified version of the environment and evaluating them on over 10 million episodes, we conclude that the behaviour can be attributed to the agents learning to detect the goal object through a specific colour channel. This choice is arbitrary. Additionally, we show how, due to underspecification, the preferences can change when retraining the agents using exactly the same procedure except for using a different random seed for the training run. Finally, we demonstrate the existence of outliers in out-of-distribution behaviour based on training random seed alone.

Clinical Notes Reveal Physician Fatigue

  • paper_url: http://arxiv.org/abs/2312.03077
  • repo_url: None
  • paper_authors: Chao-Chun Hsu, Ziad Obermeyer, Chenhao Tan
  • for: The paper aims to identify notes written by fatigued physicians and understand the impact of physician fatigue on decision-making and patient outcomes.
  • methods: The authors use a machine learning model to analyze notes from 129,228 emergency room visits and identify patterns associated with fatigued physicians. They also compare the performance of human physicians and language models (LLMs) in generating notes.
  • results: The model accurately identifies notes written by fatigued physicians and flags notes written in other high-fatigue settings. The authors find that notes written by fatigued physicians have lower yield of testing for heart attack and higher predicted fatigue for Black and Hispanic patients. Additionally, they find that LLM-written notes have higher predicted fatigue than real physicians’ notes, suggesting that LLMs may introduce distortions in generated text.
    Abstract Physicians write notes about patients. In doing so, they reveal much about themselves. Using data from 129,228 emergency room visits, we train a model to identify notes written by fatigued physicians -- those who worked 5 or more of the prior 7 days. In a hold-out set, the model accurately identifies notes written by these high-workload physicians, and also flags notes written in other high-fatigue settings: on overnight shifts, and after high patient volumes. Model predictions also correlate with worse decision-making on at least one important metric: yield of testing for heart attack is 18% lower with each standard deviation increase in model-predicted fatigue. Finally, the model indicates that notes written about Black and Hispanic patients have 12% and 21% higher predicted fatigue than Whites -- larger than overnight vs. daytime differences. These results have an important implication for large language models (LLMs). Our model indicates that fatigued doctors write more predictable notes. Perhaps unsurprisingly, because word prediction is the core of how LLMs work, we find that LLM-written notes have 17% higher predicted fatigue than real physicians' notes. This indicates that LLMs may introduce distortions in generated text that are not yet fully understood.

Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World

  • paper_url: http://arxiv.org/abs/2312.02976
  • repo_url: None
  • paper_authors: Kiana Ehsani, Tanmay Gupta, Rose Hendrix, Jordi Salvador, Luca Weihs, Kuo-Hao Zeng, Kunal Pratap Singh, Yejin Kim, Winson Han, Alvaro Herrasti, Ranjay Krishna, Dustin Schwenk, Eli VanderBilt, Aniruddha Kembhavi
  • for: The paper trains modern embodied agents using imitation learning with shortest-path planners in simulation, and demonstrates that these agents can proficiently navigate, explore, and manipulate objects in both simulation and the real world using only RGB sensors.
  • methods: The paper uses SPOC, a transformer-based, end-to-end architecture, paired with extensive image augmentation and millions of frames of shortest-path-expert trajectories collected inside approximately 200,000 procedurally generated houses containing 40,000 unique 3D assets.
  • results: The proposed method produces agents that proficiently navigate, explore, and manipulate objects in both simulation and the real world using only RGB sensors; the method is effective and efficient and generalizes to new environments and tasks.
    Abstract Reinforcement learning (RL) with dense rewards and imitation learning (IL) with human-generated trajectories are the most widely used approaches for training modern embodied agents. RL requires extensive reward shaping and auxiliary losses and is often too slow and ineffective for long-horizon tasks. While IL with human supervision is effective, collecting human trajectories at scale is extremely expensive. In this work, we show that imitating shortest-path planners in simulation produces agents that, given a language instruction, can proficiently navigate, explore, and manipulate objects in both simulation and in the real world using only RGB sensors (no depth map or GPS coordinates). This surprising result is enabled by our end-to-end, transformer-based, SPOC architecture, powerful visual encoders paired with extensive image augmentation, and the dramatic scale and diversity of our training data: millions of frames of shortest-path-expert trajectories collected inside approximately 200,000 procedurally generated houses containing 40,000 unique 3D assets. Our models, data, training code, and newly proposed 10-task benchmarking suite CHORES will be open-sourced.

Dexterous Functional Grasping

  • paper_url: http://arxiv.org/abs/2312.02975
  • repo_url: None
  • paper_authors: Ananye Agarwal, Shagun Uppal, Kenneth Shaw, Deepak Pathak
  • for: This work aims to combine perception and low-level robot control to achieve functional grasping of objects in the wild.
  • methods: A modular approach is used: functional affordances for an object are obtained first, and a low-level policy trained in simulation then grasps it. The work also proposes using eigengrasps computed from a small amount of human data to reduce the search space of RL (see the sketch below), leading to more stable and physically realistic motion.
  • results: The eigengrasp action space beats baselines in simulation, outperforms hard-coded grasping in the real world, and matches or outperforms a trained human teleoperator. Result visualizations and videos are available at https://dexfunc.github.io/
    Abstract While there have been significant strides in dexterous manipulation, most of it is limited to benchmark tasks like in-hand reorientation which are of limited utility in the real world. The main benefit of dexterous hands over two-fingered ones is their ability to pickup tools and other objects (including thin ones) and grasp them firmly to apply force. However, this task requires both a complex understanding of functional affordances as well as precise low-level control. While prior work obtains affordances from human data this approach doesn't scale to low-level control. Similarly, simulation training cannot give the robot an understanding of real-world semantics. In this paper, we aim to combine the best of both worlds to accomplish functional grasping for in-the-wild objects. We use a modular approach. First, affordances are obtained by matching corresponding regions of different objects and then a low-level policy trained in sim is run to grasp it. We propose a novel application of eigengrasps to reduce the search space of RL using a small amount of human data and find that it leads to more stable and physically realistic motion. We find that eigengrasp action space beats baselines in simulation and outperforms hardcoded grasping in real and matches or outperforms a trained human teleoperator. Results visualizations and videos at https://dexfunc.github.io/
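Eigengrasps are classically the principal components of recorded hand poses, giving a low-dimensional action space of coordinated joint motions. A minimal sketch under that reading, with synthetic stand-in data and an assumed 16-DoF hand (the paper's construction and dimensions may differ):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 500 recorded human grasps of a 16-DoF hand (joint angles).
grasp_data = np.random.default_rng(0).uniform(-1.0, 1.0, size=(500, 16))

# Eigengrasps: the principal components of the grasp data, i.e. a low-dimensional
# linear basis of coordinated joint motions ("grasp synergies").
pca = PCA(n_components=4).fit(grasp_data)

def to_joint_angles(eigengrasp_action: np.ndarray) -> np.ndarray:
    """Map a low-dimensional policy action back to full 16-D joint targets."""
    return pca.inverse_transform(eigengrasp_action[None])[0]

# The RL policy now explores a 4-D action space instead of the raw 16-D one.
joint_targets = to_joint_angles(np.array([0.3, -0.1, 0.05, 0.2]))
```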

Alchemist: Parametric Control of Material Properties with Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02970
  • repo_url: None
  • paper_authors: Prafull Sharma, Varun Jampani, Yuanzhen Li, Xuhui Jia, Dmitry Lagun, Fredo Durand, William T. Freeman, Mark Matthews
  • for: This paper controls material attributes of objects in real images, such as roughness, metallic appearance, albedo, and transparency.
  • methods: The method capitalizes on the generative prior of text-to-image models, using a scalar value and instructions to alter low-level material properties in an image.
  • results: By fine-tuning a modified pre-trained text-to-image model on an automatically generated object-centric synthetic dataset with physically-based materials, material attributes can be edited in real-world images while preserving all other attributes; the model can also be applied to material-edited NeRFs (Neural Radiance Fields).
    Abstract We propose a method to control material attributes of objects like roughness, metallic, albedo, and transparency in real images. Our method capitalizes on the generative prior of text-to-image models known for photorealism, employing a scalar value and instructions to alter low-level material properties. Addressing the lack of datasets with controlled material attributes, we generated an object-centric synthetic dataset with physically-based materials. Fine-tuning a modified pre-trained text-to-image model on this synthetic dataset enables us to edit material properties in real-world images while preserving all other attributes. We show the potential application of our model to material edited NeRFs.

Generating Interpretable Networks using Hypernetworks

  • paper_url: http://arxiv.org/abs/2312.03051
  • repo_url: None
  • paper_authors: Isaac Liao, Ziming Liu, Max Tegmark
  • for: The goal of this work is to decode neural networks, i.e., to convert a network's raw weights into an interpretable algorithm.
  • methods: The work uses hypernetworks to generate interpretable networks whose underlying algorithms are not yet known (see the sketch below), controlling network complexity to produce a diverse family of interpretable algorithms ranked by complexity.
  • results: For the task of computing L1 norms, three algorithms are found: (a) the double-sided algorithm, (b) the convexity algorithm, and (c) the pudding algorithm, of which only the first was anticipated before the experiments. The work automatically classifies these algorithms, analyzes how the algorithmic phases develop during training and how they are affected by complexity control, and shows that a trained hypernetwork can correctly construct models for input dimensions not seen in training, demonstrating systematic generalization.
    Abstract An essential goal in mechanistic interpretability to decode a network, i.e., to convert a neural network's raw weights to an interpretable algorithm. Given the difficulty of the decoding problem, progress has been made to understand the easier encoding problem, i.e., to convert an interpretable algorithm into network weights. Previous works focus on encoding existing algorithms into networks, which are interpretable by definition. However, focusing on encoding limits the possibility of discovering new algorithms that humans have never stumbled upon, but that are nevertheless interpretable. In this work, we explore the possibility of using hypernetworks to generate interpretable networks whose underlying algorithms are not yet known. The hypernetwork is carefully designed such that it can control network complexity, leading to a diverse family of interpretable algorithms ranked by their complexity. All of them are interpretable in hindsight, although some of them are less intuitive to humans, hence providing new insights regarding how to "think" like a neural network. For the task of computing L1 norms, hypernetworks find three algorithms: (a) the double-sided algorithm, (b) the convexity algorithm, (c) the pudding algorithm, although only the first algorithm was expected by the authors before experiments. We automatically classify these algorithms and analyze how these algorithmic phases develop during training, as well as how they are affected by complexity control. Furthermore, we show that a trained hypernetwork can correctly construct models for input dimensions not seen in training, demonstrating systematic generalization.
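A minimal sketch of the hypernetwork idea, assuming a toy two-layer target MLP whose weights are emitted by a generator network; the paper's architecture and complexity-control mechanism are more elaborate:

```python
import math
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Generates all weights of a small target MLP from a latent code z."""
    def __init__(self, latent_dim=8, d_in=4, d_hidden=16, d_out=1):
        super().__init__()
        self.shapes = [(d_hidden, d_in), (d_hidden,), (d_out, d_hidden), (d_out,)]
        n_params = sum(math.prod(s) for s in self.shapes)
        self.generator = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_params))

    def forward(self, z):
        flat = self.generator(z)
        params, i = [], 0
        for shape in self.shapes:
            n = math.prod(shape)
            params.append(flat[i:i + n].reshape(shape))
            i += n
        return params  # [W1, b1, W2, b2] of the generated target network

def target_forward(x, params):
    # The generated network itself: a 2-layer MLP run with external weights.
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1.T + b1) @ w2.T + b2

z = torch.randn(8)                  # latent code (could encode complexity)
params = HyperNetwork()(z)
y = target_forward(torch.randn(5, 4), params)
```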

Classification for everyone : Building geography agnostic models for fairer recognition

  • paper_url: http://arxiv.org/abs/2312.02957
  • repo_url: None
  • paper_authors: Akshat Jindal, Shreya Singh, Soham Gadgil
  • for: This paper studies how to mitigate the inherent geographical biases present in state-of-the-art image classification models.
  • methods: The bias is first quantified on two datasets, the Dollar Street Dataset and ImageNet, using images with location information; several methods that can be employed to reduce the bias are then presented.
  • results: Analysis of the different techniques shows that they make image classification models more robust to the geographical location of the images.
    Abstract In this paper, we analyze different methods to mitigate inherent geographical biases present in state of the art image classification models. We first quantitatively present this bias in two datasets - The Dollar Street Dataset and ImageNet, using images with location information. We then present different methods which can be employed to reduce this bias. Finally, we analyze the effectiveness of the different techniques on making these models more robust to geographical locations of the images.

WhisBERT: Multimodal Text-Audio Language Modeling on 100M Words

  • paper_url: http://arxiv.org/abs/2312.02931
  • repo_url: None
  • paper_authors: Lukas Wolf, Greta Tuckute, Klemen Kotar, Eghbal Hosseini, Tamar Regev, Ethan Wilcox, Alex Warstadt
  • for: This study asks whether training language models on multiple modalities of input can improve their quality and efficiency.
  • methods: The authors introduce Whisbert, inspired by the text-image approach of FLAVA (Singh et al., 2022). Following the Babylm guidelines (Warstadt et al., 2023), Whisbert is pretrained on a dataset of 100 million words plus their corresponding speech from the word-aligned version of the People's Speech dataset.
  • results: While Whisbert performs well on multimodal masked modeling and surpasses the Babylm baselines on most benchmark tasks, it struggles to optimize its complex objective and does not outperform its text-only Whisbert baseline.
    Abstract Training on multiple modalities of input can augment the capabilities of a language model. Here, we ask whether such a training regime can improve the quality and efficiency of these systems as well. We focus on text--audio and introduce Whisbert, which is inspired by the text--image approach of FLAVA (Singh et al., 2022). In accordance with Babylm guidelines (Warstadt et al., 2023), we pretrain Whisbert on a dataset comprising only 100 million words plus their corresponding speech from the word-aligned version of the People's Speech dataset (Galvez et al., 2021). To assess the impact of multimodality, we compare versions of the model that are trained on text only and on both audio and text simultaneously. We find that while Whisbert is able to perform well on multimodal masked modeling and surpasses the Babylm baselines in most benchmark tasks, it struggles to optimize its complex objective and outperform its text-only Whisbert baseline.

Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions

  • paper_url: http://arxiv.org/abs/2312.02913
  • repo_url: https://github.com/zahraabbasiantaeb/simquac
  • paper_authors: Zahra Abbasiantaeb, Yifei Yuan, Evangelos Kanoulas, Mohammad Aliannejadi
  • for: This paper explores using large language models (LLMs) to simulate human-to-human conversational question-answering (CQA).
  • methods: Zero-shot prompted GPT-4 models play the roles of student and teacher: the student model generates questions to explore a given search topic, and the teacher model, equipped with additional information including a text on the topic, answers them (see the sketch below).
  • results: Simulating human conversations with LLMs works reasonably well: the teacher LLM generates lengthier answers that tend to be more accurate and complete, while the student LLM generates more diverse questions that cover more aspects of a given topic.
    Abstract Conversational question-answering (CQA) systems aim to create interactive search systems that effectively retrieve information by interacting with users. To replicate human-to-human conversations, existing work uses human annotators to play the roles of the questioner (student) and the answerer (teacher). Despite its effectiveness, challenges exist as human annotation is time-consuming, inconsistent, and not scalable. To address this issue and investigate the applicability of large language models (LLMs) in CQA simulation, we propose a simulation framework that employs zero-shot learner LLMs for simulating teacher-student interactions. Our framework involves two LLMs interacting on a specific topic, with the first LLM acting as a student, generating questions to explore a given search topic. The second LLM plays the role of a teacher by answering questions and is equipped with additional information, including a text on the given topic. We implement both the student and teacher by zero-shot prompting the GPT-4 model. To assess the effectiveness of LLMs in simulating CQA interactions and understand the disparities between LLM- and human-generated conversations, we evaluate the simulated data from various perspectives. We begin by evaluating the teacher's performance through both automatic and human assessment. Next, we evaluate the performance of the student, analyzing and comparing the disparities between questions generated by the LLM and those generated by humans. Furthermore, we conduct extensive analyses to thoroughly examine the LLM performance by benchmarking state-of-the-art reading comprehension models on both datasets. Our results reveal that the teacher LLM generates lengthier answers that tend to be more accurate and complete. The student LLM generates more diverse questions, covering more aspects of a given topic.
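A minimal sketch of the zero-shot student-teacher loop; the prompts are illustrative rather than the paper's, and `ask_llm` stands in for any chat-completion call (e.g. one wrapping GPT-4):

```python
# Minimal sketch of the student-teacher simulation loop.
def simulate_cqa(topic, teacher_context, ask_llm, n_turns=5):
    student_sys = (f"You are a student exploring the topic '{topic}'. "
                   "Given the dialogue so far, ask one concise question.")
    teacher_sys = (f"You are a teacher. Answer questions about '{topic}' "
                   f"using this text:\n{teacher_context}")
    dialogue = []
    for _ in range(n_turns):
        question = ask_llm(student_sys, dialogue)   # student turn
        dialogue.append(f"Student: {question}")
        answer = ask_llm(teacher_sys, dialogue)     # teacher turn
        dialogue.append(f"Teacher: {answer}")
    return dialogue

# Stub LLM so the sketch runs end-to-end; replace with a real API call.
dummy = lambda system, history: f"(reply #{len(history)} to: {system[:20]}...)"
print("\n".join(simulate_cqa("quantum computing", "(reference text)", dummy, 2)))
```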

Toward autocorrection of chemical process flowsheets using large language models

  • paper_url: http://arxiv.org/abs/2312.02873
  • repo_url: None
  • paper_authors: Lukas Schulze Balhorn, Marc Caballero, Artur M. Schweidtmann
  • for: This paper proposes a generative AI methodology for automatically identifying errors in process flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets.
  • methods: A large language model (LLM) is used to detect and correct errors in flowsheets: the input to the model is a potentially erroneous flowsheet, and the output is a suggested corrected flowsheet.
  • results: Trained in a supervised manner on a synthetic dataset, the model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test set of synthetically generated flowsheets, suggesting that the model can learn to autocorrect synthetic flowsheets.
    Abstract The process engineering domain widely uses Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (P&IDs) to represent process flows and equipment configurations. However, the P&IDs and PFDs, hereafter called flowsheets, can contain errors causing safety hazards, inefficient operation, and unnecessary expenses. Correcting and verifying flowsheets is a tedious, manual process. We propose a novel generative AI methodology for automatically identifying errors in flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets. Inspired by the breakthrough of Large Language Models (LLMs) for grammatical autocorrection of human language, we investigate LLMs for the autocorrection of flowsheets. The input to the model is a potentially erroneous flowsheet and the output of the model are suggestions for a corrected flowsheet. We train our autocorrection model on a synthetic dataset in a supervised manner. The model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test dataset of synthetically generated flowsheets. The results suggest that the model can learn to autocorrect the synthetic flowsheets. We envision that flowsheet autocorrection will become a useful tool for chemical engineers.

Experimental Insights Towards Explainable and Interpretable Pedestrian Crossing Prediction

  • paper_url: http://arxiv.org/abs/2312.02872
  • repo_url: None
  • paper_authors: Angie Nataly Melo, Carlota Salinas, Miguel Angel Sotelo
  • for: This study aims to improve road safety in autonomous driving by predicting pedestrian crossing in an explainable and interpretable way.
  • methods: A novel neuro-symbolic approach combining deep learning and fuzzy logic is proposed. An explainable predictor (ExPedCross) is developed, which uses a set of explainable features and a fuzzy inference system to predict whether a pedestrian will cross (see the sketch below).
  • results: Evaluation on the PIE and JAAD datasets yields experimental insights into achieving explainability and interpretability in the pedestrian crossing prediction task, along with guidelines and recommendations on dataset selection, feature selection, and explainability.
    Abstract In the context of autonomous driving, pedestrian crossing prediction is a key component for improving road safety. Presently, the focus of these predictions extends beyond achieving trustworthy results; it is shifting towards the explainability and interpretability of these predictions. This research introduces a novel neuro-symbolic approach that combines deep learning and fuzzy logic for an explainable and interpretable pedestrian crossing prediction. We have developed an explainable predictor (ExPedCross), which utilizes a set of explainable features and employs a fuzzy inference system to predict whether the pedestrian will cross or not. Our approach was evaluated on both the PIE and JAAD datasets. The results offer experimental insights into achieving explainability and interpretability in the pedestrian crossing prediction task. Furthermore, the testing results yield a set of guidelines and recommendations regarding the process of dataset selection, feature selection, and explainability.
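A toy fuzzy-inference sketch in the spirit of ExPedCross, with two made-up input features and an ad-hoc rule base; the paper's actual features, membership functions, and rules are not reproduced here:

```python
# A toy Mamdani-style fuzzy rule base for pedestrian crossing prediction.
def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def crossing_score(distance_m: float, speed_towards_curb: float) -> float:
    close = tri(distance_m, 0.0, 0.0, 8.0)                 # "pedestrian is close"
    approaching = tri(speed_towards_curb, 0.0, 2.0, 4.0)   # "walking towards road"
    far = tri(distance_m, 5.0, 20.0, 20.0)
    # Rule 1: close AND approaching -> will cross; Rule 2: far -> won't cross.
    will_cross = min(close, approaching)
    wont_cross = far
    # Defuzzify with a weighted average of the rule consequents (1.0 / 0.0).
    total = will_cross + wont_cross
    return will_cross / total if total > 0 else 0.5

print(crossing_score(distance_m=3.0, speed_towards_curb=1.5))
```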

Navigating the Synthetic Realm: Harnessing Diffusion-based Models for Laparoscopic Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2312.03043
  • repo_url: https://github.com/lucidrains/imagen-pytorch
  • paper_authors: Simeon Allmendinger, Patrick Hemmer, Moritz Queisner, Igor Sauer, Leopold Müller, Johannes Jakubik, Michael Vössing, Niklas Kühl
  • for: This study aims to generate realistic synthetic laparoscopic images with diffusion-based generative models to support surgical applications and decision-making.
  • methods: State-of-the-art text-to-image architectures are applied to laparoscopic imaging, generating synthetic laparoscopic images from short text prompts with diffusion-based generative models.
  • results: The study shows that diffusion-based models can learn the style and semantics of laparoscopic imaging and generate high-quality synthetic laparoscopic data, advancing the use of computer-generated images in surgical applications.
    Abstract Recent advances in synthetic imaging open up opportunities for obtaining additional data in the field of surgical imaging. This data can provide reliable supplements supporting surgical applications and decision-making through computer vision. Particularly the field of image-guided surgery, such as laparoscopic and robotic-assisted surgery, benefits strongly from synthetic image datasets and virtual surgical training methods. Our study presents an intuitive approach for generating synthetic laparoscopic images from short text prompts using diffusion-based generative models. We demonstrate the usage of state-of-the-art text-to-image architectures in the context of laparoscopic imaging with regard to the surgical removal of the gallbladder as an example. Results on fidelity and diversity demonstrate that diffusion-based models can acquire knowledge about the style and semantics in the field of image-guided surgery. A validation study with a human assessment survey underlines the realistic nature of our synthetic data, as medical personnel detects actual images in a pool with generated images causing a false-positive rate of 66%. In addition, the investigation of a state-of-the-art machine learning model to recognize surgical actions indicates enhanced results when trained with additional generated images of up to 5.20%. Overall, the achieved image quality contributes to the usage of computer-generated images in surgical applications and enhances its path to maturity.

Towards Causal Representations of Climate Model Data

  • paper_url: http://arxiv.org/abs/2312.02858
  • repo_url: None
  • paper_authors: Julien Boussard, Chandni Nagda, Julia Kaltenborn, Charlotte Emilie Elektra Lange, Philippe Brouillard, Yaniv Gurwicz, Peer Nowack, David Rolnick
  • for: This paper aims to explore the potential of using causal representation learning to improve the efficiency and interpretability of climate model emulation.
  • methods: The paper uses the Causal Discovery with Single-parent Decoding (CDSD) method to learn causal representations of climate data, including emissions, temperature, and precipitation.
  • results: The paper evaluates the effectiveness of CDSD in rendering climate model emulation more efficient and interpretable, and sheds light on the challenges and limitations of using this approach.
    Abstract Climate models, such as Earth system models (ESMs), are crucial for simulating future climate change based on projected Shared Socioeconomic Pathways (SSP) greenhouse gas emissions scenarios. While ESMs are sophisticated and invaluable, machine learning-based emulators trained on existing simulation data can project additional climate scenarios much faster and are computationally efficient. However, they often lack generalizability and interpretability. This work delves into the potential of causal representation learning, specifically the \emph{Causal Discovery with Single-parent Decoding} (CDSD) method, which could render climate model emulation efficient \textit{and} interpretable. We evaluate CDSD on multiple climate datasets, focusing on emissions, temperature, and precipitation. Our findings shed light on the challenges, limitations, and promise of using CDSD as a stepping stone towards more interpretable and robust climate model emulation.

Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study

  • paper_url: http://arxiv.org/abs/2312.02855
  • repo_url: None
  • paper_authors: Qiao Yu, Wengui Zhang, Jorge Cardoso, Odej Kao
  • for: This paper investigates memory failures in large-scale datacenters, where uncorrectable errors (UEs) are a major indicator of Dual Inline Memory Module (DIMM) defects.
  • methods: The paper uses error-bit information, in particular spatio-temporal error bit patterns, to predict uncorrectable errors (see the sketch below).
  • results: Evaluations on real-world datasets show that the approach improves prediction performance by about 15% in F1-score over state-of-the-art algorithms and reduces the number of virtual machine interruptions caused by UEs by approximately 59%.
    Abstract In large-scale datacenters, memory failure is a common cause of server crashes, with uncorrectable errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily focus on predicting UEs using correctable errors (CEs), without fully considering the information provided by error bits. However, error bit patterns have a strong correlation with the occurrence of uncorrectable errors (UEs). In this paper, we present a comprehensive study on the correlation between CEs and UEs, specifically emphasizing the importance of spatio-temporal error bit information. Our analysis reveals a strong correlation between spatio-temporal error bits and UE occurrence. Through evaluations using real-world datasets, we demonstrate that our approach significantly improves prediction performance by 15% in F1-score compared to the state-of-the-art algorithms. Overall, our approach effectively reduces the number of virtual machine interruptions caused by UEs by approximately 59%.
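A sketch of what spatio-temporal error-bit features might look like, computed from a hypothetical corrected-error log; all field names and the feature set here are assumptions, not the paper's:

```python
import pandas as pd

# Illustrative CE log for one DIMM: timestamps plus the (row, column, dq_bit)
# coordinates of each corrected-error bit. Field names are assumptions.
ce_log = pd.DataFrame({
    "ts": pd.to_datetime(["2023-12-01 10:00", "2023-12-01 10:02",
                          "2023-12-01 10:03", "2023-12-01 11:30"]),
    "row": [12, 12, 12, 99], "col": [3, 3, 4, 7], "dq_bit": [5, 6, 5, 1],
})

def spatio_temporal_features(log: pd.DataFrame, window: str = "1h") -> pd.Series:
    recent = log[log["ts"] >= log["ts"].max() - pd.Timedelta(window)]
    return pd.Series({
        "ce_count": len(recent),                      # temporal burstiness
        "distinct_rows": recent["row"].nunique(),     # spatial spread
        "distinct_dq_bits": recent["dq_bit"].nunique(),
        "max_bits_per_cell": recent.groupby(["row", "col"])["dq_bit"]
                                   .nunique().max(),  # multi-bit cells precede UEs
    })

features = spatio_temporal_features(ce_log)  # input to a UE prediction model
```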
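
The abstract's core idea, predicting UEs from the spatio-temporal pattern of error bits in CE logs, can be illustrated with a small feature-extraction-plus-classifier sketch. The feature set, window length, and classifier below are illustrative assumptions, not the paper's exact design.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def error_bit_features(ce_events, window_hours=24):
    """ce_events: list of (timestamp_h, row, col, bitmask) tuples for one DIMM."""
    if not ce_events:
        return np.zeros(4)
    t_end = max(t for t, _, _, _ in ce_events)
    recent = [(t, r, c, b) for t, r, c, b in ce_events if t >= t_end - window_hours]
    n_bits = [bin(b).count("1") for _, _, _, b in recent]
    cells = {(r, c) for _, r, c, _ in recent}
    return np.array([
        len(recent),               # temporal density of CEs in the window
        max(n_bits),               # worst multi-bit error pattern observed
        float(np.mean(n_bits)),    # average error-bit count per CE
        len(cells),                # spatial spread over distinct DRAM cells
    ])

# Toy training data: one feature row per DIMM; y = 1 if a UE followed.
rng = np.random.default_rng(0)
events = [[(float(t), rng.integers(8), rng.integers(8), int(rng.integers(1, 16)))
           for t in range(rng.integers(1, 20))] for _ in range(200)]
X = np.stack([error_bit_features(e) for e in events])
y = (X[:, 1] >= 3).astype(int)      # toy label: multi-bit CEs precede UEs
clf = GradientBoostingClassifier().fit(X, y)
print(clf.score(X, y))
```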

Inherent limitations of LLMs regarding spatial information

  • paper_url: http://arxiv.org/abs/2312.03042
  • repo_url: None
  • paper_authors: He Yan, Xinyao Hu, Xiangpeng Wan, Chengyu Huang, Kai Zou, Shiqi Xu
  • for: This paper investigates the limitations of ChatGPT and similar models in spatial reasoning and navigation-related tasks, and evaluates their capabilities in 2D and 3D route planning.
  • methods: The paper introduces a novel evaluation framework and a baseline dataset specifically crafted for this study, which includes three key tasks: plotting spatial points, planning routes in 2D spaces, and devising pathways in 3D environments.
  • results: The evaluation reveals key insights into ChatGPT’s capabilities and limitations in spatial understanding, highlighting the areas where the model struggles and where further improvement is needed.
    Abstract Despite the significant advancements in natural language processing capabilities demonstrated by large language models such as ChatGPT, their proficiency in comprehending and processing spatial information, especially within the domains of 2D and 3D route planning, remains notably underdeveloped. This paper investigates the inherent limitations of ChatGPT and similar models in spatial reasoning and navigation-related tasks, an area critical for applications ranging from autonomous vehicle guidance to assistive technologies for the visually impaired. In this paper, we introduce a novel evaluation framework complemented by a baseline dataset, meticulously crafted for this study. This dataset is structured around three key tasks: plotting spatial points, planning routes in two-dimensional (2D) spaces, and devising pathways in three-dimensional (3D) environments. We specifically developed this dataset to assess the spatial reasoning abilities of ChatGPT. Our evaluation reveals key insights into the model's capabilities and limitations in spatial understanding.

Are Vision Transformers More Data Hungry Than Newborn Visual Systems?

  • paper_url: http://arxiv.org/abs/2312.02843
  • repo_url: https://github.com/buildingamind/vit-cot
  • paper_authors: Lalit Pandey, Samantha M. W. Wood, Justin N. Wood
  • for: To directly compare the learning abilities of ViTs and animals and test the assumption that ViTs need more training data to reach similar levels of performance.
  • methods: Parallel controlled-rearing experiments on ViTs and newborn chicks: self-supervised ViTs are trained on first-person images from simulated chick-rearing environments, using time as a teaching signal, akin to biological visual systems.
  • results: When trained through the eyes of newborn chicks, ViTs solved the same view-invariant object recognition tasks as the chicks. ViTs were not more data hungry than newborn visual systems: both learned view-invariant object representations in impoverished visual environments.
    Abstract Vision transformers (ViTs) are top performing models on many computer vision benchmarks and can accurately predict human behavior on object recognition tasks. However, researchers question the value of using ViTs as models of biological learning because ViTs are thought to be more data hungry than brains, with ViTs requiring more training data to reach similar levels of performance. To test this assumption, we directly compared the learning abilities of ViTs and animals, by performing parallel controlled rearing experiments on ViTs and newborn chicks. We first raised chicks in impoverished visual environments containing a single object, then simulated the training data available in those environments by building virtual animal chambers in a video game engine. We recorded the first-person images acquired by agents moving through the virtual chambers and used those images to train self supervised ViTs that leverage time as a teaching signal, akin to biological visual systems. When ViTs were trained through the eyes of newborn chicks, the ViTs solved the same view invariant object recognition tasks as the chicks. Thus, ViTs were not more data hungry than newborn visual systems: both learned view invariant object representations in impoverished visual environments. The flexible and generic attention based learning mechanism in ViTs combined with the embodied data streams available to newborn animals appears sufficient to drive the development of animal-like object recognition.
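
The phrase "leverage time as a teaching signal" suggests a contrastive objective in which embeddings of temporally adjacent frames are pulled together. The sketch below shows one common instantiation (InfoNCE over consecutive frames) with a stand-in encoder; it is an assumption for illustration, not the paper's exact ViT training recipe.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z_t, z_tp1, temperature=0.1):
    """InfoNCE over a batch: positives are embeddings of consecutive frames."""
    z_t, z_tp1 = F.normalize(z_t, dim=1), F.normalize(z_tp1, dim=1)
    logits = z_t @ z_tp1.T / temperature           # (B, B) similarities
    targets = torch.arange(z_t.size(0))            # positive = same trajectory
    return F.cross_entropy(logits, targets)

encoder = torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * 64 * 64, 128))  # stand-in for a ViT
frames_t = torch.randn(16, 3, 64, 64)                      # frames at time t
frames_tp1 = frames_t + 0.05 * torch.randn_like(frames_t)  # slightly later frames
loss = temporal_contrastive_loss(encoder(frames_t), encoder(frames_tp1))
loss.backward()
print(loss.item())
```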

MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition

  • paper_url: http://arxiv.org/abs/2312.02829
  • repo_url: https://github.com/ibm/multiple-input-multiple-output-nets
  • paper_authors: Nicolas Menet, Michael Hersche, Geethan Karunaratne, Luca Benini, Abu Sebastian, Abbas Rahimi
  • for: This work proposes Multiple-Input-Multiple-Output Neural Networks (MIMONets) to lower the inference cost of capacity-rich deep learning models by exploiting computation in superposition.
  • methods: MIMONets use variable binding mechanisms to represent multiple inputs in a fixed-width distributed representation, adapt nonlinear neural transformations to process this compositional data structure holistically, and recover each transformed input of interest with an unbinding mechanism.
  • results: Experiments show that MIMOConv achieves a 2-4x speedup over WideResNet CNNs on CIFAR10 and CIFAR100 at a small accuracy delta, and MIMOFormer handles 2-4 inputs at once on the Long Range Arena benchmark while offering a dynamic accuracy-throughput trade-off within a single set of fixed parameters.
    Abstract With the advent of deep learning, progressively larger neural networks have been designed to solve complex tasks. We take advantage of these capacity-rich models to lower the cost of inference by exploiting computation in superposition. To reduce the computational burden per input, we propose Multiple-Input-Multiple-Output Neural Networks (MIMONets) capable of handling many inputs at once. MIMONets augment various deep neural network architectures with variable binding mechanisms to represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations. Accordingly, MIMONets adapt nonlinear neural transformations to process the data structure holistically, leading to a speedup nearly proportional to the number of superposed input items in the data structure. After processing in superposition, an unbinding mechanism recovers each transformed input of interest. MIMONets also provide a dynamic trade-off between accuracy and throughput by an instantaneous on-demand switching between a set of accuracy-throughput operating points, yet within a single set of fixed parameters. We apply the concept of MIMONets to both CNN and Transformer architectures resulting in MIMOConv and MIMOFormer, respectively. Empirical evaluations show that MIMOConv achieves about 2-4 x speedup at an accuracy delta within [+0.68, -3.18]% compared to WideResNet CNNs on CIFAR10 and CIFAR100. Similarly, MIMOFormer can handle 2-4 inputs at once while maintaining a high average accuracy within a [-1.07, -3.43]% delta on the long range arena benchmark. Finally, we provide mathematical bounds on the interference between superposition channels in MIMOFormer. Our code is available at https://github.com/IBM/multiple-input-multiple-output-nets.
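
A toy sketch can make "computation in superposition" concrete: bind each input to a fixed random key, sum the bound inputs into one fixed-width vector, run a single forward pass, and unbind the outputs. The key-based binding below is one simple choice and an assumption; MIMONets' actual binding mechanisms and trained networks tolerate the cross-talk far better than this untrained example.

```python
import torch

torch.manual_seed(0)
D, n_inputs = 512, 3
keys = torch.sign(torch.randn(n_inputs, D))   # fixed +/-1 binding keys
net = torch.nn.Linear(D, D)                   # shared (here untrained) transform

xs = torch.randn(n_inputs, D)                 # three independent inputs
superposed = (keys * xs).sum(dim=0)           # bind each input, sum into one vector
y = net(superposed)                           # one forward pass for all inputs
recovered = keys * y                          # unbind one output per channel

# For this linear net, unbinding recovers net(k_i * x_i) * k_i up to cross-talk:
reference = net(xs[0] * keys[0]) * keys[0]
print(torch.nn.functional.cosine_similarity(recovered[0], reference, dim=0))
# Well above chance (~0) but below 1: the cross-talk trained MIMONets suppress.
```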

Calibrated Adaptive Teacher for Domain Adaptive Intelligent Fault Diagnosis

  • paper_url: http://arxiv.org/abs/2312.02826
  • repo_url: None
  • paper_authors: Florent Forest, Olga Fink
  • for: This work targets deep-learning-based Intelligent Fault Diagnosis (IFD), in particular the unsupervised domain adaptation setting where models must adapt to operating conditions different from those under which labeled data were collected.
  • methods: Self-training with confident pseudo-labels for target samples, where the teacher network's predictions are calibrated throughout the self-training process using post-hoc calibration techniques.
  • results: The proposed Calibrated Adaptive Teacher (CAT) is evaluated on the Paderborn benchmark for bearing fault diagnosis under varying operating conditions and achieves state-of-the-art performance on most transfer tasks.
    Abstract Intelligent Fault Diagnosis (IFD) based on deep learning has proven to be an effective and flexible solution, attracting extensive research. Deep neural networks can learn rich representations from vast amounts of representative labeled data for various applications. In IFD, they achieve high classification performance from signals in an end-to-end manner, without requiring extensive domain knowledge. However, deep learning models usually only perform well on the data distribution they have been trained on. When applied to a different distribution, they may experience performance drops. This is also observed in IFD, where assets are often operated in working conditions different from those in which labeled data have been collected. Unsupervised domain adaptation (UDA) deals with the scenario where labeled data are available in a source domain, and only unlabeled data are available in a target domain, where domains may correspond to operating conditions. Recent methods rely on training with confident pseudo-labels for target samples. However, the confidence-based selection of pseudo-labels is hindered by poorly calibrated confidence estimates in the target domain, primarily due to over-confident predictions, which limits the quality of pseudo-labels and leads to error accumulation. In this paper, we propose a novel UDA method called Calibrated Adaptive Teacher (CAT), where we propose to calibrate the predictions of the teacher network throughout the self-training process, leveraging post-hoc calibration techniques. We evaluate CAT on domain-adaptive IFD and perform extensive experiments on the Paderborn benchmark for bearing fault diagnosis under varying operating conditions. Our proposed method achieves state-of-the-art performance on most transfer tasks.
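
The calibration step CAT applies to the teacher can be illustrated with classic temperature scaling, fit by minimizing negative log-likelihood on labeled data. This fitting recipe is a common choice and an assumption here, not necessarily the paper's exact procedure.

```python
import torch

def fit_temperature(logits, labels, steps=200, lr=0.05):
    """Learn a scalar T > 0 minimizing NLL of softmax(logits / T)."""
    log_T = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_T], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_T.exp(), labels)
        loss.backward()
        opt.step()
    return log_T.exp().item()

torch.manual_seed(0)
logits = torch.randn(512, 10) * 8.0                        # over-confident teacher
noisy = torch.rand(512) < 0.3                              # 30% of labels disagree
labels = torch.where(noisy, torch.randint(0, 10, (512,)), logits.argmax(dim=1))
T = fit_temperature(logits, labels)                        # T > 1 softens the probs
probs = torch.softmax(logits / T, dim=1)
keep = probs.max(dim=1).values > 0.9                       # calibrated pseudo-labels
print(round(T, 2), keep.float().mean().item())
```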

Sample-based Dynamic Hierarchical Transformer with Layer and Head Flexibility via Contextual Bandit

  • paper_url: http://arxiv.org/abs/2312.03038
  • repo_url: None
  • paper_authors: Fanfei Meng, Lele Zhang, Yu Chen, Yuxin Wang
  • for: This paper proposes a sample-based Dynamic Hierarchical Transformer (DHT) whose layers and heads can be configured dynamically per data sample, matching network capacity to the complexity of individual samples during both training and inference.
  • methods: Layer and head counts are determined by solving contextual bandit problems: the Uniform Confidence Bound sets the number of layers and heads, while combinatorial Thompson Sampling selects the specific head combinations given that number.
  • results: Unlike prior work that only compresses trained networks for inference, DHT achieves up to 74% computational savings for both training and inference with minimal loss of accuracy.
    Abstract Transformer requires a fixed number of layers and heads which makes them inflexible to the complexity of individual samples and expensive in training and inference. To address this, we propose a sample-based Dynamic Hierarchical Transformer (DHT) model whose layers and heads can be dynamically configured with single data samples via solving contextual bandit problems. To determine the number of layers and heads, we use the Uniform Confidence Bound while we deploy combinatorial Thompson Sampling in order to select specific head combinations given their number. Different from previous work that focuses on compressing trained networks for inference only, DHT is not only advantageous for adaptively optimizing the underlying network architecture during training but also has a flexible network for efficient inference. To the best of our knowledge, this is the first comprehensive data-driven dynamic transformer without any additional auxiliary neural networks that implement the dynamic system. According to the experiment results, we achieve up to 74% computational savings for both training and inference with a minimal loss of accuracy.
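
A minimal sketch of the bandit formulation: treat each candidate depth as an arm and pick one per step with an upper-confidence-bound rule. The reward below (accuracy minus a compute penalty) is a toy assumption, and the combinatorial Thompson sampling over head subsets is omitted for brevity.

```python
import math, random

random.seed(0)
depths = [2, 4, 6, 8]                 # candidate numbers of layers (arms)
counts = [0] * len(depths)
values = [0.0] * len(depths)

def ucb_pick(t):
    for i, n in enumerate(counts):
        if n == 0:
            return i                  # play every arm once first
    return max(range(len(depths)),
               key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))

for t in range(1, 1001):
    i = ucb_pick(t)
    # Toy environment: deeper is more accurate but costlier.
    reward = min(1.0, 0.5 + 0.08 * depths[i]) - 0.05 * depths[i] \
             + random.gauss(0, 0.05)
    counts[i] += 1
    values[i] += (reward - values[i]) / counts[i]   # running mean per arm

print(dict(zip(depths, counts)))      # most pulls go to the best trade-off depth
```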

Clustering Pseudo Language Family in Multilingual Translation Models with Fisher Information Matrix

  • paper_url: http://arxiv.org/abs/2312.02820
  • repo_url: https://github.com/ecoli-hit/pseudofamily
  • paper_authors: Xinyu Ma, Xuebo Liu, Min Zhang
  • for: The paper addresses the challenge that clustering languages solely by their ancestral families can yield suboptimal results, owing to variations in the datasets employed during the model's training phase.
  • methods: The paper introduces a method that leverages the Fisher information matrix (FIM) to cluster language families, anchored on the multilingual translation model's characteristics: pseudo language families are defined by the similarity of the effects of language pairs on model parameters.
  • results: Employing these pseudo language families outperforms conventional language families when adapting a multilingual translation model to unfamiliar language pairs; the methodology can also be extended to scenarios requiring language similarity measurements.
    Abstract In multilingual translation research, the comprehension and utilization of language families are of paramount importance. Nevertheless, clustering languages based solely on their ancestral families can yield suboptimal results due to variations in the datasets employed during the model's training phase. To mitigate this challenge, we introduce an innovative method that leverages the fisher information matrix (FIM) to cluster language families, anchored on the multilingual translation model's characteristics. We hypothesize that language pairs with similar effects on model parameters exhibit a considerable degree of linguistic congruence and should thus be grouped cohesively. This concept has led us to define pseudo language families. We provide an in-depth discussion regarding the inception and application of these pseudo language families. Empirical evaluations reveal that employing these pseudo language families enhances performance over conventional language families in adapting a multilingual translation model to unfamiliar language pairs. The proposed methodology may also be extended to scenarios requiring language similarity measurements. The source code and associated scripts can be accessed at https://github.com/ecoli-hit/PseudoFamily.
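
A hedged sketch of the core recipe: estimate a diagonal Fisher information vector per language pair from squared gradients, then cluster pairs by Fisher similarity into pseudo families. The tiny model, loss, and random data below are placeholders (the paper works with a multilingual translation model), and the `metric="precomputed"` argument assumes scikit-learn >= 1.2.

```python
import torch
from sklearn.cluster import AgglomerativeClustering

torch.manual_seed(0)
model = torch.nn.Linear(32, 32)   # stand-in for the translation model

def diagonal_fim(batches):
    """Mean squared gradient per parameter over one language pair's batches."""
    fim = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in batches:
        model.zero_grad()
        torch.nn.functional.mse_loss(model(x), y).backward()
        for f, p in zip(fim, model.parameters()):
            f += p.grad.pow(2)
    return torch.cat([f.flatten() for f in fim]) / len(batches)

pairs = {f"pair{i}": [(torch.randn(8, 32), torch.randn(8, 32)) for _ in range(4)]
         for i in range(6)}
fims = torch.stack([diagonal_fim(b) for b in pairs.values()])
sims = torch.nn.functional.cosine_similarity(fims[:, None], fims[None, :], dim=-1)
dist = (1.0 - sims).clamp(min=0).numpy()
labels = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                 linkage="average").fit_predict(dist)
print(dict(zip(pairs, labels)))   # with this toy data the clusters are arbitrary
```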

BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02813
  • repo_url: None
  • paper_authors: Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei Zhang, Limin Wang
  • for: This paper proposes a training-free, general-purpose text-guided video synthesis framework, addressing the heavy memory and compute cost of training video foundation models, the additional training required for downstream video tasks, and the poor task generalization of existing adaptation methods.
  • methods: The BIVDiff framework bridges specific image diffusion models with general text-to-video diffusion models: an image diffusion model (e.g., ControlNet, Instruct Pix2Pix) generates the video frame by frame, Mixed Inversion is applied to the generated video, and the inverted latents are fed into the video diffusion model for temporal smoothing.
  • results: The framework's effectiveness and generality are demonstrated on a wide range of video generation tasks, including controllable video generation, video editing, video inpainting, and video outpainting.
    Abstract Diffusion models have made tremendous progress in text-driven image and video generation. Now text-to-image foundation models are widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks are less explored for several reasons. First, it requires huge memory and compute overhead to train a video generation foundation model. Even with video foundation models, additional costly training is still required for downstream video synthesis tasks. Second, although some works extend image diffusion models into videos in a training-free manner, temporal consistency cannot be well kept. Finally, these adaption methods are specifically designed for one task and fail to generalize to different downstream video synthesis tasks. To mitigate these issues, we propose a training-free general-purpose video synthesis framework, coined as BIVDiff, via bridging specific image diffusion models and general text-to-video foundation diffusion models. Specifically, we first use an image diffusion model (like ControlNet, Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion model for temporal smoothing. Decoupling image and video models enables flexible image model selection for different purposes, which endows the framework with strong task generalization and high efficiency. To validate the effectiveness and general use of BIVDiff, we perform a wide range of video generation tasks, including controllable video generation video editing, video inpainting and outpainting. Our project page is available at https://bivdiff.github.io.
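
The three-stage pipeline can be sketched at a high level with placeholder callables standing in for the actual diffusion models; the mixing ratio and the exact form of Mixed Inversion are assumptions based on the abstract, not the paper's implementation.

```python
import torch

def bivdiff(frames_cond, image_model, video_model, mix=0.5, steps=50):
    """frames_cond: per-frame conditions, shape (T, C, H, W).
    image_model maps one condition to one frame latent; video_model denoises."""
    # 1) Frame-wise generation with an image diffusion model (no temporal link yet).
    latents = torch.stack([image_model(c) for c in frames_cond])
    # 2) Mixed Inversion: blend the (inverted) frame latents with fresh noise so
    #    the video model receives latents inside its own training distribution.
    mixed = mix * latents + (1.0 - mix) * torch.randn_like(latents)
    # 3) Temporal smoothing with the video diffusion model.
    return video_model(mixed, steps=steps)

# Placeholder callables; real use would plug in e.g. ControlNet and a T2V model.
image_model = lambda c: c + 0.1 * torch.randn_like(c)
video_model = lambda z, steps: z.mean(dim=0, keepdim=True).expand_as(z)
print(bivdiff(torch.randn(8, 4, 32, 32), image_model, video_model).shape)
```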

Leveraging Domain Adaptation and Data Augmentation to Improve Qur’anic IR in English and Arabic

  • paper_url: http://arxiv.org/abs/2312.02803
  • repo_url: None
  • paper_authors: Vera Pavlova
  • for: This work tackles Qur'anic information retrieval (IR) in both Arabic and English.
  • methods: State-of-the-art neural IR methods are applied, training first on large general-domain data and then on in-domain data; a data augmentation technique compensates for the scarcity of in-domain data. An Islamic corpus is compiled and a domain-specific language model (LM) is pre-trained to serve as the shared backbone of the retrieval models.
  • results: Training with the domain-specific LM and augmented in-domain data considerably improves MRR@10 and NDCG@5, setting the state of the art in Qur'anic IR for both English and Arabic.
    Abstract In this work, we approach the problem of Qur'anic information retrieval (IR) in Arabic and English. Using the latest state-of-the-art methods in neural IR, we research what helps to tackle this task more efficiently. Training retrieval models requires a lot of data, which is difficult to obtain for training in-domain. Therefore, we commence with training on a large amount of general domain data and then continue training on in-domain data. To handle the lack of in-domain data, we employed a data augmentation technique, which considerably improved results in MRR@10 and NDCG@5 metrics, setting the state-of-the-art in Qur'anic IR for both English and Arabic. The absence of an Islamic corpus and domain-specific model for IR task in English motivated us to address this lack of resources and take preliminary steps of the Islamic corpus compilation and domain-specific language model (LM) pre-training, which helped to improve the performance of the retrieval models that use the domain-specific LM as the shared backbone. We examined several language models (LMs) in Arabic to select one that efficiently deals with the Qur'anic IR task. Besides transferring successful experiments from English to Arabic, we conducted additional experiments with retrieval task in Arabic to amortize the scarcity of general domain datasets used to train the retrieval models. Handling Qur'anic IR task combining English and Arabic allowed us to enhance the comparison and share valuable insights across models and languages.

PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features

  • paper_url: http://arxiv.org/abs/2312.02781
  • repo_url: None
  • paper_authors: Tianshun Han, Shengnan Gui, Yiqing Huang, Baihui Li, Lijian Liu, Benjia Zhou, Ning Jiang, Quan Lu, Ruicong Zhi, Yanyan Liang, Du Zhang, Jun Wan
  • for: To improve the precision and coherence of speech-driven 3D facial animation by incorporating complementary visual and textual cues alongside audio.
  • methods: The PMMTalk framework extracts complementary pseudo multi-modal features through three modules: a PMMTalk encoder that obtains visual and textual information from speech using off-the-shelf talking-head generation and speech recognition technology, a cross-modal alignment module that aligns audio-image-text features at the temporal and semantic levels, and a PMMTalk decoder that predicts lip-synced facial blendshape coefficients. Unlike prior methods, it requires only an additional random reference face image and integrates seamlessly into standard animation production workflows.
  • results: The method outperforms prior approaches in accuracy; the authors also introduce a large-scale 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset, and extensive experiments and a user study confirm that the approach surpasses the state of the art.
    Abstract Speech-driven 3D facial animation has improved a lot recently while most related works only utilize acoustic modality and neglect the influence of visual and textual cues, leading to unsatisfactory results in terms of precision and coherence. We argue that visual and textual cues are not trivial information. Therefore, we present a novel framework, namely PMMTalk, using complementary Pseudo Multi-Modal features for improving the accuracy of facial animation. The framework entails three modules: PMMTalk encoder, cross-modal alignment module, and PMMTalk decoder. Specifically, the PMMTalk encoder employs the off-the-shelf talking head generation architecture and speech recognition technology to extract visual and textual information from speech, respectively. Subsequently, the cross-modal alignment module aligns the audio-image-text features at temporal and semantic levels. Then PMMTalk decoder is employed to predict lip-syncing facial blendshape coefficients. Contrary to prior methods, PMMTalk only requires an additional random reference face image but yields more accurate results. Additionally, it is artist-friendly as it seamlessly integrates into standard animation production workflows by introducing facial blendshape coefficients. Finally, given the scarcity of 3D talking face datasets, we introduce a large-scale 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset. Extensive experiments and user studies show that our approach outperforms the state of the art. We recommend watching the supplementary video.

Towards the Inferrence of Structural Similarity of Combinatorial Landscapes

  • paper_url: http://arxiv.org/abs/2312.02720
  • repo_url: None
  • paper_authors: Mingyu Huang, Ke Li
  • for: To explore, via graph data mining, the latent topological structure of the fitness landscapes of combinatorial optimization problems, so that structural similarity between problems can be inferred.
  • methods: Local optima networks serve as a proxy for fitness landscapes, and graph data mining techniques support qualitative and quantitative analyses of landscape similarity across problem classes.
  • results: Large-scale empirical experiments on three classic combinatorial optimization problems provide concrete evidence of structural similarity between landscapes of the same problem class within neighboring dimensions, and probe the relationship between landscapes of different problem classes.
    Abstract One of the most common problem-solving heuristics is by analogy. For a given problem, a solver can be viewed as a strategic walk on its fitness landscape. Thus if a solver works for one problem instance, we expect it will also be effective for other instances whose fitness landscapes essentially share structural similarities with each other. However, due to the black-box nature of combinatorial optimization, it is far from trivial to infer such similarity in real-world scenarios. To bridge this gap, by using local optima network as a proxy of fitness landscapes, this paper proposed to leverage graph data mining techniques to conduct qualitative and quantitative analyses to explore the latent topological structural information embedded in those landscapes. By conducting large-scale empirical experiments on three classic combinatorial optimization problems, we gain concrete evidence to support the existence of structural similarity between landscapes of the same classes within neighboring dimensions. We also interrogated the relationship between landscapes of different problem classes.
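
Local optima networks, the landscape proxy used here, are straightforward to build for a small toy landscape: hill-climb from every state to find its local optimum, then connect optima whose basins of attraction are adjacent. The random landscape below is a stand-in for the paper's combinatorial problems.

```python
import itertools, random
import networkx as nx

random.seed(0)
N = 8
fitness = {bits: random.random() for bits in itertools.product((0, 1), repeat=N)}

def neighbors(s):
    """All bitstrings at Hamming distance 1 from s."""
    return [s[:i] + (1 - s[i],) + s[i + 1:] for i in range(N)]

def hill_climb(s):
    """Greedy ascent to the local optimum of s's basin."""
    while True:
        best = max(neighbors(s), key=fitness.get)
        if fitness[best] <= fitness[s]:
            return s
        s = best

basin = {s: hill_climb(s) for s in fitness}   # map each state to its optimum
lon = nx.Graph()
lon.add_nodes_from(set(basin.values()))
for s in fitness:                             # edge wherever two basins touch
    for n in neighbors(s):
        if basin[s] != basin[n]:
            lon.add_edge(basin[s], basin[n])
print(lon.number_of_nodes(), lon.number_of_edges())
```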

Large Knowledge Model: Perspectives and Challenges

  • paper_url: http://arxiv.org/abs/2312.02706
  • repo_url: https://github.com/molyswu/hand_detection
  • paper_authors: Huajun Chen
  • for: This work examines large language models (LLMs) such as ChatGPT through the lens of knowledge.
  • methods: It explores enhancing LLMs with symbolic knowledge such as knowledge graphs (KGs), and conversely using LLMs to amplify traditional symbolic knowledge bases.
  • results: LLMs can strengthen traditional symbolic knowledge bases and serve as KG builders and controllers; given the intricate nature of human knowledge, the paper advocates creating Large Knowledge Models (LKMs) specifically engineered to manage a diversified spectrum of knowledge structures, and proposes a five-"A" principle to distinguish the concept.
    Abstract Humankind's understanding of the world is fundamentally linked to our perception and cognition, with \emph{human languages} serving as one of the major carriers of \emph{world knowledge}. In this vein, \emph{Large Language Models} (LLMs) like ChatGPT epitomize the pre-training of extensive, sequence-based world knowledge into neural networks, facilitating the processing and manipulation of this knowledge in a parametric space. This article explores large models through the lens of ``knowledge''. We initially investigate the role of symbolic knowledge such as Knowledge Graphs (KGs) in enhancing LLMs, covering aspects like knowledge-augmented language model, structure-inducing pre-training, knowledgeable prompts, structured CoT, knowledge editing, semantic tools for LLM and knowledgeable AI agents. Subsequently, we examine how LLMs can amplify traditional symbolic knowledge bases, encompassing aspects like using LLM as KG builder and controller, structured knowledge pretraining, LLM-enhanced symbolic reasoning, and the amalgamation of perception with cognition. Considering the intricate nature of human knowledge, we advocate for the creation of \emph{Large Knowledge Models} (LKM), specifically engineered to manage diversified spectrum of knowledge structures. This ambitious undertaking could entail several key challenges, such as disentangling knowledge representation from language models, restructuring pre-training with structured knowledge, and building large commonsense models, among others. We finally propose a five-``A'' principle to distinguish the concept of LKM.

Unified learning-based lossy and lossless JPEG recompression

  • paper_url: http://arxiv.org/abs/2312.02705
  • repo_url: None
  • paper_authors: Jianghui Zhang, Yuanyuan Wang, Lina Guo, Jixiang Luo, Tongda Xu, Yan Wang, Zhi Wang, Hongwei Qin
  • for: The paper aims to improve the compression efficiency of JPEG images and bridge the gap between lossy and lossless recompression.
  • methods: A unified framework combining a learned quantization table with Markovian hierarchical variational autoencoders.
  • results: The method achieves arbitrarily low distortion as the bitrate approaches its upper bound, namely the bitrate of the lossless compression model; to the authors' knowledge, it is the first learned method to bridge lossy and lossless JPEG recompression.
Abstract JPEG is still the most widely used image compression algorithm. Most image compression algorithms only consider the uncompressed original image, while ignoring a large number of already existing JPEG images. Recently, JPEG recompression approaches have been proposed to further reduce the size of JPEG files. However, those methods only consider JPEG lossless recompression, which is just a special case of the rate-distortion theorem. In this paper, we propose a unified lossy and lossless JPEG recompression framework, which consists of learned quantization table and Markovian hierarchical variational autoencoders. Experiments show that our method can achieve arbitrarily low distortion when the bitrate is close to the upper bound, namely the bitrate of the lossless compression model. To the best of our knowledge, this is the first learned method that bridges the gap between lossy and lossless recompression of JPEG images.
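
The learned quantization table component can be sketched with a trainable per-frequency table and a straight-through estimator for the rounding step, so the table receives gradients from a rate-distortion objective. The distortion-only loss below is a simplification and an assumption; the paper's hierarchical VAE entropy model is omitted.

```python
import torch

class LearnedQuantTable(torch.nn.Module):
    """Trainable 8x8 quantization table with straight-through rounding."""
    def __init__(self):
        super().__init__()
        self.table = torch.nn.Parameter(torch.full((8, 8), 16.0))

    def forward(self, dct_blocks):                   # (B, 8, 8) DCT coefficients
        q = dct_blocks / self.table
        q_rounded = q + (q.round() - q).detach()     # gradient passes through round
        return q_rounded * self.table                # dequantized blocks

quant = LearnedQuantTable()
blocks = torch.randn(32, 8, 8) * 50.0                # toy DCT coefficients
recon = quant(blocks)
loss = torch.nn.functional.mse_loss(recon, blocks)   # distortion term only
loss.backward()
print(quant.table.grad.abs().mean())                 # the table receives gradients
```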

Enhancing Vehicle Entrance and Parking Management: Deep Learning Solutions for Efficiency and Security

  • paper_url: http://arxiv.org/abs/2312.02699
  • repo_url: None
  • paper_authors: Muhammad Umer Ramzan, Usman Ali, Syed Haider Abbas Naqvi, Zeeshan Aslam, Tehseen, Husnain Ali, Muhammad Faheem
  • for: To automate vehicle entrance and parking management in organizations, improving efficiency, security, and record keeping.
  • methods: State-of-the-art deep learning models automate the entrance and parking process, integrating vehicle detection, license plate recognition, and face detection and recognition to verify that both the vehicle and the person are registered with the organization.
  • results: The system detects entering and parking vehicles quickly and accurately, offering streamlined record keeping and optimized parking slot allocation, thereby enhancing convenience, accuracy, and security.
    Abstract The auto-management of vehicle entrance and parking in any organization is a complex challenge encompassing record-keeping, efficiency, and security concerns. Manual methods for tracking vehicles and finding parking spaces are slow and a waste of time. To solve the problem of auto management of vehicle entrance and parking, we have utilized state-of-the-art deep learning models and automated the process of vehicle entrance and parking into any organization. To ensure security, our system integrated vehicle detection, license number plate verification, and face detection and recognition models to ensure that the person and vehicle are registered with the organization. We have trained multiple deep-learning models for vehicle detection, license number plate detection, face detection, and recognition, however, the YOLOv8n model outperformed all the other models. Furthermore, License plate recognition is facilitated by Google's Tesseract-OCR Engine. By integrating these technologies, the system offers efficient vehicle detection, precise identification, streamlined record keeping, and optimized parking slot allocation in buildings, thereby enhancing convenience, accuracy, and security. Future research opportunities lie in fine-tuning system performance for a wide range of real-world applications.
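
A minimal sketch of such an entrance pipeline with off-the-shelf components: a YOLOv8 detector followed by Tesseract OCR on each detected crop, checked against a registry. The weights file, image path, and registry lookup are placeholders, and class filtering is omitted; the paper trains its own detection models and additionally integrates face recognition.

```python
import cv2
import pytesseract
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")   # stand-in weights; the paper trains custom models

def read_plates(image_path, registered_plates):
    """Detect objects, OCR each crop, and check the text against a registry."""
    img = cv2.imread(image_path)
    results = detector(img)[0]
    granted = []
    for x1, y1, x2, y2 in results.boxes.xyxy.int().tolist():
        crop = img[y1:y2, x1:x2]
        text = pytesseract.image_to_string(crop, config="--psm 7").strip()
        if text in registered_plates:
            granted.append(text)
    return granted

# Hypothetical camera frame and registry entry:
print(read_plates("gate_camera.jpg", {"ABC123"}))
```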

Analyzing and Improving the Training Dynamics of Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02696
  • repo_url: https://github.com/mmathew23/improved_edm
  • paper_authors: Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, Samuli Laine
  • for: To improve the popular ADM diffusion model architecture for data-driven image synthesis, without altering its high-level structure.
  • methods: Network layers are redesigned to preserve activation, weight, and update magnitudes on expectation, eliminating the uncontrolled magnitude drifts and imbalances observed during training. In addition, a method is proposed for setting the exponential moving average (EMA) parameters post-hoc, i.e., after the training run completes, enabling precise tuning of the EMA length without repeated training runs.
  • results: The modifications yield considerably better networks at equal computational complexity, improving the previous record FID of 2.41 on ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling.
    Abstract Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling. As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
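
The spirit of preserving activation, weight, and update magnitudes on expectation can be illustrated with a layer whose weight rows are normalized to unit norm at every forward pass, so activation magnitudes stay controlled by construction. This is a simplified stand-in for the paper's full redesign, and its post-hoc EMA technique is not shown.

```python
import torch

class MPLinear(torch.nn.Module):
    """Linear layer whose rows are forced to unit norm at every forward pass,
    so outputs keep roughly unit magnitude for (approximately) white inputs."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(d_out, d_in))

    def forward(self, x):
        w = self.weight / self.weight.norm(dim=1, keepdim=True)
        return x @ w.T

torch.manual_seed(0)
x = torch.randn(4096, 256)
for _ in range(20):
    x = MPLinear(256, 256)(x)      # stack twenty magnitude-preserving layers
print(x.std())                     # stays near 1 instead of drifting with depth
```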

H-GAP: Humanoid Control with a Generalist Planner

  • paper_url: http://arxiv.org/abs/2312.02682
  • repo_url: None
  • paper_authors: Zhengyao Jiang, Yingchen Xu, Nolan Wagener, Yicheng Luo, Michael Janner, Edward Grefenstette, Tim Rocktäschel, Yuandong Tian
  • for: To advance humanoid control, enabling integration into human-centric infrastructures and physics-driven humanoid animation.
  • methods: Humanoid Generalist Autoencoding Planner (H-GAP), a state-action trajectory generative model trained on humanoid trajectories derived from human motion-capture data (e.g., MoCapAct), used for downstream control via Model Predictive Control (MPC) despite the high-dimensional state and action spaces.
  • results: For a 56-degrees-of-freedom humanoid, H-GAP learns to represent and generate a wide range of motor behaviours and, without any learning from online interactions, flexibly transfers them via planning to solve novel downstream control tasks. It outperforms established MPC baselines that have access to the ground-truth dynamics model, and is superior or comparable to offline RL methods trained for individual tasks.
    Abstract Humanoid control is an important research challenge offering avenues for integration into human-centric infrastructures and enabling physics-driven humanoid animations. The daunting challenges in this field stem from the difficulty of optimizing in high-dimensional action spaces and the instability introduced by the bipedal morphology of humanoids. However, the extensive collection of human motion-captured data and the derived datasets of humanoid trajectories, such as MoCapAct, paves the way to tackle these challenges. In this context, we present Humanoid Generalist Autoencoding Planner (H-GAP), a state-action trajectory generative model trained on humanoid trajectories derived from human motion-captured data, capable of adeptly handling downstream control tasks with Model Predictive Control (MPC). For 56 degrees of freedom humanoid, we empirically demonstrate that H-GAP learns to represent and generate a wide range of motor behaviours. Further, without any learning from online interactions, it can also flexibly transfer these behaviors to solve novel downstream control tasks via planning. Notably, H-GAP excels established MPC baselines that have access to the ground truth dynamics model, and is superior or comparable to offline RL methods trained for individual tasks. Finally, we do a series of empirical studies on the scaling properties of H-GAP, showing the potential for performance gains via additional data but not computing. Code and videos are available at https://ycxuyingchen.github.io/hgap/.
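
Planning with a trajectory generative model under MPC can be sketched generically: sample candidate state-action trajectories conditioned on the current state, score them with a task reward, execute the first action of the best candidate, and replan. The sampler, reward, and dimensions below are placeholders, not H-GAP's actual prior.

```python
import torch

torch.manual_seed(0)
STATE_DIM, ACTION_DIM = 8, 4

def mpc_step(state, sample_trajectories, reward_fn, n_candidates=64, horizon=16):
    """Sample candidate plans, keep the best, and return its first action."""
    trajs = sample_trajectories(state, n_candidates, horizon)   # (K, H, S + A)
    returns = torch.stack([reward_fn(t) for t in trajs])
    best = trajs[returns.argmax()]
    return best[0, -ACTION_DIM:]       # execute the first action, then replan

# Placeholders: H-GAP would supply a learned state-action trajectory prior.
sampler = lambda s, k, h: torch.randn(k, h, STATE_DIM + ACTION_DIM)
reward = lambda t: t[:, 2].sum()       # e.g., keep torso height (dim 2) high
print(mpc_step(torch.zeros(STATE_DIM), sampler, reward))
```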

Contact Energy Based Hindsight Experience Prioritization

  • paper_url: http://arxiv.org/abs/2312.02677
  • repo_url: None
  • paper_authors: Erdi Sayar, Zhenshan Bing, Carlo D’Eramo, Ozgur S. Oguz, Alois Knoll
  • for: To address reinforcement learning for multi-goal robot manipulation tasks with sparse rewards, where collecting successful experiences is inefficient.
  • methods: Contact Energy Based Prioritization (CEBP), which selects samples from the replay buffer based on contact-rich information from the touch sensors in the robot's gripper and from object displacement, favoring the experiences that arguably carry the most information.
  • results: On various sparse-reward robotic manipulation tasks, CEBP surpasses or performs on par with state-of-the-art methods; the trained policy, deployed on a real Franka robot, successfully solves a pick-and-place task. Videos and code: https://erdiphd.github.io/HER_force
    Abstract Multi-goal robot manipulation tasks with sparse rewards are difficult for reinforcement learning (RL) algorithms due to the inefficiency in collecting successful experiences. Recent algorithms such as Hindsight Experience Replay (HER) expedite learning by taking advantage of failed trajectories and replacing the desired goal with one of the achieved states so that any failed trajectory can be utilized as a contribution to learning. However, HER uniformly chooses failed trajectories, without taking into account which ones might be the most valuable for learning. In this paper, we address this problem and propose a novel approach Contact Energy Based Prioritization~(CEBP) to select the samples from the replay buffer based on rich information due to contact, leveraging the touch sensors in the gripper of the robot and object displacement. Our prioritization scheme favors sampling of contact-rich experiences, which are arguably the ones providing the largest amount of information. We evaluate our proposed approach on various sparse reward robotic tasks and compare them with the state-of-the-art methods. We show that our method surpasses or performs on par with those methods on robot manipulation tasks. Finally, we deploy the trained policy from our method to a real Franka robot for a pick-and-place task. We observe that the robot can solve the task successfully. The videos and code are publicly available at: https://erdiphd.github.io/HER_force
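
One plausible reading of contact-energy prioritization is sketched below: score each stored transition by gripper force times object displacement, then sample the replay buffer in proportion to that score. The exact energy definition is an assumption based on the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

class ContactPrioritizedBuffer:
    """Replay buffer whose sampling is weighted by a contact-energy score."""
    def __init__(self):
        self.transitions, self.energies = [], []

    def add(self, transition, contact_force, object_displacement):
        self.transitions.append(transition)
        # Assumed energy: force magnitude times how far the object moved.
        self.energies.append(np.linalg.norm(contact_force) *
                             np.linalg.norm(object_displacement))

    def sample(self, k):
        e = np.asarray(self.energies) + 1e-6   # keep zero-contact steps reachable
        p = e / e.sum()
        idx = rng.choice(len(self.transitions), size=k, p=p)
        return [self.transitions[i] for i in idx]

buf = ContactPrioritizedBuffer()
for i in range(100):   # toy rollout: only every fifth step moves the object
    buf.add({"step": i}, rng.normal(size=3), rng.normal(size=3) * (i % 5 == 0))
print(buf.sample(4))   # sampling concentrates on the contact-rich steps
```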

Amortized Bayesian Decision Making for simulation-based models

  • paper_url: http://arxiv.org/abs/2312.02674
  • repo_url: https://github.com/mackelab/amortized-decision-making
  • paper_authors: Mila Gorecki, Jakob H. Macke, Michael Deistler
  • for: To perform Bayesian decision making with simulation-based inference (SBI) while circumventing the need to compute an explicit approximation to the posterior distribution.
  • methods: A neural network is trained on simulated data to predict the expected cost given any observed data and candidate action, so the lowest-cost action can be inferred directly.
  • results: On several benchmark problems, the method induces costs similar to those obtained from the true posterior distribution; applied to a real-world simulator in the medical neurosciences, the Bayesian Virtual Epileptic Patient, it infers actions associated with low cost after few simulations.
    Abstract Simulation-based inference (SBI) provides a powerful framework for inferring posterior distributions of stochastic simulators in a wide range of domains. In many settings, however, the posterior distribution is not the end goal itself -- rather, the derived parameter values and their uncertainties are used as a basis for deciding what actions to take. Unfortunately, because posterior distributions provided by SBI are (potentially crude) approximations of the true posterior, the resulting decisions can be suboptimal. Here, we address the question of how to perform Bayesian decision making on stochastic simulators, and how one can circumvent the need to compute an explicit approximation to the posterior. Our method trains a neural network on simulated data and can predict the expected cost given any data and action, and can, thus, be directly used to infer the action with lowest cost. We apply our method to several benchmark problems and demonstrate that it induces similar cost as the true posterior distribution. We then apply the method to infer optimal actions in a real-world simulator in the medical neurosciences, the Bayesian Virtual Epileptic Patient, and demonstrate that it allows to infer actions associated with low cost after few simulations.
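
The amortization idea is simple to sketch: simulate (parameter, observation) pairs, compute the cost of random actions under the true parameters, and regress cost from (observation, action). At test time the action with the lowest predicted cost is chosen, with no explicit posterior. The simulator, cost function, and action grid below are toy stand-ins.

```python
import torch

torch.manual_seed(0)

def simulator(theta):                  # toy simulator: noisy view of theta
    return theta + 0.1 * torch.randn_like(theta)

def cost(theta, action):               # toy cost: quadratic miss penalty
    return (theta - action) ** 2

actions = torch.linspace(-2, 2, 21)
net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(2000):                  # regress cost from (data, action) pairs
    theta = torch.randn(256, 1)
    x = simulator(theta)
    a = actions[torch.randint(len(actions), (256, 1))]
    loss = torch.nn.functional.mse_loss(net(torch.cat([x, a], dim=1)),
                                        cost(theta, a))
    opt.zero_grad(); loss.backward(); opt.step()

x_obs = torch.tensor([[0.8]])          # new observation at decision time
costs = net(torch.cat([x_obs.expand(len(actions), 1), actions[:, None]], dim=1))
print(actions[costs.argmin()])         # lowest-cost action, near the observed 0.8
```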

Lights out: training RL agents robust to temporary blindness

  • paper_url: http://arxiv.org/abs/2312.02665
  • repo_url: None
  • paper_authors: N. Ordonez, M. Tromp, P. M. Julbe, W. Böhmer
  • for: Enhancing the robustness of Deep Q-Network (DQN) agents to temporary blindness, i.e., periods with changed or missing observations.
  • methods: A neural network architecture that uses hidden representations of the observations, combined with a novel n-step loss function.
  • results: The trained agents withstand blindness stretches longer than those seen during training, demonstrating robustness to temporary blindness.
    Abstract Agents trained with DQN rely on an observation at each timestep to decide what action to take next. However, in real world applications observations can change or be missing entirely. Examples of this could be a light bulb breaking down, or the wallpaper in a certain room changing. While these situations change the actual observation, the underlying optimal policy does not change. Because of this we want our agent to continue taking actions until it receives a (recognized) observation again. To achieve this we introduce a combination of a neural network architecture that uses hidden representations of the observations and a novel n-step loss function. Our implementation is able to withstand location based blindness stretches longer than the ones it was trained on, and therefore shows robustness to temporary blindness. For access to our implementation, please email Nathan, Marije, or Pau.

FaceStudio: Put Your Face Everywhere in Seconds

  • paper_url: http://arxiv.org/abs/2312.02663
  • repo_url: None
  • paper_authors: Yuxuan Yan, Chi Zhang, Rui Wang, Yichao Zhou, Gege Zhang, Pei Cheng, Gang Yu, Bin Fu
  • for: This study investigates identity-preserving image synthesis: maintaining a subject's identity while adding a personalized, stylistic touch.
  • methods: A direct feed-forward mechanism avoids the time-consuming fine-tuning of prior approaches, enabling quick and efficient generation; a hybrid guidance framework combines stylized images, facial images, and textual prompts to guide the image generation process, supporting applications such as artistic portraits and identity-blended images.
  • results: Qualitative and quantitative evaluations demonstrate superiority over baseline models and previous works, particularly in efficiency and in preserving the subject's identity with high fidelity.
    Abstract This study investigates identity-preserving image synthesis, an intriguing task in image generation that seeks to maintain a subject's identity while adding a personalized, stylistic touch. Traditional methods, such as Textual Inversion and DreamBooth, have made strides in custom image creation, but they come with significant drawbacks. These include the need for extensive resources and time for fine-tuning, as well as the requirement for multiple reference images. To overcome these challenges, our research introduces a novel approach to identity-preserving synthesis, with a particular focus on human images. Our model leverages a direct feed-forward mechanism, circumventing the need for intensive fine-tuning, thereby facilitating quick and efficient image generation. Central to our innovation is a hybrid guidance framework, which combines stylized images, facial images, and textual prompts to guide the image generation process. This unique combination enables our model to produce a variety of applications, such as artistic portraits and identity-blended images. Our experimental results, including both qualitative and quantitative evaluations, demonstrate the superiority of our method over existing baseline models and previous works, particularly in its remarkable efficiency and ability to preserve the subject's identity with high fidelity.

Supervised learning of spatial features with STDP and homeostasis using Spiking Neural Networks on SpiNNaker

  • paper_url: http://arxiv.org/abs/2312.02659
  • repo_url: None
  • paper_authors: Sergio Davies, Andrew Gait, Andrew Rowley, Alessandro Di Nuovo
  • for: To perform supervised learning on Spiking Neural Networks (SNNs) so that they can identify spatial patterns.
  • methods: Spike Timing Dependent Plasticity (STDP) combined with a homeostasis mechanism, tested on the SpiNNaker digital architecture.
  • results: With a single trained pattern, the network behaves as an ideal detector, recognising the trained pattern with 100% accuracy; as more patterns are trained on the same network, identification accuracy becomes linked to the similarity between patterns. The approach may be applied to pattern recognition in static images or traffic analysis in computer networks, and the homeostatic factor may enable the network to detect patterns with some degree of similarity rather than only exact matches.
    Abstract Artificial Neural Networks (ANN) have gained large popularity thanks to their ability to learn using the well-known backpropagation algorithm. On the other hand, Spiking Neural Networks (SNNs), despite having wider abilities than ANNs, have always presented a challenge in the training phase. This paper shows a new method to perform supervised learning on SNNs, using Spike Timing Dependent Plasticity (STDP) and homeostasis, aiming at training the network to identify spatial patterns. The method is tested using the SpiNNaker digital architecture. A SNN is trained to recognise one or multiple patterns and performance metrics are extracted to measure the performance of the network. Some considerations are drawn from the results showing that, in the case of a single trained pattern, the network behaves as the ideal detector, with 100% accuracy in detecting the trained pattern. However, as the number of trained patterns on a single network increases, the accuracy of the identification is linked to the similarities between these patterns. This method of training an SNN to detect spatial patterns may be applied on pattern recognition in static images or traffic analysis in computer networks, where each network packet represents a spatial pattern. It will be stipulated that the homeostatic factor may enable the network to detect patterns with some degree of similarities, rather than only perfectly matching patterns.
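
A pair-based STDP rule with a homeostatic correction is easy to state in a few lines: potentiate when the presynaptic spike precedes the postsynaptic spike, depress otherwise, and nudge activity toward a target firing rate. The time constants and the homeostasis form below are illustrative assumptions, not SpiNNaker's exact implementation.

```python
import numpy as np

A_plus, A_minus, tau = 0.01, 0.012, 20.0   # amplitudes and time constant (ms)

def stdp_dw(t_pre, t_post):
    """Weight change for one pre/post spike pair."""
    dt = t_post - t_pre
    if dt > 0:    # pre before post: potentiate, decaying with the gap
        return A_plus * np.exp(-dt / tau)
    else:         # post before pre: depress
        return -A_minus * np.exp(dt / tau)

def homeostasis(rate, target_rate=10.0, eta=1e-4):
    """Nudge the weight so the neuron drifts toward its target firing rate."""
    return eta * (target_rate - rate)

w = 0.5
for t_pre, t_post in [(5, 12), (30, 26), (50, 53)]:   # spike pairs in ms
    w += stdp_dw(t_pre, t_post)
w += homeostasis(rate=14.0)   # neuron firing above target gets damped
print(w)
```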

How should the advent of large language models affect the practice of science?

  • paper_url: http://arxiv.org/abs/2312.03759
  • repo_url: None
  • paper_authors: Marcel Binz, Stephan Alaniz, Adina Roskies, Balazs Aczel, Carl T. Bergstrom, Colin Allen, Daniel Schad, Dirk Wulff, Jevin D. West, Qiong Zhang, Richard M. Shiffrin, Samuel J. Gershman, Ven Popov, Emily M. Bender, Marco Marelli, Matthew M. Botvinick, Zeynep Akata, Eric Schulz
  • for: This opinion piece asks how the advent of large language models (LLMs) should affect the practice of science.
  • methods: Four diverse groups of scientists were invited to reflect on the question, share their perspectives, and engage in debate, with each group also responding to the others.
  • results: Schulz et al. argue that working with LLMs is not fundamentally different from working with human collaborators; Bender et al. argue that LLMs are often misused and over-hyped and that their limitations warrant a focus on more specialized, easily interpretable tools; Marelli et al. emphasize transparent attribution and responsible use of LLMs; and Botvinick and Gershman advocate that humans should retain responsibility for determining the scientific roadmap.
    Abstract Large language models (LLMs) are being increasingly incorporated into scientific workflows. However, we have yet to fully grasp the implications of this integration. How should the advent of large language models affect the practice of science? For this opinion piece, we have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate. Schulz et al. make the argument that working with LLMs is not fundamentally different from working with human collaborators, while Bender et al. argue that LLMs are often misused and over-hyped, and that their limitations warrant a focus on more specialized, easily interpretable tools. Marelli et al. emphasize the importance of transparent attribution and responsible use of LLMs. Finally, Botvinick and Gershman advocate that humans should retain responsibility for determining the scientific roadmap. To facilitate the discussion, the four perspectives are complemented with a response from each group. By putting these different perspectives in conversation, we aim to bring attention to important considerations within the academic community regarding the adoption of LLMs and their impact on both current and future scientific practices.

SAMSGL: Series-Aligned Multi-Scale Graph Learning for Spatio-Temporal Forecasting

  • paper_url: http://arxiv.org/abs/2312.02646
  • repo_url: None
  • paper_authors: Xiaobei Zou, Luolin Xiong, Yang Tang, Jurgen Kurths
  • for: To improve spatio-temporal forecasting in domains such as traffic and weather prediction, where time-delayed propagation dynamics and high-dimensional interactions among nodes make accurate modeling difficult.
  • methods: The Series-Aligned Multi-Scale Graph Learning (SAMSGL) framework. A series-aligned graph convolution layer aggregates non-delayed graph signals, mitigating the influence of propagation delays on accuracy; a multi-scale graph learning architecture, combining multi-scale graph structure learning (a global structure for delayed and non-delayed node embeddings plus a local one for variations driven by neighboring factors) with graph-fully connected (Graph-FC) blocks, captures global and local spatio-temporal interactions.
  • results: Experiments on meteorological and traffic forecasting datasets demonstrate the framework's effectiveness and superiority over existing methods.
    Abstract Spatio-temporal forecasting in various domains, like traffic prediction and weather forecasting, is a challenging endeavor, primarily due to the difficulties in modeling propagation dynamics and capturing high-dimensional interactions among nodes. Despite the significant strides made by graph-based networks in spatio-temporal forecasting, there remain two pivotal factors closely related to forecasting performance that need further consideration: time delays in propagation dynamics and multi-scale high-dimensional interactions. In this work, we present a Series-Aligned Multi-Scale Graph Learning (SAMSGL) framework, aiming to enhance forecasting performance. In order to handle time delays in spatial interactions, we propose a series-aligned graph convolution layer to facilitate the aggregation of non-delayed graph signals, thereby mitigating the influence of time delays for the improvement in accuracy. To understand global and local spatio-temporal interactions, we develop a spatio-temporal architecture via multi-scale graph learning, which encompasses two essential components: multi-scale graph structure learning and graph-fully connected (Graph-FC) blocks. The multi-scale graph structure learning includes a global graph structure to learn both delayed and non-delayed node embeddings, as well as a local one to learn node variations influenced by neighboring factors. The Graph-FC blocks synergistically fuse spatial and temporal information to boost prediction accuracy. To evaluate the performance of SAMSGL, we conduct experiments on meteorological and traffic forecasting datasets, which demonstrate its effectiveness and superiority.
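
The series-alignment idea can be sketched as shifting each neighbor's signal by its estimated propagation lag before graph aggregation, so the convolution mixes non-delayed signals. The fixed integer delays and loop-based aggregation below are illustrative assumptions; the paper learns the alignment end to end.

```python
import torch

def series_aligned_aggregate(x, adj, delays):
    """x: (N, T) node time series; adj: (N, N) edge weights;
    delays: (N, N) integer lags used to line signals up in time."""
    N, _ = x.shape
    out = torch.zeros_like(x)
    for i in range(N):
        for j in range(N):
            if adj[i, j] != 0:
                shifted = torch.roll(x[j], shifts=int(delays[i, j]), dims=0)
                out[i] += adj[i, j] * shifted   # aggregate the aligned signal
    return out

x = torch.randn(4, 32)
adj = torch.eye(4).roll(1, dims=0)              # simple directed ring graph
delays = torch.full((4, 4), 3)                  # every edge lags by 3 steps
print(series_aligned_aggregate(x, adj, delays).shape)
```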

On the Initialization of Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2312.02622
  • repo_url: https://github.com/lspongebobjh/virgo_icml2023
  • paper_authors: Jiahang Li, Yakun Song, Xiang Song, David Paul Wipf
  • for: This paper focuses on improving the initialization process for graph neural networks (GNNs) to reduce the variance of forward and backward propagation and improve model performance.
  • methods: The proposed method, called Virgo, analyzes the variance of forward and backward propagation across GNN layers and proposes a new initialization method that takes into account the influence of the activation function, hidden dimension, graph structure, and message passing.
  • results: Virgo leads to superior model performance and more stable variance at initialization on node classification, link prediction, and graph classification tasks, as demonstrated through comprehensive experiments on 15 datasets.
    Abstract Graph Neural Networks (GNNs) have displayed considerable promise in graph representation learning across various applications. The core learning process requires the initialization of model weight matrices within each GNN layer, which is typically accomplished via classic initialization methods such as Xavier initialization. However, these methods were originally motivated to stabilize the variance of hidden embeddings and gradients across layers of Feedforward Neural Networks (FNNs) and Convolutional Neural Networks (CNNs) to avoid vanishing gradients and maintain steady information flow. In contrast, within the GNN context classical initializations disregard the impact of the input graph structure and message passing on variance. In this paper, we analyze the variance of forward and backward propagation across GNN layers and show that the variance instability of GNN initializations comes from the combined effect of the activation function, hidden dimension, graph structure and message passing. To better account for these influence factors, we propose a new initialization method for Variance Instability Reduction within GNN Optimization (Virgo), which naturally tends to equate forward and backward variances across successive layers. We conduct comprehensive experiments on 15 datasets to show that Virgo can lead to superior model performance and more stable variance at initialization on node classification, link prediction and graph classification tasks. Codes are in https://github.com/LspongebobJH/virgo_icml2023.
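To build intuition for why graph structure matters at initialization: mean aggregation over `deg` neighbors shrinks feature variance by roughly 1/deg, so a variance-preserving initializer can compensate with a degree-dependent gain. The snippet below is an illustrative simplification of this idea, not Virgo's exact derivation:

```python
import numpy as np

def degree_aware_init(fan_in, fan_out, avg_degree):
    """Xavier-style initialization rescaled by average node degree.

    Mean aggregation over deg neighbors shrinks feature variance by ~1/deg,
    so we inflate the weight variance by avg_degree to compensate.
    This is a simplified heuristic, not the exact Virgo formula.
    """
    base_var = 2.0 / (fan_in + fan_out)       # classic Xavier variance
    std = np.sqrt(base_var * avg_degree)      # degree compensation
    return np.random.normal(0.0, std, size=(fan_in, fan_out))

# Quick check: variance after one mean-aggregation + linear layer stays ~1.
rng = np.random.default_rng(0)
deg, d = 8, 64
W = degree_aware_init(d, d, avg_degree=deg)
h = rng.normal(size=(deg, d))                 # a node's neighbor features
agg = h.mean(axis=0)                          # mean aggregation: var ~ 1/deg
print(np.var(agg @ W))                        # roughly 1 in expectation
```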

Panoptica – instance-wise evaluation of 3D semantic and instance segmentation maps

  • paper_url: http://arxiv.org/abs/2312.02608
  • repo_url: https://github.com/brainlesion/panoptica
  • paper_authors: Florian Kofler, Hendrik Möller, Josef A. Buchner, Ezequiel de la Rosa, Ivan Ezhov, Marcel Rosier, Isra Mekki, Suprosanna Shit, Moritz Negwer, Rami Al-Maskari, Ali Ertürk, Shankeeth Vinayahalingam, Fabian Isensee, Sarthak Pati, Daniel Rueckert, Jan S. Kirschke, Stefan K. Ehrlich, Annika Reinke, Bjoern Menze, Benedikt Wiestler, Marie Piraud
  • for: This paper is designed for computing instance-wise segmentation quality metrics from 2D and 3D segmentation maps.
  • methods: The paper presents panoptica, a modular, performance-optimized open-source package that computes instance-wise segmentation quality metrics via a three-step computation process.
  • results: The package is evaluated in detail on several real-world biomedical datasets using a range of metrics, such as the Average Symmetric Surface Distance, demonstrating the effectiveness of panoptica.
    Abstract This paper introduces panoptica, a versatile and performance-optimized package designed for computing instance-wise segmentation quality metrics from 2D and 3D segmentation maps. panoptica addresses the limitations of existing metrics and provides a modular framework that complements the original intersection over union-based panoptic quality with other metrics, such as the distance metric Average Symmetric Surface Distance. The package is open-source, implemented in Python, and accompanied by comprehensive documentation and tutorials. panoptica employs a three-step metrics computation process to cover diverse use cases. The efficacy of panoptica is demonstrated on various real-world biomedical datasets, where an instance-wise evaluation is instrumental for an accurate representation of the underlying clinical task. Overall, we envision panoptica as a valuable tool facilitating in-depth evaluation of segmentation methods.
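For reference, the intersection-over-union-based panoptic quality that panoptica extends decomposes as PQ = SQ × RQ over matched instances. A minimal, library-independent computation of this standard definition (not a demonstration of panoptica's own API) might look like:

```python
def panoptic_quality(matched_ious, num_fp, num_fn, iou_threshold=0.5):
    """Standard panoptic quality over instance matches.

    matched_ious: IoU values of predicted/ground-truth instance pairs
                  that were matched (e.g., by greedy or optimal matching).
    num_fp / num_fn: unmatched predicted / ground-truth instances.
    Matches below the IoU threshold are demoted to an FP + FN pair.
    """
    tp = [iou for iou in matched_ious if iou >= iou_threshold]
    demoted = len(matched_ious) - len(tp)
    fp, fn = num_fp + demoted, num_fn + demoted
    denom = len(tp) + 0.5 * fp + 0.5 * fn
    sq = sum(tp) / len(tp) if tp else 0.0      # segmentation quality
    rq = len(tp) / denom if denom else 0.0     # recognition quality
    return sq * rq, sq, rq                     # PQ = SQ * RQ

pq, sq, rq = panoptic_quality([0.9, 0.8, 0.4], num_fp=1, num_fn=2)
print(f"PQ={pq:.3f} SQ={sq:.3f} RQ={rq:.3f}")
```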

Impact of Tokenization on LLaMa Russian Adaptation

  • paper_url: http://arxiv.org/abs/2312.02598
  • repo_url: None
  • paper_authors: Mikhail Tikhomirov, Daniil Chernyshev
  • for: This work addresses the performance degradation of large language models (LLMs) on non-English input.
  • methods: The study uses vocabulary substitution to improve the Russian-language adaptation of LLaMa, exploring three variants of vocabulary adaptation.
  • results: Automatic evaluation shows that vocabulary substitution not only improves the model's quality in Russian but also accelerates fine-tuning (by 35%) and inference (by up to 60%) while reducing memory consumption. Human evaluation further shows that models with Russian-adapted vocabulary generate answers preferred by users over the original Saiga-LLaMa model.
    Abstract Latest instruction-tuned large language models (LLM) show great results on various tasks, however, they often face performance degradation for non-English input. There is evidence that the reason lies in inefficient tokenization caused by low language representation in pre-training data which hinders the comprehension of non-English instructions, limiting the potential of target language instruction-tuning. In this work we investigate the possibility of addressing the issue with vocabulary substitution in the context of LLaMa Russian language adaptation. We explore three variants of vocabulary adaptation and test their performance on Saiga instruction-tuning and fine-tuning on Russian Super Glue benchmark. The results of automatic evaluation show that vocabulary substitution not only improves the model's quality in Russian but also accelerates fine-tuning (35%) and inference (up to 60%) while reducing memory consumption. Additional human evaluation of the instruction-tuned models demonstrates that models with Russian-adapted vocabulary generate answers with higher user preference than the original Saiga-LLaMa model.
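One common way to realize vocabulary substitution, used across several adaptation works (the exact variant in this paper may differ), is to train a new target-language tokenizer and warm-start each new token's embedding as the mean of the old embeddings of the subwords it decomposes into:

```python
import numpy as np

def remap_embeddings(old_emb, old_tokenize, new_vocab):
    """Initialize embeddings for a substituted vocabulary.

    old_emb:      (old_vocab_size, dim) pretrained embedding matrix
    old_tokenize: callable mapping a string to old-vocabulary token ids
    new_vocab:    list of token strings in the new (e.g., Russian) vocabulary
    Each new token embedding is the mean of its old-subword embeddings,
    a standard heuristic for warm-starting adapted vocabularies.
    """
    dim = old_emb.shape[1]
    new_emb = np.empty((len(new_vocab), dim), dtype=old_emb.dtype)
    for i, token in enumerate(new_vocab):
        ids = old_tokenize(token)
        new_emb[i] = old_emb[ids].mean(axis=0) if ids else old_emb.mean(axis=0)
    return new_emb

# Toy demo with a hypothetical 3-token old vocabulary.
old_emb = np.eye(3, dtype=np.float32)
lookup = {"пр": [0, 1], "ивет": [2]}
print(remap_embeddings(old_emb, lambda t: lookup.get(t, []), ["пр", "ивет"]))
```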

UTBoost: A Tree-boosting based System for Uplift Modeling

  • paper_url: http://arxiv.org/abs/2312.02573
  • repo_url: https://github.com/jd-opensource/utboost
  • paper_authors: Junjie Gao, Xiangyu Zheng, DongDong Wang, Zhixiang Huang, Bangqi Zheng, Kai Yang
  • for: This paper proposes two new methods based on the Gradient Boosting Decision Trees (GBDT) algorithm for estimating customer uplift.
  • methods: The two approaches learn the causal effect in a sequential way and overcome the counterfactual nature of uplift estimation, innovating on the ensemble learning method and the learning objectives, respectively.
  • results: Experiments on large-scale datasets show that both methods estimate customer uplift accurately and frequently yield remarkable improvements over base models. The authors also release UTBoost, a dedicated end-to-end tree-boosting system that implements these methods for practical use.
    Abstract Uplift modeling refers to the set of machine learning techniques that a manager may use to estimate customer uplift, that is, the net effect of an action on some customer outcome. By identifying the subset of customers for whom a treatment will have the greatest effect, uplift models assist decision-makers in optimizing resource allocations and maximizing overall returns. Accurately estimating customer uplift poses practical challenges, as it requires assessing the difference between two mutually exclusive outcomes for each individual. In this paper, we propose two innovative adaptations of the well-established Gradient Boosting Decision Trees (GBDT) algorithm, which learn the causal effect in a sequential way and overcome the counter-factual nature. Both approaches innovate existing techniques in terms of ensemble learning method and learning objectives, respectively. Experiments on large-scale datasets demonstrate the usefulness of the proposed methods, which often yielding remarkable improvements over base models. To facilitate the application, we develop the UTBoost, an end-to-end tree boosting system specifically designed for uplift modeling. The package is open source and has been optimized for training speed to meet the needs of real industrial applications.
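For context, the simplest tree-based uplift estimator is the two-model ("T-learner") baseline that such GBDT adaptations are measured against: fit one regressor on treated customers and one on controls, and score uplift as the difference of predictions. A sketch with scikit-learn (a generic baseline, not UTBoost's algorithm):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_t_learner(X, y, treated):
    """Two-model uplift baseline: uplift(x) = E[y|x,T=1] - E[y|x,T=0]."""
    m1 = GradientBoostingRegressor().fit(X[treated], y[treated])
    m0 = GradientBoostingRegressor().fit(X[~treated], y[~treated])
    return lambda X_new: m1.predict(X_new) - m0.predict(X_new)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
treated = rng.random(2000) < 0.5
# Synthetic outcome: treatment helps only when feature 0 is positive.
y = X[:, 1] + treated * np.maximum(X[:, 0], 0) + rng.normal(scale=0.1, size=2000)

uplift = fit_t_learner(X, y, treated)
X_test = np.zeros((2, 5)); X_test[0, 0], X_test[1, 0] = 2.0, -2.0
print(uplift(X_test))  # ~[2.0, 0.0]: uplift is concentrated where x0 > 0
```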

Structured World Representations in Maze-Solving Transformers

  • paper_url: http://arxiv.org/abs/2312.02566
  • repo_url: https://github.com/understanding-search/structured-representations-maze-transformers
  • paper_authors: Michael Igorevich Ivanitskiy, Alex F. Spies, Tilman Räuker, Guillaume Corlouer, Chris Mathwin, Lucia Quirke, Can Rager, Rusheb Shah, Dan Valentine, Cecilia Diniz Behn, Katsumi Inoue, Samy Wu Fung
  • for: This work aims to understand the internal behavior of small transformer models that solve mazes.
  • methods: The study trains transformer models on maze-solving tasks and finds that they consistently form structured internal representations of maze topology and valid paths.
  • results: The residual stream of a single token can be linearly decoded to faithfully reconstruct the entire maze, and the learned embeddings of individual tokens have spatial structure. In addition, attention heads involved in path-following (dubbed "adjacency heads") are implicated in finding valid subsequent tokens.
    Abstract Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers. Given the size and complexity of these models, forming a comprehensive picture of their inner workings remains a significant challenge. To this end, we set out to understand small transformer models in a more tractable setting: that of solving mazes. In this work, we focus on the abstractions formed by these models and find evidence for the consistent emergence of structured internal representations of maze topology and valid paths. We demonstrate this by showing that the residual stream of only a single token can be linearly decoded to faithfully reconstruct the entire maze. We also find that the learned embeddings of individual tokens have spatial structure. Furthermore, we take steps towards deciphering the circuity of path-following by identifying attention heads (dubbed $\textit{adjacency heads}$), which are implicated in finding valid subsequent tokens.
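The "linearly decoded" claim can be operationalized as a linear probe: for every candidate wall position, fit a logistic regression from a token's residual-stream activation to whether that wall exists. The sketch below uses synthetic stand-in activations (in the paper they come from the trained transformer's residual stream):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_mazes, d_model, n_edges = 500, 64, 24   # 24 candidate wall positions

# Stand-in activations and wall labels; in the paper these come from a
# single token's residual stream in the trained maze transformer.
walls = rng.integers(0, 2, size=(n_mazes, n_edges))
basis = rng.normal(size=(n_edges, d_model))
acts = walls @ basis + 0.1 * rng.normal(size=(n_mazes, d_model))

# One linear probe per candidate wall; high held-out accuracy means the
# maze layout is linearly decodable from the representation.
accs = []
for e in range(n_edges):
    probe = LogisticRegression(max_iter=1000).fit(acts[:400], walls[:400, e])
    accs.append(probe.score(acts[400:], walls[400:, e]))
print(f"mean held-out probe accuracy: {np.mean(accs):.3f}")
```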

Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction

  • paper_url: http://arxiv.org/abs/2312.03025
  • repo_url: None
  • paper_authors: Zilin Du, Haoxin Li, Xu Guo, Boyang Li
  • for: This work studies multimodal relation extraction, where progress is constrained by the scarcity of training data. It considers a novel setting in which only unimodal data, either text or image, is available during training, with the goal of training a multimodal classifier from synthetic data that performs well on real multimodal test data.
  • methods: The proposed method, MI2RAGE, applies Chained Cross-modal Generation (CCG) to promote diversity in the generated data and uses a teacher network to select training samples with high mutual information with the ground-truth labels.
  • results: Compared to direct training on synthetic data, the method improves F1 by 24.06% with synthetic text and 26.42% with synthetic images. The best model, trained entirely on synthetic images, outperforms prior state-of-the-art models trained on real multimodal data by 3.76% F1.
    Abstract The task of multimodal relation extraction has attracted significant research attention, but progress is constrained by the scarcity of available training data. One natural thought is to extend existing datasets with cross-modal generative models. In this paper, we consider a novel problem setting, where only unimodal data, either text or image, are available during training. We aim to train a multimodal classifier from synthetic data that perform well on real multimodal test data. However, training with synthetic data suffers from two obstacles: lack of data diversity and label information loss. To alleviate the issues, we propose Mutual Information-aware Multimodal Iterated Relational dAta GEneration (MI2RAGE), which applies Chained Cross-modal Generation (CCG) to promote diversity in the generated data and exploits a teacher network to select valuable training samples with high mutual information with the ground-truth labels. Comparing our method to direct training on synthetic data, we observed a significant improvement of 24.06% F1 with synthetic text and 26.42% F1 with synthetic images. Notably, our best model trained on completely synthetic images outperforms prior state-of-the-art models trained on real multimodal data by a margin of 3.76% in F1. Our codebase will be made available upon acceptance.
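The teacher-based selection step can be approximated by a simple confidence filter: keep only synthetic samples for which a teacher classifier assigns high probability to the sample's own label, a crude proxy for high mutual information with the ground truth. A schematic version (the paper's actual criterion may be more elaborate):

```python
import numpy as np

def select_by_teacher(features, labels, teacher_probs, threshold=0.8):
    """Keep synthetic samples whose label the teacher finds highly probable.

    teacher_probs: (n_samples, n_classes) teacher posterior over labels.
    Samples where p(label) < threshold are dropped as likely label noise,
    a simple stand-in for mutual-information-based selection.
    """
    p_label = teacher_probs[np.arange(len(labels)), labels]
    keep = p_label >= threshold
    return features[keep], labels[keep]

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))
labels = np.array([0, 1, 2, 0, 1, 2])
probs = rng.dirichlet(alpha=[1, 1, 1], size=6)
kept_f, kept_l = select_by_teacher(feats, labels, probs, threshold=0.5)
print(f"kept {len(kept_l)} of {len(labels)} synthetic samples")
```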

DanZero+: Dominating the GuanDan Game through Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.02561
  • repo_url: https://github.com/submit-paper/Danzero_plus
  • paper_authors: Youpeng Zhao, Yudong Lu, Jian Zhao, Wengang Zhou, Houqiang Li
  • for: The goal of this work is to develop an AI program, DanZero, for GuanDan, an exceptionally complex and popular card game.
  • methods: The study uses Deep Monte Carlo (DMC) with a distributed training framework, and further applies a policy-based reinforcement learning algorithm to strengthen the AI.
  • results: Evaluation against heuristic rule-based AI programs shows the outstanding performance of the DanZero bot; by adopting a pre-trained model to cope with the huge action space, the enhanced agent achieves superior performance.
    Abstract The utilization of artificial intelligence (AI) in card games has been a well-explored subject within AI research for an extensive period. Recent advancements have propelled AI programs to showcase expertise in intricate card games such as Mahjong, DouDizhu, and Texas Hold'em. In this work, we aim to develop an AI program for an exceptionally complex and popular card game called GuanDan. This game involves four players engaging in both competitive and cooperative play throughout a long process to upgrade their level, posing great challenges for AI due to its expansive state and action space, long episode length, and complex rules. Employing reinforcement learning techniques, specifically Deep Monte Carlo (DMC), and a distributed training framework, we first put forward an AI program named DanZero for this game. Evaluation against baseline AI programs based on heuristic rules highlights the outstanding performance of our bot. Besides, in order to further enhance the AI's capabilities, we apply policy-based reinforcement learning algorithm to GuanDan. To address the challenges arising from the huge action space, which will significantly impact the performance of policy-based algorithms, we adopt the pre-trained model to facilitate the training process and the achieved AI program manages to achieve a superior performance.

Beyond Isolation: Multi-Agent Synergy for Improving Knowledge Graph Construction

  • paper_url: http://arxiv.org/abs/2312.03022
  • repo_url: None
  • paper_authors: Hongbin Ye, Honghao Gui, Aijia Zhang, Tong Liu, Wei Hua, Weiqiang Jia
  • for: This paper examines the multifaceted task of knowledge graph construction (KGC), which involves entity, relation, and event extraction.
  • methods: The paper proposes a novel framework, CooperKGC, which establishes a collaborative KGC processing network in which multiple agents jointly address entity, relation, and event extraction tasks.
  • results: Experiments show that CooperKGC's collaborative network can address entity, relation, and event extraction concurrently, and that collaboration enhances knowledge selection, correction, and aggregation across multiple rounds of interaction.
    Abstract Knowledge graph construction (KGC) is a multifaceted undertaking involving the extraction of entities, relations, and events. Traditionally, large language models (LLMs) have been viewed as solitary task-solving agents in this complex landscape. However, this paper challenges this paradigm by introducing a novel framework, CooperKGC. Departing from the conventional approach, CooperKGC establishes a collaborative processing network, assembling a KGC collaboration team capable of concurrently addressing entity, relation, and event extraction tasks. Our experiments unequivocally demonstrate that fostering collaboration and information interaction among diverse agents within CooperKGC yields superior results compared to individual cognitive processes operating in isolation. Importantly, our findings reveal that the collaboration facilitated by CooperKGC enhances knowledge selection, correction, and aggregation capabilities across multiple rounds of interactions.

Graph Information Bottleneck for Remote Sensing Segmentation

  • paper_url: http://arxiv.org/abs/2312.02545
  • repo_url: None
  • paper_authors: Yuntao Shou, Wei Ai, Tao Meng
  • for: This work aims to improve the accuracy and efficiency of remote sensing image segmentation, particularly for irregular objects.
  • methods: The study treats images as graph structures and introduces a simple contrastive vision GNN (SC-ViG) architecture that adaptively learns whether to mask nodes and edges. It further applies information bottleneck theory to maximize task-relevant information while minimizing task-irrelevant redundant information.
  • results: On publicly available real-world datasets, the method outperforms state-of-the-art approaches on remote sensing image segmentation and classification tasks.
    Abstract Remote sensing segmentation has a wide range of applications in environmental protection, and urban change detection, etc. Despite the success of deep learning-based remote sensing segmentation methods (e.g., CNN and Transformer), they are not flexible enough to model irregular objects. In addition, existing graph contrastive learning methods usually adopt the way of maximizing mutual information to keep the node representations consistent between different graph views, which may cause the model to learn task-independent redundant information. To tackle the above problems, this paper treats images as graph structures and introduces a simple contrastive vision GNN (SC-ViG) architecture for remote sensing segmentation. Specifically, we construct a node-masked and edge-masked graph view to obtain an optimal graph structure representation, which can adaptively learn whether to mask nodes and edges. Furthermore, this paper innovatively introduces information bottleneck theory into graph contrastive learning to maximize task-related information while minimizing task-independent redundant information. Finally, we replace the convolutional module in UNet with the SC-ViG module to complete the segmentation and classification tasks of remote sensing images. Extensive experiments on publicly available real datasets demonstrate that our method outperforms state-of-the-art remote sensing image segmentation methods.

PolyFit: A Peg-in-hole Assembly Framework for Unseen Polygon Shapes via Sim-to-real Adaptation

  • paper_url: http://arxiv.org/abs/2312.02531
  • repo_url: None
  • paper_authors: Geonhyup Lee, Joosoon Lee, Sangjun Noh, Minhwan Ko, Kangmin Kim, Kyoobin Lee
  • for: This study addresses peg-in-hole assembly, a foundational and challenging robotics task in which sensor inaccuracies and mechanical errors cause insertion failures or jamming.
  • methods: The study proposes PolyFit, a supervised learning approach that mitigates the impact of perception and mechanical errors. PolyFit uses force/torque (F/T) data for accurate extrinsic pose estimation and adjusts the peg pose to rectify misalignments.
  • results: The model is trained extensively in simulation on a dataset covering diverse peg-hole shapes, extrinsic poses, and corresponding contact F/T readings. PolyFit achieves peg-in-hole success rates of 97.3% and 96.3% for seen and unseen shapes in simulation, and 86.7% and 85.0% in real-world evaluations, demonstrating the robustness and adaptability of the method.
    Abstract The study addresses the foundational and challenging task of peg-in-hole assembly in robotics, where misalignments caused by sensor inaccuracies and mechanical errors often result in insertion failures or jamming. This research introduces PolyFit, representing a paradigm shift by transitioning from a reinforcement learning approach to a supervised learning methodology. PolyFit is a Force/Torque (F/T)-based supervised learning framework designed for 5-DoF peg-in-hole assembly. It utilizes F/T data for accurate extrinsic pose estimation and adjusts the peg pose to rectify misalignments. Extensive training in a simulated environment involves a dataset encompassing a diverse range of peg-hole shapes, extrinsic poses, and their corresponding contact F/T readings. To enhance extrinsic pose estimation, a multi-point contact strategy is integrated into the model input, recognizing that identical F/T readings can indicate different poses. The study proposes a sim-to-real adaptation method for real-world application, using a sim-real paired dataset to enable effective generalization to complex and unseen polygon shapes. PolyFit achieves impressive peg-in-hole success rates of 97.3% and 96.3% for seen and unseen shapes in simulations, respectively. Real-world evaluations further demonstrate substantial success rates of 86.7% and 85.0%, highlighting the robustness and adaptability of the proposed method.

MEMTO: Memory-guided Transformer for Multivariate Time Series Anomaly Detection

  • paper_url: http://arxiv.org/abs/2312.02530
  • repo_url: https://github.com/gunny97/MEMTO
  • paper_authors: Junho Song, Keonwoo Kim, Jeonglyul Oh, Sungzoon Cho
  • for: This paper proposes a memory-guided, Transformer-based anomaly detection method for real-world multivariate time series data.
  • methods: The method combines a reconstruction-based Transformer with a novel memory module that learns how much each memory item should be updated in response to the input data. K-means clustering is used to initialize the memory items and stabilize training.
  • results: Evaluated on five real-world multivariate time series datasets, the proposed method achieves an average anomaly detection F1-score of 95.74%, significantly outperforming previous state-of-the-art methods.
    Abstract Detecting anomalies in real-world multivariate time series data is challenging due to complex temporal dependencies and inter-variable correlations. Recently, reconstruction-based deep models have been widely used to solve the problem. However, these methods still suffer from an over-generalization issue and fail to deliver consistently high performance. To address this issue, we propose the MEMTO, a memory-guided Transformer using a reconstruction-based approach. It is designed to incorporate a novel memory module that can learn the degree to which each memory item should be updated in response to the input data. To stabilize the training procedure, we use a two-phase training paradigm which involves using K-means clustering for initializing memory items. Additionally, we introduce a bi-dimensional deviation-based detection criterion that calculates anomaly scores considering both input space and latent space. We evaluate our proposed method on five real-world datasets from diverse domains, and it achieves an average anomaly detection F1-score of 95.74%, significantly outperforming the previous state-of-the-art methods. We also conduct extensive experiments to empirically validate the effectiveness of our proposed model's key components.
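Two of the moving parts can be sketched compactly: a soft attention read over learned memory items, and a bi-dimensional anomaly score that multiplies the input-space reconstruction error by the latent-space deviation from the nearest memory item. This is a schematic simplification, not MEMTO's exact gated update:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(queries, memory):
    """Soft attention read: each query retrieves a mixture of memory items."""
    attn = softmax(queries @ memory.T)           # (n, n_items)
    return attn @ memory                         # (n, d)

def anomaly_scores(x, x_recon, latent, memory):
    """Bi-dimensional score: input-space error times latent-space deviation."""
    recon_err = ((x - x_recon) ** 2).sum(axis=-1)
    latent_dev = np.min(((latent[:, None, :] - memory[None]) ** 2).sum(-1), axis=1)
    return recon_err * latent_dev                # large only if both deviate

rng = np.random.default_rng(0)
memory = rng.normal(size=(10, 16))               # learned normal-pattern items
latent = rng.normal(size=(5, 16))                # encoder outputs for 5 windows
read = memory_read(latent, memory)               # what the decoder would consume
x = rng.normal(size=(5, 8))
x_recon = x + 0.05 * rng.normal(size=(5, 8))     # a faithful reconstruction
print(read.shape, anomaly_scores(x, x_recon, latent, memory))
```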

MASP: Scalable GNN-based Planning for Multi-Agent Navigation

  • paper_url: http://arxiv.org/abs/2312.02522
  • repo_url: None
  • paper_authors: Xinyi Yang, Xinting Yang, Chao Yu, Jiayu Chen, Huazhong Yang, Yu Wang
  • for: This paper addresses cooperative multi-agent navigation tasks, in which multiple agents must reach initially unassigned targets within a limited time.
  • methods: The work combines reinforcement learning (RL) with a hierarchical planning structure and uses graph neural networks (GNNs) to model the interactions between agents and goals.
  • results: Experiments show that MASP outperforms classical planning methods and RL baselines, achieving a nearly 100% success rate in multi-agent particle environments (MPE) and zero-shot generalization across different team sizes.
    Abstract We investigate the problem of decentralized multi-agent navigation tasks, where multiple agents need to reach initially unassigned targets in a limited time. Classical planning-based methods suffer from expensive computation overhead at each step and offer limited expressiveness for complex cooperation strategies. In contrast, reinforcement learning (RL) has recently become a popular paradigm for addressing this issue. However, RL struggles with low data efficiency and cooperation when directly exploring (nearly) optimal policies in the large search space, especially with an increased agent number (e.g., 10+ agents) or in complex environments (e.g., 3D simulators). In this paper, we propose Multi-Agent Scalable GNN-based P lanner (MASP), a goal-conditioned hierarchical planner for navigation tasks with a substantial number of agents. MASP adopts a hierarchical framework to divide a large search space into multiple smaller spaces, thereby reducing the space complexity and accelerating training convergence. We also leverage graph neural networks (GNN) to model the interaction between agents and goals, improving goal achievement. Besides, to enhance generalization capabilities in scenarios with unseen team sizes, we divide agents into multiple groups, each with a previously trained number of agents. The results demonstrate that MASP outperforms classical planning-based competitors and RL baselines, achieving a nearly 100% success rate with minimal training data in both multi-agent particle environments (MPE) with 50 agents and a quadrotor 3-dimensional environment (OmniDrones) with 20 agents. Furthermore, the learned policy showcases zero-shot generalization across unseen team sizes.

Retrieving Conditions from Reference Images for Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02521
  • repo_url: None
  • paper_authors: Haoran Tang, Xin Zhou, Jieren Deng, Zhihong Pan, Hao Tian, Pratik Chaudhari
  • for: This work aims to improve the versatility of subject-driven image generation for a broader range of applications.
  • methods: Building on diffusion-based subject-driven generation, the work introduces RetriBooru-V1, an anime figures dataset with enhanced identity and clothing labels, and establishes an RAG-inspired baseline designed to retrieve precise conditional information from reference images.
  • results: The paper states new tasks enabled by this dataset and introduces a new diversity metric to measure success on them, quantifying the flexibility of image generations; baseline results on the new tasks and ablation studies are also reported.
    Abstract Recent diffusion-based subject driven generative methods have enabled image generations with good fidelity for specific objects or human portraits. However, to achieve better versatility for applications, we argue that not only improved datasets and evaluations are desired, but also more careful methods to retrieve only relevant information from conditional images are anticipated. To this end, we propose an anime figures dataset RetriBooru-V1, with enhanced identity and clothing labels. We state new tasks enabled by this dataset, and introduce a new diversity metric to measure success in completing these tasks, quantifying the flexibility of image generations. We establish an RAG-inspired baseline method, designed to retrieve precise conditional information from reference images. Then, we compare with current methods on existing task to demonstrate the capability of the proposed method. Finally, we provide baseline experiment results on new tasks, and conduct ablation studies on the possible structural choices.

Creative Agents: Empowering Agents with Imagination for Creative Tasks

  • paper_url: http://arxiv.org/abs/2312.02519
  • repo_url: https://github.com/pku-rl/creative-agents
  • paper_authors: Chi Zhang, Penglin Cai, Yuhui Fu, Haoqi Yuan, Zongqing Lu
  • for: This work aims to build embodied agents with creativity for open-ended creative tasks. Existing methods build diverse open-ended task solvers, but none of them demonstrates creativity.
  • methods: The authors propose a class of solutions in which the controller is enhanced with an imaginator that converts abstract language instructions into concrete imagined task outcomes in the environment. Several implementations are introduced, using either a large language model for textual imagination or a diffusion model for visual imagination.
  • results: Detailed experimental analysis in Minecraft shows that creative agents can create diverse buildings in survival mode. The paper also proposes novel evaluation metrics that better assess AI agents on open-ended creative tasks.
    Abstract We study building embodied agents for open-ended creative tasks. While existing methods build instruction-following agents that can perform diverse open-ended tasks, none of them demonstrates creativity -- the ability to give novel and diverse task solutions implicit in the language instructions. This limitation comes from their inability to convert abstract language instructions into concrete task goals in the environment and perform long-horizon planning for such complicated goals. Given the observation that humans perform creative tasks with the help of imagination, we propose a class of solutions for creative agents, where the controller is enhanced with an imaginator that generates detailed imaginations of task outcomes conditioned on language instructions. We introduce several approaches to implementing the components of creative agents. We implement the imaginator with either a large language model for textual imagination or a diffusion model for visual imagination. The controller can either be a behavior-cloning policy learned from data or a pre-trained foundation model generating executable codes in the environment. We benchmark creative tasks with the challenging open-world game Minecraft, where the agents are asked to create diverse buildings given free-form language instructions. In addition, we propose novel evaluation metrics for open-ended creative tasks utilizing GPT-4V, which holds many advantages over existing metrics. We perform a detailed experimental analysis of creative agents, showing that creative agents are the first AI agents accomplishing diverse building creation in the survival mode of Minecraft. Our benchmark and models are open-source for future research on creative agents (https://github.com/PKU-RL/Creative-Agents).

Simplifying Neural Network Training Under Class Imbalance

  • paper_url: http://arxiv.org/abs/2312.02517
  • repo_url: https://github.com/ravidziv/simplifyingimbalancedtraining
  • paper_authors: Ravid Shwartz-Ziv, Micah Goldblum, Yucen Lily Li, C. Bayan Bruss, Andrew Gordon Wilson
  • for: This work investigates how components of standard deep learning pipelines can be tuned to improve performance on class-imbalanced real-world datasets.
  • methods: The study tunes existing components of standard deep learning pipelines, including batch size, data augmentation, optimizer, and label smoothing, to address class imbalance.
  • results: The study finds that tuning these standard components achieves state-of-the-art performance under class imbalance without any specialized class-imbalance methods.
    Abstract Real-world datasets are often highly class-imbalanced, which can adversely impact the performance of deep learning models. The majority of research on training neural networks under class imbalance has focused on specialized loss functions, sampling techniques, or two-stage training procedures. Notably, we demonstrate that simply tuning existing components of standard deep learning pipelines, such as the batch size, data augmentation, optimizer, and label smoothing, can achieve state-of-the-art performance without any such specialized class imbalance methods. We also provide key prescriptions and considerations for training under class imbalance, and an understanding of why imbalance methods succeed or fail.
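In PyTorch terms, two of the "standard components" the paper tunes amount to a few lines each: label smoothing is a single argument to the built-in loss, and class-balanced batches come from a weighted sampler. A minimal sketch with illustrative hyperparameter values:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy imbalanced dataset: 900 examples of class 0, 100 of class 1.
X = torch.randn(1000, 16)
y = torch.cat([torch.zeros(900, dtype=torch.long),
               torch.ones(100, dtype=torch.long)])

# Class-balanced sampling: weight each example inversely to its class count.
class_counts = torch.bincount(y).float()
sample_weights = (1.0 / class_counts)[y]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(y),
                                replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=128, sampler=sampler)

# Label smoothing is a one-argument change to the standard loss.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

model = torch.nn.Linear(16, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for xb, yb in loader:
    opt.zero_grad()
    criterion(model(xb), yb).backward()
    opt.step()
```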

ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU

  • paper_url: http://arxiv.org/abs/2312.02515
  • repo_url: https://github.com/TUDB-Labs/multi-lora-fine-tune
  • paper_authors: Zhengmao Ye, Dengchun Li, Jingqi Tian, Tingfeng Lan, Jie Zuo, Lei Duan, Hui Lu, Yexi Jiang, Jian Sha, Ke Zhang, Mingjie Tang
  • for: This paper aims to improve the efficiency of fine-tuning large language models (LLMs), particularly when fine-tuning multiple tasks concurrently.
  • methods: The work uses Low-Rank Adaptation (LoRA) to fine-tune multiple jobs efficiently on a single GPU, leveraging a shared pre-trained model and adaptive scheduling.
  • results: Experiments show that the ASPEN framework saves 53% of GPU memory and improves training throughput by about 17% over existing methods. Its adaptive scheduling algorithm further reduces turnaround time by 24% and end-to-end training latency by 12%.
    Abstract Transformer-based large language models (LLMs) have demonstrated outstanding performance across diverse domains, particularly when fine-turned for specific domains. Recent studies suggest that the resources required for fine-tuning LLMs can be economized through parameter-efficient methods such as Low-Rank Adaptation (LoRA). While LoRA effectively reduces computational burdens and resource demands, it currently supports only a single-job fine-tuning setup. In this paper, we present ASPEN, a high-throughput framework for fine-tuning LLMs. ASPEN efficiently trains multiple jobs on a single GPU using the LoRA method, leveraging shared pre-trained model and adaptive scheduling. ASPEN is compatible with transformer-based language models like LLaMA and ChatGLM, etc. Experiments show that ASPEN saves 53% of GPU memory when training multiple LLaMA-7B models on NVIDIA A100 80GB GPU and boosts training throughput by about 17% compared to existing methods when training with various pre-trained models on different GPUs. The adaptive scheduling algorithm reduces turnaround time by 24%, end-to-end training latency by 12%, prioritizing jobs and preventing out-of-memory issues.
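LoRA itself, the building block being scheduled here, replaces a frozen weight W with W + (α/r)·BA for low-rank matrices A and B; because each job owns only its small (A, B) pair, many jobs can share one frozen backbone in GPU memory. A minimal sketch of the adapter (generic LoRA, not ASPEN's scheduler):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W + (alpha/r) B A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # backbone stays frozen and
            p.requires_grad = False            # can be shared across jobs
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r                 # update starts at zero (B = 0)

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, f"trainable params: {trainable}")  # 8192 vs 262656 frozen
```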

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

  • paper_url: http://arxiv.org/abs/2312.02512
  • repo_url: None
  • paper_authors: Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro
  • for: This paper proposes a framework for direct audio-visual speech to audio-visual speech translation (AV2AV), enabling real-like multilingual virtual meetings with synchronized lip movements alongside the translated speech.
  • methods: The framework learns unified audio-visual speech representations through self-supervised learning and trains the translation system on the audio-only dataset of A2A; an AV-Renderer with zero-shot speaker modeling generates raw audio and video in parallel.
  • results: Extensive experiments in a many-to-many language translation setting demonstrate effective translation while preserving the speaker's voice characteristics across languages.
    Abstract This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence of acoustic noise, showcasing robust performance. To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning in advance to train the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, thus the speaker in source audio-visual speech can be maintained at the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. The demo page is available on https://choijeongsoo.github.io/av2av.

Visual Hindsight Self-Imitation Learning for Interactive Navigation

  • paper_url: http://arxiv.org/abs/2312.03446
  • repo_url: None
  • paper_authors: Kibeom Kim, Kisung Shin, Min Whoo Lee, Moonhoen Lee, Minsu Lee, Byoung-Tak Zhang
  • for: This paper targets sample efficiency in interactive visual navigation tasks, enabling agents to learn and complete these tasks with fewer samples.
  • methods: The paper proposes Visual Hindsight Self-Imitation Learning (VHS), which improves sample efficiency through hindsight goal re-labeling and self-imitation. It also introduces a prototypical goal embedding method, derived from experienced goal observations, for vision-based and partially observable environments.
  • results: Experimental results show that VHS outperforms existing techniques on interactive visual navigation tasks, confirming its superior performance and sample efficiency.
    Abstract Interactive visual navigation tasks, which involve following instructions to reach and interact with specific targets, are challenging not only because successful experiences are very rare but also because the complex visual inputs require a substantial number of samples. Previous methods for these tasks often rely on intricately designed dense rewards or the use of expensive expert data for imitation learning. To tackle these challenges, we propose a novel approach, Visual Hindsight Self-Imitation Learning (VHS) for enhancing sample efficiency through hindsight goal re-labeling and self-imitation. We also introduce a prototypical goal embedding method derived from experienced goal observations, that is particularly effective in vision-based and partially observable environments. This embedding technique allows the agent to visually reinterpret its unsuccessful attempts, enabling vision-based goal re-labeling and self-imitation from enhanced successful experiences. Experimental results show that VHS outperforms existing techniques in interactive visual navigation tasks, confirming its superior performance and sample efficiency.
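Hindsight re-labeling is mechanically simple: when an episode fails to reach its commanded goal, rewrite the trajectory as if the goal it actually achieved had been intended all along, and feed it to self-imitation as a success. A schematic of this bookkeeping (the visual goal embedding is abstracted into an opaque value):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Transition:
    obs: Any
    action: Any
    goal: Any          # goal embedding the agent was conditioned on

def hindsight_relabel(trajectory: List[Transition], achieved_goal) -> List[Transition]:
    """Turn a failed episode into a successful demonstration for the goal
    that was actually achieved; used to feed self-imitation learning."""
    return [Transition(t.obs, t.action, achieved_goal) for t in trajectory]

# A failed episode toward goal "red_door" that ended at "blue_door":
traj = [Transition(obs=f"o{i}", action=f"a{i}", goal="red_door") for i in range(3)]
demo_buffer = hindsight_relabel(traj, achieved_goal="blue_door")
print([t.goal for t in demo_buffer])  # every step now targets 'blue_door'
```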

Inspecting Model Fairness in Ultrasound Segmentation Tasks

  • paper_url: http://arxiv.org/abs/2312.02501
  • repo_url: None
  • paper_authors: Zikang Xu, Fenghe Tang, Quan Quan, Jianrui Ding, Chunping Ning, S. Kevin Zhou
  • for: This paper evaluates bias in deep learning (DL) segmentation models across subgroups defined by different sensitive attributes.
  • methods: The study inspects state-of-the-art DL segmentation algorithms on two ultrasound datasets.
  • results: The study finds that even state-of-the-art DL algorithms exhibit biased behavior in ultrasound segmentation tasks. These results serve as a warning, underscoring the need for careful model evaluation before real-world deployment to uphold ethical considerations and mitigate risks to patient outcomes.
    Abstract With the rapid expansion of machine learning and deep learning (DL), researchers are increasingly employing learning-based algorithms to alleviate diagnostic challenges across diverse medical tasks and applications. While advancements in diagnostic precision are notable, some researchers have identified a concerning trend: their models exhibit biased performance across subgroups characterized by different sensitive attributes. This bias not only infringes upon the rights of patients but also has the potential to lead to life-altering consequences. In this paper, we inspect a series of DL segmentation models using two ultrasound datasets, aiming to assess the presence of model unfairness in these specific tasks. Our findings reveal that even state-of-the-art DL algorithms demonstrate unfair behavior in ultrasound segmentation tasks. These results serve as a crucial warning, underscoring the necessity for careful model evaluation before their deployment in real-world scenarios. Such assessments are imperative to ensure ethical considerations and mitigate the risk of adverse impacts on patient outcomes.
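A concrete way to inspect segmentation fairness is to compute per-subgroup Dice scores and report the gap between the best- and worst-served groups. A minimal audit of this kind (synthetic data; in practice the subgroup labels would come from the sensitive attributes):

```python
import numpy as np

def dice(pred, gt, eps=1e-7):
    inter = np.logical_and(pred, gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

def fairness_gap(preds, gts, groups):
    """Max difference in mean Dice across subgroups (e.g., by age or sex)."""
    scores = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        scores[g] = float(np.mean([dice(preds[i], gts[i]) for i in idx]))
    return scores, max(scores.values()) - min(scores.values())

rng = np.random.default_rng(0)
gts = [rng.random((32, 32)) > 0.5 for _ in range(6)]
# Simulate a model that is systematically worse on group 'B'.
noise = [0.05, 0.05, 0.05, 0.3, 0.3, 0.3]
preds = [np.logical_xor(g, rng.random(g.shape) < n) for g, n in zip(gts, noise)]
scores, gap = fairness_gap(preds, gts, groups=["A", "A", "A", "B", "B", "B"])
print(scores, f"Dice gap: {gap:.3f}")
```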

MKA: A Scalable Medical Knowledge Assisted Mechanism for Generative Models on Medical Conversation Tasks

  • paper_url: http://arxiv.org/abs/2312.02496
  • repo_url: https://github.com/liangke23/knowledge_assisted_medical_dialogue_generation_mechanism
  • paper_authors: Ke Liang, Sifan Wu, Jiayi Gu
  • for: This study aims to make medical chatbots more convenient and efficient for diagnosis, broadening the application of medical AI technology.
  • methods: The work uses neural generative models as the chatbot core and proposes a scalable Medical Knowledge Assisted mechanism (MKA) to help them perform better on medical conversation tasks.
  • results: Evaluation shows that neural generative models equipped with MKA outperform the original methods on multiple automatic evaluation metrics, achieving the best performance on two medical datasets, MedDG and MedDialog-CN.
    Abstract Using natural language processing (NLP) technologies to develop medical chatbots makes the diagnosis of the patient more convenient and efficient, which is a typical application in healthcare AI. Because of its importance, lots of research have been come out. Recently, the neural generative models have shown their impressive ability as the core of chatbot, while it cannot scale well when directly applied to medical conversation due to the lack of medical-specific knowledge. To address the limitation, a scalable Medical Knowledge Assisted mechanism, MKA, is proposed in this paper. The mechanism aims to assist general neural generative models to achieve better performance on the medical conversation task. The medical-specific knowledge graph is designed within the mechanism, which contains 6 types of medical-related information, including department, drug, check, symptom, disease, food. Besides, the specific token concatenation policy is defined to effectively inject medical information into the input data. Evaluation of our method is carried out on two typical medical datasets, MedDG and MedDialog-CN. The evaluation results demonstrate that models combined with our mechanism outperform original methods in multiple automatic evaluation metrics. Besides, MKA-Bert-GPT achieves state-of-the-art performance. The open-sourced codes are public: https://github.com/LIANGKE23/Knowledge_Assisted_Medical_Dialogue_Generation_Mechanism

Flexible Communication for Optimal Distributed Learning over Unpredictable Networks

  • paper_url: http://arxiv.org/abs/2312.02493
  • repo_url: None
  • paper_authors: Sahil Tyagi, Martin Swany
  • for: This paper aims to improve the efficiency of distributed deep learning training by reducing communication overhead and accelerating the training process.
  • methods: The paper proposes an Allreduce (AR)-compatible Topk compressor that is bandwidth-optimal and can switch between AG and AR based on the current network configuration. The authors also model the pareto-relationship between parallel and statistical efficiency as a multi-objective optimization (MOO) problem to dynamically adjust the compression ratio and accelerate training.
  • results: The proposed method achieves high accuracy like DenseSGD but with lower communication cost, and can dynamically adjust the compression ratio and collective operation to balance parallel and statistical efficiency. The authors also show that the proposed method outperforms AG in certain network configurations and can be applied to various deep learning models.
    Abstract Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and its corresponding indices, typically via Allgather (AG). Training with high compression ratio (CR) achieves high accuracy like DenseSGD, but has lower parallel scaling due to high communication cost (i.e., parallel efficiency). Using lower CRs improves parallel efficiency by lowering synchronization cost, but degrades model accuracy as well (statistical efficiency). Further, speedup attained with different models and CRs also varies with network latency, effective bandwidth and collective op used for aggregation. In many cases, collectives like Allreduce (AR) have lower cost than AG to exchange the same amount of data. In this paper, we propose an AR-compatible Topk compressor that is bandwidth-optimal and thus performs better than AG in certain network configurations. We develop a flexible communication strategy that switches between AG and AR based on which collective is optimal in the current settings, and model the pareto-relationship between parallel and statistical efficiency as a multi-objective optimization (MOO) problem to dynamically adjust CR and accelerate training while still converging to high accuracy.
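Two pieces of this idea can be sketched independently: classic Top-k sparsification with error feedback (dropped values are carried into the next step), and a crude byte-count model for choosing between Allreduce and Allgather. Both are generic simplifications, not the paper's exact compressor or scheduling policy:

```python
import numpy as np

class TopKCompressor:
    """Top-k gradient sparsification with error feedback (residual carry)."""
    def __init__(self, dim, ratio=0.01):
        self.k = max(1, int(dim * ratio))
        self.residual = np.zeros(dim)

    def compress(self, grad):
        acc = grad + self.residual                  # fold in past residual
        idx = np.argpartition(np.abs(acc), -self.k)[-self.k:]
        values = acc[idx]
        self.residual = acc.copy()
        self.residual[idx] = 0.0                    # remember what we dropped
        return values, idx

def choose_collective(dim, k, workers, bytes_per_val=4, bytes_per_idx=4):
    """Pick the cheaper collective by bytes moved per worker (crude model):
    dense Allreduce moves ~2*dim values; sparse Allgather moves
    ~workers*k (value, index) pairs."""
    ar_bytes = 2 * dim * bytes_per_val
    ag_bytes = workers * k * (bytes_per_val + bytes_per_idx)
    return "allreduce" if ar_bytes < ag_bytes else "allgather"

comp = TopKCompressor(dim=1_000_000, ratio=0.01)
vals, idx = comp.compress(np.random.randn(1_000_000))
print(len(vals), choose_collective(1_000_000, comp.k, workers=64))
```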

Learning to Holistically Detect Bridges from Large-Size VHR Remote Sensing Imagery

  • paper_url: http://arxiv.org/abs/2312.02481
  • repo_url: None
  • paper_authors: Yansheng Li, Junwei Luo, Yongjun Zhang, Yihua Tan, Jin-Gang Yu, Song Bai
  • for: This paper addresses bridge detection in remote sensing images (RSIs) and the challenges of detecting bridges in large-size very-high-resolution (VHR) RSIs.
  • methods: The paper proposes GLH-Bridge, a large-scale dataset of 6,000 VHR RSIs sampled from diverse geographic locations across the globe, and presents HBD-Net, an efficient network for holistic bridge detection in large-size RSIs that uses a separate detector-based feature fusion (SDFF) architecture optimized via a shape-sensitive sample re-weighting (SSRW) strategy.
  • results: The paper establishes a bridge detection benchmark covering the OBB and HBB tasks and validates the effectiveness of HBD-Net on the GLH-Bridge dataset; cross-dataset generalization experiments illustrate the strong generalization capability of the GLH-Bridge dataset.
    Abstract Bridge detection in remote sensing images (RSIs) plays a crucial role in various applications, but it poses unique challenges compared to the detection of other objects. In RSIs, bridges exhibit considerable variations in terms of their spatial scales and aspect ratios. Therefore, to ensure the visibility and integrity of bridges, it is essential to perform holistic bridge detection in large-size very-high-resolution (VHR) RSIs. However, the lack of datasets with large-size VHR RSIs limits the deep learning algorithms' performance on bridge detection. Due to the limitation of GPU memory in tackling large-size images, deep learning-based object detection methods commonly adopt the cropping strategy, which inevitably results in label fragmentation and discontinuous prediction. To ameliorate the scarcity of datasets, this paper proposes a large-scale dataset named GLH-Bridge comprising 6,000 VHR RSIs sampled from diverse geographic locations across the globe. These images encompass a wide range of sizes, varying from 2,048*2,048 to 16,38*16,384 pixels, and collectively feature 59,737 bridges. Furthermore, we present an efficient network for holistic bridge detection (HBD-Net) in large-size RSIs. The HBD-Net presents a separate detector-based feature fusion (SDFF) architecture and is optimized via a shape-sensitive sample re-weighting (SSRW) strategy. Based on the proposed GLH-Bridge dataset, we establish a bridge detection benchmark including the OBB and HBB tasks, and validate the effectiveness of the proposed HBD-Net. Additionally, cross-dataset generalization experiments on two publicly available datasets illustrate the strong generalization capability of the GLH-Bridge dataset.

E4SRec: An Elegant Effective Efficient Extensible Solution of Large Language Models for Sequential Recommendation

  • paper_url: http://arxiv.org/abs/2312.02443
  • repo_url: https://github.com/hestiasky/e4srec
  • paper_authors: Xinhang Li, Chong Chen, Xiangyu Zhao, Yong Zhang, Chunxiao Xing
  • for: This paper aims to apply large language models (LLMs) to recommender systems to improve personalization and efficiency.
  • methods: The proposed Elegant Effective Efficient Extensible solution for sequential recommendation (E4SRec) integrates LLMs with traditional recommender systems that represent items exclusively by IDs. It takes ID sequences as input and ensures that the generated outputs fall within the candidate list.
  • results: Comprehensive experiments on four widely used real-world datasets demonstrate the effectiveness, efficiency, and extensibility of E4SRec.
    Abstract The recent advancements in Large Language Models (LLMs) have sparked interest in harnessing their potential within recommender systems. Since LLMs are designed for natural language tasks, existing recommendation approaches have predominantly transformed recommendation tasks into open-domain natural language generation tasks. However, this approach necessitates items to possess rich semantic information, often generates out-of-range results, and suffers from notably low efficiency and limited extensibility. Furthermore, practical ID-based recommendation strategies, reliant on a huge number of unique identities (IDs) to represent users and items, have gained prominence in real-world recommender systems due to their effectiveness and efficiency. Nevertheless, the incapacity of LLMs to model IDs presents a formidable challenge when seeking to leverage LLMs for personalized recommendations. In this paper, we introduce an Elegant Effective Efficient Extensible solution for large language models for Sequential Recommendation (E4SRec), which seamlessly integrates LLMs with traditional recommender systems that exclusively utilize IDs to represent items. Specifically, E4SRec takes ID sequences as inputs, ensuring that the generated outputs fall within the candidate lists. Furthermore, E4SRec possesses the capability to generate the entire ranking list in a single forward process, and demands only a minimal set of pluggable parameters, which are trained for each dataset while keeping the entire LLM frozen. We substantiate the effectiveness, efficiency, and extensibility of our proposed E4SRec through comprehensive experiments conducted on four widely-used real-world datasets. The implementation code is accessible at https://github.com/HestiaSky/E4SRec/.
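The interface of ID-based LLM recommendation can be sketched as two small pluggable pieces around a frozen backbone: a linear adapter projecting pretrained item-ID embeddings into the model's hidden space, and an output head that scores only the candidate items, so outputs can never leave the candidate list. The sketch below substitutes a GRU for the frozen LLM and is a schematic simplification, not E4SRec's architecture:

```python
import torch
import torch.nn as nn

class IDSeqRecommender(nn.Module):
    """Frozen sequence encoder with pluggable ID-in / candidate-out adapters."""
    def __init__(self, item_emb: torch.Tensor, hidden: int):
        super().__init__()
        self.item_emb = nn.Embedding.from_pretrained(item_emb, freeze=True)
        self.proj_in = nn.Linear(item_emb.size(1), hidden)   # trainable
        # Stand-in for a frozen LLM backbone:
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.proj_out = nn.Linear(hidden, item_emb.size(1))  # trainable

    def forward(self, id_seq, candidate_ids):
        h = self.proj_in(self.item_emb(id_seq))              # (B, L, hidden)
        _, last = self.backbone(h)                           # (1, B, hidden)
        user = self.proj_out(last.squeeze(0))                # (B, emb_dim)
        cand = self.item_emb(candidate_ids)                  # (C, emb_dim)
        return user @ cand.T      # scores restricted to the candidate list

model = IDSeqRecommender(torch.randn(1000, 32), hidden=64)
scores = model(torch.randint(0, 1000, (4, 10)), torch.arange(100))
print(scores.shape)  # (4, 100): one forward pass ranks all 100 candidates
```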

Let’s Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation

  • paper_url: http://arxiv.org/abs/2312.02439
  • repo_url: https://github.com/sail-sg/clot
  • paper_authors: Shanshan Zhong, Zhongzhan Huang, Shanghua Gao, Wushao Wen, Liang Lin, Marinka Zitnik, Pan Zhou
  • for: This paper explores Chain-of-Thought (CoT) and Leap-of-Thought (LoT) in large language models (LLMs) and how to improve LLM creativity.
  • methods: Using the Oogiri game as a testbed, the authors build Oogiri-GO, a multimodal and multilingual dataset, and study the LoT abilities of existing LLMs. To improve creativity, they introduce the Creative Leap-of-Thought (CLoT) paradigm, which combines LoT-oriented instruction tuning of a pretrained LLM with explorative self-refinement.
  • results: The approach excels at humor generation in the Oogiri game and also boosts creative abilities on other tasks, such as the cloud guessing game and the divergent association task. These findings advance our understanding of LLM creativity and offer a pathway toward innovative applications.
    Abstract Chain-of-Thought (CoT) guides large language models (LLMs) to reason step-by-step, and can motivate their logical reasoning ability. While effective for logical tasks, CoT is not conducive to creative problem-solving which often requires out-of-box thoughts and is crucial for innovation advancements. In this paper, we explore the Leap-of-Thought (LoT) abilities within LLMs -- a non-sequential, creative paradigm involving strong associations and knowledge leaps. To this end, we study LLMs on the popular Oogiri game which needs participants to have good creativity and strong associative thinking for responding unexpectedly and humorously to the given image, text, or both, and thus is suitable for LoT study. Then to investigate LLMs' LoT ability in the Oogiri game, we first build a multimodal and multilingual Oogiri-GO dataset which contains over 130,000 samples from the Oogiri game, and observe the insufficient LoT ability or failures of most existing LLMs on the Oogiri game. Accordingly, we introduce a creative Leap-of-Thought (CLoT) paradigm to improve LLM's LoT ability. CLoT first formulates the Oogiri-GO dataset into LoT-oriented instruction tuning data to train pretrained LLM for achieving certain LoT humor generation and discrimination abilities. Then CLoT designs an explorative self-refinement that encourages the LLM to generate more creative LoT data via exploring parallels between seemingly unrelated concepts and selects high-quality data to train itself for self-refinement. CLoT not only excels in humor generation in the Oogiri game but also boosts creative abilities in various tasks like cloud guessing game and divergent association task. These findings advance our understanding and offer a pathway to improve LLMs' creative capacities for innovative applications across domains. The dataset, code, and models will be released online. https://zhongshsh.github.io/CLoT/.

MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following

  • paper_url: http://arxiv.org/abs/2312.02436
  • repo_url: https://github.com/RenzeLou/Muffin
  • paper_authors: Renze Lou, Kai Zhang, Jian Xie, Yuxuan Sun, Janice Ahn, Hanzi Xu, Yu Su, Wenpeng Yin
  • for: Improve the instruction-following ability of large language models (LLMs).
  • methods: Curates instruction data by automatically scaling the number of tasks per input, diversifying the tasks along various input facets.
  • results: LLMs of various scales trained on MUFFIN generally show stronger instruction-following than models trained under the Scaling-Inputs or Scaling Input-Free Tasks schemes, across benchmarks from both.
    Abstract In the realm of large language models (LLMs), enhancing instruction-following capability often involves curating expansive training data. This is achieved through two primary schemes: i) Scaling-Inputs: Amplifying (input, output) pairs per task instruction, aiming for better instruction adherence. ii) Scaling Input-Free Tasks: Enlarging tasks, each composed of an (instruction, output) pair (without requiring a separate input anymore). However, LLMs under Scaling-Inputs tend to be overly sensitive to inputs, leading to misinterpretation or non-compliance with instructions. Conversely, Scaling Input-Free Tasks demands a substantial number of tasks but is less effective in instruction following when dealing with instances in Scaling-Inputs. This work introduces MUFFIN, a new scheme of instruction-following dataset curation. Specifically, we automatically Scale Tasks per Input by diversifying these tasks with various input facets. Experimental results across four zero-shot benchmarks, spanning both Scaling-Inputs and Scaling Input-Free Tasks schemes, reveal that LLMs, at various scales, trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained on the two aforementioned schemes.
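
    The core curation idea, scaling tasks per input rather than inputs per task, can be sketched as follows: one input is expanded into several (instruction, input) instances along different facets. The facet list here is hand-written for illustration; MUFFIN derives facets and outputs automatically.

```python
# Hypothetical facet list; the actual pipeline generates facets with LLMs.
FACETS = {
    "sentiment":  "What is the sentiment of the following text?",
    "language":   "What language is the following text written in?",
    "word_count": "How many words does the following text contain?",
    "rewrite":    "Rewrite the following text in a formal tone.",
}

def scale_tasks_per_input(text: str) -> list[dict]:
    """Expand ONE input into MANY (instruction, input) task instances,
    one per facet, instead of many inputs for one instruction."""
    return [{"instruction": instr, "input": text, "facet": facet}
            for facet, instr in FACETS.items()]

for task in scale_tasks_per_input("Das Wetter ist heute wunderbar."):
    print(task["facet"], "->", task["instruction"])
```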

Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

  • paper_url: http://arxiv.org/abs/2312.02431
  • repo_url: None
  • paper_authors: Alessandro Suglia, Ioannis Konstas, Oliver Lemon
  • for: A systematic literature review of the tasks and models proposed in the Vision+Language (V+L) field.
  • methods: Uses Wittgenstein's idea of "language games" to categorize V+L tasks into three families: discriminative games, generative games, and interactive games.
  • results: The analysis of the literature suggests that future work should focus on interactive games, since communication in natural language is key to resolving ambiguities about object referents and action plans, and physical embodiment is essential for understanding the semantics of situations and events.
    Abstract In recent years, several machine learning models have been proposed. They are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language Understanding and Generation tasks. However, many facets of meaning cannot be learned by "listening to the radio" only. In the literature, many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field. We rely on Wittgenstein's idea of "language games" to categorise such tasks into 3 different families: 1) discriminative games, 2) generative games, and 3) interactive games. Our analysis of the literature provides evidence that future work should be focusing on interactive games where communication in Natural Language is important to resolve ambiguities about object referents and action plans and that physical embodiment is essential to understand the semantics of situations and events. Overall, these represent key requirements for developing grounded meanings in neural models.

PartSLIP++: Enhancing Low-Shot 3D Part Segmentation via Multi-View Instance Segmentation and Maximum Likelihood Estimation

  • paper_url: http://arxiv.org/abs/2312.03015
  • repo_url: https://github.com/zyc00/partslip2
  • paper_authors: Yuchen Zhou, Jiayuan Gu, Xuanlin Li, Minghua Liu, Yunhao Fang, Hao Su
  • for: Improve the accuracy and scalability of zero- and few-shot 3D part segmentation.
  • methods: Combines two pretrained models, GLIP and SAM, with a modified Expectation-Maximization algorithm.
  • results: PartSLIP++ outperforms PartSLIP on zero- and few-shot 3D semantic and instance-based object part segmentation tasks.
    Abstract Open-world 3D part segmentation is pivotal in diverse applications such as robotics and AR/VR. Traditional supervised methods often grapple with limited 3D data availability and struggle to generalize to unseen object categories. PartSLIP, a recent advancement, has made significant strides in zero- and few-shot 3D part segmentation. This is achieved by harnessing the capabilities of the 2D open-vocabulary detection module, GLIP, and introducing a heuristic method for converting and lifting multi-view 2D bounding box predictions into 3D segmentation masks. In this paper, we introduce PartSLIP++, an enhanced version designed to overcome the limitations of its predecessor. Our approach incorporates two major improvements. First, we utilize a pre-trained 2D segmentation model, SAM, to produce pixel-wise 2D segmentations, yielding more precise and accurate annotations than the 2D bounding boxes used in PartSLIP. Second, PartSLIP++ replaces the heuristic 3D conversion process with an innovative modified Expectation-Maximization algorithm. This algorithm conceptualizes 3D instance segmentation as unobserved latent variables, and then iteratively refines them through an alternating process of 2D-3D matching and optimization with gradient descent. Through extensive evaluations, we show that PartSLIP++ demonstrates better performance over PartSLIP in both low-shot 3D semantic and instance-based object part segmentation tasks. Code released at https://github.com/zyc00/PartSLIP2.
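
    The modified Expectation-Maximization idea can be illustrated with a toy version: per-point 3D instance labels are treated as latent variables, the E-step matches the current 3D instances to each view's 2D masks by overlap, and the M-step re-assigns points according to the matched votes. This numpy sketch simplifies aggressively (greedy matching, hard assignments) and is not the paper's algorithm.

```python
import numpy as np

def em_lift_to_3d(obs, n_instances, n_iters=10, seed=0):
    """Toy EM in the spirit of PartSLIP++.

    obs: (V, N) int array; obs[v, i] = 2D mask id of point i in view v, -1 = unseen.
    Returns a (N,) array of 3D instance labels.
    """
    rng = np.random.default_rng(seed)
    V, N = obs.shape
    z = rng.integers(0, n_instances, size=N)            # random init of latents

    for _ in range(n_iters):
        votes = np.zeros((N, n_instances))
        for v in range(V):
            masks = obs[v]
            ids = np.unique(masks[masks >= 0])
            # Overlap counts between each 2D mask and each current 3D instance.
            C = np.array([np.bincount(z[masks == m], minlength=n_instances)
                          for m in ids], dtype=float)
            # E-step: greedy one-to-one matching of 2D masks to 3D instances.
            while C.size and C.max() > 0:
                mi, ki = np.unravel_index(C.argmax(), C.shape)
                votes[masks == ids[mi], ki] += 1
                C[mi, :] = -1
                C[:, ki] = -1
        z = votes.argmax(axis=1)                        # M-step: re-assign points
    return z

# Two ground-truth parts observed (noisily) from three views.
truth = np.array([0] * 5 + [1] * 5)
obs = np.stack([truth, truth, truth])
obs[2, 4] = 1                                           # one noisy observation
print(em_lift_to_3d(obs, n_instances=2))                # two parts, up to relabeling
```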

Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data

  • paper_url: http://arxiv.org/abs/2312.02418
  • repo_url: None
  • paper_authors: Yu Yang, Aaditya K. Singh, Mostafa Elhoushi, Anas Mahmoud, Kushal Tirumala, Fabian Gloeckle, Baptiste Rozière, Carole-Jean Wu, Ari S. Morcos, Newsha Ardalani
  • for: Improve the code-generation performance and training efficiency of Large Language Models (LLMs) by removing "low-quality" code data.
  • methods: Uses embedding space to identify and remove "low-quality" code data: synthetic corruptions expose the characteristics of low-quality code, and these insights inform novel embedding-space pruning metrics.
  • results: Outperforms existing embedding-based methods on the HumanEval and MBPP benchmarks, with up to a 3% performance improvement over no pruning, demonstrating the promise of insights from synthetic corruptions for data pruning.
    Abstract Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of Large Language Models (LLMs) optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety, and in other modalities, such as images. Our work focuses on using embeddings to identify and remove "low-quality" code data. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. Armed with this knowledge, we devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset. We demonstrate the benefits of this synthetic corruption informed pruning (SCIP) approach on the well-established HumanEval and MBPP benchmarks, outperforming existing embedding-based methods. Importantly, we achieve up to a 3% performance improvement over no pruning, thereby showing the promise of insights from synthetic corruptions for data pruning.
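
    A simplified sketch of the synthetic-corruption idea: corrupt each sample, embed clean and corrupted code, take the clean-to-corrupted direction in embedding space, and flag the samples that project most strongly onto it. TF-IDF stands in for the real code embedder, and the single corruption type and pruning rule are illustrative, not the paper's metrics.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def corrupt(code: str, rng: np.random.Generator) -> str:
    """Synthetic corruption: drop a random line (one of many possible types)."""
    lines = code.splitlines()
    if len(lines) > 1:
        lines.pop(rng.integers(len(lines)))
    return "\n".join(lines)

def corruption_scores(corpus: list[str], seed: int = 0) -> np.ndarray:
    """Score each sample by its projection onto the clean->corrupted
    direction in embedding space; higher means more 'corruption-like'."""
    rng = np.random.default_rng(seed)
    corrupted = [corrupt(c, rng) for c in corpus]
    emb = TfidfVectorizer().fit_transform(corpus + corrupted).toarray()
    clean_emb, bad_emb = emb[: len(corpus)], emb[len(corpus):]
    direction = bad_emb.mean(axis=0) - clean_emb.mean(axis=0)
    direction /= np.linalg.norm(direction) + 1e-12
    return clean_emb @ direction

corpus = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    # TODO broken\n    return a - b  # wrong op",
]
scores = corruption_scores(corpus)
keep = [c for c, s in zip(corpus, scores) if s < np.median(scores)]
print(len(keep), "of", len(corpus), "samples kept")
```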

Towards Fast and Stable Federated Learning: Confronting Heterogeneity via Knowledge Anchor

  • paper_url: http://arxiv.org/abs/2312.02416
  • repo_url: https://github.com/J1nqianChen/FedKA
  • paper_authors: Jinqian Chen, Jihua Zhu, Qinghai Zheng
  • for: Tackle data heterogeneity in federated learning, which critically affects the performance and stability of the federated model.
  • methods: Systematically analyzes class-wise forgetting during local training and proposes a new algorithm, Federated Knowledge Anchor (FedKA), which builds a knowledge anchor from shared samples of missing and non-dominant classes and uses it to correct mini-batch gradients.
  • results: Experiments show that FedKA achieves fast and stable convergence and significantly improves accuracy on popular benchmarks.
    Abstract Federated learning encounters a critical challenge of data heterogeneity, adversely affecting the performance and convergence of the federated model. Various approaches have been proposed to address this issue, yet their effectiveness is still limited. Recent studies have revealed that the federated model suffers severe forgetting in local training, leading to global forgetting and performance degradation. Although the analysis provides valuable insights, a comprehensive understanding of the vulnerable classes and their impact factors is yet to be established. In this paper, we aim to bridge this gap by systematically analyzing the forgetting degree of each class during local training across different communication rounds. Our observations are: (1) Both missing and non-dominant classes suffer similar severe forgetting during local training, while dominant classes show improvement in performance. (2) When dynamically reducing the sample size of a dominant class, catastrophic forgetting occurs abruptly when the proportion of its samples is below a certain threshold, indicating that the local model struggles to leverage a few samples of a specific class effectively to prevent forgetting. Motivated by these findings, we propose a novel and straightforward algorithm called Federated Knowledge Anchor (FedKA). Assuming that all clients have a single shared sample for each class, the knowledge anchor is constructed before each local training stage by extracting shared samples for missing classes and randomly selecting one sample per class for non-dominant classes. The knowledge anchor is then utilized to correct the gradient of each mini-batch towards the direction of preserving the knowledge of the missing and non-dominant classes. Extensive experimental results demonstrate that our proposed FedKA achieves fast and stable convergence, significantly improving accuracy on popular benchmarks.
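
    A minimal PyTorch sketch of one local step with a knowledge anchor: the anchor batch holds one shared sample per missing or non-dominant class, and its gradient is used to correct the mini-batch gradient. The specific correction rule below (projecting out the conflicting component, PCGrad-style) is our illustrative choice; the paper defines its own correction.

```python
import torch
import torch.nn as nn

def flat_grad(loss, params):
    """Flatten d(loss)/d(params) into a single vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def anchored_step(model, loss_fn, batch, anchor, lr=0.1):
    """One local training step corrected by a knowledge anchor."""
    params = [p for p in model.parameters() if p.requires_grad]
    x, y = batch
    ax, ay = anchor                       # one shared sample per rare class
    g_batch = flat_grad(loss_fn(model(x), y), params)
    g_anchor = flat_grad(loss_fn(model(ax), ay), params)

    dot = g_batch @ g_anchor
    if dot < 0:                           # update would hurt anchor classes:
        g_batch = g_batch - dot / (g_anchor @ g_anchor + 1e-12) * g_anchor

    with torch.no_grad():                 # manual SGD with corrected gradient
        offset = 0
        for p in params:
            n = p.numel()
            p -= lr * g_batch[offset:offset + n].view_as(p)
            offset += n

model = nn.Linear(4, 3)
loss_fn = nn.CrossEntropyLoss()
batch = (torch.randn(16, 4), torch.randint(0, 2, (16,)))   # classes 0-1 dominate
anchor = (torch.randn(2, 4), torch.tensor([1, 2]))         # covers missing class 2
anchored_step(model, loss_fn, batch, anchor)
```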

Foundation Models for Weather and Climate Data Understanding: A Comprehensive Survey

  • paper_url: http://arxiv.org/abs/2312.03014
  • repo_url: https://github.com/shengchaochen82/awesome-large-models-for-weather-and-climate
  • paper_authors: Shengchao Chen, Guodong Long, Jing Jiang, Dikai Liu, Chengqi Zhang
  • for: This paper is written to provide an overview of state-of-the-art AI methodologies for weather and climate data, with a focus on time series and text data.
  • methods: The paper discusses various model architectures, including large language models (LLMs), and their applications in weather and climate data understanding.
  • results: The paper provides an exhaustive review of current breakthroughs in research on large, data-driven models for weather and climate data understanding, including practical applications, crucial resources, and prospective research opportunities.
    Abstract As artificial intelligence (AI) continues to rapidly evolve, the realm of Earth and atmospheric sciences is increasingly adopting data-driven models, powered by progressive developments in deep learning (DL). Specifically, DL techniques are extensively utilized to decode the chaotic and nonlinear aspects of Earth systems, and to address climate challenges via understanding weather and climate data. Cutting-edge performance on specific tasks within narrower spatio-temporal scales has been achieved recently through DL. The rise of large models, specifically large language models (LLMs), has enabled fine-tuning processes that yield remarkable outcomes across various downstream tasks, thereby propelling the advancement of general AI. However, we are still navigating the initial stages of crafting general AI for weather and climate. In this survey, we offer an exhaustive, timely overview of state-of-the-art AI methodologies specifically engineered for weather and climate data, with a special focus on time series and text data. Our primary coverage encompasses four critical aspects: types of weather and climate data, principal model architectures, model scopes and applications, and datasets for weather and climate. Furthermore, in relation to the creation and application of foundation models for weather and climate data understanding, we delve into the field's prevailing challenges, offer crucial insights, and propose detailed avenues for future research. This comprehensive approach equips practitioners with the requisite knowledge to make substantial progress in this domain. Our survey encapsulates the most recent breakthroughs in research on large, data-driven models for weather and climate data understanding, emphasizing robust foundations, current advancements, practical applications, crucial resources, and prospective research opportunities.

BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks

  • paper_url: http://arxiv.org/abs/2312.02405
  • repo_url: https://github.com/minerllabs/basalt-benchmark
  • paper_authors: Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Rohin Shah
  • for: Provides a formalized benchmark for learning from human feedback, enabling performance assessment of newly developed algorithms.
  • methods: Uses four hard-to-specify tasks in Minecraft, such as creating and photographing a waterfall, to evaluate algorithms that learn from human feedback.
  • results: Releases a collection of 26 million image-action pairs together with over 3,000 dense pairwise human evaluations of human and algorithmic agents, which serve as a fixed preliminary leaderboard for newly developed algorithms.
    Abstract The MineRL BASALT competition has served to catalyze advances in learning from human feedback through four hard-to-specify tasks in Minecraft, such as create and photograph a waterfall. Given the completion of two years of BASALT competitions, we offer to the community a formalized benchmark through the BASALT Evaluation and Demonstrations Dataset (BEDD), which serves as a resource for algorithm development and performance assessment. BEDD consists of a collection of 26 million image-action pairs from nearly 14,000 videos of human players completing the BASALT tasks in Minecraft. It also includes over 3,000 dense pairwise human evaluations of human and algorithmic agents. These comparisons serve as a fixed, preliminary leaderboard for evaluating newly-developed algorithms. To enable this comparison, we present a streamlined codebase for benchmarking new algorithms against the leaderboard. In addition to presenting these datasets, we conduct a detailed analysis of the data from both datasets to guide algorithm development and evaluation. The released code and data are available at https://github.com/minerllabs/basalt-benchmark .
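
    The dense pairwise human evaluations can be aggregated into a preliminary leaderboard. The snippet below does this with a simple Elo-style update; the benchmark's own aggregation pipeline may differ, and the judgment tuples are toy data.

```python
from collections import defaultdict

def elo_ratings(comparisons, k=32, base=1000.0):
    """Aggregate pairwise judgments into per-agent ratings.

    comparisons: iterable of (winner, loser) agent names, e.g. parsed from
    pairwise human-vs-algorithm evaluations.
    """
    rating = defaultdict(lambda: base)
    for winner, loser in comparisons:
        expected = 1.0 / (1.0 + 10 ** ((rating[loser] - rating[winner]) / 400))
        rating[winner] += k * (1.0 - expected)
        rating[loser] -= k * (1.0 - expected)
    return dict(rating)

# Toy judgments: the human demonstrator beats both agents; agent_a beats agent_b.
judgments = [("human", "agent_a"), ("human", "agent_b"), ("agent_a", "agent_b")]
for agent, score in sorted(elo_ratings(judgments).items(), key=lambda kv: -kv[1]):
    print(f"{agent}: {score:.0f}")
```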

Breast Ultrasound Report Generation using LangChain

  • paper_url: http://arxiv.org/abs/2312.03013
  • repo_url: None
  • paper_authors: Jaeyoung Huh, Hyun Jeong Park, Jong Chul Ye
  • for: Improve the diagnostic efficiency and report quality of breast ultrasound imaging while reducing the burden on radiologists and healthcare professionals.
  • methods: Proposes a LangChain-based integration of multiple image-analysis tools: combining designated tools with text generation, the method extracts relevant features from ultrasound images, interprets them in a clinical context, and produces standardized reports.
  • results: Experiments show that each tool involved offers qualitatively and quantitatively significant results without requiring specialist intervention, and clinical evaluation confirms that the generated reports are clinically meaningful.
    Abstract Breast ultrasound (BUS) is a critical diagnostic tool in the field of breast imaging, aiding in the early detection and characterization of breast abnormalities. Interpreting breast ultrasound images commonly involves creating comprehensive medical reports, containing vital information to promptly assess the patient's condition. However, the ultrasound imaging system necessitates capturing multiple images of various parts to compile a single report, presenting a time-consuming challenge. To address this problem, we propose the integration of multiple image analysis tools through a LangChain using Large Language Models (LLM) into the breast reporting process. Through a combination of designated tools and text generation through LangChain, our method can accurately extract relevant features from ultrasound images, interpret them in a clinical context, and produce comprehensive and standardized reports. This approach not only reduces the burden on radiologists and healthcare professionals but also enhances the consistency and quality of reports. Extensive experiments show that each tool involved in the proposed method can offer qualitatively and quantitatively significant results. Furthermore, clinical evaluation of the generated reports demonstrates that the proposed method can produce reports in a clinically meaningful way.
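
    The tool-composition idea can be sketched without tying it to a specific LangChain version: run each specialized analysis tool over the captured views, pool the findings, and hand them to a text generator for a standardized report. All tool names, outputs, and the prompt below are hypothetical stand-ins for the paper's actual modules.

```python
# Hypothetical stand-ins for the specialized analysis modules the paper
# plugs into its LangChain pipeline.
def detect_lesion(image_path: str) -> dict:
    return {"present": True, "location": "right breast, 10 o'clock"}

def classify_birads(image_path: str) -> dict:
    return {"birads": 4, "shape": "irregular", "margin": "spiculated"}

TOOLS = {"lesion detection": detect_lesion, "BI-RADS classification": classify_birads}

def generate_report(image_paths: list[str], llm=None) -> str:
    """Run every tool over every captured view, then hand the pooled
    findings to a text generator for a standardized report."""
    findings = {name: [tool(p) for p in image_paths]
                for name, tool in TOOLS.items()}
    prompt = f"Write a standardized breast ultrasound report from: {findings}"
    # A real pipeline would call the LLM here; we fall back to the prompt.
    return llm(prompt) if llm else prompt

print(generate_report(["view1.png", "view2.png"])[:80], "...")
```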