2023-11-30

cs.AI

Negotiated Representations to Prevent Forgetting in Machine Learning Applications

paper_url: http://arxiv.org/abs/2312.00237
repo_url: https://github.com/nurikorhan/negotiated-representations-for-continual-learning
paper_authors: Nuri Korhan, Ceren Öner
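for: This work addresses catastrophic forgetting in machine learning, specifically in neural networks.
methods: The method incorporates negotiated representations into the learning process so that the network retains knowledge across multiple tasks.
results: Experiments were run on several benchmark datasets, including Split MNIST, Split CIFAR10, Split Fashion MNIST, and Split CIFAR100. The results indicate that the method effectively avoids catastrophic forgetting when the network learns multiple tasks and keeps knowledge balanced across tasks.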

Abstract
Catastrophic forgetting is a significant challenge in the field of machine learning, particularly in neural networks. When a neural network learns to perform well on a new task, it often forgets its previously acquired knowledge or experiences. This phenomenon occurs because the network adjusts its weights and connections to minimize the loss on the new task, which can inadvertently overwrite or disrupt the representations that were crucial for the previous tasks. As a result, the performance of the network on earlier tasks deteriorates, limiting its ability to learn and adapt to a sequence of tasks. In this paper, we propose a novel method for preventing catastrophic forgetting in machine learning applications, specifically focusing on neural networks. Our approach aims to preserve the knowledge of the network across multiple tasks while still allowing it to learn new information effectively. We demonstrate the effectiveness of our method by conducting experiments on various benchmark datasets, including Split MNIST, Split CIFAR10, Split Fashion MNIST, and Split CIFAR100. These datasets are created by dividing the original datasets into separate, non-overlapping tasks, simulating a continual learning scenario where the model needs to learn multiple tasks sequentially without forgetting the previous ones. Our proposed method tackles the catastrophic forgetting problem by incorporating negotiated representations into the learning process, which allows the model to maintain a balance between retaining past experiences and adapting to new tasks. By evaluating our method on these challenging datasets, we aim to showcase its potential for addressing catastrophic forgetting and improving the performance of neural networks in continual learning settings.
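
The "Split" benchmarks mentioned in the abstract divide a dataset's classes into disjoint groups that are learned one after another. A minimal sketch of that splitting step follows; the function name `split_into_tasks`, the two-classes-per-task grouping, and the toy labels are illustrative assumptions, not taken from the paper or its repository.

```python
from collections import defaultdict


def split_into_tasks(labels, classes_per_task=2):
    """Group sample indices into sequential tasks with disjoint class sets.

    Hypothetical helper: the paper does not specify how its splits are built.
    """
    classes = sorted(set(labels))
    # Consecutive chunks of classes, e.g. [0, 1], [2, 3], ... for a 10-class dataset.
    class_groups = [classes[i:i + classes_per_task]
                    for i in range(0, len(classes), classes_per_task)]
    class_to_task = {c: t for t, group in enumerate(class_groups) for c in group}

    tasks = defaultdict(list)
    for idx, label in enumerate(labels):
        tasks[class_to_task[label]].append(idx)
    return [tasks[t] for t in range(len(class_groups))]


# Toy labels standing in for a 10-class dataset such as MNIST (assumed data).
toy_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 3]
task_indices = split_into_tasks(toy_labels, classes_per_task=2)
```

A continual learner would then train on `task_indices[0]`, then `task_indices[1]`, and so on, and be evaluated on all earlier tasks after each stage.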

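Summary
Catastrophic forgetting is a major challenge in machine learning. When a neural network learns a new task, it often forgets previously learned knowledge and experience. This happens because the network adjusts its weights and connections to minimize the loss on the new task, which can inadvertently overwrite or disrupt earlier representations, degrading performance on earlier tasks and limiting the network's ability to learn and adapt to a sequence of tasks. The paper proposes a method for preventing catastrophic forgetting, aimed specifically at neural networks, that preserves the network's knowledge across tasks while still allowing it to learn new information effectively. The method is evaluated on several benchmark datasets, including Split MNIST, Split CIFAR10, Split Fashion MNIST, and Split CIFAR100. These datasets are built by dividing the original datasets into non-overlapping tasks, simulating a continual learning scenario in which the model must learn tasks one after another without forgetting earlier ones. By incorporating negotiated representations into the learning process, the method lets the model balance retaining past experience against adapting to new tasks. The evaluation on these challenging datasets is intended to show the method's potential for preventing catastrophic forgetting and improving neural network performance in continual learning.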

Uncertainty in Graph Contrastive Learning with Bayesian Neural Networks

paper_url: http://arxiv.org/abs/2312.00232
repo_url: None
paper_authors: Alexander Möllers, Alexander Immer, Elvin Isufi, Vincent Fortuin
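for: This paper aims to improve uncertainty estimation in semi-supervised node-classification tasks, in settings where labeled data is scarce but large unlabeled datasets are available.
methods: A variational Bayesian neural network approach is used to improve both uncertainty estimates and downstream performance. The authors also propose a new uncertainty measure for contrastive learning based on the disagreement in likelihood across different positive samples.
results: Experiments show that the approach improves both uncertainty estimates and downstream performance, and that it performs well when large amounts of unlabeled data are available.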

Abstract
Graph contrastive learning has shown great promise when labeled data is scarce, but large unlabeled datasets are available. However, it often does not take uncertainty estimation into account. We show that a variational Bayesian neural network approach can be used to improve not only the uncertainty estimates but also the downstream performance on semi-supervised node-classification tasks. Moreover, we propose a new measure of uncertainty for contrastive learning, that is based on the disagreement in likelihood due to different positive samples.

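Summary
Graph contrastive learning performs well when labeled data is scarce but large unlabeled datasets are available. However, it usually does not account for uncertainty estimation. The paper shows that a variational Bayesian neural network approach improves not only the uncertainty estimates but also downstream performance on semi-supervised node-classification tasks. It also proposes a new uncertainty measure based on the disagreement in likelihood across different positive samples.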

Unsupervised textile defect detection using convolutional neural networks

paper_url: http://arxiv.org/abs/2312.00224
repo_url: None
paper_authors: Imane Koulali, M. Taner Eskil
for: This paper proposes a novel motif-based approach for unsupervised textile anomaly detection, with the goal of combating the limitations of traditional convolutional neural networks and unsupervised learning paradigms.
methods: The proposed approach consists of five main steps: preprocessing, automatic pattern period extraction, patch extraction, feature selection, and anomaly detection. It uses a new dynamic and heuristic method for feature selection, which avoids the drawbacks of initialization of the number of filters and their weights, and those of the backpropagation mechanism. The design and training of the network are performed in a dynamic and input domain-based manner, with no ad-hoc configurations required.
results: The proposed approach yields reliable and competitive results (on recall, precision, accuracy, and f1-measure) compared to state-of-the-art unsupervised approaches, in less time, with efficient training in a single epoch and a lower computational cost. The algorithm is demonstrated on the Patterned Fabrics benchmark dataset.

Abstract
In this study, we propose a novel motif-based approach for unsupervised textile anomaly detection that combines the benefits of traditional convolutional neural networks with those of an unsupervised learning paradigm. It consists of five main steps: preprocessing, automatic pattern period extraction, patch extraction, feature selection and anomaly detection. This proposed approach uses a new dynamic and heuristic method for feature selection which avoids the drawbacks of initialization of the number of filters (neurons) and their weights, and those of the backpropagation mechanism such as the vanishing gradients, which are common practice in the state-of-the-art methods. The design and training of the network are performed in a dynamic and input domain-based manner and, thus, no ad-hoc configurations are required. Before building the model, only the number of layers and the stride are defined. We do not initialize the weights randomly nor do we define the filter size or number of filters as conventionally done in CNN-based approaches. This reduces effort and time spent on hyperparameter initialization and fine-tuning. Only one defect-free sample is required for training and no further labeled data is needed. The trained network is then used to detect anomalies on defective fabric samples. We demonstrate the effectiveness of our approach on the Patterned Fabrics benchmark dataset. Our algorithm yields reliable and competitive results (on recall, precision, accuracy and f1-measure) compared to state-of-the-art unsupervised approaches, in less time, with efficient training in a single epoch and a lower computational cost.

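Summary
The study proposes a new motif-based approach for unsupervised textile anomaly detection that combines the advantages of traditional convolutional neural networks with those of the unsupervised learning paradigm. It consists of five main steps: preprocessing, automatic pattern period extraction, patch extraction, feature selection, and anomaly detection. The approach uses a new dynamic, heuristic feature selection method that avoids the drawbacks of initializing the number of filters (neurons) and their weights, as well as drawbacks of backpropagation such as vanishing gradients. The network is designed and trained in a dynamic, input domain-based manner, so no ad-hoc configuration is required. Before the model is built, only the number of layers and the stride are defined. Weights are not initialized randomly, and neither the filter size nor the number of filters is predefined, which reduces the time and effort spent on hyperparameter initialization and fine-tuning. Only one defect-free sample is needed for training, with no further labeled data. The trained network is then used to detect anomalies in defective fabric samples. On the Patterned Fabrics benchmark dataset, the algorithm achieves reliable and competitive results in recall, precision, accuracy, and f1-measure compared with state-of-the-art unsupervised approaches, in less time and at lower computational cost.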

Learning active tactile perception through belief-space control

paper_url: http://arxiv.org/abs/2312.00215
repo_url: None
paper_authors: Jean-François Tremblay, David Meger, Francois Hogan, Gregory Dudek
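for: This paper addresses how robots that interact with unknown objects in an open world can autonomously learn to sense the objects' physical properties.
methods: The approach learns tactile exploration policies by developing a generative world model, which is used to estimate an object's physical parameters with a differentiable Bayesian filtering algorithm and to develop an exploration policy with an information-gathering model predictive controller.
results: On three simulated tasks, the method discovers policies that efficiently gather information about the object property of interest, and its feasibility is validated on a real robot system.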

Abstract
Robots operating in an open world will encounter novel objects with unknown physical properties, such as mass, friction, or size. These robots will need to sense these properties through interaction prior to performing downstream tasks with the objects. We propose a method that autonomously learns tactile exploration policies by developing a generative world model that is leveraged to 1) estimate the object's physical parameters using a differentiable Bayesian filtering algorithm and 2) develop an exploration policy using an information-gathering model predictive controller. We evaluate our method on three simulated tasks where the goal is to estimate a desired object property (mass, height or toppling height) through physical interaction. We find that our method is able to discover policies that efficiently gather information about the desired property in an intuitive manner. Finally, we validate our method on a real robot system for the height estimation task, where our method is able to successfully learn and execute an information-gathering policy from scratch.
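
To make the filtering idea concrete, here is a minimal sketch of a single discrete Bayesian belief update over an unknown mass, with entropy as a rough measure of how much information an interaction provided. This is a plain histogram filter, not the differentiable filter described in the abstract; the candidate masses, the force model F = m·a, the noise level, and all names are illustrative assumptions.

```python
import math


def bayes_update(prior, likelihoods):
    """Posterior is proportional to prior times likelihood, then normalized."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def entropy(belief):
    """Shannon entropy (in nats) of a discrete belief."""
    return -sum(p * math.log(p) for p in belief if p > 0)


def gaussian_likelihood(observed, predicted, sigma):
    """Unnormalized Gaussian likelihood of an observation given a prediction."""
    return math.exp(-0.5 * ((observed - predicted) / sigma) ** 2)


# Assumed setup: four candidate masses (kg) and a uniform prior belief.
masses = [0.5, 1.0, 1.5, 2.0]
belief = [0.25, 0.25, 0.25, 0.25]

# Assumed interaction: push with acceleration 2.0 m/s^2, sense a noisy force (N).
push_accel = 2.0
observed_force = 2.1
likelihoods = [gaussian_likelihood(observed_force, m * push_accel, sigma=0.5)
               for m in masses]
posterior = bayes_update(belief, likelihoods)

# Entropy reduction: one simple proxy for information gained by the push.
information_gain = entropy(belief) - entropy(posterior)
```

An information-gathering controller in this spirit would prefer actions whose expected entropy reduction is largest; that planning step is omitted here.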

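Summary
Robots operating in an open world will encounter novel objects with unknown physical properties such as mass, friction, or size. They need to sense these properties through interaction before performing downstream tasks. The paper proposes a method that autonomously learns tactile exploration policies using a differentiable Bayesian filtering algorithm and an information-gathering model predictive controller. The method is evaluated on three simulated tasks in which the goal is to estimate a desired object property (mass, height, or toppling height) through physical interaction. It discovers policies that efficiently gather information about the desired property. Finally, the method is validated on a real robot system, where it successfully learns and executes an information-gathering policy from scratch.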

DREAM: Diffusion Rectification and Estimation-Adaptive Models

paper_url: http://arxiv.org/abs/2312.00210
repo_url: https://github.com/jinxinzhou/DREAM
paper_authors: Jinxin Zhou, Tianyu Ding, Tianyi Chen, Jiachen Jiang, Ilya Zharkov, Zhihui Zhu, Luming Liang
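for: Improving the alignment between training and sampling in diffusion models, making training faster and more efficient.
methods: The paper proposes DREAM, a new training framework with two components, diffusion rectification and estimation adaptation, which reduces the number of sampling steps needed while maintaining high image quality.
results: On image super-resolution (SR), DREAM trains faster than standard diffusion-based SR methods and reduces the number of sampling steps required to reach comparable or better image quality.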

Abstract
We present DREAM, a novel training framework representing Diffusion Rectification and Estimation-Adaptive Models, requiring minimal code changes (just three lines) yet significantly enhancing the alignment of training with sampling in diffusion models. DREAM features two components: diffusion rectification, which adjusts training to reflect the sampling process, and estimation adaptation, which balances perception against distortion. When applied to image super-resolution (SR), DREAM adeptly navigates the tradeoff between minimizing distortion and preserving high image quality. Experiments demonstrate DREAM's superiority over standard diffusion-based SR methods, showing a $2$ to $3\times $ faster training convergence and a $10$ to $20\times$ reduction in necessary sampling steps to achieve comparable or superior results. We hope DREAM will inspire a rethinking of diffusion model training paradigms.

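Summary
The paper presents DREAM, a new training framework standing for Diffusion Rectification and Estimation-Adaptive Models. It requires minimal code changes (just three lines) yet significantly improves the alignment of training with sampling. DREAM has two components: diffusion rectification, which adjusts training to reflect the sampling process, and estimation adaptation, which balances perception against distortion. Applied to image super-resolution (SR), DREAM navigates the tradeoff between minimizing distortion and preserving high image quality. Experiments show that, compared with standard diffusion-based SR methods, DREAM converges $2$ to $3\times$ faster in training and needs $10$ to $20\times$ fewer sampling steps to reach comparable or better results. The authors hope DREAM will prompt a rethinking of diffusion model training paradigms.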

On the Interplay Between Stepsize Tuning and Progressive Sharpening

paper_url: http://arxiv.org/abs/2312.00209
repo_url: None
paper_authors: Vincent Roulet, Atish Agarwala, Fabian Pedregosa
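for: This study investigates how stepsize tuning interacts with the optimization dynamics of deep learning models.
methods: The study examines stepsize tuners, namely the Armijo linesearch and Polyak stepsizes, which adapt the stepsize along the iterations, and tracks how the sharpness evolves when they are used.
results: The classical Armijo linesearch can perform poorly in the large-batch or full-batch regime, whereas Polyak stepsizes generally operate at or slightly beyond the edge of stability and outperform the other methods. The study also finds that unlocking stepsize tuners requires understanding the joint dynamics of the stepsize and the sharpness.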

Abstract
Recent empirical work has revealed an intriguing property of deep learning models by which the sharpness (largest eigenvalue of the Hessian) increases throughout optimization until it stabilizes around a critical value at which the optimizer operates at the edge of stability, given a fixed stepsize (Cohen et al, 2022). We investigate empirically how the sharpness evolves when using stepsize-tuners, the Armijo linesearch and Polyak stepsizes, that adapt the stepsize along the iterations to local quantities such as, implicitly, the sharpness itself. We find that the surprisingly poor performance of a classical Armijo linesearch may be well explained by its tendency to ever-increase the sharpness of the objective in the full or large batch regimes. On the other hand, we observe that Polyak stepsizes operate generally at the edge of stability or even slightly beyond, while outperforming its Armijo and constant stepsizes counterparts. We conclude with an analysis that suggests unlocking stepsize tuners requires an understanding of the joint dynamics of the step size and the sharpness.
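
For readers unfamiliar with the two tuners, the following minimal sketch implements textbook versions of a backtracking Armijo linesearch and the Polyak stepsize on a toy quadratic, where the sharpness is simply the largest curvature. The constants (`c`, `shrink`, the initial step), the assumption that the optimal value `f_star` is known and equal to 0, and the toy objective are illustrative choices, not the paper's experimental setup.

```python
def armijo_stepsize(f, grad, x, init_step=1.0, c=1e-4, shrink=0.5, max_iter=50):
    """Backtracking linesearch: shrink the step until the Armijo
    sufficient-decrease condition f(x - s*g) <= f(x) - c*s*||g||^2 holds."""
    g = grad(x)
    g_sq = sum(gi * gi for gi in g)
    fx = f(x)
    step = init_step
    for _ in range(max_iter):
        x_new = [xi - step * gi for xi, gi in zip(x, g)]
        if f(x_new) <= fx - c * step * g_sq:
            break
        step *= shrink
    return step


def polyak_stepsize(f, grad, x, f_star=0.0):
    """Polyak stepsize (f(x) - f*) / ||grad f(x)||^2, assuming f* is known."""
    g = grad(x)
    g_sq = sum(gi * gi for gi in g)
    return (f(x) - f_star) / g_sq if g_sq > 0 else 0.0


# Toy quadratic f(x) = 0.5 * sum(h_i * x_i^2); its Hessian is diag(h),
# so the sharpness (largest Hessian eigenvalue) is max(h).
curvatures = [1.0, 10.0]


def quad(x):
    return 0.5 * sum(h * xi * xi for h, xi in zip(curvatures, x))


def quad_grad(x):
    return [h * xi for h, xi in zip(curvatures, x)]


x0 = [1.0, 1.0]
sharpness = max(curvatures)
# Gradient descent on a quadratic is unstable for stepsizes above 2 / sharpness.
stability_threshold = 2.0 / sharpness

armijo = armijo_stepsize(quad, quad_grad, x0)
polyak = polyak_stepsize(quad, quad_grad, x0)
```

On this fixed quadratic both tuners pick a stepsize below `2 / sharpness`. The paper's question is what happens in deep networks, where the sharpness itself changes in response to the stepsizes chosen.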

An integrated framework for developing and evaluating an automated lecture style assessment system

paper_url: http://arxiv.org/abs/2312.00201
repo_url: None
paper_authors: Eleni Dimitriadou, Andreas Lanitis
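for: The goal is to develop an automated lecture style evaluation system that gives teachers instant feedback on their lecturing style, in order to improve the quality of the student learning experience.
methods: The system uses specific measurable biometric characteristics, such as facial expressions, body activity, speech rate and intonation, hand movement, and facial pose, extracted from video showing the lecturer from the audience's point of view. These features are combined into a score reflecting lecture style quality, both at frame rate and as quality metrics for the whole lecture.
results: Participants found the application novel and useful for providing automated feedback on lecture style. The system's performance was also compared with that of humans on the lecture style evaluation task: it matches human evaluators and in some cases outperforms them.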

Abstract
The aim of the work presented in this paper is to develop and evaluate an integrated system that provides automated lecture style evaluation, allowing teachers to get instant feedback related to the goodness of their lecturing style. The proposed system aims to promote improvement of lecture quality, that could upgrade the overall student learning experience. The proposed application utilizes specific measurable biometric characteristics, such as facial expressions, body activity, speech rate and intonation, hand movement, and facial pose, extracted from a video showing the lecturer from the audience point of view. Measurable biometric features extracted during a lecture are combined to provide teachers with a score reflecting lecture style quality both at frame rate and by providing lecture quality metrics for the whole lecture. The acceptance of the proposed lecture style evaluation system was evaluated by chief education officers, teachers and students regarding the functionality, usefulness of the application, and possible improvements. The results indicate that participants found the application novel and useful in providing automated feedback regarding lecture quality. Furthermore, the performance evaluation of the proposed system was compared with the performance of humans in the task of lecture style evaluation. Results indicate that the proposed system not only achieves similar performance to human observers, but in some cases, it outperforms them.

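Summary
The aim of this work is to develop an integrated lecture style evaluation system that gives teachers automated feedback, in order to improve the quality of the student learning experience. The system analyzes video of the lecturer and extracts specific measurable biometric characteristics, such as facial expressions, body activity, speech rate and intonation, hand movement, and facial pose, to produce a lecture quality assessment. It scores both individual frames and the whole lecture, and provides lecture quality metrics. Chief education officers, teachers, and students evaluated the system's functionality and usefulness. Participants found the system novel and useful for providing automated feedback on lecture quality. The system's performance was also compared with that of human evaluators; the results show that it matches human evaluators and in some cases outperforms them.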

Applying Large Language Models and Chain-of-Thought for Automatic Scoring

paper_url: http://arxiv.org/abs/2312.03748
repo_url: None
paper_authors: Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, Xiaoming Zhai
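for: This study investigates the use of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) for automatically scoring student-written responses. It aims to overcome the accessibility, technical complexity, and explainability issues that have limited the use of automatic scoring tools.
methods: The test set comprised six assessment tasks (three binomial and three trinomial) with 1,650 student responses. Six prompt engineering strategies were used, combining zero-shot or few-shot learning with CoT, either alone or alongside item stems and scoring rubrics.
results: Few-shot learning (accuracy = .67) outperformed zero-shot learning (accuracy = .60), a 12.6% increase. CoT without item stems and scoring rubrics did not significantly affect scoring accuracy (accuracy = .60). CoT prompting combined with contextual item stems and rubrics yielded a 13.44% increase for zero-shot and a 3.7% increase for few-shot. The PPEAS approach gave more balanced accuracy across proficiency categories, highlighting the importance of domain-specific reasoning for LLM scoring. GPT-4 outperformed GPT-3.5 across scoring tasks, with an 8.64% difference. The single-call strategy with greedy sampling outperformed the other approaches, including ensemble voting. The study shows the potential of LLMs for automatic scoring and emphasizes that CoT improves accuracy, particularly when combined with item stems and scoring rubrics.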

Abstract
This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of automatic assessment tools among researchers and educators. We used a testing dataset comprising six assessment tasks (three binomial and three trinomial) with 1,650 student responses. We employed six prompt engineering strategies, combining zero-shot or few-shot learning with CoT, either alone or alongside item stem and scoring rubrics. Results indicated that few-shot (acc = .67) outperformed zero-shot learning (acc = .60), a 12.6% increase. CoT, when used without item stem and scoring rubrics, did not significantly affect scoring accuracy (acc = .60). However, CoT prompting paired with contextual item stems and rubrics proved to be a significant contributor to scoring accuracy (13.44% increase for zero-shot; 3.7% increase for few-shot). Using a novel approach, PPEAS, we found a more balanced accuracy across different proficiency categories, highlighting the importance of domain-specific reasoning in enhancing the effectiveness of LLMs in scoring tasks. We also found that GPT-4 demonstrated superior performance over GPT-3.5 in various scoring tasks, showing an 8.64% difference. The study revealed that the single-call strategy with GPT-4, particularly using greedy sampling, outperformed other approaches, including ensemble voting strategies. This study demonstrates the potential of LLMs in facilitating automatic scoring, emphasizing that CoT enhances accuracy, particularly when used with item stem and scoring rubrics.
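
The abstract contrasts a single scoring call with ensemble voting over several sampled calls. The sketch below shows one plausible form of the voting and accuracy computation; the majority-vote rule, the tie-breaking choice, and the toy scores are assumptions for illustration, and no model is actually called.

```python
from collections import Counter


def majority_vote(scores):
    """Return the most frequent score; ties go to the score sampled first."""
    counts = Counter(scores)
    best = max(counts.values())
    for s in scores:
        if counts[s] == best:
            return s


def accuracy(predicted, gold):
    """Fraction of responses whose predicted score equals the human score."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Assumed toy data: three sampled scores per student response (trinomial items
# scored 0, 1, or 2) and the corresponding human-assigned scores.
sampled_scores = [[1, 1, 0], [2, 0, 2], [0, 1, 1]]
gold_scores = [1, 2, 0]

ensemble_scores = [majority_vote(s) for s in sampled_scores]
# A single-call strategy would instead use one deterministic score per response;
# here the first sample stands in for it.
single_call_scores = [s[0] for s in sampled_scores]

ensemble_accuracy = accuracy(ensemble_scores, gold_scores)
single_call_accuracy = accuracy(single_call_scores, gold_scores)
```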

HeTriNet: Heterogeneous Graph Triplet Attention Network for Drug-Target-Disease Interaction

paper_url: http://arxiv.org/abs/2312.00189
repo_url: None
paper_authors: Farhan Tanvir, Khaled Mohammed Saifuddin, Tanvir Hossain, Arunkumar Bagavathi, Esra Akbas
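for: This study aims to better understand the complex relationships among drugs, protein targets, and diseases, in order to better predict drugs' mechanisms of action (MoAs) and support personalized treatment.
methods: The study introduces a novel Heterogeneous Graph Triplet Attention Network (HeTriNet), which models the relationships among drugs, protein targets, and diseases within a heterogeneous graph.
results: Experiments on real-world datasets show that HeTriNet outperforms baseline models, indicating that it captures drug-target-disease relationships more effectively.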

Abstract
Modeling the interactions between drugs, targets, and diseases is paramount in drug discovery and has significant implications for precision medicine and personalized treatments. Current approaches frequently consider drug-target or drug-disease interactions individually, ignoring the interdependencies among all three entities. Within human metabolic systems, drugs interact with protein targets in cells, influencing target activities and subsequently impacting biological pathways to promote healthy functions and treat diseases. Moving beyond binary relationships and exploring tighter triple relationships is essential to understanding drugs' mechanisms of action (MoAs). Moreover, identifying the heterogeneity of drugs, targets, and diseases, along with their distinct characteristics, is critical to model these complex interactions appropriately. To address these challenges, we effectively model the interconnectedness of all entities in a heterogeneous graph and develop a novel Heterogeneous Graph Triplet Attention Network (HeTriNet). HeTriNet introduces a novel triplet attention mechanism within this heterogeneous graph structure. Beyond pairwise attention as the importance of an entity for the other one, we define triplet attention to model the importance of pairs for entities in the drug-target-disease triplet prediction problem. Experimental results on real-world datasets show that HeTriNet outperforms several baselines, demonstrating its remarkable proficiency in uncovering novel drug-target-disease relationships.

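Summary
Modeling the interactions between drugs, targets, and diseases is very important for drug discovery and has far-reaching implications for precision medicine and personalized treatment. Current approaches usually consider drug-target or drug-disease interactions separately, ignoring the interdependencies among all three. Within human metabolic systems, drugs bind to protein targets in cells, influence target activity, and thereby affect biological pathways to promote healthy function and treat disease. Moving from binary relationships to tighter triple relationships is key to understanding drugs' mechanisms of action. Identifying the heterogeneity and distinct characteristics of drugs, targets, and diseases is likewise key to modeling these complex relationships. To address these challenges, the authors model all entities in a heterogeneous graph and develop a novel Heterogeneous Graph Triplet Attention Network (HeTriNet). HeTriNet introduces a new triplet attention mechanism for the drug-target-disease triplet prediction problem: beyond conventional pairwise attention, triplet attention models the importance of pairs for entities. Experiments on real-world datasets show that HeTriNet outperforms several baselines and is highly capable of uncovering new drug-target-disease relationships.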

Planning Reliability Assurance Tests for Autonomous Vehicles

paper_url: http://arxiv.org/abs/2312.00186
repo_url: None
paper_authors: Simin Zheng, Lu Lu, Yili Hong, Jian Liu
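for: This paper aims to develop reliability assurance test plans for autonomous vehicles (AVs).
methods: The paper uses statistical methods to plan AV reliability assurance tests and weighs multiple criteria against each other.
results: Using AV testing data from the California Department of Motor Vehicles, the authors illustrate assurance test plans based on multiple criteria and offer recommendations for practice.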

Abstract
Artificial intelligence (AI) technology has become increasingly prevalent and transforms our everyday life. One important application of AI technology is the development of autonomous vehicles (AV). However, the reliability of an AV needs to be carefully demonstrated via an assurance test so that the product can be used with confidence in the field. To plan for an assurance test, one needs to determine how many AVs need to be tested for how many miles and the standard for passing the test. Existing research has made great efforts in developing reliability demonstration tests in the other fields of applications for product development and assessment. However, statistical methods have not been utilized in AV test planning. This paper aims to fill in this gap by developing statistical methods for planning AV reliability assurance tests based on recurrent events data. We explore the relationship between multiple criteria of interest in the context of planning AV reliability assurance tests. Specifically, we develop two test planning strategies based on homogeneous and non-homogeneous Poisson processes while balancing multiple objectives with the Pareto front approach. We also offer recommendations for practical use. The disengagement events data from the California Department of Motor Vehicles AV testing program is used to illustrate the proposed assurance test planning methods.
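
As a worked illustration of the planning question (how many miles, and what passing standard), the sketch below treats the simplest case: events follow a homogeneous Poisson process with a constant rate per mile, and the test passes if at most `max_events` events occur. It finds the smallest total mileage at which a vehicle whose true rate equals the target would pass with probability no greater than a chosen consumer's risk. This is a standard reliability-demonstration calculation offered as an assumption-laden sketch; the function names, the target rate, and the risk level are hypothetical and not taken from the paper.

```python
import math


def poisson_cdf(c, mean):
    """P(N <= c) for N ~ Poisson(mean)."""
    term = math.exp(-mean)
    total = term
    for k in range(1, c + 1):
        term *= mean / k
        total += term
    return total


def required_miles(target_rate, max_events, consumer_risk, tol=1e-6):
    """Smallest total mileage m with P(N <= max_events) <= consumer_risk
    when events arrive at exactly target_rate per mile (HPP assumption)."""
    lo, hi = 0.0, 1.0
    # Grow the upper bound until the passing probability is small enough.
    while poisson_cdf(max_events, target_rate * hi) > consumer_risk:
        hi *= 2.0
    # Bisect: the passing probability decreases monotonically in mileage.
    while hi - lo > tol * hi:
        mid = (lo + hi) / 2.0
        if poisson_cdf(max_events, target_rate * mid) > consumer_risk:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical planning inputs: 1 event per 10,000 miles, 5% consumer's risk.
zero_failure_miles = required_miles(1e-4, max_events=0, consumer_risk=0.05)
one_failure_miles = required_miles(1e-4, max_events=1, consumer_risk=0.05)
```

With zero allowed events the answer has the closed form `-ln(risk) / rate`, roughly 30,000 miles for these inputs; allowing one event requires more mileage. Splitting that mileage across vehicles, and the non-homogeneous case, are outside this sketch.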

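Summary
AI technology is increasingly prevalent in everyday life, and one important application is the development of autonomous vehicles (AVs). Before an AV can be used in the field, however, its reliability must be demonstrated through an assurance test. To plan such a test, one needs to determine how many AVs to test, for how many miles, and the standard for passing. Existing research has put considerable effort into reliability demonstration tests for product development and assessment, but statistical methods have not yet been widely applied to AV test planning. This paper attempts to fill that gap by developing statistical methods for planning AV reliability assurance tests based on recurrent events data. It studies the relationship among multiple criteria of interest and proposes two test planning strategies based on homogeneous and non-homogeneous Poisson processes, balancing multiple objectives with the Pareto front approach. It also offers recommendations for practical use. Disengagement events data from the California Department of Motor Vehicles AV testing program are used to illustrate the proposed methods.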

RNA-KG: An ontology-based knowledge graph for representing interactions involving RNA molecules

paper_url: http://arxiv.org/abs/2312.00183
repo_url: https://github.com/anacletolab/rna-kg
paper_authors: Emanuele Cavalleri, Alberto Cabri, Mauricio Soto-Gomez, Sara Bonfitto, Paolo Perlasca, Jessica Gliozzo, Tiffany J. Callahan, Justin Reese, Peter N Robinson, Elena Casiraghi, Giorgio Valentini, Marco Mesiti
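for: The goal is to build a knowledge graph (RNA-KG) that gathers biological knowledge about RNA molecules and supports the study of their functional relationships with genes, proteins, and chemicals.
methods: Each data source was identified, pre-processed, and characterized; a meta-graph was then built; and an instance-based semantically abstracted knowledge model was used to generate RNA-KG.
results: RNA-KG provides a centralized, uniform, and semantically consistent representation of the "RNA world". It can be used to study functional relationships among genes, proteins, and chemicals and to support the search for new drugs.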

Abstract
The "RNA world" represents a novel frontier for the study of fundamental biological processes and human diseases and is paving the way for the development of new drugs tailored to the patient's biomolecular characteristics. Although scientific data about coding and non-coding RNA molecules are continuously produced and available from public repositories, they are scattered across different databases and a centralized, uniform, and semantically consistent representation of the "RNA world" is still lacking. We propose RNA-KG, a knowledge graph encompassing biological knowledge about RNAs gathered from more than 50 public databases, integrating functional relationships with genes, proteins, and chemicals and ontologically grounded biomedical concepts. To develop RNA-KG, we first identified, pre-processed, and characterized each data source; next, we built a meta-graph that provides an ontological description of the KG by representing all the bio-molecular entities and medical concepts of interest in this domain, as well as the types of interactions connecting them. Finally, we leveraged an instance-based semantically abstracted knowledge model to specify the ontological alignment according to which RNA-KG was generated. RNA-KG can be downloaded in different formats and also queried by a SPARQL endpoint. A thorough topological analysis of the resulting heterogeneous graph provides further insights into the characteristics of the "RNA world". RNA-KG can be both directly explored and visualized, and/or analyzed by applying computational methods to infer bio-medical knowledge from its heterogeneous nodes and edges. The resource can be easily updated with new experimental data, and specific views of the overall KG can be extracted according to the bio-medical problem to be studied.

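Summary
The "RNA world" represents a new frontier for studying fundamental biological processes and human diseases, and for developing drugs tailored to a patient's biomolecular characteristics. Although scientific data about coding and non-coding RNA molecules are available from public repositories, they are scattered across many databases, and a centralized, uniform, and semantically consistent representation of the "RNA world" is lacking. The authors propose RNA-KG, a knowledge graph that gathers biological knowledge about RNAs from more than 50 public databases and integrates functional relationships with genes, proteins, and chemicals as well as ontologically grounded biomedical concepts. To build RNA-KG, they first identified, pre-processed, and characterized each data source. They then built a meta-graph that provides an ontological description of the biomolecular entities and medical concepts in this domain and of the types of interactions connecting them. Finally, they used an instance-based semantically abstracted knowledge model to specify the ontological alignment according to which RNA-KG was generated. RNA-KG can be downloaded in different formats and queried through a SPARQL endpoint. A thorough topological analysis of the resulting heterogeneous graph gives further insight into the characteristics of the "RNA world". RNA-KG can be explored and visualized directly, and/or analyzed with computational methods to infer biomedical knowledge from its heterogeneous nodes and edges. The resource can be easily updated with new experimental data, and specific views of the overall KG can be extracted according to the biomedical problem under study.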

Compression of end-to-end non-autoregressive image-to-speech system for low-resourced devices

paper_url: http://arxiv.org/abs/2312.00174
repo_url: None
paper_authors: Gokul Srinivasagan, Michael Deisher, Munir Georges
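for: Helping people with visual impairments access touchscreen devices such as mobile phones and laptops.
methods: Image-to-speech (ITS) systems can help, but their large model size makes them hard to deploy on low-resource devices.
results: The authors propose an efficient end-to-end neural architecture that generates audio from small segments of display content on low-resource devices. Using a vision transformer-based image encoder and knowledge distillation, they compress the model from 6.1 million to 2.46 million parameters. Human and automatic evaluations show only a minimal drop in performance and a 22% speed-up in inference time.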

Abstract
People with visual impairments have difficulty accessing touchscreen-enabled personal computing devices like mobile phones and laptops. Image-to-speech (ITS) systems can assist them in mitigating this problem, but their huge model size makes them extremely hard to deploy on low-resourced embedded devices. In this paper, we aim to overcome this challenge by developing an efficient end-to-end neural architecture for generating audio from tiny segments of display content on low-resource devices. We introduced a vision transformers-based image encoder and utilized knowledge distillation to compress the model from 6.1 million to 2.46 million parameters. Human and automatic evaluation results show that our approach leads to a very minimal drop in performance and can speed up the inference time by 22%.
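
Knowledge distillation, in its common form, trains a small student model to match the temperature-softened output distribution of a larger teacher. The sketch below shows that generic loss and the size reduction implied by the parameter counts in the abstract. The paper's actual distillation objective is not described here, so the KL-based loss, the temperature value, and the toy logits are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Toy logits (assumed) for one output frame of teacher and student.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 1.0, -0.5]
loss = distillation_loss(student_logits, teacher_logits)

# Parameter counts reported in the abstract.
teacher_params = 6.1e6
student_params = 2.46e6
size_reduction = 1.0 - student_params / teacher_params
```

The reported compression from 6.1 million to 2.46 million parameters corresponds to a size reduction of about 60%.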


Towards Accurate Differential Diagnosis with Large Language Models

paper_url: http://arxiv.org/abs/2312.00164
repo_url: None
paper_authors: Daniel McDuff, Mike Schaekermann, Tao Tu, Anil Palepu, Amy Wang, Jake Garrison, Karan Singhal, Yash Sharma, Shekoofeh Azizi, Kavita Kulkarni, Le Hou, Yong Cheng, Yun Liu, S Sara Mahdavi, Sushant Prakash, Anupam Pathak, Christopher Semturs, Shwetak Patel, Dale R Webster, Ewa Dominowska, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S Corrado, Yossi Matias, Jake Sunshine, Alan Karthikesalingam, Vivek Natarajan
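for: This study evaluates a medical diagnostic assistant based on a large language model (LLM), intended to help clinicians diagnose more accurately.
methods: The study uses an LLM optimized for diagnostic reasoning and evaluates it on 302 real-world medical cases. The 20 participating clinicians assessed each case under different assistive conditions: either search engines and standard medical resources, or LLM assistance in addition to these tools.
results: LLM assistance improved clinicians' diagnostic accuracy, particularly on challenging cases. Comparing the assistive conditions, the DDx quality score was higher for clinicians assisted by the LLM than for clinicians without its assistance (51.7% vs 36.1%, McNemar's test: 45.7, p < 0.01). Clinicians assisted by the LLM also produced more comprehensive differential lists. These results suggest that the LLM for DDx has the potential to improve clinicians' diagnostic reasoning and accuracy in practice.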

Abstract
An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its ability to generate a DDx alone or as an aid to clinicians. 20 clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, [p = 0.04]). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%) (McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.
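
The headline metric in the abstract is top-10 accuracy: the fraction of cases whose final diagnosis appears among the first ten entries of a differential list. A minimal sketch of that computation follows. Exact string matching is a simplifying assumption made here for illustration (the study's actual judgment of whether a list contains the correct diagnosis is not described in this section), and the toy cases are invented.

```python
def top_k_accuracy(ranked_lists, gold_diagnoses, k=10):
    """Fraction of cases whose gold diagnosis is in the first k list entries."""
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_lists, gold_diagnoses))
    return hits / len(gold_diagnoses)


# Invented toy cases: one ranked differential list per case.
differentials = [
    ["pneumonia", "sarcoidosis", "tuberculosis"],
    ["lupus", "lyme disease"],
    ["gout", "sepsis"],
]
final_diagnoses = ["tuberculosis", "vasculitis", "gout"]

top10 = top_k_accuracy(differentials, final_diagnoses, k=10)
top1 = top_k_accuracy(differentials, final_diagnoses, k=1)
```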

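Summary
An accurate differential diagnosis (DDx) is a cornerstone of medical care, usually reached through an iterative process of interpretation that combines clinical history, physical examination, investigations, and procedures. Interactive interfaces powered by large language models (LLMs) offer new opportunities to assist and automate parts of this process. The study introduces an LLM optimized for diagnostic reasoning and evaluates its ability to produce a DDx for challenging real-world cases. Twenty clinicians evaluated 302 real-world medical cases drawn from New England Journal of Medicine (NEJM) case reports. Each case was read by two clinicians, who were randomized to one of two assistive conditions: search engines and standard medical resources, or the LLM in addition to these tools. All clinicians provided an unassisted baseline DDx first. The LLM's standalone performance exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, p = 0.04). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by the LLM (top-10 accuracy 51.7%) than for clinicians without its assistance (36.1%) (McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Clinicians assisted by the LLM also arrived at more comprehensive differential lists. The study suggests that the LLM for DDx has the potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, and that it merits further real-world evaluation of its ability to support physicians and widen patients' access to specialist-level expertise.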

paper_url: http://arxiv.org/abs/2312.00151
repo_url: None
paper_authors: Meera Hahn, Amit Raj, James M. Rehg
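for: This study examines Vision-and-Language Navigation (VLN), in which embodied agents must follow natural language instructions to reach a goal location or object (e.g. 'walk down the hallway and turn left at the piano').
methods: The authors use a series of simple masking experiments to inspect how much the navigation model relies on different parts of the instruction.
results: Some top-performing models rely only on the noun tokens of the instruction, which is a concerning limitation. The authors propose two training methods to alleviate it.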

Abstract
The challenging task of Vision-and-Language Navigation (VLN) requires embodied agents to follow natural language instructions to reach a goal location or object (e.g., "walk down the hallway and turn left at the piano"). For agents to complete this task successfully, they must be able to ground objects referenced in the instruction (e.g., "piano") in the visual scene as well as ground directional phrases (e.g., "turn left") into actions. In this work we ask the following question: to what degree are spatial and directional language cues informing the navigation model's decisions? We propose a series of simple masking experiments to inspect the model's reliance on different parts of the instruction. Surprisingly, we uncover that certain top-performing models rely only on the noun tokens of the instructions. We propose two training methods to alleviate this concerning limitation.

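Summary
The Vision-and-Language Navigation (VLN) task requires embodied agents to follow natural language instructions to reach a specific location or object (e.g., "walk down the hallway and turn left at the piano"). To complete this task successfully, the agents must be able to connect the objects mentioned in the instruction (e.g., "piano") to the visual scene and translate directional phrases (e.g., "turn left") into actions. In this work, we ask to what degree spatial and directional language cues inform the navigation model's decisions. We propose a series of simple masking experiments to inspect the model's reliance on different parts of the instruction. Surprisingly, certain top-performing models rely only on the noun tokens of the instructions. We propose two training methods to alleviate this concerning limitation.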
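The masking experiments described for this paper can be pictured with a small sketch: keep one category of tokens in the instruction, replace everything else with a mask token, and compare the agent's behaviour across variants. The paper's actual tokenization and part-of-speech tagging are not specified here, so the whitespace split, the hand-written word sets, and the `[MASK]` token are all assumptions made for illustration.

```python
def mask_instruction(instruction, keep_words, mask_token="[MASK]"):
    """Replace every token that is not in keep_words with mask_token."""
    keep = {word.lower() for word in keep_words}
    masked = [
        token if token.lower() in keep else mask_token
        for token in instruction.split()  # assumed: whitespace tokenization
    ]
    return " ".join(masked)


instruction = "walk down the hallway and turn left at the piano"

# Hypothetical, hand-labelled token categories for this one instruction.
nouns = {"hallway", "piano"}
directions = {"down", "turn", "left"}

print(mask_instruction(instruction, nouns))       # nouns-only variant
print(mask_instruction(instruction, directions))  # directions-only variant
```

If a model navigates equally well on the nouns-only variant as on the full instruction, that would be consistent with the reported finding that it ignores directional cues.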

The Stochastic Dynamic Post-Disaster Inventory Allocation Problem with Trucks and UAVs

paper_url: http://arxiv.org/abs/2312.00140
repo_url: None
paper_authors: Robert van Steenbergen, Wouter van Heeswijk, Martijn Mes
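for: This paper studies humanitarian logistics with scarce resources in disaster areas, with particular attention to the social impact of relief deliveries over time.
methods: It proposes two anticipatory solution methods based on approximate dynamic programming, decomposed linear value function approximation (DL-VFA) and neural network value function approximation (NN-VFA), to handle the uncertainty in disaster areas effectively.
results: Experiments show that accounting for deprivation costs improves the allocation of scarce supplies, and that using UAVs reduces transportation and deprivation costs while maintaining similar demand coverage.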

Abstract
Humanitarian logistics operations face increasing difficulties due to rising demands for aid in disaster areas. This paper investigates the dynamic allocation of scarce relief supplies across multiple affected districts over time. It introduces a novel stochastic dynamic post-disaster inventory allocation problem with trucks and unmanned aerial vehicles delivering relief goods under uncertain supply and demand. The relevance of this humanitarian logistics problem lies in the importance of considering the inter-temporal social impact of deliveries. We achieve this by incorporating deprivation costs when allocating scarce supplies. Furthermore, we consider the inherent uncertainties of disaster areas and the potential use of cargo UAVs to enhance operational efficiency. This study proposes two anticipatory solution methods based on approximate dynamic programming, specifically decomposed linear value function approximation and neural network value function approximation to effectively manage uncertainties in the dynamic allocation process. We compare DL-VFA and NN-VFA with various state-of-the-art methods (exact re-optimization, PPO) and results show a 6-8% improvement compared to the best benchmarks. NN-VFA provides the best performance and captures nonlinearities in the problem, whereas DL-VFA shows excellent scalability against a minor performance loss. The experiments reveal that consideration of deprivation costs results in improved allocation of scarce supplies both across affected districts and over time. Finally, results show that deploying UAVs can play a crucial role in the allocation of relief goods, especially in the first stages after a disaster. The use of UAVs reduces transportation- and deprivation costs together by 16-20% and reduces maximum deprivation times by 19-40%, while maintaining similar levels of demand coverage, showcasing efficient and effective operations.

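Summary
Humanitarian logistics operations face growing difficulties as demand for aid in disaster areas rises. This paper studies the allocation of scarce relief supplies under uncertainty. It introduces a new stochastic dynamic post-disaster inventory allocation problem in which trucks and unmanned aerial vehicles (UAVs) deliver relief goods under uncertain supply and demand. We also account for the inherent uncertainties of disaster areas and the use of UAVs to improve operational efficiency. The relevance of this humanitarian logistics problem lies in considering the social impact of deliveries at different points in time, which we do by including deprivation costs when allocating scarce supplies. We propose two anticipatory solution methods, decomposed linear value function approximation (DL-VFA) and neural network value function approximation (NN-VFA), to manage uncertainty in the allocation process. Compared with existing methods (exact re-optimization, PPO), they yield a 6-8% improvement over the best benchmarks. NN-VFA performs best and captures nonlinearities in the problem, while DL-VFA scales very well. Experiments show that considering deprivation costs improves the allocation of relief supplies, and that UAVs reduce transportation and deprivation costs while maintaining similar demand coverage. Finally, the results show that UAVs can play an important role in relief distribution, especially in the early stages after a disaster.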

Dataset Distillation in Large Data Era

paper_url: http://arxiv.org/abs/2311.18838
repo_url: https://github.com/VILA-Lab/SRe2L
paper_authors: Zeyuan Yin, Zhiqiang Shen
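for: This paper addresses how dataset distillation can enable efficient model training while preserving performance on the original data distribution.
methods: It uses a simple yet effective Curriculum Data Augmentation (CDA) technique during data synthesis to distill the large-scale ImageNet-1K and ImageNet-21K datasets, and trains models on the distilled data.
results: The method reaches state-of-the-art Top-1 accuracy on ImageNet-1K and ImageNet-21K (63.2% under IPC 50 and 36.1% under IPC 20, respectively) and reduces the gap to full-data training to less than 15% absolute. It is also the first successful application of dataset distillation to the larger ImageNet-21K at the standard 224x224 resolution.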

Abstract
Dataset distillation aims to generate a smaller but representative subset from a large dataset, which allows a model to be trained efficiently, meanwhile evaluating on the original testing data distribution to achieve decent performance. Many prior works have aimed to align with diverse aspects of the original datasets, such as matching the training weight trajectories, gradient, feature/BatchNorm distributions, etc. In this work, we show how to distill various large-scale datasets such as full ImageNet-1K/21K under a conventional input resolution of 224$\times$224 to achieve the best accuracy over all previous approaches, including SRe$^2$L, TESLA and MTT. To achieve this, we introduce a simple yet effective ${\bf C}$urriculum ${\bf D}$ata ${\bf A}$ugmentation ($\texttt{CDA}$) during data synthesis that obtains the accuracy on large-scale ImageNet-1K and 21K with 63.2% under IPC (Images Per Class) 50 and 36.1% under IPC 20, respectively. Finally, we show that, by integrating all our enhancements together, the proposed model beats the current state-of-the-art by more than 4% Top-1 accuracy on ImageNet-1K/21K and for the first time, reduces the gap to its full-data training counterpart to less than absolute 15%. Moreover, this work represents the inaugural success in dataset distillation on larger-scale ImageNet-21K under the standard 224$\times$224 resolution. Our code and distilled ImageNet-21K dataset of 20 IPC, 2K recovery budget are available at https://github.com/VILA-Lab/SRe2L/tree/main/CDA.

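Summary
Dataset distillation aims to generate a smaller but representative subset of a large dataset so that a model can be trained quickly while still performing well when evaluated on the original test distribution. Many prior works do this by aligning with various aspects of the original data, such as matching training weight trajectories, gradients, or feature/BatchNorm distributions. In this work, we show how to distill the large-scale ImageNet-1K/21K datasets at the conventional 224x224 input resolution and achieve the best accuracy among all previous approaches, including SRe$^2$L, TESLA, and MTT. To this end, we introduce a simple yet effective Curriculum Data Augmentation (CDA) during data synthesis, which reaches 63.2% accuracy on ImageNet-1K under IPC (Images Per Class) 50 and 36.1% on ImageNet-21K under IPC 20. Finally, we show that with all our enhancements combined, the proposed model beats the current state of the art by more than 4% Top-1 accuracy on ImageNet-1K/21K and, for the first time, reduces the gap to full-data training to less than 15% absolute. This work is also the first success in dataset distillation on the larger ImageNet-21K at the standard 224x224 resolution. Our code and distilled ImageNet-21K dataset are available at https://github.com/VILA-Lab/SRe2L/tree/main/CDA.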

paper_url: http://arxiv.org/abs/2311.18837
repo_url: None
paper_authors: Zhen Xing, Qi Dai, Zihao Zhang, Hui Zhang, Han Hu, Zuxuan Wu, Yu-Gang Jiang
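for: This work proposes a foundation model that applies to a wide range of video tasks, covering both understanding tasks (such as language-guided video object segmentation) and generative tasks (video editing and enhancement).
methods: The proposed model, Video Instruction Diffusion (VIDiff), edits and enhances videos within seconds based on user instructions. An iterative auto-regressive method ensures consistency when editing and enhancing long videos.
results: The model produces convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively. Video examples are available at https://ChenHsing.github.io/VIDiff.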

Abstract
Diffusion models have achieved significant success in image and video generation. This motivates a growing interest in video editing tasks, where videos are edited according to provided text descriptions. However, most existing approaches only focus on video editing for short clips and rely on time-consuming tuning or inference. We are the first to propose Video Instruction Diffusion (VIDiff), a unified foundation model designed for a wide range of video tasks. These tasks encompass both understanding tasks (such as language-guided video object segmentation) and generative tasks (video editing and enhancement). Our model can edit and translate the desired results within seconds based on user instructions. Moreover, we design an iterative auto-regressive method to ensure consistency in editing and enhancing long videos. We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively. More examples can be found at our website https://ChenHsing.github.io/VIDiff.

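Summary
Diffusion models have achieved significant success in image and video generation, which has driven growing interest in video editing tasks where videos are edited according to provided text descriptions. However, most existing methods focus only on editing short clips and require time-consuming tuning or inference. We are the first to propose Video Instruction Diffusion (VIDiff), a unified foundation model for a wide range of video tasks, including language-guided video object segmentation, video editing, and enhancement. Our model can edit and translate videos according to user instructions within seconds. We also design an iterative auto-regressive method to keep the editing and enhancement of long videos consistent. We report generative results on diverse input videos and written instructions, both qualitatively and quantitatively. More examples can be found at our website https://ChenHsing.github.io/VIDiff.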

Motion-Conditioned Image Animation for Video Editing

paper_url: http://arxiv.org/abs/2311.18827
repo_url: None
paper_authors: Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi
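for: This paper addresses the video editing problem.
methods: It uses a simple decomposition of the video editing problem into two steps: image editing followed by motion-conditioned image animation.
results: It introduces a new video editing benchmark and a comprehensive human evaluation of recent video editing methods alongside MoCA. MoCA shows a higher human preference win-rate, outperforming Dreamix (63%), MasaCtrl (75%), and Tune-A-Video (72%), with especially significant improvements on motion edits.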

Abstract
We introduce MoCA, a Motion-Conditioned Image Animation approach for video editing. It leverages a simple decomposition of the video editing problem into image editing followed by motion-conditioned image animation. Furthermore, given the lack of robust evaluation datasets for video editing, we introduce a new benchmark that measures edit capability across a wide variety of tasks, such as object replacement, background changes, style changes, and motion edits. We present a comprehensive human evaluation of the latest video editing methods along with MoCA, on our proposed benchmark. MoCA establishes a new state-of-the-art, demonstrating greater human preference win-rate, and outperforming notable recent approaches including Dreamix (63%), MasaCtrl (75%), and Tune-A-Video (72%), with especially significant improvements for motion edits.

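Summary
We introduce MoCA, a Motion-Conditioned Image Animation approach for video editing. It uses a simple decomposition of the video editing problem: image editing first, followed by motion-conditioned image animation. Because robust evaluation datasets for video editing are lacking, we also introduce a new benchmark that measures edit capability across different kinds of tasks, including object replacement, background changes, style changes, and motion edits. We present a comprehensive human evaluation of the latest video editing methods together with MoCA on the proposed benchmark. MoCA establishes a new state of the art, achieving a higher human preference win-rate than recent video editing methods, with especially clear gains on motion edits.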

Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking

paper_url: http://arxiv.org/abs/2311.18817
repo_url: https://github.com/vfleaking/grokking-dichotomy
paper_authors: Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon S. Du, Jason D. Lee, Wei Hu
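for: This paper studies the "grokking" phenomenon on arithmetic tasks, where a neural network first "memorizes" the training set, reaching perfect training accuracy but near-random test accuracy, and then suddenly transitions to perfect test accuracy after sufficiently longer training.
methods: It trains homogeneous neural nets with large initialization and small weight decay on both classification and regression tasks, and studies how grokking is induced through theoretical analysis and experiments.
results: Training first gets trapped for a long time at a solution corresponding to a kernel predictor, then transitions sharply to min-norm/max-margin predictors, which produces a dramatic change in test accuracy.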

Abstract
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy. This paper studies the grokking phenomenon in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases. Specifically, when training homogeneous neural nets with large initialization and small weight decay on both classification and regression tasks, we prove that the training process gets trapped at a solution corresponding to a kernel predictor for a long time, and then a very sharp transition to min-norm/max-margin predictors occurs, leading to a dramatic change in test accuracy.

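Summary
Recent work by Power et al. (2022) found a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, giving perfect training accuracy but near-random test accuracy, and after sufficiently longer training it suddenly transitions to perfect test accuracy. This paper studies grokking in theoretical setups and proves that it can be induced by a dichotomy of early and late phase implicit biases. Specifically, when training homogeneous neural nets with large initialization and small weight decay, we prove that training stays trapped for a long time at a solution corresponding to a kernel predictor, and then a very sharp transition to min-norm/max-margin predictors occurs, leading to a dramatic change in test accuracy.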

Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text

paper_url: http://arxiv.org/abs/2311.18805
repo_url: None
paper_authors: Qi Cao, Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa
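for: This study investigates the resilience of large language models (LLMs), particularly GPT-4, to character-level permutations of their input.
methods: The authors propose the Scrambled Bench, a suite that measures how well LLMs handle scrambled input, covering both recovering scrambled sentences and answering questions given scrambled context.
results: The most powerful LLMs show a capability akin to typoglycemia: they can understand scrambled words as long as the first and last letters stay in place. GPT-4 stands out, reconstructing original sentences from scrambled ones almost perfectly and decreasing the edit distance by 95%.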

Abstract
While Large Language Models (LLMs) have achieved remarkable performance in many tasks, much about their inner workings remains unclear. In this study, we present novel experimental insights into the resilience of LLMs, particularly GPT-4, when subjected to extensive character-level permutations. To investigate this, we first propose the Scrambled Bench, a suite designed to measure the capacity of LLMs to handle scrambled input, in terms of both recovering scrambled sentences and answering questions given scrambled context. The experimental results indicate that most powerful LLMs demonstrate the capability akin to typoglycemia, a phenomenon where humans can understand the meaning of words even when the letters within those words are scrambled, as long as the first and last letters remain in place. More surprisingly, we found that only GPT-4 nearly flawlessly processes inputs with unnatural errors, even under the extreme condition, a task that poses significant challenges for other LLMs and often even for humans. Specifically, GPT-4 can almost perfectly reconstruct the original sentences from scrambled ones, decreasing the edit distance by 95%, even when all letters within each word are entirely scrambled. It is counter-intuitive that LLMs can exhibit such resilience despite severe disruption to input tokenization caused by scrambled text.

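Summary
Large Language Models (LLMs) perform remarkably well on many tasks, but much about their inner workings remains unclear. In this study, we present new experimental insights into the resilience of LLMs to character-level permutations. We propose the Scrambled Bench, a suite that measures the capacity of LLMs to handle scrambled input, covering both recovering scrambled sentences and answering questions given scrambled context. The results indicate that the most powerful LLMs show a capability akin to typoglycemia, the phenomenon where humans can understand words whose letters are scrambled as long as the first and last letters stay in place. More surprisingly, we found that only GPT-4 processes inputs with unnatural errors almost flawlessly even under the extreme condition, a task that is hard for other LLMs and often even for humans. Specifically, GPT-4 decreases the edit distance by 95% even when all letters within each word are entirely scrambled. It is counter-intuitive that LLMs show such resilience despite the severe disruption that scrambled text causes to input tokenization.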
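The two ingredients mentioned for this paper, scrambling the letters inside words and measuring recovery by edit distance, can be sketched as follows. The sketch covers the typoglycemia-style setting only (first and last letters fixed), while the paper also studies the extreme case where every letter is shuffled. The `recovery_rate` formula is an assumed reading of "decreasing the edit distance by 95%", not the benchmark's published definition, and all function names are hypothetical.

```python
import random


def scramble_word(word, rng):
    """Shuffle the inner letters of a word, keeping the first and last in place."""
    if len(word) <= 3:
        return word  # nothing to shuffle
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + "".join(inner) + word[-1]


def scramble_sentence(sentence, seed=0):
    """Scramble each whitespace-separated word with a seeded generator."""
    rng = random.Random(seed)
    return " ".join(scramble_word(word, rng) for word in sentence.split())


def edit_distance(a, b):
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def recovery_rate(original, scrambled, recovered):
    """Assumed metric: relative reduction in edit distance to the original."""
    before = edit_distance(original, scrambled)
    after = edit_distance(original, recovered)
    if before == 0:
        return 0.0  # input was not actually scrambled
    return (before - after) / before


sentence = "language models handle scrambled text"
print(scramble_sentence(sentence, seed=1))
```

Under this reading, a model that returns the original sentence exactly scores 1.0, and a model that echoes the scrambled input scores 0.0.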

Distributed Global Structure-from-Motion with a Deep Front-End

paper_url: http://arxiv.org/abs/2311.18801
repo_url: https://github.com/borglab/gtsfm
paper_authors: Ayush Baid, John Lambert, Travis Driver, Akshay Krishnan, Hayk Stepanyan, Frank Dellaert
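for: This paper examines whether global Structure-from-Motion (SfM) can match state-of-the-art incremental SfM by leveraging recent developments in different stages of the SfM pipeline.
methods: The authors design a modular SfM framework that makes it easy to combine developments in different pipeline stages. They compare deep-learning-based feature extraction and matching against SIFT features, and global SfM against incremental SfM.
results: Deep-learning-based two-view correspondence estimation improves point density for scenes reconstructed with global SfM, but none of these front-ends outperforms SIFT when compared with incremental SfM results on a range of datasets. This suggests that SIFT remains a very effective feature extraction and matching method, particularly in global SfM.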

Abstract
While initial approaches to Structure-from-Motion (SfM) revolved around both global and incremental methods, most recent applications rely on incremental systems to estimate camera poses due to their superior robustness. Though there has been tremendous progress in SfM `front-ends' powered by deep models learned from data, the state-of-the-art (incremental) SfM pipelines still rely on classical SIFT features, developed in 2004. In this work, we investigate whether leveraging the developments in feature extraction and matching helps global SfM perform on par with the SOTA incremental SfM approach (COLMAP). To do so, we design a modular SfM framework that allows us to easily combine developments in different stages of the SfM pipeline. Our experiments show that while developments in deep-learning based two-view correspondence estimation do translate to improvements in point density for scenes reconstructed with global SfM, none of them outperform SIFT when comparing with incremental SfM results on a range of datasets. Our SfM system is designed from the ground up to leverage distributed computation, enabling us to parallelize computation on multiple machines and scale to large scenes.

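Summary
Early approaches to Structure-from-Motion (SfM) revolved around both global and incremental methods, but most recent applications rely on incremental systems to estimate camera poses because they are more robust. Although SfM front-ends powered by deep models learned from data have made tremendous progress, state-of-the-art incremental SfM pipelines still rely on classical SIFT features, developed in 2004. In this work, we investigate whether developments in feature extraction and matching help global SfM perform on par with the state-of-the-art incremental SfM approach (COLMAP). To do so, we design a modular SfM framework that makes it easy to combine developments in different stages of the SfM pipeline. Our experiments show that developments in deep-learning-based two-view correspondence estimation improve point density for scenes reconstructed with global SfM, but none of them outperforms SIFT when compared with incremental SfM results. Our SfM system is designed from the ground up to use distributed computation, which lets us parallelize computation across multiple machines and scale to large scenes.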

Automated interpretation of congenital heart disease from multi-view echocardiograms

paper_url: http://arxiv.org/abs/2311.18788
repo_url: None
paper_authors: Jing Wang, Xiaofeng Liu, Fangyun Wang, Lin Zheng, Fengqiao Gao, Hanwen Zhang, Xin Zhang, Wanqing Xie, Binbin Wang
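for: This study aims to automatically analyze multi-view echocardiograms to help diagnose congenital heart disease.
methods: It uses multi-channel networks based on depthwise separable convolution to greatly reduce the number of network parameters, and addresses class imbalance by augmenting the positive training samples.
results: The 2D key-frame model reaches 95.4% accuracy, and the video-based model reaches 92.1% accuracy on 3-class classification. The video-based model can also diagnose without key-frame selection or view records.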

Abstract
Congenital heart disease (CHD) is the most common birth defect and the leading cause of neonate death in China. Clinical diagnosis can be based on the selected 2D key-frames from five views. Limited by the availability of multi-view data, most methods have to rely on the insufficient single view analysis. This study proposes to automatically analyze the multi-view echocardiograms with a practical end-to-end framework. We collect the five-view echocardiograms video records of 1308 subjects (including normal controls, ventricular septal defect (VSD) patients and atrial septal defect (ASD) patients) with both disease labels and standard-view key-frame labels. Depthwise separable convolution-based multi-channel networks are adopted to largely reduce the network parameters. We also approach the imbalanced class problem by augmenting the positive training samples. Our 2D key-frame model can diagnose CHD or negative samples with an accuracy of 95.4%, and in negative, VSD or ASD classification with an accuracy of 92.3%. To further alleviate the work of key-frame selection in real-world implementation, we propose an adaptive soft attention scheme to directly explore the raw video data. Four kinds of neural aggregation methods are systematically investigated to fuse the information of an arbitrary number of frames in a video. Moreover, with a view detection module, the system can work without the view records. Our video-based model can diagnose with an accuracy of 93.9% (binary classification), and 92.1% (3-class classification) in a collected 2D video testing set, which does not need key-frame selection and view annotation in testing. The detailed ablation study and the interpretability analysis are provided.

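Summary
Congenital heart disease (CHD) is the most common birth defect and the leading cause of neonatal death in China. Clinical diagnosis can be based on selected 2D key-frames from five views. Because multi-view data is scarce, most methods have to rely on single-view analysis. This study proposes to analyze multi-view echocardiograms automatically with a practical end-to-end framework. We collected five-view echocardiogram video records of 1308 subjects (including normal controls, ventricular septal defect (VSD) patients, and atrial septal defect (ASD) patients) with both disease labels and standard-view key-frame labels. Depthwise separable convolution-based multi-channel networks are adopted to greatly reduce the number of network parameters. We also address the class imbalance problem by augmenting the positive training samples. Our 2D key-frame model diagnoses CHD or negative samples with an accuracy of 95.4%, and classifies negative, VSD, or ASD with an accuracy of 92.3%. To ease key-frame selection in real-world use, we propose an adaptive soft attention scheme that works directly on the raw video data. We systematically investigated four kinds of neural aggregation methods to fuse the information of an arbitrary number of frames in a video. In addition, with a view detection module, the system can work without the view records. Our video-based model can diagnose with an accuracy of 93.9% (binary classification) and 92.1% (3-class classification) on a collected 2D video testing set, which does not require key-frame selection and view annotation in testing. A detailed ablation study and an interpretability analysis are provided.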
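The claim that depthwise separable convolutions greatly reduce network parameters can be checked with simple arithmetic. The sketch below counts weights for one layer, ignoring biases. The kernel size and channel counts are illustrative assumptions and are not taken from the paper's architecture.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard 2D convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Assumed layer shape for illustration: 3 x 3 kernel, 64 -> 128 channels.
standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)

print(standard, separable, round(standard / separable, 2))
```

For this assumed layer the separable version needs 8,768 weights instead of 73,728, roughly an 8.4x reduction, and the saving grows with the number of output channels.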

paper_url: http://arxiv.org/abs/2312.03747
repo_url: None
paper_authors: Giorgos Lysandrou, Roma English Owen, Vanja Popovic, Grant Le Brun, Beatrice Alex, Elizabeth A. L. Fairley
for: This paper aims to improve the collection and understanding of patient experiences in the real world to improve care standards and personalize drug treatment.
methods: The paper uses linguistic analysis to identify similarities between patient experience datasets across different therapeutic domains and data sources, and trains classifiers (CNN and transformer) to accurately identify patient experience posts from social media.
results: The paper finds that the transformer classifier performs the best in classifying patient experience posts, achieving F1-scores ranging between 0.863 and 0.995 across all therapeutic domains and data sources.

Abstract
It is essential that healthcare professionals and members of the healthcare community can access and easily understand patient experiences in the real world, so that care standards can be improved and driven towards personalised drug treatment. Social media platforms and message boards are deemed suitable sources of patient experience information, as patients have been observed to discuss and exchange knowledge, look for and provide support online. This paper tests the hypothesis that not all online patient experience information can be treated and collected in the same way, as a result of the inherent differences in the way individuals talk about their journeys, in different therapeutic domains and/or data sources. We used linguistic analysis to understand and identify similarities between datasets, across patient language, between data sources (Reddit, SocialGist) and therapeutic domains (cardiovascular, oncology, immunology, neurology). We detected common vocabulary used by patients in the same therapeutic domain across data sources, except for immunology patients, who use unique vocabulary between the two data sources, and compared to all other datasets. We combined linguistically similar datasets to train classifiers (CNN, transformer) to accurately identify patient experience posts from social media, a task we refer to as patient voice classification. The cardiovascular and neurology transformer classifiers perform the best in their respective comparisons for the Reddit data source, achieving F1-scores of 0.865 and 1.0 respectively. The overall best performing classifier is the transformer classifier trained on all data collected for this experiment, achieving F1-scores ranging between 0.863 and 0.995 across all therapeutic domain and data source specific test datasets.

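Summary
Healthcare professionals and members of the healthcare community need to access and easily understand real-world patient experiences in order to improve care standards and move towards personalised drug treatment. Social media platforms and message boards are considered suitable sources of patient experience information, because patients discuss and support each other online. This study tests the hypothesis that not all online patient experience information can be treated and collected in the same way, because patients talk about their journeys differently across therapeutic domains and data sources. We used linguistic analysis to understand and identify similarities between datasets, and detected common vocabulary used by patients in the same therapeutic domain across data sources. We combined linguistically similar datasets to train classifiers (CNN and transformer) to accurately identify patient experience posts on social media. The cardiovascular and neurology transformer classifiers perform best on the Reddit data source, with F1-scores of 0.865 and 1.0 respectively. The best classifier overall is the transformer trained on all collected data, with F1-scores between 0.863 and 0.995 across all therapeutic domain and data source specific test datasets.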

paper_url: http://arxiv.org/abs/2312.00825
repo_url: None
paper_authors: Phillip Howard, Avinash Madasu, Tiep Le, Gustavo Lujan Moreno, Anahita Bhiwandiwalla, Vasudev Lal
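for: This study examines the social biases present in current vision-language models (VLMs) and how these biases affect the models.
methods: It uses text-to-image diffusion models to generate counterfactual image-text pairs at scale for probing intersectional social biases. The approach uses Stable Diffusion with cross-attention control to produce image-text pairs that are highly similar in their depiction of a subject and differ only in the depicted social attributes.
results: Experiments show that the generated SocialCounterfactuals dataset is useful for probing and mitigating intersectional social biases in current VLMs.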

Abstract
While vision-language models (VLMs) have achieved remarkable performance improvements recently, there is growing evidence that these models also possess harmful biases with respect to social attributes such as gender and race. Prior studies have primarily focused on probing such bias attributes individually while ignoring biases associated with intersections between social attributes. This could be due to the difficulty of collecting an exhaustive set of image-text pairs for various combinations of social attributes. To address this challenge, we employ text-to-image diffusion models to produce counterfactual examples for probing intersectional social biases at scale. Our approach utilizes Stable Diffusion with cross attention control to produce sets of counterfactual image-text pairs that are highly similar in their depiction of a subject (e.g., a given occupation) while differing only in their depiction of intersectional social attributes (e.g., race & gender). Through our over-generate-then-filter methodology, we produce SocialCounterfactuals, a high-quality dataset containing over 171k image-text pairs for probing intersectional biases related to gender, race, and physical characteristics. We conduct extensive experiments to demonstrate the usefulness of our generated dataset for probing and mitigating intersectional social biases in state-of-the-art VLMs.

Summary
Recently, vision-language models (VLMs) have achieved significant performance improvements, but there is also growing evidence that these models possess harmful biases with respect to social attributes such as gender and race. Prior studies have primarily focused on probing such bias attributes individually, ignoring biases associated with intersections between social attributes. This may be due to the difficulty of collecting an exhaustive set of image-text pairs for various combinations of social attributes. To address this challenge, we employ text-to-image diffusion models to produce counterfactual examples for probing intersectional social biases at scale. Our approach utilizes Stable Diffusion with cross attention control to produce sets of counterfactual image-text pairs that are highly similar in their depiction of a subject (e.g., a given occupation) while differing only in their depiction of intersectional social attributes (e.g., race & gender). Through our over-generate-then-filter methodology, we produce SocialCounterfactuals, a high-quality dataset containing over 171k image-text pairs for probing intersectional biases related to gender, race, and physical characteristics. We conduct extensive experiments to demonstrate the usefulness of our generated dataset for probing and mitigating intersectional social biases in state-of-the-art VLMs.

CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

paper_url: http://arxiv.org/abs/2311.18775
repo_url: None
paper_authors: Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal
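for: This paper develops CoDi-2, a versatile multimodal large language model that can follow complex interleaved language-vision-audio instructions, learn from in-context examples, and generate multimodal outputs.
methods: It aligns modalities with language for both encoding and generation, so that the language model can understand complex modality-interleaved instructions and in-context examples.
results: CoDi-2 performs strongly on a range of zero-shot tasks, such as subject-driven image generation, vision transformation, and audio editing, and can carry out complex multimodal tasks through multi-round interactive conversation.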

Abstract
We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to not only understand complex modality-interleaved instructions and in-context examples, but also autoregressively generate grounded and coherent multimodal outputs in the continuous feature space. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot capabilities for multimodal generation, such as in-context learning, reasoning, and compositionality of any-to-any modality generation through multi-round interactive conversation. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing. CoDi-2 signifies a substantial breakthrough in developing a comprehensive multimodal foundation model adept at interpreting in-context language-vision-audio interleaved instructions and producing multimodal outputs.

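Summary
We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), chat, edit, and more, in an any-to-any input-output modality paradigm. By aligning modalities with language, CoDi-2 enables Large Language Models (LLMs) not only to understand complex modality-interleaved instructions and in-context examples, but also to autoregressively generate grounded and coherent multimodal outputs in the continuous feature space. To train CoDi-2, we build a large-scale generation dataset covering in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates zero-shot capabilities for multimodal generation, such as in-context learning, reasoning, and compositional any-to-any generation. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing. CoDi-2 is a substantial step towards a comprehensive multimodal foundation model that interprets in-context language-vision-audio interleaved instructions and produces multimodal outputs.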

Evaluating the Impact of Flaky Simulators on Testing Autonomous Driving Systems

paper_url: http://arxiv.org/abs/2311.18768
repo_url: https://github.com/anonoymous9423013/anonymous_paper
paper_authors: Mohammad Hossein Amini, Shervin Naseri, Shiva Nejati
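for: This paper studies test flakiness in simulation-based testing of Autonomous Driving Systems (ADS), and whether machine learning (ML) can identify flaky tests.
methods: It runs an empirical study on two widely used open-source ADS simulators and five different ADS test setups to determine how flakiness affects automated testing and whether ML can effectively identify flaky tests.
results: Test flakiness is common in ADS testing and can significantly affect test results. ML classifiers identify flaky tests from a single test run, whereas the non-ML baseline needs at least two runs. The classifiers outperform that baseline by 31%, 21%, and 13% in F1-score.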

Abstract
Simulators are widely used to test Autonomous Driving Systems (ADS), but their potential flakiness can lead to inconsistent test results. We investigate test flakiness in simulation-based testing of ADS by addressing two key questions: (1) How do flaky ADS simulations impact automated testing that relies on randomized algorithms? and (2) Can machine learning (ML) effectively identify flaky ADS tests while decreasing the required number of test reruns? Our empirical results, obtained from two widely-used open-source ADS simulators and five diverse ADS test setups, show that test flakiness in ADS is a common occurrence and can significantly impact the test results obtained by randomized algorithms. Further, our ML classifiers effectively identify flaky ADS tests using only a single test run, achieving F1-scores of 85%, 82% and 96% for three different ADS test setups. Our classifiers significantly outperform our non-ML baseline, which requires executing tests at least twice, by 31%, 21%, and 13% in F1-score performance, respectively. We conclude with a discussion on the scope, implications and limitations of our study. We provide our complete replication package in a GitHub repository.

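Summary
Simulators are widely used to test Autonomous Driving Systems (ADS), but their potential flakiness can lead to inconsistent test results. We investigate test flakiness in simulation-based testing of ADS by addressing two key questions: (1) How do flaky ADS simulations affect automated testing that relies on randomized algorithms? and (2) Can machine learning (ML) identify flaky ADS tests while decreasing the number of test reruns? Our empirical results, obtained from two widely used open-source ADS simulators and five diverse ADS test setups, show that test flakiness is common in ADS and can significantly affect test results. Our ML classifiers effectively identify flaky ADS tests using only a single test run, achieving F1-scores of 85%, 82%, and 96% for three different ADS test setups. The classifiers outperform the non-ML baseline by 31%, 21%, and 13% in F1-score, respectively. We conclude with a discussion of the scope, implications, and limitations of the study. We provide the complete replication package in a GitHub repository.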
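The contrast between a rerun-based baseline and a single-run classifier can be made concrete with a short sketch. It assumes a test is labelled flaky when repeated executions of the same scenario disagree on pass/fail, which is one plausible reading of a baseline that "requires executing tests at least twice". The paper's exact flakiness criterion is not given in this digest, and the toy outcomes and predictions below are invented.

```python
def is_flaky(outcomes):
    """Rerun-based check: flaky if repeated runs of one test disagree."""
    if len(outcomes) < 2:
        raise ValueError("a rerun-based check needs at least two executions")
    return len(set(outcomes)) > 1


def f1_score(y_true, y_pred):
    """F1 for the positive (flaky) class from boolean labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy rerun outcomes for four tests (invented for illustration).
reruns = [["pass", "pass"], ["pass", "fail"], ["fail", "fail"], ["fail", "pass"]]
labels = [is_flaky(outcomes) for outcomes in reruns]

# Hypothetical predictions from a classifier that saw only the first run.
predictions = [False, True, True, False]

print(labels, f1_score(labels, predictions))
```

The appeal of a single-run classifier is cost: each simulator execution is expensive, so avoiding the second run halves the simulation budget needed to flag a test.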

MLLMs-Augmented Visual-Language Representation Learning

paper_url: http://arxiv.org/abs/2311.18765
repo_url: https://github.com/lyq312318224/mllms-augmented
paper_authors: Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You
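for: Improving visual-language representation learning with multi-modal large language models (MLLMs).
methods: MLLMs are used to extend multiple captions for each image, and "text shearing" keeps the extended captions the same length as the original captions.
results: On image-text retrieval, the method improves R@1 by 5.6 ~ 35.0% under the fine-tuning setting and by 16.8 ~ 46.1% under the zero-shot setting. The zero-shot results are comparable to fine-tuning on target datasets, which encourages further exploration of the versatile use of MLLMs.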

Abstract
Visual-language pre-training (VLP) has achieved remarkable success in multi-modal tasks, largely attributed to the availability of large-scale image-text datasets. In this work, we demonstrate that multi-modal large language models (MLLMs) can enhance visual-language representation learning by improving data quality. Our approach is simple, utilizing MLLMs to extend multiple captions for each image. To prevent the bias introduced by MLLMs' hallucinations and intrinsic caption styles, we propose "text shearing" to maintain the same length for extended captions as that of the original captions. In image-text retrieval, our method consistently obtains 5.6 ~ 35.0% and 16.8 ~ 46.1% improvement on R@1 under the fine-tuning and zero-shot settings, respectively. Notably, we obtain zero-shot results that are comparable to fine-tuning on target datasets, which encourages more exploration of the versatile use of MLLMs.

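Summary
Visual-language pre-training (VLP) has been highly successful on multi-modal tasks, largely thanks to large-scale image-text datasets. In this work, we show that multi-modal large language models (MLLMs) can improve visual-language representation learning. Our method is simple: we use MLLMs to extend multiple captions for each image. To prevent the bias introduced by MLLMs' hallucinations and intrinsic caption styles, we propose "text shearing", which keeps the extended captions the same length as the original captions. In image-text retrieval, our method consistently obtains 5.6 ~ 35.0% and 16.8 ~ 46.1% improvement on R@1 under the fine-tuning and zero-shot settings, respectively. Notably, our zero-shot results are comparable to fine-tuning on target datasets, which encourages further exploration of the versatile use of MLLMs.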
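
The "text shearing" step described in the abstract can be sketched as follows. This is only a minimal illustration, assuming whitespace tokenization and plain truncation to the original caption's token count; the function name `shear_caption` is hypothetical, and the paper's actual tokenizer and truncation rule may differ.

```python
def shear_caption(original: str, extended: str) -> str:
    """Truncate an MLLM-extended caption to the token length of the original.

    Assumption: tokens are whitespace-separated words. A real pipeline
    would more likely count tokens with the model's own tokenizer.
    """
    limit = len(original.split())
    return " ".join(extended.split()[:limit])


original = "a dog runs on the beach"
extended = "a brown dog is running happily along a sandy beach at sunset"
sheared = shear_caption(original, extended)
print(sheared)
```

Keeping the length fixed means any detail the MLLM appends beyond the original caption length is cut off, which is how the abstract motivates limiting hallucinated content.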

Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters

paper_url: http://arxiv.org/abs/2311.18763
repo_url: None
paper_authors: James Seale Smith, Yen-Chang Hsu, Zsolt Kira, Yilin Shen, Hongxia Jin
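for: Investigates whether continual text-to-image diffusion methods can scale to longer concept sequences without forgetting previously learned concepts.
methods: Proposes STack-And-Mask INcremental Adapters (STAMINA), which combines low-rank attention-masked adapters with customized MLP tokens to improve the robustness and scalability of LoRA in sequential concept learning.
results: STAMINA outperforms the prior state of the art on a 50-concept benchmark without stored replay data. The method is also extended to continual learning for image classification, where it likewise reaches state-of-the-art performance on a standard benchmark.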

Abstract
Recent work has demonstrated a remarkable ability to customize text-to-image diffusion models to multiple, fine-grained concepts in a sequential (i.e., continual) manner while only providing a few example images for each concept. This setting is known as continual diffusion. Here, we ask the question: Can we scale these methods to longer concept sequences without forgetting? Although prior work mitigates the forgetting of previously learned concepts, we show that its capacity to learn new tasks reaches saturation over longer sequences. We address this challenge by introducing a novel method, STack-And-Mask INcremental Adapters (STAMINA), which is composed of low-ranked attention-masked adapters and customized MLP tokens. STAMINA is designed to enhance the robust fine-tuning properties of LoRA for sequential concept learning via learnable hard-attention masks parameterized with low rank MLPs, enabling precise, scalable learning via sparse adaptation. Notably, all introduced trainable parameters can be folded back into the model after training, inducing no additional inference parameter costs. We show that STAMINA outperforms the prior SOTA for the setting of text-to-image continual customization on a 50-concept benchmark composed of landmarks and human faces, with no stored replay data. Additionally, we extended our method to the setting of continual learning for image classification, demonstrating that our gains also translate to state-of-the-art performance in this standard benchmark.

Summary
Recent research has shown the ability to customize text-to-image diffusion models to multiple, fine-grained concepts in a sequential manner, while only providing a few example images for each concept. This setting is called continual diffusion. However, the capacity to learn new tasks reaches saturation over longer sequences. To address this challenge, we propose a novel method called STack-And-Mask INcremental Adapters (STAMINA), which includes low-ranked attention-masked adapters and customized MLP tokens. STAMINA enhances the robust fine-tuning properties of LoRA for sequential concept learning via learnable hard-attention masks parameterized with low rank MLPs, allowing for precise and scalable learning via sparse adaptation. All introduced trainable parameters can be folded back into the model after training, resulting in no additional inference parameter costs. Our method outperforms the prior state-of-the-art for the setting of text-to-image continual customization on a 50-concept benchmark composed of landmarks and human faces, with no stored replay data. Additionally, we extended our method to the setting of continual learning for image classification, demonstrating that our gains also translate to state-of-the-art performance in this standard benchmark.
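
The claim that trainable parameters "can be folded back into the model after training" can be illustrated with a small sketch. The abstract does not give STAMINA's exact adapter form, so this assumes a generic LoRA-style update `lora_b @ lora_a` gated element-wise by a hard 0/1 mask; the helper names `matmul` and `fold_masked_lora` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fold_masked_lora(weight, lora_b, lora_a, mask):
    """Return weight + mask * (lora_b @ lora_a), with the mask applied element-wise.

    Assumption: `mask` is a hard (0/1) matrix with the same shape as `weight`.
    """
    update = matmul(lora_b, lora_a)
    return [
        [w + m * u for w, m, u in zip(w_row, m_row, u_row)]
        for w_row, m_row, u_row in zip(weight, mask, update)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight
lora_b = [[1.0], [2.0]]            # 2x1 low-rank factor
lora_a = [[0.5, -1.0]]             # 1x2 low-rank factor
mask = [[1, 0], [0, 1]]            # hard mask keeps only the diagonal update

folded = fold_masked_lora(weight, lora_b, lora_a, mask)
print(folded)
```

After folding, inference uses a single weight matrix of the original shape, which is why no extra inference parameters are needed.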

TaskBench: Benchmarking Large Language Models for Task Automation

paper_url: http://arxiv.org/abs/2311.18760
repo_url: https://github.com/microsoft/JARVIS
paper_authors: Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, Yueting Zhuang
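for: Evaluating the capability of large language models (LLMs) in task automation.
methods: Introduces TaskBench, which covers three critical stages of task automation: task decomposition, tool invocation, and parameter prediction to fulfill user intent. High-quality evaluation datasets are generated with the Tool Graph concept and a back-instruct method.
results: Experiments show that TaskBench effectively reflects the capability of LLMs in task automation. Thanks to the mix of automated data construction and human verification, TaskBench achieves high consistency with human evaluation and can serve as a comprehensive and faithful benchmark for LLM-based autonomous agents.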

Abstract
Recently, the incredible progress of large language models (LLMs) has ignited the spark of task automation, which decomposes the complex tasks described by user instructions into sub-tasks, invokes external tools to execute them, and plays a central role in autonomous agents. However, a systematic and standardized benchmark to foster the development of LLMs in task automation is lacking. To this end, we introduce TaskBench to evaluate the capability of LLMs in task automation. Specifically, task automation can be formulated into three critical stages: task decomposition, tool invocation, and parameter prediction to fulfill user intent. This complexity makes data collection and evaluation more challenging compared to common NLP tasks. To generate high-quality evaluation datasets, we introduce the concept of Tool Graph to represent the decomposed tasks in user intent, and adopt a back-instruct method to simulate user instruction and annotations. Furthermore, we propose TaskEval to evaluate the capability of LLMs from different aspects, including task decomposition, tool invocation, and parameter prediction. Experimental results demonstrate that TaskBench can effectively reflect the capability of LLMs in task automation. Benefiting from the mixture of automated data construction and human verification, TaskBench achieves high consistency with human evaluation, and can be used as a comprehensive and faithful benchmark for LLM-based autonomous agents.

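Summary
Recent progress in large language models (LLMs) has sparked interest in task automation, in which complex tasks described by user instructions are decomposed into sub-tasks and executed by invoking external tools. Task automation plays a central role in autonomous agents. However, a systematic and standardized benchmark for assessing LLMs in task automation is lacking. To this end, we introduce TaskBench. Task automation can be formulated as three critical stages: task decomposition, tool invocation, and parameter prediction to fulfill user intent. This complexity makes data collection and evaluation harder than for common NLP tasks. To generate high-quality evaluation data, we introduce the Tool Graph to represent the decomposed tasks in user intent, and adopt a back-instruct method to simulate user instructions and annotations. We also propose TaskEval to evaluate LLMs from several aspects, including task decomposition, tool invocation, and parameter prediction. Experiments show that TaskBench effectively reflects the capability of LLMs in task automation. Thanks to the mix of automated data construction and human verification, TaskBench achieves high consistency with human evaluation and can serve as a comprehensive and faithful benchmark for LLM-based autonomous agents.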

Language Model Agents Suffer from Compositional Generalization in Web Automation

paper_url: http://arxiv.org/abs/2311.18751
repo_url: https://github.com/google-research/google-research
paper_authors: Hiroki Furuta, Yutaka Matsuo, Aleksandra Faust, Izzeddin Gur
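for: Examines how language model agents (LMAs) perform on multi-step decision-making tasks, and how well they generalize to real-world applications that combine tasks.
methods: Introduces a new benchmark, CompWoB, to evaluate LMAs, and studies both prompted and transferred LMAs under different task compositions.
results: Prompted LMAs perform poorly on compositional tasks (24.9% success rate), while transferred LMAs finetuned only on base tasks do better but still show a generalization gap (54.8% success rate). By balancing the data distribution across tasks, the authors train a new model, HTML-T5++, that surpasses human-level performance (95.2%) on MiniWoB and achieves the best zero-shot performance on CompWoB (61.5%).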

Abstract
Language model agents (LMA) recently emerged as a promising paradigm for multi-step decision-making tasks, often outperforming humans and other reinforcement learning agents. Despite the promise, their performance on real-world applications that often involve combinations of tasks is still underexplored. In this work, we introduce a new benchmark, called CompWoB -- 50 new compositional web automation tasks reflecting more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve 94.0% average success rate on base tasks, their performance degrades to 24.9% success rate on compositional tasks. On the other hand, transferred LMAs (finetuned only on base tasks) show a smaller generalization gap, dropping from 85.4% to 54.8%. By balancing data distribution across tasks, we train a new model, HTML-T5++, that surpasses human-level performance (95.2%) on MiniWoB, and achieves the best zero-shot performance on CompWoB (61.5%). While these highlight the promise of small-scale finetuned and transferred models for compositional generalization, their performance further degrades under different instruction compositions changing combinational order. In contrast to the recent remarkable success of LMA, our benchmark and detailed analysis emphasize the necessity of building LMAs that are robust and generalizable to task compositionality for real-world deployment.

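Summary
Language model agents (LMAs) have recently emerged as a promising paradigm for multi-step decision-making tasks, often outperforming humans and other reinforcement learning agents. Despite this promise, their performance on real-world applications that combine tasks remains underexplored. In this work, we introduce a new benchmark called CompWoB, made up of 50 new compositional web automation tasks that reflect more realistic assumptions. We find that existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve a 94.0% average success rate on base tasks, but drop to 24.9% on compositional tasks. In contrast, transferred LMAs (finetuned only on base tasks) show a smaller generalization gap, dropping from 85.4% to 54.8%. By balancing the data distribution across tasks, we train a new model, HTML-T5++, that surpasses human-level performance (95.2%) on MiniWoB and achieves the best zero-shot performance on CompWoB (61.5%). These results highlight the promise of small-scale finetuned and transferred models for compositional generalization, but their performance degrades further when the instruction composition changes the combinational order. Our benchmark and detailed analysis emphasize the need to build LMAs that are robust and generalizable to task compositionality for real-world deployment.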

TransCORALNet: A Two-Stream Transformer CORAL Networks for Supply Chain Credit Assessment Cold Start

paper_url: http://arxiv.org/abs/2311.18749
repo_url: https://github.com/jiejieniu/transcoralnet
paper_authors: Jie Shi, Arno P. J. M. Siebes, Siamak Mehrkanoon
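for: Proposes an interpretable two-stream transformer CORAL network (TransCORALNet) to address domain shift and cold start in supply chain credit assessment.
methods: The model uses a two-stream domain adaptation architecture with a correlation alignment (CORAL) loss to provide accurate credit assessment predictions.
results: Experiments on a real-world dataset show that TransCORALNet is more accurate than a number of state-of-the-art baselines. The code is available on GitHub.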

Abstract
This paper proposes an interpretable two-stream transformer CORAL network (TransCORALNet) for supply chain credit assessment under the segment industry and cold start problem. The model aims to provide accurate credit assessment prediction for new supply chain borrowers with limited historical data. Here, the two-stream domain adaptation architecture with correlation alignment (CORAL) loss is used as a core model and is equipped with a transformer, which provides insights about the learned features and allows efficient parallelization during training. Thanks to the domain adaptation capability of the proposed model, the domain shift between the source and target domain is minimized. Therefore, the model exhibits good generalization where the source and target do not follow the same distribution, and a limited amount of target labeled instances exist. Furthermore, we employ Local Interpretable Model-agnostic Explanations (LIME) to provide more insight into the model prediction and identify the key features contributing to supply chain credit assessment decisions. The proposed model addresses four significant supply chain credit assessment challenges: domain shift, cold start, imbalanced-class and interpretability. Experimental results on a real-world data set demonstrate the superiority of TransCORALNet over a number of state-of-the-art baselines in terms of accuracy. The code is available on GitHub at https://github.com/JieJieNiu/TransCORALN.

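Summary
This paper proposes an interpretable two-stream transformer network (TransCORALNet) to address domain shift and cold start in supply chain credit assessment. The model uses a two-stream domain adaptation architecture with a correlation alignment (CORAL) loss to improve prediction performance. It also uses Local Interpretable Model-agnostic Explanations (LIME) to give more insight into the predictions and to identify the key features behind supply chain credit assessment decisions. Experiments on a real-world dataset show that TransCORALNet achieves higher accuracy than a number of state-of-the-art baselines. The code is available on GitHub (https://github.com/JieJieNiu/TransCORALN).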
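
The correlation alignment (CORAL) loss named above can be sketched in a few lines. This is a minimal illustration of the commonly used formulation, the squared Frobenius distance between source and target feature covariances scaled by 1 / (4 d^2), and not the authors' implementation; the function names and toy feature batches are hypothetical.

```python
def covariance(features):
    """Sample covariance (d x d) of a list of n feature vectors of length d."""
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    sq = sum((cs[i][j] - ct[i][j]) ** 2 for i in range(d) for j in range(d))
    return sq / (4 * d * d)


source = [[0.0, 0.0], [2.0, 2.0]]   # toy source-domain features
target = [[0.0, 0.0], [2.0, -2.0]]  # toy target-domain features
print(coral_loss(source, source))   # identical batches give zero loss
print(coral_loss(source, target))
```

Minimizing this term pushes the second-order statistics of the two streams together, which is the sense in which the abstract says domain shift is minimized.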

AlignBench: Benchmarking Chinese Alignment of Large Language Models

paper_url: http://arxiv.org/abs/2311.18743
repo_url: https://github.com/thudm/alignbench
paper_authors: Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi Wen, Jiale Cheng, Pei Ke, Yifan Xu, Weng Lam Tam, Xiaohan Zhang, Lichao Sun, Hongning Wang, Jing Zhang, Minlie Huang, Yuxiao Dong, Jie Tang
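for: Provides a comprehensive multi-dimensional benchmark for evaluating the alignment of Chinese large language models (LLMs).
methods: Uses a human-in-the-loop data curation pipeline, and a rule-calibrated multi-dimensional LLM-as-Judge that generates explanations and final ratings as evaluations to ensure high reliability and interpretability.
results: CritiqueLLM, a dedicated Chinese evaluator LLM, recovers 95% of GPT-4's evaluation ability. Public APIs for evaluating AlignBench with CritiqueLLM will be provided.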

Abstract
Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effective evaluation of alignment for emerging Chinese LLMs is still significantly lacking, calling for real-scenario grounded, open-ended, challenging and automatic evaluations tailored for alignment. To fill in this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. Equipped with a human-in-the-loop data curation pipeline, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge with Chain-of-Thought to generate explanations and final ratings as evaluations, ensuring high reliability and interpretability. Furthermore, we report AlignBench evaluated by CritiqueLLM, a dedicated Chinese evaluator LLM that recovers 95% of GPT-4's evaluation ability. We will provide public APIs for evaluating AlignBench with CritiqueLLM to facilitate the evaluation of LLMs' Chinese alignment. All evaluation codes, data, and LLM generations are available at https://github.com/THUDM/AlignBench.

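Summary
Alignment has become a critical step for large language models (LLMs) to become helpful assistants. However, the evaluation of alignment for emerging Chinese LLMs still has a significant gap, and calls for evaluations that are grounded in real scenarios, open-ended, challenging, and automatic. To fill this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating the alignment of LLMs in Chinese. Our benchmark is built with a human-in-the-loop data curation pipeline and uses a rule-calibrated multi-dimensional LLM-as-Judge with Chain-of-Thought to generate explanations and final ratings, ensuring high reliability and interpretability. In addition, we report AlignBench results evaluated by CritiqueLLM, a dedicated Chinese evaluator LLM that recovers 95% of GPT-4's evaluation ability. We will provide public APIs for evaluating AlignBench with CritiqueLLM. All evaluation code, data, and LLM generations are available in the AlignBench GitHub repository.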

VREM-FL: Mobility-Aware Computation-Scheduling Co-Design for Vehicular Federated Learning

paper_url: http://arxiv.org/abs/2311.18741
repo_url: None
paper_authors: Luca Ballotta, Nicolò Dal Fabbro, Giovanni Perin, Luca Schenato, Michele Rossi, Giuseppe Piro
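for: Proposes VREM-FL, a computation-scheduling co-design for vehicular federated learning, motivated by assisted and autonomous driving.
methods: Combines federated learning with vehicle mobility and estimated 5G radio environment maps to optimize training of the global machine learning model while allocating communication resources carefully. Local computation at the vehicles and the transmission of local model updates are orchestrated jointly.
results: Experiments demonstrate the efficacy of using radio maps. Compared with literature benchmarks, VREM-FL reduces learning time by 28% for a linear regression model, and doubles the number of model updates within the same time window for a deep neural network on a semantic image segmentation task.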

Abstract
Assisted and autonomous driving are rapidly gaining momentum, and will soon become a reality. Among their key enablers, artificial intelligence and machine learning are expected to play a prominent role, also thanks to the massive amount of data that smart vehicles will collect from their onboard sensors. In this domain, federated learning is one of the most effective and promising techniques for training global machine learning models, while preserving data privacy at the vehicles and optimizing communications resource usage. In this work, we propose VREM-FL, a computation-scheduling co-design for vehicular federated learning that leverages mobility of vehicles in conjunction with estimated 5G radio environment maps. VREM-FL jointly optimizes the global model learned at the server while wisely allocating communication resources. This is achieved by orchestrating local computations at the vehicles in conjunction with the transmission of their local model updates in an adaptive and predictive fashion, by exploiting radio channel maps. The proposed algorithm can be tuned to trade model training time for radio resource usage. Experimental results demonstrate the efficacy of utilizing radio maps. VREM-FL outperforms literature benchmarks for both a linear regression model (learning time reduced by 28%) and a deep neural network for a semantic image segmentation task (doubling the number of model updates within the same time window).

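Summary
Assisted and autonomous driving are rapidly gaining momentum and will soon become a reality. Artificial intelligence and machine learning are among their key enablers, helped by the massive amount of data that smart vehicles collect from their onboard sensors. In this domain, federated learning is one of the most effective and promising techniques for training global machine learning models while preserving data privacy at the vehicles and optimizing the use of communication resources. In this work, we propose VREM-FL, a computation-scheduling co-design for vehicular federated learning that leverages vehicle mobility together with estimated 5G radio environment maps. VREM-FL jointly optimizes the global model learned at the server and the allocation of communication resources, by orchestrating local computation at the vehicles and the transmission of their local model updates in an adaptive and predictive fashion using radio channel maps. The algorithm can be tuned to trade model training time for radio resource usage. Experimental results demonstrate the efficacy of using radio maps. VREM-FL outperforms literature benchmarks for both a linear regression model (learning time reduced by 28%) and a deep neural network for a semantic image segmentation task (doubling the number of model updates within the same time window).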

Controlgym: Large-Scale Safety-Critical Control Environments for Benchmarking Reinforcement Learning Algorithms

paper_url: http://arxiv.org/abs/2311.18736
repo_url: https://github.com/xiangyuan-zhang/controlgym
paper_authors: Xiangyuan Zhang, Weichao Mao, Saviz Mowlavi, Mouhacine Benosman, Tamer Başar
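for: Introduces controlgym, a library of thirty-six safety-critical industrial control settings and ten partial differential equation (PDE)-based control problems.
methods: The library is integrated within the OpenAI Gym/Gymnasium (Gym) framework, so standard reinforcement learning (RL) algorithms such as those in stable-baselines3 can be applied directly.
results: The control environments have continuous, unbounded action and observation spaces motivated by real-world control applications, and the PDE environments let users extend the state dimensionality of the system toward infinity while preserving the intrinsic dynamics. These features support studying the scalability of RL algorithms and the stability of learning-based controllers.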

Abstract
We introduce controlgym, a library of thirty-six safety-critical industrial control settings, and ten infinite-dimensional partial differential equation (PDE)-based control problems. Integrated within the OpenAI Gym/Gymnasium (Gym) framework, controlgym allows direct applications of standard reinforcement learning (RL) algorithms like stable-baselines3. Our control environments complement those in Gym with continuous, unbounded action and observation spaces, motivated by real-world control applications. Moreover, the PDE control environments uniquely allow the users to extend the state dimensionality of the system to infinity while preserving the intrinsic dynamics. This feature is crucial for evaluating the scalability of RL algorithms for control. This project serves the learning for dynamics & control (L4DC) community, aiming to explore key questions: the convergence of RL algorithms in learning control policies; the stability and robustness issues of learning-based controllers; and the scalability of RL algorithms to high- and potentially infinite-dimensional systems. We open-source the controlgym project at https://github.com/xiangyuan-zhang/controlgym.

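Summary
We introduce controlgym, a library of thirty-six safety-critical industrial control settings and ten infinite-dimensional partial differential equation (PDE)-based control problems. The library is integrated within the OpenAI Gym/Gymnasium (Gym) framework, so standard reinforcement learning (RL) algorithms such as stable-baselines3 can be applied directly. Our control environments complement those in Gym with continuous, unbounded action and observation spaces motivated by real-world control applications. In addition, the PDE control environments let users extend the state dimensionality of the system to infinity while preserving the intrinsic dynamics, which is crucial for evaluating the scalability of RL algorithms. This project serves the learning for dynamics & control (L4DC) community and aims to explore key questions: the convergence of RL algorithms in learning control policies; the stability and robustness of learning-based controllers; and the scalability of RL algorithms to high- and potentially infinite-dimensional systems. We open-source the controlgym project on GitHub.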

Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization

paper_url: http://arxiv.org/abs/2311.18703
repo_url: None
paper_authors: Daniel Jarne Ornia, Giannis Delimpaltadakis, Jens Kober, Javier Alonso-Mora
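for: Making the behavior of RL agents more predictable.
methods: Uses the state sequence entropy rate as a predictability measure, and introduces an action-dependent surrogate entropy so that policy gradient (PG) methods can be used despite the policy-dependent entropy reward.
results: On RL tasks inspired by human-robot use cases, the approach produces agents with more predictable behavior while achieving near-optimal rewards.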

Abstract
In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors, and are often pushed (through e.g. policy entropy regularization) to randomize their actions in favor of exploration. From a human perspective, this makes RL agents hard to interpret and predict, and from a safety perspective, even harder to formally verify. We propose a novel method to induce predictable behavior in RL agents, referred to as Predictability-Aware RL (PA-RL), which employs the state sequence entropy rate as a predictability measure. We show how the entropy rate can be formulated as an average reward objective, and since its entropy reward function is policy-dependent, we introduce an action-dependent surrogate entropy enabling the use of PG methods. We prove that deterministic policies minimizing the average surrogate reward exist and also minimize the actual entropy rate, and show how, given a learned dynamical model, we are able to approximate the value function associated to the true entropy rate. Finally, we demonstrate the effectiveness of the approach in RL tasks inspired by human-robot use-cases, and show how it produces agents with more predictable behavior while achieving near-optimal rewards.

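Summary
In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behavior, and are often pushed (for example, through policy entropy regularization) to randomize their actions in favor of exploration. From a human perspective, this makes RL agents hard to interpret and predict, and from a safety perspective, even harder to formally verify. We propose a novel method, Predictability-Aware RL (PA-RL), that uses the state sequence entropy rate as a predictability measure. We show that the entropy rate can be formulated as an average reward objective, and because its entropy reward function is policy-dependent, we introduce an action-dependent surrogate entropy. We prove that deterministic policies minimizing the average surrogate reward exist and also minimize the actual entropy rate. Finally, we evaluate the approach on RL tasks inspired by human-robot use cases, and show that it produces agents with more predictable behavior while achieving near-optimal rewards.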
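
To make the predictability measure concrete, the sketch below computes the entropy rate of a finite Markov chain, H = -sum_i mu_i sum_j P_ij log P_ij, where mu is the stationary distribution. It assumes the chain induced by a fixed policy is already given as a transition matrix `p` and that power iteration from the uniform distribution converges; the function names are hypothetical and this is not the paper's PA-RL algorithm.

```python
import math


def stationary_distribution(p, iters=1000):
    """Approximate the stationary distribution by repeated multiplication."""
    n = len(p)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * p[i][j] for i in range(n)) for j in range(n)]
    return mu


def entropy_rate(p):
    """Entropy rate (in nats) of a Markov chain with transition matrix p."""
    mu = stationary_distribution(p)
    return -sum(
        mu[i] * p[i][j] * math.log(p[i][j])
        for i in range(len(p))
        for j in range(len(p))
        if p[i][j] > 0  # zero-probability transitions contribute nothing
    )


random_walk = [[0.5, 0.5], [0.5, 0.5]]   # next state is a fair coin flip
deterministic = [[0.0, 1.0], [1.0, 0.0]]  # states alternate with certainty
print(entropy_rate(random_walk))
print(entropy_rate(deterministic))
```

The deterministic chain has zero entropy rate, matching the intuition that a fully predictable state sequence carries no new information per step.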

CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation

paper_url: http://arxiv.org/abs/2311.18702
repo_url: https://github.com/thu-coai/critiquellm
paper_authors: Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang
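for: Investigates key factors of LLM-based evaluation models, such as scaling properties, and assesses their potential to replace GPT-4's evaluation in practical scenarios.
methods: Proposes a new critique generation model, CritiqueLLM, which includes a dialogue-based prompting method for obtaining high-quality referenced / reference-free evaluation data.
results: The model achieves evaluation performance comparable to GPT-4, especially in system-level correlations, and outperforms GPT-4 in 3 out of 8 tasks in a challenging reference-free setting. Detailed analysis shows promising scaling properties in the quality of generated critiques, and the generated critiques can act as feedback to directly improve the generation quality of LLMs.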

Abstract
Since the natural language processing (NLP) community started to make large language models (LLMs), such as GPT-4, act as a critic to evaluate the quality of generated texts, most of them only train a critique generation model of a specific scale on specific datasets. We argue that a comprehensive investigation on the key factor of LLM-based evaluation models, such as scaling properties, is lacking, so that it is still inconclusive whether these models have potential to replace GPT-4's evaluation in practical scenarios. In this paper, we propose a new critique generation model called CritiqueLLM, which includes a dialogue-based prompting method for high-quality referenced / reference-free evaluation data. Experimental results show that our model can achieve comparable evaluation performance to GPT-4 especially in system-level correlations, and even outperform GPT-4 in 3 out of 8 tasks in a challenging reference-free setting. We conduct detailed analysis to show promising scaling properties of our model in the quality of generated critiques. We also demonstrate that our generated critiques can act as scalable feedback to directly improve the generation quality of LLMs.

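Summary
Since the natural language processing (NLP) community began using large language models (LLMs) as critics to evaluate the quality of generated text, most work has trained a critique generation model of one specific scale on specific datasets. We argue that a comprehensive investigation of the key factors of LLM-based evaluation models is lacking, so it remains inconclusive whether these models can replace GPT-4's evaluation in practical scenarios. In this paper, we propose a new critique generation model called CritiqueLLM, which includes a dialogue-based prompting method for obtaining high-quality referenced / reference-free evaluation data. Experimental results show that our model achieves evaluation performance comparable to GPT-4, especially in system-level correlations, and even outperforms GPT-4 in 3 out of 8 tasks in a challenging reference-free setting. We conduct detailed analysis to show the promising scaling properties of our model in the quality of generated critiques. We also show that our generated critiques can directly improve the generation quality of LLMs.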

Evaluating Large Language Model Creativity from a Literary Perspective

paper_url: http://arxiv.org/abs/2312.03746
repo_url: None
paper_authors: Murray Shanahan, Catherine Clarke
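for: Assesses whether large language models (LLMs) can serve as assistive tools in the creative writing process, through a single in-depth case study.
methods: Develops interactive, multi-voice prompting strategies that interleave background descriptions (scene setting, plot elements), instructions that guide composition, samples of text in the target style, and critical discussion of the samples.
results: The findings support the view that the sophistication of the results achievable with an LLM mirrors the sophistication of the prompting.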

Abstract
This paper assesses the potential for large language models (LLMs) to serve as assistive tools in the creative writing process, by means of a single, in-depth case study. In the course of the study, we develop interactive and multi-voice prompting strategies that interleave background descriptions (scene setting, plot elements), instructions that guide composition, samples of text in the target style, and critical discussion of the given samples. We qualitatively evaluate the results from a literary critical perspective, as well as from the standpoint of computational creativity (a sub-field of artificial intelligence). Our findings lend support to the view that the sophistication of the results that can be achieved with an LLM mirrors the sophistication of the prompting.

Summary

paper_url: http://arxiv.org/abs/2311.18676
repo_url: None
paper_authors: Aryaman Rao, Parth Singh, Dinesh Kumar Vishwakarma, Mukesh Prasad
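for: Proposes a Discretized Quantum-based Salp Swarm Algorithm (DQSSA) for optimizing influence diffusion in social networks.
methods: Discretizes meta-heuristic algorithms and adds quantum-inspired enhancements to address premature convergence and low efficacy.
results: Experiments on four real-world datasets show that DQSSA outperforms established cutting-edge algorithms.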

Abstract
Influence Maximization is the task of selecting optimal nodes maximising the influence spread in social networks. This study proposes a Discretized Quantum-based Salp Swarm Algorithm (DQSSA) for optimizing influence diffusion in social networks. By discretizing meta-heuristic algorithms and infusing them with quantum-inspired enhancements, we address issues like premature convergence and low efficacy. The proposed method, guided by quantum principles, offers a promising solution for Influence Maximisation. Experiments on four real-world datasets reveal DQSSA's superior performance as compared to established cutting-edge algorithms.

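Summary
Influence Maximization is the task of selecting the best nodes to maximize the spread of influence in social networks. This study proposes a Discretized Quantum-based Salp Swarm Algorithm (DQSSA) for optimizing influence diffusion in social networks. By discretizing meta-heuristic algorithms and adding quantum-inspired enhancements, we address issues such as premature convergence and low efficacy. The proposed method, guided by quantum principles, offers a promising solution for Influence Maximization. Experiments on four real-world datasets show that DQSSA performs markedly better than established cutting-edge algorithms.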

Multi-task learning with cross-task consistency for improved depth estimation in colonoscopy

paper_url: http://arxiv.org/abs/2311.18664
repo_url: None
paper_authors: Pedro Esteban Chavarrias Solano, Andrew Bulpitt, Venkataraman Subramanian, Sharib Ali
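for: Colonoscopy screening, that is, assessing abnormalities in the colon and rectum such as ulcers and cancerous polyps.
methods: A multi-task learning (MTL) approach with a shared encoder and two decoders, a surface normal decoder and a depth estimator decoder, with attention mechanisms to enhance global context awareness.
results: Compared with the BTS baseline, the proposed method improves relative error by 14.17% and $\delta_{1}$ accuracy by 10.4%. All experiments are conducted on the recently released C3VD dataset.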

Abstract
Colonoscopy screening is the gold standard procedure for assessing abnormalities in the colon and rectum, such as ulcers and cancerous polyps. Measuring the abnormal mucosal area and its 3D reconstruction can help quantify the surveyed area and objectively evaluate disease burden. However, due to the complex topology of these organs and variable physical conditions, for example, lighting, large homogeneous texture, and image modality, estimating distance from the camera (aka depth) is highly challenging. Moreover, most colonoscopic video acquisition is monocular, making the depth estimation a non-trivial problem. While methods in computer vision for depth estimation have been proposed and advanced on natural scene datasets, the efficacy of these techniques has not been widely quantified on colonoscopy datasets. As the colonic mucosa has several low-texture regions that are not well pronounced, learning representations from an auxiliary task can improve salient feature extraction, allowing estimation of accurate camera depths. In this work, we propose to develop a novel multi-task learning (MTL) approach with a shared encoder and two decoders, namely a surface normal decoder and a depth estimator decoder. Our depth estimator incorporates attention mechanisms to enhance global context awareness. We leverage the surface normal prediction to improve geometric feature extraction. Also, we apply a cross-task consistency loss among the two geometrically related tasks, surface normal and camera depth. We demonstrate an improvement of 14.17% on relative error and 10.4% improvement on $\delta_{1}$ accuracy over the most accurate baseline state-of-the-art BTS approach. All experiments are conducted on a recently released C3VD dataset; thus, we provide a first benchmark of state-of-the-art methods.

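Summary
Colonoscopy screening is the standard procedure for assessing abnormalities in the colon and rectum, such as ulcers and cancerous polyps. Measuring the abnormal mucosal area and reconstructing it in 3D can help evaluate disease burden. However, the complex topology of these organs and variable physical conditions (such as lighting and large homogeneous texture) make it hard to estimate the distance from the camera, that is, depth. Most colonoscopic video is also monocular, which makes depth estimation a difficult problem. Depth estimation methods from computer vision have advanced on natural scene datasets, but their efficacy on colonoscopy data has not been widely quantified. Because the colonic mucosa has many low-texture regions, learning representations from an auxiliary task can improve salient feature extraction and allow accurate camera depth estimation. In this work, we propose a multi-task learning (MTL) approach with a shared encoder and two decoders, a surface normal decoder and a depth estimator decoder. Our depth estimator incorporates attention mechanisms to enhance global context awareness, and we use the surface normal prediction to improve geometric feature extraction. We also apply a cross-task consistency loss between the two geometrically related tasks. We obtain a 14.17% improvement in relative error and a 10.4% improvement in $\delta_{1}$ accuracy over the BTS baseline. All experiments are conducted on the C3VD dataset, so we provide a first benchmark of state-of-the-art methods. $\delta_{1}$ is a depth accuracy measure for which higher values are better.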
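
The two reported metrics can be sketched as follows, assuming the standard depth-estimation definitions: absolute relative error as the mean of |pred - gt| / gt, and $\delta_{1}$ as the fraction of pixels where max(pred / gt, gt / pred) is below 1.25. The function names and sample depths are illustrative only, and the paper's exact evaluation protocol is not given in the abstract.

```python
def abs_rel_error(pred, gt):
    """Mean absolute relative error between predicted and true depths."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of depths whose ratio to the ground truth is within the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


gt = [1.0, 2.0, 4.0]    # hypothetical ground-truth depths
pred = [1.1, 2.0, 6.0]  # hypothetical predicted depths

print(abs_rel_error(pred, gt))
print(delta_accuracy(pred, gt))
```

Lower relative error and higher $\delta_{1}$ both indicate better depth estimates, so the two improvements reported in the abstract point in the same direction.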

Choosing the parameter of the Fermat distance: navigating geometry and noise

paper_url: http://arxiv.org/abs/2311.18663
repo_url: None
paper_authors: Frédéric Chazal, Laure Ferraris, Pablo Groisman, Matthieu Jonckheere, Frédéric Pascal, Facundo Sapienza
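for: Studies the Fermat distance for machine learning tasks where a natural distance is not directly available, or where the results given by Euclidean distances can be improved.
methods: The Fermat distance depends on a parameter $\alpha$ that strongly affects the performance of subsequent tasks. The paper studies how to select it, both theoretically and through simulations.
results: A suitable $\alpha$ is large enough to capture the geometric and statistical properties of the dataset, yet small enough to avoid the harmful effects of noise.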

Abstract
The Fermat distance has been recently established as a useful tool for machine learning tasks when a natural distance is not directly available to the practitioner or to improve the results given by Euclidean distances by exploiting the geometrical and statistical properties of the dataset. This distance depends on a parameter $\alpha$ that greatly impacts the performance of subsequent tasks. Ideally, the value of $\alpha$ should be large enough to navigate the geometric intricacies inherent to the problem. At the same time, it should remain restrained enough to sidestep any deleterious ramifications stemming from noise during the process of distance estimation. We study both theoretically and through simulations how to select this parameter.

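Summary
The Fermat distance has recently been established as a useful tool for machine learning tasks when a natural distance is not directly available to the practitioner, or to improve on the results given by Euclidean distances. The distance depends on a parameter $\alpha$ that has a large effect on the performance of subsequent tasks. Ideally, $\alpha$ should be large enough to capture the geometry of the dataset, and small enough to avoid the side effects of noise during distance estimation. We study how to select this parameter both theoretically and through simulations.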
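
The role of $\alpha$ can be illustrated with the sample Fermat distance as it is usually defined: the cheapest path between two sample points, where each hop between points costs the Euclidean distance raised to the power $\alpha$. The sketch below assumes that definition and uses Floyd-Warshall over the complete graph for brevity; the function name `fermat_distances` is hypothetical.

```python
import math


def fermat_distances(points, alpha):
    """All-pairs sample Fermat distances via Floyd-Warshall.

    Each direct hop between two sample points costs
    (Euclidean distance) ** alpha; paths may pass through any sample points.
    """
    n = len(points)
    dist = [[math.dist(p, q) ** alpha for q in points] for p in points]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


points = [(0.0,), (1.0,), (2.0,)]  # three samples on a line
print(fermat_distances(points, alpha=1)[0][2])
print(fermat_distances(points, alpha=2)[0][2])
```

With $\alpha = 1$ the result is the ordinary Euclidean distance, while larger $\alpha$ penalizes long hops and favors paths through densely sampled regions, which is also why noisy points can distort the estimate when $\alpha$ is too large.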

Solving the Team Orienteering Problem with Transformers

paper_url: http://arxiv.org/abs/2311.18662
repo_url: https://github.com/danifuertes/top_transformer
paper_authors: Daniel Fuertes, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García
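for: Presents a multi-agent route planning system that solves the Team Orienteering Problem quickly and accurately.
methods: Uses a centralized Transformer neural network that learns to encode the scenario (modeled as a graph) and the context of the agents to provide fast and accurate solutions.
results: Several experiments show that the system outperforms most state-of-the-art works in computation speed while remaining accurate. The code is available at http://gti.ssr.upm.es/data.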

Abstract
Route planning for a fleet of vehicles is an important task in applications such as package delivery, surveillance, or transportation. This problem is usually modeled as a Combinatorial Optimization problem named as Team Orienteering Problem. The most popular Team Orienteering Problem solvers are mainly based on either linear programming, which provides accurate solutions by employing a large computation time that grows with the size of the problem, or heuristic methods, which usually find suboptimal solutions in a shorter amount of time. In this paper, a multi-agent route planning system capable of solving the Team Orienteering Problem in a very fast and accurate manner is presented. The proposed system is based on a centralized Transformer neural network that can learn to encode the scenario (modeled as a graph) and the context of the agents to provide fast and accurate solutions. Several experiments have been performed to demonstrate that the presented system can outperform most of the state-of-the-art works in terms of computation speed. In addition, the code is publicly available at http://gti.ssr.upm.es/data.

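Summary
Route planning for a fleet of vehicles is an important task in applications such as package delivery, surveillance, or transportation. The problem is usually modeled as the Team Orienteering Problem and is typically solved with linear programming or heuristic methods. This paper presents a multi-agent route planning system that solves the Team Orienteering Problem quickly and accurately. The system is based on a centralized Transformer neural network that learns to encode the scenario (modeled as a graph) and the context of the agents to provide fast and accurate solutions. Experiments show that the system outperforms most state-of-the-art works in computation speed. The code is publicly available at http://gti.ssr.upm.es/data.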

Detailed Human-Centric Text Description-Driven Large Scene Synthesis

paper_url: http://arxiv.org/abs/2311.18654
repo_url: None
paper_authors: Gwanghyun Kim, Dong Un Kang, Hoigi Seo, Hayeon Kim, Se Young Chun
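for: Proposes a text-driven large scene image synthesis method that improves the controllability and naturalness of scene generation.
methods: The method has three main components: 1) hierarchical keypoint-box layout generation using a large language model (LLM), 2) a view-wise conditioned joint diffusion process driven by the text description, and 3) pixel perturbation-based pyramidal interpolation to progressively refine the large scene.
results: Compared with prior text-to-large-scene synthesis methods, the approach shows strong faithfulness, superior controllability, and excellent naturalness in a global context.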

Abstract
Text-driven large scene image synthesis has made significant progress with diffusion models, but controlling it is challenging. While using additional spatial controls with corresponding texts has improved the controllability of large scene synthesis, it is still challenging to faithfully reflect detailed text descriptions without user-provided controls. Here, we propose DetText2Scene, a novel text-driven large-scale image synthesis with high faithfulness, controllability, and naturalness in a global context for the detailed human-centric text description. Our DetText2Scene consists of 1) hierarchical keypoint-box layout generation from the detailed description by leveraging large language model (LLM), 2) view-wise conditioned joint diffusion process to synthesize a large scene from the given detailed text with LLM-generated grounded keypoint-box layout and 3) pixel perturbation-based pyramidal interpolation to progressively refine the large scene for global coherence. Our DetText2Scene significantly outperforms prior arts in text-to-large scene synthesis qualitatively and quantitatively, demonstrating strong faithfulness with detailed descriptions, superior controllability, and excellent naturalness in a global context.

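Summary
Text-driven large scene image synthesis has made significant progress, but controlling it remains challenging. Additional spatial controls with corresponding texts improve controllability, yet it is still difficult to faithfully reflect detailed text descriptions. Here we propose DetText2Scene, a new text-driven large-scale image synthesis method with high faithfulness, controllability, and naturalness. DetText2Scene has three parts: 1) hierarchical keypoint-box layout generation from the detailed text description using a large language model (LLM); 2) a view-wise conditioned joint diffusion process that synthesizes the scene from the detailed text and the LLM-generated keypoint-box layout; 3) pixel perturbation-based pyramidal interpolation that progressively refines the scene for global coherence. Compared with prior text-to-large-scene synthesis methods, DetText2Scene improves significantly both qualitatively and quantitatively, showing strong faithfulness, high controllability, and naturalness.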

FedEmb: A Vertical and Hybrid Federated Learning Algorithm using Network And Feature Embedding Aggregation

paper_url: http://arxiv.org/abs/2312.00102
repo_url: None
paper_authors: Fanfei Meng, Lele Zhang, Yu Chen, Yuxin Wang
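for: This work proposes FedEmb, a generalized algorithm for vertical and hybrid DNN-based federated learning, aiming to improve inference accuracy, privacy preservation, and communication efficiency.
methods: FedEmb models vertical and hybrid DNN-based federated learning through network and feature embedding aggregation, lowering client-server communication bandwidth demands while keeping client data private.
results: Experiments show that FedEmb effectively handles decentralized problems with split feature and subject spaces, improving inference accuracy by 0.3% to 4.2% over existing methods and reducing time complexity by 88.9% relative to the vertical baseline.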

Abstract
Federated learning (FL) is an emerging paradigm for decentralized training of machine learning models on distributed clients, without revealing the data to the central server. The learning scheme may be horizontal, vertical or hybrid (both vertical and horizontal). Most existing research work with deep neural network (DNN) modelling is focused on horizontal data distributions, while vertical and hybrid schemes are much less studied. In this paper, we propose a generalized algorithm FedEmb, for modelling vertical and hybrid DNN-based learning. The idea of our algorithm is characterised by higher inference accuracy, stronger privacy-preserving properties, and lower client-server communication bandwidth demands as compared with existing work. The experimental results show that FedEmb is an effective method to tackle both split feature & subject space decentralized problems, shows 0.3% to 4.2% inference accuracy improvement with limited privacy revealing for datasets stored in local clients, and reduces 88.9 % time complexity over vertical baseline method.

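Summary
Federated learning (FL) is an emerging paradigm for collaboratively training machine learning models on distributed clients without revealing data to the central server. The learning scheme may be horizontal, vertical, or hybrid (both). Most existing research on deep neural network (DNN) modelling targets horizontal data distributions, while vertical and hybrid schemes are much less studied. In this paper, we propose FedEmb, a generalized algorithm for modelling vertical and hybrid DNN-based learning. The algorithm is characterised by higher inference accuracy, stronger privacy preservation, and lower client-server communication bandwidth demands. Experimental results show that FedEmb is an effective method for decentralized problems with split feature and subject spaces: for datasets stored on local clients it improves inference accuracy by 0.3% to 4.2% with limited privacy leakage, and it reduces time complexity by 88.9% relative to the vertical baseline method.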

Towards Unsupervised Representation Learning: Learning, Evaluating and Transferring Visual Representations

paper_url: http://arxiv.org/abs/2312.00101
repo_url: https://github.com/bonifazstuhr/feamgan
paper_authors: Bonifaz Stuhr
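for: This dissertation studies unsupervised representation learning, which learns representations automatically from data without manual annotation.
methods: It uses Convolutional Self-Organizing Neural Networks with self-organization- and Hebbian-based learning rules to learn convolutional kernels and masks, yielding deeper backpropagation-free unsupervised models.
results: It proposes new unsupervised representation learning methods for the visual domain and reports experiments on tasks such as lane detection with favorable results.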

Abstract
Unsupervised representation learning aims at finding methods that learn representations from data without annotation-based signals. Abstaining from annotations not only leads to economic benefits but may - and to some extent already does - result in advantages regarding the representation's structure, robustness, and generalizability to different tasks. In the long run, unsupervised methods are expected to surpass their supervised counterparts due to the reduction of human intervention and the inherently more general setup that does not bias the optimization towards an objective originating from specific annotation-based signals. While major advantages of unsupervised representation learning have been recently observed in natural language processing, supervised methods still dominate in vision domains for most tasks. In this dissertation, we contribute to the field of unsupervised (visual) representation learning from three perspectives: (i) Learning representations: We design unsupervised, backpropagation-free Convolutional Self-Organizing Neural Networks (CSNNs) that utilize self-organization- and Hebbian-based learning rules to learn convolutional kernels and masks to achieve deeper backpropagation-free models. (ii) Evaluating representations: We build upon the widely used (non-)linear evaluation protocol to define pretext- and target-objective-independent metrics for measuring and investigating the objective function mismatch between various unsupervised pretext tasks and target tasks. (iii) Transferring representations: We contribute CARLANE, the first 3-way sim-to-real domain adaptation benchmark for 2D lane detection, and a method based on prototypical self-supervised learning. Finally, we contribute a content-consistent unpaired image-to-image translation method that utilizes masks, global and local discriminators, and similarity sampling to mitigate content inconsistencies.


Stochastic Vision Transformers with Wasserstein Distance-Aware Attention

paper_url: http://arxiv.org/abs/2311.18645
repo_url: None
paper_authors: Franciskus Xaverius Erick, Mina Rezaei, Johanna Paula Müller, Bernhard Kainz
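for: This work proposes a new stochastic vision transformer that brings uncertainty and distance awareness into self-supervised learning.
methods: The method encodes image patches as Gaussian distributional embeddings and computes attention matrices with Wasserstein distance. It also proposes a Wasserstein distance-based regularization term to incorporate distance awareness into the latent representations.
results: Across tasks such as in-distribution generalization, out-of-distribution detection, dataset corruption, semi-supervised learning, and transfer learning, the method achieves superior accuracy and calibration, surpassing the self-supervised baseline.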

Abstract
Self-supervised learning is one of the most promising approaches to acquiring knowledge from limited labeled data. Despite the substantial advancements made in recent years, self-supervised models have posed a challenge to practitioners, as they do not readily provide insight into the model's confidence and uncertainty. Tackling this issue is no simple feat, primarily due to the complexity involved in implementing techniques that can make use of the latent representations learned during pre-training without relying on explicit labels. Motivated by this, we introduce a new stochastic vision transformer that integrates uncertainty and distance awareness into self-supervised learning (SSL) pipelines. Instead of the conventional deterministic vector embedding, our novel stochastic vision transformer encodes image patches into elliptical Gaussian distributional embeddings. Notably, the attention matrices of these stochastic representational embeddings are computed using Wasserstein distance-based attention, effectively capitalizing on the distributional nature of these embeddings. Additionally, we propose a regularization term based on Wasserstein distance for both pre-training and fine-tuning processes, thereby incorporating distance awareness into latent representations. We perform extensive experiments across different tasks such as in-distribution generalization, out-of-distribution detection, dataset corruption, semi-supervised settings, and transfer learning to other datasets and tasks. Our proposed method achieves superior accuracy and calibration, surpassing the self-supervised baseline in a wide range of experiments on a variety of datasets.

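Summary
Self-supervised learning is one of the most promising approaches to acquiring knowledge from limited labeled data. Despite substantial progress in recent years, self-supervised models still pose a challenge to practitioners because they do not readily provide insight into the model's confidence and uncertainty. Tackling this is not easy, mainly because techniques that use the latent representations learned during pre-training without relying on explicit labels are complex to implement. Motivated by this, we introduce a new stochastic vision transformer that integrates uncertainty and distance awareness into self-supervised learning (SSL) pipelines. Instead of the conventional deterministic vector embedding, it encodes image patches into elliptical Gaussian distributional embeddings. We compute the attention matrices with Wasserstein distance-based attention and add a Wasserstein distance-based regularization term during pre-training and fine-tuning, which incorporates distance awareness into the latent representations. We run extensive experiments across tasks including in-distribution generalization, out-of-distribution detection, dataset corruption, semi-supervised settings, and transfer learning. The proposed method achieves higher accuracy and calibration, surpassing the self-supervised baseline across a wide range of experiments on a variety of datasets.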
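
The idea of attention computed from Wasserstein distances between Gaussian embeddings can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it assumes diagonal covariances (so the squared 2-Wasserstein distance has the closed form ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2), and it turns negative distances into weights with a softmax. The function names and the `temperature` parameter are hypothetical.

```python
import math

def w2_squared(mu_a, sigma_a, mu_b, sigma_b):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    mu_* are mean vectors, sigma_* are per-dimension standard deviations.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_a, sigma_b))
    return mean_term + std_term

def wasserstein_attention(mus, sigmas, temperature=1.0):
    """Row-normalised attention weights: closer distributions get more weight."""
    n = len(mus)
    weights = []
    for i in range(n):
        logits = [
            -w2_squared(mus[i], sigmas[i], mus[j], sigmas[j]) / temperature
            for j in range(n)
        ]
        peak = max(logits)                      # subtract max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Three hypothetical patch embeddings; the third lies far from the first two.
patch_means = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
patch_stds = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
attention = wasserstein_attention(patch_means, patch_stds)
```

Each row of `attention` sums to one, and the distant third patch receives almost no weight from the first two.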

Exploring the hierarchical structure of human plans via program generation

paper_url: http://arxiv.org/abs/2311.18644
repo_url: https://github.com/cgc/lightbot-grammar-induction
paper_authors: Carlos G. Correa, Sophia Sanborn, Mark K. Ho, Frederick Callaway, Nathaniel D. Daw, Thomas L. Griffiths
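for: This paper explores how people form hierarchically structured plans, using an experimental paradigm that makes the hierarchical structure observable.
methods: Participants create programs, in a language with explicit hierarchical structure, that produce sequences of actions.
results: People are sensitive to both utility maximization and description length, but they prefer programs with reuse beyond what MDL predicts. The authors extend the MDL account into a generative model over programs, which explains this preference and gives the best prediction of human behavior.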

Abstract
Human behavior is inherently hierarchical, resulting from the decomposition of a task into subtasks or an abstract action into concrete actions. However, behavior is typically measured as a sequence of actions, which makes it difficult to infer its hierarchical structure. In this paper, we explore how people form hierarchically-structured plans, using an experimental paradigm that makes hierarchical representations observable: participants create programs that produce sequences of actions in a language with explicit hierarchical structure. This task lets us test two well-established principles of human behavior: utility maximization (i.e. using fewer actions) and minimum description length (MDL; i.e. having a shorter program). We find that humans are sensitive to both metrics, but that both accounts fail to predict a qualitative feature of human-created programs, namely that people prefer programs with reuse over and above the predictions of MDL. We formalize this preference for reuse by extending the MDL account into a generative model over programs, modeling hierarchy choice as the induction of a grammar over actions. Our account can explain the preference for reuse and provides the best prediction of human behavior, going beyond simple accounts of compressibility to highlight a principle that guides hierarchical planning.

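Summary
Human behavior is inherently hierarchical, resulting from the decomposition of a task into subtasks or of an abstract action into concrete actions. However, behavior is typically measured as a sequence of actions, which makes its hierarchical structure hard to infer. In this paper, we explore how people form hierarchically structured plans, using an experimental paradigm that makes hierarchical representations observable: participants create programs that produce sequences of actions in a language with explicit hierarchical structure. This task lets us test two established principles of human behavior: utility maximization (using fewer actions) and minimum description length (MDL; having a shorter program). We find that people are sensitive to both metrics, but neither account predicts one feature of human-created programs: people prefer programs with reuse beyond what MDL predicts. We extend the MDL account into a generative model over programs, modeling hierarchy choice as the induction of a grammar over actions. Our account explains the preference for reuse and gives the best prediction of human behavior, going beyond simple accounts of compressibility to highlight a principle that guides hierarchical planning.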
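
The two metrics named in the abstract, number of actions and program length, can be made concrete with a toy example. The sketch below is an assumption-laden illustration rather than the paper's model: a program is a hypothetical dictionary of routines, description length is crudely measured as the total number of tokens in all routine bodies, and the action names `walk` and `light` are invented for the example.

```python
def expand(program, name="main"):
    """Flatten a hierarchical program into its sequence of primitive actions."""
    actions = []
    for token in program[name]:
        if token in program:            # token names a subroutine: expand it
            actions.extend(expand(program, token))
        else:                           # token is a primitive action
            actions.append(token)
    return actions

def description_length(program):
    """Crude description length: total tokens across all routine bodies."""
    return sum(len(body) for body in program.values())

# Two programs that produce the same nine actions.
flat = {"main": ["walk", "walk", "light"] * 3}
reuse = {
    "main": ["step", "step", "step"],
    "step": ["walk", "walk", "light"],
}

same_behavior = expand(flat) == expand(reuse)
lengths = (description_length(flat), description_length(reuse))
```

Both programs cost the same number of actions, so utility alone cannot separate them, while the version with a reused subroutine has the shorter description (6 tokens against 9).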

Data-driven prediction of tool wear using Bayesian-regularized artificial neural networks

paper_url: http://arxiv.org/abs/2311.18620
repo_url: None
paper_authors: Tam T. Truong, Jay Airao, Panagiotis Karras, Faramarz Hojati, Bahman Azarhoushang, Ramin Aghababaei
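for: Predicting tool wear to reduce manufacturing costs and improve product quality.
methods: Bayesian Regularized Artificial Neural Networks (BRANNs) are used to precisely predict milling tool wear. BRANNs combine artificial neural networks (ANNs) with Bayesian regularization, so the model learns complex patterns while handling uncertainty and overfitting, giving a more generalized model.
results: An extensive experimental study on four data sets (the NASA Ames milling dataset, the 2010 PHM Data Challenge dataset, the NUAA Ideahouse tool wear dataset, and an in-house Ti6Al4V end-milling dataset) examines the effect of input features, training data size, hidden units, training algorithms, and transfer functions, and shows that the proposed BRANN model outperforms existing state-of-the-art models in accuracy and reliability.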

Abstract
The prediction of tool wear helps minimize costs and enhance product quality in manufacturing. While existing data-driven models using machine learning and deep learning have contributed to the accurate prediction of tool wear, they often lack generality and require substantial training data for high accuracy. In this paper, we propose a new data-driven model that uses Bayesian Regularized Artificial Neural Networks (BRANNs) to precisely predict milling tool wear. BRANNs combine the strengths and leverage the benefits of artificial neural networks (ANNs) and Bayesian regularization, whereby ANNs learn complex patterns and Bayesian regularization handles uncertainty and prevents overfitting, resulting in a more generalized model. We treat both process parameters and monitoring sensor signals as BRANN input parameters. We conducted an extensive experimental study featuring four different experimental data sets, including the NASA Ames milling dataset, the 2010 PHM Data Challenge dataset, the NUAA Ideahouse tool wear dataset, and an in-house performed end-milling of the Ti6Al4V dataset. We inspect the impact of input features, training data size, hidden units, training algorithms, and transfer functions on the performance of the proposed BRANN model and demonstrate that it outperforms existing state-of-the-art models in terms of accuracy and reliability.

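Summary
Predicting tool wear reduces costs and improves product quality in manufacturing. Existing data-driven models based on machine learning and deep learning have contributed to accurate tool wear prediction, but they often lack generality and need large amounts of training data for high accuracy. In this paper, we propose a new data-driven model that uses Bayesian Regularized Artificial Neural Networks (BRANNs) to precisely predict milling tool wear. BRANNs combine artificial neural networks (ANNs) and Bayesian regularization: ANNs learn complex patterns, while Bayesian regularization handles uncertainty and prevents overfitting, resulting in a more generalized model. We treat both process parameters and monitoring sensor signals as BRANN inputs. We conducted an extensive experimental study on the NASA Ames milling dataset, the 2010 PHM Data Challenge dataset, the NUAA Ideahouse tool wear dataset, and an in-house Ti6Al4V end-milling dataset. We examine the impact of input features, training data size, hidden units, training algorithms, and transfer functions on the BRANN model and show that it outperforms existing state-of-the-art models in accuracy and reliability.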

Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing

paper_url: http://arxiv.org/abs/2311.18608
repo_url: https://github.com/HyelinNAM/ContrastiveDenoisingScore
paper_authors: Hyelin Nam, Gihyun Kwon, Geon Yeong Park, Jong Chul Ye
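for: This paper proposes an image editing method based on text-to-image diffusion models that improves the flexibility and controllability of editing.
methods: The method builds on the Score Distillation Sampling (SDS) framework and the rich generative prior of text-to-image diffusion models, adding a CUT loss computed from the self-attention features of the latent diffusion model to keep structure consistent.
results: The method enables zero-shot image-to-image translation and neural radiance field (NeRF) editing while preserving structural details and content controllability. Qualitative results and comparisons demonstrate its effectiveness.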

Abstract
With the remarkable advent of text-to-image diffusion models, image editing methods have become more diverse and continue to evolve. A promising recent approach in this realm is Delta Denoising Score (DDS) - an image editing technique based on the Score Distillation Sampling (SDS) framework that leverages the rich generative prior of text-to-image diffusion models. However, relying solely on the difference between scoring functions is insufficient for preserving specific structural elements from the original image, a crucial aspect of image editing. Inspired by the similarity and importance differences between DDS and the contrastive learning for unpaired image-to-image translation (CUT), here we present an embarrassingly simple yet very powerful modification of DDS, called Contrastive Denoising Score (CDS), for latent diffusion models (LDM). Specifically, to enforce structural correspondence between the input and output while maintaining the controllability of contents, we introduce a straightforward approach to regulate structural consistency using CUT loss within the DDS framework. To calculate this loss, instead of employing auxiliary networks, we utilize the intermediate features of LDM, in particular, those from the self-attention layers, which possess rich spatial information. Our approach enables zero-shot image-to-image translation and neural radiance field (NeRF) editing, achieving a well-balanced interplay between maintaining the structural details and transforming content. Qualitative results and comparisons demonstrate the effectiveness of our proposed method. Project page with code is available at https://hyelinnam.github.io/CDS/.


Joint Detection Algorithm for Multiple Cognitive Users in Spectrum Sensing

paper_url: http://arxiv.org/abs/2311.18599
repo_url: None
paper_authors: Fanfei Meng, Yuxin Wang, Lele Zhang, Yingxin Zhao
for: This paper develops a method for multi-user spectrum sensing based on soft decisions, which detects spectrum resources left unoccupied during idle periods and improves the utilization of scarce information resources.
methods: The paper introduces three common logical circuit decision criteria in hard decisions and analyzes their decision rigor. It also proposes a method for multi-user collaborative sensing based on soft decisions, which significantly reduces false alarm probability and enhances detection probability.
results: The simulated results of multi-user collaborative sensing indicate that the approach effectively detects spectrum resources unoccupied during idle periods, leveraging the concept of time-division multiplexing and rationalizing the redistribution of information resources. The entire computation process relies on the calculation principles of power spectral density in communication theory, involving threshold decision detection for noise power and the sum of noise and signal power.

Abstract
Spectrum sensing technology is a crucial aspect of modern communication technology, serving as one of the essential techniques for efficiently utilizing scarce information resources in tight frequency bands. This paper first introduces three common logical circuit decision criteria in hard decisions and analyzes their decision rigor. Building upon hard decisions, the paper further introduces a method for multi-user spectrum sensing based on soft decisions. Then the paper simulates the false alarm probability and detection probability curves corresponding to the three criteria. The simulated results of multi-user collaborative sensing indicate that the simulation process significantly reduces false alarm probability and enhances detection probability. This approach effectively detects spectrum resources unoccupied during idle periods, leveraging the concept of time-division multiplexing and rationalizing the redistribution of information resources. The entire computation process relies on the calculation principles of power spectral density in communication theory, involving threshold decision detection for noise power and the sum of noise and signal power. It provides a secondary decision detection, reflecting the perceptual decision performance of logical detection methods with relative accuracy.
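
The abstract does not name its three hard-decision criteria. As a hedged illustration, the sketch below assumes they are the AND, OR, and majority fusion rules commonly used in cooperative sensing, and it further assumes that users decide independently with identical per-user probabilities. It covers only the hard-decision baseline, not the paper's soft-decision method, and all names and numbers are hypothetical.

```python
from math import comb

def k_out_of_n(p, n, k):
    """Probability that at least k of n independent users report 'occupied'."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def fuse(p, n):
    """Cooperative probability under three hard-decision fusion rules."""
    return {
        "OR": k_out_of_n(p, n, 1),                 # any single user suffices
        "AND": k_out_of_n(p, n, n),                # every user must agree
        "MAJORITY": k_out_of_n(p, n, n // 2 + 1),  # more than half must agree
    }

# Hypothetical per-user false alarm and detection probabilities.
pf, pd, users = 0.1, 0.9, 5
false_alarm = fuse(pf, users)
detection = fuse(pd, users)
```

Under these assumptions the OR rule raises both probabilities, the AND rule lowers both, and the majority rule lowers the false alarm probability while raising the detection probability, which is the trade-off a comparison of decision criteria examines.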


Generalisable Agents for Neural Network Optimisation

paper_url: http://arxiv.org/abs/2311.18598
repo_url: None
paper_authors: Kale-ab Tessera, Callum Rhys Tilbury, Sasha Abramowitz, Ruan de Kock, Omayma Mahjoub, Benjamin Rosman, Sara Hooker, Arnu Pretorius
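for: Improving the optimisation of deep neural networks, a challenging task because of complex training dynamics, high computational requirements, and long training times.
methods: The authors propose Generalisable Agents for Neural Network Optimisation (GANNO), a multi-agent reinforcement learning approach that improves optimisation by dynamically and responsively scheduling hyperparameters during training. GANNO uses one agent per layer, which observes localised network dynamics and acts to adjust them to improve global performance.
results: GANNO yields useful and responsive hyperparameter schedules, performs robustly across a wide variety of unseen initial conditions, and generalises to harder problems than it was trained on. The work gives an overview of the opportunities of this paradigm and the key challenges that remain.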

Abstract
Optimising deep neural networks is a challenging task due to complex training dynamics, high computational requirements, and long training times. To address this difficulty, we propose the framework of Generalisable Agents for Neural Network Optimisation (GANNO) -- a multi-agent reinforcement learning (MARL) approach that learns to improve neural network optimisation by dynamically and responsively scheduling hyperparameters during training. GANNO utilises an agent per layer that observes localised network dynamics and accordingly takes actions to adjust these dynamics at a layerwise level to collectively improve global performance. In this paper, we use GANNO to control the layerwise learning rate and show that the framework can yield useful and responsive schedules that are competitive with handcrafted heuristics. Furthermore, GANNO is shown to perform robustly across a wide variety of unseen initial conditions, and can successfully generalise to harder problems than it was trained on. Our work presents an overview of the opportunities that this paradigm offers for training neural networks, along with key challenges that remain to be overcome.

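Summary
Optimising deep neural networks is a challenging task because of complex training dynamics, high computational requirements, and long training times. To address this, we propose Generalisable Agents for Neural Network Optimisation (GANNO), a multi-agent reinforcement learning (MARL) approach that learns to improve neural network optimisation by dynamically and responsively scheduling hyperparameters during training. GANNO uses one agent per layer that observes localised network dynamics and adjusts them at a layerwise level to collectively improve global performance. In this paper, we use GANNO to control the layerwise learning rate and show that the framework yields useful and responsive schedules that are competitive with handcrafted heuristics. GANNO also performs robustly across a wide variety of unseen initial conditions and generalises to harder problems than it was trained on. Our work presents an overview of the opportunities this paradigm offers for training neural networks, along with key challenges that remain.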

Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models

paper_url: http://arxiv.org/abs/2311.18592
repo_url: https://github.com/event-ahu/safe_largevlm
paper_authors: Dong Li, Jiandong Jin, Yuhao Zhang, Yanlin Zhong, Yaoyang Wu, Lan Chen, Xiao Wang, Bin Luo
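for: This study proposes a pattern recognition method based on RGB frames and event streams that addresses semantic gaps and small-scale backbone networks.
methods: A novel framework consolidates semantic labels, RGB frames, and event streams. A pre-trained large-scale vision model (CLIP vision encoder) extracts RGB and event features. The semantic labels are converted into language descriptions through prompt engineering and encoded with a pre-trained large-scale language model (CLIP text encoder). Multimodal Transformer networks then integrate the RGB/Event and semantic features, enhanced with self-attention, cross-attention, and feed-forward layers.
results: Extensive experiments on the HARDVS and PokerEvent datasets demonstrate the effectiveness of the proposed SAFE model.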

Abstract
Pattern recognition through the fusion of RGB frames and Event streams has emerged as a novel research area in recent years. Current methods typically employ backbone networks to individually extract the features of RGB frames and event streams, and subsequently fuse these features for pattern recognition. However, we posit that these methods may suffer from key issues like semantic gaps and small-scale backbone networks. In this study, we introduce a novel pattern recognition framework that consolidates the semantic labels, RGB frames, and event streams, leveraging pre-trained large-scale vision-language models. Specifically, given the input RGB frames, event streams, and all the predefined semantic labels, we employ a pre-trained large-scale vision model (CLIP vision encoder) to extract the RGB and event features. To handle the semantic labels, we initially convert them into language descriptions through prompt engineering, and then obtain the semantic features using the pre-trained large-scale language model (CLIP text encoder). Subsequently, we integrate the RGB/Event features and semantic features using multimodal Transformer networks. The resulting frame and event tokens are further amplified using self-attention layers. Concurrently, we propose to enhance the interactions between text tokens and RGB/Event tokens via cross-attention. Finally, we consolidate all three modalities using self-attention and feed-forward layers for recognition. Comprehensive experiments on the HARDVS and PokerEvent datasets fully substantiate the efficacy of our proposed SAFE model. The source code will be made available at https://github.com/Event-AHU/SAFE_LargeVLM.


Continuous 16-bit Training: Accelerating 32-bit Pre-Trained Neural Networks

paper_url: http://arxiv.org/abs/2311.18587
repo_url: None
paper_authors: Juyoung Yun
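for: This study aims to improve the training efficiency and sustainability of deep learning models without sacrificing accuracy.
methods: It proposes continuation training: models previously trained with 32-bit precision continue training with 16-bit precision.
results: Experiments show that the method maintains accuracy while speeding up training and reducing compute and memory use. It is particularly useful where models need periodic updates and refinement.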

Abstract
In the field of deep learning, the prevalence of models initially trained with 32-bit precision is a testament to its robustness and accuracy. However, the continuous evolution of these models often demands further training, which can be resource-intensive. This study introduces a novel approach where we continue the training of these pre-existing 32-bit models using 16-bit precision. This technique not only caters to the need for efficiency in computational resources but also significantly improves the speed of additional training phases. By adopting 16-bit precision for ongoing training, we are able to substantially decrease memory requirements and computational burden, thereby accelerating the training process in a resource-limited setting. Our experiments show that this method maintains the high standards of accuracy set by the original 32-bit training while providing a much-needed boost in training speed. This approach is especially pertinent in today's context, where most models are initially trained in 32-bit and require periodic updates and refinements. The findings from our research suggest that this strategy of 16-bit continuation training can be a key solution for sustainable and efficient deep learning, offering a practical way to enhance pre-trained models rapidly and in a resource-conscious manner.

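Summary
In deep learning, the prevalence of models first trained with 32-bit precision reflects that format's robustness and accuracy. As these models evolve they often need further training, which can be resource-intensive. This study continues the training of existing 32-bit models using 16-bit precision. Doing so meets the need for computational efficiency and markedly speeds up additional training phases: memory requirements and computational load drop, which accelerates training in resource-limited settings. Experiments show that the method keeps the accuracy of the original 32-bit training while training faster. This matters because most models are initially trained in 32-bit and need periodic updates and refinement. The findings suggest that 16-bit continuation training can be a practical route to sustainable, efficient deep learning and to rapid enhancement of pre-trained models.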

Communication-Efficient Heterogeneous Federated Learning with Generalized Heavy-Ball Momentum

paper_url: http://arxiv.org/abs/2311.18578
repo_url: None
paper_authors: Riccardo Zaccone, Carlo Masone, Marco Ciccone
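for: This work addresses the system and statistical challenges of Federated Learning (FL): lowering communication bandwidth and frequency, and handling non-iid data.
methods: The paper proposes a novel generalization of heavy-ball momentum and presents FedHBM, which addresses statistical heterogeneity without introducing communication overhead.
results: Experiments on common FL vision and NLP datasets show that FedHBM yields better model quality and faster convergence than the state of the art, especially in pathological non-iid scenarios. Although designed for cross-silo settings, it is applicable in moderate-to-high cross-device scenarios, and good model initializations (e.g. pre-training) can be exploited for prompt acceleration.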

Abstract
Federated Learning (FL) is the state-of-the-art approach for learning from decentralized data in privacy-constrained scenarios. As the current literature reports, the main problems associated with FL refer to system and statistical challenges: the former ones demand for efficient learning from edge devices, including lowering communication bandwidth and frequency, while the latter require algorithms robust to non-iidness. State-of-art approaches either guarantee convergence at increased communication cost or are not sufficiently robust to handle extreme heterogeneous local distributions. In this work we propose a novel generalization of the heavy-ball momentum, and present FedHBM to effectively address statistical heterogeneity in FL without introducing any communication overhead. We conduct extensive experimentation on common FL vision and NLP datasets, showing that our FedHBM algorithm empirically yields better model quality and higher convergence speed w.r.t. the state-of-art, especially in pathological non-iid scenarios. While being designed for cross-silo settings, we show how FedHBM is applicable in moderate-to-high cross-device scenarios, and how good model initializations (e.g. pre-training) can be exploited for prompt acceleration. Extended experimentation on large-scale real-world federated datasets further corroborates the effectiveness of our approach for real-world FL applications.

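Summary
Federated Learning (FL) is the leading approach for learning from decentralized data in privacy-constrained scenarios. According to the literature, its main problems are system and statistical challenges: the former call for efficient learning from edge devices, including lower communication bandwidth and frequency, while the latter call for algorithms that are robust to non-iid data. Current approaches either guarantee convergence at increased communication cost or are not robust enough to handle extremely heterogeneous local distributions. This work proposes a novel generalization of heavy-ball momentum and presents FedHBM, which addresses statistical heterogeneity in FL without adding any communication overhead. Extensive experiments on common FL vision and NLP datasets show that FedHBM yields better model quality and faster convergence than the state of the art, especially in pathological non-iid scenarios. Although FedHBM is designed for cross-silo settings, it is also applicable in moderate-to-high cross-device scenarios, and good model initializations (e.g. pre-training) can be exploited for prompt acceleration. Further experiments on large-scale real-world federated datasets corroborate the approach's effectiveness in real-world applications.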
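
For readers unfamiliar with the term, the sketch below shows the classical heavy-ball momentum update that the abstract says FedHBM generalizes. It is not the FedHBM algorithm itself, which the abstract does not specify; the toy objective, the function names, and the hyperparameters are all assumptions chosen for illustration.

```python
def heavy_ball_minimize(grad, x0, lr=0.1, beta=0.9, steps=200):
    """Classical heavy-ball iteration:
    x_{t+1} = x_t - lr * grad(x_t) + beta * (x_t - x_{t-1}).
    """
    x_prev = x0
    x = x0
    for _ in range(steps):
        # The momentum term reuses the previous displacement (x - x_prev).
        x_next = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


def toy_grad(x):
    # Gradient of the toy objective f(x) = (x - 3)^2, whose minimizer is x = 3.
    return 2.0 * (x - 3.0)


x_star = heavy_ball_minimize(toy_grad, x0=0.0)
```

On this one-dimensional quadratic the iterate oscillates towards the minimizer at 3. A federated variant would additionally have to decide where the momentum state lives (server or clients), which is the part this sketch leaves out.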

Fingerprint Matching with Localized Deep Representation

paper_url: http://arxiv.org/abs/2311.18576
repo_url: None
paper_authors: Yongjie Duan, Zhiyu Pan, Jianjiang Feng, Jie Zhou
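for: Improving the accuracy and robustness of fixed-length fingerprint representations so that fingerprints can be identified accurately in large-scale databases.
methods: The paper proposes a localized deep representation of fingerprint (LDRF). By focusing on discriminative characteristics within local regions, it gives a more robust and accurate fixed-length representation. LDRF can be adapted to retain information within any valid area, which makes it highly flexible.
results: Experiments on 21 datasets containing over 140K fingerprints show that LDRF outperforms other fixed-length representations and is robust to sensing technologies and impression types. The proposed matching score normalization also effectively reduces the false match rate (FMR) in large-scale identification.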

Abstract
Compared to minutia-based fingerprint representations, fixed-length representations are attractive due to simple and efficient matching. However, fixed-length fingerprint representations are limited in accuracy when matching fingerprints with different visible areas, which can occur due to different finger poses or acquisition methods. To address this issue, we propose a localized deep representation of fingerprint, named LDRF. By focusing on the discriminative characteristics within local regions, LDRF provides a more robust and accurate fixed-length representation for fingerprints with variable visible areas. LDRF can be adapted to retain information within any valid area, making it highly flexible. The matching scores produced by LDRF also exhibit intuitive statistical characteristics, which led us to propose a matching score normalization technique to mitigate the uncertainty in the cases of very small overlapping area. With this new technique, we can maintain a high level of accuracy and reliability in our fingerprint matching, even as the size of the database grows rapidly. Our experimental results on 21 datasets containing over 140K fingerprints of various finger poses and impression types show that LDRF outperforms other fixed-length representations and is robust to sensing technologies and impression types. Besides, the proposed matching score normalization effectively reduces the false match rate (FMR) in large-scale identification experiments comprising over 5.11 million fingerprints. Specifically, this technique results in a reduction of two orders of magnitude compared to matching without matching score normalization and five orders of magnitude compared to prior works.

Search Still Matters: Information Retrieval in the Era of Generative AI

paper_url: http://arxiv.org/abs/2311.18550
repo_url: None
paper_authors: William R. Hersh
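for: This perspective examines how generative AI based on large language models fits into the information retrieval (search) process, especially in academic use.
methods: It explores the use of generative AI in the context of the motivations, considerations, and outcomes of the IR process.
results: Information needs range from simple to complex, and users, particularly academics, care about the authoritativeness, timeliness, and contextualization of search. LLMs may aid the IR process, but search systems and research into improving them remain essential.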

Abstract
Objective: Information retrieval (IR, also known as search) systems are ubiquitous in modern times. How does the emergence of generative artificial intelligence (AI), based on large language models (LLMs), fit into the IR process? Process: This perspective explores the use of generative AI in the context of the motivations, considerations, and outcomes of the IR process with a focus on the academic use of such systems. Conclusions: There are many information needs, from simple to complex, that motivate use of IR. Users of such systems, particularly academics, have concerns for authoritativeness, timeliness, and contextualization of search. While LLMs may provide functionality that aids the IR process, the continued need for search systems, and research into their improvement, remains essential.

Summary
Objective: Information retrieval (IR, also known as search) systems are ubiquitous today. How does generative AI, based on large language models (LLMs), fit into the IR process? Process: This perspective explores the use of generative AI in the context of the motivations, considerations, and outcomes of the IR process, focusing on academic use of such systems. Conclusions: Many information needs, from simple to complex, motivate the use of IR. Users, particularly academics, are concerned with the authoritativeness, timeliness, and contextualization of search. Although LLMs may provide functionality that aids the IR process, search systems and research into improving them remain essential.

Real-Time Vibration-Based Bearing Fault Diagnosis Under Time-Varying Speed Conditions

paper_url: http://arxiv.org/abs/2311.18547
repo_url: None
paper_authors: Tuomas Jalonen, Mohammad Al-Sa’d, Serkan Kiranyaz, Moncef Gabbouj
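for: This study proposes an efficient real-time convolutional neural network (CNN) for diagnosing rolling-element bearing faults under various noise levels and time-varying rotational speeds.
methods: The study introduces a novel Fisher-based spectral separability analysis (SSA) method to explain the effectiveness of the proposed CNN model.
results: In experiments on healthy bearings and bearings with inner race, outer race, and roller ball faults, the model achieves accuracy gains of up to 15.8% over the current state-of-the-art approach, is robust to noise, and runs in real time.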

Abstract
Detection of rolling-element bearing faults is crucial for implementing proactive maintenance strategies and for minimizing the economic and operational consequences of unexpected failures. However, many existing techniques are developed and tested under strictly controlled conditions, limiting their adaptability to the diverse and dynamic settings encountered in practical applications. This paper presents an efficient real-time convolutional neural network (CNN) for diagnosing multiple bearing faults under various noise levels and time-varying rotational speeds. Additionally, we propose a novel Fisher-based spectral separability analysis (SSA) method to elucidate the effectiveness of the designed CNN model. We conducted experiments on both healthy bearings and bearings afflicted with inner race, outer race, and roller ball faults. The experimental results show the superiority of our model over the current state-of-the-art approach in three folds: it achieves substantial accuracy gains of up to 15.8%, it is robust to noise with high performance across various signal-to-noise ratios, and it runs in real-time with processing durations five times less than acquisition. Additionally, by using the proposed SSA technique, we offer insights into the model's performance and underscore its effectiveness in tackling real-world challenges.

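Summary
Detecting rolling-element bearing faults is key to proactive maintenance and to reducing the economic and operational cost of unexpected failures. However, many existing techniques are developed and tested under strictly controlled conditions, which limits their adaptability in practice. This paper presents an efficient real-time convolutional neural network (CNN) for diagnosing multiple bearing faults under various noise levels and time-varying rotational speeds. It also proposes a Fisher-based spectral separability analysis (SSA) method to explain the effectiveness of the designed CNN. Experiments cover healthy bearings and bearings with inner race, outer race, and roller ball faults. The model outperforms the current state-of-the-art approach in three respects: accuracy gains of up to 15.8%, robustness to noise, and real-time operation with processing durations five times shorter than acquisition. The SSA technique additionally offers insight into the model's performance and underscores its effectiveness on real-world challenges.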
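
The abstract does not spell out how the Fisher-based spectral separability analysis is computed. As background only, the sketch below shows the classical Fisher criterion for a single feature, which is the usual starting point for Fisher-style separability measures. The sample values and names are hypothetical, and this is not claimed to be the paper's SSA.

```python
from statistics import mean, pvariance


def fisher_ratio(class_a, class_b):
    """Fisher criterion for one feature:
    (difference of class means)^2 / (sum of within-class variances).
    Assumes at least one class has non-zero variance.
    """
    gap = mean(class_a) - mean(class_b)
    within = pvariance(class_a) + pvariance(class_b)
    return gap * gap / within


# Hypothetical spectral magnitudes at one frequency bin (made-up numbers).
healthy = [0.10, 0.12, 0.11, 0.09]
faulty = [0.50, 0.55, 0.48, 0.52]
overlapping = [0.11, 0.10, 0.13, 0.08]

separable_score = fisher_ratio(healthy, faulty)
ambiguous_score = fisher_ratio(healthy, overlapping)
```

A large ratio means the two conditions are well separated at that bin relative to their spread. A ratio near zero means the bin carries little discriminative information.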

Dataset Distillation via the Wasserstein Metric

paper_url: http://arxiv.org/abs/2311.18531
repo_url: None
paper_authors: Haoyang Liu, Tiancheng Xing, Luwei Li, Vibhu Dalal, Jingrui He, Haohan Wang
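for: This study aims to improve dataset distillation (DD), which condenses large datasets into smaller synthetic versions without sacrificing much model performance.
methods: The paper proposes a new approach based on the Wasserstein distance to capture the essential representation of extensive datasets. It uses the Wasserstein barycenter to quantify distribution differences and embeds the synthetic data into the feature space of pretrained classification models for distribution matching.
results: Extensive tests on several high-resolution datasets confirm the method's effectiveness and adaptability, indicating the promise of Wasserstein metrics in DD.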

Abstract
Dataset distillation (DD) offers a compelling approach in computer vision, with the goal of condensing extensive datasets into smaller synthetic versions without sacrificing much of the model performance. In this paper, we continue to study the methods for DD, by addressing its conceptually core objective: how to capture the essential representation of extensive datasets in smaller, synthetic forms. We propose a novel approach utilizing the Wasserstein distance, a metric rooted in optimal transport theory, to enhance distribution matching in DD. Our method leverages the Wasserstein barycenter, offering a geometrically meaningful way to quantify distribution differences and effectively capture the centroid of a set of distributions. Our approach retains the computational benefits of distribution matching-based methods while achieving new state-of-the-art performance on several benchmarks. To provide useful prior for learning the images, we embed the synthetic data into the feature space of pretrained classification models to conduct distribution matching. Extensive testing on various high-resolution datasets confirms the effectiveness and adaptability of our method, indicating the promising yet unexplored capabilities of Wasserstein metrics in dataset distillation.

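Summary
To provide a useful prior for learning the images, we embed the synthetic data into the feature space of pretrained classification models to conduct distribution matching. Extensive testing on multiple high-resolution datasets confirms the effectiveness and adaptability of our method, indicating the promising yet unexplored capabilities of Wasserstein metrics in dataset distillation.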
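
To make the Wasserstein distance and barycenter mentioned in the abstract concrete, here is a minimal one-dimensional sketch. It assumes equal-size samples with uniform weights, where both quantities reduce to operations on sorted values. The paper works in the feature space of pretrained models, which this toy does not attempt to reproduce, and the function names are illustrative.

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size 1-D empirical samples.
    In one dimension the optimal transport plan matches sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def barycenter_1d(samples):
    """W2 barycenter of equal-size, equally weighted 1-D samples:
    average the sorted values position by position (quantile averaging).
    """
    sorted_samples = [sorted(s) for s in samples]
    n = len(sorted_samples[0])
    return [sum(s[i] for s in sorted_samples) / len(sorted_samples) for i in range(n)]


a = [0.0, 1.0, 2.0]
b = [4.0, 2.0, 3.0]
distance = wasserstein_1d(a, b)
center = barycenter_1d([a, b])
```

Here `distance` is 2.0 (every sorted point moves by 2) and `center` is `[1.0, 2.0, 3.0]`, a sample that sits halfway between the two inputs in the transport sense.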

Fast ODE-based Sampling for Diffusion Models in Around 5 Steps

paper_url: http://arxiv.org/abs/2312.00094
repo_url: None
paper_authors: Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen
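for: The paper proposes a fast sampling algorithm for diffusion models that produces high-quality images with very few function evaluations.
methods: The approach treats sampling as solving an ODE and eliminates truncation errors by directly learning the mean direction of the sampling trajectory (AMED-Solver).
results: With only 5 function evaluations (NFE), the method generates high-quality images, and it can be used as a plugin to improve existing ODE-based samplers.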

Abstract
Sampling from diffusion models can be treated as solving the corresponding ordinary differential equations (ODEs), with the aim of obtaining an accurate solution with as few number of function evaluations (NFE) as possible. Recently, various fast samplers utilizing higher-order ODE solvers have emerged and achieved better performance than the initial first-order one. However, these numerical methods inherently result in certain approximation errors, which significantly degrades sample quality with extremely small NFE (e.g., around 5). In contrast, based on the geometric observation that each sampling trajectory almost lies in a two-dimensional subspace embedded in the ambient space, we propose Approximate MEan-Direction Solver (AMED-Solver) that eliminates truncation errors by directly learning the mean direction for fast diffusion sampling. Besides, our method can be easily used as a plugin to further improve existing ODE-based samplers. Extensive experiments on image synthesis with the resolution ranging from 32 to 256 demonstrate the effectiveness of our method. With only 5 NFE, we achieve 7.14 FID on CIFAR-10, 13.75 FID on ImageNet 64$\times$64, and 12.79 FID on LSUN Bedroom. Our code is available at https://github.com/zhyzhouu/amed-solver.

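Summary
Sampling from diffusion models can be treated as solving the corresponding ordinary differential equations (ODEs), with the aim of obtaining an accurate solution using as few function evaluations (NFE) as possible. Various fast samplers based on higher-order ODE solvers have recently emerged and outperform the initial first-order one. However, these numerical methods incur approximation errors that significantly degrade sample quality when NFE is extremely small (e.g., around 5). Based on the geometric observation that each sampling trajectory lies almost entirely in a two-dimensional subspace embedded in the ambient space, the authors propose the Approximate MEan-Direction Solver (AMED-Solver), which eliminates truncation errors by directly learning the mean direction for fast diffusion sampling. The method can also be used as a plugin for existing ODE-based samplers. Extensive experiments on image synthesis at resolutions from 32 to 256 show that, with only 5 NFE, it achieves 7.14 FID on CIFAR-10, 13.75 FID on ImageNet 64x64, and 12.79 FID on LSUN Bedroom. The code is available on GitHub.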
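
The trade-off described above can be illustrated on a toy problem. The sketch below integrates dx/dt = -x with a plain Euler solver at two NFE budgets, then shows that a single step along the mean direction of the true trajectory lands exactly on the solution. The toy ODE is a stand-in for a real sampling ODE, and the mean direction is computed analytically here, whereas the abstract describes learning it. None of this is the AMED-Solver implementation.

```python
import math


def euler_solve(f, x0, t0, t1, nfe):
    """Integrate dx/dt = f(x, t) with `nfe` Euler steps (one evaluation per step)."""
    h = (t1 - t0) / nfe
    x, t = x0, t0
    for _ in range(nfe):
        x = x + h * f(x, t)
        t += h
    return x


def decay(x, t):
    # Toy ODE dx/dt = -x with exact solution x(t) = x0 * exp(-t).
    return -x


exact = math.exp(-1.0)
err_5 = abs(euler_solve(decay, 1.0, 0.0, 1.0, nfe=5) - exact)
err_50 = abs(euler_solve(decay, 1.0, 0.0, 1.0, nfe=50) - exact)

# Mean direction over [0, 1]: the average slope of the true trajectory.
# It is derived from the known solution here, purely for illustration.
mean_direction = (exact - 1.0) / (1.0 - 0.0)
one_step = 1.0 + (1.0 - 0.0) * mean_direction
```

With 5 evaluations the Euler error is roughly ten times larger than with 50, while the single mean-direction step has no truncation error. The practical difficulty is obtaining that direction without knowing the solution.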

Calibration-free online test-time adaptation for electroencephalography motor imagery decoding

paper_url: http://arxiv.org/abs/2311.18520
repo_url: None
paper_authors: Martin Wimpff, Mario Döbler, Bin Yang
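for: This study explores online test-time adaptation for Brain-Computer Interfaces (BCIs) to improve decoding performance.
methods: The study uses deep learning and adapts the decoding model in an unsupervised fashion at inference time.
results: With different adaptation techniques, decoding performance improves, and because no access to source data is required, privacy is preserved.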

Abstract
Providing a promising pathway to link the human brain with external devices, Brain-Computer Interfaces (BCIs) have seen notable advancements in decoding capabilities, primarily driven by increasingly sophisticated techniques, especially deep learning. However, achieving high accuracy in real-world scenarios remains a challenge due to the distribution shift between sessions and subjects. In this paper we will explore the concept of online test-time adaptation (OTTA) to continuously adapt the model in an unsupervised fashion during inference time. Our approach guarantees the preservation of privacy by eliminating the requirement to access the source data during the adaptation process. Additionally, OTTA achieves calibration-free operation by not requiring any session- or subject-specific data. We will investigate the task of electroencephalography (EEG) motor imagery decoding using a lightweight architecture together with different OTTA techniques like alignment, adaptive batch normalization, and entropy minimization. We examine two datasets and three distinct data settings for a comprehensive analysis. Our adaptation methods produce state-of-the-art results, potentially instigating a shift in transfer learning for BCI decoding towards online adaptation.

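Summary
Brain-Computer Interfaces (BCIs) offer a promising pathway for linking the human brain with external devices. However, achieving high accuracy in real-world scenarios remains a challenge, mainly because of the distribution shift between sessions and subjects. This paper explores online test-time adaptation (OTTA), which adapts the model in an unsupervised fashion during inference. Because the adaptation does not need access to the source data, privacy is preserved. OTTA also enables calibration-free operation, since it requires no session- or subject-specific data. The authors study electroencephalography (EEG) motor imagery decoding using a lightweight architecture together with different OTTA techniques such as alignment, adaptive batch normalization, and entropy minimization. Two datasets and three distinct data settings are analyzed. The adaptation methods produce state-of-the-art results and may shift transfer learning for BCI decoding towards online adaptation.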
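
Two of the OTTA ingredients named in the abstract, adaptive batch normalization and entropy minimization, can be sketched in a few lines. The code below is a generic illustration under assumed settings (a scalar feature, a momentum of 0.1, made-up batches). It is not the architecture or update rule used in the paper.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats: the quantity that entropy minimization drives down."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class AdaptiveNorm:
    """Single-feature normalizer whose statistics are updated from unlabeled test batches."""

    def __init__(self, mean, var, momentum=0.1, eps=1e-5):
        self.mean, self.var = mean, var
        self.momentum, self.eps = momentum, eps

    def adapt(self, batch):
        # Blend the stored (source) statistics with the current test-batch statistics.
        batch_mean = sum(batch) / len(batch)
        batch_var = sum((x - batch_mean) ** 2 for x in batch) / len(batch)
        self.mean = (1 - self.momentum) * self.mean + self.momentum * batch_mean
        self.var = (1 - self.momentum) * self.var + self.momentum * batch_var

    def __call__(self, x):
        return (x - self.mean) / math.sqrt(self.var + self.eps)


norm = AdaptiveNorm(mean=0.0, var=1.0)
for _ in range(100):
    norm.adapt([4.0, 6.0, 5.0, 5.0])  # hypothetical shifted test-session batch

confident = entropy(softmax([10.0, 0.0]))
uncertain = entropy(softmax([0.0, 0.0]))
```

After repeated updates the normalizer tracks the shifted session (mean near 5, variance near 0.5) without any labels or source data. The entropy values show what an entropy-minimization objective rewards: confident predictions have lower entropy than uniform ones.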

Color-Emotion Associations in Art: Fuzzy Approach

paper_url: http://arxiv.org/abs/2311.18518
repo_url: None
paper_authors: Muragul Muratbekova, Pakizar Shamoi
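for: The paper studies the emotions evoked by artworks and the role of color in art.
methods: The paper uses fuzzy sets to classify emotions and processes paintings from the WikiArt dataset that are tagged with emotions.
results: The study finds strong associations between specific emotions and colors: for example, gratitude correlates strongly with green, brown, and orange, and anger with brown. The findings can be applied in art retrieval systems, marketing, design, and other practical areas.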

Abstract
Art objects can evoke certain emotions. Color is a fundamental element of visual art and plays a significant role in how art is perceived. This paper introduces a novel approach to classifying emotions in art using Fuzzy Sets. We employ a fuzzy approach because it aligns well with human judgments' imprecise and subjective nature. Extensive fuzzy colors (n=120) and a broad emotional spectrum (n=10) allow for a more human-consistent and context-aware exploration of emotions inherent in paintings. First, we introduce the fuzzy color representation model. Then, at the fuzzification stage, we process the Wiki Art Dataset of paintings tagged with emotions, extracting fuzzy dominant colors linked to specific emotions. This results in fuzzy color distributions for ten emotions. Finally, we convert them back to a crisp domain, obtaining a knowledge base of color-emotion associations in primary colors. Our findings reveal strong associations between specific emotions and colors; for instance, gratitude strongly correlates with green, brown, and orange. Other noteworthy associations include brown and anger, orange with shame, yellow with happiness, and gray with fear. Using these associations and Jaccard similarity, we can find the emotions in the arbitrary untagged image. We conducted a 2AFC experiment involving human subjects to evaluate the proposed method. The average hit rate of 0.77 indicates a significant correlation between the method's predictions and human perception. The proposed method is simple to adapt to art painting retrieval systems. The study contributes to the theoretical understanding of color-emotion associations in art, offering valuable insights for various practical applications besides art, like marketing, design, and psychology.

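Summary
Art objects can evoke certain emotions. Color is a fundamental element of visual art and plays an important role in how art is perceived. This paper proposes an approach to classifying emotions in art using fuzzy sets, chosen because they match the imprecise and subjective nature of human judgment. With 120 fuzzy colors and 10 emotions, the approach allows a more human-consistent and context-aware exploration of emotions. First, the fuzzy color representation model is introduced. Then, at the fuzzification stage, paintings from the Wiki Art Dataset tagged with emotions are processed to extract fuzzy dominant colors linked to specific emotions, giving a fuzzy color distribution for each emotion. Finally, these are converted back to a crisp domain, yielding a knowledge base of color-emotion associations. The findings reveal strong associations between specific emotions and colors: for example, gratitude correlates strongly with green, brown, and orange. Other notable associations include brown with anger, orange with shame, yellow with happiness, and gray with fear. Using these associations and Jaccard similarity, emotions can be found in untagged images. A 2AFC experiment with human subjects gave an average hit rate of 0.77, indicating a significant correlation between the method's predictions and human perception. The method is simple to adapt to art painting retrieval systems. The study contributes to the theoretical understanding of color-emotion associations in art and is of value to practical applications such as marketing, design, and psychology.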
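
The matching step described above, comparing an untagged image's colors with per-emotion color associations via Jaccard similarity, might look roughly like the sketch below. It uses the common fuzzy-set form of Jaccard similarity (sum of minima over sum of maxima). Whether the paper uses exactly this form is an assumption, and the knowledge base values are invented for illustration.

```python
def fuzzy_jaccard(a, b):
    """Jaccard similarity for fuzzy sets given as {element: membership} dicts:
    sum of element-wise minima divided by sum of element-wise maxima.
    """
    keys = set(a) | set(b)
    intersection = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    union = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return intersection / union if union else 0.0


# Hypothetical color-emotion knowledge base (membership degrees are made up).
knowledge_base = {
    "gratitude": {"green": 0.5, "brown": 0.3, "orange": 0.2},
    "fear": {"gray": 0.7, "brown": 0.3},
}


def closest_emotion(image_colors):
    # Pick the emotion whose color profile is most similar to the image's.
    return max(knowledge_base, key=lambda e: fuzzy_jaccard(image_colors, knowledge_base[e]))


image = {"green": 0.6, "orange": 0.3, "gray": 0.1}
predicted = closest_emotion(image)
```

For this made-up image the green and orange memberships overlap mostly with the "gratitude" profile, so that emotion is returned.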

Adaptive Multi-Modality Prompt Learning

paper_url: http://arxiv.org/abs/2312.00823
repo_url: None
paper_authors: Zongqian Wu, Yujing Liu, Mengmeng Zhan, Jialie Shen, Ping Hu, Xiaofeng Zhu
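for: The paper proposes an adaptive multi-modality prompt learning method to address limitations of existing prompt learning: ignoring the adverse impact of meaningless patches in images, and not considering in-sample and out-of-sample generalization simultaneously.
methods: The paper employs previous text prompt learning together with a new image prompt learning. The image prompt learning first masks meaningless patches and then pads them with learnable parameters and information from the text. Each prompt also provides auxiliary information to the other, further strengthening both kinds of generalization.
results: Experiments on real datasets show that the method outperforms SOTA methods on different downstream tasks.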

Abstract
Although current prompt learning methods have successfully been designed to effectively reuse the large pre-trained models without fine-tuning their large number of parameters, they still have limitations to be addressed, i.e., without considering the adverse impact of meaningless patches in every image and without simultaneously considering in-sample generalization and out-of-sample generalization. In this paper, we propose an adaptive multi-modality prompt learning to address the above issues. To do this, we employ previous text prompt learning and propose a new image prompt learning. The image prompt learning achieves in-sample and out-of-sample generalization, by first masking meaningless patches and then padding them with the learnable parameters and the information from texts. Moreover, each of the prompts provides auxiliary information to each other, further strengthening these two kinds of generalization. Experimental results on real datasets demonstrate that our method outperforms SOTA methods, in terms of different downstream tasks.

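Summary
Current prompt learning methods successfully reuse large pre-trained models without fine-tuning their many parameters, but limitations remain: they do not consider the adverse impact of meaningless patches in every image, and they do not consider in-sample and out-of-sample generalization simultaneously. This paper proposes adaptive multi-modality prompt learning to address these issues. It employs previous text prompt learning and proposes a new image prompt learning. The image prompt learning achieves in-sample and out-of-sample generalization by first masking meaningless patches and then padding them with learnable parameters and information from the texts. In addition, each prompt provides auxiliary information to the other, further strengthening both kinds of generalization. Experiments on real datasets show that the method outperforms SOTA methods on different downstream tasks.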

ZeST-NeRF: Using temporal aggregation for Zero-Shot Temporal NeRFs

paper_url: http://arxiv.org/abs/2311.18491
repo_url: None
paper_authors: Violeta Menéndez González, Andrew Gilbert, Graeme Phillipson, Stephen Jolly, Simon Hadfield
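for: This study proposes a new technique for video editing that produces temporal NeRFs for new scenes.
methods: The study uses ZeST-NeRF, which generates temporal NeRFs for new scenes without retraining.
results: ZeST-NeRF accurately reconstructs novel views. Compared with existing approaches, it improves quantitative results by 15% and produces significantly better visual results.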

Abstract
In the field of media production, video editing techniques play a pivotal role. Recent approaches have had great success at performing novel view image synthesis of static scenes. But adding temporal information adds an extra layer of complexity. Previous models have focused on implicitly representing static and dynamic scenes using NeRF. These models achieve impressive results but are costly at training and inference time. They overfit an MLP to describe the scene implicitly as a function of position. This paper proposes ZeST-NeRF, a new approach that can produce temporal NeRFs for new scenes without retraining. We can accurately reconstruct novel views using multi-view synthesis techniques and scene flow-field estimation, trained only with unrelated scenes. We demonstrate how existing state-of-the-art approaches from a range of fields cannot adequately solve this new task and demonstrate the efficacy of our solution. The resulting network improves quantitatively by 15% and produces significantly better visual results.

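Summary
In media production, video editing techniques play a pivotal role. Recent approaches have been very successful at novel view synthesis of static scenes, but adding temporal information adds complexity. Previous models represent static and dynamic scenes implicitly with NeRF. They achieve impressive results but are costly at training and inference time. This paper proposes ZeST-NeRF, a new approach that can produce temporal NeRFs for new scenes without retraining. Novel views are reconstructed accurately using multi-view synthesis techniques and scene flow-field estimation, trained only on unrelated scenes. The authors show that existing state-of-the-art approaches cannot adequately solve this new task and demonstrate the efficacy of their solution. The resulting network improves quantitatively by 15% and produces better visual results.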

New Perspectives on the Evaluation of Link Prediction Algorithms for Dynamic Graphs

paper_url: http://arxiv.org/abs/2311.18486
repo_url: https://github.com/aida-ugent/dlp_viz
paper_authors: Raphaël Romero, Tijl De Bie, Jefrey Lijffijt
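for: This study catalogs the possibilities for negative sampling in dynamic link prediction and introduces novel visualization methods to evaluate the effect of negative sampling on predictive performance.
methods: The study considers several negative sampling schemes and new visualization tools for assessing prediction performance and the dynamics of temporal networks.
results: The study finds that prediction error is typically not evenly distributed across data segments, and that the visualization tools reveal how prediction performance varies over time.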

Abstract
There is a fast-growing body of research on predicting future links in dynamic networks, with many new algorithms. Some benchmark data exists, and performance evaluations commonly rely on comparing the scores of observed network events (positives) with those of randomly generated ones (negatives). These evaluation measures depend on both the predictive ability of the model and, crucially, the type of negative samples used. Besides, as generally the case with temporal data, prediction quality may vary over time. This creates a complex evaluation space. In this work, we catalog the possibilities for negative sampling and introduce novel visualization methods that can yield insight into prediction performance and the dynamics of temporal networks. We leverage these visualization tools to investigate the effect of negative sampling on the predictive performance, at the node and edge level. We validate empirically, on datasets extracted from recent benchmarks that the error is typically not evenly distributed across different data segments. Finally, we argue that such visualization tools can serve as powerful guides to evaluate dynamic link prediction methods at different levels.

Summary
A fast-growing body of research addresses the prediction of future links in dynamic networks, with many new algorithms. Some benchmark data exists, and performance evaluations commonly compare the scores of observed network events (positives) with those of randomly generated ones (negatives). These evaluation measures depend on both the model's predictive ability and, crucially, the type of negative samples used. As is generally the case with temporal data, prediction quality may also vary over time, which creates a complex evaluation space. This work catalogs the possibilities for negative sampling and introduces novel visualization methods that give insight into prediction performance and the dynamics of temporal networks. The visualization tools are used to investigate the effect of negative sampling on predictive performance at the node and edge level. Experiments on datasets extracted from recent benchmarks confirm that the error is typically not evenly distributed across data segments. Finally, the authors argue that such visualization tools can serve as powerful guides for evaluating dynamic link prediction methods at different levels.
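
To illustrate why the choice of negatives matters, the sketch below contrasts two sampling schemes for a single time step: uniformly random node pairs, and edges that were observed earlier but are absent now. Both functions and their names are illustrative assumptions. The abstract does not list which schemes the paper catalogs, so the second scheme in particular should be read as an example rather than as the paper's definition.

```python
import random


def random_negatives(nodes, positives, k, rng):
    """Uniformly sample k distinct ordered node pairs that are not positive edges.
    Assumes k does not exceed the number of available non-edges.
    """
    positives = set(positives)
    negatives = set()
    while len(negatives) < k:
        u, v = rng.sample(nodes, 2)
        if (u, v) not in positives:
            negatives.add((u, v))
    return sorted(negatives)


def historical_negatives(past_edges, positives):
    """Edges seen at earlier time steps that are absent at the current step."""
    return sorted(set(past_edges) - set(positives))


nodes = [0, 1, 2, 3, 4]
current_positives = [(1, 2), (3, 4)]
past_edges = [(0, 1), (1, 2), (2, 3)]

easy = random_negatives(nodes, current_positives, k=3, rng=random.Random(0))
hard = historical_negatives(past_edges, current_positives)
```

A model that simply memorizes past edges would score the random negatives low but could score the historical ones high, so the same model may look strong or weak depending on which negatives the evaluation uses.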

ESG Accountability Made Easy: DocQA at Your Service

paper_url: http://arxiv.org/abs/2311.18481
repo_url: None
paper_authors: Lokesh Mishra, Cesar Berrospi, Kasper Dinkla, Diego Antognini, Francesco Fusco, Benedikt Bothur, Maksym Lysak, Nikolaos Livathinos, Ahmed Nassar, Panagiotis Vagenas, Lucas Morin, Christoph Auer, Michele Dolfi, Peter Staar
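for: This work develops a question-answering conversational assistant, built on several AI technologies, to make information extraction from documents more efficient.
methods: The system uses computer vision to convert documents into a machine-readable format, natural language processing to find relevant data, and large language models to formulate an eloquent response.
results: Users can explore over 10,000 environmental, social, and governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io.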

Abstract
We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines consisting of document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io.

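Summary
We present Deep Search DocQA, an application that extracts information from documents through a question-answering conversational assistant. The system combines technologies from different AI disciplines: document conversion to a machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 environmental, social, and governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io.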

Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework

paper_url: http://arxiv.org/abs/2311.18460
repo_url: None
paper_authors: Maresa Schröder, Dennis Frauen, Stefan Feuerriegel
for: This paper focuses on ensuring fairness in machine learning predictions, specifically in situations where there is unobserved confounding.
methods: The paper proposes a novel neural framework for learning fair predictions, which includes deriving bounds for causal fairness metrics under different sources of unobserved confounding.
results: The paper demonstrates the effectiveness of its framework in a series of experiments, including a real-world case study about predicting prison sentences. The paper also offers worst-case guarantees of the extent to which causal fairness can be violated due to unobserved confounding.

Abstract
Fairness for machine learning predictions is widely required in practice for legal, ethical, and societal reasons. Existing work typically focuses on settings without unobserved confounding, even though unobserved confounding can lead to severe violations of causal fairness and, thus, unfair predictions. In this work, we analyze the sensitivity of causal fairness to unobserved confounding. Our contributions are three-fold. First, we derive bounds for causal fairness metrics under different sources of unobserved confounding. This enables practitioners to examine the sensitivity of their machine learning models to unobserved confounding in fairness-critical applications. Second, we propose a novel neural framework for learning fair predictions, which allows us to offer worst-case guarantees of the extent to which causal fairness can be violated due to unobserved confounding. Third, we demonstrate the effectiveness of our framework in a series of experiments, including a real-world case study about predicting prison sentences. To the best of our knowledge, ours is the first work to study causal fairness under unobserved confounding. To this end, our work is of direct practical value as a refutation strategy to ensure the fairness of predictions in high-stakes applications.

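Summary
Fairness of machine learning predictions is widely required in practice for legal, ethical, and societal reasons. Existing work typically assumes no unobserved confounding, even though unobserved confounding can lead to severe violations of causal fairness and thus to unfair predictions. This work analyzes the sensitivity of causal fairness to unobserved confounding. The contributions are three-fold. First, the authors derive bounds for causal fairness metrics under unobserved confounding, which lets practitioners examine how sensitive their machine learning models are to it. Second, they propose a novel neural framework for learning fair predictions that offers worst-case guarantees on the extent to which causal fairness can be violated due to unobserved confounding. Third, they demonstrate the framework's effectiveness in a series of experiments, including a real-world case study on predicting prison sentences. To the best of their knowledge, this is the first work to study causal fairness under unobserved confounding. The work is therefore of direct practical value as a refutation strategy for ensuring fair predictions in high-stakes applications.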

Multiple Disciplinary Data Work Practices in Artificial Intelligence Research: a Healthcare Case Study in the UK

paper_url: http://arxiv.org/abs/2311.18424
repo_url: None
paper_authors: Rafael Henkin, Elizabeth Remfry, Duncan J. Reynolds, Megan Clinch, Michael R. Barnes
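for: This study explores how knowledge is shared and tensions are resolved across disciplines during the development of AI tools for healthcare.
methods: The study applies an inductive thematic analysis to 13 semi-structured interviews with participants in a large research consortium, examining their work practices.
results: Multiple disciplinarity heavily impacts work practices: participants had to learn the languages and tools of other disciplines to communicate and share knowledge with people from different backgrounds. Large health datasets also restricted work practices. Meetings emerged as a key platform for sharing knowledge, and the study offers design implications for data science and collaborative tools.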

Abstract
Developing artificial intelligence (AI) tools for healthcare is a multiple disciplinary effort, bringing data scientists, clinicians, patients and other disciplines together. In this paper, we explore the AI development workflow and how participants navigate the challenges and tensions of sharing and generating knowledge across disciplines. Through an inductive thematic analysis of 13 semi-structured interviews with participants in a large research consortia, our findings suggest that multiple disciplinarity heavily impacts work practices. Participants faced challenges to learn the languages of other disciplines and needed to adapt the tools used for sharing and communicating with their audience, particularly those from a clinical or patient perspective. Large health datasets also posed certain restrictions on work practices. We identified meetings as a key platform for facilitating exchanges between disciplines and allowing for the blending and creation of knowledge. Finally, we discuss design implications for data science and collaborative tools, and recommendations for future research.

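Summary
Developing artificial intelligence (AI) tools for healthcare is a multiple disciplinary effort that brings together data scientists, clinicians, patients, and other disciplines. This paper explores the AI development workflow and how participants navigate the challenges and tensions of sharing and generating knowledge across disciplines. Through an inductive thematic analysis of 13 semi-structured interviews, the authors find that multiple disciplinarity affects work practices. Participants had to learn the languages of other disciplines and adapt the tools they used to communicate with their audience, particularly those with a clinical or patient perspective. Large health datasets also restricted work practices. Meetings were a key platform for exchange between disciplines, allowing knowledge to be blended and created. Finally, the paper discusses design implications for data science and collaborative tools, and directions for future research.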

Corrupting Convolution-based Unlearnable Datasets with Pixel-based Image Transformations

paper_url: http://arxiv.org/abs/2311.18403
repo_url: None
paper_authors: Xianlong Wang, Shengshan Hu, Minghui Li, Zhifei Yu, Ziqi Zhou, Leo Yu Zhang, Hai Jin
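for: Defending against unlearnable datasets (UDs), which cause a drastic drop in the generalization performance of models trained on them.
methods: The paper expresses a convolution-based unlearnable sample, in a simplified scenario, as a matrix multiplied by a clean sample, and uses this to study how such UDs work. A random matrix is then used to strengthen the defense.
results: Validation experiments show that the method successfully defends against convolution-based UDs and is also the most effective defense against two newly designed convolution-based UDs.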

Abstract
Unlearnable datasets lead to a drastic drop in the generalization performance of models trained on them by introducing elaborate and imperceptible perturbations into clean training sets. Many existing defenses, e.g., JPEG compression and adversarial training, effectively counter UDs based on norm-constrained additive noise. However, a fire-new type of convolution-based UDs have been proposed and render existing defenses all ineffective, presenting a greater challenge to defenders. To address this, we express the convolution-based unlearnable sample as the result of multiplying a matrix by a clean sample in a simplified scenario, and formalize the intra-class matrix inconsistency as $\Theta_{imi}$, inter-class matrix consistency as $\Theta_{imc}$ to investigate the working mechanism of the convolution-based UDs. We conjecture that increasing both of these metrics will mitigate the unlearnability effect. Through validation experiments that commendably support our hypothesis, we further design a random matrix to boost both $\Theta_{imi}$ and $\Theta_{imc}$, achieving a notable degree of defense effect. Hence, by building upon and extending these facts, we first propose a brand-new image COrruption that employs randomly multiplicative transformation via INterpolation operation to successfully defend against convolution-based UDs. Our approach leverages global pixel random interpolations, effectively suppressing the impact of multiplicative noise in convolution-based UDs. Additionally, we have also designed two new forms of convolution-based UDs, and find that our defense is the most effective against them.

Summary
Unlearnable datasets add elaborate, imperceptible perturbations to clean training sets and sharply reduce the generalization of models trained on them. Existing defenses such as JPEG compression and adversarial training counter UDs built on norm-constrained additive noise but fail against the newer convolution-based UDs. The paper models a convolution-based unlearnable sample as a matrix multiplied by a clean sample, defines $\Theta_{imi}$ (intra-class matrix inconsistency) and $\Theta_{imc}$ (inter-class matrix consistency), and conjectures that raising both weakens the unlearnability effect. Building on validation experiments with a random matrix, it proposes an image corruption based on random multiplicative transformation via interpolation, which also proves the most effective defense against two newly designed convolution-based UDs.

MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation

paper_url: http://arxiv.org/abs/2311.18331
repo_url: None
paper_authors: Sumanth Udupa, Prajwal Gurunath, Aniruddh Sikdar, Suresh Sundaram
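for: Improving semantic segmentation on unseen real-world target domains when training uses only single-source simulated data, which lacks style diversity.
methods: Proposes MultiResolution Feature Perturbation (MRFP), which randomizes domain-specific fine-grained features and perturbs the style of coarse features.
results: Experiments on several urban-scene segmentation datasets indicate that perturbing fine-feature components, together with style information, is key to learning domain-invariant robust feature maps. MRFP adds no learnable parameters or objective functions.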

Abstract
Deep neural networks have shown exemplary performance on semantic scene understanding tasks on source domains, but due to the absence of style diversity during training, enhancing performance on unseen target domains using only single source domain data remains a challenging task. Generation of simulated data is a feasible alternative to retrieving large style-diverse real-world datasets as it is a cumbersome and budget-intensive process. However, the large domain-specific inconsistencies between simulated and real-world data pose a significant generalization challenge in semantic segmentation. In this work, to alleviate this problem, we propose a novel MultiResolution Feature Perturbation (MRFP) technique to randomize domain-specific fine-grained features and perturb style of coarse features. Our experimental results on various urban-scene segmentation datasets clearly indicate that, along with the perturbation of style-information, perturbation of fine-feature components is paramount to learn domain invariant robust feature maps for semantic segmentation models. MRFP is a simple and computationally efficient, transferable module with no additional learnable parameters or objective functions, that helps state-of-the-art deep neural networks to learn robust domain invariant features for simulation-to-real semantic segmentation.
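As an illustration of what perturbing style can mean, the sketch below treats the mean and standard deviation of a feature channel as its style, a common convention in style-transfer work, and jitters them while leaving the normalized content unchanged. The abstract does not give MRFP's formulas, so the function `perturb_style` and its `strength` parameter are hypothetical and this is not the authors' module.

```python
import math
import random
from statistics import fmean, pstdev

def perturb_style(channel, strength=0.1, rng=random):
    """Jitter the mean and std of one feature channel, keeping its normalized shape."""
    mu, sigma = fmean(channel), pstdev(channel)
    if sigma == 0.0:
        return list(channel)             # constant channel: nothing to rescale
    new_mu = mu + sigma * rng.gauss(0.0, strength)
    new_sigma = sigma * math.exp(rng.gauss(0.0, strength))   # stays positive
    return [new_sigma * (x - mu) / sigma + new_mu for x in channel]

rng = random.Random(0)                   # fixed seed, for repeatability only
features = [0.5, 1.5, 2.0, 4.0]          # hypothetical activations of one channel
perturbed = perturb_style(features, rng=rng)
```

Like the module described in the abstract, this perturbation has no learnable parameters: it only rescales existing activations with random statistics during training.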

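Summary
Deep neural networks perform well on source domains, but improving performance on unseen target domains from single-source data remains hard because training lacks style diversity. Simulated data is a feasible alternative to collecting large, style-diverse real-world datasets, yet the domain-specific gap between simulated and real data hinders generalization in semantic segmentation. MRFP randomizes domain-specific fine-grained features and perturbs the style of coarse features. It is a simple, computationally efficient, transferable module with no additional learnable parameters or objective functions, and it helps state-of-the-art networks learn robust domain-invariant features for simulation-to-real segmentation.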

Advances in 3D Neural Stylization: A Survey

paper_url: http://arxiv.org/abs/2311.18328
repo_url: https://github.com/chenyingshu/advances_3d_neural_stylization
paper_authors: Yingshu Chen, Guocheng Shao, Ka Chun Shum, Binh-Son Hua, Sai-Kit Yeung
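for: Surveying recent advances in neural stylization for 3D data, within the broader family of visual style transfer methods used to edit images, videos, and 3D data.
methods: Provides a taxonomy organized by key design choices (scene representation, guidance data, optimization strategies, output styles), revisits 2D neural stylization as background, and includes a mini-benchmark of artistic stylization methods.
results: Drawing on the survey, the authors discuss open challenges, future research directions, and potential applications and impacts of neural stylization.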

Abstract
Modern artificial intelligence provides a novel way of producing digital art in styles. The expressive power of neural networks enables the realm of visual style transfer methods, which can be used to edit images, videos, and 3D data to make them more artistic and diverse. This paper reports on recent advances in neural stylization for 3D data. We provide a taxonomy for neural stylization by considering several important design choices, including scene representation, guidance data, optimization strategies, and output styles. Building on such taxonomy, our survey first revisits the background of neural stylization on 2D images, and then provides in-depth discussions on recent neural stylization methods for 3D data, where we also provide a mini-benchmark on artistic stylization methods. Based on the insights gained from the survey, we then discuss open challenges, future research, and potential applications and impacts of neural stylization.

Summary
Modern AI offers a new way to produce stylized digital art: neural style transfer methods can edit images, videos, and 3D data to make them more artistic and diverse. The paper reports recent advances in neural stylization for 3D data and organizes them in a taxonomy covering scene representation, guidance data, optimization strategies, and output styles. It first revisits neural stylization on 2D images, then discusses recent 3D methods in depth and provides a mini-benchmark of artistic stylization methods. It closes with open challenges, future research, and potential applications and impacts.

Generative Artificial Intelligence in Learning Analytics: Contextualising Opportunities and Challenges through the Learning Analytics Cycle

paper_url: http://arxiv.org/abs/2312.00087
repo_url: None
paper_authors: Lixiang Yan, Roberto Martinez-Maldonado, Dragan Gašević
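for: Examining the opportunities and challenges that generative AI (GenAI) poses for advancing learning analytics (LA), and how GenAI fits within the LA cycle.
methods: Gives a concise overview of the current GenAI landscape (large language models and diffusion models such as ChatGPT and Midjourney) and contextualises its potential roles within Clow's generic framework of the LA cycle.
results: Posits that GenAI can play pivotal roles in analysing unstructured data, generating synthetic learner data, enriching multimodal learner interactions, advancing interactive and explanatory analytics, and facilitating personalisation and adaptive interventions.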

Abstract
Generative artificial intelligence (GenAI), exemplified by ChatGPT, Midjourney, and other state-of-the-art large language models and diffusion models, holds significant potential for transforming education and enhancing human productivity. While the prevalence of GenAI in education has motivated numerous research initiatives, integrating these technologies within the learning analytics (LA) cycle and their implications for practical interventions remain underexplored. This paper delves into the prospective opportunities and challenges GenAI poses for advancing LA. We present a concise overview of the current GenAI landscape and contextualise its potential roles within Clow's generic framework of the LA cycle. We posit that GenAI can play pivotal roles in analysing unstructured data, generating synthetic learner data, enriching multimodal learner interactions, advancing interactive and explanatory analytics, and facilitating personalisation and adaptive interventions. As the lines blur between learners and GenAI tools, a renewed understanding of learners is needed. Future research can delve deep into frameworks and methodologies that advocate for human-AI collaboration. The LA community can play a pivotal role in capturing data about human and AI contributions and exploring how they can collaborate most effectively. As LA advances, it is essential to consider the pedagogical implications and broader socioeconomic impact of GenAI for ensuring an inclusive future.

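Summary
Generative AI (GenAI), exemplified by ChatGPT, Midjourney, and other large language and diffusion models, has significant potential to transform education and raise human productivity. Its integration into the learning analytics (LA) cycle, and the implications for practical interventions, remain underexplored. The paper gives a concise overview of the GenAI landscape and places its potential roles within Clow's generic framework of the LA cycle: analysing unstructured data, generating synthetic learner data, enriching multimodal learner interactions, advancing interactive and explanatory analytics, and facilitating personalisation and adaptive interventions. As the line between learners and GenAI tools blurs, the authors call for a renewed understanding of learners, for research on human-AI collaboration, and for attention to the pedagogical implications and broader socioeconomic impact of GenAI.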

TrustMark: Universal Watermarking for Arbitrary Resolution Images

paper_url: http://arxiv.org/abs/2311.18297
repo_url: None
paper_authors: Tu Bui, Shruti Agarwal, John Collomosse
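for: Imperceptible digital watermarking for copyright protection, misinformation prevention, and responsible generative AI.
methods: Proposes TrustMark, a GAN-based watermarking method with a novel architecture and spatio-spectral losses that balance watermarked image quality against watermark recovery accuracy; also introduces TrustMark-RM, a watermark remover useful for re-watermarking.
results: Achieves state-of-the-art performance on 3 benchmarks comprising arbitrary-resolution images.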

Abstract
Imperceptible digital watermarking is important in copyright protection, misinformation prevention, and responsible generative AI. We propose TrustMark - a GAN-based watermarking method with novel design in architecture and spatio-spectra losses to balance the trade-off between watermarked image quality with the watermark recovery accuracy. Our model is trained with robustness in mind, withstanding various in- and out-place perturbations on the encoded image. Additionally, we introduce TrustMark-RM - a watermark remover method useful for re-watermarking. Our methods achieve state-of-art performance on 3 benchmarks comprising arbitrary resolution images.

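Summary
Imperceptible digital watermarking matters for copyright protection, misinformation prevention, and responsible generative AI. TrustMark is a GAN-based watermarking method whose architecture and spatio-spectral losses balance watermarked image quality against watermark recovery accuracy. The model is trained for robustness and withstands various in- and out-place perturbations of the encoded image. TrustMark-RM, a watermark remover, supports re-watermarking. Both methods achieve state-of-the-art performance on 3 benchmarks comprising arbitrary-resolution images.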

Perceptual Group Tokenizer: Building Perception with Iterative Grouping

paper_url: http://arxiv.org/abs/2311.18296
repo_url: None
paper_authors: Zhiwei Deng, Ting Chen, Yang Li
for: Proposing a neural visual recognition backbone built entirely on perceptual grouping operations and trained with self-supervised representation learning.
methods: The model relies entirely on grouping operations to extract visual features; a series of grouping operations iteratively hypothesizes the context for pixels or superpixels to refine feature representations.
results: Achieves 80.3% on the ImageNet-1K self-supervised learning benchmark with linear probe evaluation, and offers adaptive computation without re-training as well as interpretability.

Abstract
Human visual recognition system shows astonishing capability of compressing visual information into a set of tokens containing rich representations without label supervision. One critical driving principle behind it is perceptual grouping. Despite being widely used in computer vision in the early 2010s, it remains a mystery whether perceptual grouping can be leveraged to derive a neural visual recognition backbone that generates as powerful representations. In this paper, we propose the Perceptual Group Tokenizer, a model that entirely relies on grouping operations to extract visual features and perform self-supervised representation learning, where a series of grouping operations are used to iteratively hypothesize the context for pixels or superpixels to refine feature representations. We show that the proposed model can achieve competitive performance compared to state-of-the-art vision architectures, and inherits desirable properties including adaptive computation without re-training, and interpretability. Specifically, Perceptual Group Tokenizer achieves 80.3% on ImageNet-1K self-supervised learning benchmark with linear probe evaluation, marking a new progress under this paradigm.

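Summary
The human visual recognition system compresses visual information into a set of tokens with rich representations, without label supervision, and perceptual grouping is one of its key driving principles. Although grouping was widely used in computer vision in the early 2010s, it has been unclear whether it can yield a neural visual recognition backbone with comparably powerful representations. The Perceptual Group Tokenizer relies entirely on grouping operations to extract visual features and perform self-supervised representation learning. It is competitive with state-of-the-art vision architectures, supports adaptive computation without re-training, is interpretable, and reaches 80.3% on the ImageNet-1K self-supervised benchmark with linear probe evaluation.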

Non-Cross Diffusion for Semantic Consistency

paper_url: http://arxiv.org/abs/2312.00820
repo_url: None
paper_authors: Ziyang Zheng, Ruiyuan Gao, Qiang Xu
for: Addressing the challenge of semantic inconsistencies in diffusion models, particularly in applications such as image editing and interpolation.
methods: Introducing 'Non-Cross Diffusion', a novel approach to generative modeling that incorporates an ascending dimension of input to ensure enhanced semantic consistency throughout the inference process.
results: Demonstrating the effectiveness of Non-Cross Diffusion through empirical results, including a substantial reduction in semantic inconsistencies and a notable enhancement in the overall performance of diffusion models.

Abstract
In diffusion models, deviations from a straight generative flow are a common issue, resulting in semantic inconsistencies and suboptimal generations. To address this challenge, we introduce `Non-Cross Diffusion', an innovative approach in generative modeling for learning ordinary differential equation (ODE) models. Our methodology strategically incorporates an ascending dimension of input to effectively connect points sampled from two distributions with uncrossed paths. This design is pivotal in ensuring enhanced semantic consistency throughout the inference process, which is especially critical for applications reliant on consistent generative flows, including various distillation methods and deterministic sampling, which are fundamental in image editing and interpolation tasks. Our empirical results demonstrate the effectiveness of Non-Cross Diffusion, showing a substantial reduction in semantic inconsistencies at different inference steps and a notable enhancement in the overall performance of diffusion models.

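Summary
In diffusion models, deviations from a straight generative flow are common and lead to semantic inconsistencies and suboptimal generations. Non-Cross Diffusion is an approach for learning ordinary differential equation (ODE) models that adds an ascending dimension of input so that points sampled from two distributions are connected by uncrossed paths. This improves semantic consistency throughout inference, which matters for distillation methods and deterministic sampling, both fundamental to image editing and interpolation. Experiments show fewer semantic inconsistencies at different inference steps and better overall performance.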

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

paper_url: http://arxiv.org/abs/2311.18259
repo_url: None
paper_authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
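for: Presenting a diverse, large-scale multimodal multiview video dataset and benchmark challenge to advance first-person video understanding of skilled human activity.
methods: Simultaneously captured egocentric and exocentric video, accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including expert commentary from coaches and teachers.
results: More than 800 participants from 13 cities recorded 1,422 hours of video across 131 natural scene contexts. The release includes benchmark tasks and annotations for fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose.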

Abstract
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.

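Summary
Ego-Exo4D is a diverse, large-scale multimodal multiview video dataset and benchmark challenge built around simultaneously captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities performed these activities in 131 natural scene contexts, yielding captures of 1 to 42 minutes each and 1,422 hours of video in total. The video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including a new "expert commentary" by coaches and teachers. Benchmark tasks and annotations cover fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose, and all resources will be open sourced.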

Sketch Input Method Editor: A Comprehensive Dataset and Methodology for Systematic Input Recognition

paper_url: http://arxiv.org/abs/2311.18254
repo_url: None
paper_authors: Guangming Zhu, Siyuan Wang, Qing Cheng, Kelong Wu, Hao Li, Liang Zhang
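for: Building a Sketch Input Method Editor (SketchIME) for a professional C4I system, where free-hand sketches serve as low-fidelity prototypes for recommending standardized symbols when creating comprehensive situation maps.
methods: A simultaneous recognition and segmentation architecture with multilevel supervision between the two tasks, plus few-shot domain adaptation and class-incremental learning to adapt to new users and extend to new task-specific classes; also a dataset of 374 specialized sketch types.
results: Experiments on the proposed dataset and the SPG dataset show superior performance of the proposed architecture.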

Abstract
With the recent surge in the use of touchscreen devices, free-hand sketching has emerged as a promising modality for human-computer interaction. While previous research has focused on tasks such as recognition, retrieval, and generation of familiar everyday objects, this study aims to create a Sketch Input Method Editor (SketchIME) specifically designed for a professional C4I system. Within this system, sketches are utilized as low-fidelity prototypes for recommending standardized symbols in the creation of comprehensive situation maps. This paper also presents a systematic dataset comprising 374 specialized sketch types, and proposes a simultaneous recognition and segmentation architecture with multilevel supervision between recognition and segmentation to improve performance and enhance interpretability. By incorporating few-shot domain adaptation and class-incremental learning, the network's ability to adapt to new users and extend to new task-specific classes is significantly enhanced. Results from experiments conducted on both the proposed dataset and the SPG dataset illustrate the superior performance of the proposed architecture. Our dataset and code are publicly available at https://github.com/Anony517/SketchIME.

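Summary
With touchscreen devices now widespread, free-hand sketching has become a promising modality for human-computer interaction. Earlier work focused on recognizing, retrieving, and generating everyday objects; this study builds a Sketch Input Method Editor (SketchIME) for a professional C4I system, where sketches act as low-fidelity prototypes for recommending standardized symbols in situation maps. It contributes a dataset of 374 specialized sketch types and a simultaneous recognition and segmentation architecture with multilevel supervision. Few-shot domain adaptation and class-incremental learning help the network adapt to new users and new task-specific classes. The architecture performs best on both the proposed dataset and the SPG dataset; dataset and code are available at https://github.com/Anony517/SketchIME.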

Navigating Privacy and Copyright Challenges Across the Data Lifecycle of Generative AI

paper_url: http://arxiv.org/abs/2311.18252
repo_url: None
paper_authors: Dawen Zhang, Boming Xia, Yue Liu, Xiwei Xu, Thong Hoang, Zhenchang Xing, Mark Staples, Qinghua Lu, Liming Zhu
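for: Examining the multifaceted challenges of privacy and copyright protection across the data lifecycle of generative AI.
methods: Notes that traditional approaches such as differential privacy, machine unlearning, and data poisoning offer only fragmented solutions, and advocates integrated approaches that combine technical innovation with ethical foresight, informed by the lifecycle perspective.
results: The abstract reports no empirical results; the work aims to catalyze broader discussion and concerted efforts towards data privacy and copyright integrity in generative AI.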

Abstract
The advent of Generative AI has marked a significant milestone in artificial intelligence, demonstrating remarkable capabilities in generating realistic images, texts, and data patterns. However, these advancements come with heightened concerns over data privacy and copyright infringement, primarily due to the reliance on vast datasets for model training. Traditional approaches like differential privacy, machine unlearning, and data poisoning only offer fragmented solutions to these complex issues. Our paper delves into the multifaceted challenges of privacy and copyright protection within the data lifecycle. We advocate for integrated approaches that combines technical innovation with ethical foresight, holistically addressing these concerns by investigating and devising solutions that are informed by the lifecycle perspective. This work aims to catalyze a broader discussion and inspire concerted efforts towards data privacy and copyright integrity in Generative AI.

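Summary
Generative AI shows remarkable ability to generate realistic images, texts, and data patterns, but its reliance on vast training datasets heightens concerns over data privacy and copyright infringement. Traditional approaches such as differential privacy, machine unlearning, and data poisoning offer only fragmented solutions. The paper examines privacy and copyright protection across the data lifecycle and advocates integrated approaches that combine technical innovation with ethical foresight. It aims to catalyze broader discussion and concerted efforts towards data privacy and copyright integrity in generative AI.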

Large Language Models for Travel Behavior Prediction

paper_url: http://arxiv.org/abs/2312.00819
repo_url: https://github.com/Aryia-Behroziuan/neurons
paper_authors: Baichuan Mo, Hanyong Xu, Dingyi Zhuang, Ruoyun Ma, Xiaotong Guo, Jinhua Zhao
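for: Predicting travel behavior, a fundamental task in transportation demand management.
methods: Uses large language models (LLMs) with prompt engineering and no data-based parameter learning; prompts include a task description, travel characteristics, individual attributes, and guides of thinking with domain knowledge.
results: On a travel mode choice case study, with no training samples, LLM predictions reach accuracy and F1-score competitive with multinomial logit, random forest, and neural networks, and come with explanations. Some explanations violate logic or contain hallucinations.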

Abstract
Travel behavior prediction is a fundamental task in transportation demand management. The conventional methods for travel behavior prediction rely on numerical data to construct mathematical models and calibrate model parameters to represent human preferences. Recent advancement in large language models (LLMs) has shown great reasoning abilities to solve complex problems. In this study, we propose to use LLMs to predict travel behavior with prompt engineering without data-based parameter learning. Specifically, we carefully design our prompts that include 1) task description, 2) travel characteristics, 3) individual attributes, and 4) guides of thinking with domain knowledge, and ask the LLMs to predict an individual's travel behavior and explain the results. We select the travel mode choice task as a case study. Results show that, though no training samples are provided, LLM-based predictions have competitive accuracy and F1-score as canonical supervised learning methods such as multinomial logit, random forest, and neural networks. LLMs can also output reasons that support their prediction. However, though in most of the cases, the output explanations are reasonable, we still observe cases that violate logic or with hallucinations.
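The four prompt components listed in the abstract can be assembled along the following lines. The paper's actual prompt wording is not given here, so the function `build_prompt`, the section titles, and every example value are hypothetical.

```python
def build_prompt(task, trip, person, guides):
    """Assemble the four prompt parts named in the abstract into one string."""
    sections = [
        ("Task description", task),
        ("Travel characteristics",
         "\n".join(f"- {key}: {value}" for key, value in trip.items())),
        ("Individual attributes",
         "\n".join(f"- {key}: {value}" for key, value in person.items())),
        ("Guides of thinking",
         "\n".join(f"- {guide}" for guide in guides)),
    ]
    body = "\n\n".join(f"{title}:\n{text}" for title, text in sections)
    return body + "\n\nAnswer with the predicted travel mode and a short explanation."

# Hypothetical traveller and trip, for illustration only.
prompt = build_prompt(
    task="Predict which travel mode this person chooses for the trip.",
    trip={"drive time (min)": 25, "transit time (min)": 40, "transit fare (USD)": 2.5},
    person={"age": 34, "owns a car": "yes"},
    guides=["People tend to prefer faster and cheaper modes."],
)
print(prompt)
```

The resulting string would be sent to an LLM as is. Nothing is fitted to data, which is what "without data-based parameter learning" refers to.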

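Summary
Travel behavior prediction is a fundamental task in transportation demand management. Conventional methods build mathematical models from numerical data and calibrate their parameters to represent human preferences. This study instead uses large language models (LLMs) with prompt engineering and no data-based parameter learning: each prompt contains 1) a task description, 2) travel characteristics, 3) individual attributes, and 4) guides of thinking with domain knowledge, and asks the LLM to predict an individual's travel behavior and explain the result. On a travel mode choice case study with no training samples, the predictions are competitive in accuracy and F1-score with multinomial logit, random forest, and neural networks. Most explanations are reasonable, but some violate logic or contain hallucinations.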

LLVMs4Protest: Harnessing the Power of Large Language and Vision Models for Deciphering Protests in the News

paper_url: http://arxiv.org/abs/2311.18241
repo_url: https://github.com/joshzyj/llvms4protest
paper_authors: Yongjun Zhang
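for: Showing how large language and vision models can infer potential protests in news articles.
methods: Fine-tunes two large pretrained transformer models: Longformer on text, using New York Times articles matched with the DoCA corpus, and Swin-Transformer V2 on labeled UCLA-protest imagery.
results: The two fine-tuned models are released on GitHub for social movement scholars.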

Abstract
Large language and vision models have transformed how social movements scholars identify protest and extract key protest attributes from multi-modal data such as texts, images, and videos. This article documents how we fine-tuned two large pretrained transformer models, including longformer and swin-transformer v2, to infer potential protests in news articles using textual and imagery data. First, the longformer model was fine-tuned using the Dynamic of Collective Action (DoCA) Corpus. We matched the New York Times articles with the DoCA database to obtain a training dataset for downstream tasks. Second, the swin-transformer v2 model was trained on UCLA-protest imagery data. The UCLA-protest project contains labeled imagery data with information such as protest, violence, and sign. Both fine-tuned models will be available via https://github.com/Joshzyj/llvms4protest. We release this short technical report for social movement scholars who are interested in using LLVMs to infer protests in textual and imagery data.

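Summary
Large language and vision models have changed how social movement scholars identify protests and extract key protest attributes from multimodal data. This report documents the fine-tuning of two large pretrained transformer models, Longformer and Swin-Transformer V2, to infer potential protests in news articles from text and images. Longformer was fine-tuned on New York Times articles matched with the DoCA corpus; Swin-Transformer V2 was trained on UCLA-protest imagery, which carries labels such as protest, violence, and sign. Both fine-tuned models will be available at https://github.com/Joshzyj/llvms4protest.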

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

paper_url: http://arxiv.org/abs/2311.18232
repo_url: https://github.com/abdulhaim/lmrl-gym
paper_authors: Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, Sergey Levine
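for: Supporting the development of stable and reliable reinforcement learning (RL) algorithms that train LLMs into goal-directed, multi-turn language agents.
methods: Introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework containing a basic toolkit for offline value-based and policy-based RL methods.
results: The benchmark consists of 8 language tasks that require multiple rounds of language interaction and cover open-ended dialogue and text games.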

Abstract
Large language models (LLMs) provide excellent text-generation capabilities, but standard prompting and generation methods generally do not lead to intentional or goal-directed agents and might necessitate considerable prompt tuning. This becomes particularly apparent in multi-turn conversations: even the best current LLMs rarely ask clarifying questions, engage in explicit information gathering, or take actions now that lead to better decisions after multiple turns. Reinforcement learning has the potential to leverage the powerful modeling capabilities of LLMs, as well as their internal representation of textual interactions, to create capable goal-directed language agents. This can enable intentional and temporally extended interactions, such as with humans, through coordinated persuasion and carefully crafted questions, or in goal-directed play through text games to bring about desired final outcomes. However, enabling this requires the community to develop stable and reliable reinforcement learning algorithms that can effectively train LLMs. Developing such algorithms requires tasks that can gauge progress on algorithm design, provide accessible and reproducible evaluations for multi-turn interactions, and cover a range of task properties and challenges in improving reinforcement learning algorithms. Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework containing a basic toolkit for getting started on multi-turn RL with offline value-based and policy-based RL methods. Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.

Summary
Large language models (LLMs) generate text well, but standard prompting and generation rarely produce intentional, goal-directed agents and may require considerable prompt tuning. This shows most clearly in multi-turn conversations: even the best current LLMs rarely ask clarifying questions, gather information explicitly, or act now for better decisions several turns later. Reinforcement learning could use the modeling capabilities of LLMs to create goal-directed language agents, but that requires stable, reliable RL algorithms and tasks that measure progress on them. The paper introduces the LMRL-Gym benchmark for multi-turn RL with LLMs, plus an open-source research framework with a basic toolkit for offline value-based and policy-based RL methods. The benchmark has 8 language tasks spanning open-ended dialogue and text games.

Reasoning with the Theory of Mind for Pragmatic Semantic Communication

paper_url: http://arxiv.org/abs/2311.18224
repo_url: None
paper_authors: Christo Kurisummoottil Thomas, Emilio Calvanese Strinati, Walid Saad
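for: Proposing a pragmatic semantic communication framework for effective goal-oriented information sharing between two intelligent agents.
methods: Leverages the machine-learning concept of theory of mind (ToM) so that the transmitter mimics the mental state of the receiver's reasoning neural network, and uses a dynamic two-level (wireless and semantic) feedback mechanism to continuously fine-tune the transmitter's neural network components.
results: Numerical evaluations show efficient communication with fewer bits while maintaining the same semantics, outperforming conventional systems that do not exploit ToM-based reasoning.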

Abstract
In this paper, a pragmatic semantic communication framework that enables effective goal-oriented information sharing between two-intelligent agents is proposed. In particular, semantics is defined as the causal state that encapsulates the fundamental causal relationships and dependencies among different features extracted from data. The proposed framework leverages the emerging concept in machine learning (ML) called theory of mind (ToM). It employs a dynamic two-level (wireless and semantic) feedback mechanism to continuously fine-tune neural network components at the transmitter. Thanks to the ToM, the transmitter mimics the actual mental state of the receiver's reasoning neural network operating semantic interpretation. Then, the estimated mental state at the receiver is dynamically updated thanks to the proposed dynamic two-level feedback mechanism. At the lower level, conventional channel quality metrics are used to optimize the channel encoding process based on the wireless communication channel's quality, ensuring an efficient mapping of semantic representations to a finite constellation. Additionally, a semantic feedback level is introduced, providing information on the receiver's perceived semantic effectiveness with minimal overhead. Numerical evaluations demonstrate the framework's ability to achieve efficient communication with a reduced amount of bits while maintaining the same semantics, outperforming conventional systems that do not exploit the ToM-based reasoning.

Summary
The paper proposes a pragmatic semantic communication framework for goal-oriented information sharing between two intelligent agents, where semantics is defined as the causal state that captures the causal relationships and dependencies among features extracted from data. Using theory of mind (ToM), the transmitter mimics the mental state of the receiver's reasoning neural network, and a dynamic two-level feedback mechanism keeps that estimate up to date. At the lower level, conventional channel quality metrics are used to optimize channel encoding so that semantic representations map efficiently to a finite constellation. A semantic feedback level reports the receiver's perceived semantic effectiveness with minimal overhead. Numerical evaluations show efficient communication with fewer bits at the same semantics, outperforming conventional systems that do not exploit ToM-based reasoning.

Beyond Two-Tower Matching: Learning Sparse Retrievable Cross-Interactions for Recommendation

paper_url: http://arxiv.org/abs/2311.18213
repo_url: None
paper_authors: Liangcai Su, Fan Yan, Jieming Zhu, Xi Xiao, Haoyi Duan, Zhou Zhao, Zhenhua Dong, Ruiming Tang
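for: Improving candidate matching for recommendation beyond two-tower models, supporting sophisticated feature interactions without giving up retrieval efficiency.
methods: Proposes SparCode, a matching paradigm with an all-to-all interaction module that models fine-grained query-item interactions, and a discrete code-based sparse inverted index trained jointly with the model.
results: Extensive experiments on open benchmark datasets show that SparCode significantly improves candidate item matching accuracy while retaining retrieval efficiency on par with two-tower models.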

Abstract
Two-tower models are a prevalent matching framework for recommendation, which have been widely deployed in industrial applications. The success of two-tower matching attributes to its efficiency in retrieval among a large number of items, since the item tower can be precomputed and used for fast Approximate Nearest Neighbor (ANN) search. However, it suffers two main challenges, including limited feature interaction capability and reduced accuracy in online serving. Existing approaches attempt to design novel late interactions instead of dot products, but they still fail to support complex feature interactions or lose retrieval efficiency. To address these challenges, we propose a new matching paradigm named SparCode, which supports not only sophisticated feature interactions but also efficient retrieval. Specifically, SparCode introduces an all-to-all interaction module to model fine-grained query-item interactions. Besides, we design a discrete code-based sparse inverted index jointly trained with the model to achieve effective and efficient model inference. Extensive experiments have been conducted on open benchmark datasets to demonstrate the superiority of our framework. The results show that SparCode significantly improves the accuracy of candidate item matching while retaining the same level of retrieval efficiency with two-tower models. Our source code will be available at MindSpore/models.
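For context, the two-tower baseline that the abstract contrasts with SparCode can be sketched as follows: item embeddings are computed once, and each query is scored against them with a dot product. The embeddings and names below are made up, and exhaustive scoring stands in for the ANN index a production system would use. This illustrates the baseline only, not SparCode itself.

```python
import heapq

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Output of the item tower, computed once offline (hypothetical 3-d embeddings).
item_embeddings = {
    "item_a": [0.9, 0.1, 0.0],
    "item_b": [0.1, 0.8, 0.1],
    "item_c": [0.7, 0.2, 0.1],
}

def retrieve(query_embedding, k):
    """Return the k items with the highest dot-product score, best first."""
    return heapq.nlargest(
        k,
        item_embeddings,
        key=lambda name: dot(query_embedding, item_embeddings[name]),
    )

# Output of the query tower for one request (hypothetical).
top_items = retrieve([1.0, 0.0, 0.0], k=2)
```

Query and item features meet only in the final dot product, which is the limited feature interaction the abstract points to. An all-to-all interaction module would instead let individual query and item features interact before scoring.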

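Summary
Two-tower models are a widely used matching framework that has been broadly deployed in industrial applications. Their success stems from fast retrieval over a large item pool, since the item tower can be precomputed and used for fast approximate nearest neighbor (ANN) search. However, they face two main challenges: limited feature interaction capability and reduced accuracy in online serving. Existing methods try to address these problems by designing new late interactions, but they still either fail to support complex feature interactions or lose retrieval efficiency. To address these challenges, we propose a new matching paradigm, SparCode, which supports complex feature interactions while keeping retrieval efficient. Specifically, SparCode introduces an all-to-all interaction module to model fine-grained query-item interactions. In addition, we design a discrete code-based sparse inverted index that is trained jointly with the model for effective and efficient inference. Extensive experiments on open benchmark datasets show that SparCode improves candidate-item matching accuracy at the same level of retrieval efficiency as two-tower models. Our source code will be available at MindSpore/models.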

Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation

paper_url: http://arxiv.org/abs/2311.18207
repo_url: https://github.com/hakuhodo-technologies/scope-rl
paper_authors: Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito
for: Assessing the effectiveness of counterfactual policies using only offline logged data, typically to select the top-k policies for deployment in online A/B tests.
methods: Drawing on portfolio evaluation in finance, the authors develop a new metric, SharpeRatio@k, which measures the risk-return tradeoff of the policy portfolios formed by an OPE estimator under varying online evaluation budgets (k).
results: SharpeRatio@k is validated in two example scenarios, where it effectively distinguishes low-risk from high-risk estimators and accurately identifies the most efficient estimator.

Abstract
Off-Policy Evaluation (OPE) aims to assess the effectiveness of counterfactual policies using only offline logged data and is often used to identify the top-k promising policies for deployment in online A/B tests. Existing evaluation metrics for OPE estimators primarily focus on the "accuracy" of OPE or that of downstream policy selection, neglecting risk-return tradeoff in the subsequent online policy deployment. To address this issue, we draw inspiration from portfolio evaluation in finance and develop a new metric, called SharpeRatio@k, which measures the risk-return tradeoff of policy portfolios formed by an OPE estimator under varying online evaluation budgets (k). We validate our metric in two example scenarios, demonstrating its ability to effectively distinguish between low-risk and high-risk estimators and to accurately identify the most efficient estimator. This efficient estimator is characterized by its capability to form the most advantageous policy portfolios, maximizing returns while minimizing risks during online deployment, a nuance that existing metrics typically overlook. To facilitate a quick, accurate, and consistent evaluation of OPE via SharpeRatio@k, we have also integrated this metric into an open-source software, SCOPE-RL. Employing SharpeRatio@k and SCOPE-RL, we conduct comprehensive benchmarking experiments on various estimators and RL tasks, focusing on their risk-return tradeoff. These experiments offer several interesting directions and suggestions for future OPE research.
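The abstract does not spell out the formula, so the following is only a sketch of one plausible form, modeled on the Sharpe ratio from finance: the best true value among the top-k policies chosen by the estimator, minus a baseline (assumed here to be the behavior policy's value, playing the role of the risk-free return), divided by the spread of true values within that top-k portfolio. The function name, its arguments, and the use of the population standard deviation are assumptions; the exact definition is in the paper and may differ.

```python
import statistics


def sharpe_ratio_at_k(estimated_values, true_values, behavior_value, k):
    """Risk-return score of the top-k policy portfolio picked by an OPE estimator.

    estimated_values: the estimator's predicted value per candidate policy.
    true_values: the values those policies actually achieve online.
    behavior_value: baseline return of the data-collecting policy (assumption).
    """
    if k < 2:
        raise ValueError("k must be at least 2 so the portfolio has a spread")
    # Rank candidates by the estimator's predictions, best first.
    ranked = sorted(
        range(len(estimated_values)),
        key=lambda i: estimated_values[i],
        reverse=True,
    )
    portfolio = [true_values[i] for i in ranked[:k]]
    risk = statistics.pstdev(portfolio)  # population std of true values in the top-k
    if risk == 0:
        raise ValueError("zero spread in the portfolio; the ratio is undefined")
    return (max(portfolio) - behavior_value) / risk


# Hypothetical numbers: the estimator ranks policies 0 and 1 highest.
score = sharpe_ratio_at_k(
    estimated_values=[0.9, 0.8, 0.1, 0.5],
    true_values=[1.0, 2.0, 5.0, 0.5],
    behavior_value=1.0,
    k=2,
)
```

In this toy case the portfolio holds true values 1.0 and 2.0, so the excess return is 1.0 and the spread is 0.5. An estimator that selected a wider mix of good and bad policies would be penalized through the denominator.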

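Summary
Off-policy evaluation (OPE) aims to assess the effectiveness of counterfactual policies using only offline logged data, and is often used to identify the top-k most promising policies for deployment in online A/B tests. Existing evaluation metrics for OPE estimators focus mainly on the accuracy of OPE or of downstream policy selection, and neglect the risk-return tradeoff in subsequent online policy deployment. To address this, we draw on portfolio evaluation in finance and develop a new metric, SharpeRatio@k, which measures the risk-return tradeoff of the policy portfolios formed by an OPE estimator under varying online evaluation budgets (k). We validate the metric in two example scenarios, showing that it effectively distinguishes low-risk from high-risk estimators and accurately identifies the most efficient estimator. The most efficient estimator is the one that forms the most advantageous policy portfolios, maximizing returns while minimizing risks during online deployment, a nuance that existing metrics typically overlook. To support quick, accurate, and consistent evaluation of OPE, we have integrated SharpeRatio@k into the open-source software SCOPE-RL. Using SharpeRatio@k and SCOPE-RL, we run comprehensive benchmarking experiments across estimators and RL tasks, focusing on their risk-return tradeoff. These experiments suggest several interesting directions for future OPE research.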

SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation

paper_url: http://arxiv.org/abs/2311.18206
repo_url: https://github.com/hakuhodo-technologies/scope-rl
paper_authors: Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito
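for: Introduces SCOPE-RL, an open-source Python package for offline reinforcement learning (offline RL), off-policy evaluation (OPE), and off-policy selection (OPS). Unlike most existing libraries, which focus on either policy learning or evaluation, SCOPE-RL integrates both, so that offline RL and OPE pipelines can be implemented flexibly and completely.
methods: SCOPE-RL's OPE modules offer a range of OPE estimators and robust evaluation-of-OPE protocols. For example, they can estimate the entire reward distribution under a policy rather than only its point-wise expected value.
results: SCOPE-RL enables more in-depth and reliable OPE than other packages, including an assessment of the risk-return tradeoff in OPE results that goes beyond the accuracy-only evaluations in the existing OPE literature.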

Abstract
This paper introduces SCOPE-RL, a comprehensive open-source Python software designed for offline reinforcement learning (offline RL), off-policy evaluation (OPE), and selection (OPS). Unlike most existing libraries that focus solely on either policy learning or evaluation, SCOPE-RL seamlessly integrates these two key aspects, facilitating flexible and complete implementations of both offline RL and OPE processes. SCOPE-RL puts particular emphasis on its OPE modules, offering a range of OPE estimators and robust evaluation-of-OPE protocols. This approach enables more in-depth and reliable OPE compared to other packages. For instance, SCOPE-RL enhances OPE by estimating the entire reward distribution under a policy rather than its mere point-wise expected value. Additionally, SCOPE-RL provides a more thorough evaluation-of-OPE by presenting the risk-return tradeoff in OPE results, extending beyond mere accuracy evaluations in existing OPE literature. SCOPE-RL is designed with user accessibility in mind. Its user-friendly APIs, comprehensive documentation, and a variety of easy-to-follow examples assist researchers and practitioners in efficiently implementing and experimenting with various offline RL methods and OPE estimators, tailored to their specific problem contexts. The documentation of SCOPE-RL is available at https://scope-rl.readthedocs.io/en/latest/.
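To make the difference between a point estimate and a distributional estimate concrete, here is a minimal sketch using self-normalized importance weighting. It does not use or imitate the SCOPE-RL API; the function names and the toy numbers are hypothetical, and the weights are assumed to be the usual ratios of evaluation-policy to behavior-policy action probabilities for each logged sample.

```python
def weighted_mean(rewards, weights):
    """Point estimate: self-normalized importance-weighted average reward."""
    return sum(r * w for r, w in zip(rewards, weights)) / sum(weights)


def weighted_cdf(rewards, weights, threshold):
    """Distributional estimate: estimated probability that the reward is <= threshold."""
    mass_below = sum(w for r, w in zip(rewards, weights) if r <= threshold)
    return mass_below / sum(weights)


# Hypothetical logged rewards and importance weights.
rewards = [0.0, 1.0, 2.0]
weights = [1.0, 2.0, 1.0]

mean_estimate = weighted_mean(rewards, weights)
cdf_at_one = weighted_cdf(rewards, weights, threshold=1.0)
```

Evaluating `weighted_cdf` on a grid of thresholds traces out the whole estimated reward distribution, from which tail-risk quantities can be read off; the mean alone cannot distinguish two policies with the same average but different spread.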

Summary
SCOPE-RL is designed with user accessibility in mind, providing user-friendly APIs, comprehensive documentation, and easy-to-follow examples to assist researchers and practitioners in efficiently implementing and experimenting with various offline RL methods and OPE estimators tailored to their specific problem contexts. The documentation of SCOPE-RL is available at https://scope-rl.readthedocs.io/en/latest/.

HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

paper_url: http://arxiv.org/abs/2312.00079
repo_url: None
paper_authors: Zhonghao Wang, Wei Wei, Yang Zhao, Zhisheng Xiao, Mark Hasegawa-Johnson, Humphrey Shi, Tingbo Hou
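for: Explores advances in high-fidelity personalized image generation using pre-trained text-to-image diffusion models.
methods: Proposes HiFi Tuner, a new algorithm built on a parameter-efficient fine-tuning framework comprising a denoising process and a pivotal inversion process. Key enhancements include mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations.
results: HiFi Tuner improves appearance preservation of the subject and extends to a new image editing task, substituting the subject in an image through textual manipulations. On the DreamBooth dataset with the Stable Diffusion model, fine-tuning solely on textual embeddings improves the CLIP-T score by 3.6 points and the DINO score by 9.6 points over Textual Inversion; fine-tuning all parameters improves the CLIP-T score by 1.2 points and the DINO score by 1.2 points over DreamBooth, establishing a new state of the art.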

Abstract
This paper explores advancements in high-fidelity personalized image generation through the utilization of pre-trained text-to-image diffusion models. While previous approaches have made significant strides in generating versatile scenes based on text descriptions and a few input images, challenges persist in maintaining the subject fidelity within the generated images. In this work, we introduce an innovative algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation. Our proposed method employs a parameter-efficient fine-tuning framework, comprising a denoising process and a pivotal inversion process. Key enhancements include the utilization of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations to elevate the sample fidelity. Additionally, we propose a reference-guided generation approach that leverages the pivotal inversion of a reference image to mitigate unwanted subject variations and artifacts. We further extend our method to a novel image editing task: substituting the subject in an image through textual manipulations. Experimental evaluations conducted on the DreamBooth dataset using the Stable Diffusion model showcase promising results. Fine-tuning solely on textual embeddings improves CLIP-T score by 3.6 points and improves DINO score by 9.6 points over Textual Inversion. When fine-tuning all parameters, HiFi Tuner improves CLIP-T score by 1.2 points and improves DINO score by 1.2 points over DreamBooth, establishing a new state of the art.

Summary
Our proposed method consists of a parameter-efficient fine-tuning framework that includes a denoising process and a pivotal inversion process. Key enhancements include the use of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations to elevate sample fidelity. Additionally, we propose a reference-guided generation approach that leverages the pivotal inversion of a reference image to mitigate unwanted subject variations and artifacts. We extend our method to a novel image editing task: substituting the subject in an image through textual manipulations. Experimental evaluations conducted on the DreamBooth dataset using the Stable Diffusion model show promising results. Fine-tuning solely on textual embeddings improves CLIP-T score by 3.6 points and improves DINO score by 9.6 points over Textual Inversion. When fine-tuning all parameters, HiFi Tuner improves CLIP-T score by 1.2 points and improves DINO score by 1.2 points over DreamBooth, setting a new state of the art.

Toward the Tradeoffs between Privacy, Fairness and Utility in Federated Learning

paper_url: http://arxiv.org/abs/2311.18190
repo_url: None
paper_authors: Kangkang Sun, Xiaojin Zhang, Xi Lin, Gaolei Li, Jing Wang, Jianhua Li
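for: Studies the interplay between fairness and privacy in federated learning (FL) systems, which protect user privacy and reduce the risk of data leakage through local client training.
methods: On the client side, fairness metrics such as Demographic Parity (DemP), Equalized Odds (EOs), and Disparate Impact (DI) are used to construct a local fair model. To protect the privacy of the client model, a privacy-protection fairness FL method is proposed.
results: Experiments show a tradeoff between fairness and privacy. Adding privacy breaks the constraints imposed by the fairness metrics, which increases the accuracy of the fair model. The experiments characterize the relationship among privacy, fairness, and utility, and show a tradeoff among them.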

Abstract
Federated Learning (FL) is a novel privacy-protection distributed machine learning paradigm that guarantees user privacy and prevents the risk of data leakage through local training on the client. Researchers have struggled to design fair FL systems that ensure fairness of results. However, the interplay between fairness and privacy has been less studied. Increasing the fairness of FL systems can have an impact on user privacy, while an increase in user privacy can affect fairness. In this work, on the client side, we use fairness metrics, such as Demographic Parity (DemP), Equalized Odds (EOs), and Disparate Impact (DI), to construct the local fair model. To protect the privacy of the client model, we propose a privacy-protection fairness FL method. The results show that the accuracy of the fair model with privacy increases because privacy breaks the constraints of the fairness metrics. In our experiments, we characterize the relationship between privacy, fairness, and utility, and find a tradeoff among them.
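The three fairness metrics named in the abstract have standard textbook forms for binary predictions and a binary sensitive attribute, sketched below. This is not the paper's code: the function names, the convention that group 1 is the unprivileged group, and the toy data are all assumptions, and the paper may use these quantities as training constraints in a different form.

```python
def positive_rate(preds, mask):
    """Fraction of positive predictions among the samples selected by mask."""
    selected = [p for p, keep in zip(preds, mask) if keep]
    return sum(selected) / len(selected)


def demographic_parity_gap(preds, groups):
    """|P(pred=1 | group 0) - P(pred=1 | group 1)|; 0 means parity."""
    rate0 = positive_rate(preds, [g == 0 for g in groups])
    rate1 = positive_rate(preds, [g == 1 for g in groups])
    return abs(rate0 - rate1)


def disparate_impact(preds, groups):
    """P(pred=1 | group 1) / P(pred=1 | group 0); group 1 assumed unprivileged."""
    rate0 = positive_rate(preds, [g == 0 for g in groups])
    rate1 = positive_rate(preds, [g == 1 for g in groups])
    return rate1 / rate0


def equalized_odds_gap(preds, labels, groups):
    """Largest between-group gap in positive rate, taken over true label 0 and 1."""
    gaps = []
    for y in (0, 1):
        rate0 = positive_rate(preds, [g == 0 and t == y for g, t in zip(groups, labels)])
        rate1 = positive_rate(preds, [g == 1 and t == y for g, t in zip(groups, labels)])
        gaps.append(abs(rate0 - rate1))
    return max(gaps)


# Hypothetical client-side data: predictions, true labels, sensitive attribute.
preds = [1, 0, 1, 1, 0, 1, 0, 0]
labels = [1, 0, 1, 0, 1, 1, 0, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

dp_gap = demographic_parity_gap(preds, groups)
di_ratio = disparate_impact(preds, groups)
eo_gap = equalized_odds_gap(preds, labels, groups)
```

In this toy data, group 0 receives positive predictions three times as often as group 1. A fair model would push the two gaps toward 0 and the ratio toward 1.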
